What is Caching?

A cache is a hardware or software component that stores relevant data for quick future access. Often, and in the case of Conduit, a cache is maintained in memory. In the context of data storage, the purpose of caching is to speed up access to the data.

Why Caching?

Database administrators are often concerned about providing unfettered access to their data sources. There are good reasons for this:

  • Performance of databases - Databases may be optimized for purposes other than being constantly read or queried by users. In that case, frequent reads can quickly degrade the database's other workloads (e.g. transactional databases must be carefully designed to accept rapid updates from many different sources), to the point that the database becomes inoperable without intervention from the DBA team.
  • Protection of databases - Many users concurrently sending ad-hoc queries to a database could break it. Offloading the data from the database into a temporary cache (you can define exactly how temporary) provides protection against this scenario.
  • Protection of data - Even with the best of intentions, a user could accidentally change the data in the database in a way that cannot be undone. Users connected to data sources through Conduit have read-only access, so a user cannot be mistakenly granted the wrong permission and then accidentally write to or edit a database.

However, there are also scenarios in which caching is not the best choice:

  • Freshness of data is of primary concern and latency is not an issue (data not terribly large, simple query, etc.)
  • Few users need access to the data and access is a low priority compared to other, more critical data sources

Caching options within Conduit

Conduit provides the choice of leveraging caching for the data sources of any of its connectors. This allows the administrator who sets up the connector to choose whether to default to the processing capabilities built into the database or to use those of Conduit.

Caching (disabled / enabled):

  • Disabled: All queries read directly from the data source at the time the query is submitted (from a visualization tool via a Hive connection, a Python script via a JDBC/ODBC connection, or otherwise), leveraging Conduit's SQL engine only when required for query language translation or hybrid join operations.
  • Enabled: All queries to Conduit connectors read from cached data from the data source, rather than from the data source itself.
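
The difference between the two settings can be pictured with a small sketch. This is not Conduit's API: the `Connector` and `CountingSource` classes below are hypothetical stand-ins, and the sketch only illustrates that a disabled cache reads the source on every query while an enabled cache reads it once.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]


class CountingSource:
    """Hypothetical data source that counts how often it is read."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.reads = 0

    def __call__(self) -> List[Row]:
        self.reads += 1
        return list(self._rows)


class Connector:
    """Hypothetical connector with a caching switch (illustration only)."""

    def __init__(self, read_source: CountingSource, caching_enabled: bool) -> None:
        self._read_source = read_source
        self.caching_enabled = caching_enabled
        self._cached: Optional[List[Row]] = None

    def query(self) -> List[Row]:
        if not self.caching_enabled:
            # Disabled: every query goes straight to the data source.
            return self._read_source()
        if self._cached is None:
            # Enabled: the first query fills the cache from the source.
            self._cached = self._read_source()
        # Later queries are served from the cached copy.
        return self._cached


source = CountingSource([(1, "a"), (2, "b")])

direct = Connector(source, caching_enabled=False)
direct.query()
direct.query()
reads_without_cache = source.reads  # the source was read once per query

source.reads = 0
cached = Connector(source, caching_enabled=True)
cached.query()
cached.query()
reads_with_cache = source.reads  # the source was read only once
```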

Enable Conduit SQL Engine for Join Queries (disabled / enabled):

  • Disabled: Conduit will not allow the caching necessary in the Conduit SQL engine to perform hybrid joins between different data source types. Joins between different data sources require intermediate storage and computation between the data sources themselves and the front-end through which the user is joining the data. While Conduit makes this operation seamless for the end user, an administrator can decide that a particular connector / data set should not be joined with other data in the enterprise's ecosystem. This decision could be based on a variety of considerations, ranging from data integrity to query performance.
  • Enabled: Hybrid joins between different data source types (different databases, cloud storage, or any other type of connector in Conduit) are allowed and will be triggered from whichever front-end the user has instructed to join the data (visualization tool or otherwise).
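
To see why a hybrid join needs intermediate storage and computation, consider the following minimal sketch. It does not show how Conduit's SQL engine works internally; it assumes an in-memory SQLite database standing in for a relational source, an in-memory CSV standing in for a flat file in cloud storage, and a second SQLite database (`staging`, a hypothetical name) as the intermediate store where the join is computed.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (in-memory SQLite as a stand-in).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (2, 35.5), (1, 4.5)],
)

# Source 2: a flat file (in-memory CSV as a stand-in for cloud storage).
customers_csv = io.StringIO("customer_id,name\n1,Ada\n2,Grace\n")

# Intermediate store: neither source can see the other's data,
# so both are copied into one place before the join can run.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
staging.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")

order_rows = orders_db.execute("SELECT customer_id, amount FROM orders").fetchall()
staging.executemany("INSERT INTO orders VALUES (?, ?)", order_rows)

customer_rows = [
    (int(record["customer_id"]), record["name"])
    for record in csv.DictReader(customers_csv)
]
staging.executemany("INSERT INTO customers VALUES (?, ?)", customer_rows)

# The join itself runs in the intermediate store.
totals = staging.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

Copying both inputs into a staging area is the simplest way to illustrate the idea; it also shows why an administrator might disallow such joins for very large or sensitive data sets.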

Query Caching

Conduit also automatically enables query caching for quicker query responses. This feature caches the results of user queries, which improves response time for subsequent, similar queries performed by other users.
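
As a rough illustration of the idea, the sketch below caches query results keyed by the query text. How Conduit decides that two queries are "similar" is not described here, so the `normalize` function is an assumption: it only trims surrounding whitespace and a trailing semicolon. `QueryCache` and `fake_engine` are likewise hypothetical names used for illustration.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[Tuple[str, int]]


def normalize(sql: str) -> str:
    """Assumed matching rule: ignore outer whitespace and a trailing semicolon."""
    return sql.strip().rstrip(";").strip()


class QueryCache:
    """Hypothetical result cache shared by all users of one engine."""

    def __init__(self, run_query: Callable[[str], Rows]) -> None:
        self._run_query = run_query
        self._results: Dict[str, Rows] = {}
        self.hits = 0
        self.misses = 0

    def execute(self, sql: str) -> Rows:
        key = normalize(sql)
        if key in self._results:
            self.hits += 1
        else:
            # First time this query is seen: run it and remember the result.
            self.misses += 1
            self._results[key] = self._run_query(sql)
        return self._results[key]


def fake_engine(sql: str) -> Rows:
    """Stand-in for the real engine; always returns the same rows."""
    return [("north", 120), ("south", 95)]


cache = QueryCache(fake_engine)
first_user = cache.execute("SELECT region, total FROM sales;")
second_user = cache.execute("  SELECT region, total FROM sales  ")
```

The second call is answered from the cache because both query strings normalize to the same key.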

How to Leverage Caching in Conduit

For each connector, Conduit allows for caching the datasets included in that connector. When you arrive at the `Virtualization` step of the Connector Wizard, you will see an option to 'Enable Caching.' Selecting this field indicates to Conduit that all of the data related to this connector should be stored in the cache and that all subsequent queries sent to that connector should read from the cache rather than directly from the data source.

In the `Advanced` step of the Connector Wizard, each table within the connector provides the option to modify the Caching Expiration policy. The Caching Expiration policy dictates how long the cache is to exist before being reset. This allows for fine-grained control over the freshness of the data.

The second caching option in the `Advanced` step is 'Cache Now'. When this box is selected, Conduit immediately creates a copy of the data source in the cache, so that any latency in reading the source data is not experienced when the first query is sent through this connector.

Finished! Conduit makes leveraging caching as simple as that. Three clicks provide detailed control over how your data is accessed and updated through your Conduit connector.
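
The interplay between the Caching Expiration policy and 'Cache Now' can be sketched as a small time-to-live cache. This is a conceptual model under stated assumptions, not Conduit's implementation: `ExpiringTableCache`, its `ttl_seconds` and `cache_now` parameters, and the `FakeClock` used to simulate the passage of time are all hypothetical.

```python
import time
from typing import Callable, List, Optional, Tuple

Row = Tuple[str, int]


class ExpiringTableCache:
    """Hypothetical per-table cache that resets after ttl_seconds."""

    def __init__(
        self,
        load: Callable[[], List[Row]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
        cache_now: bool = False,
    ) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows: Optional[List[Row]] = None
        self._expires_at = 0.0
        self.source_reads = 0
        if cache_now:
            # Eager fill: pay the source latency at setup time,
            # not when the first query arrives.
            self._refresh()

    def _refresh(self) -> None:
        self._rows = self._load()
        self._expires_at = self._clock() + self._ttl
        self.source_reads += 1

    def read(self) -> List[Row]:
        if self._rows is None or self._clock() >= self._expires_at:
            # Cache is empty or expired: reload from the source.
            self._refresh()
        return self._rows


class FakeClock:
    """Manually advanced clock so the example does not need to sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
table_cache = ExpiringTableCache(
    load=lambda: [("row", 1)], ttl_seconds=60, clock=clock, cache_now=True
)
reads_after_setup = table_cache.source_reads

table_cache.read()   # within the expiration window: served from cache
clock.now = 61.0     # simulate the expiration policy elapsing
table_cache.read()   # expired: triggers a reload from the source
reads_after_expiry = table_cache.source_reads
```

A shorter expiration keeps the data fresher at the cost of more reads against the source; a longer one does the opposite.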

Monitoring the Conduit Cache

Ever wonder where all those magically cached data sets end up? Click the Performance tab on the navigation bar at the top of the page. Then click the Cached Datasets link in the drop-down box. This takes you to the monitoring screen of the cache.

Here you will be able to review:

  • exactly which tables (or flat files in cloud storage) are cached
  • when each data set is set to expire
  • how much space it occupies, either in memory or on disk
  • what portion of the data set has been successfully cached (relevant for a very large file or table)
  • what the partitioning has been set to for that particular data set