What is Caching?

In Conduit, a cache is a transient data storage layer that holds a subset of data so that future requests for that data are served faster than is possible from the data's primary storage location. In addition to the data cache, Conduit also caches query results so that previously retrieved results can be reused efficiently.


Why Caching?

Database administrators are often concerned about providing unfettered access to their data sources. There are good reasons for this:

  • Performance of databases - databases may be optimized for purposes other than being constantly read or queried by users. In that case, frequent reads can quickly degrade the database's other workloads (e.g. transactional databases must be carefully designed to accept rapid updates from many different sources), to the point that the database becomes inoperable without intervention from the DBA team.
  • Protection of databases - many users concurrently sending ad hoc queries to a database could overload it. Offloading the data from the database into a temporary cache (you can define exactly how temporary) protects against this scenario.
  • Protection of data - even with the best of intentions, a user could accidentally change the data in the database in a way that cannot be undone. Users connected to data sources through Conduit have read-only access, so a user cannot mistakenly be granted the wrong permission and then accidentally write to or edit a database.


There are also scenarios in which caching may not be the best choice:

  • Freshness of data is of primary concern and latency is not an issue (data not terribly large, simple query, etc.) 
  • Few users need access to the data and access is a low priority compared to other, more critical data sources 

Caching options within Conduit

Conduit provides the choice of leveraging caching for the data sources of any of its connectors. 

Query Caching 

  • When enabled, Conduit will store query results for all queries for the connector's datasets so that when the exact same query is called again, the query results will be returned from memory to improve response time 
  • Recommended to enable when expensive queries are expected and/or when underlying data is not expected to change often 
  • Result sets that exceed one page of retrieved records (10,000 records for Power BI) are not cached, to avoid out-of-memory (OOM) errors
  • Caching expiration is 30 min by default, and can be customized for each connector's dataset as needed 
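
The Query Caching behavior listed above can be illustrated with a minimal Python sketch. This is not Conduit's implementation: the class `QueryResultCache`, the function `run_query`, and their parameters are hypothetical names chosen for illustration. The sketch only assumes what the list states, namely that results are keyed by the exact query text, expire after 30 minutes by default, and are skipped when they exceed the page size.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

DEFAULT_TTL_SECONDS = 30 * 60  # 30-minute default expiration, as described above
MAX_CACHED_ROWS = 10_000       # Power BI page size, as described above


class QueryResultCache:
    """Hypothetical in-memory cache keyed by the exact query text."""

    def __init__(
        self,
        ttl_seconds: float = DEFAULT_TTL_SECONDS,
        max_rows: int = MAX_CACHED_ROWS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max_rows = max_rows
        self._clock = clock
        # query text -> (expiry timestamp, cached rows)
        self._entries: Dict[str, Tuple[float, List[tuple]]] = {}

    def get(self, query: str) -> Optional[List[tuple]]:
        """Return cached rows, or None if the entry is missing or expired."""
        entry = self._entries.get(query)
        if entry is None:
            return None
        expires_at, rows = entry
        if self._clock() >= expires_at:
            del self._entries[query]  # drop the expired entry
            return None
        return rows

    def put(self, query: str, rows: List[tuple]) -> bool:
        """Cache the rows unless the result set is too large; report whether it was cached."""
        if len(rows) > self._max_rows:
            return False
        self._entries[query] = (self._clock() + self._ttl, rows)
        return True


def run_query(
    cache: QueryResultCache,
    query: str,
    execute: Callable[[str], List[tuple]],
) -> List[tuple]:
    """Serve from the cache when possible; otherwise execute the query and cache the result."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    rows = execute(query)
    cache.put(query, rows)
    return rows
```

In this sketch a query that differs by even one character is treated as a different query, which matches the "exact same query" wording above.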
     

Connector Caching 

  • When enabled, Conduit will create a temporary, secure parquet store of the connector's datasets for quick future access 
  • Recommended to enable for large datasets and/or when expensive queries are expected   
  • The datasets selected for the connector will be cached in the parquet store. All queries for this connector will be run against the parquet store 
  • Caching expiration is 30 min by default, and can be customized for each connector's dataset as needed 
  • When connector data is cached, query results will also be cached in memory for small and medium result sets to further enhance performance. The Query Cache expires together with the data cache 
  • The list of existing parquet files and their expected expiration times can be accessed on the Performance > Parquet Store page 
  • The Conduit SQL Engine will be used to run all queries when Connector Caching is selected 
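
The relationship between the data cache and the query cache described above can be sketched as follows. This is a simplified model, not Conduit's code: `ConnectorCache` is a hypothetical class, an in-memory list stands in for the parquet store, and a Python predicate stands in for a SQL query. It shows only two points from the list: queries run against the cached snapshot rather than the source, and cached query results are discarded when the data cache expires.

```python
import time
from typing import Any, Callable, Dict, List, Optional


class ConnectorCache:
    """Hypothetical model: a cached data snapshot plus query results tied to it."""

    def __init__(
        self,
        load_dataset: Callable[[], List[Any]],
        ttl_seconds: float = 30 * 60,  # 30-minute default, as described above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_dataset
        self._ttl = ttl_seconds
        self._clock = clock
        self._snapshot: Optional[List[Any]] = None  # stands in for the parquet store
        self._expires_at = 0.0
        self._query_results: Dict[str, List[Any]] = {}

    def _ensure_snapshot(self) -> None:
        """Rebuild the snapshot from the source if it is missing or expired."""
        if self._snapshot is None or self._clock() >= self._expires_at:
            self._snapshot = self._load()
            self._expires_at = self._clock() + self._ttl
            # The query cache expires together with the data cache.
            self._query_results.clear()

    def query(self, query_key: str, predicate: Callable[[Any], bool]) -> List[Any]:
        """Run a query against the snapshot, never against the source directly."""
        self._ensure_snapshot()
        if query_key not in self._query_results:
            self._query_results[query_key] = [
                row for row in self._snapshot if predicate(row)
            ]
        return self._query_results[query_key]
```

Until the snapshot expires, changes in the source are not visible to queries, which is the freshness trade-off noted earlier in this article.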

How to Leverage Caching in Conduit

For each connector, Conduit allows you to cache the datasets included in that connector. On the `Virtualization` step of the Connector Wizard, you will see the options 'Enable Query Caching' and 'Enable Connector Caching'.


Selecting ‘Enable Query Caching’ indicates that Conduit will store query results for all queries for the connector's datasets so that when the exact same query is called again, the query results will be returned from memory. 

Selecting 'Enable Connector Caching' indicates to Conduit that all of the data related to this connector should be stored in the parquet store and that all subsequent queries sent to that connector should read from the cache rather than directly from the data source. Query Caching is enabled automatically when Connector Caching is enabled.


In the `Advanced` step of the Connector Wizard, each table within the connector provides the option to modify the Caching expiration policy. The Caching expiration policy dictates how long the cache is to exist before being reset. This allows for fine-grained control over the freshness of the data. 

The second caching option in the `Advanced` step is 'Cache now'. When this box is selected, Conduit will start caching the data source as soon as the connector is saved, so that the first query does not have to wait for the cache to be built.

Lastly, the third caching option is 'Auto refresh'. When this option is enabled, Conduit will automatically recreate the data cache in the parquet store when the existing data cache expires.
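
The three per-table options can be summarized in a small Python sketch. The names `TableCachePolicy`, `action_on_save`, and `action_on_expiry` are hypothetical and do not correspond to Conduit's configuration format. The sketch also assumes that, when 'Cache now' or 'Auto refresh' is not selected, the cache is built on the next query; the text above implies this for the initial query, but it is an assumption for the expiry case.

```python
from dataclasses import dataclass


@dataclass
class TableCachePolicy:
    """Hypothetical model of the per-table caching options in the wizard."""

    expiration_seconds: int = 30 * 60  # Caching expiration policy (30-minute default)
    cache_now: bool = False            # 'Cache now' checkbox
    auto_refresh: bool = False         # 'Auto refresh' checkbox


def action_on_save(policy: TableCachePolicy) -> str:
    """What happens to the cache when the connector is saved."""
    # Without 'Cache now', the first query has to wait for the cache to be built.
    return "build_immediately" if policy.cache_now else "build_on_first_query"


def action_on_expiry(policy: TableCachePolicy) -> str:
    """What happens when the existing data cache expires."""
    # Assumption: without 'Auto refresh', the cache is rebuilt lazily on the next query.
    return "rebuild_immediately" if policy.auto_refresh else "rebuild_on_next_query"
```

For example, a table with `TableCachePolicy(expiration_seconds=3600, cache_now=True, auto_refresh=True)` would be cached on save and refreshed hourly without any query having to wait for a rebuild.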

 

Finished! Conduit makes leveraging caching as simple as that. A few clicks provide detailed control over how your data is accessed and updated through your Conduit connector.

Monitoring the Conduit Cache

Click the Performance tab on the navigation bar at the top of the page, then click the Parquet Store link in the drop-down menu. This takes you to the monitoring screen for the data cache.

Here you will be able to review: 

  • The name of each table (or flat file in cloud storage) cached in the parquet store 
  • When the data cache is set to expire 
  • How much space each cache occupies on disk 
  • Number of parquet files the data cache consists of 

The Parquet Store page also provides an option to clear an existing data cache if desired.
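
The disk usage and file counts shown on the Parquet Store page can be approximated with a short Python sketch, for readers who want to see how such figures are derived. The directory layout used here (one subdirectory per cached table, containing `*.parquet` files) is an assumption made for illustration, as are the names `CacheSummary` and `summarize_parquet_store`; Conduit's actual on-disk layout is not described in this article.

```python
from pathlib import Path
from typing import Dict, NamedTuple


class CacheSummary(NamedTuple):
    file_count: int   # number of parquet files in the data cache
    total_bytes: int  # space the data cache occupies on disk


def summarize_parquet_store(root: Path) -> Dict[str, CacheSummary]:
    """Summarize each table directory under a hypothetical parquet store root."""
    summaries: Dict[str, CacheSummary] = {}
    for table_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Only files ending in .parquet count toward the cache.
        files = [f for f in table_dir.rglob("*.parquet") if f.is_file()]
        summaries[table_dir.name] = CacheSummary(
            file_count=len(files),
            total_bytes=sum(f.stat().st_size for f in files),
        )
    return summaries
```

Calling `summarize_parquet_store(Path("/path/to/store"))` returns a dictionary mapping each table name to its file count and size in bytes; the path shown is a placeholder.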