Conduit storage location

Dataset caches and materialization assets are stored in Parquet format, hence the name “Parquet store”.

In Conduit, the following file systems (storage types) are supported for storing data source caches and data materialization: 

  • Azure Blob Storage (abfs) 

  • S3 (s3)

  • Google Cloud Storage (gcs)

  • HDFS (hdfs) 

  • local file system (file).



Supported storage file systems


Azure Cloud Storage

General Prerequisites


  • An Azure Blob storage account must already exist.
    • The storage account must have "Enable hierarchical namespace" checked.
    • The other settings (e.g. networking, data protection, tags) can be left at their defaults unless the user's own system configuration, not Conduit, requires otherwise.
    • See the section below on how to create a storage account.
  • An Azure Blob container must already exist.
    • No special settings are required; the default values can be kept.
    • See the section below on how to create a container.


How to create an Azure Blob storage account:


Step 1)


Step 2) enable hierarchical namespace


How to create an Azure Blob container:


Navigate to your Azure Blob Storage account and click on "Containers" on the left panel.



Azure Blob storage authentication

The configuration of Azure Blob storage as storage type can be done in 2 ways depending on the type of authentication used:

  1. Access keys authentication
  2. Azure managed identity authentication


1. Access keys authentication


Prerequisites:

  • have access to Azure Blob access keys 
    • more information on generating access keys can be found here


  • The storage account must have hierarchical namespace enabled.


Settings 

/etc/bpcs/docker/bde-server.env

Once all prerequisites are fulfilled, fill in the following configuration with the proper values and add it to bde-server.env:


FS_TYPE=abfs
FS_ABFS_STORAGE_ACCOUNT={ Azure Blob storage account }
FS_ABFS_CONTAINER={ Azure Blob container }
FS_ABFS_ACCESS_KEY={ Azure Blob access key }
FS_DEFAULTFS=abfs://{ Azure Blob container }@{ Azure Blob storage account }.dfs.core.windows.net
CONDUIT_AZURE_CLOUD_TYPE=AzureCloud
CONDUIT_AZURE_CLOUD_STORAGE_ENDPOINT_SUFFIX=core.windows.net
  • Remove the curly brackets { } when filling in the values. See the examples section below.
  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
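The FS_DEFAULTFS value follows a fixed pattern, so it can be assembled from the container name, the storage account name, and the endpoint suffix. The following is a minimal shell sketch of that pattern, not something shipped with Conduit; the function name abfs_defaultfs is hypothetical.

```shell
# Hypothetical helper (not part of Conduit): build the FS_DEFAULTFS value
# from its parts, following the pattern shown in the settings above.
#   $1 = Azure Blob container
#   $2 = Azure Blob storage account
#   $3 = endpoint suffix (optional, defaults to core.windows.net)
abfs_defaultfs() {
  printf 'abfs://%s@%s.dfs.%s\n' "$1" "$2" "${3:-core.windows.net}"
}

# Prints the value to place after "FS_DEFAULTFS=" in bde-server.env.
abfs_defaultfs my_container my_storage_account
```

Passing a different suffix as the third argument (for example core.usgovcloudapi.net) yields the corresponding value for another Azure cloud.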


2. Azure Managed Identity authentication

This type of authentication is used when Conduit services are deployed on a virtual machine running in Azure. More information about managed identities for Azure can be found here.


Prerequisites

  • Enable the system-assigned managed identity. Follow the steps from here.
  • The storage account must have hierarchical namespace enabled.


  • The resource group of the virtual machine where Conduit services are running must have the following role: Storage Blob Data Contributor
    • In the Azure Portal, navigate to "All services" -> "Resource groups" -> select the resource group used by the Conduit services VM -> "Access control (IAM)" -> search for or add "Storage Blob Data Contributor".


Settings

/etc/bpcs/docker/bde-server.env

Once all prerequisites are fulfilled, fill in the following configuration with the proper values and add it to bde-server.env:

FS_TYPE=abfs
FS_ABFS_STORAGE_ACCOUNT={ Azure Blob storage account }
FS_ABFS_CONTAINER={ Azure Blob container }
FS_DEFAULTFS=abfs://{ Azure Blob container }@{ Azure Blob storage account }.dfs.core.windows.net

CONDUIT_AZURE_CLOUD_TYPE=AzureCloud
CONDUIT_AZURE_CLOUD_STORAGE_ENDPOINT_SUFFIX=core.windows.net


  • Remove the curly brackets { } when filling in the values. See the examples section below.
  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".



Azure Government Storage

General Prerequisites

  • Follow the steps from the "Azure Cloud Storage" section above.


Settings

/etc/bpcs/docker/bde-server.env

Once all prerequisites are fulfilled, fill in the following configuration with the proper values and add it to bde-server.env:

FS_TYPE=abfs
FS_ABFS_STORAGE_ACCOUNT={ Azure Blob storage account }
FS_ABFS_CONTAINER={ Azure Blob container }
FS_DEFAULTFS=abfs://{ Azure Blob container }@{ Azure Blob storage account }.dfs.core.usgovcloudapi.net/
CONDUIT_AZURE_CLOUD_TYPE=AzureUSGovernment
CONDUIT_AZURE_CLOUD_STORAGE_ENDPOINT_SUFFIX=core.usgovcloudapi.net
  • Remove the curly brackets { } when filling in the values. See the examples section below.
  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".

S3 

The configuration of this file system can be done in 2 ways, depending on the type of authentication used:

1. Access Key authentication

  • More information on access key generation can be found here.

  • The service account in use must have the following permission: AmazonS3FullAccess

  • Open the following file:

    /etc/bpcs/docker/bde-server.env
  • Add the following configuration to bde-server.env:

FS_TYPE=s3
FS_AWS_BUCKET={ S3 bucket name }
FS_AWS_ACCESS_KEY={ S3 bucket access key }
FS_AWS_SECRET_KEY={ S3 bucket secret key }
FS_DEFAULTFS=s3a://{ S3 bucket name }


  • Remove the curly brackets { } when filling in the values. See the examples section below.
  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".


2. IAM metadata authentication

  • More information about this type of authentication can be found here.

  • The service account in use must have the following permission: AmazonS3FullAccess

  • Open the following file:

    /etc/bpcs/docker/bde-server.env
  • Add the following configuration to bde-server.env:

FS_TYPE=s3
FS_AWS_BUCKET={ S3 bucket name }
FS_DEFAULTFS=s3a://{ S3 bucket name }


  • Remove the curly brackets { } when filling in the values. See the examples section below.
  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".


Google Cloud Storage (GCS) 

The configuration of this file system can be done in 2 ways, depending on the type of authentication used:

1. File credential authentication (using P12 certificate)

  • More information about this type of authentication can be found here

  • The service account in use must have the following permission: StorageAdmin

  • Open the following file:

    /etc/bpcs/docker/bde-server.env
  • Add the following configuration to bde-server.env:

FS_TYPE=gcs
FS_GCS_BUCKET={{ GCS bucket name }}
FS_GCS_PROJECT_ID={{ GCS project id }}
FS_GCS_SERVICE_ACCOUNT_KEYFILE={{ The path to the GCS json keyfile }}
FS_DEFAULTFS=gs://{{ GCS bucket name }}


  • If the configuration is new or the keyfile needs to be changed, add the new file to the following directory: /etc/bpcs/docker/conduit/gcs/keyfile/
  • Remove the curly brackets { } when filling in the values. See the examples section below.
  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
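Placing the keyfile and deriving the matching variable can be done in one step. The sketch below is only an illustration under stated assumptions: the function name install_gcs_keyfile is hypothetical, and it assumes that FS_GCS_SERVICE_ACCOUNT_KEYFILE should point at the copied file inside the keyfile directory, as in the examples section of this page.

```shell
# Hypothetical helper (not part of Conduit): copy a keyfile into the keyfile
# directory and print the matching FS_GCS_SERVICE_ACCOUNT_KEYFILE line.
#   $1 = path to the keyfile to install
#   $2 = target directory (optional, defaults to the directory named above)
install_gcs_keyfile() {
  src="$1"
  dest_dir="${2:-/etc/bpcs/docker/conduit/gcs/keyfile}"
  cp "$src" "$dest_dir/" || return 1
  printf 'FS_GCS_SERVICE_ACCOUNT_KEYFILE=%s/%s\n' "$dest_dir" "$(basename "$src")"
}

# Usage (the printed line can then be added to bde-server.env):
#   install_gcs_keyfile ./my_keyfile.json
```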


2. IAM metadata authentication

  • More information about this type of authentication can be found here.

  • The service account in use must have the following permission: StorageAdmin

  • Open the following file:

    /etc/bpcs/docker/bde-server.env
  • Add the following configuration to bde-server.env:


FS_TYPE=gcs
FS_GCS_BUCKET={{ GCS bucket name }}
FS_GCS_PROJECT_ID={{ GCS project id }}
FS_DEFAULTFS=gs://{{ GCS bucket name }}


  • Remove the curly brackets { } when filling in the values. See the examples section below.
  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".

HDFS 

Please navigate to the following path:

/etc/bpcs/docker/bde-server.env

The following configuration must be added to bde-server.env

FS_TYPE=hdfs
FS_DEFAULTFS=hdfs://{{ spark_host }}:9000


  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".



Local file system 

Please navigate to the following path:

/etc/bpcs/docker/bde-server.env

The following configuration must be added to bde-server.env:

FS_TYPE=file
FS_DEFAULTFS=file:///{your file path}
  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 
  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".

Conduit installation

During Conduit installation, the user is asked to choose one of the storage types above as the Parquet store location. The installation script configures the Conduit system according to the chosen option.
The prerequisites for the chosen file system type must be satisfied before proceeding with the installation.


Please see section "Supported storage file systems" for a guide to creating the appropriate configuration for each storage type.


Once the configuration values are prepared, enter them in the installation dialogues.


Configuration update after installation

The following steps describe how to reconfigure the storage type used as the dataset cache location in Conduit.

If the Conduit Parquet store configuration needs to be modified after installation, this can be done through the environment variables in the bde-server.env file.

Please navigate to the following path:

/etc/bpcs/docker/bde-server.env


Step 1) Delete old configurations

If the configurations already exist and need to be modified, first delete all environment variables in bde-server.env that start with FS_ (usually found at the bottom of the file).
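The deletion can also be scripted. The following is a minimal sketch, not an official Conduit tool; the function name remove_fs_vars and the .bak backup suffix are assumptions chosen for illustration.

```shell
# Hypothetical helper (not part of Conduit): delete every line that starts
# with FS_ from an env file, keeping a backup copy as <file>.bak.
#   $1 = path to the env file
remove_fs_vars() {
  cp -p "$1" "$1.bak" || return 1
  # Rewrite the file from the backup, dropping lines that begin with FS_.
  sed '/^FS_/d' "$1.bak" > "$1"
}

# Usage:
#   remove_fs_vars /etc/bpcs/docker/bde-server.env
```

Only lines beginning with FS_ are removed; all other variables in the file are left untouched, and the original content remains available in the backup copy.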

Step 2) Edit new configurations

See section "Supported storage file systems" above for how to obtain the new configuration for each storage type.


Edit the bde-server.env file with the new values and save it.
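Before restarting anything, the edited file can be given a quick sanity check. The sketch below is a hypothetical helper, not part of Conduit: it only verifies that FS_TYPE is one of the storage types listed on this page and that FS_DEFAULTFS is not empty. It does not validate credentials or connectivity.

```shell
# Hypothetical check (not part of Conduit): FS_TYPE must be one of the
# storage types listed in this document and FS_DEFAULTFS must be non-empty.
#   $1 = path to the env file
check_fs_config() {
  # If a variable appears more than once, the last occurrence is inspected.
  fs_type=$(sed -n 's/^FS_TYPE=//p' "$1" | tail -n 1)
  fs_default=$(sed -n 's/^FS_DEFAULTFS=//p' "$1" | tail -n 1)
  case "$fs_type" in
    abfs|s3|gcs|hdfs|file) ;;
    *) echo "unsupported or missing FS_TYPE: '$fs_type'" >&2; return 1 ;;
  esac
  if [ -z "$fs_default" ]; then
    echo "FS_DEFAULTFS is missing or empty" >&2
    return 1
  fi
}

# Usage:
#   check_fs_config /etc/bpcs/docker/bde-server.env && echo "looks plausible"
```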


Step 3) Clean Conduit storage metadata service

The previous Hive metastore volume must also be removed. Stop and remove the container first, then remove the volume:


docker stop hive-metastore
docker rm hive-metastore
docker volume rm docker_hive_metastore_volume


Step 4) Restart Conduit services

cd /etc/bpcs/docker
docker-compose up -d

After this step, the Parquet store uses the new storage type.




Examples of Parquet store configurations

Please navigate to the following path:

/etc/bpcs/docker/bde-server.env

Azure Cloud Storage

FS_TYPE=abfs
FS_AZBS_STORAGE_ACCOUNT=my_storage_account
FS_AZBS_CONTAINER=my_container
FS_AZBS_ACCESS_KEY=my_access_key
FS_DEFAULTFS=abfs://my_container@my_storage_account.dfs.core.windows.net
CONDUIT_AZURE_CLOUD_TYPE=AzureCloud
CONDUIT_AZURE_CLOUD_STORAGE_ENDPOINT_SUFFIX=core.windows.net


Azure Government Storage

FS_TYPE=abfs
FS_AZBS_STORAGE_ACCOUNT=my_storage_account
FS_AZBS_CONTAINER=my_container
FS_AZBS_ACCESS_KEY=my_access_key
FS_DEFAULTFS=abfs://my_container@my_storage_account.dfs.core.usgovcloudapi.net/
CONDUIT_AZURE_CLOUD_TYPE=AzureUSGovernment
CONDUIT_AZURE_CLOUD_STORAGE_ENDPOINT_SUFFIX=core.usgovcloudapi.net


S3

FS_TYPE=s3
FS_AWS_BUCKET=my_bucket
FS_AWS_ACCESS_KEY=my_access_key
FS_AWS_SECRET_KEY=my_secret_key
FS_DEFAULTFS=s3a://my_bucket


Google Cloud Storage

FS_TYPE=gcs
FS_GCS_BUCKET=my_bucket
FS_GCS_PROJECT_ID=my_project_id
FS_GCS_SERVICE_ACCOUNT_KEYFILE=/etc/bpcs/docker/conduit/gcs/keyfile/my_keyfile.json
FS_DEFAULTFS=gs://my_bucket


HDFS

FS_TYPE=hdfs
FS_DEFAULTFS=hdfs://10.1.8.4:9000/


Local file system

FS_TYPE=file
FS_DEFAULTFS=file:///

