Running Snowflake Queries Asynchronously from Azure Data Factory

Azure Data Factory is a fully managed, serverless cloud ETL service from Microsoft that can be used to create data pipelines without writing any code. It connects to different data sources (linked services) through built-in connectors and also lets you perform transformations during migration. One of these built-in connectors is the Snowflake connector, which connects to the popular data warehouse Snowflake. Snowflake has gained popularity as a fully managed cloud data warehouse for its easy setup, scalability, performance and speed, and also for its concept of the virtual warehouse, which can be configured to start up and shut down automatically instead of running 24×7, so you only pay for the time and resources required to execute queries.

Many typical ETL queries for transformations or report generation operate on huge amounts of data and may run for a long time. This is where asynchronous calls become important: we can initiate a query and let it run in the background without holding the connection open from the caller, then check the query status periodically and use the result when it is ready.

Unfortunately, at the time of this writing (Oct 2021), neither Azure Data Factory nor Snowflake provides a built-in method to call a query asynchronously from Data Factory. The Data Factory Snowflake connector supports the Copy activity, Mapping Data Flow and the Lookup activity, where the Lookup activity can run a query but only synchronously, meaning it keeps the connection open until the query finishes. It also has limits on the number of rows returned, the duration of the query and the size of the returned output. On the other side, the Snowflake JavaScript stored procedure API only has an execute() method and no way to execute a query asynchronously. This poses a problem for the aforementioned long-running ETL queries.

Fortunately, Snowflake does allow executing queries asynchronously from common programming languages like Java or Python using the provided connectors. For example, the Python connector has an execute_async() method for asynchronous query execution. From Python we can execute the query and retrieve the query id from Snowflake for later status checking, something like below:

from snowflake import connector
...

with connector.connect(...) as conn:
    cursor = conn.cursor()
    # submit the query and return immediately, without waiting for the result
    cursor.execute_async(<LONG_RUNNING_QUERY>)
    query_id = cursor.sfqid  # save the query id for later status checks

    # is_still_running() expects a QueryStatus, so look the status up first
    while conn.is_still_running(conn.get_query_status(query_id)):
        # log status, send notifications
        ...
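
Once the loop exits, the connector can also retrieve the result set by query id. A minimal sketch, reusing the query_id saved above (connection parameters omitted):

from snowflake import connector

with connector.connect(...) as conn:
    cursor = conn.cursor()
    # raises a ProgrammingError if the query ended in an error state
    conn.get_query_status_throw_if_error(query_id)
    # attach the cursor to the completed query and fetch its results
    cursor.get_results_from_sfqid(query_id)
    rows = cursor.fetchall()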

So the only remaining item is to figure out how to call this from inside Data Factory. Here comes Azure Functions, the serverless compute platform in Azure that lets us create and execute functions written in common programming languages like C#, Java, Python or JavaScript, without any server setup or maintenance. Azure Functions also comes with a free grant of 1 million executions and 400,000 GB-s per month, which is plenty to run these ETL helpers free of cost.

Data Factory has an Azure Function activity that can be added to the data pipeline to run, for example, a Python app created in Azure Functions that takes a query as a parameter and executes it in Snowflake asynchronously, as shown above. An HTTP-triggered Azure Function does time out after 230 seconds, but that is plenty for an asynchronous call that returns immediately.
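
A rough sketch of such a function, using the Python programming model for Azure Functions, could look like the following. The "query" request parameter and the omitted Snowflake connection settings are assumptions for illustration, not part of any official sample:

import azure.functions as func
from snowflake import connector

def main(req: func.HttpRequest) -> func.HttpResponse:
    query = req.params.get("query")  # hypothetical request parameter
    if not query:
        return func.HttpResponse("Missing 'query' parameter", status_code=400)

    # connection parameters omitted; use your Snowflake account settings
    with connector.connect(...) as conn:
        cursor = conn.cursor()
        cursor.execute_async(query)  # returns immediately
        # hand the query id back to the Data Factory pipeline
        return func.HttpResponse(cursor.sfqid)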

The flow can be something like this: we use one Azure Function app to initiate the long-running query and return the corresponding query id. We then use a separate data pipeline in Data Factory, run on a schedule trigger, that calls another Azure Function to periodically check the status of the query in Snowflake. Once the query completes, the monitoring function can use the Azure Python SDK's TriggersOperations class to stop that schedule trigger, since the monitoring is no longer needed, something like below:

from azure.mgmt.datafactory import DataFactoryManagementClient
...
client = DataFactoryManagementClient(...)
# stop the schedule trigger so the monitoring pipeline no longer runs
client.triggers.begin_stop(<resource_group_name>, <data_factory_name>, <trigger_name>)
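
Putting the pieces together, the monitoring function might look roughly like this. The subscription, resource group, factory and trigger names are placeholders, and passing the query id as a request parameter is an assumption for illustration:

import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from snowflake import connector

def main(req: func.HttpRequest) -> func.HttpResponse:
    query_id = req.params.get("query_id")  # hypothetical parameter passed by the pipeline

    # connection parameters omitted; use your Snowflake account settings
    with connector.connect(...) as conn:
        status = conn.get_query_status(query_id)
        if conn.is_still_running(status):
            return func.HttpResponse(f"Query {query_id} is still running")

    # query finished (or failed): stop the schedule trigger
    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription_id>")
    client.triggers.begin_stop("<resource_group_name>", "<data_factory_name>", "<trigger_name>")
    return func.HttpResponse(f"Query {query_id} finished with status {status.name}")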
