Data profiling is an important but often overlooked component of ETL pipelines and exploratory data analysis (EDA). It provides a way to look into the data to understand its structure, inter-relationships and dependencies. It can also uncover data quality issues that may arise inside a data pipeline during migration, preventing data…
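To make the idea concrete, here is a minimal sketch of the kind of structural and quality checks a profiling step might run, assuming the data is already loaded into a pandas DataFrame (the file name and its columns are purely illustrative, not from any particular pipeline):

```python
import pandas as pd

# Illustrative only: the source file is hypothetical.
df = pd.read_csv("orders.csv")

# Structure: column names, inferred dtypes and row count.
print(df.dtypes)
print(f"rows: {len(df)}, columns: {df.shape[1]}")

# Basic quality signals: missing values, duplicates and cardinality.
print(df.isnull().sum())                      # nulls per column
print(f"duplicate rows: {df.duplicated().sum()}")
print(df.nunique())                           # distinct values per column

# Summary statistics across numeric and non-numeric columns.
print(df.describe(include="all"))
```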
Category: Data Engineering
In an ELT workflow, the raw source tables are first loaded into a data lake and later transformed into a more suitable data model in the data warehouse for reporting. The transformation step can be time-consuming or expensive for large batch ELT or streaming ETL workloads for various reasons, for example: streaming events that require…
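As a rough illustration of the load-then-transform split, the sketch below reads raw event files from a data lake path and materialises a reporting-friendly aggregate with PySpark; the paths, table name and columns are assumptions for illustration, not taken from the original post:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-transform-sketch").getOrCreate()

# Extract/Load has already happened: raw events sit in the lake as-is.
raw = spark.read.parquet("s3://data-lake/raw/events/")  # hypothetical path

# Transform step: cleanse and aggregate the raw events into a model
# that is convenient for reporting queries.
daily_summary = (
    raw.filter(F.col("event_type").isNotNull())
       .groupBy(F.to_date("event_time").alias("event_date"), "event_type")
       .agg(F.count("*").alias("event_count"))
)

# Persist the curated model for downstream BI (table name is illustrative).
daily_summary.write.mode("overwrite").saveAsTable("analytics.daily_event_summary")
```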
Azure Data Factory is a fully managed, serverless cloud ETL service from Microsoft that can be used to create data pipelines without writing any code. It can connect to different data sources (linked services) using built-in connectors and also allows transformations to be performed during migration. One of the built-in connectors is…
An ETL process usually involves a few discrete tasks that rely on each other. One way to run them is to use crontab (or a Kubernetes CronJob), where each task is scheduled to run at a specific time. However, ensuring inter-dependencies among the tasks, where one task should only start when the previous one has finished, is not…
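To show the contrast with independently scheduled cron entries, here is a minimal workflow-orchestration sketch using Apache Airflow (assuming Airflow 2.x and the BashOperator; the task names and commands are illustrative), where each task starts only after its upstream task has succeeded:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# One DAG replaces three separately scheduled cron entries; the >> operator
# encodes "start only after the previous task finished successfully".
with DAG(
    dag_id="etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    transform = BashOperator(task_id="transform", bash_command="python transform.py")
    load = BashOperator(task_id="load", bash_command="python load.py")

    extract >> transform >> load
```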
As data engineers, we have to collect data from different types of sources and often have to come up with custom data pipelines and ETL tools to move data from one system to another in order to consolidate it into a single data warehouse. The sources can be conventional relational databases, NoSQL databases or message bus…