Data profiling is an important but often overlooked component of ETL pipelines and exploratory data analysis (EDA). It provides a way to look into the data to understand its structure, the inter-relationships among fields, and their dependencies on each other. It can also uncover data quality issues that may arise inside a data pipeline during migration, preventing data…
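As a rough illustration of the idea behind column-level profiling, here is a minimal sketch in plain Python; the `profile` helper and the sample rows are hypothetical, and a real pipeline would use a dedicated profiling library, but the core is the same: walk the records and tally counts, nulls, and observed types per column.

```python
from collections import Counter

def profile(rows):
    """Per-column record count, null count, and observed value types."""
    stats = {}
    for row in rows:
        for col, val in row.items():
            s = stats.setdefault(col, {"count": 0, "nulls": 0, "types": Counter()})
            s["count"] += 1
            if val is None:
                s["nulls"] += 1
            else:
                s["types"][type(val).__name__] += 1
    return stats

rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": None},
    {"id": "3", "name": "bob"},  # mixed int/str id: the kind of issue profiling surfaces
]
report = profile(rows)
```

Even this toy report immediately exposes the null in `name` and the mixed `int`/`str` types in `id`, which is exactly the class of problem that goes unnoticed until a downstream load fails.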
Abdullah Ahmed Posts
In an ELT workflow, the raw source tables are first loaded into a data lake and later transformed into a more suitable data model in the data warehouse for reporting. The transformation step can be time-consuming or expensive for large batch ELT or streaming ETL workloads for various reasons, for example: Streaming events that require…
Azure Data Factory is a fully managed, serverless cloud ETL service from Microsoft that can be used to easily create data pipelines without writing any code. It can connect to different data sources (linked services) using built-in connectors and also allows you to perform transformations during the migration. One of the built-in connectors is…
Recently I have been travelling quite a bit. Unsurprisingly, the internet is rarely as good as you would hope: either mobile data is too expensive or the signal is too weak for everyday usage, or the broadband available from local ISPs is not the most reliable: the same public IP is shared…
PostgreSQL is one of the most popular and feature-rich open-source relational databases. It supports different types of encodings, e.g. ‘SQL_ASCII’, ‘UTF8’, ‘LATIN1’, ‘EUC_KR’, etc. Joel Spolsky has a must-read article about Unicode and character encoding, but in short, a character encoding is a mapping between a set of bytes and their corresponding characters. Without…
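The byte-to-character mapping can be seen directly from Python's standard encoding support: the same string produces different byte sequences under different encodings, and decoding bytes with the wrong encoding garbles the text (mojibake).

```python
text = "café"

# 'é' is two bytes in UTF-8 but a single byte in Latin-1.
utf8 = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1 = text.encode("latin-1")  # b'caf\xe9'

# Decoding UTF-8 bytes as Latin-1 maps each byte to a character
# individually, producing the classic mojibake 'cafÃ©'.
garbled = utf8.decode("latin-1")
```

This is exactly the failure mode a database hits when the client encoding does not match the bytes actually stored in the table.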
An ETL process usually involves a few discrete tasks that rely on each other. One way to schedule them is crontab (or a Kubernetes CronJob), where each task runs at a specific time. However, ensuring the inter-dependencies among the tasks, where one task should only start when the previous one has finished, is not…
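The dependency problem can be made concrete with a minimal sketch: instead of guessing timings in crontab, declare each task's prerequisites and run them in dependency order. The task names and the `run_in_order` helper below are hypothetical; real workflow engines add retries, cycle detection, and parallelism on top of this idea.

```python
# Each task lists the tasks that must finish before it may start.
deps = {"extract": [], "transform": ["extract"], "load": ["transform"]}

def run_in_order(deps):
    """Execute tasks so every prerequisite finishes before its dependent starts."""
    done, order = set(), []

    def run(task):
        if task in done:
            return
        for prereq in deps[task]:
            run(prereq)          # recurse into prerequisites first
        order.append(task)       # stand-in for actually executing the task
        done.add(task)

    for task in deps:
        run(task)
    return order
```

Note this sketch assumes the dependency graph is acyclic; a production scheduler would detect cycles rather than recurse forever.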
JSON (JavaScript Object Notation) is supported in PostgreSQL, which lets you store semi-structured or unstructured data in a table and allows greater flexibility for applications, along with support for NoSQL-like features. Currently, there are two JSON data types in PostgreSQL: JSON and JSONB. In this post, I want to try some common operations for…
As data engineers, we have to collect data from different types of sources and often have to build custom data pipelines and ETL tools to move data from one system to another in order to consolidate it into a single data warehouse. The sources can be conventional relational databases, NoSQL databases, or message bus…