Using Python for data engineering
Python is one of the most popular programming languages worldwide. It often ranks well in industry programming popularity surveys, recently claiming the top spot in both the Popularity of Programming Language (PYPL) and TIOBE indexes.
Python was never designed primarily for web development. However, software engineers recognised its potential for that purpose some years ago, and the language experienced a massive surge in popularity.
Python is just as indispensable to data engineers, so it is worth examining how it can make their workload more manageable and efficient.
Cloud platform providers use Python for implementing and controlling services
The everyday challenges data engineers face are not dissimilar to those experienced by data scientists, as processing data in its many forms is a key focus for both. From a data engineering perspective, however, there is more concentration on industrial processes, such as ETL (extract-transform-load) jobs and data pipelines. These have to be robust, dependable and fit for purpose.
The serverless computing principle allows for triggering data ETL processes on demand, with the physical processing infrastructure shared among users. This lets them optimise costs and, consequently, reduce the management overhead to a bare minimum. Python is supported by the serverless computing services of prominent platforms, including AWS Lambda, Azure Functions and GCP Cloud Functions.
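To make the serverless pattern concrete, here is a minimal sketch of an on-demand ETL trigger, modelled on an AWS Lambda handler. The event shape and field names are hypothetical illustrations, not any platform's actual schema; a real handler would typically read source data from object storage and load the transformed records downstream.

```python
import json

# Hypothetical serverless ETL handler: receives raw records in the event,
# applies a small transformation and reports how many records were loaded.
def handler(event, context):
    records = event.get("records", [])
    # Transform step: keep only records with a positive amount and
    # normalise the currency field to upper case.
    transformed = [
        {"id": r["id"], "amount": r["amount"], "currency": r["currency"].upper()}
        for r in records
        if r.get("amount", 0) > 0
    ]
    # Load step would go here (e.g. write to a warehouse or object store).
    return {"statusCode": 200, "body": json.dumps({"loaded": len(transformed)})}
```

The same handler shape, with provider-specific signatures, applies to Azure Functions and GCP Cloud Functions: the platform invokes the function on demand and bills only for the execution time.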
Parallel computing is, in turn, needed for the more ‘heavy duty’ ETL tasks relating to big data. Splitting the transformation workflows among multiple worker nodes is essentially the only feasible way, memory-wise and time-wise, to accomplish the goal.
PySpark, a Python wrapper for the Spark engine, is ideal here, as it is supported by AWS Elastic MapReduce (EMR), GCP Dataproc and Azure HDInsight. As far as controlling and managing cloud resources is concerned, each platform exposes appropriate application programming interfaces (APIs), used for tasks such as job triggering and data retrieval.
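As an illustration of job triggering through such an API, the sketch below builds an EMR step definition for submitting a PySpark script. The cluster ID, bucket and script path are hypothetical placeholders, and the actual submission call (via boto3, the AWS SDK for Python) is shown commented out since it requires live credentials.

```python
# Step definition for submitting a PySpark job to an EMR cluster.
# The script location and names below are hypothetical placeholders.
step = {
    "Name": "nightly-etl",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/jobs/etl_job.py"],
    },
}

# With boto3, the step would be submitted roughly like this (cluster ID
# is a placeholder; the call needs AWS credentials to run):
# import boto3
# emr = boto3.client("emr", region_name="eu-west-1")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

GCP and Azure expose analogous APIs for submitting jobs to Dataproc and HDInsight clusters.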
Python is consequently used across all cloud computing platforms. The language is useful when performing a data engineer’s job, which is to set up data pipelines along with ETL jobs to recover data from various sources (ingestion), process/aggregate them (transformation) and conclusively allow them to become available for end users.
Using Python for data ingestion
Business data originates from a number of sources, including databases (both SQL and NoSQL), flat files (e.g., CSVs), other file formats used by organisations such as spreadsheets, as well as external systems, web documents and APIs.
The wide acceptance of Python as a programming language has resulted in a wealth of libraries and modules, among which the Pandas library stands out. Pandas is notable for its ability to read data into “DataFrames”, and it can do so from a variety of formats, such as CSVs, TSVs, JSON, XML, HTML, LaTeX, SQL, Microsoft Excel and OpenDocument spreadsheets, and other binary formats produced by exports from different business systems.
Pandas is built on other scientific and computationally optimised packages, offering a rich programming interface with a huge panel of functions necessary to process and transform data reliably and efficiently. AWS Labs maintains the aws-data-wrangler library, described as “Pandas on AWS”, which brings familiar DataFrame operations to AWS services.
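A minimal ingestion sketch with Pandas: reading CSV data straight into a DataFrame and applying a typical aggregation. The data and column names here are invented for illustration; the “file” is an in-memory string so the example is self-contained, but `pd.read_csv` accepts file paths, URLs and buffers alike.

```python
import io

import pandas as pd

# Hypothetical sales export, held in memory for self-containment.
csv_data = io.StringIO(
    "order_id,region,amount\n"
    "1,EMEA,120.50\n"
    "2,APAC,80.00\n"
    "3,EMEA,42.25\n"
)

# Ingestion: read the CSV into a DataFrame.
orders = pd.read_csv(csv_data)

# A typical transformation step: aggregate revenue per region.
revenue = orders.groupby("region")["amount"].sum()
```

The same `read_*` family covers the other formats mentioned above (`read_json`, `read_sql`, `read_excel` and so on), which is what makes Pandas such a convenient ingestion layer.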
Using PySpark for parallel computing
Apache Spark is an open-source engine for processing large quantities of data that applies the parallel computing principle in a highly efficient and fault-tolerant fashion. Whilst initially implemented in Scala and natively supporting that language, it now offers a widely used Python interface: PySpark.
PySpark supports a majority of Spark’s features — including Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core — making it easier for Pandas experts to develop ETL jobs.
All of the aforementioned cloud computing platforms can be used with PySpark: Elastic MapReduce (EMR), Dataproc and HDInsight for AWS, GCP and Azure, respectively.
In addition, users can attach a Jupyter Notebook to support development of their distributed-processing Python code, for example with the natively supported EMR Notebooks in AWS.
PySpark is a useful platform for remodelling and aggregating large groups of data, making them easier for end users, including business analysts, to consume.
Using Apache Airflow for job scheduling
When Python-based tools earn renown in on-premise systems, cloud providers are motivated to commercialise them in the form of ‘managed’ services that are, therefore, simple to set up and operate.
This is true for (among others) Amazon’s Managed Workflows for Apache Airflow, which was launched in 2020 and facilitates using Airflow in some of the AWS regions (nine at the time of writing). Cloud Composer is the GCP alternative for a managed Airflow service.
Apache Airflow is a Python-based, open-source workflow management tool. It allows users to programmatically author and schedule workflow processing sequences, and subsequently keep track of them with the Airflow user interface.
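The “programmatically author” part means a pipeline is declared as a Python DAG (directed acyclic graph) of tasks. The sketch below shows the general shape, assuming Airflow 2.x conventions; the DAG ID, schedule and task callables are hypothetical placeholders rather than a real pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical ETL callables; in a real pipeline these would pull from a
# source system, transform the data and load it into a warehouse.
def extract():
    print("extracting")

def transform():
    print("transforming")

def load():
    print("loading")

# A DAG declares the tasks and their ordering; the scheduler runs it
# daily and the Airflow UI tracks each run.
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3  # extract, then transform, then load
```

The same file works unchanged on a managed deployment such as MWAA or Cloud Composer: the DAG is simply dropped into the environment's DAGs location.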
There are various alternatives to Airflow, most obviously Prefect and Dagster. Both are Python-based data workflow orchestrators with a UI that can be used to construct, run and observe pipelines, and both aim to address some of the concerns users face when using Airflow.
Strive to reach data engineering goals with Python
Valued in the software community for being intuitive and easy to use, Python is not only innovative, but also versatile, allowing engineers to elevate their services. The simplicity at the heart of the language means engineers are able to overcome obstacles as they arise.
Backed by an enthusiastic community that works together to better the language, Python’s simple composition allows developers to collaborate on projects with quantitative researchers, analysts and data engineers, and will see it remain one of the most widely used programming languages in the world.