The secret sauce for a testable, maintainable and documented ETL

Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL (Extract, Transform and Load) is a key process to bring all the data together into a standard, homogeneous environment. The ETL process is composed of:

  • Extraction – data is extracted from its sources
  • Transformation – data is transformed for storage in the proper format or structure, for querying and analysis
  • Loading – data is loaded into the final target (e.g. a database)

Managing a big ETL process can be a challenge because of:

  • overall and ever-growing complexity
  • transformation dependency management hell (when final transformations depend on other transformations)
  • inaccurate, outdated or missing ETL process documentation
  • a lack of regression tests that can prevent unexpected errors after a framework upgrade or unintentional changes in seemingly unrelated areas

and many more. There is a simple solution: unit/integration tests.

Well, not so fast. Traditional unit/integration tests might serve as up-to-date documentation, but they're readable only by programmers, who then get stuck translating their contents for everyone else and doing all the ETL-related work. As Stitch Fix put it in their Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department blog post:

Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.

There must be some way of empowering the more motivated parties (e.g. data scientists, business analysts) to control and maintain the ETL, including keeping the documentation current and guarding against regressions. The solution? Cucumber.

No, not the plant, but Cucumber — executable specifications that enable:

  • A single source of truth — Cucumber merges specification and test documentation into one cohesive whole
  • Living documentation — because they’re automatically tested by Cucumber, your specifications are always up-to-date
  • Accessible testing — human-readable statements open automated testing to non-programmers too

With Cucumber, here’s what the most trivial ETL specification might look like:

Feature: User registration ETL
  Scenario: User registration ETL
    # Input
    Given product table "users"
      | Integer | String              |
      | id      | created_on          |
      | 1       | 2016-01-01 01:02:03 |
      | 2       | 2016-01-02 04:05:06 |

    # ETL invocation
    When "vinted.warehouse.etl.facts.RegistrationFact" job is run

    # Output assertions
    Then result table "facts.registration" includes the following rows
      | Integer |
      | id      |
      | 1       |
      | 2       |

Let’s see what’s inside:

  • Input: data source mocks that are used for extraction. These could be Hive tables, RDBMS tables, Kafka topics, CSV files, Google Spreadsheets, you name it.
  • ETL invocation: a class name that performs the ETL job (extracts from mocked Inputs and writes to desired Output).
  • Output assertions: what is the expected result of a given ETL. These could also be checks for content in database tables, message queues, data files, Google Spreadsheets, whatever.
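
To make this concrete, here’s roughly what the wiring behind those three statements could look like as cucumber-scala step definitions over an embedded Spark session. It’s a minimal sketch, not the actual implementation (we’ll share the real details in the upcoming post); the EtlJob trait, the Integer/String type mapping and the session setup are illustrative assumptions.

import io.cucumber.datatable.DataTable
import io.cucumber.scala.{EN, ScalaDsl}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

import scala.jdk.CollectionConverters._

// Hypothetical contract that every ETL job under test implements.
trait EtlJob { def run(spark: SparkSession): Unit }

class EtlSteps extends ScalaDsl with EN {

  // Embedded Spark session: running in local mode keeps the feedback loop fast.
  private val spark = SparkSession.builder()
    .master("local[2]").appName("etl-specs").getOrCreate()

  // The first header row of the Gherkin table declares the column types.
  private def toDataType(name: String): DataType = name match {
    case "Integer" => IntegerType
    case "String"  => StringType
    case other     => throw new IllegalArgumentException(s"Unsupported type: $other")
  }

  // Splits a Gherkin table into (types, column names, data rows).
  private def parseTable(data: DataTable): (Seq[String], Seq[String], Seq[Seq[String]]) = {
    val rows = data.asLists(classOf[String]).asScala.map(_.asScala.toSeq).toSeq
    (rows.head, rows(1), rows.drop(2))
  }

  Given("""^product table "([^"]*)"$""") { (table: String, data: DataTable) =>
    val (types, names, values) = parseTable(data)
    val schema = StructType(names.zip(types).map {
      case (n, t) => StructField(n, toDataType(t), nullable = true)
    })
    val rows = values.map { cells =>
      Row.fromSeq(cells.zip(types).map {
        case (v, "Integer") => v.toInt
        case (v, _)         => v
      })
    }
    // Register the mock as a temp view so the job under test "extracts" from it.
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
      .createOrReplaceTempView(table)
  }

  When("""^"([^"]*)" job is run$""") { (className: String) =>
    // Instantiate the ETL job by its fully qualified class name and run it.
    Class.forName(className).getDeclaredConstructor().newInstance()
      .asInstanceOf[EtlJob].run(spark)
  }

  Then("""^result table "([^"]*)" includes the following rows$""") {
    (table: String, data: DataTable) =>
      val (_, names, expected) = parseTable(data)
      // A database-qualified name like "facts.registration" assumes the
      // database exists in the local metastore.
      val actual = spark.table(table)
        .select(names.map(col): _*)
        .collect().map(_.toSeq.map(v => String.valueOf(v))).toSet
      val missing = expected.filterNot(actual.contains)
      assert(missing.isEmpty, s"Rows missing from $table: $missing")
  }
}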

Just to ignite your imagination, here are some more interesting examples from our specifications.

Input examples

Mocking Kafka topics:

Given Kafka topic "click_event" with messages
  | {"event":"click","time":123} |
  | {"event":"click","time":234} |

Mocking HTTP responses:

# Mocks all HTTP responses with pre-recorded tapes,
# returning unique response for each HTTP call
Given mocked HTTP responses

Assertion examples

Ensuring there are no excess rows (e.g. potential row multiplication after a join):

Then result table "facts.registration" has 2 rows

Checking for certain state with SQL query:

Then SQL query "select 'error' from accounting_ledger having sum(credit) <> sum(debit)" returns 0 rows

Ensuring certain rows have been filtered out:

Then result table "t1" does not include the following rows
  | Integer |
  | id      |
  | 31337   |

Checking data in Kafka topics:

Then Kafka topic "some_stats" includes the following messages
  | {"id":123,"stats":{}} |
  | {"id":456,"stats":{}} |

Ensuring that standard time fields (e.g. local_time, utc_time) have been added to the result and that they’re not empty:

Then result table "facts.registration" contains timestamps
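
Behind the scenes, each of these assertions can be a small step definition querying the local Spark session. Here’s a hypothetical sketch of a few of the checks above; the list of standard time columns is an illustrative assumption.

import io.cucumber.scala.{EN, ScalaDsl}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

class AssertionSteps extends ScalaDsl with EN {

  private val spark = SparkSession.builder()
    .master("local[2]").appName("etl-specs").getOrCreate()

  Then("""^result table "([^"]*)" has (\d+) rows$""") { (table: String, expected: Int) =>
    val actual = spark.table(table).count()
    assert(actual == expected, s"Expected $expected rows in $table, got $actual")
  }

  Then("""^SQL query "([^"]*)" returns (\d+) rows$""") { (query: String, expected: Int) =>
    val actual = spark.sql(query).count()
    assert(actual == expected, s"Query returned $actual rows, expected $expected: $query")
  }

  // Assumed project-wide convention for standard time fields.
  private val timestampColumns = Seq("local_time", "utc_time")

  Then("""^result table "([^"]*)" contains timestamps$""") { (table: String) =>
    val df = spark.table(table)
    timestampColumns.foreach { c =>
      assert(df.columns.contains(c), s"Column $c is missing from $table")
      assert(df.filter(col(c).isNull).count() == 0, s"Column $c in $table has empty values")
    }
  }
}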

Fitting it all together

Cucumber isn’t enough on its own; there’s work to do before the magic can happen. First, engineers need to “wire” the reusable specification statements to the specific ETL implementation.

At Vinted, we’ve picked Apache Spark as our data management tool. The good news is that Spark isn’t a prerequisite: any tool that makes you productive or does the right job will work. The only requirement is some sort of API that enables interaction with the tool, ideally one that can run on a local machine (the more local/embedded it is, the faster your feedback loop will be; and obviously, the faster, the better).
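
To give a feel for that wiring, below is a sketch of the bottom layer: one shared Spark session running in local mode, plus a Cucumber hook that cleans up between scenarios. The SparkTestContext object and its config values are illustrative, not a prescription.

import io.cucumber.scala.{ScalaDsl, Scenario}
import org.apache.spark.sql.SparkSession

// One embedded Spark session shared by all step definition classes; the
// getOrCreate() calls in the sketches above would resolve to this same session.
object SparkTestContext {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")                   // embedded: no cluster required
    .appName("etl-specs")
    .config("spark.ui.enabled", "false")  // trim per-run overhead in tests
    .getOrCreate()
}

class Hooks extends ScalaDsl {
  Before { _: Scenario =>
    val spark = SparkTestContext.spark
    // Drop temp views left over from the previous scenario, so mocked inputs
    // never leak between scenarios.
    spark.catalog.listTables().collect()
      .filter(_.isTemporary)
      .foreach(t => spark.catalog.dropTempView(t.name))
  }
}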

Once the wiring’s done, it’s time to reap the benefits of this investment:

  • there’s a human-readable, always up-to-date specification
  • the ETL process has regression tests, so important things can’t be broken that easily
  • framework upgrades are safe: there will be no surprises in production
  • bonus: onboarding new team members is a non-issue thanks to the documentation and the safety net — anyone can contribute on their first day!

In an upcoming post, we’ll share all the implementation details. Stay tuned!