Change Data Capture (CDC) for Analytics

Ganesh Ramasubramanian
3 min read · Sep 1, 2019


Gone are the days when we waited for the ETL script to run in batch mode every night, with:

  • Fewer active users on the system
  • A slave (replica) node screaming out loud
  • Someone squinting through pager/SMS alerts at midnight!

From Jay Kreps (back in 2013):

Change Data Capture — There is a small industry around getting data out of databases, and this is the most log-friendly style of data extraction.

But the big data world and Kafka have evolved ever since. CDC has now become a de-facto standard architecture for extracting data from any RDBMS and transferring it to any endpoint. When Jay Kreps wrote this in 2013, there was no Kafka Connect, KSQL, or Kafka Streams; all of these came later. Today this is a selling point of the Confluent platform.

Integrate data from external systems using Kafka Connect: https://docs.confluent.io/current/connect.html

Now we can use Kafka as a real-time streaming engine to move data from almost any source system to any target system, seamlessly and in real time. We can even combine, join, or intersect multiple event sources in Kafka on the fly using KSQL or Kafka Streams, and turn the raw change records into "business events".
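For example, here is a minimal consume-transform-produce sketch in Python with the confluent-kafka client; the topic names, the handling of the Debezium JSON envelope, and the "OrderPlaced" mapping are all assumptions for illustration:

```python
import json
from confluent_kafka import Consumer, Producer

# Hypothetical topic names -- adjust to your Debezium/Connect setup.
CDC_TOPIC = "dbserver1.public.orders"
BUSINESS_TOPIC = "business.order-events"

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-to-business-events",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe([CDC_TOPIC])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        change = json.loads(msg.value())
        # Debezium wraps the row in "payload.after" when schemas are enabled,
        # or exposes "after" at the top level when they are not.
        payload = change.get("payload", change)
        after = payload.get("after") or {}
        # Turn the low-level row change into a higher-level "business event".
        event = {
            "event_type": "OrderPlaced",   # assumed mapping for this sketch
            "order_id": after.get("id"),
            "amount": after.get("amount"),
        }
        producer.produce(BUSINESS_TOPIC, json.dumps(event).encode("utf-8"))
        producer.poll(0)
finally:
    consumer.close()
    producer.flush()
```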

Image taken from: https://kafka.apache.org/intro.html

Let's take a step-by-step look at how a database can be replicated to another data center for disaster recovery, or how business events can be sent to an external target system for analytics.

Typical CDC Architecture
  • A typical Kafka cluster

- Confluent Platform, or
- on cloud (AWS, Azure or GCP), or
- on-premises

There are many well-established guides that describe how to install Kafka on each of these (a quick connectivity check against a running cluster is sketched below).
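As a quick sanity check, a minimal Python sketch using the confluent-kafka AdminClient to confirm the cluster is reachable; the bootstrap address is an assumption:

```python
from confluent_kafka.admin import AdminClient

# Assumed bootstrap address of the Kafka cluster.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# list_topics() fetches cluster metadata; the timeout guards against a dead broker.
metadata = admin.list_topics(timeout=10)
print("Brokers:", [str(b) for b in metadata.brokers.values()])
print("Topics:", list(metadata.topics.keys()))
```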

  • Kafka Source Connectors

Debezium PostgreSQL Source Connector
Debezium MySQL Source Connector

The sources need not be just one or two; they could be multiple databases, flat files, custom programs, etc. (A sample registration call for a source connector is sketched below.)
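As a sketch of how such a source connector might be registered through the Kafka Connect REST API: the endpoint, hostnames, credentials and table list below are placeholders, and the exact configuration keys depend on the Debezium version.

```python
import json
import urllib.request

# Placeholder Debezium PostgreSQL source configuration -- adjust to your setup.
connector = {
    "name": "orders-postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.dbname": "orders_db",
        "database.server.name": "dbserver1",   # used as the topic prefix
        "table.include.list": "public.orders",
    },
}

req = urllib.request.Request(
    "http://localhost:8083/connectors",        # Kafka Connect REST endpoint (assumed)
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```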

  • Kafka Sink Connectors

Any NoSQL Sink Connector
JDBC Sink Connector (for PostgreSQL, MySQL, etc.)

The targets, likewise, need not be just one or two; they could be multiple databases, flat files, custom programs, etc. (A sample sink-connector registration is sketched below.)
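A sink connector can be registered the same way. As a sketch, the following assumes a Confluent JDBC Sink Connector writing the business-events topic into a target PostgreSQL database; all names and credentials are placeholders:

```python
import json
import urllib.request

# Placeholder JDBC sink configuration -- adjust connection URL, topic and credentials.
sink = {
    "name": "orders-jdbc-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "business.order-events",
        "connection.url": "jdbc:postgresql://target-db:5432/analytics",
        "connection.user": "analytics_user",
        "connection.password": "analytics_password",
        "auto.create": "true",                 # let the connector create the target table
        "insert.mode": "upsert",
        "pk.mode": "record_value",
        "pk.fields": "order_id",
    },
}

req = urllib.request.Request(
    "http://localhost:8083/connectors",        # same Kafka Connect REST endpoint as above
    data=json.dumps(sink).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```

Note that the JDBC sink needs structured records (for example Avro with the Schema Registry), which ties in with the serialization discussion further below.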

As for connector coverage, the Apache Camel project has built out an extensive set of Kafka connectors (Camel Kafka Connector) and supports them. When a required source or sink connector isn't available, one can also implement the Kafka Connect interfaces and develop a custom connector.

Then, for analytics, one can quite easily use whichever tools the customer needs (the Elastic Stack, AMQP messaging, NoSQL databases, etc.).
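For instance, if an Elasticsearch sink connector has indexed the business events, they can be queried straight over the Elasticsearch REST API; the index name, field and endpoint below are assumptions:

```python
import json
import urllib.request

# Assumed: an Elasticsearch sink has indexed the business events into this index.
query = {
    "size": 0,
    "aggs": {"revenue": {"sum": {"field": "amount"}}},  # total order amount
}

req = urllib.request.Request(
    "http://localhost:9200/business.order-events/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["aggregations"]["revenue"])
```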

For all the messages going through your Kafka messaging system, you need to decide how they will be encoded (Avro, Protobuf, or JSON). Confluent recommends Avro and provides the Kafka Schema Registry for storing the schemas.
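As a sketch of the Avro path with the confluent-kafka Python client and the Schema Registry: the schema, topic and endpoints are assumptions, and the serializer API differs slightly between client versions.

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Assumed Avro schema for the business event.
SCHEMA = """
{
  "type": "record",
  "name": "OrderEvent",
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "amount", "type": "double"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(schema_registry, SCHEMA)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "value.serializer": avro_serializer,
})

# The serializer registers the schema in the Schema Registry on first use.
producer.produce(topic="business.order-events", value={"order_id": 42, "amount": 99.5})
producer.flush()
```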
