Apache Kafka for streaming data pipelines, validated with Apache Avro schemas

When you feed data into an Apache Kafka topic, you have no guarantee by default of what the data looks like, right? The problem with not knowing how the data looks on the producer side is that the consumer cannot know what shape of data to expect.

The problem in that case is that the data structure can change, and when it does, it will probably break your pipeline and cause very bad side effects downstream.

I know many people say they always have full control over what they send into their data pipelines, and yes, I would say that too if I were a lone warrior fighting the same battle for the rest of my life. But that is not how real-world cases work: even if you build a very simple data pipeline for something on your own, you will probably already have more than one topic in your Apache Kafka data stream.

Another benefit you get from using Apache Avro as the message format for Apache Kafka is that Avro's binary encoding is compact, so messages are smaller to transport. You save traffic between your clients and your Apache Kafka cluster, save disk space overall, and see better performance from your Apache Kafka cluster, which makes it a very powerful format to use with Apache Kafka. I have borrowed some charts of the serialization benchmark from another article here on Medium, "Serialization performance in .NET: JSON, BSON, Protobuf, Avro". Visit it and read about the performance in more depth! :)

What the abbreviations mean:

  • AA — Apache.Avro
  • CA — Chr.Avro
  • NJ — Newtonsoft.Json
  • STJ — System.Text.Json

Design your Apache Avro Schema

That is why I recommend using Apache Avro with your Apache Kafka cluster when you design your streaming data pipeline. Let's take a simple test first and design our Apache Avro schema.

Let's start by creating our first Apache Avro schema as a local file called booking.avsc:
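Here is a minimal sketch of what booking.avsc could look like. The field names are just an illustration, not from the original article, so adjust them to your own domain:

```json
{
  "type": "record",
  "name": "Booking",
  "namespace": "com.example.bookings",
  "fields": [
    {"name": "booking_id", "type": "string"},
    {"name": "customer_name", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "created_at", "type": "long"}
  ]
}
```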

Produce messages to Apache Kafka Topic with Apache Avro Schema

So it is time to produce a message to the Apache Kafka topic. It is important that you first validate your message by running your object through the Apache Avro parser, and only in the end produce it to Apache Kafka.

The benefit is that you get an error if a message does not validate against your Apache Avro schema, so you can block the message before it is ever sent and avoid errors deeper in your data pipeline.
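The article's own code is in the linked repo, so here is just a minimal sketch of such a producer in Python, assuming the confluent-kafka and avro packages, a broker on localhost:9092, a topic called bookings, and the example schema above (the broker address, topic name, and record values are all assumptions for illustration):

```python
import io

import avro.io
import avro.schema
from confluent_kafka import Producer

# Parse the local schema file; this fails fast if the schema itself is invalid.
with open("booking.avsc") as f:
    schema = avro.schema.parse(f.read())

writer = avro.io.DatumWriter(schema)

def serialize(record: dict) -> bytes:
    # DatumWriter.write raises an AvroTypeException when the record does not
    # match the schema, so bad messages are blocked before they are sent.
    buf = io.BytesIO()
    writer.write(record, avro.io.BinaryEncoder(buf))
    return buf.getvalue()

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker

booking = {  # hypothetical record matching the example schema above
    "booking_id": "b-1001",
    "customer_name": "Jane Doe",
    "amount": 249.50,
    "created_at": 1700000000000,
}

producer.produce("bookings", value=serialize(booking))  # assumed topic name
producer.flush()
```

Because the DatumWriter validates while encoding, a record that does not match the schema never reaches producer.produce() at all.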

Consume messages from Apache Kafka Topic with Apache Avro Schema

When you have confirmed the message was produced, you can consume it again. Here we parse the Apache Avro schema from our local file once more. The magic happens when you consume the messages from Apache Kafka, not before: it is only the consumer application code that validates the messages against the chosen Apache Avro schema.
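A matching consumer sketch, under the same assumptions (confluent-kafka, the avro package, a broker on localhost:9092, the bookings topic, and a made-up consumer group id):

```python
import io

import avro.io
import avro.schema
from confluent_kafka import Consumer

# Parse the same local schema file on the consumer side.
with open("booking.avsc") as f:
    schema = avro.schema.parse(f.read())

reader = avro.io.DatumReader(schema)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "booking-consumer",         # assumed consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["bookings"])  # assumed topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # Decoding validates the payload against the schema; a payload that
        # does not match raises an exception here, in the consumer code.
        record = reader.read(avro.io.BinaryDecoder(io.BytesIO(msg.value())))
        print(record)
finally:
    consumer.close()
```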

You should validate your Apache Kafka messages again in your consumer because you can version your Apache Avro schema, which makes the setup dynamic: you can be sure the schema version you use still supports the messages you receive. It is a good way to double-check your data.
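As a sketch of what that versioning can look like (the channel field and its default are hypothetical, not from the original article), Avro's schema resolution lets a consumer holding a newer schema still read older messages, as long as every new field declares a default:

```json
{
  "type": "record",
  "name": "Booking",
  "namespace": "com.example.bookings",
  "fields": [
    {"name": "booking_id", "type": "string"},
    {"name": "customer_name", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "created_at", "type": "long"},
    {"name": "channel", "type": "string", "default": "web"}
  ]
}
```

With the Python avro package you would then build the reader as avro.io.DatumReader(writers_schema, readers_schema), and version 1 messages are resolved into the version 2 shape with channel filled in from its default.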

Feel free to play and share! :)

Thanks for reading my article! You can always find all the source code in my GitHub repo linked here, where there is a fully working Docker setup to spin up a complete Apache Kafka cluster. I will keep adding my Apache Kafka sample files there over time, so take a look, give it a star if you like it, and follow me on GitHub to receive a notice when I do something crazier! :)


Paris Nakita Kejser
DevOps Engineer, Software Architect, Software Developer, Data Scientist. I identify as a non-binary person.