Hello.
Hello everyone. Can you hear us?
>> Okay.
>> Nice to see you. I will start by introducing us: I'm Marco Santoni, and I'm here with Andrea Romeo. We work at TeamSystem, an Italian tech company that builds products for micro businesses, small and medium enterprises, accountants, and lawyers. We have around 200 products across Italy, France, Spain, and Turkey, and Andrea and I are part of the data platform team, where we build features directly for the end users of our products. Today we are going to tell you about something we have worked hard on over the last months, something that really made us work hard and suffer, so I'm glad to share it with you today.
This is a simplified view of the problem we had to face. We are talking about MailUp, one of our products, which is an email marketing platform. On the backend side we have this architecture: an app exposes APIs to end users, but for each customer, which is a corporate customer, we have a dedicated database. So we have a database-per-tenant model: the exact same schema on each database dedicated to each customer, plus a master database that holds reference data saying customer A maps to tenant database A, customer B to tenant database B, and so on. This is our source data, but what we need is to run analytics and aggregations across all the customers of MailUp. The idea is that on our data lake, for a table that has the same schema across the different databases, we have a single table with the records from all customers, so that you can run aggregations across all of them.
So today we are really focusing on how we, as data engineers, can do efficient ETL. The first approach that comes to mind can be something like this: a batch job that starts every day, say at midnight, runs a big for loop over, say, table one across all source databases, takes the updated or new records in that table, and then appends or updates the records in the target data lake. This is something we considered, and it has some limitations. First, when you have a high number of tables or a high number of customers, and we are talking about tens of thousands of customers, you can have some impact on the main OLTP database, even if you run it during the night. Then there is performance: it takes time to run this big for loop over and over. And then there is latency, because the data is updated at most once per day, plus some delay.
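To make that batch approach concrete, here is a minimal sketch of the nightly for loop, assuming a hypothetical master table dbo.Tenants that maps each tenant database to its server and a simple watermark column such as UpdatedAt; these names and the connection strings are illustrative, not the real MailUp schema:

```python
# Minimal sketch of the nightly batch ETL described above.
# Assumptions (illustrative, not the real MailUp schema):
#   - a master database with a dbo.Tenants(DatabaseName, ServerName) table
#   - each tenant table has an UpdatedAt column usable as a watermark
import datetime

import pyodbc  # pip install pyodbc

MASTER_CONN = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=master-host;DATABASE=master;UID=etl_user;PWD=secret"
)
WATERMARK = datetime.datetime.utcnow() - datetime.timedelta(days=1)


def tenant_databases():
    """Read the tenant -> server mapping from the master database."""
    with pyodbc.connect(MASTER_CONN) as conn:
        rows = conn.execute("SELECT DatabaseName, ServerName FROM dbo.Tenants").fetchall()
    return [(r.DatabaseName, r.ServerName) for r in rows]


def extract_changed_rows(server, database, table="dbo.Subscribers"):
    """Pull rows changed since the watermark from one tenant database."""
    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID=etl_user;PWD=secret"
    )
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.execute(f"SELECT * FROM {table} WHERE UpdatedAt >= ?", WATERMARK)
        return cursor.fetchall()


def main():
    # The "big for loop": one polling query per tenant database, every night.
    for database, server in tenant_databases():
        rows = extract_changed_rows(server, database)
        # In the real job these rows would be appended/merged into the data lake
        # table; here we just report the volume.
        print(f"{server}/{database}: {len(rows)} changed rows")


if __name__ == "__main__":
    main()
```

With tens of thousands of tenant databases, this loop is exactly where the OLTP impact, runtime, and once-a-day latency come from.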
So today we are really asking: how can we improve this approach? How can we do something that tackles these issues? To give you a flavor of what we started from, what tools do we have as data engineers? We know how change data capture works. You can use tools like Debezium to listen to the transaction log of a certain table and be notified whenever there is an event for any record in that table. Then you can decide what to do with this event: for example, you can write it to a Kafka topic, and then you can process all these events and write them to your data target. In theory this improves what I discussed before: you reduce the impact on the database because you don't have those repeated polling queries, it should be close to real time, and you have almost no latency. But this alone was not enough on our side. It's a bit more complex, because you don't have one table: you have thousands of tables with the same schema spread across different databases. Plus, whenever a new customer arrives, a new database is dedicated to it, so you have to handle that. And then there are DevOps or SRE operations that the operations team performs: sometimes they want to move a database from one server to another. So there is extra complexity. Today I will introduce the solution we designed to tackle these challenges. I will give you a really high-level overview, then Andrea will go into more detail.
The idea is as follows. First, we have a change data capture job that is notified of the events that occur on the master database, so we know whenever a database is moved or a new customer arrives and a new database is created. These events are processed through a GitOps flow via Argo CD, which is responsible for dynamically spawning or destroying the Debezium instances that collect data from the different tables. This is a really high-level overview; in the rest of this talk we will go a bit deeper into how it works and what the challenges are. So I will leave the stage to Andrea. Let's start from the bottom part of the architecture.
>> Thank you, Marco. I will start by introducing the two main tools that are the basis for this presentation. The first one, change data capture, I'm pretty sure almost all of you already know, because it's fundamental for near-real-time analysis in data engineering. Still, I want to introduce it because we need to keep two key points in mind. So what does change data capture actually do? I will focus on SQL Server, even though this is a feature you can find in almost every RDBMS.
In SQL Server you always have a technical structure, the transaction log, which contains all the information about the updates, inserts, and deletes happening on your server. Change data capture is a feature that you activate with a stored procedure, database by database, and when you activate it you are essentially activating two jobs. The first is the capture job, which is always up and running: it reads from the transaction log, tracks all the changes to your source tables, and stores this information in technical tables, the change data capture tables. The second job is the cleanup process, a scheduled technical job that periodically removes old entries from the change data capture tables, because you want to avoid filling up the storage of the SQL Server. So keep these two points in mind: one job is always up and running, and one job is scheduled.
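As a reference, here is a minimal sketch of how change data capture could be enabled from Python for one tenant database and one of its tables, using the standard SQL Server stored procedures; the connection string, server name, and the dbo.Subscribers table are placeholders for illustration:

```python
# Minimal sketch: enabling SQL Server change data capture on one tenant database.
# The connection string, server, and table names are illustrative placeholders.
import pyodbc  # pip install pyodbc


def enable_cdc(server: str, database: str, schema: str = "dbo", table: str = "Subscribers"):
    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID=etl_user;PWD=secret"
    )
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        # Enable CDC at the database level.
        conn.execute("EXEC sys.sp_cdc_enable_db")
        # Enable CDC for the specific source table; SQL Server creates the change
        # table and the capture/cleanup jobs as needed.
        conn.execute(
            "EXEC sys.sp_cdc_enable_table "
            "@source_schema = ?, @source_name = ?, @role_name = NULL",
            schema, table,
        )


if __name__ == "__main__":
    enable_cdc("tenant-sql-01.example.internal", "TenantA")
```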
The second tool that I want to introduce is Debezium, and Debezium is really powerful when you use it together with change data capture; indeed, with SQL Server you need change data capture active on every database whose changes you want to track. How does Debezium work? The first step is to take a snapshot of the source tables, and then it starts to track all the changes to these tables by reading from the change data capture tables every update, insert, or delete operation. Essentially, for each change it builds a message and sends it to some Kafka interface. In our case we use an Event Hub, which is more or less the Kafka equivalent in the Azure world, but the main point is that with Debezium you can extract and build the message as you want and send it to some Kafka-compatible endpoint.
Here I put an example of the configuration file, because you have to define one Debezium instance per server, each with its own configuration file, and you can run these instances on a Kubernetes cluster. In this configuration file you have several technical fields, including fields about how you want to extract and build the messages, but I want to draw your attention to the database.names field: here you need to list all the databases inside the server that you want to track, and you can't use a regex pattern. So if you want to start tracking a new database with Debezium, you need to add that database to this line.
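For context, a Debezium SQL Server connector configuration typically looks something like the sketch below, expressed here as a Python dict registered through the Kafka Connect REST API; in the setup described in this talk the same properties live in configuration files applied by Argo CD. Host names, credentials, topics, and the connector name are placeholders, and the property names follow Debezium 2.x conventions, so they may differ in other versions:

```python
# Hedged sketch of a Debezium SQL Server connector configuration (one per server).
# Host names, credentials, and topics are placeholders.
import json
import urllib.request

connector = {
    "name": "mailup-sql-01",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "tenant-sql-01.example.internal",
        "database.port": "1433",
        "database.user": "debezium",
        "database.password": "secret",
        # The field discussed above: an explicit, comma-separated list of
        # databases to capture -- no regex patterns allowed.
        "database.names": "TenantA,TenantB,TenantC",
        "table.include.list": "dbo.Subscribers",
        "topic.prefix": "mailup.sql01",
        "schema.history.internal.kafka.bootstrap.servers": "eventhub-namespace.servicebus.windows.net:9093",
        "schema.history.internal.kafka.topic": "schema-history.mailup.sql01",
    },
}

# Register the connector on a Kafka Connect worker (URL is a placeholder).
req = urllib.request.Request(
    "http://connect.example.internal:8083/connectors",
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)
```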
Here I put an example of the JSON message built by Debezium. You can see an example of an update operation: a row inside one database was updated, one field changed, and now it has another value. So inside the JSON you have a before node and then an after node. The exact content depends on the operation performed on the database, but it's very similar for the insertion or deletion of rows.
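A minimal consumer for these change events might look like the sketch below, assuming the events land on a Kafka-compatible endpoint (Event Hubs speaks the Kafka protocol) and using the standard Debezium envelope fields op, before, and after; the topic name and bootstrap server are placeholders:

```python
# Minimal sketch of consuming Debezium change events from a Kafka-compatible topic.
# Topic name and bootstrap servers are placeholders; a real Event Hubs endpoint
# also needs SASL_SSL settings, omitted here.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "mailup.sql01.TenantA.dbo.Subscribers",
    bootstrap_servers="eventhub-namespace.servicebus.windows.net:9093",
    value_deserializer=lambda raw: json.loads(raw) if raw else None,
    auto_offset_reset="earliest",
)

for record in consumer:
    event = record.value
    if event is None:  # tombstone message
        continue
    payload = event.get("payload", event)  # envelope may or may not carry a schema wrapper
    op = payload["op"]               # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    before = payload.get("before")   # row state before the change (None for inserts)
    after = payload.get("after")     # row state after the change (None for deletes)
    print(op, before, after)
```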
So let me move on to the core of this presentation. Just to remind you, our infrastructure is roughly 15 servers, each holding more or less 1,000 databases, and each database represents one customer. Each database always has the same structure, so you always have the same source tables, for example dbo.Subscribers. What we want to do is bring all this data into a single table, because it's much easier for our data engineers to run their analysis on one unified table and then process it with Databricks workloads.
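On the Databricks side, one common way to land such a change stream into a single Delta table is a Structured Streaming job with a foreachBatch merge, as in the hedged sketch below; the topic, table, and column names (including a SubscriberId key and a TenantId column to keep rows from different customers apart) are assumptions for illustration, not the actual MailUp pipeline:

```python
# Hedged sketch: merging Debezium change events into one Delta table on Databricks.
# Topic, table, and column names (SubscriberId, TenantId, ...) are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Simplified schema for the Debezium "after" image of dbo.Subscribers.
after_schema = StructType([
    StructField("SubscriberId", StringType()),
    StructField("Email", StringType()),
    StructField("TenantId", StringType()),
])
envelope = StructType([
    StructField("op", StringType()),
    StructField("after", after_schema),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "eventhub-namespace.servicebus.windows.net:9093")
    .option("subscribe", "mailup.sql01.TenantA.dbo.Subscribers")
    .load()
    .select(from_json(col("value").cast("string"), envelope).alias("e"))
    .select("e.op", "e.after.*")
)


def upsert_batch(batch_df, batch_id):
    """Merge each micro-batch into the unified Delta table (upserts only; deletes
    would need the Debezium 'before' image and are left out of this sketch)."""
    target = DeltaTable.forName(spark, "lake.subscribers")
    (
        target.alias("t")
        .merge(
            batch_df.filter("op != 'd'").alias("s"),
            "t.SubscriberId = s.SubscriberId AND t.TenantId = s.TenantId",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


(
    events.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/subscribers")
    .start()
)
```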
What we decided to do is to create one dedicated topic for each table. So we have the source tables on one side and the Kafka topics as the target on the other. And what is in the middle? In the middle there is the Debezium setup: we have one Kubernetes cluster and we set up one Debezium instance for each server, because remember that you need one instance per server, and therefore one configuration file per instance. We stored all these configuration files inside one GitLab repository, which is linked to an Argo CD instance. If you don't know it, Argo CD is essentially a continuous delivery tool for Kubernetes applications: when new changes land in your GitLab repository, Argo CD automatically picks them up and applies them to your Kubernetes cluster. So everything is done more or less automatically, and you have one Debezium instance and one configuration file per server. Just remember that in each configuration file there is the field where you need to list all the database names you want to track.
As Marco showed before, the infrastructure can look quite easy, but the set of databases inside each server changes quite often, because each database is one customer. Depending on the business, in a single day you can gain or lose many customers, and every new customer is one more database added to the list you need to track. Moreover, sometimes the servers have to rebalance the number of databases they host, so even without a new customer it can happen that one database changes location from one server to another. It's quite complicated, so you need to understand how to manage these database relocations.
But we have one special server, the master server, and inside it there is one database with one table that holds all the information about the location of each database: imagine one column with the database name and one column with the server. So when a customer is added, a new row is inserted into this table, and when a database is moved, that is an update of a row in the same table.
What comes to mind, and it's quite straightforward, is to configure another Debezium instance for the master server, configured exactly the same way as the standard flow. We activated change data capture on the master database, so when an update operation is captured, Debezium extracts and builds a new message and sends it to a dedicated Kafka topic. But at the end of this flow, instead of a Databricks workload, we have a FastAPI application.
It's quite interesting to see what this application actually does, because it's the core of the management of this whole architecture. For example, assume we are adding a new customer, so we receive the event for a new row that points to a new database. The first thing to do is to activate change data capture on the right database: remember that the event contains all the information we need, the database name and the SQL server, so we can find the right server and activate change data capture there. The second step is to pull the GitLab repository where all the configuration files, one per Debezium instance, are stored. We find the right configuration file for that server, because there is one configuration file per server, and we update its database.names list by adding the new database. Then we commit and push these changes, and Argo CD automatically picks them up and applies them to the Kubernetes cluster. So in near real time the tracking configuration is updated to match reality. Here we have an example of the commit that the FastAPI application made: we added one database to the database.names list.
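A minimal sketch of such a service is shown below: a FastAPI endpoint that receives the master-database change event, enables CDC on the new tenant database, and updates the right Debezium configuration file in the GitLab repository. The event shape, file layout, repository URL, and helper names are assumptions for illustration, not the actual MailUp implementation, and the HTTP endpoint stands in for the Kafka consumer loop described in the talk:

```python
# Hedged sketch of the orchestrator service: it reacts to a master-database change
# event by enabling CDC on the new tenant database and adding it to the Debezium
# configuration tracked in Git. Event shape, paths, and URLs are illustrative.
import json
import subprocess
import tempfile
from pathlib import Path

import pyodbc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
REPO_URL = "https://gitlab.example.internal/data-platform/debezium-config.git"


class TenantEvent(BaseModel):
    database_name: str   # e.g. "TenantXYZ"
    server_name: str     # e.g. "tenant-sql-01"


def enable_cdc(server: str, database: str) -> None:
    """Enable change data capture on the new tenant database."""
    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID=etl_user;PWD=secret"
    )
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        conn.execute("EXEC sys.sp_cdc_enable_db")


def add_database_to_config(server: str, database: str) -> None:
    """Append the database to database.names in that server's connector config."""
    repo_dir = Path(tempfile.mkdtemp())
    subprocess.run(["git", "clone", "--depth", "1", REPO_URL, str(repo_dir)], check=True)
    config_path = repo_dir / f"{server}.json"          # one config file per server
    config = json.loads(config_path.read_text())
    names = config["config"]["database.names"].split(",")
    if database not in names:
        names.append(database)
        config["config"]["database.names"] = ",".join(names)
        config_path.write_text(json.dumps(config, indent=2))
        subprocess.run(["git", "-C", str(repo_dir), "commit", "-am",
                        f"Track {database} on {server}"], check=True)
        subprocess.run(["git", "-C", str(repo_dir), "push"], check=True)
        # From here Argo CD notices the new commit and reconciles the cluster.


@app.post("/tenant-created")
def tenant_created(event: TenantEvent):
    enable_cdc(event.server_name, event.database_name)
    add_database_to_config(event.server_name, event.database_name)
    return {"status": "tracking", "database": event.database_name}
```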
So everything went well. We built all of this in the test environment, it worked, and everyone was really happy, because the infrastructure is also pretty interesting to present. The problem was that the test environment was quite different from the production environment, with really different volumes: in the test environment we have only two servers, the master and one standard server, with just 300 databases each. So we ran into deadlocks when we tried to promote this to production. Just to recall the numbers: in production we have more or less 10,000 databases on a server, and if you activate change data capture on all of them you have 10,000 capture jobs continuously running. To keep 10,000 jobs continuously running you need roughly 32 processors, and that's really expensive.
So we paused this development; we needed to find a solution. The problem is the always-running capture jobs: the default, and recommended, configuration for SQL Server is to have these jobs continuously up and running. You can change this configuration and schedule the capture jobs instead, but then you need to pay attention to two main problems. The first is that if the capture job is not fast enough, you can lose data, because there is also the cleanup job: it can happen that the cleanup job removes changes you have not yet read, and then you have a problem. The second is that you have to schedule the jobs very close together, while still avoiding concurrent, overlapping runs. And since databases are added at random times during the day, it's quite complex to manage all of these problems.
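For reference, switching a capture job from continuous to scheduled mode is done with the sys.sp_cdc_change_job stored procedure; the sketch below shows what such a call could look like from Python, with the parameter values chosen purely for illustration:

```python
# Hedged sketch: reconfiguring the CDC capture job from continuous to scheduled mode.
# Parameter values are illustrative; changing them needs careful DBA review, which is
# exactly the skill gap discussed below.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tenant-sql-01.example.internal;DATABASE=TenantA;UID=dba_user;PWD=secret"
)
with pyodbc.connect(conn_str, autocommit=True) as conn:
    # @continuous = 0 makes the capture job run once per scheduled invocation
    # instead of staying up and running all the time.
    conn.execute(
        "EXEC sys.sp_cdc_change_job @job_type = N'capture', "
        "@continuous = 0, @maxtrans = 500, @maxscans = 10"
    )
    # The retention of the cleanup job (in minutes) bounds how long changes stay
    # in the change tables before they risk being cleaned up unread.
    conn.execute(
        "EXEC sys.sp_cdc_change_job @job_type = N'cleanup', @retention = 4320"
    )
```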
I think this solution would be quite interesting to pursue, but the problem is that we didn't have the right skills inside the team. If you change those configurations on the SQL Server, you need to be 100% sure of what you are touching, because these configurations are really sensitive. We lacked the DBA skills, our background is more in data engineering, so we decided to stop this development because it's quite complicated.
But we learned a lot of things, many of them technical, about Debezium and change data capture, and we also took away three key lessons from this project. They are at a higher level, because when you are designing architectures you need to pay attention to these three points. They may seem obvious, but you need to keep them in mind. The first one is the most technical: having data in near real time is awesome, but you need to be aware that when you turn on near-real-time features, in most cases you increase the demand for resources, because the background processes grow. So pay close attention to this.
The second key point, whenever you are designing an architecture, is that if your test environment has very different volumes from the production environment, it can be hard to validate the design. You need to find another way to test the infrastructure you have designed, because that validation is really fundamental.
And the third key point is that you need to fully understand the skills you have inside your team, because the people on your team are the ones who will maintain the infrastructure. So when you design an architecture, pay attention to which skills you have. And that's the end of the presentation. Feel free to ask any questions: we are Marco Santoni from TeamSystem, and I am Andrea Romeo, platform engineer. I don't know if we have time for questions; if not, feel free to take our contacts and just text us.
>> Thank you.
>> Thank you.
Data Engineering Theatre, Wednesday, 24th Sep, 15:20 - 15:50

How to move data from thousands of SQL databases to a data lake with no impact on OLTP? We'll explore the challenges we faced while migrating legacy batch data flows to an event-based architecture. A key challenge for our data engineers was the multi-tenant architecture of our backend, meaning that we had to handle the same SQL schema on over 15k databases. We'll present the journey employing Debezium, Azure Event Hub, Delta Live Tables and the extra tooling we had to put in place.

Marco Santoni, Data Engineering Manager, TeamSystem
Marco Santoni is a data engineering leader currently serving as Data Engineering Manager at TeamSystem, a leading tech company that provides products for businesses and professionals to manage finance, operations, and compliance. Before joining TeamSystem, Marco led the Data Platform team at Flowe, a digital banking startup, where he focused on building modern data infrastructure and fostering a data-driven culture. Beyond his professional roles, Marco is an active contributor to the Italian Python community. He co-hosts the podcast *Intervista Pythonista*, which features bi-weekly interviews with developers and researchers in the Python ecosystem. He also helps organize local meetups and conferences, including PyCon Italia and Py4AI, promoting knowledge sharing and community growth. Marco holds a degree from ETH Zurich and maintains a personal blog where he shares insights on data engineering, productivity, and the evolving tech landscape.

Andrea Romeo, Platform Engineer, TeamSystem
I'm a Platform Engineer at TeamSystem, where I focus on building and evolving the technical foundation that powers our data infrastructure. With a strong passion for backend and cloud technologies, I work closely with development and data teams to design scalable, resilient platforms that support data-driven products and services. My role is deeply rooted in the tech side of the business: I thrive on solving complex engineering challenges, automating workflows, and enabling teams to deliver value faster and more reliably. In particular, I specialize in architecting and maintaining data platforms that ensure efficient data processing, storage, and access across the organization. I'm continuously exploring new tools and best practices in DevOps, cloud-native development, and infrastructure as code to improve platform performance and developer experience. I value collaboration, clean architecture, and a mindset of continuous improvement.