Hello.
Hello everyone. Can you hear us?
>> Okay.
>> Nice to see you. I will start by introducing us: I'm Marco Santoni, and I'm here with Andrea Romeo. We work at TeamSystem, an Italian tech company that builds products for micro businesses, small and medium enterprises, accountants, and lawyers. We have around 200 products across Italy, France, Spain, and Turkey, and Andrea and I are part of the data platform team, where we build features directly for the end users of our products. Today we are going to tell you about something we have worked hard on over the last months, something that really made us work hard and suffer, so I'm glad to share it with you today.
This is a simplified view of the problem we had to face. We are talking about MailUp, one of our products, which is an email marketing platform. On the backend side we have this architecture: an app exposes APIs to end users, but for each customer, which is a corporate customer, we have a dedicated database. So we have a database-per-tenant model: the exact same schema on each database dedicated to each customer, plus a master database that holds reference data saying customer A maps to tenant database A, customer B to tenant database B, and so on. This is our source data, but what we need is to run analytics and aggregations across all the customers of MailUp. The idea is that on our data lake, for a table that has the same schema across the different databases, we have a single table with the records from all customers, so that you can run aggregations across all of them.
So today we are really focusing on how we, as data engineers, can do efficient ETL. The first approach that comes to mind can be something like this: a batch job that starts every day, say at midnight, runs a big for loop over, say, table one across all source databases, takes the updated or new records in that table, and then appends or updates the records in the target data lake. This is something we considered, and it has some limitations. First, when you have a high number of tables or a high number of customers, and we are talking about tens of thousands of customers, you can have some impact on the main OLTP database, even if you run it during the night. Then there is performance: it takes time to run this big for loop over and over. And then there is latency, because the data is updated at most once per day, plus some delay.
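To make that batch approach concrete, here is a minimal sketch of the nightly for loop, assuming a hypothetical master table dbo.Tenants that maps each tenant database to its server and a simple watermark column such as UpdatedAt; these names and the connection strings are illustrative, not the real MailUp schema:

```python
# Minimal sketch of the nightly batch ETL described above.
# Assumptions (illustrative, not the real MailUp schema):
#   - a master database with a dbo.Tenants(DatabaseName, ServerName) table
#   - each tenant table has an UpdatedAt column usable as a watermark
import datetime

import pyodbc  # pip install pyodbc

MASTER_CONN = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=master-host;DATABASE=master;UID=etl_user;PWD=secret"
)
WATERMARK = datetime.datetime.utcnow() - datetime.timedelta(days=1)


def tenant_databases():
    """Read the tenant -> server mapping from the master database."""
    with pyodbc.connect(MASTER_CONN) as conn:
        rows = conn.execute("SELECT DatabaseName, ServerName FROM dbo.Tenants").fetchall()
    return [(r.DatabaseName, r.ServerName) for r in rows]


def extract_changed_rows(server, database, table="dbo.Subscribers"):
    """Pull rows changed since the watermark from one tenant database."""
    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID=etl_user;PWD=secret"
    )
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.execute(f"SELECT * FROM {table} WHERE UpdatedAt >= ?", WATERMARK)
        return cursor.fetchall()


def main():
    # The "big for loop": one polling query per tenant database, every night.
    for database, server in tenant_databases():
        rows = extract_changed_rows(server, database)
        # In the real job these rows would be appended/merged into the data lake
        # table; here we just report the volume.
        print(f"{server}/{database}: {len(rows)} changed rows")


if __name__ == "__main__":
    main()
```

With tens of thousands of tenant databases, this loop is exactly where the OLTP impact, runtime, and once-a-day latency come from.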
So today we are really asking: how can we improve this approach? How can we do something that tackles these issues? To give you a flavor of what we started from, what tools do we have as data engineers? We know how change data capture works. You can use tools like Debezium to listen to the transaction log of a certain table and be notified whenever there is an event for any record in that table. Then you can decide what to do with this event: for example, you can write it to a Kafka topic, and then you can process all these events and write them to your data target. In theory this improves what I discussed before: you reduce the impact on the database because you don't have those repeated polling queries, it should be close to real time, and you have almost no latency. But this alone was not enough on our side. It's a bit more complex, because you don't have one table: you have thousands of tables with the same schema spread across different databases. Plus, whenever a new customer arrives, a new database is dedicated to it, so you have to handle that. And then there are DevOps or SRE operations that the operations team performs: sometimes they want to move a database from one server to another. So there is extra complexity. Today I will introduce the solution we designed to tackle these challenges. I will give you a really high-level overview, then Andrea will go into more detail.
The idea is as follows. First, we have a change data capture job that is notified of the events that occur on the master database, so we know whenever a database is moved or a new customer arrives and a new database is created. These events are processed through a GitOps flow via Argo CD, which is responsible for dynamically spawning or destroying the Debezium instances that collect data from the different tables. This is a really high-level overview; in the rest of this talk we will go a bit deeper into how it works and what the challenges are. So I will leave the stage to Andrea. Let's start from the bottom part of the architecture.
>> Thank you, Marco. I will start by introducing the two main tools that are the basis for this presentation. The first one, change data capture, I'm pretty sure almost all of you already know, because it's fundamental for near-real-time analysis in data engineering. Still, I want to introduce it because we need to keep two key points in mind. So what does change data capture actually do? I will focus on SQL Server, even though this is a feature you can find in almost every RDBMS.
In SQL Server you always have a technical structure, the transaction log, which contains all the information about the updates, inserts, and deletes happening on your server. Change data capture is a feature that you activate with a stored procedure, database by database, and when you activate it you are essentially activating two jobs. The first is the capture job, which is always up and running: it reads from the transaction log, tracks all the changes to your source tables, and stores this information in technical tables, the change data capture tables. The second job is the cleanup process, a scheduled technical job that periodically removes old entries from the change data capture tables, because you want to avoid filling up the storage of the SQL Server. So keep these two points in mind: one job is always up and running, and one job is scheduled.
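As a reference, here is a minimal sketch of how change data capture could be enabled from Python for one tenant database and one of its tables, using the standard SQL Server stored procedures; the connection string, server name, and the dbo.Subscribers table are placeholders for illustration:

```python
# Minimal sketch: enabling SQL Server change data capture on one tenant database.
# The connection string, server, and table names are illustrative placeholders.
import pyodbc  # pip install pyodbc


def enable_cdc(server: str, database: str, schema: str = "dbo", table: str = "Subscribers"):
    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID=etl_user;PWD=secret"
    )
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        # Enable CDC at the database level.
        conn.execute("EXEC sys.sp_cdc_enable_db")
        # Enable CDC for the specific source table; SQL Server creates the change
        # table and the capture/cleanup jobs as needed.
        conn.execute(
            "EXEC sys.sp_cdc_enable_table "
            "@source_schema = ?, @source_name = ?, @role_name = NULL",
            schema, table,
        )


if __name__ == "__main__":
    enable_cdc("tenant-sql-01.example.internal", "TenantA")
```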
The second tool that I want to introduce is Debezium, and Debezium is really powerful when you use it together with change data capture; indeed, with SQL Server you need change data capture active on every database whose changes you want to track. How does Debezium work? The first step is to take a snapshot of the source tables, and then it starts to track all the changes to these tables by reading from the change data capture tables every update, insert, or delete operation. Essentially, for each change it builds a message and sends it to some Kafka interface. In our case we use an Event Hub, which is more or less the Kafka equivalent in the Azure world, but the main point is that with Debezium you can extract and build the message as you want and send it to some Kafka-compatible endpoint.
Here I put an example of the configuration file, because you have to define one Debezium instance per server, each with its own configuration file, and you can run these instances on a Kubernetes cluster. In this configuration file you have several technical fields, including fields about how you want to extract and build the messages, but I want to draw your attention to the database.names field: here you need to list all the databases inside the server that you want to track, and you can't use a regex pattern. So if you want to start tracking a new database with Debezium, you need to add that database to this line.
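For context, a Debezium SQL Server connector configuration typically looks something like the sketch below, expressed here as a Python dict registered through the Kafka Connect REST API; in the setup described in this talk the same properties live in configuration files applied by Argo CD. Host names, credentials, topics, and the connector name are placeholders, and the property names follow Debezium 2.x conventions, so they may differ in other versions:

```python
# Hedged sketch of a Debezium SQL Server connector configuration (one per server).
# Host names, credentials, and topics are placeholders.
import json
import urllib.request

connector = {
    "name": "mailup-sql-01",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "tenant-sql-01.example.internal",
        "database.port": "1433",
        "database.user": "debezium",
        "database.password": "secret",
        # The field discussed above: an explicit, comma-separated list of
        # databases to capture -- no regex patterns allowed.
        "database.names": "TenantA,TenantB,TenantC",
        "table.include.list": "dbo.Subscribers",
        "topic.prefix": "mailup.sql01",
        "schema.history.internal.kafka.bootstrap.servers": "eventhub-namespace.servicebus.windows.net:9093",
        "schema.history.internal.kafka.topic": "schema-history.mailup.sql01",
    },
}

# Register the connector on a Kafka Connect worker (URL is a placeholder).
req = urllib.request.Request(
    "http://connect.example.internal:8083/connectors",
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)
```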
Here I put an example of the JSON message built by Debezium. You can see an example of an update operation: a row inside one database was updated, one field changed, and now it has another value. So inside the JSON you have a before node and then an after node. The exact content depends on the operation performed on the database, but it's very similar for the insertion or deletion of rows.
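A minimal consumer for these change events might look like the sketch below, assuming the events land on a Kafka-compatible endpoint (Event Hubs speaks the Kafka protocol) and using the standard Debezium envelope fields op, before, and after; the topic name and bootstrap server are placeholders:

```python
# Minimal sketch of consuming Debezium change events from a Kafka-compatible topic.
# Topic name and bootstrap servers are placeholders; a real Event Hubs endpoint
# also needs SASL_SSL settings, omitted here.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "mailup.sql01.TenantA.dbo.Subscribers",
    bootstrap_servers="eventhub-namespace.servicebus.windows.net:9093",
    value_deserializer=lambda raw: json.loads(raw) if raw else None,
    auto_offset_reset="earliest",
)

for record in consumer:
    event = record.value
    if event is None:  # tombstone message
        continue
    payload = event.get("payload", event)  # envelope may or may not carry a schema wrapper
    op = payload["op"]               # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    before = payload.get("before")   # row state before the change (None for inserts)
    after = payload.get("after")     # row state after the change (None for deletes)
    print(op, before, after)
```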
So let me move on to the core of this presentation. Just to remind you, our infrastructure is roughly 15 servers, each holding more or less 1,000 databases, and each database represents one customer. Each database always has the same structure, so you always have the same source tables, for example dbo.Subscribers. What we want to do is bring all this data into a single table, because it's much easier for our data engineers to run their analysis on one unified table and then process it with Databricks workloads.
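On the Databricks side, one common way to land such a change stream into a single Delta table is a Structured Streaming job with a foreachBatch merge, as in the hedged sketch below; the topic, table, and column names (including a SubscriberId key and a TenantId column to keep rows from different customers apart) are assumptions for illustration, not the actual MailUp pipeline:

```python
# Hedged sketch: merging Debezium change events into one Delta table on Databricks.
# Topic, table, and column names (SubscriberId, TenantId, ...) are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Simplified schema for the Debezium "after" image of dbo.Subscribers.
after_schema = StructType([
    StructField("SubscriberId", StringType()),
    StructField("Email", StringType()),
    StructField("TenantId", StringType()),
])
envelope = StructType([
    StructField("op", StringType()),
    StructField("after", after_schema),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "eventhub-namespace.servicebus.windows.net:9093")
    .option("subscribe", "mailup.sql01.TenantA.dbo.Subscribers")
    .load()
    .select(from_json(col("value").cast("string"), envelope).alias("e"))
    .select("e.op", "e.after.*")
)


def upsert_batch(batch_df, batch_id):
    """Merge each micro-batch into the unified Delta table (upserts only; deletes
    would need the Debezium 'before' image and are left out of this sketch)."""
    target = DeltaTable.forName(spark, "lake.subscribers")
    (
        target.alias("t")
        .merge(
            batch_df.filter("op != 'd'").alias("s"),
            "t.SubscriberId = s.SubscriberId AND t.TenantId = s.TenantId",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


(
    events.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/subscribers")
    .start()
)
```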
What we decided to do is to create one dedicated topic for each table. So we have the source tables on one side and the Kafka topics as the target on the other. And what is in the middle? In the middle there is the Debezium setup: we have one Kubernetes cluster and we set up one Debezium instance for each server, because remember that you need one instance per server, and therefore one configuration file per instance. We stored all these configuration files inside one GitLab repository, which is linked to an Argo CD instance. If you don't know it, Argo CD is essentially a continuous delivery tool for Kubernetes applications: when new changes land in your GitLab repository, Argo CD automatically picks them up and applies them to your Kubernetes cluster. So everything is done more or less automatically, and you have one Debezium instance and one configuration file per server. Just remember that in each configuration file there is the field where you need to list all the database names you want to track.
As Marco showed before, the infrastructure can look quite easy, but the set of databases inside each server changes quite often, because each database is one customer. Depending on the business, in a single day you can gain or lose many customers, and every new customer is one more database added to the list you need to track. Moreover, sometimes the servers have to rebalance the number of databases they host, so even without a new customer it can happen that one database changes location from one server to another. It's quite complicated, so you need to understand how to manage these database relocations.
But we have one special server, the master server, and inside it there is one database with one table that holds all the information about the location of each database: imagine one column with the database name and one column with the server. So when a customer is added, a new row is inserted into this table, and when a database is moved, that is an update of a row in the same table.
What comes to mind, and it's quite straightforward, is to configure another Debezium instance for the master server, configured exactly the same way as the standard flow. We activated change data capture on the master database, so when an update operation is captured, Debezium extracts and builds a new message and sends it to a dedicated Kafka topic. But at the end of this flow, instead of a Databricks workload, we have a FastAPI application.
It's quite interesting to see what this application actually does, because it's the core of the management of this whole architecture. For example, assume we are adding a new customer, so we receive the event for a new row that points to a new database. The first thing to do is to activate change data capture on the right database: remember that the event contains all the information we need, the database name and the SQL server, so we can find the right server and activate change data capture there. The second step is to pull the GitLab repository where all the configuration files, one per Debezium instance, are stored. We find the right configuration file for that server, because there is one configuration file per server, and we update its database.names list by adding the new database. Then we commit and push these changes, and Argo CD automatically picks them up and applies them to the Kubernetes cluster. So in near real time the tracking configuration is updated to match reality. Here we have an example of the commit that the FastAPI application made: we added one database to the database.names list.
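A minimal sketch of such a service is shown below: a FastAPI endpoint that receives the master-database change event, enables CDC on the new tenant database, and updates the right Debezium configuration file in the GitLab repository. The event shape, file layout, repository URL, and helper names are assumptions for illustration, not the actual MailUp implementation, and the HTTP endpoint stands in for the Kafka consumer loop described in the talk:

```python
# Hedged sketch of the orchestrator service: it reacts to a master-database change
# event by enabling CDC on the new tenant database and adding it to the Debezium
# configuration tracked in Git. Event shape, paths, and URLs are illustrative.
import json
import subprocess
import tempfile
from pathlib import Path

import pyodbc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
REPO_URL = "https://gitlab.example.internal/data-platform/debezium-config.git"


class TenantEvent(BaseModel):
    database_name: str   # e.g. "TenantXYZ"
    server_name: str     # e.g. "tenant-sql-01"


def enable_cdc(server: str, database: str) -> None:
    """Enable change data capture on the new tenant database."""
    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID=etl_user;PWD=secret"
    )
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        conn.execute("EXEC sys.sp_cdc_enable_db")


def add_database_to_config(server: str, database: str) -> None:
    """Append the database to database.names in that server's connector config."""
    repo_dir = Path(tempfile.mkdtemp())
    subprocess.run(["git", "clone", "--depth", "1", REPO_URL, str(repo_dir)], check=True)
    config_path = repo_dir / f"{server}.json"          # one config file per server
    config = json.loads(config_path.read_text())
    names = config["config"]["database.names"].split(",")
    if database not in names:
        names.append(database)
        config["config"]["database.names"] = ",".join(names)
        config_path.write_text(json.dumps(config, indent=2))
        subprocess.run(["git", "-C", str(repo_dir), "commit", "-am",
                        f"Track {database} on {server}"], check=True)
        subprocess.run(["git", "-C", str(repo_dir), "push"], check=True)
        # From here Argo CD notices the new commit and reconciles the cluster.


@app.post("/tenant-created")
def tenant_created(event: TenantEvent):
    enable_cdc(event.server_name, event.database_name)
    add_database_to_config(event.server_name, event.database_name)
    return {"status": "tracking", "database": event.database_name}
```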
So everything went well. We built all of this in the test environment, it worked, and everyone was really happy, because the infrastructure is also pretty interesting to present. The problem was that the test environment was quite different from the production environment, with really different volumes: in the test environment we have only two servers, the master and one standard server, with just 300 databases each. So we ran into deadlocks when we tried to promote this to production. Just to recall the numbers: in production we have more or less 10,000 databases on a server, and if you activate change data capture on all of them you have 10,000 capture jobs continuously running. To keep 10,000 jobs continuously running you need roughly 32 processors, and that's really expensive.
So we paused this development; we needed to find a solution. The problem is the always-running capture jobs: the default, and recommended, configuration for SQL Server is to have these jobs continuously up and running. You can change this configuration and schedule the capture jobs instead, but then you need to pay attention to two main problems. The first is that if the capture job is not fast enough, you can lose data, because there is also the cleanup job: it can happen that the cleanup job removes changes you have not yet read, and then you have a problem. The second is that you have to schedule the jobs very close together, while still avoiding concurrent, overlapping runs. And since databases are added at random times during the day, it's quite complex to manage all of these problems.
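For reference, switching a capture job from continuous to scheduled mode is done with the sys.sp_cdc_change_job stored procedure; the sketch below shows what such a call could look like from Python, with the parameter values chosen purely for illustration:

```python
# Hedged sketch: reconfiguring the CDC capture job from continuous to scheduled mode.
# Parameter values are illustrative; changing them needs careful DBA review, which is
# exactly the skill gap discussed below.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tenant-sql-01.example.internal;DATABASE=TenantA;UID=dba_user;PWD=secret"
)
with pyodbc.connect(conn_str, autocommit=True) as conn:
    # @continuous = 0 makes the capture job run once per scheduled invocation
    # instead of staying up and running all the time.
    conn.execute(
        "EXEC sys.sp_cdc_change_job @job_type = N'capture', "
        "@continuous = 0, @maxtrans = 500, @maxscans = 10"
    )
    # The retention of the cleanup job (in minutes) bounds how long changes stay
    # in the change tables before they risk being cleaned up unread.
    conn.execute(
        "EXEC sys.sp_cdc_change_job @job_type = N'cleanup', @retention = 4320"
    )
```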
I think this solution would be quite interesting to pursue, but the problem is that we didn't have the right skills inside the team. If you change those configurations on the SQL Server, you need to be 100% sure of what you are touching, because these configurations are really sensitive. We lacked the DBA skills, our background is more in data engineering, so we decided to stop this development because it's quite complicated.
But we learned a lot of things, many of them technical, about Debezium and change data capture, and we also took away three key lessons from this project. They are at a higher level, because when you are designing architectures you need to pay attention to these three points. They may seem obvious, but you need to keep them in mind. The first one is the most technical: having data in near real time is awesome, but you need to be aware that when you turn on near-real-time features, in most cases you increase the demand for resources, because the background processes grow. So pay close attention to this.
The second key point, whenever you are designing an architecture, is that if your test environment has very different volumes from the production environment, it can be hard to validate the design. You need to find another way to test the infrastructure you have designed, because that validation is really fundamental.
And the third key point is that you need to fully understand the skills you have inside your team, because the people on your team are the ones who will maintain the infrastructure. So when you design an architecture, pay attention to which skills you have. And that's the end of the presentation. Feel free to ask any questions: we are Marco Santoni from TeamSystem, and I am Andrea Romeo, platform engineer. I don't know if we have time for questions; if not, feel free to take our contacts and just text us.
>> Thank you.
>> Thank you.
Data Engineering Theatre, Wednesday, 24th Sep, 15:20 - 15:50

How to move data from thousands of SQL databases to a data lake with no impact on OLTP? We'll explore the challenges we faced while migrating legacy batch data flows to an event-based architecture. A key challenge for our data engineers was the multi-tenant architecture of our backend, meaning that we had to handle the same SQL schema on over 15k databases. We'll present the journey employing Debezium, Azure Event Hub, Delta Live Tables and the extra tooling we had to put in place.

Marco Santoni, Data Engineering Manager, TeamSystem
Marco Santoni is a data engineering leader currently serving as Data Engineering Manager at TeamSystem, a leading tech company that provides products for businesses and professionals to manage finance, operations, and compliance. Before joining TeamSystem, Marco led the Data Platform team at Flowe, a digital banking startup, where he focused on building modern data infrastructure and fostering a data-driven culture. Beyond his professional roles, Marco is an active contributor to the Italian Python community. He co-hosts the podcast *Intervista Pythonista*, which features bi-weekly interviews with developers and researchers in the Python ecosystem. He also helps organize local meetups and conferences, including PyCon Italia and Py4AI, promoting knowledge sharing and community growth. Marco holds a degree from ETH Zurich and maintains a personal blog where he shares insights on data engineering, productivity, and the evolving tech landscape.

Andrea Romeo, Platform Engineer, TeamSystem
I'm a Platform Engineer at TeamSystem, where I focus on building and evolving the technical foundation that powers our data infrastructure. With a strong passion for backend and cloud technologies, I work closely with development and data teams to design scalable, resilient platforms that support data-driven products and services. My role is deeply rooted in the tech side of the business: I thrive on solving complex engineering challenges, automating workflows, and enabling teams to deliver value faster and more reliably. In particular, I specialize in architecting and maintaining data platforms that ensure efficient data processing, storage, and access across the organization. I'm continuously exploring new tools and best practices in DevOps, cloud-native development, and infrastructure as code to improve platform performance and developer experience. I value collaboration, clean architecture, and a mindset of continuous improvement.