General
DATA+AI World Tour Amsterdam
At Big Industries we are encouraged to visit conferences of interest. Here are my impressions of the Databricks Data + AI World Tour, which took place in November 2024 in Amsterdam.
The Big Databricks Show
Databricks has momentum. The Data + AI World Tour visits several regions on each continent and, remarkably, with 2700 visitors (and many more registrations) the Benelux event is the biggest of their 'regional' events in the world (not counting the annual flagship event in San Francisco).
The revolving door leading into the Johan Cruyff Arena was a bottleneck upon arrival, resulting in a line stretching at least half a kilometer. To keep the crowd from getting anxious, a mobile DJ stage provided musical entertainment as the hordes trickled into – literally – the Arena. 'Is this the Databricks event... or a Taylor Swift concert?' I heard someone say a couple of rows in front of me.
The (Benelux) Protagonists
The conference itself felt like spending a day visiting a shopping strip in Holland and realizing that every brand was embracing Databricks. The list of speakers was filled with representatives of the quintessential Dutch companies. Some dairy products for breakfast (Friesland Campina), big players in oil (Shell) and the chip industry (ASML). Stopping by a retailer (take your pick from Jumbo, Hema or Albert Heijn – whose logistic supply chain forecasting runs on Databricks).
The issue of increasing costs was raised by some of the speakers (Databricks only has cloud offerings), so there was a need to withdraw money from the ATM (‘even pinnen’) courtesy of presenters ABN AMRO, ING or Rabobank to pay for all those cloud costs. And to end the day, grab a Heineken (keynote speaker) while watching Ajax's soccer game (more on data and sports in the appendix). It made one wonder – how far along is the kroketjes-uit-de-muur https://www.the-low-countries.com/article/why-do-the-dutch-eat-snacks-from-the-wall/ company in its data & AI journey?
The Success Explained
Databricks' own keynote speaker pointed out that the biggest problem in data solutions is complexity. The talk underscored a problem everyone relates to: many data teams spend their entire days stitching together overly complex solutions with all kinds of tech nobody fully understands. A complexity nightmare of high costs and proprietary formats. The magic word behind the Databricks success is...
DEMOCRATIZATION
To democratize data & AI means (as their representatives explained) making it accessible to non-technical users. They live by this motto and are consistent in their vision of tackling the problems that give the somewhat-technical-but-not-expert user the shivers when facing more advanced challenges (as is the case with competing data lake solutions).
SIMPLICITY
With revenues pouring in, Databricks has been able to invest massively in their engineering departments to simplify complex elements that had previously been accepted with a shrug of 'that's the way big data and data lakes work' (examples in the next section).
Next they focused on simplifying (and limiting) the number of different tools needed. With a combo of Delta Lake, Spark, Unity Catalog and some derivatives of these, we’re actually close to listing all the tools the platform relies on. This is much easier to grasp than the 'biodiverse zoo' of components that resides in, for example, the Hadoop ecosystem.
It distinguishes them from competitors whose tools are tailored towards the elite data engineer with years of experience. It is incredibly smart that not-so-technical people (often the decision makers in a given enterprise) can fully understand their product.
Want a data warehouse? Wrap a new "jacket" around Spark, using performance enhancements like the Photon accelerator and the option to access a cluster serverless: requesting a couple of machines from a pool that is always waiting for Databricks customers on a ‘premium’-like subscription, cutting the start-up cost from minutes to seconds. Combining all of the above, they have turned Spark SQL from an ETL tool (where you didn't lose sleep over whether your batch finished in 12 minutes or 1 hour) into a powerful SQL warehouse that is fast enough for user-interactive queries. They were not shy about showing off how the Databricks data warehouse has generated the most revenue ever (in the history of data warehouses) since its (recent) conception and how it outperforms all others.
Next to having an excellent product, Databricks is very marketing savvy. The trainings, the website, the events. It is like walking on the playground and seeing the cool kids wearing shiny shoes... you will soon want a pair, too!
The Tech
AI is everywhere and it comes in 2 flavors.
AI Flavor 1: under-the-hood data engineering optimization.
They incorporate AI for "intelligence optimizations" that fix some of the most common errors in data engineering by automatically applying tweaks. The claim is that this leads to efficiency and cost-effectiveness and saves the developer's time (and hence costs). Just a few examples are:
COMMON MISTAKE → DATABRICKS’ SOLUTION
- Poorly chosen partitioning → Liquid clustering https://docs.databricks.com/en/delta/clustering.html: in Delta Lake, partitioning happens automatically; Delta Lake decides on the best clustering strategy for you, and you don’t even need to know the details.
- The Hadoop small file problem → Predictive optimization https://docs.databricks.com/en/optimizations/predictive-optimization.html: this automatically schedules maintenance such as compacting small files and removing redundant ones, fixing the infamous Hadoop small file problem https://medium.com/@rahul.singh.suny/small-file-problem-e2a5184678c7.
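To make this concrete, here is a minimal sketch of what both features look like in Databricks SQL. The table and schema names are made up for illustration:

```sql
-- Liquid clustering: no PARTITION BY, just declare clustering columns
-- and let Delta Lake lay out the data.
CREATE TABLE sales (
  order_id   BIGINT,
  order_date DATE,
  region     STRING
)
CLUSTER BY (order_date, region);

-- Predictive optimization: opt a whole schema in, and Databricks
-- schedules OPTIMIZE/VACUUM maintenance for you.
ALTER SCHEMA my_schema ENABLE PREDICTIVE OPTIMIZATION;
```

Note how little tuning knowledge is required: no partition column soul-searching, no cron jobs for compaction.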
Enter a world where you use Spark without needing to care about the details that used to distinguish the layman SQL user from the seasoned data engineer (executor memory, number of executors, dynamic allocation, etc.). The only thing you need to know is how to write the functional (e.g. SQL) logic in a notebook.
AI Flavor 2: gimmicks (often for the SQL-savvy user).
Some specific examples:
- An AI generated description of the fields in your schema https://docs.databricks.com/en/comments/ai-comments.html
- Instead of writing queries yourself, have them generated by AI. Databricks Assistant https://www.databricks.com/blog/introducing-databricks-assistant is an AI assistant trained specifically to generate Databricks content. It will suggest which fields to join, explain why a query is not working, or generate Python code to – let’s say – ingest data into a Kafka topic.
- AI/BI Genie: did it use to take months to add one field to a BI dashboard and promote it to production? Now you can generate a dashboard in plain English with AI assistance https://www.databricks.com/product/ai-bi/genie
- Several new AI-fueled features were added that can be called directly in SQL in the data warehouse (basically the built-in functions starting with ai_ found here https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-functions-builtin-alpha to do sentiment analysis, summarizing, classifying, fixing grammar errors or calling your favorite AI model). Or why not translation?
SELECT ai_translate('Hello, how are you?', 'es');
will now return:
"Hola, ¿cómo estás?"
The final 2 pillars of the Data Lakehouse (ETL/streaming) were mentioned in merely seconds during the keynote. The emphasis was on the more "common, simple" pipeline of growing data ending up in dashboards.
We were also reminded that just before summer '24 Databricks acquired Tabular – the company founded by the creators of Iceberg – to unify Delta Lake and Iceberg in 'Delta Lake UniForm' https://www.databricks.com/blog/databricks-tabular. Why would one need Iceberg if there is already Delta Lake? I believe the answer lies in the fact that Delta Lake was developed specifically for Spark, while Iceberg is used in conjunction with many other tools. Ruling out all the customers who are on those tools (and using Iceberg right now) would limit the number who might be tempted to migrate to Databricks.
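In practice, UniForm is switched on per table via table properties, so a Delta table also exposes Iceberg metadata for those other engines. A minimal sketch (catalog, schema and column names are made up):

```sql
-- UniForm: a Delta table that Iceberg-based clients can also read.
CREATE TABLE main.default.events (
  id BIGINT,
  ts TIMESTAMP
)
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```

The data files are written once; only the metadata is maintained in both formats.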
Finally, one could spoil the fun by zeroing in on the costs of running everything on this cloud-only platform, or by discussing the gaps (given this more limited set of tools) compared to other data platforms. But they had me spin a wheel of fortune with swag where I scored a nice pair of Databricks bathroom slippers, so I'm gonna let it go.
Extra-time: Tell me more about data and soccer! Now!
OK, OK. I will!
What a great choice to let this event take place in Ajax's soccer stadium! The keynote sessions were technically outdoors – inside the stadium. A bit chilly, but pleasant to get some fresh air with an impressive view overlooking a TED-style podium placed just in front of the dugouts, while seated in the best (leather) seats in the stadium. The data engineering course took place in the press room of Ajax – to get there one could glance into the visitors' locker rooms and roam through the catacombs just meters away from the steps that would have gotten you onto the pitch. A wall full of the team's sponsors – the backdrop for the first post-game reactions – was still in place.
Ajax – the host of this event – employs 7 data professionals who use data + AI. One use case for the tickets/merch team: targeting ticket sales to fill up empty spots in the stadium by offering reduced pricing to the fans seated around them (hoping they will invite some friends).
Another team focuses on sports analytics. To do that, they use data tracking the whereabouts of the soccer ball (every 2 seconds, for >100 competitions and 200-400 games per season), who possesses it, whether a pass succeeded, etc. Their stadium is equipped with cameras that generate over a billion pieces of optical tracking data about the team each year. All of this is fed to MLflow and tells them which players pick the right options most frequently, or who is an especially creative player who excels in a certain position.
Fun fact 1: the guy who is the brains behind using data as a road to success in baseball (portrayed in Moneyball https://www.imdb.com/title/tt1210166/) apparently has a gig right now as ‘Databricks evangelist’ and made a quick cameo on stage.
Fun fact 2: Bernardo Silva (Man. City) was given as an example of a player with not too impressive stats for goals or assists but still considered an excellent player. He outperforms the model that predicts whether a pass will be completed at any given game-state and was given as an example of someone who might be discovered by their techniques (and not by looking at the traditional stats). He often chooses passes that are unlikely to succeed yet still makes them happen!

Joris Billen
Joris comes from a computational/engineering background and is approaching 10 years of consultancy experience. His focus is on data engineering and he enjoys working at the intersection of the functional aspects of use cases and their implementation on innovative data platforms. He was part of the Big Industries team that laid the foundation for a data analytics platform on Azure for a major client in the auto industry. More recently he has been using Cloudera Hadoop in aviation and anti-fraud projects at EU Institutions and has been digging deeper into AWS. Outside of work he can be found biking, playing soccer and getting fresh air with his (expanding) family.