
Anyone have experience with data lakes

We've built a platform for our data lake that brings in over a dozen data sources. Our CMS event data lives in ClickHouse because it's so effective for large streams of event data, though that might be overkill for a single dealer. We pull what we need from it into the data lake. For the other APIs, we bring in the raw JSON and then parse it out into a MySQL database. The market data is refreshed each month. I've built much of this using Claude Code, with my engineering team ensuring it's secure, stable, and scalable. We'll be migrating it into our production environment soon.
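For anyone picturing the "raw JSON in, MySQL out" step, here is a minimal sketch of one way it could look. It is not the poster's actual pipeline: the payload shape, the table names (`raw_pulls`, `listings`), and the `load_listings` helper are all hypothetical, and Python's built-in `sqlite3` stands in for MySQL so the example runs without a server.

```python
import json
import sqlite3

# Hypothetical raw API payload; every field name here is an illustrative assumption.
RAW_PAYLOAD = json.dumps({
    "source": "listings_api",
    "pulled_at": "2026-06-01",
    "listings": [
        {"vin": "VIN0001", "price": 28995, "dealer": {"id": 7, "region": "north"}},
        {"vin": "VIN0002", "price": None, "dealer": {"id": 7, "region": "north"}},
    ],
})


def load_listings(conn, raw_json):
    """Store the raw JSON untouched, then flatten each listing into one row."""
    payload = json.loads(raw_json)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pulls (source TEXT, pulled_at TEXT, body TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "vin TEXT, pulled_at TEXT, price INTEGER, dealer_id INTEGER, region TEXT, "
        "PRIMARY KEY (vin, pulled_at))"
    )
    # Keeping the raw body means the parsed table can be rebuilt if parsing changes.
    conn.execute(
        "INSERT INTO raw_pulls VALUES (?, ?, ?)",
        (payload["source"], payload["pulled_at"], raw_json),
    )
    rows = [
        (
            item["vin"],
            payload["pulled_at"],
            item.get("price"),  # missing or null prices become NULL
            item["dealer"]["id"],
            item["dealer"]["region"],
        )
        for item in payload["listings"]
    ]
    # SQLite upsert syntax; on MySQL this would be REPLACE INTO or
    # INSERT ... ON DUPLICATE KEY UPDATE.
    conn.executemany("INSERT OR REPLACE INTO listings VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
loaded = load_listings(conn, RAW_PAYLOAD)
```

Keying the parsed table on VIN plus pull date is just one possible choice; it lets a monthly refresh be re-run without duplicating rows while still keeping each month's snapshot.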
 

Are you also bringing in DMS data, or is this mostly the metadata layer of everything your dealership is producing or working with in the market?
My particular focus is on digital visibility... this is purely dealer CMS, inventory pushed there, and then regional search engine results and listings data.

I will advise one critical thing: while a data lake provides all of your data elements, it does not provide the context that AI needs to fully explain them. I'd highly recommend building out a data schema that documents every data element: its definition, how it's updated, its benefits, its challenges, and so on. I worked for one company that literally built this into a table that was ALWAYS queried alongside every other query. That way the data came back with its definitions, and the AI was better able to contextualize its answers.
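A rough sketch of that "definitions table queried with every query" idea, under stated assumptions: the `data_dictionary` layout, the sample columns, and the `query_with_context` helper are hypothetical, and `sqlite3` again stands in for whatever database the lake actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_dictionary (
    table_name TEXT, column_name TEXT, definition TEXT, refresh TEXT, caveats TEXT,
    PRIMARY KEY (table_name, column_name)
);
CREATE TABLE listings (vin TEXT, price INTEGER, serp_rank INTEGER);
INSERT INTO data_dictionary VALUES
    ('listings', 'vin', 'Vehicle identification number', 'on inventory push', 'none'),
    ('listings', 'price', 'Advertised price in USD', 'on inventory push', 'may be null'),
    ('listings', 'serp_rank', 'Position in regional search results', 'monthly', 'sampled per region');
INSERT INTO listings VALUES ('VIN0001', 28995, 3);
""")


def query_with_context(conn, table, sql):
    """Run a query and return its rows together with the dictionary entries
    for exactly the columns that came back."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]  # column names of the result set
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    placeholders = ",".join("?" for _ in columns)
    defs = conn.execute(
        "SELECT column_name, definition, refresh, caveats FROM data_dictionary "
        f"WHERE table_name = ? AND column_name IN ({placeholders})",
        [table, *columns],
    ).fetchall()
    context = {
        name: {"definition": d, "refresh": r, "caveats": c} for name, d, r, c in defs
    }
    return {"rows": rows, "context": context}


result = query_with_context(conn, "listings", "SELECT vin, serp_rank FROM listings")
```

Passing `result` to the model as one payload means the numbers never arrive without their definitions, which is the point of the always-queried table described above.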
 

✨ AI Highlights

A dealer group operator asks for vendor recommendations and advice on building a data lake to power marketing dashboards and a CDP. Practitioners recommend Snowflake, ClickHouse, BigQuery, and AWS stacks, with one member sharing a detailed real-world architecture using CDK, DriveCentric, AWS Glue, and Segment. The thread's sharpest takeaway is that the hardest problems aren't the storage platform choice but rather building reliable data pipelines out of resistant DMS and CRM vendors, and then resolving fragmented customer identity across rooftops before any meaningful reporting or audience activation is possible.
