
Anyone have experience with data lakes

We've built a platform for our data lake that brings in over a dozen data sources. Our CMS event data lives in ClickHouse because it's so effective for large streams of event data, though that might be overkill for a single dealer. We pull what we need from it into the data lake. We bring in the raw JSON from other APIs and then parse it into a MySQL database. The market data is refreshed each month. I've done much of this using Claude Code, with my engineering team ensuring it's secure, stable, and scalable. We'll be migrating it into our production environment soon.
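To make the "land the raw JSON, then parse it into a relational table" step concrete, here is a minimal sketch. It is not the poster's actual pipeline: the payload shape, the table names (`raw_landing`, `market_listings`), and the `load_listings` helper are all hypothetical, and SQLite stands in for MySQL so the example runs without a server.

```python
import json
import sqlite3

# Hypothetical raw API payload, landed in the lake unchanged.
RAW_PAYLOAD = json.dumps({
    "fetched_at": "2026-06-01",
    "listings": [
        {"vin": "1HGCM82633A004352", "price": 18995, "dealer": "Example Motors"},
        {"vin": "2T1BURHE5JC034461", "price": 16450, "dealer": "Example Motors"},
    ],
})

def load_listings(conn, raw_json):
    """Keep the raw JSON, then parse it into a queryable table."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_landing (fetched_at TEXT, body TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS market_listings ("
        "vin TEXT, fetched_at TEXT, price INTEGER, dealer TEXT, "
        "PRIMARY KEY (vin, fetched_at))"
    )
    payload = json.loads(raw_json)
    # Store the untouched payload so it can be re-parsed if the schema changes.
    conn.execute("INSERT INTO raw_landing VALUES (?, ?)", (payload["fetched_at"], raw_json))
    rows = [
        (item["vin"], payload["fetched_at"], item["price"], item["dealer"])
        for item in payload["listings"]
    ]
    # INSERT OR REPLACE keeps the parsed table unchanged if the same refresh is re-run.
    conn.executemany("INSERT OR REPLACE INTO market_listings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
loaded = load_listings(conn, RAW_PAYLOAD)
```

Keying the parsed table on `(vin, fetched_at)` is one way to let each monthly refresh add a new snapshot without duplicating rows from a repeated run.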
 

Attachments

  • Screenshot 2026-06-30 at 12.44.58 PM.png
Are you also bringing in DMS data, or is this mostly the metadata layer of everything your dealership is producing or working in the market?
My particular focus is on digital visibility... this is purely dealer CMS, inventory pushed there, and then regional search engine results and listings data.

I will advise one critical thing... while a data lake provides all of your data elements, it does not provide the context that AI needs to fully explain them. I'd highly recommend building out a data schema that explains every data element: how it's updated, its benefits, challenges, definitions, etc. I worked for one company that literally built this into a table that was ALWAYS queried alongside every other query. This way the data came back with the definitions, and AI was able to better contextualize the answers.
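A rough sketch of that "definitions table queried with every query" idea, under stated assumptions: the `data_dictionary` and `vdp_events` tables, their columns, and the `query_with_context` helper are hypothetical names made up for illustration, and SQLite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_dictionary (
    table_name TEXT, column_name TEXT, definition TEXT, refresh TEXT, caveats TEXT
);
CREATE TABLE vdp_events (vin TEXT, vdp_views INTEGER);
INSERT INTO data_dictionary VALUES
    ('vdp_events', 'vin', 'Vehicle identification number', 'on inventory push', 'none'),
    ('vdp_events', 'vdp_views', 'Vehicle detail page views', 'streaming', 'bot traffic not filtered');
INSERT INTO vdp_events VALUES ('1HGCM82633A004352', 42);
""")

def query_with_context(conn, table, sql):
    """Run a query and return its rows together with the column definitions."""
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    rows = [dict(zip(columns, r)) for r in cursor.fetchall()]
    # Look up the dictionary entries for exactly the columns that came back.
    placeholders = ",".join("?" for _ in columns)
    context = conn.execute(
        "SELECT column_name, definition, refresh, caveats FROM data_dictionary "
        f"WHERE table_name = ? AND column_name IN ({placeholders})",
        (table, *columns),
    ).fetchall()
    return {"rows": rows, "context": context}

result = query_with_context(conn, "vdp_events", "SELECT vin, vdp_views FROM vdp_events")
```

Passing `result` to the model as one bundle means every answer arrives with its definitions, refresh cadence, and caveats attached.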
 
Agreed! The ID resolution took about 2-3 months to pin down and was handled downstream via Segment CDP, with a PK string across customer ID, VIN, lead ID, deal ID, etc. to ensure LTV values and the assigned VIN were accurate against the first defined record.

So we relied on Glue for clear dedup logic, but then had downstream mapping using DMS/CRM and non-DMS/CRM sources to append any new or modified changes to shoppers. The hardest part was keeping up with third-party tool tracking coverage!
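As an illustration of a composite PK string with "first defined record wins" dedup, here is a minimal sketch. It is an assumption about how such a key could be built, not the Glue or Segment configuration described above; the `Record` fields, `pk_string`, and `first_defined` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    customer_id: str
    vin: str
    lead_id: str
    deal_id: str
    created: str  # ISO date, so string comparison sorts chronologically
    source: str

def pk_string(r):
    """Join the normalized identifiers into one composite key."""
    parts = (r.customer_id, r.vin, r.lead_id, r.deal_id)
    return "|".join(p.strip().upper() for p in parts)

def first_defined(records):
    """Deterministic dedup: keep the earliest record for each composite key."""
    kept = {}
    for r in sorted(records, key=lambda rec: rec.created):
        kept.setdefault(pk_string(r), r)
    return kept

# Hypothetical rows: the first two are the same shopper with different casing.
records = [
    Record("c-100", "1hgcm82633a004352", "L-9", "D-3", "2026-02-10", "crm"),
    Record("C-100", "1HGCM82633A004352", "l-9", "d-3", "2026-01-05", "dms"),
    Record("C-200", "2T1BURHE5JC034461", "L-12", "D-7", "2026-03-01", "crm"),
]
resolved = first_defined(records)
```

Normalizing case and whitespace before joining is what lets the two casing variants collapse to one key; anything beyond that (typos, missing IDs) falls outside this kind of rules-based match.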
Hi Bill, I agree with the process you outlined. Customer ID, VIN, lead ID, deal ID, CRM/BMS data, CDP mapping, and de-dupe logic are all part of getting identity resolution right. The piece I’d add is that those methods usually only catch the obvious, rules-based connections. There’s another semantic layer (with a robust embedded ID spine) that finds relationships and matches that don’t show up cleanly through standard deterministic logic. We’re using this semantic layer (Darkmath.ai) mostly with clients outside the auto industry, but within the automotive space we’re using it with Tekion to help consolidate records across multiple rooftops. We’re seeing it add roughly 30% to 50% more matches, relationships, and deduplications, depending on the dealer group. So I don’t disagree with your approach. I just think there’s a meaningful layer on top of it that most teams aren’t factoring in yet. Happy to compare notes.
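To show the general gap between deterministic and non-deterministic matching, here is a small sketch that uses plain string similarity from Python's standard library. This is a deliberately simple stand-in, not the semantic layer or the Darkmath.ai product mentioned above; the records, the `fingerprint` helper, and the 0.75 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records from two rooftops; no shared ID links them.
records = [
    {"id": "A1", "name": "Jon Smith", "address": "12 Oak St"},
    {"id": "B7", "name": "Jonathan Smith", "address": "12 Oak Street"},
    {"id": "B9", "name": "Maria Lopez", "address": "88 Pine Ave"},
]

def fingerprint(r):
    """Lowercased name plus address, used as the text to compare."""
    return f"{r['name']} {r['address']}".lower()

def candidate_matches(records, threshold=0.75):
    """Return ID pairs whose text similarity meets the threshold."""
    pairs = []
    for a, b in combinations(records, 2):
        score = SequenceMatcher(None, fingerprint(a), fingerprint(b)).ratio()
        if score >= threshold:
            pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs

matches = candidate_matches(records)
```

An exact-key join would never link `A1` and `B7`, while the similarity pass flags them as a candidate pair. In practice candidates like this would likely need review or further evidence before records are merged.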
 

✨ AI Highlights

A dealer group operator asks for vendor recommendations and advice on building a data lake to power marketing dashboards and a CDP. Practitioners recommend Snowflake, ClickHouse, BigQuery, and AWS stacks, with one member sharing a detailed real-world architecture using CDK, DriveCentric, AWS Glue, and Segment. The thread's sharpest takeaway is that the hardest problems aren't the storage platform choice but rather building reliable data pipelines out of resistant DMS and CRM vendors, and then resolving fragmented customer identity across rooftops before any meaningful reporting or audience activation is possible.
