
Anyone have experience with data lakes

We've built a platform for our data lake to bring in over a dozen resources. Our CMS event data is in ClickHouse because it's so effective for large streams of event data, though that might be overkill for a single dealer. We pull what we need from it into the data lake. We bring in the raw JSON from other APIs and then parse it out into a MySQL database. The market data is refreshed each month. I've built much of this using Claude Code, with my engineering team ensuring it's secure, stable, and scalable. We're going to be migrating it into our production environment soon.
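For readers who want to picture the "land the raw JSON, then parse it into a relational table" step, here is a minimal sketch. It is not the poster's pipeline: the payload shape, the table names (`raw_market`, `market_listing`), and the function `load_listings` are all hypothetical, and SQLite from the Python standard library stands in for MySQL so the example runs anywhere.

```python
import json
import sqlite3

# Hypothetical raw payload, as it might be landed unmodified in the lake.
RAW_PAYLOAD = json.dumps({
    "fetched_at": "2026-06-01",
    "listings": [
        {"vin": "VIN0001", "price": 28995, "days_on_lot": 12},
        {"vin": "VIN0002", "price": 41250},  # days_on_lot missing on purpose
    ],
})


def load_listings(conn, raw_json):
    """Store the raw JSON as-is, then parse it into a typed table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_market (fetched_at TEXT, payload TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS market_listing ("
        "vin TEXT, fetched_at TEXT, price INTEGER, days_on_lot INTEGER, "
        "PRIMARY KEY (vin, fetched_at))"
    )
    doc = json.loads(raw_json)
    conn.execute(
        "INSERT INTO raw_market VALUES (?, ?)", (doc["fetched_at"], raw_json)
    )
    rows = [
        (item["vin"], doc["fetched_at"], item.get("price"), item.get("days_on_lot"))
        for item in doc["listings"]
    ]
    # INSERT OR REPLACE keeps a re-run of the same refresh from duplicating rows.
    conn.executemany(
        "INSERT OR REPLACE INTO market_listing VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_listings(conn, RAW_PAYLOAD)
```

Keeping the raw payload next to the parsed table means a parsing bug can be fixed and replayed without calling the API again. `INSERT OR REPLACE` is SQLite syntax; on MySQL the equivalent would be `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`.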
 

Attachments

  • Screenshot 2026-06-30 at 12.44.58 PM.png
Are you also bringing in DMS data or is this mostly the metadata layer of everything your dealership is producing or working in the market?
My particular focus is on digital visibility... this is purely dealer CMS data, the inventory pushed there, and regional search engine results and listings data.

I will advise one critical thing... while a data lake provides all of your data elements, it does not provide the context that is critical for AI to fully explain them. I'd highly recommend building out a data schema that explains every data element: how it's updated, its benefits, challenges, definitions, etc. I worked for one company that literally built this into a table that was ALWAYS queried alongside every other query. This way the data came back with the definitions, and AI was able to better contextualize the answers.
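One way to picture the "definitions table queried with every query" idea is the sketch below. It is only an illustration under assumptions: the `data_dictionary` layout, the sample `inventory` table, the sample definitions, and the helper `query_with_context` are all hypothetical, and SQLite is used so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_dictionary (
    table_name TEXT, column_name TEXT, definition TEXT, refresh TEXT, caveats TEXT,
    PRIMARY KEY (table_name, column_name)
);
CREATE TABLE inventory (vin TEXT, days_on_lot INTEGER, vdp_views INTEGER);
INSERT INTO data_dictionary VALUES
  ('inventory', 'vin', 'Vehicle identification number', 'nightly', 'none'),
  ('inventory', 'days_on_lot', 'Days since the unit was stocked in', 'nightly',
   'resets on dealer trade'),
  ('inventory', 'vdp_views', 'Vehicle detail page views from the CMS', 'hourly',
   'bot traffic not filtered');
INSERT INTO inventory VALUES ('VIN0001', 12, 340), ('VIN0002', 45, 18);
""")


def query_with_context(conn, table, sql):
    """Run a query and return its rows plus the definitions of the returned columns."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    placeholders = ",".join("?" for _ in columns)
    defs = conn.execute(
        "SELECT column_name, definition, refresh, caveats FROM data_dictionary "
        f"WHERE table_name = ? AND column_name IN ({placeholders})",
        [table, *columns],
    ).fetchall()
    context = {
        name: {"definition": d, "refresh": r, "caveats": c}
        for name, d, r, c in defs
    }
    return {"rows": rows, "context": context}


result = query_with_context(
    conn, "inventory",
    "SELECT vin, days_on_lot FROM inventory WHERE days_on_lot > 30",
)
```

The `result` dictionary carries both the rows and a `context` entry per returned column, so whatever is passed to the model includes the definition, refresh cadence, and known caveats along with the numbers.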
 
Agreed! The ID resolution took about 2-3 months to pin down and was handled downstream via Segment CDP, with a PK string across customer ID, VIN, lead ID, deal ID, etc. to ensure LTV values and the assigned VIN were accurate against the first defined record.

So we relied on Glue for clear dedup logic, but then had downstream mapping using DMS/CRM and non-DMS/CRM sources to append any new or modified changes to shoppers. The hardest part was keeping up with third-party tool tracking coverage!
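A rough sketch of the composite-key idea follows. It is one possible reading of the description above, not the Segment or Glue configuration the poster used: the field names, the `|` separator, and the helpers `identity_key` and `dedupe_first_wins` are all assumptions made for illustration.

```python
def identity_key(record):
    """Build one normalized key string from the identifiers a record carries."""
    parts = []
    for field in ("customer_id", "vin", "lead_id", "deal_id"):
        value = str(record.get(field) or "").strip().upper()
        parts.append(value)
    return "|".join(parts)


def dedupe_first_wins(records):
    """Keep the first record seen per key; later duplicates only fill in blanks."""
    canonical = {}
    for rec in records:
        key = identity_key(rec)
        if key not in canonical:
            canonical[key] = dict(rec)
            continue
        for field, value in rec.items():
            if canonical[key].get(field) in (None, ""):
                canonical[key][field] = value
    return canonical


# Hypothetical records: the first two differ only in casing and whitespace.
records = [
    {"customer_id": "c-100", "vin": "vin0001 ", "lead_id": "L9", "deal_id": "D4",
     "ltv": 1200, "email": None},
    {"customer_id": "C-100", "vin": "VIN0001", "lead_id": "l9", "deal_id": "d4",
     "ltv": 9999, "email": "pat@example.com"},
    {"customer_id": "C-200", "vin": "VIN0002", "lead_id": None, "deal_id": None,
     "ltv": 300, "email": None},
]
canonical = dedupe_first_wins(records)
```

Here the first two records collapse to one key. The LTV from the first defined record is kept, while the email that only the later record had is appended to it.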
Hi Bill, I agree with the process you outlined. Customer ID, VIN, lead ID, deal ID, CRM/BMS data, CDP mapping, and de-dupe logic are all part of getting identity resolution right. The piece I’d add is that those methods usually only catch the obvious, rules-based connections. However, there’s another semantic layer (with a robust embedded ID spine) that finds relationships and matches that don’t show up cleanly through standard deterministic logic. We’re using this semantic layer (Darkmath.ai) mostly with clients outside the auto industry, but within the automotive space we’re using it with Tekion to help consolidate records across multiple rooftops. We’re seeing it add roughly 30% to 50% more matches, relationships, and deduplications, depending on the dealer group. So I don’t disagree with your approach. I just think there’s a meaningful layer on top of it that most teams aren’t factoring in yet. Happy to compare notes.
 
Every multi-rooftop dealer group seems to eventually build a data lake, and every one hits the same set of problems. Once you figure out the pipeline, the next one is usually that “one customer” is actually six records scattered across DMS, CRM, and service at five different stores. Rules-based dedup catches the easy ones and drops a good bit of the rest. The fix is resolving identity by meaning to match even the records that don't have many common identifiers. Unify the person first. Then your dashboards and campaigns tell the truth.
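To make the gap between rules-based and looser matching concrete, here is a small sketch. It is not a semantic or embedding-based matcher like the one described in this thread: plain string similarity from Python's `difflib` is swapped in as a stand-in, and the field names, sample records, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop simple punctuation, and collapse whitespace."""
    cleaned = str(text or "").lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Average string similarity over name and address, from 0.0 to 1.0."""
    fields = ("name", "address")
    scores = [
        SequenceMatcher(None, normalize(a.get(f)), normalize(b.get(f))).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def rules_match(a, b):
    """Deterministic rule: an exact shared email or phone."""
    return any(a.get(f) and a.get(f) == b.get(f) for f in ("email", "phone"))


def likely_same_person(a, b, threshold=0.8):
    """Rules first, then fall back to fuzzy similarity."""
    return rules_match(a, b) or similarity(a, b) >= threshold


# Hypothetical records for one shopper in two systems, plus an unrelated one.
crm = {"name": "Jonathan Q. Smith", "address": "12 Oak Street, Dayton",
       "email": "jon@example.com", "phone": None}
service = {"name": "Jonathon Smith", "address": "12 Oak St Dayton",
           "email": None, "phone": "555-0100"}
other = {"name": "Maria Lopez", "address": "98 Pine Ave, Akron",
         "email": "maria@example.com", "phone": "555-0142"}
```

The CRM and service records share no email or phone, so `rules_match` misses them, while the similarity fallback pairs them and still keeps the unrelated record apart. A production matcher would need blocking, better features, and a reviewed threshold, since a loose threshold merges different people.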
This (one customer, multiple records) is key, even within the same store. A customer communicates by email one time and by phone another, or they change their email or phone. The key is to have a communication platform that helps you identify such duplicates and merge them.
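The changed-email or changed-phone case can be handled by linking records transitively: if record A shares an email with B, and B shares a phone with C, all three belong together. Below is a minimal sketch using a union-find structure. The record layout and the function name `merge_by_shared_contact` are hypothetical, and this is not how any particular communication platform does it.

```python
def merge_by_shared_contact(records):
    """Group record ids that share an email or phone, directly or through a chain."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        parent[find(b)] = find(a)

    first_seen = {}  # (field, value) -> id of the first record that used it
    for r in records:
        for field in ("email", "phone"):
            value = (r.get(field) or "").strip().lower()
            if not value:
                continue
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], r["id"])
            else:
                first_seen[key] = r["id"]

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(groups.values())


# Hypothetical records: 1 and 3 share nothing directly, but 2 links them.
records = [
    {"id": 1, "email": "old@example.com", "phone": None},
    {"id": 2, "email": "old@example.com", "phone": "555-0100"},
    {"id": 3, "email": "new@example.com", "phone": "555-0100"},
    {"id": 4, "email": "other@example.com", "phone": "555-0199"},
]
groups = merge_by_shared_contact(records)
```

Records 1, 2, and 3 end up in one group even though the first and third have no identifier in common. Shared household phones or dealership-entered placeholder emails would over-merge under this rule, so those values would need to be filtered out first.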
 

✨ AI Highlights

A dealer group operator asks for vendor recommendations and advice on building a data lake to power marketing dashboards and CDP audience activation across multiple rooftops. Experienced practitioners share detailed stack recommendations — AWS, Redshift, CDK Data Your Way, DriveCentric, Segment — along with documented architectures, while the thread's dominant insight is that the hardest unsolved problem isn't the pipeline or platform choice but identity resolution: a single customer routinely exists as multiple fragmented records across DMS, CRM, and service systems, and rules-based deduplication misses roughly half of true matches. Consensus points toward investing early in a clean data model, consistent naming conventions, and probabilistic identity resolution before layering on reporting or AI.
