Anyone have experience with data lakes

douglaskarr · Jun 30, 2026

We've built a platform for our data lake to bring in over a dozen resources. Our CMS event data lives in ClickHouse because it's so effective for large streams of event data, though that might be overkill for a single dealer. We pull what we need from it into the data lake. We bring in the raw JSON from other APIs and then parse it into a MySQL database. The market data is refreshed each month. I've built much of this using Claude Code, with my engineering team ensuring it's secure, stable, and scalable. We'll be migrating it into our production environment soon.
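For anyone wondering what the "raw JSON in, relational rows out" step can look like, here is a minimal sketch, not the actual pipeline described above. It assumes a hypothetical payload shape (`dealer_id`, `listings`, `vin`, `price`, `days_on_lot`) and uses Python's built-in `sqlite3` as a stand-in for MySQL so it runs anywhere. The raw body is stored untouched first, then parsed into a keyed table so a monthly refresh can be re-run without creating duplicates.

```python
import json
import sqlite3

# Hypothetical raw API payload; every field name here is an assumption.
RAW_PAYLOAD = json.dumps({
    "dealer_id": "D-100",
    "listings": [
        {"vin": "1HGCM82633A004352", "price": 18995, "days_on_lot": 12},
        {"vin": "2T1BURHE5JC034461", "price": 21450, "days_on_lot": 3},
    ],
})


def load_listings(conn: sqlite3.Connection, raw: str) -> int:
    """Store the raw JSON as-is, then parse it into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_api (id INTEGER PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "dealer_id TEXT, vin TEXT, price INTEGER, days_on_lot INTEGER, "
        "PRIMARY KEY (dealer_id, vin))"
    )
    # Keep the untouched payload so it can be re-parsed if the schema changes.
    conn.execute("INSERT INTO raw_api (body) VALUES (?)", (raw,))

    doc = json.loads(raw)
    rows = [
        (doc["dealer_id"], item["vin"], item["price"], item["days_on_lot"])
        for item in doc["listings"]
    ]
    # INSERT OR REPLACE is SQLite syntax; it makes a repeated refresh idempotent.
    conn.executemany("INSERT OR REPLACE INTO listings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_listings(conn, RAW_PAYLOAD)
```

Running `load_listings` a second time with the same payload adds another raw row but leaves the parsed table at two rows, which is the behavior you want from a scheduled refresh.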

craigh · Jun 30, 2026

douglaskarr said:
We've built a platform for our data lake to bring in over a dozen resources.

Are you also bringing in DMS data or is this mostly the metadata layer of everything your dealership is producing or working in the market?

douglaskarr · Jun 30, 2026

craigh said:
Are you also bringing in DMS data or is this mostly the metadata layer of everything your dealership is producing or working in the market?

My particular focus is on digital visibility... this is purely dealer CMS, inventory pushed there, and then regional search engine result and listings data.

I will advise one critical thing: while a data lake provides all of your data elements, it does not provide the context that is critical for AI to fully explain them. I'd highly recommend building out a data schema that explains every data element: how it's updated, benefits, challenges, definitions, etc. I worked for one company that literally built this into a table that was ALWAYS queried alongside every other query. This way the data came back with the definitions and AI was able to better contextualize the answers.
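A rough sketch of that "definitions travel with every query" idea, under stated assumptions: the `data_dictionary` table, its columns, and the `query_with_context` helper are all hypothetical names, and `sqlite3` stands in for whatever database you actually run. The helper executes the query, then looks up the dictionary entries for the columns that came back and returns both together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_dictionary (
    table_name TEXT, column_name TEXT, definition TEXT, refresh TEXT,
    PRIMARY KEY (table_name, column_name)
);
CREATE TABLE listings (vin TEXT, price INTEGER, days_on_lot INTEGER);
INSERT INTO data_dictionary VALUES
  ('listings', 'vin', 'Vehicle identification number', 'on inventory push'),
  ('listings', 'price', 'Advertised price in USD', 'on inventory push'),
  ('listings', 'days_on_lot', 'Days since the unit was first listed', 'daily');
INSERT INTO listings VALUES ('1HGCM82633A004352', 18995, 12);
""")


def query_with_context(conn, table, sql):
    """Run a query and attach dictionary entries for the returned columns."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]

    defs = conn.execute(
        "SELECT column_name, definition, refresh "
        "FROM data_dictionary WHERE table_name = ?",
        (table,),
    ).fetchall()
    # Only keep definitions for columns that actually appear in the result.
    context = {
        col: {"definition": definition, "refresh": refresh}
        for col, definition, refresh in defs
        if col in columns
    }
    return {"rows": rows, "context": context}


result = query_with_context(conn, "listings", "SELECT vin, price FROM listings")
```

The `result` dict is what you would hand to the model: the rows plus, for each column, what it means and how often it is updated.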

djbailey121 · Jun 30, 2026

Bill Hoerr said:
Agreed! The id resolution took about 2-3 months to pin down and was handled downstream via Segment CDP with a PK string across customer id, vin, lead id, deal id, etc. to ensure LTV values and vin assigned were accurate against the first defined record.

So relied on Glue for clear dedup logic but then had downstream mapping using DMS/CRM and non DMS/CRM sources to append any new/modified change to shoppers. Hardest part was keeping with 3rd party tool tracking coverage!

Hi Bill, I agree with the process you outlined. Customer ID, VIN, lead ID, deal ID, CRM/BMS data, CDP mapping, and de-dupe logic are all part of getting identity resolution right. The piece I’d add is that those methods usually only catch the obvious, rules-based connections. However, there’s another semantic layer (with a robust impeded ID spine) that finds relationships and matches that don’t show up cleanly through standard deterministic logic. We’re using this semantic layer (Darkmath.ai) mostly with clients outside the auto industry, but within the automotive space we’re using it with Tekion to help consolidate records across multiple rooftops. We’re seeing it add roughly 30% to 50% more matches, relationships, and deduplicated records, depending on the dealer group. So I don’t disagree with your approach. I just think there’s a meaningful layer on top of it that most teams aren’t factoring in yet. Happy to compare notes.
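To make the deterministic-versus-everything-else distinction concrete, here is a toy sketch. It is not how Darkmath.ai, Segment, or Glue work; it simply swaps in a plain string-similarity fallback (`difflib.SequenceMatcher` from the Python standard library) to show the kind of match an exact-key rule misses. The record fields, the 0.85 threshold, and the "same ZIP" requirement are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def deterministic_key(rec):
    """Rules-based key: normalized email, else phone digits, else None."""
    email = (rec.get("email") or "").strip().lower()
    phone = "".join(ch for ch in (rec.get("phone") or "") if ch.isdigit())
    return email or phone or None


def fuzzy_match(a, b, threshold=0.85):
    """Fallback: similar name plus identical ZIP. Threshold is an assumption."""
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return a["zip"] == b["zip"] and score >= threshold


def same_customer(a, b):
    """Return which rule matched the two records, or None if neither did."""
    key_a, key_b = deterministic_key(a), deterministic_key(b)
    if key_a is not None and key_a == key_b:
        return "deterministic"
    if fuzzy_match(a, b):
        return "fuzzy"
    return None


# Hypothetical records from three different systems.
crm = {"name": "Jon Smith", "email": "jon@example.com", "phone": "", "zip": "46204"}
service = {"name": "John Smith", "email": "", "phone": "(317) 555-0100", "zip": "46204"}
desk = {"name": "J. Smith", "email": "", "phone": "317.555.0100", "zip": "46204"}
other = {"name": "Mary Jones", "email": "", "phone": "317-555-0199", "zip": "46204"}
```

Here `service` and `desk` match deterministically on phone digits, `crm` and `service` share no identifier and only match through the name-similarity fallback, and `other` matches nothing. A real semantic layer would use far richer signals than name similarity, and any fuzzy rule needs review for false positives.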

iWish AI · Jul 1, 2026

djbailey121 said:
Every multi-rooftop dealer group seems to eventually build a data lake, and every one hits the same set of problems. Once you figure out the pipeline, the next one is usually that “one customer” is actually six records scattered across DMS, CRM, and service at five different stores. Rules-based dedup catches the easy ones and drops a good bit of the rest. The fix is resolving identity by meaning to match even the records that don't have many common identifiers. Unify the person first. Then your dashboards and campaigns tell the truth.

This (one customer, multiple records) is key, even within the same store. A customer might communicate via email one time and by phone another, or change their email or phone number. The key is to have a communication platform that helps you identify such duplicates and merge them.
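One way to picture the merge, as a sketch rather than any particular platform's logic: treat each record as a node, link any two records that share an email or a phone number, and group the linked records with a small union-find. The record IDs and field names below are hypothetical. Note how a customer who changed their email is still pulled together through the phone number they kept.

```python
from collections import defaultdict


def merge_duplicates(records):
    """Cluster record ids that are linked by a shared email or phone."""
    parent = list(range(len(records)))

    def find(i):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (field, normalized value) -> index of first record seen
    for idx, rec in enumerate(records):
        for field in ("email", "phone"):
            value = rec.get(field)
            if not value:
                continue
            key = (field, value.strip().lower())
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[find(idx)].append(rec["id"])
    return sorted(clusters.values())


records = [
    {"id": "crm-1", "email": "pat@old.example", "phone": None},
    {"id": "svc-7", "email": "pat@old.example", "phone": "3175550100"},
    {"id": "crm-9", "email": "pat@new.example", "phone": "3175550100"},
    {"id": "crm-4", "email": "lee@example.com", "phone": "3175550111"},
]
clusters = merge_duplicates(records)
```

With this data, `crm-1`, `svc-7`, and `crm-9` end up in one cluster and `crm-4` stays on its own. Shared household phones or generic store emails would over-merge with a rule this simple, so a production version would need exclusions or a review step.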

Bill Hoerr · Jul 1, 2026

djbailey121 said:
Hi Bill, I agree with the process you outlined. Customer ID, VIN, lead ID, deal ID, CRM/BMS data, CDP mapping, and de-dupe logic are all part of getting identity resolution right. The piece I’d add is that those methods usually only catch the obvious, rules-based connections. However, there’s another semantic layer (with a robust impeded ID spine) that finds relationships and matches that don’t show up cleanly through standard deterministic logic. We’re using this semantic layer (Darkmath.ai) mostly with clients outside the auto industry, but within the automotive space we’re using it with Tekion to help consolidate records across multiple rooftops. We’re seeing it add roughly 30% to 50% more matches, relationships, and deduplicated records, depending on the dealer group. So I don’t disagree with your approach. I just think there’s a meaningful layer on top of it that most teams aren’t factoring in yet. Happy to compare notes.

Really appreciate the insight here and I will have to take a look at Darkmath.ai! Segment handled a lot of the "ID spine" from the various sources and updated as records were modified, always changing addresses, phone, etc. but keeping the record up to date and consistent. Never got to the external enrichment and semantic layer portion. I'll take you up on the offer!

djbailey121 · Jul 7, 2026

Bill Hoerr said:
Really appreciate the insight here and I will have to take a look at Darkmath.ai! Segment handled a lot of the "ID spine" from the various sources and updated as records were modified, always changing addresses, phone, etc. but keeping the record up to date and consistent. Never got to the external enrichment and semantic layer portion. I'll take you up on the offer!

Send me an email at [email protected] and we can set up a time to chat. I have a lot of availability this week. Look forward to it.
