• This thread is just the tip of the iceberg.The people ahead of the curve aren't Googling for answers — they're already in here, having the conversations you haven't found yet. DealerRefresh is free.Get the full picture →

Anyone have experience with data lakes

Brad Burlingham

4 Pounder
May 28, 2009
76
21
Awards
4
First Name
Brad
I'm hoping to setup api from all of our dealerships into a datalake so that we can create our own reporting. Does anyone have any suggestions on vendors or overall advice?

First stage is to use it to develop marketing dashboards to evaluate trends, performance, kpi etc. next step would be to use it with a cdp to develop audiences etc.
 
Snowflake is still the industry leader by far, but ClickHouse is quickly stealing market share as a more innovative solution:
Fast Open-Source OLAP DBMS - ClickHouse

Getting marketing data like GA4 into a data warehouse can sometimes be easier with Google BigQuery, but tools like Supermetrics can help load that data into Snowflake as well.

For reporting, I would highly suggest looking into Sigma. We’re in the process of moving off Tableau and Google Data / Looker Studio and over to Sigma:
Sigma. Unified AI apps and analytics

But honestly, the most challenging piece is going to be setting up the pipelines to extract data out of your DMS, CRM, etc because as I’m sure you already know, automotive still tends to be a pretty walled off garden.
 
I went through this process previously at my 4 rooftop group. I documented the process and put it in the DR Resources tab here as well as GitHub with a more detailed architectural breakdown.

We used:
  • CDK (DMS): The Data Your Way export tool for daily pipeline cadence. This was super helpful.
  • DriveCentric (CRM): They offered direct API connection
  • Infrastructure: AWS API Gateway, S3 storage, Glue for dedup/cleaning, and Redshift for storage
  • Reporting: Data Studio
  • Marketing Activation: Segment CDP
What I'd do differently now:
  • I'd go with BigQuery/Google Cloud over Redshift/AWS for the native integrations with GA4 and Data Studio as Ryan mentioned ^
  • Supermetrics (they recently released a marketing intelligence platform for activation) or Airbyte if you want more technical control
  • Evidence.dev for reporting. They have NLP built-in so non-technical folks can build reports too and permission controls if you wanted to create reports per department.

Depending on the DMS and CRM, the pipeline will be the largest obstacle. But DriveCentric and CDK had these export tools which made it ALOT easier to get things started. And as always, get comfortable with the data dictionary provided

It was a fun and beneficial project for our group and I hope some of this helps you in your quest!
 
I'm hoping to setup api from all of our dealerships into a datalake so that we can create our own reporting. Does anyone have any suggestions on vendors or overall advice?

First stage is to use it to develop marketing dashboards to evaluate trends, performance, kpi etc. next step would be to use it with a cdp to develop audiences etc.
Every multi-rooftop dealer group seems to eventually build a data lake, and every one hits the same set of problems. Once you figure out the pipeline, the next one is usually that “one customer” is actually six records scattered across DMS, CRM, and service at five different stores. Rules-based dedup catches the easy ones and drops a good bit of the rest. The fix is resolving identity by meaning to match even the records that don't have many common identifiers. Unify the person first. Then your dashboards and campaigns tell the truth.
 
I went through this process previously at my 4 rooftop group. I documented the process and put it in the DR Resources tab here as well as GitHub with a more detailed architectural breakdown.

We used:
  • CDK (DMS): The Data Your Way export tool for daily pipeline cadence. This was super helpful.
  • DriveCentric (CRM): They offered direct API connection
  • Infrastructure: AWS API Gateway, S3 storage, Glue for dedup/cleaning, and Redshift for storage
  • Reporting: Data Studio
  • Marketing Activation: Segment CDP
What I'd do differently now:
  • I'd go with BigQuery/Google Cloud over Redshift/AWS for the native integrations with GA4 and Data Studio as Ryan mentioned ^
  • Supermetrics (they recently released a marketing intelligence platform for activation) or Airbyte if you want more technical control
  • Evidence.dev for reporting. They have NLP built-in so non-technical folks can build reports too and permission controls if you wanted to create reports per department.

Depending on the DMS and CRM, the pipeline will be the largest obstacle. But DriveCentric and CDK had these export tools which made it ALOT easier to get things started. And as always, get comfortable with the data dictionary provided

It was a fun and beneficial project for our group and I hope some of this helps you in your quest!
Sharp breakdown Bill. You've nailed the pipeline, and I'd just add the part that plays a big role in whether or not the whole thing works.

It doesn't matter much whether you land on Databricks, Snowflake, BigQuery, or AWS, they're all capable. What actually makes or breaks it is the identity spine and the outside data you append to it, run through a pipeline built to fit the solution rather than forcing the solution into the tool.

That's where the dedup step you mentioned really carries weight. We're seeing this firsthand on a project with Tekion right now, reconciling records across their ecosystem. We’ve found generally that Glue/rules-based matching gets the easy ones but misses ~half the true matches, and the ones it drops are usually the high-value repeat customers. Resolve identity by meaning first, then append (vehicle/VIN, equity, demographics) so you're activating against a real person who's been placed in the broader market rather than just what's in your own system.
 
  • :light:
Reactions: Bill Hoerr
Sharp breakdown Bill. You've nailed the pipeline, and I'd just add the part that plays a big role in whether or not the whole thing works.

It doesn't matter much whether you land on Databricks, Snowflake, BigQuery, or AWS, they're all capable. What actually makes or breaks it is the identity spine and the outside data you append to it, run through a pipeline built to fit the solution rather than forcing the solution into the tool.

That's where the dedup step you mentioned really carries weight. We're seeing this firsthand on a project with Tekion right now, reconciling records across their ecosystem. We’ve found generally that Glue/rules-based matching gets the easy ones but misses ~half the true matches, and the ones it drops are usually the high-value repeat customers. Resolve identity by meaning first, then append (vehicle/VIN, equity, demographics) so you're activating against a real person who's been placed in the broader market rather than just what's in your own system.
Agreed! The id resolution took about 2-3 months to pin down and was handled downstream via Segment CDP with a PK string across customer id, vin, lead id, deal id, etc. to ensure LTV values and vin assigned were accurate against the first defined record.

So relied on Glue for clear dedup logic but then had downstream mapping using DMS/CRM and non DMS/CRM sources to append any new/modified change to shoppers. Hardest part was keeping with 3rd party tool tracking coverage!
 
A data lake is a solid foundation if you're pulling data from multiple dealership systems. My suggestion would be to prioritize data quality and standardization early consistent customer IDs, campaign naming, and event tracking will make reporting and audience building much easier later. Choose a platform that scales well and integrates easily with your existing APIs, especially if your long-term goal is to connect it with a CDP. Building a strong data model upfront will save a lot of time down the road.
 
We are in the data lake business as of late and I agree with all points in this thread - the first big challenge is getting ALL the data (financial data especially) and we typically throw all of that into the lake as bronze tier data. This data is messy, but pretty important for alot of the reporting we do.

Once we have as much bronze data as possible thrown in there, it's ETL time - most of our lakes skip silver entirely and go straight to gold files, which is merging all those customer records, balancing the accounts and making sure the financial statement actually aligns with reality after all credits, debits and inter-account transfers. Once you have the gold files figured out, the reporting layer is much simpler and (more importantly) a portable decision - you already have clean data you just need to visualize it, measure it, predict it, etc. The tricky part about this layer is the drilldown - we wanted to be able to click "90 Aged Units" and see those units, then click any unit and see all service data, but also see every single journal entry attached to that vehicle in the financial statement.

For the average dealer this is virtually impossible (especially with many of the vendors in our space) but we are having more and more conversations as dealers are realizing that this disparate data needs to be connected in order to properly see the entire operation.

One quick shout out: Tekion has done a superb job of opening this data up recently with their new direct data program - they have a full SDK to retrieve all data from the system and they're the easiest vendor to get data from because of this. We are also seeing great success with DealerBuilt and PBS who both believe data should be accessible with dealer consent.
 

✨ AI Highlights

A dealer group operator asks for vendor recommendations and advice on building a data lake to power marketing dashboards and a CDP. Practitioners recommend Snowflake, ClickHouse, BigQuery, and AWS stacks, with one member sharing a detailed real-world architecture using CDK, DriveCentric, AWS Glue, and Segment. The thread's sharpest takeaway is that the hardest problems aren't the storage platform choice but rather building reliable data pipelines out of resistant DMS and CRM vendors, and then resolving fragmented customer identity across rooftops before any meaningful reporting or audience activation is possible.

Replies Views 9 754 Started Last Reply