The 5-Minute Data Quality Check I Run on Every New Dataset
Before I trust any dataset — whether it's handed to me by a stakeholder, pulled from a database, or ingested from an API — I run the same five checks. Every time. Without exception.
This takes 5–15 minutes depending on the size of the dataset. It has caught critical issues before they reached a dashboard or a decision more times than I can count.
The five checks
1. Row count vs expectation. How many rows should this dataset have? If I'm pulling monthly sales data for a 12-month period and I get 11 months of rows, that's a problem I need to find before I aggregate anything. Always validate count against a known expectation before you start.
SELECT COUNT(*) as total_rows, COUNT(DISTINCT month_key) as months FROM sales_fact;
2. Null rate on key columns. Which columns are critical for this analysis? What's the null rate? A 2% null rate on a primary key is catastrophic. A 2% null rate on an optional comment field is fine. Know which is which.
SELECT
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_null_customer,
SUM(CASE CASE WHEN amount IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_null_amount
FROM transactions;
3. Date range sanity. Does the date range match what was requested? Are there future dates in a historical dataset? Are there dates from 1900 that indicate a default value instead of a real date? Date corruption is one of the most common and least visible data quality issues.
4. Duplicate check on primary keys. Should be obvious, but it's the check most people skip. If your primary key has duplicates, every join you do downstream will silently inflate your numbers.
SELECT id, COUNT(*) as occurrences FROM orders GROUP BY id HAVING COUNT(*) > 1;
5. Distribution sanity on key metrics. Pull a MIN, MAX, AVG, and a few percentiles on your main numeric columns. Does the max value look plausible? Is the average where you'd expect it? A transaction amount of $47,000,000 in a dataset where the average is $85 is either the deal of the century or a data entry error. Check before you report it.
Why this works
Most data quality issues aren't subtle. They're nulls where there shouldn't be nulls, duplicates that inflate every aggregate, and date values that are clearly wrong to any human who looks at them. The five-minute check catches these before they become a 5-hour debugging session after someone spots an anomaly in a live dashboard.