Data Quality Assets¶
Comprehensive testing framework for ensuring data reliability and trustworthiness
Data Quality assets in OpenMetadata provide a structured framework for defining, executing, and tracking data quality tests across all your data assets. The quality framework follows a three-level hierarchy: TestDefinitions define test types, TestCases apply tests to specific data assets, and TestSuites group related tests for comprehensive validation.
Hierarchy Overview¶
graph TD
A[TestDefinition] -.->|defines test type| B1[TestCase 1]
A -.->|defines test type| B2[TestCase 2]
B1 --> C1[TestSuite 1]
B2 --> C1
B2 --> C2[TestSuite 2]
B1 -.->|tests| D1[customers table]
B2 -.->|tests| D2[orders table]
style A fill:#667eea,color:#fff
style B1 fill:#f5576c,color:#fff
style B2 fill:#f5576c,color:#fff
style C1 fill:#fa709a,color:#fff
style C2 fill:#fa709a,color:#fff
style D1 fill:#4facfe,color:#fff
style D2 fill:#4facfe,color:#fff
Why This Hierarchy?¶
TestDefinition¶
Purpose: Template defining a type of data quality test
A TestDefinition is a reusable test template that specifies what kind of quality check to perform. It defines the test logic, parameters, and expected behavior without being tied to any specific data asset.
Examples:
- columnValuesToBeNotNull: Checks that a column has no null values
- columnValuesToBeBetween: Validates values are within a range
- tableRowCountToBeBetween: Verifies row count is within expected bounds
- tableColumnCountToEqual: Ensures table has expected number of columns
- columnValuesToMatchRegex: Validates values match a pattern
Why needed:
Test definitions enable:
- Reusability: Define once, apply to many assets
- Standardization: Consistent quality checks across the organization
- Parameterization: Customize behavior with parameters (min, max, regex pattern)
- Extensibility: Add custom test definitions for specific needs
Built-in Test Types:
- Column Tests: Null checks, range validation, regex matching, uniqueness
- Table Tests: Row count, column count, freshness
- Custom SQL: Define custom validation logic
- Cross-table Tests: Compare values across tables
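To make the "define once, parameterize per use" idea concrete, here is a minimal sketch of a test definition as a plain Python data structure. The class names (`ParameterSpec`, `DefinitionSketch`), the `entity_type` field, and the parameter names `min`, `max`, and `regex` are illustrative assumptions, not the OpenMetadata schema; see the specification for the actual fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSpec:
    """One parameter a test definition accepts (hypothetical shape)."""
    name: str
    required: bool = False


@dataclass(frozen=True)
class DefinitionSketch:
    """A reusable template: a name, the level it targets, and its parameters."""
    name: str
    entity_type: str  # assumed values: "TABLE" or "COLUMN"
    parameters: tuple = ()

    def validate(self, values: dict) -> list:
        """Return a list of problems with the supplied parameter values."""
        known = {p.name for p in self.parameters}
        problems = [f"unknown parameter: {k}" for k in values if k not in known]
        problems += [
            f"missing required parameter: {p.name}"
            for p in self.parameters
            if p.required and p.name not in values
        ]
        return problems


row_count_between = DefinitionSketch(
    name="tableRowCountToBeBetween",
    entity_type="TABLE",
    parameters=(ParameterSpec("min"), ParameterSpec("max")),
)
regex_match = DefinitionSketch(
    name="columnValuesToMatchRegex",
    entity_type="COLUMN",
    parameters=(ParameterSpec("regex", required=True),),
)

print(row_count_between.validate({"min": 100, "max": 10000}))  # no problems
print(regex_match.validate({}))  # reports the missing regex parameter
```

Nothing in the definition refers to a particular table or column, which is what lets the same template back many test cases.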
View TestDefinition Specification →
TestCase¶
Purpose: Specific test instance applied to a data asset
A TestCase applies a TestDefinition to a specific table, column, or other data asset with concrete parameters. It represents an actual quality check that gets executed and produces pass/fail results.
Examples:
- customers_email_not_null: Check email column in customers table is not null
- orders_amount_positive: Verify order amount is greater than zero
- daily_revenue_row_count: Ensure daily revenue table has expected row count
- product_sku_format: Validate product SKU matches pattern
Why needed:
Test cases provide:
- Specificity: Test definition + target asset + parameters
- Execution History: Track results over time
- Alerting: Notify stakeholders when tests fail
- Ownership: Assign responsibility for data quality
- Documentation: Explain why this test matters
Test Case Components:
- Test Definition: Which test type to use
- Entity Reference: What to test (table, column, etc.)
- Parameters: Test-specific configuration (min=0, max=100, regex pattern)
- Threshold: When to fail (critical, warning thresholds)
- Compute: SQL query that implements the test logic
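The components above can be sketched as a small Python structure that binds a definition name to a target and concrete parameters. This is an illustration only: `CaseSketch`, the `CHECKS` registry, the `"Success"`/`"Failed"` status strings, and the in-memory value lists are hypothetical stand-ins for the real entity and for SQL executed against a warehouse.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical registry: definition name -> function(values, params) -> bool
CHECKS: dict[str, Callable[[list, dict], bool]] = {
    "columnValuesToBeNotNull": lambda values, params: all(
        v is not None for v in values
    ),
    "columnValuesToBeBetween": lambda values, params: all(
        params["min"] <= v <= params["max"] for v in values if v is not None
    ),
}


@dataclass
class CaseSketch:
    name: str
    definition: str                      # which test type to use
    entity: str                          # what to test, e.g. "customers.email"
    parameters: dict = field(default_factory=dict)
    description: str = ""                # why this test matters

    def run(self, values: list) -> str:
        """Apply the referenced check to the column values."""
        passed = CHECKS[self.definition](values, self.parameters)
        return "Success" if passed else "Failed"


email_not_null = CaseSketch(
    name="customers_email_not_null",
    definition="columnValuesToBeNotNull",
    entity="customers.email",
    description="Marketing emails cannot be sent without an address.",
)
amount_in_range = CaseSketch(
    name="orders_amount_in_range",
    definition="columnValuesToBeBetween",
    entity="orders.amount",
    parameters={"min": 0, "max": 100},
)

print(email_not_null.run(["a@example.com", None]))  # Failed
print(amount_in_range.run([5, 42, 100]))            # Success
```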
TestSuite¶
Purpose: Logical grouping of related test cases
A TestSuite organizes test cases into meaningful collections for batch execution, reporting, and monitoring. Test suites enable comprehensive validation of entire data pipelines or business domains.
Examples:
- CustomerDataQuality: All tests for customer data assets
- DailySalesValidation: Tests for daily sales processing
- PIIComplianceChecks: Ensure PII columns are properly handled
- CriticalTableMonitoring: Monitor tier-1 production tables
Why needed:
Test suites enable:
- Organization: Group related tests logically
- Batch Execution: Run all tests in a suite together
- Reporting: Aggregate results across multiple tests
- Scheduling: Run suites on a schedule (hourly, daily, weekly)
- Ownership: Assign team responsibility for a domain
Suite Types:
- Logical Suite: Manual grouping of related tests
- Pipeline Suite: Tests for a specific pipeline's outputs
- Asset Suite: All tests for tables in a schema or database
- Domain Suite: Tests for a business domain (sales, finance, etc.)
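As a rough sketch of grouping and aggregate reporting, the snippet below models a suite as a named list of test case names and summarizes per-case statuses. `SuiteSketch`, its fields, and the status strings are assumptions made for illustration. Note that one test case appears in two suites, mirroring the hierarchy diagram where a case can belong to more than one suite.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SuiteSketch:
    name: str
    owner: str
    case_names: list = field(default_factory=list)

    def summarize(self, results: dict) -> dict:
        """Aggregate per-case statuses for the cases in this suite."""
        counts = Counter(results.get(n, "NotRun") for n in self.case_names)
        return {
            "suite": self.name,
            "total": len(self.case_names),
            "passed": counts["Success"],
            "failed": counts["Failed"],
            "not_run": counts["NotRun"],
        }


# Latest status per test case (made-up values).
results = {
    "customers_email_not_null": "Success",
    "orders_amount_positive": "Failed",
}

customer_suite = SuiteSketch(
    "CustomerDataQuality", "customer-team", ["customers_email_not_null"]
)
critical_suite = SuiteSketch(
    "CriticalTableMonitoring",
    "platform-team",
    ["customers_email_not_null", "orders_amount_positive", "orders_freshness"],
)

print(customer_suite.summarize(results))
print(critical_suite.summarize(results))
```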
View TestSuite Specification →
Common Patterns¶
Pattern 1: Column Validation¶
TestDefinition: columnValuesToBeNotNull
↓
TestCase: customers_email_not_null (on customers.email)
↓
TestSuite: CustomerDataQuality
A reusable test definition is applied to a specific column, and its results are tracked in a suite.
Pattern 2: Table Health Check¶
TestDefinition: tableRowCountToBeBetween
↓
TestCase: daily_orders_row_count (min=100, max=10000)
↓
TestSuite: DailyPipelineValidation
Ensures the table has the expected number of rows after pipeline execution.
Pattern 3: Comprehensive Asset Testing¶
TestSuite: OrdersTableQuality
├── orders_id_unique (uniqueness test)
├── orders_amount_positive (range test)
├── orders_date_not_null (null check)
├── orders_status_valid (allowed values test)
└── orders_freshness (freshness test)
Multiple test cases grouped to validate all aspects of a critical table.
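One way to picture this pattern is as a set of SQL queries that each count offending rows, where zero violations means the test passes. The sketch below uses Python's built-in `sqlite3` with a made-up `orders` table; the schema, sample rows, and queries are assumptions for illustration and are not how OpenMetadata implements these tests. The freshness check is omitted because it depends on the current time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, order_date TEXT, status TEXT);
    INSERT INTO orders VALUES
        (1, 19.99, '2024-01-01', 'completed'),
        (2, 5.00,  '2024-01-01', 'pending'),
        (2, 12.50, NULL,         'shipped');
""")

# Each check counts violations; a count of zero means the test passes.
CHECKS = {
    "orders_id_unique": (
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1)"
    ),
    "orders_amount_positive": "SELECT COUNT(*) FROM orders WHERE amount <= 0",
    "orders_date_not_null": (
        "SELECT COUNT(*) FROM orders WHERE order_date IS NULL"
    ),
    "orders_status_valid": (
        "SELECT COUNT(*) FROM orders "
        "WHERE status NOT IN ('pending', 'completed', 'cancelled')"
    ),
}

results = {}
for name, sql in CHECKS.items():
    violations = conn.execute(sql).fetchone()[0]
    results[name] = "Success" if violations == 0 else "Failed"

for name, status in results.items():
    print(f"{name}: {status}")
```

With the sample rows, the duplicate `id`, the missing `order_date`, and the unexpected `shipped` status each fail their test, while the amount check passes.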
Real-World Example¶
Here's how a data team ensures quality in their customer analytics pipeline:
graph TB
subgraph "Test Definitions (Reusable Templates)"
TD1[columnValuesToBeNotNull]
TD2[columnValuesToBeUnique]
TD3[tableRowCountToBeBetween]
TD4[columnValuesToBeInSet]
end
subgraph "Test Cases (Specific Tests)"
TC1[customers_email_not_null]
TC2[customers_id_unique]
TC3[orders_row_count]
TC4[orders_status_valid]
end
subgraph "Test Suites (Organized Groups)"
TS1[CustomerDataQuality Suite]
TS2[OrdersDataQuality Suite]
end
subgraph "Data Assets"
T1[customers table]
T2[orders table]
C1[email column]
C2[customer_id column]
end
TD1 -.->|defines| TC1
TD2 -.->|defines| TC2
TD3 -.->|defines| TC3
TD4 -.->|defines| TC4
TC1 -->|in suite| TS1
TC2 -->|in suite| TS1
TC3 -->|in suite| TS2
TC4 -->|in suite| TS2
TC1 -.->|tests| C1
TC2 -.->|tests| C2
TC3 -.->|tests| T2
TC4 -.->|tests| T2
style TD1 fill:#667eea,color:#fff
style TD2 fill:#667eea,color:#fff
style TD3 fill:#667eea,color:#fff
style TD4 fill:#667eea,color:#fff
style TC1 fill:#f5576c,color:#fff
style TC2 fill:#f5576c,color:#fff
style TC3 fill:#f5576c,color:#fff
style TC4 fill:#f5576c,color:#fff
style TS1 fill:#fa709a,color:#fff
style TS2 fill:#fa709a,color:#fff
style T1 fill:#4facfe,color:#fff
style T2 fill:#4facfe,color:#fff
style C1 fill:#00f2fe,color:#fff
style C2 fill:#00f2fe,color:#fff
Example Quality Implementation Workflow:
This example shows how Test Definitions, Test Cases, Test Suites, and Execution work together for an e-commerce platform:
Step 1: Define Reusable Test Templates (Test Definitions)
Standard test types available to all teams:
- Not null checks
- Uniqueness validation
- Row count monitoring
- Allowed values checking
Step 2: Apply Tests to Specific Assets (Test Cases)
Specific tests on customer and order tables:
- customers.email must not be null
- customers.customer_id must be unique
- orders table should have 1,000-50,000 rows daily
- orders.status must be in ["pending", "completed", "cancelled"]
Step 3: Organize Tests by Domain (Test Suites)
- Customer Suite: Runs all customer data quality checks
- Orders Suite: Validates transactional data integrity
Step 4: Execute and Monitor (Execution)
- Pipeline completes → Test suite executes → Results reported
- Failed tests trigger alerts to data engineering team
Comprehensive Quality Framework¶
A complete data quality implementation for an e-commerce platform:
graph TB
subgraph "Test Definitions Library"
direction LR
TDG1[Column Tests]
TDG2[Table Tests]
TDG3[Custom SQL Tests]
TDG1 --> TD1[NotNull]
TDG1 --> TD2[Unique]
TDG1 --> TD3[InRange]
TDG1 --> TD4[MatchRegex]
TDG2 --> TD5[RowCount]
TDG2 --> TD6[Freshness]
TDG2 --> TD7[ColumnCount]
TDG3 --> TD8[CustomQuery]
end
subgraph "Applied Test Cases"
direction TB
TC1[email_not_null]
TC2[customer_id_unique]
TC3[age_in_range]
TC4[phone_format_valid]
TC5[orders_daily_count]
TC6[revenue_freshness]
TC7[products_column_count]
TC8[revenue_reconciliation]
end
subgraph "Test Suite Organization"
TS1[Critical Data Quality<br/>Runs: Every Hour]
TS2[Daily Pipeline Validation<br/>Runs: After ETL]
TS3[Weekly Health Check<br/>Runs: Sunday 2 AM]
end
subgraph "Data Assets"
DB1[(Production Database)]
DB2[(Analytics Warehouse)]
DB1 --> T1[customers]
DB1 --> T2[orders]
DB2 --> T3[daily_revenue]
DB2 --> T4[products]
end
TD1 -.-> TC1
TD2 -.-> TC2
TD3 -.-> TC3
TD4 -.-> TC4
TD5 -.-> TC5
TD6 -.-> TC6
TD7 -.-> TC7
TD8 -.-> TC8
TC1 --> TS1
TC2 --> TS1
TC5 --> TS1
TC3 --> TS2
TC4 --> TS2
TC6 --> TS2
TC8 --> TS2
TC7 --> TS3
TC1 -.-> T1
TC2 -.-> T1
TC3 -.-> T1
TC5 -.-> T2
TC6 -.-> T3
TC7 -.-> T4
TC8 -.-> T3
style TD1 fill:#667eea,color:#fff
style TD2 fill:#667eea,color:#fff
style TD3 fill:#667eea,color:#fff
style TD4 fill:#667eea,color:#fff
style TD5 fill:#667eea,color:#fff
style TD6 fill:#667eea,color:#fff
style TD7 fill:#667eea,color:#fff
style TD8 fill:#667eea,color:#fff
style TS1 fill:#fa709a,color:#fff
style TS2 fill:#fa709a,color:#fff
style TS3 fill:#fa709a,color:#fff
Implementation Details:
- Test Definition Library:
    - 50+ built-in test definitions
    - Custom SQL tests for complex validation
    - Parameterized for flexibility
- Test Case Strategy:
    - Critical columns: Not null, uniqueness, format validation
    - Tables: Row count, freshness, schema stability
    - Business rules: Custom reconciliation queries
- Test Suite Scheduling:
    - Hourly: Critical quality checks on production tables
    - Pipeline: Run after each ETL completion
    - Weekly: Comprehensive health checks and drift detection
- Alerting and Remediation:
    - Critical failures → Page on-call engineer
    - Warnings → Slack notification to data team
    - Trends → Weekly quality report to leadership
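The alerting rules above amount to a small routing function. The sketch below is one possible reading of them, assuming each result carries a `status` and a `severity` field; those field names and the channel labels are hypothetical, and a real deployment would call a paging or chat integration instead of returning a string.

```python
def route(result: dict) -> str:
    """Pick a notification channel for one test result."""
    failed = result["status"] == "Failed"
    if failed and result["severity"] == "critical":
        return "page_on_call"
    if failed:
        return "slack_data_team"
    # Passing results raise no alert; they only feed the weekly trend report.
    return "weekly_report_only"


examples = [
    {"name": "orders_row_count", "status": "Failed", "severity": "critical"},
    {"name": "customers_email_not_null", "status": "Failed", "severity": "warning"},
    {"name": "orders_status_valid", "status": "Success", "severity": "critical"},
]
for r in examples:
    print(r["name"], "->", route(r))
```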
Test Execution Flow¶
Understanding how tests execute in practice:
sequenceDiagram
participant Pipeline as Data Pipeline
participant TestSuite as Test Suite
participant TestCase as Test Cases
participant DataAsset as Data Assets
participant Alert as Alerting System
Pipeline->>DataAsset: Load/Transform Data
Pipeline->>TestSuite: Trigger Test Execution
TestSuite->>TestCase: Execute Test 1
TestCase->>DataAsset: Query Data
DataAsset-->>TestCase: Return Results
TestCase->>TestCase: Evaluate Against Threshold
TestCase-->>TestSuite: Pass/Fail Result
TestSuite->>TestCase: Execute Test 2
TestCase->>DataAsset: Query Data
DataAsset-->>TestCase: Return Results
TestCase->>TestCase: Evaluate Against Threshold
TestCase-->>TestSuite: Pass/Fail Result
TestSuite->>TestSuite: Aggregate Results
TestSuite->>Alert: Send Notifications (if failures)
Alert->>Alert: Route to Owners
Execution Steps:
- Trigger: Pipeline completion or schedule
- Execution: Test suite runs all test cases
- Evaluation: Compare results against thresholds
- Recording: Store results and history
- Alerting: Notify owners of failures
- Remediation: Teams investigate and fix issues
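The execution, evaluation, recording, and alerting steps can be sketched as a single loop. Everything here is a simplified assumption: `run_suite`, the dictionary-based test cases, the `METRICS` lookup that stands in for querying a data asset, and the list that stands in for an alerting channel. Remediation is a human step and is not modeled.

```python
from datetime import datetime, timezone


def run_suite(cases, fetch_metric, history, notify):
    """Run every case, record each result, and notify once if any failed."""
    failures = []
    for case in cases:
        observed = fetch_metric(case["entity"], case["metric"])  # Execution
        passed = case["min"] <= observed <= case["max"]          # Evaluation
        record = {
            "case": case["name"],
            "observed": observed,
            "status": "Success" if passed else "Failed",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        history.append(record)                                   # Recording
        if not passed:
            failures.append(record)
    if failures:
        notify(failures)                                         # Alerting
    return failures


# Stand-ins for a warehouse query and an alerting channel.
METRICS = {("orders", "row_count"): 12000, ("customers", "null_emails"): 3}
history, sent = [], []

cases = [
    {"name": "orders_row_count", "entity": "orders",
     "metric": "row_count", "min": 1000, "max": 50000},
    {"name": "customers_email_not_null", "entity": "customers",
     "metric": "null_emails", "min": 0, "max": 0},
]

failed = run_suite(cases, lambda e, m: METRICS[(e, m)], history, sent.append)
print([r["case"] for r in failed])
```

Results are appended to `history` whether or not a test passes, so pass rates can be tracked over time rather than only at the moment of failure.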
Data Quality Dimensions¶
OpenMetadata test definitions cover the following quality dimensions:
| Dimension | Description | Test Examples |
|---|---|---|
| Completeness | No missing or null values | columnValuesToBeNotNull |
| Uniqueness | No duplicate records | columnValuesToBeUnique |
| Validity | Values conform to rules | columnValuesToMatchRegex, columnValuesToBeInSet |
| Accuracy | Values are correct | Custom SQL comparing to source of truth |
| Consistency | Values consistent across systems | Cross-table comparison tests |
| Timeliness | Data is fresh and up-to-date | tableFreshnessCheck |
| Integrity | Referential integrity maintained | Foreign key validation tests |
Entity Specifications¶
Each entity in the data quality framework has complete specifications:
| Entity | Description | Specification |
|---|---|---|
| TestDefinition | Reusable test template | View Spec |
| TestCase | Specific test instance | View Spec |
| TestSuite | Grouped test collection | View Spec |
Each specification includes:
- Complete field reference
- JSON Schema definition
- RDF/OWL ontology representation
- JSON-LD context and examples
- Execution semantics
- API operations
Best Practices¶
1. Start with Critical Assets¶
Focus first on tier-1 tables that power critical dashboards and reports. Expand coverage gradually.
2. Use Meaningful Test Names¶
Name test cases descriptively: customers_email_not_null is better than test_1.
3. Set Appropriate Thresholds¶
- Critical: Data cannot be used (100% null rate)
- Warning: Quality degrading (>5% null rate)
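A two-level threshold like this can be expressed as a small classifier. The sketch below uses the example figures from the bullets above (warning above a 5% null rate, critical at 100%); the function name, the default values, and the choice to treat an empty table as critical are assumptions, and real thresholds should be tuned per column.

```python
def classify_null_rate(
    null_count: int,
    row_count: int,
    warning: float = 0.05,
    critical: float = 1.0,
) -> str:
    """Map a column's null rate onto ok / warning / critical."""
    if row_count == 0:
        return "critical"  # assumption: an empty table is treated as unusable
    rate = null_count / row_count
    if rate >= critical:
        return "critical"
    if rate > warning:
        return "warning"
    return "ok"


print(classify_null_rate(100, 100))  # critical
print(classify_null_rate(8, 100))    # warning
print(classify_null_rate(5, 100))    # ok (not strictly above 5%)
```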
4. Organize by Domain¶
Create test suites by business domain (sales, finance, marketing) for clear ownership.
5. Monitor Test Trends¶
Track pass rates over time to identify quality degradation before it becomes critical.
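As a minimal sketch of trend monitoring, the snippet below computes a daily pass rate from stored results and flags a run of consecutive declines. The `(day, status)` input shape, the function names, and the three-day window are illustrative assumptions, not a prescribed method.

```python
from collections import defaultdict


def daily_pass_rates(results):
    """Return {day: pass_rate} from (day, status) pairs, ordered by day."""
    totals = defaultdict(lambda: [0, 0])  # day -> [passed, total]
    for day, status in results:
        totals[day][1] += 1
        if status == "Success":
            totals[day][0] += 1
    return {
        day: passed / total
        for day, (passed, total) in sorted(totals.items())
    }


def is_degrading(rates, window=3):
    """True if the pass rate fell on each of the last `window` days."""
    recent = list(rates.values())[-window:]
    return len(recent) == window and all(
        earlier > later for earlier, later in zip(recent, recent[1:])
    )


results = (
    [("2024-01-01", "Success")] * 4
    + [("2024-01-02", "Success")] * 3 + [("2024-01-02", "Failed")]
    + [("2024-01-03", "Success")] * 2 + [("2024-01-03", "Failed")] * 2
)
rates = daily_pass_rates(results)
print(rates)
print(is_degrading(rates))
```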
6. Document Test Rationale¶
Explain why each test matters and what failure indicates.
7. Integrate with Pipelines¶
Run test suites automatically after pipeline execution to catch issues immediately.
8. Establish Ownership¶
Assign data asset owners who are responsible for fixing quality issues.
Next Steps¶
- Explore specifications - Click through each entity above
- See test examples - Check out data quality examples
- Integration guide - Learn how to implement quality tests
- Built-in tests - Review the complete test definition library