Storage Assets¶
Object storage, file systems, and cloud document services
Storage assets represent two distinct types of storage systems:
- Storage Services: Object storage (S3, GCS, Azure Blob) and distributed file systems (HDFS, NFS) for data lakes and large-scale data storage
- Drive Services: Cloud document management platforms (Google Drive, OneDrive, SharePoint) for collaborative documents, spreadsheets, and files
Both types organize data hierarchically but serve different purposes.
Hierarchy Overview¶
graph TD
A[StorageService<br/>S3, GCS, Azure Blob] --> B1[Container:<br/>raw-events]
A --> B2[Container:<br/>curated-datasets]
A --> B3[Container:<br/>ml-models]
B1 --> C1[Folder: year=2024]
C1 --> C2[Folder: month=11]
C2 --> C3[Folder: day=22]
C3 --> D1[File: events_000.parquet]
C3 --> D2[File: events_001.parquet]
C3 --> D3[File: events_002.parquet]
B2 --> E1[Folder: sales]
E1 --> E2[File: transactions.parquet<br/>Schema: id, amount, date]
B3 --> F1[Folder: customer-churn]
F1 --> F2[File: model.pkl]
F1 --> F3[File: features.json]
D1 -.->|format| G1[Parquet]
D1 -.->|schema| G2[userId, eventType, timestamp]
E2 -.->|partitioned by| G3[date]
B1 -.->|consumed by| G4[ETL Pipeline]
B2 -.->|consumed by| G5[BI Dashboard]
style A fill:#667eea,color:#fff
style B1 fill:#764ba2,color:#fff
style B2 fill:#764ba2,color:#fff
style B3 fill:#764ba2,color:#fff
style C1 fill:#f093fb,color:#fff
style C2 fill:#f093fb,color:#fff
style C3 fill:#f093fb,color:#fff
style D1 fill:#4facfe,color:#fff
style D2 fill:#4facfe,color:#fff
style D3 fill:#4facfe,color:#fff
style E1 fill:#f093fb,color:#fff
style E2 fill:#4facfe,color:#fff
style F1 fill:#f093fb,color:#fff
style F2 fill:#4facfe,color:#fff
style F3 fill:#4facfe,color:#fff
style G1 fill:#43e97b,color:#fff
style G2 fill:#43e97b,color:#fff
style G3 fill:#43e97b,color:#fff
style G4 fill:#00f2fe,color:#fff
style G5 fill:#00f2fe,color:#fff
Why This Hierarchy?¶
Storage Service¶
Purpose: Represents the cloud storage platform or account
A Storage Service is the platform that hosts object storage containers and buckets. It contains configuration for connecting to the storage provider and discovering containers.
Examples:
- s3-data-lake - AWS S3 for data lake
- azure-blob-analytics - Azure Blob Storage for analytics
- gcs-archive - Google Cloud Storage for archival
- minio-local - MinIO for local object storage
- hdfs-production - HDFS cluster for big data processing
Why needed: Organizations use multiple storage platforms across cloud providers (AWS S3, Azure Blob, GCS) and have multiple accounts or regions for different purposes. The service level groups containers by platform and account, making it easy to manage connections and understand storage organization.
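To make the "discovering containers" idea concrete, here is a minimal Python sketch of what container discovery might look like for an S3-backed storage service, using boto3. It is illustrative only: credentials come from the environment, and it is not the OpenMetadata connector code.

```python
import boto3

# Enumerate the buckets (containers) visible to one storage service
# connection. Credentials are resolved from the environment here; a real
# connector would use the service's stored connection configuration.
s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"], bucket["CreationDate"])
```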
Supported Platforms:
- Object Storage: AWS S3, Azure Blob Storage, Azure Data Lake Storage (ADLS), Google Cloud Storage, MinIO, Alibaba Cloud OSS, IBM Cloud Object Storage
- Distributed File Systems: HDFS, NFS, Ceph
Note: Cloud document platforms (Google Drive, OneDrive, SharePoint) are modeled separately as Drive Services.
View Storage Service Specification →
Container¶
Purpose: Represents a bucket or container holding files and folders
A Container (a "bucket" in S3 and GCS, a "container" in Azure Blob) is a top-level namespace that holds objects organized in a hierarchical folder structure.
Examples:
- raw-events - Raw event data from applications
- curated-datasets - Cleaned and processed datasets
- ml-models - Machine learning model artifacts
- data-warehouse-staging - Staging area for warehouse loads
- logs-archive - Historical log files
- backup-snapshots - Database backups and snapshots
Key Metadata:
- Structure: Folder hierarchy and file organization
- File Types: Parquet, CSV, JSON, Avro, images, videos, etc.
- Schema: Detected schema for structured files (Parquet, Avro)
- Size: Total storage size and object count
- Partitioning: Date-based or custom partitioning schemes
- Lifecycle Policies: Retention, archival, deletion rules
- Access Control: Permissions and encryption settings
- Lineage: Source → Container → Processing pipelines
- Tags: Department, sensitivity, compliance classifications
Why needed: Containers are the primary organizational unit in object storage. Tracking them enables:
- Understanding data lake organization and structure
- Schema discovery for structured file formats
- Impact analysis (which pipelines read from this container?)
- Data governance (PII detection, compliance)
- Cost optimization (storage usage patterns)
- Data quality monitoring
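As a rough illustration of the structure and size metadata listed above, the sketch below summarizes a container's top-level folders plus the objects at its root, using boto3. The bucket name raw-events matches the earlier example but is hypothetical.

```python
import boto3

# Summarize one container: top-level "folders" (common prefixes) and the
# count/size of objects sitting at the container root.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

root_bytes, root_objects, folders = 0, 0, set()
for page in paginator.paginate(Bucket="raw-events", Delimiter="/"):
    for prefix in page.get("CommonPrefixes", []):
        folders.add(prefix["Prefix"])        # top-level folder prefixes
    for obj in page.get("Contents", []):     # objects directly at the root
        root_objects += 1
        root_bytes += obj["Size"]

print(sorted(folders))
print(f"{root_objects} root objects, {root_bytes / 1e9:.2f} GB at the root")
```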
View Container Specification →
Drive Service¶
Purpose: Cloud document management and file sharing platforms
Drive Services represent cloud-based document platforms like Google Drive, OneDrive, and SharePoint. Unlike object storage (S3, GCS), Drive Services are designed for collaborative documents, spreadsheets, and presentations with features like real-time editing, version history, and sharing.
Examples:
- google-drive-marketing - Google Drive for marketing team
- onedrive-finance - OneDrive for finance documents
- sharepoint-hr - SharePoint for HR document libraries
- dropbox-engineering - Dropbox for engineering team
Key Metadata:
- Directories/Folders: Top-level folders and nested structure
- Files: Documents, Spreadsheets, Presentations
- Spreadsheets: Excel/Google Sheets with multiple worksheets
- Worksheets: Individual sheets/tabs in spreadsheets with schemas
- Sharing & Permissions: Access control and collaboration
- Version History: Track changes and revisions
- Lineage: Spreadsheet → Data Pipeline → Table
- Tags: Department, sensitivity, compliance
Hierarchy:
Drive Service (Google Drive, OneDrive, SharePoint)
└── Directory/Folder
├── Spreadsheet
│ └── Worksheet (Sheet1, Sheet2, etc.)
├── Document (Word, Google Docs, PDF)
├── Presentation (PowerPoint, Slides)
└── Other Files (CSV, Images, etc.)
Why needed: Many organizations use spreadsheets in Google Drive or SharePoint as data sources for analytics. Tracking these enables:
- Lineage: Track which dashboards and pipelines consume spreadsheet data
- Schema Discovery: Understand worksheet structure and columns
- Data Governance: Tag PII in collaborative documents
- Impact Analysis: Know which teams use which shared files
Drive Service Entities:
Drive Service → Directory → Spreadsheet → Worksheet
File Organization Patterns¶
Storage containers typically organize files in hierarchical folder structures:
Date-Partitioned Events¶
raw-events/
├── year=2024/
│ ├── month=01/
│ │ ├── day=01/
│ │ │ ├── events_000.parquet
│ │ │ ├── events_001.parquet
│ │ ├── day=02/
│ ├── month=02/
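A minimal pyarrow sketch of producing the Hive-style year=/month=/day= layout shown above; the paths and columns are illustrative, and partition values are written as-is (e.g., month=1 rather than month=01).

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Example event data with explicit partition columns.
table = pa.table({
    "userId": ["u1", "u2"],
    "eventType": ["click", "view"],
    "year": [2024, 2024],
    "month": [1, 1],
    "day": [1, 2],
})

# Writes raw-events/year=2024/month=1/day=1/<file>.parquet, etc.
pq.write_to_dataset(table, root_path="raw-events",
                    partition_cols=["year", "month", "day"])
```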
Domain-Based Organization¶
data-lake/
├── sales/
│ ├── transactions/
│ ├── customers/
├── marketing/
│ ├── campaigns/
│ ├── analytics/
├── operations/
Processing Stage Organization¶
analytics/
├── raw/
│ ├── source_system_1/
│ ├── source_system_2/
├── staging/
│ ├── cleaned_data/
├── curated/
│ ├── aggregated_metrics/
Common Patterns¶
Pattern 1: S3 Data Lake¶
S3 Service → raw-events Container → year=2024/month=11/day=22/events.parquet
→ File Format: Parquet
→ Schema: userId, eventType, timestamp
→ Partitioning: Date-based (year/month/day)
Event data organized by date with automatic schema detection.
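Reading the pattern back looks roughly like the following pyarrow sketch: point a dataset reader at the container path, let it infer the schema (including partition keys), and rely on partition pruning for date-range reads. The S3 URI is an example and credentials are assumed to come from the environment.

```python
import pyarrow.dataset as ds

# Infer the unified schema from Parquet metadata plus hive-style partitions.
dataset = ds.dataset("s3://raw-events/", format="parquet", partitioning="hive")
print(dataset.schema)  # userId, eventType, timestamp, year, month, day

# Partition pruning: only files under day=22 are scanned.
day_slice = dataset.to_table(filter=(ds.field("day") == 22))
```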
Pattern 2: Azure Blob Analytics Storage¶
Azure Blob Service → curated-datasets Container → sales/
→ marketing/
→ operations/
→ File Format: CSV, Parquet
→ Lifecycle: Archive after 90 days
Departmental data organization with lifecycle management.
Pattern 3: GCS ML Artifacts¶
GCS Service → ml-models Container → customer-churn/
│ ├── model.pkl
│ ├── features.json
│ └── metrics.json
→ fraud-detection/
→ recommendation-engine/
ML model artifacts organized by model name with metadata.
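Publishing artifacts into this layout could look like the following google-cloud-storage sketch; the bucket and model names mirror the example above but are illustrative.

```python
from google.cloud import storage

# Upload model artifacts under a per-model prefix in the ml-models bucket.
client = storage.Client()
bucket = client.bucket("ml-models")

for artifact in ["model.pkl", "features.json", "metrics.json"]:
    bucket.blob(f"customer-churn/{artifact}").upload_from_filename(artifact)
```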
Real-World Example¶
Here's how a data platform team uses object storage for their data lake:
graph LR
A[Application APIs] -->|Stream| B[Kafka Topics]
B -->|Firehose| C[S3 raw-events<br/>Container]
C -->|Spark ETL| D[S3 curated-datasets<br/>Container]
D -->|Athena Query| E[BI Dashboard]
D -->|Read| F[ML Training Pipeline]
C -.->|Format| G[Parquet files]
C -.->|Partitioning| H[year/month/day]
C -.->|Schema| I[userId, eventType, timestamp]
D -.->|Format| J[Parquet files]
D -.->|Tags| K[PII, Analytics]
D -.->|Quality Tests| L[Completeness, Schema validation]
style A fill:#0061f2,color:#fff
style B fill:#f093fb,color:#fff
style C fill:#6900c7,color:#fff
style D fill:#00ac69,color:#fff
style E fill:#f5576c,color:#fff
style F fill:#4facfe,color:#fff
Flow:
1. Ingestion: Application events stream to Kafka, then land in the S3 raw-events container
2. Raw Storage: Events stored as Parquet files, partitioned by date
3. Schema Detection: Automatic schema inference from Parquet metadata
4. Processing: Spark ETL reads from raw, writes to the curated-datasets container
5. Consumption: BI tools query curated data with Athena, ML pipelines read for training
6. Governance: PII tags applied, quality tests validate schema and completeness
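Step 4 (raw to curated) might look like the rough PySpark sketch below. Paths, column names, and the deduplication/partitioning choices are illustrative; the exact URI scheme (s3:// vs s3a://) depends on the environment.

```python
from pyspark.sql import SparkSession, functions as F

# Read raw events, deduplicate, derive a date column, and write curated,
# date-partitioned Parquet into the curated container.
spark = SparkSession.builder.appName("raw-to-curated").getOrCreate()

raw = spark.read.parquet("s3://raw-events/year=2024/month=11/")
curated = (
    raw.dropDuplicates(["userId", "eventType", "timestamp"])
       .withColumn("event_date", F.to_date("timestamp"))
)
curated.write.mode("overwrite") \
       .partitionBy("event_date") \
       .parquet("s3://curated-datasets/events/")
```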
Benefits:
- Lineage: Trace data from application → Kafka → S3 → ETL → Curated → Analytics
- Schema Management: Automatic schema detection for Parquet/Avro files
- Impact Analysis: Know which ETL jobs and BI tools depend on container data
- Cost Optimization: Track storage usage, apply lifecycle policies
- Data Quality: Monitor schema drift, validate data completeness
Storage Lineage¶
Storage containers create lineage connections across data pipelines:
graph TD
A[MySQL Database] --> P1[Export Pipeline]
B[Application Logs] --> P2[Log Aggregation]
P1 --> C[S3 raw-exports]
P2 --> C
C --> P3[ETL Pipeline]
P3 --> D[S3 curated-data]
D --> P4[Analytics Pipeline]
D --> P5[ML Training]
P4 --> E[Snowflake Tables]
P5 --> F[ML Models]
style C fill:#6900c7,color:#fff
style D fill:#00ac69,color:#fff
style P1 fill:#f5576c,color:#fff
style P2 fill:#f5576c,color:#fff
style P3 fill:#f5576c,color:#fff
style P4 fill:#f5576c,color:#fff
style P5 fill:#f5576c,color:#fff
Source → Raw Storage → Processing → Curated Storage → Consumption
File Format Support¶
OpenMetadata detects schemas and metadata from various file formats:
Structured Formats (Schema Detection)¶
- Parquet: Columnar format with embedded schema
- Avro: Row-based format with schema
- ORC: Optimized columnar format
- Delta Lake: ACID transactions on Parquet
- Iceberg: Table format for big data
Semi-Structured Formats¶
- JSON: Nested data structures
- JSONL/NDJSON: Line-delimited JSON
- CSV: Tabular data with headers
- TSV: Tab-separated values
- XML: Hierarchical markup
Unstructured Formats¶
- Images: PNG, JPEG, TIFF
- Videos: MP4, AVI, MOV
- Audio: MP3, WAV, FLAC
- Documents: PDF, DOCX, TXT
- Archives: ZIP, TAR, GZIP
Schema Inference: For structured formats (Parquet, Avro, ORC), OpenMetadata automatically extracts:
- Column names and data types
- Nested structures
- Partitioning schemes
- Statistics (row counts, null counts)
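For a sense of where this metadata comes from, the pyarrow sketch below pulls schema, row count, and per-column null counts from a single Parquet file's footer. The file path is an example; this is not the connector implementation.

```python
import pyarrow.parquet as pq

# Read schema and statistics from the Parquet file footer only.
pf = pq.ParquetFile("events_000.parquet")

print(pf.schema_arrow)                # column names and types, incl. nested fields
print("rows:", pf.metadata.num_rows)  # row count without scanning data

# Per-column null counts from the first row group's statistics (if present).
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    if col.statistics is not None:
        print(col.path_in_schema, "nulls:", col.statistics.null_count)
```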
Partitioning Strategies¶
Storage containers often use partitioning for performance and organization:
Time-Based Partitioning¶
- Use Case: Event data, logs, time-series
- Benefits: Efficient date range queries, lifecycle management
- Format: Hive-style partitioning (key=value), e.g., year=2024/month=11/day=22 (see the parsing sketch below)
Geography-Based Partitioning¶
- Use Case: Distributed operations, compliance (data residency)
- Benefits: Regional queries, GDPR compliance
Entity-Based Partitioning¶
s3://data-lake/entity=customers/version=v2/customers.parquet
s3://data-lake/entity=orders/version=v2/orders.parquet
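Hive-style keys are easy to recover from an object key, as in this small Python sketch that works for either the time-based or entity-based paths above.

```python
def parse_partitions(key: str) -> dict:
    """Extract key=value path segments from a Hive-style object key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)

print(parse_partitions("entity=orders/version=v2/orders.parquet"))
# {'entity': 'orders', 'version': 'v2'}
print(parse_partitions("year=2024/month=11/day=22/events_000.parquet"))
# {'year': '2024', 'month': '11', 'day': '22'}
```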
Storage Security and Governance¶
Track important security and compliance metadata:
Access Control¶
- IAM Policies: Who can read/write to containers
- Bucket Policies: Resource-based permissions
- Encryption: At-rest (SSE-S3, SSE-KMS) and in-transit (TLS)
- Versioning: Object version history
Data Classification¶
- PII Detection: Automatically tag sensitive data
- Compliance Tags: GDPR, HIPAA, PCI-DSS classifications
- Data Retention: Lifecycle policies for deletion and archival
- Access Logs: Track who accessed which objects
Cost Management¶
- Storage Class: Standard, Infrequent Access, Glacier, Archive
- Lifecycle Transitions: Automatic tiering based on age
- Storage Analytics: Usage patterns and optimization opportunities
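Lifecycle tiering and expiration are usually configured per bucket; the boto3 sketch below applies one such rule as an example. The bucket name, prefixes, and day counts are illustrative, not recommendations.

```python
import boto3

# Tier objects to cheaper storage classes by age, then expire them.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="logs-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire",
                "Filter": {"Prefix": ""},   # apply to all objects
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```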
Container Types¶
Different container configurations for different use cases:
Data Lake Raw Zone¶
{
"name": "raw-events",
"purpose": "Landing zone for raw ingestion",
"fileFormat": ["Parquet", "JSON"],
"partitioning": "year/month/day/hour",
"retention": "90 days",
"encryption": "SSE-KMS"
}
Data Lake Curated Zone¶
{
"name": "curated-datasets",
"purpose": "Cleaned and transformed data",
"fileFormat": ["Parquet", "Delta Lake"],
"partitioning": "entity/version/date",
"retention": "7 years",
"dataQuality": ["Schema validation", "Completeness checks"]
}
ML Artifacts Storage¶
{
"name": "ml-models",
"purpose": "Model artifacts and metadata",
"fileFormat": ["PKL", "ONNX", "JSON"],
"versioning": "Enabled",
"encryption": "SSE-KMS"
}
Entity Specifications¶
Storage Services (Object Storage)¶
| Entity | Description | Specification |
|---|---|---|
| Storage Service | Cloud storage platform (S3, GCS, Azure Blob) | View Spec |
| Container | Bucket or container | View Spec |
| File | Individual files in containers | View Spec |
Drive Services (Document Platforms)¶
| Entity | Description | Specification |
|---|---|---|
| Drive Service | Cloud document platform (Google Drive, OneDrive, SharePoint) | View Spec |
| Directory | Folders and directories | View Spec |
| Spreadsheet | Spreadsheet files (Excel, Google Sheets) | View Spec |
| Worksheet | Individual sheets/tabs within spreadsheets | View Spec |
Each specification includes:
- Complete field reference
- JSON Schema definition
- RDF/OWL ontology representation
- JSON-LD context and examples
- Integration with storage/drive platforms
Supported Storage Platforms¶
OpenMetadata supports metadata extraction from:
- Amazon S3 - Scalable object storage
- Azure Blob Storage - Object storage for Azure
- Azure Data Lake Storage (ADLS) - Hierarchical data lake storage
- Google Cloud Storage - Unified object storage for Google Cloud
- MinIO - High-performance S3-compatible storage
- Alibaba Cloud OSS - Object storage service
- IBM Cloud Object Storage - Distributed storage system
- Oracle Cloud Object Storage - Object storage for Oracle Cloud
- Cloudflare R2 - S3-compatible edge storage
- Wasabi - Hot cloud storage
Storage Integration Patterns¶
Pattern: S3 + AWS Glue Data Catalog¶
S3 Service → raw-events Container → AWS Glue Crawler discovers schema
→ Glue Data Catalog stores metadata
→ Athena queries using catalog
Serverless architecture with automatic schema discovery.
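Once the crawler has run, the registered schema can be read back from the Glue Data Catalog, which is the same metadata Athena uses at query time. A minimal boto3 sketch, with hypothetical database and table names:

```python
import boto3

# Fetch the table definition a Glue crawler registered for the container.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="raw_events_db", Name="events")["Table"]

for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
print("partitions:", [k["Name"] for k in table.get("PartitionKeys", [])])
```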
Pattern: Azure Blob + Databricks¶
Azure Blob Service → curated-data Container → Databricks mounts container
→ Delta Lake tables on blob storage
→ Unity Catalog manages metadata
Lakehouse architecture with unified governance.
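From a Databricks notebook, reading a Delta table that lives in the blob container looks roughly like this sketch. The abfss path is illustrative, authentication is assumed to be configured (for example via Unity Catalog or a storage credential), and spark is the session Databricks provides.

```python
# Read a Delta table stored in the curated-data container and query it.
df = spark.read.format("delta").load(
    "abfss://curated-data@myaccount.dfs.core.windows.net/sales/transactions"
)
df.createOrReplaceTempView("transactions")
spark.sql("SELECT date, SUM(amount) AS total FROM transactions GROUP BY date").show()
```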
Pattern: GCS + BigQuery External Tables¶
GCS Service → analytics-data Container → BigQuery external tables reference GCS
→ Query Parquet/Avro without loading
→ Materialize to BigQuery for performance
Hybrid architecture querying data in place.
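Registering such an external table with the google-cloud-bigquery client could look like the sketch below; the project, dataset, and GCS URI are examples.

```python
from google.cloud import bigquery

# Define a BigQuery external table over Parquet files in the GCS container,
# so queries read the data in place.
client = bigquery.Client()

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://analytics-data/sales/*.parquet"]

table = bigquery.Table("my-project.analytics.sales_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```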
Data Lake Architecture¶
Storage containers are central to modern data lake architectures:
graph TD
A[Streaming Sources] --> B[Bronze Layer<br/>Raw Container]
C[Batch Sources] --> B
B --> D[ETL Pipelines]
D --> E[Silver Layer<br/>Cleaned Container]
E --> F[Transformation Pipelines]
F --> G[Gold Layer<br/>Curated Container]
G --> H[BI Dashboards]
G --> I[ML Models]
G --> J[Data Warehouse]
style B fill:#f5576c,color:#fff
style E fill:#6900c7,color:#fff
style G fill:#00ac69,color:#fff
Multi-Hop Architecture:
1. Bronze: Raw data, minimal processing
2. Silver: Cleaned, validated, deduplicated
3. Gold: Business-level aggregates, ready for consumption
Next Steps¶
- Explore specifications - Click through each entity above
- See storage lineage - Check out lineage from storage to analytics
- Data lake patterns - Learn about modern data lake architectures