Storage Assets¶
Object storage, file systems, and cloud document services
Storage assets represent two distinct types of storage systems:
- Storage Services: Object storage (S3, GCS, Azure Blob) and distributed file systems (HDFS, NFS) for data lakes and large-scale data storage
- Drive Services: Cloud document management platforms (Google Drive, OneDrive, SharePoint) for collaborative documents, spreadsheets, and files
Both types organize data hierarchically but serve different purposes.
Hierarchy Overview¶
graph TD
A[StorageService<br/>S3, GCS, Azure Blob] --> B1[Container:<br/>raw-events]
A --> B2[Container:<br/>curated-datasets]
A --> B3[Container:<br/>ml-models]
B1 --> C1[Folder: year=2024]
C1 --> C2[Folder: month=11]
C2 --> C3[Folder: day=22]
C3 --> D1[File: events_000.parquet]
C3 --> D2[File: events_001.parquet]
C3 --> D3[File: events_002.parquet]
B2 --> E1[Folder: sales]
E1 --> E2[File: transactions.parquet<br/>Schema: id, amount, date]
B3 --> F1[Folder: customer-churn]
F1 --> F2[File: model.pkl]
F1 --> F3[File: features.json]
D1 -.->|format| G1[Parquet]
D1 -.->|schema| G2[userId, eventType, timestamp]
E2 -.->|partitioned by| G3[date]
B1 -.->|consumed by| G4[ETL Pipeline]
B2 -.->|consumed by| G5[BI Dashboard]
style A fill:#667eea,color:#fff
style B1 fill:#764ba2,color:#fff
style B2 fill:#764ba2,color:#fff
style B3 fill:#764ba2,color:#fff
style C1 fill:#f093fb,color:#fff
style C2 fill:#f093fb,color:#fff
style C3 fill:#f093fb,color:#fff
style D1 fill:#4facfe,color:#fff
style D2 fill:#4facfe,color:#fff
style D3 fill:#4facfe,color:#fff
style E1 fill:#f093fb,color:#fff
style E2 fill:#4facfe,color:#fff
style F1 fill:#f093fb,color:#fff
style F2 fill:#4facfe,color:#fff
style F3 fill:#4facfe,color:#fff
style G1 fill:#43e97b,color:#fff
style G2 fill:#43e97b,color:#fff
style G3 fill:#43e97b,color:#fff
style G4 fill:#00f2fe,color:#fff
style G5 fill:#00f2fe,color:#fff
Why This Hierarchy?¶
Storage Service¶
Purpose: Represents the cloud storage platform or account
A Storage Service is the platform that hosts object storage containers and buckets. It contains configuration for connecting to the storage provider and discovering containers.
Examples:
- s3-data-lake - AWS S3 for data lake
- azure-blob-analytics - Azure Blob Storage for analytics
- gcs-archive - Google Cloud Storage for archival
- minio-local - MinIO for local object storage
- hdfs-production - HDFS cluster for big data processing
Why needed: Organizations use multiple storage platforms across cloud providers (AWS S3, Azure Blob, GCS) and have multiple accounts or regions for different purposes. The service level groups containers by platform and account, making it easy to manage connections and understand storage organization.
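To make the "discovering containers" idea concrete, here is a minimal Python sketch of what container discovery might look like for an S3-backed storage service, using boto3. It is illustrative only: credentials come from the environment, and it is not the OpenMetadata connector code.

```python
import boto3

# Enumerate the buckets (containers) visible to one storage service
# connection. Credentials are resolved from the environment here; a real
# connector would use the service's stored connection configuration.
s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"], bucket["CreationDate"])
```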
Supported Platforms:
- Object Storage: AWS S3, Azure Blob Storage, Azure Data Lake Storage (ADLS), Google Cloud Storage, MinIO, Alibaba Cloud OSS, IBM Cloud Object Storage
- Distributed File Systems: HDFS, NFS, Ceph
Note: Cloud document platforms (Google Drive, OneDrive, SharePoint) are modeled separately as Drive Services.
View Storage Service Specification →
Container¶
Purpose: Represents a bucket or container holding files and folders
A Container (a "bucket" in S3 and GCS, a "container" in Azure Blob) is a top-level namespace that holds objects organized in a hierarchical folder structure.
Examples:
- raw-events - Raw event data from applications
- curated-datasets - Cleaned and processed datasets
- ml-models - Machine learning model artifacts
- data-warehouse-staging - Staging area for warehouse loads
- logs-archive - Historical log files
- backup-snapshots - Database backups and snapshots
Key Metadata:
- Structure: Folder hierarchy and file organization
- File Types: Parquet, CSV, JSON, Avro, images, videos, etc.
- Schema: Detected schema for structured files (Parquet, Avro)
- Size: Total storage size and object count
- Partitioning: Date-based or custom partitioning schemes
- Lifecycle Policies: Retention, archival, deletion rules
- Access Control: Permissions and encryption settings
- Lineage: Source → Container → Processing pipelines
- Tags: Department, sensitivity, compliance classifications
Why needed: Containers are the primary organizational unit in object storage. Tracking them enables:
- Understanding data lake organization and structure
- Schema discovery for structured file formats
- Impact analysis (which pipelines read from this container?)
- Data governance (PII detection, compliance)
- Cost optimization (storage usage patterns)
- Data quality monitoring
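As a rough illustration of the structure and size metadata listed above, the sketch below summarizes a container's top-level folders plus the objects at its root, using boto3. The bucket name raw-events matches the earlier example but is hypothetical.

```python
import boto3

# Summarize one container: top-level "folders" (common prefixes) and the
# count/size of objects sitting at the container root.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

root_bytes, root_objects, folders = 0, 0, set()
for page in paginator.paginate(Bucket="raw-events", Delimiter="/"):
    for prefix in page.get("CommonPrefixes", []):
        folders.add(prefix["Prefix"])        # top-level folder prefixes
    for obj in page.get("Contents", []):     # objects directly at the root
        root_objects += 1
        root_bytes += obj["Size"]

print(sorted(folders))
print(f"{root_objects} root objects, {root_bytes / 1e9:.2f} GB at the root")
```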
View Container Specification →
Drive Service¶
Purpose: Cloud document management and file sharing platforms
Drive Services represent cloud-based document platforms like Google Drive, OneDrive, and SharePoint. Unlike object storage (S3, GCS), Drive Services are designed for collaborative documents, spreadsheets, and presentations with features like real-time editing, version history, and sharing.
Examples:
- google-drive-marketing - Google Drive for marketing team
- onedrive-finance - OneDrive for finance documents
- sharepoint-hr - SharePoint for HR document libraries
- dropbox-engineering - Dropbox for engineering team
Key Metadata:
- Directories/Folders: Top-level folders and nested structure
- Files: Documents, Spreadsheets, Presentations
- Spreadsheets: Excel/Google Sheets with multiple worksheets
- Worksheets: Individual sheets/tabs in spreadsheets with schemas
- Sharing & Permissions: Access control and collaboration
- Version History: Track changes and revisions
- Lineage: Spreadsheet → Data Pipeline → Table
- Tags: Department, sensitivity, compliance
Hierarchy:
Drive Service (Google Drive, OneDrive, SharePoint)
└── Directory/Folder
├── Spreadsheet
│ └── Worksheet (Sheet1, Sheet2, etc.)
├── Document (Word, Google Docs, PDF)
├── Presentation (PowerPoint, Slides)
└── Other Files (CSV, Images, etc.)
Why needed: Many organizations use spreadsheets in Google Drive or SharePoint as data sources for analytics. Tracking these enables:
- Lineage: Track which dashboards and pipelines consume spreadsheet data
- Schema Discovery: Understand worksheet structure and columns
- Data Governance: Tag PII in collaborative documents
- Impact Analysis: Know which teams use which shared files
Drive Service Entities:
Drive Service → Directory → Spreadsheet → Worksheet
File Organization Patterns¶
Storage containers typically organize files in hierarchical folder structures:
Date-Partitioned Events¶
raw-events/
├── year=2024/
│ ├── month=01/
│ │ ├── day=01/
│ │ │ ├── events_000.parquet
│ │ │ ├── events_001.parquet
│ │ ├── day=02/
│ ├── month=02/
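A minimal pyarrow sketch of producing the Hive-style year=/month=/day= layout shown above; the paths and columns are illustrative, and partition values are written as-is (e.g., month=1 rather than month=01).

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Example event data with explicit partition columns.
table = pa.table({
    "userId": ["u1", "u2"],
    "eventType": ["click", "view"],
    "year": [2024, 2024],
    "month": [1, 1],
    "day": [1, 2],
})

# Writes raw-events/year=2024/month=1/day=1/<file>.parquet, etc.
pq.write_to_dataset(table, root_path="raw-events",
                    partition_cols=["year", "month", "day"])
```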
Domain-Based Organization¶
data-lake/
├── sales/
│ ├── transactions/
│ ├── customers/
├── marketing/
│ ├── campaigns/
│ ├── analytics/
├── operations/
Processing Stage Organization¶
analytics/
├── raw/
│ ├── source_system_1/
│ ├── source_system_2/
├── staging/
│ ├── cleaned_data/
├── curated/
│ ├── aggregated_metrics/
Common Patterns¶
Pattern 1: S3 Data Lake¶
S3 Service → raw-events Container → year=2024/month=11/day=22/events.parquet
→ File Format: Parquet
→ Schema: userId, eventType, timestamp
→ Partitioning: Date-based (year/month/day)
Event data organized by date with automatic schema detection.
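Reading the pattern back looks roughly like the following pyarrow sketch: point a dataset reader at the container path, let it infer the schema (including partition keys), and rely on partition pruning for date-range reads. The S3 URI is an example and credentials are assumed to come from the environment.

```python
import pyarrow.dataset as ds

# Infer the unified schema from Parquet metadata plus hive-style partitions.
dataset = ds.dataset("s3://raw-events/", format="parquet", partitioning="hive")
print(dataset.schema)  # userId, eventType, timestamp, year, month, day

# Partition pruning: only files under day=22 are scanned.
day_slice = dataset.to_table(filter=(ds.field("day") == 22))
```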
Pattern 2: Azure Blob Analytics Storage¶
Azure Blob Service → curated-datasets Container → sales/
→ marketing/
→ operations/
→ File Format: CSV, Parquet
→ Lifecycle: Archive after 90 days
Departmental data organization with lifecycle management.
Pattern 3: GCS ML Artifacts¶
GCS Service → ml-models Container → customer-churn/
│ ├── model.pkl
│ ├── features.json
│ └── metrics.json
→ fraud-detection/
→ recommendation-engine/
ML model artifacts organized by model name with metadata.
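Publishing artifacts into this layout could look like the following google-cloud-storage sketch; the bucket and model names mirror the example above but are illustrative.

```python
from google.cloud import storage

# Upload model artifacts under a per-model prefix in the ml-models bucket.
client = storage.Client()
bucket = client.bucket("ml-models")

for artifact in ["model.pkl", "features.json", "metrics.json"]:
    bucket.blob(f"customer-churn/{artifact}").upload_from_filename(artifact)
```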
Real-World Example¶
Here's how a data platform team uses object storage for their data lake:
graph LR
A[Application APIs] -->|Stream| B[Kafka Topics]
B -->|Firehose| C[S3 raw-events<br/>Container]
C -->|Spark ETL| D[S3 curated-datasets<br/>Container]
D -->|Athena Query| E[BI Dashboard]
D -->|Read| F[ML Training Pipeline]
C -.->|Format| G[Parquet files]
C -.->|Partitioning| H[year/month/day]
C -.->|Schema| I[userId, eventType, timestamp]
D -.->|Format| J[Parquet files]
D -.->|Tags| K[PII, Analytics]
D -.->|Quality Tests| L[Completeness, Schema validation]
style A fill:#0061f2,color:#fff
style B fill:#f093fb,color:#fff
style C fill:#6900c7,color:#fff
style D fill:#00ac69,color:#fff
style E fill:#f5576c,color:#fff
style F fill:#4facfe,color:#fff
Flow:
1. Ingestion: Application events stream to Kafka, then land in the S3 raw-events container
2. Raw Storage: Events stored as Parquet files, partitioned by date
3. Schema Detection: Automatic schema inference from Parquet metadata
4. Processing: Spark ETL reads from raw, writes to the curated-datasets container
5. Consumption: BI tools query curated data with Athena, ML pipelines read for training
6. Governance: PII tags applied, quality tests validate schema and completeness
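Step 4 (raw to curated) might look like the rough PySpark sketch below. Paths, column names, and the deduplication/partitioning choices are illustrative; the exact URI scheme (s3:// vs s3a://) depends on the environment.

```python
from pyspark.sql import SparkSession, functions as F

# Read raw events, deduplicate, derive a date column, and write curated,
# date-partitioned Parquet into the curated container.
spark = SparkSession.builder.appName("raw-to-curated").getOrCreate()

raw = spark.read.parquet("s3://raw-events/year=2024/month=11/")
curated = (
    raw.dropDuplicates(["userId", "eventType", "timestamp"])
       .withColumn("event_date", F.to_date("timestamp"))
)
curated.write.mode("overwrite") \
       .partitionBy("event_date") \
       .parquet("s3://curated-datasets/events/")
```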
Benefits:
- Lineage: Trace data from application → Kafka → S3 → ETL → Curated → Analytics
- Schema Management: Automatic schema detection for Parquet/Avro files
- Impact Analysis: Know which ETL jobs and BI tools depend on container data
- Cost Optimization: Track storage usage, apply lifecycle policies
- Data Quality: Monitor schema drift, validate data completeness
Storage Lineage¶
Storage containers create lineage connections across data pipelines:
graph TD
A[MySQL Database] --> P1[Export Pipeline]
B[Application Logs] --> P2[Log Aggregation]
P1 --> C[S3 raw-exports]
P2 --> C
C --> P3[ETL Pipeline]
P3 --> D[S3 curated-data]
D --> P4[Analytics Pipeline]
D --> P5[ML Training]
P4 --> E[Snowflake Tables]
P5 --> F[ML Models]
style C fill:#6900c7,color:#fff
style D fill:#00ac69,color:#fff
style P1 fill:#f5576c,color:#fff
style P2 fill:#f5576c,color:#fff
style P3 fill:#f5576c,color:#fff
style P4 fill:#f5576c,color:#fff
style P5 fill:#f5576c,color:#fff
Source → Raw Storage → Processing → Curated Storage → Consumption
File Format Support¶
OpenMetadata detects schemas and metadata from various file formats:
Structured Formats (Schema Detection)¶
- Parquet: Columnar format with embedded schema
- Avro: Row-based format with schema
- ORC: Optimized columnar format
- Delta Lake: ACID transactions on Parquet
- Iceberg: Table format for big data
Semi-Structured Formats¶
- JSON: Nested data structures
- JSONL/NDJSON: Line-delimited JSON
- CSV: Tabular data with headers
- TSV: Tab-separated values
- XML: Hierarchical markup
Unstructured Formats¶
- Images: PNG, JPEG, TIFF
- Videos: MP4, AVI, MOV
- Audio: MP3, WAV, FLAC
- Documents: PDF, DOCX, TXT
- Archives: ZIP, TAR, GZIP
Schema Inference: For structured formats (Parquet, Avro, ORC), OpenMetadata automatically extracts:
- Column names and data types
- Nested structures
- Partitioning schemes
- Statistics (row counts, null counts)
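For a sense of where this metadata comes from, the pyarrow sketch below pulls schema, row count, and per-column null counts from a single Parquet file's footer. The file path is an example; this is not the connector implementation.

```python
import pyarrow.parquet as pq

# Read schema and statistics from the Parquet file footer only.
pf = pq.ParquetFile("events_000.parquet")

print(pf.schema_arrow)                # column names and types, incl. nested fields
print("rows:", pf.metadata.num_rows)  # row count without scanning data

# Per-column null counts from the first row group's statistics (if present).
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    if col.statistics is not None:
        print(col.path_in_schema, "nulls:", col.statistics.null_count)
```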
Partitioning Strategies¶
Storage containers often use partitioning for performance and organization:
Time-Based Partitioning¶
- Use Case: Event data, logs, time-series
- Benefits: Efficient date range queries, lifecycle management
- Format: Hive-style partitioning (key=value), e.g., year=2024/month=11/day=22 (see the parsing sketch below)
Geography-Based Partitioning¶
- Use Case: Distributed operations, compliance (data residency)
- Benefits: Regional queries, GDPR compliance
Entity-Based Partitioning¶
s3://data-lake/entity=customers/version=v2/customers.parquet
s3://data-lake/entity=orders/version=v2/orders.parquet
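Hive-style keys are easy to recover from an object key, as in this small Python sketch that works for either the time-based or entity-based paths above.

```python
def parse_partitions(key: str) -> dict:
    """Extract key=value path segments from a Hive-style object key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)

print(parse_partitions("entity=orders/version=v2/orders.parquet"))
# {'entity': 'orders', 'version': 'v2'}
print(parse_partitions("year=2024/month=11/day=22/events_000.parquet"))
# {'year': '2024', 'month': '11', 'day': '22'}
```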
Storage Security and Governance¶
Track important security and compliance metadata:
Access Control¶
- IAM Policies: Who can read/write to containers
- Bucket Policies: Resource-based permissions
- Encryption: At-rest (SSE-S3, SSE-KMS) and in-transit (TLS)
- Versioning: Object version history
Data Classification¶
- PII Detection: Automatically tag sensitive data
- Compliance Tags: GDPR, HIPAA, PCI-DSS classifications
- Data Retention: Lifecycle policies for deletion and archival
- Access Logs: Track who accessed which objects
Cost Management¶
- Storage Class: Standard, Infrequent Access, Glacier, Archive
- Lifecycle Transitions: Automatic tiering based on age
- Storage Analytics: Usage patterns and optimization opportunities
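Lifecycle tiering and expiration are usually configured per bucket; the boto3 sketch below applies one such rule as an example. The bucket name, prefixes, and day counts are illustrative, not recommendations.

```python
import boto3

# Tier objects to cheaper storage classes by age, then expire them.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="logs-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire",
                "Filter": {"Prefix": ""},   # apply to all objects
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```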
Container Types¶
Different container configurations for different use cases:
Data Lake Raw Zone¶
{
"name": "raw-events",
"purpose": "Landing zone for raw ingestion",
"fileFormat": ["Parquet", "JSON"],
"partitioning": "year/month/day/hour",
"retention": "90 days",
"encryption": "SSE-KMS"
}
Data Lake Curated Zone¶
{
"name": "curated-datasets",
"purpose": "Cleaned and transformed data",
"fileFormat": ["Parquet", "Delta Lake"],
"partitioning": "entity/version/date",
"retention": "7 years",
"dataQuality": ["Schema validation", "Completeness checks"]
}
ML Artifacts Storage¶
{
"name": "ml-models",
"purpose": "Model artifacts and metadata",
"fileFormat": ["PKL", "ONNX", "JSON"],
"versioning": "Enabled",
"encryption": "SSE-KMS"
}
Entity Specifications¶
Storage Services (Object Storage)¶
| Entity | Description | Specification |
|---|---|---|
| Storage Service | Cloud storage platform (S3, GCS, Azure Blob) | View Spec |
| Container | Bucket or container | View Spec |
| File | Individual files in containers | View Spec |
Drive Services (Document Platforms)¶
| Entity | Description | Specification |
|---|---|---|
| Drive Service | Cloud document platform (Google Drive, OneDrive, SharePoint) | View Spec |
| Directory | Folders and directories | View Spec |
| Spreadsheet | Spreadsheet files (Excel, Google Sheets) | View Spec |
| Worksheet | Individual sheets/tabs within spreadsheets | View Spec |
Each specification includes:
- Complete field reference
- JSON Schema definition
- RDF/OWL ontology representation
- JSON-LD context and examples
- Integration with storage/drive platforms
Supported Storage Platforms¶
OpenMetadata supports metadata extraction from:
- Amazon S3 - Scalable object storage
- Azure Blob Storage - Object storage for Azure
- Azure Data Lake Storage (ADLS) - Hierarchical data lake storage
- Google Cloud Storage - Unified object storage for Google Cloud
- MinIO - High-performance S3-compatible storage
- Alibaba Cloud OSS - Object storage service
- IBM Cloud Object Storage - Distributed storage system
- Oracle Cloud Object Storage - Object storage for Oracle Cloud
- Cloudflare R2 - S3-compatible edge storage
- Wasabi - Hot cloud storage
Storage Integration Patterns¶
Pattern: S3 + AWS Glue Data Catalog¶
S3 Service → raw-events Container → AWS Glue Crawler discovers schema
→ Glue Data Catalog stores metadata
→ Athena queries using catalog
Serverless architecture with automatic schema discovery.
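Once the crawler has run, the registered schema can be read back from the Glue Data Catalog, which is the same metadata Athena uses at query time. A minimal boto3 sketch, with hypothetical database and table names:

```python
import boto3

# Fetch the table definition a Glue crawler registered for the container.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="raw_events_db", Name="events")["Table"]

for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
print("partitions:", [k["Name"] for k in table.get("PartitionKeys", [])])
```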
Pattern: Azure Blob + Databricks¶
Azure Blob Service → curated-data Container → Databricks mounts container
→ Delta Lake tables on blob storage
→ Unity Catalog manages metadata
Lakehouse architecture with unified governance.
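From a Databricks notebook, reading a Delta table that lives in the blob container looks roughly like this sketch. The abfss path is illustrative, authentication is assumed to be configured (for example via Unity Catalog or a storage credential), and spark is the session Databricks provides.

```python
# Read a Delta table stored in the curated-data container and query it.
df = spark.read.format("delta").load(
    "abfss://curated-data@myaccount.dfs.core.windows.net/sales/transactions"
)
df.createOrReplaceTempView("transactions")
spark.sql("SELECT date, SUM(amount) AS total FROM transactions GROUP BY date").show()
```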
Pattern: GCS + BigQuery External Tables¶
GCS Service → analytics-data Container → BigQuery external tables reference GCS
→ Query Parquet/Avro without loading
→ Materialize to BigQuery for performance
Hybrid architecture querying data in place.
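Registering such an external table with the google-cloud-bigquery client could look like the sketch below; the project, dataset, and GCS URI are examples.

```python
from google.cloud import bigquery

# Define a BigQuery external table over Parquet files in the GCS container,
# so queries read the data in place.
client = bigquery.Client()

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://analytics-data/sales/*.parquet"]

table = bigquery.Table("my-project.analytics.sales_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```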
Data Lake Architecture¶
Storage containers are central to modern data lake architectures:
graph TD
A[Streaming Sources] --> B[Bronze Layer<br/>Raw Container]
C[Batch Sources] --> B
B --> D[ETL Pipelines]
D --> E[Silver Layer<br/>Cleaned Container]
E --> F[Transformation Pipelines]
F --> G[Gold Layer<br/>Curated Container]
G --> H[BI Dashboards]
G --> I[ML Models]
G --> J[Data Warehouse]
style B fill:#f5576c,color:#fff
style E fill:#6900c7,color:#fff
style G fill:#00ac69,color:#fff
Multi-Hop Architecture:
1. Bronze: Raw data, minimal processing
2. Silver: Cleaned, validated, deduplicated
3. Gold: Business-level aggregates, ready for consumption
Next Steps¶
- Explore specifications - Click through each entity above
- See storage lineage - Check out lineage from storage to analytics
- Data lake patterns - Learn about modern data lake architectures