Skip to content

Search Assets

Search engines and index systems

Search assets represent search indexes and full-text search systems that enable fast querying across large datasets. OpenMetadata models search with a two-level hierarchy for search platforms.


Hierarchy Overview

graph TD
    A[SearchService<br/>Elasticsearch, OpenSearch] --> B1[SearchIndex:<br/>products]
    A --> B2[SearchIndex:<br/>logs-2024.11]
    A --> B3[SearchIndex:<br/>customers]

    B1 --> C1[Mappings & Fields]
    C1 --> D1[Field: product_id<br/>Type: keyword]
    C1 --> D2[Field: name<br/>Type: text]
    C1 --> D3[Field: description<br/>Type: text]
    C1 --> D4[Field: price<br/>Type: float]
    C1 --> D5[Field: category<br/>Type: keyword]
    C1 --> D6[Field: in_stock<br/>Type: boolean]

    B2 --> C2[Mappings & Fields]
    C2 --> E1[Field: timestamp<br/>Type: date]
    C2 --> E2[Field: level<br/>Type: keyword]
    C2 --> E3[Field: message<br/>Type: text]

    B1 -.->|synced from| F1[PostgreSQL<br/>products table]
    B2 -.->|synced from| F2[Application Logs<br/>via Fluentd]
    B3 -.->|synced from| F3[PostgreSQL<br/>customers table]

    B1 -.->|queried by| G1[Website Search]
    B1 -.->|queried by| G2[Mobile App]
    B2 -.->|queried by| G3[Kibana Dashboard]

    B1 -.->|settings| H1[5 shards, 1 replica]
    B1 -.->|documents| H2[100,000 products]

    style A fill:#667eea,color:#fff
    style B1 fill:#764ba2,color:#fff
    style B2 fill:#764ba2,color:#fff
    style B3 fill:#764ba2,color:#fff
    style C1 fill:#f093fb,color:#fff
    style C2 fill:#f093fb,color:#fff
    style D1 fill:#4facfe,color:#fff
    style D2 fill:#4facfe,color:#fff
    style D3 fill:#4facfe,color:#fff
    style D4 fill:#4facfe,color:#fff
    style D5 fill:#4facfe,color:#fff
    style D6 fill:#4facfe,color:#fff
    style E1 fill:#4facfe,color:#fff
    style E2 fill:#4facfe,color:#fff
    style E3 fill:#4facfe,color:#fff
    style F1 fill:#43e97b,color:#fff
    style F2 fill:#43e97b,color:#fff
    style F3 fill:#43e97b,color:#fff
    style G1 fill:#00f2fe,color:#fff
    style G2 fill:#00f2fe,color:#fff
    style G3 fill:#00f2fe,color:#fff
    style H1 fill:#fa709a,color:#fff
    style H2 fill:#fa709a,color:#fff

Why This Hierarchy?

Search Service

Purpose: Represents the search platform or cluster

A Search Service is the platform that hosts search indexes and provides full-text search capabilities. It contains configuration for connecting to the search cluster and discovering indexes.

Examples:

  • elasticsearch-prod - Production Elasticsearch cluster
  • opensearch-analytics - OpenSearch for log analytics
  • solr-ecommerce - Solr for product search
  • algolia-website - Algolia for website search

Why needed: Organizations use different search platforms for different use cases (Elasticsearch for analytics, Solr for enterprise search, Algolia for website search). The service level groups indexes by platform and cluster, making it easy to manage connections and understand search infrastructure.

Supported Platforms: Elasticsearch, OpenSearch, Apache Solr, Algolia, Azure Cognitive Search, Amazon CloudSearch, Typesense

View Search Service Specification →


Search Index

Purpose: Represents a searchable collection of documents

A Search Index is a collection of documents optimized for fast full-text search and filtering. It has mappings (field definitions), settings (analyzers, shards), and contains searchable data.

Examples:

  • products - Product catalog search
  • logs-2024.11 - Application logs for November 2024
  • customers - Customer data for support search
  • documents - Content management system documents

Key Metadata:

  • Mappings: Field names, types, and indexing settings
  • Settings: Number of shards, replicas, analyzers
  • Aliases: Alternative names for the index
  • Document Count: Total documents in the index
  • Size: Storage size of the index
  • Refresh Interval: How often index updates are visible
  • Data Sources: Tables or systems feeding the index
  • Lineage: Source data → Index → Search applications
  • Query Patterns: Common search queries and filters

Why needed: Search indexes are critical for application search functionality. Tracking them enables: - Understanding search infrastructure and data flow - Schema management for search fields and mappings - Impact analysis (which applications depend on this index?) - Performance optimization (shard sizing, replication) - Data quality monitoring (indexing lag, completeness)

View Search Index Specification →


Index Mapping and Fields

Search indexes define how documents are stored and queried:

Elasticsearch Mapping Example

{
  "mappings": {
    "properties": {
      "product_id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "standard"
      },
      "description": {
        "type": "text",
        "analyzer": "english"
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "tags": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date"
      },
      "in_stock": {
        "type": "boolean"
      }
    }
  }
}

Field Types

  • Text: Full-text searchable fields (analyzed)
  • Keyword: Exact match fields (not analyzed)
  • Numeric: Integer, long, float, double
  • Date: Timestamp fields
  • Boolean: True/false values
  • Nested: Complex nested objects
  • Geo: Geographic coordinates

Common Patterns

Elasticsearch Service → products Index → Mappings: name (text), price (float), category (keyword)
                                       → Documents: 100,000 products
                                       → Shards: 5 primary, 1 replica
                                       → Source: PostgreSQL products table

E-commerce product catalog indexed for fast search and filtering.

Pattern 2: OpenSearch Log Analytics

OpenSearch Service → logs-2024.11 Index → Mappings: timestamp (date), level (keyword), message (text)
                                         → Documents: 10M log entries
                                         → Time-based index (monthly rollover)
                                         → Source: Application logs via Fluentd

Time-series log data with automated index lifecycle management.

Solr Service → documents Index → Mappings: title (text), content (text), author (keyword)
                                → Documents: 1M documents
                                → Facets: department, document_type, year
                                → Source: SharePoint, Confluence, file systems

Enterprise document search with faceted navigation.


Real-World Example

Here's how an e-commerce platform uses search for product discovery:

graph LR
    A[PostgreSQL<br/>products table] --> P1[ETL Pipeline]
    B[PostgreSQL<br/>reviews table] --> P1

    P1 --> C[Elasticsearch<br/>products Index]

    C -->|Search| D[Website Search]
    C -->|Facets| E[Category Filters]
    C -->|Suggestions| F[Autocomplete]

    C -.->|Mappings| G[name: text, price: float]
    C -.->|Documents| H[100,000 products]
    C -.->|Shards| I[5 primary, 1 replica]
    C -.->|Refresh| J[1 second]

    style A fill:#0061f2,color:#fff
    style B fill:#0061f2,color:#fff
    style P1 fill:#f5576c,color:#fff
    style C fill:#00ac69,color:#fff
    style D fill:#6900c7,color:#fff
    style E fill:#6900c7,color:#fff
    style F fill:#6900c7,color:#fff

Flow: 1. Data Sources: Product and review data from PostgreSQL 2. ETL Pipeline: Syncs data to Elasticsearch (real-time or batch) 3. Search Index: Products indexed with full-text search on name/description 4. Search Features: - Search: Full-text search across products - Filters: Faceted search by category, price, rating - Autocomplete: Search suggestions as users type 5. Configuration: 5 shards for scalability, 1 replica for availability

Benefits:

  • Lineage: Trace search index back to source database tables
  • Schema Management: Track field mappings, detect mapping changes
  • Impact Analysis: Know which search features depend on which fields
  • Performance Monitoring: Track indexing lag, query latency
  • Data Quality: Validate completeness, monitor indexing errors

Search Lineage

Search indexes create lineage connections from data sources to search applications:

graph LR
    A[MySQL products] --> P1[Sync Pipeline]
    B[MongoDB reviews] --> P2[Sync Pipeline]
    C[S3 documents] --> P3[Indexing Pipeline]

    P1 --> D[ES products Index]
    P2 --> E[ES reviews Index]
    P3 --> F[ES documents Index]

    D --> G[Website Search]
    E --> G
    F --> G

    D --> H[Mobile App Search]
    E --> H

    style P1 fill:#f5576c,color:#fff
    style P2 fill:#f5576c,color:#fff
    style P3 fill:#f5576c,color:#fff
    style D fill:#00ac69,color:#fff
    style E fill:#00ac69,color:#fff
    style F fill:#00ac69,color:#fff

Database → Pipeline → Search Index → Application


Index Settings and Configuration

Search indexes have settings that control behavior and performance:

Elasticsearch Settings

{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "refresh_interval": "1s",
    "max_result_window": 10000,
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  }
}

Key Settings:

  • Shards: Partitions for horizontal scaling
  • Replicas: Copies for high availability
  • Refresh Interval: How often new documents are searchable
  • Analyzers: Text processing (tokenization, stemming, stop words)

Time-Series Index Patterns

Logs and events often use time-based indexes:

Index Naming Convention

logs-2024.11.22  (Daily)
logs-2024.11     (Monthly)
metrics-2024-w47 (Weekly)

Index Lifecycle Management

{
  "policy": "logs-lifecycle",
  "phases": {
    "hot": {
      "min_age": "0ms",
      "actions": {
        "rollover": {
          "max_age": "1d",
          "max_size": "50gb"
        }
      }
    },
    "warm": {
      "min_age": "7d",
      "actions": {
        "shrink": {
          "number_of_shards": 1
        }
      }
    },
    "cold": {
      "min_age": "30d",
      "actions": {
        "freeze": {}
      }
    },
    "delete": {
      "min_age": "90d",
      "actions": {
        "delete": {}
      }
    }
  }
}

Lifecycle Phases: 1. Hot: Active indexing and querying 2. Warm: Read-only, less frequent queries 3. Cold: Rarely accessed, compressed 4. Delete: Automatically delete old data


Search Query Patterns

Common search operations and their use cases:

{
  "query": {
    "match": {
      "name": {
        "query": "wireless headphones",
        "operator": "and"
      }
    }
  }
}
Use Case: Product search, document search

{
  "query": {
    "bool": {
      "must": [
        {"match": {"category": "electronics"}}
      ],
      "filter": [
        {"range": {"price": {"gte": 50, "lte": 200}}},
        {"term": {"in_stock": true}}
      ]
    }
  }
}
Use Case: E-commerce filtering by price, category, availability

Aggregations (Facets)

{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.keyword",
        "size": 10
      }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          {"to": 50},
          {"from": 50, "to": 100},
          {"from": 100}
        ]
      }
    }
  }
}
Use Case: Faceted navigation, analytics

Autocomplete/Suggestions

{
  "suggest": {
    "product_suggest": {
      "prefix": "wire",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
Use Case: Search-as-you-type, autocomplete


Index Aliases

Aliases provide indirection and zero-downtime reindexing:

{
  "actions": [
    {
      "add": {
        "index": "products-v2",
        "alias": "products"
      }
    },
    {
      "remove": {
        "index": "products-v1",
        "alias": "products"
      }
    }
  ]
}

Use Cases:

  • Blue-Green Deployment: Build new index, swap alias atomically
  • Versioning: Maintain multiple index versions
  • Read/Write Separation: Different aliases for read and write operations
  • Time-Series Rollup: Single alias pointing to multiple time-based indexes

Search Performance Optimization

Track and optimize search performance:

Shard Sizing

{
  "shards": {
    "optimal_size": "20-50GB per shard",
    "count": 5,
    "replicas": 1,
    "total": 10
  }
}

Query Performance

{
  "query_metrics": {
    "average_latency": "45ms",
    "p95_latency": "120ms",
    "p99_latency": "250ms",
    "queries_per_second": 500
  }
}

Indexing Performance

{
  "indexing_metrics": {
    "documents_per_second": 5000,
    "indexing_lag": "2s",
    "refresh_time": "1s",
    "bulk_size": 1000
  }
}

Search Data Synchronization

Different strategies for keeping search indexes in sync:

Real-Time Sync

Database → Change Data Capture → Kafka → Index Consumer → Elasticsearch
Use Case: E-commerce, social media (immediate consistency)

Near-Real-Time Sync

Database → API → Elasticsearch Bulk API → Index
Use Case: Content management, customer data (eventual consistency)

Batch Sync

Database → Scheduled ETL → Full Reindex → Elasticsearch
Use Case: Analytics, reports (daily/hourly updates)

Event-Driven Sync

Application → Event Bus → Lambda/Function → Elasticsearch
Use Case: Microservices, event-driven architectures


Search Analytics and Monitoring

Track search usage and effectiveness:

{
  "index": "products",
  "analytics": {
    "top_searches": [
      {"query": "wireless headphones", "count": 5000},
      {"query": "laptop", "count": 3500},
      {"query": "phone case", "count": 2000}
    ],
    "zero_result_searches": [
      {"query": "productxyz", "count": 100}
    ],
    "average_results_per_query": 45,
    "click_through_rate": "15%",
    "search_to_purchase_rate": "8%"
  }
}

Key Metrics:

  • Top Searches: Most common queries
  • Zero-Result Searches: Queries with no results (improve content)
  • Click-Through Rate: Users clicking on results
  • Conversion Rate: Searches leading to actions

Support for multiple languages and localization:

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "en": {
            "type": "text",
            "analyzer": "english"
          },
          "es": {
            "type": "text",
            "analyzer": "spanish"
          },
          "fr": {
            "type": "text",
            "analyzer": "french"
          }
        }
      }
    }
  }
}

Language Features:

  • Analyzers: Language-specific stemming and stop words
  • Multi-Fields: Same content indexed for different languages
  • Character Filters: Unicode normalization, accent folding

Entity Specifications

Entity Description Specification
Search Service Search platform or cluster View Spec
Search Index Searchable document collection View Spec

Each specification includes: - Complete field reference - JSON Schema definition - RDF/OWL ontology representation - JSON-LD context and examples - Integration with search platforms


Supported Search Platforms

OpenMetadata supports metadata extraction from:

  • Elasticsearch - Distributed search and analytics engine
  • OpenSearch - Open-source Elasticsearch fork
  • Apache Solr - Enterprise search platform
  • Algolia - Managed search API
  • Azure Cognitive Search - AI-powered cloud search
  • Amazon CloudSearch - Managed search service
  • Typesense - Fast, typo-tolerant search engine
  • Meilisearch - Open-source instant search
  • Amazon Kendra - Intelligent enterprise search
  • Google Cloud Search - Enterprise search for G Suite

Search Index Types

Different index patterns for different use cases:

Product Catalog

{
  "index": "products",
  "purpose": "E-commerce product search",
  "mappings": ["name:text", "price:float", "category:keyword"],
  "features": ["full-text search", "faceting", "autocomplete"],
  "update_frequency": "Real-time"
}

Log Analytics

{
  "index": "logs-*",
  "purpose": "Application log search and analysis",
  "mappings": ["timestamp:date", "level:keyword", "message:text"],
  "features": ["time-series", "aggregations", "alerts"],
  "lifecycle": "Daily rollover, 90-day retention"
}
{
  "index": "documents",
  "purpose": "Enterprise document search",
  "mappings": ["title:text", "content:text", "author:keyword"],
  "features": ["full-text search", "highlighting", "relevance tuning"],
  "update_frequency": "Near real-time"
}

Search Security and Access Control

Track security and access control metadata:

Index-Level Security

{
  "security": {
    "authentication": "SAML/OAuth",
    "field_level_security": {
      "hr_team": ["employee_id", "name", "salary"],
      "general": ["employee_id", "name"]
    },
    "document_level_security": {
      "filter": {
        "term": {"department": "user.department"}
      }
    }
  }
}

Encryption

{
  "encryption": {
    "at_rest": "AES-256",
    "in_transit": "TLS 1.3",
    "field_level": ["ssn", "credit_card"]
  }
}

Next Steps

  1. Explore specifications - Click through each entity above
  2. See search lineage - Check out lineage from databases to search to apps
  3. Search optimization - Learn about query performance and index tuning