Search Assets¶
Search engines and index systems
Search assets represent search indexes and full-text search systems that enable fast querying across large datasets. OpenMetadata models search platforms with a two-level hierarchy: a Search Service that contains one or more Search Indexes.
Hierarchy Overview¶
graph TD
A[SearchService<br/>Elasticsearch, OpenSearch] --> B1[SearchIndex:<br/>products]
A --> B2[SearchIndex:<br/>logs-2024.11]
A --> B3[SearchIndex:<br/>customers]
B1 --> C1[Mappings & Fields]
C1 --> D1[Field: product_id<br/>Type: keyword]
C1 --> D2[Field: name<br/>Type: text]
C1 --> D3[Field: description<br/>Type: text]
C1 --> D4[Field: price<br/>Type: float]
C1 --> D5[Field: category<br/>Type: keyword]
C1 --> D6[Field: in_stock<br/>Type: boolean]
B2 --> C2[Mappings & Fields]
C2 --> E1[Field: timestamp<br/>Type: date]
C2 --> E2[Field: level<br/>Type: keyword]
C2 --> E3[Field: message<br/>Type: text]
B1 -.->|synced from| F1[PostgreSQL<br/>products table]
B2 -.->|synced from| F2[Application Logs<br/>via Fluentd]
B3 -.->|synced from| F3[PostgreSQL<br/>customers table]
B1 -.->|queried by| G1[Website Search]
B1 -.->|queried by| G2[Mobile App]
B2 -.->|queried by| G3[Kibana Dashboard]
B1 -.->|settings| H1[5 shards, 1 replica]
B1 -.->|documents| H2[100,000 products]
style A fill:#667eea,color:#fff
style B1 fill:#764ba2,color:#fff
style B2 fill:#764ba2,color:#fff
style B3 fill:#764ba2,color:#fff
style C1 fill:#f093fb,color:#fff
style C2 fill:#f093fb,color:#fff
style D1 fill:#4facfe,color:#fff
style D2 fill:#4facfe,color:#fff
style D3 fill:#4facfe,color:#fff
style D4 fill:#4facfe,color:#fff
style D5 fill:#4facfe,color:#fff
style D6 fill:#4facfe,color:#fff
style E1 fill:#4facfe,color:#fff
style E2 fill:#4facfe,color:#fff
style E3 fill:#4facfe,color:#fff
style F1 fill:#43e97b,color:#fff
style F2 fill:#43e97b,color:#fff
style F3 fill:#43e97b,color:#fff
style G1 fill:#00f2fe,color:#fff
style G2 fill:#00f2fe,color:#fff
style G3 fill:#00f2fe,color:#fff
style H1 fill:#fa709a,color:#fff
style H2 fill:#fa709a,color:#fff
Why This Hierarchy?¶
Search Service¶
Purpose: Represents the search platform or cluster
A Search Service is the platform that hosts search indexes and provides full-text search capabilities. It contains configuration for connecting to the search cluster and discovering indexes.
Examples:
- elasticsearch-prod - Production Elasticsearch cluster
- opensearch-analytics - OpenSearch for log analytics
- solr-ecommerce - Solr for product search
- algolia-website - Algolia for website search
Why needed: Organizations use different search platforms for different use cases (Elasticsearch for analytics, Solr for enterprise search, Algolia for website search). The service level groups indexes by platform and cluster, making it easy to manage connections and understand search infrastructure.
Supported Platforms: Elasticsearch, OpenSearch, Apache Solr, Algolia, Azure Cognitive Search, Amazon CloudSearch, Typesense
View Search Service Specification →
Search Index¶
Purpose: Represents a searchable collection of documents
A Search Index is a collection of documents optimized for fast full-text search and filtering. It has mappings (field definitions), settings (analyzers, shards), and contains searchable data.
Examples:
- products - Product catalog search
- logs-2024.11 - Application logs for November 2024
- customers - Customer data for support search
- documents - Content management system documents
Key Metadata:
- Mappings: Field names, types, and indexing settings
- Settings: Number of shards, replicas, analyzers
- Aliases: Alternative names for the index
- Document Count: Total documents in the index
- Size: Storage size of the index
- Refresh Interval: How often index updates are visible
- Data Sources: Tables or systems feeding the index
- Lineage: Source data → Index → Search applications
- Query Patterns: Common search queries and filters
Why needed: Search indexes are critical for application search functionality. Tracking them enables:
- Understanding search infrastructure and data flow
- Schema management for search fields and mappings
- Impact analysis (which applications depend on this index?)
- Performance optimization (shard sizing, replication)
- Data quality monitoring (indexing lag, completeness)
View Search Index Specification →
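The two-level hierarchy can be sketched as a pair of small data classes. This is only an illustration: the class and attribute names below are hypothetical and are not OpenMetadata's actual schema, and the dotted `service.index` name is an assumption about how a qualified name might be formed.

```python
from dataclasses import dataclass, field


@dataclass
class SearchIndex:
    """A searchable collection of documents (hypothetical model)."""
    name: str
    fields: dict[str, str] = field(default_factory=dict)  # field name -> field type


@dataclass
class SearchService:
    """The platform or cluster that hosts indexes (hypothetical model)."""
    name: str
    service_type: str
    indexes: dict[str, SearchIndex] = field(default_factory=dict)

    def add_index(self, index: SearchIndex) -> str:
        """Register an index under this service and return a dotted name."""
        self.indexes[index.name] = index
        return f"{self.name}.{index.name}"


service = SearchService("elasticsearch-prod", "Elasticsearch")
fqn = service.add_index(
    SearchIndex("products", {"product_id": "keyword", "name": "text"})
)
print(fqn)
```

Grouping indexes under their service keeps connection details in one place while each index carries its own mappings.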
Index Mapping and Fields¶
Search indexes define how documents are stored and queried:
Elasticsearch Mapping Example¶
{
"mappings": {
"properties": {
"product_id": {
"type": "keyword"
},
"name": {
"type": "text",
"analyzer": "standard"
},
"description": {
"type": "text",
"analyzer": "english"
},
"price": {
"type": "float"
},
"category": {
"type": "keyword"
},
"tags": {
"type": "keyword"
},
"created_at": {
"type": "date"
},
"in_stock": {
"type": "boolean"
}
}
}
}
Field Types¶
- Text: Full-text searchable fields (analyzed)
- Keyword: Exact match fields (not analyzed)
- Numeric: Integer, long, float, double
- Date: Timestamp fields
- Boolean: True/false values
- Nested: Arrays of objects whose fields are indexed and queried together, one object at a time
- Geo: Geographic coordinates
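When cataloging an index, it can help to flatten a mapping into a simple path-to-type table. The following is a minimal sketch, assuming an Elasticsearch-style `properties` dictionary; the function name and the sample `location` and `variants` fields are illustrative and do not come from the mapping above.

```python
def flatten_mapping(properties: dict, prefix: str = "") -> dict:
    """Return {dotted.field.path: type} for an Elasticsearch-style mapping."""
    flat = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        # A field with sub-properties but no explicit type is a plain object.
        flat[path] = spec.get("type", "object")
        if "properties" in spec:
            flat.update(flatten_mapping(spec["properties"], prefix=f"{path}."))
    return flat


mapping = {
    "name": {"type": "text", "analyzer": "standard"},
    "price": {"type": "float"},
    "location": {"type": "geo_point"},
    "variants": {
        "type": "nested",
        "properties": {
            "sku": {"type": "keyword"},
            "stock": {"type": "integer"},
        },
    },
}

field_types = flatten_mapping(mapping)
for path, field_type in field_types.items():
    print(f"{path}: {field_type}")
```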
Common Patterns¶
Pattern 1: Elasticsearch Product Search¶
Elasticsearch Service → products Index → Mappings: name (text), price (float), category (keyword)
→ Documents: 100,000 products
→ Shards: 5 primary, 1 replica
→ Source: PostgreSQL products table
E-commerce product catalog indexed for fast search and filtering.
Pattern 2: OpenSearch Log Analytics¶
OpenSearch Service → logs-2024.11 Index → Mappings: timestamp (date), level (keyword), message (text)
→ Documents: 10M log entries
→ Time-based index (monthly rollover)
→ Source: Application logs via Fluentd
Time-series log data with automated index lifecycle management.
Pattern 3: Solr Enterprise Search¶
Solr Service → documents Index → Mappings: title (text), content (text), author (keyword)
→ Documents: 1M documents
→ Facets: department, document_type, year
→ Source: SharePoint, Confluence, file systems
Enterprise document search with faceted navigation.
Real-World Example¶
Here's how an e-commerce platform uses search for product discovery:
graph LR
A[PostgreSQL<br/>products table] --> P1[ETL Pipeline]
B[PostgreSQL<br/>reviews table] --> P1
P1 --> C[Elasticsearch<br/>products Index]
C -->|Search| D[Website Search]
C -->|Facets| E[Category Filters]
C -->|Suggestions| F[Autocomplete]
C -.->|Mappings| G[name: text, price: float]
C -.->|Documents| H[100,000 products]
C -.->|Shards| I[5 primary, 1 replica]
C -.->|Refresh| J[1 second]
style A fill:#0061f2,color:#fff
style B fill:#0061f2,color:#fff
style P1 fill:#f5576c,color:#fff
style C fill:#00ac69,color:#fff
style D fill:#6900c7,color:#fff
style E fill:#6900c7,color:#fff
style F fill:#6900c7,color:#fff
Flow:
1. Data Sources: Product and review data from PostgreSQL
2. ETL Pipeline: Syncs data to Elasticsearch (real-time or batch)
3. Search Index: Products indexed with full-text search on name/description
4. Search Features:
   - Search: Full-text search across products
   - Filters: Faceted search by category, price, rating
   - Autocomplete: Search suggestions as users type
5. Configuration: 5 shards for scalability, 1 replica for availability
Benefits:
- Lineage: Trace search index back to source database tables
- Schema Management: Track field mappings, detect mapping changes
- Impact Analysis: Know which search features depend on which fields
- Performance Monitoring: Track indexing lag, query latency
- Data Quality: Validate completeness, monitor indexing errors
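Detecting mapping changes, as mentioned under Schema Management, comes down to comparing two snapshots of field types. A minimal sketch, assuming each snapshot is already a flat `{field: type}` dictionary; the function name and sample data are illustrative.

```python
def diff_mappings(old: dict, new: dict) -> dict:
    """Compare two {field: type} snapshots and report what changed."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Present in both snapshots, but with a different type.
        "retyped": sorted(f for f in old.keys() & new.keys() if old[f] != new[f]),
    }


before = {"name": "text", "price": "float", "category": "keyword"}
after = {"name": "text", "price": "double", "tags": "keyword"}

changes = diff_mappings(before, after)
print(changes)
```

Removed or retyped fields are the ones most likely to break a search feature that depends on them.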
Search Lineage¶
Search indexes create lineage connections from data sources to search applications:
graph LR
A[MySQL products] --> P1[Sync Pipeline]
B[MongoDB reviews] --> P2[Sync Pipeline]
C[S3 documents] --> P3[Indexing Pipeline]
P1 --> D[ES products Index]
P2 --> E[ES reviews Index]
P3 --> F[ES documents Index]
D --> G[Website Search]
E --> G
F --> G
D --> H[Mobile App Search]
E --> H
style P1 fill:#f5576c,color:#fff
style P2 fill:#f5576c,color:#fff
style P3 fill:#f5576c,color:#fff
style D fill:#00ac69,color:#fff
style E fill:#00ac69,color:#fff
style F fill:#00ac69,color:#fff
Database → Pipeline → Search Index → Application
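Impact analysis over this lineage is a graph walk. The sketch below copies the edges from the diagram (with the sync pipelines collapsed into direct edges, which is a simplification) and lists everything downstream of a source. The function name is illustrative.

```python
from collections import deque

# Upstream -> downstream edges, pipelines collapsed for brevity.
EDGES = {
    "MySQL products": ["ES products Index"],
    "MongoDB reviews": ["ES reviews Index"],
    "S3 documents": ["ES documents Index"],
    "ES products Index": ["Website Search", "Mobile App Search"],
    "ES reviews Index": ["Website Search", "Mobile App Search"],
    "ES documents Index": ["Website Search"],
}


def downstream(node: str) -> list[str]:
    """Breadth-first walk returning every asset reachable from `node`."""
    seen, queue, order = {node}, deque([node]), []
    while queue:
        for nxt in EDGES.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


impacted = downstream("MongoDB reviews")
print(impacted)
```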
Index Settings and Configuration¶
Search indexes have settings that control behavior and performance:
Elasticsearch Settings¶
{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1,
"refresh_interval": "1s",
"max_result_window": 10000,
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "stop", "snowball"]
}
}
}
}
}
Key Settings:
- Shards: Partitions for horizontal scaling
- Replicas: Copies for high availability
- Refresh Interval: How often newly indexed documents become visible to search
- Analyzers: Text processing (tokenization, stemming, stop words)
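Shards and replicas multiply: every primary shard has one copy per replica. A small sketch of that arithmetic, using the values from the settings above (the function name is illustrative):

```python
def total_shard_copies(settings: dict) -> int:
    """Primaries plus all replica copies that the cluster must host."""
    primaries = int(settings["number_of_shards"])
    replicas = int(settings["number_of_replicas"])
    return primaries * (1 + replicas)


settings = {"number_of_shards": 5, "number_of_replicas": 1}
copies = total_shard_copies(settings)
print(copies)  # 5 primaries + 5 replica shards = 10
```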
Time-Series Index Patterns¶
Logs and events often use time-based indexes:
Index Naming Convention¶
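A minimal sketch of generating time-based index names, assuming the monthly `prefix-YYYY.MM` pattern used by `logs-2024.11` elsewhere on this page. Other deployments may use daily names or numbered rollover suffixes; the helper names here are hypothetical.

```python
from datetime import date


def monthly_index_name(prefix: str, day: date) -> str:
    """Name of the monthly index that a document dated `day` belongs to."""
    return f"{prefix}-{day:%Y.%m}"


def index_names_between(prefix: str, start: date, end: date) -> list[str]:
    """All monthly index names covering the inclusive range start..end."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


print(monthly_index_name("logs", date(2024, 11, 15)))
print(index_names_between("logs", date(2024, 11, 20), date(2025, 1, 3)))
```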
Index Lifecycle Management¶
{
"policy": "logs-lifecycle",
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "1d",
"max_size": "50gb"
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": {
"number_of_shards": 1
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"freeze": {}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
Lifecycle Phases:
1. Hot: Active indexing and querying
2. Warm: Read-only, less frequent queries
3. Cold: Rarely accessed, compressed
4. Delete: Automatically delete old data
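The `min_age` thresholds in the policy above determine which phase an index is in at a given age. The sketch below only illustrates that lookup; it is not how Elasticsearch evaluates policies internally, and it assumes age is measured in days from rollover (or from creation if the index never rolled over).

```python
# (phase, min_age in days) taken from the example policy, youngest first.
PHASES = [("hot", 0), ("warm", 7), ("cold", 30), ("delete", 90)]


def phase_for_age(age_days: float) -> str:
    """Return the last phase whose min_age the index has reached."""
    current = PHASES[0][0]
    for name, min_age in PHASES:
        if age_days >= min_age:
            current = name
    return current


for age in (0, 7, 29.5, 30, 120):
    print(age, phase_for_age(age))
```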
Search Query Patterns¶
Common search operations and their use cases:
Full-Text Search¶
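A minimal sketch of a full-text request body, built as a Python dictionary. It assumes an Elasticsearch-style `multi_match` query over the `name` and `description` fields from the mapping earlier on this page; the `^2` boost on `name` and the helper name are illustrative choices. The sketch only builds the body and does not send it to a cluster.

```python
import json


def full_text_query(text: str, fields: list[str], size: int = 10) -> dict:
    """Build a search body that matches `text` against several text fields."""
    return {
        "size": size,
        "query": {"multi_match": {"query": text, "fields": fields}},
    }


body = full_text_query("wireless headphones", ["name^2", "description"])
print(json.dumps(body, indent=2))
```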
Use Case: Product search, document search
Filtered Search¶
{
"query": {
"bool": {
"must": [
{"match": {"category": "electronics"}}
],
"filter": [
{"range": {"price": {"gte": 50, "lte": 200}}},
{"term": {"in_stock": true}}
]
}
}
}
Aggregations (Facets)¶
{
"aggs": {
"categories": {
"terms": {
"field": "category",
"size": 10
}
},
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{"to": 50},
{"from": 50, "to": 100},
{"from": 100}
]
}
}
}
}
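To make the bucket semantics concrete, the sketch below recomputes both facets over a small in-memory list. It assumes the usual range-aggregation convention that `from` is inclusive and `to` is exclusive; the sample documents and function names are illustrative.

```python
from collections import Counter


def terms_facet(docs: list[dict], field: str, size: int = 10) -> list[tuple]:
    """Count documents per distinct value, most frequent first."""
    counts = Counter(d[field] for d in docs if field in d)
    return counts.most_common(size)


def range_facet(docs: list[dict], field: str, ranges: list[dict]) -> list[dict]:
    """Count documents per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        count = sum(
            1
            for d in docs
            if (lo is None or d[field] >= lo) and (hi is None or d[field] < hi)
        )
        buckets.append({"from": lo, "to": hi, "doc_count": count})
    return buckets


docs = [
    {"category": "electronics", "price": 49.99},
    {"category": "electronics", "price": 50.0},
    {"category": "books", "price": 100.0},
    {"category": "electronics", "price": 250.0},
]

categories = terms_facet(docs, "category")
price_ranges = range_facet(
    docs, "price", [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]
)
print(categories)
print(price_ranges)
```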
Autocomplete/Suggestions¶
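One common way to support search-as-you-type is to index the leading prefixes (edge n-grams) of each word so that partial input matches directly. The sketch below illustrates that idea in plain Python. It is not the search engine's own implementation, and the function names, gram lengths, and sample titles are assumptions.

```python
def edge_ngrams(term: str, min_gram: int = 2, max_gram: int = 10) -> list[str]:
    """Leading prefixes of `term`, from min_gram up to max_gram characters."""
    term = term.lower()
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_prefix_index(titles: list[str]) -> dict[str, set]:
    """Map every word prefix to the titles that contain such a word."""
    index: dict[str, set] = {}
    for title in titles:
        for word in title.split():
            for gram in edge_ngrams(word):
                index.setdefault(gram, set()).add(title)
    return index


def suggest(index: dict[str, set], typed: str) -> list[str]:
    """Titles with a word starting with what the user has typed so far."""
    return sorted(index.get(typed.lower(), set()))


index = build_prefix_index(["Wireless Headphones", "Wired Earbuds", "Laptop Stand"])
print(suggest(index, "wir"))
print(suggest(index, "lap"))
```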
Use Case: Search-as-you-type, autocomplete
Index Aliases¶
Aliases provide indirection and zero-downtime reindexing:
{
"actions": [
{
"add": {
"index": "products-v2",
"alias": "products"
}
},
{
"remove": {
"index": "products-v1",
"alias": "products"
}
}
]
}
Use Cases:
- Blue-Green Deployment: Build new index, swap alias atomically
- Versioning: Maintain multiple index versions
- Read/Write Separation: Different aliases for read and write operations
- Time-Series Rollup: Single alias pointing to multiple time-based indexes
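For blue-green reindexing, the add and remove actions are sent together in one request so the alias never points at nothing. A small sketch that builds that body for any alias; the helper names and the `-vN` version suffix convention are assumptions based on the `products-v1` and `products-v2` example above.

```python
def next_version(index_name: str) -> str:
    """Increment a trailing -vN suffix, e.g. products-v1 -> products-v2."""
    base, _, version = index_name.rpartition("-v")
    return f"{base}-v{int(version) + 1}"


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body that moves `alias` from old_index to new_index in one request."""
    return {
        "actions": [
            {"add": {"index": new_index, "alias": alias}},
            {"remove": {"index": old_index, "alias": alias}},
        ]
    }


old = "products-v1"
swap = alias_swap_actions("products", old, next_version(old))
print(swap)
```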
Search Performance Optimization¶
Track and optimize search performance:
Shard Sizing¶
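A rough way to pick a primary shard count is to divide the expected index size by a target shard size. The sketch below assumes a 30 GB target, which sits inside the commonly cited 10 to 50 GB range; the right number depends on the workload, and the function name is hypothetical.

```python
import math


def recommended_primary_shards(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Smallest shard count that keeps each shard at or under the target size."""
    if index_size_gb <= 0:
        return 1
    return max(1, math.ceil(index_size_gb / target_shard_gb))


for size_gb in (10, 120, 500):
    print(size_gb, "GB ->", recommended_primary_shards(size_gb), "primary shards")
```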
Query Performance¶
{
"query_metrics": {
"average_latency": "45ms",
"p95_latency": "120ms",
"p99_latency": "250ms",
"queries_per_second": 500
}
}
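Latency figures like these are typically derived from raw per-query timings. A minimal sketch using the nearest-rank percentile method; the sample latencies are invented for illustration and do not reproduce the numbers above.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [12, 18, 25, 31, 40, 45, 52, 60, 120, 250]

metrics = {
    "average_latency": sum(latencies_ms) / len(latencies_ms),
    "p50_latency": percentile(latencies_ms, 50),
    "p95_latency": percentile(latencies_ms, 95),
}
print(metrics)
```

Tail percentiles are usually more informative than the average, because a few slow queries can dominate the user experience.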
Indexing Performance¶
{
"indexing_metrics": {
"documents_per_second": 5000,
"indexing_lag": "2s",
"refresh_time": "1s",
"bulk_size": 1000
}
}
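A `bulk_size` of 1000 means documents are sent in groups of up to 1000 per request. The sketch below shows one way to split a document stream into such groups and estimate indexing time at the rate shown above; the function names are illustrative.

```python
from itertools import islice


def batches(docs, bulk_size: int = 1000):
    """Yield consecutive lists of at most `bulk_size` documents."""
    it = iter(docs)
    while chunk := list(islice(it, bulk_size)):
        yield chunk


def estimated_seconds(total_docs: int, docs_per_second: int) -> float:
    """Time to index everything at a steady throughput."""
    return total_docs / docs_per_second


sizes = [len(batch) for batch in batches(range(2500))]
print(sizes)
print(estimated_seconds(100_000, 5000))
```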
Search Data Synchronization¶
Different strategies for keeping search indexes in sync:
Real-Time Sync¶
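In a real-time setup, each change in the source system is applied to the index as soon as it is observed. The sketch below uses a plain dictionary as a stand-in for the search index, and the event shape (`op`, `id`, `doc`) is a hypothetical format rather than one defined by any particular tool.

```python
def apply_change(index: dict, event: dict) -> None:
    """Apply one create/update/delete event to the in-memory index."""
    op, doc_id = event["op"], event["id"]
    if op == "delete":
        index.pop(doc_id, None)
    elif op in ("create", "update"):
        # Merge so partial updates keep fields that were not changed.
        index[doc_id] = {**index.get(doc_id, {}), **event["doc"]}
    else:
        raise ValueError(f"unknown operation: {op}")


events = [
    {"op": "create", "id": "p1", "doc": {"name": "Laptop", "price": 999.0}},
    {"op": "update", "id": "p1", "doc": {"price": 899.0}},
    {"op": "create", "id": "p2", "doc": {"name": "Phone case", "price": 15.0}},
    {"op": "delete", "id": "p2"},
]

products_index: dict = {}
for event in events:
    apply_change(products_index, event)
print(products_index)
```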
Use Case: E-commerce, social media (immediate consistency)
Near-Real-Time Sync¶
Use Case: Content management, customer data (eventual consistency)
Batch Sync¶
Use Case: Analytics, reports (daily/hourly updates)
Event-Driven Sync¶
Use Case: Microservices, event-driven architectures
Search Analytics and Monitoring¶
Track search usage and effectiveness:
{
"index": "products",
"analytics": {
"top_searches": [
{"query": "wireless headphones", "count": 5000},
{"query": "laptop", "count": 3500},
{"query": "phone case", "count": 2000}
],
"zero_result_searches": [
{"query": "productxyz", "count": 100}
],
"average_results_per_query": 45,
"click_through_rate": "15%",
"search_to_purchase_rate": "8%"
}
}
Key Metrics:
- Top Searches: Most common queries
- Zero-Result Searches: Queries that return no results, a signal to add content or synonyms
- Click-Through Rate: Share of searches where the user clicks a result
- Conversion Rate: Share of searches that lead to an action such as a purchase
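These metrics can be derived from a log of individual searches. A minimal sketch, assuming each log entry records the query text, the number of results, and whether a result was clicked; that entry format and the function name are hypothetical.

```python
from collections import Counter


def summarize(search_log: list[dict]) -> dict:
    """Aggregate a list of search events into the key metrics."""
    total = len(search_log)
    clicks = sum(1 for e in search_log if e["clicked"])
    return {
        "top_searches": Counter(e["query"] for e in search_log).most_common(3),
        "zero_result_searches": sorted(
            {e["query"] for e in search_log if e["results"] == 0}
        ),
        "click_through_rate": clicks / total if total else 0.0,
    }


search_log = [
    {"query": "laptop", "results": 40, "clicked": True},
    {"query": "laptop", "results": 40, "clicked": False},
    {"query": "phone case", "results": 12, "clicked": True},
    {"query": "productxyz", "results": 0, "clicked": False},
]

summary = summarize(search_log)
print(summary)
```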
Multi-Language Search¶
Support for multiple languages and localization:
{
"mappings": {
"properties": {
"name": {
"type": "text",
"fields": {
"en": {
"type": "text",
"analyzer": "english"
},
"es": {
"type": "text",
"analyzer": "spanish"
},
"fr": {
"type": "text",
"analyzer": "french"
}
}
}
}
}
}
Language Features:
- Analyzers: Language-specific stemming and stop words
- Multi-Fields: Same content indexed for different languages
- Character and Token Filters: Unicode normalization, accent folding (for example, the asciifolding token filter)
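With the multi-field mapping above, the application chooses which sub-field to query based on the user's language. A small sketch of that choice, assuming locale strings such as `es-MX` and a fallback to the base field for unsupported languages; the function name is illustrative.

```python
# Sub-fields defined in the example mapping.
SUPPORTED_LANGUAGES = {"en", "es", "fr"}


def field_for_locale(base_field: str, locale: str) -> str:
    """Pick the language sub-field for a locale, or the base field if none fits."""
    lang = locale.replace("_", "-").split("-")[0].lower()
    return f"{base_field}.{lang}" if lang in SUPPORTED_LANGUAGES else base_field


for locale in ("en-US", "es_MX", "FR", "de-DE"):
    print(locale, "->", field_for_locale("name", locale))
```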
Entity Specifications¶
| Entity | Description | Specification |
|---|---|---|
| Search Service | Search platform or cluster | View Spec |
| Search Index | Searchable document collection | View Spec |
Each specification includes:
- Complete field reference
- JSON Schema definition
- RDF/OWL ontology representation
- JSON-LD context and examples
- Integration with search platforms
Supported Search Platforms¶
OpenMetadata supports metadata extraction from:
- Elasticsearch - Distributed search and analytics engine
- OpenSearch - Open-source Elasticsearch fork
- Apache Solr - Enterprise search platform
- Algolia - Managed search API
- Azure Cognitive Search - AI-powered cloud search
- Amazon CloudSearch - Managed search service
- Typesense - Fast, typo-tolerant search engine
- Meilisearch - Open-source instant search
- Amazon Kendra - Intelligent enterprise search
- Google Cloud Search - Enterprise search for Google Workspace (formerly G Suite)
Search Index Types¶
Different index patterns for different use cases:
Product Catalog¶
{
"index": "products",
"purpose": "E-commerce product search",
"mappings": ["name:text", "price:float", "category:keyword"],
"features": ["full-text search", "faceting", "autocomplete"],
"update_frequency": "Real-time"
}
Log Analytics¶
{
"index": "logs-*",
"purpose": "Application log search and analysis",
"mappings": ["timestamp:date", "level:keyword", "message:text"],
"features": ["time-series", "aggregations", "alerts"],
"lifecycle": "Daily rollover, 90-day retention"
}
Document Search¶
{
"index": "documents",
"purpose": "Enterprise document search",
"mappings": ["title:text", "content:text", "author:keyword"],
"features": ["full-text search", "highlighting", "relevance tuning"],
"update_frequency": "Near real-time"
}
Search Security and Access Control¶
Track security and access control metadata:
Index-Level Security¶
{
"security": {
"authentication": "SAML/OAuth",
"field_level_security": {
"hr_team": ["employee_id", "name", "salary"],
"general": ["employee_id", "name"]
},
"document_level_security": {
"filter": {
"term": {"department": "user.department"}
}
}
}
}
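The combined effect of field-level and document-level security can be illustrated with a small in-memory filter. This sketch mirrors the example metadata above (roles `hr_team` and `general`, filtering on `department`). It is not how a search engine enforces security, and the sample employees and function names are invented.

```python
FIELD_ACCESS = {
    "hr_team": ["employee_id", "name", "salary"],
    "general": ["employee_id", "name"],
}


def visible_fields(doc: dict, role: str) -> dict:
    """Keep only the fields the role is allowed to read."""
    allowed = FIELD_ACCESS.get(role, [])
    return {key: value for key, value in doc.items() if key in allowed}


def visible_docs(docs: list[dict], role: str, user_department: str) -> list[dict]:
    """Document-level filter on department, then field-level projection."""
    return [
        visible_fields(doc, role)
        for doc in docs
        if doc.get("department") == user_department
    ]


employees = [
    {"employee_id": 1, "name": "Ana", "salary": 90000, "department": "eng"},
    {"employee_id": 2, "name": "Ben", "salary": 80000, "department": "sales"},
]

print(visible_docs(employees, "general", "eng"))
print(visible_docs(employees, "hr_team", "sales"))
```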
Encryption¶
{
"encryption": {
"at_rest": "AES-256",
"in_transit": "TLS 1.3",
"field_level": ["ssn", "credit_card"]
}
}
Next Steps¶
- Explore specifications - Click through each entity above
- See search lineage - Check out lineage from databases to search to apps
- Search optimization - Learn about query performance and index tuning