A Deep Dive into Elasticsearch's Powerful Vector Search

Search engines have traditionally relied on lexical search techniques, which involve matching query terms with documents containing those exact terms. While effective in many cases, lexical search has limitations, particularly in understanding the context or semantics behind a query. As the need for more nuanced and intelligent search capabilities has grown, vector search has emerged as a powerful alternative, particularly in contexts like semantic search, recommendation systems, and natural language processing. In this post, we’ll explore what vector search is, how it differs from lexical search, and how to implement vector search in Elasticsearch.

Before we dive deep into vector search and its benefits, we first need to look at how most queries against Elasticsearch (and OpenSearch) are made today: with lexical search.

How Lexical Search Works in Elasticsearch

Lexical search in Elasticsearch relies on matching exact terms within documents based on an inverted index. When a query is issued, Elasticsearch breaks down the query into tokens (words or phrases) and looks up these tokens in the index to find documents that contain them. The results are then scored using algorithms like TF-IDF (Term Frequency-Inverse Document Frequency) or BM25, which rank documents based on how often the terms appear and how important they are within the corpus. You can find more details on how these scores are calculated in our blog post on the subject.
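
To make this concrete, here is what a typical lexical query looks like (a minimal sketch; the my_text_index index and title field are hypothetical):

POST /my_text_index/_search
{
  "query": {
    "match": {
      "title": "vector search engine"
    }
  }
}

Elasticsearch analyzes the query string into the tokens vector, search, and engine, looks each token up in the inverted index, and ranks the matching documents with BM25.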

While lexical search is efficient and effective for exact term matching, it falls short in cases where the user’s intent isn’t perfectly aligned with the exact terms in the document. For example, lexical search struggles with:

1. Complex Mappings and Nested Data:

In many databases, data is stored in complex structures with nested fields, arrays, or mixed content types. Lexical search, which relies on straightforward token matching, struggles to effectively query or index these complex data types. For instance, in a document with deeply nested fields, a simple keyword search might miss relevant matches that are buried within subfields or arrays. Synonyms are a classic example: a query for “automobile” might miss documents containing “car,” even though they mean the same thing.
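
Lexical engines can be patched around this with hand-curated synonym lists, but those lists have to be built and maintained manually. Here is a minimal sketch of such a workaround (the my_text_index index, title field, and synonym set are hypothetical):

PUT /my_text_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [ "automobile, car, vehicle" ]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonyms" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_synonym_analyzer"
      }
    }
  }
}

Every synonym group must be curated by hand; as we’ll see, vector search sidesteps this maintenance burden entirely.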

2. Inconsistent Text Formats:

Non-structured data often includes text that doesn’t conform to regular formats, such as logs, emails, or web scraping results. These documents might contain a mix of structured text, metadata, and free-form content. Lexical search relies heavily on text analysis pipelines that break down text into tokens based on predefined rules. If the text is irregular or doesn’t fit well with these rules, important content can be misinterpreted or overlooked.

3. Difficulty with Rich Text and Multimedia Content:

When dealing with documents that contain rich text (e.g., HTML) or embedded multimedia elements (images, videos), lexical search hits its limits. It can’t easily parse and understand content that isn’t purely textual. For example, a product description containing both text and images might be indexed poorly because lexical search only sees the text and misses the context provided by the images or formatting.

4. Challenges with Semantic Variability:

In non-structured data, the same concept might be expressed in many different ways, using varied vocabulary, phrasing, or even multiple languages. Lexical search’s reliance on exact matches means it often fails to connect different expressions of the same idea. This is particularly problematic in non-structured data where the language used might vary significantly within the same dataset, leading to incomplete or irrelevant search results.

How Vector Search Works

Vector search is a technique that involves representing data—whether text, images, or other unstructured information—as vectors in a continuous vector space. These vectors capture the semantic meaning of the data, allowing for searches that are more context-aware and less reliant on exact keyword matches. Documents are ranked based on their vector similarity to the query. This ranking process allows the search engine to return the most semantically relevant results, even if they don’t contain the exact keywords present in the query.

Before searching can happen, data must be transformed into vector representations. This is typically done using machine learning models such as BERT, Word2Vec, or custom models trained for specific tasks. Each piece of data (e.g., a document or an image) is converted into a high-dimensional vector. The vectors representing the data are stored in an Elasticsearch index. Unlike traditional indexing methods that rely on inverted indices for keyword lookup, vector indices store the actual vector representations, enabling efficient similarity searches.

When a search query is issued, it is also converted into a vector using the same or a similar model. The search engine then computes the similarity between the query vector and the document vectors stored in the index. Common similarity metrics include cosine similarity, Euclidean distance, or dot product.
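
As a quick worked example of cosine similarity, the most common of these metrics, take two toy vectors a = [1, 2] and b = [2, 4]:

cosine(a, b) = (a · b) / (|a| × |b|)
             = (1×2 + 2×4) / (√(1² + 2²) × √(2² + 4²))
             = 10 / (√5 × √20)
             = 10 / 10
             = 1.0

The two vectors point in exactly the same direction, so their cosine similarity is 1, the maximum possible value; orthogonal vectors would score 0.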

To fully appreciate the advantages of vector search, it’s important to compare it with traditional lexical search.

  1. Lexical Search:
    • Technique: Relies on matching query terms with document terms using an inverted index.
    • Strengths: Effective for precise term matching, fast lookup for exact matches.
    • Limitations: Struggles with synonyms, word variations, and understanding context. For example, a search for “car” might miss documents containing “automobile” or “vehicle.”
  2. Vector Search:
    • Technique: Represents queries and documents as vectors in a high-dimensional space and retrieves results based on vector similarity.
    • Strengths: Captures semantic meaning, handles synonyms and related terms, better at understanding the context and intent behind queries.
    • Limitations: Requires more computational resources for vector computations and storage. Also, requires training or fine-tuning of models to generate effective embeddings.

Implementing Vector Search in Elasticsearch

Let’s dive into a practical example of how to implement vector search in Elasticsearch. Elasticsearch supports storing vector data through the dense_vector field type. Here’s how you can create an index with a dense_vector field:

PUT /my_vector_index
{
  "mappings": {
    "properties": {
      "rating": {
        "type": "integer"
      },
      "title_vector": {
        "type": "dense_vector",
        "index": true,
        "similarity": "dot_product",
        "dims": 4, // dimensions of the vector, typically based on the embedding model; we use 4 dims for this demo
        "index_options": {
            "type": "hnsw",
            "ef_construction": 128,
            "m": 24    
         }
      }
    }
  }
}

In this example, the title_vector field stores the vector representation of the document’s title, and the number of dimensions (dims) corresponds to the size of the vectors generated by your embedding model. Note that on Elasticsearch 8.11 or higher, specifying index, similarity, and the other options is no longer required when creating or converting a field to dense_vector. When configuring vector fields in Elasticsearch, several parameters determine how vectors are stored, indexed, and searched. Here’s a breakdown of these parameters:

  • dims: This parameter specifies the number of vector dimensions. Up until Elasticsearch version 8.11, it was mandatory to set this parameter. The maximum number of dimensions allowed has evolved over time: it was 1024 up to version 8.9.2, increased to 2048 in version 8.10.0, and expanded further to 4096 in version 8.11.0. As of version 8.11, specifying the dims parameter is optional, and if not set, it will default to the dimension of the first indexed vector.
  • element_type: This parameter defines the data type of the vector elements. If you don’t specify a type, Elasticsearch defaults to using float (which consumes 4 bytes per element). Another option available is byte, which uses 1 byte per element and can help reduce storage size (see the mapping sketch after this list).
  • index: This parameter controls whether the vectors should be indexed or simply stored. When set to true, vectors are indexed in a dedicated, optimized data structure, which enhances search performance. If set to false, vectors are stored as binary doc values. Before version 8.10, the default value for this parameter was false, meaning vectors were not indexed unless explicitly specified. As of version 8.11, the default behavior has changed to true, meaning vectors are indexed by default.
  • similarity: This parameter determines the similarity metric used for k-NN search when vectors are indexed. The available metrics include:
    • l2_norm (L2 distance)
    • dot_product (dot product similarity)
    • cosine (cosine similarity)
    • max_inner_product (maximum inner product similarity)
    Prior to version 8.11, this parameter was required if vectors were being indexed. It’s important to note that the dot_product metric should only be used if your vectors are already normalized (i.e., they are unit vectors with a magnitude of 1). Starting from version 8.11, if this parameter isn’t specified, it defaults to cosine.
  • index_options: This parameter is relevant when the vectors are indexed. Currently, the only supported algorithm for indexing is hnsw (Hierarchical Navigable Small World). Within this context, you can configure options such as ef_construction and m. For more detailed information on these options, refer to the official Elasticsearch documentation.
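
As an illustration of element_type, here is a minimal mapping sketch that stores byte vectors to reduce storage (the index name is hypothetical; byte vectors must contain whole numbers between -128 and 127, and this relies on the 8.11+ defaults for index and similarity):

PUT /my_byte_vector_index
{
  "mappings": {
    "properties": {
      "title_vector": {
        "type": "dense_vector",
        "element_type": "byte",
        "dims": 4
      }
    }
  }
}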

These parameters provide the flexibility needed to fine-tune how vector data is handled in Elasticsearch, allowing for optimized search performance and efficient storage. To perform a vector search, you need to convert the query into a vector and then search for documents with similar vectors. Below are various examples of queries:

Exact Search

Exact search, also known as brute-force search or exact k-NN (k-Nearest Neighbors) search, retrieves the most similar vectors to a given query vector by comparing it against every vector in the index. This method calculates a similarity metric (like cosine similarity or Euclidean distance) between the query vector and every document vector. First, we need to set up the vector field mapping so that the vectors are not indexed, which is done by specifying index: false and omitting the similarity metric in the mapping:

PUT /my_vector_index
{
  "mappings": {
    "properties": {
      "rating": {
        "type": "integer"
      },
      "title_vector": {
        "type": "dense_vector",
        "index": false,
        "dims": 4
      }
    }
  }
}

Then index some data into it:

POST /my_vector_index/_bulk
{ "index": { "_id": "1" } }
{ "title_vector": [3.0, 2.7, 6.4, 1.9], "rating": 85 }
{ "index": { "_id": "2" } }
{ "title_vector": [4.8, 1.2, 3.3, 2.6], "rating": 92 }
{ "index": { "_id": "3" } }
{ "title_vector": [2.5, 3.8, 7.1, 0.4], "rating": 60 }
{ "index": { "_id": "4" } }
{ "title_vector": [1.9, 6.2, 2.3, 4.0], "rating": 47 }

Then finally, execute the search with the following query:

POST /my_vector_index/_search
{
  "_source": false,
  "fields": [ "price" ],
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "range" : {
              "price" : {
                "gte": 50
              }
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'title_vector') + 1.0",
        "params": {
          "queryVector": [1.4, 3.2, 0.1, 2.7]
        }
      }
    }
  }
}

This query searches the my_vector_index index in Elasticsearch, focusing on documents where the rating field is greater than or equal to 50. Instead of returning the full document, it only retrieves the rating field for each result.

The core of the query uses a script_score to rank the documents. It does so by applying a cosineSimilarity function between a provided query vector ([1.4, 3.2, 0.1, 2.7]) and the title_vector field of each document. The score generated by the cosine similarity is adjusted by adding 1.0 to it, ensuring that all scores are positive and making the scoring more suitable for ranking purposes.

The query first filters the documents to only those meeting the rating condition. Then, among the filtered documents, it calculates a similarity score for each one based on how close its title_vector is to the provided query vector. The results are then sorted by this score, allowing you to identify which documents, with a rating of at least 50, have vectors most similar to the query vector.
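
Painless also exposes other vector functions for script_score, such as dotProduct and l2norm. As a sketch, here is the same exact search scored by L2 distance instead; the 1 / (1 + distance) inversion keeps scores positive and maps smaller distances to higher scores:

POST /my_vector_index/_search
{
  "_source": false,
  "fields": [ "rating" ],
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "1 / (1 + l2norm(params.queryVector, 'title_vector'))",
        "params": {
          "queryVector": [1.4, 3.2, 0.1, 2.7]
        }
      }
    }
  }
}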

In summary, while exact search might be appropriate for very small datasets, it doesn’t scale well as data volumes increase. If you expect your dataset to grow, it’s essential to consider using k-NN search instead. We’ll explore that next.

  • Use Case: When high accuracy is critical and the dataset size is manageable.
  • Advantages: Provides the most precise results as it evaluates all possible candidates.
  • Limitations: Computationally expensive and time-consuming for large datasets.

Approximate k-NN Search

Approximate k-NN search uses algorithms like Annoy, HNSW (Hierarchical Navigable Small World), or LSH (Locality-Sensitive Hashing) to approximate nearest neighbors quickly. These algorithms trade off some accuracy for much faster search times, which makes them suitable for large-scale datasets and, in practice, what most users end up running. Elasticsearch’s implementation is based on HNSW.

To set up k-NN searches in the first place, we’ll create a sample index with a suitable vector field mapping, ensuring the vector data is properly indexed with index: true, with a specified similarity metric:

PUT /my_vector_index
{
  "mappings": {
    "properties": {
      "rating": {
        "type": "integer"
      },
      "title_vector": {
        "type": "dense_vector",
        "index": true,
        "similarity": "cosine",
        "dims": 4,
        "index_options": {
            "ef_construction": 128,
            "m": 24    
        }
      }
    }
  }
}

Then let’s load some data into the index:

POST /my_vector_index/_bulk
{ "index": { "_id": "1" } }
{ "title_vector": [3.0, 2.7, 6.4, 1.9], "rating": 85 }
{ "index": { "_id": "2" } }
{ "title_vector": [4.8, 1.2, 3.3, 2.6], "rating": 92 }
{ "index": { "_id": "3" } }
{ "title_vector": [2.5, 3.8, 7.1, 0.4], "rating": 60 }
{ "index": { "_id": "4" } }
{ "title_vector": [1.9, 6.2, 2.3, 4.0], "rating": 47 }

After running these two commands, our vector data is now properly indexed in an HNSW graph and ready to be searched using the knn search option.

  1. Use Case: Large-scale datasets where speed is more critical than absolute precision.
  2. Advantages: Faster search times compared to exact search, scalable to large datasets.
  3. Limitations: May return results that are slightly less accurate due to approximation.

Simple k-NN Search

Simple k-NN search involves finding the k nearest neighbors to a query vector without additional constraints or filtering. It retrieves the top k most similar vectors based on a chosen similarity metric.

POST /my_vector_index/_search
{
  "_source": false,
  "fields": [ "rating" ],
  "knn": {
    "field": "title_vector",
    "query_vector": [0.9, 3.7, 5.4, 2.5],
    "k": 2,
    "num_candidates": 100
  }
}

This query searches the my_vector_index index using k-Nearest Neighbors (k-NN) to find the vectors most similar to the provided query_vector [0.9, 3.7, 5.4, 2.5]. The search is conducted on the title_vector field.

The query requests only the rating field in the search results and excludes the full document source (_source: false). The goal is to find the top 2 (k: 2) vectors that are closest to the query vector.

The num_candidates parameter is set to 100, meaning the search will consider up to 100 candidate vectors on each shard. These candidates are evaluated for similarity, and the top 2 vectors are returned to the coordinating node, which merges results from all shards and presents the final top 2 vectors from the global set of candidates.

The role of num_candidates is crucial for balancing search performance and accuracy. A higher value increases the likelihood of finding the true nearest neighbors but also slows down the search process since more candidates are evaluated. Conversely, a lower value speeds up the search but might miss some of the nearest neighbors.

Several k-NN Searches

Several k-NN searches refers to performing multiple k-NN queries within a single search request. This can be useful when you need to retrieve the top k nearest neighbors for different query vectors or different vector fields simultaneously, as seen below. Note that this example assumes the mapping also defines a second vector field, content_vector; each clause’s score is weighted by its boost, and the weighted scores are combined into the document’s final score:

POST /my_vector_index/_search
{
  "_source": false,
  "fields": [ "rating" ],
  "knn": [
    {
      "field": "title_vector",
      "query_vector": [4.2, 1.9, 7.3, 3.5],
      "k": 2,
      "num_candidates": 100,
      "boost": 0.4
    },
    {
      "field": "content_vector",
      "query_vector": [3.8, 2.4, 6.5, 0.7],
      "k": 5,
      "num_candidates": 100,
      "boost": 0.6
    }
  ]
}

  • Use Case: Batch processing where multiple queries need to be evaluated in one go.
  • Advantages: Efficient for scenarios involving multiple queries, reducing the number of search requests.
  • Limitations: May require more complex query management and handling of multiple result sets.

Filtered k-NN Search

Filtered k-NN search combines k-NN search with additional filtering criteria. It retrieves the top k nearest neighbors, but only within the subset of documents that match the specified filter conditions. This is useful for narrowing down results based on other attributes or metadata. In Elasticsearch, the filter is applied during the approximate search itself, so the query still returns up to k matching results. The optional similarity parameter in the example below additionally sets a minimum similarity threshold: candidates scoring below it are dropped even if they rank among the k nearest.

POST /my_vector_index/_search
{
  "_source": false,
  "fields": [ "price" ],
  "knn": {
    "field": "title_vector",
    "query_vector": [5.4, 5.1, 4.0, 2.8],
    "k": 2,
    "num_candidates": 100,
    "similarity": 0.975,
    "filter" : {
      "range" : {
        "rating" : {
          "gte": 50
        }
      }
    }
  }
}

  • Use Case: When you need to find similar items within a specific subset of your data, such as filtering by category or date.
  • Advantages: Allows for more targeted and relevant search results by applying filters.
  • Limitations: May require careful management of filters to ensure they do not overly restrict the search space.

Limitations of k-NN Search

While k-NN (k-Nearest Neighbors) search is a powerful tool for finding similar items in a dataset, it comes with certain limitations that are important to consider:

1. Scalability and Performance

k-NN search can become computationally expensive as the size of the dataset grows. Exact k-NN search requires comparing the query vector with every vector in the index, leading to significant time and resource consumption for large datasets. This makes it less suitable for real-time applications or when dealing with millions of vectors unless approximations are used.

2. Curse of Dimensionality

As the number of dimensions in the vector space increases, the effectiveness of k-NN search can diminish. High-dimensional spaces often lead to sparsity, where data points become more equidistant from each other, making it harder to differentiate between truly similar and dissimilar points. This phenomenon, known as the “curse of dimensionality,” can reduce the accuracy of k-NN search results.

3. Need for Tuning

k-NN search requires careful tuning of parameters, such as the number of neighbors (k), the similarity metric, and, in the case of approximate k-NN, the algorithm-specific settings like ef_construction and m. Poorly chosen parameters can lead to suboptimal search results, either missing relevant items or including too many irrelevant ones.

4. Resource-Intensive Indexing

Indexing vectors for k-NN search, especially with exact methods, can be resource-intensive. The process may require significant memory and processing power, particularly when using algorithms like HNSW, which builds and maintains complex graph structures to speed up search queries. This can be a bottleneck in systems with limited resources or when indexing needs to happen frequently.

5. Sensitivity to Noise

k-NN search can be sensitive to noisy data, where outliers or irrelevant features in the vectors can lead to incorrect nearest neighbor identification. Unlike some machine learning models that can be regularized or trained to ignore noise, k-NN relies directly on the raw vector data, making it more vulnerable to inaccuracies if the input data isn’t clean or well-prepared.

6. Lack of Interpretability

The results of k-NN search are often harder to interpret compared to more structured query methods. Since k-NN simply retrieves the closest vectors based on the chosen metric, it doesn’t provide insights into why certain vectors are similar or how much each dimension contributes to the similarity score. This can be a limitation when transparency and explainability are important.

In summary, while k-NN search is highly effective for certain use cases, its limitations in scalability, dimensionality, resource usage, and interpretability mean that it is not always the best choice, particularly in large-scale, real-time, or high-dimensional scenarios. Careful consideration and tuning are required to maximize its effectiveness.

Conclusion

Vector search represents a significant advancement in search technology, enabling more sophisticated, context-aware queries that go beyond simple keyword matching. By understanding and implementing vector search in Elasticsearch, you can unlock powerful new capabilities for your search applications, whether you’re building a semantic search engine, recommendation system, or any other application that benefits from understanding the deeper meaning of data.

While vector search requires more resources and setup compared to traditional lexical search, the benefits in terms of relevance and user satisfaction can be substantial, making it a valuable tool in the modern search engineer’s toolkit.