Architecture: Retrieval system

How we take a query, search for related text, and send back a large final prompt that includes the question plus all relevant supporting text

This diagram depicts our retrieval system that vectorizes user queries to search a database and construct a comprehensive prompt for further processing. Here's a detailed breakdown of the steps involved in this retrieval process:

1. User Query: The process begins when a user submits a query from a frontend React app hosted on Vercel. This query can be any kind of question or command that will be enriched with information from the vector database.

2. Amazon API Gateway: The user query comes to the system via Amazon API Gateway. The API Gateway acts as an intermediary that provides a secure way to connect clients with the back-end services. It can handle request routing, composition, and protocol translation.
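
On the Lambda side, the API Gateway event arrives as JSON with the query in the request body. A minimal sketch of the entry point (the handler name, payload shape, and run_retrieval helper are assumptions for illustration, not part of the original code):

import json

def handler(event, context):
    # API Gateway proxy integrations put the raw request body in event['body']
    body = json.loads(event.get('body') or '{}')
    query = body['query']
    # Steps 3-7 below: embed the query, search OpenSearch, assemble the prompt
    final_prompt = run_retrieval(query)  # hypothetical helper wrapping the code shown later
    return json.dumps(final_prompt)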

3. Fetch Embedding for User's Query within a Lambda: The user's query is sent to OpenAI, where an embedding model transforms the query into a vector embedding. This embedding represents the semantic meaning of the query in numerical form, suitable for comparison with other embeddings.
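
In the author's setup llama-index can make this call under the hood, but it boils down to a single request to OpenAI's embeddings endpoint. A minimal sketch using the openai Python client (the model name and client version are assumptions; the post doesn't say which embedding model is used):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_query(query):
    # Assumed model; it must match the model used to embed the indexed documents
    resp = client.embeddings.create(model='text-embedding-ada-002', input=query)
    return resp.data[0].embedding  # a list of floats, e.g. 1536 dimensions for ada-002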

4. AWS Lambda deployed with Docker: This is a serverless compute service that runs code in response to events. It is used here to perform a search using cosine similarity, the cosine of the angle between two vectors, which in this context measures how similar the query vector is to the document vectors stored in the database. All of that is handled by llama-index classes.

def build_prompt(source_nodes, query):
    # Concatenate the text of every retrieved node into one context block
    context_str = ''
    print('source nodes to build prompt', source_nodes)
    for node in source_nodes:
        context_str += '\n' + node.text + '\n'
    # Drop the context and the original question into the QA prompt template
    formatted_prompt = TEXT_QA_TEMPLATE.format(context_str=context_str, query_str=query)
    return formatted_prompt

...

# Retrieve the top-k most similar chunks from OpenSearch, without any LLM synthesis
retriever = construct_retriever(aos_endpoint, index_to_query, auth, similarity_top_k=bot_settings['similarity_top_k'])
query_engine = query_engine_no_synthesis(retriever, similarity_cutoff=bot_settings['similarity_cutoff'])
resp = query_engine.query(query)
source_nodes = resp.source_nodes
final_prompt = build_prompt(source_nodes, query)
return json.dumps(final_prompt)  # the next.js serverless fn will take the prompt and send it to openai and then stream to the client
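
For intuition, here is what the cosine-similarity comparison boils down to, written out with numpy (a standalone illustration, not the llama-index internals):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means same direction, 0.0 means unrelated
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents whose embeddings score above similarity_cutoff (and rank within the
# top similarity_top_k) are the ones that make it into the final prompt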

Docker is used to containerize the Lambda deployment package, because the Lambda relies on key external pip libraries like opensearch-py and llama-index. Docker provides a way to package and run applications in a loosely isolated environment called a container. The system is designed for easy scalability and consistency across development, testing, and production environments.
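
The image itself can stay small: start from AWS's Lambda Python base image, install the pip dependencies, and point the runtime at the handler. A rough sketch (the file names, handler name, and Python version are assumptions):

FROM public.ecr.aws/lambda/python:3.11

# llama-index, opensearch-py, openai, etc.
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt

# Lambda function code; "app.handler" is a hypothetical module/function name
COPY app.py ${LAMBDA_TASK_ROOT}
CMD ["app.handler"]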

5. Amazon OpenSearch Service: Lambda interacts with Amazon OpenSearch Service, querying the vector database for entries that have a high cosine similarity to the user's query vector. This service provides the necessary infrastructure for fast, scalable, and relevant search capabilities over the contents of the documents.
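
The construct_retriever helper referenced in the Lambda snippet above isn't shown in the post; with llama-index's OpenSearch integration it would look roughly like this (a sketch assuming the 0.x-style imports used elsewhere in the post; the dimension, field names, and auth wiring are assumptions):

from llama_index import VectorStoreIndex
from llama_index.retrievers import VectorIndexRetriever
from llama_index.vector_stores import OpensearchVectorClient, OpensearchVectorStore

def construct_retriever(aos_endpoint, index_to_query, auth, similarity_top_k=5):
    # dim, embedding_field, and text_field must match how the documents were indexed
    client = OpensearchVectorClient(
        endpoint=aos_endpoint,
        index=index_to_query,
        dim=1536,
        embedding_field='embedding',
        text_field='content',
        http_auth=auth,  # assumption: extra kwargs are forwarded to the opensearch-py client
    )
    vector_store = OpensearchVectorStore(client)
    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
    return VectorIndexRetriever(index=index, similarity_top_k=similarity_top_k)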

6. Return of Supporting Texts: Once the search is complete, texts from the database that are relevant to the user's query are retrieved. These supporting texts are documents or chunks of documents that have a high degree of similarity to the query.

# Querying the Opensearch vector database
from llama_index import (
    VectorStoreIndex,
    get_response_synthesizer,
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.postprocessor import SimilarityPostprocessor

...

def query_engine_no_synthesis(retriever, similarity_cutoff=0.7):
    # response_mode='no_text' skips LLM answer synthesis inside the Lambda;
    # we only want the retrieved source nodes that clear the similarity cutoff
    query_engine = RetrieverQueryEngine.from_args(
        retriever=retriever,
        node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=similarity_cutoff)],
        response_mode='no_text',
    )
    return query_engine

7. Construction of a Larger Prompt: The retrieved supporting texts are combined with the original user query to form a larger prompt. This expanded prompt encompasses the user's initial query and adds context or additional information from the most relevant documents found in the vector database.
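
The TEXT_QA_TEMPLATE used in build_prompt above isn't shown in the post; it's essentially a format string with slots for the retrieved context and the original question. A hypothetical example of what it might contain:

# Hypothetical template; the real TEXT_QA_TEMPLATE isn't shown in the post
TEXT_QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)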

8. Response to Front End: This larger prompt is sent back to the front end of the application. It is designed so that this comprehensive prompt can be used in subsequent calls to a Large Language Model (LLM) for more detailed analysis or generation of a response, which can then be streamed to the user.
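
The streaming itself happens in the Next.js serverless function, but conceptually that final hop looks like this (shown in Python for consistency with the rest of the post; the model name is an assumption):

from openai import OpenAI

client = OpenAI()

def stream_answer(final_prompt):
    # Send the retrieval-augmented prompt to the LLM and yield tokens as they arrive
    stream = client.chat.completions.create(
        model='gpt-4',  # assumption; any chat model works here
        messages=[{'role': 'user', 'content': final_prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta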

This system vectorizes a user's textual query, searches a vector database for the most relevant documents, constructs a more detailed prompt from the search results, and then sends this prompt back to the front end. The front end can further use this prompt for interaction with an LLM, facilitating a rich and informative user experience.

(yes I used ChatGPT to help write this article)
