RAG with Langchain and Oracle, Approximate spelling of proper nouns

In this part of the series, we address the challenge of dealing with approximate spellings of proper nouns. This is a common issue when users might misspell names or use slightly different variations. To enhance our agent’s performance, we will implement a solution that leverages Langchain and Oracle Database to handle these approximations effectively.

By integrating a tool that utilizes fuzzy matching algorithms, the agent can infer the correct names even when the spelling isn’t perfect. This approach improves the agent’s ability to respond accurately, providing a more robust and user-friendly experience. With this enhancement, our agent becomes more resilient to errors, ensuring smoother and more intuitive interactions.

If you haven’t seen the previous part of this post, we invite you to read it before proceeding.

RAG with Langchain and Oracle Database – Part II Providing memory to the agent.

Facing the challenge

Let’s start by challenging the agent by asking about an employee whose name is misspelled. For this, we will inquire about an employee with the last name “Scyara”.

config = {"configurable": {"thread_id": "thread-3"}}
print_stream(agent_memory,  {"messages": [("user", "Who is the manager of Scyara?")]}, config)

================================ Human Message =================================

Who is the manager of Scyara?
================================== Ai Message ==================================

It seems there was an issue with the previous query. Let's try searching for "Scyara" by last name as well, just in case the name was entered differently. I'll check both first and last names.
Tool Calls:
  sql_db_query_checker (call_kcaRAcxSk6yQVOuKJNbQBGPz)
 Call ID: call_kcaRAcxSk6yQVOuKJNbQBGPz
  Args:
    query: SELECT e2.first_name, e2.last_name FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.employee_id WHERE e1.last_name = 'Scyara'
  sql_db_query_checker (call_9HHLr7ZyyqoYZwomXnUP6sCF)
 Call ID: call_9HHLr7ZyyqoYZwomXnUP6sCF
  Args:
    query: SELECT e2.first_name, e2.last_name FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.employee_id WHERE e1.first_name = 'Scyara'
================================= Tool Message =================================
Name: sql_db_query_checker

```sql
SELECT e2.first_name, e2.last_name 
FROM employees e1 
JOIN employees e2 ON e1.manager_id = e2.employee_id 
WHERE e1.first_name = 'Scyara'
```
================================== Ai Message ==================================
Tool Calls:
  sql_db_query (call_b0F6aO1R21lvFVR1hI1kAJWp)
 Call ID: call_b0F6aO1R21lvFVR1hI1kAJWp
  Args:
    query: SELECT e2.first_name, e2.last_name FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.employee_id WHERE e1.last_name = 'Scyara'
  sql_db_query (call_GnaxX9buNCXckY2ovtRtfohh)
 Call ID: call_GnaxX9buNCXckY2ovtRtfohh
  Args:
    query: SELECT e2.first_name, e2.last_name FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.employee_id WHERE e1.first_name = 'Scyara'
================================= Tool Message =================================
Name: sql_db_query


================================== Ai Message ==================================

It appears that there is no employee named "Scyara" in the database, either by first name or last name. Therefore, I couldn't find a manager for Scyara. If you have any other questions or need further assistance, please let me know!

In this case, the agent simply responds that no employee with that name exists. Next, we will define a new tool that allows the agent to infer who we are referring to through a list of names.

Implementing the retriever tool.

First, we get the whole list of employees.

import ast
import re


def query_as_list(db, query):
    res = db.run(query)
    res = [el for sub in ast.literal_eval(res) for el in sub if el]
    res = [re.sub(r"\b\d+\b", "", string).strip() for string in res]
    return list(set(res))


employees = query_as_list(db, "SELECT first_name || ' ' || last_name employee_name FROM Employees")

And then, we create a FAISS vector database to store the OpenAI embeddings. This allows us to efficiently perform similarity searches, enabling the agent to infer the correct names even with approximate spellings.

Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

from langchain.agents.agent_toolkits import create_retriever_tool
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vector_db = FAISS.from_texts(employees, OpenAIEmbeddings())
retriever = vector_db.as_retriever(search_kwargs={"k": 5})
description = """Use to look up values to filter on. Input is an approximate spelling of the proper noun, output is \
valid proper nouns. Use the noun most similar to the search."""
retriever_tool = create_retriever_tool(
    retriever,
    name="search_proper_nouns",
    description=description,
)

The code snippet shown above creates a retriever tool. It initializes the vector database with embeddings generated from employee names using OpenAI’s model. Then, it configures the vector database as a retriever with a search parameter to return the top 5 closest matches. A description is provided to explain the tool’s purpose, and finally, a retriever tool named “search_proper_nouns” is created using this configuration.

By Invoking the retriever tool we can check how it handles approximate spellings by returning the most similar valid proper nouns from the list of employees.

print(retriever_tool.invoke("Scyara"))

Ismael Sciarra

Sarath Sewall

Sarah Bell

Sigal Tobias

Julia Nayer

Now we will add the retriever tool to our existing set of tools and create a new agent.

system = """You are an agent designed to interact with an Oracle SQL database.
Given an input question, create a syntactically correct Oracle SQL query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most 5 results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the below tools. Only use the information returned by the below tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP, TRUNCATE etc.) to the database.

To start you should ALWAYS look at the tables in the database to see what you can query. 
Do NOT skip this step.
When possible try to join tables to extend the retrieved information.
Then you should query the schema of the most relevant tables.

If you need to filter on a proper noun, you must ALWAYS first look up the filter value using the "search_proper_nouns" tool!
Do not try to guess at the proper name - use this function to find similar ones.
"""

system_message = SystemMessage(content=system)
tools.append(retriever_tool)
new_agent = create_react_agent(llm, tools, messages_modifier=system_message, checkpointer=MemorySaver())

This time, the agent is aware of the misspelling and is able to handle it effectively.

config = {"configurable": {"thread_id": "thread-2"}}

for s in new_agent.stream(
    {"messages": [HumanMessage(content="Who is the manager of Scyara?")]}, config):
    print(s)
    print("----")

{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_FmqPeXhg59OW9TlBJnFrcqyw', 'function': {'arguments': '{"query":"Scyara"}', 'name': 'search_proper_nouns'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 737, 'total_tokens': 756}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_d576307f90', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-4c5d40b4-d799-479a-93ad-d1cd491f981e-0', tool_calls=[{'name': 'search_proper_nouns', 'args': {'query': 'Scyara'}, 'id': 'call_FmqPeXhg59OW9TlBJnFrcqyw'}], usage_metadata={'input_tokens': 737, 'output_tokens': 19, 'total_tokens': 756})]}}
----
{'tools': {'messages': [ToolMessage(content='Ismael Sciarra\n\nSarath Sewall\n\nSarah Bell\n\nSigal Tobias\n\nJulia Nayer', name='search_proper_nouns', tool_call_id='call_FmqPeXhg59OW9TlBJnFrcqyw')]}}
----
{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_EcDbRd1cNlTyPlC77maEZLeK', 'function': {'arguments': '{"query":"Ismael Sciarra"}', 'name': 'search_proper_nouns'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 20, 'prompt_tokens': 787, 'total_tokens': 807}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_d576307f90', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-5094de0c-6411-47bd-866c-0b31d9083bd9-0', tool_calls=[{'name': 'search_proper_nouns', 'args': {'query': 'Ismael Sciarra'}, 'id': 'call_EcDbRd1cNlTyPlC77maEZLeK'}], usage_metadata={'input_tokens': 787, 'output_tokens': 20, 'total_tokens': 807})]}}
----
{'tools': {'messages': [ToolMessage(content='Ismael Sciarra\n\nJose Manuel Urman\n\nAlberto Errazuriz\n\nSigal Tobias\n\nShelli Baida', name='search_proper_nouns', tool_call_id='call_EcDbRd1cNlTyPlC77maEZLeK')]}}
----
{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_2wGD2ZedIwdeS67juEtZDRQq', 'function': {'arguments': '{}', 'name': 'sql_db_list_tables'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 12, 'prompt_tokens': 843, 'total_tokens': 855}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_d576307f90', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-ffd9efc5-1a21-4fa7-a2c7-231c5d293b1a-0', tool_calls=[{'name': 'sql_db_list_tables', 'args': {}, 'id': 'call_2wGD2ZedIwdeS67juEtZDRQq'}], usage_metadata={'input_tokens': 843, 'output_tokens': 12, 'total_tokens': 855})]}}
----
{'tools': {'messages': [ToolMessage(content='countries, departments, emp_temp, employees, j_purchaseorder, job_history, jobs, locations, regions, sales', name='sql_db_list_tables', tool_call_id='call_2wGD2ZedIwdeS67juEtZDRQq')]}}
----
{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_VOCG8mzA56ZRIeUeMEnOCCoK', 'function': {'arguments': '{"table_names":"employees, departments"}', 'name': 'sql_db_schema'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 18, 'prompt_tokens': 888, 'total_tokens': 906}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_4008e3b719', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-2ac5387a-1bf6-4b78-938e-372a0a7ff9cf-0', tool_calls=[{'name': 'sql_db_schema', 'args': {'table_names': 'employees, departments'}, 'id': 'call_VOCG8mzA56ZRIeUeMEnOCCoK'}], usage_metadata={'input_tokens': 888, 'output_tokens': 18, 'total_tokens': 906})]}}
----
{'tools': {'messages': [ToolMessage(content='\nCREATE TABLE departments (\n\tdepartment_id NUMBER(4, 0) NOT NULL, \n\tdepartment_name VARCHAR(30 CHAR) NOT NULL, \n\tmanager_id NUMBER(6, 0), \n\tlocation_id NUMBER(4, 0), \n\tCONSTRAINT dept_id_pk PRIMARY KEY (department_id), \n\tCONSTRAINT dept_loc_fk FOREIGN KEY(location_id) REFERENCES locations (location_id), \n\tCONSTRAINT dept_mgr_fk FOREIGN KEY(manager_id) REFERENCES employees (employee_id)\n)\n\n/*\n3 rows from departments table:\ndepartment_id\tdepartment_name\tmanager_id\tlocation_id\n10\tAdministration\t200\t1700\n20\tMarketing\t201\t1800\n30\tPurchasing\t114\t1700\n*/\n\n\nCREATE TABLE employees (\n\temployee_id NUMBER(6, 0) NOT NULL, \n\tfirst_name VARCHAR(20 CHAR), \n\tlast_name VARCHAR(25 CHAR) NOT NULL, \n\temail VARCHAR(25 CHAR) NOT NULL, \n\tphone_number VARCHAR(20 CHAR), \n\thire_date DATE NOT NULL, \n\tjob_id VARCHAR(10 CHAR) NOT NULL, \n\tsalary NUMBER(8, 2), \n\tcommission_pct NUMBER(2, 2), \n\tmanager_id NUMBER(6, 0), \n\tdepartment_id NUMBER(4, 0), \n\tCONSTRAINT emp_emp_id_pk PRIMARY KEY (employee_id), \n\tCONSTRAINT emp_dept_fk FOREIGN KEY(department_id) REFERENCES departments (department_id), \n\tCONSTRAINT emp_job_fk FOREIGN KEY(job_id) REFERENCES jobs (job_id), \n\tCONSTRAINT emp_manager_fk FOREIGN KEY(manager_id) REFERENCES employees (employee_id), \n\tCONSTRAINT emp_salary_min CHECK (salary > 0)\n)\n\n/*\n3 rows from employees table:\nemployee_id\tfirst_name\tlast_name\temail\tphone_number\thire_date\tjob_id\tsalary\tcommission_pct\tmanager_id\tdepartment_id\n199\tDouglas\tGrant\tDGRANT\t1.650.555.0164\t2018-01-13 00:00:00\tSH_CLERK\t2600\tNone\t124\t50\n200\tJennifer\tWhalen\tJWHALEN\t1.515.555.0165\t2013-09-17 00:00:00\tAD_ASST\t4400\tNone\t101\t10\n201\tMichael\tMartinez\tMMARTINE\t1.515.555.0166\t2014-02-17 00:00:00\tMK_MAN\t13000\tNone\t100\t20\n*/', name='sql_db_schema', tool_call_id='call_VOCG8mzA56ZRIeUeMEnOCCoK')]}}
----
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/langchain_community/utilities/sql_database.py:316: SAWarning: Cannot correctly sort tables; there are unresolvable cycles between tables "departments, employees", which is usually caused by mutually dependent foreign key constraints.  Foreign key constraints involving these tables will not be considered; this warning may raise an error in a future release.
  metadata_table_names = [tbl.name for tbl in self._metadata.sorted_tables]
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/langchain_community/utilities/sql_database.py:328: SAWarning: Cannot correctly sort tables; there are unresolvable cycles between tables "departments, employees", which is usually caused by mutually dependent foreign key constraints.  Foreign key constraints involving these tables will not be considered; this warning may raise an error in a future release.
  for tbl in self._metadata.sorted_tables
{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_lhRiiKvlJm5XfZE7V4WPpX2G', 'function': {'arguments': '{"query":"SELECT e.first_name, e.last_name, m.first_name AS manager_first_name, m.last_name AS manager_last_name\\nFROM employees e\\nJOIN employees m ON e.manager_id = m.employee_id\\nWHERE e.first_name = \'Ismael\' AND e.last_name = \'Sciarra\'"}', 'name': 'sql_db_query_checker'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 74, 'prompt_tokens': 1434, 'total_tokens': 1508}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_ce0793330f', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-25dcdbd8-5ac8-4c71-88af-3cf797376596-0', tool_calls=[{'name': 'sql_db_query_checker', 'args': {'query': "SELECT e.first_name, e.last_name, m.first_name AS manager_first_name, m.last_name AS manager_last_name\nFROM employees e\nJOIN employees m ON e.manager_id = m.employee_id\nWHERE e.first_name = 'Ismael' AND e.last_name = 'Sciarra'"}, 'id': 'call_lhRiiKvlJm5XfZE7V4WPpX2G'}], usage_metadata={'input_tokens': 1434, 'output_tokens': 74, 'total_tokens': 1508})]}}
----
{'tools': {'messages': [ToolMessage(content="```sql\nSELECT e.first_name, e.last_name, m.first_name AS manager_first_name, m.last_name AS manager_last_name\nFROM employees e\nJOIN employees m ON e.manager_id = m.employee_id\nWHERE e.first_name = 'Ismael' AND e.last_name = 'Sciarra'\n```", name='sql_db_query_checker', tool_call_id='call_lhRiiKvlJm5XfZE7V4WPpX2G')]}}
----
{'agent': {'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_1zw1csZrWhJ9e9KbziVDuENv', 'function': {'arguments': '{"query":"SELECT e.first_name, e.last_name, m.first_name AS manager_first_name, m.last_name AS manager_last_name\\nFROM employees e\\nJOIN employees m ON e.manager_id = m.employee_id\\nWHERE e.first_name = \'Ismael\' AND e.last_name = \'Sciarra\'"}', 'name': 'sql_db_query'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 73, 'prompt_tokens': 1581, 'total_tokens': 1654}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_4008e3b719', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-7ea9641f-0eba-47fe-91c6-aa278d2912fc-0', tool_calls=[{'name': 'sql_db_query', 'args': {'query': "SELECT e.first_name, e.last_name, m.first_name AS manager_first_name, m.last_name AS manager_last_name\nFROM employees e\nJOIN employees m ON e.manager_id = m.employee_id\nWHERE e.first_name = 'Ismael' AND e.last_name = 'Sciarra'"}, 'id': 'call_1zw1csZrWhJ9e9KbziVDuENv'}], usage_metadata={'input_tokens': 1581, 'output_tokens': 73, 'total_tokens': 1654})]}}
----
{'tools': {'messages': [ToolMessage(content="[('Ismael', 'Sciarra', 'Nancy', 'Gruenberg')]", name='sql_db_query', tool_call_id='call_1zw1csZrWhJ9e9KbziVDuENv')]}}
----
{'agent': {'messages': [AIMessage(content='The manager of Ismael Sciarra is Nancy Gruenberg.', response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 1679, 'total_tokens': 1692}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_d576307f90', 'finish_reason': 'stop', 'logprobs': None}, id='run-525819c6-2d37-41b5-94d6-c3586e66fa98-0', usage_metadata={'input_tokens': 1679, 'output_tokens': 13, 'total_tokens': 1692})]}}

The final message holds the requested answer: ‘The manager of Ismael Sciarra is Nancy Gruenberg.’

To wrap up

Incorporating the ability to handle misspellings enhances the user experience by making interactions more intuitive and seamless. By implementing tools like the retriever with a FAISS vector database and OpenAI embeddings, our agent can accurately infer the intended proper nouns, even when users provide approximate spellings. This not only improves accuracy but also ensures a more robust and user-friendly system.

However, it’s important to note that while we have improved the agent’s functionality, we still face the critical challenge of anonymizing employee data. Ensuring that personal information is protected is paramount to maintaining privacy and compliance with data protection regulations. Moving forward, addressing this challenge will be essential to safeguard sensitive data and uphold user trust.