Hey folks, here is another small tutorial, this time on how to build a small search engine using OpenAI's pre-trained NLP model "text-embedding-ada-002". According to OpenAI, "it replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks while being priced 99.8% lower". If you want to explore this model further, OpenAI's documentation has more details.
That sounds good, right? Well, let's test it out by building embeddings from the descriptions of some superheroes I have in a CSV file.
1 - To build this, you will need an OpenAI API key. You can get one for free at https://openai.com/api/. Just sign in, click on Personal, then View API Keys, and create a new one. Make sure to store it in a secure place.
2 - Download the CSV file from my GitHub datasets repo.
3 - Open a Jupyter notebook and install the libraries imported below: pandas, numpy, and openai.
Alright, enough chat, let's get down to business. The first step is to load the required libraries.
import pandas as pd
import numpy as np
import openai
# Note: openai.embeddings_utils ships with openai versions before 1.0.
from openai.embeddings_utils import get_embedding, cosine_similarity

openai.api_key = 'put-your-api-key-here'
In this snippet, we are loading all the utils we need as well as setting up the API Key.
Create the "Combined" Field for Search
We load supers.csv into a pandas dataframe. We will search over the "history_text" and "powers_text" descriptions. To make this easier, we merge both columns into a new one called "combined". The following code does exactly this. Empty cells are replaced with empty strings first so that the two columns can be joined as text. We will use only the first 50 superheroes for our testing.
df = pd.read_csv("data/supers.csv")
df = df.fillna('')
# Let's use the first 50 superheroes (because of the rate limit).
df = df.head(50)
df["combined"] = df[["history_text", "powers_text"]].apply(" ".join, axis=1)
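To see why the fillna('') call matters, here is a minimal sketch on a toy dataframe. The two rows are made up for illustration and are not taken from supers.csv.

```python
import pandas as pd

# Toy rows standing in for supers.csv; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "history_text": ["Butler of the Wayne family.", None],
        "powers_text": ["Skilled medic.", "Flight."],
    }
)

# A missing value (NaN when read from a CSV) is not a string,
# so " ".join would raise a TypeError without this step.
toy = toy.fillna("")
toy["combined"] = toy[["history_text", "powers_text"]].apply(" ".join, axis=1)

print(toy["combined"].tolist())
```

Note that a row with an empty history ends up with a leading space in "combined". If that bothers you, calling .str.strip() on the column cleans it up.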
Create the Embeddings
The "combined" column plays an important role. We need to convert this text into an embedding, which is a vector representation of the text according to the "text-embedding-ada-002" model. This means we will create a new pandas column named "embedding" to store this representation. The code below loops over the dataframe and creates an embedding from each combined text row. Each embedding is then converted to a numpy.array and added to the dataframe. Note that the embeddings are created by making a remote call to OpenAI through the get_embedding(...) function, which has a rate limit of 60 calls/second unless you upgrade your subscription.
# Collect one embedding per row, then attach them as a new column.
embeddings = []
for item in df["combined"]:
    embedding = get_embedding(item, engine="text-embedding-ada-002")
    embeddings.append(np.array(embedding))

df["embedding"] = embeddings
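Because each embedding is a remote call, a loop like this can hit the rate limit partway through. One common way to cope is to retry with an increasing delay. The sketch below is only an illustration: with_retries and flaky_embedding are hypothetical names that are not part of the openai library, and the fake function simply fails twice so the retry logic can be seen without calling the API.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the attempts are used up.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for get_embedding(...) that fails twice before succeeding.
calls = {"count": 0}


def flaky_embedding():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limit reached")
    return [0.1, 0.2, 0.3]


# Record the delays instead of actually sleeping, to keep the demo instant.
delays = []
vector = with_retries(flaky_embedding, sleep=delays.append)
print(vector, delays)
```

In the notebook you could wrap the real call as with_retries(lambda: get_embedding(item, engine="text-embedding-ada-002")). Catching every Exception is a shortcut for the sketch. In real use you would probably want to catch only the library's rate limit error.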
Now we have a new column called "embedding" that holds the model representation for each row. The only thing left to do is to build a search function that calculates the cosine similarity between the search query and the existing embeddings. The most similar results should be returned first.
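If cosine similarity is new to you, it is the dot product of two vectors divided by the product of their lengths. Here is a minimal numpy sketch of that standard definition. The cosine function is written here for illustration, on the assumption that the library's cosine_similarity helper follows the same formula.

```python
import numpy as np


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Vectors pointing the same way score close to 1, perpendicular ones score 0.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

The score depends only on direction, not on length, which is why a short query can still match a long description.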
This is where the magic happens. The search function converts a query such as "helps batman" to a vector embedding and then compares it to the rest of the dataset. The results are sorted by the similarity column calculated from the cosine function. Take a look.
def search(df, description, n=3):
    text_embedding = get_embedding(
        description,
        engine="text-embedding-ada-002"
    )
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, text_embedding))

    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
    )
    return results
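If you want to try the ranking step without spending API calls, here is a rough offline sketch. Everything in it is an assumption for illustration: the rank function, the hero names, and the 3-dimensional vectors are made up (real ada-002 embeddings have 1536 dimensions). It also computes all similarities in one matrix operation instead of row by row.

```python
import numpy as np
import pandas as pd

# Made-up 3-dimensional vectors standing in for real embeddings.
toy = pd.DataFrame(
    {
        "real_name": ["Hero A", "Hero B", "Hero C"],
        "embedding": [
            np.array([1.0, 0.0, 0.0]),
            np.array([0.0, 1.0, 0.0]),
            np.array([0.7, 0.7, 0.0]),
        ],
    }
)


def rank(df, query_vector, n=3):
    """Return the n rows whose embeddings are most similar to query_vector."""
    matrix = np.stack(df["embedding"].to_numpy())  # shape: (rows, dims)
    query = np.asarray(query_vector, dtype=float)
    # Cosine similarity of every row against the query in one shot.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # assign() returns a copy, so the caller's dataframe is left unchanged.
    return df.assign(similarity=scores).sort_values("similarity", ascending=False).head(n)


top = rank(toy, [1.0, 0.1, 0.0], n=2)
print(top[["real_name", "similarity"]])
```

Unlike search(...), this version does not write a "similarity" column into the original dataframe, which can be handy if you run several queries in a row.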
Here comes the fun and final part. We now just ask for information and let the search(...) function do the magic. The n=3 means we want the top 3 results.
results = search(df, "helps batman", n=3)
Now we loop over the results with this code to see what came back.
for index, row in results.iterrows():
    print("---------------------- ")
    print(row["real_name"], '[Similarity Score:', round(row["similarity"]*100),"%]")
    print(" ")
    print("History:", row["history_text"][:200])
    print(" ")
    print("Powers:", row["powers_text"][:200])
These are the results from the search:
----------------------
Alfred Pennyworth [Similarity Score: 84 %]

History: Alfred CranePennyworth is the butler, mentor, surrogate father, and close friend of Bruce Wayne. Alfred has served the Wayne family since before Bruce was born. After Bruce was left orphaned from the

Powers: Alfred is exceptionally intelligent, which extends to his considerable investigative, analytical, communications, computer operating, engineering, and medical skills, as well as those in ordinary hous
----------------------
Aaron Cash [Similarity Score: 82 %]

History: Aaron Cash is the head of security at Arkham Asylum. He has a hook for a hand after his real hand was eaten by Killer Croc.

Powers:
----------------------
Bruce Wayne [Similarity Score: 82 %]

History: He was one of the many prisoners of Indian Hill to be transferred to another facility upstate on the orders of The Court. However, Fish Mooney hijacks the bus and drives it into Gotham City, where the

Powers:
As you can see, Alfred Pennyworth was the first result, with a similarity score of 84%. For those who don't know, Alfred is Batman's butler.