Comparing Prompt Results - A Rose By Any Other Name

You might want to test an expected response from a prompt sent to a large language model, but string comparisons will not help you. The inherent variability in large language model (LLM) responses will require you to find new ways to compare generated prompt results.

There are a few reasons why a generated prompt result will not exactly match a prior result: the prompt itself may have changed, the model parameters may have changed, or the model’s inherent variability may inject a small amount of change in the results.

One technique for dealing with “fuzzy” comparisons is to use embeddings to compare texts. Embeddings can be thought of as a multidimensional space of meaning, within which any fragment of text can be assigned a point, identified by a coordinate. That coordinate is the “embedding”. The distance between any two points in such a space can serve as a proxy for “sameness in meaning.” Birds of a feather flock together.

A coordinate list (or tuple) of two numbers, such as (x, y), is needed to identify a unique point in a two-dimensional space. To extend this to an n-dimensional space, a list of length n is required. Embedding models typically assign coordinates in spaces of well over 100 dimensions; the exact number depends on the embedding model chosen. For OpenAI embedding models such as 'text-embedding-ada-002' and 'text-embedding-3-small', the embedding will be a list of 1,536 float values: a coordinate in a 1,536-dimensional space.

I’ve always found it fascinating that no matter how many dimensions there are, the distance between any two points in n dimensions is always a single number.

There are a couple of ways to measure the distance between n-dimensional coordinates. The most straightforward is the “Euclidean distance” (the Euclidean norm of the difference between the two points), which most high school algebra students will remember in two dimensions as: $$ d = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2} $$ or, more generally, for any n-dimensional Euclidean space:

$$ d(p,q) = \sqrt{\sum_{i=1}^{n}(q_i - p_i)^2} $$
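As a quick illustration of the general formula, here is a minimal sketch in plain Python. The function name `euclidean_distance` is my own choice for this example and is not part of any library used later in this post.

```python
import math

def euclidean_distance(p: list[float], q: list[float]) -> float:
    """Straight-line distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared differences per dimension, then take the square root
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# The classic 3-4-5 right triangle in two dimensions
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0

# Identical points are zero distance apart, whatever the dimension count
print(euclidean_distance([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]))  # 0.0
```

The same function works unchanged for a 1,536-element embedding, since the loop does not care how many dimensions there are.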

Another choice is the cosine similarity, which considers the angle between two coordinate vectors. An embedding can be interpreted as a vector from the origin to the coordinate the embedding represents. The OpenAI embeddings used here are normalized to a length of 1.0, so only the direction of the vector carries information. If two vectors point in the same direction, the texts have similar meaning. Birds of a feather flock together and fly in the same direction too:

$$ \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} $$

If the two embedding vectors are identical, the angle between them will be 0, so the angle’s cosine will be 1. The possible values range from -1 to 1, where -1 means opposite in meaning, 0 means no overlap in meaning, and 1 is identical. In practical terms, the values we are concerned with range from 0 to 1.
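To make the three landmark values concrete, here is a small sketch of the formula in plain Python, using toy two-dimensional vectors rather than real embeddings. The function name `cosine_similarity` is an assumption made for this illustration only.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero-length vector")
    return dot / (norm_a * norm_b)

# Same direction (vector length does not matter)
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0

# Perpendicular directions
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0

# Opposite directions
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```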

For the following code examples, I will use cosine similarity. Rather than roll my own from the above formula, I will use a helper found in the torch.nn package, nn.CosineSimilarity(). This helper can run on the GPU if the tensors are placed there, and it guards against division by zero, an edge case that can arise when comparing embedding vectors.

Setting up a few functions to help generate prompt results and compare results:

from openai import OpenAI
import torch
import torch.nn as nn

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def get_prompt_result(prompt:str, temperature:float=0, top_p:float = 1) -> str:
    result = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {
          "role": "user",
          "content": prompt,
        }
      ],
      temperature=temperature,
      top_p=top_p
    )
    return result.choices[0].message.content

def get_embedding(text:str, model:str = 'text-embedding-ada-002') -> list[float]:
  return client.embeddings.create(
    input=[text],
    model=model  # 'text-embedding-3-small' can be passed instead
  ).data[0].embedding

def get_similarity(embedding1:list[float], embedding2:list[float]) -> float:
  cosine_similarity = nn.CosineSimilarity(dim=0)
  similarity = cosine_similarity(torch.tensor(embedding1), torch.tensor(embedding2))
  return similarity.item()

def describe_similarity(value:float) -> str:
  if value > 0.99:
    return "identical"
  elif value > 0.95:
    return "very similar"
  elif value > 0.9:
    return "similar"
  elif value > 0.8:
    return "somewhat similar"
  else:
    return "not similar"

Let’s show how different prompts can produce different results even if the prompts themselves seem to be asking nearly the same question:

baseline_result_text = get_prompt_result("What are some nice places to visit in Paris?")
current_result_text = get_prompt_result("Describe a few good places to visit in Paris?")

returns these results:

Paris is a city rich in history, culture, and beauty, offering a wide array of attractions for visitors. Here are some must-see places in Paris:

1. **Eiffel Tower**: The iconic symbol of Paris, offering stunning views of the city from its observation decks.

2. **Louvre Museum**: Home to thousands of works of art, including the Mona Lisa and the Venus de Milo.

3. **Notre-Dame Cathedral**: A masterpiece of Gothic architecture, though currently under restoration following the 2019 fire.

4. **Champs-Élysées and Arc de Triomphe**: A famous avenue leading to the monumental arch, which offers panoramic views of Paris.

5. **Montmartre and Sacré-Cœur Basilica**: A historic district known for its artistic history, charming streets, and the beautiful basilica with its stunning views.

6. **Musée d'Orsay**: An art museum housed in a former railway station, featuring an extensive collection of Impressionist and Post-Impressionist masterpieces.

7. **Palace of Versailles**: Located just outside Paris, this opulent palace and its gardens are a testament to the grandeur of French royalty.

8. **Seine River Cruise**: A boat tour along the Seine River offers a unique perspective of Paris's landmarks, especially beautiful at night.

9. **Luxembourg Gardens**: A beautiful park perfect for a leisurely stroll, with statues, fountains, and the Luxembourg Palace.

10. **Le Marais**: A historic district known for its narrow medieval streets, trendy boutiques, and vibrant nightlife.

11. **Panthéon**: A neoclassical mausoleum containing the remains of distinguished French citizens, with impressive architecture and a crypt.

12. **Centre Pompidou**: A modern art museum with a distinctive architectural design, housing a vast collection of contemporary art.

13. **Sainte-Chapelle**: A stunning Gothic chapel known for its magnificent stained glass windows.

14. **Opéra Garnier**: A grand opera house with opulent interiors, worth visiting for its architecture and performances.

15. **Père Lachaise Cemetery**: The final resting place of many famous individuals, including Jim Morrison, Oscar Wilde, and Edith Piaf.

These are just a few highlights, and Paris has much more to offer depending on your interests, from charming neighborhoods and cafes to world-class shopping and dining experiences.

and…

Paris, often referred to as "The City of Light," is renowned for its rich history, stunning architecture, and vibrant culture. Here are a few must-visit places in Paris:

1. **Eiffel Tower**: No trip to Paris is complete without visiting its most iconic landmark. You can take an elevator ride to the top for a breathtaking view of the city or enjoy a picnic on the Champ de Mars with the tower as your backdrop.

2. **Louvre Museum**: Home to thousands of works of art, including the Mona Lisa and the Venus de Milo, the Louvre is the world's largest art museum and a historic monument in Paris. The glass pyramid entrance is a modern architectural marvel.

3. **Notre-Dame Cathedral**: This masterpiece of French Gothic architecture is famous for its stunning facade, intricate sculptures, and beautiful stained glass windows. Although it suffered damage from a fire in 2019, it remains a symbol of Parisian heritage.

4. **Montmartre and the Basilica of the Sacré-Cœur**: Perched atop a hill, Montmartre is known for its bohemian atmosphere, artists, and charming streets. The Basilica of the Sacré-Cœur offers panoramic views of Paris and is a beautiful example of Romano-Byzantine architecture.

5. **Champs-Élysées and Arc de Triomphe**: The Champs-Élysées is one of the most famous avenues in the world, lined with shops, theaters, and cafes. At its western end stands the Arc de Triomphe, a monument honoring those who fought and died for France.

6. **Musée d'Orsay**: Housed in a former railway station, this museum is renowned for its extensive collection of Impressionist and Post-Impressionist masterpieces by artists such as Monet, Van Gogh, and Degas.

7. **Seine River Cruise**: A boat cruise on the Seine River offers a unique perspective of Paris's landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre. It's a relaxing way to see the city, especially at night when the buildings are illuminated.

8. **Palace of Versailles**: Just a short trip from Paris, the Palace of Versailles is a UNESCO World Heritage site known for its opulent architecture, stunning gardens, and the Hall of Mirrors. It was the royal residence of Louis XIV.

9. **Le Marais**: This historic district is known for its narrow medieval streets, trendy boutiques, art galleries, and vibrant nightlife. It's also home to the Jewish Quarter and the beautiful Place des Vosges.

10. **Luxembourg Gardens**: These beautifully manicured gardens are perfect for a leisurely stroll or a relaxing afternoon. The gardens feature statues, fountains, and the Luxembourg Palace, which houses the French Senate.

Each of these locations offers a unique glimpse into the rich tapestry of Parisian culture and history. Whether you're an art lover, history buff, or simply looking to soak in the city's ambiance, Paris has something to offer everyone.

They are slightly different in that the first prompt returns fifteen locations and the second prompt only ten. Let’s calculate the embeddings and compare them:

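baseline_embedding = get_embedding(baseline_result_text)
current_embedding = get_embedding(current_result_text)
similarity = get_similarity(baseline_embedding, current_embedding)
print(similarity, describe_similarity(similarity))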

> 0.9737002849578857 very similar

So, not identical, but essentially the same. You might add tests to your code that use this metric to detect whether anticipated results are “drifting” beyond some predetermined limit. With the ranges I chose, a score that falls into a lower band would be a cue to look for model changes that have crept into the process, or for a prompt that was changed in a recent release.
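A drift check of this kind can be very small. The sketch below assumes you already have a similarity score, for example from get_similarity(), and compares it with a limit. The name `has_drifted` and the 0.95 threshold are hypothetical choices for illustration, not recommended values.

```python
# Hypothetical limit; tune it against scores observed for your own prompts.
DRIFT_THRESHOLD = 0.95

def has_drifted(similarity: float, threshold: float = DRIFT_THRESHOLD) -> bool:
    """Return True when a similarity score has fallen below the allowed limit."""
    return similarity < threshold

# The score from the Paris comparison above stays within the limit
print(has_drifted(0.9737002849578857))  # False

# A lower score would be flagged for investigation
print(has_drifted(0.85))  # True
```

In a test suite, the boolean could feed an assertion so that a drifting prompt fails the build instead of going unnoticed.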

We can also compare results from running the exact same prompt twice. Notice that the default temperature for the prompt generation function get_prompt_result() is set to “0”, which should provide the most “repeatable” output from gpt-4o. In practice there is always a tiny bit of variability, even with temperature set to the most stable value. The amount varies between different LLMs:

baseline_result_text = get_prompt_result("What are some nice places to visit in Paris?")
current_result_text = get_prompt_result("What are some nice places to visit in Paris?")
similarity = get_similarity(get_embedding(baseline_result_text), get_embedding(current_result_text))
print(similarity, describe_similarity(similarity))
print(f"Is baseline_result_text == current_result_text? {baseline_result_text == current_result_text}")

The two results are not exactly identical as strings, but by comparing the embeddings we can determine them to be identical for practical purposes:

> 0.997055172920227 identical
> Is baseline_result_text == current_result_text? False

The code above allows for changing the model parameters used in the prompt generation. You can play with this code yourself and get a feel for how these values can change repeatability of the results. You might find this useful to add to your prompt testing and processing in live systems.
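One way to run that experiment is to repeat the same prompt twice at each temperature and record the similarity score. The sketch below is only a starting point: `repeatability_by_temperature` is a hypothetical helper, and it takes the generate, embed, and compare steps as arguments so that it can be tried without calling the API.

```python
from typing import Callable, Sequence

def repeatability_by_temperature(
    prompt: str,
    temperatures: Sequence[float],
    generate: Callable[[str, float], str],
    embed: Callable[[str], list[float]],
    compare: Callable[[list[float], list[float]], float],
) -> dict[float, float]:
    """Run the same prompt twice per temperature and record the similarity score."""
    scores: dict[float, float] = {}
    for temperature in temperatures:
        first = generate(prompt, temperature)
        second = generate(prompt, temperature)
        scores[temperature] = compare(embed(first), embed(second))
    return scores

# With the helper functions defined earlier in this post, a call might look like:
#
# scores = repeatability_by_temperature(
#     "What are some nice places to visit in Paris?",
#     [0, 0.5, 1.0],
#     generate=lambda p, t: get_prompt_result(p, temperature=t),
#     embed=get_embedding,
#     compare=get_similarity,
# )
# for temperature, score in scores.items():
#     print(temperature, score, describe_similarity(score))
```

Keep in mind that each temperature costs two completion calls and two embedding calls, so a long list of values adds up.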