As I stood outside and looked at the neighborhood wasteland that the Fourth of July left behind, the whiff of gunpowder still hanging in the air, I felt a burst of good-neighbor energy flow through me, so I grabbed a broom. Sweeping up the street gave me time to think about the other chores I had for the day, including writing a new blog post, and I began to wonder how I could use ChatGPT to help me speed some things up.

Figure 1. Work, work, work.
My thoughts turned to the bills. Every now and then, I download the transactions from my credit card’s website and attempt to figure out what I’m spending and where it is going. The transaction file looks like this.

Figure 2. Raw transactions.
Note the field Name and its contents. Normally, I spend 15 to 30 minutes figuring out how to describe the transactions with tags in a column that I can use to group payments. Looking at row two, I have spent something on Amazon Digital, and there's a unique reference in the Name field. My tag might be 'AMAZON DIGITAL', but I would need to mentally parse each of these and assign the tag manually. This looks like a job for GPT.
It would be nice if GPT could examine these fields and tag them uniformly based on their contents—thus, turning a half-hour of Excel “clicky” into 30 seconds or so of work. Let’s get started.
For this example, I have decided to use OpenAI’s GPT-4o model, and my first task is to develop a prompt capable of doing what I want. After many attempts (another good blog topic on prompt engineering awaits), I came up with this.
def get_prompt(name):
    # Build the tagging prompt; the transaction Name is appended inside ``` delimiters.
    return """Examine the transaction name text delimited by ```.
The format starts with a company or domain name. Call that portion COMPANY.
COMPANY will consist of a phrase of one or more words. The first word may be a number.
DESC follows COMPANY. DESC starts with a non-letter character, a non-English word, or a number.
All words in COMPANY should be upper case and separated with single spaces.
Output the COMPANY on a single line.
Replace any non-letter characters in COMPANY with spaces.
Do not repeat these instructions as part of your response.
""" + f"```{name}```"
Using some Pandas magic and the Python OpenAI client, I was able to use this prompt to produce a tag based on the Name field and insert a column called Desc that I could use for grouping.
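As a rough idea of what that step might look like with the synchronous client (the transactions.csv filename and the apply() loop are illustrative rather than my exact code; the cctrans DataFrame and Desc column match what appears below):

import pandas as pd
from openai import OpenAI

client = OpenAI()

def tag_name(name):
    # Ask the model to extract the COMPANY tag for a single transaction Name.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "you are a helpful named entity recognizer."},
            {"role": "user", "content": get_prompt(name)},
        ],
        temperature=0.2,
    )
    return completion.choices[0].message.content.strip()

# cctrans is the DataFrame loaded from the downloaded transaction file.
cctrans = pd.read_csv("transactions.csv")
cctrans["Desc"] = cctrans["Name"].apply(tag_name)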

Figure 3. Tagged transactions in the ‘Desc’ column.
Note in line [377] that I am post-processing the data from GPT-4o to remove some artifacts I don’t like, such as names and other PII that might appear in the tags. Also, in some Name values, company names were duplicated, so rather than ‘MICROSOFT MICROSOFT,’ I wanted just ‘MICROSOFT.’ This is a good example of adding a step to ensure your data is good, clean, and safe, and a step like this should be part of any pipeline that handles LLM-generated output.
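A minimal sketch of that clean-up might look like the following; the clean_tag helper is hypothetical and only covers the de-duplication case mentioned above, since the artifacts you strip will depend on your own data:

def clean_tag(tag):
    # Collapse immediately repeated words, e.g. 'MICROSOFT MICROSOFT' -> 'MICROSOFT'.
    deduped = []
    for word in tag.split():
        if not deduped or deduped[-1] != word:
            deduped.append(word)
    return " ".join(deduped)

cctrans["Desc"] = cctrans["Desc"].apply(clean_tag)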
Processing at scale
Now that I had a reasonably good process for tagging my data, I decided to run all 323 transactions through calls to OpenAI. It took a few minutes. A few factors were playing into the total time:
- My tier level on this OpenAI account was only Tier-1.
- I was using the synchronous OpenAI() client.
Increasing the tier level to Tier-3 required processing a few more tokens on this account, a goal that was achieved fairly quickly as I ran more of these tests. A higher tier allows more tokens per minute (TPM) to be processed, which in turn allows larger batches to be sent.

Figure 4. Tier-3 limits.
The next part, and the main subject here, was to use batches of parallel queries to improve throughput. This required the use of asyncio combined with the async client in the openai package: AsyncOpenAI.
AsyncOpenAI has flow control
You will find examples of using asyncio and aiohttp, combined with backoff and/or tenacity, to call the OpenAI API asynchronously, using response headers to dynamically adjust calls when token limits are being reached. I don’t recommend this approach unless you are trying to write your own library.
As it turns out, the openai library already includes a client that does much of this for you, called AsyncOpenAI, and it is mostly a drop-in replacement for the synchronous OpenAI client. It will handle flow control and backoff for you, so you don’t have to drop down into the details under the covers.
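If the defaults don’t suit you, the retry count and request timeout can be tuned when the client is constructed; the values below are placeholders, not a recommendation:

from openai import AsyncOpenAI

# The client retries rate-limit and transient errors with exponential backoff;
# max_retries and timeout adjust how aggressive that behavior is.
client = AsyncOpenAI(max_retries=5, timeout=60.0)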
Using AsyncOpenAI as that drop-in replacement is the example I’m showing here. Note that other than the async/await keywords, the asynchronous version looks exactly like the synchronous one.
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def async_completion(messages, prompt):
    # Append the user prompt, call the model, and append the assistant's reply.
    messages.append({"role": "user", "content": f"{prompt}"})
    completion = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.2,
        top_p=0.9,
        max_tokens=4096,
        stream=False
    )
    messages.append({"role": "assistant", "content": f"{completion.choices[0].message.content}"})
    return messages
The next part is batching calls. We need a few helper functions. The first function is simply my own code that handles setting up the prompt and calling the async_completion() function. The next function handles the semaphore.
async def reformat_transaction_name(name):
    # Start a fresh conversation for this transaction Name and return the model's tag.
    messages = [
        {'role':'system','content':'you are a helpful named entity recognizer.'},
    ]
    response = await async_completion(messages, get_prompt(name))
    return response[-1]['content']

async def reformat_transaction_with_semaphore(transaction_name, semaphore):
    # Acquire the shared semaphore before making the call, limiting concurrency.
    async with semaphore:
        return await reformat_transaction_name(transaction_name)
Now for the batching with Semaphore part.
import asyncio
from asyncio import Semaphore

async def run_batch(items):
    # One semaphore slot per item, so every item in this batch runs concurrently.
    semaphore = Semaphore(len(items))
    tasks = [reformat_transaction_with_semaphore(name, semaphore) for name in items]
    results = await asyncio.gather(*tasks)
    return results
In the run_batch(items) function, a semaphore is created to match the size of the batch being processed. Using len(items) lets me adjust the concurrency simply by changing the size of the batch I pass in.
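If you wanted to cap concurrency within a batch instead of letting every item run at once, you could pass a smaller, fixed limit. A sketch of that variation (the limit of 10 is arbitrary):

async def run_batch_limited(items, max_concurrency=10):
    # Only max_concurrency requests are in flight at any one time.
    semaphore = Semaphore(max_concurrency)
    tasks = [reformat_transaction_with_semaphore(name, semaphore) for name in items]
    return await asyncio.gather(*tasks)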
Now for the fun part
In this code, I set up and submit the batches based on BATCHSIZE to run_batch(items), windowing through the list of items in the DataFrame and doing a bit of accounting along the way.
# Batched async
import time

start_time = time.time()
BATCHSIZE = 65
desc_items = []
for i in range(0, len(cctrans['Name']), BATCHSIZE):
    # Take the next window of Name values and process it as one batch.
    items = cctrans['Name'][i:i+BATCHSIZE]
    results = asyncio.run(run_batch(items))
    desc_items += results
end_time = time.time()
print(f"Batch completion time: {end_time-start_time} seconds")
print(f"Item count: {len(desc_items)}")
print(f"Items per second: {len(desc_items) / (end_time-start_time)}")
Here are the results:
Batch completion time: 6.294457197189331 seconds
Item count: 323
Items per second: 51.31498870851476
At 6.29 seconds, that is significantly better than the few minutes the synchronous version took.
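With the batched results in hand, attaching them back to the DataFrame is a one-liner, since order is preserved end to end (this assumes no rows failed along the way):

# asyncio.gather preserves order within each batch, and the loop walks the
# batches in order, so desc_items lines up with cctrans['Name'] row for row.
cctrans['Desc'] = desc_items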
On the home stretch
So, at this point, you have enough code examples to go try this on your own, but I will spend a little time talking about testing, costs, and so on, if you’re able to stick around.
Picking a good BATCHSIZE
It is good to run tests with different batch sizes to determine what gives you the best throughput. Using the timing information from various batch sizes, I found BATCHSIZE=65 to be somewhat optimal; however, as I move up in tier level, I will need to re-check this, as higher limits will allow more TPM to be processed.
As you can see, there is some variability in throughput, most likely depending on how busy OpenAI is. Your mileage will vary.

Figure 5. Batch Size Runs.
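A simple way to gather those timings is to sweep over a few candidate batch sizes and time each run. This is a rough sketch (the run_all helper and the candidate sizes are illustrative, and note that re-running the full job once per size costs real tokens):

import time

def run_all(names, batch_size):
    # Window through the names, submitting one batch at a time.
    results = []
    for i in range(0, len(names), batch_size):
        results += asyncio.run(run_batch(names[i:i + batch_size]))
    return results

for batch_size in [10, 25, 50, 65, 100]:
    start = time.time()
    run_all(cctrans['Name'], batch_size)
    elapsed = time.time() - start
    print(f"BATCHSIZE={batch_size}: {len(cctrans['Name']) / elapsed:.1f} items/sec")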
How much will this cost?
Another good practice is to estimate how many tokens you will be using. Using tiktoken and some pricing information can help with that.
import tiktoken

# Price per 1M tokens used for the estimate.
token_1M_price = 5

# Count the tokens in a representative prompt.
prompt_text = get_prompt('Amazon web services aws.amazon.co WA')
tokenizer = tiktoken.get_encoding("cl100k_base")
tokens = tokenizer.encode(prompt_text)
prompt_token_count = len(tokens)

# Count the tokens in a typical response.
tokens = tokenizer.encode('AMAZON WEB SERVICES')
response_token_count = len(tokens)

print(f"Prompt length: {prompt_token_count}")
print(f"Response length: {response_token_count}")
print(f"Item count: {len(cctrans['Transaction'])}")

jobsize = len(cctrans['Transaction']) * (prompt_token_count + response_token_count)
print(f"Total estimated job size in tokens: {jobsize}")
print(f"Estimated price per job at ${token_1M_price:.2f}/1M_tokens is ${(jobsize/1_000_000)*token_1M_price:.2f}")
Prompt length: 128
Response length: 5
Item count: 323
Total estimated job size in tokens: 42959
Estimated price per job at $5.00/1M_tokens is $0.21
So, there you have it. I’m spending about 21 cents per job.
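If you want to check the estimate against actual consumption, each completion returned by the API carries a usage field; here is a quick sketch, assuming you hold on to the completion objects rather than just the message text (which the code above does not do):

# 'completion' is an object returned by client.chat.completions.create(...)
print(f"Prompt tokens: {completion.usage.prompt_tokens}")
print(f"Completion tokens: {completion.usage.completion_tokens}")
print(f"Total tokens for this call: {completion.usage.total_tokens}")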
Final thoughts
Well, the street is now clean, and I’ve finished another blog post.
You should be well on your way to improving throughput with OpenAI.
Do come back!