Question: Trouble retrieving embedding data from an OpenAI API batch embedding job
Background:
I have to embed over 300,000 product descriptions for a multi-class classification project. I split the descriptions into chunks of 34,337 descriptions each to stay under the batch embeddings size limit.
A sample of my jsonl file for batch processing:
{"custom_id": "request-0", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Base L\u00edquida Maybelline Superstay 24 Horas Full Coverage Cor 220 Natural Beige 30ml", "encoding_format": "float"}}
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Sand\u00e1lia Havaianas Top Animals Cinza/Gelo 39/40", "encoding_format": "float"}}
My jsonl file has 34,337 lines.
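A minimal sketch of how chunked .jsonl files of this shape can be generated (it assumes descriptions is the in-memory list of product description strings; the filename pattern is illustrative):

import json

CHUNK_SIZE = 34_337  # stay under the batch size limit

for n in range(0, len(descriptions), CHUNK_SIZE):
    chunk = descriptions[n:n + CHUNK_SIZE]
    with open(f"batch_emb_file_{n // CHUNK_SIZE + 1}.jsonl", "w") as f:
        for i, text in enumerate(chunk):
            request = {
                "custom_id": f"request-{i}",
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {
                    "model": "text-embedding-ada-002",
                    "input": text,
                    "encoding_format": "float",
                },
            }
            f.write(json.dumps(request) + "\n")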
I've successfully uploaded the file:
File 'batch_emb_file_1.jsonl' uploaded successfully:FileObject(id='redacted for work compliance', bytes=6663946, created_at=1720128016, filename='batch_emb_file_1.jsonl', object='file', purpose='batch', status='processed', status_details=None)
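For reference, the upload uses the Files API with purpose='batch'; a minimal sketch of the call that produces a FileObject like the one above:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("batch_emb_file_1.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

print(f"File 'batch_emb_file_1.jsonl' uploaded successfully:{batch_file}")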
and ran the embedding job:
Batch job created successfully:Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))
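A sketch of the corresponding call, assuming batch_file is the FileObject returned by the upload:

batch_job_1 = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h",
    metadata={
        "description": "Batch job for embedding large quantity of product descriptions",
        "project": "Product Classification",
    },
)
print(f"Batch job created successfully:{batch_job_1}")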
The work was completed:
client.batches.retrieve(batch_job_1.id).status
'completed'
client.batches.retrieve('redacted for work compliance')
returns:
Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1720135956, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=1720133521, in_progress_at=1720129903, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id='redacted for work compliance', request_counts=BatchRequestCounts(completed=34337, failed=0, total=34337))
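Rather than checking the status manually, it can be polled until the job reaches a terminal state; a sketch:

import time

# Poll until the batch reaches a terminal state
while True:
    job = client.batches.retrieve(batch_job_1.id)
    if job.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)  # the completion window is 24h, so a minute between checks is plenty

output_file_id = job.output_file_id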
But when I try to get the content using the output_file_id string:
client.files.content(value of output_file_id)
it returns:
<openai._legacy_response.HttpxBinaryResponseContent at 0x79ae81ec7d90>
I have tried client.files.content(value of output_file_id).content,
but this kills my kernel.
What am I doing wrong? Also, I believe I am under-utilizing batch embeddings: the 90,000 limit conflicts with the Batch Queue Limit of the 'text-embedding-ada-002' model, which is 3,000,000.
Could someone help?
Solution:
Retrieving the embedding data from the batch output file is a bit tricky; this tutorial breaks it down step by step: link
After getting the output_file_id, you need to:
import json
import pandas as pd

# Download the output file contents as text
output_file = client.files.content(output_file_id).text

embedding_results = []
for line in output_file.split('\n')[:-1]:  # the last element after split is an empty string
    data = json.loads(line)
    custom_id = data.get('custom_id')
    embedding = data['response']['body']['data'][0]['embedding']
    embedding_results.append([custom_id, embedding])

embedding_results = pd.DataFrame(embedding_results, columns=['custom_id', 'embedding'])
In my case, this retrieves the embedding data from the batch job's output file.
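As a side note on the kernel crash: holding or rendering the entire response for 34,337 embeddings in a notebook can exhaust memory. A sketch of an alternative that streams the output to disk first and parses it line by line (write_to_file is the SDK helper on the binary response; the path is illustrative):

# Write the raw output to disk instead of holding it all in memory
client.files.content(output_file_id).write_to_file('batch_output.jsonl')

embedding_results = []
with open('batch_output.jsonl') as f:
    for line in f:
        if line.strip():
            data = json.loads(line)
            embedding_results.append([data.get('custom_id'),
                                      data['response']['body']['data'][0]['embedding']])
embedding_results = pd.DataFrame(embedding_results, columns=['custom_id', 'embedding'])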