Leveraging The Power Of LLMs For Receipt Extraction

How I used OpenAI's GPT-4o model and PaddleOCR to facilitate the extraction and formatting of ingredients and prices from supplier receipts in Bon Service.

July 8, 2024

Bon Service is a web application that allows chefs to write, standardize, and share their recipes with their kitchen team members. One of the main challenges we encountered was providing access to real data. Our initial solution was to allow manual entry of ingredients, prices, and other relevant information such as origin, supplier name, and the date of the last price update.

Although this approach is used for our free version, the paid version should offer something more robust, which would truly save time in management and eliminate the burden of data entry.

In order to achieve this, we needed to extract ingredients, prices, and other relevant information from receipts. This led me to develop a simple receipt extractor API using Python, OpenAI's GPT-4o, and PaddleOCR.

The Spark of Inspiration

My colleague Remi and I were actually working on a completely different school project at the time, and OpenAI had just added the GPT-Vision model to their API. At the end of the day, we decided to experiment with the model and see whether it could be properly prompted to extract ingredients from the receipt PDFs and images I had on hand.

After roughly 25 minutes of prompting, we had a working theory: it would be possible to use GPT to format the data, provided we could feed it good-quality text. The Vision model was not very good at interpreting receipts, especially those that were not in English.

This, however, gave me the idea to explore other OCR models that might be better suited to the task, and to feed their text output to GPT-4o to format the data.

Finding The Best OCR Model For Our Use Case

Tesseract OCR

First, we needed to find a good OCR model that would be able to extract the text from the receipts. I started by looking at open-source OCR models and found Tesseract, which seemed like a good fit for our use case. It worked in TypeScript, which was the language we were using for the project, and I was able to run it on the same server as the actual application.

Tesseract was a decent model, but it had three glaring issues:

  • Its accuracy in French was suboptimal.

  • Extracting text from handwritten receipts was near impossible.

  • It couldn't handle PDFs out of the box.

This meant that if I wanted to use Tesseract, I would have to convert the PDF receipts to images and then extract the text from those. This task quickly became a lot more complicated than I thought it would be, especially in TypeScript.

While looking for a way to convert my PDFs, all signs pointed toward Python. Python had easier ways to convert PDFs to images, and it also offered a variety of other options for OCR. So, without really thinking twice, I decided that I was going to write the API in Python.

PaddleOCR

After a few more hours of research, I stumbled upon PaddleOCR, an open-source Optical Character Recognition (OCR) tool developed by PaddlePaddle, which is an open-source deep learning platform created by Baidu. PaddleOCR is designed to provide a comprehensive solution for text detection and text recognition in images. It supports a wide range of languages and is known for its accuracy and efficiency.

The base models are very powerful, and PaddleOCR could handle both images and PDFs out of the box. This was enough to convince me to try it out.

```python
from paddleocr import PaddleOCR

# Helper function: run OCR on the receipt, then join the recognized
# chunks into one big string separated by newlines.
def extract_receipts(ocr, file_path):
    results = ocr.ocr(file_path, cls=True)
    txts = []
    # PaddleOCR returns a list of lists, where each sublist contains
    # the text and its confidence score.
    for result in results:
        for line in result:
            txts.append(line[1][0])
    return "\n".join(txts)

file_path = '/root/documents/receipts/text-receipt.pdf'

_ocr = PaddleOCR(
    use_angle_cls=True,
    lang='fr',
    show_log=True,
)

try:
    formated_text = extract_receipts(_ocr, file_path)
except Exception:
    # This is where the error would be handled once added to the Flask API:
    # return jsonify({"error": "Something went wrong when passing the file through OCR."}), 500
    raise
```

In just a few hours I rewrote the code in Python and had a working prototype. Now I needed to build the API using Flask and feed the text output to GPT-4o to format the data and return it to my Next.js application.

Building The API

In order to use the receipt extractor I needed to write a simple API using Flask. When calling the API, the Next.js application would pass the file as input as well as the supplier name. The API would then return the extracted receipts in a user-friendly format.
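From the client's side, the contract is just a multipart file upload plus a supplier header. A rough sketch of such a call using Python's requests library (the endpoint path and 'X-Supplier' header mirror the Flask code in this post; the host and supplier name are made up for illustration):

```python
import requests

def build_receipt_request(api_url, file_path, supplier):
    """Build (but don't send) the multipart request the API expects."""
    with open(file_path, "rb") as f:
        req = requests.Request(
            "POST",
            api_url,
            headers={"X-Supplier": supplier},  # supplier name travels in a header
            files={"file": (file_path, f.read())},  # file travels as multipart form data
        )
    # prepare() resolves the multipart body and Content-Type boundary.
    return req.prepare()
```

In the real application the same request is issued from the Next.js frontend, but the shape of the payload is identical from any HTTP client.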

I chose Flask because it was simple to use and I had already used it in the past, when working on a bot that automated registration for my local CrossFit classes (maybe I'll write a blog post about that in the future).

Writing The Flask Application

Starting from the initial prototype, I built a simple Flask application to handle the requests and extraction.

```python
from flask import Flask, request, abort, jsonify
from werkzeug.utils import secure_filename
from paddleocr import PaddleOCR
import os

app = Flask(__name__)

@app.route("/api/process-receipts", methods=["POST"])
def process_receipts():
    # Initialize an OCR object on every request so the API can deal
    # with multiple requests at once.
    _ocr = PaddleOCR(
        use_angle_cls=True,
        lang='fr',
        show_log=True,
    )

    # Get the supplier name from the header.
    supplier = request.headers.get('X-Supplier')

    if 'file' not in request.files:
        return jsonify({"error": "No file part"}), 400

    file = request.files['file']
    if file.filename == '':
        return jsonify({"error": "No selected file"}), 400

    filename = secure_filename(file.filename)
    file_path = os.path.join("/tmp", filename)
    file.save(file_path)

    try:
        formated_text = extract_receipts(_ocr, file_path)
    except Exception:
        return jsonify({"error": "Something went wrong when passing the file through OCR."}), 500

    os.remove(file_path)

    # Once the text has been extracted, we can pass it to GPT-4o to format the data.

if __name__ == "__main__":
    app.run()
```

Adding An API Key And CORS For Extra Security

In order to protect this API from being abused, I added an API key and enabled CORS. This way, only applications that are allowed to access the API can make requests to it.

The API keys are stored in a database that is only accessible by the API itself. If no valid API key is provided, the request is aborted.
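The key store itself (`ApiDataManager`) isn't shown in the post; a minimal in-memory stand-in might look like the following sketch (the real class reads from a database, and the class/method names here simply mirror how it is used):

```python
import secrets

class ApiDataManager:
    """Hypothetical in-memory stand-in for the database-backed key store."""

    def __init__(self, keys=None):
        # Maps application name -> API key.
        self._keys = dict(keys or {})

    def get_application_api_key(self, app_name):
        # Return a random sentinel for unknown apps so that a missing
        # 'X-Api-Key' header (None) can never compare equal by accident.
        return self._keys.get(app_name, secrets.token_hex(16))
```

Returning a random sentinel rather than None for unknown applications makes the `app_api_key != db.get_application_api_key(app_name)` check fail safely when both the app name and the key are missing.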

Notice how we've renamed the header 'X-Supplier' to 'X-Supplier-Notes'.

This is because the API now receives a string that is used to modify the base prompt. Adding the supplier notes to the prompt allows the user to work with any supplier.

```python
from flask import Flask, request, abort, jsonify
from flask_cors import CORS
from werkzeug.utils import secure_filename
from data.dao import ApiDataManager
from paddleocr import PaddleOCR
import os

app = Flask(__name__)

# Added a database to store the API keys.
db = ApiDataManager()

# Enable CORS (the allowed origins can be restricted via the `origins` argument).
CORS(app)

@app.route("/api/process-receipts", methods=["POST"])
def process_receipts():
    _ocr = PaddleOCR(
        use_angle_cls=True,
        lang='fr',
        show_log=True,
    )

    # Added two new headers to the request.
    app_name = request.headers.get('X-App-Name')
    app_api_key = request.headers.get('X-Api-Key')
    supplier_notes = request.headers.get('X-Supplier-Notes')

    if app_api_key != db.get_application_api_key(app_name):
        abort(401)
```

Connecting The API To OpenAI

In order to use the API, we needed to connect it to OpenAI's GPT-4o model. This model is a powerful LLM that can generate text based on a given prompt. We used the openai Python library to connect to the model and send the requests.

```python
import os
import re
import json
import openai

class Inferencer:
    def __init__(self):
        self.__client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))

    def inference(self, additional_notes, receipts):
        # Base prompt used to make sense of the OCR text sent to the model.
        # additional_notes contains supplier-specific notes appended to the prompt.
        prompt = (
            "The following text is from a receipt, convert it to a JSON object. "
            "There might be some errors in the item description: for example, '5LB' "
            "might have been extracted as 'SLB' since the 5 closely resembles an S. "
            "Fix those errors. The JSON object should only contain the items and their "
            "price per unit (if there are two prices for the same item use the smallest "
            "of the 2). Units that are labeled GR should be changed to G, LT to L. "
            "CL is not an origin tag, it should be ignored. It is possible for items to "
            "not have an origin. The QUANTITY-UNIT should be in 2 unique fields; also "
            "add a category field. The category should be logical with the item name, "
            "and should be in the language of the receipt. Use the following categories: "
            "'Fruit & Légume', 'Viande', 'Poisson', 'Produit Laitier', 'Pâtisserie', "
            "'Cannes', 'Congeler', 'Sec', 'Fines Herbes'. Make sure to filter duplicate "
            "items. If the item is not in the list use 'Autre'. A mushroom should be "
            "classified as 'Fruit & Légume'. The JSON structure for an item should be: "
            "'name': 'NAME OF PRODUCT', 'quantity': number, 'unit': 'UNIT OF PRODUCT', "
            "'origin': 'ORIGIN TAG OF PRODUCT', 'category': 'ONE OF THE CATEGORY', "
            "'price': number. Make sure to correct common French mistakes "
            f"(e.g. lls should be île). {additional_notes}"
        )

        response = self.__client.chat.completions.create(
            model="gpt-4-0125-preview",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "text", "text": receipts},
                    ],
                }
            ],
            max_tokens=4096,
        )

        # In order to receive a JSON object we need to remove the markdown fences.
        json_string = response.choices[0].message.content.replace("```json\n", "").replace("\n```", "")
        # In some cases the JSON object was inside of an array, so extract that array.
        array = re.search(r'\[(.*?)\]', json_string, re.DOTALL).group(0)
        json_data = json.loads(array)
        return json_data
```
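The response-cleaning step is easy to test in isolation. Here is the same markdown-stripping and array-extraction logic as a standalone helper, exercised with a made-up model output (the receipt values are purely illustrative):

```python
import json
import re

def parse_model_json(content):
    # Strip the ```json fences the model often wraps its output in.
    cleaned = content.replace("```json\n", "").replace("\n```", "")
    # Some responses nest the items inside an array; grab that array if present.
    match = re.search(r'\[(.*?)\]', cleaned, re.DOTALL)
    return json.loads(match.group(0)) if match else json.loads(cleaned)
```

Note that the non-greedy `\[(.*?)\]` works here because a flat array of objects contains no `]` before its closing bracket; deeply nested arrays would need a proper JSON-aware scan.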

Sending The Receipt To GPT-4o

Once the API is connected to OpenAI, we can send the receipt to the model and get the extracted ingredients and prices. We used the extract_receipts function to send the text output to GPT-4o and get the data back in JSON format.

```python
from gpt4 import Inferencer

_gpt = Inferencer()

# Inside process_receipts, after the OCR step:
try:
    print("sending to GPT-4...")
    interpreted_receipt_data = _gpt.inference(supplier_notes, formated_text)
    return jsonify(interpreted_receipt_data), 200
except Exception:
    return jsonify({"error": "Something went wrong when converting the response from GPT-4."}), 500

if __name__ == "__main__":
    app.run()
```

This worked surprisingly well: it returned the data in a consistent format over and over again. Now all that was left was to deploy it on my personal server and call it from the Bon Service application.
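For reference, each returned item follows the schema described in the prompt. A quick structural sanity check on a hypothetical item (the values below are invented for illustration) might look like:

```python
# Fields the prompt asks the model to produce for every item.
REQUIRED_FIELDS = {"name", "quantity", "unit", "origin", "category", "price"}

def is_valid_item(item):
    # 'origin' may legitimately be absent on some receipts, per the prompt.
    return REQUIRED_FIELDS - {"origin"} <= set(item)

sample = {
    "name": "CHAMPIGNON PARIS",   # hypothetical values
    "quantity": 2,
    "unit": "KG",
    "origin": "QC",
    "category": "Fruit & Légume",
    "price": 12.99,
}
```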

Deployment

Getting PaddleOCR to work on anything other than a Linux machine was so complicated that I decided to just use a Docker container. I created a Dockerfile that would build a minimal Python container and install all the required dependencies.

The python:3.10-slim image is a good starting point for the container; however, I had to install some additional dependencies to get it to work.

```dockerfile
FROM python:3.10-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    libgomp1 \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    && rm -rf /var/lib/apt/lists/*

COPY app/ .

RUN pip install -r requirements.txt

CMD gunicorn --bind 0.0.0.0:5000 app:app
```

Up until now, every time a request was sent to the API, PaddleOCR would download the models. To avoid this, I now add the models at build time, so they don't need to be downloaded on every request.

```dockerfile
FROM python:3.10-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    libgomp1 \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    && rm -rf /var/lib/apt/lists/*

COPY app/ .

RUN pip install -r requirements.txt

# Bake the PaddleOCR models into the image so they aren't downloaded at runtime.
COPY models/ /root/.paddleocr/whl/

CMD gunicorn --bind 0.0.0.0:5000 app:app
```

In our Flask application, we simply need to add the lines of code that point to the detection and recognition models.

```python
@app.route("/api/process-receipts", methods=["POST"])
def process_receipts():
    _ocr = PaddleOCR(
        use_angle_cls=True,
        lang='fr',
        show_log=True,
        det_model_dir='/root/.paddleocr/whl/det/en_PP-OCRv3_det_infer/',
        rec_model_dir='/root/.paddleocr/whl/rec/latin_PP-OCRv3_rec_infer/',
        cls_model_dir='/root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer/',
    )
```

Since our API is deployed with Docker, it would be easy to use a web server such as NGINX to create a load balancer that would distribute the requests between different containers. This would potentially allow us to handle the requests faster.
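A minimal sketch of what that NGINX configuration could look like, assuming two API containers published on host ports 5000 and 5001 (the upstream name and ports are hypothetical):

```nginx
upstream receipt_api {
    # One entry per running container; NGINX round-robins between them.
    server 127.0.0.1:5000;
    server 127.0.0.1:5001;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://receipt_api;
    }
}
```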

This application will now be deployed on a DigitalOcean droplet, but you could also choose your own VPS.

Conclusion

In conclusion, the API is a powerful tool that can be used to extract ingredients and prices from receipts. Its ability to handle different languages and different receipt formats makes it a valuable asset to our application.

I hope you find this article useful and informative. If you have any questions or feedback, please feel free to reach out to me at hello@juliencm.dev. I'm always happy to hear from you!

Thank you for reading my blog!

Peace nerds,

Julien