Extract Data from PDFs to Airtable using OpenAI

Jul 18, 2024Zayyad Muhammad Sani

In this tutorial, you'll learn how to extract data from PDFs in Airtable. We will extract specific fields (like invoice number and due date) from invoices, using OpenAI. You can use the same approach to extract structured data from any PDF attachments. Let's dive in!

Setting up the Invoices Table

We'll begin by uploading our PDF invoices, and setting up a filtered view with all the records that need data:

1. Create a table called "Invoices" or use an existing table.

2. Create a new field called "PDF" with the type Attachment. You can use your existing attachment field, if you have one.

3. Upload invoices to the "PDF" field. Make sure the invoices are PDF files.

pdf-files-uploaded.png

4. Rename the primary field "Name" to "Invoice number".

renamed-to-invoice-number.png

5. Create a new Grid view called Needs data and add the following filters applied:

  • Invoice number - is empty
  • PDF - is not empty
needs-data-filter.png

This view will help us extract data from the PDFs that don't have any information in the table.

Now that the table is set up, an OpenAI assistant to extract PDF data for us...

Creating an Assistant in OpenAI

OpenAI assistants are like small versions of ChatGPT that developers can use to interact with files, write code, and solve problems. However, you don't need to be a developer to use them in Data Fetcher. Follow these steps to create an OpenAI assistant:

1. Visit the OpenAI assistants page in your browser. You'll be asked to log in or sign up.

openai-assistants-login.png

2. Once you're logged in, click + Create.

create-assistant-openai.png

3. Name the assistant "PDF parser".

4. Under Instructions, enter "You are a helpful assistant.". This is all we need.

name-and-instruction-pdf-parser.png

5. Select gpt-4o under Model.

select-model-openAI-assistant.png

6. Scroll down to TOOLS and turn File search on. The file search tool enables the AI to search for data in PDFs and extract it.

turn-on-file-search-openai-assistants.png

OpenAI will save the changes automatically, so you don't have to click any buttons. Now that the assistant is set up, we can use it in Data Fetcher.

Add Data Fetcher to Airtable

We'll create a request in Data Fetcher that will allow us to communicate with the OpenAI assistant we created earlier.

1. Add the Data Fetcher extension to your Airtable base.

2. Once you’ve added the extension, sign in to Data Fetcher or sign up if you don’t have an account.

sign up for data fetcher.png

3. Once you're logged in to Data Fetcher, click Create your first request.

create your first request logo 2 smaller.png

Extract Data from PDFs to Airtable using OpenAI

Follow these steps to configure the request:

1. Select OpenAI under Application.

open ai application.png

2. Rename the request to "Extract invoice data".

rename-request-extract-invoice-data.png

3. Copy and paste your OpenAI API key under Authorization. You can create an API key in your OpenAI account.

api key authorization.png

4. Under Endpoint, select Create assistant thread and run.

openai assistant thread and run.png

5. Click Save and Continue in the bottom right corner.

save-and-continue-extract-invoice-data.png

Sending Messages to the Assistant

After following the previous steps, you'll be taken to a new section where you'll tell the AI how to interact with the PDFs in the table.

1. Under Assistant, select the assistant you created earlier in OpenAI.

select-pdf-parser-data-fetcher.png

2. Copy and paste the following code under Messages.

Extract data from this invoice. Return JSON data. If you cannot find a field, return null.

first-part-of-message-extract-invoice-data.png

Now we'll reference the PDF files in the table.

3. Click the + button next to Messages.

add-reference-extract-invoice-data.png

4. In the dialog that opens, select PDF under Field.

5. Under Run for every record in view, select Needs data.

insert-reference-extract-invoice-data.png

6. Click Confirm to save and close the dialog.

7. Press Shift + Enter on your computer to go to the next line, then copy and paste the following code:

Output format: {"InvoiceNumber": "345ADK", "TotalAmount": 100, "InvoiceDueDate": "TODO" }

full-message-extract-invoice-data.png

8. Press Enter to save the message.

9. Select gpt-4o under Model, then click Save and Run to proceed to the next step.

save-and-run-extract-invoice-data.png

Let's walk through the message before we move on. We ask OpenAI to extract data from an invoice, which we attach by referencing the "PDF" field in the table. We then ask it to return the results in JSON format, so that Data Fetcher can process it. Finally, we give it the output format, where we specify the data we want from the PDF file.

OpenAI is smart enough to extract the data from different invoice formats, so if you want other invoice data such as tax or quantity, you can add them to the format above like so:

Output format: {"InvoiceNumber": "345ADK", "TotalAmount": 100, "InvoiceDueDate": "TODO", "Quantity": 1 }

Mapping Response Fields

After completing the previous steps, Data Fetcher will take you to Response Field Mapping. On this page, you'll select the fields you want to keep from the response data.

response-field-mapping-extract-invoice-data.png

Notice the text "Json" in the first three field names. Data Fetcher automatically added this prefix to the field names because we instructed OpenAI to return a JSON response when we created the message earlier.

Data Fetcher returns four fields from the request, but we only need the three we specified in the output format. Follow these steps to import them:

1. Click the checkbox next to "Message" to deselect it.

deselect-message-extract-invoice-data.png

2. Use the following mapping for the remaining fields:

  • Json invoice number -> Existing field Invoice number
  • Json total amount -> New field Total amount (change field type to currency)
  • Json due date -> New field Due date
fields-mapped-extract-invoice-data.png

3. Click Save and Run in the bottom right corner.

Go back to the default view of your table, and you should see the generated invoice data.

invoice-data-generated.png

Extract PDF Data in Airtable Automatically

Currently, whenever you add new PDFs to your table, you'll have to manually run the request. To save time, you can use Data Fetcher's Trigger feature to run the request automatically.

The Trigger feature is only available on our paid plans, so you can follow these steps to upgrade:

1. Open the request in Data Fetcher, scroll down to the Schedule / Trigger / Webhook URL tabs, then select Upgrade under Trigger.

schedule trigger webhook url tabs.png

2. Select a paid plan and complete the payment process.

3. Click + Authorize to give Data Fetcher access to your Airtable base.

trigger tab authorize.png

A warning dialog will pop up in Data Fetcher.

4. Click I understand, let's Authorize.

I understand. Let-s authorize.png

A new window will open, asking you to grant Data Fetcher access to your Airtable base.

5. Click Add a base, then select All current and future bases in all current and future workspaces so you won’t have to authorize Data Fetcher anytime you want to use it on an Airtable base.

schedule-authorize-bases.jpg

Click Grant access.

Now that you've authorized Data Fetcher, you can create a trigger to run the request when you add a new PDF to the table:

1. Select Record created.

trigger tab record created.png

2. Select "Invoices" under Table and "Needs data" under View.

select-invoices-table-trigger.png

3. Finally, click Save at the bottom of the screen.

Now, Data Fetcher will extract the data whenever you add a new PDF to your table.

That's all for this tutorial. You can check out our blog to learn how to use OpenAI and Data Fetcher to generate images and text in Airtable.

Related Posts

How to Extract Data from Images in Airtable using AI

How to Extract Data from Images in Airtable using AI

Jul 25, 2024

Zayyad Muhammad Sani

OpenAI
How to Use Airtable Generative AI

How to Use Airtable Generative AI

Jul 5, 2024

Andy Cloke

OpenAI
Connect to the OpenAI API in Airtable with No-Code

Connect to the OpenAI API in Airtable with No-Code

Jun 25, 2024

Andy Cloke

OpenAI