In this tutorial, you'll learn how to extract data from PDFs in Airtable. We will extract specific fields (like invoice number and due date) from invoices using OpenAI. You can use the same approach to extract structured data from any PDF attachments.
We'll begin by uploading our PDF invoices and setting up a filtered view with all the records that need data:
1. Create a table called "Invoices" or use an existing table.
2. Create a new field called "PDF" with the type Attachment. You can use your existing attachment field if you have one.
3. Upload invoices to the "PDF" field. Make sure the invoices are PDF files. OpenAI PDF extraction is flexible enough to handle different invoice layouts and structures, so they do not all have to be from the same supplier.
4. Rename the primary field "Name" to "Invoice number".
5. Create a new Grid view called Needs data and add the following filters applied:
This view will help us extract data from the PDFs that don't have any information in the table.
Now that the table is set up, let's create an OpenAI assistant to extract PDF data for us.
To extract data from PDF files, we need to use OpenAI assistants, since OpenAI's API requires assistants for PDF file processing. Follow these steps to create an OpenAI assistant:
1. Sign up for an OpenAI API account, or log in to your existing account.
2. Add a payment method in your OpenAI API billing panel.
Note: OpenAI API accounts and billing are separate from ChatGPT accounts and plus subscriptions.
3. Visit the OpenAI assistants page.
4. Click + Create to create your first OpenAI assistant.
5. Name the assistant "PDF parser".
6. Under Instructions, enter "You are a helpful assistant.". This is all we need.
7. Select gpt-4o under Model.
8. Scroll down to TOOLS and turn File search on. The file search tool enables the assistant to access the PDFs we send to it and extract data.
OpenAI will save the changes automatically, so you don't have to click any buttons. Now that the assistant is set up, we can use it in Data Fetcher.
We'll create a request in Data Fetcher that will allow us to communicate with the OpenAI assistant we created earlier.
1. Add the Data Fetcher extension to your Airtable base.
2. Once you’ve added the extension, sign in to Data Fetcher or sign up if you don’t have an account.
3. Once you're logged in to Data Fetcher, click Create your first request.
Follow these steps to configure the request:
1. Select OpenAI under Application.
2. Rename the request to "Extract invoice data".
3. Copy and paste your OpenAI API key under Authorization. You can create an API key in your OpenAI account.
4. Under Endpoint, select Create assistant thread and run.
5. Click Save and Continue in the bottom right corner.
After following the previous steps, you'll be taken to a new section where you'll configure how OpenAI processes the PDFs.
1. Under Assistant, select the assistant you created earlier in OpenAI.
2. Copy and paste the following message under Messages:
Extract data from the following invoice. Return JSON data. If you cannot find a field, return null.
Output format:
{
"InvoiceNumber": "345ADK",
"TotalAmount": 100,
"InvoiceDueDate": "01-27-2025"
}
Note: If you want to extract additional fields from each invoice, such as tax or quantity, add them to the output format JSON.
Now we'll reference the PDF files in the table.
3. Click the + button next to Messages to reference the PDF files.
4. In the dialog that opens:
5. Click Confirm to save and close the dialog.
The message sent to OpenAI will now include each invoice when the integration runs.
1. Click Save and Run in the bottom right corner to proceed to the next step.
After completing the previous steps, Data Fetcher will take you to Response Field Mapping. This is where you'll configure how the extracted PDF results are imported into your Airtable fields.
Notice the "Json" prefix in the field names. Data Fetcher automatically adds this prefix because we configured our PDF invoice parser to return JSON-formatted data from the OpenAI assistant.
The OpenAI API response returns four fields from the request, but we only need the three invoice data fields we specified in our output format. Follow these steps to map them:
1. Click the checkbox next to "Message" to deselect it (we don't need the raw message text).
2. Map the remaining parsed PDF fields to your Airtable columns:
3. Click Save and Run in the bottom right corner to run the PDF invoice parser.
Return to the default view of your Invoices table, and you should see the extracted invoice data populated automatically from your PDF files through the OpenAI PDF extraction process.
Currently, whenever you add new PDFs to your table, you'll have to manually run the request. To save time, you can use Data Fetcher's Trigger feature to run the request automatically.
The Trigger feature is only available on our paid plans, so you can follow these steps to upgrade:
1. Open the request in Data Fetcher, scroll down to the Schedule / Trigger / Webhook URL tabs, then select Upgrade under Trigger.
2. Select a paid plan and complete the payment process.
3. Click + Authorize to give Data Fetcher access to your Airtable base.
4. Click I understand, let's Authorize.
A new window will open, asking you to grant Data Fetcher access to your Airtable base.
5. Click Add a base, then select + Add resources so you won’t have to authorize Data Fetcher anytime you want to use it on an additional Airtable base.
6. Click Grant access.
Now that you've authorized Data Fetcher, you can create a trigger to run the request when you add a new PDF to the table:
1. Select Record created.
2. Select "Invoices" under Table and "Needs data" under View.
3. Finally, click Save at the bottom of the screen.
Your automation is now active! Data Fetcher will automatically extract data from any new PDFs you add to your Invoices table. Simply upload PDF invoices and the system will populate the invoice numbers, amounts, and due dates without any manual work - saving you from manual data entry.
That's all for this tutorial. You can check out our blog to learn how to use OpenAI and Data Fetcher to generate images and text in Airtable.
This same workflow works with many other file types and business scenarios. You can extract structured data from Word documents, PowerPoint presentations, text files, code files, and more. Here are some practical applications:
The flexibility of OpenAI's file processing means you can adapt this approach to virtually any document-based workflow where you need to turn unstructured files into organized Airtable data.
OpenAI supports a wide range of file formats beyond PDFs:
Note: For text-based files, encoding must be UTF-8, UTF-16, or ASCII.
For working with images instead of documents, check out our guide on extracting data from images to learn how to process screenshots, charts, and visual content.
Jul 25, 2024
Jul 5, 2024
Jun 17, 2024
Data Fetcher customers spend less time copying data and more time using it.
1 / 11
Data Fetcher is incredibly easy to use and understand. We have no API or data experience, yet our team can seamlessly integrate external data easily with Data Fetcher.
Thomas Coiner
CEO, ProU Sports