In this tutorial, you'll learn how to extract data from PDFs in Airtable. We will extract specific fields (like invoice number and due date) from invoices, using OpenAI. You can use the same approach to extract structured data from any PDF attachments. Let's dive in!
We'll begin by uploading our PDF invoices, and setting up a filtered view with all the records that need data:
1. Create a table called "Invoices" or use an existing table.
2. Create a new field called "PDF" with the type Attachment. You can use your existing attachment field, if you have one.
3. Upload invoices to the "PDF" field. Make sure the invoices are PDF files.
4. Rename the primary field "Name" to "Invoice number".
5. Create a new Grid view called Needs data and add the following filters applied:
This view will help us extract data from the PDFs that don't have any information in the table.
Now that the table is set up, an OpenAI assistant to extract PDF data for us...
OpenAI assistants are like small versions of ChatGPT that developers can use to interact with files, write code, and solve problems. However, you don't need to be a developer to use them in Data Fetcher. Follow these steps to create an OpenAI assistant:
1. Visit the OpenAI assistants page in your browser. You'll be asked to log in or sign up.
2. Once you're logged in, click + Create.
3. Name the assistant "PDF parser".
4. Under Instructions, enter "You are a helpful assistant.". This is all we need.
5. Select gpt-4o under Model.
6. Scroll down to TOOLS and turn File search on. The file search tool enables the AI to search for data in PDFs and extract it.
OpenAI will save the changes automatically, so you don't have to click any buttons. Now that the assistant is set up, we can use it in Data Fetcher.
We'll create a request in Data Fetcher that will allow us to communicate with the OpenAI assistant we created earlier.
1. Add the Data Fetcher extension to your Airtable base.
2. Once you’ve added the extension, sign in to Data Fetcher or sign up if you don’t have an account.
3. Once you're logged in to Data Fetcher, click Create your first request.
Follow these steps to configure the request:
1. Select OpenAI under Application.
2. Rename the request to "Extract invoice data".
3. Copy and paste your OpenAI API key under Authorization. You can create an API key in your OpenAI account.
4. Under Endpoint, select Create assistant thread and run.
5. Click Save and Continue in the bottom right corner.
After following the previous steps, you'll be taken to a new section where you'll tell the AI how to interact with the PDFs in the table.
1. Under Assistant, select the assistant you created earlier in OpenAI.
2. Copy and paste the following code under Messages.
Extract data from this invoice. Return JSON data. If you cannot find a field, return null.
Now we'll reference the PDF files in the table.
3. Click the + button next to Messages.
4. In the dialog that opens, select PDF under Field.
5. Under Run for every record in view, select Needs data.
6. Click Confirm to save and close the dialog.
7. Press Shift + Enter on your computer to go to the next line, then copy and paste the following code:
Output format: {"InvoiceNumber": "345ADK", "TotalAmount": 100, "InvoiceDueDate": "TODO" }
8. Press Enter to save the message.
9. Select gpt-4o under Model, then click Save and Run to proceed to the next step.
Let's walk through the message before we move on. We ask OpenAI to extract data from an invoice, which we attach by referencing the "PDF" field in the table. We then ask it to return the results in JSON format, so that Data Fetcher can process it. Finally, we give it the output format, where we specify the data we want from the PDF file.
OpenAI is smart enough to extract the data from different invoice formats, so if you want other invoice data such as tax or quantity, you can add them to the format above like so:
Output format: {"InvoiceNumber": "345ADK", "TotalAmount": 100, "InvoiceDueDate": "TODO", "Quantity": 1 }
After completing the previous steps, Data Fetcher will take you to Response Field Mapping. On this page, you'll select the fields you want to keep from the response data.
Notice the text "Json" in the first three field names. Data Fetcher automatically added this prefix to the field names because we instructed OpenAI to return a JSON response when we created the message earlier.
Data Fetcher returns four fields from the request, but we only need the three we specified in the output format. Follow these steps to import them:
1. Click the checkbox next to "Message" to deselect it.
2. Use the following mapping for the remaining fields:
3. Click Save and Run in the bottom right corner.
Go back to the default view of your table, and you should see the generated invoice data.
Currently, whenever you add new PDFs to your table, you'll have to manually run the request. To save time, you can use Data Fetcher's Trigger feature to run the request automatically.
The Trigger feature is only available on our paid plans, so you can follow these steps to upgrade:
1. Open the request in Data Fetcher, scroll down to the Schedule / Trigger / Webhook URL tabs, then select Upgrade under Trigger.
2. Select a paid plan and complete the payment process.
3. Click + Authorize to give Data Fetcher access to your Airtable base.
A warning dialog will pop up in Data Fetcher.
4. Click I understand, let's Authorize.
A new window will open, asking you to grant Data Fetcher access to your Airtable base.
5. Click Add a base, then select All current and future bases in all current and future workspaces so you won’t have to authorize Data Fetcher anytime you want to use it on an Airtable base.
Click Grant access.
Now that you've authorized Data Fetcher, you can create a trigger to run the request when you add a new PDF to the table:
1. Select Record created.
2. Select "Invoices" under Table and "Needs data" under View.
3. Finally, click Save at the bottom of the screen.
Now, Data Fetcher will extract the data whenever you add a new PDF to your table.
That's all for this tutorial. You can check out our blog to learn how to use OpenAI and Data Fetcher to generate images and text in Airtable.
Jul 25, 2024
•Zayyad Muhammad Sani
•OpenAIJul 5, 2024
•Andy Cloke
•OpenAIJun 25, 2024
•Andy Cloke
•OpenAI