Extract Data from PDFs to Airtable using OpenAI

Jul 18, 2024Andy Cloke

In this tutorial, you'll learn how to extract data from PDFs in Airtable. We will extract specific fields (like invoice number and due date) from invoices using OpenAI. You can use the same approach to extract structured data from any PDF attachments.

Set Up Invoices Table

We'll begin by uploading our PDF invoices and setting up a filtered view with all the records that need data:

1. Create a table called "Invoices" or use an existing table.

2. Create a new field called "PDF" with the type Attachment. You can use your existing attachment field if you have one.

3. Upload invoices to the "PDF" field. Make sure the invoices are PDF files. OpenAI PDF extraction is flexible enough to handle different invoice layouts and structures, so they do not all have to be from the same supplier.

pdf-files-uploaded.png

4. Rename the primary field "Name" to "Invoice number".

renamed-to-invoice-number.png

5. Create a new Grid view called Needs data and add the following filters applied:

  • Invoice number - is empty
  • PDF - is not empty
needs-data-filter.png

This view will help us extract data from the PDFs that don't have any information in the table.

Now that the table is set up, let's create an OpenAI assistant to extract PDF data for us.

Create an Assistant in OpenAI

To extract data from PDF files, we need to use OpenAI assistants, since OpenAI's API requires assistants for PDF file processing. Follow these steps to create an OpenAI assistant:

1. Sign up for an OpenAI API account, or log in to your existing account.

2. Add a payment method in your OpenAI API billing panel.

Note: OpenAI API accounts and billing are separate from ChatGPT accounts and plus subscriptions.

3. Visit the OpenAI assistants page.

4. Click + Create to create your first OpenAI assistant.

create-assistant-openai.png

5. Name the assistant "PDF parser".

6. Under Instructions, enter "You are a helpful assistant.". This is all we need.

name-and-instruction-pdf-parser.png

7. Select gpt-4o under Model.

select-model-openAI-assistant.png

8. Scroll down to TOOLS and turn File search on. The file search tool enables the assistant to access the PDFs we send to it and extract data.

turn-on-file-search-openai-assistants.png

OpenAI will save the changes automatically, so you don't have to click any buttons. Now that the assistant is set up, we can use it in Data Fetcher.

Add Data Fetcher

We'll create a request in Data Fetcher that will allow us to communicate with the OpenAI assistant we created earlier.

1. Add the Data Fetcher extension to your Airtable base.

2. Once you’ve added the extension, sign in to Data Fetcher or sign up if you don’t have an account.

Data Fetcher extension sign up.png

3. Once you're logged in to Data Fetcher, click Create your first request.

Create your first request button.png

Extract Data from PDFs to Airtable using OpenAI

Follow these steps to configure the request:

1. Select OpenAI under Application.

OpenAI application dropdown.png

2. Rename the request to "Extract invoice data".

3. Copy and paste your OpenAI API key under Authorization. You can create an API key in your OpenAI account.

api key authorization.png

4. Under Endpoint, select Create assistant thread and run.

Create assistant thread and run.png

5. Click Save and Continue in the bottom right corner.

extract pdf save and continue.png

Send Messages to the Assistant

After following the previous steps, you'll be taken to a new section where you'll configure how OpenAI processes the PDFs.

1. Under Assistant, select the assistant you created earlier in OpenAI.

pdf parser assistant.png

2. Copy and paste the following message under Messages:

Extract data from the following invoice. Return JSON data. If you cannot find a field, return null.
⁠
Output format: 
{
"InvoiceNumber": "345ADK", 
"TotalAmount": 100, 
"InvoiceDueDate": "01-27-2025" 
}
extract pdf data message.png

Note: If you want to extract additional fields from each invoice, such as tax or quantity, add them to the output format JSON.

Now we'll reference the PDF files in the table.

3. Click the + button next to Messages to reference the PDF files.

extract pdf + button.png

4. In the dialog that opens:

  • Select PDF under Field
  • Under Run for every record in view, select Needs data
  • extract pdf table reference.png

    5. Click Confirm to save and close the dialog.

    The message sent to OpenAI will now include each invoice when the integration runs.

    messages with pdf reference.png

    Run PDF Invoice Parser

    1. Click Save and Run in the bottom right corner to proceed to the next step.

    pdf parser save and run.png

    Map PDF Extraction Results to Airtable Fields

    After completing the previous steps, Data Fetcher will take you to Response Field Mapping. This is where you'll configure how the extracted PDF results are imported into your Airtable fields.

    pdf response fields.png

    Notice the "Json" prefix in the field names. Data Fetcher automatically adds this prefix because we configured our PDF invoice parser to return JSON-formatted data from the OpenAI assistant.

    The OpenAI API response returns four fields from the request, but we only need the three invoice data fields we specified in our output format. Follow these steps to map them:

    1. Click the checkbox next to "Message" to deselect it (we don't need the raw message text).

    2. Map the remaining parsed PDF fields to your Airtable columns:

    • Json invoice number → Existing field Invoice number
    • Json total amount → New field Total amount (change field type to currency)
    • Json invoice due date → New field Due date
    pdf response field mapping.png

    3. Click Save and Run in the bottom right corner to run the PDF invoice parser.

    pdf rfm save and run.png

    Return to the default view of your Invoices table, and you should see the extracted invoice data populated automatically from your PDF files through the OpenAI PDF extraction process.

    invoice-data-generated.png

    Automate PDF Extraction

    Currently, whenever you add new PDFs to your table, you'll have to manually run the request. To save time, you can use Data Fetcher's Trigger feature to run the request automatically.

    The Trigger feature is only available on our paid plans, so you can follow these steps to upgrade:

    1. Open the request in Data Fetcher, scroll down to the Schedule / Trigger / Webhook URL tabs, then select Upgrade under Trigger.

    upgrade button trigger tab.png

    2. Select a paid plan and complete the payment process.

    3. Click + Authorize to give Data Fetcher access to your Airtable base.

    trigger tab authorize.png

    4. Click I understand, let's Authorize.

    A new window will open, asking you to grant Data Fetcher access to your Airtable base.

    5. Click Add a base, then select + Add resources so you won’t have to authorize Data Fetcher anytime you want to use it on an additional Airtable base.

    airtable oauth grant all access.png

    6. Click Grant access.

    Now that you've authorized Data Fetcher, you can create a trigger to run the request when you add a new PDF to the table:

    1. Select Record created.

    trigger tab record created.png

    2. Select "Invoices" under Table and "Needs data" under View.

    needs data view record created trigger.png

    3. Finally, click Save at the bottom of the screen.

    Your automation is now active! Data Fetcher will automatically extract data from any new PDFs you add to your Invoices table. Simply upload PDF invoices and the system will populate the invoice numbers, amounts, and due dates without any manual work - saving you from manual data entry.

    That's all for this tutorial. You can check out our blog to learn how to use OpenAI and Data Fetcher to generate images and text in Airtable.

    Beyond PDF Invoices: Other Use Cases

    This same workflow works with many other file types and business scenarios. You can extract structured data from Word documents, PowerPoint presentations, text files, code files, and more. Here are some practical applications:

    • Contract Management: Extract key terms, dates, and parties from legal documents and contracts stored as PDFs or Word files.
    • Resume Screening: Parse candidate information from resumes in various formats to populate your hiring database.
    • Research Documentation: Extract findings, data points, and citations from research papers and academic documents.
    • Financial Reports: Pull key metrics and figures from financial statements and quarterly reports.
    • Technical Documentation: Extract API endpoints, code snippets, and configuration details from technical docs.

    The flexibility of OpenAI's file processing means you can adapt this approach to virtually any document-based workflow where you need to turn unstructured files into organized Airtable data.

    Supported File Types

    OpenAI supports a wide range of file formats beyond PDFs:

    • Documents: PDF, Word (.doc, .docx), PowerPoint (.pptx), Markdown (.md), plain text (.txt)
    • Code Files: JavaScript (.js), Python (.py), C++ (.cpp), Java (.java), TypeScript (.ts), and more
    • Web Files: HTML (.html), CSS (.css), JSON (.json)
    • Other: TeX (.tex), Shell scripts (.sh), Ruby (.rb), Go (.go), PHP (.php)

    Note: For text-based files, encoding must be UTF-8, UTF-16, or ASCII.

    For working with images instead of documents, check out our guide on extracting data from images to learn how to process screenshots, charts, and visual content.

    G2 rating

    Loved by Airtable users like you

    Data Fetcher customers spend less time copying data and more time using it.

    1 / 11

    "Need data pumped into Airtable? Data Fetcher is the solution."

    Data Fetcher is incredibly easy to use and understand. We have no API or data experience, yet our team can seamlessly integrate external data easily with Data Fetcher.

    Thomas Coiner

    Thomas Coiner

    CEO, ProU Sports

    Ready to build with Data Fetcher?

    Start connecting your data sources with Airtable today.