Extract Data from PDFs to Airtable using OpenAI

Jul 18, 2024Andy Cloke

In this tutorial, you'll learn how to extract and parse data from PDFs in Airtable. We will extract specific fields (like document numbers and dates) from PDF files to Airtable using OpenAI. You can use the same approach to extract structured data from any PDF attachments.

Set Up Your Documents Table

We'll begin by uploading our PDF files and setting up a filtered view with all the records that need data:

1. Create a table called "Documents" or use an existing table.

2. Create a new field called "PDF" with the type Attachment. You can use your existing attachment field if you have one.

3. Upload PDF files to the "PDF" field. Make sure the documents are PDF files. OpenAI PDF parsing is flexible enough to handle different documents layouts and structures, so they do not all have to be from the same source.

4. Rename the primary field "Name" to "Title".

documents in table.png

5. Create a new Grid view called Needs data and add the following filters applied:

  • Title - is empty
  • PDF - is not empty
needs data view.png

This view will help us extract data from the PDFs that don't have any information in the table.

Now that the table is set up, let's create an OpenAI assistant to parse PDFs to Airtable for us.

Create an Assistant in OpenAI

To extract data from PDF files, we need to use OpenAI assistants, since OpenAI's API requires assistants for PDF file processing. Follow these steps to create an OpenAI assistant:

1. Sign up for an OpenAI API account, or log in to your existing account.

2. Add a payment method in your OpenAI API billing panel.

Note: OpenAI API accounts and billing are separate from ChatGPT accounts and plus subscriptions.

3. Visit the OpenAI assistants page.

4. Click + Create to create your first OpenAI assistant.

create assistant in openai dashboard.png

5. Name the assistant "PDF Data Extractor".

6. Select gpt-4o under Model.

openai assistant name and model.png

7. Scroll down to TOOLS and turn File search on. The file search tool enables the assistant to access the PDFs we send to it and extract data.

enable file search.png

OpenAI will save the changes automatically, so you don't have to click any buttons. Now that the assistant is set up, we can use it in Data Fetcher.

Add Data Fetcher

We'll create a request in Data Fetcher that will allow us to communicate with the OpenAI assistant we created earlier.

1. Add the Data Fetcher extension to your Airtable base.

2. Once you’ve added the extension, sign in to Data Fetcher or sign up if you don’t have an account.

Data Fetcher extension sign up.png

3. Once you're logged in to Data Fetcher, click Create your first request.

Create your first request button.png

Extract Data from PDFs to Airtable using OpenAI

Follow these steps to configure the request:

1. Select OpenAI under Application.

OpenAI application dropdown.png

2. Rename the request to "Extract PDF data".

3. Copy and paste your OpenAI API key under Authorization. You can create an API key in your OpenAI account.

api key authorization.png

4. Under Endpoint, select Create assistant thread and run.

Create assistant thread and run.png

5. Click Save and Continue in the bottom right corner.

extract pdf save & continue.png

Send Messages to the Assistant

After following the previous steps, you'll be taken to a new section where you'll configure how OpenAI processes the PDFs.

1. Under Assistant, select the assistant you created earlier in OpenAI.

openai assistant pdf extractor.png

2. Copy and paste the following message under Messages:

Extract data from the following document. Return JSON data. If you cannot find a field, return null.

Output format: 
{
"DocumentNumber": "DOC-12345", 
"Title": "Project Requirements Document", 
"Date": "01-27-2025" 
}
openai pdf extract message.png

Note: If you want to extract additional fields from each document, such as author information or categories, add them to the output format JSON.

Now we'll reference the PDF files in the table.

3. Click the + button next to Messages to reference the PDF files.

openai pdf extract message + highlight.png

4. In the dialog that opens:

  • Ensure Documents is selected for Table
  • Select PDF under Field
  • Under Run for every record in view, select Needs data
pdf document field table reference.png

5. Click Confirm to save and close the dialog.

The message sent to OpenAI will now include each PDF document when the integration runs.

pdf message with reference.png

Run PDF Parser

1. Click Save and Run in the bottom right corner to proceed to the next step.

save & run pdf data extract.png

Map PDF Parsing Results to Airtable Fields

After completing the previous steps, Data Fetcher will take you to Response Field Mapping. This is where you'll configure how the extracted PDF results are imported into your Airtable fields.

response field mapping pdf data.png

You can ignore the "Json" prefix in the field names. Data Fetcher automatically adds this prefix because we configured our PDF document extractor to return JSON-formatted data from the OpenAI assistant.

The OpenAI API response returns four fields from the request, but we only need the three document data fields we specified in our output format. Follow these steps to map them:

1. Click the selected "Message" field to deselect it (we don't need the raw message text).

2. Map the remaining extracted PDF fields to your Airtable columns:

  • Json title → Existing field Title
  • Json document number New field Document number
  • Json date → New field date
PDF data fields mapped.png

3. Click Save and Run in the bottom right corner to run the PDF extraction.

Return to the default "Grid view" in your Documents table, and you will see the extracted data populated automatically from your PDF files through the OpenAI PDF parsing process.

airtable pdfs with data extracted.png

Automate PDF Extraction

Currently, whenever you add new PDFs to your table, you'll have to manually run the request. To save time, you can use Data Fetcher's Trigger feature to run the request automatically.

The Trigger feature is only available on our paid plans, so you can follow these steps to upgrade:

1. Open the request in Data Fetcher, scroll down to the Schedule / Trigger / Webhook URL tabs, then select Upgrade under Trigger.

upgrade button trigger tab.png

2. Select a paid plan and complete the payment process.

3. Click + Authorize to give Data Fetcher access to your Airtable base.

trigger tab authorize.png

4. Click I understand, let's Authorize.

A new window will open, asking you to grant Data Fetcher access to your Airtable base.

5. Click Add a base, then select + Add resources so you won’t have to authorize Data Fetcher anytime you want to use it on an additional Airtable base.

airtable oauth grant all access.png

6. Click Grant access.

Now that you've authorized Data Fetcher, you can create a trigger to run the request when you add a new PDF to the table:

1. Select Record created.

trigger tab record created.png

2. Select "Documents" under Table and "Needs data" under View.

documents needs data trigger.png

3. Finally, click Save at the bottom of the screen.

Your automation is now active! Data Fetcher will automatically parse data from any new PDFs you add to your Documents table. Simply upload PDF files and the system will populate the document numbers, titles, and dates without any manual work - creating an automatic PDF to Airtable workflow.

That's all for this tutorial. You can check out our blog to learn how to use OpenAI and Data Fetcher to generate images and text in Airtable.

Beyond Basic Documents: Other Use Cases

This same workflow works with many other file types and business scenarios. You can extract structured data from Word documents, PowerPoint presentations, text files, code files, and more. Here are some practical applications:

  • Contract Management: Extract key terms, dates, and parties from legal documents and contracts stored as PDFs or Word files.
  • Resume Screening: Parse candidate information from resumes in various formats to populate your hiring database.
  • Research Documentation: Extract findings, data points, and citations from research papers and academic documents.
  • Financial Reports: Pull key metrics and figures from financial statements and quarterly reports.
  • Technical Documentation: Extract API endpoints, code snippets, and configuration details from technical docs.

The flexibility of OpenAI's file processing means you can adapt this approach to virtually any document-based workflow where you need to turn unstructured files into organized Airtable data.

Supported File Types

OpenAI supports a wide range of file formats beyond PDFs:

  • Documents: PDF, Word (.doc, .docx), PowerPoint (.pptx), Markdown (.md), plain text (.txt)
  • Code Files: JavaScript (.js), Python (.py), C++ (.cpp), Java (.java), TypeScript (.ts), and more
  • Web Files: HTML (.html), CSS (.css), JSON (.json)
  • Other: TeX (.tex), Shell scripts (.sh), Ruby (.rb), Go (.go), PHP (.php)

Note: For text-based files, encoding must be UTF-8, UTF-16, or ASCII.

For working with images instead of documents, check out our guide on extracting data from images to learn how to process screenshots, charts, and visual content.

G2 rating

Loved by Airtable users like you

Data Fetcher customers spend less time copying data and more time using it.

1 / 11

"Need data pumped into Airtable? Data Fetcher is the solution."

Data Fetcher is incredibly easy to use and understand. We have no API or data experience, yet our team can seamlessly integrate external data easily with Data Fetcher.

Thomas Coiner

Thomas Coiner

CEO, ProU Sports

Ready to build with Data Fetcher?

Start connecting your data sources with Airtable today.