Simple steps: your guide to extracting data from PDFs (images, text, tables, graphs)

2021-08-10 10:44



By Laura Grant-GIJN

Journalists get lots of data in PDF format – they can be tables of data that are embedded in reports, or spreadsheets that have been thoughtfully saved as PDFs before they’re emailed to you – but until you can get that data into a spreadsheet, there’s not much you can do with it.

Luckily, there are a few great tools that can liberate your data quickly and relatively easily. I’ve listed some of the ones that I’ve used here, but there are no doubt loads more out there.

I love Tabula. It’s my go-to option, firstly because it’s free and secondly because it’s really easy to use. Its website says it was created “by journalists for journalists”, which is probably why it’s so popular with non-techie people like me. I often need to extract tables of data from biggish PDF reports. Tabula lets you upload an entire document and select just the tables you want. Depending on the layout of your document, you can convert one table at a time or several at once into a CSV, TSV or JSON file, which you can import into Google Sheets (free), LibreOffice Calc (free), Excel (not free), or whatever program you prefer.

The only times I don’t go straight to Tabula are when I have PDFs that have been scanned in, or when the tables I want to convert are rotated 90°. But I’ll deal with those later.

For more tips on extracting data from PDFs, watch our video tutorial on using Tabula.
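If you are comfortable running a few lines of Python, a quick sanity check on an exported CSV can show where a table did not convert cleanly. The sketch below is only an illustration, not part of Tabula: the function name `find_ragged_rows` and the sample data are made up for this example. It assumes that a clean table has the same number of columns in every row, and it reports the rows that differ.

```python
import csv
import io
from collections import Counter


def find_ragged_rows(csv_text):
    """Return (expected_width, [(row_number, width), ...]).

    expected_width is the most common column count in the file;
    the list holds every non-empty row whose count differs from it.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    widths = Counter(len(row) for row in rows if row)
    if not widths:
        return 0, []
    expected = widths.most_common(1)[0][0]
    ragged = [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if row and len(row) != expected
    ]
    return expected, ragged


# Made-up sample standing in for a CSV exported from a PDF table.
sample = (
    "province,2019,2020\n"
    "Gauteng,120,135\n"
    "Limpopo,88\n"
    "Free State,40,42\n"
)

expected, ragged = find_ragged_rows(sample)
print("Expected columns:", expected)
for number, width in ragged:
    print(f"Row {number} has {width} columns, check it against the PDF")
```

To use it on a real file, read the file's contents into a string first, for example with `open("my_table.csv", newline="").read()`, where the filename is a placeholder. A row with too few or too many columns usually means a cell was merged or split during conversion.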

Cometdocs is also popular with journalists, not least because IRE members get free premium membership, and it’s really easy to use. You can convert up to five documents a week for free, but you have to subscribe if you want to do more. I quite like the fact that you can subscribe for a month at a time for $9.99, but if you really like it you can get a lifetime membership for about $130. You upload or import the PDF you want to convert, click the convert button and choose between Excel and .ODS (which you can open in LibreOffice). Unfortunately .CSV isn’t an option, but if you don’t have either of those spreadsheet packages, you can upload the file to Google Drive and open it in Google Sheets. It works quickly and well, but the really nice thing about Cometdocs is that it does optical character recognition (OCR), so it can convert scanned PDFs. You need to check the converted document against the original, though, just to be sure it picked everything up correctly. Like Tabula, it can’t handle tables that are rotated.

Adobe Export PDF
This one’s not free, but it’s not terribly expensive either (about $24 a year). If you use Acrobat Reader, which is Adobe’s free PDF reader, Export PDF allows you to convert a PDF document that you’ve opened in the reader to Excel, Word, PowerPoint or RTF. It works well and quickly with fairly big documents. But, like Tabula, it can’t do scanned documents or rotated tables.

Nitro Pro
If you have a Windows machine, Nitro is a great tool for editing and converting PDFs to useful formats, but it’s not free (about $160) and the fact that it only works with Windows means it’s out of reach for me and my MacBook. I have tried it out on somebody else’s machine, though, and I was suitably impressed.

Acrobat Pro
This one is accessible for Mac users, but it’s also not free (about $15 a month and it requires an annual commitment).

This UK-based company has developed software to automate PDF processing. It’s not free, but you can see what it can do by trying out its demo document converter, as long as your document is 1.5MB or smaller. You upload your PDF, tell them what you want it converted to, give them your email address and they’ll mail you the converted document.

This is another online conversion tool where you can upload your document, choose the format you want to convert it to and it’ll email the converted document to the email address of your choice.

Rotated tables
Sometimes the tables in PDF documents have been rotated 90°. You need to be able to rotate the tables back to a normal orientation before any conversion tool will be able to identify them as text. Just rotating the page in Acrobat Reader or Preview, for example, won’t work. You need to rotate the table itself. To do this you need a proper PDF editor such as Acrobat Pro or Nitro Pro.

If you have Acrobat Pro, here’s what you do:

  • If your tables are part of a larger document, open your document and, using the Organise Pages option, extract the pages with the tables you want to rotate. If you want to extract a number of consecutive pages, it’s simpler to extract them into separate files.
  • Open the page with the table on it. Go to the View menu and rotate view until your table is upright.
  • If there are headers and footers, or any other text, that are not rotated in the same direction as your table, remove them using the Edit PDF function. You need to delete them; covering them up doesn’t work.
  • Go to the Enhance Scans option and choose Recognise Text, then check the settings to make sure the option “Save as editable text and images” is selected. This may take a few minutes, and when it’s finished your table may be rotated 90° again.
  • Go back to View and rotate your page till the table is upright again. Then save your file.
  • You can try to convert your page to an Excel spreadsheet using the Export PDF function, but I find that Tabula generally does the job better.

Always check the converted data against the original documents because sometimes 8s can be mistaken for 6s or Bs. But even if your converted document isn’t absolutely perfect, converting it this way will be much quicker than manually typing it into a spreadsheet.
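Part of that checking can be done with a short script. The sketch below is a rough illustration under a few assumptions: the converted data has been saved as a CSV with a header row, you know which columns should contain only numbers, and the function name `flag_suspect_cells` and the sample data are invented for this example. It flags cells in those columns that contain anything other than digits, spaces, commas, full stops, a leading minus sign or a trailing percent sign, which is how a stray B in place of an 8 would show up.

```python
import csv
import io
import re

# A cell counts as "numeric-looking" if it holds only digits, spaces,
# commas and full stops, with an optional leading minus and trailing %.
NUMERIC_LOOKING = re.compile(r"-?[\d,. ]+%?")


def flag_suspect_cells(csv_text, numeric_columns):
    """Return (row_number, column, value) for cells that should be
    numbers but contain other characters. Row 1 is the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    suspects = []
    for row_number, row in enumerate(reader, start=2):
        for column in numeric_columns:
            value = (row.get(column) or "").strip()
            # Empty cells are skipped; they are not necessarily errors.
            if value and not NUMERIC_LOOKING.fullmatch(value):
                suspects.append((row_number, column, value))
    return suspects


# Made-up sample: "1B40" stands in for a misread "1840".
sample = (
    "town,population\n"
    "Alpha,1280\n"
    "Beta,1B40\n"
    "Gamma,1 200\n"
)

for row_number, column, value in flag_suspect_cells(sample, ["population"]):
    print(f"Row {row_number}, column {column}: suspicious value {value!r}")
```

A script like this only catches letters and symbols that have crept into number columns. It cannot tell when an 8 has been read as a 6, because both are valid digits, so comparing totals or spot-checking figures against the original PDF is still necessary.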

Converting scanned PDFs
In a scanned PDF, a table will be identified as an image rather than text, so if you want to extract the data from a table you first need to convert it to text with something that has optical character recognition (OCR). You can use Cometdocs, Acrobat Pro or Nitro Pro. Acrobat Pro’s Enhance Scans tool should recognise the text in your PDF as long as the quality of the scan isn’t terrible. Sometimes it helps to save a snapshot of the table you want to extract into its own PDF before you use the Enhance Scans tool. Once the scan is converted to text and images, I still save it as a PDF and convert it to a CSV with Tabula. And, of course, always check your data against the original.

Password protected PDFs
Sometimes PDFs are password protected so that you can’t edit them or convert them to any other format. If you have a Mac with Preview, try opening your PDF in Preview, then select the Export as PDF option under the File menu. Open the new version of your PDF and see if you’re able to convert it to a spreadsheet now.