How to convert tables from PDF to Excel or CSV with Tabula

One of the most laborious tasks in Machine Learning consists of data collection and treatment.

There is a meteorological observatory in my city. You can see the main meteorological indicators in real time through its website, and it shares historical data too, but only as PDF files.

I have talked with them about sharing all the data in CSV so that people can use it easily, but it seems that is not possible 🙁

So I want this data, and I want to convert these PDF files into a workable data collection. I have been searching for a good way to convert these PDF tables to CSV, and the solution is called Tabula.

Once you have the data in CSV you can use it in many ways: open it with Excel, LibreOffice, Google Sheets, etc., because it is easy to import into spreadsheets, or process it with Python and its libraries.

As I want an automated process I will work with a Python script, and this is where Tabula comes in.

Convert PDF to CSV with Tabula

It is easy. First of all I install Tabula in our development environment. Tabula lets us extract data from tables in a PDF and convert it to pandas DataFrames.

pandas is a Python library specialized in working with tabular data, such as CSV files and arrays.

With this script we can also extract and convert data from PDF to JSON, CSV and TSV. You can find it in its GitHub repository.

To take advantage of the Anaconda tutorial we saw the other day, I use it to install tabula-py. Here you can see how to install Anaconda.

Install Tabula

# First of all we activate our development environment. In our case 'comparador'
conda activate comparador
pip install tabula-py

I get an error.

Reading the official documentation I find the solution: uninstall the conflicting tabula package and then install tabula-py.

pip uninstall tabula
pip install tabula-py
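The conflict comes from two different PyPI packages that share the module name tabula. According to the tabula-py FAQ, the unrelated tabula package does not offer read_pdf. As a minimal sketch using only the standard library, the helper below checks which one is importable. The name has_tabula_py is hypothetical and not part of any library.

```python
import importlib


def has_tabula_py():
    """Return True if the importable `tabula` module exposes read_pdf."""
    try:
        module = importlib.import_module("tabula")
    except Exception:
        # Any import failure is treated as "tabula-py is not usable"
        return False
    # Assumption (from the tabula-py FAQ): the unrelated `tabula`
    # package has no read_pdf attribute
    return hasattr(module, "read_pdf")


print("tabula-py available: %s" % has_tabula_py())
```

If this prints False after installing, the uninstall and reinstall steps above are the first thing to retry.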

Create .py executable

Read pdf tables

I create the .py script called pdftocsv.py and save it in the folder Descargas/eltiempo. It is a file with this code:

import tabula
# Extract the data from the PDF into a DataFrame
df = tabula.read_pdf("inforatge.pdf")
# Write it to a file called out.csv, tab-separated and encoded as UTF-8
df.to_csv('out.csv', sep='\t', encoding='utf-8')

The downloaded PDF I have to read is called inforatge.pdf and I want to obtain an out.csv file with the result. It stays in the folder we are working in.

Go to the directory where the .py and the PDF are. It seems a very obvious step, but many people have problems because they are not working in the correct folder. If you are not in the right place you get a file-not-found error.
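One caveat with newer releases: from tabula-py 2.0 on, read_pdf returns a list of DataFrames by default (one per detected table), so calling to_csv directly on the result fails. A minimal sketch that copes with both behaviours follows. It assumes Python 3, and the helper name save_tables and the numbered output file names are hypothetical.

```python
def save_tables(result, prefix="out"):
    """Write each extracted table to its own tab-separated file.

    `result` may be a single DataFrame (older tabula-py) or a list of
    DataFrames (tabula-py 2.0 and later).
    """
    tables = result if isinstance(result, list) else [result]
    written = []
    for index, table in enumerate(tables):
        path = "%s_%d.csv" % (prefix, index)
        # Same options as the article's script: tab separator, UTF-8
        table.to_csv(path, sep="\t", encoding="utf-8")
        written.append(path)
    return written


# Usage (needs tabula-py, Java and the PDF from the article):
#   import tabula
#   save_tables(tabula.read_pdf("inforatge.pdf", pages="all"))
```

With a PDF that holds two tables this would produce out_0.csv and out_1.csv instead of a single out.csv.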

cd Descargas/eltiempo

The PDF to be read and the .py script are placed in this directory, and our desired CSV, 'out.csv', will be placed here too.
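If the wrong-folder problem keeps coming back, one option is to build the file paths from the location of the script instead of the current directory. This is only a sketch, and the helper name beside_script is hypothetical.

```python
import os


def beside_script(filename, script_path):
    """Build an absolute path to `filename` in the folder that holds the script."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, filename)


# Inside pdftocsv.py one would pass __file__ as script_path:
#   pdf_path = beside_script("inforatge.pdf", __file__)
#   csv_path = beside_script("out.csv", __file__)
```

With paths built this way, the script can be launched from any folder and still finds inforatge.pdf next to itself.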

Write in console

python pdftocsv.py

I have written python, i.e. I want to execute the script with Python 2. I do it this way because Python 3 returned errors for me.

I have added 3 new lines to the .py file in order to measure execution time. At the end of the article I have attached the final version of my pdftocsv.py.

import tabula
import time

start_time = time.time()

df = tabula.read_pdf("inforatge.pdf")
df.to_csv('out.csv', sep='\t', encoding='utf-8')

print("--- %s seconds ---" % (time.time() - start_time))
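When several steps need timing, the same idea can be wrapped in a small reusable context manager. This is a sketch assuming Python 3.3 or later (for time.perf_counter). The name timed and the optional results dictionary are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally store the value."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("--- %s: %.3f seconds ---" % (label, elapsed))


# Example with a stand-in workload instead of the real PDF conversion
with timed("demo"):
    sum(range(100000))
```

In pdftocsv.py the read_pdf and to_csv calls would go inside the with block in place of the stand-in workload.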

More Tabula options

Here are examples of more things that can be done with this script. There are many options.

# Read a remote PDF and convert it into a DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# Convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")
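The same convert_into call can also write TSV or JSON through its output_format parameter, and pages="all" asks for every page instead of only the first one, which is the default. A minimal sketch, assuming tabula-py and Java are installed; the function name convert_to_formats and the output file names are hypothetical.

```python
def convert_to_formats(pdf_path, base_name, formats=("csv", "tsv", "json")):
    """Write one output file per format, e.g. base_name.csv, base_name.tsv."""
    import tabula  # imported here so this file loads even without tabula-py

    outputs = []
    for fmt in formats:
        output_path = "%s.%s" % (base_name, fmt)
        # pages="all" extracts every page rather than only page 1
        tabula.convert_into(pdf_path, output_path, output_format=fmt, pages="all")
        outputs.append(output_path)
    return outputs


# Usage:
#   convert_to_formats("test.pdf", "output")
```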

Another useful thing I love is that it works in batch mode: we can convert all the files in a directory at once.

tabula.convert_into_by_batch("input_directory", output_format='csv')

This is the way to automate tasks.
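For more control than the batch call gives (for example, to continue when one PDF fails), the loop can be written by hand. The sketch below is one possible approach, not part of tabula-py. The function convert_directory is hypothetical and takes the actual converter as an argument.

```python
import glob
import os


def convert_directory(input_dir, convert):
    """Call convert(pdf_path, csv_path) for every .pdf file in input_dir.

    Returns a pair: the CSV paths written and a list of (pdf_path, error).
    """
    converted, failed = [], []
    for pdf_path in sorted(glob.glob(os.path.join(input_dir, "*.pdf"))):
        csv_path = os.path.splitext(pdf_path)[0] + ".csv"
        try:
            convert(pdf_path, csv_path)
            converted.append(csv_path)
        except Exception as error:
            # Keep going so one bad PDF does not stop the whole batch
            failed.append((pdf_path, str(error)))
    return converted, failed


# Usage with tabula-py:
#   import tabula
#   convert_directory(
#       "input_directory",
#       lambda pdf, csv: tabula.convert_into(pdf, csv, output_format="csv"),
#   )
```

The returned list of failures makes it easy to see which files need a second look after an unattended run.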

How to convert pdf to excel online

If we want to convert only one file, or extract the data from one file and import it into Excel or LibreOffice Calc, there are online tools that do the work for us. I have explained tabula-py with people interested in Machine Learning, data science and similar disciplines in mind.

I have tested these two online tools and they work fine.

These are a good option if you only have to work with one file. But the free online solutions don't offer batch options. Therefore, if you need to automate tasks, a Python-based solution will always be better.

Classical method

If you run into problems and don't like working with tools or scripts, you can always do it manually:

copy the PDF tables and paste them into your spreadsheet.
