One of the most laborious tasks in Machine Learning is collecting and cleaning data.
There is a meteorological observatory in my city. You can see the main meteorological indicators in real time through its website, and it shares historical data too, but only as PDF files.
I have talked with them about publishing all the data as CSV so that people can use it easily, but it seems that is not possible 🙁
So I want to convert these PDF files into a workable data collection. I have been searching for a good way to convert PDF tables to CSV, and the solution is called Tabula.
Once you have the data in CSV you can use it in many ways: open it with Excel, LibreOffice, Google Sheets, etc., because CSV is easy to import into spreadsheets, or process it with Python and its libraries.
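As a minimal sketch of the Python route, the standard library's csv module can read such a file into dictionaries keyed by column name. The sample text and its column names (date, temp_max) are made up for illustration; they are not the observatory's real data, and read_rows is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample standing in for a converted file; the column names are invented.
SAMPLE = "date,temp_max\n2019-01-01,14.2\n2019-01-02,15.0\n"


def read_rows(text, delimiter=","):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows = read_rows(SAMPLE)
print(len(rows), rows[0]["temp_max"])
```

For a tab-separated file, pass delimiter="\t". Values come back as strings, so numeric columns still need converting.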
As I want an automated process, I will work with a Python script, and this is where Tabula comes in.
Convert PDF to CSV with Tabula
It is easy. First of all, I install Tabula in the development environment. Tabula extracts data from tables in PDF files and converts it to pandas DataFrames.
pandas is a Python library specialized in working with tabular data such as CSV files and arrays.
With it we can also convert the tables in a PDF directly to JSON, CSV and TSV files. You can find it in its GitHub repository.
To take advantage of the Anaconda tutorial we saw the other day, I use it to install tabula-py. Here you can see how to install Anaconda.
Install Tabula
# First of all we activate our development environment, in our case 'comparador'
conda activate comparador
pip install tabula-py
I get an error.
Reading the official documentation, I find the solution: uninstall the old tabula package and install tabula-py.
pip uninstall tabula
pip install tabula-py
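If the error persists, a quick check like the one below can help narrow it down. It is only a sketch and rests on two assumptions that are not stated above: that tabula-py needs a Java runtime on the PATH (it wraps the Java tool tabula-java), and that another package installing a module also named tabula is what gets in the way. The function name check_tabula_setup is made up for this example.

```python
import importlib
import importlib.util
import shutil


def check_tabula_setup():
    """Return a dict of booleans describing the local tabula-py setup."""
    report = {
        # Assumption: tabula-py calls a Java program, so "java" must be on PATH.
        "java_on_path": shutil.which("java") is not None,
        "tabula_importable": importlib.util.find_spec("tabula") is not None,
        "has_read_pdf": False,
    }
    if report["tabula_importable"]:
        try:
            module = importlib.import_module("tabula")
        except Exception:
            # The module exists but fails to import; leave has_read_pdf as False.
            return report
        # A different package that also installs "tabula" would lack read_pdf.
        report["has_read_pdf"] = hasattr(module, "read_pdf")
    return report


print(check_tabula_setup())
```

Run it inside the activated environment. If tabula_importable is True but has_read_pdf is False, the module being imported is probably not tabula-py.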
Create .py executable
I create the script, called pdftocsv.py, and save it in the folder Descargas/eltiempo. It is a file with this code:
import tabula
# Extract the data from the PDF into a DataFrame
df = tabula.read_pdf("inforatge.pdf")
# Write it to a file called out.csv, tab-separated and UTF-8 encoded
df.to_csv('out.csv', sep='\t', encoding='utf-8')
The downloaded PDF I have to read is called inforatge.pdf, and I want to obtain an out.csv file with the result. It is saved in the folder we are working in.
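One thing to watch when reusing this script: depending on the installed tabula-py version, read_pdf may return a single DataFrame (as the script above expects) or a list of DataFrames, one per detected table. The sketch below tolerates both. The helper names as_table_list and pdf_to_csv are hypothetical, and writing only the first table is an assumption about what is wanted.

```python
def as_table_list(result):
    """Wrap a single DataFrame in a list; pass a list of DataFrames through."""
    return result if isinstance(result, list) else [result]


def pdf_to_csv(pdf_path, csv_path):
    """Write the first table found in pdf_path to csv_path; return the table count."""
    import tabula  # imported lazily so as_table_list can be used without tabula-py

    tables = as_table_list(tabula.read_pdf(pdf_path))
    if not tables:
        raise ValueError("no tables found in %s" % pdf_path)
    # sep="\t" produces a tab-separated file, as in the script above.
    tables[0].to_csv(csv_path, sep="\t", encoding="utf-8")
    return len(tables)


# Example call (needs tabula-py, Java and the PDF in the current folder):
# pdf_to_csv("inforatge.pdf", "out.csv")
```

If the PDF holds several tables, the returned count shows how many were detected, so the remaining ones are not silently ignored.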
Go to the directory where the .py file and the PDF are. It seems a very obvious step, but many people run into problems because they are not working in the correct folder. If you are not in the right place, you get a file-not-found error.
cd Descargas/eltiempo
The PDF to be read and the .py script are placed in this directory, and our desired CSV, out.csv, will be placed here too.
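An alternative to relying on the current folder is to build the paths from the script's own location. This is a sketch, not part of the original script; the helper names beside and require_file are hypothetical.

```python
from pathlib import Path


def beside(script_file, name):
    """Return the path of `name` inside the folder that holds `script_file`."""
    return Path(script_file).resolve().parent / name


def require_file(path):
    """Raise a clear error if `path` is missing; otherwise return it unchanged."""
    if not Path(path).is_file():
        raise FileNotFoundError(
            "cannot find %s (current folder: %s)" % (path, Path.cwd())
        )
    return path


# Inside pdftocsv.py this could be used as:
# pdf_path = require_file(beside(__file__, "inforatge.pdf"))
```

With paths built this way, the script finds inforatge.pdf no matter which folder the console is in when it is launched.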
Type in the console:
python pdftocsv.py
I have written python, i.e. I run the script with Python 2. I do it this way because Python 3 gave me errors. Recent tabula-py releases no longer support Python 2, so this only applies to older versions.
I have added three new lines to the .py file to measure execution time. At the end of the article I have attached the final version of my pdftocsv.py:
import tabula
import time
start_time = time.time()
df = tabula.read_pdf("inforatge.pdf")
df.to_csv('out.csv', sep='\t', encoding='utf-8')
print("--- %s seconds ---" % (time.time() - start_time))
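The same measurement can be wrapped in a small reusable helper. The sketch below assumes Python 3 and uses time.perf_counter, a clock intended for measuring intervals; the name timed is made up for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print("--- %s: %.3f seconds ---" % (label, elapsed))


with timed("demo"):
    sum(range(100000))  # stand-in for the read_pdf / to_csv calls
```

In the script, the read_pdf and to_csv lines would go inside the with block in place of the stand-in.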
More Tabula options
Here are examples of more things that can be done with this script. There are many options.
# Read a remote PDF and convert it into a DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# Convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")
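Since convert_into can write CSV, TSV or JSON, the output_format can be derived from the name of the output file. A minimal sketch follows; format_for and convert are hypothetical helpers, not part of tabula-py.

```python
from pathlib import Path

# Output formats that tabula.convert_into accepts.
FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def format_for(output_path):
    """Pick the convert_into output_format from the output file's extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError("unsupported extension: %r" % suffix)
    return FORMATS[suffix]


def convert(pdf_path, output_path):
    """Convert pdf_path to the format implied by output_path's extension."""
    import tabula  # lazy import: format_for works without tabula-py installed

    tabula.convert_into(
        pdf_path, output_path, output_format=format_for(output_path)
    )


# Example call (needs tabula-py and Java):
# convert("test.pdf", "output.json")
```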
And a useful thing I love: it works in batch, so we can convert all the files in a directory at once.
tabula.convert_into_by_batch("input_directory", output_format='csv')
This is the way to automate the task.
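convert_into_by_batch handles the whole folder in one call. When one damaged PDF should not stop the rest, a per-file loop is an alternative. The following is a sketch under that assumption, with hypothetical helper names planned_outputs and convert_each, and it assumes each CSV should be written next to its PDF.

```python
from pathlib import Path


def planned_outputs(directory):
    """List (pdf, csv) path pairs for every PDF directly inside `directory`."""
    pdfs = sorted(
        p for p in Path(directory).iterdir() if p.suffix.lower() == ".pdf"
    )
    return [(pdf, pdf.with_suffix(".csv")) for pdf in pdfs]


def convert_each(directory):
    """Convert the PDFs one by one and return the files that failed."""
    import tabula  # lazy import: planned_outputs works without tabula-py

    failed = []
    for pdf, csv_path in planned_outputs(directory):
        try:
            tabula.convert_into(str(pdf), str(csv_path), output_format="csv")
        except Exception as exc:  # keep going if one PDF cannot be converted
            failed.append((pdf.name, str(exc)))
    return failed
```

Calling planned_outputs first shows which files would be written before anything is converted.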
How to convert PDF to Excel online
If we only want to convert one file, or extract the data from one file and import it into Excel or LibreOffice Calc, there are online tools that do the work for us. I have explained tabula-py thinking of people who are interested in Machine Learning, data science and similar disciplines.
I have tested these two online tools and they work fine.
- https://www.pdftoexcel.com/
- https://www.pdftoexcelonline.com/ works well but sends you the output by email
These are a good option if you have to work with only one file. But the free online solutions don’t offer batch options. Therefore, if you need to automate tasks, a Python-based solution will always be better.
Classical method
If you run into problems and don’t like working with tools or scripts, you can always do it manually:
copy the PDF tables and paste them into your spreadsheet.