The pdf2txt project combines interfaces to a number of PDF to text converters with text preprocessors that refine the converted text for use in further NLP applications.
This project has been published to maven central and can be used by sbt
and other build tools as a library dependency. Include a line like this in build.sbt
to incorporate the main project along with all the subprojects:
libraryDependencies += "org.clulab" %% "pdf2txt" % "1.1.2"
The main Pdf2txtApp can be run directly from the pre-built jar file. The only prerequisite is Java. Startup is significantly quicker than when it is run via sbt.
The PDF converters are divided into two categories. Some converters work locally, with no network connection needed, while others depend on remote servers to perform the conversion. The default is the local tika converter:
- Local
  - ghostact: This converter is a combination of Ghostscript for conversion of PDF to images and Tesseract for conversion of images to text. It depends on both of these programs having been installed in advance and being available on the $PATH if default settings are used. The settings can be adjusted; see the subproject's README.md for details. This converter does not do well on any but the simplest pages, but it is able to process images embedded in PDFs.
  - pdfminer: This Python project is further wrapped in Python code included as a resource with this project. It is run as an external process using the python3 command, which must be available on the $PATH. Furthermore, pdfminer needs to have been installed in advance, possibly with pip install pdfminer.
  - pdftotext: This executable program needs to be installed on the local computer and accessible via the operating system $PATH so that the pdftotext command can run. It is started as an external process to perform the conversion (a rough sketch of this kind of external call follows the converter list). See the README.md file for configuration details.
  - scienceparse: Science Parse is a Scala library that parses scientific papers. The pre-built jars are included in this project because recent versions are no longer available in standard repositories (e.g., maven central). This converter relies on large machine learning models which are downloaded when the converter is first used.
  - text: If your text has already been converted from PDF and only needs to be preprocessed, then this is the "converter" to use. It is implemented directly in this project rather than in a subproject. In contrast to the others, it reads files matching *.txt rather than *.pdf.
  - tika: Apache Tika provides a Java library which is included as a dependency for this project. This is the default converter.
- Remote
  - adobe: This converter provides an interface to Adobe's online PDF Extract service. The service requires credentials and eventual payment if used beyond the trial limits. See the adobe subproject's README.md for configuration details. The service returns a zip file containing a description of the PDF. The zip files are saved alongside the PDFs and will be reused if the same PDF is converted again. Converted text is generated wholly from the zip file, and if one is found with the PDF, the call to the service is skipped (and the credentials are not used or needed).
  - amazon: Amazon provides, via AWS, a similar online Textract service. The service requires credentials and eventual payment if used beyond the trial limits. An S3 bucket may also be required. See the amazon subproject's README.md for configuration details. The service converts the PDF document into images and performs optical character recognition (OCR) to recover the text. It knows about pages, lines, and words, but not about paragraphs or other logical document structure. Input files of more than one page need to temporarily reside in an S3 bucket. If no bucket is configured (the value is an empty string), none is used, but that will cause errors if a PDF has more than one page.
  - google: Google's Cloud Vision API also offers PDF to text conversion. The service requires credentials and eventual payment if used beyond the trial limits. A cloud storage bucket is required. See the google subproject's README.md for configuration details. The service separates the PDF into pages and performs optical character recognition (OCR) on each one separately. Both input and output files need to temporarily reside in a storage bucket. The service returns a json file containing a description of the PDF. The json files are saved alongside the PDFs and will be reused if the same PDF is converted again. Converted text is generated wholly from the json file, and if one is found with the PDF, the call to the service is skipped (and the credentials are not used or needed).
  - microsoft: Microsoft has its own computer vision service. As with several of the other converters, its processing of PDFs is an extension of more general image processing capabilities. The PDF is converted into an image and then scanned for text. In this case, no cloud storage is needed and no temporary files are created. Credentials are required and there are eventual charges after a trial period. Free conversions are limited both in number of pages (to two) and in submission rate (20 calls per minute). There are also image size limits. See Microsoft documentation for input requirements. The subproject's README.md has information on how to configure the credentials.
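As mentioned for pdftotext and pdfminer above, some converters simply drive an external program. A rough, hypothetical sketch of that kind of call is shown below; the object and method names are invented for this example and are not the project's actual code.

```scala
import scala.sys.process._

object ExternalConverterSketch {
  // pdftotext must already be installed and on the $PATH for this to succeed.
  // Returns the exit code of the external process.
  def convert(pdfFile: String, txtFile: String): Int =
    Seq("pdftotext", pdfFile, txtFile).!
}
```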
Preprocessors can be configured on (true) and off (false) as shown later, but by default they are applied in the order given here. The order can be changed if the project is used as a library, since it is an (ordered) array of preprocessors that gets passed around (a small sketch follows the list below). Because the actions of one preprocessor can affect how the next might work or how the previous might have worked, the list is traversed multiple times until the output no longer changes.
- line: This preprocessor removes blank lines that some PDF converters leave between populated lines of text even though there is no paragraph break and usually not even the end of a sentence intervening. After the blank line is removed, text parsers can usually piece together a sentence that is split across the remaining lines.
- paragraph: Blank lines are otherwise assumed to end paragraphs. Sentences cannot span paragraphs, so at the end of each paragraph a period is added if necessary. This prevents parsers from combining things like multiple section headings into a single nonsensical sentence.
- unicode: Conversion of unicode characters is controlled by a translation table, which can remove accents, spell out Greek letters, convert to spaces, etc., and by a list of accented characters which might be spared from such conversion. How these are used is controlled by parameters. In the command line interface they are hard coded, but the library provides access.
- case: Headers and titles are often indicated with words that have been capitalized. Unfortunately, this can confuse part-of-speech taggers and named entity recognizers. Case is restored here so that words appear as they would in normal sentences for more accurate processing.
- number: Numbers are sometimes converted so that spaces separate some of the digits or a comma lands after a space, as in 123 ,45. This preprocessor tries to remove unnecessary spaces within numbers.
- ligature: Many PDF converters have difficulties with ligatures, like ffi, that are typeset as single glyphs, resulting in spaces inserted into words. Such situations are detected and resolved by this preprocessor: "coe ffi cient" would be corrected to "coefficient". In order to do so, it must have a fairly good idea of what is or is not a word and even whether one word is more probable than another. Therefore, this preprocessor (and all the remaining ones) makes use of a language model described in the next section.
- lineBreak: Words, particularly in justified text, are often hyphenated and split between lines of text. Some words already include hyphens that are not optional. This preprocessor, with the aid of a language model, attempts to find words split across lines and unite the parts.
- lineWrap: Some PDF converters attempt to reformat each paragraph of text into a single line, as it would be in a word processor with automatic word wrapping. Unfortunately, the operation often ignores the possibility of words having been hyphenated across lines. Rather than removing the hyphen and line break, the converter simply replaces the break with a space and leaves the hyphen, resulting in text like "imple- ment". This preprocessor watches for this situation and, with the help of a language model, corrects it to "implement", since that is a word.
- wordBreakByHyphen: Given the many kinds of dashes (-, –, —, etc.) within words, PDF converters sometimes can't tell whether the letters after the dash belong to the same word or to the next one, and unwanted spaces can get inserted. Words with hyphens are recombined here. For example, "left- handed" might be restored to "left-handed" or "two- year- old" to "two-year-old".
- wordBreakBySpace: Finally, sometimes spaces just appear magically within words. They might be removed here, but by default the Never language model is configured out of an abundance of caution. Library users can change this.
The preprocessor unit tests include illustrative examples of transformations.
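To make the ordering and the repeated traversal concrete, here is a minimal, hypothetical sketch of how a caller might apply an ordered sequence of preprocessors until the text stops changing. The Preprocessor trait and method names are stand-ins for illustration, not the project's exact API.

```scala
import scala.annotation.tailrec

object PreprocessingSketch {
  // Stand-in for the project's preprocessor interface.
  trait Preprocessor { def preprocess(text: String): String }

  // Apply the preprocessors in order, repeatedly, until a fixed point is
  // reached, mirroring the multi-pass behavior described above.
  @tailrec
  def process(text: String, preprocessors: Seq[Preprocessor]): String = {
    val next = preprocessors.foldLeft(text) { (current, preprocessor) =>
      preprocessor.preprocess(current)
    }
    if (next == text) text else process(next, preprocessors)
  }
}
```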
The primary responsibility of the language models is to determine whether word "parts" should be joined so that a word is whole again. The parts may have resulted from spaces or hyphens having been inserted between the characters of a word. The programming interface looks like this:
def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean
It decides whether a sentence starting "Wordone wordtwo left right" is OK or should have been "Wordone wordtwo leftright". This might be calculated based on something like
P(Wordone wordtwo leftright | Wordone wordtwo) > P(Wordone wordtwo left | Wordone wordtwo)
or even
P(leftright) > P(left)
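As a rough illustration of this decision, a set-based model might implement shouldJoin along the following lines. This is a simplified, hypothetical sketch that ignores the preceding words; the real models described below are more elaborate.

```scala
class SetLanguageModelSketch(vocabulary: Set[String]) {
  // Join the parts when the combined form is a known word but the left part
  // on its own is not, e.g. "coe" + "fficient" -> "coefficient".
  def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean = {
    val joined = (left + right).toLowerCase
    vocabulary.contains(joined) && !vocabulary.contains(left.toLowerCase)
  }
}
```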
The language models below are currently available. Both the gigaword and glove models not only use vocabulary from their respective dictionaries but also dynamically add to it words from the document currently being processed. A novel word, such as a product or brand name, that is seen without a hyphen in a document can be used to de-hyphenate other instances in the same document.
- always: Always join left and right, which is useful in testing.
- gigaword: Use word frequencies derived from gigaword. Since counts are involved, this is coded as a BagLanguageModel.
- glove: Use words, without frequencies, derived from glove. Since these are without counts, this is called a SetLanguageModel.
- never: Never join left and right, which is again useful in testing.
A HuggingFace language model is also anticipated.
Although this project is intended more as a library, there are several command line applications included. Many read all the PDF files in an input directory, convert them to text, preprocess them for potential use with other NLP projects, and then write them to an output directory. They differ mainly in which component converts the PDF to text. Pdf2txtApp should be noted in particular, since it is the most encompassing. Here are highlights from its help text.
From the command line with sbt and a clone of the git repo, use
sbt "run <arguments>"
or, from the command line after having run "sbt assembly" and changed to the target/scala-2.12 directory, or after having downloaded the jar file, use
java -jar pdf2txt.jar <arguments>
- <no_arguments> converts all PDFs in the current directory to text files.
- -in ./pdfs -out ./txts converts all PDFs in ./pdfs to text files in ./txts.
- -converter pdftotext -wordBreakBySpace false -in doc.pdf -out doc.txt converts doc.pdf to doc.txt using pdftotext without the wordBreakBySpace preprocessor.
- -converter text -in file.txt -out file.out.txt preprocesses file.txt, resulting in file.out.txt.
To get the full help text, use -h, -help, or --help.
This software uses a lot of memory for multiple large neural network models and dictionaries. It may not run on machines with less than 16GB of memory, particularly with ScienceParse, and even then settings may need to be adjusted so that the available memory can actually be used. If you encounter errors indicating memory exhaustion, such as
[error] ## Exception when compiling 44 sources to /clulab/pdf2txt-project/pdf2txt/target/scala-2.11/classes
[error] java.lang.OutOfMemoryError: Java heap space
or
Exception in thread "ModelLoaderThread" java.lang.OutOfMemoryError: Java heap space
then here are some tips to try:
- If sbt can't complete commands like compile or assembly for lack of memory, then the -Xmx setting in .jvmopts might be increased. The Windows version of sbt seems to ignore this file, so it may be necessary to instead set the value of the environment variable _JAVA_OPTIONS. Depending on the shell, that might be with set _JAVA_OPTIONS=-Xmx12g or $env:_JAVA_OPTIONS="-Xmx12g".
- If sbt can't complete the test command, then the value for ThisBuild / Test / javaOptions in test.sbt needs to be adjusted.
- If the run command doesn't work, then use the setting for run / javaOptions in build.sbt.
- If you execute the jar file from Java and run out of memory, then the environment variable _JAVA_OPTIONS is the best place to make the change. The command for Windows is above. For other operating systems, it is usually export _JAVA_OPTIONS=-Xmx12g.
- If sbt run or java -jar is problematic, then lowering the value of the -threads argument can reduce memory requirements because fewer documents will be processed at the same time.
In each case, adjust the number before the g (gigabytes) as needed; a sketch of the corresponding sbt settings follows.
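For reference, the sbt-side settings mentioned in the tips above might look roughly like this. The 12g value is only an example, and these fragments assume an otherwise standard sbt build.

```scala
// In test.sbt, for the test command:
ThisBuild / Test / javaOptions += "-Xmx12g"

// In build.sbt, for the run command (javaOptions only apply to forked runs):
run / fork := true
run / javaOptions += "-Xmx12g"
```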
Please note that the startup messages from fatdynet that are printed to stderr, like the ones below, are normal and not indicative of a problem.
[error] [dynet] Checking /home/user/pwd for libdynet_swig.so...
[error] [dynet] Checking /home/user for libdynet_swig.so...
[error] [dynet] Extracting resource libdynet_swig.so to /tmp/libdynet_swig-8897097308525612384.so...
[error] [dynet] Loading DyNet from /tmp/libdynet_swig-8897097308525612384.so...
[error] [dynet] random seed: 2522620396
[error] [dynet] allocating memory: 512MB
[error] [dynet] memory allocation done.