Using ReportLab and PDFtk to Modify Existing PDFs

The PDF format is great for printing and displaying written content, but it can be a nightmare for editing and manipulating the content itself. Here, I explain a simple way to use some open-source tools to add information to an existing document without editing the original content.

Overview

In this example, I want to add a bit of text to the first page of a multipage PDF. This takes four steps:

  1. Separate the first page of the original PDF
  2. Create a PDF containing just the additional text, placed exactly where it should appear on the first page. You may need some trial and error to get the position right.
  3. Overlay the additional text PDF onto the first page.
  4. Re-assemble the PDF using the new first page and the original other pages.
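
As a rough sketch, steps 1, 3 and 4 map onto pdftk invocations like the ones built below. The helper `plan_commands` and the intermediate file names are placeholders of mine for illustration; step 2 is left out because ReportLab handles it, not pdftk.

```python
def plan_commands(original, stamp, output):
    """Return pdftk argument lists for steps 1, 3 and 4.

    Hypothetical helper: the intermediate file names are placeholders.
    """
    first = 'first.pdf'            # step 1 output: page 1 on its own
    stamped = 'first_stamped.pdf'  # step 3 output: page 1 plus overlay
    return [
        # Step 1: copy page 1 of the original into its own file.
        ['pdftk', original, 'cat', '1', 'output', first],
        # Step 3: overlay the stamp PDF onto that single page.
        ['pdftk', first, 'stamp', stamp, 'output', stamped],
        # Step 4: stamped page 1, then pages 2..end of the original.
        ['pdftk', 'A=' + stamped, 'B=' + original,
         'cat', 'A1', 'B2-end', 'output', output],
    ]


for args in plan_commands('example.pdf', 'stamp.pdf', 'stamped_example.pdf'):
    print(' '.join(args))
```

Printing the commands first is a cheap way to check the page ranges before running anything against a real file.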

PDFtk

PDFtk is a free and open-source, feature-rich PDF manipulation program. With that richness of functionality, it can be a little unwieldy for many common uses. Since I only need a small subset of its features, I have written a thin wrapper in Python 3.6:

# pdftk.py
"""
Wrapper around some of the pdftk functionality
"""
from collections.abc import Iterable  # the ABCs live in collections.abc
from pathlib import PurePath
from string import ascii_uppercase
from subprocess import call


def __get_path(obj):
    """
    Turn pathlib object into string
    """
    # PurePath is the base class of PosixPath and WindowsPath alike
    if isinstance(obj, PurePath):
        return str(obj)
    return obj


def dump_data(infile, output=None):
    """Dump pdf metadata to output file

    Args:
        infile (Path or str): input file path

    Kwargs:
        output (Path or str): output file path of metadata file

    Returns: outcome of call command

    """
    infile = __get_path(infile)
    outfile = __get_path(output)

    command = 'pdftk {0} dump_data_utf8 > {1}'.format(infile, outfile)

    return call(command, shell=True)


def cat(inputs=None, ranges=None, output=None):
    """Combine pdfs using ranges

    Kwargs:
        inputs (Iterable): iterable of Paths or strings for input files
        ranges (Iterable): iterable of strings for range in pdftk format
        output (Path or str): output file path

    Returns: outcome of call command

    """
    if not all(isinstance(arg, Iterable) for arg in (inputs, ranges)):
        raise TypeError('inputs and ranges must be iterables')

    # materialize so that len() and indexing work for any iterable
    inputs, ranges = list(inputs), list(ranges)
    if len(inputs) != len(ranges):
        raise ValueError('inputs and ranges must be same length')

    inputs = [__get_path(i) for i in inputs]
    outfile = __get_path(output)
    if len(inputs) == 1:
        command = 'pdftk {0} cat {1} output {2}'.format(inputs[0],
                                                        ranges[0],
                                                        outfile)
    else:
        command = 'pdftk '
        for i, input_ in enumerate(inputs):
            command += '{0}={1} '.format(ascii_uppercase[i], input_)
        command += 'cat '
        for i, range_ in enumerate(ranges):
            command += '{0}{1} '.format(ascii_uppercase[i], range_)
        command += 'output {}'.format(outfile)

    return call(command, shell=True)


def stamp(infile, stamp=None, output=None):
    """Stamp input with stamp file

    Args:
        infile (Path or str): input file path

    Kwargs:
        stamp (Path or str): stamp file path
        output (Path or str): output file path

    Returns: outcome of call command

    """
    infile = __get_path(infile)
    stamp = __get_path(stamp)
    output = __get_path(output)

    command = 'pdftk {0} stamp {1} output {2}'.format(infile, stamp, output)

    return call(command, shell=True)


def update_info(infile, info=None, output=None):
    """Add pdf metadata to pdf file

    Args:
        infile (Path or str): input file

    Kwargs:
        info (Path or str): metadata file path
        output (Path or str): output file path

    Returns: outcome of call command

    """
    infile = __get_path(infile)
    info = __get_path(info)
    output = __get_path(output)

    command = 'pdftk {0} update_info {1} output {2}'.format(infile, info,
                                                            output)

    return call(command, shell=True)

As you can see, I have implemented the dump_data, update_info, cat and stamp commands from PDFtk. I will explain exactly what they do below. I am simply using the subprocess module to run the pdftk command in the shell.
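
One caveat with building a shell string this way: a path containing a space gets split into separate words by the shell. The sketch below illustrates the problem and one possible remedy using `shlex.quote` from the standard library. The file name `my report.pdf` is a made-up example, and nothing here actually runs pdftk.

```python
import shlex

infile = 'my report.pdf'  # hypothetical file name containing a space

# Naive formatting, as in the wrapper: the shell would see two words.
naive = 'pdftk {0} stamp {1} output {2}'.format(
    infile, 'stamp.pdf', 'out.pdf')

# Quoting each path keeps it together as a single shell word.
quoted = 'pdftk {0} stamp {1} output {2}'.format(
    shlex.quote(infile), shlex.quote('stamp.pdf'), shlex.quote('out.pdf'))

# shlex.split mimics how a POSIX shell would tokenize each command.
print(shlex.split(naive))
print(shlex.split(quoted))
```

The wrapper's dump_data function relies on `>` redirection, so it does need a shell; the other commands could alternatively be passed to subprocess as an argument list without `shell=True`.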

ReportLab

ReportLab is an open-source toolkit for creating dynamic PDFs using Python. My example is heavily influenced by Dan Bader’s upcoming ReportLab book. I backed it on Kickstarter, so I have access to the pre-release version. I recommend picking up the book when it becomes available to the public.

Putting it together

Below is the final code.

from pathlib import Path

from reportlab.lib.pagesizes import letter
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
from reportlab.lib.units import inch

import pdftk


def coord(x, y, unit=1):
    """determine coordinates

    Args:
        x (float): x coordinate
        y (float): y coordinate

    Kwargs:
        unit (unit): unit of measure

    Returns:
        x: (unit): x in unit of measure
        y: (unit): y in unit of measure

    """
    x, y = x * unit, y * unit
    return x, y


m_font_path = '/home/robert/.fonts/Merriweather-Light.ttf'
m = TTFont('Merriweather-Light', m_font_path)
pdfmetrics.registerFont(m)
mb_font_path = '/home/robert/.fonts/Merriweather-Bold.ttf'
mb = TTFont('Merriweather-Bold', mb_font_path)
pdfmetrics.registerFont(mb)

c = canvas.Canvas("stamp.pdf", bottomup=0, pagesize=letter)

c.setFont('Merriweather-Bold', 10)
c.drawString(*coord(4.83, 2.95, inch), text="My Text Label")
c.setFont('Merriweather-Light', 10)
c.drawString(*coord(5.9, 2.95, inch), text="My Text Content")
c.save()

# set file names
report_dir = Path('/home/robert/Downloads/reports')
original_report = report_dir / 'example.pdf'
assert original_report.exists()
pdf_metadata_file = Path('example.info')
first_page = Path('temp1.pdf')
stamped_first_page = Path('temp2.pdf')
raw_stamped_report = Path('temp3.pdf')
stamped_report = original_report.with_name('stamped_' + original_report.name)

# Add the additional text to the example report
pdftk.dump_data(original_report, output=pdf_metadata_file)
pdftk.cat(inputs=(original_report, ),
          ranges=('1-1', ), output=first_page)
pdftk.stamp(first_page, stamp='stamp.pdf', output=stamped_first_page)
pdftk.cat(inputs=(stamped_first_page, original_report),
          ranges=('1-1', '2-end'), output=raw_stamped_report)
pdftk.update_info(raw_stamped_report, info=pdf_metadata_file,
                  output=stamped_report)

# clean up
pdf_metadata_file.unlink()
first_page.unlink()
stamped_first_page.unlink()
raw_stamped_report.unlink()
Path('stamp.pdf').unlink()

Let’s break this down. First, I create a function coord to handle units of measure for the layout.
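
To make the unit handling concrete, here is a small standalone sketch of the same function. It assumes 72 points per inch, which is the value of `reportlab.lib.units.inch`; the constant name `POINTS_PER_INCH` is mine, used so the snippet runs without ReportLab installed.

```python
POINTS_PER_INCH = 72  # same value as reportlab.lib.units.inch


def coord(x, y, unit=1):
    """Scale (x, y) from the given unit into PDF points."""
    return x * unit, y * unit


print(coord(1, 2, POINTS_PER_INCH))  # (72, 144)
print(coord(100, 200))               # unit defaults to points: (100, 200)
```

Keeping the positions in inches makes the trial and error easier, since they can be checked against a ruler on a printed page.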

Then, I define where to get the font assets for the PDF. This step is not always needed, but if you want the new text to look like it belongs, you should probably match the font.

After that, I indicate where on the page to apply the text. Then I save the resulting pdf file.
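
Note that the canvas is created with bottomup=0, which puts the origin at the top-left corner with y increasing down the page. By default, ReportLab measures y from the bottom edge. The sketch below shows the conversion between the two conventions for a US letter page (792 points tall); the helper `from_top` is hypothetical and not part of ReportLab.

```python
LETTER_HEIGHT = 792  # points; US letter is 11 in tall at 72 points per inch


def from_top(y_from_top, page_height=LETTER_HEIGHT):
    """Convert a distance from the top edge into a bottom-up y value."""
    return page_height - y_from_top


y_top = 3 * 72          # three inches below the top edge
print(from_top(y_top))  # 576
```

Measuring from the top tends to match how positions are read off a page in a PDF viewer, which is why I find the top-down convention convenient here.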

Using the dump_data command from PDFtk, I extract the original PDF’s metadata, which includes the bookmarks that PDF viewers and e-readers use to build a table of contents.
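
The dumped metadata is a plain text file of `Key: value` lines. As an illustration, the sketch below pulls bookmark titles and page numbers out of text in that style. The sample text is an assumption about the general shape of the output, not captured from a real file, and `parse_bookmarks` is a hypothetical helper.

```python
# Assumed shape of pdftk metadata output; not taken from a real PDF.
SAMPLE = """\
InfoBegin
InfoKey: Title
InfoValue: Example Report
NumberOfPages: 12
BookmarkBegin
BookmarkTitle: Introduction
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Results
BookmarkLevel: 1
BookmarkPageNumber: 5
"""


def parse_bookmarks(text):
    """Collect (title, page) pairs from pdftk-style metadata text."""
    bookmarks = []
    title = None
    for line in text.splitlines():
        key, _, value = line.partition(': ')
        if key == 'BookmarkTitle':
            title = value
        elif key == 'BookmarkPageNumber' and title is not None:
            bookmarks.append((title, int(value)))
            title = None
    return bookmarks


print(parse_bookmarks(SAMPLE))
```

Since page 1 is only being stamped, not moved, the bookmark page numbers stay valid when the metadata is written back later.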

Using the cat command, I create a copy of the first page of the original example.pdf file.

I then use the stamp command to overlay the new text pdf onto the first page.

Following this, I create a new PDF from the stamped first page and the remaining pages (2-end) of the original file.

To finish up, I add the previously-dumped metadata to the new pdf file. Since each one of the steps was non-destructive to the original files, I then delete all of the temporary files created during the process.
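
An alternative to unlinking each file by hand is to keep the intermediates in a temporary directory that removes itself. This is only a sketch of that idea using the standard library's `tempfile` module; the placeholder bytes stand in for a real pdftk output.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    # Intermediate files would live here instead of the current directory.
    first_page = work_dir / 'temp1.pdf'
    first_page.write_bytes(b'placeholder')  # stands in for a pdftk output
    existed_inside = first_page.exists()

# Leaving the with-block removes the directory and everything in it.
print(existed_inside, first_page.exists())
```

With this approach the cleanup also happens if one of the steps raises an exception partway through.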

Summary

I am a moderately experienced user of PDFtk and a novice ReportLab user. Going forward, I hope to use the two tools in combination to programmatically manipulate and generate PDFs in all kinds of ways for my personal and professional use cases. I want to thank Dan Bader for his work on the ReportLab book, and all the contributors to the ReportLab and PDFtk projects.
