How to extract PDF file attachments using Python and PyPDF2

TL;DR: Copy and paste the function below.

This is a quick technical writeup to answer a question I’ve seen posted a few times on StackOverflow and in the issue trackers of various Python PDF libraries. It is especially handy if you don’t want to dig through the PDF 32000 specification to figure out how Adobe wants us to handle attachments.

PyPDF2 makes working with PDFs easy, but you may have noticed that it only has an addAttachment() function, similar to many other PDF libraries I tried. How do we extract attachments so that we can work with them? Embedding files in PDFs is very common and it would be nice to be able to interact with these objects, like we can with form fields and other things you might find in PDF files.

Fortunately, the building blocks for doing this are already available in the PdfFileReader class! We just need to stitch them together:

  1. Read the PDF file using PdfFileReader from PyPDF2
  2. Decrypt the PDF if it is encrypted (you can’t get to the embedded files without doing this)
  3. Retrieve the document catalog from the file trailer (reader.trailer['/Root'])
  4. Navigate in the dictionary this returns to '/Names', then '/EmbeddedFiles', then '/Names'
  5. Loop through the list that is found there. It alternates between a file name (a string) and an IndirectObject for that file
  6. When we get to an IndirectObject, we have our file parameters. We call getObject() to return the parameters dictionary, then navigate to '/EF' and then '/F', where our file data is stored as yet another IndirectObject. Here we simply call getData() and get a byte string back. This can then be written to a destination file or processed however you please!
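
Steps 5 and 6 are easier to follow once you see the shape of that list: it is a flat array in which each file name is followed by the entry describing that file. Here is a minimal sketch of pairing them up, using plain Python values as stand-ins for the PyPDF2 objects. The helper name pair_names_with_specs and the sample data are hypothetical and only there for illustration:

```python
def pair_names_with_specs(names_array):
    """Pair each file name with the entry that follows it.

    names_array stands in for the flat '/Names' array:
    [name1, spec1, name2, spec2, ...]
    """
    pairs = []
    # Step through the array two items at a time: name, then spec.
    for i in range(0, len(names_array) - 1, 2):
        pairs.append((names_array[i], names_array[i + 1]))
    return pairs


# Stand-in data: in a real PDF the second item of each pair would be an
# IndirectObject pointing at the file's parameters dictionary.
sample = ["a.txt", {"/EF": {"/F": b"hello"}}, "b.csv", {"/EF": {"/F": b"1,2"}}]
for name, spec in pair_names_with_specs(sample):
    print(name, spec["/EF"]["/F"])
```

With real PyPDF2 objects you would call getObject() on the second item of each pair and getData() on its '/EF' '/F' entry, as described in step 6.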

As always it’s better to show the code, so here’s a proof of concept script:


import PyPDF2


def getAttachments(reader):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.

    :return: dictionary of filenames and bytestrings
    """
    catalog = reader.trailer["/Root"]
    # The '/Names' array is flat and alternates:
    # file name, file parameters, file name, file parameters, ...
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
    attachments = {}
    for i, f in enumerate(fileNames):
        if isinstance(f, str):
            name = f
            # The entry right after each name holds that file's parameters.
            # Using the loop index (rather than fileNames.index(f)) keeps
            # this correct even if two attachments share a name.
            fDict = fileNames[i + 1].getObject()
            fData = fDict['/EF']['/F'].getData()
            attachments[name] = fData
    return attachments


# Open in binary mode and keep the file open while the reader is in use.
with open('YOURPDFPATH', 'rb') as handler:
    reader = PyPDF2.PdfFileReader(handler)
    dictionary = getAttachments(reader)

print(dictionary)
for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)
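
Note that the script above skips step 2. Here is a minimal sketch of the decryption check, assuming the same camelCase PyPDF2 API used in this post: the isEncrypted property and the decrypt() method, which returns 0 when the password does not match. The helper name ensureDecrypted and the empty default password are my own choices for illustration, not part of PyPDF2:

```python
def ensureDecrypted(reader, password=""):
    """Decrypt the reader in place if the PDF is encrypted.

    Assumes a PyPDF2 PdfFileReader-like object: isEncrypted is a bool
    and decrypt(password) returns 0 on failure, non-zero on success.
    """
    if reader.isEncrypted:
        if reader.decrypt(password) == 0:
            raise ValueError("Wrong password, cannot read attachments")
    return reader
```

You would call ensureDecrypted(reader, 'YOURPASSWORD') right after creating the reader and before calling getAttachments(reader).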

Easy, just not immediately intuitive when you want to do this fast! I created a pull request to hopefully get this function added as a method of the PdfFileReader class.

 

3 thoughts on “How to extract PDF file attachments using Python and PyPDF2”

  1. I love this, it works perfectly, thanks for sharing.
    Though, if I don’t have an attachment, this script breaks. Sadly, I can’t figure out a way to check whether or not there is an attachment using PyPDF2.
    Do you have any suggestions?

    Thanks in advance, Vio

    1. Something I would try is to put everything after the ‘catalog’ line in the getAttachments function in a try/except block. I’m guessing the script is breaking because the embedded files section of the PDF doesn’t always exist, so trying to access it raises an error (most likely a KeyError).
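
An alternative to a blanket try/except is to check for each key before descending. Here is a minimal sketch, with plain dicts standing in for the PyPDF2 catalog. The helper name getEmbeddedFileNames is hypothetical, not part of PyPDF2:

```python
def getEmbeddedFileNames(catalog):
    """Return the flat '/Names' array of embedded files, or an empty
    list if the PDF has no attachments."""
    if '/Names' not in catalog:
        return []
    names = catalog['/Names']
    if '/EmbeddedFiles' not in names:
        return []
    embedded = names['/EmbeddedFiles']
    if '/Names' not in embedded:
        return []
    return embedded['/Names']


# A catalog without attachments yields an empty list instead of an error.
print(getEmbeddedFileNames({}))
```

The loop in the script could then iterate over this result, so a PDF without attachments simply produces an empty dictionary.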
