How to extract PDF file attachments using Python and PyPDF2

Tl;dr: Cut and paste the function I wrote here.

This is a quick technical writeup to hopefully answer a question I’ve seen posted a few times around StackOverflow and the issue trackers of various Python PDF libraries. It’s especially handy for those of you who don’t want to dig through the PDF 32000 specification (ISO 32000) to figure out how Adobe wants us to handle attachments.

PyPDF2 makes working with PDFs easy, but you may have noticed that it only has an addAttachment() function, similar to many other PDF libraries I tried. How do we extract attachments so that we can work with them? Embedding files in PDFs is very common and it would be nice to be able to interact with these objects, like we can with form fields and other things you might find in PDF files.

Fortunately, the building blocks for doing this are already available in the PdfFileReader class! We just need to stitch them together:

  1. Read the PDF file using PdfFileReader from PyPDF2
  2. Decrypt the PDF if necessary (required, you can’t get to the embedded files without doing this)
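  3. Retrieve the document catalog from the file trailer (reader.trailer['/Root']; note that PDF names are case-sensitive)
  4. Navigate in the dictionary this returns to '/Names', then '/EmbeddedFiles', then that entry’s own '/Names' array
  5. Loop through the list found there, which alternates between a file name and a reference to that file’s parameters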
  6. When we get to an IndirectObject, we have our file parameters. We call getObject() to return the parameters dictionary, then navigate to '/EF' and from there to '/F', where our file data is stored as yet another IndirectObject. Here we simply call getData() and get a byte string back. This can then be written to a destination file or processed however you please!

As always it’s better to show the code, so here’s a proof of concept script:
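
What follows is a minimal sketch of the steps above, not a definitive implementation. It assumes an older PyPDF2 release (1.x/2.x) that still provides PdfFileReader, getObject() and getData(), and it assumes the attachments sit in a single flat '/Names' array rather than a larger name tree split across '/Kids'. The function names get_attachments and read_attachments are illustrative, and PyPDF2 is imported inside read_attachments so that get_attachments works with any reader-like object.

```python
def get_attachments(reader):
    """Return {filename: bytes} for the files embedded in a PDF.

    `reader` is expected to be an already decrypted PyPDF2 PdfFileReader.
    """
    catalog = reader.trailer["/Root"]
    # No '/Names' -> '/EmbeddedFiles' entry means the PDF has no attachments.
    if "/Names" not in catalog or "/EmbeddedFiles" not in catalog["/Names"]:
        return {}
    # Assumption: a flat array [name1, spec1, name2, spec2, ...].
    # Name trees that branch through '/Kids' are not handled here.
    names = catalog["/Names"]["/EmbeddedFiles"]["/Names"]
    attachments = {}
    for name, spec_ref in zip(names[0::2], names[1::2]):
        file_spec = spec_ref.getObject()  # the file parameters dictionary
        attachments[str(name)] = file_spec["/EF"]["/F"].getData()
    return attachments


def read_attachments(path, password=""):
    """Open the PDF at `path`, decrypt it if needed, and return its attachments."""
    from PyPDF2 import PdfFileReader  # assumes PyPDF2 < 3.0

    with open(path, "rb") as handle:
        reader = PdfFileReader(handle)
        if reader.isEncrypted:
            reader.decrypt(password)
        # Read the data while the file is still open.
        return get_attachments(reader)
```

Calling read_attachments("report.pdf") (a hypothetical file name) would give you a dictionary you can loop over, writing each byte string to disk under its file name.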

Easy, just not immediately intuitive when you want to do this fast! I created a pull request to hopefully get this function added as a method for the PdfFileReader class.

 
