This is a quick technical writeup to hopefully answer a question I’ve seen posted a few times around StackOverflow and the issue trackers of various Python PDF libraries. This is especially handy for those of you who don’t want to dig through the PDF 32000 specification to figure out how Adobe wants us to handle attachments.
PyPDF2 makes working with PDFs easy, but you may have noticed that it only has an addAttachment() function, similar to many other PDF libraries I tried. How do we extract attachments so that we can work with them? Embedding files in PDFs is very common and it would be nice to be able to interact with these objects, like we can with form fields and other things you might find in PDF files.
Fortunately, the building blocks for doing this are already available in the PdfFileReader class! We just need to stitch them together:
- Read the PDF file using PdfFileReader from PyPDF2
- Decrypt the PDF if it is encrypted (you can’t get to the embedded files otherwise)
- Retrieve the document catalog from the file trailer (reader.trailer['/Root']; note that the key is case-sensitive)
- Navigate in the dictionary this returns through ‘/Names’ to ‘/EmbeddedFiles’, then take that entry’s own ‘/Names’ array
- Loop through the list of entries found there; the array alternates between a file name string and a reference to that file’s parameters
- When we get to an IndirectObject, we have our file parameters. We call getObject() to return the parameters dictionary, then navigate through ‘/EF’ to ‘/F’, where our file data is stored as yet another IndirectObject. Here we simply call getData() and get a byte string back. This can then be written to a destination file or processed however you please!
As always it’s better to show the code, so here’s a proof of concept script:
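A minimal sketch of the steps above, assuming the legacy PyPDF2 1.x API that this post describes (`PdfFileReader`, `isEncrypted`, `decrypt()`, `getObject()`, `getData()`). The function names `get_attachments` and `extract_from_path` are illustrative, and the sketch assumes the embedded-files name tree is flat (a single ‘/Names’ array with no ‘/Kids’ nodes):

```python
def get_attachments(reader):
    """Return {filename: bytes} for the files embedded in an opened PDF.

    `reader` is expected to be a PyPDF2 1.x PdfFileReader, already decrypted.
    """
    catalog = reader.trailer["/Root"]
    try:
        # Catalog -> /Names -> /EmbeddedFiles -> /Names (flat name tree assumed)
        names = catalog["/Names"]["/EmbeddedFiles"]["/Names"]
    except KeyError:
        return {}  # no embedded files, or a name tree split into /Kids nodes

    attachments = {}
    # The array alternates: filename, file specification, filename, ...
    for name, spec_ref in zip(names[0::2], names[1::2]):
        spec = spec_ref.getObject()   # resolve the IndirectObject
        stream = spec["/EF"]["/F"]    # the embedded file stream
        attachments[name] = stream.getData()
    return attachments


def extract_from_path(path, password=""):
    """Open `path`, decrypt it if needed, and return its attachments."""
    # Imported here so get_attachments() itself has no hard dependency.
    from PyPDF2 import PdfFileReader  # legacy 1.x class name, as in the text

    with open(path, "rb") as handle:
        reader = PdfFileReader(handle)
        if reader.isEncrypted:
            reader.decrypt(password)
        # Read the streams while the file is still open.
        return get_attachments(reader)
```

Calling `extract_from_path("report.pdf")` (a hypothetical file name) should give back a dictionary mapping each attachment name to its bytes, which can then be written out with an ordinary `open(name, "wb")`.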
Easy, just not immediately intuitive when you want to do this fast! I created a pull request to hopefully get this function added as a method for the PdfFileReader class.