Tl;dr: Cut and paste the function I wrote here.
This is a quick technical writeup to hopefully answer a question I’ve seen posted a few times around StackOverflow and the issue trackers of various Python PDF libraries. This is especially handy for those of you who don’t want to dig through the PDF 32000 specification to figure out how Adobe wants us to handle attachments.
PyPDF2 makes working with PDFs easy, but you may have noticed that it only has an addAttachment() function, similar to many other PDF libraries I tried. How do we extract attachments so that we can work with them? Embedding files in PDFs is very common and it would be nice to be able to interact with these objects, like we can with form fields and other things you might find in PDF files.
Fortunately, the building blocks for doing this are already available in the PdfFileReader class! We just need to stitch them together:
- Read the PDF file using PdfFileReader from PyPDF2
- Decrypt the PDF if necessary (required, you can’t get to the embedded files without doing this)
- Retrieve the document catalog from the file trailer (reader.trailer[‘/Root’])
- In the dictionary this returns, navigate to ‘/Names’, then ‘/EmbeddedFiles’, then ‘/Names’ again
- Loop through the list of files that are found there
- The list alternates between file names and IndirectObjects. When we get to an IndirectObject, we have our file parameters. We call getObject() to return the parameters dictionary, then navigate to ‘/EF’ and then ‘/F’, where our file data is stored as yet another IndirectObject. Here we simply call getData() and get a byte string back. This can then be written to a destination file or processed however you please!
As always it’s better to show the code, so here’s a proof of concept script:
import PyPDF2


def getAttachments(reader):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.
    :return: dictionary of filenames and bytestrings
    """
    catalog = reader.trailer["/Root"]
    # Flat list of pairs: [name1, fileSpec1, name2, fileSpec2, ...]
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
    attachments = {}
    # enumerate() keeps the right position even if two files share a name
    for i, f in enumerate(fileNames):
        if isinstance(f, str):
            name = f
            dataIndex = i + 1
            fDict = fileNames[dataIndex].getObject()
            fData = fDict['/EF']['/F'].getData()
            attachments[name] = fData
    return attachments


# Keep the file open while reading, since the reader loads objects lazily
with open('YOURPDFPATH', 'rb') as handler:
    reader = PyPDF2.PdfFileReader(handler)
    dictionary = getAttachments(reader)
print(dictionary)
for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)
Easy, just not immediately intuitive when you want to do this fast! I created a pull request to hopefully get this function added as a method of the PdfFileReader class.
I love this, it works perfectly, thanks for sharing.
Though, if I don’t have an attachment, this script breaks. Sadly, I can’t figure out a way to check whether or not there is an attachment using PyPDF2.
Do you have any suggestions?
Thanks in advance, Vio
Something I would try is to put everything after the ‘catalog’ line in the getAttachments function in a try/except block. I’m guessing the script is breaking because the embedded files section of the PDF doesn’t always exist, so trying to access it raises a KeyError.
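A minimal sketch of that idea, written as an explicit key check rather than a blanket try/except. It assumes the catalog behaves like a regular Python dict (PyPDF2's dictionary objects subclass dict), and getEmbeddedFileNames is a hypothetical helper name, not part of PyPDF2. Plain dicts stand in for a real catalog in the example.

```python
def getEmbeddedFileNames(catalog):
    """
    Return the flat /Names list of the embedded-files section,
    or an empty list when the PDF has no attachments.
    """
    node = catalog
    # Walk /Names -> /EmbeddedFiles -> /Names, stopping at the first missing key.
    for key in ('/Names', '/EmbeddedFiles', '/Names'):
        if key not in node:
            return []
        node = node[key]
    return node


# Plain dicts stand in for a PDF catalog here (assumption for illustration).
withFiles = {'/Names': {'/EmbeddedFiles': {'/Names': ['a.txt', 'fileSpec']}}}
withoutFiles = {'/Pages': {}}
print(getEmbeddedFileNames(withFiles))
print(getEmbeddedFileNames(withoutFiles))
```

Inside getAttachments, replacing the fileNames lookup with a call like this would make the loop run zero times for a PDF without attachments, so the function would return an empty dictionary instead of failing.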
Thank you for your post. It really helps. I just have one question about step 2, how do you decrypt the PDF?
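For the decryption step, here is a minimal sketch, assuming the same legacy PyPDF2 reader API used in the post (the isEncrypted property and the decrypt() method, which returns 0 when the password is rejected). decryptIfNeeded is a hypothetical helper name, and the empty default password is an assumption that only works for PDFs encrypted without a user password.

```python
def decryptIfNeeded(reader, password=''):
    """
    Decrypt the reader in place when the PDF is encrypted.
    Raises ValueError if the password is rejected.
    """
    if reader.isEncrypted:
        # decrypt() returns 0 when the password does not match.
        if reader.decrypt(password) == 0:
            raise ValueError('Wrong password, cannot decrypt PDF')
    return reader
```

It would be called right after creating the reader and before getAttachments, for example reader = decryptIfNeeded(PyPDF2.PdfFileReader(handler), 'YOURPASSWORD').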