Extracting and editing PDF file content

With the PDF long established as one of the common currencies of the creative industries, it’s not unusual to have clients hand you a PDF and ask if you can take an image or a logo from within the file for use in a different layout. In an ideal world we would always go back to the source file and get the original content, but sometimes the original file isn’t available, or there simply isn’t time.

In today’s post I want to look at extracting raster (pixel-based) content from PDFs. In Part 2 we’ll cover extracting vector content and look at the Touch-up tools for ‘round-trip’ editing of PDF content.

So, we’ve got our PDF and we need to access one or more of the images within the file to use elsewhere.

 

[Image: a PDF containing images]

Because we’re looking at extracting raster content, let’s open up Photoshop and look at our options. Go to File > Open... and select your PDF in the file browser.

I should point out now that, if the PDF was created in Photoshop and saved with Photoshop editing capabilities preserved, there is a chance the file will open as a Photoshop file with fully editable layers and separate content, and your life will be made a little easier.

However, you’ll most likely encounter a dialogue box similar to what you see below. By default, when we open a PDF in Photoshop it’s configured to import whole pages – notice the ‘Pages’ radio button is active. That means all the content on the page will be rasterised according to the settings on the right side of the dialogue box. You choose a size, resolution and colour mode, and the whole page is converted to pixels based on these parameters. For certain tasks this is ideal, but we’re looking to extract the image, as is, from the PDF. If we convert the whole page we are re-rasterising the images within it, which could alter their quality and colour.
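To get a feel for what rasterising a whole page means in pixels, here is a rough Python sketch of the arithmetic. The function name and the A4 example are my own illustration, not anything Photoshop exposes; it simply multiplies the page size in inches by the chosen resolution.

```python
MM_PER_INCH = 25.4

def page_pixels(width_mm: float, height_mm: float, ppi: float) -> tuple[int, int]:
    """Pixel dimensions of a page rasterised at the given resolution (illustrative helper)."""
    width_px = round(width_mm / MM_PER_INCH * ppi)
    height_px = round(height_mm / MM_PER_INCH * ppi)
    return width_px, height_px

# An A4 page (210 x 297 mm) rasterised at 300 ppi
print(page_pixels(210, 297, 300))  # (2480, 3508)
```

Every image on the page is resampled onto that single pixel grid, whatever resolution it was embedded at, which is why whole-page import is not the way to recover an image untouched.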

 

[Image: the Import PDF dialogue box with ‘Pages’ selected]

If we select the ‘Images’ radio button, we see a thumbnail for each of the images within our PDF – notice the Image Size options on the right side of the panel are greyed out. Instead of rasterising our page, we’re now choosing to extract the images directly from the PDF, as is. This means we get them at the embedded resolution and intended colour mode. Currently, the first thumbnail is selected (indicated by the blue surrounding highlight), but we can shift-click to select and open multiple images if required.
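If you are curious why direct extraction leaves the pixels untouched, the crude Python sketch below shows the idea. It is not how Photoshop does it, and the function name is hypothetical. It assumes an unencrypted PDF whose images are stored as JPEG (DCTDecode) streams, and simply copies the bytes of any stream that starts with the JPEG signature. Images stored with other filters are ignored, and a proper PDF library would be the sensible choice for real work.

```python
import re

# Body of each PDF stream object: "stream" + end-of-line ... "endstream".
# The lookbehind stops the "stream" inside "endstream" from starting a match.
_STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)

JPEG_SIGNATURE = b"\xff\xd8\xff"

def extract_jpeg_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the raw bytes of every stream that looks like a JPEG (illustrative only)."""
    jpegs = []
    for match in _STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        if data.startswith(JPEG_SIGNATURE):
            # Copied byte for byte: nothing is decoded, resampled or recompressed.
            jpegs.append(data)
    return jpegs
```

Each item returned could be written straight to a .jpg file. Because nothing is decoded along the way, the pixel dimensions and compression are exactly those stored in the PDF.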

 

[Image: the Import PDF dialogue box with ‘Images’ selected, showing thumbnails of the embedded images]

Once our image is opened, you’ll see the image filename is inherited from the PDF filename followed by a number – original filenames are discarded when PDFs are created, so they are not embedded in the file. We can check the resolution by calling up Image Size – the image below isn’t exactly 300ppi, which indicates that the original image was resampled when the PDF was created. Because scaling and conversion factors don’t always work out to whole numbers, you’ll often see images render out to these slightly off values, but this is, essentially, a 300ppi image.
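The near miss on 300ppi is easier to see with some numbers. This is a rough sketch, assuming the PDF exporter resamples each image to a target resolution and has to settle on a whole number of pixels; the function name and the 84.5mm placed width are made up for illustration.

```python
MM_PER_INCH = 25.4

def resampled_ppi(placed_width_mm: float, target_ppi: float) -> tuple[int, float]:
    """Whole-pixel width after resampling, and the resolution that actually results."""
    placed_width_in = placed_width_mm / MM_PER_INCH
    pixels = round(placed_width_in * target_ppi)  # pixel counts must be whole numbers
    return pixels, pixels / placed_width_in

pixels, ppi = resampled_ppi(84.5, 300)
print(pixels, round(ppi, 2))  # 998 299.99
```

The ideal width would be just over 998 pixels, so the rounded image ends up a fraction under 300ppi rather than exactly on it.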

 

[Image: the Image Size dialogue box for an image extracted from the PDF]

So, when we extract an image from a PDF, it is possible the original image was converted or compressed when the PDF was created, but by using this extraction method we know we are getting the maximum available image quality and resolution from the embedded content. In situations where the original resources can no longer be traced, this is your next-best resource for re-using content.