Why You Can’t Always Copy the Right Text from PDFs
In this post I will explain why you can’t always copy and paste from PDF files, why the “wrong” areas sometimes get selected, and how to fix it.
To start, let’s clarify that there are two main types of PDF files: PDF files with selectable text, and PDF files without selectable text. Before I explain the difference, the easiest way to figure out which one you are facing is to open the PDF, then press Ctrl+A on your keyboard. This will automatically select all selectable text (thank you Mr Obvious) and highlight it. If nothing gets highlighted, then your PDF doesn’t contain selectable text. Now let’s look at each case:
PDF files with non-selectable text include scans of documents and files whose text has been converted to outlines (vectorised). These PDFs don’t technically contain any text, only an “image” of it, which is why you can’t copy it. For these, you could use OCR (Optical Character Recognition) software, which adds a layer of selectable text behind the “image”. As you probably know, OCR is very useful but a little hit and miss, since it is not 100% accurate.
In short, you can’t copy text from a PDF without selectable text, and OCR output isn’t reliable enough to use without checking it. If you still need to copy that text, you can ask the creator of the PDF. Creators of such PDF files are often graphic designers, and they usually have a version with selectable text which you will be able to copy and paste safely. Maybe they won’t want to send it to you, but that’s another story.
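If you have a whole folder of PDFs, the Ctrl+A check can be automated. Here is a minimal sketch of the idea, assuming you already have the text that some extraction library returned for each page. The function name `has_selectable_text` and the sample data are hypothetical and only there for illustration; the pypdf call mentioned in the comment is one possible way to get the page texts, not something this post depends on.

```python
def has_selectable_text(page_texts):
    """Return True if at least one page yielded non-whitespace text.

    `page_texts` is a list with one entry per page: the text a PDF
    library extracted from that page, or ""/None if it found nothing.
    Assumption: with pypdf installed, such a list could be built with
    [page.extract_text() for page in PdfReader("file.pdf").pages].
    """
    return any(text and text.strip() for text in page_texts)


# Hypothetical sample data: a scan yields nothing but empty strings,
# while a PDF with real text yields actual characters.
scanned_pages = ["", "  \n", None]
digital_pages = ["Invoice 2024", ""]

print(has_selectable_text(scanned_pages))
print(has_selectable_text(digital_pages))
```

Keep in mind that a scan which has already been through OCR will also pass this check, so a positive result doesn’t tell you how accurate the text is.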
PDF files with selectable text can also pose problems: sometimes you will try to copy something but it won’t work, and for some (as yet) unknown reason that other paragraph on the opposite side of the page gets selected instead of the one you want. This happens because PDF files don’t have a standard logical text flow. In Word, for example, the logical text flow is from left to right and from top to bottom (and right to left, top to bottom for Arabic or Hebrew, well, sort of). The text flow in a PDF file, on the other hand, depends on how the graphic designer created it: if, for example, she or he pasted the second paragraph, then the first one, and finally the third one, then technically the logical order in the PDF file becomes second paragraph, then first paragraph, then third (2, 1, 3)! This explains why you can’t select what you want in the order you want it.
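To make the 2, 1, 3 example concrete, here is a small sketch in Python. It is only a model of the problem, not how any real PDF viewer works: the `TextBlock` class, the coordinates, and both function names are made up for illustration. Each block remembers where it sits on the page, and the list keeps the blocks in the order they were created.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def stored_order(blocks):
    """Text in the order the blocks were created (what a naive copy follows)."""
    return [block.text for block in blocks]


def reading_order(blocks):
    """Text sorted top to bottom, then left to right."""
    return [block.text for block in sorted(blocks, key=lambda b: (b.y, b.x))]


# The designer pasted paragraph 2 first, then 1, then 3.
blocks = [
    TextBlock("Second paragraph", x=50, y=300),
    TextBlock("First paragraph", x=50, y=100),
    TextBlock("Third paragraph", x=50, y=500),
]

print(stored_order(blocks))
print(reading_order(blocks))
```

Sorting by position like this only works for a simple single-column page. With columns, sidebars, or captions, position alone can’t tell you what comes next, so the logical order has to be set by whoever built the layout.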
The solution to this is “threading text boxes”: when graphic designers create PDF files, they manually “re-order” the text by linking the text boxes with connectors so that it flows in the correct logical order. The challenge comes with layout-heavy PDFs, because threading every text box can take a fair amount of time. So if your PDFs are fairly simple in terms of layout, you can and should expect your designer to do a pretty good job of ordering the text flow, and this will allow you to copy and paste your text in the right logical order.