Introduction to Unicode for Proofreaders
Unicode is a computer and software industry standard for handling text. It's possible to geek out on the technical aspects of Unicode (if you're so inclined, check out the Wikipedia page on Unicode and the Unicode Consortium website), but in this article we'll focus on the fundamentals and discuss why this basic knowledge is critical for anyone who is part of a document creation, editing, review, and correction workflow in a highly regulated industry.
What Is Unicode? (In Plain English For Non-Techies)
Unicode is most easily understood as a separation of the visual representation of text characters (also called glyphs) from the definition of what those characters are.
People can understand what written letters and other kinds of symbols mean, but computers only understand binary numbers (ones and zeroes). The challenge with human languages, then, is how to convert written letters, numbers, and other kinds of symbols (apostrophes, quotation marks, parentheses, etc.) into the binary language that computers understand.
The answer is encoding, and that's what Unicode does. Each letter, number, and symbol in every language around the world is represented by a unique numeric code (called a code point) in the Unicode standard.
In plain English, then, encoding exists because human beings can look at a capital letter A and understand what it is, but computers cannot. Computers understand A as the Unicode value U+0041 (which ultimately gets converted into binary ones and zeroes via math that we don’t need to get into here).
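If you'd like to see this mapping in code, here's a minimal Python sketch (any programming language with Unicode support exposes the same idea):

```python
# Every character maps to a numeric Unicode code point.
# ord() looks up the code point; chr() goes the other way.
code_point = ord("A")
print(f"U+{code_point:04X}")  # U+0041
print(chr(0x0041))            # A
```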
You can also see this for yourself in Microsoft Word: type a capital A, select it, then press Alt+X. The capital A on your screen turns into the four-digit code 0041.
Assuming you’re using a Unicode font (and most fonts are these days), every single character in every document you create (or proofread) has a Unicode value behind it.
How Unicode Makes Your Life Easier (And You Don’t Even Know It)
If there were no such thing as character encoding, you’d have a very hard time copying and pasting text from one application to another, because there would be no way for the applications to have a common understanding of the characters being copied and pasted.
So that pharmaceutical product insert where the regulatory-approved source material is in Microsoft Word, but the artwork needs to be done in Adobe InDesign and exported to PDF to go to the printer? None of that would be possible without an encoding standard.
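As a rough illustration (a Python sketch of the principle, not of how Word or InDesign actually exchange data internally), two applications can hand text back and forth reliably because they agree on how code points are encoded as bytes:

```python
# Applications can exchange text reliably because they agree on an
# encoding of Unicode code points into bytes, such as UTF-8.
text = "32°F"
data = text.encode("utf-8")   # the bytes that travel between systems
print(data)                   # b'32\xc2\xb0F' — the degree sign takes two bytes
print(data.decode("utf-8"))   # 32°F — recovered intact on the other end
```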
But Nothing’s Perfect
Here’s where things get interesting, and really important for proofreaders (plus anyone else involved with document creation, editing, review, and correction). There are risks and pitfalls if you don’t understand the basic Unicode concepts described above.
If you don't understand that human beings and computers interpret written language differently, a number of risky situations can arise.
Here’s a quick example:
When creating the artwork for an insert, a graphic designer decides that a zero formatted as superscript works just as well as the actual degree symbol. So instead of 32°F (with a real degree sign), we have 320F with the zero raised by formatting alone.
Looks similar, right?
The proofreader visually checks the final version of the design proof against the regulatory-approved source material and doesn't catch the misuse of the digit zero to represent degrees. The design proof goes to the printer, but the printer ends up needing to adjust the text slightly to make it look just right for printing. During this adjustment, the zero loses its superscript attribute, and now the text plainly reads 320F.
Now we have a problem. Potentially a big problem if there’s a misprint or missed deadline due to an unexpected additional correction cycle.
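Under the hood, the two strings were never the same, even when the superscript formatting made them look alike. A quick Python sketch shows the difference at the code point level:

```python
import unicodedata

correct = "32°F"  # degree sign, U+00B0
flawed = "320F"   # plain digit zero; the "superscript" was only formatting

for s in (correct, flawed):
    print("  ".join(f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s))
# U+0033 DIGIT THREE  U+0032 DIGIT TWO  U+00B0 DEGREE SIGN  U+0046 LATIN CAPITAL LETTER F
# U+0033 DIGIT THREE  U+0032 DIGIT TWO  U+0030 DIGIT ZERO   U+0046 LATIN CAPITAL LETTER F
```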
Accessibility Considerations
Print is not the only area where risk needs to be managed. If you're developing IFUs (instructions for use) or technical documentation that will be posted on a website for the public to access, be aware that not everyone will be consuming the document with their eyes. Visually impaired people often use text-to-speech applications to have documents read aloud to them.
Want to guess how text-to-speech applications read text?
Yup. It’s Unicode.
So even if your 320F made it to the printed material and didn’t cause any problems, there might be others out there trying to listen to instructions read aloud and hearing “320 Fahrenheit” instead of “32 degrees Fahrenheit.”
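To make that concrete, here's a toy sketch (real text-to-speech engines are far more sophisticated, but they start from the same Unicode characters, not from visual formatting):

```python
# A toy "reader" standing in for a real text-to-speech engine.
# It sees characters, not superscript formatting.
def read_aloud(text: str) -> str:
    spoken = {"°": " degrees ", "F": "Fahrenheit"}
    return "".join(spoken.get(ch, ch) for ch in text)

print(read_aloud("32°F"))  # 32 degrees Fahrenheit
print(read_aloud("320F"))  # 320Fahrenheit — spoken as "three hundred twenty..."
```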
Regulatory Submission Considerations
Even if you’re not planning on publishing your documents publicly on your website, you should know that regulatory bodies around the world are moving towards interoperability of submitted content among systems. As mentioned earlier in this article, Unicode is a big part of what makes this interoperability possible.
Here's an example of how the "superscript zero as degree symbol" scenario might play out in this context:
You submit a document for approval to a regulatory body. That regulatory body has an initiative in place to take every submitted document and put its content into a searchable database. As part of this workflow, the submitted document gets converted from PDF to another format. The conversion routine relies on the Unicode values of each character in the PDF to produce the content that will ultimately be stored in the database. When the conversion happens, the zero being used as a degree symbol loses its superscript attribute and becomes a regular zero. Now, in this regulatory body's database, what should be 32°F reads 320F.
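There's a closely related trap worth knowing about. Even if a designer uses the dedicated SUPERSCRIPT ZERO character (U+2070) instead of a formatted regular zero, the compatibility normalization (NFKC) that many search and conversion pipelines apply will still flatten it to a plain zero. A Python sketch of that behavior:

```python
import unicodedata

# U+2070 SUPERSCRIPT ZERO renders as a raised zero on screen...
text = "32\u2070F"
# ...but NFKC compatibility normalization, common in search and
# conversion pipelines, folds it to a plain DIGIT ZERO.
print(unicodedata.normalize("NFKC", text))  # 320F
```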
So again, when it comes to Unicode, if you don’t understand the importance of using the correct characters, you can easily end up with garbage in / garbage out.
Avoiding Unicode Pitfalls
There are two main ways to prevent Unicode-related problems from cropping up in your documents:
1. Educate everyone involved in the document workflow about the difference between how human beings and computers understand written language. When creating documents, always use the correct symbols instead of cutting corners.
2. Use proofreading software (such as TVT) that compares text at the Unicode level. You'll save lots of time, you'll be able to proofread in any language, and you'll catch deviations that you'd never see in a manual visual inspection. A minimal sketch of this kind of comparison follows below.
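As promised above, here's a minimal sketch of what a code-point-level comparison looks like in Python (tools like TVT do far more than this, of course; this just illustrates the principle):

```python
import unicodedata
from itertools import zip_longest

def describe(ch):
    """Human-readable label for a character, or a placeholder for a gap."""
    if ch is None:
        return "(missing)"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"

def diff_code_points(source: str, proof: str) -> None:
    """Flag every position where two strings differ at the code point level."""
    for i, (a, b) in enumerate(zip_longest(source, proof)):
        if a != b:
            print(f"position {i}: source {describe(a)} vs proof {describe(b)}")

diff_code_points("32°F", "320F")
# position 2: source U+00B0 DEGREE SIGN vs proof U+0030 DIGIT ZERO
```

No amount of visual similarity fools a comparison like this; the degree sign and the digit zero are simply different code points.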