Other aspects of computing work best with monospace, also. The Unix shells; PowerShell; the Windows Command Prompt. Email is still sent with a copy in plaintext, which has to be wrapped on a monospace boundary. Not least, this persists because HTML email is excessively difficult to render securely, and there are user agents that still work better with plaintext.
In all of these situations, the problem presents itself that the originator has to anticipate how text will be rendered in advance. You cannot just send text and expect the recipient to flow it. You have to predict the effects of Tab characters correctly, and word wrap the text in advance, often not knowing the software that will be used for display. In terminal emulation, e.g. xterm via SSH, when the server sends the client a character to render, the server and the client need to agree by how many positions to advance the cursor. If they disagree, the whole screen can become corrupted.
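To make "word wrap the text in advance" concrete, here is a minimal Python sketch of what an originator might do before sending plaintext. The function name `prewrap`, the 72-column limit and the 8-column tab stops are my assumptions for illustration, not anything a particular mail client or terminal mandates. Note that it counts every code point as one column, which is exactly the assumption that breaks down later in this post.

```python
import textwrap

def prewrap(text, columns=72, tab_size=8):
    """Expand tabs and hard-wrap each input line before sending (sketch)."""
    out = []
    for line in text.splitlines():
        # The sender has to guess the recipient's tab stops here.
        expanded = line.expandtabs(tab_size)
        # textwrap measures length in code points, not display cells.
        wrapped = textwrap.wrap(expanded, width=columns) or [""]
        out.extend(wrapped)
    return "\n".join(out)

if __name__ == "__main__":
    print(prewrap("name\tvalue"))
    print(prewrap("word " * 20, columns=20))
```

If the recipient uses different tab stops or a different notion of character width, the pre-wrapped text no longer lines up, and the sender has no way to find out.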
As long as you stick to precomposed Unicode characters, and Western scripts, things are relatively straightforward. Whether it's A or Å, S or Š – so long as there are no combining marks, you can count a single Unicode code point as one character width. So the following works:
aeioucsz
áéíóúčšž
Nice and neat, right?
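A small Python sketch of why the "no combining marks" condition matters, using only the standard `unicodedata` module. The helper name `naive_width` is hypothetical, and it is only a rough approximation that skips combining marks and otherwise counts one column per code point.

```python
import unicodedata

# Force the precomposed (NFC) and decomposed (NFD) forms explicitly.
precomposed = unicodedata.normalize("NFC", "áéíóúčšž")
decomposed = unicodedata.normalize("NFD", precomposed)

def naive_width(text):
    """Count one column per code point, skipping combining marks (sketch)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

if __name__ == "__main__":
    print(len(precomposed), len(decomposed))  # code point counts differ
    print(naive_width(precomposed), naive_width(decomposed))  # widths agree
```

Both forms display the same eight letters, but only the precomposed one has a code point count that equals its width.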
Unfortunately, problems appear with Asian characters. When displayed in monospace, many Asian characters occupy two character widths. How do we know which ones?
Our problems would be solved if the Unicode standard included this information.
If you're on Unix, you may have access to wcwidth. However: "This function was removed from the final ISO/IEC 9899:1990/Amendment 1:1995 (E), and the return value for a non-printable wide character is not specified." What this means is that the results of wcwidth are system-specific.
In 2007, Markus Kuhn implemented a generic version of wcwidth, which we now use in the graphical SSH terminal console in Bitvise SSH Client. However, this is more than 8 years old at this point, and is based on Unicode 5.0, whereas the current latest version is 8.0.
So I had the idea that maybe we could "just" extract up-to-date information from Windows. It's 2015, the following should render well, right?
aeioucsz
áéíóúčšž
台北1234 (leading characters should be 2 spaces each)
abcdefgh
ＱＲＳ12 (fullwidth latin; should be 2 spaces each)
abcdefgh
ｱｲｳ1234 (halfwidth kana; should be 1 space each)
abcdefgh
It turns out, no. Perhaps you have an operating system with proper monospace fonts, which displays all of the above lined up. On my Windows 8.1, the problem looks like this:
[Screenshots of the text above in IE, Chrome, Firefox, Notepad, and VS 2015]
Note how nothing lines up: not in Internet Explorer; not in Chrome; not in Firefox; not in Notepad; not in the latest version of Visual Studio.
It turns out, when locale is set to English (United States), Windows just doesn't seem to use monospace fonts for Asian characters. Indeed, setting the Windows locale to Chinese (Simplified) produces this:
This is better, but now the half-width kana are borked. Sigh.
Note that the above isn't a Windows problem only. This is how the same text displays on Android:
It boggles my mind that it's 2015, and we still don't have a single, authoritative answer to this question: how many character positions should each Unicode character occupy in a monospace font?
Discussion
Because I'm providing examples of incorrect character rendering, this may offer the misleading impression that this is just a font problem.
This isn't just a font problem. It's that there's no standard monospace character width information, independent of font used.
The above incorrect renderings involve systems using non-monospace fallback fonts. However:
- Even if you only have a fallback font that's not mono, you can coerce it into the right character positions if you know the character widths. The above examples could work correctly – although the renderings might be less than perfect – if software knew the intended character widths.
- Even if you do not have a fallback font, and are just displaying placeholder boxes – you still need to know character widths to render the rest of the text properly, and for Tab characters to work.
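As a rough sketch of both points, the Python below assigns each character a starting column given some width function, and advances Tab characters to the next tab stop. Everything here is hypothetical: `layout_cells`, the 8-column tab stops, and especially `demo_width`, which is a hand-written stand-in for the authoritative width data this post is asking for.

```python
def layout_cells(text, width_of, tab_size=8):
    """Return ([(char, start_column), ...], total_columns) for one line."""
    cells = []
    col = 0
    for ch in text:
        if ch == "\t":
            # A tab advances to the next tab stop; this only works
            # if the widths of the preceding characters are known.
            col = (col // tab_size + 1) * tab_size
            continue
        cells.append((ch, col))
        col += width_of(ch)
    return cells, col

# Hypothetical demo table: two CJK ideographs treated as 2 cells wide.
DEMO_WIDE = {"\u53f0", "\u5317"}

def demo_width(ch):
    return 2 if ch in DEMO_WIDE else 1

if __name__ == "__main__":
    print(layout_cells("\u53f0\u5317" + "1234", demo_width))
    print(layout_cells("\u53f0\tx", demo_width))
```

A renderer that knows the start column of each character can place a glyph from any font, or a placeholder box, into the right cells, even if the glyph itself is not drawn at the ideal size.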
Update and additional information
It turns out that Unicode does in fact provide character width information for East-Asian characters. It's just not as neat as one number. When is it ever? :)
The information is in EastAsianWidth.txt, which is part of the Unicode character database. The data provides an East_Asian_Width property, which is explained in this technical report (UAX #11).
This is basically what is needed... with some unfortunate limitations:
- Hundreds of characters are categorized as ambiguous width (property value A). These characters include anything from U+00A1 (inverted exclamation mark, ¡) to U+2010 (hyphen, ‐) to U+FFFD (replacement character, �). Many of these characters (but not all!) have different widths depending on system locale. For example, U+00F7 (division character, ÷) has a width of 1 on Windows under English (United States), but a width of 2 under Chinese (Simplified, China).
- In some cases, width can differ even between different fonts under the same locale. For example, on Windows under Chinese (Simplified, China), U+FFFD (replacement character) renders as narrow (1 position) with a raster font, and wide (2 positions) as TrueType.
- Some characters categorized as one width are still displayed as another width by certain systems. For example, U+20A9 (Won sign, ₩) has width property value H (half-width), but is displayed as wide (two positions) by Windows under locale Chinese (Simplified, China). It is displayed as narrow under locale English (United States).
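For illustration, Python exposes this property through `unicodedata.east_asian_width`, so a width estimate can be sketched as below. The function names and the `ambiguous_wide` switch are my own assumptions. The results depend on the Unicode version bundled with the Python build, and, per the limitations above, they will not always match what a given system actually displays.

```python
import unicodedata

def cell_width(ch, ambiguous_wide=False):
    """Estimate monospace cells for one code point (sketch)."""
    if unicodedata.combining(ch):
        return 0  # combining marks attach to the previous character
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):  # Wide, Fullwidth
        return 2
    if eaw == "A":  # Ambiguous: the caller has to pick a policy
        return 2 if ambiguous_wide else 1
    return 1  # Na, H, N

def text_width(text, ambiguous_wide=False):
    return sum(cell_width(ch, ambiguous_wide) for ch in text)

if __name__ == "__main__":
    print(text_width("\u53f0\u5317" + "1234"))        # CJK ideographs + digits
    print(text_width("\uff31\uff32\uff33" + "12"))    # fullwidth Latin + digits
    print(text_width("\uff71\uff72\uff73" + "1234"))  # halfwidth kana + digits
    print(cell_width("\u00f7"), cell_width("\u00f7", ambiguous_wide=True))
```

The `ambiguous_wide` flag is where the locale dependence shows up: the data alone cannot say whether ÷ takes one cell or two.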
There are other efforts to provide information on character widths, including the utf8proc library that's part of Julia. Interestingly, this library derives its information by extracting it from Unifont. Unifont, in turn, is an impressive open source Unicode font with a huge coverage of characters.
Showing 18 out of 18 comments, oldest first:
Comment on Sep 11, 2015 at 14:47 by Pádraig Brady
Comment on Sep 11, 2015 at 18:54 by denisbider
Comment on Sep 11, 2015 at 22:23 by Conley
http://stackoverflow.com/questions/30881811/how-do-you-get-the-display-width-of-combined-unicode-characters-in-python-3
Comment on Sep 11, 2015 at 23:47 by AcidFlask
A correction about UAX 11: East Asian Widths - it is not definitive with regard to character widths. Section 2 states that "Instead, the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary."
Furthermore, a great many characters have East_Asian_Width property 'N' (Not an East Asian Character) - and so Unicode does not provide any guidance as to how to deal with these cases.
Jiahao Chen, MIT Research Scientist
Comment on Sep 12, 2015 at 08:11 by Dominik Dalek
Comment on Sep 12, 2015 at 12:31 by D. Ongs
Comment on Sep 12, 2015 at 13:19 by microcolonel
Chrome on Chrome OS (using FreeType 2) properly aligns the text in that <pre>, as does Firefox on GNU/Linux (also using FreeType 2, in addition to Graphite). On Linux, both of them also render the full-width Latin characters with the correct weight and face, something that Chrome and IE on Windows seem to get wrong.
Comment on Sep 13, 2015 at 15:18 by denisbider
Comment on Jan 5, 2016 at 22:26 by Unknown
Comment on Jan 6, 2016 at 15:44 by Unknown
Chinese characters align in notepad, pfe and html.
Comment on Feb 9, 2018 at 23:52 by Rajiv
Comment on Feb 10, 2018 at 01:14 by denisbider
Scripts that are incompatible with monospace representations are fundamentally incompatible with the above described technologies. This is usually not a problem because there is little intersection between the use cases for those scripts, and the use cases for monospace technologies.
However, for those scripts that can be represented in monospace, it helps if fonts are available that display them that way, so that they can be used with monospace technologies.
Comment on Feb 10, 2018 at 01:26 by denisbider
The intolerant argument is that the diversity of languages and scripts that exist in the world is shit. It increases the oppressiveness of geopolitical borders and is the main obstacle that prevents ideas crossing them. It prevents communication, protects harmful idiosyncrasies and local fiefdoms, and enables the most harmful cognitive bias in the world – the in-group/out-group dynamic – because our cultures are separated by language and script, and make us foreign to each other in the world.
Ideally, there would be one language, one script, and all the rest should go into a museum and never be touched again. It is variety for the sake of variety, and its effects are economically, socially, and politically evil.
Comment on Apr 22, 2020 at 19:50 by amn
Comment on Apr 22, 2020 at 23:08 by denisbider
You appear to be an uninsightful commenter who does not bother reading, so I do not welcome further discussion.
Comment on Apr 23, 2020 at 13:19 by amn
You either have to concede that *character*-based terminal emulators may be unfit for displaying text in multiple languages simultaneously, or you need to understand that it's not the job of Unicode to mandate widths and heights of the characters; those notions lose their meaning, if you ask me. It makes sense for a terminal displaying source code, but that's not Unicode's job, although I agree that their purpose is to coordinate. I am not saying multilingual text rendering layout cannot be standardized, but that's not Unicode's job!
Good day. I don't mean to offend you, but I don't know you and I don't agree with you. If you are looking for emotional support, feel free to withhold my seemingly aggressive comment from your blog, what can I tell you. I don't have a habit of sprinkling niceties into these kinds of discussions. I don't know why I would. I state my opinions, you could have stopped with yours without asking me to leave. This is the Internet, not a nightclub.
Also, I have read your post, how do you think I came here? To insult you?
Comment on Apr 24, 2020 at 01:19 by denisbider
That is a Unicode character. If Unicode has taken upon itself to standardize this, then it could certainly standardize character widths for scripts where it makes sense. It doesn't make sense for Devanagari, but it certainly makes sense for Asian scripts which have been used in monospace terminals for decades.
Again, I do not appreciate further discussion, not because I'm offended but because it's stupid.
Comment on May 28, 2020 at 15:11 by Unknown