<?xml version="1.0" encoding="utf-8" standalone="yes"?><feed xmlns="http://www.w3.org/2005/Atom">
  <title></title>
  <subtitle></subtitle>
  <id>https://www.endpointdev.com/blog/tags/unicode/</id>
  <link href="https://www.endpointdev.com/blog/tags/unicode/"/>
  <link href="https://www.endpointdev.com/blog/tags/unicode/" rel="self"/>
  <updated>2025-04-29T00:00:00+00:00</updated>
  <author>
    <name>End Point Dev</name>
  </author>
  
    <entry>
      <title>Handling text encoding in Perl</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2025/04/encoding-in-perl/"/>
      <id>https://www.endpointdev.com/blog/2025/04/encoding-in-perl/</id>
      <published>2025-04-29T00:00:00+00:00</published>
      <author>
        <name>Marco Pessotto</name>
      </author>
      <content type="html">
        &lt;p&gt;&lt;img src=&#34;/blog/2025/04/encoding-in-perl/hieroglyphics-1246926-pxhere-com.webp&#34; alt=&#34;Columns of Egyptian hieroglyphics carved into stone&#34;&gt;&lt;/p&gt;
&lt;!-- Photo https://pxhere.com/en/photo/1246926 CC0 Public Domain --&gt;
&lt;p&gt;When we are dealing with legacy applications, it&amp;rsquo;s very possible that the code we are looking at does not deal with Unicode characters, instead assuming all text is ASCII. This will cause a myriad of glitches and visual errors.&lt;/p&gt;
&lt;p&gt;In 2025, more than 30 years after Unicode was born, how is it possible that old applications still survive while ignoring or working around the whole issue?&lt;/p&gt;
&lt;p&gt;Well, if your audience is mainly English speaking, it&amp;rsquo;s possible that you just experience glitches sometimes, with some characters like typographical quotes, non-breaking spaces, etc. which are not really mission-critical. If, on the contrary, you need to deal every day with diacritics or even different languages (say, Italian and Slovenian), your application simply won&amp;rsquo;t survive without a good understanding of encoding.&lt;/p&gt;
&lt;p&gt;In this article we are going to focus on Perl, but other languages face the same problems.&lt;/p&gt;
&lt;h3 id=&#34;back-to-the-bytes&#34;&gt;Back to the bytes&lt;/h3&gt;
&lt;p&gt;As we know, machines work with numbers and bytes. A string of text is made of bytes, and each of them is 8 bits (each bit is a 0 or a 1). So one byte allows 256 possible combinations of bits.&lt;/p&gt;
&lt;p&gt;Plain ASCII is made up of 128 characters (7 bits), so it fits nicely in one byte, leaving room for more. One character is exactly one byte, and one byte carries one character.&lt;/p&gt;
&lt;p&gt;However, ASCII is not enough for most languages, even if they use the Latin alphabet, because they use diacritics like é, à, č, and ž.&lt;/p&gt;
&lt;p&gt;To address this problem, the &lt;a href=&#34;https://en.wikipedia.org/wiki/ISO/IEC_8859&#34;&gt;ISO 8859&lt;/a&gt; encoding standards appeared (there are others, like the Windows code pages, using the same idea, but of course using different code points). These standards use the 8th bit, which ASCII leaves unused: each character still takes a single byte, but the number of combinations doubles, allowing 256 possible characters. That&amp;rsquo;s better, but still not great. It suffices for handling text in a couple of languages if they share the same characters, but not more. For this reason, there are various ISO 8859 encoding standards (8859-1, 8859-2, etc.), one for each group of related languages: 8859-1 is for Western Europe, 8859-2 for Central Europe, and so on. There are even revisions of the same encoding, like 8859-15 and 8859-16.&lt;/p&gt;
&lt;p&gt;The problem is that if you have a random string, you have to guess which is the correct encoding. The same byte value could represent an &amp;ldquo;È&amp;rdquo; or a &amp;ldquo;Č&amp;rdquo;. You need to look at the context (which language is this?) or search for an encoding declaration. Most importantly, you are simply not able to type È and Č in the same plain text document. If your company works in Italy using the 8859-15 encoding, it means you can&amp;rsquo;t even accept the correct name of a customer from Slovenia, a neighbouring country, because the encoding simply doesn&amp;rsquo;t have a place for &amp;ldquo;č&amp;rdquo; (a c with a caron) and you have to work around this real problem.&lt;/p&gt;
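&lt;p&gt;To make the ambiguity concrete, here is a minimal sketch that decodes one and the same byte under two different ISO 8859 parts. It uses the core &lt;code&gt;Encode&lt;/code&gt; module; the byte value and the variable names are picked only for illustration.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;use strict;
use warnings;
use Encode qw(decode);

# One and the same byte, as it might sit in a legacy text file.
my $byte = &amp;#34;\xC8&amp;#34;;

# Decode it under two different ISO 8859 parts and record the code point.
my %codepoint;
for my $encoding (&amp;#34;iso-8859-1&amp;#34;, &amp;#34;iso-8859-2&amp;#34;) {
    my $char = decode($encoding, $byte);
    $codepoint{$encoding} = sprintf(&amp;#34;U+%04X&amp;#34;, ord($char));
    print &amp;#34;$encoding: $codepoint{$encoding}\n&amp;#34;;
}
# iso-8859-1 gives U+00C8 (È), iso-8859-2 gives U+010C (Č)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Nothing in the byte itself says which of the two answers is the right one, which is why the declared encoding or the context has to decide.&lt;/p&gt;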
&lt;p&gt;So finally came the &lt;a href=&#34;https://en.wikipedia.org/wiki/Unicode&#34;&gt;Unicode&lt;/a&gt; age. This standard allows for more than a million characters, which should be enough. You can finally type English, Italian, Russian, Arabic, and emojis all in the same plain text. This is truly great, but it creates a complication for the programmer: the assumption that one byte is one character is not true anymore. The common encoding for Unicode is UTF-8, which is also backward compatible with ASCII. This means that if you have ASCII text, it is also valid UTF-8. Any other character which is not ASCII will instead take from two to four bytes and the programming language needs to be aware of this.&lt;/p&gt;
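&lt;p&gt;As a rough illustration of the variable width, the sketch below encodes a handful of characters to UTF-8 and counts the bytes. The sample characters are an arbitrary choice made for this example, written as code points so the script itself stays plain ASCII.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;use strict;
use warnings;
use Encode qw(encode);

# A, È, Č, the euro sign and the smirking-face emoji, given as code points.
my @chars = (&amp;#34;A&amp;#34;, &amp;#34;\x{c8}&amp;#34;, &amp;#34;\x{10c}&amp;#34;, &amp;#34;\x{20ac}&amp;#34;, &amp;#34;\x{1f60f}&amp;#34;);

# Encode each character to UTF-8 and count the resulting bytes.
my @sizes = map { length(encode(&amp;#34;UTF-8&amp;#34;, $_)) } @chars;

for my $i (0 .. $#chars) {
    printf &amp;#34;U+%04X takes %d byte(s) in UTF-8\n&amp;#34;, ord($chars[$i]), $sizes[$i];
}
# byte counts, in order: 1, 2, 2, 3, 4&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Only the ASCII letter keeps the old one-byte-per-character relationship; everything else needs more than one byte.&lt;/p&gt;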
&lt;h3 id=&#34;into-the-language-and-back-to-the-world&#34;&gt;Into the language and back to the world&lt;/h3&gt;
&lt;p&gt;Text manipulation is a very common task. If you need to process a string, say &amp;ldquo;ÈČ&amp;rdquo;, like in this document, you should be able to tell that it is a string with two characters representing two letters. You want to be able to use regular expressions on it, and so on.&lt;/p&gt;
&lt;p&gt;Now, if we read it as a string of bytes, we get 4 of them and the newline, which is not what we want.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s see an example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;#!/usr/bin/env perl&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;strict&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;warnings&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;Data::Dumper::Concise&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;# sample.txt contains ÈČ and a new line&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#038&#34;&gt;open&lt;/span&gt; &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$fh&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;&amp;lt;&amp;#39;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;sample.txt&amp;#39;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;while&lt;/span&gt; (&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$l&lt;/span&gt; = &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;&amp;lt;$fh&amp;gt;&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$l&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;if&lt;/span&gt; (&lt;span style=&#34;color:#369&#34;&gt;$l&lt;/span&gt; =~ &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;m/\w\w/&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;Found two characters\n&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; Dumper(&lt;span style=&#34;color:#369&#34;&gt;$l&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#038&#34;&gt;close&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$fh&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#038&#34;&gt;open&lt;/span&gt; &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$fh&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;&amp;lt;:encoding(UTF-8)&amp;#39;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;sample.txt&amp;#39;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;while&lt;/span&gt; (&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$l&lt;/span&gt; = &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;&amp;lt;$fh&amp;gt;&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$l&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;if&lt;/span&gt; (&lt;span style=&#34;color:#369&#34;&gt;$l&lt;/span&gt; =~ &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;m/\w\w/&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;Found two characters\n&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; Dumper(&lt;span style=&#34;color:#369&#34;&gt;$l&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#038&#34;&gt;close&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$fh&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is the output:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ÈČ
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;#34;\303\210\304\214\n&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Wide character in print at test.pl line 24, &amp;lt;$fh&amp;gt; line 1.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ÈČ
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Found two characters
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;#34;\x{c8}\x{10c}\n&amp;#34;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the first block the file is read verbatim, without any decoding. The regular expression doesn&amp;rsquo;t match: we basically have 4 bytes which don&amp;rsquo;t seem to mean much.&lt;/p&gt;
&lt;p&gt;In the second block we decoded the input, converting it into Perl&amp;rsquo;s internal representation. Now we can use regular expressions and have a consistent approach to text manipulation.&lt;/p&gt;
&lt;p&gt;In the second block, we got a warning:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Wide character in print at test.pl line 24, &amp;lt;$fh&amp;gt; line 1.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That&amp;rsquo;s because we printed something to the screen, but given that the string is now made of characters (decoded for internal use), Perl warns us that we need to encode it back to bytes (for the outside world to consume). A wide character is a character with a code point above 255, which cannot be written out as a single byte and therefore needs to be encoded.&lt;/p&gt;
&lt;p&gt;This can be done either by calling the &lt;code&gt;encode()&lt;/code&gt; function from the &lt;code&gt;Encode&lt;/code&gt; module:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;strict&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;warnings&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;Encode&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; encode(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;UTF-8&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;\x{c8}\x{10c}\n&amp;#34;&lt;/span&gt;);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Or, better, by declaring the global encoding for the standard output:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;strict&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;warnings&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;binmode&lt;/span&gt; &lt;span style=&#34;color:#038&#34;&gt;STDOUT&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;:encoding(UTF-8)&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;\x{c8}\x{10c}\n&amp;#34;&lt;/span&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, the golden rule is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;decode the string on input and get characters out of bytes&lt;/li&gt;
&lt;li&gt;work with it in your program as a string of characters&lt;/li&gt;
&lt;li&gt;encode the string on output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Any other approach is going to lead to double encoded characters (seeing things like Ã and Ä in English text is a clear symptom of this), corrupted text, and confusion.&lt;/p&gt;
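&lt;p&gt;The three steps, and what happens when they are not followed, can be sketched as below. This is only an illustration with the core &lt;code&gt;Encode&lt;/code&gt; module: the input bytes are hard-coded here instead of being read from a file, and the variable names are made up for the example.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;use strict;
use warnings;
use Encode qw(decode encode);

# Input: the four UTF-8 bytes of ÈČ, as they would arrive from outside.
my $input = &amp;#34;\xc3\x88\xc4\x8c&amp;#34;;

# 1. Decode on input: four bytes become two characters.
my $text = decode(&amp;#34;UTF-8&amp;#34;, $input);

# 2. Work on characters: lc() lowercases both letters.
my $lower = lc $text;

# 3. Encode on output: characters become bytes again.
my $output = encode(&amp;#34;UTF-8&amp;#34;, $lower);
print $output, &amp;#34;\n&amp;#34;;

# Breaking the rule: encoding bytes that are already encoded.
# Each byte is treated as a Latin-1 character and encoded again,
# so the four bytes grow to eight and display as garbage.
my $double = encode(&amp;#34;UTF-8&amp;#34;, $input);
printf &amp;#34;%d bytes in, %d characters, %d bytes when double encoded\n&amp;#34;,
    length($input), length($text), length($double);&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Comparing &lt;code&gt;length&lt;/code&gt; before and after each step is a cheap sanity check: a decoded string should report the number of characters, not the number of bytes.&lt;/p&gt;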
&lt;h3 id=&#34;encoding-strategies&#34;&gt;Encoding strategies&lt;/h3&gt;
&lt;p&gt;If you are dealing with standard input/​output on the shell, you should have this in your script:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;binmode STDIN,  &amp;#34;:encoding(UTF-8)&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;binmode STDOUT, &amp;#34;:encoding(UTF-8)&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;binmode STDERR, &amp;#34;:encoding(UTF-8)&amp;#34;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So you&amp;rsquo;re decoding on input and encoding on output automatically.&lt;/p&gt;
&lt;p&gt;For files, you can add the layer in the second argument of &lt;code&gt;open&lt;/code&gt; like in the sample script above, or use a handy module like &lt;code&gt;Path::Tiny&lt;/code&gt;, which provides methods like &lt;code&gt;slurp_utf8&lt;/code&gt; and &lt;code&gt;spew_utf8&lt;/code&gt; to read and write files using the correct encoding.&lt;/p&gt;
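&lt;p&gt;A minimal sketch of the &lt;code&gt;Path::Tiny&lt;/code&gt; route might look like this. It assumes the module is installed from CPAN, and it uses temporary files created on the spot so that it does not depend on any file of yours; in real code you would call &lt;code&gt;path()&lt;/code&gt; with your own file names.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;use strict;
use warnings;
use Path::Tiny;

# Prepare a sample file holding the UTF-8 bytes of ÈČ and a newline.
my $source = Path::Tiny-&amp;gt;tempfile;
$source-&amp;gt;spew_raw(&amp;#34;\xc3\x88\xc4\x8c\n&amp;#34;);

# slurp_utf8 reads the whole file and decodes it into characters.
my $text = $source-&amp;gt;slurp_utf8;

# Work on characters, then let spew_utf8 encode them on the way out.
my $target = Path::Tiny-&amp;gt;tempfile;
$target-&amp;gt;spew_utf8(lc $text);

# Read the result back as raw bytes to see what actually hit the disk.
my $bytes = $target-&amp;gt;slurp_raw;
printf &amp;#34;%d characters read, %d bytes written\n&amp;#34;, length($text), length($bytes);&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The raw variants are used here only to set up and inspect the files; the application code in the middle never touches bytes.&lt;/p&gt;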
&lt;p&gt;Interactions with web frameworks should always happen with the internal Perl representation. When you receive the input from a form, it &lt;em&gt;should be considered already decoded&lt;/em&gt;. It&amp;rsquo;s also the framework&amp;rsquo;s responsibility to handle the encoding on output. Here at End Point we have many Interchange applications. Interchange &lt;em&gt;can&lt;/em&gt; support this, via the &lt;code&gt;MV_UTF8&lt;/code&gt; variable.&lt;/p&gt;
&lt;p&gt;The same rules apply to databases. It&amp;rsquo;s the responsibility of the driver to take your strings and encode/​decode them when talking to the database. E.g. &lt;a href=&#34;https://metacpan.org/pod/DBD::Pg&#34;&gt;DBD::Pg&lt;/a&gt; has the &lt;code&gt;pg_enable_utf8&lt;/code&gt; option, while &lt;a href=&#34;https://metacpan.org/pod/DBD::mysql&#34;&gt;DBD::mysql&lt;/a&gt; has &lt;code&gt;mysql_enable_utf8&lt;/code&gt;. These options should usually be turned on or off explicitly. Not specifying the option is usually a source of confusion because of the heuristic approach it requires for understanding the code.&lt;/p&gt;
&lt;h3 id=&#34;debugging-strategies&#34;&gt;Debugging strategies&lt;/h3&gt;
&lt;p&gt;It may not be the most correct approach, but I&amp;rsquo;ve been using &lt;code&gt;Dumper&lt;/code&gt; for more than a decade and it works. You simply use &lt;code&gt;Data::Dumper&lt;/code&gt; or &lt;code&gt;Data::Dumper::Concise&lt;/code&gt; and call &lt;code&gt;Dumper&lt;/code&gt; on the string you want to examine.&lt;/p&gt;
&lt;p&gt;If you see hexadecimal codepoints like &lt;code&gt;\x{c8}\x{10c}&lt;/code&gt;, it means the string is decoded and you&amp;rsquo;re working with the characters. If you see the raw bytes or characters with diacritics (the latter would happen if the terminal is interpreting the bytes and showing you the characters), you&amp;rsquo;re dealing with an encoded string. If you see weird characters in an English context, it probably means the text has been encoded more than once.&lt;/p&gt;
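&lt;p&gt;The three cases can be reproduced side by side with a small sketch like the one below. It uses the core &lt;code&gt;Data::Dumper&lt;/code&gt; with its &lt;code&gt;Useqq&lt;/code&gt; and &lt;code&gt;Terse&lt;/code&gt; settings turned on, which is an assumption made here to get escaped, compact dumps; the exact formatting may differ slightly from what &lt;code&gt;Data::Dumper::Concise&lt;/code&gt; prints.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode encode);

# Escape non-ASCII data instead of printing it raw, and skip the $VAR1 prefix.
$Data::Dumper::Useqq = 1;
$Data::Dumper::Terse = 1;

my $encoded = &amp;#34;\xc3\x88\xc4\x8c&amp;#34;;            # bytes, as they arrive from outside
my $decoded = decode(&amp;#34;UTF-8&amp;#34;, $encoded);   # characters, for internal use
my $twice   = encode(&amp;#34;UTF-8&amp;#34;, $encoded);   # bytes mistakenly encoded again

# Dump each variant; Dumper output already ends with a newline.
print &amp;#34;encoded: &amp;#34;, Dumper($encoded);
print &amp;#34;decoded: &amp;#34;, Dumper($decoded);
print &amp;#34;twice:   &amp;#34;, Dumper($twice);&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The decoded string should show up as code points in braces, the encoded one as escaped bytes, and the double encoded one as roughly twice as many escaped bytes.&lt;/p&gt;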
&lt;h3 id=&#34;migrate-a-web-application-to-unicode&#34;&gt;Migrate a web application to Unicode&lt;/h3&gt;
&lt;p&gt;If you&amp;rsquo;re still using legacy encoding systems like ISO 8859 or the similar Windows code pages, or worse, you simply don&amp;rsquo;t know and you&amp;rsquo;re relying on the browsers&amp;rsquo; heuristics (they&amp;rsquo;re quite good at guessing), you should switch to handling the input and the output correctly throughout the whole application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Convert the templates from the encoding you are using to UTF-8 (&lt;code&gt;iconv&lt;/code&gt; should do the trick).&lt;/li&gt;
&lt;li&gt;Inspect and possibly convert the existing DB data&lt;/li&gt;
&lt;li&gt;Make sure the DB drivers handle the I/O correctly&lt;/li&gt;
&lt;li&gt;Make sure the web framework is decoding the input and encoding the output&lt;/li&gt;
&lt;li&gt;Make sure the files you read and write are correctly handled&lt;/li&gt;
&lt;li&gt;Clean up any workarounds you may have had in place&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks like a challenging task, and it can be, but it&amp;rsquo;s totally worth it because fancy and well-supported characters nowadays are the norm. Typographical quotes like “this” and ‘this’ are very common and inserted by word processors automatically. So are emojis. People and customers simply expect them to work.&lt;/p&gt;
&lt;h3 id=&#34;band-aids&#34;&gt;Band-aids&lt;/h3&gt;
&lt;p&gt;If your client is on a budget or can&amp;rsquo;t deal with a large upgrade like this one, which has the potential to be disruptive and expose bugs which are lurking around, you can try to downgrade the Unicode characters to ASCII with tools like &lt;a href=&#34;https://metacpan.org/pod/Text::Unidecode&#34;&gt;Text::Unidecode&lt;/a&gt; (which has been ported to other languages as well). So typographical quotes will become the plain ASCII ones, diacritics will be stripped, and various other characters will get their ASCII representation. Not great, but better than dealing with unexpected behavior!&lt;/p&gt;

      </content>
    </entry>
  
    <entry>
      <title>Ecommerce customer names with interesting Unicode characters</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2021/12/ecommerce-customer-names-interesting-unicode/"/>
      <id>https://www.endpointdev.com/blog/2021/12/ecommerce-customer-names-interesting-unicode/</id>
      <published>2021-12-29T00:00:00+00:00</published>
      <author>
        <name>Jon Jensen</name>
      </author>
      <content type="html">
        &lt;p&gt;&lt;img src=&#34;/blog/2021/12/ecommerce-customer-names-interesting-unicode/20210710_085128-sm.jpg&#34; alt=&#34;Photo of a small garden in front of a house basement window with two cats looking out&#34;&gt;&lt;/p&gt;
&lt;!-- photo by Jon Jensen --&gt;
&lt;p&gt;One of our clients with a busy ecommerce site sees a lot of orders, and among those, sometimes there are unusual customer name and address submissions.&lt;/p&gt;
&lt;p&gt;We first noticed in 2015 that they had a customer order come in with an emoji in the name field of the order. The emoji was 😏 and we half-jokingly wondered if that was a new sign of fraud.&lt;/p&gt;
&lt;p&gt;Over the following 3 years only a few more orders came in with various emoji in customers&amp;rsquo; names, but in mid-2018 emoji started to appear increasingly frequently, and now one appears on average every day or two.&lt;/p&gt;
&lt;p&gt;Why the sudden appearance of emoji in 2015? It correlates with the rapid shift to browsing the web and shopping on mobile devices. Mobile visits now represent more than half of this client&amp;rsquo;s ecommerce traffic.&lt;/p&gt;
&lt;p&gt;Most people in 2015 didn&amp;rsquo;t even know how to type emoji on a desktop or laptop computer, but mobile touchscreen keyboards began showing emoji choices around that time, so the mobile explanation makes sense. And mobile keyboard autocorrect also sometimes offers emoji in addition to words, making them even more common in the past few years.&lt;/p&gt;
&lt;p&gt;Just for fun I wanted to automatically find all such &amp;ldquo;interesting&amp;rdquo; names, so I wrote a simple report that uses SQL to query their PostgreSQL ecommerce database.&lt;/p&gt;
&lt;p&gt;To preserve customer privacy, names shown here have been changed and limited to a few names that are common in the United States.&lt;/p&gt;
&lt;h3 id=&#34;real-names-with-bonuses&#34;&gt;Real names with bonuses&lt;/h3&gt;
&lt;p&gt;First let&amp;rsquo;s look at the apparently real names with emoji and other non-alphabetical Unicode characters mixed in:&lt;/p&gt;
&lt;p&gt;Amy 🩺&lt;br&gt;
Amy💕&lt;br&gt;
Amy👑&lt;br&gt;
👸Amy&lt;br&gt;
Anna 💫&lt;br&gt;
Anna 😎&lt;br&gt;
Anna ❤️&lt;br&gt;
Anna💊💉&lt;br&gt;
Bob☯️&lt;br&gt;
Bob😍😘😜👑💞&lt;br&gt;
Bob🐞&lt;br&gt;
Brenda 🌙&lt;br&gt;
Brenda 🥳🤪&lt;br&gt;
Brenda☠️&lt;br&gt;
B R E N D A 💕💅🏾💃🏾👜&lt;br&gt;
Cameron 🌻&lt;br&gt;
Cameron 😎&lt;br&gt;
Cameron 💁🏻‍♀️&lt;br&gt;
Doug 💕&lt;br&gt;
Doug 🅱️&lt;br&gt;
Doug 🌸🤩&lt;br&gt;
Doug 👦🏻❤️&lt;br&gt;
Emily 🌷&lt;br&gt;
Emily 😊🇨🇺&lt;br&gt;
Emily😚&lt;br&gt;
Emily 🎰👑❤️&lt;br&gt;
Frank 🥰&lt;br&gt;
Frank ⚗️&lt;br&gt;
Frank🔆&lt;br&gt;
Frank💞💰&lt;br&gt;
Jane 💕&lt;br&gt;
Jane⁸&lt;br&gt;
Jane ❤️❤️&lt;br&gt;
Jane🔆&lt;br&gt;
Jill 👑🎀&lt;br&gt;
Jill 👼🏽✨🌫&lt;br&gt;
Jill🍓🍒🍒&lt;br&gt;
Jill👸🏽💖&lt;br&gt;
Jim 🐰&lt;br&gt;
Jim, 💕&lt;br&gt;
Jim’s iPhone✨&lt;br&gt;
Joe 🐯&lt;br&gt;
Joe 🇦🇺&lt;br&gt;
Joe😏&lt;br&gt;
Joe🏀🎸‼️&lt;br&gt;
John 👪&lt;br&gt;
John⁵&lt;br&gt;
John🎭&lt;br&gt;
Karen 🌻🌹&lt;br&gt;
Karen 👑✨&lt;br&gt;
Karen 🔒❤️&lt;br&gt;
Karen🎀👑&lt;br&gt;
Kate💙💚&lt;br&gt;
Kate🤪🤞🏽💙&lt;br&gt;
Kate.💘&lt;br&gt;
K$ate💉&lt;br&gt;
Liz🌺&lt;br&gt;
Liz❣️&lt;br&gt;
Liz❤️🙃&lt;br&gt;
Liz Mama💙💙&lt;br&gt;
Mary 🎀&lt;br&gt;
Mary💘&lt;br&gt;
Mary👼🏼💓&lt;br&gt;
Mary⁹&lt;br&gt;
Mike 👑&lt;br&gt;
Mike 👑🌸&lt;br&gt;
Mike 👩🏻‍🌾&lt;br&gt;
Mike⁶&lt;br&gt;
Sarah 👑&lt;br&gt;
Sarah👑A.&lt;br&gt;
✨Sarah✨&lt;br&gt;
💛🌻 Sarah&lt;br&gt;
Steve🐑&lt;br&gt;
Steve 🦁&lt;br&gt;
Steve♓️💓&lt;br&gt;
Victoria 🥴&lt;br&gt;
Victoria🌻&lt;br&gt;
V𝚒𝚌𝚝𝚘𝚛𝚒𝚊&lt;br&gt;
V I C T O R I A 🤍&lt;br&gt;&lt;/p&gt;
&lt;p&gt;Would you have expected all that in ecommerce orders? I didn&amp;rsquo;t!&lt;/p&gt;
&lt;h3 id=&#34;fake-names&#34;&gt;Fake names&lt;/h3&gt;
&lt;p&gt;Next let&amp;rsquo;s look at placeholder names with people&amp;rsquo;s role or self-description or similar:&lt;/p&gt;
&lt;p&gt;Amor ⚽️&lt;br&gt;
Babe ❤️&lt;br&gt;
C𝚒𝚝𝚒𝚣𝚎𝚗&lt;br&gt;
Daddy🥴😏&lt;br&gt;
Daddy😘👴👨🙏👨👩👧🙇🏾&lt;br&gt;
Fly High 🕊&lt;br&gt;
Forever 💍💜&lt;br&gt;
Granny 👵🏽&lt;br&gt;
HOME ❤️&lt;br&gt;
Home🏠&lt;br&gt;
Home🏠💜&lt;br&gt;
Hubby🥰&lt;br&gt;
💦Juicy🍑&lt;br&gt;
me!! 💛&lt;br&gt;
mi amor ❤️&lt;br&gt;
Mom💗&lt;br&gt;
Mom♥️&lt;br&gt;
Mom 🐥💛&lt;br&gt;
MOMMY 💗&lt;br&gt;
Myself 😘&lt;br&gt;
Princess👑&lt;br&gt;
princess❤️&lt;br&gt;
Queen 😍💖🔓&lt;br&gt;
Queen💘&lt;br&gt;
The Husband💍❤️&lt;br&gt;
Wifey 😈✌🏽👅&lt;br&gt;&lt;/p&gt;
&lt;p&gt;Perhaps the occurrence of &amp;ldquo;home&amp;rdquo; several times reflects a mobile address book auto-fill function for billing or shipping address fields?&lt;/p&gt;
&lt;h3 id=&#34;not-names-at-all&#34;&gt;Not names at all!&lt;/h3&gt;
&lt;p&gt;Then there are those customers who didn&amp;rsquo;t provide any kind of name at all, just emoji and other special characters:&lt;/p&gt;
&lt;p&gt;🦋&lt;br&gt;
💙&lt;br&gt;
🤡🎪&lt;br&gt;
💗☁️&lt;br&gt;
🍌&lt;br&gt;
♥️&lt;br&gt;
🌙&lt;br&gt;
⁵&lt;br&gt;
∅&lt;br&gt;&lt;/p&gt;
&lt;p&gt;I guess only one or two of those per year doesn&amp;rsquo;t amount to much, but they&amp;rsquo;re interesting to see.&lt;/p&gt;
&lt;h3 id=&#34;strange-addresses&#34;&gt;Strange addresses&lt;/h3&gt;
&lt;p&gt;In addition to the name fields we also checked the address fields for unusual characters and found (again, details changed to preserve privacy):&lt;/p&gt;
&lt;p&gt;125 E 27😎&lt;br&gt;
227 W 24 Circle ⭕️&lt;br&gt;
3 Blvd. George Washington™&lt;br&gt;&lt;/p&gt;
&lt;h3 id=&#34;simply-odd&#34;&gt;Simply odd&lt;/h3&gt;
&lt;p&gt;The prizewinner for oddity, which seems like some kind of copy-and-paste mistake, is this in the city field of an address:&lt;/p&gt;
&lt;p&gt;Indian® Roadmaster™ Classic&lt;/p&gt;
&lt;p&gt;Maybe at least one motorcycle has achieved sentience and needed to do some online shopping!&lt;/p&gt;
&lt;h3 id=&#34;interesting-unicode-ranges&#34;&gt;&amp;ldquo;Interesting&amp;rdquo; Unicode ranges&lt;/h3&gt;
&lt;p&gt;When searching for interesting Unicode ranges, we could just look for characters in the &lt;a href=&#34;https://unicode.org/emoji/charts/full-emoji-list.html&#34;&gt;Unicode emoji ranges&lt;/a&gt;. That would be fairly straightforward since there are just a few ranges to match.&lt;/p&gt;
&lt;p&gt;But we were curious what other unusual characters were getting used aside from emoji, so we wanted to include other classes of characters. So perhaps we should include everything to start and then exclude the entire class of Unicode &amp;ldquo;word characters&amp;rdquo;? That covers the world&amp;rsquo;s standard characters used for names and addresses, including not just Roman/Latin with optional diacritics, but also other character sets such as Cyrillic, Arabic, Hebrew, Korean, Chinese, Japanese, Devanagari, Thai, and many others.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.postgresql.org/docs/current/functions-matching.html#POSIX-CLASS-SHORTHAND-ESCAPES-TABLE&#34;&gt;PostgreSQL POSIX regular expressions&lt;/a&gt; include the class &amp;ldquo;word character&amp;rdquo; represented by either &lt;code&gt;[[:word:]]&lt;/code&gt; or the Perl shorthand &lt;code&gt;\w&lt;/code&gt;. I started with that, but found it covered too many things I did want to see, such as the visually double-width Latin characters that are part of the Chinese word character range, and some special numbers.&lt;/p&gt;
&lt;p&gt;So I switched back to matching what I want, rather than excluding what I don&amp;rsquo;t want, and I manually went through the &lt;a href=&#34;https://www.unicode.org/charts/&#34;&gt;Unicode code charts&lt;/a&gt; and noted the ranges to include.&lt;/p&gt;
&lt;p&gt;The list of Unicode code ranges I came up with, in hexadecimal, is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;250-2ba
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2bc-2c5
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2cc-2dc
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2de-2ff
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;58d-58e
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;fd5-fd8
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1d00-1dbf
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2070-2079
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;207b-209f
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;20d0-2104
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2106-2115
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2117-215f
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2163-218b
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2190-2211
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2213-266e
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2670-2bff
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2e00-2e7f
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2ff0-2fff
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;3004
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;3012-3013
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;3020
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;3200-33ff
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;4dc0-4dff
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;a000-abf9
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;fff0-fffc
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;fffe-1d35f
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1d360-1d37f
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1d400-1d7ff
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1ec70-1ecbf
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1ed00-1ed4f
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1ee00-1eeff
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1f000-10ffff&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Those ranges exclude several fairly common characters that people (or their software&amp;rsquo;s autocorrect) used in their address fields, which we wanted to ignore, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Music sharp sign: ♯ (instead of # before a number)&lt;/li&gt;
&lt;li&gt;numero: №&lt;/li&gt;
&lt;li&gt;care of: ℅&lt;/li&gt;
&lt;li&gt;replacement character: � (though this could be interesting if it reveals unknown encoding errors)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;the-sql-query&#34;&gt;The SQL query&lt;/h3&gt;
&lt;p&gt;PostgreSQL regular expressions let us write Unicode characters as hexadecimal escapes: either &lt;code&gt;\u&lt;/code&gt; plus exactly 4 digits or &lt;code&gt;\U&lt;/code&gt; plus exactly 8 digits. So the character 2ba is written as &lt;code&gt;\u02ba&lt;/code&gt; and the range fffe-1d35f is written in a Postgres regex range as &lt;code&gt;[\ufffe-\U0001d35f]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;With a little scripting to put it all together, I came up with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;SELECT&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;order_number,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;order_timestamp::&lt;span style=&#34;color:#038&#34;&gt;date&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;AS&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;order_date,&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;fname,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;lname,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;company,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;address1,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;address2,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;city,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;state&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_fname,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_lname,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_company,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_address1,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_address2,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_city,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_state,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;phone&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;FROM&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;orders&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;WHERE&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;concat(fname,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;lname,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;company,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;address1,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;address2,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;city,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;state&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_fname,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_lname,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_company,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_address1,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_address2,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_city,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;b_state,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;phone)&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;~&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;[\u0250-\u02ba\u02bc-\u02c5\u02cc-\u02dc\u02de-\u02ff\u058d-\u058e\u0fd5-\u0fd8\u1d00-\u1dbf\u2070-\u2079\u207b-\u209f\u20d0-\u2104\u2106-\u2115\u2117-\u215f\u2163-\u218b\u2190-\u2211\u2213-\u266e\u2670-\u2bff\u2e00-\u2e7f\u2ff0-\u2fff\u3004\u3012-\u3013\u3020\u3200-\u33ff\u4dc0-\u4dff\ua000-\uabf9\ufff0-\ufffc\ufffe-\U0001d35f\U0001d360-\U0001d37f\U0001d400-\U0001d7ff\U0001ec70-\U0001ecbf\U0001ed00-\U0001ed4f\U0001ee00-\U0001eeff\U0001f000-\U0010ffff]&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;-- limit how many years back to go
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;AND&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;order_timestamp&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&amp;gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;(&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;SELECT&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;-&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#038&#34;&gt;interval&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;3 years&amp;#39;&lt;/span&gt;)&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;-- exclude any order that had PII expunged for GDPR
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;AND&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;expunged_at&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;IS&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;NULL&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;ORDER&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;BY&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;order_timestamp&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;DESC&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
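&lt;p&gt;The scripting step itself isn&amp;rsquo;t shown here. As a rough sketch of how it could work, the Python below turns a few of the hexadecimal ranges into a bracket expression, choosing &lt;code&gt;\u&lt;/code&gt; or &lt;code&gt;\U&lt;/code&gt; by code point size. The names &lt;code&gt;RANGES&lt;/code&gt;, &lt;code&gt;pg_escape&lt;/code&gt;, and &lt;code&gt;pg_char_class&lt;/code&gt; are made up for illustration and are not the script that was actually used.&lt;/p&gt;

```python
import re

# A small sample of the hexadecimal ranges listed earlier (not the full list).
RANGES = ["250-2ba", "3004", "fffe-1d35f", "1f000-10ffff"]


def pg_escape(hex_cp):
    """Return the PostgreSQL regex escape for one hexadecimal code point."""
    cp = int(hex_cp, 16)
    # \u takes exactly 4 hex digits; code points beyond U+FFFF need \U with 8.
    if cp in range(0x10000):
        return f"\\u{cp:04x}"
    return f"\\U{cp:08x}"


def pg_char_class(ranges):
    """Join single code points and start-end pairs into one bracket expression."""
    parts = []
    for item in ranges:
        parts.append("-".join(pg_escape(end) for end in item.split("-")))
    return "[" + "".join(parts) + "]"


pattern = pg_char_class(RANGES)
print(pattern)

# Python's re module accepts the same \u and \U escapes in str patterns,
# so the generated class can be sanity-checked before pasting it into SQL.
matches_emoji = bool(re.search(pattern, "\U0001F600"))
matches_sharp = bool(re.search(pattern, "\u266f"))
print(matches_emoji, matches_sharp)
```

&lt;p&gt;Generating the class from a plain list of ranges keeps the list easy to review and edit, which is harder to do with the long one-line pattern in the query.&lt;/p&gt;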

&lt;p&gt;Try a similar query on databases you have access to, and see what interesting user submissions you discover. I always find surprises. Besides being fun, the results sometimes help us improve our input validation and user guidance so that more mistakes are caught while they&amp;rsquo;re still easy for the customer to correct.&lt;/p&gt;
&lt;p&gt;Happy holidays!&lt;/p&gt;

      </content>
    </entry>
  
    <entry>
      <title>Generating TOTP QR codes as Unicode text from the command line</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2021/10/generating-qr-codes-as-unicode-text/"/>
      <id>https://www.endpointdev.com/blog/2021/10/generating-qr-codes-as-unicode-text/</id>
      <published>2021-10-28T00:00:00+00:00</published>
      <author>
        <name>Bharathi Ponnusamy</name>
      </author>
      <content type="html">
        &lt;p&gt;&lt;img src=&#34;/blog/2021/10/generating-qr-codes-as-unicode-text/banner.jpg&#34; alt=&#34;banner, qr code, Unicode, text, security, console, terminal, command line&#34;&gt;&lt;/p&gt;
&lt;!-- photo by Bharathi Ponnusamy --&gt;
&lt;p&gt;(QR = “Quick Response” — good to know!)&lt;/p&gt;
&lt;p&gt;Python’s QR code generator library &lt;a href=&#34;https://pypi.org/project/qrcode/&#34;&gt;qrcode&lt;/a&gt; generates QR codes from a secret key and outputs to a terminal using Unicode characters, not a PNG graphic as most other libraries do. We can store that in a text file. This is a neat thing to do, but how is this functionality useful?&lt;/p&gt;
&lt;h5 id=&#34;benefits-of-having-unicode-qr-code-as-a-text-file&#34;&gt;Benefits of having Unicode QR code as a text file:&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Storing the QR code as a text file takes less disk space than a PNG image.&lt;/li&gt;
&lt;li&gt;It is easy to read the QR code over ssh using the &lt;code&gt;cat&lt;/code&gt; command; you don&amp;rsquo;t even have to download the file to your own workstation.&lt;/li&gt;
&lt;li&gt;It is simpler to manage QR codes in Git as text files than as PNG images.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This can be used for any kind of QR code, but we have found it especially useful for managing shared multi-factor authentication (MFA, including 2FA for 2-factor authentication) secrets for TOTPs (Time-based One-Time Passwords).&lt;/p&gt;
&lt;h3 id=&#34;multi-factor-authentication-mfa&#34;&gt;Multi-factor authentication (MFA)&lt;/h3&gt;
&lt;p&gt;Many services provide a separate account and login for each user so that accounts do not need to be shared, and thus passwords and multi-factor authentication secrets do not need to be shared either. This is ideal, and what we insist on for our most important accounts.&lt;/p&gt;
&lt;p&gt;Unfortunately, however, some services provide only a single login per account, or only a single primary account login with the other accounts being limited in serious ways (no access to billing, account management, etc.) so that any business relying on them needs to share the access between several authorized users. A single point of failure in an account login is a serious problem when that one person is unavailable.&lt;/p&gt;
&lt;h3 id=&#34;totp-mobile-apps&#34;&gt;TOTP mobile apps&lt;/h3&gt;
&lt;p&gt;There are many good mobile apps for managing TOTP keys and codes, including Aegis, FreeOTP, Google Authenticator, and many others. Look for one that works with no connection to the outside world, so that you won’t be stuck when off internet &amp;amp; data networks.&lt;/p&gt;
&lt;p&gt;Most applications support scanning QR codes with the phone’s camera, or else typing in a secret key to import the accounts.&lt;/p&gt;
&lt;p&gt;For those shared accounts with no option for fully empowered individual user accounts, we can convert secret keys into QR codes for easy sharing and easy imports.&lt;/p&gt;
&lt;p&gt;First, note that you should never use online QR code generators for MFA secrets! You risk exposing your extra authentication factors and defeating the purpose of your extra work.&lt;/p&gt;
&lt;h3 id=&#34;pythons-qrcode-library&#34;&gt;Python’s &amp;lsquo;qrcode&amp;rsquo; library&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;qrcode&lt;/code&gt; Python library provides a &lt;code&gt;qr&lt;/code&gt; executable that can print your QR code using Unicode characters on the console.&lt;/p&gt;
&lt;h4 id=&#34;installation&#34;&gt;Installation&lt;/h4&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;apt install python3-qrcode&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Or visit &lt;a href=&#34;https://pypi.org/project/qrcode/&#34;&gt;https://pypi.org/project/qrcode/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The contents of the QR code are a URL in the format:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;otpauth://totp/{username}?secret={key}&amp;amp;issuer={provider_name}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;provider_name&lt;/code&gt; can contain spaces; however, they need to be URL-encoded and entered as %20 for auth to work correctly on iOS. Otherwise, an invalid barcode error will be shown when adding the code.&lt;/p&gt;
&lt;p&gt;For example, if you generate the QR code with key &lt;code&gt;JSZE5V4676DZFCUCFW4GLPAHEFDNY447&lt;/code&gt; for the account &lt;code&gt;root@example.com&lt;/code&gt;, the resulting command would be:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ qr &amp;#34;otpauth://totp/Example:root@example.com?secret=JSZE5V4676DZFCUCFW4GLPAHEFDNY447&amp;amp;issuer=Superhost&amp;#34; &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here&amp;rsquo;s what its output looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/blog/2021/10/generating-qr-codes-as-unicode-text/qrcode.jpg&#34; alt=&#34;qrcode&#34;&gt;&lt;/p&gt;
&lt;p&gt;Providing the username and issuer will display it properly in the list of configured accounts in your authenticator application. For example: &lt;code&gt;Superhost (Example:root@example.com)&lt;/code&gt;&lt;/p&gt;
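&lt;p&gt;To avoid hand-encoding mistakes such as a literal space in the issuer, the URL can also be assembled programmatically before passing it to &lt;code&gt;qr&lt;/code&gt;. The following is a minimal sketch using only the Python standard library. The helper name &lt;code&gt;totp_uri&lt;/code&gt; and the issuer value &lt;code&gt;Super Host&lt;/code&gt; (with a space) are assumptions for illustration.&lt;/p&gt;

```python
from urllib.parse import quote, urlencode


def totp_uri(account, secret, issuer):
    """Build an otpauth:// URL with spaces encoded as %20 rather than '+'."""
    # Keep ':' and '@' readable in the label; everything else unsafe is escaped.
    label = quote(account, safe=":@")
    # quote_via=quote makes urlencode emit %20 for spaces instead of '+'.
    query = urlencode({"secret": secret, "issuer": issuer}, quote_via=quote)
    return f"otpauth://totp/{label}?{query}"


uri = totp_uri(
    "Example:root@example.com",
    "JSZE5V4676DZFCUCFW4GLPAHEFDNY447",
    "Super Host",
)
print(uri)
```

&lt;p&gt;The printed URL can then be quoted and given to the &lt;code&gt;qr&lt;/code&gt; command in the same way as the hand-written one above.&lt;/p&gt;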
&lt;h3 id=&#34;reference&#34;&gt;Reference&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;See a &lt;a href=&#34;http://www1.auth.iij.jp/smartkey/en/uri_v1.html&#34;&gt;simple explanation of otpauth URI format&lt;/a&gt; used by TOTP.&lt;/li&gt;
&lt;li&gt;See &lt;a href=&#34;https://datatracker.ietf.org/doc/html/rfc6238&#34;&gt;RFC 6238&lt;/a&gt; for full details about TOTP.&lt;/li&gt;
&lt;li&gt;See &lt;a href=&#34;https://datatracker.ietf.org/doc/html/rfc4648#section-6&#34;&gt;RFC 4648&lt;/a&gt; for the base 32 specification used to encode the secret key.&lt;/li&gt;
&lt;li&gt;A recent similar Perl implementation &lt;a href=&#34;https://github.polettix.it/ETOOBUSY/2021/09/26/text-qrcode-unicode/&#34;&gt;Terminal: QR Code with Unicode characters&lt;/a&gt; by Flavio Poletti that builds on &lt;code&gt;Text::QRCode&lt;/code&gt;, which uses &lt;code&gt;libqrencode&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

      </content>
    </entry>
  
    <entry>
      <title>Regular Expression Inconsistencies With Unicode</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2018/01/regular-expression-inconsistencies-with-unicode/"/>
      <id>https://www.endpointdev.com/blog/2018/01/regular-expression-inconsistencies-with-unicode/</id>
      <published>2018-01-23T00:00:00+00:00</published>
      <author>
        <name>Phineas Jensen</name>
      </author>
      <content type="html">
        &lt;p&gt;&lt;img src=&#34;/blog/2018/01/regular-expression-inconsistencies-with-unicode/mud-run.jpg&#34; alt=&#34;A mud run&#34;&gt;&lt;br/&gt;
&lt;small&gt;A casual stroll through the world of Unicode and regular expressions—​&lt;a href=&#34;https://www.flickr.com/photos/presidioofmonterey/7025086135&#34;&gt;Photo&lt;/a&gt; by Presidio of Monterey&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;Character classes in regular expressions are an extremely useful and widespread feature, but there are some relatively recent changes that you might not know of.&lt;/p&gt;
&lt;p&gt;The issue stems from how different programming languages, locales, and character encodings treat predefined character classes. Take, for example, the expression &lt;code&gt;\w&lt;/code&gt; which was introduced in Perl around the year 1990 (along with &lt;code&gt;\d&lt;/code&gt; and &lt;code&gt;\s&lt;/code&gt; and their inverted sets &lt;code&gt;\W&lt;/code&gt;, &lt;code&gt;\D&lt;/code&gt;, and &lt;code&gt;\S&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;\w&lt;/code&gt; shorthand is a character class that matches “word characters” as the C language understands them: &lt;code&gt;[a-zA-Z0-9_]&lt;/code&gt;. At least, that simple fact was true when ASCII was the main player in the character encoding scene. With the standardization of Unicode and UTF-8, the meaning of &lt;code&gt;\w&lt;/code&gt; has become a bit foggier.&lt;/p&gt;
&lt;h4 id=&#34;perl&#34;&gt;Perl&lt;/h4&gt;
&lt;p&gt;Take this example in a recent Perl version:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;5.012&lt;/span&gt;; &lt;span style=&#34;color:#888&#34;&gt;# use 5.012 or higher includes Unicode support&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;utf8&lt;/span&gt;;  &lt;span style=&#34;color:#888&#34;&gt;# necessary for Unicode string literals&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt; =~&lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt; /^\w+$/&lt;/span&gt;; &lt;span style=&#34;color:#888&#34;&gt;# 1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;  =~&lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt; /^\w+$/&lt;/span&gt;; &lt;span style=&#34;color:#888&#34;&gt;# 1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Perl is treating &lt;code&gt;\w&lt;/code&gt; differently here because the characters “اسم” (“ism” meaning “name” in Arabic) definitely don’t fall within &lt;code&gt;[a-zA-Z0-9_]&lt;/code&gt;!&lt;/p&gt;
&lt;p&gt;Beginning with Perl 5.12, released in 2010, character classes are handled differently. Documentation on the topic is found in &lt;a href=&#34;https://perldoc.perl.org/perlrecharclass.html#Backslash-sequences&#34;&gt;perlrecharclass&lt;/a&gt;. The rules aren’t as simple as with some languages, but can be summarized as follows:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;\w&lt;/code&gt; will match Unicode characters with the “Word” property (equivalent to &lt;code&gt;\p{Word}&lt;/code&gt;), unless the &lt;code&gt;/a&lt;/code&gt; (ASCII) flag, added in Perl 5.14, is enabled, in which case it will be equivalent to the original &lt;code&gt;[a-zA-Z0-9_]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Let’s see the &lt;code&gt;/a&lt;/code&gt; flag in action.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;5.012&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;utf8&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt; =~&lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt; /^\w+$/&lt;/span&gt;a; &lt;span style=&#34;color:#888&#34;&gt;# 1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;  =~&lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt; /^\w+$/&lt;/span&gt;a; &lt;span style=&#34;color:#888&#34;&gt;# 0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;However, you should know that for code points below 256, these rules can change depending on whether Unicode or locale rules are on, so if you’re unsure, consult the &lt;a href=&#34;https://perldoc.perl.org/perlre.html&#34;&gt;perlre&lt;/a&gt; and &lt;a href=&#34;https://perldoc.perl.org/perlrecharclass.html&#34;&gt;perlrecharclass&lt;/a&gt; documentation.&lt;/p&gt;
&lt;p&gt;Keep in mind that these same questions of what the character classes include can apply to every predefined character class in whatever language you’re using, so remember to check language-specific implementations for other character class shorthands, such as &lt;code&gt;\s&lt;/code&gt; and &lt;code&gt;\d&lt;/code&gt;, not just &lt;code&gt;\w&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Every language seems to do regular expressions a little bit differently, so here’s a short, incomplete guide for several other languages we use frequently.&lt;/p&gt;
&lt;h4 id=&#34;python&#34;&gt;Python&lt;/h4&gt;
&lt;p&gt;Take this example in Python 3.6.2:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; re.&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;match&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;^\w+$&amp;#39;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;username&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;lt;_sre.SRE_Match &lt;span style=&#34;color:#038&#34;&gt;object&lt;/span&gt;; span=(&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;, &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;8&lt;/span&gt;), &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;match&lt;/span&gt;=&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;username&amp;#39;&lt;/span&gt;&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; re.&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;match&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;^\w+$&amp;#39;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;userاسم&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;lt;_sre.SRE_Match &lt;span style=&#34;color:#038&#34;&gt;object&lt;/span&gt;; span=(&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;, &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;7&lt;/span&gt;), &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;match&lt;/span&gt;=&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;userاسم&amp;#39;&lt;/span&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Python is also treating &lt;code&gt;\w&lt;/code&gt; differently here. Let’s take a look at &lt;a href=&#34;https://docs.python.org/3/library/re.html#regular-expression-syntax&#34;&gt;the Python docs&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;\w&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;For Unicode (str) patterns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For 8-bit (bytes) patterns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.
&lt;/code&gt;&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p&gt;So &lt;code&gt;\w&lt;/code&gt; includes “most characters that can be part of a word in any language, as well as numbers and the underscore”. A list of the characters that includes is difficult to pin down, so it would be best to use the &lt;code&gt;re.ASCII&lt;/code&gt; flag as suggested when you’re unsure if you want letters from other languages matched:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; re.&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;match&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;^\w+$&amp;#39;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;userاسم&amp;#39;&lt;/span&gt;,  flags=re.ASCII)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; re.&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;match&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;r&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;^\w+$&amp;#39;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;username&amp;#39;&lt;/span&gt;, flags=re.ASCII)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;lt;_sre.SRE_Match &lt;span style=&#34;color:#038&#34;&gt;object&lt;/span&gt;; span=(&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;, &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;8&lt;/span&gt;), &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;match&lt;/span&gt;=&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;username&amp;#39;&lt;/span&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
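&lt;p&gt;As noted earlier, the same question applies to the other shorthands. A small sketch of how &lt;code&gt;\d&lt;/code&gt; and &lt;code&gt;\s&lt;/code&gt; behave in Python 3 with and without &lt;code&gt;re.ASCII&lt;/code&gt; follows. The sample strings and variable names are chosen here purely for illustration.&lt;/p&gt;

```python
import re

# Arabic-Indic digits three, four, five (U+0663 to U+0665)
digits = "\u0663\u0664\u0665"
# Two letters separated by a NO-BREAK SPACE (U+00A0)
spaced = "a\u00a0b"

# For str patterns, \d matches any Unicode decimal digit by default.
digits_unicode = bool(re.fullmatch(r"\d+", digits))
digits_ascii = bool(re.fullmatch(r"\d+", digits, flags=re.ASCII))

# Likewise, \s matches Unicode whitespace such as the no-break space.
space_unicode = bool(re.fullmatch(r"a\sb", spaced))
space_ascii = bool(re.fullmatch(r"a\sb", spaced, flags=re.ASCII))

print(digits_unicode, digits_ascii)  # True False
print(space_unicode, space_ascii)    # True False
```

&lt;p&gt;So a validation such as “digits only” can quietly accept non-ASCII digits unless the flag or an explicit &lt;code&gt;[0-9]&lt;/code&gt; is used.&lt;/p&gt;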

&lt;h4 id=&#34;ruby&#34;&gt;Ruby&lt;/h4&gt;
&lt;p&gt;Ruby’s &lt;a href=&#34;https://ruby-doc.org/core-2.5.0/Regexp.html#class-Regexp-label-Character+Classes&#34;&gt;Regexp class&lt;/a&gt; documentation gives a simple and useful explanation: backslash character classes (e.g. &lt;code&gt;\w&lt;/code&gt;, &lt;code&gt;\s&lt;/code&gt;, &lt;code&gt;\d&lt;/code&gt;) are ASCII-only, while POSIX-style bracket expressions (e.g. &lt;code&gt;[[:alnum:]]&lt;/code&gt;) include other Unicode characters.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-ruby&#34; data-lang=&#34;ruby&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;irb(main):&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;001&lt;/span&gt;:&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/^\w+$/&lt;/span&gt;         =~ &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;=&amp;gt; &lt;span style=&#34;color:#080&#34;&gt;nil&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;irb(main):&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;002&lt;/span&gt;:&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/^[[:word:]]+$/&lt;/span&gt; =~ &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;=&amp;gt; &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id=&#34;javascript&#34;&gt;JavaScript&lt;/h4&gt;
&lt;p&gt;JavaScript doesn’t support POSIX-style bracket expressions, and its &lt;code&gt;\w&lt;/code&gt; and &lt;code&gt;\d&lt;/code&gt; classes are simple, straightforward lists of ASCII characters (&lt;code&gt;\s&lt;/code&gt; is the exception: it also matches Unicode whitespace such as the no-break space). &lt;a href=&#34;https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Using_special_characters&#34;&gt;MDN&lt;/a&gt; has simple explanations for each one.&lt;/p&gt;
&lt;p&gt;JavaScript regular expressions do accept a &lt;code&gt;/u&lt;/code&gt; flag, but it does not affect shorthand character classes. Consider these examples in Node.js:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-javascript&#34; data-lang=&#34;javascript&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/^\w+$/&lt;/span&gt;.test(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/^\w+$/&lt;/span&gt;.test(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/^\w+$/u&lt;/span&gt;.test(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/^\w+$/u&lt;/span&gt;.test(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;false&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can see that the &lt;code&gt;/u&lt;/code&gt; flag has no effect on what &lt;code&gt;\w&lt;/code&gt; matches. Now let’s look at Unicode character lengths in JavaScript:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-javascript&#34; data-lang=&#34;javascript&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;❤&amp;#39;&lt;/span&gt;.length
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;👩&amp;#39;&lt;/span&gt;.length
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;🀄️&amp;#39;&lt;/span&gt;.length
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

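&lt;p&gt;JavaScript strings are sequences of UTF-16 code units, and &lt;code&gt;length&lt;/code&gt; counts code units rather than characters. A character outside the BMP (Basic Multilingual Plane) takes two code units, so strings containing such characters will appear to be longer than they are. The third example reports 3 because the emoji is followed by an invisible variation selector, which is a separate code point.&lt;/p&gt;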
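&lt;p&gt;As a minimal sketch of one way around this, iterating a string yields one element per code point, so counting the iterated elements gives the number of code points instead of UTF-16 code units. The helper name &lt;code&gt;codePointLength&lt;/code&gt; is made up for this illustration and is not a built-in:&lt;/p&gt;

```javascript
// Count Unicode code points instead of UTF-16 code units.
// codePointLength is a hypothetical helper name, not a built-in.
function codePointLength(str) {
  // The string iterator yields one element per code point,
  // so a surrogate pair is counted once.
  return Array.from(str).length;
}

const woman = "\u{1F469}";          // WOMAN, outside the BMP
const mahjong = "\u{1F004}\uFE0F";  // MAHJONG TILE RED DRAGON + VARIATION SELECTOR-16

console.log(woman.length, codePointLength(woman));     // 2 1
console.log(mahjong.length, codePointLength(mahjong)); // 3 2
```

&lt;p&gt;Note that this counts code points, not what a reader would perceive as characters: the variation selector in the second string is still counted separately.&lt;/p&gt;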
&lt;p&gt;In regular expressions, the &lt;code&gt;/u&lt;/code&gt; flag accounts for characters outside the BMP by treating each one as a single character. It only corrects character parsing, and does not affect shorthand character classes:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-javascript&#34; data-lang=&#34;javascript&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;let&lt;/span&gt; mystr = &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;hi👩there&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;undefined&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; mystr.length
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;9&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/hi.there/&lt;/span&gt;.test(mystr);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/hi..there/&lt;/span&gt;.test(mystr);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/hi.there/u&lt;/span&gt;.test(mystr);  &lt;span style=&#34;color:#888&#34;&gt;// note the /u from here on&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/hi..there/u&lt;/span&gt;.test(mystr);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; &lt;span style=&#34;color:#080;background-color:#fff0ff&#34;&gt;/hi..there/u&lt;/span&gt;.test(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;hi👩👩there&amp;#34;&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The excellent article &lt;a href=&#34;http://blog.jonnew.com/posts/poo-dot-length-equals-two&#34;&gt;&amp;quot;💩&amp;quot;.length === 2&lt;/a&gt; by Jonathan New goes into detail about why this is, and explores various solutions. It also addresses some legacy inconsistencies, like how the old HEAVY BLACK HEART character and other older Unicode symbols might be represented differently.&lt;/p&gt;
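&lt;p&gt;Since &lt;code&gt;\w&lt;/code&gt; stays ASCII-only even with &lt;code&gt;/u&lt;/code&gt;, one option in engines that implement ES2018 is to spell out the class with Unicode property escapes, which require the &lt;code&gt;/u&lt;/code&gt; flag. The following is only a sketch: the class &lt;code&gt;[\p{L}\p{N}_]&lt;/code&gt; (letters, numbers, and underscore) and the name &lt;code&gt;unicodeWord&lt;/code&gt; are choices made for this illustration, not a standard definition of a Unicode-aware &lt;code&gt;\w&lt;/code&gt;:&lt;/p&gt;

```javascript
// Match "word" characters in any script with Unicode property escapes.
// The u flag is required for \p{...}. The class below is one possible
// definition of a Unicode-aware \w, chosen for illustration.
const unicodeWord = /^[\p{L}\p{N}_]+$/u;

console.log(unicodeWord.test("username"));               // true
console.log(unicodeWord.test("user\u0627\u0633\u0645")); // true ("user" followed by Arabic letters)
console.log(unicodeWord.test("user name"));              // false (a space is neither a letter nor a number)
```

&lt;p&gt;This class deliberately leaves out combining marks (&lt;code&gt;\p{M}&lt;/code&gt;), so text that uses them would need a wider class.&lt;/p&gt;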
&lt;h4 id=&#34;php&#34;&gt;PHP&lt;/h4&gt;
&lt;p&gt;PHP’s documentation explains that &lt;code&gt;\w&lt;/code&gt; matches letters, digits, and the underscore as defined by your locale. It’s not totally clear about how Unicode is treated, but PHP uses the PCRE (Perl Compatible Regular Expressions) library, and the &lt;code&gt;/u&lt;/code&gt; flag can be used to enable Unicode matching in character classes:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-php&#34; data-lang=&#34;php&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;lt;?php
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;echo&lt;/span&gt; preg_match(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;/^&lt;/span&gt;&lt;span style=&#34;color:#04d;background-color:#fff0f0&#34;&gt;\\&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;w+$/&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;), &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#04d;background-color:#fff0f0&#34;&gt;\n&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt;;  &lt;span style=&#34;color:#888&#34;&gt;# 1
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;echo&lt;/span&gt; preg_match(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;/^&lt;/span&gt;&lt;span style=&#34;color:#04d;background-color:#fff0f0&#34;&gt;\\&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;w+$/&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;),  &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#04d;background-color:#fff0f0&#34;&gt;\n&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt;;  &lt;span style=&#34;color:#888&#34;&gt;# 0
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;echo&lt;/span&gt; preg_match(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;/^&lt;/span&gt;&lt;span style=&#34;color:#04d;background-color:#fff0f0&#34;&gt;\\&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;w+$/u&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;), &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#04d;background-color:#fff0f0&#34;&gt;\n&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt;; &lt;span style=&#34;color:#888&#34;&gt;# 1
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;echo&lt;/span&gt; preg_match(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;/^&lt;/span&gt;&lt;span style=&#34;color:#04d;background-color:#fff0f0&#34;&gt;\\&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;w+$/u&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;),  &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#04d;background-color:#fff0f0&#34;&gt;\n&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt;; &lt;span style=&#34;color:#888&#34;&gt;# 1
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id=&#34;net&#34;&gt;.NET&lt;/h4&gt;
&lt;p&gt;The &lt;a href=&#34;https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions&#34;&gt;.NET Quick Reference&lt;/a&gt; has a comprehensive guide to character classes. For word characters, it defines a specific group of Unicode categories including letters, modifiers, and connectors from many languages, but also points out that setting the &lt;a href=&#34;https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-options#ECMAScript&#34;&gt;ECMAScript Matching Behavior&lt;/a&gt; option will limit &lt;code&gt;\w&lt;/code&gt; to &lt;code&gt;[a-zA-Z_0-9]&lt;/code&gt;, among other things. Microsoft’s documentation is clear and comprehensive with great examples, so I recommend referring to it frequently.&lt;/p&gt;
&lt;h4 id=&#34;go&#34;&gt;Go&lt;/h4&gt;
&lt;p&gt;Go follows the regular expression syntax used by &lt;a href=&#34;https://github.com/google/re2/wiki/Syntax&#34;&gt;Google’s RE2 engine&lt;/a&gt;, which has easy syntax for specifying whether you want Unicode characters to be matched or not:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;package&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;main&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;import&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;(&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;fmt&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;regexp&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;)&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;func&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;main&lt;/span&gt;()&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// Perl-style&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;fmt.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;Println&lt;/span&gt;(regexp.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;MatchString&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;`^\w+$`&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;))&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;fmt.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;Println&lt;/span&gt;(regexp.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;MatchString&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;`^\w+$`&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;))&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// POSIX-style&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;fmt.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;Println&lt;/span&gt;(regexp.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;MatchString&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;`^[[:word:]]+$`&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;))&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;fmt.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;Println&lt;/span&gt;(regexp.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;MatchString&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;`^[[:word:]]+$`&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;))&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// Unicode character class&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;fmt.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;Println&lt;/span&gt;(regexp.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;MatchString&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;`^\pL+$`&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;))&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;	&lt;/span&gt;fmt.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;Println&lt;/span&gt;(regexp.&lt;span style=&#34;color:#06b;font-weight:bold&#34;&gt;MatchString&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;`^\pL+$`&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;))&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;// true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can see this code in action &lt;a href=&#34;https://play.golang.org/p/Y0HEhWXgXYa&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;grep&#34;&gt;grep&lt;/h4&gt;
&lt;p&gt;Implementations of grep vary widely across platforms and versions. On my personal computer with GNU grep 3.1, the pattern &lt;code&gt;^\w+$&lt;/code&gt; doesn&amp;rsquo;t match at all with default settings (in basic regular expressions an unescaped &lt;code&gt;+&lt;/code&gt; is a literal character rather than a quantifier), matches only ASCII characters with the &lt;code&gt;-P&lt;/code&gt; (PCRE) option, and matches Unicode characters with &lt;code&gt;-E&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[phin@caballero ~]$ grep    &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;^\w+&lt;/span&gt;$&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt; &amp;lt;(&lt;span style=&#34;color:#038&#34;&gt;echo&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;)  &lt;span style=&#34;color:#888&#34;&gt;# no match&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[phin@caballero ~]$ grep -P &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;^\w+&lt;/span&gt;$&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt; &amp;lt;(&lt;span style=&#34;color:#038&#34;&gt;echo&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;username
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[phin@caballero ~]$ grep -P &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;^\w+&lt;/span&gt;$&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt; &amp;lt;(&lt;span style=&#34;color:#038&#34;&gt;echo&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;)   &lt;span style=&#34;color:#888&#34;&gt;# no match&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[phin@caballero ~]$ grep -E &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;^\w+&lt;/span&gt;$&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt; &amp;lt;(&lt;span style=&#34;color:#038&#34;&gt;echo&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;username&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;username
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[phin@caballero ~]$ grep -E &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;^\w+&lt;/span&gt;$&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;&lt;/span&gt; &amp;lt;(&lt;span style=&#34;color:#038&#34;&gt;echo&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;userاسم&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;userاسم&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Again, implementations vary a lot, so double check on your system before doing anything important.&lt;/p&gt;
&lt;h3 id=&#34;other-links&#34;&gt;Other links&lt;/h3&gt;
&lt;p&gt;As great as Unicode and regular expressions are, their implementations vary widely across languages and tools, and that introduces far more unexpected behavior than I can write about in this post. Whenever you&amp;rsquo;re going to combine Unicode and regular expressions, check the documentation for your language or tool to confirm everything will work as expected.&lt;/p&gt;
&lt;p&gt;Of course, this topic has already been discussed and written about at great length. Here are some links worth checking out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/&#34;&gt;The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)&lt;/a&gt; - This is an oft-referenced article by Joel Spolsky. It was written in 2003 but the wealth of valuable information within is still very relevant and it helps greatly in going from Unicode noob to having a comfortable, useful knowledge of many common issues.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://mathiasbynens.be/notes/es-regexp-proposals&#34;&gt;ECMAScript regular expressions are getting better!&lt;/a&gt; - This article by a V8 developer at Google shows some nice JavaScript regular expression improvements planned for ES2018, including Unicode property escapes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/LuminosoInsight/python-ftfy&#34;&gt;ftfy for Python&lt;/a&gt; - ftfy is a Python library that takes corrupt Unicode text and attempts to fix it as best it can. I haven’t yet had a chance to use it, but the examples are compelling and it’s definitely worth knowing about.&lt;/li&gt;
&lt;/ul&gt;

      </content>
    </entry>
  
    <entry>
      <title>Postgres migrating SQL_ASCII to UTF-8 with fix_latin</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2017/07/postgres-migrating-sqlascii-to-utf-8/"/>
      <id>https://www.endpointdev.com/blog/2017/07/postgres-migrating-sqlascii-to-utf-8/</id>
      <published>2017-07-21T00:00:00+00:00</published>
      <author>
        <name>Greg Sabino Mullane</name>
      </author>
      <content type="html">
        &lt;div class=&#34;separator&#34; style=&#34;clear: both; float:right; text-align: center;&#34;&gt;&lt;a href=&#34;/blog/2017/07/postgres-migrating-sqlascii-to-utf-8/image-0.jpeg&#34; imageanchor=&#34;1&#34; style=&#34;clear: right; margin-bottom: 1em; margin-left: 1em;&#34;&gt;&lt;img border=&#34;0&#34; data-original-height=&#34;395&#34; data-original-width=&#34;500&#34; src=&#34;/blog/2017/07/postgres-migrating-sqlascii-to-utf-8/image-0.jpeg&#34;/&gt;&lt;/a&gt;&lt;br/&gt;&lt;small&gt;(&lt;a href=&#34;https://flic.kr/p/fZRp6G&#34;&gt;photograph&lt;/a&gt; by &lt;a href=&#34;https://www.flickr.com/people/usoceangov/&#34;&gt;NOAA National Ocean Service&lt;/a&gt;)&lt;/small&gt;&lt;/div&gt;
&lt;p&gt;Upgrading &lt;a href=&#34;https://www.postgresql.org/&#34;&gt;Postgres&lt;/a&gt; is not quite as painful as it used to be, thanks
primarily to the &lt;a href=&#34;https://www.postgresql.org/docs/current/static/pgupgrade.html&#34;&gt;pg_upgrade program&lt;/a&gt;, but there are times when it simply cannot be used.
We recently had an existing End Point client come to us requesting help upgrading from their current
Postgres database (version 9.2) to the latest version (9.6—​but soon to be 10). They also wanted
to finally move away from their SQL_ASCII encoding to &lt;a href=&#34;https://en.wikipedia.org/wiki/UTF-8&#34;&gt;UTF-8&lt;/a&gt;. As this meant
that pg_upgrade could not be used, we took the opportunity to enable
checksums as well (this change cannot be done via pg_upgrade). Finally, they
were moving their database server to new hardware. There were many lessons learned and bumps
along the way for this migration, but for this
post I’d like to focus on one of the most vexing problems, the database encoding.&lt;/p&gt;
&lt;p&gt;When a Postgres database is created, it is set to a specific encoding. The most
common one (and the default) is “UTF8”. This covers
99% of all users’ needs. The second most common one is the
poorly-named “SQL_ASCII” encoding, which should be named
“DANGER_DO_NOT_USE_THIS_ENCODING”, because it causes nothing but trouble.
The SQL_ASCII encoding basically means no encoding at all, and simply stores
any bytes you throw at it. This usually means the database ends up containing a
whole mess of different encodings, creating a “byte soup” that will be
difficult to sanitize by moving to a real encoding (i.e. UTF-8).&lt;/p&gt;
&lt;p&gt;Many tools exist which convert text from one encoding to another. One of the
most popular ones on Unix boxes is “iconv”. Although this program works great
if your source text is using one encoding, it fails when it encounters
byte soup.&lt;/p&gt;
&lt;p&gt;For this migration, we first did a &lt;a href=&#34;https://www.postgresql.org/docs/current/static/app-pgdump.html&#34;&gt;pg_dump&lt;/a&gt; from the old database to
a newly created UTF-8 test database, just to see which tables had encoding problems.
Quite a few did—​but not all of them!—​so we wrote a script to import tables
in parallel, with some filtering for the problem ones. As mentioned above,
iconv was not particularly helpful: looking at the tables closely showed
evidence of many different encodings in each one: Windows-1252, ISO-8859-1, Japanese,
Greek, and many others. There were even large bits that were plainly
binary data (e.g. images) that simply got shoved into a text field somehow.
This is the big problem with SQL_ASCII: it accepts &lt;em&gt;everything&lt;/em&gt;, and does no
validation whatsoever. The iconv program simply could not handle these tables,
even when adding the //IGNORE option.&lt;/p&gt;
&lt;p&gt;To better explain the problem and the solution, let’s create a small text
file with a jumble of encodings. Discussions of how UTF-8 represents
characters, and its interactions with Unicode, are avoided here, as
Unicode is a dense, complex subject, and this article is dry enough already. :)&lt;/p&gt;
&lt;p&gt;First, we want to add some items using the encodings ‘Windows-1252’ and ‘Latin-1’. These encoding
systems were attempts to extend the basic ASCII character set to include more characters. As these encodings
pre-date the invention of UTF-8, they do it in a very inelegant (and incompatible)
way. Use of the “echo” command is a great way to add arbitrary bytes to a file as it
allows direct hex input:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ echo -e &amp;#34;[Windows-1252]   Euro: \x80   Double dagger: \x87&amp;#34; &amp;gt; sample.data
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ echo -e &amp;#34;[Latin-1]   Yen: \xa5   Half: \xbd&amp;#34; &amp;gt;&amp;gt; sample.data
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ echo -e &amp;#34;[Japanese]   Ship: \xe8\x88\xb9&amp;#34; &amp;gt;&amp;gt; sample.data
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ echo -e &amp;#34;[Invalid UTF-8]  Blob: \xf4\xa5\xa3\xa5&amp;#34; &amp;gt;&amp;gt; sample.data&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This file looks ugly. Notice all the “wrong” characters when we simply view the file directly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ cat sample.data
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Windows-1252]   Euro: �   Double dagger: �
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Latin-1]   Yen: �   Half: �
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Japanese]   Ship: 船
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Invalid UTF-8]  Blob: ����&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Running iconv is of little help:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## With no source encoding given, it errors on the Euro:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ iconv -t utf8 sample.data &amp;gt;/dev/null
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;iconv: illegal input sequence at position 23
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## We can tell it to ignore those errors, but it still barfs on the blob:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ iconv -t utf8//ignore sample.data &amp;gt;/dev/null
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;iconv: illegal input sequence at position 123
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Telling it the source is Windows-1252 fixes some things, but still sinks the Ship:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ iconv -f windows-1252 -t utf8//ignore sample.data
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Windows-1252]   Euro: €   Double dagger: ‡
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Latin-1]   Yen: ¥   Half: ½
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Japanese]   Ship: èˆ¹
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Invalid UTF-8]  Blob: ô¥£¥&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After testing a few other tools, we discovered the nifty &lt;a href=&#34;https://metacpan.org/pod/Encoding::FixLatin&#34;&gt;Encoding::FixLatin&lt;/a&gt;, a Perl module which provides a command-line program called “fix_latin”. Rather than being authoritative like iconv, it tries its best to fix things up with educated guesses. Its documentation gives a good summary:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The script acts as a filter, taking source data which may contain a mix of
ASCII, UTF8, ISO8859-1 and CP1252 characters, and producing output will be
all ASCII/UTF8.&lt;/p&gt;
&lt;p&gt;Multi-byte UTF8 characters will be passed through unchanged (although
over-long UTF8 byte sequences will be converted to the shortest normal
form). Single byte characters will be converted as follows:&lt;/p&gt;
&lt;p&gt;0x00 - 0x7F   ASCII - passed through unchanged&lt;br&gt;
0x80 - 0x9F   Converted to UTF8 using CP1252 mappings&lt;br&gt;
0xA0 - 0xFF   Converted to UTF8 using Latin-1 mappings&lt;/p&gt;&lt;/blockquote&gt;
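&lt;p&gt;The byte-range table above can be sketched in a few lines of Perl. This is only an illustration built on the core Encode module, not the module’s own code: the helper name single_byte_to_utf8 is made up for this example, and it assumes the caller has already decided that the byte is not part of a multi-byte UTF-8 sequence.&lt;/p&gt;

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Encoding::FixLatin): convert one stray
# single byte to UTF-8, following the three ranges in the table above.
sub single_byte_to_utf8 {
    my ($byte) = @_;
    my $code = ord $byte;

    # 0x00 - 0x7F: plain ASCII, passed through unchanged
    return $byte if 0x7F >= $code;

    # 0x80 - 0x9F: CP1252 mappings; 0xA0 - 0xFF: Latin-1 mappings
    my $source = 0x9F >= $code ? 'cp1252' : 'iso-8859-1';
    return encode('UTF-8', decode($source, $byte));
}

# The Euro sign from the sample file: one CP1252 byte in, three UTF-8 bytes out
print join(' ', map { sprintf('%02X', ord($_)) } split //, single_byte_to_utf8("\x80")), "\n";   # prints E2 82 AC
```

&lt;p&gt;The same call turns the Latin-1 Yen byte (0xA5) into the two UTF-8 bytes C2 A5, which is why the first two lines of the sample file come out clean.&lt;/p&gt;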
&lt;p&gt;While this works great for fixing the Windows-1252 and Latin-1 problems (and
thus accounted for at least 95% of our tables’ bad encodings), it still allows
“invalid” UTF-8 to pass on through, which means that Postgres will still refuse
to accept it. Let’s check our test file:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ fix_latin sample.data
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Windows-1252]   Euro: €   Double dagger: ‡
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Latin-1]   Yen: ¥   Half: ½
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Japanese]   Ship: 船
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[Invalid UTF-8]  Blob: ����
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Postgres will refuse to import that last part:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ echo &amp;#34;SELECT E&amp;#39;&amp;#34;  &amp;#34;$(fix_latin sample.data)&amp;#34;  &amp;#34;&amp;#39;;&amp;#34; | psql
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ERROR:  invalid byte sequence for encoding &amp;#34;UTF8&amp;#34;: 0xf4 0xa5 0xa3 0xa5
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Even adding iconv is of no help:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ echo &amp;#34;SELECT E&amp;#39;&amp;#34;  &amp;#34;$(fix_latin sample.data | iconv -t utf-8)&amp;#34;  &amp;#34;&amp;#39;;&amp;#34; | psql
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ERROR:  invalid byte sequence for encoding &amp;#34;UTF8&amp;#34;: 0xf4 0xa5 0xa3 0xa5&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href=&#34;https://tools.ietf.org/html/rfc3629&#34;&gt;UTF-8 specification&lt;/a&gt; is rather dense and puts many requirements on
encoders and decoders. How well programs implement these requirements (and optional
bits) varies, of course. But at the end of the day, we needed that data to go
into a UTF-8 encoded Postgres database without complaint. When in doubt, go
to the source! The relevant file in the Postgres source code responsible for
rejecting bad UTF-8 (as in the examples above) is src/backend/utils/mb/wchar.c.
Analyzing that file shows a small but elegant piece of code whose job is
to ensure only “legal” UTF-8 is accepted:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;bool
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;pg_utf8_islegal(const unsigned char *source, int length)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  unsigned char a;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  switch (length)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    default:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      /* reject lengths 5 and 6 for now */
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    case 4:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      a = source[3];
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      if (a &amp;lt; 0x80 || a &amp;gt; 0xBF)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      /* FALL THRU */
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    case 3:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      a = source[2];
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      if (a &amp;lt; 0x80 || a &amp;gt; 0xBF)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      /* FALL THRU */
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    case 2:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      a = source[1];
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      switch (*source)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        case 0xE0:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          if (a &amp;lt; 0xA0 || a &amp;gt; 0xBF)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          break;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        case 0xED:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          if (a &amp;lt; 0x80 || a &amp;gt; 0x9F)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          break;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        case 0xF0:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          if (a &amp;lt; 0x90 || a &amp;gt; 0xBF)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          break;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        case 0xF4:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          if (a &amp;lt; 0x80 || a &amp;gt; 0x8F)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          break;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        default:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          if (a &amp;lt; 0x80 || a &amp;gt; 0xBF)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          break;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      /* FALL THRU */
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    case 1:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      a = *source;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      if (a &amp;gt;= 0x80 &amp;amp;&amp;amp; a &amp;lt; 0xC2)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      if (a &amp;gt; 0xF4)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        return false;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      break;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  return true;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now that we know the UTF-8 rules for Postgres, how do we ensure our data follows them?
While we could have made another standalone filter to run after fix_latin, that would
increase the migration time. So I made a quick patch to the fix_latin program itself, rewriting
that C logic in Perl. A new option “--strict-utf8” was added. Its job is to simply enforce the
rules found in the Postgres source code. If a character is invalid, it is replaced with
a question mark (there are other choices for a replacement character, but we decided simple
question marks were quick and easy, and the surrounding data was unlikely to be read or even used anyway).&lt;/p&gt;
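&lt;p&gt;The actual patch is not reproduced here, but the idea can be sketched as follows. This is a minimal standalone version under stated assumptions, not the fix_latin code itself: the byte ranges are copied from the C function above, the names $valid_utf8 and strict_utf8 are invented for this example, and one question mark is emitted per rejected byte (an assumption chosen to match the sample output).&lt;/p&gt;

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, using the same byte ranges that
# pg_utf8_islegal checks (lengths 1 to 4, no overlongs, no surrogates).
my $valid_utf8 = qr/
      [\x00-\x7F]
    | [\xC2-\xDF] [\x80-\xBF]
    | \xE0 [\xA0-\xBF] [\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]{2}
    | \xED [\x80-\x9F] [\x80-\xBF]
    | \xF0 [\x90-\xBF] [\x80-\xBF]{2}
    | [\xF1-\xF3] [\x80-\xBF]{3}
    | \xF4 [\x80-\x8F] [\x80-\xBF]{2}
/x;

# Hypothetical helper: keep every legal sequence and replace any byte
# that cannot start one with a question mark. Expects a byte string.
sub strict_utf8 {
    my ($bytes) = @_;
    $bytes =~ s{($valid_utf8)|(.)}{ defined $1 ? $1 : '?' }gse;
    return $bytes;
}

print strict_utf8("Blob: \xf4\xa5\xa3\xa5"), "\n";   # prints Blob: ????
```

&lt;p&gt;The lead byte 0xF4 must be followed by a byte in the range 0x80 to 0x8F, so the 0xA5 in the sample blob fails the check and all four bytes are replaced, while the three bytes of the Japanese character pass through untouched.&lt;/p&gt;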
&lt;p&gt;Voila! All of the data was now going into Postgres without a problem. Observe:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ echo &amp;#34;SELECT E&amp;#39;&amp;#34;  &amp;#34;$(fix_latin  sample.data)&amp;#34;  &amp;#34;&amp;#39;;&amp;#34; | psql
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ERROR:  invalid byte sequence for encoding &amp;#34;UTF8&amp;#34;: 0xf4 0xa5 0xa3 0xa5
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ echo &amp;#34;SELECT E&amp;#39;&amp;#34;  &amp;#34;$(fix_latin --strict-utf8 sample.data)&amp;#34;  &amp;#34;&amp;#39;;&amp;#34; | psql
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                   ?column?
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;----------------------------------------------
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  [Windows-1252]   Euro: €   Double dagger: ‡+
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; [Latin-1]   Yen: ¥   Half: ½                +
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; [Japanese]   Ship: 船                       +
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; [Invalid UTF-8]  Blob: ????
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;(1 row)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;What are the lessons here? First and foremost, &lt;em&gt;&lt;strong&gt;never&lt;/strong&gt;&lt;/em&gt; use SQL_ASCII. It’s outdated,
dangerous, and will cause much pain down the road. Second, there are an amazing number
of client encodings in use, especially for old data, but the world has pretty much standardized
on UTF-8 these days, so even if you are stuck with SQL_ASCII, the amount of Windows-1252 and
other monstrosities will be small. Third, don’t be afraid to go to the source. If Postgres
is rejecting your data, it’s probably for a very good reason, so find out exactly why.
There were other challenges to overcome in this migration, but the encoding was certainly
one of the most interesting ones. Everyone, the client and us, is very happy to finally
have everything using UTF-8!&lt;/p&gt;

      </content>
    </entry>
  
    <entry>
      <title>Bucardo, and Coping with Unicode</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2014/03/bucardo-and-coping-with-unicode/"/>
      <id>https://www.endpointdev.com/blog/2014/03/bucardo-and-coping-with-unicode/</id>
      <published>2014-03-12T00:00:00+00:00</published>
      <author>
        <name>Josh Tolley</name>
      </author>
      <content type="html">
        &lt;p&gt;Given the &lt;a href=&#34;/blog/2014/02/dbdpg-utf-8-perl-postgresql/&#34;&gt;recent DBD::Pg 3.0.0 release&lt;/a&gt;, with its improved Unicode support, it seemed like a good time to work on a &lt;a href=&#34;https://github.com/bucardo/bucardo/issues/47&#34;&gt;Bucardo bug&lt;/a&gt; we’ve wanted fixed for a while. Although &lt;a href=&#34;https://bucardo.org&#34;&gt;Bucardo&lt;/a&gt; will replicate Unicode data without a problem, it runs into difficulties when table or column names in the database include non-ASCII characters. Teaching Bucardo to handle Unicode data has been an interesting exercise.&lt;/p&gt;
&lt;p&gt;Without information about its encoding, string data at its heart is meaningless. Programs that exchange string information without paying attention to the encoding end up with problems exactly like that described in the bug, with nonsense characters all over. Further, it’s impossible even to compare two different strings reliably. So not only would Bucardo’s logs and program output contain junk data, Bucardo would simply fail to find database objects that clearly existed, because it would end up querying for the wrong object name, or the keys of the hashes it uses internally would be meaningless. Even communication between different Bucardo processes needs to be decoded correctly. The recent DBD::Pg 3.0.0 release takes care of decoding strings sent from PostgreSQL, but other inputs, such as command-line arguments, must be treated individually. All output handles, such as STDOUT, STDERR, and the log file output, must be told to expect data in a particular encoding to ensure their output is handled correctly.&lt;/p&gt;
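&lt;p&gt;The comparison problem is easy to reproduce. The following is a small sketch, not Bucardo code: the variable names are made up, and it assumes the incoming bytes are UTF-8 and decodes them with the core Encode module.&lt;/p&gt;

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for raw bytes handed to Perl by a shell command or
# @ARGV: "test_b" followed by u-umlaut, c-cedilla and a-acute as UTF-8.
my $raw_bytes = "test_b\xc3\xbc\xc3\xa7\xc3\xa1";

# Decoding turns the bytes into characters, so length and eq behave as expected
my $name = decode('UTF-8', $raw_bytes);

print length($raw_bytes), "\n";   # prints 12 (bytes)
print length($name), "\n";        # prints 9 (characters)
print $name eq "test_b\x{fc}\x{e7}\x{e1}" ? "match\n" : "no match\n";   # prints match
```

&lt;p&gt;The undecoded string would not compare equal to the same name written as characters, which is exactly how a lookup for an object that clearly exists can come back empty.&lt;/p&gt;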
&lt;p&gt;The first step is to build a test case. Bucardo’s test suite is quite comprehensive, and easy to use. For starters, I’ll make a simple test that just creates a table, and tries to tell Bucardo about it. The test suite will already create databases and install Bucardo for me; I can talk to those databases with handles $dbhA and $dbhB. Note that in this case, although the table and primary key names contain non-ASCII characters, the relgroup and sync names do not. That will require further programming. The character in the primary key name, incidentally, is a &lt;a href=&#34;https://en.wikipedia.org/wiki/Rod_of_Asclepius&#34;&gt;staff of Aesculapius&lt;/a&gt;, which I don’t recommend people include in the name of a typical primary key.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$dbh&lt;/span&gt; ((&lt;span style=&#34;color:#369&#34;&gt;$dbhA&lt;/span&gt;, &lt;span style=&#34;color:#369&#34;&gt;$dbhB&lt;/span&gt;)) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#369&#34;&gt;$dbh&lt;/span&gt;-&amp;gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;do&lt;/span&gt;(&lt;span style=&#34;color:#2b2;background-color:#f0fff0&#34;&gt;qq/CREATE TABLE test_büçárđo ( pkey_\x{2695} INTEGER PRIMARY KEY, data TEXT );/&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#369&#34;&gt;$dbh&lt;/span&gt;-&amp;gt;commit;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;like &lt;span style=&#34;color:#369&#34;&gt;$bct&lt;/span&gt;-&amp;gt;ctl(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;bucardo add table test_büçárđo db=A relgroup=unicode&amp;#39;&lt;/span&gt;),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#2b2;background-color:#f0fff0&#34;&gt;qr/Added the following tables/&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;Added table in db A&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;like(&lt;span style=&#34;color:#369&#34;&gt;$bct&lt;/span&gt;-&amp;gt;ctl(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;bucardo add sync test_unicode relgroup=unicode dbs=A:source,B:target&amp;#34;&lt;/span&gt;),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#2b2;background-color:#f0fff0&#34;&gt;qr/Added sync &amp;#34;test_unicode&amp;#34;/&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;Create sync from A to B&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080&#34;&gt;or&lt;/span&gt; BAIL_OUT &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;Failed to add test_unicode sync&amp;#34;&lt;/span&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Having created database objects and configured Bucardo, the next part of the test starts Bucardo, inserts some data into the master database “A”, and tries to replicate it:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#369&#34;&gt;$dbhA&lt;/span&gt;-&amp;gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;do&lt;/span&gt;(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;INSERT INTO test_büçárđo (pkey_\x{2695}, data) VALUES (1, &amp;#39;Something&amp;#39;)&amp;#34;&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#369&#34;&gt;$dbhA&lt;/span&gt;-&amp;gt;commit;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;## Get Bucardo going&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#369&#34;&gt;$bct&lt;/span&gt;-&amp;gt;restart_bucardo(&lt;span style=&#34;color:#369&#34;&gt;$dbhX&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;## Kick off the sync.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$timer_regex&lt;/span&gt; = &lt;span style=&#34;color:#2b2;background-color:#f0fff0&#34;&gt;qr/\[0\s*s\]\s+(?:[\b]{6}\[\d+\s*s\]\s+)*/&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;like &lt;span style=&#34;color:#369&#34;&gt;$bct&lt;/span&gt;-&amp;gt;ctl(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;kick sync test_unicode 0&amp;#39;&lt;/span&gt;),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#2b2;background-color:#f0fff0&#34;&gt;qr/^Kick\s+test_unicode:\s+${timer_regex}DONE!/&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;Kick test_unicode&amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#080&#34;&gt;or&lt;/span&gt; &lt;span style=&#34;color:#038&#34;&gt;die&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;Sync failed, no point continuing&amp;#39;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$res&lt;/span&gt; = &lt;span style=&#34;color:#369&#34;&gt;$dbhB&lt;/span&gt;-&amp;gt;selectall_arrayref(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;SELECT * FROM test_büçárđo&amp;#39;&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ok(&lt;span style=&#34;color:#369&#34;&gt;$#$res&lt;/span&gt; == &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt; &amp;amp;&amp;amp; &lt;span style=&#34;color:#369&#34;&gt;$res&lt;/span&gt;-&amp;gt;[&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;][&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;] == &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;1&lt;/span&gt; &amp;amp;&amp;amp; &lt;span style=&#34;color:#369&#34;&gt;$res&lt;/span&gt;-&amp;gt;[&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;0&lt;/span&gt;][&lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;1&lt;/span&gt;] &lt;span style=&#34;color:#080&#34;&gt;eq&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;Something&amp;#39;&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;Replication worked&amp;#39;&lt;/span&gt;);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Given that DBD::Pg handles the encodings for strings from the database, I only need to make a few other changes. I added a few lines to the preamble of some files, to deal with UTF8 elements in the code itself, and to tell input and output pipes to expect UTF8 data.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;utf8&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;open&lt;/span&gt; &lt;span style=&#34;color:#2b2;background-color:#f0fff0&#34;&gt;qw( :std :utf8 )&lt;/span&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In some cases, I also had to add a couple more modules, and explicitly decode incoming values. For instance, the test suite repeatedly runs shell commands to configure and manage test instances of Bucardo. There, too, the output needs to be decoded correctly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    debug(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;Script: $ctl Connection options: $connopts Args: $args&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;3&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;-   &lt;span style=&#34;color:#369&#34;&gt;$info&lt;/span&gt; = &lt;span style=&#34;color:#2b2;background-color:#f0fff0&#34;&gt;qx{$ctl $connopts $args 2&amp;gt;&amp;amp;1}&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;+   &lt;span style=&#34;color:#369&#34;&gt;$info&lt;/span&gt; = decode( locale =&amp;gt; &lt;span style=&#34;color:#2b2;background-color:#f0fff0&#34;&gt;qx{$ctl $connopts $args 2&amp;gt;&amp;amp;1}&lt;/span&gt; );
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    debug(&lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;Exit value: $?&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;3&lt;/span&gt;);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And with that, now Bucardo accepts non-ASCII table names.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;[~/devel/bucardo]$ prove t/10-object-names.t 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;t/10-object-names.t .. ok     
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;All tests successful.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Files=1, Tests=20, 24 wallclock secs ( 0.01 usr  0.01 sys +  2.01 cusr  0.22 csys =  2.25 CPU)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Result: PASS&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


      </content>
    </entry>
  
    <entry>
      <title>DBD::Pg 3.0.0 and the utf8 flag</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2014/02/dbdpg-utf-8-perl-postgresql/"/>
      <id>https://www.endpointdev.com/blog/2014/02/dbdpg-utf-8-perl-postgresql/</id>
      <published>2014-02-19T00:00:00+00:00</published>
      <author>
        <name>Greg Sabino Mullane</name>
      </author>
      <content type="html">
        &lt;p&gt;One of the major changes in the recently released &lt;a href=&#34;/blog/2014/02/perl-postgresql-driver-dbdpg-300/&#34;&gt;3.0 version of DBD::Pg (the Perl driver for PostgreSQL)&lt;/a&gt; was the handling of UTF-8 strings. Previously, you had to make sure to always set the mysterious “pg_enable_utf8” attribute. Now, everything should simply work as expected without any adjustments.&lt;/p&gt;
&lt;p&gt;When using an older DBD::Pg (version 2.x), any data coming back from the database was treated as a plain old string. Perl strings have an internal flag called “utf8” that tells Perl that the string should be treated as containing UTF-8. The only way to get this flag turned on was to set the &lt;strong&gt;pg_enable_utf8&lt;/strong&gt; attribute to true before fetching your data from the database. When this attribute was on, each returned string was scanned for high-bit characters, and if any were found, the utf8 flag was set on the string. The Postgres server_encoding and client_encoding values were never consulted, so this one attribute was the only knob available. Here is a sample program we will use to examine the returned strings. The handy &lt;a href=&#34;http://search.cpan.org/~hmbrand/Data-Peek/Peek.pm&#34;&gt;Data::Peek module&lt;/a&gt; will help us see if the string has the utf8 flag enabled.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;#!perl
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use strict;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use warnings;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use utf8;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use charnames &amp;#39;:full&amp;#39;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use DBI;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use Data::Peek;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use lib &amp;#39;blib/lib&amp;#39;, &amp;#39;blib/arch&amp;#39;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Do our best to represent the output faithfully
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;binmode STDOUT, &amp;#34;:encoding(utf8)&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;my $DSN = &amp;#39;DBI:Pg:dbname=postgres&amp;#39;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;my $dbh = DBI-&amp;gt;connect($DSN, &amp;#39;&amp;#39;, &amp;#39;&amp;#39;, {AutoCommit=&amp;gt;0,RaiseError=&amp;gt;1,PrintError=&amp;gt;0})
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  or die &amp;#34;Connection failed!\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print &amp;#34;DBI is version $DBI::VERSION, DBD::Pg is version $DBD::Pg::VERSION\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Create some Unicode strings (perl strings with the utf8 flag enabled)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;my %dm = (
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    dotty  =&amp;gt; &amp;#34;\N{CADUCEUS}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    chilly =&amp;gt; &amp;#34;\N{SNOWMAN}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    stuffy =&amp;gt; &amp;#34;\N{DRAGON}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    lambie =&amp;gt; &amp;#34;\N{SHEEP}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Show the strings both before and after a trip to the database
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;for my $x (sort keys %dm) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print &amp;#34;\nSending $x ($dm{$x}) to the database. Length is &amp;#34; . length($dm{$x}) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    my $SQL = qq{SELECT &amp;#39;$dm{$x}&amp;#39;::TEXT};
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    my $var = $dbh-&amp;gt;selectall_arrayref($SQL)-&amp;gt;[0][0];
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print &amp;#34;Database gave us back ($var) with a length of &amp;#34; . length($var) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print DPeek $var;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let’s check out an older version of DBD::Pg and run the script:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ cd dbdpg.git; git checkout 2.18.1; perl Makefile.PL; make
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ perl dbdpg_unicode_test.pl
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;DBI is version 1.628, DBD::Pg is version 2.18.1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Sending chilly (☃) to the database. Length is 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Database gave us back (â) with a length of 3
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;PV(&amp;#34;\342\230\203&amp;#34;\0)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Sending dotty (☤) to the database. Length is 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Database gave us back (â¤) with a length of 3
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;PV(&amp;#34;\342\230\244&amp;#34;\0)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Sending lambie (🐑) to the database. Length is 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Database gave us back (ð) with a length of 4
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;PV(&amp;#34;\360\237\220\221&amp;#34;\0)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Sending stuffy (🐉) to the database. Length is 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Database gave us back (ð) with a length of 4
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;PV(&amp;#34;\360\237\220\211&amp;#34;\0)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first thing you may notice is that not all of the Unicode symbols appear as expected. They should be tiny but legible versions of a snowman, a caduceus, a sheep, and a dragon. The fact that they do not appear properly everywhere indicates we have a way to go before the world is Unicode ready. When writing this, only chilly and dotty appeared correctly on my terminal. The blog editing textarea showed chilly, dotty, and lambie. The final blog in Chrome showed only chilly and dotty! Obviously, your mileage may vary, but all of those are legitimate Unicode characters.&lt;/p&gt;
&lt;p&gt;The second thing to notice is how badly the length of the string is computed once it comes back from the database. Each string is one character long, and goes in that way, but comes back longer. This means the utf8 flag is off, which is confirmed by the lack of a UTF8 section in the DPeek output. We can get the correct output by setting the pg_enable_utf8 attribute after connecting, like so:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;my $dbh = DBI-&amp;gt;connect($DSN, &amp;#39;&amp;#39;, &amp;#39;&amp;#39;, {AutoCommit=&amp;gt;0,RaiseError=&amp;gt;1,PrintError=&amp;gt;0})
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  or die &amp;#34;Connection failed!\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Needed for older versions of DBD::Pg.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## This is the same as setting it to 1 for DBD::Pg 2.x - see below
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$dbh-&amp;gt;{pg_enable_utf8} = -1;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;...&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once we do that, DBD::Pg will add the utf8 flag to any returned string, regardless of the actual encoding, as long as there is a high bit in the string. The output will now look like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Sending chilly (☃) to the database. Length is 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Database gave us back (☃) with a length of 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;PVMG(&amp;#34;\342\230\203&amp;#34;\0) [UTF8 &amp;#34;\x{2603}&amp;#34;]&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now our snowman has the correct length, and Data::Peek shows us that it has a UTF8 section. However, it’s not a great solution, because it ignores client_encoding, has to scan every single string, and requires you to remember to set an obscure attribute in your code every time you connect. Version 3.0.0 and up will check your &lt;a href=&#34;http://www.postgresql.org/docs/9.3/static/multibyte.html&#34;&gt;client_encoding&lt;/a&gt;, and as long as it is UTF-8 (and it really ought to be!), it will automatically return strings with the utf8 flag set. Here is our snowman test on 3.0.0 with no explicit setting of pg_enable_utf8:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ git checkout 3.0.0; perl Makefile.PL; make
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ perl dbdpg_unicode_test.pl
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;DBI is version 1.628, DBD::Pg is version 3.0.0
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Sending chilly (☃) to the database. Length is 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Database gave us back (☃) with a length of 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;PVMG(&amp;#34;\342\230\203&amp;#34;\0) [UTF8 &amp;#34;\x{2603}&amp;#34;]&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This new automatic detection is the same as setting pg_enable_utf8 to -1. Setting it to 0 will prevent the utf8 flag from ever being set, while setting it to 1 will cause the flag to always be set. Setting it to anything but -1 should be extremely rare in production and used with care.&lt;/p&gt;
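&lt;p&gt;To see what the utf8 flag changes independently of any database, here is a minimal sketch using only the core Encode module. It assumes we start with the raw UTF-8 bytes of the snowman, which is what DBD::Pg 2.x returned above without pg_enable_utf8. The variable names are illustrative and not part of DBD::Pg.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;#!perl
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use strict;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use warnings;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use Encode qw(decode);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## The raw UTF-8 bytes for SNOWMAN, as an unflagged byte string
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;my $bytes = &amp;#34;\342\230\203&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print &amp;#34;Byte string: length &amp;#34; . length($bytes) . &amp;#34;, utf8 flag &amp;#34; . (utf8::is_utf8($bytes) ? &amp;#34;on&amp;#34; : &amp;#34;off&amp;#34;) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Decoding the bytes as UTF-8 yields a one-character string with the flag set
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;my $chars = decode(&amp;#39;UTF-8&amp;#39;, $bytes);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print &amp;#34;Character string: length &amp;#34; . length($chars) . &amp;#34;, utf8 flag &amp;#34; . (utf8::is_utf8($chars) ? &amp;#34;on&amp;#34; : &amp;#34;off&amp;#34;) . &amp;#34;\n&amp;#34;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first line should report a length of 3 with the flag off, and the second a length of 1 with the flag on, mirroring the 2.18.1 and 3.0.0 outputs shown earlier.&lt;/p&gt;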
&lt;h3 id=&#34;common-questions&#34;&gt;Common Questions&lt;/h3&gt;
&lt;h4 id=&#34;what-happens-if-i-set-pg_enable_utf8---1-on-older-versions-of-dbdpg&#34;&gt;What happens if I set pg_enable_utf8 = -1 on older versions of DBD::Pg?&lt;/h4&gt;
&lt;p&gt;Prior to DBD::Pg 3.0.0, the pg_enable_utf8 attribute was a simple boolean, so setting it to anything other than &lt;strong&gt;0&lt;/strong&gt; makes it true. In other words, setting it to -1 is the same as setting it to 1. If you must support older versions of DBD::Pg, -1 is a good choice.&lt;/p&gt;
&lt;h4 id=&#34;why-does-dbdpg-flag-everything-as-utf8-including-simple-ascii-strings-with-no-high-bit-characters&#34;&gt;Why does DBD::Pg flag everything as utf8, including simple ASCII strings with no high bit characters?&lt;/h4&gt;
&lt;p&gt;The lovely thing about the UTF-8 scheme is that ASCII data fits nicely inside it with no changes. A bare ASCII string is still valid UTF-8; it simply doesn’t have any high-bit characters. So rather than read each string as it comes back from the database and determine if it &lt;em&gt;must&lt;/em&gt; be flagged as utf8, DBD::Pg simply flags every string as utf8 because it &lt;em&gt;can&lt;/em&gt;. In other words, every string may or may not contain actual non-ASCII characters, but either way we simply flag it because it &lt;em&gt;may&lt;/em&gt; contain them, and that is good enough. This saves us a bit of time and effort, as we no longer have to scan every single byte coming back from the database. Marking everything as utf8, instead of only non-ASCII strings, was the most contentious decision made while this new version was being developed.&lt;/p&gt;
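&lt;p&gt;As a small illustration of why flagging plain ASCII is harmless, the sketch below compares an unflagged ASCII string with a copy whose utf8 flag has been turned on with the core utf8::upgrade function. That call is only a stand-in chosen for illustration, not DBD::Pg’s own mechanism, and the variable names are hypothetical.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;#!perl
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use strict;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use warnings;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;my $plain   = &amp;#34;hello&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;my $flagged = &amp;#34;hello&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Turn the utf8 flag on; the characters themselves are unchanged
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;utf8::upgrade($flagged);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print &amp;#34;plain flag:   &amp;#34; . (utf8::is_utf8($plain)   ? &amp;#34;on&amp;#34; : &amp;#34;off&amp;#34;) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print &amp;#34;flagged flag: &amp;#34; . (utf8::is_utf8($flagged) ? &amp;#34;on&amp;#34; : &amp;#34;off&amp;#34;) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print &amp;#34;same string:  &amp;#34; . ($plain eq $flagged ? &amp;#34;yes&amp;#34; : &amp;#34;no&amp;#34;) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;print &amp;#34;same length:  &amp;#34; . (length($plain) == length($flagged) ? &amp;#34;yes&amp;#34; : &amp;#34;no&amp;#34;) . &amp;#34;\n&amp;#34;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The two strings compare as equal and have the same length; only the internal flag differs.&lt;/p&gt;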
&lt;h4 id=&#34;why-is-only-utf-8-the-only-client_encoding-that-is-treated-special&#34;&gt;Why is UTF-8 the only client_encoding that is treated specially?&lt;/h4&gt;
&lt;p&gt;There are two important reasons why we only look at UTF-8. First, the utf8 flag is the only flag Perl strings have, so there is no way of marking a string as any other type of encoding. Second, UTF-8 is unique inside Postgres as it is the universal client_encoding, which has a mapping from nearly every supported server_encoding. In other words, no matter what your server_encoding is set to, setting your client_encoding to UTF-8 is always a safe bet. It’s pretty obvious at this point that UTF-8 has won the encoding wars, and is the de-facto encoding standard for Unicode.&lt;/p&gt;
&lt;h4 id=&#34;when-is-the-client_encoding-checked-what-if-i-change-it&#34;&gt;When is the client_encoding checked? What if I change it?&lt;/h4&gt;
&lt;p&gt;The value of client_encoding is only checked when DBD::Pg first connects. Rechecking this seldom-changed attribute would be quite costly, but there is a way to signal DBD::Pg. If you really want to change the value of client_encoding after you connect, just set the pg_enable_utf8 attribute to -1, and it will cause DBD::Pg to re-read the client_encoding and start setting the utf8 flags accordingly.&lt;/p&gt;
&lt;h4 id=&#34;what-about-arrays&#34;&gt;What about arrays?&lt;/h4&gt;
&lt;p&gt;Arrays are handled as expected too. Arrays are unwrapped and turned into an array reference, in which the individual strings within it have the utf8 flag set. Example code:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;for my $x (sort keys %dm) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print &amp;#34;\nSending $x ($dm{$x}) to the database. Length is &amp;#34; . length($dm{$x}) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    my $SQL = qq{SELECT ARRAY[&amp;#39;$dm{$x}&amp;#39;]};
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    my $var = $dbh-&amp;gt;selectall_arrayref($SQL)-&amp;gt;[0][0];
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print &amp;#34;Database gave us back ($var) with a length of &amp;#34; . length($var) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    $var = pop @$var;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print &amp;#34;Inner array ($var) has a length of &amp;#34; . length($var) . &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print DPeek $var;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print &amp;#34;\n&amp;#34;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;DBI is version 1.628, DBD::Pg is version 3.0.0
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Sending chilly (☃) to the database. Length is 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Database gave us back (ARRAY(0x90c555c)) with a length of 16
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Inner array (☃) has a length of 1
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;PVMG(&amp;#34;\342\230\203&amp;#34;\0) [UTF8 &amp;#34;\x{2603}&amp;#34;]&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id=&#34;why-is-unicode-so-hard&#34;&gt;Why is Unicode so hard?&lt;/h4&gt;
&lt;p&gt;Partly because human languages are a vast and complex system, and partly because we painted ourselves into a corner a bit in the early days of computing. Some of the statements presented above have been over-simplified. Unicode is much more than just using UTF-8 properly. The utf8 flag in Perl strings does not mean quite the same thing as a UTF-8 encoding. Interestingly, Perl even makes a distinction between “UTF8” and “UTF-8”. It’s quite a mess, but at the end of the day, Unicode support is far better &lt;a href=&#34;http://perldoc.perl.org/perlunicode.html&#34;&gt;in Perl&lt;/a&gt; than in &lt;a href=&#34;https://web.archive.org/web/20140306122242/https://dheeb.files.wordpress.com/2011/07/gbu.pdf&#34;&gt;any other language&lt;/a&gt;.&lt;/p&gt;

      </content>
    </entry>
  
    <entry>
      <title>The Un-unaccentable Character</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2013/08/the-un-unaccentable-character/"/>
      <id>https://www.endpointdev.com/blog/2013/08/the-un-unaccentable-character/</id>
      <published>2013-08-18T00:00:00+00:00</published>
      <author>
        <name>Josh Williams</name>
      </author>
      <content type="html">
        &lt;p&gt;I typed “Unicode” into an online translator, and it responded saying it had no idea what the language was but it roughly translates to “Surprise!”&lt;/p&gt;
&lt;p&gt;Recently a client sent over a problem getting some of their Postgres data through an ASCII-only ETL process.  They only needed to worry about some occasional accent marks, and not any of the more uncommon or odd Unicode characters, thankfully. ☺ Or so we thought.  The &lt;a href=&#34;http://www.postgresql.org/docs/current/interactive/unaccent.html&#34;&gt;unaccent extension&lt;/a&gt; was a great starting point, but the problem they sent over boiled down to this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;postgres=# SELECT unaccent(&amp;#39;e é ѐ&amp;#39;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; unaccent 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;----------
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; e e ѐ
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;(1 row)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;unaccent() worked, except for that odd ѐ, which then failed the ETL task.  That’s exactly what unaccent is supposed to handle.  The character è even appears in the unaccent.rules file.  So what gives?&lt;/p&gt;
&lt;p&gt;Well, if you’re in the habit of piping blog posts through hexdump (and who isn’t?) then you probably already know the answer.  But even if not, you may already suspect that we’re dealing with a different character that just looks the same.  And you’d be right.  Specifically, the &lt;a href=&#34;http://unicode.org/cldr/utility/character.jsp?a=00E8&#34;&gt;è in the rules file&lt;/a&gt; is from the more common Latin set, and the &lt;a href=&#34;http://unicode.org/cldr/utility/character.jsp?a=0450&#34;&gt;ѐ that doesn’t work&lt;/a&gt; is from the Cyrillic set.  Pretty much visually identical, but completely separate characters.&lt;/p&gt;
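&lt;p&gt;If hexdump is not your thing, a quick way to confirm which character you are dealing with is to ask Perl for its official Unicode name. This minimal sketch uses the core charnames module and the two code points from the links above (U+00E8 and U+0450), written numerically so the look-alike glyphs cannot be confused in the source:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;#!perl
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use strict;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use warnings;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;use charnames ();
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;## Print the code point and official Unicode name of each look-alike
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;for my $cp (0x00E8, 0x0450) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    printf &amp;#34;U+%04X %s\n&amp;#34;, $cp, charnames::viacode($cp);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first should come back as LATIN SMALL LETTER E WITH GRAVE and the second as CYRILLIC SMALL LETTER IE WITH GRAVE.&lt;/p&gt;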
&lt;h3 id=&#34;augmenting-the-unaccent-dictionary&#34;&gt;Augmenting the unaccent dictionary:&lt;/h3&gt;
&lt;p&gt;Speaking more generically, ideally a simple UPDATE statement with a replace() will correct it in the source data.  And a trigger doing the same will keep it tidy from that point forward.&lt;/p&gt;
&lt;p&gt;But if you can’t or just don’t want to go down that path, the unaccent extension dictionary can be edited.  On my system it’s found in /usr/share/postgresql/9.3/tsearch_data/unaccent.rules.  It has a very simple format.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Make a copy of the file before you edit it.  Updated packages, or new deployments if you’re compiling from source, will wipe out any changes to the unaccent.rules file.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;root:~# cp /usr/share/postgresql/9.3/tsearch_data/{unaccent,extended}.rules&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add a line including the character to translate.  To handle our example above, add:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ѐ e&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In Postgres, create a new dictionary to load in those rules.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;db=# CREATE TEXT SEARCH DICTIONARY extended (TEMPLATE=unaccent, RULES=&amp;#39;extended&amp;#39;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;CREATE TEXT SEARCH DICTIONARY&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that &amp;rsquo;extended&amp;rsquo; above will point to the extended.rules file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Call unaccent() specifying the newly added dictionary:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;db=# SELECT unaccent(&amp;#39;extended&amp;#39;, &amp;#39;e é ѐ&amp;#39;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; unaccent 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;----------
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; e e e
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;(1 row)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Note that subsequent changes won’t automatically appear.  To update the in-database version, after you make any changes to the rules file run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;db=# ALTER TEXT SEARCH DICTIONARY extended (RULES=&amp;#39;extended&amp;#39;);
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ALTER TEXT SEARCH DICTIONARY&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ol&gt;

      </content>
    </entry>
  
    <entry>
      <title>Perl, UTF-8, and binmode on filehandles</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2012/02/perl-utf8-binmode-filehandle-unicode/"/>
      <id>https://www.endpointdev.com/blog/2012/02/perl-utf8-binmode-filehandle-unicode/</id>
      <published>2012-02-21T00:00:00+00:00</published>
      <author>
        <name>Greg Sabino Mullane</name>
      </author>
      <content type="html">
        &lt;p&gt;&lt;a href=&#34;/blog/2012/02/perl-utf8-binmode-filehandle-unicode/image-0-big.jpeg&#34;&gt;&lt;img alt=&#34;&#34; border=&#34;0&#34; id=&#34;BLOGGER_PHOTO_ID_5711604532308041090&#34; src=&#34;/blog/2012/02/perl-utf8-binmode-filehandle-unicode/image-0.jpeg&#34; style=&#34;cursor: pointer; height: 222px; width: 320px;&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Original &lt;a href=&#34;https://www.flickr.com/photos/avlxyz/2462987456/&#34;&gt;image&lt;/a&gt; by &lt;a href=&#34;https://www.flickr.com/photos/avlxyz/&#34;&gt;avlxyz&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently ran into a Perl quirk involving UTF-8, standard filehandles, and the built-in Perl die() and warn() functions. Someone reported a bug in the &lt;a href=&#34;https://bucardo.org/check_postgres/&#34;&gt;check_postgres&lt;/a&gt; program in which the French output was displaying incorrectly. That is, when the &lt;a href=&#34;https://en.wikipedia.org/wiki/Locale&#34;&gt;locale&lt;/a&gt; was set to &lt;strong&gt;FR_fr&lt;/strong&gt;, the French accented characters generated by the program were coming out as “byte soup” instead of proper UTF-8. Some other languages, English and Japanese among them, seemed to be fine. For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;## English: &amp;#34;sorry, too many clients already&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;## Japanese: &amp;#34;現在クライアント数が多すぎます&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;## French expected: &amp;#34;désolé, trop de clients sont déjà connectés&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;## French actual: &amp;#34;d�sol�, trop de clients sont d�j� connect�s&amp;#34;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That last line should be very familiar to anyone who has struggled with Unicode on a command line, with those question marks on an inverted background. Our problem was that the output of the script looked like the last line, rather than the one before it. The Japanese output, despite being chock full of Unicode, does not have the same problem! More on that later.&lt;/p&gt;
&lt;p&gt;I was able to duplicate the problem easily enough by setting my locale to &lt;strong&gt;FR_fr&lt;/strong&gt; and having check_postgres output a message with some non-ASCII characters in it. However, as noted above, some languages were fine, some were not.&lt;/p&gt;
&lt;p&gt;Before going any further, I should point out that this Perl script did have a &lt;strong&gt;use utf8;&lt;/strong&gt; at the top of it, as it should. This does not dictate how things will be read in or output, but merely tells Perl that the source code itself contains UTF-8 characters. Now to the quirky parts.&lt;/p&gt;
&lt;p&gt;I normally test my Perl scripts on the fly by adding a quick series of debugging statements using warn() or die(). Both go to stderr, so it is easy to separate your debugging statements from the normal output of the code. However, when I output the non-ASCII message in question immediately after it was defined in the script, it showed a normal, expected UTF-8 string. So I started tracking things through the code, to see if there was some point at which the apparently normal UTF-8 string got turned into byte soup. It never did; I finally realized that although print was outputting byte soup, both warn() and die() were outputting UTF-8! Here’s a sample script to better demonstrate the problem:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;#!perl&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;strict&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;warnings&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;utf8&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$msg&lt;/span&gt; = &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;This is a micro symbol: µ&amp;#39;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;print = $msg\n&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;warn&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;warn = $msg\n&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;die&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;die = $msg\n&amp;#34;&lt;/span&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now let’s run it and see what happens:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; = This is a micro symbol: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;�&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;warn&lt;/span&gt; = This is a micro symbol: µ
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;die&lt;/span&gt; = This is a micro symbol: µ&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So we’ve found one Perl quirk: the outputs of print() and warn() differ, as warn() manages to correctly output the string as UTF-8. Perhaps it is just that the stdout and stderr filehandles are using different encodings? Let’s take a look by expanding the script and explicitly printing to both stdout and stderr. We’ll also add some other Unicode characters, to emulate the difference between French and Japanese above:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;#!perl&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;strict&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;warnings&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;use&lt;/span&gt; &lt;span style=&#34;color:#b06;font-weight:bold&#34;&gt;utf8&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$msg&lt;/span&gt; = &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;This is a micro symbol: µ&amp;#39;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;my&lt;/span&gt; &lt;span style=&#34;color:#369&#34;&gt;$alert&lt;/span&gt; = &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;The radioactive snowmen come in peace: ☢ ☃☃☃ ☮&amp;#39;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#038&#34;&gt;STDOUT&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;print to STDOUT = $msg\n&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#038&#34;&gt;STDOUT&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;print to STDOUT = $alert\n&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#038&#34;&gt;STDERR&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;print to STDERR = $msg\n&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; &lt;span style=&#34;color:#038&#34;&gt;STDERR&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;print to STDERR = $alert\n&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;warn&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;warn = $msg\n&amp;#34;&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;warn&lt;/span&gt; &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#34;warn = $alert\n&amp;#34;&lt;/span&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(Note: if you do not see small literal snowmen characters in the above script, you need to get a better browser or RSS reader!)&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; to &lt;span style=&#34;color:#038&#34;&gt;STDOUT&lt;/span&gt; = This is a micro symbol: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;�&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Wide character in &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; at utf12 line &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;11&lt;/span&gt;.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; to &lt;span style=&#34;color:#038&#34;&gt;STDOUT&lt;/span&gt; = The radioactive snowmen come in peace: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☢&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☃☃☃&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☮&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; to &lt;span style=&#34;color:#038&#34;&gt;STDERR&lt;/span&gt; = This is a micro symbol: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;�&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Wide character in &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; at utf12 line &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;14&lt;/span&gt;.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; to &lt;span style=&#34;color:#038&#34;&gt;STDERR&lt;/span&gt; = The radioactive snowmen come in peace: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☢&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☃☃☃&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☮&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;warn&lt;/span&gt; = This is a micro symbol: µ
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;warn&lt;/span&gt; = The radioactive snowmen come in peace: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☢&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☃☃☃&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☮&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There are a number of things to note here. First, the stderr filehandle has the same problem as the stdout filehandle. So, while warn() and die() send things to stderr, there is some magic happening behind the scenes such that sending a string to them is &lt;em&gt;not&lt;/em&gt; the same as sending it to stderr ourselves via a print statement. Which is a good thing overall, as it would be even weirder for stdout and stderr to have different encoding layers! The solution to this is simple enough: just force stdout to have the proper encoding by use of the &lt;a href=&#34;http://perldoc.perl.org/functions/binmode.html&#34;&gt;binmode&lt;/a&gt; function:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;binmode&lt;/span&gt; &lt;span style=&#34;color:#038&#34;&gt;STDOUT&lt;/span&gt;, &lt;span style=&#34;color:#d20;background-color:#fff0f0&#34;&gt;&amp;#39;:utf8&amp;#39;&lt;/span&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Indeed, the one line above solved the original poster’s problem; applying it to our test script shows that the stdout filehandle now outputs things correctly, unlike the stderr filehandle:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; to &lt;span style=&#34;color:#038&#34;&gt;STDOUT&lt;/span&gt; = This is a micro symbol: µ
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; to &lt;span style=&#34;color:#038&#34;&gt;STDOUT&lt;/span&gt; = The radioactive snowmen come in peace: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☢&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☃☃☃&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☮&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; to &lt;span style=&#34;color:#038&#34;&gt;STDERR&lt;/span&gt; = This is a micro symbol: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;�&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Wide character in &lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; at utf12 line &lt;span style=&#34;color:#00d;font-weight:bold&#34;&gt;16&lt;/span&gt;.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-weight:bold&#34;&gt;print&lt;/span&gt; to &lt;span style=&#34;color:#038&#34;&gt;STDERR&lt;/span&gt; = The radioactive snowmen come in peace: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☢&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☃☃☃&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☮&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;warn&lt;/span&gt; = This is a micro symbol: µ
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#038&#34;&gt;warn&lt;/span&gt; = The radioactive snowmen come in peace: &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☢&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☃☃☃&lt;/span&gt; &lt;span style=&#34;color:#a61717;background-color:#e3d2d2&#34;&gt;☮&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The next thing to notice is that the snowmen alert message is displayed properly everywhere. Why is this? The answer lies in that the micro symbol (and the accented French characters) fall into a range that &lt;em&gt;could&lt;/em&gt; still be a single-byte Latin-1 character (a code point below 256), as far as Perl is concerned. What happens is that, in the absence of an explicit encoding layer, Perl tries to write each character as a single byte. For the French and “micro” strings that works, so the characters were output as Latin-1 bytes, which a UTF-8 terminal shows as byte soup. The Japanese and “snowmen” strings contain code points above 255 that cannot fit in one byte, which leaves no doubt that we had left single-byte land and were exploring the land of Unicode. So Perl falls back to writing its internal UTF-8 representation, and the characters appear as one would expect. Note, however, that Perl still emits a wide character warning, for it recognizes that something is probably wrong. The warnings go away when we use &lt;code&gt;binmode&lt;/code&gt; to force the encoding layer to &lt;code&gt;:utf8&lt;/code&gt;.&lt;/p&gt;
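&lt;p&gt;Here is a minimal sketch of that distinction, using only core Perl. &lt;code&gt;utf8::downgrade&lt;/code&gt; reports whether every character in a string fits in a single byte, which is roughly the check that print makes on a filehandle with no encoding layer. The variable names and sample characters are illustrative assumptions, and the script assumes it is saved as UTF-8:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;#!perl

use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Illustrative sample characters: one below code point 256, one above
my %sample = (micro =&amp;gt; &amp;#39;µ&amp;#39;, snowman =&amp;gt; &amp;#39;☃&amp;#39;);

for my $name (sort keys %sample) {
    my $char = $sample{$name};

    # With a true second argument, utf8::downgrade returns false
    # instead of dying when a character cannot fit in one byte
    my $copy = $char;
    my $fits = utf8::downgrade($copy, 1) ? &amp;#39;yes&amp;#39; : &amp;#39;no&amp;#39;;

    # Number of bytes the character needs when encoded as UTF-8
    my $utf8_bytes = length(encode(&amp;#39;UTF-8&amp;#39;, $char));

    printf &amp;#34;%-8s U+%04X single byte: %-3s UTF-8 bytes: %d\n&amp;#34;,
        $name, ord($char), $fits, $utf8_bytes;
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The output is plain ASCII, so it reads the same regardless of the terminal encoding. It should report that the micro symbol fits in a single byte while the snowman does not.&lt;/p&gt;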
&lt;p&gt;The correct solution when dealing with UTF-8 is to be explicit and not let Perl make any guesses. Solutions to this vary, but the combination here of adding &lt;strong&gt;use utf8;&lt;/strong&gt; and &lt;strong&gt;binmode STDOUT, &amp;#39;:utf8&amp;#39;;&lt;/strong&gt; is what worked. While I was able to duplicate the problem right away, the combination of Perl making inconsistent guesses and the odd behavior of warn() and die() turned this from a quick fix into a slightly longer investigation. Yes, Unicode and Perl have given me quite a few gray hairs over the years, but I always feel better when I &lt;a href=&#34;https://www.azabani.com/pages/gbu/&#34;&gt;look at how &lt;em&gt;other&lt;/em&gt; languages handle Unicode&lt;/a&gt;. :)&lt;/p&gt;
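&lt;p&gt;For completeness, here is a minimal sketch of being explicit on both output handles. It swaps in the stricter &lt;code&gt;:encoding(UTF-8)&lt;/code&gt; layer in place of &lt;code&gt;:utf8&lt;/code&gt; and applies it to stderr as well. Both choices go beyond the one-line fix described above and are assumptions about what you might want, not what the original poster used:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;#!perl

use strict;
use warnings;
use utf8;

# Declare the encoding on both handles so Perl never has to guess
binmode STDOUT, &amp;#39;:encoding(UTF-8)&amp;#39;;
binmode STDERR, &amp;#39;:encoding(UTF-8)&amp;#39;;

my $msg   = &amp;#39;This is a micro symbol: µ&amp;#39;;
my $alert = &amp;#39;The radioactive snowmen come in peace: ☢ ☃☃☃ ☮&amp;#39;;

print STDOUT &amp;#34;print to STDOUT = $msg\n&amp;#34;;
print STDOUT &amp;#34;print to STDOUT = $alert\n&amp;#34;;

print STDERR &amp;#34;print to STDERR = $msg\n&amp;#34;;
print STDERR &amp;#34;print to STDERR = $alert\n&amp;#34;;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With both layers in place, the wide character warnings should no longer appear on either handle.&lt;/p&gt;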

      </content>
    </entry>
  
    <entry>
      <title>Sanitizing supposed UTF-8 data</title>
      <link rel="alternate" href="https://www.endpointdev.com/blog/2011/12/sanitizing-supposed-utf-8-data/"/>
      <id>https://www.endpointdev.com/blog/2011/12/sanitizing-supposed-utf-8-data/</id>
      <published>2011-12-17T00:00:00+00:00</published>
      <author>
        <name>Jon Jensen</name>
      </author>
      <content type="html">
        &lt;p&gt;As time passes, it’s clear that Unicode has won the character set encoding wars, and UTF-8 is by far the most popular encoding, and the expected default. In a few more years we’ll probably find discussion of different character set encodings to be arcane, relegated to “data historians” and people working with legacy systems.&lt;/p&gt;
&lt;p&gt;But we’re not there yet! There’s still lots of migration to do before we can forget about everything that’s not UTF-8.&lt;/p&gt;
&lt;p&gt;Last week I again found myself converting data. This time I was taking data from a PostgreSQL database with no specified encoding (so-called “SQL_ASCII”, really just raw bytes), and sending it via JSON to a remote web service. JSON uses UTF-8 by default, and that’s what I needed here. Most of the source data was in either UTF-8, ISO Latin-1, or Windows-1252, but some was in non-Unicode Chinese or Japanese encodings, and some was just plain mangled.&lt;/p&gt;
&lt;p&gt;At this point I need to remind you about one of the most unusual aspects of UTF-8: It has limited valid forms. Legacy encodings typically used all or most of the 255 code points in their 8-bit space (leaving point 0 for traditional ASCII NUL). While UTF-8 is compatible with 7-bit ASCII, it does not allow just any 8-bit byte in any position. See &lt;a href=&#34;https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences&#34;&gt;the Wikipedia summary of invalid byte sequences&lt;/a&gt; to know what can be considered invalid.&lt;/p&gt;
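&lt;p&gt;As a minimal sketch of what “limited valid forms” means in practice, the core Encode module can be asked to reject malformed input. The helper name &lt;code&gt;is_valid_utf8&lt;/code&gt; and the sample byte strings are hypothetical, chosen only for illustration:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode may modify its argument when CHECK is set
    return eval { decode(&amp;#39;UTF-8&amp;#39;, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $good = &amp;#34;caf\xC3\xA9&amp;#34;;    # e-acute as the two-byte UTF-8 sequence
my $bad  = &amp;#34;caf\xE9&amp;#34;;        # e-acute as a lone Latin-1 byte

printf &amp;#34;%s\n&amp;#34;, is_valid_utf8($good) ? &amp;#39;valid&amp;#39; : &amp;#39;invalid&amp;#39;;
printf &amp;#34;%s\n&amp;#34;, is_valid_utf8($bad)  ? &amp;#39;valid&amp;#39; : &amp;#39;invalid&amp;#39;;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The second string is perfectly good Latin-1, but a lone 0xE9 byte starts a multi-byte sequence in UTF-8 and nothing follows it, so the strict decoder refuses it.&lt;/p&gt;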
&lt;p&gt;We had no need to try to fix the truly broken data, but we wanted to convert everything possible to UTF-8 and at the very least guarantee no invalid UTF-8 strings appeared in what we sent.&lt;/p&gt;
&lt;p&gt;I previously wrote about &lt;a href=&#34;/blog/2010/03/postgresql-utf-8-conversion/&#34;&gt;converting a PostgreSQL database dump to UTF-8&lt;/a&gt;, and used the Perl CPAN module &lt;a href=&#34;https://metacpan.org/pod/IsUTF8&#34;&gt;IsUTF8&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I was going to use that again, but looked around and found an even better module, exactly targeting this use case: &lt;a href=&#34;https://metacpan.org/release/Encoding-FixLatin&#34;&gt;Encoding::FixLatin&lt;/a&gt;, by Grant McLean. Its documentation says it “takes mixed encoding input and produces UTF-8 output” and that’s exactly what it does, focusing on input with mixed UTF-8, Latin-1, and Windows-1252.&lt;/p&gt;
&lt;p&gt;It worked as advertised, very well. We would need to use a different module to convert some other legacy encodings, but in this case this was good enough and got the vast majority of the data right.&lt;/p&gt;
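&lt;p&gt;Basic usage looks roughly like the sketch below, which relies on the module’s &lt;code&gt;fix_latin&lt;/code&gt; function. The sample input is made up for illustration, and Encoding::FixLatin has to be installed from CPAN since it is not a core module:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

# Made-up input mixing encodings in one byte string:
# a Latin-1 i-diaeresis (0xEF) and a UTF-8 e-acute (0xC3 0xA9)
my $mixed = &amp;#34;na\xEFve caf\xC3\xA9&amp;#34;;

# By default fix_latin returns the text as a Perl character string
my $fixed = fix_latin($mixed);

# Encode explicitly on the way out
binmode STDOUT, &amp;#39;:encoding(UTF-8)&amp;#39;;
print &amp;#34;$fixed\n&amp;#34;;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;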
&lt;p&gt;There’s even a standalone &lt;a href=&#34;https://metacpan.org/pod/Encoding::FixLatin&#34;&gt;fix_latin&lt;/a&gt; program designed specifically for processing Postgres pg_dump output from legacy encodings, with some nice examples of how to use it.&lt;/p&gt;
&lt;p&gt;One gotcha is similar to a catch that David Christensen reported with the Encode module in a &lt;a href=&#34;/blog/2010/12/character-encoding-in-perl-decodeutf8/&#34;&gt;blog post here about a year ago&lt;/a&gt;: If the Perl string already has the UTF-8 flag set, Encoding::FixLatin immediately returns it, rather than trying to process it. So it’s important that the incoming data be a pure byte stream, or that you otherwise turn off the UTF-8 flag, if you expect it to change anything.&lt;/p&gt;
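&lt;p&gt;The flag in question can be inspected and cleared with the core &lt;code&gt;utf8::&lt;/code&gt; functions. This sketch does not call Encoding::FixLatin at all; it only shows how the flag changes, and the sample string is an assumption for illustration. Whether re-encoding to bytes is the right way to clear the flag depends on where your data came from:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-perl&#34; data-lang=&#34;perl&#34;&gt;use strict;
use warnings;

my $bytes = &amp;#34;caf\xC3\xA9&amp;#34;;    # five raw bytes, UTF-8 flag off

# Decoding in place yields a four-character string with the flag on
my $chars = $bytes;
utf8::decode($chars);
printf &amp;#34;chars: length %d, flag %s\n&amp;#34;,
    length($chars), utf8::is_utf8($chars) ? &amp;#39;on&amp;#39; : &amp;#39;off&amp;#39;;

# Encoding in place goes back to five bytes and turns the flag off
my $again = $chars;
utf8::encode($again);
printf &amp;#34;bytes: length %d, flag %s\n&amp;#34;,
    length($again), utf8::is_utf8($again) ? &amp;#39;on&amp;#39; : &amp;#39;off&amp;#39;;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;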
&lt;p&gt;Along the way I found some other CPAN modules that look useful for cases where I need more manual control than Encoding::FixLatin gives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://metacpan.org/pod/Search::Tools::UTF8&#34;&gt;Search::Tools::UTF8&lt;/a&gt; — test for and/or fix bad ASCII, Latin-1, Windows-1252, and UTF-8 strings&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://search.cpan.org/perldoc?Encode::Detect&#34;&gt;Encode::Detect&lt;/a&gt; — use Mozilla’s universal charset detector and convert to UTF-8&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://metacpan.org/pod/Unicode::Tussle&#34;&gt;Unicode::Tussle&lt;/a&gt; — ridiculously comprehensive set of Unicode tools that has to be seen to be believed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once again Perl’s thriving open source/free software community made my day!&lt;/p&gt;

      </content>
    </entry>
  
</feed>
