Handling text encoding in Perl

2025-04-29T00:00:00+00:00

When we are dealing with legacy applications, it’s very possible that the code we are looking at does not deal with Unicode characters, instead assuming all text is ASCII. This will cause a myriad of glitches and visual errors.

In 2025, more than 30 years after Unicode was born, how is it possible that old applications still survive while ignoring or working around the whole issue?

Well, if your audience is mainly English speaking, it’s possible that you just experience occasional glitches with characters like typographical quotes, non-breaking spaces, etc., which are not really mission-critical. If, on the contrary, you need to deal every day with diacritics or even different languages (say, Italian and Slovenian), your application simply won’t survive without a good understanding of encoding.

In this article we are going to focus on Perl, but other languages face the same problems.

Back to the bytes

As we know, machines work with numbers and bytes. A string of text is made of bytes, and each of them is 8 bits (each bit is a 0 or a 1). So one byte allows 256 possible combinations of bits.

Plain ASCII is made up of 128 characters (7 bits), so it fits nicely in one byte, leaving room for more. One character is exactly one byte, and one byte carries one character.

However, ASCII is not enough for most languages, even if they use the Latin alphabet, because they use diacritics like é, à, č, and ž.

To address this problem, the ISO 8859 encoding standards appeared (there are others, like the Windows code pages, using the same idea, but of course using different code points). These standards use the 8th bit not used by ASCII, still using a single byte for each character but doubling the combinations from ASCII, allowing 256 possible characters. That’s better, but still not great. It suffices for handling text in a couple of languages if they share the same characters, but not more. For this reason, there are various ISO 8859 encoding standards (8859-1, 8859-2, etc.), one for each group of related languages (e.g. 8859-1 is for Western Europe, 8859-2 for Central Europe and so on, and there are even revisions of the same encoding, like 8859-15 and 8859-16).

The problem is that if you have a random string, you have to guess which is the correct encoding. The same byte value could represent an “È” or a “Č”. You need to look at the context (which language is this?) or search for an encoding declaration. Most importantly, you are simply not able to type È and Č in the same plain text document. If your company works in Italy using the 8859-15 encoding, it means you can’t even accept the correct name of a customer from Slovenia, a neighbouring country, because the encoding simply doesn’t have a place for characters with a caron (like “č”) and you have to work around this real problem.
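The ambiguity is easy to reproduce. The following is a minimal sketch in Python, used here only for illustration; the helper name fits is made up for this example. It decodes the same byte with two ISO 8859 variants and checks whether a character exists in a given encoding at all:

```python
# The same byte means a different letter depending on the assumed encoding.
raw = b"\xc8"
as_western = raw.decode("iso-8859-1")   # Western European table
as_central = raw.decode("iso-8859-2")   # Central European table
print(ascii(as_western), ascii(as_central))  # code points shown as escapes


def fits(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A letter with a caron has no slot in ISO 8859-15, but it does in 8859-2.
print(fits("č", "iso-8859-15"), fits("č", "iso-8859-2"))
```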

So finally came the Unicode age. This standard allows for more than a million characters, which should be enough. You can finally type English, Italian, Russian, Arabic, and emojis all in the same plain text. This is truly great, but it creates a complication for the programmer: the assumption that one byte is one character is not true anymore. The common encoding for Unicode is UTF-8, which is also backward compatible with ASCII. This means that if you have ASCII text, it is also valid UTF-8. Any other character which is not ASCII will instead take from two to four bytes and the programming language needs to be aware of this.
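To make the variable width concrete, here is a small Python sketch (the function name utf8_length is just an illustrative choice) that reports how many bytes a few characters need in UTF-8:

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# An ASCII letter, two accented letters, the euro sign, and an emoji.
for ch in ["A", "È", "Č", "€", "😏"]:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")
```

Pure ASCII text produces exactly the same bytes in both encodings, which is what the backward compatibility mentioned above means in practice.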

Into the language and back to the world

Text manipulation is a very common task. If you need to process a string, say “ÈČ”, like in this document, you should be able to tell that it is a string with two characters representing two letters. You want to be able to use regular expressions on it, and so on.

Now, if we read it as a string of bytes, we get 4 of them and the newline, which is not what we want.

Let’s see an example:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper::Concise;

# sample.txt contains ÈČ and a new line

{
    open my $fh, '<', 'sample.txt' or die "Cannot open sample.txt: $!";
    while (my $l = <$fh>) {
        print $l;
        if ($l =~ m/\w\w/) {
            print "Found two characters\n"
        }
        print Dumper($l);
    }
    close $fh;
}

{
    open my $fh, '<:encoding(UTF-8)', 'sample.txt' or die "Cannot open sample.txt: $!";
    while (my $l = <$fh>) {
        print $l;
        if ($l =~ m/\w\w/) {
            print "Found two characters\n"
        }
        print Dumper($l);
    }
    close $fh;
}

This is the output:

ÈČ
"\303\210\304\214\n"
Wide character in print at test.pl line 24, <$fh> line 1.
ÈČ
Found two characters
"\x{c8}\x{10c}\n"

In the first block the file is read verbatim, without any decoding. The regular expression doesn’t match: we basically have 4 bytes which don’t seem to mean much.

In the second block we decoded the input, converting it into the Perl internal representation. Now we can use regular expressions and have a consistent approach to text manipulation.

In the second block, we got a warning:

Wide character in print at test.pl line 24, <$fh> line 1.

That’s because we printed something to the screen, but given that the string is now made of characters (decoded for internal use), Perl warns us that we need to encode it back to bytes (for the outside world to consume). A wide character is basically a character which doesn’t fit in a single byte and therefore needs to be encoded.

This can either be done by calling the encode() function from the Encode API:

use strict;
use warnings;
use Encode;
print encode("UTF-8", "\x{c8}\x{10c}\n");

Or, better, by declaring the global encoding for the standard output:

use strict;
use warnings;
binmode STDOUT, ":encoding(UTF-8)";
print "\x{c8}\x{10c}\n";

So, the golden rule is:

decode the string on input and get characters out of bytes
work with it in your program as a string of characters
encode the string on output

Any other approach is going to lead to double encoded characters (seeing things like Ã and Ä in English text is a clear symptom of this), corrupted text, and confusion.
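The same rule applies outside Perl. As a hedged illustration, this Python sketch walks through the three steps on the sample string and then reproduces the double-encoding symptom on purpose; the variable names are arbitrary:

```python
import re

raw = "ÈČ\n".encode("utf-8")        # bytes, as read from a file in binary mode

# Without decoding there are 5 bytes, and \w ignores the non-ASCII ones.
bytes_match = re.search(rb"\w\w", raw)

# 1. decode on input
text = raw.decode("utf-8")
# 2. work with characters: 3 of them (two letters and the newline)
chars_match = re.search(r"\w\w", text)
# 3. encode on output
out = text.encode("utf-8")

# What goes wrong otherwise: already-encoded bytes are mistaken for
# Latin-1 characters and encoded a second time.
double_encoded = raw.decode("latin-1").encode("utf-8")
garbled = double_encoded.decode("utf-8")

print(len(raw), len(text), len(double_encoded))
```

The garbled string starts with Ã and contains Ä, which is exactly the kind of debris described above.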

Encoding strategies

If you are dealing with standard input/output on the shell, you should have this in your script:

binmode STDIN,  ":encoding(UTF-8)";
binmode STDOUT, ":encoding(UTF-8)";
binmode STDERR, ":encoding(UTF-8)";

So you’re decoding on input and encoding on output automatically.

For files, you can add the layer in the second argument of open like in the sample script above, or use a handy module like Path::Tiny, which provides methods like slurp_utf8 and spew_utf8 to read and write files using the correct encoding.

Interactions with web frameworks should always happen with the internal Perl representation. When you receive the input from a form, it should be considered already decoded. It’s also the framework’s responsibility to handle the encoding on output. Here at End Point we have many Interchange applications. Interchange can support this, via the MV_UTF8 variable.

The same rules apply to databases. It’s the responsibility of the driver to take your strings and encode/decode them when talking to the database. E.g. DBD::Pg has the pg_enable_utf8 option, while DBD::mysql has mysql_enable_utf8. These options should usually be turned on or off explicitly. Not specifying the option is usually a source of confusion because of the heuristic approach it requires for understanding the code.

Debugging strategies

It may not be the most correct approach, but I’ve been using Dumper for more than a decade and it works. You simply use Data::Dumper or Data::Dumper::Concise and call Dumper on the string you want to examine.

If you see hexadecimal codepoints like \x{c8}\x{10c}, it means the string is decoded and you’re working with the characters. If you see the raw bytes or characters with diacritics (the latter would happen if the terminal is interpreting the bytes and showing you the characters), you’re dealing with an encoded string. If you see weird characters in an English context, it probably means the text has been encoded more than once.
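The same kind of inspection can be done in other languages. This is a rough Python counterpart to the Dumper trick, assuming a hypothetical helper called describe that prints bytes as hex pairs and decoded text as code points:

```python
def describe(value) -> str:
    """Show whether value is encoded bytes or decoded text, unambiguously."""
    if isinstance(value, bytes):
        return "bytes: " + " ".join(f"{b:02x}" for b in value)
    return "text: " + " ".join(f"U+{ord(ch):04X}" for ch in value)


decoded = "ÈČ"
print(describe(decoded))                   # code points of the characters
print(describe(decoded.encode("utf-8")))   # the raw UTF-8 bytes
```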

Migrate a web application to Unicode

If you’re still using legacy encoding systems like ISO 8859 or the similar Windows code pages, or worse, you simply don’t know and you’re relying on the browsers’ heuristics (they’re quite good at guessing), you should change the application to handle input and output correctly throughout:

Convert the templates from the encoding you are using to UTF-8 (iconv should do the trick).
Inspect and possibly convert the existing DB data
Make sure the DB drivers handle the I/O correctly
Make sure the web framework is decoding the input and encoding the output
Make sure the files you read and write are correctly handled
Clean up any workarounds you may have had in place

This looks like a challenging task, and it can be, but it’s totally worth it because fancy and well-supported characters nowadays are the norm. Typographical quotes like “this” and ‘this’ are very common and inserted by word processors automatically. So are emojis. People and customers simply expect them to work.

Band-aids

If your client is on a budget or can’t deal with a large upgrade like this one, which has the potential to be disruptive and expose bugs which are lurking around, you can try to downgrade the Unicode characters to ASCII with tools like Text::Unidecode (which has been ported to other languages as well). So typographical quotes will become the plain ASCII ones, diacritics will be stripped, and various other characters will get their ASCII representation. Not great, but better than dealing with unexpected behavior!
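To give an idea of what such a downgrade involves, here is a rough standard-library approximation in Python. It is not the Text::Unidecode algorithm, which covers far more characters; the PUNCTUATION table and the to_ascii name are assumptions made for this sketch:

```python
import unicodedata

# A tiny hand-picked table; a real transliteration library covers far more.
PUNCTUATION = {
    "\u201c": '"', "\u201d": '"',   # typographical double quotes
    "\u2018": "'", "\u2019": "'",   # typographical single quotes
    "\u2013": "-", "\u2014": "-",   # dashes
}


def to_ascii(text: str) -> str:
    """Best-effort ASCII downgrade: map punctuation, strip diacritics."""
    text = "".join(PUNCTUATION.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (emoji, for instance) becomes "?".
    return stripped.encode("ascii", errors="replace").decode("ascii")


print(to_ascii("\u201cČaj\u201d è"))
```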

Ecommerce customer names with interesting Unicode characters

2021-12-29T00:00:00+00:00

One of our clients with a busy ecommerce site sees a lot of orders, and among those, sometimes there are unusual customer name and address submissions.

We first noticed in 2015 that they had a customer order come in with an emoji in the name field of the order. The emoji was 😏 and we half-jokingly wondered if that was a new sign of fraud.

Over the following 3 years only a few more orders came in with various emoji in customers’ names, but in mid-2018 emoji started to appear increasingly frequently, until now one appears on average every day or two.

Why the sudden appearance of emoji in 2015? It correlates with the rapid shift to browsing the web and shopping on mobile devices. Mobile visits now represent more than half of this client’s ecommerce traffic.

Most people in 2015 didn’t even know how to type emoji on a desktop or laptop computer, but mobile touchscreen keyboards began showing emoji choices around that time, so the mobile explanation makes sense. And mobile keyboard autocorrect also sometimes offers emoji in addition to words, making them even more common in the past few years.

Just for fun I wanted to automatically find all such “interesting” names, so I wrote a simple report that uses SQL to query their PostgreSQL ecommerce database.

To preserve customer privacy, names shown here have been changed and limited to a few names that are common in the United States.

Real names with bonuses

First let’s look at the apparently real names with emoji and other non-alphabetical Unicode characters mixed in:

Amy 🩺
Amy💕
Amy👑
👸Amy
Anna 💫
Anna 😎
Anna ❤️
Anna💊💉
Bob☯️
Bob😍😘😜👑💞
Bob🐞
Brenda 🌙
Brenda 🥳🤪
Brenda☠️
B R E N D A 💕💅🏾💃🏾👜
Cameron 🌻
Cameron 😎
Cameron 💁🏻‍♀️
Doug 💕
Doug 🅱️
Doug 🌸🤩
Doug 👦🏻❤️
Emily 🌷
Emily 😊🇨🇺
Emily😚
Emily 🎰👑❤️
Frank 🥰
Frank ⚗️
Frank🔆
Frank💞💰
Jane 💕
Jane⁸
Jane ❤️❤️
Jane🔆
Jill 👑🎀
Jill 👼🏽✨🌫
Jill🍓🍒🍒
Jill👸🏽💖
Jim 🐰
Jim, 💕
Jim’s iPhone✨
Joe 🐯
Joe 🇦🇺
Joe😏
Joe🏀🎸‼️
John 👪
John⁵
John🎭
Karen 🌻🌹
Karen 👑✨
Karen 🔒❤️
Karen🎀👑
Kate💙💚
Kate🤪🤞🏽💙
Kate.💘
K$ate💉
Liz🌺
Liz❣️
Liz❤️🙃
Liz Mama💙💙
Mary 🎀
Mary💘
Mary👼🏼💓
Mary⁹
Mike 👑
Mike 👑🌸
Mike 👩🏻‍🌾
Mike⁶
Sarah 👑
Sarah👑A.
✨Sarah✨
💛🌻 Sarah
Steve🐑
Steve 🦁
Steve♓️💓
Victoria 🥴
Victoria🌻
V𝚒𝚌𝚝𝚘𝚛𝚒𝚊
V I C T O R I A 🤍

Would you have expected all that in ecommerce orders? I didn’t!

Fake names

Next let’s look at placeholder names with people’s role or self-description or similar:

Amor ⚽️
Babe ❤️
C𝚒𝚝𝚒𝚣𝚎𝚗
Daddy🥴😏
Daddy😘👴👨🙏👨👩👧🙇🏾
Fly High 🕊
Forever 💍💜
Granny 👵🏽
HOME ❤️
Home🏠
Home🏠💜
Hubby🥰
💦Juicy🍑
me!! 💛
mi amor ❤️
Mom💗
Mom♥️
Mom 🐥💛
MOMMY 💗
Myself 😘
Princess👑
princess❤️
Queen 😍💖🔓
Queen💘
The Husband💍❤️
Wifey 😈✌🏽👅

Perhaps the occurrence of “home” several times reflects a mobile address book auto-fill function for billing or shipping address fields?

Not names at all!

Then there are those customers who didn’t provide any kind of name at all, just emoji and other special characters:

🦋
💙
🤡🎪
💗☁️
🍌
♥️
🌙
⁵
∅

I guess only one or two of those per year doesn’t amount to much, but they’re interesting to see.

Strange addresses

In addition to the name fields we also checked the address fields for unusual characters and found (again, details changed to preserve privacy):

125 E 27😎
227 W 24 Circle ⭕️
3 Blvd. George Washington™

Simply odd

The prizewinner for oddity, which seems like some kind of copy-and-paste mistake, is this in the city field of an address:

Indian® Roadmaster™ Classic

Maybe at least one motorcycle has achieved sentience and needed to do some online shopping!

“Interesting” Unicode ranges

When searching for interesting Unicode ranges, we could just look for characters in the Unicode emoji ranges. That would be fairly straightforward since there are just a few ranges to match.

But we were curious what other unusual characters were getting used aside from emoji, so we wanted to include other classes of characters. So perhaps we should include everything to start and then exclude the entire class of Unicode “word characters”? That covers the world’s standard characters used for names and addresses, including not just Roman/Latin with optional diacritics, but also other character sets such as Cyrillic, Arabic, Hebrew, Korean, Chinese, Japanese, Devanagari, Thai, and many others.

PostgreSQL POSIX regular expressions include the class “word character” represented by either [[:word:]] or the Perl shorthand \w. I started with that, but found it covered too many things I did want to see, such as the visually double-width Latin characters that are part of the Chinese word character range, and some special numbers.

So I switched back to matching what I want, rather than excluding what I don’t want, and I manually went through the Unicode code charts and noted the ranges to include.

The list of Unicode code ranges I came up with, in hexadecimal, is:

250-2ba
2bc-2c5
2cc-2dc
2de-2ff
58d-58e
fd5-fd8
1d00-1dbf
2070-2079
207b-209f
20d0-2104
2106-2115
2117-215f
2163-218b
2190-2211
2213-266e
2670-2bff
2e00-2e7f
2ff0-2fff
3004
3012-3013
3020
3200-33ff
4dc0-4dff
a000-abf9
fff0-fffc
fffe-1d35f
1d360-1d37f
1d400-1d7ff
1ec70-1ecbf
1ed00-1ed4f
1ee00-1eeff
1f000-10ffff

Those ranges exclude several fairly common characters that people (or their software’s autocorrect) used in their address fields, which we wanted to ignore, such as:

Music sharp sign: ♯ (instead of # before a number)
numero: №
care of: ℅
replacement character: � (though this could be interesting if it reveals unknown encoding errors)
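The effect of these ranges can be tried outside the database. Below is a minimal Python sketch using only a handful of the ranges listed above; the names INTERESTING_RANGES and interesting_chars are made up for the example:

```python
# A subset of the ranges listed above, as (first, last) code points.
INTERESTING_RANGES = [
    (0x2070, 0x2079),
    (0x20D0, 0x2104), (0x2106, 0x2115), (0x2117, 0x215F),
    (0x2213, 0x266E), (0x2670, 0x2BFF),
    (0x1D400, 0x1D7FF),
    (0x1F000, 0x10FFFF),
]


def interesting_chars(text: str) -> list:
    """Return the characters of text that fall in any of the ranges."""
    return [
        ch for ch in text
        if any(first <= ord(ch) <= last for first, last in INTERESTING_RANGES)
    ]


# A superscript digit is reported; the sharp sign, numero, and care-of
# signs fall into the gaps between ranges and are ignored.
print(len(interesting_chars("Jane\u2078")), len(interesting_chars("\u266f\u2116\u2105")))
```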

The SQL query

PostgreSQL allows us to represent Unicode characters in hexadecimal numbers as either \u plus 4 digits or \U plus 8 digits. So the character 2ba is written in a Postgres string as \u02ba and the range fffe-1d35f is written in a Postgres regex range as [\ufffe-\U0001d35f].

With a little scripting to put it all together, I came up with:

SELECT order_number, order_timestamp::date AS order_date,
    fname, lname, company, address1, address2, city, state, b_fname, b_lname, b_company, b_address1, b_address2, b_city, b_state, phone
FROM orders
WHERE concat(fname, lname, company, address1, address2, city, state, b_fname, b_lname, b_company, b_address1, b_address2, b_city, b_state, phone)
    ~ '[\u0250-\u02ba\u02bc-\u02c5\u02cc-\u02dc\u02de-\u02ff\u058d-\u058e\u0fd5-\u0fd8\u1d00-\u1dbf\u2070-\u2079\u207b-\u209f\u20d0-\u2104\u2106-\u2115\u2117-\u215f\u2163-\u218b\u2190-\u2211\u2213-\u266e\u2670-\u2bff\u2e00-\u2e7f\u2ff0-\u2fff\u3004\u3012-\u3013\u3020\u3200-\u33ff\u4dc0-\u4dff\ua000-\uabf9\ufff0-\ufffc\ufffe-\U0001d35f\U0001d360-\U0001d37f\U0001d400-\U0001d7ff\U0001ec70-\U0001ecbf\U0001ed00-\U0001ed4f\U0001ee00-\U0001eeff\U0001f000-\U0010ffff]'
    -- limit how many years back to go
    AND order_timestamp > (SELECT CURRENT_TIMESTAMP - interval '3 years')
    -- exclude any order that had PII expunged for GDPR
    AND expunged_at IS NULL
ORDER BY order_timestamp DESC
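The bracket expression in the query is tedious to type by hand. The original script is not shown, so the following Python sketch is only one possible way to generate it from the list of hexadecimal ranges; pg_escape and pg_char_class are hypothetical helper names:

```python
def pg_escape(code_point_hex: str) -> str:
    """Format one code point as \\uXXXX or, beyond the BMP, \\UXXXXXXXX."""
    code_point = int(code_point_hex, 16)
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04x}"
    return f"\\U{code_point:08x}"


def pg_char_class(ranges) -> str:
    """Turn items like '250-2ba' or '3004' into one bracket expression."""
    parts = []
    for item in ranges:
        start, dash, end = item.partition("-")
        parts.append(pg_escape(start) + (dash + pg_escape(end) if dash else ""))
    return "[" + "".join(parts) + "]"


# Only a few of the ranges from the list, to keep the example short.
print(pg_char_class(["250-2ba", "3004", "fffe-1d35f", "1f000-10ffff"]))
```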

Try a similar query on databases you have access to, and see what interesting user submissions you discover. I always find surprises and in addition to being fun, sometimes we find things that help us improve our input validation and user guidance so that more mistakes are caught when they’re easy for the customer to correct.

Happy holidays!

Generating TOTP QR codes as Unicode text from the command line

2021-10-28T00:00:00+00:00

(QR = “Quick Response” — good to know!)

Python’s QR code generator library qrcode generates QR codes from a secret key and outputs to a terminal using Unicode characters, not a PNG graphic as most other libraries do. We can store that in a text file. This is a neat thing to do, but how is this functionality useful?

Benefits of having Unicode QR code as a text file:

Storing the QR code as a text file takes less disk space than a PNG image.
It is easy to read the QR code over ssh using the cat command; you don’t even have to download the file to your own workstation.
It is simpler to manage QR codes in Git as text files than as PNG images.

This can be used for any kind of QR code, but we have found it especially useful for managing shared multi-factor authentication (MFA, including 2FA for 2-factor authentication) secrets for TOTPs (Time-based One-Time Passwords).

Multi-factor authentication (MFA)

Many services provide a separate account and login for each user so that accounts do not need to be shared, and thus passwords and multi-factor authentication secrets do not need to be shared either. This is ideal, and what we insist on for our most important accounts.

Unfortunately, however, some services provide only a single login per account, or only a single primary account login with the other accounts being limited in serious ways (no access to billing, account management, etc.) so that any business relying on them needs to share the access between several authorized users. A single point of failure in an account login is a serious problem when that one person is unavailable.

TOTP mobile apps

There are many good mobile apps for managing TOTP keys and codes, including Aegis, FreeOTP, Google Authenticator, and many others. Look for one that works with no connection to the outside world, so that you won’t be stuck when off internet & data networks.

Most applications support scanning QR codes with the phone’s camera, or else typing in a secret key to import the accounts.

For those shared accounts with no option for fully empowered individual user accounts, we can convert secret keys into QR codes for easy sharing and easy imports.

First, note that you should never use online QR code generators for MFA secrets! You risk exposing your extra authentication factors and defeating the purpose of your extra work.

Python’s ‘qrcode’ library

The qrcode Python library provides a qr executable that can print your QR code using UTF-8 characters on the console.

Installation

apt install python3-qrcode

Or visit https://pypi.org/project/qrcode/

The contents of the QR code are a URL in the format:

otpauth://totp/{username}?secret={key}&issuer={provider_name}

The provider_name can contain spaces; however, they need to be URL-encoded and entered as %20 for auth to work correctly on iOS. Otherwise, an invalid barcode error will be shown when adding the code.

For example, if you generate the QR code with key JSZE5V4676DZFCUCFW4GLPAHEFDNY447 for the account root@example.com, the resulting command would be:

$ qr "otpauth://totp/Example:root@example.com?secret=JSZE5V4676DZFCUCFW4GLPAHEFDNY447&issuer=Superhost"

Here’s what its output looks like:

Providing the username and issuer will display it properly in the list of configured accounts in your authenticator application. For example: Superhost (Example:root@example.com)
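The URL format described above can also be assembled programmatically, which takes care of the %20 encoding of spaces in the provider name. This is a small sketch using Python's standard library; the function name otpauth_url is an assumption for this example:

```python
from urllib.parse import quote


def otpauth_url(label: str, secret: str, issuer: str) -> str:
    """Build an otpauth://totp/ URL, percent-encoding label and issuer."""
    return "otpauth://totp/{}?secret={}&issuer={}".format(
        quote(label, safe=":@"),   # keep ':' and '@' readable in the label
        secret,
        quote(issuer, safe=""),    # spaces become %20, not '+'
    )


print(otpauth_url("Example:root@example.com",
                  "JSZE5V4676DZFCUCFW4GLPAHEFDNY447",
                  "Super Host"))
```

The resulting string can be passed to the qr command in place of the hand-written URL.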

Reference

See a simple explanation of otpauth URI format used by TOTP.
See RFC 6238 for full details about TOTP.
See RFC 4648 for the base 32 specification used to encode the secret key.
A recent similar Perl implementation Terminal: QR Code with Unicode characters by Flavio Poletti that builds on Text::QRCode, which uses libqrencode.
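Before sharing a QR code, it can be useful to confirm that the secret produces the same codes as the authenticator app. The following is a compact sketch of the computation described in RFC 6238, using only Python's standard library and assuming the common defaults of SHA-1, 30-second steps, and 6 digits; the function name totp is illustrative:

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32: str, for_time=None, digits: int = 6, step: int = 30) -> str:
    """Compute a time-based one-time password from a base 32 secret."""
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(padded)
    now = time.time() if for_time is None else for_time
    counter = int(now // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                      # dynamic truncation
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


# The example key from this article; the code changes every 30 seconds.
print(totp("JSZE5V4676DZFCUCFW4GLPAHEFDNY447"))
```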

Regular Expression Inconsistencies With Unicode

2018-01-23T00:00:00+00:00

A casual stroll through the world of Unicode and regular expressions—Photo by Presidio of Monterey

Character classes in regular expressions are an extremely useful and widespread feature, but there are some relatively recent changes that you might not know of.

The issue stems from how different programming languages, locales, and character encodings treat predefined character classes. Take, for example, the expression \w which was introduced in Perl around the year 1990 (along with \d and \s and their inverted sets \W, \D, and \S).

The \w shorthand is a character class that matches “word characters” as the C language understands them: [a-zA-Z0-9_]. At least when ASCII was the main player in the character encoding scene that simple fact was true. With the standardization of Unicode and UTF-8, the meaning of \w has become foggier.

Perl

Take this example in a recent Perl version:

use 5.012; # use 5.012 or higher includes Unicode support
use utf8;  # necessary for Unicode string literals

print "username" =~ /^\w+$/; # 1
print "userاسم"  =~ /^\w+$/; # 1

Perl is treating \w differently here because the characters “اسم” (“ism” meaning “name” in Arabic) definitely don’t fall within [a-zA-Z0-9_]!

Beginning with Perl 5.12 from the year 2010, character classes are handled differently. Documentation on the topic is found in perlrecharclass. The rules aren’t as simple as with some languages, but can be generalized as such:

\w will match Unicode characters with the “Word” property (equivalent to \p{Word}), unless the /a (ASCII) flag is enabled, in which case it will be equivalent to the original [a-zA-Z0-9_].

Let’s see the /a flag in action.

use 5.012;
use utf8;

print "username" =~ /^\w+$/a; # 1
print "userاسم"  =~ /^\w+$/a; # 0

However, you should know that for code points below 256, these rules can change depending on whether Unicode or locale rules are on, so if you’re unsure, consult perlre and perlrecharclass.

Keep in mind that these same questions of what the character classes include can apply to every predefined character class in whatever language you’re using, so remember to check language-specific implementations for other character class shorthands, such as \s and \d, not just \w.

Every language seems to do regular expressions a little bit differently, so here’s a short, incomplete guide for several other languages we use frequently.

Python

Take this example in Python 3.6.2:

>>> re.match(r'^\w+$', 'username')
<_sre.SRE_Match object; span=(0, 8), match='username'>
>>> re.match(r'^\w+$', 'userاسم')
<_sre.SRE_Match object; span=(0, 7), match='userاسم'>

Python is also treating \w differently here. Let’s take a look at the Python docs:

\w

For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

So \w includes “most characters that can be part of a word in any language, as well as numbers and the underscore”. A list of the characters that includes is difficult to pin down, so it would be best to use the re.ASCII flag as suggested when you’re unsure if you want letters from other languages matched:

>>> re.match(r'^\w+$', 'userاسم',  flags=re.ASCII)
>>> re.match(r'^\w+$', 'username', flags=re.ASCII)
<_sre.SRE_Match object; span=(0, 8), match='username'>

Ruby

Ruby’s Regexp class documentation gives a simple and useful explanation: backslash character classes (e.g. \w, \s, \d) are ASCII-only, while POSIX-style bracket expressions (e.g. [[:alnum:]]) include other Unicode characters.

irb(main):001:0> /^\w+$/         =~ "userاسم"
=> nil
irb(main):002:0> /^[[:word:]]+$/ =~ "userاسم"
=> 0

JavaScript

JavaScript doesn’t support POSIX-style bracket expressions, and its backslash character classes are simple, straightforward lists of ASCII characters. The MDN has simple explanations for each one.

JavaScript regular expressions do accept a /u flag, but it does not affect shorthand character classes. Consider these examples in Node.js:

> /^\w+$/.test("username");
true
> /^\w+$/.test("userاسم");
false
> /^\w+$/u.test("username");
true
> /^\w+$/u.test("userاسم");
false

We can see that the /u flag has no effect on what \w matches. Now let’s look at Unicode character lengths in JavaScript:

> '❤'.length
1
> '👩'.length
2
> '🀄️'.length
3

Because of the way Unicode is implemented in JavaScript, strings with Unicode characters outside the BMP (Basic Multilingual Plane) will appear to be longer than they are.

This can be accounted for in regular expressions with the /u flag, which only corrects character parsing, and does not affect shorthand character classes:

> let mystr = "hi👩there";
undefined
> mystr.length
9
> /hi.there/.test(mystr);
false
> /hi..there/.test(mystr);
true
> /hi.there/u.test(mystr);  // note the /u from here on
true
> /hi..there/u.test(mystr);
false
> /hi..there/u.test("hi👩👩there");
true

The excellent article "💩".length === 2 by Jonathan New goes into detail about why this is, and explores various solutions. It also addresses some legacy inconsistencies, like how the old HEAVY BLACK HEART character and other older Unicode symbols might be represented differently.
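The lengths JavaScript reports above are counts of UTF-16 code units, not characters. As an illustration, this Python sketch (the helper name utf16_units is made up) reproduces the same numbers by encoding to UTF-16 and counting 2-byte units:

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's .length counts."""
    return len(text.encode("utf-16-le")) // 2


heart = "\u2764"                 # one code point inside the BMP
woman = "\U0001F469"             # one code point outside the BMP
mahjong = "\U0001F004\uFE0F"     # non-BMP code point plus a variation selector

for sample in (heart, woman, mahjong):
    print(len(sample), utf16_units(sample))
```

Python counts code points, so the first number in each line is the character count and the second is what JavaScript would report.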

PHP

PHP’s documentation explains that \w matches letters, digits, and the underscore as defined by your locale. It’s not totally clear about how Unicode is treated, but it uses the PCRE (Perl Compatible Regular Expressions) library which supports a /u flag that can be used to enable Unicode matching in character classes:

<?php

echo preg_match("/^\\w+$/", "username"), "\n";  # 1
echo preg_match("/^\\w+$/", "userاسم"),  "\n";  # 0

echo preg_match("/^\\w+$/u", "username"), "\n"; # 1
echo preg_match("/^\\w+$/u", "userاسم"),  "\n"; # 1

.NET

The .NET Quick Reference has a comprehensive guide to character classes. For word characters, it defines a specific group of Unicode categories including letters, modifiers, and connectors from many languages, but also points out that setting the ECMAScript Matching Behavior option will limit \w to [a-zA-Z_0-9], among other things. Microsoft’s documentation is clear and comprehensive with great examples, so I recommend referring to it frequently.

Go

Go follows the regular expression syntax used by Google’s RE2 engine, which has easy syntax for specifying whether you want Unicode characters to be captured or not:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Perl-style
	fmt.Println(regexp.MatchString(`^\w+$`, "username")) // true
	fmt.Println(regexp.MatchString(`^\w+$`, "userاسم"))  // false

	// POSIX-style
	fmt.Println(regexp.MatchString(`^[[:word:]]+$`, "username")) // true
	fmt.Println(regexp.MatchString(`^[[:word:]]+$`, "userاسم"))  // false

	// Unicode character class
	fmt.Println(regexp.MatchString(`^\pL+$`, "username")) // true
	fmt.Println(regexp.MatchString(`^\pL+$`, "userاسم"))  // true
}

You can see this code in action here.

grep

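Implementations of grep vary widely across platforms and versions. On my personal computer with GNU grep 3.1, the pattern doesn’t match at all with default settings (in basic regular expressions an unescaped + is a literal character, so the quantifier would have to be written \+), \w matches only ASCII characters with the -P (PCRE) option, and it matches Unicode characters with -E: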

[phin@caballero ~]$ grep    "^\w+$" <(echo "username")  # no match
[phin@caballero ~]$ grep -P "^\w+$" <(echo "username")
username
[phin@caballero ~]$ grep -P "^\w+$" <(echo "userاسم")   # no match
[phin@caballero ~]$ grep -E "^\w+$" <(echo "username")
username
[phin@caballero ~]$ grep -E "^\w+$" <(echo "userاسم")
userاسم

Again, implementations vary a lot, so double check on your system before doing anything important.

Postgres migrating SQL_ASCII to UTF-8 with fix_latin

2017-07-21T00:00:00+00:00

(photograph by NOAA National Ocean Service)

Upgrading Postgres is not quite as painful as it used to be, thanks primarily to the pg_upgrade program, but there are times when it simply cannot be used. We recently had an existing End Point client come to us requesting help upgrading from their current Postgres database (version 9.2) to the latest version (9.6, but soon to be 10). They also wanted to finally move away from their SQL_ASCII encoding to UTF-8. As this meant that pg_upgrade could not be used, we also took the opportunity to enable checksums (this change cannot be done via pg_upgrade). Finally, they were moving their database server to new hardware. There were many lessons learned and bumps along the way for this migration, but for this post I’d like to focus on one of the most vexing problems, the database encoding.

When a Postgres database is created, it is set to a specific encoding. The most common one (and the default) is “UTF8”. This covers 99% of all users’ needs. The second most common one is the poorly-named “SQL_ASCII” encoding, which should be named “DANGER_DO_NOT_USE_THIS_ENCODING”, because it causes nothing but trouble. The SQL_ASCII encoding basically means no encoding at all, and simply stores any bytes you throw at it. This usually means the database ends up containing a whole mess of different encodings, creating a “byte soup” that will be difficult to sanitize by moving to a real encoding (i.e. UTF-8).

Many tools exist which convert text from one encoding to another. One of the most popular ones on Unix boxes is “iconv”. Although this program works great if your source text is using one encoding, it fails when it encounters byte soup.

For this migration, we first did a pg_dump from the old database to a newly created UTF-8 test database, just to see which tables had encoding problems. Quite a few did—but not all of them!—so we wrote a script to import tables in parallel, with some filtering for the problem ones. As mentioned above, iconv was not particularly helpful: looking at the tables closely showed evidence of many different encodings in each one: Windows-1252, ISO-8859-1, Japanese, Greek, and many others. There were even large bits that were plainly binary data (e.g. images) that simply got shoved into a text field somehow. This is the big problem with SQL_ASCII: it accepts everything, and does no validation whatsoever. The iconv program simply could not handle these tables, even when adding the //IGNORE option.

To better explain the problem and the solution, let’s create a small text file with a jumble of encodings. Discussions of how UTF-8 represents characters, and its interactions with Unicode, are avoided here, as Unicode is a dense, complex subject, and this article is dry enough already. :)

First, we want to add some items using the encodings ‘Windows-1252’ and ‘Latin-1’. These encoding systems were attempts to extend the basic ASCII character set to include more characters. As these encodings pre-date the invention of UTF-8, they do it in a very inelegant (and incompatible) way. Use of the “echo” command is a great way to add arbitrary bytes to a file as it allows direct hex input:

$ echo -e "[Windows-1252]   Euro: \x80   Double dagger: \x87" > sample.data
$ echo -e "[Latin-1]   Yen: \xa5   Half: \xbd" >> sample.data
$ echo -e "[Japanese]   Ship: \xe8\x88\xb9" >> sample.data
$ echo -e "[Invalid UTF-8]  Blob: \xf4\xa5\xa3\xa5" >> sample.data

This file looks ugly. Notice all the “wrong” characters when we simply view the file directly:

$ cat sample.data
[Windows-1252]   Euro: �   Double dagger: �
[Latin-1]   Yen: �   Half: �
[Japanese]   Ship: 船
[Invalid UTF-8]  Blob: ����

Running iconv is of little help:

## With no source encoding given, it errors on the Euro:
$ iconv -t utf8 sample.data >/dev/null
iconv: illegal input sequence at position 23

## We can tell it to ignore those errors, but it still barfs on the blob:
$ iconv -t utf8//ignore sample.data >/dev/null
iconv: illegal input sequence at position 123

## Telling it the source is Windows-1252 fixes some things, but still sinks the Ship:
$ iconv -f windows-1252 -t utf8//ignore sample.data
[Windows-1252]   Euro: €   Double dagger: ‡
[Latin-1]   Yen: ¥   Half: ½
[Japanese]   Ship: èˆ¹
[Invalid UTF-8]  Blob: ô¥£¥

After testing a few other tools, we discovered the nifty Encoding::FixLatin, a Perl module which provides a command-line program called “fix_latin”. Rather than being authoritative like iconv, it tries its best to fix things up with educated guesses. Its documentation gives a good summary:

The script acts as a filter, taking source data which may contain a mix of ASCII, UTF8, ISO8859-1 and CP1252 characters, and producing output will be all ASCII/UTF8.

Multi-byte UTF8 characters will be passed through unchanged (although over-long UTF8 byte sequences will be converted to the shortest normal form). Single byte characters will be converted as follows:

0x00 - 0x7F  ASCII - passed through unchanged
0x80 - 0x9F  Converted to UTF8 using CP1252 mappings
0xA0 - 0xFF  Converted to UTF8 using Latin-1 mappings
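To make the quoted rules concrete, here is a minimal sketch of the same byte-range idea using only the core Encode module. The function name fix_mixed_bytes is hypothetical, and this is a simplified illustration, not the code Encoding::FixLatin actually uses (for instance, it does not shorten over-long sequences).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: a simplified take on the byte-range rules quoted above.
# Input and output are both plain byte strings.
sub fix_mixed_bytes {
    my ($bytes) = @_;
    my $out = '';
    while (length $bytes) {
        if ($bytes =~ s/\A([\x00-\x7F]+)//) {
            # 0x00 - 0x7F: ASCII passes through unchanged
            $out .= $1;
        }
        elsif ($bytes =~ s/\A([\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3})//) {
            # Shaped like a multi-byte UTF-8 sequence: pass through unchanged
            $out .= $1;
        }
        else {
            # Lone high byte: 0x80 - 0x9F is read as CP1252, 0xA0 - 0xFF as Latin-1
            my $byte = substr($bytes, 0, 1, '');
            my $from = ord($byte) <= 0x9F ? 'cp1252' : 'iso-8859-1';
            $out .= encode('UTF-8', decode($from, $byte));
        }
    }
    return $out;
}

# The Windows-1252 Euro sign (0x80) becomes the three UTF-8 bytes e2 82 ac
printf "%vx\n", fix_mixed_bytes("\x80");
```

Note that a check this loose only looks at the shape of a multi-byte sequence, so the four-byte blob from the sample file would pass straight through it untouched.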

While fix_latin works great for fixing the Windows-1252 and Latin-1 problems (and thus accounted for at least 95% of our tables’ bad encodings), it still allows “invalid” UTF-8 to pass on through, which means that Postgres will still refuse to accept it. Let’s check our test file:

$ fix_latin sample.data
[Windows-1252]   Euro: €   Double dagger: ‡
[Latin-1]   Yen: ¥   Half: ½
[Japanese]   Ship: 船
[Invalid UTF-8]  Blob: ����

## Postgres will refuse to import that last part:
$ echo "SELECT E'"  "$(fix_latin sample.data)"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

## Even adding iconv is of no help:
$ echo "SELECT E'"  "$(fix_latin sample.data | iconv -t utf-8)"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

The UTF-8 specification is rather dense and puts many requirements on encoders and decoders. How well programs implement these requirements (and optional bits) varies, of course. But at the end of the day, we needed that data to go into a UTF-8 encoded Postgres database without complaint. When in doubt, go to the source! The relevant file in the Postgres source code responsible for rejecting bad UTF-8 (as in the examples above) is src/backend/utils/mb/wchar.c. Analyzing that file shows a small but elegant piece of code whose job is to ensure only “legal” UTF-8 is accepted:

bool
pg_utf8_islegal(const unsigned char *source, int length)
{
  unsigned char a;

  switch (length)
  {
    default:
      /* reject lengths 5 and 6 for now */
      return false;
    case 4:
      a = source[3];
      if (a < 0x80 || a > 0xBF)
        return false;
      /* FALL THRU */
    case 3:
      a = source[2];
      if (a < 0x80 || a > 0xBF)
        return false;
      /* FALL THRU */
    case 2:
      a = source[1];
      switch (*source)
      {
        case 0xE0:
          if (a < 0xA0 || a > 0xBF)
            return false;
          break;
        case 0xED:
          if (a < 0x80 || a > 0x9F)
            return false;
          break;
        case 0xF0:
          if (a < 0x90 || a > 0xBF)
            return false;
          break;
        case 0xF4:
          if (a < 0x80 || a > 0x8F)
            return false;
          break;
        default:
          if (a < 0x80 || a > 0xBF)
            return false;
          break;
      }
      /* FALL THRU */
    case 1:
      a = *source;
      if (a >= 0x80 && a < 0xC2)
        return false;
      if (a > 0xF4)
        return false;
      break;
  }
  return true;
}

Now that we know the UTF-8 rules for Postgres, how do we ensure our data follows them? While we could have made another standalone filter to run after fix_latin, that would increase the migration time. So I made a quick patch to the fix_latin program itself, rewriting that C logic in Perl. A new option “--strict-utf8” was added. Its job is to simply enforce the rules found in the Postgres source code. If a character is invalid, it is replaced with a question mark (there are other choices for a replacement character, but we decided simple question marks were quick and easy, and the surrounding data was unlikely to be read or even used anyway).
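For readers who want to see what such a port might look like, here is a rough Perl rendering of the C checks shown above. It is only a sketch: the name utf8_is_legal is hypothetical, and this is not the actual patch that went into fix_latin. Like the C function, it expects the caller to pass exactly one candidate sequence.

```perl
use strict;
use warnings;

# Hypothetical Perl rendering of pg_utf8_islegal(); not the real fix_latin patch.
# Takes a byte string holding one candidate sequence, returns 1 if legal, else 0.
sub utf8_is_legal {
    my ($seq) = @_;
    my @b   = unpack 'C*', $seq;
    my $len = @b;

    # Only lengths 1 through 4 are accepted
    return 0 if $len < 1 || $len > 4;

    # Third and fourth bytes, when present, must be continuation bytes
    for my $i (2 .. $len - 1) {
        return 0 if $b[$i] < 0x80 || $b[$i] > 0xBF;
    }

    # The allowed range of the second byte depends on the lead byte
    if ($len >= 2) {
        my ($min, $max) = (0x80, 0xBF);
        if    ($b[0] == 0xE0) { $min = 0xA0 }
        elsif ($b[0] == 0xED) { $max = 0x9F }
        elsif ($b[0] == 0xF0) { $min = 0x90 }
        elsif ($b[0] == 0xF4) { $max = 0x8F }
        return 0 if $b[1] < $min || $b[1] > $max;
    }

    # Lead byte: no bare continuation bytes, no 0xC0/0xC1, nothing above 0xF4
    return 0 if $b[0] >= 0x80 && $b[0] < 0xC2;
    return 0 if $b[0] > 0xF4;
    return 1;
}

print utf8_is_legal("\xe8\x88\xb9")     ? "ship: legal\n" : "ship: illegal\n";
print utf8_is_legal("\xf4\xa5\xa3\xa5") ? "blob: legal\n" : "blob: illegal\n";
```

The blob from the sample file fails because its lead byte is 0xF4, which only allows a second byte between 0x80 and 0x8F.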

Voila! All of the data was now going into Postgres without a problem. Observe:

$ echo "SELECT E'"  "$(fix_latin  sample.data)"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

$ echo "SELECT E'"  "$(fix_latin --strict-utf8 sample.data)"  "';" | psql
                   ?column?
----------------------------------------------
  [Windows-1252]   Euro: €   Double dagger: ‡+
 [Latin-1]   Yen: ¥   Half: ½                +
 [Japanese]   Ship: 船                       +
 [Invalid UTF-8]  Blob: ????
(1 row)

What are the lessons here? First and foremost, never use SQL_ASCII. It’s outdated, dangerous, and will cause much pain down the road. Second, there are an amazing number of client encodings in use, especially for old data, but the world has pretty much standardized on UTF-8 these days, so even if you are stuck with SQL_ASCII, the amount of Windows-1252 and other monstrosities will be small. Third, don’t be afraid to go to the source. If Postgres is rejecting your data, it’s probably for a very good reason, so find out exactly why. There were other challenges to overcome in this migration, but the encoding was certainly one of the most interesting ones. Everyone, the client and us, is very happy to finally have everything using UTF-8!

Bucardo, and Coping with Unicode

2014-03-12T00:00:00+00:00

Given the recent DBD::Pg 3.0.0 release, with its improved Unicode support, it seemed like a good time to work on a Bucardo bug we’ve wanted fixed for a while. Although Bucardo will replicate Unicode data without a problem, it runs into difficulties when table or column names in the database include non-ASCII characters. Teaching Bucardo to handle Unicode data has been an interesting exercise.

Without information about its encoding, string data at its heart is meaningless. Programs that exchange string information without paying attention to the encoding end up with problems exactly like that described in the bug, with nonsense characters all over. Further, it’s impossible even to compare two different strings reliably. So not only would Bucardo’s logs and program output contain junk data, Bucardo would simply fail to find database objects that clearly existed, because it would end up querying for the wrong object name, or the keys of the hashes it uses internally would be meaningless. Even communication between different Bucardo processes needs to be decoded correctly. The recent DBD::Pg 3.0.0 release takes care of decoding strings sent from PostgreSQL, but other inputs, such as command-line arguments, must be treated individually. All output handles, such as STDOUT, STDERR, and the log file output, must be told to expect data in a particular encoding to ensure their output is handled correctly.
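As a small illustration of treating inputs and outputs individually, the sketch below decodes command-line arguments and sets an encoding layer on the output handles. It is not taken from Bucardo; the helper name decode_args is hypothetical, and it assumes the shell hands over arguments as UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw argument bytes into Perl character strings.
# Assumption: the arguments arrive encoded as UTF-8.
sub decode_args {
    my @raw = @_;
    return map { decode('UTF-8', $_) } @raw;
}

my @args = decode_args(@ARGV);

# Tell the output handles which encoding to write
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

# length() now counts characters rather than bytes
print "Argument: $_ (", length($_), " characters)\n" for @args;
```

Once decoded, a name typed on the command line and the same name fetched from the database compare equal character by character, instead of one being a byte string and the other a character string.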

The first step is to build a test case. Bucardo’s test suite is quite comprehensive, and easy to use. For starters, I’ll make a simple test that just creates a table, and tries to tell Bucardo about it. The test suite will already create databases and install Bucardo for me; I can talk to those databases with handles $dbhA and $dbhB. Note that in this case, although the table and primary key names contain non-ASCII characters, the relgroup and sync names do not. That will require further programming. The character in the primary key name, incidentally, is a staff of Aesculapius, which I don’t recommend people include in the name of a typical primary key.

for my $dbh (($dbhA, $dbhB)) {
    $dbh->do(qq/CREATE TABLE test_büçárđo ( pkey_\x{2695} INTEGER PRIMARY KEY, data TEXT );/);
    $dbh->commit;
}

like $bct->ctl('bucardo add table test_büçárđo db=A relgroup=unicode'),
    qr/Added the following tables/, "Added table in db A";
like($bct->ctl("bucardo add sync test_unicode relgroup=unicode dbs=A:source,B:target"),
    qr/Added sync "test_unicode"/, "Create sync from A to B")
    or BAIL_OUT "Failed to add test_unicode sync";

Having created database objects and configured Bucardo, the next part of the test starts Bucardo, inserts some data into the master database “A”, and tries to replicate it:

$dbhA->do("INSERT INTO test_büçárđo (pkey_\x{2695}, data) VALUES (1, 'Something')");
$dbhA->commit;

## Get Bucardo going
$bct->restart_bucardo($dbhX);

## Kick off the sync.
my $timer_regex = qr/\[0\s*s\]\s+(?:[\b]{6}\[\d+\s*s\]\s+)*/;
like $bct->ctl('kick sync test_unicode 0'),
    qr/^Kick\s+test_unicode:\s+${timer_regex}DONE!/,
    'Kick test_unicode' or die 'Sync failed, no point continuing';

my $res = $dbhB->selectall_arrayref('SELECT * FROM test_büçárđo');
ok($#$res == 0 && $res->[0][0] == 1 && $res->[0][1] eq 'Something', 'Replication worked');

Given that DBD::Pg handles the encodings for strings from the database, I only need to make a few other changes. I added a few lines to the preamble of some files, to deal with UTF8 elements in the code itself, and to tell input and output pipes to expect UTF8 data.

use utf8;
use open qw( :std :utf8 );

In some cases, I also had to add a couple more modules, and explicitly decode incoming values. For instance, the test suite repeatedly runs shell commands to configure and manage test instances of Bucardo. There, too, the output needs to be decoded correctly:

    debug("Script: $ctl Connection options: $connopts Args: $args", 3);
-   $info = qx{$ctl $connopts $args 2>&1};
+   $info = decode( locale => qx{$ctl $connopts $args 2>&1} );
    debug("Exit value: $?", 3);

And with that, now Bucardo accepts non-ASCII table names.

[~/devel/bucardo]$ prove t/10-object-names.t 
t/10-object-names.t .. ok     
All tests successful.
Files=1, Tests=20, 24 wallclock secs ( 0.01 usr  0.01 sys +  2.01 cusr  0.22 csys =  2.25 CPU)
Result: PASS

DBD::Pg 3.0.0 and the utf8 flag

2014-02-19T00:00:00+00:00

One of the major changes in the recently released 3.0 version of DBD::Pg (the Perl driver for PostgreSQL) was the handling of UTF-8 strings. Previously, you had to make sure to always set the mysterious “pg_enable_utf8” attribute. Now, everything should simply work as expected without any adjustments.

When using an older DBD::Pg (version 2.x), any data coming back from the database was treated as a plain old string. Perl strings have an internal flag called “utf8” that tells Perl that the string should be treated as containing UTF-8. The only way to get this flag turned on was to set the pg_enable_utf8 attribute to true before fetching your data from the database. When this flag was on, each returned string was scanned for high bit characters, and if found, the utf8 flag was set on the string. The Postgres server_encoding and client_encoding values were never consulted, so this one attribute was the only knob available. Here is a sample program we will use to examine the returned strings. The handy Data::Peek module will help us see if the string has the utf8 flag enabled.

#!perl
use strict;
use warnings;
use utf8;
use charnames ':full';
use DBI;
use Data::Peek;
use lib 'blib/lib', 'blib/arch';

## Do our best to represent the output faithfully
binmode STDOUT, ":encoding(utf8)";

my $DSN = 'DBI:Pg:dbname=postgres';
my $dbh = DBI->connect($DSN, '', '', {AutoCommit=>0,RaiseError=>1,PrintError=>0})
  or die "Connection failed!\n";                                            
print "DBI is version $DBI::VERSION, DBD::Pg is version $DBD::Pg::VERSION\n";

## Create some Unicode strings (perl strings with the utf8 flag enabled)
my %dm = (
    dotty  => "\N{CADUCEUS}",
    chilly => "\N{SNOWMAN}",
    stuffy => "\N{DRAGON}",
    lambie => "\N{SHEEP}",
);

## Show the strings both before and after a trip to the database
for my $x (sort keys %dm) {
    print "\nSending $x ($dm{$x}) to the database. Length is " . length($dm{$x}) . "\n";                                                                    
    my $SQL = qq{SELECT '$dm{$x}'::TEXT};             
    my $var = $dbh->selectall_arrayref($SQL)->[0][0];
    print "Database gave us back ($var) with a length of " . length($var) . "\n";
    print DPeek $var;
    print "\n";
}

Let’s check out an older version of DBD::Pg and run the script:

$ cd dbdpg.git; git checkout 2.18.1; perl Makefile.PL; make
$ perl dbdpg_unicode_test.pl
DBI is version 1.628, DBD::Pg is version 2.18.1

Sending chilly (☃) to the database. Length is 1
Database gave us back (â) with a length of 3
PV("\342\230\203"\0)

Sending dotty (☤) to the database. Length is 1
Database gave us back (â¤) with a length of 3
PV("\342\230\244"\0)

Sending lambie (🐑) to the database. Length is 1
Database gave us back (ð) with a length of 4
PV("\360\237\220\221"\0)

Sending stuffy (🐉) to the database. Length is 1
Database gave us back (ð) with a length of 4
PV("\360\237\220\211"\0)

The first thing you may notice is that not all of the Unicode symbols appear as expected. They should be tiny but legible versions of a snowman, a caduceus, a sheep, and a dragon. The fact that they do not appear properly everywhere indicates we have a way to go before the world is Unicode ready. When writing this, only chilly and dotty appeared correctly on my terminal. The blog editing textarea showed chilly, dotty, and lambie. The final blog in Chrome showed only chilly and dotty! Obviously, your mileage may vary, but all of those are legitimate Unicode characters.

The second thing to notice is how badly the length of the string is computed once it comes back from the database. Each string is one character long, and goes in that way, but comes back longer. Which means the utf8 flag is off - this is confirmed by a lack of a UTF8 section in the DPeek output. We can get the correct output by setting the pg_enable_utf8 attribute after connecting, like so:

...
my $dbh = DBI->connect($DSN, '', '', {AutoCommit=>0,RaiseError=>1,PrintError=>0})
  or die "Connection failed!\n";
## Needed for older versions of DBD::Pg.
## This is the same as setting it to 1 for DBD::Pg 2.x - see below
$dbh->{pg_enable_utf8} = -1;
...

Once we do that, DBD::Pg will add the utf8 flag to any returned string, regardless of the actual encoding, as long as there is a high bit in the string. The output will now look like this:


Sending chilly (☃) to the database. Length is 1
Database gave us back (☃) with a length of 1
PVMG("\342\230\203"\0) [UTF8 "\x{2603}"]

Now our snowman has the correct length, and Data::Peek shows us that it has a UTF8 section. However, it’s not a great solution, because it ignores client_encoding, has to scan every single string, and because it means having to always remember to set an obscure attribute in your code every time you connect. Version 3.0.0 and up will check your client_encoding, and as long as it is UTF-8 (and it really ought to be!), it will automatically return strings with the utf8 flag set. Here is our snowman test on 3.0.0 with no explicit setting of pg_enable_utf8:

$ git checkout 3.0.0; perl Makefile.PL; make
$ perl dbdpg_unicode_test.pl
DBI is version 1.628, DBD::Pg is version 3.0.0

Sending chilly (☃) to the database. Length is 1
Database gave us back (☃) with a length of 1
PVMG("\342\230\203"\0) [UTF8 "\x{2603}"]

This new automatic detection is the same as setting pg_enable_utf8 to -1. Setting it to 0 will prevent the utf8 flag from ever being set, while setting it to 1 will cause the flag to always be set. Setting it to anything but -1 should be extremely rare in production and used with care.
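The effect of the flag on length() can be reproduced without a database. The following sketch uses the core Encode module to stand in for what the driver does when it hands back a flagged string; the describe helper is hypothetical and exists only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three UTF-8 bytes of the snowman, as an unflagged byte string
my $bytes = "\xe2\x98\x83";

# Decoding yields a single character, U+2603, with the utf8 flag on
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper that reports length and flag state for a string
sub describe {
    my ($str) = @_;
    return sprintf 'length %d, utf8 flag %s',
        length($str), utf8::is_utf8($str) ? 'on' : 'off';
}

print 'bytes: ', describe($bytes), "\n";   # length 3, utf8 flag off
print 'chars: ', describe($chars), "\n";   # length 1, utf8 flag on
```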

Common Questions

What happens if I set pg_enable_utf8 = -1 on older versions of DBD::Pg?

Prior to DBD::Pg 3.0.0, the pg_enable_utf8 attribute was a simple boolean, so setting it to anything other than 0 will set it to true. In other words, setting it to -1 is the same as setting it to 1. If you must support older versions of DBD::Pg, -1 is a good setting.

Why does DBD::Pg flag everything as utf8, including simple ASCII strings with no high bit characters?

The lovely thing about the UTF-8 scheme is that ASCII data fits nicely inside it with no changes. In other words, a bare ASCII string is still valid UTF-8; it simply doesn’t have any high-bit characters. So rather than read each string as it comes back from the database and determine if it must be flagged as utf8, DBD::Pg simply flags every string as utf8 because it can. Every string may or may not contain actual non-ASCII characters, but either way we flag it because it may contain them, and that is good enough. This saves us a bit of time and effort, as we no longer have to scan every single byte coming back from the database. This decision to mark everything as utf8 instead of only non-ASCII strings was the most contentious decision when this new version was being developed.

Why is UTF-8 the only client_encoding that is treated specially?

There are two important reasons why we only look at UTF-8. First, the utf8 flag is the only flag Perl strings have, so there is no way of marking a string as any other type of encoding. Second, UTF-8 is unique inside Postgres as it is the universal client_encoding, which has a mapping from nearly every supported server_encoding. In other words, no matter what your server_encoding is set to, setting your client_encoding to UTF-8 is always a safe bet. It’s pretty obvious at this point that UTF-8 has won the encoding wars, and is the de-facto encoding standard for Unicode.

When is the client_encoding checked? What if I change it?

The value of client_encoding is only checked when DBD::Pg first connects. Rechecking this seldom-changed attribute would be quite costly, but there is a way to signal DBD::Pg. If you really want to change the value of client_encoding after you connect, just set the pg_enable_utf8 attribute to -1, and it will cause DBD::Pg to re-read the client_encoding and start setting the utf8 flags accordingly.

What about arrays?

Arrays are handled as expected too. Arrays are unwrapped and turned into an array reference, in which the individual strings within it have the utf8 flag set. Example code:

...
for my $x (sort keys %dm) {

    print "\nSending $x ($dm{$x}) to the database. Length is " . length($dm{$x}) . "\n";
    my $SQL = qq{SELECT ARRAY['$dm{$x}']};
    my $var = $dbh->selectall_arrayref($SQL)->[0][0];
    print "Database gave us back ($var) with a length of " . length($var) . "\n";
    $var = pop @$var;
    print "Inner array ($var) has a length of " . length($var) . "\n";
    print DPeek $var;
    print "\n";
}

DBI is version 1.628, DBD::Pg is version 3.0.0

Sending chilly (☃) to the database. Length is 1
Database gave us back (ARRAY(0x90c555c)) with a length of 16
Inner array (☃) has a length of 1
PVMG("\342\230\203"\0) [UTF8 "\x{2603}"]

Why is Unicode so hard?

Partly because human languages are a vast and complex system, and partly because we painted ourselves into a corner a bit in the early days of computing. Some of the statements presented above have been over-simplified. Unicode is much more than just using UTF-8 properly. The utf8 flag in Perl strings does not mean quite the same thing as a UTF-8 encoding. Interestingly, Perl even makes a distinction between “UTF8” and “UTF-8”. It’s quite a mess, but at the end of the day, Unicode support is far better in Perl than any other language.

The Un-unaccentable Character

2013-08-18T00:00:00+00:00

I typed “Unicode” into an online translator, and it responded saying it had no idea what the language was but it roughly translates to “Surprise!”

Recently a client sent over a problem getting some of their Postgres data through an ASCII-only ETL process. They only needed to worry about some occasional accent marks, and not any of the more uncommon or odd Unicode characters, thankfully. ☺ Or so we thought. The unaccent extension was a great starting point, but the problem they sent over boiled down to this:

postgres=# SELECT unaccent('e é ѐ');
 unaccent 
----------
 e e ѐ
(1 row)

unaccent() worked, except for that odd ѐ, which then failed the ETL task. That’s exactly what unaccent is supposed to handle. The character è even appears in the unaccent.rules file. So what gives?

Well, if you’re in the habit of piping blog posts through hexdump (and who isn’t?) then you probably already know the answer. But even if not, you may already suspect that we’re dealing with a different character that just looks the same. And you’d be right. Specifically, the è in the rules file is from the more common Latin set, and the ѐ that doesn’t work is from the Cyrillic set. Pretty much visually identical, but completely separate characters.
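If hexdump is not your thing, a few lines of Perl can tell look-alike characters apart by code point and official name. This is only a sketch: the helper name codepoint_report is hypothetical, and it assumes the two characters in question are U+00E8 and U+0450.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: list the code point and Unicode name of each character
sub codepoint_report {
    my ($str) = @_;
    my @lines;
    for my $char (split //, $str) {
        my $cp   = ord($char);
        my $name = charnames::viacode($cp) // 'UNKNOWN';
        push @lines, sprintf('U+%04X %s', $cp, $name);
    }
    return @lines;
}

# Escapes are used here so the example does not depend on the file's own encoding
print "$_\n" for codepoint_report("\x{E8}\x{450}");
# U+00E8 LATIN SMALL LETTER E WITH GRAVE
# U+0450 CYRILLIC SMALL LETTER IE WITH GRAVE
```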

Augmenting the unaccent dictionary:

Speaking more generically, ideally a simple UPDATE statement with a replace() will correct it in the source data. And a trigger doing the same will keep it tidy from that point forward.

But if you can’t or just don’t want to go down that path, the unaccent extension dictionary can be edited. On my system it’s found in /usr/share/postgresql/9.3/tsearch_data/unaccent.rules. It has a very simple format.

Make a copy of the file before you edit it. Updated packages or new deployments if you’re compiling from source will wipe out any changes to the unaccent.rules file.
```
root:~# cp /usr/share/postgresql/9.3/tsearch_data/{unaccent,extended}.rules
```
Add a line including the character to translate. To handle our example above, add:
```
ѐ e
```
In Postgres, create a new dictionary to load in those rules.
```
db=# CREATE TEXT SEARCH DICTIONARY extended (TEMPLATE=unaccent, RULES='extended');
CREATE TEXT SEARCH DICTIONARY
```
Note that ’extended’ above will point to the extended.rules file.

Call unaccent() specifying the newly added dictionary:

db=# SELECT unaccent('extended', 'e é ѐ');
 unaccent 
----------
 e e e
(1 row)

Note that subsequent changes won’t automatically appear. To update the in-database version, after you make any changes to the rules file run:
```
db=# ALTER TEXT SEARCH DICTIONARY extended (RULES='extended');
ALTER TEXT SEARCH DICTIONARY
```

Perl, UTF-8, and binmode on filehandles

2012-02-21T00:00:00+00:00


I recently ran into a Perl quirk involving UTF-8, standard filehandles, and the built-in Perl die() and warn() functions. Someone reported a bug in the check_postgres program in which the French output was displaying incorrectly. That is, when the locale was set to fr_FR, the French accented characters generated by the program were coming out as “byte soup” instead of proper UTF-8. Some other languages, English and Japanese among them, seemed to be fine. For example:

## English: "sorry, too many clients already"
## Japanese: "現在クライアント数が多すぎます"
## French expected: "désolé, trop de clients sont déjà connectés"
## French actual: "d�sol�, trop de clients sont d�j� connect�s"

That last line should be very familiar to anyone who has struggled with Unicode on a command line, with those question marks on an inverted background. Our problem was that the output of the script looked like the last line, rather than the one before it. The Japanese output, despite being chock full of Unicode, does not have the same problem! More on that later.

I was able to duplicate the problem easily enough by setting my locale to fr_FR and having check_postgres output a message with some non-ASCII characters in it. However, as noted above, some languages were fine, some were not.

Before going any further, I should point out that this Perl script did have a use utf8; at the top of it, as it should. This does not dictate how things will be read in or output, but merely tells Perl that the source code itself contains UTF-8 characters. Now to the quirky parts.

I normally test my Perl scripts on the fly by adding a quick series of debugging statements using warn() or die(). Both go to stderr, so it is easy to separate your debugging statements from the normal output of the code. However, when I output the non-ASCII message in question immediately after it was defined in the script, it showed a normal, expected UTF-8 string. So I started tracking things through the code, to see if there was some point at which the apparently normal UTF-8 string got turned back into byte soup. It never did; I finally realized that although print was outputting byte soup, both warn() and die() were outputting UTF-8! Here’s a sample script to better demonstrate the problem:

#!perl

use strict;
use warnings;
use utf8;

my $msg = 'This is a micro symbol: µ';

print "print = $msg\n";
warn "warn = $msg\n";
die "die = $msg\n";

Now let’s run it and see what happens:

print = This is a micro symbol: �
warn = This is a micro symbol: µ
die = This is a micro symbol: µ

So we’ve found one Perl quirk: print() and warn() produce different output, as warn() manages to correctly output the string as UTF-8. Perhaps it is just that the stdout and stderr filehandles are using different encodings? Let’s take a look by expanding the script and explicitly printing to both stdout and stderr. We’ll also add some other Unicode characters, to emulate the difference between French and Japanese above:

#!perl

use strict;
use warnings;
use utf8;

my $msg = 'This is a micro symbol: µ';
my $alert = 'The radioactive snowmen come in peace: ☢ ☃☃☃ ☮';

print STDOUT "print to STDOUT = $msg\n";
print STDOUT "print to STDOUT = $alert\n";

print STDERR "print to STDERR = $msg\n";
print STDERR "print to STDERR = $alert\n";

warn "warn = $msg\n";
warn "warn = $alert\n";

(Note: if you do not see small literal snowmen characters in the above script, you need to get a better browser or RSS reader!)

print to STDOUT = This is a micro symbol: �
Wide character in print at utf12 line 11.
print to STDOUT = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
print to STDERR = This is a micro symbol: �
Wide character in print at utf12 line 14.
print to STDERR = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
warn = This is a micro symbol: µ
warn = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮

There are a number of things to note here. First, that the stderr filehandle has the same problem as the stdout filehandle. So, while warn() and die() send things to stderr, there is some magic happening behind the scenes such that sending a string to them is not the same as sending it to stderr ourselves via a print statement. Which is a good thing overall, as it would be more weird for stdout and stderr to have different encoding layers! The solution to this is simple enough: just force stdout to have the proper encoding by use of the binmode function:

binmode STDOUT, ':utf8';

Indeed, the one line above solved the original poster’s problem; applying it to our test script shows that the stdout filehandle now outputs things correctly, unlike the stderr filehandle:

print to STDOUT = This is a micro symbol: µ
print to STDOUT = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
print to STDERR = This is a micro symbol: �
Wide character in print at utf12 line 16.
print to STDERR = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
warn = This is a micro symbol: µ
warn = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮

The next thing to notice is that the snowmen alert message is displayed properly everywhere. Why is this? The answer is that the micro symbol (and the accented French characters) have code points below 256, so each one still fits in a single byte (Latin-1) as far as Perl is concerned. In the absence of any explicit guidance, Perl makes a best guess as to how a string should be output. In the case of the French and “micro” strings, it guessed wrong, and each character was output as a single Latin-1 byte, which is not valid UTF-8. In the case of the Japanese and “snowmen” strings, the guess was no better, but those strings contain characters above 255 that left no doubt that we had left single-byte land and were exploring the land of Unicode. Because those characters have no single-byte equivalent, Perl falls back to writing out its internal UTF-8 representation, so they appear as the characters one would expect. Note, however, that Perl still emits a wide character warning, for it recognizes that something is probably wrong. The warnings go away when we use binmode to force the encoding layer to :utf8.
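The difference an encoding layer makes can be observed directly by printing to an in-memory filehandle and looking at the bytes that land in the buffer. This is a sketch for illustration only; the helper name bytes_written is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text through a handle opened with the given
# layer (may be empty) and return the resulting bytes as hex.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $buffer;
}

my $micro = "\x{b5}";   # the micro symbol, code point below 256

# No layer: a single Latin-1 byte, which a UTF-8 terminal cannot display
print 'no layer: ', bytes_written($micro, ''), "\n";                  # b5

# With an encoding layer: the two-byte UTF-8 form
print 'utf8:     ', bytes_written($micro, ':encoding(UTF-8)'), "\n";  # c2 b5
```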

The correct solution when dealing with UTF-8 is to be explicit and not let Perl make any guesses. Solutions to this vary, but the combination of adding use utf8; and binmode STDOUT, ‘:utf8’; worked here. While I was able to duplicate the problem right away, the combination of Perl making inconsistent guesses and the odd behavior of warn() and die() turned this from a quick fix into a slightly longer investigation. Yes, Unicode and Perl have given me quite a few gray hairs over the years, but I always feel better when I look at how other languages handle Unicode. :)

Sanitizing supposed UTF-8 data

2011-12-17T00:00:00+00:00

As time passes, it’s clear that Unicode has won the character set encoding wars, and UTF-8 is by far the most popular encoding, and the expected default. In a few more years we’ll probably find discussion of different character set encodings to be arcane, relegated to “data historians” and people working with legacy systems.

But we’re not there yet! There’s still lots of migration to do before we can forget about everything that’s not UTF-8.

Last week I again found myself converting data. This time I was taking data from a PostgreSQL database with no specified encoding (so-called “SQL_ASCII”, really just raw bytes), and sending it via JSON to a remote web service. JSON uses UTF-8 by default, and that’s what I needed here. Most of the source data was in either UTF-8, ISO Latin-1, or Windows-1252, but some was in non-Unicode Chinese or Japanese encodings, and some was just plain mangled.

At this point I need to remind you about one of the most unusual aspects of UTF-8: It has limited valid forms. Legacy encodings typically used all or most of the 255 code points in their 8-bit space (leaving point 0 for traditional ASCII NUL). While UTF-8 is compatible with 7-bit ASCII, it does not allow just any 8-bit byte in any position. See the Wikipedia summary of invalid byte sequences to know what can be considered invalid.
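One way to test whether a byte string is valid UTF-8 is to ask the core Encode module to decode it strictly and see whether it complains. The sketch below does that; the function name is_valid_utf8 is hypothetical, and this is not necessarily how the modules mentioned in this post perform their checks.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if $bytes decodes cleanly as strict UTF-8.
# FB_CROAK makes decode() die on the first malformed sequence;
# LEAVE_SRC keeps decode() from modifying its input.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

for my $sample ("plain ASCII", "\xe8\x88\xb9", "\xbd", "\xf4\xa5\xa3\xa5") {
    printf "%-12vx %s\n", $sample, is_valid_utf8($sample) ? 'valid' : 'invalid';
}
```

The lone Latin-1 byte 0xBD and the four-byte sequence starting with 0xF4 0xA5 are both rejected, while plain ASCII and a well-formed three-byte sequence pass.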

We had no need to try to fix the truly broken data, but we wanted to convert everything possible to UTF-8 and at the very least guarantee no invalid UTF-8 strings appeared in what we sent.

I previously wrote about converting a PostgreSQL database dump to UTF-8, and used the Perl CPAN module IsUTF8.

I was going to use that again, but looked around and found an even better module, exactly targeting this use case: Encoding::FixLatin, by Grant McLean. Its documentation says it “takes mixed encoding input and produces UTF-8 output” and that’s exactly what it does, focusing on input with mixed UTF-8, Latin-1, and Windows-1252.

It worked as advertised, very well. We would need to use a different module to convert some other legacy encodings, but in this case this was good enough and got the vast majority of the data right.

There’s even a standalone fix_latin program designed specifically for processing Postgres pg_dump output from legacy encodings, with some nice examples of how to use it.

One gotcha is similar to a catch that David Christensen reported with the Encode module in a blog post here about a year ago: If the Perl string already has the UTF-8 flag set, Encoding::FixLatin immediately returns it, rather than trying to process it. So it’s important that the incoming data be a pure byte stream, or that you otherwise turn off the UTF-8 flag, if you expect it to change anything.

Along the way I found some other CPAN modules that look useful for cases where I need more manual control than Encoding::FixLatin gives:

Search::Tools::UTF8 — test for and/or fix bad ASCII, Latin-1, Windows-1252, and UTF-8 strings
Encode::Detect — use Mozilla’s universal charset detector and convert to UTF-8
Unicode::Tussle — ridiculously comprehensive set of Unicode tools that has to be seen to be believed

Once again Perl’s thriving open source/free software community made my day!