Our Blog

Ongoing observations by End Point Dev people

Ecommerce customer names with interesting Unicode characters

Jon Jensen

By Jon Jensen
December 29, 2021

Photo of a small garden in front of a house basement window with two cats looking out

One of our clients with a busy ecommerce site sees a lot of orders, and among those, sometimes there are unusual customer name and address submissions.

We first noticed in 2015 that they had a customer order come in with an emoji in the name field of the order. The emoji was ๐Ÿ˜ and we half-jokingly wondered if that was a new sign of fraud.

Over the following 3 years only a few more orders came in with various emoji in customer’s names, but in mid-2018 emoji started to appear increasingly frequently until now one now appears on average every day or two.

Why the sudden appearance of emoji in 2015? It correlates with the rapid shift to browsing the web and shopping on mobile devices. Mobile visits now represent more than half of this client’s ecommerce traffic.

Most people in 2015 didn’t even know how to type emoji on a desktop or laptop computer, but mobile touchscreen keyboards began showing emoji choices around that time, so the mobile explanation makes sense. And mobile keyboard autocorrect also sometimes offers emoji in addition to words, making them even more common in the past few years.

Just for fun I wanted to automatically find all such “interesting” names, so I wrote a simple report that uses SQL to query their PostgreSQL ecommerce database.

To preserve customer privacy, names shown here have been changed and limited to a few names that are common in the United States.

Real names with bonuses

First let’s look at the apparently real names with emoji and other non-alphabetical Unicode characters mixed in:

Amy ๐Ÿฉบ
Amy๐Ÿ’•
Amy๐Ÿ‘‘
๐Ÿ‘ธAmy
Anna ๐Ÿ’ซ
Anna ๐Ÿ˜Ž
Anna โค๏ธ
Anna๐Ÿ’Š๐Ÿ’‰
Bobโ˜ฏ๏ธ
Bob๐Ÿ˜๐Ÿ˜˜๐Ÿ˜œ๐Ÿ‘‘๐Ÿ’ž
Bob๐Ÿž
Brenda ๐ŸŒ™
Brenda ๐Ÿฅณ๐Ÿคช
Brendaโ˜ ๏ธ
B R E N D A ๐Ÿ’•๐Ÿ’…๐Ÿพ๐Ÿ’ƒ๐Ÿพ๐Ÿ‘œ
Cameron ๐ŸŒป
Cameron ๐Ÿ˜Ž
Cameron ๐Ÿ’๐Ÿปโ€โ™€๏ธ
Doug ๐Ÿ’•
Doug ๐Ÿ…ฑ๏ธ
Doug ๐ŸŒธ๐Ÿคฉ
Doug ๐Ÿ‘ฆ๐Ÿปโค๏ธ
Emily ๐ŸŒท
Emily ๐Ÿ˜Š๐Ÿ‡จ๐Ÿ‡บ
Emily๐Ÿ˜š
Emily ๐ŸŽฐ๐Ÿ‘‘โค๏ธ
Frank ๐Ÿฅฐ
Frank โš—๏ธ
Frank๐Ÿ”†
Frank๐Ÿ’ž๐Ÿ’ฐ
Jane ๐Ÿ’•
Janeโธ
Jane โค๏ธโค๏ธ
Jane๐Ÿ”†
Jill ๐Ÿ‘‘๐ŸŽ€
Jill ๐Ÿ‘ผ๐Ÿฝโœจ๐ŸŒซ
Jill๐Ÿ“๐Ÿ’๐Ÿ’
Jill๐Ÿ‘ธ๐Ÿฝ๐Ÿ’–
Jim ๐Ÿฐ
Jim, ๐Ÿ’•
Jimโ€™s iPhoneโœจ
Joe ๐Ÿฏ
Joe ๐Ÿ‡ฆ๐Ÿ‡บ
Joe๐Ÿ˜
Joe๐Ÿ€๐ŸŽธโ€ผ๏ธ
John ๐Ÿ‘ช
Johnโต
John๐ŸŽญ
Karen ๐ŸŒป๐ŸŒน
Karen ๐Ÿ‘‘โœจ
Karen ๐Ÿ”’โค๏ธ
Karen๐ŸŽ€๐Ÿ‘‘
Kate๐Ÿ’™๐Ÿ’š
Kate๐Ÿคช๐Ÿคž๐Ÿฝ๐Ÿ’™
Kate.๐Ÿ’˜
K$ate๐Ÿ’‰
Liz๐ŸŒบ
Lizโฃ๏ธ
Lizโค๏ธ๐Ÿ™ƒ
Liz Mama๐Ÿ’™๐Ÿ’™
Mary ๐ŸŽ€
Mary๐Ÿ’˜
Mary๐Ÿ‘ผ๐Ÿผ๐Ÿ’“
Maryโน
Mike ๐Ÿ‘‘
Mike ๐Ÿ‘‘๐ŸŒธ
Mike ๐Ÿ‘ฉ๐Ÿปโ€๐ŸŒพ
Mikeโถ
Sarah ๐Ÿ‘‘
Sarah๐Ÿ‘‘A.
โœจSarahโœจ
๐Ÿ’›๐ŸŒป Sarah
Steve๐Ÿ‘
Steve ๐Ÿฆ
Steveโ™“๏ธ๐Ÿ’“
Victoria ๐Ÿฅด
Victoria๐ŸŒป
V๐š’๐šŒ๐š๐š˜๐š›๐š’๐šŠ
V I C T O R I A ๐Ÿค

Would you have expected all that in ecommerce orders? I didn’t!

Fake names

Next let’s look at placeholder names with people’s role or self-description or similar:

Amor โšฝ๏ธ
Babe โค๏ธ
C๐š’๐š๐š’๐šฃ๐šŽ๐š—
Daddy๐Ÿฅด๐Ÿ˜
Daddy๐Ÿ˜˜๐Ÿ‘ด๐Ÿ‘จ๐Ÿ™๐Ÿ‘จ๐Ÿ‘ฉ๐Ÿ‘ง๐Ÿ™‡๐Ÿพ
Fly High ๐Ÿ•Š
Forever ๐Ÿ’๐Ÿ’œ
Granny ๐Ÿ‘ต๐Ÿฝ
HOME โค๏ธ
Home๐Ÿ 
Home๐Ÿ ๐Ÿ’œ
Hubby๐Ÿฅฐ
๐Ÿ’ฆJuicy๐Ÿ‘
me!! ๐Ÿ’›
mi amor โค๏ธ
Mom๐Ÿ’—
Momโ™ฅ๏ธ
Mom ๐Ÿฅ๐Ÿ’›
MOMMY ๐Ÿ’—
Myself ๐Ÿ˜˜
Princess๐Ÿ‘‘
princessโค๏ธ
Queen ๐Ÿ˜๐Ÿ’–๐Ÿ”“
Queen๐Ÿ’˜
The Husband๐Ÿ’โค๏ธ
Wifey ๐Ÿ˜ˆโœŒ๐Ÿฝ๐Ÿ‘…

Perhaps the occurrence of “home” several times reflects a mobile address book auto-fill function for billing or shipping address fields?

Not names at all!

Then there are those customers who didn’t provide any kind of name at all, just emoji and other special characters:

๐Ÿฆ‹
๐Ÿ’™
๐Ÿคก๐ŸŽช
๐Ÿ’—โ˜๏ธ
๐ŸŒ
โ™ฅ๏ธ
๐ŸŒ™
โต
โˆ…

I guess only one or two of those per year doesn’t amount to much, but they’re interesting to see.

Strange addresses

In addition to the name fields we also checked the address fields for unusual characters and found (again, details changed to preserve privacy):

125 E 27๐Ÿ˜Ž
227 W 24 Circle โญ•๏ธ
3 Blvd. George Washingtonโ„ข

Simply odd

The prizewinner for oddity, which seems like some kind of copy-and-paste mistake, is this in the city field of an address:

Indianยฎ Roadmasterโ„ข Classic

Maybe at least one motorcycle has achieved sentience and needed to do some online shopping!

“Interesting” Unicode ranges

When searching for interesting Unicode ranges, we could just look for characters in the Unicode emoji ranges. That would be fairly straightforward since there are just a few ranges to match.

But we were curious what other unusual characters were getting used aside from emoji, so we wanted to include other classes of characters. So perhaps we should include everything to start and then exclude the entire class of Unicode “word characters”? That covers the world’s standard characters used for names and addresses, including not just Roman/Latin with optional diacritics, but also other character sets such as Cyrillic, Arabic, Hebrew, Korean, Chinese, Japanese, Devanagari, Thai, and many others.

PostgreSQL POSIX regular expressions include the class “word character” represented by either [[:word:]] or the Perl shorthand \w. I started with that, but found it covered too many things I did want to see, such as the visually double-width Latin characters that are part of the Chinese word character range, and some special numbers.

So I switched back to matching what I want, rather than excluding what I don’t want, and I manually went through the Unicode code charts and noted the ranges to include.

The list of Unicode code ranges I came up with, in hexadecimal, is:

250-2ba
2bc-2c5
2cc-2dc
2de-2ff
58d-58e
fd5-fd8
1d00-1dbf
2070-2079
207b-209f
20d0-2104
2106-2115
2117-215f
2163-218b
2190-2211
2213-266e
2670-2bff
2e00-2e7f
2ff0-2fff
3004
3012-3013
3020
3200-33ff
4dc0-4dff
a000-abf9
fff0-fffc
fffe-1d35f
1d360-1d37f
1d400-1d7ff
1ec70-1ecbf
1ed00-1ed4f
1ee00-1eeff
1f000-10ffff

Those ranges exclude several fairly common characters that people (or their software’s autocorrect) used in their address fields, which we wanted to ignore, such as:

  • Music sharp sign: โ™ฏ (instead of # before a number)
  • numero: โ„–
  • care of: โ„…
  • replacement character: ๏ฟฝ (though this could be interesting if it reveals unknown encoding errors)

The SQL query

PostgreSQL allows us to represent Unicode characters in hexadecimal numbers as either \u plus 4 digits or \U plus 8 digits. So the character 2ba is written in a Postgres string as \u02ba and the range fffe-1d35f is written in a Postgres regex range as [\ufffe-\U0001d35f].

With a little scripting to put it all together, I came up with:

SELECT order_number, order_timestamp::date AS order_date,
    fname, lname, company, address1, address2, city, state, b_fname, b_lname, b_company, b_address1, b_address2, b_city, b_state, phone
FROM orders
WHERE concat(fname, lname, company, address1, address2, city, state, b_fname, b_lname, b_company, b_address1, b_address2, b_city, b_state, phone)
    ~ '[\u0250-\u02ba\u02bc-\u02c5\u02cc-\u02dc\u02de-\u02ff\u058d-\u058e\u0fd5-\u0fd8\u1d00-\u1dbf\u2070-\u2079\u207b-\u209f\u20d0-\u2104\u2106-\u2115\u2117-\u215f\u2163-\u218b\u2190-\u2211\u2213-\u266e\u2670-\u2bff\u2e00-\u2e7f\u2ff0-\u2fff\u3004\u3012-\u3013\u3020\u3200-\u33ff\u4dc0-\u4dff\ua000-\uabf9\ufff0-\ufffc\ufffe-\U0001d35f\U0001d360-\U0001d37f\U0001d400-\U0001d7ff\U0001ec70-\U0001ecbf\U0001ed00-\U0001ed4f\U0001ee00-\U0001eeff\U0001f000-\U0010ffff]'
    -- limit how many years back to go
    AND order_timestamp > (SELECT CURRENT_TIMESTAMP - interval '3 years')
    -- exclude any order that had PII expunged for GDPR
    AND expunged_at IS NULL
ORDER BY order_timestamp DESC

Try a similar query on databases you have access to, and see what interesting user submissions you discover. I always find surprises and in addition to being fun, sometimes we find things that help us improve our input validation and user guidance so that more mistakes are caught when they’re easy for the customer to correct.

Happy holidays!

ecommerce unicode sql postgres


Comments