Find Text in Any Column of a PostgreSQL Table

2023-02-06T00:00:00+00:00

It’s not uncommon for me to want to find a particular text snippet in a PostgreSQL database. Easy enough, you might say. After all, that’s what databases are for: You feed them a bunch of information, ask them questions in the form of a query, and they give you the answer. So just write a query, right?

Well, maybe.

SQL stands for “Structured Query Language”, and the fact that it’s “structured” means not only that the database abides by some defined structure, but that your queries do, too, which implies that you know at the time you’re writing the query where in the structure you want to look. And that’s where the problem arises. What if I know “Kilroy” is somewhere in a table, but I don’t know what column to look in to find him? How do I write that query?

The first answer I came up with to that question depends on pg_dump: dump the contents of a table, search the results with grep, and there you have it.

$ pg_dump -t person mydb | grep -i kilroy
633132  F       NH      \N      Cristen212      J    Kilroy44        1983-09-28 00:00:00     \N      t       \N      \N      \N      \N      F       USA  \N       \N      \N      \N      \N      \N      \N      \N

Here I knew I wanted to look in the person table, so I specified that in the call to pg_dump, but this works as well when I want to search the entire database:

$ pg_dump  mydb | grep -i kilroy
716565  public  person  18185   {"session_user": "josh"}        {"id": 633132}  {"last_name": "Feeney44"}       {"last_name": "Kilroy44"}       U       123349  2023-02-03 11:19:40.587883-06    2023-02-03 11:19:40.587883-06    2023-02-03 11:19:40.592103-06   josh    \N      \N
633132  F       NH      \N      Cristen212      J    Kilroy44        1983-09-28 00:00:00     \N      t       \N      \N      \N      \N      F       USA  \N       \N      \N      \N      \N      \N      \N      \N

I can switch pg_dump to --inserts mode and thus see the table these entries were found in:

$ pg_dump --inserts mydb | grep -i kilroy
INSERT INTO audit.modification_log VALUES (716565, 'public', 'person', 18185, '{"session_user": "josh"}', '{"id": 633132}', '{"last_name": "Feeney44"}', '{"last_name": "Kilroy44"}', 'U', 123349, '2023-02-03 11:19:40.587883-06', '2023-02-03 11:19:40.587883-06', '2023-02-03 11:19:40.592103-06', 'josh', NULL, NULL);
INSERT INTO public.person VALUES (633132, 'F', 'NH', NULL, 'Cristen212', 'J', 'Kilroy44', '1983-09-28 00:00:00', NULL, true, NULL, NULL, NULL, NULL, 'F', 'USA', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);

But pg_dump is considerably slower with --inserts turned on (about 200% slower, in my rudimentary testing on this one database), so perhaps I can get fancier with grep and achieve the same thing:

$ pg_dump mydb | grep -ie kilroy -e '^COPY' | grep -B1 -i kilroy
COPY audit.modification_log (id, table_schema, table_name, table_relid, app_user_info, id_columns, old_data, new_data, operation, transaction_id, ts_transaction, ts_statement, ts_clock, session_user_name, client_addr, query_text) FROM stdin;
716565  public  person  18185   {"session_user": "josh"}        {"id": 633132}  {"last_name": "Feeney44"}       {"last_name": "Kilroy44"}       U       123349  2023-02-03 11:19:40.587883-06    2023-02-03 11:19:40.587883-06    2023-02-03 11:19:40.592103-06   josh    \N      \N
--
COPY public.person (id, birth_gender, ethnicity, language, first_name, middle_name, last_name, birth_date, date_of_death, live, days_old_no_birthday, from_pregnant_id, fatal_condition_id, death_status, current_gender, country_of_birth, people_id, entity_id, birth_gender_id, ethnicity_id, primary_language_id, age_type_id, approximate_age_no_birthday, updated_at) FROM stdin;
633132  F       NH      \N      Cristen212      J    Kilroy44        1983-09-28 00:00:00     \N      t       \N      \N      \N      \N      F       USA  \N       \N      \N      \N      \N      \N      \N      \N
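The double-grep trick above can also be sketched as a small Python filter that remembers the most recent COPY header while scanning the dump. This is only an illustration under stated assumptions: the function name `grep_dump` and the sample rows are hypothetical, and it assumes pg_dump's default plain-text format, in which each table's data follows a `COPY ... FROM stdin;` line and ends with a `\.` line.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches the header pg_dump writes before each table's data.
COPY_HEADER = re.compile(r"^COPY (\S+) ")


def grep_dump(lines: Iterable[str], pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (table, row) for every data row matching pattern, ignoring case."""
    needle = re.compile(pattern, re.IGNORECASE)
    table = None
    for line in lines:
        line = line.rstrip("\n")
        header = COPY_HEADER.match(line)
        if header:
            table = header.group(1)   # remember which table we are inside
        elif line == "\\.":
            table = None              # end-of-data marker for this table
        elif table is not None and needle.search(line):
            yield table, line


# Hypothetical, heavily shortened dump used only for illustration.
sample = [
    "COPY public.person (id, first_name, last_name) FROM stdin;",
    "633132\tCristen212\tKilroy44",
    "633133\tAlice\tSmith",
    "\\.",
    "COPY public.pet (id, name) FROM stdin;",
    "1\tRex",
    "\\.",
]

matches = list(grep_dump(sample, "kilroy"))
for table, row in matches:
    print(table, row, sep=": ")
```

To use it on a real dump, the same function could be fed `sys.stdin` with pg_dump piped into the script.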

But the other day I realized I could easily search a single table without leaving the database, thanks to \copy:

mydb=# \copy person to program 'grep -i kilroy';
COPY 117
633132  F       NH      \N      Cristen212      J    Kilroy44        1983-09-28 00:00:00     \N      t       \N      \N      \N      \N      F       USA  \N       \N      \N      \N      \N      \N      \N      \N

\copy requires copying the entire table to the client side, which wasn’t a concern in my instance but could be a problem for some folks. The server-side alternative, COPY ... TO PROGRAM, avoids that transfer, but you need to be a superuser (or a member of the pg_execute_server_program role) to use it, so there are some limitations to this approach.

I started thinking about this problem fresh off an encounter with row types, and realized we could do the whole thing in SQL:

mydb=# select * from person where person::text ~* 'kilroy';
   id   | birth_gender | ethnicity | language | first_name |             middle_name              | last_name |     birth_date      | date_of_death | live | days_old_no_birthday | from_pregnant_id | fatal_condition_id | death_status | current_gender | country_of_birth | people_id | entity_id | birth_gender_id | ethnicity_id | primary_language_id | age_type_id | approximate_age_no_birthday | updated_at
--------+--------------+-----------+----------+------------+--------------------------------------+-----------+---------------------+---------------+------+----------------------+------------------+--------------------+--------------+----------------+------------------+-----------+-----------+-----------------+--------------+---------------------+-------------+-----------------------------+------------
 633132 | F            | NH        |          | Cristen212 | J | Kilroy44  | 1983-09-28 00:00:00 |               | t    |                      |                  |                    |              | F              | USA              |           |           |                 |              |                     |             |                             |
(1 row)

This is my new favorite way of solving this problem, when I know what table I’m looking in but don’t know the column. In cases where I don’t know either, there’s always this monstrosity:

mydb=# select 'select ' || quote_literal(relname) || ' as relname, f.*
from ' || oid::regclass || ' f where f::text ~* ''kilroy''' from pg_class
where relkind = 'r' and relnamespace = 'public'::regnamespace; \gexec

This will work only in psql, as it depends on the \gexec command. It composes one query for each table, and then executes them in turn. It produces rather a lot of probably useless output you’ll need to sort through, and if you have (as I do) some tables with a large number of columns, I recommend trying it with \x turned on, to avoid having to page through quite so many results. But it did work, to find the entry I was looking for:

... Lots of output
(0 rows)

-[ RECORD 1 ]---------------+-------------------------------------
relname                     | person
id                          | 633132
birth_gender                | F
ethnicity                   | NH
language                    |
first_name                  | Cristen212
middle_name                 | J
last_name                   | Kilroy44
birth_date                  | 1983-09-28 00:00:00
date_of_death               |
live                        | t
days_old_no_birthday        |
from_pregnant_id            |
fatal_condition_id          |
death_status                |
current_gender              | F
country_of_birth            | USA
people_id                   |
entity_id                   |
birth_gender_id             |
ethnicity_id                |
primary_language_id         |
age_type_id                 |
approximate_age_no_birthday |
updated_at                  |

(0 rows)
... Lots more output
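For readers who would rather generate the per-table queries outside psql, here is a rough Python sketch of the same composition step. Everything in it is an assumption for illustration: the helper names are hypothetical, the table list is hard-coded rather than read from pg_class, and the quoting helpers are simplified versions of what PostgreSQL's quote_ident and quote_literal do (they assume standard_conforming_strings is on and do not handle backslashes specially).

```python
def quote_ident(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Single-quote a string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_search_queries(tables, pattern):
    """Compose one row-as-text search query per table."""
    return [
        f"select {quote_literal(table)} as relname, f.* "
        f"from {quote_ident(table)} f "
        f"where f::text ~* {quote_literal(pattern)}"
        for table in tables
    ]


# Hypothetical table names; a real run would fetch them from the catalog.
queries = build_search_queries(["person", "order history"], "kilroy")
for query in queries:
    print(query + ";")
```

Each generated string could then be sent to the database by whatever client library is in use.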

If I’m willing to expand beyond one-liner solutions, I can get psql to filter that useless output for me:

mydb=# \o | grep -i kilroy | grep -v select
mydb=# select 'select ' || quote_literal(relname) || ' as rname, f.* from ' || oid::regclass || ' f where f::text ~* ''kilroy'''
from pg_class where relkind = 'r' and relnamespace = 'public'::regnamespace; \gexec
mydb=# \o
 person | 633132 | F            | NH        |          | Cristen212 | J | Kilroy44  | 1983-09-28 00:00:00 |               | t    |                      |                  |                    |              | F              | USA              |           |           |                 |              |                     |             |                             |

When I did this, I had to use a final \o to set the output mode back to normal, before I saw results.

Do you have a different technique to solve this problem? I’m sure there are many other possibilities out there! Comment below!

Data Migration Tips

2023-02-04T00:00:00+00:00

When you’re in the business of selling software to people, you tend to get a few chances to migrate data from their legacy software to your shiny new system. Most recently for me that has involved public health data exported from legacy disease surveillance systems into PostgreSQL databases for use by the open source EpiTrax system and its companion EMSA.

We have collected a few tips that may help you learn from our successes, as well as our ~~mistakes~~particularly educational experiences.

Customer Management

Your job is to satisfy your customers, and your customers want to know how the migration is progressing. Give them an answer, even if it’s just a generalization. This may be a burndown chart, a calculated percentage, a nifty graphic, or whatever, but something your project managers can show to their managers, to know more or less how far along things are.

Your job is also to know your system; that’s not the customer’s job. They shouldn’t have to get their data into a specific format for you to make use of it. Be as flexible as possible in the data format and structure you’ll accept. In theory, so long as your customer can provide the legacy data in a machine-readable format, you should be able to use it. Let them focus on getting the data out of their legacy system — which is sometimes quite an effort in itself! Real data is almost always messy data, and your customer will probably want to take the opportunity to clean things up; make it as easy as possible for them to do that, while still ensuring the migration proceeds quickly and smoothly.

Be careful about the vocabulary you use with your customer. Your system and the legacy system probably deal with the same kinds of data, and do generally the same kinds of things. Of course, your software does it better than the old ’n’ busted mess you’re replacing, but in order to be better, your software has to be different from what it’s replacing. Different software means different methods, different processes, and different concepts. You and your customer might use the same words to mean totally different things, and unless you’re careful, those differences can remain hidden well into the implementation process, coming to light just in time to highlight some horrible mistake you’ve made. Try to avoid that. Talk with your customer to make sure your team and their team share a common vocabulary. If your system’s core function is to track widgets, and their legacy system’s function is also to track widgets, make sure you understand the differences between your software’s concept of a widget and their legacy system’s concept of a widget.

Migration Scripting

For our products and our customers, we’ve found pretty much every migration process is different, and each one is a custom programming job. Your team should decide at the beginning what steps the migration needs to include, and what technologies it will use in each step. Here at End Point, we often like to work directly in the database, in SQL. Migrations are all about manipulating data, and SQL is well suited for that task. With whatever technology you use, decide how you’ll use it, to manage the considerations given here. You can change your mind later and refactor accordingly, but always have a plan you’re following.

Design the migration as a sequence of processes. That is, first you might import one type of record, next another type of record that depends on the previous one, followed by several further steps to import data from a third source, clean it, map values from the legacy system to the new system, validate the results, and create records in the destination database. Of course the steps will vary from project to project, but the point is your migration will probably include several steps which need to be run in a specific order, so plan your development conventions accordingly. At End Point, we often like to put each step in a SQL file, and name each file beginning with a number, so you can run each script in order sorted by filename, and achieve the correct result. We might have files called:

01_import_products.sql
02_import_customers.sql
03_import_order_history.sql
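A small sketch of why the numeric prefix matters, assuming a runner that simply sorts filenames before executing them. The function name `migration_order` and the extra filenames are hypothetical. It also shows why zero-padding the numbers is worth the trouble once there are ten or more steps.

```python
def migration_order(filenames):
    """Return the .sql files in the order a plain lexical sort would run them."""
    return sorted(name for name in filenames if name.endswith(".sql"))


files = [
    "03_import_order_history.sql",
    "01_import_products.sql",
    "README.txt",
    "02_import_customers.sql",
]
ordered = migration_order(files)
print(ordered)

# Without zero-padding, lexical order no longer matches numeric order:
# step 10 sorts before steps 1 and 2.
unpadded = migration_order(["2_customers.sql", "10_cleanup.sql", "1_products.sql"])
print(unpadded)
```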

It’s also common to implement each step one at a time, and to need to run each step several times as it’s being developed. We find it very helpful to wrap each step in a transaction. Often that means each SQL file begins with a BEGIN; statement, and ends with COMMIT;. Often I’ll leave out the COMMIT until I’m finished working on a file. That way to work on the code I can open a database session and run the migration script I’m working on, and when it completes, it will leave me inside the open transaction where I can inspect the results of my work. I make changes to the script, roll back the transaction, and run the script again, for as many iterations as it takes. I only add a COMMIT when I’ve tested the whole file and think it’s ready for the next step in testing.
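The run, inspect, roll back loop can be imitated in a few lines of Python. This is a sketch only: it uses the standard library's sqlite3 module as a stand-in for PostgreSQL, the table and the step's SQL are made up, and the statement splitting is deliberately naive (it would break on semicolons inside string literals).

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so transactions start only when we issue BEGIN ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("create table products (id integer primary key, name text)")

# A hypothetical migration step with no COMMIT at the end.
step = """
insert into products (name) values ('widget');
insert into products (name) values ('gadget');
"""


def run_step(connection, sql):
    """Open a transaction and run each statement, leaving the transaction open."""
    connection.execute("begin")
    for statement in sql.split(";"):
        if statement.strip():
            connection.execute(statement)


run_step(conn, step)
inside = conn.execute("select count(*) from products").fetchone()[0]

# Not happy with the result? Roll back, edit the step, and run it again.
conn.execute("rollback")
after = conn.execute("select count(*) from products").fetchone()[0]

print(inside, after)  # prints "2 0"
```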

I mentioned above that the customer may want to use this opportunity to clean their data. You should want this, too. Make sure the data you’re feeding your new system is as clean and well-structured as possible. You may find, as we do, that most of the work in your migrations is in validating the input data, and that actually creating new records in your application is almost an afterthought. That’s OK. You may also find there are places where your application, wonderful though it may be, could stand to be more strict about the data it accepts. I have often discovered my application’s database needs a uniqueness constraint, or a foreign key, thanks to a migration I was working on.

Data Migration History

I wish I could truthfully claim all our migrations go off flawlessly, but that would be a lie. It’s not unheard of to run into some corner case, a few weeks or even months after the migration goes live, where data wasn’t migrated correctly.

On the other hand, it’s certainly not uncommon for a customer or coworker to spot something that strikes them as odd, after the migration goes live, only to find later that everything was in fact correct. In either case, it’s important to preserve a history of the migration to investigate these concerns. We accomplish this with a few specific steps:

Create a database schema for the migration, and a table within that schema for each data file or object type we’re importing. I call these tables “staging tables”, where the incoming data is “staged” as it’s cleaned and validated. Having a separate schema means these tables can remain in the production database long after the migration is complete, generally without interfering with anything.
These staging tables should generally use text fields, to be as forgiving and flexible as possible with the incoming data. We can clean, parse, and reformat the data after it’s imported.
Don’t change the data in these staging tables; add to the data instead. In other words, if you need to map a value from the legacy system to a different value for your new system, don’t change the column you imported into the staging table; instead, add a new column to the staging table where you’ll store the re-mapped value. If you need to parse a text field into a date (because you followed the instruction to use text fields!), don’t change the type of an imported column; instead, add a new column of date or timestamp type, to store the parsed value. That way, when three months down the road someone discovers that some of the imported records have weird dates, you have all the information you need to determine whether the fault lies with the imported data or some step of the migration process. Knowing exactly where the fault crept in leaves you that much more empowered to fix it.
Keep track of your migrated records’ primary keys, in the legacy system and the new system. Imagine you’ve just imported your client’s legacy customer list into a staging table. This data includes the legacy system’s primary key. Add a new column to the table for your new system’s primary key, and populate it. Many of our systems use an integer sequence as a primary key, so we’d add a new integer column to the staging table, and fill it with the next values from the sequence. Following this principle will give you several important abilities:
- You can always connect a record in the legacy system with its corresponding record(s) in the new system. If you’ve imported a customer list in this fashion, then when you’re importing the order data later, and each order points to a customer using a legacy customer primary key, you can easily find the correct customer primary key to use in your system.
- You can easily know if a record in your system comes from the migration, or from normal day-to-day business. You will probably use this every time you try to debug something with your migration.
- If you need to remove all imported records and re-import them, you can identify exactly which records those are. This should be only rarely needed.
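The staging-table principles above can be sketched end to end in Python. Treat this as an illustration under assumptions, not our actual migration code: sqlite3 stands in for PostgreSQL (so there is no separate schema and no real date type), every table, column, and value is hypothetical, and the new primary key is recorded as each record is created rather than reserved from a sequence beforehand.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers (id integer primary key, name text, signup_date text);
insert into customers (name, signup_date) values ('Existing Customer', '2020-01-15');

-- Staging table: text columns only, loaded exactly as received.
create table staging_customers (legacy_id text, name text, signup_date text);
insert into staging_customers values
    ('A-17', 'Kilroy', '09/28/1983'),
    ('A-18', 'Feeney', 'unknown');

-- Add to the staged data instead of changing it.
alter table staging_customers add column signup_date_parsed text;
alter table staging_customers add column new_id integer;
""")


def parse_date(raw):
    """Return an ISO date string, or None when the legacy value is unusable."""
    try:
        return datetime.strptime(raw, "%m/%d/%Y").date().isoformat()
    except ValueError:
        return None


# Store the parsed value next to the original, which stays untouched.
for rowid, raw in conn.execute(
        "select rowid, signup_date from staging_customers").fetchall():
    conn.execute(
        "update staging_customers set signup_date_parsed = ? where rowid = ?",
        (parse_date(raw), rowid))

# Create the real records and remember which new key each legacy row received.
for rowid, name, parsed in conn.execute(
        "select rowid, name, signup_date_parsed from staging_customers "
        "where signup_date_parsed is not null").fetchall():
    cursor = conn.execute(
        "insert into customers (name, signup_date) values (?, ?)", (name, parsed))
    conn.execute(
        "update staging_customers set new_id = ? where rowid = ?",
        (cursor.lastrowid, rowid))

staged = conn.execute(
    "select legacy_id, signup_date, signup_date_parsed, new_id "
    "from staging_customers order by legacy_id").fetchall()
migrated = conn.execute(
    "select c.name from customers c "
    "join staging_customers s on s.new_id = c.id").fetchall()
print(staged)
print(migrated)
```

The last query is the payoff: joining on the recorded key separates migrated records from ones created in day-to-day business, and the untouched signup_date column still shows what the legacy system actually sent.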

Finally, document the decision making process, in comments directly in your code. For instance, if you have a table of mappings from one value to another, chances are good you arrived at the final version of that mapping table only after some discussion with the customer. Chances are also good someone’s going to question it later on. It’s helpful to keep a comment around, something like, # Joe Rogers verified this is the correct mapping in the daily standup meeting, 23 Nov 2022. This is especially common if you eventually decide to ignore a certain class of records. /* Rebecca says ignore all records with type = "ARCHIVED", via group email 9 Jan 2023 */ is a very helpful clue when someone comes around wondering where those records went.

Teamwork

My remaining tips apply to almost any programming project. First, use source control, and commit your code to it often. I can’t count how often I’ve been grateful the Git repository had a backup of my work, or made my work accessible to fill some unexpected need on some other system, nor can I count how many times I’ve been stuck because someone else didn’t commit their code so I couldn’t get at it when I needed it. Let’s not talk about how many times I’ve caused someone else to get stuck in the same way… Of course, don’t commit your customer’s data. But you should commit your code, and commit it often.

Finally, where possible, work with someone else. Two programmers reviewing each other’s code and collaborating on solutions are often far better than two programmers working alone, or one programmer working twice as much.

Speaking of working out solutions together, I’d love your help improving this list. What keys have you found are important for data migration projects? I welcome your comments. And if you’re looking for someone to handle a data migration project for you, give us a call!

What is serialization?

2021-05-06T00:00:00+00:00

Photo by Brian Patrick Tagalog on Unsplash

Serialization is a process used constantly by most applications today. However, there are some common misconceptions and misunderstandings about what it is and how it works; I hope to clear up a few of these in this post. I’ll be talking specifically about serialization and not marshalling, a related process.

What is serialization?

Most developers know that complex objects need to be transformed into another format before they can be sent to a server, but many might not be aware that every time they print an object in the Python or JavaScript console, the same type of thing is happening. Variables and objects as they’re stored in memory—either in a headless program or one with developer tools attached—are not really usable to us humans.

Data serialization is the process of taking an object in memory and translating it to another format. This may entail encoding the information as a chunk of binary to store in a database, creating a string representation that a human can understand, or saving a config file from the options a user selected in an application. The reverse—deserialization—takes an object in one of these formats and converts it to an in-memory object the program can work with. This two-way process of translation is a very important part of the ability of various programs and computers to communicate with one another.

An example of serialization that we deal with every day can be found in the way we view numbers on a calculator. Computers use binary numbers, not decimal, so how do we ask one to add 230 and 4 and get back 234? Because the 230 and the 4 are deserialized to their machine representations, added in that format, and then serialized again in a form we understand: 234. To get 230 in a form the computer understands, it has to read each digit one at a time, figure out what that digit’s value is (i.e. the 2 is 200 and the 3 is 30), and then add them together. It’s easy to overlook how often this concept appears in everything we do with computers!
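The calculator round trip can be written out by hand in a few lines of Python. This is only a sketch with hypothetical function names. It also accumulates the value digit by digit (multiply by ten, add the next digit) rather than computing 200 + 30 + 0 separately, which gives the same result.

```python
def parse_decimal(text: str) -> int:
    """Deserialize a string of decimal digits into a machine integer."""
    value = 0
    for ch in text:
        digit = ord(ch) - ord("0")
        if not 0 <= digit <= 9:
            raise ValueError(f"not a decimal digit: {ch!r}")
        value = value * 10 + digit   # shift earlier digits one place left
    return value


def format_decimal(value: int) -> str:
    """Serialize a non-negative machine integer back into decimal digits."""
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, 10)
        digits.append(chr(ord("0") + remainder))
    return "".join(reversed(digits))


total = parse_decimal("230") + parse_decimal("4")
print(bin(total))             # the machine's view: 0b11101010
print(format_decimal(total))  # our view: 234
```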

Why it’s important to understand how it works

As a developer, there are many reasons you should be familiar with how serialization works as well as the various formats available, including:

Different formats are best suited for different use cases.
Standardization varies between formats. For example, INI files have no single specification, but TOML does. YAML 1.2 came out in 2009 but most YAML parsers still implement only parts of the earlier YAML 1.1 spec.
Each application typically supports only one or a few formats.
Formats have different goals, such as readability & simplicity for humans, speed for computers, and conciseness for storage space and transfer efficiency.
Applications use the various formats very differently from each other.

Before you start working on a project, it will certainly pay off to make sure you’re familiar with the options for serialization formats so you can pick the one most suited to your particular use case.

Binary vs. human-readable serialization

There’s one more important distinction to be made before I show any examples, and that is human-readable vs. binary serialization. The advantage of human-readability is obvious: debugging in particular is much simpler, and other things like scanning data for keywords are much easier as well. Binary serialization, however, can be much faster to process for both the sender and recipient, it can sometimes include information that’s hard to represent in plain text, and it can be much more efficient with space without needing separate compression. I’ll stick to reviewing human-readable formats in this post.

Common serialization formats with examples

CSV

For my examples, I’ll have a simple JavaScript object representing myself, with properties including my name, recent books I’ve read, and my favorite food. I’ll start with CSV (comma-separated values) because it’s intended for simpler data records than most of the other formats I’ll be showing; you’ll notice that there isn’t an easy way to do object hierarchies or lists. CSV files begin with a list of the column names followed by the rows of data:

name,favorite_food_name,favorite_food_prep_time,recent_book
Zed,Pizza,30,Leviathan Wakes

CSV files are most often used for storing or transferring tabular data, but there’s no single specification, so the implementation can be fairly different in different programs. The most common differences involve data with commas or line breaks, requiring quoting of some or all elements, and escaping some characters.
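To see quoting and escaping in action, here is a short sketch using Python's standard csv module, which is one dialect among many. The field values are made up so that one contains a comma and another contains double quotes.

```python
import csv
import io

rows = [
    ["name", "favorite_food_name", "recent_book"],
    ["Zed", "Pizza, extra cheese", 'The "Expanse" series'],
]

# Serialize: by default the writer quotes only the fields that need it.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()
print(text)

# Deserialize: the reader undoes the quoting and the doubled quotes.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed == rows)
```

In this dialect a field containing a comma is wrapped in double quotes, and a double quote inside a field is escaped by doubling it. Other programs may expect backslash escapes or quote every field, which is exactly the kind of difference that causes trouble.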

TSV

Files in tab-separated values (TSV) format are also fairly common, using tabs instead of commas to separate columns of data.

Because the tab character is rarely used in text put into table format, it is less of a problem as a separator than the very frequently-occurring comma. Typically no quoting or escaping of any kind is needed or possible in a TSV file.

name	favorite_food_name	favorite_food_prep_time	recent_book
Zed	Pizza	30	Leviathan Wakes

For the rest of my examples of each format, I’ll show the command (and library, if needed) that I used to get the serialized form of my object.

JSON

JSON stands for JavaScript Object Notation, and thus you might be fooled into thinking that it’s just an extension of JavaScript itself. However, this isn’t the case; it was originally derived from JavaScript syntax, but it has significant differences. For example, JSON has a stricter syntax for declaring objects. For my example, using the Google Chrome developer console I declared my object like this:

const me = {
  name: 'Zed',
  recent_books: [
    'Leviathan Wakes',
    'Pride and Prejudice and Zombies'
  ],
  favorite_food: {
    name: 'Pizza',
    prep_time: 30
  }
};

You’ll notice that the property names aren’t quoted and the strings are single-quoted with '. This is perfectly valid JavaScript, but invalid JSON. Let’s see what an equivalent JSON file could look like:

{
  "name": "Zed",
  "recent_books": [
    "Leviathan Wakes",
    "Pride and Prejudice and Zombies"
  ],
  "favorite_food": {
    "name": "Pizza",
    "prep_time": 30
  }
}

JSON requires property names to be quoted, and only double quotes " are allowed. It’s true that they look very similar, but the difference is important. Also notice that this JSON is formatted in an easy-to-read way, on multiple lines with indentation. This is called pretty-printing and is possible because JSON doesn’t care about whitespace.

Imagine my JavaScript application wants to send this object to some server that’s expecting JSON. That server could be built on any platform, such as Java or .NET, and not necessarily JavaScript. My application would need to serialize the object from memory into a JSON string first, which can be done by JavaScript itself:

> let meJSON = JSON.stringify(me);
> console.log(meJSON);
{"name":"Zed","recent_books":["Leviathan Wakes","Pride and Prejudice and Zombies"],"favorite_food":{"name":"Pizza","prep_time":30}}

Note that the result here has no extra line breaks or spaces. This is called minifying, and is the reverse of pretty-printing. The flexibility allowed by these two processes is one reason people like JSON.
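Since JSON is not tied to JavaScript, the same two forms can be produced from another language. Here is a sketch in Python using the standard json module. The `separators` argument is what removes the spaces Python would otherwise put after commas and colons.

```python
import json

me = {
    "name": "Zed",
    "recent_books": ["Leviathan Wakes", "Pride and Prejudice and Zombies"],
    "favorite_food": {"name": "Pizza", "prep_time": 30},
}

# Minified: no whitespace between tokens.
minified = json.dumps(me, separators=(",", ":"))
print(minified)

# Pretty-printed: one item per line, indented by two spaces.
pretty = json.dumps(me, indent=2)
print(pretty)

# Both texts deserialize to the same in-memory object.
same = json.loads(minified) == json.loads(pretty) == me
print(same)
```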

Parsing our example back into a JavaScript object is also very easy:

> console.log(JSON.parse(meJSON));
{
  name: 'Zed',
  recent_books: [ 'Leviathan Wakes', 'Pride and Prejudice and Zombies' ],
  favorite_food: { name: 'Pizza', prep_time: 30 }
}

The easy integration with JavaScript is a big reason JSON is so popular. I showed these examples to highlight how easy it is to use, but also to point out that sometimes we might use serialization without being aware of what’s going on under the hood; it’s important to remember that JSON texts aren’t JavaScript objects, and there may be instances where it makes more sense to use another format.

For instance, if you need a config file format that’s easy for humans to read, it is very helpful to allow comments that are not part of the data structure once it is read into memory. But CSV, TSV, and JSON do not allow for comments. The most obvious or popular choice isn’t always the only one, or the best one, so let’s keep looking at other formats.

XML

XML is well known as a markup language closely related to HTML: both descend from SGML, although HTML is not strictly a subset of XML. It can also be used for serialization of data, and allows us to add comments such as the one at the beginning:

<!-- My favorites as of May 2021 -->
<me>
    <name>Zed</name>
    <recent_books>Leviathan Wakes</recent_books>
    <recent_books>Pride and Prejudice and Zombies</recent_books>
    <favorite_food>
        <name>Pizza</name>
        <prep_time>30</prep_time>
    </favorite_food>
</me>

XML has the benefit of being widely used, and it can represent more complex data structures since each element can also optionally have various attributes, and ordering of its child elements is significant.

But XML is unpleasant to type and for many use cases feels rather complex and bloated, so it suffers when compared to other formats we are looking at in this post.
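Reading XML back into an object takes a little more work than JSON, because the format says nothing about which elements are lists or numbers. The sketch below uses Python's standard xml.etree.ElementTree and makes two assumptions of its own: the elements are wrapped in a single root element (here a hypothetical `<me>`), which a well-formed XML document requires, and repeated `<recent_books>` elements are meant to form a list.

```python
import xml.etree.ElementTree as ET

document = """<!-- My favorites as of May 2021 -->
<me>
    <name>Zed</name>
    <recent_books>Leviathan Wakes</recent_books>
    <recent_books>Pride and Prejudice and Zombies</recent_books>
    <favorite_food>
        <name>Pizza</name>
        <prep_time>30</prep_time>
    </favorite_food>
</me>"""

root = ET.fromstring(document)

# The parser hands back text everywhere; we decide what is a list or a number.
me = {
    "name": root.findtext("name"),
    "recent_books": [book.text for book in root.findall("recent_books")],
    "favorite_food": {
        "name": root.findtext("favorite_food/name"),
        "prep_time": int(root.findtext("favorite_food/prep_time")),
    },
}
print(me)
```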

YAML

YAML is a serialization format for all kinds of data that’s designed to be human-readable. Simple files look fine, like our example:

---
# My favorites as of May 2021
name: "Zed"
recent_books:
  - "Leviathan Wakes"
  - "Pride and Prejudice and Zombies"
favorite_food:
  name: "Pizza"
  prep_time: 30

However, the YAML specification is far from simple, and quite a bit has been written on why it’s better to use other formats where possible.

INI

INI, short for initialization, is well-known and has been around since the ’90s or earlier. It was most notably used for configuration files in Windows, especially in the era before Windows 95. INI files are still used in many places, including the configuration files of Windows and Linux programs such as the Git version control system.

Our example in INI format looks like this:

; My favorites as of May 2021

name=Zed
recent_books[]=Leviathan Wakes
recent_books[]=Pride and Prejudice and Zombies

[favorite_food]
name=Pizza
prep_time=30

INI has no single specification, so one project’s config files might use different syntax from another. This makes it hard to recommend over newer formats like TOML.
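The lack of a shared specification is easy to demonstrate. The example above uses top-level keys and the `[]` list convention, which some INI parsers accept, but Python's standard configparser, used here with its default settings, is not one of them. The sketch below is only an illustration of that mismatch.

```python
import configparser

php_style = """\
; My favorites as of May 2021

name=Zed
recent_books[]=Leviathan Wakes
recent_books[]=Pride and Prejudice and Zombies

[favorite_food]
name=Pizza
prep_time=30
"""

parser = configparser.ConfigParser()
try:
    parser.read_string(php_style)
    accepted = True
except configparser.Error as exc:
    # With default settings, keys that appear before any [section] are rejected.
    accepted = False
    print("rejected:", type(exc).__name__)

# The sectioned part on its own is fine, but every value comes back as a string.
sectioned = configparser.ConfigParser()
sectioned.read_string("[favorite_food]\nname=Pizza\nprep_time=30\n")
prep_time = sectioned["favorite_food"]["prep_time"]
print(repr(prep_time))
print(sectioned.getint("favorite_food", "prep_time"))
```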

TOML

TOML, which stands for Tom’s Obvious Minimal Language, is a more recent addition to serialization formats; its first version was released in 2013. TOML maps directly to dictionary objects and is intended especially for configuration files as an alternative to INI. It has similar syntax to INI as well:

# My favorites as of May 2021

name = "Zed"

recent_books = [
  "Leviathan Wakes",
  "Pride and Prejudice and Zombies"
]

[favorite_food]
name = "Pizza"
prep_time = 30

Unlike INI and YAML, TOML has a very clear and well-defined specification, and seems like a great option for new projects in the future. It is currently used most prominently by the Rust programming language tools. There is a list of TOML libraries per language and version on the TOML wiki at GitHub.

PHP’s serialize()

PHP’s serialization output isn’t quite as readable, but the data is still recognizable for someone scanning visually for keywords or doing a more rigorous search. Converting from JSON is fairly simple:

#!/usr/bin/env php
<?php

$json = '
{
  "name": "Zed",
  "recent_books": [
    "Leviathan Wakes",
    "Pride and Prejudice and Zombies"
  ],
  "favorite_food": {
    "name": "Pizza",
    "prep_time": 30
  }
}
';

$obj = json_decode($json, true);

echo serialize($obj);

And the result:

a:3:{s:4:"name";s:3:"Zed";s:12:"recent_books";a:2:{i:0;s:15:"Leviathan Wakes";i:1;s:31:"Pride and Prejudice and Zombies";}s:13:"favorite_food";a:2:{s:4:"name";s:5:"Pizza";s:9:"prep_time";i:30;}}

PHP serialize() does not allow for comments, but it does support full object marshalling, which it is more commonly used for.
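To make the format easier to read, here is a rough Python sketch of an encoder for the small subset used in this example: strings, integers, booleans, and arrays. The function name `php_serialize` is hypothetical and this is not PHP's implementation. It ignores floats, nulls, objects, and PHP's habit of turning numeric string keys into integers. The number after `s:` is the string's length in bytes, and the number after `a:` is the element count.

```python
def php_serialize(value) -> str:
    """Encode a small subset of Python values in PHP's serialize() format."""
    if isinstance(value, bool):          # check bool first: it is a subclass of int
        return f"b:{int(value)};"
    if isinstance(value, int):
        return f"i:{value};"
    if isinstance(value, str):
        return f's:{len(value.encode("utf-8"))}:"{value}";'
    if isinstance(value, list):
        items = list(enumerate(value))   # PHP lists are arrays keyed 0, 1, 2, ...
    elif isinstance(value, dict):
        items = list(value.items())
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")
    body = "".join(php_serialize(key) + php_serialize(item) for key, item in items)
    return f"a:{len(items)}:{{{body}}}"


me = {
    "name": "Zed",
    "recent_books": ["Leviathan Wakes", "Pride and Prejudice and Zombies"],
    "favorite_food": {"name": "Pizza", "prep_time": 30},
}
serialized = php_serialize(me)
print(serialized)
```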

Perl’s Data::Dumper

Perl’s Data::Dumper module serializes data in a format specifically for Perl to load back into memory:

#!/usr/bin/env perl

use strict;
use warnings;
use JSON;
use Data::Dumper 'Dumper';

my $json = <<'END';
{
  "name": "Zed",
  "recent_books": [
    "Leviathan Wakes",
    "Pride and Prejudice and Zombies"
  ],
  "favorite_food": {
    "name": "Pizza",
    "prep_time": 30
  }
}
END

my $hash = decode_json $json;

print Dumper($hash);

And the result, which is a valid Perl statement:

$VAR1 = {
          'recent_books' => [
                              'Leviathan Wakes',
                              'Pride and Prejudice and Zombies'
                            ],
          'name' => 'Zed',
          'favorite_food' => {
                               'name' => 'Pizza',
                               'prep_time' => 30
                             }
        };

Conclusion

Serialization is an extremely common function that we as programmers should be familiar with. Knowing which format is a good option for a new project can save time and money, as well as make things easier for developers and API users.

Please leave a comment if I have missed your favorite format!

The flow of hierarchical data extraction

2019-03-13T00:00:00+00:00

1. Problem statement

There are many cases when people intend to collect data, for various purposes. One may want to compare prices or find out how musical fashion changes over time. There are a zillion potential uses of collected data.

The old-fashioned way to do this task is to hire a few dozen people and explain to them where they should go on the web, what they should collect, how they should write a report and how they should send it.

It is more effective to teach them all at once than to teach them separately, but even then there will be misunderstandings and costly mistakes, not to mention the limit on the amount of data a human can process. As a result, the industry strives to automate this as much as possible.

This is why people write software to cope with this issue. The terms data-extractor, data-miner, data-crawler, data-spider mean software which extracts data from a source and stores it at the target. If data is mined from the web, then the more-specific web-extractor, web-miner, web-crawler, web-spider terms can be used.

In this article I will use the term “data-miner”.

This article deals with the extraction of hierarchical data in a semantic manner and with the way we can parse the data obtained this way.

1.1. Hierarchical structure

A hierarchical structure involves a hierarchy, that is, a graph with nodes and edges, but without a cycle. Specialists call this structure a forest. A forest consists of trees; in our case we have a forest of rooted trees. A rooted tree has a root node and every other node is its descendant (child of child of child …), or, if we put it inversely, the root is the ancestor of all other nodes in a rooted tree.

If we add a node to a forest and make all the trees’ roots in the forest children of the new node, the new root, then we have transformed our hierarchical structure, our forest, into a tree.
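As a minimal sketch of this transformation, assume each node is a plain object with a `name` and a `children` array. This representation and the names `forestToTree` and `countNodes` are chosen for illustration only:

```javascript
// Hypothetical node shape: { name: string, children: Node[] }.
function forestToTree(forest, rootName = "ROOT") {
    // The new root adopts every former root as a direct child.
    return { name: rootName, children: [...forest] };
}

// Counts the nodes of a rooted tree recursively.
function countNodes(node) {
    return 1 + node.children.reduce((sum, child) => sum + countNodes(child), 0);
}

const forest = [
    { name: "html", children: [{ name: "body", children: [] }] },
    { name: "feed", children: [] },
];
const tree = forestToTree(forest);
```

After the call, every node of the former forest is a descendant of the single new root, so any tree algorithm can be applied to the whole structure.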

Common hierarchical structures a data-miner might have to work with are:

HTML
XML
Trees represented in JSON
File structure

Other, non-hierarchical structures could be texts, pictures, audio files, video, etc. In this article we focus on hierarchical structures.

1.2. Purpose of the data-miner

Of course the purpose of the data-miner is to mine data; however, more specifically, we can speak about general-purpose data-miners, thematic data-miners and narrow-spaced data-miners.

General-purpose data-miners are the data-miners which extract and prepare data for search engines. The technique one can imagine is to find the text in the source (for example a web-page) and map it to keywords, so that when people are searching for the keywords, the source is shown as the result. Of course, it is highly probable that these data-miners are much smarter and more complex than described here, but that is out of scope from our perspective.

If we speak about general-purpose data-mining, then even if the data is mined from a hierarchical data-source, it is very difficult for humans to define all the semantics involved, so, if one wants to create a general-purpose data-miner, machine learning will play a large part in defining all the concepts. However, a general-purpose data-miner is a generalized version of a thematic data-miner. If we add up a lot of thematic data-miners, we get a general-purpose data-miner and, inversely, if we divide the different areas a general-purpose data-miner deals with, we get thematic data-miners. This is true if we look at what they accomplish, but the implementation of a general-purpose data-miner will differ a lot from the implementation of a thematic data-miner. As a result, when we discuss thematic data-miners, we can keep in mind that general-purpose data-miners are a broader version of the same thing, at least if we look at what they achieve.

Thematic data-miners deal with a given cluster of concepts, that is, concepts which are logically related to each other. For example, if we are to extract real-estate details, then we have concepts of “type”, “bedroom number”, “picture” and so on. All these concepts are interrelated; the cluster is defined by the real-estate entity we want to extract. If we speak of the “type”, we really mean “the type of the real estate” entity.

Narrow-spaced data-miners are data-miners which extract data from very specific data-sources, like a single website. These data-miners are always particular cases of thematic data-miners, however, in many cases narrow-spaced data-miners have a lot of hard-coded logic, so, when the space of a narrow-spaced thematic data-miner is broadened, there is usually a lot of code refactoring involved.

This article focuses on thematic data-miners, which could have thousands of different sources.

1.3. The human element

We need to pay attention to legality and to the ethical aspect. If data mining from the source is illegal, then we must avoid doing it. If it is against the will of those who own it, then it is unethical and we should avoid it. Sources should be at least neutral to our extraction, but preferably happy about us extracting their data.

Why would the owner of a data source be against extracting their data? There can be many possible causes. For example, the owner might want to attract many human visitors to a website, who would visit to see their data and would be unhappy if another website were created and people would visit the new website instead of theirs.

But why would the owner of a data source be happy about extracting their data? It depends on the purpose of their data. If the purpose is to spread the information as far as possible, then data extractors are considered to be “helping hands”.

For example, if an estate agent of a small village wants to target a global audience, he/she might want to create a website and hope that people would see his/her data, but without advertisement and a lot of care to raise the popularity of the site, including design, SEO measures and so on, people searching for real estate might never see his/her site. A large website which is searchable by region and shows the results of extracted data could boost the business of the estate agent, especially if the data of the estate agent is publicized without him/her having to pay a penny. People will search for real estate in the region where the estate agent works and will find the big site showing the mined data, emphasizing the contact of the estate agent.

So, to cope with all possible needs of data-source owners, the planner of a data-miner project might want to support listing results with or without a details page. Imagine the case when one wants visitors to his/her website and would not like another site to be visited instead. In this case, one can tell him/her that, if he/she allows the extraction of data from his/her site, then the data will appear in the list of results, but will not have a details page; instead, when the user clicks on such a result, the website of the owner of the data-source would be opened in a new tab. In this case we are giving them an attractive offer, so that our site will essentially help bring visitors to his/her site. And of course, for those who just want to share the information with as many people as possible, one could show a details page for his/her data. This is a simplified strategy, but it illustrates the idea of how people can be convinced to allow us to extract their data.

1.4. The actual problem statement

In this article we intend to have solid ideas of how hierarchical data can be extracted by a thematic data-extractor from data-sources where the owner is content with their data being extracted.

2. The extraction

Now we have hierarchies to work with, possibly many. Nodes can have the parent-child relation, but they can have the ancestor-descendant relation as well. A is the ancestor and D is its descendant if A is the parent of D, or if we have a sequence of nodes S1, …, Sn, where Si is the parent of Si+1 for each neighboring pair of the sequence, A is the parent of S1 and Sn is the parent of D. Consider this HTML code chunk:

<div class="dimensions">
    <div class="large-width">
        <div class="area">Area<span class="dimension-data">500</span> <span class="unit">sqm</span></div>
        <div class="height">Height<span class="dimension-data">60</span> <span class="unit">m</span></div>
    </div>
</div>

We see that the div having the class of dimensions is the parent of the div with the large-width class and the ancestor of the divs having the area and height classes, respectively. In the case of hierarchical data, a descendant is inside the ancestor; knowing this gives us a lot of context in many cases.
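The ancestor-descendant definition can be sketched as a walk up the parent chain. The node shape with a `parent` pointer and the function name `isAncestor` are assumptions made for this illustration:

```javascript
// Hypothetical node shape: { name: string, parent: Node | null }.
function isAncestor(candidate, node) {
    // Walk up the parent chain; a node is not its own ancestor here.
    for (let current = node.parent; current !== null; current = current.parent) {
        if (current === candidate) return true;
    }
    return false;
}

// The nesting from the HTML chunk above, reduced to parent pointers.
const dimensionsNode = { name: "dimensions", parent: null };
const largeWidthNode = { name: "large-width", parent: dimensionsNode };
const areaNode = { name: "area", parent: largeWidthNode };
```

In a browser, the DOM method `Node.contains()` answers a similar question, with the difference that a node is considered to contain itself.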

2.1. Semantic concepts

In hierarchical structures we have some nodes, a (usually small) subset of which contains interesting data for us. For instance, in the example shown in section 2 we have a div with the class of dimensions, whose useful information is found in its descendants: the nodes with the classes of area and height, each containing a node with the class of dimension-data. However, the div having the class of large-width by itself does not hold any useful information outside those descendant nodes; frankly, it is just a styling node, which makes it irrelevant from our point of view. This means that the large-width div should not exist for us conceptually; we just need to know that we have a concept of dimensions, inside which we should be able to find the area and height. In terms of JavaScript selectors we know that we can find dimensions inside the document and area and height inside it, like this:

// Collect one object per .dimensions container found in the document.
const dimensions = [];
const dimensionContainers = document.querySelectorAll(".dimensions");
for (const dimensionContainer of dimensionContainers) {
    const dimension = {};
    // Search for each sub-concept only inside its parent concept.
    const areaContainer = dimensionContainer.querySelector(".area");
    if (areaContainer) {
        const value = areaContainer.querySelector(".dimension-data");
        const unit = areaContainer.querySelector(".unit");
        if (value && unit) {
            dimension.area = {value: value.innerText, unit: unit.innerText};
        }
    }
    const heightContainer = dimensionContainer.querySelector(".height");
    if (heightContainer) {
        const value = heightContainer.querySelector(".dimension-data");
        const unit = heightContainer.querySelector(".unit");
        if (value && unit) {
            // Store the height under its own key.
            dimension.height = {value: value.innerText, unit: unit.innerText};
        }
    }
    dimensions.push(dimension);
}

We immediately notice some details in the code:

It does not deal with large-width at all, because it’s irrelevant; instead, it just focuses on nodes which are semantically relevant.
When a concept is found, its direct sub-concept is searched for inside its structural context instead of the whole document, so we are narrowing the search.
Our code searches for plural results in the case of dimensions and is aware that area and height are singular in their parent concept, the dimension.
The code does not assume the existence of any data.

There are also problems to cope with:

What happens if the structure changes over time? How can we maintain the code we have above?
The code above is based on empirical evidence, on the findings of a developer. This only proves that the hierarchy the code deals with exists; it certainly does not prove that this is the only structure to work with.
How can we cope with composite data, like extracting something like “height: 60m”?
How can we cope with the variance of data, such as synonyms?
How can we cope with paging where applicable?

All these are serious questions, which show that such a concrete code will not solve the problems we face. We will need abstraction to progress further than the limits of particularity and achieve the level of a thematic data-miner. We have seen that in our examples we had a hierarchy of semantic concepts and also, we can observe the rule that an ancestor concept is an ancestor structurally as well. Also, the algorithm we had above has a pattern of searching for the child concept in the context of the parent concept.

In reality we also have the problem of conceptual inconsistency, that is, the structure of the same data-source can vary, and in some cases the same concept is at a totally different location in the structure. To give you a practical example, let’s consider the example we had about real-estate properties and their dimensions particularly. It is possible that the height of the real estate makes it very attractive for buyers, so the developers of the data-source decided to show the height separately, in an emphasized manner, at the top of the page and not to show it at its usual place. In this case, if we did not cope with this important detail, we would miss the height for real-estate properties where the height is the most important detail; we would miss the stars of the show.

2.2. Semantic rules

In 2.1 we have seen how we can have an understanding about the conceptual essence. Naturally, the concepts are useful by themselves, but we need to build up a powerful structure, a signature of the concepts, which, if shown on a diagram would describe the essence of the structure, the conceptual essence in such a powerful way that one could understand it at a glance (unless there are too many nodes for a human, of course).

In order to gain the ability to build up such a powerful structure we need to find patterns, but in a more systematic way than the one we have seen in the naive code used as an example. Concepts can be described by rules. I call these rules semantic rules. The following attributes describe a concept:

parent concept(s): Which concepts can be the parents of the current concepts? A special case is a root parent. Also, the possibility to define multiple parents is to avoid duplication when the very same concept in the very same substructure can be present at various places.
descendant(s): Useful to make the rule more specific and possibly filter out unneeded cases and thus increase performance.
relative path: How can the given concept be found starting from the parent concept as root?
plurality: Should we stop at the first match, or continue searching?
excludes: Which concepts are excluded if this concept is matched? (this is a symmetrical relation)
implies: What concepts imply this concept? (consequently, if this concept is not matched, the concepts which imply it can be excluded as well)
value: Where the actual value of the concept can be found.
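To make the list above more tangible, here is one way such a rule could be written down as plain data. The property names mirror the attributes above, while the use of CSS selectors as path descriptors and the helper `isCompleteRule` are assumptions made for this sketch:

```javascript
// A hypothetical semantic rule for the "area" concept from section 2.1.
const areaRule = {
    concept: "area",
    parents: ["dimensions"],          // possible parent concepts
    descendants: ["value", "unit"],   // sub-concepts expected inside
    relativePath: ".area",            // path starting from the parent concept
    plural: false,                    // stop at the first match
    excludes: [],                     // concepts ruled out by a match
    impliedBy: [],                    // concepts that imply this one
    value: null,                      // no own value; the sub-concepts hold it
};

// Checks that a rule object carries every attribute from the list.
function isCompleteRule(rule) {
    const required = ["parents", "descendants", "relativePath",
                      "plural", "excludes", "impliedBy", "value"];
    return required.every((key) => key in rule);
}
```

Keeping rules as data rather than code is what allows only the `relativePath` to be edited when a data-source changes its layout.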

If the structure of the data source changes over time, then occasionally the semantic rules will have to change as well. However, arguably in the majority of cases the relative path of the concepts will be the only thing to be changed in this case, which is much easier to maintain than to refactor code. If the relative path descriptor does not change and the virtual structure of our semantic rules remains similar, then our data-miner might just work well even if the data-source is changed.

For example, in the case of web-crawlers, writing the initial code-base is just a fraction of the long-term costs. The real financial burden is maintaining the crawlability of many thousands of different sites, all changing from time to time. With semantic rules, even if they are defined manually, maintenance is greatly simplified. And if there are many data-sources, a module which proposes the new semantic rules would certainly not hurt.

Imagine the case when we already have a well-defined set of semantic rules and a cron job detects that some data from a data source was not found in the last few hours. It would search for the missing data in the data-source and, once found, generate the semantic rules for it and propose a new semantic tree of rules, which would run in parallel with the main semantic tree and store data in some experimental place. When the human developers arrive back and check the reports, they would see that the data-miner thinks the semantic tree needs to be changed and even has a proposal. If the data-miner is right with its proposal, nothing was missed: the extracted data is among the experimental data, and once the new semantic tree is accepted, the experimental data would override the actual data.

Of course, the engine will not be 100% accurate in such a case, its suggestion might be mistaken, or, even if accurate, slight adjustments might be helpful. A relative path might be less accurate than needed, or the concept in the changed structure might have a different plurality, but even if there is room for improvement, such an automatic feature, generating a new semantic tree would be extremely helpful and would make sure that data is accurately extracted even in the case of large changes.

If the owner of the data-source is cooperating, then he/she could provide the changed semantic tree.

The parser is the part of the project which should do the job of handling composite data, but at the level of semantic rules the parser could be helped with a strategy of defining rules. One could define a syntax for the parser, so it would know which concepts are not atomic and maybe even get some clues about how they should be parsed.

Synonyms could be dealt with using keyword clusters.
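A keyword cluster can be as simple as a lookup from a canonical concept to the labels that stand for it. The clusters below and the function name `canonicalConcept` are invented for illustration:

```javascript
// Hypothetical keyword clusters: each canonical concept lists the
// labels that data-sources are assumed to use for it.
const keywordClusters = {
    area: ["area", "surface", "floor space"],
    height: ["height", "elevation"],
};

// Returns the canonical concept for a label, or null if no cluster matches.
function canonicalConcept(label) {
    const normalized = label.trim().toLowerCase();
    for (const [concept, keywords] of Object.entries(keywordClusters)) {
        if (keywords.includes(normalized)) return concept;
    }
    return null;
}
```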

Navigation can be handled by a module created for this purpose; we can call it the navigation module. The algorithm below describes how data-mining should occur:

initialNavigation()
while (page ← page.next())
    for each element in elements
        for each node in semanticTree  // breadth-first or depth-first traversing
            if (applicable(node)) then extract(node)
        end for
    end for
end while
finalNavigation()
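The loop above can be sketched in runnable form under heavy simplification. Here pages are in-memory arrays, elements are plain objects, the semantic tree is flattened to a list of concept names, and `applicable`, `extract` and `mine` are hypothetical helpers; navigation is left out entirely:

```javascript
// Hypothetical in-memory stand-ins: each page is a list of elements and
// each element is a plain object keyed by concept name.
const pages = [
    [{ area: "500 sqm", height: "60 m" }, { area: "80 sqm" }],
    [{ height: "12 m" }],
];

// A flattened semantic tree: here just the concept names to look for.
const semanticTree = ["area", "height"];

// A node is applicable when the element actually contains the concept.
const applicable = (element, node) => node in element;
const extract = (element, node) => ({ concept: node, raw: element[node] });

function mine(pages, semanticTree) {
    const results = [];
    for (const page of pages) {                 // one iteration per page
        for (const element of page) {           // for each element
            for (const node of semanticTree) {  // for each node of the tree
                if (applicable(element, node)) {
                    results.push(extract(element, node));
                }
            }
        }
    }
    return results;
}

const mined = mine(pages, semanticTree);
```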

Before we move on to deal with semantic trees we need to solve the problem of conceptual inconsistencies. We have seen that concepts can have multiple conceptual parents, but this is neither inconsistent nor a violation of the criteria of a tree. In reality, for each element a concept will have a single parent, but it is perfectly possible that a given concept will have one parent for an element and a different parent for another element. This is what we described with the possibility of a concept having one of many possible parents per element. However, even though we have a very good explanation for a concept having multiple parents, we still have to deal with the possibility of a concept being present twice for the same element. How is this possible?

Let’s consider the example of books displayed on a website. A book may have main author(s) and secondary author(s). It would not be surprising at all if the main authors are displayed differently than secondary authors. Also, a main picture might be displayed of the book’s cover and maybe some smaller images elsewhere, shown in a gallery. This would be a nice feature on a site which shows children’s books. However, from our perspective this means that the same concept is placed at several places at the same time. One may think of quantum mechanics, where time-place discrepancies are also possible.

How can we solve this problem for ourselves? We need to have an understanding, otherwise the whole thought process will go astray. My solution is to differentiate the concepts in the different stages of extraction and parsing. This would mean that the concept of “author” or “picture” is conceptually unique when we store it, even though these might be plural. More elements do not necessarily mean more concepts. On the other hand, at the time of extraction, the main picture is a different concept than the other pictures, also, the main authors are a different concept from the secondary authors. The moment of merging happens at the point when we store these and therefore have to convert the extraction concepts into the concepts we store.
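The merging step can be sketched as a mapping from extraction-time concepts to stored concepts. The concept names and the function `mergeConcepts` are assumptions made for this example:

```javascript
// Hypothetical mapping from extraction-time concepts to stored concepts.
const storedConceptOf = {
    mainAuthor: "author",
    secondaryAuthor: "author",
    mainPicture: "picture",
    galleryPicture: "picture",
};

// Groups extracted values under the concept they are stored as.
function mergeConcepts(extracted) {
    const stored = {};
    for (const { concept, value } of extracted) {
        // Concepts without a mapping are stored under their own name.
        const target = storedConceptOf[concept] || concept;
        if (!stored[target]) stored[target] = [];
        stored[target].push(value);
    }
    return stored;
}

const book = mergeConcepts([
    { concept: "mainAuthor", value: "A. Writer" },
    { concept: "secondaryAuthor", value: "B. Helper" },
    { concept: "title", value: "Some Book" },
]);
```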

2.3. Semantic tree

The semantic rules we define have parent-child and ancestor-descendant relations. The semantic tree is the blueprint of the conceptual essence, a plan to extract data; its attributes also instruct the extractor about which concept should be found where, how the extractor should operate to have good performance and so on.

However, the entities to extract can vary greatly and there might be cases when seemingly the same concept is distributed among various places. The word “seemingly” here means that even though at the end these concepts will be merged, at the phase of the actual data-mining we view these to be similar, but different concepts. The fact that a conceptual node might have several parents in the semantic rules only means that one of the parents of the list is to be expected, which means that all of the listed parents are possible. Consider a semantic rule which says that the parent of currency can be the description or the price:

parent: description,price

This means that the parent can be any of the elements of the list, that is, in the case of some elements the parent will be description, but in the case of some other elements, the parent will be price, so we do not violate by design our aim to have a semantic tree.

We have to watch out for cycles though. Without additional tools there is no protection against cycles that we might accidentally add while we define the semantic rules, or while our semantic rule generator does its magic. Therefore it makes sense to check whether there is a cycle in the semantic tree. If this is done automatically, then all the better.

However, since the description of a semantic tree can define multiple alternative parents in order to cover the actual tree of concepts for all elements, at least those whose pattern is known, the semantic tree in reality is a semantic tree pattern, and when we use it or search it for cycles, we will need to traverse every possible parent where more than one is listed.
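An automatic cycle check over such a pattern can be sketched with a depth-first search that follows every listed parent. The rule set and the function name `hasCycle` are hypothetical:

```javascript
// Hypothetical rule set: each concept maps to its list of possible parents.
// An empty list marks a main concept hanging off the abstract root.
const possibleParents = {
    "real-estate": [],
    description: ["real-estate"],
    price: ["description"],
    currency: ["description", "price"],
};

// Depth-first search over child-to-parent edges; returns true on a cycle.
function hasCycle(parentsOf) {
    const state = new Map(); // concept -> "visiting" | "done"
    const visit = (concept) => {
        if (state.get(concept) === "done") return false;
        if (state.get(concept) === "visiting") return true; // back edge found
        state.set(concept, "visiting");
        for (const parent of parentsOf[concept] || []) {
            if (visit(parent)) return true;
        }
        state.set(concept, "done");
        return false;
    };
    return Object.keys(parentsOf).some((concept) => visit(concept));
}
```

Multiple possible parents, as for currency here, are not a cycle by themselves; the check only fails when a concept can be reached again by walking up from itself.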

Consider the example we have brought for a semantic tree.

We can see that we have a single main concept, “REAL-ESTATE”. This is not a general pattern; there might be several main concepts to extract from the same data-source, but for the sake of simplicity we have a single one. Implicitly, the technical implementation involves a single, abstract root, which is the parent of all the main concepts. We see that “TYPE” is plural for “REAL-ESTATE”, but “BEDROOM #” is singular. The conceptual reason is that we defined type in such a way that not all pairs of types are mutually exclusive, so if a type is found, for example “Bungalow”, the engine should not stop searching for other types. If the number of bedrooms is found, however, there is no point in searching further for it, because it is safe to assume that even if there are several occurrences of this concept, they will be equivalent.

Why is “CURRENCY” special? It has two potential parents: “DESCRIPTION” or “PRICE”. In some cases it can be found inside “DESCRIPTION”, in other cases it can be found inside “PRICE”. For example, if there are several values for “PRICE”, then “CURRENCY” is inside “DESCRIPTION”, but otherwise “CURRENCY” is inside “PRICE”.

But why could a real-estate property have several different prices? Well, this is outside the scope of data-mining, but to have an understanding, it is good to consider a viable example. Let’s consider the example of an estate agent who gets a 20% bonus if he/she successfully sells a given real-estate property within a month. In this case, the agent might want to draw buyers by putting a 10% discount on the property for a month; if he/she is successful in selling it, then he/she will still have a nice bonus. Considering this economic mechanism, the data-source might show the prices in a special way if there is such a discount, like:

<div class="description">
    <table class="prices">
        <tr>
            <td>
                <p class="price"><span>100000</span></p>
                <p class="price red"><span>90000</span> 10% discount!</p>
            </td>
            <td>
                <div class="currency">$</div>
            </td>
        </tr>
    </table>
    <!-- … -->
</div>

while, if there is a single price, a different structure is generated:

<div class="description">
    <p class="price">
        <span>100000</span>
        <span class="currency">$</span>
    </p>
</div>

We notice again that the possibility of multiple parents does not mean that there will be any extracted element with multiple parents, it just describes that among the many elements some of the items will have “PRICE” as parent, the others will have “DESCRIPTION” as parent.

If we take a look at “DIMENSIONS”, we will see several concepts with the same name (“VALUE” and “UNIT”), but they have a different meaning in their specific context (“AREA” and “HEIGHT”, respectively).

Another interesting region is “CONTACT”, which has “PERSON” and “COMPANY” elements as conceptual children, and both “PERSON” and “COMPANY” are plural. The underlying logic is that several companies or persons can be contacted when one wants to buy/view a real-estate property. We have sub-concepts of “FACEBOOK”, “EMAIL”, “PHONE”, “NAME”, “OTHER” and “WEBSITE” for both “PERSON” and “COMPANY”, but similarly to the “VALUE” and “UNIT” example we have seen in “DIMENSIONS”, here the concepts have different meanings. A corporate website is a different thing from a personal website.

However, if we draw/generate such a semantic tree, then it is better than a long documentation. It actually describes for coders the exact way the engine will operate and in the case when the engine is suggesting a new semantic tree for a reason, then, provided that the engine generates the tree’s picture, one will immediately understand what the essence of the engine’s suggestion is. Also, with such a nice diagram managers will understand the mechanism of the system at a glance.

2.4. Parallelization

It is feasible to send requests in parallel, using promises and the event queue in the case of JavaScript, or multiple threads in a multi-threaded environment.
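In JavaScript, a minimal sketch of this could rely on `Promise.all`. The function `fetchPage` below is a hypothetical stand-in that simulates a request with a timer, and the URLs are placeholders:

```javascript
// Hypothetical stand-in for a network request; a real data-miner would
// call an HTTP client here instead of a timer.
function fetchPage(url) {
    return new Promise((resolve) => {
        setTimeout(() => resolve({ url, body: `content of ${url}` }), 10);
    });
}

// Starts all requests at once and resolves when every one has finished.
async function fetchAll(urls) {
    return Promise.all(urls.map((url) => fetchPage(url)));
}

const pending = fetchAll(["https://example.com/1", "https://example.com/2"]);
```

`Promise.all` rejects as soon as any single request fails; `Promise.allSettled` is an alternative when partial results are acceptable. Given the concerns of section 1.3, it is probably wise to limit how many requests are sent to the same data-source at once.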

3. Symbiosis

If the owner of the data-source is happy to have his/her data extracted, then he/she might notify the maintainers of the data-miner whenever structural changes occur, or provide an API from which data can be extracted, for example as a large JSON document. However, this JSON will probably be hierarchical as well.

In some extremely lucky cases the owner of the data-source will be happy to provide and maintain the semantic rules. This could happen in the case when “spreading the word” via a data-miner is deeply valued by the owner of the data-source. The key is to have an offer, which helps reaching the goals of the data-source, so the owner of the data-source will see the data-miner as his/her extended hand instead of seeing it as a barrier in reaching the goals of the data-source.

4. Parsing the data

Once the data has been successfully extracted, the results can be parsed just before they are stored. For example, in some cases we might have textually composite data in the same node, which is impossible to separate via the semantic tree, which needs leaf nodes of the original structure as atoms. So, in many cases a separate layer is needed to decompose composite textual data which holds data applicable to different concepts.
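For the “height: 60m” case mentioned in section 2.1, such a decomposition layer could start from a regular expression. The “label: number unit” layout is an assumption, and `parseComposite` is a hypothetical helper; real data-sources will need their own patterns:

```javascript
// Splits a composite text such as "height: 60m" into its parts.
// Returns null when the text does not follow the assumed layout.
function parseComposite(text) {
    // label, colon, number (optionally with decimals), unit not starting with a digit
    const pattern = /^\s*([^:]+?)\s*:\s*(\d+(?:\.\d+)?)\s*([^\d\s]\S*)\s*$/;
    const match = pattern.exec(text);
    if (!match) return null;
    const [, label, value, unit] = match;
    return { concept: label.toLowerCase(), value: Number(value), unit };
}
```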

Also, if, for some technical reason, the semantic tree split a concept into different concepts (for example main authors and secondary authors), then the data-parser can merge the concepts which belong together into a single concept. There are many possible jobs the data-parser might fulfill, depending on the actual needs.

5. Analyzing the data

Let’s assume that we have a very nice schema and we store the data we have efficiently. However, we might be interested to know what patterns we can find in our data. The patterns we are interested in include:

association rules
functional dependencies
conditional functional dependencies (a functional dependency upon the table or cluster records provided a condition is met)

AR (Association Rule): c => {v1, …, vn}

If a certain condition (c) is fulfilled, then we have a set of constant values for their respective (database table) columns.

For example, let’s consider the table:

person(id, is_alive, has_right_to_vote, has_valid_passport)

Now, we can observe that:

(is_alive = 0) => ((has_right_to_vote = 0) ^ (has_valid_passport = 0))

So, this is an association rule, which has the condition of is_alive = 0 (so the person is deceased) and we know for a fact that dead people will not vote and their passport is invalid.

When we extract data from a source, there might be some association rules (field values are associated to a condition) we do not know about yet. If we find those out, then it will help us a lot. For instance, imagine the case when for whatever reason an insert/update is attempted with values:

is_alive = 0
has_right_to_vote = 1

In this case we can throw an exception, so, this way we can find bugs in the code or mistakes in the semantic tree. This kind of inconsistency prevention is useful even if we are not speaking of data-mining, but, in the context of this article it is extremely useful, as it might detect problems in the semantic tree automatically.
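A sketch of such a guard follows, with the rule written as data. The object shape and the function name `assertAssociationRule` are assumptions made for this example:

```javascript
// An association rule: if `condition` holds for a record, every column in
// `implied` must carry the listed constant value.
const deceasedRule = {
    condition: (person) => person.is_alive === 0,
    implied: { has_right_to_vote: 0, has_valid_passport: 0 },
};

// Throws when a record satisfies the condition but breaks an implied value.
// Columns missing from the record (partial updates) are not checked.
function assertAssociationRule(rule, record) {
    if (!rule.condition(record)) return;
    for (const [column, expected] of Object.entries(rule.implied)) {
        if (column in record && record[column] !== expected) {
            throw new Error(`Rule violated: ${column} must be ${expected}`);
        }
    }
}
```

Calling the guard with `{ is_alive: 0, has_right_to_vote: 1 }` throws, while a record for a living person passes regardless of the other columns.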

FD (Functional Dependency): S → D

The column-set S (Source):

S = {S1, …, Sm}

determines the column-set D (Destination):

D = {D1, …, Dn}

The formula more explicitly looks like this:

{S1, …, Sm} → {D1, …, Dn}

This relation means that if we have two different records/entities with the same source values:

Source1 = Source2 = {s1, …, sm}

then their destination will match as well:

Destination1 = Destination2 = {d1, …, dn}

Inversely this is not necessarily true. If two records/entities have the same destination values, then the functional dependency does not require them to have the very same sources.

CFD (Conditional Functional Dependency): c => S → D

A CFD is a generalization of an FD. It adds a condition to the formula, so that the functional dependency’s applicability is only guaranteed if the condition is met. We can describe functional dependencies as particular conditional functional dependencies, where the condition is inherently true:

(true => S → D) <=> S → D

Also, an AR can be described as

c => {} → D
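Since FDs and ARs are both particular CFDs, one checker can test all three on a sample of records. This is a sketch; the function `holdsCfd` and the sample rows are invented for illustration:

```javascript
// Checks c => S → D on a sample of records. `condition` is a predicate,
// `source` and `destination` are arrays of column names.
function holdsCfd(records, condition, source, destination) {
    const seen = new Map(); // serialized S values -> serialized D values
    for (const record of records.filter(condition)) {
        const key = JSON.stringify(source.map((column) => record[column]));
        const value = JSON.stringify(destination.map((column) => record[column]));
        if (seen.has(key) && seen.get(key) !== value) return false;
        seen.set(key, value);
    }
    return true;
}

// Hypothetical sample rows for the person table used above.
const persons = [
    { id: 1, is_alive: 0, has_right_to_vote: 0, has_valid_passport: 0 },
    { id: 2, is_alive: 0, has_right_to_vote: 0, has_valid_passport: 0 },
    { id: 3, is_alive: 1, has_right_to_vote: 1, has_valid_passport: 0 },
    { id: 4, is_alive: 1, has_right_to_vote: 0, has_valid_passport: 1 },
];

const always = () => true;
const isDeceased = (person) => person.is_alive === 0;
const voteAndPassport = ["has_right_to_vote", "has_valid_passport"];

// FD (true => S → D): the id determines the other columns.
const fdHolds = holdsCfd(persons, always, ["id"], voteAndPassport);
// AR (c => {} → D): all deceased persons share the same D values.
const arHolds = holdsCfd(persons, isDeceased, [], voteAndPassport);
// Without the condition, the empty source no longer determines D.
const unconditionalHolds = holdsCfd(persons, always, [], voteAndPassport);
```

Note that a sample can only refute a pattern; a pattern that holds on the stored records merely seems to be accurate.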

5.1. The More Useful (MU) relation

Let’s consider that we have two patterns, P1 and P2, which could be ARs, FDs or CFDs. Is there a way to determine which of them is more useful? Generally speaking:

P1 MU P2 if and only if P1 is more general than P2.

Since both ARs and FDs are particular cases of CFDs, we will work with the formula of CFDs:

P1 = (c1 => S1 → D1)

P2 = (c2 => S2 → D2)

P1 MU P2 <=> ((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))
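The comparison can be sketched in code if we accept a simplification: conditions are modeled as conjunctions of column = value constraints, so that c2 => c1 holds whenever every constraint of c1 is also part of c2. This model, the pattern objects and the names `implies`, `isSubset` and `moreUseful` are assumptions; arbitrary logical conditions would need a real implication check:

```javascript
// A pattern c => S → D as { condition, source, destination }, where the
// condition is an object of column = value constraints ({} means true).
const isSubset = (a, b) => a.every((item) => b.includes(item));

// c2 => c1 holds when every constraint of c1 is also required by c2.
const implies = (c2, c1) =>
    Object.entries(c1).every(([column, value]) => c2[column] === value);

// P1 MU P2 <=> (c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2)
function moreUseful(p1, p2) {
    return implies(p2.condition, p1.condition)
        && isSubset(p1.source, p2.source)
        && isSubset(p2.destination, p1.destination);
}

// Hypothetical patterns over an address table.
const general = { condition: {}, source: ["zip"], destination: ["city", "state"] };
const specific = {
    condition: { country: "USA" },
    source: ["zip", "street"],
    destination: ["city"],
};
```

Here `moreUseful(general, specific)` is true, because the general pattern needs a weaker condition, fewer source columns and determines more destination columns.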

Note that MU is reflexive, transitive, antisymmetrical and has a neutral element.

Reflexive: (c1 => c1) ^ (S1 ⊆ S1) ^ (D1 ⊇ D1) is trivially true.

Transitive:

Let’s suppose that

((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))

((c3 => c2) ^ (S2 ⊆ S3) ^ (D2 ⊇ D3))

is true. Is

((c3 => c1) ^ (S1 ⊆ S3) ^ (D1 ⊇ D3))

also true?

Since c3 => c2 => c1, due to the transitivity of the implication relation we know that c3 => c1.

Since S1 ⊆ S2 ⊆ S3, due to the transitivity of the subset relation we know that S1 ⊆ S3.

Since D1 ⊇ D2 ⊇ D3, due to the transitivity of the superset relation we know that D1 ⊇ D3.

The three transitivities together prove that MU is transitive.

Neutral element (the least useful):

false => All columns → {}

Every pattern is more useful than this one: false => c is always true, every column set S is a subset of the set of all columns, and every column set D is a superset of {}.

Antisymmetrical:

If P1 MU P2 and P2 MU P1, then P1 <=> P2.

P1 MU P2:

((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))

P2 MU P1:

((c1 => c2) ^ (S2 ⊆ S1) ^ (D2 ⊇ D1))

P1 MU P2 ^ P2 MU P1 <=>

((c2 => c1) ^ (c1 => c2)) ^

((S1 ⊆ S2) ^ (S2 ⊆ S1)) ^

((D1 ⊇ D2) ^ (D2 ⊇ D1)) <=>

(c1 <=> c2) ^ (S1 = S2) ^ (D1 = D2) <=>

P1 <=> P2

so the relation is antisymmetrical:

((P1 MU P2) ^ (P2 MU P1)) <=> (P1 <=> P2)

This means that MU is a partial order, so the set of patterns together with MU forms a poset (partially ordered set), and all the algebra applicable to partially ordered sets in general can be used to analyze MU as well.

The importance of the MU relation is that we can search for such patterns in an ordered manner. We start from the most useful candidates we are willing to consider and, whenever a candidate proves to be false, we derive less useful candidates from it by adding columns to S or removing columns from D. In the other direction, if we know that a pattern is accurate, we know that all less useful patterns are accurate as well, so they do not need to be checked separately. We can start our search from a useful pattern candidate, but not necessarily from the most useful one, because, intuitively, it is not very probable that every column has the same value in all records; that would defeat the purpose of storing so many values. This means that it makes sense to define the most useful patterns we consider possible and begin the search there.
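One possible shape of such an ordered search is sketched below. It is restricted to plain FDs with a single destination column, tries the smallest source sets first, and skips any candidate that is a superset of an already accepted source, because that candidate is less useful and holds automatically. The names `fd_holds` and `minimal_sources` and the sample rows are assumptions made for this illustration only.

```python
from itertools import combinations


def fd_holds(records, source, destination):
    """True if no two records agree on `source` but differ on `destination`."""
    seen = {}
    for record in records:
        s_values = tuple(record[col] for col in source)
        d_values = tuple(record[col] for col in destination)
        if seen.setdefault(s_values, d_values) != d_values:
            return False
    return True


def minimal_sources(records, columns, target):
    """Find the smallest source sets S for which S -> {target} has no counter-example."""
    candidates = [col for col in columns if col != target]
    found = []
    for size in range(len(candidates) + 1):  # smallest S first, i.e. most useful first
        for source in combinations(candidates, size):
            # A superset of an accepted source is less useful and already known to hold.
            if any(set(accepted) <= set(source) for accepted in found):
                continue
            if fd_holds(records, source, [target]):
                found.append(source)
    return found


# Hypothetical sample data
rows = [
    {"zip": "03301", "city": "Concord", "state": "NH", "name": "Ann"},
    {"zip": "03301", "city": "Concord", "state": "NH", "name": "Bob"},
    {"zip": "94520", "city": "Concord", "state": "CA", "name": "Cy"},
    {"zip": "10001", "city": "New York", "state": "NY", "name": "Ann"},
]

print(minimal_sources(rows, ["zip", "city", "state", "name"], "state"))
# [('zip',), ('city', 'name')]
```

The second result, {city, name} → state, has no counter-example only because the sample is tiny. It is a pattern candidate, not an established rule.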

5.2. MU Lattice

Possible patterns can be represented using a lattice, where the root (top) is the most useful node and the leaf (bottom) is the least useful node. We have a join and a meet operation, and the set of patterns is closed under both:

P1 join P2 = (c1 v c2) => (S1 ⋂ S2) → (D1 ⋃ D2)

P1 meet P2 = (c1 ^ c2) => (S1 ⋃ S2) → (D1 ⋂ D2)

Of course:

(P1 join P2) MU (P1 meet P2)

Proof:

(c1 ^ c2) => (c1 v c2) is trivially true and

(S1 ⋂ S2) ⊆ (S1 ⋃ S2) is trivially true and

(D1 ⋃ D2) ⊇ (D1 ⋂ D2) is trivially true.
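The join and meet formulas can be tried out with a small sketch. To make c1 v c2 and c1 ^ c2 computable, it assumes a condition is represented extensionally, as the set of row ids of a fixed data set that satisfy it; "c2 => c1" then becomes "rows of c2 are a subset of rows of c1", which only approximates logical implication over the rows at hand. All names here are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    rows: frozenset         # ids of the rows that satisfy the condition c (an assumption)
    source: frozenset       # S
    destination: frozenset  # D


def more_useful(p1, p2):
    """P1 MU P2 with implication modeled as a subset test on matching rows."""
    return (p2.rows <= p1.rows
            and p1.source <= p2.source
            and p1.destination >= p2.destination)


def join(p1, p2):
    """(c1 v c2) => (S1 intersect S2) -> (D1 union D2)"""
    return Pattern(p1.rows | p2.rows, p1.source & p2.source, p1.destination | p2.destination)


def meet(p1, p2):
    """(c1 ^ c2) => (S1 union S2) -> (D1 intersect D2)"""
    return Pattern(p1.rows & p2.rows, p1.source | p2.source, p1.destination & p2.destination)


# Hypothetical patterns
p1 = Pattern(frozenset({1, 2, 3}), frozenset({"zip"}), frozenset({"city"}))
p2 = Pattern(frozenset({3, 4}), frozenset({"zip", "street"}), frozenset({"city", "state"}))

top, bottom = join(p1, p2), meet(p1, p2)
print(more_useful(top, bottom))                        # True
print(more_useful(top, p1), more_useful(p1, bottom))   # True True
```

Note that the join is only a candidate: it is more useful than both inputs, so it does not follow from them that it actually holds in the data.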

We can split the lattice into many simpler lattices, each having its own condition. Since an AR never has source columns, it cannot be less useful than a non-AR CFD. Also, since the condition of an FD (true) is implied by any other possible condition, FDs are never less useful than CFDs with real conditions.

5.3. Domain of search

The domain of search can vary. It can be limited to a single table, or to a cluster of tables related to each other in a one-to-one, one-to-many, many-to-one or many-to-many manner. The condition can be simplistic, checking only the equals operator on some columns, or it can be complex, considering several columns, even across tables, and using several possible operators. The choice depends on the kinds of patterns we intend to find, the available computing resources, the ability of the development team to work out something serious, time, and, yes, money.

5.4. Differentiation between patterns and reality

Suppose we have a pattern P which was found automatically. We only know that there is no counter-example of P, or, if we have a level of tolerance, that there were too few counter-examples to discard the pattern. So, P appears to be true. But is appearance equivalent to the truth in this case? As a matter of fact, nature produces countless examples of seemingly impossible occurrences or highly improbable coincidences.

It is a common fallacy to concentrate on a pattern and, because the result is so unlikely to be a mere coincidence, to exclude the possibility that it was just a coincidence. This is the so-called Texas sharpshooter fallacy, even though in our specific case it is committed unconsciously.

If I have a die, that is, a cube with 6 sides numbered 1–6, and I toss it 1000 times and the result is always six, then I will naturally feel that something is not right: I might be divinely favored, or the stars might be lining up in my favor. But in this case I’m ignoring the fact that there is no connection between the results of the tosses, or, in other words, that my experiments are independent of each other. I could calculate that getting a 6 in all 1000 tosses has a probability of 1 / 6^1000, which is quasi-impossible. Yet, it is only quasi-impossible and not actually impossible.

If I toss the same die 1000 times and get a varied sequence of 1000 numbers, then I could calculate the probability of that particular, not special sequence occurring exactly as it did. And, surprise, surprise, the result will be exactly 1 / 6^1000 as well, but because the results in the sequence are varied, I do not feel them to be special. Consequently, getting a 6 in all 1000 tosses is, in terms of probability, not special at all either.

There is no mathematical difference between the probability of tossing a die 1000 times and getting only sixes and that of tossing it 1000 times and getting any specific sequence of 1000 elements you might choose in advance. The probability of each exact sequence is the same before I start tossing the die. The difference between the two sequences is the meaning that I, as a person, attribute to one of them.
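The equality of the two probabilities can be verified with exact arithmetic. This is a minimal sketch assuming a fair die and independent tosses; the helper name `sequence_probability` and the fixed random seed are choices made only for this illustration.

```python
import random
from fractions import Fraction

TOSSES = 1000
P_SINGLE = Fraction(1, 6)  # probability of one given face on a fair die


def sequence_probability(sequence):
    """Probability of one exact sequence of independent fair-die results."""
    return P_SINGLE ** len(sequence)


all_sixes = [6] * TOSSES
rng = random.Random(0)  # seeded so the varied sequence is reproducible
varied = [rng.randint(1, 6) for _ in range(TOSSES)]

print(sequence_probability(all_sixes) == sequence_probability(varied))  # True
print(len(str(6 ** TOSSES)))  # 779: number of decimal digits in the denominator
```

The probability depends only on the length of the sequence, not on its content, which is exactly why the all-sixes sequence is not mathematically special.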

Also, if I win a lottery, I might think about the probability of my choice of numeric combination being correct and feel that I’m especially lucky, but if I calculate the probability of the actual result when I do not win, I will not find any difference in the probability itself. Yet, people tend to calculate the chances of the case when they get lucky, but not the chances when they are not. The low probability of a given event is only special because we are interested in it; we can find countless similarly low-probability events happening all the time, but we are just not interested enough in the majority of them to analyze them.

But let’s take this example further. Before I toss the die 1000 times, I do not know the exact sequence I will get. In fact, it is almost impossible to guess it, but I know that whatever the sequence will be, its a priori probability is 1 / 6^1000, which is practically zero.

This means that whenever we do not attribute a meaning to a pattern, we are inclined to fallaciously dismiss it without even considering that it might reflect the nature of things. Conversely, if we do attribute a meaning to a pattern, we might be inclined to fallaciously overlook that it was a mere coincidence when it does not actually reflect how things work.

If we find a pattern with a tool, we get enthusiastic and we almost want it to be the nature of things, but we need to be much more severe before we factually accept a pattern. Consider the example of primes. How many primes are there? The answer is simple: infinitely many.

Proof (reductio ad absurdum):

Let’s assume that there is a finite number n of primes and that there are no others:

p1, …, pn

Now, let’s consider the number:

N = (p1 * … * pn) + 1

We know that N is not divisible by any of p1, …, pn, because dividing N by any of them leaves a remainder of 1. There are two cases: N is either a prime or a composite number. If N is a prime, then we found a new prime. Otherwise, if N is composite, then it is divisible by at least one prime, which cannot be among p1, …, pn. So, in either case we find a new prime, which contradicts our assumption; therefore there are infinitely many prime numbers.
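Both cases of the proof can be observed on small inputs. The sketch below uses naive trial division, which is fine for numbers of this size; the helper name `smallest_prime_factor` is made up for the illustration.

```python
from math import prod


def smallest_prime_factor(n):
    """Return the smallest prime factor of n (n itself if n is prime), by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


for primes in ([2, 3, 5], [2, 3, 5, 7, 11, 13]):
    n = prod(primes) + 1
    assert all(n % p == 1 for p in primes)  # N leaves remainder 1 for every listed prime
    factor = smallest_prime_factor(n)
    kind = "prime" if factor == n else "composite"
    print(n, kind, "new prime:", factor)
# 31 prime new prime: 31
# 30031 composite new prime: 59
```

In the first case N itself is the new prime; in the second, N = 30031 = 59 * 509 is composite, and its prime factors are not in the original list.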

How many primes are even? Exactly one: the number 2. Now, if we have a huge set of primes which does not contain 2, and we do not know that 2 is a prime, we might be inclined to think that primes can only be odd numbers, which is of course wrong. If we pick a prime randomly from a huge set of primes, the chance that it will be exactly 2 is minuscule. However, if a human has to pick a prime number, a human will know only a few primes and 2 is the “first”, so 2 has a high chance of being chosen by a human.

The point of all this contemplation is that something being very frequent or highly probable does not make it necessarily true. When we take a look at a pattern, it is good to be very critical about it: to think about how that pattern could fail, and about the consequences if we assume the pattern to be accurate and it then leaves us in trouble at the most inopportune time.

5.5. Usage of factually validated patterns

Now, if we accept a rule to be factually accurate, then we might want to make sure that it is respected. Assuming that

c => S → D

is accurate, we also assume that if the condition is met and there is already a record having the source values s1, …, sm, then inserting/updating another record with the same source values, but with different destination values, leads to an error. Let’s suppose that we throw an exception when an accepted pattern is about to be violated. If many such exceptions are thrown, then we have a problem. What could the problem be?

the semantic tree might have wrong/deprecated rules
the older records might be broken
the pattern might be no longer valid, or, it might have been wrongly accepted in the first place

See? By analyzing the data we can add some features of machine-learning, so our data-miner will really rule the problem space it is responsible for. Such patterns can also be used at insertion and update, when we do not get some of the destination values, but we have a pattern from which we can deduce them. For that, a tableau of at least the most frequent source values and conditions could come to our help.
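As an illustration of both uses, the sketch below keeps a tableau of known source and destination values for one accepted pattern, raises an exception when a record would violate it, and fills in missing destination values when they can be deduced. It assumes in-memory dictionaries and treats `None` as "value not provided"; `PatternGuard` and `PatternViolation` are hypothetical names, not part of any existing library.

```python
class PatternViolation(Exception):
    """Raised when a record contradicts an accepted pattern c => S -> D."""


class PatternGuard:
    def __init__(self, condition, source, destination):
        self.condition = condition
        self.source = tuple(source)
        self.destination = tuple(destination)
        self.tableau = {}  # source values -> destination values seen so far

    def check(self, record):
        """Validate a record before insert/update and return it, completed where possible."""
        if not self.condition(record):
            return record  # the pattern does not apply to this record
        key = tuple(record[col] for col in self.source)
        known = self.tableau.get(key)
        if known is None:
            # First record with these source values: remember it only if it is complete.
            if all(record.get(col) is not None for col in self.destination):
                self.tableau[key] = tuple(record[col] for col in self.destination)
            return record
        completed = dict(record)
        for column, expected in zip(self.destination, known):
            actual = completed.get(column)
            if actual is None:
                completed[column] = expected  # deduce the missing destination value
            elif actual != expected:
                raise PatternViolation(
                    f"{column}={actual!r} contradicts known value {expected!r} for {key}")
        return completed


# Hypothetical accepted pattern: country = 'US' => {zip} -> {state}
guard = PatternGuard(lambda r: r["country"] == "US", ["zip"], ["state"])
guard.check({"country": "US", "zip": "03301", "state": "NH"})
filled = guard.check({"country": "US", "zip": "03301", "state": None})
print(filled["state"])  # NH

try:
    guard.check({"country": "US", "zip": "03301", "state": "NY"})
except PatternViolation as error:
    print("rejected:", error)
```

Counting how often `PatternViolation` is raised is one simple way to notice that the semantic tree, the older records, or the pattern itself needs another look.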

Knowledge is power. A failure that happens silently, outside the limits of what we perceive, could result in many months of gibberish data. Instead of that pain, we could know instantly when a problem appears and, if we have some helping robotic hands, even if they are only virtual, at the end of the day we will rarely be alerted with urgencies. And such patterns deepen our understanding of the data we work with, even if they cannot be accepted as a general rule. With better understanding we will have better ideas. With better ideas we will have better features and performance. With better features and performance we will have more patterns. And with more patterns, we deepen our understanding further.

6. The flow

This diagram is an idealized representation of the flow. In reality we might have several different cron jobs, we might be working with threads in an asynchronous manner, and the parser is invoked much more frequently, not just at the end of the whole extraction, because we do not have infinite resources. All these nuances would complicate the diagram immensely.
