Understanding Linear Regression, by Kürşat Kutlu Aydemir (End Point Dev, 2022-06-01)
https://www.endpointdev.com/blog/2022/06/understanding-linear-regression/
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.15.6/dist/katex.min.css" integrity="sha384-ZPe7yZ91iWxYumsBEOn7ieg8q/o+qh/hQpSaPow8T6BwALcXSCS6C6fSRPIAnTQs" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.15.6/dist/katex.min.js" integrity="sha384-ljao5I1l+8KYFXG7LNEA7DyaFvuvSCmedUf6Y6JI7LJqiu8q5dEivP2nDdFH31V4" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.15.6/dist/contrib/auto-render.min.js" integrity="sha384-+XBljXPPiv+OzfbB3cVmLHf4hdUFHlWNZN5spNQ7rmHTXpd7WvJum6fIACpNNfIR" crossorigin="anonymous"></script>
<p><img src="/blog/2022/06/understanding-linear-regression/banner.webp" alt="Green Striped">
<a href="https://www.pexels.com/photo/green-striped-wallpaper-136740/">Photo by Scott Webb</a></p>
<p>Linear regression is a regression model which outputs a numeric value. It is used to predict an outcome based on a linear combination of inputs.</p>
<p>The simplest hypothesis function of linear regression model is a univariate function as shown in the equation below:</p>
<p>$$
h_θ = θ_0 + θ_1x_1
$$</p>
<p>As you can guess, this function represents a straight line in the coordinate system. The hypothesis function (h<sub>θ</sub>) approximates the output for a given input.</p>
<p><img src="/blog/2022/06/understanding-linear-regression/linear-regression-1.webp" alt="Linear regression plot"></p>
<p>θ<sub>0</sub> is the <em>intercept</em>, also called <em>bias term</em>. θ<sub>1</sub> is the <em>gradient</em> or <em>slope</em>.</p>
<p>A linear regression model can represent either a univariate or a multivariate problem, so we can generalize the hypothesis equation as a summation:</p>
<p>$$
h_θ = \sum_{i=0}^{n}{θ_ix_i}
$$</p>
<p>where x<sub>0</sub> is always 1.</p>
<p>We can also represent the hypothesis equation with vector notation:</p>
<p>$$
h_θ =
\begin{bmatrix}
θ_0 & θ_1 & θ_2 & \dots & θ_n
\end{bmatrix}
\begin{bmatrix}
x_0 \\
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}
$$</p>
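In numpy this vector form is just a dot product. A minimal sketch, with made-up values for θ and x (and x<sub>0</sub> fixed to 1):

```python
import numpy as np

# hypothetical parameters and input for h = theta . x
theta = np.array([2.0, 0.5, -1.0])  # theta_0 (bias), theta_1, theta_2
x = np.array([1.0, 4.0, 3.0])       # x_0 is always 1

h = np.dot(theta, x)                # 2.0 + 0.5*4.0 - 1.0*3.0
print(h)                            # 1.0
```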
<h3 id="linear-regression-model">Linear Regression Model</h3>
<p>I am going to implement a linear regression model using the <em>gradient descent</em> algorithm. Each iteration of gradient descent performs the following steps:</p>
<ul>
<li>Hypothesis <em>h</em></li>
<li>The loss</li>
<li>Gradient descent update</li>
</ul>
<p>The gradient descent update iteration stops when it reaches <em>convergence</em>.</p>
<p>Although I am implementing a univariate linear regression model in this section, these steps apply to multivariate linear regression models as well.</p>
<h4 id="hypothesis">Hypothesis</h4>
<p>We start with an initial hypothesis using random parameters. Then we calculate the loss over the training dataset using the <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm">L2 Loss</a> function. In Python:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">hypothesis</span>(X, theta):
<span style="color:#080;font-weight:bold">return</span> theta[<span style="color:#00d;font-weight:bold">0</span>] + theta[<span style="color:#00d;font-weight:bold">1</span>:] * X
</code></pre></div><p>This function takes the input <code>X</code> (univariate in this implementation) and the theta parameter values. <code>X</code> represents the feature input of our dataset, and theta holds the weights of the features. θ<sub>0</sub> is called the <em>bias term</em> and θ<sub>1</sub> is the <em>gradient</em> or <em>slope</em>.</p>
<h4 id="l2-loss-function">L2 Loss Function</h4>
<p>The L2 loss function — sometimes called Mean Squared Error (MSE) — is the total error of the current hypothesis over the given training dataset. During training, by calculating the MSE, we aim to minimize this cumulative error.</p>
<p><img src="/blog/2022/06/understanding-linear-regression/linear-regression-2-mse.webp" alt="L2 Loss"></p>
<p>$$
J(θ) = \frac{\sum{(h_θ(x_i) - y_i)^2}}{2m}
$$</p>
<p>The L2 loss function (MSE) calculates the error by summing the squares of each data point’s error and dividing by twice the size of the dataset (the factor of 2 simplifies the derivative).</p>
<p>The better the line is positioned through the center of the data points, with an optimized slope, the smaller the error, which is what we aim to minimize in linear regression training.</p>
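The <code>gradient_update</code> function shown later calls an <code>L2_loss</code> helper that the post doesn't list. A minimal sketch consistent with the J(θ) formula above could be:

```python
import numpy as np

def L2_loss(h, y):
    # sum of squared errors over the dataset, divided by 2m to match J(theta)
    return np.sum((h.flatten() - y) ** 2) / (2 * len(y))

# tiny hypothetical check: predictions [1, 2] against targets [0, 2]
print(L2_loss(np.array([1.0, 2.0]), np.array([0.0, 2.0])))  # 0.25
```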
<h4 id="gradients-of-the-loss">Gradients of the Loss</h4>
<p>Each time we iterate and calculate a new theta (θ), we get a new theta<sub>1</sub> (slope) value. If we plot each slope value in the gradient descent batch update we will have a curve like this:</p>
<p><img src="/blog/2022/06/understanding-linear-regression/linear-regression-3-gradient-descent.webp" alt="Gradient Descent"></p>
<p>This curve has a minimum value below which it cannot go. Our goal is to find an optimal value of theta<sub>1</sub>, reaching a point where the curve doesn’t get any lower or the change can be ignored. That is where convergence is achieved and the loss is minimized.</p>
<p>Let’s do a little bit more math. The gradient of the loss is the vector of partial derivatives with respect to θ. We calculate the partial derivative of the loss for θ<sub>0</sub> and θ<sub>1</sub> separately. For multivariate functions, the θ<sub>1</sub> case generalizes to all available θ<sub>i</sub>, since the partial derivatives are calculated similarly. You can derive the partial derivatives of the loss function yourself too.</p>
<p>$$
\frac{∂}{∂θ_0}J(θ_0) = \frac{\sum{(h_0 - y_0)}}{m}
$$</p>
<p>$$
\frac{∂}{∂θ_0}J(θ_i) = \frac{\sum{(h_i - y_i)x_i}}{m}
$$</p>
<p>Since we know the hypothesis equation we can replace it in the derivatives as well:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">numpy</span> <span style="color:#080;font-weight:bold">as</span> np

<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">partial_derivatives</span>(h, X, y):
<span style="color:#080;font-weight:bold">return</span> [np.mean((h.flatten() - y)), np.mean((h.flatten() - y) * X.flatten())]
</code></pre></div><p>Now we calculate the gradients of the loss for a given theta, X, and y:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">calc_gradients</span>(theta, X, y):
h = hypothesis(X, theta)
gradient = partial_derivatives(h, X, y)
<span style="color:#080;font-weight:bold">return</span> np.array(gradient)
</code></pre></div><h4 id="batch-gradient-descent">Batch Gradient Descent</h4>
<p>The gradient descent method I used in this implementation is called <em>batch gradient descent</em>, which uses all the available data on every iteration; this slows down the overall convergence process. There are methods that improve the performance of gradient descent, such as <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a>.</p>
<p>Having the gradients for the current theta, we iterate the following update until convergence:</p>
<p>$$
θ_1^{(new)} = θ_1^{(current)} - α \frac{∂}{∂θ_1}J(θ)
$$</p>
<p>Here comes the <em>learning rate</em> (α), also called the <em>convergence rate</em>, which decides how big a step we take on each iteration. If <code>α</code> is too small, convergence is more reliable but training becomes very slow. If <code>α</code> is too large, training runs faster but gradient descent can overshoot the minimum and fail to converge accurately.</p>
<p>There is no strict best value for <code>α</code>, since it depends on the dataset used to train the model. By evaluating the trained model you can find the best alpha value for your dataset. You can refer to statistical measures like the R<sup>2</sup> score to determine the explained variance. But there usually won’t be a single model parameter, hyperparameter, or statistical measure to rely on.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">gradient_update</span>(X, y, theta, alpha, stop_threshold):
<span style="color:#888"># initial loss</span>
loss = L2_loss(hypothesis(X, theta), y)
old_loss = loss + stop_threshold
<span style="color:#080;font-weight:bold">while</span>( <span style="color:#038">abs</span>(old_loss - loss) > stop_threshold ):
<span style="color:#888"># gradient descent update</span>
gradients = calc_gradients(theta, X, y)
theta = theta - alpha * gradients
old_loss = loss
loss = L2_loss(hypothesis(X, theta), y)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Gradient Descent training stopped at loss </span><span style="color:#33b;background-color:#fff0f0">%s</span><span style="color:#d20;background-color:#fff0f0">, with coefficients: </span><span style="color:#33b;background-color:#fff0f0">%s</span><span style="color:#d20;background-color:#fff0f0">'</span> % (loss, theta))
<span style="color:#080;font-weight:bold">return</span> theta
</code></pre></div><p>By performing batch gradient descent we actually train our algorithm and make it find the best theta values to fit the linear function. Now we can evaluate our algorithm and compare it with <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">Sci-Kit Learn Linear Regression</a>.</p>
<h4 id="evaluation">Evaluation</h4>
<p>Since linear regression is a regression model, you should train and evaluate this model on regression datasets.</p>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html">SK-Learn Diabetes dataset</a> is a good regression dataset example. Below I loaded and prepared the dataset by splitting into training and test datasets.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">sklearn</span> <span style="color:#080;font-weight:bold">import</span> datasets
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">sklearn.model_selection</span> <span style="color:#080;font-weight:bold">import</span> train_test_split
<span style="color:#888"># Load the diabetes dataset</span>
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, <span style="color:#00d;font-weight:bold">2</span>]
diabetes_y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(diabetes_X, diabetes_y, test_size=<span style="color:#00d;font-weight:bold">0.1</span>)
</code></pre></div><p>Now we can evaluate our model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">matplotlib.pyplot</span> <span style="color:#080;font-weight:bold">as</span> plt
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">sklearn.metrics</span> <span style="color:#080;font-weight:bold">import</span> mean_squared_error, r2_score
<span style="color:#888"># initial theta guess</span>
theta = [<span style="color:#00d;font-weight:bold">100</span>, <span style="color:#00d;font-weight:bold">3</span>]
stop_threshold = <span style="color:#00d;font-weight:bold">0.1</span>
<span style="color:#888"># learning rate</span>
alpha = <span style="color:#00d;font-weight:bold">0.5</span>
theta = gradient_update(X_train, y_train, theta, alpha, stop_threshold)
y_pred = hypothesis(X_test, theta)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Intercept (theta 0):"</span>, theta[<span style="color:#00d;font-weight:bold">0</span>])
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Coefficients (theta 1):"</span>, theta[<span style="color:#00d;font-weight:bold">1</span>])
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"MSE:"</span>, mean_squared_error(y_test, y_pred))
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"R2 Score"</span>, r2_score(y_test, y_pred))
<span style="color:#888"># Plot outputs using test data</span>
plt.scatter(X_test, y_test, color=<span style="color:#d20;background-color:#fff0f0">'black'</span>)
plt.plot(X_test, y_pred, color=<span style="color:#d20;background-color:#fff0f0">'blue'</span>, linewidth=<span style="color:#00d;font-weight:bold">3</span>)
plt.show()
</code></pre></div><p>When I run my linear regression model it finds the optimal theta values, finishes the training, and produces the output below, including sample evaluation scores.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Gradient Descent training stopped at loss 3753.11429796413, with coefficients: [151.6166715 850.81024746]
Intercept (theta 0): 151.61667150054697
Coefficients (theta 1): 850.8102474614635
MSE: 5320.89741757879
R2 Score 0.14348916154815183
</code></pre></div><p><img src="/blog/2022/06/understanding-linear-regression/gd-evaluate.webp" alt="Linear Regression Plot"></p>
<p>Now let’s evaluate the SK-Learn linear regression model with the same training and test datasets we used. I’m going to use default parameters without optimizing.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#888"># Sci-Kit Learn LinearRegression model evaluation</span>
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">sklearn</span> <span style="color:#080;font-weight:bold">import</span> linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Coef:"</span>, regr.coef_)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Intercept:"</span>, regr.intercept_)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"MSE:"</span>, mean_squared_error(y_test, y_pred))
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"R2 Score"</span>, r2_score(y_test, y_pred))
<span style="color:#888"># Plot outputs</span>
plt.scatter(X_test, y_test, color=<span style="color:#d20;background-color:#fff0f0">'black'</span>)
plt.plot(X_test, y_pred, color=<span style="color:#d20;background-color:#fff0f0">'blue'</span>, linewidth=<span style="color:#00d;font-weight:bold">3</span>)
plt.show()
</code></pre></div><p>The output and plot of the SK-Learn Linear Regression model is as below:</p>
<pre tabindex="0"><code>Coef: [993.14228074]
Intercept: 151.5751918329106
MSE: 5544.283378702411
R2 Score 0.10753047228113943
</code></pre><p><img src="/blog/2022/06/understanding-linear-regression/sklearn-lr-evaluate.webp" alt="SK-Learn Linear Regression Plot"></p>
<p>Notice that the intercepts of my linear regression model and SK-Learn’s linear regression model are very close, both around 151. The MSE values are also very close, and both models plotted very similar predictions.</p>
<h3 id="multivariate-linear-regression">Multivariate Linear Regression</h3>
<p>When a dataset has more features, we can extend our hypothesis accordingly, similar to the univariate hypothesis:</p>
<p>$$
h_θ(x) = θ_0 + θ_1x_1 + … + θ_nx_n
$$</p>
<p>A multivariate dataset can have multiple features and a single output like below.</p>
<table>
<thead>
<tr>
<th>Feature1</th>
<th>Feature2</th>
<th>Feature3</th>
<th>Feature4</th>
<th>Target</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>0</td>
<td>0</td>
<td>100</td>
<td>12</td>
</tr>
<tr>
<td>16</td>
<td>10</td>
<td>1000</td>
<td>121</td>
<td>18</td>
</tr>
<tr>
<td>5</td>
<td>5</td>
<td>450</td>
<td>302</td>
<td>14</td>
</tr>
</tbody>
</table>
<p>Each feature is an independent variable (x<sub>i</sub>) of a dataset. Parameters (theta) are what we aim to find during the training just like the univariate model.</p>
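The multivariate hypothesis is the same dot product with longer vectors. A minimal sketch, using hypothetical weights and the first row of the example table as input:

```python
import numpy as np

def hypothesis_multi(X, theta):
    # X: (m, n) feature matrix; prepend a column of ones for the bias term theta_0
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ theta

# hypothetical weights for the 4-feature example table above
theta = np.array([1.0, 0.5, 0.1, 0.001, 0.01])
X = np.array([[2, 0, 0, 100]])
print(hypothesis_multi(X, theta))  # [3.]
```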
<h3 id="linear-regression-with-polynomial-functions">Linear Regression with Polynomial Functions</h3>
<p>Sometimes a line function doesn’t fit the data well enough, but a polynomial function (including higher powers of the features) could fit the data better.</p>
<p>In this case the data itself is not linear, but fortunately the parameter space is linear, so we can still apply linear regression to a non-linear dataset:</p>
<p>$$
h_θ(x) = θ_0 + θ_1x + θ_2x^2 + … + θ_nx^n
$$</p>
<p>$$
h_θ =
\begin{bmatrix}
1 & x & x^2 & \dots & x^n
\end{bmatrix}
\begin{bmatrix}
θ_0 \\
θ_1 \\
θ_2 \\
\vdots \\
θ_n
\end{bmatrix}
$$</p>
<p>Here the data is non-linear but the parameters are linear and we can still apply the gradient descent algorithm.</p>
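One way to set this up is to expand x into its powers and feed the result to the same linear machinery. A short sketch of building such a polynomial feature matrix (using numpy's <code>vander</code>, my own choice rather than anything from the post):

```python
import numpy as np

def polynomial_features(x, degree):
    # columns [1, x, x^2, ..., x^degree] for a 1-D input vector x
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([1.0, 2.0, 3.0])
X_poly = polynomial_features(x, degree=2)
print(X_poly)
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```

Gradient descent then learns one θ per column, exactly as in the multivariate case.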
<h3 id="conclusion">Conclusion</h3>
<p>In this post I implemented a linear regression model from scratch, trained it, and evaluated it.</p>
<p>Linear regression is useful when the variables in your dataset are linearly related. In the real world, linear regression is very useful in <a href="https://www.pluralsight.com/courses/understanding-applying-linear-regression?aid=7010a000002BWqGAAW&exp=2">forecasting</a>.</p>
<script>
document.addEventListener("DOMContentLoaded", function() {
renderMathInElement(document.body, {
// customised options
// • auto-render specific keys, e.g.:
delimiters: [
{left: '$$', right: '$$', display: true},
{left: '$', right: '$', display: false},
{left: '\\(', right: '\\)', display: false},
{left: '\\[', right: '\\]', display: true}
],
// • rendering keys, e.g.:
throwOnError : false
});
});
</script>
Implementing SummAE neural text summarization with a denoising auto-encoder, by Kamil Ciemniewski (End Point Dev, 2020-05-28)
https://www.endpointdev.com/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/book.jpg" alt="Book open on lawn with dandelions"></p>
<p>If there’s any problem space in machine learning with no shortage of (unlabelled) data to train on, it’s natural language processing (NLP).</p>
<p>In this article, I’d like to take on the challenge of taking a paper that came from Google Research in late 2019 and implementing it. It’s going to be a fun trip into the world of neural text summarization. We’re going to go through the basics, the coding, and then we’ll look at what the results actually are in the end.</p>
<p>The paper we’re going to implement here is: <a href="https://arxiv.org/abs/1910.00998">Peter J. Liu, Yu-An Chung, Jie Ren (2019) SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders</a>.</p>
<p>Here’s the paper’s abstract:</p>
<blockquote>
<p>We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training, evaluation code, and best model weights are open-sourced.</p>
</blockquote>
<h3 id="preliminaries">Preliminaries</h3>
<p>Before we go any further, let’s talk a little bit about neural summarization in general. There are two main approaches to it:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Automatic_summarization#Extraction-based_summarization">Extractive</a></li>
<li><a href="https://en.wikipedia.org/wiki/Automatic_summarization#Abstraction-based_summarization">Abstractive</a></li>
</ul>
<p>The first approach makes the model “focus” on the most important parts of the longer text - extracting them to form a summary.</p>
<p>Let’s take a recent article, <a href="/blog/2020/05/shopify-product-creation/">“Shopify Admin API: Importing Products in Bulk”</a>, by one of my great co-workers, <a href="/team/patrick-lewis/">Patrick Lewis</a>, as an example and see what the extractive summarization would look like. Let’s take the first two paragraphs:</p>
<blockquote>
<p>I recently worked on an interesting project for a store owner who was facing a daunting task: he had an inventory of hundreds of thousands of Magic: The Gathering (MTG) cards that he wanted to sell online through his Shopify store. The logistics of tracking down artwork and current market pricing for each card made it impossible to do manually.</p>
<p>My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card in Shopify. The resulting project turned what would have been a months- or years-long task into a bulk upload that only took a few hours to complete and allowed the store owner to immediately start selling his inventory online. The online store launch turned out to be even more important than initially expected due to current closures of physical stores.</p>
</blockquote>
<p>An extractive model could summarize it as follows:</p>
<blockquote>
<p>I recently worked on an interesting project for a store owner who had an inventory of hundreds of thousands of cards that he wanted to sell through his store. The logistics and current pricing for each card made it impossible to do manually. My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card. The store launch turned out to be even more important than expected due to current closures of physical stores.</p>
</blockquote>
<p>See how it does the copying and pasting? The big advantage of these types of models is that they are generally easier to create and the resulting summaries tend to faithfully reflect the facts included in the source.</p>
<p>The downside though is that it’s not how a human would do it. We do a lot of paraphrasing, for instance. We use different words and tend to form sentences less rigidly following the original ones. The need for the summaries to feel more natural made the second type — abstractive — into this subfield’s holy grail.</p>
<h3 id="datasets">Datasets</h3>
<p>The paper’s authors used the so-called <a href="https://cs.rochester.edu/nlp/rocstories/">“ROCStories” dataset</a> (<a href="https://www.aclweb.org/anthology/P18-2119/">“Tackling The Story Ending Biases in The Story Cloze Test”. Rishi Sharma, James Allen, Omid Bakhshandeh, Nasrin Mostafazadeh. In Proceedings of the 2018 Conference of the Association for Computational Linguistics (ACL), 2018</a>).</p>
<p>In my experiments, I’ve also tried the model against one that’s quite a bit more difficult: <a href="https://github.com/mahnazkoupaee/WikiHow-Dataset">WikiHow</a> (<a href="https://arxiv.org/abs/1810.09305">Mahnaz Koupaee, William Yang Wang (2018) WikiHow: A Large Scale Text Summarization Dataset</a>).</p>
<h4 id="rocstories">ROCStories</h4>
<p>The dataset consists of 98162 stories, each one consisting of 5 sentences. It’s incredibly clean. The only step I needed to take was to split the stories between the train, eval, and test sets.</p>
<p>Examples of sentences:</p>
<p>Example 1:</p>
<blockquote>
<p>My retired coworker turned 69 in July. I went net surfing to get her a gift. She loves Diana Ross. I got two newly released cds and mailed them to her. She sent me an email thanking me.</p>
</blockquote>
<p>Example 2:</p>
<blockquote>
<p>Tom alerted the government he expected a guest. When she didn’t come he got in a lot of trouble. They talked about revoking his doctor’s license. And charging him a huge fee! Tom’s life was destroyed because of his act of kindness.</p>
</blockquote>
<p>Example 3:</p>
<blockquote>
<p>I went to see the doctor when I knew it was bad. I hadn’t eaten in nearly a week. I told him I felt afraid of food in my body. He told me I was developing an eating disorder. He instructed me to get some help.</p>
</blockquote>
<h4 id="wikihow">Wikihow</h4>
<p>This is one of the most challenging openly available datasets for neural summarization. It consists of more than 200,000 long-sequence pairs of text + headline scraped from <a href="https://www.wikihow.com/Main-Page">WikiHow’s website</a>.</p>
<p>Some examples:</p>
<p>Text:</p>
<blockquote>
<p>One easy way to conserve water is to cut down on your shower time. Practice cutting your showers down to 10 minutes, then 7, then 5. Challenge yourself to take a shorter shower every day. Washing machines take up a lot of water and electricity, so running a cycle for a couple of articles of clothing is inefficient. Hold off on laundry until you can fill the machine. Avoid letting the water run while you’re brushing your teeth or shaving. Keep your hoses and faucets turned off as much as possible. When you need them, use them sparingly.</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>Take quicker showers to conserve water. Wait for a full load of clothing before running a washing machine. Turn off the water when you’re not using it.</p>
</blockquote>
<p>The main challenge for the summarization model here is that the headline <strong>was actually created by humans</strong> and is not just “extracting” anything. Any model performing well on this dataset needs to model the language pretty well. The headline can still be used for computing evaluation metrics, but traditional metrics like <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> are bound to miss the point here.</p>
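For intuition on why such metrics fall short of paraphrasing, ROUGE-1 recall simply measures unigram overlap with the reference. A toy sketch (not the official ROUGE implementation):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    # fraction of reference unigrams that also appear in the candidate
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

score = rouge1_recall("the cat sat", "the cat sat on the mat")
print(score)  # 0.5
```

A perfectly good paraphrase using different words would score near zero, which is exactly the failure mode described above.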
<h3 id="basics-of-the-sequence-to-sequence-modeling">Basics of the sequence-to-sequence modeling</h3>
<p>Most sequence-to-sequence models are based on the “next token prediction” workflow.</p>
<p>The general idea can be expressed with P(token | context) — where the task is to model this conditional probability distribution. The “context” here depends on the approach.</p>
<p>Those models are also called “auto-regressive” because they need to consume their own predictions from previous steps during the inference:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>], context)
<span style="color:#888"># "I"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>], context)
<span style="color:#888"># "love"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>], context)
<span style="color:#888"># "biking"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>, <span style="color:#d20;background-color:#fff0f0">"biking"</span>], context)
<span style="color:#888"># "<end>"</span>
</code></pre></div><h4 id="naively-simple-modeling-markov-model">Naively simple modeling: Markov Model</h4>
<p>In this model, the approach is to take on a bold assumption: that the probability of the next token is conditioned <strong>only</strong> on the previous token.</p>
<p>The Markov Model is elegantly introduced in the blog post <a href="https://medium.com/ymedialabs-innovation/next-word-prediction-using-markov-model-570fc0475f96">Next Word Prediction using Markov Model</a>.</p>
<p>Why is it naive? Because we know that the probability of the word “love” depends on the word “I” <strong>given a broader context</strong>. A model that’s always going to output “roses” would miss the best word more often than not.</p>
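A first-order Markov next-word model can be sketched as a table of bigram counts; a toy example (the corpus and words here are invented for illustration):

```python
from collections import defaultdict, Counter

def train_markov(corpus):
    # map each word to a Counter of the words that follow it
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, word):
    # most frequent follower of `word` -- ignores any broader context
    return model[word].most_common(1)[0][0]

corpus = ["i love biking", "i love roses", "i love biking fast"]
model = train_markov(corpus)
print(predict_next(model, "love"))  # biking
```

Note that the prediction after "love" is always the same, no matter what came before it; that is the naive assumption in action.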
<h4 id="modeling-with-neural-networks">Modeling with neural networks</h4>
<p>Usually, sequence-to-sequence neural network models consist of two parts:</p>
<ul>
<li>encoder</li>
<li>decoder</li>
</ul>
<p>The encoder is there to build a “gist” representation of the input sequence. The gist and the previous token become our “context” to do the inference. This fits in well within the P(token | context) modeling I described above. That distribution can be expressed more clearly as P(token | previous; gist).</p>
<p>There are other approaches too with one of them being the <a href="https://arxiv.org/pdf/2001.04063v2.pdf">ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training - 2020 - Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming</a>. The difference in the approach here was the prediction of n-tokens ahead at once.</p>
<h3 id="teacher-forcing">Teacher-forcing</h3>
<p>Let’s see how we could go about teaching the model the next token’s conditional distribution.</p>
<p>Imagine that the model’s parameters aren’t performing well yet. We have an input sequence of: <code>["<start>", "I", "love", "biking", "during", "the", "summer", "<end>"]</code>. We’re training the model giving it the first token:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>], context)
<span style="color:#888"># "I"</span>
</code></pre></div><p>Great, now let’s ask it for another one:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>], context)
<span style="color:#888"># "wonder"</span>
</code></pre></div><p>Hmmm that’s not what we wanted, but let’s naively continue:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"wonder"</span>], context)
<span style="color:#888"># "why"</span>
</code></pre></div><p>We could continue gathering predictions and compute the loss at the end. The loss would really only be able to tell it about the first mistake (“love” vs. “wonder”); the rest of the errors would just accumulate from here. This would hinder the learning considerably, adding in the noise from the accumulated errors.</p>
<p>There’s a better approach called <a href="https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/">Teacher Forcing</a>. In this approach, you’re telling the model the true answer after each of its guesses. The last example would look like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>], context)
<span style="color:#888"># "watching"</span>
</code></pre></div><p>You’d continue the process, feeding it the full input sequence and the loss term would be computed based on all its guesses.</p>
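<p>To make the difference concrete, here’s a toy sketch of a teacher-forced training step (the model, its dimensions, and all names here are made up for illustration; this is not the paper’s code). The key point is that the prefix fed to the model is always built from the ground-truth tokens, never from its own guesses:</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy vocabulary and target sequence (illustrative only)
vocab = ["<start>", "I", "love", "biking", "during", "the", "summer", "<end>"]
token_ids = {t: i for i, t in enumerate(vocab)}
target = [token_ids[t] for t in vocab]  # the sequence to reproduce

# A stand-in "model": any module mapping a prefix of ids to next-token logits
class ToyModel(torch.nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.rnn = torch.nn.GRU(dim, dim, batch_first=True)
        self.out = torch.nn.Linear(dim, vocab_size)

    def forward(self, ids):  # ids: (1, prefix_len)
        h, _ = self.rnn(self.embed(ids))
        return self.out(h[:, -1, :])  # logits for the next token

model = ToyModel(len(vocab))

# Teacher forcing: the prefix is always the *true* tokens, never the guesses
loss = 0.0
for i in range(1, len(target)):
    prefix = torch.tensor([target[:i]])  # ground-truth prefix
    logits = model(prefix)
    loss = loss + F.cross_entropy(logits, torch.tensor([target[i]]))
loss = loss / (len(target) - 1)
loss.backward()  # one optimizer step would follow here
```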
<h3 id="compute-friendly-representation-for-tokens-and-gists">Compute-friendly representation for tokens and gists</h3>
<p>Some readers might want to skip this section. I’d like to quickly describe the concepts of the <a href="https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d">latent space</a> and <a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa">vector embeddings</a>, to keep matters palatable for a broader audience.</p>
<h4 id="representing-words-naively">Representing words naively</h4>
<p>How do we turn words (strings) into the numbers that we feed into our machine learning models? A software developer might think of assigning each word a unique integer. That works well for databases, but in a machine learning model the fact that integers follow one another means they encode a relation (which one follows which, and at what distance). This doesn’t work well for almost any problem in data science.</p>
<p>Traditionally, the problem is solved by “<a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/">one-hot encoding</a>”. This means turning each integer into a vector in which every entry is zero except for a single 1 at the index equal to the value being encoded (assuming zero-based indexing). Example: <code>3 => [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]</code> when the total number of “integers” (classes) to encode is 10.</p>
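<p>A minimal sketch of that encoding (with zero-based indexing):</p>

```python
def one_hot(index, num_classes):
    """Return a list that's all zeros except for a 1 at `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

one_hot(3, 10)
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```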
<p>This is better, as it breaks the false ordering and distance assumptions. It doesn’t encode anything about the words themselves, though, except the arbitrary number we’ve decided to assign to them. We no longer have the ordering, but we also don’t have any notion of distance. Empirically, though, we just know that the word “love” is much closer to “enjoy” than it is to “helicopter”.</p>
<h4 id="a-better-approach-word-embeddings">A better approach: word embeddings</h4>
<p>How could we keep our vector representation (as in one-hot encoding) but also introduce a notion of distance? I’ve already touched on this concept in my <a href="/blog/2018/07/recommender-mxnet/">post about the simple recommender system</a>. The idea is to have a vector of floating-point values such that the closer two words are in meaning, the smaller the angle between their vectors. We can easily compute a metric following this logic by measuring the <a href="http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/">cosine distance</a>. This way, the word representations are easy to feed into the encoder, and they already carry a lot of information in themselves.</p>
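<p>The intuition can be sketched with made-up three-dimensional “embeddings” (real ones have hundreds of dimensions and are learned, not hand-picked; the cosine <em>distance</em> is simply one minus the similarity computed below):</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, chosen by hand purely for illustration:
love = [0.9, 0.1, 0.0]
enjoy = [0.85, 0.2, 0.05]
helicopter = [0.0, 0.1, 0.95]

# "love" points in nearly the same direction as "enjoy",
# and nearly orthogonally to "helicopter":
cosine_similarity(love, enjoy) > cosine_similarity(love, helicopter)
# True
```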
<h4 id="not-only-words">Not only words</h4>
<p>Can we only have vectors for words? Couldn’t we have vectors for paragraphs, so that the closer they are in meaning, the smaller some vector-space metric between them? Of course we can. This is, in fact, what will allow this article’s model to encode the “gist” that we talked about. The “encoder” part of the model is going to learn the most convenient way of turning the input sequence into a vector of floating-point numbers.</p>
<h3 id="auto-encoders">Auto-encoders</h3>
<p>We’re slowly approaching the model from the paper. We still have one concept that’s vital to understand in order to get why the model is going to work.</p>
<p>Up until now, we talked about the following structure of the typical sequence-to-sequence neural network model:</p>
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/seq-to-seq.png" alt="Sequence To Sequence Neural Nets"></p>
<p>This is true e.g. for translation models where the input sequence is in English and the output is in Greek. It’s also true for this article’s model <strong>during the inference</strong>.</p>
<p>What if we made the input and the output the same sequence? We’d turn the model into a so-called <a href="https://en.wikipedia.org/wiki/Autoencoder">auto-encoder</a>.</p>
<p>The output of course isn’t all that useful — we already know what the input sequence is. The true value is in the model’s ability to encode the input into a <strong>gist</strong>.</p>
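<p>Schematically, an auto-encoder is just an encoder squeezing the input into a small “gist” vector and a decoder reconstructing the input from it. A minimal sketch in PyTorch (illustrative only; nothing here comes from the paper’s architecture):</p>

```python
import torch
import torch.nn as nn

class TinyAutoEncoder(nn.Module):
    """Illustrative only: compresses a 32-dim input into an 8-dim gist."""
    def __init__(self, input_dim=32, gist_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, gist_dim), nn.Tanh())
        self.decoder = nn.Linear(gist_dim, input_dim)

    def forward(self, x):
        gist = self.encoder(x)           # the compressed representation
        return self.decoder(gist), gist  # reconstruction + gist

model = TinyAutoEncoder()
x = torch.randn(4, 32)
reconstruction, gist = model(x)

# The loss compares the output against the *input* itself:
loss = nn.functional.mse_loss(reconstruction, x)
```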
<h4 id="adding-the-noise">Adding the noise</h4>
<p>A very interesting type of auto-encoder is the <a href="https://towardsdatascience.com/denoising-autoencoders-explained-dbb82467fc2">denoising auto-encoder</a>. The idea is that the input sequence gets randomly corrupted, and the network learns to still produce a good gist and reconstruct the sequence as it was before the corruption. This makes the training “teach” the network about the deeper connections in the data, instead of just “memorizing” as much as it can.</p>
<h3 id="the-summae-model">The SummAE model</h3>
<p>We’re now ready to talk about the architecture from the paper. Given what we’ve already learned, this is going to be very simple. The SummAE model is just a denoising auto-encoder that is trained in a special way.</p>
<h4 id="auto-encoding-paragraphs-and-sentences">Auto-encoding paragraphs and sentences</h4>
<p>The authors were training the model on both single sentences and full paragraphs. In all cases the task was to reproduce the uncorrupted input.</p>
<p>The first part of the approach is about having two special “start tokens” to signal the mode: paragraph vs. sentence. In my code, I’ve used “<start-full>” and “<start-short>”.</p>
<p>During training, the model learns, for each position in the sequence, the conditional distribution of the next token given one of those two start tokens and the tokens so far.</p>
<h4 id="adding-the-noise-1">Adding the noise</h4>
<p>The sentences are simply concatenated to form a paragraph. The input then gets corrupted at random by means of:</p>
<ul>
<li>masking the input tokens</li>
<li>shuffling the order of the sentences within the paragraph</li>
</ul>
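<p>The two corruptions could be sketched like this (a simplified take; the exact masking scheme used in the paper and in my repository may differ in details):</p>

```python
import random

MASK = "<mask>"

def corrupt(sentences, mask_prob=0.15, seed=None):
    """Shuffle the sentence order, then randomly mask tokens."""
    rng = random.Random(seed)
    sentences = sentences[:]
    rng.shuffle(sentences)                      # corrupt the sentence order
    tokens = [t for s in sentences for t in s]  # concatenate into a paragraph
    return [MASK if rng.random() < mask_prob else t for t in tokens]

paragraph = [["i", "lost", "my", "job", "."],
             ["i", "searched", "online", "."]]
corrupted = corrupt(paragraph, mask_prob=0.3, seed=0)
```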
<p>The authors claim that the latter helped them solve the issue of the network just memorizing the first sentence. What I found, though, is that this model is generally prone to memorizing concrete sentences from the paragraph. Sometimes it’s the first one, and sometimes one of the others. I found this to be true even when adding a lot of noise to the input.</p>
<h4 id="the-code">The code</h4>
<p>The full PyTorch implementation described in this blog post is available at <a href="https://github.com/kamilc/neural-text-summarization">https://github.com/kamilc/neural-text-summarization</a>. You may find some of its parts less clean than others — it’s a work in progress. In particular, the data-downloading part is mostly left out.</p>
<p>You can find the WikiData preprocessing in a notebook in the repository. For the ROCStories dataset, I just downloaded the CSV files and concatenated them with the Unix <code>cat</code> command. There’s an additional <code>process.py</code> file generated from a very simple <code>IPython</code> session.</p>
<p>Let’s have a very brief look at some of the most interesting parts of the code:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">SummarizeNet</span>(NNModel):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">encode</span>(self, embeddings, lengths):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode</span>(self, embeddings, encoded, lengths, modes):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, embeddings, clean_embeddings, lengths, modes):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">predict</span>(self, vocabulary, embeddings, lengths):
<span style="color:#888"># ...</span>
</code></pre></div><p>You can notice separate methods for <code>forward</code> and <code>predict</code>. I chose the <a href="https://jalammar.github.io/illustrated-transformer/">Transformer</a> over recurrent neural networks for both the encoder and the decoder. The <a href="https://pytorch.org/docs/master/generated/torch.nn.TransformerDecoder.html">PyTorch implementation of the transformer decoder</a> already includes teacher forcing in its <code>forward</code> method. This makes it convenient at training time: we can just feed it the full, uncorrupted sequence of embeddings as the “target”. During inference, though, we need to do the “auto-regressive” part by hand, feeding the previous predictions back in a loop. Hence the need for two distinct methods here.</p>
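<p>That convenience boils down to passing the whole ground-truth sequence as <code>tgt</code> together with a causal mask, so that each position can only attend to the ones before it. A minimal sketch with made-up dimensions:</p>

```python
import torch
import torch.nn as nn

d_model, seq_len, batch = 16, 7, 2
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4)
decoder = nn.TransformerDecoder(layer, num_layers=2)

# (seq_len, batch, d_model): PyTorch transformers default to sequence-first
tgt = torch.randn(seq_len, batch, d_model)  # full ground-truth embeddings
memory = torch.randn(1, batch, d_model)     # the encoder's output (the "gist")

# Causal mask: -inf above the diagonal means position i can't see positions > i
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

# One call processes all positions at once, teacher-forced by the mask:
out = decoder(tgt, memory, tgt_mask=causal_mask)
```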
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, embeddings, clean_embeddings, lengths, modes):
noisy_embeddings = self.mask_dropout(embeddings, lengths)
encoded = self.encode(noisy_embeddings[:, <span style="color:#00d;font-weight:bold">1</span>:, :], lengths-<span style="color:#00d;font-weight:bold">1</span>)
decoded = self.decode(clean_embeddings, encoded, lengths, modes)
<span style="color:#080;font-weight:bold">return</span> (
decoded,
encoded
)
</code></pre></div><p>You can notice that I’m doing the token masking at the model level during training. The code also clearly shows the structure of this seq2seq model, with its encoder and decoder.</p>
<p>The encoder part looks simple as long as you’re familiar with the transformers:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">encode</span>(self, embeddings, lengths):
batch_size, seq_len, _ = embeddings.shape
embeddings = self.encode_positions(embeddings)
paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand((batch_size, seq_len)).to(self.device)
paddings_mask = (paddings_mask + <span style="color:#00d;font-weight:bold">1</span>) > lengths.unsqueeze(dim=<span style="color:#00d;font-weight:bold">1</span>).expand((batch_size, seq_len))
encoded = embeddings.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">for</span> ix, encoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.encoders):
encoded = encoder(encoded, src_key_padding_mask=paddings_mask)
encoded = self.encode_batch_norms[ix](encoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
last_encoded = encoded
encoded = self.pool_encoded(encoded, lengths)
encoded = self.to_hidden(encoded)
<span style="color:#080;font-weight:bold">return</span> encoded
</code></pre></div><p>We’re first encoding the positions as in the “Attention Is All You Need” paper and then feeding the embeddings into a stack of encoder layers. At the end, we’re reshaping the tensor so that its final dimension equals the size given as the model’s parameter.</p>
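<p>“Encoding the positions” means adding the fixed sinusoidal signal from the “Attention Is All You Need” paper to the embeddings. A sketch of that signal (the repository’s <code>encode_positions</code> may differ in details):</p>

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal positions: even dims get sin, odd dims get cos."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

embeddings = torch.randn(10, 64)  # (seq_len, d_model)
with_positions = embeddings + positional_encoding(10, 64)
```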
<p>The <code>decode</code> sits on PyTorch’s shoulders too:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode</span>(self, embeddings, encoded, lengths, modes):
batch_size, seq_len, _ = embeddings.shape
embeddings = self.encode_positions(embeddings)
mask = self.mask_for(embeddings)
encoded = self.from_hidden(encoded)
encoded = encoded.unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand(seq_len, batch_size, -<span style="color:#00d;font-weight:bold">1</span>)
decoded = embeddings.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
decoded = torch.cat(
[
encoded,
decoded
],
axis=<span style="color:#00d;font-weight:bold">2</span>
)
decoded = self.combine_decoded(decoded)
decoded = self.combine_batch_norm(decoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand((batch_size, seq_len)).to(self.device)
paddings_mask = paddings_mask > lengths.unsqueeze(dim=<span style="color:#00d;font-weight:bold">1</span>).expand((batch_size, seq_len))
<span style="color:#080;font-weight:bold">for</span> ix, decoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.decoders):
decoded = decoder(
decoded,
torch.ones_like(decoded),
tgt_mask=mask,
tgt_key_padding_mask=paddings_mask
)
decoded = self.decode_batch_norms[ix](decoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
decoded = decoded.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">return</span> self.linear_logits(decoded)
</code></pre></div><p>You can notice that I’m combining the gist received from the encoder with each word embedding — as this is how it was described in the paper.</p>
<p>The <code>predict</code> is very similar to <code>forward</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">predict</span>(self, vocabulary, embeddings, lengths):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Caller should include the start and end tokens here
</span><span style="color:#d20;background-color:#fff0f0"> but we’re going to ensure the start one is replaced by <start-short>
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
previous_mode = self.training
self.eval()
batch_size, _, _ = embeddings.shape
results = []
<span style="color:#080;font-weight:bold">for</span> row <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">0</span>, batch_size):
row_embeddings = embeddings[row, :, :].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
row_embeddings[<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>] = vocabulary.token_vector(<span style="color:#d20;background-color:#fff0f0">"<start-short>"</span>)
encoded = self.encode(
row_embeddings[:, <span style="color:#00d;font-weight:bold">1</span>:, :],
lengths[row].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
)
results.append(
self.decode_prediction(
vocabulary,
encoded,
lengths[row].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
)
)
self.training = previous_mode
<span style="color:#080;font-weight:bold">return</span> results
</code></pre></div><p>The workhorse behind the decoding at the inference time looks as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode_prediction</span>(self, vocabulary, encoded1xH, lengths1x):
tokens = [<span style="color:#d20;background-color:#fff0f0">'<start-short>'</span>]
last_token = <span style="color:#080;font-weight:bold">None</span>
seq_len = <span style="color:#00d;font-weight:bold">1</span>
encoded1xH = self.from_hidden(encoded1xH)
<span style="color:#080;font-weight:bold">while</span> last_token != <span style="color:#d20;background-color:#fff0f0">'<end>'</span> <span style="color:#080">and</span> seq_len < <span style="color:#00d;font-weight:bold">50</span>:
embeddings1xSxD = vocabulary.embed(tokens).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).to(self.device)
embeddings1xSxD = self.encode_positions(embeddings1xSxD)
maskSxS = self.mask_for(embeddings1xSxD)
encodedSx1xH = encoded1xH.unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand(seq_len, <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>)
decodedSx1xD = embeddings1xSxD.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
decodedSx1xD = torch.cat(
[
encodedSx1xH,
decodedSx1xD
],
axis=<span style="color:#00d;font-weight:bold">2</span>
)
decodedSx1xD = self.combine_decoded(decodedSx1xD)
decodedSx1xD = self.combine_batch_norm(decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#080;font-weight:bold">for</span> ix, decoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.decoders):
decodedSx1xD = decoder(
decodedSx1xD,
torch.ones_like(decodedSx1xD),
tgt_mask=maskSxS,
)
decodedSx1xD = self.decode_batch_norms[ix](decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>))
decodedSx1xD = decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
decoded1x1xD = decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)[:, (seq_len-<span style="color:#00d;font-weight:bold">1</span>):seq_len, :]
decoded1x1xV = self.linear_logits(decoded1x1xD)
word_id = F.softmax(decoded1x1xV[<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>, :]).argmax().cpu().item()
last_token = vocabulary.words[word_id]
tokens.append(last_token)
seq_len += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">' '</span>.join(tokens[<span style="color:#00d;font-weight:bold">1</span>:])
</code></pre></div><p>You can notice that we start with the “start short” token and go in a loop, getting predictions and feeding them back in until the “end” token.</p>
<p>Again, the model is very, very simple. What makes the difference is how it’s being trained — it’s all in the training data corruption and the model pre-training.</p>
<p>It’s already a long article so I encourage the curious readers to look at the code at <a href="https://github.com/kamilc/neural-text-summarization">my GitHub repo</a> for more details.</p>
<h4 id="my-experiment-with-the-wikihow-dataset">My experiment with the WikiHow dataset</h4>
<p>In my WikiHow experiment I wanted to see how the results would look if I fed the network the full articles and their headlines as its two modes. The same data-corruption regime was used in this case.</p>
<p>Some of the results were looking <strong>almost</strong> good:</p>
<p>Text:</p>
<blockquote>
<p>for a savory flavor, mix in 1/2 teaspoon ground cumin, ground turmeric, or masala powder.this works best when added to the traditional salty lassi. for a flavorful addition to the traditional sweet lassi, add 1/2 teaspoon of ground cardamom powder or ginger, for some kick. , start with a traditional sweet lassi and blend in some of your favorite fruits. consider mixing in strawberries, papaya, bananas, or coconut.try chopping and freezing the fruit before blending it into the lassi. this will make your drink colder and frothier. , while most lassi drinks are yogurt based, you can swap out the yogurt and water or milk for coconut milk. this will give a slightly tropical flavor to the drink. or you could flavor the lassi with rose water syrup, vanilla extract, or honey.don’t choose too many flavors or they could make the drink too sweet. if you stick to one or two flavors, they’ll be more pronounced. , top your lassi with any of the following for extra flavor and a more polished look: chopped pistachios sprigs of mint sprinkle of turmeric or cumin chopped almonds fruit sliver</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>add a spice., blend in a fruit., flavor with a syrup or milk., garnish.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>blend vanilla in a sweeter flavor . , add a sugary fruit . , do a spicy twist . eat with dessert . , revise . <end></p>
</blockquote>
<p>It’s not 100% faithful to the original text even though it seems to “read” well.</p>
<p>My suspicion is that pre-training against a much larger corpus of text might help. There’s an obvious issue here: the network lacks the very specific knowledge needed to summarize better. Here’s another one of those examples:</p>
<p>Text:</p>
<blockquote>
<p>the settings app looks like a gray gear icon on your iphone’s home screen.; , this option is listed next to a blue “a” icon below general. , this option will be at the bottom of the display & brightness menu. , the right-hand side of the slider will give you bigger font size in all menus and apps that support dynamic type, including the mail app. you can preview the corresponding text size by looking at the menu texts located above and below the text size slider. , the left-hand side of the slider will make all dynamic type text smaller, including all menus and mailboxes in the mail app. , tap the back button twice in the upper-left corner of your screen. it will save your text size settings and take you back to your settings menu. , this option is listed next to a gray gear icon above display & brightness. , it’s halfway through the general menu. ,, the switch will turn green. the text size slider below the switch will allow for even bigger fonts. , the text size in all menus and apps that support dynamic type will increase as you go towards the right-hand side of the slider. this is the largest text size you can get on an iphone. , it will save your settings.</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>open your iphone’s settings., scroll down and tap display & brightness., tap text size., tap and drag the slider to the right for bigger text., tap and drag the slider to the left for smaller text., go back to the settings menu., tap general., tap accessibility., tap larger text. , slide the larger accessibility sizes switch to on position., tap and drag the slider to the right., tap the back button in the upper-left corner.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>open your iphone ’s settings . , tap general . , scroll down and tap accessibility . , tap larger accessibility . , tap and larger text for the iphone to highlight the text you want to close . , tap the larger text - colored contacts app .</p>
</blockquote>
<p>It might be interesting to train against this dataset again while:</p>
<ul>
<li>utilizing some pre-trained, large scale model as part of the encoder</li>
<li>using a large corpus of text to still pre-train the auto-encoder</li>
</ul>
<p>This could possibly take a lot of time to train on my GPU (even with the pre-trained part of the encoder). I didn’t follow the idea further at this time.</p>
<h4 id="the-problem-with-getting-paragraphs-when-we-want-the-sentences">The problem with getting paragraphs when we want the sentences</h4>
<p>One of the biggest problems the authors ran into was the decoder outputting the paragraph-long version of the text even when asked for the sentence-long summary.</p>
<p>The authors called this phenomenon the “segregation issue”. What they found was that the encoder was mapping paragraphs and sentences into completely separate regions of the latent space. The solution was to trick the encoder into making both representations indistinguishable. The following figure comes from the paper and visualizes the issue:</p>
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/segregation.jpg" alt="Segregation problem"></p>
<h4 id="better-gists-by-using-the-critic">Better gists by using the “critic”</h4>
<p>The idea of a “critic” has been popularized along with the fantastic results produced by some of the <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">Generative Adversarial Networks</a>. The general workflow is to have the main network generate output while another network tries to guess some of its properties.</p>
<p>For GANs that are generating realistic photos, the critic is there to guess if the photo was generated or if it’s real. A loss term is added based on how well it’s doing, penalizing the main network for generating photos that the critic is able to call out as fake.</p>
<p>A similar idea was used in the A3C algorithm I blogged about (<a href="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/">Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm</a>). The “critic” part penalized the AI agent for taking steps that were on average less advantageous.</p>
<p>Here, in the SummAE model, the critic adds a penalty to the loss proportional to how well it can guess whether the gist comes from a paragraph or a sentence.</p>
<p>Training with the critic can get tricky. What I’ve found to be the cleanest way is to use two different optimizers: one updating the main network’s parameters and the other updating the critic’s:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">for</span> batch <span style="color:#080">in</span> batches:
<span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
self.model.train()
self.discriminator.train()
<span style="color:#080;font-weight:bold">else</span>:
self.model.eval()
self.discriminator.eval()
self.optimizer.zero_grad()
self.discriminator_optimizer.zero_grad()
logits, state = self.model(
batch.word_embeddings.to(self.device),
batch.clean_word_embeddings.to(self.device),
batch.lengths.to(self.device),
batch.mode.to(self.device)
)
mode_probs_disc = self.discriminator(state.detach())
mode_probs = self.discriminator(state)
discriminator_loss = F.binary_cross_entropy(
mode_probs_disc,
batch.mode
)
discriminator_loss.backward(retain_graph=<span style="color:#080;font-weight:bold">True</span>)
<span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
self.discriminator_optimizer.step()
text = batch.text.copy()
<span style="color:#080;font-weight:bold">if</span> self.no_period_trick:
text = [txt.replace(<span style="color:#d20;background-color:#fff0f0">'.'</span>, <span style="color:#d20;background-color:#fff0f0">''</span>) <span style="color:#080;font-weight:bold">for</span> txt <span style="color:#080">in</span> text]
classes = self.vocabulary.encode(text, modes=batch.mode)
classes = classes.roll(-<span style="color:#00d;font-weight:bold">1</span>, dims=<span style="color:#00d;font-weight:bold">1</span>)
classes[:,classes.shape[<span style="color:#00d;font-weight:bold">1</span>]-<span style="color:#00d;font-weight:bold">1</span>] = <span style="color:#00d;font-weight:bold">3</span>
model_loss = torch.tensor(<span style="color:#00d;font-weight:bold">0</span>).cuda()
<span style="color:#080;font-weight:bold">if</span> logits.shape[<span style="color:#00d;font-weight:bold">0</span>:<span style="color:#00d;font-weight:bold">2</span>] == classes.shape:
model_loss = F.cross_entropy(
logits.reshape(-<span style="color:#00d;font-weight:bold">1</span>, logits.shape[<span style="color:#00d;font-weight:bold">2</span>]).to(self.device),
classes.long().reshape(-<span style="color:#00d;font-weight:bold">1</span>).to(self.device),
ignore_index=<span style="color:#00d;font-weight:bold">3</span>
)
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"WARNING: Skipping model loss for inconsistency between logits and classes shapes"</span>)
fooling_loss = F.binary_cross_entropy(
mode_probs,
torch.ones_like(batch.mode).to(self.device)
)
loss = model_loss + (<span style="color:#00d;font-weight:bold">0.1</span> * fooling_loss)
loss.backward()
<span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
self.optimizer.step()
self.optimizer.zero_grad()
self.discriminator_optimizer.zero_grad()
</code></pre></div><p>The main idea is to treat the main network’s encoded gist as constant with respect to the updates to the critic’s parameters, and vice versa.</p>
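<p>The role of <code>detach()</code> here can be seen in isolation with a toy example (not the post’s code): gradients flow back to the parameters that produced a tensor only when the non-detached version is used.</p>

```python
import torch

w = torch.tensor([2.0], requires_grad=True)  # stands in for encoder weights
c = torch.tensor([5.0], requires_grad=True)  # stands in for critic weights
gist = w * 3.0                               # the encoded "gist" (value 6.0)

# Critic update: uses the detached gist, so no gradient reaches `w`
critic_loss = (gist.detach() * c).sum()
critic_loss.backward()
assert w.grad is None          # the encoder is a constant for the critic
assert c.grad.item() == 6.0    # d(6*c)/dc

# Main update: uses the live gist, so the gradient does reach `w`
main_loss = (gist * c).sum()
main_loss.backward()
assert w.grad.item() == 15.0   # d(3*w*5)/dw
```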
<h3 id="results">Results</h3>
<p>I’ve found some of the results look really exceptional:</p>
<p>Text:</p>
<blockquote>
<p>lynn is unhappy in her marriage. her husband is never good to her and shows her no attention. one evening lynn tells her husband she is going out with her friends. she really goes out with a man from work and has a great time. lynn continues dating him and starts having an affair.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>lynn starts dating him and has an affair . <end></p>
</blockquote>
<p>Text:</p>
<blockquote>
<p>cedric was hoping to get a big bonus at work. he had worked hard at the office all year. cedric’s boss called him into his office. cedric was disappointed when told there would be no bonus. cedric’s boss surprised cedric with a big raise instead of a bonus.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>cedric had a big deal at his boss ’s office . <end></p>
</blockquote>
<p>Some others showed how the model attends to single sentences though:</p>
<p>Text:</p>
<blockquote>
<p>i lost my job. i was having trouble affording my necessities. i didn’t have enough money to pay rent. i searched online for money making opportunities. i discovered amazon mechanical turk.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>i did n’t have enough money to pay rent . <end></p>
</blockquote>
<p>While a sentence like this one might make a good headline, it’s definitely not the best summary, as it naturally loses the vital parts found in the other sentences.</p>
<h3 id="final-words">Final words</h3>
<p>First of all, let me thank the paper’s authors for their exceptional work. It was a great read and great fun implementing!</p>
<p>Abstractive text summarization remains very difficult. The model trained for this blog post has very limited use in practice. There’s a lot of room for improvement though, which makes the future of abstractive summaries very promising.</p>
OpenITI Starts Arabic-script OCR Catalyst Projecthttps://www.endpointdev.com/blog/2019/09/openiti-arabic-ocr-catalyst-project/2019-09-10T00:00:00+00:00Elizabeth Garrett Christensen
<p><img src="/blog/2019/09/openiti-arabic-ocr-catalyst-project/banner.jpg" alt="Decorative Arabic calligraphy" /> <a href="https://www.flickr.com/photos/firaskaheel/16680667070">Photo</a> by Free Quran Pictures 4K, cropped, <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></p>
<p>Congratulations to the <a href="https://iti-corpus.github.io/">Open Islamicate Texts Initiative</a> (OpenITI) on their new project, the Arabic-script OCR Catalyst Project (AOCP)! This project received funding from The Andrew W. Mellon Foundation this summer.</p>
<p>End Point developer Kamil Ciemniewski will be serving the project as a Technology Integration Specialist. Kamil has been involved with OpenITI since 2018 and with the affiliated project, <a href="https://openiti.org/projects/corpusbuilder">Corpus Builder</a>, since 2017.</p>
<p>Version 1.0 of the Corpus Builder project made collaborative production of ground-truth datasets for OCR model training possible. The application acts as both a versioned database of text transcriptions and a full OCR pipeline itself. The versioned character of the database closely follows the model used by Git.</p>
<p>What is remarkable about it is that it enables working on revisions of documents that, unlike text in Git’s case, aren’t linear in character. For the OCR problem, one needs not only the textual data but also the spatial: where exactly the text is to be found.</p>
<p>A sophisticated mechanism of applying updates to those documents minimizes (with mathematical guarantees) the chance of introducing merge conflicts.</p>
<p>The project also provides a great-looking UI that allows non-technical editors to work within the workflow of this versioned data.</p>
<p>CorpusBuilder works with both <a href="https://github.com/tesseract-ocr/tesseract">Tesseract</a> and <a href="http://kraken.re/">Kraken</a> as its OCR backends and is capable of exporting datasets in their respective formats for further model training / retraining. Training of Tesseract models was covered last year in a <a href="/blog/2018/07/training-tesseract-models-from-scratch/">blog post</a> by Kamil.</p>
<p>AOCP will rapidly expand on prior work, helping establish a digital pipeline for digitizing texts and create a set of tools for students and scholars of historic texts.</p>
<p>End Point is really excited to be a part of such a cool integration of technology and the humanities!</p>
<p>Read more at:</p>
<ul>
<li><a href="https://www.openiti.org/projects/openitiaocp">https://www.openiti.org/projects/openitiaocp</a></li>
<li><a href="https://medium.com/@openiti/openiti-aocp-9802865a6586">https://medium.com/@openiti/openiti-aocp-9802865a6586</a></li>
<li><a href="http://kitab-project.org/corpus/">http://kitab-project.org/corpus/</a></li>
</ul>
An Introduction to Neural Networkshttps://www.endpointdev.com/blog/2019/07/an-introduction-to-neural-networks/2019-07-01T00:00:00+00:00Ben Ironside Goldstein
<p><img src="/blog/2019/07/an-introduction-to-neural-networks/image-0.jpg" alt="Weird Tree Art (Neural Network)" /> <a href="https://flic.kr/p/5eL8Ag">Photo</a> by <a href="https://www.flickr.com/photos/sudhamshu/">Sudhamshu Hebbar</a>, used under <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></p>
<p>Earlier this year I wrote a <a href="/blog/2019/05/facial-recognition-amazon-deeplens/">post</a> about my work with a machine-learning camera, the <a href="https://aws.amazon.com/deeplens/">AWS DeepLens</a>, which has onboard processing power to enable AI capabilities without sending data to the cloud. Neural networks are a type of ML model which achieves very impressive results on certain problems (including computer vision), so in this post I give a more thorough introduction to neural networks, and share some useful resources for those who want to dig deeper.</p>
<h3 id="neurons-and-nodes">Neurons and Nodes</h3>
<p>Neural networks are models inspired by the function of biological neural networks. They consist of nodes (arranged in layers) and the connections between those nodes. Each connection between two nodes enables one-way information transfer: a node either receives input from, or sends output to, each node to which it is connected. A node typically has an “activation function”, parameterized by the node’s inputs, and the node’s output is the result of this function.</p>
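<p>As a toy illustration (a hypothetical sketch, not any particular library’s API), a single node can be modeled as an activation function applied to the weighted sum of its inputs:</p>

```python
import math

# A hypothetical sketch of a single artificial node: its output is an
# activation function (here, a sigmoid) applied to the weighted sum of
# its inputs plus a bias.
def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid squashes z into (0, 1)

out = neuron([0.5, -1.0], weights=[0.8, 0.2], bias=0.1)  # z = 0.3
```

The weights and bias are the parameters that training will later adjust.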
<p>As with the function of biological neural networks, the emergence of information processing from these mathematical operations is opaque. Nevertheless, complex artificial neural networks are capable of feats such as vision, language translation, and winning competitive games. As the technology improves, even more impressive tasks will become possible. As with organic brains, neural networks can achieve complex tasks only as a result of appropriate architecture, constraints, and training—for machine learning, humans must (for now) design it all.</p>
<h3 id="neural-network-architecture">Neural Network Architecture</h3>
<p><img src="/blog/2019/07/an-introduction-to-neural-networks/image-1.png" style="float: right; max-width: 200px" /></p>
<p>Nodes are grouped in layers: the input layer, the output layer, and all the layers between them, known as hidden layers. Nodes can be networked in a variety of ways within and between layers, and sophisticated neural network models can include dozens of layers configured in various ways. These include layers which summarize, combine, eliminate, direct, or transform information. Each receives its input from the previous layer, and passes its output to the next layer. The last layer is designed such that its output answers the relevant question (for example, it would offer 9 options if the goal were to identify the hand-written numbers 1–9).</p>
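<p>A densely connected forward pass can be sketched like this (a hypothetical toy example; the layer sizes and the 9 output options are purely illustrative):</p>

```python
import numpy as np

# Hypothetical sketch of a small densely connected network. Each layer
# feeds the next; the last layer has 9 nodes, one per answer option,
# and softmax turns its outputs into probabilities.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(8, 9)), np.zeros(9)   # hidden -> 9 options

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)        # ReLU activation
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()

probs = forward(rng.normal(size=4))              # 9 probabilities summing to 1
```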
<p>For all this information processing to achieve a given task, the parameters of each node need appropriate values. The process of choosing those values is called training. In order to train a neural network, one needs to provide examples of what the network should do. (For example, to train it to write requires examples of writing. To train it to identify objects in images requires images and their appropriately labeled counterparts.) The more data a model can learn from, the better it can work. Gathering enough data is typically a major undertaking.</p>
<h3 id="training-a-neural-network">Training a Neural Network</h3>
<p>Before training, models have random parameters for all nodes. Each time data is passed through the model, the effectiveness of the model is measured using a “loss function”. Loss functions measure how wrong a model’s output is. Different loss functions (also known as cost functions or error functions) measure this in different ways, but in general, the more wrong a model is, the higher its loss/error/cost. Loss functions thus summarize the quality of a model’s output with a single number. Models are optimized to minimize the loss. (For more on the role of loss functions in neural networks, I suggest <a href="https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/">this excellent article</a>.)</p>
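<p>For instance, mean squared error is one common loss function; this minimal sketch shows how more wrong predictions produce a larger single number:</p>

```python
# Mean squared error, one common loss function: it condenses how wrong a
# model's predictions are into a single number, and the more wrong the
# predictions, the higher that number.
def mse_loss(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

nearly_right = mse_loss([1.0, 0.1], [1.0, 0.0])  # small loss
very_wrong = mse_loss([0.0, 1.0], [1.0, 0.0])    # large loss
```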
<p>One of the most interesting details of the entire process has to do with how the parameters are tuned. Model optimization relies on variations of a process called gradient descent, in which parameter values are adjusted by small intervals in an attempt to minimize the loss. Over many thousands of repetitions, the training program uses calculus to pick values that help to minimize the loss. As you can imagine, this process becomes extremely computationally intensive when the neural network is large and complex. However, in order to solve hard problems, networks must be large and complex. This is why training neural networks requires substantial computing power, and often takes place in the cloud. (For more on stochastic gradient descent, I suggest <a href="https://www.youtube.com/watch?v=vMh0zPT0tLI">this video</a> as a great starting point, or <a href="http://ruder.io/optimizing-gradient-descent/">this review</a> for a more advanced overview.)</p>
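<p>A minimal sketch of plain gradient descent on a single parameter (a toy loss, not a real network) shows the idea of adjusting values by small intervals to minimize the loss:</p>

```python
# Toy gradient descent: the "model" is one parameter theta with loss
# (theta - 3)^2, minimized at theta = 3. Each step nudges theta a small
# interval against the gradient, lowering the loss.
theta = 0.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (theta - 3)     # d/dtheta of (theta - 3)^2
    theta -= learning_rate * gradient
```

Real networks repeat this over millions of parameters, which is where the computational cost comes from.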
<h3 id="further-reading">Further reading</h3>
<ul>
<li>It turns out that a relatively simple neural network can approximate any function. This remarkable <a href="https://towardsdatascience.com/can-neural-networks-really-learn-any-function-65e106617fc6">demonstration</a> is quite accessible.</li>
<li>There are countless useful implementations of neural network models. End Pointer <a href="/blog/authors/kamil-ciemniewski/">Kamil Ciemniewski</a> wrote two in-depth and fascinating blogs about neural network projects which he completed in the past year: <a href="/blog/2019/01/speech-recognition-with-tensorflow/">Speech Recognition From Scratch</a>, and <a href="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/">Self-Driving Toy Car</a>.</li>
<li>If you’re interested in getting a sense for the general state of the art, <a href="https://www.topbots.com/most-important-ai-research-papers-2018/">here</a> are summaries of some of the most influential papers in machine learning since 2018.</li>
<li>For those curious about the inner workings of the training process, here’s one about <a href="http://neuralnetworksanddeeplearning.com/chap2.html">back-propagation</a>.</li>
<li>This blog post describes “densely connected” network layers; here’s an article about <a href="https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53">convolutional layers</a>.</li>
<li>And finally, this article describes <a href="https://medium.com/explore-artificial-intelligence/an-introduction-to-recurrent-neural-networks-72c97bf0912">recurrent neural networks</a>.</li>
</ul>
Deploying production Machine Learning pipelines to Kubernetes with Argohttps://www.endpointdev.com/blog/2019/06/deploying-production-pipelines-to-kubernetes/2019-06-28T00:00:00+00:00Kamil Ciemniewski
<p><img src="/blog/2019/06/deploying-production-pipelines-to-kubernetes/image-0.jpg" alt="Rube Goldberg machine" /><br><a href="https://commons.wikimedia.org/wiki/File:Rube_Goldberg_Machine_(278696130).jpg">Image by Wikimedia Commons</a></p>
<p>In some sense, most machine learning projects look exactly the same. There are 4 stages to be concerned with no matter what the project is:</p>
<ol>
<li>Sourcing the data</li>
<li>Transforming it</li>
<li>Building the model</li>
<li>Deploying it</li>
</ol>
<p>It’s been said that #1 and #2 take up most of an ML engineer’s time. This emphasizes how little time the most fun part, #3, sometimes seems to get.</p>
<p>In the real world, though, #4 can over time take almost as much as the previous three combined.</p>
<p>Deployed models sometimes need to be rebuilt. They consume data that constantly need to go through stages #1 and #2. It certainly isn’t always what’s shown in the classroom, where datasets fit perfectly in memory and model training takes at most a couple of hours on an old laptop.</p>
<p>Working with gigantic datasets isn’t the only problem. Data pipelines can take long hours to complete. What if some part of your infrastructure has an unexpected downtime? Do you just start it all over again from the very beginning?</p>
<p>Many solutions of course exist. With this article, I’d like to go over this problem space and present an approach that feels really nice and clean.</p>
<h3 id="project-description">Project description</h3>
<p>End Point Corporation was founded in 1995. That’s 24 years! About 9 years later, <a href="/blog/2004/10/red-hat-enterprise-linux-3-update-3/">the oldest article</a> on the company’s blog was published. Since that time, a staggering 1435 unique articles have been published. That’s a lot of words! This is something we can definitely put to smart use.</p>
<p>For the purpose of having fun with building a production-grade data pipeline, let’s imagine the following project:</p>
<ul>
<li>A <a href="https://cs.stanford.edu/~quocle/paragraph_vector.pdf">doc2vec</a> model trained on the corpus of End Point’s blog articles</li>
<li>Use of the paragraph vectors for each article to find the 10 other, most similar articles</li>
</ul>
<p>I blogged earlier about using <a href="/blog/2018/07/recommender-mxnet/">matrix factorization</a> as a simple <a href="https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering">collaborative filtering</a> style of recommender system. We can think of today’s doc2vec-based model as an example of <a href="https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering">content-based filtering</a>. The business value would be potentially increased blog traffic from users staying longer on the website.</p>
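<p>To make the “10 most similar articles” step concrete, here’s a hedged sketch of the similarity lookup, assuming each article already has a paragraph vector from a trained doc2vec model (the vectors below are random placeholders, and the dimensions are illustrative):</p>

```python
import numpy as np

# Hedged sketch: assuming a matrix with one paragraph vector per article
# (random placeholders here; a real pipeline would take them from a
# trained doc2vec model), cosine similarity ranks the 10 nearest articles.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(1435, 50))           # one 50-dim vector per article

def most_similar(article_idx, topn=10):
    v = vectors[article_idx]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v))
    sims[article_idx] = -np.inf                 # exclude the article itself
    return np.argsort(-sims)[:topn]             # indices of the top matches

neighbors = most_similar(0)
```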
<h3 id="scalable-pipelines">Scalable pipelines</h3>
<p>The data pipeline problem has certainly found some really great solutions. The <a href="http://hadoop.apache.org">Hadoop</a> project brought in HDFS, a distributed file system for huge data artifacts. Its MapReduce component plays a vital role in distributed data processing.</p>
<p>Then, the fantastic <a href="https://spark.apache.org">Spark</a> project came in. Its architecture makes data reside in memory by default—with explicit caching of the data on disks. The project claims to be running workloads 100 times faster than Hadoop.</p>
<p>Both projects though require the developer to use a very specific set of libraries. It’s not easy, for example, to distribute <a href="https://spacy.io">spaCy</a> training and inference on Spark.</p>
<h3 id="containers">Containers</h3>
<p>On the other side of the spectrum, there’s <a href="https://dask.org">Dask</a>. It’s a Python package that wraps <a href="https://www.numpy.org">Numpy</a>, <a href="https://pandas.pydata.org">Pandas</a> and <a href="https://scikit-learn.org/stable/">Scikit-Learn</a>. It enables developers to load huge piles of data, just as they would with the smaller datasets. The data is partitioned and distributed among the cluster nodes. It can work with groups of processes as well as clusters of containers. The APIs of the above-mentioned projects are (mostly) preserved while all the processing is suddenly distributed.</p>
<p>Some teams like to use Dask along with <a href="https://luigi.readthedocs.io/en/stable/">Luigi</a> and build production pipelines around <a href="https://www.docker.com">Docker</a> or <a href="https://kubernetes.io">Kubernetes</a>.</p>
<p>In this article, I’d like to present another Dask-friendly solution: Kubernetes-native workflows using <a href="https://argoproj.github.io">Argo</a>. What’s great about it compared to Luigi is that you don’t even need to care about having a certain version of Python and Luigi installed to orchestrate the pipeline. All you need is a Kubernetes cluster with Argo installed on it.</p>
<h3 id="hands-down-work-on-the-project">Hands-on work on the project</h3>
<p>The first thing to do when developing this project is to get access to the Kubernetes cluster. For the development, you can set up a one-node cluster using either one of:</p>
<ul>
<li><a href="https://microk8s.io">Microk8s</a></li>
<li><a href="https://github.com/kubernetes/minikube">Minikube</a></li>
</ul>
<p>I love them both. The first is developed by Canonical, while the second comes from the Kubernetes team itself.</p>
<p>This isn’t going to be a step-by-step tutorial on using Kubernetes. If you’re new to it, I encourage you to read the documentation or seek out a good online course. Even in that case, though, read on; nothing here is overly complex.</p>
<p>Next, you’ll need the Argo Workflows. The installation is really easy. The full yet simple documentation can be found <a href="https://argoproj.github.io/docs/argo/demo.html">here</a>.</p>
<h4 id="the-project-structure">The project structure</h4>
<p>Here’s what the project looks like in the end:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">.
├── Makefile
├── notebooks
│   └── scratch.ipynb
├── notebooks.yml
├── pipeline.yaml
└── tasks
    ├── base
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── build_model
    │   ├── Dockerfile
    │   └── run.py
    ├── clone_repo
    │   ├── Dockerfile
    │   └── run.sh
    ├── infer
    │   ├── Dockerfile
    │   └── run.py
    ├── notebooks
    │   └── Dockerfile
    └── preprocess
        ├── Dockerfile
        └── run.py
</code></pre></div><p>The main parts are as follows:</p>
<ul>
<li><code>Makefile</code> provides easy-to-use helpers for building images, pushing them to the Docker registry, and running the Argo workflow</li>
<li><code>notebooks.yml</code> defines a Kubernetes service and deployment for an exploratory <a href="https://github.com/jupyterlab/jupyterlab">Jupyter Lab</a> instance</li>
<li><code>notebooks</code> contains individual Jupyter notebooks</li>
<li><code>pipeline.yaml</code> defines our Machine Learning pipeline in the form of the Argo workflow</li>
<li><code>tasks</code> contains workflow steps as containers along with their Dockerfiles</li>
<li><code>tasks/base</code> defines the base Docker image for other tasks</li>
<li><code>tasks/**/run.(py|sh)</code> is a single entry point for a given pipeline step</li>
</ul>
<p>The idea is to minimize the boilerplate while retaining the features offered by tools like Luigi.</p>
<h4 id="makefile">Makefile</h4>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-makefile" data-lang="makefile"><span style="color:#369">SHELL</span> := /bin/bash
<span style="color:#369">VERSION</span>?=latest
<span style="color:#369">TASK_IMAGES</span>:=<span style="color:#080;font-weight:bold">$(</span>shell find tasks -name Dockerfile -printf <span style="color:#d20;background-color:#fff0f0">'%h '</span><span style="color:#080;font-weight:bold">)</span>
<span style="color:#369">REGISTRY</span>=base:5000
<span style="color:#06b;font-weight:bold">tasks/%</span>: FORCE
<span style="color:#038">set</span> -e ;<span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> docker build -t blog_pipeline_<span style="color:#080;font-weight:bold">$(</span>@F<span style="color:#080;font-weight:bold">)</span>:<span style="color:#080;font-weight:bold">$(</span>VERSION<span style="color:#080;font-weight:bold">)</span> <span style="color:#369">$@</span> ;<span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> docker tag blog_pipeline_<span style="color:#080;font-weight:bold">$(</span>@F<span style="color:#080;font-weight:bold">)</span>:<span style="color:#080;font-weight:bold">$(</span>VERSION<span style="color:#080;font-weight:bold">)</span> <span style="color:#080;font-weight:bold">$(</span>REGISTRY<span style="color:#080;font-weight:bold">)</span>/blog_pipeline_<span style="color:#080;font-weight:bold">$(</span>@F<span style="color:#080;font-weight:bold">)</span>:<span style="color:#080;font-weight:bold">$(</span>VERSION<span style="color:#080;font-weight:bold">)</span> ;<span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> docker push <span style="color:#080;font-weight:bold">$(</span>REGISTRY<span style="color:#080;font-weight:bold">)</span>/blog_pipeline_<span style="color:#080;font-weight:bold">$(</span>@F<span style="color:#080;font-weight:bold">)</span>:<span style="color:#080;font-weight:bold">$(</span>VERSION<span style="color:#080;font-weight:bold">)</span>
<span style="color:#06b;font-weight:bold">images</span>: <span style="color:#080;font-weight:bold">$(</span><span style="color:#369">TASK_IMAGES</span><span style="color:#080;font-weight:bold">)</span>
<span style="color:#06b;font-weight:bold">run</span>: images
argo submit pipeline.yaml --watch
<span style="color:#06b;font-weight:bold">start_notebooks</span>:
kubectl apply -f notebooks.yml
<span style="color:#06b;font-weight:bold">stop_notebooks</span>:
kubectl delete deployment jupyter-notebook
<span style="color:#06b;font-weight:bold">FORCE</span>: ;
</code></pre></div><p>Running this Makefile with <code>make run</code> requires resolving the <code>images</code> dependency, which in turn resolves each of the <code>tasks/**/Dockerfile</code> dependencies. Notice how the <code>TASK_IMAGES</code> variable is constructed: it uses make’s <code>shell</code> function with Unix’s <code>find</code> to locate the subdirectories of <code>tasks</code> that contain a Dockerfile. Here’s the output if you were to run the command directly:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ find tasks -name Dockerfile -printf <span style="color:#d20;background-color:#fff0f0">'%h '</span>
tasks/notebooks tasks/base tasks/preprocess tasks/infer tasks/build_model tasks/clone_repo
</code></pre></div><h4 id="setting-up-jupyter-notebooks-as-a-scratch-pad-and-for-eda">Setting up Jupyter Notebooks as a scratch pad and for EDA</h4>
<p>Let’s start off by defining our base Docker image:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-dockerfile" data-lang="dockerfile"><span style="color:#080;font-weight:bold">FROM</span><span style="color:#d20;background-color:#fff0f0"> python:3.7</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">COPY</span> requirements.txt /requirements.txt<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> pip install -r /requirements.txt<span style="color:#a61717;background-color:#e3d2d2">
</span></code></pre></div><p>Following is the Dockerfile that extends it and adds the Jupyter Lab:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-dockerfile" data-lang="dockerfile"><span style="color:#080;font-weight:bold">FROM</span><span style="color:#d20;background-color:#fff0f0"> endpoint-blog-pipeline/base:latest</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> pip install jupyterlab<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> mkdir ~/.jupyter<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"c.NotebookApp.token = ''"</span> >> ~/.jupyter/jupyter_notebook_config.py<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"c.NotebookApp.password = ''"</span> >> ~/.jupyter/jupyter_notebook_config.py<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> mkdir /notebooks<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">WORKDIR</span><span style="color:#d20;background-color:#fff0f0"> /notebooks</span><span style="color:#a61717;background-color:#e3d2d2">
</span></code></pre></div><p>The last step is to add the Kubernetes service and deployment definition in <code>notebooks.yml</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-yaml" data-lang="yaml"><span style="color:#b06;font-weight:bold">apiVersion</span>:<span style="color:#bbb"> </span>apps/v1<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">kind</span>:<span style="color:#bbb"> </span>Deployment<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">metadata</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">labels</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">app</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">spec</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">replicas</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">1</span><span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">selector</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">matchLabels</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">app</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">metadata</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">labels</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">app</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">spec</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">containers</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>minimal-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">image</span>:<span style="color:#bbb"> </span>base:5000/blog_pipeline_notebooks<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">ports</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">containerPort</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">8888</span><span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">command</span>:<span style="color:#bbb"> </span>[<span style="color:#d20;background-color:#fff0f0">"/usr/local/bin/jupyter"</span>]<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">args</span>:<span style="color:#bbb"> </span>[<span style="color:#d20;background-color:#fff0f0">"lab"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"--allow-root"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"--port"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"8888"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"--ip"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"0.0.0.0"</span>]<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">---</span><span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">kind</span>:<span style="color:#bbb"> </span>Service<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">apiVersion</span>:<span style="color:#bbb"> </span>v1<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">metadata</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">spec</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">type</span>:<span style="color:#bbb"> </span>NodePort<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">selector</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">app</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">ports</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">protocol</span>:<span style="color:#bbb"> </span>TCP<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">nodePort</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">30040</span><span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">port</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">8888</span><span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">targetPort</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">8888</span><span style="color:#bbb">
</span></code></pre></div><p>This can be run using our Makefile with <code>make start_notebooks</code> or directly with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ kubectl apply -f notebooks.yml
</code></pre></div><h4 id="exploration">Exploration</h4>
<p>The <a href="https://github.com/kamilc/endpoint-blog-nlp/blob/master/notebooks/scratch.ipynb">notebook itself</a> feels more like a scratch pad than an exploratory data analysis. It’s very informal and doesn’t include much exploration or visualization. In more real-world code you likely wouldn’t omit those.</p>
<p>I used it to ensure the model would work at all. I was then able to grab portions of the code and paste them directly into the step definitions.</p>
<h4 id="implementation">Implementation</h4>
<h5 id="step-1-source-blog-articles">Step 1: Source blog articles</h5>
<p>The blog’s articles are stored on <a href="https://github.com/EndPointCorp/end-point-blog">GitHub</a> in Markdown files.</p>
<p>Our first pipeline task will need to either clone the repo or pull from it if it’s present in the pipeline’s shared volume.</p>
<p>We’ll use the Kubernetes <a href="https://kubernetes.io/docs/concepts/storage/volumes/#hostpath">hostPath</a> as the cross-step volume. What’s nice about it is that it’s easy to peek into the volume during development to see if the data artifacts are being generated correctly.</p>
<p>In our example here, I’m hardcoding the path on my local system:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-yaml" data-lang="yaml"><span style="color:#888"># ...</span><span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">volumes</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>endpoint-blog-src<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">hostPath</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">path</span>:<span style="color:#bbb"> </span>/home/kamil/data/endpoint-blog-src<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">type</span>:<span style="color:#bbb"> </span>Directory<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#888"># ...</span><span style="color:#bbb">
</span></code></pre></div><p>This is one of the downsides of the <code>hostPath</code>—it only accepts absolute paths. This will do just fine for now though.</p>
<p>In the <code>pipeline.yml</code> we define the task container with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-yaml" data-lang="yaml"><span style="color:#888"># ...</span><span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">templates</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>clone-repo<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">container</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">image</span>:<span style="color:#bbb"> </span>base:5000/blog_pipeline_clone_repo<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">command</span>:<span style="color:#bbb"> </span>[bash]<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">args</span>:<span style="color:#bbb"> </span>[<span style="color:#d20;background-color:#fff0f0">"/run.sh"</span>]<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">volumeMounts</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">mountPath</span>:<span style="color:#bbb"> </span>/data<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>endpoint-blog-src<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#888"># ...</span><span style="color:#bbb">
</span></code></pre></div><p>The full pipeline forms a tree, which is conveniently expressed as a directed acyclic graph in Argo. Here’s the definition of the whole pipeline (some steps haven’t been shown yet):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-yaml" data-lang="yaml"><span style="color:#888"># ...</span><span style="color:#bbb">
</span><span style="color:#bbb"></span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>article-vectors<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">dag</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">tasks</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>src<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb"> </span>clone-repo<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>dataframe<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb"> </span>preprocess<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">dependencies</span>:<span style="color:#bbb"> </span>[src]<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>model<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb"> </span>build-model<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">dependencies</span>:<span style="color:#bbb"> </span>[dataframe]<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>infer<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb"> </span>infer<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">dependencies</span>:<span style="color:#bbb"> </span>[model]<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#888"># ...</span><span style="color:#bbb">
</span></code></pre></div><p>Notice how the <code>dependencies</code> field makes it easy to tell Argo in what order to execute the tasks. Argo steps can also define inputs and outputs—just like Luigi. For this simple example, I decided to omit them and rely on the convention that each step expects its data artifacts at a certain location in the mounted volume. If you’re curious about other Argo features, <a href="https://argoproj.github.io/docs/argo/examples/readme.html#parameters">here</a> is its documentation.</p>
<p>The entry point script for the task is pretty simple:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#c00;font-weight:bold">#!/bin/bash
</span><span style="color:#c00;font-weight:bold"></span>
<span style="color:#038">cd</span> /data
<span style="color:#080;font-weight:bold">if</span> [ -d ./blog ]
<span style="color:#080;font-weight:bold">then</span>
  <span style="color:#038">cd</span> blog
  git pull origin master
<span style="color:#080;font-weight:bold">else</span>
  git clone https://github.com/EndPointCorp/end-point-blog.git blog
<span style="color:#080;font-weight:bold">fi</span>
</code></pre></div><h5 id="step-2-data-wrangling">Step 2: Data wrangling</h5>
<p>At this point, we have the source files for the blog articles in Markdown. To be able to run them through any kind of machine learning modeling, we need to load them into a data frame. We’ll also need to clean the text a bit. Here is the reasoning behind the cleanup routine:</p>
<ul>
<li>I want the relations between the articles to ignore the code snippets: articles should <strong>not</strong> be grouped by the programming language or library they use just because of the keywords they contain</li>
<li>I want the metadata about tags and authors to be omitted too, as I don’t want to see e.g. only my own articles listed as similar to my other ones</li>
</ul>
<p>The full source for the <code>run.py</code> of the “preprocess” task can be viewed <a href="https://github.com/kamilc/endpoint-blog-nlp/blob/master/tasks/preprocess/run.py">here</a>.</p>
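<p>The gist of the routine can be sketched in a few lines. This is a simplified illustration, not the actual <code>run.py</code>: the regexes, the column names, and the assumption that the metadata sits between <code>---</code> markers are mine:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import re
from pathlib import Path

import pandas as pd


def clean_markdown(text):
    # Drop fenced code blocks so articles don't cluster by code keywords
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)
    # Drop the metadata header (tags, author) assumed to sit between --- markers
    text = re.sub(r"\A---.*?---", " ", text, flags=re.DOTALL)
    return text


def build_dataframe(root):
    rows = [
        {"file": str(path.relative_to(root)), "text": clean_markdown(path.read_text())}
        for path in sorted(Path(root).glob("**/*.html.md"))
    ]
    return pd.DataFrame(rows)


if __name__ == "__main__" and Path("/data/blog").is_dir():
    build_dataframe("/data/blog").to_parquet("/data/articles.parquet")
</code></pre></div>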
<p>Notice that unlike make or Luigi, Argo will run the same task fully again even if the step’s artifact has already been created. I <strong>like</strong> this flexibility—after all, it’s extremely easy to skip the processing in the Python or shell script if the artifact already exists.</p>
<p>At the end of this step, the data frame is written as an <a href="https://parquet.apache.org">Apache Parquet</a> file.</p>
<h5 id="step-3-building-the-model">Step 3: Building the model</h5>
<p>The model from the paper mentioned earlier has already been implemented in a variety of other projects. There are implementations for each major deep learning framework on GitHub. There’s also a pretty good one included in <a href="https://radimrehurek.com/gensim/index.html">Gensim</a>. Its documentation can be found <a href="https://radimrehurek.com/gensim/models/doc2vec.html">here</a>.</p>
<p>The <a href="https://github.com/kamilc/endpoint-blog-nlp/blob/master/tasks/build_model/run.py">run.py</a> is pretty short and straightforward as well, which is one of the goals for the pipeline. In the end, it writes the trained model into the shared volume too.</p>
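<p>For reference, the core of the model building step can be sketched like this. The <code>TaggedDocument</code> construction and the hyperparameters below are my assumptions, not necessarily what the actual <code>run.py</code> uses:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


def build_model():
    # Tag each document with its file path so most_similar can return paths
    articles = pd.read_parquet('/data/articles.parquet')
    documents = [
        TaggedDocument(words=text.split(), tags=[file])
        for file, text in zip(articles['file'], articles['text'])
    ]
    model = Doc2Vec(documents, vector_size=100, min_count=2, epochs=40)
    model.save('/data/articles.model')
</code></pre></div>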
<p>Notice that re-running the pipeline with the model already stored will not trigger the training again. This is what we want. Imagine a new article being pushed into the repository: it’s very unlikely that retraining with it would affect the model’s performance in any significant way, but we’ll still need to predict the similar documents for it. The model building step short-circuits with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">'__main__'</span>:
<span style="color:#080;font-weight:bold">if</span> os.path.isfile(<span style="color:#d20;background-color:#fff0f0">'/data/articles.model'</span>):
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Skipping as the model file already exists"</span>)
<span style="color:#080;font-weight:bold">else</span>:
build_model()
</code></pre></div><h5 id="step-4-predict-similar-articles">Step 4: Predict similar articles</h5>
<p>The listing of the <a href="https://github.com/kamilc/endpoint-blog-nlp/blob/master/tasks/infer/run.py">run.py</a> isn’t overly long:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">pandas</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">pd</span>
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">gensim.models.doc2vec</span> <span style="color:#080;font-weight:bold">import</span> Doc2Vec
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">yaml</span>
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">pathlib</span> <span style="color:#080;font-weight:bold">import</span> Path
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">os</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">write_similar_for</span>(path, model):
similar_paths = model.docvecs.most_similar(path)
yaml_path = (Path(<span style="color:#d20;background-color:#fff0f0">'/data/blog/'</span>) / path).parent / <span style="color:#d20;background-color:#fff0f0">'similar.yaml'</span>
<span style="color:#080;font-weight:bold">with</span> <span style="color:#038">open</span>(yaml_path, <span style="color:#d20;background-color:#fff0f0">"w"</span>) <span style="color:#080;font-weight:bold">as</span> file:
file.write(yaml.dump([p <span style="color:#080;font-weight:bold">for</span> p, _ <span style="color:#080">in</span> similar_paths]))
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">f</span><span style="color:#d20;background-color:#fff0f0">"Wrote similar paths to </span><span style="color:#33b;background-color:#fff0f0">{</span>yaml_path<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">infer_similar</span>():
articles = pd.read_parquet(<span style="color:#d20;background-color:#fff0f0">'/data/articles.parquet'</span>)
model = Doc2Vec.load(<span style="color:#d20;background-color:#fff0f0">'/data/articles.model'</span>)
<span style="color:#080;font-weight:bold">for</span> tag <span style="color:#080">in</span> articles[<span style="color:#d20;background-color:#fff0f0">'file'</span>].tolist():
write_similar_for(tag, model)
<span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">'__main__'</span>:
infer_similar()
</code></pre></div><p>The idea is to load up the saved Gensim model and the data frame with articles first. Then, for each article, the model is used to get the 10 most similar other articles.</p>
<p>As the step’s output, the listing of similar articles is placed in a <code>similar.yaml</code> file in each article’s subdirectory.</p>
<p>The blog’s Markdown → HTML compiler could then use this file and e.g. inject the “You might find those articles interesting too” section.</p>
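<p>The compiler hook itself isn’t part of this pipeline, but a hypothetical helper for it could look as follows. The function name and the URL scheme are made up for illustration:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">from pathlib import Path

import yaml


def similar_section(article_dir):
    # Turn the pipeline's similar.yaml output into a Markdown snippet
    similar_file = Path(article_dir) / "similar.yaml"
    if not similar_file.exists():
        return ""
    paths = yaml.safe_load(similar_file.read_text())
    lines = ["### You might find those articles interesting too", ""]
    lines += ["- /blog/" + p.replace(".html.md", "/") for p in paths]
    return "\n".join(lines)
</code></pre></div>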
<h4 id="results">Results</h4>
<p>The scratch notebook already includes the example results of running this doc2vec model. Examples:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model.docvecs.most_similar(<span style="color:#d20;background-color:#fff0f0">'2019/01/09/liquid-galaxy-at-instituto-moreira-salles.html.md'</span>)
</code></pre></div><p>Giving the output of:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">[(<span style="color:#d20;background-color:#fff0f0">'2016/04/22/liquid-galaxy-for-real-estate.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8872901201248169</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/07/03/liquid-galaxy-at-2017-boma.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8766101598739624</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/01/25/smartracs-liquid-galaxy-at-national.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8722846508026123</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/01/04/liquid-galaxy-at-new-york-tech-meetup_4.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8693454265594482</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/06/16/successful-first-geoint-symposium-for.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8679709434509277</span>),
(<span style="color:#d20;background-color:#fff0f0">'2014/08/22/liquid-galaxy-for-daniel-island-school.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8659971356391907</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/07/21/liquid-galaxy-featured-on-reef-builders.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8644022941589355</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/11/17/president-of-the-un-general-assembly.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8620222806930542</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/04/27/we-are-bigger-than-vr-gear-liquid-galaxy.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8613147139549255</span>),
(<span style="color:#d20;background-color:#fff0f0">'2015/11/04/end-pointers-favorite-liquid-galaxy.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8601428270339966</span>)]
</code></pre></div><p>Or the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model.docvecs.most_similar(<span style="color:#d20;background-color:#fff0f0">'2019/01/08/speech-recognition-with-tensorflow.html.md'</span>)
</code></pre></div><p>Giving:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">[(<span style="color:#d20;background-color:#fff0f0">'2019/05/01/facial-recognition-amazon-deeplens.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8850516080856323</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/05/30/recognizing-handwritten-digits-quick.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8535605072975159</span>),
(<span style="color:#d20;background-color:#fff0f0">'2018/10/10/image-recognition-tools.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8495659232139587</span>),
(<span style="color:#d20;background-color:#fff0f0">'2018/07/09/training-tesseract-models-from-scratch.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8377258777618408</span>),
(<span style="color:#d20;background-color:#fff0f0">'2015/12/18/ros-has-become-pivotal-piece-of.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8344655632972717</span>),
(<span style="color:#d20;background-color:#fff0f0">'2013/03/07/streaming-live-with-red5-media.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8181146383285522</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/04/27/streaming-live-with-red5-media-server.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8142604827880859</span>),
(<span style="color:#d20;background-color:#fff0f0">'2013/03/15/generating-pdf-documents-in-browser.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7829260230064392</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/05/12/sketchfab-on-liquid-galaxy.html.md'</span>, <span style="color:#00d;font-weight:bold">0.7779937386512756</span>),
(<span style="color:#d20;background-color:#fff0f0">'2018/08/29/self-driving-toy-car-using-the-a3c-algorithm.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7659779787063599</span>)]
</code></pre></div><p>Or</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model.docvecs.most_similar(<span style="color:#d20;background-color:#fff0f0">'2016/06/03/adding-bash-completion-to-python-script.html.md'</span>)
</code></pre></div><p>With:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">[(<span style="color:#d20;background-color:#fff0f0">'2014/03/12/provisioning-development-environment.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8298013806343079</span>),
(<span style="color:#d20;background-color:#fff0f0">'2015/04/03/manage-python-script-options.html.md'</span>, <span style="color:#00d;font-weight:bold">0.7975824475288391</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/01/03/automating-removal-of-ssh-key-patterns.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7794561386108398</span>),
(<span style="color:#d20;background-color:#fff0f0">'2014/03/14/provisioning-development-environment_14.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7763932943344116</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/04/16/easy-creating-ramdisk-on-ubuntu.html.md'</span>, <span style="color:#00d;font-weight:bold">0.7579266428947449</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/03/03/loading-json-files-into-postgresql-95.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7410352230072021</span>),
(<span style="color:#d20;background-color:#fff0f0">'2015/02/06/vim-plugin-spotlight-ctrlp.html.md'</span>, <span style="color:#00d;font-weight:bold">0.7385793924331665</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/10/27/hot-deploy-java-classes-and-assets-in.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7358890771865845</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/03/21/puppet-custom-fact-ruby-plugin.html.md'</span>, <span style="color:#00d;font-weight:bold">0.718029260635376</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/01/14/using-disqus-and-rails.html.md'</span>, <span style="color:#00d;font-weight:bold">0.716759443283081</span>)]
</code></pre></div><p>To run the pipeline all you need is to:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ make run
</code></pre></div><p>Or directly with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ argo submit pipeline.yml --watch
</code></pre></div><p>Argo gives a nice looking output of all the steps:</p>
<pre tabindex="0"><code>Name: endpoint-blog-pipeline-49ls5
Namespace: default
ServiceAccount: default
Status: Succeeded
Created: Wed Jun 26 13:27:51 +0200 (17 seconds ago)
Started: Wed Jun 26 13:27:51 +0200 (17 seconds ago)
Finished: Wed Jun 26 13:28:08 +0200 (now)
Duration: 17 seconds
STEP PODNAME DURATION MESSAGE
✔ endpoint-blog-pipeline-49ls5
├-✔ src endpoint-blog-pipeline-49ls5-3331170004 3s
├-✔ dataframe endpoint-blog-pipeline-49ls5-2286787535 3s
├-✔ model endpoint-blog-pipeline-49ls5-529475051 3s
└-✔ infer endpoint-blog-pipeline-49ls5-1778224726 6s
</code></pre><p>The resulting <code>similar.yaml</code> files look as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ ls ~/data/endpoint-blog-src/blog/2013/03/15/
generating-pdf-documents-in-browser.html.md similar.yaml
$ cat ~/data/endpoint-blog-src/blog/2013/03/15/similar.yaml
- 2016/03/17/creating-video-player-with-time-markers.html.md
- 2014/07/17/creating-symbol-web-font.html.md
- 2018/10/10/image-recognition-tools.html.md
- 2015/08/04/how-to-big-beautiful-background-video.html.md
- 2014/11/06/simplifying-mobile-development-with.html.md
- 2016/03/23/learning-from-data-basics-naive-bayes.html.md
- 2019/01/08/speech-recognition-with-tensorflow.html.md
- 2013/11/19/asynchronous-page-switches-with-django.html.md
- 2016/03/11/strict-typing-fun-example-free-monads.html.md
- 2018/07/09/training-tesseract-models-from-scratch.html.md
</code></pre></div><p>Although it’s difficult to quantify, those sets of “similar” documents do seem to be linked in many ways to their “anchor” articles. You’re invited to read them and see for yourself!</p>
<h3 id="closing-words">Closing words</h3>
<p>The code presented here is hosted <a href="https://github.com/kamilc/endpoint-blog-nlp">on GitHub</a>. There’s lots of room for improvement of course. It shows an approach that could work for small model deployments (like the one above) as well as for very big ones.</p>
<p>Argo workflows can be used in tandem with Kubernetes deployments. You could e.g. run a distributed <a href="https://www.tensorflow.org">TensorFlow</a> model training and then deploy the model on Kubernetes via <a href="https://www.tensorflow.org/tfx/guide/serving">TensorFlow Serving</a>. If you’re more into <a href="https://pytorch.org">PyTorch</a>, then distributing the training would be possible via <a href="https://eng.uber.com/horovod/">Horovod</a>. Have data scientists that use R? Deploy <a href="https://www.rstudio.com">RStudio Server</a> instead of JupyterLab with <a href="https://hub.docker.com/r/rocker/rstudio">the image from DockerHub</a>, and run some or all of the tasks with the <a href="https://hub.docker.com/r/rocker/r-ver">simpler image</a> that includes only base R.</p>
<p>If you have any questions or projects you’d like us to help you with, reach out right away through our <a href="/contact/">contact form</a>!</p>
<h2><a href="https://www.endpointdev.com/blog/2019/05/facial-recognition-amazon-deeplens/">Facial Recognition Using Amazon DeepLens: Counting Liquid Galaxy Interactions</a></h2>
<p>2019-05-01 · Ben Ironside Goldstein</p>
<p>I have been exploring the possible uses of a machine-learning-enabled camera for the Liquid Galaxy. The Amazon Web Services (AWS) <a href="https://aws.amazon.com/deeplens/">DeepLens</a> is a camera that can receive and transmit data over wifi, and that has computing hardware built in. Since its hardware enables it to use machine learning models, it can perform computer vision tasks in the field.</p>
<h3 id="the-amazon-deeplens-camera">The Amazon DeepLens camera</h3>
<p><img style="float: left; width: 400px; padding-right: 2em;" src="/blog/2019/05/facial-recognition-amazon-deeplens/deeplens-front-angle.jpg" alt="DeepLens" /></p>
<p>This camera is the first of its kind—likely the first of many, given the ongoing rapid adoption of Internet of Things (IoT) devices and computer vision. It came to End Point’s attention as hardware that could potentially interface with and extend End Point’s immersive visualization platform, the <a href="https://www.visionport.com/">Liquid Galaxy</a>. We’ve thought of several ways computer vision could potentially work to enhance the platform, for example:</p>
<ol>
<li>Monitoring users’ reactions</li>
<li>Counting unique visitors to the LG</li>
<li>Counting the number of people using an LG at a given time</li>
</ol>
<p>The first idea would depend on parsing facial expressions. Perhaps a certain moment in a user experience causes people to look confused, or particularly delighted—valuable insights. The second idea would generate data that could help us assess the platform’s impact, using a metric crucial to any potential clients whose goals involve engaging audiences. The third idea would create a simpler metric: the average number of people engaging with the system over a period of time. Nevertheless, this idea has a key advantage over the second: it doesn’t require distinguishing between people, which makes it a much more tractable project. This post focuses on the third idea.</p>
<p>To set up the camera, the user has to plug it into a power outlet and connect it to wifi. The camera will still work even with a slow network connection, though the slower the connection, the longer the delay between the camera seeing something and reporting it. However, this delay was hardly noticeable on my home network, which has slow-to-moderate speeds of about 17 Mbps down and 33 Mbps up.</p>
<h3 id="computer-vision-and-the-amazon-deeplens">Computer Vision and the Amazon DeepLens</h3>
<p>A <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning model</a> is a neural network with multiple layers of processing units; it is called “deep” because of those multiple layers. The inputs and outputs of each processing unit are numbers. These units are roughly analogous to neurons: they receive input from units in the previous layer, transform it with a function, and pass the result on to units in the next layer. These “activation functions” can take a variety of forms. The last layer’s outputs translate into the results. The models work because the units’ functions get tuned based on how well the model performs. For example, to make a model that labels each human face in a picture and draws a box around it, we would start with a corpus of pictures with boxes drawn around faces, as well as versions of the pictures without the boxes drawn. We would test the model on the non-labeled images by checking—for each picture—whether the output generated by the model is correct. If not, the computer adjusts the units’ parameters, tries again, and compares the results. Repeating this process thousands of times yields models which work remarkably well for a wide range of tasks, including computer vision.</p>
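<p>The arithmetic of a single unit is simple enough to show in a few lines of Python (an illustration only; real frameworks vectorize this across whole layers and pictures):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import math


def unit(inputs, weights, bias):
    # Weighted sum of the previous layer's outputs, plus a bias term...
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...passed through an activation function (here: the logistic sigmoid)
    return 1.0 / (1.0 + math.exp(-total))


# Two numbers coming from the previous layer, feeding one unit of the next
print(unit([0.5, -1.2], [0.8, 0.1], bias=0.3))  # about 0.64
</code></pre></div>
<p>Training amounts to nudging the <code>weights</code> and <code>bias</code> of every unit until the last layer’s outputs match the labels well.</p>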
<p>In deep learning for computer vision, training on large sets of labeled images enables models to generalize about visual characteristics. The training process takes a lot of computing resources, but once models are trained, they can produce results quickly and with relative ease. This is why the DeepLens is able to perform computer vision with its limited computing resources.</p>
<p>Since the DeepLens is an Amazon product, it comes as no surprise that the user interface and backend for DeepLens consist of AWS services. One of the most important is <a href="https://aws.amazon.com/sagemaker/">SageMaker</a>, which is used to train, manage, optimize, and deploy machine learning models such as neural networks. It includes hosted Jupyter notebooks (<a href="https://jupyter.org/">Jupyter</a> is a development environment for data science), as well as the computing resources required for model training and storage. With SageMaker, users can train computer vision models for deployment to DeepLens, or import and adjust pretrained models from various sources.</p>
<p>Remote management of the DeepLens depends on <a href="https://aws.amazon.com/lambda/">AWS Lambda</a>, a “serverless” cloud service that provides an environment to run backend code and integrate with other cloud services. It runs the show, allowing users to manage everything from the camera’s behavior to what happens to gathered data. Another service, <a href="https://aws.amazon.com/greengrass/">AWS Greengrass</a>, connects the instructions from AWS Lambda to the DeepLens, managing tasks like authentication, updates, and reactions to local events.</p>
<p>Amazon’s IoT service saves information about each DeepLens, and allows users to manage their devices, for example by choosing which model is active on the device, or viewing a live stream from the camera. It also keeps track of what’s going on with the hardware, even when it’s off. When a model is running on the DeepLens, you can view a live stream of its inferences about what it’s seeing (the labeled images). Amazon has released various pretrained models designed to work on the DeepLens. Using a model for detecting faces, we can get a live stream that looks like this:</p>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/one-face-recognition.jpg" alt="one-face-recognition">
<br>Me looking at the DeepLens in my kitchen</p>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/multi-face-recognition.jpg" alt="multi-face-recognition">
<br>Facial recognition inferences on multiple people. (Witness my smile of satisfaction at finally finding enthusiastic subjects of facial recognition.)</p>
<p>Each face that the camera detects gets a box around it, along with the model’s level of certainty that it is a face. The above pictures were the results of an attempt to simulate the conditions where this could be used.</p>
<h3 id="the-model">The Model</h3>
<p>The model I used was trained on data from <a href="http://www.image-net.org/">ImageNet</a>, a public database with hundreds or thousands of images associated with each noun. (For example, they have 1537 <a href="http://www.image-net.org/synset?wnid=n03376595">pictures of folding chairs</a>.) ImageNet is <a href="https://arxiv.org/search/?query=imagenet&searchtype=all&source=header">commonly</a> used to train and test computer vision models.</p>
<p>However, the training for this model didn’t stop there: Amazon used transfer learning from another large image dataset, <a href="http://cocodataset.org/#home">MS-COCO</a>, to fine-tune the model for face detection. Transfer learning works essentially by retraining the last layer of an already-trained model. In this way it harnesses the “insights” of the existing model (e.g. about shapes, colors, and positions) by repurposing this information to make predictions about something else: in this case, whether something is a face.</p>
<p>Since this model was pretrained and optimized by Amazon for the DeepLens, it provides a low effort route to implementing a computer vision model on the DeepLens. I didn’t have to do any of the processing on my own hardware. The DeepLens hardware took care of all the predictions, though the biggest resource savings were from not having to train the model myself (which can take days, or longer).</p>
<p>When the facial recognition model is deployed and the DeepLens is on, an AWS Lambda function written in Python repeatedly prompts the camera for frames:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">frame = awscam.getLastFrame()
</code></pre></div><p>…to resize the frames before inference (the model accepts frames of a particular size):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">frame_resize = cv2.resize(frame, (input_height, input_width))
</code></pre></div><p>…to pass the frames to the model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">parsed_inference_results = model.parseResult(model_type, model.doInference(frame_resize))
</code></pre></div><p>…and to use the results to draw boxes around the faces:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (<span style="color:#00d;font-weight:bold">255</span>, <span style="color:#00d;font-weight:bold">165</span>, <span style="color:#00d;font-weight:bold">20</span>), <span style="color:#00d;font-weight:bold">10</span>)
</code></pre></div><p>As you can see from how often “cv2” appears in the code above, this implementation relies heavily on code from <a href="https://opencv.org">OpenCV</a>, an open source computer vision framework. Finally, the results are sent to the cloud:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">client.publish(topic=iot_topic, payload=json.dumps(cloud_output))
</code></pre></div><p>In the last code snippet above, iot_topic refers to an Amazon MQTT (Message Queuing Telemetry Transport) topic for IoT devices. <a href="https://en.wikipedia.org/wiki/MQTT">MQTT</a> is the standard connectivity framework for DeepLens and many other IoT devices. One of its advantages in this context is that it handles intermittent connectivity by smoothly queueing messages until the network connection is stable. The essence of MQTT is publishing and subscribing to topics, and this system of topics lets results from a DeepLens trigger other processes. For example, the DeepLens could publish a message when it sees a face, prompting another cloud service to do something else, such as record when and for how long the face appeared.</p>
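<p>The publish/subscribe mechanism can be illustrated with a minimal in-memory sketch, a simplified stand-in for a real MQTT broker (the topic name and the downstream handler below are hypothetical, not the actual DeepLens topic):</p>

```python
# Minimal in-memory publish/subscribe sketch, illustrating how a message
# on one topic can trigger further processing (a stand-in for MQTT).
import json
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, payload):
    # Deliver the decoded payload to every handler subscribed to the topic.
    for handler in subscribers[topic]:
        handler(json.loads(payload))

# A hypothetical downstream process: record when a face appears.
sightings = []
subscribe("deeplens/infer", lambda msg: sightings.append(msg["timestamp"]))

publish("deeplens/infer", json.dumps({"face": 0.92, "timestamp": 1554853281975}))
print(sightings)  # → [1554853281975]
```

<p>A real broker adds authentication, quality-of-service levels, and message queueing on top of this basic topic-routing idea.</p>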
<p>I wanted to test how data from this model would compare to a human’s perception. The first step was to understand what data the camera offers. It produces data about each frame analyzed: a timestamp (in 13-digit <a href="https://en.wikipedia.org/wiki/Unix_time">Unix time</a>), and the predicted probability that something it identifies is a face. To gather this data, I used the AWS IoT service to manually subscribe to a secure MQTT topic where the DeepLens published its predictions. Each frame processed produces data like this:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-json" data-lang="json">{
<span style="color:#b06;font-weight:bold">"format"</span>: <span style="color:#d20;background-color:#fff0f0">"json"</span>,
<span style="color:#b06;font-weight:bold">"payload"</span>: {
<span style="color:#b06;font-weight:bold">"face"</span>: <span style="color:#00d;font-weight:bold">0.5654296875</span>
},
<span style="color:#b06;font-weight:bold">"qos"</span>: <span style="color:#00d;font-weight:bold">0</span>,
<span style="color:#b06;font-weight:bold">"timestamp"</span>: <span style="color:#00d;font-weight:bold">1554853281975</span>,
<span style="color:#b06;font-weight:bold">"topic"</span>: <span style="color:#d20;background-color:#fff0f0">"$aws/things/deeplens_bnU5sr2sSD2ecW5YkfJZtw/infer"</span>
}
</code></pre></div><p>The data generated by a single frame (with one face) when processed by the DeepLens.</p>
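<p>A message like the one above is straightforward to post-process. A minimal sketch, using only the field names from the sample payload, might extract the timestamp and face probabilities like this:</p>

```python
import json
from datetime import datetime, timezone

message = '''{
  "format": "json",
  "payload": {"face": 0.5654296875},
  "qos": 0,
  "timestamp": 1554853281975,
  "topic": "$aws/things/deeplens_bnU5sr2sSD2ecW5YkfJZtw/infer"
}'''

data = json.loads(message)
# The 13-digit timestamp is Unix time in milliseconds.
when = datetime.fromtimestamp(data["timestamp"] / 1000, tz=timezone.utc)
# Each entry in the payload is one detected face with its probability.
faces = list(data["payload"].values())
print(when.date(), len(faces), max(faces))  # → 2019-04-09 1 0.5654296875
```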
<p>For my purposes, I was only interested in the timestamps and payloads (which contain the number of faces identified, and their probabilities). I decided to test the facial recognition model under several different conditions:</p>
<ol>
<li>No faces present</li>
<li>One face present</li>
<li>Multiple faces present</li>
</ol>
<p>For condition 1 I just aimed the camera at an empty room for 20 minutes, and for condition 2 I sat in front of it for 20 minutes. For condition 3, I aimed the camera at a public space for 20 minutes, and while it was running I kept an ongoing count of the number of people looking in the general direction of the camera (I put the camera in front of a wall with a TV on it so people would be more likely to look towards it). Averaging my count over the duration of the sample gave an average engagement of 2.5 people looking at the camera at any given time. To minimize bias, I made my human-eye assessment before looking at any of the data.</p>
<p>I’ll spoil one aspect of the results right away: there were no false positives under any condition. Even the lower-probability guesses corresponded to actual faces, though this result might not hold in a room with lots of face-like art; that’s not a common scenario. This simplified things, since it meant there was no need to set a lower bound on the probabilities which we should count—any face detected by the camera is a face. This also highlights one of my remaining questions about the model: is there useful information to be gained from the probabilities?</p>
<p>Another important note: I noticed early in the experiment that it almost never detects a face farther than 15 feet away. For the use case of a Liquid Galaxy, the 15-foot range is too short to capture all types of engagement (some people look at it across the room), but from my experience with the system I think that users within this range could be accurately described as focused users—something worth measuring, but certainly not everything worth measuring. After noticing this, I retested condition 2 with my face about 5 feet from the DeepLens, after initially trying it from across a room.</p>
<h3 id="how-did-the-deeplens-counts-compare-to-my-counts">How did the DeepLens counts compare to my counts?</h3>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/results.png" alt="results"></p>
<p>The model matched my performance in conditions 1 and 2, which makes a strong statement about its reliability in relatively static and close-up conditions such as looking at an empty room, or looking at someone stare at their laptop across a small table. In contrast, it did not count as many faces as I did in condition 3—so I’m happy to report I can still outperform A.I. on something.</p>
<p>Anyway, this suggests that the model is somewhat conservative, at least compared to my count (likely partly due to my eyes having a range larger than 15 feet). Therefore, when considering usage statistics gathered by a similar method, it might make most sense to think of the results as a lower bound, e.g. “the average number of people focused on the system was more than 2.1”.</p>
<p>It would be useful to experiment with the multiple faces condition again, to see how robust these findings are. It would also be helpful to keep track of factors like how much people move, the lighting, and the orientation of the camera, to see if they might impact the results. It would also be useful to automate the data collection and analysis.</p>
<p>This investigation has shown me that the DeepLens has a lot of potential as a tool for measuring engagement. Perhaps a future post will examine how it can be used to count users.</p>
<hr>
<p>Thanks for reading! You are welcome to learn more about <a href="https://www.visionport.com/">End Point Liquid Galaxy</a> and <a href="https://aws.amazon.com/deeplens/">AWS DeepLens</a>.</p>
The flow of hierarchical data extractionhttps://www.endpointdev.com/blog/2019/03/the-flow-of-hierarchical-data-extraction/2019-03-13T00:00:00+00:00Árpád Lajos
<p><img src="/blog/2019/03/the-flow-of-hierarchical-data-extraction/art-ball-ball-shaped-235615-smaller.jpg" alt="forest view through glass ball on wood stump"></p>
<h3 id="1-problem-statement">1. Problem statement</h3>
<p>There are many cases when people intend to collect data, for various purposes. One may want to compare prices or find out how musical fashion changes over time. There are a zillion potential uses of collected data.</p>
<p>The old-fashioned way to do this task is to hire a few dozen people and explain to them where they should go on the web, what they should collect, how they should write a report, and how they should send it.</p>
<p>It is more effective to teach them all at once than separately, but even then there will be misunderstandings and costly mistakes, not to mention the limits on how much data a human can process. As a result, the industry strives to make this as automatic as possible.</p>
<p>This is why people write software to cope with this issue. The terms data-extractor, data-miner, data-crawler, and data-spider all refer to software which extracts data from a source and stores it at a target. If data is mined from the web, then the more specific terms web-extractor, web-miner, web-crawler, and web-spider can be used.</p>
<p><img src="/blog/2019/03/the-flow-of-hierarchical-data-extraction/SemanticDataExtractorFigure1.png" alt="" title="Extraction, to put it simply"></p>
<p>In this article I will use the term “data-miner”.</p>
<p>This article deals with the extraction of hierarchical data in a semantic manner and the way we can parse the data obtained this way.</p>
<h4 id="11-hierarchical-structure">1.1. Hierarchical structure</h4>
<p>A hierarchical structure involves a hierarchy, that is, a graph with nodes and edges but without a cycle. Specialists call this structure a <strong>forest</strong>. A forest consists of <strong>trees</strong>; in our case we have a forest of rooted trees. A rooted tree has a root node, and every other node is its descendant (child of child of child …), or, to put it inversely, the root is the ancestor of all other nodes in a rooted tree.</p>
<p>If we add a node to a forest and we make sure that all the trees’ roots in the forest are children of the new node, the new root, then we transformed our hierarchical structure, our forest into a tree.</p>
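<p>The forest-to-tree transformation described above can be sketched in a few lines; the dict-based node representation here is just an assumption for illustration:</p>

```python
# Each node is a dict with a name and a list of children.
def node(name, children=None):
    return {"name": name, "children": children or []}

# A forest: two independent rooted trees.
forest = [node("html", [node("body")]), node("xml")]

# Adding a new root whose children are all the trees' roots
# turns the forest into a single tree.
root = node("virtual-root", forest)

def count_nodes(n):
    return 1 + sum(count_nodes(c) for c in n["children"])

print(count_nodes(root))  # → 4
```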
<p>Common hierarchical structures a data-miner might have to work with are:</p>
<ul>
<li>HTML</li>
<li>XML</li>
<li>Trees represented in JSON</li>
<li>File structure</li>
</ul>
<p>Other, non-hierarchical structures could be texts, pictures, audio files, video, etc. In this article we focus on hierarchical structures.</p>
<h4 id="12-purpose-of-the-data-miner">1.2. Purpose of the data-miner</h4>
<p>Of course the purpose of the data-miner is to mine data; however, more specifically, we can speak about general-purpose data-miners, thematic data-miners and narrow-spaced data-miners.</p>
<p>General-purpose data-miners are the data-miners which extract and prepare data for search engines. The technique one can imagine is to find the text in the source (for example a web page) and map it to keywords, so that when people search for those keywords, the source is shown as a result. Of course, it is highly probable that these data-miners are much smarter and more complex than described here, but that is out of scope from our perspective.</p>
<p>If we speak about general-purpose data-mining, then even when mining from a hierarchical data-source it is very difficult for humans to define all the semantics involved, so if one wants to create a general-purpose data-miner, machine learning will play a large part in defining all the concepts. However, a general-purpose data-miner is a generalized version of a thematic data-miner: if we add up a lot of thematic data-miners, we get a general-purpose data-miner, and inversely, if we divide up the different areas a general-purpose data-miner deals with, we get thematic data-miners. This is true in terms of what they accomplish, but the implementation of a general-purpose data-miner will differ greatly from that of a thematic data-miner. As a result, when we discuss thematic data-miners, we can keep in mind that general-purpose data-miners are a broader version of the same thing, at least in terms of what they achieve.</p>
<p>Thematic data-miners deal with a given cluster of concepts, that is, concepts which are logically related to each other. For example, if we are to extract real-estate details, then we have concepts of “type”, “bedroom number”, “picture”, and so on. All these concepts are interrelated; the cluster is defined by the real-estate entity we want to extract. If we speak of the “type”, we really mean “the type of the real estate” entity.</p>
<p>Narrow-spaced data-miners are data-miners which extract data from very specific data-sources, like a single website. These data-miners are always particular cases of thematic data-miners; however, they often contain a lot of hard-coded logic, so when the space of a narrow-spaced thematic data-miner is broadened, a lot of code refactoring is usually involved.</p>
<p>This article focuses on thematic data-miners, which could have thousands of different sources.</p>
<h4 id="13-the-human-element">1.3. The human element</h4>
<p>We need to pay attention to legality and to the ethical aspect. If data mining from the source is illegal, then we must avoid doing it. If it is against the will of those who own it, then it is unethical and we should avoid it. Sources should be at least neutral to our extraction, but, preferably happy about us extracting their data.</p>
<p>Why would the owner of a data source be against extracting their data? There can be many possible causes. For example, the owner might want to attract many human visitors to a website, who would visit to see their data and would be unhappy if another website were created and people would visit the new website instead of theirs.</p>
<p>But why would the owner of a data source be happy about extracting their data? It depends on the purpose of their data. If the purpose is to spread the information as far as possible, then data extractors are considered to be “helping hands”.</p>
<p>For example, if an estate agent in a small village wants to target a global audience, he/she might create a website and hope that people will see his/her data, but without advertisement and a lot of care to raise the popularity of the site, including design, SEO measures, and so on, people searching for real estate might never see the site. A large website which is searchable by region and shows extracted data could boost the estate agent’s business, especially if the agent’s data is publicized without him/her having to pay a penny. People will search for real estate in the region where the estate agent works and will find the big site showing the mined data, emphasizing the estate agent’s contact information.</p>
<p>So, to cope with all possible needs of data-source owners, the planner of a data-miner project might want to support listing results with or without details page. Imagine the case when one wants visitors to his/her website and would not like another site to be visited instead. In this case, one can tell him/her that, if he/she allows the extraction of data from his/her site, then they will appear in the list of results, but will not have a details page, instead, when the user clicks on such a result, the website of the owner of the data-source would be opened in a new tab. In this case we are giving them an attractive offer, so that our site will essentially help getting visitors to his/her site. And of course, for those who just want to share the information with as many people as possible, one could show a details page for his/her data. This is a simplified strategy, but illustrates the idea of how people can be convinced to allow us to extract their data.</p>
<h4 id="14-the-actual-problem-statement">1.4. The actual problem statement</h4>
<p>In this article we intend to have solid ideas of how hierarchical data can be extracted by a thematic data-extractor from data-sources where the owner is content with their data being extracted.</p>
<h3 id="2-the-extraction">2. The extraction</h3>
<p>Now we have hierarchies to work with, possibly many. Nodes can have the parent-child relation, but they can also have the ancestor-descendant relation. A is the ancestor of D if A is the parent of D, or if there is a sequence of nodes S<sub>1</sub>, …, S<sub>n</sub> where A is the parent of S<sub>1</sub>, S<sub>i</sub> is the parent of S<sub>i+1</sub> for each neighboring pair of the sequence, and S<sub>n</sub> is the parent of D. Consider this HTML code chunk:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-html" data-lang="html"><<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"dimensions"</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"large-width"</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"area"</span>>Area<<span style="color:#b06;font-weight:bold">span</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"dimension-data"</span>>500</<span style="color:#b06;font-weight:bold">span</span>> <<span style="color:#b06;font-weight:bold">span</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"unit"</span>>sqm</<span style="color:#b06;font-weight:bold">span</span>></<span style="color:#b06;font-weight:bold">div</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"height"</span>>Height<<span style="color:#b06;font-weight:bold">span</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"dimension-data"</span>>60</<span style="color:#b06;font-weight:bold">span</span>> <<span style="color:#b06;font-weight:bold">span</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"unit"</span>>m</<span style="color:#b06;font-weight:bold">span</span>></<span style="color:#b06;font-weight:bold">div</span>>
</<span style="color:#b06;font-weight:bold">div</span>>
</<span style="color:#b06;font-weight:bold">div</span>>
</code></pre></div><p>We see that the div having the class of dimensions is the parent of the div with the large-width class and the ancestor of the spans having the area and height classes, respectively. In the case of hierarchical data, a descendant is inside the ancestor; knowing this gives us a lot of context in many cases.</p>
<h4 id="21-semantic-concepts">2.1. Semantic concepts</h4>
<p>In hierarchical structures we have some nodes, a (usually small) subset of which contains data that interests us. For instance, in the example shown in section 2 we have a div which exists only to style the shown data: the div with the class of large-width. The useful information is in its descendants with the classes of area and height, whose values carry the class of dimension-data. The large-width div by itself holds no useful information outside those descendant nodes; frankly, it is just a styling node, which makes it irrelevant from our point of view. This means that the large-width div should not exist for us conceptually; we just need to know that we have a concept of dimensions, inside which we should be able to find the area and height. In terms of JavaScript selectors, we know we can find dimensions inside the document, and area and height inside it, like this:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-js" data-lang="js"><span style="color:#080;font-weight:bold">let</span> dimensions = [];
<span style="color:#080;font-weight:bold">let</span> dimensionContainers = <span style="color:#038">document</span>.querySelectorAll(<span style="color:#d20;background-color:#fff0f0">".dimensions"</span>);
<span style="color:#080;font-weight:bold">for</span> (<span style="color:#080;font-weight:bold">let</span> dimensionContainer <span style="color:#080;font-weight:bold">of</span> dimensionContainers) {
<span style="color:#080;font-weight:bold">const</span> dimension = {};
<span style="color:#080;font-weight:bold">let</span> areaContainer = dimensionContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".area"</span>);
<span style="color:#080;font-weight:bold">if</span> (areaContainer) {
<span style="color:#080;font-weight:bold">let</span> value = areaContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".dimension-data"</span>);
<span style="color:#080;font-weight:bold">let</span> unit = areaContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".unit"</span>);
<span style="color:#080;font-weight:bold">if</span> (value && unit) {
dimension.area = {value: value.innerText, unit: unit.innerText};
}
}
<span style="color:#080;font-weight:bold">let</span> heightContainer = dimensionContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".height"</span>);
<span style="color:#080;font-weight:bold">if</span> (heightContainer) {
<span style="color:#080;font-weight:bold">let</span> value = heightContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".dimension-data"</span>);
<span style="color:#080;font-weight:bold">let</span> unit = heightContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".unit"</span>);
<span style="color:#080;font-weight:bold">if</span> (value && unit) {
dimension.height = {value: value.innerText, unit: unit.innerText};
}
}
dimensions.push(dimension);
}
</code></pre></div><p>We immediately notice some details in the code:</p>
<ul>
<li>It does not deal with large-width at all, because it’s irrelevant; instead, it focuses only on nodes which are semantically relevant.</li>
<li>When a concept is found, its direct sub-concept is searched inside its structural context instead of the whole context, so we are particularizing the search.</li>
<li>Our code smartly searches for plural results in the case of dimensions and is aware that area and height are singular in their parent concept, the dimension.</li>
<li>The code does not assume the existence of any data.</li>
</ul>
<p>There are also problems to cope with:</p>
<ul>
<li>What happens if the structure changes over time? How can we maintain the code we have above?</li>
<li>The code above is based on empirical evidence, on the findings of a developer; this only proves that the hierarchy the code deals with exists, not that it is the only structure to work with.</li>
<li>How can we cope with composite data, like extracting something like “height: 60m”?</li>
<li>How can we cope with the variance of data, such as synonyms?</li>
<li>How can we cope with paging where applicable?</li>
</ul>
<p>All these are serious questions, which show that such concrete code will not solve the problems we face. We will need abstraction to progress beyond the limits of particularity and reach the level of a thematic data-miner. We have seen that in our examples we had a hierarchy of semantic concepts, and we can also observe the rule that an ancestor concept is an ancestor structurally as well. The algorithm above likewise follows a pattern of searching for the child concept in the context of the parent concept.</p>
<p>In reality we also have the problem of conceptual inconsistency, that is, the structure of the same data-source can vary, and in some cases items are at a totally different location in the structure. To give a practical example, let’s return to real-estate properties and their dimensions. It is possible that the height of a property makes it very attractive to buyers, so the developers of the data-source decide to show the height separately, in an emphasized manner at the top of the page, and not at its usual place. If we did not cope with this important detail, we would miss the height for exactly those properties where the height is the most important detail; we would miss the stars of the show.</p>
<h4 id="22-semantic-rules">2.2. Semantic rules</h4>
<p>In 2.1 we have seen how we can have an understanding about the conceptual essence. Naturally, the concepts are useful by themselves, but we need to build up a powerful structure, a signature of the concepts, which, if shown on a diagram would describe the essence of the structure, the conceptual essence in such a powerful way that one could understand it at a glance (unless there are too many nodes for a human, of course).</p>
<p>In order to gain the ability to build up such a powerful structure we need to find patterns, but in a more systematic way than in the naive example code above. Concepts can be described by rules; I call these semantic rules. Which attributes describe a concept?</p>
<ul>
<li><strong>parent concept(s):</strong> Which concepts can be the parents of the current concept? A special case is a root parent. Allowing multiple parents avoids duplication when the very same concept, in the very same substructure, can be present at various places.</li>
<li><strong>descendant(s):</strong> Useful to make the rule more specific and possibly filter out unneeded cases and thus increase performance.</li>
<li><strong>relative path:</strong> How can the given concept be found starting from the parent concept as root?</li>
<li><strong>plurality:</strong> Should we stop at the first match, or continue searching?</li>
<li><strong>excludes:</strong> Which concepts are excluded if this concept is matched? (this is a symmetrical relation)</li>
<li><strong>implies:</strong> What concepts imply this concept? (consequently, if this concept is not matched, the concepts which imply it can be excluded as well)</li>
<li><strong>value:</strong> Where the actual value of the concept can be found.</li>
</ul>
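<p>The attribute list above maps naturally onto a small record type. Here is a sketch of what such a rule might look like as a data structure (all names are illustrative, not part of any real library):</p>

```python
from dataclasses import dataclass, field

@dataclass
class SemanticRule:
    name: str
    parents: list          # concept names that may contain this one
    relative_path: str     # e.g. a CSS selector relative to the parent concept
    plural: bool = False   # stop at the first match, or collect all matches?
    descendants: list = field(default_factory=list)  # required sub-concepts
    excludes: list = field(default_factory=list)     # mutually exclusive concepts
    implied_by: list = field(default_factory=list)   # concepts implying this one
    value_path: str = ""   # where the actual value can be found

# A rule for the "area" concept from the earlier HTML example.
area = SemanticRule(
    name="area",
    parents=["dimensions"],
    relative_path=".area",
    value_path=".dimension-data",
)
print(area.plural)  # → False
```

<p>A tree of such rules is data, not code, so changing a relative path when a source’s markup changes does not require refactoring.</p>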
<p>If the structure of the data source changes over time, then occasionally the semantic rules will have to change as well. However, arguably in the majority of cases the relative path of the concepts will be the only thing to be changed in this case, which is much easier to maintain than to refactor code. If the relative path descriptor does not change and the virtual structure of our semantic rules remains similar, then our data-miner might just work well even if the data-source is changed.</p>
<p>For example, in the case of web-crawlers, writing the initial code-base is just a fraction of the long-term cost. The real financial burden is maintaining the crawlability of many thousands of different sites, all changing from time to time. With semantic rules, even if they are defined manually, maintainability is greatly simplified. And with many data-sources, a module which proposes new semantic rules would not hurt either.</p>
<p>Imagine the case when we already have a well-defined set of semantic rules, and a cron job detects that some data from a data source has not been found in the last few hours. It would search for the missing data in the data-source and, once found, generate the semantic rules for it and propose a new semantic tree of rules, which would run in parallel with the main semantic tree and store its data in some experimental place. When the human developers return and check the reports, they would see that the data-miner thinks the semantic tree needs to change and even has a proposal. If the proposal is right, nothing was missed: the extracted data is among the experimental data, and once the new semantic tree is accepted, the experimental data would override the actual data.</p>
<p>Of course, the engine will not be 100% accurate in such a case, its suggestion might be mistaken, or, even if accurate, slight adjustments might be helpful. A relative path might be less accurate than needed, or the concept in the changed structure might have a different plurality, but even if there is room for improvement, such an automatic feature, generating a new semantic tree would be extremely helpful and would make sure that data is accurately extracted even in the case of large changes.</p>
<p>If the owner of the data-source is cooperating, then he/she could provide the changed semantic tree.</p>
<p>The parser is the part of the project which handles composite data, but at the level of semantic rules the parser can be helped by a strategy for defining rules. One could define a syntax for the parser, so it would know which concepts are not atomic, along with some clues about how they should be parsed.</p>
<p>Synonyms could be dealt with via keyword clusters.</p>
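<p>One simple way to realize keyword clusters is to map every known variant to a canonical concept name; the clusters below are hypothetical examples:</p>

```python
# Hypothetical keyword clusters: each canonical concept maps to its synonyms.
clusters = {
    "area": {"area", "surface", "floor space"},
    "height": {"height", "elevation"},
}

# Invert into a lookup table from any variant to the canonical concept.
canonical = {word: concept for concept, words in clusters.items() for word in words}

print(canonical["floor space"])  # → area
print(canonical["elevation"])   # → height
```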
<p>Navigation can be handled by a module created for this purpose; we can call it the navigation module. The algorithm below describes how data-mining should occur:</p>
<pre tabindex="0"><code>initialNavigation()
while (page ← page.next())
for each element in elements
for each node in semanticTree // breadth-first or depth-first traversing
if (applicable(node)) then extract(node)
end for
end for
end while
finalNavigation()
</code></pre><p>Before we move on to deal with semantic trees, we need to solve the problem of conceptual inconsistencies. We have seen that concepts can have multiple conceptual parents, but this is not inconsistent, nor does it violate the criteria of a tree. In reality, for each element a concept will have a single parent, but it is perfectly possible that a given concept will have one parent for one element and a different parent for another element. This is what we described with the possibility of a concept having one of many possible parents per element. However, even though we have a very good explanation for a concept having more parents, we still have to deal with the possibility of a concept being present twice for the same element. How is this possible?</p>
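<p>With the navigation and the semantic tree stubbed out, the pseudocode above can be realized as the following sketch (all names, and the flat rule list standing in for the tree, are illustrative simplifications):</p>

```python
# A runnable sketch of the mining loop, with pages and rules stubbed out.
def mine(pages, semantic_tree, applicable, extract):
    results = []
    for page in pages:                  # while (page ← page.next())
        for element in page:            # for each element in elements
            for rule in semantic_tree:  # traverse the semantic tree
                if applicable(rule, element):
                    results.append(extract(rule, element))
    return results

# Two fake "pages", each holding one element (a dict of found fields).
pages = [[{"area": "500 sqm"}], [{"height": "60 m"}]]
rules = ["area", "height"]
found = mine(
    pages, rules,
    applicable=lambda rule, el: rule in el,
    extract=lambda rule, el: (rule, el[rule]),
)
print(found)  # → [('area', '500 sqm'), ('height', '60 m')]
```

<p>In a real implementation the page iterator would be the navigation module, and the traversal would walk the actual rule tree breadth-first or depth-first.</p>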
<p>Let’s consider the example of books displayed on a website. A book may have main author(s) and secondary author(s). It would not be surprising at all if the main authors are displayed differently from the secondary authors. Also, a main picture of the book’s cover might be displayed, with maybe some smaller images elsewhere, shown in a gallery. This would be a nice feature on a site which shows children’s books. However, from our perspective this means that the same concept is placed at several places at the same time. One may think of quantum mechanics, where time-place discrepancies are also possible.</p>
<p>How can we solve this problem for ourselves? We need to have an understanding, otherwise the whole thought process will go astray. My solution is to differentiate the concepts in the different stages of extraction and parsing. This would mean that the concept of “author” or “picture” is conceptually unique when we store it, even though these might be plural. More elements do not necessarily mean more concepts. On the other hand, at the time of extraction, the main picture is a different concept than the other pictures, also, the main authors are a different concept from the secondary authors. The moment of merging happens at the point when we store these and therefore have to convert the extraction concepts into the concepts we store.</p>
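<p>The merge at storage time can be sketched like this (the mapping from extraction concepts to stored concepts is a hypothetical example following the author/picture case above):</p>

```python
# Hypothetical mapping: distinct extraction-time concepts collapse into a
# single stored concept at the moment of storing.
EXTRACTION_TO_STORAGE = {
    "main-author": "author",
    "secondary-author": "author",
    "main-picture": "picture",
    "gallery-picture": "picture",
}

def merge_for_storage(extracted):
    """Convert {extraction_concept: [values]} into {storage_concept: [values]}."""
    stored = {}
    for concept, values in extracted.items():
        target = EXTRACTION_TO_STORAGE.get(concept, concept)
        stored.setdefault(target, []).extend(values)
    return stored
```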
<h4 id="23-semantic-tree">2.3. Semantic tree</h4>
<p>The semantic rules we define have parent-child and ancestor-descendant relations. The semantic tree is the blueprint of the conceptual essence: a plan for extracting data. Its attributes also instruct the extractor about which concept should be found where, how the extractor should operate to achieve good performance, and so on.</p>
<p>However, the entities to extract can vary greatly, and there might be cases when seemingly the same concept is distributed among various places. The word “seemingly” here means that even though in the end these concepts will be merged, during the actual data-mining we view them as similar but distinct concepts. The fact that a conceptual node might have several parents in the semantic rules only means that one of the parents from the list is to be expected; any of the listed parents is possible. Consider a semantic rule which says that the parent of currency can be the description or the price:</p>
<pre tabindex="0"><code>parent: description,price
</code></pre><p>This means that the parent can be any of the elements of the list; that is, for some elements the parent will be the description, while for other elements it will be the price, so by design we do not violate our aim of having a semantic tree.</p>
<p>We have to watch out for cycles though. Without additional tools there is no protection against cycles, which we might accidentally introduce while defining the semantic rules, or which our semantic rule generator might add while doing its magic. However, it makes sense to check whether there is a cycle in the semantic tree. If this is done automatically, all the better.</p>
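<p>Since a concept may list several possible parents, the cycle check has to follow every listed parent. A minimal sketch of such a check (the rule representation as a dictionary of parent lists is an assumption made for illustration):</p>

```python
def has_cycle(rules):
    """rules: {concept: [possible parents]}. True if any cycle exists when
    every listed parent is treated as a directed edge."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {c: WHITE for c in rules}

    def visit(node):
        color[node] = GRAY
        for parent in rules.get(node, []):
            if color.get(parent, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(parent, WHITE) == WHITE and parent in rules and visit(parent):
                return True
        color[node] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in rules)
```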
<p>However, since the description of a semantic tree can define multiple disjunct parents in order to cope with the actual tree of concepts for all elements (at least those whose pattern is known), the semantic tree is in reality a semantic tree pattern. When we use it, or when we search for cycles, we will need to traverse all the possible parents wherever more than one is listed.</p>
<p><img src="/blog/2019/03/the-flow-of-hierarchical-data-extraction/SemanticDataExtractorFigure2.png" alt="" title="Semantic tree example"></p>
<p>Consider the example semantic tree above.</p>
<p>We can see that we have a single main concept, “REAL-ESTATE”. This is not a general pattern: there might be several main concepts to extract from the same data-source, but for the sake of simplicity we have a single one. Implicitly, the technical implementation involves a single abstract root, which is the parent of all the main concepts. We see that “TYPE” is plural for “REAL-ESTATE”, but “BEDROOM #” is singular. The conceptual reason is that we defined type in such a way that not all pairs of types are mutually exclusive, so if a type is found, for example “Bungalow”, the engine should not stop searching for other types. But if the number of bedrooms is found, there is no point in searching further for it, because it is safe to assume that even if there are several different occurrences of this concept, they will be equivalent.</p>
<p>Why is “CURRENCY” special? It has two potential parents: “DESCRIPTION” or “PRICE”. In some cases it can be found inside “DESCRIPTION”, in other cases inside “PRICE”. For example, if there are several values for “PRICE”, then “CURRENCY” is inside “DESCRIPTION”, but otherwise “CURRENCY” is inside “PRICE”.</p>
<p>But why could a real-estate property have several different prices? Well, this is outside the scope of data-mining, but to get an understanding it is good to consider a viable example. Let’s consider an estate agent who gets a 20% bonus if he/she successfully sells a given property within a month. In this case, the agent might want to draw buyers by putting a discount on the property for a month; if he/she succeeds in selling it, there will still be a nice bonus. Given this economic mechanism, the data-source might show the prices in a special way when there is such a discount, like:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-html" data-lang="html"><<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"description"</span>>
<<span style="color:#b06;font-weight:bold">table</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"prices"</span>>
<<span style="color:#b06;font-weight:bold">tr</span>>
<<span style="color:#b06;font-weight:bold">td</span>>
<<span style="color:#b06;font-weight:bold">p</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"price"</span>><<span style="color:#b06;font-weight:bold">span</span>>100000</<span style="color:#b06;font-weight:bold">span</span>></<span style="color:#b06;font-weight:bold">p</span>>
<<span style="color:#b06;font-weight:bold">p</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"price red"</span>><<span style="color:#b06;font-weight:bold">span</span>>90000</<span style="color:#b06;font-weight:bold">span</span>> 10% discount!</<span style="color:#b06;font-weight:bold">p</span>>
</<span style="color:#b06;font-weight:bold">td</span>>
<<span style="color:#b06;font-weight:bold">td</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"currency"</span>>$</<span style="color:#b06;font-weight:bold">div</span>>
</<span style="color:#b06;font-weight:bold">td</span>>
</<span style="color:#b06;font-weight:bold">tr</span>>
</<span style="color:#b06;font-weight:bold">table</span>>
<span style="color:#888"><!-- … --></span>
</<span style="color:#b06;font-weight:bold">div</span>>
</code></pre></div><p>while, if there is a single price, a different structure is generated:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-html" data-lang="html"><<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"description"</span>>
<<span style="color:#b06;font-weight:bold">p</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"price"</span>>
<<span style="color:#b06;font-weight:bold">span</span>>100000</<span style="color:#b06;font-weight:bold">span</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"currency"</span>>$</<span style="color:#b06;font-weight:bold">div</span>>
</<span style="color:#b06;font-weight:bold">p</span>>
</<span style="color:#b06;font-weight:bold">div</span>>
</code></pre></div><p>We notice again that the possibility of multiple parents does not mean that any extracted element will have multiple parents; it just describes that among the many elements, some items will have “PRICE” as parent and others will have “DESCRIPTION” as parent.</p>
<p>If we take a look at “DIMENSIONS”, we will see several concepts with the same name (“VALUE” and “UNIT”), but they have a different meaning in their specific context (“AREA” and “HEIGHT”, respectively).</p>
<p>Another interesting region is “CONTACT”, which has “PERSON” and “COMPANY” elements as conceptual children, and both “PERSON” and “COMPANY” are plural. The underlying logic is that several companies or persons can be contacted when one wants to buy or view a real-estate property. We have the sub-concepts “FACEBOOK”, “EMAIL”, “PHONE”, “NAME”, “OTHER” and “WEBSITE” for both “PERSON” and “COMPANY”, but similarly to the example we saw with “CURRENCY”, here the concepts have different meanings in their different contexts. A corporate website is a different thing from a personal website.</p>
<p>However, if we draw or generate such a semantic tree, it is better than long documentation. It describes for coders the exact way the engine will operate, and when the engine suggests a new semantic tree for some reason, then, provided that the engine generates a picture of the tree, one will immediately understand the essence of the engine’s suggestion. Also, with such a nice diagram, managers will understand the mechanism of the system at a glance.</p>
<h4 id="24-parallelization">2.4. Parallelization</h4>
<p>It is feasible to send several requests in parallel, which could happen using promises and the event queue in the case of JavaScript, or multiple threads in a multi-threaded environment.</p>
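<p>As a sketch of the multi-threaded variant in Python, a thread pool can issue several page requests at once; <code>fetch_page</code> here is a placeholder standing in for a real HTTP call, so the example stays self-contained:</p>

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    # Placeholder for a real HTTP request (e.g. via urllib or requests).
    return f"<html>content of {url}</html>"

def fetch_all(urls, max_workers=4):
    """Fetch all pages concurrently, preserving input order in the result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, urls))
```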
<h3 id="3-symbiosis">3. Symbiosis</h3>
<p>If the owner of the data-source is happy and supportive about their data being extracted, they might notify the maintainers of the data-miner whenever structural changes occur, or they can provide an API from which data can be extracted, for example a large JSON. However, this JSON will probably be hierarchical as well.</p>
<p>In some extremely lucky cases the owner of the data-source will be happy to provide and maintain the semantic rules. This could happen when “spreading the word” via a data-miner is deeply valued by the owner of the data-source. The key is to have an offer which helps reach the goals of the data-source, so its owner will see the data-miner as an extended hand rather than a barrier to reaching those goals.</p>
<h3 id="4-parsing-the-data">4. Parsing the data</h3>
<p>Once the data has been successfully extracted, the results can be parsed just before being stored. For example, in some cases we might have textually composite data in the same node which is impossible to separate via the semantic tree, since the tree needs leaf nodes of the original structure as atoms. So in many cases a separate layer is needed to decompose composite textual data which holds data applicable to different concepts.</p>
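<p>A hedged sketch of such a decomposition layer: here a regex splits a composite price string (in the style of the earlier discount example) into a numeric price and an optional discount percentage. The exact text format is an assumption for illustration.</p>

```python
import re

def parse_composite_price(text):
    """Split e.g. '90000 10% discount!' into price and discount parts."""
    match = re.match(r"\s*(\d+)\s*(?:(\d+)%\s*discount)?", text)
    if not match:
        return None
    price = int(match.group(1))
    discount = int(match.group(2)) if match.group(2) else None
    return {"price": price, "discount_percent": discount}
```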
<p>Also, if for some technical reasons the semantic tree split a concept into different concepts (for example main authors and secondary authors), then the data-parser can merge the concepts which belong together into a single concept. There are many possible jobs the data-parser might fulfill, depending on the actual needs.</p>
<h3 id="5-analyzing-the-data">5. Analyzing the data</h3>
<p>Let’s assume that we have a very nice schema and we store our data efficiently. Still, we might be interested in what patterns we can find in that data. The patterns we are interested in finding include:</p>
<ul>
<li><a href="http://www.vldb.org/conf/1994/P487.PDF">association rules</a></li>
<li><a href="https://opentextbc.ca/dbdesign01/chapter/chapter-11-functional-dependencies/">functional dependencies</a></li>
<li>conditional functional dependencies (a functional dependency upon the table or cluster records provided a condition is met)</li>
</ul>
<p><strong>AR (Association Rule):</strong> <code>c => {v1, …, vn}</code></p>
<p>If a certain condition (c) is fulfilled, then we have a set of constant values for their respective (database table) columns.</p>
<p>For example, let’s consider the table:</p>
<pre tabindex="0"><code>person(id, is_alive, has_right_to_vote, has_valid_passport)
</code></pre><p>Now, we can observe that:</p>
<pre tabindex="0"><code>(is_alive = 0) => ((has_right_to_vote = 0) ^ (has_valid_passport = 0))
</code></pre><p>So this is an association rule with the condition is_alive = 0 (the person is deceased), and we know for a fact that dead people cannot vote and that their passports are invalid.</p>
<p>When we extract data from a source, there might be some association rules (field values associated with a condition) that we do not know about yet. Finding them out will help us a lot. For instance, imagine the case when, for whatever reason, an insert/update is attempted with the values:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">is_alive = <span style="color:#00d;font-weight:bold">0</span>
has_right_to_vote = <span style="color:#00d;font-weight:bold">1</span>
</code></pre></div><p>In this case we can throw an exception; this way we can find bugs in the code or mistakes in the semantic tree. This kind of inconsistency prevention is useful even outside of data-mining, but in the context of this article it is extremely useful, as it might detect problems in the semantic tree automatically.</p>
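<p>A minimal sketch of such a guard, encoding the is_alive rule from above (the record representation as a plain dictionary is an assumption for illustration):</p>

```python
class AssociationRuleViolation(Exception):
    pass

def ar_violations(record, condition, implied_values):
    """Columns whose values contradict the rule, given the condition holds."""
    if not condition(record):
        return []
    return [col for col, value in implied_values.items() if record.get(col) != value]

def guard_insert(record, condition, implied_values):
    """Raise instead of accepting a record that violates an accepted AR."""
    bad = ar_violations(record, condition, implied_values)
    if bad:
        raise AssociationRuleViolation(f"rule violated on columns: {bad}")

# (is_alive = 0) => ((has_right_to_vote = 0) ^ (has_valid_passport = 0))
DEAD_PERSON_RULE = (
    lambda r: r["is_alive"] == 0,
    {"has_right_to_vote": 0, "has_valid_passport": 0},
)
```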
<p><strong>FD (Functional Dependency):</strong> <code>S → D</code></p>
<p>The column-set S (Source):</p>
<pre tabindex="0"><code>S = {S1, …, Sm}
</code></pre><p>determines the column-set D (Destination):</p>
<pre tabindex="0"><code>D = {D1, …, Dn}
</code></pre><p>The formula more explicitly looks like this:</p>
<pre tabindex="0"><code>{S1, …, Sm} → (D1, …, Dn)
</code></pre><p>This relation means that if we have two different records/entities with the same source values:</p>
<pre tabindex="0"><code>Source1 = Source2 = {s1, …, sm}
</code></pre><p>then their destination will match as well:</p>
<pre tabindex="0"><code>Destination1 = Destination2 = {d1, …, dn}
</code></pre><p>Inversely this is not necessarily true. If two records/entities have the same destination values, then the functional dependency does not require them to have the very same sources.</p>
<p><strong>CFD (Conditional Functional Dependency):</strong> <code>c => S → D</code></p>
<p>A CFD is a generalization of an FD. It adds a condition to the formula, so that the functional dependency’s applicability is only guaranteed if the condition is met. We can describe functional dependencies as particular conditional functional dependencies where the condition is inherently true:</p>
<pre tabindex="0"><code>(true => S → D) <=> S → D
</code></pre><p>Also, an AR can be described as</p>
<pre tabindex="0"><code>c => {} → D
</code></pre><h4 id="51-the-more-useful-mu-relation">5.1. The More Useful (MU) relation</h4>
<p>Let’s consider that we have two patterns, P1 and P2, which could be ARs, FDs or CFDs. Is there a way to determine which of them is more useful? Generally speaking:</p>
<p>P1 MU P2 if and only if P1 is more general than P2.</p>
<p>Since both ARs and FDs are particular cases of CFDs, we will work with the formula of CFDs:</p>
<pre tabindex="0"><code>P1 = (c1 => S1 → D1)
P2 = (c2 => S2 → D2)
P1 MU P2 <=> ((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))
</code></pre><p>Note that MU is reflexive, transitive, antisymmetrical and has a neutral element.</p>
<p>Reflexive: <code>(c1 => c1) ^ (S1 ⊆ S1) ^ (D1 ⊇ D1)</code> is trivially true.</p>
<p>Transitive:</p>
<p>Let’s suppose that</p>
<pre tabindex="0"><code>((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))
((c3 => c2) ^ (S2 ⊆ S3) ^ (D2 ⊇ D3))
</code></pre><p>is true. Is</p>
<pre tabindex="0"><code>((c3 => c1) ^ (S1 ⊆ S3) ^ (D1 ⊇ D3))
</code></pre><p>also true?</p>
<p>Since <code>c3 => c2 => c1</code>, due to the transitivity of the implication relation we know that <code>c3 => c1</code>.</p>
<p>Since <code>S1 ⊆ S2 ⊆ S3</code>, due to the transitivity of the subset relation we know that <code>S1 ⊆ S3</code>.</p>
<p>Since <code>D1 ⊇ D2 ⊇ D3</code>, due to the transitivity of the superset relation we know that <code>D1 ⊇ D3</code>.</p>
<p>The three transitivities together prove that MU is transitive.</p>
<p>Neutral element (the least useful):</p>
<pre tabindex="0"><code>false => All columns → {}
</code></pre><p><code>false => c</code> is always true for any condition c, the set of all columns is a superset of every possible source set, and <code>{}</code> is a subset of every destination set, so every destination is a superset of it.</p>
<p>Antisymmetrical:</p>
<pre tabindex="0"><code>If P1 MU P2 and P2 MU P1, then P1 <=> P2.
P1 MU P2:
((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))
P2 MU P1
((c1 => c2) ^ (S2 ⊆ S1) ^ (D2 ⊇ D1))
P1 MU P2 ^ P2 MU P1 <=>
((c2 => c1) ^ (c1 => c2)) ^
((S1 ⊆ S2) ^ (S2 ⊆ S1)) ^
((D1 ⊇ D2) ^ (D2 ⊇ D1)) <=>
(c1 <=> c2) ^ (S1 = S2) ^ (D1 = D2) <=>
P1 <=> P2
</code></pre><p>so the relation is antisymmetrical:</p>
<pre tabindex="0"><code>((P1 MU P2) ^ (P2 Mu P1)) <=> (P1 <=> P2)
</code></pre><p>This means that MU is a partial order, so the set of patterns forms a poset (partially ordered set), and all the algebra applicable to partially ordered sets in general can be used to analyze MU as well.</p>
<p>The importance of the MU relation is that we can search for such patterns in an ordered manner, starting from the most useful candidate we consider, and composing S and decomposing D into less useful candidates whenever a candidate proves to be false. If we know that a pattern is accurate, we also know that all less useful patterns are accurate as well. We can start our search from a useful pattern candidate, but not necessarily from the most useful, since, intuitively, it is not very probable that all the columns will invariably have the same values across all records; that would defeat the purpose of storing so many values. This means that defining the most useful <em>possible</em> patterns would make sense.</p>
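<p>The MU test can be written down directly from the formula above. In this sketch a pattern is a (condition, source columns, destination columns) triple, and a condition is represented as a frozenset of (column, value) equality constraints, so “c2 implies c1” becomes “c1’s constraints are a subset of c2’s”. This representation is an assumption made for illustration.</p>

```python
def more_useful(p1, p2):
    """P1 MU P2  <=>  (c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2)."""
    c1, s1, d1 = p1
    c2, s2, d2 = p2
    # With conditions as conjunctions of equality constraints,
    # c2 => c1 holds exactly when c1's constraints are a subset of c2's.
    return c1 <= c2 and s1 <= s2 and d1 >= d2
```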
<h4 id="52-mu-lattice">5.2. MU Lattice</h4>
<p>Possible patterns can be represented using a <a href="http://mathworld.wolfram.com/LatticeTheory.html">lattice</a>, where the root would be the most useful node and the leaf the least useful one. We have a join and a meet operation, under which the set of patterns is closed.</p>
<pre tabindex="0"><code>P1 join P2 = (c1 v c2) => (S1 ⋂ S2) → (D1 ⋃ D2)
P1 meet P2 = (c1 ^ c2) => (S1 ⋃ S2) → (D1 ⋂ D2)
</code></pre><p>Of course:</p>
<pre tabindex="0"><code>(P1 join P2) MU (P1 meet P2)
</code></pre><p>Proof:</p>
<pre tabindex="0"><code>(c1 ^ c2) => (c1 v c2) is trivially true and
(S1 ⋂ S2) ⊆ (S1 ⋃ S2) is trivially true and
(D1 ⋃ D2) ⊇ (D1 ⋂ D2) is trivially true.
</code></pre><p>We can split the lattice into many different simple lattices, each having its own condition. Since an AR never has source columns, it cannot be less useful than a non-AR CFD. Also, since an FD has a condition which is implied by any possible other condition, FDs are never less useful than CFDs with real conditions.</p>
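<p>The join and meet formulas translate directly under the same sketch representation (conditions as frozensets of equality constraints, patterns as (condition, sources, destinations) triples). Note one simplification made here: the disjunction of two conjunctive conditions is approximated by keeping only their shared constraints, a weaker condition implied by both. This encoding is an assumption for illustration, not the only possible one.</p>

```python
def join(p1, p2):
    (c1, s1, d1), (c2, s2, d2) = p1, p2
    # (c1 v c2) => (S1 ⋂ S2) → (D1 ⋃ D2); "or" approximated by shared constraints
    return (c1 & c2, s1 & s2, d1 | d2)

def meet(p1, p2):
    (c1, s1, d1), (c2, s2, d2) = p1, p2
    # (c1 ^ c2) => (S1 ⋃ S2) → (D1 ⋂ D2); "and" is the union of constraints
    return (c1 | c2, s1 | s2, d1 & d2)
```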
<h4 id="53-domain-of-search">5.3. Domain of search</h4>
<p>The domain of search can vary. It can be limited to a single table, or to a cluster of tables related to each other in a one-to-one, one-to-many, many-to-one or many-to-many manner. The condition can be handled in a simplistic way, checking only equality on some columns, or it can be complex, considering several columns, even in a cross-table manner, using several possible operators. This depends on the kinds of patterns we intend to find, the cumulative power of resources, the ability of the development team to work out something serious, time, and, yes, money.</p>
<h4 id="54-differentiation-between-patterns-and-reality">5.4. Differentiation between patterns and reality</h4>
<p>We can have a pattern P which was automatically found. We only know that there is no counter-example to the pattern P, or, if we have a level of tolerance, that there were not enough counter-examples for us to discard the pattern. So P appears to be true. But is appearance equivalent to truth in this case? As a matter of fact, nature produces infinitely many examples of seemingly impossible occurrences or highly improbable coincidences.</p>
<p>It is a common fallacy to concentrate on a pattern and, due to the improbability of the result being a mere coincidence, to exclude the possibility that it was just a coincidence. Indeed, this is the so-called <a href="https://www.logicallyfallacious.com/tools/lp/Bo/LogicalFallacies/175/Texas-Sharpshooter-Fallacy">Texas sharpshooter</a> fallacy, even though in our specific case it is committed unconsciously.</p>
<p>If I have a die with 6 sides, each having a number from 1–6, and I toss it 1000 times and the result is always six, then I will have the natural feeling that something must not be right. I might be divinely favored, or the stars are lining up in my favor, but in this case I’m ignoring the fact that there is no connection between the results of the tosses, or, in other words, that my experiments are independent from each other. I could calculate that getting a result of 6 all 1000 times has a probability of 1 / 6^1000, which is quasi-impossible. Yet, it is only quasi-impossible and not actually impossible.</p>
<p>If I toss the same die 1000 times randomly and get a sequence of 1000 numbers, then I could calculate the probability of my random, not special sequence of 1000 elements occurring in the exact same way it occurred. And, surprise, surprise, the result will be exactly 1 / 6^1000. But if the results of the sequence are varied, I do not feel the results to be special. By the same token, getting a result of 6 for 1000 tosses is not special at all either.</p>
<p>There is no mathematical difference between the probability of tossing a die 1000 times and getting only sixes and tossing it 1000 times and getting any other specific sequence of 1000 elements you might choose. The probability of each exact sequence is the same before I start tossing. The difference between the two sequences is the meaning that I, as a person, attribute to one of them.</p>
<p>Also, if I win the lottery, I might think about the probability of my choice of numeric combination being correct and feel that I’m especially lucky, but if I calculate the probability of the actual result when I do not win, I will not find any difference in the probability itself. Yet people tend to calculate the chances of the case when they get lucky, but not the chances when they are not. The low probability of a given event is only special because we are interested in it; we can find infinitely many similarly low-probability events happening all the time, but we are just not interested enough in the majority of such events to analyze them.</p>
<p>But let’s take this example further. Before I toss the die 1000 times I do not know the exact sequence I will get in advance. In fact, it is almost impossible to guess it, but I know that whatever the sequence will be, its a priori probability is practically zero.</p>
<p>This means that whenever we do not attribute a meaning to a pattern, we are inclined not even to consider that it might be the nature of how things are. Yet if we do attribute a meaning to a pattern, we might fallaciously fail to understand that it was a mere coincidence when it happens not to be a natural rule of how things are according to our understanding.</p>
<p>If we find a pattern with a tool, we get enthusiastic and we almost want it to be the nature of things, but we need to be much more rigorous before we factually accept a pattern. Consider the example of primes. How many primes are there? The answer is simple: infinitely many.</p>
<p>Proof (reductio ad absurdum):</p>
<p>Let’s assume that there is a finite number n of primes, and there are no others:</p>
<pre tabindex="0"><code>p1, …, pn
</code></pre><p>Now, let’s consider the number:</p>
<pre tabindex="0"><code>N = (p1 * … * pn) + 1
</code></pre><p>We know that N is not divisible by any of p1, …, pn (dividing by each leaves remainder 1), so there are two cases: N is either a prime or a composite number. If N is a prime, then we have found a new prime. Otherwise, if N is composite, then it is divisible by at least one prime which is not among p1, …, pn. So in either case we find a new prime, and therefore there are infinitely many prime numbers.</p>
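<p>The proof can be replayed concretely: multiply a finite list of primes, add one, and factor the result; the smallest factor is always a prime missing from the list.</p>

```python
def smallest_prime_factor(n):
    """Trial division: the smallest divisor > 1 of n is always prime."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n itself is prime

def new_prime_outside(primes):
    """Given a finite list of primes, produce a prime not in the list."""
    n = 1
    for p in primes:
        n *= p
    return smallest_prime_factor(n + 1)
```

<p>For example, from [2, 3, 5, 7, 11, 13] we get 30031 = 59 × 509, and 59 is a prime outside the list.</p>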
<p>How many primes are even? Exactly one: the number 2. Now, if we have a huge set of primes among which we do not find 2, and we do not know that 2 is a prime, we might be inclined to think that primes can only be odd numbers, which is of course wrong. If we pick a prime randomly from a huge set of primes, the chance that it will be exactly 2 is minuscule. However, if a human has to pick a prime number, a human will know only a few primes, and 2 is the “first”, so among the primes 2 is one that has a high chance of being chosen by a human.</p>
<p>The point of all this contemplation is that if something is very frequent or highly probable, it is not necessarily true. When we take a look at a pattern, it is good to be very critical about it and think about how that pattern could fail, and what the consequences would be if we assumed the pattern to be accurate, yet it left us in trouble at the most inappropriate time.</p>
<h4 id="55-usage-of-factually-validated-patterns">5.5. Usage of factually validated patterns</h4>
<p>Now, if we accept a rule to be factually accurate, then we might want to make sure that it is respected. Assuming that</p>
<pre tabindex="0"><code>c => S → D
</code></pre><p>is accurate, we also assume that if the condition is met and there is already a record having the source values s1, …, sm, then inserting/updating another record with the same source values but different destination values leads to an error. Let’s suppose that we throw an exception when an accepted pattern is about to be violated. If many such exceptions are thrown, then we have a problem. The problem could be that:</p>
<ul>
<li>the semantic tree might have wrong/deprecated rules</li>
<li>the older records might be broken</li>
<li>the pattern might be no longer valid, or, it might have been wrongly accepted in the first place</li>
</ul>
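<p>A minimal sketch of enforcing an accepted FD at insert time, as described above: remember the destination seen for each source tuple and raise when a later record disagrees. The column names are illustrative.</p>

```python
class PatternViolation(Exception):
    pass

class FdGuard:
    """Enforce an accepted functional dependency source_cols → dest_cols."""

    def __init__(self, source_cols, dest_cols):
        self.source_cols = source_cols
        self.dest_cols = dest_cols
        self.seen = {}  # source tuple -> destination tuple

    def check(self, record):
        source = tuple(record[c] for c in self.source_cols)
        dest = tuple(record[c] for c in self.dest_cols)
        if source in self.seen and self.seen[source] != dest:
            raise PatternViolation(
                f"{source!r} already maps to {self.seen[source]!r}, got {dest!r}"
            )
        self.seen[source] = dest
```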
<p>See? By analyzing the data we can add some features of machine learning, so our data-miner will really rule the problem space it is responsible for. Naturally, such patterns can also be used at insertion and update, when we do not get some of the destination values but have a pattern from which we can deduce them. A tableau of at least the most frequent source values and conditions could also come to our help.</p>
<p>Knowledge is power. A failure outside the limits of what we perceive could result in many months of gibberish data. But instead of that pain, we could instantly know when a problem appears, and if we have some helping robotic hands, even if they are only virtual, at the end of the day we will rarely be alerted with urgencies. And such patterns deepen our understanding of the data we work with, even if they cannot be accepted as a general rule. With better understanding we will have better ideas. With better ideas we will have better features and performance. With better features and performance we will have more patterns. And with more patterns, we deepen our understanding further.</p>
<h3 id="6-the-flow">6. The flow</h3>
<p><img src="/blog/2019/03/the-flow-of-hierarchical-data-extraction/SemanticDataExtractorFigure3.png" alt="" title="The flow"></p>
<p>This diagram is an idealized representation of the flow. In reality we might have several different cron jobs, we might be working with threads in an asynchronous manner, and the parser is invoked much more frequently than just at the end of the whole extraction, because we do not have infinite resources. All these nuances would complicate the diagram immensely.</p>
Speech Recognition from scratch using Dilated Convolutions and CTC in TensorFlowhttps://www.endpointdev.com/blog/2019/01/speech-recognition-with-tensorflow/2019-01-08T00:00:00+00:00Kamil Ciemniewski
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/25928285337_50483f3619_o.jpg" alt="Sound visualization" /><br><a href="https://www.flickr.com/photos/williamismael/25928285337/">Image by WILL POWER · CC BY 2.0, cropped</a></p>
<p>In this blog post, I’d like to take you on a journey. We’re going to get a speech recognition project from its architecting phase, through coding and training. In the end, we’ll have a fully working model. You’ll be able to take it and run the model serving app, exposing a nice HTTP API. Yes, you’ll even be able to use it in your projects.</p>
<p>Speech recognition has been among the hardest tasks in Machine Learning. Traditional approaches involve meticulous crafting and extraction of the audio features that separate one phoneme from another. To be able to do that, one needs a deep background in data science and signal processing. The complexity of the training process prompted teams of researchers to look for alternative, more automated approaches.</p>
<p>With the growing development of Deep Learning, the need for handcrafted features declined. The training process for a neural network is much more streamlined. You can feed the signals either in their raw form or as their spectrograms and watch the model improve.</p>
<p>Did this get you excited? Let’s start!</p>
<h3 id="project-plan-of-attack">Project Plan of Attack</h3>
<p>Let’s build a web service that exposes an API. Let it be able to receive audio signals, encoded as an array of floating point numbers. In return, we’re going to get the recognized text.</p>
<p>Here’s a rough plan of the stages we’re going to go through:</p>
<ol>
<li>Get the dataset to train the model on</li>
<li>Architect the model</li>
<li>Implement it along with the unit tests</li>
<li>Train it on the dataset</li>
<li>Measure its accuracy</li>
<li>Serve it as a web service</li>
</ol>
<h4 id="the-dataset">The dataset</h4>
<p>The open-source community has a lot to thank the <a href="https://foundation.mozilla.org/">Mozilla Foundation</a> for. It hosts many projects, with the wonderful, free Firefox browser at the forefront. One of its other projects, called <a href="https://voice.mozilla.org">Common Voice</a>, focuses on gathering large datasets to be used by anyone in speech recognition projects.</p>
<p>The datasets consist of wave files and their text transcriptions. There’s no notion of time-alignment. It’s just the audio and text for each utterance.</p>
<p>If you want to code along, head over to <a href="https://voice.mozilla.org/pl/datasets">the Common Voice datasets download page</a>. Be warned that the download weighs roughly 12GB.</p>
<p>After the download, simply extract the files from the archive into the <code>./data</code> directory at the root of the project. The files, in the end, should reside under the <code>./data/cv_corpus_v1/</code> path.</p>
<p>How much data should we have? It always depends on the challenge at hand. Roughly speaking, the more difficult the task, the more powerful your neural network needs to be, as it has to be capable of expressing more complex patterns in the data. The more powerful the network, though, the easier it is for it to simply memorize the training examples. This is highly undesirable and results in overfitting. To lessen its tendency to do so, you need to either augment your data on the fly randomly or gather more “real” examples. In this project, we’re going to do both. Data augmentation will be covered in the coding section. The additional datasets we’ll use are the well-known <a href="http://www.openslr.org/12/">LibriSpeech</a> (<a href="http://www.openslr.org/resources/12/train-clean-360.tar.gz">the file to download, around 23GB</a>) and <a href="http://voxforge.org">VoxForge</a> (<a href="https://s3.us-east-2.amazonaws.com/common-voice-data-download/voxforge_corpus_v1.0.0.tar.gz">the file to download</a>).</p>
<p>Those two datasets are among the most popular that are freely available. There are others I chose to omit as they weigh quite a lot. I was already almost out of free space after the download and preprocessing of the three sets chosen above.</p>
<p>You need to download both Libri and Vox and extract them under <code>./data/LibriSpeech/</code> and <code>./data/voxforge/</code>.</p>
<h3 id="background-on-audio-processing">Background on audio processing</h3>
<p>In order to build a working model, we need some background in signal processing. Although a lot of the traditional work is going to be done by the neural network automatically, we still need to understand what is going on in order to reason about its various hyperparameters.</p>
<p>Additionally, we’re going to process the audio into a form that’s easier to train on. This is going to lower the memory requirements. It’s also going to lower the time needed for the model’s parameters to <em>converge</em> to ones that work well.</p>
<h4 id="how-is-audio-represented">How is audio represented?</h4>
<p>Let’s have a quick look at what the audio data looks like when we load it from a wave file.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">librosa</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">librosa.display</span>
SAMPLING_RATE=<span style="color:#00d;font-weight:bold">16000</span>
<span style="color:#888"># ...</span>
wave, _ = librosa.load(path_to_file, sr=SAMPLING_RATE)
librosa.display.waveplot(wave, sr=SAMPLING_RATE)
</code></pre></div><p>The above code specifies that we want to load the audio data with a <em>sampling rate</em> of 16 kHz (more about that later). It then loads the file and plots it along the time axis:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/wave-plot.png" alt="" title="Plot of a raw audio signal"></p>
<p>The X-axis obviously represents the time. The Y-axis is often called the <a href="https://en.wikipedia.org/wiki/Amplitude">amplitude</a>. A quick look at the plot above makes it obvious that we have negative values in the signal. How can those values be called amplitudes, then? Amplitude is said to represent the maximum displacement of a physical object as it vibrates — so what does a negative amplitude mean? To make those values a bit clearer, let’s just call them displacements for now. Audio is nothing more than the vibration of the air. If you were to build an electrical recorder, you might come up with one that gives you output in voltages at each point in time. To capture the exact specifics of the vibration, you obviously need a <strong>reference point</strong> — how the signal “rises” above that point and then falls back below it. Imagine that your electrical circuit gives you output within the range of <code>-1V</code> and <code>1V</code>. To load it into your computer and into a plot like the one above, you’d need to capture those values at discrete points in time. The <strong>sampling rate</strong> is nothing more than the number of times per second that the value from your sound meter is measured and stored — to be loaded later. Next time you read that your CD from the ’90s contains audio sampled at a frequency of 44,100 Hz, you’ll know that the raw “air displacement” values were sampled 44,100 times each second.</p>
<p>Let’s do a simple thought experiment to prepare for the next section. What would you hear if all the above values were constant, e.g. 1.0? We saw that the values given by <code>librosa</code> are floating points; in the example file they ranged between -0.6 and 0.6. The value of 1.0 is certainly much higher — would you hear “more” of “something” then? Because the definition of a sound is that <strong>it’s a vibration</strong>, you wouldn’t hear anything! The amplitudes of an audio signal must change periodically — that is how we detect, or hear, sounds. It follows that in order to distinguish between different sounds, those sounds have to “vibrate differently”. The difference that makes sounds different is the <strong>frequency</strong> of the vibration.</p>
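<p>You don’t have to take this on faith. A minimal numpy sketch (using the Fourier Transform introduced in the next section) shows that a constant “signal” has no vibration at all: all of its energy sits in the 0 Hz component.</p>

```python
import numpy as np

# One second of the constant "signal" 1.0, sampled at 16 kHz.
constant = np.ones(16000)

# Magnitudes of the frequency components (rfft: FFT for real-valued input).
spectrum = np.abs(np.fft.rfft(constant))

# All of the energy sits in the 0 Hz (DC) component; every other
# frequency bin is zero (up to floating-point noise).
print(spectrum[0])  # 16000.0
print(spectrum[1:].max())
```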
<h4 id="decomposing-the-signal-with-the-fourier-transform">Decomposing the signal with the Fourier Transform</h4>
<p>Let’s create a signal-generating machine that will output a sinusoid of a given frequency and amplitude:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">numpy</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">np</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">gen_sin</span>(freq, amplitude, sr=<span style="color:#00d;font-weight:bold">1000</span>):
<span style="color:#080;font-weight:bold">return</span> np.sin(
(freq * <span style="color:#00d;font-weight:bold">2</span> * np.pi * np.linspace(<span style="color:#00d;font-weight:bold">0</span>, sr, sr)) / sr
) * amplitude
</code></pre></div><p>Here’s what a 1000-point signal looks like for a frequency of 30 and an amplitude of 1:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">seaborn</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">sns</span>
sns.lineplot(data=gen_sin(<span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">1</span>))
</code></pre></div><p><img src="/blog/2019/01/speech-recognition-with-tensorflow/signal-1000-30-1.png" alt="" title="Sinusoidal signal"></p>
<p>Here’s one for 10 and 0.6:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/signal-1000-10-0.6.png" alt="" title="Sinusoidal signal"></p>
<p>You can count the number of times the values in the plots approach their maximum. Knowing that a sine wave has only one maximum within each period and that we’re showing just one second, those counts confirm the frequencies of 30 and 10.</p>
<p>What would we get if we were to sum such sinusoidal signals of different frequencies and amplitudes? Let’s see — below, three different sine waves are plotted on top of each other. The fourth — and last — plot shows the signal that is the sum of all of them:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/wave-decomposition-2.png" alt="" title="Wave composition / decomposition"></p>
<p>Here’s another example, with the last plot showing the sum of 5 different waves:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/wave-decomposition-1.png" alt="" title="Wave composition / decomposition"></p>
<p>It isn’t that regular anymore, is it? It turns out that <strong>you can construct any signal by summing up some number of sine waves of different frequencies and amplitudes</strong> (and phases, their translation in time). The converse is also true: <strong>any signal can be represented as a sum of some number of sine waves of different frequencies and amplitudes</strong> (and phases). This is extremely important to our speech recognition task. Frequencies are the real difference between sounds that make up the phonemes and words that we want to be able to recognize.</p>
<p>This is where the <a href="https://en.wikipedia.org/wiki/Fourier_transform">Fourier Transform</a> comes into play. It takes our data points that represent intensity per each point in time and produces data points representing intensity per each <em>frequency bin</em>. It’s said that it transforms the domain of the signal from <em>time</em> into <em>frequency</em>. Now, what exactly is a <em>frequency bin</em>? Imagine the physical audio signal being constructed from frequencies between 0Hz and 8000Hz. The FFT algorithm (Fast Fourier Transform) is going to split that full spectrum into <em>bins</em>. If you were to split it into 10 bins, you’d end up having the following ranges: 0Hz–800Hz, 800Hz–1600Hz, 1600Hz–2400Hz, 2400Hz–3200Hz, 3200Hz–4000Hz, 4000Hz–4800Hz, 4800Hz–5600Hz, 5600Hz–6400Hz, 6400Hz–7200Hz, 7200Hz–8000Hz.</p>
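<p>Those ranges are simply ten equal slices of the spectrum. A trivial sketch, assuming the 0Hz–8000Hz spectrum from the example:</p>

```python
import numpy as np

# Split the 0-8000 Hz spectrum into 10 equal-width frequency bins.
edges = np.linspace(0, 8000, 11)
bins = list(zip(edges[:-1], edges[1:]))

print(len(bins))                # 10
print(bins[0][0], bins[0][1])   # 0.0 800.0
print(bins[-1][0], bins[-1][1]) # 7200.0 8000.0
```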
<p>Let’s see how the FFT works on the example signal given above. The waves and plots were produced by the following Python function:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">matplotlib.pyplot</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">plt</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">plot_wave_composition</span>(defs, hspace=<span style="color:#00d;font-weight:bold">1.0</span>):
fig_size = plt.rcParams[<span style="color:#d20;background-color:#fff0f0">"figure.figsize"</span>]
plt.rcParams[<span style="color:#d20;background-color:#fff0f0">"figure.figsize"</span>] = [<span style="color:#00d;font-weight:bold">14.0</span>, <span style="color:#00d;font-weight:bold">10.0</span>]
waves = [
gen_sin(freq, amp)
<span style="color:#080;font-weight:bold">for</span> freq, amp <span style="color:#080">in</span> defs
]
fig, axs = plt.subplots(nrows=<span style="color:#038">len</span>(defs) + <span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#080;font-weight:bold">for</span> ix, wave <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(waves):
sns.lineplot(data=wave, ax=axs[ix])
axs[ix].set_ylabel(<span style="color:#d20;background-color:#fff0f0">'</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(defs[ix]))
<span style="color:#080;font-weight:bold">if</span> ix != <span style="color:#00d;font-weight:bold">0</span>:
axs[ix].set_title(<span style="color:#d20;background-color:#fff0f0">'+'</span>)
plt.subplots_adjust(hspace = hspace)
sns.lineplot(data=<span style="color:#038">sum</span>(waves), ax=axs[<span style="color:#038">len</span>(defs)])
axs[<span style="color:#038">len</span>(defs)].set_ylabel(<span style="color:#d20;background-color:#fff0f0">'sum'</span>)
axs[<span style="color:#038">len</span>(defs)].set_xlabel(<span style="color:#d20;background-color:#fff0f0">'time'</span>)
axs[<span style="color:#038">len</span>(defs)].set_title(<span style="color:#d20;background-color:#fff0f0">'='</span>)
plt.rcParams[<span style="color:#d20;background-color:#fff0f0">"figure.figsize"</span>] = fig_size
<span style="color:#080;font-weight:bold">return</span> waves, <span style="color:#038">sum</span>(waves)
</code></pre></div><p>We can plot the signals and grab them at the same time with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">wave_defs = [
(<span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">1</span>),
(<span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">0.8</span>),
(<span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">0.2</span>),
(<span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">0.1</span>),
(<span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">0.25</span>)
]
waves, the_sum = plot_wave_composition(wave_defs)
</code></pre></div><p>Next, let’s compute the FFT values along with the frequencies:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">ffts = np.fft.fft(the_sum)
freqs = np.fft.fftfreq(<span style="color:#038">len</span>(the_sum))
frequencies, coeffs = <span style="color:#038">zip</span>(
*<span style="color:#038">list</span>(
<span style="color:#038">filter</span>(
<span style="color:#080;font-weight:bold">lambda</span> row: row[<span style="color:#00d;font-weight:bold">1</span>] > <span style="color:#00d;font-weight:bold">10</span>, <span style="color:#888"># arbitrary threshold but let’s not make it too complex for now</span>
[ (<span style="color:#038">int</span>(<span style="color:#038">abs</span>(freq * <span style="color:#00d;font-weight:bold">1000</span>)), coef) <span style="color:#080;font-weight:bold">for</span> freq, coef <span style="color:#080">in</span> <span style="color:#038">zip</span>(freqs[<span style="color:#00d;font-weight:bold">0</span>:(<span style="color:#038">len</span>(ffts) // <span style="color:#00d;font-weight:bold">2</span>)], np.abs(ffts)[<span style="color:#00d;font-weight:bold">0</span>:(<span style="color:#038">len</span>(ffts) // <span style="color:#00d;font-weight:bold">2</span>)]) ]
)
)
)
sns.barplot(x=<span style="color:#038">list</span>(frequencies), y=coeffs)
</code></pre></div><p>The last call produces the following plot:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/fft-results.png" alt="" title="Detected frequencies"></p>
<p>The X-axis now represents the frequency in Hz, while the Y-axis shows the intensity.</p>
<p>There’s one missing part before we can use this with our speech data. As you can see, the FFT gives us frequencies <strong>for the whole signal, assuming that it’s periodic and extends infinitely in time</strong>. Obviously, when I say “hello”, the air vibrates one way in the beginning, changes in between and differs even more at the end. We need to <strong>split</strong> the audio into small “windows” of data points. By feeding them into the FFT, we can get the frequencies for each one of them. This turns the data domain from time into frequency within the scope of each window. The time information is retained at the global level, making our data represent: <code>time x frequency x intensity</code>.</p>
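<p>A minimal sketch of that windowing idea, using non-overlapping windows (a real pipeline would typically use overlapping, tapered windows, e.g. via <code>librosa.stft</code>):</p>

```python
import numpy as np

def naive_stft(signal, window_size=256):
    """Split a signal into non-overlapping windows and FFT each one.

    Returns an array of shape (time_windows, frequency_bins).
    """
    n_windows = len(signal) // window_size
    windows = signal[:n_windows * window_size].reshape(n_windows, window_size)
    # rfft: we only need the non-negative frequencies of a real signal.
    return np.abs(np.fft.rfft(windows, axis=1))

# One second of "audio" at 16 kHz -> 62 windows x 129 frequency bins.
spectrogram = naive_stft(np.random.randn(16000))
print(spectrogram.shape)  # (62, 129)
```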
<h4 id="scaling-frequencies">Scaling frequencies</h4>
<p>Human perception is a vastly complex phenomenon. Taking it into account can take us a long way when working on a recognition model that emulates what our brains do when we listen to each other.</p>
<p>Let’s run another experiment. What sound is produced by an 800Hz sine wave?</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">IPython.display</span> <span style="color:#080;font-weight:bold">import</span> Audio
Audio(data=gen_sin(<span style="color:#00d;font-weight:bold">800</span>, <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">16000</span>), rate=<span style="color:#00d;font-weight:bold">16000</span>)
</code></pre></div><div>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/800Hz.wav" type="audio/wav">
</audio>
</div><br />
<p>Let’s now generate 900Hz and 1000Hz to get a sense of the difference:</p>
<p>900Hz:</p>
<div>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/900Hz.wav" type="audio/wav">
</audio>
</div><br />
<p>1000Hz:</p>
<div>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/1000Hz.wav" type="audio/wav">
</audio>
</div><br />
<p>Let us now ante up the frequencies and generate 7000Hz, 7100Hz and 7200Hz:</p>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/7000Hz.wav" type="audio/wav">
</audio>
<br />
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/7100Hz.wav" type="audio/wav">
</audio>
<br />
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/7200Hz.wav" type="audio/wav">
</audio>
<br />
<p>Can you hear how much smaller the difference is between the last three? It’s a well-known phenomenon: we perceive a greater difference between sounds at lower frequencies, and as the frequency increases, that perceived difference becomes less and less pronounced.</p>
<p>Because of this, three gentlemen—Stevens, Volkmann, and Newman—created a so-called <a href="https://en.wikipedia.org/wiki/Mel_scale">Mel scale</a> in 1937. You can think of it as a simple rescaling of the frequencies that roughly follows the relationship shown below:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/Mel-Hz_plot.svg.png" alt=""></p>
<p>Although not mandatory, lots of models that deal with human speech also decrease the importance of the intensity by taking the log of the re-scaled data. The resulting <code>time x frequency (mels) x log-intensity</code> is called the <strong>log-Mel spectrogram</strong>.</p>
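<p>One widely used formula for the Hz-to-Mel conversion (the HTK variant; the exact constants vary between implementations) is <code>m = 2595 * log10(1 + f / 700)</code>. A quick sketch shows how it compresses the differences at the high end, matching what we heard above:</p>

```python
import numpy as np

def hz_to_mel(f):
    # The HTK-style Mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

# The 800 Hz -> 1000 Hz step covers many more mels than the
# 7000 Hz -> 7200 Hz step, mirroring the perceived difference.
low_step = hz_to_mel(1000) - hz_to_mel(800)
high_step = hz_to_mel(7200) - hz_to_mel(7000)
print(low_step > high_step)  # True
```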
<h3 id="background-on-deep-learning-techniques-in-use-for-this-project">Background on deep learning techniques in use for this project</h3>
<p>We’ve just gone through the necessary basics of signal processing. Let’s now focus on the Deep Learning concepts we’ll use to construct and train the model.</p>
<p>While this article assumes that the reader already knows a lot, we’ll use some less common techniques that deserve at least a quick walkthrough.</p>
<h4 id="dilated-convolutions-as-a-faster-alternative-to-recurrent-networks">Dilated convolutions as a faster alternative to recurrent networks</h4>
<p>Traditionally, sequence processing in Deep Learning has been tackled with <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural networks</a>.</p>
<p>No matter the choice of flavor, the basic scheme is always the same: the computations are done <strong>sequentially</strong>, going through the examples <strong>in time</strong>. In our case, we’d need to split the <code>time x frequency x intensity</code> data into <code>time</code> chunks of shape <code>frequency x intensity</code>. As the chunks are processed one by one, the recurrent network’s internal state “remembers” the previous chunks’ specifics, incorporating them into its future outputs. The output shape would be <code>time x frequency x recurrent units</code>.</p>
<p>The fact that the computations are done sequentially makes them quite slow overall. Computations later in the pipeline spend most of their time waiting for the previous ones to finish because of the direct dependency. The problem is even more severe with GPUs: we use them for their ability to do math in parallel on huge chunks of data, and with recurrent networks lots of that power is wasted.</p>
<p>The premise of RNNs is that, in theory, they have the capacity to keep very long contexts in their “memory”. This has recently been put to the test and falsified in practice by <a href="https://arxiv.org/pdf/1803.01271.pdf">Bai et al</a>. Also, when you stop and think about the task at hand: does it really matter to “remember” the beginning of a sentence to know that it ends with the word “dog”? Some context is obviously needed — but not as wide as it might seem at first.</p>
<p>I have an Nvidia GTX 1070Ti with 8GB of memory to train my models on, and I don’t really feel like waiting a month for a recurrent network to converge. In this project, let’s use a very performant alternative — the <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural network</a>.</p>
<h5 id="expanding-the-context-of-the-convolutional-network">Expanding the context of the convolutional network</h5>
<p>Simple convolutional layers weren’t used much for sequence processing, and for a good reason. The crux of sequence processing is being able to take bigger contexts into account. Depending on the job, we might want to constrain the context only to the <em>past</em> — learning the <strong>causal</strong> relations in the data. We might sometimes want to incorporate both the <em>past</em> and the <em>future</em> as well. The go-to solution for doing OCR at the moment is to use bidirectional recurrent layers: one pass learns the relations from left to right while the other learns from right to left, and the results are then concatenated.</p>
<p>By applying proper padding, we can easily include one or two-sided contexts in 1D convolutions. The challenge is that in order to make the outputs depend on bigger contexts, the size of the filters needs to become bigger and bigger. This, in turn, requires more and more memory.</p>
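<p>For the one-sided (causal) case, the padding trick amounts to prepending zeros, so that each output depends only on the current and past inputs. A small numpy sketch:</p>

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1D cross-correlation that only looks at the past.

    Left-padding with len(kernel) - 1 zeros keeps the output the same
    length as the input, and output[t] depends only on x[0..t].
    """
    padded = np.concatenate([np.zeros(len(kernel) - 1), x])
    return np.array([
        np.dot(padded[i:i + len(kernel)], kernel)
        for i in range(len(x))
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
print(causal_conv1d(x, np.ones(3)))  # [1. 3. 6. 9.]
```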
<p>Because our aim is to create a model that we’ll be able to train on a quite cheap (given the GPUs used in this field usually) GTX 1070Ti (around $500 at the moment), we want the memory requirements to be as low as possible.</p>
<p>Thanks to the success of the <a href="https://arxiv.org/pdf/1609.03499.pdf">WaveNet</a> (among others), a specific class of convolutional layers gained a lot of attention lately. The variation is called <strong>Dilated Convolutions</strong> or sometimes <strong>Atrous Convolutions</strong>. So what are they?</p>
<p>Let’s first have a look at how the outputs depend on their context for simple convolutional layers:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/causal-conv-3-1.png" alt=""></p>
<p>Imagine that you originally have just the top-most row of numbers. You are going to use 1D convolutions, and to keep the reasoning simple, the number of filters is 1 and all filter values are set to 1. You can see the cross-correlation operator (because that’s what convolutional layers in fact compute) taking 3 values of the context, multiplying them by the filter and summing up to <code>2 * 1 + 3 * 1 + 4 * 1 = 9</code>.</p>
<p>The <em>atrous</em> convolutions are really the same, except that they <strong>dilate</strong> their focus by introducing holes, without increasing the size of the filter. This is shown below with a convolution of size 2 and dilation of 2:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/causal-conv-2-2.png" alt=""></p>
<p>Here’s yet another example for the size of 2 and dilation of 3:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/causal-conv-2-3.png" alt=""></p>
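<p>The pictures above translate directly into code. A minimal numpy sketch of a single-filter 1D dilated cross-correlation, reproducing the <code>2 * 1 + 3 * 1 + 4 * 1 = 9</code> example:</p>

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Valid 1D cross-correlation with holes between the filter taps."""
    taps = np.arange(len(kernel)) * dilation
    out_len = len(x) - taps[-1]
    return np.array([
        np.dot(x[i + taps], kernel) for i in range(out_len)
    ])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Plain convolution, all-ones filter of size 3: 2 + 3 + 4 = 9 at position 1.
print(dilated_conv1d(x, np.ones(3)))              # [ 6.  9. 12.]
# Size 2, dilation 2: each output skips one value, e.g. 1 + 3 = 4.
print(dilated_conv1d(x, np.ones(2), dilation=2))  # [4. 6. 8.]
```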
<h4 id="gated-activations">Gated activations</h4>
<p>Traditionally, convolutional layers are followed by the *elu family of activations (ReLU, ELU, PReLU, SELU). They fit well within the “match the pattern” paradigm of conv nets. Recurrent units, on the contrary, operate with a “remember/forget” approach. Two of their most commonly used implementations, GRU and LSTM, include explicit “forget” gates.</p>
<p>We want to mimic their ability to “forget” parts of the context within our dilated convolutions based model too. To do that, we’re going to use the “gated activations” approach, explained by <a href="https://arxiv.org/pdf/1712.09444.pdf">Liptchinsky et al.</a></p>
<p>The idea is very simple: we pass the input through two separate Conv1D layers and apply tanh and sigmoid to their outputs respectively. The result is the element-wise product of the two. In our approach we’re going to go one step further, applying tanh one more time at the end.</p>
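<p>In numpy terms, the activation could be sketched like this (with <code>conv_a</code> and <code>conv_b</code> as hypothetical stand-ins for the outputs of the two separate Conv1D passes):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_activation(conv_a, conv_b):
    """tanh gate * sigmoid gate, squashed once more with a final tanh."""
    return np.tanh(np.tanh(conv_a) * sigmoid(conv_b))

# Two fake Conv1D outputs for the same input window.
a = np.array([-2.0, 0.0, 3.0])
b = np.array([10.0, -10.0, 0.0])
out = gated_activation(a, b)
# The sigmoid branch acts as a "forget" gate: a near-zero gate value
# suppresses the corresponding output.
print(np.round(out, 3))
```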
<h4 id="others">Others</h4>
<p>The full explanation of all of the details of our neural network’s architecture is beyond the scope of an article like this. Let me point you at additional pieces along with the reading they come from:</p>
<ul>
<li><a href="https://arxiv.org/pdf/1502.03167.pdf">Batch Normalization</a></li>
<li><a href="ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf">Connectionist Temporal Classification</a></li>
<li><a href="https://arxiv.org/pdf/1512.03385.pdf">Residual Learning</a></li>
</ul>
<h3 id="lets-code-it">Let’s code it</h3>
<p>The architecture we’ve chosen for this project relies heavily on the great success of residual-style networks as well as dilated convolutions. You might see similarities to the famous WaveNet, although it’s going to be a bit different.</p>
<p>Here is the bird’s-eye view of the SpeechNet neural network:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/speech-net.png" alt=""></p>
<p>The residual stacks, being at the heart of it, are structured the following way:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/residual-stack.png" alt=""></p>
<p>The residual blocks, doing all the heavy lifting, can be seen as shown below:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/residual-block.png" alt=""></p>
<h4 id="the-most-important-aspect-of-coding-of-the-deep-learning-models">The most important aspect of coding of the Deep Learning models</h4>
<p>Developing Deep Learning models doesn’t really differ that much from any other type of coding. It does require specific background knowledge, but good coding practices remain the same. In fact, good coding habits are 10× more relevant here than in, e.g., a web-app project.</p>
<p>Training a speech-to-text model is bound to take days if not weeks. Imagine having a small bug in your code that prevents the process from finding a good local minimum. It’s extremely frustrating to discover it days into the training, with the model’s trainable parameters having barely improved.</p>
<p>Let’s start by adding some unit tests, then. In this project we’re using a Jupyter notebook, as we don’t intend to package the code anywhere; it’s intended mainly for educational purposes.</p>
<p>Adding unit tests within the Jupyter notebook is possible with the following “hack” (notice the value for <code>argv</code>):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">unittest</span>
RUN_TESTS = <span style="color:#080;font-weight:bold">True</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">TestNotebook</span>(unittest.TestCase):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_it_works</span>(self):
self.assertEqual(<span style="color:#00d;font-weight:bold">2</span> + <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">4</span>)
<span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">'__main__'</span> <span style="color:#080">and</span> RUN_TESTS:
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">doctest</span>
doctest.testmod()
unittest.main(
argv=[<span style="color:#d20;background-color:#fff0f0">'first-arg-is-ignored'</span>],
failfast=<span style="color:#080;font-weight:bold">True</span>,
exit=<span style="color:#080;font-weight:bold">False</span>
)
</code></pre></div><p>Notice the import of the <code>doctest</code> module, which adds support for <a href="https://docs.python.org/2/library/doctest.html">doc-string level tests</a>; these may come in handy as well.</p>
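<p>A tiny sketch of what such a doc-string test could look like (the <code>normalize</code> helper here is a made-up example, not part of the project):</p>

```python
import doctest

def normalize(wave):
    """Scale a wave so its peak absolute amplitude is 1.0.

    >>> normalize([0.0, 0.5, -0.25])
    [0.0, 1.0, -0.5]
    """
    peak = max(abs(v) for v in wave)
    return [v / peak for v in wave]

# Run every >>> example found in this module's doc-strings.
doctest.testmod()
```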
<p>I also highly recommend the <a href="https://hypothesis.readthedocs.io/en/latest/">hypothesis library</a> for testing the QuickCheck way, <a href="/blog/2016/03/quickcheck-property-based-testing-in/">as I’ve blogged about before</a>.</p>
<h5 id="data-pipeline">Data pipeline</h5>
<p>A surprisingly bug-prone place is the data pipeline. It’s easy, for example, to shuffle the labels independently of the input vectors if you’re not careful. There’s also always a chance of introducing input vectors containing <code>NaN</code> or <code>inf</code> values, which a few steps later produce <code>NaN</code> or <code>inf</code> loss values. Let’s add a simple test to check for the first condition:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">
<span style="color:#888"># assuming test path will look like: 1/file.wav</span>
<span style="color:#888"># the input and output types are driven by the input_fn shown later</span>
<span style="color:#888"># here, we’re just generating values based on the “path”</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">dummy_load_wave</span>(example):
row, params = example
path = row.filename
<span style="color:#080;font-weight:bold">return</span> np.ones((SAMPLING_RATE)) * <span style="color:#038">float</span>(path.split(<span style="color:#d20;background-color:#fff0f0">'/'</span>)[<span style="color:#00d;font-weight:bold">0</span>]), row
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">TestNotebook</span>(unittest.TestCase):
<span style="color:#888"># (...)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_dataset_returns_data_in_order</span>(self):
params = experiment_params(
dataset_params(
batch_size=<span style="color:#00d;font-weight:bold">2</span>,
epochs=<span style="color:#00d;font-weight:bold">1</span>,
augment=<span style="color:#080;font-weight:bold">False</span>
)
)
data = pd.DataFrame(
data={
<span style="color:#d20;background-color:#fff0f0">'text'</span>: [ <span style="color:#038">str</span>(i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">10</span>) ],
<span style="color:#d20;background-color:#fff0f0">'filename'</span>: [ <span style="color:#d20;background-color:#fff0f0">'</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">/wav'</span>.format(i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">10</span>) ]
}
)
dataset = input_fn(data, params[<span style="color:#d20;background-color:#fff0f0">'data'</span>], dummy_load_wave)()
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
<span style="color:#080;font-weight:bold">try</span>:
<span style="color:#080;font-weight:bold">while</span> <span style="color:#080;font-weight:bold">True</span>:
audio, label = session.run(next_element)
audio, length = audio
<span style="color:#080;font-weight:bold">for</span> _audio, _label <span style="color:#080">in</span> <span style="color:#038">zip</span>(<span style="color:#038">list</span>(audio), <span style="color:#038">list</span>(label)):
self.assertEqual(_audio[<span style="color:#00d;font-weight:bold">0</span>], <span style="color:#038">float</span>(_label))
<span style="color:#080;font-weight:bold">for</span> _length <span style="color:#080">in</span> length:
self.assertEqual(_length, SAMPLING_RATE)
<span style="color:#080;font-weight:bold">except</span> tf.errors.OutOfRangeError:
<span style="color:#080;font-weight:bold">pass</span>
</code></pre></div><p>The above code assumes that the <code>input_fn</code> function is in scope. If you’re not familiar with the concept yet, please go ahead and read the introduction to the <a href="https://www.tensorflow.org/guide/estimators">TensorFlow Estimators API</a>.</p>
<p>Here’s our implementation:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">multiprocessing</span> <span style="color:#080;font-weight:bold">import</span> Pool
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">input_fn</span>(input_dataset, params, load_wave_fn=load_wave):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">_input_fn</span>():
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Returns raw audio wave along with the label
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
dataset = input_dataset
<span style="color:#038">print</span>(params)
<span style="color:#080;font-weight:bold">if</span> <span style="color:#d20;background-color:#fff0f0">'max_text_length'</span> <span style="color:#080">in</span> params <span style="color:#080">and</span> params[<span style="color:#d20;background-color:#fff0f0">'max_text_length'</span>] <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Constraining dataset to the max_text_length'</span>)
dataset = input_dataset[input_dataset.text.str.len() < params[<span style="color:#d20;background-color:#fff0f0">'max_text_length'</span>]]
<span style="color:#080;font-weight:bold">if</span> <span style="color:#d20;background-color:#fff0f0">'min_text_length'</span> <span style="color:#080">in</span> params <span style="color:#080">and</span> params[<span style="color:#d20;background-color:#fff0f0">'min_text_length'</span>] <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Constraining dataset to the min_text_length'</span>)
dataset = input_dataset[input_dataset.text.str.len() >= params[<span style="color:#d20;background-color:#fff0f0">'min_text_length'</span>]]
<span style="color:#080;font-weight:bold">if</span> <span style="color:#d20;background-color:#fff0f0">'max_wave_length'</span> <span style="color:#080">in</span> params <span style="color:#080">and</span> params[<span style="color:#d20;background-color:#fff0f0">'max_wave_length'</span>] <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Constraining dataset to the max_wave_length'</span>)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Resulting dataset length: </span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(<span style="color:#038">len</span>(dataset)))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">generator_fn</span>():
pool = Pool()
buffer = []
<span style="color:#080;font-weight:bold">for</span> epoch <span style="color:#080">in</span> <span style="color:#038">range</span>(params[<span style="color:#d20;background-color:#fff0f0">'epochs'</span>]):
<span style="color:#080;font-weight:bold">for</span> _, row <span style="color:#080">in</span> dataset.sample(frac=<span style="color:#00d;font-weight:bold">1</span>).iterrows():
buffer.append((row, params))
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">len</span>(buffer) >= params[<span style="color:#d20;background-color:#fff0f0">'batch_size'</span>]:
<span style="color:#080;font-weight:bold">if</span> params[<span style="color:#d20;background-color:#fff0f0">'parallelize'</span>]:
audios = pool.map(
load_wave_fn,
buffer
)
<span style="color:#080;font-weight:bold">else</span>:
audios = <span style="color:#038">map</span>(
load_wave_fn,
buffer
)
<span style="color:#080;font-weight:bold">for</span> audio, row <span style="color:#080">in</span> audios:
<span style="color:#080;font-weight:bold">if</span> audio <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#080;font-weight:bold">if</span> np.isnan(audio).any():
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'SKIPPING! NaN coming from the pipeline!'</span>)
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">yield</span> (audio, <span style="color:#038">len</span>(audio)), row.text.encode()
buffer = []
<span style="color:#080;font-weight:bold">return</span> tf.data.Dataset.from_generator(
generator_fn,
output_types=((tf.float32, tf.int32), (tf.string)),
output_shapes=((<span style="color:#080;font-weight:bold">None</span>,()), (()))
) \
.padded_batch(
batch_size=params[<span style="color:#d20;background-color:#fff0f0">'batch_size'</span>],
padded_shapes=(
(tf.TensorShape([<span style="color:#080;font-weight:bold">None</span>]), tf.TensorShape(())),
tf.TensorShape(())
)
)
<span style="color:#080;font-weight:bold">return</span> _input_fn
</code></pre></div><p>This depends on the <code>load_wave</code> function:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">librosa</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">hickle</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">hkl</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">os.path</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">to_path</span>(filename):
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">'./data/cv_corpus_v1/'</span> + filename
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">load_wave</span>(example, absolute=<span style="color:#080;font-weight:bold">False</span>):
row, params = example
_path = row.filename <span style="color:#080;font-weight:bold">if</span> absolute <span style="color:#080;font-weight:bold">else</span> to_path(row.filename)
<span style="color:#080;font-weight:bold">if</span> os.path.isfile(_path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>):
wave = hkl.load(_path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>).astype(np.float32)
<span style="color:#080;font-weight:bold">else</span>:
wave, _ = librosa.load(_path, sr=SAMPLING_RATE)
hkl.dump(wave, _path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>)
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">len</span>(wave) <= params[<span style="color:#d20;background-color:#fff0f0">'max_wave_length'</span>]:
<span style="color:#080;font-weight:bold">if</span> params[<span style="color:#d20;background-color:#fff0f0">'augment'</span>]:
wave = random_noise(
random_stretch(
random_shift(
wave,
params
),
params
),
params
)
<span style="color:#080;font-weight:bold">else</span>:
wave = <span style="color:#080;font-weight:bold">None</span>
<span style="color:#080;font-weight:bold">return</span> wave, row
</code></pre></div><p>This in turn depends on three other functions that augment the data on the fly to improve the model’s generalization:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">random</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">glob</span>
noise_files = glob.glob(<span style="color:#d20;background-color:#fff0f0">'./data/*.wav'</span>)
noises = {}
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">random_stretch</span>(audio, params):
rate = random.uniform(params[<span style="color:#d20;background-color:#fff0f0">'random_stretch_min'</span>], params[<span style="color:#d20;background-color:#fff0f0">'random_stretch_max'</span>])
<span style="color:#080;font-weight:bold">return</span> librosa.effects.time_stretch(audio, rate)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">random_shift</span>(audio, params):
_shift = random.randrange(params[<span style="color:#d20;background-color:#fff0f0">'random_shift_min'</span>], params[<span style="color:#d20;background-color:#fff0f0">'random_shift_max'</span>])
<span style="color:#080;font-weight:bold">if</span> _shift < <span style="color:#00d;font-weight:bold">0</span>:
pad = (_shift * -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">else</span>:
pad = (<span style="color:#00d;font-weight:bold">0</span>, _shift)
<span style="color:#080;font-weight:bold">return</span> np.pad(audio, pad, mode=<span style="color:#d20;background-color:#fff0f0">'constant'</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">random_noise</span>(audio, params):
_factor = random.uniform(
params[<span style="color:#d20;background-color:#fff0f0">'random_noise_factor_min'</span>],
params[<span style="color:#d20;background-color:#fff0f0">'random_noise_factor_max'</span>]
)
<span style="color:#080;font-weight:bold">if</span> params[<span style="color:#d20;background-color:#fff0f0">'random_noise'</span>] > random.uniform(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">1</span>):
_path = random.choice(noise_files)
<span style="color:#080;font-weight:bold">if</span> _path <span style="color:#080">in</span> noises:
wave = noises[_path]
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">if</span> os.path.isfile(_path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>):
wave = hkl.load(_path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>).astype(np.float32)
noises[_path] = wave
<span style="color:#080;font-weight:bold">else</span>:
wave, _ = librosa.load(_path, sr=SAMPLING_RATE)
hkl.dump(wave, _path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>)
noises[_path] = wave
noise = random_shift(
wave,
{
<span style="color:#d20;background-color:#fff0f0">'random_shift_min'</span>: -<span style="color:#00d;font-weight:bold">16000</span>,
<span style="color:#d20;background-color:#fff0f0">'random_shift_max'</span>: <span style="color:#00d;font-weight:bold">16000</span>
}
)
max_noise = np.max(noise[<span style="color:#00d;font-weight:bold">0</span>:<span style="color:#038">len</span>(audio)])
max_wave = np.max(audio)
noise = noise * (max_wave / max_noise)
<span style="color:#080;font-weight:bold">return</span> _factor * noise[<span style="color:#00d;font-weight:bold">0</span>:<span style="color:#038">len</span>(audio)] + (<span style="color:#00d;font-weight:bold">1.0</span> - _factor) * audio
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">return</span> audio
</code></pre></div><p>Notice that we’re turning almost everything into a configurable parameter. We want the code to give us the greatest possible freedom when searching for just the right set of hyperparameters.</p>
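<p>The shift augmentation above boils down to zero-padding the signal on one side. Here’s a minimal pure-Python sketch of the same idea (the real code applies <code>np.pad</code> to NumPy arrays):</p>

```python
import random

def shift_with_padding(audio, shift):
    """Pad with zeros at the front (shift < 0) or at the back (shift > 0)."""
    if shift < 0:
        return [0.0] * (-shift) + list(audio)
    return list(audio) + [0.0] * shift

def random_shift(audio, params):
    # Mirrors the augmentation above: draw a shift amount, then pad.
    shift = random.randrange(params['random_shift_min'], params['random_shift_max'])
    return shift_with_padding(audio, shift)
```

<p>Note that, as in the original, the padded signal grows in length rather than being cropped back to its original size.</p>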
<p>The data pipeline shown above randomly shuffles the <a href="https://pandas.pydata.org">Pandas</a> data frame once per epoch. It also creates a pool of background workers to parallelize the data loading as much as possible; both the loading and the augmentation happen on the CPU. Finally, it uses the <a href="https://github.com/telegraphic/hickle">hickle</a> library to cache decoded audio signals on disk. Loading a wave file at a given sampling rate isn’t as fast as one might think: in my experiments, loading the resulting array of floating-point samples via <code>hickle</code> was 10x faster. We need to feed data into the network as quickly as possible, or else our GPU is going to sit underutilized.</p>
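<p>The caching trick itself is generic: decode the file once, store the resulting array on disk next to the source, and read it back on subsequent epochs. A minimal sketch of the pattern, using the standard library’s <code>pickle</code> as a stand-in for <code>hickle</code>:</p>

```python
import os
import pickle

def load_cached(path, decode_fn, suffix='.cache.pkl'):
    """Return decode_fn(path), memoizing the result on disk next to the file."""
    cache_path = path + suffix
    if os.path.isfile(cache_path):
        # Cache hit: skip the expensive decode entirely.
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    data = decode_fn(path)
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data
```

<p>In <code>load_wave</code> above, <code>decode_fn</code> corresponds to <code>librosa.load</code> and the suffix is <code>.wave.hkl</code>.</p>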
<p>In my experiments, turning data augmentation on also <strong>made a real difference</strong>. When I ran the training without it, the network overfit disastrously: the normalized <a href="https://en.wikipedia.org/wiki/Edit_distance">edit distance</a> hovered around 0.01 for the training set but 0.53 for the validation set.</p>
<p>The <code>random_noise</code> function uses the noise sounds included in the <a href="http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz">Speech Commands: A public dataset for single-word speech recognition</a> dataset. Please go ahead and download it, extracting just the noise files under the <code>./data</code> directory.</p>
<p>The last function in use that we haven’t seen yet is <code>dataset_params</code>. It’s just a helper that makes it easy to construct the params dictionary for our experiments:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">dataset_params</span>(batch_size=<span style="color:#00d;font-weight:bold">32</span>,
epochs=<span style="color:#00d;font-weight:bold">50000</span>,
parallelize=<span style="color:#080;font-weight:bold">True</span>,
max_text_length=<span style="color:#080;font-weight:bold">None</span>,
min_text_length=<span style="color:#080;font-weight:bold">None</span>,
max_wave_length=<span style="color:#00d;font-weight:bold">80000</span>,
shuffle=<span style="color:#080;font-weight:bold">True</span>,
random_shift_min=-<span style="color:#00d;font-weight:bold">4000</span>,
random_shift_max= <span style="color:#00d;font-weight:bold">4000</span>,
random_stretch_min=<span style="color:#00d;font-weight:bold">0.7</span>,
random_stretch_max= <span style="color:#00d;font-weight:bold">1.3</span>,
random_noise=<span style="color:#00d;font-weight:bold">0.75</span>,
random_noise_factor_min=<span style="color:#00d;font-weight:bold">0.2</span>,
random_noise_factor_max=<span style="color:#00d;font-weight:bold">0.5</span>,
augment=<span style="color:#080;font-weight:bold">False</span>):
<span style="color:#080;font-weight:bold">return</span> {
<span style="color:#d20;background-color:#fff0f0">'parallelize'</span>: parallelize,
<span style="color:#d20;background-color:#fff0f0">'shuffle'</span>: shuffle,
<span style="color:#d20;background-color:#fff0f0">'max_text_length'</span>: max_text_length,
<span style="color:#d20;background-color:#fff0f0">'min_text_length'</span>: min_text_length,
<span style="color:#d20;background-color:#fff0f0">'max_wave_length'</span>: max_wave_length,
<span style="color:#d20;background-color:#fff0f0">'random_shift_min'</span>: random_shift_min,
<span style="color:#d20;background-color:#fff0f0">'random_shift_max'</span>: random_shift_max,
<span style="color:#d20;background-color:#fff0f0">'random_stretch_min'</span>: random_stretch_min,
<span style="color:#d20;background-color:#fff0f0">'random_stretch_max'</span>: random_stretch_max,
<span style="color:#d20;background-color:#fff0f0">'random_noise'</span>: random_noise,
<span style="color:#d20;background-color:#fff0f0">'random_noise_factor_min'</span>: random_noise_factor_min,
<span style="color:#d20;background-color:#fff0f0">'random_noise_factor_max'</span>: random_noise_factor_max,
<span style="color:#d20;background-color:#fff0f0">'epochs'</span>: epochs,
<span style="color:#d20;background-color:#fff0f0">'batch_size'</span>: batch_size,
<span style="color:#d20;background-color:#fff0f0">'augment'</span>: augment
}
</code></pre></div><h5 id="labels-encoder-and-decoder">Labels encoder and decoder</h5>
<p>When working with the CTC loss, we need a way to encode each letter as a numerical value. Conversely, the neural network is going to give us probabilities for each letter, indexed by its position within the output matrix.</p>
<p>The idea behind this project’s approach is to push the encoding and decoding into the network graph itself. We want two functions: <code>encode_labels</code> and <code>decode_codes</code>. The first turns a string into an array of integers; the second complements it, turning the array of integers back into the resulting string.</p>
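<p>Outside of a TensorFlow graph, the pair amounts to two table lookups. A plain-Python sketch of the contract (hypothetical helpers; the real versions build TensorFlow lookup tables):</p>

```python
def encode_labels(text, params):
    """Map each character to its index in the alphabet (-1 if unknown)."""
    char2id = {c: i for i, c in enumerate(params['alphabet'])}
    return [char2id.get(c, -1) for c in text]

def decode_codes(codes, params):
    """Map indices back to characters; unknown codes become ''."""
    alphabet = params['alphabet']
    return ''.join(alphabet[c] if 0 <= c < len(alphabet) else '' for c in codes)
```

<p>Decoding must invert encoding exactly, which is precisely the round-trip property the unit test checks.</p>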
<p>It’s a good idea to use the <code>hypothesis</code> library for this unit test. It’s going to come up with many input examples, trying to falsify our assumptions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(st.text(alphabet=<span style="color:#d20;background-color:#fff0f0">"abcdefghijk1234!@#$%^&*"</span>, max_size=<span style="color:#00d;font-weight:bold">10</span>))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_encode_and_decode_work</span>(self, text):
assume(text != <span style="color:#d20;background-color:#fff0f0">''</span>)
params = { <span style="color:#d20;background-color:#fff0f0">'alphabet'</span>: <span style="color:#d20;background-color:#fff0f0">'abcdefghijk1234!@#$%^&*'</span> }
label_ph = tf.placeholder(tf.string, shape=(<span style="color:#00d;font-weight:bold">1</span>), name=<span style="color:#d20;background-color:#fff0f0">'text'</span>)
codes_op = encode_labels(label_ph, params)
decode_op = decode_codes(codes_op, params)
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer(name=<span style="color:#d20;background-color:#fff0f0">'init_all_tables'</span>))
codes, decoded = session.run(
[codes_op, decode_op],
{
label_ph: np.array([text])
}
)
note(codes)
note(decoded)
self.assertEqual(text, <span style="color:#d20;background-color:#fff0f0">''</span>.join(<span style="color:#038">map</span>(<span style="color:#080;font-weight:bold">lambda</span> s: s.decode(<span style="color:#d20;background-color:#fff0f0">'UTF-8'</span>), decoded.values)))
self.assertEqual(codes.values.dtype, np.int32)
self.assertEqual(<span style="color:#038">len</span>(codes.values), <span style="color:#038">len</span>(text))
</code></pre></div><p>Here is the implementation that passes the above test:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">encode_labels</span>(labels, params):
characters = <span style="color:#038">list</span>(params[<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>])
table = tf.contrib.lookup.HashTable(
tf.contrib.lookup.KeyValueTensorInitializer(
characters,
<span style="color:#038">list</span>(<span style="color:#038">range</span>(<span style="color:#038">len</span>(characters)))
),
-<span style="color:#00d;font-weight:bold">1</span>,
name=<span style="color:#d20;background-color:#fff0f0">'char2id'</span>
)
<span style="color:#080;font-weight:bold">return</span> table.lookup(
tf.string_split(labels, delimiter=<span style="color:#d20;background-color:#fff0f0">''</span>)
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode_codes</span>(codes, params):
characters = <span style="color:#038">list</span>(params[<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>])
table = tf.contrib.lookup.HashTable(
tf.contrib.lookup.KeyValueTensorInitializer(
<span style="color:#038">list</span>(<span style="color:#038">range</span>(<span style="color:#038">len</span>(characters))),
characters
),
<span style="color:#d20;background-color:#fff0f0">''</span>,
name=<span style="color:#d20;background-color:#fff0f0">'id2char'</span>
)
<span style="color:#080;font-weight:bold">return</span> table.lookup(codes)
</code></pre></div><h5 id="log-mel-spectrogram-layer">Log-Mel Spectrogram layer</h5>
<p>Another piece we need is a way to turn raw audio signals into log-Mel spectrograms. The idea, again, is to push this into the network graph. That way it runs much faster on GPUs, and the model’s API stays much simpler.</p>
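<p>The Mel scale itself is just a logarithmic warping of frequency. Here’s a quick sketch of the HTK-style conversion used to place filterbank edges evenly on the Mel scale (note that librosa’s default Mel filterbank uses the slightly different Slaney variant):</p>

```python
import math

def hertz_to_mel(f):
    """HTK-style Hz -> Mel conversion."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hertz(m):
    """Inverse of hertz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_bin_edges(lower_hz, upper_hz, num_mel_bins):
    """num_mel_bins + 2 edge frequencies, evenly spaced on the Mel scale."""
    lo, hi = hertz_to_mel(lower_hz), hertz_to_mel(upper_hz)
    step = (hi - lo) / (num_mel_bins + 1)
    return [mel_to_hertz(lo + i * step) for i in range(num_mel_bins + 2)]
```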
<p>In the following unit test, we check our custom TensorFlow layer against values produced by librosa, which we treat as the known-good reference:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
st.sampled_from([<span style="color:#00d;font-weight:bold">22000</span>, <span style="color:#00d;font-weight:bold">16000</span>, <span style="color:#00d;font-weight:bold">8000</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1024</span>, <span style="color:#00d;font-weight:bold">512</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1024</span>, <span style="color:#00d;font-weight:bold">512</span>]),
npst.arrays(
np.float32,
(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
)
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_log_mel_conversion_works</span>(self, sampling_rate, n_fft, frame_step, audio):
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0.0</span>
upper_edge_hertz=sampling_rate / <span style="color:#00d;font-weight:bold">2.0</span>
num_mel_bins=<span style="color:#00d;font-weight:bold">64</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">librosa_melspectrogram</span>(audio_item):
spectrogram = np.abs(
librosa.core.stft(
audio_item,
n_fft=n_fft,
hop_length=frame_step,
center=<span style="color:#080;font-weight:bold">False</span>
)
)**<span style="color:#00d;font-weight:bold">2</span>
<span style="color:#080;font-weight:bold">return</span> np.log(
librosa.feature.melspectrogram(
S=spectrogram,
sr=sampling_rate,
n_mels=num_mel_bins,
fmin=lower_edge_hertz,
fmax=upper_edge_hertz,
) + <span style="color:#00d;font-weight:bold">1e-6</span>
)
audio_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>))
librosa_log_mels = np.transpose(
np.stack([
librosa_melspectrogram(audio_item)
<span style="color:#080;font-weight:bold">for</span> audio_item <span style="color:#080">in</span> audio
]),
(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">1</span>)
)
log_mel_op = tf.check_numerics(
LogMelSpectrogram(
sampling_rate=sampling_rate,
n_fft=n_fft,
frame_step=frame_step,
lower_edge_hertz=lower_edge_hertz,
upper_edge_hertz=upper_edge_hertz,
num_mel_bins=num_mel_bins
)(audio_ph),
message=<span style="color:#d20;background-color:#fff0f0">"log mels"</span>
)
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
log_mels = session.run(
log_mel_op,
{
audio_ph: audio
}
)
np.testing.assert_allclose(
log_mels,
librosa_log_mels,
rtol=<span style="color:#00d;font-weight:bold">1e-1</span>,
atol=<span style="color:#00d;font-weight:bold">0</span>
)
</code></pre></div><p>The implementation of the layer that passes the above unit test reads as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">LogMelSpectrogram</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self,
sampling_rate,
n_fft,
frame_step,
lower_edge_hertz,
upper_edge_hertz,
num_mel_bins,
**kwargs):
<span style="color:#038">super</span>(LogMelSpectrogram, self).__init__(**kwargs)
self.sampling_rate = sampling_rate
self.n_fft = n_fft
self.frame_step = frame_step
self.lower_edge_hertz = lower_edge_hertz
self.upper_edge_hertz = upper_edge_hertz
self.num_mel_bins = num_mel_bins
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs, training=<span style="color:#080;font-weight:bold">True</span>):
stfts = tf.contrib.signal.stft(
inputs,
frame_length=self.n_fft,
frame_step=self.frame_step,
fft_length=self.n_fft,
pad_end=<span style="color:#080;font-weight:bold">False</span>
)
power_spectrograms = tf.real(stfts * tf.conj(stfts))
num_spectrogram_bins = power_spectrograms.shape[-<span style="color:#00d;font-weight:bold">1</span>].value
linear_to_mel_weight_matrix = tf.constant(
np.transpose(
librosa.filters.mel(
sr=self.sampling_rate,
n_fft=self.n_fft + <span style="color:#00d;font-weight:bold">1</span>,
n_mels=self.num_mel_bins,
fmin=self.lower_edge_hertz,
fmax=self.upper_edge_hertz
)
),
dtype=tf.float32
)
mel_spectrograms = tf.tensordot(
power_spectrograms,
linear_to_mel_weight_matrix,
<span style="color:#00d;font-weight:bold">1</span>
)
mel_spectrograms.set_shape(
power_spectrograms.shape[:-<span style="color:#00d;font-weight:bold">1</span>].concatenate(
linear_to_mel_weight_matrix.shape[-<span style="color:#00d;font-weight:bold">1</span>:]
)
)
<span style="color:#080;font-weight:bold">return</span> tf.log(mel_spectrograms + <span style="color:#00d;font-weight:bold">1e-6</span>)
</code></pre></div><h5 id="converted-data-lengths-function">Converted data lengths function</h5>
<p>In order to use the CTC loss and decoder efficiently, we need to pass them the length of the data that actually represents audio in each example. This is because not all audio files are the same length, yet we need to pad them with zeros to form mini-batches.</p>
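<p>The arithmetic mirrors how the STFT slides its analysis window: without padding, a signal of <code>n</code> samples yields <code>floor((n - n_fft) / frame_step) + 1</code> frames. In plain Python:</p>

```python
def frame_count(n_samples, n_fft, frame_step):
    """Number of STFT frames for a signal of n_samples, without padding."""
    # Integer floor division matches tf.floor for non-negative values.
    return (n_samples - n_fft) // frame_step + 1
```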
<p>Here’s the unit test:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
npst.arrays(
np.float32,
(st.integers(min_value=<span style="color:#00d;font-weight:bold">16000</span>, max_value=<span style="color:#00d;font-weight:bold">16000</span>*<span style="color:#00d;font-weight:bold">5</span>)),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
),
st.sampled_from([<span style="color:#00d;font-weight:bold">22000</span>, <span style="color:#00d;font-weight:bold">16000</span>, <span style="color:#00d;font-weight:bold">8000</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1024</span>, <span style="color:#00d;font-weight:bold">512</span>, <span style="color:#00d;font-weight:bold">640</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1024</span>, <span style="color:#00d;font-weight:bold">512</span>, <span style="color:#00d;font-weight:bold">160</span>]),
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_compute_lengths_works</span>(self,
audio_wave,
sampling_rate,
n_fft,
frame_step
):
assume(n_fft >= frame_step)
original_wave_length = audio_wave.shape[<span style="color:#00d;font-weight:bold">0</span>]
audio_waves_ph = tf.placeholder(tf.float32, (<span style="color:#080;font-weight:bold">None</span>, <span style="color:#080;font-weight:bold">None</span>), name=<span style="color:#d20;background-color:#fff0f0">"audio_waves"</span>)
original_lengths_ph = tf.placeholder(tf.int32, (<span style="color:#080;font-weight:bold">None</span>), name=<span style="color:#d20;background-color:#fff0f0">"original_lengths"</span>)
lengths_op = compute_lengths(
original_lengths_ph,
{
<span style="color:#d20;background-color:#fff0f0">'frame_step'</span>: frame_step,
<span style="color:#d20;background-color:#fff0f0">'n_fft'</span>: n_fft
}
)
self.assertEqual(lengths_op.dtype, tf.int32)
log_mel_op = LogMelSpectrogram(
sampling_rate=sampling_rate,
n_fft=n_fft,
frame_step=frame_step,
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0.0</span>,
upper_edge_hertz=<span style="color:#00d;font-weight:bold">8000.0</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">13</span>
)(audio_waves_ph)
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
lengths, log_mels = session.run(
[lengths_op, log_mel_op],
{
audio_waves_ph: np.array([audio_wave]),
original_lengths_ph: np.array([original_wave_length])
}
)
note(original_wave_length)
note(lengths)
note(log_mels.shape)
self.assertEqual(lengths[<span style="color:#00d;font-weight:bold">0</span>], log_mels.shape[<span style="color:#00d;font-weight:bold">1</span>])
</code></pre></div><p>And here’s the implementation:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">compute_lengths</span>(original_lengths, params):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Computes the length of data for CTC
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#080;font-weight:bold">return</span> tf.cast(
tf.floor(
(tf.cast(original_lengths, dtype=tf.float32) - params[<span style="color:#d20;background-color:#fff0f0">'n_fft'</span>]) /
params[<span style="color:#d20;background-color:#fff0f0">'frame_step'</span>]
) + <span style="color:#00d;font-weight:bold">1</span>,
tf.int32
)
</code></pre></div><h5 id="atrous-1d-convolutions-layer">Atrous 1D Convolutions layer</h5>
<p>It’s also a good idea to ensure that our dilated convolutions layer behaves as the theory predicts. TensorFlow already includes the ability to specify dilations, but the end result may differ wildly based on the choice of other parameters.</p>
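<p>“Causal” here means that each output sample may depend only on the current and past input samples; with dilation <code>d</code>, a kernel of size <code>k</code> reaches <code>(k - 1) * d</code> steps into the past. A pure-Python sketch of the intended behavior, using the same all-ones weights as the unit test:</p>

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] = sum_j kernel[j] * x[t - (k - 1 - j) * dilation], zero-padded on the left."""
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(x)
    return [
        sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ]
```

<p>With a size-2 kernel and dilation 1, each output is simply <code>x[t] + x[t - 1]</code>, which is exactly what the assertions below rely on.</p>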
<p>Let’s at least ensure that it works as intended when we choose the “causal” mode. The unit test:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_causal_conv1d_works</span>(self):
conv_size2_dilation_1 = AtrousConv1D(
filters=<span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">2</span>,
dilation_rate=<span style="color:#00d;font-weight:bold">1</span>,
kernel_initializer=tf.ones_initializer(),
use_bias=<span style="color:#080;font-weight:bold">False</span>
)
conv_size3_dilation_1 = AtrousConv1D(
filters=<span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">3</span>,
dilation_rate=<span style="color:#00d;font-weight:bold">1</span>,
kernel_initializer=tf.ones_initializer(),
use_bias=<span style="color:#080;font-weight:bold">False</span>
)
conv_size2_dilation_2 = AtrousConv1D(
filters=<span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">2</span>,
dilation_rate=<span style="color:#00d;font-weight:bold">2</span>,
kernel_initializer=tf.ones_initializer(),
use_bias=<span style="color:#080;font-weight:bold">False</span>
)
conv_size2_dilation_3 = AtrousConv1D(
filters=<span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">2</span>,
dilation_rate=<span style="color:#00d;font-weight:bold">3</span>,
kernel_initializer=tf.ones_initializer(),
use_bias=<span style="color:#080;font-weight:bold">False</span>
)
data = np.array(<span style="color:#038">list</span>(<span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">31</span>)))
data_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">1</span>))
size2_dilation_1_1 = conv_size2_dilation_1(data_ph)
size2_dilation_1_2 = conv_size2_dilation_1(size2_dilation_1_1)
size3_dilation_1_1 = conv_size3_dilation_1(data_ph)
size3_dilation_1_2 = conv_size3_dilation_1(size3_dilation_1_1)
size2_dilation_2_1 = conv_size2_dilation_2(data_ph)
size2_dilation_2_2 = conv_size2_dilation_2(size2_dilation_2_1)
size2_dilation_3_1 = conv_size2_dilation_3(data_ph)
size2_dilation_3_2 = conv_size2_dilation_3(size2_dilation_3_1)
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
outputs = session.run(
[
size2_dilation_1_1,
size2_dilation_1_2,
size3_dilation_1_1,
size3_dilation_1_2,
size2_dilation_2_1,
size2_dilation_2_2,
size2_dilation_3_1,
size2_dilation_3_2
],
{
data_ph: np.reshape(data, (<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">1</span>))
}
)
<span style="color:#080;font-weight:bold">for</span> ix, out <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(outputs):
out = np.squeeze(out)
outputs[ix] = out
self.assertEqual(out.shape[<span style="color:#00d;font-weight:bold">0</span>], <span style="color:#038">len</span>(data))
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">0</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">11</span>, <span style="color:#00d;font-weight:bold">13</span>, <span style="color:#00d;font-weight:bold">15</span>, <span style="color:#00d;font-weight:bold">17</span>, <span style="color:#00d;font-weight:bold">19</span>, <span style="color:#00d;font-weight:bold">21</span>, <span style="color:#00d;font-weight:bold">23</span>, <span style="color:#00d;font-weight:bold">25</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">29</span>, <span style="color:#00d;font-weight:bold">31</span>, <span style="color:#00d;font-weight:bold">33</span>, <span style="color:#00d;font-weight:bold">35</span>, <span style="color:#00d;font-weight:bold">37</span>, <span style="color:#00d;font-weight:bold">39</span>, <span style="color:#00d;font-weight:bold">41</span>, <span style="color:#00d;font-weight:bold">43</span>, <span style="color:#00d;font-weight:bold">45</span>, <span style="color:#00d;font-weight:bold">47</span>, <span style="color:#00d;font-weight:bold">49</span>, <span style="color:#00d;font-weight:bold">51</span>, <span style="color:#00d;font-weight:bold">53</span>, <span style="color:#00d;font-weight:bold">55</span>, <span style="color:#00d;font-weight:bold">57</span>, <span style="color:#00d;font-weight:bold">59</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">1</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">16</span>, <span style="color:#00d;font-weight:bold">20</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">28</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">40</span>, <span style="color:#00d;font-weight:bold">44</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">52</span>, <span style="color:#00d;font-weight:bold">56</span>, <span style="color:#00d;font-weight:bold">60</span>, <span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">68</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">76</span>, <span style="color:#00d;font-weight:bold">80</span>, <span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">88</span>, <span style="color:#00d;font-weight:bold">92</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">100</span>, <span style="color:#00d;font-weight:bold">104</span>, <span style="color:#00d;font-weight:bold">108</span>, <span style="color:#00d;font-weight:bold">112</span>, <span style="color:#00d;font-weight:bold">116</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">2</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">15</span>, <span style="color:#00d;font-weight:bold">18</span>, <span style="color:#00d;font-weight:bold">21</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">33</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">39</span>, <span style="color:#00d;font-weight:bold">42</span>, <span style="color:#00d;font-weight:bold">45</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">51</span>, <span style="color:#00d;font-weight:bold">54</span>, <span style="color:#00d;font-weight:bold">57</span>, <span style="color:#00d;font-weight:bold">60</span>, <span style="color:#00d;font-weight:bold">63</span>, <span style="color:#00d;font-weight:bold">66</span>, <span style="color:#00d;font-weight:bold">69</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">75</span>, <span style="color:#00d;font-weight:bold">78</span>, <span style="color:#00d;font-weight:bold">81</span>, <span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">87</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">3</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">10</span>, <span style="color:#00d;font-weight:bold">18</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">45</span>, <span style="color:#00d;font-weight:bold">54</span>, <span style="color:#00d;font-weight:bold">63</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">81</span>, <span style="color:#00d;font-weight:bold">90</span>, <span style="color:#00d;font-weight:bold">99</span>, <span style="color:#00d;font-weight:bold">108</span>, <span style="color:#00d;font-weight:bold">117</span>, <span style="color:#00d;font-weight:bold">126</span>, <span style="color:#00d;font-weight:bold">135</span>, <span style="color:#00d;font-weight:bold">144</span>, <span style="color:#00d;font-weight:bold">153</span>, <span style="color:#00d;font-weight:bold">162</span>, <span style="color:#00d;font-weight:bold">171</span>, <span style="color:#00d;font-weight:bold">180</span>, <span style="color:#00d;font-weight:bold">189</span>, <span style="color:#00d;font-weight:bold">198</span>, <span style="color:#00d;font-weight:bold">207</span>, <span style="color:#00d;font-weight:bold">216</span>, <span style="color:#00d;font-weight:bold">225</span>, <span style="color:#00d;font-weight:bold">234</span>, <span style="color:#00d;font-weight:bold">243</span>, <span style="color:#00d;font-weight:bold">252</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">4</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#00d;font-weight:bold">10</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">14</span>, <span style="color:#00d;font-weight:bold">16</span>, <span style="color:#00d;font-weight:bold">18</span>, <span style="color:#00d;font-weight:bold">20</span>, <span style="color:#00d;font-weight:bold">22</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">26</span>, <span style="color:#00d;font-weight:bold">28</span>, <span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">34</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">38</span>, <span style="color:#00d;font-weight:bold">40</span>, <span style="color:#00d;font-weight:bold">42</span>, <span style="color:#00d;font-weight:bold">44</span>, <span style="color:#00d;font-weight:bold">46</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">50</span>, <span style="color:#00d;font-weight:bold">52</span>, <span style="color:#00d;font-weight:bold">54</span>, <span style="color:#00d;font-weight:bold">56</span>, <span style="color:#00d;font-weight:bold">58</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">5</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">16</span>, <span style="color:#00d;font-weight:bold">20</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">28</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">40</span>, <span style="color:#00d;font-weight:bold">44</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">52</span>, <span style="color:#00d;font-weight:bold">56</span>, <span style="color:#00d;font-weight:bold">60</span>, <span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">68</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">76</span>, <span style="color:#00d;font-weight:bold">80</span>, <span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">88</span>, <span style="color:#00d;font-weight:bold">92</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">100</span>, <span style="color:#00d;font-weight:bold">104</span>, <span style="color:#00d;font-weight:bold">108</span>, <span style="color:#00d;font-weight:bold">112</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">6</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">11</span>, <span style="color:#00d;font-weight:bold">13</span>, <span style="color:#00d;font-weight:bold">15</span>, <span style="color:#00d;font-weight:bold">17</span>, <span style="color:#00d;font-weight:bold">19</span>, <span style="color:#00d;font-weight:bold">21</span>, <span style="color:#00d;font-weight:bold">23</span>, <span style="color:#00d;font-weight:bold">25</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">29</span>, <span style="color:#00d;font-weight:bold">31</span>, <span style="color:#00d;font-weight:bold">33</span>, <span style="color:#00d;font-weight:bold">35</span>, <span style="color:#00d;font-weight:bold">37</span>, <span style="color:#00d;font-weight:bold">39</span>, <span style="color:#00d;font-weight:bold">41</span>, <span style="color:#00d;font-weight:bold">43</span>, <span style="color:#00d;font-weight:bold">45</span>, <span style="color:#00d;font-weight:bold">47</span>, <span style="color:#00d;font-weight:bold">49</span>, <span style="color:#00d;font-weight:bold">51</span>, <span style="color:#00d;font-weight:bold">53</span>, <span style="color:#00d;font-weight:bold">55</span>, <span style="color:#00d;font-weight:bold">57</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">7</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">16</span>, <span style="color:#00d;font-weight:bold">20</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">28</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">40</span>, <span style="color:#00d;font-weight:bold">44</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">52</span>, <span style="color:#00d;font-weight:bold">56</span>, <span style="color:#00d;font-weight:bold">60</span>, <span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">68</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">76</span>, <span style="color:#00d;font-weight:bold">80</span>, <span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">88</span>, <span style="color:#00d;font-weight:bold">92</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">100</span>, <span style="color:#00d;font-weight:bold">104</span>, <span style="color:#00d;font-weight:bold">108</span>], dtype=np.float32)
)
</code></pre></div><p>And the layer’s code:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">AtrousConv1D</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self,
filters,
kernel_size,
dilation_rate,
use_bias=<span style="color:#080;font-weight:bold">True</span>,
kernel_initializer=tf.glorot_normal_initializer(),
causal=<span style="color:#080;font-weight:bold">True</span>
):
<span style="color:#038">super</span>(AtrousConv1D, self).__init__()
self.filters = filters
self.kernel_size = kernel_size
self.dilation_rate = dilation_rate
self.causal = causal
self.conv1d = tf.layers.Conv1D(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
padding=<span style="color:#d20;background-color:#fff0f0">'valid'</span> <span style="color:#080;font-weight:bold">if</span> causal <span style="color:#080;font-weight:bold">else</span> <span style="color:#d20;background-color:#fff0f0">'same'</span>,
use_bias=use_bias,
kernel_initializer=kernel_initializer
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs):
<span style="color:#080;font-weight:bold">if</span> self.causal:
padding = (self.kernel_size - <span style="color:#00d;font-weight:bold">1</span>) * self.dilation_rate
inputs = tf.pad(inputs, tf.constant([(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>,), (<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0</span>), (<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>)]) * padding)  <span style="color:#888"># scales to [(0, 0), (padding, 0), (0, 0)]: left-pad the time axis only</span>
<span style="color:#080;font-weight:bold">return</span> self.conv1d(inputs)
</code></pre></div><h5 id="residual-block-layer">Residual Block layer</h5>
<p>One aspect we haven’t covered yet is the heavy use of batch normalization. When coding the residual block layer, one of the most important tasks is ensuring that batch normalization is applied correctly both during training and during inference.</p>
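<p>As a reminder of what’s at stake, here’s a minimal NumPy sketch of that train-vs-inference distinction. It only normalizes per feature; the real <code>tf.layers.batch_normalization</code> additionally learns a scale and shift, and updates its moving statistics via the <code>UPDATE_OPS</code> collection:</p>

```python
import numpy as np

class ToyBatchNorm:
    # Minimal sketch: normalize with the current batch's statistics while
    # training, but with accumulated moving statistics at inference time.
    def __init__(self, num_features, momentum=0.99, eps=1e-3):
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # the moving averages are what inference will rely on
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)
```

<p>Forgetting the distinction (e.g. always using batch statistics) is a classic source of models that score well in training yet behave erratically when serving single examples.</p>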
<p>Here’s the unit test:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
npst.arrays(
np.float32,
(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
),
st.sampled_from([<span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">32</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">3</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>]),
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_residual_block_works</span>(self, audio_waves, filters, size, dilation_rate):
<span style="color:#080;font-weight:bold">with</span> tf.Graph().as_default() <span style="color:#080;font-weight:bold">as</span> g:
audio_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#080;font-weight:bold">None</span>))
log_mel_op = LogMelSpectrogram(
sampling_rate=<span style="color:#00d;font-weight:bold">16000</span>,
n_fft=<span style="color:#00d;font-weight:bold">512</span>,
frame_step=<span style="color:#00d;font-weight:bold">256</span>,
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0</span>,
upper_edge_hertz=<span style="color:#00d;font-weight:bold">8000</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">10</span>
)(audio_ph)
expanded_op = tf.layers.Dense(filters)(log_mel_op)
_, block_op = ResidualBlock(
filters=filters,
kernel_size=size,
causal=<span style="color:#080;font-weight:bold">True</span>,
dilation_rate=dilation_rate
)(expanded_op, training=<span style="color:#080;font-weight:bold">True</span>)
<span style="color:#888"># really dumb loss function just for the sake</span>
<span style="color:#888"># of testing:</span>
loss_op = tf.reduce_sum(block_op)
variables = tf.trainable_variables()
self.assertTrue(<span style="color:#038">any</span>([<span style="color:#d20;background-color:#fff0f0">"batch_normalization"</span> <span style="color:#080">in</span> var.name <span style="color:#080;font-weight:bold">for</span> var <span style="color:#080">in</span> variables]))
grads_op = tf.gradients(
loss_op,
variables
)
<span style="color:#080;font-weight:bold">for</span> grad, var <span style="color:#080">in</span> <span style="color:#038">zip</span>(grads_op, variables):
<span style="color:#080;font-weight:bold">if</span> grad <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
note(var)
self.assertTrue(grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>)
<span style="color:#080;font-weight:bold">with</span> tf.Session(graph=g) <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
result, expanded, grads, _ = session.run(
[block_op, expanded_op, grads_op, loss_op],
{
audio_ph: audio_waves
}
)
self.assertFalse(np.array_equal(result, expanded))
self.assertEqual(result.shape, expanded.shape)
self.assertEqual(<span style="color:#038">len</span>(grads), <span style="color:#038">len</span>(variables))
self.assertFalse(<span style="color:#038">any</span>([np.isnan(grad).any() <span style="color:#080;font-weight:bold">for</span> grad <span style="color:#080">in</span> grads]))
</code></pre></div><p>And here’s the implementation:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">ResidualBlock</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self, filters, kernel_size, dilation_rate, causal, **kwargs):
<span style="color:#038">super</span>(ResidualBlock, self).__init__(**kwargs)
self.dilated_conv1 = AtrousConv1D(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
causal=causal
)
self.dilated_conv2 = AtrousConv1D(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
causal=causal
)
self.out = tf.layers.Conv1D(
filters=filters,
kernel_size=<span style="color:#00d;font-weight:bold">1</span>
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs, training=<span style="color:#080;font-weight:bold">True</span>):
data = tf.layers.batch_normalization(
inputs,
training=training
)
filters = self.dilated_conv1(data)
gates = self.dilated_conv2(data)
filters = tf.nn.tanh(filters)
gates = tf.nn.sigmoid(gates)
out = tf.nn.tanh(
self.out(
filters * gates
)
)
<span style="color:#080;font-weight:bold">return</span> out + inputs, out
</code></pre></div><h5 id="residual-stack-layer">Residual Stack layer</h5>
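<p>Stacking blocks with growing dilation rates is what buys the network a large receptive field cheaply. Assuming the causal mode, and noting that each block’s filter and gate convolutions run in parallel (and its 1×1 convolution doesn’t widen the field), each block extends the lookback by <code>(kernel_size − 1) · dilation_rate</code>. A back-of-the-envelope sketch:</p>

```python
def stack_receptive_field(kernel_size, dilation_rates):
    # Each residual block contributes one dilated-conv "depth":
    # its filter and gate convolutions sit in parallel, and the 1x1
    # output convolution does not widen the receptive field.
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

# kernel size 3 with dilation rates [1, 2, 4]:
# 1 + 2 + 4 + 8 = 15 input steps visible to each output step.
stack_receptive_field(3, [1, 2, 4])
```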
<p>Testing the residual stack follows the same kind of logic:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
npst.arrays(
np.float32,
(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
),
st.sampled_from([<span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">32</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">3</span>])
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_residual_stack_works</span>(self, audio_waves, filters, size):
dilation_rates = [<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">4</span>]
<span style="color:#080;font-weight:bold">with</span> tf.Graph().as_default() <span style="color:#080;font-weight:bold">as</span> g:
audio_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#080;font-weight:bold">None</span>))
log_mel_op = LogMelSpectrogram(
sampling_rate=<span style="color:#00d;font-weight:bold">16000</span>,
n_fft=<span style="color:#00d;font-weight:bold">512</span>,
frame_step=<span style="color:#00d;font-weight:bold">256</span>,
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0</span>,
upper_edge_hertz=<span style="color:#00d;font-weight:bold">8000</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">10</span>
)(audio_ph)
expanded_op = tf.layers.Dense(filters)(log_mel_op)
stack_op = ResidualStack(
filters=filters,
kernel_size=size,
causal=<span style="color:#080;font-weight:bold">True</span>,
dilation_rates=dilation_rates
)(expanded_op, training=<span style="color:#080;font-weight:bold">True</span>)
<span style="color:#888"># really dumb loss function just for the sake</span>
<span style="color:#888"># of testing:</span>
loss_op = tf.reduce_sum(stack_op)
variables = tf.trainable_variables()
self.assertTrue(<span style="color:#038">any</span>([<span style="color:#d20;background-color:#fff0f0">"batch_normalization"</span> <span style="color:#080">in</span> var.name <span style="color:#080;font-weight:bold">for</span> var <span style="color:#080">in</span> variables]))
grads_op = tf.gradients(
loss_op,
variables
)
<span style="color:#080;font-weight:bold">for</span> grad, var <span style="color:#080">in</span> <span style="color:#038">zip</span>(grads_op, variables):
<span style="color:#080;font-weight:bold">if</span> grad <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
note(var)
self.assertTrue(grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>)
<span style="color:#080;font-weight:bold">with</span> tf.Session(graph=g) <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
result, expanded, grads, _ = session.run(
[stack_op, expanded_op, grads_op, loss_op],
{
audio_ph: audio_waves
}
)
self.assertFalse(np.array_equal(result, expanded))
self.assertEqual(result.shape, expanded.shape)
self.assertEqual(<span style="color:#038">len</span>(grads), <span style="color:#038">len</span>(variables))
self.assertFalse(<span style="color:#038">any</span>([np.isnan(grad).any() <span style="color:#080;font-weight:bold">for</span> grad <span style="color:#080">in</span> grads]))
</code></pre></div><p>With the layer’s code looking as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">ResidualStack</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self, filters, kernel_size, dilation_rates, causal, **kwargs):
<span style="color:#038">super</span>(ResidualStack, self).__init__(**kwargs)
self.blocks = [
ResidualBlock(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
causal=causal
)
<span style="color:#080;font-weight:bold">for</span> dilation_rate <span style="color:#080">in</span> dilation_rates
]
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs, training=<span style="color:#080;font-weight:bold">True</span>):
data = inputs
skip = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> block <span style="color:#080">in</span> self.blocks:
data, current_skip = block(data, training=training)
skip += current_skip
<span style="color:#080;font-weight:bold">return</span> skip
</code></pre></div><h5 id="the-speechnet">The SpeechNet</h5>
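<p>One detail worth calling out: the test below asserts that the logits have 5 output channels for the alphabet <code>'abcd'</code>. That’s because CTC reserves one extra class for the “blank” symbol on top of the alphabet:</p>

```python
# CTC needs one extra output class for the "blank" symbol:
alphabet = 'abcd'
num_classes = len(alphabet) + 1  # -> 5
```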
<p>Finally, let’s add a very similar test for the SpeechNet itself:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
npst.arrays(
np.float32,
(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
)
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_speech_net_works</span>(self, audio_waves):
<span style="color:#080;font-weight:bold">with</span> tf.Graph().as_default() <span style="color:#080;font-weight:bold">as</span> g:
audio_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#080;font-weight:bold">None</span>))
logits_op = SpeechNet(
experiment_params(
{},
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">4</span>],
stack_kernel_size=<span style="color:#00d;font-weight:bold">3</span>,
stack_filters=<span style="color:#00d;font-weight:bold">32</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">'abcd'</span>
)
)(audio_ph)
<span style="color:#888"># really dumb loss function just for the sake</span>
<span style="color:#888"># of testing:</span>
loss_op = tf.reduce_sum(logits_op)
variables = tf.trainable_variables()
self.assertTrue(<span style="color:#038">any</span>([<span style="color:#d20;background-color:#fff0f0">"batch_normalization"</span> <span style="color:#080">in</span> var.name <span style="color:#080;font-weight:bold">for</span> var <span style="color:#080">in</span> variables]))
grads_op = tf.gradients(
loss_op,
variables
)
<span style="color:#080;font-weight:bold">for</span> grad, var <span style="color:#080">in</span> <span style="color:#038">zip</span>(grads_op, variables):
<span style="color:#080;font-weight:bold">if</span> grad <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
note(var)
self.assertTrue(grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>)
<span style="color:#080;font-weight:bold">with</span> tf.Session(graph=g) <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
result, grads, _ = session.run(
[logits_op, grads_op, loss_op],
{
audio_ph: audio_waves
}
)
self.assertEqual(result.shape[<span style="color:#00d;font-weight:bold">2</span>], <span style="color:#00d;font-weight:bold">5</span>)
self.assertEqual(<span style="color:#038">len</span>(grads), <span style="color:#038">len</span>(variables))
self.assertFalse(<span style="color:#038">any</span>([np.isnan(grad).any() <span style="color:#080;font-weight:bold">for</span> grad <span style="color:#080">in</span> grads]))
</code></pre></div><p>And let’s provide the code that passes it:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">SpeechNet</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self, params, **kwargs):
<span style="color:#038">super</span>(SpeechNet, self).__init__(**kwargs)
self.to_log_mel = LogMelSpectrogram(
sampling_rate=params[<span style="color:#d20;background-color:#fff0f0">'sampling_rate'</span>],
n_fft=params[<span style="color:#d20;background-color:#fff0f0">'n_fft'</span>],
frame_step=params[<span style="color:#d20;background-color:#fff0f0">'frame_step'</span>],
lower_edge_hertz=params[<span style="color:#d20;background-color:#fff0f0">'lower_edge_hertz'</span>],
upper_edge_hertz=params[<span style="color:#d20;background-color:#fff0f0">'upper_edge_hertz'</span>],
num_mel_bins=params[<span style="color:#d20;background-color:#fff0f0">'num_mel_bins'</span>]
)
self.expand = tf.layers.Conv1D(
filters=params[<span style="color:#d20;background-color:#fff0f0">'stack_filters'</span>],
kernel_size=<span style="color:#00d;font-weight:bold">1</span>,
padding=<span style="color:#d20;background-color:#fff0f0">'same'</span>
)
self.stacks = [
ResidualStack(
filters=params[<span style="color:#d20;background-color:#fff0f0">'stack_filters'</span>],
kernel_size=params[<span style="color:#d20;background-color:#fff0f0">'stack_kernel_size'</span>],
dilation_rates=params[<span style="color:#d20;background-color:#fff0f0">'stack_dilation_rates'</span>],
causal=params[<span style="color:#d20;background-color:#fff0f0">'causal_convolutions'</span>]
)
<span style="color:#080;font-weight:bold">for</span> _ <span style="color:#080">in</span> <span style="color:#038">range</span>(params[<span style="color:#d20;background-color:#fff0f0">'stacks'</span>])
]
self.out = tf.layers.Conv1D(
filters=<span style="color:#038">len</span>(params[<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>]) + <span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">1</span>,
padding=<span style="color:#d20;background-color:#fff0f0">'same'</span>
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs, training=<span style="color:#080;font-weight:bold">True</span>):
data = self.to_log_mel(inputs)
data = tf.layers.batch_normalization(
data,
training=training
)
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">len</span>(data.shape) == <span style="color:#00d;font-weight:bold">2</span>:
data = tf.expand_dims(data, <span style="color:#00d;font-weight:bold">0</span>)
data = self.expand(data)
<span style="color:#080;font-weight:bold">for</span> stack <span style="color:#080">in</span> self.stacks:
data = stack(data, training=training)
data = tf.layers.batch_normalization(
data,
training=training
)
<span style="color:#080;font-weight:bold">return</span> self.out(data) + <span style="color:#00d;font-weight:bold">1e-8</span>
</code></pre></div><h5 id="the-model-function">The model function</h5>
<p>We have one last piece of code to cover before we can start the training: the <code>model_fn</code> that adheres to the TensorFlow Estimator API:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">model_fn</span>(features, labels, mode, params):
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">isinstance</span>(features, <span style="color:#038">dict</span>):
audio = features[<span style="color:#d20;background-color:#fff0f0">'audio'</span>]
original_lengths = features[<span style="color:#d20;background-color:#fff0f0">'length'</span>]
<span style="color:#080;font-weight:bold">else</span>:
audio, original_lengths = features
lengths = compute_lengths(original_lengths, params)
<span style="color:#080;font-weight:bold">if</span> labels <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
codes = encode_labels(labels, params)
network = SpeechNet(params)
is_training = mode==tf.estimator.ModeKeys.TRAIN
logits = network(audio, training=is_training)
text, predicted_codes = decode_logits(logits, lengths, params)
<span style="color:#080;font-weight:bold">if</span> mode == tf.estimator.ModeKeys.PREDICT:
predictions = {
<span style="color:#d20;background-color:#fff0f0">'logits'</span>: logits,
<span style="color:#d20;background-color:#fff0f0">'text'</span>: tf.sparse_tensor_to_dense(
text,
<span style="color:#d20;background-color:#fff0f0">''</span>
)
}
export_outputs = {
<span style="color:#d20;background-color:#fff0f0">'predictions'</span>: tf.estimator.export.PredictOutput(predictions)
}
<span style="color:#080;font-weight:bold">return</span> tf.estimator.EstimatorSpec(
mode,
predictions=predictions,
export_outputs=export_outputs
)
<span style="color:#080;font-weight:bold">else</span>:
loss = tf.reduce_mean(
tf.nn.ctc_loss(
labels=codes,
inputs=logits,
sequence_length=lengths,
time_major=<span style="color:#080;font-weight:bold">False</span>,
ignore_longer_outputs_than_inputs=<span style="color:#080;font-weight:bold">True</span>
)
)
mean_edit_distance = tf.reduce_mean(
tf.edit_distance(
tf.cast(predicted_codes, tf.int32),
codes
)
)
distance_metric = tf.metrics.mean(mean_edit_distance)
<span style="color:#080;font-weight:bold">if</span> mode == tf.estimator.ModeKeys.EVAL:
<span style="color:#080;font-weight:bold">return</span> tf.estimator.EstimatorSpec(
mode,
loss=loss,
eval_metric_ops={ <span style="color:#d20;background-color:#fff0f0">'edit_distance'</span>: distance_metric }
)
<span style="color:#080;font-weight:bold">elif</span> mode == tf.estimator.ModeKeys.TRAIN:
global_step = tf.train.get_or_create_global_step()
tf.summary.text(
<span style="color:#d20;background-color:#fff0f0">'train_predicted_text'</span>,
tf.sparse_tensor_to_dense(text, <span style="color:#d20;background-color:#fff0f0">''</span>)
)
tf.summary.scalar(<span style="color:#d20;background-color:#fff0f0">'train_edit_distance'</span>, mean_edit_distance)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
<span style="color:#080;font-weight:bold">with</span> tf.control_dependencies(update_ops):
train_op = tf.contrib.layers.optimize_loss(
loss=loss,
global_step=global_step,
learning_rate=params[<span style="color:#d20;background-color:#fff0f0">'lr'</span>],
optimizer=(params[<span style="color:#d20;background-color:#fff0f0">'optimizer'</span>]),
update_ops=update_ops,
clip_gradients=params[<span style="color:#d20;background-color:#fff0f0">'clip_gradients'</span>],
summaries=[
<span style="color:#d20;background-color:#fff0f0">"learning_rate"</span>,
<span style="color:#d20;background-color:#fff0f0">"loss"</span>,
<span style="color:#d20;background-color:#fff0f0">"global_gradient_norm"</span>,
]
)
<span style="color:#080;font-weight:bold">return</span> tf.estimator.EstimatorSpec(
mode,
loss=loss,
train_op=train_op
)
</code></pre></div><p>Using this API, we’ll get lots of stats in TensorBoard for free. It will also make it very easy to validate the model and to export it in the <code>SavedModel</code> format.</p>
<p>In order to easily experiment with different hyperparameters, I’ve also created a helper function as listed below:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">copy</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">experiment</span>(data_params=dataset_params(), **kwargs):
params = experiment_params(
data_params,
**kwargs
)
<span style="color:#038">print</span>(params)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
model_dir=<span style="color:#d20;background-color:#fff0f0">'stats/</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(experiment_name(params)),
params=params
)
train_spec = tf.estimator.TrainSpec(
input_fn=input_fn(
train_data,
params[<span style="color:#d20;background-color:#fff0f0">'data'</span>]
)
)
features = {
<span style="color:#d20;background-color:#fff0f0">"audio"</span>: tf.placeholder(dtype=tf.float32, shape=[<span style="color:#080;font-weight:bold">None</span>]),
<span style="color:#d20;background-color:#fff0f0">"length"</span>: tf.placeholder(dtype=tf.int32, shape=[])
}
serving_input_receiver_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
features
)
best_exporter = tf.estimator.BestExporter(
name=<span style="color:#d20;background-color:#fff0f0">"best_exporter"</span>,
serving_input_receiver_fn=serving_input_receiver_fn,
exports_to_keep=<span style="color:#00d;font-weight:bold">5</span>
)
eval_params = copy.deepcopy(params[<span style="color:#d20;background-color:#fff0f0">'data'</span>])
eval_params[<span style="color:#d20;background-color:#fff0f0">'augment'</span>] = <span style="color:#080;font-weight:bold">False</span>
eval_spec = tf.estimator.EvalSpec(
input_fn=input_fn(
eval_data,
eval_params
),
throttle_secs=<span style="color:#00d;font-weight:bold">60</span>*<span style="color:#00d;font-weight:bold">30</span>,
exporters=best_exporter
)
tf.estimator.train_and_evaluate(
estimator,
train_spec,
eval_spec
)
</code></pre></div><p>I’ve also written two more helpers: one to test the model’s accuracy and one to get the test set predictions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test</span>(data_params=dataset_params(), **kwargs):
params = experiment_params(
data_params,
**kwargs
)
<span style="color:#038">print</span>(params)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
model_dir=<span style="color:#d20;background-color:#fff0f0">'stats/</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(experiment_name(params)),
params=params
)
eval_params = copy.deepcopy(params[<span style="color:#d20;background-color:#fff0f0">'data'</span>])
eval_params[<span style="color:#d20;background-color:#fff0f0">'augment'</span>] = <span style="color:#080;font-weight:bold">False</span>
eval_params[<span style="color:#d20;background-color:#fff0f0">'epochs'</span>] = <span style="color:#00d;font-weight:bold">1</span>
eval_params[<span style="color:#d20;background-color:#fff0f0">'shuffle'</span>] = <span style="color:#080;font-weight:bold">False</span>
estimator.evaluate(
input_fn=input_fn(
test_data,
eval_params
)
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">predict_test</span>(**kwargs):
params = experiment_params(
dataset_params(
augment=<span style="color:#080;font-weight:bold">False</span>,
shuffle=<span style="color:#080;font-weight:bold">False</span>,
batch_size=<span style="color:#00d;font-weight:bold">1</span>,
epochs=<span style="color:#00d;font-weight:bold">1</span>,
parallelize=<span style="color:#080;font-weight:bold">False</span>
),
**kwargs
)
<span style="color:#038">print</span>(<span style="color:#038">len</span>(test_data))
estimator = tf.estimator.Estimator(
model_fn=model_fn,
model_dir=<span style="color:#d20;background-color:#fff0f0">'stats/</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(experiment_name(params)),
params=params
)
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">list</span>(
estimator.predict(
input_fn=input_fn(
test_data,
params[<span style="color:#d20;background-color:#fff0f0">'data'</span>]
)
)
)
</code></pre></div><p>These in turn depend on the following functions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">experiment_params</span>(data,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Adam'</span>,
lr=<span style="color:#00d;font-weight:bold">1e-4</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">" 'abcdefghijklmnopqrstuvwxyz"</span>,
causal_convolutions=<span style="color:#080;font-weight:bold">True</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">81</span>],
stacks=<span style="color:#00d;font-weight:bold">2</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">3</span>,
stack_filters=<span style="color:#00d;font-weight:bold">32</span>,
sampling_rate=<span style="color:#00d;font-weight:bold">16000</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>,
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0</span>,
upper_edge_hertz=<span style="color:#00d;font-weight:bold">8000</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
clip_gradients=<span style="color:#080;font-weight:bold">None</span>,
codename=<span style="color:#d20;background-color:#fff0f0">'regular'</span>,
**kwargs):
params = {
<span style="color:#d20;background-color:#fff0f0">'optimizer'</span>: optimizer,
<span style="color:#d20;background-color:#fff0f0">'lr'</span>: lr,
<span style="color:#d20;background-color:#fff0f0">'data'</span>: data,
<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>: alphabet,
<span style="color:#d20;background-color:#fff0f0">'causal_convolutions'</span>: causal_convolutions,
<span style="color:#d20;background-color:#fff0f0">'stack_dilation_rates'</span>: stack_dilation_rates,
<span style="color:#d20;background-color:#fff0f0">'stacks'</span>: stacks,
<span style="color:#d20;background-color:#fff0f0">'stack_kernel_size'</span>: stack_kernel_size,
<span style="color:#d20;background-color:#fff0f0">'stack_filters'</span>: stack_filters,
<span style="color:#d20;background-color:#fff0f0">'sampling_rate'</span>: sampling_rate,
<span style="color:#d20;background-color:#fff0f0">'n_fft'</span>: n_fft,
<span style="color:#d20;background-color:#fff0f0">'frame_step'</span>: frame_step,
<span style="color:#d20;background-color:#fff0f0">'lower_edge_hertz'</span>: lower_edge_hertz,
<span style="color:#d20;background-color:#fff0f0">'upper_edge_hertz'</span>: upper_edge_hertz,
<span style="color:#d20;background-color:#fff0f0">'num_mel_bins'</span>: num_mel_bins,
<span style="color:#d20;background-color:#fff0f0">'clip_gradients'</span>: clip_gradients,
<span style="color:#d20;background-color:#fff0f0">'codename'</span>: codename
}
<span style="color:#080;font-weight:bold">if</span> kwargs <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span> <span style="color:#080">and</span> <span style="color:#d20;background-color:#fff0f0">'data'</span> <span style="color:#080">in</span> kwargs:
params[<span style="color:#d20;background-color:#fff0f0">'data'</span>] = { **params[<span style="color:#d20;background-color:#fff0f0">'data'</span>], **kwargs[<span style="color:#d20;background-color:#fff0f0">'data'</span>] }
<span style="color:#080;font-weight:bold">del</span> kwargs[<span style="color:#d20;background-color:#fff0f0">'data'</span>]
<span style="color:#080;font-weight:bold">if</span> kwargs <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
params = { **params, **kwargs }
<span style="color:#080;font-weight:bold">return</span> params
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">experiment_name</span>(params, excluded_keys=[<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>, <span style="color:#d20;background-color:#fff0f0">'data'</span>, <span style="color:#d20;background-color:#fff0f0">'lr'</span>, <span style="color:#d20;background-color:#fff0f0">'clip_gradients'</span>]):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">represent</span>(key, value):
<span style="color:#080;font-weight:bold">if</span> key <span style="color:#080">in</span> excluded_keys:
<span style="color:#080;font-weight:bold">return</span> <span style="color:#080;font-weight:bold">None</span>
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">isinstance</span>(value, <span style="color:#038">list</span>):
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">'</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">_</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(key, <span style="color:#d20;background-color:#fff0f0">'_'</span>.join([<span style="color:#038">str</span>(v) <span style="color:#080;font-weight:bold">for</span> v <span style="color:#080">in</span> value]))
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">'</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">_</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(key, value)
parts = <span style="color:#038">filter</span>(
<span style="color:#080;font-weight:bold">lambda</span> p: p <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>,
[
represent(k, params[k])
<span style="color:#080;font-weight:bold">for</span> k <span style="color:#080">in</span> <span style="color:#038">sorted</span>(params.keys())
]
)
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">'/'</span>.join(parts)
</code></pre></div><p>Each new set of hyperparameters constitutes a different “experiment”. Each one writes its statistics to a separate directory, which makes the runs easy to filter in TensorBoard.</p>
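<p>To make the naming scheme concrete, here’s a standalone sketch of how <code>experiment_name</code> builds such a directory name — the parameter values below are made up:</p>

```python
# Standalone copy of the naming scheme from experiment_name above;
# excluded keys don't show up in the resulting directory path.
def experiment_name(params, excluded_keys=('alphabet', 'data', 'lr', 'clip_gradients')):
    def represent(key, value):
        if key in excluded_keys:
            return None
        if isinstance(value, list):
            return '{}_{}'.format(key, '_'.join(str(v) for v in value))
        return '{}_{}'.format(key, value)

    parts = filter(
        lambda p: p is not None,
        [represent(k, params[k]) for k in sorted(params.keys())]
    )
    return '/'.join(parts)

# Hypothetical params, just to show the shape of the output:
print(experiment_name({
    'codename': 'regular',
    'stacks': 2,
    'stack_dilation_rates': [1, 3, 9],
    'lr': 1e-4
}))
# codename_regular/stack_dilation_rates_1_3_9/stacks_2
```

<p>The sorted, filtered keys give every experiment a stable, human-readable path under <code>stats/</code>, which is exactly what TensorBoard’s run filter operates on.</p>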
<p>The <code>experiment</code> function uses TensorFlow’s <code>train_and_evaluate</code> function, which periodically tests the model against the validation set. This is how we gauge how well it generalizes. It also uses the <code>tf.estimator.BestExporter</code> class to automatically export <code>SavedModel</code> files for the best-performing versions.</p>
<h5 id="other-aspects">Other aspects</h5>
<p>Covering the full code listing wouldn’t be practical for an article like this; we’ve covered the most important parts above. I invite you to have a look at the Jupyter notebook itself, which is hosted on GitHub: <a href="https://github.com/kamilc/speech-recognition">kamilc/speech-recognition</a>.</p>
<h3 id="lets-train-it">Let’s train it</h3>
<p>Before we can dive in and start training the model using the code above, we need to set a few things up.</p>
<p>First of all, I’m using Docker. This way I’m not constrained by, for example, the version of CUDA I have installed.</p>
<p>Here’s the Dockerfile for this project:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-dockerfile" data-lang="dockerfile"><span style="color:#080;font-weight:bold">FROM</span><span style="color:#d20;background-color:#fff0f0"> tensorflow/tensorflow:latest-devel-gpu-py3</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> apt-get update<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> apt-get install -y ffmpeg git cmake<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> pip install matplotlib pandas scikit-learn librosa seaborn hickle hypothesis[pandas]<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> mkdir -p /home/data-science/projects<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">VOLUME</span><span style="color:#d20;background-color:#fff0f0"> /home/data-science/projects</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"c.NotebookApp.token = ''"</span> >> ~/.jupyter/jupyter_notebook_config.py<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"c.NotebookApp.password = ''"</span> >> ~/.jupyter/jupyter_notebook_config.py<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">WORKDIR</span><span style="color:#d20;background-color:#fff0f0"> /home/data-science/projects</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> pip install git+https://github.com/Supervisor/supervisor && <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> mkdir -p /var/log/supervisor<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">ADD</span> supervisor.conf /etc/supervisor.conf<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">EXPOSE</span><span style="color:#d20;background-color:#fff0f0"> 80</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">EXPOSE</span><span style="color:#d20;background-color:#fff0f0"> 6006</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">CMD</span> supervisord -c /etc/supervisor.conf<span style="color:#a61717;background-color:#e3d2d2">
</span></code></pre></div><p>I also like to make my life easier with a Makefile that automates common project-related tasks:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-makefile" data-lang="makefile"><span style="color:#06b;font-weight:bold">build</span>:
nvidia-docker build -t speech-recognition:latest .
<span style="color:#06b;font-weight:bold">run</span>:
nvidia-docker run -p 80:80 -p 6006:6006 --shm-size 16G --mount <span style="color:#369">type</span>=bind,source=/home/kamil/projects/speech-recognition,target=/home/data-science/projects -it speech-recognition
<span style="color:#06b;font-weight:bold">bash</span>:
nvidia-docker run --mount <span style="color:#369">type</span>=bind,source=/home/kamil/projects/speech-recognition,target=/home/data-science/projects -it speech-recognition bash
</code></pre></div><p>We’ll use TensorBoard to visualize the progress, and at the same time we need the Jupyter notebook server to be running. To run both in one container, we’ll use a supervisor daemon. Here’s its config file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ini" data-lang="ini"><span style="color:#080;font-weight:bold">[supervisord]</span>
<span style="color:#369">nodaemon</span>=<span style="color:#d20;background-color:#fff0f0">true</span>
<span style="color:#080;font-weight:bold">[program:jupyter]</span>
<span style="color:#369">command</span>=<span style="color:#d20;background-color:#fff0f0">bash -c "source /etc/bash.bashrc && jupyter notebook --notebook-dir=/home/data-science/projects --ip 0.0.0.0 --no-browser --allow-root --port=80"</span>
<span style="color:#080;font-weight:bold">[program:tensorboard]</span>
<span style="color:#369">command</span>=<span style="color:#d20;background-color:#fff0f0">tensorboard --logdir /home/data-science/projects/stats</span>
</code></pre></div><p>In order to run the Jupyter notebook and start experimenting, first build the image:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">make build
</code></pre></div><p>And then start the container with TensorFlow, Jupyter, and TensorBoard:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">make run
</code></pre></div><p>The notebook includes a helper function for running experiments. Here’s the invocation with the set of parameters that worked best for me:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">experiment(
dataset_params(
batch_size=<span style="color:#00d;font-weight:bold">18</span>,
epochs=<span style="color:#00d;font-weight:bold">10</span>,
max_wave_length=<span style="color:#00d;font-weight:bold">320000</span>,
augment=<span style="color:#080;font-weight:bold">True</span>,
random_noise=<span style="color:#00d;font-weight:bold">0.75</span>,
random_noise_factor_min=<span style="color:#00d;font-weight:bold">0.1</span>,
random_noise_factor_max=<span style="color:#00d;font-weight:bold">0.15</span>,
random_stretch_min=<span style="color:#00d;font-weight:bold">0.8</span>,
random_stretch_max=<span style="color:#00d;font-weight:bold">1.2</span>
),
codename=<span style="color:#d20;background-color:#fff0f0">'deep_max_20_seconds'</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">' !"&</span><span style="color:#04d;background-color:#fff0f0">\'</span><span style="color:#d20;background-color:#fff0f0">,-.01234:;</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">abcdefghijklmnopqrstuvwxyz'</span>, <span style="color:#888"># !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz</span>
causal_convolutions=<span style="color:#080;font-weight:bold">False</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>],
stacks=<span style="color:#00d;font-weight:bold">6</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">7</span>,
stack_filters=<span style="color:#00d;font-weight:bold">3</span>*<span style="color:#00d;font-weight:bold">128</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">8</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Momentum'</span>,
lr=<span style="color:#00d;font-weight:bold">0.00001</span>,
clip_gradients=<span style="color:#00d;font-weight:bold">20.0</span>
)
</code></pre></div><p>The training process takes a lot of time. On my machine, it took more than two weeks. Searching for the best set of parameters is very difficult (and not fun).</p>
<p>The function accepts <code>max_text_length</code> as one of its parameters. I first ran the experiments with it set to some small value (e.g. 15 characters), which constrains the data set to a narrow set of “easy” files. The reason is that issues with the architecture are easy to spot on an easy set: if the model isn’t converging even here, then we surely have a bug.</p>
<p>For the main training procedure, this parameter is kept unset.</p>
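<p>The same smoke-testing idea can be expressed as a plain-Python filter. The example structure and the <code>text</code> field below are hypothetical — in the notebook, the equivalent filtering happens inside the input pipeline:</p>

```python
# Hypothetical sketch: keep only examples with short transcriptions,
# mirroring what a small max_text_length does to the data set.
def easy_subset(examples, max_text_length=15):
    return [ex for ex in examples if len(ex['text']) <= max_text_length]

examples = [
    {'text': 'hello world'},
    {'text': 'a much longer transcription of speech'},
]
print(len(easy_subset(examples)))  # 1
```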
<h3 id="results">Results</h3>
<p>With TensorBoard, we get a handy tool for monitoring the progress. I made the <code>model_fn</code> output the <a href="https://en.wikipedia.org/wiki/Edit_distance">edit distance</a> statistics for both the training and the evaluation set.</p>
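<p>As a refresher, the edit distance is the minimum number of single-character insertions, deletions, and substitutions turning one string into another. A minimal pure-Python version looks like this (<code>tf.edit_distance</code> computes the same quantity, except that it works on sparse tensors of label codes and normalizes by length by default):</p>

```python
# Minimal Levenshtein (edit) distance via dynamic programming.
def edit_distance(a, b):
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]

print(edit_distance('speech', 'speach'))  # 1
```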
<p>The statistics for the <a href="https://en.wikipedia.org/wiki/Connectionist_temporal_classification">CTC Loss</a> are included by default.</p>
<p>Here are the charts for the final model included in the GitHub repo:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/training-1.png" alt=""></p>
<p>One thing to notice is that I paused the training between December 20 and 30.</p>
<p>The above chart presents the <strong>training-time</strong> edit distance. Thanks to the fairly aggressive data augmentation, the training and validation edit distances didn’t differ hugely throughout the whole process.</p>
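<p>For intuition, here’s a hypothetical NumPy sketch of the random-noise part of that augmentation. The function name and the exact scaling are my own; only the <code>random_noise*</code> parameter names come from <code>dataset_params</code>:</p>

```python
import numpy as np

# Hypothetical sketch: with some probability, add scaled Gaussian noise
# to the waveform, as controlled by random_noise_factor_min/max above.
def add_random_noise(wave, factor_min=0.1, factor_max=0.15,
                     probability=0.75, rng=None):
    rng = rng or np.random.default_rng(0)
    if rng.random() > probability:
        return wave  # leave this example unaugmented
    factor = rng.uniform(factor_min, factor_max)
    noise = rng.standard_normal(wave.shape).astype(wave.dtype)
    return wave + factor * np.abs(wave).max() * noise

wave = np.sin(np.linspace(0, 100, 16000)).astype(np.float32)
augmented = add_random_noise(wave)
print(augmented.shape)  # (16000,)
```

<p>Because each epoch sees a differently perturbed copy of every file, the model can’t simply memorize the training waveforms, which is consistent with the small train–validation gap above.</p>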
<p>The following image shows the CTC loss, with the orange line representing the evaluation runs.</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/training-2.png" alt=""></p>
<p>The evaluation edit distance is shown below. I stopped the training once the gain over a whole day dropped below <code>0.005</code>.</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/training-3.png" alt=""></p>
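<p>That stopping rule can be expressed as a simple check. Here’s a sketch (the actual decision was made by eyeballing TensorBoard, and the once-per-day sampling of evaluation scores is an assumption of this illustration):</p>

```python
def should_stop(edit_distances, min_daily_gain=0.005):
    """Return True when the improvement over the last day of
    evaluations falls below the threshold.

    edit_distances: evaluation edit distances, oldest first,
    assumed here to be sampled roughly once per day.
    """
    if len(edit_distances) < 2:
        return False
    gain = edit_distances[-2] - edit_distances[-1]
    return gain < min_daily_gain

print(should_stop([0.12, 0.10]))    # gain 0.02: keep training
print(should_stop([0.081, 0.079]))  # gain 0.002: stop
```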
<p>Every machine learning model should be rigorously measured against meaningful accuracy statistics. Let’s see how we did:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">test(
dataset_params(
batch_size=<span style="color:#00d;font-weight:bold">18</span>,
epochs=<span style="color:#00d;font-weight:bold">10</span>,
max_wave_length=<span style="color:#00d;font-weight:bold">320000</span>,
augment=<span style="color:#080;font-weight:bold">True</span>,
random_noise=<span style="color:#00d;font-weight:bold">0.75</span>,
random_noise_factor_min=<span style="color:#00d;font-weight:bold">0.1</span>,
random_noise_factor_max=<span style="color:#00d;font-weight:bold">0.15</span>,
random_stretch_min=<span style="color:#00d;font-weight:bold">0.8</span>,
random_stretch_max=<span style="color:#00d;font-weight:bold">1.2</span>
),
codename=<span style="color:#d20;background-color:#fff0f0">'deep_max_20_seconds'</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">' !"&</span><span style="color:#04d;background-color:#fff0f0">\'</span><span style="color:#d20;background-color:#fff0f0">,-.01234:;</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">abcdefghijklmnopqrstuvwxyz'</span>, <span style="color:#888"># !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz</span>
causal_convolutions=<span style="color:#080;font-weight:bold">False</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>],
stacks=<span style="color:#00d;font-weight:bold">6</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">7</span>,
stack_filters=<span style="color:#00d;font-weight:bold">3</span>*<span style="color:#00d;font-weight:bold">128</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">8</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Momentum'</span>,
lr=<span style="color:#00d;font-weight:bold">0.00001</span>,
clip_gradients=<span style="color:#00d;font-weight:bold">20.0</span>
)
</code></pre></div><p>The output:</p>
<pre tabindex="0"><code>(...)
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-07-10:51:09
INFO:tensorflow:Saving dict for global step 1525345: edit_distance = 0.07922124, global_step = 1525345, loss = 13.410753
(...)
</code></pre><p>This shows that for the test set, we’ve scored <code>0.079</code> in edit distance. Inverting it, we could (somewhat naively) call it an accuracy of <code>92.1%</code>, which is not too bad. The result would be officially reported as <code>7.9 LER</code>.</p>
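<p>The label error rate is just the edit (Levenshtein) distance normalized by the reference length. A minimal pure-Python illustration (the numbers above come from TensorFlow’s edit distance op, not from this sketch):</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def label_error_rate(predicted, reference):
    return levenshtein(predicted, reference) / len(reference)

# 'sek' vs. 'seek': one missing letter out of four reference characters
print(round(label_error_rate('sek', 'seek'), 3))  # 0.25
```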
<p>What’s even nicer is the size of the model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">ls stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/export/best_exporter/1546198558/variables -lh
total 204M
</code></pre></div><p>That’s <code>204MB</code> for a model trained on the 375k+ dataset with aggressive augmentation (which effectively makes the resulting dataset a couple of times bigger).</p>
<p>It’s always nice to <strong>see</strong> what the results look like. Here’s the code that runs the model through the whole test set and gathers the predicted transcriptions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">test_results = predict_test(
codename=<span style="color:#d20;background-color:#fff0f0">'deep_max_20_seconds'</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">' !"&</span><span style="color:#04d;background-color:#fff0f0">\'</span><span style="color:#d20;background-color:#fff0f0">,-.01234:;</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">abcdefghijklmnopqrstuvwxyz'</span>, <span style="color:#888"># !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz</span>
causal_convolutions=<span style="color:#080;font-weight:bold">False</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>],
stacks=<span style="color:#00d;font-weight:bold">6</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">7</span>,
stack_filters=<span style="color:#00d;font-weight:bold">3</span>*<span style="color:#00d;font-weight:bold">128</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">8</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Momentum'</span>,
lr=<span style="color:#00d;font-weight:bold">0.00001</span>,
clip_gradients=<span style="color:#00d;font-weight:bold">20.0</span>
)
[ <span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">''</span>.join(t[<span style="color:#d20;background-color:#fff0f0">'text'</span>]) <span style="color:#080;font-weight:bold">for</span> t <span style="color:#080">in</span> test_results ]
</code></pre></div><p>And the excerpt of the above is:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">[<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'without the dotaset the artice suistles'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i've got to go to him"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'and you know it'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'down below in the darknes were hundrededs of people sleping in peace'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'strange images pased through my mind'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'the shep had taught him that'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'it was glaringly hot not a clou in hesky nor a breath of wind'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'your son went to serve at a distant place and became a cinturion'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'they made a boy continue tiging but he found nothing'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'the shoreas in da'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'fol the instructions here'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"the're caling to u not to give up and to kep on fighting"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'the shop was closed on monis'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'even coming down on the train together she wrote me'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i'm going away he said"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"he wasn't asking for help"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'some of the grynsh was faling of the circular edge'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i'd like to think"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'the alchemist robably already knew al that'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"you 'l take fiftly and like et"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'it was droping of in flakes and raining down on the sand'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"what's your name he asked"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"it's because you were not born"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'what do you think of that'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"if i had told tyo o you wouldn't have sen the pyramids"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i havn't hert the baby complain yet"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'i told him wit could teach hr to ignore people who was had tend'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"the one you're blocking"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'henderson stod up with a spade in his hand'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"he didn't ned to sek out the old woman for this"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'only a minority of literature is reaten this way'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i wish you wouldn't"</span>,
...]
</code></pre></div><p>Seems quite okay. You can immediately notice that some words are misspelled. This stems from the nature of the CTC algorithm itself: we’re <strong>predicting letters</strong> instead of words here. The upside is that the problem of out-of-vocabulary words is lessened. The downside is that you’ll sometimes get e.g. ‘sek’ instead of ‘seek’. Because we’re outputting the logits for each example, it’s possible to use e.g. <a href="https://github.com/githubharald/CTCWordBeamSearch">CTCWordBeamSearch</a> to constrain the output’s tokens to ones known within the corpus, making it predict whole words instead.</p>
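<p>To make the letter-level behavior concrete, here’s a toy greedy CTC decoder: the network emits one symbol (or a special blank) per frame, and decoding collapses consecutive repeats and then drops blanks. The model itself uses TensorFlow’s CTC ops, so treat this purely as an illustration (the <code>_</code> blank symbol is an arbitrary stand-in):</p>

```python
BLANK = '_'  # stand-in for the CTC blank symbol

def ctc_greedy_decode(frame_symbols):
    """Collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for s in frame_symbols:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return ''.join(out)

# 'seek' needs a blank between the two e's to survive collapsing;
# without it, repeated 'e' frames collapse into a single 'e'
print(ctc_greedy_decode(list('ss_e_e_kk')))  # seek
print(ctc_greedy_decode(list('ss_ee_kk')))   # sek
```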
<p>Here’s the last little fun test: speech to text on the utterance I created on my laptop:</p>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/test-me.m4a" type="audio/mp4">
</audio>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">results = predict(
<span style="color:#d20;background-color:#fff0f0">'cv_corpus_v1/test-me.m4a'</span>,
codename=<span style="color:#d20;background-color:#fff0f0">'deep_max_20_seconds'</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">' !"&</span><span style="color:#04d;background-color:#fff0f0">\'</span><span style="color:#d20;background-color:#fff0f0">,-.01234:;</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">abcdefghijklmnopqrstuvwxyz'</span>, <span style="color:#888"># !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz</span>
causal_convolutions=<span style="color:#080;font-weight:bold">False</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>],
stacks=<span style="color:#00d;font-weight:bold">6</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">7</span>,
stack_filters=<span style="color:#00d;font-weight:bold">3</span>*<span style="color:#00d;font-weight:bold">128</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">8</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Momentum'</span>,
lr=<span style="color:#00d;font-weight:bold">0.00001</span>,
clip_gradients=<span style="color:#00d;font-weight:bold">20.0</span>
)
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">''</span>.join(results[<span style="color:#00d;font-weight:bold">0</span>][<span style="color:#d20;background-color:#fff0f0">'text'</span>])
</code></pre></div><p>The result:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'it semed to work just fine'</span>
</code></pre></div><h3 id="project-on-github">Project on GitHub</h3>
<p>The full Jupyter notebook’s code for this article can be found on GitHub: <a href="https://github.com/kamilc/speech-recognition">kamilc/speech-recognition</a>.</p>
<p>The repository includes the bz2 archive of the best performing model I’ve trained. You can download it and run it as a web service via <a href="https://www.tensorflow.org/serving/">TensorFlow Serving</a>, which we will cover in the next and last section here.</p>
<h3 id="serving-the-model-with-the-tensorflow-serving">Serving the model with TensorFlow Serving</h3>
<p>The last step in this project is to serve our trained model as a web service. Thankfully, the TensorFlow project includes a ready-to-use, free “model server”: <a href="https://www.tensorflow.org/serving/">TensorFlow Serving</a>.</p>
<p>The idea behind it is that we can run it, pointing it at the directory containing models saved in TensorFlow’s SavedModel format.</p>
<p>The deployment is extremely straightforward if you’re okay with running it from a Docker container. Let’s first pull the image:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">docker pull tensorflow/serving
</code></pre></div><p>Next, we need to download the saved model we’ve trained in this article from GitHub:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ wget https://github.com/kamilc/speech-recognition/raw/master/best.tar.bz2
$ tar xvjf best.tar.bz2
</code></pre></div><p>In the next step, we need to start a container from the TensorFlow Serving image, making it:</p>
<ul>
<li>expose its port to the outside</li>
<li>mount the directory containing our model</li>
<li>set the <code>MODEL_NAME</code> environment variable</li>
</ul>
<p>As follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">docker run -t --rm -p 8501:8501 -v <span style="color:#d20;background-color:#fff0f0">"/home/kamil/projects/speech-recognition/best/1546646971:/models/speech/1"</span> -e <span style="color:#369">MODEL_NAME</span>=speech tensorflow/serving
</code></pre></div><p>The service communicates via JSON payloads. Let’s prepare a <code>payload.json</code> file containing our request payload:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-json" data-lang="json">{<span style="color:#b06;font-weight:bold">"inputs"</span>: {<span style="color:#b06;font-weight:bold">"audio"</span>: <span style="color:#a61717;background-color:#e3d2d2"><audio-data-here></span>, <span style="color:#b06;font-weight:bold">"length"</span>: <span style="color:#a61717;background-color:#e3d2d2"><audio-raw-signal-length-here></span>}}
</code></pre></div><p>We can now easily query the web service with the prepared request audio data:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">curl -d @payload.json <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> -X POST http://localhost:8501/v1/models/speech:predict
</code></pre></div><p>Here’s what our intelligent web service responds with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-json" data-lang="json">{
<span style="color:#b06;font-weight:bold">"outputs"</span>: {
<span style="color:#b06;font-weight:bold">"text"</span>: [
[
<span style="color:#d20;background-color:#fff0f0">"c"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">"v"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">"r"</span>,
<span style="color:#d20;background-color:#fff0f0">"y"</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">"h"</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"n"</span>,
<span style="color:#d20;background-color:#fff0f0">"g"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"n"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">"h"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"u"</span>,
<span style="color:#d20;background-color:#fff0f0">"n"</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"v"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">"r"</span>,
<span style="color:#d20;background-color:#fff0f0">"s"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"o"</span>,
<span style="color:#d20;background-color:#fff0f0">"v"</span>,
<span style="color:#d20;background-color:#fff0f0">"a"</span>,
<span style="color:#d20;background-color:#fff0f0">"l"</span>,
<span style="color:#d20;background-color:#fff0f0">"s"</span>,
<span style="color:#d20;background-color:#fff0f0">"h"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">"d"</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"n"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"a"</span>,
<span style="color:#d20;background-color:#fff0f0">"w"</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"j"</span>,
<span style="color:#d20;background-color:#fff0f0">"g"</span>,
<span style="color:#d20;background-color:#fff0f0">"m"</span>,
<span style="color:#d20;background-color:#fff0f0">"f"</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">"a"</span>,
<span style="color:#d20;background-color:#fff0f0">"r"</span>,
<span style="color:#d20;background-color:#fff0f0">"y"</span>,
<span style="color:#d20;background-color:#fff0f0">"s"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>
]
],
<span style="color:#b06;font-weight:bold">"logits"</span>: [
[
<span style="color:#a61717;background-color:#e3d2d2"><logits-here></span>
]
]
}
}
</code></pre></div>
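<p>For completeness, here’s a sketch of the client side in plain Python: building <code>payload.json</code> and joining the response’s character list back into a transcription. The field names follow the request and response shapes shown above; the signal and the response dict are hard-coded stand-ins so the sketch stays self-contained:</p>

```python
import json

# Build the request payload. The 'audio' and 'length' field names follow
# the serving signature above; the tiny hard-coded signal is a stand-in
# for real decoded raw audio samples.
signal = [0.0, 0.1, -0.2, 0.05]
request = {'inputs': {'audio': signal, 'length': len(signal)}}
with open('payload.json', 'w') as f:
    json.dump(request, f)

# Parse a response of the shape shown above. A real client would get
# this dict back from the HTTP call instead of hard-coding it.
response = {'outputs': {'text': [['i', 't', ' ', 'w', 'o', 'r', 'k', 's']]}}
transcriptions = [''.join(chars) for chars in response['outputs']['text']]
print(transcriptions[0])  # it works
```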
Image Recognition Toolshttps://www.endpointdev.com/blog/2018/10/image-recognition-tools/2018-10-10T00:00:00+00:00Muhammad Najmi bin Ahmad Zabidi
<p><img src="/blog/2018/10/image-recognition-tools/image-1.jpg" alt="detecting 1 face" /></p>
<p>I’m always impressed with the advancement of machine learning and, more recently, deep learning. However, since I am not an expert in the field, I’ll leave the deeper explanations to the researchers and scholars.</p>
<p>In this post I will share the tools I found and the libraries needed to make them work, at least for me.</p>
<p>The reason I explored these tools is simple: I plan to deploy a poor man’s security camera in my home with some “sense” of intelligence. Since I work at home, I want to know who is actually knocking on my door. So I thought, what if I could use a web cam to monitor my door and let me know who’s actually standing at the door?</p>
<h3 id="face-detection">Face Detection</h3>
<p>I searched around for existing face detection software and found <a href="https://github.com/shantnu/FaceDetect/blob/master/face_detect.py">this Python script</a> using <a href="https://github.com/opencv/opencv/tree/master/data/haarcascades">Haar cascades</a>. So I was able to detect faces, but upon sharing the “findings” with a friend, he said this only detects faces. How would the computer be able to recognize who’s who? Then I stumbled upon the phrase “face recognition”.</p>
<p>You might have noticed that if you use an image file imported directly from your smartphone, it will be displayed very large on the screen. You can use ImageMagick to resize the file to, say, 640x480 pixels.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ file makan.jpg
makan.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, Exif Standard: [TIFF image data, big-endian, <span style="color:#369">direntries</span>=15, <span style="color:#369">height</span>=3120, <span style="color:#369">bps</span>=0, <span style="color:#369">width</span>=4160], baseline, precision 8, 4160x3120, frames <span style="color:#00d;font-weight:bold">3</span>
$ convert makan.jpg -resize 640x480 makan-small.jpg
$ file makan-small.jpg
makan-small.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, big-endian, <span style="color:#369">direntries</span>=15, <span style="color:#369">height</span>=3120, <span style="color:#369">bps</span>=0, <span style="color:#369">width</span>=4160], baseline, precision 8, 640x480, frames <span style="color:#00d;font-weight:bold">3</span>
</code></pre></div><p><img src="/blog/2018/10/image-recognition-tools/image-0.jpg" alt="detecting 2 faces" /></p>
<h3 id="machine-vision">Machine Vision</h3>
<p>The computer doesn’t see the image directly the way humans do, so we need to convert the images into numerical values. For example, in the facial recognition tools, the training file contains the following matrices:</p>
<pre tabindex="0"><code>opencv_lbphfaces:
threshold: 1.7976931348623157e+308
radius: 1
neighbors: 8
grid_x: 8
grid_y: 8
histograms:
- !!opencv-matrix
rows: 1
cols: 16384
dt: f
data: [ 2.46913582e-02, 1.85185187e-02, 0., 3.08641978e-03,
1.23456791e-02, 6.17283955e-03, 3.08641978e-03,
2.46913582e-02, 0., 0., 0., 0., 0., 3.08641978e-03, 0.,
9.25925933e-03, 1.85185187e-02, 9.25925933e-03, 0., 0.,
3.08641978e-03, 0., 0., 0., 3.08641978e-03, 0., 0., 0.,
2.46913582e-02, 3.08641978e-03, 0., 6.79012388e-02, 0., 0.,
...................
1.30385486e-02, 1.47392293e-02, 4.53514745e-03,
1.13378686e-03, 7.93650839e-03, 5.66893432e-04,
5.66893432e-04, 1.13378686e-03, 6.80272095e-03,
2.26757373e-03, 0., 0., 5.66893443e-03, 2.83446722e-03,
5.10204071e-03, 9.07029491e-03, 7.14285746e-02 ]
labels: !!opencv-matrix
rows: 26
cols: 1
dt: i
data: [ 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5, 5,
6, 6, 8, 8, 8, 8 ]
labelsInfo:
[]
</code></pre><h3 id="face-recognition">Face Recognition</h3>
<p>I continued my search for existing face recognition software and found several projects which could be tested right away, with some modifications from the original source. I found one <a href="https://www.youtube.com/watch?v=PmZ29Vta7Vc">tutorial</a> which clearly explained how to get face recognition working from the web camera, in real time.</p>
<p>If the code provided in the video isn’t working directly, you could try my small patches, in which I corrected a typo and extended the accepted filename extensions; the changes against the source are <a href="https://github.com/codingforentrepreneurs/OpenCV-Python-Series/compare/master...raden:utk-github">here</a>.</p>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/aufa-im-process.webm" type="video/webm">
</video>
<caption>My daughter Aufa is joining me in this facial recognition session.</caption>
</center>
<p>Apart from that, there is also a fork <a href="https://github.com/nazmiasri95/Face-Recognition">on GitHub</a> which allows us to do real-time face recognition. For now, however, some manual work is needed to add more data (images of faces) if you want to use the code right away.</p>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/tom-cruise.webm" type="video/webm">
</video>
<caption>Obviously I am not Tom Cruise.</caption>
</center>
<h3 id="object-recognition">Object Recognition</h3>
<p>I also searched for more related software which could possibly provide an alternative to face recognition. I found quite an interesting piece of work on object detection using neural networks. It runs on a framework called <a href="https://pjreddie.com/darknet/">Darknet</a>. It allows us to do post-processing object detection for still pictures and videos. It can also do real-time object recognition, but requires a GPU to do it efficiently. I tried the CPU-only mode but could not get a real-time result (my computer almost crashed).</p>
<h4 id="still-image-samples">Still image samples</h4>
<p><img src="/blog/2018/10/image-recognition-tools/image-2.jpg" alt="detecting boats and people at the beach" /></p>
<p><img src="/blog/2018/10/image-recognition-tools/image-3.jpg" alt="detecting birds at the zoo" /></p>
<h4 id="video-samples">Video samples</h4>
<center>
<video width="40%" controls>
<source src="/blog/2018/10/image-recognition-tools/keteslow.webm" type="video/webm">
</video>
<br /><caption>This video was on Lebuhraya Utara Selatan (freeway) in Malaysia</caption>
</center>
<center>
<video width="40%" controls>
<source src="/blog/2018/10/image-recognition-tools/keteslow2.webm" type="video/webm">
</video>
<br /><caption>Another from Lebuhraya Utara Selatan (freeway) in Malaysia</caption>
</center>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/kids-bubble.webm" type="video/webm">
</video>
<caption>Two kids playing with bubbles</caption>
</center>
<center>
<video width="40%" controls>
<source src="/blog/2018/10/image-recognition-tools/perhentian-swim-analyzed.webm" type="video/webm">
</video>
<br /><caption>This video was taken a on a boat, with several people floating in the sea wearing their life jackets</caption>
</center>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/jalan-pantai.webm" type="video/webm">
</video>
<caption>My kid and I walking on the beach in western Australia</caption>
</center>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/aufa-naik-kida-slow.webm" type="video/webm">
</video>
<caption>Here’s a kid riding a small horse</caption>
</center>
<h4 id="vehicle-counting-and-speed-measurement">Vehicle Counting and Speed Measurement</h4>
<p>I found a tool developed by <a href="https://github.com/ahmetozlu/vehicle_counting_tensorflow">Ahmet Ozlu</a> which uses TensorFlow. The use case here is vehicle counting, vehicle type and color recognition, and speed detection.</p>
<p>You can see how it works in the following video.</p>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/ahmet-traffic.webm" type="video/webm">
</video>
<caption></caption>
</center>
<h3 id="libraries">Libraries</h3>
<h4 id="opencv">OpenCV</h4>
<p><a href="https://opencv.org/">OpenCV</a> is an open source library for computer vision, which comes with libraries we can use for our detection and recognition work.</p>
<p>In my understanding, face detection comes first and recognition second. In newer digital cameras and smartphones face detection is quite common. Social media applications sometimes use facial recognition to suggest similar faces to be tagged in photo albums, or for photo album reorganization.</p>
<h4 id="tools-based-on-or-making-use-of-opencv">Tools based on or making use of OpenCV</h4>
<p>Apart from the custom-written Python code which uses OpenCV and NumPy, I also found several projects which use neural networks, specifically an architecture called YOLO (You Only Look Once). They are:</p>
<ul>
<li><a href="https://pjreddie.com/darknet/">darknet</a> (written in C)</li>
<li><a href="https://github.com/thtrieu/darkflow">darkflow</a> (written in Python; seems to work as a wrapper for darknet) — You need to install different dependencies than for darknet, for example Cython and TensorFlow. The good thing is that we could use this tool for video post-processing, where instead of taking input directly from a webcam, we take it from existing videos. However, if you want to use the latest YOLO algorithm, then just stick to Darknet rather than using Darkflow. There is a fork on GitHub which could allow Darknet to save the output of the processed video into a file as well.</li>
</ul>
<p>To rotate a video that was taken on a smartphone held in a 180-degree position:</p>
<p><code>ffmpeg -i sourcefile.mp4 -vf "transpose=4" fileout.mp4</code></p>
<p>The transpose value depends on the direction of the rotation: for a 90-degree rotation, <code>transpose=2</code> rotates counter-clockwise, while <code>transpose=1</code> rotates clockwise.</p>
<p>To convert the video to a lower frame rate:</p>
<p><code>ffmpeg -i sourcefile.avi -r 8 fileout.mp4</code></p>
<p>For the darkflow tool, the default output is in AVI format, but ffmpeg allows us to convert it to MP4 if we want.</p>
<h4 id="imageai">ImageAI</h4>
<p><a href="https://github.com/OlafenwaMoses/ImageAI">ImageAI</a> is a Python-based computer vision library built on TensorFlow, Keras, Matplotlib, and several other dependencies commonly used for machine learning. In terms of usage, it is similar to darkflow.</p>
<h3 id="conclusion">Conclusion</h3>
<p>The advancement of the AI field contributes a lot of useful automation to our lives, ranging from detecting tumors, assisting search and rescue missions, and reducing keystrokes with keyword prediction, to filtering spam. AI also accelerates the fields of image processing and pattern recognition.</p>
<p>The hard work of many smart people and scholars has produced many clever solutions that help people live better lives with the use of AI. As I have shown, some of these tools can achieve better detection given a good amount of training samples and the correct size of picture to be detected.</p>
<p>The tools above will work as-is, but may need some tweaking or editing if you want to customize them. For example, some of the code only works with its own demos, so you may need to pass an argument such as <code>sys.argv[]</code> inside the Python code if you want to process your own video.</p>
Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithmhttps://www.endpointdev.com/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/2018-08-29T00:00:00+00:00Kamil Ciemniewski
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/contrib/auto-render.min.js"></script>
<style>
.katex .op-symbol.large-op {
line-height: 1.2 !important;
}
.mtight {
font-size: 0.95em;
}
</style>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/892-openaigym.video.90.68.video000000.mp4" type="video/mp4">
</video>
</center>
<p>The field of <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a> has seen a lot of great improvement in the past years. Researchers at universities and companies like <a href="https://deepmind.com/">DeepMind</a> have been developing new and better ways to train intelligent, artificial agents to solve more and more difficult tasks. The algorithms being developed require less time to train and make the training much more stable.</p>
<p>This article is about an algorithm that’s one of the most cited lately: A3C — Asynchronous Advantage Actor-Critic.</p>
<p>As the subject is both wide and deep, I’m assuming the reader has already mastered the relevant background. Reading on might be interesting even without understanding most of the notions in use, but having a good grasp of them will help you get the most out of this article.</p>
<p>Because we’re looking at Deep Reinforcement Learning, the obvious requirement is to be acquainted with <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">neural networks</a>. I’m also using notions known in the field of <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a> at large, like the $Q(a, s)$ and $V(s)$ functions or the n-step return. The mathematical expressions, in particular, are given assuming that the reader already knows what the symbols stand for. Some notions known from other families of RL algorithms are touched on as well (e.g. experience replay), to contrast them with the A3C way of solving the same kinds of problems. The article along with the source code uses the <a href="https://gym.openai.com">OpenAI gym</a>, Python, and <a href="https://pytorch.org">PyTorch</a>, among other Python-related libraries.</p>
<h3 id="theory">Theory</h3>
<p>The A3C algorithm is a part of the greater class of RL algorithms called <a href="http://www.scholarpedia.org/article/Policy_gradient_methods">Policy Gradients</a>.</p>
<p>In this approach, we’re creating a model that <strong>approximates the action-choosing policy itself</strong>.</p>
<p>Let’s contrast it with <a href="https://en.wikipedia.org/wiki/Markov_decision_process#Value_iteration">value iteration</a>, the goal of which is to learn the <a href="https://en.wikipedia.org/wiki/Reinforcement_learning#Value_function">value function</a> and have the policy emerge as the function that chooses the action transitioning to the state of the greatest value.</p>
<p>With the policy gradient approach, we’re approximating the policy with a differentiable function. Stated this way, the problem requires only a good approximation of the gradient that over time will maximize the rewards.</p>
<p>The unique approach of A3C adds a very clever twist: we’re also learning an approximation of the value function at the same time. This helps us in getting the variance of the gradient down considerably, making the training much more stable.</p>
<p>These two aspects of the algorithm are reflected in its name: actor-critic. The policy function approximation is called the actor, while the value function approximation is called the critic.</p>
<h4 id="the-policy-gradient">The policy gradient</h4>
<p>As we’ve noticed already, in order to improve our policy function approximation, we need a gradient that points at the direction that maximizes the rewards.</p>
<p>I’m not going to reinvent the wheel here. There are some great resources the reader can access to dig deep into the Mathematics of what’s called the Policy Gradient Theorem:</p>
<ul>
<li><a href="https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html">Lilian Weng’s excellent article</a></li>
<li><a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">Sutton & Barto — Reinforcement Learning: An Introduction</a></li>
</ul>
<p>The following equation presents the basic form of the gradient of the policy function:</p>
<p>$$\nabla_{\theta} J(\theta) = E_{\tau}\left[R_{\tau}\cdot\nabla_\theta\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\right]$$</p>
<p>This states that for each sampled trajectory $\tau$, the correct estimate of the gradient is the expected value of the rewards times the action probabilities moved into the log space. Ascending in this direction makes our rewards greater and greater over time.</p>
<p>We <strong>can</strong> derive all the needed intermediary gradients ourselves by hand of course. Because we’re using <a href="https://pytorch.org">PyTorch</a> though, we only need the right loss function.</p>
<p>Let’s figure out the right loss function formula that will produce the gradient as shown above:</p>
<p>$$L_\theta=-J(\theta)$$</p>
<p>Also:</p>
<p>$$J(\theta)=E_\tau\left[R_\tau\cdot\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\right]$$</p>
<p>Hence:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}R_t\cdot\log\pi(a_t|s_t;\theta)$$</p>
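<p>As a minimal PyTorch sketch (the tensor values below are purely illustrative, not taken from the article’s training code), this loss comes out to a single scalar that autograd can then differentiate for us:</p>

```python
import torch

# illustrative log-probabilities of the sampled actions and the
# accumulated rewards for each step (both come from the rollout in practice)
log_probs = torch.log(torch.tensor([0.5, 0.25, 0.25]))
returns = torch.tensor([1.0, 2.0, 0.5])

# L_theta = -(1/n) * sum_t R_t * log pi(a_t | s_t; theta)
loss = -(returns * log_probs).mean()
```

<p>Calling <code>loss.backward()</code> on this scalar is all that’s needed to get the gradients flowing back into the policy’s parameters.</p>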
<h4 id="formalizing-the-accumulation-of-rewards">Formalizing the accumulation of rewards</h4>
<p>For now, we’ve been using the $R_\tau$ and $R_t$ terms very abstractly. Let’s make this part more intuitive and concrete now.</p>
<p>Its true meaning really is “the quality of the sampled trajectory”. Consider the following equation:</p>
<p>$$R_t=\sum_{i=t}^{t+N}\gamma^{i-t}r_i+\gamma^{N+1}V(s_{t+N+1})$$</p>
<p>Each $r_i$ is the reward received from the environment after each step. Each trajectory consists of multiple steps. Each time, we’re sampling actions based on our policy function. This gives probabilities of a given action being best given the state.</p>
<p>What if we take 5 actions for which we’re not given any reward, but which overall help us get rewarded in the 6th step? This is exactly the case we’ll be dealing with later in this article when training a toy car to drive based only on the pixel values of the scene. In that environment, we’ll be given a $-0.1$ “negative” reward each step and something close to $7$ for each new “tile” the car enters while staying on the road.</p>
<p>We need a way to still encourage actions that make us earn rewards in a not too distant future. We also need to be smart and <strong>discount</strong> future rewards somewhat so that the more immediate the reward is to our action, the more emphasis we put on it.</p>
<p>That’s exactly what the above equation does. Notice that $\gamma$ becomes a hyperparameter. It makes sense to give it a value from $(0, 1)$. Let’s consider the following list of rewards: $[r_1, r_2, r_3, r_4]$. For $r_1$, the formula for the discounted accumulated reward is:</p>
<p>$$R_1=r_1+\gamma r_2+\gamma^2 r_3+\gamma^3 r_4+\gamma^4 V(s_5)$$</p>
<p>For $r_2$ it’s:</p>
<p>$$R_2=r_2+\gamma r_3+\gamma^2 r_4+\gamma^3 V(s_5)$$</p>
<p>And so on… When we hit the terminal state and there is no “next” state, we substitute $0$ for $V(s_{t+N+1})$.</p>
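<p>The discounting above can be computed by walking the rollout backwards, since $R_t = r_t + \gamma R_{t+1}$. Here’s a small sketch in plain Python (the function and argument names are mine, not from the article’s code):</p>

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute R_t for every step of an n-step rollout, bootstrapping with
    V(s) of the state following the last step (pass 0.0 at a terminal state)."""
    returns = []
    acc = bootstrap_value
    for r in reversed(rewards):
        acc = r + gamma * acc  # R_t = r_t + gamma * R_{t+1}
        returns.insert(0, acc)
    return returns

# e.g. a 4-step rollout with gamma = 0.9 and V(s_5) = 0.5:
rs = discounted_returns([1.0, 0.0, 0.0, 2.0], 0.5, gamma=0.9)
```

<p>Note how the two zero-reward middle steps still receive credit for the $2.0$ reward earned at the end, just discounted by one extra factor of $\gamma$ per step.</p>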
<p>We’ve said that in A3C we’re learning the value function at the same time. The $R_t$ as described above becomes the target value when training our $V(s)$. The value function becomes an approximation of the average of the rewards given the state (because $R_t$ depends on us sampling actions in this state).</p>
<h4 id="making-the-gradients-more-stable">Making the gradients more stable</h4>
<p>One of the greatest inhibitors of the policy gradient performance is what’s broadly called “high variance”.</p>
<p>I have to admit, the first time I saw that term in this context, I was disoriented. I knew what “variance” was. It’s the “variance of what” that was not clear to me.</p>
<p>Thankfully I found <a href="https://www.quora.com/Why-does-the-policy-gradient-method-have-a-high-variance?share=1">a brilliant answer to this question</a>. It explains the issue simply yet in detail.</p>
<p>Let me cite it here:</p>
<blockquote>
<p>When we talk about high variance in the policy gradient method, we’re specifically talking about the facts that the variance of the gradients are high — namely, that $Var(\nabla_{\theta} J(\theta))$ is big.</p>
</blockquote>
<p>To put it in simple terms: because we’re <strong>sampling</strong> trajectories from the space that is stochastic in nature, we’re bound to have those samples give gradients that disagree a lot on the best direction to take our model’s parameters into.</p>
<p>I encourage the reader to pause now and read the above-mentioned answer, as it’s vital. The gist of the solution described in it is that we can <strong>subtract a baseline value from each $R_t$</strong>. One example given of a good baseline is the <strong>average of the sampled accumulated rewards</strong>. The A3C algorithm uses this insight in a very, very clever way.</p>
<h4 id="value-function-as-a-baseline">Value function as a baseline</h4>
<p>To learn the $V(s)$ we’re typically using the MSE or Huber loss against the accumulated rewards for each step. This means that over time we’re <strong>averaging those rewards out based on the state we’re finding ourselves in</strong>.</p>
<p>Improving our gradient formula with those ideas we now get:</p>
<p>$$\nabla_{\theta} J(\theta) = E_{\tau}\left[\nabla_\theta\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))\right]$$</p>
<p>It’s important to treat the $(R_t-V(s_t))$ term <strong>as a constant</strong>. This means that when using PyTorch or any other deep learning framework, the computation of it should occur <strong>outside the graph that influences the gradients</strong>.</p>
<p>The enhanced part of the equation is where we get the word “advantage” in the algorithm’s name. The <strong>advantage</strong> is simply the difference between the accumulated rewards and what those rewards are <strong>on average</strong> for the given state:</p>
<p>$$A(a_{t..t+n},s_{t..t+n})=R_t(a_{t..t+n},s_{t..t+n})-V(s_t)$$</p>
<p>If we make $R_t$ into $Q(s,a)$, as it’s commonly written in the literature, we arrive at the formula:</p>
<p>$$A(s,a)=Q(s,a) - V(s)$$</p>
<p>What’s the intuition here? Imagine that you’re playing chess with a 5-year-old. You win by a huge margin. Your friend who’s watched lots of master-level games observed this one as well. His take is that even though you scored positively, you still made lots of mistakes. You’ve got your <strong>critic</strong> here. Your score and what it looks like for the “observing critic” combined is what we call the advantage of the actions you took.</p>
<h4 id="guarding-against-the-models-overconfidence">Guarding against the model’s overconfidence</h4>
<blockquote>
<p>Although he was warned, Icarus was too young and too enthusiastic about flying. He got excited by the thrill of flying and carried away by the amazing feeling of freedom and started flying high to salute the sun, diving low to the sea, and then up high again.
His father Daedalus was trying in vain to make young Icarus to understand that his behavior was dangerous, and Icarus soon saw his wings melting.
Icarus fell into the sea and drowned.</p>
</blockquote>
<p><em><a href="https://www.greekmyths-greekmythology.com/myth-of-daedalus-and-icarus/">The Myth Of Daedalus And Icarus</a></em></p>
<p>The job of an “actor” is to output probability values for each possible action the agent can take. The greater the probability, the greater the model’s confidence that this action will result in the highest reward.</p>
<p>What if at some point the weights are steered in a way that makes the model <em>overconfident</em> about some particular action? If this happens before the model has learned much, it becomes a huge problem.</p>
<p>Because we’re using the $\pi(a|s;\theta)$ distribution to sample trajectories with, we’re not sampling totally at random. In other words, for $\pi(a|s;\theta) = [0.1, 0.4, 0.2, 0.3]$ our sampling chooses the second option 40% of the time. With any action overwhelming the others, we’re losing the ability to <strong>explore</strong> different paths and thus learn valuable lessons.</p>
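<p>To see that sampling concretely, here’s a tiny standalone experiment (using the standard library only; not part of the article’s code):</p>

```python
import random

random.seed(42)  # for reproducibility

# the policy's output distribution over the 4 discrete actions
probs = [0.1, 0.4, 0.2, 0.3]
actions = random.choices(range(len(probs)), weights=probs, k=10_000)

# the second action dominates: it's chosen roughly 40% of the time
freq = actions.count(1) / len(actions)
```

<p>If the distribution collapsed to something like $[0.99, 0.003, 0.004, 0.003]$, nearly every sample would pick the first action and exploration would effectively stop.</p>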
<p>Empirically, I have sometimes seen the process unable to escape this “overconfidence” area for long, long hours.</p>
<h4 id="regularizing-with-entropy">Regularizing with entropy</h4>
<p>Let’s introduce the notion of <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a>.</p>
<p>In simple words, in our case it’s the measure of how much uncertainty a given probability distribution possesses. It’s maximized for the uniform distribution. Here’s the formula:</p>
<p>$$H(X)=E[-\log_b(P(X))]$$</p>
<p>This expands to the following:</p>
<p>$$H(X)=-\sum_{i=1}^{n}P(x_i)\log_b(P(x_i))$$</p>
<p>Let’s look closer at the values this function produces using the following simple <a href="https://calca.io">Calca</a> code:</p>
<pre tabindex="0"><code>uniform = [0.25, 0.25, 0.25, 0.25]
more confident = [0.5, 0.25, 0.15, 0.10]
over confident = [0.95, 0.01, 0.01, 0.03]
super over confident = [0.99, 0.003, 0.004, 0.003]
y(x) = x*log(x, 10)
entropy(dist) = -sum(map(y, dist))
entropy (uniform) => 0.6021
entropy (more confident) => 0.5246
entropy (over confident) => 0.1068
entropy (super over confident) => 0.0291
</code></pre><p>We can use the above to “punish” the model whenever it’s too confident of its choices. As we’re going to use gradient descent, we’ll be minimizing the terms that appear in our loss function. Minimizing the entropy as shown above would encourage more confidence, though, so we need to negate it in the loss for it to work the way we intend:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))-\beta\, H(\pi(a_t|s_t;\theta))$$</p>
<p>Where $\beta$ is a hyperparameter scaling the effect the entropy penalty has on the gradients. Choosing the right value for $\beta$ is vital for the model’s convergence. In this article, I’m using $0.01$, as with $0.001$ I still observed the process getting stuck in overconfidence.</p>
<p>Let’s include the value loss $L_v$ in the loss function formula making it full and ready to be implemented:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))+\alpha L_v-\beta\, H(\pi(a_t|s_t;\theta))$$</p>
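<p>Here’s one way to assemble this full loss in PyTorch (a hedged sketch with names of my own choosing, using MSE for the value loss as mentioned earlier; the article’s actual training code appears further below):</p>

```python
import torch
import torch.nn.functional as F

def a3c_loss(log_probs, values, returns, entropies, alpha=0.5, beta=0.01):
    """Assemble the full A3C loss: the policy term weighted by the advantage
    (detached, so it acts as a constant in the graph), the value loss L_v
    (MSE here), and the entropy bonus scaled by beta."""
    advantages = (returns - values).detach()  # kept outside the gradient graph
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    return policy_loss + alpha * value_loss - beta * entropies.mean()
```

<p>The <code>.detach()</code> call is the point made above about treating $(R_t-V(s_t))$ as a constant: the advantage must scale the policy gradient without itself contributing gradients to the critic.</p>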
<h4 id="the-last-a-in-a3c">The last A in A3C</h4>
<p>So far we’ve gone from vanilla policy gradients to using the notion of an advantage. We’ve also improved it with a baseline, which intuitively makes the model consist of two parts: the actor and the critic. At this point, we have what’s sometimes called A2C — Advantage Actor-Critic.</p>
<p>Let us now focus on the last piece of the puzzle: the last A. This last A comes from the word “asynchronous”. It’s been explained very clearly in the <a href="https://arxiv.org/pdf/1602.01783">original paper on A3C</a>.</p>
<p>This idea, I think, is the least complex of all those that have their place in the approach. I’ll just comment on what was already written:</p>
<blockquote>
<p>These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent’s data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.</p>
</blockquote>
<p>A3C’s unique approach is that it doesn’t use experience replay to de-correlate the updates to the model’s weights. Instead, we sample many different trajectories <strong>at the same time</strong> in an <strong>asynchronous</strong> manner.</p>
<p>This means that we create many clones of the environment and let our agents experience them at the same time. Separate agents share their weights in one way or another. There are implementations where the agents share those weights quite <strong>literally</strong>, performing the updates to the weights on their own whenever they need to. There are also implementations with one main agent holding the main weights and doing the updates based on the gradients reported by the “worker” agents; the workers are then updated with the evolved weights. The environments and agents are not directly synchronized and work at their own speed. As soon as any of them collects the rewards needed for the n-step gradient calculations, the gradients are applied in one way or another.</p>
<p>In this article, I’m preferring the second approach — having one “main” agent and making workers synchronize their weights with it each n-step period.</p>
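<p>One possible shape of that “main agent plus workers” synchronization, sketched with assumed names (the article’s actual training loop is not shown in this excerpt):</p>

```python
import torch

def apply_worker_gradients(main_agent, worker_agent, optimizer):
    """After a worker finishes its n-step rollout and calls backward(),
    copy its gradients onto the main agent, step the main optimizer,
    then refresh the worker with the evolved weights."""
    for main_p, worker_p in zip(main_agent.parameters(),
                                worker_agent.parameters()):
        if worker_p.grad is not None:
            main_p.grad = worker_p.grad.clone()
    optimizer.step()         # update the main weights
    optimizer.zero_grad()    # clear the main agent's gradients
    # the worker continues from the evolved main weights:
    worker_agent.load_state_dict(main_agent.state_dict())
```

<p>In a real setup each worker would also zero its own gradients before the next rollout, and the copies would happen across processes rather than within one.</p>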
<h3 id="practice">Practice</h3>
<h4 id="the-challenge">The challenge</h4>
<p>To put the above theory into practice, we’re going to code A3C to train a toy self-driving game car. The algorithm will only have the game’s pixels as inputs. We’re also going to collect rewards.</p>
<p>At each step, the player decides how to move the steering wheel, how much throttle to apply, and how much brake.</p>
<p>Points are assigned for each new “tile” that the car enters while staying on the road. In every other case there’s a small penalty of $-0.1$ points.</p>
<p>We’re going to use <a href="https://gym.openai.com">OpenAI Gym</a> and the environment’s called <a href="https://gym.openai.com/envs/CarRacing-v0/">CarRacing</a>.</p>
<p>You can read a bit more about the setup in the environment’s source code on <a href="https://github.com/openai/gym/blob/master/gym/envs/box2d/car_racing.py">GitHub</a>.</p>
<h4 id="coding-the-agent">Coding the Agent</h4>
<p>Our agent is going to output both $\pi(a|s;\theta)$ and $V(s)$. We’re going to use a GRU unit to give the agent the ability to remember its previous actions and the environment’s previous features.</p>
<p>I’ve also decided to use PReLU instead of ReLU activations, as it <strong>appeared</strong> to me that the agent was learning much quicker this way (although I don’t have any numbers to back this impression up).</p>
<p><strong>Disclaimer</strong>: the code presented below <strong>has not been refactored</strong> in any way. If it were going to be used in production, I’d certainly clean it up a great deal.</p>
<p>Here’s the full listing of the agent’s class:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Agent</span>(nn.Module):
<span style="color:#080;font-weight:bold">def</span> __init__(self, **kwargs):
<span style="color:#038">super</span>(Agent, self).__init__(**kwargs)
self.init_args = kwargs
self.h = torch.zeros(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">256</span>)
self.norm1 = nn.BatchNorm2d(<span style="color:#00d;font-weight:bold">4</span>)
self.norm2 = nn.BatchNorm2d(<span style="color:#00d;font-weight:bold">32</span>)
self.conv1 = nn.Conv2d(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">4</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv2 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv3 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv4 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.gru = nn.GRUCell(<span style="color:#00d;font-weight:bold">1152</span>, <span style="color:#00d;font-weight:bold">256</span>)
self.policy = nn.Linear(<span style="color:#00d;font-weight:bold">256</span>, <span style="color:#00d;font-weight:bold">4</span>)
self.value = nn.Linear(<span style="color:#00d;font-weight:bold">256</span>, <span style="color:#00d;font-weight:bold">1</span>)
self.prelu1 = nn.PReLU()
self.prelu2 = nn.PReLU()
self.prelu3 = nn.PReLU()
self.prelu4 = nn.PReLU()
nn.init.xavier_uniform_(self.conv1.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv1.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv2.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv2.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv3.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv3.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv4.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv4.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.constant_(self.gru.bias_ih, <span style="color:#00d;font-weight:bold">0</span>)
nn.init.constant_(self.gru.bias_hh, <span style="color:#00d;font-weight:bold">0</span>)
nn.init.xavier_uniform_(self.policy.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.policy.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.value.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.value.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
self.train()
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">reset</span>(self):
self.h = torch.zeros(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">256</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clone</span>(self, num=<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">return</span> [ self.clone_one() <span style="color:#080;font-weight:bold">for</span> _ <span style="color:#080">in</span> <span style="color:#038">range</span>(num) ]
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clone_one</span>(self):
<span style="color:#080;font-weight:bold">return</span> Agent(**self.init_args)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, state):
state = state.view(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">96</span>)
state = self.norm1(state)
data = self.prelu1(self.conv1(state))
data = self.prelu2(self.conv2(data))
data = self.prelu3(self.conv3(data))
data = self.prelu4(self.conv4(data))
data = self.norm2(data)
data = data.view(<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>)
h = self.gru(data, self.h)
self.h = h.detach()
pre_policy = h.view(-<span style="color:#00d;font-weight:bold">1</span>)
policy = F.softmax(self.policy(pre_policy))
value = self.value(pre_policy)
<span style="color:#080;font-weight:bold">return</span> policy, value
</code></pre></div><p>You can immediately notice that the actor and critic parts share most of their weights; they only differ in the last layer.</p>
<p>Next, I wanted to abstract out the notion of the “runner”, which encapsulates the idea of a “running agent”. Think of it as the game player, with a joystick and a brain to score game points. I’m discretizing the action space the following way:</p>
<table>
<tr>
<th>Action name</th>
<th>value</th>
</tr>
<tr>
<td>Turn left</td>
<td>[-0.8, 0.0, 0.0]</td>
</tr>
<tr>
<td>Turn right</td>
<td>[0.8, 0.0, 0]</td>
</tr>
<tr>
<td>Full throttle</td>
<td>[0.0, 0.1, 0.0]</td>
</tr>
<tr>
<td>Brake</td>
<td>[0.0, 0.0, 0.6]</td>
</tr>
</table>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Runner</span>:
<span style="color:#080;font-weight:bold">def</span> __init__(self, agent, ix, train = <span style="color:#080;font-weight:bold">True</span>, **kwargs):
self.agent = agent
self.train = train
self.ix = ix
self.reset = <span style="color:#080;font-weight:bold">False</span>
self.states = []
<span style="color:#888"># each runner has its own environment:</span>
self.env = gym.make(<span style="color:#d20;background-color:#fff0f0">'CarRacing-v0'</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_value</span>(self):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Returns just the current state's value.
</span><span style="color:#d20;background-color:#fff0f0"> This is used when approximating the R.
</span><span style="color:#d20;background-color:#fff0f0"> If the last step was
</span><span style="color:#d20;background-color:#fff0f0"> not terminal, then we're substituting the "r"
</span><span style="color:#d20;background-color:#fff0f0"> with V(s) - hence, we need a way to just
</span><span style="color:#d20;background-color:#fff0f0"> get that V(s) without moving forward yet.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
_input = self.preprocess(self.states)
_, _, _, value = self.decide(_input)
<span style="color:#080;font-weight:bold">return</span> value
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">run_episode</span>(self, yield_every = <span style="color:#00d;font-weight:bold">10</span>, do_render = <span style="color:#080;font-weight:bold">False</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> The episode runner written in the generator style.
</span><span style="color:#d20;background-color:#fff0f0"> This is meant to be used in a "for (...) in run_episode(...):" manner.
</span><span style="color:#d20;background-color:#fff0f0"> Each value generated is a tuple of:
</span><span style="color:#d20;background-color:#fff0f0"> step_ix: the current "step" number
</span><span style="color:#d20;background-color:#fff0f0"> rewards: the list of rewards as received from the environment (without discounting yet)
</span><span style="color:#d20;background-color:#fff0f0"> values: the list of V(s) values, as predicted by the "critic"
</span><span style="color:#d20;background-color:#fff0f0"> policies: the list of policies as received from the "actor"
</span><span style="color:#d20;background-color:#fff0f0"> actions: the list of actions as sampled based on policies
</span><span style="color:#d20;background-color:#fff0f0"> terminal: whether we're in a "terminal" state
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
self.reset = <span style="color:#080;font-weight:bold">False</span>
step_ix = <span style="color:#00d;font-weight:bold">0</span>
rewards, values, policies, actions = [[], [], [], []]
self.env.reset()
<span style="color:#888"># we're going to feed the last 4 frames to the neural network that acts as the "actor-critic" duo. We'll use the "deque" to efficiently drop too old frames always keeping its length at 4:</span>
states = deque([ ])
<span style="color:#888"># we're pre-populating the states deque by taking first 4 steps as "full throttle forward":</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">len</span>(states) < <span style="color:#00d;font-weight:bold">4</span>:
_, r, _, _ = self.env.step([<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">1.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>])
state = self.env.render(mode=<span style="color:#d20;background-color:#fff0f0">'rgb_array'</span>)
states.append(state)
logger.info(<span style="color:#d20;background-color:#fff0f0">'Init reward '</span> + <span style="color:#038">str</span>(r) )
<span style="color:#888"># we need to repeat the following as long as the game is not over yet:</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#080;font-weight:bold">True</span>:
<span style="color:#888"># the frames need to be preprocessed (I'm explaining the reasons later in the article)</span>
_input = self.preprocess(states)
<span style="color:#888"># asking the neural network for the policy and value predictions:</span>
action, action_ix, policy, value = self.decide(_input, step_ix)
<span style="color:#888"># taking the step and receiving the reward along with info if the game is over:</span>
_, reward, terminal, _ = self.env.step(action)
<span style="color:#888"># explicitly rendering the scene (again, this will be explained later)</span>
state = self.env.render(mode=<span style="color:#d20;background-color:#fff0f0">'rgb_array'</span>)
<span style="color:#888"># update the last 4 states deque:</span>
states.append(state)
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">len</span>(states) > <span style="color:#00d;font-weight:bold">4</span>:
states.popleft()
<span style="color:#888"># if we've been asked to render into the window (e. g. to capture the video):</span>
<span style="color:#080;font-weight:bold">if</span> do_render:
self.env.render()
self.states = states
step_ix += <span style="color:#00d;font-weight:bold">1</span>
rewards.append(reward)
values.append(value)
policies.append(policy)
actions.append(action_ix)
<span style="color:#888"># periodically save the state's screenshot along with the numerical values in an easy to read way:</span>
<span style="color:#080;font-weight:bold">if</span> self.ix == <span style="color:#00d;font-weight:bold">2</span> <span style="color:#080">and</span> step_ix % <span style="color:#00d;font-weight:bold">200</span> == <span style="color:#00d;font-weight:bold">0</span>:
fname = <span style="color:#d20;background-color:#fff0f0">'./screens/car-racing/screen-'</span> + <span style="color:#038">str</span>(step_ix) + <span style="color:#d20;background-color:#fff0f0">'-'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(time.time())) + <span style="color:#d20;background-color:#fff0f0">'.jpg'</span>
im = Image.fromarray(state)
im.save(fname)
state.tofile(fname + <span style="color:#d20;background-color:#fff0f0">'.txt'</span>, sep=<span style="color:#d20;background-color:#fff0f0">" "</span>)
_input.numpy().tofile(fname + <span style="color:#d20;background-color:#fff0f0">'.input.txt'</span>, sep=<span style="color:#d20;background-color:#fff0f0">" "</span>)
<span style="color:#888"># if it's game over or we hit the "yield every" value, yield the values from this generator:</span>
<span style="color:#080;font-weight:bold">if</span> terminal <span style="color:#080">or</span> step_ix % yield_every == <span style="color:#00d;font-weight:bold">0</span>:
<span style="color:#080;font-weight:bold">yield</span> step_ix, rewards, values, policies, actions, terminal
rewards, values, policies, actions = [[], [], [], []]
<span style="color:#888"># following is a very tacky way to allow external using code to mark that it wants us to reset the environment, finishing the episode prematurely. (this would be hugely refactored in the production code but for the sake of playing with the algorithm itself, it's good enough):</span>
<span style="color:#080;font-weight:bold">if</span> self.reset:
self.reset = <span style="color:#080;font-weight:bold">False</span>
self.agent.reset()
states = deque([ ])
self.states = deque([ ])
<span style="color:#080;font-weight:bold">return</span>
<span style="color:#080;font-weight:bold">if</span> terminal:
self.agent.reset()
states = deque([ ])
<span style="color:#080;font-weight:bold">return</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">ask_reset</span>(self):
self.reset = <span style="color:#080;font-weight:bold">True</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">preprocess</span>(self, states):
<span style="color:#080;font-weight:bold">return</span> torch.stack([ torch.tensor(self.preprocess_one(image_data), dtype=torch.float32) <span style="color:#080;font-weight:bold">for</span> image_data <span style="color:#080">in</span> states ])
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">preprocess_one</span>(self, image):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Scales the rendered image and makes it grayscale
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#080;font-weight:bold">return</span> rescale(rgb2gray(image), (<span style="color:#00d;font-weight:bold">0.24</span>, <span style="color:#00d;font-weight:bold">0.16</span>), anti_aliasing=<span style="color:#080;font-weight:bold">False</span>, mode=<span style="color:#d20;background-color:#fff0f0">'edge'</span>, multichannel=<span style="color:#080;font-weight:bold">False</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">choose_action</span>(self, policy, step_ix):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Chooses an action to take based on the policy and whether we're in the training mode or not. During training, it samples based on the probability values in the policy. During the evaluation, it takes the most probable action in a greedy way.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
policies = [[-<span style="color:#00d;font-weight:bold">0.8</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>], [<span style="color:#00d;font-weight:bold">0.8</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0</span>], [<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.1</span>, <span style="color:#00d;font-weight:bold">0.0</span>], [<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.6</span>]]
<span style="color:#080;font-weight:bold">if</span> self.train:
action_ix = np.random.choice(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">1</span>, p=torch.tensor(policy).detach().numpy())[<span style="color:#00d;font-weight:bold">0</span>]
<span style="color:#080;font-weight:bold">else</span>:
action_ix = np.argmax(torch.tensor(policy).detach().numpy())
logger.info(<span style="color:#d20;background-color:#fff0f0">'Step '</span> + <span style="color:#038">str</span>(step_ix) + <span style="color:#d20;background-color:#fff0f0">' Runner '</span> + <span style="color:#038">str</span>(self.ix) + <span style="color:#d20;background-color:#fff0f0">' Action ix: '</span> + <span style="color:#038">str</span>(action_ix) + <span style="color:#d20;background-color:#fff0f0">' From: '</span> + <span style="color:#038">str</span>(policy))
<span style="color:#080;font-weight:bold">return</span> np.array(policies[action_ix], dtype=np.float32), action_ix
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decide</span>(self, state, step_ix = <span style="color:#00d;font-weight:bold">999</span>):
policy, value = self.agent(state)
action, action_ix = self.choose_action(policy, step_ix)
<span style="color:#080;font-weight:bold">return</span> action, action_ix, policy, value
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">load_state_dict</span>(self, state):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> As we'll have multiple "worker" runners, they will need to be able to sync their agents' weights with the main agent.
</span><span style="color:#d20;background-color:#fff0f0"> This function loads the weights into this runner's agent.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
self.agent.load_state_dict(state)
</code></pre></div><p>I’m also encapsulating the training process in a class of its own. Notice that the gradients are clipped before being applied. I’m also clipping the rewards into the $<-3, 3>$ range to help keep the variance low.</p>
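<p>As a quick aside on what “clipping the gradients” means here: the idea behind <code>torch.nn.utils.clip_grad_norm_</code> is to compute the global L2 norm of all gradients and, if it exceeds a threshold, scale every gradient down proportionally. A minimal, self-contained sketch of that idea, with made-up numbers:</p>

```python
import math

def clip_grad_norm(grads, max_norm):
    # Global-norm clipping: if the combined L2 norm of all gradient
    # components exceeds max_norm, scale each one down proportionally
    # so the overall direction is preserved but the magnitude is bounded.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

print(clip_grad_norm([30.0, 40.0], 40))  # norm 50 → scaled to [24.0, 32.0]
print(clip_grad_norm([3.0, 4.0], 40))    # norm 5 → left untouched
```

<p>PyTorch applies the same idea in place across all of a model’s parameter tensors at once.</p>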
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Trainer</span>:
<span style="color:#080;font-weight:bold">def</span> __init__(self, gamma, agent, window = <span style="color:#00d;font-weight:bold">15</span>, workers = <span style="color:#00d;font-weight:bold">8</span>, **kwargs):
<span style="color:#038">super</span>().__init__(**kwargs)
self.agent = agent
self.window = window
self.gamma = gamma
self.optimizer = optim.Adam(self.agent.parameters(), lr=<span style="color:#00d;font-weight:bold">1e-4</span>)
self.workers = workers
<span style="color:#888"># even though we're loading the weights into worker agents explicitly, I found that still without sharing the weights as following, the algorithm was not converging:</span>
self.agent.share_memory()
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">fit</span>(self, episodes = <span style="color:#00d;font-weight:bold">1000</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> The higher level method for training the agents.
</span><span style="color:#d20;background-color:#fff0f0"> It calls into the lower-level "train", which orchestrates the process itself.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
last_update = <span style="color:#00d;font-weight:bold">0</span>
updates = <span style="color:#038">dict</span>()
<span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, self.workers + <span style="color:#00d;font-weight:bold">1</span>):
updates[ ix ] = { <span style="color:#d20;background-color:#fff0f0">'episode'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'step'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'rewards'</span>: deque(), <span style="color:#d20;background-color:#fff0f0">'losses'</span>: deque(), <span style="color:#d20;background-color:#fff0f0">'points'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'mean_reward'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'mean_loss'</span>: <span style="color:#00d;font-weight:bold">0</span> }
<span style="color:#080;font-weight:bold">for</span> update <span style="color:#080">in</span> self.train(episodes):
now = time.time()
<span style="color:#888"># you could do something useful here with the updates dict.</span>
<span style="color:#888"># I've opted out as I'm using logging anyways and got more value in just watching the log file, grepping for the desired values</span>
<span style="color:#888"># save the current model's weights every minute:</span>
<span style="color:#080;font-weight:bold">if</span> now - last_update > <span style="color:#00d;font-weight:bold">60</span>:
torch.save(self.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(now)) + <span style="color:#d20;background-color:#fff0f0">'-.pytorch'</span>)
last_update = now
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">train</span>(self, episodes = <span style="color:#00d;font-weight:bold">1000</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Lower level training orchestration method. Written in the generator style. Intended to be used with "for update in train(...):"
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#888"># create the requested number of background agents and runners:</span>
worker_agents = self.agent.clone(num = self.workers)
runners = [ Runner(agent=agent, ix = ix + <span style="color:#00d;font-weight:bold">1</span>, train = <span style="color:#080;font-weight:bold">True</span>) <span style="color:#080;font-weight:bold">for</span> ix, agent <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(worker_agents) ]
<span style="color:#888"># we're going to communicate the workers' updates via the thread safe queue:</span>
queue = mp.SimpleQueue()
<span style="color:#888"># if we've not been given a number of episodes: assume the process is going to be interrupted with the keyboard interrupt once the user (us) decides so:</span>
<span style="color:#080;font-weight:bold">if</span> episodes <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Starting out an infinite training process'</span>)
<span style="color:#888"># create the actual background processes, making their entry be the train_one method:</span>
processes = [ mp.Process(target=self.train_one, args=(runners[ix - <span style="color:#00d;font-weight:bold">1</span>], queue, episodes, ix)) <span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, self.workers + <span style="color:#00d;font-weight:bold">1</span>) ]
<span style="color:#888"># run those processes:</span>
<span style="color:#080;font-weight:bold">for</span> process <span style="color:#080">in</span> processes:
process.start()
<span style="color:#080;font-weight:bold">try</span>:
<span style="color:#888"># what follows is a rather naive implementation of listening to workers updates. it works though for our purposes:</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">any</span>([ process.is_alive() <span style="color:#080;font-weight:bold">for</span> process <span style="color:#080">in</span> processes ]):
results = queue.get()
<span style="color:#080;font-weight:bold">yield</span> results
<span style="color:#080;font-weight:bold">except</span> <span style="color:#b06;font-weight:bold">Exception</span> <span style="color:#080;font-weight:bold">as</span> e:
logger.error(<span style="color:#038">str</span>(e))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">train_one</span>(self, runner, queue, episodes = <span style="color:#00d;font-weight:bold">1000</span>, ix = <span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Orchestrate the training for a single worker runner and agent. This is intended to run in its own background process.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#888"># possibly naive way of trying to de-correlate the weight updates further (I have no hard evidence to prove if it works, other than my subjective observation):</span>
time.sleep(ix)
<span style="color:#080;font-weight:bold">try</span>:
<span style="color:#888"># we are going to request that the episode be reset whenever our agent scores lower than its max points. the same will happen if the agent scores a total of -10 points:</span>
max_points = <span style="color:#00d;font-weight:bold">0</span>
max_eval_points = <span style="color:#00d;font-weight:bold">0</span>
min_points = <span style="color:#00d;font-weight:bold">0</span>
max_episode = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> episode_ix <span style="color:#080">in</span> itertools.count(start=<span style="color:#00d;font-weight:bold">0</span>, step=<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">if</span> episodes <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span> <span style="color:#080">and</span> episode_ix >= episodes:
<span style="color:#080;font-weight:bold">return</span>
max_episode_points = <span style="color:#00d;font-weight:bold">0</span>
points = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#888"># load up the newest weights every new episode:</span>
runner.load_state_dict(self.agent.state_dict())
<span style="color:#888"># every 5 episodes lets evaluate the weights we've learned so far by recording the run of the car using the greedy strategy:</span>
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> episode_ix % <span style="color:#00d;font-weight:bold">5</span> == <span style="color:#00d;font-weight:bold">0</span>:
eval_points = self.record_greedy(episode_ix)
<span style="color:#080;font-weight:bold">if</span> eval_points > max_eval_points:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(eval_points) + <span style="color:#d20;background-color:#fff0f0">'-eval-points.pytorch'</span>)
max_eval_points = eval_points
<span style="color:#888"># each n-step window, compute the gradients and apply</span>
<span style="color:#888"># also: decide if we shouldn't restart the episode if we don't want to explore too much of the not-useful state space:</span>
<span style="color:#080;font-weight:bold">for</span> step, rewards, values, policies, action_ixs, terminal <span style="color:#080">in</span> runner.run_episode(yield_every=self.window):
points += <span style="color:#038">sum</span>(rewards)
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> points > max_points:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">'-points.pytorch'</span>)
max_points = points
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> episode_ix > max_episode:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">'-episode.pytorch'</span>)
max_episode = episode_ix
<span style="color:#080;font-weight:bold">if</span> points < -<span style="color:#00d;font-weight:bold">10</span> <span style="color:#080">or</span> (max_episode_points > min_points <span style="color:#080">and</span> points < min_points):
terminal = <span style="color:#080;font-weight:bold">True</span>
max_episode_points = <span style="color:#00d;font-weight:bold">0</span>
points = <span style="color:#00d;font-weight:bold">0</span>
runner.ask_reset()
<span style="color:#080;font-weight:bold">if</span> terminal:
logger.info(<span style="color:#d20;background-color:#fff0f0">'TERMINAL for '</span> + <span style="color:#038">str</span>(ix) + <span style="color:#d20;background-color:#fff0f0">' at step '</span> + <span style="color:#038">str</span>(step) + <span style="color:#d20;background-color:#fff0f0">' with total points '</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">' max: '</span> + <span style="color:#038">str</span>(max_episode_points) )
<span style="color:#888"># if we're learning, then compute and apply the gradients and load the newest weights:</span>
<span style="color:#080;font-weight:bold">if</span> runner.train:
loss = self.apply_gradients(policies, action_ixs, rewards, values, terminal, runner)
runner.load_state_dict(self.agent.state_dict())
max_episode_points = <span style="color:#038">max</span>(max_episode_points, points)
min_points = <span style="color:#038">max</span>(min_points, points)
<span style="color:#888"># communicate the gathered values to the main process:</span>
queue.put((ix, episode_ix, step, rewards, loss, points, terminal))
<span style="color:#080;font-weight:bold">except</span> <span style="color:#b06;font-weight:bold">Exception</span> <span style="color:#080;font-weight:bold">as</span> e:
string = traceback.format_exc()
logger.error(<span style="color:#038">str</span>(e) + <span style="color:#d20;background-color:#fff0f0">' → '</span> + string)
queue.put((ix, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, [-<span style="color:#00d;font-weight:bold">1</span>], -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#038">str</span>(e) + <span style="color:#d20;background-color:#fff0f0">'<br />'</span> + string, <span style="color:#080;font-weight:bold">True</span>))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">record_greedy</span>(self, episode_ix):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Records the video of the "greedy" run based on the current weights.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
directory = <span style="color:#d20;background-color:#fff0f0">'./videos/car-racing/episode-'</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">'-'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(time.time()))
player = Player(agent=self.agent, directory=directory, train=<span style="color:#080;font-weight:bold">False</span>)
points = player.play()
logger.info(<span style="color:#d20;background-color:#fff0f0">'Evaluation at episode '</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">': '</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">' points ('</span> + directory + <span style="color:#d20;background-color:#fff0f0">')'</span>)
<span style="color:#080;font-weight:bold">return</span> points
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">apply_gradients</span>(self, policies, actions, rewards, values, terminal, runner):
worker_agent = runner.agent
actions_one_hot = torch.tensor([[ <span style="color:#038">int</span>(i == action) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">4</span>) ] <span style="color:#080;font-weight:bold">for</span> action <span style="color:#080">in</span> actions], dtype=torch.float32)
policies = torch.stack(policies)
values = torch.cat(values)
values_nograd = torch.zeros_like(values.detach(), requires_grad=<span style="color:#080;font-weight:bold">False</span>)
values_nograd.copy_(values)
discounted_rewards = self.discount_rewards(runner, rewards, values_nograd[-<span style="color:#00d;font-weight:bold">1</span>], terminal)
advantages = discounted_rewards - values_nograd
logger.info(<span style="color:#d20;background-color:#fff0f0">'Runner '</span> + <span style="color:#038">str</span>(runner.ix) + <span style="color:#d20;background-color:#fff0f0">' Rewards: '</span> + <span style="color:#038">str</span>(rewards))
logger.info(<span style="color:#d20;background-color:#fff0f0">'Runner '</span> + <span style="color:#038">str</span>(runner.ix) + <span style="color:#d20;background-color:#fff0f0">' Discounted Rewards: '</span> + <span style="color:#038">str</span>(discounted_rewards.numpy()))
log_policies = torch.log(<span style="color:#00d;font-weight:bold">0.00000001</span> + policies)
one_log_policies = torch.sum(log_policies * actions_one_hot, dim=<span style="color:#00d;font-weight:bold">1</span>)
entropy = torch.sum(policies * -log_policies)
policy_loss = -torch.mean(one_log_policies * advantages)
value_loss = F.mse_loss(values, discounted_rewards)
value_loss_nograd = torch.zeros_like(value_loss)
value_loss_nograd.copy_(value_loss)
policy_loss_nograd = torch.zeros_like(policy_loss)
policy_loss_nograd.copy_(policy_loss)
logger.info(<span style="color:#d20;background-color:#fff0f0">'Value Loss: '</span> + <span style="color:#038">str</span>(<span style="color:#038">float</span>(value_loss_nograd)) + <span style="color:#d20;background-color:#fff0f0">' Policy Loss: '</span> + <span style="color:#038">str</span>(<span style="color:#038">float</span>(policy_loss_nograd)))
loss = policy_loss + <span style="color:#00d;font-weight:bold">0.5</span> * value_loss - <span style="color:#00d;font-weight:bold">0.01</span> * entropy
self.agent.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(worker_agent.parameters(), <span style="color:#00d;font-weight:bold">40</span>)
<span style="color:#888"># the following step is crucial. at this point, all the info about the gradients reside in the worker agent's memory. We need to "move" those gradients into the main agent's memory:</span>
self.share_gradients(worker_agent)
<span style="color:#888"># update the weights with the computed gradients:</span>
self.optimizer.step()
worker_agent.zero_grad()
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">float</span>(loss.detach())
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">share_gradients</span>(self, worker_agent):
<span style="color:#080;font-weight:bold">for</span> param, shared_param <span style="color:#080">in</span> <span style="color:#038">zip</span>(worker_agent.parameters(), self.agent.parameters()):
<span style="color:#080;font-weight:bold">if</span> shared_param.grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#080;font-weight:bold">return</span>
shared_param._grad = param.grad
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clip_reward</span>(self, reward):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Clips the rewards into the <-3, 3> range, preventing too large a variance in the gradients.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">max</span>(<span style="color:#038">min</span>(reward, <span style="color:#00d;font-weight:bold">3</span>), -<span style="color:#00d;font-weight:bold">3</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">discount_rewards</span>(self, runner, rewards, last_value, terminal):
discounted_rewards = [<span style="color:#00d;font-weight:bold">0</span> <span style="color:#080;font-weight:bold">for</span> _ <span style="color:#080">in</span> rewards]
loop_rewards = [ self.clip_reward(reward) <span style="color:#080;font-weight:bold">for</span> reward <span style="color:#080">in</span> rewards ]
<span style="color:#080;font-weight:bold">if</span> terminal:
loop_rewards.append(<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">else</span>:
loop_rewards.append(runner.get_value())
<span style="color:#080;font-weight:bold">for</span> main_ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#038">len</span>(discounted_rewards) - <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">for</span> inside_ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#038">len</span>(loop_rewards) - <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">if</span> inside_ix >= main_ix:
reward = loop_rewards[inside_ix]
discounted_rewards[main_ix] += self.gamma**(inside_ix - main_ix) * reward
<span style="color:#080;font-weight:bold">return</span> torch.tensor(discounted_rewards)
</code></pre></div><p>For the <code>record_greedy</code> method to work we need the following class:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Player</span>(Runner):
<span style="color:#080;font-weight:bold">def</span> __init__(self, directory, **kwargs):
<span style="color:#038">super</span>().__init__(ix=<span style="color:#00d;font-weight:bold">999</span>, **kwargs)
self.env = Monitor(self.env, directory)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">play</span>(self):
points = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> step, rewards, values, policies, actions, terminal <span style="color:#080">in</span> self.run_episode(yield_every = <span style="color:#00d;font-weight:bold">1</span>, do_render = <span style="color:#080;font-weight:bold">True</span>):
points += <span style="color:#038">sum</span>(rewards)
self.env.close()
<span style="color:#080;font-weight:bold">return</span> points
</code></pre></div><p>All the above code can be used as follows (in the Python script):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">"__main__"</span>:
agent = Agent()
trainer = Trainer(gamma = <span style="color:#00d;font-weight:bold">0.99</span>, agent = agent)
trainer.fit(episodes=<span style="color:#080;font-weight:bold">None</span>)
</code></pre></div><h4 id="the-importance-of-tuning-of-the-n-step-window-size">The importance of tuning the n-step window size</h4>
<p>Reading the code, you’ll notice that we’ve chosen $15$ as the size of the n-step window. We’ve also chosen $\gamma=0.99$. Getting those values right is a matter of tuning: the ones that work for one game or challenge will not necessarily work well for another.</p>
<p>Here’s a quick way to think about them: we’re going to be penalized most of the time, so it’s important to give the algorithm a chance to actually find trajectories that score positively. In the “CarRacing” challenge, I’ve found that it can take 10 steps of moving “full throttle” in the correct direction before we’re rewarded for entering a new “tile”. I simply added a safety margin of $5$ steps to that number. No mathematical proof backs this reasoning, but I can tell you that it made a <strong>huge</strong> difference in training time for me. The version of the code presented above starts to score above 700 points after approximately 10 hours on my Ryzen 7 based machine.</p>
<h4 id="problems-with-the-state-being-returned-from-the-environment---overcoming-with-the-explicit-render">Problems with the state being returned from the environment - overcoming with the explicit render</h4>
<p>You might have also noticed that I’m not using the state values returned by the <code>step</code> method of the gym environment. This might seem contrary to how the gym is typically used. After <strong>days</strong> of not seeing my model converge, though, I found that the <code>step</code> method was returning <strong>one and the same</strong> numpy array <strong>on each call</strong>. You can imagine that it was the absolute <strong>last</strong> thing I checked when trying to find that bug.</p>
<p>I found that <code>render(mode='rgb_array')</code> works as intended on each call. I just needed to write my own preprocessing code to scale the frame down and convert it to grayscale.</p>
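<p>For reference, that preprocessing is just a couple of array operations. This is a hedged sketch with assumed sizes, not the exact code from the project:</p>

```python
import numpy as np

# Hypothetical sketch: grayscale conversion plus naive downscaling of an
# RGB frame like the one returned by render(mode='rgb_array'). The 96x96
# input size and the stride-2 downscale factor are illustrative assumptions.
def preprocess(frame, downscale=2):
    gray = frame @ np.array([0.299, 0.587, 0.114])  # (H, W, 3) -> (H, W) luma
    small = gray[::downscale, ::downscale]          # keep every 2nd pixel
    return (small / 255.0).astype(np.float32)       # scale to [0, 1]

frame = np.random.randint(0, 256, size=(96, 96, 3))
state = preprocess(frame)
```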
<h4 id="how-to-know-when-the-algorithm-converges">How to know when the algorithm converges</h4>
<p>I’ve seen some people conclude that their A3C implementation does not converge: the resulting policy did not seem to work that well, and the training process was taking a bit longer than “some other implementation”. I fell for this kind of thinking myself. My humble bit of advice is to stick to what makes sense mathematically. Someone else’s model might be converging faster simply because of the hardware being used, or because of some slight difference in the code <strong>around</strong> the training (e.g. the explicit render needed in my case). This might not have anything to do with the A3C part at all.</p>
<p>How do we “stick to what makes sense mathematically”? Simply by logging the value loss and observing it as training continues. Intuitively, a model that has converged has already learned the value function. Those values — representing the average of the discounted rewards — should not make the loss too big most of the time. Still, for some states the best action will make $R_t$ much bigger than $V(s_t)$, so we should still see the loss spiking from time to time.</p>
<p>Again, the above advice doesn’t come with any mathematical proof. It’s simply what I found to work and make sense <strong>in my case</strong>.</p>
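<p>If you want the “log and watch” idea in code form, here’s a minimal sketch (my own illustration, not part of the training code) that smooths the value loss so spikes stand out against the trend:</p>

```python
# Minimal sketch: an exponential moving average of the value loss.
# The alpha and factor values are arbitrary illustrative choices.
class LossTracker:
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.ema = None  # running smoothed loss

    def update(self, loss):
        if self.ema is None:
            self.ema = loss
        else:
            self.ema = (1 - self.alpha) * self.ema + self.alpha * loss
        return self.ema

    def is_spike(self, loss, factor=5.0):
        """Flag losses far above the running average."""
        return self.ema is not None and loss > factor * self.ema

tracker = LossTracker()
for step_loss in [0.9, 1.1, 1.0, 0.95]:
    smoothed = tracker.update(step_loss)
```

<p>A converged model should show a small, stable running average, with only the occasional spike flagged.</p>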
<h3 id="the-results">The Results</h3>
<p>Instead of presenting hard-core statistics about the model’s performance (which wouldn’t make much sense, because I stopped training as soon as the “evaluation” videos started looking cool enough), I’ll just post three videos of the car driving on its own through three randomly generated tracks.</p>
<p>Have fun watching and even more fun coding it yourself too!</p>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/873-openaigym.video.92.68.video000000.mp4" type="video/mp4">
</video>
</center>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/892-openaigym.video.90.68.video000000.mp4" type="video/mp4">
</video>
</center>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/904-openaigym.video.77.68.video000000.mp4" type="video/mp4">
</video>
</center>
<script>
renderMathInElement(
document.body,
{
delimiters: [
{left: "$$", right: "$$", display: true},
{left: "\\[", right: "\\]", display: true},
{left: "$", right: "$", display: false},
{left: "\\(", right: "\\)", display: false}
]
}
);
</script>
<h2>Recommender System via a Simple Matrix Factorization</h2>
<p><a href="https://www.endpointdev.com/blog/2018/07/recommender-mxnet/">https://www.endpointdev.com/blog/2018/07/recommender-mxnet/</a> · 2018-07-17 · Kamil Ciemniewski</p>
<p><img src="/blog/2018/07/recommender-mxnet/10539898745_56b790e62e_o-crop.jpg" alt="people sitting and laughing" /><br><a href="https://www.flickr.com/photos/michaelcartwright/10539898745/">Photo by Michael Cartwright, CC BY-SA 2.0, cropped</a></p>
<p>We all like how apps like Spotify or Last.fm can recommend us a song that feels so much like our taste. Being able to recommend an item to a user is very important for keeping and expanding the user base.</p>
<p>In this article I’ll present an overview of building a recommendation system. The approach here is quite basic, but it’s grounded in valid and battle-tested theory. I’ll show you how to put this theory into practice by coding it in Python with the help of MXNet.</p>
<h3 id="kinds-of-recommenders">Kinds of recommenders</h3>
<p>The general setup of the content recommendation challenge is that we have <strong>users</strong> and <strong>items</strong>. The task is to recommend items to a particular user.</p>
<p>There are two distinct approaches to recommending content:</p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering">Content based filtering</a></li>
<li><a href="https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering">Collaborative filtering</a></li>
</ol>
<p>The first bases its output on the intricate features of the item and how they relate to the user. The second uses information about how other, similar users rate the items. More elaborate systems base their work on both; such systems are called <a href="https://en.wikipedia.org/wiki/Recommender_system#Hybrid_recommender_systems">hybrid recommender systems</a>.</p>
<p>This article is going to focus on <strong>collaborative filtering</strong> only.</p>
<h3 id="a-bit-of-theory-matrix-factorization">A bit of theory: matrix factorization</h3>
<p>In the simplest terms, we can represent interactions between users and items with a matrix:</p>
<table style="border-collapse: collapse; text-align: center"><tr><th></th><th>item1</th><th>item2</th><th>item3</th></tr><tr><th>user1</th><td>-1</td><td>-</td><td>0.6</td></tr><tr><th>user2</th><td>-</td><td>0.95</td><td>-0.1</td></tr><tr><th>user3</th><td>0.5</td><td>-</td><td>0.8</td></tr></table>
<p>In the above case users can rate items on the scale of <code><-1, 1></code>. Notice that in reality it’s most likely that users will not rate everything. The missing ratings are represented with the dash: <code>-</code>.</p>
<p>Just by looking at the above table, we know that no amount of math is going to change the fact that user1 completely dislikes item1, or that user2 likes item2 a lot. The ratings we already have make for a fairly easy set of items to propose. The goal of a recommender, though, is not to propose items the users already know. We want to predict which of the “dashes” in the table are most likely to hide high ratings. In other words: we want to reconstruct the full representation of the above matrix based only on its “sparse” representation as shown above.</p>
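<p>For concreteness, that sparse matrix can be written down directly, with <code>NaN</code> standing in for the dashes (a NumPy sketch, not code from this article):</p>

```python
import numpy as np

# The ratings table from above; NaN marks a missing (unrated) entry.
C = np.array([
    [-1.0,   np.nan,  0.6],
    [np.nan, 0.95,   -0.1],
    [ 0.5,   np.nan,  0.8],
])

known = ~np.isnan(C)            # mask of observed ratings
density = known.sum() / C.size  # fraction of the matrix we actually know
```

<p>Real-world rating matrices are far sparser than this toy one; most users rate only a handful of items.</p>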
<p>How can we solve this problem? Let’s recall the rules of multiplying two matrices:</p>
<p>Given two matrices <code>A: m × k</code> and <code>B: k × n</code>, their product is another matrix <code>C: m × n</code>. We can multiply two matrices only if the second dimension of the first matrix equals the first dimension of the second. In such a case, matrix <code>C</code> is the product of two factors: matrix <code>A</code> and matrix <code>B</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">C = AB
</code></pre></div><p>Imagine now that the sparse matrix represented by the ratings table is our <code>C</code>. We then assume that there exist two matrices, <code>A</code> and <code>B</code>, that (at least approximately) <em>factorize</em> <code>C</code>.</p>
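<p>As a quick shape check, with toy dimensions (a NumPy sketch of the factorization, not the article’s model):</p>

```python
import numpy as np

m, k, n = 3, 2, 3          # toy sizes; k is the latent dimension
A = np.random.randn(m, k)  # one k-sized vector per user (rows)
B = np.random.randn(k, n)  # one k-sized vector per item (columns)
C = A @ B                  # the full m x n reconstruction

# a single predicted rating is the dot product of a user row and an item column
pred = A[0] @ B[:, 2]
```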
<p>Notice also how this factorization is saving the space needed to persist the ratings:</p>
<p>Let’s assign concrete values to <code>m</code> and <code>n</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">m = <span style="color:#00d;font-weight:bold">1000000</span>
n = <span style="color:#00d;font-weight:bold">10000</span>
</code></pre></div><p>Then the full representation takes:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">m * n => <span style="color:#00d;font-weight:bold">10</span>,<span style="color:#00d;font-weight:bold">000</span>,<span style="color:#00d;font-weight:bold">000</span>,<span style="color:#00d;font-weight:bold">000</span>
</code></pre></div><p>We can now choose the value for <code>k</code>, to be later used when constructing the factorizing matrices:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">k = <span style="color:#00d;font-weight:bold">16</span>
</code></pre></div><p>Then to store both matrices: <code>A</code> and <code>B</code> we only need:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">m * k + n * k => <span style="color:#00d;font-weight:bold">16</span>,<span style="color:#00d;font-weight:bold">160</span>,<span style="color:#00d;font-weight:bold">000</span>
</code></pre></div><p>Making it into a fraction of the previous number:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">(m * k + n * k) / (m * n) => <span style="color:#00d;font-weight:bold">0.001616</span>
</code></pre></div><p>That’s a <strong>huge</strong> saving over the original space! The cost we pay is a small increase in the computation needed at retrieval time: inferring a rating of <code>C</code> from <code>A</code> and <code>B</code> requires a <strong>dot product</strong> of the corresponding row and column of those matrices.</p>
<h3 id="reasoning-about-the-matrix-factors">Reasoning about the matrix factors</h3>
<p>What intuition can we build for the above-mentioned matrices <code>A</code> and <code>B</code>? Looking at their dimensions, we can see that each row of <code>A</code> is a <code>k</code>-sized vector representing a user. Conversely, each column of <code>B</code> is a <code>k</code>-sized vector representing an item. The values in those vectors are called <strong>latent features</strong>, and the vectors themselves are sometimes called <strong>latent representations</strong> of users and items.</p>
<p>What could the intuition be? To split the original matrix, for each item we need to look at all of its interactions with users. You can imagine the algorithm finding patterns in the ratings that match certain characteristics of the item. If this were about movies, the features could be that it’s a comedy or sci-fi, or that it’s futuristic or set deep in ancient times. We’re essentially taking the original vector of a movie, which contains ratings, and distilling from it the features that describe the movie best. Note that this is only a half-truth: we think about it this way to have a way of explaining why the approach works. In many cases we’d have a hard time finding the actual real-world aspects that those latent features track.</p>
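<p>One practical consequence of these latent vectors: items whose vectors point in similar directions should feel similar. Cosine similarity is a common way to check that. This sketch, with made-up vectors, is my own illustration and not something the article’s model computes:</p>

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up 3-dimensional latent vectors for three hypothetical movies
comedy     = np.array([0.9, 0.1, -0.2])
comedy_too = np.array([0.8, 0.2, -0.1])
horror     = np.array([-0.7, 0.9, 0.4])

sim_close = cosine_similarity(comedy, comedy_too)
sim_far   = cosine_similarity(comedy, horror)
```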
<h3 id="factorizing-the-user--item-matrix-in-practice">Factorizing the user × item matrix in practice</h3>
<p>A simple approach to finding matrices <code>A</code> and <code>B</code> is to initialize them randomly. Then, for each known value in <code>C</code>, we compute the dot product of the corresponding row and column and measure how much it differs from the known value. Because the dot product is easily differentiable, we can use <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> to iteratively improve <code>A</code> and <code>B</code> until <code>AB</code> is close enough to <code>C</code> for our purposes.</p>
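<p>Stripped of any framework, that procedure fits in a few lines. The following is a toy NumPy sketch with hand-derived gradients; the article’s actual code below relies on MXNet’s automatic differentiation instead:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# the known entries of C from the ratings table, as (row, col, value) triples
ratings = [(0, 0, -1.0), (0, 2, 0.6), (1, 1, 0.95),
           (1, 2, -0.1), (2, 0, 0.5), (2, 2, 0.8)]

m, n, k, lr = 3, 3, 2, 0.1
A = rng.normal(scale=0.1, size=(m, k))  # random init, as described above
B = rng.normal(scale=0.1, size=(k, n))

for _ in range(5000):
    for i, j, r in ratings:
        err = A[i] @ B[:, j] - r  # prediction error on one known entry
        grad_a = err * B[:, j]    # gradient of err**2 / 2 w.r.t. A[i]
        grad_b = err * A[i]       # gradient of err**2 / 2 w.r.t. B[:, j]
        A[i] -= lr * grad_a
        B[:, j] -= lr * grad_b

max_err = max(abs(A[i] @ B[:, j] - r) for i, j, r in ratings)
```

<p>After training, the known entries are reconstructed closely, and the previously missing entries of <code>A @ B</code> serve as the predictions.</p>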
<p>In this article, I’m going to use a freely available database of joke ratings, called “<a href="http://eigentaste.berkeley.edu/dataset/">Jester</a>”. It contains data about ratings from 59132 users and 150 jokes.</p>
<h3 id="coding-the-model-with-mxnet">Coding the model with MXNet</h3>
<p>Let’s first import some of the classes and functions we’ll use later.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">mxnet.gluon</span> <span style="color:#080;font-weight:bold">import</span> Block, nn, Trainer
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">mxnet.gluon.loss</span> <span style="color:#080;font-weight:bold">import</span> L2Loss
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">mxnet</span> <span style="color:#080;font-weight:bold">import</span> autograd, ndarray <span style="color:#080;font-weight:bold">as</span> F
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">mxnet</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">mx</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">numpy</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">np</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">random</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">logging</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">re</span>
</code></pre></div><p>The first step in building the training process is to create an iterator over the training batches read from the data files. To keep things trivially simple, I’ll read the whole dataset into memory; the batches will then be constructed from the data cached there.</p>
<p>To create a custom data iterator, we’ll need to inherit from <code>mxnet.io.DataIter</code> and implement at least two methods: <code>next</code> and <code>reset</code>. Here’s our simple code:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">DataIter</span>(mx.io.DataIter):
<span style="color:#080;font-weight:bold">def</span> __init__(self, data, batch_size = <span style="color:#00d;font-weight:bold">16</span>):
<span style="color:#038">super</span>(DataIter, self).__init__()
self.batch_size = batch_size
self.all_user_ids = <span style="color:#038">set</span>()
self.data = data
self.index = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> user_id, item_id, _ <span style="color:#080">in</span> data:
self.all_user_ids.add(user_id)
<span style="color:#555">@property</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">user_count</span>(self):
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">len</span>(self.all_user_ids)
<span style="color:#555">@property</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">item_count</span>(self):
<span style="color:#888"># we just know the value even though 10 of them were</span>
<span style="color:#888"># not voted</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#00d;font-weight:bold">150</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">next</span>(self):
index = self.index * self.batch_size
endindex = index + self.batch_size
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">len</span>(self.data) <= index:
<span style="color:#080;font-weight:bold">raise</span> <span style="color:#b06;font-weight:bold">StopIteration</span>
<span style="color:#080;font-weight:bold">else</span>:
user_ids = []
item_ids = []
ratings = []
user_ids = self.data[index:endindex, <span style="color:#00d;font-weight:bold">0</span>]
item_ids = self.data[index:endindex, <span style="color:#00d;font-weight:bold">1</span>]
ratings = self.data[index:endindex, <span style="color:#00d;font-weight:bold">2</span>]
data_all = [mx.nd.array(user_ids), mx.nd.array(item_ids)]
label_all = [mx.nd.array([r]) <span style="color:#080;font-weight:bold">for</span> r <span style="color:#080">in</span> ratings]
self.index += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">return</span> mx.io.DataBatch(data_all, label_all)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">reset</span>(self):
self.index = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#888"># use numpy's shuffle; random.shuffle can duplicate rows of a 2-D array</span>
np.random.shuffle(self.data)
</code></pre></div><p>The above <code>DataIter</code> class expects to be given a <code>numpy</code> array with all the training examples. The first column represents the user, the second the item, and the third the rating.</p>
<p>Here’s the code for reading data from disk and feeding it into the <code>DataIter</code>’s constructor:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_data</span>(batch_size):
user_ids = []
item_ids = []
ratings = []
<span style="color:#080;font-weight:bold">with</span> <span style="color:#038">open</span>(<span style="color:#d20;background-color:#fff0f0">"data/jester_ratings.dat"</span>, <span style="color:#d20;background-color:#fff0f0">"r"</span>) <span style="color:#080;font-weight:bold">as</span> file:
<span style="color:#080;font-weight:bold">for</span> line <span style="color:#080">in</span> file:
user_id, _, item_id, _, rating = line.strip().split(<span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#04d;background-color:#fff0f0">\t</span><span style="color:#d20;background-color:#fff0f0">"</span>)
user_ids.append(<span style="color:#038">int</span>(user_id))
item_ids.append(<span style="color:#038">int</span>(item_id))
ratings.append(<span style="color:#038">float</span>(rating) / <span style="color:#00d;font-weight:bold">10.0</span>)
all_raw = np.asarray(<span style="color:#038">list</span>(<span style="color:#038">zip</span>(user_ids, item_ids, ratings)), dtype=<span style="color:#d20;background-color:#fff0f0">'float32'</span>)
<span style="color:#080;font-weight:bold">return</span> DataIter(all_raw, batch_size = batch_size)
</code></pre></div><p>Notice that I’m dividing each rating by <code>10</code> to scale the ratings from <code><-10,10></code> down to <code><-1,1></code>. I’m doing so because I found the process hitting numerical overflows when using the <code>Adam</code> optimizer otherwise.</p>
<p>The function accepts the <code>batch_size</code> as an argument. Below I’m creating a dataset iterator yielding 64 examples at a time:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">train = get_data(<span style="color:#00d;font-weight:bold">64</span>)
</code></pre></div><p>Recent versions of MXNet bring in a coding model similar to the one found in PyTorch. We can use the clean approach of defining the model by extending a base class and implementing the <code>forward</code> method. This is possible thanks to the <code>mxnet.gluon</code> module, which defines the <code>Block</code> class.</p>
<p>As a full-featured deep learning framework, MXNet has its own implementation of automatic gradient calculation. The <code>forward</code> method in our <code>Block</code> subclass is all we need to proceed with gradient descent.</p>
<p>In our model, the <code>A</code> and <code>B</code> matrices will be encoded within the <code>gluon</code> layers of type <code>Embedding</code>. The <code>Embedding</code> class lets you specify the number of rows in the matrix as well as the dimension into which we’re “squashing” them. Using the class is very handy as it doesn’t require us to “<a href="https://en.wikipedia.org/wiki/One-hot">one-hot encode</a>” the user and item IDs.</p>
<p>Following is the implementation of our simple model as an <code>MXNet</code> block. Notice that all it really is, is a regression; the model is linear, so we’re not using any <a href="https://en.wikipedia.org/wiki/Activation_function">activation function</a>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Model</span>(Block):
<span style="color:#080;font-weight:bold">def</span> __init__(self, k, dataiter, **kwargs):
<span style="color:#038">super</span>(Model, self).__init__(**kwargs)
<span style="color:#080;font-weight:bold">with</span> self.name_scope():
self.user_embedding = nn.Embedding(input_dim = dataiter.user_count, output_dim=k)
self.item_embedding = nn.Embedding(input_dim = dataiter.item_count, output_dim=k)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, x):
user = self.user_embedding(x[<span style="color:#00d;font-weight:bold">0</span>] - <span style="color:#00d;font-weight:bold">1</span>)
item = self.item_embedding(x[<span style="color:#00d;font-weight:bold">1</span>] - <span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#888"># the following is a dot product in essence</span>
<span style="color:#888"># summing up of the element-wise multiplication</span>
pred = user * item
<span style="color:#080;font-weight:bold">return</span> F.sum_axis(pred, axis = <span style="color:#00d;font-weight:bold">1</span>)
</code></pre></div><p>Next, I’m creating the MXNet computation context as well as an instance of the model itself. Before doing any kind of learning, the parameters of the model will need to be initialized:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">context = mx.gpu() <span style="color:#080;font-weight:bold">if</span> mx.test_utils.list_gpus() <span style="color:#080;font-weight:bold">else</span> mx.cpu()
model = Model(<span style="color:#00d;font-weight:bold">16</span>, train)
model.collect_params().initialize(mx.init.Xavier(), ctx=context)
</code></pre></div><p>The last line from above is initializing the <code>A</code> and <code>B</code> matrices randomly.</p>
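<p>Incidentally, the sum of an element-wise product in the model’s <code>forward</code> is exactly a batched dot product. In plain NumPy terms (an illustration, not the article’s code):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
user = rng.normal(size=(4, 16))  # a batch of 4 user embeddings, k = 16
item = rng.normal(size=(4, 16))  # the matching batch of item embeddings

# what the forward method does: element-wise multiply, then sum over axis 1
pred = (user * item).sum(axis=1)

# the equivalent batched dot product
pred_dot = np.einsum('bk,bk->b', user, item)
```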
<p>We are going to save the state of the model periodically to a file. We’ll be able to load it back with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model.load_params(<span style="color:#d20;background-color:#fff0f0">"model.mxnet"</span>, ctx=context)
</code></pre></div><p>The last bit of code that we need is the training procedure itself. We’re going to code it as a function that takes the model, the data iterator and the number of epochs:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">fit</span>(model, train, num_epoch):
trainer = Trainer(model.collect_params(), <span style="color:#d20;background-color:#fff0f0">'adam'</span>)
<span style="color:#080;font-weight:bold">for</span> epoch_id <span style="color:#080">in</span> <span style="color:#038">range</span>(num_epoch):
batch_id = <span style="color:#00d;font-weight:bold">0</span>
train.reset()
<span style="color:#080;font-weight:bold">for</span> batch <span style="color:#080">in</span> train:
<span style="color:#080;font-weight:bold">with</span> autograd.record():
targets = F.concat(*batch.label, dim=<span style="color:#00d;font-weight:bold">0</span>)
predictions = model(batch.data)
L = L2Loss()
loss = L(predictions, targets)
loss.backward()
trainer.step(batch.data[<span style="color:#00d;font-weight:bold">0</span>].shape[<span style="color:#00d;font-weight:bold">0</span>])
<span style="color:#080;font-weight:bold">if</span> (batch_id + <span style="color:#00d;font-weight:bold">1</span>) % <span style="color:#00d;font-weight:bold">1000</span> == <span style="color:#00d;font-weight:bold">0</span>:
mean_loss = F.mean(loss).asnumpy()[<span style="color:#00d;font-weight:bold">0</span>]
logger.info(<span style="color:#d20;background-color:#fff0f0">f</span><span style="color:#d20;background-color:#fff0f0">'Epoch </span><span style="color:#33b;background-color:#fff0f0">{</span>epoch_id + <span style="color:#00d;font-weight:bold">1</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> / </span><span style="color:#33b;background-color:#fff0f0">{</span>num_epoch<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> | Batch </span><span style="color:#33b;background-color:#fff0f0">{</span>batch_id + <span style="color:#00d;font-weight:bold">1</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> | Mean Loss: </span><span style="color:#33b;background-color:#fff0f0">{</span>mean_loss<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">'</span>)
batch_id += <span style="color:#00d;font-weight:bold">1</span>
logger.info(<span style="color:#d20;background-color:#fff0f0">'Saving model parameters'</span>)
model.save_params(<span style="color:#d20;background-color:#fff0f0">"model.mxnet"</span>)
</code></pre></div><p>Running the trainer for 10 epochs is as simple as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">fit(model, train, num_epoch=<span style="color:#00d;font-weight:bold">10</span>)
</code></pre></div><p>The training process is periodically outputting statistics similar to ones below:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">INFO:root:Epoch 1 / 10 | Batch 1000 | Mean Loss: 0.11189080774784088
INFO:root:Epoch 1 / 10 | Batch 2000 | Mean Loss: 0.12274568527936935
INFO:root:Epoch 1 / 10 | Batch 3000 | Mean Loss: 0.1204155907034874
INFO:root:Epoch 1 / 10 | Batch 4000 | Mean Loss: 0.12192331254482269
(...)
INFO:root:Epoch 10 / 10 | Batch 24000 | Mean Loss: 0.0003094784333370626
INFO:root:Epoch 10 / 10 | Batch 25000 | Mean Loss: 0.0006345464498735964
INFO:root:Epoch 10 / 10 | Batch 26000 | Mean Loss: 0.0007207655580714345
INFO:root:Epoch 10 / 10 | Batch 27000 | Mean Loss: 0.005522257648408413
INFO:root:Saving model parameters
</code></pre></div><h3 id="using-the-trained-latent-feature-matrices">Using the trained latent feature matrices</h3>
<p>To extract the latent matrices from the trained model, we need to use <code>collect_params</code> as shown below:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">user_embed = model.collect_params().get(<span style="color:#d20;background-color:#fff0f0">'embedding0_weight'</span>).data()
joke_embed = model.collect_params().get(<span style="color:#d20;background-color:#fff0f0">'embedding1_weight'</span>).data()
</code></pre></div><p>Each user’s latent representation is a vector of <code>k</code> values:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> user_embed[<span style="color:#00d;font-weight:bold">0</span>]
[ <span style="color:#00d;font-weight:bold">0.11911439</span> -<span style="color:#00d;font-weight:bold">0.01560098</span> -<span style="color:#00d;font-weight:bold">0.26248184</span> <span style="color:#00d;font-weight:bold">0.5341552</span> <span style="color:#00d;font-weight:bold">1.3078408</span> -<span style="color:#00d;font-weight:bold">0.82505447</span>
<span style="color:#00d;font-weight:bold">0.2181341</span> <span style="color:#00d;font-weight:bold">0.69577765</span> -<span style="color:#00d;font-weight:bold">0.22569533</span> -<span style="color:#00d;font-weight:bold">0.7669992</span> <span style="color:#00d;font-weight:bold">0.14042236</span> <span style="color:#00d;font-weight:bold">0.78608125</span>
<span style="color:#00d;font-weight:bold">0.07242275</span> <span style="color:#00d;font-weight:bold">0.49357334</span> <span style="color:#00d;font-weight:bold">0.7525147</span> <span style="color:#00d;font-weight:bold">0.37984315</span>]
<NDArray <span style="color:#00d;font-weight:bold">16</span> <span style="color:#555">@cpu</span>(<span style="color:#00d;font-weight:bold">0</span>)>
</code></pre></div><p>The same goes for the latent representations of jokes:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> joke_embed[<span style="color:#00d;font-weight:bold">7</span>]
[ <span style="color:#00d;font-weight:bold">0.11836094</span> <span style="color:#00d;font-weight:bold">0.14039275</span> -<span style="color:#00d;font-weight:bold">0.10859593</span> -<span style="color:#00d;font-weight:bold">0.13673168</span> <span style="color:#00d;font-weight:bold">0.14074579</span> -<span style="color:#00d;font-weight:bold">0.18800738</span>
<span style="color:#00d;font-weight:bold">0.0463879</span> -<span style="color:#00d;font-weight:bold">0.09659509</span> <span style="color:#00d;font-weight:bold">0.1629943</span> <span style="color:#00d;font-weight:bold">0.02109279</span> -<span style="color:#00d;font-weight:bold">0.0294639</span> -<span style="color:#00d;font-weight:bold">0.03487734</span>
-<span style="color:#00d;font-weight:bold">0.18192524</span> -<span style="color:#00d;font-weight:bold">0.13103536</span> -<span style="color:#00d;font-weight:bold">0.10280509</span> <span style="color:#00d;font-weight:bold">0.14753008</span>]
<NDArray <span style="color:#00d;font-weight:bold">16</span> <span style="color:#555">@cpu</span>(<span style="color:#00d;font-weight:bold">0</span>)>
</code></pre></div><p>Let’s first test to see if the known values got reconstructed:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> F.dot(user_embed[<span style="color:#00d;font-weight:bold">0</span>], joke_embed[<span style="color:#00d;font-weight:bold">7</span>]) * <span style="color:#00d;font-weight:bold">10</span>
[-<span style="color:#00d;font-weight:bold">9.26895</span>]
<NDArray <span style="color:#00d;font-weight:bold">1</span> <span style="color:#555">@cpu</span>(<span style="color:#00d;font-weight:bold">0</span>)>
</code></pre></div><p>Comparing it with the value from the file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">> cat data/jester_ratings.dat | rg <span style="color:#d20;background-color:#fff0f0">"^1\t\t8\t"</span>
<span style="color:#00d;font-weight:bold">1</span> <span style="color:#00d;font-weight:bold">8</span> -9.281
</code></pre></div><p>That’s close enough. Let’s now get the set of all joke ids rated by the first user:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">test = get_data(<span style="color:#00d;font-weight:bold">1</span>)
joke_ids = <span style="color:#038">set</span>()
<span style="color:#080;font-weight:bold">for</span> batch <span style="color:#080">in</span> test:
user_id, joke_id = batch.data
<span style="color:#080;font-weight:bold">if</span> user_id.asnumpy()[<span style="color:#00d;font-weight:bold">0</span>] == <span style="color:#00d;font-weight:bold">1</span>:
joke_ids.add(joke_id.asnumpy()[<span style="color:#00d;font-weight:bold">0</span>])
joke_ids
</code></pre></div><p>The above code outputs:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">{<span style="color:#00d;font-weight:bold">5.0</span>, <span style="color:#00d;font-weight:bold">7.0</span>, <span style="color:#00d;font-weight:bold">8.0</span>, <span style="color:#00d;font-weight:bold">13.0</span>, <span style="color:#00d;font-weight:bold">15.0</span>, <span style="color:#00d;font-weight:bold">16.0</span>, <span style="color:#00d;font-weight:bold">17.0</span>, <span style="color:#00d;font-weight:bold">18.0</span>, <span style="color:#00d;font-weight:bold">19.0</span>, <span style="color:#00d;font-weight:bold">20.0</span>, <span style="color:#00d;font-weight:bold">21.0</span>, <span style="color:#00d;font-weight:bold">22.0</span>, <span style="color:#00d;font-weight:bold">23.0</span>, <span style="color:#00d;font-weight:bold">24.0</span>, <span style="color:#00d;font-weight:bold">25.0</span>, <span style="color:#00d;font-weight:bold">26.0</span>, <span style="color:#00d;font-weight:bold">27.0</span>, <span style="color:#00d;font-weight:bold">29.0</span>, <span style="color:#00d;font-weight:bold">31.0</span>, <span style="color:#00d;font-weight:bold">32.0</span>, <span style="color:#00d;font-weight:bold">34.0</span>, <span style="color:#00d;font-weight:bold">35.0</span>, <span style="color:#00d;font-weight:bold">36.0</span>, <span style="color:#00d;font-weight:bold">42.0</span>, <span style="color:#00d;font-weight:bold">49.0</span>, <span style="color:#00d;font-weight:bold">50.0</span>, <span style="color:#00d;font-weight:bold">51.0</span>, <span style="color:#00d;font-weight:bold">52.0</span>, <span style="color:#00d;font-weight:bold">53.0</span>, <span style="color:#00d;font-weight:bold">54.0</span>, <span style="color:#00d;font-weight:bold">61.0</span>, <span style="color:#00d;font-weight:bold">62.0</span>, <span style="color:#00d;font-weight:bold">65.0</span>, <span 
style="color:#00d;font-weight:bold">66.0</span>, <span style="color:#00d;font-weight:bold">68.0</span>, <span style="color:#00d;font-weight:bold">69.0</span>, <span style="color:#00d;font-weight:bold">72.0</span>, <span style="color:#00d;font-weight:bold">76.0</span>, <span style="color:#00d;font-weight:bold">80.0</span>, <span style="color:#00d;font-weight:bold">81.0</span>, <span style="color:#00d;font-weight:bold">83.0</span>, <span style="color:#00d;font-weight:bold">87.0</span>, <span style="color:#00d;font-weight:bold">89.0</span>, <span style="color:#00d;font-weight:bold">91.0</span>, <span style="color:#00d;font-weight:bold">92.0</span>, <span style="color:#00d;font-weight:bold">93.0</span>, <span style="color:#00d;font-weight:bold">102.0</span>, <span style="color:#00d;font-weight:bold">103.0</span>, <span style="color:#00d;font-weight:bold">104.0</span>, <span style="color:#00d;font-weight:bold">105.0</span>, <span style="color:#00d;font-weight:bold">106.0</span>, <span style="color:#00d;font-weight:bold">107.0</span>, <span style="color:#00d;font-weight:bold">108.0</span>, <span style="color:#00d;font-weight:bold">109.0</span>, <span style="color:#00d;font-weight:bold">118.0</span>, <span style="color:#00d;font-weight:bold">119.0</span>, <span style="color:#00d;font-weight:bold">120.0</span>, <span style="color:#00d;font-weight:bold">121.0</span>, <span style="color:#00d;font-weight:bold">123.0</span>, <span style="color:#00d;font-weight:bold">127.0</span>, <span style="color:#00d;font-weight:bold">128.0</span>, <span style="color:#00d;font-weight:bold">134.0</span>}
</code></pre></div><p>Since we’re mostly interested in the items the user has not yet rated, let’s see what the model learned about them in this context:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> <span style="color:#038">sorted</span>([ (i, F.dot(user_embed[<span style="color:#00d;font-weight:bold">0</span>], joke_embed[i]).asnumpy()[<span style="color:#00d;font-weight:bold">0</span>] * <span style="color:#00d;font-weight:bold">10</span>) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">150</span>) <span style="color:#080;font-weight:bold">if</span> i + <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">not</span> <span style="color:#080">in</span> joke_ids ], key=<span style="color:#080;font-weight:bold">lambda</span> x: x[<span style="color:#00d;font-weight:bold">1</span>])
[(<span style="color:#00d;font-weight:bold">100</span>, -<span style="color:#00d;font-weight:bold">25.34627914428711</span>),
(<span style="color:#00d;font-weight:bold">89</span>, -<span style="color:#00d;font-weight:bold">23.647150993347168</span>),
(<span style="color:#00d;font-weight:bold">63</span>, -<span style="color:#00d;font-weight:bold">23.543219566345215</span>),
(<span style="color:#00d;font-weight:bold">94</span>, -<span style="color:#00d;font-weight:bold">23.415722846984863</span>),
(<span style="color:#00d;font-weight:bold">70</span>, -<span style="color:#00d;font-weight:bold">22.017195224761963</span>),
(<span style="color:#00d;font-weight:bold">93</span>, -<span style="color:#00d;font-weight:bold">21.375732421875</span>),
(<span style="color:#00d;font-weight:bold">140</span>, -<span style="color:#00d;font-weight:bold">20.033082962036133</span>),
(<span style="color:#00d;font-weight:bold">81</span>, -<span style="color:#00d;font-weight:bold">18.813319206237793</span>),
(<span style="color:#00d;font-weight:bold">40</span>, -<span style="color:#00d;font-weight:bold">18.48101019859314</span>),
(<span style="color:#00d;font-weight:bold">135</span>, -<span style="color:#00d;font-weight:bold">18.216774463653564</span>),
(<span style="color:#00d;font-weight:bold">39</span>, -<span style="color:#00d;font-weight:bold">16.993610858917236</span>),
(<span style="color:#00d;font-weight:bold">123</span>, -<span style="color:#00d;font-weight:bold">16.66216731071472</span>),
(<span style="color:#00d;font-weight:bold">45</span>, -<span style="color:#00d;font-weight:bold">16.03758215904236</span>),
(<span style="color:#00d;font-weight:bold">59</span>, -<span style="color:#00d;font-weight:bold">15.045435428619385</span>),
(<span style="color:#00d;font-weight:bold">43</span>, -<span style="color:#00d;font-weight:bold">14.993469715118408</span>),
(<span style="color:#00d;font-weight:bold">74</span>, -<span style="color:#00d;font-weight:bold">12.132725715637207</span>),
(<span style="color:#00d;font-weight:bold">72</span>, -<span style="color:#00d;font-weight:bold">11.94629430770874</span>),
(<span style="color:#00d;font-weight:bold">76</span>, -<span style="color:#00d;font-weight:bold">11.861177682876587</span>),
(<span style="color:#00d;font-weight:bold">29</span>, -<span style="color:#00d;font-weight:bold">11.831218004226685</span>),
(<span style="color:#00d;font-weight:bold">114</span>, -<span style="color:#00d;font-weight:bold">11.82992935180664</span>),
(<span style="color:#00d;font-weight:bold">38</span>, -<span style="color:#00d;font-weight:bold">11.327273845672607</span>),
(<span style="color:#00d;font-weight:bold">98</span>, -<span style="color:#00d;font-weight:bold">10.9122633934021</span>),
(<span style="color:#00d;font-weight:bold">62</span>, -<span style="color:#00d;font-weight:bold">9.507511854171753</span>),
(<span style="color:#00d;font-weight:bold">32</span>, -<span style="color:#00d;font-weight:bold">9.498740434646606</span>),
(<span style="color:#00d;font-weight:bold">83</span>, -<span style="color:#00d;font-weight:bold">9.442780017852783</span>),
(<span style="color:#00d;font-weight:bold">56</span>, -<span style="color:#00d;font-weight:bold">9.361632466316223</span>),
(<span style="color:#00d;font-weight:bold">78</span>, -<span style="color:#00d;font-weight:bold">9.310351014137268</span>),
(<span style="color:#00d;font-weight:bold">109</span>, -<span style="color:#00d;font-weight:bold">8.428668975830078</span>),
(<span style="color:#00d;font-weight:bold">77</span>, -<span style="color:#00d;font-weight:bold">8.131155967712402</span>),
(<span style="color:#00d;font-weight:bold">47</span>, -<span style="color:#00d;font-weight:bold">7.274705171585083</span>),
(<span style="color:#00d;font-weight:bold">99</span>, -<span style="color:#00d;font-weight:bold">7.204542756080627</span>),
(<span style="color:#00d;font-weight:bold">42</span>, -<span style="color:#00d;font-weight:bold">7.091279625892639</span>),
(<span style="color:#00d;font-weight:bold">69</span>, -<span style="color:#00d;font-weight:bold">6.739482879638672</span>),
(<span style="color:#00d;font-weight:bold">57</span>, -<span style="color:#00d;font-weight:bold">6.623743772506714</span>),
(<span style="color:#00d;font-weight:bold">96</span>, -<span style="color:#00d;font-weight:bold">6.209834814071655</span>),
(<span style="color:#00d;font-weight:bold">134</span>, -<span style="color:#00d;font-weight:bold">5.58724582195282</span>),
(<span style="color:#00d;font-weight:bold">73</span>, -<span style="color:#00d;font-weight:bold">5.530622601509094</span>),
(<span style="color:#00d;font-weight:bold">110</span>, -<span style="color:#00d;font-weight:bold">5.126549005508423</span>),
(<span style="color:#00d;font-weight:bold">131</span>, -<span style="color:#00d;font-weight:bold">4.435622692108154</span>),
(<span style="color:#00d;font-weight:bold">9</span>, -<span style="color:#00d;font-weight:bold">4.142558574676514</span>),
(<span style="color:#00d;font-weight:bold">46</span>, -<span style="color:#00d;font-weight:bold">3.7173447012901306</span>),
(<span style="color:#00d;font-weight:bold">13</span>, -<span style="color:#00d;font-weight:bold">3.1510373950004578</span>),
(<span style="color:#00d;font-weight:bold">44</span>, -<span style="color:#00d;font-weight:bold">2.9845643043518066</span>),
(<span style="color:#00d;font-weight:bold">124</span>, -<span style="color:#00d;font-weight:bold">2.7145612239837646</span>),
(<span style="color:#00d;font-weight:bold">137</span>, -<span style="color:#00d;font-weight:bold">2.2213394939899445</span>),
(<span style="color:#00d;font-weight:bold">132</span>, -<span style="color:#00d;font-weight:bold">2.2054636478424072</span>),
(<span style="color:#00d;font-weight:bold">116</span>, -<span style="color:#00d;font-weight:bold">1.9229638576507568</span>),
(<span style="color:#00d;font-weight:bold">111</span>, -<span style="color:#00d;font-weight:bold">1.9177806377410889</span>),
(<span style="color:#00d;font-weight:bold">121</span>, -<span style="color:#00d;font-weight:bold">1.3515384495258331</span>),
(<span style="color:#00d;font-weight:bold">36</span>, -<span style="color:#00d;font-weight:bold">1.119830161333084</span>),
(<span style="color:#00d;font-weight:bold">2</span>, -<span style="color:#00d;font-weight:bold">1.0263845324516296</span>),
(<span style="color:#00d;font-weight:bold">136</span>, -<span style="color:#00d;font-weight:bold">0.14549612998962402</span>),
(<span style="color:#00d;font-weight:bold">97</span>, <span style="color:#00d;font-weight:bold">0.02288222312927246</span>),
(<span style="color:#00d;font-weight:bold">138</span>, <span style="color:#00d;font-weight:bold">0.23310404270887375</span>),
(<span style="color:#00d;font-weight:bold">11</span>, <span style="color:#00d;font-weight:bold">0.34488800913095474</span>),
(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0.3801669552922249</span>),
(<span style="color:#00d;font-weight:bold">95</span>, <span style="color:#00d;font-weight:bold">0.42442888021469116</span>),
(<span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">0.585017055273056</span>),
(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0.6578207015991211</span>),
(<span style="color:#00d;font-weight:bold">10</span>, <span style="color:#00d;font-weight:bold">1.0580871254205704</span>),
(<span style="color:#00d;font-weight:bold">148</span>, <span style="color:#00d;font-weight:bold">1.101222038269043</span>),
(<span style="color:#00d;font-weight:bold">85</span>, <span style="color:#00d;font-weight:bold">1.5351229906082153</span>),
(<span style="color:#00d;font-weight:bold">8</span>, <span style="color:#00d;font-weight:bold">1.8577364087104797</span>),
(<span style="color:#00d;font-weight:bold">129</span>, <span style="color:#00d;font-weight:bold">2.067573070526123</span>),
(<span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">2.5856217741966248</span>),
(<span style="color:#00d;font-weight:bold">125</span>, <span style="color:#00d;font-weight:bold">2.927420735359192</span>),
(<span style="color:#00d;font-weight:bold">145</span>, <span style="color:#00d;font-weight:bold">3.010193407535553</span>),
(<span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">3.240116238594055</span>),
(<span style="color:#00d;font-weight:bold">112</span>, <span style="color:#00d;font-weight:bold">3.8082027435302734</span>),
(<span style="color:#00d;font-weight:bold">115</span>, <span style="color:#00d;font-weight:bold">3.8878047466278076</span>),
(<span style="color:#00d;font-weight:bold">147</span>, <span style="color:#00d;font-weight:bold">4.29826945066452</span>),
(<span style="color:#00d;font-weight:bold">58</span>, <span style="color:#00d;font-weight:bold">5.724080801010132</span>),
(<span style="color:#00d;font-weight:bold">144</span>, <span style="color:#00d;font-weight:bold">6.969168186187744</span>),
(<span style="color:#00d;font-weight:bold">130</span>, <span style="color:#00d;font-weight:bold">7.328435778617859</span>),
(<span style="color:#00d;font-weight:bold">146</span>, <span style="color:#00d;font-weight:bold">8.421227931976318</span>),
(<span style="color:#00d;font-weight:bold">149</span>, <span style="color:#00d;font-weight:bold">8.71802568435669</span>),
(<span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">10.014463663101196</span>),
(<span style="color:#00d;font-weight:bold">143</span>, <span style="color:#00d;font-weight:bold">10.086603164672852</span>),
(<span style="color:#00d;font-weight:bold">113</span>, <span style="color:#00d;font-weight:bold">11.049185991287231</span>),
(<span style="color:#00d;font-weight:bold">66</span>, <span style="color:#00d;font-weight:bold">11.210532188415527</span>),
(<span style="color:#00d;font-weight:bold">139</span>, <span style="color:#00d;font-weight:bold">11.213960647583008</span>),
(<span style="color:#00d;font-weight:bold">142</span>, <span style="color:#00d;font-weight:bold">11.479517221450806</span>),
(<span style="color:#00d;font-weight:bold">128</span>, <span style="color:#00d;font-weight:bold">11.862180233001709</span>),
(<span style="color:#00d;font-weight:bold">141</span>, <span style="color:#00d;font-weight:bold">12.742302417755127</span>),
(<span style="color:#00d;font-weight:bold">54</span>, <span style="color:#00d;font-weight:bold">13.011351823806763</span>),
(<span style="color:#00d;font-weight:bold">55</span>, <span style="color:#00d;font-weight:bold">16.884247064590454</span>),
(<span style="color:#00d;font-weight:bold">37</span>, <span style="color:#00d;font-weight:bold">18.53071689605713</span>),
(<span style="color:#00d;font-weight:bold">87</span>, <span style="color:#00d;font-weight:bold">23.8028883934021</span>)]
</code></pre></div><p>The above output lists joke ids along with the rating the model predicts user 1 would give them. Some values fall outside of the <code><-10, 10></code> range, which is fine: we can simply clamp them, treating anything below -10 as -10 and anything above 10 as 10.</p>
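<p>As a quick sketch of that clamping step (plain Python, independent of the model code above; the <code>clamp_rating</code> name is ours):</p>

```python
# Clamp a raw model score to the Jester rating scale of [-10, 10].
def clamp_rating(score, low=-10.0, high=10.0):
    return max(low, min(high, score))

print(clamp_rating(-25.35))  # -10.0
print(clamp_rating(23.80))   # 10.0
print(clamp_rating(3.24))    # 3.24
```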
<p>Immediately we can see that with this recommender model we could recommend the jokes: <code>146, 149, 27, 143, 113, 66, 139, 142, 128, 141, 54, 55, 37, 87</code>.</p>
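<p>The same selection can be expressed as a small helper. This is a sketch working on plain <code>(joke_index, predicted_rating)</code> pairs like the ones printed above; the <code>top_recommendations</code> name is ours, not part of the model code:</p>

```python
# Return the n joke indices with the highest predicted ratings.
def top_recommendations(predictions, n=5):
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    return [ix for ix, _ in ranked[:n]]

# A few of the (index, prediction) pairs from the output above:
preds = [(87, 23.80), (37, 18.53), (55, 16.88), (54, 13.01), (2, -1.03)]
print(top_recommendations(preds, n=3))  # [87, 37, 55]
```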
<p>To have a little bit more fun, let’s write some code to read the actual text of the jokes. I took the following class from <a href="https://stackoverflow.com/questions/11061058/using-htmlparser-in-python-3-2">StackOverflow</a>; we’ll use it to strip HTML tags from the jokes file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">html.parser</span> <span style="color:#080;font-weight:bold">import</span> HTMLParser
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">MLStripper</span>(HTMLParser):
<span style="color:#080;font-weight:bold">def</span> __init__(self):
self.reset()
self.strict = <span style="color:#080;font-weight:bold">False</span>
self.convert_charrefs= <span style="color:#080;font-weight:bold">True</span>
self.fed = []
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">handle_data</span>(self, d):
self.fed.append(d)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_data</span>(self):
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">''</span>.join(self.fed)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">strip_tags</span>(html):
s = MLStripper()
s.feed(html)
<span style="color:#080;font-weight:bold">return</span> s.get_data()
</code></pre></div><p>Here’s the function that reads the file and uses the HTML tags stripping class:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_jokes</span>():
jokes = []
joke = <span style="color:#d20;background-color:#fff0f0">''</span>
pattern = re.compile(<span style="color:#d20;background-color:#fff0f0">'^</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">d+:$'</span>)
<span style="color:#080;font-weight:bold">with</span> <span style="color:#038">open</span>(<span style="color:#d20;background-color:#fff0f0">"data/jester_items.dat"</span>, <span style="color:#d20;background-color:#fff0f0">"r"</span>) <span style="color:#080;font-weight:bold">as</span> file:
<span style="color:#080;font-weight:bold">for</span> line <span style="color:#080">in</span> file:
<span style="color:#080;font-weight:bold">if</span> pattern.match(line):
joke = <span style="color:#d20;background-color:#fff0f0">''</span>
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">if</span> line.strip() == <span style="color:#d20;background-color:#fff0f0">''</span>:
jokes.append(strip_tags(joke).strip())
<span style="color:#080;font-weight:bold">else</span>:
joke += line
<span style="color:#080;font-weight:bold">return</span> jokes
</code></pre></div><p>Let’s now read them from disk and see an example joke our system would recommend to the first user:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> jokes = get_jokes()
> jokes[<span style="color:#00d;font-weight:bold">87</span>]
<span style="color:#d20;background-color:#fff0f0">'A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist.</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">The doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY...</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"Can you read this?" the doctor asked.</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"Read it?" the Czech answered. "Doc, I know him!"'</span>
</code></pre></div><h3 id="using-the-item-feature-vectors-to-find-similarities">Using the item feature vectors to find similarities</h3>
<p>One cool thing we can do with the latent vectors is measure how similar they are in terms of appealing to the same users. To do that we can use the so-called <strong>cosine similarity</strong>. The subject is very clearly described by Christian S. Perone <a href="http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/">in his blog post</a>.</p>
<p>It makes use of the angle between the two vectors and returns its cosine. Notice that it only cares about the angle between the vectors, and <strong>not</strong> their magnitudes. The codomain of the cosine function is <code><-1, 1></code>, and the same holds for the <em>cosine similarity</em>. It maps onto our sense of similarity quite naturally: <code>-1</code> means “the total opposite” and <code>1</code> means “exactly the same”.</p>
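<p>In symbols, for two feature vectors a and b:</p>
<p>$$
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$</p>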
<p>We can trivially implement the function as the dot product of the two vectors divided by the product of their norms:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">cos_similarity</span>(vec1, vec2):
<span style="color:#080;font-weight:bold">return</span> mx.nd.dot(vec1, vec2) / (F.norm(vec1) * F.norm(vec2))
</code></pre></div><p>We can use the new measurement to rank the jokes in terms of how close they are. Here’s a function that takes a joke ID and returns a list of IDs along with their similarity scores:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_scores</span>(joke_id):
scores = []
joke = joke_embed[joke_id]
<span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">150</span>):
scores.append((ix, cos_similarity(joke, joke_embed[ix]).asnumpy()[<span style="color:#00d;font-weight:bold">0</span>]))
<span style="color:#080;font-weight:bold">return</span> scores
</code></pre></div><p>The following function takes a joke ID, finds the most similar jokes (skipping the joke itself, which is always its own closest match), and prints the top three in a summary:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">print_joke_stats</span>(ix):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">by_second</span>(t):
<span style="color:#080;font-weight:bold">if</span> t[<span style="color:#00d;font-weight:bold">1</span>] <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#080;font-weight:bold">return</span> -<span style="color:#00d;font-weight:bold">2</span>
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">return</span> t[<span style="color:#00d;font-weight:bold">1</span>]
similar = get_scores(ix)
similar.sort(key=by_second)
similar.reverse()
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">f</span><span style="color:#d20;background-color:#fff0f0">'Jokes making same people laugh compared to:</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">=== </span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#33b;background-color:#fff0f0">{</span>jokes[ix]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#d20;background-color:#fff0f0">===:</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">'</span>)
<span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>):
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">f</span><span style="color:#d20;background-color:#fff0f0">'---</span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#33b;background-color:#fff0f0">{</span>jokes[similar[ix][<span style="color:#00d;font-weight:bold">0</span>]]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#d20;background-color:#fff0f0">---</span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#d20;background-color:#fff0f0">'</span>)
</code></pre></div><p>Let’s see which jokes our system found to crack up the same kinds of people:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">> print_joke_stats(87)
Jokes making same people laugh compared to:
===
A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist.
The doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY...
"Can you read this?" the doctor asked.
"Read it?" the Czech answered. "Doc, I know him!"
===:
---
A woman has twins, and gives them up for adoption. One of them goes to a family in Egypt and is named "Amal." The other goes to a family in Spain; they name him "Juan." Years later, Juan sends a picture of himself to his mom. Upon receiving the picture, she tells her husband that she wishes she also had a picture of Amal.
Her husband responds, "But they are twins--if you've seen Juan, you've seen Amal."
---
---
An explorer in the deepest Amazon suddenly finds himself surrounded by a bloodthirsty group of natives. Upon surveying the situation, he says quietly to himself, "Oh God, I'm screwed."
The sky darkens and a voice booms out, "No, you are NOT screwed. Pick up that stone at your feet and bash in the head of the chief standing in front of you."
So with the stone he bashes the life out of the chief. He stands above the lifeless body, breathing heavily and looking at 100 angry natives...
The voice booms out again, "Okay....NOW you're screwed."
---
---
A man is driving in the country one evening when his car stalls and won't start. He goes up to a nearby farm house for help, and because it is suppertime he is asked to stay for supper. When he sits down at the table he notices that a pig is sitting at the table with them for supper and that the pig has a wooden leg.
As they are eating and chatting, he eventually asks the farmer why the pig is there and why it has a wooden leg.
"Oh," says the farmer, "that is a very special pig. Last month my wife and daughter were in the barn when it caught fire. The pig saw this, ran to the barn, tipped over a pail of water, crawled over the wet floor to reach them and pulled them out of the barn safely. A special pig like that, you just don't eat it all at once!"
---
</code></pre></div><h3 id="final-words">Final words</h3>
<p>The approach presented here is relatively simple, yet people have found it surprisingly accurate. It does depend, though, on having enough data for each item; otherwise the accuracy degrades. The extreme case of not having enough data is known as the <a href="https://en.wikipedia.org/wiki/Cold_start_(computing)">cold start</a> problem.</p>
<p>Also, accuracy is not the only goal. <a href="https://en.wikipedia.org/wiki/Recommender_system">Wikipedia</a> lists “serendipity”, among other features, as an important factor in a successful system:</p>
<blockquote>
<p>Serendipity is a measure of “how surprising the recommendations are”. For instance, a recommender system that recommends milk to a customer in a grocery store might be perfectly accurate, but it is not a good recommendation because it is an obvious item for the customer to buy. However, high scores of serendipity may have a negative impact on accuracy.</p>
</blockquote>
<p>Researchers have been working on different approaches to tackling the above-mentioned issues. Netflix is known to use a “hybrid” approach, one that combines a content-based and a collaborative recommender. As per <a href="https://en.wikipedia.org/wiki/Recommender_system">Wikipedia</a>:</p>
<blockquote>
<p>Netflix is a good example of the use of hybrid recommender systems. The website makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).</p>
</blockquote>
<h2><a href="https://www.endpointdev.com/blog/2018/07/training-tesseract-models-from-scratch/">Training Tesseract 4 models from real images</a></h2>
<p>2018-07-09 · Kamil Ciemniewski</p>
<p><img src="/blog/2018/07/training-tesseract-models-from-scratch/15403939701_6e85c63a08_o-sm.jpg" alt="table of ancient alphabets" /></p>
<p>Over the years, Tesseract has been one of the most popular open source optical character recognition (OCR) solutions. It provides ready-to-use models for recognizing text in many languages. Currently there are 124 models that are available to be downloaded and used.</p>
<p>Not too long ago, the project moved in the direction of using more modern machine-learning approaches and is now using <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">artificial neural networks</a>.</p>
<p>For some people, this move caused a lot of confusion when they wanted to train their own models. This blog post tries to explain the process of turning scans of images with textual ground-truth data into models that are ready to be used.</p>
<h3 id="tesseract-pre-trained-models">Tesseract pre-trained models</h3>
<p>You can download the <a href="https://github.com/tesseract-ocr/tessdata_fast">pre-created ones designed to be fast and consume less memory</a>, as well as the <a href="https://github.com/tesseract-ocr/tessdata_best">ones requiring more resources but giving better accuracy</a>.</p>
<p>The pre-trained models were created from images with text artificially rendered from a huge corpus of text taken from the web, using a wide variety of fonts. The project’s wiki states that:</p>
<blockquote>
<p>For Latin-based languages, the existing model data provided has been trained on about <a href="https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951">400000 textlines spanning about 4500 fonts</a>. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines.</p>
</blockquote>
<h3 id="training-a-new-model-from-scratch">Training a new model from scratch</h3>
<p>Before diving in, there are a couple of broader aspects you need to know:</p>
<ul>
<li>The latest Tesseract uses <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">artificial neural networks</a> based models (they differ totally from the older approach)</li>
<li>You might want to get familiar with how neural networks work, how their different types of layers are used, and what you can expect of them</li>
<li>It’s definitely a bonus (though not mandatory) to read about “Connectionist Temporal Classification”, explained brilliantly at <a href="https://distill.pub/2017/ctc/">Sequence Modeling with CTC</a></li>
</ul>
<h3 id="compiling-the-training-tools">Compiling the training tools</h3>
<p>This blog post talks specifically about the latest version 4 of Tesseract. Please make sure that you have that installed and not some older version 3 release.</p>
<p>To continue with the training, you’ll also need the training tools. The project’s wiki already <a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#building-the-training-tools">explains the process</a> of getting them well enough.</p>
<h3 id="preparing-the-training-data">Preparing the training data</h3>
<p>Training datasets consist of <code>*.tif</code> files and accompanying <code>*.box</code> files. While the image files are easy to prepare, the box files seem to be a source of confusion.</p>
<p>For some images you’ll want to <strong>ensure that there’s at least 10px of free space between the border and the text</strong>. Adding it to all of the images will not hurt and will only ensure that you won’t see odd-looking warning messages during the training.</p>
<p>The first rule is that you’ll have one box file per image. You need to give them the same prefixes, e.g. <code>image1.tif</code> and <code>image1.box</code>. The box files describe the characters used as well as their spatial location within the image.</p>
<p>Each line describes one character as follows:</p>
<p><code><symbol> <left> <bottom> <right> <top> <page></code></p>
<p>Where:</p>
<ul>
<li><code><symbol></code> is the character e.g. <code>a</code> or <code>b</code>.</li>
<li><code><left> <bottom> <right> <top></code> are the coordinates of the rectangle that fits the character on the page. Note that the coordinates system used by Tesseract has <code>(0,0)</code> in the bottom-left corner of the image!</li>
<li><code><page></code> is only relevant if you’re using multi-page TIFF files. In all other cases just put <code>0</code> in here.</li>
</ul>
<p>The order of characters is extremely important here. They <strong>should be sorted strictly in the visual order, going from left to right</strong>. Tesseract does the Unicode bidi-re-ordering internally on its own.</p>
<p>Each word should be separated by a line with a space as the <code><symbol></code>. It works best for me to use a small <code>1x1</code> rectangle that directly follows the previous character as its bounding box.</p>
<p>If your image contains more than one line, the line ending should be marked with a line where <code><symbol></code> is a tab.</p>
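<p>Putting these rules together, here’s what a hypothetical box file for a one-line image containing the words “ab cd” could look like (the coordinates are made up purely for illustration; the third line starts with a literal space character and uses a small <code>1x1</code> rectangle):</p>
<pre tabindex="0"><code>a 12 30 24 50 0
b 25 30 38 52 0
  39 30 40 31 0
c 45 30 58 50 0
d 59 30 70 52 0
</code></pre>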
<h4 id="generating-the-unicharset-file">Generating the <code>unicharset</code> file</h4>
<p>If you’ve gone through the neural networks reading, you’ll quickly understand that if the model is to be fast, it needs to be given a constrained list of characters you want it to recognize. Trying to make it choose from the whole Unicode set would be computationally infeasible. This is what the so-called <code>unicharset</code> file is for. It defines the set of graphemes along with info about their basic properties.</p>
<p>Tesseract does come with its own utility for compiling that file, but I’ve found it very buggy; at least that’s how it behaved the last time I tried it, in June 2018. I came up with my own script in Ruby which compiles a very basic version of that file and is more than enough:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">"rubygems"</span>
<span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">"unicode/scripts"</span>
<span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">"unicode/categories"</span>
bool_to_si = -> (b) {
b ? <span style="color:#d20;background-color:#fff0f0">"1"</span> : <span style="color:#d20;background-color:#fff0f0">"0"</span>
}
is_digit = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"Nd"</span>, <span style="color:#d20;background-color:#fff0f0">"No"</span>, <span style="color:#d20;background-color:#fff0f0">"Nl"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
is_letter = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"LC"</span>, <span style="color:#d20;background-color:#fff0f0">"Ll"</span>, <span style="color:#d20;background-color:#fff0f0">"Lm"</span>, <span style="color:#d20;background-color:#fff0f0">"Lo"</span>, <span style="color:#d20;background-color:#fff0f0">"Lt"</span>, <span style="color:#d20;background-color:#fff0f0">"Lu"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
is_alpha = -> (props) {
is_letter.call(props)
}
is_lower = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"Ll"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
is_upper = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"Lu"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
is_punct = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"Pc"</span>, <span style="color:#d20;background-color:#fff0f0">"Pd"</span>, <span style="color:#d20;background-color:#fff0f0">"Pe"</span>, <span style="color:#d20;background-color:#fff0f0">"Pf"</span>, <span style="color:#d20;background-color:#fff0f0">"Pi"</span>, <span style="color:#d20;background-color:#fff0f0">"Po"</span>, <span style="color:#d20;background-color:#fff0f0">"Ps"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
<span style="color:#080;font-weight:bold">if</span> <span style="color:#036;font-weight:bold">ARGV</span>.length < <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#d70">$stderr</span>.puts <span style="color:#d20;background-color:#fff0f0">"Usage: ruby ./extract_unicharset.rb path/to/all-boxes"</span>
<span style="color:#038">exit</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> !<span style="color:#036;font-weight:bold">File</span>.exist?(<span style="color:#036;font-weight:bold">ARGV</span>[<span style="color:#00d;font-weight:bold">0</span>])
<span style="color:#d70">$stderr</span>.puts <span style="color:#d20;background-color:#fff0f0">"The all-boxes file </span><span style="color:#33b;background-color:#fff0f0">#{</span><span style="color:#036;font-weight:bold">ARGV</span>[<span style="color:#00d;font-weight:bold">0</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> doesn't exist"</span>
<span style="color:#038">exit</span>
<span style="color:#080;font-weight:bold">end</span>
uniqs = <span style="color:#036;font-weight:bold">IO</span>.readlines(<span style="color:#036;font-weight:bold">ARGV</span>[<span style="color:#00d;font-weight:bold">0</span>]).map { |line| line[<span style="color:#00d;font-weight:bold">0</span>] }.uniq.sort
outs = uniqs.each_with_index.map <span style="color:#080;font-weight:bold">do</span> |char, ix|
script = <span style="color:#036;font-weight:bold">Unicode</span>::<span style="color:#036;font-weight:bold">Scripts</span>.scripts(char).first
props = <span style="color:#036;font-weight:bold">Unicode</span>::<span style="color:#036;font-weight:bold">Categories</span>.categories(char)
isalpha = is_alpha.call(props)
islower = is_lower.call(props)
isupper = is_upper.call(props)
isdigit = is_digit.call(props)
ispunctuation = is_punct.call(props)
props = [ isalpha, islower, isupper, isdigit, ispunctuation].reverse.inject(<span style="color:#d20;background-color:#fff0f0">""</span>) <span style="color:#080;font-weight:bold">do</span> |state, is|
<span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#33b;background-color:#fff0f0">#{</span>state<span style="color:#33b;background-color:#fff0f0">}#{</span>bool_to_si.call(is)<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#33b;background-color:#fff0f0">#{</span>char<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> </span><span style="color:#33b;background-color:#fff0f0">#{</span>props.to_i(<span style="color:#00d;font-weight:bold">2</span>)<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> </span><span style="color:#33b;background-color:#fff0f0">#{</span>script<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> </span><span style="color:#33b;background-color:#fff0f0">#{</span>ix + <span style="color:#00d;font-weight:bold">1</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#038">puts</span> outs.count + <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"NULL 0 Common 0"</span>
outs.each { |o| <span style="color:#038">puts</span> o }
</code></pre></div><p>You’ll need to install the <code>unicode-scripts</code> and <code>unicode-categories</code> gems first. The usage is as it stands in the source code:</p>
<pre tabindex="0"><code>ruby extract_unicharset.rb path/to/all-boxes > path/to/unicharset
</code></pre><p>Where do we get the <code>all-boxes</code> file from? The script only cares about the unique set of characters from the box files. The following gist of shell-work will provide you with all you need:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">cat path/to/dataset/*.box > path/to/all-boxes
ruby extract_unicharset.rb path/to/all-boxes > path/to/unicharset
</code></pre></div><p>Notice that the last command should create a <code>path/to/unicharset</code> text file for you.</p>
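<p>For illustration, for box files that contained only the characters <code>1</code>, <code>a</code>, and <code>b</code> (leaving aside the space and tab marker lines), the script above would emit something along these lines:</p>
<pre tabindex="0"><code>4
NULL 0 Common 0
1 8 Common 1
a 3 Latin 2
b 3 Latin 3
</code></pre>
<p>Each line carries the grapheme, a bitmask of its basic properties (alpha, lower, upper, digit, punctuation), its Unicode script, and an identifier.</p>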
<h4 id="combining-images-with-box-files-into-lstmf-files">Combining images with box files into <code>*.lstmf</code> files</h4>
<p>The image and box files aren’t fed directly into the trainer. Instead, Tesseract works with special <code>*.lstmf</code> files which combine the image, boxes, and text for each pair of <code>*.tif</code> and <code>*.box</code> files.</p>
<p>In order to generate those <code>*.lstmf</code> files you’ll need to run the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#038">cd</span> path/to/dataset
<span style="color:#080;font-weight:bold">for</span> file in *.tif; <span style="color:#080;font-weight:bold">do</span>
<span style="color:#038">echo</span> <span style="color:#369">$file</span>
<span style="color:#369">base</span>=<span style="color:#d20;background-color:#fff0f0">`</span>basename <span style="color:#369">$file</span> .tif<span style="color:#d20;background-color:#fff0f0">`</span>
tesseract <span style="color:#369">$file</span> <span style="color:#369">$base</span> lstm.train
<span style="color:#080;font-weight:bold">done</span>
</code></pre></div><p>After the above is done, you should be able to find the accompanying <code>*.lstmf</code> files. Make sure that you have Tesseract with <code>langdata</code> and <code>tessdata</code> properly installed. If you keep your <code>tessdata</code> folder in a nonstandard location, you might need to either export or set inline the following shell variable:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#888"># exporting so that it’s available for all following commands:</span>
<span style="color:#038">export</span> <span style="color:#369">TESSDATA_PREFIX</span>=path/to/your/tessdata
<span style="color:#888"># or run it inline:</span>
<span style="color:#038">cd</span> path/to/dataset
<span style="color:#080;font-weight:bold">for</span> file in *.tif; <span style="color:#080;font-weight:bold">do</span>
<span style="color:#038">echo</span> <span style="color:#369">$file</span>
<span style="color:#369">base</span>=<span style="color:#d20;background-color:#fff0f0">`</span>basename <span style="color:#369">$file</span> .tif<span style="color:#d20;background-color:#fff0f0">`</span>
<span style="color:#369">TESSDATA_PREFIX</span>=path/to/your/tessdata tesseract <span style="color:#369">$file</span> <span style="color:#369">$base</span> lstm.train
<span style="color:#080;font-weight:bold">done</span>
</code></pre></div><p>We’ll need to generate the <code>all-lstmf</code> file containing paths to all those files that we will use later:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">ls -1 *.lstmf | sort -R > all-lstmf
</code></pre></div><p>Notice the use of <code>sort -R</code>, which shuffles the list into a random order; randomizing the order of examples is a good practice when preparing training data in many cases.</p>
<h4 id="generating-the-training-and-evaluation-files-lists">Generating the training and evaluation files lists</h4>
<p>Next, we want to create the <code>list.train</code> and <code>list.eval</code> files. Their purpose is to contain the paths to <code>*.lstmf</code> files that Tesseract is going to use during the training and during the evaluation. Training and evaluation are interleaved. The former adjusts the neural network learnable parameters to minimize the so-called loss. The evaluation here is strictly to enhance the user experience: it prints out accuracy metrics periodically, letting you know how much the model has learned so far. Their values are averaged out. You can expect to see two metrics being shown: <code>char error</code> and <code>word error</code>: both are going to be close to 100% in the beginning but with all going well, you should see them dropping even to below 1%.</p>
<p>The evaluation set is often called the “holdout set”. How many examples should it contain? That depends. If you have a big enough set, something around 10% of all of the examples should be more than enough. You might also not care about the training-time evaluation and set it to something very small. You’d then do your own evaluation after the network’s loss converges to something small (by small we mean something close to 0.1 or less).</p>
<p>Assuming that you want the evaluation set to contain 1000 examples, here’s how you can generate the <code>list.train</code> and <code>list.eval</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">head -n <span style="color:#00d;font-weight:bold">1000</span> path/to/all-lstmf > list.eval
tail -n +1001 path/to/all-lstmf > list.train
</code></pre></div><p>If you’d like to express it in terms of fractions of all of the examples:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#369">count_all</span>=<span style="color:#d20;background-color:#fff0f0">`</span>wc -l < path/to/all-lstmf<span style="color:#d20;background-color:#fff0f0">`</span>
<span style="color:#369">holdout_count</span>=<span style="color:#080;font-weight:bold">$(</span>bc <<< <span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#369">$count_all</span><span style="color:#d20;background-color:#fff0f0"> * 0.1 / 1"</span><span style="color:#080;font-weight:bold">)</span>
head -n <span style="color:#369">$holdout_count</span> path/to/all-lstmf > list.eval
tail -n +<span style="color:#080;font-weight:bold">$((</span>holdout_count + <span style="color:#00d;font-weight:bold">1</span><span style="color:#080;font-weight:bold">))</span> path/to/all-lstmf > list.train
</code></pre></div><p>The above shell code assigns around 10% of the examples to the holdout set.</p>
<h4 id="compiling-the-initial-traineddata-file">Compiling the initial <code>*.traineddata</code> file</h4>
<p>There’s one last piece that we’ll need to generate before we’re able to start the training process: the <code>yourmodel.traineddata</code>. This file is going to contain the initial info needed for the trainer to perform the training:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#888"># --lang_is_rtl: pass it only if you work with a right-to-left language</span>
<span style="color:#888"># --pass_through_recoder: I found it working better with this option</span>
combine_lang_model <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --input_unicharset path/to/unicharset <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --script_dir path/to/your/tessdata <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --output_dir path/to/output <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --lang_is_rtl <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --pass_through_recoder <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --lang yourmodelname
</code></pre></div><p>The above should create a bunch of files in the specified output directory.</p>
<h4 id="starting-the-actual-training-process">Starting the actual training process</h4>
<p>To start the training process you’ll need to execute the <code>lstmtraining</code> app. It accepts the arguments that are described below.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#369">num_classes</span>=<span style="color:#d20;background-color:#fff0f0">`</span>head -n1 path/to/unicharset<span style="color:#d20;background-color:#fff0f0">`</span>
lstmtraining <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> path/to/traineddata-file <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --net_spec <span style="color:#d20;background-color:#fff0f0">"[1,40,0,1 Ct5,5,64 Mp3,3 Lfys128 Lbx256 Lbx256 O1c</span><span style="color:#369">$num_classes</span><span style="color:#d20;background-color:#fff0f0">]"</span> <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --model_output path/to/model/output <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --train_listfile path/to/list.train <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --eval_listfile path/to/list.eval
</code></pre></div><p>You’re giving it the compiled <code>*.traineddata</code> file and the train/eval file lists and it trains the new model for you. It will adjust the neural network parameters to make the error between its predictions and what is known as ground-truth smaller and smaller.</p>
<p>There’s one part that we haven’t talked about yet: the <code>--net_spec</code> argument and its accompanying value, given as a string.</p>
<p>The neural network “spec” is there because neural networks come in many different shapes and forms. The subject is beyond the scope of this article. If you don’t know anything yet but are curious, I encourage you to look for some good books. The process of learning about them is extremely rewarding if you’re into math and computer science.</p>
<p>The value for that argument I presented above should be more than enough for most of your needs. That’s unless you’d like to e.g. recognize vertical text, for which I recommend adjusting the spec greatly.</p>
<p>The format that the given string follows is called VGSL. You can find out more about it on the <a href="https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs">Tesseract Wiki</a>.</p>
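<p>As a rough guide (this is my reading of the VGSL notation, so double-check it against the wiki), the example spec used above decomposes like this:</p>
<pre tabindex="0"><code>[1,40,0,1 ...]  # input: batch of 1, height 40px, variable width (0), 1 channel
Ct5,5,64        # 5x5 convolution with tanh non-linearity and 64 outputs
Mp3,3           # 3x3 max-pooling
Lfys128         # forward LSTM scanning the y dimension, summarizing, 128 units
Lbx256          # bidirectional LSTM scanning the x dimension, 256 units (used twice)
O1c<n>          # 1-dimensional output layer with softmax over <n> character classes
</code></pre>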
<h4 id="finishing-the-training-and-compiling-the-resulting-model-file">Finishing the training and compiling the resulting model file</h4>
<p>If you’ve gotten excited by what we’ve done so far, I have to encourage your expectations to make friends with <strong>The Reality</strong>. The truth is that the training process can take days, depending on how fast your machine is and how many training examples you have. You may notice it taking even longer if your examples differ by a huge factor. That might be true if you’re feeding it examples that use significantly different fonts.</p>
<p>Once the training error rate is small enough and no longer seems to be decreasing, you may want to stop the training and compile the final model file.</p>
<p>During the training, the <code>lstmtraining</code> app will output checkpoint files every once in a while. They are there to make it possible to stop the training and resume it later (with the <code>--continue_from</code> argument). You create the final model files out of those checkpoint files with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">lstmtraining <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --traineddata path/to/traineddata-file <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --continue_from path/to/model/output/checkpoint <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --model_output path/to/final/output <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --stop_training
</code></pre></div><p>And that’s it — you can now take the output file of that last command, place it inside your <code>tessdata</code> folder, and Tesseract will immediately be able to use it.</p>
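<p>To sanity-check the new model, you can point Tesseract at it, e.g. (assuming the final output file was named <code>yourmodelname.traineddata</code> and placed in your <code>tessdata</code> folder):</p>
<pre tabindex="0"><code>tesseract path/to/some-image.tif output-base -l yourmodelname
</code></pre>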
Recognizing handwritten digits: a quick peek into the basics of machine learninghttps://www.endpointdev.com/blog/2017/05/recognizing-handwritten-digits-quick/2017-05-30T00:00:00+00:00Kamil Ciemniewski
<p>Previous in series:</p>
<ul>
<li><a href="/blog/2016/03/learning-from-data-basics-naive-bayes/">Learning from data basics: the Naive Bayes model</a></li>
<li><a href="/blog/2016/04/learning-from-data-basics-ii-simple/">Learning from data basics II: simple Bayesian Networks</a></li>
</ul>
<p>In the previous two posts on machine learning, I presented a very basic introduction to an approach called “probabilistic graphical models”. In this post I’d like to take a tour of some different techniques while creating code that will recognize handwritten digits.</p>
<p>Handwritten digit recognition is an interesting topic that has been explored for many years. It is now considered one of the best ways to start a journey into the world of machine learning.</p>
<h3 id="taking-the-kaggle-challenge">Taking the Kaggle challenge</h3>
<p>We’ll take on the “digits recognition” challenge as presented on Kaggle, an online platform with challenges for data scientists. Most of the challenges offer prizes in real money to win. Some of them are there to help us on our journey of learning data science techniques, and the “digits recognition” contest is one of those.</p>
<h3 id="the-challenge">The challenge</h3>
<p>As explained on Kaggle:</p>
<blockquote>
<p>MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision.</p>
</blockquote>
<p>The “digits recognition” challenge is one of the best ways to get acquainted with machine learning and computer vision. The so-called “MNIST” dataset consists of 70k images of handwritten digits, each one grayscale and 28x28 pixels in size. The Kaggle challenge is about taking a subset of 42k of them along with labels (which actual number the image shows) and “training” the computer on that set. The next step is to take the remaining 28k images without labels and “predict” which number each of them represents.</p>
<p>Here’s a short overview of what the digits in the set really look like (along with the numbers they represent):</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="/blog/2017/05/recognizing-handwritten-digits-quick/image-0-big.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1201" data-original-width="1600" height="480" src="/blog/2017/05/recognizing-handwritten-digits-quick/image-0.png" width="640"/></a></div>
<p>I have to admit that for some of them I have a really hard time recognizing the actual numbers on my own :)</p>
<h3 id="the-general-approach-to-supervised-learning">The general approach to supervised learning</h3>
<p>Learning from labelled data is what is called “supervised learning”. It’s supervised because we’re taking the computer by hand through the whole training data set and “teaching” it what the data linked with different labels looks like.</p>
<p>In all such scenarios we can express the data and labels as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Y ~ X1, X2, X3, X4, ..., Xn
</code></pre></div><p>The Y is called a <strong>dependent variable</strong> while each Xn are <strong>independent variables</strong>. This formula holds both for classification problems as well as regressions.</p>
<p>Classification is when the dependent variable Y is so-called <em>categorical</em>—taking values from a concrete set without a meaningful order. Regression is when the Y is not categorical—most often continuous.</p>
<p>In the digits recognition challenge we’re faced with the classification task. The dependent variable takes values from the set:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Y = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
</code></pre></div><p>I’m sure the question you might be asking yourself now is: what are the independent variables Xn? It turns out to be the crux of the whole problem to solve :)</p>
<h3 id="the-plan-of-attack">The plan of attack</h3>
<p>A good introduction to computer vision techniques is the book by J. R. Parker, “Algorithms for Image Processing and Computer Vision”. I encourage the reader to buy that book. I took some ideas from it while having fun with my own solution to the challenge.</p>
<p>The book outlines the ideas revolving around computing image profiles—for each side. For each row of pixels, a number representing the distance of the first pixel from the edge is computed. This way we’re getting our first independent variables. To capture even more information about digit shapes, we’ll also capture the differences between consecutive row values as well as their global maxima and minima. We’ll also compute the width of the shape for each row.</p>
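<p>To make the profile features concrete, here’s a minimal sketch in Ruby (my own illustration with hypothetical helper names, not code from the book): it treats a binarized image as rows of 0/1 pixels and computes the left profile, per-row shape widths, and the consecutive differences of the profile:</p>

```ruby
# Per-row left profile and shape width of a binarized image.
# Rows with no "on" pixels get the full image width as their profile value.
def row_profiles(image)
  image.map do |row|
    left  = row.index(1)    # distance of the first "on" pixel from the left edge
    right = row.rindex(1)
    width = left.nil? ? 0 : right - left + 1
    { left: left || row.length, width: width }
  end
end

# Differences between consecutive left-profile values capture how the
# outline of the digit changes from row to row.
def profile_diffs(profiles)
  profiles.each_cons(2).map { |a, b| b[:left] - a[:left] }
end

rows = [[0, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 0]]
profs = row_profiles(rows)
diffs = profile_diffs(profs)
```

<p>The global maxima and minima mentioned above are then simply <code>profs.map { |p| p[:left] }.minmax</code>; the right profile is the mirror image of the same computation.</p>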
<p>Because the handwritten digits vary greatly in their thickness, we will first preprocess the images to detect the so-called skeletons of the digits. The skeleton is an image representation in which the thickness of the shape has been reduced to just one pixel.</p>
<p>Having the image thinned will also allow us to capture some more info about the shapes. We will write an algorithm that walks the skeleton and records the direction change frequencies.</p>
<p>Once we have our set of independent variables Xn, we’ll use a classification algorithm to first learn in a supervised way (using the provided labels) and then to predict the values for the test data set. Lastly we’ll submit our predictions to Kaggle and see how well we did.</p>
<h3 id="having-fun-with-languages">Having fun with languages</h3>
<p>In the data science world, the lingua franca still remains the R programming language. In recent years Python has also come close in popularity, and nowadays we can say the duo of R and Python rules the data science world (not counting high-performance code written e.g. in C++ in production systems).</p>
<p>Lately a new language designed with data scientists in mind has emerged: Julia. It has characteristics of both dynamically typed scripting languages and statically typed compiled ones. It compiles its code into efficient native binaries via LLVM, but in a JIT fashion, inferring types as needed on the go.</p>
<p>While having fun with the Kaggle challenge I’ll use Julia and Python for the so-called <strong>feature extraction</strong> phase (the one in which we compute information about our Xn variables). I’ll then turn to R for the classification itself. Note that I could use any of these languages at each step and get very similar results. The purpose of this series of articles is to be a fun bird’s-eye overview, so I decided this way would be much more interesting.</p>
<h3 id="feature-extraction">Feature Extraction</h3>
<p>The end result of this phase is a data frame saved as a CSV file, so that we can load it in R and do the classification.</p>
<p>First let’s define the general function in Julia that takes the name of the input CSV file and returns a data frame with features of given images extracted into columns:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">using <span style="color:#036;font-weight:bold">DataFrames</span>
function get_data(<span style="color:#038">name</span> :: <span style="color:#038">String</span>, include_label = <span style="color:#080">true</span>)
println(<span style="color:#d20;background-color:#fff0f0">"Loading CSV file into a data frame..."</span>)
table = readtable(string(<span style="color:#038">name</span>, <span style="color:#d20;background-color:#fff0f0">".csv"</span>))
extract(table, include_label)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Now the extract function looks like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0">Extracts the features from the dataframe. Puts them into
</span><span style="color:#d20;background-color:#fff0f0">separate columns and removes all other columns except the
</span><span style="color:#d20;background-color:#fff0f0">labels.
</span><span style="color:#d20;background-color:#fff0f0">
</span><span style="color:#d20;background-color:#fff0f0">The features:
</span><span style="color:#d20;background-color:#fff0f0">
</span><span style="color:#d20;background-color:#fff0f0">* Left and right profiles (after fitting into the same sized rect):
</span><span style="color:#d20;background-color:#fff0f0"> * Min
</span><span style="color:#d20;background-color:#fff0f0"> * Max
</span><span style="color:#d20;background-color:#fff0f0"> * Width[y]
</span><span style="color:#d20;background-color:#fff0f0"> * Diff[y]
</span><span style="color:#d20;background-color:#fff0f0">* Paths:
</span><span style="color:#d20;background-color:#fff0f0"> * Frequencies of movement directions
</span><span style="color:#d20;background-color:#fff0f0"> * Simplified directions:
</span><span style="color:#d20;background-color:#fff0f0"> * Frequencies of 3 element simplified paths
</span><span style="color:#d20;background-color:#fff0f0">"""</span>
function extract(frame :: <span style="color:#036;font-weight:bold">DataFrame</span>, include_label = <span style="color:#080">true</span>)
println(<span style="color:#d20;background-color:#fff0f0">"Reshaping data..."</span>)
function to_image(flat :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}) :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
dim = <span style="color:#036;font-weight:bold">Base</span>.isqrt(length(flat))
reshape(flat, (dim, dim))<span style="color:#a61717;background-color:#e3d2d2">'</span>
<span style="color:#080;font-weight:bold">end</span>
from = include_label ? <span style="color:#00d;font-weight:bold">2</span> : <span style="color:#00d;font-weight:bold">1</span>
frame[<span style="color:#a60;background-color:#fff0f0">:pixels</span>] = map((i) -> convert(<span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, frame[i, <span style="color:#a60;background-color:#fff0f0">from</span>:<span style="color:#080;font-weight:bold">end</span>]) |> to_image, <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:size</span>(frame, <span style="color:#00d;font-weight:bold">1</span>))
images = frame[:, <span style="color:#a60;background-color:#fff0f0">:pixels</span>] ./ <span style="color:#00d;font-weight:bold">255</span>
data = <span style="color:#038">Array</span>{<span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}}(length(images))
<span style="color:#33b">@showprogress</span> <span style="color:#00d;font-weight:bold">1</span> <span style="color:#d20;background-color:#fff0f0">"Computing features..."</span> <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:length</span>(images)
features = pixels_to_features(images[i])
data[i] = features_to_row(features)
<span style="color:#080;font-weight:bold">end</span>
start_column = include_label ? [<span style="color:#a60;background-color:#fff0f0">:label</span>] : []
columns = vcat(start_column, features_columns(images[<span style="color:#00d;font-weight:bold">1</span>]))
result = <span style="color:#036;font-weight:bold">DataFrame</span>()
<span style="color:#080;font-weight:bold">for</span> c <span style="color:#080;font-weight:bold">in</span> columns
result[c] = []
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:length</span>(data)
<span style="color:#080;font-weight:bold">if</span> include_label
push!(result, vcat(frame[i, <span style="color:#a60;background-color:#fff0f0">:label</span>], data[i]))
<span style="color:#080;font-weight:bold">else</span>
push!(result, vcat([], data[i]))
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
result
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>A few nice things to notice here about Julia itself are:</p>
<ul>
<li>The function documentation is written in Markdown</li>
<li>We can nest functions inside other functions</li>
<li>The language is strongly typed, with optional type annotations</li>
<li>Types can be inferred from the context</li>
<li>It is often desirable to provide concrete types to improve performance (but that is a more advanced Julia topic)</li>
<li>Arrays are indexed from 1</li>
<li>There’s the nice |> pipe operator found e.g. in Elixir (which I absolutely love)</li>
</ul>
<p>The above code converts the images to arrays of Float64 and scales the values to fall between 0 and 1 (instead of the original 0..255).</p>
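<p>That scaling is a single vectorized division. For readers following along in Python, the NumPy equivalent looks like this (a sketch with made-up pixel values):</p>

```python
import numpy as np

# Pixel intensities 0..255 for two tiny flattened "images".
images = np.array([[0, 51, 102, 255],
                   [255, 204, 0, 51]], dtype=np.float64)

# Element-wise division, same idea as Julia's `images ./ 255`.
normalized = images / 255
print(normalized.min(), normalized.max())
```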
<p>A thing to notice is that in Julia we can vectorize operations easily, and we’re using this fact to tersely convert our numbers:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">images = frame[:, <span style="color:#a60;background-color:#fff0f0">:pixels</span>] ./ <span style="color:#00d;font-weight:bold">255</span>
</code></pre></div><p>We are referencing the pixels_to_features function which we define as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0">Returns ImageFeatures struct for the image pixels
</span><span style="color:#d20;background-color:#fff0f0">given as an argument
</span><span style="color:#d20;background-color:#fff0f0">"""</span>
function pixels_to_features(image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>})
dim = <span style="color:#036;font-weight:bold">Base</span>.isqrt(length(image))
skeleton = compute_skeleton(image)
bounds = compute_bounds(skeleton)
resized = compute_resized(skeleton, bounds, (dim, dim))
left = compute_profile(resized, <span style="color:#a60;background-color:#fff0f0">:left</span>)
right = compute_profile(resized, <span style="color:#a60;background-color:#fff0f0">:right</span>)
width_min, width_max, width_at = compute_widths(left, right, image)
frequencies, simples = compute_transitions(skeleton)
<span style="color:#036;font-weight:bold">ImageStats</span>(dim, left, right, width_min, width_max, width_at, frequencies, simples)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>This in turn uses the ImageStats structure:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">immutable <span style="color:#036;font-weight:bold">ImageStats</span>
image_dim :: <span style="color:#036;font-weight:bold">Int64</span>
left :: <span style="color:#036;font-weight:bold">ProfileStats</span>
right :: <span style="color:#036;font-weight:bold">ProfileStats</span>
width_min :: <span style="color:#036;font-weight:bold">Int64</span>
width_max :: <span style="color:#036;font-weight:bold">Int64</span>
width_at :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}
direction_frequencies :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
<span style="color:#888"># The following adds information about transitions</span>
<span style="color:#888"># in 2 element simplified paths:</span>
simple_direction_frequencies :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
<span style="color:#080;font-weight:bold">end</span>
immutable <span style="color:#036;font-weight:bold">ProfileStats</span>
min :: <span style="color:#036;font-weight:bold">Int64</span>
max :: <span style="color:#036;font-weight:bold">Int64</span>
at :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}
diff :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>The pixels_to_features function first gets the skeleton of the digit shape as an image and then uses other functions passing that skeleton to them. The function returning the skeleton utilizes the fact that in Julia it’s trivially easy to use Python libraries. Here’s its definition:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">using <span style="color:#036;font-weight:bold">PyCall</span>
<span style="color:#33b">@pyimport</span> skimage.morphology as cv
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0">Thin the number in the image by computing the skeleton
</span><span style="color:#d20;background-color:#fff0f0">"""</span>
function compute_skeleton(number_image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}) :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
convert(<span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, cv.skeletonize_3d(number_image))
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>It uses the scikit-image library’s skeletonize_3d function via the @pyimport macro, calling it as if it were regular Julia code.</p>
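<p>For comparison, the same functionality is available directly from Python. A minimal sketch, assuming scikit-image is installed (newer versions recommend the plain <code>skeletonize</code> function for 2D images):</p>

```python
import numpy as np
from skimage.morphology import skeletonize

# A thick 3x3 blob of "ink" inside a 5x5 image.
image = np.zeros((5, 5), dtype=bool)
image[1:4, 1:4] = True

# Thin the blob down to its one-pixel-wide skeleton.
skeleton = skeletonize(image)
print(skeleton.astype(int))
```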
<p>Next the code crops the digit itself from the 28x28 image and resizes it back to 28x28 so that the edges of the shape always “touch” the edges of the image. For this we need a function that returns the bounds of the shape, making the cropping easy:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function compute_bounds(number_image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}) :: <span style="color:#036;font-weight:bold">Bounds</span>
rows = size(number_image, <span style="color:#00d;font-weight:bold">1</span>)
cols = size(number_image, <span style="color:#00d;font-weight:bold">2</span>)
saw_top = <span style="color:#080">false</span>
saw_bottom = <span style="color:#080">false</span>
top = <span style="color:#00d;font-weight:bold">1</span>
bottom = rows
left = cols
right = <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">for</span> y = <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:rows</span>
saw_left = <span style="color:#080">false</span>
row_sum = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> x = <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:cols</span>
row_sum += number_image[y, x]
<span style="color:#080;font-weight:bold">if</span> !saw_top && number_image[y, x] > <span style="color:#00d;font-weight:bold">0</span>
saw_top = <span style="color:#080">true</span>
top = y
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> !saw_left && number_image[y, x] > <span style="color:#00d;font-weight:bold">0</span> && x < left
saw_left = <span style="color:#080">true</span>
left = x
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> saw_top && !saw_bottom && x == cols && row_sum == <span style="color:#00d;font-weight:bold">0</span>
saw_bottom = <span style="color:#080">true</span>
bottom = y - <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> number_image[y, x] > <span style="color:#00d;font-weight:bold">0</span> && x > right
right = x
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#036;font-weight:bold">Bounds</span>(top, right, bottom, left)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Resizing the image is pretty straightforward:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">using <span style="color:#036;font-weight:bold">Images</span>
function compute_resized(image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, bounds :: <span style="color:#036;font-weight:bold">Bounds</span>, dims :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#036;font-weight:bold">Int64</span>, <span style="color:#036;font-weight:bold">Int64</span>}) :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
cropped = image[bounds.left<span style="color:#a60;background-color:#fff0f0">:bounds</span>.right, bounds.top<span style="color:#a60;background-color:#fff0f0">:bounds</span>.bottom]
imresize(cropped, dims)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Next, we need to compute the profile stats as described in our plan of attack:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function compute_profile(image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, side :: <span style="color:#036;font-weight:bold">Symbol</span>) :: <span style="color:#036;font-weight:bold">ProfileStats</span>
<span style="color:#33b">@assert</span> side == <span style="color:#a60;background-color:#fff0f0">:left</span> || side == <span style="color:#a60;background-color:#fff0f0">:right</span>
rows = size(image, <span style="color:#00d;font-weight:bold">1</span>)
cols = size(image, <span style="color:#00d;font-weight:bold">2</span>)
columns = side == <span style="color:#a60;background-color:#fff0f0">:left</span> ? collect(<span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:cols</span>) : (collect(<span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:cols</span>) |> reverse)
at = zeros(<span style="color:#036;font-weight:bold">Int64</span>, rows)
diff = zeros(<span style="color:#036;font-weight:bold">Int64</span>, rows)
min = rows
max = <span style="color:#00d;font-weight:bold">0</span>
min_val = cols
max_val = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> y = <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:rows</span>
<span style="color:#080;font-weight:bold">for</span> x = columns
<span style="color:#080;font-weight:bold">if</span> image[y, x] > <span style="color:#00d;font-weight:bold">0</span>
at[y] = side == <span style="color:#a60;background-color:#fff0f0">:left</span> ? x : cols - x + <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">if</span> at[y] < min_val
min_val = at[y]
min = y
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> at[y] > max_val
max_val = at[y]
max = y
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">break</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> y == <span style="color:#00d;font-weight:bold">1</span>
diff[y] = at[y]
<span style="color:#080;font-weight:bold">else</span>
diff[y] = at[y] - at[y - <span style="color:#00d;font-weight:bold">1</span>]
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#036;font-weight:bold">ProfileStats</span>(min, max, at, diff)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>The widths of shapes can be computed with the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function compute_widths(left :: <span style="color:#036;font-weight:bold">ProfileStats</span>, right :: <span style="color:#036;font-weight:bold">ProfileStats</span>, image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}) :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#036;font-weight:bold">Int64</span>, <span style="color:#036;font-weight:bold">Int64</span>, <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}}
image_width = size(image, <span style="color:#00d;font-weight:bold">2</span>)
min_width = image_width
max_width = <span style="color:#00d;font-weight:bold">0</span>
width_ats = length(left.at) |> zeros
<span style="color:#080;font-weight:bold">for</span> row <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:length</span>(left.at)
width_ats[row] = image_width - (left.at[row] - <span style="color:#00d;font-weight:bold">1</span>) - (right.at[row] - <span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#080;font-weight:bold">if</span> width_ats[row] < min_width
min_width = width_ats[row]
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> width_ats[row] > max_width
max_width = width_ats[row]
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
(min_width, max_width, width_ats)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>And lastly, the transitions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function compute_transitions(image :: <span style="color:#036;font-weight:bold">Image</span>) :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}}
history = zeros((size(image,<span style="color:#00d;font-weight:bold">1</span>), size(image,<span style="color:#00d;font-weight:bold">2</span>)))
function next_point() :: <span style="color:#036;font-weight:bold">Nullable</span>{<span style="color:#036;font-weight:bold">Point</span>}
point = <span style="color:#036;font-weight:bold">Nullable</span>()
<span style="color:#080;font-weight:bold">for</span> row <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:size</span>(image, <span style="color:#00d;font-weight:bold">1</span>) |> reverse
<span style="color:#080;font-weight:bold">for</span> col <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:size</span>(image, <span style="color:#00d;font-weight:bold">2</span>) |> reverse
<span style="color:#080;font-weight:bold">if</span> image[row, col] > <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span> && history[row, col] == <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span>
point = <span style="color:#036;font-weight:bold">Nullable</span>((row, col))
history[row, col] = <span style="color:#00d;font-weight:bold">1</span>.<span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">return</span> point
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
function next_point(point :: <span style="color:#036;font-weight:bold">Nullable</span>{<span style="color:#036;font-weight:bold">Point</span>}) :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#036;font-weight:bold">Nullable</span>{<span style="color:#036;font-weight:bold">Point</span>}, <span style="color:#036;font-weight:bold">Int64</span>}
result = <span style="color:#036;font-weight:bold">Nullable</span>()
trans = <span style="color:#00d;font-weight:bold">0</span>
function direction_to_moves(direction :: <span style="color:#036;font-weight:bold">Int64</span>) :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#036;font-weight:bold">Int64</span>, <span style="color:#036;font-weight:bold">Int64</span>}
<span style="color:#888"># for frequencies:</span>
<span style="color:#888"># 8 1 2</span>
<span style="color:#888"># 7 - 3</span>
<span style="color:#888"># 6 5 4</span>
[
( -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0</span> ),
( -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span> ),
( <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">1</span> ),
( <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span> ),
( <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0</span> ),
( <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span> ),
( <span style="color:#00d;font-weight:bold">0</span>, -<span style="color:#00d;font-weight:bold">1</span> ),
( -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span> ),
][direction]
<span style="color:#080;font-weight:bold">end</span>
function peek_point(direction :: <span style="color:#036;font-weight:bold">Int64</span>) :: <span style="color:#036;font-weight:bold">Nullable</span>{<span style="color:#036;font-weight:bold">Point</span>}
actual_current = get(point)
row_move, col_move = direction_to_moves(direction)
new_row = actual_current[<span style="color:#00d;font-weight:bold">1</span>] + row_move
new_col = actual_current[<span style="color:#00d;font-weight:bold">2</span>] + col_move
<span style="color:#080;font-weight:bold">if</span> new_row <= size(image, <span style="color:#00d;font-weight:bold">1</span>) && new_col <= size(image, <span style="color:#00d;font-weight:bold">2</span>) &&
new_row >= <span style="color:#00d;font-weight:bold">1</span> && new_col >= <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#036;font-weight:bold">Nullable</span>((new_row, new_col))
<span style="color:#080;font-weight:bold">else</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#036;font-weight:bold">Nullable</span>()
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">for</span> direction <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span>:<span style="color:#00d;font-weight:bold">8</span>
peeked = peek_point(direction)
<span style="color:#080;font-weight:bold">if</span> !isnull(peeked)
actual = get(peeked)
<span style="color:#080;font-weight:bold">if</span> image[actual[<span style="color:#00d;font-weight:bold">1</span>], actual[<span style="color:#00d;font-weight:bold">2</span>]] > <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span> && history[actual[<span style="color:#00d;font-weight:bold">1</span>], actual[<span style="color:#00d;font-weight:bold">2</span>]] == <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span>
result = peeked
history[actual[<span style="color:#00d;font-weight:bold">1</span>], actual[<span style="color:#00d;font-weight:bold">2</span>]] = <span style="color:#00d;font-weight:bold">1</span>
trans = direction
<span style="color:#080;font-weight:bold">break</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
( result, trans )
<span style="color:#080;font-weight:bold">end</span>
function trans_to_simples(transition :: <span style="color:#036;font-weight:bold">Int64</span>) :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}
<span style="color:#888"># for frequencies:</span>
<span style="color:#888"># 8 1 2</span>
<span style="color:#888"># 7 - 3</span>
<span style="color:#888"># 6 5 4</span>
<span style="color:#888"># for simples:</span>
<span style="color:#888"># - 1 -</span>
<span style="color:#888"># 4 - 2</span>
<span style="color:#888"># - 3 -</span>
[
[ <span style="color:#00d;font-weight:bold">1</span> ],
[ <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span> ],
[ <span style="color:#00d;font-weight:bold">2</span> ],
[ <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">3</span> ],
[ <span style="color:#00d;font-weight:bold">3</span> ],
[ <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">4</span> ],
[ <span style="color:#00d;font-weight:bold">4</span> ],
[ <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span> ]
][transition]
<span style="color:#080;font-weight:bold">end</span>
transitions = zeros(<span style="color:#00d;font-weight:bold">8</span>)
simples = zeros(<span style="color:#00d;font-weight:bold">16</span>)
last_simples = [ ]
point = next_point()
num_transitions = .<span style="color:#00d;font-weight:bold">0</span>
ind(r, c) = (c - <span style="color:#00d;font-weight:bold">1</span>)*<span style="color:#00d;font-weight:bold">4</span> + r
<span style="color:#080;font-weight:bold">while</span> !isnull(point)
point, trans = next_point(point)
<span style="color:#080;font-weight:bold">if</span> isnull(point)
point = next_point()
<span style="color:#080;font-weight:bold">else</span>
current_simples = trans_to_simples(trans)
transitions[trans] += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">for</span> simple <span style="color:#080;font-weight:bold">in</span> current_simples
<span style="color:#080;font-weight:bold">for</span> last_simple <span style="color:#080;font-weight:bold">in</span> last_simples
simples[ind(last_simple, simple)] +=<span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
last_simples = current_simples
num_transitions += <span style="color:#00d;font-weight:bold">1</span>.<span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
(transitions ./ num_transitions, simples ./ num_transitions)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>All those gathered features can be turned into rows with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function features_to_row(features :: <span style="color:#036;font-weight:bold">ImageStats</span>)
lefts = [ features.left.min, features.left.max ]
rights = [ features.right.min, features.right.max ]
left_ats = [ features.left.at[i] <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:features</span>.image_dim ]
left_diffs = [ features.left.diff[i] <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:features</span>.image_dim ]
right_ats = [ features.right.at[i] <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:features</span>.image_dim ]
right_diffs = [ features.right.diff[i] <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:features</span>.image_dim ]
frequencies = features.direction_frequencies
simples = features.simple_direction_frequencies
vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Similarly we can construct the column names with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function features_columns(image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>})
image_dim = <span style="color:#036;font-weight:bold">Base</span>.isqrt(length(image))
lefts = [ <span style="color:#a60;background-color:#fff0f0">:left_min</span>, <span style="color:#a60;background-color:#fff0f0">:left_max</span> ]
rights = [ <span style="color:#a60;background-color:#fff0f0">:right_min</span>, <span style="color:#a60;background-color:#fff0f0">:right_max</span> ]
left_ats = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"left_at_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:image_dim</span> ]
left_diffs = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"left_diff_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:image_dim</span> ]
right_ats = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"right_at_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:image_dim</span> ]
right_diffs = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"right_diff_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:image_dim</span> ]
frequencies = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"direction_freq_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span>:<span style="color:#00d;font-weight:bold">8</span> ]
simples = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"simple_trans_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span>:<span style="color:#00d;font-weight:bold">4</span>^<span style="color:#00d;font-weight:bold">2</span> ]
vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>The data frame constructed with the get_data function can easily be dumped into a CSV file with the writetable function from the DataFrames package.</p>
<p>You can notice that gathering and extracting features is a <strong>lot</strong> of work. All of it was needed because in this article we’re focusing on the somewhat “classical” way of doing machine learning. You might have heard of algorithms that mimic how the human brain learns. We’re <strong>not</strong> focusing on them here; we will explore them in a future article.</p>
<p>We use the mentioned writetable on the data frames computed for both the training and test datasets to store two files: processed_train.csv and processed_test.csv.</p>
<h3 id="choosing-the-model">Choosing the model</h3>
<p>For the classification task I decided to use the XGBoost library, which is something of a hot new technology in the world of machine learning. It’s an improvement over the so-called Random Forest algorithm. The reader can learn more about XGBoost on its website: <a href="https://xgboost.readthedocs.io/">https://xgboost.readthedocs.io/</a>.</p>
<p>Both Random Forest and XGBoost revolve around an idea called <em>ensemble learning</em>. In this approach we don’t build just one learning model: the algorithm actually creates many variations of models and uses them to collectively come up with better results. A short description will have to suffice here, as this article is already quite lengthy.</p>
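<p>To make the ensemble idea concrete, here is a minimal Ruby sketch of majority voting, the simplest way an ensemble combines its members’ answers (the model predictions are made up; real XGBoost combines trees additively rather than by voting):</p>

```ruby
# Ensemble learning in miniature: each model in the ensemble casts a
# vote, and the most common prediction wins (majority voting).
votes = [3, 3, 8, 3, 5]  # predictions from five hypothetical models
prediction = votes.tally.max_by { |_label, count| count }.first
puts prediction  # prints 3
```

Even if individual models are only moderately accurate, their collective answer tends to be better as long as their errors are not perfectly correlated.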
<h3 id="training-the-model">Training the model</h3>
<p>The training and classification code in R is very simple. We first need to load the libraries that will allow us to load data as well as to build the classification model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#06b;font-weight:bold">library</span>(xgboost)
<span style="color:#06b;font-weight:bold">library</span>(readr)
</code></pre></div><p>Loading the data into data frames is equally straightforward:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">processed_train <- <span style="color:#06b;font-weight:bold">read_csv</span>(<span style="color:#d20;background-color:#fff0f0">"processed_train.csv"</span>)
processed_test <- <span style="color:#06b;font-weight:bold">read_csv</span>(<span style="color:#d20;background-color:#fff0f0">"processed_test.csv"</span>)
</code></pre></div><p>We then move on to preparing the vector of labels for each row as well as the matrix of features:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">labels = processed_train$label
features = processed_train[, <span style="color:#00d;font-weight:bold">2</span>:<span style="color:#00d;font-weight:bold">141</span>]
features = <span style="color:#06b;font-weight:bold">scale</span>(features)
features = <span style="color:#06b;font-weight:bold">as.matrix</span>(features)
</code></pre></div><h3 id="the-train-test-split">The train-test split</h3>
<p>When working with models, one of the ways of evaluating their performance is to split the data into so-called train and test sets. We train the model on one set and then we predict the values from the test set. We then calculate the accuracy of predicted values as the ratio between the number of correct predictions and the number of all observations.</p>
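<p>The accuracy calculation itself is trivial; here is what it boils down to, sketched in Ruby with made-up labels:</p>

```ruby
# Accuracy: the fraction of predictions that match the true labels.
predicted = [3, 1, 4, 1, 5, 9, 2, 6]
actual    = [3, 1, 4, 1, 5, 9, 2, 7]

correct  = predicted.zip(actual).count { |p, a| p == a }
accuracy = correct.to_f / actual.length
puts accuracy  # prints 0.875
```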
<p>Because Kaggle provides the test set without labels, for the sake of evaluating the model’s performance without the need to submit the results, we’ll split our Kaggle-training set into local train and test ones. We’ll use the amazing caret library which provides a wealth of tools for doing machine learning:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#06b;font-weight:bold">library</span>(caret)
index <- <span style="color:#06b;font-weight:bold">createDataPartition</span>(processed_train$label, p = <span style="color:#00d;font-weight:bold">.8</span>,
list = <span style="color:#080;font-weight:bold">FALSE</span>,
times = <span style="color:#00d;font-weight:bold">1</span>)
train_labels <- labels[index]
train_features <- features[index,]
test_labels <- labels[-index]
test_features <- features[-index,]
</code></pre></div><p>The above code splits the set uniformly based on the labels, so that the train set is approximately 80% of the whole data set and the class proportions are preserved in both parts.</p>
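<p>A minimal Ruby sketch of the same stratified idea, using a toy label vector (caret’s createDataPartition is the real implementation; this only illustrates the mechanics):</p>

```ruby
# A stratified 80/20 split: sample 80% of the row indices within each
# label group separately, so the class proportions are preserved.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # a toy label vector

train_index = (0...labels.length)
              .group_by { |i| labels[i] }
              .flat_map { |_label, idxs| idxs.sample((idxs.length * 0.8).round) }
              .sort
test_index = (0...labels.length).to_a - train_index
# train_index has 8 elements (4 per class); test_index has the other 2.
```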
<h3 id="using-xgboost-as-the-classification-model">Using XGBoost as the classification model</h3>
<p>We can now make our data digestible by the XGBoost library:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">train <- <span style="color:#06b;font-weight:bold">xgb.DMatrix</span>(<span style="color:#06b;font-weight:bold">as.matrix</span>(train_features), label = train_labels)
test <- <span style="color:#06b;font-weight:bold">xgb.DMatrix</span>(<span style="color:#06b;font-weight:bold">as.matrix</span>(test_features), label = test_labels)
</code></pre></div><p>The next step is to make the XGBoost learn from our data. The actual parameters and their explanations are beyond the scope of this overview article, but the reader can look them up on the XGBoost pages:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">model <- <span style="color:#06b;font-weight:bold">xgboost</span>(train,
max_depth = <span style="color:#00d;font-weight:bold">16</span>,
nrounds = <span style="color:#00d;font-weight:bold">600</span>,
eta = <span style="color:#00d;font-weight:bold">0.2</span>,
objective = <span style="color:#d20;background-color:#fff0f0">"multi:softmax"</span>,
num_class = <span style="color:#00d;font-weight:bold">10</span>)
</code></pre></div><p>It’s critically important to pass the objective as “multi:softmax” and num_class as 10, since we’re classifying digits into 10 classes.</p>
<h3 id="simple-performance-evaluation-with-confusion-matrix">Simple performance evaluation with confusion matrix</h3>
<p>After waiting a while (a couple of minutes) for the last batch of code to finish computing, we now have the classification model ready to be used. Let’s use it to predict the labels from our test set:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">predicted = <span style="color:#06b;font-weight:bold">predict</span>(model, test)
</code></pre></div><p>This returns the vector of predicted values. We’d now like to check how well our model predicts the values. One of the easiest ways is to use the so-called <strong>confusion matrix</strong>.</p>
<p>As per Wikipedia, a confusion matrix:</p>
<blockquote>
<p>(…) also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).</p>
</blockquote>
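<p>The caret library will compute this for us below, but the underlying bookkeeping is simple; a minimal Ruby sketch with made-up labels:</p>

```ruby
# Tally (predicted, actual) pairs into a nested hash:
# matrix[predicted][actual] counts how often that pair occurred.
predicted = [0, 1, 1, 2, 2, 2, 0]
actual    = [0, 1, 2, 2, 2, 0, 0]

matrix = Hash.new { |h, k| h[k] = Hash.new(0) }
predicted.zip(actual).each { |p, a| matrix[p][a] += 1 }

puts matrix[0][0]  # prints 2: two 0s classified correctly
puts matrix[2][0]  # prints 1: one actual 0 mislabelled as a 2
```

The diagonal entries are the correct classifications; everything off the diagonal shows which pairs of classes the model confuses.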
<p>The caret library provides a very easy to use function for examining the confusion matrix and statistics derived from it:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#06b;font-weight:bold">confusionMatrix</span>(data=<span style="color:#06b;font-weight:bold">factor</span>(predicted), reference=<span style="color:#06b;font-weight:bold">factor</span>(test_labels))
</code></pre></div><p>The function returns an R list that gets pretty printed to the R console. In our case it looks like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 819 0 3 3 1 1 2 1 10 5
1 0 923 0 4 5 1 5 3 4 5
2 4 2 766 26 2 6 8 12 5 0
3 2 0 15 799 0 22 2 8 0 8
4 5 2 1 0 761 1 0 15 4 19
5 1 3 0 13 2 719 3 0 9 6
6 5 3 4 1 6 5 790 0 16 2
7 1 7 12 9 2 3 1 813 4 16
8 6 2 4 7 8 11 8 5 767 10
9 5 2 1 13 22 6 1 14 14 746
Overall Statistics
Accuracy : 0.9411
95% CI : (0.9358, 0.946)
No Information Rate : 0.1124
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9345
Mcnemar's Test P-Value : NA
(...)
</code></pre></div><p>Each column in the matrix represents actual labels while each row represents what our algorithm predicted the value to be. The accuracy rate is also printed for us, and in this case it equals 0.9411. This means that our code was able to predict the correct values of handwritten digits for 94.11% of the observations.</p>
<h3 id="submitting-the-results">Submitting the results</h3>
<p>We got an accuracy rate of 0.9411 on our local test set, and it turned out to be very close to the one we got against the test set coming from Kaggle. After predicting the competition values and submitting them, the accuracy rate computed by Kaggle was 0.94357. That’s quite okay given that we aren’t using any of the newer, fancier techniques here.</p>
<p>Also, we haven’t done any <em>parameter tuning</em>, which would surely improve the overall accuracy. We could also revisit the code from the feature extraction phase. One improvement I can think of would be to first crop and resize, and only then compute the skeleton, which might preserve more information about the shape. We could also use the confusion matrix to find the digit that was confused most often, and then look at the real images we failed to recognize. This could lead us to conclusions about improvements to our feature extraction code. There’s always a way to extract more information.</p>
<p>Nowadays, Kagglers from around the world successfully use advanced techniques like <em>Convolutional Neural Networks</em>, getting accuracy scores close to 0.999. Those live in a somewhat different branch of the machine learning world though. Using this type of neural network we don’t need to do the feature extraction on our own: the algorithm includes a step that automatically gathers features, which it later feeds into the network itself. We will take a look at them in a future article.</p>
<h3 id="see-also">See also</h3>
<ul>
<li><a href="https://julialang.org/">Julia Language</a></li>
<li><a href="https://www.r-project.org/">R Language</a></li>
<li><a href="http://scikit-image.org/">Scikit-Image library</a></li>
<li><a href="https://xgboost.readthedocs.io/">XGBoost library</a></li>
<li><a href="https://topepo.github.io/caret/index.html">Caret library</a></li>
<li><a href="https://www.kaggle.com/">Kaggle</a></li>
</ul>
wroc_love.rb 2017 part 1 (2017-03-18, Wojtek Ziniewicz): https://www.endpointdev.com/blog/2017/03/wrocloverb-2017-part-1/
<p><a href="https://wrocloverb.com/">wroc_love.rb</a> is a single-track 3-day conference that takes place in Wrocław, Poland, every year in March.</p>
<p>Here’s a subjective list of most interesting talks from the first day:</p>
<h3 id="kafka--karafka-by-maciej-mensfeld">Kafka / Karafka by Maciej Mensfeld</h3>
<p><a href="https://github.com/karafka/karafka">Karafka</a> is another library that simplifies Apache Kafka usage in Ruby. It lets Ruby on Rails apps benefit from horizontally scalable message buses in a pub-sub (publisher/subscriber) type of network.</p>
<p><strong>Why <a href="https://kafka.apache.org/">Kafka</a> is (<em>probably</em>) a better message/task broker for your app:</strong></p>
<ul>
<li>broadcasting is a real power feature of Kafka (HTTP lacks that)</li>
<li>author claims that it’s easier to support than ZeroMQ/RabbitMQ</li>
<li>it’s namespaced with topics (similar to ROS, the <a href="http://www.ros.org/">Robot Operating System</a>)</li>
<li>great replacement for <a href="https://github.com/zendesk/ruby-kafka">ruby-kafka</a> and <a href="https://github.com/bpot/poseidon">Poseidon</a></li>
</ul>
<blockquote>
<p>Karafka <a href="https://t.co/g9LQZiAV4i">https://t.co/g9LQZiAV4i</a> microframework to have <a href="https://twitter.com/hashtag/rails?src=hash">#rails</a>-like development performance with <a href="https://twitter.com/hashtag/kafka?src=hash">#kafka</a> in <a href="https://twitter.com/hashtag/ruby?src=hash">#ruby</a> <a href="https://twitter.com/maciejmensfeld">@maciejmensfeld</a> <a href="https://twitter.com/hashtag/wrocloverb?src=hash">#wrocloverb</a></p>
<p>— Maciek Rząsa (@mjrzasa) <a href="https://twitter.com/mjrzasa/status/842771868239192064">March 17, 2017</a></p>
</blockquote>
<h3 id="machine-learning-to-the-rescue-by-mariusz-gil">Machine Learning to the Rescue by Mariusz Gil</h3>
<p>This talk was devoted to the author’s Machine Learning success (and failure) stories.</p>
<p>The author underlined that Machine Learning is a <strong>process</strong> and proposed the following <strong>workflow</strong>:</p>
<ol>
<li>define a problem</li>
<li>gather your data</li>
<li>understand your data</li>
<li>prepare and condition the data</li>
<li>select & run your algorithms</li>
<li>tune algorithm parameters</li>
<li>select final model</li>
<li>validate final model (test using production data)</li>
</ol>
<p>Mariusz described a few ML problems that he has dealt with in the past. One of them was a project designed to estimate the cost of a code review. He outlined the process of tuning the input data. Here’s a list of what comprised the input for the code review cost estimation:</p>
<ul>
<li>number of lines changed</li>
<li>number of files changed</li>
<li><a href="https://en.wikipedia.org/wiki/Efferent_coupling">efferent</a> coupling</li>
<li><a href="https://en.wikipedia.org/wiki/Coupling_(computer_programming)">afferent</a> coupling</li>
<li>number of classes</li>
<li>number of interfaces</li>
<li>inheritance level</li>
<li>number of method calls</li>
<li>LLOC metric (Logical Lines of Code, excluding empty or comment lines)</li>
<li>LCOM metric (Lack of Cohesion between Methods—whether single responsibility pattern is followed or not)</li>
</ul>
<h3 id="spree-lightning-talk-by-sparksolutionscohttpssparksolutionsco">Spree lightning talk by <a href="https://sparksolutions.co/">sparksolutions.co</a></h3>
<p>One of the lightning talks was devoted to Spree. Here’s some of the latest interesting data from the Spree world:</p>
<ul>
<li>number of contributors to Spree: 700</li>
<li>it’s very modular</li>
<li>it’s API driven</li>
<li>it’s one of the biggest repos on GitHub</li>
<li>very large number of extensions</li>
<li>it drives thousands of stores worldwide</li>
<li><a href="https://sparksolutions.co/">Spark Solutions</a> is a maintainer</li>
<li>Popular companies that use Spree: GoDaddy, Goop, Casper, Bonobos, Littlebits, Greetabl</li>
<li>it supports Rails 5, Rails 4.2, and Rails 3.x</li>
</ul>
<p>The author also released the newest stable version, 3.2.0, live during the talk:</p>
<blockquote>
<p>releasing spree 3.2.0 live during lightning talk <a href="https://twitter.com/hashtag/wrocloverb?src=hash">#wrocloverb</a> <a href="https://t.co/9oPcB5CTfB">pic.twitter.com/9oPcB5CTfB</a></p>
<p>— Wojciech Ziniewicz (@fribulusxax) <a href="https://twitter.com/fribulusxax/status/842800094915301376">March 17, 2017</a></p>
</blockquote>
Learning from data basics II: simple Bayesian Networks (2016-04-12, Kamil Ciemniewski): https://www.endpointdev.com/blog/2016/04/learning-from-data-basics-ii-simple/
<p>In my <a href="/blog/2016/03/learning-from-data-basics-naive-bayes/">last article</a> I presented an approach that simplifies computations of very complex probability models. It makes these complex models viable by shrinking the amount of needed memory and improving the speed of computing probabilities. The approach we were exploring is called the <strong>Naive Bayes model</strong>.</p>
<p>The context was the e-commerce feature in which a user is presented with the promotion box. The box shows the product category the user is most likely to buy.</p>
<p>Though the results we got were quite good, I promised to present an approach that gives much better ones. While the Naive Bayes approach may not be acceptable in some scenarios due to the gap between approximated and real values, the approach presented in this article will make this distance much, much smaller.</p>
<h3 id="naive-bayes-as-a-simple-bayesian-network">Naive Bayes as a simple Bayesian Network</h3>
<p>When exploring the Naive Bayes model, we said that there is a probabilistic assumption the model makes in order to simplify the computations. In the last article I wrote:</p>
<blockquote>
<p>The Naive Bayes assumption says that the distribution factorizes the way we did it <strong>only if the features are conditionally independent given the category</strong>.</p>
</blockquote>
<h4 id="expressing-variable-dependencies-as-a-graph">Expressing variable dependencies as a graph</h4>
<p>Let’s imagine a visual representation of the relations between the random variables in the Naive Bayes model. Let’s make it into a directed acyclic graph, marking the dependence of one variable on another as a graph edge from the parent node pointing to its dependent node.</p>
<p>Because of the assumption the Naive Bayes model enforces, its structure as a graph looks like the following:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="/blog/2016/04/learning-from-data-basics-ii-simple/image-0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="/blog/2016/04/learning-from-data-basics-ii-simple/image-0.png"/></a></div>
<p>You can notice there are no lines between all the “evidence” nodes. The assumption says that knowing the category, we have all needed knowledge about every single evidence node. This makes category the parent of all the other nodes. Intuitively, we can say that knowing the <strong>class</strong> (in this example, the category) we know everything about all <strong>features</strong>. It’s easy to notice that this assumption doesn’t hold in this example.</p>
<p>In our fake data generator, we made it so that e.g. relationship status depends on age. We’ve also made the category depend on sex and age directly. This way we can’t say that knowing the category we know everything about e.g. age. The random variables age and sex are not independent even if we know the value of category. It is clear that the above graph does not model the dependency relationships between these random variables.</p>
<p>Let’s draw a graph that represents our fake data model better:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="/blog/2016/04/learning-from-data-basics-ii-simple/image-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="/blog/2016/04/learning-from-data-basics-ii-simple/image-1.png"/></a></div>
<p>The combination of a graph like the one above and the probability distribution that follows the independencies it describes are known as a <strong>Bayesian Network</strong>.</p>
<h4 id="using-the-graph-representation-in-practice---the-chain-rule-for-bayesian-networks">Using the graph representation in practice - the chain rule for Bayesian Networks</h4>
<p>The fact that our distribution is part of the Bayesian Network, allows us to use the formula for simplifying the distribution itself. The formula is called the <strong>chain rule for Bayesian Networks</strong> and for our particular example looks like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * p(rel | age) * p(cat | sex, age)
</code></pre></div><p>You can notice that the equation is just a product of a number of factors. There’s one factor for each random variable. The factors for variables that in the graph don’t have any parents are expressed as p(var) while those that do are expressed as p(var | par) or p(var | par1, par2…).</p>
<p>Notice that the Naive Bayes model fits perfectly into this equation. If you were to take the first graph presented in this article—for the Naive Bayes, and use the above equation, you’d get exactly the formula we used in the last article.</p>
<h3 id="coding-the-updated-probabilistic-model">Coding the updated probabilistic model</h3>
<p>Before going further, I strongly advise you to make sure you read the <a href="/blog/2016/03/learning-from-data-basics-naive-bayes/">previous article - about the Naive Bayes model</a> - to fully understand the classes used in the code in this section.</p>
<p>Let’s take our chain rule equation and simplify it:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * p(rel | age) * p(cat | sex, age)
</code></pre></div><p>Recall that a conditional distribution can be expressed as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(a | b) = p(a, b) / p(b)
</code></pre></div><p>This gives us:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * (p(rel, age)/ p(age)) * (p(cat, sex, age) / p(sex, age))
</code></pre></div><p>The p(age) terms cancel, leaving:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(cat, sex, age, rel, loc) = p(sex) * p(loc) * p(rel, age) * (p(cat, sex, age) / p(sex, age))
</code></pre></div><p>Let’s define needed random variables and factors:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">category = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:category</span>, [ <span style="color:#a60;background-color:#fff0f0">:veggies</span>, <span style="color:#a60;background-color:#fff0f0">:snacks</span>, <span style="color:#a60;background-color:#fff0f0">:meat</span>, <span style="color:#a60;background-color:#fff0f0">:drinks</span>, <span style="color:#a60;background-color:#fff0f0">:beauty</span>, <span style="color:#a60;background-color:#fff0f0">:magazines</span> ]
age = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:age</span>, [ <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:young_adults</span>, <span style="color:#a60;background-color:#fff0f0">:adults</span>, <span style="color:#a60;background-color:#fff0f0">:elders</span> ]
sex = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:sex</span>, [ <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:female</span> ]
relation = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:relation</span>, [ <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span> ]
location = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:location</span>, [ <span style="color:#a60;background-color:#fff0f0">:us</span>, <span style="color:#a60;background-color:#fff0f0">:canada</span>, <span style="color:#a60;background-color:#fff0f0">:europe</span>, <span style="color:#a60;background-color:#fff0f0">:asia</span> ]
loc_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ location ]
sex_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ sex ]
rel_age_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ relation, age ]
cat_age_sex_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ category, age, sex ]
age_sex_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ age, sex ]
full_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ category, age, sex, relation, location ]
</code></pre></div><p>The learning part is as trivial as in the Naive Bayes case. The only difference is the set of distributions involved:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#036;font-weight:bold">Model</span>.generate(<span style="color:#00d;font-weight:bold">1000</span>).each <span style="color:#080;font-weight:bold">do</span> |user|
user.baskets.each <span style="color:#080;font-weight:bold">do</span> |basket|
basket.line_items.each <span style="color:#080;font-weight:bold">do</span> |item|
loc_dist.observe! <span style="color:#a60;background-color:#fff0f0">location</span>: user.location
sex_dist.observe! <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex
rel_age_dist.observe! <span style="color:#a60;background-color:#fff0f0">relation</span>: user.relationship, <span style="color:#a60;background-color:#fff0f0">age</span>: user.age
cat_age_sex_dist.observe! <span style="color:#a60;background-color:#fff0f0">category</span>: item.category, <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex
age_sex_dist.observe! <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex
full_dist.observe! <span style="color:#a60;background-color:#fff0f0">category</span>: item.category, <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex,
<span style="color:#a60;background-color:#fff0f0">relation</span>: user.relationship, <span style="color:#a60;background-color:#fff0f0">location</span>: user.location
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>The inference part is also very similar to the one from the previous article. Here too the only difference are the distributions involved:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">infer = -> (age, sex, rel, loc) <span style="color:#080;font-weight:bold">do</span>
all = category.values.map <span style="color:#080;font-weight:bold">do</span> |cat|
pl = loc_dist.value_for <span style="color:#a60;background-color:#fff0f0">location</span>: loc
ps = sex_dist.value_for <span style="color:#a60;background-color:#fff0f0">sex</span>: sex
pra = rel_age_dist.value_for <span style="color:#a60;background-color:#fff0f0">relation</span>: rel, <span style="color:#a60;background-color:#fff0f0">age</span>: age
pcas = cat_age_sex_dist.value_for <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">sex</span>: sex
pas = age_sex_dist.value_for <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">sex</span>: sex
{ <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">value</span>: (pl * ps * pra * pcas) / pas }
<span style="color:#080;font-weight:bold">end</span>
all_full = category.values.map <span style="color:#080;font-weight:bold">do</span> |cat|
val = full_dist.value_for <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">sex</span>: sex,
<span style="color:#a60;background-color:#fff0f0">relation</span>: rel, <span style="color:#a60;background-color:#fff0f0">location</span>: loc
{ <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">value</span>: val }
<span style="color:#080;font-weight:bold">end</span>
win = all.max { |a, b| a[<span style="color:#a60;background-color:#fff0f0">:value</span>] <=> b[<span style="color:#a60;background-color:#fff0f0">:value</span>] }
win_full = all_full.max { |a, b| a[<span style="color:#a60;background-color:#fff0f0">:value</span>] <=> b[<span style="color:#a60;background-color:#fff0f0">:value</span>] }
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"Best match for </span><span style="color:#33b;background-color:#fff0f0">#{</span>[ age, sex, rel, loc ]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">:"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">" </span><span style="color:#33b;background-color:#fff0f0">#{</span>win[<span style="color:#a60;background-color:#fff0f0">:category</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> => </span><span style="color:#33b;background-color:#fff0f0">#{</span>win[<span style="color:#a60;background-color:#fff0f0">:value</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"Full pointed at:"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">" </span><span style="color:#33b;background-color:#fff0f0">#{</span>win_full[<span style="color:#a60;background-color:#fff0f0">:category</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> => </span><span style="color:#33b;background-color:#fff0f0">#{</span>win_full[<span style="color:#a60;background-color:#fff0f0">:value</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><h3 id="the-results">The results</h3>
<p>Now let’s run the inference procedure with the same set of examples as in the previous post to compare the results:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">infer.call <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:us</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:young_adults</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:asia</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:adults</span>, <span style="color:#a60;background-color:#fff0f0">:female</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span>, <span style="color:#a60;background-color:#fff0f0">:europe</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:elders</span>, <span style="color:#a60;background-color:#fff0f0">:female</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span>, <span style="color:#a60;background-color:#fff0f0">:canada</span>
</code></pre></div><p>Which yields:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Best match for [:teens, :male, :single, :us]:
snacks => 0.020610837341908994
Full pointed at:
snacks => 0.02103999999999992
Best match for [:young_adults, :male, :single, :asia]:
meat => 0.001801062449999991
Full pointed at:
meat => 0.0010700000000000121
Best match for [:adults, :female, :in_relationship, :europe]:
beauty => 0.0007693377820183494
Full pointed at:
beauty => 0.0008300000000000074
Best match for [:elders, :female, :in_relationship, :canada]:
veggies => 0.0024346445741176875
Full pointed at:
veggies => 0.0034199999999999886
</code></pre></div><p>Just as with the Naive Bayes model, we got correct values for all cases. Look closer, though, and you’ll notice that the resulting probability values are much nearer to the ones from the original, full distribution. The approach we took here makes the values differ by no more than about one part in a thousand. That precision could make a difference in the example e-commerce shop if it were visited by millions of customers each month.</p>
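<p>For the curious, the gaps between the two columns of numbers above can be recomputed directly (the values are hard-coded here from the printed output):</p>

```ruby
# Probabilities taken from the printed output above: the approximated
# distribution vs. the full joint distribution, per winning category.
approx = { snacks: 0.020610837341908994, meat: 0.001801062449999991,
           beauty: 0.0007693377820183494, veggies: 0.0024346445741176875 }
full   = { snacks: 0.02103999999999992, meat: 0.0010700000000000121,
           beauty: 0.0008300000000000074, veggies: 0.0034199999999999886 }

approx.each do |category, value|
  puts format('%-8s absolute difference: %.6f', category, (value - full[category]).abs)
end
```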
<p><strong>Learning from data basics: the Naive Bayes model</strong> (Kamil Ciemniewski, 2016-03-23)<br>https://www.endpointdev.com/blog/2016/03/learning-from-data-basics-naive-bayes/</p>
<p>Have you ever wondered about the machinery behind some of the algorithms that perform seemingly very intelligent tasks? How is it possible that a computer program can recognize faces in photos, turn an image into text, or even classify some emails as legitimate and others as spam?</p>
<p>Today, I’d like to present one of the simplest models for performing classification tasks. The model is extremely fast to execute, making it practical in many use cases. The example I’ve chosen will also let us extend the discussion about the optimal approach in a later blog post.</p>
<h3 id="the-problem">The problem</h3>
<p>Imagine that you’re working on an e-commerce store for your client. One of the requirements is to present the currently logged-in user with a “promotion box” somewhere on the page. The goal is to maximize the chances of the user putting the product from the box into the basket. There’s one promotional box and a couple of different product categories to choose the actual product from.</p>
<h3 id="thinking-about-the-solutionusing-probability-theory">Thinking about the solution—using probability theory</h3>
<p>One obvious direction to turn towards is probability theory. If we collect data about users’ previous choices and their characteristics, we can use probability to select the product category best suited to the current user. We would then choose a product from this category that currently has an active promotion.</p>
<h3 id="quick-theory-refresher-for-programmers">Quick theory refresher for programmers</h3>
<p>As we’ll be exploring the probability approaches using Ruby code, I’d like to very quickly walk you through some of the basic concepts we will be using from now on.</p>
<h4 id="random-variables">Random variables</h4>
<p>The simplest probability scenario many of us are already accustomed to is the distribution of coin toss results. Here we’re throwing the coin and noting whether we get heads or tails. In this experiment, we call “got heads” and “got tails” probability events. We can also shift the terminology a bit by calling them two values of the “toss result” <strong>random variable</strong>.</p>
<p>So in this case we’d have a random variable—let’s call it <strong>T</strong> (for “toss”)—that can take the values “heads” or “tails”. We then define the probability distribution P(T) as a function from the random variable’s values to real numbers between 0 and 1, inclusive. In the real world, the probability values after e.g. 10,000 tosses might look like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">+-------+---------------------+
| toss | value |
+-------+---------------------+
| heads | 0.49929999999999947 |
| tails | 0.500699999999998 |
+-------+---------------------+
</code></pre></div><p>These values get closer and closer to 0.5 as the number of tosses grows.</p>
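<p>As a quick aside (this snippet isn’t part of the post’s main code), we can watch this convergence with a simulated coin, here using Ruby’s seeded <code>Random</code>:</p>

```ruby
# Estimate the empirical distribution of a fair coin toss.
# The more tosses, the closer both values get to 0.5.
def toss_distribution(tosses, rng: Random.new(42))
  counts = { heads: 0, tails: 0 }
  tosses.times { counts[rng.rand(2).zero? ? :heads : :tails] += 1 }
  counts.transform_values { |count| count.to_f / tosses }
end

puts toss_distribution(10_000)
```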
<h4 id="factors-and-probability-distributions">Factors and probability distributions</h4>
<p>We’ve shown a simple probability distribution. To ease comprehension of the Ruby code we’ll be working with, let me introduce the notion of the <strong>factor</strong>. We called the “table” from the last example a probability distribution. The table represented a function from a random variable’s values to real numbers in [0, 1]. The <strong>factor</strong> is a generalization of that notion: it’s a function from the same domain, but returning any real number. We’ll explore the usefulness of this notion in some of our next articles.</p>
<p>The probability distribution is a factor that adds two constraints:</p>
<ul>
<li>its values are always in the range [0, 1] inclusively</li>
<li>the sum of all its values is exactly 1</li>
</ul>
<h3 id="simple-ruby-modeling-of-random-variables-and-factors">Simple Ruby modeling of random variables and factors</h3>
<p>We need some way of computing probability distributions. Let’s define a few simple tools we’ll be using in this blog series:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#888"># Let's define a simple version of the random variable</span>
<span style="color:#888"># - one that will hold discrete values</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">RandomVariable</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:values</span>, <span style="color:#a60;background-color:#fff0f0">:name</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(<span style="color:#038">name</span>, values)
<span style="color:#33b">@name</span> = <span style="color:#038">name</span>
<span style="color:#33b">@values</span> = values
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># The following class really represents here a probability</span>
<span style="color:#888"># distribution. We'll adjust it in the next posts to make</span>
<span style="color:#888"># it match the definition of a "factor". We're naming it this</span>
<span style="color:#888"># way right now as every probability distribution is a factor</span>
<span style="color:#888"># too.</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Factor</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:_table</span>, <span style="color:#a60;background-color:#fff0f0">:_count</span>, <span style="color:#a60;background-color:#fff0f0">:variables</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(variables)
<span style="color:#33b">@_table</span> = {}
<span style="color:#33b">@_count</span> = <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span>
<span style="color:#33b">@variables</span> = variables
initialize_table
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># We're choosing to represent the factor / distribution</span>
<span style="color:#888"># here as a table with value combinations in one column</span>
<span style="color:#888"># and probability values in another. Technically, we're using</span>
<span style="color:#888"># Ruby's Hash. The following method builds the initial hash</span>
<span style="color:#888"># with all the possible keys and values assigned to 0:</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize_table</span>
variables_values = <span style="color:#33b">@variables</span>.map <span style="color:#080;font-weight:bold">do</span> |var|
var.values.map <span style="color:#080;font-weight:bold">do</span> |val|
{ var.name.to_sym => val }
<span style="color:#080;font-weight:bold">end</span>.flatten
<span style="color:#080;font-weight:bold">end</span> <span style="color:#888"># [ [ { name: value } ] ] </span>
<span style="color:#33b">@_table</span> = variables_values[<span style="color:#00d;font-weight:bold">1</span>..(variables_values.count)].inject(variables_values.first) <span style="color:#080;font-weight:bold">do</span> |all_array, var_arrays|
all_array = all_array.map <span style="color:#080;font-weight:bold">do</span> |ob|
var_arrays.map <span style="color:#080;font-weight:bold">do</span> |var_val|
ob.merge var_val
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>.flatten
all_array
<span style="color:#080;font-weight:bold">end</span>.inject({}) { |m, item| m[item] = <span style="color:#00d;font-weight:bold">0</span>; m }
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># The following method adjusts the factor by adding information</span>
<span style="color:#888"># about observed combination of values. This in turn adjusts probability</span>
<span style="color:#888"># values for all the entries:</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">observe!</span>(observation)
<span style="color:#080;font-weight:bold">if</span> !<span style="color:#33b">@_table</span>.has_key? observation
<span style="color:#080;font-weight:bold">raise</span> <span style="color:#036;font-weight:bold">ArgumentError</span>, <span style="color:#d20;background-color:#fff0f0">"Doesn't fit the factor - </span><span style="color:#33b;background-color:#fff0f0">#{</span><span style="color:#33b">@variables</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> for observation: </span><span style="color:#33b;background-color:#fff0f0">#{</span>observation<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#33b">@_count</span> += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#33b">@_table</span>.keys.each <span style="color:#080;font-weight:bold">do</span> |key|
observed = key == observation
<span style="color:#33b">@_table</span>[key] = (<span style="color:#33b">@_table</span>[key] * (<span style="color:#33b">@_count</span> == <span style="color:#00d;font-weight:bold">0</span> ? <span style="color:#00d;font-weight:bold">0</span> : (<span style="color:#33b">@_count</span> - <span style="color:#00d;font-weight:bold">1</span>)) +
(observed ? <span style="color:#00d;font-weight:bold">1</span> : <span style="color:#00d;font-weight:bold">0</span>)) /
(<span style="color:#33b">@_count</span> == <span style="color:#00d;font-weight:bold">0</span> ? <span style="color:#00d;font-weight:bold">1</span> : <span style="color:#33b">@_count</span>)
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#038">self</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Helper method returning an enumerator over all the possible</span>
<span style="color:#888"># combinations of random variable assignments and their values</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">entries</span>
<span style="color:#33b">@_table</span>.each
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Helper method for testing purposes. Sums the values for the whole</span>
<span style="color:#888"># distribution - it should return 1 (close to 1 due to how computers</span>
<span style="color:#888"># handle floating point operations)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">sum</span>
<span style="color:#33b">@_table</span>.values.inject(<span style="color:#a60;background-color:#fff0f0">:+</span>)
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Returns a probability of a given combination happening</span>
<span style="color:#888"># in the experiment</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">value_for</span>(key)
<span style="color:#080;font-weight:bold">if</span> <span style="color:#33b">@_table</span>[key].nil?
<span style="color:#080;font-weight:bold">raise</span> <span style="color:#036;font-weight:bold">ArgumentError</span>, <span style="color:#d20;background-color:#fff0f0">"Doesn't fit the factor - </span><span style="color:#33b;background-color:#fff0f0">#{</span><span style="color:#33b">@variables</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> for: </span><span style="color:#33b;background-color:#fff0f0">#{</span>key<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#33b">@_table</span>[key]
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Helper method for testing purposes. Returns a table object</span>
<span style="color:#888"># ready to be printed to stdout. It shows the whole distribution</span>
<span style="color:#888"># as a table with some columns being random variables values and</span>
<span style="color:#888"># the last one being the probability value</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">table</span>
rows = <span style="color:#33b">@_table</span>.keys.map <span style="color:#080;font-weight:bold">do</span> |key|
key.values << <span style="color:#33b">@_table</span>[key]
<span style="color:#080;font-weight:bold">end</span>
table = <span style="color:#036;font-weight:bold">Terminal</span>::<span style="color:#036;font-weight:bold">Table</span>.new <span style="color:#a60;background-color:#fff0f0">rows</span>: rows, <span style="color:#a60;background-color:#fff0f0">headings</span>: ( <span style="color:#33b">@variables</span>.map(&<span style="color:#a60;background-color:#fff0f0">:name</span>) << <span style="color:#d20;background-color:#fff0f0">"value"</span> )
table.align_column(<span style="color:#33b">@variables</span>.count, <span style="color:#a60;background-color:#fff0f0">:right</span>)
table
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080">protected</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">entries</span>=(_entries)
_entries.each <span style="color:#080;font-weight:bold">do</span> |entry|
<span style="color:#33b">@_table</span>[entry.keys.first] = entry.values.first
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">count</span>
<span style="color:#33b">@_count</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">count</span>=(_count)
<span style="color:#33b">@_count</span> = _count
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Notice that we’re using the <strong>terminal-table</strong> gem here as a helper for printing out factors in an easy-to-grasp fashion. You’ll need the following requires:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">'rubygems'</span>
<span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">'terminal-table'</span>
</code></pre></div><h3 id="the-scenario-setup">The scenario setup</h3>
<p>Let’s imagine that we have the following categories to choose from:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">category = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:category</span>, [ <span style="color:#a60;background-color:#fff0f0">:veggies</span>, <span style="color:#a60;background-color:#fff0f0">:snacks</span>, <span style="color:#a60;background-color:#fff0f0">:meat</span>, <span style="color:#a60;background-color:#fff0f0">:drinks</span>, <span style="color:#a60;background-color:#fff0f0">:beauty</span>, <span style="color:#a60;background-color:#fff0f0">:magazines</span> ]
</code></pre></div><p>And the following user features on each request:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">age = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:age</span>, [ <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:young_adults</span>, <span style="color:#a60;background-color:#fff0f0">:adults</span>, <span style="color:#a60;background-color:#fff0f0">:elders</span> ]
sex = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:sex</span>, [ <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:female</span> ]
relation = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:relation</span>, [ <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span> ]
location = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:location</span>, [ <span style="color:#a60;background-color:#fff0f0">:us</span>, <span style="color:#a60;background-color:#fff0f0">:canada</span>, <span style="color:#a60;background-color:#fff0f0">:europe</span>, <span style="color:#a60;background-color:#fff0f0">:asia</span> ]
</code></pre></div><p>Let’s define the data model that resembles logically the one we could have in our real e-commerce application:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">LineItem</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:category</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(category)
<span style="color:#038">self</span>.category = category
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Basket</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:line_items</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(line_items)
<span style="color:#038">self</span>.line_items = line_items
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">User</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:age</span>, <span style="color:#a60;background-color:#fff0f0">:sex</span>, <span style="color:#a60;background-color:#fff0f0">:relationship</span>, <span style="color:#a60;background-color:#fff0f0">:location</span>, <span style="color:#a60;background-color:#fff0f0">:baskets</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(age, sex, relationship, location, baskets)
<span style="color:#038">self</span>.age = age
<span style="color:#038">self</span>.sex = sex
<span style="color:#038">self</span>.relationship = relationship
<span style="color:#038">self</span>.location = location
<span style="color:#038">self</span>.baskets = baskets
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>We want to utilize users’ baskets to infer the most probable category, given a set of user features. In our example, we can imagine that we’re offering authentication via Facebook. We can grab info about a user’s sex, location, age, and whether he or she is in a relationship. We want to find the category that’s chosen most often by users with a given set of features.</p>
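<p>Before formalizing anything with factors, the naive counting version of this idea can be sketched as follows (the <code>Purchase</code> struct and sample data are made up for illustration):</p>

```ruby
# Hypothetical flattened purchase records: one row per line item,
# carrying the buyer's traits and the item's category.
Purchase = Struct.new(:age, :sex, :relation, :location, :category)

# Among purchases made by users sharing the visitor's traits,
# return the category that occurs most often (nil if no match).
def best_category(purchases, age:, sex:, relation:, location:)
  counts = Hash.new(0)
  purchases.each do |purchase|
    next unless purchase.age == age && purchase.sex == sex &&
                purchase.relation == relation && purchase.location == location
    counts[purchase.category] += 1
  end
  counts.max_by { |_category, count| count }&.first
end

purchases = [
  Purchase.new(:teens, :male, :single, :us, :snacks),
  Purchase.new(:teens, :male, :single, :us, :snacks),
  Purchase.new(:teens, :male, :single, :us, :meat),
]
puts best_category(purchases, age: :teens, sex: :male, relation: :single, location: :us)
# prints "snacks"
```

This approach breaks down quickly: with many features, most exact combinations are rarely or never observed, which is the problem the probabilistic models in this series address.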
<p>As we don’t have any real data to play with, we’ll need a generator to create fake data with certain characteristics. Let’s first define a helper class with a method that will allow us to choose a value out of a given list of options along with their weights:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Generator</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">pick</span>(options)
items = options.inject([]) <span style="color:#080;font-weight:bold">do</span> |memo, keyval|
key, val = keyval
memo << <span style="color:#038">Array</span>.new(val, key)
memo
<span style="color:#080;font-weight:bold">end</span>.flatten
items.sample
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>With all the above we can define a random data generation model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Model</span>
<span style="color:#888"># Let's generate `num` users (1000 by default)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">generate</span>(num = <span style="color:#00d;font-weight:bold">1000</span>)
num.times.to_a.map <span style="color:#080;font-weight:bold">do</span> |user_index|
gen_user
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Returns a user with randomly selected traits and baskets</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_user</span>
age = gen_age
sex = gen_sex
rel = gen_rel(age)
loc = gen_loc
baskets = gen_baskets(age, sex)
<span style="color:#036;font-weight:bold">User</span>.new age, sex, rel, loc, baskets
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select a sex with 40% chance for getting a male</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_sex</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">male</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">female</span>: <span style="color:#00d;font-weight:bold">6</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select an age with 50% chance for getting a teen</span>
<span style="color:#888"># (among other options and weights)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_age</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">teens</span>: <span style="color:#00d;font-weight:bold">5</span>, <span style="color:#a60;background-color:#fff0f0">young_adults</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">adults</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">elders</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select a relationship status.</span>
<span style="color:#888"># Depend the chance of getting a given option on the user's age</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_rel</span>(age)
<span style="color:#080;font-weight:bold">case</span> age
<span style="color:#080;font-weight:bold">when</span> <span style="color:#a60;background-color:#fff0f0">:teens</span> <span style="color:#080;font-weight:bold">then</span> <span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">single</span>: <span style="color:#00d;font-weight:bold">7</span>, <span style="color:#a60;background-color:#fff0f0">in_relationship</span>: <span style="color:#00d;font-weight:bold">3</span>
<span style="color:#080;font-weight:bold">when</span> <span style="color:#a60;background-color:#fff0f0">:young_adults</span> <span style="color:#080;font-weight:bold">then</span> <span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">single</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">in_relationship</span>: <span style="color:#00d;font-weight:bold">6</span>
<span style="color:#080;font-weight:bold">else</span> <span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">single</span>: <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#a60;background-color:#fff0f0">in_relationship</span>: <span style="color:#00d;font-weight:bold">2</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select a location with 40% chance for getting the United States</span>
<span style="color:#888"># (among other options and weights)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_loc</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">us</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">canada</span>: <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#a60;background-color:#fff0f0">europe</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">asia</span>: <span style="color:#00d;font-weight:bold">2</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select 20 basket line items.</span>
<span style="color:#888"># Depend the chance of getting a given option on the user's age and sex</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_items</span>(age, sex)
num = <span style="color:#00d;font-weight:bold">20</span>
num.times.to_a.map <span style="color:#080;font-weight:bold">do</span> |i|
<span style="color:#080;font-weight:bold">if</span> (age == <span style="color:#a60;background-color:#fff0f0">:teens</span> || age == <span style="color:#a60;background-color:#fff0f0">:young_adults</span>) && sex == <span style="color:#a60;background-color:#fff0f0">:female</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">6</span>
<span style="color:#080;font-weight:bold">elsif</span> age == <span style="color:#a60;background-color:#fff0f0">:teens</span> && sex == <span style="color:#a60;background-color:#fff0f0">:male</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">4</span>
<span style="color:#080;font-weight:bold">elsif</span> (age == <span style="color:#a60;background-color:#fff0f0">:young_adults</span> || age == <span style="color:#a60;background-color:#fff0f0">:adults</span>) && sex == <span style="color:#a60;background-color:#fff0f0">:male</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">elsif</span> (age == <span style="color:#a60;background-color:#fff0f0">:young_adults</span> || age == <span style="color:#a60;background-color:#fff0f0">:adults</span>) && sex == <span style="color:#a60;background-color:#fff0f0">:female</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">3</span>
<span style="color:#080;font-weight:bold">elsif</span> age == <span style="color:#a60;background-color:#fff0f0">:elders</span> && sex == <span style="color:#a60;background-color:#fff0f0">:male</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">elsif</span> age == <span style="color:#a60;background-color:#fff0f0">:elders</span> && sex == <span style="color:#a60;background-color:#fff0f0">:female</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">else</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>.map <span style="color:#080;font-weight:bold">do</span> |cat|
<span style="color:#036;font-weight:bold">LineItem</span>.new cat
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly generate 5 baskets, with contents depending on the user's</span>
<span style="color:#888"># age and sex</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_baskets</span>(age, sex)
num = <span style="color:#00d;font-weight:bold">5</span>
num.times.to_a.map <span style="color:#080;font-weight:bold">do</span> |i|
<span style="color:#036;font-weight:bold">Basket</span>.new gen_items(age, sex)
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><h3 id="where-is-the-complexity">Where is the complexity?</h3>
<p>The approach described above doesn’t seem that exciting or complex. Reading about probability theory as applied in machine learning usually means wading through quite a dense set of mathematical notions, and the field is still being actively researched. All of this suggests huge complexity—certainly not the simple definition of probability we got used to in high school.</p>
<p>The problem becomes a bit more complex once you consider the efficiency of computing the probabilities. In our example, the joint probability distribution—to fully describe the scenario—needs to assign a probability to each of 384 cases:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(<span style="color:#a60;background-color:#fff0f0">:veggies</span>, <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:us</span>) <span style="color:#888"># one of 384 combinations</span>
</code></pre></div><p>Given that a probability distribution has to sum up to 1, the last case can be fully inferred from the sum of all the others. This means the model needs 6 * 4 * 2 * 2 * 4 - 1 = 383 parameters: 6 categories, 4 age classes, 2 sexes, 2 relationship kinds, and 4 locations. Imagine adding one additional, 4-valued feature (a season). This would grow the number of parameters to <strong>1535</strong>. And this is a very simple training example. A model with close to 100 different features would have a number of parameters that is clearly unmanageable even on the biggest servers. This approach would also make it very painful to add new features to the model.</p>
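<p>As a quick sanity check, the parameter counts above can be reproduced with a few lines of Ruby (the cardinalities are the ones from the example; the snippet is just illustrative arithmetic):</p>

```ruby
# Cardinalities of the variables in the example scenario
cardinalities = { category: 6, age: 4, sex: 2, relationship: 2, location: 4 }

# The full joint distribution needs one value per combination; since all
# values must sum to 1, the last one is implied, hence the "- 1":
full_params = cardinalities.values.reduce(:*) - 1
# => 383

# Adding a 4-valued season feature multiplies the number of combinations:
with_season = cardinalities.values.reduce(:*) * 4 - 1
# => 1535
```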
<h3 id="very-simple-but-powerful-optimization-the-naive-bayes-model">Very simple but powerful optimization: The Naive Bayes model</h3>
<p>In this section I’m going to present the equation we’ll be working with when optimizing our example. I’m not going to explain the mathematics behind it, as you can easily read about it on, e.g., Wikipedia.</p>
<p>The approach is called the <strong><a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes model</a></strong>. It is used, e.g., in spam filters, and it has also been used in the medical diagnosis field.</p>
<p>It allows us to present the full probability distribution as a product of factors:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(cat, age, sex, rel, loc) == <span style="color:#038">p</span>(cat) * <span style="color:#038">p</span>(age | cat) * <span style="color:#038">p</span>(sex | cat) * <span style="color:#038">p</span>(rel | cat) * <span style="color:#038">p</span>(loc | cat)
</code></pre></div><p>Where, e.g., p(age | cat) represents the probability of a user being a certain age given that this user selects cat products most frequently. Each such factor is a conditional probability of a feature given the class. The above equation states that we can simplify the distribution into a product of much more easily manageable factors.</p>
<p>The category from our example is often called a <strong>class</strong> and the rest of random variables in the distribution are often called <strong>features</strong>.</p>
<p>In our example, the number of parameters we’ll need to manage when presenting the distribution in this form drops to:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">(<span style="color:#00d;font-weight:bold">6</span> - <span style="color:#00d;font-weight:bold">1</span>) + (<span style="color:#00d;font-weight:bold">6</span> * <span style="color:#00d;font-weight:bold">4</span> - <span style="color:#00d;font-weight:bold">1</span>) + (<span style="color:#00d;font-weight:bold">6</span> * <span style="color:#00d;font-weight:bold">2</span> - <span style="color:#00d;font-weight:bold">1</span>) + (<span style="color:#00d;font-weight:bold">6</span> * <span style="color:#00d;font-weight:bold">2</span> - <span style="color:#00d;font-weight:bold">1</span>) + (<span style="color:#00d;font-weight:bold">6</span> * <span style="color:#00d;font-weight:bold">4</span> - <span style="color:#00d;font-weight:bold">1</span>) == <span style="color:#00d;font-weight:bold">73</span>
</code></pre></div><p>That’s just around 19% of the original amount! Also, adding another variable (season) would only add 23 new parameters (compared to 1152 in the full distribution case).</p>
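<p>These counts can be double-checked with a bit of arithmetic as well (illustrative only; the sizes are the ones from the example):</p>

```ruby
cat = 6
feature_sizes = { age: 4, sex: 2, relationship: 2, location: 4 }

# p(cat) needs (6 - 1) values; each joint p(feature, cat) factor needs
# (|feature| * |cat| - 1) values:
nb_params = (cat - 1) + feature_sizes.values.sum { |n| cat * n - 1 }
# => 73

# Compared to the 383 parameters of the full joint distribution:
ratio = (nb_params / 383.0 * 100).round
# => 19 (percent)

# Adding a 4-valued season feature only adds one more joint factor:
season_extra = cat * 4 - 1
# => 23
```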
<p>The Naive Bayes model limits the number of parameters we have to manage but it comes with very strong assumptions about the variables involved: in our example, that the user features are conditionally independent given the resulting category. Later on I’ll show why this isn’t true in this case even though the results will still be quite okay.</p>
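<p>To see what this assumption means in practice, here is a toy sketch with made-up numbers (not related to the article’s data): for a distribution where the features are conditionally dependent, the Naive Bayes factorization visibly misestimates the true conditional probability:</p>

```ruby
# A toy joint distribution p(a, b, c) with made-up numbers, where the
# features a and b are NOT conditionally independent given the class c:
joint = {
  [true,  true,  true]  => 0.20, [true,  false, true]  => 0.05,
  [false, true,  true]  => 0.05, [false, false, true]  => 0.20,
  [true,  true,  false] => 0.10, [true,  false, false] => 0.15,
  [false, true,  false] => 0.15, [false, false, false] => 0.10,
}

p_c    = joint.sum { |(_a, _b, c), v| c ? v : 0 }      # ≈ 0.5
p_a_c  = joint.sum { |(a, _b, c), v| a && c ? v : 0 }  # ≈ 0.25
p_b_c  = joint.sum { |(_a, b, c), v| b && c ? v : 0 }  # ≈ 0.25
p_ab_c = joint[[true, true, true]]                     # 0.20

exact = p_ab_c / p_c                   # true p(a, b | c)  ≈ 0.4
naive = (p_a_c / p_c) * (p_b_c / p_c)  # NB factorization  ≈ 0.25
```

<p>Here the true p(a, b | c) is about 0.4, while the product of the per-feature conditionals gives about 0.25: exactly the kind of error the assumption can introduce when it doesn’t hold.</p>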
<h3 id="implementing-the-naive-bayes-model">Implementing the Naive Bayes model</h3>
<p>As we now have all the tools we need, let’s get back to probability theory to figure out how best to model Naive Bayes in terms of the Ruby building blocks we now have.</p>
<p>The approach says that under the assumptions we discussed we can approximate the original distribution to be the product of factors:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(cat, age, sex, rel, loc) = <span style="color:#038">p</span>(cat) * <span style="color:#038">p</span>(age | cat) * <span style="color:#038">p</span>(sex | cat) * <span style="color:#038">p</span>(rel | cat) * <span style="color:#038">p</span>(loc | cat)
</code></pre></div><p>Given the definition of the conditional probability we have that:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(a | b) = <span style="color:#038">p</span>(a, b) / <span style="color:#038">p</span>(b)
</code></pre></div><p>Thus, we can express the approximation as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(cat, age, sex, rel, loc) = <span style="color:#038">p</span>(cat) * ( <span style="color:#038">p</span>(age, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(sex, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(rel, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(loc, cat) / <span style="color:#038">p</span>(cat) )
</code></pre></div><p>And then simplify it even further as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(cat, age, sex, rel, loc) = <span style="color:#038">p</span>(age, cat) * ( <span style="color:#038">p</span>(sex, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(rel, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(loc, cat) / <span style="color:#038">p</span>(cat) )
</code></pre></div><p>Let’s define all the factors we will need:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ category ]
age_cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ age, category ]
sex_cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ sex, category ]
rel_cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ relation, category ]
loc_cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ location, category ]
</code></pre></div><p>Also, we want a full distribution to compare the results:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">full_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ category, age, sex, relation, location ]
</code></pre></div><p>Let’s generate 1000 random users and, looping through them and their baskets, adjust the probability distributions for combinations of product categories and user traits:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#036;font-weight:bold">Model</span>.generate(<span style="color:#00d;font-weight:bold">1000</span>).each <span style="color:#080;font-weight:bold">do</span> |user|
user.baskets.each <span style="color:#080;font-weight:bold">do</span> |basket|
basket.line_items.each <span style="color:#080;font-weight:bold">do</span> |item|
cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
age_cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
sex_cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex, <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
rel_cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">relation</span>: user.relationship, <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
loc_cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">location</span>: user.location, <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
full_dist.observe! <span style="color:#a60;background-color:#fff0f0">category</span>: item.category, <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex,
<span style="color:#a60;background-color:#fff0f0">relation</span>: user.relationship, <span style="color:#a60;background-color:#fff0f0">location</span>: user.location
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>We can now print the distributions as tables to have an insight about the data:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">[ cat_dist, age_cat_dist, sex_cat_dist, rel_cat_dist,
loc_cat_dist, full_dist ].each <span style="color:#080;font-weight:bold">do</span> |dist|
<span style="color:#038">puts</span> dist.table
<span style="color:#888"># Let's print out the sum of all entries to ensure the</span>
<span style="color:#888"># algorithm works well:</span>
<span style="color:#038">puts</span> dist.sum
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>This yields the following console output (the full distribution is truncated due to its size):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">+-----------+---------------------+
| category | value |
+-----------+---------------------+
| veggies | 0.10866 |
| snacks | 0.19830999999999863 |
| meat | 0.14769 |
| drinks | 0.10115999999999989 |
| beauty | 0.24632 |
| magazines | 0.19785999999999926 |
+-----------+---------------------+
0.9999999999999978
+--------------+-----------+----------------------+
| age | category | value |
+--------------+-----------+----------------------+
| teens | veggies | 0.02608000000000002 |
| teens | snacks | 0.11347999999999969 |
| teens | meat | 0.06282999999999944 |
| teens | drinks | 0.0263200000000002 |
| teens | beauty | 0.1390699999999995 |
| teens | magazines | 0.13322000000000103 |
| young_adults | veggies | 0.010250000000000023 |
| young_adults | snacks | 0.03676000000000003 |
| young_adults | meat | 0.03678000000000005 |
| young_adults | drinks | 0.03670000000000045 |
| young_adults | beauty | 0.05172999999999976 |
| young_adults | magazines | 0.035779999999999916 |
| adults | veggies | 0.026749999999999927 |
| adults | snacks | 0.03827999999999962 |
| adults | meat | 0.034600000000000505 |
| adults | drinks | 0.028190000000000038 |
| adults | beauty | 0.03892000000000036 |
| adults | magazines | 0.02225999999999998 |
| elders | veggies | 0.04558000000000066 |
| elders | snacks | 0.009790000000000047 |
| elders | meat | 0.013480000000000027 |
| elders | drinks | 0.009949999999999931 |
| elders | beauty | 0.016600000000000226 |
| elders | magazines | 0.006600000000000025 |
+--------------+-----------+----------------------+
1.0000000000000013
+--------+-----------+----------------------+
| sex | category | value |
+--------+-----------+----------------------+
| male | veggies | 0.03954000000000044 |
| male | snacks | 0.1132499999999996 |
| male | meat | 0.10851000000000031 |
| male | drinks | 0.073 |
| male | beauty | 0.023679999999999857 |
| male | magazines | 0.05901999999999993 |
| female | veggies | 0.06911999999999997 |
| female | snacks | 0.08506000000000069 |
| female | meat | 0.03918000000000006 |
| female | drinks | 0.02816000000000005 |
| female | beauty | 0.22264000000000062 |
| female | magazines | 0.13884000000000046 |
+--------+-----------+----------------------+
1.000000000000002
+-----------------+-----------+----------------------+
| relation | category | value |
+-----------------+-----------+----------------------+
| single | veggies | 0.07722000000000082 |
| single | snacks | 0.13090999999999794 |
| single | meat | 0.09317000000000061 |
| single | drinks | 0.059979999999999915 |
| single | beauty | 0.16317999999999971 |
| single | magazines | 0.13054000000000135 |
| in_relationship | veggies | 0.031440000000000336 |
| in_relationship | snacks | 0.06740000000000032 |
| in_relationship | meat | 0.054520000000000006 |
| in_relationship | drinks | 0.04118000000000009 |
| in_relationship | beauty | 0.08314000000000002 |
| in_relationship | magazines | 0.06732000000000182 |
+-----------------+-----------+----------------------+
1.000000000000003
+----------+-----------+----------------------+
| location | category | value |
+----------+-----------+----------------------+
| us | veggies | 0.04209000000000062 |
| us | snacks | 0.07534000000000109 |
| us | meat | 0.055059999999999984 |
| us | drinks | 0.03704000000000108 |
| us | beauty | 0.09879000000000099 |
| us | magazines | 0.07867999999999964 |
| canada | veggies | 0.027930000000000062 |
| canada | snacks | 0.05745999999999996 |
| canada | meat | 0.04288000000000003 |
| canada | drinks | 0.03078999999999948 |
| canada | beauty | 0.06397999999999997 |
| canada | magazines | 0.053959999999999675 |
| europe | veggies | 0.013110000000000132 |
| europe | snacks | 0.0223200000000001 |
| europe | meat | 0.01730000000000005 |
| europe | drinks | 0.011859999999999964 |
| europe | beauty | 0.025490000000000183 |
| europe | magazines | 0.020920000000000164 |
| asia | veggies | 0.02552999999999989 |
| asia | snacks | 0.04319000000000044 |
| asia | meat | 0.03244999999999966 |
| asia | drinks | 0.02147000000000005 |
| asia | beauty | 0.05805999999999953 |
| asia | magazines | 0.0442999999999999 |
+----------+-----------+----------------------+
1.0000000000000029
+-----------+--------------+--------+-----------------+----------+------------------------+
| category | age | sex | relation | location | value |
+-----------+--------------+--------+-----------------+----------+------------------------+
| veggies | teens | male | single | us | 0.0035299999999999936 |
| veggies | teens | male | single | canada | 0.0024500000000000073 |
| veggies | teens | male | single | europe | 0.0006999999999999944 |
| veggies | teens | male | single | asia | 0.0016699999999999899 |
| veggies | teens | male | in_relationship | us | 0.001340000000000006 |
| veggies | teens | male | in_relationship | canada | 0.0010099999999999775 |
| veggies | teens | male | in_relationship | europe | 0.0006499999999999989 |
| veggies | teens | male | in_relationship | asia | 0.000819999999999994 |
(... many rows ...)
| magazines | elders | male | in_relationship | asia | 0.00012000000000000163 |
| magazines | elders | female | single | us | 0.0007399999999999966 |
| magazines | elders | female | single | canada | 0.0007000000000000037 |
| magazines | elders | female | single | europe | 0.0003199999999999965 |
| magazines | elders | female | single | asia | 0.0005899999999999999 |
| magazines | elders | female | in_relationship | us | 0.0004899999999999885 |
| magazines | elders | female | in_relationship | canada | 0.00027000000000000114 |
| magazines | elders | female | in_relationship | europe | 0.00012000000000000014 |
| magazines | elders | female | in_relationship | asia | 0.00012000000000000014 |
+-----------+--------------+--------+-----------------+----------+------------------------+
1.0000000000000004
</code></pre></div><p>Let’s define a Proc for inferring categories based on user traits as evidence:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">infer = -> (age, sex, rel, loc) <span style="color:#080;font-weight:bold">do</span>
<span style="color:#888"># Let's map through the possible categories and the probability</span>
<span style="color:#888"># values the distributions assign to them:</span>
all = category.values.map <span style="color:#080;font-weight:bold">do</span> |cat|
pc = cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">category</span>: cat
pac = age_cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">category</span>: cat
psc = sex_cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">sex</span>: sex, <span style="color:#a60;background-color:#fff0f0">category</span>: cat
prc = rel_cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">relation</span>: rel, <span style="color:#a60;background-color:#fff0f0">category</span>: cat
plc = loc_cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">location</span>: loc, <span style="color:#a60;background-color:#fff0f0">category</span>: cat
{ <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">value</span>: (pac * psc/pc * prc/pc * plc/pc) }
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Let's do the same with the full distribution to be able to compare</span>
<span style="color:#888"># the results:</span>
all_full = category.values.map <span style="color:#080;font-weight:bold">do</span> |cat|
val = full_dist.value_for <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">sex</span>: sex,
<span style="color:#a60;background-color:#fff0f0">relation</span>: rel, <span style="color:#a60;background-color:#fff0f0">location</span>: loc
{ <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">value</span>: val }
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Here we're getting the most probable categories based on the</span>
<span style="color:#888"># Naive Bayes distribution approximation model and based on the full</span>
<span style="color:#888"># distribution:</span>
win = all.max { |a, b| a[<span style="color:#a60;background-color:#fff0f0">:value</span>] <=> b[<span style="color:#a60;background-color:#fff0f0">:value</span>] }
win_full = all_full.max { |a, b| a[<span style="color:#a60;background-color:#fff0f0">:value</span>] <=> b[<span style="color:#a60;background-color:#fff0f0">:value</span>] }
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"Best match for </span><span style="color:#33b;background-color:#fff0f0">#{</span>[ age, sex, rel, loc ]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">:"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">" </span><span style="color:#33b;background-color:#fff0f0">#{</span>win[<span style="color:#a60;background-color:#fff0f0">:category</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> => </span><span style="color:#33b;background-color:#fff0f0">#{</span>win[<span style="color:#a60;background-color:#fff0f0">:value</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"Full pointed at:"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">" </span><span style="color:#33b;background-color:#fff0f0">#{</span>win_full[<span style="color:#a60;background-color:#fff0f0">:category</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> => </span><span style="color:#33b;background-color:#fff0f0">#{</span>win_full[<span style="color:#a60;background-color:#fff0f0">:value</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><h3 id="the-results">The results</h3>
<p>We’re ready now to use the model and see how well the Naive Bayes model performs in this particular scenario:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">infer.call <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:us</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:young_adults</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:asia</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:adults</span>, <span style="color:#a60;background-color:#fff0f0">:female</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span>, <span style="color:#a60;background-color:#fff0f0">:europe</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:elders</span>, <span style="color:#a60;background-color:#fff0f0">:female</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span>, <span style="color:#a60;background-color:#fff0f0">:canada</span>
</code></pre></div><p>This gave the following results on the console:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Best match for [:teens, :male, :single, :us]:
snacks => 0.016252573282200262
Full pointed at:
snacks => 0.01898999999999971
Best match for [:young_adults, :male, :single, :asia]:
meat => 0.0037455794492659757
Full pointed at:
meat => 0.0017000000000000016
Best match for [:adults, :female, :in_relationship, :europe]:
beauty => 0.0012287311061725868
Full pointed at:
beauty => 0.0003000000000000026
Best match for [:elders, :female, :in_relationship, :canada]:
veggies => 0.002156365730474441
Full pointed at:
veggies => 0.0013500000000000022
</code></pre></div><p>That’s quite impressive! Even though we’re using a simplified model to approximate the original distribution, the algorithm managed to infer the correct category in every case. Notice also that the probability estimates differ only by a couple of cases in 1000.</p>
<p>An approximation like this would certainly be very useful in a more complex e-commerce scenario, where the number of evidence variables is big enough to make the full distribution unmanageable. There are use cases, though, where a couple of errors in 1000 cases would be too many—the traditional example is medical diagnosis. There are also cases where the number of errors would be much greater, simply because the Naive Bayes assumption of conditional independence of variables is not always a fair assumption. Is there a way to improve?</p>
<p>The Naive Bayes assumption says that the distribution factorizes the way we did it <strong>only if the features are conditionally independent given the category</strong>. The notion of <strong>conditional independence</strong> (apart from the formal mathematical definition) says that if variables a and b are conditionally independent given c, then once we know the value of c, no additional information about b can alter our knowledge about a. In our example, knowing the category, let’s say :beauty, doesn’t mean that, e.g., sex is independent of age. In real-world examples it’s often very hard to find a use case for Naive Bayes that follows the assumption in all cases.</p>
<p>There are alternative approaches that allow us to apply assumptions that more closely follow the chosen data set. We will explore these in future articles, building on top of what we saw here.</p>
RailsConf 2014 on Machine Learning
https://www.endpointdev.com/blog/2014/04/railsconf-2014-on-machine-learning/
2014-04-24 · Steph Skardal
<p>This year at RailsConf 2014 there are workshop tracks: focused sessions double or triple the length of a normal talk. Today I attended <em>Machine Learning for Fun and Profit</em> by <a href="https://twitter.com/johnashenfelter">John Paul Ashenfelter</a>. Some analytics tools are good at providing averages over data (e.g., Google Analytics), but averages don’t tell you the specific story or context of your users, which can be valuable and actionable. In his story-telling approach, John covered several stories of generating data via machine learning techniques in Ruby.</p>
<h3 id="make-a-plan">Make a Plan</h3>
<p>First, one must formulate a plan or a goal for which to collect actionable data. More likely than not, the goal is to make money, and the hope is that machine learning can help you find actionable data to make more money! John walked through several use cases and example code with machine learning, and I’ll add a bit of ecommerce context to each story below.</p>
<h3 id="act-1-describe-your-users">Act 1: Describe your Users</h3>
<p>First, John talked about a few tools used for describing your users. In the context of his story, he wanted to figure out what gender ratio of shirts to order for the company. He used the <a href="https://github.com/bmuller/sexmachine">sexmachine gem</a>, which is based on census data, to predict the sex of a person based on a first name. The first names of all your users would be passed into this gem to segment users by gender, and from there you may be able to take action (e.g. order shirts in the estimated gender ratio).</p>
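<p>A minimal sketch of the idea in plain Ruby, with a tiny hypothetical name table standing in for the gem’s census data:</p>

```ruby
# Hypothetical name-to-gender table; the sexmachine gem backs this kind of
# lookup with real census frequency data.
NAME_GENDER = {
  "james" => :male, "mary" => :female, "john" => :male,
  "linda" => :female, "robert" => :male, "susan" => :female
}

def guess_gender(first_name)
  NAME_GENDER.fetch(first_name.downcase, :unknown)
end

users = ["Mary", "John", "Linda", "Robert", "Alex"]
ratio = users.group_by { |name| guess_gender(name) }
             .transform_values(&:size)
puts ratio.inspect
```

<p>Here "Alex" falls through as <code>:unknown</code>; the gem handles ambiguous names with categories like <code>:mostly_male</code> and <code>:mostly_female</code> instead.</p>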
<p>Next, John covered geolocation. John wanted to know how to scale support hours for customers using the product, likely a very common reason for geolocation in any SaaS or customer-centric tool. His solution uses a free IP address lookup service, Python, and Go, and free <a href="https://www.maxmind.com/en/geoip2-services-and-databases">Maxmind data</a>. The example code is available <a href="https://github.com/johnpaulashenfelter/railsconf2014-ml/tree/master/ex2_geolocation">here</a>.</p>
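<p>The support-hours question boils down to counting users per timezone once each IP is resolved. A plain Ruby sketch of that step, with a made-up lookup table standing in for a GeoIP database query:</p>

```ruby
# Hypothetical IP-to-UTC-offset mapping; real code would resolve each
# address against a GeoIP database such as MaxMind's.
IP_TO_OFFSET = {
  "203.0.113.5"  => 10,  # Sydney
  "198.51.100.7" => -5,  # New York
  "192.0.2.9"    => -5,  # Toronto
  "198.51.100.8" => 1    # Berlin
}

# Count active users per UTC offset to see where support coverage is needed.
coverage = IP_TO_OFFSET.values.tally.sort_by { |_, count| -count }
coverage.each { |offset, count| puts format("UTC%+d: %d users", offset, count) }
```

<p>The top entry tells you which timezone band most needs staffed support hours.</p>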
<p>With these tools, gender assignment &amp; geolocation reveal basic but valuable information about your users. In the ecommerce space, determining gender ratios and geolocation may help determine the target of marketing and/or product development efforts, for example targeting a specific marketing message to a female East Coast demographic.</p>
<h3 id="act-2-clustering">Act 2: Clustering</h3>
<p>In the next step, John talked about using machine learning to cluster users. The context John provided was to cluster users into three groups: casual users, super users, and professional users, to potentially learn more about the super users and how to get more users into that group. An ecommerce story might be to cluster users into amount-spent buckets which have rewards at higher levels, to incentivize users to spend more money to climb the hierarchy for more rewards. Here John used the <a href="https://github.com/SergioFierens/ai4r">ai4r</a> gem, which uses <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means clustering</a> to group users. In as few words as possible, k-means clustering randomly creates X clusters (step 1), computes the center of each cluster (step 2), moves nodes if they are closer to a different cluster’s center (step 3), and repeats steps 2 &amp; 3 until no nodes have been moved. The actual code is quite simple with the gem. Alternative clustering tools are <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clusterers</a> or <a href="https://www.google.com/search?q=divisive+hierarchical+clustering">divisive hierarchical clustering</a>, which will yield slightly different results. John also mentioned that there are much better numerical tools like Python, R, Octave/Matlab, and Mathematica.</p>
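<p>The loop described above can be sketched in a few lines of plain Ruby: a simplified one-dimensional version over amount-spent figures (made-up numbers), not the ai4r implementation:</p>

```ruby
# Minimal 1-D k-means: assign points to nearest center, recompute centers
# as cluster means, repeat until the centers stop moving.
def kmeans(points, centers)
  loop do
    # Assignment step: group each point with its nearest center.
    clusters = points.group_by { |p| centers.min_by { |c| (p - c).abs } }
    # Update step: each center becomes the mean of its cluster.
    new_centers = centers.map do |c|
      members = clusters[c]
      members ? members.sum.to_f / members.size : c
    end
    return [new_centers, clusters.values] if new_centers == centers
    centers = new_centers
  end
end

spend = [5, 8, 12, 95, 102, 480, 510]  # dollars spent per user
# Seed centers roughly at casual / super / professional spend levels.
centers, groups = kmeans(spend, [0.0, 100.0, 500.0])
puts centers.inspect
```

<p>With these seeds the users fall cleanly into three spend buckets; ai4r wraps the same loop (including random initialization) behind a one-call API.</p>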
<h3 id="act-3-similarity">Act 3: Similarity</h3>
<p>The third and final topic John covered was determining similarity between users, or perhaps finding other users similar to user X. The context of this was to understand how people collaborate and spread knowledge. In the ecommerce space, the obvious use case here is building a product recommendation engine, e.g. recommending products to a user based on what they have bought, are looking at, or have in their cart. John didn’t dive into the specific linear algebra math here (linear algebra is hard!), but he provided <a href="https://github.com/johnpaulashenfelter/railsconf2014-ml/tree/master/ex4_similarity">example code</a> using the <a href="https://github.com/quix/linalg">linalg</a> gem that does much of the hard work for you.</p>
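<p>To give a taste of the underlying math, here is a small plain-Ruby sketch of one common similarity measure, cosine similarity, applied to hypothetical purchase-count vectors (the linalg gem provides the general-purpose linear algebra for the real thing):</p>

```ruby
# Cosine similarity: the cosine of the angle between two vectors,
# 1.0 for identical direction, 0.0 for orthogonal (nothing in common).
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (mag.(a) * mag.(b))
end

# Rows: users; columns: purchase counts per product (hypothetical data).
purchases = {
  "alice" => [2, 0, 1, 3],
  "bob"   => [1, 0, 1, 2],
  "carol" => [0, 4, 0, 0]
}

# Find the user most similar to alice by purchase behavior.
target = purchases["alice"]
most_similar = purchases.reject { |name, _| name == "alice" }
                        .max_by { |_, vec| cosine_similarity(target, vec) }
puts most_similar.first
```

<p>Bob buys the same products in similar proportions, so he scores highest; recommending to Alice what Bob bought (and she hasn’t) is the basic collaborative-filtering move.</p>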
<h3 id="conclusion">Conclusion</h3>
<p>The conclusion of this workshop was again to share Ruby tools that can help solve problems concerning your users and business. It’s very important to have a plan and/or goal to strive for, and to determine actionable data analysis and metrics to help reach those goals.</p>