Understanding Linear Regression, by Kürşat Kutlu Aydemir (End Point Dev, 2022-06-01)
https://www.endpointdev.com/blog/2022/06/understanding-linear-regression/
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.15.6/dist/katex.min.css" integrity="sha384-ZPe7yZ91iWxYumsBEOn7ieg8q/o+qh/hQpSaPow8T6BwALcXSCS6C6fSRPIAnTQs" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.15.6/dist/katex.min.js" integrity="sha384-ljao5I1l+8KYFXG7LNEA7DyaFvuvSCmedUf6Y6JI7LJqiu8q5dEivP2nDdFH31V4" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.15.6/dist/contrib/auto-render.min.js" integrity="sha384-+XBljXPPiv+OzfbB3cVmLHf4hdUFHlWNZN5spNQ7rmHTXpd7WvJum6fIACpNNfIR" crossorigin="anonymous"></script>
<p><img src="/blog/2022/06/understanding-linear-regression/banner.webp" alt="Green Striped">
<a href="https://www.pexels.com/photo/green-striped-wallpaper-136740/">Photo by Scott Webb</a></p>
<p>Linear regression is a regression model which outputs a numeric value. It is used to predict an outcome based on a linear combination of inputs.</p>
<p>The simplest hypothesis function of linear regression model is a univariate function as shown in the equation below:</p>
<p>$$
h_θ = θ_0 + θ_1x_1
$$</p>
<p>As you can guess, this function represents a straight line in the coordinate system. The hypothesis function (h<sub>θ</sub>) approximates the output for a given input.</p>
<p><img src="/blog/2022/06/understanding-linear-regression/linear-regression-1.webp" alt="Linear regression plot"></p>
<p>θ<sub>0</sub> is the <em>intercept</em>, also called <em>bias term</em>. θ<sub>1</sub> is the <em>gradient</em> or <em>slope</em>.</p>
<p>A linear regression model can represent either a univariate or a multivariate problem, so we can generalize the hypothesis equation as a summation:</p>
<p>$$
h_θ = \sum_{i=0}^{n}{θ_ix_i}
$$</p>
<p>where x<sub>0</sub> is always 1.</p>
<p>We can also represent the hypothesis equation with vector notation:</p>
<p>$$
h_θ =
\begin{bmatrix}
θ_0 & θ_1 & θ_2 & \dots & θ_n
\end{bmatrix}
\begin{bmatrix}
x_0 \\
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}
$$</p>
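In numpy this vector form is just a dot product. A minimal sketch, with made-up values for θ and x (and x<sub>0</sub> fixed to 1):

```python
import numpy as np

# hypothetical parameters and input for h = theta . x
theta = np.array([2.0, 0.5, -1.0])  # theta_0 (bias), theta_1, theta_2
x = np.array([1.0, 4.0, 3.0])       # x_0 is always 1

h = np.dot(theta, x)                # 2.0 + 0.5*4.0 - 1.0*3.0
print(h)                            # 1.0
```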
<h3 id="linear-regression-model">Linear Regression Model</h3>
<p>I am going to implement a linear regression model using the <em>gradient descent</em> algorithm. Each iteration of gradient descent performs the following steps:</p>
<ul>
<li>Hypothesis <em>h</em></li>
<li>The loss</li>
<li>Gradient descent update</li>
</ul>
<p>The gradient descent update iteration stops when it reaches <em>convergence</em>.</p>
<p>Although I am implementing a univariate linear regression model in this section, these steps apply to multivariate linear regression models as well.</p>
<h4 id="hypothesis">Hypothesis</h4>
<p>We start with an initial hypothesis using random parameters. Then we calculate the loss over the training dataset using the <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm">L2 Loss</a> function. In Python:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">hypothesis</span>(X, theta):
<span style="color:#080;font-weight:bold">return</span> theta[<span style="color:#00d;font-weight:bold">0</span>] + theta[<span style="color:#00d;font-weight:bold">1</span>:] * X
</code></pre></div><p>This function takes the input <code>X</code> (univariate in this implementation) and the theta parameter values. <code>X</code> represents the feature input of our dataset, and theta holds the weights of the features. θ<sub>0</sub> is called the <em>bias term</em> and θ<sub>1</sub> is the <em>gradient</em> or <em>slope</em>.</p>
<h4 id="l2-loss-function">L2 Loss Function</h4>
<p>The L2 loss function — sometimes called Mean Squared Error (MSE) — is the total error of the current hypothesis over the given training dataset. During training, by calculating the MSE, we aim to minimize this cumulative error.</p>
<p><img src="/blog/2022/06/understanding-linear-regression/linear-regression-2-mse.webp" alt="L2 Loss"></p>
<p>$$
J(θ) = \frac{\sum{(h_θ(x_i) - y_i)^2}}{2m}
$$</p>
<p>The L2 loss function (MSE) calculates the error by summing the squares of each data point’s error and dividing by twice the size of the dataset (the factor of 2 simplifies the derivative).</p>
<p>The better the line is positioned through the center of the data points, with an optimized slope, the smaller the error, which is what we aim to minimize in linear regression training.</p>
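The <code>gradient_update</code> function shown later calls an <code>L2_loss</code> helper that the post doesn't list. A minimal sketch consistent with the J(θ) formula above could be:

```python
import numpy as np

def L2_loss(h, y):
    # sum of squared errors over the dataset, divided by 2m to match J(theta)
    return np.sum((h.flatten() - y) ** 2) / (2 * len(y))

# tiny hypothetical check: predictions [1, 2] against targets [0, 2]
print(L2_loss(np.array([1.0, 2.0]), np.array([0.0, 2.0])))  # 0.25
```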
<h4 id="gradients-of-the-loss">Gradients of the Loss</h4>
<p>Each time we iterate and calculate a new theta (θ), we get a new theta<sub>1</sub> (slope) value. If we plot each slope value in the gradient descent batch update we will have a curve like this:</p>
<p><img src="/blog/2022/06/understanding-linear-regression/linear-regression-3-gradient-descent.webp" alt="Gradient Descent"></p>
<p>This curve has a minimum value below which it cannot go. Our goal is to find an optimal value of theta<sub>1</sub>, reaching a point where the curve doesn’t get any lower or the change can be ignored. That is where convergence is achieved and the loss is minimized.</p>
<p>Let’s do a little bit more math. The gradient of the loss is the vector of partial derivatives with respect to θ. We calculate the partial derivative of the loss for θ<sub>0</sub> and θ<sub>1</sub> separately. For multivariate functions, the θ<sub>1</sub> case generalizes to all available θ<sub>i</sub>, since the partial derivatives are calculated similarly. You can derive the partial derivatives of the loss function yourself too.</p>
<p>$$
\frac{∂}{∂θ_0}J(θ_0) = \frac{\sum{(h_0 - y_0)}}{m}
$$</p>
<p>$$
\frac{∂}{∂θ_0}J(θ_i) = \frac{\sum{(h_i - y_i)x_i}}{m}
$$</p>
<p>Since we know the hypothesis equation we can replace it in the derivatives as well:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">numpy</span> <span style="color:#080;font-weight:bold">as</span> np

<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">partial_derivatives</span>(h, X, y):
<span style="color:#080;font-weight:bold">return</span> [np.mean((h.flatten() - y)), np.mean((h.flatten() - y) * X.flatten())]
</code></pre></div><p>Now we calculate the gradients of the loss for a given theta, X, and y:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">calc_gradients</span>(theta, X, y):
h = hypothesis(X, theta)
gradient = partial_derivatives(h, X, y)
<span style="color:#080;font-weight:bold">return</span> np.array(gradient)
</code></pre></div><h4 id="batch-gradient-descent">Batch Gradient Descent</h4>
<p>The gradient descent method I used in this implementation is called <em>batch gradient descent</em>, which uses all the available data on every iteration; this slows down the overall convergence process. There are methods that improve the performance of gradient descent, such as <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a>.</p>
<p>Having the gradients for the current theta, we iterate the following update until convergence:</p>
<p>$$
θ_1^{(new)} = θ_1^{(current)} - α \frac{∂}{∂θ_1}J(θ)
$$</p>
<p>Here comes the <em>learning rate</em> (α), also called the <em>convergence rate</em>, which decides how big a step we take on each iteration. If <code>α</code> is too small, convergence is more reliable but training becomes very slow. If <code>α</code> is too large, training runs faster but gradient descent can overshoot the minimum and fail to converge accurately.</p>
<p>There is no strict best value for <code>α</code>, since it depends on the dataset used to train the model. By evaluating the trained model you can find the best alpha value for your dataset. You can refer to statistical measures like the R<sup>2</sup> score to determine the explained variance. But there usually won’t be a single model parameter, hyperparameter, or statistical measure to rely on.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">gradient_update</span>(X, y, theta, alpha, stop_threshold):
<span style="color:#888"># initial loss</span>
loss = L2_loss(hypothesis(X, theta), y)
old_loss = loss + stop_threshold
<span style="color:#080;font-weight:bold">while</span>( <span style="color:#038">abs</span>(old_loss - loss) > stop_threshold ):
<span style="color:#888"># gradient descent update</span>
gradients = calc_gradients(theta, X, y)
theta = theta - alpha * gradients
old_loss = loss
loss = L2_loss(hypothesis(X, theta), y)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Gradient Descent training stopped at loss </span><span style="color:#33b;background-color:#fff0f0">%s</span><span style="color:#d20;background-color:#fff0f0">, with coefficients: </span><span style="color:#33b;background-color:#fff0f0">%s</span><span style="color:#d20;background-color:#fff0f0">'</span> % (loss, theta))
<span style="color:#080;font-weight:bold">return</span> theta
</code></pre></div><p>By performing batch gradient descent we actually train our algorithm and make it find the best theta values to fit the linear function. Now we can evaluate our algorithm and compare it with <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">Sci-Kit Learn Linear Regression</a>.</p>
<h4 id="evaluation">Evaluation</h4>
<p>Since linear regression is a regression model, you should train and evaluate this model on regression datasets.</p>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html">SK-Learn Diabetes dataset</a> is a good regression dataset example. Below I loaded and prepared the dataset by splitting into training and test datasets.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">sklearn</span> <span style="color:#080;font-weight:bold">import</span> datasets
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">sklearn.model_selection</span> <span style="color:#080;font-weight:bold">import</span> train_test_split
<span style="color:#888"># Load the diabetes dataset</span>
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, <span style="color:#00d;font-weight:bold">2</span>]
diabetes_y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(diabetes_X, diabetes_y, test_size=<span style="color:#00d;font-weight:bold">0.1</span>)
</code></pre></div><p>Now we can evaluate our model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">matplotlib.pyplot</span> <span style="color:#080;font-weight:bold">as</span> plt
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">sklearn.metrics</span> <span style="color:#080;font-weight:bold">import</span> mean_squared_error, r2_score
<span style="color:#888"># initial theta guess</span>
theta = [<span style="color:#00d;font-weight:bold">100</span>, <span style="color:#00d;font-weight:bold">3</span>]
stop_threshold = <span style="color:#00d;font-weight:bold">0.1</span>
<span style="color:#888"># learning rate</span>
alpha = <span style="color:#00d;font-weight:bold">0.5</span>
theta = gradient_update(X_train, y_train, theta, alpha, stop_threshold)
y_pred = hypothesis(X_test, theta)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Intercept (theta 0):"</span>, theta[<span style="color:#00d;font-weight:bold">0</span>])
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Coefficients (theta 1):"</span>, theta[<span style="color:#00d;font-weight:bold">1</span>])
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"MSE:"</span>, mean_squared_error(y_test, y_pred))
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"R2 Score"</span>, r2_score(y_test, y_pred))
<span style="color:#888"># Plot outputs using test data</span>
plt.scatter(X_test, y_test, color=<span style="color:#d20;background-color:#fff0f0">'black'</span>)
plt.plot(X_test, y_pred, color=<span style="color:#d20;background-color:#fff0f0">'blue'</span>, linewidth=<span style="color:#00d;font-weight:bold">3</span>)
plt.show()
</code></pre></div><p>When I run my linear regression model it finds the optimal theta values, finishes the training, and produces the output below, including sample evaluation scores.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Gradient Descent training stopped at loss 3753.11429796413, with coefficients: [151.6166715 850.81024746]
Intercept (theta 0): 151.61667150054697
Coefficients (theta 1): 850.8102474614635
MSE: 5320.89741757879
R2 Score 0.14348916154815183
</code></pre></div><p><img src="/blog/2022/06/understanding-linear-regression/gd-evaluate.webp" alt="Linear Regression Plot"></p>
<p>Now let’s evaluate the SK-Learn linear regression model with the same training and test datasets we used. I’m going to use default parameters without optimizing.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#888"># Sci-Kit Learn LinearRegression model evaluation</span>
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">sklearn</span> <span style="color:#080;font-weight:bold">import</span> linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Coef:"</span>, regr.coef_)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Intercept:"</span>, regr.intercept_)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"MSE:"</span>, mean_squared_error(y_test, y_pred))
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"R2 Score"</span>, r2_score(y_test, y_pred))
<span style="color:#888"># Plot outputs</span>
plt.scatter(X_test, y_test, color=<span style="color:#d20;background-color:#fff0f0">'black'</span>)
plt.plot(X_test, y_pred, color=<span style="color:#d20;background-color:#fff0f0">'blue'</span>, linewidth=<span style="color:#00d;font-weight:bold">3</span>)
plt.show()
</code></pre></div><p>The output and plot of the SK-Learn Linear Regression model is as below:</p>
<pre tabindex="0"><code>Coef: [993.14228074]
Intercept: 151.5751918329106
MSE: 5544.283378702411
R2 Score 0.10753047228113943
</code></pre><p><img src="/blog/2022/06/understanding-linear-regression/sklearn-lr-evaluate.webp" alt="SK-Learn Linear Regression Plot"></p>
<p>Notice that the intercepts of my linear regression model and SK-Learn’s linear regression model are very close, both around 151. The MSE values are also very close, and both models plotted very similar predictions.</p>
<h3 id="multivariate-linear-regression">Multivariate Linear Regression</h3>
<p>When a dataset has more features, we can extend our hypothesis accordingly, similar to the univariate hypothesis:</p>
<p>$$
h_θ(x) = θ_0 + θ_1x_1 + … + θ_nx_n
$$</p>
<p>A multivariate dataset can have multiple features and a single output like below.</p>
<table>
<thead>
<tr>
<th>Feature1</th>
<th>Feature2</th>
<th>Feature3</th>
<th>Feature4</th>
<th>Target</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>0</td>
<td>0</td>
<td>100</td>
<td>12</td>
</tr>
<tr>
<td>16</td>
<td>10</td>
<td>1000</td>
<td>121</td>
<td>18</td>
</tr>
<tr>
<td>5</td>
<td>5</td>
<td>450</td>
<td>302</td>
<td>14</td>
</tr>
</tbody>
</table>
<p>Each feature is an independent variable (x<sub>i</sub>) of a dataset. Parameters (theta) are what we aim to find during the training just like the univariate model.</p>
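The multivariate hypothesis is the same dot product with longer vectors. A minimal sketch, using hypothetical weights and the first row of the example table as input:

```python
import numpy as np

def hypothesis_multi(X, theta):
    # X: (m, n) feature matrix; prepend a column of ones for the bias term theta_0
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ theta

# hypothetical weights for the 4-feature example table above
theta = np.array([1.0, 0.5, 0.1, 0.001, 0.01])
X = np.array([[2, 0, 0, 100]])
print(hypothesis_multi(X, theta))  # [3.]
```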
<h3 id="linear-regression-with-polynomial-functions">Linear Regression with Polynomial Functions</h3>
<p>Sometimes a line function doesn’t fit the data well enough, but a polynomial function (including higher powers of the features) could fit the data better.</p>
<p>In this case the data itself is not linear, but fortunately the parameter space is linear, so we can still apply linear regression to a non-linear dataset:</p>
<p>$$
h_θ(x) = θ_0 + θ_1x + θ_2x^2 + … + θ_nx^n
$$</p>
<p>$$
h_θ =
\begin{bmatrix}
1 & x & x^2 & \dots & x^n
\end{bmatrix}
\begin{bmatrix}
θ_0 \\
θ_1 \\
θ_2 \\
\vdots \\
θ_n
\end{bmatrix}
$$</p>
<p>Here the data is non-linear but the parameters are linear and we can still apply the gradient descent algorithm.</p>
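One way to set this up is to expand x into its powers and feed the result to the same linear machinery. A short sketch of building such a polynomial feature matrix (using numpy's <code>vander</code>, my own choice rather than anything from the post):

```python
import numpy as np

def polynomial_features(x, degree):
    # columns [1, x, x^2, ..., x^degree] for a 1-D input vector x
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([1.0, 2.0, 3.0])
X_poly = polynomial_features(x, degree=2)
print(X_poly)
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```

Gradient descent then learns one θ per column, exactly as in the multivariate case.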
<h3 id="conclusion">Conclusion</h3>
<p>In this post I implemented a linear regression model from scratch, trained it, and evaluated it.</p>
<p>Linear regression is useful when the variables in your dataset are linearly related. In the real world, linear regression is very useful in <a href="https://www.pluralsight.com/courses/understanding-applying-linear-regression?aid=7010a000002BWqGAAW&exp=2">forecasting</a>.</p>
<script>
document.addEventListener("DOMContentLoaded", function() {
renderMathInElement(document.body, {
// customised options
// • auto-render specific keys, e.g.:
delimiters: [
{left: '$$', right: '$$', display: true},
{left: '$', right: '$', display: false},
{left: '\\(', right: '\\)', display: false},
{left: '\\[', right: '\\]', display: true}
],
// • rendering keys, e.g.:
throwOnError : false
});
});
</script>
Implementing SummAE neural text summarization with a denoising auto-encoder, by Kamil Ciemniewski (End Point Dev, 2020-05-28)
https://www.endpointdev.com/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/book.jpg" alt="Book open on lawn with dandelions"></p>
<p>If there’s any problem space in machine learning with no shortage of (unlabelled) data to train on, it’s natural language processing (NLP).</p>
<p>In this article, I’d like to take on the challenge of taking a paper that came from Google Research in late 2019 and implementing it. It’s going to be a fun trip into the world of neural text summarization. We’re going to go through the basics, the coding, and then we’ll look at what the results actually are in the end.</p>
<p>The paper we’re going to implement here is: <a href="https://arxiv.org/abs/1910.00998">Peter J. Liu, Yu-An Chung, Jie Ren (2019) SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders</a>.</p>
<p>Here’s the paper’s abstract:</p>
<blockquote>
<p>We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training, evaluation code, and best model weights are open-sourced.</p>
</blockquote>
<h3 id="preliminaries">Preliminaries</h3>
<p>Before we go any further, let’s talk a little bit about neural summarization in general. There are two main approaches to it:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Automatic_summarization#Extraction-based_summarization">Extractive</a></li>
<li><a href="https://en.wikipedia.org/wiki/Automatic_summarization#Abstraction-based_summarization">Abstractive</a></li>
</ul>
<p>The first approach makes the model “focus” on the most important parts of the longer text - extracting them to form a summary.</p>
<p>Let’s take a recent article, <a href="/blog/2020/05/shopify-product-creation/">“Shopify Admin API: Importing Products in Bulk”</a>, by one of my great co-workers, <a href="/team/patrick-lewis/">Patrick Lewis</a>, as an example and see what the extractive summarization would look like. Let’s take the first two paragraphs:</p>
<blockquote>
<p>I recently worked on an interesting project for a store owner who was facing a daunting task: he had an inventory of hundreds of thousands of Magic: The Gathering (MTG) cards that he wanted to sell online through his Shopify store. The logistics of tracking down artwork and current market pricing for each card made it impossible to do manually.</p>
<p>My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card in Shopify. The resulting project turned what would have been a months- or years-long task into a bulk upload that only took a few hours to complete and allowed the store owner to immediately start selling his inventory online. The online store launch turned out to be even more important than initially expected due to current closures of physical stores.</p>
</blockquote>
<p>An extractive model could summarize it as follows:</p>
<blockquote>
<p>I recently worked on an interesting project for a store owner who had an inventory of hundreds of thousands of cards that he wanted to sell through his store. The logistics and current pricing for each card made it impossible to do manually. My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card. The store launch turned out to be even more important than expected due to current closures of physical stores.</p>
</blockquote>
<p>See how it does the copying and pasting? The big advantage of these types of models is that they are generally easier to create and the resulting summaries tend to faithfully reflect the facts included in the source.</p>
<p>The downside though is that it’s not how a human would do it. We do a lot of paraphrasing, for instance. We use different words and tend to form sentences less rigidly following the original ones. The need for the summaries to feel more natural made the second type — abstractive — into this subfield’s holy grail.</p>
<h3 id="datasets">Datasets</h3>
<p>The paper’s authors used the so-called <a href="https://cs.rochester.edu/nlp/rocstories/">“ROCStories” dataset</a> (<a href="https://www.aclweb.org/anthology/P18-2119/">“Tackling The Story Ending Biases in The Story Cloze Test”. Rishi Sharma, James Allen, Omid Bakhshandeh, Nasrin Mostafazadeh. In Proceedings of the 2018 Conference of the Association for Computational Linguistics (ACL), 2018</a>).</p>
<p>In my experiments, I’ve also tried the model against one that’s quite a bit more difficult: <a href="https://github.com/mahnazkoupaee/WikiHow-Dataset">WikiHow</a> (<a href="https://arxiv.org/abs/1810.09305">Mahnaz Koupaee, William Yang Wang (2018) WikiHow: A Large Scale Text Summarization Dataset</a>).</p>
<h4 id="rocstories">ROCStories</h4>
<p>The dataset consists of 98162 stories, each one consisting of 5 sentences. It’s incredibly clean. The only step I needed to take was to split the stories between the train, eval, and test sets.</p>
<p>Examples of sentences:</p>
<p>Example 1:</p>
<blockquote>
<p>My retired coworker turned 69 in July. I went net surfing to get her a gift. She loves Diana Ross. I got two newly released cds and mailed them to her. She sent me an email thanking me.</p>
</blockquote>
<p>Example 2:</p>
<blockquote>
<p>Tom alerted the government he expected a guest. When she didn’t come he got in a lot of trouble. They talked about revoking his doctor’s license. And charging him a huge fee! Tom’s life was destroyed because of his act of kindness.</p>
</blockquote>
<p>Example 3:</p>
<blockquote>
<p>I went to see the doctor when I knew it was bad. I hadn’t eaten in nearly a week. I told him I felt afraid of food in my body. He told me I was developing an eating disorder. He instructed me to get some help.</p>
</blockquote>
<h4 id="wikihow">Wikihow</h4>
<p>This is one of the most challenging openly available datasets for neural summarization. It consists of more than 200,000 long-sequence pairs of text + headline scraped from <a href="https://www.wikihow.com/Main-Page">WikiHow’s website</a>.</p>
<p>Some examples:</p>
<p>Text:</p>
<blockquote>
<p>One easy way to conserve water is to cut down on your shower time. Practice cutting your showers down to 10 minutes, then 7, then 5. Challenge yourself to take a shorter shower every day. Washing machines take up a lot of water and electricity, so running a cycle for a couple of articles of clothing is inefficient. Hold off on laundry until you can fill the machine. Avoid letting the water run while you’re brushing your teeth or shaving. Keep your hoses and faucets turned off as much as possible. When you need them, use them sparingly.</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>Take quicker showers to conserve water. Wait for a full load of clothing before running a washing machine. Turn off the water when you’re not using it.</p>
</blockquote>
<p>The main challenge for the summarization model here is that the headline <strong>was actually created by humans</strong> and is not just “extracting” anything. Any model performing well on this dataset needs to model the language pretty well. The headline can still be used for computing evaluation metrics, but traditional metrics like <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> are bound to miss the point here.</p>
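For intuition on why such metrics fall short of paraphrasing, ROUGE-1 recall simply measures unigram overlap with the reference. A toy sketch (not the official ROUGE implementation):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    # fraction of reference unigrams that also appear in the candidate
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

score = rouge1_recall("the cat sat", "the cat sat on the mat")
print(score)  # 0.5
```

A perfectly good paraphrase using different words would score near zero, which is exactly the failure mode described above.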
<h3 id="basics-of-the-sequence-to-sequence-modeling">Basics of the sequence-to-sequence modeling</h3>
<p>Most sequence-to-sequence models are based on the “next token prediction” workflow.</p>
<p>The general idea can be expressed with P(token | context) — where the task is to model this conditional probability distribution. The “context” here depends on the approach.</p>
<p>Those models are also called “auto-regressive” because they need to consume their own predictions from previous steps during the inference:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>], context)
<span style="color:#888"># "I"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>], context)
<span style="color:#888"># "love"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>], context)
<span style="color:#888"># "biking"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>, <span style="color:#d20;background-color:#fff0f0">"biking"</span>], context)
<span style="color:#888"># "<end>"</span>
</code></pre></div><h4 id="naively-simple-modeling-markov-model">Naively simple modeling: Markov Model</h4>
<p>In this model, the approach is to take on a bold assumption: that the probability of the next token is conditioned <strong>only</strong> on the previous token.</p>
<p>The Markov Model is elegantly introduced in the blog post <a href="https://medium.com/ymedialabs-innovation/next-word-prediction-using-markov-model-570fc0475f96">Next Word Prediction using Markov Model</a>.</p>
<p>Why is it naive? Because we know that the probability of the word “love” depends on the word “I” <strong>given a broader context</strong>. A model that’s always going to output “roses” would miss the best word more often than not.</p>
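A first-order Markov next-word model can be sketched as a table of bigram counts; a toy example (the corpus and words here are invented for illustration):

```python
from collections import defaultdict, Counter

def train_markov(corpus):
    # map each word to a Counter of the words that follow it
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, word):
    # most frequent follower of `word` -- ignores any broader context
    return model[word].most_common(1)[0][0]

corpus = ["i love biking", "i love roses", "i love biking fast"]
model = train_markov(corpus)
print(predict_next(model, "love"))  # biking
```

Note that the prediction after "love" is always the same, no matter what came before it; that is the naive assumption in action.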
<h4 id="modeling-with-neural-networks">Modeling with neural networks</h4>
<p>Usually, sequence-to-sequence neural network models consist of two parts:</p>
<ul>
<li>encoder</li>
<li>decoder</li>
</ul>
<p>The encoder is there to build a “gist” representation of the input sequence. The gist and the previous token become our “context” to do the inference. This fits in well within the P(token | context) modeling I described above. That distribution can be expressed more clearly as P(token | previous; gist).</p>
<p>There are other approaches too with one of them being the <a href="https://arxiv.org/pdf/2001.04063v2.pdf">ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training - 2020 - Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming</a>. The difference in the approach here was the prediction of n-tokens ahead at once.</p>
<h3 id="teacher-forcing">Teacher-forcing</h3>
<p>Let’s see how we could go about teaching the model the next token’s conditional distribution.</p>
<p>Imagine that the model’s parameters aren’t performing well yet. We have an input sequence of: <code>["<start>", "I", "love", "biking", "during", "the", "summer", "<end>"]</code>. We’re training the model giving it the first token:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>], context)
<span style="color:#888"># "I"</span>
</code></pre></div><p>Great, now let’s ask it for another one:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>], context)
<span style="color:#888"># "wonder"</span>
</code></pre></div><p>Hmmm that’s not what we wanted, but let’s naively continue:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"wonder"</span>], context)
<span style="color:#888"># "why"</span>
</code></pre></div><p>We could continue gathering predictions and compute the loss at the end. The loss would really only be able to tell it about the first mistake (“love” vs. “wonder”); the rest of the errors would just accumulate from here. This would hinder the learning considerably, adding in the noise from the accumulated errors.</p>
<p>There’s a better approach called <a href="https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/">Teacher Forcing</a>. In this approach, you’re telling the model the true answer after each of its guesses. The last example would look like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>], context)
<span style="color:#888"># "watching"</span>
</code></pre></div><p>You’d continue the process, feeding it the full input sequence and the loss term would be computed based on all its guesses.</p>
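<p>To make the difference concrete, here’s a toy sketch of a teacher-forced training step (the model, its dimensions, and all names here are made up for illustration; this is not the paper’s code). The key point is that the prefix fed to the model is always built from the ground-truth tokens, never from its own guesses:</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy vocabulary and target sequence (illustrative only)
vocab = ["<start>", "I", "love", "biking", "during", "the", "summer", "<end>"]
token_ids = {t: i for i, t in enumerate(vocab)}
target = [token_ids[t] for t in vocab]  # the sequence to reproduce

# A stand-in "model": any module mapping a prefix of ids to next-token logits
class ToyModel(torch.nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.rnn = torch.nn.GRU(dim, dim, batch_first=True)
        self.out = torch.nn.Linear(dim, vocab_size)

    def forward(self, ids):  # ids: (1, prefix_len)
        h, _ = self.rnn(self.embed(ids))
        return self.out(h[:, -1, :])  # logits for the next token

model = ToyModel(len(vocab))

# Teacher forcing: the prefix is always the *true* tokens, never the guesses
loss = 0.0
for i in range(1, len(target)):
    prefix = torch.tensor([target[:i]])  # ground-truth prefix
    logits = model(prefix)
    loss = loss + F.cross_entropy(logits, torch.tensor([target[i]]))
loss = loss / (len(target) - 1)
loss.backward()  # one optimizer step would follow here
```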
<h3 id="compute-friendly-representation-for-tokens-and-gists">Compute-friendly representation for tokens and gists</h3>
<p>Some readers might want to skip this section. I’d like to quickly describe the concepts of the <a href="https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d">latent space</a> and <a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa">vector embeddings</a>, to keep matters palatable for a broader audience.</p>
<h4 id="representing-words-naively">Representing words naively</h4>
<p>How do we turn words (strings) into the numbers that we feed into our machine learning models? A software developer might think of assigning each word a unique integer. That works well for databases, but in a machine learning model the fact that integers follow one another means they encode a relation (which one follows which, and at what distance). This doesn’t work well for almost any problem in data science.</p>
<p>Traditionally, the problem is solved by “<a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/">one-hot encoding</a>”. This means turning each integer into a vector in which every entry is zero except for a single 1 at the index equal to the value being encoded (assuming zero-based indexing). Example: <code>3 => [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]</code> when the total number of “integers” (classes) to encode is 10.</p>
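<p>A minimal sketch of that encoding (with zero-based indexing):</p>

```python
def one_hot(index, num_classes):
    """Return a list that's all zeros except for a 1 at `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

one_hot(3, 10)
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```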
<p>This is better, as it breaks the false ordering and distance assumptions. It doesn’t encode anything about the words themselves, though, except the arbitrary number we’ve decided to assign to them. We no longer have the ordering, but we also don’t have any notion of distance. Empirically, though, we just know that the word “love” is much closer to “enjoy” than it is to “helicopter”.</p>
<h4 id="a-better-approach-word-embeddings">A better approach: word embeddings</h4>
<p>How could we keep our vector representation (as in one-hot encoding) but also introduce a notion of distance? I’ve already touched on this concept in my <a href="/blog/2018/07/recommender-mxnet/">post about the simple recommender system</a>. The idea is to have a vector of floating-point values such that the closer two words are in meaning, the smaller the angle between their vectors. We can easily compute a metric following this logic by measuring the <a href="http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/">cosine distance</a>. This way, the word representations are easy to feed into the encoder, and they already carry a lot of information in themselves.</p>
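<p>The intuition can be sketched with made-up three-dimensional “embeddings” (real ones have hundreds of dimensions and are learned, not hand-picked; the cosine <em>distance</em> is simply one minus the similarity computed below):</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, chosen by hand purely for illustration:
love = [0.9, 0.1, 0.0]
enjoy = [0.85, 0.2, 0.05]
helicopter = [0.0, 0.1, 0.95]

# "love" points in nearly the same direction as "enjoy",
# and nearly orthogonally to "helicopter":
cosine_similarity(love, enjoy) > cosine_similarity(love, helicopter)
# True
```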
<h4 id="not-only-words">Not only words</h4>
<p>Can we only have vectors for words? Couldn’t we have vectors for paragraphs, so that the closer they are in meaning, the smaller some vector-space metric between them? Of course we can. This is, in fact, what will allow this article’s model to encode the “gist” that we talked about. The “encoder” part of the model is going to learn the most convenient way of turning the input sequence into a vector of floating-point numbers.</p>
<h3 id="auto-encoders">Auto-encoders</h3>
<p>We’re slowly approaching the model from the paper. We still have one concept that’s vital to understand in order to get why the model is going to work.</p>
<p>Up until now, we talked about the following structure of the typical sequence-to-sequence neural network model:</p>
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/seq-to-seq.png" alt="Sequence To Sequence Neural Nets"></p>
<p>This is true e.g. for translation models where the input sequence is in English and the output is in Greek. It’s also true for this article’s model <strong>during the inference</strong>.</p>
<p>What if we made the input and the output the same sequence? We’d turn the model into a so-called <a href="https://en.wikipedia.org/wiki/Autoencoder">auto-encoder</a>.</p>
<p>The output of course isn’t all that useful — we already know what the input sequence is. The true value is in the model’s ability to encode the input into a <strong>gist</strong>.</p>
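<p>Schematically, an auto-encoder is just an encoder squeezing the input into a small “gist” vector and a decoder reconstructing the input from it. A minimal sketch in PyTorch (illustrative only; nothing here comes from the paper’s architecture):</p>

```python
import torch
import torch.nn as nn

class TinyAutoEncoder(nn.Module):
    """Illustrative only: compresses a 32-dim input into an 8-dim gist."""
    def __init__(self, input_dim=32, gist_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, gist_dim), nn.Tanh())
        self.decoder = nn.Linear(gist_dim, input_dim)

    def forward(self, x):
        gist = self.encoder(x)           # the compressed representation
        return self.decoder(gist), gist  # reconstruction + gist

model = TinyAutoEncoder()
x = torch.randn(4, 32)
reconstruction, gist = model(x)

# The loss compares the output against the *input* itself:
loss = nn.functional.mse_loss(reconstruction, x)
```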
<h4 id="adding-the-noise">Adding the noise</h4>
<p>A very interesting type of auto-encoder is the <a href="https://towardsdatascience.com/denoising-autoencoders-explained-dbb82467fc2">denoising auto-encoder</a>. The idea is that the input sequence gets randomly corrupted, and the network learns to still produce a good gist and reconstruct the sequence as it was before the corruption. This makes the training “teach” the network about the deeper connections in the data, instead of just “memorizing” as much as it can.</p>
<h3 id="the-summae-model">The SummAE model</h3>
<p>We’re now ready to talk about the architecture from the paper. Given what we’ve already learned, this is going to be very simple. The SummAE model is just a denoising auto-encoder that is trained in a special way.</p>
<h4 id="auto-encoding-paragraphs-and-sentences">Auto-encoding paragraphs and sentences</h4>
<p>The authors were training the model on both single sentences and full paragraphs. In all cases the task was to reproduce the uncorrupted input.</p>
<p>The first part of the approach is about having two special “start tokens” to signal the mode: paragraph vs. sentence. In my code, I’ve used “<start-full>” and “<start-short>”.</p>
<p>During training, the model learns, for each position in the sequence, the conditional distribution of the next token given one of those two start tokens and the tokens so far.</p>
<h4 id="adding-the-noise-1">Adding the noise</h4>
<p>The sentences are simply concatenated to form a paragraph. The input then gets corrupted at random by means of:</p>
<ul>
<li>masking the input tokens</li>
<li>shuffling the order of the sentences within the paragraph</li>
</ul>
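<p>The two corruptions could be sketched like this (a simplified take; the exact masking scheme used in the paper and in my repository may differ in details):</p>

```python
import random

MASK = "<mask>"

def corrupt(sentences, mask_prob=0.15, seed=None):
    """Shuffle the sentence order, then randomly mask tokens."""
    rng = random.Random(seed)
    sentences = sentences[:]
    rng.shuffle(sentences)                      # corrupt the sentence order
    tokens = [t for s in sentences for t in s]  # concatenate into a paragraph
    return [MASK if rng.random() < mask_prob else t for t in tokens]

paragraph = [["i", "lost", "my", "job", "."],
             ["i", "searched", "online", "."]]
corrupted = corrupt(paragraph, mask_prob=0.3, seed=0)
```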
<p>The authors claim that the latter helped them solve the issue of the network just memorizing the first sentence. What I found, though, is that this model is generally prone to memorizing concrete sentences from the paragraph. Sometimes it’s the first one, and sometimes one of the others. I found this to be true even when adding a lot of noise to the input.</p>
<h4 id="the-code">The code</h4>
<p>The full PyTorch implementation described in this blog post is available at <a href="https://github.com/kamilc/neural-text-summarization">https://github.com/kamilc/neural-text-summarization</a>. You may find some of its parts less clean than others — it’s a work in progress. In particular, the data-downloading part is mostly left out.</p>
<p>You can find the WikiData preprocessing in a notebook in the repository. For the ROCStories dataset, I just downloaded the CSV files and concatenated them with the Unix <code>cat</code> command. There’s an additional <code>process.py</code> file generated from a very simple <code>IPython</code> session.</p>
<p>Let’s have a very brief look at some of the most interesting parts of the code:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">SummarizeNet</span>(NNModel):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">encode</span>(self, embeddings, lengths):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode</span>(self, embeddings, encoded, lengths, modes):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, embeddings, clean_embeddings, lengths, modes):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">predict</span>(self, vocabulary, embeddings, lengths):
<span style="color:#888"># ...</span>
</code></pre></div><p>You can notice separate methods for <code>forward</code> and <code>predict</code>. I chose the <a href="https://jalammar.github.io/illustrated-transformer/">Transformer</a> over recurrent neural networks for both the encoder and the decoder. The <a href="https://pytorch.org/docs/master/generated/torch.nn.TransformerDecoder.html">PyTorch implementation of the transformer decoder</a> already includes teacher forcing in its <code>forward</code> method. This makes it convenient at training time: we can just feed it the full, uncorrupted sequence of embeddings as the “target”. During inference, though, we need to do the “auto-regressive” part by hand, feeding the previous predictions back in a loop. Hence the need for two distinct methods here.</p>
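<p>That convenience boils down to passing the whole ground-truth sequence as <code>tgt</code> together with a causal mask, so that each position can only attend to the ones before it. A minimal sketch with made-up dimensions:</p>

```python
import torch
import torch.nn as nn

d_model, seq_len, batch = 16, 7, 2
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4)
decoder = nn.TransformerDecoder(layer, num_layers=2)

# (seq_len, batch, d_model): PyTorch transformers default to sequence-first
tgt = torch.randn(seq_len, batch, d_model)  # full ground-truth embeddings
memory = torch.randn(1, batch, d_model)     # the encoder's output (the "gist")

# Causal mask: -inf above the diagonal means position i can't see positions > i
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

# One call processes all positions at once, teacher-forced by the mask:
out = decoder(tgt, memory, tgt_mask=causal_mask)
```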
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, embeddings, clean_embeddings, lengths, modes):
noisy_embeddings = self.mask_dropout(embeddings, lengths)
encoded = self.encode(noisy_embeddings[:, <span style="color:#00d;font-weight:bold">1</span>:, :], lengths-<span style="color:#00d;font-weight:bold">1</span>)
decoded = self.decode(clean_embeddings, encoded, lengths, modes)
<span style="color:#080;font-weight:bold">return</span> (
decoded,
encoded
)
</code></pre></div><p>You can notice that I’m doing the token masking at the model level during training. The code also clearly shows the structure of this seq2seq model, with its encoder and decoder.</p>
<p>The encoder part looks simple as long as you’re familiar with the transformers:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">encode</span>(self, embeddings, lengths):
batch_size, seq_len, _ = embeddings.shape
embeddings = self.encode_positions(embeddings)
paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand((batch_size, seq_len)).to(self.device)
paddings_mask = (paddings_mask + <span style="color:#00d;font-weight:bold">1</span>) > lengths.unsqueeze(dim=<span style="color:#00d;font-weight:bold">1</span>).expand((batch_size, seq_len))
encoded = embeddings.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">for</span> ix, encoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.encoders):
encoded = encoder(encoded, src_key_padding_mask=paddings_mask)
encoded = self.encode_batch_norms[ix](encoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
last_encoded = encoded
encoded = self.pool_encoded(encoded, lengths)
encoded = self.to_hidden(encoded)
<span style="color:#080;font-weight:bold">return</span> encoded
</code></pre></div><p>We’re first encoding the positions as in the “Attention Is All You Need” paper and then feeding the embeddings into a stack of encoder layers. At the end, we’re reshaping the tensor so that its final dimension equals the size given as the model’s parameter.</p>
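<p>“Encoding the positions” means adding the fixed sinusoidal signal from the “Attention Is All You Need” paper to the embeddings. A sketch of that signal (the repository’s <code>encode_positions</code> may differ in details):</p>

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal positions: even dims get sin, odd dims get cos."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

embeddings = torch.randn(10, 64)  # (seq_len, d_model)
with_positions = embeddings + positional_encoding(10, 64)
```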
<p>The <code>decode</code> sits on PyTorch’s shoulders too:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode</span>(self, embeddings, encoded, lengths, modes):
batch_size, seq_len, _ = embeddings.shape
embeddings = self.encode_positions(embeddings)
mask = self.mask_for(embeddings)
encoded = self.from_hidden(encoded)
encoded = encoded.unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand(seq_len, batch_size, -<span style="color:#00d;font-weight:bold">1</span>)
decoded = embeddings.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
decoded = torch.cat(
[
encoded,
decoded
],
axis=<span style="color:#00d;font-weight:bold">2</span>
)
decoded = self.combine_decoded(decoded)
decoded = self.combine_batch_norm(decoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand((batch_size, seq_len)).to(self.device)
paddings_mask = paddings_mask > lengths.unsqueeze(dim=<span style="color:#00d;font-weight:bold">1</span>).expand((batch_size, seq_len))
<span style="color:#080;font-weight:bold">for</span> ix, decoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.decoders):
decoded = decoder(
decoded,
torch.ones_like(decoded),
tgt_mask=mask,
tgt_key_padding_mask=paddings_mask
)
decoded = self.decode_batch_norms[ix](decoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
decoded = decoded.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">return</span> self.linear_logits(decoded)
</code></pre></div><p>You can notice that I’m combining the gist received from the encoder with each word embedding — as this is how it was described in the paper.</p>
<p>The <code>predict</code> is very similar to <code>forward</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">predict</span>(self, vocabulary, embeddings, lengths):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Caller should include the start and end tokens here
</span><span style="color:#d20;background-color:#fff0f0"> but we’re going to ensure the start one is replaced by <start-short>
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
previous_mode = self.training
self.eval()
batch_size, _, _ = embeddings.shape
results = []
<span style="color:#080;font-weight:bold">for</span> row <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">0</span>, batch_size):
row_embeddings = embeddings[row, :, :].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
row_embeddings[<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>] = vocabulary.token_vector(<span style="color:#d20;background-color:#fff0f0">"<start-short>"</span>)
encoded = self.encode(
row_embeddings[:, <span style="color:#00d;font-weight:bold">1</span>:, :],
lengths[row].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
)
results.append(
self.decode_prediction(
vocabulary,
encoded,
lengths[row].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
)
)
self.training = previous_mode
<span style="color:#080;font-weight:bold">return</span> results
</code></pre></div><p>The workhorse behind the decoding at the inference time looks as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode_prediction</span>(self, vocabulary, encoded1xH, lengths1x):
tokens = [<span style="color:#d20;background-color:#fff0f0">'<start-short>'</span>]
last_token = <span style="color:#080;font-weight:bold">None</span>
seq_len = <span style="color:#00d;font-weight:bold">1</span>
encoded1xH = self.from_hidden(encoded1xH)
<span style="color:#080;font-weight:bold">while</span> last_token != <span style="color:#d20;background-color:#fff0f0">'<end>'</span> <span style="color:#080">and</span> seq_len < <span style="color:#00d;font-weight:bold">50</span>:
embeddings1xSxD = vocabulary.embed(tokens).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).to(self.device)
embeddings1xSxD = self.encode_positions(embeddings1xSxD)
maskSxS = self.mask_for(embeddings1xSxD)
encodedSx1xH = encoded1xH.unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand(seq_len, <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>)
decodedSx1xD = embeddings1xSxD.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
decodedSx1xD = torch.cat(
[
encodedSx1xH,
decodedSx1xD
],
axis=<span style="color:#00d;font-weight:bold">2</span>
)
decodedSx1xD = self.combine_decoded(decodedSx1xD)
decodedSx1xD = self.combine_batch_norm(decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#080;font-weight:bold">for</span> ix, decoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.decoders):
decodedSx1xD = decoder(
decodedSx1xD,
torch.ones_like(decodedSx1xD),
tgt_mask=maskSxS,
)
decodedSx1xD = self.decode_batch_norms[ix](decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>))
decodedSx1xD = decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
decoded1x1xD = decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)[:, (seq_len-<span style="color:#00d;font-weight:bold">1</span>):seq_len, :]
decoded1x1xV = self.linear_logits(decoded1x1xD)
word_id = F.softmax(decoded1x1xV[<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>, :]).argmax().cpu().item()
last_token = vocabulary.words[word_id]
tokens.append(last_token)
seq_len += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">' '</span>.join(tokens[<span style="color:#00d;font-weight:bold">1</span>:])
</code></pre></div><p>You can notice that we start with the “start short” token and go in a loop, getting predictions and feeding them back in until the “end” token.</p>
<p>Again, the model is very, very simple. What makes the difference is how it’s being trained — it’s all in the training data corruption and the model pre-training.</p>
<p>It’s already a long article so I encourage the curious readers to look at the code at <a href="https://github.com/kamilc/neural-text-summarization">my GitHub repo</a> for more details.</p>
<h4 id="my-experiment-with-the-wikihow-dataset">My experiment with the WikiHow dataset</h4>
<p>In my WikiHow experiment I wanted to see how the results would look if I fed the network the full articles and their headlines as its two modes. The same data-corruption regime was used in this case.</p>
<p>Some of the results were looking <strong>almost</strong> good:</p>
<p>Text:</p>
<blockquote>
<p>for a savory flavor, mix in 1/2 teaspoon ground cumin, ground turmeric, or masala powder.this works best when added to the traditional salty lassi. for a flavorful addition to the traditional sweet lassi, add 1/2 teaspoon of ground cardamom powder or ginger, for some kick. , start with a traditional sweet lassi and blend in some of your favorite fruits. consider mixing in strawberries, papaya, bananas, or coconut.try chopping and freezing the fruit before blending it into the lassi. this will make your drink colder and frothier. , while most lassi drinks are yogurt based, you can swap out the yogurt and water or milk for coconut milk. this will give a slightly tropical flavor to the drink. or you could flavor the lassi with rose water syrup, vanilla extract, or honey.don’t choose too many flavors or they could make the drink too sweet. if you stick to one or two flavors, they’ll be more pronounced. , top your lassi with any of the following for extra flavor and a more polished look: chopped pistachios sprigs of mint sprinkle of turmeric or cumin chopped almonds fruit sliver</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>add a spice., blend in a fruit., flavor with a syrup or milk., garnish.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>blend vanilla in a sweeter flavor . , add a sugary fruit . , do a spicy twist . eat with dessert . , revise . <end></p>
</blockquote>
<p>It’s not 100% faithful to the original text even though it seems to “read” well.</p>
<p>My suspicion is that pre-training against a much larger corpus of text might help. There’s an obvious issue here: the network lacks the very specific knowledge needed to summarize better. Here’s another one of those examples:</p>
<p>Text:</p>
<blockquote>
<p>the settings app looks like a gray gear icon on your iphone’s home screen.; , this option is listed next to a blue “a” icon below general. , this option will be at the bottom of the display & brightness menu. , the right-hand side of the slider will give you bigger font size in all menus and apps that support dynamic type, including the mail app. you can preview the corresponding text size by looking at the menu texts located above and below the text size slider. , the left-hand side of the slider will make all dynamic type text smaller, including all menus and mailboxes in the mail app. , tap the back button twice in the upper-left corner of your screen. it will save your text size settings and take you back to your settings menu. , this option is listed next to a gray gear icon above display & brightness. , it’s halfway through the general menu. ,, the switch will turn green. the text size slider below the switch will allow for even bigger fonts. , the text size in all menus and apps that support dynamic type will increase as you go towards the right-hand side of the slider. this is the largest text size you can get on an iphone. , it will save your settings.</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>open your iphone’s settings., scroll down and tap display & brightness., tap text size., tap and drag the slider to the right for bigger text., tap and drag the slider to the left for smaller text., go back to the settings menu., tap general., tap accessibility., tap larger text. , slide the larger accessibility sizes switch to on position., tap and drag the slider to the right., tap the back button in the upper-left corner.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>open your iphone ’s settings . , tap general . , scroll down and tap accessibility . , tap larger accessibility . , tap and larger text for the iphone to highlight the text you want to close . , tap the larger text - colored contacts app .</p>
</blockquote>
<p>It might be interesting to train against this dataset again while:</p>
<ul>
<li>utilizing some pre-trained, large scale model as part of the encoder</li>
<li>using a large corpus of text to still pre-train the auto-encoder</li>
</ul>
<p>This could possibly take a lot of time to train on my GPU (even with the pre-trained part of the encoder). I didn’t follow the idea further at this time.</p>
<h4 id="the-problem-with-getting-paragraphs-when-we-want-the-sentences">The problem with getting paragraphs when we want the sentences</h4>
<p>One of the biggest problems the authors ran into was the decoder outputting the paragraph-long version of the text even when asked for the sentence-long summary.</p>
<p>The authors called this phenomenon the “segregation issue”. What they found was that the encoder was mapping paragraphs and sentences into completely separate regions of the latent space. The solution was to trick the encoder into making both representations indistinguishable. The following figure comes from the paper and visualizes the issue:</p>
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/segregation.jpg" alt="Segregation problem"></p>
<h4 id="better-gists-by-using-the-critic">Better gists by using the “critic”</h4>
<p>The idea of a “critic” has been popularized along with the fantastic results produced by some of the <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">Generative Adversarial Networks</a>. The general workflow is to have the main network generate output while another network tries to guess some of its properties.</p>
<p>For GANs that are generating realistic photos, the critic is there to guess if the photo was generated or if it’s real. A loss term is added based on how well it’s doing, penalizing the main network for generating photos that the critic is able to call out as fake.</p>
<p>A similar idea was used in the A3C algorithm I blogged about (<a href="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/">Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm</a>). The “critic” part penalized the AI agent for taking steps that were on average less advantageous.</p>
<p>Here, in the SummAE model, the critic adds a penalty to the loss proportional to how well it can guess whether the gist comes from a paragraph or a sentence.</p>
<p>Training with the critic can get tricky. What I’ve found to be the cleanest way is to use two different optimizers: one updating the main network’s parameters and the other updating the critic’s:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">for</span> batch <span style="color:#080">in</span> batches:
<span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
self.model.train()
self.discriminator.train()
<span style="color:#080;font-weight:bold">else</span>:
self.model.eval()
self.discriminator.eval()
self.optimizer.zero_grad()
self.discriminator_optimizer.zero_grad()
logits, state = self.model(
batch.word_embeddings.to(self.device),
batch.clean_word_embeddings.to(self.device),
batch.lengths.to(self.device),
batch.mode.to(self.device)
)
mode_probs_disc = self.discriminator(state.detach())
mode_probs = self.discriminator(state)
discriminator_loss = F.binary_cross_entropy(
mode_probs_disc,
batch.mode
)
discriminator_loss.backward(retain_graph=<span style="color:#080;font-weight:bold">True</span>)
<span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
self.discriminator_optimizer.step()
text = batch.text.copy()
<span style="color:#080;font-weight:bold">if</span> self.no_period_trick:
text = [txt.replace(<span style="color:#d20;background-color:#fff0f0">'.'</span>, <span style="color:#d20;background-color:#fff0f0">''</span>) <span style="color:#080;font-weight:bold">for</span> txt <span style="color:#080">in</span> text]
classes = self.vocabulary.encode(text, modes=batch.mode)
classes = classes.roll(-<span style="color:#00d;font-weight:bold">1</span>, dims=<span style="color:#00d;font-weight:bold">1</span>)
classes[:,classes.shape[<span style="color:#00d;font-weight:bold">1</span>]-<span style="color:#00d;font-weight:bold">1</span>] = <span style="color:#00d;font-weight:bold">3</span>
model_loss = torch.tensor(<span style="color:#00d;font-weight:bold">0</span>).cuda()
<span style="color:#080;font-weight:bold">if</span> logits.shape[<span style="color:#00d;font-weight:bold">0</span>:<span style="color:#00d;font-weight:bold">2</span>] == classes.shape:
model_loss = F.cross_entropy(
logits.reshape(-<span style="color:#00d;font-weight:bold">1</span>, logits.shape[<span style="color:#00d;font-weight:bold">2</span>]).to(self.device),
classes.long().reshape(-<span style="color:#00d;font-weight:bold">1</span>).to(self.device),
ignore_index=<span style="color:#00d;font-weight:bold">3</span>
)
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"WARNING: Skipping model loss for inconsistency between logits and classes shapes"</span>)
fooling_loss = F.binary_cross_entropy(
mode_probs,
torch.ones_like(batch.mode).to(self.device)
)
loss = model_loss + (<span style="color:#00d;font-weight:bold">0.1</span> * fooling_loss)
loss.backward()
<span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
self.optimizer.step()
self.optimizer.zero_grad()
self.discriminator_optimizer.zero_grad()
</code></pre></div><p>The main idea is to treat the main network’s encoded gist as constant with respect to the updates to the critic’s parameters, and vice versa.</p>
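<p>The role of <code>detach()</code> here can be seen in isolation with a toy example (not the post’s code): gradients flow back to the parameters that produced a tensor only when the non-detached version is used.</p>

```python
import torch

w = torch.tensor([2.0], requires_grad=True)  # stands in for encoder weights
c = torch.tensor([5.0], requires_grad=True)  # stands in for critic weights
gist = w * 3.0                               # the encoded "gist" (value 6.0)

# Critic update: uses the detached gist, so no gradient reaches `w`
critic_loss = (gist.detach() * c).sum()
critic_loss.backward()
assert w.grad is None          # the encoder is a constant for the critic
assert c.grad.item() == 6.0    # d(6*c)/dc

# Main update: uses the live gist, so the gradient does reach `w`
main_loss = (gist * c).sum()
main_loss.backward()
assert w.grad.item() == 15.0   # d(3*w*5)/dw
```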
<h3 id="results">Results</h3>
<p>I’ve found some of the results look really exceptional:</p>
<p>Text:</p>
<blockquote>
<p>lynn is unhappy in her marriage. her husband is never good to her and shows her no attention. one evening lynn tells her husband she is going out with her friends. she really goes out with a man from work and has a great time. lynn continues dating him and starts having an affair.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>lynn starts dating him and has an affair . <end></p>
</blockquote>
<p>Text:</p>
<blockquote>
<p>cedric was hoping to get a big bonus at work. he had worked hard at the office all year. cedric’s boss called him into his office. cedric was disappointed when told there would be no bonus. cedric’s boss surprised cedric with a big raise instead of a bonus.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>cedric had a big deal at his boss ’s office . <end></p>
</blockquote>
<p>Some others showed how the model attends to single sentences though:</p>
<p>Text:</p>
<blockquote>
<p>i lost my job. i was having trouble affording my necessities. i didn’t have enough money to pay rent. i searched online for money making opportunities. i discovered amazon mechanical turk.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>i did n’t have enough money to pay rent . <end></p>
</blockquote>
<p>While a sentence like this one might make a good headline, it’s definitely not the best summary, as it naturally loses the vital parts found in the other sentences.</p>
<h3 id="final-words">Final words</h3>
<p>First of all, let me thank the paper’s authors for their exceptional work. It was a great read and great fun implementing!</p>
<p>Abstractive text summarization remains very difficult. The model trained for this blog post has very limited use in practice. There’s a lot of room for improvement though, which makes the future of abstractive summaries very promising.</p>
OpenITI Starts Arabic-script OCR Catalyst Projecthttps://www.endpointdev.com/blog/2019/09/openiti-arabic-ocr-catalyst-project/2019-09-10T00:00:00+00:00Elizabeth Garrett Christensen
<p><img src="/blog/2019/09/openiti-arabic-ocr-catalyst-project/banner.jpg" alt="Decorative Arabic calligraphy" /> <a href="https://www.flickr.com/photos/firaskaheel/16680667070">Photo</a> by Free Quran Pictures 4K, cropped, <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></p>
<p>Congratulations to the <a href="https://iti-corpus.github.io/">Open Islamicate Texts Initiative</a> (OpenITI) on their new project, the Arabic-script OCR Catalyst Project (AOCP)! This project received funding from The Andrew W. Mellon Foundation this summer.</p>
<p>End Point developer Kamil Ciemniewski will be serving the project as a Technology Integration Specialist. Kamil has been involved with OpenITI since 2018 and with the affiliated project, <a href="https://openiti.org/projects/corpusbuilder">Corpus Builder</a>, since 2017.</p>
<p>Version 1.0 of the Corpus Builder project made collaborative production of ground-truth datasets for OCR model training possible. The application acts as both a versioned database of text transcriptions and a full OCR pipeline itself. The versioned character of the database closely follows the model used by Git.</p>
<p>What is remarkable about it is that it enables working on revisions of documents that, unlike text in Git’s case, aren’t linear in character. For the OCR problem, one needs not only the textual data but also the spatial: where exactly the text is to be found.</p>
<p>A sophisticated mechanism of applying updates to those documents minimizes (with mathematical guarantees) the chance of introducing merge conflicts.</p>
<p>The project also provides a great-looking UI that allows non-technical editors to work within the workflow of this versioned data.</p>
<p>CorpusBuilder works with both <a href="https://github.com/tesseract-ocr/tesseract">Tesseract</a> and <a href="http://kraken.re/">Kraken</a> as its OCR backends and is capable of exporting datasets in their respective formats for further model training / retraining. Training of Tesseract models was covered last year in a <a href="/blog/2018/07/training-tesseract-models-from-scratch/">blog post</a> by Kamil.</p>
<p>AOCP will rapidly expand on prior work, helping establish a digital pipeline for digitizing texts and create a set of tools for students and scholars of historic texts.</p>
<p>End Point is really excited to be a part of such a cool integration of technology and the humanities!</p>
<p>Read more at:</p>
<ul>
<li><a href="https://www.openiti.org/projects/openitiaocp">https://www.openiti.org/projects/openitiaocp</a></li>
<li><a href="https://medium.com/@openiti/openiti-aocp-9802865a6586">https://medium.com/@openiti/openiti-aocp-9802865a6586</a></li>
<li><a href="http://kitab-project.org/corpus/">http://kitab-project.org/corpus/</a></li>
</ul>
An Introduction to Neural Networkshttps://www.endpointdev.com/blog/2019/07/an-introduction-to-neural-networks/2019-07-01T00:00:00+00:00Ben Ironside Goldstein
<p><img src="/blog/2019/07/an-introduction-to-neural-networks/image-0.jpg" alt="Weird Tree Art (Neural Network)" /> <a href="https://flic.kr/p/5eL8Ag">Photo</a> by <a href="https://www.flickr.com/photos/sudhamshu/">Sudhamshu Hebbar</a>, used under <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></p>
<p>Earlier this year I wrote a <a href="/blog/2019/05/facial-recognition-amazon-deeplens/">post</a> about my work with a machine-learning camera, the <a href="https://aws.amazon.com/deeplens/">AWS DeepLens</a>, which has onboard processing power to enable AI capabilities without sending data to the cloud. Neural networks are a type of ML model which achieves very impressive results on certain problems (including computer vision), so in this post I give a more thorough introduction to neural networks, and share some useful resources for those who want to dig deeper.</p>
<h3 id="neurons-and-nodes">Neurons and Nodes</h3>
<p>Neural networks are models inspired by the function of biological neural networks. They consist of nodes (arranged in layers) and the connections between those nodes. Each connection between two nodes enables one-way information transfer: a node either receives input from, or sends output to, each node to which it is connected. A node typically has an “activation function”, parameterized by the node’s inputs, and the node’s output is the result of this function.</p>
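<p>As a toy illustration (a hypothetical sketch, not any particular library’s API), a single node can be modeled as an activation function applied to the weighted sum of its inputs:</p>

```python
import math

# A hypothetical sketch of a single artificial node: its output is an
# activation function (here, a sigmoid) applied to the weighted sum of
# its inputs plus a bias.
def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid squashes z into (0, 1)

out = neuron([0.5, -1.0], weights=[0.8, 0.2], bias=0.1)  # z = 0.3
```

The weights and bias are the parameters that training will later adjust.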
<p>As with the function of biological neural networks, the emergence of information processing from these mathematical operations is opaque. Nevertheless, complex artificial neural networks are capable of feats such as vision, language translation, and winning competitive games. As the technology improves, even more impressive tasks will become possible. As with organic brains, neural networks can achieve complex tasks only as a result of appropriate architecture, constraints, and training—for machine learning, humans must (for now) design it all.</p>
<h3 id="neural-network-architecture">Neural Network Architecture</h3>
<p><img src="/blog/2019/07/an-introduction-to-neural-networks/image-1.png" style="float: right; max-width: 200px" /></p>
<p>Nodes are grouped in layers: the input layer, the output layer, and all the layers between them, known as hidden layers. Nodes can be networked in a variety of ways within and between layers, and sophisticated neural network models can include dozens of layers configured in various ways. These include layers which summarize, combine, eliminate, direct, or transform information. Each receives its input from the previous layer, and passes its output to the next layer. The last layer is designed such that its output answers the relevant question (for example, it would offer 9 options if the goal were to identify the hand-written numbers 1–9).</p>
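<p>A densely connected forward pass can be sketched like this (a hypothetical toy example; the layer sizes and the 9 output options are purely illustrative):</p>

```python
import numpy as np

# Hypothetical sketch of a small densely connected network. Each layer
# feeds the next; the last layer has 9 nodes, one per answer option,
# and softmax turns its outputs into probabilities.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(8, 9)), np.zeros(9)   # hidden -> 9 options

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)        # ReLU activation
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()

probs = forward(rng.normal(size=4))              # 9 probabilities summing to 1
```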
<p>For all this information processing to achieve a given task, the parameters of each node need appropriate values. The process of choosing those values is called training. In order to train a neural network, one needs to provide examples of what the network should do. (For example, to train it to write requires examples of writing. To train it to identify objects in images requires images and their appropriately labeled counterparts.) The more data a model can learn from, the better it can work. Gathering enough data is typically a major undertaking.</p>
<h3 id="training-a-neural-network">Training a Neural Network</h3>
<p>Before training, models have random parameters for all nodes. Each time data is passed through the model, the effectiveness of the model is measured using a “loss function”. Loss functions measure how wrong a model’s output is. Different loss functions (also known as cost functions or error functions) measure this in different ways, but in general, the more wrong a model is, the higher its loss/error/cost. Loss functions thus summarize the quality of a model’s output with a single number. Models are optimized to minimize the loss. (For more on the role of loss functions in neural networks, I suggest <a href="https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/">this excellent article</a>.)</p>
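<p>For instance, mean squared error is one common loss function; this minimal sketch shows how more wrong predictions produce a larger single number:</p>

```python
# Mean squared error, one common loss function: it condenses how wrong a
# model's predictions are into a single number, and the more wrong the
# predictions, the higher that number.
def mse_loss(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

nearly_right = mse_loss([1.0, 0.1], [1.0, 0.0])  # small loss
very_wrong = mse_loss([0.0, 1.0], [1.0, 0.0])    # large loss
```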
<p>One of the most interesting details of the entire process has to do with how the parameters are tuned. Model optimization relies on variations of a process called gradient descent, in which parameter values are adjusted by small intervals in an attempt to minimize the loss. Over many thousands of repetitions, the training program uses calculus to pick values that help to minimize the loss. As you can imagine, this process becomes extremely computationally intensive when the neural network is large and complex. However, in order to solve hard problems, networks must be large and complex. This is why training neural networks requires substantial computing power, and often takes place in the cloud. (For more on stochastic gradient descent, I suggest <a href="https://www.youtube.com/watch?v=vMh0zPT0tLI">this video</a> as a great starting point, or <a href="http://ruder.io/optimizing-gradient-descent/">this review</a> for a more advanced overview.)</p>
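<p>A minimal sketch of plain gradient descent on a single parameter (a toy loss, not a real network) shows the idea of adjusting values by small intervals to minimize the loss:</p>

```python
# Toy gradient descent: the "model" is one parameter theta with loss
# (theta - 3)^2, minimized at theta = 3. Each step nudges theta a small
# interval against the gradient, lowering the loss.
theta = 0.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (theta - 3)     # d/dtheta of (theta - 3)^2
    theta -= learning_rate * gradient
```

Real networks repeat this over millions of parameters, which is where the computational cost comes from.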
<h3 id="further-reading">Further reading</h3>
<ul>
<li>It turns out that a relatively simple neural network can approximate any function. This remarkable <a href="https://towardsdatascience.com/can-neural-networks-really-learn-any-function-65e106617fc6">demonstration</a> is quite accessible.</li>
<li>There are countless useful implementations of neural network models. End Pointer <a href="/blog/authors/kamil-ciemniewski/">Kamil Ciemniewski</a> wrote two in-depth and fascinating blogs about neural network projects which he completed in the past year: <a href="/blog/2019/01/speech-recognition-with-tensorflow/">Speech Recognition From Scratch</a>, and <a href="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/">Self-Driving Toy Car</a>.</li>
<li>If you’re interested in getting a sense for the general state of the art, <a href="https://www.topbots.com/most-important-ai-research-papers-2018/">here</a> are summaries of some of the most influential papers in machine learning since 2018.</li>
<li>For those curious about the inner workings of the training process, here’s one about <a href="http://neuralnetworksanddeeplearning.com/chap2.html">back-propagation</a>.</li>
<li>This blog post describes “densely connected” network layers; here’s an article about <a href="https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53">convolutional layers</a>.</li>
<li>And finally, this article describes <a href="https://medium.com/explore-artificial-intelligence/an-introduction-to-recurrent-neural-networks-72c97bf0912">recurrent neural networks</a>.</li>
</ul>
Deploying production Machine Learning pipelines to Kubernetes with Argohttps://www.endpointdev.com/blog/2019/06/deploying-production-pipelines-to-kubernetes/2019-06-28T00:00:00+00:00Kamil Ciemniewski
<p><img src="/blog/2019/06/deploying-production-pipelines-to-kubernetes/image-0.jpg" alt="Rube Goldberg machine" /><br><a href="https://commons.wikimedia.org/wiki/File:Rube_Goldberg_Machine_(278696130).jpg">Image by Wikimedia Commons</a></p>
<p>In some sense, most machine learning projects look exactly the same. There are 4 stages to be concerned with no matter what the project is:</p>
<ol>
<li>Sourcing the data</li>
<li>Transforming it</li>
<li>Building the model</li>
<li>Deploying it</li>
</ol>
<p>It’s been said that #1 and #2 take up most of an ML engineer’s time. This emphasizes how little time the most fun part, #3, sometimes seems to get.</p>
<p>In the real world, though, #4 can over time take almost as much as the previous three combined.</p>
<p>Deployed models sometimes need to be rebuilt. They consume data that constantly need to go through stages #1 and #2. It certainly isn’t always what’s shown in the classroom, where datasets fit perfectly in memory and model training takes at most a couple of hours on an old laptop.</p>
<p>Working with gigantic datasets isn’t the only problem. Data pipelines can take long hours to complete. What if some part of your infrastructure has an unexpected downtime? Do you just start it all over again from the very beginning?</p>
<p>Many solutions of course exist. With this article, I’d like to go over this problem space and present an approach that feels really nice and clean.</p>
<h3 id="project-description">Project description</h3>
<p>End Point Corporation was founded in 1995. That’s 24 years! About 9 years later, <a href="/blog/2004/10/red-hat-enterprise-linux-3-update-3/">the oldest article</a> on the company’s blog was published. Since that time, a staggering 1435 unique articles have been published. That’s a lot of words! This is something we can definitely put to smart use.</p>
<p>For the purpose of having fun with building a production-grade data pipeline, let’s imagine the following project:</p>
<ul>
<li>A <a href="https://cs.stanford.edu/~quocle/paragraph_vector.pdf">doc2vec</a> model trained on the corpus of End Point’s blog articles</li>
<li>Use of the paragraph vectors for each article to find the 10 other, most similar articles</li>
</ul>
<p>I blogged earlier about using <a href="/blog/2018/07/recommender-mxnet/">matrix factorization</a> as a simple <a href="https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering">collaborative filtering</a> style of recommender system. We can think of today’s doc2vec-based model as an example of <a href="https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering">content-based filtering</a>. The business value would be potentially increased blog traffic from users staying longer on the website.</p>
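<p>To make the “10 most similar articles” step concrete, here’s a hedged sketch of the similarity lookup, assuming each article already has a paragraph vector from a trained doc2vec model (the vectors below are random placeholders, and the dimensions are illustrative):</p>

```python
import numpy as np

# Hedged sketch: assuming a matrix with one paragraph vector per article
# (random placeholders here; a real pipeline would take them from a
# trained doc2vec model), cosine similarity ranks the 10 nearest articles.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(1435, 50))           # one 50-dim vector per article

def most_similar(article_idx, topn=10):
    v = vectors[article_idx]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v))
    sims[article_idx] = -np.inf                 # exclude the article itself
    return np.argsort(-sims)[:topn]             # indices of the top matches

neighbors = most_similar(0)
```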
<h3 id="scalable-pipelines">Scalable pipelines</h3>
<p>The data pipeline problem has certainly found some really great solutions. The <a href="http://hadoop.apache.org">Hadoop</a> project brought in HDFS, a distributed file system for huge data artifacts. Its MapReduce component plays a vital role in distributed data processing.</p>
<p>Then, the fantastic <a href="https://spark.apache.org">Spark</a> project came in. Its architecture makes data reside in memory by default—with explicit caching of the data on disks. The project claims to be running workloads 100 times faster than Hadoop.</p>
<p>Both projects though require the developer to use a very specific set of libraries. It’s not easy, for example, to distribute <a href="https://spacy.io">spaCy</a> training and inference on Spark.</p>
<h3 id="containers">Containers</h3>
<p>On the other side of the spectrum, there’s <a href="https://dask.org">Dask</a>. It’s a Python package that wraps <a href="https://www.numpy.org">Numpy</a>, <a href="https://pandas.pydata.org">Pandas</a> and <a href="https://scikit-learn.org/stable/">Scikit-Learn</a>. It enables developers to load huge piles of data, just as they would with the smaller datasets. The data is partitioned and distributed among the cluster nodes. It can work with groups of processes as well as clusters of containers. The APIs of the above-mentioned projects are (mostly) preserved while all the processing is suddenly distributed.</p>
<p>Some teams like to use Dask along with <a href="https://luigi.readthedocs.io/en/stable/">Luigi</a> and build production pipelines around <a href="https://www.docker.com">Docker</a> or <a href="https://kubernetes.io">Kubernetes</a>.</p>
<p>In this article, I’d like to present another Dask-friendly solution: Kubernetes-native workflows using <a href="https://argoproj.github.io">Argo</a>. What’s great about it compared to Luigi is that you don’t even need to care about having a certain version of Python and Luigi installed to orchestrate the pipeline. All you need is a Kubernetes cluster with Argo installed on it.</p>
<h3 id="hands-down-work-on-the-project">Hands-on work on the project</h3>
<p>The first thing to do when developing this project is to get access to the Kubernetes cluster. For the development, you can set up a one-node cluster using either one of:</p>
<ul>
<li><a href="https://microk8s.io">Microk8s</a></li>
<li><a href="https://github.com/kubernetes/minikube">Minikube</a></li>
</ul>
<p>I love them both. The first is developed by Canonical, while the second comes from the Kubernetes team itself.</p>
<p>This isn’t going to be a step-by-step tutorial on using Kubernetes. If you’re new to it, I encourage you to read the documentation or seek out a good online course. Even in that case, though, read on; nothing here is overly complex.</p>
<p>Next, you’ll need the Argo Workflows. The installation is really easy. The full yet simple documentation can be found <a href="https://argoproj.github.io/docs/argo/demo.html">here</a>.</p>
<h4 id="the-project-structure">The project structure</h4>
<p>Here’s what the project looks like in the end:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">.
├── Makefile
├── notebooks
│   └── scratch.ipynb
├── notebooks.yml
├── pipeline.yaml
└── tasks
    ├── base
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── build_model
    │   ├── Dockerfile
    │   └── run.py
    ├── clone_repo
    │   ├── Dockerfile
    │   └── run.sh
    ├── infer
    │   ├── Dockerfile
    │   └── run.py
    ├── notebooks
    │   └── Dockerfile
    └── preprocess
        ├── Dockerfile
        └── run.py
</code></pre></div><p>The main parts are as follows:</p>
<ul>
<li><code>Makefile</code> provides easy-to-use helpers for building images, pushing them to the Docker registry, and running the Argo workflow</li>
<li><code>notebooks.yml</code> defines a Kubernetes service and deployment for an exploratory <a href="https://github.com/jupyterlab/jupyterlab">Jupyter Lab</a> instance</li>
<li><code>notebooks</code> contains individual Jupyter notebooks</li>
<li><code>pipeline.yaml</code> defines our Machine Learning pipeline in the form of the Argo workflow</li>
<li><code>tasks</code> contains workflow steps as containers along with their Dockerfiles</li>
<li><code>tasks/base</code> defines the base Docker image for other tasks</li>
<li><code>tasks/**/run.(py|sh)</code> is a single entry point for a given pipeline step</li>
</ul>
<p>The idea is to minimize the boilerplate while retaining the features offered by tools like Luigi.</p>
<h4 id="makefile">Makefile</h4>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-makefile" data-lang="makefile"><span style="color:#369">SHELL</span> := /bin/bash
<span style="color:#369">VERSION</span>?=latest
<span style="color:#369">TASK_IMAGES</span>:=<span style="color:#080;font-weight:bold">$(</span>shell find tasks -name Dockerfile -printf <span style="color:#d20;background-color:#fff0f0">'%h '</span><span style="color:#080;font-weight:bold">)</span>
<span style="color:#369">REGISTRY</span>=base:5000
<span style="color:#06b;font-weight:bold">tasks/%</span>: FORCE
<span style="color:#038">set</span> -e ;<span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> docker build -t blog_pipeline_<span style="color:#080;font-weight:bold">$(</span>@F<span style="color:#080;font-weight:bold">)</span>:<span style="color:#080;font-weight:bold">$(</span>VERSION<span style="color:#080;font-weight:bold">)</span> <span style="color:#369">$@</span> ;<span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> docker tag blog_pipeline_<span style="color:#080;font-weight:bold">$(</span>@F<span style="color:#080;font-weight:bold">)</span>:<span style="color:#080;font-weight:bold">$(</span>VERSION<span style="color:#080;font-weight:bold">)</span> <span style="color:#080;font-weight:bold">$(</span>REGISTRY<span style="color:#080;font-weight:bold">)</span>/blog_pipeline_<span style="color:#080;font-weight:bold">$(</span>@F<span style="color:#080;font-weight:bold">)</span>:<span style="color:#080;font-weight:bold">$(</span>VERSION<span style="color:#080;font-weight:bold">)</span> ;<span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> docker push <span style="color:#080;font-weight:bold">$(</span>REGISTRY<span style="color:#080;font-weight:bold">)</span>/blog_pipeline_<span style="color:#080;font-weight:bold">$(</span>@F<span style="color:#080;font-weight:bold">)</span>:<span style="color:#080;font-weight:bold">$(</span>VERSION<span style="color:#080;font-weight:bold">)</span>
<span style="color:#06b;font-weight:bold">images</span>: <span style="color:#080;font-weight:bold">$(</span><span style="color:#369">TASK_IMAGES</span><span style="color:#080;font-weight:bold">)</span>
<span style="color:#06b;font-weight:bold">run</span>: images
argo submit pipeline.yaml --watch
<span style="color:#06b;font-weight:bold">start_notebooks</span>:
kubectl apply -f notebooks.yml
<span style="color:#06b;font-weight:bold">stop_notebooks</span>:
kubectl delete deployment jupyter-notebook
<span style="color:#06b;font-weight:bold">FORCE</span>: ;
</code></pre></div><p>Running this Makefile with <code>make run</code> requires resolving the <code>images</code> dependency, which in turn resolves each of the <code>tasks/**/Dockerfile</code> dependencies. Notice how the <code>TASK_IMAGES</code> variable is constructed: it uses make’s <code>shell</code> function with Unix’s <code>find</code> to locate the subdirectories of <code>tasks</code> that contain a Dockerfile. Here’s the output if you were to run the command directly:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ find tasks -name Dockerfile -printf <span style="color:#d20;background-color:#fff0f0">'%h '</span>
tasks/notebooks tasks/base tasks/preprocess tasks/infer tasks/build_model tasks/clone_repo
</code></pre></div><h4 id="setting-up-jupyter-notebooks-as-a-scratch-pad-and-for-eda">Setting up Jupyter Notebooks as a scratch pad and for EDA</h4>
<p>Let’s start off by defining our base Docker image:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-dockerfile" data-lang="dockerfile"><span style="color:#080;font-weight:bold">FROM</span><span style="color:#d20;background-color:#fff0f0"> python:3.7</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">COPY</span> requirements.txt /requirements.txt<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> pip install -r /requirements.txt<span style="color:#a61717;background-color:#e3d2d2">
</span></code></pre></div><p>Following is the Dockerfile that extends it and adds the Jupyter Lab:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-dockerfile" data-lang="dockerfile"><span style="color:#080;font-weight:bold">FROM</span><span style="color:#d20;background-color:#fff0f0"> endpoint-blog-pipeline/base:latest</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> pip install jupyterlab<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> mkdir ~/.jupyter<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"c.NotebookApp.token = ''"</span> >> ~/.jupyter/jupyter_notebook_config.py<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"c.NotebookApp.password = ''"</span> >> ~/.jupyter/jupyter_notebook_config.py<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> mkdir /notebooks<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">WORKDIR</span><span style="color:#d20;background-color:#fff0f0"> /notebooks</span><span style="color:#a61717;background-color:#e3d2d2">
</span></code></pre></div><p>The last step is to add the Kubernetes service and deployment definition in <code>notebooks.yml</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-yaml" data-lang="yaml"><span style="color:#b06;font-weight:bold">apiVersion</span>:<span style="color:#bbb"> </span>apps/v1<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">kind</span>:<span style="color:#bbb"> </span>Deployment<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">metadata</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">labels</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">app</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">spec</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">replicas</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">1</span><span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">selector</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">matchLabels</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">app</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">metadata</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">labels</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">app</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">spec</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">containers</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>minimal-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">image</span>:<span style="color:#bbb"> </span>base:5000/blog_pipeline_notebooks<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">ports</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">containerPort</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">8888</span><span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">command</span>:<span style="color:#bbb"> </span>[<span style="color:#d20;background-color:#fff0f0">"/usr/local/bin/jupyter"</span>]<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">args</span>:<span style="color:#bbb"> </span>[<span style="color:#d20;background-color:#fff0f0">"lab"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"--allow-root"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"--port"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"8888"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"--ip"</span>,<span style="color:#bbb"> </span><span style="color:#d20;background-color:#fff0f0">"0.0.0.0"</span>]<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">---</span><span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">kind</span>:<span style="color:#bbb"> </span>Service<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">apiVersion</span>:<span style="color:#bbb"> </span>v1<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">metadata</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">spec</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">type</span>:<span style="color:#bbb"> </span>NodePort<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">selector</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">app</span>:<span style="color:#bbb"> </span>jupyter-notebook<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">ports</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">protocol</span>:<span style="color:#bbb"> </span>TCP<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">nodePort</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">30040</span><span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">port</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">8888</span><span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">targetPort</span>:<span style="color:#bbb"> </span><span style="color:#00d;font-weight:bold">8888</span><span style="color:#bbb">
</span></code></pre></div><p>This can be run using our Makefile with <code>make start_notebooks</code> or directly with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ kubectl apply -f notebooks.yml
</code></pre></div><h4 id="exploration">Exploration</h4>
<p>The <a href="https://github.com/kamilc/endpoint-blog-nlp/blob/master/notebooks/scratch.ipynb">notebook itself</a> feels more like a scratch pad than an exploratory data analysis. It’s very informal and doesn’t include much exploration or visualization. In more real-world code you likely wouldn’t omit those.</p>
<p>I used it to ensure the model would work at all. I was then able to grab portions of the code and paste them directly into the step definitions.</p>
<h4 id="implementation">Implementation</h4>
<h5 id="step-1-source-blog-articles">Step 1: Source blog articles</h5>
<p>The blog’s articles are stored on <a href="https://github.com/EndPointCorp/end-point-blog">GitHub</a> in Markdown files.</p>
<p>Our first pipeline task will need to either clone the repo or pull from it if it’s present in the pipeline’s shared volume.</p>
<p>We’ll use the Kubernetes <a href="https://kubernetes.io/docs/concepts/storage/volumes/#hostpath">hostPath</a> as the cross-step volume. What’s nice about it is that it’s easy to peek into the volume during development to see if the data artifacts are being generated correctly.</p>
<p>In our example here, I’m hardcoding the path on my local system:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-yaml" data-lang="yaml"><span style="color:#888"># ...</span><span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">volumes</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>endpoint-blog-src<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">hostPath</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">path</span>:<span style="color:#bbb"> </span>/home/kamil/data/endpoint-blog-src<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">type</span>:<span style="color:#bbb"> </span>Directory<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#888"># ...</span><span style="color:#bbb">
</span></code></pre></div><p>This is one of the downsides of the <code>hostPath</code>—it only accepts absolute paths. This will do just fine for now though.</p>
<p>In the <code>pipeline.yml</code> we define the task container with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-yaml" data-lang="yaml"><span style="color:#888"># ...</span><span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#b06;font-weight:bold">templates</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>clone-repo<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">container</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">image</span>:<span style="color:#bbb"> </span>base:5000/blog_pipeline_clone_repo<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">command</span>:<span style="color:#bbb"> </span>[bash]<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">args</span>:<span style="color:#bbb"> </span>[<span style="color:#d20;background-color:#fff0f0">"/run.sh"</span>]<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">volumeMounts</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">mountPath</span>:<span style="color:#bbb"> </span>/data<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>endpoint-blog-src<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#888"># ...</span><span style="color:#bbb">
</span></code></pre></div><p>The full pipeline forms a tree, which is conveniently expressed as a directed acyclic graph in Argo. Here’s the definition of the whole pipeline (some steps haven’t been shown yet):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-yaml" data-lang="yaml"><span style="color:#888"># ...</span><span style="color:#bbb">
</span><span style="color:#bbb"></span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>article-vectors<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">dag</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">tasks</span>:<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>src<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb"> </span>clone-repo<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>dataframe<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb"> </span>preprocess<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">dependencies</span>:<span style="color:#bbb"> </span>[src]<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>model<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb"> </span>build-model<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">dependencies</span>:<span style="color:#bbb"> </span>[dataframe]<span style="color:#bbb">
</span><span style="color:#bbb"> </span>- <span style="color:#b06;font-weight:bold">name</span>:<span style="color:#bbb"> </span>infer<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">template</span>:<span style="color:#bbb"> </span>infer<span style="color:#bbb">
</span><span style="color:#bbb"> </span><span style="color:#b06;font-weight:bold">dependencies</span>:<span style="color:#bbb"> </span>[model]<span style="color:#bbb">
</span><span style="color:#bbb"></span><span style="color:#888"># ...</span><span style="color:#bbb">
</span></code></pre></div><p>Notice how the <code>dependencies</code> field makes it easy to tell Argo in what order to execute the tasks. Argo steps can also define inputs and outputs—just like Luigi. For this simple example, I decided to omit them and rely on the convention that each step expects its data artifacts at a certain location in the mounted volume. If you’re curious about other Argo features, <a href="https://argoproj.github.io/docs/argo/examples/readme.html#parameters">here</a> is its documentation.</p>
<p>The entry point script for the task is pretty simple:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#c00;font-weight:bold">#!/bin/bash
</span><span style="color:#c00;font-weight:bold"></span>
<span style="color:#038">cd</span> /data
<span style="color:#080;font-weight:bold">if</span> [ -d ./blog ]
<span style="color:#080;font-weight:bold">then</span>
  <span style="color:#038">cd</span> blog
  git pull origin master
<span style="color:#080;font-weight:bold">else</span>
  git clone https://github.com/EndPointCorp/end-point-blog.git blog
<span style="color:#080;font-weight:bold">fi</span>
</code></pre></div><h5 id="step-2-data-wrangling">Step 2: Data wrangling</h5>
<p>At this point, we have the source files for the blog articles in Markdown. To be able to run them through any kind of machine learning modeling, we need to load them into a data frame. We’ll also need to clean the text a bit. Here is the reasoning behind the cleanup routine:</p>
<ul>
<li>I want the relations between the articles to ignore the code snippets: articles should <strong>not</strong> be grouped by the programming language or library they use just because of the keywords they contain</li>
<li>I want the metadata about tags and authors to be omitted too, as I don’t want to see e.g. only my own articles listed as similar to my other ones</li>
</ul>
<p>The full source for the <code>run.py</code> of the “preprocess” task can be viewed <a href="https://github.com/kamilc/endpoint-blog-nlp/blob/master/tasks/preprocess/run.py">here</a>.</p>
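<p>The gist of the routine can be sketched in a few lines. This is a simplified illustration, not the actual <code>run.py</code>: the regexes, the column names, and the assumption that the metadata sits between <code>---</code> markers are mine:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import re
from pathlib import Path

import pandas as pd


def clean_markdown(text):
    # Drop fenced code blocks so articles don't cluster by code keywords
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)
    # Drop the metadata header (tags, author) assumed to sit between --- markers
    text = re.sub(r"\A---.*?---", " ", text, flags=re.DOTALL)
    return text


def build_dataframe(root):
    rows = [
        {"file": str(path.relative_to(root)), "text": clean_markdown(path.read_text())}
        for path in sorted(Path(root).glob("**/*.html.md"))
    ]
    return pd.DataFrame(rows)


if __name__ == "__main__" and Path("/data/blog").is_dir():
    build_dataframe("/data/blog").to_parquet("/data/articles.parquet")
</code></pre></div>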
<p>Notice that unlike make or Luigi, Argo will run the same task fully again even if the step’s artifact has already been created. I <strong>like</strong> this flexibility—after all, it’s extremely easy to skip the processing in the Python or shell script if the artifact already exists.</p>
<p>At the end of this step, the data frame is written as an <a href="https://parquet.apache.org">Apache Parquet</a> file.</p>
<h5 id="step-3-building-the-model">Step 3: Building the model</h5>
<p>The model from the paper mentioned earlier has already been implemented in a variety of other projects. There are implementations for each major deep learning framework on GitHub. There’s also a pretty good one included in <a href="https://radimrehurek.com/gensim/index.html">Gensim</a>. Its documentation can be found <a href="https://radimrehurek.com/gensim/models/doc2vec.html">here</a>.</p>
<p>The <a href="https://github.com/kamilc/endpoint-blog-nlp/blob/master/tasks/build_model/run.py">run.py</a> is pretty short and straightforward as well, which is one of the goals for the pipeline. In the end, it writes the trained model into the shared volume too.</p>
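<p>For reference, the core of the model building step can be sketched like this. The <code>TaggedDocument</code> construction and the hyperparameters below are my assumptions, not necessarily what the actual <code>run.py</code> uses:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


def build_model():
    # Tag each document with its file path so most_similar can return paths
    articles = pd.read_parquet('/data/articles.parquet')
    documents = [
        TaggedDocument(words=text.split(), tags=[file])
        for file, text in zip(articles['file'], articles['text'])
    ]
    model = Doc2Vec(documents, vector_size=100, min_count=2, epochs=40)
    model.save('/data/articles.model')
</code></pre></div>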
<p>Notice that re-running the pipeline with the model already stored will not trigger the training again. This is what we want. Imagine a new article being pushed into the repository: it’s very unlikely that retraining with it would affect the model’s performance in any significant way, but we’ll still need to predict the similar documents for it. The model building step short-circuits with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">'__main__'</span>:
<span style="color:#080;font-weight:bold">if</span> os.path.isfile(<span style="color:#d20;background-color:#fff0f0">'/data/articles.model'</span>):
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"Skipping as the model file already exists"</span>)
<span style="color:#080;font-weight:bold">else</span>:
build_model()
</code></pre></div><h5 id="step-4-predict-similar-articles">Step 4: Predict similar articles</h5>
<p>The listing of the <a href="https://github.com/kamilc/endpoint-blog-nlp/blob/master/tasks/infer/run.py">run.py</a> isn’t overly long:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">pandas</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">pd</span>
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">gensim.models.doc2vec</span> <span style="color:#080;font-weight:bold">import</span> Doc2Vec
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">yaml</span>
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">pathlib</span> <span style="color:#080;font-weight:bold">import</span> Path
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">os</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">write_similar_for</span>(path, model):
similar_paths = model.docvecs.most_similar(path)
yaml_path = (Path(<span style="color:#d20;background-color:#fff0f0">'/data/blog/'</span>) / path).parent / <span style="color:#d20;background-color:#fff0f0">'similar.yaml'</span>
<span style="color:#080;font-weight:bold">with</span> <span style="color:#038">open</span>(yaml_path, <span style="color:#d20;background-color:#fff0f0">"w"</span>) <span style="color:#080;font-weight:bold">as</span> file:
file.write(yaml.dump([p <span style="color:#080;font-weight:bold">for</span> p, _ <span style="color:#080">in</span> similar_paths]))
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">f</span><span style="color:#d20;background-color:#fff0f0">"Wrote similar paths to </span><span style="color:#33b;background-color:#fff0f0">{</span>yaml_path<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">infer_similar</span>():
articles = pd.read_parquet(<span style="color:#d20;background-color:#fff0f0">'/data/articles.parquet'</span>)
model = Doc2Vec.load(<span style="color:#d20;background-color:#fff0f0">'/data/articles.model'</span>)
<span style="color:#080;font-weight:bold">for</span> tag <span style="color:#080">in</span> articles[<span style="color:#d20;background-color:#fff0f0">'file'</span>].tolist():
write_similar_for(tag, model)
<span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">'__main__'</span>:
infer_similar()
</code></pre></div><p>The idea is to load up the saved Gensim model and the data frame with articles first. Then, for each article, the model is used to get the 10 most similar other articles.</p>
<p>As the step’s output, the listing of similar articles is placed in a <code>similar.yaml</code> file in each article’s subdirectory.</p>
<p>The blog’s Markdown → HTML compiler could then use this file and e.g. inject the “You might find those articles interesting too” section.</p>
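<p>The compiler hook itself isn’t part of this pipeline, but a hypothetical helper for it could look as follows. The function name and the URL scheme are made up for illustration:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">from pathlib import Path

import yaml


def similar_section(article_dir):
    # Turn the pipeline's similar.yaml output into a Markdown snippet
    similar_file = Path(article_dir) / "similar.yaml"
    if not similar_file.exists():
        return ""
    paths = yaml.safe_load(similar_file.read_text())
    lines = ["### You might find those articles interesting too", ""]
    lines += ["- /blog/" + p.replace(".html.md", "/") for p in paths]
    return "\n".join(lines)
</code></pre></div>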
<h4 id="results">Results</h4>
<p>The scratch notebook already includes the example results of running this doc2vec model. Examples:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model.docvecs.most_similar(<span style="color:#d20;background-color:#fff0f0">'2019/01/09/liquid-galaxy-at-instituto-moreira-salles.html.md'</span>)
</code></pre></div><p>Giving the output of:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">[(<span style="color:#d20;background-color:#fff0f0">'2016/04/22/liquid-galaxy-for-real-estate.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8872901201248169</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/07/03/liquid-galaxy-at-2017-boma.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8766101598739624</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/01/25/smartracs-liquid-galaxy-at-national.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8722846508026123</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/01/04/liquid-galaxy-at-new-york-tech-meetup_4.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8693454265594482</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/06/16/successful-first-geoint-symposium-for.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8679709434509277</span>),
(<span style="color:#d20;background-color:#fff0f0">'2014/08/22/liquid-galaxy-for-daniel-island-school.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8659971356391907</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/07/21/liquid-galaxy-featured-on-reef-builders.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8644022941589355</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/11/17/president-of-the-un-general-assembly.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8620222806930542</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/04/27/we-are-bigger-than-vr-gear-liquid-galaxy.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8613147139549255</span>),
(<span style="color:#d20;background-color:#fff0f0">'2015/11/04/end-pointers-favorite-liquid-galaxy.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8601428270339966</span>)]
</code></pre></div><p>Or the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model.docvecs.most_similar(<span style="color:#d20;background-color:#fff0f0">'2019/01/08/speech-recognition-with-tensorflow.html.md'</span>)
</code></pre></div><p>Giving:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">[(<span style="color:#d20;background-color:#fff0f0">'2019/05/01/facial-recognition-amazon-deeplens.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8850516080856323</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/05/30/recognizing-handwritten-digits-quick.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8535605072975159</span>),
(<span style="color:#d20;background-color:#fff0f0">'2018/10/10/image-recognition-tools.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8495659232139587</span>),
(<span style="color:#d20;background-color:#fff0f0">'2018/07/09/training-tesseract-models-from-scratch.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8377258777618408</span>),
(<span style="color:#d20;background-color:#fff0f0">'2015/12/18/ros-has-become-pivotal-piece-of.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8344655632972717</span>),
(<span style="color:#d20;background-color:#fff0f0">'2013/03/07/streaming-live-with-red5-media.html.md'</span>, <span style="color:#00d;font-weight:bold">0.8181146383285522</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/04/27/streaming-live-with-red5-media-server.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8142604827880859</span>),
(<span style="color:#d20;background-color:#fff0f0">'2013/03/15/generating-pdf-documents-in-browser.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7829260230064392</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/05/12/sketchfab-on-liquid-galaxy.html.md'</span>, <span style="color:#00d;font-weight:bold">0.7779937386512756</span>),
(<span style="color:#d20;background-color:#fff0f0">'2018/08/29/self-driving-toy-car-using-the-a3c-algorithm.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7659779787063599</span>)]
</code></pre></div><p>Or</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model.docvecs.most_similar(<span style="color:#d20;background-color:#fff0f0">'2016/06/03/adding-bash-completion-to-python-script.html.md'</span>)
</code></pre></div><p>With:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">[(<span style="color:#d20;background-color:#fff0f0">'2014/03/12/provisioning-development-environment.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.8298013806343079</span>),
(<span style="color:#d20;background-color:#fff0f0">'2015/04/03/manage-python-script-options.html.md'</span>, <span style="color:#00d;font-weight:bold">0.7975824475288391</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/01/03/automating-removal-of-ssh-key-patterns.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7794561386108398</span>),
(<span style="color:#d20;background-color:#fff0f0">'2014/03/14/provisioning-development-environment_14.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7763932943344116</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/04/16/easy-creating-ramdisk-on-ubuntu.html.md'</span>, <span style="color:#00d;font-weight:bold">0.7579266428947449</span>),
(<span style="color:#d20;background-color:#fff0f0">'2016/03/03/loading-json-files-into-postgresql-95.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7410352230072021</span>),
(<span style="color:#d20;background-color:#fff0f0">'2015/02/06/vim-plugin-spotlight-ctrlp.html.md'</span>, <span style="color:#00d;font-weight:bold">0.7385793924331665</span>),
(<span style="color:#d20;background-color:#fff0f0">'2017/10/27/hot-deploy-java-classes-and-assets-in.html.md'</span>,
<span style="color:#00d;font-weight:bold">0.7358890771865845</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/03/21/puppet-custom-fact-ruby-plugin.html.md'</span>, <span style="color:#00d;font-weight:bold">0.718029260635376</span>),
(<span style="color:#d20;background-color:#fff0f0">'2012/01/14/using-disqus-and-rails.html.md'</span>, <span style="color:#00d;font-weight:bold">0.716759443283081</span>)]
</code></pre></div><p>To run the pipeline all you need is to:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ make run
</code></pre></div><p>Or directly with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ argo submit pipeline.yml --watch
</code></pre></div><p>Argo gives a nice looking output of all the steps:</p>
<pre tabindex="0"><code>Name: endpoint-blog-pipeline-49ls5
Namespace: default
ServiceAccount: default
Status: Succeeded
Created: Wed Jun 26 13:27:51 +0200 (17 seconds ago)
Started: Wed Jun 26 13:27:51 +0200 (17 seconds ago)
Finished: Wed Jun 26 13:28:08 +0200 (now)
Duration: 17 seconds
STEP PODNAME DURATION MESSAGE
✔ endpoint-blog-pipeline-49ls5
├-✔ src endpoint-blog-pipeline-49ls5-3331170004 3s
├-✔ dataframe endpoint-blog-pipeline-49ls5-2286787535 3s
├-✔ model endpoint-blog-pipeline-49ls5-529475051 3s
└-✔ infer endpoint-blog-pipeline-49ls5-1778224726 6s
</code></pre><p>The resulting <code>similar.yaml</code> files look as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ ls ~/data/endpoint-blog-src/blog/2013/03/15/
generating-pdf-documents-in-browser.html.md similar.yaml
$ cat ~/data/endpoint-blog-src/blog/2013/03/15/similar.yaml
- 2016/03/17/creating-video-player-with-time-markers.html.md
- 2014/07/17/creating-symbol-web-font.html.md
- 2018/10/10/image-recognition-tools.html.md
- 2015/08/04/how-to-big-beautiful-background-video.html.md
- 2014/11/06/simplifying-mobile-development-with.html.md
- 2016/03/23/learning-from-data-basics-naive-bayes.html.md
- 2019/01/08/speech-recognition-with-tensorflow.html.md
- 2013/11/19/asynchronous-page-switches-with-django.html.md
- 2016/03/11/strict-typing-fun-example-free-monads.html.md
- 2018/07/09/training-tesseract-models-from-scratch.html.md
</code></pre></div><p>Although it’s difficult to quantify, those sets of “similar” documents do seem to be linked in many ways to their “anchor” articles. You’re invited to read them and see for yourself!</p>
<h3 id="closing-words">Closing words</h3>
<p>The code presented here is hosted <a href="https://github.com/kamilc/endpoint-blog-nlp">on GitHub</a>. There’s lots of room for improvement of course. It shows an approach that could work for small model deployments (like the one above) as well as for very big ones.</p>
<p>Argo workflows can be used in tandem with Kubernetes deployments. You could e.g. run a distributed <a href="https://www.tensorflow.org">TensorFlow</a> model training and then deploy the model on Kubernetes via <a href="https://www.tensorflow.org/tfx/guide/serving">TensorFlow Serving</a>. If you’re more into <a href="https://pytorch.org">PyTorch</a>, then distributing the training would be possible via <a href="https://eng.uber.com/horovod/">Horovod</a>. Have data scientists that use R? Deploy <a href="https://www.rstudio.com">RStudio Server</a> instead of JupyterLab with <a href="https://hub.docker.com/r/rocker/rstudio">the image from DockerHub</a>, and run some or all of the tasks with the <a href="https://hub.docker.com/r/rocker/r-ver">simpler image</a> that includes only base R.</p>
<p>If you have any questions or projects you’d like us to help you with, reach out right away through our <a href="/contact/">contact form</a>!</p>
<h2><a href="https://www.endpointdev.com/blog/2019/05/facial-recognition-amazon-deeplens/">Facial Recognition Using Amazon DeepLens: Counting Liquid Galaxy Interactions</a></h2>
<p>2019-05-01 · Ben Ironside Goldstein</p>
<p>I have been exploring the possible uses of a machine-learning-enabled camera for the Liquid Galaxy. The Amazon Web Services (AWS) <a href="https://aws.amazon.com/deeplens/">DeepLens</a> is a camera that can receive and transmit data over wifi, and that has computing hardware built in. Since its hardware enables it to use machine learning models, it can perform computer vision tasks in the field.</p>
<h3 id="the-amazon-deeplens-camera">The Amazon DeepLens camera</h3>
<p><img style="float: left; width: 400px; padding-right: 2em;" src="/blog/2019/05/facial-recognition-amazon-deeplens/deeplens-front-angle.jpg" alt="DeepLens" /></p>
<p>This camera is the first of its kind—likely the first of many, given the ongoing rapid adoption of Internet of Things (IoT) devices and computer vision. It came to End Point’s attention as hardware that could potentially interface with and extend End Point’s immersive visualization platform, the <a href="https://www.visionport.com/">Liquid Galaxy</a>. We’ve thought of several ways computer vision could potentially work to enhance the platform, for example:</p>
<ol>
<li>Monitoring users’ reactions</li>
<li>Counting unique visitors to the LG</li>
<li>Counting the number of people using an LG at a given time</li>
</ol>
<p>The first idea would depend on parsing facial expressions. Perhaps a certain moment in a user experience causes people to look confused, or particularly delighted—valuable insights. The second idea would generate data that could help us assess the platform’s impact, using a metric crucial to any potential clients whose goals involve engaging audiences. The third idea would create a simpler metric: the average number of people engaging with the system over a period of time. Nevertheless, this idea has a key advantage over the second: it doesn’t require distinguishing between people, which makes it a much more tractable project. This post focuses on the third idea.</p>
<p>To set up the camera, the user has to plug it into a power outlet and connect it to wifi. The camera will still work even with a slow network connection, though the slower the connection, the longer the delay between the camera seeing something and reporting it. However, this delay was hardly noticeable on my home network, which has slow-to-moderate speeds of about 17 Mbps down and 33 Mbps up.</p>
<h3 id="computer-vision-and-the-amazon-deeplens">Computer Vision and the Amazon DeepLens</h3>
<p>A <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning model</a> is a neural network with multiple layers of processing units; it is called “deep” because of those multiple layers. The inputs and outputs of each processing unit are numbers. These units are roughly analogous to neurons: they receive input from units in the previous layer, transform it with a function, and pass the result on to units in the next layer. These “activation functions” can take a variety of forms. The last layer’s outputs translate into the results. The models work because the units’ functions get tuned based on how well the model performs. For example, to make a model that labels each human face in a picture and draws a box around it, we would start with a corpus of pictures with boxes drawn around faces, as well as versions of the pictures without the boxes drawn. We would test the model on the non-labeled images by checking—for each picture—whether the output generated by the model is correct. If not, the computer adjusts the units’ parameters, tries again, and compares the results. Repeating this process thousands of times yields models which work remarkably well for a wide range of tasks, including computer vision.</p>
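<p>The arithmetic of a single unit is simple enough to show in a few lines of Python (an illustration only; real frameworks vectorize this across whole layers and pictures):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import math


def unit(inputs, weights, bias):
    # Weighted sum of the previous layer's outputs, plus a bias term...
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...passed through an activation function (here: the logistic sigmoid)
    return 1.0 / (1.0 + math.exp(-total))


# Two numbers coming from the previous layer, feeding one unit of the next
print(unit([0.5, -1.2], [0.8, 0.1], bias=0.3))  # about 0.64
</code></pre></div>
<p>Training amounts to nudging the <code>weights</code> and <code>bias</code> of every unit until the last layer’s outputs match the labels well.</p>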
<p>In deep learning for computer vision, training on large sets of labeled images enables models to generalize about visual characteristics. The training process takes a lot of computing resources, but once models are trained, they can produce results quickly and with relative ease. This is why the DeepLens is able to perform computer vision with its limited computing resources.</p>
<p>Since the DeepLens is an Amazon product, it comes as no surprise that the user interface and backend for DeepLens consist of AWS services. One of the most important is <a href="https://aws.amazon.com/sagemaker/">SageMaker</a>, which is used to train, manage, optimize, and deploy machine learning models such as neural networks. It includes hosted Jupyter notebooks (<a href="https://jupyter.org/">Jupyter</a> is a development environment for data science), as well as the computing resources required for model training and storage. With SageMaker, users can train computer vision models for deployment to DeepLens, or import and adjust pretrained models from various sources.</p>
<p>Remote management of the DeepLens depends on <a href="https://aws.amazon.com/lambda/">AWS Lambda</a>, a “serverless” cloud service that provides an environment to run backend code and integrate with other cloud services. It runs the show, allowing users to manage everything from the camera’s behavior to what happens to gathered data. Another service, <a href="https://aws.amazon.com/greengrass/">AWS Greengrass</a>, connects the instructions from AWS Lambda to the DeepLens, managing tasks like authentication, updates, and reactions to local events.</p>
<p>Amazon’s IoT service saves information about each DeepLens, and allows users to manage their devices, for example by choosing which model is active on the device, or viewing a live stream from the camera. It also keeps track of what’s going on with the hardware, even when it’s off. When a model is running on the DeepLens, you can view a live stream of its inferences about what it’s seeing (the labeled images). Amazon has released various pretrained models designed to work on the DeepLens. Using a model for detecting faces, we can get a live stream that looks like this:</p>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/one-face-recognition.jpg" alt="one-face-recognition">
<br>Me looking at the DeepLens in my kitchen</p>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/multi-face-recognition.jpg" alt="multi-face-recognition">
<br>Facial recognition inferences on multiple people. (Witness my smile of satisfaction at finally finding enthusiastic subjects of facial recognition.)</p>
<p>Each face that the camera detects gets a box around it, along with the model’s level of certainty that it is a face. The above pictures were the results of an attempt to simulate the conditions where this could be used.</p>
<h3 id="the-model">The Model</h3>
<p>The model I used was trained on data from <a href="http://www.image-net.org/">ImageNet</a>, a public database with hundreds or thousands of images associated with each noun. (For example, they have 1537 <a href="http://www.image-net.org/synset?wnid=n03376595">pictures of folding chairs</a>.) ImageNet is <a href="https://arxiv.org/search/?query=imagenet&searchtype=all&source=header">commonly</a> used to train and test computer vision models.</p>
<p>However, the training for this model didn’t stop there: Amazon used transfer learning from another large image dataset, <a href="http://cocodataset.org/#home">MS-COCO</a>, to fine-tune the model for face detection. Transfer learning works essentially by retraining the last layer of an already-trained model. In this way it harnesses the “insights” of the existing model (e.g. about shapes, colors, and positions) by repurposing this information to make predictions about something else: in this case, whether something is a face.</p>
<p>Since this model was pretrained and optimized by Amazon for the DeepLens, it provides a low effort route to implementing a computer vision model on the DeepLens. I didn’t have to do any of the processing on my own hardware. The DeepLens hardware took care of all the predictions, though the biggest resource savings were from not having to train the model myself (which can take days, or longer).</p>
<p>When the facial recognition model is deployed and the DeepLens is on, an AWS Lambda function written in Python repeatedly prompts the camera for frames:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">frame = awscam.getLastFrame()
</code></pre></div><p>…to resize the frames before inference (the model accepts frames of a particular size):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">frame_resize = cv2.resize(frame, (input_height, input_width))
</code></pre></div><p>…to pass the frames to the model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">parsed_inference_results = model.parseResult(model_type, model.doInference(frame_resize))
</code></pre></div><p>…and to use the results to draw boxes around the faces:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (<span style="color:#00d;font-weight:bold">255</span>, <span style="color:#00d;font-weight:bold">165</span>, <span style="color:#00d;font-weight:bold">20</span>), <span style="color:#00d;font-weight:bold">10</span>)
</code></pre></div><p>As you can see from how often “cv2” appears in the code above, this implementation relies heavily on code from <a href="https://opencv.org">OpenCV</a>, an open source computer vision framework. Finally, the results are sent to the cloud:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">client.publish(topic=iot_topic, payload=json.dumps(cloud_output))
</code></pre></div><p>In the last code snippet above, iot_topic refers to an Amazon MQTT (Message Queuing Telemetry Transport) topic for IoT devices. <a href="https://en.wikipedia.org/wiki/MQTT">MQTT</a> is the standard connectivity framework for DeepLens and many other IoT devices. One of its advantages in this context is that it handles intermittent connectivity by smoothly queueing messages until the network connection is stable. The essence of MQTT is publishing and subscribing to topics, and this system of topics lets results from a DeepLens trigger other processes. For example, the DeepLens could publish a message when it sees a face, prompting another cloud service to do something else, such as record when and for how long the face appeared.</p>
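<p>The publish/subscribe mechanism can be illustrated with a minimal in-memory sketch, a simplified stand-in for a real MQTT broker (the topic name and the downstream handler below are hypothetical, not the actual DeepLens topic):</p>

```python
# Minimal in-memory publish/subscribe sketch, illustrating how a message
# on one topic can trigger further processing (a stand-in for MQTT).
import json
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, payload):
    # Deliver the decoded payload to every handler subscribed to the topic.
    for handler in subscribers[topic]:
        handler(json.loads(payload))

# A hypothetical downstream process: record when a face appears.
sightings = []
subscribe("deeplens/infer", lambda msg: sightings.append(msg["timestamp"]))

publish("deeplens/infer", json.dumps({"face": 0.92, "timestamp": 1554853281975}))
print(sightings)  # → [1554853281975]
```

<p>A real broker adds authentication, quality-of-service levels, and message queueing on top of this basic topic-routing idea.</p>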
<p>I wanted to test how data from this model would compare to a human’s perception. The first step was to understand what data the camera offers. It produces data about each frame analyzed: a timestamp (in 13-digit <a href="https://en.wikipedia.org/wiki/Unix_time">Unix time</a>), and the predicted probability that something it identifies is a face. To gather this data, I used the AWS IoT service to manually subscribe to a secure MQTT topic where the DeepLens published its predictions. Each frame processed produces data like this:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-json" data-lang="json">{
<span style="color:#b06;font-weight:bold">"format"</span>: <span style="color:#d20;background-color:#fff0f0">"json"</span>,
<span style="color:#b06;font-weight:bold">"payload"</span>: {
<span style="color:#b06;font-weight:bold">"face"</span>: <span style="color:#00d;font-weight:bold">0.5654296875</span>
},
<span style="color:#b06;font-weight:bold">"qos"</span>: <span style="color:#00d;font-weight:bold">0</span>,
<span style="color:#b06;font-weight:bold">"timestamp"</span>: <span style="color:#00d;font-weight:bold">1554853281975</span>,
<span style="color:#b06;font-weight:bold">"topic"</span>: <span style="color:#d20;background-color:#fff0f0">"$aws/things/deeplens_bnU5sr2sSD2ecW5YkfJZtw/infer"</span>
}
</code></pre></div><p>The data generated by a single frame (with one face) when processed by the DeepLens.</p>
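<p>A message like the one above is straightforward to post-process. A minimal sketch, using only the field names from the sample payload, might extract the timestamp and face probabilities like this:</p>

```python
import json
from datetime import datetime, timezone

message = '''{
  "format": "json",
  "payload": {"face": 0.5654296875},
  "qos": 0,
  "timestamp": 1554853281975,
  "topic": "$aws/things/deeplens_bnU5sr2sSD2ecW5YkfJZtw/infer"
}'''

data = json.loads(message)
# The 13-digit timestamp is Unix time in milliseconds.
when = datetime.fromtimestamp(data["timestamp"] / 1000, tz=timezone.utc)
# Each entry in the payload is one detected face with its probability.
faces = list(data["payload"].values())
print(when.date(), len(faces), max(faces))  # → 2019-04-09 1 0.5654296875
```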
<p>For my purposes, I was only interested in the timestamps and payloads (which contain the number of faces identified, and their probabilities). I decided to test the facial recognition model under several different conditions:</p>
<ol>
<li>No faces present</li>
<li>One face present</li>
<li>Multiple faces present</li>
</ol>
<p>For condition 1 I just aimed the camera at an empty room for 20 minutes, and for condition 2 I sat in front of it for 20 minutes. For condition 3, I aimed the camera at a public space for 20 minutes, and while it was running I kept an ongoing count of the number of people looking in the general direction of the camera (I put the camera in front of a wall with a TV on it so people would be more likely to look towards it). Averaging my count over the duration of the sample gave an average engagement of 2.5 people looking at the camera at any given time. To minimize bias, I made my human-eye assessment before looking at any of the data.</p>
<p>I’ll spoil one aspect of the results right away: there were no false positives under any condition. Even the lower-probability guesses corresponded to actual faces, though this result might not hold in a room with lots of face-like art; that’s not a common scenario. This simplified things, since it meant there was no need to set a lower bound on the probabilities which we should count—any face detected by the camera is a face. This also highlights one of my remaining questions about the model: is there useful information to be gained from the probabilities?</p>
<p>Another important note: I noticed early in the experiment that it almost never detects a face farther than 15 feet away. For the use case of a Liquid Galaxy, the 15-foot range is too short to capture all types of engagement (some people look at it across the room), but from my experience with the system I think that users within this range could be accurately described as focused users—something worth measuring, but certainly not everything worth measuring. After noticing this, I retested condition 2 with my face about 5 feet from the DeepLens, after initially trying it from across a room.</p>
<h3 id="how-did-the-deeplens-counts-compare-to-my-counts">How did the DeepLens counts compare to my counts?</h3>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/results.png" alt="results"></p>
<p>The model matched my performance in conditions 1 and 2, which makes a strong statement about its reliability in relatively static and close-up conditions such as looking at an empty room, or looking at someone stare at their laptop across a small table. In contrast, it did not count as many faces as I did in condition 3—so I’m happy to report I can still outperform A.I. on something.</p>
<p>Anyway, this suggests that the model is somewhat conservative, at least compared to my count (likely partly due to my eyes having a range larger than 15 feet). Therefore, when considering usage statistics gathered by a similar method, it might make most sense to think of the results as a lower bound, e.g. “the average number of people focused on the system was more than 2.1”.</p>
<p>It would be useful to experiment with the multiple faces condition again, to see how robust these findings are. It would also be helpful to keep track of factors like how much people move, the lighting, and the orientation of the camera, to see if they might impact the results. It would also be useful to automate the data collection and analysis.</p>
<p>This investigation has shown me that the DeepLens has a lot of potential as a tool for measuring engagement. Perhaps a future post will examine how it can be used to count users.</p>
<hr>
<p>Thanks for reading! You are welcome to learn more about <a href="https://www.visionport.com/">End Point Liquid Galaxy</a> and <a href="https://aws.amazon.com/deeplens/">AWS DeepLens</a>.</p>
The flow of hierarchical data extractionhttps://www.endpointdev.com/blog/2019/03/the-flow-of-hierarchical-data-extraction/2019-03-13T00:00:00+00:00Árpád Lajos
<p><img src="/blog/2019/03/the-flow-of-hierarchical-data-extraction/art-ball-ball-shaped-235615-smaller.jpg" alt="forest view through glass ball on wood stump"></p>
<h3 id="1-problem-statement">1. Problem statement</h3>
<p>There are many cases when people intend to collect data, for various purposes. One may want to compare prices or find out how musical fashion changes over time. There are a zillion potential uses of collected data.</p>
<p>The old-fashioned way to do this task is to hire a few dozen people and explain to them where they should go on the web, what they should collect, how they should write a report, and how they should send it.</p>
<p>It is more effective to teach them all at once than separately, but even then there will be misunderstandings and costly mistakes, not to mention the limits on how much data a human can process. As a result, the industry strives to make this as automatic as possible.</p>
<p>This is why people write software to cope with this issue. The terms data-extractor, data-miner, data-crawler, and data-spider all refer to software which extracts data from a source and stores it at a target. If data is mined from the web, then the more specific terms web-extractor, web-miner, web-crawler, and web-spider can be used.</p>
<p><img src="/blog/2019/03/the-flow-of-hierarchical-data-extraction/SemanticDataExtractorFigure1.png" alt="" title="Extraction, to put it simply"></p>
<p>In this article I will use the term “data-miner”.</p>
<p>This article deals with the extraction of hierarchical data in a semantic manner and the way we can parse the data obtained this way.</p>
<h4 id="11-hierarchical-structure">1.1. Hierarchical structure</h4>
<p>A hierarchical structure involves a hierarchy, that is, a graph with nodes and edges but without a cycle. Specialists call this structure a <strong>forest</strong>. A forest consists of <strong>trees</strong>; in our case we have a forest of rooted trees. A rooted tree has a root node, and every other node is its descendant (child of child of child …), or, to put it inversely, the root is the ancestor of all other nodes in a rooted tree.</p>
<p>If we add a node to a forest and we make sure that all the trees’ roots in the forest are children of the new node, the new root, then we transformed our hierarchical structure, our forest into a tree.</p>
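<p>The forest-to-tree transformation described above can be sketched in a few lines; the dict-based node representation here is just an assumption for illustration:</p>

```python
# Each node is a dict with a name and a list of children.
def node(name, children=None):
    return {"name": name, "children": children or []}

# A forest: two independent rooted trees.
forest = [node("html", [node("body")]), node("xml")]

# Adding a new root whose children are all the trees' roots
# turns the forest into a single tree.
root = node("virtual-root", forest)

def count_nodes(n):
    return 1 + sum(count_nodes(c) for c in n["children"])

print(count_nodes(root))  # → 4
```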
<p>Common hierarchical structures a data-miner might have to work with are:</p>
<ul>
<li>HTML</li>
<li>XML</li>
<li>Trees represented in JSON</li>
<li>File structure</li>
</ul>
<p>Other, non-hierarchical structures could be texts, pictures, audio files, video, etc. In this article we focus on hierarchical structures.</p>
<h4 id="12-purpose-of-the-data-miner">1.2. Purpose of the data-miner</h4>
<p>Of course the purpose of the data-miner is to mine data; however, more specifically, we can speak about general-purpose data-miners, thematic data-miners and narrow-spaced data-miners.</p>
<p>General-purpose data-miners are the data-miners which extract and prepare data for search engines. The technique one can imagine is to find the text in the source (for example a web page) and map it to keywords, so that when people search for those keywords, the source is shown as a result. Of course, it is highly probable that these data-miners are much smarter and more complex than described here, but that is out of scope from our perspective.</p>
<p>If we speak about general-purpose data-mining, then even when mining from a hierarchical data-source it is very difficult for humans to define all the semantics involved, so if one wants to create a general-purpose data-miner, machine learning will play a large part in defining all the concepts. However, a general-purpose data-miner is a generalized version of a thematic data-miner: if we add up a lot of thematic data-miners, we get a general-purpose data-miner, and inversely, if we divide up the different areas a general-purpose data-miner deals with, we get thematic data-miners. This is true in terms of what they accomplish, but the implementation of a general-purpose data-miner will differ greatly from that of a thematic data-miner. As a result, when we discuss thematic data-miners, we can keep in mind that general-purpose data-miners are a broader version of the same thing, at least in terms of what they achieve.</p>
<p>Thematic data-miners deal with a given cluster of concepts, that is, concepts which are logically related to each other. For example, if we are to extract real-estate details, then we have concepts of “type”, “bedroom number”, “picture”, and so on. All these concepts are interrelated; the cluster is defined by the real-estate entity we want to extract. If we speak of the “type”, we really mean “the type of the real estate” entity.</p>
<p>Narrow-spaced data-miners are data-miners which extract data from very specific data-sources, like a single website. These data-miners are always particular cases of thematic data-miners; however, they often contain a lot of hard-coded logic, so when the space of a narrow-spaced thematic data-miner is broadened, a lot of code refactoring is usually involved.</p>
<p>This article focuses on thematic data-miners, which could have thousands of different sources.</p>
<h4 id="13-the-human-element">1.3. The human element</h4>
<p>We need to pay attention to legality and to the ethical aspect. If data mining from the source is illegal, then we must avoid doing it. If it is against the will of those who own it, then it is unethical and we should avoid it. Sources should be at least neutral to our extraction, but, preferably happy about us extracting their data.</p>
<p>Why would the owner of a data source be against extracting their data? There can be many possible causes. For example, the owner might want to attract many human visitors to a website, who would visit to see their data and would be unhappy if another website were created and people would visit the new website instead of theirs.</p>
<p>But why would the owner of a data source be happy about extracting their data? It depends on the purpose of their data. If the purpose is to spread the information as far as possible, then data extractors are considered to be “helping hands”.</p>
<p>For example, if an estate agent in a small village wants to target a global audience, he/she might create a website and hope that people will see his/her data, but without advertisement and a lot of care to raise the popularity of the site, including design, SEO measures, and so on, people searching for real estate might never see the site. A large website which is searchable by region and shows extracted data could boost the estate agent’s business, especially if the agent’s data is publicized without him/her having to pay a penny. People will search for real estate in the region where the estate agent works and will find the big site showing the mined data, emphasizing the estate agent’s contact information.</p>
<p>So, to cope with all possible needs of data-source owners, the planner of a data-miner project might want to support listing results with or without details page. Imagine the case when one wants visitors to his/her website and would not like another site to be visited instead. In this case, one can tell him/her that, if he/she allows the extraction of data from his/her site, then they will appear in the list of results, but will not have a details page, instead, when the user clicks on such a result, the website of the owner of the data-source would be opened in a new tab. In this case we are giving them an attractive offer, so that our site will essentially help getting visitors to his/her site. And of course, for those who just want to share the information with as many people as possible, one could show a details page for his/her data. This is a simplified strategy, but illustrates the idea of how people can be convinced to allow us to extract their data.</p>
<h4 id="14-the-actual-problem-statement">1.4. The actual problem statement</h4>
<p>In this article we intend to have solid ideas of how hierarchical data can be extracted by a thematic data-extractor from data-sources where the owner is content with their data being extracted.</p>
<h3 id="2-the-extraction">2. The extraction</h3>
<p>Now we have hierarchies to work with, possibly many. Nodes can have the parent-child relation, but they can also have the ancestor-descendant relation. A is the ancestor of D if A is the parent of D, or if there is a sequence of nodes S<sub>1</sub>, …, S<sub>n</sub> where A is the parent of S<sub>1</sub>, S<sub>i</sub> is the parent of S<sub>i+1</sub> for each neighboring pair of the sequence, and S<sub>n</sub> is the parent of D. Consider this HTML code chunk:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-html" data-lang="html"><<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"dimensions"</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"large-width"</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"area"</span>>Area<<span style="color:#b06;font-weight:bold">span</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"dimension-data"</span>>500</<span style="color:#b06;font-weight:bold">span</span>> <<span style="color:#b06;font-weight:bold">span</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"unit"</span>>sqm</<span style="color:#b06;font-weight:bold">span</span>></<span style="color:#b06;font-weight:bold">div</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"height"</span>>Height<<span style="color:#b06;font-weight:bold">span</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"dimension-data"</span>>60</<span style="color:#b06;font-weight:bold">span</span>> <<span style="color:#b06;font-weight:bold">span</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"unit"</span>>m</<span style="color:#b06;font-weight:bold">span</span>></<span style="color:#b06;font-weight:bold">div</span>>
</<span style="color:#b06;font-weight:bold">div</span>>
</<span style="color:#b06;font-weight:bold">div</span>>
</code></pre></div><p>We see that the div having the class of dimensions is the parent of the div with the large-width class and the ancestor of the spans having the area and height classes, respectively. In the case of hierarchical data, a descendant is inside the ancestor; knowing this gives us a lot of context in many cases.</p>
<h4 id="21-semantic-concepts">2.1. Semantic concepts</h4>
<p>In hierarchical structures we have some nodes, a (usually small) subset of which contains data that interests us. For instance, in the example shown in section 2 we have a div which exists only to style the shown data: the div with the class of large-width. The useful information is in its descendants with the classes of area and height, whose values carry the class of dimension-data. The large-width div by itself holds no useful information outside those descendant nodes; frankly, it is just a styling node, which makes it irrelevant from our point of view. This means that the large-width div should not exist for us conceptually; we just need to know that we have a concept of dimensions, inside which we should be able to find the area and height. In terms of JavaScript selectors, we know we can find dimensions inside the document, and area and height inside it, like this:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-js" data-lang="js"><span style="color:#080;font-weight:bold">let</span> dimensions = [];
<span style="color:#080;font-weight:bold">let</span> dimensionContainers = <span style="color:#038">document</span>.querySelectorAll(<span style="color:#d20;background-color:#fff0f0">".dimensions"</span>);
<span style="color:#080;font-weight:bold">for</span> (<span style="color:#080;font-weight:bold">let</span> dimensionContainer <span style="color:#080;font-weight:bold">of</span> dimensionContainers) {
<span style="color:#080;font-weight:bold">const</span> dimension = {};
<span style="color:#080;font-weight:bold">let</span> areaContainer = dimensionContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".area"</span>);
<span style="color:#080;font-weight:bold">if</span> (areaContainer) {
<span style="color:#080;font-weight:bold">let</span> value = areaContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".dimension-data"</span>);
<span style="color:#080;font-weight:bold">let</span> unit = areaContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".unit"</span>);
<span style="color:#080;font-weight:bold">if</span> (value && unit) {
dimension.area = {value: value.innerText, unit: unit.innerText};
}
}
<span style="color:#080;font-weight:bold">let</span> heightContainer = dimensionContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".height"</span>);
<span style="color:#080;font-weight:bold">if</span> (heightContainer) {
<span style="color:#080;font-weight:bold">let</span> value = heightContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".dimension-data"</span>);
<span style="color:#080;font-weight:bold">let</span> unit = heightContainer.querySelector(<span style="color:#d20;background-color:#fff0f0">".unit"</span>);
<span style="color:#080;font-weight:bold">if</span> (value && unit) {
dimension.height = {value: value.innerText, unit: unit.innerText};
}
}
dimensions.push(dimension);
}
</code></pre></div><p>We immediately notice some details in the code:</p>
<ul>
<li>It does not deal with large-width at all, because it’s irrelevant; instead, it focuses only on nodes which are semantically relevant.</li>
<li>When a concept is found, its direct sub-concept is searched inside its structural context instead of the whole context, so we are particularizing the search.</li>
<li>Our code smartly searches for plural results in the case of dimensions and is aware that area and height are singular in their parent concept, the dimension.</li>
<li>The code does not assume the existence of any data.</li>
</ul>
<p>There are also problems to cope with:</p>
<ul>
<li>What happens if the structure changes over time? How can we maintain the code we have above?</li>
<li>The code above is based on empirical evidence, on the findings of a developer; this only proves that the hierarchy the code deals with exists, not that it is the only structure to work with.</li>
<li>How can we cope with composite data, like extracting something like “height: 60m”?</li>
<li>How can we cope with the variance of data, such as synonyms?</li>
<li>How can we cope with paging where applicable?</li>
</ul>
<p>All these are serious questions, which show that such concrete code will not solve the problems we face. We will need abstraction to progress beyond the limits of particularity and reach the level of a thematic data-miner. We have seen that in our examples we had a hierarchy of semantic concepts, and we can also observe the rule that an ancestor concept is an ancestor structurally as well. The algorithm above likewise follows a pattern of searching for the child concept in the context of the parent concept.</p>
<p>In reality we also have the problem of conceptual inconsistency, that is, the structure of the same data-source can vary, and in some cases items are at a totally different location in the structure. To give a practical example, let’s return to real-estate properties and their dimensions. It is possible that the height of a property makes it very attractive to buyers, so the developers of the data-source decide to show the height separately, in an emphasized manner at the top of the page, and not at its usual place. If we did not cope with this important detail, we would miss the height for exactly those properties where the height is the most important detail; we would miss the stars of the show.</p>
<h4 id="22-semantic-rules">2.2. Semantic rules</h4>
<p>In 2.1 we have seen how we can have an understanding about the conceptual essence. Naturally, the concepts are useful by themselves, but we need to build up a powerful structure, a signature of the concepts, which, if shown on a diagram would describe the essence of the structure, the conceptual essence in such a powerful way that one could understand it at a glance (unless there are too many nodes for a human, of course).</p>
<p>In order to gain the ability to build up such a powerful structure we need to find patterns, but in a more systematic way than in the naive example code above. Concepts can be described by rules; I call these semantic rules. Which attributes describe a concept?</p>
<ul>
<li><strong>parent concept(s):</strong> Which concepts can be the parents of the current concept? A special case is a root parent. Allowing multiple parents avoids duplication when the very same concept, in the very same substructure, can be present at various places.</li>
<li><strong>descendant(s):</strong> Useful to make the rule more specific and possibly filter out unneeded cases and thus increase performance.</li>
<li><strong>relative path:</strong> How can the given concept be found starting from the parent concept as root?</li>
<li><strong>plurality:</strong> Should we stop at the first match, or continue searching?</li>
<li><strong>excludes:</strong> Which concepts are excluded if this concept is matched? (this is a symmetrical relation)</li>
<li><strong>implies:</strong> What concepts imply this concept? (consequently, if this concept is not matched, the concepts which imply it can be excluded as well)</li>
<li><strong>value:</strong> Where the actual value of the concept can be found.</li>
</ul>
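<p>The attribute list above maps naturally onto a small record type. Here is a sketch of what such a rule might look like as a data structure (all names are illustrative, not part of any real library):</p>

```python
from dataclasses import dataclass, field

@dataclass
class SemanticRule:
    name: str
    parents: list          # concept names that may contain this one
    relative_path: str     # e.g. a CSS selector relative to the parent concept
    plural: bool = False   # stop at the first match, or collect all matches?
    descendants: list = field(default_factory=list)  # required sub-concepts
    excludes: list = field(default_factory=list)     # mutually exclusive concepts
    implied_by: list = field(default_factory=list)   # concepts implying this one
    value_path: str = ""   # where the actual value can be found

# A rule for the "area" concept from the earlier HTML example.
area = SemanticRule(
    name="area",
    parents=["dimensions"],
    relative_path=".area",
    value_path=".dimension-data",
)
print(area.plural)  # → False
```

<p>A tree of such rules is data, not code, so changing a relative path when a source’s markup changes does not require refactoring.</p>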
<p>If the structure of the data source changes over time, then occasionally the semantic rules will have to change as well. However, arguably in the majority of cases the relative path of the concepts will be the only thing to be changed in this case, which is much easier to maintain than to refactor code. If the relative path descriptor does not change and the virtual structure of our semantic rules remains similar, then our data-miner might just work well even if the data-source is changed.</p>
<p>For example, in the case of web-crawlers, writing the initial code-base is just a fraction of the long-term cost. The real financial burden is maintaining the crawlability of many thousands of different sites, all changing from time to time. With semantic rules, even if they are defined manually, maintainability is greatly simplified. And with many data-sources, a module which proposes new semantic rules would not hurt either.</p>
<p>Imagine the case when we already have a well-defined set of semantic rules, and a cron job detects that some data from a data source has not been found in the last few hours. It would search for the missing data in the data-source and, once found, generate the semantic rules for it and propose a new semantic tree of rules, which would run in parallel with the main semantic tree and store its data in some experimental place. When the human developers return and check the reports, they would see that the data-miner thinks the semantic tree needs to change and even has a proposal. If the proposal is right, nothing was missed: the extracted data is among the experimental data, and once the new semantic tree is accepted, the experimental data would override the actual data.</p>
<p>Of course, the engine will not be 100% accurate in such a case, its suggestion might be mistaken, or, even if accurate, slight adjustments might be helpful. A relative path might be less accurate than needed, or the concept in the changed structure might have a different plurality, but even if there is room for improvement, such an automatic feature, generating a new semantic tree would be extremely helpful and would make sure that data is accurately extracted even in the case of large changes.</p>
<p>If the owner of the data-source is cooperating, then he/she could provide the changed semantic tree.</p>
<p>The parser is the part of the project which handles composite data, but at the level of semantic rules the parser can be helped by a strategy for defining rules. One could define a syntax for the parser, so it would know which concepts are not atomic, along with some clues about how they should be parsed.</p>
<p>Synonyms could be dealt with via keyword clusters.</p>
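<p>One simple way to realize keyword clusters is to map every known variant to a canonical concept name; the clusters below are hypothetical examples:</p>

```python
# Hypothetical keyword clusters: each canonical concept maps to its synonyms.
clusters = {
    "area": {"area", "surface", "floor space"},
    "height": {"height", "elevation"},
}

# Invert into a lookup table from any variant to the canonical concept.
canonical = {word: concept for concept, words in clusters.items() for word in words}

print(canonical["floor space"])  # → area
print(canonical["elevation"])   # → height
```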
<p>Navigation can be handled by a module created for this purpose; we can call it the navigation module. The algorithm below describes how data-mining should occur:</p>
<pre tabindex="0"><code>initialNavigation()
while (page ← page.next())
for each element in elements
for each node in semanticTree // breadth-first or depth-first traversing
if (applicable(node)) then extract(node)
end for
end for
end while
finalNavigation()
</code></pre><p>Before we move on to deal with semantic trees, we need to solve the problem of conceptual inconsistencies. We have seen that concepts can have multiple conceptual parents, but this is not inconsistent, nor does it violate the criteria of a tree. In reality, for each element a concept will have a single parent, but it is perfectly possible that a given concept will have one parent for one element and a different parent for another element. This is what we described with the possibility of a concept having one of many possible parents per element. However, even though we have a very good explanation for a concept having more parents, we still have to deal with the possibility of a concept being present twice for the same element. How is this possible?</p>
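<p>With the navigation and the semantic tree stubbed out, the pseudocode above can be realized as the following sketch (all names, and the flat rule list standing in for the tree, are illustrative simplifications):</p>

```python
# A runnable sketch of the mining loop, with pages and rules stubbed out.
def mine(pages, semantic_tree, applicable, extract):
    results = []
    for page in pages:                  # while (page ← page.next())
        for element in page:            # for each element in elements
            for rule in semantic_tree:  # traverse the semantic tree
                if applicable(rule, element):
                    results.append(extract(rule, element))
    return results

# Two fake "pages", each holding one element (a dict of found fields).
pages = [[{"area": "500 sqm"}], [{"height": "60 m"}]]
rules = ["area", "height"]
found = mine(
    pages, rules,
    applicable=lambda rule, el: rule in el,
    extract=lambda rule, el: (rule, el[rule]),
)
print(found)  # → [('area', '500 sqm'), ('height', '60 m')]
```

<p>In a real implementation the page iterator would be the navigation module, and the traversal would walk the actual rule tree breadth-first or depth-first.</p>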
<p>Let’s consider the example of books displayed on a website. A book may have main author(s) and secondary author(s). It would not be surprising at all if the main authors are displayed differently from the secondary authors. Also, a main picture of the book’s cover might be displayed, with maybe some smaller images elsewhere, shown in a gallery. This would be a nice feature on a site which shows children’s books. However, from our perspective this means that the same concept is placed at several places at the same time. One may think of quantum mechanics, where time-place discrepancies are also possible.</p>
<p>How can we solve this problem for ourselves? We need to have an understanding, otherwise the whole thought process will go astray. My solution is to differentiate the concepts in the different stages of extraction and parsing. This would mean that the concept of “author” or “picture” is conceptually unique when we store it, even though these might be plural. More elements do not necessarily mean more concepts. On the other hand, at the time of extraction, the main picture is a different concept than the other pictures, also, the main authors are a different concept from the secondary authors. The moment of merging happens at the point when we store these and therefore have to convert the extraction concepts into the concepts we store.</p>
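<p>The merge at storage time can be sketched like this (the mapping from extraction concepts to stored concepts is a hypothetical example following the author/picture case above):</p>

```python
# Hypothetical mapping: distinct extraction-time concepts collapse into a
# single stored concept at the moment of storing.
EXTRACTION_TO_STORAGE = {
    "main-author": "author",
    "secondary-author": "author",
    "main-picture": "picture",
    "gallery-picture": "picture",
}

def merge_for_storage(extracted):
    """Convert {extraction_concept: [values]} into {storage_concept: [values]}."""
    stored = {}
    for concept, values in extracted.items():
        target = EXTRACTION_TO_STORAGE.get(concept, concept)
        stored.setdefault(target, []).extend(values)
    return stored
```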
<h4 id="23-semantic-tree">2.3. Semantic tree</h4>
<p>The semantic rules we define have parent-child and ancestor-descendant relations. The semantic tree is the blueprint of the conceptual essence: a plan for extracting data. Its attributes also instruct the extractor about which concept should be found where, how the extractor should operate to achieve good performance, and so on.</p>
<p>However, the entities to extract can vary greatly, and there might be cases when seemingly the same concept is distributed among various places. The word “seemingly” here means that even though in the end these concepts will be merged, during the actual data-mining we view them as similar but distinct concepts. The fact that a conceptual node might have several parents in the semantic rules only means that one of the parents from the list is to be expected; any of the listed parents is possible. Consider a semantic rule which says that the parent of currency can be the description or the price:</p>
<pre tabindex="0"><code>parent: description,price
</code></pre><p>This means that the parent can be any of the elements of the list; that is, for some elements the parent will be the description, while for other elements it will be the price, so by design we do not violate our aim of having a semantic tree.</p>
<p>We have to watch out for cycles though. Without additional tools there is no protection against cycles, which we might accidentally introduce while defining the semantic rules, or which our semantic rule generator might add while doing its magic. However, it makes sense to check whether there is a cycle in the semantic tree. If this is done automatically, all the better.</p>
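<p>Since a concept may list several possible parents, the cycle check has to follow every listed parent. A minimal sketch of such a check (the rule representation as a dictionary of parent lists is an assumption made for illustration):</p>

```python
def has_cycle(rules):
    """rules: {concept: [possible parents]}. True if any cycle exists when
    every listed parent is treated as a directed edge."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {c: WHITE for c in rules}

    def visit(node):
        color[node] = GRAY
        for parent in rules.get(node, []):
            if color.get(parent, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(parent, WHITE) == WHITE and parent in rules and visit(parent):
                return True
        color[node] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in rules)
```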
<p>However, since the description of a semantic tree can define multiple disjunct parents in order to cope with the actual tree of concepts for all elements (at least those whose pattern is known), the semantic tree is in reality a semantic tree pattern. When we use it, or when we search for cycles, we will need to traverse all the possible parents wherever more than one is listed.</p>
<p><img src="/blog/2019/03/the-flow-of-hierarchical-data-extraction/SemanticDataExtractorFigure2.png" alt="" title="Semantic tree example"></p>
<p>Consider the example semantic tree above.</p>
<p>We can see that we have a single main concept, “REAL-ESTATE”. This is not a general pattern: there might be several main concepts to extract from the same data-source, but for the sake of simplicity we have a single one. Implicitly, the technical implementation involves a single abstract root, which is the parent of all the main concepts. We see that “TYPE” is plural for “REAL-ESTATE”, but “BEDROOM #” is singular. The conceptual reason is that we defined type in such a way that not all pairs of types are mutually exclusive, so if a type is found, for example “Bungalow”, the engine should not stop searching for other types. But if the number of bedrooms is found, there is no point in searching further for it, because it is safe to assume that even if there are several different occurrences of this concept, they will be equivalent.</p>
<p>Why is “CURRENCY” special? It has two potential parents: “DESCRIPTION” or “PRICE”. In some cases it can be found inside “DESCRIPTION”, in other cases inside “PRICE”. For example, if there are several values for “PRICE”, then “CURRENCY” is inside “DESCRIPTION”, but otherwise “CURRENCY” is inside “PRICE”.</p>
<p>But why could a real-estate property have several different prices? Well, this is outside the scope of data-mining, but to get an understanding it is good to consider a viable example. Let’s consider an estate agent who gets a 20% bonus if he/she successfully sells a given property within a month. In this case, the agent might want to draw buyers by putting a discount on the property for a month; if he/she succeeds in selling it, there will still be a nice bonus. Given this economic mechanism, the data-source might show the prices in a special way when there is such a discount, like:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-html" data-lang="html"><<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"description"</span>>
<<span style="color:#b06;font-weight:bold">table</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"prices"</span>>
<<span style="color:#b06;font-weight:bold">tr</span>>
<<span style="color:#b06;font-weight:bold">td</span>>
<<span style="color:#b06;font-weight:bold">p</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"price"</span>><<span style="color:#b06;font-weight:bold">span</span>>100000</<span style="color:#b06;font-weight:bold">span</span>></<span style="color:#b06;font-weight:bold">p</span>>
<<span style="color:#b06;font-weight:bold">p</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"price red"</span>><<span style="color:#b06;font-weight:bold">span</span>>90000</<span style="color:#b06;font-weight:bold">span</span>> 10% discount!</<span style="color:#b06;font-weight:bold">p</span>>
</<span style="color:#b06;font-weight:bold">td</span>>
<<span style="color:#b06;font-weight:bold">td</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"currency"</span>>$</<span style="color:#b06;font-weight:bold">div</span>>
</<span style="color:#b06;font-weight:bold">td</span>>
</<span style="color:#b06;font-weight:bold">tr</span>>
</<span style="color:#b06;font-weight:bold">table</span>>
<span style="color:#888"><!-- … --></span>
</<span style="color:#b06;font-weight:bold">div</span>>
</code></pre></div><p>while, if there is a single price, a different structure is generated:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-html" data-lang="html"><<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"description"</span>>
<<span style="color:#b06;font-weight:bold">p</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"price"</span>>
<<span style="color:#b06;font-weight:bold">span</span>>100000</<span style="color:#b06;font-weight:bold">span</span>>
<<span style="color:#b06;font-weight:bold">div</span> <span style="color:#369">class</span>=<span style="color:#d20;background-color:#fff0f0">"currency"</span>>$</<span style="color:#b06;font-weight:bold">div</span>>
</<span style="color:#b06;font-weight:bold">p</span>>
</<span style="color:#b06;font-weight:bold">div</span>>
</code></pre></div><p>We notice again that the possibility of multiple parents does not mean that any extracted element will have multiple parents; it just describes that among the many elements, some items will have “PRICE” as parent and others will have “DESCRIPTION” as parent.</p>
<p>If we take a look at “DIMENSIONS”, we will see several concepts with the same name (“VALUE” and “UNIT”), but they have a different meaning in their specific context (“AREA” and “HEIGHT”, respectively).</p>
<p>Another interesting region is “CONTACT”, which has “PERSON” and “COMPANY” elements as conceptual children, and both “PERSON” and “COMPANY” are plural. The underlying logic is that several companies or persons can be contacted when one wants to buy or view a real-estate property. We have the sub-concepts “FACEBOOK”, “EMAIL”, “PHONE”, “NAME”, “OTHER” and “WEBSITE” for both “PERSON” and “COMPANY”, but similarly to the example we saw with “CURRENCY”, here the concepts have different meanings in their different contexts. A corporate website is a different thing from a personal website.</p>
<p>However, if we draw or generate such a semantic tree, it is better than long documentation. It describes for coders the exact way the engine will operate, and when the engine suggests a new semantic tree for some reason, then, provided that the engine generates a picture of the tree, one will immediately understand the essence of the engine’s suggestion. Also, with such a nice diagram, managers will understand the mechanism of the system at a glance.</p>
<h4 id="24-parallelization">2.4. Parallelization</h4>
<p>It is feasible to send several requests in parallel, which could happen using promises and the event queue in the case of JavaScript, or multiple threads in a multi-threaded environment.</p>
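<p>As a sketch of the multi-threaded variant in Python, a thread pool can issue several page requests at once; <code>fetch_page</code> here is a placeholder standing in for a real HTTP call, so the example stays self-contained:</p>

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    # Placeholder for a real HTTP request (e.g. via urllib or requests).
    return f"<html>content of {url}</html>"

def fetch_all(urls, max_workers=4):
    """Fetch all pages concurrently, preserving input order in the result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, urls))
```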
<h3 id="3-symbiosis">3. Symbiosis</h3>
<p>If the owner of the data-source is happy and supportive about their data being extracted, they might notify the maintainers of the data-miner whenever structural changes occur, or they can provide an API from which data can be extracted, for example a large JSON. However, this JSON will probably be hierarchical as well.</p>
<p>In some extremely lucky cases the owner of the data-source will be happy to provide and maintain the semantic rules. This could happen when “spreading the word” via a data-miner is deeply valued by the owner of the data-source. The key is to have an offer which helps reach the goals of the data-source, so its owner will see the data-miner as an extended hand rather than a barrier to reaching those goals.</p>
<h3 id="4-parsing-the-data">4. Parsing the data</h3>
<p>Once the data has been successfully extracted, the results can be parsed just before being stored. For example, in some cases we might have textually composite data in the same node which is impossible to separate via the semantic tree, since the tree needs leaf nodes of the original structure as atoms. So in many cases a separate layer is needed to decompose composite textual data which holds data applicable to different concepts.</p>
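<p>A hedged sketch of such a decomposition layer: here a regex splits a composite price string (in the style of the earlier discount example) into a numeric price and an optional discount percentage. The exact text format is an assumption for illustration.</p>

```python
import re

def parse_composite_price(text):
    """Split e.g. '90000 10% discount!' into price and discount parts."""
    match = re.match(r"\s*(\d+)\s*(?:(\d+)%\s*discount)?", text)
    if not match:
        return None
    price = int(match.group(1))
    discount = int(match.group(2)) if match.group(2) else None
    return {"price": price, "discount_percent": discount}
```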
<p>Also, if for some technical reasons the semantic tree split a concept into different concepts (for example main authors and secondary authors), then the data-parser can merge the concepts which belong together into a single concept. There are many possible jobs the data-parser might fulfill, depending on the actual needs.</p>
<h3 id="5-analyzing-the-data">5. Analyzing the data</h3>
<p>Let’s assume that we have a very nice schema and we store our data efficiently. Still, we might be interested in what patterns we can find in that data. The patterns we are interested in finding include:</p>
<ul>
<li><a href="http://www.vldb.org/conf/1994/P487.PDF">association rules</a></li>
<li><a href="https://opentextbc.ca/dbdesign01/chapter/chapter-11-functional-dependencies/">functional dependencies</a></li>
<li>conditional functional dependencies (a functional dependency upon the table or cluster records provided a condition is met)</li>
</ul>
<p><strong>AR (Association Rule):</strong> <code>c => {v1, …, vn}</code></p>
<p>If a certain condition (c) is fulfilled, then we have a set of constant values for their respective (database table) columns.</p>
<p>For example, let’s consider the table:</p>
<pre tabindex="0"><code>person(id, is_alive, has_right_to_vote, has_valid_passport)
</code></pre><p>Now, we can observe that:</p>
<pre tabindex="0"><code>(is_alive = 0) => ((has_right_to_vote = 0) ^ (has_valid_passport = 0))
</code></pre><p>So this is an association rule with the condition is_alive = 0 (the person is deceased), and we know for a fact that dead people cannot vote and that their passports are invalid.</p>
<p>When we extract data from a source, there might be some association rules (field values associated with a condition) that we do not know about yet. Finding them out will help us a lot. For instance, imagine the case when, for whatever reason, an insert/update is attempted with the values:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">is_alive = <span style="color:#00d;font-weight:bold">0</span>
has_right_to_vote = <span style="color:#00d;font-weight:bold">1</span>
</code></pre></div><p>In this case we can throw an exception; this way we can find bugs in the code or mistakes in the semantic tree. This kind of inconsistency prevention is useful even outside of data-mining, but in the context of this article it is extremely useful, as it might detect problems in the semantic tree automatically.</p>
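<p>A minimal sketch of such a guard, encoding the is_alive rule from above (the record representation as a plain dictionary is an assumption for illustration):</p>

```python
class AssociationRuleViolation(Exception):
    pass

def ar_violations(record, condition, implied_values):
    """Columns whose values contradict the rule, given the condition holds."""
    if not condition(record):
        return []
    return [col for col, value in implied_values.items() if record.get(col) != value]

def guard_insert(record, condition, implied_values):
    """Raise instead of accepting a record that violates an accepted AR."""
    bad = ar_violations(record, condition, implied_values)
    if bad:
        raise AssociationRuleViolation(f"rule violated on columns: {bad}")

# (is_alive = 0) => ((has_right_to_vote = 0) ^ (has_valid_passport = 0))
DEAD_PERSON_RULE = (
    lambda r: r["is_alive"] == 0,
    {"has_right_to_vote": 0, "has_valid_passport": 0},
)
```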
<p><strong>FD (Functional Dependency):</strong> <code>S → D</code></p>
<p>The column-set S (Source):</p>
<pre tabindex="0"><code>S = {S1, …, Sm}
</code></pre><p>determines the column-set D (Destination):</p>
<pre tabindex="0"><code>D = {D1, …, Dn}
</code></pre><p>The formula more explicitly looks like this:</p>
<pre tabindex="0"><code>{S1, …, Sm} → (D1, …, Dn)
</code></pre><p>This relation means that if we have two different records/entities with the same source values:</p>
<pre tabindex="0"><code>Source1 = Source2 = {s1, …, sm}
</code></pre><p>then their destination will match as well:</p>
<pre tabindex="0"><code>Destination1 = Destination2 = {d1, …, dn}
</code></pre><p>Inversely this is not necessarily true. If two records/entities have the same destination values, then the functional dependency does not require them to have the very same sources.</p>
<p><strong>CFD (Conditional Functional Dependency):</strong> <code>c => S → D</code></p>
<p>A CFD is a generalization of an FD. It adds a condition to the formula, so that the functional dependency’s applicability is only guaranteed if the condition is met. We can describe functional dependencies as particular conditional functional dependencies where the condition is inherently true:</p>
<pre tabindex="0"><code>(true => S → D) <=> S → D
</code></pre><p>Also, an AR can be described as</p>
<pre tabindex="0"><code>c => {} → D
</code></pre><h4 id="51-the-more-useful-mu-relation">5.1. The More Useful (MU) relation</h4>
<p>Let’s consider that we have two patterns, P1 and P2, which could be ARs, FDs or CFDs. Is there a way to determine which of them is more useful? Generally speaking:</p>
<p>P1 MU P2 if and only if P1 is more general than P2.</p>
<p>Since both ARs and FDs are particular cases of CFDs, we will work with the formula of CFDs:</p>
<pre tabindex="0"><code>P1 = (c1 => S1 → D1)
P2 = (c2 => S2 → D2)
P1 MU P2 <=> ((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))
</code></pre><p>Note that MU is reflexive, transitive, antisymmetrical and has a neutral element.</p>
<p>Reflexive: <code>(c1 => c1) ^ (S1 ⊆ S1) ^ (D1 ⊇ D1)</code> is trivially true.</p>
<p>Transitive:</p>
<p>Let’s suppose that</p>
<pre tabindex="0"><code>((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))
((c3 => c2) ^ (S2 ⊆ S3) ^ (D2 ⊇ D3))
</code></pre><p>is true. Is</p>
<pre tabindex="0"><code>((c3 => c1) ^ (S1 ⊆ S3) ^ (D1 ⊇ D3))
</code></pre><p>also true?</p>
<p>Since <code>c3 => c2 => c1</code>, due to the transitivity of the implication relation we know that <code>c3 => c1</code>.</p>
<p>Since <code>S1 ⊆ S2 ⊆ S3</code>, due to the transitivity of the subset relation we know that <code>S1 ⊆ S3</code>.</p>
<p>Since <code>D1 ⊇ D2 ⊇ D3</code>, due to the transitivity of the superset relation we know that <code>D1 ⊇ D3</code>.</p>
<p>The three transitivities together prove that MU is transitive.</p>
<p>Neutral element (the least useful):</p>
<pre tabindex="0"><code>false => All columns → {}
</code></pre><p><code>false => c</code> is always true for any condition c, the set of all columns is a superset of every possible source set, and <code>{}</code> is a subset of every destination set, so every destination is a superset of it.</p>
<p>Antisymmetrical:</p>
<pre tabindex="0"><code>If P1 MU P2 and P2 MU P1, then P1 <=> P2.
P1 MU P2:
((c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2))
P2 MU P1
((c1 => c2) ^ (S2 ⊆ S1) ^ (D2 ⊇ D1))
P1 MU P2 ^ P2 MU P1 <=>
((c2 => c1) ^ (c1 => c2)) ^
((S1 ⊆ S2) ^ (S2 ⊆ S1)) ^
((D1 ⊇ D2) ^ (D2 ⊇ D1)) <=>
(c1 <=> c2) ^ (S1 = S2) ^ (D1 = D2) <=>
P1 <=> P2
</code></pre><p>so the relation is antisymmetrical:</p>
<pre tabindex="0"><code>((P1 MU P2) ^ (P2 Mu P1)) <=> (P1 <=> P2)
</code></pre><p>This means that MU is a partial order, so the set of patterns forms a poset (partially ordered set), and all the algebra applicable to partially ordered sets in general can be used to analyze MU as well.</p>
<p>The importance of the MU relation is that we can search for such patterns in an ordered manner, starting from the most useful candidate we consider, and composing S and decomposing D into less useful candidates whenever a candidate proves to be false. If we know that a pattern is accurate, we also know that all less useful patterns are accurate as well. We can start our search from a useful pattern candidate, but not necessarily from the most useful, since, intuitively, it is not very probable that all the columns will invariably have the same values across all records; that would defeat the purpose of storing so many values. This means that defining the most useful <em>possible</em> patterns would make sense.</p>
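<p>The MU test can be written down directly from the formula above. In this sketch a pattern is a (condition, source columns, destination columns) triple, and a condition is represented as a frozenset of (column, value) equality constraints, so “c2 implies c1” becomes “c1’s constraints are a subset of c2’s”. This representation is an assumption made for illustration.</p>

```python
def more_useful(p1, p2):
    """P1 MU P2  <=>  (c2 => c1) ^ (S1 ⊆ S2) ^ (D1 ⊇ D2)."""
    c1, s1, d1 = p1
    c2, s2, d2 = p2
    # With conditions as conjunctions of equality constraints,
    # c2 => c1 holds exactly when c1's constraints are a subset of c2's.
    return c1 <= c2 and s1 <= s2 and d1 >= d2
```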
<h4 id="52-mu-lattice">5.2. MU Lattice</h4>
<p>Possible patterns can be represented using a <a href="http://mathworld.wolfram.com/LatticeTheory.html">lattice</a>, where the root would be the most useful node and the leaf the least useful one. We have a join and a meet operation, under which the set of patterns is closed.</p>
<pre tabindex="0"><code>P1 join P2 = (c1 v c2) => (S1 ⋂ S2) → (D1 ⋃ D2)
P1 meet P2 = (c1 ^ c2) => (S1 ⋃ S2) → (D1 ⋂ D2)
</code></pre><p>Of course:</p>
<pre tabindex="0"><code>(P1 join P2) MU (P1 meet P2)
</code></pre><p>Proof:</p>
<pre tabindex="0"><code>(c1 ^ c2) => (c1 v c2) is trivially true and
(S1 ⋂ S2) ⊆ (S1 ⋃ S2) is trivially true and
(D1 ⋃ D2) ⊇ (D1 ⋂ D2) is trivially true.
</code></pre><p>We can split the lattice into many different simple lattices, each having its own condition. Since an AR never has source columns, it cannot be less useful than a non-AR CFD. Also, since an FD has a condition which is implied by any possible other condition, FDs are never less useful than CFDs with real conditions.</p>
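<p>The join and meet formulas translate directly under the same sketch representation (conditions as frozensets of equality constraints, patterns as (condition, sources, destinations) triples). Note one simplification made here: the disjunction of two conjunctive conditions is approximated by keeping only their shared constraints, a weaker condition implied by both. This encoding is an assumption for illustration, not the only possible one.</p>

```python
def join(p1, p2):
    (c1, s1, d1), (c2, s2, d2) = p1, p2
    # (c1 v c2) => (S1 ⋂ S2) → (D1 ⋃ D2); "or" approximated by shared constraints
    return (c1 & c2, s1 & s2, d1 | d2)

def meet(p1, p2):
    (c1, s1, d1), (c2, s2, d2) = p1, p2
    # (c1 ^ c2) => (S1 ⋃ S2) → (D1 ⋂ D2); "and" is the union of constraints
    return (c1 | c2, s1 | s2, d1 & d2)
```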
<h4 id="53-domain-of-search">5.3. Domain of search</h4>
<p>The domain of search can vary. It can be limited to a single table, or to a cluster of tables related to each other in a one-to-one, one-to-many, many-to-one or many-to-many manner. The condition can be handled in a simplistic way, checking only equality on some columns, or it can be complex, considering several columns, even in a cross-table manner, using several possible operators. This depends on the kinds of patterns we intend to find, the cumulative power of resources, the ability of the development team to work out something serious, time, and, yes, money.</p>
<h4 id="54-differentiation-between-patterns-and-reality">5.4. Differentiation between patterns and reality</h4>
<p>We can have a pattern P which was automatically found. We only know that there is no counter-example to the pattern P, or, if we have a level of tolerance, that there were not enough counter-examples for us to discard the pattern. So P appears to be true. But is appearance equivalent to truth in this case? As a matter of fact, nature produces infinitely many examples of seemingly impossible occurrences or highly improbable coincidences.</p>
<p>It is a common fallacy to concentrate on a pattern and, due to the improbability of the result being a mere coincidence, to exclude the possibility that it was just a coincidence. Indeed, this is the so-called <a href="https://www.logicallyfallacious.com/tools/lp/Bo/LogicalFallacies/175/Texas-Sharpshooter-Fallacy">Texas sharpshooter</a> fallacy, even though in our specific case it is committed unconsciously.</p>
<p>If I have a die with 6 sides, each having a number from 1–6, and I toss it 1000 times and the result is always six, then I will have the natural feeling that something must not be right. I might be divinely favored, or the stars are lining up in my favor, but in this case I’m ignoring the fact that there is no connection between the results of the tosses, or, in other words, that my experiments are independent from each other. I could calculate that getting a result of 6 all 1000 times has a probability of 1 / 6^1000, which is quasi-impossible. Yet, it is only quasi-impossible and not actually impossible.</p>
<p>If I toss the same die 1000 times randomly and get a sequence of 1000 numbers, then I could calculate the probability of my random, not special sequence of 1000 elements occurring in the exact same way it occurred. And, surprise, surprise, the result will be exactly 1 / 6^1000. But if the results of the sequence are varied, I do not feel the results to be special. By the same token, getting a result of 6 for 1000 tosses is not special at all either.</p>
<p>There is no mathematical difference between the probability of tossing a die 1000 times and getting only sixes and tossing it 1000 times and getting any other specific sequence of 1000 elements you might choose. The probability of each exact sequence is the same before I start tossing. The difference between the two sequences is the meaning that I, as a person, attribute to one of them.</p>
<p>Also, if I win the lottery, I might think about the probability of my choice of numeric combination being correct and feel that I’m especially lucky, but if I calculate the probability of the actual result when I do not win, I will not find any difference in the probability itself. Yet people tend to calculate the chances of the case when they get lucky, but not the chances when they are not. The low probability of a given event is only special because we are interested in it; we can find infinitely many similarly low-probability events happening all the time, but we are just not interested enough in the majority of such events to analyze them.</p>
<p>But let’s take this example further. Before I toss the die 1000 times I do not know the exact sequence I will get in advance. In fact, it is almost impossible to guess it, but I know that whatever the sequence will be, its a priori probability is practically zero.</p>
<p>This means that whenever we do not attribute a meaning to a pattern, we are inclined not even to consider that it might be the nature of how things are. Yet if we do attribute a meaning to a pattern, we might fallaciously fail to understand that it was a mere coincidence when it happens not to be a natural rule of how things are according to our understanding.</p>
<p>If we find a pattern with a tool, we get enthusiastic and we almost want it to be the nature of things, but we need to be much more rigorous before we factually accept a pattern. Consider the example of primes. How many primes are there? The answer is simple: infinitely many.</p>
<p>Proof (reductio ad absurdum):</p>
<p>Let’s assume that there is a finite number n of primes, and there are no others:</p>
<pre tabindex="0"><code>p1, …, pn
</code></pre><p>Now, let’s consider the number:</p>
<pre tabindex="0"><code>N = (p1 * … * pn) + 1
</code></pre><p>We know that N is not divisible by any of p1, …, pn (dividing by each leaves remainder 1), so there are two cases: N is either a prime or a composite number. If N is a prime, then we have found a new prime. Otherwise, if N is composite, then it is divisible by at least one prime which is not among p1, …, pn. So in either case we find a new prime, and therefore there are infinitely many prime numbers.</p>
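<p>The proof can be replayed concretely: multiply a finite list of primes, add one, and factor the result; the smallest factor is always a prime missing from the list.</p>

```python
def smallest_prime_factor(n):
    """Trial division: the smallest divisor > 1 of n is always prime."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n itself is prime

def new_prime_outside(primes):
    """Given a finite list of primes, produce a prime not in the list."""
    n = 1
    for p in primes:
        n *= p
    return smallest_prime_factor(n + 1)
```

<p>For example, from [2, 3, 5, 7, 11, 13] we get 30031 = 59 × 509, and 59 is a prime outside the list.</p>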
<p>How many primes are even? Exactly one: the number 2. Now, if we have a huge set of primes among which we do not find 2, and we do not know that 2 is a prime, we might be inclined to think that primes can only be odd numbers, which is of course wrong. If we pick a prime randomly from a huge set of primes, the chance that it will be exactly 2 is minuscule. However, if a human has to pick a prime number, a human will know only a few primes, and 2 is the “first”, so among the primes 2 is one that has a high chance of being chosen by a human.</p>
<p>The point of all this contemplation is that if something is very frequent or highly probable, it is not necessarily true. When we take a look at a pattern, it is good to be very critical about it and think about how that pattern could fail, and what the consequences would be if we assumed the pattern to be accurate, yet it left us in trouble at the most inappropriate time.</p>
<h4 id="55-usage-of-factually-validated-patterns">5.5. Usage of factually validated patterns</h4>
<p>Now, if we accept a rule to be factually accurate, then we might want to make sure that it is respected. Assuming that</p>
<pre tabindex="0"><code>c => S → D
</code></pre><p>is accurate, we also assume that if the condition is met and there is already a record having the source values s1, …, sm, then inserting/updating another record with the same source values but different destination values leads to an error. Let’s suppose that we throw an exception when an accepted pattern is about to be violated. If many such exceptions are thrown, then we have a problem. The problem could be that:</p>
<ul>
<li>the semantic tree might have wrong/deprecated rules</li>
<li>the older records might be broken</li>
<li>the pattern might be no longer valid, or, it might have been wrongly accepted in the first place</li>
</ul>
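<p>A minimal sketch of enforcing an accepted FD at insert time, as described above: remember the destination seen for each source tuple and raise when a later record disagrees. The column names are illustrative.</p>

```python
class PatternViolation(Exception):
    pass

class FdGuard:
    """Enforce an accepted functional dependency source_cols → dest_cols."""

    def __init__(self, source_cols, dest_cols):
        self.source_cols = source_cols
        self.dest_cols = dest_cols
        self.seen = {}  # source tuple -> destination tuple

    def check(self, record):
        source = tuple(record[c] for c in self.source_cols)
        dest = tuple(record[c] for c in self.dest_cols)
        if source in self.seen and self.seen[source] != dest:
            raise PatternViolation(
                f"{source!r} already maps to {self.seen[source]!r}, got {dest!r}"
            )
        self.seen[source] = dest
```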
<p>See? By analyzing the data we can add some features of machine learning, so our data-miner will really rule the problem space it is responsible for. Naturally, such patterns can also be used at insertion and update, when we do not get some of the destination values but have a pattern from which we can deduce them. A tableau of at least the most frequent source values and conditions could also come to our help.</p>
<p>Knowledge is power. A failure outside the limits of what we perceive could result in many months of gibberish data. But instead of that pain, we could instantly know when a problem appears, and if we have some helping robotic hands, even if they are only virtual, at the end of the day we will rarely be alerted with urgencies. And such patterns deepen our understanding of the data we work with, even if they cannot be accepted as a general rule. With better understanding we will have better ideas. With better ideas we will have better features and performance. With better features and performance we will have more patterns. And with more patterns, we deepen our understanding further.</p>
<h3 id="6-the-flow">6. The flow</h3>
<p><img src="/blog/2019/03/the-flow-of-hierarchical-data-extraction/SemanticDataExtractorFigure3.png" alt="" title="The flow"></p>
<p>This diagram is an idealized representation of the flow. In reality we might have several different cron jobs, we might be working with threads in an asynchronous manner, and the parser is invoked much more frequently than just at the end of the whole extraction, because we do not have infinite resources. All these nuances would complicate the diagram immensely.</p>
Speech Recognition from scratch using Dilated Convolutions and CTC in TensorFlowhttps://www.endpointdev.com/blog/2019/01/speech-recognition-with-tensorflow/2019-01-08T00:00:00+00:00Kamil Ciemniewski
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/25928285337_50483f3619_o.jpg" alt="Sound visualization" /><br><a href="https://www.flickr.com/photos/williamismael/25928285337/">Image by WILL POWER · CC BY 2.0, cropped</a></p>
<p>In this blog post, I’d like to take you on a journey. We’re going to get a speech recognition project from its architecting phase, through coding and training. In the end, we’ll have a fully working model. You’ll be able to take it and run the model serving app, exposing a nice HTTP API. Yes, you’ll even be able to use it in your projects.</p>
<p>Speech recognition has been among the hardest tasks in Machine Learning. Traditional approaches involve meticulous crafting and extraction of the audio features that separate one phoneme from another. To be able to do that, one needs a deep background in data science and signal processing. The complexity of the training process prompted teams of researchers to look for alternative, more automated approaches.</p>
<p>With the growing development of Deep Learning, the need for handcrafted features declined. The training process for a neural network is much more streamlined. You can feed the signals either in their raw form or as their spectrograms and watch the model improve.</p>
<p>Did this get you excited? Let’s start!</p>
<h3 id="project-plan-of-attack">Project Plan of Attack</h3>
<p>Let’s build a web service that exposes an API. Let it be able to receive audio signals, encoded as an array of floating point numbers. In return, we’re going to get the recognized text.</p>
<p>Here’s a rough plan of the stages we’re going to go through:</p>
<ol>
<li>Get the dataset to train the model on</li>
<li>Architect the model</li>
<li>Implement it along with the unit tests</li>
<li>Train it on the dataset</li>
<li>Measure its accuracy</li>
<li>Serve it as a web service</li>
</ol>
<h4 id="the-dataset">The dataset</h4>
<p>The open-source community has a lot to thank the <a href="https://foundation.mozilla.org/">Mozilla Foundation</a> for. It hosts many projects, with the wonderful, free Firefox browser at the forefront. One of its other projects, called <a href="https://voice.mozilla.org">Common Voice</a>, focuses on gathering large datasets to be used by anyone in speech recognition projects.</p>
<p>The datasets consist of wave files and their text transcriptions. There’s no notion of time-alignment. It’s just the audio and text for each utterance.</p>
<p>If you want to code along, head over to <a href="https://voice.mozilla.org/pl/datasets">the Common Voice datasets download page</a>. Be warned that the download weighs roughly 12GB.</p>
<p>After the download, simply extract the files from the archive into the <code>./data</code> directory at the root of the project. The files, in the end, should reside under the <code>./data/cv_corpus_v1/</code> path.</p>
<p>How much data should we have? It always depends on the challenge at hand. Roughly speaking, the more difficult the task, the more powerful your neural network needs to be, as it has to be capable of expressing more complex patterns in the data. The more powerful the network, though, the easier it is for it to simply memorize the training examples. This is highly undesirable and results in overfitting. To lessen its tendency to do so, you need to either augment your data on the fly randomly or gather more “real” examples. In this project, we’re going to do both. Data augmentation will be covered in the coding section. The additional datasets we’ll use are the well-known <a href="http://www.openslr.org/12/">LibriSpeech</a> (<a href="http://www.openslr.org/resources/12/train-clean-360.tar.gz">the file to download, around 23GB</a>) and <a href="http://voxforge.org">VoxForge</a> (<a href="https://s3.us-east-2.amazonaws.com/common-voice-data-download/voxforge_corpus_v1.0.0.tar.gz">the file to download</a>).</p>
<p>Those two datasets are among the most popular that are freely available. There are others I chose to omit as they weigh quite a lot. I was already almost out of free space after the download and preprocessing of the three sets chosen above.</p>
<p>You need to download both Libri and Vox and extract them under <code>./data/LibriSpeech/</code> and <code>./data/voxforge/</code>.</p>
<h3 id="background-on-audio-processing">Background on audio processing</h3>
<p>In order to build a working model, we need some background in signal processing. Although a lot of the traditional work is going to be done by the neural network automatically, we still need to understand what is going on in order to reason about its various hyperparameters.</p>
<p>Additionally, we’re going to process the audio into a form that’s easier to train on. This is going to lower the memory requirements. It’s also going to lower the time needed for the model’s parameters to <em>converge</em> to ones that work well.</p>
<h4 id="how-is-audio-represented">How is audio represented?</h4>
<p>Let’s have a quick look at what the audio data looks like when we load it from a wave file.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">librosa</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">librosa.display</span>
SAMPLING_RATE=<span style="color:#00d;font-weight:bold">16000</span>
<span style="color:#888"># ...</span>
wave, _ = librosa.load(path_to_file, sr=SAMPLING_RATE)
librosa.display.waveplot(wave, sr=SAMPLING_RATE)
</code></pre></div><p>The above code specifies that we want to load the audio data with a <em>sampling rate</em> of 16 kHz (more about that later). It then loads the file and plots it along the time axis:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/wave-plot.png" alt="" title="Plot of a raw audio signal"></p>
<p>The X-axis obviously represents the time. The Y-axis is often called the <a href="https://en.wikipedia.org/wiki/Amplitude">amplitude</a>. A quick look at the plot above makes it obvious that we have negative values in the signal. How can those values be called amplitudes, then? Amplitude is said to represent the maximum displacement of a physical object as it vibrates — so what does a negative amplitude mean? To make those values a bit clearer, let’s just call them displacements for now. Audio is nothing more than the vibration of the air. If you were to build an electrical recorder, you might come up with one that gives you output in voltages at each point in time. To capture the exact specifics of the vibration, you obviously need a <strong>reference point</strong> — how the signal “rises” above that point and then falls back below it. Imagine that your electrical circuit gives you output within the range of <code>-1V</code> and <code>1V</code>. To load it into your computer and into a plot like the one above, you’d need to capture those values at discrete points in time. The <strong>sampling rate</strong> is nothing more than the number of times per second that the value from your sound meter is measured and stored — to be loaded later. Next time you read that your CD from the ’90s contains audio sampled at a frequency of 44,100 Hz, you’ll know that the raw “air displacement” values were sampled 44,100 times each second.</p>
<p>Let’s do a simple thought experiment to prepare for the next section. What would you hear if all the above values were constant, e.g. 1.0? We saw that the values given by <code>librosa</code> are floating points; in the example file they ranged between -0.6 and 0.6. The value of 1.0 is certainly much higher — would you hear “more” of “something” then? Because the definition of a sound is that <strong>it’s a vibration</strong>, you wouldn’t hear anything! The amplitudes of an audio signal must change periodically — that is how we detect, or hear, sounds. It follows that in order to distinguish between different sounds, those sounds have to “vibrate differently”. The difference that makes sounds different is the <strong>frequency</strong> of the vibration.</p>
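<p>You don’t have to take this on faith. A minimal numpy sketch (using the Fourier Transform introduced in the next section) shows that a constant “signal” has no vibration at all: all of its energy sits in the 0 Hz component.</p>

```python
import numpy as np

# One second of the constant "signal" 1.0, sampled at 16 kHz.
constant = np.ones(16000)

# Magnitudes of the frequency components (rfft: FFT for real-valued input).
spectrum = np.abs(np.fft.rfft(constant))

# All of the energy sits in the 0 Hz (DC) component; every other
# frequency bin is zero (up to floating-point noise).
print(spectrum[0])  # 16000.0
print(spectrum[1:].max())
```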
<h4 id="decomposing-the-signal-with-the-fourier-transform">Decomposing the signal with the Fourier Transform</h4>
<p>Let’s create a signal-generating machine that will output a sinusoid of a given frequency and amplitude:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">numpy</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">np</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">gen_sin</span>(freq, amplitude, sr=<span style="color:#00d;font-weight:bold">1000</span>):
<span style="color:#080;font-weight:bold">return</span> np.sin(
(freq * <span style="color:#00d;font-weight:bold">2</span> * np.pi * np.linspace(<span style="color:#00d;font-weight:bold">0</span>, sr, sr)) / sr
) * amplitude
</code></pre></div><p>Here’s what a 1000-point signal looks like for a frequency of 30 and an amplitude of 1:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">seaborn</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">sns</span>
sns.lineplot(data=gen_sin(<span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">1</span>))
</code></pre></div><p><img src="/blog/2019/01/speech-recognition-with-tensorflow/signal-1000-30-1.png" alt="" title="Sinusoidal signal"></p>
<p>Here’s one for 10 and 0.6:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/signal-1000-10-0.6.png" alt="" title="Sinusoidal signal"></p>
<p>You can count the number of times the values in the plots approach their maximum. Knowing that a sine wave has only one maximum within each period and that we’re showing just one second, those counts confirm the frequencies of 30 and 10.</p>
<p>What would we get if we were to sum such sinusoidal signals of different frequencies and amplitudes? Let’s see — below, three different sine waves are plotted on top of each other. The fourth — and last — plot shows the signal that is the sum of all of them:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/wave-decomposition-2.png" alt="" title="Wave composition / decomposition"></p>
<p>Here’s another example, with the last plot showing the sum of 5 different waves:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/wave-decomposition-1.png" alt="" title="Wave composition / decomposition"></p>
<p>It isn’t that regular anymore, is it? It turns out that <strong>you can construct any signal by summing up some number of sine waves of different frequencies and amplitudes</strong> (and phases, their translation in time). The converse is also true: <strong>any signal can be represented as a sum of some number of sine waves of different frequencies and amplitudes</strong> (and phases). This is extremely important to our speech recognition task. Frequencies are the real difference between sounds that make up the phonemes and words that we want to be able to recognize.</p>
<p>This is where the <a href="https://en.wikipedia.org/wiki/Fourier_transform">Fourier Transform</a> comes into play. It takes our data points that represent intensity per each point in time and produces data points representing intensity per each <em>frequency bin</em>. It’s said that it transforms the domain of the signal from <em>time</em> into <em>frequency</em>. Now, what exactly is a <em>frequency bin</em>? Imagine the physical audio signal being constructed from frequencies between 0Hz and 8000Hz. The FFT algorithm (Fast Fourier Transform) is going to split that full spectrum into <em>bins</em>. If you were to split it into 10 bins, you’d end up having the following ranges: 0Hz–800Hz, 800Hz–1600Hz, 1600Hz–2400Hz, 2400Hz–3200Hz, 3200Hz–4000Hz, 4000Hz–4800Hz, 4800Hz–5600Hz, 5600Hz–6400Hz, 6400Hz–7200Hz, 7200Hz–8000Hz.</p>
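<p>Those ranges are simply ten equal slices of the spectrum. A trivial sketch, assuming the 0Hz–8000Hz spectrum from the example:</p>

```python
import numpy as np

# Split the 0-8000 Hz spectrum into 10 equal-width frequency bins.
edges = np.linspace(0, 8000, 11)
bins = list(zip(edges[:-1], edges[1:]))

print(len(bins))                # 10
print(bins[0][0], bins[0][1])   # 0.0 800.0
print(bins[-1][0], bins[-1][1]) # 7200.0 8000.0
```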
<p>Let’s see how the FFT works on the example signal given above. The waves and plots were produced by the following Python function:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">matplotlib.pyplot</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">plt</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">plot_wave_composition</span>(defs, hspace=<span style="color:#00d;font-weight:bold">1.0</span>):
fig_size = plt.rcParams[<span style="color:#d20;background-color:#fff0f0">"figure.figsize"</span>]
plt.rcParams[<span style="color:#d20;background-color:#fff0f0">"figure.figsize"</span>] = [<span style="color:#00d;font-weight:bold">14.0</span>, <span style="color:#00d;font-weight:bold">10.0</span>]
waves = [
gen_sin(freq, amp)
<span style="color:#080;font-weight:bold">for</span> freq, amp <span style="color:#080">in</span> defs
]
fig, axs = plt.subplots(nrows=<span style="color:#038">len</span>(defs) + <span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#080;font-weight:bold">for</span> ix, wave <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(waves):
sns.lineplot(data=wave, ax=axs[ix])
axs[ix].set_ylabel(<span style="color:#d20;background-color:#fff0f0">'</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(defs[ix]))
<span style="color:#080;font-weight:bold">if</span> ix != <span style="color:#00d;font-weight:bold">0</span>:
axs[ix].set_title(<span style="color:#d20;background-color:#fff0f0">'+'</span>)
plt.subplots_adjust(hspace = hspace)
sns.lineplot(data=<span style="color:#038">sum</span>(waves), ax=axs[<span style="color:#038">len</span>(defs)])
axs[<span style="color:#038">len</span>(defs)].set_ylabel(<span style="color:#d20;background-color:#fff0f0">'sum'</span>)
axs[<span style="color:#038">len</span>(defs)].set_xlabel(<span style="color:#d20;background-color:#fff0f0">'time'</span>)
axs[<span style="color:#038">len</span>(defs)].set_title(<span style="color:#d20;background-color:#fff0f0">'='</span>)
plt.rcParams[<span style="color:#d20;background-color:#fff0f0">"figure.figsize"</span>] = fig_size
<span style="color:#080;font-weight:bold">return</span> waves, <span style="color:#038">sum</span>(waves)
</code></pre></div><p>We can plot the signals and grab them at the same time with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">wave_defs = [
(<span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">1</span>),
(<span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">0.8</span>),
(<span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">0.2</span>),
(<span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">0.1</span>),
(<span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">0.25</span>)
]
waves, the_sum = plot_wave_composition(wave_defs)
</code></pre></div><p>Next, let’s compute the FFT values along with the frequencies:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">ffts = np.fft.fft(the_sum)
freqs = np.fft.fftfreq(<span style="color:#038">len</span>(the_sum))
frequencies, coeffs = <span style="color:#038">zip</span>(
*<span style="color:#038">list</span>(
<span style="color:#038">filter</span>(
<span style="color:#080;font-weight:bold">lambda</span> row: row[<span style="color:#00d;font-weight:bold">1</span>] > <span style="color:#00d;font-weight:bold">10</span>, <span style="color:#888"># arbitrary threshold but let’s not make it too complex for now</span>
[ (<span style="color:#038">int</span>(<span style="color:#038">abs</span>(freq * <span style="color:#00d;font-weight:bold">1000</span>)), coef) <span style="color:#080;font-weight:bold">for</span> freq, coef <span style="color:#080">in</span> <span style="color:#038">zip</span>(freqs[<span style="color:#00d;font-weight:bold">0</span>:(<span style="color:#038">len</span>(ffts) // <span style="color:#00d;font-weight:bold">2</span>)], np.abs(ffts)[<span style="color:#00d;font-weight:bold">0</span>:(<span style="color:#038">len</span>(ffts) // <span style="color:#00d;font-weight:bold">2</span>)]) ]
)
)
)
sns.barplot(x=<span style="color:#038">list</span>(frequencies), y=coeffs)
</code></pre></div><p>The last call produces the following plot:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/fft-results.png" alt="" title="Detected frequencies"></p>
<p>The X-axis now represents the frequency in Hz, while the Y-axis shows the intensity.</p>
<p>There’s one missing part before we can use this with our speech data. As you can see, the FFT gives us frequencies <strong>for the whole signal, assuming that it’s periodic and extends infinitely in time</strong>. Obviously, when I say “hello”, the air vibrates one way in the beginning, changes in between and differs even more at the end. We need to <strong>split</strong> the audio into small “windows” of data points. By feeding them into the FFT, we can get the frequencies for each one of them. This turns the data domain from time into frequency within the scope of each window. The time information is retained at the global level, making our data represent: <code>time x frequency x intensity</code>.</p>
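<p>A minimal sketch of that windowing idea, using non-overlapping windows (a real pipeline would typically use overlapping, tapered windows, e.g. via <code>librosa.stft</code>):</p>

```python
import numpy as np

def naive_stft(signal, window_size=256):
    """Split a signal into non-overlapping windows and FFT each one.

    Returns an array of shape (time_windows, frequency_bins).
    """
    n_windows = len(signal) // window_size
    windows = signal[:n_windows * window_size].reshape(n_windows, window_size)
    # rfft: we only need the non-negative frequencies of a real signal.
    return np.abs(np.fft.rfft(windows, axis=1))

# One second of "audio" at 16 kHz -> 62 windows x 129 frequency bins.
spectrogram = naive_stft(np.random.randn(16000))
print(spectrogram.shape)  # (62, 129)
```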
<h4 id="scaling-frequencies">Scaling frequencies</h4>
<p>Human perception is a vastly complex phenomenon. Taking it into account can take us a long way when working on a recognition model that emulates what our brains do when we listen to each other.</p>
<p>Let’s run another experiment. What sound is produced by an 800Hz sine wave?</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">IPython.display</span> <span style="color:#080;font-weight:bold">import</span> Audio
Audio(data=gen_sin(<span style="color:#00d;font-weight:bold">800</span>, <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">16000</span>), rate=<span style="color:#00d;font-weight:bold">16000</span>)
</code></pre></div><div>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/800Hz.wav" type="audio/wav">
</audio>
</div><br />
<p>Let’s now generate 900Hz and 1000Hz to get a sense of the difference:</p>
<p>900Hz:</p>
<div>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/900Hz.wav" type="audio/wav">
</audio>
</div><br />
<p>1000Hz:</p>
<div>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/1000Hz.wav" type="audio/wav">
</audio>
</div><br />
<p>Let us now ante up the frequencies and generate 7000Hz, 7100Hz and 7200Hz:</p>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/7000Hz.wav" type="audio/wav">
</audio>
<br />
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/7100Hz.wav" type="audio/wav">
</audio>
<br />
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/7200Hz.wav" type="audio/wav">
</audio>
<br />
<p>Can you hear how much smaller the difference is between the last three? It’s a well-known phenomenon: we perceive a greater difference between sounds at lower frequencies, and as the frequency increases, that perceived difference becomes less and less pronounced.</p>
<p>Because of this, three gentlemen—Stevens, Volkmann, and Newman—created a so-called <a href="https://en.wikipedia.org/wiki/Mel_scale">Mel scale</a> in 1937. You can think of it as a simple rescaling of the frequencies that roughly follows the relationship shown below:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/Mel-Hz_plot.svg.png" alt=""></p>
<p>Although not mandatory, lots of models that deal with human speech also decrease the importance of the intensity by taking the log of the re-scaled data. The resulting <code>time x frequency (mels) x log-intensity</code> is called the <strong>log-Mel spectrogram</strong>.</p>
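<p>One widely used formula for the Hz-to-Mel conversion (the HTK variant; the exact constants vary between implementations) is <code>m = 2595 * log10(1 + f / 700)</code>. A quick sketch shows how it compresses the differences at the high end, matching what we heard above:</p>

```python
import numpy as np

def hz_to_mel(f):
    # The HTK-style Mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

# The 800 Hz -> 1000 Hz step covers many more mels than the
# 7000 Hz -> 7200 Hz step, mirroring the perceived difference.
low_step = hz_to_mel(1000) - hz_to_mel(800)
high_step = hz_to_mel(7200) - hz_to_mel(7000)
print(low_step > high_step)  # True
```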
<h3 id="background-on-deep-learning-techniques-in-use-for-this-project">Background on deep learning techniques in use for this project</h3>
<p>We’ve just gone through the necessary basics of signal processing. Let’s now focus on the Deep Learning concepts we’ll use to construct and train the model.</p>
<p>While this article assumes that the reader already knows a lot, we’ll use some less common techniques that deserve at least a quick walkthrough.</p>
<h4 id="dilated-convolutions-as-a-faster-alternative-to-recurrent-networks">Dilated convolutions as a faster alternative to recurrent networks</h4>
<p>Traditionally, sequence processing in Deep Learning has been tackled with <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural networks</a>.</p>
<p>No matter the choice of flavor, the basic scheme is always the same: the computations are done <strong>sequentially</strong>, going through the examples <strong>in time</strong>. In our case, we’d need to split the <code>time x frequency x intensity</code> data into <code>time</code> chunks of shape <code>frequency x intensity</code>. As the chunks are processed one by one, the recurrent network’s internal state “remembers” the previous chunks’ specifics, incorporating them into its future outputs. The output shape would be <code>time x frequency x recurrent units</code>.</p>
<p>The fact that the computations are done sequentially makes them quite slow overall. Computations later in the pipeline spend most of their time waiting for the previous ones to finish because of the direct dependency. The problem is even more severe with GPUs: we use them for their ability to do math in parallel on huge chunks of data, and with recurrent networks lots of that power is wasted.</p>
<p>The premise of RNNs is that, in theory, they have the capacity to keep very long contexts in their “memory”. This has recently been put to the test and falsified in practice by <a href="https://arxiv.org/pdf/1803.01271.pdf">Bai et al</a>. Also, when you stop and think about the task at hand: does it really matter to “remember” the beginning of a sentence to know that it ends with the word “dog”? Some context is obviously needed — but not as wide as it might seem at first.</p>
<p>I have an Nvidia GTX 1070Ti with 8GB of memory to train my models on, and I don’t really feel like waiting a month for a recurrent network to converge. In this project, let’s use a very performant alternative — the <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural network</a>.</p>
<h5 id="expanding-the-context-of-the-convolutional-network">Expanding the context of the convolutional network</h5>
<p>Simple convolutional layers weren’t used much for sequence processing, and for a good reason. The crux of sequence processing is being able to take bigger contexts into account. Depending on the job, we might want to constrain the context only to the <em>past</em> — learning the <strong>causal</strong> relations in the data. We might sometimes want to incorporate both the <em>past</em> and the <em>future</em> as well. The go-to solution for doing OCR at the moment is to use bidirectional recurrent layers: one pass learns the relations from left to right while the other learns from right to left, and the results are then concatenated.</p>
<p>By applying proper padding, we can easily include one or two-sided contexts in 1D convolutions. The challenge is that in order to make the outputs depend on bigger contexts, the size of the filters needs to become bigger and bigger. This, in turn, requires more and more memory.</p>
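<p>For the one-sided (causal) case, the padding trick amounts to prepending zeros, so that each output depends only on the current and past inputs. A small numpy sketch:</p>

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1D cross-correlation that only looks at the past.

    Left-padding with len(kernel) - 1 zeros keeps the output the same
    length as the input, and output[t] depends only on x[0..t].
    """
    padded = np.concatenate([np.zeros(len(kernel) - 1), x])
    return np.array([
        np.dot(padded[i:i + len(kernel)], kernel)
        for i in range(len(x))
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
print(causal_conv1d(x, np.ones(3)))  # [1. 3. 6. 9.]
```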
<p>Because our aim is to create a model that we’ll be able to train on a quite cheap (given the GPUs used in this field usually) GTX 1070Ti (around $500 at the moment), we want the memory requirements to be as low as possible.</p>
<p>Thanks to the success of the <a href="https://arxiv.org/pdf/1609.03499.pdf">WaveNet</a> (among others), a specific class of convolutional layers gained a lot of attention lately. The variation is called <strong>Dilated Convolutions</strong> or sometimes <strong>Atrous Convolutions</strong>. So what are they?</p>
<p>Let’s first have a look at how the outputs depend on their context for simple convolutional layers:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/causal-conv-3-1.png" alt=""></p>
<p>Imagine that you originally have just the top-most row of numbers. You are going to use 1D convolutions, and to keep the reasoning simple, the number of filters is 1 and all filter values are set to 1. You can see the cross-correlation operator (because that’s what convolutional layers in fact compute) taking 3 values of the context, multiplying them by the filter and summing up to <code>2 * 1 + 3 * 1 + 4 * 1 = 9</code>.</p>
<p>The <em>atrous</em> convolutions are really the same, except that they <strong>dilate</strong> their focus by introducing holes, without increasing the size of the filter. This is shown below with a convolution of size 2 and dilation of 2:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/causal-conv-2-2.png" alt=""></p>
<p>Here’s yet another example for the size of 2 and dilation of 3:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/causal-conv-2-3.png" alt=""></p>
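<p>The pictures above translate directly into code. A minimal numpy sketch of a single-filter 1D dilated cross-correlation, reproducing the <code>2 * 1 + 3 * 1 + 4 * 1 = 9</code> example:</p>

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Valid 1D cross-correlation with holes between the filter taps."""
    taps = np.arange(len(kernel)) * dilation
    out_len = len(x) - taps[-1]
    return np.array([
        np.dot(x[i + taps], kernel) for i in range(out_len)
    ])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Plain convolution, all-ones filter of size 3: 2 + 3 + 4 = 9 at position 1.
print(dilated_conv1d(x, np.ones(3)))              # [ 6.  9. 12.]
# Size 2, dilation 2: each output skips one value, e.g. 1 + 3 = 4.
print(dilated_conv1d(x, np.ones(2), dilation=2))  # [4. 6. 8.]
```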
<h4 id="gated-activations">Gated activations</h4>
<p>Traditionally, convolutional layers are followed by the *elu family of activations (ReLU, ELU, PReLU, SELU). They fit well within the “match the pattern” paradigm of conv nets. Recurrent units, on the contrary, operate with a “remember/forget” approach. Two of their most commonly used implementations, GRU and LSTM, include explicit “forget” gates.</p>
<p>We want to mimic their ability to “forget” parts of the context within our dilated convolutions based model too. To do that, we’re going to use the “gated activations” approach, explained by <a href="https://arxiv.org/pdf/1712.09444.pdf">Liptchinsky et al.</a></p>
<p>The idea is very simple: we pass the input through two separate Conv1D layers and apply tanh and sigmoid to their outputs respectively. The result is the element-wise product of the two. In our approach we’re going to go one step further, applying tanh one more time at the end.</p>
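<p>In numpy terms, the activation could be sketched like this (with <code>conv_a</code> and <code>conv_b</code> as hypothetical stand-ins for the outputs of the two separate Conv1D passes):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_activation(conv_a, conv_b):
    """tanh gate * sigmoid gate, squashed once more with a final tanh."""
    return np.tanh(np.tanh(conv_a) * sigmoid(conv_b))

# Two fake Conv1D outputs for the same input window.
a = np.array([-2.0, 0.0, 3.0])
b = np.array([10.0, -10.0, 0.0])
out = gated_activation(a, b)
# The sigmoid branch acts as a "forget" gate: a near-zero gate value
# suppresses the corresponding output.
print(np.round(out, 3))
```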
<h4 id="others">Others</h4>
<p>The full explanation of all of the details of our neural network’s architecture is beyond the scope of an article like this. Let me point you at additional pieces along with the reading they come from:</p>
<ul>
<li><a href="https://arxiv.org/pdf/1502.03167.pdf">Batch Normalization</a></li>
<li><a href="ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf">Connectionist Temporal Classification</a></li>
<li><a href="https://arxiv.org/pdf/1512.03385.pdf">Residual Learning</a></li>
</ul>
<h3 id="lets-code-it">Let’s code it</h3>
<p>The architecture we’ve chosen for this project relies heavily on the great success of residual-style networks as well as dilated convolutions. You might see similarities to the famous WaveNet, although it’s going to be a bit different.</p>
<p>Here is the bird’s-eye view of the SpeechNet neural network:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/speech-net.png" alt=""></p>
<p>The residual stacks, being at the heart of it, are structured the following way:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/residual-stack.png" alt=""></p>
<p>The residual blocks, doing all the heavy lifting, can be seen as shown below:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/residual-block.png" alt=""></p>
<h4 id="the-most-important-aspect-of-coding-of-the-deep-learning-models">The most important aspect of coding of the Deep Learning models</h4>
<p>Developing Deep Learning models doesn’t really differ that much from any other type of coding. It does require specific background knowledge, but good coding practices remain the same. In fact, good coding habits are 10× more relevant here than in, e.g., a web-app project.</p>
<p>Training a speech-to-text model is bound to take days if not weeks. Imagine having a small bug in your code that prevents the process from finding a good local minimum. It’s extremely frustrating to discover it days into the training, with the model’s trainable parameters having barely improved.</p>
<p>Let’s start by adding some unit tests, then. In this project we’re using a Jupyter notebook, as we don’t intend to package the code anywhere; it’s intended mainly for educational purposes.</p>
<p>Adding unit tests within the Jupyter notebook is possible with the following “hack” (notice the value for <code>argv</code>):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">unittest</span>
RUN_TESTS = <span style="color:#080;font-weight:bold">True</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">TestNotebook</span>(unittest.TestCase):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_it_works</span>(self):
self.assertEqual(<span style="color:#00d;font-weight:bold">2</span> + <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">4</span>)
<span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">'__main__'</span> <span style="color:#080">and</span> RUN_TESTS:
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">doctest</span>
doctest.testmod()
unittest.main(
argv=[<span style="color:#d20;background-color:#fff0f0">'first-arg-is-ignored'</span>],
failfast=<span style="color:#080;font-weight:bold">True</span>,
exit=<span style="color:#080;font-weight:bold">False</span>
)
</code></pre></div><p>Notice the import of the <code>doctest</code> module, which adds support for <a href="https://docs.python.org/2/library/doctest.html">doc-string level tests</a>; these may come in handy as well.</p>
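<p>A tiny sketch of what such a doc-string test could look like (the <code>normalize</code> helper here is a made-up example, not part of the project):</p>

```python
import doctest

def normalize(wave):
    """Scale a wave so its peak absolute amplitude is 1.0.

    >>> normalize([0.0, 0.5, -0.25])
    [0.0, 1.0, -0.5]
    """
    peak = max(abs(v) for v in wave)
    return [v / peak for v in wave]

# Run every >>> example found in this module's doc-strings.
doctest.testmod()
```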
<p>I also highly recommend the <a href="https://hypothesis.readthedocs.io/en/latest/">hypothesis library</a> for testing the QuickCheck way, <a href="/blog/2016/03/quickcheck-property-based-testing-in/">as I’ve blogged about before</a>.</p>
<h5 id="data-pipeline">Data pipeline</h5>
<p>A surprisingly bug-prone place is the data pipeline. It’s easy, for example, to shuffle the labels independently of the input vectors if you’re not careful. There’s also always a chance of introducing input vectors containing <code>NaN</code> or <code>inf</code> values, which a few steps later produce <code>NaN</code> or <code>inf</code> loss values. Let’s add a simple test to check for the first condition:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">
<span style="color:#888"># assuming test path will look like: 1/file.wav</span>
<span style="color:#888"># the input and output types are driven by the input_fn shown later</span>
<span style="color:#888"># here, we’re just generating values based on the “path”</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">dummy_load_wave</span>(example):
row, params = example
path = row.filename
<span style="color:#080;font-weight:bold">return</span> np.ones((SAMPLING_RATE)) * <span style="color:#038">float</span>(path.split(<span style="color:#d20;background-color:#fff0f0">'/'</span>)[<span style="color:#00d;font-weight:bold">0</span>]), row
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">TestNotebook</span>(unittest.TestCase):
<span style="color:#888"># (...)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_dataset_returns_data_in_order</span>(self):
params = experiment_params(
dataset_params(
batch_size=<span style="color:#00d;font-weight:bold">2</span>,
epochs=<span style="color:#00d;font-weight:bold">1</span>,
augment=<span style="color:#080;font-weight:bold">False</span>
)
)
data = pd.DataFrame(
data={
<span style="color:#d20;background-color:#fff0f0">'text'</span>: [ <span style="color:#038">str</span>(i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">10</span>) ],
<span style="color:#d20;background-color:#fff0f0">'filename'</span>: [ <span style="color:#d20;background-color:#fff0f0">'</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">/wav'</span>.format(i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">10</span>) ]
}
)
dataset = input_fn(data, params[<span style="color:#d20;background-color:#fff0f0">'data'</span>], dummy_load_wave)()
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
<span style="color:#080;font-weight:bold">try</span>:
<span style="color:#080;font-weight:bold">while</span> <span style="color:#080;font-weight:bold">True</span>:
audio, label = session.run(next_element)
audio, length = audio
<span style="color:#080;font-weight:bold">for</span> _audio, _label <span style="color:#080">in</span> <span style="color:#038">zip</span>(<span style="color:#038">list</span>(audio), <span style="color:#038">list</span>(label)):
self.assertEqual(_audio[<span style="color:#00d;font-weight:bold">0</span>], <span style="color:#038">float</span>(_label))
<span style="color:#080;font-weight:bold">for</span> _length <span style="color:#080">in</span> length:
self.assertEqual(_length, SAMPLING_RATE)
<span style="color:#080;font-weight:bold">except</span> tf.errors.OutOfRangeError:
<span style="color:#080;font-weight:bold">pass</span>
</code></pre></div><p>The above code assumes that the <code>input_fn</code> function is in scope. If you’re not familiar with the concept yet, please go ahead and read the introduction to the <a href="https://www.tensorflow.org/guide/estimators">TensorFlow Estimators API</a>.</p>
<p>Here’s our implementation:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">multiprocessing</span> <span style="color:#080;font-weight:bold">import</span> Pool
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">input_fn</span>(input_dataset, params, load_wave_fn=load_wave):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">_input_fn</span>():
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Returns raw audio wave along with the label
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
dataset = input_dataset
<span style="color:#038">print</span>(params)
<span style="color:#080;font-weight:bold">if</span> <span style="color:#d20;background-color:#fff0f0">'max_text_length'</span> <span style="color:#080">in</span> params <span style="color:#080">and</span> params[<span style="color:#d20;background-color:#fff0f0">'max_text_length'</span>] <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Constraining dataset to the max_text_length'</span>)
dataset = input_dataset[input_dataset.text.str.len() < params[<span style="color:#d20;background-color:#fff0f0">'max_text_length'</span>]]
<span style="color:#080;font-weight:bold">if</span> <span style="color:#d20;background-color:#fff0f0">'min_text_length'</span> <span style="color:#080">in</span> params <span style="color:#080">and</span> params[<span style="color:#d20;background-color:#fff0f0">'min_text_length'</span>] <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Constraining dataset to the min_text_length'</span>)
dataset = input_dataset[input_dataset.text.str.len() >= params[<span style="color:#d20;background-color:#fff0f0">'min_text_length'</span>]]
<span style="color:#080;font-weight:bold">if</span> <span style="color:#d20;background-color:#fff0f0">'max_wave_length'</span> <span style="color:#080">in</span> params <span style="color:#080">and</span> params[<span style="color:#d20;background-color:#fff0f0">'max_wave_length'</span>] <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Constraining dataset to the max_wave_length'</span>)
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Resulting dataset length: </span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(<span style="color:#038">len</span>(dataset)))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">generator_fn</span>():
pool = Pool()
buffer = []
<span style="color:#080;font-weight:bold">for</span> epoch <span style="color:#080">in</span> <span style="color:#038">range</span>(params[<span style="color:#d20;background-color:#fff0f0">'epochs'</span>]):
<span style="color:#080;font-weight:bold">for</span> _, row <span style="color:#080">in</span> dataset.sample(frac=<span style="color:#00d;font-weight:bold">1</span>).iterrows():
buffer.append((row, params))
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">len</span>(buffer) >= params[<span style="color:#d20;background-color:#fff0f0">'batch_size'</span>]:
<span style="color:#080;font-weight:bold">if</span> params[<span style="color:#d20;background-color:#fff0f0">'parallelize'</span>]:
audios = pool.map(
load_wave_fn,
buffer
)
<span style="color:#080;font-weight:bold">else</span>:
audios = <span style="color:#038">map</span>(
load_wave_fn,
buffer
)
<span style="color:#080;font-weight:bold">for</span> audio, row <span style="color:#080">in</span> audios:
<span style="color:#080;font-weight:bold">if</span> audio <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#080;font-weight:bold">if</span> np.isnan(audio).any():
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'SKIPPING! NaN coming from the pipeline!'</span>)
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">yield</span> (audio, <span style="color:#038">len</span>(audio)), row.text.encode()
buffer = []
<span style="color:#080;font-weight:bold">return</span> tf.data.Dataset.from_generator(
generator_fn,
output_types=((tf.float32, tf.int32), (tf.string)),
output_shapes=((<span style="color:#080;font-weight:bold">None</span>,()), (()))
) \
.padded_batch(
batch_size=params[<span style="color:#d20;background-color:#fff0f0">'batch_size'</span>],
padded_shapes=(
(tf.TensorShape([<span style="color:#080;font-weight:bold">None</span>]), tf.TensorShape(())),
tf.TensorShape(())
)
)
<span style="color:#080;font-weight:bold">return</span> _input_fn
</code></pre></div><p>This depends on the <code>load_wave</code> function:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">librosa</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">hickle</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">hkl</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">os.path</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">to_path</span>(filename):
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">'./data/cv_corpus_v1/'</span> + filename
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">load_wave</span>(example, absolute=<span style="color:#080;font-weight:bold">False</span>):
row, params = example
_path = row.filename <span style="color:#080;font-weight:bold">if</span> absolute <span style="color:#080;font-weight:bold">else</span> to_path(row.filename)
<span style="color:#080;font-weight:bold">if</span> os.path.isfile(_path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>):
wave = hkl.load(_path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>).astype(np.float32)
<span style="color:#080;font-weight:bold">else</span>:
wave, _ = librosa.load(_path, sr=SAMPLING_RATE)
hkl.dump(wave, _path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>)
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">len</span>(wave) <= params[<span style="color:#d20;background-color:#fff0f0">'max_wave_length'</span>]:
<span style="color:#080;font-weight:bold">if</span> params[<span style="color:#d20;background-color:#fff0f0">'augment'</span>]:
wave = random_noise(
random_stretch(
random_shift(
wave,
params
),
params
),
params
)
<span style="color:#080;font-weight:bold">else</span>:
wave = <span style="color:#080;font-weight:bold">None</span>
<span style="color:#080;font-weight:bold">return</span> wave, row
</code></pre></div><p>This in turn depends on three other functions that augment the data on the fly to improve the model’s generalization:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">random</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">glob</span>
noise_files = glob.glob(<span style="color:#d20;background-color:#fff0f0">'./data/*.wav'</span>)
noises = {}
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">random_stretch</span>(audio, params):
rate = random.uniform(params[<span style="color:#d20;background-color:#fff0f0">'random_stretch_min'</span>], params[<span style="color:#d20;background-color:#fff0f0">'random_stretch_max'</span>])
<span style="color:#080;font-weight:bold">return</span> librosa.effects.time_stretch(audio, rate)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">random_shift</span>(audio, params):
_shift = random.randrange(params[<span style="color:#d20;background-color:#fff0f0">'random_shift_min'</span>], params[<span style="color:#d20;background-color:#fff0f0">'random_shift_max'</span>])
<span style="color:#080;font-weight:bold">if</span> _shift < <span style="color:#00d;font-weight:bold">0</span>:
pad = (_shift * -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">else</span>:
pad = (<span style="color:#00d;font-weight:bold">0</span>, _shift)
<span style="color:#080;font-weight:bold">return</span> np.pad(audio, pad, mode=<span style="color:#d20;background-color:#fff0f0">'constant'</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">random_noise</span>(audio, params):
_factor = random.uniform(
params[<span style="color:#d20;background-color:#fff0f0">'random_noise_factor_min'</span>],
params[<span style="color:#d20;background-color:#fff0f0">'random_noise_factor_max'</span>]
)
<span style="color:#080;font-weight:bold">if</span> params[<span style="color:#d20;background-color:#fff0f0">'random_noise'</span>] > random.uniform(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">1</span>):
_path = random.choice(noise_files)
<span style="color:#080;font-weight:bold">if</span> _path <span style="color:#080">in</span> noises:
wave = noises[_path]
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">if</span> os.path.isfile(_path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>):
wave = hkl.load(_path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>).astype(np.float32)
noises[_path] = wave
<span style="color:#080;font-weight:bold">else</span>:
wave, _ = librosa.load(_path, sr=SAMPLING_RATE)
hkl.dump(wave, _path + <span style="color:#d20;background-color:#fff0f0">'.wave.hkl'</span>)
noises[_path] = wave
noise = random_shift(
wave,
{
<span style="color:#d20;background-color:#fff0f0">'random_shift_min'</span>: -<span style="color:#00d;font-weight:bold">16000</span>,
<span style="color:#d20;background-color:#fff0f0">'random_shift_max'</span>: <span style="color:#00d;font-weight:bold">16000</span>
}
)
max_noise = np.max(noise[<span style="color:#00d;font-weight:bold">0</span>:<span style="color:#038">len</span>(audio)])
max_wave = np.max(audio)
noise = noise * (max_wave / max_noise)
<span style="color:#080;font-weight:bold">return</span> _factor * noise[<span style="color:#00d;font-weight:bold">0</span>:<span style="color:#038">len</span>(audio)] + (<span style="color:#00d;font-weight:bold">1.0</span> - _factor) * audio
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">return</span> audio
</code></pre></div><p>Notice that we’re turning almost everything into a configurable parameter. We want the code to give us the greatest possible freedom when searching for just the right set of hyperparameters.</p>
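<p>The shift augmentation above boils down to zero-padding the signal on one side. Here’s a minimal pure-Python sketch of the same idea (the real code applies <code>np.pad</code> to NumPy arrays):</p>

```python
import random

def shift_with_padding(audio, shift):
    """Pad with zeros at the front (shift < 0) or at the back (shift > 0)."""
    if shift < 0:
        return [0.0] * (-shift) + list(audio)
    return list(audio) + [0.0] * shift

def random_shift(audio, params):
    # Mirrors the augmentation above: draw a shift amount, then pad.
    shift = random.randrange(params['random_shift_min'], params['random_shift_max'])
    return shift_with_padding(audio, shift)
```

<p>Note that, as in the original, the padded signal grows in length rather than being cropped back to its original size.</p>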
<p>The data pipeline shown above randomly shuffles the <a href="https://pandas.pydata.org">Pandas</a> data frame once per epoch. It also creates a pool of background workers to parallelize the data loading as much as possible; both the loading and the augmentation happen on the CPU. Finally, it uses the <a href="https://github.com/telegraphic/hickle">hickle</a> library to cache decoded audio signals on disk. Loading a wave file at a given sampling rate isn’t as fast as one might think: in my experiments, loading the resulting array of floating-point samples via <code>hickle</code> was 10x faster. We need to feed data into the network as quickly as possible, or else our GPU is going to sit underutilized.</p>
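<p>The caching trick itself is generic: decode the file once, store the resulting array on disk next to the source, and read it back on subsequent epochs. A minimal sketch of the pattern, using the standard library’s <code>pickle</code> as a stand-in for <code>hickle</code>:</p>

```python
import os
import pickle

def load_cached(path, decode_fn, suffix='.cache.pkl'):
    """Return decode_fn(path), memoizing the result on disk next to the file."""
    cache_path = path + suffix
    if os.path.isfile(cache_path):
        # Cache hit: skip the expensive decode entirely.
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    data = decode_fn(path)
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data
```

<p>In <code>load_wave</code> above, <code>decode_fn</code> corresponds to <code>librosa.load</code> and the suffix is <code>.wave.hkl</code>.</p>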
<p>In my experiments, turning data augmentation on also <strong>made a real difference</strong>. When I ran the training without it, the network overfit disastrously: the normalized <a href="https://en.wikipedia.org/wiki/Edit_distance">edit distance</a> hovered around 0.01 for the training set but 0.53 for the validation set.</p>
<p>The <code>random_noise</code> function uses the noise sounds included in the <a href="http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz">Speech Commands: A public dataset for single-word speech recognition</a> dataset. Please go ahead and download it, extracting just the noise files under the <code>./data</code> directory.</p>
<p>The last function in use that we haven’t seen yet is <code>dataset_params</code>. It’s just a helper that makes it easy to construct the params dictionary for our experiments:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">dataset_params</span>(batch_size=<span style="color:#00d;font-weight:bold">32</span>,
epochs=<span style="color:#00d;font-weight:bold">50000</span>,
parallelize=<span style="color:#080;font-weight:bold">True</span>,
max_text_length=<span style="color:#080;font-weight:bold">None</span>,
min_text_length=<span style="color:#080;font-weight:bold">None</span>,
max_wave_length=<span style="color:#00d;font-weight:bold">80000</span>,
shuffle=<span style="color:#080;font-weight:bold">True</span>,
random_shift_min=-<span style="color:#00d;font-weight:bold">4000</span>,
random_shift_max= <span style="color:#00d;font-weight:bold">4000</span>,
random_stretch_min=<span style="color:#00d;font-weight:bold">0.7</span>,
random_stretch_max= <span style="color:#00d;font-weight:bold">1.3</span>,
random_noise=<span style="color:#00d;font-weight:bold">0.75</span>,
random_noise_factor_min=<span style="color:#00d;font-weight:bold">0.2</span>,
random_noise_factor_max=<span style="color:#00d;font-weight:bold">0.5</span>,
augment=<span style="color:#080;font-weight:bold">False</span>):
<span style="color:#080;font-weight:bold">return</span> {
<span style="color:#d20;background-color:#fff0f0">'parallelize'</span>: parallelize,
<span style="color:#d20;background-color:#fff0f0">'shuffle'</span>: shuffle,
<span style="color:#d20;background-color:#fff0f0">'max_text_length'</span>: max_text_length,
<span style="color:#d20;background-color:#fff0f0">'min_text_length'</span>: min_text_length,
<span style="color:#d20;background-color:#fff0f0">'max_wave_length'</span>: max_wave_length,
<span style="color:#d20;background-color:#fff0f0">'random_shift_min'</span>: random_shift_min,
<span style="color:#d20;background-color:#fff0f0">'random_shift_max'</span>: random_shift_max,
<span style="color:#d20;background-color:#fff0f0">'random_stretch_min'</span>: random_stretch_min,
<span style="color:#d20;background-color:#fff0f0">'random_stretch_max'</span>: random_stretch_max,
<span style="color:#d20;background-color:#fff0f0">'random_noise'</span>: random_noise,
<span style="color:#d20;background-color:#fff0f0">'random_noise_factor_min'</span>: random_noise_factor_min,
<span style="color:#d20;background-color:#fff0f0">'random_noise_factor_max'</span>: random_noise_factor_max,
<span style="color:#d20;background-color:#fff0f0">'epochs'</span>: epochs,
<span style="color:#d20;background-color:#fff0f0">'batch_size'</span>: batch_size,
<span style="color:#d20;background-color:#fff0f0">'augment'</span>: augment
}
</code></pre></div><h5 id="labels-encoder-and-decoder">Labels encoder and decoder</h5>
<p>When working with the CTC loss, we need a way to encode each letter as a numerical value. Conversely, the neural network is going to give us probabilities for each letter, indexed by its position within the output matrix.</p>
<p>The idea behind this project’s approach is to push the encoding and decoding into the network graph itself. We want two functions: <code>encode_labels</code> and <code>decode_codes</code>. The first turns a string into an array of integers; the second complements it, turning the array of integers back into the resulting string.</p>
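<p>Outside of a TensorFlow graph, the pair amounts to two table lookups. A plain-Python sketch of the contract (hypothetical helpers; the real versions build TensorFlow lookup tables):</p>

```python
def encode_labels(text, params):
    """Map each character to its index in the alphabet (-1 if unknown)."""
    char2id = {c: i for i, c in enumerate(params['alphabet'])}
    return [char2id.get(c, -1) for c in text]

def decode_codes(codes, params):
    """Map indices back to characters; unknown codes become ''."""
    alphabet = params['alphabet']
    return ''.join(alphabet[c] if 0 <= c < len(alphabet) else '' for c in codes)
```

<p>Decoding must invert encoding exactly, which is precisely the round-trip property the unit test checks.</p>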
<p>It’s a good idea to use the <code>hypothesis</code> library for this unit test. It’s going to come up with many input examples, trying to falsify our assumptions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(st.text(alphabet=<span style="color:#d20;background-color:#fff0f0">"abcdefghijk1234!@#$%^&*"</span>, max_size=<span style="color:#00d;font-weight:bold">10</span>))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_encode_and_decode_work</span>(self, text):
assume(text != <span style="color:#d20;background-color:#fff0f0">''</span>)
params = { <span style="color:#d20;background-color:#fff0f0">'alphabet'</span>: <span style="color:#d20;background-color:#fff0f0">'abcdefghijk1234!@#$%^&*'</span> }
label_ph = tf.placeholder(tf.string, shape=(<span style="color:#00d;font-weight:bold">1</span>), name=<span style="color:#d20;background-color:#fff0f0">'text'</span>)
codes_op = encode_labels(label_ph, params)
decode_op = decode_codes(codes_op, params)
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer(name=<span style="color:#d20;background-color:#fff0f0">'init_all_tables'</span>))
codes, decoded = session.run(
[codes_op, decode_op],
{
label_ph: np.array([text])
}
)
note(codes)
note(decoded)
self.assertEqual(text, <span style="color:#d20;background-color:#fff0f0">''</span>.join(<span style="color:#038">map</span>(<span style="color:#080;font-weight:bold">lambda</span> s: s.decode(<span style="color:#d20;background-color:#fff0f0">'UTF-8'</span>), decoded.values)))
self.assertEqual(codes.values.dtype, np.int32)
self.assertEqual(<span style="color:#038">len</span>(codes.values), <span style="color:#038">len</span>(text))
</code></pre></div><p>Here is the implementation that passes the above test:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">encode_labels</span>(labels, params):
characters = <span style="color:#038">list</span>(params[<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>])
table = tf.contrib.lookup.HashTable(
tf.contrib.lookup.KeyValueTensorInitializer(
characters,
<span style="color:#038">list</span>(<span style="color:#038">range</span>(<span style="color:#038">len</span>(characters)))
),
-<span style="color:#00d;font-weight:bold">1</span>,
name=<span style="color:#d20;background-color:#fff0f0">'char2id'</span>
)
<span style="color:#080;font-weight:bold">return</span> table.lookup(
tf.string_split(labels, delimiter=<span style="color:#d20;background-color:#fff0f0">''</span>)
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode_codes</span>(codes, params):
characters = <span style="color:#038">list</span>(params[<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>])
table = tf.contrib.lookup.HashTable(
tf.contrib.lookup.KeyValueTensorInitializer(
<span style="color:#038">list</span>(<span style="color:#038">range</span>(<span style="color:#038">len</span>(characters))),
characters
),
<span style="color:#d20;background-color:#fff0f0">''</span>,
name=<span style="color:#d20;background-color:#fff0f0">'id2char'</span>
)
<span style="color:#080;font-weight:bold">return</span> table.lookup(codes)
</code></pre></div><h5 id="log-mel-spectrogram-layer">Log-Mel Spectrogram layer</h5>
<p>Another piece we need is a way to turn raw audio signals into log-Mel spectrograms. The idea, again, is to push this into the network graph. That way it runs much faster on GPUs, and the model’s API stays much simpler.</p>
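<p>The Mel scale itself is just a logarithmic warping of frequency. Here’s a quick sketch of the HTK-style conversion used to place filterbank edges evenly on the Mel scale (note that librosa’s default Mel filterbank uses the slightly different Slaney variant):</p>

```python
import math

def hertz_to_mel(f):
    """HTK-style Hz -> Mel conversion."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hertz(m):
    """Inverse of hertz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_bin_edges(lower_hz, upper_hz, num_mel_bins):
    """num_mel_bins + 2 edge frequencies, evenly spaced on the Mel scale."""
    lo, hi = hertz_to_mel(lower_hz), hertz_to_mel(upper_hz)
    step = (hi - lo) / (num_mel_bins + 1)
    return [mel_to_hertz(lo + i * step) for i in range(num_mel_bins + 2)]
```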
<p>In the following unit test, we check our custom TensorFlow layer against values produced by librosa, which we treat as the known-good reference:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
st.sampled_from([<span style="color:#00d;font-weight:bold">22000</span>, <span style="color:#00d;font-weight:bold">16000</span>, <span style="color:#00d;font-weight:bold">8000</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1024</span>, <span style="color:#00d;font-weight:bold">512</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1024</span>, <span style="color:#00d;font-weight:bold">512</span>]),
npst.arrays(
np.float32,
(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
)
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_log_mel_conversion_works</span>(self, sampling_rate, n_fft, frame_step, audio):
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0.0</span>
upper_edge_hertz=sampling_rate / <span style="color:#00d;font-weight:bold">2.0</span>
num_mel_bins=<span style="color:#00d;font-weight:bold">64</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">librosa_melspectrogram</span>(audio_item):
spectrogram = np.abs(
librosa.core.stft(
audio_item,
n_fft=n_fft,
hop_length=frame_step,
center=<span style="color:#080;font-weight:bold">False</span>
)
)**<span style="color:#00d;font-weight:bold">2</span>
<span style="color:#080;font-weight:bold">return</span> np.log(
librosa.feature.melspectrogram(
S=spectrogram,
sr=sampling_rate,
n_mels=num_mel_bins,
fmin=lower_edge_hertz,
fmax=upper_edge_hertz,
) + <span style="color:#00d;font-weight:bold">1e-6</span>
)
audio_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>))
librosa_log_mels = np.transpose(
np.stack([
librosa_melspectrogram(audio_item)
<span style="color:#080;font-weight:bold">for</span> audio_item <span style="color:#080">in</span> audio
]),
(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">1</span>)
)
log_mel_op = tf.check_numerics(
LogMelSpectrogram(
sampling_rate=sampling_rate,
n_fft=n_fft,
frame_step=frame_step,
lower_edge_hertz=lower_edge_hertz,
upper_edge_hertz=upper_edge_hertz,
num_mel_bins=num_mel_bins
)(audio_ph),
message=<span style="color:#d20;background-color:#fff0f0">"log mels"</span>
)
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
log_mels = session.run(
log_mel_op,
{
audio_ph: audio
}
)
np.testing.assert_allclose(
log_mels,
librosa_log_mels,
rtol=<span style="color:#00d;font-weight:bold">1e-1</span>,
atol=<span style="color:#00d;font-weight:bold">0</span>
)
</code></pre></div><p>The implementation of the layer that passes the above unit test reads as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">LogMelSpectrogram</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self,
sampling_rate,
n_fft,
frame_step,
lower_edge_hertz,
upper_edge_hertz,
num_mel_bins,
**kwargs):
<span style="color:#038">super</span>(LogMelSpectrogram, self).__init__(**kwargs)
self.sampling_rate = sampling_rate
self.n_fft = n_fft
self.frame_step = frame_step
self.lower_edge_hertz = lower_edge_hertz
self.upper_edge_hertz = upper_edge_hertz
self.num_mel_bins = num_mel_bins
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs, training=<span style="color:#080;font-weight:bold">True</span>):
stfts = tf.contrib.signal.stft(
inputs,
frame_length=self.n_fft,
frame_step=self.frame_step,
fft_length=self.n_fft,
pad_end=<span style="color:#080;font-weight:bold">False</span>
)
power_spectrograms = tf.real(stfts * tf.conj(stfts))
num_spectrogram_bins = power_spectrograms.shape[-<span style="color:#00d;font-weight:bold">1</span>].value
linear_to_mel_weight_matrix = tf.constant(
np.transpose(
librosa.filters.mel(
sr=self.sampling_rate,
n_fft=self.n_fft + <span style="color:#00d;font-weight:bold">1</span>,
n_mels=self.num_mel_bins,
fmin=self.lower_edge_hertz,
fmax=self.upper_edge_hertz
)
),
dtype=tf.float32
)
mel_spectrograms = tf.tensordot(
power_spectrograms,
linear_to_mel_weight_matrix,
<span style="color:#00d;font-weight:bold">1</span>
)
mel_spectrograms.set_shape(
power_spectrograms.shape[:-<span style="color:#00d;font-weight:bold">1</span>].concatenate(
linear_to_mel_weight_matrix.shape[-<span style="color:#00d;font-weight:bold">1</span>:]
)
)
<span style="color:#080;font-weight:bold">return</span> tf.log(mel_spectrograms + <span style="color:#00d;font-weight:bold">1e-6</span>)
</code></pre></div><h5 id="converted-data-lengths-function">Converted data lengths function</h5>
<p>In order to use the CTC loss and decoder efficiently, we need to pass them the length of the data that actually represents audio in each example. This is because not all audio files are the same length, yet we need to pad them with zeros to form mini-batches.</p>
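<p>The arithmetic mirrors how the STFT slides its analysis window: without padding, a signal of <code>n</code> samples yields <code>floor((n - n_fft) / frame_step) + 1</code> frames. In plain Python:</p>

```python
def frame_count(n_samples, n_fft, frame_step):
    """Number of STFT frames for a signal of n_samples, without padding."""
    # Integer floor division matches tf.floor for non-negative values.
    return (n_samples - n_fft) // frame_step + 1
```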
<p>Here’s the unit test:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
npst.arrays(
np.float32,
(st.integers(min_value=<span style="color:#00d;font-weight:bold">16000</span>, max_value=<span style="color:#00d;font-weight:bold">16000</span>*<span style="color:#00d;font-weight:bold">5</span>)),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
),
st.sampled_from([<span style="color:#00d;font-weight:bold">22000</span>, <span style="color:#00d;font-weight:bold">16000</span>, <span style="color:#00d;font-weight:bold">8000</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1024</span>, <span style="color:#00d;font-weight:bold">512</span>, <span style="color:#00d;font-weight:bold">640</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1024</span>, <span style="color:#00d;font-weight:bold">512</span>, <span style="color:#00d;font-weight:bold">160</span>]),
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_compute_lengths_works</span>(self,
audio_wave,
sampling_rate,
n_fft,
frame_step
):
assume(n_fft >= frame_step)
original_wave_length = audio_wave.shape[<span style="color:#00d;font-weight:bold">0</span>]
audio_waves_ph = tf.placeholder(tf.float32, (<span style="color:#080;font-weight:bold">None</span>, <span style="color:#080;font-weight:bold">None</span>), name=<span style="color:#d20;background-color:#fff0f0">"audio_waves"</span>)
original_lengths_ph = tf.placeholder(tf.int32, (<span style="color:#080;font-weight:bold">None</span>), name=<span style="color:#d20;background-color:#fff0f0">"original_lengths"</span>)
lengths_op = compute_lengths(
original_lengths_ph,
{
<span style="color:#d20;background-color:#fff0f0">'frame_step'</span>: frame_step,
<span style="color:#d20;background-color:#fff0f0">'n_fft'</span>: n_fft
}
)
self.assertEqual(lengths_op.dtype, tf.int32)
log_mel_op = LogMelSpectrogram(
sampling_rate=sampling_rate,
n_fft=n_fft,
frame_step=frame_step,
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0.0</span>,
upper_edge_hertz=<span style="color:#00d;font-weight:bold">8000.0</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">13</span>
)(audio_waves_ph)
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
lengths, log_mels = session.run(
[lengths_op, log_mel_op],
{
audio_waves_ph: np.array([audio_wave]),
original_lengths_ph: np.array([original_wave_length])
}
)
note(original_wave_length)
note(lengths)
note(log_mels.shape)
self.assertEqual(lengths[<span style="color:#00d;font-weight:bold">0</span>], log_mels.shape[<span style="color:#00d;font-weight:bold">1</span>])
</code></pre></div><p>And here’s the implementation:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">compute_lengths</span>(original_lengths, params):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Computes the length of data for CTC
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#080;font-weight:bold">return</span> tf.cast(
tf.floor(
(tf.cast(original_lengths, dtype=tf.float32) - params[<span style="color:#d20;background-color:#fff0f0">'n_fft'</span>]) /
params[<span style="color:#d20;background-color:#fff0f0">'frame_step'</span>]
) + <span style="color:#00d;font-weight:bold">1</span>,
tf.int32
)
</code></pre></div><h5 id="atrous-1d-convolutions-layer">Atrous 1D Convolutions layer</h5>
<p>It’s also a good idea to ensure that our dilated convolutions layer behaves as the theory predicts. TensorFlow already includes the ability to specify dilations, but the end result may differ wildly based on the choice of other parameters.</p>
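<p>“Causal” here means that each output sample may depend only on the current and past input samples; with dilation <code>d</code>, a kernel of size <code>k</code> reaches <code>(k - 1) * d</code> steps into the past. A pure-Python sketch of the intended behavior, using the same all-ones weights as the unit test:</p>

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] = sum_j kernel[j] * x[t - (k - 1 - j) * dilation], zero-padded on the left."""
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(x)
    return [
        sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ]
```

<p>With a size-2 kernel and dilation 1, each output is simply <code>x[t] + x[t - 1]</code>, which is exactly what the assertions below rely on.</p>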
<p>Let’s at least ensure that it works as intended when we choose the “causal” mode. The unit test:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_causal_conv1d_works</span>(self):
conv_size2_dilation_1 = AtrousConv1D(
filters=<span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">2</span>,
dilation_rate=<span style="color:#00d;font-weight:bold">1</span>,
kernel_initializer=tf.ones_initializer(),
use_bias=<span style="color:#080;font-weight:bold">False</span>
)
conv_size3_dilation_1 = AtrousConv1D(
filters=<span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">3</span>,
dilation_rate=<span style="color:#00d;font-weight:bold">1</span>,
kernel_initializer=tf.ones_initializer(),
use_bias=<span style="color:#080;font-weight:bold">False</span>
)
conv_size2_dilation_2 = AtrousConv1D(
filters=<span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">2</span>,
dilation_rate=<span style="color:#00d;font-weight:bold">2</span>,
kernel_initializer=tf.ones_initializer(),
use_bias=<span style="color:#080;font-weight:bold">False</span>
)
conv_size2_dilation_3 = AtrousConv1D(
filters=<span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">2</span>,
dilation_rate=<span style="color:#00d;font-weight:bold">3</span>,
kernel_initializer=tf.ones_initializer(),
use_bias=<span style="color:#080;font-weight:bold">False</span>
)
data = np.array(<span style="color:#038">list</span>(<span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">31</span>)))
data_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">1</span>))
size2_dilation_1_1 = conv_size2_dilation_1(data_ph)
size2_dilation_1_2 = conv_size2_dilation_1(size2_dilation_1_1)
size3_dilation_1_1 = conv_size3_dilation_1(data_ph)
size3_dilation_1_2 = conv_size3_dilation_1(size3_dilation_1_1)
size2_dilation_2_1 = conv_size2_dilation_2(data_ph)
size2_dilation_2_2 = conv_size2_dilation_2(size2_dilation_2_1)
size2_dilation_3_1 = conv_size2_dilation_3(data_ph)
size2_dilation_3_2 = conv_size2_dilation_3(size2_dilation_3_1)
<span style="color:#080;font-weight:bold">with</span> tf.Session() <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
outputs = session.run(
[
size2_dilation_1_1,
size2_dilation_1_2,
size3_dilation_1_1,
size3_dilation_1_2,
size2_dilation_2_1,
size2_dilation_2_2,
size2_dilation_3_1,
size2_dilation_3_2
],
{
data_ph: np.reshape(data, (<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">1</span>))
}
)
<span style="color:#080;font-weight:bold">for</span> ix, out <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(outputs):
out = np.squeeze(out)
outputs[ix] = out
self.assertEqual(out.shape[<span style="color:#00d;font-weight:bold">0</span>], <span style="color:#038">len</span>(data))
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">0</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">11</span>, <span style="color:#00d;font-weight:bold">13</span>, <span style="color:#00d;font-weight:bold">15</span>, <span style="color:#00d;font-weight:bold">17</span>, <span style="color:#00d;font-weight:bold">19</span>, <span style="color:#00d;font-weight:bold">21</span>, <span style="color:#00d;font-weight:bold">23</span>, <span style="color:#00d;font-weight:bold">25</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">29</span>, <span style="color:#00d;font-weight:bold">31</span>, <span style="color:#00d;font-weight:bold">33</span>, <span style="color:#00d;font-weight:bold">35</span>, <span style="color:#00d;font-weight:bold">37</span>, <span style="color:#00d;font-weight:bold">39</span>, <span style="color:#00d;font-weight:bold">41</span>, <span style="color:#00d;font-weight:bold">43</span>, <span style="color:#00d;font-weight:bold">45</span>, <span style="color:#00d;font-weight:bold">47</span>, <span style="color:#00d;font-weight:bold">49</span>, <span style="color:#00d;font-weight:bold">51</span>, <span style="color:#00d;font-weight:bold">53</span>, <span style="color:#00d;font-weight:bold">55</span>, <span style="color:#00d;font-weight:bold">57</span>, <span style="color:#00d;font-weight:bold">59</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">1</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">16</span>, <span style="color:#00d;font-weight:bold">20</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">28</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">40</span>, <span style="color:#00d;font-weight:bold">44</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">52</span>, <span style="color:#00d;font-weight:bold">56</span>, <span style="color:#00d;font-weight:bold">60</span>, <span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">68</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">76</span>, <span style="color:#00d;font-weight:bold">80</span>, <span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">88</span>, <span style="color:#00d;font-weight:bold">92</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">100</span>, <span style="color:#00d;font-weight:bold">104</span>, <span style="color:#00d;font-weight:bold">108</span>, <span style="color:#00d;font-weight:bold">112</span>, <span style="color:#00d;font-weight:bold">116</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">2</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">15</span>, <span style="color:#00d;font-weight:bold">18</span>, <span style="color:#00d;font-weight:bold">21</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">33</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">39</span>, <span style="color:#00d;font-weight:bold">42</span>, <span style="color:#00d;font-weight:bold">45</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">51</span>, <span style="color:#00d;font-weight:bold">54</span>, <span style="color:#00d;font-weight:bold">57</span>, <span style="color:#00d;font-weight:bold">60</span>, <span style="color:#00d;font-weight:bold">63</span>, <span style="color:#00d;font-weight:bold">66</span>, <span style="color:#00d;font-weight:bold">69</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">75</span>, <span style="color:#00d;font-weight:bold">78</span>, <span style="color:#00d;font-weight:bold">81</span>, <span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">87</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">3</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">10</span>, <span style="color:#00d;font-weight:bold">18</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">45</span>, <span style="color:#00d;font-weight:bold">54</span>, <span style="color:#00d;font-weight:bold">63</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">81</span>, <span style="color:#00d;font-weight:bold">90</span>, <span style="color:#00d;font-weight:bold">99</span>, <span style="color:#00d;font-weight:bold">108</span>, <span style="color:#00d;font-weight:bold">117</span>, <span style="color:#00d;font-weight:bold">126</span>, <span style="color:#00d;font-weight:bold">135</span>, <span style="color:#00d;font-weight:bold">144</span>, <span style="color:#00d;font-weight:bold">153</span>, <span style="color:#00d;font-weight:bold">162</span>, <span style="color:#00d;font-weight:bold">171</span>, <span style="color:#00d;font-weight:bold">180</span>, <span style="color:#00d;font-weight:bold">189</span>, <span style="color:#00d;font-weight:bold">198</span>, <span style="color:#00d;font-weight:bold">207</span>, <span style="color:#00d;font-weight:bold">216</span>, <span style="color:#00d;font-weight:bold">225</span>, <span style="color:#00d;font-weight:bold">234</span>, <span style="color:#00d;font-weight:bold">243</span>, <span style="color:#00d;font-weight:bold">252</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">4</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#00d;font-weight:bold">10</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">14</span>, <span style="color:#00d;font-weight:bold">16</span>, <span style="color:#00d;font-weight:bold">18</span>, <span style="color:#00d;font-weight:bold">20</span>, <span style="color:#00d;font-weight:bold">22</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">26</span>, <span style="color:#00d;font-weight:bold">28</span>, <span style="color:#00d;font-weight:bold">30</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">34</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">38</span>, <span style="color:#00d;font-weight:bold">40</span>, <span style="color:#00d;font-weight:bold">42</span>, <span style="color:#00d;font-weight:bold">44</span>, <span style="color:#00d;font-weight:bold">46</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">50</span>, <span style="color:#00d;font-weight:bold">52</span>, <span style="color:#00d;font-weight:bold">54</span>, <span style="color:#00d;font-weight:bold">56</span>, <span style="color:#00d;font-weight:bold">58</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">5</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">16</span>, <span style="color:#00d;font-weight:bold">20</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">28</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">40</span>, <span style="color:#00d;font-weight:bold">44</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">52</span>, <span style="color:#00d;font-weight:bold">56</span>, <span style="color:#00d;font-weight:bold">60</span>, <span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">68</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">76</span>, <span style="color:#00d;font-weight:bold">80</span>, <span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">88</span>, <span style="color:#00d;font-weight:bold">92</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">100</span>, <span style="color:#00d;font-weight:bold">104</span>, <span style="color:#00d;font-weight:bold">108</span>, <span style="color:#00d;font-weight:bold">112</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">6</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">11</span>, <span style="color:#00d;font-weight:bold">13</span>, <span style="color:#00d;font-weight:bold">15</span>, <span style="color:#00d;font-weight:bold">17</span>, <span style="color:#00d;font-weight:bold">19</span>, <span style="color:#00d;font-weight:bold">21</span>, <span style="color:#00d;font-weight:bold">23</span>, <span style="color:#00d;font-weight:bold">25</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">29</span>, <span style="color:#00d;font-weight:bold">31</span>, <span style="color:#00d;font-weight:bold">33</span>, <span style="color:#00d;font-weight:bold">35</span>, <span style="color:#00d;font-weight:bold">37</span>, <span style="color:#00d;font-weight:bold">39</span>, <span style="color:#00d;font-weight:bold">41</span>, <span style="color:#00d;font-weight:bold">43</span>, <span style="color:#00d;font-weight:bold">45</span>, <span style="color:#00d;font-weight:bold">47</span>, <span style="color:#00d;font-weight:bold">49</span>, <span style="color:#00d;font-weight:bold">51</span>, <span style="color:#00d;font-weight:bold">53</span>, <span style="color:#00d;font-weight:bold">55</span>, <span style="color:#00d;font-weight:bold">57</span>], dtype=np.float32)
)
np.testing.assert_equal(
outputs[<span style="color:#00d;font-weight:bold">7</span>],
np.array([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">12</span>, <span style="color:#00d;font-weight:bold">16</span>, <span style="color:#00d;font-weight:bold">20</span>, <span style="color:#00d;font-weight:bold">24</span>, <span style="color:#00d;font-weight:bold">28</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">36</span>, <span style="color:#00d;font-weight:bold">40</span>, <span style="color:#00d;font-weight:bold">44</span>, <span style="color:#00d;font-weight:bold">48</span>, <span style="color:#00d;font-weight:bold">52</span>, <span style="color:#00d;font-weight:bold">56</span>, <span style="color:#00d;font-weight:bold">60</span>, <span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">68</span>, <span style="color:#00d;font-weight:bold">72</span>, <span style="color:#00d;font-weight:bold">76</span>, <span style="color:#00d;font-weight:bold">80</span>, <span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">88</span>, <span style="color:#00d;font-weight:bold">92</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">100</span>, <span style="color:#00d;font-weight:bold">104</span>, <span style="color:#00d;font-weight:bold">108</span>], dtype=np.float32)
)
</code></pre></div><p>And the layer’s code:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">AtrousConv1D</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self,
filters,
kernel_size,
dilation_rate,
use_bias=<span style="color:#080;font-weight:bold">True</span>,
kernel_initializer=tf.glorot_normal_initializer(),
causal=<span style="color:#080;font-weight:bold">True</span>
):
<span style="color:#038">super</span>(AtrousConv1D, self).__init__()
self.filters = filters
self.kernel_size = kernel_size
self.dilation_rate = dilation_rate
self.causal = causal
self.conv1d = tf.layers.Conv1D(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
padding=<span style="color:#d20;background-color:#fff0f0">'valid'</span> <span style="color:#080;font-weight:bold">if</span> causal <span style="color:#080;font-weight:bold">else</span> <span style="color:#d20;background-color:#fff0f0">'same'</span>,
use_bias=use_bias,
kernel_initializer=kernel_initializer
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs):
<span style="color:#080;font-weight:bold">if</span> self.causal:
padding = (self.kernel_size - <span style="color:#00d;font-weight:bold">1</span>) * self.dilation_rate
inputs = tf.pad(inputs, tf.constant([(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>,), (<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0</span>), (<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>)]) * padding)  <span style="color:#888"># scales to [(0, 0), (padding, 0), (0, 0)]: left-pad the time axis only</span>
<span style="color:#080;font-weight:bold">return</span> self.conv1d(inputs)
</code></pre></div><h5 id="residual-block-layer">Residual Block layer</h5>
<p>One aspect we haven’t covered yet is the heavy use of batch normalization. When coding the residual block layer, one of the most important tasks is ensuring that batch normalization is applied correctly both during training and during inference.</p>
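<p>As a reminder of what’s at stake, here’s a minimal NumPy sketch of that train-vs-inference distinction. It only normalizes per feature; the real <code>tf.layers.batch_normalization</code> additionally learns a scale and shift, and updates its moving statistics via the <code>UPDATE_OPS</code> collection:</p>

```python
import numpy as np

class ToyBatchNorm:
    # Minimal sketch: normalize with the current batch's statistics while
    # training, but with accumulated moving statistics at inference time.
    def __init__(self, num_features, momentum=0.99, eps=1e-3):
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # the moving averages are what inference will rely on
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)
```

<p>Forgetting the distinction (e.g. always using batch statistics) is a classic source of models that score well in training yet behave erratically when serving single examples.</p>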
<p>Here’s the unit test:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
npst.arrays(
np.float32,
(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
),
st.sampled_from([<span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">32</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">3</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>]),
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_residual_block_works</span>(self, audio_waves, filters, size, dilation_rate):
<span style="color:#080;font-weight:bold">with</span> tf.Graph().as_default() <span style="color:#080;font-weight:bold">as</span> g:
audio_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#080;font-weight:bold">None</span>))
log_mel_op = LogMelSpectrogram(
sampling_rate=<span style="color:#00d;font-weight:bold">16000</span>,
n_fft=<span style="color:#00d;font-weight:bold">512</span>,
frame_step=<span style="color:#00d;font-weight:bold">256</span>,
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0</span>,
upper_edge_hertz=<span style="color:#00d;font-weight:bold">8000</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">10</span>
)(audio_ph)
expanded_op = tf.layers.Dense(filters)(log_mel_op)
_, block_op = ResidualBlock(
filters=filters,
kernel_size=size,
causal=<span style="color:#080;font-weight:bold">True</span>,
dilation_rate=dilation_rate
)(expanded_op, training=<span style="color:#080;font-weight:bold">True</span>)
<span style="color:#888"># really dumb loss function just for the sake</span>
<span style="color:#888"># of testing:</span>
loss_op = tf.reduce_sum(block_op)
variables = tf.trainable_variables()
self.assertTrue(<span style="color:#038">any</span>([<span style="color:#d20;background-color:#fff0f0">"batch_normalization"</span> <span style="color:#080">in</span> var.name <span style="color:#080;font-weight:bold">for</span> var <span style="color:#080">in</span> variables]))
grads_op = tf.gradients(
loss_op,
variables
)
<span style="color:#080;font-weight:bold">for</span> grad, var <span style="color:#080">in</span> <span style="color:#038">zip</span>(grads_op, variables):
<span style="color:#080;font-weight:bold">if</span> grad <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
note(var)
self.assertTrue(grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>)
<span style="color:#080;font-weight:bold">with</span> tf.Session(graph=g) <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
result, expanded, grads, _ = session.run(
[block_op, expanded_op, grads_op, loss_op],
{
audio_ph: audio_waves
}
)
self.assertFalse(np.array_equal(result, expanded))
self.assertEqual(result.shape, expanded.shape)
self.assertEqual(<span style="color:#038">len</span>(grads), <span style="color:#038">len</span>(variables))
self.assertFalse(<span style="color:#038">any</span>([np.isnan(grad).any() <span style="color:#080;font-weight:bold">for</span> grad <span style="color:#080">in</span> grads]))
</code></pre></div><p>And here’s the implementation:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">ResidualBlock</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self, filters, kernel_size, dilation_rate, causal, **kwargs):
<span style="color:#038">super</span>(ResidualBlock, self).__init__(**kwargs)
self.dilated_conv1 = AtrousConv1D(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
causal=causal
)
self.dilated_conv2 = AtrousConv1D(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
causal=causal
)
self.out = tf.layers.Conv1D(
filters=filters,
kernel_size=<span style="color:#00d;font-weight:bold">1</span>
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs, training=<span style="color:#080;font-weight:bold">True</span>):
data = tf.layers.batch_normalization(
inputs,
training=training
)
filters = self.dilated_conv1(data)
gates = self.dilated_conv2(data)
filters = tf.nn.tanh(filters)
gates = tf.nn.sigmoid(gates)
out = tf.nn.tanh(
self.out(
filters * gates
)
)
<span style="color:#080;font-weight:bold">return</span> out + inputs, out
</code></pre></div><h5 id="residual-stack-layer">Residual Stack layer</h5>
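<p>Stacking blocks with growing dilation rates is what buys the network a large receptive field cheaply. Assuming the causal mode, and noting that each block’s filter and gate convolutions run in parallel (and its 1×1 convolution doesn’t widen the field), each block extends the lookback by <code>(kernel_size − 1) · dilation_rate</code>. A back-of-the-envelope sketch:</p>

```python
def stack_receptive_field(kernel_size, dilation_rates):
    # Each residual block contributes one dilated-conv "depth":
    # its filter and gate convolutions sit in parallel, and the 1x1
    # output convolution does not widen the receptive field.
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

# kernel size 3 with dilation rates [1, 2, 4]:
# 1 + 2 + 4 + 8 = 15 input steps visible to each output step.
stack_receptive_field(3, [1, 2, 4])
```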
<p>Testing the residual stack follows the same kind of logic:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
npst.arrays(
np.float32,
(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
),
st.sampled_from([<span style="color:#00d;font-weight:bold">64</span>, <span style="color:#00d;font-weight:bold">32</span>]),
st.sampled_from([<span style="color:#00d;font-weight:bold">7</span>, <span style="color:#00d;font-weight:bold">3</span>])
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_residual_stack_works</span>(self, audio_waves, filters, size):
dilation_rates = [<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">4</span>]
<span style="color:#080;font-weight:bold">with</span> tf.Graph().as_default() <span style="color:#080;font-weight:bold">as</span> g:
audio_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#080;font-weight:bold">None</span>))
log_mel_op = LogMelSpectrogram(
sampling_rate=<span style="color:#00d;font-weight:bold">16000</span>,
n_fft=<span style="color:#00d;font-weight:bold">512</span>,
frame_step=<span style="color:#00d;font-weight:bold">256</span>,
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0</span>,
upper_edge_hertz=<span style="color:#00d;font-weight:bold">8000</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">10</span>
)(audio_ph)
expanded_op = tf.layers.Dense(filters)(log_mel_op)
stack_op = ResidualStack(
filters=filters,
kernel_size=size,
causal=<span style="color:#080;font-weight:bold">True</span>,
dilation_rates=dilation_rates
)(expanded_op, training=<span style="color:#080;font-weight:bold">True</span>)
<span style="color:#888"># really dumb loss function just for the sake</span>
<span style="color:#888"># of testing:</span>
loss_op = tf.reduce_sum(stack_op)
variables = tf.trainable_variables()
self.assertTrue(<span style="color:#038">any</span>([<span style="color:#d20;background-color:#fff0f0">"batch_normalization"</span> <span style="color:#080">in</span> var.name <span style="color:#080;font-weight:bold">for</span> var <span style="color:#080">in</span> variables]))
grads_op = tf.gradients(
loss_op,
variables
)
<span style="color:#080;font-weight:bold">for</span> grad, var <span style="color:#080">in</span> <span style="color:#038">zip</span>(grads_op, variables):
<span style="color:#080;font-weight:bold">if</span> grad <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
note(var)
self.assertTrue(grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>)
<span style="color:#080;font-weight:bold">with</span> tf.Session(graph=g) <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
result, expanded, grads, _ = session.run(
[stack_op, expanded_op, grads_op, loss_op],
{
audio_ph: audio_waves
}
)
self.assertFalse(np.array_equal(result, expanded))
self.assertEqual(result.shape, expanded.shape)
self.assertEqual(<span style="color:#038">len</span>(grads), <span style="color:#038">len</span>(variables))
self.assertFalse(<span style="color:#038">any</span>([np.isnan(grad).any() <span style="color:#080;font-weight:bold">for</span> grad <span style="color:#080">in</span> grads]))
</code></pre></div><p>With the layer’s code looking as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">ResidualStack</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self, filters, kernel_size, dilation_rates, causal, **kwargs):
<span style="color:#038">super</span>(ResidualStack, self).__init__(**kwargs)
self.blocks = [
ResidualBlock(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
causal=causal
)
<span style="color:#080;font-weight:bold">for</span> dilation_rate <span style="color:#080">in</span> dilation_rates
]
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs, training=<span style="color:#080;font-weight:bold">True</span>):
data = inputs
skip = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> block <span style="color:#080">in</span> self.blocks:
data, current_skip = block(data, training=training)
skip += current_skip
<span style="color:#080;font-weight:bold">return</span> skip
</code></pre></div><h5 id="the-speechnet">The SpeechNet</h5>
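<p>One detail worth calling out: the test below asserts that the logits have 5 output channels for the alphabet <code>'abcd'</code>. That’s because CTC reserves one extra class for the “blank” symbol on top of the alphabet:</p>

```python
# CTC needs one extra output class for the "blank" symbol:
alphabet = 'abcd'
num_classes = len(alphabet) + 1  # -> 5
```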
<p>Finally, let’s add a very similar test for the SpeechNet itself:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#555">@given</span>(
npst.arrays(
np.float32,
(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">16000</span>),
elements=st.floats(-<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span>)
)
)
<span style="color:#555">@settings</span>(max_examples=<span style="color:#00d;font-weight:bold">10</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test_speech_net_works</span>(self, audio_waves):
<span style="color:#080;font-weight:bold">with</span> tf.Graph().as_default() <span style="color:#080;font-weight:bold">as</span> g:
audio_ph = tf.placeholder(tf.float32, (<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#080;font-weight:bold">None</span>))
logits_op = SpeechNet(
experiment_params(
{},
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">4</span>],
stack_kernel_size=<span style="color:#00d;font-weight:bold">3</span>,
stack_filters=<span style="color:#00d;font-weight:bold">32</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">'abcd'</span>
)
)(audio_ph)
<span style="color:#888"># really dumb loss function just for the sake</span>
<span style="color:#888"># of testing:</span>
loss_op = tf.reduce_sum(logits_op)
variables = tf.trainable_variables()
self.assertTrue(<span style="color:#038">any</span>([<span style="color:#d20;background-color:#fff0f0">"batch_normalization"</span> <span style="color:#080">in</span> var.name <span style="color:#080;font-weight:bold">for</span> var <span style="color:#080">in</span> variables]))
grads_op = tf.gradients(
loss_op,
variables
)
<span style="color:#080;font-weight:bold">for</span> grad, var <span style="color:#080">in</span> <span style="color:#038">zip</span>(grads_op, variables):
<span style="color:#080;font-weight:bold">if</span> grad <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
note(var)
self.assertTrue(grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>)
<span style="color:#080;font-weight:bold">with</span> tf.Session(graph=g) <span style="color:#080;font-weight:bold">as</span> session:
session.run(tf.global_variables_initializer())
result, grads, _ = session.run(
[logits_op, grads_op, loss_op],
{
audio_ph: audio_waves
}
)
self.assertEqual(result.shape[<span style="color:#00d;font-weight:bold">2</span>], <span style="color:#00d;font-weight:bold">5</span>)
self.assertEqual(<span style="color:#038">len</span>(grads), <span style="color:#038">len</span>(variables))
self.assertFalse(<span style="color:#038">any</span>([np.isnan(grad).any() <span style="color:#080;font-weight:bold">for</span> grad <span style="color:#080">in</span> grads]))
</code></pre></div><p>And let’s provide the code that passes it:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">SpeechNet</span>(tf.layers.Layer):
<span style="color:#080;font-weight:bold">def</span> __init__(self, params, **kwargs):
<span style="color:#038">super</span>(SpeechNet, self).__init__(**kwargs)
self.to_log_mel = LogMelSpectrogram(
sampling_rate=params[<span style="color:#d20;background-color:#fff0f0">'sampling_rate'</span>],
n_fft=params[<span style="color:#d20;background-color:#fff0f0">'n_fft'</span>],
frame_step=params[<span style="color:#d20;background-color:#fff0f0">'frame_step'</span>],
lower_edge_hertz=params[<span style="color:#d20;background-color:#fff0f0">'lower_edge_hertz'</span>],
upper_edge_hertz=params[<span style="color:#d20;background-color:#fff0f0">'upper_edge_hertz'</span>],
num_mel_bins=params[<span style="color:#d20;background-color:#fff0f0">'num_mel_bins'</span>]
)
self.expand = tf.layers.Conv1D(
filters=params[<span style="color:#d20;background-color:#fff0f0">'stack_filters'</span>],
kernel_size=<span style="color:#00d;font-weight:bold">1</span>,
padding=<span style="color:#d20;background-color:#fff0f0">'same'</span>
)
self.stacks = [
ResidualStack(
filters=params[<span style="color:#d20;background-color:#fff0f0">'stack_filters'</span>],
kernel_size=params[<span style="color:#d20;background-color:#fff0f0">'stack_kernel_size'</span>],
dilation_rates=params[<span style="color:#d20;background-color:#fff0f0">'stack_dilation_rates'</span>],
causal=params[<span style="color:#d20;background-color:#fff0f0">'causal_convolutions'</span>]
)
<span style="color:#080;font-weight:bold">for</span> _ <span style="color:#080">in</span> <span style="color:#038">range</span>(params[<span style="color:#d20;background-color:#fff0f0">'stacks'</span>])
]
self.out = tf.layers.Conv1D(
filters=<span style="color:#038">len</span>(params[<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>]) + <span style="color:#00d;font-weight:bold">1</span>,
kernel_size=<span style="color:#00d;font-weight:bold">1</span>,
padding=<span style="color:#d20;background-color:#fff0f0">'same'</span>
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">call</span>(self, inputs, training=<span style="color:#080;font-weight:bold">True</span>):
data = self.to_log_mel(inputs)
data = tf.layers.batch_normalization(
data,
training=training
)
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">len</span>(data.shape) == <span style="color:#00d;font-weight:bold">2</span>:
data = tf.expand_dims(data, <span style="color:#00d;font-weight:bold">0</span>)
data = self.expand(data)
<span style="color:#080;font-weight:bold">for</span> stack <span style="color:#080">in</span> self.stacks:
data = stack(data, training=training)
data = tf.layers.batch_normalization(
data,
training=training
)
<span style="color:#080;font-weight:bold">return</span> self.out(data) + <span style="color:#00d;font-weight:bold">1e-8</span>
</code></pre></div><h5 id="the-model-function">The model function</h5>
<p>We have one last piece of code to cover before we can start the training: the <code>model_fn</code> that adheres to the TensorFlow Estimator API:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">model_fn</span>(features, labels, mode, params):
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">isinstance</span>(features, <span style="color:#038">dict</span>):
audio = features[<span style="color:#d20;background-color:#fff0f0">'audio'</span>]
original_lengths = features[<span style="color:#d20;background-color:#fff0f0">'length'</span>]
<span style="color:#080;font-weight:bold">else</span>:
audio, original_lengths = features
lengths = compute_lengths(original_lengths, params)
<span style="color:#080;font-weight:bold">if</span> labels <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
codes = encode_labels(labels, params)
network = SpeechNet(params)
is_training = mode==tf.estimator.ModeKeys.TRAIN
logits = network(audio, training=is_training)
text, predicted_codes = decode_logits(logits, lengths, params)
<span style="color:#080;font-weight:bold">if</span> mode == tf.estimator.ModeKeys.PREDICT:
predictions = {
<span style="color:#d20;background-color:#fff0f0">'logits'</span>: logits,
<span style="color:#d20;background-color:#fff0f0">'text'</span>: tf.sparse_tensor_to_dense(
text,
<span style="color:#d20;background-color:#fff0f0">''</span>
)
}
export_outputs = {
<span style="color:#d20;background-color:#fff0f0">'predictions'</span>: tf.estimator.export.PredictOutput(predictions)
}
<span style="color:#080;font-weight:bold">return</span> tf.estimator.EstimatorSpec(
mode,
predictions=predictions,
export_outputs=export_outputs
)
<span style="color:#080;font-weight:bold">else</span>:
loss = tf.reduce_mean(
tf.nn.ctc_loss(
labels=codes,
inputs=logits,
sequence_length=lengths,
time_major=<span style="color:#080;font-weight:bold">False</span>,
ignore_longer_outputs_than_inputs=<span style="color:#080;font-weight:bold">True</span>
)
)
mean_edit_distance = tf.reduce_mean(
tf.edit_distance(
tf.cast(predicted_codes, tf.int32),
codes
)
)
distance_metric = tf.metrics.mean(mean_edit_distance)
<span style="color:#080;font-weight:bold">if</span> mode == tf.estimator.ModeKeys.EVAL:
<span style="color:#080;font-weight:bold">return</span> tf.estimator.EstimatorSpec(
mode,
loss=loss,
eval_metric_ops={ <span style="color:#d20;background-color:#fff0f0">'edit_distance'</span>: distance_metric }
)
<span style="color:#080;font-weight:bold">elif</span> mode == tf.estimator.ModeKeys.TRAIN:
global_step = tf.train.get_or_create_global_step()
tf.summary.text(
<span style="color:#d20;background-color:#fff0f0">'train_predicted_text'</span>,
tf.sparse_tensor_to_dense(text, <span style="color:#d20;background-color:#fff0f0">''</span>)
)
tf.summary.scalar(<span style="color:#d20;background-color:#fff0f0">'train_edit_distance'</span>, mean_edit_distance)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
<span style="color:#080;font-weight:bold">with</span> tf.control_dependencies(update_ops):
train_op = tf.contrib.layers.optimize_loss(
loss=loss,
global_step=global_step,
learning_rate=params[<span style="color:#d20;background-color:#fff0f0">'lr'</span>],
optimizer=(params[<span style="color:#d20;background-color:#fff0f0">'optimizer'</span>]),
update_ops=update_ops,
clip_gradients=params[<span style="color:#d20;background-color:#fff0f0">'clip_gradients'</span>],
summaries=[
<span style="color:#d20;background-color:#fff0f0">"learning_rate"</span>,
<span style="color:#d20;background-color:#fff0f0">"loss"</span>,
<span style="color:#d20;background-color:#fff0f0">"global_gradient_norm"</span>,
]
)
<span style="color:#080;font-weight:bold">return</span> tf.estimator.EstimatorSpec(
mode,
loss=loss,
train_op=train_op
)
</code></pre></div><p>Using this API, we’ll get lots of stats in TensorBoard for free. It will also make it very easy to validate the model and to export it in the <code>SavedModel</code> format.</p>
<p>In order to easily experiment with different hyperparameters, I’ve also created a helper function as listed below:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">copy</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">experiment</span>(data_params=dataset_params(), **kwargs):
params = experiment_params(
data_params,
**kwargs
)
<span style="color:#038">print</span>(params)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
model_dir=<span style="color:#d20;background-color:#fff0f0">'stats/</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(experiment_name(params)),
params=params
)
train_spec = tf.estimator.TrainSpec(
input_fn=input_fn(
train_data,
params[<span style="color:#d20;background-color:#fff0f0">'data'</span>]
)
)
features = {
<span style="color:#d20;background-color:#fff0f0">"audio"</span>: tf.placeholder(dtype=tf.float32, shape=[<span style="color:#080;font-weight:bold">None</span>]),
<span style="color:#d20;background-color:#fff0f0">"length"</span>: tf.placeholder(dtype=tf.int32, shape=[])
}
serving_input_receiver_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
features
)
best_exporter = tf.estimator.BestExporter(
name=<span style="color:#d20;background-color:#fff0f0">"best_exporter"</span>,
serving_input_receiver_fn=serving_input_receiver_fn,
exports_to_keep=<span style="color:#00d;font-weight:bold">5</span>
)
eval_params = copy.deepcopy(params[<span style="color:#d20;background-color:#fff0f0">'data'</span>])
eval_params[<span style="color:#d20;background-color:#fff0f0">'augment'</span>] = <span style="color:#080;font-weight:bold">False</span>
eval_spec = tf.estimator.EvalSpec(
input_fn=input_fn(
eval_data,
eval_params
),
throttle_secs=<span style="color:#00d;font-weight:bold">60</span>*<span style="color:#00d;font-weight:bold">30</span>,
exporters=best_exporter
)
tf.estimator.train_and_evaluate(
estimator,
train_spec,
eval_spec
)
</code></pre></div><p>I’ve also written two more helpers: one to test the model’s accuracy and one to get the test set predictions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">test</span>(data_params=dataset_params(), **kwargs):
params = experiment_params(
data_params,
**kwargs
)
<span style="color:#038">print</span>(params)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
model_dir=<span style="color:#d20;background-color:#fff0f0">'stats/</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(experiment_name(params)),
params=params
)
eval_params = copy.deepcopy(params[<span style="color:#d20;background-color:#fff0f0">'data'</span>])
eval_params[<span style="color:#d20;background-color:#fff0f0">'augment'</span>] = <span style="color:#080;font-weight:bold">False</span>
eval_params[<span style="color:#d20;background-color:#fff0f0">'epochs'</span>] = <span style="color:#00d;font-weight:bold">1</span>
eval_params[<span style="color:#d20;background-color:#fff0f0">'shuffle'</span>] = <span style="color:#080;font-weight:bold">False</span>
estimator.evaluate(
input_fn=input_fn(
test_data,
eval_params
)
)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">predict_test</span>(**kwargs):
params = experiment_params(
dataset_params(
augment=<span style="color:#080;font-weight:bold">False</span>,
shuffle=<span style="color:#080;font-weight:bold">False</span>,
batch_size=<span style="color:#00d;font-weight:bold">1</span>,
epochs=<span style="color:#00d;font-weight:bold">1</span>,
parallelize=<span style="color:#080;font-weight:bold">False</span>
),
**kwargs
)
<span style="color:#038">print</span>(<span style="color:#038">len</span>(test_data))
estimator = tf.estimator.Estimator(
model_fn=model_fn,
model_dir=<span style="color:#d20;background-color:#fff0f0">'stats/</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(experiment_name(params)),
params=params
)
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">list</span>(
estimator.predict(
input_fn=input_fn(
test_data,
params[<span style="color:#d20;background-color:#fff0f0">'data'</span>]
)
)
)
</code></pre></div><p>These in turn depend on the following functions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">experiment_params</span>(data,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Adam'</span>,
lr=<span style="color:#00d;font-weight:bold">1e-4</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">" 'abcdefghijklmnopqrstuvwxyz"</span>,
causal_convolutions=<span style="color:#080;font-weight:bold">True</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">81</span>],
stacks=<span style="color:#00d;font-weight:bold">2</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">3</span>,
stack_filters=<span style="color:#00d;font-weight:bold">32</span>,
sampling_rate=<span style="color:#00d;font-weight:bold">16000</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>,
lower_edge_hertz=<span style="color:#00d;font-weight:bold">0</span>,
upper_edge_hertz=<span style="color:#00d;font-weight:bold">8000</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
clip_gradients=<span style="color:#080;font-weight:bold">None</span>,
codename=<span style="color:#d20;background-color:#fff0f0">'regular'</span>,
**kwargs):
params = {
<span style="color:#d20;background-color:#fff0f0">'optimizer'</span>: optimizer,
<span style="color:#d20;background-color:#fff0f0">'lr'</span>: lr,
<span style="color:#d20;background-color:#fff0f0">'data'</span>: data,
<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>: alphabet,
<span style="color:#d20;background-color:#fff0f0">'causal_convolutions'</span>: causal_convolutions,
<span style="color:#d20;background-color:#fff0f0">'stack_dilation_rates'</span>: stack_dilation_rates,
<span style="color:#d20;background-color:#fff0f0">'stacks'</span>: stacks,
<span style="color:#d20;background-color:#fff0f0">'stack_kernel_size'</span>: stack_kernel_size,
<span style="color:#d20;background-color:#fff0f0">'stack_filters'</span>: stack_filters,
<span style="color:#d20;background-color:#fff0f0">'sampling_rate'</span>: sampling_rate,
<span style="color:#d20;background-color:#fff0f0">'n_fft'</span>: n_fft,
<span style="color:#d20;background-color:#fff0f0">'frame_step'</span>: frame_step,
<span style="color:#d20;background-color:#fff0f0">'lower_edge_hertz'</span>: lower_edge_hertz,
<span style="color:#d20;background-color:#fff0f0">'upper_edge_hertz'</span>: upper_edge_hertz,
<span style="color:#d20;background-color:#fff0f0">'num_mel_bins'</span>: num_mel_bins,
<span style="color:#d20;background-color:#fff0f0">'clip_gradients'</span>: clip_gradients,
<span style="color:#d20;background-color:#fff0f0">'codename'</span>: codename
}
<span style="color:#080;font-weight:bold">if</span> kwargs <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span> <span style="color:#080">and</span> <span style="color:#d20;background-color:#fff0f0">'data'</span> <span style="color:#080">in</span> kwargs:
params[<span style="color:#d20;background-color:#fff0f0">'data'</span>] = { **params[<span style="color:#d20;background-color:#fff0f0">'data'</span>], **kwargs[<span style="color:#d20;background-color:#fff0f0">'data'</span>] }
<span style="color:#080;font-weight:bold">del</span> kwargs[<span style="color:#d20;background-color:#fff0f0">'data'</span>]
<span style="color:#080;font-weight:bold">if</span> kwargs <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
params = { **params, **kwargs }
<span style="color:#080;font-weight:bold">return</span> params
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">experiment_name</span>(params, excluded_keys=[<span style="color:#d20;background-color:#fff0f0">'alphabet'</span>, <span style="color:#d20;background-color:#fff0f0">'data'</span>, <span style="color:#d20;background-color:#fff0f0">'lr'</span>, <span style="color:#d20;background-color:#fff0f0">'clip_gradients'</span>]):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">represent</span>(key, value):
<span style="color:#080;font-weight:bold">if</span> key <span style="color:#080">in</span> excluded_keys:
<span style="color:#080;font-weight:bold">return</span> <span style="color:#080;font-weight:bold">None</span>
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">isinstance</span>(value, <span style="color:#038">list</span>):
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">'</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">_</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(key, <span style="color:#d20;background-color:#fff0f0">'_'</span>.join([<span style="color:#038">str</span>(v) <span style="color:#080;font-weight:bold">for</span> v <span style="color:#080">in</span> value]))
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">'</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">_</span><span style="color:#33b;background-color:#fff0f0">{}</span><span style="color:#d20;background-color:#fff0f0">'</span>.format(key, value)
parts = <span style="color:#038">filter</span>(
<span style="color:#080;font-weight:bold">lambda</span> p: p <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>,
[
represent(k, params[k])
<span style="color:#080;font-weight:bold">for</span> k <span style="color:#080">in</span> <span style="color:#038">sorted</span>(params.keys())
]
)
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">'/'</span>.join(parts)
</code></pre></div><p>Each new set of hyperparameters constitutes a different “experiment”. Each one writes its statistics to a separate directory, which makes the runs easy to filter in TensorBoard.</p>
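<p>To make the naming scheme concrete, here’s a standalone sketch of how <code>experiment_name</code> builds such a directory name — the parameter values below are made up:</p>

```python
# Standalone copy of the naming scheme from experiment_name above;
# excluded keys don't show up in the resulting directory path.
def experiment_name(params, excluded_keys=('alphabet', 'data', 'lr', 'clip_gradients')):
    def represent(key, value):
        if key in excluded_keys:
            return None
        if isinstance(value, list):
            return '{}_{}'.format(key, '_'.join(str(v) for v in value))
        return '{}_{}'.format(key, value)

    parts = filter(
        lambda p: p is not None,
        [represent(k, params[k]) for k in sorted(params.keys())]
    )
    return '/'.join(parts)

# Hypothetical params, just to show the shape of the output:
print(experiment_name({
    'codename': 'regular',
    'stacks': 2,
    'stack_dilation_rates': [1, 3, 9],
    'lr': 1e-4
}))
# codename_regular/stack_dilation_rates_1_3_9/stacks_2
```

<p>The sorted, filtered keys give every experiment a stable, human-readable path under <code>stats/</code>, which is exactly what TensorBoard’s run filter operates on.</p>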
<p>The <code>experiment</code> function uses TensorFlow’s <code>train_and_evaluate</code> function, which periodically tests the model against the validation set. This is how we gauge how well it generalizes. It also uses the <code>tf.estimator.BestExporter</code> class to automatically export <code>SavedModel</code> files for the best-performing versions.</p>
<h5 id="other-aspects">Other aspects</h5>
<p>Covering the full code listing wouldn’t be practical for an article like this; we’ve covered the most important parts above. I invite you to have a look at the Jupyter notebook itself, which is hosted on GitHub: <a href="https://github.com/kamilc/speech-recognition">kamilc/speech-recognition</a>.</p>
<h3 id="lets-train-it">Let’s train it</h3>
<p>Before we can dive in and start training the model using the code above, we need to set a few things up.</p>
<p>First of all, I’m using Docker. This way I’m not constrained by, for example, the version of CUDA I have installed.</p>
<p>Here’s the Dockerfile for this project:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-dockerfile" data-lang="dockerfile"><span style="color:#080;font-weight:bold">FROM</span><span style="color:#d20;background-color:#fff0f0"> tensorflow/tensorflow:latest-devel-gpu-py3</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> apt-get update<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> apt-get install -y ffmpeg git cmake<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> pip install matplotlib pandas scikit-learn librosa seaborn hickle hypothesis[pandas]<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> mkdir -p /home/data-science/projects<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">VOLUME</span><span style="color:#d20;background-color:#fff0f0"> /home/data-science/projects</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"c.NotebookApp.token = ''"</span> >> ~/.jupyter/jupyter_notebook_config.py<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"c.NotebookApp.password = ''"</span> >> ~/.jupyter/jupyter_notebook_config.py<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">WORKDIR</span><span style="color:#d20;background-color:#fff0f0"> /home/data-science/projects</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">RUN</span> pip install git+https://github.com/Supervisor/supervisor && <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> mkdir -p /var/log/supervisor<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">ADD</span> supervisor.conf /etc/supervisor.conf<span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">EXPOSE</span><span style="color:#d20;background-color:#fff0f0"> 80</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">EXPOSE</span><span style="color:#d20;background-color:#fff0f0"> 6006</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2">
</span><span style="color:#a61717;background-color:#e3d2d2"></span><span style="color:#080;font-weight:bold">CMD</span> supervisord -c /etc/supervisor.conf<span style="color:#a61717;background-color:#e3d2d2">
</span></code></pre></div><p>I also like to make my life easier with a Makefile that automates common project-related tasks:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-makefile" data-lang="makefile"><span style="color:#06b;font-weight:bold">build</span>:
nvidia-docker build -t speech-recognition:latest .
<span style="color:#06b;font-weight:bold">run</span>:
nvidia-docker run -p 80:80 -p 6006:6006 --shm-size 16G --mount <span style="color:#369">type</span>=bind,source=/home/kamil/projects/speech-recognition,target=/home/data-science/projects -it speech-recognition
<span style="color:#06b;font-weight:bold">bash</span>:
nvidia-docker run --mount <span style="color:#369">type</span>=bind,source=/home/kamil/projects/speech-recognition,target=/home/data-science/projects -it speech-recognition bash
</code></pre></div><p>We’ll use TensorBoard to visualize the progress, and at the same time we need the Jupyter notebook server to be running. To run both in one container, we’ll use a supervisor daemon. Here’s its config file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ini" data-lang="ini"><span style="color:#080;font-weight:bold">[supervisord]</span>
<span style="color:#369">nodaemon</span>=<span style="color:#d20;background-color:#fff0f0">true</span>
<span style="color:#080;font-weight:bold">[program:jupyter]</span>
<span style="color:#369">command</span>=<span style="color:#d20;background-color:#fff0f0">bash -c "source /etc/bash.bashrc && jupyter notebook --notebook-dir=/home/data-science/projects --ip 0.0.0.0 --no-browser --allow-root --port=80"</span>
<span style="color:#080;font-weight:bold">[program:tensorboard]</span>
<span style="color:#369">command</span>=<span style="color:#d20;background-color:#fff0f0">tensorboard --logdir /home/data-science/projects/stats</span>
</code></pre></div><p>In order to run the Jupyter notebook and start experimenting, first build the image:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">make build
</code></pre></div><p>And then start the container with TensorFlow, Jupyter, and TensorBoard:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">make run
</code></pre></div><p>The notebook includes a helper function for running experiments. Here’s the invocation with the set of parameters that worked best for me:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">experiment(
dataset_params(
batch_size=<span style="color:#00d;font-weight:bold">18</span>,
epochs=<span style="color:#00d;font-weight:bold">10</span>,
max_wave_length=<span style="color:#00d;font-weight:bold">320000</span>,
augment=<span style="color:#080;font-weight:bold">True</span>,
random_noise=<span style="color:#00d;font-weight:bold">0.75</span>,
random_noise_factor_min=<span style="color:#00d;font-weight:bold">0.1</span>,
random_noise_factor_max=<span style="color:#00d;font-weight:bold">0.15</span>,
random_stretch_min=<span style="color:#00d;font-weight:bold">0.8</span>,
random_stretch_max=<span style="color:#00d;font-weight:bold">1.2</span>
),
codename=<span style="color:#d20;background-color:#fff0f0">'deep_max_20_seconds'</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">' !"&</span><span style="color:#04d;background-color:#fff0f0">\'</span><span style="color:#d20;background-color:#fff0f0">,-.01234:;</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">abcdefghijklmnopqrstuvwxyz'</span>, <span style="color:#888"># !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz</span>
causal_convolutions=<span style="color:#080;font-weight:bold">False</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>],
stacks=<span style="color:#00d;font-weight:bold">6</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">7</span>,
stack_filters=<span style="color:#00d;font-weight:bold">3</span>*<span style="color:#00d;font-weight:bold">128</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">8</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Momentum'</span>,
lr=<span style="color:#00d;font-weight:bold">0.00001</span>,
clip_gradients=<span style="color:#00d;font-weight:bold">20.0</span>
)
</code></pre></div><p>The training process takes a lot of time. On my machine, it took more than two weeks. Searching for the best set of parameters is very difficult (and not fun).</p>
<p>The function accepts <code>max_text_length</code> as one of its parameters. I first ran the experiments with it set to some small value (e.g. 15 characters), which constrains the data set to a narrow set of “easy” files. The reason is that issues with the architecture are easy to spot on an easy set: if the model isn’t converging even here, then we surely have a bug.</p>
<p>For the main training procedure, this parameter is kept unset.</p>
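<p>The same smoke-testing idea can be expressed as a plain-Python filter. The example structure and the <code>text</code> field below are hypothetical — in the notebook, the equivalent filtering happens inside the input pipeline:</p>

```python
# Hypothetical sketch: keep only examples with short transcriptions,
# mirroring what a small max_text_length does to the data set.
def easy_subset(examples, max_text_length=15):
    return [ex for ex in examples if len(ex['text']) <= max_text_length]

examples = [
    {'text': 'hello world'},
    {'text': 'a much longer transcription of speech'},
]
print(len(easy_subset(examples)))  # 1
```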
<h3 id="results">Results</h3>
<p>With TensorBoard, we get a handy tool for monitoring the progress. I made the <code>model_fn</code> output the <a href="https://en.wikipedia.org/wiki/Edit_distance">edit distance</a> statistics for both the training and the evaluation set.</p>
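<p>As a refresher, the edit distance is the minimum number of single-character insertions, deletions, and substitutions turning one string into another. A minimal pure-Python version looks like this (<code>tf.edit_distance</code> computes the same quantity, except that it works on sparse tensors of label codes and normalizes by length by default):</p>

```python
# Minimal Levenshtein (edit) distance via dynamic programming.
def edit_distance(a, b):
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]

print(edit_distance('speech', 'speach'))  # 1
```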
<p>The statistics for the <a href="https://en.wikipedia.org/wiki/Connectionist_temporal_classification">CTC Loss</a> are included by default.</p>
<p>Here are the charts for the final model included in the GitHub repo:</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/training-1.png" alt=""></p>
<p>One thing to notice is that I paused the training between December 20 and 30.</p>
<p>The above chart presents the <strong>training-time</strong> edit distance. Thanks to the fairly aggressive data augmentation, the training and validation edit distances didn’t differ hugely throughout the whole process.</p>
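<p>For intuition, here’s a hypothetical NumPy sketch of the random-noise part of that augmentation. The function name and the exact scaling are my own; only the <code>random_noise*</code> parameter names come from <code>dataset_params</code>:</p>

```python
import numpy as np

# Hypothetical sketch: with some probability, add scaled Gaussian noise
# to the waveform, as controlled by random_noise_factor_min/max above.
def add_random_noise(wave, factor_min=0.1, factor_max=0.15,
                     probability=0.75, rng=None):
    rng = rng or np.random.default_rng(0)
    if rng.random() > probability:
        return wave  # leave this example unaugmented
    factor = rng.uniform(factor_min, factor_max)
    noise = rng.standard_normal(wave.shape).astype(wave.dtype)
    return wave + factor * np.abs(wave).max() * noise

wave = np.sin(np.linspace(0, 100, 16000)).astype(np.float32)
augmented = add_random_noise(wave)
print(augmented.shape)  # (16000,)
```

<p>Because each epoch sees a differently perturbed copy of every file, the model can’t simply memorize the training waveforms, which is consistent with the small train–validation gap above.</p>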
<p>The following image shows the CTC loss, with the orange line representing the evaluation runs.</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/training-2.png" alt=""></p>
<p>The evaluation edit distance is shown below. I stopped the training once the gain over a whole day dropped below <code>0.005</code>.</p>
<p><img src="/blog/2019/01/speech-recognition-with-tensorflow/training-3.png" alt=""></p>
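<p>That stopping rule can be expressed as a simple check. Here’s a sketch (the actual decision was made by eyeballing TensorBoard, and the once-per-day sampling of evaluation scores is an assumption of this illustration):</p>

```python
def should_stop(edit_distances, min_daily_gain=0.005):
    """Return True when the improvement over the last day of
    evaluations falls below the threshold.

    edit_distances: evaluation edit distances, oldest first,
    assumed here to be sampled roughly once per day.
    """
    if len(edit_distances) < 2:
        return False
    gain = edit_distances[-2] - edit_distances[-1]
    return gain < min_daily_gain

print(should_stop([0.12, 0.10]))    # gain 0.02: keep training
print(should_stop([0.081, 0.079]))  # gain 0.002: stop
```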
<p>Every machine learning model should be rigorously measured against meaningful accuracy statistics. Let’s see how we did:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">test(
dataset_params(
batch_size=<span style="color:#00d;font-weight:bold">18</span>,
epochs=<span style="color:#00d;font-weight:bold">10</span>,
max_wave_length=<span style="color:#00d;font-weight:bold">320000</span>,
augment=<span style="color:#080;font-weight:bold">True</span>,
random_noise=<span style="color:#00d;font-weight:bold">0.75</span>,
random_noise_factor_min=<span style="color:#00d;font-weight:bold">0.1</span>,
random_noise_factor_max=<span style="color:#00d;font-weight:bold">0.15</span>,
random_stretch_min=<span style="color:#00d;font-weight:bold">0.8</span>,
random_stretch_max=<span style="color:#00d;font-weight:bold">1.2</span>
),
codename=<span style="color:#d20;background-color:#fff0f0">'deep_max_20_seconds'</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">' !"&</span><span style="color:#04d;background-color:#fff0f0">\'</span><span style="color:#d20;background-color:#fff0f0">,-.01234:;</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">abcdefghijklmnopqrstuvwxyz'</span>, <span style="color:#888"># !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz</span>
causal_convolutions=<span style="color:#080;font-weight:bold">False</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>],
stacks=<span style="color:#00d;font-weight:bold">6</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">7</span>,
stack_filters=<span style="color:#00d;font-weight:bold">3</span>*<span style="color:#00d;font-weight:bold">128</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">8</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Momentum'</span>,
lr=<span style="color:#00d;font-weight:bold">0.00001</span>,
clip_gradients=<span style="color:#00d;font-weight:bold">20.0</span>
)
</code></pre></div><p>The output:</p>
<pre tabindex="0"><code>(...)
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-07-10:51:09
INFO:tensorflow:Saving dict for global step 1525345: edit_distance = 0.07922124, global_step = 1525345, loss = 13.410753
(...)
</code></pre><p>This shows that for the test set, we’ve scored <code>0.079</code> in edit distance. Inverting it, we could (somewhat naively) call it an accuracy of <code>92.1%</code>, which is not too bad. The result would be officially reported as <code>7.9 LER</code>.</p>
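<p>The label error rate is just the edit (Levenshtein) distance normalized by the reference length. A minimal pure-Python illustration (the numbers above come from TensorFlow’s edit distance op, not from this sketch):</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def label_error_rate(predicted, reference):
    return levenshtein(predicted, reference) / len(reference)

# 'sek' vs. 'seek': one missing letter out of four reference characters
print(round(label_error_rate('sek', 'seek'), 3))  # 0.25
```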
<p>What’s even nicer is the size of the model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">ls stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/export/best_exporter/1546198558/variables -lh
total 204M
</code></pre></div><p>That’s <code>204MB</code> for a model trained on the 375k+ dataset with aggressive augmentation (which effectively makes the resulting dataset a couple of times bigger).</p>
<p>It’s always nice to <strong>see</strong> what the results look like. Here’s the code that runs the model through the whole test set and gathers the predicted transcriptions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">test_results = predict_test(
codename=<span style="color:#d20;background-color:#fff0f0">'deep_max_20_seconds'</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">' !"&</span><span style="color:#04d;background-color:#fff0f0">\'</span><span style="color:#d20;background-color:#fff0f0">,-.01234:;</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">abcdefghijklmnopqrstuvwxyz'</span>, <span style="color:#888"># !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz</span>
causal_convolutions=<span style="color:#080;font-weight:bold">False</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>],
stacks=<span style="color:#00d;font-weight:bold">6</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">7</span>,
stack_filters=<span style="color:#00d;font-weight:bold">3</span>*<span style="color:#00d;font-weight:bold">128</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">8</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Momentum'</span>,
lr=<span style="color:#00d;font-weight:bold">0.00001</span>,
clip_gradients=<span style="color:#00d;font-weight:bold">20.0</span>
)
[ <span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">''</span>.join(t[<span style="color:#d20;background-color:#fff0f0">'text'</span>]) <span style="color:#080;font-weight:bold">for</span> t <span style="color:#080">in</span> test_results ]
</code></pre></div><p>And the excerpt of the above is:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">[<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'without the dotaset the artice suistles'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i've got to go to him"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'and you know it'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'down below in the darknes were hundrededs of people sleping in peace'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'strange images pased through my mind'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'the shep had taught him that'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'it was glaringly hot not a clou in hesky nor a breath of wind'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'your son went to serve at a distant place and became a cinturion'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'they made a boy continue tiging but he found nothing'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'the shoreas in da'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'fol the instructions here'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"the're caling to u not to give up and to kep on fighting"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'the shop was closed on monis'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'even coming down on the train together she wrote me'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i'm going away he said"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"he wasn't asking for help"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'some of the grynsh was faling of the circular edge'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i'd like to think"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'the alchemist robably already knew al that'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"you 'l take fiftly and like et"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'it was droping of in flakes and raining down on the sand'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"what's your name he asked"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"it's because you were not born"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'what do you think of that'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"if i had told tyo o you wouldn't have sen the pyramids"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i havn't hert the baby complain yet"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'i told him wit could teach hr to ignore people who was had tend'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"the one you're blocking"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'henderson stod up with a spade in his hand'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"he didn't ned to sek out the old woman for this"</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'only a minority of literature is reaten this way'</span>,
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">"i wish you wouldn't"</span>,
...]
</code></pre></div><p>Seems quite okay. You can immediately notice that some words are misspelled. This stems from the nature of the CTC algorithm itself: we’re <strong>predicting letters</strong> instead of words here. The upside is that the problem of out-of-vocabulary words is lessened. The downside is that you’ll sometimes get e.g. ‘sek’ instead of ‘seek’. Because we’re outputting the logits for each example, it’s possible to use e.g. <a href="https://github.com/githubharald/CTCWordBeamSearch">CTCWordBeamSearch</a> to constrain the output’s tokens to ones known within the corpus, making it predict whole words instead.</p>
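<p>To make the letter-level behavior concrete, here’s a toy greedy CTC decoder: the network emits one symbol (or a special blank) per frame, and decoding collapses consecutive repeats and then drops blanks. The model itself uses TensorFlow’s CTC ops, so treat this purely as an illustration (the <code>_</code> blank symbol is an arbitrary stand-in):</p>

```python
BLANK = '_'  # stand-in for the CTC blank symbol

def ctc_greedy_decode(frame_symbols):
    """Collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for s in frame_symbols:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return ''.join(out)

# 'seek' needs a blank between the two e's to survive collapsing;
# without it, repeated 'e' frames collapse into a single 'e'
print(ctc_greedy_decode(list('ss_e_e_kk')))  # seek
print(ctc_greedy_decode(list('ss_ee_kk')))   # sek
```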
<p>Here’s the last little fun test: speech to text on the utterance I created on my laptop:</p>
<audio controls="controls">
<source src="/blog/2019/01/speech-recognition-with-tensorflow/test-me.m4a" type="audio/mp4">
</audio>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">results = predict(
<span style="color:#d20;background-color:#fff0f0">'cv_corpus_v1/test-me.m4a'</span>,
codename=<span style="color:#d20;background-color:#fff0f0">'deep_max_20_seconds'</span>,
alphabet=<span style="color:#d20;background-color:#fff0f0">' !"&</span><span style="color:#04d;background-color:#fff0f0">\'</span><span style="color:#d20;background-color:#fff0f0">,-.01234:;</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">abcdefghijklmnopqrstuvwxyz'</span>, <span style="color:#888"># !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz</span>
causal_convolutions=<span style="color:#080;font-weight:bold">False</span>,
stack_dilation_rates=[<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#00d;font-weight:bold">27</span>],
stacks=<span style="color:#00d;font-weight:bold">6</span>,
stack_kernel_size=<span style="color:#00d;font-weight:bold">7</span>,
stack_filters=<span style="color:#00d;font-weight:bold">3</span>*<span style="color:#00d;font-weight:bold">128</span>,
n_fft=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">8</span>,
frame_step=<span style="color:#00d;font-weight:bold">160</span>*<span style="color:#00d;font-weight:bold">4</span>,
num_mel_bins=<span style="color:#00d;font-weight:bold">160</span>,
optimizer=<span style="color:#d20;background-color:#fff0f0">'Momentum'</span>,
lr=<span style="color:#00d;font-weight:bold">0.00001</span>,
clip_gradients=<span style="color:#00d;font-weight:bold">20.0</span>
)
<span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">''</span>.join(results[<span style="color:#00d;font-weight:bold">0</span>][<span style="color:#d20;background-color:#fff0f0">'text'</span>])
</code></pre></div><p>The result:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#d20;background-color:#fff0f0">b</span><span style="color:#d20;background-color:#fff0f0">'it semed to work just fine'</span>
</code></pre></div><h3 id="project-on-github">Project on GitHub</h3>
<p>The full Jupyter notebook’s code for this article can be found on GitHub: <a href="https://github.com/kamilc/speech-recognition">kamilc/speech-recognition</a>.</p>
<p>The repository includes the bz2 archive of the best performing model I’ve trained. You can download it and run it as a web service via <a href="https://www.tensorflow.org/serving/">TensorFlow Serving</a>, which we will cover in the next and last section here.</p>
<h3 id="serving-the-model-with-the-tensorflow-serving">Serving the model with TensorFlow Serving</h3>
<p>The last step in this project is to serve our trained model as a web service. Thankfully, the TensorFlow project includes a ready-to-use, free “model server”: <a href="https://www.tensorflow.org/serving/">TensorFlow Serving</a>.</p>
<p>The idea behind it is that we can run it, pointing it at the directory containing models saved in TensorFlow’s SavedModel format.</p>
<p>The deployment is extremely straightforward if you’re okay with running it from a Docker container. Let’s first pull the image:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">docker pull tensorflow/serving
</code></pre></div><p>Next, we need to download the saved model we’ve trained in this article from GitHub:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ wget https://github.com/kamilc/speech-recognition/raw/master/best.tar.bz2
$ tar xvjf best.tar.bz2
</code></pre></div><p>In the next step, we need to start a container from the TensorFlow Serving image, making it:</p>
<ul>
<li>expose its port to the outside</li>
<li>mount the directory containing our model</li>
<li>set the <code>MODEL_NAME</code> environment variable</li>
</ul>
<p>As follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">docker run -t --rm -p 8501:8501 -v <span style="color:#d20;background-color:#fff0f0">"/home/kamil/projects/speech-recognition/best/1546646971:/models/speech/1"</span> -e <span style="color:#369">MODEL_NAME</span>=speech tensorflow/serving
</code></pre></div><p>The service communicates via JSON payloads. Let’s prepare a <code>payload.json</code> file containing our request payload:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-json" data-lang="json">{<span style="color:#b06;font-weight:bold">"inputs"</span>: {<span style="color:#b06;font-weight:bold">"audio"</span>: <span style="color:#a61717;background-color:#e3d2d2"><audio-data-here></span>, <span style="color:#b06;font-weight:bold">"length"</span>: <span style="color:#a61717;background-color:#e3d2d2"><audio-raw-signal-length-here></span>}}
</code></pre></div><p>We can now easily query the web service with the prepared request audio data:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">curl -d @payload.json <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> -X POST http://localhost:8501/v1/models/speech:predict
</code></pre></div><p>Here’s what our intelligent web service responds with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-json" data-lang="json">{
<span style="color:#b06;font-weight:bold">"outputs"</span>: {
<span style="color:#b06;font-weight:bold">"text"</span>: [
[
<span style="color:#d20;background-color:#fff0f0">"c"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">"v"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">"r"</span>,
<span style="color:#d20;background-color:#fff0f0">"y"</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">"h"</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"n"</span>,
<span style="color:#d20;background-color:#fff0f0">"g"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"n"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">"h"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"u"</span>,
<span style="color:#d20;background-color:#fff0f0">"n"</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"v"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">"r"</span>,
<span style="color:#d20;background-color:#fff0f0">"s"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"o"</span>,
<span style="color:#d20;background-color:#fff0f0">"v"</span>,
<span style="color:#d20;background-color:#fff0f0">"a"</span>,
<span style="color:#d20;background-color:#fff0f0">"l"</span>,
<span style="color:#d20;background-color:#fff0f0">"s"</span>,
<span style="color:#d20;background-color:#fff0f0">"h"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>,
<span style="color:#d20;background-color:#fff0f0">"d"</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"n"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"a"</span>,
<span style="color:#d20;background-color:#fff0f0">"w"</span>,
<span style="color:#d20;background-color:#fff0f0">"i"</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">" "</span>,
<span style="color:#d20;background-color:#fff0f0">"j"</span>,
<span style="color:#d20;background-color:#fff0f0">"g"</span>,
<span style="color:#d20;background-color:#fff0f0">"m"</span>,
<span style="color:#d20;background-color:#fff0f0">"f"</span>,
<span style="color:#d20;background-color:#fff0f0">"t"</span>,
<span style="color:#d20;background-color:#fff0f0">"a"</span>,
<span style="color:#d20;background-color:#fff0f0">"r"</span>,
<span style="color:#d20;background-color:#fff0f0">"y"</span>,
<span style="color:#d20;background-color:#fff0f0">"s"</span>,
<span style="color:#d20;background-color:#fff0f0">"e"</span>
]
],
<span style="color:#b06;font-weight:bold">"logits"</span>: [
[
<span style="color:#a61717;background-color:#e3d2d2"><logits-here></span>
]
]
}
}
</code></pre></div>
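<p>For completeness, here’s a sketch of the client side in plain Python: building <code>payload.json</code> and joining the response’s character list back into a transcription. The field names follow the request and response shapes shown above; the signal and the response dict are hard-coded stand-ins so the sketch stays self-contained:</p>

```python
import json

# Build the request payload. The 'audio' and 'length' field names follow
# the serving signature above; the tiny hard-coded signal is a stand-in
# for real decoded raw audio samples.
signal = [0.0, 0.1, -0.2, 0.05]
request = {'inputs': {'audio': signal, 'length': len(signal)}}
with open('payload.json', 'w') as f:
    json.dump(request, f)

# Parse a response of the shape shown above. A real client would get
# this dict back from the HTTP call instead of hard-coding it.
response = {'outputs': {'text': [['i', 't', ' ', 'w', 'o', 'r', 'k', 's']]}}
transcriptions = [''.join(chars) for chars in response['outputs']['text']]
print(transcriptions[0])  # it works
```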
Image Recognition Toolshttps://www.endpointdev.com/blog/2018/10/image-recognition-tools/2018-10-10T00:00:00+00:00Muhammad Najmi bin Ahmad Zabidi
<p><img src="/blog/2018/10/image-recognition-tools/image-1.jpg" alt="detecting 1 face" /></p>
<p>I’m always impressed with the advancement of machine learning and, more recently, deep learning. However, since I am not an expert in the field, I’ll leave the deeper explanations to the researchers and scholars.</p>
<p>In this post I will share the tools I found and the libraries needed to make them work, at least for me.</p>
<p>The reason I explored these tools is simple: I plan to deploy a poor man’s security camera in my home with some “sense” of intelligence. Since I work at home, I want to know who is actually knocking on my door. So I thought, what if I could use a web cam to monitor my door and let me know who’s actually standing at the door?</p>
<h3 id="face-detection">Face Detection</h3>
<p>I searched around for existing face detection software and found <a href="https://github.com/shantnu/FaceDetect/blob/master/face_detect.py">this Python script</a> using <a href="https://github.com/opencv/opencv/tree/master/data/haarcascades">Haar cascades</a>. So I was able to detect faces, but upon sharing the “findings” with a friend, he said this only detects faces. How would the computer be able to recognize who’s who? Then I stumbled upon the phrase “face recognition”.</p>
<p>You might have noticed that if you use an image file imported directly from your smartphone, it will be displayed very large on the screen. You can use ImageMagick to resize the file to, say, 640x480 pixels.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ file makan.jpg
makan.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, Exif Standard: [TIFF image data, big-endian, <span style="color:#369">direntries</span>=15, <span style="color:#369">height</span>=3120, <span style="color:#369">bps</span>=0, <span style="color:#369">width</span>=4160], baseline, precision 8, 4160x3120, frames <span style="color:#00d;font-weight:bold">3</span>
$ convert makan.jpg -resize 640x480 makan-small.jpg
$ file makan-small.jpg
makan-small.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, big-endian, <span style="color:#369">direntries</span>=15, <span style="color:#369">height</span>=3120, <span style="color:#369">bps</span>=0, <span style="color:#369">width</span>=4160], baseline, precision 8, 640x480, frames <span style="color:#00d;font-weight:bold">3</span>
</code></pre></div><p><img src="/blog/2018/10/image-recognition-tools/image-0.jpg" alt="detecting 2 faces" /></p>
<h3 id="machine-vision">Machine Vision</h3>
<p>The computer doesn’t see the image directly the way humans do, so we need to convert the images into numerical values. For example, in the facial recognition tools, the training file contains the following matrices:</p>
<pre tabindex="0"><code>opencv_lbphfaces:
threshold: 1.7976931348623157e+308
radius: 1
neighbors: 8
grid_x: 8
grid_y: 8
histograms:
- !!opencv-matrix
rows: 1
cols: 16384
dt: f
data: [ 2.46913582e-02, 1.85185187e-02, 0., 3.08641978e-03,
1.23456791e-02, 6.17283955e-03, 3.08641978e-03,
2.46913582e-02, 0., 0., 0., 0., 0., 3.08641978e-03, 0.,
9.25925933e-03, 1.85185187e-02, 9.25925933e-03, 0., 0.,
3.08641978e-03, 0., 0., 0., 3.08641978e-03, 0., 0., 0.,
2.46913582e-02, 3.08641978e-03, 0., 6.79012388e-02, 0., 0.,
...................
1.30385486e-02, 1.47392293e-02, 4.53514745e-03,
1.13378686e-03, 7.93650839e-03, 5.66893432e-04,
5.66893432e-04, 1.13378686e-03, 6.80272095e-03,
2.26757373e-03, 0., 0., 5.66893443e-03, 2.83446722e-03,
5.10204071e-03, 9.07029491e-03, 7.14285746e-02 ]
labels: !!opencv-matrix
rows: 26
cols: 1
dt: i
data: [ 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5, 5,
6, 6, 8, 8, 8, 8 ]
labelsInfo:
[]
</code></pre><h3 id="face-recognition">Face Recognition</h3>
<p>I continued my search for existing face recognition software and found several projects which could be tested right away, with some modifications from the original source. I found one <a href="https://www.youtube.com/watch?v=PmZ29Vta7Vc">tutorial</a> which clearly explained how to get face recognition working from the web camera, in real time.</p>
<p>If the code provided in the video isn’t working directly, you could try my small patches, in which I corrected a typo and extended the accepted filename extensions; the changes against the source are <a href="https://github.com/codingforentrepreneurs/OpenCV-Python-Series/compare/master...raden:utk-github">here</a>.</p>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/aufa-im-process.webm" type="video/webm">
</video>
<caption>My daughter Aufa is joining me in this facial recognition session.</caption>
</center>
<p>Apart from that, there is also a fork <a href="https://github.com/nazmiasri95/Face-Recognition">on GitHub</a> which allows us to do real-time face recognition. For now, however, some manual work is needed to add more data (images of faces) if you want to use the code right away.</p>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/tom-cruise.webm" type="video/webm">
</video>
<caption>Obviously I am not Tom Cruise.</caption>
</center>
<h3 id="object-recognition">Object Recognition</h3>
<p>I also searched for more related software which could possibly provide an alternative to face recognition. I found quite an interesting piece of work on object detection using neural networks. It runs on a framework called <a href="https://pjreddie.com/darknet/">Darknet</a>. It allows us to do post-processing object detection for still pictures and videos. It can also do real-time object recognition, but requires a GPU to do it efficiently. I tried the CPU-only mode but could not get a real-time result (my computer almost crashed).</p>
<h4 id="still-image-samples">Still image samples</h4>
<p><img src="/blog/2018/10/image-recognition-tools/image-2.jpg" alt="detecting boats and people at the beach" /></p>
<p><img src="/blog/2018/10/image-recognition-tools/image-3.jpg" alt="detecting birds at the zoo" /></p>
<h4 id="video-samples">Video samples</h4>
<center>
<video width="40%" controls>
<source src="/blog/2018/10/image-recognition-tools/keteslow.webm" type="video/webm">
</video>
<br /><caption>This video was on Lebuhraya Utara Selatan (freeway) in Malaysia</caption>
</center>
<center>
<video width="40%" controls>
<source src="/blog/2018/10/image-recognition-tools/keteslow2.webm" type="video/webm">
</video>
<br /><caption>Another from Lebuhraya Utara Selatan (freeway) in Malaysia</caption>
</center>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/kids-bubble.webm" type="video/webm">
</video>
<caption>Two kids playing with bubbles</caption>
</center>
<center>
<video width="40%" controls>
<source src="/blog/2018/10/image-recognition-tools/perhentian-swim-analyzed.webm" type="video/webm">
</video>
<br /><caption>This video was taken a on a boat, with several people floating in the sea wearing their life jackets</caption>
</center>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/jalan-pantai.webm" type="video/webm">
</video>
<caption>My kid and I walking on the beach in western Australia</caption>
</center>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/aufa-naik-kida-slow.webm" type="video/webm">
</video>
<caption>Here’s a kid riding a small horse</caption>
</center>
<h4 id="vehicle-counting-and-speed-measurement">Vehicle Counting and Speed Measurement</h4>
<p>I found a tool developed by <a href="https://github.com/ahmetozlu/vehicle_counting_tensorflow">Ahmet Ozlu</a> which uses TensorFlow. The use case here is vehicle counting, vehicle type and color recognition, and speed detection.</p>
<p>You can see how it works in the following video.</p>
<center>
<video width="100%" controls>
<source src="/blog/2018/10/image-recognition-tools/ahmet-traffic.webm" type="video/webm">
</video>
<caption></caption>
</center>
<h3 id="libraries">Libraries</h3>
<h4 id="opencv">OpenCV</h4>
<p><a href="https://opencv.org/">OpenCV</a> is an open source library for computer vision, which comes with libraries we can use for our detection and recognition work.</p>
<p>In my understanding, face detection comes first and recognition second. In newer digital cameras and smartphones face detection is quite common. Social media applications sometimes use facial recognition to suggest similar faces to be tagged in photo albums, or for photo album reorganization.</p>
<h4 id="tools-based-on-or-making-use-of-opencv">Tools based on or making use of OpenCV</h4>
<p>Apart from the custom-written Python code which uses OpenCV and NumPy, I also found several projects which use neural networks, specifically an architecture called YOLO (You Only Look Once). They are:</p>
<ul>
<li><a href="https://pjreddie.com/darknet/">darknet</a> (written in C)</li>
<li><a href="https://github.com/thtrieu/darkflow">darkflow</a> (written in Python; seems to work as a wrapper for darknet) — You need to install different dependencies than for darknet, for example Cython and TensorFlow. The good thing is that we could use this tool for video post-processing, where instead of taking input directly from a webcam, we take it from existing videos. However, if you want to use the latest YOLO algorithm, then just stick to Darknet rather than using Darkflow. There is a fork on GitHub which could allow Darknet to save the output of the processed video into a file as well.</li>
</ul>
<p>To rotate a video that was taken on a smartphone held in a 180-degree position:</p>
<p><code>ffmpeg -i sourcefile.mp4 -vf "transpose=4" fileout.mp4</code></p>
<p>The transpose value depends on the direction of the rotation: for a 90-degree rotation, <code>transpose=2</code> rotates counter-clockwise, while <code>transpose=1</code> rotates clockwise.</p>
<p>To convert the video to a lower frame rate:</p>
<p><code>ffmpeg -i sourcefile.avi -r 8 fileout.mp4</code></p>
<p>For the darkflow tool, the default output is in AVI format, but ffmpeg allows us to convert it to MP4 if we want.</p>
<h4 id="imageai">ImageAI</h4>
<p><a href="https://github.com/OlafenwaMoses/ImageAI">ImageAI</a> is a Python-based computer vision library built on TensorFlow, Keras, Matplotlib, and several other dependencies commonly used for machine learning. In terms of usage, it is similar to darkflow.</p>
<h3 id="conclusion">Conclusion</h3>
<p>The advancement of the AI field contributes a lot of useful automation to our lives, ranging from detecting tumors, assisting search and rescue missions, and reducing keystrokes with keyword prediction, to filtering spam. AI also accelerates the fields of image processing and pattern recognition.</p>
<p>The hard work of many smart people and scholars has produced many clever solutions that help people live better lives with the use of AI. As I have shown, some of these tools can achieve better detection given a good amount of training samples and the correct size of picture to be detected.</p>
<p>The tools above will work as-is, but may need some tweaking or editing if you want to customize them. For example, some of the code only works with its own demos, so you may need to pass an argument such as <code>sys.argv[]</code> inside the Python code if you want to process your own video.</p>
Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithmhttps://www.endpointdev.com/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/2018-08-29T00:00:00+00:00Kamil Ciemniewski
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/contrib/auto-render.min.js"></script>
<style>
.katex .op-symbol.large-op {
line-height: 1.2 !important;
}
.mtight {
font-size: 0.95em;
}
</style>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/892-openaigym.video.90.68.video000000.mp4" type="video/mp4">
</video>
</center>
<p>The field of <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a> has seen a lot of great improvement in the past years. Researchers at universities and companies like <a href="https://deepmind.com/">DeepMind</a> have been developing new and better ways to train intelligent, artificial agents to solve more and more difficult tasks. The algorithms being developed require less time to train and make the training much more stable.</p>
<p>This article is about an algorithm that’s one of the most cited lately: A3C — Asynchronous Advantage Actor-Critic.</p>
<p>As the subject is both wide and deep, I’m assuming the reader has already mastered the relevant background. Reading on might be interesting even without understanding most of the notions in use, but having a good grasp of them will help you get the most out of this article.</p>
<p>Because we’re looking at Deep Reinforcement Learning, the obvious requirement is to be acquainted with <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">neural networks</a>. I’m also using notions known in the field of <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a> at large, like the $Q(a, s)$ and $V(s)$ functions or the n-step return. The mathematical expressions, in particular, are given assuming that the reader already knows what the symbols stand for. Some notions known from other families of RL algorithms are touched on as well (e.g. experience replay), to contrast them with the A3C way of solving the same kinds of problems. The article along with the source code uses the <a href="https://gym.openai.com">OpenAI gym</a>, Python, and <a href="https://pytorch.org">PyTorch</a>, among other Python-related libraries.</p>
<h3 id="theory">Theory</h3>
<p>The A3C algorithm is a part of the greater class of RL algorithms called <a href="http://www.scholarpedia.org/article/Policy_gradient_methods">Policy Gradients</a>.</p>
<p>In this approach, we’re creating a model that <strong>approximates the action-choosing policy itself</strong>.</p>
<p>Let’s contrast it with <a href="https://en.wikipedia.org/wiki/Markov_decision_process#Value_iteration">value iteration</a>, the goal of which is to learn the <a href="https://en.wikipedia.org/wiki/Reinforcement_learning#Value_function">value function</a> and have the policy emerge as the function that chooses the action transitioning to the state of the greatest value.</p>
<p>With the policy gradient approach, we’re approximating the policy with a differentiable function. Stated this way, the problem requires only a good approximation of the gradient that over time will maximize the rewards.</p>
<p>The unique approach of A3C adds a very clever twist: we’re also learning an approximation of the value function at the same time. This helps us in getting the variance of the gradient down considerably, making the training much more stable.</p>
<p>These two aspects of the algorithm are reflected in its name: actor-critic. The policy function approximation is called the actor, while the value function approximation is called the critic.</p>
<h4 id="the-policy-gradient">The policy gradient</h4>
<p>As we’ve noticed already, in order to improve our policy function approximation, we need a gradient that points at the direction that maximizes the rewards.</p>
<p>I’m not going to reinvent the wheel here. There are some great resources the reader can access to dig deep into the Mathematics of what’s called the Policy Gradient Theorem:</p>
<ul>
<li><a href="https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html">Lilian Weng’s excellent article</a></li>
<li><a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">Sutton & Barto — Reinforcement Learning: An Introduction</a></li>
</ul>
<p>The following equation presents the basic form of the gradient of the policy function:</p>
<p>$$\nabla_{\theta} J(\theta) = E_{\tau}\left[R_{\tau}\cdot\nabla_\theta\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\right]$$</p>
<p>This states that for each sampled trajectory $\tau$, the correct estimate of the gradient is the expected value of the rewards times the action probabilities moved into the log space. Ascending in this direction makes our rewards greater and greater over time.</p>
<p>We <strong>can</strong> derive all the needed intermediary gradients ourselves by hand of course. Because we’re using <a href="https://pytorch.org">PyTorch</a> though, we only need the right loss function.</p>
<p>Let’s figure out the right loss function formula that will produce the gradient as shown above:</p>
<p>$$L_\theta=-J(\theta)$$</p>
<p>Also:</p>
<p>$$J(\theta)=E_\tau\left[R_\tau\cdot\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\right]$$</p>
<p>Hence:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}R_t\cdot\log\pi(a_t|s_t;\theta)$$</p>
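<p>As a minimal PyTorch sketch (the tensor values below are purely illustrative, not taken from the article’s training code), this loss comes out to a single scalar that autograd can then differentiate for us:</p>

```python
import torch

# illustrative log-probabilities of the sampled actions and the
# accumulated rewards for each step (both come from the rollout in practice)
log_probs = torch.log(torch.tensor([0.5, 0.25, 0.25]))
returns = torch.tensor([1.0, 2.0, 0.5])

# L_theta = -(1/n) * sum_t R_t * log pi(a_t | s_t; theta)
loss = -(returns * log_probs).mean()
```

<p>Calling <code>loss.backward()</code> on this scalar is all that’s needed to get the gradients flowing back into the policy’s parameters.</p>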
<h4 id="formalizing-the-accumulation-of-rewards">Formalizing the accumulation of rewards</h4>
<p>For now, we’ve been using the $R_\tau$ and $R_t$ terms very abstractly. Let’s make this part more intuitive and concrete now.</p>
<p>Its true meaning really is “the quality of the sampled trajectory”. Consider the following equation:</p>
<p>$$R_t=\sum_{i=t}^{t+N}\gamma^{i-t}r_i+\gamma^{N+1}V(s_{t+N+1})$$</p>
<p>Each $r_i$ is the reward received from the environment after each step. Each trajectory consists of multiple steps. Each time, we’re sampling actions based on our policy function. This gives probabilities of a given action being best given the state.</p>
<p>What if we take 5 actions for which we’re not given any reward, but which overall help us get rewarded in the 6th step? This is exactly the case we’ll be dealing with later in this article when training a toy car to drive based only on the pixel values of the scene. In that environment, we’ll be given a $-0.1$ “negative” reward each step and something close to $7$ for each new “tile” the car enters while staying on the road.</p>
<p>We need a way to still encourage actions that make us earn rewards in a not too distant future. We also need to be smart and <strong>discount</strong> future rewards somewhat so that the more immediate the reward is to our action, the more emphasis we put on it.</p>
<p>That’s exactly what the above equation does. Notice that $\gamma$ becomes a hyperparameter. It makes sense to give it a value from $(0, 1)$. Let’s consider the following list of rewards: $[r_1, r_2, r_3, r_4]$. For $r_1$, the formula for the discounted accumulated reward is:</p>
<p>$$R_1=r_1+\gamma r_2+\gamma^2 r_3+\gamma^3 r_4+\gamma^4 V(s_5)$$</p>
<p>For $r_2$ it’s:</p>
<p>$$R_2=r_2+\gamma r_3+\gamma^2 r_4+\gamma^3 V(s_5)$$</p>
<p>And so on… When we hit the terminal state and there is no “next” state, we substitute $0$ for $V(s_{t+N+1})$.</p>
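<p>The discounting above can be computed by walking the rollout backwards, since $R_t = r_t + \gamma R_{t+1}$. Here’s a small sketch in plain Python (the function and argument names are mine, not from the article’s code):</p>

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute R_t for every step of an n-step rollout, bootstrapping with
    V(s) of the state following the last step (pass 0.0 at a terminal state)."""
    returns = []
    acc = bootstrap_value
    for r in reversed(rewards):
        acc = r + gamma * acc  # R_t = r_t + gamma * R_{t+1}
        returns.insert(0, acc)
    return returns

# e.g. a 4-step rollout with gamma = 0.9 and V(s_5) = 0.5:
rs = discounted_returns([1.0, 0.0, 0.0, 2.0], 0.5, gamma=0.9)
```

<p>Note how the two zero-reward middle steps still receive credit for the $2.0$ reward earned at the end, just discounted by one extra factor of $\gamma$ per step.</p>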
<p>We’ve said that in A3C we’re learning the value function at the same time. The $R_t$ as described above becomes the target value when training our $V(s)$. The value function becomes an approximation of the average of the rewards given the state (because $R_t$ depends on us sampling actions in this state).</p>
<h4 id="making-the-gradients-more-stable">Making the gradients more stable</h4>
<p>One of the greatest inhibitors of the policy gradient performance is what’s broadly called “high variance”.</p>
<p>I have to admit, the first time I saw that term in this context, I was disoriented. I knew what “variance” was. It’s the “variance of what” that was not clear to me.</p>
<p>Thankfully I found <a href="https://www.quora.com/Why-does-the-policy-gradient-method-have-a-high-variance?share=1">a brilliant answer to this question</a>. It explains the issue simply yet in detail.</p>
<p>Let me cite it here:</p>
<blockquote>
<p>When we talk about high variance in the policy gradient method, we’re specifically talking about the facts that the variance of the gradients are high — namely, that $Var(\nabla_{\theta} J(\theta))$ is big.</p>
</blockquote>
<p>To put it in simple terms: because we’re <strong>sampling</strong> trajectories from the space that is stochastic in nature, we’re bound to have those samples give gradients that disagree a lot on the best direction to take our model’s parameters into.</p>
<p>I encourage the reader to pause now and read the above-mentioned answer, as it’s vital. The gist of the solution described in it is that we can <strong>subtract a baseline value from each $R_t$</strong>. One example given of a good baseline is the <strong>average of the sampled accumulated rewards</strong>. The A3C algorithm uses this insight in a very, very clever way.</p>
<h4 id="value-function-as-a-baseline">Value function as a baseline</h4>
<p>To learn the $V(s)$ we’re typically using the MSE or Huber loss against the accumulated rewards for each step. This means that over time we’re <strong>averaging those rewards out based on the state we’re finding ourselves in</strong>.</p>
<p>Improving our gradient formula with those ideas we now get:</p>
<p>$$\nabla_{\theta} J(\theta) = E_{\tau}\left[\nabla_\theta\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))\right]$$</p>
<p>It’s important to treat the $(R_t-V(s_t))$ term <strong>as a constant</strong>. This means that when using PyTorch or any other deep learning framework, the computation of it should occur <strong>outside the graph that influences the gradients</strong>.</p>
<p>The enhanced part of the equation is where we get the word “advantage” in the algorithm’s name. The <strong>advantage</strong> is simply the difference between the accumulated rewards and what those rewards are <strong>on average</strong> for the given state:</p>
<p>$$A(a_{t..t+n},s_{t..t+n})=R_t(a_{t..t+n},s_{t..t+n})-V(s_t)$$</p>
<p>If we make $R_t$ into $Q(s,a)$, as it’s commonly written in the literature, we arrive at the formula:</p>
<p>$$A(s,a)=Q(s,a) - V(s)$$</p>
<p>What’s the intuition here? Imagine that you’re playing chess with a 5-year-old. You win by a huge margin. Your friend who’s watched lots of master-level games observed this one as well. His take is that even though you scored positively, you still made lots of mistakes. You’ve got your <strong>critic</strong> here. Your score and what it looks like for the “observing critic” combined is what we call the advantage of the actions you took.</p>
<h4 id="guarding-against-the-models-overconfidence">Guarding against the model’s overconfidence</h4>
<blockquote>
<p>Although he was warned, Icarus was too young and too enthusiastic about flying. He got excited by the thrill of flying and carried away by the amazing feeling of freedom and started flying high to salute the sun, diving low to the sea, and then up high again.
His father Daedalus was trying in vain to make young Icarus to understand that his behavior was dangerous, and Icarus soon saw his wings melting.
Icarus fell into the sea and drowned.</p>
</blockquote>
<p><em><a href="https://www.greekmyths-greekmythology.com/myth-of-daedalus-and-icarus/">The Myth Of Daedalus And Icarus</a></em></p>
<p>The job of an “actor” is to output probability values for each possible action the agent can take. The greater the probability, the greater the model’s confidence that this action will result in the highest reward.</p>
<p>What if at some point the weights are steered in a way that makes the model <em>overconfident</em> about some particular action? If this happens before the model has learned much, it becomes a huge problem.</p>
<p>Because we’re using the $\pi(a|s;\theta)$ distribution to sample trajectories with, we’re not sampling totally at random. In other words, for $\pi(a|s;\theta) = [0.1, 0.4, 0.2, 0.3]$ our sampling chooses the second option 40% of the time. With any action overwhelming the others, we’re losing the ability to <strong>explore</strong> different paths and thus learn valuable lessons.</p>
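<p>To see that sampling concretely, here’s a tiny standalone experiment (using the standard library only; not part of the article’s code):</p>

```python
import random

random.seed(42)  # for reproducibility

# the policy's output distribution over the 4 discrete actions
probs = [0.1, 0.4, 0.2, 0.3]
actions = random.choices(range(len(probs)), weights=probs, k=10_000)

# the second action dominates: it's chosen roughly 40% of the time
freq = actions.count(1) / len(actions)
```

<p>If the distribution collapsed to something like $[0.99, 0.003, 0.004, 0.003]$, nearly every sample would pick the first action and exploration would effectively stop.</p>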
<p>Empirically, I have sometimes seen the process unable to escape this “overconfidence” area for long, long hours.</p>
<h4 id="regularizing-with-entropy">Regularizing with entropy</h4>
<p>Let’s introduce the notion of <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a>.</p>
<p>In simple words, in our case it’s the measure of how much uncertainty a given probability distribution possesses. It’s maximized for the uniform distribution. Here’s the formula:</p>
<p>$$H(X)=E[-\log_b(P(X))]$$</p>
<p>This expands to the following:</p>
<p>$$H(X)=-\sum_{i=1}^{n}P(x_i)\log_b(P(x_i))$$</p>
<p>Let’s look closer at the values this function produces using the following simple <a href="https://calca.io">Calca</a> code:</p>
<pre tabindex="0"><code>uniform = [0.25, 0.25, 0.25, 0.25]
more confident = [0.5, 0.25, 0.15, 0.10]
over confident = [0.95, 0.01, 0.01, 0.03]
super over confident = [0.99, 0.003, 0.004, 0.003]
y(x) = x*log(x, 10)
entropy(dist) = -sum(map(y, dist))
entropy (uniform) => 0.6021
entropy (more confident) => 0.5246
entropy (over confident) => 0.1068
entropy (super over confident) => 0.0291
</code></pre><p>We can use the above to “punish” the model whenever it’s too confident of its choices. As we’re going to use gradient descent, we’ll be minimizing the terms that appear in our loss function. Minimizing the entropy as shown above would encourage more confidence, though, so we need to negate it in the loss for it to work the way we intend:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))-\beta\, H(\pi(a_t|s_t;\theta))$$</p>
<p>Where $\beta$ is a hyperparameter scaling the effect the entropy penalty has on the gradients. Choosing the right value for $\beta$ is vital for the model’s convergence. In this article, I’m using $0.01$, as with $0.001$ I still observed the process getting stuck in overconfidence.</p>
<p>Let’s include the value loss $L_v$ in the loss function formula making it full and ready to be implemented:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))+\alpha L_v-\beta\, H(\pi(a_t|s_t;\theta))$$</p>
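<p>Here’s one way to assemble this full loss in PyTorch (a hedged sketch with names of my own choosing, using MSE for the value loss as mentioned earlier; the article’s actual training code appears further below):</p>

```python
import torch
import torch.nn.functional as F

def a3c_loss(log_probs, values, returns, entropies, alpha=0.5, beta=0.01):
    """Assemble the full A3C loss: the policy term weighted by the advantage
    (detached, so it acts as a constant in the graph), the value loss L_v
    (MSE here), and the entropy bonus scaled by beta."""
    advantages = (returns - values).detach()  # kept outside the gradient graph
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    return policy_loss + alpha * value_loss - beta * entropies.mean()
```

<p>The <code>.detach()</code> call is the point made above about treating $(R_t-V(s_t))$ as a constant: the advantage must scale the policy gradient without itself contributing gradients to the critic.</p>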
<h4 id="the-last-a-in-a3c">The last A in A3C</h4>
<p>So far we’ve gone from vanilla policy gradients to using the notion of an advantage. We’ve also improved it with a baseline, which intuitively makes the model consist of two parts: the actor and the critic. At this point, we have what’s sometimes called A2C — Advantage Actor-Critic.</p>
<p>Let us now focus on the last piece of the puzzle: the last A. This last A comes from the word “asynchronous”. It’s been explained very clearly in the <a href="https://arxiv.org/pdf/1602.01783">original paper on A3C</a>.</p>
<p>This idea, I think, is the least complex of all those that have their place in the approach. I’ll just comment on what was already written:</p>
<blockquote>
<p>These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent’s data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.</p>
</blockquote>
<p>A3C’s unique approach is that it doesn’t use experience replay to de-correlate the updates to the model’s weights. Instead, we sample many different trajectories <strong>at the same time</strong> in an <strong>asynchronous</strong> manner.</p>
<p>This means that we create many clones of the environment and let our agents experience them at the same time. Separate agents share their weights in one way or another. There are implementations where the agents share those weights quite <strong>literally</strong>, performing the updates to the weights on their own whenever they need to. There are also implementations with one main agent holding the main weights and doing the updates based on the gradients reported by the “worker” agents; the workers are then updated with the evolved weights. The environments and agents are not directly synchronized and work at their own speed. As soon as any of them collects the rewards needed for the n-step gradient calculations, the gradients are applied in one way or another.</p>
<p>In this article, I’m preferring the second approach — having one “main” agent and making workers synchronize their weights with it each n-step period.</p>
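<p>One possible shape of that “main agent plus workers” synchronization, sketched with assumed names (the article’s actual training loop is not shown in this excerpt):</p>

```python
import torch

def apply_worker_gradients(main_agent, worker_agent, optimizer):
    """After a worker finishes its n-step rollout and calls backward(),
    copy its gradients onto the main agent, step the main optimizer,
    then refresh the worker with the evolved weights."""
    for main_p, worker_p in zip(main_agent.parameters(),
                                worker_agent.parameters()):
        if worker_p.grad is not None:
            main_p.grad = worker_p.grad.clone()
    optimizer.step()         # update the main weights
    optimizer.zero_grad()    # clear the main agent's gradients
    # the worker continues from the evolved main weights:
    worker_agent.load_state_dict(main_agent.state_dict())
```

<p>In a real setup each worker would also zero its own gradients before the next rollout, and the copies would happen across processes rather than within one.</p>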
<h3 id="practice">Practice</h3>
<h4 id="the-challenge">The challenge</h4>
<p>To put the above theory into practice, we’re going to code A3C to train a toy self-driving game car. The algorithm will only have the game’s pixels as inputs. We’re also going to collect rewards.</p>
<p>At each step, the player decides how to move the steering wheel, how much throttle to apply, and how much brake.</p>
<p>Points are assigned for each new “tile” that the car enters while staying on the road. In every other case there’s a small penalty of $-0.1$ points.</p>
<p>We’re going to use <a href="https://gym.openai.com">OpenAI Gym</a> and the environment’s called <a href="https://gym.openai.com/envs/CarRacing-v0/">CarRacing</a>.</p>
<p>You can read a bit more about the setup in the environment’s source code on <a href="https://github.com/openai/gym/blob/master/gym/envs/box2d/car_racing.py">GitHub</a>.</p>
<h4 id="coding-the-agent">Coding the Agent</h4>
<p>Our agent is going to output both $\pi(a|s;\theta)$ and $V(s)$. We’re going to use a GRU unit to give the agent the ability to remember its previous actions and the environment’s previous features.</p>
<p>I’ve also decided to use PReLU instead of ReLU activations, as it <strong>appeared</strong> to me that the agent was learning much quicker this way (although I don’t have any numbers to back this impression up).</p>
<p><strong>Disclaimer</strong>: the code presented below <strong>has not been refactored</strong> in any way. If it were going to be used in production, I’d certainly clean it up a great deal.</p>
<p>Here’s the full listing of the agent’s class:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Agent</span>(nn.Module):
<span style="color:#080;font-weight:bold">def</span> __init__(self, **kwargs):
<span style="color:#038">super</span>(Agent, self).__init__(**kwargs)
self.init_args = kwargs
self.h = torch.zeros(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">256</span>)
self.norm1 = nn.BatchNorm2d(<span style="color:#00d;font-weight:bold">4</span>)
self.norm2 = nn.BatchNorm2d(<span style="color:#00d;font-weight:bold">32</span>)
self.conv1 = nn.Conv2d(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">4</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv2 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv3 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv4 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.gru = nn.GRUCell(<span style="color:#00d;font-weight:bold">1152</span>, <span style="color:#00d;font-weight:bold">256</span>)
self.policy = nn.Linear(<span style="color:#00d;font-weight:bold">256</span>, <span style="color:#00d;font-weight:bold">4</span>)
self.value = nn.Linear(<span style="color:#00d;font-weight:bold">256</span>, <span style="color:#00d;font-weight:bold">1</span>)
self.prelu1 = nn.PReLU()
self.prelu2 = nn.PReLU()
self.prelu3 = nn.PReLU()
self.prelu4 = nn.PReLU()
nn.init.xavier_uniform_(self.conv1.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv1.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv2.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv2.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv3.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv3.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv4.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv4.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.constant_(self.gru.bias_ih, <span style="color:#00d;font-weight:bold">0</span>)
nn.init.constant_(self.gru.bias_hh, <span style="color:#00d;font-weight:bold">0</span>)
nn.init.xavier_uniform_(self.policy.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.policy.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.value.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.value.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
self.train()
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">reset</span>(self):
self.h = torch.zeros(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">256</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clone</span>(self, num=<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">return</span> [ self.clone_one() <span style="color:#080;font-weight:bold">for</span> _ <span style="color:#080">in</span> <span style="color:#038">range</span>(num) ]
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clone_one</span>(self):
<span style="color:#080;font-weight:bold">return</span> Agent(**self.init_args)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, state):
state = state.view(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">96</span>)
state = self.norm1(state)
data = self.prelu1(self.conv1(state))
data = self.prelu2(self.conv2(data))
data = self.prelu3(self.conv3(data))
data = self.prelu4(self.conv4(data))
data = self.norm2(data)
data = data.view(<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>)
h = self.gru(data, self.h)
self.h = h.detach()
pre_policy = h.view(-<span style="color:#00d;font-weight:bold">1</span>)
policy = F.softmax(self.policy(pre_policy))
value = self.value(pre_policy)
<span style="color:#080;font-weight:bold">return</span> policy, value
</code></pre></div><p>You can immediately notice that the actor and critic parts share most of their weights; they only differ in the last layer.</p>
<p>Next, I wanted to abstract out the notion of the “runner”, which encapsulates the idea of a “running agent”. Think of it as the game player, with a joystick and a brain to score game points. I’m discretizing the action space the following way:</p>
<table>
<tr>
<th>Action name</th>
<th>value</th>
</tr>
<tr>
<td>Turn left</td>
<td>[-0.8, 0.0, 0.0]</td>
</tr>
<tr>
<td>Turn right</td>
<td>[0.8, 0.0, 0]</td>
</tr>
<tr>
<td>Full throttle</td>
<td>[0.0, 0.1, 0.0]</td>
</tr>
<tr>
<td>Brake</td>
<td>[0.0, 0.0, 0.6]</td>
</tr>
</table>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Runner</span>:
<span style="color:#080;font-weight:bold">def</span> __init__(self, agent, ix, train = <span style="color:#080;font-weight:bold">True</span>, **kwargs):
self.agent = agent
self.train = train
self.ix = ix
self.reset = <span style="color:#080;font-weight:bold">False</span>
self.states = []
<span style="color:#888"># each runner has its own environment:</span>
self.env = gym.make(<span style="color:#d20;background-color:#fff0f0">'CarRacing-v0'</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_value</span>(self):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Returns just the current state's value.
</span><span style="color:#d20;background-color:#fff0f0"> This is used when approximating the R.
</span><span style="color:#d20;background-color:#fff0f0"> If the last step was
</span><span style="color:#d20;background-color:#fff0f0"> not terminal, then we're substituting the "r"
</span><span style="color:#d20;background-color:#fff0f0"> with V(s) - hence, we need a way to just
</span><span style="color:#d20;background-color:#fff0f0"> get that V(s) without moving forward yet.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
_input = self.preprocess(self.states)
_, _, _, value = self.decide(_input)
<span style="color:#080;font-weight:bold">return</span> value
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">run_episode</span>(self, yield_every = <span style="color:#00d;font-weight:bold">10</span>, do_render = <span style="color:#080;font-weight:bold">False</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> The episode runner written in the generator style.
</span><span style="color:#d20;background-color:#fff0f0"> This is meant to be used in a "for (...) in run_episode(...):" manner.
</span><span style="color:#d20;background-color:#fff0f0"> Each value generated is a tuple of:
</span><span style="color:#d20;background-color:#fff0f0"> step_ix: the current "step" number
</span><span style="color:#d20;background-color:#fff0f0"> rewards: the list of rewards as received from the environment (without discounting yet)
</span><span style="color:#d20;background-color:#fff0f0"> values: the list of V(s) values, as predicted by the "critic"
</span><span style="color:#d20;background-color:#fff0f0"> policies: the list of policies as received from the "actor"
</span><span style="color:#d20;background-color:#fff0f0"> actions: the list of actions as sampled based on policies
</span><span style="color:#d20;background-color:#fff0f0"> terminal: whether we're in a "terminal" state
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
self.reset = <span style="color:#080;font-weight:bold">False</span>
step_ix = <span style="color:#00d;font-weight:bold">0</span>
rewards, values, policies, actions = [[], [], [], []]
self.env.reset()
<span style="color:#888"># we're going to feed the last 4 frames to the neural network that acts as the "actor-critic" duo. We'll use the "deque" to efficiently drop too old frames always keeping its length at 4:</span>
states = deque([ ])
<span style="color:#888"># we're pre-populating the states deque by taking first 4 steps as "full throttle forward":</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">len</span>(states) < <span style="color:#00d;font-weight:bold">4</span>:
_, r, _, _ = self.env.step([<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">1.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>])
state = self.env.render(mode=<span style="color:#d20;background-color:#fff0f0">'rgb_array'</span>)
states.append(state)
logger.info(<span style="color:#d20;background-color:#fff0f0">'Init reward '</span> + <span style="color:#038">str</span>(r) )
<span style="color:#888"># we need to repeat the following as long as the game is not over yet:</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#080;font-weight:bold">True</span>:
<span style="color:#888"># the frames need to be preprocessed (I'm explaining the reasons later in the article)</span>
_input = self.preprocess(states)
<span style="color:#888"># asking the neural network for the policy and value predictions:</span>
action, action_ix, policy, value = self.decide(_input, step_ix)
<span style="color:#888"># taking the step and receiving the reward along with info if the game is over:</span>
_, reward, terminal, _ = self.env.step(action)
<span style="color:#888"># explicitly rendering the scene (again, this will be explained later)</span>
state = self.env.render(mode=<span style="color:#d20;background-color:#fff0f0">'rgb_array'</span>)
<span style="color:#888"># update the last 4 states deque:</span>
states.append(state)
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">len</span>(states) > <span style="color:#00d;font-weight:bold">4</span>:
states.popleft()
<span style="color:#888"># if we've been asked to render into the window (e. g. to capture the video):</span>
<span style="color:#080;font-weight:bold">if</span> do_render:
self.env.render()
self.states = states
step_ix += <span style="color:#00d;font-weight:bold">1</span>
rewards.append(reward)
values.append(value)
policies.append(policy)
actions.append(action_ix)
<span style="color:#888"># periodically save the state's screenshot along with the numerical values in an easy to read way:</span>
<span style="color:#080;font-weight:bold">if</span> self.ix == <span style="color:#00d;font-weight:bold">2</span> <span style="color:#080">and</span> step_ix % <span style="color:#00d;font-weight:bold">200</span> == <span style="color:#00d;font-weight:bold">0</span>:
fname = <span style="color:#d20;background-color:#fff0f0">'./screens/car-racing/screen-'</span> + <span style="color:#038">str</span>(step_ix) + <span style="color:#d20;background-color:#fff0f0">'-'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(time.time())) + <span style="color:#d20;background-color:#fff0f0">'.jpg'</span>
im = Image.fromarray(state)
im.save(fname)
state.tofile(fname + <span style="color:#d20;background-color:#fff0f0">'.txt'</span>, sep=<span style="color:#d20;background-color:#fff0f0">" "</span>)
_input.numpy().tofile(fname + <span style="color:#d20;background-color:#fff0f0">'.input.txt'</span>, sep=<span style="color:#d20;background-color:#fff0f0">" "</span>)
<span style="color:#888"># if it's game over or we hit the "yield every" value, yield the values from this generator:</span>
<span style="color:#080;font-weight:bold">if</span> terminal <span style="color:#080">or</span> step_ix % yield_every == <span style="color:#00d;font-weight:bold">0</span>:
<span style="color:#080;font-weight:bold">yield</span> step_ix, rewards, values, policies, actions, terminal
rewards, values, policies, actions = [[], [], [], []]
<span style="color:#888"># following is a very tacky way to allow external using code to mark that it wants us to reset the environment, finishing the episode prematurely. (this would be hugely refactored in the production code but for the sake of playing with the algorithm itself, it's good enough):</span>
<span style="color:#080;font-weight:bold">if</span> self.reset:
self.reset = <span style="color:#080;font-weight:bold">False</span>
self.agent.reset()
states = deque([ ])
self.states = deque([ ])
<span style="color:#080;font-weight:bold">return</span>
<span style="color:#080;font-weight:bold">if</span> terminal:
self.agent.reset()
states = deque([ ])
<span style="color:#080;font-weight:bold">return</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">ask_reset</span>(self):
self.reset = <span style="color:#080;font-weight:bold">True</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">preprocess</span>(self, states):
<span style="color:#080;font-weight:bold">return</span> torch.stack([ torch.tensor(self.preprocess_one(image_data), dtype=torch.float32) <span style="color:#080;font-weight:bold">for</span> image_data <span style="color:#080">in</span> states ])
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">preprocess_one</span>(self, image):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Scales the rendered image and makes it grayscale
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#080;font-weight:bold">return</span> rescale(rgb2gray(image), (<span style="color:#00d;font-weight:bold">0.24</span>, <span style="color:#00d;font-weight:bold">0.16</span>), anti_aliasing=<span style="color:#080;font-weight:bold">False</span>, mode=<span style="color:#d20;background-color:#fff0f0">'edge'</span>, multichannel=<span style="color:#080;font-weight:bold">False</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">choose_action</span>(self, policy, step_ix):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Chooses an action to take based on the policy and whether we're in the training mode or not. During training, it samples based on the probability values in the policy. During the evaluation, it takes the most probable action in a greedy way.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
policies = [[-<span style="color:#00d;font-weight:bold">0.8</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>], [<span style="color:#00d;font-weight:bold">0.8</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0</span>], [<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.1</span>, <span style="color:#00d;font-weight:bold">0.0</span>], [<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.6</span>]]
<span style="color:#080;font-weight:bold">if</span> self.train:
action_ix = np.random.choice(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">1</span>, p=torch.tensor(policy).detach().numpy())[<span style="color:#00d;font-weight:bold">0</span>]
<span style="color:#080;font-weight:bold">else</span>:
action_ix = np.argmax(torch.tensor(policy).detach().numpy())
logger.info(<span style="color:#d20;background-color:#fff0f0">'Step '</span> + <span style="color:#038">str</span>(step_ix) + <span style="color:#d20;background-color:#fff0f0">' Runner '</span> + <span style="color:#038">str</span>(self.ix) + <span style="color:#d20;background-color:#fff0f0">' Action ix: '</span> + <span style="color:#038">str</span>(action_ix) + <span style="color:#d20;background-color:#fff0f0">' From: '</span> + <span style="color:#038">str</span>(policy))
<span style="color:#080;font-weight:bold">return</span> np.array(policies[action_ix], dtype=np.float32), action_ix
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decide</span>(self, state, step_ix = <span style="color:#00d;font-weight:bold">999</span>):
policy, value = self.agent(state)
action, action_ix = self.choose_action(policy, step_ix)
<span style="color:#080;font-weight:bold">return</span> action, action_ix, policy, value
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">load_state_dict</span>(self, state):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> As we'll have multiple "worker" runners, they will need to be able to sync their agents' weights with the main agent.
</span><span style="color:#d20;background-color:#fff0f0"> This function loads the weights into this runner's agent.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
self.agent.load_state_dict(state)
</code></pre></div><p>I’m also encapsulating the training process in a class of its own. Notice that the gradients are clipped before being applied. I’m also clipping the rewards into the $<-3, 3>$ range to help keep the variance low.</p>
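<p>As a quick aside on what “clipping the gradients” means here: the idea behind <code>torch.nn.utils.clip_grad_norm_</code> is to compute the global L2 norm of all gradients and, if it exceeds a threshold, scale every gradient down proportionally. A minimal, self-contained sketch of that idea, with made-up numbers:</p>

```python
import math

def clip_grad_norm(grads, max_norm):
    # Global-norm clipping: if the combined L2 norm of all gradient
    # components exceeds max_norm, scale each one down proportionally
    # so the overall direction is preserved but the magnitude is bounded.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

print(clip_grad_norm([30.0, 40.0], 40))  # norm 50 → scaled to [24.0, 32.0]
print(clip_grad_norm([3.0, 4.0], 40))    # norm 5 → left untouched
```

<p>PyTorch applies the same idea in place across all of a model’s parameter tensors at once.</p>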
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Trainer</span>:
<span style="color:#080;font-weight:bold">def</span> __init__(self, gamma, agent, window = <span style="color:#00d;font-weight:bold">15</span>, workers = <span style="color:#00d;font-weight:bold">8</span>, **kwargs):
<span style="color:#038">super</span>().__init__(**kwargs)
self.agent = agent
self.window = window
self.gamma = gamma
self.optimizer = optim.Adam(self.agent.parameters(), lr=<span style="color:#00d;font-weight:bold">1e-4</span>)
self.workers = workers
<span style="color:#888"># even though we're loading the weights into worker agents explicitly, I found that still without sharing the weights as following, the algorithm was not converging:</span>
self.agent.share_memory()
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">fit</span>(self, episodes = <span style="color:#00d;font-weight:bold">1000</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> The higher level method for training the agents.
</span><span style="color:#d20;background-color:#fff0f0"> It calls into the lower-level "train", which orchestrates the process itself.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
last_update = <span style="color:#00d;font-weight:bold">0</span>
updates = <span style="color:#038">dict</span>()
<span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, self.workers + <span style="color:#00d;font-weight:bold">1</span>):
updates[ ix ] = { <span style="color:#d20;background-color:#fff0f0">'episode'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'step'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'rewards'</span>: deque(), <span style="color:#d20;background-color:#fff0f0">'losses'</span>: deque(), <span style="color:#d20;background-color:#fff0f0">'points'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'mean_reward'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'mean_loss'</span>: <span style="color:#00d;font-weight:bold">0</span> }
<span style="color:#080;font-weight:bold">for</span> update <span style="color:#080">in</span> self.train(episodes):
now = time.time()
<span style="color:#888"># you could do something useful here with the updates dict.</span>
<span style="color:#888"># I've opted out as I'm using logging anyways and got more value in just watching the log file, grepping for the desired values</span>
<span style="color:#888"># save the current model's weights every minute:</span>
<span style="color:#080;font-weight:bold">if</span> now - last_update > <span style="color:#00d;font-weight:bold">60</span>:
torch.save(self.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(now)) + <span style="color:#d20;background-color:#fff0f0">'-.pytorch'</span>)
last_update = now
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">train</span>(self, episodes = <span style="color:#00d;font-weight:bold">1000</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Lower level training orchestration method. Written in the generator style. Intended to be used with "for update in train(...):"
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#888"># create the requested number of background agents and runners:</span>
worker_agents = self.agent.clone(num = self.workers)
runners = [ Runner(agent=agent, ix = ix + <span style="color:#00d;font-weight:bold">1</span>, train = <span style="color:#080;font-weight:bold">True</span>) <span style="color:#080;font-weight:bold">for</span> ix, agent <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(worker_agents) ]
<span style="color:#888"># we're going to communicate the workers' updates via the thread safe queue:</span>
queue = mp.SimpleQueue()
<span style="color:#888"># if we've not been given a number of episodes: assume the process is going to be interrupted with the keyboard interrupt once the user (us) decides so:</span>
<span style="color:#080;font-weight:bold">if</span> episodes <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Starting out an infinite training process'</span>)
<span style="color:#888"># create the actual background processes, making their entry be the train_one method:</span>
processes = [ mp.Process(target=self.train_one, args=(runners[ix - <span style="color:#00d;font-weight:bold">1</span>], queue, episodes, ix)) <span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, self.workers + <span style="color:#00d;font-weight:bold">1</span>) ]
<span style="color:#888"># run those processes:</span>
<span style="color:#080;font-weight:bold">for</span> process <span style="color:#080">in</span> processes:
process.start()
<span style="color:#080;font-weight:bold">try</span>:
<span style="color:#888"># what follows is a rather naive implementation of listening to workers updates. it works though for our purposes:</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">any</span>([ process.is_alive() <span style="color:#080;font-weight:bold">for</span> process <span style="color:#080">in</span> processes ]):
results = queue.get()
<span style="color:#080;font-weight:bold">yield</span> results
<span style="color:#080;font-weight:bold">except</span> <span style="color:#b06;font-weight:bold">Exception</span> <span style="color:#080;font-weight:bold">as</span> e:
logger.error(<span style="color:#038">str</span>(e))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">train_one</span>(self, runner, queue, episodes = <span style="color:#00d;font-weight:bold">1000</span>, ix = <span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Orchestrate the training for a single worker runner and agent. This is intended to run in its own background process.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#888"># possibly naive way of trying to de-correlate the weight updates further (I have no hard evidence to prove if it works, other than my subjective observation):</span>
time.sleep(ix)
<span style="color:#080;font-weight:bold">try</span>:
<span style="color:#888"># we are going to request that the episode be reset whenever our agent scores lower than its max points. the same will happen if the agent scores a total of -10 points:</span>
max_points = <span style="color:#00d;font-weight:bold">0</span>
max_eval_points = <span style="color:#00d;font-weight:bold">0</span>
min_points = <span style="color:#00d;font-weight:bold">0</span>
max_episode = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> episode_ix <span style="color:#080">in</span> itertools.count(start=<span style="color:#00d;font-weight:bold">0</span>, step=<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">if</span> episodes <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span> <span style="color:#080">and</span> episode_ix >= episodes:
<span style="color:#080;font-weight:bold">return</span>
max_episode_points = <span style="color:#00d;font-weight:bold">0</span>
points = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#888"># load up the newest weights every new episode:</span>
runner.load_state_dict(self.agent.state_dict())
<span style="color:#888"># every 5 episodes lets evaluate the weights we've learned so far by recording the run of the car using the greedy strategy:</span>
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> episode_ix % <span style="color:#00d;font-weight:bold">5</span> == <span style="color:#00d;font-weight:bold">0</span>:
eval_points = self.record_greedy(episode_ix)
<span style="color:#080;font-weight:bold">if</span> eval_points > max_eval_points:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(eval_points) + <span style="color:#d20;background-color:#fff0f0">'-eval-points.pytorch'</span>)
max_eval_points = eval_points
<span style="color:#888"># each n-step window, compute the gradients and apply</span>
<span style="color:#888"># also: decide if we shouldn't restart the episode if we don't want to explore too much of the not-useful state space:</span>
<span style="color:#080;font-weight:bold">for</span> step, rewards, values, policies, action_ixs, terminal <span style="color:#080">in</span> runner.run_episode(yield_every=self.window):
points += <span style="color:#038">sum</span>(rewards)
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> points > max_points:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">'-points.pytorch'</span>)
max_points = points
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> episode_ix > max_episode:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">'-episode.pytorch'</span>)
max_episode = episode_ix
<span style="color:#080;font-weight:bold">if</span> points < -<span style="color:#00d;font-weight:bold">10</span> <span style="color:#080">or</span> (max_episode_points > min_points <span style="color:#080">and</span> points < min_points):
terminal = <span style="color:#080;font-weight:bold">True</span>
max_episode_points = <span style="color:#00d;font-weight:bold">0</span>
points = <span style="color:#00d;font-weight:bold">0</span>
runner.ask_reset()
<span style="color:#080;font-weight:bold">if</span> terminal:
logger.info(<span style="color:#d20;background-color:#fff0f0">'TERMINAL for '</span> + <span style="color:#038">str</span>(ix) + <span style="color:#d20;background-color:#fff0f0">' at step '</span> + <span style="color:#038">str</span>(step) + <span style="color:#d20;background-color:#fff0f0">' with total points '</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">' max: '</span> + <span style="color:#038">str</span>(max_episode_points) )
<span style="color:#888"># if we're learning, then compute and apply the gradients and load the newest weights:</span>
<span style="color:#080;font-weight:bold">if</span> runner.train:
loss = self.apply_gradients(policies, action_ixs, rewards, values, terminal, runner)
runner.load_state_dict(self.agent.state_dict())
max_episode_points = <span style="color:#038">max</span>(max_episode_points, points)
min_points = <span style="color:#038">max</span>(min_points, points)
<span style="color:#888"># communicate the gathered values to the main process:</span>
queue.put((ix, episode_ix, step, rewards, loss, points, terminal))
<span style="color:#080;font-weight:bold">except</span> <span style="color:#b06;font-weight:bold">Exception</span> <span style="color:#080;font-weight:bold">as</span> e:
string = traceback.format_exc()
logger.error(<span style="color:#038">str</span>(e) + <span style="color:#d20;background-color:#fff0f0">' → '</span> + string)
queue.put((ix, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, [-<span style="color:#00d;font-weight:bold">1</span>], -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#038">str</span>(e) + <span style="color:#d20;background-color:#fff0f0">'<br />'</span> + string, <span style="color:#080;font-weight:bold">True</span>))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">record_greedy</span>(self, episode_ix):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Records the video of the "greedy" run based on the current weights.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
directory = <span style="color:#d20;background-color:#fff0f0">'./videos/car-racing/episode-'</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">'-'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(time.time()))
player = Player(agent=self.agent, directory=directory, train=<span style="color:#080;font-weight:bold">False</span>)
points = player.play()
logger.info(<span style="color:#d20;background-color:#fff0f0">'Evaluation at episode '</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">': '</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">' points ('</span> + directory + <span style="color:#d20;background-color:#fff0f0">')'</span>)
<span style="color:#080;font-weight:bold">return</span> points
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">apply_gradients</span>(self, policies, actions, rewards, values, terminal, runner):
worker_agent = runner.agent
actions_one_hot = torch.tensor([[ <span style="color:#038">int</span>(i == action) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">4</span>) ] <span style="color:#080;font-weight:bold">for</span> action <span style="color:#080">in</span> actions], dtype=torch.float32)
policies = torch.stack(policies)
values = torch.cat(values)
values_nograd = torch.zeros_like(values.detach(), requires_grad=<span style="color:#080;font-weight:bold">False</span>)
values_nograd.copy_(values)
discounted_rewards = self.discount_rewards(runner, rewards, values_nograd[-<span style="color:#00d;font-weight:bold">1</span>], terminal)
advantages = discounted_rewards - values_nograd
logger.info(<span style="color:#d20;background-color:#fff0f0">'Runner '</span> + <span style="color:#038">str</span>(runner.ix) + <span style="color:#d20;background-color:#fff0f0">' Rewards: '</span> + <span style="color:#038">str</span>(rewards))
logger.info(<span style="color:#d20;background-color:#fff0f0">'Runner '</span> + <span style="color:#038">str</span>(runner.ix) + <span style="color:#d20;background-color:#fff0f0">' Discounted Rewards: '</span> + <span style="color:#038">str</span>(discounted_rewards.numpy()))
log_policies = torch.log(<span style="color:#00d;font-weight:bold">0.00000001</span> + policies)
one_log_policies = torch.sum(log_policies * actions_one_hot, dim=<span style="color:#00d;font-weight:bold">1</span>)
entropy = torch.sum(policies * -log_policies)
policy_loss = -torch.mean(one_log_policies * advantages)
value_loss = F.mse_loss(values, discounted_rewards)
value_loss_nograd = torch.zeros_like(value_loss)
value_loss_nograd.copy_(value_loss)
policy_loss_nograd = torch.zeros_like(policy_loss)
policy_loss_nograd.copy_(policy_loss)
logger.info(<span style="color:#d20;background-color:#fff0f0">'Value Loss: '</span> + <span style="color:#038">str</span>(<span style="color:#038">float</span>(value_loss_nograd)) + <span style="color:#d20;background-color:#fff0f0">' Policy Loss: '</span> + <span style="color:#038">str</span>(<span style="color:#038">float</span>(policy_loss_nograd)))
loss = policy_loss + <span style="color:#00d;font-weight:bold">0.5</span> * value_loss - <span style="color:#00d;font-weight:bold">0.01</span> * entropy
self.agent.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(worker_agent.parameters(), <span style="color:#00d;font-weight:bold">40</span>)
<span style="color:#888"># the following step is crucial. at this point, all the info about the gradients reside in the worker agent's memory. We need to "move" those gradients into the main agent's memory:</span>
self.share_gradients(worker_agent)
<span style="color:#888"># update the weights with the computed gradients:</span>
self.optimizer.step()
worker_agent.zero_grad()
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">float</span>(loss.detach())
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">share_gradients</span>(self, worker_agent):
<span style="color:#080;font-weight:bold">for</span> param, shared_param <span style="color:#080">in</span> <span style="color:#038">zip</span>(worker_agent.parameters(), self.agent.parameters()):
<span style="color:#080;font-weight:bold">if</span> shared_param.grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#080;font-weight:bold">return</span>
shared_param._grad = param.grad
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clip_reward</span>(self, reward):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Clips the rewards into the <-3, 3> range, preventing too large a variance in the gradients.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">max</span>(<span style="color:#038">min</span>(reward, <span style="color:#00d;font-weight:bold">3</span>), -<span style="color:#00d;font-weight:bold">3</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">discount_rewards</span>(self, runner, rewards, last_value, terminal):
discounted_rewards = [<span style="color:#00d;font-weight:bold">0</span> <span style="color:#080;font-weight:bold">for</span> _ <span style="color:#080">in</span> rewards]
loop_rewards = [ self.clip_reward(reward) <span style="color:#080;font-weight:bold">for</span> reward <span style="color:#080">in</span> rewards ]
<span style="color:#080;font-weight:bold">if</span> terminal:
loop_rewards.append(<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">else</span>:
loop_rewards.append(runner.get_value())
<span style="color:#080;font-weight:bold">for</span> main_ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#038">len</span>(discounted_rewards) - <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">for</span> inside_ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#038">len</span>(loop_rewards) - <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">if</span> inside_ix >= main_ix:
reward = loop_rewards[inside_ix]
discounted_rewards[main_ix] += self.gamma**(inside_ix - main_ix) * reward
<span style="color:#080;font-weight:bold">return</span> torch.tensor(discounted_rewards)
</code></pre></div><p>For the <code>record_greedy</code> method to work we need the following class:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Player</span>(Runner):
<span style="color:#080;font-weight:bold">def</span> __init__(self, directory, **kwargs):
<span style="color:#038">super</span>().__init__(ix=<span style="color:#00d;font-weight:bold">999</span>, **kwargs)
self.env = Monitor(self.env, directory)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">play</span>(self):
points = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> step, rewards, values, policies, actions, terminal <span style="color:#080">in</span> self.run_episode(yield_every = <span style="color:#00d;font-weight:bold">1</span>, do_render = <span style="color:#080;font-weight:bold">True</span>):
points += <span style="color:#038">sum</span>(rewards)
self.env.close()
<span style="color:#080;font-weight:bold">return</span> points
</code></pre></div><p>All the above code can be used as follows (in the Python script):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">"__main__"</span>:
agent = Agent()
trainer = Trainer(gamma = <span style="color:#00d;font-weight:bold">0.99</span>, agent = agent)
trainer.fit(episodes=<span style="color:#080;font-weight:bold">None</span>)
</code></pre></div><h4 id="the-importance-of-tuning-of-the-n-step-window-size">The importance of tuning the n-step window size</h4>
<p>Reading the code, you’ll notice that we’ve chosen $15$ as the size of the n-step window. We’ve also chosen $\gamma=0.99$. Getting those values right is a matter of tuning: the ones that work for one game or challenge will not necessarily work well for another.</p>
<p>Here’s a quick way to think about them: we’re going to be penalized most of the time, so it’s important to give the algorithm a chance to actually find trajectories that score positively. In the “CarRacing” challenge, I’ve found that it can take 10 steps of moving “full throttle” in the correct direction before we’re rewarded for entering a new “tile”. I simply added a safety margin of $5$ steps to that number. No mathematical proof backs this reasoning, but I can tell you that it made a <strong>huge</strong> difference in training time for me. The version of the code presented above starts to score above 700 points after approximately 10 hours on my Ryzen 7 based machine.</p>
<h4 id="problems-with-the-state-being-returned-from-the-environment---overcoming-with-the-explicit-render">Problems with the state being returned from the environment - overcoming with the explicit render</h4>
<p>You might have also noticed that I’m not using the state values returned by the <code>step</code> method of the gym environment. This might seem contrary to how the gym is typically used. After <strong>days</strong> of not seeing my model converge, though, I found that the <code>step</code> method was returning <strong>one and the same</strong> numpy array <strong>on each call</strong>. You can imagine that it was the absolute <strong>last</strong> thing I checked when trying to find that bug.</p>
<p>I found that <code>render(mode='rgb_array')</code> works as intended on each call. I just needed to write my own preprocessing code to scale the frame down and convert it to grayscale.</p>
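<p>For reference, that preprocessing is just a couple of array operations. This is a hedged sketch with assumed sizes, not the exact code from the project:</p>

```python
import numpy as np

# Hypothetical sketch: grayscale conversion plus naive downscaling of an
# RGB frame like the one returned by render(mode='rgb_array'). The 96x96
# input size and the stride-2 downscale factor are illustrative assumptions.
def preprocess(frame, downscale=2):
    gray = frame @ np.array([0.299, 0.587, 0.114])  # (H, W, 3) -> (H, W) luma
    small = gray[::downscale, ::downscale]          # keep every 2nd pixel
    return (small / 255.0).astype(np.float32)       # scale to [0, 1]

frame = np.random.randint(0, 256, size=(96, 96, 3))
state = preprocess(frame)
```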
<h4 id="how-to-know-when-the-algorithm-converges">How to know when the algorithm converges</h4>
<p>I’ve seen some people conclude that their A3C implementation does not converge: the resulting policy did not seem to work that well, and the training process was taking a bit longer than “some other implementation”. I fell for this kind of thinking myself. My humble bit of advice is to stick to what makes sense mathematically. Someone else’s model might be converging faster simply because of the hardware being used, or because of some slight difference in the code <strong>around</strong> the training (e.g. the explicit render needed in my case). This might not have anything to do with the A3C part at all.</p>
<p>How do we “stick to what makes sense mathematically”? Simply by logging the value loss and observing it as training continues. Intuitively, a model that has converged has already learned the value function. Those values — representing the average of the discounted rewards — should not make the loss too big most of the time. Still, for some states the best action will make $R_t$ much bigger than $V(s_t)$, so we should still see the loss spiking from time to time.</p>
<p>Again, the above advice doesn’t come with any mathematical proof. It’s simply what I found to work and make sense <strong>in my case</strong>.</p>
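<p>If you want the “log and watch” idea in code form, here’s a minimal sketch (my own illustration, not part of the training code) that smooths the value loss so spikes stand out against the trend:</p>

```python
# Minimal sketch: an exponential moving average of the value loss.
# The alpha and factor values are arbitrary illustrative choices.
class LossTracker:
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.ema = None  # running smoothed loss

    def update(self, loss):
        if self.ema is None:
            self.ema = loss
        else:
            self.ema = (1 - self.alpha) * self.ema + self.alpha * loss
        return self.ema

    def is_spike(self, loss, factor=5.0):
        """Flag losses far above the running average."""
        return self.ema is not None and loss > factor * self.ema

tracker = LossTracker()
for step_loss in [0.9, 1.1, 1.0, 0.95]:
    smoothed = tracker.update(step_loss)
```

<p>A converged model should show a small, stable running average, with only the occasional spike flagged.</p>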
<h3 id="the-results">The Results</h3>
<p>Instead of presenting hard-core statistics about the model’s performance (which wouldn’t make much sense, because I stopped training as soon as the “evaluation” videos started looking cool enough), I’ll just post three videos of the car driving on its own through three randomly generated tracks.</p>
<p>Have fun watching and even more fun coding it yourself too!</p>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/873-openaigym.video.92.68.video000000.mp4" type="video/mp4">
</video>
</center>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/892-openaigym.video.90.68.video000000.mp4" type="video/mp4">
</video>
</center>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/904-openaigym.video.77.68.video000000.mp4" type="video/mp4">
</video>
</center>
<script>
renderMathInElement(
document.body,
{
delimiters: [
{left: "$$", right: "$$", display: true},
{left: "\\[", right: "\\]", display: true},
{left: "$", right: "$", display: false},
{left: "\\(", right: "\\)", display: false}
]
}
);
</script>
<h2>Recommender System via a Simple Matrix Factorization</h2>
<p><a href="https://www.endpointdev.com/blog/2018/07/recommender-mxnet/">https://www.endpointdev.com/blog/2018/07/recommender-mxnet/</a> · 2018-07-17 · Kamil Ciemniewski</p>
<p><img src="/blog/2018/07/recommender-mxnet/10539898745_56b790e62e_o-crop.jpg" alt="people sitting and laughing" /><br><a href="https://www.flickr.com/photos/michaelcartwright/10539898745/">Photo by Michael Cartwright, CC BY-SA 2.0, cropped</a></p>
<p>We all like how apps like Spotify or Last.fm can recommend us a song that feels so much like our taste. Being able to recommend an item to a user is very important for keeping and expanding the user base.</p>
<p>In this article I’ll present an overview of building a recommendation system. The approach here is quite basic, but it’s grounded in valid and battle-tested theory. I’ll show you how to put this theory into practice by coding it in Python with the help of MXNet.</p>
<h3 id="kinds-of-recommenders">Kinds of recommenders</h3>
<p>The general setup of the content recommendation challenge is that we have <strong>users</strong> and <strong>items</strong>. The task is to recommend items to a particular user.</p>
<p>There are two distinct approaches to recommending content:</p>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering">Content based filtering</a></li>
<li><a href="https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering">Collaborative filtering</a></li>
</ol>
<p>The first bases its output on the intricate features of the item and how they relate to the user. The second uses information about how other, similar users rate the items. More elaborate systems base their work on both; such systems are called <a href="https://en.wikipedia.org/wiki/Recommender_system#Hybrid_recommender_systems">hybrid recommender systems</a>.</p>
<p>This article is going to focus on <strong>collaborative filtering</strong> only.</p>
<h3 id="a-bit-of-theory-matrix-factorization">A bit of theory: matrix factorization</h3>
<p>In the simplest terms, we can represent interactions between users and items with a matrix:</p>
<table style="border-collapse: collapse; text-align: center"><tr><th></th><th>item1</th><th>item2</th><th>item3</th></tr><tr><th>user1</th><td>-1</td><td>-</td><td>0.6</td></tr><tr><th>user2</th><td>-</td><td>0.95</td><td>-0.1</td></tr><tr><th>user3</th><td>0.5</td><td>-</td><td>0.8</td></tr></table>
<p>In the above case users can rate items on the scale of <code><-1, 1></code>. Notice that in reality it’s most likely that users will not rate everything. The missing ratings are represented with the dash: <code>-</code>.</p>
<p>Just by looking at the above table, we know that no amount of math is going to change the fact that user1 completely dislikes item1, or that user2 likes item2 a lot. The ratings we already have make for a fairly easy set of items to propose. The goal of a recommender, though, is not to propose items the users already know. We want to predict which of the “dashes” in the table are most likely to hide high ratings. In other words: we want to reconstruct the full representation of the above matrix based only on its “sparse” representation as shown above.</p>
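<p>For concreteness, that sparse matrix can be written down directly, with <code>NaN</code> standing in for the dashes (a NumPy sketch, not code from this article):</p>

```python
import numpy as np

# The ratings table from above; NaN marks a missing (unrated) entry.
C = np.array([
    [-1.0,   np.nan,  0.6],
    [np.nan, 0.95,   -0.1],
    [ 0.5,   np.nan,  0.8],
])

known = ~np.isnan(C)            # mask of observed ratings
density = known.sum() / C.size  # fraction of the matrix we actually know
```

<p>Real-world rating matrices are far sparser than this toy one; most users rate only a handful of items.</p>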
<p>How can we solve this problem? Let’s recall the rules of multiplying two matrices:</p>
<p>Given two matrices <code>A: m × k</code> and <code>B: k × n</code>, their product is another matrix <code>C: m × n</code>. We can multiply two matrices only if the second dimension of the first matrix equals the first dimension of the second. In such a case, matrix <code>C</code> is the product of two factors: matrix <code>A</code> and matrix <code>B</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">C = AB
</code></pre></div><p>Imagine now that the sparse matrix represented by the ratings table is our <code>C</code>. We then assume that there exist two matrices, <code>A</code> and <code>B</code>, that (at least approximately) <em>factorize</em> <code>C</code>.</p>
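<p>As a quick shape check, with toy dimensions (a NumPy sketch of the factorization, not the article’s model):</p>

```python
import numpy as np

m, k, n = 3, 2, 3          # toy sizes; k is the latent dimension
A = np.random.randn(m, k)  # one k-sized vector per user (rows)
B = np.random.randn(k, n)  # one k-sized vector per item (columns)
C = A @ B                  # the full m x n reconstruction

# a single predicted rating is the dot product of a user row and an item column
pred = A[0] @ B[:, 2]
```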
<p>Notice also how this factorization is saving the space needed to persist the ratings:</p>
<p>Let’s assign concrete values to <code>m</code> and <code>n</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">m = <span style="color:#00d;font-weight:bold">1000000</span>
n = <span style="color:#00d;font-weight:bold">10000</span>
</code></pre></div><p>Then the full representation takes:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">m * n => <span style="color:#00d;font-weight:bold">10</span>,<span style="color:#00d;font-weight:bold">000</span>,<span style="color:#00d;font-weight:bold">000</span>,<span style="color:#00d;font-weight:bold">000</span>
</code></pre></div><p>We can now choose the value for <code>k</code>, to be later used when constructing the factorizing matrices:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">k = <span style="color:#00d;font-weight:bold">16</span>
</code></pre></div><p>Then to store both matrices: <code>A</code> and <code>B</code> we only need:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">m * k + n * k => <span style="color:#00d;font-weight:bold">16</span>,<span style="color:#00d;font-weight:bold">160</span>,<span style="color:#00d;font-weight:bold">000</span>
</code></pre></div><p>Making it into a fraction of the previous number:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">(m * k + n * k) / (m * n) => <span style="color:#00d;font-weight:bold">0.001616</span>
</code></pre></div><p>That’s a <strong>huge</strong> saving over the original space! The cost we pay is a small increase in the computation needed at retrieval time: inferring a rating of <code>C</code> from <code>A</code> and <code>B</code> requires a <strong>dot product</strong> of the corresponding row and column of those matrices.</p>
<h3 id="reasoning-about-the-matrix-factors">Reasoning about the matrix factors</h3>
<p>What intuition can we build for the above-mentioned matrices <code>A</code> and <code>B</code>? Looking at their dimensions, we can see that each row of <code>A</code> is a <code>k</code>-sized vector representing a user. Conversely, each column of <code>B</code> is a <code>k</code>-sized vector representing an item. The values in those vectors are called <strong>latent features</strong>, and the vectors themselves are sometimes called <strong>latent representations</strong> of users and items.</p>
<p>What could the intuition be? To split the original matrix, for each item we need to look at all of its interactions with users. You can imagine the algorithm finding patterns in the ratings that match certain characteristics of the item. If this were about movies, the features could be that it’s a comedy or sci-fi, or that it’s futuristic or set deep in ancient times. We’re essentially taking the original vector of a movie, which contains ratings, and distilling from it the features that describe the movie best. Note that this is only a half-truth: we think about it this way to have a way of explaining why the approach works. In many cases we’d have a hard time finding the actual real-world aspects that those latent features track.</p>
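<p>One practical consequence of these latent vectors: items whose vectors point in similar directions should feel similar. Cosine similarity is a common way to check that. This sketch, with made-up vectors, is my own illustration and not something the article’s model computes:</p>

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up 3-dimensional latent vectors for three hypothetical movies
comedy     = np.array([0.9, 0.1, -0.2])
comedy_too = np.array([0.8, 0.2, -0.1])
horror     = np.array([-0.7, 0.9, 0.4])

sim_close = cosine_similarity(comedy, comedy_too)
sim_far   = cosine_similarity(comedy, horror)
```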
<h3 id="factorizing-the-user--item-matrix-in-practice">Factorizing the user × item matrix in practice</h3>
<p>A simple approach to finding matrices <code>A</code> and <code>B</code> is to initialize them randomly. Then, for each known value in <code>C</code>, we compute the dot product of the corresponding row and column and measure how much it differs from the known value. Because the dot product is easily differentiable, we can use <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> to iteratively improve <code>A</code> and <code>B</code> until <code>AB</code> is close enough to <code>C</code> for our purposes.</p>
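<p>Stripped of any framework, that procedure fits in a few lines. The following is a toy NumPy sketch with hand-derived gradients; the article’s actual code below relies on MXNet’s automatic differentiation instead:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# the known entries of C from the ratings table, as (row, col, value) triples
ratings = [(0, 0, -1.0), (0, 2, 0.6), (1, 1, 0.95),
           (1, 2, -0.1), (2, 0, 0.5), (2, 2, 0.8)]

m, n, k, lr = 3, 3, 2, 0.1
A = rng.normal(scale=0.1, size=(m, k))  # random init, as described above
B = rng.normal(scale=0.1, size=(k, n))

for _ in range(5000):
    for i, j, r in ratings:
        err = A[i] @ B[:, j] - r  # prediction error on one known entry
        grad_a = err * B[:, j]    # gradient of err**2 / 2 w.r.t. A[i]
        grad_b = err * A[i]       # gradient of err**2 / 2 w.r.t. B[:, j]
        A[i] -= lr * grad_a
        B[:, j] -= lr * grad_b

max_err = max(abs(A[i] @ B[:, j] - r) for i, j, r in ratings)
```

<p>After training, the known entries are reconstructed closely, and the previously missing entries of <code>A @ B</code> serve as the predictions.</p>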
<p>In this article, I’m going to use a freely available database of joke ratings, called “<a href="http://eigentaste.berkeley.edu/dataset/">Jester</a>”. It contains data about ratings from 59132 users and 150 jokes.</p>
<h3 id="coding-the-model-with-mxnet">Coding the model with MXNet</h3>
<p>Let’s first import some of the classes and functions we’ll use later.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">mxnet.gluon</span> <span style="color:#080;font-weight:bold">import</span> Block, nn, Trainer
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">mxnet.gluon.loss</span> <span style="color:#080;font-weight:bold">import</span> L2Loss
<span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">mxnet</span> <span style="color:#080;font-weight:bold">import</span> autograd, ndarray <span style="color:#080;font-weight:bold">as</span> F
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">mxnet</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">mx</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">numpy</span> <span style="color:#080;font-weight:bold">as</span> <span style="color:#b06;font-weight:bold">np</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">random</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">logging</span>
<span style="color:#080;font-weight:bold">import</span> <span style="color:#b06;font-weight:bold">re</span>
</code></pre></div><p>The first step in building the training process is to create an iterator over the training batches read from the data files. To keep things trivially simple, I’ll read the whole dataset into memory; the batches will then be constructed from the data cached there.</p>
<p>To create a custom data iterator, we’ll need to inherit from <code>mxnet.io.DataIter</code> and implement at least two methods: <code>next</code> and <code>reset</code>. Here’s our simple code:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">DataIter</span>(mx.io.DataIter):
<span style="color:#080;font-weight:bold">def</span> __init__(self, data, batch_size = <span style="color:#00d;font-weight:bold">16</span>):
<span style="color:#038">super</span>(DataIter, self).__init__()
self.batch_size = batch_size
self.all_user_ids = <span style="color:#038">set</span>()
self.data = data
self.index = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> user_id, item_id, _ <span style="color:#080">in</span> data:
self.all_user_ids.add(user_id)
<span style="color:#555">@property</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">user_count</span>(self):
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">len</span>(self.all_user_ids)
<span style="color:#555">@property</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">item_count</span>(self):
<span style="color:#888"># we just know the value even though 10 of them were</span>
<span style="color:#888"># not voted</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#00d;font-weight:bold">150</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">next</span>(self):
index = self.index * self.batch_size
endindex = index + self.batch_size
<span style="color:#080;font-weight:bold">if</span> <span style="color:#038">len</span>(self.data) <= index:
<span style="color:#080;font-weight:bold">raise</span> <span style="color:#b06;font-weight:bold">StopIteration</span>
<span style="color:#080;font-weight:bold">else</span>:
user_ids = []
item_ids = []
ratings = []
user_ids = self.data[index:endindex, <span style="color:#00d;font-weight:bold">0</span>]
item_ids = self.data[index:endindex, <span style="color:#00d;font-weight:bold">1</span>]
ratings = self.data[index:endindex, <span style="color:#00d;font-weight:bold">2</span>]
data_all = [mx.nd.array(user_ids), mx.nd.array(item_ids)]
label_all = [mx.nd.array([r]) <span style="color:#080;font-weight:bold">for</span> r <span style="color:#080">in</span> ratings]
self.index += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">return</span> mx.io.DataBatch(data_all, label_all)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">reset</span>(self):
self.index = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#888"># use numpy's shuffle; random.shuffle can duplicate rows of a 2-D array</span>
np.random.shuffle(self.data)
</code></pre></div><p>The above <code>DataIter</code> class expects to be given a <code>numpy</code> array with all the training examples. The first column represents the user, the second the item, and the third the rating.</p>
<p>Here’s the code for reading data from disk and feeding it into the <code>DataIter</code>’s constructor:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_data</span>(batch_size):
user_ids = []
item_ids = []
ratings = []
<span style="color:#080;font-weight:bold">with</span> <span style="color:#038">open</span>(<span style="color:#d20;background-color:#fff0f0">"data/jester_ratings.dat"</span>, <span style="color:#d20;background-color:#fff0f0">"r"</span>) <span style="color:#080;font-weight:bold">as</span> file:
<span style="color:#080;font-weight:bold">for</span> line <span style="color:#080">in</span> file:
user_id, _, item_id, _, rating = line.strip().split(<span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#04d;background-color:#fff0f0">\t</span><span style="color:#d20;background-color:#fff0f0">"</span>)
user_ids.append(<span style="color:#038">int</span>(user_id))
item_ids.append(<span style="color:#038">int</span>(item_id))
ratings.append(<span style="color:#038">float</span>(rating) / <span style="color:#00d;font-weight:bold">10.0</span>)
all_raw = np.asarray(<span style="color:#038">list</span>(<span style="color:#038">zip</span>(user_ids, item_ids, ratings)), dtype=<span style="color:#d20;background-color:#fff0f0">'float32'</span>)
<span style="color:#080;font-weight:bold">return</span> DataIter(all_raw, batch_size = batch_size)
</code></pre></div><p>Notice that I’m dividing each rating by <code>10</code> to scale the ratings from <code><-10,10></code> down to <code><-1,1></code>. I’m doing so because I found the process hitting numerical overflows when using the <code>Adam</code> optimizer otherwise.</p>
<p>The function accepts the <code>batch_size</code> as an argument. Below I’m creating a dataset iterator yielding 64 examples at a time:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">train = get_data(<span style="color:#00d;font-weight:bold">64</span>)
</code></pre></div><p>Recent versions of MXNet bring in a coding model similar to the one found in PyTorch. We can use the clean approach of defining the model by extending a base class and implementing the <code>forward</code> method. This is possible thanks to the <code>mxnet.gluon</code> module, which defines the <code>Block</code> class.</p>
<p>As a full-featured deep learning framework, MXNet has its own implementation of automatic gradient calculation. The <code>forward</code> method in our <code>Block</code> subclass is all we need to proceed with gradient descent.</p>
<p>In our model, the <code>A</code> and <code>B</code> matrices will be encoded within the <code>gluon</code> layers of type <code>Embedding</code>. The <code>Embedding</code> class lets you specify the number of rows in the matrix as well as the dimension into which we’re “squashing” them. Using the class is very handy as it doesn’t require us to “<a href="https://en.wikipedia.org/wiki/One-hot">one-hot encode</a>” the user and item IDs.</p>
<p>Following is the implementation of our simple model as an <code>MXNet</code> block. Notice that all it really is, is a regression; the model is linear, so we’re not using any <a href="https://en.wikipedia.org/wiki/Activation_function">activation function</a>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Model</span>(Block):
<span style="color:#080;font-weight:bold">def</span> __init__(self, k, dataiter, **kwargs):
<span style="color:#038">super</span>(Model, self).__init__(**kwargs)
<span style="color:#080;font-weight:bold">with</span> self.name_scope():
self.user_embedding = nn.Embedding(input_dim = dataiter.user_count, output_dim=k)
self.item_embedding = nn.Embedding(input_dim = dataiter.item_count, output_dim=k)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, x):
user = self.user_embedding(x[<span style="color:#00d;font-weight:bold">0</span>] - <span style="color:#00d;font-weight:bold">1</span>)
item = self.item_embedding(x[<span style="color:#00d;font-weight:bold">1</span>] - <span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#888"># the following is a dot product in essence</span>
<span style="color:#888"># summing up of the element-wise multiplication</span>
pred = user * item
<span style="color:#080;font-weight:bold">return</span> F.sum_axis(pred, axis = <span style="color:#00d;font-weight:bold">1</span>)
</code></pre></div><p>Next, I’m creating the MXNet computation context as well as an instance of the model itself. Before doing any kind of learning, the parameters of the model will need to be initialized:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">context = mx.gpu() <span style="color:#080;font-weight:bold">if</span> mx.test_utils.list_gpus() <span style="color:#080;font-weight:bold">else</span> mx.cpu()
model = Model(<span style="color:#00d;font-weight:bold">16</span>, train)
model.collect_params().initialize(mx.init.Xavier(), ctx=context)
</code></pre></div><p>The last line from above is initializing the <code>A</code> and <code>B</code> matrices randomly.</p>
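<p>Incidentally, the sum of an element-wise product in the model’s <code>forward</code> is exactly a batched dot product. In plain NumPy terms (an illustration, not the article’s code):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
user = rng.normal(size=(4, 16))  # a batch of 4 user embeddings, k = 16
item = rng.normal(size=(4, 16))  # the matching batch of item embeddings

# what the forward method does: element-wise multiply, then sum over axis 1
pred = (user * item).sum(axis=1)

# the equivalent batched dot product
pred_dot = np.einsum('bk,bk->b', user, item)
```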
<p>We are going to save the state of the model periodically to a file. We’ll be able to load it back with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model.load_params(<span style="color:#d20;background-color:#fff0f0">"model.mxnet"</span>, ctx=context)
</code></pre></div><p>The last bit of code that we need is the training procedure itself. We’re going to code it as a function that takes the model, the data iterator and the number of epochs:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">fit</span>(model, train, num_epoch):
trainer = Trainer(model.collect_params(), <span style="color:#d20;background-color:#fff0f0">'adam'</span>)
<span style="color:#080;font-weight:bold">for</span> epoch_id <span style="color:#080">in</span> <span style="color:#038">range</span>(num_epoch):
batch_id = <span style="color:#00d;font-weight:bold">0</span>
train.reset()
<span style="color:#080;font-weight:bold">for</span> batch <span style="color:#080">in</span> train:
<span style="color:#080;font-weight:bold">with</span> autograd.record():
targets = F.concat(*batch.label, dim=<span style="color:#00d;font-weight:bold">0</span>)
predictions = model(batch.data)
L = L2Loss()
loss = L(predictions, targets)
loss.backward()
trainer.step(batch.data[<span style="color:#00d;font-weight:bold">0</span>].shape[<span style="color:#00d;font-weight:bold">0</span>])
<span style="color:#080;font-weight:bold">if</span> (batch_id + <span style="color:#00d;font-weight:bold">1</span>) % <span style="color:#00d;font-weight:bold">1000</span> == <span style="color:#00d;font-weight:bold">0</span>:
mean_loss = F.mean(loss).asnumpy()[<span style="color:#00d;font-weight:bold">0</span>]
logger.info(<span style="color:#d20;background-color:#fff0f0">f</span><span style="color:#d20;background-color:#fff0f0">'Epoch </span><span style="color:#33b;background-color:#fff0f0">{</span>epoch_id + <span style="color:#00d;font-weight:bold">1</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> / </span><span style="color:#33b;background-color:#fff0f0">{</span>num_epoch<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> | Batch </span><span style="color:#33b;background-color:#fff0f0">{</span>batch_id + <span style="color:#00d;font-weight:bold">1</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> | Mean Loss: </span><span style="color:#33b;background-color:#fff0f0">{</span>mean_loss<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">'</span>)
batch_id += <span style="color:#00d;font-weight:bold">1</span>
logger.info(<span style="color:#d20;background-color:#fff0f0">'Saving model parameters'</span>)
model.save_params(<span style="color:#d20;background-color:#fff0f0">"model.mxnet"</span>)
</code></pre></div><p>Running the trainer for 10 epochs is as simple as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">fit(model, train, num_epoch=<span style="color:#00d;font-weight:bold">10</span>)
</code></pre></div><p>The training process is periodically outputting statistics similar to ones below:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">INFO:root:Epoch 1 / 10 | Batch 1000 | Mean Loss: 0.11189080774784088
INFO:root:Epoch 1 / 10 | Batch 2000 | Mean Loss: 0.12274568527936935
INFO:root:Epoch 1 / 10 | Batch 3000 | Mean Loss: 0.1204155907034874
INFO:root:Epoch 1 / 10 | Batch 4000 | Mean Loss: 0.12192331254482269
(...)
INFO:root:Epoch 10 / 10 | Batch 24000 | Mean Loss: 0.0003094784333370626
INFO:root:Epoch 10 / 10 | Batch 25000 | Mean Loss: 0.0006345464498735964
INFO:root:Epoch 10 / 10 | Batch 26000 | Mean Loss: 0.0007207655580714345
INFO:root:Epoch 10 / 10 | Batch 27000 | Mean Loss: 0.005522257648408413
INFO:root:Saving model parameters
</code></pre></div><h3 id="using-the-trained-latent-feature-matrices">Using the trained latent feature matrices</h3>
<p>To extract the latent matrices from the trained model, we need to use <code>collect_params</code> as shown below:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">user_embed = model.collect_params().get(<span style="color:#d20;background-color:#fff0f0">'embedding0_weight'</span>).data()
joke_embed = model.collect_params().get(<span style="color:#d20;background-color:#fff0f0">'embedding1_weight'</span>).data()
</code></pre></div><p>Each user’s latent representation is a vector of <code>k</code> values:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> user_embed[<span style="color:#00d;font-weight:bold">0</span>]
[ <span style="color:#00d;font-weight:bold">0.11911439</span> -<span style="color:#00d;font-weight:bold">0.01560098</span> -<span style="color:#00d;font-weight:bold">0.26248184</span> <span style="color:#00d;font-weight:bold">0.5341552</span> <span style="color:#00d;font-weight:bold">1.3078408</span> -<span style="color:#00d;font-weight:bold">0.82505447</span>
<span style="color:#00d;font-weight:bold">0.2181341</span> <span style="color:#00d;font-weight:bold">0.69577765</span> -<span style="color:#00d;font-weight:bold">0.22569533</span> -<span style="color:#00d;font-weight:bold">0.7669992</span> <span style="color:#00d;font-weight:bold">0.14042236</span> <span style="color:#00d;font-weight:bold">0.78608125</span>
<span style="color:#00d;font-weight:bold">0.07242275</span> <span style="color:#00d;font-weight:bold">0.49357334</span> <span style="color:#00d;font-weight:bold">0.7525147</span> <span style="color:#00d;font-weight:bold">0.37984315</span>]
<NDArray <span style="color:#00d;font-weight:bold">16</span> <span style="color:#555">@cpu</span>(<span style="color:#00d;font-weight:bold">0</span>)>
</code></pre></div><p>The same goes for the latent representations of jokes:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> joke_embed[<span style="color:#00d;font-weight:bold">7</span>]
[ <span style="color:#00d;font-weight:bold">0.11836094</span> <span style="color:#00d;font-weight:bold">0.14039275</span> -<span style="color:#00d;font-weight:bold">0.10859593</span> -<span style="color:#00d;font-weight:bold">0.13673168</span> <span style="color:#00d;font-weight:bold">0.14074579</span> -<span style="color:#00d;font-weight:bold">0.18800738</span>
<span style="color:#00d;font-weight:bold">0.0463879</span> -<span style="color:#00d;font-weight:bold">0.09659509</span> <span style="color:#00d;font-weight:bold">0.1629943</span> <span style="color:#00d;font-weight:bold">0.02109279</span> -<span style="color:#00d;font-weight:bold">0.0294639</span> -<span style="color:#00d;font-weight:bold">0.03487734</span>
-<span style="color:#00d;font-weight:bold">0.18192524</span> -<span style="color:#00d;font-weight:bold">0.13103536</span> -<span style="color:#00d;font-weight:bold">0.10280509</span> <span style="color:#00d;font-weight:bold">0.14753008</span>]
<NDArray <span style="color:#00d;font-weight:bold">16</span> <span style="color:#555">@cpu</span>(<span style="color:#00d;font-weight:bold">0</span>)>
</code></pre></div><p>Let’s first test to see if the known values got reconstructed:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> F.dot(user_embed[<span style="color:#00d;font-weight:bold">0</span>], joke_embed[<span style="color:#00d;font-weight:bold">7</span>]) * <span style="color:#00d;font-weight:bold">10</span>
[-<span style="color:#00d;font-weight:bold">9.26895</span>]
<NDArray <span style="color:#00d;font-weight:bold">1</span> <span style="color:#555">@cpu</span>(<span style="color:#00d;font-weight:bold">0</span>)>
</code></pre></div><p>Comparing it with the value from the file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">> cat data/jester_ratings.dat | rg <span style="color:#d20;background-color:#fff0f0">"^1\t\t8\t"</span>
<span style="color:#00d;font-weight:bold">1</span> <span style="color:#00d;font-weight:bold">8</span> -9.281
</code></pre></div><p>That’s close enough. Let’s now get the set of all joke ids rated by the first user:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">test = get_data(<span style="color:#00d;font-weight:bold">1</span>)
joke_ids = <span style="color:#038">set</span>()
<span style="color:#080;font-weight:bold">for</span> batch <span style="color:#080">in</span> test:
user_id, joke_id = batch.data
<span style="color:#080;font-weight:bold">if</span> user_id.asnumpy()[<span style="color:#00d;font-weight:bold">0</span>] == <span style="color:#00d;font-weight:bold">1</span>:
joke_ids.add(joke_id.asnumpy()[<span style="color:#00d;font-weight:bold">0</span>])
joke_ids
</code></pre></div><p>The above code outputs:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">{<span style="color:#00d;font-weight:bold">5.0</span>, <span style="color:#00d;font-weight:bold">7.0</span>, <span style="color:#00d;font-weight:bold">8.0</span>, <span style="color:#00d;font-weight:bold">13.0</span>, <span style="color:#00d;font-weight:bold">15.0</span>, <span style="color:#00d;font-weight:bold">16.0</span>, <span style="color:#00d;font-weight:bold">17.0</span>, <span style="color:#00d;font-weight:bold">18.0</span>, <span style="color:#00d;font-weight:bold">19.0</span>, <span style="color:#00d;font-weight:bold">20.0</span>, <span style="color:#00d;font-weight:bold">21.0</span>, <span style="color:#00d;font-weight:bold">22.0</span>, <span style="color:#00d;font-weight:bold">23.0</span>, <span style="color:#00d;font-weight:bold">24.0</span>, <span style="color:#00d;font-weight:bold">25.0</span>, <span style="color:#00d;font-weight:bold">26.0</span>, <span style="color:#00d;font-weight:bold">27.0</span>, <span style="color:#00d;font-weight:bold">29.0</span>, <span style="color:#00d;font-weight:bold">31.0</span>, <span style="color:#00d;font-weight:bold">32.0</span>, <span style="color:#00d;font-weight:bold">34.0</span>, <span style="color:#00d;font-weight:bold">35.0</span>, <span style="color:#00d;font-weight:bold">36.0</span>, <span style="color:#00d;font-weight:bold">42.0</span>, <span style="color:#00d;font-weight:bold">49.0</span>, <span style="color:#00d;font-weight:bold">50.0</span>, <span style="color:#00d;font-weight:bold">51.0</span>, <span style="color:#00d;font-weight:bold">52.0</span>, <span style="color:#00d;font-weight:bold">53.0</span>, <span style="color:#00d;font-weight:bold">54.0</span>, <span style="color:#00d;font-weight:bold">61.0</span>, <span style="color:#00d;font-weight:bold">62.0</span>, <span style="color:#00d;font-weight:bold">65.0</span>, <span 
style="color:#00d;font-weight:bold">66.0</span>, <span style="color:#00d;font-weight:bold">68.0</span>, <span style="color:#00d;font-weight:bold">69.0</span>, <span style="color:#00d;font-weight:bold">72.0</span>, <span style="color:#00d;font-weight:bold">76.0</span>, <span style="color:#00d;font-weight:bold">80.0</span>, <span style="color:#00d;font-weight:bold">81.0</span>, <span style="color:#00d;font-weight:bold">83.0</span>, <span style="color:#00d;font-weight:bold">87.0</span>, <span style="color:#00d;font-weight:bold">89.0</span>, <span style="color:#00d;font-weight:bold">91.0</span>, <span style="color:#00d;font-weight:bold">92.0</span>, <span style="color:#00d;font-weight:bold">93.0</span>, <span style="color:#00d;font-weight:bold">102.0</span>, <span style="color:#00d;font-weight:bold">103.0</span>, <span style="color:#00d;font-weight:bold">104.0</span>, <span style="color:#00d;font-weight:bold">105.0</span>, <span style="color:#00d;font-weight:bold">106.0</span>, <span style="color:#00d;font-weight:bold">107.0</span>, <span style="color:#00d;font-weight:bold">108.0</span>, <span style="color:#00d;font-weight:bold">109.0</span>, <span style="color:#00d;font-weight:bold">118.0</span>, <span style="color:#00d;font-weight:bold">119.0</span>, <span style="color:#00d;font-weight:bold">120.0</span>, <span style="color:#00d;font-weight:bold">121.0</span>, <span style="color:#00d;font-weight:bold">123.0</span>, <span style="color:#00d;font-weight:bold">127.0</span>, <span style="color:#00d;font-weight:bold">128.0</span>, <span style="color:#00d;font-weight:bold">134.0</span>}
</code></pre></div><p>Since we’re mostly interested in the items the user has not yet rated, let’s see what the model learned about them in this context:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> <span style="color:#038">sorted</span>([ (i, F.dot(user_embed[<span style="color:#00d;font-weight:bold">0</span>], joke_embed[i]).asnumpy()[<span style="color:#00d;font-weight:bold">0</span>] * <span style="color:#00d;font-weight:bold">10</span>) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">150</span>) <span style="color:#080;font-weight:bold">if</span> i + <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">not</span> <span style="color:#080">in</span> joke_ids ], key=<span style="color:#080;font-weight:bold">lambda</span> x: x[<span style="color:#00d;font-weight:bold">1</span>])
[(<span style="color:#00d;font-weight:bold">100</span>, -<span style="color:#00d;font-weight:bold">25.34627914428711</span>),
(<span style="color:#00d;font-weight:bold">89</span>, -<span style="color:#00d;font-weight:bold">23.647150993347168</span>),
(<span style="color:#00d;font-weight:bold">63</span>, -<span style="color:#00d;font-weight:bold">23.543219566345215</span>),
(<span style="color:#00d;font-weight:bold">94</span>, -<span style="color:#00d;font-weight:bold">23.415722846984863</span>),
(<span style="color:#00d;font-weight:bold">70</span>, -<span style="color:#00d;font-weight:bold">22.017195224761963</span>),
(<span style="color:#00d;font-weight:bold">93</span>, -<span style="color:#00d;font-weight:bold">21.375732421875</span>),
(<span style="color:#00d;font-weight:bold">140</span>, -<span style="color:#00d;font-weight:bold">20.033082962036133</span>),
(<span style="color:#00d;font-weight:bold">81</span>, -<span style="color:#00d;font-weight:bold">18.813319206237793</span>),
(<span style="color:#00d;font-weight:bold">40</span>, -<span style="color:#00d;font-weight:bold">18.48101019859314</span>),
(<span style="color:#00d;font-weight:bold">135</span>, -<span style="color:#00d;font-weight:bold">18.216774463653564</span>),
(<span style="color:#00d;font-weight:bold">39</span>, -<span style="color:#00d;font-weight:bold">16.993610858917236</span>),
(<span style="color:#00d;font-weight:bold">123</span>, -<span style="color:#00d;font-weight:bold">16.66216731071472</span>),
(<span style="color:#00d;font-weight:bold">45</span>, -<span style="color:#00d;font-weight:bold">16.03758215904236</span>),
(<span style="color:#00d;font-weight:bold">59</span>, -<span style="color:#00d;font-weight:bold">15.045435428619385</span>),
(<span style="color:#00d;font-weight:bold">43</span>, -<span style="color:#00d;font-weight:bold">14.993469715118408</span>),
(<span style="color:#00d;font-weight:bold">74</span>, -<span style="color:#00d;font-weight:bold">12.132725715637207</span>),
(<span style="color:#00d;font-weight:bold">72</span>, -<span style="color:#00d;font-weight:bold">11.94629430770874</span>),
(<span style="color:#00d;font-weight:bold">76</span>, -<span style="color:#00d;font-weight:bold">11.861177682876587</span>),
(<span style="color:#00d;font-weight:bold">29</span>, -<span style="color:#00d;font-weight:bold">11.831218004226685</span>),
(<span style="color:#00d;font-weight:bold">114</span>, -<span style="color:#00d;font-weight:bold">11.82992935180664</span>),
(<span style="color:#00d;font-weight:bold">38</span>, -<span style="color:#00d;font-weight:bold">11.327273845672607</span>),
(<span style="color:#00d;font-weight:bold">98</span>, -<span style="color:#00d;font-weight:bold">10.9122633934021</span>),
(<span style="color:#00d;font-weight:bold">62</span>, -<span style="color:#00d;font-weight:bold">9.507511854171753</span>),
(<span style="color:#00d;font-weight:bold">32</span>, -<span style="color:#00d;font-weight:bold">9.498740434646606</span>),
(<span style="color:#00d;font-weight:bold">83</span>, -<span style="color:#00d;font-weight:bold">9.442780017852783</span>),
(<span style="color:#00d;font-weight:bold">56</span>, -<span style="color:#00d;font-weight:bold">9.361632466316223</span>),
(<span style="color:#00d;font-weight:bold">78</span>, -<span style="color:#00d;font-weight:bold">9.310351014137268</span>),
(<span style="color:#00d;font-weight:bold">109</span>, -<span style="color:#00d;font-weight:bold">8.428668975830078</span>),
(<span style="color:#00d;font-weight:bold">77</span>, -<span style="color:#00d;font-weight:bold">8.131155967712402</span>),
(<span style="color:#00d;font-weight:bold">47</span>, -<span style="color:#00d;font-weight:bold">7.274705171585083</span>),
(<span style="color:#00d;font-weight:bold">99</span>, -<span style="color:#00d;font-weight:bold">7.204542756080627</span>),
(<span style="color:#00d;font-weight:bold">42</span>, -<span style="color:#00d;font-weight:bold">7.091279625892639</span>),
(<span style="color:#00d;font-weight:bold">69</span>, -<span style="color:#00d;font-weight:bold">6.739482879638672</span>),
(<span style="color:#00d;font-weight:bold">57</span>, -<span style="color:#00d;font-weight:bold">6.623743772506714</span>),
(<span style="color:#00d;font-weight:bold">96</span>, -<span style="color:#00d;font-weight:bold">6.209834814071655</span>),
(<span style="color:#00d;font-weight:bold">134</span>, -<span style="color:#00d;font-weight:bold">5.58724582195282</span>),
(<span style="color:#00d;font-weight:bold">73</span>, -<span style="color:#00d;font-weight:bold">5.530622601509094</span>),
(<span style="color:#00d;font-weight:bold">110</span>, -<span style="color:#00d;font-weight:bold">5.126549005508423</span>),
(<span style="color:#00d;font-weight:bold">131</span>, -<span style="color:#00d;font-weight:bold">4.435622692108154</span>),
(<span style="color:#00d;font-weight:bold">9</span>, -<span style="color:#00d;font-weight:bold">4.142558574676514</span>),
(<span style="color:#00d;font-weight:bold">46</span>, -<span style="color:#00d;font-weight:bold">3.7173447012901306</span>),
(<span style="color:#00d;font-weight:bold">13</span>, -<span style="color:#00d;font-weight:bold">3.1510373950004578</span>),
(<span style="color:#00d;font-weight:bold">44</span>, -<span style="color:#00d;font-weight:bold">2.9845643043518066</span>),
(<span style="color:#00d;font-weight:bold">124</span>, -<span style="color:#00d;font-weight:bold">2.7145612239837646</span>),
(<span style="color:#00d;font-weight:bold">137</span>, -<span style="color:#00d;font-weight:bold">2.2213394939899445</span>),
(<span style="color:#00d;font-weight:bold">132</span>, -<span style="color:#00d;font-weight:bold">2.2054636478424072</span>),
(<span style="color:#00d;font-weight:bold">116</span>, -<span style="color:#00d;font-weight:bold">1.9229638576507568</span>),
(<span style="color:#00d;font-weight:bold">111</span>, -<span style="color:#00d;font-weight:bold">1.9177806377410889</span>),
(<span style="color:#00d;font-weight:bold">121</span>, -<span style="color:#00d;font-weight:bold">1.3515384495258331</span>),
(<span style="color:#00d;font-weight:bold">36</span>, -<span style="color:#00d;font-weight:bold">1.119830161333084</span>),
(<span style="color:#00d;font-weight:bold">2</span>, -<span style="color:#00d;font-weight:bold">1.0263845324516296</span>),
(<span style="color:#00d;font-weight:bold">136</span>, -<span style="color:#00d;font-weight:bold">0.14549612998962402</span>),
(<span style="color:#00d;font-weight:bold">97</span>, <span style="color:#00d;font-weight:bold">0.02288222312927246</span>),
(<span style="color:#00d;font-weight:bold">138</span>, <span style="color:#00d;font-weight:bold">0.23310404270887375</span>),
(<span style="color:#00d;font-weight:bold">11</span>, <span style="color:#00d;font-weight:bold">0.34488800913095474</span>),
(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0.3801669552922249</span>),
(<span style="color:#00d;font-weight:bold">95</span>, <span style="color:#00d;font-weight:bold">0.42442888021469116</span>),
(<span style="color:#00d;font-weight:bold">5</span>, <span style="color:#00d;font-weight:bold">0.585017055273056</span>),
(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0.6578207015991211</span>),
(<span style="color:#00d;font-weight:bold">10</span>, <span style="color:#00d;font-weight:bold">1.0580871254205704</span>),
(<span style="color:#00d;font-weight:bold">148</span>, <span style="color:#00d;font-weight:bold">1.101222038269043</span>),
(<span style="color:#00d;font-weight:bold">85</span>, <span style="color:#00d;font-weight:bold">1.5351229906082153</span>),
(<span style="color:#00d;font-weight:bold">8</span>, <span style="color:#00d;font-weight:bold">1.8577364087104797</span>),
(<span style="color:#00d;font-weight:bold">129</span>, <span style="color:#00d;font-weight:bold">2.067573070526123</span>),
(<span style="color:#00d;font-weight:bold">84</span>, <span style="color:#00d;font-weight:bold">2.5856217741966248</span>),
(<span style="color:#00d;font-weight:bold">125</span>, <span style="color:#00d;font-weight:bold">2.927420735359192</span>),
(<span style="color:#00d;font-weight:bold">145</span>, <span style="color:#00d;font-weight:bold">3.010193407535553</span>),
(<span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">3.240116238594055</span>),
(<span style="color:#00d;font-weight:bold">112</span>, <span style="color:#00d;font-weight:bold">3.8082027435302734</span>),
(<span style="color:#00d;font-weight:bold">115</span>, <span style="color:#00d;font-weight:bold">3.8878047466278076</span>),
(<span style="color:#00d;font-weight:bold">147</span>, <span style="color:#00d;font-weight:bold">4.29826945066452</span>),
(<span style="color:#00d;font-weight:bold">58</span>, <span style="color:#00d;font-weight:bold">5.724080801010132</span>),
(<span style="color:#00d;font-weight:bold">144</span>, <span style="color:#00d;font-weight:bold">6.969168186187744</span>),
(<span style="color:#00d;font-weight:bold">130</span>, <span style="color:#00d;font-weight:bold">7.328435778617859</span>),
(<span style="color:#00d;font-weight:bold">146</span>, <span style="color:#00d;font-weight:bold">8.421227931976318</span>),
(<span style="color:#00d;font-weight:bold">149</span>, <span style="color:#00d;font-weight:bold">8.71802568435669</span>),
(<span style="color:#00d;font-weight:bold">27</span>, <span style="color:#00d;font-weight:bold">10.014463663101196</span>),
(<span style="color:#00d;font-weight:bold">143</span>, <span style="color:#00d;font-weight:bold">10.086603164672852</span>),
(<span style="color:#00d;font-weight:bold">113</span>, <span style="color:#00d;font-weight:bold">11.049185991287231</span>),
(<span style="color:#00d;font-weight:bold">66</span>, <span style="color:#00d;font-weight:bold">11.210532188415527</span>),
(<span style="color:#00d;font-weight:bold">139</span>, <span style="color:#00d;font-weight:bold">11.213960647583008</span>),
(<span style="color:#00d;font-weight:bold">142</span>, <span style="color:#00d;font-weight:bold">11.479517221450806</span>),
(<span style="color:#00d;font-weight:bold">128</span>, <span style="color:#00d;font-weight:bold">11.862180233001709</span>),
(<span style="color:#00d;font-weight:bold">141</span>, <span style="color:#00d;font-weight:bold">12.742302417755127</span>),
(<span style="color:#00d;font-weight:bold">54</span>, <span style="color:#00d;font-weight:bold">13.011351823806763</span>),
(<span style="color:#00d;font-weight:bold">55</span>, <span style="color:#00d;font-weight:bold">16.884247064590454</span>),
(<span style="color:#00d;font-weight:bold">37</span>, <span style="color:#00d;font-weight:bold">18.53071689605713</span>),
(<span style="color:#00d;font-weight:bold">87</span>, <span style="color:#00d;font-weight:bold">23.8028883934021</span>)]
</code></pre></div><p>The above output lists joke ids along with the rating the model predicts user 1 would give them. Some values fall outside of the <code><-10, 10></code> range, which is fine: we can simply clamp them, treating anything below -10 as -10 and anything above 10 as 10.</p>
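<p>As a quick sketch of that clamping step (plain Python, independent of the model code above; the <code>clamp_rating</code> name is ours):</p>

```python
# Clamp a raw model score to the Jester rating scale of [-10, 10].
def clamp_rating(score, low=-10.0, high=10.0):
    return max(low, min(high, score))

print(clamp_rating(-25.35))  # -10.0
print(clamp_rating(23.80))   # 10.0
print(clamp_rating(3.24))    # 3.24
```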
<p>Immediately we can see that with this recommender model we could recommend the jokes: <code>146, 149, 27, 143, 113, 66, 139, 142, 128, 141, 54, 55, 37, 87</code>.</p>
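<p>The same selection can be expressed as a small helper. This is a sketch working on plain <code>(joke_index, predicted_rating)</code> pairs like the ones printed above; the <code>top_recommendations</code> name is ours, not part of the model code:</p>

```python
# Return the n joke indices with the highest predicted ratings.
def top_recommendations(predictions, n=5):
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    return [ix for ix, _ in ranked[:n]]

# A few of the (index, prediction) pairs from the output above:
preds = [(87, 23.80), (37, 18.53), (55, 16.88), (54, 13.01), (2, -1.03)]
print(top_recommendations(preds, n=3))  # [87, 37, 55]
```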
<p>To have a little bit more fun, let’s write some code to read the actual text of the jokes. I took the following class from <a href="https://stackoverflow.com/questions/11061058/using-htmlparser-in-python-3-2">StackOverflow</a>; we’ll use it to strip HTML tags from the jokes file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">from</span> <span style="color:#b06;font-weight:bold">html.parser</span> <span style="color:#080;font-weight:bold">import</span> HTMLParser
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">MLStripper</span>(HTMLParser):
<span style="color:#080;font-weight:bold">def</span> __init__(self):
self.reset()
self.strict = <span style="color:#080;font-weight:bold">False</span>
self.convert_charrefs= <span style="color:#080;font-weight:bold">True</span>
self.fed = []
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">handle_data</span>(self, d):
self.fed.append(d)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_data</span>(self):
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">''</span>.join(self.fed)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">strip_tags</span>(html):
s = MLStripper()
s.feed(html)
<span style="color:#080;font-weight:bold">return</span> s.get_data()
</code></pre></div><p>Here’s the function that reads the file and uses the HTML tags stripping class:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_jokes</span>():
jokes = []
joke = <span style="color:#d20;background-color:#fff0f0">''</span>
pattern = re.compile(<span style="color:#d20;background-color:#fff0f0">'^</span><span style="color:#04d;background-color:#fff0f0">\\</span><span style="color:#d20;background-color:#fff0f0">d+:$'</span>)
<span style="color:#080;font-weight:bold">with</span> <span style="color:#038">open</span>(<span style="color:#d20;background-color:#fff0f0">"data/jester_items.dat"</span>, <span style="color:#d20;background-color:#fff0f0">"r"</span>) <span style="color:#080;font-weight:bold">as</span> file:
<span style="color:#080;font-weight:bold">for</span> line <span style="color:#080">in</span> file:
<span style="color:#080;font-weight:bold">if</span> pattern.match(line):
joke = <span style="color:#d20;background-color:#fff0f0">''</span>
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">if</span> line.strip() == <span style="color:#d20;background-color:#fff0f0">''</span>:
jokes.append(strip_tags(joke).strip())
<span style="color:#080;font-weight:bold">else</span>:
joke += line
<span style="color:#080;font-weight:bold">return</span> jokes
</code></pre></div><p>Let’s now read them from disk and see an example joke our system would recommend to the first user:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">> jokes = get_jokes()
> jokes[<span style="color:#00d;font-weight:bold">87</span>]
<span style="color:#d20;background-color:#fff0f0">'A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist.</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">The doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY...</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"Can you read this?" the doctor asked.</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"Read it?" the Czech answered. "Doc, I know him!"'</span>
</code></pre></div><h3 id="using-the-item-feature-vectors-to-find-similarities">Using the item feature vectors to find similarities</h3>
<p>One cool thing we can do with the latent vectors is measure how similar they are in terms of appealing to the same users. To do that we can use the so-called <strong>cosine similarity</strong>. The subject is very clearly described by Christian S. Perone <a href="http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/">in his blog post</a>.</p>
<p>It makes use of the angle between the two vectors and returns its cosine. Notice that it only cares about the angle between the vectors, and <strong>not</strong> their magnitudes. The codomain of the cosine function is <code><-1, 1></code>, and the same holds for the <em>cosine similarity</em>. It maps onto our sense of similarity quite naturally: <code>-1</code> means “the total opposite” and <code>1</code> means “exactly the same”.</p>
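<p>In symbols, for two feature vectors a and b:</p>
<p>$$
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$</p>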
<p>We can trivially implement the function as the dot product of the two vectors divided by the product of their norms:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">cos_similarity</span>(vec1, vec2):
<span style="color:#080;font-weight:bold">return</span> mx.nd.dot(vec1, vec2) / (F.norm(vec1) * F.norm(vec2))
</code></pre></div><p>We can use the new measurement to rank the jokes in terms of how close they are. Here’s a function that takes a joke ID and returns a list of IDs along with their similarity scores:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_scores</span>(joke_id):
scores = []
joke = joke_embed[joke_id]
<span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">150</span>):
scores.append((ix, cos_similarity(joke, joke_embed[ix]).asnumpy()[<span style="color:#00d;font-weight:bold">0</span>]))
<span style="color:#080;font-weight:bold">return</span> scores
</code></pre></div><p>The following function takes a joke ID, finds the most similar jokes (skipping the joke itself, which is always its own closest match), and prints the top three in a summary:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">print_joke_stats</span>(ix):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">by_second</span>(t):
<span style="color:#080;font-weight:bold">if</span> t[<span style="color:#00d;font-weight:bold">1</span>] <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#080;font-weight:bold">return</span> -<span style="color:#00d;font-weight:bold">2</span>
<span style="color:#080;font-weight:bold">else</span>:
<span style="color:#080;font-weight:bold">return</span> t[<span style="color:#00d;font-weight:bold">1</span>]
similar = get_scores(ix)
similar.sort(key=by_second)
similar.reverse()
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">f</span><span style="color:#d20;background-color:#fff0f0">'Jokes making same people laugh compared to:</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">=== </span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#33b;background-color:#fff0f0">{</span>jokes[ix]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#d20;background-color:#fff0f0">===:</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">'</span>)
<span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>):
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">f</span><span style="color:#d20;background-color:#fff0f0">'---</span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#33b;background-color:#fff0f0">{</span>jokes[similar[ix][<span style="color:#00d;font-weight:bold">0</span>]]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#d20;background-color:#fff0f0">---</span><span style="color:#04d;background-color:#fff0f0">\n</span><span style="color:#d20;background-color:#fff0f0">'</span>)
</code></pre></div><p>Let’s see which jokes our system found to crack up the same kinds of people:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">> print_joke_stats(87)
Jokes making same people laugh compared to:
===
A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist.
The doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY...
"Can you read this?" the doctor asked.
"Read it?" the Czech answered. "Doc, I know him!"
===:
---
A woman has twins, and gives them up for adoption. One of them goes to a family in Egypt and is named "Amal." The other goes to a family in Spain; they name him "Juan." Years later, Juan sends a picture of himself to his mom. Upon receiving the picture, she tells her husband that she wishes she also had a picture of Amal.
Her husband responds, "But they are twins--if you've seen Juan, you've seen Amal."
---
---
An explorer in the deepest Amazon suddenly finds himself surrounded by a bloodthirsty group of natives. Upon surveying the situation, he says quietly to himself, "Oh God, I'm screwed."
The sky darkens and a voice booms out, "No, you are NOT screwed. Pick up that stone at your feet and bash in the head of the chief standing in front of you."
So with the stone he bashes the life out of the chief. He stands above the lifeless body, breathing heavily and looking at 100 angry natives...
The voice booms out again, "Okay....NOW you're screwed."
---
---
A man is driving in the country one evening when his car stalls and won't start. He goes up to a nearby farm house for help, and because it is suppertime he is asked to stay for supper. When he sits down at the table he notices that a pig is sitting at the table with them for supper and that the pig has a wooden leg.
As they are eating and chatting, he eventually asks the farmer why the pig is there and why it has a wooden leg.
"Oh," says the farmer, "that is a very special pig. Last month my wife and daughter were in the barn when it caught fire. The pig saw this, ran to the barn, tipped over a pail of water, crawled over the wet floor to reach them and pulled them out of the barn safely. A special pig like that, you just don't eat it all at once!"
---
</code></pre></div><h3 id="final-words">Final words</h3>
<p>The approach presented here is relatively simple, yet people have found it surprisingly accurate. It does depend, though, on having enough data for each item; otherwise the accuracy degrades. The extreme case of not having enough data is known as the <a href="https://en.wikipedia.org/wiki/Cold_start_(computing)">cold start</a> problem.</p>
<p>Also, accuracy is not the only goal. <a href="https://en.wikipedia.org/wiki/Recommender_system">Wikipedia</a> lists “serendipity”, among other features, as an important factor in a successful system:</p>
<blockquote>
<p>Serendipity is a measure of “how surprising the recommendations are”. For instance, a recommender system that recommends milk to a customer in a grocery store might be perfectly accurate, but it is not a good recommendation because it is an obvious item for the customer to buy. However, high scores of serendipity may have a negative impact on accuracy.</p>
</blockquote>
<p>Researchers have been working on different approaches to tackling the above-mentioned issues. Netflix is known to use a “hybrid” approach, one that combines a content-based and a collaborative recommender. As per <a href="https://en.wikipedia.org/wiki/Recommender_system">Wikipedia</a>:</p>
<blockquote>
<p>Netflix is a good example of the use of hybrid recommender systems. The website makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).</p>
</blockquote>
<h2><a href="https://www.endpointdev.com/blog/2018/07/training-tesseract-models-from-scratch/">Training Tesseract 4 models from real images</a></h2>
<p>2018-07-09 · Kamil Ciemniewski</p>
<p><img src="/blog/2018/07/training-tesseract-models-from-scratch/15403939701_6e85c63a08_o-sm.jpg" alt="table of ancient alphabets" /></p>
<p>Over the years, Tesseract has been one of the most popular open source optical character recognition (OCR) solutions. It provides ready-to-use models for recognizing text in many languages. Currently there are 124 models that are available to be downloaded and used.</p>
<p>Not too long ago, the project moved in the direction of using more modern machine-learning approaches and is now using <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">artificial neural networks</a>.</p>
<p>For some people, this move caused a lot of confusion when they wanted to train their own models. This blog post tries to explain the process of turning scans of images with textual ground-truth data into models that are ready to be used.</p>
<h3 id="tesseract-pre-trained-models">Tesseract pre-trained models</h3>
<p>You can download the <a href="https://github.com/tesseract-ocr/tessdata_fast">pre-created ones designed to be fast and consume less memory</a>, as well as the <a href="https://github.com/tesseract-ocr/tessdata_best">ones requiring more resources but giving better accuracy</a>.</p>
<p>The pre-trained models were created from images with text artificially rendered from a huge corpus of text taken from the web, using a wide variety of fonts. The project’s wiki states that:</p>
<blockquote>
<p>For Latin-based languages, the existing model data provided has been trained on about <a href="https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951">400000 textlines spanning about 4500 fonts</a>. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines.</p>
</blockquote>
<h3 id="training-a-new-model-from-scratch">Training a new model from scratch</h3>
<p>Before diving in, there are a couple of broader aspects you need to know:</p>
<ul>
<li>The latest Tesseract uses <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">artificial neural networks</a> based models (they differ totally from the older approach)</li>
<li>You might want to get familiar with how neural networks work, how their different types of layers are used, and what you can expect of them</li>
<li>It’s definitely a bonus (though not mandatory) to read about “Connectionist Temporal Classification”, explained brilliantly at <a href="https://distill.pub/2017/ctc/">Sequence Modeling with CTC</a></li>
</ul>
<h3 id="compiling-the-training-tools">Compiling the training tools</h3>
<p>This blog post talks specifically about the latest version 4 of Tesseract. Please make sure that you have that installed and not some older version 3 release.</p>
<p>To continue with the training, you’ll also need the training tools. The project’s wiki already <a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#building-the-training-tools">explains the process</a> of getting them well enough.</p>
<h3 id="preparing-the-training-data">Preparing the training data</h3>
<p>Training datasets consist of <code>*.tif</code> files and accompanying <code>*.box</code> files. While the image files are easy to prepare, the box files seem to be a source of confusion.</p>
<p>For some images you’ll want to <strong>ensure that there’s at least 10px of free space between the border and the text</strong>. Adding it to all of the images will not hurt and will only ensure that you won’t see odd-looking warning messages during the training.</p>
<p>The first rule is that you’ll have one box file per image. You need to give them the same prefixes, e.g. <code>image1.tif</code> and <code>image1.box</code>. The box files describe the characters used as well as their spatial location within the image.</p>
<p>Each line describes one character as follows:</p>
<p><code><symbol> <left> <bottom> <right> <top> <page></code></p>
<p>Where:</p>
<ul>
<li><code><symbol></code> is the character e.g. <code>a</code> or <code>b</code>.</li>
<li><code><left> <bottom> <right> <top></code> are the coordinates of the rectangle that fits the character on the page. Note that the coordinates system used by Tesseract has <code>(0,0)</code> in the bottom-left corner of the image!</li>
<li><code><page></code> is only relevant if you’re using multi-page TIFF files. In all other cases just put <code>0</code> in here.</li>
</ul>
<p>The order of characters is extremely important here. They <strong>should be sorted strictly in the visual order, going from left to right</strong>. Tesseract does the Unicode bidi-re-ordering internally on its own.</p>
<p>Each word should be separated by a line with a space as the <code><symbol></code>. It works best for me to use a small <code>1x1</code> rectangle that directly follows the previous character as its bounding box.</p>
<p>If your image contains more than one line, the line ending should be marked with a line where <code><symbol></code> is a tab.</p>
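<p>Putting these rules together, here’s what a hypothetical box file for a one-line image containing the words “ab cd” could look like (the coordinates are made up purely for illustration; the third line starts with a literal space character and uses a small <code>1x1</code> rectangle):</p>
<pre tabindex="0"><code>a 12 30 24 50 0
b 25 30 38 52 0
  39 30 40 31 0
c 45 30 58 50 0
d 59 30 70 52 0
</code></pre>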
<h4 id="generating-the-unicharset-file">Generating the <code>unicharset</code> file</h4>
<p>If you’ve gone through the neural networks reading, you’ll quickly understand that if the model is to be fast, it needs to be given a constrained list of characters you want it to recognize. Trying to make it choose from the whole Unicode set would be computationally infeasible. This is what the so-called <code>unicharset</code> file is for. It defines the set of graphemes along with info about their basic properties.</p>
<p>Tesseract does come with its own utility for compiling that file, but I’ve found it very buggy; at least that’s how it behaved the last time I tried it, in June 2018. I came up with my own script in Ruby which compiles a very basic version of that file and is more than enough:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">"rubygems"</span>
<span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">"unicode/scripts"</span>
<span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">"unicode/categories"</span>
bool_to_si = -> (b) {
b ? <span style="color:#d20;background-color:#fff0f0">"1"</span> : <span style="color:#d20;background-color:#fff0f0">"0"</span>
}
is_digit = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"Nd"</span>, <span style="color:#d20;background-color:#fff0f0">"No"</span>, <span style="color:#d20;background-color:#fff0f0">"Nl"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
is_letter = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"LC"</span>, <span style="color:#d20;background-color:#fff0f0">"Ll"</span>, <span style="color:#d20;background-color:#fff0f0">"Lm"</span>, <span style="color:#d20;background-color:#fff0f0">"Lo"</span>, <span style="color:#d20;background-color:#fff0f0">"Lt"</span>, <span style="color:#d20;background-color:#fff0f0">"Lu"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
is_alpha = -> (props) {
is_letter.call(props)
}
is_lower = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"Ll"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
is_upper = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"Lu"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
is_punct = -> (props) {
(props & [<span style="color:#d20;background-color:#fff0f0">"Pc"</span>, <span style="color:#d20;background-color:#fff0f0">"Pd"</span>, <span style="color:#d20;background-color:#fff0f0">"Pe"</span>, <span style="color:#d20;background-color:#fff0f0">"Pf"</span>, <span style="color:#d20;background-color:#fff0f0">"Pi"</span>, <span style="color:#d20;background-color:#fff0f0">"Po"</span>, <span style="color:#d20;background-color:#fff0f0">"Ps"</span>]).count > <span style="color:#00d;font-weight:bold">0</span>
}
<span style="color:#080;font-weight:bold">if</span> <span style="color:#036;font-weight:bold">ARGV</span>.length < <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#d70">$stderr</span>.puts <span style="color:#d20;background-color:#fff0f0">"Usage: ruby ./extract_unicharset.rb path/to/all-boxes"</span>
<span style="color:#038">exit</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> !<span style="color:#036;font-weight:bold">File</span>.exist?(<span style="color:#036;font-weight:bold">ARGV</span>[<span style="color:#00d;font-weight:bold">0</span>])
<span style="color:#d70">$stderr</span>.puts <span style="color:#d20;background-color:#fff0f0">"The all-boxes file </span><span style="color:#33b;background-color:#fff0f0">#{</span><span style="color:#036;font-weight:bold">ARGV</span>[<span style="color:#00d;font-weight:bold">0</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> doesn't exist"</span>
<span style="color:#038">exit</span>
<span style="color:#080;font-weight:bold">end</span>
uniqs = <span style="color:#036;font-weight:bold">IO</span>.readlines(<span style="color:#036;font-weight:bold">ARGV</span>[<span style="color:#00d;font-weight:bold">0</span>]).map { |line| line[<span style="color:#00d;font-weight:bold">0</span>] }.uniq.sort
outs = uniqs.each_with_index.map <span style="color:#080;font-weight:bold">do</span> |char, ix|
script = <span style="color:#036;font-weight:bold">Unicode</span>::<span style="color:#036;font-weight:bold">Scripts</span>.scripts(char).first
props = <span style="color:#036;font-weight:bold">Unicode</span>::<span style="color:#036;font-weight:bold">Categories</span>.categories(char)
isalpha = is_alpha.call(props)
islower = is_lower.call(props)
isupper = is_upper.call(props)
isdigit = is_digit.call(props)
ispunctuation = is_punct.call(props)
props = [ isalpha, islower, isupper, isdigit, ispunctuation].reverse.inject(<span style="color:#d20;background-color:#fff0f0">""</span>) <span style="color:#080;font-weight:bold">do</span> |state, is|
<span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#33b;background-color:#fff0f0">#{</span>state<span style="color:#33b;background-color:#fff0f0">}#{</span>bool_to_si.call(is)<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#33b;background-color:#fff0f0">#{</span>char<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> </span><span style="color:#33b;background-color:#fff0f0">#{</span>props.to_i(<span style="color:#00d;font-weight:bold">2</span>)<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> </span><span style="color:#33b;background-color:#fff0f0">#{</span>script<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> </span><span style="color:#33b;background-color:#fff0f0">#{</span>ix + <span style="color:#00d;font-weight:bold">1</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#038">puts</span> outs.count + <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"NULL 0 Common 0"</span>
outs.each { |o| <span style="color:#038">puts</span> o }
</code></pre></div><p>You’ll need to install the <code>unicode-scripts</code> and <code>unicode-categories</code> gems first. The usage is as it stands in the source code:</p>
<pre tabindex="0"><code>ruby extract_unicharset.rb path/to/all-boxes > path/to/unicharset
</code></pre><p>Where do we get the <code>all-boxes</code> file from? The script only cares about the unique set of characters from the box files. The following gist of shell-work will provide you with all you need:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">cat path/to/dataset/*.box > path/to/all-boxes
ruby extract_unicharset.rb path/to/all-boxes > path/to/unicharset
</code></pre></div><p>Notice that the last command should create a <code>path/to/unicharset</code> text file for you.</p>
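<p>For illustration, for box files that contained only the characters <code>1</code>, <code>a</code>, and <code>b</code> (leaving aside the space and tab marker lines), the script above would emit something along these lines:</p>
<pre tabindex="0"><code>4
NULL 0 Common 0
1 8 Common 1
a 3 Latin 2
b 3 Latin 3
</code></pre>
<p>Each line carries the grapheme, a bitmask of its basic properties (alpha, lower, upper, digit, punctuation), its Unicode script, and an identifier.</p>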
<h4 id="combining-images-with-box-files-into-lstmf-files">Combining images with box files into <code>*.lstmf</code> files</h4>
<p>The image and box files aren’t fed directly into the trainer. Instead, Tesseract works with special <code>*.lstmf</code> files which combine the image, boxes, and text for each pair of <code>*.tif</code> and <code>*.box</code> files.</p>
<p>In order to generate those <code>*.lstmf</code> files you’ll need to run the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#038">cd</span> path/to/dataset
<span style="color:#080;font-weight:bold">for</span> file in *.tif; <span style="color:#080;font-weight:bold">do</span>
<span style="color:#038">echo</span> <span style="color:#369">$file</span>
<span style="color:#369">base</span>=<span style="color:#d20;background-color:#fff0f0">`</span>basename <span style="color:#369">$file</span> .tif<span style="color:#d20;background-color:#fff0f0">`</span>
tesseract <span style="color:#369">$file</span> <span style="color:#369">$base</span> lstm.train
<span style="color:#080;font-weight:bold">done</span>
</code></pre></div><p>After the above is done, you should be able to find the accompanying <code>*.lstmf</code> files. Make sure that you have Tesseract with <code>langdata</code> and <code>tessdata</code> properly installed. If you keep your <code>tessdata</code> folder in a nonstandard location, you might need to either export or set inline the following shell variable:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#888"># exporting so that it’s available for all following commands:</span>
<span style="color:#038">export</span> <span style="color:#369">TESSDATA_PREFIX</span>=path/to/your/tessdata
<span style="color:#888"># or run it inline:</span>
<span style="color:#038">cd</span> path/to/dataset
<span style="color:#080;font-weight:bold">for</span> file in *.tif; <span style="color:#080;font-weight:bold">do</span>
<span style="color:#038">echo</span> <span style="color:#369">$file</span>
<span style="color:#369">base</span>=<span style="color:#d20;background-color:#fff0f0">`</span>basename <span style="color:#369">$file</span> .tif<span style="color:#d20;background-color:#fff0f0">`</span>
<span style="color:#369">TESSDATA_PREFIX</span>=path/to/your/tessdata tesseract <span style="color:#369">$file</span> <span style="color:#369">$base</span> lstm.train
<span style="color:#080;font-weight:bold">done</span>
</code></pre></div><p>We’ll need to generate the <code>all-lstmf</code> file containing paths to all those files that we will use later:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">ls -1 *.lstmf | sort -R > all-lstmf
</code></pre></div><p>Notice the use of <code>sort -R</code>, which shuffles the list into a random order; randomizing the order of examples is a good practice when preparing training data in many cases.</p>
<h4 id="generating-the-training-and-evaluation-files-lists">Generating the training and evaluation files lists</h4>
<p>Next, we want to create the <code>list.train</code> and <code>list.eval</code> files. Their purpose is to contain the paths to <code>*.lstmf</code> files that Tesseract is going to use during the training and during the evaluation. Training and evaluation are interleaved. The former adjusts the neural network learnable parameters to minimize the so-called loss. The evaluation here is strictly to enhance the user experience: it prints out accuracy metrics periodically, letting you know how much the model has learned so far. Their values are averaged out. You can expect to see two metrics being shown: <code>char error</code> and <code>word error</code>: both are going to be close to 100% in the beginning but with all going well, you should see them dropping even to below 1%.</p>
<p>The evaluation set is often called the “holdout set”. How many examples should it contain? That depends. If you have a big enough set, something around 10% of all of the examples should be more than enough. You might also not care about the training-time evaluation and set it to something very small. You’d then do your own evaluation after the network’s loss converges to something small (by small we mean something close to 0.1 or less).</p>
<p>Assuming that you want the evaluation set to contain 1000 examples, here’s how you can generate the <code>list.train</code> and <code>list.eval</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">head -n <span style="color:#00d;font-weight:bold">1000</span> path/to/all-lstmf > list.eval
tail -n +1001 path/to/all-lstmf > list.train
</code></pre></div><p>If you’d like to express it in terms of fractions of all of the examples:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#369">count_all</span>=<span style="color:#d20;background-color:#fff0f0">`</span>wc -l < path/to/all-lstmf<span style="color:#d20;background-color:#fff0f0">`</span>
<span style="color:#369">holdout_count</span>=<span style="color:#080;font-weight:bold">$(</span>bc <<< <span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#369">$count_all</span><span style="color:#d20;background-color:#fff0f0"> * 0.1 / 1"</span><span style="color:#080;font-weight:bold">)</span>
head -n <span style="color:#369">$holdout_count</span> path/to/all-lstmf > list.eval
tail -n +<span style="color:#080;font-weight:bold">$((</span>holdout_count + <span style="color:#00d;font-weight:bold">1</span><span style="color:#080;font-weight:bold">))</span> path/to/all-lstmf > list.train
</code></pre></div><p>The above shell code assigns around 10% of the examples to the holdout set.</p>
<h4 id="compiling-the-initial-traineddata-file">Compiling the initial <code>*.traineddata</code> file</h4>
<p>There’s one last piece that we’ll need to generate before we’re able to start the training process: the <code>yourmodel.traineddata</code>. This file is going to contain the initial info needed for the trainer to perform the training:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#888"># --lang_is_rtl: pass it only if you work with a right-to-left language</span>
<span style="color:#888"># --pass_through_recoder: I found it working better with this option</span>
combine_lang_model <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --input_unicharset path/to/unicharset <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --script_dir path/to/your/tessdata <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --output_dir path/to/output <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --lang_is_rtl <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --pass_through_recoder <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --lang yourmodelname
</code></pre></div><p>The above should create a bunch of files in the specified output directory.</p>
<h4 id="starting-the-actual-training-process">Starting the actual training process</h4>
<p>To start the training process you’ll need to execute the <code>lstmtraining</code> app. It accepts the arguments that are described below.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#369">num_classes</span>=<span style="color:#d20;background-color:#fff0f0">`</span>head -n1 path/to/unicharset<span style="color:#d20;background-color:#fff0f0">`</span>
lstmtraining <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> path/to/traineddata-file <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --net_spec <span style="color:#d20;background-color:#fff0f0">"[1,40,0,1 Ct5,5,64 Mp3,3 Lfys128 Lbx256 Lbx256 O1c</span><span style="color:#369">$num_classes</span><span style="color:#d20;background-color:#fff0f0">]"</span> <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --model_output path/to/model/output <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --train_listfile path/to/list.train <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --eval_listfile path/to/list.eval
</code></pre></div><p>You’re giving it the compiled <code>*.traineddata</code> file and the train/eval file lists and it trains the new model for you. It will adjust the neural network parameters to make the error between its predictions and what is known as ground-truth smaller and smaller.</p>
<p>There’s one part that we haven’t talked about yet: the <code>--net_spec</code> argument and its accompanying value, given as a string.</p>
<p>The neural network “spec” is there because neural networks come in many different shapes and forms. The subject is beyond the scope of this article. If you don’t know anything yet but are curious, I encourage you to look for some good books. The process of learning about them is extremely rewarding if you’re into math and computer science.</p>
<p>The value for that argument I presented above should be more than enough for most of your needs. That’s unless you’d like to e.g. recognize vertical text, for which I recommend adjusting the spec greatly.</p>
<p>The format that the given string follows is called VGSL. You can find out more about it on the <a href="https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs">Tesseract Wiki</a>.</p>
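<p>As a rough guide (this is my reading of the VGSL notation, so double-check it against the wiki), the example spec used above decomposes like this:</p>
<pre tabindex="0"><code>[1,40,0,1 ...]  # input: batch of 1, height 40px, variable width (0), 1 channel
Ct5,5,64        # 5x5 convolution with tanh non-linearity and 64 outputs
Mp3,3           # 3x3 max-pooling
Lfys128         # forward LSTM scanning the y dimension, summarizing, 128 units
Lbx256          # bidirectional LSTM scanning the x dimension, 256 units (used twice)
O1c<n>          # 1-dimensional output layer with softmax over <n> character classes
</code></pre>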
<h4 id="finishing-the-training-and-compiling-the-resulting-model-file">Finishing the training and compiling the resulting model file</h4>
<p>If you’ve gotten excited by what we’ve done so far, I have to encourage your expectations to make friends with <strong>The Reality</strong>. The truth is that the training process can take days, depending on how fast your machine is and how many training examples you have. You may notice it taking even longer if your examples differ by a huge factor. That might be true if you’re feeding it examples that use significantly different fonts.</p>
<p>Once the training error rate is small enough and no longer seems to be decreasing, you may want to stop the training and compile the final model file.</p>
<p>During the training, the <code>lstmtraining</code> app will output checkpoint files every once in a while. They are there to make it possible to stop the training and resume it later (with the <code>--continue_from</code> argument). You create the final model files out of those checkpoint files with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">lstmtraining <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --traineddata path/to/traineddata-file <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --continue_from path/to/model/output/checkpoint <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --model_output path/to/final/output <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> --stop_training
</code></pre></div><p>And that’s it — you can now take the output file of that last command, place it inside your <code>tessdata</code> folder, and Tesseract will immediately be able to use it.</p>
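<p>To sanity-check the new model, you can point Tesseract at it, e.g. (assuming the final output file was named <code>yourmodelname.traineddata</code> and placed in your <code>tessdata</code> folder):</p>
<pre tabindex="0"><code>tesseract path/to/some-image.tif output-base -l yourmodelname
</code></pre>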
Recognizing handwritten digits: a quick peek into the basics of machine learninghttps://www.endpointdev.com/blog/2017/05/recognizing-handwritten-digits-quick/2017-05-30T00:00:00+00:00Kamil Ciemniewski
<p>Previous in series:</p>
<ul>
<li><a href="/blog/2016/03/learning-from-data-basics-naive-bayes/">Learning from data basics: the Naive Bayes model</a></li>
<li><a href="/blog/2016/04/learning-from-data-basics-ii-simple/">Learning from data basics II: simple Bayesian Networks</a></li>
</ul>
<p>In the previous two posts on machine learning, I presented a very basic introduction to an approach called “probabilistic graphical models”. In this post I’d like to take a tour of some different techniques while creating code that will recognize handwritten digits.</p>
<p>Handwritten digit recognition is an interesting topic that has been explored for many years. It is now considered one of the best ways to start a journey into the world of machine learning.</p>
<h3 id="taking-the-kaggle-challenge">Taking the Kaggle challenge</h3>
<p>We’ll take on the “digits recognition” challenge as presented on Kaggle, an online platform with challenges for data scientists. Most of the challenges offer prizes in real money to win. Some of them are there to help us on our journey of learning data science techniques, and the “digits recognition” contest is one of those.</p>
<h3 id="the-challenge">The challenge</h3>
<p>As explained on Kaggle:</p>
<blockquote>
<p>MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision.</p>
</blockquote>
<p>The “digits recognition” challenge is one of the best ways to get acquainted with machine learning and computer vision. The so-called “MNIST” dataset consists of 70k images of handwritten digits, each one grayscale and 28x28 pixels in size. The Kaggle challenge is about taking a subset of 42k of them along with labels (which actual number the image shows) and “training” the computer on that set. The next step is to take the remaining 28k images without labels and “predict” which number each of them represents.</p>
<p>Here’s a short overview of what the digits in the set really look like (along with the numbers they represent):</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="/blog/2017/05/recognizing-handwritten-digits-quick/image-0-big.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1201" data-original-width="1600" height="480" src="/blog/2017/05/recognizing-handwritten-digits-quick/image-0.png" width="640"/></a></div>
<p>I have to admit that for some of them I have a really hard time recognizing the actual numbers on my own :)</p>
<h3 id="the-general-approach-to-supervised-learning">The general approach to supervised learning</h3>
<p>Learning from labelled data is what is called “supervised learning”. It’s supervised because we’re taking the computer by hand through the whole training data set and “teaching” it what the data linked with different labels looks like.</p>
<p>In all such scenarios we can express the data and labels as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Y ~ X1, X2, X3, X4, ..., Xn
</code></pre></div><p>The Y is called a <strong>dependent variable</strong> while each Xn are <strong>independent variables</strong>. This formula holds both for classification problems as well as regressions.</p>
<p>Classification is when the dependent variable Y is so-called <em>categorical</em>—taking values from a concrete set without a meaningful order. Regression is when the Y is not categorical—most often continuous.</p>
<p>In the digits recognition challenge we’re faced with the classification task. The dependent variable takes values from the set:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Y = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
</code></pre></div><p>I’m sure the question you might be asking yourself now is: what are the independent variables Xn? It turns out to be the crux of the whole problem to solve :)</p>
<h3 id="the-plan-of-attack">The plan of attack</h3>
<p>A good introduction to computer vision techniques is the book by J. R. Parker, “Algorithms for Image Processing and Computer Vision”. I encourage the reader to buy that book. I took some ideas from it while having fun with my own solution to the challenge.</p>
<p>The book outlines the ideas revolving around computing image profiles—for each side. For each row of pixels, a number representing the distance of the first pixel from the edge is computed. This way we’re getting our first independent variables. To capture even more information about digit shapes, we’ll also capture the differences between consecutive row values as well as their global maxima and minima. We’ll also compute the width of the shape for each row.</p>
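<p>To make the profile features concrete, here’s a minimal sketch in Ruby (my own illustration with hypothetical helper names, not code from the book): it treats a binarized image as rows of 0/1 pixels and computes the left profile, per-row shape widths, and the consecutive differences of the profile:</p>

```ruby
# Per-row left profile and shape width of a binarized image.
# Rows with no "on" pixels get the full image width as their profile value.
def row_profiles(image)
  image.map do |row|
    left  = row.index(1)    # distance of the first "on" pixel from the left edge
    right = row.rindex(1)
    width = left.nil? ? 0 : right - left + 1
    { left: left || row.length, width: width }
  end
end

# Differences between consecutive left-profile values capture how the
# outline of the digit changes from row to row.
def profile_diffs(profiles)
  profiles.each_cons(2).map { |a, b| b[:left] - a[:left] }
end

rows = [[0, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 0]]
profs = row_profiles(rows)
diffs = profile_diffs(profs)
```

<p>The global maxima and minima mentioned above are then simply <code>profs.map { |p| p[:left] }.minmax</code>; the right profile is the mirror image of the same computation.</p>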
<p>Because the handwritten digits vary greatly in their thickness, we will first preprocess the images to detect the so-called skeletons of the digits. The skeleton is an image representation in which the thickness of the shape has been reduced to just one pixel.</p>
<p>Having the image thinned will also allow us to capture some more info about the shapes. We will write an algorithm that walks the skeleton and records the direction change frequencies.</p>
<p>Once we have our set of independent variables Xn, we’ll use a classification algorithm to first learn in a supervised way (using the provided labels) and then to predict the values for the test data set. Lastly we’ll submit our predictions to Kaggle and see how well we did.</p>
<h3 id="having-fun-with-languages">Having fun with languages</h3>
<p>In the data science world, the lingua franca still remains the R programming language. In recent years Python has also come close in popularity, and nowadays we can say the duo of R and Python rules the data science world (not counting high-performance code written e.g. in C++ in production systems).</p>
<p>Lately a new language designed with data scientists in mind has emerged: Julia. It has characteristics of both dynamically typed scripting languages and statically typed compiled ones. It compiles its code into efficient native binaries via LLVM, but in a JIT fashion, inferring types as needed on the go.</p>
<p>While having fun with the Kaggle challenge I’ll use Julia and Python for the so-called <strong>feature extraction</strong> phase (the one in which we compute information about our Xn variables). I’ll then turn to R for the classification itself. Note that I could use any of these languages at each step and get very similar results. The purpose of this series of articles is to be a fun bird’s-eye overview, so I decided this way would be much more interesting.</p>
<h3 id="feature-extraction">Feature Extraction</h3>
<p>The end result of this phase is a data frame saved as a CSV file, so that we can load it in R and do the classification.</p>
<p>First let’s define the general function in Julia that takes the name of the input CSV file and returns a data frame with features of given images extracted into columns:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">using <span style="color:#036;font-weight:bold">DataFrames</span>
function get_data(<span style="color:#038">name</span> :: <span style="color:#038">String</span>, include_label = <span style="color:#080">true</span>)
println(<span style="color:#d20;background-color:#fff0f0">"Loading CSV file into a data frame..."</span>)
table = readtable(string(<span style="color:#038">name</span>, <span style="color:#d20;background-color:#fff0f0">".csv"</span>))
extract(table, include_label)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Now the extract function looks like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0">Extracts the features from the dataframe. Puts them into
</span><span style="color:#d20;background-color:#fff0f0">separate columns and removes all other columns except the
</span><span style="color:#d20;background-color:#fff0f0">labels.
</span><span style="color:#d20;background-color:#fff0f0">
</span><span style="color:#d20;background-color:#fff0f0">The features:
</span><span style="color:#d20;background-color:#fff0f0">
</span><span style="color:#d20;background-color:#fff0f0">* Left and right profiles (after fitting into the same sized rect):
</span><span style="color:#d20;background-color:#fff0f0"> * Min
</span><span style="color:#d20;background-color:#fff0f0"> * Max
</span><span style="color:#d20;background-color:#fff0f0"> * Width[y]
</span><span style="color:#d20;background-color:#fff0f0"> * Diff[y]
</span><span style="color:#d20;background-color:#fff0f0">* Paths:
</span><span style="color:#d20;background-color:#fff0f0"> * Frequencies of movement directions
</span><span style="color:#d20;background-color:#fff0f0"> * Simplified directions:
</span><span style="color:#d20;background-color:#fff0f0"> * Frequencies of 3 element simplified paths
</span><span style="color:#d20;background-color:#fff0f0">"""</span>
function extract(frame :: <span style="color:#036;font-weight:bold">DataFrame</span>, include_label = <span style="color:#080">true</span>)
println(<span style="color:#d20;background-color:#fff0f0">"Reshaping data..."</span>)
function to_image(flat :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}) :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
dim = <span style="color:#036;font-weight:bold">Base</span>.isqrt(length(flat))
reshape(flat, (dim, dim))<span style="color:#a61717;background-color:#e3d2d2">'</span>
<span style="color:#080;font-weight:bold">end</span>
from = include_label ? <span style="color:#00d;font-weight:bold">2</span> : <span style="color:#00d;font-weight:bold">1</span>
frame[<span style="color:#a60;background-color:#fff0f0">:pixels</span>] = map((i) -> convert(<span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, frame[i, <span style="color:#a60;background-color:#fff0f0">from</span>:<span style="color:#080;font-weight:bold">end</span>]) |> to_image, <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:size</span>(frame, <span style="color:#00d;font-weight:bold">1</span>))
images = frame[:, <span style="color:#a60;background-color:#fff0f0">:pixels</span>] ./ <span style="color:#00d;font-weight:bold">255</span>
data = <span style="color:#038">Array</span>{<span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}}(length(images))
<span style="color:#33b">@showprogress</span> <span style="color:#00d;font-weight:bold">1</span> <span style="color:#d20;background-color:#fff0f0">"Computing features..."</span> <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:length</span>(images)
features = pixels_to_features(images[i])
data[i] = features_to_row(features)
<span style="color:#080;font-weight:bold">end</span>
start_column = include_label ? [<span style="color:#a60;background-color:#fff0f0">:label</span>] : []
columns = vcat(start_column, features_columns(images[<span style="color:#00d;font-weight:bold">1</span>]))
result = <span style="color:#036;font-weight:bold">DataFrame</span>()
<span style="color:#080;font-weight:bold">for</span> c <span style="color:#080;font-weight:bold">in</span> columns
result[c] = []
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:length</span>(data)
<span style="color:#080;font-weight:bold">if</span> include_label
push!(result, vcat(frame[i, <span style="color:#a60;background-color:#fff0f0">:label</span>], data[i]))
<span style="color:#080;font-weight:bold">else</span>
push!(result, vcat([], data[i]))
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
result
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>A few nice things to notice here about Julia itself are:</p>
<ul>
<li>The function documentation is written in Markdown</li>
<li>We can nest functions inside other functions</li>
<li>The language is strongly typed, with optional type annotations</li>
<li>Types can be inferred from the context</li>
<li>It is often desirable to provide concrete types to improve performance (but that is a more advanced Julia topic)</li>
<li>Arrays are indexed from 1</li>
<li>There’s the nice |> pipe operator found e.g. in Elixir (which I absolutely love)</li>
</ul>
<p>The above code converts the images to arrays of Float64 and scales the values to fall between 0 and 1 (instead of the original 0..255).</p>
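<p>That scaling is a single vectorized division. For readers following along in Python, the NumPy equivalent looks like this (a sketch with made-up pixel values):</p>

```python
import numpy as np

# Pixel intensities 0..255 for two tiny flattened "images".
images = np.array([[0, 51, 102, 255],
                   [255, 204, 0, 51]], dtype=np.float64)

# Element-wise division, same idea as Julia's `images ./ 255`.
normalized = images / 255
print(normalized.min(), normalized.max())
```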
<p>A thing to notice is that in Julia we can vectorize operations easily, and we’re using this fact to tersely convert our numbers:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">images = frame[:, <span style="color:#a60;background-color:#fff0f0">:pixels</span>] ./ <span style="color:#00d;font-weight:bold">255</span>
</code></pre></div><p>We are referencing the pixels_to_features function which we define as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0">Returns ImageFeatures struct for the image pixels
</span><span style="color:#d20;background-color:#fff0f0">given as an argument
</span><span style="color:#d20;background-color:#fff0f0">"""</span>
function pixels_to_features(image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>})
dim = <span style="color:#036;font-weight:bold">Base</span>.isqrt(length(image))
skeleton = compute_skeleton(image)
bounds = compute_bounds(skeleton)
resized = compute_resized(skeleton, bounds, (dim, dim))
left = compute_profile(resized, <span style="color:#a60;background-color:#fff0f0">:left</span>)
right = compute_profile(resized, <span style="color:#a60;background-color:#fff0f0">:right</span>)
width_min, width_max, width_at = compute_widths(left, right, image)
frequencies, simples = compute_transitions(skeleton)
<span style="color:#036;font-weight:bold">ImageStats</span>(dim, left, right, width_min, width_max, width_at, frequencies, simples)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>This in turn uses the ImageStats structure:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">immutable <span style="color:#036;font-weight:bold">ImageStats</span>
image_dim :: <span style="color:#036;font-weight:bold">Int64</span>
left :: <span style="color:#036;font-weight:bold">ProfileStats</span>
right :: <span style="color:#036;font-weight:bold">ProfileStats</span>
width_min :: <span style="color:#036;font-weight:bold">Int64</span>
width_max :: <span style="color:#036;font-weight:bold">Int64</span>
width_at :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}
direction_frequencies :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
<span style="color:#888"># The following adds information about transitions</span>
<span style="color:#888"># in 2 element simplified paths:</span>
simple_direction_frequencies :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
<span style="color:#080;font-weight:bold">end</span>
immutable <span style="color:#036;font-weight:bold">ProfileStats</span>
min :: <span style="color:#036;font-weight:bold">Int64</span>
max :: <span style="color:#036;font-weight:bold">Int64</span>
at :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}
diff :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>The pixels_to_features function first gets the skeleton of the digit shape as an image and then uses other functions passing that skeleton to them. The function returning the skeleton utilizes the fact that in Julia it’s trivially easy to use Python libraries. Here’s its definition:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">using <span style="color:#036;font-weight:bold">PyCall</span>
<span style="color:#33b">@pyimport</span> skimage.morphology as cv
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0">Thin the number in the image by computing the skeleton
</span><span style="color:#d20;background-color:#fff0f0">"""</span>
function compute_skeleton(number_image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}) :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
convert(<span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, cv.skeletonize_3d(number_image))
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>It uses the scikit-image library’s skeletonize_3d function via the @pyimport macro, calling it as if it were regular Julia code.</p>
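<p>For comparison, the same functionality is available directly from Python. A minimal sketch, assuming scikit-image is installed (newer versions recommend the plain <code>skeletonize</code> function for 2D images):</p>

```python
import numpy as np
from skimage.morphology import skeletonize

# A thick 3x3 blob of "ink" inside a 5x5 image.
image = np.zeros((5, 5), dtype=bool)
image[1:4, 1:4] = True

# Thin the blob down to its one-pixel-wide skeleton.
skeleton = skeletonize(image)
print(skeleton.astype(int))
```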
<p>Next the code crops the digit itself from the 28x28 image and resizes it back to 28x28 so that the edges of the shape always “touch” the edges of the image. For this we need a function that returns the bounds of the shape, making the cropping easy:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function compute_bounds(number_image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}) :: <span style="color:#036;font-weight:bold">Bounds</span>
rows = size(number_image, <span style="color:#00d;font-weight:bold">1</span>)
cols = size(number_image, <span style="color:#00d;font-weight:bold">2</span>)
saw_top = <span style="color:#080">false</span>
saw_bottom = <span style="color:#080">false</span>
top = <span style="color:#00d;font-weight:bold">1</span>
bottom = rows
left = cols
right = <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">for</span> y = <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:rows</span>
saw_left = <span style="color:#080">false</span>
row_sum = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> x = <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:cols</span>
row_sum += number_image[y, x]
<span style="color:#080;font-weight:bold">if</span> !saw_top && number_image[y, x] > <span style="color:#00d;font-weight:bold">0</span>
saw_top = <span style="color:#080">true</span>
top = y
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> !saw_left && number_image[y, x] > <span style="color:#00d;font-weight:bold">0</span> && x < left
saw_left = <span style="color:#080">true</span>
left = x
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> saw_top && !saw_bottom && x == cols && row_sum == <span style="color:#00d;font-weight:bold">0</span>
saw_bottom = <span style="color:#080">true</span>
bottom = y - <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> number_image[y, x] > <span style="color:#00d;font-weight:bold">0</span> && x > right
right = x
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#036;font-weight:bold">Bounds</span>(top, right, bottom, left)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Resizing the image is pretty straightforward:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">using <span style="color:#036;font-weight:bold">Images</span>
function compute_resized(image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, bounds :: <span style="color:#036;font-weight:bold">Bounds</span>, dims :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#036;font-weight:bold">Int64</span>, <span style="color:#036;font-weight:bold">Int64</span>}) :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}
cropped = image[bounds.left<span style="color:#a60;background-color:#fff0f0">:bounds</span>.right, bounds.top<span style="color:#a60;background-color:#fff0f0">:bounds</span>.bottom]
imresize(cropped, dims)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Next, we need to compute the profile stats as described in our plan of attack:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function compute_profile(image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, side :: <span style="color:#036;font-weight:bold">Symbol</span>) :: <span style="color:#036;font-weight:bold">ProfileStats</span>
<span style="color:#33b">@assert</span> side == <span style="color:#a60;background-color:#fff0f0">:left</span> || side == <span style="color:#a60;background-color:#fff0f0">:right</span>
rows = size(image, <span style="color:#00d;font-weight:bold">1</span>)
cols = size(image, <span style="color:#00d;font-weight:bold">2</span>)
columns = side == <span style="color:#a60;background-color:#fff0f0">:left</span> ? collect(<span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:cols</span>) : (collect(<span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:cols</span>) |> reverse)
at = zeros(<span style="color:#036;font-weight:bold">Int64</span>, rows)
diff = zeros(<span style="color:#036;font-weight:bold">Int64</span>, rows)
min = rows
max = <span style="color:#00d;font-weight:bold">0</span>
min_val = cols
max_val = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> y = <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:rows</span>
<span style="color:#080;font-weight:bold">for</span> x = columns
<span style="color:#080;font-weight:bold">if</span> image[y, x] > <span style="color:#00d;font-weight:bold">0</span>
at[y] = side == <span style="color:#a60;background-color:#fff0f0">:left</span> ? x : cols - x + <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">if</span> at[y] < min_val
min_val = at[y]
min = y
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> at[y] > max_val
max_val = at[y]
max = y
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">break</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> y == <span style="color:#00d;font-weight:bold">1</span>
diff[y] = at[y]
<span style="color:#080;font-weight:bold">else</span>
diff[y] = at[y] - at[y - <span style="color:#00d;font-weight:bold">1</span>]
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#036;font-weight:bold">ProfileStats</span>(min, max, at, diff)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>The widths of shapes can be computed with the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function compute_widths(left :: <span style="color:#036;font-weight:bold">ProfileStats</span>, right :: <span style="color:#036;font-weight:bold">ProfileStats</span>, image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}) :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#036;font-weight:bold">Int64</span>, <span style="color:#036;font-weight:bold">Int64</span>, <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}}
image_width = size(image, <span style="color:#00d;font-weight:bold">2</span>)
min_width = image_width
max_width = <span style="color:#00d;font-weight:bold">0</span>
width_ats = length(left.at) |> zeros
<span style="color:#080;font-weight:bold">for</span> row <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:length</span>(left.at)
width_ats[row] = image_width - (left.at[row] - <span style="color:#00d;font-weight:bold">1</span>) - (right.at[row] - <span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#080;font-weight:bold">if</span> width_ats[row] < min_width
min_width = width_ats[row]
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">if</span> width_ats[row] > max_width
max_width = width_ats[row]
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
(min_width, max_width, width_ats)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>And lastly, the transitions:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function compute_transitions(image :: <span style="color:#036;font-weight:bold">Image</span>) :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}, <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>}}
history = zeros((size(image,<span style="color:#00d;font-weight:bold">1</span>), size(image,<span style="color:#00d;font-weight:bold">2</span>)))
function next_point() :: <span style="color:#036;font-weight:bold">Nullable</span>{<span style="color:#036;font-weight:bold">Point</span>}
point = <span style="color:#036;font-weight:bold">Nullable</span>()
<span style="color:#080;font-weight:bold">for</span> row <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:size</span>(image, <span style="color:#00d;font-weight:bold">1</span>) |> reverse
<span style="color:#080;font-weight:bold">for</span> col <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:size</span>(image, <span style="color:#00d;font-weight:bold">2</span>) |> reverse
<span style="color:#080;font-weight:bold">if</span> image[row, col] > <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span> && history[row, col] == <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span>
point = <span style="color:#036;font-weight:bold">Nullable</span>((row, col))
history[row, col] = <span style="color:#00d;font-weight:bold">1</span>.<span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">return</span> point
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
function next_point(point :: <span style="color:#036;font-weight:bold">Nullable</span>{<span style="color:#036;font-weight:bold">Point</span>}) :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#036;font-weight:bold">Nullable</span>{<span style="color:#036;font-weight:bold">Point</span>}, <span style="color:#036;font-weight:bold">Int64</span>}
result = <span style="color:#036;font-weight:bold">Nullable</span>()
trans = <span style="color:#00d;font-weight:bold">0</span>
function direction_to_moves(direction :: <span style="color:#036;font-weight:bold">Int64</span>) :: <span style="color:#036;font-weight:bold">Tuple</span>{<span style="color:#036;font-weight:bold">Int64</span>, <span style="color:#036;font-weight:bold">Int64</span>}
<span style="color:#888"># for frequencies:</span>
<span style="color:#888"># 8 1 2</span>
<span style="color:#888"># 7 - 3</span>
<span style="color:#888"># 6 5 4</span>
[
( -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0</span> ),
( -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span> ),
( <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">1</span> ),
( <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">1</span> ),
( <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">0</span> ),
( <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span> ),
( <span style="color:#00d;font-weight:bold">0</span>, -<span style="color:#00d;font-weight:bold">1</span> ),
( -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span> ),
][direction]
<span style="color:#080;font-weight:bold">end</span>
function peek_point(direction :: <span style="color:#036;font-weight:bold">Int64</span>) :: <span style="color:#036;font-weight:bold">Nullable</span>{<span style="color:#036;font-weight:bold">Point</span>}
actual_current = get(point)
row_move, col_move = direction_to_moves(direction)
new_row = actual_current[<span style="color:#00d;font-weight:bold">1</span>] + row_move
new_col = actual_current[<span style="color:#00d;font-weight:bold">2</span>] + col_move
<span style="color:#080;font-weight:bold">if</span> new_row <= size(image, <span style="color:#00d;font-weight:bold">1</span>) && new_col <= size(image, <span style="color:#00d;font-weight:bold">2</span>) &&
new_row >= <span style="color:#00d;font-weight:bold">1</span> && new_col >= <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#036;font-weight:bold">Nullable</span>((new_row, new_col))
<span style="color:#080;font-weight:bold">else</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#036;font-weight:bold">Nullable</span>()
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">for</span> direction <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span>:<span style="color:#00d;font-weight:bold">8</span>
peeked = peek_point(direction)
<span style="color:#080;font-weight:bold">if</span> !isnull(peeked)
actual = get(peeked)
<span style="color:#080;font-weight:bold">if</span> image[actual[<span style="color:#00d;font-weight:bold">1</span>], actual[<span style="color:#00d;font-weight:bold">2</span>]] > <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span> && history[actual[<span style="color:#00d;font-weight:bold">1</span>], actual[<span style="color:#00d;font-weight:bold">2</span>]] == <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span>
result = peeked
history[actual[<span style="color:#00d;font-weight:bold">1</span>], actual[<span style="color:#00d;font-weight:bold">2</span>]] = <span style="color:#00d;font-weight:bold">1</span>
trans = direction
<span style="color:#080;font-weight:bold">break</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
( result, trans )
<span style="color:#080;font-weight:bold">end</span>
function trans_to_simples(transition :: <span style="color:#036;font-weight:bold">Int64</span>) :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Int64</span>}
<span style="color:#888"># for frequencies:</span>
<span style="color:#888"># 8 1 2</span>
<span style="color:#888"># 7 - 3</span>
<span style="color:#888"># 6 5 4</span>
<span style="color:#888"># for simples:</span>
<span style="color:#888"># - 1 -</span>
<span style="color:#888"># 4 - 2</span>
<span style="color:#888"># - 3 -</span>
[
[ <span style="color:#00d;font-weight:bold">1</span> ],
[ <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">2</span> ],
[ <span style="color:#00d;font-weight:bold">2</span> ],
[ <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#00d;font-weight:bold">3</span> ],
[ <span style="color:#00d;font-weight:bold">3</span> ],
[ <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#00d;font-weight:bold">4</span> ],
[ <span style="color:#00d;font-weight:bold">4</span> ],
[ <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span> ]
][transition]
<span style="color:#080;font-weight:bold">end</span>
transitions = zeros(<span style="color:#00d;font-weight:bold">8</span>)
simples = zeros(<span style="color:#00d;font-weight:bold">16</span>)
last_simples = [ ]
point = next_point()
num_transitions = .<span style="color:#00d;font-weight:bold">0</span>
ind(r, c) = (c - <span style="color:#00d;font-weight:bold">1</span>)*<span style="color:#00d;font-weight:bold">4</span> + r
<span style="color:#080;font-weight:bold">while</span> !isnull(point)
point, trans = next_point(point)
<span style="color:#080;font-weight:bold">if</span> isnull(point)
point = next_point()
<span style="color:#080;font-weight:bold">else</span>
current_simples = trans_to_simples(trans)
transitions[trans] += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">for</span> simple <span style="color:#080;font-weight:bold">in</span> current_simples
<span style="color:#080;font-weight:bold">for</span> last_simple <span style="color:#080;font-weight:bold">in</span> last_simples
simples[ind(last_simple, simple)] +=<span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
last_simples = current_simples
num_transitions += <span style="color:#00d;font-weight:bold">1</span>.<span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
(transitions ./ num_transitions, simples ./ num_transitions)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>All those gathered features can be turned into rows with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function features_to_row(features :: <span style="color:#036;font-weight:bold">ImageStats</span>)
lefts = [ features.left.min, features.left.max ]
rights = [ features.right.min, features.right.max ]
left_ats = [ features.left.at[i] <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:features</span>.image_dim ]
left_diffs = [ features.left.diff[i] <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:features</span>.image_dim ]
right_ats = [ features.right.at[i] <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:features</span>.image_dim ]
right_diffs = [ features.right.diff[i] <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:features</span>.image_dim ]
frequencies = features.direction_frequencies
simples = features.simple_direction_frequencies
vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Similarly we can construct the column names with:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">function features_columns(image :: <span style="color:#038">Array</span>{<span style="color:#036;font-weight:bold">Float64</span>})
image_dim = <span style="color:#036;font-weight:bold">Base</span>.isqrt(length(image))
lefts = [ <span style="color:#a60;background-color:#fff0f0">:left_min</span>, <span style="color:#a60;background-color:#fff0f0">:left_max</span> ]
rights = [ <span style="color:#a60;background-color:#fff0f0">:right_min</span>, <span style="color:#a60;background-color:#fff0f0">:right_max</span> ]
left_ats = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"left_at_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:image_dim</span> ]
left_diffs = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"left_diff_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:image_dim</span> ]
right_ats = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"right_at_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:image_dim</span> ]
right_diffs = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"right_diff_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span><span style="color:#a60;background-color:#fff0f0">:image_dim</span> ]
frequencies = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"direction_freq_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span>:<span style="color:#00d;font-weight:bold">8</span> ]
simples = [ <span style="color:#036;font-weight:bold">Symbol</span>(<span style="color:#d20;background-color:#fff0f0">"simple_trans_"</span>, i) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080;font-weight:bold">in</span> <span style="color:#00d;font-weight:bold">1</span>:<span style="color:#00d;font-weight:bold">4</span>^<span style="color:#00d;font-weight:bold">2</span> ]
vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>The data frame constructed with the get_data function can easily be dumped into a CSV file with the writetable function from the DataFrames package.</p>
<p>You can notice that gathering and extracting features is a <strong>lot</strong> of work. All of it was needed because in this article we’re focusing on the somewhat “classical” way of doing machine learning. You might have heard of algorithms that mimic how the human brain learns. We’re <strong>not</strong> focusing on them here; we will explore them in a future article.</p>
<p>We use the mentioned writetable on the data frames computed for both the training and test datasets to store two files: processed_train.csv and processed_test.csv.</p>
<h3 id="choosing-the-model">Choosing the model</h3>
<p>For the classification task I decided to use the XGBoost library, which is something of a hot new technology in the world of machine learning. It’s an improvement over the so-called Random Forest algorithm. The reader can learn more about XGBoost on its website: <a href="https://xgboost.readthedocs.io/">https://xgboost.readthedocs.io/</a>.</p>
<p>Both Random Forest and XGBoost revolve around an idea called <em>ensemble learning</em>. In this approach we don’t build just one learning model: the algorithm actually creates many variations of models and uses them to collectively come up with better results. A short description will have to suffice here, as this article is already quite lengthy.</p>
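<p>To make the ensemble idea concrete, here is a minimal Ruby sketch of majority voting, the simplest way an ensemble combines its members’ answers (the model predictions are made up; real XGBoost combines trees additively rather than by voting):</p>

```ruby
# Ensemble learning in miniature: each model in the ensemble casts a
# vote, and the most common prediction wins (majority voting).
votes = [3, 3, 8, 3, 5]  # predictions from five hypothetical models
prediction = votes.tally.max_by { |_label, count| count }.first
puts prediction  # prints 3
```

Even if individual models are only moderately accurate, their collective answer tends to be better as long as their errors are not perfectly correlated.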
<h3 id="training-the-model">Training the model</h3>
<p>The training and classification code in R is very simple. We first need to load the libraries that will allow us to load data as well as to build the classification model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#06b;font-weight:bold">library</span>(xgboost)
<span style="color:#06b;font-weight:bold">library</span>(readr)
</code></pre></div><p>Loading the data into data frames is equally straightforward:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">processed_train <- <span style="color:#06b;font-weight:bold">read_csv</span>(<span style="color:#d20;background-color:#fff0f0">"processed_train.csv"</span>)
processed_test <- <span style="color:#06b;font-weight:bold">read_csv</span>(<span style="color:#d20;background-color:#fff0f0">"processed_test.csv"</span>)
</code></pre></div><p>We then move on to preparing the vector of labels for each row as well as the matrix of features:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">labels = processed_train$label
features = processed_train[, <span style="color:#00d;font-weight:bold">2</span>:<span style="color:#00d;font-weight:bold">141</span>]
features = <span style="color:#06b;font-weight:bold">scale</span>(features)
features = <span style="color:#06b;font-weight:bold">as.matrix</span>(features)
</code></pre></div><h3 id="the-train-test-split">The train-test split</h3>
<p>When working with models, one of the ways of evaluating their performance is to split the data into so-called train and test sets. We train the model on one set and then we predict the values from the test set. We then calculate the accuracy of predicted values as the ratio between the number of correct predictions and the number of all observations.</p>
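<p>The accuracy calculation itself is trivial; here is what it boils down to, sketched in Ruby with made-up labels:</p>

```ruby
# Accuracy: the fraction of predictions that match the true labels.
predicted = [3, 1, 4, 1, 5, 9, 2, 6]
actual    = [3, 1, 4, 1, 5, 9, 2, 7]

correct  = predicted.zip(actual).count { |p, a| p == a }
accuracy = correct.to_f / actual.length
puts accuracy  # prints 0.875
```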
<p>Because Kaggle provides the test set without labels, for the sake of evaluating the model’s performance without the need to submit the results, we’ll split our Kaggle-training set into local train and test ones. We’ll use the amazing caret library which provides a wealth of tools for doing machine learning:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#06b;font-weight:bold">library</span>(caret)
index <- <span style="color:#06b;font-weight:bold">createDataPartition</span>(processed_train$label, p = <span style="color:#00d;font-weight:bold">.8</span>,
list = <span style="color:#080;font-weight:bold">FALSE</span>,
times = <span style="color:#00d;font-weight:bold">1</span>)
train_labels <- labels[index]
train_features <- features[index,]
test_labels <- labels[-index]
test_features <- features[-index,]
</code></pre></div><p>The above code splits the set uniformly based on the labels, so that the train set is approximately 80% of the whole data set and the class proportions are preserved in both parts.</p>
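<p>A minimal Ruby sketch of the same stratified idea, using a toy label vector (caret’s createDataPartition is the real implementation; this only illustrates the mechanics):</p>

```ruby
# A stratified 80/20 split: sample 80% of the row indices within each
# label group separately, so the class proportions are preserved.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # a toy label vector

train_index = (0...labels.length)
              .group_by { |i| labels[i] }
              .flat_map { |_label, idxs| idxs.sample((idxs.length * 0.8).round) }
              .sort
test_index = (0...labels.length).to_a - train_index
# train_index has 8 elements (4 per class); test_index has the other 2.
```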
<h3 id="using-xgboost-as-the-classification-model">Using XGBoost as the classification model</h3>
<p>We can now make our data digestible by the XGBoost library:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">train <- <span style="color:#06b;font-weight:bold">xgb.DMatrix</span>(<span style="color:#06b;font-weight:bold">as.matrix</span>(train_features), label = train_labels)
test <- <span style="color:#06b;font-weight:bold">xgb.DMatrix</span>(<span style="color:#06b;font-weight:bold">as.matrix</span>(test_features), label = test_labels)
</code></pre></div><p>The next step is to make the XGBoost learn from our data. The actual parameters and their explanations are beyond the scope of this overview article, but the reader can look them up on the XGBoost pages:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">model <- <span style="color:#06b;font-weight:bold">xgboost</span>(train,
max_depth = <span style="color:#00d;font-weight:bold">16</span>,
nrounds = <span style="color:#00d;font-weight:bold">600</span>,
eta = <span style="color:#00d;font-weight:bold">0.2</span>,
objective = <span style="color:#d20;background-color:#fff0f0">"multi:softmax"</span>,
num_class = <span style="color:#00d;font-weight:bold">10</span>)
</code></pre></div><p>It’s critically important to pass the objective as “multi:softmax” and num_class as 10, since we’re classifying digits into 10 classes.</p>
<h3 id="simple-performance-evaluation-with-confusion-matrix">Simple performance evaluation with confusion matrix</h3>
<p>After waiting a while (a couple of minutes) for the last batch of code to finish computing, we now have the classification model ready to be used. Let’s use it to predict the labels from our test set:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">predicted = <span style="color:#06b;font-weight:bold">predict</span>(model, test)
</code></pre></div><p>This returns the vector of predicted values. We’d now like to check how well our model predicts the values. One of the easiest ways is to use the so-called <strong>confusion matrix</strong>.</p>
<p>As per Wikipedia, a confusion matrix:</p>
<blockquote>
<p>(…) also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).</p>
</blockquote>
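<p>The caret library will compute this for us below, but the underlying bookkeeping is simple; a minimal Ruby sketch with made-up labels:</p>

```ruby
# Tally (predicted, actual) pairs into a nested hash:
# matrix[predicted][actual] counts how often that pair occurred.
predicted = [0, 1, 1, 2, 2, 2, 0]
actual    = [0, 1, 2, 2, 2, 0, 0]

matrix = Hash.new { |h, k| h[k] = Hash.new(0) }
predicted.zip(actual).each { |p, a| matrix[p][a] += 1 }

puts matrix[0][0]  # prints 2: two 0s classified correctly
puts matrix[2][0]  # prints 1: one actual 0 mislabelled as a 2
```

The diagonal entries are the correct classifications; everything off the diagonal shows which pairs of classes the model confuses.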
<p>The caret library provides a very easy to use function for examining the confusion matrix and statistics derived from it:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#06b;font-weight:bold">confusionMatrix</span>(data=<span style="color:#06b;font-weight:bold">factor</span>(predicted), reference=<span style="color:#06b;font-weight:bold">factor</span>(test_labels))
</code></pre></div><p>The function returns an R list that gets pretty printed to the R console. In our case it looks like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 819 0 3 3 1 1 2 1 10 5
1 0 923 0 4 5 1 5 3 4 5
2 4 2 766 26 2 6 8 12 5 0
3 2 0 15 799 0 22 2 8 0 8
4 5 2 1 0 761 1 0 15 4 19
5 1 3 0 13 2 719 3 0 9 6
6 5 3 4 1 6 5 790 0 16 2
7 1 7 12 9 2 3 1 813 4 16
8 6 2 4 7 8 11 8 5 767 10
9 5 2 1 13 22 6 1 14 14 746
Overall Statistics
Accuracy : 0.9411
95% CI : (0.9358, 0.946)
No Information Rate : 0.1124
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9345
Mcnemar's Test P-Value : NA
(...)
</code></pre></div><p>Each column in the matrix represents actual labels while each row represents what our algorithm predicted the value to be. The accuracy rate is also printed for us, and in this case it equals 0.9411. This means that our code was able to predict the correct values of handwritten digits for 94.11% of the observations.</p>
<h3 id="submitting-the-results">Submitting the results</h3>
<p>We got an accuracy rate of 0.9411 on our local test set, and it turned out to be very close to the one we got against the test set coming from Kaggle. After predicting the competition values and submitting them, the accuracy rate computed by Kaggle was 0.94357. That’s quite okay given that we aren’t using any of the newer, fancier techniques here.</p>
<p>Also, we haven’t done any <em>parameter tuning</em>, which would surely improve the overall accuracy. We could also revisit the code from the feature extraction phase. One improvement I can think of would be to first crop and resize, and only then compute the skeleton, which might preserve more information about the shape. We could also use the confusion matrix to find the digit that was confused most often, and then look at the real images we failed to recognize. This could lead us to conclusions about improvements to our feature extraction code. There’s always a way to extract more information.</p>
<p>Nowadays, Kagglers from around the world successfully use advanced techniques like <em>Convolutional Neural Networks</em>, getting accuracy scores close to 0.999. Those live in a somewhat different branch of the machine learning world though. Using this type of neural network we don’t need to do the feature extraction on our own: the algorithm includes a step that automatically gathers features, which it later feeds into the network itself. We will take a look at them in a future article.</p>
<h3 id="see-also">See also</h3>
<ul>
<li><a href="https://julialang.org/">Julia Language</a></li>
<li><a href="https://www.r-project.org/">R Language</a></li>
<li><a href="http://scikit-image.org/">Scikit-Image library</a></li>
<li><a href="https://xgboost.readthedocs.io/">XGBoost library</a></li>
<li><a href="https://topepo.github.io/caret/index.html">Caret library</a></li>
<li><a href="https://www.kaggle.com/">Kaggle</a></li>
</ul>
wroc_love.rb 2017 part 1 (2017-03-18, Wojtek Ziniewicz): https://www.endpointdev.com/blog/2017/03/wrocloverb-2017-part-1/
<p><a href="https://wrocloverb.com/">wroc_love.rb</a> is a single-track 3-day conference that takes place in Wrocław, Poland, every year in March.</p>
<p>Here’s a subjective list of most interesting talks from the first day:</p>
<h3 id="kafka--karafka-by-maciej-mensfeld">Kafka / Karafka by Maciej Mensfeld</h3>
<p><a href="https://github.com/karafka/karafka">Karafka</a> is another library that simplifies Apache Kafka usage in Ruby. It lets Ruby on Rails apps benefit from horizontally scalable message buses in a pub-sub (publisher/subscriber) type of network.</p>
<p><strong>Why <a href="https://kafka.apache.org/">Kafka</a> is (<em>probably</em>) a better message/task broker for your app:</strong></p>
<ul>
<li>broadcasting is a real power feature of Kafka (HTTP lacks that)</li>
<li>author claims that it’s easier to support than ZeroMQ/RabbitMQ</li>
<li>it’s namespaced with topics (similar to ROS, the <a href="http://www.ros.org/">Robot Operating System</a>)</li>
<li>great replacement for <a href="https://github.com/zendesk/ruby-kafka">ruby-kafka</a> and <a href="https://github.com/bpot/poseidon">Poseidon</a></li>
</ul>
<blockquote>
<p>Karafka <a href="https://t.co/g9LQZiAV4i">https://t.co/g9LQZiAV4i</a> microframework to have <a href="https://twitter.com/hashtag/rails?src=hash">#rails</a>-like development performance with <a href="https://twitter.com/hashtag/kafka?src=hash">#kafka</a> in <a href="https://twitter.com/hashtag/ruby?src=hash">#ruby</a> <a href="https://twitter.com/maciejmensfeld">@maciejmensfeld</a> <a href="https://twitter.com/hashtag/wrocloverb?src=hash">#wrocloverb</a></p>
<p>— Maciek Rząsa (@mjrzasa) <a href="https://twitter.com/mjrzasa/status/842771868239192064">March 17, 2017</a></p>
</blockquote>
<h3 id="machine-learning-to-the-rescue-by-mariusz-gil">Machine Learning to the Rescue by Mariusz Gil</h3>
<p>This talk was devoted to the author’s Machine Learning success (and failure) stories.</p>
<p>The author underlined that Machine Learning is a <strong>process</strong> and proposed the following <strong>workflow</strong>:</p>
<ol>
<li>define a problem</li>
<li>gather your data</li>
<li>understand your data</li>
<li>prepare and condition the data</li>
<li>select & run your algorithms</li>
<li>tune algorithm parameters</li>
<li>select final model</li>
<li>validate final model (test using production data)</li>
</ol>
<p>Mariusz described a few ML problems that he has dealt with in the past. One of them was a project designed to estimate the cost of a code review. He outlined the process of tuning the input data. Here’s a list of what comprised the input for the code review cost estimation:</p>
<ul>
<li>number of lines changed</li>
<li>number of files changed</li>
<li><a href="https://en.wikipedia.org/wiki/Efferent_coupling">efferent</a> coupling</li>
<li><a href="https://en.wikipedia.org/wiki/Coupling_(computer_programming)">afferent</a> coupling</li>
<li>number of classes</li>
<li>number of interfaces</li>
<li>inheritance level</li>
<li>number of method calls</li>
<li>LLOC metric (Logical Lines of Code, excluding empty or comment lines)</li>
<li>LCOM metric (Lack of Cohesion between Methods—whether single responsibility pattern is followed or not)</li>
</ul>
<h3 id="spree-lightning-talk-by-sparksolutionscohttpssparksolutionsco">Spree lightning talk by <a href="https://sparksolutions.co/">sparksolutions.co</a></h3>
<p>One of the lightning talks was devoted to Spree. Here’s some of the latest interesting data from the Spree world:</p>
<ul>
<li>number of contributors to Spree: 700</li>
<li>it’s very modular</li>
<li>it’s API driven</li>
<li>it’s one of the biggest repos on GitHub</li>
<li>very large number of extensions</li>
<li>it drives thousands of stores worldwide</li>
<li><a href="https://sparksolutions.co/">Spark Solutions</a> is a maintainer</li>
<li>Popular companies that use Spree: GoDaddy, Goop, Casper, Bonobos, Littlebits, Greetabl</li>
<li>it supports Rails 5, Rails 4.2, and Rails 3.x</li>
</ul>
<p>The author also released the newest stable version, 3.2.0, live during the talk:</p>
<blockquote>
<p>releasing spree 3.2.0 live during lightning talk <a href="https://twitter.com/hashtag/wrocloverb?src=hash">#wrocloverb</a> <a href="https://t.co/9oPcB5CTfB">pic.twitter.com/9oPcB5CTfB</a></p>
<p>— Wojciech Ziniewicz (@fribulusxax) <a href="https://twitter.com/fribulusxax/status/842800094915301376">March 17, 2017</a></p>
</blockquote>
Learning from data basics II: simple Bayesian Networks (2016-04-12, Kamil Ciemniewski): https://www.endpointdev.com/blog/2016/04/learning-from-data-basics-ii-simple/
<p>In my <a href="/blog/2016/03/learning-from-data-basics-naive-bayes/">last article</a> I presented an approach that simplifies computations of very complex probability models. It makes these complex models viable by shrinking the amount of needed memory and improving the speed of computing probabilities. The approach we were exploring is called the <strong>Naive Bayes model</strong>.</p>
<p>The context was the e-commerce feature in which a user is presented with the promotion box. The box shows the product category the user is most likely to buy.</p>
<p>Though the results we got were quite good, I promised to present an approach that gives much better ones. While the Naive Bayes approach may not be acceptable in some scenarios due to the gap between approximated and real values, the approach presented in this article will make this distance much, much smaller.</p>
<h3 id="naive-bayes-as-a-simple-bayesian-network">Naive Bayes as a simple Bayesian Network</h3>
<p>When exploring the Naive Bayes model, we said that there is a probabilistic assumption the model makes in order to simplify the computations. In the last article I wrote:</p>
<blockquote>
<p>The Naive Bayes assumption says that the distribution factorizes the way we did it <strong>only if the features are conditionally independent given the category</strong>.</p>
</blockquote>
<h4 id="expressing-variable-dependencies-as-a-graph">Expressing variable dependencies as a graph</h4>
<p>Let’s imagine a visual representation of the relations between the random variables in the Naive Bayes model. Let’s make it into a directed acyclic graph, marking the dependence of one variable on another as a graph edge from the parent node pointing to its dependent node.</p>
<p>Because of the assumption the Naive Bayes model enforces, its structure as a graph looks like the following:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="/blog/2016/04/learning-from-data-basics-ii-simple/image-0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="/blog/2016/04/learning-from-data-basics-ii-simple/image-0.png"/></a></div>
<p>You can notice there are no lines between all the “evidence” nodes. The assumption says that knowing the category, we have all needed knowledge about every single evidence node. This makes category the parent of all the other nodes. Intuitively, we can say that knowing the <strong>class</strong> (in this example, the category) we know everything about all <strong>features</strong>. It’s easy to notice that this assumption doesn’t hold in this example.</p>
<p>In our fake data generator, we made it so that e.g. relationship status depends on age. We’ve also made the category depend on sex and age directly. This way we can’t say that knowing the category we know everything about e.g. age. The random variables age and sex are not independent even if we know the value of category. It is clear that the above graph does not model the dependency relationships between these random variables.</p>
<p>Let’s draw a graph that represents our fake data model better:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="/blog/2016/04/learning-from-data-basics-ii-simple/image-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="/blog/2016/04/learning-from-data-basics-ii-simple/image-1.png"/></a></div>
<p>The combination of a graph like the one above and the probability distribution that follows the independencies it describes are known as a <strong>Bayesian Network</strong>.</p>
<h4 id="using-the-graph-representation-in-practice---the-chain-rule-for-bayesian-networks">Using the graph representation in practice - the chain rule for Bayesian Networks</h4>
<p>The fact that our distribution is part of the Bayesian Network, allows us to use the formula for simplifying the distribution itself. The formula is called the <strong>chain rule for Bayesian Networks</strong> and for our particular example looks like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * p(rel | age) * p(cat | sex, age)
</code></pre></div><p>You can notice that the equation is just a product of a number of factors. There’s one factor for each random variable. The factors for variables that in the graph don’t have any parents are expressed as p(var) while those that do are expressed as p(var | par) or p(var | par1, par2…).</p>
<p>Notice that the Naive Bayes model fits perfectly into this equation. If you were to take the first graph presented in this article—for the Naive Bayes, and use the above equation, you’d get exactly the formula we used in the last article.</p>
<h3 id="coding-the-updated-probabilistic-model">Coding the updated probabilistic model</h3>
<p>Before going further, I strongly advise you to make sure you read the <a href="/blog/2016/03/learning-from-data-basics-naive-bayes/">previous article - about the Naive Bayes model</a> - to fully understand the classes used in the code in this section.</p>
<p>Let’s take our chain rule equation and simplify it:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * p(rel | age) * p(cat | sex, age)
</code></pre></div><p>Recall that a conditional distribution can be expressed as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(a | b) = p(a, b) / p(b)
</code></pre></div><p>This gives us:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * (p(rel, age)/ p(age)) * (p(cat, sex, age) / p(sex, age))
</code></pre></div><p>The p(age) terms cancel, leaving:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">p(cat, sex, age, rel, loc) = p(sex) * p(loc) * p(rel, age) * (p(cat, sex, age) / p(sex, age))
</code></pre></div><p>Let’s define needed random variables and factors:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">category = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:category</span>, [ <span style="color:#a60;background-color:#fff0f0">:veggies</span>, <span style="color:#a60;background-color:#fff0f0">:snacks</span>, <span style="color:#a60;background-color:#fff0f0">:meat</span>, <span style="color:#a60;background-color:#fff0f0">:drinks</span>, <span style="color:#a60;background-color:#fff0f0">:beauty</span>, <span style="color:#a60;background-color:#fff0f0">:magazines</span> ]
age = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:age</span>, [ <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:young_adults</span>, <span style="color:#a60;background-color:#fff0f0">:adults</span>, <span style="color:#a60;background-color:#fff0f0">:elders</span> ]
sex = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:sex</span>, [ <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:female</span> ]
relation = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:relation</span>, [ <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span> ]
location = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:location</span>, [ <span style="color:#a60;background-color:#fff0f0">:us</span>, <span style="color:#a60;background-color:#fff0f0">:canada</span>, <span style="color:#a60;background-color:#fff0f0">:europe</span>, <span style="color:#a60;background-color:#fff0f0">:asia</span> ]
loc_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ location ]
sex_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ sex ]
rel_age_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ relation, age ]
cat_age_sex_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ category, age, sex ]
age_sex_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ age, sex ]
full_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ category, age, sex, relation, location ]
</code></pre></div><p>The learning part is as trivial as in the Naive Bayes case. The only difference is the set of distributions involved:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#036;font-weight:bold">Model</span>.generate(<span style="color:#00d;font-weight:bold">1000</span>).each <span style="color:#080;font-weight:bold">do</span> |user|
user.baskets.each <span style="color:#080;font-weight:bold">do</span> |basket|
basket.line_items.each <span style="color:#080;font-weight:bold">do</span> |item|
loc_dist.observe! <span style="color:#a60;background-color:#fff0f0">location</span>: user.location
sex_dist.observe! <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex
rel_age_dist.observe! <span style="color:#a60;background-color:#fff0f0">relation</span>: user.relationship, <span style="color:#a60;background-color:#fff0f0">age</span>: user.age
cat_age_sex_dist.observe! <span style="color:#a60;background-color:#fff0f0">category</span>: item.category, <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex
age_sex_dist.observe! <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex
full_dist.observe! <span style="color:#a60;background-color:#fff0f0">category</span>: item.category, <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex,
<span style="color:#a60;background-color:#fff0f0">relation</span>: user.relationship, <span style="color:#a60;background-color:#fff0f0">location</span>: user.location
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>The inference part is also very similar to the one from the previous article. Here too the only difference are the distributions involved:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">infer = -> (age, sex, rel, loc) <span style="color:#080;font-weight:bold">do</span>
all = category.values.map <span style="color:#080;font-weight:bold">do</span> |cat|
pl = loc_dist.value_for <span style="color:#a60;background-color:#fff0f0">location</span>: loc
ps = sex_dist.value_for <span style="color:#a60;background-color:#fff0f0">sex</span>: sex
pra = rel_age_dist.value_for <span style="color:#a60;background-color:#fff0f0">relation</span>: rel, <span style="color:#a60;background-color:#fff0f0">age</span>: age
pcas = cat_age_sex_dist.value_for <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">sex</span>: sex
pas = age_sex_dist.value_for <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">sex</span>: sex
{ <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">value</span>: (pl * ps * pra * pcas) / pas }
<span style="color:#080;font-weight:bold">end</span>
all_full = category.values.map <span style="color:#080;font-weight:bold">do</span> |cat|
val = full_dist.value_for <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">sex</span>: sex,
<span style="color:#a60;background-color:#fff0f0">relation</span>: rel, <span style="color:#a60;background-color:#fff0f0">location</span>: loc
{ <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">value</span>: val }
<span style="color:#080;font-weight:bold">end</span>
win = all.max { |a, b| a[<span style="color:#a60;background-color:#fff0f0">:value</span>] <=> b[<span style="color:#a60;background-color:#fff0f0">:value</span>] }
win_full = all_full.max { |a, b| a[<span style="color:#a60;background-color:#fff0f0">:value</span>] <=> b[<span style="color:#a60;background-color:#fff0f0">:value</span>] }
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"Best match for </span><span style="color:#33b;background-color:#fff0f0">#{</span>[ age, sex, rel, loc ]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">:"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">" </span><span style="color:#33b;background-color:#fff0f0">#{</span>win[<span style="color:#a60;background-color:#fff0f0">:category</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> => </span><span style="color:#33b;background-color:#fff0f0">#{</span>win[<span style="color:#a60;background-color:#fff0f0">:value</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"Full pointed at:"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">" </span><span style="color:#33b;background-color:#fff0f0">#{</span>win_full[<span style="color:#a60;background-color:#fff0f0">:category</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> => </span><span style="color:#33b;background-color:#fff0f0">#{</span>win_full[<span style="color:#a60;background-color:#fff0f0">:value</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><h3 id="the-results">The results</h3>
<p>Now let’s run the inference procedure with the same set of examples as in the previous post to compare the results:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">infer.call <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:us</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:young_adults</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:asia</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:adults</span>, <span style="color:#a60;background-color:#fff0f0">:female</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span>, <span style="color:#a60;background-color:#fff0f0">:europe</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:elders</span>, <span style="color:#a60;background-color:#fff0f0">:female</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span>, <span style="color:#a60;background-color:#fff0f0">:canada</span>
</code></pre></div><p>Which yields:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Best match for [:teens, :male, :single, :us]:
snacks => 0.020610837341908994
Full pointed at:
snacks => 0.02103999999999992
Best match for [:young_adults, :male, :single, :asia]:
meat => 0.001801062449999991
Full pointed at:
meat => 0.0010700000000000121
Best match for [:adults, :female, :in_relationship, :europe]:
beauty => 0.0007693377820183494
Full pointed at:
beauty => 0.0008300000000000074
Best match for [:elders, :female, :in_relationship, :canada]:
veggies => 0.0024346445741176875
Full pointed at:
veggies => 0.0034199999999999886
</code></pre></div><p>Just as with the Naive Bayes model, we got correct values for all cases. Look closer, though, and you’ll notice that the resulting probability values are much nearer to the ones from the original, full distribution. The approach we took here makes the values differ by no more than about one part in a thousand. That precision could make a difference in the example e-commerce shop if it were visited by millions of customers each month.</p>
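<p>For the curious, the gaps between the two columns of numbers above can be recomputed directly (the values are hard-coded here from the printed output):</p>

```ruby
# Probabilities taken from the printed output above: the approximated
# distribution vs. the full joint distribution, per winning category.
approx = { snacks: 0.020610837341908994, meat: 0.001801062449999991,
           beauty: 0.0007693377820183494, veggies: 0.0024346445741176875 }
full   = { snacks: 0.02103999999999992, meat: 0.0010700000000000121,
           beauty: 0.0008300000000000074, veggies: 0.0034199999999999886 }

approx.each do |category, value|
  puts format('%-8s absolute difference: %.6f', category, (value - full[category]).abs)
end
```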
<p><strong>Learning from data basics: the Naive Bayes model</strong> (Kamil Ciemniewski, 2016-03-23)<br>https://www.endpointdev.com/blog/2016/03/learning-from-data-basics-naive-bayes/</p>
<p>Have you ever wondered about the machinery behind some of the algorithms that perform seemingly very intelligent tasks? How is it possible that a computer program can recognize faces in photos, turn an image into text, or even classify some emails as legitimate and others as spam?</p>
<p>Today, I’d like to present one of the simplest models for performing classification tasks. The model is extremely fast to execute, making it practical in many use cases. The example I’ve chosen will also let us extend the discussion about the optimal approach in a later blog post.</p>
<h3 id="the-problem">The problem</h3>
<p>Imagine that you’re working on an e-commerce store for your client. One of the requirements is to present the currently logged-in user with a “promotion box” somewhere on the page. The goal is to maximize the chances of the user putting the product from the box into the basket. There’s one promotional box and a couple of different product categories to choose the actual product from.</p>
<h3 id="thinking-about-the-solutionusing-probability-theory">Thinking about the solution—using probability theory</h3>
<p>One obvious direction to turn towards is probability theory. If we collect data about users’ previous choices and their characteristics, we can use probability to select the product category best suited to the current user. We would then choose a product from this category that currently has an active promotion.</p>
<h3 id="quick-theory-refresher-for-programmers">Quick theory refresher for programmers</h3>
<p>As we’ll be exploring the probability approaches using Ruby code, I’d like to very quickly walk you through some of the basic concepts we will be using from now on.</p>
<h4 id="random-variables">Random variables</h4>
<p>The simplest probability scenario many of us are already accustomed to is the distribution of coin toss results. Here we’re throwing the coin and noting whether we get heads or tails. In this experiment, we call “got heads” and “got tails” probability events. We can also shift the terminology a bit by calling them two values of the “toss result” <strong>random variable</strong>.</p>
<p>So in this case we’d have a random variable—let’s call it <strong>T</strong> (for “toss”)—that can take the values “heads” or “tails”. We then define the probability distribution P(T) as a function from the random variable’s values to real numbers between 0 and 1, inclusive. In the real world, the probability values after e.g. 10,000 tosses might look like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">+-------+---------------------+
| toss | value |
+-------+---------------------+
| heads | 0.49929999999999947 |
| tails | 0.500699999999998 |
+-------+---------------------+
</code></pre></div><p>These values get closer and closer to 0.5 as the number of tosses grows.</p>
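<p>As a quick aside (this snippet isn’t part of the post’s main code), we can watch this convergence with a simulated coin, here using Ruby’s seeded <code>Random</code>:</p>

```ruby
# Estimate the empirical distribution of a fair coin toss.
# The more tosses, the closer both values get to 0.5.
def toss_distribution(tosses, rng: Random.new(42))
  counts = { heads: 0, tails: 0 }
  tosses.times { counts[rng.rand(2).zero? ? :heads : :tails] += 1 }
  counts.transform_values { |count| count.to_f / tosses }
end

puts toss_distribution(10_000)
```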
<h4 id="factors-and-probability-distributions">Factors and probability distributions</h4>
<p>We’ve shown a simple probability distribution. To ease comprehension of the Ruby code we’ll be working with, let me introduce the notion of the <strong>factor</strong>. We called the “table” from the last example a probability distribution. The table represented a function from a random variable’s values to real numbers in [0, 1]. The <strong>factor</strong> is a generalization of that notion: it’s a function from the same domain, but returning any real number. We’ll explore the usefulness of this notion in some of our next articles.</p>
<p>The probability distribution is a factor that adds two constraints:</p>
<ul>
<li>its values are always in the range [0, 1] inclusively</li>
<li>the sum of all its values is exactly 1</li>
</ul>
<h3 id="simple-ruby-modeling-of-random-variables-and-factors">Simple Ruby modeling of random variables and factors</h3>
<p>We need some way of computing probability distributions. Let’s define a few simple tools we’ll be using in this blog series:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#888"># Let's define a simple version of the random variable</span>
<span style="color:#888"># - one that will hold discrete values</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">RandomVariable</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:values</span>, <span style="color:#a60;background-color:#fff0f0">:name</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(<span style="color:#038">name</span>, values)
<span style="color:#33b">@name</span> = <span style="color:#038">name</span>
<span style="color:#33b">@values</span> = values
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># The following class really represents here a probability</span>
<span style="color:#888"># distribution. We'll adjust it in the next posts to make</span>
<span style="color:#888"># it match the definition of a "factor". We're naming it this</span>
<span style="color:#888"># way right now as every probability distribution is a factor</span>
<span style="color:#888"># too.</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Factor</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:_table</span>, <span style="color:#a60;background-color:#fff0f0">:_count</span>, <span style="color:#a60;background-color:#fff0f0">:variables</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(variables)
<span style="color:#33b">@_table</span> = {}
<span style="color:#33b">@_count</span> = <span style="color:#00d;font-weight:bold">0</span>.<span style="color:#00d;font-weight:bold">0</span>
<span style="color:#33b">@variables</span> = variables
initialize_table
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># We're choosing to represent the factor / distribution</span>
<span style="color:#888"># here as a table with value combinations in one column</span>
<span style="color:#888"># and probability values in another. Technically, we're using</span>
<span style="color:#888"># Ruby's Hash. The following method builds the initial hash</span>
<span style="color:#888"># with all the possible keys and values assigned to 0:</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize_table</span>
variables_values = <span style="color:#33b">@variables</span>.map <span style="color:#080;font-weight:bold">do</span> |var|
var.values.map <span style="color:#080;font-weight:bold">do</span> |val|
{ var.name.to_sym => val }
<span style="color:#080;font-weight:bold">end</span>.flatten
<span style="color:#080;font-weight:bold">end</span> <span style="color:#888"># [ [ { name: value } ] ] </span>
<span style="color:#33b">@_table</span> = variables_values[<span style="color:#00d;font-weight:bold">1</span>..(variables_values.count)].inject(variables_values.first) <span style="color:#080;font-weight:bold">do</span> |all_array, var_arrays|
all_array = all_array.map <span style="color:#080;font-weight:bold">do</span> |ob|
var_arrays.map <span style="color:#080;font-weight:bold">do</span> |var_val|
ob.merge var_val
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>.flatten
all_array
<span style="color:#080;font-weight:bold">end</span>.inject({}) { |m, item| m[item] = <span style="color:#00d;font-weight:bold">0</span>; m }
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># The following method adjusts the factor by adding information</span>
<span style="color:#888"># about observed combination of values. This in turn adjusts probability</span>
<span style="color:#888"># values for all the entries:</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">observe!</span>(observation)
<span style="color:#080;font-weight:bold">if</span> !<span style="color:#33b">@_table</span>.has_key? observation
<span style="color:#080;font-weight:bold">raise</span> <span style="color:#036;font-weight:bold">ArgumentError</span>, <span style="color:#d20;background-color:#fff0f0">"Doesn't fit the factor - </span><span style="color:#33b;background-color:#fff0f0">#{</span><span style="color:#33b">@variables</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> for observation: </span><span style="color:#33b;background-color:#fff0f0">#{</span>observation<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#33b">@_count</span> += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#33b">@_table</span>.keys.each <span style="color:#080;font-weight:bold">do</span> |key|
observed = key == observation
<span style="color:#33b">@_table</span>[key] = (<span style="color:#33b">@_table</span>[key] * (<span style="color:#33b">@_count</span> == <span style="color:#00d;font-weight:bold">0</span> ? <span style="color:#00d;font-weight:bold">0</span> : (<span style="color:#33b">@_count</span> - <span style="color:#00d;font-weight:bold">1</span>)) +
(observed ? <span style="color:#00d;font-weight:bold">1</span> : <span style="color:#00d;font-weight:bold">0</span>)) /
(<span style="color:#33b">@_count</span> == <span style="color:#00d;font-weight:bold">0</span> ? <span style="color:#00d;font-weight:bold">1</span> : <span style="color:#33b">@_count</span>)
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#038">self</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Helper method returning an enumerator over all the possible</span>
<span style="color:#888"># combinations of random variable assignments and their values</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">entries</span>
<span style="color:#33b">@_table</span>.each
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Helper method for testing purposes. Sums the values for the whole</span>
<span style="color:#888"># distribution - it should return 1 (close to 1 due to how computers</span>
<span style="color:#888"># handle floating point operations)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">sum</span>
<span style="color:#33b">@_table</span>.values.inject(<span style="color:#a60;background-color:#fff0f0">:+</span>)
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Returns a probability of a given combination happening</span>
<span style="color:#888"># in the experiment</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">value_for</span>(key)
<span style="color:#080;font-weight:bold">if</span> <span style="color:#33b">@_table</span>[key].nil?
<span style="color:#080;font-weight:bold">raise</span> <span style="color:#036;font-weight:bold">ArgumentError</span>, <span style="color:#d20;background-color:#fff0f0">"Doesn't fit the factor - </span><span style="color:#33b;background-color:#fff0f0">#{</span><span style="color:#33b">@variables</span><span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> for: </span><span style="color:#33b;background-color:#fff0f0">#{</span>key<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#33b">@_table</span>[key]
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Helper method for testing purposes. Returns a table object</span>
<span style="color:#888"># ready to be printed to stdout. It shows the whole distribution</span>
<span style="color:#888"># as a table with some columns being random variables values and</span>
<span style="color:#888"># the last one being the probability value</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">table</span>
rows = <span style="color:#33b">@_table</span>.keys.map <span style="color:#080;font-weight:bold">do</span> |key|
key.values << <span style="color:#33b">@_table</span>[key]
<span style="color:#080;font-weight:bold">end</span>
table = <span style="color:#036;font-weight:bold">Terminal</span>::<span style="color:#036;font-weight:bold">Table</span>.new <span style="color:#a60;background-color:#fff0f0">rows</span>: rows, <span style="color:#a60;background-color:#fff0f0">headings</span>: ( <span style="color:#33b">@variables</span>.map(&<span style="color:#a60;background-color:#fff0f0">:name</span>) << <span style="color:#d20;background-color:#fff0f0">"value"</span> )
table.align_column(<span style="color:#33b">@variables</span>.count, <span style="color:#a60;background-color:#fff0f0">:right</span>)
table
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080">protected</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">entries</span>=(_entries)
_entries.each <span style="color:#080;font-weight:bold">do</span> |entry|
<span style="color:#33b">@_table</span>[entry.keys.first] = entry.values.first
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">count</span>
<span style="color:#33b">@_count</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">count</span>=(_count)
<span style="color:#33b">@_count</span> = _count
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>Notice that we’re using the <strong>terminal-table</strong> gem here as a helper for printing out factors in an easy-to-grasp fashion. You’ll need the following requires:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">'rubygems'</span>
<span style="color:#038">require</span> <span style="color:#d20;background-color:#fff0f0">'terminal-table'</span>
</code></pre></div><h3 id="the-scenario-setup">The scenario setup</h3>
<p>Let’s imagine that we have the following categories to choose from:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">category = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:category</span>, [ <span style="color:#a60;background-color:#fff0f0">:veggies</span>, <span style="color:#a60;background-color:#fff0f0">:snacks</span>, <span style="color:#a60;background-color:#fff0f0">:meat</span>, <span style="color:#a60;background-color:#fff0f0">:drinks</span>, <span style="color:#a60;background-color:#fff0f0">:beauty</span>, <span style="color:#a60;background-color:#fff0f0">:magazines</span> ]
</code></pre></div><p>And the following user features on each request:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">age = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:age</span>, [ <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:young_adults</span>, <span style="color:#a60;background-color:#fff0f0">:adults</span>, <span style="color:#a60;background-color:#fff0f0">:elders</span> ]
sex = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:sex</span>, [ <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:female</span> ]
relation = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:relation</span>, [ <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span> ]
location = <span style="color:#036;font-weight:bold">RandomVariable</span>.new <span style="color:#a60;background-color:#fff0f0">:location</span>, [ <span style="color:#a60;background-color:#fff0f0">:us</span>, <span style="color:#a60;background-color:#fff0f0">:canada</span>, <span style="color:#a60;background-color:#fff0f0">:europe</span>, <span style="color:#a60;background-color:#fff0f0">:asia</span> ]
</code></pre></div><p>Let’s define the data model that resembles logically the one we could have in our real e-commerce application:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">LineItem</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:category</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(category)
<span style="color:#038">self</span>.category = category
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Basket</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:line_items</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(line_items)
<span style="color:#038">self</span>.line_items = line_items
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">User</span>
<span style="color:#080">attr_accessor</span> <span style="color:#a60;background-color:#fff0f0">:age</span>, <span style="color:#a60;background-color:#fff0f0">:sex</span>, <span style="color:#a60;background-color:#fff0f0">:relationship</span>, <span style="color:#a60;background-color:#fff0f0">:location</span>, <span style="color:#a60;background-color:#fff0f0">:baskets</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">initialize</span>(age, sex, relationship, location, baskets)
<span style="color:#038">self</span>.age = age
<span style="color:#038">self</span>.sex = sex
<span style="color:#038">self</span>.relationship = relationship
<span style="color:#038">self</span>.location = location
<span style="color:#038">self</span>.baskets = baskets
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>We want to utilize users’ baskets to infer the most probable category, given a set of user features. In our example, we can imagine that we’re offering authentication via Facebook. We can grab info about a user’s sex, location, age, and whether he or she is in a relationship. We want to find the category that’s chosen most often by users with a given set of features.</p>
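<p>Before formalizing anything with factors, the naive counting version of this idea can be sketched as follows (the <code>Purchase</code> struct and sample data are made up for illustration):</p>

```ruby
# Hypothetical flattened purchase records: one row per line item,
# carrying the buyer's traits and the item's category.
Purchase = Struct.new(:age, :sex, :relation, :location, :category)

# Among purchases made by users sharing the visitor's traits,
# return the category that occurs most often (nil if no match).
def best_category(purchases, age:, sex:, relation:, location:)
  counts = Hash.new(0)
  purchases.each do |purchase|
    next unless purchase.age == age && purchase.sex == sex &&
                purchase.relation == relation && purchase.location == location
    counts[purchase.category] += 1
  end
  counts.max_by { |_category, count| count }&.first
end

purchases = [
  Purchase.new(:teens, :male, :single, :us, :snacks),
  Purchase.new(:teens, :male, :single, :us, :snacks),
  Purchase.new(:teens, :male, :single, :us, :meat),
]
puts best_category(purchases, age: :teens, sex: :male, relation: :single, location: :us)
# prints "snacks"
```

This approach breaks down quickly: with many features, most exact combinations are rarely or never observed, which is the problem the probabilistic models in this series address.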
<p>As we don’t have any real data to play with, we’ll need a generator to create fake data with certain characteristics. Let’s first define a helper class with a method that will allow us to choose a value out of a given list of options along with their weights:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Generator</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">pick</span>(options)
items = options.inject([]) <span style="color:#080;font-weight:bold">do</span> |memo, keyval|
key, val = keyval
memo << <span style="color:#038">Array</span>.new(val, key)
memo
<span style="color:#080;font-weight:bold">end</span>.flatten
items.sample
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>With all the above we can define a random data generation model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Model</span>
<span style="color:#888"># Let's generate `num` users (1000 by default)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">generate</span>(num = <span style="color:#00d;font-weight:bold">1000</span>)
num.times.to_a.map <span style="color:#080;font-weight:bold">do</span> |user_index|
gen_user
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Returns a user with randomly selected traits and baskets</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_user</span>
age = gen_age
sex = gen_sex
rel = gen_rel(age)
loc = gen_loc
baskets = gen_baskets(age, sex)
<span style="color:#036;font-weight:bold">User</span>.new age, sex, rel, loc, baskets
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select a sex with 40% chance for getting a male</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_sex</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">male</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">female</span>: <span style="color:#00d;font-weight:bold">6</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select an age with 50% chance for getting a teen</span>
<span style="color:#888"># (among other options and weights)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_age</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">teens</span>: <span style="color:#00d;font-weight:bold">5</span>, <span style="color:#a60;background-color:#fff0f0">young_adults</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">adults</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">elders</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select a relationship status.</span>
<span style="color:#888"># Depend the chance of getting a given option on the user's age</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_rel</span>(age)
<span style="color:#080;font-weight:bold">case</span> age
<span style="color:#080;font-weight:bold">when</span> <span style="color:#a60;background-color:#fff0f0">:teens</span> <span style="color:#080;font-weight:bold">then</span> <span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">single</span>: <span style="color:#00d;font-weight:bold">7</span>, <span style="color:#a60;background-color:#fff0f0">in_relationship</span>: <span style="color:#00d;font-weight:bold">3</span>
<span style="color:#080;font-weight:bold">when</span> <span style="color:#a60;background-color:#fff0f0">:young_adults</span> <span style="color:#080;font-weight:bold">then</span> <span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">single</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">in_relationship</span>: <span style="color:#00d;font-weight:bold">6</span>
<span style="color:#080;font-weight:bold">else</span> <span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">single</span>: <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#a60;background-color:#fff0f0">in_relationship</span>: <span style="color:#00d;font-weight:bold">2</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select a location with 40% chance for getting the United States</span>
<span style="color:#888"># (among other options and weights)</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_loc</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">us</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">canada</span>: <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#a60;background-color:#fff0f0">europe</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">asia</span>: <span style="color:#00d;font-weight:bold">2</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly select 20 basket line items.</span>
<span style="color:#888"># Depend the chance of getting a given option on the user's age and sex</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_items</span>(age, sex)
num = <span style="color:#00d;font-weight:bold">20</span>
num.times.to_a.map <span style="color:#080;font-weight:bold">do</span> |i|
<span style="color:#080;font-weight:bold">if</span> (age == <span style="color:#a60;background-color:#fff0f0">:teens</span> || age == <span style="color:#a60;background-color:#fff0f0">:young_adults</span>) && sex == <span style="color:#a60;background-color:#fff0f0">:female</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">3</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">9</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">6</span>
<span style="color:#080;font-weight:bold">elsif</span> age == <span style="color:#a60;background-color:#fff0f0">:teens</span> && sex == <span style="color:#a60;background-color:#fff0f0">:male</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">4</span>
<span style="color:#080;font-weight:bold">elsif</span> (age == <span style="color:#a60;background-color:#fff0f0">:young_adults</span> || age == <span style="color:#a60;background-color:#fff0f0">:adults</span>) && sex == <span style="color:#a60;background-color:#fff0f0">:male</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">elsif</span> (age == <span style="color:#a60;background-color:#fff0f0">:young_adults</span> || age == <span style="color:#a60;background-color:#fff0f0">:adults</span>) && sex == <span style="color:#a60;background-color:#fff0f0">:female</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">3</span>
<span style="color:#080;font-weight:bold">elsif</span> age == <span style="color:#a60;background-color:#fff0f0">:elders</span> && sex == <span style="color:#a60;background-color:#fff0f0">:male</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">6</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">elsif</span> age == <span style="color:#a60;background-color:#fff0f0">:elders</span> && sex == <span style="color:#a60;background-color:#fff0f0">:female</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">8</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">2</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">else</span>
<span style="color:#036;font-weight:bold">Generator</span>.pick <span style="color:#a60;background-color:#fff0f0">veggies</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">snacks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">meat</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">drinks</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">beauty</span>: <span style="color:#00d;font-weight:bold">1</span>, <span style="color:#a60;background-color:#fff0f0">magazines</span>: <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>.map <span style="color:#080;font-weight:bold">do</span> |cat|
<span style="color:#036;font-weight:bold">LineItem</span>.new cat
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Randomly generate 5 baskets, with contents depending on the user's</span>
<span style="color:#888"># age and sex</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#b06;font-weight:bold">self</span>.<span style="color:#06b;font-weight:bold">gen_baskets</span>(age, sex)
num = <span style="color:#00d;font-weight:bold">5</span>
num.times.to_a.map <span style="color:#080;font-weight:bold">do</span> |i|
<span style="color:#036;font-weight:bold">Basket</span>.new gen_items(age, sex)
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><h3 id="where-is-the-complexity">Where is the complexity?</h3>
<p>The approach described above doesn’t seem that exciting or complex. Reading about probability theory as applied in machine learning usually means wading through quite a dense set of mathematical notions, and the field is still being actively researched. All of this suggests huge complexity—certainly not the simple definition of probability we got used to in high school.</p>
<p>The problem becomes a bit more complex once you consider the efficiency of computing the probabilities. In our example, the joint probability distribution—to fully describe the scenario—needs to assign a probability to each of 384 cases:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(<span style="color:#a60;background-color:#fff0f0">:veggies</span>, <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:us</span>) <span style="color:#888"># one of 384 combinations</span>
</code></pre></div><p>Given that a probability distribution has to sum up to 1, the last case can be fully inferred from the sum of all the others. This means the model needs 6 * 4 * 2 * 2 * 4 - 1 = 383 parameters: 6 categories, 4 age classes, 2 sexes, 2 relationship kinds, and 4 locations. Imagine adding one additional, 4-valued feature (a season). This would grow the number of parameters to <strong>1535</strong>. And this is a very simple training example. A model with close to 100 different features would have a number of parameters that is clearly unmanageable even on the biggest servers. This approach would also make it very painful to add new features to the model.</p>
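<p>As a quick sanity check, the parameter counts above can be reproduced with a few lines of Ruby (the cardinalities are the ones from the example; the snippet is just illustrative arithmetic):</p>

```ruby
# Cardinalities of the variables in the example scenario
cardinalities = { category: 6, age: 4, sex: 2, relationship: 2, location: 4 }

# The full joint distribution needs one value per combination; since all
# values must sum to 1, the last one is implied, hence the "- 1":
full_params = cardinalities.values.reduce(:*) - 1
# => 383

# Adding a 4-valued season feature multiplies the number of combinations:
with_season = cardinalities.values.reduce(:*) * 4 - 1
# => 1535
```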
<h3 id="very-simple-but-powerful-optimization-the-naive-bayes-model">Very simple but powerful optimization: The Naive Bayes model</h3>
<p>In this section I’m going to present the equation we’ll be working with when optimizing our example. I’m not going to explain the mathematics behind it, as you can easily read about it on, e.g., Wikipedia.</p>
<p>The approach is called the <strong><a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes model</a></strong>. It is used, e.g., in spam filters, and it has also been used in the medical diagnosis field.</p>
<p>It allows us to present the full probability distribution as a product of factors:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(cat, age, sex, rel, loc) == <span style="color:#038">p</span>(cat) * <span style="color:#038">p</span>(age | cat) * <span style="color:#038">p</span>(sex | cat) * <span style="color:#038">p</span>(rel | cat) * <span style="color:#038">p</span>(loc | cat)
</code></pre></div><p>Where, e.g., p(age | cat) represents the probability of a user being a certain age given that this user selects cat products most frequently. Each such factor is a conditional probability of a feature given the class. The above equation states that we can simplify the distribution into a product of much more easily manageable factors.</p>
<p>The category from our example is often called a <strong>class</strong> and the rest of random variables in the distribution are often called <strong>features</strong>.</p>
<p>In our example, the number of parameters we’ll need to manage when presenting the distribution in this form drops to:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">(<span style="color:#00d;font-weight:bold">6</span> - <span style="color:#00d;font-weight:bold">1</span>) + (<span style="color:#00d;font-weight:bold">6</span> * <span style="color:#00d;font-weight:bold">4</span> - <span style="color:#00d;font-weight:bold">1</span>) + (<span style="color:#00d;font-weight:bold">6</span> * <span style="color:#00d;font-weight:bold">2</span> - <span style="color:#00d;font-weight:bold">1</span>) + (<span style="color:#00d;font-weight:bold">6</span> * <span style="color:#00d;font-weight:bold">2</span> - <span style="color:#00d;font-weight:bold">1</span>) + (<span style="color:#00d;font-weight:bold">6</span> * <span style="color:#00d;font-weight:bold">4</span> - <span style="color:#00d;font-weight:bold">1</span>) == <span style="color:#00d;font-weight:bold">73</span>
</code></pre></div><p>That’s just around 19% of the original amount! Also, adding another variable (season) would only add 23 new parameters (compared to 1152 in the full distribution case).</p>
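<p>These counts can be double-checked with a bit of arithmetic as well (illustrative only; the sizes are the ones from the example):</p>

```ruby
cat = 6
feature_sizes = { age: 4, sex: 2, relationship: 2, location: 4 }

# p(cat) needs (6 - 1) values; each joint p(feature, cat) factor needs
# (|feature| * |cat| - 1) values:
nb_params = (cat - 1) + feature_sizes.values.sum { |n| cat * n - 1 }
# => 73

# Compared to the 383 parameters of the full joint distribution:
ratio = (nb_params / 383.0 * 100).round
# => 19 (percent)

# Adding a 4-valued season feature only adds one more joint factor:
season_extra = cat * 4 - 1
# => 23
```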
<p>The Naive Bayes model limits the number of parameters we have to manage but it comes with very strong assumptions about the variables involved: in our example, that the user features are conditionally independent given the resulting category. Later on I’ll show why this isn’t true in this case even though the results will still be quite okay.</p>
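<p>To see what this assumption means in practice, here is a toy sketch with made-up numbers (not related to the article’s data): for a distribution where the features are conditionally dependent, the Naive Bayes factorization visibly misestimates the true conditional probability:</p>

```ruby
# A toy joint distribution p(a, b, c) with made-up numbers, where the
# features a and b are NOT conditionally independent given the class c:
joint = {
  [true,  true,  true]  => 0.20, [true,  false, true]  => 0.05,
  [false, true,  true]  => 0.05, [false, false, true]  => 0.20,
  [true,  true,  false] => 0.10, [true,  false, false] => 0.15,
  [false, true,  false] => 0.15, [false, false, false] => 0.10,
}

p_c    = joint.sum { |(_a, _b, c), v| c ? v : 0 }      # ≈ 0.5
p_a_c  = joint.sum { |(a, _b, c), v| a && c ? v : 0 }  # ≈ 0.25
p_b_c  = joint.sum { |(_a, b, c), v| b && c ? v : 0 }  # ≈ 0.25
p_ab_c = joint[[true, true, true]]                     # 0.20

exact = p_ab_c / p_c                   # true p(a, b | c)  ≈ 0.4
naive = (p_a_c / p_c) * (p_b_c / p_c)  # NB factorization  ≈ 0.25
```

<p>Here the true p(a, b | c) is about 0.4, while the product of the per-feature conditionals gives about 0.25: exactly the kind of error the assumption can introduce when it doesn’t hold.</p>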
<h3 id="implementing-the-naive-bayes-model">Implementing the Naive Bayes model</h3>
<p>As we now have all the tools we need, let’s get back to probability theory to figure out how best to model Naive Bayes in terms of the Ruby building blocks we now have.</p>
<p>The approach says that under the assumptions we discussed we can approximate the original distribution to be the product of factors:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(cat, age, sex, rel, loc) = <span style="color:#038">p</span>(cat) * <span style="color:#038">p</span>(age | cat) * <span style="color:#038">p</span>(sex | cat) * <span style="color:#038">p</span>(rel | cat) * <span style="color:#038">p</span>(loc | cat)
</code></pre></div><p>Given the definition of the conditional probability we have that:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(a | b) = <span style="color:#038">p</span>(a, b) / <span style="color:#038">p</span>(b)
</code></pre></div><p>Thus, we can express the approximation as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(cat, age, sex, rel, loc) = <span style="color:#038">p</span>(cat) * ( <span style="color:#038">p</span>(age, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(sex, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(rel, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(loc, cat) / <span style="color:#038">p</span>(cat) )
</code></pre></div><p>And then simplify it even further as:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#038">p</span>(cat, age, sex, rel, loc) = <span style="color:#038">p</span>(age, cat) * ( <span style="color:#038">p</span>(sex, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(rel, cat) / <span style="color:#038">p</span>(cat) ) * ( <span style="color:#038">p</span>(loc, cat) / <span style="color:#038">p</span>(cat) )
</code></pre></div><p>Let’s define all the factors we will need:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ category ]
age_cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ age, category ]
sex_cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ sex, category ]
rel_cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ relation, category ]
loc_cat_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ location, category ]
</code></pre></div><p>Also, we want a full distribution to compare the results:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">full_dist = <span style="color:#036;font-weight:bold">Factor</span>.new [ category, age, sex, relation, location ]
</code></pre></div><p>Let’s generate 1000 random users and, looping through them and their baskets, adjust the probability distributions for combinations of product categories and user traits:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby"><span style="color:#036;font-weight:bold">Model</span>.generate(<span style="color:#00d;font-weight:bold">1000</span>).each <span style="color:#080;font-weight:bold">do</span> |user|
user.baskets.each <span style="color:#080;font-weight:bold">do</span> |basket|
basket.line_items.each <span style="color:#080;font-weight:bold">do</span> |item|
cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
age_cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
sex_cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex, <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
rel_cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">relation</span>: user.relationship, <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
loc_cat_dist.observe! <span style="color:#a60;background-color:#fff0f0">location</span>: user.location, <span style="color:#a60;background-color:#fff0f0">category</span>: item.category
full_dist.observe! <span style="color:#a60;background-color:#fff0f0">category</span>: item.category, <span style="color:#a60;background-color:#fff0f0">age</span>: user.age, <span style="color:#a60;background-color:#fff0f0">sex</span>: user.sex,
<span style="color:#a60;background-color:#fff0f0">relation</span>: user.relationship, <span style="color:#a60;background-color:#fff0f0">location</span>: user.location
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>We can now print the distributions as tables to have an insight about the data:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">[ cat_dist, age_cat_dist, sex_cat_dist, rel_cat_dist,
loc_cat_dist, full_dist ].each <span style="color:#080;font-weight:bold">do</span> |dist|
<span style="color:#038">puts</span> dist.table
<span style="color:#888"># Let's print out the sum of all entries to ensure the</span>
<span style="color:#888"># algorithm works well:</span>
<span style="color:#038">puts</span> dist.sum
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><p>This yields the following console output (the full distribution is truncated due to its size):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">+-----------+---------------------+
| category | value |
+-----------+---------------------+
| veggies | 0.10866 |
| snacks | 0.19830999999999863 |
| meat | 0.14769 |
| drinks | 0.10115999999999989 |
| beauty | 0.24632 |
| magazines | 0.19785999999999926 |
+-----------+---------------------+
0.9999999999999978
+--------------+-----------+----------------------+
| age | category | value |
+--------------+-----------+----------------------+
| teens | veggies | 0.02608000000000002 |
| teens | snacks | 0.11347999999999969 |
| teens | meat | 0.06282999999999944 |
| teens | drinks | 0.0263200000000002 |
| teens | beauty | 0.1390699999999995 |
| teens | magazines | 0.13322000000000103 |
| young_adults | veggies | 0.010250000000000023 |
| young_adults | snacks | 0.03676000000000003 |
| young_adults | meat | 0.03678000000000005 |
| young_adults | drinks | 0.03670000000000045 |
| young_adults | beauty | 0.05172999999999976 |
| young_adults | magazines | 0.035779999999999916 |
| adults | veggies | 0.026749999999999927 |
| adults | snacks | 0.03827999999999962 |
| adults | meat | 0.034600000000000505 |
| adults | drinks | 0.028190000000000038 |
| adults | beauty | 0.03892000000000036 |
| adults | magazines | 0.02225999999999998 |
| elders | veggies | 0.04558000000000066 |
| elders | snacks | 0.009790000000000047 |
| elders | meat | 0.013480000000000027 |
| elders | drinks | 0.009949999999999931 |
| elders | beauty | 0.016600000000000226 |
| elders | magazines | 0.006600000000000025 |
+--------------+-----------+----------------------+
1.0000000000000013
+--------+-----------+----------------------+
| sex | category | value |
+--------+-----------+----------------------+
| male | veggies | 0.03954000000000044 |
| male | snacks | 0.1132499999999996 |
| male | meat | 0.10851000000000031 |
| male | drinks | 0.073 |
| male | beauty | 0.023679999999999857 |
| male | magazines | 0.05901999999999993 |
| female | veggies | 0.06911999999999997 |
| female | snacks | 0.08506000000000069 |
| female | meat | 0.03918000000000006 |
| female | drinks | 0.02816000000000005 |
| female | beauty | 0.22264000000000062 |
| female | magazines | 0.13884000000000046 |
+--------+-----------+----------------------+
1.000000000000002
+-----------------+-----------+----------------------+
| relation | category | value |
+-----------------+-----------+----------------------+
| single | veggies | 0.07722000000000082 |
| single | snacks | 0.13090999999999794 |
| single | meat | 0.09317000000000061 |
| single | drinks | 0.059979999999999915 |
| single | beauty | 0.16317999999999971 |
| single | magazines | 0.13054000000000135 |
| in_relationship | veggies | 0.031440000000000336 |
| in_relationship | snacks | 0.06740000000000032 |
| in_relationship | meat | 0.054520000000000006 |
| in_relationship | drinks | 0.04118000000000009 |
| in_relationship | beauty | 0.08314000000000002 |
| in_relationship | magazines | 0.06732000000000182 |
+-----------------+-----------+----------------------+
1.000000000000003
+----------+-----------+----------------------+
| location | category | value |
+----------+-----------+----------------------+
| us | veggies | 0.04209000000000062 |
| us | snacks | 0.07534000000000109 |
| us | meat | 0.055059999999999984 |
| us | drinks | 0.03704000000000108 |
| us | beauty | 0.09879000000000099 |
| us | magazines | 0.07867999999999964 |
| canada | veggies | 0.027930000000000062 |
| canada | snacks | 0.05745999999999996 |
| canada | meat | 0.04288000000000003 |
| canada | drinks | 0.03078999999999948 |
| canada | beauty | 0.06397999999999997 |
| canada | magazines | 0.053959999999999675 |
| europe | veggies | 0.013110000000000132 |
| europe | snacks | 0.0223200000000001 |
| europe | meat | 0.01730000000000005 |
| europe | drinks | 0.011859999999999964 |
| europe | beauty | 0.025490000000000183 |
| europe | magazines | 0.020920000000000164 |
| asia | veggies | 0.02552999999999989 |
| asia | snacks | 0.04319000000000044 |
| asia | meat | 0.03244999999999966 |
| asia | drinks | 0.02147000000000005 |
| asia | beauty | 0.05805999999999953 |
| asia | magazines | 0.0442999999999999 |
+----------+-----------+----------------------+
1.0000000000000029
+-----------+--------------+--------+-----------------+----------+------------------------+
| category | age | sex | relation | location | value |
+-----------+--------------+--------+-----------------+----------+------------------------+
| veggies | teens | male | single | us | 0.0035299999999999936 |
| veggies | teens | male | single | canada | 0.0024500000000000073 |
| veggies | teens | male | single | europe | 0.0006999999999999944 |
| veggies | teens | male | single | asia | 0.0016699999999999899 |
| veggies | teens | male | in_relationship | us | 0.001340000000000006 |
| veggies | teens | male | in_relationship | canada | 0.0010099999999999775 |
| veggies | teens | male | in_relationship | europe | 0.0006499999999999989 |
| veggies | teens | male | in_relationship | asia | 0.000819999999999994 |
(... many rows ...)
| magazines | elders | male | in_relationship | asia | 0.00012000000000000163 |
| magazines | elders | female | single | us | 0.0007399999999999966 |
| magazines | elders | female | single | canada | 0.0007000000000000037 |
| magazines | elders | female | single | europe | 0.0003199999999999965 |
| magazines | elders | female | single | asia | 0.0005899999999999999 |
| magazines | elders | female | in_relationship | us | 0.0004899999999999885 |
| magazines | elders | female | in_relationship | canada | 0.00027000000000000114 |
| magazines | elders | female | in_relationship | europe | 0.00012000000000000014 |
| magazines | elders | female | in_relationship | asia | 0.00012000000000000014 |
+-----------+--------------+--------+-----------------+----------+------------------------+
1.0000000000000004
</code></pre></div><p>Let’s define a Proc for inferring categories based on user traits as evidence:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">infer = -> (age, sex, rel, loc) <span style="color:#080;font-weight:bold">do</span>
<span style="color:#888"># Let's map through the possible categories and the probability</span>
<span style="color:#888"># values the distributions assign to them:</span>
all = category.values.map <span style="color:#080;font-weight:bold">do</span> |cat|
pc = cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">category</span>: cat
pac = age_cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">category</span>: cat
psc = sex_cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">sex</span>: sex, <span style="color:#a60;background-color:#fff0f0">category</span>: cat
prc = rel_cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">relation</span>: rel, <span style="color:#a60;background-color:#fff0f0">category</span>: cat
plc = loc_cat_dist.value_for <span style="color:#a60;background-color:#fff0f0">location</span>: loc, <span style="color:#a60;background-color:#fff0f0">category</span>: cat
{ <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">value</span>: (pac * psc/pc * prc/pc * plc/pc) }
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Let's do the same with the full distribution to be able to compare</span>
<span style="color:#888"># the results:</span>
all_full = category.values.map <span style="color:#080;font-weight:bold">do</span> |cat|
val = full_dist.value_for <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">age</span>: age, <span style="color:#a60;background-color:#fff0f0">sex</span>: sex,
<span style="color:#a60;background-color:#fff0f0">relation</span>: rel, <span style="color:#a60;background-color:#fff0f0">location</span>: loc
{ <span style="color:#a60;background-color:#fff0f0">category</span>: cat, <span style="color:#a60;background-color:#fff0f0">value</span>: val }
<span style="color:#080;font-weight:bold">end</span>
<span style="color:#888"># Here we're getting the most probable categories based on the</span>
<span style="color:#888"># Naive Bayes distribution approximation model and based on the full</span>
<span style="color:#888"># distribution:</span>
win = all.max { |a, b| a[<span style="color:#a60;background-color:#fff0f0">:value</span>] <=> b[<span style="color:#a60;background-color:#fff0f0">:value</span>] }
win_full = all_full.max { |a, b| a[<span style="color:#a60;background-color:#fff0f0">:value</span>] <=> b[<span style="color:#a60;background-color:#fff0f0">:value</span>] }
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"Best match for </span><span style="color:#33b;background-color:#fff0f0">#{</span>[ age, sex, rel, loc ]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">:"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">" </span><span style="color:#33b;background-color:#fff0f0">#{</span>win[<span style="color:#a60;background-color:#fff0f0">:category</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> => </span><span style="color:#33b;background-color:#fff0f0">#{</span>win[<span style="color:#a60;background-color:#fff0f0">:value</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">"Full pointed at:"</span>
<span style="color:#038">puts</span> <span style="color:#d20;background-color:#fff0f0">" </span><span style="color:#33b;background-color:#fff0f0">#{</span>win_full[<span style="color:#a60;background-color:#fff0f0">:category</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#d20;background-color:#fff0f0"> => </span><span style="color:#33b;background-color:#fff0f0">#{</span>win_full[<span style="color:#a60;background-color:#fff0f0">:value</span>]<span style="color:#33b;background-color:#fff0f0">}</span><span style="color:#04d;background-color:#fff0f0">\n\n</span><span style="color:#d20;background-color:#fff0f0">"</span>
<span style="color:#080;font-weight:bold">end</span>
</code></pre></div><h3 id="the-results">The results</h3>
<p>We’re ready now to use the model and see how well the Naive Bayes model performs in this particular scenario:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-ruby" data-lang="ruby">infer.call <span style="color:#a60;background-color:#fff0f0">:teens</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:us</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:young_adults</span>, <span style="color:#a60;background-color:#fff0f0">:male</span>, <span style="color:#a60;background-color:#fff0f0">:single</span>, <span style="color:#a60;background-color:#fff0f0">:asia</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:adults</span>, <span style="color:#a60;background-color:#fff0f0">:female</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span>, <span style="color:#a60;background-color:#fff0f0">:europe</span>
infer.call <span style="color:#a60;background-color:#fff0f0">:elders</span>, <span style="color:#a60;background-color:#fff0f0">:female</span>, <span style="color:#a60;background-color:#fff0f0">:in_relationship</span>, <span style="color:#a60;background-color:#fff0f0">:canada</span>
</code></pre></div><p>This gave the following results on the console:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-plain" data-lang="plain">Best match for [:teens, :male, :single, :us]:
snacks => 0.016252573282200262
Full pointed at:
snacks => 0.01898999999999971
Best match for [:young_adults, :male, :single, :asia]:
meat => 0.0037455794492659757
Full pointed at:
meat => 0.0017000000000000016
Best match for [:adults, :female, :in_relationship, :europe]:
beauty => 0.0012287311061725868
Full pointed at:
beauty => 0.0003000000000000026
Best match for [:elders, :female, :in_relationship, :canada]:
veggies => 0.002156365730474441
Full pointed at:
veggies => 0.0013500000000000022
</code></pre></div><p>That’s quite impressive! Even though we’re using a simplified model to approximate the original distribution, the algorithm managed to infer the correct category in every case. Notice also that the probability estimates differ only by a couple of cases in 1000.</p>
<p>An approximation like this would certainly be very useful in a more complex e-commerce scenario, where the number of evidence variables is big enough to make the full distribution unmanageable. There are use cases, though, where a couple of errors in 1000 cases would be too many—the traditional example is medical diagnosis. There are also cases where the number of errors would be much greater, simply because the Naive Bayes assumption of conditional independence of variables is not always a fair assumption. Is there a way to improve?</p>
<p>The Naive Bayes assumption says that the distribution factorizes the way we did it <strong>only if the features are conditionally independent given the category</strong>. The notion of <strong>conditional independence</strong> (apart from the formal mathematical definition) says that if variables a and b are conditionally independent given c, then once we know the value of c, no additional information about b can alter our knowledge about a. In our example, knowing the category, let’s say :beauty, doesn’t mean that, e.g., sex is independent of age. In real-world examples it’s often very hard to find a use case for Naive Bayes that follows the assumption in all cases.</p>
<p>There are alternative approaches that allow us to apply assumptions that more closely follow the chosen data set. We will explore these in future articles, building on top of what we saw here.</p>
RailsConf 2014 on Machine Learning
https://www.endpointdev.com/blog/2014/04/railsconf-2014-on-machine-learning/
2014-04-24 · Steph Skardal
<p>This year at RailsConf 2014 there are workshop tracks: focused sessions double or triple the length of a normal talk. Today I attended <em>Machine Learning for Fun and Profit</em> by <a href="https://twitter.com/johnashenfelter">John Paul Ashenfelter</a>. Some analytics tools are good at providing averages over data (e.g., Google Analytics), but averages don’t tell you the specific story or context of your users, which can be valuable and actionable. In his story-telling approach, John covered several stories of generating data via machine learning techniques in Ruby.</p>
<h3 id="make-a-plan">Make a Plan</h3>
<p>First, one must formulate a plan or a goal for which to collect actionable data. More likely than not, the goal is to make money, and the hope is that machine learning can help you find actionable data to make more money! John walked through several use cases and example code with machine learning, and I’ll add a bit of ecommerce context to each story below.</p>
<h3 id="act-1-describe-your-users">Act 1: Describe your Users</h3>
<p>First, John talked about a few tools used for describing your users. In the context of his story, he wanted to figure out what gender ratio of shirts to order for the company. He used the <a href="https://github.com/bmuller/sexmachine">sexmachine gem</a>, which is based on census data, to predict the sex of a person based on a first name. The first names of all your users would be passed into this gem to segment users by gender, and from there you may be able to take action (e.g. order shirts in the estimated gender ratio).</p>
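<p>A minimal sketch of the idea in plain Ruby, with a tiny hypothetical name table standing in for the gem’s census data:</p>

```ruby
# Hypothetical name-to-gender table; the sexmachine gem backs this kind of
# lookup with real census frequency data.
NAME_GENDER = {
  "james" => :male, "mary" => :female, "john" => :male,
  "linda" => :female, "robert" => :male, "susan" => :female
}

def guess_gender(first_name)
  NAME_GENDER.fetch(first_name.downcase, :unknown)
end

users = ["Mary", "John", "Linda", "Robert", "Alex"]
ratio = users.group_by { |name| guess_gender(name) }
             .transform_values(&:size)
puts ratio.inspect
```

<p>Here "Alex" falls through as <code>:unknown</code>; the gem handles ambiguous names with categories like <code>:mostly_male</code> and <code>:mostly_female</code> instead.</p>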
<p>Next, John covered geolocation. John wanted to know how to scale support hours for customers using the product, likely a very common reason for geolocation in any SaaS or customer-centric tool. His solution uses a free IP address lookup service, Python, and Go, and free <a href="https://www.maxmind.com/en/geoip2-services-and-databases">Maxmind data</a>. The example code is available <a href="https://github.com/johnpaulashenfelter/railsconf2014-ml/tree/master/ex2_geolocation">here</a>.</p>
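<p>The support-hours question boils down to counting users per timezone once each IP is resolved. A plain Ruby sketch of that step, with a made-up lookup table standing in for a GeoIP database query:</p>

```ruby
# Hypothetical IP-to-UTC-offset mapping; real code would resolve each
# address against a GeoIP database such as MaxMind's.
IP_TO_OFFSET = {
  "203.0.113.5"  => 10,  # Sydney
  "198.51.100.7" => -5,  # New York
  "192.0.2.9"    => -5,  # Toronto
  "198.51.100.8" => 1    # Berlin
}

# Count active users per UTC offset to see where support coverage is needed.
coverage = IP_TO_OFFSET.values.tally.sort_by { |_, count| -count }
coverage.each { |offset, count| puts format("UTC%+d: %d users", offset, count) }
```

<p>The top entry tells you which timezone band most needs staffed support hours.</p>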
<p>With these tools, gender assignment &amp; geolocation reveal basic but valuable information about your users. In the ecommerce space, determining gender ratios and geolocation may help determine the target of marketing and/or product development efforts, for example targeting a specific marketing message to a female East Coast demographic.</p>
<h3 id="act-2-clustering">Act 2: Clustering</h3>
<p>In the next step, John talked about using machine learning to cluster users. The context John provided was to cluster users into three groups: casual users, super users, and professional users, to potentially learn more about the super users and how to get more users into that group. An ecommerce story might be to cluster users into amount-spent buckets which have rewards at higher levels, to incentivize users to spend more money to climb the hierarchy for more rewards. Here John used the <a href="https://github.com/SergioFierens/ai4r">ai4r</a> gem, which uses <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means clustering</a> to group users. In as few words as possible, k-means clustering randomly creates X clusters (step 1), computes the center of each cluster (step 2), moves nodes if they are closer to a different cluster’s center (step 3), and repeats steps 2 &amp; 3 until no nodes have been moved. The actual code is quite simple with the gem. Alternative clustering tools are <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clusterers</a> or <a href="https://www.google.com/search?q=divisive+hierarchical+clustering">divisive hierarchical clustering</a>, which will yield slightly different results. John also mentioned that there are much better numerical tools like Python, R, Octave/Matlab, and Mathematica.</p>
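<p>The loop described above can be sketched in a few lines of plain Ruby: a simplified one-dimensional version over amount-spent figures (made-up numbers), not the ai4r implementation:</p>

```ruby
# Minimal 1-D k-means: assign points to nearest center, recompute centers
# as cluster means, repeat until the centers stop moving.
def kmeans(points, centers)
  loop do
    # Assignment step: group each point with its nearest center.
    clusters = points.group_by { |p| centers.min_by { |c| (p - c).abs } }
    # Update step: each center becomes the mean of its cluster.
    new_centers = centers.map do |c|
      members = clusters[c]
      members ? members.sum.to_f / members.size : c
    end
    return [new_centers, clusters.values] if new_centers == centers
    centers = new_centers
  end
end

spend = [5, 8, 12, 95, 102, 480, 510]  # dollars spent per user
# Seed centers roughly at casual / super / professional spend levels.
centers, groups = kmeans(spend, [0.0, 100.0, 500.0])
puts centers.inspect
```

<p>With these seeds the users fall cleanly into three spend buckets; ai4r wraps the same loop (including random initialization) behind a one-call API.</p>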
<h3 id="act-3-similarity">Act 3: Similarity</h3>
<p>The third and final topic John covered was determining similarity between users, or perhaps finding other users similar to user X. The context of this was to understand how people collaborate and spread knowledge. In the ecommerce space, the obvious use case here is building a product recommendation engine, e.g. recommending products to a user based on what they have bought, are looking at, or have in their cart. John didn’t dive into the specific linear algebra math here (linear algebra is hard!), but he provided <a href="https://github.com/johnpaulashenfelter/railsconf2014-ml/tree/master/ex4_similarity">example code</a> using the <a href="https://github.com/quix/linalg">linalg</a> gem that does much of the hard work for you.</p>
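<p>To give a taste of the underlying math, here is a small plain-Ruby sketch of one common similarity measure, cosine similarity, applied to hypothetical purchase-count vectors (the linalg gem provides the general-purpose linear algebra for the real thing):</p>

```ruby
# Cosine similarity: the cosine of the angle between two vectors,
# 1.0 for identical direction, 0.0 for orthogonal (nothing in common).
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (mag.(a) * mag.(b))
end

# Rows: users; columns: purchase counts per product (hypothetical data).
purchases = {
  "alice" => [2, 0, 1, 3],
  "bob"   => [1, 0, 1, 2],
  "carol" => [0, 4, 0, 0]
}

# Find the user most similar to alice by purchase behavior.
target = purchases["alice"]
most_similar = purchases.reject { |name, _| name == "alice" }
                        .max_by { |_, vec| cosine_similarity(target, vec) }
puts most_similar.first
```

<p>Bob buys the same products in similar proportions, so he scores highest; recommending to Alice what Bob bought (and she hasn’t) is the basic collaborative-filtering move.</p>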
<h3 id="conclusion">Conclusion</h3>
<p>The conclusion of this workshop was again to share Ruby tools that can help solve problems concerning your users and business. It’s very important to have a plan and/or goal to strive for, and to determine actionable data analysis and metrics to help reach those goals.</p>