<h2><a href="https://www.endpointdev.com/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/">Implementing SummAE neural text summarization with a denoising auto-encoder</a></h2>
<p>By Kamil Ciemniewski · May 28, 2020</p>
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/book.jpg" alt="Book open on lawn with dandelions"></p>
<p>If there’s any problem space in machine learning with no shortage of (unlabelled) data to train on, it’s natural language processing (NLP).</p>
<p>In this article, I’d like to take on the challenge of implementing a paper that came out of Google Research in late 2019. It’s going to be a fun trip into the world of neural text summarization. We’re going to go through the basics and the coding, and then we’ll look at what the results actually are in the end.</p>
<p>The paper we’re going to implement here is: <a href="https://arxiv.org/abs/1910.00998">Peter J. Liu, Yu-An Chung, Jie Ren (2019) SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders</a>.</p>
<p>Here’s the paper’s abstract:</p>
<blockquote>
<p>We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training, evaluation code, and best model weights are open-sourced.</p>
</blockquote>
<h3 id="preliminaries">Preliminaries</h3>
<p>Before we go any further, let’s talk a little bit about neural summarization in general. There are two main approaches to it:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Automatic_summarization#Extraction-based_summarization">Extractive</a></li>
<li><a href="https://en.wikipedia.org/wiki/Automatic_summarization#Abstraction-based_summarization">Abstractive</a></li>
</ul>
<p>The first approach makes the model “focus” on the most important parts of the longer text, extracting them to form a summary.</p>
<p>Let’s take a recent article, <a href="/blog/2020/05/shopify-product-creation/">“Shopify Admin API: Importing Products in Bulk”</a>, by one of my great co-workers, <a href="/team/patrick-lewis/">Patrick Lewis</a>, as an example and see what the extractive summarization would look like. Let’s take the first two paragraphs:</p>
<blockquote>
<p>I recently worked on an interesting project for a store owner who was facing a daunting task: he had an inventory of hundreds of thousands of Magic: The Gathering (MTG) cards that he wanted to sell online through his Shopify store. The logistics of tracking down artwork and current market pricing for each card made it impossible to do manually.</p>
<p>My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card in Shopify. The resulting project turned what would have been a months- or years-long task into a bulk upload that only took a few hours to complete and allowed the store owner to immediately start selling his inventory online. The online store launch turned out to be even more important than initially expected due to current closures of physical stores.</p>
</blockquote>
<p>An extractive model could summarize it as follows:</p>
<blockquote>
<p>I recently worked on an interesting project for a store owner who had an inventory of hundreds of thousands of cards that he wanted to sell through his store. The logistics and current pricing for each card made it impossible to do manually. My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card. The store launch turned out to be even more important than expected due to current closures of physical stores.</p>
</blockquote>
<p>See how it does the copying and pasting? The big advantage of these types of models is that they are generally easier to create and the resulting summaries tend to faithfully reflect the facts included in the source.</p>
<p>The downside though is that it’s not how a human would do it. We do a lot of paraphrasing, for instance. We use different words and tend to form sentences less rigidly following the original ones. The need for the summaries to feel more natural made the second type — abstractive — into this subfield’s holy grail.</p>
<h3 id="datasets">Datasets</h3>
<p>The paper’s authors used the so-called <a href="https://cs.rochester.edu/nlp/rocstories/">“ROCStories” dataset</a> (<a href="https://www.aclweb.org/anthology/P18-2119/">“Tackling The Story Ending Biases in The Story Cloze Test”. Rishi Sharma, James Allen, Omid Bakhshandeh, Nasrin Mostafazadeh. In Proceedings of the 2018 Conference of the Association for Computational Linguistics (ACL), 2018</a>).</p>
<p>In my experiments, I’ve also tried the model against one that’s quite a bit more difficult: <a href="https://github.com/mahnazkoupaee/WikiHow-Dataset">WikiHow</a> (<a href="https://arxiv.org/abs/1810.09305">Mahnaz Koupaee, William Yang Wang (2018) WikiHow: A Large Scale Text Summarization Dataset</a>).</p>
<h4 id="rocstories">ROCStories</h4>
<p>The dataset consists of 98,162 stories, each one consisting of 5 sentences. It’s incredibly clean. The only step I needed to take was to split the stories between the train, eval, and test sets.</p>
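<p>A sketch of such a split (the proportions and the deterministic seed here are my choices, not the paper’s):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import random

def split_stories(stories, eval_frac=0.05, test_frac=0.05, seed=42):
    # Shuffle deterministically so the split is reproducible
    stories = stories.copy()
    random.Random(seed).shuffle(stories)

    n_eval = int(len(stories) * eval_frac)
    n_test = int(len(stories) * test_frac)

    return {
        "eval": stories[:n_eval],
        "test": stories[n_eval:n_eval + n_test],
        "train": stories[n_eval + n_test:],
    }
</code></pre></div>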
<p>Example stories:</p>
<p>Example 1:</p>
<blockquote>
<p>My retired coworker turned 69 in July. I went net surfing to get her a gift. She loves Diana Ross. I got two newly released cds and mailed them to her. She sent me an email thanking me.</p>
</blockquote>
<p>Example 2:</p>
<blockquote>
<p>Tom alerted the government he expected a guest. When she didn’t come he got in a lot of trouble. They talked about revoking his doctor’s license. And charging him a huge fee! Tom’s life was destroyed because of his act of kindness.</p>
</blockquote>
<p>Example 3:</p>
<blockquote>
<p>I went to see the doctor when I knew it was bad. I hadn’t eaten in nearly a week. I told him I felt afraid of food in my body. He told me I was developing an eating disorder. He instructed me to get some help.</p>
</blockquote>
<h4 id="wikihow">Wikihow</h4>
<p>This is one of the most challenging openly available datasets for neural summarization. It consists of more than 200,000 long-sequence pairs of text + headline scraped from <a href="https://www.wikihow.com/Main-Page">WikiHow’s website</a>.</p>
<p>Some examples:</p>
<p>Text:</p>
<blockquote>
<p>One easy way to conserve water is to cut down on your shower time. Practice cutting your showers down to 10 minutes, then 7, then 5. Challenge yourself to take a shorter shower every day. Washing machines take up a lot of water and electricity, so running a cycle for a couple of articles of clothing is inefficient. Hold off on laundry until you can fill the machine. Avoid letting the water run while you’re brushing your teeth or shaving. Keep your hoses and faucets turned off as much as possible. When you need them, use them sparingly.</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>Take quicker showers to conserve water. Wait for a full load of clothing before running a washing machine. Turn off the water when you’re not using it.</p>
</blockquote>
<p>The main challenge for the summarization model here is that the headline <strong>was actually created by humans</strong> and is not just “extracted” from the text. Any model performing well on this dataset needs to model the language itself really well. The headlines can still serve as references for computing evaluation metrics, but traditional metrics like <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> are bound to miss the point here: they reward word overlap rather than meaning.</p>
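<p>A toy ROUGE-1 computed by hand makes the problem concrete (a heavily simplified sketch; real ROUGE implementations add stemming and other details):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">from collections import Counter

def rouge_1_f1(candidate, reference):
    # Unigram-overlap F1 between a candidate summary and a reference
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# A faithful paraphrase still scores poorly on word overlap:
print(rouge_1_f1("take shorter showers to save water",
                 "cut down your shower time to conserve water"))
# ~0.29, even though the meaning is nearly the same
</code></pre></div>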
<h3 id="basics-of-the-sequence-to-sequence-modeling">Basics of the sequence-to-sequence modeling</h3>
<p>Most sequence-to-sequence models are based on the “next token prediction” workflow.</p>
<p>The general idea can be expressed with P(token | context) — where the task is to model this conditional probability distribution. The “context” here depends on the approach.</p>
<p>Those models are also called “auto-regressive” because they need to consume their own predictions from previous steps during the inference:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>], context)
<span style="color:#888"># "I"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>], context)
<span style="color:#888"># "love"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>], context)
<span style="color:#888"># "biking"</span>
predict([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>, <span style="color:#d20;background-color:#fff0f0">"biking"</span>], context)
<span style="color:#888"># "<end>"</span>
</code></pre></div><h4 id="naively-simple-modeling-markov-model">Naively simple modeling: Markov Model</h4>
<p>In this model, the approach is to make a bold assumption: that the probability of the next token is conditioned <strong>only</strong> on the previous token.</p>
<p>The Markov Model is elegantly introduced in the blog post <a href="https://medium.com/ymedialabs-innovation/next-word-prediction-using-markov-model-570fc0475f96">Next Word Prediction using Markov Model</a>.</p>
<p>Why is it naive? Because we know that the probability of the word “love” following the word “I” really depends on the broader context, not on “I” alone. A model that always outputs “roses” after “love” would miss the best word more often than not.</p>
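<p>A minimal sketch of such a bigram model (the corpus here is made up purely for illustration):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">from collections import Counter, defaultdict

corpus = "i love biking . i love roses . i love biking during the summer .".split()

# Count how often each token follows each other token
transitions = defaultdict(Counter)
for previous, current in zip(corpus, corpus[1:]):
    transitions[previous][current] += 1

def predict_next(token):
    # P(next | token): pick the most frequent follower, ignoring all context
    followers = transitions[token]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("love"))  # "biking", no matter what the broader context is
</code></pre></div>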
<h4 id="modeling-with-neural-networks">Modeling with neural networks</h4>
<p>Usually, sequence-to-sequence neural network models consist of two parts:</p>
<ul>
<li>encoder</li>
<li>decoder</li>
</ul>
<p>The encoder is there to build a “gist” representation of the input sequence. The gist and the previous token become our “context” for the inference. This fits in well with the P(token | context) modeling I described above. That distribution can be expressed more precisely as P(token | previous; gist).</p>
<p>There are other approaches too, one of them being <a href="https://arxiv.org/pdf/2001.04063v2.pdf">ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training (Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou, 2020)</a>. The difference in that approach is predicting n tokens ahead at once.</p>
<h3 id="teacher-forcing">Teacher-forcing</h3>
<p>Let’s see how we could go about teaching the model the next token’s conditional distribution.</p>
<p>Imagine that the model’s parameters aren’t performing well yet. We have an input sequence of: <code>["<start>", "I", "love", "biking", "during", "the", "summer", "<end>"]</code>. We start the training by giving the model the first token:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, context])
<span style="color:#888"># "I"</span>
</code></pre></div><p>Great, now let’s ask it for another one:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>], context])
<span style="color:#888"># "wonder"</span>
</code></pre></div><p>Hmmm that’s not what we wanted, but let’s naively continue:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"wonder"</span>], context)
<span style="color:#888"># "why"</span>
</code></pre></div><p>We could continue gathering predictions and compute the loss at the end. The loss would really only be able to tell it about the first mistake (“love” vs. “wonder”); the rest of the errors would just accumulate from here. This would hinder the learning considerably, adding in the noise from the accumulated errors.</p>
<p>There’s a better approach called <a href="https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/">Teacher Forcing</a>. In this approach, you’re telling the model the true answer after each of its guesses. The last example would look like the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">model([<span style="color:#d20;background-color:#fff0f0">"<start>"</span>, <span style="color:#d20;background-color:#fff0f0">"I"</span>, <span style="color:#d20;background-color:#fff0f0">"love"</span>], context)
<span style="color:#888"># "watching"</span>
</code></pre></div><p>You’d continue the process, feeding it the full input sequence and the loss term would be computed based on all its guesses.</p>
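<p>In code, teacher forcing means the loss can be computed in a single pass over the true sequence. Here’s a minimal, self-contained sketch with made-up shapes (not the article’s actual model):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

# Pretend the decoder, fed the TRUE prefix at every position, produced
# these logits in one pass: (batch, seq_len, vocab_size)
logits = torch.randn(1, 5, 10, requires_grad=True)
targets = torch.tensor([[4, 7, 1, 3, 2]])  # the true "next tokens"

# One cross-entropy term per position; a mistake at step t doesn't
# corrupt the input at step t+1, because the inputs are the ground truth
loss = F.cross_entropy(logits.reshape(-1, 10), targets.reshape(-1))
loss.backward()
</code></pre></div>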
<h3 id="compute-friendly-representation-for-tokens-and-gists">Compute-friendly representation for tokens and gists</h3>
<p>Some readers might want to skip this section. I’d like to quickly describe here the concepts of the <a href="https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d">latent space</a> and <a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa">vector embeddings</a>, to keep matters relatively palatable for a broader audience.</p>
<h4 id="representing-words-naively">Representing words naively</h4>
<p>How do we turn the words (strings) into numbers that we can input into our machine learning models? A software developer might think of assigning each word a unique integer. This works well for databases, but in machine learning models the fact that integers follow one another means that they encode a relation (which one follows which, and at what distance). This doesn’t work well for almost any problem in data science.</p>
<p>Traditionally, the problem is solved by “<a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/">one-hot encoding</a>”. This means that we’re turning our integers into vectors, where every value is zero except the one at the index corresponding to the encoded value (using zero-based indexing). Example: <code>3 => [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]</code> when the total number of “integers” (classes) to encode is 10.</p>
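<p>In PyTorch, for example, this encoding is a one-liner:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

F.one_hot(torch.tensor(3), num_classes=10)
# tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
</code></pre></div>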
<p>This is better as it breaks the ordering and distancing assumptions. It doesn’t encode anything about the words, though, except the arbitrary number we’ve decided to assign to them. We now don’t have the ordering but we also don’t have any distance. Empirically though we just know that the word “love” is much closer to “enjoy” than it is to “helicopter”.</p>
<h4 id="a-better-approach-word-embeddings">A better approach: word embeddings</h4>
<p>How could we keep our vector representation (as in one-hot encoding) but also introduce the distance? I already touched on this concept in my <a href="/blog/2018/07/recommender-mxnet/">post about the simple recommender system</a>. The idea is to have a vector of floating-point values so that the closer the words are in their meaning, the smaller the angle is between them. We can easily compute a metric following this logic by measuring the <a href="http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/">cosine distance</a>. This way, the word representations are easy to feed into the encoder, and they already carry a lot of information in themselves.</p>
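<p>A quick sketch of that intuition, with tiny made-up vectors standing in for real, learned embeddings:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

# Tiny made-up vectors; real embeddings are learned and much wider
love = torch.tensor([0.9, 0.8, 0.1])
enjoy = torch.tensor([0.85, 0.75, 0.2])
helicopter = torch.tensor([0.1, 0.2, 0.95])

# Cosine similarity: close to 1.0 for vectors pointing the same way
print(F.cosine_similarity(love, enjoy, dim=0))        # ~0.99
print(F.cosine_similarity(love, helicopter, dim=0))   # ~0.29
</code></pre></div>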
<h4 id="not-only-words">Not only words</h4>
<p>Can we only have vectors for words? Couldn’t we have vectors for paragraphs, so that the closer they are in their meaning, the smaller some vector-space metric is between them? Of course we can. This is, in fact, what will allow this article’s model to encode the “gist” we talked about. The “encoder” part of the model is going to learn the most convenient way of turning the input sequence into a vector of floating-point numbers.</p>
<h3 id="auto-encoders">Auto-encoders</h3>
<p>We’re slowly approaching the model from the paper. We still have one concept that’s vital to understand in order to get why the model is going to work.</p>
<p>Up until now, we talked about the following structure of the typical sequence-to-sequence neural network model:</p>
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/seq-to-seq.png" alt="Sequence To Sequence Neural Nets"></p>
<p>This is true e.g. for translation models where the input sequence is in English and the output is in Greek. It’s also true for this article’s model <strong>during the inference</strong>.</p>
<p>What if we made the input and the output the same sequence? We’d turn the model into a so-called <a href="https://en.wikipedia.org/wiki/Autoencoder">auto-encoder</a>.</p>
<p>The output of course isn’t all that useful — we already know what the input sequence is. The true value is in the model’s ability to encode the input into a <strong>gist</strong>.</p>
<h4 id="adding-the-noise">Adding the noise</h4>
<p>A very interesting type of auto-encoder is the <a href="https://towardsdatascience.com/denoising-autoencoders-explained-dbb82467fc2">denoising auto-encoder</a>. The idea is that the input sequence gets randomly corrupted and the network learns to still produce a good gist and to reconstruct the sequence as it was before the corruption. This makes the training “teach” the network the deeper connections in the data, instead of letting it just “memorize” as much as it can.</p>
<h3 id="the-summae-model">The SummAE model</h3>
<p>We’re now ready to talk about the architecture from the paper. Given what we’ve already learned, this is going to be very simple. The SummAE model is just a denoising auto-encoder that is being trained a special way.</p>
<h4 id="auto-encoding-paragraphs-and-sentences">Auto-encoding paragraphs and sentences</h4>
<p>The authors were training the model on both single sentences and full paragraphs. In all cases the task was to reproduce the uncorrupted input.</p>
<p>The first part of the approach is about having two special “start tokens” to signal the mode: paragraph vs. sentence. In my code, I’ve used “<start-full>” and “<start-short>”.</p>
<p>During training, the model learns, for each position in the sequence, the conditional distribution of the next token given the mode token and all the tokens that precede it.</p>
<h4 id="adding-the-noise-1">Adding the noise</h4>
<p>The sentences are simply concatenated to form a paragraph. The input then gets corrupted at random (see the sketch after this list) by means of:</p>
<ul>
<li>masking the input tokens</li>
<li>shuffling the order of the sentences within the paragraph</li>
</ul>
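<p>A minimal sketch of that corruption (the masking probability and the mask token are my assumptions, not values from the paper):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import random

def corrupt(sentences, mask_prob=0.15, mask_token="<mask>"):
    # Shuffle the order of the sentences within the paragraph
    shuffled = random.sample(sentences, len(sentences))

    # Mask input tokens at random
    tokens = " ".join(shuffled).split()
    tokens = [mask_token if random.random() < mask_prob else t for t in tokens]

    # Prepend the paragraph-mode token described above
    return ["<start-full>"] + tokens
</code></pre></div>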
<p>The authors claim that the latter helped them solve the issue of the network just memorizing the first sentence. What I have found, though, is that this model is generally prone to memorizing concrete sentences from the paragraph. Sometimes it’s the first one, and sometimes it’s one of the others. I found this to be true even when adding a lot of noise to the input.</p>
<h4 id="the-code">The code</h4>
<p>The full PyTorch implementation described in this blog post is available at <a href="https://github.com/kamilc/neural-text-summarization">https://github.com/kamilc/neural-text-summarization</a>. You may find some of its parts less clean than others — it’s a work in progress. In particular, the data-downloading part is mostly left out.</p>
<p>You can find the WikiHow preprocessing in a notebook in the repository. For ROCStories, I just downloaded the CSV files and concatenated them with Unix <code>cat</code>. There’s an additional <code>process.py</code> file generated from a very simple <code>IPython</code> session.</p>
<p>Let’s have a very brief look at some of the most interesting parts of the code:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">SummarizeNet</span>(NNModel):
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">encode</span>(self, embeddings, lengths):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode</span>(self, embeddings, encoded, lengths, modes):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, embeddings, clean_embeddings, lengths, modes):
<span style="color:#888"># ...</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">predict</span>(self, vocabulary, embeddings, lengths):
<span style="color:#888"># ...</span>
</code></pre></div><p>You can notice separate methods for <code>forward</code> and <code>predict</code>. I chose the <a href="https://jalammar.github.io/illustrated-transformer/">Transformer</a> over the recurrent neural networks for both the encoder part and the decoder. The <a href="https://pytorch.org/docs/master/generated/torch.nn.TransformerDecoder.html">PyTorch implementation of the transformer decoder part</a> already includes the teacher forcing in the <code>forward</code> method. This makes it convenient at the training time — to just feed it the full, uncorrupted sequence of embeddings as the “target”. During the inference we need to do the “auto-regressive” part by hand though. This means feeding the previous predictions in a loop — hence the need for two distinct methods here.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, embeddings, clean_embeddings, lengths, modes):
noisy_embeddings = self.mask_dropout(embeddings, lengths)
encoded = self.encode(noisy_embeddings[:, <span style="color:#00d;font-weight:bold">1</span>:, :], lengths-<span style="color:#00d;font-weight:bold">1</span>)
decoded = self.decode(clean_embeddings, encoded, lengths, modes)
<span style="color:#080;font-weight:bold">return</span> (
decoded,
encoded
)
</code></pre></div><p>You can notice that I’m doing the token masking at the model level during the training. The code also shows cleanly the structure of this seq2seq model — with the encoder and the decoder.</p>
<p>The encoder part looks simple as long as you’re familiar with transformers:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">encode</span>(self, embeddings, lengths):
batch_size, seq_len, _ = embeddings.shape
embeddings = self.encode_positions(embeddings)
paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand((batch_size, seq_len)).to(self.device)
paddings_mask = (paddings_mask + <span style="color:#00d;font-weight:bold">1</span>) > lengths.unsqueeze(dim=<span style="color:#00d;font-weight:bold">1</span>).expand((batch_size, seq_len))
encoded = embeddings.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">for</span> ix, encoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.encoders):
encoded = encoder(encoded, src_key_padding_mask=paddings_mask)
encoded = self.encode_batch_norms[ix](encoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
last_encoded = encoded
encoded = self.pool_encoded(encoded, lengths)
encoded = self.to_hidden(encoded)
<span style="color:#080;font-weight:bold">return</span> encoded
</code></pre></div><p>We first encode the positions as in the “Attention Is All You Need” paper and then feed the embeddings into a stack of encoder layers. At the end, we morph the tensor so that its final dimension equals the number given as the model’s parameter.</p>
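<p>For reference, the position encoding can be sketched with the standard sinusoidal formulas from that paper (assuming a batch-first <code>(batch, seq_len, dim)</code> tensor with an even embedding dimension; a sketch, not necessarily the repository’s exact code):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import math
import torch

def encode_positions(embeddings):
    # Add the sinusoidal positional encodings from "Attention Is All You Need"
    _, seq_len, dim = embeddings.shape
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(dim=1)
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    encoding = torch.zeros(seq_len, dim)
    encoding[:, 0::2] = torch.sin(position * div_term)
    encoding[:, 1::2] = torch.cos(position * div_term)
    return embeddings + encoding.unsqueeze(dim=0)
</code></pre></div>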
<p>The <code>decode</code> sits on PyTorch’s shoulders too:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode</span>(self, embeddings, encoded, lengths, modes):
batch_size, seq_len, _ = embeddings.shape
embeddings = self.encode_positions(embeddings)
mask = self.mask_for(embeddings)
encoded = self.from_hidden(encoded)
encoded = encoded.unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand(seq_len, batch_size, -<span style="color:#00d;font-weight:bold">1</span>)
decoded = embeddings.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
decoded = torch.cat(
[
encoded,
decoded
],
axis=<span style="color:#00d;font-weight:bold">2</span>
)
decoded = self.combine_decoded(decoded)
decoded = self.combine_batch_norm(decoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand((batch_size, seq_len)).to(self.device)
paddings_mask = paddings_mask > lengths.unsqueeze(dim=<span style="color:#00d;font-weight:bold">1</span>).expand((batch_size, seq_len))
<span style="color:#080;font-weight:bold">for</span> ix, decoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.decoders):
decoded = decoder(
decoded,
torch.ones_like(decoded),
tgt_mask=mask,
tgt_key_padding_mask=paddings_mask
)
decoded = self.decode_batch_norms[ix](decoded.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
decoded = decoded.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">return</span> self.linear_logits(decoded)
</code></pre></div><p>You can notice that I’m combining the gist received from the encoder with each word embedding — as this is how it was described in the paper.</p>
<p>The <code>predict</code> is very similar to <code>forward</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">predict</span>(self, vocabulary, embeddings, lengths):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Caller should include the start and end tokens here
</span><span style="color:#d20;background-color:#fff0f0"> but we’re going to ensure the start one is replaces by <start-short>
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
previous_mode = self.training
self.eval()
batch_size, _, _ = embeddings.shape
results = []
<span style="color:#080;font-weight:bold">for</span> row <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">0</span>, batch_size):
row_embeddings = embeddings[row, :, :].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
row_embeddings[<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>] = vocabulary.token_vector(<span style="color:#d20;background-color:#fff0f0">"<start-short>"</span>)
encoded = self.encode(
row_embeddings[:, <span style="color:#00d;font-weight:bold">1</span>:, :],
lengths[row].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
)
results.append(
self.decode_prediction(
vocabulary,
encoded,
lengths[row].unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>)
)
)
self.training = previous_mode
<span style="color:#080;font-weight:bold">return</span> results
</code></pre></div><p>The workhorse behind the decoding at the inference time looks as follows:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decode_prediction</span>(self, vocabulary, encoded1xH, lengths1x):
tokens = [<span style="color:#d20;background-color:#fff0f0">'<start-short>'</span>]
last_token = <span style="color:#080;font-weight:bold">None</span>
seq_len = <span style="color:#00d;font-weight:bold">1</span>
encoded1xH = self.from_hidden(encoded1xH)
<span style="color:#080;font-weight:bold">while</span> last_token != <span style="color:#d20;background-color:#fff0f0">'<end>'</span> <span style="color:#080">and</span> seq_len < <span style="color:#00d;font-weight:bold">50</span>:
embeddings1xSxD = vocabulary.embed(tokens).unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).to(self.device)
embeddings1xSxD = self.encode_positions(embeddings1xSxD)
maskSxS = self.mask_for(embeddings1xSxD)
encodedSx1xH = encoded1xH.unsqueeze(dim=<span style="color:#00d;font-weight:bold">0</span>).expand(seq_len, <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>)
decodedSx1xD = embeddings1xSxD.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)
decodedSx1xD = torch.cat(
[
encodedSx1xH,
decodedSx1xD
],
axis=<span style="color:#00d;font-weight:bold">2</span>
)
decodedSx1xD = self.combine_decoded(decodedSx1xD)
decodedSx1xD = self.combine_batch_norm(decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)).transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
<span style="color:#080;font-weight:bold">for</span> ix, decoder <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(self.decoders):
decodedSx1xD = decoder(
decodedSx1xD,
torch.ones_like(decodedSx1xD),
tgt_mask=maskSxS,
)
decodedSx1xD = self.decode_batch_norms[ix](decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>))
decodedSx1xD = decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">2</span>,<span style="color:#00d;font-weight:bold">1</span>)
decoded1x1xD = decodedSx1xD.transpose(<span style="color:#00d;font-weight:bold">1</span>,<span style="color:#00d;font-weight:bold">0</span>)[:, (seq_len-<span style="color:#00d;font-weight:bold">1</span>):seq_len, :]
decoded1x1xV = self.linear_logits(decoded1x1xD)
word_id = F.softmax(decoded1x1xV[<span style="color:#00d;font-weight:bold">0</span>, <span style="color:#00d;font-weight:bold">0</span>, :]).argmax().cpu().item()
last_token = vocabulary.words[word_id]
tokens.append(last_token)
seq_len += <span style="color:#00d;font-weight:bold">1</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#d20;background-color:#fff0f0">' '</span>.join(tokens[<span style="color:#00d;font-weight:bold">1</span>:])
</code></pre></div><p>Notice how decoding starts with the “start short” token and goes in a loop, getting a prediction and feeding it back in, until the “end” token is produced.</p>
<p>Again, the model is very, very simple. What makes the difference is how it’s being trained — it’s all in the training data corruption and the model pre-training.</p>
<p>It’s already a long article so I encourage the curious readers to look at the code at <a href="https://github.com/kamilc/neural-text-summarization">my GitHub repo</a> for more details.</p>
<h4 id="my-experiment-with-the-wikihow-dataset">My experiment with the WikiHow dataset</h4>
<p>In my WikiHow experiment, I wanted to see what the results would look like if I fed in the full articles and their headlines as the two modes of the network. The same data-corruption regime was used in this case.</p>
<p>Some of the results were looking <strong>almost</strong> good:</p>
<p>Text:</p>
<blockquote>
<p>for a savory flavor, mix in 1/2 teaspoon ground cumin, ground turmeric, or masala powder.this works best when added to the traditional salty lassi. for a flavorful addition to the traditional sweet lassi, add 1/2 teaspoon of ground cardamom powder or ginger, for some kick. , start with a traditional sweet lassi and blend in some of your favorite fruits. consider mixing in strawberries, papaya, bananas, or coconut.try chopping and freezing the fruit before blending it into the lassi. this will make your drink colder and frothier. , while most lassi drinks are yogurt based, you can swap out the yogurt and water or milk for coconut milk. this will give a slightly tropical flavor to the drink. or you could flavor the lassi with rose water syrup, vanilla extract, or honey.don’t choose too many flavors or they could make the drink too sweet. if you stick to one or two flavors, they’ll be more pronounced. , top your lassi with any of the following for extra flavor and a more polished look: chopped pistachios sprigs of mint sprinkle of turmeric or cumin chopped almonds fruit sliver</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>add a spice., blend in a fruit., flavor with a syrup or milk., garnish.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>blend vanilla in a sweeter flavor . , add a sugary fruit . , do a spicy twist . eat with dessert . , revise . <end></p>
</blockquote>
<p>It’s not 100% faithful to the original text even though it seems to “read” well.</p>
<p>My suspicion is that pre-training against a much larger corpus of text might help. There’s an obvious issue here: the network lacks the very specific knowledge it would need to summarize better. Here’s another of those examples:</p>
<p>Text:</p>
<blockquote>
<p>the settings app looks like a gray gear icon on your iphone’s home screen.; , this option is listed next to a blue “a” icon below general. , this option will be at the bottom of the display & brightness menu. , the right-hand side of the slider will give you bigger font size in all menus and apps that support dynamic type, including the mail app. you can preview the corresponding text size by looking at the menu texts located above and below the text size slider. , the left-hand side of the slider will make all dynamic type text smaller, including all menus and mailboxes in the mail app. , tap the back button twice in the upper-left corner of your screen. it will save your text size settings and take you back to your settings menu. , this option is listed next to a gray gear icon above display & brightness. , it’s halfway through the general menu. ,, the switch will turn green. the text size slider below the switch will allow for even bigger fonts. , the text size in all menus and apps that support dynamic type will increase as you go towards the right-hand side of the slider. this is the largest text size you can get on an iphone. , it will save your settings.</p>
</blockquote>
<p>Headline:</p>
<blockquote>
<p>open your iphone’s settings., scroll down and tap display & brightness., tap text size., tap and drag the slider to the right for bigger text., tap and drag the slider to the left for smaller text., go back to the settings menu., tap general., tap accessibility., tap larger text. , slide the larger accessibility sizes switch to on position., tap and drag the slider to the right., tap the back button in the upper-left corner.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>open your iphone ’s settings . , tap general . , scroll down and tap accessibility . , tap larger accessibility . , tap and larger text for the iphone to highlight the text you want to close . , tap the larger text - colored contacts app .</p>
</blockquote>
<p>It might be interesting to train against this dataset again while:</p>
<ul>
<li>utilizing some pre-trained, large scale model as part of the encoder</li>
<li>using a large corpus of text to still pre-train the auto-encoder</li>
</ul>
<p>This could possibly take a lot of time to train on my GPU (even with the pre-trained part of the encoder). I didn’t follow the idea further at this time.</p>
<h4 id="the-problem-with-getting-paragraphs-when-we-want-the-sentences">The problem with getting paragraphs when we want the sentences</h4>
<p>One of the biggest problems the authors ran into was with the decoder outputting the long version of the text, even though it was asked for the sentence-long summary.</p>
<p>The authors called this phenomenon the “segregation issue”. What they found was that the encoder was mapping paragraphs and sentences into completely separate regions. The solution was to trick the encoder into making both representations indistinguishable. The following figure comes from the paper and visualizes the issue:</p>
<p><img src="/blog/2020/05/summae-neural-text-summarization-denoising-autoencoder/segregation.jpg" alt="Segregation problem"></p>
<h4 id="better-gists-by-using-the-critic">Better gists by using the “critic”</h4>
<p>The idea of a “critic” was popularized along with the fantastic results produced by some of the <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">Generative Adversarial Networks</a>. The general workflow is to have the main network generate output while another network tries to guess some of its properties.</p>
<p>For GANs that are generating realistic photos, the critic is there to guess if the photo was generated or if it’s real. A loss term is added based on how well it’s doing, penalizing the main network for generating photos that the critic is able to call out as fake.</p>
<p>A similar idea was used in the A3C algorithm I blogged about (<a href="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/">Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm</a>). The “critic” part penalized the AI agent for taking steps that were on average less advantageous.</p>
<p>Here, in the SummAE model, the critic adds a penalty to the loss to the degree to which it’s able to guess whether the gist comes from a paragraph or a sentence.</p>
<p>Training with the critic might get tricky. What I’ve found to be the cleanest way is to use two different optimizers — one updating the main network’s parameters while the other updates the critic itself:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">for</span> batch <span style="color:#080">in</span> batches:
<span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
self.model.train()
self.discriminator.train()
<span style="color:#080;font-weight:bold">else</span>:
self.model.eval()
self.discriminator.eval()
self.optimizer.zero_grad()
self.discriminator_optimizer.zero_grad()
logits, state = self.model(
batch.word_embeddings.to(self.device),
batch.clean_word_embeddings.to(self.device),
batch.lengths.to(self.device),
batch.mode.to(self.device)
)
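    <span style="color:#888"># state.detach() keeps the critic's loss from updating the main network;</span>
    <span style="color:#888"># the non-detached state lets the "fooling" loss below do exactly that</span>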
    mode_probs_disc = self.discriminator(state.detach())
    mode_probs = self.discriminator(state)
    discriminator_loss = F.binary_cross_entropy(
        mode_probs_disc,
        batch.mode
    )
    discriminator_loss.backward(retain_graph=<span style="color:#080;font-weight:bold">True</span>)
    <span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
        self.discriminator_optimizer.step()
    text = batch.text.copy()
    <span style="color:#080;font-weight:bold">if</span> self.no_period_trick:
        text = [txt.replace(<span style="color:#d20;background-color:#fff0f0">'.'</span>, <span style="color:#d20;background-color:#fff0f0">''</span>) <span style="color:#080;font-weight:bold">for</span> txt <span style="color:#080">in</span> text]
    classes = self.vocabulary.encode(text, modes=batch.mode)
    classes = classes.roll(-<span style="color:#00d;font-weight:bold">1</span>, dims=<span style="color:#00d;font-weight:bold">1</span>)
    classes[:,classes.shape[<span style="color:#00d;font-weight:bold">1</span>]-<span style="color:#00d;font-weight:bold">1</span>] = <span style="color:#00d;font-weight:bold">3</span>
    model_loss = torch.tensor(<span style="color:#00d;font-weight:bold">0</span>).cuda()
    <span style="color:#080;font-weight:bold">if</span> logits.shape[<span style="color:#00d;font-weight:bold">0</span>:<span style="color:#00d;font-weight:bold">2</span>] == classes.shape:
        model_loss = F.cross_entropy(
            logits.reshape(-<span style="color:#00d;font-weight:bold">1</span>, logits.shape[<span style="color:#00d;font-weight:bold">2</span>]).to(self.device),
            classes.long().reshape(-<span style="color:#00d;font-weight:bold">1</span>).to(self.device),
            ignore_index=<span style="color:#00d;font-weight:bold">3</span>
        )
    <span style="color:#080;font-weight:bold">else</span>:
        <span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">"WARNING: Skipping model loss for inconsistency between logits and classes shapes"</span>)
    fooling_loss = F.binary_cross_entropy(
        mode_probs,
        torch.ones_like(batch.mode).to(self.device)
    )
    loss = model_loss + (<span style="color:#00d;font-weight:bold">0.1</span> * fooling_loss)
    loss.backward()
    <span style="color:#080;font-weight:bold">if</span> mode == <span style="color:#d20;background-color:#fff0f0">"train"</span>:
        self.optimizer.step()
    self.optimizer.zero_grad()
    self.discriminator_optimizer.zero_grad()
</code></pre></div><p>The main idea is to treat the main network’s encoded gist as constant with respect to the updates to the critic’s parameters, and vice versa.</p>
<h3 id="results">Results</h3>
<p>I’ve found some of the results look really exceptional:</p>
<p>Text:</p>
<blockquote>
<p>lynn is unhappy in her marriage. her husband is never good to her and shows her no attention. one evening lynn tells her husband she is going out with her friends. she really goes out with a man from work and has a great time. lynn continues dating him and starts having an affair.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>lynn starts dating him and has an affair . <end></p>
</blockquote>
<p>Text:</p>
<blockquote>
<p>cedric was hoping to get a big bonus at work. he had worked hard at the office all year. cedric’s boss called him into his office. cedric was disappointed when told there would be no bonus. cedric’s boss surprised cedric with a big raise instead of a bonus.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>cedric had a big deal at his boss ’s office . <end></p>
</blockquote>
<p>Some others showed how the model attends to single sentences though:</p>
<p>Text:</p>
<blockquote>
<p>i lost my job. i was having trouble affording my necessities. i didn’t have enough money to pay rent. i searched online for money making opportunities. i discovered amazon mechanical turk.</p>
</blockquote>
<p>Predicted summary:</p>
<blockquote>
<p>i did n’t have enough money to pay rent . <end></p>
</blockquote>
<p>While a sentence like this one might make a good headline, it’s definitely not the best summary, as it loses the vital parts found in the other sentences.</p>
<h3 id="final-words">Final words</h3>
<p>First of all, let me thank the paper’s authors for their exceptional work. It was a great read and great fun implementing!</p>
<p>Abstractive text summarization remains very difficult. The model trained for this blog post has very limited use in practice. There’s a lot of room for improvement though, which makes the future of abstractive summaries very promising.</p>
<h2><a href="https://www.endpointdev.com/blog/2019/07/an-introduction-to-neural-networks/">An Introduction to Neural Networks</a></h2>
<p>By Ben Ironside Goldstein · July 1, 2019</p>
<p><img src="/blog/2019/07/an-introduction-to-neural-networks/image-0.jpg" alt="Weird Tree Art (Neural Network)" /> <a href="https://flic.kr/p/5eL8Ag">Photo</a> by <a href="https://www.flickr.com/photos/sudhamshu/">Sudhamshu Hebbar</a>, used under <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a></p>
<p>Earlier this year I wrote a <a href="/blog/2019/05/facial-recognition-amazon-deeplens/">post</a> about my work with a machine-learning camera, the <a href="https://aws.amazon.com/deeplens/">AWS DeepLens</a>, which has onboard processing power to enable AI capabilities without sending data to the cloud. Neural networks are a type of ML model which achieves very impressive results on certain problems (including computer vision), so in this post I give a more thorough introduction to neural networks, and share some useful resources for those who want to dig deeper.</p>
<h3 id="neurons-and-nodes">Neurons and Nodes</h3>
<p>Neural networks are models inspired by the function of biological neural networks. They consist of nodes (arranged in layers) and the connections between those nodes. Each connection between two nodes enables one-way information transfer: a node either receives input from, or sends output to, each node to which it is connected. Each node typically has an “activation function” applied to a combination of its inputs, and the node’s output is the result of this function.</p>
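<p>As a toy illustration (with arbitrary numbers), a node’s output is just its activation function applied to a weighted combination of its inputs:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import math

def node_output(inputs, weights, bias):
    # A weighted sum of inputs passed through a sigmoid activation function
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-weighted_sum))

print(node_output([0.5, 0.8], weights=[0.4, -0.6], bias=0.1))  # ~0.455
</code></pre></div>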
<p>As with the function of biological neural networks, the emergence of information processing from these mathematical operations is opaque. Nevertheless, complex artificial neural networks are capable of feats such as vision, language translation, and winning competitive games. As the technology improves, even more impressive tasks will become possible. As with organic brains, neural networks can achieve complex tasks only as a result of appropriate architecture, constraints, and training—for machine learning, humans must (for now) design it all.</p>
<h3 id="neural-network-architecture">Neural Network Architecture</h3>
<p><img src="/blog/2019/07/an-introduction-to-neural-networks/image-1.png" style="float: right; max-width: 200px" /></p>
<p>Nodes are grouped in layers: the input layer, the output layer, and all the layers between them, known as hidden layers. Nodes can be networked in a variety of ways within and between layers, and sophisticated neural network models can include dozens of layers configured in various ways. These include layers which summarize, combine, eliminate, direct, or transform information. Each receives its input from the previous layer, and passes its output to the next layer. The last layer is designed such that its output answers the relevant question (for example, it would offer 9 options if the goal were to identify the hand-written numbers 1–9).</p>
<p>For all this information processing to achieve a given task, the parameters of each node need appropriate values. The process of choosing those values is called training. In order to train a neural network, one needs to provide examples of what the network should do. (For example, to train it to write requires examples of writing. To train it to identify objects in images requires images and their appropriately labeled counterparts.) The more data a model can learn from, the better it can work. Gathering enough data is typically a major undertaking.</p>
<h3 id="training-a-neural-network">Training a Neural Network</h3>
<p>Before training, models have random parameters for all nodes. Each time data is passed through the model, the effectiveness of the model is measured using a “loss function”. Loss functions measure how wrong a model’s output is. Different loss functions (also known as cost functions or error functions) measure this in different ways, but in general, the more wrong a model is, the higher its loss/error/cost. Loss functions thus summarize the quality of a model’s output with a single number. Models are optimized to minimize the loss. (For more on the role of loss functions in neural networks, I suggest <a href="https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/">this excellent article</a>.)</p>
<p>One of the most interesting details of the entire process has to do with how the parameters are tuned. Model optimization relies on variations of a process called gradient descent, in which parameter values are adjusted by small intervals in an attempt to minimize the loss. Over many thousands of repetitions, the training program uses calculus to pick values that help to minimize the loss. As you can imagine, this process becomes extremely computationally intensive when the neural network is large and complex. However, in order to solve hard problems, networks must be large and complex. This is why training neural networks requires substantial computing power, and often takes place in the cloud. (For more on stochastic gradient descent, I suggest <a href="https://www.youtube.com/watch?v=vMh0zPT0tLI">this video</a> as a great starting point, or <a href="http://ruder.io/optimizing-gradient-descent/">this review</a> for a more advanced overview.)</p>
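<p>A bare-bones sketch of the idea, fitting a one-parameter model y = w·x by repeatedly stepping against the gradient of a squared-error loss:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># Toy data generated by y = 3x; gradient descent should recover w close to 3
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0             # the single parameter, starting from scratch
learning_rate = 0.01

for step in range(1000):
    # Mean squared error loss: average of (w*x - y)^2,
    # whose derivative with respect to w is 2 * (w*x - y) * x
    gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * gradient  # step a little against the gradient

print(w)  # ~3.0
</code></pre></div>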
<h3 id="further-reading">Further reading</h3>
<ul>
<li>It turns out that a relatively simple neural network can approximate any function. This remarkable <a href="https://towardsdatascience.com/can-neural-networks-really-learn-any-function-65e106617fc6">demonstration</a> is quite accessible.</li>
<li>There are countless useful implementations of neural network models. End Pointer <a href="/blog/authors/kamil-ciemniewski/">Kamil Ciemniewski</a> wrote two in-depth and fascinating blogs about neural network projects which he completed in the past year: <a href="/blog/2019/01/speech-recognition-with-tensorflow/">Speech Recognition From Scratch</a>, and <a href="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/">Self-Driving Toy Car</a>.</li>
<li>If you’re interested in getting a sense for the general state of the art, <a href="https://www.topbots.com/most-important-ai-research-papers-2018/">here</a> are summaries of some of the most influential papers in machine learning since 2018.</li>
<li>For those curious about the inner workings of the training process, here’s one about <a href="http://neuralnetworksanddeeplearning.com/chap2.html">back-propagation</a>.</li>
<li>This blog post describes “densely connected” network layers; here’s an article about <a href="https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53">convolutional layers</a>.</li>
<li>And finally, this article describes <a href="https://medium.com/explore-artificial-intelligence/an-introduction-to-recurrent-neural-networks-72c97bf0912">recurrent neural networks</a>.</li>
</ul>
<h2><a href="https://www.endpointdev.com/blog/2019/05/facial-recognition-amazon-deeplens/">Facial Recognition Using Amazon DeepLens: Counting Liquid Galaxy Interactions</a></h2>
<p>By Ben Ironside Goldstein · May 1, 2019</p>
<p>I have been exploring the possible uses of a machine-learning-enabled camera for the Liquid Galaxy. The Amazon Web Services (AWS) <a href="https://aws.amazon.com/deeplens/">DeepLens</a> is a camera that can receive and transmit data over wifi, and that has computing hardware built in. Since its hardware enables it to use machine learning models, it can perform computer vision tasks in the field.</p>
<h3 id="the-amazon-deeplens-camera">The Amazon DeepLens camera</h3>
<p><img style="float: left; width: 400px; padding-right: 2em;" src="/blog/2019/05/facial-recognition-amazon-deeplens/deeplens-front-angle.jpg" alt="DeepLens" /></p>
<p>This camera is the first of its kind—likely the first of many, given the ongoing rapid adoption of Internet of Things (IoT) devices and computer vision. It came to End Point’s attention as hardware that could potentially interface with and extend End Point’s immersive visualization platform, the <a href="https://www.visionport.com/">Liquid Galaxy</a>. We’ve thought of several ways computer vision could potentially work to enhance the platform, for example:</p>
<ol>
<li>Monitoring users’ reactions</li>
<li>Counting unique visitors to the LG</li>
<li>Counting the number of people using an LG at a given time</li>
</ol>
<p>The first idea would depend on parsing facial expressions. Perhaps a certain moment in a user experience causes people to look confused, or particularly delighted—valuable insights. The second idea would generate data that could help us assess the platform’s impact, using a metric crucial to any potential clients whose goals involve engaging audiences. The third idea would create a simpler metric: the average number of people engaging with the system over a period of time. Nevertheless, this idea has a key advantage over the second: it doesn’t require distinguishing between people, which makes it a much more tractable project. This post focuses on the third idea.</p>
<p>To set up the camera, the user has to plug it into a power outlet and connect it to wifi. The camera will still work even with a slow network connection, though when the connection is slower, the delay between the camera seeing something and reporting it is longer. However, this delay was hardly noticeable on my home network, which has slow-to-moderate speeds of about 17 Mbps down and 33 Mbps up.</p>
<h3 id="computer-vision-and-the-amazon-deeplens">Computer Vision and the Amazon DeepLens</h3>
<p>A <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning model</a> is a neural network with multiple layers of processing units. It is called “deep” because it has multiple layers. The inputs and outputs of each processing unit are numbers. These units are roughly analogous to neurons: they receive input from units in the previous layer, and output it to units in the next layer after transforming it based on a function. These “activation functions” can change in a variety of ways. The last layer’s outputs translate into the results. These models work because these functions get tuned based on how well the model works. For example, to make a model that labels each human face in a picture and draws a box around it, we would start with a corpus of pictures with boxes drawn around faces, as well as the versions of the pictures without the boxes drawn. We would test the model on the non-labeled images by checking—for each picture—whether the output generated by the model is correct. If not, the computer chooses different unit functions, tries again, and compares the results. Repeating this process thousands of times yields models which work remarkably well for a wide range of tasks, including computer vision.</p>
<p>In deep learning for computer vision, training on large sets of labeled images enables models to generalize about visual characteristics. The training process takes a lot of computing resources, but once models are trained, they can produce results quickly and with relative ease. This is why the DeepLens is able to perform computer vision with its limited computing resources.</p>
<p>Since the DeepLens is an Amazon product, it comes as no surprise that the user interface and backend for DeepLens consist of AWS services. One of the most important is <a href="https://aws.amazon.com/sagemaker/">SageMaker</a>, which is used to train, manage, optimize, and deploy machine learning models such as neural networks. It includes hosted Jupyter notebooks (<a href="https://jupyter.org/">Jupyter</a> is a development environment for data science), as well as the computing resources required for model training and storage. With SageMaker, users can train computer vision models for deployment to DeepLens, or import and adjust pretrained models from various sources.</p>
<p>Remote management of the DeepLens depends on <a href="https://aws.amazon.com/lambda/">AWS Lambda</a>, a “serverless” cloud service that provides an environment to run backend code and integrate with other cloud services. It runs the show, allowing users to manage everything from the camera’s behavior to what happens to gathered data. Another service, <a href="https://aws.amazon.com/greengrass/">AWS Greengrass</a>, connects the instructions from AWS Lambda to the DeepLens, managing tasks like authentication, updates, and reactions to local events.</p>
<p>Amazon’s IoT service saves information about each DeepLens, and allows users to manage their devices, for example by choosing which model is active on the device, or viewing a live stream from the camera. It also keeps track of what’s going on with the hardware, even when it’s off. When a model is running on the DeepLens, you can view a live stream of its inferences about what it’s seeing (the labeled images). Amazon has released various pretrained models designed to work on the DeepLens. Using a model for detecting faces, we can get a live stream that looks like this:</p>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/one-face-recognition.jpg" alt="one-face-recognition">
<br>Me looking at the DeepLens in my kitchen</p>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/multi-face-recognition.jpg" alt="multi-face-recognition">
<br>Facial recognition inferences on multiple people. (Witness my smile of satisfaction at finally finding enthusiastic subjects of facial recognition.)</p>
<p>Each face that the camera detects gets a box around it, along with the model’s level of certainty that it is a face. The above pictures were the results of an attempt to simulate the conditions where this could be used.</p>
<h3 id="the-model">The Model</h3>
<p>The model I used was trained on data from <a href="http://www.image-net.org/">ImageNet</a>, a public database with hundreds or thousands of images associated with each noun. (For example, they have 1537 <a href="http://www.image-net.org/synset?wnid=n03376595">pictures of folding chairs</a>.) ImageNet is <a href="https://arxiv.org/search/?query=imagenet&searchtype=all&source=header">commonly</a> used to train and test computer vision models.</p>
<p>However, the training for this model didn’t stop there: Amazon used transfer learning from another large image dataset, <a href="http://cocodataset.org/#home">MS-COCO</a>, to fine-tune the model for face detection. Transfer learning works essentially by retraining the last layer of an already-trained model. In this way it harnesses the “insights” of the existing model (e.g. about shapes, colors, and positions) by repurposing that information to make a new kind of prediction: in this case, whether something is a face.</p>
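<p>As a generic illustration of the technique (not Amazon’s actual pipeline; the two-class “face”/“not face” head is my assumption), retraining only the last layer of a pretrained network looks roughly like this in PyTorch:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import torch.nn as nn
import torchvision.models as models

# start from a network pretrained on a large image dataset
model = models.resnet18(pretrained=True)

# freeze the pretrained layers so their "insights" stay intact
for param in model.parameters():
    param.requires_grad = False

# swap in a fresh last layer; only its weights will be trained
model.fc = nn.Linear(model.fc.in_features, 2)  # hypothetical face / not-face head
</code></pre></div>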
<p>Since this model was pretrained and optimized by Amazon for the DeepLens, it provides a low-effort route to implementing a computer vision model on the DeepLens. I didn’t have to do any of the processing on my own hardware. The DeepLens hardware took care of all the predictions, though the biggest resource savings came from not having to train the model myself (which can take days, or longer).</p>
<p>When the facial recognition model is deployed and the DeepLens is on, an AWS Lambda function written in Python repeatedly prompts the camera for frames:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">frame = awscam.getLastFrame()
</code></pre></div><p>…to resize the frames before inference (the model accepts frames of particular size):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">frame_resize = cv2.resize(frame, (input_height, input_width))
</code></pre></div><p>…to pass the frames to the model:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">parsed_inference_results = model.parseResult(model_type, model.doInference(frame_resize))
</code></pre></div><p>…and to use the results to draw boxes around the faces:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (<span style="color:#00d;font-weight:bold">255</span>, <span style="color:#00d;font-weight:bold">165</span>, <span style="color:#00d;font-weight:bold">20</span>), <span style="color:#00d;font-weight:bold">10</span>)
</code></pre></div><p>As you can see from how often “cv2” appears in the code above, this implementation relies heavily on code from <a href="https://opencv.org">OpenCV</a>, an open source computer vision framework. Finally, the results are sent to the cloud:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">client.publish(topic=iot_topic, payload=json.dumps(cloud_output))
</code></pre></div><p>In the last code snippet above, iot_topic refers to an Amazon “MQTT topic” for IoT devices. <a href="https://en.wikipedia.org/wiki/MQTT">MQTT</a> (Message Queuing Telemetry Transport) is the standard connectivity framework for DeepLens and many other IoT devices. One of its advantages in this context is that it handles intermittent connectivity by queueing messages until the network connection is stable. The essence of MQTT is publishing and subscribing to topics, and this system of topics enables results from a DeepLens to trigger other processes. For example, the DeepLens could publish a message when it sees a face, and this could prompt another cloud service to do something else, such as save what time and how long the face appeared.</p>
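<p>Putting the snippets above together, the core of the Lambda function amounts to a simple loop. The following is only a sketch: the setup of the model object, the OpenCV helpers, the box coordinates, and the MQTT client happens elsewhere in the real function:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">while True:
    # grab the newest frame and resize it to the model's expected input size
    frame = awscam.getLastFrame()
    frame_resize = cv2.resize(frame, (input_height, input_width))

    # run inference on the device and parse the raw output
    parsed_inference_results = model.parseResult(
        model_type, model.doInference(frame_resize))

    # draw a box around each detection (coordinate scaling omitted here)
    for obj in parsed_inference_results[model_type]:
        cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (255, 165, 20), 10)

    # send the inferences to the cloud over MQTT
    client.publish(topic=iot_topic, payload=json.dumps(cloud_output))
</code></pre></div>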
<p>I wanted to test how data from this model would compare to a human’s perception. The first step was to understand what data the camera offers. It produces data about each frame analyzed: a timestamp (in 13-digit <a href="https://en.wikipedia.org/wiki/Unix_time">Unix time</a>, i.e. milliseconds) and the predicted probability that something it identifies is a face. To gather this data, I used the AWS IoT service to manually subscribe to a secure MQTT topic where the DeepLens published its predictions. Each frame processed produces data like this:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-json" data-lang="json">{
<span style="color:#b06;font-weight:bold">"format"</span>: <span style="color:#d20;background-color:#fff0f0">"json"</span>,
<span style="color:#b06;font-weight:bold">"payload"</span>: {
<span style="color:#b06;font-weight:bold">"face"</span>: <span style="color:#00d;font-weight:bold">0.5654296875</span>
},
<span style="color:#b06;font-weight:bold">"qos"</span>: <span style="color:#00d;font-weight:bold">0</span>,
<span style="color:#b06;font-weight:bold">"timestamp"</span>: <span style="color:#00d;font-weight:bold">1554853281975</span>,
<span style="color:#b06;font-weight:bold">"topic"</span>: <span style="color:#d20;background-color:#fff0f0">"$aws/things/deeplens_bnU5sr2sSD2ecW5YkfJZtw/infer"</span>
}
</code></pre></div><p>The data generated by a single frame (with one face) when processed by the DeepLens.</p>
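<p>For later analysis, each of these messages can be unpacked with a few lines of Python. This is a sketch under the assumption that every detected face contributes one probability value to the payload:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import json

def parse_inference(message):
    """Pull the timestamp and the face probabilities out of one MQTT message."""
    data = json.loads(message)
    probabilities = list(data["payload"].values())
    return data["timestamp"], len(probabilities), probabilities
</code></pre></div>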
<p>For my purposes, I was only interested in the timestamps and payloads (which contain the number of faces identified, and their probabilities). I decided to test the facial recognition model under several different conditions:</p>
<ol>
<li>No faces present</li>
<li>One face present</li>
<li>Multiple faces present</li>
</ol>
<p>For condition 1 I just aimed it at an empty room for 20 minutes, and for condition 2 I sat in front of the camera for 20 minutes. For condition 3, I aimed the camera at a public space for 20 minutes, and while it was running I kept an ongoing count of the number of people looking in the general direction of the camera (I put the camera in front of a wall with a TV on it so people would be more likely to look towards it). Averaging my count over the duration of the sample gave an average engagement of 2.5 people looking at the camera at any given time. In an attempt to minimize bias, I made my human-eye assessment before looking at any of the data.</p>
<p>I’ll spoil one aspect of the results right away: there were no false positives under any condition. Even the lower-probability guesses corresponded to actual faces, though this might not hold true in a room with lots of face-like art (not too common of a scenario). This simplified things, since it meant there was no need to set a lower bound on which probabilities to count: any face detected by the camera is a face. It also highlights one of my remaining questions about the model: is there useful information to be gained from the probabilities?</p>
<p>Another important note: I noticed early in the experiment that it almost never detects a face farther than 15 feet away. For the use case of a Liquid Galaxy, the 15-foot range is too short to capture all types of engagement (some people look at it across the room), but from my experience with the system I think that users within this range could be accurately described as focused users—something worth measuring, but certainly not everything worth measuring. After noticing this, I retested condition 2 with my face about 5 feet from the DeepLens, after initially trying it from across a room.</p>
<h3 id="how-did-the-deeplens-counts-compare-to-my-counts">How did the DeepLens counts compare to my counts?</h3>
<p><img src="/blog/2019/05/facial-recognition-amazon-deeplens/results.png" alt="results"></p>
<p>The model matched my performance in conditions 1 and 2, which makes a strong statement about its reliability in relatively static and close-up conditions such as looking at an empty room, or looking at someone stare at their laptop across a small table. In contrast, it did not count as many faces as I did in condition 3—so I’m happy to report I can still outperform A.I. on something.</p>
<p>Anyway, this suggests that the model is somewhat conservative, at least compared to my count (likely partly because my eyes have a range longer than 15 feet). Therefore, when considering usage statistics gathered by a similar method, it might make the most sense to treat the results as a lower bound, e.g. “the average number of people focused on the system was more than 2.1”.</p>
<p>It would be useful to experiment with the multiple-faces condition again to see how robust these findings are, and to track factors like how much people move, the lighting, and the orientation of the camera, to see whether they impact the results. Automating the data collection and analysis would help as well.</p>
<p>This investigation has shown me that the DeepLens has a lot of potential as a tool for measuring engagement. Perhaps a future post will examine how it can be used to count users.</p>
<hr>
<p>Thanks for reading! You are welcome to learn more about <a href="https://www.visionport.com/">End Point Liquid Galaxy</a> and <a href="https://aws.amazon.com/deeplens/">AWS DeepLens</a>.</p>
Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithmhttps://www.endpointdev.com/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/2018-08-29T00:00:00+00:00Kamil Ciemniewski
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/contrib/auto-render.min.js"></script>
<style>
.katex .op-symbol.large-op {
line-height: 1.2 !important;
}
.mtight {
font-size: 0.95em;
}
</style>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/892-openaigym.video.90.68.video000000.mp4" type="video/mp4">
</video>
</center>
<p>The field of <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a> has seen a lot of great improvement in recent years. Researchers at universities and companies like <a href="https://deepmind.com/">DeepMind</a> have been developing new and better ways to train intelligent, artificial agents to solve more and more difficult tasks. The algorithms being developed require less time to train, and they make the training much more stable.</p>
<p>This article is about an algorithm that’s one of the most cited lately: A3C — Asynchronous Advantage Actor-Critic.</p>
<p>As the subject is both wide and deep, I’m assuming the reader has already mastered the relevant background. The article might be interesting even without understanding most of the notions in use, but having a good grasp of them will help you get the most out of it.</p>
<p>Because we’re looking at Deep Reinforcement Learning, the obvious requirement is to be acquainted with <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">neural networks</a>. I’m also using notions known from the wider field of <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a>, like the $Q(a, s)$ and $V(s)$ functions or the n-step return. The mathematical expressions, in particular, are given assuming that the reader already knows what the symbols stand for. Some notions known from other families of RL algorithms are touched on as well (e.g. experience replay), to contrast them with the A3C way of solving the same kind of problems. The article along with the source code uses the <a href="https://gym.openai.com">OpenAI gym</a>, Python, and <a href="https://pytorch.org">PyTorch</a> among other Python-related libraries.</p>
<h3 id="theory">Theory</h3>
<p>The A3C algorithm is a part of the greater class of RL algorithms called <a href="http://www.scholarpedia.org/article/Policy_gradient_methods">Policy Gradients</a>.</p>
<p>In this approach, we’re creating a model that <strong>approximates the action-choosing policy itself</strong>.</p>
<p>Let’s contrast it with <a href="https://en.wikipedia.org/wiki/Markov_decision_process#Value_iteration">value iteration</a>, the goal of which is to learn the <a href="https://en.wikipedia.org/wiki/Reinforcement_learning#Value_function">value function</a> and have the policy emerge as the function that chooses the action transitioning to the state of the greatest value.</p>
<p>With the policy gradient approach, we’re approximating the policy with a differentiable function. Stated this way, the problem requires only a good approximation of the gradient that, over time, will maximize the rewards.</p>
<p>The unique approach of A3C adds a very clever twist: we’re also learning an approximation of the value function at the same time. This brings the variance of the gradient down considerably, making the training much more stable.</p>
<p>These two aspects of the algorithm are reflected in its name: actor-critic. The policy function approximation is called the actor, while the value function approximation is called the critic.</p>
<h4 id="the-policy-gradient">The policy gradient</h4>
<p>As we’ve noticed already, in order to improve our policy function approximation, we need a gradient that points in the direction that maximizes the rewards.</p>
<p>I’m not going to reinvent the wheel here. There are some great resources the reader can access to dig deep into the mathematics of what’s called the Policy Gradient Theorem:</p>
<ul>
<li><a href="https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html">Lilian Weng’s excellent article</a></li>
<li><a href="http://incompleteideas.net/book/bookdraft2017nov5.pdf">Sutton & Barto — Reinforcement Learning: An Introduction</a></li>
</ul>
<p>The following equation presents the basic form of the gradient of the policy function:</p>
<p>$$\nabla_{\theta} J(\theta) = E_{\tau}\left[\,R_{\tau}\cdot\nabla_\theta\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\,\right]$$</p>
<p>This states that for each sampled trajectory $\tau$, the correct estimate of the gradient is the expected value of the rewards times the gradient of the summed log-probabilities of the actions taken. Ascending in this direction makes our rewards greater and greater over time.</p>
<p>We <strong>can</strong>, of course, derive all the needed intermediary gradients ourselves by hand. Because we’re using <a href="https://pytorch.org">PyTorch</a>, though, we only need the right loss function.</p>
<p>Let’s figure out the right loss function formula that will produce the gradient as shown above:</p>
<p>$$L_\theta=-J(\theta)$$</p>
<p>Also:</p>
<p>$$J(\theta)=E_\tau\left[R_\tau\cdot\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\right]$$</p>
<p>Hence:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}R_t\cdot\log\pi(a_t|s_t;\theta)$$</p>
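<p>In PyTorch terms, this loss takes only a couple of lines. A minimal sketch, assuming log_probs holds $\log\pi(a_t|s_t;\theta)$ for the sampled actions and returns holds the corresponding $R_t$ values:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># the minus sign turns gradient descent on the loss
# into gradient ascent on J(theta)
loss = -(returns * log_probs).mean()
loss.backward()
</code></pre></div>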
<h4 id="formalizing-the-accumulation-of-rewards">Formalizing the accumulation of rewards</h4>
<p>So far, we’ve been using the $R_\tau$ and $R_t$ terms very abstractly. Let’s make this part more intuitive and concrete now.</p>
<p>What these terms really mean is “the quality of the sampled trajectory”. Consider the following equation:</p>
<p>$$R_t=\sum_{i=t}^{N+t}\gamma^{i-t+1}\,r_i\,+\,\gamma^{N+2}\,V(s_{t+N+1})$$</p>
<p>Each $r_i$ is the reward received from the environment after each step. Each trajectory consists of multiple steps, and at each step we sample an action based on our policy function, which gives the probability of each action being the best one given the state.</p>
<p>What if we take 5 actions that earn no reward, but overall they help us get rewarded at the 6th step? This is exactly the case we’ll be dealing with later in this article when training a toy car to drive based only on pixel values of the scene. In that environment, we’re given a $-0.1$ “negative” reward each step and something close to $7$ for each new “tile” the car reaches while staying on the road.</p>
<p>We need a way to still encourage actions that make us earn rewards in a not too distant future. We also need to be smart and <strong>discount</strong> future rewards somewhat so that the more immediate the reward is to our action, the more emphasis we put on it.</p>
<p>That’s exactly what the above equation does. Notice that $\gamma$ becomes a hyperparameter. It makes sense to give it a value from $(0, 1)$. Let’s consider the following list of rewards: $[r_1, r_2, r_3, r_4]$. For $r_1$, the formula for the discounted accumulated reward is:</p>
<p>$$R_1=\gamma\,r_1\,+\,\gamma^2r_2\,+\,\gamma^3r_3\,+\,\gamma^4r_4\,+\,\gamma^5V(s_5)$$</p>
<p>For $r_2$ it’s:</p>
<p>$$R_2=\gamma\,r_2\,+\,\gamma^2r_3\,+\,\gamma^3r_4\,+\,\gamma^4V(s_5)$$</p>
<p>And so on… When we hit the terminal state and there is no “next” state, we substitute $0$ for $V(s_{t+N+1})$.</p>
<p>We’ve said that in A3C we’re learning the value function at the same time. The $R_t$ as described above becomes the target value when training our $V(s)$. The value function becomes an approximation of the average of the rewards given the state (because $R_t$ depends on us sampling actions in this state).</p>
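<p>In code, the formulas above reduce to the recursion $R_t=\gamma(r_t+R_{t+1})$, computed by walking the collected rewards backwards. A minimal sketch:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">def discounted_returns(rewards, bootstrap_value, gamma):
    """Compute R_t for each step of an n-step window.

    bootstrap_value is V(s_{t+N+1}) as predicted by the critic,
    or 0 if the last state was terminal.
    """
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = gamma * (r + R)  # the recursion R_t = gamma * (r_t + R_{t+1})
        returns.append(R)
    returns.reverse()
    return returns
</code></pre></div>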
<h4 id="making-the-gradients-more-stable">Making the gradients more stable</h4>
<p>One of the greatest inhibitors of the policy gradient performance is what’s broadly called “high variance”.</p>
<p>I have to admit, the first time I saw that term in this context, I was disoriented. I knew what “variance” was. It’s the “variance of what” that was not clear to me.</p>
<p>Thankfully I found <a href="https://www.quora.com/Why-does-the-policy-gradient-method-have-a-high-variance?share=1">a brilliant answer to this question</a>. It explains the issue simply yet in detail.</p>
<p>Let me cite it here:</p>
<blockquote>
<p>When we talk about high variance in the policy gradient method, we’re specifically talking about the facts that the variance of the gradients are high — namely, that $Var(\nabla_{\theta} J(\theta))$ is big.</p>
</blockquote>
<p>To put it in simple terms: because we’re <strong>sampling</strong> trajectories from a space that is stochastic in nature, we’re bound to have those samples give gradients that disagree a lot on the best direction in which to take our model’s parameters.</p>
<p>I encourage the reader to pause now and read the above-mentioned answer, as it’s vital. The gist of the solution described in it is that we can <strong>subtract a baseline value from each $R_t$</strong>. One good baseline given there is the <strong>average of the sampled accumulated rewards</strong>. The A3C algorithm uses this insight in a very clever way.</p>
<h4 id="value-function-as-a-baseline">Value function as a baseline</h4>
<p>To learn the $V(s)$ we’re typically using the MSE or Huber loss against the accumulated rewards for each step. This means that over time we’re <strong>averaging those rewards out based on the state we’re finding ourselves in</strong>.</p>
<p>Improving our gradient formula with those ideas we now get:</p>
<p>$$\nabla_{\theta} J(\theta) = E_{\tau}\left[\,\nabla_\theta\sum_{t=0}^{T-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))\,\right]$$</p>
<p>It’s important to treat the $(R_t-V(s_t))$ term <strong>as a constant</strong>. This means that when using PyTorch or any other deep learning framework, the computation of it should occur <strong>outside the graph that influences the gradients</strong>.</p>
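<p>In PyTorch this is a one-call affair. A sketch, with returns and values being the tensors of $R_t$ and $V(s_t)$:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># detach() removes the term from the autograd graph,
# so the policy gradient treats the advantage as a constant
advantages = (returns - values).detach()
</code></pre></div>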
<p>The enhanced part of the equation is where we get the word “advantage” in the algorithm’s name. The <strong>advantage</strong> is simply the difference between the accumulated rewards and what those rewards are <strong>on average</strong> for the given state:</p>
<p>$$A(a_{t..t+n},s_{t..t+n})=R_t(a_{t..t+n},s_{t..t+n})-V(s_t)$$</p>
<p>If we make $R_t$ into $Q(s,a)$, as it’s commonly written in the literature, we arrive at the formula:</p>
<p>$$A(s,a)=Q(s,a) - V(s)$$</p>
<p>What’s the intuition here? Imagine that you’re playing chess with a 5-year-old. You win by a huge margin. Your friend, who has watched lots of master-level games, observed this one as well. His take is that even though you scored positively, you still made lots of mistakes. You’ve got your <strong>critic</strong> here. Your score, combined with what it looks like to the “observing critic”, is what we call the advantage of the actions you took.</p>
<h4 id="guarding-against-the-models-overconfidence">Guarding against the model’s overconfidence</h4>
<blockquote>
<p>Although he was warned, Icarus was too young and too enthusiastic about flying. He got excited by the thrill of flying and carried away by the amazing feeling of freedom and started flying high to salute the sun, diving low to the sea, and then up high again.
His father Daedalus was trying in vain to make young Icarus to understand that his behavior was dangerous, and Icarus soon saw his wings melting.
Icarus fell into the sea and drowned.</p>
</blockquote>
<p><em><a href="https://www.greekmyths-greekmythology.com/myth-of-daedalus-and-icarus/">The Myth Of Daedalus And Icarus</a></em></p>
<p>The job of an “actor” is to output probability values for each possible action the agent can take. The greater the probability, the greater the model’s confidence that this action will result in the highest reward.</p>
<p>What if, at some point, the weights are steered in a way that makes the model <em>overconfident</em> about some particular action? If this happens before the model learns much, it becomes a huge problem.</p>
<p>Because we’re using the $\pi(a|s;\theta)$ distribution to sample trajectories, we’re not sampling totally at random. In other words, for $\pi(a|s;\theta) = [0.1, 0.4, 0.2, 0.3]$ our sampling chooses the second action 40% of the time. With any one action overwhelming the others, we lose the ability to <strong>explore</strong> different paths and thus learn valuable lessons.</p>
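<p>Sampling like this takes a single NumPy call. For instance:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import numpy as np

policy = [0.1, 0.4, 0.2, 0.3]
# picks action 1 about 40% of the time, but still explores the others
action_ix = np.random.choice(4, p=policy)
</code></pre></div>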
<p>Empirically, I have sometimes seen the training process unable to escape this “overconfidence” trap for long, long hours.</p>
<h4 id="regularizing-with-entropy">Regularizing with entropy</h4>
<p>Let’s introduce the notion of <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a>.</p>
<p>In simple words, in our case it’s a measure of how much uncertainty a given probability distribution carries. It’s maximized for the uniform distribution. Here’s the formula:</p>
<p>$$H(X)=E[-\log_b(P(X))]$$</p>
<p>This expands to the following:</p>
<p>$$H(X)=-\sum_{i=1}^{n}P(x_i)\log_b(P(x_i))$$</p>
<p>Let’s look closer at the values this function produces using the following simple <a href="https://calca.io">Calca</a> code:</p>
<pre tabindex="0"><code>uniform = [0.25, 0.25, 0.25, 0.25]
more confident = [0.5, 0.25, 0.15, 0.10]
over confident = [0.95, 0.01, 0.01, 0.03]
super over confident = [0.99, 0.003, 0.004, 0.003]
y(x) = x*log(x, 10)
entropy(dist) = -sum(map(y, dist))
entropy (uniform) => 0.6021
entropy (more confident) => 0.5246
entropy (over confident) => 0.1068
entropy (super over confident) => 0.0291
</code></pre><p>We can use the above to “punish” the model whenever it’s too confident of its choices. As we’re going to use gradient descent, we’ll be minimizing the terms that appear in our loss function. Minimizing the entropy as shown above would encourage more confidence, though, so we need to negate it in the loss for it to work the way we intend:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))\,-\,\beta\,H(\pi(a_t|s_t;\theta))$$</p>
<p>Where $\beta$ is a hyperparameter scaling the effect of the entropy penalty on the gradients. Choosing the right value for $\beta$ is vital for the model’s convergence. In this article I’m using $0.01$, as with $0.001$ I was still observing the process getting stuck in overconfidence.</p>
<p>Let’s include the value loss $L_v$ in the loss function formula, making it complete and ready to be implemented:</p>
<p>$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))\,+\,\alpha\,L_v\,-\,\beta\,H(\pi(a_t|s_t;\theta))$$</p>
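<p>Here’s what the full loss can look like in PyTorch. This is only a sketch over one n-step window: policies is the batch of action distributions, log_probs the log-probabilities of the sampled actions, and alpha and beta are the hyperparameters from the formula above:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import torch.nn.functional as F

advantages = (returns - values).detach()           # treated as a constant
policy_loss = -(log_probs * advantages).mean()
value_loss = F.mse_loss(values, returns.detach())  # the L_v term
entropy = -(policies * policies.log()).sum(dim=1).mean()
loss = policy_loss + alpha * value_loss - beta * entropy
</code></pre></div>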
<h4 id="the-last-a-in-a3c">The last A in A3C</h4>
<p>So far we’ve gone from vanilla policy gradients to using the notion of an advantage. We’ve also improved the gradient with a baseline, which intuitively makes the model consist of two parts: the actor and the critic. At this point, we have what’s sometimes called A2C: Advantage Actor-Critic.</p>
<p>Let us now focus on the last piece of the puzzle: the final A, which comes from the word “asynchronous”. It’s been explained very clearly in the <a href="https://arxiv.org/pdf/1602.01783">original paper on A3C</a>.</p>
<p>I think this idea is the least complex of all the pieces of the approach. I’ll just comment on what was already written:</p>
<blockquote>
<p>These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent’s data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.</p>
</blockquote>
<p>A3C’s unique approach is that it doesn’t use experience replay to de-correlate the updates to the weights of the model. Instead, we sample many different trajectories <strong>at the same time</strong> in an <strong>asynchronous</strong> manner.</p>
<p>This means that we create many clones of the environment and let our agents experience them at the same time. The separate agents share their weights in one way or another. Some implementations share those weights very <strong>literally</strong>, with each agent performing updates to the weights on its own whenever it needs to. Other implementations keep one main agent holding the main weights, with updates applied based on the gradients reported by the “worker” agents; the worker agents are then updated with the evolved weights. The environments and agents are not directly synchronized, each working at its own speed. As soon as any of them collects the rewards needed to perform the n-step gradient calculation, the gradients are applied in one way or another.</p>
<p>In this article, I prefer the second approach: having one “main” agent and making workers synchronize their weights with it every n-step period.</p>
<h3 id="practice">Practice</h3>
<h4 id="the-challenge">The challenge</h4>
<p>To present the above theory in practical terms, we’re going to code the A3C to train a toy self-driving game car. The algorithm will have only the game’s pixels as inputs, along with the rewards collected from the environment.</p>
<p>At each step, the player decides how to move the steering wheel, how much throttle to apply, and how much brake.</p>
<p>Points are awarded for each new “tile” the car enters while staying on the road. Every other case brings a small penalty of $-0.1$ points.</p>
<p>We’re going to use <a href="https://gym.openai.com">OpenAI Gym</a> and the environment’s called <a href="https://gym.openai.com/envs/CarRacing-v0/">CarRacing</a>.</p>
<p>You can read a bit more about the setup in the environment’s source code on <a href="https://github.com/openai/gym/blob/master/gym/envs/box2d/car_racing.py">GitHub</a>.</p>
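<p>Interacting with the environment follows the standard Gym loop. A tiny sketch of the raw interface our Runner will wrap:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import gym

env = gym.make('CarRacing-v0')
state = env.reset()
# an action is [steering, throttle, brake]
state, reward, terminal, info = env.step([0.0, 1.0, 0.0])
</code></pre></div>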
<h4 id="coding-the-agent">Coding the Agent</h4>
<p>Our agent is going to output both $\pi(a|s;\theta)$ and $V(s)$. We’re going to use a GRU unit to give the agent the ability to remember its previous actions and the environment’s previous features.</p>
<p>I’ve also decided to use PReLU instead of ReLU activations, as it <strong>appeared</strong> to me that this way the agent was learning much quicker (although I don’t have any numbers to back this impression up).</p>
<p><strong>Disclaimer</strong>: the code presented below <strong>has not been refactored</strong> in any way. If this was going to be used in production I’d certainly hugely clean it up.</p>
<p>Here’s the full listing of the agent’s class:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Agent</span>(nn.Module):
<span style="color:#080;font-weight:bold">def</span> __init__(self, **kwargs):
<span style="color:#038">super</span>(Agent, self).__init__(**kwargs)
self.init_args = kwargs
self.h = torch.zeros(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">256</span>)
self.norm1 = nn.BatchNorm2d(<span style="color:#00d;font-weight:bold">4</span>)
self.norm2 = nn.BatchNorm2d(<span style="color:#00d;font-weight:bold">32</span>)
self.conv1 = nn.Conv2d(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">4</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv2 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv3 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.conv4 = nn.Conv2d(<span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">32</span>, <span style="color:#00d;font-weight:bold">3</span>, stride=<span style="color:#00d;font-weight:bold">2</span>, padding=<span style="color:#00d;font-weight:bold">1</span>)
self.gru = nn.GRUCell(<span style="color:#00d;font-weight:bold">1152</span>, <span style="color:#00d;font-weight:bold">256</span>)
self.policy = nn.Linear(<span style="color:#00d;font-weight:bold">256</span>, <span style="color:#00d;font-weight:bold">4</span>)
self.value = nn.Linear(<span style="color:#00d;font-weight:bold">256</span>, <span style="color:#00d;font-weight:bold">1</span>)
self.prelu1 = nn.PReLU()
self.prelu2 = nn.PReLU()
self.prelu3 = nn.PReLU()
self.prelu4 = nn.PReLU()
nn.init.xavier_uniform_(self.conv1.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv1.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv2.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv2.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv3.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv3.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.conv4.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.conv4.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.constant_(self.gru.bias_ih, <span style="color:#00d;font-weight:bold">0</span>)
nn.init.constant_(self.gru.bias_hh, <span style="color:#00d;font-weight:bold">0</span>)
nn.init.xavier_uniform_(self.policy.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.policy.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
nn.init.xavier_uniform_(self.value.weight, gain=nn.init.calculate_gain(<span style="color:#d20;background-color:#fff0f0">'leaky_relu'</span>))
nn.init.constant_(self.value.bias, <span style="color:#00d;font-weight:bold">0.01</span>)
self.train()
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">reset</span>(self):
self.h = torch.zeros(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">256</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clone</span>(self, num=<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">return</span> [ self.clone_one() <span style="color:#080;font-weight:bold">for</span> _ <span style="color:#080">in</span> <span style="color:#038">range</span>(num) ]
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clone_one</span>(self):
<span style="color:#080;font-weight:bold">return</span> Agent(**self.init_args)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">forward</span>(self, state):
state = state.view(<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">96</span>, <span style="color:#00d;font-weight:bold">96</span>)
state = self.norm1(state)
data = self.prelu1(self.conv1(state))
data = self.prelu2(self.conv2(data))
data = self.prelu3(self.conv3(data))
data = self.prelu4(self.conv4(data))
data = self.norm2(data)
data = data.view(<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>)
h = self.gru(data, self.h)
self.h = h.detach()
pre_policy = h.view(-<span style="color:#00d;font-weight:bold">1</span>)
policy = F.softmax(self.policy(pre_policy))
value = self.value(pre_policy)
<span style="color:#080;font-weight:bold">return</span> policy, value
</code></pre></div><p>You can immediately notice that the actor and critic parts share most of the weights. They differ only in the last layer.</p>
<p>Next, I wanted to abstract out the notion of the “runner”. It encapsulates the idea of a “running agent”: think of it as the game player, with the joystick and its brain to score game points. I’m discretizing the action space the following way:</p>
<table>
<tr>
<th>Action name</th>
<th>value</th>
</tr>
<tr>
<td>Turn left</td>
<td>[-0.8, 0.0, 0.0]</td>
</tr>
<tr>
<td>Turn right</td>
<td>[0.8, 0.0, 0]</td>
</tr>
<tr>
<td>Full throttle</td>
<td>[0.0, 0.1, 0.0]</td>
</tr>
<tr>
<td>Brake</td>
<td>[0.0, 0.0, 0.6]</td>
</tr>
</table>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Runner</span>:
<span style="color:#080;font-weight:bold">def</span> __init__(self, agent, ix, train = <span style="color:#080;font-weight:bold">True</span>, **kwargs):
self.agent = agent
self.train = train
self.ix = ix
self.reset = <span style="color:#080;font-weight:bold">False</span>
self.states = []
<span style="color:#888"># each runner has its own environment:</span>
self.env = gym.make(<span style="color:#d20;background-color:#fff0f0">'CarRacing-v0'</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">get_value</span>(self):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Returns just the current state's value.
</span><span style="color:#d20;background-color:#fff0f0"> This is used when approximating the R.
</span><span style="color:#d20;background-color:#fff0f0"> If the last step was
</span><span style="color:#d20;background-color:#fff0f0"> not terminal, then we're substituting the "r"
</span><span style="color:#d20;background-color:#fff0f0"> with V(s) - hence, we need a way to just
</span><span style="color:#d20;background-color:#fff0f0"> get that V(s) without moving forward yet.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
_input = self.preprocess(self.states)
_, _, _, value = self.decide(_input)
<span style="color:#080;font-weight:bold">return</span> value
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">run_episode</span>(self, yield_every = <span style="color:#00d;font-weight:bold">10</span>, do_render = <span style="color:#080;font-weight:bold">False</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> The episode runner written in the generator style.
</span><span style="color:#d20;background-color:#fff0f0"> This is meant to be used in a "for (...) in run_episode(...):" manner.
</span><span style="color:#d20;background-color:#fff0f0"> Each value generated is a tuple of:
</span><span style="color:#d20;background-color:#fff0f0"> step_ix: the current "step" number
</span><span style="color:#d20;background-color:#fff0f0"> rewards: the list of rewards as received from the environment (without discounting yet)
</span><span style="color:#d20;background-color:#fff0f0"> values: the list of V(s) values, as predicted by the "critic"
</span><span style="color:#d20;background-color:#fff0f0"> policies: the list of policies as received from the "actor"
</span><span style="color:#d20;background-color:#fff0f0"> actions: the list of actions as sampled based on policies
</span><span style="color:#d20;background-color:#fff0f0"> terminal: whether we're in a "terminal" state
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
self.reset = <span style="color:#080;font-weight:bold">False</span>
step_ix = <span style="color:#00d;font-weight:bold">0</span>
rewards, values, policies, actions = [[], [], [], []]
self.env.reset()
<span style="color:#888"># we're going to feed the last 4 frames to the neural network that acts as the "actor-critic" duo. We'll use the "deque" to efficiently drop too old frames always keeping its length at 4:</span>
states = deque([ ])
<span style="color:#888"># we're pre-populating the states deque by taking first 4 steps as "full throttle forward":</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">len</span>(states) < <span style="color:#00d;font-weight:bold">4</span>:
_, r, _, _ = self.env.step([<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">1.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>])
state = self.env.render(mode=<span style="color:#d20;background-color:#fff0f0">'rgb_array'</span>)
states.append(state)
logger.info(<span style="color:#d20;background-color:#fff0f0">'Init reward '</span> + <span style="color:#038">str</span>(r) )
<span style="color:#888"># we need to repeat the following as long as the game is not over yet:</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#080;font-weight:bold">True</span>:
<span style="color:#888"># the frames need to be preprocessed (I'm explaining the reasons later in the article)</span>
_input = self.preprocess(states)
<span style="color:#888"># asking the neural network for the policy and value predictions:</span>
action, action_ix, policy, value = self.decide(_input, step_ix)
<span style="color:#888"># taking the step and receiving the reward along with info if the game is over:</span>
_, reward, terminal, _ = self.env.step(action)
<span style="color:#888"># explicitly rendering the scene (again, this will be explained later)</span>
state = self.env.render(mode=<span style="color:#d20;background-color:#fff0f0">'rgb_array'</span>)
<span style="color:#888"># update the last 4 states deque:</span>
states.append(state)
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">len</span>(states) > <span style="color:#00d;font-weight:bold">4</span>:
states.popleft()
<span style="color:#888"># if we've been asked to render into the window (e. g. to capture the video):</span>
<span style="color:#080;font-weight:bold">if</span> do_render:
self.env.render()
self.states = states
step_ix += <span style="color:#00d;font-weight:bold">1</span>
rewards.append(reward)
values.append(value)
policies.append(policy)
actions.append(action_ix)
<span style="color:#888"># periodically save the state's screenshot along with the numerical values in an easy to read way:</span>
<span style="color:#080;font-weight:bold">if</span> self.ix == <span style="color:#00d;font-weight:bold">2</span> <span style="color:#080">and</span> step_ix % <span style="color:#00d;font-weight:bold">200</span> == <span style="color:#00d;font-weight:bold">0</span>:
fname = <span style="color:#d20;background-color:#fff0f0">'./screens/car-racing/screen-'</span> + <span style="color:#038">str</span>(step_ix) + <span style="color:#d20;background-color:#fff0f0">'-'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(time.time())) + <span style="color:#d20;background-color:#fff0f0">'.jpg'</span>
im = Image.fromarray(state)
im.save(fname)
state.tofile(fname + <span style="color:#d20;background-color:#fff0f0">'.txt'</span>, sep=<span style="color:#d20;background-color:#fff0f0">" "</span>)
_input.numpy().tofile(fname + <span style="color:#d20;background-color:#fff0f0">'.input.txt'</span>, sep=<span style="color:#d20;background-color:#fff0f0">" "</span>)
<span style="color:#888"># if it's game over or we hit the "yield every" value, yield the values from this generator:</span>
<span style="color:#080;font-weight:bold">if</span> terminal <span style="color:#080">or</span> step_ix % yield_every == <span style="color:#00d;font-weight:bold">0</span>:
<span style="color:#080;font-weight:bold">yield</span> step_ix, rewards, values, policies, actions, terminal
rewards, values, policies, actions = [[], [], [], []]
<span style="color:#888"># following is a very tacky way to allow external using code to mark that it wants us to reset the environment, finishing the episode prematurely. (this would be hugely refactored in the production code but for the sake of playing with the algorithm itself, it's good enough):</span>
<span style="color:#080;font-weight:bold">if</span> self.reset:
self.reset = <span style="color:#080;font-weight:bold">False</span>
self.agent.reset()
states = deque([ ])
self.states = deque([ ])
<span style="color:#080;font-weight:bold">return</span>
<span style="color:#080;font-weight:bold">if</span> terminal:
self.agent.reset()
states = deque([ ])
<span style="color:#080;font-weight:bold">return</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">ask_reset</span>(self):
self.reset = <span style="color:#080;font-weight:bold">True</span>
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">preprocess</span>(self, states):
<span style="color:#080;font-weight:bold">return</span> torch.stack([ torch.tensor(self.preprocess_one(image_data), dtype=torch.float32) <span style="color:#080;font-weight:bold">for</span> image_data <span style="color:#080">in</span> states ])
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">preprocess_one</span>(self, image):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Scales the rendered image and makes it grayscale
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#080;font-weight:bold">return</span> rescale(rgb2gray(image), (<span style="color:#00d;font-weight:bold">0.24</span>, <span style="color:#00d;font-weight:bold">0.16</span>), anti_aliasing=<span style="color:#080;font-weight:bold">False</span>, mode=<span style="color:#d20;background-color:#fff0f0">'edge'</span>, multichannel=<span style="color:#080;font-weight:bold">False</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">choose_action</span>(self, policy, step_ix):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Chooses an action to take based on the policy and whether we're in the training mode or not. During training, it samples based on the probability values in the policy. During the evaluation, it takes the most probable action in a greedy way.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
policies = [[-<span style="color:#00d;font-weight:bold">0.8</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>], [<span style="color:#00d;font-weight:bold">0.8</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0</span>], [<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.1</span>, <span style="color:#00d;font-weight:bold">0.0</span>], [<span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.0</span>, <span style="color:#00d;font-weight:bold">0.6</span>]]
<span style="color:#080;font-weight:bold">if</span> self.train:
action_ix = np.random.choice(<span style="color:#00d;font-weight:bold">4</span>, <span style="color:#00d;font-weight:bold">1</span>, p=torch.tensor(policy).detach().numpy())[<span style="color:#00d;font-weight:bold">0</span>]
<span style="color:#080;font-weight:bold">else</span>:
action_ix = np.argmax(torch.tensor(policy).detach().numpy())
logger.info(<span style="color:#d20;background-color:#fff0f0">'Step '</span> + <span style="color:#038">str</span>(step_ix) + <span style="color:#d20;background-color:#fff0f0">' Runner '</span> + <span style="color:#038">str</span>(self.ix) + <span style="color:#d20;background-color:#fff0f0">' Action ix: '</span> + <span style="color:#038">str</span>(action_ix) + <span style="color:#d20;background-color:#fff0f0">' From: '</span> + <span style="color:#038">str</span>(policy))
<span style="color:#080;font-weight:bold">return</span> np.array(policies[action_ix], dtype=np.float32), action_ix
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">decide</span>(self, state, step_ix = <span style="color:#00d;font-weight:bold">999</span>):
policy, value = self.agent(state)
action, action_ix = self.choose_action(policy, step_ix)
<span style="color:#080;font-weight:bold">return</span> action, action_ix, policy, value
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">load_state_dict</span>(self, state):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> As we'll have multiple "worker" runners, they will need to be able to sync their agents' weights with the main agent.
</span><span style="color:#d20;background-color:#fff0f0"> This function loads the weights into this runner's agent.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
self.agent.load_state_dict(state)
</code></pre></div><p>I’m also encapsulating the training process in a class of its own. You can notice the gradients being clipped before being applied. I’m also clipping the rewards into the range of $[-3, 3]$ to help keep the variance low.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Trainer</span>:
<span style="color:#080;font-weight:bold">def</span> __init__(self, gamma, agent, window = <span style="color:#00d;font-weight:bold">15</span>, workers = <span style="color:#00d;font-weight:bold">8</span>, **kwargs):
<span style="color:#038">super</span>().__init__(**kwargs)
self.agent = agent
self.window = window
self.gamma = gamma
self.optimizer = optim.Adam(self.agent.parameters(), lr=<span style="color:#00d;font-weight:bold">1e-4</span>)
self.workers = workers
<span style="color:#888"># even though we're loading the weights into worker agents explicitly, I found that still without sharing the weights as following, the algorithm was not converging:</span>
self.agent.share_memory()
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">fit</span>(self, episodes = <span style="color:#00d;font-weight:bold">1000</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> The higher level method for training the agents.
</span><span style="color:#d20;background-color:#fff0f0"> It calls into the lower level "train" which orchestrates the process itself.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
last_update = <span style="color:#00d;font-weight:bold">0</span>
updates = <span style="color:#038">dict</span>()
<span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, self.workers + <span style="color:#00d;font-weight:bold">1</span>):
updates[ ix ] = { <span style="color:#d20;background-color:#fff0f0">'episode'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'step'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'rewards'</span>: deque(), <span style="color:#d20;background-color:#fff0f0">'losses'</span>: deque(), <span style="color:#d20;background-color:#fff0f0">'points'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'mean_reward'</span>: <span style="color:#00d;font-weight:bold">0</span>, <span style="color:#d20;background-color:#fff0f0">'mean_loss'</span>: <span style="color:#00d;font-weight:bold">0</span> }
<span style="color:#080;font-weight:bold">for</span> update <span style="color:#080">in</span> self.train(episodes):
now = time.time()
<span style="color:#888"># you could do something useful here with the updates dict.</span>
<span style="color:#888"># I've opted out as I'm using logging anyways and got more value in just watching the log file, grepping for the desired values</span>
<span style="color:#888"># save the current model's weights every minute:</span>
<span style="color:#080;font-weight:bold">if</span> now - last_update > <span style="color:#00d;font-weight:bold">60</span>:
torch.save(self.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(now)) + <span style="color:#d20;background-color:#fff0f0">'-.pytorch'</span>)
last_update = now
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">train</span>(self, episodes = <span style="color:#00d;font-weight:bold">1000</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Lower level training orchestration method. Written in the generator style. Intended to be used with "for update in train(...):"
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#888"># create the requested number of background agents and runners:</span>
worker_agents = self.agent.clone(num = self.workers)
runners = [ Runner(agent=agent, ix = ix + <span style="color:#00d;font-weight:bold">1</span>, train = <span style="color:#080;font-weight:bold">True</span>) <span style="color:#080;font-weight:bold">for</span> ix, agent <span style="color:#080">in</span> <span style="color:#038">enumerate</span>(worker_agents) ]
<span style="color:#888"># we're going to communicate the workers' updates via the thread safe queue:</span>
queue = mp.SimpleQueue()
<span style="color:#888"># if we've not been given a number of episodes: assume the process is going to be interrupted with the keyboard interrupt once the user (us) decides so:</span>
<span style="color:#080;font-weight:bold">if</span> episodes <span style="color:#080">is</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#038">print</span>(<span style="color:#d20;background-color:#fff0f0">'Starting out an infinite training process'</span>)
<span style="color:#888"># create the actual background processes, making their entry be the train_one method:</span>
processes = [ mp.Process(target=self.train_one, args=(runners[ix - <span style="color:#00d;font-weight:bold">1</span>], queue, episodes, ix)) <span style="color:#080;font-weight:bold">for</span> ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">1</span>, self.workers + <span style="color:#00d;font-weight:bold">1</span>) ]
<span style="color:#888"># run those processes:</span>
<span style="color:#080;font-weight:bold">for</span> process <span style="color:#080">in</span> processes:
process.start()
<span style="color:#080;font-weight:bold">try</span>:
<span style="color:#888"># what follows is a rather naive implementation of listening to workers updates. it works though for our purposes:</span>
<span style="color:#080;font-weight:bold">while</span> <span style="color:#038">any</span>([ process.is_alive() <span style="color:#080;font-weight:bold">for</span> process <span style="color:#080">in</span> processes ]):
results = queue.get()
<span style="color:#080;font-weight:bold">yield</span> results
<span style="color:#080;font-weight:bold">except</span> <span style="color:#b06;font-weight:bold">Exception</span> <span style="color:#080;font-weight:bold">as</span> e:
logger.error(<span style="color:#038">str</span>(e))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">train_one</span>(self, runner, queue, episodes = <span style="color:#00d;font-weight:bold">1000</span>, ix = <span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Orchestrate the training for a single worker runner and agent. This is intended to run in its own background process.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#888"># possibly naive way of trying to de-correlate the weight updates further (I have no hard evidence to prove if it works, other than my subjective observation):</span>
time.sleep(ix)
<span style="color:#080;font-weight:bold">try</span>:
<span style="color:#888"># we are going to request the episode be reset whenever our agent scores lower than its max points. the same will happen if the agent scores total of -10 points:</span>
max_points = <span style="color:#00d;font-weight:bold">0</span>
max_eval_points = <span style="color:#00d;font-weight:bold">0</span>
min_points = <span style="color:#00d;font-weight:bold">0</span>
max_episode = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> episode_ix <span style="color:#080">in</span> itertools.count(start=<span style="color:#00d;font-weight:bold">0</span>, step=<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">if</span> episodes <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span> <span style="color:#080">and</span> episode_ix >= episodes:
<span style="color:#080;font-weight:bold">return</span>
max_episode_points = <span style="color:#00d;font-weight:bold">0</span>
points = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#888"># load up the newest weights every new episode:</span>
runner.load_state_dict(self.agent.state_dict())
<span style="color:#888"># every 5 episodes lets evaluate the weights we've learned so far by recording the run of the car using the greedy strategy:</span>
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> episode_ix % <span style="color:#00d;font-weight:bold">5</span> == <span style="color:#00d;font-weight:bold">0</span>:
eval_points = self.record_greedy(episode_ix)
<span style="color:#080;font-weight:bold">if</span> eval_points > max_eval_points:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(eval_points) + <span style="color:#d20;background-color:#fff0f0">'-eval-points.pytorch'</span>)
max_eval_points = eval_points
<span style="color:#888"># each n-step window, compute the gradients and apply</span>
<span style="color:#888"># also: decide if we shouldn't restart the episode if we don't want to explore too much of the not-useful state space:</span>
<span style="color:#080;font-weight:bold">for</span> step, rewards, values, policies, action_ixs, terminal <span style="color:#080">in</span> runner.run_episode(yield_every=self.window):
points += <span style="color:#038">sum</span>(rewards)
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> points > max_points:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">'-points.pytorch'</span>)
max_points = points
<span style="color:#080;font-weight:bold">if</span> ix == <span style="color:#00d;font-weight:bold">1</span> <span style="color:#080">and</span> episode_ix > max_episode:
torch.save(runner.agent.state_dict(), <span style="color:#d20;background-color:#fff0f0">'./checkpoints/car-racing/'</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">'-episode.pytorch'</span>)
max_episode = episode_ix
<span style="color:#080;font-weight:bold">if</span> points < -<span style="color:#00d;font-weight:bold">10</span> <span style="color:#080">or</span> (max_episode_points > min_points <span style="color:#080">and</span> points < min_points):
terminal = <span style="color:#080;font-weight:bold">True</span>
max_episode_points = <span style="color:#00d;font-weight:bold">0</span>
point = <span style="color:#00d;font-weight:bold">0</span>
runner.ask_reset()
<span style="color:#080;font-weight:bold">if</span> terminal:
logger.info(<span style="color:#d20;background-color:#fff0f0">'TERMINAL for '</span> + <span style="color:#038">str</span>(ix) + <span style="color:#d20;background-color:#fff0f0">' at step '</span> + <span style="color:#038">str</span>(step) + <span style="color:#d20;background-color:#fff0f0">' with total points '</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">' max: '</span> + <span style="color:#038">str</span>(max_episode_points) )
<span style="color:#888"># if we're learning, then compute and apply the gradients and load the newest weights:</span>
<span style="color:#080;font-weight:bold">if</span> runner.train:
loss = self.apply_gradients(policies, action_ixs, rewards, values, terminal, runner)
runner.load_state_dict(self.agent.state_dict())
max_episode_points = <span style="color:#038">max</span>(max_episode_points, points)
min_points = <span style="color:#038">max</span>(min_points, points)
<span style="color:#888"># communicate the gathered values to the main process:</span>
queue.put((ix, episode_ix, step, rewards, loss, points, terminal))
<span style="color:#080;font-weight:bold">except</span> <span style="color:#b06;font-weight:bold">Exception</span> <span style="color:#080;font-weight:bold">as</span> e:
string = traceback.format_exc()
logger.error(<span style="color:#038">str</span>(e) + <span style="color:#d20;background-color:#fff0f0">' → '</span> + string)
queue.put((ix, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, [-<span style="color:#00d;font-weight:bold">1</span>], -<span style="color:#00d;font-weight:bold">1</span>, <span style="color:#038">str</span>(e) + <span style="color:#d20;background-color:#fff0f0">'<br />'</span> + string, <span style="color:#080;font-weight:bold">True</span>))
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">record_greedy</span>(self, episode_ix):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Records the video of the "greedy" run based on the current weights.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
directory = <span style="color:#d20;background-color:#fff0f0">'./videos/car-racing/episode-'</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">'-'</span> + <span style="color:#038">str</span>(<span style="color:#038">int</span>(time.time()))
player = Player(agent=self.agent, directory=directory, train=<span style="color:#080;font-weight:bold">False</span>)
points = player.play()
logger.info(<span style="color:#d20;background-color:#fff0f0">'Evaluation at episode '</span> + <span style="color:#038">str</span>(episode_ix) + <span style="color:#d20;background-color:#fff0f0">': '</span> + <span style="color:#038">str</span>(points) + <span style="color:#d20;background-color:#fff0f0">' points ('</span> + directory + <span style="color:#d20;background-color:#fff0f0">')'</span>)
<span style="color:#080;font-weight:bold">return</span> points
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">apply_gradients</span>(self, policies, actions, rewards, values, terminal, runner):
worker_agent = runner.agent
actions_one_hot = torch.tensor([[ <span style="color:#038">int</span>(i == action) <span style="color:#080;font-weight:bold">for</span> i <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#00d;font-weight:bold">4</span>) ] <span style="color:#080;font-weight:bold">for</span> action <span style="color:#080">in</span> actions], dtype=torch.float32)
policies = torch.stack(policies)
values = torch.cat(values)
values_nograd = torch.zeros_like(values.detach(), requires_grad=<span style="color:#080;font-weight:bold">False</span>)
values_nograd.copy_(values)
discounted_rewards = self.discount_rewards(runner, rewards, values_nograd[-<span style="color:#00d;font-weight:bold">1</span>], terminal)
advantages = discounted_rewards - values_nograd
logger.info(<span style="color:#d20;background-color:#fff0f0">'Runner '</span> + <span style="color:#038">str</span>(runner.ix) + <span style="color:#d20;background-color:#fff0f0">'Rewards: '</span> + <span style="color:#038">str</span>(rewards))
logger.info(<span style="color:#d20;background-color:#fff0f0">'Runner '</span> + <span style="color:#038">str</span>(runner.ix) + <span style="color:#d20;background-color:#fff0f0">'Discounted Rewards: '</span> + <span style="color:#038">str</span>(discounted_rewards.numpy()))
        # epsilon added to avoid taking the log of zero:
        log_policies = torch.log(1e-8 + policies)
one_log_policies = torch.sum(log_policies * actions_one_hot, dim=<span style="color:#00d;font-weight:bold">1</span>)
entropy = torch.sum(policies * -log_policies)
policy_loss = -torch.mean(one_log_policies * advantages)
value_loss = F.mse_loss(values, discounted_rewards)
value_loss_nograd = torch.zeros_like(value_loss)
value_loss_nograd.copy_(value_loss)
policy_loss_nograd = torch.zeros_like(policy_loss)
policy_loss_nograd.copy_(policy_loss)
logger.info(<span style="color:#d20;background-color:#fff0f0">'Value Loss: '</span> + <span style="color:#038">str</span>(<span style="color:#038">float</span>(value_loss_nograd)) + <span style="color:#d20;background-color:#fff0f0">' Policy Loss: '</span> + <span style="color:#038">str</span>(<span style="color:#038">float</span>(policy_loss_nograd)))
loss = policy_loss + <span style="color:#00d;font-weight:bold">0.5</span> * value_loss - <span style="color:#00d;font-weight:bold">0.01</span> * entropy
self.agent.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(worker_agent.parameters(), <span style="color:#00d;font-weight:bold">40</span>)
<span style="color:#888"># the following step is crucial. at this point, all the info about the gradients reside in the worker agent's memory. We need to "move" those gradients into the main agent's memory:</span>
self.share_gradients(worker_agent)
<span style="color:#888"># update the weights with the computed gradients:</span>
self.optimizer.step()
worker_agent.zero_grad()
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">float</span>(loss.detach())
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">share_gradients</span>(self, worker_agent):
<span style="color:#080;font-weight:bold">for</span> param, shared_param <span style="color:#080">in</span> <span style="color:#038">zip</span>(worker_agent.parameters(), self.agent.parameters()):
<span style="color:#080;font-weight:bold">if</span> shared_param.grad <span style="color:#080">is</span> <span style="color:#080">not</span> <span style="color:#080;font-weight:bold">None</span>:
<span style="color:#080;font-weight:bold">return</span>
shared_param._grad = param.grad
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">clip_reward</span>(self, reward):
<span style="color:#d20;background-color:#fff0f0">"""
</span><span style="color:#d20;background-color:#fff0f0"> Clips the rewards into the <-3, 3> range preventing too big of the gradients variance.
</span><span style="color:#d20;background-color:#fff0f0"> """</span>
<span style="color:#080;font-weight:bold">return</span> <span style="color:#038">max</span>(<span style="color:#038">min</span>(reward, <span style="color:#00d;font-weight:bold">3</span>), -<span style="color:#00d;font-weight:bold">3</span>)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">discount_rewards</span>(self, runner, rewards, last_value, terminal):
discounted_rewards = [<span style="color:#00d;font-weight:bold">0</span> <span style="color:#080;font-weight:bold">for</span> _ <span style="color:#080">in</span> rewards]
loop_rewards = [ self.clip_reward(reward) <span style="color:#080;font-weight:bold">for</span> reward <span style="color:#080">in</span> rewards ]
<span style="color:#080;font-weight:bold">if</span> terminal:
loop_rewards.append(<span style="color:#00d;font-weight:bold">0</span>)
<span style="color:#080;font-weight:bold">else</span>:
loop_rewards.append(runner.get_value())
<span style="color:#080;font-weight:bold">for</span> main_ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#038">len</span>(discounted_rewards) - <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">for</span> inside_ix <span style="color:#080">in</span> <span style="color:#038">range</span>(<span style="color:#038">len</span>(loop_rewards) - <span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>, -<span style="color:#00d;font-weight:bold">1</span>):
<span style="color:#080;font-weight:bold">if</span> inside_ix >= main_ix:
reward = loop_rewards[inside_ix]
discounted_rewards[main_ix] += self.gamma**(inside_ix - main_ix) * reward
<span style="color:#080;font-weight:bold">return</span> torch.tensor(discounted_rewards)
</code></pre></div><p>For the <code>record_greedy</code> method to work we need the following class:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">class</span> <span style="color:#b06;font-weight:bold">Player</span>(Runner):
<span style="color:#080;font-weight:bold">def</span> __init__(self, directory, **kwargs):
<span style="color:#038">super</span>().__init__(ix=<span style="color:#00d;font-weight:bold">999</span>, **kwargs)
self.env = Monitor(self.env, directory)
<span style="color:#080;font-weight:bold">def</span> <span style="color:#06b;font-weight:bold">play</span>(self):
points = <span style="color:#00d;font-weight:bold">0</span>
<span style="color:#080;font-weight:bold">for</span> step, rewards, values, policies, actions, terminal <span style="color:#080">in</span> self.run_episode(yield_every = <span style="color:#00d;font-weight:bold">1</span>, do_render = <span style="color:#080;font-weight:bold">True</span>):
points += <span style="color:#038">sum</span>(rewards)
self.env.close()
<span style="color:#080;font-weight:bold">return</span> points
</code></pre></div><p>All the above code can be used as follows (in the Python script):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#080;font-weight:bold">if</span> __name__ == <span style="color:#d20;background-color:#fff0f0">"__main__"</span>:
agent = Agent()
trainer = Trainer(gamma = <span style="color:#00d;font-weight:bold">0.99</span>, agent = agent)
trainer.fit(episodes=<span style="color:#080;font-weight:bold">None</span>)
</code></pre></div><h4 id="the-importance-of-tuning-of-the-n-step-window-size">The importance of tuning of the n-step window size</h4>
<p>Reading the code, you can notice that we’ve chosen $15$ to be the size of the n-step window. We’ve also chosen $\gamma=0.99$. Getting those values right is a subject for tuning. The same ones that work on one game or a challenge will not necessarily work well for the other.</p>
<p>Here’s a quick explanation of how to think about them: We’re going to be penalized most of the time. It’s important for us to give the algorithm a chance to actually find trajectories that score positively. In the “CarRacing” challenge, I’ve found that it can take 10 steps of moving “full throttle” in the correct direction before we’re being rewarded by entering the new “tile”. I’ve just simply added $5$ of the safety net to that number. No mathematical proof follows this thinking here, but I can tell you though that it made a <strong>huge</strong> difference in the training time for me. The version of the code I’m presenting above starts to score above 700 points after approximately 10 hours on my Ryzen 7 based computing box.</p>
<h4 id="problems-with-the-state-being-returned-from-the-environment---overcoming-with-the-explicit-render">Problems with the state being returned from the environment - overcoming with the explicit render</h4>
<p>You might have also noticed that I’m not using the state values returned by the <code>step</code> method of the gym environment. This might seem contradictory to how the gym is typically being used. After <strong>days</strong> of not seeing my model converge though, I have found that the <code>step</code> method was returning <strong>one and the same</strong> numpy array <strong>on each call</strong>. You can imagine that it was the absolutely <strong>last</strong> thing I’ve checked when trying to find that bug.</p>
<p>I’ve found the <code>render(mode='rgb_array')</code> works as intended each time. I just needed to write my own preprocessing code, to scale it down and make it grayscale.</p>
<h4 id="how-to-know-when-the-algorithm-converges">How to know when the algorithm converges</h4>
<p>I’ve seen some people thinking that their A3C implementation does not converge. The resulting policy did not seem to be working that well, but the training process was taking a bit longer than “some other implementation”. I fell for this kind of thinking myself as well. My humble bit of advice is to stick to what makes sense mathematically. Someone else’s model might be converging faster simply because of the hardware being used or some slight difference in the code <strong>around</strong> the training (e.g. explicit render needed in my case). This might not have anything to do with the A3C part at all.</p>
<p>How do we “stick to what makes sense mathematically”? Simply by logging the value loss and observing it as the training continues. Intuitively, for the model that has converged, we should see that it has already learned the value function. Those values — representing the average of the discounted rewards — should not make the loss too big most of the time. Still, for some states, the best action will make the $R_t$ much bigger than $V(s_t)$ which means that we still should see the loss spiking from time to time.</p>
<p>Again, the above bit of advice doesn’t come with any mathematical proofs. It’s what I found working and making sense <strong>in my case</strong>.</p>
<h3 id="the-results">The Results</h3>
<p>Instead of presenting hard-core statistics about the model’s performance — which wouldn’t make much sense because I stopped it as soon as the “evaluation” videos started looking cool enough) — I’ll just post three videos of the car driving on its own through the three randomly generated tracks.</p>
<p>Have fun watching and even more fun coding it yourself too!</p>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/873-openaigym.video.92.68.video000000.mp4" type="video/mp4">
</video>
</center>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/892-openaigym.video.90.68.video000000.mp4" type="video/mp4">
</video>
</center>
<center>
<video width="100%" controls poster="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/poster.png">
<source src="/blog/2018/08/self-driving-toy-car-using-the-a3c-algorithm/904-openaigym.video.77.68.video000000.mp4" type="video/mp4">
</video>
</center>
<script>
renderMathInElement(
document.body,
{
delimiters: [
{left: "$$", right: "$$", display: true},
{left: "\\[", right: "\\]", display: true},
{left: "$", right: "$", display: false},
{left: "\\(", right: "\\)", display: false}
]
}
);
</script>