Introducing the End Point AI Portal for Our Customers

2025-03-12T00:00:00+00:00

Today, End Point is opening up the End Point AI Portal for our customers to use. As a technology consulting company that gets in the trenches with our customers to integrate, manage, and support the information technology they need to operate and transform their businesses, we believe this is a great service for us to provide.

We are offering our AI Portal to our customers at no charge. The price is certainly right, so we think our customers are very much going to want to use it. 🙂

What’s in the Portal

End Point’s AI Portal provides an interface to interact with many different LLM AI chatbot services all in one place. It even gives users the ability to simultaneously submit queries and instructions to multiple LLMs and see their responses side by side.

Our Portal is a customized implementation of the Open WebUI AI interface and connects via APIs to various AI services. Because of this architecture, our analysis has shown that unless there is ongoing high-intensity usage, the cost per user will be very low — much lower than a typical end-user subscription to a single LLM provider. In the future we may implement an accounting system for the portal to pass usage costs onto our users, but our first priority is to get this capability into our clients’ hands as soon as possible to help them be early adopters of this transformative tech.

Our AI Portal connects to a range of proprietary AI-powered chatbots served up by Anthropic, OpenAI, Google, and Grok, as well as to several open-source AI-powered chatbots hosted on servers in the United States. This includes the Deepseek R1 model developed in China that rocked the tech world with its release on January 20. It’s a remarkable model, but for businesses in the US having their chat histories stored on servers in China is generally not acceptable. Having the service hosted in the US helps address privacy concerns. We run the open-source models on our own servers and using third-party hosting providers such as together.ai and groq.com.

To provide a more coherent user experience, the Open WebUI interface maintains each user’s chat history and provides the ability to manage, share, and download them.

The portal has already helped End Point with internal projects such as experimenting with how best to process various data sets to make them useful for AI-powered applications. The portal hosts several RAG (Retrieval Augmented Generation) applications that make use of various LLMs to search and answer questions about End Point’s blog posts—we are working to identify the best solution, so results may vary. Please don’t give up on reading our blog just yet.

Below are screenshots of our AI Portal’s interface showing some of its features (click on the images to view at full size).

We require users to accept our AI portal’s standard Terms of Service to use the portal. Additionally, each AI-powered chatbot accessible through our portal has its own Terms of Service, to which we provide links for your reference. Our primary concern is whether the information users input into these chatbots might be used by providers for training or other purposes. We strive to avoid chatbots that engage in such practices, and as of the launch of our portal, we’re not aware of any chatbot providers that claim rights to user input data. However, it’s common for Terms of Service documents to include clauses allowing providers to modify their terms without prior notice, so we can’t guarantee ongoing compliance with our standards. Regardless, users should always exercise caution and good judgment when entering confidential information into any online service.

Why We Are Doing This

At End Point, we’ve built our reputation for almost 30 years by providing expert consulting, development, and systems support. As AI technology rapidly evolves, we see it as both an opportunity and a necessity to stay ahead—helping our customers integrate AI into their own systems while deepening our expertise in the field.

By opening up our Open WebUI Portal to customers at no charge, we’re reinforcing our commitment to innovation, transparency, value, and great customer service. We intend this not only to benefit our clients, but to highlight our ability to integrate AI-powered solutions into your business. We stand ready to help you enhance your existing workflows, products, and services—as well as to create entirely new, AI-powered applications.

AI use cases include:

User-facing enhancements
- Intelligent similarity search
- Recommendation systems
- AI-powered insights from your data
- Dynamic FAQ generation based on past inquiries
- Customer support chatbots
- Faster and smarter documentation creation
- You name it. The possibilities are endless!
AI-driven development & optimization
- System testing and automated quality assurance
- AI-assisted code reviews and refactoring
- Automated code generation for improved efficiency
- …and more

We’re also using AI to refine our own processes, improve our code, and enhance the value we deliver to our customers. This initiative is part of our commitment to ensure that our clients continue to benefit from the best technology solutions available.

How to Get an Account

Reach out to your End Point client rep or use our contact form to get in touch.

Implementing SummAE neural text summarization with a denoising auto-encoder

2020-05-28T00:00:00+00:00

If there’s any problem space in machine learning, with no shortage of (unlabelled) data to train on, it’s easily natural language processing (NLP).

In this article, I’d like to take on the challenge of taking a paper that came from Google Research in late 2019 and implementing it. It’s going to be a fun trip into the world of neural text summarization. We’re going to go through the basics, the coding, and then we’ll look at what the results actually are in the end.

The paper we’re going to implement here is: Peter J. Liu, Yu-An Chung, Jie Ren (2019) SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders.

Here’s the paper’s abstract:

We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training, evaluation code, and best model weights are open-sourced.

Preliminaries

Before we go any further, let’s talk a little bit about neural summarization in general. There’re two main approaches to it:

The first approach makes the model “focus” on the most important parts of the longer text - extracting them to form a summary.

Let’s take a recent article, “Shopify Admin API: Importing Products in Bulk”, by one of my great co-workers, Patrick Lewis, as an example and see what the extractive summarization would look like. Let’s take the first two paragraphs:

I recently worked on an interesting project for a store owner who was facing a daunting task: he had an inventory of hundreds of thousands of Magic: The Gathering (MTG) cards that he wanted to sell online through his Shopify store. The logistics of tracking down artwork and current market pricing for each card made it impossible to do manually.

My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card in Shopify. The resulting project turned what would have been a months- or years-long task into a bulk upload that only took a few hours to complete and allowed the store owner to immediately start selling his inventory online. The online store launch turned out to be even more important than initially expected due to current closures of physical stores.

An extractive model could summarize it as follows:

I recently worked on an interesting project for a store owner who had an inventory of hundreds of thousands of cards that he wanted to sell through his store. The logistics and current pricing for each card made it impossible to do manually. My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card. The store launch turned out to be even more important than expected due to current closures of physical stores.

See how it does the copying and pasting? The big advantage of these types of models is that they are generally easier to create and the resulting summaries tend to faithfully reflect the facts included in the source.

The downside though is that it’s not how a human would do it. We do a lot of paraphrasing, for instance. We use different words and tend to form sentences less rigidly following the original ones. The need for the summaries to feel more natural made the second type — abstractive — into this subfield’s holy grail.

Datasets

The paper’s authors used the so-called “ROCStories” dataset (“Tackling The Story Ending Biases in The Story Cloze Test”. Rishi Sharma, James Allen, Omid Bakhshandeh, Nasrin Mostafazadeh. In Proceedings of the 2018 Conference of the Association for Computational Linguistics (ACL), 2018).

In my experiments, I’ve also tried the model against one that’s quite a bit more difficult: WikiHow (Mahnaz Koupaee, William Yang Wang (2018) WikiHow: A Large Scale Text Summarization Dataset).

ROCStories

The dataset consists of 98162 stories, each one consisting of 5 sentences. It’s incredibly clean. The only step I needed to take was to split the stories between the train, eval, and test sets.

Examples of sentences:

Example 1:

My retired coworker turned 69 in July. I went net surfing to get her a gift. She loves Diana Ross. I got two newly released cds and mailed them to her. She sent me an email thanking me.

Example 2:

Tom alerted the government he expected a guest. When she didn’t come he got in a lot of trouble. They talked about revoking his doctor’s license. And charging him a huge fee! Tom’s life was destroyed because of his act of kindness.

Example 3:

I went to see the doctor when I knew it was bad. I hadn’t eaten in nearly a week. I told him I felt afraid of food in my body. He told me I was developing an eating disorder. He instructed me to get some help.

Wikihow

This is one of the most challenging openly available datasets for neural summarization. It consists of more than 200,000 long-sequence pairs of text + headline scraped from WikiHow’s website.

Some examples:

Text:

One easy way to conserve water is to cut down on your shower time. Practice cutting your showers down to 10 minutes, then 7, then 5. Challenge yourself to take a shorter shower every day. Washing machines take up a lot of water and electricity, so running a cycle for a couple of articles of clothing is inefficient. Hold off on laundry until you can fill the machine. Avoid letting the water run while you’re brushing your teeth or shaving. Keep your hoses and faucets turned off as much as possible. When you need them, use them sparingly.

Headline:

Take quicker showers to conserve water. Wait for a full load of clothing before running a washing machine. Turn off the water when you’re not using it.

The main challenge for the summarization model here is that the headline was actually created by humans and is not just “extracting” anything. Any model performing well on this dataset actually needs to model the language pretty well. Otherwise, the headline could be used for computing the evaluation metrics, but it’s pretty clear that traditional metrics like ROUGE are just bound here to miss the point.

Basics of the sequence-to-sequence modeling

Most sequence-to-sequence models are based on the “next token prediction” workflow.

The general idea can be expressed with P(token | context) — where the task is to model this conditional probability distribution. The “context” here depends on the approach.

Those models are also called “auto-regressive” because they need to consume their own predictions from previous steps during the inference:

predict(["<start>"], context)
# "I"
predict(["<start>", "I"], context)
# "love"
predict(["<start>", "I", "love"], context)
# "biking"
predict(["<start>", "I", "love", "biking"], context)
# "<end>"

Naively simple modeling: Markov Model

In this model, the approach is to take on a bold assumption: that the probability of the next token is conditioned only on the previous token.

The Markov Model is elegantly introduced in the blog post Next Word Prediction using Markov Model.

Why is it naive? Because we know that the probability of the word “love” depends on the word “I” given a broader context. A model that’s always going to output “roses” would miss the best word more often than not.

Modeling with neural networks

Usually, sequence-to-sequence neural network models consist of two parts:

encoder
decoder

The encoder is there to build a “gist” representation of the input sequence. The gist and the previous token become our “context” to do the inference. This fits in well within the P(token | context) modeling I described above. That distribution can be expressed more clearly as P(token | previous; gist).

There are other approaches too with one of them being the ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training - 2020 - Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming. The difference in the approach here was the prediction of n-tokens ahead at once.

Teacher-forcing

Let’s see how could we go about teaching the model about the next token’s conditional distribution.

Imagine that the model’s parameters aren’t performing well yet. We have an input sequence of: ["<start>", "I", "love", "biking", "during", "the", "summer", "<end>"]. We’re training the model giving it the first token:

model(["<start>", context])
# "I"

Great, now let’s ask it for another one:

model(["<start>", "I"], context])
# "wonder"

Hmmm that’s not what we wanted, but let’s naively continue:

model(["<start>", "I", "wonder"], context)
# "why"

We could continue gathering predictions and compute the loss at the end. The loss would really only be able to tell it about the first mistake (“love” vs. “wonder”); the rest of the errors would just accumulate from here. This would hinder the learning considerably, adding in the noise from the accumulated errors.

There’s a better approach called Teacher Forcing. In this approach, you’re telling the model the true answer after each of its guesses. The last example would look like the following:

model(["<start>", "I", "love"], context)
# "watching"

You’d continue the process, feeding it the full input sequence and the loss term would be computed based on all its guesses.

Compute-friendly representation for tokens and gists

Some of the readers might want to skip this section. I’d like to describe quickly here the concept of the latent space and vector embeddings. This is to keep the matters relatively palatable for the broader audience.

Representing words naively

How do we turn the words (strings) into numbers that we input into our machine learning models? A software developer might think about assigning each word a unique integer. This works well for databases but in machine learning models, the fact that integers follow one another means that they encode a relation (which one follows which and in what distance). This doesn’t work well for almost any problem in data science.

Traditionally, the problem is solved by “one-hot encoding”. This means that we’re turning our integers into vectors, where each value is zero except the one for the index that equals the value to encode (or minus one if your programming language uses zero-based indexing). Example: 3 => [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] when the total number of “integers” (classes) to encode is 10.

This is better as it breaks the ordering and distancing assumptions. It doesn’t encode anything about the words, though, except the arbitrary number we’ve decided to assign to them. We now don’t have the ordering but we also don’t have any distance. Empirically though we just know that the word “love” is much closer to “enjoy” than it is to “helicopter”.

A better approach: word embeddings

How could we keep our vector representation (as in one-hot encoding) but also introduce the distance? I’ve already glanced over this concept in my post about the simple recommender system. The idea is to have a vector of floating-point values so that the closer the words are in their meaning, the smaller the angle is between them. We can easily compute a metric following this logic by measuring the cosine distance. This way, the word representations are easy to feed into the encoder, and they already contain a lot of the information in themselves.

Not only words

Can we only have vectors for words? Couldn’t we have vectors for paragraphs, so that the closer they are in their meaning, the smaller some vector space metric between them? Of course we can. This is, in fact, what will allow us in this article’s model to encode the “gist” that we talked about. The “encoder” part of the model is going to learn the most convenient way of turning the input sequence into the floating-point numbers vector.

Auto-encoders

We’re slowly approaching the model from the paper. We still have one concept that’s vital to understand in order to get why the model is going to work.

Up until now, we talked about the following structure of the typical sequence-to-sequence neural network model:

This is true e.g. for translation models where the input sequence is in English and the output is in Greek. It’s also true for this article’s model during the inference.

What if we’d make the input and output to be the same sequence? We’d turn it into a so-called auto-encoder.

The output of course isn’t all that useful — we already know what the input sequence is. The true value is in the model’s ability to encode the input into a gist.

Adding the noise

A very interesting type of an auto-encoder is the denoising auto-encoder. The idea is that the input sequence gets randomly corrupted and the network learns to still produce a good gist and reconstruct the sequence before it got corrupted. This makes the training “teach” the network about the deeper connections in the data, instead of just “memorizing” as much as it can.

The SummAE model

We’re now ready to talk about the architecture from the paper. Given what we’ve already learned, this is going to be very simple. The SummAE model is just a denoising auto-encoder that is being trained a special way.

Auto-encoding paragraphs and sentences

The authors were training the model on both single sentences and full paragraphs. In all cases the task was to reproduce the uncorrupted input.

The first part of the approach is about having two special “start tokens” to signal the mode: paragraph vs. sentence. In my code, I’ve used “<start-full>” and “<start-short>”.

During the training, the model learns the conditional distributions given those two tokens and the ones that follow, for any given token in the sequence.

Adding the noise

The sentences are simply concatenated to form a paragraph. The input then gets corrupted at random by means of:

masking the input tokens
shuffling the order of the sentences within the paragraph

The authors are claiming that the latter helped them in solving the issue of the network just memorizing the first sentence. What I have found though is that this model is generally prone towards memorizing concrete sentences from the paragraph. Sometimes it’s the first, and sometimes it’s some of the others. I’ve found this true even when adding a lot of noise to the input.

The code

The full PyTorch implementation described in this blog post is available at https://github.com/kamilc/neural-text-summarization. You may find some of its parts less clean than others — it’s a work in progress. Specifically, the data download is almost left out.

You can find the WikiData preprocessing in a notebook in the repository. For the ROCStories, I just downloaded the CSV files and concatenated with Unix cat. There’s an additional process.py file generated from a very simple IPython session.

Let’s have a very brief look at some of the most interesting parts of the code:

class SummarizeNet(NNModel):
    def encode(self, embeddings, lengths):
        # ...

    def decode(self, embeddings, encoded, lengths, modes):
        # ...

    def forward(self, embeddings, clean_embeddings, lengths, modes):
        # ...

    def predict(self, vocabulary, embeddings, lengths):
        # ...

You can notice separate methods for forward and predict. I chose the Transformer over the recurrent neural networks for both the encoder part and the decoder. The PyTorch implementation of the transformer decoder part already includes the teacher forcing in the forward method. This makes it convenient at the training time — to just feed it the full, uncorrupted sequence of embeddings as the “target”. During the inference we need to do the “auto-regressive” part by hand though. This means feeding the previous predictions in a loop — hence the need for two distinct methods here.

def forward(self, embeddings, clean_embeddings, lengths, modes):
    noisy_embeddings = self.mask_dropout(embeddings, lengths)

    encoded = self.encode(noisy_embeddings[:, 1:, :], lengths-1)
    decoded = self.decode(clean_embeddings, encoded, lengths, modes)

    return (
        decoded,
        encoded
    )

You can notice that I’m doing the token masking at the model level during the training. The code also shows cleanly the structure of this seq2seq model — with the encoder and the decoder.

The encoder part looks simple as long as you’re familiar with the transformers:

def encode(self, embeddings, lengths):
    batch_size, seq_len, _ = embeddings.shape

    embeddings = self.encode_positions(embeddings)

    paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=0).expand((batch_size, seq_len)).to(self.device)
    paddings_mask = (paddings_mask + 1) > lengths.unsqueeze(dim=1).expand((batch_size, seq_len))

    encoded = embeddings.transpose(1,0)

    for ix, encoder in enumerate(self.encoders):
        encoded = encoder(encoded, src_key_padding_mask=paddings_mask)
        encoded = self.encode_batch_norms[ix](encoded.transpose(2,1)).transpose(2,1)

    last_encoded = encoded

    encoded = self.pool_encoded(encoded, lengths)

    encoded = self.to_hidden(encoded)

    return encoded

We’re first encoding the positions as in the “Attention Is All You Need” paper and then feeding the embeddings into a stack of the encoder layers. At the end, we’re morphing the tensor to have the final dimension equal the number given as the model’s parameter.

The decode sits on PyTorch’s shoulders too:

def decode(self, embeddings, encoded, lengths, modes):
    batch_size, seq_len, _ = embeddings.shape

    embeddings = self.encode_positions(embeddings)

    mask = self.mask_for(embeddings)

    encoded = self.from_hidden(encoded)
    encoded = encoded.unsqueeze(dim=0).expand(seq_len, batch_size, -1)

    decoded = embeddings.transpose(1,0)
    decoded = torch.cat(
        [
            encoded,
            decoded
        ],
        axis=2
    )
    decoded = self.combine_decoded(decoded)
    decoded = self.combine_batch_norm(decoded.transpose(2,1)).transpose(2,1)

    paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=0).expand((batch_size, seq_len)).to(self.device)
    paddings_mask = paddings_mask > lengths.unsqueeze(dim=1).expand((batch_size, seq_len))

    for ix, decoder in enumerate(self.decoders):
        decoded = decoder(
            decoded,
            torch.ones_like(decoded),
            tgt_mask=mask,
            tgt_key_padding_mask=paddings_mask
        )
        decoded = self.decode_batch_norms[ix](decoded.transpose(2,1)).transpose(2,1)

    decoded = decoded.transpose(1,0)

    return self.linear_logits(decoded)

You can notice that I’m combining the gist received from the encoder with each word embeddings — as this is how it was described in the paper.

The predict is very similar to forward:

def predict(self, vocabulary, embeddings, lengths):
    """
    Caller should include the start and end tokens here
    but we’re going to ensure the start one is replaces by <start-short>
    """
    previous_mode = self.training

    self.eval()

    batch_size, _, _ = embeddings.shape

    results = []

    for row in range(0, batch_size):
        row_embeddings = embeddings[row, :, :].unsqueeze(dim=0)
        row_embeddings[0, 0] = vocabulary.token_vector("<start-short>")

        encoded = self.encode(
            row_embeddings[:, 1:, :],
            lengths[row].unsqueeze(dim=0)
        )

        results.append(
            self.decode_prediction(
                vocabulary,
                encoded,
                lengths[row].unsqueeze(dim=0)
            )
        )

    self.training = previous_mode

    return results

The workhorse behind the decoding at the inference time looks as follows:

def decode_prediction(self, vocabulary, encoded1xH, lengths1x):
    tokens = ['<start-short>']
    last_token = None
    seq_len = 1

    encoded1xH = self.from_hidden(encoded1xH)

    while last_token != '<end>' and seq_len < 50:
        embeddings1xSxD = vocabulary.embed(tokens).unsqueeze(dim=0).to(self.device)
        embeddings1xSxD = self.encode_positions(embeddings1xSxD)

        maskSxS = self.mask_for(embeddings1xSxD)

        encodedSx1xH = encoded1xH.unsqueeze(dim=0).expand(seq_len, 1, -1)

        decodedSx1xD = embeddings1xSxD.transpose(1,0)
        decodedSx1xD = torch.cat(
            [
                encodedSx1xH,
                decodedSx1xD
            ],
            axis=2
        )
        decodedSx1xD = self.combine_decoded(decodedSx1xD)
        decodedSx1xD = self.combine_batch_norm(decodedSx1xD.transpose(2,1)).transpose(2,1)

        for ix, decoder in enumerate(self.decoders):
            decodedSx1xD = decoder(
                decodedSx1xD,
                torch.ones_like(decodedSx1xD),
                tgt_mask=maskSxS,
            )
            decodedSx1xD = self.decode_batch_norms[ix](decodedSx1xD.transpose(2,1))
            decodedSx1xD = decodedSx1xD.transpose(2,1)

        decoded1x1xD = decodedSx1xD.transpose(1,0)[:, (seq_len-1):seq_len, :]
        decoded1x1xV = self.linear_logits(decoded1x1xD)

        word_id = F.softmax(decoded1x1xV[0, 0, :]).argmax().cpu().item()
        last_token = vocabulary.words[word_id]
        tokens.append(last_token)
        seq_len += 1

    return ' '.join(tokens[1:])

You can notice starting with the “start short” token and going in a loop, getting predictions, and feeding back until the “end” token.

Again, the model is very, very simple. What makes the difference is how it’s being trained — it’s all in the training data corruption and the model pre-training.

It’s already a long article so I encourage the curious readers to look at the code at my GitHub repo for more details.

My experiment with the WikiHow dataset

In my WikiHow experiment I wanted to see how the results look if I fed the full articles and their headlines for the two modes of the network. The same data-corruption regime was used in this case.

Some of the results were looking almost good:

Text:

for a savory flavor, mix in 1/2 teaspoon ground cumin, ground turmeric, or masala powder.this works best when added to the traditional salty lassi. for a flavorful addition to the traditional sweet lassi, add 1/2 teaspoon of ground cardamom powder or ginger, for some kick. , start with a traditional sweet lassi and blend in some of your favorite fruits. consider mixing in strawberries, papaya, bananas, or coconut.try chopping and freezing the fruit before blending it into the lassi. this will make your drink colder and frothier. , while most lassi drinks are yogurt based, you can swap out the yogurt and water or milk for coconut milk. this will give a slightly tropical flavor to the drink. or you could flavor the lassi with rose water syrup, vanilla extract, or honey.don’t choose too many flavors or they could make the drink too sweet. if you stick to one or two flavors, they’ll be more pronounced. , top your lassi with any of the following for extra flavor and a more polished look: chopped pistachios sprigs of mint sprinkle of turmeric or cumin chopped almonds fruit sliver

Headline:

add a spice., blend in a fruit., flavor with a syrup or milk., garnish.

Predicted summary:

blend vanilla in a sweeter flavor . , add a sugary fruit . , do a spicy twist . eat with dessert . , revise .

It’s not 100% faithful to the original text even though it seems to “read” well.

My suspicion is that pre-training against a much larger corpus of text might possibly help. There’s an obvious issue with the lack of very specific knowledge here to have the network summarize better. Here’s another of those examples:

Text:

the settings app looks like a gray gear icon on your iphone’s home screen.; , this option is listed next to a blue “a” icon below general. , this option will be at the bottom of the display & brightness menu. , the right-hand side of the slider will give you bigger font size in all menus and apps that support dynamic type, including the mail app. you can preview the corresponding text size by looking at the menu texts located above and below the text size slider. , the left-hand side of the slider will make all dynamic type text smaller, including all menus and mailboxes in the mail app. , tap the back button twice in the upper-left corner of your screen. it will save your text size settings and take you back to your settings menu. , this option is listed next to a gray gear icon above display & brightness. , it’s halfway through the general menu. ,, the switch will turn green. the text size slider below the switch will allow for even bigger fonts. , the text size in all menus and apps that support dynamic type will increase as you go towards the right-hand side of the slider. this is the largest text size you can get on an iphone. , it will save your settings.

Headline:

open your iphone’s settings., scroll down and tap display & brightness., tap text size., tap and drag the slider to the right for bigger text., tap and drag the slider to the left for smaller text., go back to the settings menu., tap general., tap accessibility., tap larger text. , slide the larger accessibility sizes switch to on position., tap and drag the slider to the right., tap the back button in the upper-left corner.

Predicted summary:

open your iphone ’s settings . , tap general . , scroll down and tap accessibility . , tap larger accessibility . , tap and larger text for the iphone to highlight the text you want to close . , tap the larger text - colored contacts app .

It might be interesting to train against this dataset again while:

utilizing some pre-trained, large scale model as part of the encoder
using a large corpus of text to still pre-train the auto-encoder

This could possibly take a lot of time to train on my GPU (even with the pre-trained part of the encoder). I didn’t follow the idea further at this time.

The problem with getting paragraphs when we want the sentences

One of the biggest problems the authors ran into was with the decoder outputting the long version of the text, even though it was asked for the sentence-long summary.

Authors called this phenomenon the “segregation issue”. What they have found was that the encoder was mapping paragraphs and sentences into completely separate regions. The solution to this problem was to trick the encoder into making both representations indistinguishable. The following figure comes from the paper and shows the issue visualized:

Better gists by using the “critic”

The idea of a “critic” has been popularized along with the fantastic results produced by some of the Generative Adversarial Networks. The general workflow is to have the main network generate output while the other tries to guess some of its properties.

For GANs that are generating realistic photos, the critic is there to guess if the photo was generated or if it’s real. A loss term is added based on how well it’s doing, penalizing the main network for generating photos that the critic is able to call out as fake.

A similar idea was used in the A3C algorithm I blogged about (Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm). The “critic” part penalized the AI agent for taking steps that were on average less advantageous.

Here, in the SummAE model, the critic adds a penalty to the loss to the degree to which it’s able to guess whether the gist comes from a paragraph or a sentence.

Training with the critic might get tricky. What I’ve found to be the cleanest way is to use two different optimizers — one updating the main network’s parameters while the other updates the critic itself:

for batch in batches:
    if mode == "train":
        self.model.train()
        self.discriminator.train()
    else:
        self.model.eval()
        self.discriminator.eval()

    self.optimizer.zero_grad()
    self.discriminator_optimizer.zero_grad()

    logits, state = self.model(
        batch.word_embeddings.to(self.device),
        batch.clean_word_embeddings.to(self.device),
        batch.lengths.to(self.device),
        batch.mode.to(self.device)
    )

    mode_probs_disc = self.discriminator(state.detach())
    mode_probs = self.discriminator(state)

    discriminator_loss = F.binary_cross_entropy(
        mode_probs_disc,
        batch.mode
    )

    discriminator_loss.backward(retain_graph=True)

    if mode == "train":
        self.discriminator_optimizer.step()

    text = batch.text.copy()

    if self.no_period_trick:
        text = [txt.replace('.', '') for txt in text]

    classes = self.vocabulary.encode(text, modes=batch.mode)
    classes = classes.roll(-1, dims=1)
    classes[:,classes.shape[1]-1] = 3

    model_loss = torch.tensor(0).cuda()

    if logits.shape[0:2] == classes.shape:
        model_loss = F.cross_entropy(
            logits.reshape(-1, logits.shape[2]).to(self.device),
            classes.long().reshape(-1).to(self.device),
            ignore_index=3
        )
    else:
        print("WARNING: Skipping model loss for inconsistency between logits and classes shapes")

    fooling_loss = F.binary_cross_entropy(
        mode_probs,
        torch.ones_like(batch.mode).to(self.device)
    )

    loss = model_loss + (0.1 * fooling_loss)

    loss.backward()
    if mode == "train":
        self.optimizer.step()

    self.optimizer.zero_grad()
    self.discriminator_optimizer.zero_grad()

The main idea is to treat the main network’s encoded gist as constant with respect to the updates to the critic’s parameters, and vice versa.

Results

I’ve found some of the results look really exceptional:

Text:

lynn is unhappy in her marriage. her husband is never good to her and shows her no attention. one evening lynn tells her husband she is going out with her friends. she really goes out with a man from work and has a great time. lynn continues dating him and starts having an affair.

Predicted summary:

lynn starts dating him and has an affair .

Text:

cedric was hoping to get a big bonus at work. he had worked hard at the office all year. cedric’s boss called him into his office. cedric was disappointed when told there would be no bonus. cedric’s boss surprised cedric with a big raise instead of a bonus.

Predicted summary:

cedric had a big deal at his boss ’s office .

Some others showed how the model attends to single sentences though:

Text:

i lost my job. i was having trouble affording my necessities. i didn’t have enough money to pay rent. i searched online for money making opportunities. i discovered amazon mechanical turk.

Predicted summary:

i did n’t have enough money to pay rent .

While the sentence like this one would maybe make a good headline — it’s definitely not the best summary as it naturally loses the vital parts found in other sentences.

Final words

First of all, let me thank the paper’s authors for their exceptional work. It was a great read and great fun implementing!

Abstractive text summarization remains very difficult. The model trained for this blog post has very limited use in practice. There’s a lot of room for improvement though, which makes the future of abstractive summaries very promising.

An Introduction to Neural Networks

2019-07-01T00:00:00+00:00

Photo by Sudhamshu Hebbar, used under CC BY 2.0

Earlier this year I wrote a post about my work with a machine-learning camera, the AWS DeepLens, which has onboard processing power to enable AI capabilities without sending data to the cloud. Neural networks are a type of ML model which achieves very impressive results on certain problems (including computer vision), so in this post I give a more thorough introduction to neural networks, and share some useful resources for those who want to dig deeper.

Neurons and Nodes

Neural networks are models inspired by the function of biological neural networks. They consist of nodes (arranged in layers), and the connections between those nodes. Each connection between two nodes enables one-way information transfer: a node either receives input from, or sends output to each node to which it is connected. Nodes typically have an “activation function”, parameterized by the node’s inputs, and its output is the result of this function.

As with the function of biological neural networks, the emergence of information processing from these mathematical operations is opaque. Nevertheless, complex artificial neural networks are capable of feats such as vision, language translation, and winning competitive games. As the technology improves, even more impressive tasks will become possible. As with organic brains, neural networks can achieve complex tasks only as a result of appropriate architecture, constraints, and training—for machine learning, humans must (for now) design it all.

Neural Network Architecture

Nodes are grouped in layers: the input layer, the output layer, and all the layers between them, known as hidden layers. Nodes can be networked in a variety of ways within and between layers, and sophisticated neural network models can include dozens of layers configured in various ways. These include layers which summarize, combine, eliminate, direct, or transform information. Each receives its input from the previous layer, and passes its output to the next layer. The last layer is designed such that its output answers the relevant question (for example, it would offer 9 options if the goal were to identify the hand-written numbers 1–9).

For all this information processing to achieve a given task, the parameters of each node need appropriate values. The process of choosing those values is called training. In order to train a neural network, one needs to provide examples of what the network should do. (For example, to train it to write requires examples of writing. To train it to identify objects in images requires images and their appropriately labeled counterparts.) The more data a model can learn from, the better it can work. Gathering enough data is typically a major undertaking.

Training a Neural Network

Before training, models have random parameters for all nodes. Each time data is passed through the model, the effectiveness of the model is measured using a “loss function”. Loss functions measure how wrong a model’s output is. Different loss functions (also known as cost functions or error functions) measure this in different ways, but in general, the more wrong a model is, the higher its loss/error/cost. Loss functions thus summarize the quality of a model’s output with a single number. Models are optimized to minimize the loss. (For more on the role of loss functions in neural networks, I suggest this excellent article.)

One of the most interesting details of the entire process has to do with how the parameters are tuned. Model optimization relies on variations of a process called gradient descent, in which parameter values are adjusted by small intervals in an attempt to minimize the loss. Over many thousands of repetitions, the training program uses calculus to pick values that help to minimize the loss. As you can imagine, this process becomes extremely computationally intensive when the neural network is large and complex. However, in order to solve hard problems, networks must be large and complex. This is why training neural networks requires substantial computing power, and often takes place in the cloud. (For more on stochastic gradient descent, I suggest this video as a great starting point, or this review for a more advanced overview.)

Facial Recognition Using Amazon DeepLens: Counting Liquid Galaxy Interactions

2019-05-01T00:00:00+00:00

I have been exploring the possible uses of a machine-learning-enabled camera for the Liquid Galaxy. The Amazon Web Services (AWS) DeepLens is a camera that can receive and transmit data over wifi, and that has computing hardware built in. Since its hardware enables it to use machine learning models, it can perform computer vision tasks in the field.

The Amazon DeepLens camera

This camera is the first of its kind—likely the first of many, given the ongoing rapid adoption of Internet of Things (IoT) devices and computer vision. It came to End Point’s attention as hardware that could potentially interface with and extend End Point’s immersive visualization platform, the Liquid Galaxy. We’ve thought of several ways computer vision could potentially work to enhance the platform, for example:

Monitoring users’ reactions
Counting unique visitors to the LG
Counting the number of people using an LG at a given time

The first idea would depend on parsing facial expressions. Perhaps a certain moment in a user experience causes people to look confused, or particularly delighted—valuable insights. The second idea would generate data that could help us assess the platform’s impact, using a metric crucial to any potential clients whose goals involve engaging audiences. The third idea would create a simpler metric: the average number of people engaging with the system over a period of time. Nevertheless, this idea has a key advantage over the second: it doesn’t require distinguishing between people, which makes it a much more tractable project. This post focuses on the third idea.

To set up the camera, the user has to plug it into a power outlet and connect it to wifi. The camera will still work even with a slow network connection, though when the connection is slower the delay between the camera seeing something and reporting it is longer. However, this delay was hardly noticable on my home network which has slow-to-moderate speeds of about 17 Mbps down and 33 Mbps up.

Computer Vision and the Amazon DeepLens

A deep learning model is a neural network with multiple layers of processing units. It is called “deep” because it has multiple layers. The inputs and outputs of each processing unit are numbers. These units are roughly analogous to neurons: they receive input from units in the previous layer, and output it to units in the next layer after transforming it based on a function. These “activation functions” can change in a variety of ways. The last layer’s outputs translate into the results. These models work because these functions get tuned based on how well the model works. For example, to make a model that labels each human face in a picture and draws a box around it, we would start with a corpus of pictures with boxes drawn around faces, as well as the versions of the pictures without the boxes drawn. We would test the model on the non-labeled images by checking—for each picture—whether the output generated by the model is correct. If not, the computer chooses different unit functions, tries again, and compares the results. Repeating this process thousands of times yields models which work remarkably well for a wide range of tasks, including computer vision.

In deep learning for computer vision, training on large sets of labeled images enables models to generalize about visual characteristics. The training process takes a lot of computing resources, but once models are trained, they can produce results quickly and with relative ease. This is why the DeepLens is able to perform computer vision with its limited computing resources.

Since the DeepLens is an Amazon product, it comes as no surprise that the user interface and backend for DeepLens consist of AWS services. One of the most important is SageMaker, which is used to train, manage, optimize, and deploy machine learning models such as neural networks. It includes hosted Jupyter notebooks (Jupyter is a development environment for data science), as well as the computing resources required for model training and storage. With SageMaker, users can train computer vision models for deployment to DeepLens, or import and adjust pretrained models from various sources.

Remote management of the DeepLens depends on AWS Lambda, a “serverless” cloud service that provides an environment to run backend code and integrate with other cloud services. It runs the show, allowing users to manage everything from the camera’s behavior to what happens to gathered data. Another service, AWS Greengrass, connects the instructions from AWS Lambda to the DeepLens, managing tasks like authentication, updates, and reactions to local events.

Amazon’s IoT service saves information about each DeepLens, and allows users to manage their devices, for example by choosing which model is active on the device, or viewing a live stream from the camera. It also keeps track of what’s going on with the hardware, even when it’s off. When a model is running on the DeepLens, you can view a live stream of its inferences about what it’s seeing (the labeled images). Amazon has released various pretrained models designed to work on the DeepLens. Using a model for detecting faces, we can get a live stream that looks like this:

Me looking at the DeepLens in my kitchen

Facial recognition inferences on multiple people. (Witness my smile of satisfaction at finally finding enthusiastic subjects of facial recognition.)

Each face that the camera detects gets a box around it, along with the model’s level of certainty that it is a face. The above pictures were the results of an attempt to simulate the conditions where this could be used.

The Model

The model I used was trained on data from ImageNet, a public database with hundreds or thousands of images associated with nouns. (For example they have 1537 pictures of folding chairs.) ImageNet is commonly used to train and test computer vision models.

However, the training for this model didn’t stop there: Amazon used transfer learning from another large image dataset, MS-COCO, to fine-tune the model for face detection. Transfer learning works essentially by retraining the last layer of an already-trained model. In this way it harnesses the “insights” of the existing model (e.g. about shapes, colors, and positions) by repurposing this information to make predictions about something else. In this case, whether something is a face.

Since this model was pretrained and optimized by Amazon for the DeepLens, it provides a low effort route to implementing a computer vision model on the DeepLens. I didn’t have to do any of the processing on my own hardware. The DeepLens hardware took care of all the predictions, though the biggest resource savings were from not having to train the model myself (which can take days, or longer).

When the facial recognition model is deployed and the DeepLens is on, an AWS Lambda function written in Python repeatedly prompts the camera to get frames from the camera:

frame = awscam.getLastFrame()

…to resize the frames before inference (the model accepts frames of particular size):

frame_resize = cv2.resize(frame, (input_height, input_width))

…to pass the frames to the model:

parsed_inference_results = model.parseResult(model_type, model.doInference(frame_resize))

…and to use the results to draw boxes around the faces:

cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (255, 165, 20), 10)

As you can see from how often “cv2” appears in the code above, this implementation relies heavily on code from OpenCV, an open source computer vision framework. Finally, the results are sent to the cloud:

client.publish(topic=iot_topic, payload=json.dumps(cloud_output))

In the last code snippet above, iot_topic refers to an Amazon “MQTT topic” (Message Queuing Telemetry Transport), for IoT devices. MQTT is the standard connectivity framework for DeepLens and many other IoT devices. One of its advantages for this context is that it can handle situations with intermittent connectivity, by smoothly queueing messages for when the network connection is stable. The essence of MQTT is to enable publishing and subscribing to different topics. The system of topics enables results from a DeepLens to trigger other processes. For example, the DeepLens could publish a message when it sees a face, and this could prompt another cloud service to do something else, such as save what time and how long the face appeared.

I wanted to test how data from this model would compare to a human’s perception. The first step was to understand what data the camera offers. It produces data about each frame analyzed: a timestamp (in 13-digit Unix time), and the predicted probability that something it identifies is a face. To gather this data, I used the AWS IoT service to manually subscribe to a secure MQTT topic where the DeepLens published its predictions. Each frame processed produces data like this:

{
  "format": "json",
  "payload": {
    "face": 0.5654296875
  },
  "qos": 0,
  "timestamp": 1554853281975,
  "topic": "$aws/things/deeplens_bnU5sr2sSD2ecW5YkfJZtw/infer"
}

The data generated by a single frame (with one face) when processed by the DeepLens.

For my purposes, I was only interested in the timestamps and payloads (which contain the number of faces identified, and their probabilities). I decided to test the facial recognition model under several different conditions:

No faces present
One face present
Multiple faces present

For condition 1 I just aimed it at an empty room for 20 minutes, and for condition 2 I sat in front of the camera for 20 minutes. For condition 3, I aimed the camera at a public space for 20 minutes, and while it was running I kept an ongoing count of the number of people looking in the general direction of the camera (I put the camera in front of a wall with a TV on it so people would be more likely to look towards it). Then I averaged my count over the duration of the sample, which resulted in an average engagement number of 2.5 people, meaning that on average, 2.5 people were looking at the camera. In an attempt to minimize bias, I made my human-eye assessment before looking at any of the data.

I’ll spoil one aspect of the results right away: there were no false positives under any condition. Even the lower probability guesses corresponded to actual faces, though this result might not hold true in a room with lots of face-like art, that’s not too common of a scenario. This simplified things, since it meant there was no need to set a lower bound on the probabilities which we should count—any face detected by the camera is a face. This also highlights one of my remaining questions about the model: is there useful information to be gained from the probabilities?

Another important note: I noticed early in the experiment that it almost never detects a face farther than 15 feet away. For the use case of a Liquid Galaxy, the 15-foot range is too short to capture all types of engagement (some people look at it across the room), but from my experience with the system I think that users within this range could be accurately described as focused users—something worth measuring, but certainly not everything worth measuring. After noticing this, I retested condition 2 with my face about 5 feet from the DeepLens, after initially trying it from across a room.

How did the DeepLens counts compare to my counts?

The model matched my performance in conditions 1 and 2, which makes a strong statement about its reliability in relatively static and close-up conditions such as looking at an empty room, or looking at someone stare at their laptop across a small table. In contrast, it did not count as many faces as I did in condition 3—so I’m happy to report I can still outperform A.I. on something.

Anyway, this suggests that the model is somewhat conservative, at least compared to my count (likely partly due to my eyes having a range larger than 15 feet). Therefore, when considering usage statistics gathered by a similar method, it might make most sense to think of the results as a lower bound, e.g. “the average number of people focused on the system was more than 2.1”.

It would be useful to experiment with the multiple faces condition again, to see how robust these findings are. It would also be helpful to keep track of factors like how much people move, the lighting, and the orientation of the camera, to see if they might impact the results. It would also be useful to automate the data collection and analysis.

This investigation has showed me that the DeepLens has a lot of potential as a tool for measuring engagement. Perhaps a future post will examine how it can be used to count users.

Thanks for reading! You are welcome to learn more about End Point Liquid Galaxy and AWS DeepLens.

Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm

2018-08-29T00:00:00+00:00

The field of Reinforcement Learning has seen a lot of great improvement in the past years. Researchers at universities and companies like Deep Mind have been developing new and better ways to train intelligent, artificial agents to solve more and more difficult tasks. The algorithms being developed are requiring less time to train. They also are making the training much more stable.

This article is about an algorithm that’s one of the most cited lately: A3C — Asynchronous Advantage Actor-Critic.

As the subject is both wide and deep, I’m assuming the reader has the relevant background mastered already. Although reading it might be interesting even without understanding most of the notions in use, having a good grasp of them will help you get the most out of it.

Because we’re looking at the Deep Reinforcement Learning, the obvious requirement is to be acquainted with the neural networks. I’m also using different notions known in the field of Reinforcement Learning overall like $Q(a, s)$ and $V(s)$ functions or the n-step return. The mathematical expressions, in particular, are given assuming that the reader already knows what the symbols stand for. Some notions known from other families of RL algorithms are being touched on as well (e.g. experience replay) — to contrast them with the A3C way of solving the same kind of problems. The article along with the source code uses the OpenAI gym, Python, and PyTorch among other Python-related libraries.

Theory

The A3C algorithm is a part of the greater class of RL algorithms called Policy Gradients.

In this approach, we’re creating a model that approximates the action-choosing policy itself.

Let’s contrast it with value iteration, the goal of which is to learn the value function and have policy emerge as the function that chooses an action transitioning to the state of the greatest value.

With the policy gradient approach, we’re approximating the policy with a differentiable function. Such stated problem requires only a good approximation of the gradient that over time will maximize the rewards.

The unique approach of A3C adds a very clever twist: we’re also learning an approximation of the value function at the same time. This helps us in getting the variance of the gradient down considerably, making the training much more stable.

These two aspects of the algorithm are being personified within its name: actor-critic. The policy function approximation is being called the actor, while the value function is being called the critic.

The policy gradient

As we’ve noticed already, in order to improve our policy function approximation, we need a gradient that points at the direction that maximizes the rewards.

I’m not going to reinvent the wheel here. There are some great resources the reader can access to dig deep into the Mathematics of what’s called the Policy Gradient Theorem:

The following equation presents the basic form of the gradient of the policy function:

$$\nabla_{\theta} J(\theta) = E_{\tau}[,R_{\tau}\cdot\nabla_\theta,\sum_{t=0}^{T-1},log,\pi(a_t|s_t;\theta),]$$

This states that for each sampled trajectory $\tau$, the correct estimate of the gradient is the expected value of the rewards times the action probabilities moved into the log space. Ascending in this direction makes our rewards greater and greater over time.

We can derive all the needed intermediary gradients ourselves by hand of course. Because we’re using PyTorch though, we only need the right loss function.

Let’s figure out the right loss function formula that will produce the gradient as shown above:

$$L_\theta=-J(\theta)$$

Also:

$$J(\theta)=E_\tau[R_\tau\cdot\sum_{t=0}^{T-1},log,\pi(a_t|s_t;\theta)]$$

Hence:

$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}R_t,\cdot,log\pi(a_t|s_t;\theta)$$

Formalizing the accumulation of rewards

For now, we’ve been using the $R_\tau$ and $R_t$ terms very abstractly. Let’s make this part more intuitive and concrete now.

Its true meaning really is “the quality of the sampled trajectory”. Consider the following equation:

$$R_t=\sum_{i=t}^{N+t}\gamma^{i-t}r_i,+,\gamma^{i-t+1}V(s_{t+N+1})$$

Each $r_i$ is the reward received from the environment after each step. Each trajectory consists of multiple steps. Each time, we’re sampling actions based on our policy function. This gives probabilities of a given action being best given the state.

What if we’re taking 5 actions for which we’re not being given any reward but overall it helped us get rewarded in the 6th step? This is exactly the case we’ll be dealing with in this article later when training a toy car to drive based only on pixel values of the scene. In that environment, we’ll be given $-0.1$ “negative” reward each step and something close to $7$ each new “tile” the car stays on the road.

We need a way to still encourage actions that make us earn rewards in a not too distant future. We also need to be smart and discount future rewards somewhat so that the more immediate the reward is to our action, the more emphasis we put on it.

That’s exactly what the above equation does. Notice that $\gamma$ becomes a hyper-parameter. It makes sense to give it value from $(0, 1)$. Let’s consider the following list of rewards: $[r_1, r_2, r_3, r_4]$. For $r_1$, the formula for the discounted accumulated reward is:

$$R_1=\gamma,r_1,+,\gamma^2r_2,+,\gamma^3r_3,+,\gamma^4r_4,+,\gamma^5V(s_5)$$

For $r_2$ it’s:

$$R_2=\gamma,r_2,+,\gamma^2r_3,+,\gamma^3r_4,+,\gamma^4V(s_5)$$

And so on… In case when we hit the terminal state, having no “next” state, we substitute $0$ for $V(s_{t+N+1})$.

We’ve said that in A3C we’re learning the value function at the same time. The $R_t$ as described above becomes the target value when training our $V(s)$. The value function becomes an approximation of the average of the rewards given the state (because $R_t$ depends on us sampling actions in this state).

Making the gradients more stable

One of the greatest inhibitors of the policy gradient performance is what’s broadly called “high variance”.

I have to admit, the first time I saw that term in this context, I was disoriented. I knew what “variance” was. It’s the “variance of what” that was not clear to me.

Thankfully I found a brilliant answer to this question. It explains the issue simply yet in detail.

Let me cite it here:

When we talk about high variance in the policy gradient method, we’re specifically talking about the facts that the variance of the gradients are high — namely, that $Var(\nabla_{\theta} J(\theta))$ is big.

To put it in simple terms: because we’re sampling trajectories from the space that is stochastic in nature, we’re bound to have those samples give gradients that disagree a lot on the best direction to take our model’s parameters into.

I encourage the reader to pause now and read the above-mentioned answer as it’s very vital. The gist of the solution described in it is that we can subtract a baseline value from each $R_t$. An example of a good baseline that was given was to make it into an average of the sampled accumulated rewards. The A3C algorithm uses this insight in a very, very clever way.

Value function as a baseline

To learn the $V(s)$ we’re typically using the MSE or Huber loss against the accumulated rewards for each step. This means that over time we’re averaging those rewards out based on the state we’re finding ourselves in.

Improving our gradient formula with those ideas we now get:

$$\nabla_{\theta} J(\theta) = E_{\tau}[,\nabla_\theta,\sum_{t=0}^{T-1},log,\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t)),]$$

It’s important to treat the $(R_t-V(s_t))$ term as a constant. This means that when using PyTorch or any other deep learning framework, the computation of it should occur outside the graph that influences the gradients.

The enhanced part of the equation is where we get the word “advantage” in the algorithm’s name. The advantage is simply the difference between the accumulated rewards and what those rewards are on average for the given state:

$$A(a_{t..t+n},s_{t..t+n})=R_t(a_{t..t+n},s_{t..t+n})-V(s_t)$$

If we’ll make $R_t$ into $Q(s,a)$ as it’s commonly written in literature, we’ll arrive at the formula:

$$A(s,a)=Q(s,a) - V(s)$$

What’s the intuition here? Imagine that you’re playing chess with a 5-year-old. You win by a huge margin. Your friend who’s watched lots of master-level games observed this one as well. His take is that even though you scored positively, you still made lots of mistakes. You’ve got your critic here. Your score and what it looks like for the “observing critic” combined is what we call the advantage of the actions you took.

Guarding against the model’s overconfidence

Although he was warned, Icarus was too young and too enthusiastic about flying. He got excited by the thrill of flying and carried away by the amazing feeling of freedom and started flying high to salute the sun, diving low to the sea, and then up high again. His father Daedalus was trying in vain to make young Icarus to understand that his behavior was dangerous, and Icarus soon saw his wings melting. Icarus fell into the sea and drowned.

The Myth Of Daedalus And Icarus

The job of an “actor” is to output probability values for each possible action the agent can take. The greater the probability, the greater the model’s confidence that this action will result in the highest reward.

What if at some point, the weights are being steered in a way that the model becomes overconfident of some particular action? If this happens before the model learns much, it becomes a huge problem.

Because we’re using the $\pi(a|s;\theta)$ distribution to sample trajectories with, we’re not sampling totally at random. In other words, for $\pi(a|s;\theta) = [0.1, 0.4, 0.2, 0.3]$ our sampling chooses the second option 40% of the time. With any action overwhelming the others, we’re losing the ability to explore different paths and thus learn valuable lessons.

Empirically, I have found myself seeing the process sometimes not even able to escape the “overconfidence” area for long, long hours.

Regularizing with entropy

Let’s introduce the notion of an entropy.

In simple words in our case, it’s the measure of how much “knowledge” does given probability distribution posses. It’s being maximized for the uniform distribution. Here’s the formula:

$$H(X)=E[-log_b(P(X))]$$

This expands to the following:

$$H(X)=-\sum_{i=1}^{n}P(x_i)log_b(P(x_i))$$

Let’s look closer at the values this function produces using the following simple Calca code:

uniform = [0.25, 0.25, 0.25, 0.25]
more confident = [0.5, 0.25, 0.15, 0.10]
over confident = [0.95, 0.01, 0.01, 0.03]
super over confident = [0.99, 0.003, 0.004, 0.003]

y(x) = x*log(x, 10)

entropy(dist) = -sum(map(y, dist))

entropy (uniform) => 0.6021
entropy (more confident) => 0.5246
entropy (over confident) => 0.1068
entropy (super over confident) => 0.0291

We can use the above to “punish” the model whenever it’s too confident of its choices. As we’re going to use gradient descend, we’ll be minimizing terms that appear in our loss function. Minimizing the entropy as shown above would encourage more confidence though. We’ll need to make it into a negative in the loss to work the way we intend:

$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t)),-\beta,H(\pi(a_t|s_t;\theta))$$

Where $\beta$ is a hyperparameter scaling the effects of the penalty that the entropy has on the gradients. Choosing the right value for $\beta$ becomes very vital for the model’s convergence. In this article, I’m using $0.01$ as with $0.001$ I was still observing the process stuck being overconfident.

Let’s include the value loss $L_v$ in the loss function formula making it full and ready to be implemented:

$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t)),+\alpha,L_v,,-\beta,H(\pi(a_t|s_t;\theta))$$

The last A in A3C

So far we’ve gone from the vanilla policy gradients to using the notion of an advantage. We’ve also improved it with the baseline that intuitively makes the model consist of two parts: the actor and the critic. At this point, we have what’s sometimes called the A2C — Advantage Actor-Critic.

Let us now focus on the last piece of the puzzle: the last A. This last A comes from the word “asynchronous”. It’s been explained very clearly in the original paper on A3C.

This idea I think is the least complex of all that have their place in the approach. I’ll just comment on what was already written:

These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent’s data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

The A3C unique approach is that it doesn’t use experience replay for de-correlating the updates to the weights of the model. Instead, we’re sampling many different trajectories at the same time in an asynchronous manner.

This means that we’re creating many clones of the environment and we let our agents experience them at the same time. Separate agents share their weights in one way or another. There are implementations with agents sharing those weights very literally — and performing the updates to the weights on their own whenever they need to. There also are implementations with one main agent holding the main weights and doing the updates based on the gradients reported by the “worker” agents. The worker agents are then being updated with the evolved weights. The environments and agents are not being directly synchronized, working at their own speed. As soon as any of them collects the needed rewards to perform the n-step gradients calculations, the gradients are being applied in one way or another.

In this article, I’m preferring the second approach — having one “main” agent and making workers synchronize their weights with it each n-step period.

Practice

The challenge

To present the above theory in practical terms, we’re going to code the A3C to train a toy self-driving game car. The algorithm will only have the game’s pixels as inputs. We’re also going to collect rewards.

Each step, the player will decide how to move the steering wheel, how much throttle to apply and how much brake.

Points are being assigned for each new “tile” that the car enters staying within the road. There’s a small penalty for each other case of $-0.1$ points.

We’re going to use OpenAI Gym and the environment’s called CarRacing.

You can read a bit more about the setup in the environment’s source code on GitHub.

Coding the Agent

Our agent is going to output both $\pi(a|s;\theta)$ as well as $V(s)$. We’re going to use the GRU unit to give the agent the ability to remember its previous actions and environments previous features.

I’ve also decided to use PRelu instead of Relu activations as it appeared to me that this way the agent was learning much quicker (although I don’t have any numbers to back this impression up).

Disclaimer: the code presented below has not been refactored in any way. If this was going to be used in production I’d certainly hugely clean it up.

Here’s the full listing of the agent’s class:

class Agent(nn.Module):
    def __init__(self, **kwargs):
        super(Agent, self).__init__(**kwargs)

        self.init_args = kwargs

        self.h = torch.zeros(1, 256)

        self.norm1 = nn.BatchNorm2d(4)
        self.norm2 = nn.BatchNorm2d(32)

        self.conv1 = nn.Conv2d(4, 32, 4, stride=2, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)

        self.gru = nn.GRUCell(1152, 256)
        self.policy = nn.Linear(256, 4)
        self.value = nn.Linear(256, 1)

        self.prelu1 = nn.PReLU()
        self.prelu2 = nn.PReLU()
        self.prelu3 = nn.PReLU()
        self.prelu4 = nn.PReLU()

        nn.init.xavier_uniform_(self.conv1.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.conv1.bias, 0.01)
        nn.init.xavier_uniform_(self.conv2.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.conv2.bias, 0.01)
        nn.init.xavier_uniform_(self.conv3.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.conv3.bias, 0.01)
        nn.init.xavier_uniform_(self.conv4.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.conv4.bias, 0.01)
        nn.init.constant_(self.gru.bias_ih, 0)
        nn.init.constant_(self.gru.bias_hh, 0)
        nn.init.xavier_uniform_(self.policy.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.policy.bias, 0.01)
        nn.init.xavier_uniform_(self.value.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.value.bias, 0.01)

        self.train()

    def reset(self):
        self.h = torch.zeros(1, 256)

    def clone(self, num=1):
        return [ self.clone_one() for _ in range(num) ]

    def clone_one(self):
        return Agent(**self.init_args)

    def forward(self, state):
        state = state.view(1, 4, 96, 96)
        state = self.norm1(state)

        data = self.prelu1(self.conv1(state))
        data = self.prelu2(self.conv2(data))
        data = self.prelu3(self.conv3(data))
        data = self.prelu4(self.conv4(data))

        data = self.norm2(data)
        data = data.view(1, -1)

        h = self.gru(data, self.h)
        self.h = h.detach()

        pre_policy = h.view(-1)

        policy = F.softmax(self.policy(pre_policy))
        value = self.value(pre_policy)

        return policy, value

You can immediately notice that actor and critic parts share most of the weights. They only differ in the last layer.

Next, I wanted to abstract out the notion of the “runner”. It encapsulates the idea of a “running agent”. Think of it as the game player — with the joystick and its brain to score game points. I’m discretizing the action space the following way:

Action name	value
Turn left	[-0.8, 0.0, 0.0]
Turn right	[0.8, 0.0, 0]
Full throttle	[0.0, 0.1, 0.0]
Brake	[0.0, 0.0, 0.6]

class Runner:
    def __init__(self, agent, ix, train = True, **kwargs):
        self.agent = agent
        self.train = train
        self.ix = ix
        self.reset = False
        self.states = []

        # each runner has its own environment:
        self.env = gym.make('CarRacing-v0')

    def get_value(self):
        """
        Returns just the current state's value.
        This is used when approximating the R.
        If the last step was
        not terminal, then we're substituting the "r"
        with V(s) - hence, we need a way to just
        get that V(s) without moving forward yet.
        """
        _input = self.preprocess(self.states)
        _, _, _, value = self.decide(_input)
        return value

    def run_episode(self, yield_every = 10, do_render = False):
        """
        The episode runner written in the generator style.
        This is meant to be used in a "for (...) in run_episode(...):" manner.
        Each value generated is a tuple of:
        step_ix: the current "step" number
        rewards: the list of rewards as received from the environment (without discounting yet)
        values: the list of V(s) values, as predicted by the "critic"
        policies: the list of policies as received from the "actor"
        actions: the list of actions as sampled based on policies
        terminal: whether we're in a "terminal" state
        """
        self.reset = False
        step_ix = 0

        rewards, values, policies, actions = [[], [], [], []]

        self.env.reset()

        # we're going to feed the last 4 frames to the neural network that acts as the "actor-critic" duo. We'll use the "deque" to efficiently drop too old frames always keeping its length at 4:
        states = deque([ ])

        # we're pre-populating the states deque by taking first 4 steps as "full throttle forward":
        while len(states) < 4:
            _, r, _, _ = self.env.step([0.0, 1.0, 0.0])
            state = self.env.render(mode='rgb_array')
            states.append(state)
            logger.info('Init reward ' + str(r) )

        # we need to repeat the following as long as the game is not over yet:
        while True:
            # the frames need to be preprocessed (I'm explaining the reasons later in the article)
            _input = self.preprocess(states)

            # asking the neural network for the policy and value predictions:
            action, action_ix, policy, value = self.decide(_input, step_ix)

            # taking the step and receiving the reward along with info if the game is over:
            _, reward, terminal, _ = self.env.step(action)

            # explicitly rendering the scene (again, this will be explained later)
            state = self.env.render(mode='rgb_array')

            # update the last 4 states deque:
            states.append(state)
            while len(states) > 4:
                states.popleft()

            # if we've been asked to render into the window (e. g. to capture the video):
            if do_render:
                self.env.render()

            self.states = states
            step_ix += 1

            rewards.append(reward)
            values.append(value)
            policies.append(policy)
            actions.append(action_ix)

            # periodically save the state's screenshot along with the numerical values in an easy to read way:
            if self.ix == 2 and step_ix % 200 == 0:
                fname = './screens/car-racing/screen-' + str(step_ix) + '-' + str(int(time.time())) + '.jpg'
                im = Image.fromarray(state)
                im.save(fname)
                state.tofile(fname + '.txt', sep=" ")
                _input.numpy().tofile(fname + '.input.txt', sep=" ")

            # if it's game over or we hit the "yield every" value, yield the values from this generator:
            if terminal or step_ix % yield_every == 0:
                yield step_ix, rewards, values, policies, actions, terminal
                rewards, values, policies, actions = [[], [], [], []]

            # following is a very tacky way to allow external using code to mark that it wants us to reset the environment, finishing the episode prematurely. (this would be hugely refactored in the production code but for the sake of playing with the algorithm itself, it's good enough):
            if self.reset:
                self.reset = False
                self.agent.reset()
                states = deque([ ])
                self.states = deque([ ])
                return

            if terminal:
                self.agent.reset()
                states = deque([ ])
                return

    def ask_reset(self):
        self.reset = True

    def preprocess(self, states):
        return torch.stack([ torch.tensor(self.preprocess_one(image_data), dtype=torch.float32) for image_data in states ])

    def preprocess_one(self, image):
        """
        Scales the rendered image and makes it grayscale
        """
        return rescale(rgb2gray(image), (0.24, 0.16), anti_aliasing=False, mode='edge', multichannel=False)

    def choose_action(self, policy, step_ix):
        """
        Chooses an action to take based on the policy and whether we're in the training mode or not. During training, it samples based on the probability values in the policy. During the evaluation, it takes the most probable action in a greedy way.
        """
        policies = [[-0.8, 0.0, 0.0], [0.8, 0.0, 0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.6]]

        if self.train:
            action_ix = np.random.choice(4, 1, p=torch.tensor(policy).detach().numpy())[0]
        else:
            action_ix = np.argmax(torch.tensor(policy).detach().numpy())

        logger.info('Step ' + str(step_ix) + ' Runner ' + str(self.ix) + ' Action ix: ' + str(action_ix) + ' From: ' + str(policy))

        return np.array(policies[action_ix], dtype=np.float32), action_ix

    def decide(self, state, step_ix = 999):
        policy, value = self.agent(state)

        action, action_ix = self.choose_action(policy, step_ix)

        return action, action_ix, policy, value

    def load_state_dict(self, state):
        """
        As we'll have multiple "worker" runners, they will need to be able to sync their agents' weights with the main agent.
        This function loads the weights into this runner's agent.
        """
        self.agent.load_state_dict(state)

I’m also encapsulating the training process in a class of its own. You can notice the gradients being clipped before being applied. I’m also clipping the rewards into the range of $<-3, 3>$ to help to keep the variance low.

class Trainer:
    def __init__(self, gamma, agent, window = 15, workers = 8, **kwargs):
        super().__init__(**kwargs)

        self.agent = agent
        self.window = window
        self.gamma = gamma
        self.optimizer = optim.Adam(self.agent.parameters(), lr=1e-4)
        self.workers = workers

        # even though we're loading the weights into worker agents explicitly, I found that still without sharing the weights as following, the algorithm was not converging:
        self.agent.share_memory()

    def fit(self, episodes = 1000):
            """
            The higher level method for training the agents.
            It called into the lower level "train" which orchestrates the process itself.
            """
        last_update = 0
        updates = dict()

        for ix in range(1, self.workers + 1):
            updates[ ix ] = { 'episode': 0, 'step': 0, 'rewards': deque(), 'losses': deque(), 'points': 0, 'mean_reward': 0, 'mean_loss': 0 }

        for update in self.train(episodes):
            now = time.time()

            # you could do something useful here with the updates dict.
            # I've opted out as I'm using logging anyways and got more value in just watching the log file, grepping for the desired values

            # save the current model's weights every minute:
            if now - last_update > 60:
                torch.save(self.agent.state_dict(), './checkpoints/car-racing/' + str(int(now)) + '-.pytorch')
                last_update = now

    def train(self, episodes = 1000):
        """
        Lower level training orchestration method. Written in the generator style. Intended to be used with "for update in train(...):"
        """

        # create the requested number of background agents and runners:
        worker_agents = self.agent.clone(num = self.workers)
        runners = [ Runner(agent=agent, ix = ix + 1, train = True) for ix, agent in enumerate(worker_agents) ]

        # we're going to communicate the workers' updates via the thread safe queue:
        queue = mp.SimpleQueue()

        # if we've not been given a number of episodes: assume the process is going to be interrupted with the keyboard interrupt once the user (us) decides so:
        if episodes is None:
            print('Starting out an infinite training process')

        # create the actual background processes, making their entry be the train_one method:
        processes = [ mp.Process(target=self.train_one, args=(runners[ix - 1], queue, episodes, ix)) for ix in range(1, self.workers + 1) ]

        # run those processes:
        for process in processes:
            process.start()

        try:
            # what follows is a rather naive implementation of listening to workers updates. it works though for our purposes:
            while any([ process.is_alive() for process in processes ]):
                results = queue.get()
                yield results
        except Exception as e:
            logger.error(str(e))

    def train_one(self, runner, queue, episodes = 1000, ix = 1):
        """
        Orchestrate the training for a single worker runner and agent. This is intended to run in its own background process.
        """

        # possibly naive way of trying to de-correlate the weight updates further (I have no hard evidence to prove if it works, other than my subjective observation):
        time.sleep(ix)

        try:
            # we are going to request the episode be reset whenever our agent scores lower than its max points. the same will happen if the agent scores total of -10 points:
            max_points = 0
            max_eval_points = 0
            min_points = 0
            max_episode = 0

            for episode_ix in itertools.count(start=0, step=1):

                if episodes is not None and episode_ix >= episodes:
                    return

                max_episode_points = 0
                points = 0

                # load up the newest weights every new episode:
                runner.load_state_dict(self.agent.state_dict())

                # every 5 episodes lets evaluate the weights we've learned so far by recording the run of the car using the greedy strategy:
                if ix == 1 and episode_ix % 5 == 0:
                    eval_points = self.record_greedy(episode_ix)

                    if eval_points > max_eval_points:
                        torch.save(runner.agent.state_dict(), './checkpoints/car-racing/' + str(eval_points) + '-eval-points.pytorch')
                        max_eval_points = eval_points

                # each n-step window, compute the gradients and apply
                # also: decide if we shouldn't restart the episode if we don't want to explore too much of the not-useful state space:
                for step, rewards, values, policies, action_ixs, terminal in runner.run_episode(yield_every=self.window):
                    points += sum(rewards)

                    if ix == 1 and points > max_points:
                        torch.save(runner.agent.state_dict(), './checkpoints/car-racing/' + str(points) + '-points.pytorch')
                        max_points = points

                    if ix == 1 and episode_ix > max_episode:
                        torch.save(runner.agent.state_dict(), './checkpoints/car-racing/' + str(episode_ix) + '-episode.pytorch')
                        max_episode = episode_ix

                    if points < -10 or (max_episode_points > min_points and points < min_points):
                        terminal = True
                        max_episode_points = 0
                        point = 0
                        runner.ask_reset()

                    if terminal:
                        logger.info('TERMINAL for ' + str(ix) + ' at step ' + str(step) + ' with total points ' + str(points) + ' max: ' + str(max_episode_points) )

                    # if we're learning, then compute and apply the gradients and load the newest weights:
                    if runner.train:
                        loss = self.apply_gradients(policies, action_ixs, rewards, values, terminal, runner)
                        runner.load_state_dict(self.agent.state_dict())

                    max_episode_points = max(max_episode_points, points)
                    min_points = max(min_points, points)

                    # communicate the gathered values to the main process:
                    queue.put((ix, episode_ix, step, rewards, loss, points, terminal))

        except Exception as e:
            string = traceback.format_exc()
            logger.error(str(e) + ' → ' + string)
            queue.put((ix, -1, -1, [-1], -1, str(e) + '<br />' + string, True))

    def record_greedy(self, episode_ix):
        """
        Records the video of the "greedy" run based on the current weights.
        """
        directory = './videos/car-racing/episode-' + str(episode_ix) + '-' + str(int(time.time()))
        player = Player(agent=self.agent, directory=directory, train=False)
        points = player.play()
        logger.info('Evaluation at episode ' + str(episode_ix) + ': ' + str(points) + ' points (' + directory + ')')
        return points

    def apply_gradients(self, policies, actions, rewards, values, terminal, runner):
        worker_agent = runner.agent
        actions_one_hot = torch.tensor([[ int(i == action) for i in range(4) ] for action in actions], dtype=torch.float32)

        policies = torch.stack(policies)
        values = torch.cat(values)
        values_nograd = torch.zeros_like(values.detach(), requires_grad=False)
        values_nograd.copy_(values)

        discounted_rewards = self.discount_rewards(runner, rewards, values_nograd[-1], terminal)
        advantages = discounted_rewards - values_nograd

        logger.info('Runner ' + str(runner.ix) + 'Rewards: ' + str(rewards))
        logger.info('Runner ' + str(runner.ix) + 'Discounted Rewards: ' + str(discounted_rewards.numpy()))

        log_policies = torch.log(0.00000001 + policies)

        one_log_policies = torch.sum(log_policies * actions_one_hot, dim=1)

        entropy = torch.sum(policies * -log_policies)

        policy_loss = -torch.mean(one_log_policies * advantages)

        value_loss = F.mse_loss(values, discounted_rewards)

        value_loss_nograd = torch.zeros_like(value_loss)
        value_loss_nograd.copy_(value_loss)

        policy_loss_nograd = torch.zeros_like(policy_loss)
        policy_loss_nograd.copy_(policy_loss)

        logger.info('Value Loss: ' + str(float(value_loss_nograd)) + ' Policy Loss: ' + str(float(policy_loss_nograd)))

        loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
        self.agent.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(worker_agent.parameters(), 40)

        # the following step is crucial. at this point, all the info about the gradients reside in the worker agent's memory. We need to "move" those gradients into the main agent's memory:
        self.share_gradients(worker_agent)

        # update the weights with the computed gradients:
        self.optimizer.step()

        worker_agent.zero_grad()
        return float(loss.detach())

    def share_gradients(self, worker_agent):
        for param, shared_param in zip(worker_agent.parameters(), self.agent.parameters()):
            if shared_param.grad is not None:
                return
            shared_param._grad = param.grad

    def clip_reward(self, reward):
        """
        Clips the rewards into the <-3, 3> range preventing too big of the gradients variance.
        """
        return max(min(reward, 3), -3)

    def discount_rewards(self, runner, rewards, last_value, terminal):
            discounted_rewards = [0 for _ in rewards]
        loop_rewards = [ self.clip_reward(reward) for reward in rewards ]

        if terminal:
            loop_rewards.append(0)
        else:
            loop_rewards.append(runner.get_value())

        for main_ix in range(len(discounted_rewards) - 1, -1, -1):
            for inside_ix in range(len(loop_rewards) - 1, -1, -1):
                if inside_ix >= main_ix:
                    reward = loop_rewards[inside_ix]
                    discounted_rewards[main_ix] += self.gamma**(inside_ix - main_ix) * reward

        return torch.tensor(discounted_rewards)

For the record_greedy method to work we need the following class:

class Player(Runner):
    def __init__(self, directory, **kwargs):
        super().__init__(ix=999, **kwargs)

        self.env = Monitor(self.env, directory)

    def play(self):
        points = 0
        for step, rewards, values, policies, actions, terminal in self.run_episode(yield_every = 1, do_render = True):
            points += sum(rewards)
        self.env.close()
        return points

All the above code can be used as follows (in the Python script):

if __name__ == "__main__":
    agent = Agent()

    trainer = Trainer(gamma = 0.99, agent = agent)
    trainer.fit(episodes=None)

The importance of tuning of the n-step window size

Reading the code, you can notice that we’ve chosen $15$ to be the size of the n-step window. We’ve also chosen $\gamma=0.99$. Getting those values right is a subject for tuning. The same ones that work on one game or a challenge will not necessarily work well for the other.

Here’s a quick explanation of how to think about them: We’re going to be penalized most of the time. It’s important for us to give the algorithm a chance to actually find trajectories that score positively. In the “CarRacing” challenge, I’ve found that it can take 10 steps of moving “full throttle” in the correct direction before we’re being rewarded by entering the new “tile”. I’ve just simply added $5$ of the safety net to that number. No mathematical proof follows this thinking here, but I can tell you though that it made a huge difference in the training time for me. The version of the code I’m presenting above starts to score above 700 points after approximately 10 hours on my Ryzen 7 based computing box.

Problems with the state being returned from the environment - overcoming with the explicit render

You might have also noticed that I’m not using the state values returned by the step method of the gym environment. This might seem contradictory to how the gym is typically being used. After days of not seeing my model converge though, I have found that the step method was returning one and the same numpy array on each call. You can imagine that it was the absolutely last thing I’ve checked when trying to find that bug.

I’ve found the render(mode='rgb_array') works as intended each time. I just needed to write my own preprocessing code, to scale it down and make it grayscale.

How to know when the algorithm converges

I’ve seen some people thinking that their A3C implementation does not converge. The resulting policy did not seem to be working that well, but the training process was taking a bit longer than “some other implementation”. I fell for this kind of thinking myself as well. My humble bit of advice is to stick to what makes sense mathematically. Someone else’s model might be converging faster simply because of the hardware being used or some slight difference in the code around the training (e.g. explicit render needed in my case). This might not have anything to do with the A3C part at all.

How do we “stick to what makes sense mathematically”? Simply by logging the value loss and observing it as the training continues. Intuitively, for the model that has converged, we should see that it has already learned the value function. Those values — representing the average of the discounted rewards — should not make the loss too big most of the time. Still, for some states, the best action will make the $R_t$ much bigger than $V(s_t)$ which means that we still should see the loss spiking from time to time.

Again, the above bit of advice doesn’t come with any mathematical proofs. It’s what I found working and making sense in my case.

The Results

Instead of presenting hard-core statistics about the model’s performance — which wouldn’t make much sense because I stopped it as soon as the “evaluation” videos started looking cool enough) — I’ll just post three videos of the car driving on its own through the three randomly generated tracks.

Have fun watching and even more fun coding it yourself too!

Introducing the End Point AI Portal for Our Customers

What’s in the Portal

Why We Are Doing This

How to Get an Account

Implementing SummAE neural text summarization with a denoising auto-encoder

Preliminaries

Datasets

ROCStories

Wikihow

Basics of the sequence-to-sequence modeling

Naively simple modeling: Markov Model

Modeling with neural networks

Teacher-forcing

Compute-friendly representation for tokens and gists

Representing words naively

A better approach: word embeddings

Not only words

Auto-encoders

Adding the noise

The SummAE model

Auto-encoding paragraphs and sentences

Adding the noise

The code

My experiment with the WikiHow dataset

The problem with getting paragraphs when we want the sentences

Better gists by using the “critic”

Results

Final words

An Introduction to Neural Networks

Neurons and Nodes

Neural Network Architecture

Training a Neural Network

Further reading

Facial Recognition Using Amazon DeepLens: Counting Liquid Galaxy Interactions

The Amazon DeepLens camera

Computer Vision and the Amazon DeepLens

The Model

How did the DeepLens counts compare to my counts?

Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm

Theory

The policy gradient

Formalizing the accumulation of rewards

Making the gradients more stable

Value function as a baseline

Guarding against the model’s overconfidence

Regularizing with entropy

The last A in A3C

Practice

The challenge

Coding the Agent

The importance of tuning of the n-step window size

Problems with the state being returned from the environment - overcoming with the explicit render

How to know when the algorithm converges

The Results