Django and Mojolicious: a quick comparison of two popular web frameworks

2025-02-06T00:00:00+00:00

Recently I’ve been working on a project with a Vue front-end and two back-ends, one in Python using the Django framework and one in Perl using the Mojolicious framework. It seems like a good time to share the experience and do a quick comparison.

Previously I wrote a post about Perl web frameworks, and now I’m expanding the subject into another language.

Django was chosen for this project because it’s been around for almost 20 years now and provides the needed maturity and stability to be long-running and low-budget. In this regard, it has proved a good choice so far. Recently it saw a major version upgrade without any problems to speak of. It could be argued that I should have used the Django REST Framework instead of plain Django. However, at the time the decision was made, adding a framework on top of another seemed a bit excessive. I don’t have many regrets about this, though.

Mojolicious is an old acquaintance. It used to have fast-paced development but seems very mature now, and it’s even been ported to JavaScript.

Both frameworks have just a few dependencies (which is fairly normal in the Python world, but not in the Perl one) and excellent documentation. They both follow the model-view-controller pattern. Let’s examine the components.

Views

Both frameworks come with a built-in template system (which can be swapped out with something else), but in this project we can skip the topic altogether as both frameworks are used only as back-end for transmitting JSON, without any HTML rendering involved.

However, let’s see how the rendering looks for the API we’re writing.

use Mojo::Base 'Mojolicious::Controller', -signatures;
sub check ($self) {
    $self->render(json => { status => 'OK' });
}

from django.http import JsonResponse
def status(request):
    return JsonResponse({ "status":  "OK" })

Nothing complicated here, just provide the right call.

Models

Django

In the context of web development, a model usually means a database, and we are going to keep this assumption here.

Django comes with a comprehensive object-relational mapping (ORM) system and it feels like the natural thing to use. I don’t think it makes much sense to use another ORM, or even to use raw SQL queries (though it is possible).

You usually start a Django project by defining the model. The Django ORM gives you the tools to manage the migrations, providing abstraction from the SQL. You need to define the field types and the relationships (joins and foreign keys) using the appropriate class methods.

For example:

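from django.contrib.auth.models import AbstractUser
from django.db import models

# Site and Library are other models defined in the same project.
class User(AbstractUser):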
    email = models.EmailField(null=False, blank=False)
    site = models.ForeignKey(Site, on_delete=models.CASCADE, related_name="site_users")
    libraries = models.ManyToManyField(Library, related_name="affiliated_users")
    expiration = models.DateTimeField(null=True, blank=True)
    created = models.DateTimeField(auto_now_add=True)
    last_modified = models.DateTimeField(auto_now=True)

These calls provide not only the SQL type to use, but also the validation. For example, the blank parameter is a validation option specifying whether Django will accept an empty value. It is different from the null option, which directly correlates to SQL. You can see we’re quite far from working with SQL, at least two layers of abstraction away.

In the example above, we’re also defining a foreign key between a site and a user (many-to-one), so each user belongs to one site. We also define a many-to-many relationship with the libraries record. I like how concise these relationship definitions are.

Thanks to these definitions, you get a whole admin console almost for free, which your admin users are sure to like. However, I’m not sure this is a silver bullet for solving all problems. With large tables and relationships the admin pages load slowly and they could become unusable very quickly. Of course, you can tune that by filtering out what you need and what you don’t, but that means things are not as simple as “an admin dashboard for free” — at the very least, there’s some configuring to do.

As for the query syntax, you usually need to call Class.objects.filter(). As you would expect from an ORM, you can chain the calls and finally get objects out of that, representing a database row, which, in turn, you can update or delete.

The syntax for the filter() call is based on the double underscore separator, so you can query over the relationships like this:

for agent in (Agent.objects.filter(canonical_agent_id__isnull=False)
              .prefetch_related('canonical_agent')
              .order_by('canonical_agent__name', 'name')
              .all()):
    agent.name = "Dummy"
    agent.save()

In this case, provided that we defined the foreign keys and the attributes in the model, we can search/order across the relationship. The __isnull suffix, as you can imagine, results in a WHERE canonical_agent_id IS NOT NULL query, while in the order_by call we sort over the joined table using the name column. Looks nice and readable, with a touch of magic.
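
To make the touch of magic a bit less magical, here is a rough sketch of the kind of SQL such a query corresponds to, run against an in-memory SQLite database with Python’s sqlite3 module. The table layout (an agent table with a self-referencing canonical_agent_id column) and the sample rows are assumptions made for illustration, and the SQL Django actually generates will differ in its details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agent (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        canonical_agent_id INTEGER REFERENCES agent(id)
    );
    INSERT INTO agent (id, name, canonical_agent_id) VALUES
        (1, 'Zed', NULL),
        (2, 'Alice', NULL),
        (3, 'Z. Ed', 1),
        (4, 'A. Lice', 2),
        (5, 'Al', 2);
""")

# canonical_agent_id__isnull=False          -> WHERE ... IS NOT NULL
# order_by('canonical_agent__name', 'name') -> join the related row, then sort
sql = """
    SELECT agent.name, canonical.name
      FROM agent
      JOIN agent AS canonical ON canonical.id = agent.canonical_agent_id
     WHERE agent.canonical_agent_id IS NOT NULL
     ORDER BY canonical.name, agent.name
"""
rows = conn.execute(sql).fetchall()
for name, canonical_name in rows:
    print(f"{name} -> {canonical_name}")
```

Each double underscore in the ORM call hides one step of this query: either a comparison operator or a hop across a join.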

Of course things are never so simple, so you can build complex queries with the Q class combined with bitwise operators (&, |).

Here’s an example of a simple case-insensitive search for a name containing multiple words:

import re
from django.db.models import Q

def api_list(request):
    term = request.GET.get('search')
    if term:
        words = [ w for w in re.split(r'\W+', term) if w ]
        if words:
            query = Q(name__icontains=words.pop())
            while words:
                query = query & Q(name__icontains=words.pop())
            # logger.debug(query)
            agents = Agent.objects.filter(query).all()

To sum up, the ORM is providing everything you need to stay away from the SQL. In fact, it seems like Django doesn’t like you doing raw SQL queries.
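
For comparison, the multi-word search above boils down to a chain of AND-ed LIKE conditions. The sketch below does the same word splitting and combination by hand, using sqlite3 from the standard library instead of Django. The build_name_search helper, the agent table, and the sample names are all made up for the example.

```python
import re
import sqlite3

def build_name_search(term):
    """Return (where_clause, params) matching names that contain every word."""
    words = [w for w in re.split(r"\W+", term) if w]
    if not words:
        return "1 = 1", []
    # One LIKE condition per word, all joined with AND (like chaining Q with &).
    clause = " AND ".join("name LIKE ?" for _ in words)
    # Note: "_" is a word character and also a LIKE wildcard; ignored here.
    params = [f"%{w}%" for w in words]
    return clause, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent (name TEXT)")
conn.executemany(
    "INSERT INTO agent VALUES (?)",
    [("John Ronald Tolkien",), ("John Smith",), ("Ronald McDonald",)],
)

clause, params = build_name_search("ronald JOHN")
# SQLite's LIKE is case-insensitive for ASCII, similar in spirit to icontains.
found = [r[0] for r in conn.execute(f"SELECT name FROM agent WHERE {clause}", params)]
print(clause)
print(found)
```

Only the placeholders are interpolated into the SQL string; the search words themselves are always passed as bound parameters.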

Mojolicious and Perl

In the Perl world things are a bit different.

The Mojolicious tutorial doesn’t even mention the database. You can use any ORM or no ORM at all, if you prefer. However, with a helper, Mojolicious lets you make the DB handle available everywhere in the application.

You could use DBIx::Connector, DBIx::Class, Mojo::Pg (which was developed with Mojolicious), or whatever you prefer.

For example, to use Mojo::Pg in the main application class:

package MyApp;
use Mojo::Base 'Mojolicious', -signatures;
use Mojo::Pg;
use Data::Dumper::Concise;

sub startup ($self) {
    my $config = $self->plugin('NotYAMLConfig');
    $self->log->info("Starting up with " . Dumper($config));
    $self->helper(pg => sub {
                      state $pg = Mojo::Pg->new($config->{dbi_connection_string});
                  });
    # ... routes go here ...
}

In the routes you can call $self->pg to get the database object.

The three approaches I’ve mentioned here are different.

DBIx::Connector is basically a way to get you a safe DBI handle across forks and DB connection failures.

Mojo::Pg gives you the ability to do abstract queries but also gives some convenient methods to get the results. I wouldn’t call it an ORM; from a query you usually get hashes, not objects, you don’t need to define the database layout, and it won’t produce migrations for you, though there is some migration support.

Here’s an example of standard and abstract queries:

sub list_texts ($self) {
    my @all;
    if (my $sid = $self->param('sid')) {
        my $sql = 'SELECT * FROM texts WHERE sid = ? ORDER BY sorting_index';
        @all = $self->pg->db->query($sql, $sid)->hashes->each;
    }
    $self->render(json => { texts => \@all });
}

The query above can be rewritten with an abstract query, using the same module.

@all = $self->pg->db->select(texts => undef,
                             { sid => $sid },
                             { order_by => 'sorting_index' })->hashes->each;

If it’s a simple, static query, it’s basically a matter of taste: do you prefer to see the SQL or not? The second version is usually nicer if you want to build a different query depending on the parameters: you add or remove keys in the hashes that map to the query, and finally execute it.
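
To illustrate the idea of adding or removing keys, here is a small Python sketch of what an abstract query builder does under the hood. It is not how SQL::Abstract is implemented, just a stripped-down stand-in; build_select, texts_query, and the lang column are hypothetical names invented for the example.

```python
def build_select(table, where=None, order_by=None):
    """Build a (sql, params) pair from a dict of equality conditions.

    Table and column names are interpolated as-is, so they must come from
    trusted code, never from user input; only the values are bound.
    """
    sql = f"SELECT * FROM {table}"
    params = []
    if where:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in where)
        params = list(where.values())
    if order_by:
        sql += f" ORDER BY {order_by}"
    return sql, params

def texts_query(sid=None, lang=None):
    """Add a condition only for the parameters that were actually given."""
    where = {}
    if sid is not None:
        where["sid"] = sid
    if lang is not None:
        where["lang"] = lang
    return build_select("texts", where, order_by="sorting_index")

print(texts_query(sid=42))
print(texts_query(sid=42, lang="en"))
print(texts_query())
```

With hand-written SQL, the same flexibility means concatenating string fragments and keeping the parameter list in sync yourself, which is where the abstract form starts to pay off.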

Now, speaking of taste, for complex queries with a lot of joins I honestly prefer to see the SQL query instead of wondering if the abstract one is producing the correct SQL. This is true regardless of the framework. I have the impression that it is faster, safer, and cleaner to have the explicit SQL in the code rather than leaving future developers (including future me) to wonder if the magic is happening or not.

Finally, nothing stops you from using DBIx::Class, which is the best ORM for Perl, even if it’s not exactly light on dependencies.

It’s very versatile, it can build queries of arbitrary complexity, and you usually get objects out of the queries you make. It doesn’t come with an admin dashboard, it doesn’t enforce the data types and it doesn’t ship any validation by default (of course, you can implement that manually). The query syntax is very close to the Mojo::Pg one (which is basically SQL::Abstract).

The gain here is that, like in Django’s ORM, you can attach your methods to the classes representing the rows, so the data definitions live with the code operating on them.

However, the fact that it builds an object for each result means you’re paying a performance penalty which sometimes can be very high. I think this is a problem common to all ORMs, regardless of the language and framework you’re using.

The difference with Django is that once you have chosen it as your framework, you are basically already sold to the ORM. With Mojolicious and other Perl frameworks (Catalyst, Dancer), you can still make the decision and, at least in theory, change it down the road.

My recommendation would be to keep the model, both code and business logic, decoupled from the web-specific code. This is not really doable with Django, but is fully doable with the Perl frameworks. Just put the DB configuration in a dedicated file and the business code in appropriate classes. Then you should be able to, for example, run a script without loading the web and the whole framework configuration. In this ideal scenario, the web framework just provides the glue between the user and your model.
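
As a framework-agnostic sketch of that separation, written in Python for brevity: the model class below only knows about the database, and the web handler is reduced to glue. TextStore, handle_list_request, and the texts table layout are made-up names for illustration, with sqlite3 standing in for the real database.

```python
import sqlite3

class TextStore:
    """Business logic that knows about the database but nothing about HTTP."""

    def __init__(self, conn):
        self.conn = conn

    def texts_for(self, sid):
        cur = self.conn.execute(
            "SELECT title FROM texts WHERE sid = ? ORDER BY sorting_index",
            (sid,),
        )
        return [row[0] for row in cur]

def handle_list_request(store, params):
    """All the web layer does: parse input, call the model, shape the output."""
    return {"texts": store.texts_for(int(params["sid"]))}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (sid INTEGER, title TEXT, sorting_index INTEGER)")
conn.executemany(
    "INSERT INTO texts VALUES (?, ?, ?)",
    [(1, "Second", 2), (1, "First", 1), (2, "Other", 1)],
)
store = TextStore(conn)

print(store.texts_for(1))                        # used from a plain script
print(handle_list_request(store, {"sid": "1"}))  # used from the web layer
```

A maintenance script or a cron job can import TextStore directly, without loading any routes, templates, or framework configuration.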

Controllers

Routes are defined similarly between Django and Mojolicious. Usually you put the code in a class and then point to it, attaching a name to it so you can reference it elsewhere. The language is different, the style is different, but they essentially do the same thing.

Django:

from django.urls import path
from . import views
urlpatterns = [
    path("api/agents/<int:agent_id>", views.api_agent_view, name="api_agent_view"),
]

The function views.api_agent_view will receive the request with the agent_id as a parameter.

Mojolicious:

sub startup ($self) {
    # ....
    my $r = $self->routes;
    $r->get('/list/:sid')->to('API#list_texts')->name('api_list_texts');
}

The ->to method routes the request to MyApp::Controller::API::list_texts, which will receive the request with the sid as a parameter.

This is pretty much the core business of every web framework: routing a request to a given function.
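
Stripped of everything else, that core business fits in a few lines. The following is a toy router in plain Python, not Django’s or Mojolicious’s actual implementation: it only understands a simplified form of the int placeholder syntax, and route, dispatch, and the response dictionaries are assumptions made for the example.

```python
import re

ROUTES = []

def route(pattern, name):
    """Register a handler for a path pattern with <int:name> placeholders."""
    regex = re.sub(
        r"<int:(\w+)>",
        lambda m: f"(?P<{m.group(1)}>\\d+)",
        pattern,
    )
    def decorator(func):
        ROUTES.append((re.compile(f"^{regex}$"), name, func))
        return func
    return decorator

def dispatch(path):
    """Find the first matching route and call its function with the captures."""
    for regex, name, func in ROUTES:
        match = regex.match(path)
        if match:
            kwargs = {key: int(value) for key, value in match.groupdict().items()}
            return func(**kwargs)
    return {"status": 404}

@route("api/agents/<int:agent_id>", name="api_agent_view")
def api_agent_view(agent_id):
    return {"status": 200, "agent_id": agent_id}

print(dispatch("api/agents/7"))
print(dispatch("api/agents/abc"))
```

The route name is stored but unused here; in a real framework it is what lets you build URLs for a route without hard-coding the path.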

Mojolicious also has the ability to chain the routes (pretty much taken from Catalyst). The typical use is authorization:

sub startup ($self) {
    # ....
    my $r = $self->routes;
    my $api = $r->under('/api/v1', sub ($c) {
        if (($c->req->headers->header('X-API-Key') // '') eq 'testkey') {
            return 1;
        }
        $c->render(text => 'Authentication required!', status => 401);
        return undef;
    });
    $api->get('/check')->to('API#check')->name('api_check');
}

So the request to /api/v1/check will first go in the first block and the chain will abort if the API key is not set in the header. Otherwise it will proceed to run the API module’s check function.
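
The chaining idea itself is not tied to Mojolicious. Here is a minimal Python sketch of the same pattern, where a guard runs first and the real handler only runs if the guard lets the request through. The names make_chain, require_api_key, and check, and the dictionary-shaped request, are assumptions for the example. Note that the convention is flipped compared to Mojolicious: here the guard returns None to continue, rather than a true value.

```python
def make_chain(guard, handler):
    """Run guard first; call handler only if guard returns None."""
    def run(request):
        early_response = guard(request)
        if early_response is not None:
            return early_response
        return handler(request)
    return run

def require_api_key(request):
    """Abort the chain with a 401 unless the expected API key is present."""
    if request.get("headers", {}).get("X-API-Key") == "testkey":
        return None
    return (401, "Authentication required!")

def check(request):
    return (200, {"status": "OK"})

api_check = make_chain(require_api_key, check)

print(api_check({"headers": {"X-API-Key": "testkey"}}))
print(api_check({"headers": {}}))
```

Because the guard is a separate function, every route under the same prefix can share it without repeating the check in each handler.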

Conclusion

I’m a Perl guy and so I’m a bit biased toward Mojolicious, but I also have a pragmatic approach to programming. Python is widely used (they teach it in schools), while Perl is seen as old-school, if not dead (like all the mature technologies). So, Python could potentially attract more developers to your project, and this is important to consider.

Learning a new language like Python is not a big leap; it and Perl are quite similar despite the different syntax. I’d throw Ruby in the same basket.

Of course both languages provide high quality modules you can use, and these two frameworks are an excellent example.

Exploring Geodatabase Files

2024-08-14T00:00:00+00:00

One of our clients recently provided us with a dataset of real estate properties that they manage, and asked us to generate content based off of the points and polygons in the dataset.

We will walk through the process of extracting polygons, placemarks, and other info from a geodatabase file and converting them into separate KML files using the ogr2ogr command-line tool, adding some logic to the data selection to limit the subset of features. We will also explore the GDB file using the GDAL Python library to export the data as JSON for use in other scripts.

Prerequisites

Basic understanding of geospatial data
Installed versions of the GDAL/OGR library
A geodatabase file (.gdb, .gdb.zip, or .shp)

A first look into the contents of the GDB file

ogrinfo example.gdb.zip

This command will list all layers that are available in the dataset. Add a specific layer as a parameter to the same command, and it will output all the fields, their types, and values for every feature in the layer.

ogrinfo example.gdb.zip a_layer_name

The parameter -so can be used to omit the values from the output and get only the layer field names and types:

#$> ogrinfo example.gdb.zip Land_Points -so
INFO: Open of `example.gdb.zip'
      using driver `OpenFileGDB' successful.

Layer name: Land_Points
Geometry: Point
Feature Count: 219
Extent: (-122.699891, -23.590280) - (139.763352, 59.622821)
Layer SRS WKT:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[2],
        ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["unknown"],
        AREA["World"],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
FID Column = OBJECTID
Geometry Column = Shape
asset_type: Integer (0.0)
asset_id: String (15.0)
property_code: String (255.0)
full_address_text: String (255.0)
land_name: String (255.0)
land_address1: String (255.0)
land_city: String (255.0)
land_sate_code: String (50.0)
land_postal_code: String (10.0)
land_country_code: String (50.0)
division_name: String (255.0)
region_name: String (255.0)
market_name: String (255.0)
submarket_name: String (255.0)
supplemental_portfolio_name: String (255.0)
ownership_name: String (255.0)
land_held_for_sale_acre: Real (0.0)
land_held_for_sale_hect: Real (0.0)
land_held_for_development_acre: Real (0.0)
land_held_for_development_hect: Real (0.0)
total_land_acre: Real (0.0)
total_land_hectare: Real (0.0)
buildable_area_sf: Real (0.0)
buildable_area_sm: Real (0.0)
buildable_area_tsubo: Real (0.0)
land_latitude: Real (0.0)
land_longitude: Real (0.0)
geocoded: Integer (0.0) DEFAULT 0
globalid: String (0.0) NOT NULL
created_user: String (255.0)
created_date: DateTime (0.0)
last_edited_user:  String (255.0)
last_edited_date: DateTime (0.0)

This command shows you information for a single layer, which can be helpful if you are only looking for certain values. However, the geographic data is not much to look at in the terminal. To visualize it, we can convert the data to KML, which applications like Google Earth or Cesium can render while keeping the information as text that can be read:

ogr2ogr -f "KML" example_output.kml example.gdb.zip layer_name

-f "KML" specifies the output format
example_output.kml is the name of the output KML file
example.gdb.zip is the path to the geodatabase file
layer_name is the geodatabase layer to export

This command will export into a KML file:

All the geometries in the layer, be it placemarks or polygons in our case
All the other fields of information as extended data, which will show up for each feature as a balloon table when visualized in Google Earth

The `ogr2ogr` `-sql` option

The default balloon popup was not the outcome we wanted for this KML file. We first used sed to remove all the extended data from the KML files, but after looking into it a bit further, we found an ogr2ogr option, -sql, that made the data filtering easier. This option lets us add a query to the command just like getting the data from a SQL database.

1. Extract property names

To create a KML file with just the names of the properties, look up the layers and fields in them using ogrinfo and find the points layer that has the names—in this case, Layer_Points. Then, add the desired SQL query to the command.

ogr2ogr -f "KML" output_names.kml example.gdb.zip -sql "SELECT name FROM Layer_Points"

2. Extract polygons only

The polygon geometries are stored in Layer_Polygons. They can be selected using the special OGR field OGR_GEOMETRY that refers to the geometry of the selected layer:

ogr2ogr -f "KML" output_polygons.kml example.gdb.zip -sql "SELECT OGR_GEOMETRY FROM Layer_Polygons"

3. Create KML with pins and names

To create a KML file with property names as placemarks with pins, we just select the name. The point placemark seems to be included with all data:

ogr2ogr -f "KML" output_pins.kml input.gdb layer_name -sql "SELECT name FROM layer_name"

4. Create KML with pins only, no names

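To get the points with nothing else, we use the MakePoint function through the SQLite dialect (this relies on GDAL being built with SpatiaLite support). It builds a point geometry from each feature’s longitude and latitude fields.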

ogr2ogr -f "KML" output_pins.kml input.gdb layer_name -dialect SQLite -sql "SELECT MakePoint(land_longitude, land_latitude) AS geom FROM Land_Points"

5. Extract limited properties

For extracting a limited number of properties, we can add the WHERE clause to the SQL command:

ogr2ogr -f "KML" output_pins.kml input.gdb layer_name -sql "SELECT name FROM layer_name WHERE ID IN (1, 2, 3)"

We can also use the ogr2ogr -where flag to use that part of the query only:

ogr2ogr -f "KML" limited_output_polygons.kml input.gdb layer_name -where "ID IN (1, 2, 3)"

6. Run it all in a Python script

There is a Python GDAL library, which I will cover the basics of later, but first, here is a simplified example using Python subprocess to run the ogr2ogr commands we tested in the terminal.

import subprocess

gdb_file = "example.gdb.zip"
column_name = "p_code"
elem_str = "'a1', 'a2', 'b1', 'b3'"

sql_land_labels = f"SELECT land_name FROM Land_Points WHERE {column_name} IN ({elem_str})"
subprocess.run(['ogr2ogr', '-f', 'KML', 'land_labels.kml', gdb_file, '-sql', sql_land_labels])

sql_land_points = f"SELECT MakePoint(land_longitude, land_latitude) AS geom FROM Land_Points WHERE {column_name} IN ({elem_str})"
subprocess.run(['ogr2ogr', '-f', 'KML', 'land_points.kml', gdb_file, '-dialect', 'SQLite', '-sql', sql_land_points])

sql_land_polygons = f"SELECT OGR_GEOMETRY FROM Land_Polygons WHERE {column_name} IN ({elem_str})"
subprocess.run(['ogr2ogr', '-f', 'KML', 'land_pols.kml', gdb_file, '-sql', sql_land_polygons])

This script will create the three different KML files that we usually need for our presentations. We used variables for the gdb_file, column_name, and elem_str command line parameters to make the script easy to use for selecting other data. We also use them in other scripts to join, apply custom styling, and add regions to the KMLs, which will be covered in another blog post.
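
One small refinement, sketched below, is to build the quoted IN list from a regular Python list instead of typing the quotes by hand, and to assemble the ogr2ogr arguments in one place. The helpers sql_in_list and ogr2ogr_command are names I made up for this example; the block only builds the command and does not run it.

```python
def sql_in_list(values):
    """Render Python strings as a quoted, comma-separated SQL IN list."""
    # Single quotes inside a value are doubled, as SQL string literals require.
    return ", ".join("'" + value.replace("'", "''") + "'" for value in values)

def ogr2ogr_command(output, gdb_file, sql, dialect=None):
    """Build the argument list only; pass it to subprocess.run() to execute."""
    cmd = ["ogr2ogr", "-f", "KML", output, gdb_file]
    if dialect:
        cmd += ["-dialect", dialect]
    cmd += ["-sql", sql]
    return cmd

codes = ["a1", "a2", "b1", "b3"]
sql = f"SELECT land_name FROM Land_Points WHERE p_code IN ({sql_in_list(codes)})"
cmd = ogr2ogr_command("land_labels.kml", "example.gdb.zip", sql)
print(cmd)
```

Passing the command as a list, as the script above already does, also means the SQL string needs no extra shell quoting.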

7. Extract all data as JSON from one layer using the Python GDAL library

I probably should have started here, but I only used the Python library later on in the process to join the gdb.zip file data with data from other sources (spreadsheets, emails, etc.).

First, install the GDAL library:

pip install gdal

Then you can use the following Python script, changing as necessary the gdb_file, layer_name, and the unique field chosen to structure the data. The code is explained in the comments.

#!/bin/python

import json
from osgeo import ogr

# Open the Geodatabase file
driver = ogr.GetDriverByName('OpenFileGDB')
gdb_file = 'example.gdb.zip'
data_source = driver.Open(gdb_file, 0)

# Get the layer
layer_name = 'Building_Points'
layer = data_source.GetLayerByName(layer_name)
if layer is not None:
    # Get the layer definition
    layer_defn = layer.GetLayerDefn()

    # Get the number of fields in the layer
    num_fields = layer_defn.GetFieldCount()

    # Initialize an empty list to store field names
    field_names = []

    # Iterate over each field and add its name to the list
    for i in range(num_fields):
        field_defn = layer_defn.GetFieldDefn(i)
        field_name = field_defn.GetName()
        field_names.append(field_name)

    print("Field names:", field_names)
else:
    # Stop here: the loop below cannot run without a layer.
    raise SystemExit(f"Layer '{layer_name}' not found.")

all_info = {}

# Iterate over all features and organize all field names under a unique one for the dictionary structure
for feature in layer:
    if feature is not None:
        globalid = feature.GetField('globalid')
        all_info[globalid] = {}
        for fn in field_names:
            all_info[globalid][fn] = feature.GetField(fn)

# Close the data source
data_source = None

# Save information to a JSON file
output_file = 'info.json'
with open(output_file, 'w') as json_file:
    json.dump(all_info, json_file, indent=4)
print(f"Information saved to {output_file}")

The script will create a JSON file with all the information on the layer.
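
Once the data is in JSON, joining it with another source is plain dictionary work. The sketch below uses inline sample data in place of the real info.json and a made-up list of rows standing in for a spreadsheet export; the field values and the contact column are invented for the example.

```python
import json

# Stand-in for the contents of info.json (field values are made up).
all_info = json.loads("""
{
    "{A1}": {"globalid": "{A1}", "property_code": "a1", "land_city": "Seattle"},
    "{B3}": {"globalid": "{B3}", "property_code": "b3", "land_city": "Tokyo"}
}
""")

# Stand-in for rows from another source, e.g. a spreadsheet export.
extra_rows = [
    {"property_code": "a1", "contact": "north-team"},
    {"property_code": "zz", "contact": "unknown"},
]

# Re-key the features by property_code so the two sources can be matched.
by_code = {info["property_code"]: info for info in all_info.values()}

# Keep only the rows that have a matching feature, merging both dictionaries.
joined = [
    {**by_code[row["property_code"]], **row}
    for row in extra_rows
    if row["property_code"] in by_code
]
print(joined)
```

Rows without a matching feature are silently dropped here; in a real script you would probably want to log them instead.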

Conclusion

Following these steps, we can efficiently manage and visualize geospatial data using ogr2ogr, SQL, and KML, and in some cases, JSON. These methods allow for a high degree of customization and can be tailored to specific project requirements.

Making a Loading Spinner with tkinter

2024-03-05T00:00:00+00:00

When you need a loading spinner, you really need a loading spinner. Interested in putting something on the screen without installing a pile of dependencies, I reached deep into the toolbox of the Python standard library, dug around a bit, and pulled out the tkinter module.

The tkinter module is an interface to the venerable Tcl/Tk GUI toolkit, a cross-platform suite for creating user interfaces in the style of whatever operating system you run it on. It’s the only built-in GUI toolkit in Python, but there are many worthy alternatives available (see the end of the post for a list).

Here I’ll demonstrate how to make a loading spinner with tkinter on Ubuntu 22.04. It should work on any platform that runs Python, with some variations when setting up the system for it.

Prerequisites

My vision for the loading spinner is some spinning dots and a logo, since this is such a convenient branding opportunity. To accomplish this we’ll be extending tkinter with Pillow’s ImageTk capability, which can load a PNG with transparency.

To produce that PNG with transparency, first we may need to rasterize an SVG file, because wise designers work in vectors. This is made trivial by Inkscape, a free and complete vector graphics tool:

$ inkscape logo.svg -o logo.png

With the logo in hand, we can move on to setting up our dependencies. Ubuntu’s python3 distribution doesn’t include tkinter by default, so we’ll need to install it explicitly, along with Pillow’s separate ImageTk support:

$ sudo apt install python3-tk python3-pil.imagetk

This may occupy up to 75MB, if Python was already installed. This was still the smallest apt footprint of all of the Python GUI libraries in consideration. Pygame was also a strong contender.

Code

Let’s start with something simple: putting the logo on the screen.

#!/usr/bin/env python3

from PIL import Image, ImageTk
from tkinter import BOTH, Canvas, Tk


# Desired dimensions of our window.
WIDTH, HEIGHT = 500, 500

if __name__ == "__main__":
    # Create the root window object.
    root = Tk()
    # Create a canvas for drawing our graphics.
    canvas = Canvas(root, width=WIDTH, height=HEIGHT, background="black")
    # Fill the entire window with the canvas.
    canvas.pack(fill=BOTH, expand=1)

    # Load the logo PNG with regular PIL.
    logo_img = Image.open("logo.png")
    # Convert the logo to an ImageTk PhotoImage.
    logo_pi = ImageTk.PhotoImage(logo_img)
    # Add our logo image to the canvas.
    canvas.create_image(
        WIDTH / 2,
        HEIGHT / 2,
        image=logo_pi,
    )

    # Run the tkinter main loop.
    root.mainloop()

This puts the logo in the center of a window, but the logo may be too large or small. Let’s scale it according to the window dimensions, let’s say to about ⅔ of the width so we have some room for spinning dots:

#!/usr/bin/env python3

from PIL import Image, ImageTk
from tkinter import BOTH, Canvas, Tk


# Desired dimensions of our window.
WIDTH, HEIGHT = 500, 500

if __name__ == "__main__":
    # Create the root window object.
    root = Tk()
    # Create a canvas for drawing our graphics.
    canvas = Canvas(root, width=WIDTH, height=HEIGHT, background="black")
    # Fill the entire window with the canvas.
    canvas.pack(fill=BOTH, expand=1)

    # Load the logo PNG with regular PIL.
    logo_img = Image.open("logo.png")
    # Resize the logo to about 2/3 the window width.
    scaled_w = round(WIDTH * 0.6)
    scaled_h = round(scaled_w / (logo_img.width / logo_img.height))
    logo_img = logo_img.resize((scaled_w, scaled_h), Image.LANCZOS)
    # Convert the logo to an ImageTk PhotoImage.
    logo_pi = ImageTk.PhotoImage(logo_img)
    # Add our logo image to the canvas.
    canvas.create_image(
        WIDTH / 2,
        HEIGHT / 2,
        image=logo_pi,
    )

    # Run the tkinter main loop.
    root.mainloop()

That’s better. Now let’s add the promised spinning dots. We’ll draw some ovals on the canvas and modify our main loop to animate them:

#!/usr/bin/env python3

import math
import time

from PIL import Image, ImageTk
from tkinter import BOTH, Canvas, Tk


# Desired dimensions of our window.
WIDTH, HEIGHT = 500, 500
# Coordinates of the center.
CENTER_X, CENTER_Y = WIDTH / 2, HEIGHT / 2
# How many spinning dots we want.
NUM_DOTS = 8

if __name__ == "__main__":
    # Create the root window object.
    root = Tk()
    # Create a canvas for drawing our graphics.
    canvas = Canvas(root, width=WIDTH, height=HEIGHT, background="black")
    # Fill the entire window with the canvas.
    canvas.pack(fill=BOTH, expand=1)

    # Load the logo PNG with regular PIL.
    logo_img = Image.open("logo.png")
    # Resize the logo to about 2/3 the window width.
    scaled_w = round(WIDTH * 0.6)
    scaled_h = round(scaled_w / (logo_img.width / logo_img.height))
    logo_img = logo_img.resize((scaled_w, scaled_h), Image.LANCZOS)
    # Convert the logo to an ImageTk PhotoImage.
    logo_pi = ImageTk.PhotoImage(logo_img)
    # Add our logo image to the canvas.
    canvas.create_image(
        CENTER_X,
        CENTER_Y,
        image=logo_pi,
    )

    # Radius in pixels of a single dot.
    dot_radius = WIDTH * 0.05
    # Radius of the ring of dots from the center of the window.
    dots_radius = WIDTH / 2 - dot_radius * 2

    # Helper function to calculate dot position on each update.
    def get_dot_coords(n: int, t: float):
        """Get the x0, y0, x1, y1 coords of dot at index 'n' at time 't'."""
        angle = (n / NUM_DOTS) * math.pi * 2 + t
        x = math.cos(angle) * dots_radius + CENTER_X
        y = math.sin(angle) * dots_radius + CENTER_Y
        return x - dot_radius, y - dot_radius, x + dot_radius, y + dot_radius

    # Create all the dots.
    t0 = time.monotonic()
    for n in range(NUM_DOTS):
        coords = get_dot_coords(n, t0)
        canvas.create_oval(
            *coords,
            fill="#888888",
            width=0,  # Border width.
            tags=f"dot_{n}",
        )

    # Set up a custom main loop to animate the moving dots.
    while True:
        # Check the time of this update.
        t = time.monotonic()
        for n in range(NUM_DOTS):
            # Get the desired coords for this dot at this time.
            coords = get_dot_coords(n, t)
            # Move the dot on the canvas, finding it by its tag.
            canvas.coords(
                f"dot_{n}",
                *coords,
            )
        # Call the required tkinter update function.
        root.update()
        # Attempt to stabilize the timing of this loop by targeting 60Hz.
        while t0 < t:
            t0 += 1 / 60
        time.sleep(t0 - t)
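
If you want to convince yourself that the position math is right without opening a window, the trigonometry can be pulled out into a standalone function and checked on its own. This is a sketch with a hypothetical dot_center function that takes the constants as arguments; the numbers match the script above (8 dots, a 500-pixel window, and a ring radius of 200).

```python
import math

def dot_center(n, num_dots, t, ring_radius, cx, cy):
    """Center of dot 'n' out of 'num_dots' at time 't' (in seconds)."""
    # Spread the dots evenly around the circle, then rotate by t radians.
    angle = (n / num_dots) * math.pi * 2 + t
    return (
        math.cos(angle) * ring_radius + cx,
        math.sin(angle) * ring_radius + cy,
    )

centers = [dot_center(n, 8, 0.0, 200.0, 250.0, 250.0) for n in range(8)]
for n, (x, y) in enumerate(centers):
    print(f"dot {n}: ({x:.1f}, {y:.1f})")
```

Since t comes from time.monotonic() and is added to the angle directly, the ring turns at one radian per second, so a full revolution takes about 6.3 seconds. Multiplying t by a constant would speed it up or slow it down.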

You may notice that the dots don’t look all that great. There’s no anti-aliasing when drawing shape primitives in tkinter, so the edges look jagged compared to our well-scaled logo image. One hack is to layer slightly larger and dimmer shapes under each object, which you might do like so:

#!/usr/bin/env python3

import math
import time

from PIL import Image, ImageTk
from tkinter import BOTH, Canvas, Tk


# Desired dimensions of our window.
WIDTH, HEIGHT = 500, 500
# Coordinates of the center.
CENTER_X, CENTER_Y = WIDTH / 2, HEIGHT / 2
# How many spinning dots we want.
NUM_DOTS = 8
# Colors for each layer of fake anti-aliasing around each dot.
# Must be in order from back to front.
COLORS = ["#888888", "#BBBBBB", "#FFFFFF"]

if __name__ == "__main__":
    # Create the root window object.
    root = Tk()
    # Create a canvas for drawing our graphics.
    canvas = Canvas(root, width=WIDTH, height=HEIGHT, background="black")
    # Fill the entire window with the canvas.
    canvas.pack(fill=BOTH, expand=1)

    # Load the logo PNG with regular PIL.
    logo_img = Image.open("logo.png")
    # Resize the logo to about 2/3 the window width.
    scaled_w = round(WIDTH * 0.6)
    scaled_h = round(scaled_w / (logo_img.width / logo_img.height))
    logo_img = logo_img.resize((scaled_w, scaled_h), Image.LANCZOS)
    # Convert the logo to an ImageTk PhotoImage.
    logo_pi = ImageTk.PhotoImage(logo_img)
    # Add our logo image to the canvas.
    canvas.create_image(
        CENTER_X,
        CENTER_Y,
        image=logo_pi,
    )

    # Radius in pixels of a single dot.
    dot_radius = WIDTH * 0.05
    # Radius of the ring of dots from the center of the window.
    dots_radius = WIDTH / 2 - dot_radius * 2

    # Helper function to calculate dot position on each update.
    def get_dot_coords(n: int, t: float, c: int):
        """Get the x0, y0, x1, y1 coords of dot at index 'n' at time 't'.
        Inflate the radius by color index 'c'."""
        angle = (n / NUM_DOTS) * math.pi * 2 + t
        x = math.cos(angle) * dots_radius + CENTER_X
        y = math.sin(angle) * dots_radius + CENTER_Y
        # Invert the color index and add to the radius.
        radius = dot_radius + (len(COLORS) - c) * 0.75
        #radius = dot_radius + c
        return x - radius, y - radius, x + radius, y + radius

    # Create all the dots.
    t0 = time.monotonic()
    for c, color in enumerate(COLORS):
        for n in range(NUM_DOTS):
            coords = get_dot_coords(n, t0, c)
            canvas.create_oval(
                *coords,
                fill=color,
                width=0,  # Border width.
                tags=f"dot_{c}_{n}",
            )

    # Set up a custom main loop to animate the moving dots.
    while True:
        # Check the time of this update.
        t = time.monotonic()
        for c, color in enumerate(COLORS):
            for n in range(NUM_DOTS):
                # Get the desired coords for this dot at this time.
                coords = get_dot_coords(n, t, c)
                # Move the dot on the canvas, finding it by its tag.
                canvas.coords(
                    f"dot_{c}_{n}",
                    *coords,
                )
        # Call the required tkinter update function.
        root.update()
        # Attempt to stabilize the timing of this loop by targeting 60Hz.
        while t0 < t:
            t0 += 1 / 60
        time.sleep(t0 - t)

The fake anti-aliasing was a fun exercise, but for this use case you’ll probably get better-looking results out of scaling a PNG asset like we did the logo.

Resources

If you’re interested in learning more about tkinter, see also:

Other Python GUI/graphics toolkits you might consider:

Pygame
Pyglet
PySide6 (official Qt binding)
PyQt (unofficial Qt binding)
wxPython

How to deploy a containerized Django app with AWS Copilot

2022-06-21T00:00:00+00:00

Generally there are 2 major options at AWS when it comes to deployment of containerized applications. You can either go for EKS or ECS.

EKS (Elastic Kubernetes Service) is the managed Kubernetes service by AWS. ECS (Elastic Container Service), on the other hand, is AWS’s own way to manage your containerized application. You can learn more about EKS and ECS on the AWS website.

For this post we will use ECS.

The chosen one and the sidekick

With ECS chosen, now you have to find a preferably easy way to deploy your containerized application on it.

There are quite a number of resources from AWS that are needed for your application to live on ECS, such as VPC (Virtual Private Cloud), Security Group (firewall), EC2 (virtual machine), Load Balancer, and others. Creating these resources manually is cumbersome, so AWS has come out with a tool that can automate the creation of all of them. The tool is known as AWS Copilot and we are going to learn how to use it.

Install Docker

Docker or Docker Desktop is required for building the Docker image later. Please refer to my previous article on how to install Docker Desktop on macOS, or follow Docker’s instructions for Linux and Windows.

Set up AWS CLI

We need to set up the AWS CLI (command-line interface) for authentication and authorization to AWS.

Execute the following command to install the AWS CLI on macOS:

$ curl -O "https://awscli.amazonaws.com/AWSCLIV2.pkg"
$ sudo installer -pkg AWSCLIV2.pkg -target /

For other OSes see Amazon’s docs.

Execute the following command and enter your AWS access key ID and secret access key when prompted. The command also asks for a default region and output format.

$ aws configure

Install AWS Copilot CLI

Now it’s time for the main character: AWS Copilot.

Install AWS Copilot with Homebrew for macOS:

$ brew install aws/tap/copilot-cli

See AWS Copilot Installation for other platforms.

The Django project

Create a Django project by using a Python Docker Image. You can clone my Git project to get the Dockerfile, docker-compose.yaml and requirements.txt that I’m using.

$ git clone https://github.com/aburayyanjeffry/django-copilot.git

Go to the django-copilot directory and execute docker-compose to create a Django project named “mydjango”.

$ cd django-copilot
$ docker-compose run web django-admin startproject mydjango .

Edit mydjango/settings.py to allow all hostnames for its URL. This is required because by default AWS will generate a random URL for the application. Find the following variable and set the value as follows:

ALLOWED_HOSTS = ['*']
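Allowing every hostname is convenient for a first deployment, but once the application has a known domain you may want to restrict the list. Here is a minimal sketch of one way to do that in settings.py. It assumes a hypothetical DJANGO_ALLOWED_HOSTS environment variable (the name is my own choice, not something Django or Copilot defines) holding a comma-separated list of hostnames:

```python
import os


def parse_allowed_hosts(raw):
    """Split a comma-separated host list, ignoring blanks and whitespace."""
    return [host.strip() for host in raw.split(",") if host.strip()]


# DJANGO_ALLOWED_HOSTS is a hypothetical variable name chosen for this sketch.
# Falling back to "*" keeps the allow-all behaviour when the variable is unset.
ALLOWED_HOSTS = parse_allowed_hosts(os.environ.get("DJANGO_ALLOWED_HOSTS", "*"))
```

With the variable unset, the setting stays equivalent to the allow-all value shown above.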

The Deployment with AWS Copilot

Create an AWS Copilot “Application”. This is a grouping of services such as web app or database, environments (development, QA, production), and CI/CD pipelines. Execute the following command to create an Application with the name of “mydjango”.

$ copilot init -a mydjango

Select the Workload type. Since this Django app is internet-facing, we will choose the “Load Balanced Web Service”.

Which workload type best represents your architecture?  [Use arrows to move, type to filter, ? for more help]
    Request-Driven Web Service  (App Runner)
  > Load Balanced Web Service   (Internet to ECS on Fargate)
    Backend Service             (ECS on Fargate)
    Worker Service              (Events to SQS to ECS on Fargate)
    Scheduled Job               (Scheduled event to State Machine to Fargate)

Give the Workload a name. We are going to name it “mydjango-web”.

Workload type: Load Balanced Web Service

  What do you want to name this service? [? for help] mydjango-web

Select the Dockerfile in the current directory.

Which Dockerfile would you like to use for mydjango-web?  [Use arrows to move, type to filter, ? for more help]
  > ./Dockerfile
    Enter custom path for your Dockerfile
    Use an existing image instead

Accept to create a test environment.

All right, you're all set for local development.

  Would you like to deploy a test environment? [? for help] (y/N) y

Wait and see. At the end of the deployment you will get the URL of your application. Open it in a browser.

Now let’s migrate some data, create a superuser, and try to log in. The Django app comes with a SQLite database. Execute the following command to get a terminal for the Django app:

$ copilot svc exec

Once in the terminal, execute the following to migrate the initial data and to create the superuser:

$ python manage.py migrate
$ python manage.py createsuperuser

Now you can access the admin page and log in using the credentials you created.

You should see the Django admin:

A mini cheat sheet

AWS Copilot commands	Remarks
`copilot app ls`	To list available Applications
`copilot app show -n appname`	To get the details of an Application
`copilot app delete -n appname`	To delete an Application
`copilot svc ls`	To list available Services
`copilot svc show -n svcname`	To get the details of a Service
`copilot svc delete -n svcname`	To delete a Service

The End

That’s all, folks.

AWS Copilot is a tool to automate the deployment of AWS infrastructure for our containerized application needs. It takes away most of the worries about infrastructure and enables us to focus sooner on the application development.

For further info on AWS Copilot visit its website.

Understanding Linear Regression

2022-06-01T00:00:00+00:00

Photo by Scott Webb

Linear regression is a regression model, which means it outputs a numeric value. It predicts an outcome as a linear combination of its inputs.

The simplest hypothesis function of a linear regression model is a univariate function, as shown in the equation below:

$$ h_θ = θ_0 + θ_1x_1 $$

As you can guess, this function represents a straight line in the coordinate system. The hypothesis function (h_θ) approximates the output for a given input.

θ₀ is the intercept, also called the bias term. θ₁ is the gradient or slope.

A linear regression model can represent either a univariate or a multivariate problem, so we can generalize the hypothesis equation as a summation:

$$ h_θ = \sum_{i=0}^{n}{θ_ix_i} $$

where x₀ is always 1.

We can also represent the hypothesis equation with vector notation:

$$ h_θ = \begin{bmatrix} θ_0 & θ_1 & θ_2 & \dots & θ_n \end{bmatrix} \times \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} $$
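To see that the summation and the vector notation describe the same calculation, here is a small NumPy sketch. The parameter and input values are made up for illustration:

```python
import numpy as np

# Hypothetical parameters and one input sample; x[0] is the constant 1.
theta = np.array([2.0, 0.5, -1.0])
x = np.array([1.0, 4.0, 3.0])

# Summation form: add up theta_i * x_i term by term.
h_sum = sum(t * xi for t, xi in zip(theta, x))
# Vector form: row vector of parameters times column vector of inputs.
h_dot = theta @ x

print(h_sum, h_dot)
```

Both expressions produce the same number, and the vector form is the one that scales well to many features and many samples.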

Linear Regression Model

I am going to introduce a linear regression model using a gradient descent algorithm. Each iteration of the gradient descent algorithm goes through the following steps:

Hypothesis h
The loss
Gradient descent update

The gradient descent iterations stop when the loss converges.

Although I am implementing a univariate linear regression model in this section, these steps apply to multivariate linear regression models as well.

Hypothesis

We start with an initial hypothesis that uses random parameters. Then we calculate the loss of that hypothesis over the training dataset using the L2 loss function. In Python, the hypothesis is:

def hypothesis(X, theta):
    return theta[0] + theta[1:] * X

This function takes the input X (univariate in this implementation) and the parameter values theta. X holds the feature values of our dataset and theta holds the weights of the features. θ₀ is the bias term and θ₁ is the gradient or slope.

L2 Loss Function

The L2 loss function, sometimes called Mean Squared Error (MSE), measures the total error of the current hypothesis over the given training dataset. By calculating the MSE during training, we can aim to minimize the cumulative error.

$$ J(θ) = \frac{\sum{(h_θ(x_i) - y_i)^2}}{2m} $$

The L2 loss function (MSE) simply sums the squared error of each data point and divides the sum by twice the size of the dataset (m). The extra factor of 2 only simplifies the derivative and does not change where the minimum is.

The better the line is aligned with the data points, that is, the better its intercept and slope are chosen, the smaller the error. Minimizing this error is our target in linear regression training.
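The gradient descent code later in this post calls an L2_loss function. Here is a minimal sketch of it that follows the formula above. The argument names h and y are my assumptions, and the 1/(2m) scaling is taken from the equation, so the absolute loss values it reports may differ from the sample output shown later:

```python
import numpy as np


def L2_loss(h, y):
    """Sum of squared errors between predictions h and targets y, divided by 2m."""
    errors = np.asarray(h).flatten() - np.asarray(y).flatten()
    return np.sum(errors ** 2) / (2 * len(errors))
```

Flattening both arguments keeps the function usable whether the predictions arrive as a column vector or as a flat array.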

Gradients of the Loss

Each time we iterate and calculate a new theta (θ), we get a new θ₁ (slope) value. If we plot the loss against each slope value visited during the batch gradient descent updates, we get a curve like this:

This curve has a minimum. Our goal is to find the value of θ₁ where the curve doesn’t get any lower, or where the remaining change is small enough to ignore. That is where convergence is achieved and the loss is minimized.

Let’s do a little bit more math. The gradient of the loss is the vector of its partial derivatives with respect to each θ. We calculate the partial derivative of the loss for θ₀ and θ₁ separately. For multivariate functions, the θ₁ case generalizes to every θ_i, since the partial derivatives are calculated in the same way. You can derive the partial derivatives of the loss function yourself too.

$$ \frac{∂}{∂θ_0}J(θ) = \frac{\sum{(h_θ(x_i) - y_i)}}{m} $$

$$ \frac{∂}{∂θ_1}J(θ) = \frac{\sum{(h_θ(x_i) - y_i)x_i}}{m} $$

Since we know the hypothesis equation we can replace it in the derivatives as well:

import numpy as np

def partial_derivatives(h, X, y):
    return [np.mean((h.flatten() - y)), np.mean((h.flatten() - y) * X.flatten())]

Now we can calculate the gradients of the hypothesis for a given theta, X, and y:

def calc_gradients(theta, X, y):
    # Evaluate the hypothesis, then the partial derivatives of the loss.
    h = hypothesis(X, theta)
    gradient = partial_derivatives(h, X, y)
    return np.array(gradient)
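A quick way to gain confidence in hand-derived partial derivatives is to compare them with a numerical approximation. The sketch below is self-contained and uses its own helper names (loss, analytic_gradient, numeric_gradient), which are hypothetical and not part of the implementation above. It assumes the univariate hypothesis and the loss J(θ) with the 1/(2m) scaling:

```python
import numpy as np


def loss(theta, X, y):
    # J(theta) for the univariate hypothesis theta[0] + theta[1] * x.
    h = theta[0] + theta[1] * X
    return np.sum((h - y) ** 2) / (2 * len(y))


def analytic_gradient(theta, X, y):
    # The two partial derivatives: mean error, and mean error weighted by x.
    h = theta[0] + theta[1] * X
    return np.array([np.mean(h - y), np.mean((h - y) * X)])


def numeric_gradient(theta, X, y, eps=1e-6):
    # Central finite differences, one parameter at a time.
    grad = np.zeros(len(theta))
    for k in range(len(theta)):
        step = np.zeros(len(theta))
        step[k] = eps
        grad[k] = (loss(theta + step, X, y) - loss(theta - step, X, y)) / (2 * eps)
    return grad


# Made-up data for illustration only.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = np.array([0.5, 1.5])

print(analytic_gradient(theta, X, y))
print(numeric_gradient(theta, X, y))
```

If the two printed vectors agree to several decimal places, the derivation is very likely correct.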

Batch Gradient Descent

The gradient descent method I used in this implementation is called batch gradient descent. It uses all the available data in every iteration, which slows down the overall convergence process. Other methods, such as stochastic gradient descent, improve on this performance.

Since we can calculate the gradients for a given theta, we iterate until convergence.

$$ θ_1^{new} = θ_1^{current} - α \frac{∂}{∂θ_1}J(θ^{current}) $$

Here the learning rate (α), also called the convergence rate, decides how big a step we take in each iteration. If α is too small, each step is more precise, but training needs many more iterations to converge. If α is too big, training moves faster, but the updates can overshoot the minimum, so the model may converge poorly or not at all.
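The effect of the learning rate is easy to see on a toy problem. The sketch below does not use the regression code from this post. It assumes the simplest possible loss, J(θ) = θ², whose gradient is 2θ and whose minimum is at θ = 0, and the function name descend is made up for this example:

```python
def descend(alpha, steps=20, theta=1.0):
    """Run gradient descent on J(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta


# A small, a moderate, and a too-large learning rate.
for alpha in (0.01, 0.4, 1.1):
    print(alpha, descend(alpha))
```

With the small rate the parameter is still far from the minimum after 20 steps, the moderate rate gets very close to it, and the large rate moves further away from the minimum on every step.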

There is no strict best value for α since it depends on the dataset used for training the model. By evaluating the model you trained you can find the best α value for your dataset. You can refer to statistical measures like the R² score to see how much of the observed variance the model explains. But there usually won’t be a single model parameter, hyperparameter, or statistical variable to refer to for regularization.

def gradient_update(X, y, theta, alpha, stop_threshold):
    # initial loss
    loss = L2_loss(hypothesis(X, theta), y)
    # Start with an infinite previous loss so the loop always runs at least once.
    old_loss = float('inf')

    while abs(old_loss - loss) > stop_threshold:
        # gradient descent update
        gradients = calc_gradients(theta, X, y)
        theta = theta - alpha * gradients            
        old_loss = loss
        loss = L2_loss(hypothesis(X, theta), y)
        
    print('Gradient Descent training stopped at loss %s, with coefficients: %s' % (loss, theta))
    return theta

By performing batch gradient descent we actually train our algorithm and make it find the best theta values to fit the linear function. Now we can evaluate our algorithm and compare it with the scikit-learn LinearRegression model.

Evaluation

Since linear regression is a regression model, you should train and evaluate this model on regression datasets.

The scikit-learn diabetes dataset is a good example of a regression dataset. Below I load the dataset, keep a single feature so the problem stays univariate, and split it into training and test datasets.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature (column 2) to keep the problem univariate
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(diabetes_X, diabetes_y, test_size=0.1)

Now we can evaluate our model:

import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

# initial random theta
theta = [100, 3]

stop_threshold = 0.1

# learning rate
alpha = 0.5

theta = gradient_update(X_train, y_train, theta, alpha, stop_threshold)
y_pred = hypothesis(X_test, theta)

print("Intercept (theta 0):", theta[0])
print("Coefficients (theta 1):", theta[1])
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score", r2_score(y_test, y_pred))

# Plot outputs using test data
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)

plt.show()

When I run my linear regression model it finds the optimal theta values, finishes the training, and outputs as below. I noted sample evaluation scores below too.

Gradient Descent training stopped at loss 3753.11429796413, with coefficients: [151.6166715  850.81024746]
Intercept (theta 0): 151.61667150054697
Coefficients (theta 1): 850.8102474614635
MSE: 5320.89741757879
R2 Score 0.14348916154815183

Now let’s evaluate the scikit-learn linear regression model with the same training and test datasets. I’m going to use the default parameters without optimizing.

# scikit-learn LinearRegression model evaluation
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

print("Coef:", regr.coef_)
print("Intercept:", regr.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score", r2_score(y_test, y_pred))

# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)

plt.show()

The output and plot of the scikit-learn LinearRegression model are shown below:

Coef: [993.14228074]
Intercept: 151.5751918329106
MSE: 5544.283378702411
R2 Score 0.10753047228113943

Notice that the intercepts of my linear regression model and scikit-learn’s model are very close, at around 151. The slopes differ more (about 851 versus 993). The MSE values are close too, and both models plotted very similar predictions.

Multivariate Linear Regression

When a dataset has more features, we can add them to the hypothesis in the same way as in the univariate case:

$$ h_θ(x) = θ_0 + θ_1x_1 + \dots + θ_nx_n $$

A multivariate dataset can have multiple features and a single output like below.

Feature1	Feature2	Feature3	Feature4	Target
2	0	0	100	12
16	10	1000	121	18
5	5	450	302	14

Each feature is an independent variable (x_i) of a dataset. Parameters (theta) are what we aim to find during the training just like the univariate model.
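The univariate training loop carries over to several features almost unchanged once the hypothesis is written as a matrix product. Below is a minimal vectorized sketch. The function name, the default learning rate, and the fixed iteration count are my assumptions for illustration, and the data is synthetic:

```python
import numpy as np


def multivariate_gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend x_0 = 1 for the bias term
    theta = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        errors = Xb @ theta - y            # h(x) - y for every sample
        theta -= alpha * (Xb.T @ errors) / m
    return theta


# Synthetic data generated from known parameters, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
true_theta = np.array([4.0, 2.0, -1.0, 0.5])
y = true_theta[0] + X @ true_theta[1:]

theta = multivariate_gradient_descent(X, y)
print(theta)
```

The synthetic features here all share the same range. Features on very different scales, like the ones in the table above, usually need to be rescaled first, otherwise a single learning rate rarely works for all parameters.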

Linear Regression with Polynomial Functions

Sometimes a straight line doesn’t fit the data well enough. In that case a polynomial function, which adds powers of the feature as extra terms, could fit the data better.

In this case the data itself is not linear, but luckily the hypothesis is still linear in its parameters, so we can apply linear regression to the non-linear dataset as well:

$$ h_θ(x) = θ_0 + θ_1x + θ_2x^2 + \dots + θ_nx^n $$

$$ h_θ = \begin{bmatrix} 1 & x & x^2 & \dots & x^n \end{bmatrix} \times \begin{bmatrix} θ_0 \\ θ_1 \\ θ_2 \\ \vdots \\ θ_n \end{bmatrix} $$

Here the data is non-linear but the parameters are linear and we can still apply the gradient descent algorithm.
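As a rough illustration, the powers of x can be treated as ordinary features. In the sketch below the helper name polynomial_features is hypothetical, the data is synthetic, and the parameters are found with NumPy’s least-squares solver instead of gradient descent to keep the example short:

```python
import numpy as np


def polynomial_features(x, degree):
    """Return the matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)


# Synthetic, noise-free quadratic data for illustration only.
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2

A = polynomial_features(x, degree=2)
# The model is linear in theta, so ordinary least squares applies directly.
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta)
```

The recovered parameters match the coefficients used to generate the data, even though the relationship between x and y is a curve.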

Conclusion

In this post I implemented a linear regression model from scratch, trained it, and evaluated it against scikit-learn’s implementation.

Linear regression is useful when your dataset variables can be related in a linear relation. In the real world, linear regression is very useful in forecasting.

Visualizing Data with Pair-Plot Using Matplotlib

2022-04-25T00:00:00+00:00

Photo by Sebastian

Pair Plot

A pair plot shows “pairwise relationships in a dataset” (seaborn.pairplot). Two well-known visualization modules for Python are widely used by data scientists and analysts: Matplotlib and Seaborn. There are many others as well, but these are the de facto standards. Matplotlib is the lower-level library, while Seaborn builds upon Matplotlib and “provides a high-level interface for drawing attractive and informative statistical graphics” (Seaborn project).

Seaborn’s higher-level, pre-built plot functions are convenient, and the pair plot is one of them. With Matplotlib you can draw many plot types, such as line, scatter, bar, and histogram plots. A pair plot is a way of arranging such plots in a grid rather than a plot type of its own. Here is a pair-plot example from the Seaborn site:

Using a pair-plot we aim to visualize the correlation of each feature pair in a dataset against the class distribution. The diagonal of the pair-plot is different from the other pairwise plots, as you can see above. That is because the diagonal cells pair each feature with itself, so there is no correlation to plot there. Instead we can plot the class distribution for that feature using a single plot type.

The plots for different feature pairs can be scatter plots or heatmaps, so that the class distribution makes sense in terms of correlation. The plot type of the diagonal can be chosen from the most commonly used kinds, such as a histogram or a KDE (kernel density estimate), which essentially plot the density distribution of the classes.

Since a pair plot visually gives an idea of correlation of each feature pair, it helps us to understand and quickly analyse the correlation matrix (Pearson) of the dataset as well.
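For comparison, the Pearson correlation matrix itself can be computed with NumPy. This sketch uses synthetic features rather than a real dataset, so the variable names and values are purely illustrative:

```python
import numpy as np

# Synthetic features for illustration: b follows a closely, c is unrelated noise.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)

X = np.column_stack([a, b, c])
# rowvar=False treats each column as a variable (feature).
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

In a pair plot, the strongly correlated pair (a, b) would show up as a narrow diagonal band of points, while the pairs involving c would look like shapeless clouds.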

Custom Pair-Plot using Matplotlib

Since Matplotlib is relatively low-level and doesn’t provide a ready-to-use pair-plot function, we can build one ourselves in a similar way to how Seaborn does. You normally wouldn’t write such home-made functions if they are already available in modules like Seaborn. But implementing your visualization methods yourself gives you a chance to know exactly what you plot, and the result may sometimes need to be very different from the existing ones. I am not going to cover such an exceptional case here; we will simply create our own pair-plot grid using Matplotlib.

Plot Grid Area

Initially we need to create a grid plot area using the subplots function of matplotlib like below.

import matplotlib.pyplot as plt
fig, axis = plt.subplots(nrows=3, ncols=3)

For a pair-plot grid you should use the same number of rows and columns, because we are going to plot pairwise. Now we can prepare a plot function for the grid area we created. If we have 3 features in our dataset, as in this example, we can loop over every pair of features like this:

for i in range(0, 3):
    for j in range(0, 3):
        plotPair()

For cleaner code it is better to move the single pair plotting to another function.

Below is a function I created for one of my master’s degree coursework assignments in December 2021 at the University of London. Plotting a single pair in a grid requires getting the axis for the current grid cell and identifying the feature values to draw on that axis. Another thing to consider is where to render the axis labels. If we were plotting a single chart it would be easy to render the labels on each axis of the chart. But in a pair plot it is better to render the labels only on the left-most and bottom-most cells of the grid, so that the inner subplots are not cluttered with labels.

import pandas as pd

def plot_single_pair(ax, feature_ind1, feature_ind2, _X, _y, _features, colormap):
    """Plots single pair of features.

    Parameters
    ----------
    ax : Axes
        matplotlib axis to be plotted
    feature_ind1 : int
        index of first feature to be plotted
    feature_ind2 : int
        index of second feature to be plotted
    _X : numpy.ndarray
        Feature dataset of shape m x n
    _y : numpy.ndarray
        Target array of length m
    _features : list of str
        List of n feature titles
    colormap : dict
        Color map of classes existing in target

    Returns
    -------
    None
    """

    # Plot distribution histogram if the features are the same (diagonal of the pair-plot).
    if feature_ind1 == feature_ind2:
        tdf = pd.DataFrame(_X[:, [feature_ind1]], columns = [_features[feature_ind1]])
        tdf['target'] = _y
        for c in colormap.keys():
            tdf_filtered = tdf.loc[tdf['target']==c]
            ax[feature_ind1, feature_ind2].hist(tdf_filtered[_features[feature_ind1]], color = colormap[c], bins = 30)
    else:
        # Otherwise plot the pair-wise scatter plot.
        tdf = pd.DataFrame(_X[:, [feature_ind1, feature_ind2]], columns = [_features[feature_ind1], _features[feature_ind2]])
        tdf['target'] = _y
        for c in colormap.keys():
            tdf_filtered = tdf.loc[tdf['target']==c]
            ax[feature_ind1, feature_ind2].scatter(x = tdf_filtered[_features[feature_ind2]], y = tdf_filtered[_features[feature_ind1]], color=colormap[c])

    # Print the feature labels only on the left side of the pair-plot figure
    # and bottom side of the pair-plot figure. 
    # Here avoiding printing the labels for inner axis plots.
    if feature_ind1 == len(_features) - 1:
        ax[feature_ind1, feature_ind2].set(xlabel=_features[feature_ind2], ylabel='')
    if feature_ind2 == 0:
        if feature_ind1 == len(_features) - 1:
            ax[feature_ind1, feature_ind2].set(xlabel=_features[feature_ind2], ylabel=_features[feature_ind1])
        else:
            ax[feature_ind1, feature_ind2].set(xlabel='', ylabel=_features[feature_ind1])

Let’s go back to the initial plotting of the grid area and add the call to the plot_single_pair function. We can adjust the figure size of the grid area using fig.set_size_inches depending on the feature count, so that the area is well scaled.

colormap={0: "red", 1: "green", 2: "blue"}

feature_count = 3  # Three features, matching the 3 x 3 grid created above.
fig.set_size_inches(feature_count * 4, feature_count * 4)

# Iterate through features to plot pairwise.
for i in range(0, 3):
    for j in range(0, 3):
        plot_single_pair(axis, i, j, X, y, features, colormap)

plt.show()

In my plot_single_pair function notice that I also used a colormap dictionary. This dictionary assigns a color to each class (label) of the dataset, which makes the classes easy to distinguish in a scatter plot or a histogram and makes the plot look nicer.

Here is my final grid plot function for pair-plot:

def myplotGrid(X, y, features, colormap={0: "red", 1: "green", 2: "blue"}):
    """Plots a pair grid of the given features.

    Parameters
    ----------
    X : numpy.ndarray
        Dataset of shape m x n
    y : numpy.ndarray
        Target array of length m
    features : list of str
        List of n feature titles
    colormap : dict
        Color map of classes existing in target

    Returns
    -------
    None
    """

    feature_count = len(features)
    # Create a matplot subplot area with the size of [feature count x feature count]
    fig, axis = plt.subplots(nrows=feature_count, ncols=feature_count)
    # Setting figure size helps to optimize the figure size according to the feature count.
    fig.set_size_inches(feature_count * 4, feature_count * 4)

    # Iterate through features to plot pairwise.
    for i in range(0, feature_count):
        for j in range(0, feature_count):
            plot_single_pair(axis, i, j, X, y, features, colormap)

    plt.show()

Pair-Plot a Dataset

Now let’s prepare a dataset and plot it using our custom pair-plot implementation. Notice that my plot_single_pair function expects the feature and target values as numpy.ndarray.

Let’s get the iris dataset from the scikit-learn dataset collection and do a quick exploratory data analysis.

from sklearn import datasets
iris = datasets.load_iris()

Here are the targets (classes) of the iris dataset:

iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

And here we can see the feature names and a few lines of the dataset values.

iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2

Since iris.data and iris.target are already of type numpy.ndarray, which is what my function expects, I don’t need any further dataset manipulation here. Now let’s finally call the myplotGrid function and render the pair-plot for the iris dataset.

Note that you can change color per target in colormap as you wish.

myplotGrid(iris.data, iris.target, iris.feature_names, colormap={0: "red", 1: "green", 2: "blue"})

And here is my custom pair-plot output:

For further research I encourage you to go and do your exploratory data analysis and take a look at the correlation coefficient analysis to get more insights for pair-wise analysis.

Python concurrency: asyncio for threading users

2020-10-05T00:00:00+00:00

Photo by Adrian Schwarz

You’ve probably heard this classic software engineering mantra:

Concurrency is hard.

The undeniable fact is that an entire category of software bugs, known for being elusive and frustrating to reproduce, is gated behind the introduction of concurrency to a project. Race conditions, mutual exclusion, deadlock, and starvation, to name a few.

Most programming languages with concurrency features ship with some or all of the classical concurrency primitives: threads, locks, events, semaphores, mutexes, thread-safe queues, and so on. While practically any concurrency problem can be solved with this toolkit, let me share a relevant life mantra:

Just because you can doesn’t mean you should.

In the case of Python, you have access to a standard library alternative to threading, which factors out many of the trickier parts of concurrent programming: asyncio. Many existing applications of Python threads can be replaced by asyncio coroutines, potentially eliminating many of the difficulties of concurrency.

Understanding the differences between asyncio and threading can help you make informed choices about which to apply and when, so let’s take a closer look.

The Python GIL

Any discussion of Python concurrency should mention Python’s GIL, or Global Interpreter Lock. The GIL ensures that only one thread may be executing Python code at a time. A. Jesse Jiryu Davis writes a succinct description of how the GIL affects Python threads:

One thread runs Python, while N others sleep or await I/O.

Once a thread has acquired the GIL, there are two ways it can release the GIL for another thread to acquire:

Reaching a point in the script where it is sleeping, ending, waiting for I/O, or executing native code which explicitly releases the GIL. For example, calling time.sleep(0) forces a thread to release the GIL without ending the thread.
After executing for some amount of time and/or instructions, a thread may be forced to release the GIL for another thread to have a turn. Interrupting a thread in this way is preemption.
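In CPython, the time-based preemption described in the second point is governed by the interpreter’s switch interval, which can be inspected from the standard library. A small sketch; the exact value is an implementation detail and may differ between interpreters and versions:

```python
import sys

# How long (in seconds) a thread may run before the interpreter
# asks it to release the GIL so another thread can be scheduled.
interval = sys.getswitchinterval()
print(f"Switch interval: {interval:.4f}s")
```

There is a matching sys.setswitchinterval() function, but tuning it is rarely a substitute for proper synchronization.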

A Contrived Example of Thread Preemption

Here I will set up a concurrency toy to demonstrate some characteristics of Python threading. One thread will increment a list of numbers in a for loop. The other thread will, roughly once per second, check the consistency of the list (all numbers are equal) and print a message. Observant readers can probably imagine a better non-threading solution to this problem, but allow me to entertain you.

All posted results are from running the examples in Python 3.8.6 in a Docker container on a laptop. Your results may vary.

Here is the naive approach, without any attempt at synchronization:

import threading
import time

SIZE = 100000


class Counter():
  def __init__(self):
    self.values = [0] * SIZE

  def count(self):
    while True:
      for i in range(SIZE):
        self.values[i] += 1

  def heartbeat(self):
    t0 = time.monotonic()

    while True:
      time.sleep(1.0)

      # Check for consistency.
      for i in range(SIZE):
        assert self.values[i] == self.values[0], f'Value at index {i} is inconsistent'

      now = time.monotonic()
      print(f'All values are {self.values[0]} at +{now - t0:.5f}s')


if __name__ == '__main__':
  counter = Counter()

  t_count = threading.Thread(target=counter.count, daemon=True)
  t_count.start()

  t_heartbeat = threading.Thread(target=counter.heartbeat, daemon=True)
  t_heartbeat.start()

  print('Press Ctrl+C to end')

  # Wait for Ctrl+C or heartbeat crash.
  t_heartbeat.join()

If you run this code a few times, you’ll notice that you might get a few heartbeat messages (if you’re lucky), then the script crashes because the consistency check failed. The counter thread can be preempted at any time, including between value increments in the for loop. Similarly, the heartbeat thread can be preempted in the middle of its consistency check.

Threads can be very rude when left alone. We can prevent threads racing for access to shared state by using a Lock, which can only be held by one thread at a time:

class Counter():
  def __init__(self):
    self.values_lock = threading.Lock()
    self.values = [0] * SIZE

  def count(self):
    while True:
      with self.values_lock:
        for i in range(SIZE):
          self.values[i] += 1

  def heartbeat(self):
    t0 = time.monotonic()

    while True:
      time.sleep(1.0)

      # Check for consistency.
      with self.values_lock:
        for i in range(SIZE):
          assert self.values[i] == self.values[0], f'Value at index {i} is inconsistent'

      now = time.monotonic()
      print(f'All values are {self.values[0]} at +{now - t0:.5f}s')

All values are 693 at +9.12154s
All values are 797 at +10.60691s
All values are 877 at +11.83219s
All values are 957 at +13.04729s
All values are 1036 at +14.27122s
All values are 1157 at +16.15205s

When you run this code you may notice that the consistency check never fails, but the timing of the heartbeat message is not as consistent as we would like. In fact, it may be several seconds before you see a heartbeat message. There is a lot of timing jitter, which is deviation from the expected time interval. There is also a lot of drift, which is accumulated jitter.

This is happening because there is no guarantee that the counter thread will release the GIL while values_lock is available, or that values_lock won’t be acquired again by the counter thread immediately after releasing it.

When multiple threads are waiting to acquire a Lock, the order in which they are woken up is not defined. We might expect or sometimes observe that the Lock will be acquired first come, first served, or first in, first out, but Python makes no guarantees.

Remembering that we can explicitly release the GIL, we cleverly try doing so immediately before the counter thread acquires values_lock to give the heartbeat thread a chance to win the race:

def count(self):
  while True:
    time.sleep(0)  # Explicitly yield the GIL.
    with self.values_lock:
      for i in range(SIZE):
        self.values[i] += 1

All values are 114 at +1.41245s
All values are 206 at +2.57230s
All values are 315 at +4.05285s
All values are 432 at +5.75999s
All values are 514 at +6.94572s
All values are 662 at +9.05388s

This is much better, but there is still enough jitter and drift that we might not even be able to adequately compensate for the drift by reducing the sleep interval automatically. We’ve applied all of our knowledge of threading and this is the best we can do, within the constraints of the example.

All of this may seem like a lot to keep up with, and it is. Even if you are a threading master and craft perfectly safe and reasonably performant Python threading code, it is likely to be relatively difficult and expensive to maintain. There’s got to be a better way.

Concurrency with asyncio

The asyncio approach to Python concurrency is relatively new. Its integration with the language has changed over the course of Python development, but it appears to be largely stable and useful as of Python 3.8. Instead of using Python threads to run instructions concurrently, asyncio uses an event loop to schedule instructions on the main thread.

Contrasted with threads, asyncio coroutines may never be interrupted by another coroutine unless they explicitly yield control at an await (or at an async for or async with statement). However, there is no guarantee that any given await will actually yield to another task.

The asyncio library is intended to be used for I/O-bound applications such as high performance network servers, which spend much of their time waiting for the OS to send or receive data on a file descriptor or socket. However, as we will see when applying asyncio to our toy concurrency example, it can be applied to otherwise pure and isolated Python code too.

This example requires Python 3.7 or later for the asyncio.run() function.

import asyncio

SIZE = 100000


class Counter():
  def __init__(self):
    self.values = [0] * SIZE

  async def count(self):
    while True:
      await asyncio.sleep(0)  # Explicitly yield to other coroutines.
      for i in range(SIZE):
        self.values[i] += 1

  async def heartbeat(self):
    loop = asyncio.get_event_loop()
    t0 = loop.time()

    while True:
      await asyncio.sleep(1.0)

      # Check for consistency.
      for i in range(SIZE):
        assert self.values[i] == self.values[0]

      now = loop.time()
      print(f'All values are {self.values[0]} at +{now - t0:.5f}s')


async def main():
    counter = Counter()

    tasks = map(asyncio.create_task, [counter.count(), counter.heartbeat()])
    await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)


if __name__ == '__main__':
    asyncio.run(main())

All values are 84 at +1.04261s
All values are 169 at +2.08948s
All values are 254 at +3.13599s
All values are 326 at +4.18760s
All values are 400 at +5.24720s
All values are 473 at +6.30465s

This timing is now in the ballpark where we can compensate for drift by automatically adjusting the heartbeat sleep interval. There is no more racing for the GIL. Instead the counter coroutine explicitly awaits whenever the state is consistent. It may not be interrupted by another coroutine at any other time, but it can still be interrupted by another thread or a signal.
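One way to do that compensation, sketched here under assumptions: instead of sleeping for a fixed interval, the heartbeat sleeps until the next absolute deadline, so lateness on one beat does not carry over to the next. The name steady_heartbeat and its parameters are hypothetical and not part of the example above.

```python
import asyncio


async def steady_heartbeat(interval, beats):
    """Wake on absolute deadlines so per-beat lateness does not accumulate."""
    loop = asyncio.get_running_loop()
    t0 = loop.time()
    offsets = []
    for n in range(1, beats + 1):
        deadline = t0 + n * interval
        # Sleep only for the time remaining until the deadline. If we are
        # already late, sleep(0) still yields to other coroutines.
        await asyncio.sleep(max(0.0, deadline - loop.time()))
        offsets.append(loop.time() - t0)
    return offsets


if __name__ == '__main__':
    for offset in asyncio.run(steady_heartbeat(interval=0.1, beats=5)):
        print(f'beat at +{offset:.5f}s')
```

Each beat can still be late by however long another coroutine holds the event loop, but the error no longer adds up from beat to beat.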

If we imagine that the asyncio.sleep(0) incantation is actually awaiting data on a file descriptor or network socket saturated with data, this example is immediately relevant to the intended use of asyncio. The pattern of running top-level coroutines with asyncio.wait(), among other asyncio task scheduling tools, can potentially replace many existing uses of daemon threads for background tasks.

A caveat of asyncio is the consequence of one of its advantages: While a coroutine is running, it may never be interrupted by another coroutine unless it explicitly gives up control. One slow or malformed coroutine can freeze your entire application, where daemon threads would keep preempting and bypassing the blockage (at cost of some context-switching overhead and developer sanity). If you must run blocking code concurrently you can run it in an executor to decouple it from the asyncio event loop, but understand that the usual thread synchronization problems may apply.
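As a minimal sketch of the executor approach, assuming a stand-in blocking function (blocking_work, ticker, and run_both are made-up names), loop.run_in_executor() hands the blocking call to a thread pool while the event loop keeps scheduling coroutines:

```python
import asyncio
import time


def blocking_work(seconds):
    """Stand-in for code that blocks and cannot be awaited."""
    time.sleep(seconds)
    return 'done'


async def ticker(ticks, interval):
    """Counts ticks; it only makes progress while the event loop is free."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(interval)
        count += 1
    return count


async def run_both():
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    future = loop.run_in_executor(None, blocking_work, 0.2)
    ticks = await ticker(ticks=3, interval=0.01)
    result = await future  # Wait for the blocking call to finish.
    return ticks, result


if __name__ == '__main__':
    print(asyncio.run(run_both()))
```

Calling blocking_work(0.2) directly inside a coroutine would instead stall the ticker for the full 0.2 seconds.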

Just because asyncio never preempts doesn’t mean you will never need to synchronize access to shared state. The library does include threading-like synchronization primitives (which are not thread-safe), but the need for them should be the exception, not the norm.
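Here is a small illustration of one such exception, with a hypothetical Ledger class: a coroutine that awaits in the middle of a two-step update leaves the state inconsistent across that await, and an asyncio.Lock keeps readers out until the update is complete.

```python
import asyncio


class Ledger:
    def __init__(self):
        self.a = 100
        self.b = 0
        self.lock = asyncio.Lock()

    async def transfer(self, amount):
        async with self.lock:
            self.a -= amount
            await asyncio.sleep(0)  # State is inconsistent across this await.
            self.b += amount

    async def total(self):
        async with self.lock:
            return self.a + self.b


async def demo():
    ledger = Ledger()  # Created inside the running event loop.
    results = await asyncio.gather(
        ledger.transfer(10), ledger.total(),
        ledger.transfer(5), ledger.total())
    return ledger, results


if __name__ == '__main__':
    ledger, results = asyncio.run(demo())
    print(results, ledger.a, ledger.b)
```

Without the lock, total() could run while a transfer is suspended at its await and report a sum that is short by the amount in flight.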

Conclusion

If you’re interested in using asyncio, I urge you to explore its interfaces further. We’ve only scratched the surface in these examples. If you are shoehorning asyncio into an existing application or want to use it with a library or framework that relies on threads, know there are asyncio interfaces for safely adding coroutines from a thread.
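As a starting point, here is a minimal sketch of one of those interfaces, asyncio.run_coroutine_threadsafe(). The greet and worker functions are made up for illustration; the thread submits a coroutine to a loop running in another thread and blocks on the returned concurrent.futures.Future.

```python
import asyncio
import threading


async def greet(name):
    await asyncio.sleep(0)
    return f'hello, {name}'


def worker(loop, results):
    """Runs in a plain thread: hands a coroutine to the loop and waits."""
    future = asyncio.run_coroutine_threadsafe(greet('thread'), loop)
    results.append(future.result(timeout=5))


async def main():
    loop = asyncio.get_running_loop()
    results = []
    thread = threading.Thread(target=worker, args=(loop, results))
    thread.start()
    # Join in the default executor so the event loop stays free to run greet().
    await loop.run_in_executor(None, thread.join)
    return results


if __name__ == '__main__':
    print(asyncio.run(main()))
```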

Happy hacking!

Random Strings and Integers That Actually Aren’t

2020-07-02T00:00:00+00:00

Image from Flickr user fsse8info

Recently the topic of generating random-looking coupon codes and other strings came up on internal chat. My go-to for something like that is always this solution based on Feistel networks, which I didn’t think was terribly obscure. But I was surprised when nobody else seemed to recognize it, so maybe it is. In any case here’s a little illustration of the thing in action.

Feistel networks are the mathematical basis of the ciphers behind DES and other encryption algorithms. I won’t go into details (because that would suggest I fully understand it, and there are bits where I’m hazy) but ultimately it’s a somewhat simple and very fast mechanism that’s fairly effective for our uses here.

For string generation we have two parts. For the first part we take an integer, say the sequentially generated id primary key field in the database, and run it through a function that turns it into some other random-looking integer. Our implementation of the function has an interesting property: If you take that random-looking integer and run it back through the same function, you get the original integer back out. In other words…

cipher(cipher(n)) == n

…for any integer value of n. That one-to-one mapping essentially guarantees that the random-looking output is actually unique across the integer space. In other words, we can be sure there will be no collisions once we get to the string-making part.

The original function is based off the code on the PostgreSQL wiki with just a few alterations for clarity, and should work for any modern (or archaic) version of Postgres.

CREATE OR REPLACE FUNCTION public.feistel_crypt(value integer)
  RETURNS integer
  LANGUAGE plpgsql
  IMMUTABLE STRICT
AS $function$
DECLARE
    key numeric;
    l1 int;
    l2 int;
    r1 int;
    r2 int;
    i int:=0;
BEGIN
    l1:= (VALUE >> 16) & 65535;
    r1:= VALUE & 65535;
    WHILE i < 3 LOOP
        -- key can be any function that returns numeric between 0 and 1
        key := (((1366 * r1 + 150889) % 714025) / 714025.0);
        l2 := r1;
        r2 := l1 # (key * 32767)::int;
        l1 := l2;
        r1 := r2;
        i := i + 1;
    END LOOP;
    RETURN ((r1 << 16) + l1);
END;
$function$;

Swap what’s assigned to that key variable around a bit, just so you get a different output than what I’m illustrating. No good, after all, if someone can take this example verbatim and generate your coupon codes. Also once you start using the generated numbers, one way or another, you probably don’t want to change that key function as that would introduce the possibility of collisions with existing values generated with the previous key.

Anyway, with that in place, you can start generating some random integers, and make sure they map back:

totesrandom=# SELECT feistel_crypt(1), feistel_crypt(2), feistel_crypt(3), feistel_crypt(4);
 feistel_crypt | feistel_crypt | feistel_crypt | feistel_crypt
---------------+---------------+---------------+---------------
     561465857 |     436885871 |     576481439 |     483424269
(1 row)

totesrandom=# SELECT feistel_crypt(561465857), feistel_crypt(436885871), feistel_crypt(576481439), feistel_crypt(483424269);
 feistel_crypt | feistel_crypt | feistel_crypt | feistel_crypt
---------------+---------------+---------------+---------------
             1 |             2 |             3 |             4
(1 row)

In fact we can run a verification across, say, 10 million integers:

totesrandom=# SELECT COUNT(*) FROM generate_series (1,10000000) WHERE feistel_crypt(feistel_crypt(generate_series)) != generate_series;
 count
-------
     0
(1 row)

Time: 185151.416 ms (03:05.151)

The cool part: string generation

Once we have that new value, the second part is even easier. We take the new integer and map that to a string, essentially creating a base-N representation of the number.

CREATE OR REPLACE FUNCTION public.int_to_string(n int)
  RETURNS text
  LANGUAGE plpgsql
  IMMUTABLE STRICT
AS $function$
DECLARE
    alphabet text:='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    base int:=length(alphabet);
    output text:='';
BEGIN
    LOOP
        output := output || substr(alphabet, 1+(n%base)::int, 1);
        n := n / base;
        EXIT WHEN n=0;
    END LOOP;
    RETURN output;
END $function$;

Voilà, short random-looking strings you can use for coupon codes, email confirmation tokens, whatever you need:

totesrandom=# SELECT int_to_string(feistel_crypt(1)), int_to_string(feistel_crypt(2)), int_to_string(feistel_crypt(3)), int_to_string(feistel_crypt(4));
 int_to_string | int_to_string | int_to_string | int_to_string
---------------+---------------+---------------+---------------
 5409L         | t8hJD         | Tj1aN         | NTySG
(1 row)

Time: 0.473 ms

You can tune that character set as needed, of course. Maybe jumble it up a bit if you’re super paranoid about someone reverse engineering it. Your character set could be anything you wanted. A purely emoji set could be fun, or perhaps set it to an array of words to concatenate together instead of individual letters. Or if there’s a chance someone could be reading one of these out loud, over a phone call for instance, you might want to go with a single case:

totesrandom=# -- alphabet in above function instead set to 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
totesrandom=# SELECT int_to_string(feistel_crypt(1)), int_to_string(feistel_crypt(2)), int_to_string(feistel_crypt(3)), int_to_string(feistel_crypt(4));
 int_to_string | int_to_string | int_to_string | int_to_string
---------------+---------------+---------------+---------------
 33FKKJ        | XK9DIH        | L79HTJ        | 7TQ39H
(1 row)

Time: 0.681 ms

It’s also certainly possible to reverse this part, too, and read the string back into the original integer. But I’d instead recommend stashing the resulting string into a database field, and doing a look-up on that directly.
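For the curious, here is a rough sketch of that reversal in Python rather than PL/pgSQL. It assumes non-negative integers and the same alphabet as above; string_to_int is my own name for the inverse, and int_to_string here is just a Python mirror of the SQL function.

```python
ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'


def int_to_string(n, alphabet=ALPHABET):
    """Mirror of the SQL function: least significant digit comes first."""
    base = len(alphabet)
    output = ''
    while True:
        output += alphabet[n % base]
        n //= base
        if n == 0:
            return output


def string_to_int(code, alphabet=ALPHABET):
    """Inverse of int_to_string: rebuild the integer from its digits."""
    base = len(alphabet)
    n = 0
    # The last character is the most significant digit, so walk backwards.
    for char in reversed(code):
        n = n * base + alphabet.index(char)
    return n


print(int_to_string(561465857))  # 5409L, matching the psql output above
print(string_to_int('5409L'))    # 561465857
```

Running the recovered integer through feistel_crypt once more would then give back the original id.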

Bonus

At some point I ended up porting this to Python. It’s still super simple, and works just the same. But maybe seeing it in another form will help you port it to whatever other language you might need it for.

def simple_feistel(value):
    # A simple self-inverse Feistel cipher for ID obfuscation
    l1 = (value >> 16) & 65535
    r1 = value & 65535

    for i in range(3):
        key = (((1366 * r1 + 150889) % 714025) / 714025.0)
        l2 = r1
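        # round(), not int(): PostgreSQL's ::int cast rounds, so this
        # reproduces the values the SQL function returns.
        r2 = l1 ^ round(key * 32767)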
        l1 = l2
        r1 = r2
    return (r1 << 16) + l1

def stringify_integer(value):
    # Take an integer and encode it as a base(len(alphabet)) string

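    alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    base = len(alphabet)
    output = ''

    # Always emit at least one digit so that 0 encodes as 'a',
    # just like the LOOP ... EXIT WHEN n=0 in the SQL version.
    while True:
        output += alphabet[value % base]
        value //= base
        if value == 0:
            break

    return output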

Implementing SummAE neural text summarization with a denoising auto-encoder

2020-05-28T00:00:00+00:00

If there’s any problem space in machine learning with no shortage of (unlabelled) data to train on, it’s natural language processing (NLP).

In this article, I’d like to take on the challenge of taking a paper that came from Google Research in late 2019 and implementing it. It’s going to be a fun trip into the world of neural text summarization. We’re going to go through the basics, the coding, and then we’ll look at what the results actually are in the end.

The paper we’re going to implement here is: Peter J. Liu, Yu-An Chung, Jie Ren (2019) SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders.

Here’s the paper’s abstract:

We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training, evaluation code, and best model weights are open-sourced.

Preliminaries

Before we go any further, let’s talk a little bit about neural summarization in general. There are two main approaches to it: extractive and abstractive.

The first, extractive approach makes the model “focus” on the most important parts of the longer text, extracting them to form a summary.

Let’s take a recent article, “Shopify Admin API: Importing Products in Bulk”, by one of my great co-workers, Patrick Lewis, as an example and see what the extractive summarization would look like. Let’s take the first two paragraphs:

I recently worked on an interesting project for a store owner who was facing a daunting task: he had an inventory of hundreds of thousands of Magic: The Gathering (MTG) cards that he wanted to sell online through his Shopify store. The logistics of tracking down artwork and current market pricing for each card made it impossible to do manually.

My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card in Shopify. The resulting project turned what would have been a months- or years-long task into a bulk upload that only took a few hours to complete and allowed the store owner to immediately start selling his inventory online. The online store launch turned out to be even more important than initially expected due to current closures of physical stores.

An extractive model could summarize it as follows:

I recently worked on an interesting project for a store owner who had an inventory of hundreds of thousands of cards that he wanted to sell through his store. The logistics and current pricing for each card made it impossible to do manually. My solution was to create a custom Rails application that retrieves inventory data from a combination of APIs and then automatically creates products for each card. The store launch turned out to be even more important than expected due to current closures of physical stores.

See how it does the copying and pasting? The big advantage of these types of models is that they are generally easier to create and the resulting summaries tend to faithfully reflect the facts included in the source.

The downside though is that it’s not how a human would do it. We do a lot of paraphrasing, for instance. We use different words and tend to form sentences less rigidly following the original ones. The need for the summaries to feel more natural made the second type — abstractive — into this subfield’s holy grail.

Datasets

The paper’s authors used the so-called “ROCStories” dataset (“Tackling The Story Ending Biases in The Story Cloze Test”. Rishi Sharma, James Allen, Omid Bakhshandeh, Nasrin Mostafazadeh. In Proceedings of the 2018 Conference of the Association for Computational Linguistics (ACL), 2018).

In my experiments, I’ve also tried the model against one that’s quite a bit more difficult: WikiHow (Mahnaz Koupaee, William Yang Wang (2018) WikiHow: A Large Scale Text Summarization Dataset).

ROCStories

The dataset consists of 98162 stories, each one consisting of 5 sentences. It’s incredibly clean. The only step I needed to take was to split the stories between the train, eval, and test sets.

Examples of stories:

Example 1:

My retired coworker turned 69 in July. I went net surfing to get her a gift. She loves Diana Ross. I got two newly released cds and mailed them to her. She sent me an email thanking me.

Example 2:

Tom alerted the government he expected a guest. When she didn’t come he got in a lot of trouble. They talked about revoking his doctor’s license. And charging him a huge fee! Tom’s life was destroyed because of his act of kindness.

Example 3:

I went to see the doctor when I knew it was bad. I hadn’t eaten in nearly a week. I told him I felt afraid of food in my body. He told me I was developing an eating disorder. He instructed me to get some help.

WikiHow

This is one of the most challenging openly available datasets for neural summarization. It consists of more than 200,000 long-sequence pairs of text + headline scraped from WikiHow’s website.

Some examples:

Text:

One easy way to conserve water is to cut down on your shower time. Practice cutting your showers down to 10 minutes, then 7, then 5. Challenge yourself to take a shorter shower every day. Washing machines take up a lot of water and electricity, so running a cycle for a couple of articles of clothing is inefficient. Hold off on laundry until you can fill the machine. Avoid letting the water run while you’re brushing your teeth or shaving. Keep your hoses and faucets turned off as much as possible. When you need them, use them sparingly.

Headline:

Take quicker showers to conserve water. Wait for a full load of clothing before running a washing machine. Turn off the water when you’re not using it.

The main challenge for the summarization model here is that the headline was actually created by humans and is not just “extracting” anything. Any model performing well on this dataset actually needs to model the language pretty well. The headline can still be used for computing the evaluation metrics, but it’s pretty clear that traditional metrics like ROUGE, which score word overlap with the reference, are bound to miss the point here.

Basics of the sequence-to-sequence modeling

Most sequence-to-sequence models are based on the “next token prediction” workflow.

The general idea can be expressed with P(token | context) — where the task is to model this conditional probability distribution. The “context” here depends on the approach.

Those models are also called “auto-regressive” because they need to consume their own predictions from previous steps during the inference:

predict(["<start>"], context)
# "I"
predict(["<start>", "I"], context)
# "love"
predict(["<start>", "I", "love"], context)
# "biking"
predict(["<start>", "I", "love", "biking"], context)
# "<end>"
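The loop behind those calls can be sketched in a few lines of Python. Everything here is illustrative: generate is a hypothetical helper, and toy_predict is a lookup table standing in for a trained model.

```python
def generate(predict, context, max_len=20):
    """Greedy auto-regressive loop: feed the model its own output so far."""
    tokens = ['<start>']
    while tokens[-1] != '<end>' and len(tokens) < max_len:
        tokens.append(predict(tokens, context))
    return tokens


# Toy stand-in for a trained model: picks the next token from the last one.
TOY_TABLE = {'<start>': 'I', 'I': 'love', 'love': 'biking', 'biking': '<end>'}


def toy_predict(tokens, context):
    return TOY_TABLE[tokens[-1]]


print(generate(toy_predict, context=None))
```

The max_len cap is there because nothing guarantees that a real model ever emits the end token.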

Naively simple modeling: Markov Model

In this model, the approach is to take on a bold assumption: that the probability of the next token is conditioned only on the previous token.

The Markov Model is elegantly introduced in the blog post Next Word Prediction using Markov Model.

Why is it naive? Because we know that the probability of the word “love” depends on the word “I” given a broader context. A model that’s always going to output “roses” would miss the best word more often than not.
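To make this concrete, here is a minimal sketch of such a model built from bigram counts. The three-sentence corpus is made up, and the function names are my own.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ['<start>'] + sentence.split() + ['<end>']
        for previous, current in zip(tokens, tokens[1:]):
            counts[previous][current] += 1
    return counts


def next_token_probabilities(counts, previous):
    """Estimate P(token | previous) from the raw counts."""
    total = sum(counts[previous].values())
    return {token: n / total for token, n in counts[previous].items()}


corpus = ['I love roses', 'I love roses', 'I love biking']  # made-up data
counts = train_bigrams(corpus)
# After "love", "roses" comes out twice as likely as "biking",
# no matter what the rest of the sentence was about.
print(next_token_probabilities(counts, 'love'))
```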

Modeling with neural networks

Usually, sequence-to-sequence neural network models consist of two parts:

encoder
decoder

The encoder is there to build a “gist” representation of the input sequence. The gist and the previous token become our “context” to do the inference. This fits in well within the P(token | context) modeling I described above. That distribution can be expressed more clearly as P(token | previous; gist).

There are other approaches too. One of them is ProphetNet (Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou (2020) ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training). The difference in that approach is that it predicts n tokens ahead at once.

Teacher-forcing

Let’s see how we could go about teaching the model about the next token’s conditional distribution.

Imagine that the model’s parameters aren’t performing well yet. We have an input sequence of: ["<start>", "I", "love", "biking", "during", "the", "summer", "<end>"]. We’re training the model giving it the first token:

model(["<start>"], context)
# "I"

Great, now let’s ask it for another one:

model(["<start>", "I"], context)
# "wonder"

Hmmm that’s not what we wanted, but let’s naively continue:

model(["<start>", "I", "wonder"], context)
# "why"

We could continue gathering predictions and compute the loss at the end. The loss would really only be able to tell it about the first mistake (“love” vs. “wonder”); the rest of the errors would just accumulate from here. This would hinder the learning considerably, adding in the noise from the accumulated errors.

There’s a better approach called Teacher Forcing. In this approach, you’re telling the model the true answer after each of its guesses. The last example would look like the following:

model(["<start>", "I", "love"], context)
# "watching"

You’d continue the process, feeding it the full input sequence and the loss term would be computed based on all its guesses.
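Here’s a rough sketch of the idea in plain Python. It counts mistakes instead of computing a real loss, and both helper names are hypothetical. The key point is that every prefix is built from the true sequence, never from the model’s own guesses.

```python
def teacher_forcing_pairs(sequence):
    """Every prefix of the true sequence, paired with the true next token."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def count_mistakes(model, sequence, context=None):
    """Score each guess independently of the guesses made before it."""
    mistakes = 0
    for prefix, target in teacher_forcing_pairs(sequence):
        # The prefix always holds the true tokens, so one wrong guess
        # does not derail the inputs for the following steps.
        if model(prefix, context) != target:
            mistakes += 1
    return mistakes


sequence = ['<start>', 'I', 'love', 'biking', '<end>']
for prefix, target in teacher_forcing_pairs(sequence):
    print(prefix, '->', target)
```

In real training the per-step comparison would be a loss term (such as cross-entropy) rather than a counter, but the data flow is the same.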

Compute-friendly representation for tokens and gists

Some of the readers might want to skip this section. I’d like to describe quickly here the concept of the latent space and vector embeddings. This is to keep the matters relatively palatable for the broader audience.

Representing words naively

How do we turn the words (strings) into numbers that we input into our machine learning models? A software developer might think about assigning each word a unique integer. This works well for databases but in machine learning models, the fact that integers follow one another means that they encode a relation (which one follows which and in what distance). This doesn’t work well for almost any problem in data science.

Traditionally, the problem is solved by “one-hot encoding”. This means that we’re turning our integers into vectors, where each value is zero except the one at the index that equals the value to encode (or that value minus one, if your classes are numbered from one and your programming language uses zero-based indexing). Example, with classes numbered from zero: 3 => [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] when the total number of “integers” (classes) to encode is 10.

This is better as it breaks the ordering and distancing assumptions. It doesn’t encode anything about the words, though, except the arbitrary number we’ve decided to assign to them. We now don’t have the ordering but we also don’t have any distance. Empirically though we just know that the word “love” is much closer to “enjoy” than it is to “helicopter”.
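A quick sketch shows both properties at once. The three-word vocabulary is purely illustrative; note how every pair of distinct words ends up at exactly the same distance.

```python
def one_hot(index, size):
    """Vector of zeros with a single one at the given zero-based index."""
    vector = [0] * size
    vector[index] = 1
    return vector


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


print(one_hot(3, 10))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

vocabulary = ['love', 'enjoy', 'helicopter']  # illustrative only
vectors = {word: one_hot(i, len(vocabulary)) for i, word in enumerate(vocabulary)}

# Both distances are 2: one-hot vectors say nothing about meaning.
print(squared_distance(vectors['love'], vectors['enjoy']))
print(squared_distance(vectors['love'], vectors['helicopter']))
```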

A better approach: word embeddings

How could we keep our vector representation (as in one-hot encoding) but also introduce the distance? I’ve already touched on this concept in my post about the simple recommender system. The idea is to have a vector of floating-point values for each word so that the closer two words are in their meaning, the smaller the angle is between their vectors. We can easily compute a metric following this logic by measuring the cosine distance. This way, the word representations are easy to feed into the encoder, and they already contain a lot of the information in themselves.
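As a small illustration, here is the cosine similarity (the cosine distance is one minus this value) computed over made-up three-dimensional vectors. Real embeddings are learned and have hundreds of dimensions; these numbers are invented just to show the mechanics.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings, chosen so that similar words point the same way.
embeddings = {
    'love': [0.9, 0.8, 0.1],
    'enjoy': [0.8, 0.9, 0.2],
    'helicopter': [0.1, 0.2, 0.9],
}

print(cosine_similarity(embeddings['love'], embeddings['enjoy']))
print(cosine_similarity(embeddings['love'], embeddings['helicopter']))
```

With these vectors, “love” scores much closer to “enjoy” than to “helicopter”, which is exactly the property one-hot vectors lack.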

Not only words

Can we only have vectors for words? Couldn’t we have vectors for paragraphs, so that the closer they are in their meaning, the smaller some vector space metric is between them? Of course we can. This is, in fact, what will allow us in this article’s model to encode the “gist” that we talked about. The “encoder” part of the model is going to learn the most convenient way of turning the input sequence into a vector of floating-point numbers.

Auto-encoders

We’re slowly approaching the model from the paper. We still have one concept that’s vital to understand in order to get why the model is going to work.

Up until now, we talked about the following structure of the typical sequence-to-sequence neural network model:

This is true e.g. for translation models where the input sequence is in English and the output is in Greek. It’s also true for this article’s model during the inference.

What if we made the input and the output the same sequence? We’d turn the model into a so-called auto-encoder.

The output of course isn’t all that useful — we already know what the input sequence is. The true value is in the model’s ability to encode the input into a gist.

Adding the noise

A very interesting type of an auto-encoder is the denoising auto-encoder. The idea is that the input sequence gets randomly corrupted and the network learns to still produce a good gist and reconstruct the sequence before it got corrupted. This makes the training “teach” the network about the deeper connections in the data, instead of just “memorizing” as much as it can.

The SummAE model

We’re now ready to talk about the architecture from the paper. Given what we’ve already learned, this is going to be very simple. The SummAE model is just a denoising auto-encoder that is being trained in a special way.

Auto-encoding paragraphs and sentences

The authors were training the model on both single sentences and full paragraphs. In all cases the task was to reproduce the uncorrupted input.

The first part of the approach is about having two special “start tokens” to signal the mode: paragraph vs. sentence. In my code, I’ve used “<start-full>” and “<start-short>”.

During the training, the model learns the conditional distributions given those two tokens and the ones that follow, for any given token in the sequence.

Adding the noise

The sentences are simply concatenated to form a paragraph. The input then gets corrupted at random by means of:

masking the input tokens
shuffling the order of the sentences within the paragraph

The authors are claiming that the latter helped them in solving the issue of the network just memorizing the first sentence. What I have found though is that this model is generally prone towards memorizing concrete sentences from the paragraph. Sometimes it’s the first, and sometimes it’s some of the others. I’ve found this true even when adding a lot of noise to the input.
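To illustrate the two kinds of noise, here is a token-level sketch in plain Python. It is not the code from my repository, which masks at the embedding level inside the model; the "<mask>" placeholder and the function names are assumptions made for this example.

```python
import random


def mask_tokens(tokens, probability, rng, mask='<mask>'):
    """Replace each token with a placeholder with the given probability."""
    return [mask if rng.random() < probability else token for token in tokens]


def shuffle_sentences(sentences, rng):
    """Return the sentences in a random order, leaving the input untouched."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def corrupt_paragraph(sentences, probability, rng):
    """Shuffle the sentences, concatenate them, then mask random tokens."""
    tokens = []
    for sentence in shuffle_sentences(sentences, rng):
        tokens.extend(sentence.split())
    return mask_tokens(tokens, probability, rng)


rng = random.Random(0)
story = ['She loves Diana Ross.', 'I got two newly released cds.']
print(corrupt_paragraph(story, probability=0.3, rng=rng))
```

The training target stays the clean, unshuffled paragraph, so the network has to undo both kinds of corruption.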

The code

The full PyTorch implementation described in this blog post is available at https://github.com/kamilc/neural-text-summarization. You may find some of its parts less clean than others — it’s a work in progress. Specifically, the data download is almost left out.

You can find the WikiHow preprocessing in a notebook in the repository. For the ROCStories, I just downloaded the CSV files and concatenated them with Unix cat. There’s an additional process.py file generated from a very simple IPython session.

Let’s have a very brief look at some of the most interesting parts of the code:

class SummarizeNet(NNModel):
    def encode(self, embeddings, lengths):
        ...  # body elided

    def decode(self, embeddings, encoded, lengths, modes):
        ...  # body elided

    def forward(self, embeddings, clean_embeddings, lengths, modes):
        ...  # body elided

    def predict(self, vocabulary, embeddings, lengths):
        ...  # body elided

You can notice separate methods for forward and predict. I chose the Transformer over the recurrent neural networks for both the encoder part and the decoder. The PyTorch implementation of the transformer decoder part already includes the teacher forcing in the forward method. This makes it convenient at the training time — to just feed it the full, uncorrupted sequence of embeddings as the “target”. During the inference we need to do the “auto-regressive” part by hand though. This means feeding the previous predictions in a loop — hence the need for two distinct methods here.

def forward(self, embeddings, clean_embeddings, lengths, modes):
    noisy_embeddings = self.mask_dropout(embeddings, lengths)

    encoded = self.encode(noisy_embeddings[:, 1:, :], lengths-1)
    decoded = self.decode(clean_embeddings, encoded, lengths, modes)

    return (
        decoded,
        encoded
    )

You can notice that I’m doing the token masking at the model level during the training. The code also shows cleanly the structure of this seq2seq model — with the encoder and the decoder.

The encoder part looks simple as long as you’re familiar with the transformers:

def encode(self, embeddings, lengths):
    batch_size, seq_len, _ = embeddings.shape

    embeddings = self.encode_positions(embeddings)

    paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=0).expand((batch_size, seq_len)).to(self.device)
    paddings_mask = (paddings_mask + 1) > lengths.unsqueeze(dim=1).expand((batch_size, seq_len))

    encoded = embeddings.transpose(1,0)

    for ix, encoder in enumerate(self.encoders):
        encoded = encoder(encoded, src_key_padding_mask=paddings_mask)
        encoded = self.encode_batch_norms[ix](encoded.transpose(2,1)).transpose(2,1)

    last_encoded = encoded

    encoded = self.pool_encoded(encoded, lengths)

    encoded = self.to_hidden(encoded)

    return encoded

We’re first encoding the positions as in the “Attention Is All You Need” paper and then feeding the embeddings into a stack of the encoder layers. At the end, we’re morphing the tensor to have the final dimension equal the number given as the model’s parameter.
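The two paddings_mask lines are probably the least obvious part, so here is the same logic as a plain-Python sketch over nested lists instead of tensors (the function name is mine). True marks a padding position that the attention should ignore.

```python
def padding_mask(lengths, seq_len):
    """Mirror of (position + 1) > length for every row of the batch."""
    return [
        [position + 1 > length for position in range(seq_len)]
        for length in lengths
    ]


# A batch of two sequences padded to 4 positions, with true lengths 2 and 4.
for row in padding_mask([2, 4], seq_len=4):
    print(row)
# [False, False, True, True]
# [False, False, False, False]
```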

The decoder sits on PyTorch’s shoulders too:

def decode(self, embeddings, encoded, lengths, modes):
    batch_size, seq_len, _ = embeddings.shape

    embeddings = self.encode_positions(embeddings)

    mask = self.mask_for(embeddings)

    encoded = self.from_hidden(encoded)
    encoded = encoded.unsqueeze(dim=0).expand(seq_len, batch_size, -1)

    decoded = embeddings.transpose(1,0)
    decoded = torch.cat(
        [
            encoded,
            decoded
        ],
        axis=2
    )
    decoded = self.combine_decoded(decoded)
    decoded = self.combine_batch_norm(decoded.transpose(2,1)).transpose(2,1)

    paddings_mask = torch.arange(end=seq_len).unsqueeze(dim=0).expand((batch_size, seq_len)).to(self.device)
    paddings_mask = paddings_mask > lengths.unsqueeze(dim=1).expand((batch_size, seq_len))

    for ix, decoder in enumerate(self.decoders):
        decoded = decoder(
            decoded,
            torch.ones_like(decoded),
            tgt_mask=mask,
            tgt_key_padding_mask=paddings_mask
        )
        decoded = self.decode_batch_norms[ix](decoded.transpose(2,1)).transpose(2,1)

    decoded = decoded.transpose(1,0)

    return self.linear_logits(decoded)

You can notice that I’m combining the gist received from the encoder with each word embedding, as this is how it was described in the paper.

The predict is very similar to forward:

def predict(self, vocabulary, embeddings, lengths):
    """
    Caller should include the start and end tokens here
    but we’re going to ensure the start one is replaced by <start-short>
    """
    previous_mode = self.training

    self.eval()

    batch_size, _, _ = embeddings.shape

    results = []

    for row in range(0, batch_size):
        row_embeddings = embeddings[row, :, :].unsqueeze(dim=0)
        row_embeddings[0, 0] = vocabulary.token_vector("<start-short>")

        encoded = self.encode(
            row_embeddings[:, 1:, :],
            lengths[row].unsqueeze(dim=0)
        )

        results.append(
            self.decode_prediction(
                vocabulary,
                encoded,
                lengths[row].unsqueeze(dim=0)
            )
        )

    self.train(previous_mode)  # restores the mode on all submodules too

    return results

The workhorse behind the decoding at the inference time looks as follows:

def decode_prediction(self, vocabulary, encoded1xH, lengths1x):
    tokens = ['<start-short>']
    last_token = None
    seq_len = 1

    encoded1xH = self.from_hidden(encoded1xH)

    while last_token != '<end>' and seq_len < 50:
        embeddings1xSxD = vocabulary.embed(tokens).unsqueeze(dim=0).to(self.device)
        embeddings1xSxD = self.encode_positions(embeddings1xSxD)

        maskSxS = self.mask_for(embeddings1xSxD)

        encodedSx1xH = encoded1xH.unsqueeze(dim=0).expand(seq_len, 1, -1)

        decodedSx1xD = embeddings1xSxD.transpose(1,0)
        decodedSx1xD = torch.cat(
            [
                encodedSx1xH,
                decodedSx1xD
            ],
            axis=2
        )
        decodedSx1xD = self.combine_decoded(decodedSx1xD)
        decodedSx1xD = self.combine_batch_norm(decodedSx1xD.transpose(2,1)).transpose(2,1)

        for ix, decoder in enumerate(self.decoders):
            decodedSx1xD = decoder(
                decodedSx1xD,
                torch.ones_like(decodedSx1xD),
                tgt_mask=maskSxS,
            )
            decodedSx1xD = self.decode_batch_norms[ix](decodedSx1xD.transpose(2,1))
            decodedSx1xD = decodedSx1xD.transpose(2,1)

        decoded1x1xD = decodedSx1xD.transpose(1,0)[:, (seq_len-1):seq_len, :]
        decoded1x1xV = self.linear_logits(decoded1x1xD)

        word_id = F.softmax(decoded1x1xV[0, 0, :], dim=-1).argmax().cpu().item()
        last_token = vocabulary.words[word_id]
        tokens.append(last_token)
        seq_len += 1

    return ' '.join(tokens[1:])

Notice how it starts with the “start short” token and then loops: it gets a prediction, appends it to the sequence, and feeds the sequence back in until the “end” token appears or the sequence reaches the cap of 50 tokens.
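Stripped of the tensor handling, the loop is plain greedy decoding. Here is a minimal sketch of just that control flow. The names greedy_decode and toy_next_token are made up for illustration, and the toy function merely stands in for the network by replaying a canned sentence:

```python
from typing import Callable, List


def greedy_decode(
    next_token: Callable[[List[str]], str],
    start_token: str = "<start-short>",
    end_token: str = "<end>",
    max_len: int = 50,
) -> str:
    """Feed the growing token list back in until the end token or the length cap."""
    tokens = [start_token]
    while tokens[-1] != end_token and len(tokens) < max_len:
        tokens.append(next_token(tokens))
    # Drop the start token, mirroring the join at the end of decode_prediction
    return " ".join(tokens[1:])


# Hypothetical stand-in for the model: replays a canned sentence token by token
CANNED = ["lynn", "has", "an", "affair", ".", "<end>"]


def toy_next_token(tokens: List[str]) -> str:
    return CANNED[min(len(tokens) - 1, len(CANNED) - 1)]


summary = greedy_decode(toy_next_token)
```

In the real method, next_token corresponds to running the decoder stack over all tokens generated so far and taking the argmax over the vocabulary for the last position.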

Again, the model is very, very simple. What makes the difference is how it’s being trained — it’s all in the training data corruption and the model pre-training.

It’s already a long article so I encourage the curious readers to look at the code at my GitHub repo for more details.

My experiment with the WikiHow dataset

In my WikiHow experiment I wanted to see how the results look if I fed the full articles and their headlines for the two modes of the network. The same data-corruption regime was used in this case.

Some of the results were looking almost good:

Text:

for a savory flavor, mix in 1/2 teaspoon ground cumin, ground turmeric, or masala powder.this works best when added to the traditional salty lassi. for a flavorful addition to the traditional sweet lassi, add 1/2 teaspoon of ground cardamom powder or ginger, for some kick. , start with a traditional sweet lassi and blend in some of your favorite fruits. consider mixing in strawberries, papaya, bananas, or coconut.try chopping and freezing the fruit before blending it into the lassi. this will make your drink colder and frothier. , while most lassi drinks are yogurt based, you can swap out the yogurt and water or milk for coconut milk. this will give a slightly tropical flavor to the drink. or you could flavor the lassi with rose water syrup, vanilla extract, or honey.don’t choose too many flavors or they could make the drink too sweet. if you stick to one or two flavors, they’ll be more pronounced. , top your lassi with any of the following for extra flavor and a more polished look: chopped pistachios sprigs of mint sprinkle of turmeric or cumin chopped almonds fruit sliver

Headline:

add a spice., blend in a fruit., flavor with a syrup or milk., garnish.

Predicted summary:

blend vanilla in a sweeter flavor . , add a sugary fruit . , do a spicy twist . eat with dessert . , revise .

It’s not 100% faithful to the original text even though it seems to “read” well.

My suspicion is that pre-training against a much larger corpus of text would help. The network clearly lacks the specific knowledge it would need to summarize texts like this one better. Here’s another such example:

Text:

the settings app looks like a gray gear icon on your iphone’s home screen.; , this option is listed next to a blue “a” icon below general. , this option will be at the bottom of the display & brightness menu. , the right-hand side of the slider will give you bigger font size in all menus and apps that support dynamic type, including the mail app. you can preview the corresponding text size by looking at the menu texts located above and below the text size slider. , the left-hand side of the slider will make all dynamic type text smaller, including all menus and mailboxes in the mail app. , tap the back button twice in the upper-left corner of your screen. it will save your text size settings and take you back to your settings menu. , this option is listed next to a gray gear icon above display & brightness. , it’s halfway through the general menu. ,, the switch will turn green. the text size slider below the switch will allow for even bigger fonts. , the text size in all menus and apps that support dynamic type will increase as you go towards the right-hand side of the slider. this is the largest text size you can get on an iphone. , it will save your settings.

Headline:

open your iphone’s settings., scroll down and tap display & brightness., tap text size., tap and drag the slider to the right for bigger text., tap and drag the slider to the left for smaller text., go back to the settings menu., tap general., tap accessibility., tap larger text. , slide the larger accessibility sizes switch to on position., tap and drag the slider to the right., tap the back button in the upper-left corner.

Predicted summary:

open your iphone ’s settings . , tap general . , scroll down and tap accessibility . , tap larger accessibility . , tap and larger text for the iphone to highlight the text you want to close . , tap the larger text - colored contacts app .

It might be interesting to train against this dataset again while:

utilizing some pre-trained, large scale model as part of the encoder
using a large corpus of text to still pre-train the auto-encoder

This could possibly take a lot of time to train on my GPU (even with the pre-trained part of the encoder). I didn’t follow the idea further at this time.

The problem with getting paragraphs when we want the sentences

One of the biggest problems the authors ran into was with the decoder outputting the long version of the text, even though it was asked for the sentence-long summary.

The authors called this phenomenon the “segregation issue”. What they found was that the encoder was mapping paragraphs and sentences into completely separate regions of the latent space. The solution to this problem was to trick the encoder into making both representations indistinguishable. The following figure comes from the paper and shows the issue visualized:

Better gists by using the “critic”

The idea of a “critic” has been popularized along with the fantastic results produced by some of the Generative Adversarial Networks. The general workflow is to have the main network generate output while the other tries to guess some of its properties.

For GANs that are generating realistic photos, the critic is there to guess if the photo was generated or if it’s real. A loss term is added based on how well it’s doing, penalizing the main network for generating photos that the critic is able to call out as fake.

A similar idea was used in the A3C algorithm I blogged about (Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm). The “critic” part penalized the AI agent for taking steps that were on average less advantageous.

Here, in the SummAE model, the critic adds a penalty to the loss that grows with how well it’s able to guess whether the gist comes from a paragraph or a sentence.

Training with the critic might get tricky. What I’ve found to be the cleanest way is to use two different optimizers — one updating the main network’s parameters while the other updates the critic itself:

for batch in batches:
    if mode == "train":
        self.model.train()
        self.discriminator.train()
    else:
        self.model.eval()
        self.discriminator.eval()

    self.optimizer.zero_grad()
    self.discriminator_optimizer.zero_grad()

    logits, state = self.model(
        batch.word_embeddings.to(self.device),
        batch.clean_word_embeddings.to(self.device),
        batch.lengths.to(self.device),
        batch.mode.to(self.device)
    )

    mode_probs_disc = self.discriminator(state.detach())
    mode_probs = self.discriminator(state)

    discriminator_loss = F.binary_cross_entropy(
        mode_probs_disc,
        batch.mode.to(self.device)
    )

    discriminator_loss.backward(retain_graph=True)

    if mode == "train":
        self.discriminator_optimizer.step()

    text = batch.text.copy()

    if self.no_period_trick:
        text = [txt.replace('.', '') for txt in text]

    classes = self.vocabulary.encode(text, modes=batch.mode)
    classes = classes.roll(-1, dims=1)
    classes[:,classes.shape[1]-1] = 3

    model_loss = torch.tensor(0.0, device=self.device)

    if logits.shape[0:2] == classes.shape:
        model_loss = F.cross_entropy(
            logits.reshape(-1, logits.shape[2]).to(self.device),
            classes.long().reshape(-1).to(self.device),
            ignore_index=3
        )
    else:
        print("WARNING: Skipping model loss for inconsistency between logits and classes shapes")

    fooling_loss = F.binary_cross_entropy(
        mode_probs,
        torch.ones_like(batch.mode).to(self.device)
    )

    loss = model_loss + (0.1 * fooling_loss)

    loss.backward()
    if mode == "train":
        self.optimizer.step()

    self.optimizer.zero_grad()
    self.discriminator_optimizer.zero_grad()

The main idea is to treat the main network’s encoded gist as constant with respect to the updates to the critic’s parameters, and vice versa.
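Reading the two objectives straight off the loop above, they can be written down as follows. The notation is my own shorthand, not the paper’s: z is the encoded gist (state), D is the critic (discriminator), m is the batch mode, y stands for the shifted target classes, and sg denotes the stop-gradient that detach() provides.

```latex
\begin{aligned}
\mathcal{L}_{\text{critic}} &= \mathrm{BCE}\big(D(\operatorname{sg}(z)),\; m\big) \\
\mathcal{L}_{\text{main}} &= \mathrm{CE}\big(\text{logits},\; y\big) + 0.1 \cdot \mathrm{BCE}\big(D(z),\; \mathbf{1}\big)
\end{aligned}
```

The first loss is stepped only by the critic’s optimizer and the second only by the main network’s optimizer, which is what keeps each side constant with respect to the other’s update.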

Results

I’ve found some of the results look really exceptional:

Text:

lynn is unhappy in her marriage. her husband is never good to her and shows her no attention. one evening lynn tells her husband she is going out with her friends. she really goes out with a man from work and has a great time. lynn continues dating him and starts having an affair.

Predicted summary:

lynn starts dating him and has an affair .

Text:

cedric was hoping to get a big bonus at work. he had worked hard at the office all year. cedric’s boss called him into his office. cedric was disappointed when told there would be no bonus. cedric’s boss surprised cedric with a big raise instead of a bonus.

Predicted summary:

cedric had a big deal at his boss ’s office .

Some others showed how the model attends to single sentences though:

Text:

i lost my job. i was having trouble affording my necessities. i didn’t have enough money to pay rent. i searched online for money making opportunities. i discovered amazon mechanical turk.

Predicted summary:

i did n’t have enough money to pay rent .

While a sentence like this one might make a good headline, it’s definitely not the best summary, as it loses the vital parts found in the other sentences.

Final words

First of all, let me thank the paper’s authors for their exceptional work. It was a great read and great fun implementing!

Abstractive text summarization remains very difficult. The model trained for this blog post has very limited use in practice. There’s a lot of room for improvement though, which makes the future of abstractive summaries very promising.

Deploying production Machine Learning pipelines to Kubernetes with Argo

2019-06-28T00:00:00+00:00

Image by Wikimedia Commons

In some sense, most machine learning projects look exactly the same. There are 4 stages to be concerned with no matter what the project is:

1. Sourcing the data
2. Transforming it
3. Building the model
4. Deploying it

It’s often said that #1 and #2 take most of an ML engineer’s time. The point is usually to emphasize how little time the most fun part, #3, seems to get.

In the real world, though, #4 can over time take almost as much time as the previous three.

Deployed models sometimes need to be rebuilt. They consume data that need to constantly go through points #1 and #2. It certainly isn’t always what’s shown in the classroom, where datasets perfectly fit in the memory and model training takes at most a couple hours on an old laptop.

Working with gigantic datasets isn’t the only problem. Data pipelines can take long hours to complete. What if some part of your infrastructure has an unexpected downtime? Do you just start it all over again from the very beginning?

Many solutions of course exist. With this article, I’d like to go over this problem space and present an approach that feels really nice and clean.

Project description

End Point Corporation was founded in 1995. That’s 24 years! About 9 years later, the oldest article on the company’s blog was published. Since that time, a staggering number of 1435 unique articles have been published. That’s a lot of words! This is something we can definitely use in a smart way.

For the purpose of having fun with building a production-grade data pipeline, let’s imagine the following project:

A doc2vec model trained on the corpus of End Point’s blog articles
Use of the paragraph vectors for each article to find the 10 other, most similar articles

I previously blogged about using matrix factorization as a simple, collaborative-filtering style of recommender system. We can think of today’s doc2vec-based model as an example of content-based filtering. The business value would be the potentially increased blog traffic from users staying longer on the website.

Scalable pipelines

The data pipeline problem has certainly found some really great solutions. The Hadoop project brought in HDFS, a distributed file system for huge data artifacts. Its MapReduce component plays a vital role in distributed data processing.

Then, the fantastic Spark project came in. Its architecture makes data reside in memory by default—with explicit caching of the data on disks. The project claims to be running workloads 100 times faster than Hadoop.

Both projects though require the developer to use a very specific set of libraries. It’s not easy, for example, to distribute spaCy training and inference on Spark.

Containers

On the other side of the spectrum, there’s Dask. It’s a Python package that wraps NumPy, pandas and scikit-learn. It enables developers to load huge piles of data, just as they would with smaller datasets. The data is partitioned and distributed among the cluster nodes. It can work with groups of processes as well as clusters of containers. The APIs of the above-mentioned projects are (mostly) preserved while all the processing is suddenly distributed.

Some teams like to use Dask along with Luigi and build production pipelines around Docker or Kubernetes.

In this article, I’d like to present another Dask-friendly solution: Kubernetes-native workflows using Argo. What’s great about it, compared to Luigi, is that you don’t even need to care about having a certain version of Python and Luigi installed to orchestrate the pipeline. All you need is a Kubernetes cluster with Argo installed on it.

Hands-on work on the project

The first thing to do when developing this project is to get access to a Kubernetes cluster. For development, you can set up a one-node cluster using either one of:

MicroK8s
Minikube

I love them both. The first is developed by Canonical while the second by the Kubernetes team itself.

This isn’t going to be a step-by-step tutorial on using Kubernetes. I encourage you to read the documentation or possibly seek out a good online course if you don’t know anything yet. Read on even in this case though—it’s nothing that would be overly complex.

Next, you’ll need the Argo Workflows. The installation is really easy. The full yet simple documentation can be found here.

The project structure

Here’s what the project looks like in the end:

.
├── Makefile
├── notebooks
│  └── scratch.ipynb
├── notebooks.yml
├── pipeline.yaml
└── tasks
   ├── base
   │  ├── Dockerfile
   │  └── requirements.txt
   ├── build_model
   │  ├── Dockerfile
   │  └── run.py
   ├── clone_repo
   │  ├── Dockerfile
   │  └── run.sh
   ├── infer
   │  ├── Dockerfile
   │  └── run.py
   ├── notebooks
   │  └── Dockerfile
   └── preprocess
      ├── Dockerfile
      └── run.py

The main parts are as follows:

Makefile provides easy to use helpers for building images, sending them into the Docker repository and running the Argo workflow
notebooks.yml defines a Kubernetes service and deployment for an exploratory JupyterLab instance
notebooks contains individual Jupyter notebooks
pipeline.yaml defines our Machine Learning pipeline in the form of the Argo workflow
tasks contains workflow steps as containers along with their Dockerfiles
tasks/base defines the base Docker image for other tasks
tasks/**/run.(py|sh) is a single entry point for a given pipeline step

The idea is to minimize the boilerplate while retaining the features offered e.g. by Luigi.

Makefile

SHELL := /bin/bash
VERSION?=latest
TASK_IMAGES:=$(shell find tasks -name Dockerfile -printf '%h ')
REGISTRY=base:5000

tasks/%: FORCE
        set -e ;\
        docker build -t blog_pipeline_$(@F):$(VERSION) $@ ;\
        docker tag blog_pipeline_$(@F):$(VERSION) $(REGISTRY)/blog_pipeline_$(@F):$(VERSION) ;\
        docker push $(REGISTRY)/blog_pipeline_$(@F):$(VERSION)

images: $(TASK_IMAGES)

run: images
        argo submit pipeline.yaml --watch

start_notebooks:
        kubectl apply -f notebooks.yml

stop_notebooks:
        kubectl delete deployment jupyter-notebook

FORCE: ;

When you run make run with this Makefile, make first needs to resolve the images dependency. This, in turn, requires resolving all of the tasks/* targets, one for each directory that contains a Dockerfile. Notice how the TASK_IMAGES variable is constructed: it uses make’s shell function to call the Unix find command, which lists the subdirectories of tasks that contain a Dockerfile. Here’s what the output would be if you were to use it directly:

$ find tasks -name Dockerfile -printf '%h '
tasks/notebooks tasks/base tasks/preprocess tasks/infer tasks/build_model tasks/clone_repo
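For readers less familiar with find’s -printf '%h ' format, here is a rough Python equivalent of the same lookup. It’s only a sketch: task_image_dirs is a made-up helper name, the directory layout is recreated in a temporary folder, and unlike find the result is sorted.

```python
import tempfile
from pathlib import Path
from typing import List


def task_image_dirs(root: Path) -> List[str]:
    """Return every directory under root that directly contains a Dockerfile."""
    return sorted(
        p.parent.relative_to(root.parent).as_posix()
        for p in root.rglob("Dockerfile")
        if p.is_file()
    )


# Build a throwaway layout resembling the project tree and list its image directories
with tempfile.TemporaryDirectory() as tmp:
    tasks = Path(tmp) / "tasks"
    for name in ("base", "infer", "preprocess"):
        (tasks / name).mkdir(parents=True)
        (tasks / name / "Dockerfile").write_text("FROM python:3.7\n")
    (tasks / "infer" / "run.py").write_text("")
    found = task_image_dirs(tasks)
```

Each returned path doubles as a make target name, which is what lets the tasks/% pattern rule build, tag and push one image per directory.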

Setting up Jupyter Notebooks as a scratch pad and for EDA

Let’s start off by defining our base Docker image:

FROM python:3.7

COPY requirements.txt /requirements.txt
RUN pip install -r /requirements.txt

Following is the Dockerfile that extends it and adds the Jupyter Lab:

FROM endpoint-blog-pipeline/base:latest

RUN pip install jupyterlab

RUN mkdir ~/.jupyter
RUN echo "c.NotebookApp.token = ''" >> ~/.jupyter/jupyter_notebook_config.py
RUN echo "c.NotebookApp.password = ''" >> ~/.jupyter/jupyter_notebook_config.py

RUN mkdir /notebooks
WORKDIR /notebooks

The last step is to add the Kubernetes service and deployment definition in notebooks.yml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter-notebook
  labels:
    app: jupyter-notebook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter-notebook
  template:
    metadata:
      labels:
        app: jupyter-notebook
    spec:
      containers:
      - name: minimal-notebook
        image: base:5000/blog_pipeline_notebooks
        ports:
        - containerPort: 8888
        command: ["/usr/local/bin/jupyter"]
        args: ["lab", "--allow-root", "--port", "8888", "--ip", "0.0.0.0"]
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter-notebook
spec:
  type: NodePort
  selector:
    app: jupyter-notebook
  ports:
  - protocol: TCP
    nodePort: 30040
    port: 8888
    targetPort: 8888

This can be run using our Makefile with make start_notebooks or directly with:

$ kubectl apply -f notebooks.yml

Exploration

The notebook itself feels more like a scratch pad than an exploratory data analysis. You can see that it’s very informal and doesn’t include much exploration or visualization. You likely wouldn’t omit those in more real-world code.

I used it to ensure the model would work at all. I was then able to grab portions of the code and paste them directly into the step definitions.

Implementation

Step 1: Source blog articles

The blog’s articles are stored on GitHub in Markdown files.

Our first pipeline task will need to either clone the repo or pull from it if it’s present in the pipeline’s shared volume.

We’ll use the Kubernetes hostPath as the cross-step volume. What’s nice about it is that it’s easy to peek into the volume during development to see if the data artifacts are being generated correctly.

In our example here, I’m hardcoding the path on my local system:

# ...
volumes:
  - name: endpoint-blog-src
    hostPath:
      path: /home/kamil/data/endpoint-blog-src
      type: Directory
# ...

This is one of the downsides of the hostPath—it only accepts absolute paths. This will do just fine for now though.

In the pipeline.yaml we define the task container with:

# ...
templates:
  - name: clone-repo
    container:
      image: base:5000/blog_pipeline_clone_repo
      command: [bash]
      args: ["/run.sh"]
      volumeMounts:
      - mountPath: /data
        name: endpoint-blog-src
# ...

The full pipeline forms a tree, which is conveniently expressed as a directed acyclic graph within Argo. Here’s the definition of the whole pipeline (some steps were not shown yet):

# ...
  - name: article-vectors
    dag:
      tasks:
      - name: src
        template: clone-repo
      - name: dataframe
        template: preprocess
        dependencies: [src]
      - name: model
        template: build-model
        dependencies: [dataframe]
      - name: infer
        template: infer
        dependencies: [model]
# ...

Notice how the dependencies field makes it easy to tell Argo what order to take when executing the tasks. The Argo steps can also define inputs and outputs—just like Luigi. For this simple example, I decided to omit them and enforce the convention for the steps to expect data artifacts in a certain location in the mounted volume. If you’re curious about other Argo features though, here is its documentation.
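To see why the dependencies field alone is enough to fix the execution order, here is a small sketch that derives it with Python’s standard graphlib module (Python 3.9+). This only illustrates the ordering idea and says nothing about how Argo schedules the tasks internally. The dictionary simply mirrors the task names and dependencies from the DAG above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Task name -> names of the tasks it depends on, copied from the DAG definition
dependencies = {
    "src": [],
    "dataframe": ["src"],
    "model": ["dataframe"],
    "infer": ["model"],
}

# static_order() yields every task only after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
```

Because this particular pipeline is a simple chain, there is exactly one valid order. With branching dependencies, independent tasks could run in parallel.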

The entry point script for the task is pretty simple:

#!/bin/bash

cd /data

if [ -d ./blog ]
then
  cd blog
  git pull origin master
else
  git clone https://github.com/EndPointCorp/end-point-blog.git blog
fi

Step 2: Data wrangling

At this point, we have the source files for the blog articles as Markdown files. To be able to run them through any kind of machine learning modeling, we need to load them into a data frame. We’ll also need to clean the text a bit. Here is the reasoning behind the cleanup routine:

I want the relations between the articles to ignore the code snippets, so that articles aren’t grouped by programming language or library just because of the keywords they contain
I also want the metadata about the tags and authors to be omitted, as I don’t want to see e.g. only my articles listed as similar to my other ones

The full source for the run.py of the “preprocess” task can be viewed here.
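To give a flavor of what such a cleanup could look like, here is a minimal sketch. It is not the actual run.py. It assumes that the tag and author metadata live in a front matter block delimited by --- lines at the top of each file and that code snippets are fenced with triple backticks. Dropping inline code spans is an extra choice of mine. The name clean_article is hypothetical.

```python
import re

FENCE = "`" * 3  # a Markdown code fence marker

# Assumption: metadata sits in a front matter block at the very top of the file
FRONT_MATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
# A fenced block: from a line starting with a fence to the next line that is only a fence
FENCED_CODE = re.compile(rf"^{FENCE}.*?^{FENCE}[ \t]*$", re.DOTALL | re.MULTILINE)
INLINE_CODE = re.compile(r"`[^`\n]+`")


def clean_article(markdown: str) -> str:
    """Drop front matter and code so that only the prose is left."""
    text = FRONT_MATTER.sub("", markdown)
    text = FENCED_CODE.sub(" ", text)
    text = INLINE_CODE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "\n".join([
    "---",
    "author: Someone",
    "tags: python",
    "---",
    "Intro text.",
    "",
    FENCE + "python",
    "print('hi')",
    FENCE,
    "",
    "Use `pip` wisely.",
    "",
])
cleaned = clean_article(sample)
```

The cleaned string is what would go into the data frame’s text column, next to the file path that later serves as the document tag.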

Notice that unlike make or Luigi, Argo workflows run the same task fully again even if the step’s artifact has already been created. I like this flexibility. After all, it’s extremely easy to skip the processing in the Python or shell script if the artifact already exists.

At the end of this step, the data frame is written as an Apache Parquet file.

Step 3: Building the model

The model from the paper mentioned earlier has already been implemented in a variety of other projects. There are implementations for each major deep learning framework on GitHub. There’s also a pretty good one included in Gensim. Its documentation can be found here.

The run.py is pretty short and straightforward as well. This is one of the goals for the pipeline. In the end, it writes the trained model into the shared volume.

Notice that re-running the pipeline with the model already stored will not trigger the training again. This is what we want. Imagine a new article being pushed into the repository. It’s very unlikely that retraining with it would affect the model’s performance in any significant way. We’ll still need to predict the similar documents for it, though. The model building step short-circuits with:

if __name__ == '__main__':
    if os.path.isfile('/data/articles.model'):
        print("Skipping as the model file already exists")
    else:
        build_model()

Step 4: Predict similar articles

The listing of the run.py isn’t overly long:

import pandas as pd
from gensim.models.doc2vec import Doc2Vec
import yaml
from pathlib import Path
import os


def write_similar_for(path, model):
    similar_paths = model.docvecs.most_similar(path)
    yaml_path = (Path('/data/blog/') / path).parent / 'similar.yaml'

    with open(yaml_path, "w") as file:
        file.write(yaml.dump([p for p, _ in similar_paths]))
        print(f"Wrote similar paths to {yaml_path}")


def infer_similar():
    articles = pd.read_parquet('/data/articles.parquet')
    model = Doc2Vec.load('/data/articles.model')

    for tag in articles['file'].tolist():
        write_similar_for(tag, model)

if __name__ == '__main__':
    infer_similar()

The idea is to load up the saved Gensim model and the data frame with articles first. Then for each article use the model to get the 10 most similar other articles.
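Under the hood, most_similar ranks the other documents by the cosine similarity of their paragraph vectors. The pure-Python sketch below shows that ranking on made-up two-dimensional vectors. The helper names and the toy data are mine, and Gensim’s real implementation is vectorized and far faster.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(
    tag: str, vectors: Dict[str, Sequence[float]], topn: int = 10
) -> List[Tuple[str, float]]:
    """Rank every other document by similarity to the one identified by tag."""
    scores = [
        (other, cosine(vectors[tag], vec))
        for other, vec in vectors.items()
        if other != tag
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors: a.md and b.md point in nearly the same direction, c.md does not
vectors = {
    "a.md": [1.0, 0.0],
    "b.md": [0.9, 0.1],
    "c.md": [0.0, 1.0],
}
ranked = most_similar("a.md", vectors, topn=2)
```

The result has the same shape as the (path, score) pairs that write_similar_for unpacks before dumping the paths to YAML.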

As the step’s output, the listing of similar articles is placed in a similar.yaml file in each article’s subdirectory.

The blog’s Markdown → HTML compiler could then use this file and e.g. inject the “You might find those articles interesting too” section.

Results

The scratch notebook already includes the example results of running this doc2vec model. Examples:

model.docvecs.most_similar('2019/01/09/liquid-galaxy-at-instituto-moreira-salles.html.md')

Giving the output of:

[('2016/04/22/liquid-galaxy-for-real-estate.html.md', 0.8872901201248169),
 ('2017/07/03/liquid-galaxy-at-2017-boma.html.md', 0.8766101598739624),
 ('2017/01/25/smartracs-liquid-galaxy-at-national.html.md',
  0.8722846508026123),
 ('2016/01/04/liquid-galaxy-at-new-york-tech-meetup_4.html.md',
  0.8693454265594482),
 ('2017/06/16/successful-first-geoint-symposium-for.html.md',
  0.8679709434509277),
 ('2014/08/22/liquid-galaxy-for-daniel-island-school.html.md',
  0.8659971356391907),
 ('2016/07/21/liquid-galaxy-featured-on-reef-builders.html.md',
  0.8644022941589355),
 ('2017/11/17/president-of-the-un-general-assembly.html.md',
  0.8620222806930542),
 ('2016/04/27/we-are-bigger-than-vr-gear-liquid-galaxy.html.md',
  0.8613147139549255),
 ('2015/11/04/end-pointers-favorite-liquid-galaxy.html.md',
  0.8601428270339966)]

Or the following:

model.docvecs.most_similar('2019/01/08/speech-recognition-with-tensorflow.html.md')

Giving:

[('2019/05/01/facial-recognition-amazon-deeplens.html.md', 0.8850516080856323),
 ('2017/05/30/recognizing-handwritten-digits-quick.html.md',
  0.8535605072975159),
 ('2018/10/10/image-recognition-tools.html.md', 0.8495659232139587),
 ('2018/07/09/training-tesseract-models-from-scratch.html.md',
  0.8377258777618408),
 ('2015/12/18/ros-has-become-pivotal-piece-of.html.md', 0.8344655632972717),
 ('2013/03/07/streaming-live-with-red5-media.html.md', 0.8181146383285522),
 ('2012/04/27/streaming-live-with-red5-media-server.html.md',
  0.8142604827880859),
 ('2013/03/15/generating-pdf-documents-in-browser.html.md',
  0.7829260230064392),
 ('2016/05/12/sketchfab-on-liquid-galaxy.html.md', 0.7779937386512756),
 ('2018/08/29/self-driving-toy-car-using-the-a3c-algorithm.html.md',
  0.7659779787063599)]

model.docvecs.most_similar('2016/06/03/adding-bash-completion-to-python-script.html.md')

With:

[('2014/03/12/provisioning-development-environment.html.md',
  0.8298013806343079),
 ('2015/04/03/manage-python-script-options.html.md', 0.7975824475288391),
 ('2012/01/03/automating-removal-of-ssh-key-patterns.html.md',
  0.7794561386108398),
 ('2014/03/14/provisioning-development-environment_14.html.md',
  0.7763932943344116),
 ('2012/04/16/easy-creating-ramdisk-on-ubuntu.html.md', 0.7579266428947449),
 ('2016/03/03/loading-json-files-into-postgresql-95.html.md',
  0.7410352230072021),
 ('2015/02/06/vim-plugin-spotlight-ctrlp.html.md', 0.7385793924331665),
 ('2017/10/27/hot-deploy-java-classes-and-assets-in.html.md',
  0.7358890771865845),
 ('2012/03/21/puppet-custom-fact-ruby-plugin.html.md', 0.718029260635376),
 ('2012/01/14/using-disqus-and-rails.html.md', 0.716759443283081)]

To run the pipeline all you need is to:

$ make run

Or directly with:

$ argo submit pipeline.yaml --watch

Argo gives a nice looking output of all the steps:

Name:                endpoint-blog-pipeline-49ls5
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Wed Jun 26 13:27:51 +0200 (17 seconds ago)
Started:             Wed Jun 26 13:27:51 +0200 (17 seconds ago)
Finished:            Wed Jun 26 13:28:08 +0200 (now)
Duration:            17 seconds

STEP                             PODNAME                                  DURATION  MESSAGE
 ✔ endpoint-blog-pipeline-49ls5
 ├-✔ src                         endpoint-blog-pipeline-49ls5-3331170004  3s
 ├-✔ dataframe                   endpoint-blog-pipeline-49ls5-2286787535  3s
 ├-✔ model                       endpoint-blog-pipeline-49ls5-529475051   3s
 └-✔ infer                       endpoint-blog-pipeline-49ls5-1778224726  6s

The resulting similar.yaml files look as follows:

$ ls ~/data/endpoint-blog-src/blog/2013/03/15/
generating-pdf-documents-in-browser.html.md  similar.yaml

$ cat ~/data/endpoint-blog-src/blog/2013/03/15/similar.yaml
- 2016/03/17/creating-video-player-with-time-markers.html.md
- 2014/07/17/creating-symbol-web-font.html.md
- 2018/10/10/image-recognition-tools.html.md
- 2015/08/04/how-to-big-beautiful-background-video.html.md
- 2014/11/06/simplifying-mobile-development-with.html.md
- 2016/03/23/learning-from-data-basics-naive-bayes.html.md
- 2019/01/08/speech-recognition-with-tensorflow.html.md
- 2013/11/19/asynchronous-page-switches-with-django.html.md
- 2016/03/11/strict-typing-fun-example-free-monads.html.md
- 2018/07/09/training-tesseract-models-from-scratch.html.md

Although it’s difficult to quantify, those sets of “similar” documents do seem to be linked in many ways to their “anchor” articles. You’re invited to read them and see for yourself!

Closing words

The code presented here is hosted on GitHub. There’s lots of room for improvement of course. Still, it shows a nice approach that could be used for both small model deployments (like the one above) and very big ones.

The Argo workflows could be used in tandem with Kubernetes deployments. You could e.g. run a distributed TensorFlow model training and then deploy it on Kubernetes via TensorFlow Serving. If you’re more into PyTorch, then distributing the training would be possible via Horovod. Have data scientists that use R? Deploy RStudio Server instead of JupyterLab using the image from DockerHub, and run some or all of the tasks with a simpler image containing R-base only.

If you have any questions or projects you’d like us to help you with, reach out right away through our contact form!

Speech Recognition from scratch using Dilated Convolutions and CTC in TensorFlow

2019-01-08T00:00:00+00:00

Image by WILL POWER · CC BY 2.0, cropped

In this blog post, I’d like to take you on a journey. We’re going to take a speech recognition project from its architecture phase through coding and training. In the end, we’ll have a fully working model. You’ll be able to take it and run the model serving app, exposing a nice HTTP API. Yes, you’ll even be able to use it in your projects.

Speech recognition has long been among the hardest tasks in Machine Learning. Traditional approaches involve meticulous crafting and extracting of the audio features that separate one phoneme from another. To be able to do that, one needs a deep background in data science and signal processing. The complexity of the training process prompted teams of researchers to look for alternative, more automated approaches.

With the growing development of Deep Learning, the need for handcrafted features declined. The training process for a neural network is much more streamlined. You can feed the signals either in their raw form or as their spectrograms and watch the model improve.

Did this get you excited? Let’s start!

Project Plan of Attack

Let’s build a web service that exposes an API. Let it be able to receive audio signals, encoded as an array of floating point numbers. In return, we’re going to get the recognized text.

Here’s a rough plan of the stages we’re going to go through:

Get the dataset to train the model on
Architect the model
Implement it along with the unit tests
Train it on the dataset
Measure its accuracy
Serve it as a web service

The dataset

The open-source community has a lot to thank the Mozilla Foundation for. It hosts many projects, with the wonderful, free Firefox browser at the forefront. One of its other projects, called Common Voice, focuses on gathering large datasets to be used by anyone in speech recognition projects.

The datasets consist of wave files and their text transcriptions. There’s no notion of time-alignment. It’s just the audio and text for each utterance.

If you want to code along, head over to the Common Voice Datasets download page. Be warned that it weighs roughly 12GB.

After the download, simply extract the files from the archive into the ./data directory of the root of the project. The files, in the end, should reside under the ./data/cv_corpus_v1/ path.

How much data should we have? It always depends on the challenge at hand. Roughly speaking, the more difficult the task, the more powerful your neural network needs to be. It will need to be capable of expressing more complex patterns in data. The more powerful the network, the easier it is for it to just memorize the training examples. This is highly undesirable and results in overfitting. To reduce its tendency to do so, you need to either augment your data on the fly randomly or gather more “real” examples. On this project, we’re going to do both. Data augmentation will be covered in the coding section. The additional datasets we’ll use are the well-known LibriSpeech (the file to download, around 23GB) and VoxForge (the file to download).

Those two datasets are among the most popular that are freely available. There are others I chose to omit as they weigh quite a lot. I was already almost out of free space after the download and preprocessing of the three sets chosen above.

You need to download both Libri and Vox and extract them under ./data/LibriSpeech/ and ./data/voxforge/.

Background on audio processing

In order to build a working model, we need some background in signal processing. Although a lot of the traditional work is going to be done by the neural network automatically, we still need to understand what is going on in order to reason about its various hyperparameters.

Additionally, we’re going to process audio into a form that’s easier to train on. This is going to lower the memory requirements. It’s also going to lower the time needed for the model’s parameters to converge to ones that work well.

How is audio represented?

Let’s have a quick look at what the audio data looks like when we load it from a wave file.

import librosa
import librosa.display

SAMPLING_RATE=16000

# ...

wave, _ = librosa.load(path_to_file, sr=SAMPLING_RATE)

librosa.display.waveplot(wave, sr=SAMPLING_RATE)

The above code specifies that we want to load the audio data with a sampling rate of 16 kHz (more about it later). It then loads it and plots it along the time axis:

The X-axis represents time. The Y-axis is often called the amplitude. A quick look at the plot above shows that the signal contains negative values. How come those values are called amplitudes then? Amplitude is said to represent the maximum displacement of a physical object as it vibrates. What would it mean to have a negative amplitude? To make those values a bit clearer, let’s just call them displacements for now.

Audio is nothing more than the vibration of the air. If you were to build an electrical recorder, you might come up with one that outputs a voltage at each point in time. As the air vibrates, you need a reference point. This, in turn, allows you to capture the exact specifics of the vibration: how it “rises” above the reference point and then falls back below it.

Imagine that your electrical circuit gives you output within the range of -1V to 1V. To load it into your computer and into a plot like the one above, you’d need to capture those values at discrete points in time. The sampling rate is simply the number of times per second that the value from your sound meter is measured and stored, to be loaded later. Next time you read that your CD from the ’90s contains audio sampled at a frequency of 44,100 Hz, you’ll know that the raw “air displacement” values were sampled 44,100 times each second.
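To make the relationship between duration, sampling rate and the number of stored values concrete, here is a minimal sketch. The helper names are made up for illustration and aren’t part of the project’s code.

```python
SAMPLING_RATE = 16000  # samples per second, the rate used throughout this post


def n_samples(duration_in_seconds, sampling_rate=SAMPLING_RATE):
    # hypothetical helper: how many values we store for a recording
    return int(round(duration_in_seconds * sampling_rate))


def duration_in_seconds(sample_count, sampling_rate=SAMPLING_RATE):
    # hypothetical helper: the inverse of n_samples
    return sample_count / sampling_rate


five_seconds = n_samples(5)
one_cd_second = duration_in_seconds(44100, sampling_rate=44100)
```

At 16 kHz, a five-second utterance is an array of 80,000 floating-point numbers.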

Let’s do a simple thought experiment to prepare for the next section. What would you hear if all the above values were constant, e.g. 1.0? We saw that the values given by librosa are floating-point numbers. In the example file they ranged between -0.6 and 0.6. The value of 1.0 is certainly much higher. Would you hear “more” of “something” then? Because sound is, by definition, a vibration, you wouldn’t hear anything! The amplitudes of the audio signal must change periodically; this is how we detect or hear sounds. This implies that in order to distinguish between different sounds, those sounds have to “vibrate differently”. What makes sounds different is the frequency of the vibration.

Decomposing the signal with the Fourier Transform

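Let’s create a signal-generating machine that will output a sinusoid of a given frequency and amplitude:

import numpy as np

def gen_sin(freq, amplitude, sr=1000):
    # one second of signal: sr samples taken at t = 0, 1/sr, ..., (sr - 1)/sr
    # (the endpoint is excluded so that whole periods fit exactly)
    t = np.arange(sr) / sr
    return np.sin(2 * np.pi * freq * t) * amplitude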

Here’s what a 1000-point signal looks like for a frequency of 30 and an amplitude of 1:

import seaborn as sns

sns.lineplot(data=gen_sin(30, 1))

Here’s one for 10 and 0.6:

You can count the number of times the values in plots approach their maximum. Knowing that sine has only one maximum within its period and that we’re showing just one second, that number shows that we have frequencies 30 and 10.
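The counting can also be done in code. The sketch below is only an illustration: count_peaks is a made-up helper, and gen_sin is redefined locally (sampling at t = 0, 1/sr, 2/sr, ...) so that the snippet runs on its own.

```python
import numpy as np


def gen_sin(freq, amplitude, sr=1000):
    # one second of a sine wave, sampled sr times
    t = np.arange(sr) / sr
    return np.sin(2 * np.pi * freq * t) * amplitude


def count_peaks(signal):
    # a local maximum is a sample strictly greater than both of its neighbours
    inner = signal[1:-1]
    return int(np.sum((inner > signal[:-2]) & (inner > signal[2:])))


peaks_30 = count_peaks(gen_sin(30, 1))
peaks_10 = count_peaks(gen_sin(10, 0.6))
```

Because each period of a sine has exactly one maximum, the number of peaks within one second equals the frequency.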

What would we get if we were to sum such sinusoidal signals of different frequencies and amplitudes? Let’s see — below you can see 3 different sine waves plotted on top of each other. The fourth — and last one — shows the signal that is the sum of all of them:

Here’s another example, with the last plot showing the sum of 5 different waves:

It isn’t that regular anymore, is it? It turns out that you can construct any signal by summing up some number of sine waves of different frequencies and amplitudes (and phases, their translation in time). The converse is also true: any signal can be represented as a sum of some number of sine waves of different frequencies and amplitudes (and phases). This is extremely important to our speech recognition task. Frequencies are the real difference between sounds that make up the phonemes and words that we want to be able to recognize.

This is where the Fourier Transform comes into play. It takes our data points that represent intensity per each point in time and produces data points representing intensity per each frequency bin. It’s said that it transforms the domain of the signal from time into frequency. Now, what exactly is a frequency bin? Imagine the physical audio signal being constructed from frequencies between 0Hz and 8000Hz. The FFT algorithm (Fast Fourier Transform) is going to split that full spectrum into bins. If you were to split it into 10 bins, you’d end up having the following ranges: 0Hz–800Hz, 800Hz–1600Hz, 1600Hz–2400Hz, 2400Hz–3200Hz, 3200Hz–4000Hz, 4000Hz–4800Hz, 4800Hz–5600Hz, 5600Hz–6400Hz, 6400Hz–7200Hz, 7200Hz–8000Hz.
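The bin ranges listed above can be reproduced with a few lines of code. This is a simplified picture that follows the text (equal-width ranges); the function names are hypothetical.

```python
def frequency_bins(max_freq_hz, n_bins):
    # split the range 0..max_freq_hz into n_bins equally wide ranges
    width = max_freq_hz / n_bins
    return [(i * width, (i + 1) * width) for i in range(n_bins)]


def bin_index(freq_hz, max_freq_hz, n_bins):
    # which bin a given frequency falls into (the top edge goes into the last bin)
    width = max_freq_hz / n_bins
    return min(int(freq_hz // width), n_bins - 1)


bins = frequency_bins(8000, 10)
index_of_1000hz = bin_index(1000, 8000, 10)
```

With more bins, each range gets narrower, so the frequency resolution improves.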

Let’s see how the FFT works on the example of the signal given above. The waves and plots were produced by the following Python function:

import matplotlib.pyplot as plt

def plot_wave_composition(defs, hspace=1.0):
    # remember the current figure size so we can restore it afterwards
    fig_size = plt.rcParams["figure.figsize"]

    plt.rcParams["figure.figsize"] = [14.0, 10.0]

    # defs is a list of (frequency, amplitude) pairs
    waves = [
        gen_sin(freq, amp)
        for freq, amp in defs
    ]

    # one subplot per wave plus one for their sum
    fig, axs = plt.subplots(nrows=len(defs) + 1)

    for ix, wave in enumerate(waves):
        sns.lineplot(data=wave, ax=axs[ix])
        axs[ix].set_ylabel('{}'.format(defs[ix]))

        if ix != 0:
            axs[ix].set_title('+')

    plt.subplots_adjust(hspace = hspace)

    sns.lineplot(data=sum(waves), ax=axs[len(defs)])
    axs[len(defs)].set_ylabel('sum')
    axs[len(defs)].set_xlabel('time')
    axs[len(defs)].set_title('=')

    plt.rcParams["figure.figsize"] = fig_size

    return waves, sum(waves)

We can plot the signals and grab them at the same time with:

wave_defs = [
        (2, 1),
        (3, 0.8),
        (5, 0.2),
        (7, 0.1),
        (9, 0.25)
    ]

waves, the_sum = plot_wave_composition(wave_defs)

Next, let’s compute the FFT values along with the frequencies:

ffts = np.fft.fft(the_sum)
freqs = np.fft.fftfreq(len(the_sum))

# keep only the non-negative half of the spectrum
half = len(ffts) // 2

frequencies, coeffs = zip(
    *[
        # fftfreq returns cycles per sample; with 1000 samples per second,
        # multiplying by 1000 gives Hz (round guards against float error)
        (int(round(abs(freq * 1000))), coef)
        for freq, coef in zip(freqs[:half], np.abs(ffts)[:half])
        if coef > 10  # arbitrary threshold but let’s not make it too complex for now
    ]
)

sns.barplot(x=list(frequencies), y=list(coeffs))

The last call produces the following plot:

The X-axis now represents the frequency in Hz, while the Y-axis is the intensity.

There’s one missing part before we can use it with our speech data. As you can see, FFT gives us frequencies for the whole signal, assuming that it’s periodic and extends infinitely in time. Obviously, when I say “hello”, the air vibrates differently in the beginning, changes in between and is even more different at the end. We need to split that audio into small “windows” of data points. By feeding them into FFT, we can get the frequencies for each one of them. This turns the data domain from time into frequency within the scope of the window. The information about time is retained at the global level, making our data represent: time x frequency x intensity.
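The windowing idea can be sketched in plain NumPy. This is a minimal illustration rather than the code used later in the project: simple_stft, window_size and hop are made-up names, and applying a Hann window to each frame is an assumption on my part (a common choice that softens the window edges).

```python
import numpy as np


def simple_stft(signal, window_size, hop):
    # split the signal into (possibly overlapping) windows
    n_frames = 1 + (len(signal) - window_size) // hop
    frames = np.stack([
        signal[i * hop : i * hop + window_size]
        for i in range(n_frames)
    ])
    # soften the edges of every window, then run the FFT on each of them;
    # rfft keeps only the non-negative frequencies
    frames = frames * np.hanning(window_size)
    return np.abs(np.fft.rfft(frames, axis=1))


sr = 1000
t = np.arange(2 * sr) / sr

# the first second is a 50Hz tone, the second one a 200Hz tone
signal = np.concatenate([
    np.sin(2 * np.pi * 50 * t[:sr]),
    np.sin(2 * np.pi * 200 * t[sr:]),
])

# rows are points in time, columns are frequency bins of width sr / window_size = 10Hz
spectrogram = simple_stft(signal, window_size=100, hop=50)
```

The strongest bin in the early rows corresponds to 50Hz and in the late rows to 200Hz, which is exactly the time information a single FFT over the whole signal would lose.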

Scaling frequencies

Human perception is a vastly complex phenomenon. Taking that into account can take us a long way when working on a recognition model that emulates the work of our brains when we’re listening to each other.

Let’s do another experiment. What sound is produced by the 800Hz sine?

from IPython.display import Audio

Audio(data=gen_sin(800, 1, 16000), rate=16000)

Let’s now generate 900Hz and 1000Hz to get a sense of the difference:

900Hz:

1000Hz:

Let’s now raise the frequencies and generate 7000Hz, 7100Hz and 7200Hz:

Can you hear that the difference is smaller in the case of the last three? It’s a well-known phenomenon. We sense a greater difference between sounds at lower frequencies, and as the frequency increases, that difference becomes smaller and smaller.

Because of this, three gentlemen (Stevens, Volkmann, and Newman) created the so-called Mel scale in 1937. You can think of it as a simple rescaling of the frequencies that roughly follows the relationship shown below:
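For a feel of the numbers, here is a small sketch using one commonly quoted closed-form approximation of the scale, 2595 * log10(1 + f / 700). Treat the choice of formula as an assumption: libraries differ in which variant they implement, and the function names below are made up.

```python
import math


def hz_to_mel(freq_hz):
    # one common approximation of the Mel scale (an assumption, see above)
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    # the inverse of hz_to_mel
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# the same 200Hz step, taken low and high in the spectrum
low_gap = hz_to_mel(1000) - hz_to_mel(800)
high_gap = hz_to_mel(7200) - hz_to_mel(7000)
```

The step from 800Hz to 1000Hz spans several times more mels than the step from 7000Hz to 7200Hz, which matches the listening experiment above.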

Although not mandatory, lots of models that deal with human speech also decrease the importance of the intensity by taking the log of the re-scaled data. The resulting time x frequency (mels) x log-intensity is called the log-Mel spectrogram.

Background on deep learning techniques in use for this project

We’ve just gone through the necessary basics of signal processing. Let’s now focus on the Deep Learning concepts we’ll use to construct and train the model.

While this article assumes that the reader already knows a lot, there are less common techniques we’ll use that deserve at least a quick walkthrough.

Dilated convolutions as a faster alternative to recurrent networks

Traditionally, sequence processing in Deep Learning is tackled with recurrent neural networks.

No matter the choice of their flavor, the basic scheme is always the same: the computations are done sequentially, going through examples in time. In our case, we’d need to split the time x frequency x intensity data along the time axis into a sequence of frequency x intensity chunks. As the chunks would be processed one by one, the recurrent network’s internal state would “remember” the previous chunk’s specifics, incorporating them into its future outputs. The output shape would be time x frequency x recurrent units.

The fact that the computations are done sequentially makes them quite slow overall. Computations later in the pipeline spend most of the time waiting on the previous ones to finish because of the direct dependency. The problem is even more severe with the use of GPUs. We use them because of their ability to do math in parallel on huge chunks of data. With recurrent networks, lots of that power is being wasted.

The premise of RNNs is that, in theory, they have the capacity to keep very long contexts in their “memory”. This has recently been put to the test and falsified in practice by Bai et al. Also, when you stop and think about the task at hand: does it really matter to “remember” the beginning of the sentence to know that it ends with the word “dog”? Some context is obviously needed, but not as wide as it might seem at first.

I have an Nvidia GTX 1070Ti with 8GB of memory to train my models on. I don’t really feel like waiting a month for my recurrent network to converge. In this project, let’s use a very performant alternative: the convolutional neural network.

Expanding the context of the convolutional network

Simple convolutional layers weren’t used for sequence processing much for a good reason. The crux of the sequence processing is to be able to take bigger contexts into account. Depending on the job, we might want to constrain the context only to the past — learning the causal relations in data. We might sometimes want to incorporate both past and future in it as well. The go-to solution for doing OCR at the moment is to use bidirectional recurrent layers. Their one pass learns the relations from left to right while another learns from right to left. The results are then concatenated.

By applying proper padding, we can easily include one or two-sided contexts in 1D convolutions. The challenge is that in order to make the outputs depend on bigger contexts, the size of the filters needs to become bigger and bigger. This, in turn, requires more and more memory.

Because our aim is to create a model that we’ll be able to train on a GTX 1070Ti (around $500 at the moment), which is quite cheap compared to the GPUs usually used in this field, we want the memory requirements to be as low as possible.

Thanks to the success of the WaveNet (among others), a specific class of convolutional layers gained a lot of attention lately. The variation is called Dilated Convolutions or sometimes Atrous Convolutions. So what are they?

Let’s first have a look at how the outputs depend on their context for simple convolutional layers:

Imagine that you originally have just the top-most row of numbers. You are going to use 1D convolutions and, to keep the reasoning simple, the number of filters is 1. Also for simplicity, all filter values are set to 1. You can see the cross-correlation operator (because that’s what convolutional layers are in fact computing) taking 3 values in the context, multiplying them by the filter and summing up to 2 * 1 + 3 * 1 + 4 * 1 = 9.

The atrous convolutions are really the same, except they dilate their focus without increasing the size of the filter by introducing holes. It’s shown below with the convolution of the size 2 and dilation of 2:

Here’s yet another example for the size of 2 and dilation of 3:
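The same arithmetic can be written down as a short NumPy sketch. The input row below is hypothetical (chosen so that it contains the 2, 3, 4 window from the text), the function name is made up, and no padding is applied.

```python
import numpy as np


def dilated_cross_correlation(x, kernel, dilation=1):
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)

    # how many input positions a single output value "sees" from end to end
    span = (len(kernel) - 1) * dilation + 1

    out = []
    for start in range(len(x) - span + 1):
        # take every dilation-th value, skipping the "holes" in between
        taps = x[start : start + span : dilation]
        out.append(float(np.dot(taps, kernel)))

    return out


row = [1, 2, 3, 4, 5, 6, 7]

plain = dilated_cross_correlation(row, [1, 1, 1])             # size 3, no dilation
dilated_2 = dilated_cross_correlation(row, [1, 1], dilation=2)  # size 2, dilation 2
dilated_3 = dilated_cross_correlation(row, [1, 1], dilation=3)  # size 2, dilation 3
```

Note how the filter of size 2 with a dilation of 3 covers a span of 4 input values while still holding only 2 weights.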

Gated activations

Traditionally, convolutional layers are followed by the *elu family of activations (ReLu, Elu, PRelu, Selu). They fit in well within the “match pattern” paradigm of the conv nets. In contrast, recurrent units follow the “remember/forget” approach. Two of their most commonly used implementations, GRU and LSTM, include explicit “forget” gates.

We want to mimic their ability to “forget” parts of the context within our dilated convolutions based model too. To do that, we’re going to use the “gated activations” approach, explained by Liptchinsky et al.

The idea is very simple: we pass the input through Conv1D separately and apply tanh and sigmoid respectively. The result is the element-wise product. We’re going to go one step further in our approach, by applying tanh one more time in the end.
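In NumPy terms, the activation described above could be sketched like this. The two arguments stand in for the outputs of the two separate Conv1D layers applied to the same input; the function and argument names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_activation(filter_out, gate_out, final_tanh=True):
    # tanh carries the "content", sigmoid decides how much of it passes through
    gated = np.tanh(filter_out) * sigmoid(gate_out)

    # the extra tanh at the end is the additional step mentioned in the text
    return np.tanh(gated) if final_tanh else gated


content = np.array([2.0, -1.0, 0.5])

closed_gate = gated_activation(content, np.array([-50.0, -50.0, -50.0]))
open_gate = gated_activation(content, np.array([50.0, 50.0, 50.0]))
```

With the gate pushed towards 0, the output is “forgotten” regardless of the content; with the gate pushed towards 1, the content passes through almost unchanged.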

Others

The full explanation of all of the details of our neural network’s architecture is beyond the scope of an article like this. Let me point you at additional pieces along with the reading they come from:

Let’s code it

The architecture of our choice in this project is going to heavily rely on the great success of residual-style networks as well as dilated convolutions. You might see similarities to the famous WaveNet, although it’s going to be a bit different.

Here is the bird’s-eye view of the SpeechNet neural network:

The residual stacks, being at the heart of it, are structured the following way:

The residual blocks, doing all the heavy lifting, can be seen as shown below:

The most important aspect of coding Deep Learning models

Developing Deep Learning models doesn’t really differ that much from any other type of coding. It does require specific background knowledge, but the good coding practices remain the same. In fact, good coding habits are 10× more relevant here than in e.g. a web-app project.

Training a speech-to-text model is bound to require days if not weeks. Imagine having a small bug in your code, preventing the process from finding a good local minimum. It’s extremely frustrating to find out about it days into the training, with the model trainable parameters not being improved much.

Let’s start by adding some unit tests then. In this project, we’re using the Jupyter notebook as we don’t intend to package it anywhere. The code’s intent is to be for educational purposes mainly.

Adding unit tests within the Jupyter notebook is possible with the following “hack” (notice the value for argv):

import unittest

RUN_TESTS = True

class TestNotebook(unittest.TestCase):
    def test_it_works(self):
        self.assertEqual(2 + 2, 4)

if __name__ == '__main__' and RUN_TESTS:
    import doctest

    doctest.testmod()
    unittest.main(
        argv=['first-arg-is-ignored'],
        failfast=True,
        exit=False
    )

You can notice the import of the doctest module which adds support for doc-string level tests which may come in handy as well.

I also highly recommend the hypothesis library for testing the QuickCheck way, which I blogged about before.

Data pipeline

A place that’s surprisingly bug-prone is the data pipeline. It’s easy to e.g. shuffle the labels independently of input vectors if you’re not careful. There’s also always a chance of introducing input vectors that include NaN or inf values, which a few steps later produce NaN or inf loss values. Let’s add a simple test to check for the first condition:


# assuming test path will look like: 1/file.wav
# the input and output types are driven by the input_fn shown later
# here, we’re just generating values based on the “path”
def dummy_load_wave(example):
    row, params = example
    path = row.filename

    return np.ones((SAMPLING_RATE)) * float(path.split('/')[0]), row

class TestNotebook(unittest.TestCase):

    # (...)

    def test_dataset_returns_data_in_order(self):

        params = experiment_params(
            dataset_params(
                batch_size=2,
                epochs=1,
                augment=False
            )
        )

        data = pd.DataFrame(
            data={
                'text': [ str(i) for i in range(10) ],
                'filename':  [ '{}/wav'.format(i) for i in range(10) ]
            }
        )

        dataset = input_fn(data, params['data'], dummy_load_wave)()
        iterator = dataset.make_one_shot_iterator()
        next_element = iterator.get_next()

        with tf.Session() as session:
            try:
                while True:
                    audio, label = session.run(next_element)
                    audio, length = audio

                    for _audio, _label in zip(list(audio), list(label)):
                        self.assertEqual(_audio[0], float(_label))

                    for _length in length:
                        self.assertEqual(_length, SAMPLING_RATE)
            except tf.errors.OutOfRangeError:
                pass

The above code assumes having the input_fn function in scope. If you’re not familiar with the concept yet, please go ahead and read the introduction to the TensorFlow Estimators API.

Here’s our implementation:

from multiprocessing import Pool

def input_fn(input_dataset, params, load_wave_fn=load_wave):
    def _input_fn():
        """
        Returns raw audio wave along with the label
        """

        dataset = input_dataset

        print(params)

        if 'max_text_length' in params and params['max_text_length'] is not None:
            print('Constraining dataset to the max_text_length')
            dataset = dataset[dataset.text.str.len() < params['max_text_length']]

        if 'min_text_length' in params and params['min_text_length'] is not None:
            print('Constraining dataset to the min_text_length')
            # filter the already-constrained dataset so that both limits apply together
            dataset = dataset[dataset.text.str.len() >= params['min_text_length']]

        if 'max_wave_length' in params and params['max_wave_length'] is not None:
            print('Constraining dataset to the max_wave_length')

        print('Resulting dataset length: {}'.format(len(dataset)))

        def generator_fn():
            pool = Pool()
            buffer = []

            for epoch in range(params['epochs']):
                for _, row in dataset.sample(frac=1).iterrows():
                    buffer.append((row, params))

                    if len(buffer) >= params['batch_size']:

                        if params['parallelize']:
                            audios = pool.map(
                                load_wave_fn,
                                buffer
                            )
                        else:
                            audios = map(
                                load_wave_fn,
                                buffer
                            )

                        for audio, row in audios:
                            if audio is not None:
                                if np.isnan(audio).any():
                                    print('SKIPPING! NaN coming from the pipeline!')
                                else:
                                    yield (audio, len(audio)), row.text.encode()

                        buffer = []

        return tf.data.Dataset.from_generator(
                generator_fn,
                output_types=((tf.float32, tf.int32), (tf.string)),
                output_shapes=((None,()), (()))
            ) \
            .padded_batch(
                batch_size=params['batch_size'],
                padded_shapes=(
                    (tf.TensorShape([None]), tf.TensorShape(())),
                    tf.TensorShape(())
                )
            )

    return _input_fn

This depends on the load_wave function:

import librosa
import hickle as hkl
import os.path

def to_path(filename):
    return './data/cv_corpus_v1/' + filename

def load_wave(example, absolute=False):
    row, params = example

    _path = row.filename if absolute else to_path(row.filename)

    if os.path.isfile(_path + '.wave.hkl'):
        wave = hkl.load(_path + '.wave.hkl').astype(np.float32)
    else:
        wave, _ = librosa.load(_path, sr=SAMPLING_RATE)
        hkl.dump(wave, _path + '.wave.hkl')

    if len(wave) <= params['max_wave_length']:
        if params['augment']:
            wave = random_noise(
                random_stretch(
                    random_shift(
                        wave,
                        params
                    ),
                    params
                ),
                params
            )
    else:
        wave = None

    return wave, row

Which depends on three other functions used to augment the data on the fly to improve the model’s generalization:

import random
import glob

noise_files = glob.glob('./data/*.wav')
noises = {}

def random_stretch(audio, params):
    rate = random.uniform(params['random_stretch_min'], params['random_stretch_max'])

    # rate is passed by keyword; newer librosa versions no longer accept it positionally
    return librosa.effects.time_stretch(audio, rate=rate)

def random_shift(audio, params):
    _shift = random.randrange(params['random_shift_min'], params['random_shift_max'])

    if _shift < 0:
        pad = (_shift * -1, 0)
    else:
        pad = (0, _shift)

    return np.pad(audio, pad, mode='constant')

def random_noise(audio, params):
    _factor = random.uniform(
        params['random_noise_factor_min'],
        params['random_noise_factor_max']
    )

    if params['random_noise'] > random.uniform(0, 1):
        _path = random.choice(noise_files)

        if _path in noises:
            wave = noises[_path]
        else:
            if os.path.isfile(_path + '.wave.hkl'):
                wave = hkl.load(_path + '.wave.hkl').astype(np.float32)
                noises[_path] = wave
            else:
                wave, _ = librosa.load(_path, sr=SAMPLING_RATE)
                hkl.dump(wave, _path + '.wave.hkl')
                noises[_path] = wave

        noise = random_shift(
            wave,
            {
                'random_shift_min': -16000,
                'random_shift_max': 16000
            }
        )

        max_noise = np.max(noise[0:len(audio)])
        max_wave = np.max(audio)

        noise = noise * (max_wave / max_noise)

        return _factor * noise[0:len(audio)] + (1.0 - _factor) * audio
    else:
        return audio

Notice that we’re making almost everything into a configurable parameter. We want the code to allow the greatest freedom of searching for just the right set of hyperparameters.

The data pipeline as shown above randomly shuffles the Pandas data frame once for each epoch. It also creates a pool of background workers to parallelize the data loading as much as possible. We’re doing the data loading and augmentation on the CPU. It also uses the hickle library for caching audio signals on the disk. Loading a wave file with a given sampling rate isn’t as fast as one might think. In my experiments, loading the resulting array of floating-point numbers via hickle was 10x faster. We need to feed the data into the network as fast as possible, or else our GPU is going to stay underutilized.
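The caching pattern itself is independent of the library. Here’s a minimal sketch of it that swaps hickle for NumPy’s own .npy format, so that it runs without extra dependencies; load_cached and fake_loader are made-up names, and fake_loader merely stands in for the slow librosa.load call.

```python
import os
import tempfile

import numpy as np


def load_cached(path, slow_loader):
    # the decoded signal is stored next to the original file
    cache_path = path + '.cache.npy'

    if os.path.isfile(cache_path):
        return np.load(cache_path)

    data = np.asarray(slow_loader(path), dtype=np.float32)
    np.save(cache_path, data)

    return data


calls = []

def fake_loader(path):
    # stands in for the expensive decoding and resampling step
    calls.append(path)
    return np.linspace(-1.0, 1.0, 16000)


with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, 'example.wav')

    first = load_cached(example_path, fake_loader)   # decodes and writes the cache
    second = load_cached(example_path, fake_loader)  # served from the cache
```

The slow loader runs only once per file; every later epoch reads the ready-made array from the disk.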

In my experiments, turning data augmentation on also made a real difference. I ran the training without it and the overfitting was disastrous: the normalized edit distance hovered around 0.01 for the training set and 0.53 for the validation set.
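In case the metric is unfamiliar, here is a plain-Python sketch of a normalized edit distance. It assumes the Levenshtein distance divided by the length of the reference text, which is one common definition; the post doesn’t spell out the exact variant, and the function names are made up.

```python
def edit_distance(a, b):
    # classic dynamic programming, keeping only the previous row in memory
    prev = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,        # deletion
                curr[j - 1] + 1,    # insertion
                prev[j - 1] + cost  # substitution (or a match)
            ))
        prev = curr

    return prev[-1]


def normalized_edit_distance(predicted, truth):
    # assumption: normalize by the length of the reference transcription
    return edit_distance(predicted, truth) / max(len(truth), 1)


perfect = normalized_edit_distance('hello', 'hello')
one_letter_off = normalized_edit_distance('helo', 'hello')
```

A value of 0.01 therefore means roughly one wrong character per hundred, while 0.53 means that more than half of the characters need fixing.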

The random_noise function uses the noise sounds included in the Speech Commands: A public dataset for single-word speech recognition dataset. Please go ahead and download it, extracting just the noise files under the ./data directory.

The last helpers in use that we haven’t seen yet are experiment_params and the dataset_params function passed to it in the test above. They just make it easy to construct the params hash for our experiments. Here’s dataset_params:

def dataset_params(batch_size=32,
                   epochs=50000,
                   parallelize=True,
                   max_text_length=None,
                   min_text_length=None,
                   max_wave_length=80000,
                   shuffle=True,
                   random_shift_min=-4000,
                   random_shift_max= 4000,
                   random_stretch_min=0.7,
                   random_stretch_max= 1.3,
                   random_noise=0.75,
                   random_noise_factor_min=0.2,
                   random_noise_factor_max=0.5,
                   augment=False):
    return {
        'parallelize': parallelize,
        'shuffle': shuffle,
        'max_text_length': max_text_length,
        'min_text_length': min_text_length,
        'max_wave_length': max_wave_length,
        'random_shift_min': random_shift_min,
        'random_shift_max': random_shift_max,
        'random_stretch_min': random_stretch_min,
        'random_stretch_max': random_stretch_max,
        'random_noise': random_noise,
        'random_noise_factor_min': random_noise_factor_min,
        'random_noise_factor_max': random_noise_factor_max,
        'epochs': epochs,
        'batch_size': batch_size,
        'augment': augment
    }

Labels encoder and decoder

When working with the CTC loss, we need a way to code each letter as a numerical value. Conversely, the neural network is going to give us probabilities for each letter, given by its index within the output matrix.

The idea behind this project’s approach is to push the encoding and decoding into the network graph itself. We want two functions: encode_labels and decode_codes. We want the first to turn a string into an array of integers. The second one should complement it, turning the array of integers into the resulting string.
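Before moving this into the TensorFlow graph, it may help to see the intended behaviour in plain Python. This is only an illustration of the contract between the two functions: the _py names and the alphabet are made up for the example.

```python
ALPHABET = 'abcdefghijk1234!@#$%^&*'  # example alphabet, for illustration only


def encode_labels_py(text, alphabet=ALPHABET):
    # each character becomes its index in the alphabet; unknown ones become -1
    char2id = {char: ix for ix, char in enumerate(alphabet)}
    return [char2id.get(char, -1) for char in text]


def decode_codes_py(codes, alphabet=ALPHABET):
    # the inverse mapping; codes outside of the alphabet decode to nothing
    return ''.join(
        alphabet[code] if 0 <= code < len(alphabet) else ''
        for code in codes
    )


codes = encode_labels_py('kid!12')
round_trip = decode_codes_py(codes)
```

The property we care about is the round trip: decoding the encoded text gives the original text back, as long as all of its characters are in the alphabet.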

It’s a good idea to use the hypothesis library for this unit test. It’s going to come up with many input examples, trying to falsify our assumptions:

from hypothesis import given, assume, note
import hypothesis.strategies as st

@given(st.text(alphabet="abcdefghijk1234!@#$%^&*", max_size=10))
def test_encode_and_decode_work(self, text):
    assume(text != '')

    params = { 'alphabet': 'abcdefghijk1234!@#$%^&*' }

    label_ph = tf.placeholder(tf.string, shape=(1), name='text')
    codes_op = encode_labels(label_ph, params)
    decode_op = decode_codes(codes_op, params)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer(name='init_all_tables'))

        codes, decoded = session.run(
            [codes_op, decode_op],
            {
                label_ph: np.array([text])
            }
        )

        note(codes)
        note(decoded)

        self.assertEqual(text, ''.join(map(lambda s: s.decode('UTF-8'), decoded.values)))
        self.assertEqual(codes.values.dtype, np.int32)
        self.assertEqual(len(codes.values), len(text))

Here is the implementation that passes the above test:

def encode_labels(labels, params):
    characters = list(params['alphabet'])

    table = tf.contrib.lookup.HashTable(
        tf.contrib.lookup.KeyValueTensorInitializer(
            characters,
            list(range(len(characters)))
        ),
        -1,
        name='char2id'
    )

    return table.lookup(
        tf.string_split(labels, delimiter='')
    )

def decode_codes(codes, params):
    characters = list(params['alphabet'])

    table = tf.contrib.lookup.HashTable(
        tf.contrib.lookup.KeyValueTensorInitializer(
            list(range(len(characters))),
            characters
        ),
        '',
        name='id2char'
    )

    return table.lookup(codes)

Log-Mel Spectrogram layer

Another piece we need is a way to turn raw audio signals into the log-Mel spectrograms. The idea, again, is to push it into the network graph. This way it’s going to work way faster on GPUs and also the model’s API is going to be much simpler.

In the following unit test, we’re testing our custom TensorFlow layer against values coming from known-to-be-valid librosa:

from hypothesis import settings
import hypothesis.extra.numpy as npst

@given(
    st.sampled_from([22000, 16000, 8000]),
    st.sampled_from([1024, 512]),
    st.sampled_from([1024, 512]),
    npst.arrays(
        np.float32,
        (4, 16000),
        elements=st.floats(-1, 1)
    )
)
@settings(max_examples=10)
def test_log_mel_conversion_works(self, sampling_rate, n_fft, frame_step, audio):
    lower_edge_hertz=0.0
    upper_edge_hertz=sampling_rate / 2.0
    num_mel_bins=64

    def librosa_melspectrogram(audio_item):
        spectrogram = np.abs(
            librosa.core.stft(
                audio_item,
                n_fft=n_fft,
                hop_length=frame_step,
                center=False
            )
        )**2

        return np.log(
            librosa.feature.melspectrogram(
                S=spectrogram,
                sr=sampling_rate,
                n_mels=num_mel_bins,
                fmin=lower_edge_hertz,
                fmax=upper_edge_hertz,
            ) + 1e-6
        )

    audio_ph = tf.placeholder(tf.float32, (4, 16000))

    librosa_log_mels = np.transpose(
        np.stack([
            librosa_melspectrogram(audio_item)
            for audio_item in audio
        ]),
        (0, 2, 1)
    )

    log_mel_op = tf.check_numerics(
        LogMelSpectrogram(
            sampling_rate=sampling_rate,
            n_fft=n_fft,
            frame_step=frame_step,
            lower_edge_hertz=lower_edge_hertz,
            upper_edge_hertz=upper_edge_hertz,
            num_mel_bins=num_mel_bins
        )(audio_ph),
        message="log mels"
    )

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())

        log_mels = session.run(
            log_mel_op,
            {
               audio_ph: audio
            }
        )

        np.testing.assert_allclose(
            log_mels,
            librosa_log_mels,
            rtol=1e-1,
            atol=0
        )

The implementation of the layer that passes the above unit test reads as follows:

class LogMelSpectrogram(tf.layers.Layer):
    def __init__(self,
                 sampling_rate,
                 n_fft,
                 frame_step,
                 lower_edge_hertz,
                 upper_edge_hertz,
                 num_mel_bins,
                 **kwargs):
        super(LogMelSpectrogram, self).__init__(**kwargs)

        self.sampling_rate = sampling_rate
        self.n_fft = n_fft
        self.frame_step = frame_step
        self.lower_edge_hertz = lower_edge_hertz
        self.upper_edge_hertz = upper_edge_hertz
        self.num_mel_bins = num_mel_bins

    def call(self, inputs, training=True):
        stfts = tf.contrib.signal.stft(
            inputs,
            frame_length=self.n_fft,
            frame_step=self.frame_step,
            fft_length=self.n_fft,
            pad_end=False
        )

        power_spectrograms = tf.real(stfts * tf.conj(stfts))

        num_spectrogram_bins = power_spectrograms.shape[-1].value

        linear_to_mel_weight_matrix = tf.constant(
            np.transpose(
                librosa.filters.mel(
                    sr=self.sampling_rate,
                    n_fft=self.n_fft + 1,
                    n_mels=self.num_mel_bins,
                    fmin=self.lower_edge_hertz,
                    fmax=self.upper_edge_hertz
                )
            ),
            dtype=tf.float32
        )

        mel_spectrograms = tf.tensordot(
            power_spectrograms,
            linear_to_mel_weight_matrix,
            1
        )

        mel_spectrograms.set_shape(
            power_spectrograms.shape[:-1].concatenate(
                linear_to_mel_weight_matrix.shape[-1:]
            )
        )

        return tf.log(mel_spectrograms + 1e-6)
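
To make the data flow of the layer easier to follow outside of a TensorFlow graph, here is a minimal NumPy sketch of the same steps: framing, FFT, power spectrum, Mel projection, and logarithm. The function names, the symmetric Hann window, and the HTK-style Mel formula are assumptions made for illustration. librosa’s default filterbank differs, so the numbers will not match the layer above exactly.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style Mel scale (an assumption; librosa defaults to the Slaney variant).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(sampling_rate, n_fft, num_mel_bins, fmin, fmax):
    # An FFT of size n_fft yields n_fft // 2 + 1 non-negative frequency bins.
    fft_freqs = np.linspace(0.0, sampling_rate / 2.0, n_fft // 2 + 1)
    # num_mel_bins triangular filters need num_mel_bins + 2 edge points.
    edges = mel_to_hz(
        np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), num_mel_bins + 2)
    )
    weights = np.zeros((n_fft // 2 + 1, num_mel_bins))
    for m in range(num_mel_bins):
        left, centre, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        weights[:, m] = np.maximum(0.0, np.minimum(rising, falling))
    return weights

def log_mel_spectrogram(audio, sampling_rate, n_fft, frame_step,
                        num_mel_bins, fmin=0.0, fmax=None):
    if fmax is None:
        fmax = sampling_rate / 2.0
    # No padding at the end, mirroring pad_end=False in the layer.
    num_frames = 1 + (len(audio) - n_fft) // frame_step
    frames = np.stack([
        audio[i * frame_step: i * frame_step + n_fft]
        for i in range(num_frames)
    ])
    stfts = np.fft.rfft(frames * np.hanning(n_fft), axis=-1)
    # Same identity as in the layer: |z|^2 == real(z * conj(z)).
    power = np.real(stfts * np.conj(stfts))
    mel = power @ mel_filterbank(sampling_rate, n_fft, num_mel_bins, fmin, fmax)
    # The small constant keeps the logarithm finite for silent frames.
    return np.log(mel + 1e-6)

rng = np.random.RandomState(0)
audio = rng.uniform(-1.0, 1.0, 16000)
log_mels = log_mel_spectrogram(
    audio, sampling_rate=16000, n_fft=512, frame_step=256, num_mel_bins=10
)
print(log_mels.shape)  # (61, 10): frames first, Mel bins last
```

The output keeps time as the first axis and Mel bins as the last one, which is also the layout the convolutional layers later in the article expect.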

Converted data lengths function

In order to use the CTC loss and decoder efficiently, we need to pass them the number of frames that actually represent audio for each example in the batch. Not all audio files are of the same length, so we pad them with zeros to form mini-batches, and that padding must not be treated as data.

Here’s the unit test:

    @given(
        npst.arrays(
            np.float32,
            (st.integers(min_value=16000, max_value=16000*5)),
            elements=st.floats(-1, 1)
        ),
        st.sampled_from([22000, 16000, 8000]),
        st.sampled_from([1024, 512, 640]),
        st.sampled_from([1024, 512, 160]),
    )
    @settings(max_examples=10)
    def test_compute_lengths_works(self,
                                   audio_wave,
                                   sampling_rate,
                                   n_fft,
                                   frame_step
                                  ):
        assume(n_fft >= frame_step)

        original_wave_length = audio_wave.shape[0]

        audio_waves_ph = tf.placeholder(tf.float32, (None, None), name="audio_waves")
        original_lengths_ph = tf.placeholder(tf.int32, (None,), name="original_lengths")

        lengths_op = compute_lengths(
            original_lengths_ph,
            {
                'frame_step': frame_step,
                'n_fft': n_fft
            }
        )

        self.assertEqual(lengths_op.dtype, tf.int32)

        log_mel_op = LogMelSpectrogram(
            sampling_rate=sampling_rate,
            n_fft=n_fft,
            frame_step=frame_step,
            lower_edge_hertz=0.0,
            upper_edge_hertz=8000.0,
            num_mel_bins=13
        )(audio_waves_ph)

        with tf.Session() as session:
            session.run(tf.global_variables_initializer())

            lengths, log_mels = session.run(
                [lengths_op, log_mel_op],
                {
                    audio_waves_ph: np.array([audio_wave]),
                    original_lengths_ph: np.array([original_wave_length])
                }
            )

            note(original_wave_length)
            note(lengths)
            note(log_mels.shape)

            self.assertEqual(lengths[0], log_mels.shape[1])

And here’s the implementation:

def compute_lengths(original_lengths, params):
    """
    Computes the length of data for CTC
    """

    return tf.cast(
        tf.floor(
            (tf.cast(original_lengths, dtype=tf.float32) - params['n_fft']) /
                params['frame_step']
        ) + 1,
        tf.int32
    )
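
The arithmetic in compute_lengths can be checked without TensorFlow. The sketch below applies the same formula to plain integers and compares it with a brute-force count of how many full windows fit into the signal. The helper names are made up for this illustration. The window and step sizes are the defaults from experiment_params.

```python
import numpy as np

def frame_count(length, n_fft, frame_step):
    # Same formula as compute_lengths, applied to a plain integer.
    return int(np.floor((length - n_fft) / frame_step) + 1)

def count_frames_by_sliding(length, n_fft, frame_step):
    # Brute force: slide a window of n_fft samples in steps of frame_step
    # and count the positions where the whole window still fits.
    count = 0
    start = 0
    while start + n_fft <= length:
        count += 1
        start += frame_step
    return count

for length in (16000, 16001, 16159, 16160, 40000, 80000):
    assert frame_count(length, 640, 160) == count_frames_by_sliding(length, 640, 160)

# A one-second and a two-second clip at 16 kHz, padded into the same batch,
# still have different numbers of meaningful frames:
print(frame_count(16000, 640, 160))  # 97
print(frame_count(32000, 640, 160))  # 197
```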

Atrous 1D Convolutions layer

It’s also a good idea to ensure that our dilated convolution layer behaves as the theory predicts. TensorFlow already lets us specify the dilation rate. The end result, though, may differ wildly depending on the choice of other parameters, such as the padding.

Let’s ensure at least that it works as intended when we choose it to work in the “causal” mode. The unit test:

def test_causal_conv1d_works(self):
    conv_size2_dilation_1 = AtrousConv1D(
        filters=1,
        kernel_size=2,
        dilation_rate=1,
        kernel_initializer=tf.ones_initializer(),
        use_bias=False
    )

    conv_size3_dilation_1 = AtrousConv1D(
        filters=1,
        kernel_size=3,
        dilation_rate=1,
        kernel_initializer=tf.ones_initializer(),
        use_bias=False
    )

    conv_size2_dilation_2 = AtrousConv1D(
        filters=1,
        kernel_size=2,
        dilation_rate=2,
        kernel_initializer=tf.ones_initializer(),
        use_bias=False
    )

    conv_size2_dilation_3 = AtrousConv1D(
        filters=1,
        kernel_size=2,
        dilation_rate=3,
        kernel_initializer=tf.ones_initializer(),
        use_bias=False
    )

    data = np.array(list(range(1, 31)))
    data_ph = tf.placeholder(tf.float32, (1, 30, 1))

    size2_dilation_1_1 = conv_size2_dilation_1(data_ph)
    size2_dilation_1_2 = conv_size2_dilation_1(size2_dilation_1_1)

    size3_dilation_1_1 = conv_size3_dilation_1(data_ph)
    size3_dilation_1_2 = conv_size3_dilation_1(size3_dilation_1_1)

    size2_dilation_2_1 = conv_size2_dilation_2(data_ph)
    size2_dilation_2_2 = conv_size2_dilation_2(size2_dilation_2_1)

    size2_dilation_3_1 = conv_size2_dilation_3(data_ph)
    size2_dilation_3_2 = conv_size2_dilation_3(size2_dilation_3_1)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())

        outputs = session.run(
            [
                size2_dilation_1_1,
                size2_dilation_1_2,
                size3_dilation_1_1,
                size3_dilation_1_2,
                size2_dilation_2_1,
                size2_dilation_2_2,
                size2_dilation_3_1,
                size2_dilation_3_2
            ],
            {
                data_ph: np.reshape(data, (1, 30, 1))
            }
        )

        for ix, out in enumerate(outputs):
            out = np.squeeze(out)
            outputs[ix] = out

            self.assertEqual(out.shape[0], len(data))

        np.testing.assert_equal(
            outputs[0],
            np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59], dtype=np.float32)
        )

        np.testing.assert_equal(
            outputs[1],
            np.array([1, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108, 112, 116], dtype=np.float32)
        )

        np.testing.assert_equal(
            outputs[2],
            np.array([1, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87], dtype=np.float32)
        )

        np.testing.assert_equal(
            outputs[3],
            np.array([1, 4, 10, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99, 108, 117, 126, 135, 144, 153, 162, 171, 180, 189, 198, 207, 216, 225, 234, 243, 252], dtype=np.float32)
        )

        np.testing.assert_equal(
            outputs[4],
            np.array([1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58], dtype=np.float32)
        )

        np.testing.assert_equal(
            outputs[5],
            np.array([1, 2, 5, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108, 112], dtype=np.float32)
        )

        np.testing.assert_equal(
            outputs[6],
            np.array([1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57], dtype=np.float32)
        )

        np.testing.assert_equal(
            outputs[7],
            np.array([1, 2, 3, 6, 9, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108], dtype=np.float32)
        )

And the layer’s code:

class AtrousConv1D(tf.layers.Layer):
    def __init__(self,
                 filters,
                 kernel_size,
                 dilation_rate,
                 use_bias=True,
                 kernel_initializer=tf.glorot_normal_initializer(),
                 causal=True
                ):
        super(AtrousConv1D, self).__init__()

        self.filters = filters
        self.kernel_size = kernel_size
        self.dilation_rate = dilation_rate
        self.causal = causal

        self.conv1d = tf.layers.Conv1D(
            filters=filters,
            kernel_size=kernel_size,
            dilation_rate=dilation_rate,
            padding='valid' if causal else 'same',
            use_bias=use_bias,
            kernel_initializer=kernel_initializer
        )

    def call(self, inputs):
        if self.causal:
            padding = (self.kernel_size - 1) * self.dilation_rate
            inputs = tf.pad(inputs, tf.constant([(0, 0,), (1, 0), (0, 0)]) * padding)

        return self.conv1d(inputs)
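
The expected arrays in the unit test can be reproduced by hand. Below is a small NumPy sketch of a single-channel causal dilated convolution. It is an illustration of the idea under simplifying assumptions (one filter, one channel, no bias), not the layer itself, and the function name is made up for this example. The left padding of (kernel_size - 1) * dilation_rate zeros is the same as in the call method above.

```python
import numpy as np

def causal_dilated_conv1d(signal, kernel, dilation_rate):
    kernel_size = len(kernel)
    # Pad only on the left, so output[t] never sees samples after t.
    padding = (kernel_size - 1) * dilation_rate
    padded = np.concatenate([np.zeros(padding), signal])
    output = np.zeros(len(signal))
    for t in range(len(signal)):
        # kernel_size taps, spaced dilation_rate apart, ending at sample t.
        taps = padded[t: t + padding + 1: dilation_rate]
        output[t] = np.dot(taps, kernel)
    return output

data = np.arange(1, 31, dtype=np.float64)
ones2 = np.ones(2)

first = causal_dilated_conv1d(data, ones2, dilation_rate=1)
second = causal_dilated_conv1d(first, ones2, dilation_rate=1)
dilated = causal_dilated_conv1d(data, ones2, dilation_rate=2)

print(first[:5])    # [1. 3. 5. 7. 9.]
print(second[:5])   # [ 1.  4.  8. 12. 16.]
print(dilated[:5])  # [1. 2. 4. 6. 8.]
```

With a kernel of ones, each output is simply the sum of the current sample and the one dilation_rate steps earlier, which is why the first few outputs equal the inputs until enough history is available.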

Residual Block layer

One aspect that hasn’t been covered yet is the heavy use of batch normalization. When coding the residual block layer, one of the most important tasks is ensuring that batch normalization is applied properly both during training and during inference.

Here’s the unit test:

@given(
    npst.arrays(
        np.float32,
        (4, 16000),
        elements=st.floats(-1, 1)
    ),
    st.sampled_from([64, 32]),
    st.sampled_from([7, 3]),
    st.sampled_from([1, 4]),
)
@settings(max_examples=10)
def test_residual_block_works(self, audio_waves, filters, size, dilation_rate):
    with tf.Graph().as_default() as g:
        audio_ph = tf.placeholder(tf.float32, (4, None))

        log_mel_op = LogMelSpectrogram(
            sampling_rate=16000,
            n_fft=512,
            frame_step=256,
            lower_edge_hertz=0,
            upper_edge_hertz=8000,
            num_mel_bins=10
        )(audio_ph)

        expanded_op = tf.layers.Dense(filters)(log_mel_op)

        _, block_op = ResidualBlock(
            filters=filters,
            kernel_size=size,
            causal=True,
            dilation_rate=dilation_rate
        )(expanded_op, training=True)

        # really dumb loss function just for the sake
        # of testing:
        loss_op = tf.reduce_sum(block_op)

        variables = tf.trainable_variables()
        self.assertTrue(any(["batch_normalization" in var.name for var in variables]))

        grads_op = tf.gradients(
            loss_op,
            variables
        )

        for grad, var in zip(grads_op, variables):
            if grad is None:
                note(var)

            self.assertTrue(grad is not None)

        with tf.Session(graph=g) as session:
            session.run(tf.global_variables_initializer())

            result, expanded, grads, _ = session.run(
                [block_op, expanded_op, grads_op, loss_op],
                {
                    audio_ph: audio_waves
                }
            )

            self.assertFalse(np.array_equal(result, expanded))
            self.assertEqual(result.shape, expanded.shape)
            self.assertEqual(len(grads), len(variables))
            self.assertFalse(any([np.isnan(grad).any() for grad in grads]))

And here’s the implementation:

class ResidualBlock(tf.layers.Layer):
    def __init__(self, filters, kernel_size, dilation_rate, causal, **kwargs):
        super(ResidualBlock, self).__init__(**kwargs)

        self.dilated_conv1 = AtrousConv1D(
            filters=filters,
            kernel_size=kernel_size,
            dilation_rate=dilation_rate,
            causal=causal
        )

        self.dilated_conv2 = AtrousConv1D(
            filters=filters,
            kernel_size=kernel_size,
            dilation_rate=dilation_rate,
            causal=causal
        )

        self.out = tf.layers.Conv1D(
            filters=filters,
            kernel_size=1
        )

    def call(self, inputs, training=True):
        data = tf.layers.batch_normalization(
            inputs,
            training=training
        )

        filters = self.dilated_conv1(data)
        gates = self.dilated_conv2(data)

        filters = tf.nn.tanh(filters)
        gates = tf.nn.sigmoid(gates)

        out = tf.nn.tanh(
            self.out(
                filters * gates
            )
        )

        return out + inputs, out
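
Why the training flag matters so much for batch normalization can be seen in a simplified NumPy sketch. It leaves out the learned scale and offset and normalizes over a single batch axis, so treat it as an illustration of the two modes rather than a replica of tf.layers.batch_normalization. The momentum and epsilon values are that function’s documented defaults.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training,
               momentum=0.99, epsilon=1e-3):
    if training:
        # Training: normalize with the statistics of the current batch
        # and nudge the moving averages towards them.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    else:
        # Inference: rely only on the accumulated moving averages.
        mean, var = moving_mean, moving_var
    normalized = (x - mean) / np.sqrt(var + epsilon)
    return normalized, moving_mean, moving_var

rng = np.random.RandomState(0)
batch = rng.normal(loc=5.0, scale=2.0, size=(256, 3))
moving_mean, moving_var = np.zeros(3), np.ones(3)

train_out, moving_mean, moving_var = batch_norm(
    batch, moving_mean, moving_var, training=True
)
infer_out, _, _ = batch_norm(batch, moving_mean, moving_var, training=False)

print(train_out.mean(axis=0))  # close to zero
print(infer_out.mean(axis=0))  # far from zero after a single update
```

After one update the moving averages are still far from the true statistics, so the same input produces very different outputs in the two modes. If the moving averages are never updated during training, inference stays broken, which is why the update ops need to run together with the training op.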

Residual Stack layer

Testing the residual stack follows the same kind of logic:

@given(
    npst.arrays(
        np.float32,
        (4, 16000),
        elements=st.floats(-1, 1)
    ),
    st.sampled_from([64, 32]),
    st.sampled_from([7, 3])
)
@settings(max_examples=10)
def test_residual_stack_works(self, audio_waves, filters, size):
    dilation_rates = [1,2,4]

    with tf.Graph().as_default() as g:
        audio_ph = tf.placeholder(tf.float32, (4, None))

        log_mel_op = LogMelSpectrogram(
            sampling_rate=16000,
            n_fft=512,
            frame_step=256,
            lower_edge_hertz=0,
            upper_edge_hertz=8000,
            num_mel_bins=10
        )(audio_ph)

        expanded_op = tf.layers.Dense(filters)(log_mel_op)

        stack_op = ResidualStack(
            filters=filters,
            kernel_size=size,
            causal=True,
            dilation_rates=dilation_rates
        )(expanded_op, training=True)

        # really dumb loss function just for the sake
        # of testing:
        loss_op = tf.reduce_sum(stack_op)

        variables = tf.trainable_variables()
        self.assertTrue(any(["batch_normalization" in var.name for var in variables]))

        grads_op = tf.gradients(
            loss_op,
            variables
        )

        for grad, var in zip(grads_op, variables):
            if grad is None:
                note(var)

            self.assertTrue(grad is not None)

        with tf.Session(graph=g) as session:
            session.run(tf.global_variables_initializer())

            result, expanded, grads, _ = session.run(
                [stack_op, expanded_op, grads_op, loss_op],
                {
                    audio_ph: audio_waves
                }
            )

            self.assertFalse(np.array_equal(result, expanded))
            self.assertEqual(result.shape, expanded.shape)
            self.assertEqual(len(grads), len(variables))
            self.assertFalse(any([np.isnan(grad).any() for grad in grads]))

With the layer’s code looking as follows:

class ResidualStack(tf.layers.Layer):
    def __init__(self, filters, kernel_size, dilation_rates, causal, **kwargs):
        super(ResidualStack, self).__init__(**kwargs)

        self.blocks = [
            ResidualBlock(
                filters=filters,
                kernel_size=kernel_size,
                dilation_rate=dilation_rate,
                causal=causal
            )
            for dilation_rate in dilation_rates
        ]

    def call(self, inputs, training=True):
        data = inputs
        skip = 0

        for block in self.blocks:
            data, current_skip = block(data, training=training)
            skip += current_skip

        return skip
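
A useful number to keep in mind when choosing dilation rates is the receptive field: how many spectrogram frames can influence a single output frame. The sketch below is a back-of-the-envelope calculation, assuming each residual block contributes one dilated convolution to the depth of the network (the filter and gate convolutions run in parallel on the same input). The function name is made up for this example.

```python
def receptive_field(kernel_size, dilation_rates, stacks=1):
    # A single frame sees itself; each dilated convolution then widens
    # the view by (kernel_size - 1) * dilation_rate frames.
    field = 1
    for _ in range(stacks):
        for rate in dilation_rates:
            field += (kernel_size - 1) * rate
    return field

# The configurations exercised by the unit test above:
print(receptive_field(3, [1, 2, 4]))  # 15
print(receptive_field(7, [1, 2, 4]))  # 43
# Stacking the same blocks twice roughly doubles the context:
print(receptive_field(3, [1, 2, 4], stacks=2))  # 29
```

Dilation lets the context grow quickly with depth while the number of weights per convolution stays the same.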

The SpeechNet

Finally, let’s add a very similar test for the SpeechNet itself:

@given(
    npst.arrays(
        np.float32,
        (4, 16000),
        elements=st.floats(-1, 1)
    )
)
@settings(max_examples=10)
def test_speech_net_works(self, audio_waves):
    with tf.Graph().as_default() as g:
        audio_ph = tf.placeholder(tf.float32, (4, None))

        logits_op = SpeechNet(
            experiment_params(
                {},
                stack_dilation_rates= [1, 2, 4],
                stack_kernel_size= 3,
                stack_filters= 32,
                alphabet= 'abcd'
            )
        )(audio_ph)

        # really dumb loss function just for the sake
        # of testing:
        loss_op = tf.reduce_sum(logits_op)

        variables = tf.trainable_variables()
        self.assertTrue(any(["batch_normalization" in var.name for var in variables]))

        grads_op = tf.gradients(
            loss_op,
            variables
        )

        for grad, var in zip(grads_op, variables):
            if grad is None:
                note(var)

            self.assertTrue(grad is not None)

        with tf.Session(graph=g) as session:
            session.run(tf.global_variables_initializer())

            result, grads, _ = session.run(
                [logits_op, grads_op, loss_op],
                {
                    audio_ph: audio_waves
                }
            )

            self.assertEqual(result.shape[2], 5)
            self.assertEqual(len(grads), len(variables))
            self.assertFalse(any([np.isnan(grad).any() for grad in grads]))

And let’s provide the code that passes it:

class SpeechNet(tf.layers.Layer):
    def __init__(self, params, **kwargs):
        super(SpeechNet, self).__init__(**kwargs)

        self.to_log_mel = LogMelSpectrogram(
            sampling_rate=params['sampling_rate'],
            n_fft=params['n_fft'],
            frame_step=params['frame_step'],
            lower_edge_hertz=params['lower_edge_hertz'],
            upper_edge_hertz=params['upper_edge_hertz'],
            num_mel_bins=params['num_mel_bins']
        )

        self.expand = tf.layers.Conv1D(
            filters=params['stack_filters'],
            kernel_size=1,
            padding='same'
        )

        self.stacks = [
            ResidualStack(
                filters=params['stack_filters'],
                kernel_size=params['stack_kernel_size'],
                dilation_rates=params['stack_dilation_rates'],
                causal=params['causal_convolutions']
            )
            for _ in range(params['stacks'])
        ]

        self.out = tf.layers.Conv1D(
            filters=len(params['alphabet']) + 1,
            kernel_size=1,
            padding='same'
        )

    def call(self, inputs, training=True):
        data = self.to_log_mel(inputs)

        data = tf.layers.batch_normalization(
            data,
            training=training
        )

        if len(data.shape) == 2:
            data = tf.expand_dims(data, 0)

        data = self.expand(data)

        for stack in self.stacks:
            data = stack(data, training=training)

        data = tf.layers.batch_normalization(
            data,
            training=training
        )

        return self.out(data) + 1e-8

The model function

We have one last piece of code to cover before we can start the training. It’s the model_fn that adheres to the TensorFlow Estimators API:

def model_fn(features, labels, mode, params):
    if isinstance(features, dict):
        audio = features['audio']
        original_lengths = features['length']
    else:
        audio, original_lengths = features

    lengths = compute_lengths(original_lengths, params)

    if labels is not None:
        codes = encode_labels(labels, params)

    network = SpeechNet(params)

    is_training = mode==tf.estimator.ModeKeys.TRAIN

    logits = network(audio, training=is_training)
    text, predicted_codes = decode_logits(logits, lengths, params)

    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            'logits': logits,
            'text': tf.sparse_tensor_to_dense(
                text,
                ''
            )
        }

        export_outputs = {
            'predictions': tf.estimator.export.PredictOutput(predictions)
        }

        return tf.estimator.EstimatorSpec(
            mode,
            predictions=predictions,
            export_outputs=export_outputs
        )
    else:
        loss = tf.reduce_mean(
            tf.nn.ctc_loss(
                labels=codes,
                inputs=logits,
                sequence_length=lengths,
                time_major=False,
                ignore_longer_outputs_than_inputs=True
            )
        )

        mean_edit_distance = tf.reduce_mean(
            tf.edit_distance(
                tf.cast(predicted_codes, tf.int32),
                codes
            )
        )

        distance_metric = tf.metrics.mean(mean_edit_distance)

        if mode == tf.estimator.ModeKeys.EVAL:
            return tf.estimator.EstimatorSpec(
                mode,
                loss=loss,
                eval_metric_ops={ 'edit_distance': distance_metric }
            )

        elif mode == tf.estimator.ModeKeys.TRAIN:
            global_step = tf.train.get_or_create_global_step()

            tf.summary.text(
                'train_predicted_text',
                tf.sparse_tensor_to_dense(text, '')
            )
            tf.summary.scalar('train_edit_distance', mean_edit_distance)

            update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
            with tf.control_dependencies(update_ops):
                train_op = tf.contrib.layers.optimize_loss(
                    loss=loss,
                    global_step=global_step,
                    learning_rate=params['lr'],
                    optimizer=(params['optimizer']),
                    update_ops=update_ops,
                    clip_gradients=params['clip_gradients'],
                    summaries=[
                        "learning_rate",
                        "loss",
                        "global_gradient_norm",
                    ]
                )

            return tf.estimator.EstimatorSpec(
                mode,
                loss=loss,
                train_op=train_op
            )

Using the API, we’ll get lots of stats in TensorBoard for free. It will also make it very easy to validate the model and to export it to the SavedModel format.
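
The decode_logits helper called in model_fn isn’t listed in this article. To give an intuition of what a CTC decoder does with the logits, here is a minimal greedy (best-path) decoding sketch in plain Python. It is an illustration only and may differ from what the notebook actually uses. It assumes the blank label is the last class, which matches the extra output filter in SpeechNet and the convention of tf.nn.ctc_loss.

```python
def ctc_greedy_decode(logits, alphabet):
    # One class per alphabet character plus a trailing blank class.
    blank = len(alphabet)
    # Pick the most likely class for every frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__)
        for frame in logits
    ]
    decoded = []
    previous = None
    for code in best_path:
        # Collapse repeated labels first, then drop the blanks.
        if code != previous and code != blank:
            decoded.append(alphabet[code])
        previous = code
    return ''.join(decoded)

def one_hot(index, size=3):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Frame-wise winners: a a <blank> a b b <blank>
logits = [one_hot(i) for i in (0, 0, 2, 0, 1, 1, 2)]
print(ctc_greedy_decode(logits, 'ab'))  # aab
```

The blank between the two runs of "a" is what allows the decoder to output a doubled letter. Without it, the repeated frames would collapse into a single character.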

In order to easily experiment with different hyperparameters, I’ve also created a helper function as listed below:

import copy

def experiment(data_params=dataset_params(), **kwargs):
    params = experiment_params(
        data_params,
        **kwargs
    )

    print(params)

    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        model_dir='stats/{}'.format(experiment_name(params)),
        params=params
    )

    train_spec = tf.estimator.TrainSpec(
        input_fn=input_fn(
            train_data,
            params['data']
        )
    )

    features = {
        "audio": tf.placeholder(dtype=tf.float32, shape=[None]),
        "length": tf.placeholder(dtype=tf.int32, shape=[])
    }

    serving_input_receiver_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
        features
    )

    best_exporter = tf.estimator.BestExporter(
      name="best_exporter",
      serving_input_receiver_fn=serving_input_receiver_fn,
      exports_to_keep=5
    )

    eval_params = copy.deepcopy(params['data'])
    eval_params['augment'] = False

    eval_spec = tf.estimator.EvalSpec(
        input_fn=input_fn(
            eval_data,
            eval_params
        ),
        throttle_secs=60*30,
        exporters=best_exporter
    )

    tf.estimator.train_and_evaluate(
        estimator,
        train_spec,
        eval_spec
    )

There are two more helpers: one to test the model’s accuracy and one to get the test set predictions:

def test(data_params=dataset_params(), **kwargs):
    params = experiment_params(
        data_params,
        **kwargs
    )

    print(params)

    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        model_dir='stats/{}'.format(experiment_name(params)),
        params=params
    )

    eval_params = copy.deepcopy(params['data'])
    eval_params['augment'] = False
    eval_params['epochs'] = 1
    eval_params['shuffle'] = False

    estimator.evaluate(
        input_fn=input_fn(
            test_data,
            eval_params
        )
    )

def predict_test(**kwargs):
    params = experiment_params(
        dataset_params(
            augment=False,
            shuffle=False,
            batch_size=1,
            epochs=1,
            parallelize=False
        ),
        **kwargs
    )

    print(len(test_data))

    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        model_dir='stats/{}'.format(experiment_name(params)),
        params=params
    )

    return list(
        estimator.predict(
            input_fn=input_fn(
                test_data,
                params['data']
            )
        )
    )

These depend on the following helper functions:

def experiment_params(data,
                      optimizer='Adam',
                      lr=1e-4,
                      alphabet=" 'abcdefghijklmnopqrstuvwxyz",
                      causal_convolutions=True,
                      stack_dilation_rates=[1, 3, 9, 27, 81],
                      stacks=2,
                      stack_kernel_size=3,
                      stack_filters=32,
                      sampling_rate=16000,
                      n_fft=160*4,
                      frame_step=160,
                      lower_edge_hertz=0,
                      upper_edge_hertz=8000,
                      num_mel_bins=160,
                      clip_gradients=None,
                      codename='regular',
                      **kwargs):
    params = {
        'optimizer': optimizer,
        'lr': lr,
        'data': data,
        'alphabet': alphabet,
        'causal_convolutions': causal_convolutions,
        'stack_dilation_rates': stack_dilation_rates,
        'stacks': stacks,
        'stack_kernel_size': stack_kernel_size,
        'stack_filters': stack_filters,
        'sampling_rate': sampling_rate,
        'n_fft': n_fft,
        'frame_step': frame_step,
        'lower_edge_hertz': lower_edge_hertz,
        'upper_edge_hertz': upper_edge_hertz,
        'num_mel_bins': num_mel_bins,
        'clip_gradients': clip_gradients,
        'codename': codename
    }

    if kwargs is not None and 'data' in kwargs:
        params['data'] = { **params['data'], **kwargs['data'] }
        del kwargs['data']

    if kwargs is not None:
        params = { **params, **kwargs }

    return params

def experiment_name(params, excluded_keys=['alphabet', 'data', 'lr', 'clip_gradients']):

    def represent(key, value):
        if key in excluded_keys:
            return None
        else:
            if isinstance(value, list):
                return '{}_{}'.format(key, '_'.join([str(v) for v in value]))
            else:
                return '{}_{}'.format(key, value)

    parts = filter(
        lambda p: p is not None,
        [
            represent(k, params[k])
            for k in sorted(params.keys())
        ]
    )

    return '/'.join(parts)

Each new set of hyperparameters constitutes a different “experiment”. It will output separate statistics in TensorBoard that are going to be easily filterable.
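
To see what such an experiment name looks like, here is a condensed, self-contained copy of experiment_name applied to a small, made-up set of parameters. The logic follows the listing above. Only the sample values are invented for this illustration.

```python
def experiment_name(params,
                    excluded_keys=('alphabet', 'data', 'lr', 'clip_gradients')):
    parts = []
    for key in sorted(params):
        if key in excluded_keys:
            continue
        value = params[key]
        if isinstance(value, list):
            # Lists are flattened into a single underscore-separated token.
            value = '_'.join(str(v) for v in value)
        parts.append('{}_{}'.format(key, value))
    # Every parameter becomes one directory level under stats/.
    return '/'.join(parts)

params = {
    'codename': 'regular',
    'lr': 1e-4,
    'stacks': 2,
    'stack_dilation_rates': [1, 3, 9],
}

name = experiment_name(params)
print(name)  # codename_regular/stack_dilation_rates_1_3_9/stacks_2

# The learning rate is excluded, so changing only lr maps to the same directory:
print(experiment_name({**params, 'lr': 1e-3}) == name)  # True
```

Because lr and clip_gradients are excluded from the name, two runs that differ only in those values share a model directory. Changing the codename is one way to keep such runs apart.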

The experiment function uses TensorFlow’s tf.estimator.train_and_evaluate function, which periodically tests the model against the validation set. This is our tool for gauging how well it generalizes. It also uses the tf.estimator.BestExporter class to automatically export SavedModel files for the best-performing versions.

Other aspects

Covering the full code listing wouldn’t be very practical for an article like this. We’ve covered the most important parts above. I invite you to have a look at the Jupyter notebook itself, which is hosted on GitHub: kamilc/speech-recognition.

Let’s train it

Before we can dive in and start training the model using the code above, we need to set a few things up.

First of all, I’m using Docker. This way I’m not constrained by, for example, the version of CUDA I’d otherwise have to install.

Here’s the Dockerfile for this project:

FROM tensorflow/tensorflow:latest-devel-gpu-py3

RUN apt-get update
RUN apt-get install -y ffmpeg git cmake

RUN pip install matplotlib pandas scikit-learn librosa seaborn hickle hypothesis[pandas]

RUN mkdir -p /home/data-science/projects
VOLUME /home/data-science/projects

RUN echo "c.NotebookApp.token = ''" >> ~/.jupyter/jupyter_notebook_config.py
RUN echo "c.NotebookApp.password = ''" >> ~/.jupyter/jupyter_notebook_config.py

WORKDIR /home/data-science/projects

RUN pip install git+https://github.com/Supervisor/supervisor && \
  mkdir -p /var/log/supervisor

ADD supervisor.conf /etc/supervisor.conf

EXPOSE 80
EXPOSE 6006

CMD supervisord -c /etc/supervisor.conf

I also like to make my life easier by providing a Makefile that automates common project-related tasks:

build:
    nvidia-docker build -t speech-recognition:latest .
run:
    nvidia-docker run -p 80:80 -p 6006:6006 --shm-size 16G --mount type=bind,source=/home/kamil/projects/speech-recognition,target=/home/data-science/projects -it speech-recognition
bash:
    nvidia-docker run --mount type=bind,source=/home/kamil/projects/speech-recognition,target=/home/data-science/projects -it speech-recognition bash

We’ll use TensorBoard to visualize the progress. At the same time, we need a Jupyter notebook server to be running as well. We’ll use a supervisor daemon to run both at the same time in a single container. Here’s its config file:

[supervisord]
nodaemon=true

[program:jupyter]
command=bash -c "source /etc/bash.bashrc && jupyter notebook --notebook-dir=/home/data-science/projects --ip 0.0.0.0 --no-browser --allow-root --port=80"

[program:tensorboard]
command=tensorboard --logdir /home/data-science/projects/stats

Before you can run the Jupyter notebook and start experimenting, you’ll need to build the image by running the following in the command line:

make build

And then to start the container with TensorFlow, Jupyter, and TensorBoard:

make run

The notebook includes a helper function for running experiments. Here’s the invocation with the set of parameters that worked best for me:

experiment(
    dataset_params(
        batch_size=18,
        epochs=10,
        max_wave_length=320000,
        augment=True,
        random_noise=0.75,
        random_noise_factor_min=0.1,
        random_noise_factor_max=0.15,
        random_stretch_min=0.8,
        random_stretch_max=1.2
    ),
    codename='deep_max_20_seconds',
    alphabet=' !"&\',-.01234:;\\abcdefghijklmnopqrstuvwxyz', # !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz
    causal_convolutions=False,
    stack_dilation_rates=[1, 3, 9, 27],
    stacks=6,
    stack_kernel_size=7,
    stack_filters=3*128,
    n_fft=160*8,
    frame_step=160*4,
    num_mel_bins=160,
    optimizer='Momentum',
    lr=0.00001,
    clip_gradients=20.0
)

The training process takes a lot of time. On my machine, it took more than 2 weeks. Searching for the best set of parameters is very difficult (and not fun).

The function accepts max_text_length as one of its parameters. I first ran the experiments with it set to a small value (e.g. 15 characters), which constrains the data set to a narrow set of “easy” files. The reason is that it’s easy to spot issues with the architecture on an easy set: if the model isn’t converging there, then we surely have a bug.

For the main training procedure, this parameter is kept unset.
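To make the effect of this constraint concrete, here is a minimal sketch of the filtering idea. The function name, the `(audio_path, transcript)` pair layout, and the sample file names are assumptions made for illustration. The notebook's own implementation may differ.

```python
def constrain_by_text_length(samples, max_text_length=None):
    """Keep only samples whose transcript fits within max_text_length.

    `samples` is a hypothetical list of (audio_path, transcript) pairs.
    None means "no constraint", mirroring the parameter being left unset.
    """
    if max_text_length is None:
        return list(samples)
    return [(path, text) for path, text in samples if len(text) <= max_text_length]


# Hypothetical samples, used only to show the effect of the constraint.
samples = [
    ("a.mp3", "and you know it"),
    ("b.mp3", "strange images passed through my mind"),
]
easy = constrain_by_text_length(samples, max_text_length=15)
print(easy)
```

With a limit of 15 characters only the short utterance survives, which is the kind of "easy" subset that is useful for spotting architecture bugs early.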

Results

By using TensorBoard, we get a handy tool for monitoring the progress. I made the model_fn output statistics for the training set edit distance as well as the one for the evaluation set.

The statistics for the CTC Loss are included by default.

Here are the charts for the final model included in the GitHub repo:

A thing to notice is that I paused the training between the 20th and 30th December.

The above chart presents the training time edit distance. Because of the pretty aggressive data augmentation, I noticed that throughout the whole process the training and validation edit distances didn’t differ hugely.

The following image shows the CTC loss, with the orange line representing the evaluation runs.

The evaluation edit distance is shown below. I stopped the training once a whole day of further training improved it by less than 0.005.

Every machine learning model should be rigorously measured against meaningful accuracy statistics. Let’s see how we did:

test(
    dataset_params(
        batch_size=18,
        epochs=10,
        max_wave_length=320000,
        augment=True,
        random_noise=0.75,
        random_noise_factor_min=0.1,
        random_noise_factor_max=0.15,
        random_stretch_min=0.8,
        random_stretch_max=1.2
    ),
    codename='deep_max_20_seconds',
    alphabet=' !"&\',-.01234:;\\abcdefghijklmnopqrstuvwxyz', # !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz
    causal_convolutions=False,
    stack_dilation_rates=[1, 3, 9, 27],
    stacks=6,
    stack_kernel_size=7,
    stack_filters=3*128,
    n_fft=160*8,
    frame_step=160*4,
    num_mel_bins=160,
    optimizer='Momentum',
    lr=0.00001,
    clip_gradients=20.0
)

The output:

(...)
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-07-10:51:09
INFO:tensorflow:Saving dict for global step 1525345: edit_distance = 0.07922124, global_step = 1525345, loss = 13.410753
(...)

This shows that we’ve scored 0.079 in edit distance on the test set. We could (somewhat naively) invert it and call it accuracy, which gives 92.1%, not too bad. The result would be officially reported as a 7.9% LER (label error rate).
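For readers who haven't met the metric before, below is a small sketch of how an edit distance between a predicted and a reference transcript can be computed and normalized. It is plain Python, not the TensorFlow op used by the model. Dividing by the reference length is an assumption about the normalization, and the reference sentence is a guess at the intended transcript.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def label_error_rate(predicted, reference):
    """Edit distance divided by the reference length (assumed normalization)."""
    return edit_distance(predicted, reference) / len(reference)


predicted = "strange images pased through my mind"
reference = "strange images passed through my mind"  # assumed ground truth
print(edit_distance(predicted, reference))
print(label_error_rate(predicted, reference))
```

Averaging this per-utterance value over the whole test set gives a single number comparable in spirit to the `edit_distance` reported in the log above.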

What’s even nicer is the size of the model:

ls stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/export/best_exporter/1546198558/variables -lh
total 204M

That’s 204MB for the model trained on the 375k+ dataset with aggressive augmentation (which makes the resulting dataset size effectively a couple times bigger).

It’s always nice to see what the results look like. Here’s the code that runs the model through the whole test set and gathers the predicted transcriptions:

test_results = predict_test(
    codename='deep_max_20_seconds',
    alphabet=' !"&\',-.01234:;\\abcdefghijklmnopqrstuvwxyz', # !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz
    causal_convolutions=False,
    stack_dilation_rates=[1, 3, 9, 27],
    stacks=6,
    stack_kernel_size=7,
    stack_filters=3*128,
    n_fft=160*8,
    frame_step=160*4,
    num_mel_bins=160,
    optimizer='Momentum',
    lr=0.00001,
    clip_gradients=20.0
)
[ b''.join(t['text']) for t in test_results ]

Here’s an excerpt of the output:

[b'without the dotaset the artice suistles',
 b"i've got to go to him",
 b'and you know it',
 b'down below in the darknes were hundrededs of people sleping in peace',
 b'strange images pased through my mind',
 b'the shep had taught him that',
 b'it was glaringly hot not a clou in hesky nor a breath of wind',
 b'your son went to serve at a distant place and became a cinturion',
 b'they made a boy continue tiging but he found nothing',
 b'the shoreas in da',
 b'fol the instructions here',
 b"the're caling to u not to give up and to kep on fighting",
 b'the shop was closed on monis',
 b'even coming down on the train together she wrote me',
 b"i'm going away he said",
 b"he wasn't asking for help",
 b'some of the grynsh was faling of the circular edge',
 b"i'd like to think",
 b'the alchemist robably already knew al that',
 b"you 'l take fiftly and like et",
 b'it was droping of in flakes and raining down on the sand',
 b"what's your name he asked",
 b"it's because you were not born",
 b'what do you think of that',
 b"if i had told tyo o you wouldn't have sen the pyramids",
 b"i havn't hert the baby complain yet",
 b'i told him wit could teach hr to ignore people who was had tend',
 b"the one you're blocking",
 b'henderson stod up with a spade in his hand',
 b"he didn't ned to sek out the old woman for this",
 b'only a minority of literature is reaten this way',
 b"i wish you wouldn't",
 ...]

Seems quite okay. You can immediately notice that some words are misspelled. This stems from the nature of the CTC algorithm itself. We’re predicting letters instead of words here. The good side is that the problem of out-of-vocabulary words is lessened. The worse part is that you’ll get e.g. ‘sek’ sometimes instead of ‘seek’. Because we’re outputting the logits for each example, it’s possible to use e.g. the CTCWordBeamSearch to constrain the output’s tokens to ones known within the corpus — making it predict the words instead.

Here’s the last little fun test: speech to text on the utterance I created on my laptop:

results = predict(
    'cv_corpus_v1/test-me.m4a',
    codename='deep_max_20_seconds',
    alphabet=' !"&\',-.01234:;\\abcdefghijklmnopqrstuvwxyz', # !"&',-.01234:;\abcdefghijklmnopqrstuvwxyz
    causal_convolutions=False,
    stack_dilation_rates=[1, 3, 9, 27],
    stacks=6,
    stack_kernel_size=7,
    stack_filters=3*128,
    n_fft=160*8,
    frame_step=160*4,
    num_mel_bins=160,
    optimizer='Momentum',
    lr=0.00001,
    clip_gradients=20.0
)
b''.join(results[0]['text'])

The result:

b'it semed to work just fine'

Project on GitHub

The full Jupyter notebook’s code for this article can be found on GitHub: kamilc/speech-recognition.

The repository includes the bz2 archive of the best performing model I’ve trained. You can download it and run it as a web service via TensorFlow Serving, which we will cover in the next and last section here.

Serving the model with TensorFlow Serving

The last step in this project is to serve our trained model as a web service. Thankfully, the TensorFlow project includes a ready to use “model server” that’s free to use: TensorFlow Serving.

The idea behind it is that we can run it, pointing it at the directory containing the models saved in TensorFlow’s SavedModel format.

The deployment is extremely straightforward if you’re okay with running it from a Docker container. Let’s first pull the image:

docker pull tensorflow/serving

Next, we need to download the saved model we’ve trained in this article from GitHub:

$ wget https://github.com/kamilc/speech-recognition/raw/master/best.tar.bz2
$ tar xvjf best.tar.bz2

In the next step, we need to start a container from the TensorFlow Serving image in a way that:

opens its port to the outside
mounts the directory containing our model
sets the MODEL_NAME environment variable

As follows:

docker run -t --rm -p 8501:8501 -v "/home/kamil/projects/speech-recognition/best/1546646971:/models/speech/1" -e MODEL_NAME=speech tensorflow/serving

The service communicates via JSON payloads. Let’s prepare a payload.json file containing our request payload:

{"inputs": {"audio": <audio-data-here>, "length": <audio-raw-signal-length-here>}}

We can now easily query the web service with the prepared request audio data:

curl -d @payload.json \
   -X POST http://localhost:8501/v1/models/speech:predict

Here’s what our intelligent web service responds with:

{
    "outputs": {
        "text": [
            [
                "c",
                "e",
                "v",
                "e",
                "r",
                "y",
                "t",
                "h",
                "i",
                "n",
                "g",
                " ",
                "i",
                "n",
                " ",
                "t",
                "h",
                "e",
                " ",
                "u",
                "n",
                "i",
                "v",
                "e",
                "r",
                "s",
                " ",
                "o",
                "v",
                "a",
                "l",
                "s",
                "h",
                "e",
                " ",
                "t",
                "e",
                "d",
                "i",
                "n",
                " ",
                "a",
                "w",
                "i",
                "t",
                " ",
                "j",
                "g",
                "m",
                "f",
                "t",
                "a",
                "r",
                "y",
                "s",
                "e"
            ]
        ],
        "logits": [
            [
                <logits-here>
            ]
        ]
    }
}
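The `text` output comes back as one list of characters per utterance, so a client needs one more step to get a readable string. Here's a minimal sketch of that step. The response body is a shortened, made-up example in the same shape as the one above, with the logits left out.

```python
import json

# A shortened, hypothetical response body in the same shape as the one above.
response_body = (
    '{"outputs": {"text": [["e", "v", "e", "r", "y", "t", "h", "i", "n", "g"]]}}'
)


def transcriptions(body):
    """Join the per-character predictions of each utterance into a string."""
    outputs = json.loads(body)["outputs"]
    return ["".join(characters) for characters in outputs["text"]]


print(transcriptions(response_body))
```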

Image Recognition Tools

2018-10-10T00:00:00+00:00

I’m always impressed with the advancement of machine learning, and, more recently, deep learning. However, since I am not an expert in the field I decided to let the researchers and scholars elaborate more on them.

In this post I will share the existing tools and the associated libraries to make them work, at least for me.

The reason I explored these tools is simple: I plan to deploy a poor man’s security camera in my home with some “sense” of intelligence. Since I am working at home, I want to know who is actually knocking on my door. So I thought, what if I could use a web cam to monitor my door and let me know who’s actually standing there?

Face Detection

I searched around for existing face detection software and found this Python script using Haarcascade. So I was able to detect faces, but upon sharing the “findings” with a friend he said this only detects faces. How would the computer be able to recognize who’s who? Then I stumbled upon the phrase “face recognition”.

You might have noticed that if you use an image file imported directly from your smartphone, the output will be displayed on the screen at a very large size. You can use ImageMagick to resize the file to, say, 640x480 pixels.

$ file makan.jpg
makan.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, Exif Standard: [TIFF image data, big-endian, direntries=15, height=3120, bps=0, width=4160], baseline, precision 8, 4160x3120, frames 3

$ convert makan.jpg -resize 640x480 makan-small.jpg

$ file makan-small.jpg
makan-small.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, big-endian, direntries=15, height=3120, bps=0, width=4160], baseline, precision 8, 640x480, frames 3

Machine Vision

The computer doesn’t see the image directly as humans do, so we need to convert the images into numerical values. For example, in the facial recognition tools, the training file contains the following matrices:

opencv_lbphfaces:
   threshold: 1.7976931348623157e+308
   radius: 1
   neighbors: 8
   grid_x: 8
   grid_y: 8
   histograms:
      - !!opencv-matrix
         rows: 1
         cols: 16384
         dt: f
         data: [ 2.46913582e-02, 1.85185187e-02, 0., 3.08641978e-03,
             1.23456791e-02, 6.17283955e-03, 3.08641978e-03,
             2.46913582e-02, 0., 0., 0., 0., 0., 3.08641978e-03, 0.,
             9.25925933e-03, 1.85185187e-02, 9.25925933e-03, 0., 0.,
             3.08641978e-03, 0., 0., 0., 3.08641978e-03, 0., 0., 0.,
             2.46913582e-02, 3.08641978e-03, 0., 6.79012388e-02, 0., 0.,
		...................
             1.30385486e-02, 1.47392293e-02, 4.53514745e-03,
             1.13378686e-03, 7.93650839e-03, 5.66893432e-04,
             5.66893432e-04, 1.13378686e-03, 6.80272095e-03,
             2.26757373e-03, 0., 0., 5.66893443e-03, 2.83446722e-03,
             5.10204071e-03, 9.07029491e-03, 7.14285746e-02 ]
   labels: !!opencv-matrix
      rows: 26
      cols: 1
      dt: i
      data: [ 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5, 5,
          6, 6, 8, 8, 8, 8 ]
   labelsInfo:
      []

Face Recognition

I continued my search for existing face recognition software and found several projects which could be tested right away, with some modifications from the original source. I found one tutorial which explained clearly how we could get the face recognition working from the web camera, in real time.

If the code provided in the video doesn’t work right away, you could try my small patches to the source file from here, in which I corrected a typo and extended the list of accepted filename extensions.

My daughter Aufa is joining me in this facial recognition session.

Apart from that there is also a fork on GitHub which allows us to do the real-time face recognition. For now, however, some manual work needs to be done in order to add more datasets (images of faces) if you want to use the code right away.

Obviously I am not Tom Cruise.

Object Recognition

I also searched for more related software which could possibly provide an alternative to the face recognition. I found quite an interesting piece of work for object detection using neural networks. It runs on a framework called Darknet. It allows us to do post-processing object detection for still pictures and videos. It can also do real-time object recognition but requires a GPU to do it efficiently. I tried the CPU-only mode but I could not get a real-time result (my computer almost crashed).

Still image samples

Video samples

This video was taken on Lebuhraya Utara Selatan (freeway) in Malaysia
Another from Lebuhraya Utara Selatan (freeway) in Malaysia
Two kids playing with bubbles
This video was taken on a boat, with several people floating in the sea wearing their life jackets
My kid and I walking on the beach in Western Australia
Here’s a kid riding a small horse

Vehicle Counting and Speed Measurement

I found a tool developed by Ahmet Ozlu which uses TensorFlow. The use case here is vehicle counting, vehicle type and color recognition, and speed detection.

You can see how it works in the following video.

Libraries

OpenCV

OpenCV is an open source library for computer vision, which comes together with libraries which we can use for our detection and recognition work.

In my understanding, the face detection will come first and the recognition second. In newer digital cameras and smartphones facial detection is quite common. Social media applications sometimes use facial recognition to suggest similar faces to be tagged in photo albums, or for photo album reorganization.

Tools based on or making use of OpenCV

Apart from the custom-written Python code which uses OpenCV and NumPy, I also found out there are several works which use neural networks (in some cases together with TensorFlow) and an algorithm called YOLO (You Only Look Once). They are:

darknet (written in C)
darkflow (written with Python and seems to work as a wrapper for darknet). You need to install different dependencies than for darknet, for example Cython and TensorFlow. The good thing is that we could use this tool for video post-processing, where instead of taking input directly from a webcam, we take it from existing videos. However, if you want to use the latest YOLO algorithm, then just stick to Darknet rather than using Darkflow. There is a fork on GitHub which could allow Darknet to save the output of the processed video into a file as well.

To rotate a video that was taken with a smartphone held upside down (rotated by 180 degrees):

ffmpeg -i sourcefile.mp4 -vf "transpose=4" fileout.mp4

The transpose value depends on the nature of the rotation and on whether it’s clockwise or counter-clockwise: a value of 1 rotates the video by 90 degrees clockwise, and 2 rotates it by 90 degrees counter-clockwise.

To convert the video to a slower framerate:

ffmpeg -i sourcefile.avi -r 8 fileout.mp4

For the Darkflow tool, the default output is in AVI format, but ffmpeg allows us to convert it to MP4 if you want.

ImageAI

ImageAI is a Python-based computer vision library which uses TensorFlow, Keras, Matplotlib and several other dependencies that are common in machine learning. In terms of usage, it is similar to darkflow.

Conclusion

The advancement of the AI field contributes a lot of useful automation to life. It can range from helping detect tumors, helping search and rescue missions, reducing keystrokes with keyword predictions, to spam filtering. AI also accelerates the field of image processing and pattern recognition.

The hard work of many smart people and scholars has produced many smart solutions that help people live a better life with the use of AI. As I have shown, some of these tools could achieve better detection given a good amount of samples to be trained on and the correct size of picture to be detected.

The tools above will work as-is, but may need some tweaking/editing if you want to customize them. For example, some of the code works only with its own demos, so you may need to pass an argument such as sys.argv[] inside the Python code if you want to process your own video.

Self-driving toy car using the Asynchronous Advantage Actor-Critic algorithm

2018-08-29T00:00:00+00:00

The field of Reinforcement Learning has seen a lot of great improvement in the past years. Researchers at universities and companies like DeepMind have been developing new and better ways to train intelligent, artificial agents to solve more and more difficult tasks. The algorithms being developed require less time to train and make the training much more stable.

This article is about an algorithm that’s one of the most cited lately: A3C — Asynchronous Advantage Actor-Critic.

As the subject is both wide and deep, I’m assuming the reader has the relevant background mastered already. Although reading it might be interesting even without understanding most of the notions in use, having a good grasp of them will help you get the most out of it.

Because we’re looking at the Deep Reinforcement Learning, the obvious requirement is to be acquainted with the neural networks. I’m also using different notions known in the field of Reinforcement Learning overall like $Q(a, s)$ and $V(s)$ functions or the n-step return. The mathematical expressions, in particular, are given assuming that the reader already knows what the symbols stand for. Some notions known from other families of RL algorithms are being touched on as well (e.g. experience replay) — to contrast them with the A3C way of solving the same kind of problems. The article along with the source code uses the OpenAI gym, Python, and PyTorch among other Python-related libraries.

Theory

The A3C algorithm is a part of the greater class of RL algorithms called Policy Gradients.

In this approach, we’re creating a model that approximates the action-choosing policy itself.

Let’s contrast it with value iteration, the goal of which is to learn the value function and have policy emerge as the function that chooses an action transitioning to the state of the greatest value.

With the policy gradient approach, we’re approximating the policy with a differentiable function. Stated this way, the problem requires only a good approximation of the gradient that over time will maximize the rewards.

The unique approach of A3C adds a very clever twist: we’re also learning an approximation of the value function at the same time. This helps us in getting the variance of the gradient down considerably, making the training much more stable.

These two aspects of the algorithm are reflected in its name: actor-critic. The policy function approximation is called the actor, while the value function is called the critic.

The policy gradient

As we’ve noticed already, in order to improve our policy function approximation, we need a gradient that points at the direction that maximizes the rewards.

I’m not going to reinvent the wheel here. There are some great resources the reader can access to dig deep into the mathematics of what’s called the Policy Gradient Theorem.

The following equation presents the basic form of the gradient of the policy function:

$$\nabla_{\theta} J(\theta) = E_{\tau}[\,R_{\tau}\cdot\nabla_\theta\,\sum_{t=0}^{T-1}\,\log\,\pi(a_t|s_t;\theta)\,]$$

This states that for each sampled trajectory $\tau$, the correct estimate of the gradient is the expected value of the rewards times the action probabilities moved into the log space. Ascending in this direction makes our rewards greater and greater over time.

We can derive all the needed intermediary gradients ourselves by hand of course. Because we’re using PyTorch though, we only need the right loss function.

Let’s figure out the right loss function formula that will produce the gradient as shown above:

$$L_\theta=-J(\theta)$$

Also:

$$J(\theta)=E_\tau[R_\tau\cdot\sum_{t=0}^{T-1}\,\log\,\pi(a_t|s_t;\theta)]$$

Hence:

$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}R_t\,\cdot\,\log\pi(a_t|s_t;\theta)$$

Formalizing the accumulation of rewards

For now, we’ve been using the $R_\tau$ and $R_t$ terms very abstractly. Let’s make this part more intuitive and concrete now.

Its true meaning really is “the quality of the sampled trajectory”. Consider the following equation:

$$R_t=\sum_{i=t}^{N+t}\gamma^{i-t}r_i\,+\,\gamma^{N+1}V(s_{t+N+1})$$

Each $r_i$ is the reward received from the environment after each step. Each trajectory consists of multiple steps. Each time, we’re sampling actions based on our policy function. This gives probabilities of a given action being best given the state.

What if we’re taking 5 actions for which we’re not being given any reward but overall it helped us get rewarded in the 6th step? This is exactly the case we’ll be dealing with in this article later when training a toy car to drive based only on pixel values of the scene. In that environment, we’ll be given $-0.1$ “negative” reward each step and something close to $7$ each new “tile” the car stays on the road.

We need a way to still encourage actions that make us earn rewards in a not too distant future. We also need to be smart and discount future rewards somewhat so that the more immediate the reward is to our action, the more emphasis we put on it.

That’s exactly what the above equation does. Notice that $\gamma$ becomes a hyper-parameter. It makes sense to give it a value from $(0, 1)$. Let’s consider the following list of rewards: $[r_1, r_2, r_3, r_4]$. For $r_1$, the formula for the discounted accumulated reward is:

$$R_1=r_1\,+\,\gamma r_2\,+\,\gamma^2r_3\,+\,\gamma^3r_4\,+\,\gamma^4V(s_5)$$

For $r_2$ it’s:

$$R_2=r_2\,+\,\gamma r_3\,+\,\gamma^2r_4\,+\,\gamma^3V(s_5)$$

And so on… When we hit a terminal state, which has no “next” state, we substitute $0$ for $V(s_{t+N+1})$.
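The general formula for $R_t$ can be computed for every step of a sampled segment in a single backward pass. The sketch below is plain Python rather than the article's PyTorch code. The function name, the reward values and the bootstrap value are all made up for illustration.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted accumulated reward R_t for every step of a sampled segment.

    bootstrap_value is V(s) of the state that follows the last reward,
    or 0.0 when that state is terminal.
    """
    returns = []
    accumulated = bootstrap_value
    # Walk backwards: R_t = r_t + gamma * R_{t+1}, starting from V(s_last).
    for reward in reversed(rewards):
        accumulated = reward + gamma * accumulated
        returns.append(accumulated)
    returns.reverse()
    return returns


rewards = [-0.1, -0.1, -0.1, 7.0]  # hypothetical rewards r_1..r_4
print(n_step_returns(rewards, bootstrap_value=2.0, gamma=0.9))
```

Walking backwards avoids recomputing the powers of $\gamma$ for each step, and passing `0.0` as the bootstrap value covers the terminal-state case.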

We’ve said that in A3C we’re learning the value function at the same time. The $R_t$ as described above becomes the target value when training our $V(s)$. The value function becomes an approximation of the average of the rewards given the state (because $R_t$ depends on us sampling actions in this state).

Making the gradients more stable

One of the greatest inhibitors of the policy gradient performance is what’s broadly called “high variance”.

I have to admit, the first time I saw that term in this context, I was disoriented. I knew what “variance” was. It’s the “variance of what” that was not clear to me.

Thankfully I found a brilliant answer to this question. It explains the issue simply yet in detail.

Let me cite it here:

When we talk about high variance in the policy gradient method, we’re specifically talking about the facts that the variance of the gradients are high — namely, that $Var(\nabla_{\theta} J(\theta))$ is big.

To put it in simple terms: because we’re sampling trajectories from the space that is stochastic in nature, we’re bound to have those samples give gradients that disagree a lot on the best direction to take our model’s parameters into.

I encourage the reader to pause now and read the above-mentioned answer as it’s very vital. The gist of the solution described in it is that we can subtract a baseline value from each $R_t$. An example of a good baseline that was given was to make it into an average of the sampled accumulated rewards. The A3C algorithm uses this insight in a very, very clever way.

Value function as a baseline

To learn the $V(s)$ we’re typically using the MSE or Huber loss against the accumulated rewards for each step. This means that over time we’re averaging those rewards out based on the state we’re finding ourselves in.

Improving our gradient formula with those ideas we now get:

$$\nabla_{\theta} J(\theta) = E_{\tau}[\,\nabla_\theta\,\sum_{t=0}^{T-1}\,\log\,\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))\,]$$

It’s important to treat the $(R_t-V(s_t))$ term as a constant. This means that when using PyTorch or any other deep learning framework, the computation of it should occur outside the graph that influences the gradients.

The enhanced part of the equation is where we get the word “advantage” in the algorithm’s name. The advantage is simply the difference between the accumulated rewards and what those rewards are on average for the given state:

$$A(a_{t..t+n},s_{t..t+n})=R_t(a_{t..t+n},s_{t..t+n})-V(s_t)$$

If we write $R_t$ as $Q(s,a)$, as is common in the literature, we arrive at the formula:

$$A(s,a)=Q(s,a) - V(s)$$

What’s the intuition here? Imagine that you’re playing chess with a 5-year-old. You win by a huge margin. Your friend who’s watched lots of master-level games observed this one as well. His take is that even though you scored positively, you still made lots of mistakes. You’ve got your critic here. Your score and what it looks like for the “observing critic” combined is what we call the advantage of the actions you took.

Guarding against the model’s overconfidence

Although he was warned, Icarus was too young and too enthusiastic about flying. He got excited by the thrill of flying and carried away by the amazing feeling of freedom and started flying high to salute the sun, diving low to the sea, and then up high again. His father Daedalus was trying in vain to make young Icarus to understand that his behavior was dangerous, and Icarus soon saw his wings melting. Icarus fell into the sea and drowned.

The Myth Of Daedalus And Icarus

The job of an “actor” is to output probability values for each possible action the agent can take. The greater the probability, the greater the model’s confidence that this action will result in the highest reward.

What if at some point, the weights are being steered in a way that the model becomes overconfident of some particular action? If this happens before the model learns much, it becomes a huge problem.

Because we’re using the $\pi(a|s;\theta)$ distribution to sample trajectories with, we’re not sampling totally at random. In other words, for $\pi(a|s;\theta) = [0.1, 0.4, 0.2, 0.3]$ our sampling chooses the second option 40% of the time. With any action overwhelming the others, we’re losing the ability to explore different paths and thus learn valuable lessons.
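A quick way to see the effect is to sample actions from both a healthy and a collapsed distribution and count what comes out. This is only an illustrative sketch using Python's standard library. The "overconfident" probabilities are made up and the seed is there purely for reproducibility.

```python
import random
from collections import Counter


def sample_action(probabilities, rng=random):
    """Pick an action index with probability given by the policy output."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


rng = random.Random(0)  # seeded only to make the sketch reproducible
policy = [0.1, 0.4, 0.2, 0.3]
overconfident = [0.97, 0.01, 0.01, 0.01]  # hypothetical collapsed policy

counts = Counter(sample_action(policy, rng) for _ in range(10_000))
collapsed = Counter(sample_action(overconfident, rng) for _ in range(10_000))
print(sorted(counts.items()))
print(sorted(collapsed.items()))
```

With the first distribution every action gets tried regularly. With the second one almost all samples are the same action, so the agent rarely sees what the alternatives would have earned.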

Empirically, I have found myself seeing the process sometimes not even able to escape the “overconfidence” area for long, long hours.

Regularizing with entropy

Let’s introduce the notion of an entropy.

In simple terms, in our case it’s a measure of how uncertain a given probability distribution is: the less “knowledge” the distribution possesses, the higher its entropy. It’s maximized for the uniform distribution. Here’s the formula:

$$H(X)=E[-\log_b(P(X))]$$

This expands to the following:

$$H(X)=-\sum_{i=1}^{n}P(x_i)\log_b(P(x_i))$$

Let’s look closer at the values this function produces using the following simple Calca code:

uniform = [0.25, 0.25, 0.25, 0.25]
more confident = [0.5, 0.25, 0.15, 0.10]
over confident = [0.95, 0.01, 0.01, 0.03]
super over confident = [0.99, 0.003, 0.004, 0.003]

y(x) = x*log(x, 10)

entropy(dist) = -sum(map(y, dist))

entropy (uniform) => 0.6021
entropy (more confident) => 0.5246
entropy (over confident) => 0.1068
entropy (super over confident) => 0.0291
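For readers without Calca, the same calculation can be sketched in plain Python. It uses the same four distributions and base 10. Skipping zero probabilities is an addition not present in the Calca version, made so that the logarithm is always defined.

```python
import math


def entropy(distribution, base=10):
    """H(X) = -sum(p * log_b(p)); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in distribution if p > 0)


distributions = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "more confident": [0.5, 0.25, 0.15, 0.10],
    "over confident": [0.95, 0.01, 0.01, 0.03],
    "super over confident": [0.99, 0.003, 0.004, 0.003],
}
for name, dist in distributions.items():
    print(f"{name}: {entropy(dist):.4f}")
```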

We can use the above to “punish” the model whenever it’s too confident of its choices. As we’re going to use gradient descent, we’ll be minimizing terms that appear in our loss function. Minimizing the entropy as shown above would encourage more confidence though. We need to put it into the loss with a negative sign for it to work the way we intend:

$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))\,-\beta\,H(\pi(a_t|s_t;\theta))$$

Where $\beta$ is a hyperparameter scaling the effects of the penalty that the entropy has on the gradients. Choosing the right value for $\beta$ becomes very vital for the model’s convergence. In this article, I’m using $0.01$ as with $0.001$ I was still observing the process stuck being overconfident.

Let’s include the value loss $L_v$ in the loss function formula making it full and ready to be implemented:

$$L_\theta=-\frac{1}{n}\sum_{t=0}^{n-1}\log\pi(a_t|s_t;\theta)\cdot(R_t-V(s_t))\,+\alpha\,L_v\,-\beta\,H(\pi(a_t|s_t;\theta))$$
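To see how the three terms combine numerically, here is a framework-free sketch of the full loss for one n-step segment. It makes several assumptions the formula leaves open: $L_v$ is taken as the mean squared error, the entropy is averaged over the steps, natural logarithms are used, and $\alpha = 0.5$ is an arbitrary illustrative value. The function and argument names are made up too. In a real implementation the same expression would be built from tensors so that the framework can differentiate it.

```python
import math


def a3c_loss(action_probs, chosen_actions, returns, values, alpha=0.5, beta=0.01):
    """Policy loss + alpha * value loss - beta * entropy for one n-step segment.

    action_probs[t] is pi(.|s_t), chosen_actions[t] is a_t, returns[t] is R_t
    and values[t] is V(s_t). The advantage is a plain number here, i.e. it is
    treated as a constant with respect to the policy.
    """
    n = len(returns)
    policy_loss = 0.0
    value_loss = 0.0
    entropy = 0.0
    for probs, action, ret, value in zip(action_probs, chosen_actions, returns, values):
        advantage = ret - value
        policy_loss -= math.log(probs[action]) * advantage
        value_loss += (ret - value) ** 2  # MSE as L_v (an assumption)
        entropy -= sum(p * math.log(p) for p in probs if p > 0)
    return policy_loss / n + alpha * (value_loss / n) - beta * (entropy / n)


loss = a3c_loss(
    action_probs=[[0.5, 0.5]],
    chosen_actions=[0],
    returns=[1.0],
    values=[0.0],
)
print(loss)
```

Raising `beta` lowers the loss for high-entropy policies, which is the mechanism that pushes the model away from overconfidence.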

The last A in A3C

So far we’ve gone from the vanilla policy gradients to using the notion of an advantage. We’ve also improved it with the baseline that intuitively makes the model consist of two parts: the actor and the critic. At this point, we have what’s sometimes called the A2C — Advantage Actor-Critic.

Let us now focus on the last piece of the puzzle: the last A. This last A comes from the word “asynchronous”. It’s been explained very clearly in the original paper on A3C.

This idea I think is the least complex of all that have their place in the approach. I’ll just comment on what was already written:

These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent’s data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

A3C’s unique approach is that it doesn’t use experience replay for de-correlating the updates to the weights of the model. Instead, we’re sampling many different trajectories at the same time in an asynchronous manner.

This means that we’re creating many clones of the environment and we let our agents experience them at the same time. Separate agents share their weights in one way or another. There are implementations with agents sharing those weights very literally — and performing the updates to the weights on their own whenever they need to. There also are implementations with one main agent holding the main weights and doing the updates based on the gradients reported by the “worker” agents. The worker agents are then being updated with the evolved weights. The environments and agents are not being directly synchronized, working at their own speed. As soon as any of them collects the needed rewards to perform the n-step gradients calculations, the gradients are being applied in one way or another.

In this article, I prefer the second approach: having one “main” agent and making workers synchronize their weights with it every n-step period.
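The shape of this arrangement can be sketched with plain Python threads. This is only a toy model of the idea and not the article's implementation. The class and function names are hypothetical, the "gradients" come from a stand-in quadratic function instead of an environment, and a real setup would more likely use separate processes.

```python
import threading


class MainAgent:
    """Holds the main weights and applies gradients reported by workers."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr
        self.updates = 0
        self._lock = threading.Lock()

    def snapshot(self):
        with self._lock:
            return list(self.weights)

    def apply_gradients(self, gradients):
        """Apply one worker's gradients and hand back the evolved weights."""
        with self._lock:
            self.weights = [w - self.lr * g for w, g in zip(self.weights, gradients)]
            self.updates += 1
            return list(self.weights)


def n_step_gradients(weights):
    # Placeholder for "play n steps in the worker's own environment clone and
    # backpropagate the loss"; here simply the gradient of sum((w - 1)^2).
    return [2.0 * (w - 1.0) for w in weights]


def worker(main, periods):
    local_weights = main.snapshot()
    for _ in range(periods):
        gradients = n_step_gradients(local_weights)
        # Report to the main agent, then synchronize with its weights.
        local_weights = main.apply_gradients(gradients)


def train(n_workers=4, periods=50):
    main = MainAgent([0.0, 0.0], lr=0.05)
    threads = [
        threading.Thread(target=worker, args=(main, periods))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return main


main = train()
print(main.updates, main.weights)
```

Workers never wait for each other. Each one computes gradients against whatever copy of the weights it last received, which is what makes the updates asynchronous.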

Practice

The challenge

To present the above theory in practical terms, we’re going to code the A3C to train a toy self-driving game car. The algorithm will only have the game’s pixels as inputs. We’re also going to collect rewards.

Each step, the player will decide how to move the steering wheel, how much throttle to apply and how much brake.

Points are awarded for each new “tile” the car enters while staying on the road. In every other case there’s a small penalty of $-0.1$ points.

We’re going to use OpenAI Gym; the environment is called CarRacing.

You can read a bit more about the setup in the environment’s source code on GitHub.

Coding the Agent

Our agent is going to output both $\pi(a|s;\theta)$ and $V(s)$. We’re going to use a GRU unit to give the agent the ability to remember its previous actions and the environment’s previous features.

I’ve also decided to use PReLU instead of ReLU activations, as it appeared to me that the agent was learning much quicker this way (although I don’t have any numbers to back this impression up).

Disclaimer: the code presented below has not been refactored in any way. If this were going to be used in production, I’d certainly clean it up considerably.

Here’s the full listing of the agent’s class:

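# Imports assumed by the listings in this article (they were not shown originally).
# In particular, "mp" is assumed to be torch.multiprocessing.
import itertools
import logging
import time
import traceback
from collections import deque

import gym
import numpy as np
import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from gym.wrappers import Monitor
from PIL import Image
from skimage.color import rgb2gray
from skimage.transform import rescale

logger = logging.getLogger(__name__)

class Agent(nn.Module):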
    def __init__(self, **kwargs):
        super(Agent, self).__init__(**kwargs)

        self.init_args = kwargs

        self.h = torch.zeros(1, 256)

        self.norm1 = nn.BatchNorm2d(4)
        self.norm2 = nn.BatchNorm2d(32)

        self.conv1 = nn.Conv2d(4, 32, 4, stride=2, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)

        self.gru = nn.GRUCell(1152, 256)
        self.policy = nn.Linear(256, 4)
        self.value = nn.Linear(256, 1)

        self.prelu1 = nn.PReLU()
        self.prelu2 = nn.PReLU()
        self.prelu3 = nn.PReLU()
        self.prelu4 = nn.PReLU()

        nn.init.xavier_uniform_(self.conv1.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.conv1.bias, 0.01)
        nn.init.xavier_uniform_(self.conv2.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.conv2.bias, 0.01)
        nn.init.xavier_uniform_(self.conv3.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.conv3.bias, 0.01)
        nn.init.xavier_uniform_(self.conv4.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.conv4.bias, 0.01)
        nn.init.constant_(self.gru.bias_ih, 0)
        nn.init.constant_(self.gru.bias_hh, 0)
        nn.init.xavier_uniform_(self.policy.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.policy.bias, 0.01)
        nn.init.xavier_uniform_(self.value.weight, gain=nn.init.calculate_gain('leaky_relu'))
        nn.init.constant_(self.value.bias, 0.01)

        self.train()

    def reset(self):
        self.h = torch.zeros(1, 256)

    def clone(self, num=1):
        return [ self.clone_one() for _ in range(num) ]

    def clone_one(self):
        return Agent(**self.init_args)

    def forward(self, state):
        state = state.view(1, 4, 96, 96)
        state = self.norm1(state)

        data = self.prelu1(self.conv1(state))
        data = self.prelu2(self.conv2(data))
        data = self.prelu3(self.conv3(data))
        data = self.prelu4(self.conv4(data))

        data = self.norm2(data)
        data = data.view(1, -1)

        h = self.gru(data, self.h)
        self.h = h.detach()

        pre_policy = h.view(-1)

        # pass dim explicitly; relying on the implicit dimension is deprecated
        policy = F.softmax(self.policy(pre_policy), dim=-1)
        value = self.value(pre_policy)

        return policy, value

You can immediately notice that actor and critic parts share most of the weights. They only differ in the last layer.

Next, I wanted to abstract out the notion of the “runner”. It encapsulates the idea of a “running agent”. Think of it as the game player — with the joystick and its brain to score game points. I’m discretizing the action space the following way:

Action name	value
Turn left	[-0.8, 0.0, 0.0]
Turn right	[0.8, 0.0, 0]
Full throttle	[0.0, 0.1, 0.0]
Brake	[0.0, 0.0, 0.6]

class Runner:
    def __init__(self, agent, ix, train = True, **kwargs):
        self.agent = agent
        self.train = train
        self.ix = ix
        self.reset = False
        self.states = []

        # each runner has its own environment:
        self.env = gym.make('CarRacing-v0')

    def get_value(self):
        """
        Returns just the current state's value.
        This is used when approximating the R.
        If the last step was
        not terminal, then we're substituting the "r"
        with V(s) - hence, we need a way to just
        get that V(s) without moving forward yet.
        """
        _input = self.preprocess(self.states)
        _, _, _, value = self.decide(_input)
        return value

    def run_episode(self, yield_every = 10, do_render = False):
        """
        The episode runner written in the generator style.
        This is meant to be used in a "for (...) in run_episode(...):" manner.
        Each value generated is a tuple of:
        step_ix: the current "step" number
        rewards: the list of rewards as received from the environment (without discounting yet)
        values: the list of V(s) values, as predicted by the "critic"
        policies: the list of policies as received from the "actor"
        actions: the list of actions as sampled based on policies
        terminal: whether we're in a "terminal" state
        """
        self.reset = False
        step_ix = 0

        rewards, values, policies, actions = [[], [], [], []]

        self.env.reset()

        # we're going to feed the last 4 frames to the neural network that acts as the "actor-critic" duo. We'll use the "deque" to efficiently drop too old frames always keeping its length at 4:
        states = deque([ ])

        # we're pre-populating the states deque by taking first 4 steps as "full throttle forward":
        while len(states) < 4:
            _, r, _, _ = self.env.step([0.0, 1.0, 0.0])
            state = self.env.render(mode='rgb_array')
            states.append(state)
            logger.info('Init reward ' + str(r) )

        # we need to repeat the following as long as the game is not over yet:
        while True:
            # the frames need to be preprocessed (I'm explaining the reasons later in the article)
            _input = self.preprocess(states)

            # asking the neural network for the policy and value predictions:
            action, action_ix, policy, value = self.decide(_input, step_ix)

            # taking the step and receiving the reward along with info if the game is over:
            _, reward, terminal, _ = self.env.step(action)

            # explicitly rendering the scene (again, this will be explained later)
            state = self.env.render(mode='rgb_array')

            # update the last 4 states deque:
            states.append(state)
            while len(states) > 4:
                states.popleft()

            # if we've been asked to render into the window (e. g. to capture the video):
            if do_render:
                self.env.render()

            self.states = states
            step_ix += 1

            rewards.append(reward)
            values.append(value)
            policies.append(policy)
            actions.append(action_ix)

            # periodically save the state's screenshot along with the numerical values in an easy to read way:
            if self.ix == 2 and step_ix % 200 == 0:
                fname = './screens/car-racing/screen-' + str(step_ix) + '-' + str(int(time.time())) + '.jpg'
                im = Image.fromarray(state)
                im.save(fname)
                state.tofile(fname + '.txt', sep=" ")
                _input.numpy().tofile(fname + '.input.txt', sep=" ")

            # if it's game over or we hit the "yield every" value, yield the values from this generator:
            if terminal or step_ix % yield_every == 0:
                yield step_ix, rewards, values, policies, actions, terminal
                rewards, values, policies, actions = [[], [], [], []]

            # following is a very tacky way to allow external using code to mark that it wants us to reset the environment, finishing the episode prematurely. (this would be hugely refactored in the production code but for the sake of playing with the algorithm itself, it's good enough):
            if self.reset:
                self.reset = False
                self.agent.reset()
                states = deque([ ])
                self.states = deque([ ])
                return

            if terminal:
                self.agent.reset()
                states = deque([ ])
                return

    def ask_reset(self):
        self.reset = True

    def preprocess(self, states):
        return torch.stack([ torch.tensor(self.preprocess_one(image_data), dtype=torch.float32) for image_data in states ])

    def preprocess_one(self, image):
        """
        Scales the rendered image and makes it grayscale
        """
        return rescale(rgb2gray(image), (0.24, 0.16), anti_aliasing=False, mode='edge', multichannel=False)

    def choose_action(self, policy, step_ix):
        """
        Chooses an action to take based on the policy and whether we're in the training mode or not. During training, it samples based on the probability values in the policy. During the evaluation, it takes the most probable action in a greedy way.
        """
        policies = [[-0.8, 0.0, 0.0], [0.8, 0.0, 0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.6]]

        if self.train:
            action_ix = np.random.choice(4, 1, p=torch.tensor(policy).detach().numpy())[0]
        else:
            action_ix = np.argmax(torch.tensor(policy).detach().numpy())

        logger.info('Step ' + str(step_ix) + ' Runner ' + str(self.ix) + ' Action ix: ' + str(action_ix) + ' From: ' + str(policy))

        return np.array(policies[action_ix], dtype=np.float32), action_ix

    def decide(self, state, step_ix = 999):
        policy, value = self.agent(state)

        action, action_ix = self.choose_action(policy, step_ix)

        return action, action_ix, policy, value

    def load_state_dict(self, state):
        """
        As we'll have multiple "worker" runners, they will need to be able to sync their agents' weights with the main agent.
        This function loads the weights into this runner's agent.
        """
        self.agent.load_state_dict(state)

I’m also encapsulating the training process in a class of its own. Notice that the gradients are clipped before being applied. I’m also clipping the rewards to the range of $<-3, 3>$ to help keep the variance low.

class Trainer:
    def __init__(self, gamma, agent, window = 15, workers = 8, **kwargs):
        super().__init__(**kwargs)

        self.agent = agent
        self.window = window
        self.gamma = gamma
        self.optimizer = optim.Adam(self.agent.parameters(), lr=1e-4)
        self.workers = workers

        # even though we're loading the weights into worker agents explicitly, I found that still without sharing the weights as following, the algorithm was not converging:
        self.agent.share_memory()

    def fit(self, episodes = 1000):
        """
        The higher level method for training the agents.
        It calls into the lower level "train" which orchestrates the process itself.
        """
        last_update = 0
        updates = dict()

        for ix in range(1, self.workers + 1):
            updates[ ix ] = { 'episode': 0, 'step': 0, 'rewards': deque(), 'losses': deque(), 'points': 0, 'mean_reward': 0, 'mean_loss': 0 }

        for update in self.train(episodes):
            now = time.time()

            # you could do something useful here with the updates dict.
            # I've opted out as I'm using logging anyways and got more value in just watching the log file, grepping for the desired values

            # save the current model's weights every minute:
            if now - last_update > 60:
                torch.save(self.agent.state_dict(), './checkpoints/car-racing/' + str(int(now)) + '-.pytorch')
                last_update = now

    def train(self, episodes = 1000):
        """
        Lower level training orchestration method. Written in the generator style. Intended to be used with "for update in train(...):"
        """

        # create the requested number of background agents and runners:
        worker_agents = self.agent.clone(num = self.workers)
        runners = [ Runner(agent=agent, ix = ix + 1, train = True) for ix, agent in enumerate(worker_agents) ]

        # we're going to communicate the workers' updates via the thread safe queue:
        queue = mp.SimpleQueue()

        # if we've not been given a number of episodes: assume the process is going to be interrupted with the keyboard interrupt once the user (us) decides so:
        if episodes is None:
            print('Starting out an infinite training process')

        # create the actual background processes, making their entry be the train_one method:
        processes = [ mp.Process(target=self.train_one, args=(runners[ix - 1], queue, episodes, ix)) for ix in range(1, self.workers + 1) ]

        # run those processes:
        for process in processes:
            process.start()

        try:
            # what follows is a rather naive implementation of listening to workers updates. it works though for our purposes:
            while any([ process.is_alive() for process in processes ]):
                results = queue.get()
                yield results
        except Exception as e:
            logger.error(str(e))

    def train_one(self, runner, queue, episodes = 1000, ix = 1):
        """
        Orchestrate the training for a single worker runner and agent. This is intended to run in its own background process.
        """

        # possibly naive way of trying to de-correlate the weight updates further (I have no hard evidence to prove if it works, other than my subjective observation):
        time.sleep(ix)

        try:
            # we are going to request the episode be reset whenever our agent scores lower than its max points. the same will happen if the agent scores total of -10 points:
            max_points = 0
            max_eval_points = 0
            min_points = 0
            max_episode = 0

            for episode_ix in itertools.count(start=0, step=1):

                if episodes is not None and episode_ix >= episodes:
                    return

                max_episode_points = 0
                points = 0

                # load up the newest weights every new episode:
                runner.load_state_dict(self.agent.state_dict())

                # every 5 episodes lets evaluate the weights we've learned so far by recording the run of the car using the greedy strategy:
                if ix == 1 and episode_ix % 5 == 0:
                    eval_points = self.record_greedy(episode_ix)

                    if eval_points > max_eval_points:
                        torch.save(runner.agent.state_dict(), './checkpoints/car-racing/' + str(eval_points) + '-eval-points.pytorch')
                        max_eval_points = eval_points

                # each n-step window, compute the gradients and apply
                # also: decide if we shouldn't restart the episode if we don't want to explore too much of the not-useful state space:
                for step, rewards, values, policies, action_ixs, terminal in runner.run_episode(yield_every=self.window):
                    points += sum(rewards)

                    if ix == 1 and points > max_points:
                        torch.save(runner.agent.state_dict(), './checkpoints/car-racing/' + str(points) + '-points.pytorch')
                        max_points = points

                    if ix == 1 and episode_ix > max_episode:
                        torch.save(runner.agent.state_dict(), './checkpoints/car-racing/' + str(episode_ix) + '-episode.pytorch')
                        max_episode = episode_ix

                    if points < -10 or (max_episode_points > min_points and points < min_points):
                        terminal = True
                        max_episode_points = 0
                        point = 0
                        runner.ask_reset()

                    if terminal:
                        logger.info('TERMINAL for ' + str(ix) + ' at step ' + str(step) + ' with total points ' + str(points) + ' max: ' + str(max_episode_points) )

                    # if we're learning, then compute and apply the gradients and load the newest weights:
                    if runner.train:
                        loss = self.apply_gradients(policies, action_ixs, rewards, values, terminal, runner)
                        runner.load_state_dict(self.agent.state_dict())

                    max_episode_points = max(max_episode_points, points)
                    min_points = max(min_points, points)

                    # communicate the gathered values to the main process:
                    queue.put((ix, episode_ix, step, rewards, loss, points, terminal))

        except Exception as e:
            string = traceback.format_exc()
            logger.error(str(e) + ' → ' + string)
            queue.put((ix, -1, -1, [-1], -1, str(e) + '<br />' + string, True))

    def record_greedy(self, episode_ix):
        """
        Records the video of the "greedy" run based on the current weights.
        """
        directory = './videos/car-racing/episode-' + str(episode_ix) + '-' + str(int(time.time()))
        player = Player(agent=self.agent, directory=directory, train=False)
        points = player.play()
        logger.info('Evaluation at episode ' + str(episode_ix) + ': ' + str(points) + ' points (' + directory + ')')
        return points

    def apply_gradients(self, policies, actions, rewards, values, terminal, runner):
        worker_agent = runner.agent
        actions_one_hot = torch.tensor([[ int(i == action) for i in range(4) ] for action in actions], dtype=torch.float32)

        policies = torch.stack(policies)
        values = torch.cat(values)
        values_nograd = torch.zeros_like(values.detach(), requires_grad=False)
        values_nograd.copy_(values)

        discounted_rewards = self.discount_rewards(runner, rewards, values_nograd[-1], terminal)
        advantages = discounted_rewards - values_nograd

        logger.info('Runner ' + str(runner.ix) + 'Rewards: ' + str(rewards))
        logger.info('Runner ' + str(runner.ix) + 'Discounted Rewards: ' + str(discounted_rewards.numpy()))

        log_policies = torch.log(0.00000001 + policies)

        one_log_policies = torch.sum(log_policies * actions_one_hot, dim=1)

        entropy = torch.sum(policies * -log_policies)

        policy_loss = -torch.mean(one_log_policies * advantages)

        value_loss = F.mse_loss(values, discounted_rewards)

        value_loss_nograd = torch.zeros_like(value_loss)
        value_loss_nograd.copy_(value_loss)

        policy_loss_nograd = torch.zeros_like(policy_loss)
        policy_loss_nograd.copy_(policy_loss)

        logger.info('Value Loss: ' + str(float(value_loss_nograd)) + ' Policy Loss: ' + str(float(policy_loss_nograd)))

        loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
        self.agent.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(worker_agent.parameters(), 40)

        # the following step is crucial. at this point, all the info about the gradients reside in the worker agent's memory. We need to "move" those gradients into the main agent's memory:
        self.share_gradients(worker_agent)

        # update the weights with the computed gradients:
        self.optimizer.step()

        worker_agent.zero_grad()
        return float(loss.detach())

    def share_gradients(self, worker_agent):
        for param, shared_param in zip(worker_agent.parameters(), self.agent.parameters()):
            if shared_param.grad is not None:
                return
            shared_param._grad = param.grad

    def clip_reward(self, reward):
        """
        Clips the rewards into the <-3, 3> range preventing too big of the gradients variance.
        """
        return max(min(reward, 3), -3)

    def discount_rewards(self, runner, rewards, last_value, terminal):
        discounted_rewards = [0 for _ in rewards]
        loop_rewards = [ self.clip_reward(reward) for reward in rewards ]

        if terminal:
            loop_rewards.append(0)
        else:
            loop_rewards.append(runner.get_value())

        for main_ix in range(len(discounted_rewards) - 1, -1, -1):
            for inside_ix in range(len(loop_rewards) - 1, -1, -1):
                if inside_ix >= main_ix:
                    reward = loop_rewards[inside_ix]
                    discounted_rewards[main_ix] += self.gamma**(inside_ix - main_ix) * reward

        return torch.tensor(discounted_rewards)

For the record_greedy method to work we need the following class:

class Player(Runner):
    def __init__(self, directory, **kwargs):
        super().__init__(ix=999, **kwargs)

        self.env = Monitor(self.env, directory)

    def play(self):
        points = 0
        for step, rewards, values, policies, actions, terminal in self.run_episode(yield_every = 1, do_render = True):
            points += sum(rewards)
        self.env.close()
        return points

All the above code can be used as follows (in the Python script):

if __name__ == "__main__":
    agent = Agent()

    trainer = Trainer(gamma = 0.99, agent = agent)
    trainer.fit(episodes=None)

The importance of tuning of the n-step window size

Reading the code, you can notice that we’ve chosen $15$ to be the size of the n-step window. We’ve also chosen $\gamma=0.99$. Getting those values right is a matter of tuning. Values that work for one game or challenge will not necessarily work well for another.

Here’s a quick way to think about them. We’re going to be penalized most of the time, so it’s important to give the algorithm a chance to actually find trajectories that score positively. In the “CarRacing” challenge, I’ve found that it can take 10 steps of moving “full throttle” in the correct direction before we’re rewarded for entering a new “tile”. I simply added a safety margin of $5$ to that number. No mathematical proof backs this reasoning, but I can tell you that it made a huge difference in the training time for me. The version of the code I’m presenting above starts to score above 700 points after approximately 10 hours on my Ryzen 7 based computing box.
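To see how the window size and $\gamma$ interact, here is a small standalone sketch of the n-step return computation. It should give the same numbers as the discount_rewards method above, but uses a single backward pass. The `n_step_returns` name and its signature are made up for this illustration.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99, clip=3.0):
    """
    Discounted returns for one n-step window.

    bootstrap_value is 0 when the window ended in a terminal state and
    V(s) of the state after the window otherwise.
    """
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        # Clip each reward into <-clip, clip> first, as the Trainer does.
        reward = max(min(reward, clip), -clip)
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# A window of three steps that ends in a terminal state. The last
# reward gets clipped from 7.0 to 3.0 before discounting.
returns = n_step_returns([-0.1, -0.1, 7.0], bootstrap_value=0.0, gamma=0.5)
# returns is approximately [0.6, 1.4, 3.0]
```

With a window shorter than the number of steps it takes to reach the next tile, every reward in the window is a penalty, and the only positive signal can come from the bootstrapped $V(s)$.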

Problems with the state being returned from the environment - overcoming with the explicit render

You might have also noticed that I’m not using the state values returned by the step method of the gym environment. This might seem contrary to how gym is typically used. After days of not seeing my model converge, though, I found that the step method was returning one and the same numpy array on each call. You can imagine that it was the very last thing I checked when trying to find that bug.

I’ve found that render(mode='rgb_array') works as intended each time. I just needed to write my own preprocessing code to scale the frame down and make it grayscale.

How to know when the algorithm converges

I’ve seen some people conclude that their A3C implementation does not converge: the resulting policy did not seem to work that well and the training process was taking a bit longer than “some other implementation”. I fell for this kind of thinking myself as well. My humble bit of advice is to stick to what makes sense mathematically. Someone else’s model might be converging faster simply because of the hardware being used or some slight difference in the code around the training (e.g. the explicit render needed in my case). This might not have anything to do with the A3C part at all.

How do we “stick to what makes sense mathematically”? Simply by logging the value loss and observing it as the training continues. Intuitively, for the model that has converged, we should see that it has already learned the value function. Those values — representing the average of the discounted rewards — should not make the loss too big most of the time. Still, for some states, the best action will make the $R_t$ much bigger than $V(s_t)$ which means that we still should see the loss spiking from time to time.

Again, the above bit of advice doesn’t come with any mathematical proofs. It’s what I found working and making sense in my case.
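One way to watch the value loss without staring at raw log lines is to keep a running mean over a recent window and flag the occasional spike. The sketch below is only an illustration of that idea; the `LossMonitor` class, the window size and the spike factor are all arbitrary choices of mine, not part of the training code above.

```python
from collections import deque


class LossMonitor:
    """Tracks the mean of the most recent value losses."""

    def __init__(self, window=100):
        self.recent = deque(maxlen=window)

    def update(self, value_loss):
        self.recent.append(float(value_loss))
        return self.mean()

    def mean(self):
        return sum(self.recent) / len(self.recent)

    def is_spike(self, value_loss, factor=5.0):
        # A loss counts as a spike when it is several times larger
        # than the recent average.
        return len(self.recent) > 0 and value_loss > factor * self.mean()


monitor = LossMonitor(window=4)
for loss in [1.0, 1.0, 1.0, 1.0]:
    monitor.update(loss)

spike = monitor.is_spike(10.0)
no_spike = monitor.is_spike(2.0)
```

For a converged model, the running mean should stay low while is_spike still fires from time to time.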

The Results

Instead of presenting hard-core statistics about the model’s performance (which wouldn’t make much sense because I stopped training as soon as the “evaluation” videos started looking cool enough), I’ll just post three videos of the car driving on its own through three randomly generated tracks.

Have fun watching and even more fun coding it yourself too!

Recommender System via a Simple Matrix Factorization

2018-07-17T00:00:00+00:00

Photo by Michael Cartwright, CC BY-SA 2.0, cropped

We all like how apps like Spotify or Last.fm can recommend us a song that feels so much like our taste. Being able to recommend an item to a user is very important for keeping and expanding the user base.

In this article I’ll present an overview of building a recommendation system. The approach here is quite basic. It’s grounded though in a valid and battle-tested theory. I’ll show you how to put this theory into practice by coding it in Python with the help of MXNet.

Kinds of recommenders

The general setup of the content recommendation challenge is that we have users and items. The task is to recommend items to a particular user.

There are two distinct approaches to recommending content:

The first one, content-based filtering, bases its outputs on the intricate features of the item and how they relate to the user. The second one, collaborative filtering, uses information about how other, similar users rate the items. More elaborate systems base their work on both; these are called hybrid recommender systems.

This article is going to focus on collaborative filtering only.

A bit of theory: matrix factorization

In the simplest terms, we can represent interactions between users and items with a matrix:

	item1	item2	item3
user1	-1	-	0.6
user2	-	0.95	-0.1
user3	0.5	-	0.8

In the above case users can rate items on the scale of <-1, 1>. Notice that in reality it’s most likely that users will not rate everything. The missing ratings are represented with the dash: -.
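In code, a sparse table like this is naturally stored as a mapping that only contains the known ratings. The sketch below uses a plain dictionary; the variable names are arbitrary.

```python
# Only the known ratings from the table are stored.
ratings = {
    ("user1", "item1"): -1.0,
    ("user1", "item3"): 0.6,
    ("user2", "item2"): 0.95,
    ("user2", "item3"): -0.1,
    ("user3", "item1"): 0.5,
    ("user3", "item3"): 0.8,
}

users = ["user1", "user2", "user3"]
items = ["item1", "item2", "item3"]

# The "dashes": user/item pairs that have no rating yet.
missing = [(u, i) for u in users for i in items if (u, i) not in ratings]
```

The entries in `missing` are the ones a recommender has to predict.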

Just by looking at the above table, we know that no amount of math is going to change the fact that user1 completely dislikes item1. The same goes for user2 liking item2 a lot. The ratings we already have make for a fairly easy set of items to propose. The goal of a recommender is not to propose items users already know, though. We want to predict which of the “dashes” in the table are most likely to be liked the most. In other words: we want to predict the full representation of the matrix based only on its “sparse” representation shown above.

How can we solve this problem? Let’s recall the rules of multiplying two matrices:

Given two matrices: A: m × k and B: k × n, their product is another matrix C: m × n. We know that we can multiply matrices only if the second dimension of the first matrix equals the first one of the second matrix. In such a case, matrix C becomes a product of two factors: matrix A and matrix B:

C = AB

Imagine now that the sparse matrix represented by the ratings table is our C. Suppose there exist two matrices, A and B, whose product reproduces, at least approximately, the known entries of C. We then say that A and B factorize C.

Notice also how this factorization is saving the space needed to persist the ratings:

Let’s set m and n to:

m = 1000000
n = 10000

Then the full representation takes:

m * n => 10,000,000,000

We can now choose the value for k, to be later used when constructing the factorizing matrices:

k = 16

Then to store both matrices: A and B we only need:

m * k + n * k => 16,160,000

Making it into a fraction of the previous number:

(m * k + n * k) / (m * n) => 0.001616

That’s a huge saving of the original space! The cost we need to pay is the small increase in the computational resources needed for the information retrieval. Inference of the rating from C based on A and B requires a dot product of the corresponding row and column of those matrices.
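The retrieval step can be sketched in a few lines of plain Python. The matrices below are tiny, made-up examples with k = 2, and `predict` is a hypothetical helper name.

```python
def predict(A, B, user_ix, item_ix):
    """Rating of one user for one item: row of A dotted with column of B."""
    k = len(B)
    return sum(A[user_ix][f] * B[f][item_ix] for f in range(k))


# A: 3 users x k, B: k x 3 items (k = 2), values chosen by hand.
A = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 2.0]]
B = [[0.2, 0.4, 0.6],
     [0.1, 0.3, 0.5]]

# 0.5 * 0.6 + 0.5 * 0.5
rating = predict(A, B, user_ix=1, item_ix=2)
```

Each lookup costs k multiplications and additions, which is the small computational price mentioned above.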

Reasoning about the matrix factors

What intuition can we build for the above-mentioned matrices A and B? Looking at their dimensions, we can see that each row of A is a k-sized vector that represents a user. Likewise, each column of B is a k-sized vector that represents an item. The values in those vectors are called latent features, and the vectors themselves are sometimes called latent representations of users and items.

What could be the intuition? To split the original matrix, for each item we need to look at all interactions with users. You can imagine the algorithm finding patterns in the ratings that later on match certain characteristics of the item. If this was about movies, the features could be that it’s a comedy or sci-fi or that it’s futuristic or embedded deeply in some ancient times. We’re essentially taking the original vector of a movie, that contains ratings — and based on that we’re distilling features of the movie that describe it best. Note that this is only a half-truth. We think about it this way just to have a way to explain why the approach works. In many cases we could have a hard time finding the actual real world aspects that those latent features follow.

Factorizing the user × item matrix in practice

A simple approach to finding matrices A and B is to initialize them randomly first. Then, for each known value in C, we compute the dot product of the corresponding row of A and column of B and measure how much it differs from the known value. Because the dot product is easily differentiable, we can use gradient descent to iteratively improve A and B until AB is close enough to C for our purposes.
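Before moving to a deep learning framework, the procedure can be sketched in plain Python on the toy ratings table from earlier. This is a minimal illustration using stochastic gradient descent on the squared error; the function names, learning rate, number of epochs and k = 2 are arbitrary choices of mine and not the settings used later in the article.

```python
import random


def factorize(ratings, n_users, n_items, k=2, lr=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    # Random initialization of A (users x k) and B (k x items).
    A = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    B = [[rng.uniform(-0.1, 0.1) for _ in range(n_items)] for _ in range(k)]

    for _ in range(epochs):
        for (u, i), target in ratings.items():
            pred = sum(A[u][f] * B[f][i] for f in range(k))
            err = pred - target
            for f in range(k):
                a, b = A[u][f], B[f][i]
                # Gradient of 0.5 * err ** 2 with respect to each factor.
                A[u][f] -= lr * err * b
                B[f][i] -= lr * err * a
    return A, B


def mean_squared_error(ratings, A, B):
    k = len(B)
    total = 0.0
    for (u, i), target in ratings.items():
        pred = sum(A[u][f] * B[f][i] for f in range(k))
        total += (pred - target) ** 2
    return total / len(ratings)


# Known ratings from the table, indexed by (user, item) position.
known = {(0, 0): -1.0, (0, 2): 0.6, (1, 1): 0.95,
         (1, 2): -0.1, (2, 0): 0.5, (2, 2): 0.8}

A0, B0 = factorize(known, n_users=3, n_items=3, epochs=0)
A, B = factorize(known, n_users=3, n_items=3)

error_before = mean_squared_error(known, A0, B0)
error_after = mean_squared_error(known, A, B)
```

Only the known entries contribute to the error. The missing ones are simply whatever the product AB happens to give, which is what turns them into predictions.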

In this article, I’m going to use a freely available database of joke ratings, called “Jester”. It contains data about ratings from 59132 users and 150 jokes.

Coding the model with MXNet

Let’s first import some of the classes and functions we’ll use later.

from mxnet.gluon import Block, nn, Trainer
from mxnet.gluon.loss import L2Loss
from mxnet import autograd, ndarray as F
import mxnet as mx

import numpy as np
import random
import logging
import re

The first step in building the training process is to create an iterator over the training batches read from the data files. To make things trivially simple, I’ll read the whole data into memory. The batches will be constructed each time from the data cached in memory.

To create a custom data iterator, we’ll need to inherit from mxnet.io.DataIter and implement at least two methods: next and reset. Here’s our simple code:

class DataIter(mx.io.DataIter):
    def __init__(self, data, batch_size = 16):
        super(DataIter, self).__init__()
        self.batch_size = batch_size
        self.all_user_ids = set()
        self.data = data
        self.index = 0

        for user_id, item_id, _ in data:
            self.all_user_ids.add(user_id)

    @property
    def user_count(self):
        return len(self.all_user_ids)

    @property
    def item_count(self):
        # we just know the value even though 10 of them were
        # not voted
        return 150

    def next(self):
        index = self.index * self.batch_size
        endindex = index + self.batch_size

        if len(self.data) <= index:
            raise StopIteration
        else:
            user_ids = []
            item_ids = []
            ratings = []

            user_ids = self.data[index:endindex, 0]
            item_ids = self.data[index:endindex, 1]
            ratings   = self.data[index:endindex, 2]

            data_all = [mx.nd.array(user_ids), mx.nd.array(item_ids)]
            label_all = [mx.nd.array([r]) for r in ratings]

            self.index += 1

            return mx.io.DataBatch(data_all, label_all)

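    def reset(self):
        self.index = 0
        # shuffle the rows in place; random.shuffle can duplicate rows
        # when applied to a 2-D numpy array
        np.random.shuffle(self.data)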

The above DataIter class expects to be given a numpy array with all the training examples. Each row holds one example: the first column is the user ID, the second the item ID, and the third the rating.
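The index arithmetic in next can be looked at in isolation. The following is a small sketch, not part of the original code: batch_bounds is a hypothetical helper that yields the same start and end offsets the iterator slices with, clipped to the data length the way numpy slicing clips them implicitly.

```python
def batch_bounds(n_examples, batch_size):
    """Yield (start, end) offsets for consecutive batches."""
    index = 0
    while index * batch_size < n_examples:
        start = index * batch_size
        # the final batch may be shorter than batch_size
        yield start, min(start + batch_size, n_examples)
        index += 1

for start, end in batch_bounds(10, 4):
    print(start, end)
```

Because the last batch can be shorter, it is convenient that the training loop later reads the batch size from the batch itself rather than assuming a constant.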

Here’s the code for reading data from disk and feeding it into the DataIter’s constructor:

def get_data(batch_size):
    user_ids = []
    item_ids = []
    ratings = []

    with open("data/jester_ratings.dat", "r") as file:
        for line in file:
            user_id, _, item_id, _, rating = line.strip().split("\t")

            user_ids.append(int(user_id))
            item_ids.append(int(item_id))
            ratings.append(float(rating) / 10.0)

    all_raw = np.asarray(list(zip(user_ids, item_ids, ratings)), dtype='float32')

    return DataIter(all_raw, batch_size = batch_size)

Notice that I’m dividing each rating by 10 to scale the ratings from <-10,10> to <-1,1>. I’m doing it because I found the process hitting numerical overflows when using the Adam optimizer.

The function accepts the batch_size as an argument. Below I’m creating a dataset iterator yielding 64 examples at a time:

train = get_data(64)

Recent versions of MXNet bring in a coding model similar to the one found in PyTorch. We can use the clean approach of defining the model by extending the base class and defining the forward method. This is possible by using the mxnet.gluon module that defines the Block class.

As a full-featured deep learning framework, MXNet has its own implementation of calculating gradients automatically. The forward method in our Block-derived class is all we need to proceed with gradient descent.

In our model, the A and B matrices will be encoded within the gluon layers of type Embedding. The Embedding class lets you specify the number of rows in the matrix as well as the dimension into which we’re “squashing” them. Using the class is very handy as it doesn’t require us to one-hot encode our user and item IDs.
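To see why no one-hot encoding is needed, here is a plain-Python sketch (the helper names and the tiny weights matrix are made up for illustration, they are not MXNet API): multiplying a one-hot vector by a weight matrix selects exactly one row, which is what an embedding lookup returns directly.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_matmul(vec, matrix):
    # (1 x n) vector times (n x k) matrix
    return [sum(v * row[c] for v, row in zip(vec, matrix))
            for c in range(len(matrix[0]))]

def embed(index, matrix):
    # an embedding lookup is just "give me row number index"
    return list(matrix[index])

weights = [[0.1, 0.2],
           [0.3, 0.4],
           [0.5, 0.6]]  # 3 IDs, k = 2 latent features

print(vec_matmul(one_hot(1, 3), weights))
print(embed(1, weights))
```

Both calls produce the same row, but the lookup skips building a vector as long as the number of users or items.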

The following is the implementation of our simple model as an MXNet block. Notice that all it really is, is a regression. The model is linear so we’re not using any activation function:

class Model(Block):
    def __init__(self, k, dataiter, **kwargs):
        super(Model, self).__init__(**kwargs)

        with self.name_scope():
            self.user_embedding = nn.Embedding(input_dim = dataiter.user_count, output_dim=k)
            self.item_embedding = nn.Embedding(input_dim = dataiter.item_count, output_dim=k)

    def forward(self, x):
        user = self.user_embedding(x[0] - 1)
        item = self.item_embedding(x[1] - 1)

        # the following is a dot product in essence
        # summing up of the element-wise multiplication
        pred = user * item
        return F.sum_axis(pred, axis = 1)

Next, I’m creating the MXNet computation context as well as an instance of the model itself. Before doing any kind of learning, the parameters of the model will need to be initialized:

context = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
model = Model(16, train)
model.collect_params().initialize(mx.init.Xavier(), ctx=context)

The last line from above is initializing the A and B matrices randomly.

We are going to save the state of the model periodically to a file. We’ll be able to load it back with:

model.load_params("model.mxnet", ctx=context)

The last bit of code that we need is the training procedure itself. We’re going to code it as a function that takes the model, the data iterator and the number of epochs:

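# logger used inside fit(); the root logger matches the INFO:root: lines in the output
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

def fit(model, train, num_epoch):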
    trainer = Trainer(model.collect_params(), 'adam')

    for epoch_id in range(num_epoch):
        batch_id = 0
        train.reset()

        for batch in train:
            with autograd.record():
                targets = F.concat(*batch.label, dim=0)
                predictions = model(batch.data)
                L = L2Loss()
                loss = L(predictions, targets)
                loss.backward()

            trainer.step(batch.data[0].shape[0])

            if (batch_id + 1) % 1000 == 0:
                mean_loss = F.mean(loss).asnumpy()[0]

                logger.info(f'Epoch {epoch_id + 1} / {num_epoch} | Batch {batch_id + 1} | Mean Loss: {mean_loss}')

            batch_id += 1

        logger.info('Saving model parameters')
        model.save_params("model.mxnet")

Running the trainer for 10 epochs is as simple as:

fit(model, train, num_epoch=10)

The training process periodically outputs statistics similar to the ones below:

INFO:root:Epoch 1 / 10 | Batch 1000 | Mean Loss: 0.11189080774784088
INFO:root:Epoch 1 / 10 | Batch 2000 | Mean Loss: 0.12274568527936935
INFO:root:Epoch 1 / 10 | Batch 3000 | Mean Loss: 0.1204155907034874
INFO:root:Epoch 1 / 10 | Batch 4000 | Mean Loss: 0.12192331254482269

(...)

INFO:root:Epoch 10 / 10 | Batch 24000 | Mean Loss: 0.0003094784333370626
INFO:root:Epoch 10 / 10 | Batch 25000 | Mean Loss: 0.0006345464498735964
INFO:root:Epoch 10 / 10 | Batch 26000 | Mean Loss: 0.0007207655580714345
INFO:root:Epoch 10 / 10 | Batch 27000 | Mean Loss: 0.005522257648408413
INFO:root:Saving model parameters

Using the trained latent feature matrices

To extract the latent matrices from the trained model we need to use collect_params as shown below:

user_embed = model.collect_params().get('embedding0_weight').data()
joke_embed = model.collect_params().get('embedding1_weight').data()

Each user’s latent representation is a vector of k values:

> user_embed[0]

[ 0.11911439 -0.01560098 -0.26248184  0.5341552   1.3078408  -0.82505447
  0.2181341   0.69577765 -0.22569533 -0.7669992   0.14042236  0.78608125
  0.07242275  0.49357334  0.7525147   0.37984315]
<NDArray 16 @cpu(0)>

The same is true of the latent representations of jokes:

> joke_embed[7]

[ 0.11836094  0.14039275 -0.10859593 -0.13673168  0.14074579 -0.18800738
  0.0463879  -0.09659509  0.1629943   0.02109279 -0.0294639  -0.03487734
 -0.18192524 -0.13103536 -0.10280509  0.14753008]
<NDArray 16 @cpu(0)>

Let’s first test to see if the known values got reconstructed:

> F.dot(user_embed[0], joke_embed[7]) * 10

[-9.26895]
<NDArray 1 @cpu(0)>

Comparing it with the value from the file:

> cat data/jester_ratings.dat | rg "^1\t\t8\t"
1               8               -9.281

That’s close enough. Let’s now get the set of all joke ids rated by the first user:

test = get_data(1)
joke_ids = set()
for batch in test:
    user_id, joke_id = batch.data
    if user_id.asnumpy()[0] == 1:
        joke_ids.add(joke_id.asnumpy()[0])
joke_ids

The above code outputs:

{5.0, 7.0, 8.0, 13.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 29.0, 31.0, 32.0, 34.0, 35.0, 36.0, 42.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 61.0, 62.0, 65.0, 66.0, 68.0, 69.0, 72.0, 76.0, 80.0, 81.0, 83.0, 87.0, 89.0, 91.0, 92.0, 93.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 118.0, 119.0, 120.0, 121.0, 123.0, 127.0, 128.0, 134.0}

Because we’re mostly interested in the items that have not yet been rated by the user, we’d like to see what the model gathered about them in this context:

> sorted([ (i, F.dot(user_embed[0], joke_embed[i]).asnumpy()[0] * 10) for i in range(0, 150) if i + 1 not in joke_ids ], key=lambda x: x[1])

[(100, -25.34627914428711),
 (89, -23.647150993347168),
 (63, -23.543219566345215),
 (94, -23.415722846984863),
 (70, -22.017195224761963),
 (93, -21.375732421875),
 (140, -20.033082962036133),
 (81, -18.813319206237793),
 (40, -18.48101019859314),
 (135, -18.216774463653564),
 (39, -16.993610858917236),
 (123, -16.66216731071472),
 (45, -16.03758215904236),
 (59, -15.045435428619385),
 (43, -14.993469715118408),
 (74, -12.132725715637207),
 (72, -11.94629430770874),
 (76, -11.861177682876587),
 (29, -11.831218004226685),
 (114, -11.82992935180664),
 (38, -11.327273845672607),
 (98, -10.9122633934021),
 (62, -9.507511854171753),
 (32, -9.498740434646606),
 (83, -9.442780017852783),
 (56, -9.361632466316223),
 (78, -9.310351014137268),
 (109, -8.428668975830078),
 (77, -8.131155967712402),
 (47, -7.274705171585083),
 (99, -7.204542756080627),
 (42, -7.091279625892639),
 (69, -6.739482879638672),
 (57, -6.623743772506714),
 (96, -6.209834814071655),
 (134, -5.58724582195282),
 (73, -5.530622601509094),
 (110, -5.126549005508423),
 (131, -4.435622692108154),
 (9, -4.142558574676514),
 (46, -3.7173447012901306),
 (13, -3.1510373950004578),
 (44, -2.9845643043518066),
 (124, -2.7145612239837646),
 (137, -2.2213394939899445),
 (132, -2.2054636478424072),
 (116, -1.9229638576507568),
 (111, -1.9177806377410889),
 (121, -1.3515384495258331),
 (36, -1.119830161333084),
 (2, -1.0263845324516296),
 (136, -0.14549612998962402),
 (97, 0.02288222312927246),
 (138, 0.23310404270887375),
 (11, 0.34488800913095474),
 (1, 0.3801669552922249),
 (95, 0.42442888021469116),
 (5, 0.585017055273056),
 (0, 0.6578207015991211),
 (10, 1.0580871254205704),
 (148, 1.101222038269043),
 (85, 1.5351229906082153),
 (8, 1.8577364087104797),
 (129, 2.067573070526123),
 (84, 2.5856217741966248),
 (125, 2.927420735359192),
 (145, 3.010193407535553),
 (3, 3.240116238594055),
 (112, 3.8082027435302734),
 (115, 3.8878047466278076),
 (147, 4.29826945066452),
 (58, 5.724080801010132),
 (144, 6.969168186187744),
 (130, 7.328435778617859),
 (146, 8.421227931976318),
 (149, 8.71802568435669),
 (27, 10.014463663101196),
 (143, 10.086603164672852),
 (113, 11.049185991287231),
 (66, 11.210532188415527),
 (139, 11.213960647583008),
 (142, 11.479517221450806),
 (128, 11.862180233001709),
 (141, 12.742302417755127),
 (54, 13.011351823806763),
 (55, 16.884247064590454),
 (37, 18.53071689605713),
 (87, 23.8028883934021)]

The above output presents zero-based joke indices (the joke ID minus one, as used to index joke_embed) along with the prediction of what rating user 1 would give each joke. We can see that some values fall outside of the <-10, 10> range, which is fine. We can simply treat values smaller than -10 as -10 and values greater than 10 as 10.
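The clamping can be written as a tiny helper. This is a sketch with a hypothetical name, to_rating; it assumes the raw model output is on the <-1, 1> training scale and multiplies it by 10, as was done above, before limiting it to the rating range.

```python
def to_rating(raw_prediction, scale=10.0, low=-10.0, high=10.0):
    """Rescale a raw model output and clamp it into the valid rating range."""
    return max(low, min(high, raw_prediction * scale))

print(to_rating(2.38))    # far above the range, clamped to the maximum
print(to_rating(-2.53))   # far below the range, clamped to the minimum
print(to_rating(0.5))     # inside the range, only rescaled
```

Clamping does not change the ordering of the recommendations other than producing ties at the extremes, so it mostly matters when the predicted value is shown to a user.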

Immediately we can see that with this recommender model we could recommend the jokes at indices 146, 149, 27, 143, 113, 66, 139, 142, 128, 141, 54, 55, 37, 87.

To have a little bit more fun, let’s create code for reading the actual text of the jokes. I took the following class from StackOverflow. We’ll use it for stripping HTML tags from the jokes file:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Here’s the function that reads the file and uses the HTML tags stripping class:

def get_jokes():
    jokes = []
    joke = ''
    pattern = re.compile('^\\d+:$')
    with open("data/jester_items.dat", "r") as file:
        for line in file:
            if pattern.match(line):
                joke = ''
            else:
                if line.strip() == '':
                    jokes.append(strip_tags(joke).strip())
                else:
                    joke += line
    return jokes

Let’s now read them from disk and see an example joke our system would recommend to the first user:

> jokes = get_jokes()
> jokes[87]

'A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist.\n\nThe doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY...\n\n"Can you read this?" the doctor asked.\n\n"Read it?" the Czech answered. "Doc, I know him!"'

Using the item feature vectors to find similarities

One cool thing we can do with the latent vectors is to measure how similar the jokes are in terms of appealing to certain users. To do that, we can use the so-called cosine similarity. The subject is very clearly described by Christian S. Perone in his blog post.

It makes use of the angle between the two vectors and returns its cosine. Notice that it only cares about the angle between the vectors, and not their magnitudes. The range of the cosine function is <-1, 1>, and the same holds for the cosine similarity. It translates to our sense of similarity quite naturally: -1 meaning “the total opposite” and 1 meaning “exactly the same”.

We can trivially implement the function as the dot product of the two vectors divided by the product of their norms:

def cos_similarity(vec1, vec2):
    return mx.nd.dot(vec1, vec2) / (F.norm(vec1) * F.norm(vec2))
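To check the properties described above without MXNet, here is the same formula in plain Python (cosine_similarity is a stand-in name for this sketch, operating on ordinary lists rather than NDArrays):

```python
import math

def cosine_similarity(vec1, vec2):
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction, different magnitude
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # perpendicular vectors
```

The first pair differs only in length, which is why the measure ignores how strongly a joke is liked and only looks at who tends to like it.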

We can use the new measurement to rank the jokes in terms of how close they are. Here’s a function that takes a joke ID and returns a list of IDs along with the similarity ratings:

def get_scores(joke_id):
    scores = []
    joke = joke_embed[joke_id]
    for ix in range(0, 150):
        scores.append((ix, cos_similarity(joke, joke_embed[ix]).asnumpy()[0]))
    return scores

The following function takes a joke_id and finds the 3 most similar jokes. It then prints them one by one in a summary:

def print_joke_stats(ix):
    def by_second(t):
        if t[1] is None:
            return -2
        else:
            return t[1]

    similar = get_scores(ix)
    similar.sort(key=by_second)
    similar.reverse()

    print(f'Jokes making same people laugh compared to:\n\n=== \n{jokes[ix]}\n===:\n\n')

    # position 0 is the joke itself (similarity 1), so start from 1
    for rank in range(1, 4):
        print(f'---\n{jokes[similar[rank][0]]}\n---\n')

Let’s see what jokes our system found to be cracking up the same kinds of people:

> print_joke_stats(87)

Jokes making same people laugh compared to:

===
A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist.

The doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY...

"Can you read this?" the doctor asked.

"Read it?" the Czech answered. "Doc, I know him!"
===:

---
A woman has twins, and gives them up for adoption. One of them goes to a family in Egypt and is named "Amal." The other goes to a family in Spain; they name him "Juan." Years later, Juan sends a picture of himself to his mom. Upon receiving the picture, she tells her husband that she wishes she also had a picture of Amal.

Her husband responds, "But they are twins--if you've seen Juan, you've seen Amal."
---

---
An explorer in the deepest Amazon suddenly finds himself surrounded by a bloodthirsty group of natives. Upon surveying the situation, he says quietly to himself, "Oh God, I'm screwed."

The sky darkens and a voice booms out, "No, you are NOT screwed. Pick up that stone at your feet and bash in the head of the chief standing in front of you."

So with the stone he bashes the life out of the chief. He stands above the lifeless body, breathing heavily and looking at 100 angry natives...

The voice booms out again, "Okay....NOW you're screwed."
---

---
A man is driving in the country one evening when his car stalls and won't start. He goes up to a nearby farm house for help, and because it is suppertime he is asked to stay for supper. When he sits down at the table he notices that a pig is sitting at the table with them for supper and that the pig has a wooden leg.

As they are eating and chatting, he eventually asks the farmer why the pig is there and why it has a wooden leg.
"Oh," says the farmer, "that is a very special pig. Last month my wife and daughter were in the barn when it caught fire. The pig saw this, ran to the barn, tipped over a pail of water, crawled over the wet floor to reach them and pulled them out of the barn safely. A special pig like that, you just don't eat it all at once!"
---

Final words

The approach presented here is relatively simple, yet people have found it surprisingly accurate. It depends though on having enough data for each item. Otherwise the accuracy degrades. An extreme case of not having enough data is called a cold start.

Also, accuracy is not the only goal. Wikipedia lists features like “Serendipity” as an important factor of a successful system among others:

Serendipity is a measure of “how surprising the recommendations are”. For instance, a recommender system that recommends milk to a customer in a grocery store might be perfectly accurate, but it is not a good recommendation because it is an obvious item for the customer to buy. However, high scores of serendipity may have a negative impact on accuracy.

Researchers have been working on different approaches to tackling the above-mentioned issues. Netflix is known to be using a “hybrid” approach — one that uses both content-based and collaborative recommenders. As per Wikipedia:

Netflix is a good example of the use of hybrid recommender systems. The website makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).

Sentiment Analysis with Python

2018-05-18T00:00:00+00:00

Photograph by Helena Lopes, CC0

I recently had the chance to spend my weekend enhancing my knowledge by joining a local community meetup in Malaysia which is sponsored by Malaysian Global Innovation & Creativity Centre (MaGIC). The trainer was Mr Lee Boon Kong.

Anaconda and Jupyter Notebook

We started by preparing our Jupyter Notebook setup which is running on the Anaconda Python distribution. The installer is 500 MB in size but pretty handy when we started using it.

Anaconda comes with a graphical installer called “Navigator” so the user can install some packages for work. However, it did not always work for me on some OSes, so I had to use its command-line based tool “conda”. Conda works like Linux-based package management tools such as apt, dnf, yum, and pacman, so to install a package I would just run conda install <package name>.

Jupyter uses a web browser to allow us to write the code directly in its cell. It is quite helpful for us to debug the code or if we just want to execute it segment by segment independently.

Creating Twitter’s API key

First we need to head to apps.twitter.com.

The following items are needed:

Consumer Key (API key)
Consumer Secret (API secret)
Access Token
Access Token Secret

Using Tweepy, NLTK and TextBlob

from textblob import TextBlob
import tweepy
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
consumer_token = '<put your token here>'
consumer_secret = '<put your secret here>'
access_token = '<put your access token here>'
access_secret = '<put your access secret here>'
auth = tweepy.OAuthHandler(consumer_token, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
public_tweets = api.search("Avengers Infinity War", lang='en')
print("number of tweets extracted: " + str(len(public_tweets)))
for tweet in public_tweets:
    print(tweet.text)
    analysis = TextBlob(tweet.text)
    print(analysis.sentiment)
    print("\n")

If we want to increase the number of tweets to be displayed and analyzed, just change this line to:

public_tweets = api.search("avengers", count=100, result_type='recent', lang='en')

Analyzing Sentiment Score Results

The sentiment score that we got is summarized as follows:

< 0 - Negative sentiment
0 - Neutral
> 0 - Positive sentiment
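The thresholds can be turned into a small helper. This is a sketch with a made-up name, label_polarity; it takes the polarity number (the first value of the sentiment printed above) and returns a label.

```python
def label_polarity(polarity):
    """Map a polarity score to a human-readable sentiment label."""
    if polarity < 0:
        return "negative"
    if polarity > 0:
        return "positive"
    return "neutral"

for score in (-0.8, 0.0, 0.35):
    print(score, label_polarity(score))
```

In the loop above it could be called as label_polarity(analysis.sentiment.polarity).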

By default, the code above uses the English-based library. As a Malaysian, I could not analyze tweets in the Malay language yet. Efforts by the local community are being made to create the Malay-based language corpus for NLTK.

Looking at the TextBlob Component

TextBlob, a Natural Language Processing (NLP) library, did the sentiment processing task.

I did some reading in TextBlob’s documentation. So for example if I declare:

text = '''
I love to read!
'''

I get a sentiment polarity value of 0.5 (positive).

While if I put

text = '''
I hate to read!
'''

I get a sentiment polarity value of -1.0 (negative).

Summary

Overall I am quite satisfied with what I learned during the session. It is good to have one day spent on a technical workshop like this in which we could be super focused on the content without any external distraction.

Kudos to Mr Lee for his effort to teach us about data analysis with Python. Till we meet again, hopefully!

Shell Command Outputs Truncated in Python

2018-04-05T00:00:00+00:00

Photo by Sarah Pflug of Burst

Recently I was working on a Python script to do some parsing and processing on the output of shell commands in Ubuntu. The output that showed up was truncated.

The below sections will walk through the debugging process to identify the root cause and implement a solution with detailed explanation, using Python 2.

Problem

The following code block shows the output of a shell command which lists the installed packages, name and version, in Ubuntu.

# dpkg -l | grep ^ii | awk '{print $2 "    " $3}'
accountsservice    0.6.35-0ubuntu7.3
acl    2.2.52-1
adduser    3.113+nmu3ubuntu3
ant    1.9.3-2build1
ant-optional    1.9.3-2build1
apache2    2.4.7-1ubuntu4.18
apache2-bin    2.4.7-1ubuntu4.18
apache2-data    2.4.7-1ubuntu4.18
apache2-utils    2.4.7-1ubuntu4.18
apparmor    2.10.95-0ubuntu2.6~14.04.1

The same shell command executes in the Python console but the output shows truncated values for a few packages’ versions, for example, accountsservice, adduser, apache2, etc.

>>> import subprocess
>>> installed_packages = subprocess.check_output(['dpkg -l | grep ^ii | awk \'{print $2 "    " $3}\''], shell=True)
>>> print installed_packages
accountsservice    0.6.35-0ubuntu7.
acl    2.2.52-1
adduser    3.113+nmu3ubuntu
ant    1.9.3-2build1
ant-optional    1.9.3-2build1
apache2    2.4.7-1ubuntu4.1
apache2-bin    2.4.7-1ubuntu4.1
apache2-data    2.4.7-1ubuntu4.1
apache2-utils    2.4.7-1ubuntu4.1
apparmor    2.10.95-0ubuntu2

Root Cause

To identify the root cause of the problem, I started with the underlying dpkg -l command, without any filters or processing. I noticed two different results for this command, with and without the less command. The less command showed the complete result with scrolling as below.

# dpkg -l | less
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                  Version                                    Architecture Description
+++-=====================================-==========================================-============-===============================================================================
rc  aacraid                               1.2.1-52011                                amd64        This driver supports Adaptec by PMC aacraid family of cards.
ii  accountsservice                       0.6.35-0ubuntu7.3                          amd64        query and manipulate user account information
ii  acl                                   2.2.52-1                                   amd64        Access control list utilities
ii  adduser                               3.113+nmu3ubuntu3                          all          add and remove users and groups
ii  ant                                   1.9.3-2build1                              all          Java based build tool like make
ii  ant-optional                          1.9.3-2build1                              all          Java based build tool like make - optional libraries
ii  apache2                               2.4.7-1ubuntu4.18                          amd64        Apache HTTP Server
ii  apache2-bin                           2.4.7-1ubuntu4.18                          amd64        Apache HTTP Server (binary files and modules)
ii  apache2-data                          2.4.7-1ubuntu4.18                          all          Apache HTTP Server (common files)
ii  apache2-utils                         2.4.7-1ubuntu4.18                          amd64        Apache HTTP Server (utility programs for web servers)
rc  apache2.2-common                      2.2.22-1ubuntu1.11                         amd64        Apache HTTP Server common files
ii  apparmor                              2.10.95-0ubuntu2.6~14.04.1                 amd64        user-space parser utility for AppArmor

But plain dpkg -l prints truncated data on the screen because of the column width constraint. The truncated values exactly match the Python console output. The output width is determined by the COLUMNS environment variable, and commands restrict the width of their output columns based on its value.

# echo $COLUMNS
127

# dpkg -l
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                   Version          Architecture     Description
+++-======================-================-================-==================================================
rc  aacraid                1.2.1-52011      amd64            This driver supports Adaptec by PMC aacraid family
ii  accountsservice        0.6.35-0ubuntu7. amd64            query and manipulate user account information
ii  acl                    2.2.52-1         amd64            Access control list utilities
ii  adduser                3.113+nmu3ubuntu all              add and remove users and groups
ii  ant                    1.9.3-2build1    all              Java based build tool like make
ii  ant-optional           1.9.3-2build1    all              Java based build tool like make - optional librari
ii  apache2                2.4.7-1ubuntu4.1 amd64            Apache HTTP Server
ii  apache2-bin            2.4.7-1ubuntu4.1 amd64            Apache HTTP Server (binary files and modules)
ii  apache2-data           2.4.7-1ubuntu4.1 all              Apache HTTP Server (common files)
ii  apache2-utils          2.4.7-1ubuntu4.1 amd64            Apache HTTP Server (utility programs for web serve
rc  apache2.2-common       2.2.22-1ubuntu1. amd64            Apache HTTP Server common files
ii  apparmor               2.10.95-0ubuntu2 amd64            user-space parser utility for AppArmor

Solution

The subprocess module of Python provides complete untruncated output of the shell command when the argument env={} is passed to check_output function:

>>> installed_packages = subprocess.check_output(['dpkg -l | grep ^ii | awk \'{print $2 "    " $3}\''], shell=True, env={})
>>> print installed_packages
accountsservice    0.6.35-0ubuntu7.3
acl    2.2.52-1
adduser    3.113+nmu3ubuntu3
ant    1.9.3-2build1
ant-optional    1.9.3-2build1
apache2    2.4.7-1ubuntu4.18
apache2-bin    2.4.7-1ubuntu4.18
apache2-data    2.4.7-1ubuntu4.18
apache2-utils    2.4.7-1ubuntu4.18
apparmor    2.10.95-0ubuntu2.6~14.04.1

Explanation

Curious to know what is happening behind the scenes? The check_output function uses C library functions execv or execve for processing. It chooses the function based on the env argument.

Reference:

When no env argument is passed to subprocess.check_output, the os.execv function is called.

When an env argument is passed to subprocess.check_output, the os.execve function is called.

for (i = 0; exec_array[i] != NULL; ++i) {
    const char *executable = exec_array[i];
    if (envp) {
        execve(executable, argv, envp);
    } else {
        execv(executable, argv);
    }

What makes the execv and execve functions produce different output?

With execv, the child process inherits the parent’s environment, including the shell’s COLUMNS variable, which leads to truncating the output to a width of 127 columns, as on our reference system.

# echo $COLUMNS
127

>>> print subprocess.check_output(['dpkg -l | grep libqtcore4'], shell=True)
ii  libqtcore4:amd64          4:4.8.5+git192-g0 amd64             Qt 4 core module

>>> print subprocess.check_output(['dpkg -l | grep libqtcore4'], shell=True, env={'COLUMNS':'127'})
ii  libqtcore4:amd64          4:4.8.5+git192-g0 amd64             Qt 4 core module

The execve function takes an additional argument holding the environment variables for the new process (see environ). With env={}, the child process starts with an empty environment in which COLUMNS is not initialised, so the output values are returned without any column width restriction.

>>> print subprocess.check_output(['dpkg -l | grep libqtcore4'], shell=True, env={})
ii  libqtcore4:amd64                      4:4.8.5+git192-g085f851+dfsg-2ubuntu4.1    amd64        Qt 4 core module

>>> print subprocess.check_output(['dpkg -l | grep libqtcore4'], shell=True, env={'COLUMNS':''})
ii  libqtcore4:amd64                      4:4.8.5+git192-g085f851+dfsg-2ubuntu4.1    amd64        Qt 4 core module

For more details refer to the man pages of execv, execve, environ.

Conclusion

It is always good to pass the env={} argument to the subprocess.check_output function whenever processing shell command output in Python. It helps avoid unstable results down the line due to truncated values.
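An empty environment also drops variables such as PATH and LANG. A possible variation, shown here as a Python 3 sketch rather than the Python 2 console used above, is to copy the current environment and remove only COLUMNS. The helper name check_output_without_columns is made up for this example, and whether removing COLUMNS alone is enough for every tool is an assumption worth verifying on your system.

```python
import os
import subprocess

def check_output_without_columns(command, columns=None):
    """Run a shell command with COLUMNS removed (or pinned) in the child's environment."""
    env = dict(os.environ)      # keep PATH, LANG and friends
    env.pop("COLUMNS", None)    # drop the width hint inherited from the terminal
    if columns is not None:
        env["COLUMNS"] = str(columns)
    return subprocess.check_output(command, shell=True, env=env).decode()

# the child shell only sees what we put into env
print(check_output_without_columns('echo "[$COLUMNS]"'))
print(check_output_without_columns('echo "[$COLUMNS]"', 127))
```

The echo commands only demonstrate what the child process sees; in the script from this post the command would be the dpkg pipeline.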

Recycling Web Workers: Just Proper Hygiene

2018-03-23T00:00:00+00:00

Photo by Dano, CC BY 2.0

A long while back we were helping out a client with a mysterious and serious problem: The PostgreSQL instance was showing gradual memory growth, each of the processes slowly ballooning memory across a few days until the system triggered the OOM (out of memory) killer. Database at that point kicks out all connections and restarts. Downtime is bad, yo.

Spoiler alert: It was a prepared SQL statements bug in Rails. Sometimes it’s fun to take you through all the hair-pulling that goes into debugging something like this, but instead this Friday I’m feeling preachy.

Of course there’s a few different directions you could go to work around a problem like this:

Update your framework. If, of course, the fix has been released, or determined in the first place. And if your application doesn’t have any compatibility trouble preventing it from running on the updated version, or you feel comfortable back-patching the fix yourself. And if the Change Management Officer doesn’t try to string you up for wanting to update production willy nilly. (Not everyone works at a startup!) But do add it as a milestone. It should be one anyway.
Take a different code path. Feature switches, like three-point seat belts and pocket breath mints, fall into the category of neat things that seem like a little burden until you really need them. In fact in this case, turning off prepared statements in unpatched Rails deployments is the recommended workaround. It’ll still take some testing, but is likely less risky than changing the framework code itself. It might also be a slight performance hit, but then so is a crashing database.
Recycle your worker processes periodically. Simple. Readily doable. And usually entirely undisruptive. And thus I’m here advocating that you think about doing it occasionally as a matter of course.

The usual way to get worker processes to recycle gracefully is to send a SIGHUP signal to the application master process. At that point each worker process finishes handling its current request and then quits, after which the master starts up a new one to handle the next request. It’s typically a seamless process.

You could do that through cron every now and then, perhaps daily or whatever is appropriate. But some app servers have this built in, usually after processing some given number of requests (it’ll usually be a “max_requests” parameter, or something very close to that.)

On the Python side, both gunicorn and uwsgi have it. The name variation is ever so slightly different (max_requests, sometimes, versus max-requests).
For Ruby, Passenger has it as a parameter, while unicorn has a separate gem, unicorn-worker-killer, that does this.
php-fpm has it as a parameter as well, though if you’re using Apache httpd to host the application through mod_php directly, MaxConnectionsPerChild is what you want to set.
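To make the max-requests behavior concrete, here is a toy simulation in Python. It is not how any of the servers above are implemented; Worker and Master are made-up classes that only model the bookkeeping: a worker finishes its current request, and once it has reached its limit the master replaces it with a fresh one.

```python
class Worker:
    def __init__(self, worker_id, max_requests):
        self.worker_id = worker_id
        self.max_requests = max_requests
        self.handled = 0

    def handle(self, request):
        self.handled += 1
        return f"worker {self.worker_id} handled {request}"

    @property
    def exhausted(self):
        return self.handled >= self.max_requests


class Master:
    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.spawned = 0
        self.worker = self._spawn()

    def _spawn(self):
        self.spawned += 1
        return Worker(self.spawned, self.max_requests)

    def dispatch(self, request):
        response = self.worker.handle(request)
        if self.worker.exhausted:
            # the request in flight is done, so retire the worker
            # and start a fresh one for the next request
            self.worker = self._spawn()
        return response


master = Master(max_requests=3)
for n in range(7):
    print(master.dispatch(f"request {n}"))
```

Whatever memory the old worker accumulated goes away with it, which is the whole point of the exercise.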

I would be remiss if I didn’t mention a couple potential downsides. The first is that while no request will be left behind, the first one that hits each new worker process might see a slight delay as code is reloaded and any process cache is warmed.

The second is that it may happen unexpectedly, which could be a problem if some changes had been made but the app process hadn’t seen it yet. This could be a deploy that’s still in progress, or some change that was left out there to be completed later. (On one hand, tsk, tsk; on the other, eh, it does happen.)

There you have it. As a coworker said in chat: Recycling worker processes, it’s just good hygiene.

Regular Expression Inconsistencies With Unicode

2018-01-23T00:00:00+00:00

A casual stroll through the world of Unicode and regular expressions—Photo by Presidio of Monterey

Character classes in regular expressions are an extremely useful and widespread feature, but there are some relatively recent changes that you might not know of.

The issue stems from how different programming languages, locales, and character encodings treat predefined character classes. Take, for example, the expression \w which was introduced in Perl around the year 1990 (along with \d and \s and their inverted sets \W, \D, and \S).

The \w shorthand is a character class that matches “word characters” as the C language understands them: [a-zA-Z0-9_]. At least when ASCII was the main player in the character encoding scene, that simple fact was true. With the standardization of Unicode and UTF-8, the meaning of \w has become a bit more foggy.

Perl

Take this example in a recent Perl version:

use 5.012; # use 5.012 or higher includes Unicode support
use utf8;  # necessary for Unicode string literals

print "username" =~ /^\w+$/; # 1
print "userاسم"  =~ /^\w+$/; # 1

Perl is treating \w differently here because the characters “اسم” (“ism” meaning “name” in Arabic) definitely don’t fall within [a-zA-Z0-9_]!

Beginning with Perl 5.12 from the year 2010, character classes are handled differently. Documentation on the topic is found in perlrecharclass. The rules aren’t as simple as with some languages, but can be generalized as such:

\w will match Unicode characters with the “Word” property (equivalent to \p{Word}), unless the /a (ASCII) flag is enabled, in which case it will be equivalent to the original [a-zA-Z0-9_].

Let’s see the /a flag in action.

use 5.012;
use utf8;

print "username" =~ /^\w+$/a; # 1
print "userاسم"  =~ /^\w+$/a; # prints nothing (no match)

However, you should know that for code points below 256, these rules can change depending on whether Unicode or locale rules are on, so if you’re unsure, consult perlre and perlrecharclass.

Keep in mind that these same questions of what the character classes include can apply to every predefined character class in whatever language you’re using, so remember to check language-specific implementations for other character class shorthands, such as \s and \d, not just \w.

Every language seems to do regular expressions a little bit differently, so here’s a short, incomplete guide for several other languages we use frequently.

Python

Take this example in Python 3.6.2:

>>> import re
>>> re.match(r'^\w+$', 'username')
<_sre.SRE_Match object; span=(0, 8), match='username'>
>>> re.match(r'^\w+$', 'userاسم')
<_sre.SRE_Match object; span=(0, 7), match='userاسم'>

Python is also treating \w differently here. Let’s take a look at the Python docs:

\w

For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

So \w includes “most characters that can be part of a word in any language, as well as numbers and the underscore”. A list of the characters that includes is difficult to pin down, so it would be best to use the re.ASCII flag as suggested when you’re unsure if you want letters from other languages matched:

>>> re.match(r'^\w+$', 'userاسم',  flags=re.ASCII)
>>> re.match(r'^\w+$', 'username', flags=re.ASCII)
<_sre.SRE_Match object; span=(0, 8), match='username'>
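
The same caveat applies to \d and \s in Python, not just \w. Here is a small sketch, assuming Python 3 str patterns; the sample strings (Arabic-Indic digits and a no-break space) are my own picks for illustration:

```python
import re

arabic_digits = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE
nbsp = "\u00a0"                       # NO-BREAK SPACE

# Default (Unicode) matching for str patterns
unicode_digits = re.fullmatch(r"\d+", arabic_digits) is not None
unicode_space = re.fullmatch(r"\s", nbsp) is not None

# ASCII-only matching
ascii_digits = re.fullmatch(r"\d+", arabic_digits, flags=re.ASCII) is not None
ascii_space = re.fullmatch(r"\s", nbsp, flags=re.ASCII) is not None

print(unicode_digits, unicode_space)  # True True
print(ascii_digits, ascii_space)      # False False
```

If a field is supposed to hold only 0-9, an explicit [0-9] or the re.ASCII flag is the safer choice.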

Ruby

Ruby’s Regexp class documentation gives a simple and useful explanation: backslash character classes (e.g. \w, \s, \d) are ASCII-only, while POSIX-style bracket expressions (e.g. [[:alnum:]]) include other Unicode characters.

irb(main):001:0> /^\w+$/         =~ "userاسم"
=> nil
irb(main):002:0> /^[[:word:]]+$/ =~ "userاسم"
=> 0

JavaScript

JavaScript doesn’t support POSIX-style bracket expressions, and its backslash character classes are simple, straightforward lists of ASCII characters. The MDN has simple explanations for each one.

JavaScript regular expressions do accept a /u flag, but it does not affect shorthand character classes. Consider these examples in Node.js:

> /^\w+$/.test("username");
true
> /^\w+$/.test("userاسم");
false
> /^\w+$/u.test("username");
true
> /^\w+$/u.test("userاسم");
false

We can see that the /u flag has no effect on what \w matches. Now let’s look at Unicode character lengths in JavaScript:

> '❤'.length
1
> '👩'.length
2
> '🀄️'.length
3

Because of the way Unicode is implemented in JavaScript, strings with Unicode characters outside the BMP (Basic Multilingual Plane) will appear to be longer than they are.
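
To see where those numbers come from, it helps to count UTF-16 code units, which is what JavaScript’s .length reports. The sketch below does that from Python, where len() counts code points instead; utf16_length is a helper name made up for this example:

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's .length counts."""
    return len(text.encode("utf-16-le")) // 2

samples = [
    ("heart", "\u2764"),
    ("woman", "\U0001F469"),
    ("mahjong tile + variation selector", "\U0001F004\uFE0F"),
]

for label, s in samples:
    # code points (Python's len) versus UTF-16 code units (JavaScript's .length)
    print(label, len(s), utf16_length(s))
```

The heart fits in one code unit, the woman needs a surrogate pair, and the mahjong tile is a surrogate pair followed by a separate variation selector character.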

This can be accounted for in regular expressions with the /u flag, which only corrects character parsing, and does not affect shorthand character classes:

> let mystr = "hi👩there";
undefined
> mystr.length
9
> /hi.there/.test(mystr);
false
> /hi..there/.test(mystr);
true
> /hi.there/u.test(mystr);  // note the /u from here on
true
> /hi..there/u.test(mystr);
false
> /hi..there/u.test("hi👩👩there");
true

The excellent article "💩".length === 2 by Jonathan New goes into detail about why this is, and explores various solutions. It also addresses some legacy inconsistencies, like how the old HEAVY BLACK HEART character and other older Unicode symbols might be represented differently.

PHP

PHP’s documentation explains that \w matches letters, digits, and the underscore as defined by your locale. It’s not totally clear about how Unicode is treated, but it uses the PCRE (Perl Compatible Regular Expressions) library which supports a /u flag that can be used to enable Unicode matching in character classes:

<?php

echo preg_match("/^\\w+$/", "username"), "\n";  # 1
echo preg_match("/^\\w+$/", "userاسم"),  "\n";  # 0

echo preg_match("/^\\w+$/u", "username"), "\n"; # 1
echo preg_match("/^\\w+$/u", "userاسم"),  "\n"; # 1

.NET

The .NET Quick Reference has a comprehensive guide to character classes. For word characters, it defines a specific group of Unicode categories including letters, modifiers, and connectors from many languages, but also points out that setting the ECMAScript Matching Behavior option will limit \w to [a-zA-Z_0-9], among other things. Microsoft’s documentation is clear and comprehensive with great examples, so I recommend referring to it frequently.

Go

Go follows the regular expression syntax used by Google’s RE2 engine, which has easy syntax for specifying whether you want Unicode characters to be captured or not:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Perl-style
	fmt.Println(regexp.MatchString(`^\w+$`, "username")) // true
	fmt.Println(regexp.MatchString(`^\w+$`, "userاسم"))  // false

	// POSIX-style
	fmt.Println(regexp.MatchString(`^[[:word:]]+$`, "username")) // true
	fmt.Println(regexp.MatchString(`^[[:word:]]+$`, "userاسم"))  // false

	// Unicode character class
	fmt.Println(regexp.MatchString(`^\pL+$`, "username")) // true
	fmt.Println(regexp.MatchString(`^\pL+$`, "userاسم"))  // true
}

You can see this code in action here.

grep

Implementations of grep vary widely across platforms and versions. On my personal computer with GNU grep 3.1, this pattern doesn’t match at all with default settings (in basic regular expressions + is a literal character unless written as \+), \w matches only ASCII characters with the -P (PCRE) option, and it matches Unicode characters with -E:

[phin@caballero ~]$ grep    "^\w+$" <(echo "username")  # no match
[phin@caballero ~]$ grep -P "^\w+$" <(echo "username")
username
[phin@caballero ~]$ grep -P "^\w+$" <(echo "userاسم")   # no match
[phin@caballero ~]$ grep -E "^\w+$" <(echo "username")
username
[phin@caballero ~]$ grep -E "^\w+$" <(echo "userاسم")
userاسم

Again, implementations vary a lot, so double check on your system before doing anything important.

Conference Recap: PyCon Asia Pacific (APAC) 2017 in Kuala Lumpur, Malaysia

2017-12-02T00:00:00+00:00

I got a chance to attend the annual PyCon APAC 2017 (Python Conference, Asia Pacific), which was hosted in my homeland, Malaysia. In previous years, Python conferences in Malaysia were held at the national level, and this year Malaysia’s PyCon committee worked hard on organizing a broader Asia-level regional conference.

Highlights from Day 1

The first day of the conference began with a keynote delivered by Luis Miguel Sanchez, the founder of SGX Analytics, a New York City-based data science/data strategy advisory firm. Luis shared thoughts about the advancement of artificial intelligence and machine learning in many areas, including demonstrations of automated music generation. In his talk Luis presented his application, which composed a song using his AI algorithm. He also told us a bit about the legal aspects of music produced by his algorithm.

Luis speaking to the audience. Photo from PyCon’s Flickr.

Then I attended Amir Othman’s talk, which discussed data mining techniques for news in the Malay and German languages (he received his education at a German tertiary institution). His discussion included verifying the source of the news and the language structures of German and Malay, which have similarities with English. First, Amir mentioned language detection using pycld2. He then shared the backend setup for his news crawler, which includes RSS and Twitter feeds for input, Redis as a message queue, and spaCy and Polyglot for “entity recognition”.

Quite a number of speakers spoke about gensim, including Amir, who used it for “topic modelling”. Amir also used TF-IDF (term frequency–inverse document frequency), a numerical statistic intended to reflect how significant a word is to a document in a corpus. For the similarity lookup aspect, he used word2vec on the entire corpus. For full-text search he used Elasticsearch.
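
For readers who haven’t met TF-IDF before, here is a minimal sketch in plain Python. It is not the code from the talk (gensim ships its own implementation), and the weighting used here, tf = count / document length and idf = log(N / document frequency), is only one common variant. The toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumption: tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)

    # In how many documents does each term appear at least once?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the parliament passed the budget".split(),
]
scores = tf_idf(docs)

print(scores[2]["the"])                            # 0.0: "the" is in every document
print(scores[2]["parliament"] > scores[1]["cat"])  # True: rarer words score higher
```

A word that shows up everywhere tells you nothing about any one document, which is exactly what the zero score for “the” expresses.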

Later I attended Mr. Ng Swee Meng’s talk, in which he shared his effort in the Sinar Project to process the government of Malaysia’s publicly available documents using his Python code. He described characterizing documents with a bag-of-words model plus stopword removal, much as is done for English. Mr. Ng’s work focuses on Malay-language documents, and he found that the already available Indonesian stopword lists could be adapted to Malay. Ng also mentioned the use of gensim in his work.

Highlights from Day 2

The second day’s talks began with a keynote from Jessica McKellar, who was involved in the development of Ksplice, Zulip (co-founder), and Dropbox. She highlighted her involvement with San Quentin prison to help the convicts prepare for real-world opportunities after they get out. Jessica also mentioned diversity issues of men and women in computing, race diversity, and the accessibility of technical devices. Jessica said that the problem of getting more people involved in computing is not due to lack of interest, but due to lack of access. She also praised the effort made by PyCon UK to help visually impaired attendees attend a conference. If possible, a conference should be wheelchair friendly too.

Me standing in the audience. Photo from PyCon’s Flickr.

I found the talk by Praveen Patil entitled “Physics and Math with Python” really interesting. Praveen showed his effort to make teaching physics and mathematics interesting for students. Apart from the code snippets he also shared the electronic gadgets which were being used for the subjects.

The other talk which I attended was delivered by Hironori Sekine on the technologies being used by startups in Japan. Hironori mentioned that Ruby is widely used by Japanese startups and that many books on Ruby have been published in Japanese. Other programming languages being used include Java, PHP, Scala, and Go. Python has been growing more popular since last year, as books in the local language started to be published.

Conclusion

Overall I really appreciate the efforts of the organizer. Though it was the first ever APAC-based PyCon held in Malaysia, I felt that it was very well organized and I could not complain about anything. Thumbs up for the effort and hopefully I can attend next year’s event!

Recognizing handwritten digits: a quick peek into the basics of machine learning

2017-05-30T00:00:00+00:00

In the previous two posts on machine learning, I presented a very basic introduction of an approach called “probabilistic graphical models”. In this post I’d like to take a tour of some different techniques while creating code that will recognize handwritten digits.

Handwritten digit recognition is an interesting topic that has been explored for many years. It is now considered one of the best ways to start the journey into the world of machine learning.

Taking the Kaggle challenge

We’ll take the “digits recognition” challenge as presented on Kaggle, an online platform with challenges for data scientists. Most of the challenges offer real-money prizes. Some of them are there to help us out on our journey of learning data science techniques, and the “digits recognition” contest is one of those.

The challenge

As explained on Kaggle:

MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision.

The “digits recognition” challenge is one of the best ways to get acquainted with machine learning and computer vision. The so-called “MNIST” dataset consists of 70k images of handwritten digits, each one grayscale and 28x28 pixels in size. The Kaggle challenge is about taking a subset of 42k of them along with labels (the actual number each image shows) and “training” the computer on that set. The next step is to take the remaining 28k images without labels and “predict” which number each one shows.

Here’s a short overview of what the digits in the set really look like (along with the numbers they represent):

I have to admit that for some of them I have a really hard time recognizing the actual numbers on my own :)

The general approach to supervised learning

Learning from labelled data is what is called “supervised learning”. It’s supervised because we’re taking the computer by the hand through the whole training data set and “teaching” it what the data linked with different labels “looks” like.

In all such scenarios we can express the data and labels as:

Y ~ X1, X2, X3, X4, ..., Xn

Y is called the dependent variable, while X1 through Xn are the independent variables. This formula holds for both classification and regression problems.

Classification is when the dependent variable Y is so-called categorical: taking values from a concrete set without a meaningful order. Regression is when Y is not categorical, most often continuous.

In the digits recognition challenge we’re faced with the classification task. The dependent variable takes values from the set:

Y = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }

I’m sure the question you might be asking yourself now is: what are the independent variables Xn? It turns out to be the crux of the whole problem to solve :)

The plan of attack

A good introduction to computer vision techniques is a book by J. R. Parker, “Algorithms for Image Processing and Computer Vision”. I encourage the reader to buy that book. I took some ideas from it while having fun with my own solution to the challenge.

The book outlines the ideas revolving around computing image profiles—for each side. For each row of pixels, a number representing the distance of the first pixel from the edge is computed. This way we’re getting our first independent variables. To capture even more information about digit shapes, we’ll also capture the differences between consecutive row values as well as their global maxima and minima. We’ll also compute the width of the shape for each row.
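
To make the idea of profiles concrete, here is a toy sketch in plain Python on a 5x5 image. It only illustrates the idea and is not the code used for the challenge; the function names and the tiny image are made up, and a non-zero pixel is assumed to mean “ink”.

```python
def left_profile(image):
    """For each row, the 1-based column of the first non-zero pixel (0 if the row is empty)."""
    profile = []
    for row in image:
        position = 0
        for x, pixel in enumerate(row, start=1):
            if pixel > 0:
                position = x
                break
        profile.append(position)
    return profile

def right_profile(image):
    """Same as left_profile, but measured from the right edge."""
    return left_profile([row[::-1] for row in image])

def diffs(profile):
    """Change of the profile between consecutive rows (the first row keeps its own value)."""
    return [profile[0]] + [b - a for a, b in zip(profile, profile[1:])]

def widths(left, right, image_width):
    """Width of the shape in each row, derived from the two profiles."""
    return [image_width - (lft - 1) - (rgt - 1) for lft, rgt in zip(left, right)]

# A toy "1"-like shape
image = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
]
left = left_profile(image)
right = right_profile(image)

print(left)                    # [3, 2, 3, 3, 2]
print(right)                   # [3, 3, 3, 3, 2]
print(diffs(left))             # [3, -1, 1, 0, -1]
print(widths(left, right, 5))  # [1, 2, 1, 1, 3]
```

Each of those lists becomes a handful of independent variables: one number per row of the image.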

Because the handwritten digits vary greatly in their thickness, we will first preprocess the images to detect the so-called skeletons of the digits. The skeleton is an image representation where the thickness of the shape has been reduced to just one pixel.

Having the image thinned will also allow us to capture some more info about the shapes. We will write an algorithm that walks the skeleton and records the direction change frequencies.

Once we have our set of independent variables Xn, we’ll use a classification algorithm to first learn in a supervised way (using the provided labels) and then to predict the values of the test data set. Lastly we’ll submit our predictions to Kaggle and see how well we did.

Having fun with languages

In the data science world, the lingua franca still remains the R programming language. In recent years Python has also come close in popularity, and nowadays we can say it’s the duo of R and Python that rules the data science world (not counting high performance code written e.g. in C++ in production systems).

Lately a new language designed with data scientists in mind has emerged - Julia. It’s a language with characteristics of both dynamically typed scripting languages as well as strictly typed compiled ones. It compiles its code into efficient native binary via LLVM—but it’s using it in a JIT fashion - inferring the types when needed on the go.

While having fun with the Kaggle challenge I’ll use Julia and Python for the so-called feature extraction phase (the one in which we’re computing information about our Xn variables). I’ll then turn to R for doing the classification itself. Note that I could use any of those languages at each step and get very similar results. The purpose of this series of articles is to be a fun bird’s-eye overview, so I decided that this way would be much more interesting.

Feature Extraction

The end result of this phase is the data frame saved as a CSV file so that we’ll be able to load it in R and do the classification.

First let’s define the general function in Julia that takes the name of the input CSV file and returns a data frame with features of given images extracted into columns:

using DataFrames
using ProgressMeter # provides the @showprogress macro used in extract

function get_data(name :: String, include_label = true)
  println("Loading CSV file into a data frame...")
  table = readtable(string(name, ".csv"))
  extract(table, include_label)
end

Now the extract function looks like the following:

"""
Extracts the features from the dataframe. Puts them into
separate columns and removes all other columns except the
labels.

The features:

* Left and right profiles (after fitting into the same sized rect):
  * Min
  * Max
  * Width[y]
  * Diff[y]
* Paths:
  * Frequencies of movement directions
  * Simplified directions:
    * Frequencies of 3 element simplified paths
"""
function extract(frame :: DataFrame, include_label = true)
  println("Reshaping data...")

  function to_image(flat :: Array{Float64}) :: Array{Float64}
    dim      = Base.isqrt(length(flat))
    reshape(flat, (dim, dim))'
  end

  from = include_label ? 2 : 1
  frame[:pixels] = map((i) -> convert(Array{Float64}, frame[i, from:end]) |> to_image, 1:size(frame, 1))
  images = frame[:, :pixels] ./ 255
  data = Array{Array{Float64}}(length(images))

  @showprogress 1 "Computing features..." for i in 1:length(images)
    features = pixels_to_features(images[i])
    data[i] = features_to_row(features)
  end
  start_column = include_label ? [:label] : []
  columns = vcat(start_column, features_columns(images[1]))

  result = DataFrame()
  for c in columns
    result[c] = []
  end

  for i in 1:length(data)
    if include_label
      push!(result, vcat(frame[i, :label], data[i]))
    else
      push!(result, vcat([],               data[i]))
    end
  end

  result
end

A few nice things to notice here about Julia itself are:

The function documentation is written in Markdown
We can nest functions inside other functions
The language is dynamically but strongly typed, with optional type annotations
Types can be inferred from the context
It is often desirable to provide the concrete types to improve performance (but that’s an advanced Julia-related topic)
Arrays are indexed from 1
There’s the nice |> operator found e.g. in Elixir (which I absolutely love)

The above code converts the images to be arrays of Float64 and converts the values to be within 0 and 1 (instead of 0..255 originally).

A thing to notice is that in Julia we can vectorize operations easily and we’re using this fact to tersely convert our numbers:

images = frame[:, :pixels] ./ 255

We are referencing the pixels_to_features function which we define as:

"""
Returns ImageStats struct for the image pixels
given as an argument
"""
function pixels_to_features(image :: Array{Float64})
  dim      = Base.isqrt(length(image))
  skeleton = compute_skeleton(image)
  bounds   = compute_bounds(skeleton)
  resized  = compute_resized(skeleton, bounds, (dim, dim))
  left     = compute_profile(resized, :left)
  right    = compute_profile(resized, :right)
  width_min, width_max, width_at = compute_widths(left, right, image)
  frequencies, simples = compute_transitions(skeleton)

  ImageStats(dim, left, right, width_min, width_max, width_at, frequencies, simples)
end

This in turn uses the ImageStats structure:

# ProfileStats has to be defined first because ImageStats refers to it
immutable ProfileStats
  min :: Int64
  max :: Int64
  at  :: Array{Int64}
  diff :: Array{Int64}
end

immutable ImageStats
  image_dim             :: Int64
  left                  :: ProfileStats
  right                 :: ProfileStats
  width_min             :: Int64
  width_max             :: Int64
  width_at              :: Array{Int64}
  direction_frequencies :: Array{Float64}

  # The following adds information about transitions
  # in 2 element simplified paths:
  simple_direction_frequencies :: Array{Float64}
end

The pixels_to_features function first gets the skeleton of the digit shape as an image and then uses other functions passing that skeleton to them. The function returning the skeleton utilizes the fact that in Julia it’s trivially easy to use Python libraries. Here’s its definition:

using PyCall

@pyimport skimage.morphology as cv

"""
Thin the number in the image by computing the skeleton
"""
function compute_skeleton(number_image :: Array{Float64}) :: Array{Float64}
  convert(Array{Float64}, cv.skeletonize_3d(number_image))
end

It uses the scikit-image library’s skeletonize_3d function by importing it with the @pyimport macro and calling it as if it were regular Julia code.

Next the code crops the digit itself from the 28x28 image and resizes it back to 28x28 so that the edges of the shape always “touch” the edges of the image. For this we need the function that returns the bounds of the shape so that it’s easy to do the cropping:

function compute_bounds(number_image :: Array{Float64}) :: Bounds
  rows = size(number_image, 1)
  cols = size(number_image, 2)

  saw_top = false
  saw_bottom = false

  top = 1
  bottom = rows
  left = cols
  right = 1

  for y = 1:rows
    saw_left = false
    row_sum = 0

    for x = 1:cols
      row_sum += number_image[y, x]

      if !saw_top && number_image[y, x] > 0
        saw_top = true
        top = y
      end

      if !saw_left && number_image[y, x] > 0 && x < left
        saw_left = true
        left = x
      end

      if saw_top && !saw_bottom && x == cols && row_sum == 0
        saw_bottom = true
        bottom = y - 1
      end

      if number_image[y, x] > 0 && x > right
        right = x
      end
    end
  end
  Bounds(top, right, bottom, left)
end

Resizing the image is pretty straight-forward:

using Images

function compute_resized(image :: Array{Float64}, bounds :: Bounds, dims :: Tuple{Int64, Int64}) :: Array{Float64}
  # rows run top..bottom and columns left..right, matching the [y, x] indexing in compute_bounds
  cropped = image[bounds.top:bounds.bottom, bounds.left:bounds.right]
  imresize(cropped, dims)
end

Next, we need to compute the profile stats as described in our plan of attack:

function compute_profile(image :: Array{Float64}, side :: Symbol) :: ProfileStats
  @assert side == :left || side == :right

  rows = size(image, 1)
  cols = size(image, 2)

  columns = side == :left ? collect(1:cols) : (collect(1:cols) |> reverse)
  at = zeros(Int64, rows)
  diff = zeros(Int64, rows)
  min = rows
  max = 0

  min_val = cols
  max_val = 0

  for y = 1:rows
    for x = columns
      if image[y, x] > 0
        at[y] = side == :left ? x : cols - x + 1

        if at[y] < min_val
          min_val = at[y]
          min = y
        end

        if at[y] > max_val
          max_val = at[y]
          max = y
        end
        break
      end
    end
    if y == 1
      diff[y] = at[y]
    else
      diff[y] = at[y] - at[y - 1]
    end
  end

  ProfileStats(min, max, at, diff)
end

The widths of shapes can be computed with the following:

function compute_widths(left :: ProfileStats, right :: ProfileStats, image :: Array{Float64}) :: Tuple{Int64, Int64, Array{Int64}}
  image_width = size(image, 2)
  min_width = image_width
  max_width = 0
  width_ats = length(left.at) |> zeros

  for row in 1:length(left.at)
    width_ats[row] = image_width - (left.at[row] - 1) - (right.at[row] - 1)

    if width_ats[row] < min_width
      min_width = width_ats[row]
    end

    if width_ats[row] > max_width
      max_width = width_ats[row]
    end
  end

  (min_width, max_width, width_ats)
end

And lastly, the transitions:

function compute_transitions(image :: Image) :: Tuple{Array{Float64}, Array{Float64}}
  history = zeros((size(image,1), size(image,2)))

  function next_point() :: Nullable{Point}
    point = Nullable()

    for row in 1:size(image, 1) |> reverse
      for col in 1:size(image, 2) |> reverse
        if image[row, col] > 0.0 && history[row, col] == 0.0
          point = Nullable((row, col))
          history[row, col] = 1.0

          return point
        end
      end
    end
  end

  function next_point(point :: Nullable{Point}) :: Tuple{Nullable{Point}, Int64}
    result = Nullable()
    trans = 0

    function direction_to_moves(direction :: Int64) :: Tuple{Int64, Int64}
      # for frequencies:
      # 8 1 2
      # 7 - 3
      # 6 5 4
      [
       ( -1,  0 ),
       ( -1,  1 ),
       (  0,  1 ),
       (  1,  1 ),
       (  1,  0 ),
       (  1, -1 ),
       (  0, -1 ),
       ( -1, -1 ),
      ][direction]
    end

    function peek_point(direction :: Int64) :: Nullable{Point}
      actual_current = get(point)

      row_move, col_move = direction_to_moves(direction)

      new_row = actual_current[1] + row_move
      new_col = actual_current[2] + col_move

      if new_row <= size(image, 1) && new_col <= size(image, 2) &&
         new_row >= 1 && new_col >= 1
        return Nullable((new_row, new_col))
      else
        return Nullable()
      end
    end

    for direction in 1:8
      peeked = peek_point(direction)

      if !isnull(peeked)
        actual = get(peeked)
        if image[actual[1], actual[2]] > 0.0 && history[actual[1], actual[2]] == 0.0
          result = peeked
          history[actual[1], actual[2]] = 1
          trans = direction
          break
        end
      end
    end

    ( result, trans )
  end

  function trans_to_simples(transition :: Int64) :: Array{Int64}
    # for frequencies:
    # 8 1 2
    # 7 - 3
    # 6 5 4

    # for simples:
    # - 1 -
    # 4 - 2
    # - 3 -
    [
      [ 1 ],
      [ 1, 2 ],
      [ 2 ],
      [ 2, 3 ],
      [ 3 ],
      [ 3, 4 ],
      [ 4 ],
      [ 1, 4 ]
    ][transition]
  end

  transitions     = zeros(8)
  simples         = zeros(16)
  last_simples    = [ ]
  point           = next_point()
  num_transitions = .0
  ind(r, c) = (c - 1)*4 + r

  while !isnull(point)
    point, trans = next_point(point)

    if isnull(point)
      point = next_point()
    else
      current_simples = trans_to_simples(trans)
      transitions[trans] += 1
      for simple in current_simples
        for last_simple in last_simples
          simples[ind(last_simple, simple)] +=1
        end
      end
      last_simples = current_simples
      num_transitions += 1.0
    end
  end

  (transitions ./ num_transitions, simples ./ num_transitions)
end

All those gathered features can be turned into rows with:

function features_to_row(features :: ImageStats)
  lefts       = [ features.left.min,  features.left.max  ]
  rights      = [ features.right.min, features.right.max ]

  left_ats    = [ features.left.at[i]  for i in 1:features.image_dim ]
  left_diffs  = [ features.left.diff[i]  for i in 1:features.image_dim ]
  right_ats   = [ features.right.at[i] for i in 1:features.image_dim ]
  right_diffs = [ features.right.diff[i]  for i in 1:features.image_dim ]
  frequencies = features.direction_frequencies
  simples     = features.simple_direction_frequencies

  vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
end

Similarly we can construct the column names with:

function features_columns(image :: Array{Float64})
  image_dim   = Base.isqrt(length(image))

  lefts       = [ :left_min,  :left_max  ]
  rights      = [ :right_min, :right_max ]

  left_ats    = [ Symbol("left_at_",  i) for i in 1:image_dim ]
  left_diffs  = [ Symbol("left_diff_",  i) for i in 1:image_dim ]
  right_ats   = [ Symbol("right_at_", i) for i in 1:image_dim ]
  right_diffs = [ Symbol("right_diff_", i) for i in 1:image_dim ]
  frequencies = [ Symbol("direction_freq_", i)   for i in 1:8 ]
  simples     = [ Symbol("simple_trans_", i)   for i in 1:4^2 ]

  vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
end

The data frame constructed with the get_data function can be easily dumped into a CSV file with the writetable function from the DataFrames package.

You may notice that gathering / extracting features is a lot of work. All of this was necessary because in this article we’re focusing on the somewhat “classical” way of doing machine learning. You might have heard about algorithms that mimic how the human brain learns. We’re not focusing on them here; we will explore them in some future article.

We use the mentioned writetable on data frames computed for both training and test datasets to store two files: processed_train.csv and processed_test.csv.

Choosing the model

For the task of classifying I decided to use the XGBoost library, which is something of a hot new technology in the world of machine learning. It implements gradient-boosted decision trees and often improves on the results of the so-called Random Forest algorithm. The reader can read more about XGBoost on its website: https://xgboost.readthedocs.io/.

Both random forest and XGBoost revolve around an idea called ensemble learning. In this approach we’re not getting just one learning model: the algorithm actually creates many variations of models and uses them to collectively come up with better results. That short description will have to do, as this article is already quite lengthy.
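
As a toy illustration of the ensemble idea, here is majority voting in plain Python, the combination rule used by bagging-style ensembles such as random forests (boosting, as in XGBoost, instead adds up the outputs of trees trained one after another). The three “models” and their feature names are hypothetical and deliberately simplistic.

```python
from collections import Counter

def majority_vote(models, sample):
    """Ask every model for a label and return the most common answer."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]

# Three weak, made-up classifiers that each look at a single feature
# and try to tell the digit 8 from the digit 1.
models = [
    lambda s: "8" if s["loops"] >= 2 else "1",
    lambda s: "8" if s["width"] > 10 else "1",
    lambda s: "8" if s["ink"] > 0.3 else "1",
]

sample = {"loops": 2, "width": 9, "ink": 0.35}
print(majority_vote(models, sample))  # 8 (two of the three models vote for it)
```

The second model gets this sample wrong on its own, but the ensemble as a whole still lands on the right answer.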

Training the model

The training and classification code in R is very simple. We first need to load the libraries that will allow us to load data as well as to build the classification model:

library(xgboost)
library(readr)

Loading the data into data frames is equally straight-forward:

processed_train <- read_csv("processed_train.csv")
processed_test <- read_csv("processed_test.csv")

We then move on to preparing the vector of labels for each row as well as the matrix of features:

labels = processed_train$label
features = processed_train[, 2:141]
features = scale(features)
features = as.matrix(features)

The train-test split

When working with models, one of the ways of evaluating their performance is to split the data into so-called train and test sets. We train the model on one set and then we predict the values from the test set. We then calculate the accuracy of predicted values as the ratio between the number of correct predictions and the number of all observations.

Because Kaggle provides the test set without labels, for the sake of evaluating the model’s performance without the need to submit the results, we’ll split our Kaggle-training set into local train and test ones. We’ll use the amazing caret library which provides a wealth of tools for doing machine learning:

library(caret)

index <- createDataPartition(processed_train$label, p = .8,
                             list = FALSE,
                             times = 1)

train_labels <- labels[index]
train_features <- features[index,]

test_labels <- labels[-index]
test_features <- features[-index,]

The above code splits the set uniformly based on the labels so that the train set is approximately 80% in size of the whole data set.
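To make the idea of a label-based split concrete, here is a minimal sketch in plain Python. It is not caret's implementation, only an illustration of sampling about 80% of the rows within each label so that the class proportions stay the same in both parts. The function name, the seed and the made-up labels are assumptions for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, p=0.8, seed=0):
    """Return (train_idx, test_idx) with about a fraction p of every class in train."""
    rng = random.Random(seed)

    # Group row indices by their label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * p)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 100 rows for each of the ten digits.
labels = [i % 10 for i in range(1000)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the cut is made per class, a rare digit cannot end up entirely in one of the two sets, which a plain random split does not guarantee.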

Using XGBoost as the classification model

We can now make our data digestible by the XGBoost library:

train <- xgb.DMatrix(as.matrix(train_features), label = train_labels)
test  <- xgb.DMatrix(as.matrix(test_features),  label = test_labels)

The next step is to make XGBoost learn from our data. The actual parameters and their explanations are beyond the scope of this overview article, but the reader can look them up on the XGBoost pages:

model <- xgboost(train,
                 max_depth = 16,
                 nrounds = 600,
                 eta = 0.2,
                 objective = "multi:softmax",
                 num_class = 10)

It’s critically important to pass the objective as “multi:softmax” and num_class as 10, since we’re classifying each image into one of ten digits.

Simple performance evaluation with confusion matrix

After waiting a while (couple of minutes) for the last batch of code to finish computing, we now have the classification model ready to be used. Let’s use it to predict the labels from our test set:

predicted = predict(model, test)

This returns the vector of predicted values. We’d now like to check how well our model predicts the values. One of the easiest ways is to use the so-called confusion matrix.

As per Wikipedia, a confusion matrix is simply:

(…) also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

The caret library provides a very easy to use function for examining the confusion matrix and statistics derived from it:

confusionMatrix(data=predicted, reference=test_labels)

The function returns an R list that gets pretty printed to the R console. In our case it looks like the following:

Confusion Matrix and Statistics

          Reference
Prediction   0   1   2   3   4   5   6   7   8   9
         0 819   0   3   3   1   1   2   1  10   5
         1   0 923   0   4   5   1   5   3   4   5
         2   4   2 766  26   2   6   8  12   5   0
         3   2   0  15 799   0  22   2   8   0   8
         4   5   2   1   0 761   1   0  15   4  19
         5   1   3   0  13   2 719   3   0   9   6
         6   5   3   4   1   6   5 790   0  16   2
         7   1   7  12   9   2   3   1 813   4  16
         8   6   2   4   7   8  11   8   5 767  10
         9   5   2   1  13  22   6   1  14  14 746

Overall Statistics

               Accuracy : 0.9411
                 95% CI : (0.9358, 0.946)
    No Information Rate : 0.1124
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9345
 Mcnemar's Test P-Value : NA

(...)

Each column in the matrix represents actual labels while rows represent what our algorithms predicted this value to be. There’s also the accuracy rate printed for us and in this case it equals 0.9411. This means that our code was able to predict correct values of handwritten digits for 94.11% of observations.
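The accuracy figure can be recomputed by hand from the matrix: the diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the sum of all cells. The sketch below does this in Python with the numbers copied from the output above, and also picks out the largest off-diagonal cell, which is the pair of digits confused most often.

```python
# Confusion matrix copied from the output above:
# rows are predictions, columns are the actual labels.
matrix = [
    [819,   0,   3,   3,   1,   1,   2,   1,  10,   5],
    [  0, 923,   0,   4,   5,   1,   5,   3,   4,   5],
    [  4,   2, 766,  26,   2,   6,   8,  12,   5,   0],
    [  2,   0,  15, 799,   0,  22,   2,   8,   0,   8],
    [  5,   2,   1,   0, 761,   1,   0,  15,   4,  19],
    [  1,   3,   0,  13,   2, 719,   3,   0,   9,   6],
    [  5,   3,   4,   1,   6,   5, 790,   0,  16,   2],
    [  1,   7,  12,   9,   2,   3,   1, 813,   4,  16],
    [  6,   2,   4,   7,   8,  11,   8,   5, 767,  10],
    [  5,   2,   1,  13,  22,   6,   1,  14,  14, 746],
]

correct = sum(matrix[d][d] for d in range(10))   # diagonal: correct predictions
total = sum(sum(row) for row in matrix)          # all observations
accuracy = correct / total

# Largest off-diagonal cell: the most frequent mistake.
count, predicted, actual = max(
    (matrix[p][a], p, a) for p in range(10) for a in range(10) if p != a
)

print(round(accuracy, 4))        # 0.9411
print(predicted, actual, count)  # 2 3 26
```

Here 7903 of the 8398 test observations sit on the diagonal, and the most common error is predicting a 2 when the image actually shows a 3.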

Submitting the results

We got an accuracy rate of 0.9411 for our local test set, and it turned out to be very close to the one we got against the test set coming from Kaggle. After predicting the competition values and submitting them, the accuracy rate computed by Kaggle was 0.94357. That’s quite okay given that we’re not using any of the new and fancy techniques here.

Also, we haven’t done any parameter tuning, which could surely improve the overall accuracy. We could also revisit the code from the feature extraction phase. One improvement I can think of would be to first crop and resize back, and only then compute the skeleton, which might preserve more information about the shape. We could also use the confusion matrix: taking the digit that was confused the most, we could look at the real images that we failed to recognize. This could lead us to conclusions about improvements to our feature extraction code. There’s always a way to extract more information.

Nowadays, Kagglers from around the world are successfully using advanced techniques like Convolutional Neural Networks, getting accuracy scores close to 0.999. Those live in a somewhat different branch of the machine learning world though. With this type of neural network we don’t need to do the feature extraction on our own. The algorithm includes a step that automatically gathers features, which it later feeds into the network itself. We will take a look at them in a future article.

Executing Custom SQL in Django Migrations

2016-09-17T00:00:00+00:00

Since version 1.7, Django has natively supported database migrations similar to Rails migrations. The most fundamental difference between the two is the way the migrations are created: Rails migrations are written by hand, specifying changes you want made to the database, while Django migrations are usually automatically generated to mirror the database schema in its current state.

Usually, Django’s automatic schema detection works quite nicely, but occasionally you will have to write some custom migration that Django can’t properly generate, such as a functional index in PostgreSQL.

Creating an empty migration

To create a custom migration, it’s easiest to start by generating an empty migration. In this example, it’ll be for an application called blog:

$ ./manage.py makemigrations blog --empty -n create_custom_index
Migrations for 'blog':
  0002_create_custom_index.py:

This generates a file at blog/migrations/0002_create_custom_index.py that will look something like this:

# -*- coding: utf-8 -*-
# Generated by Django 1.9.4 on 2016-09-17 17:35
from __future__ import unicode_literals

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('blog', '0001_initial'),
    ]

    operations = [
    ]

Adding Custom SQL to a Migration

The best way to run custom SQL in a migration is through the migrations.RunSQL operation. RunSQL allows you to write code for migrating forwards and backwards, that is, applying migrations and unapplying them. In this example, the first string in RunSQL is the forward SQL, the second is the reverse SQL.

# -*- coding: utf-8 -*-
# Generated by Django 1.9.4 on 2016-09-17 17:35
from __future__ import unicode_literals

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('blog', '0001_initial'),
    ]

    operations = [
        migrations.RunSQL(
            "CREATE INDEX i_active_posts ON posts(id) WHERE active",
            "DROP INDEX i_active_posts"
        )
    ]

Unless you’re using Postgres for your database, you’ll need to install the sqlparse library, which allows Django to break the SQL strings into individual statements.
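To see what the forward and reverse statements do without a Django project at hand, here is a small sketch that runs the same two SQL strings against an in-memory SQLite database using Python's sqlite3 module. This is only an illustration: the posts table definition is a hypothetical stand-in, and Django itself is not involved.

```python
import sqlite3

FORWARD_SQL = "CREATE INDEX i_active_posts ON posts(id) WHERE active"
REVERSE_SQL = "DROP INDEX i_active_posts"


def index_exists(conn, name):
    """Check SQLite's catalog for an index with the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?", (name,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the blog application's table.
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, active BOOLEAN)")

conn.execute(FORWARD_SQL)   # what applying the migration would run
applied = index_exists(conn, "i_active_posts")

conn.execute(REVERSE_SQL)   # what unapplying the migration would run
unapplied = index_exists(conn, "i_active_posts")

print(applied, unapplied)
```

The reverse statement has to undo exactly what the forward one did, otherwise unapplying and re-applying the migration will fail.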

Running the Migrations

Running your migrations is easy:

$ ./manage.py migrate
Operations to perform:
  Apply all migrations: blog, sessions, auth, contenttypes, admin
Running migrations:
  Rendering model states... DONE
  Applying blog.0002_create_custom_index... OK

Unapplying migrations is also simple. Just provide the name of the app to migrate and the id of the migration you want to go to, or “zero” to reverse all migrations on that app:

$ ./manage.py migrate blog 0001
Operations to perform:
  Target specific migration: 0001_initial, from blog
Running migrations:
  Rendering model states... DONE
  Unapplying blog.0002_create_custom_index... OK

Hand-written migrations can be used for many other operations, including data migrations. Full documentation for migrations can be found in the Django documentation.

(This post originally covered South migrations and was updated by Phin Jensen to illustrate the now-native Django migrations.)

Adding Bash Completion To a Python Script

2016-06-03T00:00:00+00:00

Bash has quite a nice feature: you can write a command in a console, and then press <tab> twice. This should show you the possible options you can write for this command.

I will show how to integrate this mechanism into a custom Python script with two types of arguments. What’s more, I want this to be totally generic. I don’t want to change it whenever I change the options or the config files.

This script accepts two types of arguments. One type contains mainly flags beginning with ‘--’, the other type is a host name taken from a bunch of chef scripts.

Let’s name this script show.py—it will show some information about the host. This way I can use it with:

show.py szymon

The szymon part is the name of my special host, and it is taken from one of our chef node definition files.

This script also takes a huge number of arguments like:

show.py --cpu --memory --format=json

So we have two kinds of arguments: one is a simple string, one begins with --.

To implement the bash completion on double <tab>, first I wrote a simple Python script, which prints a huge list of all the node names:

#!/usr/bin/env python

from sys import argv
import os
import json

if __name__ == "__main__":
    pattern = ""
    if len(argv) == 2:
        pattern = argv[1]

    chef_dir = os.environ.get('CHEF_DIR', None)
    if not chef_dir:
        exit(0)
    node_dirs = [os.path.join(chef_dir, "nodes"),
                 os.path.join(chef_dir, "dev_nodes")]
    node_names = []

    for nodes_dir in node_dirs:
        for root, dirs, files in os.walk(nodes_dir):
            for f in files:
                try:
                    with open(os.path.join(root, f), 'r') as nf:
                        data = json.load(nf)
                        node_names.append(data['normal']['support_name'])
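                except (IOError, ValueError, KeyError, TypeError):
                    # Skip files that are unreadable, not valid JSON,
                    # or that lack a support_name entry.
                    pass

    for name in node_names:
        print(name)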

Another thing was to get a list of all the program options. We used the one-liner below. It uses the help information shown by the script, so each time the script’s options change, and the change shows up in show.py --help, the tab completion will show the new options.

$CHEF_DIR/repo_scripts/show.py --help | grep '  --' | awk '{print $1}'
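For readers less familiar with grep and awk, the same extraction can be sketched in Python. The help text below is a made-up example (the real show.py help output is not reproduced in this post), and extract_options is a hypothetical helper name.

```python
def extract_options(help_text):
    """Mimic: grep '  --' | awk '{print $1}'."""
    options = []
    for line in help_text.splitlines():
        if "  --" in line:            # grep: keep lines containing two spaces and --
            fields = line.split()
            if fields:                # awk '{print $1}': first whitespace-separated field
                options.append(fields[0])
    return options


# Hypothetical help output, for illustration only.
SAMPLE_HELP = """\
usage: show.py [options] host

options:
  --cpu          show CPU information
  --memory       show memory information
  --format=FMT   output format
"""

print(extract_options(SAMPLE_HELP))
```

Note that this relies on the help text indenting each option with two spaces, just like the shell one-liner does.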

The last step to make all this work was writing a simple bash script, which uses the above Python script and the one-liner. I placed this script in the file $CHEF_DIR/repo_scripts/show.bash-completion.

_show_complete()
{
    local cur prev opts node_names
    COMPREPLY=()
    cur="${COMP_WORDS[COMP_CWORD]}"
    prev="${COMP_WORDS[COMP_CWORD-1]}"
    opts=`$CHEF_DIR/repo_scripts/show.py --help | grep '  --' | awk '{print $1}'`
    node_names=`python $CHEF_DIR/repo_scripts/node_names.py`

    if [[ ${cur} == -* ]] ; then
        COMPREPLY=( $(compgen -W "${opts}" -- ${cur}) )
        return 0
    fi

    COMPREPLY=( $(compgen -W "${node_names}" -- ${cur}) )
}

complete -F _show_complete show.py
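The decision the bash function makes can be written out as a few lines of Python, which may be easier to follow than the compgen calls. This is only a sketch of the logic, not something bash can use directly; apart from szymon and the three flags, the values are invented for the example.

```python
def complete(cur, opts, node_names):
    """Mimic _show_complete: options if the word starts with '-', node names otherwise."""
    candidates = opts if cur.startswith("-") else node_names
    # compgen -W "..." -- cur keeps only the words that start with cur.
    return [word for word in candidates if word.startswith(cur)]


# Hypothetical values for illustration.
opts = ["--cpu", "--memory", "--format=json"]
node_names = ["szymon", "hades", "hermes"]

print(complete("h", opts, node_names))
print(complete("--c", opts, node_names))
```

An empty current word matches every node name, which is why pressing <tab> twice right after the command lists all the nodes.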

The last thing was to source this file, so I’ve added the line below to my ~/.bashrc.

source $CHEF_DIR/repo_scripts/show.bash-completion

And now pressing <tab> twice in a console shows quite nice completion options:

$ show.py <tab><tab>
Display all 42 possibilities? (y or n)
... and here go all 42 node names ...

$ show.py h<tab><tab>
... and here go all node names beginning with 'h' ...

$ show.py --<tab><tab>
.. and here go all the options beginning with -- ...

ROS architecture of Liquid Galaxy

2015-12-18T00:00:00+00:00

ROS has become the pivotal piece of software we have written our new Liquid Galaxy platform on. We have also recently open sourced all of our ROS nodes on GitHub. While the system itself is not a robot per se, it does have many characteristics of modern robots, which is what makes the ROS platform so useful. Our system is made up of multiple computers and peripheral devices, all working together to bring view-synced content to multiple displays at the same time. To do this we made use of ROS’s messaging platform, and distributed the work done on our system to many small ROS nodes.

Overview

Our systems are usually made up of three or more machines:

Head node: Small computer that runs roscore, more of a director in the system.
display-a: Usually controls the center three screens and a touchscreen + spacenav joystick.
display-b: Controls four screens, two on either side of the middle three.
display-$N: Controls more and more screens as needed, usually about four a piece.

Display-a and display-b are mostly identical in build. They mainly have a powerful graphics card and a PXE booted Ubuntu image. ROS has become our means to communicate between these machines to synchronize content across the system. The two most common functions are running Google Earth with KML / browser overlays to show extra content, and panoramic image viewers like Google’s Street View. ROS is how we tell each instance of Google Earth what it should be looking at, and what should appear on all the screens.

ROS Architecture

Here is a general description of all our ROS nodes. Hopefully we will be writing more blog posts about each node individually; as we do, links will be filled in below. The source to all nodes can be found here on GitHub.

lg_activity: A node that measures activity across the system to determine when the system has become inactive. It will send an alert on a specific ROS topic when it detects inactivity, as well as another alert when the system is active again.
lg_attract_loop: This node will go over a list of tours that we provide to it. This node is usually listening for inactivity before starting, providing a unique screensaver when inactive.
lg_builder: Makes use of the ROS build system to create Debian packages.
lg_common: Full of useful tools and common message types to reduce coupling between nodes.
lg_earth: Manages Google Earth, syncs instances between all screens, includes a KML server to automate loading KML on earth.
lg_media: This shows images, videos, and text (or really any webpage) on screen at whatever geometry / location through awesome window manager rules.
lg_nav_to_device: This grabs the output of the /spacenav/twist topic, and translates it back into an event device. This was needed because Google Earth grabs the spacenav event device, not allowing the spacenav ROS node access.
lg_replay: This grabs any event device, and publishes its activity over a ROS topic.
lg_sv: This includes a Street View and generic panoramic image viewer, plus a server that manages the current POV / image for either viewer.

Why ROS

None of the above nodes specifically needs to exist as a ROS node. We chose ROS because, as a ROS node, each running program (and sometimes any one of these nodes can exist multiple times at once on one machine) has an easy way to communicate with any other program. We really liked the pub/sub style for Inter-Process Communication in ROS. This has helped us reduce coupling between nodes. Each node can be replaced as needed without detrimental effects on the system.
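To illustrate why the pub/sub style reduces coupling, here is a toy in-process message bus in plain Python. It does not use the ROS API at all (ROS does this across processes and machines); the Bus class, the topic name and the messages are invented for the example.

```python
from collections import defaultdict


class Bus:
    """Toy in-process message bus, a stand-in for topic-based messaging."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


bus = Bus()
log = []


def on_activity(active):
    # A screensaver-like consumer: start tours when idle, stop when active.
    log.append("stop tours" if active else "start tours")


# Hypothetical topic name.
bus.subscribe("/activity/active", on_activity)

bus.publish("/activity/active", False)   # an activity tracker reports inactivity
bus.publish("/activity/active", True)    # ...and later activity again
print(log)
```

The publisher never references the consumer, only the topic, so either side can be replaced without touching the other.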

We also make heavy use of the ROS packaging/build system, Catkin. We use it to build Debian packages which are installed on the PXE booted images.

Lastly ROS has become a real joy to work with. It is a really dependable system, with many powerful features. The ROS architecture allows us to easily add on new features as we develop them, without conflicting with everything else going on. We were able to re-implement our Street View viewer recently, and had no issues plugging the new one into the system. Documenting the nodes from a client facing side is also very easy. As long as we describe each rosparam and rostopic then we have finished most of the work needed to document a node. Each program becomes a small, easy to understand, high functioning piece of the system, similar to the Unix philosophy. We couldn’t be happier with our new changes, or our decision to open source the ROS nodes.

Testing Django Applications

2015-12-14T00:00:00+00:00

This post summarizes some observations and guidelines originating from introducing the pytest unit testing framework into our CMS (Content Management System) component of the Liquid Galaxy. Our Django-based CMS allows users to define scenes, presentations and assets (StreetView, Earth tours, panos, etc) to be displayed on the Liquid Galaxy.

The purpose of this blog post is to capture my Django and testing study points, summarize useful resource links as well as to itemize some guidelines for implementing tests for newcomers to the project. It also provides a comparison between Python’s standard unittest library and the aforementioned pytest. Its focus is on Django database interaction.

Versions of software packages used

This post describes some of our experiences at End Point in designing and working on comprehensive QA/CI facilities for a new system which is closely related to the Liquid Galaxy.

The experiments were done on Ubuntu Linux 14.04:

python (2.7.6) and its corresponding version of unittest
django 1.7 (the most recent version is 1.9 but our CMS still uses 1.7)
pytest-django 2.8.0
pytest 2.7.2 (with py 1.4.30)
virtualenv 13.1.2
factory_boy 2.6.0

Testing Django Applications

We probably don’t need to talk much about the importance of testing. Writing tests along with the application code has become standard over the years. Surely, developers may fall into the trap of their own bias when creating test conditions, which can still result in faulty software, but the likelihood of bugs is certainly higher in code that has no QA measures. If the code works and is untested, it works by accident, they say.

As a rule of thumb, unit tests should be very brief testing items seldom interacting with any external services such as the database. Integration tests on the other hand often communicate with external components.

This post will heavily reference a minimal example Django application written for the purpose of experimenting with Django testing. Its README file contains some setup and requirement notes. Also, I am not going to list (m)any code snippets here but rather reference the functional application and its test suite. Hence the points below are more or less an assortment of little topics and observations.

In order to benefit from this post, it will be helpful to follow the README and interact (run tests that is) with the demo django-testing application.

Basic Django unittest versus pytest basic examples

This pair of test modules shows the differences between Django TestCase (unittest) and pytest-django (pytest) frameworks.

test_unittest_style.py

The base Django TestCase class derives along this tree:

    django.test.TestCase
        django.test.TransactionTestCase
            django.test.SimpleTestCase
                unittest.TestCase

Django adds (among many other aspects) database handling on top of the Python standard unittest library; the documentation is here.

test_pytest_style.py

This is a pytest-style implementation of the same tests. The pytest-django plug-in adds, among other features, Django database handling support.

The advantage of unittest is that it comes with the Python installation—it’s a standard library. That means that one does not have to install anything for writing tests, unlike pytest which is a third-party library and needs to be installed separately. While the absence of additional installation is certainly a plus, it’s dubious whether being a part of the Python distribution is a benefit. I seem to recall Guido van Rossum saying at EuroPython 2010 that the best thing for pytest is not being part of the Python standard library, because its lively development and evolution would be slowed down by the inclusion.

There are very good talks and articles summarizing advantages of pytest. For me personally, the reporting of error context is supreme. No boilerplate (no inheritance), plain Python asserts instead of many assert* methods, and flexibility (function, class) are other big plus points.

Testing Django applications with py.test (EuroPython 2013) (very good)
pytest presentation from its author
very brief pytest introduction
switch to pytest, rich features descriptions, has some Django touches
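The boilerplate difference mentioned above is easiest to see side by side. The sketch below tests the same hypothetical function (slugify is made up for this example and is not part of the demo application) first in unittest style and then in pytest style. The pytest-style test is just a function with a plain assert, so it needs no import from pytest at all.

```python
import unittest


def slugify(title):
    """Hypothetical function under test."""
    return title.lower().replace(" ", "-")


class SlugifyTestCase(unittest.TestCase):
    # unittest style: a class inheriting from TestCase, assert* methods.
    def test_slugify(self):
        self.assertEqual(slugify("Hello World"), "hello-world")


def test_slugify():
    # pytest style: a plain function and a plain assert.
    assert slugify("Hello World") == "hello-world"
```

pytest collects both kinds of tests, while the unittest runner only picks up the TestCase class.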

As the comment in the test_unittest_style.py file says, this particular unittest-based test module can be run either by Django manage.py (which boils down to unittest lookup discovery on a lower layer) or by py.test (pytest).

It should also be noted that pytest’s flexibility can bite back if something gets overlooked.

Django database interaction unittest versus pytest (advanced examples)

test_unittest_advanced.py

Since this post concentrates on pytest and since it’s the choice for our LG CMS project (naturally :-), this unittest example just shows how the test (fresh) database is determined and how Django migrations are run at each test suite execution. Just as described in the Django documentation: “If your tests rely on database access such as creating or querying models, be sure to create your test classes as subclasses of django.test.TestCase rather than unittest.TestCase.”

That is true for database interaction but not completely true when using pytest. And “Using unittest.TestCase avoids the cost of running each test in a transaction and flushing the database, but if your tests interact with the database their behavior will vary based on the order that the test runner executes them. This can lead to unit tests that pass when run in isolation but fail when run in a suite.” django.test.TestCase, however, ensures that each test runs inside a transaction to provide isolation. The transaction is rolled back once the test case is over.

test_pytest_advanced.py

This file represents the actual core of the test experiments for this blog / demo app and shows various pytest features and approaches typical for this framework as well as Django (pytest-django that is) specifics.

Django pytest notes (advanced example)

Much like the unittest documentation, the pytest-django documentation recommends avoiding database interaction in unit tests and concentrating only on the logic, which should be designed in such a fashion that it can be tested without a database.

the test database name is prefixed with test_ (just like in the unittest example); the base value is taken from the database section of settings.py. As a matter of fact, it’s possible to run the test suite after previously dropping the main database; the test suite interacts only with test_ + DATABASE_NAME
migration execution before any database interaction is carried out (similarly to unittest example)
database interaction marked by a Python decorator @pytest.mark.django_db on the method or class level (or stand-alone function level). It’s in fact the first occurrence of this marker which triggers the database set up (its creation and migrations handling). Again analogously to unittest (django.test.TestCase), the test case is wrapped in a database transaction which puts the database back into the state prior to the test case. The database test_ + DATABASE_NAME itself is dropped once the test suite run is over. The database is not dropped if --db-reuse option is used. The production DATABASE_NAME remains untouched during the test suite run (more about this below)
pytest_djangodb_only.py, setup_method: when this module is run separately, the data created in setup_method end up NOT in the test_ + DATABASE_NAME database but in the standard one (as configured in settings.py, which is likely the production database)! This data also won’t be rolled back. When run separately, this test module will pass (but the production database will still be tainted). It may or may not fail on the second and subsequent runs, depending on whether it creates any unique data. When run within the test suite, the database call from setup_method will fail despite the presence of the class-level django_db marker. This was very important to realize. Recommendation: do not include database interaction in the pytest special methods (such as setup_method or teardown_method, etc.); only include database interaction in the test case methods
The error message Failed: Database access not allowed, use the "django_db" mark to enable was seen on a database error on a method which actually had the marker. This output is not to be 100% trusted
data model factories are discussed separately below
lastly the test module shows Django Client instance and calling an HTTP resource

pytest setup_method

While the fundamental differences between unittest and pytest were discussed, there is something to be said about Django specific differences of the two. There is different database-related behaviour of unittest setUp method versus the pytest setup_method method. The setUp is included in the transaction and database interactions are rolled back once the test case is over. The setup_method is not included in the transaction. Moreover, interacting with the database from setup_method results in faulty behaviour and difference depending whether the test module is run on its own or as a part of the whole test suite.

The bottom line is: do not include database interaction in setup_method. This setUp, setup_method behaviour was already shown in the basic examples. And more description and demonstration of this behaviour is in the file: pytest_djangodb_only.py. This actually revealed the fact that using django_db database fixture is not supported in special pytest methods and the aforementioned error message is misleading (more references here and here).

When running the whole test suite, this file won’t be collected (its name lacks test_ string). It needs to be renamed to be included in the test suite run.

JSON data fixtures versus factories (pytest advanced example)

The traditional way of interacting with some test data was to perform the following steps:

have data loaded in the database
python manage.py dumpdata
the produced JSON file is dragged along the application test code
call_command("loaddata", fixture_json_file_name) happens at each test suite run

The load is expensive, and the JSON dump file is hard to maintain manually once the original data and the current needs diverge (the file contains integer primary key values, etc.). Although even the recent Django testing documentation mentions JSON data fixtures, the approach is generally discouraged; the recommended alternatives are loading the data in migrations or using model data factories.

This talk for example compares both approaches and favours the factory_boy library. A quote from the article: “Factory Boy is a Python port of a popular Ruby project called Factory Girl. It provides a declarative syntax for how new instances should be created. … Using fixtures for complex data structures in your tests is fraught with peril. They are hard to maintain and they make your tests slow. Creating model instances as they are needed is a cleaner way to write your tests which will make them faster and more maintainable.”

The file test_pytest_advanced.py demonstrates interaction with factories defined in the module factories.py, using the basic, very easy-to-use features.

Despite its ease of use, the factory_boy is a powerful library capable of modeling Django’s ORM many-to-many relationships, among other features.
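The core idea behind a model factory fits in a few lines of plain Python. The sketch below is not the factory_boy API; it is a hand-rolled illustration with a hypothetical Post class standing in for a Django model. Defaults are declared once, a sequence keeps generated values unique, and each test overrides only the fields it cares about.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical model standing in for a Django model.
    title: str
    author: str
    published: bool


_sequence = itertools.count(1)


def post_factory(**overrides):
    """Build a Post with sensible defaults; tests override only what matters."""
    n = next(_sequence)
    fields = {"title": f"Post {n}", "author": "alice", "published": False}
    fields.update(overrides)
    return Post(**fields)


first = post_factory()
second = post_factory(published=True)
print(first)
print(second)
```

Compared with a JSON fixture, there are no hard-coded primary keys to keep in sync, and the data a test depends on is visible right in the test.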

Additional useful links

Django 1.7 testing — version used in the demo application
Django 2.0 testing — latest stable version
Effective Django — testing covered already in the second chapter of the book
Effective Django factory_boy
Django testing — excellent PyCon talk, slides covering pytest, fixtures vs factories, etc

Conclusion

You should now have a good idea of the differences between testing with unittest and with pytest in the Django environment. The emphasis has been put on pytest (pytest-django) and some recommended approaches. The demo application django-testing provides functional test cases demonstrating the behaviour and features discussed. The articles and talks listed in this post were extremely helpful and instrumental in gaining expertise in the area and introducing a rigorous testing approach into the production application.

Any discrepancy between the behaviour described above and on your own setup may originate from different software versions. In any case, if anything is not clear enough, please let me know in the comments.

img.bi, a secret encrypted image sharing service tool

2015-07-30T00:00:00+00:00

After a fairly good experience with dnote installed on our own servers as an encrypted notes sharing service, my team decided that it would have been nice to have a similar service for images.

We found a nice project called img.bi that is based on NodeJS, Python, Redis and a lot of client-side JavaScript.

The system is divided into two components: the HTML/JS frontend and a Python FastCGI API.

Unfortunately the documentation is still in its very early stage and it’s lacking a meaningful structure and a lot of needed information.

Here’s an overview of the steps we followed to set up img.bi on our own server behind nginx.

First of all, we decided to have as much as possible running as, and confined to, a regular user, which is always a good idea with such young and potentially vulnerable tools. We chose to use the imgbi user.

Then, since we wanted to keep the root user environment (and system status) as clean as possible, we also decided to use pyenv. To be conservative we chose the latest Python 2.7 stable release, 2.7.10.

git clone https://github.com/yyuu/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
exec $SHELL -l
pyenv install -l  | grep 2\\.7
pyenv install 2.7.10
pyenv global 2.7.10
pyenv version
which python
python --version

In order to use img.bi, we also needed NodeJS and following the same approach we chose to use nvm and install the latest NodeJS stable version:

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.25.4/install.sh | bash
nvm install stable
nvm list
nvm use stable
nvm alias default stable
node --version

A short note on the bad practice of blindly running:

curl -o- https://some_obscure_link_or_not | bash

We want to add that we do not endorse this practice as it’s dangerous and exposes your system to many security risks. On the other hand, though, it’s true that cloning the source via Git and compiling/installing it blindly is not much safer, so it always comes down to how much you trust the peer review on the project you’re about to use. And at least with an https URL you should be talking to the destination you want, whereas an http URL is much more dangerous.

Furthermore, going through the entire Python and NodeJS installation as a regular user is far beyond the scope of this post, and the steps proposed here assume that you’re doing everything as the regular user, except where specifically stated differently.

Anyway after that we updated pip and then installed all the needed Python modules:

pip install --upgrade pip
pip install redis m2crypto web.py bcrypt pysha3 zbase62 pyutil flup
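To confirm that the modules are importable from the pyenv-managed interpreter, a small check like the following can help. It is a minimal sketch: the import names in API_MODULES are our assumption about how the pip packages above are imported (for instance, web.py as web and pysha3 as sha3), and the code avoids newer syntax so that it should also run on Python 2.7.

```python
def find_missing(module_names):
    """Return the names in `module_names` that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the packages installed with pip above.
API_MODULES = ["redis", "M2Crypto", "web", "bcrypt", "sha3",
               "zbase62", "pyutil", "flup"]

# Hypothetical usage:
#   print(find_missing(API_MODULES))
```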

Then it’s time to clone the actual img.bi code from the GitHub repo, install a few missing dependencies and then use the bower and npm .json files to add the desired packages:

git clone https://github.com/imgbi/img.bi.git
cd img.bi/
npm install -g bower grunt grunt-cli grunt-multiresize
npm install -g grunt-webfont --save-dev
npm install
bower install

We also faced an issue which made Grunt fail to start correctly. Grunt was complaining about an “undefined property” called “prototype”. If you happen to have the same problem just type

cd node_modules/grunt-connect-proxy/node_modules/http-proxy
npm install eventemitter3@0.1.6
cd -

That’ll install the eventemitter3 NodeJS package locally to the grunt-connect-proxy module, which works around the compatibility issue that causes the error mentioned above.

Use your favourite editor to change the config.json file, which contains all the configuration specific to your installation. In particular, our host is not exposed on the I2P or Tor networks, so we “visually” disabled those options.

# lines starting with "+" need to replace the ones starting with "-"
-  "name": "img.bi",
+  "name": "img.bi - End Point image sharing service",

-  "maxSize": "3145728",
+  "maxSize": "32145728",

-  "clearnet": "https://img.bi",
+  "clearnet": "https://imgbi.example",

-  "i2p": "http://imgbi.i2p",
+  "i2p": "http://NOTAVAILABLE.i2p",

-  "tor": "http://imgbifwwqoixh7te.onion",
+  "tor": "http://NOTAVAILABLE.onion",
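As an alternative to editing by hand, the same changes could be scripted. This is only a sketch: it assumes the keys shown in the diff sit at the top level of config.json, which the diff suggests but does not prove, and the function names are ours. Note that json.dump rewrites the whole file, so the original formatting is not preserved.

```python
import json

# Values taken from the diff above; adjust them to your own host.
OVERRIDES = {
    "name": "img.bi - End Point image sharing service",
    "maxSize": "32145728",
    "clearnet": "https://imgbi.example",
    "i2p": "http://NOTAVAILABLE.i2p",
    "tor": "http://NOTAVAILABLE.onion",
}


def apply_overrides(config, overrides):
    """Return a copy of `config` with the keys in `overrides` replaced."""
    updated = dict(config)
    updated.update(overrides)
    return updated


def rewrite_config(path="config.json", overrides=OVERRIDES):
    """Load the JSON file at `path`, apply `overrides`, and write it back."""
    with open(path) as fh:
        config = json.load(fh)
    with open(path, "w") as fh:
        json.dump(apply_overrides(config, overrides), fh, indent=2)
        fh.write("\n")
```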

Save and close the file. At this point you should be able to run “grunt” to build the project but if it fails on the multiresize task, just run

grunt --force

to ignore the warnings.

That’s about everything you need for the frontend part, so it’s now time to take care of the API.

cd
git clone https://github.com/imgbi/img.bi-api.git
cd /home/imgbi/img.bi-api/

You now need to edit the two Python files which are the core of the API.

# edit code.py expired.py
-upload_dir = '/home/img.bi/img.bi-files'
+upload_dir = '/home/imgbi/img.bi-files'

Verify that there are no Python import errors, due to missing modules or anything else, by running the code.py file directly.

./code.py

If that works, create a symlink in the build directory to make the files created by the API available to the frontend:

ln -s /home/imgbi/img.bi-files /home/imgbi/img.bi/build/download

And then it’s time to spawn the actual Python daemon:

spawn-fcgi -f /home/imgbi/img.bi-api/code.py -a 127.0.0.1 -p 1234
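To check that the daemon is actually accepting connections on the address and port given to spawn-fcgi, a quick TCP probe is enough. The helper below is a minimal sketch with a name of our choosing; it only tests that something is listening and does not speak FastCGI.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()


# Hypothetical usage, matching the spawn-fcgi command above:
#   is_listening("127.0.0.1", 1234)
```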

The expired.py file is used by a cronjob which periodically checks if there’s any image/content that should be removed because its time has expired. First of all let’s call the script directly and if there’s no error, let’s create the crontab:

python /home/imgbi/img.bi-api/expired.py

crontab -e

@reboot spawn-fcgi -f /home/imgbi/img.bi-api/code.py -a 127.0.0.1 -p 1234
30 4 * * * python /home/imgbi/img.bi-api/expired.py
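To illustrate the general idea of such a periodic cleanup, here is a simplified sketch. It is not how img.bi’s expired.py works: we have swapped in a plain file-age rule based on modification time, and the function name and max_age_seconds parameter are hypothetical.

```python
import os
import time


def remove_expired(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` older than `max_age_seconds`.

    Returns the sorted list of removed file names.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Age is measured from the file's last modification time
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return sorted(removed)
```

Run from cron, a script built around a function like this would be safe to call repeatedly, since files that are already gone are simply not listed.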

It’s now time to install nginx and Redis (if you haven’t done so already) and then configure them. For Redis, the usual basic installation is enough. The same is true for nginx, but we’ll add the content of our configuration/vhost file, /etc/nginx/sites-enabled/imgbi.example.conf, as an example for anyone who may need it:

upstream imgbi-fastcgi {
  server 127.0.0.1:1234;
}

server {
  listen 80;
  listen [::]:80;
  server_name imgbi.example;
  access_log /var/log/nginx/sites/imgbi.example/access.log;
  error_log /var/log/nginx/sites/imgbi.example/error.log;
  rewrite ^ https://imgbi.example/ permanent;
}

server {
  listen 443 ssl spdy;
  listen [::]:443 ssl spdy;
  server_name  imgbi.example;
  access_log /var/log/nginx/sites/imgbi.example/access.log;
  error_log /var/log/nginx/sites/imgbi.example/error.log;

  client_max_body_size 4G;

  include include/ssl-wildcard-example.inc;

  add_header Strict-Transport-Security max-age=31536000;
  add_header X-Frame-Options SAMEORIGIN;
  add_header X-Content-Type-Options nosniff;
  add_header X-XSS-Protection "1; mode=block";

  location / {
    root /home/imgbi/img.bi/build;
  }

  location /api {
    fastcgi_param QUERY_STRING $query_string;
    fastcgi_param REQUEST_METHOD $request_method;
    fastcgi_param CONTENT_TYPE $content_type;
    fastcgi_param CONTENT_LENGTH $content_length;

    fastcgi_param SCRIPT_NAME "";
    fastcgi_param PATH_INFO $uri;
    fastcgi_param REQUEST_URI $request_uri;
    fastcgi_param DOCUMENT_URI $document_uri;
    fastcgi_param DOCUMENT_ROOT $document_root;
    fastcgi_param SERVER_PROTOCOL $server_protocol;

    fastcgi_param GATEWAY_INTERFACE CGI/1.1;
    fastcgi_param SERVER_SOFTWARE nginx/$nginx_version;

    fastcgi_param REMOTE_ADDR $remote_addr;
    fastcgi_param REMOTE_PORT $remote_port;
    fastcgi_param SERVER_ADDR $server_addr;
    fastcgi_param SERVER_PORT $server_port;
    fastcgi_param SERVER_NAME $server_name;
    fastcgi_param HTTPS on;

    fastcgi_pass imgbi-fastcgi;
    fastcgi_keep_conn on;
  }
}
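Once nginx is reloaded, it can be worth checking that the security headers from the vhost above are really being sent. The following sketch compares a dictionary of response headers against the values in that configuration; the function name is ours, and how you obtain the headers (curl -I, a browser, or an HTTP client) is up to you.

```python
# Header values copied from the add_header lines in the vhost above.
EXPECTED_HEADERS = {
    "Strict-Transport-Security": "max-age=31536000",
    "X-Frame-Options": "SAMEORIGIN",
    "X-Content-Type-Options": "nosniff",
    "X-XSS-Protection": "1; mode=block",
}


def header_problems(response_headers):
    """Return a list of messages for missing or unexpected header values."""
    # Header names are case-insensitive, so normalize them before comparing
    seen = {k.lower(): v.strip() for k, v in response_headers.items()}
    problems = []
    for name, expected in sorted(EXPECTED_HEADERS.items()):
        actual = seen.get(name.lower())
        if actual is None:
            problems.append("%s: missing" % name)
        elif actual != expected:
            problems.append("%s: expected %r, got %r" % (name, expected, actual))
    return problems
```

An empty list means all four headers came back with the configured values.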

Well, that should be enough to get you started and at least have all the components in place. Enjoy your secure image sharing now.
