Engineering a fast feedback infrastructure

[Originally posted at]

A tech company’s potential to create value comes from its ability to prototype quickly and iterate fast: the infrastructure shouldn’t be a hurdle in that process. In fact, it should do exactly the opposite: give us the means to go even faster. At Snips, we believe that everyone in the team should be able to run and monitor any code on any server at the press of a button.

The time and effort needed to go from idea to a prototype running in production should be as small as possible. Prototypes provide insights into what works and what needs to be improved. This enables us to avoid premature optimizations and to focus on what matters. Making logs and run-time metrics straightforward to record and explore therefore goes a long way toward making this iterative process more efficient and enjoyable.

In this post, we want to share the first steps we have taken in the direction of a true infrastructure as a service approach, using exclusively open-source tools. We will touch upon:

  • how we run one-off or recurrent jobs
  • how we run long-running services
  • how we inspect and monitor services and jobs
  • how we push services in production

A Docker-based infrastructure

When we move code from local machines to the shared infrastructure, we must guarantee that the deployed code will work exactly like it does on our development machines despite potential differences in package versions, OS distributions and hardware configurations. This is why we use Docker to build a standardized environment.

Docker is an open platform for creating and running software containers. Containers have their own isolated user space, network interface, file system and processes, a bit like a virtual machine. Since isolation is done at the OS level, it is less strict than in a virtual machine. But instantiating a container is very fast, because there is no separate OS to boot.

A container is created from an image, which defines the initial contents of the container in which the process will run. An image is built from a script, called a Dockerfile, which starts from a base image, like a raw Ubuntu distribution, and lists the commands to run to install the particular packages and files needed for this image.
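As an illustration, a minimal Dockerfile along those lines might look like this (the base image, packages and paths below are made up for the example, not our actual setup):

```dockerfile
# Start from a base image
FROM ubuntu:14.04

# Install the packages needed for this image
RUN apt-get update && apt-get install -y python python-pip

# Add the application code and its dependencies
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt
COPY . /app

# Command run when a container is started from this image
CMD ["python", "/app/server.py"]
```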

Docker is smart about space and uses a concept of image hierarchy to only save diffs (called “layers”) when it makes sense. When building several images based on the same base image, Docker will only store the original image once. It then only stores the differences from the original image. At Snips, we have created our own base image including the packages we use most and our internal libraries, and use it for most of our builds. This allows us to save a lot of space when building hundreds of derived images which mostly consist of adding a few additional packages and a script to the base image.

We believe each Docker image should describe as much as possible an atomic functionality. For instance it is best to run a database and an application in two separate containers which are linked by the various facilities provided by Docker. Thus, each service is isolated, leading to easier maintenance and scalability. This extends the micro-services philosophy that we apply to our internal and public applications.

All of our Docker images are stored in a private registry which is shared by all our servers. This allows us to push an image once and use it everywhere.


Everything at Snips runs in a Docker container.

Everything in our infrastructure runs in a container. Developers and data scientists are in charge of maintaining their own images. At its core, maintaining infrastructure means making sure Docker works. Provisioning a new machine on our infrastructure is relatively easy: Install Docker 😉

Running a one-off job

We have a small home-made Docker wrapper which enables us to start a new instance of our base image very easily. A simple sky container in the command prompt gives you a new instance (container) running in our cloud. It’s more or less like SSH’ing into a random machine of our infrastructure, except the environment is completely isolated and standardized:

> sky container
7ffb9a3b28e3# pwd
7ffb9a3b28e3# ps
  PID TTY          TIME CMD
    1 ?        00:00:00 zsh
   50 ?        00:00:00 ps

You can then git clone a repository, run some long-running code, and get notified once the process has finished (using a small in-house tool called snitch):

7ffb9a3b28e3# snitch --notify-email -c "sleep 5; echo done"
2015-02-25 11:19:06.788241[57]: Started "sleep 5; echo done"
2015-02-25 11:19:12.791770[57]: Command successfully finished

You can even detach from the container and re-attach later whenever you want to check on it. Now this is great for running one-off jobs, but what happens if you want to run a long-running service like a REST API?

Deploying a service

You have a piece of code working on your laptop and you wish to deploy it so the rest of the company can start testing it. It can be anything from a new algorithm to a new API, a new dashboard or even a new database.

The first step is to construct a Docker image, containing all the required binaries and code. A simple Dockerfile extending our custom base image does the trick in a few lines. Once the image is built, we need to describe where the container instance will be started and how it will be connected to the rest of the infrastructure. To this purpose, we use a standardized service configuration file containing:

  • the service maintainer (name and email)
  • the service docker image and image version
  • the service interfaces (ports, DNS…)
  • the service dependencies (databases, file system volumes…)

This config is inspired by Fig (now Docker Compose) and Maestro. Here is a configuration example for a server requiring one Cassandra database:

    my-application:
        maintainer: <name and email>
        image: my-application
        requires: [ database ]
        instances:
            production:
                version: 1.0-SNAPSHOT
                env:
                    RUNTIME_ENV: production
                memory: 1g
            dev:
                version: 1.1-dev
                env:
                    RUNTIME_ENV: staging
                memory: 1g

    database:
        image: cassandra
        version: 1.0-SNAPSHOT
        env:
            RUNTIME_ENV: production
        memory: 10g

The instances key in the my-application section describes a list of all instances of the application to be run (in this case, a production and a development instance). Each instance will inherit the parent properties of the configuration file (here maintainer, image, and requires). This means that each instance will spawn a new database alongside it. To connect the application to the database, the sky tool will inject specific environment variables in the application container describing which address and port to connect to.
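As a sketch of what consuming those injected variables can look like on the application side, assuming they follow Docker's link-style naming convention (`ALIAS_PORT_<port>_TCP_ADDR` / `ALIAS_PORT_<port>_TCP_PORT`; the exact names sky injects are not shown here):

```python
import os

# Hypothetical helper: read the database connection details injected into
# the container. The variable names assume Docker's link-style convention;
# the names actually injected by the sky tool may differ.
def database_endpoint(alias="DATABASE", default_port=9042):
    host = os.environ.get("%s_PORT_%d_TCP_ADDR" % (alias, default_port), "localhost")
    port = os.environ.get("%s_PORT_%d_TCP_PORT" % (alias, default_port), default_port)
    return host, int(port)

host, port = database_endpoint()
print("Connecting to Cassandra at %s:%d" % (host, port))
```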

The VIRTUAL_HOST environment variable is cool: it allows us to bind a container application to a public or private (on our VPN) URL by simply adding a line to the config file. This also load-balances instances having the same VIRTUAL_HOST variable.

Because the maintainer's email is in the configuration file, any alerts triggered by warnings or errors are sent directly to the maintainer, who can then fix them.

Once the service configuration file has been written and the Dockerfile built, starting the service is as easy as:

> sky service start my-application
Starting my-application-production-database.. DONE
Starting my-application-production.. DONE
Starting my-application-dev-database.. DONE
Starting my-application-dev.. DONE

and the running process logs can be obtained by running sky service logs my-application.

A dashboard on our private intranet also lets us monitor the status of services: gather logs, and interrupt or restart failing ones. This makes it straightforward for new team members to understand at a glance how the Docker infrastructure works and to inspect what is going on with their containers.

We will extend the configuration file to include lifecycle checks (HTTP check, port check or a custom command) ensuring the service runs smoothly and is never down.
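A minimal sketch of what two of those checks could look like (this only illustrates the port check and custom command variants; our actual tooling may differ):

```python
import socket
from subprocess import call

def port_check(host, port, timeout=2.0):
    """Lifecycle check: can we open a TCP connection to the service?"""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False

def command_check(command):
    """Lifecycle check: does a custom shell command exit with status 0?"""
    return call(command, shell=True) == 0
```

A scheduler or supervisor can then poll these checks and restart the service when one of them fails.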

Running a recurrent job

Recurrent jobs are not that different from services and thus are expressed in the same configuration framework. For example, sending out a report email periodically would look like:

    still-alive:
        image: base:0.7
        command: echo "Hey I'm still alive!"
        every: 1 day at 17:00
        notify-on-completion: true

A “scheduler” service watches for configuration changes and runs Docker containers based on the aforementioned config.
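As an illustration, such a scheduler could turn a spec like `1 day at 17:00` into the next execution time roughly like this (the parsing below is a simplified assumption, handling only the `N day(s) at HH:MM` form):

```python
from datetime import datetime, timedelta

def next_run(spec, now):
    """Next execution time for a spec like '1 day at 17:00'.
    Simplified sketch: only handles the 'N day(s) at HH:MM' form."""
    amount, _unit, _at, clock = spec.split()
    hour, minute = map(int, clock.split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has passed: jump forward by the requested interval
        candidate += timedelta(days=int(amount))
    return candidate
```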

We can of course start any of these recurring jobs outside of the standard schedule, for instance when an error has occurred and a service needs to be restarted. For example, forcing a run of the “still-alive” job can be done with sky job start still-alive.

Inspecting services and jobs

Simplicity of use is not only about ease of deployment, but also about how simple it is to debug and improve your code. Two things are very important to get meaningful insights into your code: access to logs and run-time metrics.

Logs give precise details about what happened and when. Metrics quantify how fast and how often the code ran. Both are critical in understanding how applications and algorithms behave when run on production-sized data. Feedback about how services and jobs are running is essential. It means measuring as much as you can, and this for at least two reasons:

  • You can’t optimize properly if you don’t measure properly. You risk optimizing the wrong part of the code or doing premature optimization.
  • It allows us to quantify, compare and learn. Why is my code running so slowly compared to others? Why am I using so much RAM? Asking the right question is already halfway to a solution. Develop a culture of speed and efficiency!

When a container runs on our infrastructure, it is automatically monitored. Anyone can then inspect its resource usage.


Each container is monitored using CollectD and Graphite containers. The results are gathered in infrastructure-wide dashboards which allow us to investigate the resource usage of each container.

Application-level metrics and logs are handled with a set of homemade wrappers included in our base image. These tools are written in the most common backend languages used at Snips (Python and Scala), and give us a standardized way of evaluating code performance and exploring logs. Logs and alerts are handled by logstash, while metrics are handled by Graphite and StatsD.
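For illustration, a timing wrapper in the spirit of those helpers could push a StatsD timing metric like this (the host, port and metric names are placeholders; the `name:value|ms` datagram format is the standard StatsD wire protocol):

```python
import socket
import time
from functools import wraps

STATSD_HOST, STATSD_PORT = "localhost", 8125  # placeholder StatsD address

def format_timing(metric, millis):
    """Build a StatsD timing datagram, e.g. 'api.query:123|ms'."""
    return "%s:%d|ms" % (metric, millis)

def timed(metric):
    """Send the wrapped function's wall-clock duration to StatsD over UDP."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                payload = format_timing(metric, (time.time() - start) * 1000)
                sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
                sock.sendto(payload.encode("utf-8"), (STATSD_HOST, STATSD_PORT))
                sock.close()
        return wrapper
    return decorator

@timed("api.slow_query")
def slow_query():
    time.sleep(0.05)
```

Because StatsD uses UDP, instrumented code pays almost no overhead and keeps working even if the metrics collector is down.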


Home-made Python and Scala libraries enable us to have a standardized way of defining application metrics. We introspect those using dashboards automatically generated for each containerized web server API. We can then investigate slow queries and explore time series using a series of tools.

Switching to production

All of our production traffic is duplicated and redirected to services in staging. This enables us to test services that have not yet reached production maturity with production data. Because everything is measured, we can quickly assess the impact of changes, and identify errors that lead to bottlenecks.

Since Docker images are tagged by version and stored, rolling back to an older version simply consists of reverting to the previous service configuration file, which points to the previous image versions.

We use Strider as a Continuous Integration system. It allows direct deployment of services that pass tests upon GitHub commits. This is especially useful for iterating quickly in a staging environment to correct mistakes that have slipped through.

As a consequence of having a uniform infrastructure, running code in production is not fundamentally different from running a prototype in staging. The same toolchain and processes are used throughout.

Closing words

Fast iteration only becomes possible when you have substantially reduced the time and effort needed to deploy and inspect services on an infrastructure. Tightening the feedback loop enables richer ideas to be conceived, and higher quality prototypes to be deployed.

Fewer errors are introduced when the same toolchain is used for development, prototyping and production. The infrastructure then becomes a high quality service for all of its users.

An infrastructure is in essence no different from a traditional interface. Its true purpose is to hide complexity, in order to let us do what we do best: be creative.


Flash code to log on your Wifi

We’ve all experienced the pain of logging into a Wifi network whose password is a long string of characters. You type away at the long sequence, try to log in, and find out you mistyped a few characters thanks to your bulky thumbs hitting the wrong keys on your smartphone.

This is for you, owner of long passwords: have your guests log into your Wifi by flashing a QR code! It’s quite easy: you can print one... or build one in LEGO. Why take the easy way when the long way can be fun? 🙂


There are different resolutions of QR codes, with different data storage capacities. The encoding scheme for Wifi SSID/passwords is not very well defined; however, the ZXing project, authors of the Barcode Scanner app for Android, seems to provide a QR code generator you can use to encode pretty much anything. Unfortunately the standard hasn’t reached iOS yet, so you’re out of luck if using Apple.
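For reference, the string ZXing's scanner expects to find in the QR code follows the `WIFI:T:<auth>;S:<ssid>;P:<password>;;` convention, with special characters backslash-escaped. A small helper to build it, sketched in Python:

```python
def wifi_qr_payload(ssid, password, auth="WPA"):
    """Build the ZXing-style Wifi configuration string to encode in a QR code."""
    def escape(value):
        # Backslash-escape the characters that are special in the format.
        # The backslash itself must be escaped first.
        for char in '\\;,:"':
            value = value.replace(char, "\\" + char)
        return value
    return "WIFI:T:%s;S:%s;P:%s;;" % (auth, escape(ssid), escape(password))
```

Feed the resulting string to any QR code generator and the scanner will offer to join the network.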

Happy flashing!

Pocket for video: or how to bookmark videos for offline watching

EDIT: Full updated code on github:

I use Pocket a lot. For those who don’t know it, it’s an app which downloads and syncs a given article so that it can be viewed offline on any device. Every time I see a great article somewhere on the web, I just hit “Save to Pocket” using the Chrome extension, and I know I will be able to read it later from my phone or tablet during a commute.

I want that for video too! Too often I stumble across a video which I can’t view because I do not want to use my data plan, or simply because it is inconvenient. However, I’d like to bookmark that video to watch at home, in a more relaxed context. For my use, videos mainly come from Youtube. To build this system, I needed three parts:

  1. A system that triggers the saving of a video
  2. A system which downloads the video
  3. A system which syncs that video across devices

Turns out that Youtube already has a button called “Watch Later” which is visible on any video you are watching. When pressed, the targeted video gets added to the “Watch Later” playlist. Then you could have a Python script, called periodically, check the “Watch Later” playlist and download the videos in it. Downloading the videos can be done using the youtube-dl command-line utility (Mac OS, Windows, Linux), and syncing them across devices can be done using Bittorrent Sync (works with Android and iOS!).

If you want such a system set up, the real tricky (and interesting) part is fetching the “Watch Later” playlist. By default, it is a private playlist, so you must authenticate against your own Youtube account. Let’s do that.

Head over to the Google Cloud Console, create a new Project, and activate the YouTube Data API v3 under APIs & Auth > APIs. Next, you need to register a new application (i.e. the Python script). Under Registered apps, click the Register App button, and select Web Application. You should now be able to download the OAuth client ID JSON, which represents the Python script as a client. Save this file as client_secrets.json.

Now we are ready to try to authenticate from Python. Install the Google Client APIs for Python and the python-gflags libraries. If you use easy_install:

sudo easy_install google-api-python-client
sudo easy_install python-gflags

We now need to generate an OAuth token, which is used to authenticate against Youtube. The script will load the client_secrets.json file, and open a web browser to authenticate against Youtube if no token is already present:

from oauth2client.file import Storage
from oauth2client.client import flow_from_clientsecrets
from oauth2client.tools import run

CLIENT_SECRETS_FILE = "client_secrets.json"
OAUTH_TOKEN_FILE = "oauth2.json"
SCOPE = "https://www.googleapis.com/auth/youtube"

flow = flow_from_clientsecrets(CLIENT_SECRETS_FILE,
                               message="Missing client_secrets.json",
                               scope=SCOPE)
storage = Storage(OAUTH_TOKEN_FILE)
credentials = storage.get()

if credentials is None or credentials.invalid:
    print('No credentials, running authentication flow to get OAuth token')
    credentials = run(flow, storage)

Once the web authentication has been performed, the message “Authentication successful.” should appear in the console, and the oauth token file oauth2.json should have been created. Congratulations, you are now authenticated on your Youtube account!

The task is now to retrieve the content of the “Watch Later” playlist (the whole API doc can be found here). First, we declare some constants and connect to the API (this requires the httplib2 library to be installed, e.g. with easy_install):

from apiclient.discovery import build
from httplib2 import Http


def buildAPI():
    http = Http()
    http = credentials.authorize(http)
    return build('youtube', 'v3', http=http)

youtube = buildAPI()
channels = youtube.channels().list(part='contentDetails', mine=True).execute()
watchLaterID = channels['items'][0]['contentDetails']['relatedPlaylists']['watchLater']
videos = youtube.playlistItems().list(
    part='snippet', playlistId=watchLaterID, maxResults=50
).execute()['items']
print('Videos to download: %s' % len(videos))

The playlistId representing the Watch Later playlist is saved in watchLaterID, which is then used to list all videos in that playlist. Next, we loop through the list of videos and download each one. After each successful download, the corresponding video is deleted from the playlist. Because the video download might take some time, our API connection might have timed out, so we call the buildAPI() method again before removing the video from the “Watch Later” playlist.

from subprocess import call

STORAGE_PATH = "storage/"

for video in videos:
    video_id = video["snippet"]["resourceId"]["videoId"]
    video_title = video["snippet"]["title"]
    video_filename = '%s%s %s.%%(ext)s' % (STORAGE_PATH, toValidFilename(video_title), video_id)
    try:
        call('youtube-dl -o "%s" -- %s' % (video_filename, video_id), shell=True)
    except Exception as e:
        # Download can resume, so don't delete the file
        print('%s' % e)
        continue
    # Rebuild the API client: the connection may have timed out during the download
    youtube = buildAPI()
    youtube.playlistItems().delete(id=video["id"]).execute()

Note that the file name is obtained from the video title, so we must make sure no illegal characters are used:

def toValidFilename(value):
  valid_chars = "-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
  return "".join([c for c in value if c in valid_chars])

Simply share the directory pointed to by STORAGE_PATH through Bittorrent Sync (or any other syncing solution like Dropbox or Google Drive), and the videos will be accessible from wherever you wish. This script should be called periodically (I use a cron job every 15 minutes), so we must make sure only one instance is running at a time (or else we might download the same file twice simultaneously). This can be done very simply with a singleton from the tendo library, which ensures only one instance of the script runs at a time.

from tendo import singleton
me = singleton.SingleInstance()

Finally, if you want the script to run every 15 minutes on your Raspberry Pi, simply add a cron job. In a command line prompt, type “crontab -e”, and paste in the following line (putting the correct paths and file names).

0,15,30,45 * * * * /home/pi/path/to/script/ >> /home/pi/path/to/log/youtubewatchlater.log 2>&1

This gives me a nice way to save videos for offline watching. Those videos get downloaded while I’m away (as I often add them to the playlist when I’m not home), and get synced over WiFi as soon as I get home. I can then watch them on my TV or on my phone when commuting, without using the data plan.