Mutation testing with Infection

Mutation testing is a testing methodology in which a program is modified in small ways and the reaction of the test suite to each modification is analyzed. If the tests still pass after the code has been changed, they are not very effective for the mutated piece of code.

A mutation testing tool makes one small change to the code base and then runs the tests. If the tests fail, the test suite killed the mutation. If the tests pass, the mutant escaped. This will become clear if we look at an example.

Say we have a Temperature object in our nuclear reactor codebase. We can instantiate the Temperature object with an int, and we can ask the object whether the temperature is safe: below 100 degrees it is safe. To be safe ourselves, we write some tests. The first test checks that a temperature of 50 is reported as safe. The second test checks that a temperature of 200 is reported as not safe. We run our tests and, sure enough, all tests succeed and we get a reassuring code coverage of 100%.
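
The post doesn't show the code itself, but based on the description it could look roughly like this (a minimal sketch: the Temperature class and isSafe() method come from the text above, while the exact implementation, file layout, and PHPUnit test names are my assumptions):

<?php

// src/Temperature.php (assumed layout): the class described above
final class Temperature
{
    private int $temperature;

    public function __construct(int $temperature)
    {
        $this->temperature = $temperature;
    }

    public function isSafe(): bool
    {
        // 100 degrees or more is considered unsafe
        return !($this->temperature >= 100);
    }
}

<?php

// tests/TemperatureTest.php (assumed layout): the two tests described above
use PHPUnit\Framework\TestCase;

final class TemperatureTest extends TestCase
{
    public function testATemperatureOf50IsSafe(): void
    {
        $this->assertTrue((new Temperature(50))->isSafe());
    }

    public function testATemperatureOf200IsNotSafe(): void
    {
        $this->assertFalse((new Temperature(200))->isSafe());
    }
}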

Now we will use Infection, a PHP mutation testing tool, and run it on our code. It created six separate changes, ran our tests for each one, and three of the changes, or mutations, were not detected by our tests. Let's look at one of those changes. Notice the greater-than-or-equal sign: the mutation removed the equal sign, our tests were run, and all of them still succeeded. This could be a big problem! If we look at the other changes, we first see the original, then the three changes our tests did not detect, and then the three changes they did detect.
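
Assuming the implementation sketched above, the escaped mutant looks roughly like this (the mutation simply turns the greater-than-or-equal comparison into a strict greater-than):

// Original condition inside Temperature::isSafe()
return !($this->temperature >= 100);

// Escaped mutant: the equal sign is removed, yet the tests for 50 and 200 still pass
return !($this->temperature > 100);

// With this mutant in place, new Temperature(100) would be reported as safe,
// which is exactly the boundary our current tests never exercise.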

OK, let's improve our tests. We will now check that 99 is reported as safe and that 100 is reported as not safe. When we run Infection again, all mutants are killed. We can now be more confident that when a developer makes a mistake, our tests will be less forgiving.
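
A sketch of the improved boundary tests (test names are again my own):

public function testATemperatureOf99IsSafe(): void
{
    $this->assertTrue((new Temperature(99))->isSafe());
}

public function testATemperatureOf100IsNotSafe(): void
{
    $this->assertFalse((new Temperature(100))->isSafe());
}

The 100-degree test fails against the escaped mutant shown earlier (100 > 100 is false, so isSafe() would wrongly return true), which is why that mutant is now reported as killed.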

We’re releasing an analysis showing that since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet[1] classification has been decreasing by a factor of 2 every 16 months. Compared to 2012, it now takes 44 times less compute to train a neural network to the level of AlexNet[2] (by contrast, Moore’s Law[3] would yield an 11x cost improvement over this period). Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.

Algorithmic improvement is a key factor driving the advance of AI. It’s important to search for measures that shed light on overall algorithmic progress, even though it’s harder than measuring such trends in compute.[4]

[Figure: 44x less compute required to get to AlexNet performance 7 years later. Total amount of compute in teraflop/s-days used to train to AlexNet-level performance, 2012–2019. Lowest compute points at any given time shown in blue; all points measured shown in gray. Models include AlexNet, VGG-11, GoogLeNet, ResNet-18/34/50, SqueezeNet v1.1, DenseNet-121, MobileNet v1/v2, ShuffleNet v1/v2, EfficientNet-b0, ResNeXt-50, and Wide ResNet-50.]

Measuring efficiency

Algorithmic efficiency can be defined as reducing the compute needed to train a specific capability. Efficiency is the primary way we measure algorithmic progress on classic computer science problems like sorting. Efficiency gains on traditional problems like sorting are more straightforward to measure than in ML because they have a clearer measure of task difficulty. [1]

However, we can apply the efficiency lens to machine learning by holding performance constant. Efficiency trends can be compared across domains like DNA sequencing[17] (10-month doubling), solar energy[18] (6-year doubling), and transistor density[3] (2-year doubling).
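
As a rough sanity check (my own arithmetic, assuming the 2012–2019 window above is about 84 months), a doubling time $T$ turns into an efficiency factor over a window $\Delta t$ as follows:

$$ \text{factor} = 2^{\Delta t / T} $$

$$ \text{Moore's Law } (T = 24\ \text{months}): \quad 2^{84/24} = 2^{3.5} \approx 11\times $$

$$ \text{Observed } 44\times: \quad T \approx \frac{84}{\log_2 44} \approx 15\text{–}16\ \text{months} $$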

We are standardizing OpenAI’s deep learning framework on PyTorch. In the past, we implemented projects in many frameworks depending on their relative strengths. We’ve now chosen to standardize to make it easier for our team to create and share optimized implementations of our models.

$ pip install procgen # install
$ python -m procgen.interactive --env-name starpilot # human
$ python <<EOF # random AI agent
import gym
env = gym.make('procgen:procgen-coinrun-v0')
obs = env.reset()
while True:
    obs, rew, done, info = env.step(env.action_space.sample())
    env.render()
    if done:
        break
EOF

Design principles

We’ve designed all Procgen environments to satisfy the following criteria:
