
Mutation testing with Infection

Mutation testing is a testing methodology that involves modifying a program in small ways and analyzing how the test suite reacts to those modifications. If the tests still pass after the code is changed, they are not very effective for the mutated piece of code.

A mutation testing tool makes one change in the code base and then runs the tests. If the tests fail, the tests killed the mutant. If the tests succeed, the mutant escaped. This will be clear if we look at an example.
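As a rough illustration of that loop, here is a minimal Python sketch. Infection itself works on PHP code and generates its mutants automatically, so this is not how the tool is implemented. The sketch borrows the temperature rule from the walkthrough in this post (anything below 100 degrees is safe); the function name `is_safe`, the four hand-written mutants, and the two test lists are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

TestCase = Tuple[int, bool]  # (temperature, expected result of is_safe)


def is_safe(degrees: int) -> bool:
    """Original rule: a temperature below 100 degrees is safe."""
    return not (degrees >= 100)


# Hypothetical mutants, each one small change away from is_safe.
MUTANTS: Dict[str, Callable[[int], bool]] = {
    ">= becomes >": lambda d: not (d > 100),
    ">= becomes <": lambda d: not (d < 100),
    "100 becomes 101": lambda d: not (d >= 101),
    "negation removed": lambda d: d >= 100,
}

# The first test suite only probes values far from the boundary.
WEAK_TESTS: List[TestCase] = [(50, True), (200, False)]
# The improved suite adds checks on both sides of the boundary.
STRONG_TESTS: List[TestCase] = WEAK_TESTS + [(99, True), (100, False)]


def passes(candidate: Callable[[int], bool], tests: List[TestCase]) -> bool:
    """Return True if the candidate gives the expected answer for every test."""
    return all(candidate(d) == expected for d, expected in tests)


def escaped_mutants(tests: List[TestCase]) -> List[str]:
    """Run the tests against each mutant and list the ones that still pass."""
    assert passes(is_safe, tests), "tests must pass on the original code"
    return [name for name, mutant in MUTANTS.items() if passes(mutant, tests)]


if __name__ == "__main__":
    for label, tests in (("weak", WEAK_TESTS), ("strong", STRONG_TESTS)):
        escaped = escaped_mutants(tests)
        killed = len(MUTANTS) - len(escaped)
        print(f"{label} tests: {killed}/{len(MUTANTS)} killed, escaped: {escaped}")
```

With the weak tests, two of the four hand-written mutants escape even though every line of `is_safe` is executed. Adding the boundary checks at 99 and 100 kills all four.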

Say we have a Temperature object in our nuclear reactor codebase. We can instantiate the Temperature object with an int, and we can ask the object whether the temperature is safe. If it is below 100 degrees, it is safe.

To be safe, we write some tests. The first test checks that a temperature of 50 is reported as safe. The second test creates a temperature of 200 and checks that it is reported as not safe. We run our tests and, sure enough, all tests succeed, giving us a reassuring code coverage of 100%.

Now we will use Infection, a PHP mutation testing tool, and run it on our code. It created 6 separate changes and ran our tests after each one; 3 of the changes, or mutations, were not detected by our tests.

Let's look at one of those changes. Notice the greater-than-or-equal sign: the mutation removed the equal sign, ran our tests, and all of our tests succeeded. This could be a big problem! If we look at the other changes, first we see the original, then the 3 changes our tests didn't detect, and then the 3 changes they did detect.

OK, let's improve our tests. We will now check that 99 returns safe and 100 returns not safe. Now we run Infection again, and all mutants are killed. We can now be more confident that when a developer makes a mistake, our tests will be less forgiving.

We’re releasing an analysis showing that since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet [1] classification has been decreasing by a factor of 2 every 16 months. Compared to 2012, it now takes 44 times less compute to train a neural network to the level of AlexNet [2] (by contrast, Moore’s Law [3] would yield an 11x cost improvement over this period).
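The relationship between these figures can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: it assumes the 7-year span and the 2-year Moore's Law doubling period that this text quotes elsewhere, and the function names are made up for illustration.

```python
import math


def doubling_time_months(improvement_factor: float, elapsed_months: float) -> float:
    """Months per factor-of-2 gain, given a total gain over a period."""
    return elapsed_months / math.log2(improvement_factor)


def improvement_factor(elapsed_months: float, doubling_months: float) -> float:
    """Total gain after a period, given a fixed doubling time."""
    return 2 ** (elapsed_months / doubling_months)


ELAPSED_MONTHS = 7 * 12  # assumption: the 7-year span quoted for the 44x figure

if __name__ == "__main__":
    algo = doubling_time_months(44, ELAPSED_MONTHS)
    moore = improvement_factor(ELAPSED_MONTHS, 24)
    print(f"44x over 7 years implies a doubling time of {algo:.1f} months")
    print(f"A 2-year doubling over 7 years yields a {moore:.1f}x improvement")
```

A 44x gain over 84 months works out to a doubling time of a little over 15 months, in line with the quoted 16-month figure, and a 2-year doubling over the same span gives roughly the 11x attributed to Moore's Law.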
Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency. Algorithmic improvement is a key factor driving the advance of AI. It’s important to search for measures that shed light on overall algorithmic progress, even though it’s harder than measuring such trends in compute.

[Figure: 44x less compute required to get to AlexNet performance 7 years later. Total amount of compute in teraflops/s-days used to train to AlexNet-level performance. Lowest compute points at any given time shown in blue, all points measured shown in gray. [2, 5-16]]

Measuring efficiency

Algorithmic efficiency can be defined as reducing the compute needed to train a specific capability. Efficiency is the primary way we measure algorithmic progress on classic computer science problems like sorting. Efficiency gains on traditional problems like sorting are more straightforward to measure than in ML because they have a clearer measure of task difficulty. [1] However, we can apply the efficiency lens to machine learning by holding performance constant. Efficiency trends can be compared across domains like DNA sequencing [17] (10-month doubling), solar energy [18] (6-year doubling), and transistor density [3] (2-year doubling).

We are standardizing OpenAI’s deep learning framework on PyTorch. In the past, we implemented projects in many frameworks depending on their relative strengths. We’ve now chosen to standardize to make it easier for our team to create and share optimized implementations of our models.

$ pip install procgen  # install
$ python -m procgen.interactive --env-name starpilot  # human
$ python <<EOF  # random AI agent
import gym
env = gym.make('procgen:procgen-coinrun-v0')
obs = env.reset()
while True:
    # Take a random action each step until the episode ends.
    obs, rew, done, info = env.step(env.action_space.sample())
    env.render()
    if done:
        break
EOF


Design principles

We’ve designed all Procgen environments to satisfy the following criteria:

• High Diversity: Environment generation logic is given maximal freedom, subject to basic design constraints. The diversity in the resulting level distributions presents agents with meaningful generalization challenges.

• Fast Evaluation: Environment difficulty is calibrated such that baseline agents make significant progress after training for 200M timesteps. Moreover, the environments are optimized to perform thousands of steps per second on a single CPU core, enabling a fast experimental pipeline.

• Tunable Difficulty: All environments support two well-calibrated difficulty settings: easy and hard. While we report results using the hard difficulty setting, we make the easy difficulty setting available for those with limited access to compute power. Easy environments require approximately an eighth of the resources to train.

• Emphasis on Visual Recognition and Motor Control: In keeping with precedent, environments mimic the style of many Atari and Gym Retro games. Performing well primarily depends on identifying key assets in the observation space and enacting appropriate low-level motor responses.
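To get a feel for the numbers in the Fast Evaluation and Tunable Difficulty criteria, here is a rough budget calculation. It rests on two assumptions that the text does not state: "an eighth of the resources" is read as an eighth of the timesteps, and "thousands of steps per second" is taken as 2,000 steps per second on one core. It counts environment stepping only, not the cost of training the agent.

```python
HARD_TIMESTEPS = 200_000_000      # training budget quoted for baseline agents
EASY_FRACTION = 1 / 8             # "approximately an eighth of the resources"
ASSUMED_STEPS_PER_SECOND = 2_000  # assumption: one reading of "thousands of steps per second"


def simulation_hours(timesteps: int, steps_per_second: int = ASSUMED_STEPS_PER_SECOND) -> float:
    """Hours spent stepping the environment on a single core."""
    return timesteps / steps_per_second / 3600


# Assumption: resources scale with timesteps, so easy mode gets an eighth of them.
EASY_TIMESTEPS = int(HARD_TIMESTEPS * EASY_FRACTION)

if __name__ == "__main__":
    for name, steps in (("hard", HARD_TIMESTEPS), ("easy", EASY_TIMESTEPS)):
        print(f"{name}: {steps:,} timesteps, about {simulation_hours(steps):.1f} core-hours of stepping")
```

Under these assumptions the hard setting needs about 28 core-hours of environment stepping and the easy setting about 3.5, which is why the easy setting is practical with limited compute.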
