In a post published on techblog.netflix.com on July 19, 2011, Netflix describes a tool that randomly disables its production instances to make sure the service can survive that kind of failure.
This tool is backed by an army of 7 kinds of “soldiers”, each with a specific job:
- Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately.
- Conformity Monkey finds instances that don’t adhere to best-practices and shuts them down.
- Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health to detect unhealthy instances.
- Janitor Monkey ensures that the cloud environment is running free of clutter and waste.
- Security Monkey finds security violations or vulnerabilities.
- 10–18 Monkey detects configuration and run time problems in instances serving customers in multiple geographic regions.
- Chaos Gorilla simulates an outage of an entire Amazon availability zone.
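The idea shared by all of these tools, inject a failure on purpose and check that the service still answers, can be sketched as a toy simulation. This is only an illustration under stated assumptions, not Netflix's implementation: `Instance`, `terminate_random_instance` and `service_available` are hypothetical names, and the "fleet" is a plain in-memory list rather than real cloud instances.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a production instance."""
    instance_id: str
    running: bool = True


def terminate_random_instance(fleet, rng):
    """Stop one randomly chosen running instance; return it, or None if none are running."""
    candidates = [inst for inst in fleet if inst.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def service_available(fleet, min_running=1):
    """The toy service counts as available while at least `min_running` instances are up."""
    return sum(inst.running for inst in fleet) >= min_running


# Assumed setup: three redundant instances behind one service.
fleet = [Instance(f"i-{n}") for n in range(3)]
rng = random.Random(2011)  # seeded so the experiment is repeatable

victim = terminate_random_instance(fleet, rng)
print(f"terminated {victim.instance_id}; service available: {service_available(fleet)}")
```

With three redundant instances the toy service survives one random termination. Raising `min_running` or shrinking the fleet shows how quickly that margin disappears.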
Seven years later, it is interesting to ask whether every system needs a Simian Army to ensure the availability of its platform.
More information: The Netflix Simian Army