Abstract
Testing distributed systems (such as a network) with automation is difficult. In this post I will attempt to codify a general approach to testing a distributed, heterogeneous system with automation. Some goals of this approach are:
- To be implementable on top of a “script based” automation system incrementally.
- To allow for easy and effective test failure isolation and troubleshooting.
- To enable simulations of the distributed system that help testers/engineers gain insight about how the distributed system works.
- To allow for automated or computer assisted test case generation.
Motivating example
You have a heterogeneous system of providers and clients. Providers give services to clients. Let’s say you have a few storage providers in your distributed system. You also have many pieces that are storage clients. When a storage provider gives storage to a client, it should add the client to a list of clients that it has provided storage to. In addition, the client should add the provided storage to a list of all the storage that has been provided to it. You also have to guarantee that the storage provided to the client is usable.
The Solution
To test this behavior you have a set of proxy objects that are manipulated by the test framework. The proxy objects then send the correct commands to the actual pieces of the distributed system to provide the storage to the clients. These proxy objects have internal state that keeps track of what state the distributed system should be in. Periodically (depending on performance constraints) the state of the framework proxy objects is compared to the state of the actual system. If they do not match, a test error is raised specifying what part of the state does not match.
- The expectations of a component should be grouped with the interface to the component.
Each proxy object has preconditions and postconditions associated with each of its methods. An invariant shared by all proxy objects is that their internal state matches the state of the corresponding piece of the distributed system.
The proxy objects automatically poll the actual component and compare the actual state to the expected state to determine if the actual component is behaving correctly.
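As a rough illustration of the idea, here is a minimal Python sketch of a proxy for a storage provider. Every name in it (FakeStorageProvider, StorageProviderProxy, StateMismatch, verify) is made up for this post, and the fake provider stands in for whatever transport a real system would use. The precondition simply raises here, which is the simplest possible handling and not the only option.

```python
class StateMismatch(AssertionError):
    """Raised when the expected state and the actual state disagree."""


class FakeStorageProvider:
    """Hypothetical stand-in for a real provider, kept in memory."""

    def __init__(self):
        self.clients = []

    def provide_storage(self, client_id):
        self.clients.append(client_id)

    def list_clients(self):
        return list(self.clients)


class StorageProviderProxy:
    """Drives the component and tracks the state it should be in."""

    def __init__(self, component):
        self.component = component
        self.expected_clients = []

    def provide_storage(self, client_id):
        # Precondition: this client has not been given storage yet.
        if client_id in self.expected_clients:
            raise ValueError(f"{client_id} already has storage")
        self.component.provide_storage(client_id)
        # Postcondition: the client is now in the expected client list.
        self.expected_clients.append(client_id)
        self.verify()

    def verify(self):
        # Invariant: expected state matches the actual component state.
        actual = self.component.list_clients()
        if sorted(actual) != sorted(self.expected_clients):
            raise StateMismatch(
                f"clients: expected {self.expected_clients}, actual {actual}"
            )
```

In this sketch verify() runs after every command. A framework with tighter performance constraints could call it on a timer instead.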
Some cool things we can do once we have proxy objects
- Specify certain states and state transitions that we require. We then use a constraint solver (either deterministic, based on object contracts, or heuristic, based on simulator runs) to determine the optimal test execution given a few constraints (maximum allowable time, maximum number of components of each type, etc.)
- Get detailed reports of the exact components that failed in a test case
- Keep track of what components fail the most. If we keep track of the reasons that pieces failed, we can also determine what the most likely causes are automatically.
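The failure tracking in the last bullet could be as simple as two counters. This is only a sketch; FailureLog and its method names are hypothetical, and a real framework would probably persist the data between runs.

```python
from collections import Counter


class FailureLog:
    """Counts failures per component and per reason."""

    def __init__(self):
        self.by_component = Counter()
        self.by_reason = Counter()

    def record(self, component, reason):
        # Called whenever a proxy reports a state mismatch.
        self.by_component[component] += 1
        self.by_reason[reason] += 1

    def most_failing(self, n=1):
        # Components with the most recorded failures, highest first.
        return self.by_component.most_common(n)

    def likely_causes(self, n=1):
        # Most frequently recorded failure reasons, highest first.
        return self.by_reason.most_common(n)
```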
Simulation
To effectively simulate, the proxy objects' interfaces to real devices must be disabled. This is best implemented as a separation between the part of the proxy object that communicates with the real components and the part that keeps track of the expected internal state. Simulation would be implemented by only using the part that keeps track of expected internal state. The part that communicates with the actual object could either be disabled or replaced with a dumb object that does nothing and always returns that the state matches.
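One way to picture that separation, as a sketch with hypothetical names (NullBackend, StorageClientProxy, send, state_matches): the proxy keeps the expected state, and everything that touches a real device goes through a backend object that can be swapped out.

```python
class NullBackend:
    """Dumb backend: does nothing and always reports a match."""

    def send(self, command, *args):
        return None

    def state_matches(self, expected_state):
        return True


class StorageClientProxy:
    """Tracks expected state; all device access goes through the backend."""

    def __init__(self, backend):
        self.backend = backend
        self.expected_storage = []

    def attach_storage(self, volume):
        # With a real backend this would reach the actual client.
        self.backend.send("attach", volume)
        self.expected_storage.append(volume)

    def check(self):
        expected = {"storage": list(self.expected_storage)}
        return self.backend.state_matches(expected)


# Simulation: same proxy code, no real device behind it.
simulated = StorageClientProxy(NullBackend())
simulated.attach_storage("vol-1")
```

A real backend would implement the same two methods against the actual component, so the test scripts would not need to change between simulation and live runs.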
Implementation Notes
- The proxy object doesn’t need to keep track of all the state of the real component. Detailed tracking can be added incrementally.
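A small sketch of what incremental tracking could look like: compare only the fields the proxy knows about and ignore everything else the real component reports. The function name and the dictionary layout are assumptions made for this example.

```python
def find_mismatches(expected, actual):
    """Return {field: (expected, actual)} for tracked fields that differ.

    Fields that appear only in `actual` are not tracked yet and are ignored.
    """
    return {
        field: (want, actual.get(field))
        for field, want in expected.items()
        if actual.get(field) != want
    }
```

Tracking a new piece of state then means adding one more key to the expected dictionary.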
Notes and Gotchas
- What about issuing commands to components that fail preconditions? The component should go into a proper failure state. So preconditions shouldn’t prevent the call, but should alert the user that the preconditions aren’t met and update the expected state of the component to reflect the failed preconditions. This is hard to do well with a clear API. Perhaps users should check preconditions manually? Or the object treats preconditions differently based on a mode it is in?
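To make the "mode" option concrete, here is one possible shape for it. This is a sketch, not a recommendation: the strict flag, the "failed" state name, and the capacity check are all invented for the example, and the call to the real component is left out.

```python
import warnings


class PreconditionError(Exception):
    """Raised in strict mode when a precondition is not met."""


class ProviderProxy:
    def __init__(self, capacity, strict=False):
        self.strict = strict
        self.free = capacity
        self.expected_state = "ok"

    def provide_storage(self, size):
        if size > self.free:
            # Precondition failed: not enough free capacity.
            if self.strict:
                raise PreconditionError("not enough free capacity")
            # Lenient mode: warn, and expect the real component to end
            # up in its failure state once the command is issued.
            warnings.warn("precondition failed: not enough free capacity")
            self.expected_state = "failed"
            return
        self.free -= size
```

Strict mode suits script authors who want mistakes caught early. Lenient mode suits negative tests that deliberately push a component into its failure state.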
What are the downsides?
Experts can take other factors into account when determining the optimal execution. This approach forces that knowledge to be codified into the system, even in cases where it would be more efficient to simply apply it without codifying it.
I can’t think of a clear and obvious way to detect hardware failures directly through the proxy objects. They would help isolate any such problems though.