A previous post that serves as a reference for this topic can be found here.
Following are a few tips and suggestions for High Availability testing:
Understand the scope of the High Availability test
Most High Availability test scenarios involve deploying different parts of the product(s) on different machines. Proper deployment guidelines need to be clearly defined and understood.
Debugger and symbols
Reproducing High Availability bugs is difficult due to the magnitude of the deployment effort. Thus, when these bugs are encountered, we need to ensure that all supporting software is installed or available so that the developer can diagnose the problem.
A complete stack dump will only be useful if all the matching symbols are available. These should include not just the symbols for the product but also the symbols for the relevant OS (including all service packs) and for any dependencies.
It is important to run all tests on High Availability test beds with debuggers attached.
High Availability Tools
We need to ensure that we consistently run and pass using the tools provided by Data Center. In addition to these, we also need to run customized tools that simulate real-world failure situations.
These tools need to be built based on the area of functionality, after consulting with the PM and developers.
In a real-world scenario, all serious consumers of our products will install some form of security firewall.
High Availability tests should also be run with firewalls (both hardware and software) set up. This will help detect issues related to limitations in the product early on, e.g. problems with raw RPC calls.
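A customized failure-simulation tool of the kind described above can be sketched in a few lines. The following Python sketch is purely illustrative (FaultInjector, call_with_failover, and the node names are all invented here): it injects connection failures into calls, the way a firewall drop or network partition would, and checks that failover to another node still succeeds.

```python
import random


class FaultInjector:
    """Hypothetical fault injector: fails a fraction of calls to
    simulate network drops, firewall blocks, or node failures."""

    def __init__(self, failure_rate, seed=None):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded for reproducible runs

    def call(self, operation, *args, **kwargs):
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected failure (simulated network drop)")
        return operation(*args, **kwargs)


def call_with_failover(injector, operation, nodes):
    """Try the operation against each node in turn, failing over
    when the injector simulates a dropped connection."""
    last_error = None
    for node in nodes:
        try:
            return injector.call(operation, node)
        except ConnectionError as err:
            last_error = err  # fail over to the next node
    raise last_error


# Even with a 50% injected failure rate, the request succeeds as long
# as at least one node in the list gets through.
injector = FaultInjector(failure_rate=0.5, seed=42)
result = call_with_failover(
    injector, lambda node: f"served by {node}", ["node1", "node2", "node3"]
)
```

Tools like this are cheap to build per feature area, and the seed makes any failure sequence that finds a bug reproducible for the developer.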
Increase Test Coverage
Write code to test new functionality. In addition to running and testing existing HA-aware functionality, write scripts to increase feature-area test coverage.
Use Multi-Proc servers
All High Availability tests should be run on multi-proc servers to avoid the serialization introduced by single-proc hardware. Many concurrency issues will not easily be found on a single-proc server.
Although High Availability testing is not Performance testing, every effort should be made to ensure that the hardware platform does not frequently change. This will help ensure consistent results build over build.
Ensure that all tests are performed using normal user credentials. Running as Administrator with full access will obscure some of the results. It is imperative to test early in the cycle with normal user rights; if normal functionality does not work without administrative rights, file a bug.
All steps taken to set up and deploy the products for High Availability testing should be clearly documented, and caveats should be clearly highlighted. Problems faced installing or uninstalling in an HA environment should be filed as bugs. E.g. installing on an MSCS cluster should require deployment from one node only; if setup requires manual installation on each node, file a bug.
Page heap or App Verifier to catch memory corruption
For memory corruption bugs, getting a good call stack at the point of failure requires tools that catch the corruption the moment it happens. Page heap/Gflags (or App Verifier if you use .NET Server) can help.
For debugging and catching memory corruption bugs, you should always use page heap. Run with both the normal and full options (gflags /p /enable <image.exe> and gflags /p /enable <image.exe> /full).
Occasionally use other options as well:
If page heap hangs: /protect (protects internal page heap structures from corruption).
To verify failure code paths: /fault (simulates low-memory conditions).
To verify pointer decrements: /backwards (catches buffer underruns rather than overruns).
Some of these problems might manifest themselves only when failover occurs.
All testing in an HA environment should be automated, preferably through scripts. Imaging, deployment, running the tests, and logging the results should all be set up so that they can be triggered remotely.
Time invested in this effort will save time in the long run and make transferring ownership of the testing seamless.
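The image/deploy/run/log sequence can be sketched as a simple pipeline runner. The step names and functions below are hypothetical placeholders; in a real bed each step would shell out to site-specific imaging and deployment tooling, and the runner itself could sit behind a remote trigger.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("ha-run")


def run_pipeline(steps):
    """Run each named step in order, logging pass/fail. Stop on the first
    failure so the test bed is left in a diagnosable state."""
    results = {}
    for name, step in steps:
        try:
            step()
            results[name] = "pass"
            log.info("%s: pass", name)
        except Exception as err:
            results[name] = f"fail: {err}"
            log.error("%s: fail: %s", name, err)
            break
    return results


# Hypothetical steps; each lambda stands in for real tooling.
steps = [
    ("image", lambda: None),
    ("deploy", lambda: None),
    ("run-tests", lambda: None),
    ("collect-logs", lambda: None),
]
results = run_pipeline(steps)
```

Because the whole sequence is one scripted entry point, a new owner can take over the testing by learning one command rather than a page of manual steps.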
Help Improve Functionality
Attend feature-area meetings related to High Availability functionality, and pay close attention to changes in functionality in these areas. E.g. a change to maintain state locally on the box could mean that the feature can no longer fail over and also cannot scale out.
Call out these issues, and look for areas that can be improved. E.g. long-running tasks should be broken into smaller atomic jobs; if the system fails during one of these long tasks, the next node can and should continue from the last good checkpoint rather than starting all over.
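The checkpoint idea above can be sketched in a few lines. This is an illustrative model only (run_with_checkpoints and the fail_at crash hook are invented for the demo, and the in-memory dict stands in for durable shared state): a long task runs as a list of atomic jobs, a checkpoint is recorded after each, and a failover node resumes from the last good checkpoint.

```python
def run_with_checkpoints(jobs, state, fail_at=None):
    """Run a long task as smaller atomic jobs, checkpointing after each.
    fail_at simulates a node crashing mid-task (demo assumption only)."""
    for index in range(state.get("checkpoint", 0), len(jobs)):
        if index == fail_at:
            raise RuntimeError(f"node failed before job {index}")
        jobs[index]()
        # In a real system this checkpoint lives in durable shared storage.
        state["checkpoint"] = index + 1


completed = []
jobs = [lambda i=i: completed.append(i) for i in range(5)]
state = {}  # stand-in for durable, cluster-visible state

try:
    run_with_checkpoints(jobs, state, fail_at=3)  # first node crashes
except RuntimeError:
    pass
run_with_checkpoints(jobs, state)  # failover node resumes at job 3
```

Jobs 0-2 run exactly once on the first node, and the second node picks up at job 3 instead of redoing the whole task, which is precisely the behavior worth pushing for in feature reviews.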
Example in a real scenario…
to follow in the next post…