To obtain meaningful empirical results, it was necessary to conduct a series of experiments under different playing conditions. Each enhancement was tested against a variety of opponents with different styles. Experiments were designed with built-in standards for comparison, such as playing one version against the identical program with a single enhancement, so as to isolate the dependent variable. For each test, the number of games was chosen to produce statistically significant results. Many experiments were performed to establish reliable results, and only a cross-section of those tests is presented here. For instance, over a dozen experiments were conducted to measure the performance gain achieved by the addition of the extra arc from OPP Final to BPP Final alone.
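As an illustration of how a game count can be checked for statistical significance, the comparison of two program versions might be sketched as follows. This is a minimal example, not BPP code: the function name, the two-sample z-test, and the synthetic per-game winnings are our own assumptions.

```python
# Illustrative sketch (not from BPP): a two-sample z-test on per-game
# winnings, deciding whether the observed difference between a baseline
# version and an enhanced version is statistically significant.
import math
import random
import statistics


def significant_difference(a, b, z_crit=1.96):
    """True if mean per-game winnings of a and b differ at ~95% confidence."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))  # standard error of the gap
    z = (mean_b - mean_a) / se
    return abs(z) > z_crit


# Synthetic data: 5000 games each, enhanced version has a small true edge.
random.seed(0)
baseline = [random.gauss(0.00, 1.0) for _ in range(5000)]
enhanced = [random.gauss(0.05, 1.0) for _ in range(5000)]
print(significant_difference(baseline, enhanced))
```

With noisy per-game outcomes, even a real edge of a few hundredths of a unit per game needs thousands of games before the test reliably detects it, which is why the number of games per experiment matters.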
Throughout the experiments, to allow comparable inferences about the relative merits of each feature, the original Bayesian Poker Player was used as a baseline when assessing the number of betting units won per game. This method can give a preliminary indication of the relative value of a new enhancement; however, one must be careful when drawing conclusions from self-play experiments, and it is important not to over-interpret the results of a single simulation. With the above format, there are limits to how much can be concluded from a single experiment, since it represents only one particular type of game and style of opponent. It is quite possible that the same feature would perform much worse (or much better) in a game against human opposition, for example.
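To make the units-per-game metric concrete, one way to report it so that a single simulation is not over-interpreted is to attach a confidence interval to the mean. The sketch below is our own illustration with synthetic game results; it is not taken from BPP.

```python
# Illustrative sketch (not from BPP): mean betting units won per game,
# reported with a ~95% confidence interval so that the uncertainty of a
# single self-play experiment is visible alongside the point estimate.
import math
import statistics


def units_per_game(results):
    """Return (mean units/game, (lower, upper) 95% confidence bounds)."""
    mean = statistics.fmean(results)
    half = 1.96 * statistics.stdev(results) / math.sqrt(len(results))
    return mean, (mean - half, mean + half)


# 500 synthetic game results (net betting units won in each game).
results = [+2, -1, 0, +3, -2, +1, -1, +4, 0, -3] * 50
mean, (lo, hi) = units_per_game(results)
print(f"{mean:+.3f} units/game, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

A wide interval around a small positive mean is exactly the situation the text warns about: the point estimate alone would overstate how much the experiment has established.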
A wider variety of testing is necessary for an accurate assessment of a new feature, such as changing the context of the simulated game. However, the various versions of BPP differ only slightly and share fairly similar playing styles. The consequences of each change could well be different against a field of opponents employing diverse playing styles. For example, against several human players, the effect of refining the action classifications may be much greater than that of refining the way hands are represented.
Ultimately, the only performance metric that matters is how BPP plays against humans. Since it was difficult (and expensive) to obtain this data, most of the experimentation was done with self-play tournaments.
One final essential method of evaluation is the critique of experienced human players. Humans can review the play of the computer and determine whether particular decisions are ``reasonable'' under the circumstances, or are indicative of a serious weakness or misconception.