Kalamazoo Psychiatric Hospital—where Cushny and Peebles conducted the hyoscine clinical trial (14)

Hypothesis testing pervades statistical education and scientific research, and for good reason. Experiments are often initiated from a hypothesis or produce data that leads to a hypothesis. Consider, for example, this data set from pharmacologists Cushny and Peebles (1):

Table 1: Average sleep for patients at Kalamazoo Psychiatric Hospital after being dosed with small amounts of different candidate soporifics (1)

Cushny and Peebles were hoping to find an effective soporific. To that end, they dosed patients at the Michigan Asylum for the Insane at Kalamazoo with small amounts of different related drugs and measured sleep activity. Their data set is somewhat famous (2). It was one of the first examples Student (9) applied the t-distribution to and it was later analyzed by Ronald Fisher in his highly influential book *Statistical Methods for Research Workers (15, p. 119)*.

Looking at the data, several natural hypotheses suggest themselves: Is *L-hyoscine HBr* more effective than *L-hyoscyamine HBr *at inducing sleep? Is *L-hyoscyamine HBr* better than no drug? One would hope for statistics to provide us with good tools to test such questions. Unfortunately, the method most often taught and used, significance testing with P-values, can be problematic and is frequently misunderstood.

To see an example of how P-values are misunderstood, let’s look at how Peter Bernstein describes hypothesis testing in his bestselling book *Against the Gods: The Remarkable Story of Risk.*

Epidemiologists — the statisticians of health — observe the same convention as that used to measure the performance of investment managers. They usually define a result as statistically significant if there is no more than a 5% probability that an outcome was the result of chance. (4, p. 285)

Peter L. Bernstein, Against the Gods: The Remarkable Story of Risk

Bernstein is mistakenly thinking that P-values quantify the probability of the null hypothesis. He is hardly alone in this error. Steven Goodman coined the term P value fallacy to refer to this misconception, and he describes how common it is

In my experience teaching many academic physicians, when physicians are presented with a single-sentence summary of a study that produced a surprising result with P = 0.05, the overwhelming majority will confidently state that there is a 95% or greater chance that the null hypothesis is incorrect. (3)

There is no pathway to probability of the null hypothesis that doesn’t involve use of a prior and Bayesian statistics, which Ronald Fisher, the statistician who popularized P-value hypothesis testing, categorically rejected

… advocates of inverse probability (i.e. Bayesian statistics) seem forced to regard mathematical probability, not as an objective quantity measured by observed frequencies, but as measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes (5, p. 7)

To Fisher, probability of the null hypothesis was ill-defined.

The “ error,” so called, of accepting the null hypothesis “when it is false,” is thus always ill-defined both in magnitude and frequency. (5, p. 20)

This reoccurring misinterpretation of P-values as probabilities is a major problem for statistics, and it raises the question “*How should we interpret P-values?*”.

In 1987, James Berger and Thomas Sellke (6) made good progress towards an answer and what they discovered was quite alarming for how P-values and statistical significance levels are often understood.

Analyzing the case of testing a normal mean and a point null hypothesis, they considered the broad class of symmetric priors that assign equal weight to the null hypothesis and alternate hypothesis and are nonincreasing away from the null.

Figure 1: Examples of priors with support on (-R, R) that are symmetric and nonincreasing from the null, mean=0. The most extreme, and least favorable to the null, is the uniform prior.

They found that for any such objective prior, probability of the null hypothesis is *bounded below* by the function

Figure 2: Lower bounds on the posterior probability of a point null hypothesis for given P values, assuming equal prior probability for the null and alternate hypothesis and any objective prior.

What the plot says is quite stunning.* *For* *P=0.05, which is generally considered the threshold for statistical significance, the probability of the null hypothesis is *at least *28.9%. At best P=0.05 only weakly suggests the null hypothesis is wrong. Interpreting P*spectacularly wrong*.

Moreover, it should be emphasized, α provides a *lower bound *on the probability of the null hypothesis under an objective prior. It assumes circumstances that are unfavorable to the null hypothesis. To better understand this, Tomas Sellke, María Bayarri, and James Berger (7) set up a simple simulation. They assumed a sequence of experimental drugs D_1, D_2, … would be tested for non-negligible effect by computing a P-value from a sample of normally distributed data with unknown variance. If 50% of the drugs were chosen to have negligible effect and data was randomly sampled from normal distributions with varying values chosen for their means and standard deviations, they found, after running the simulations for long enough,

1. Of the D_i for which the P-value ≈ .05, at least 23% (and typically close to 50%) will have negligible effect.

2. Of the D_i for which the P-value ≈ .01, at least 7% (and typically close to 15%) will have negligible effect.

In this blog post I want to consider the problem of how to derive objective Bayesian answers for hypothesis testing under conditions that are more reasonable for the null hypothesis. There are some good approaches available in the Bayesian literature; but unfortunately, they are often not covered as part of a standard statistical education and hence are much less widely known and used as P-values.

The method I’ll be covering uses what are known as *expected encompassing intrinsic Bayes factors* (EEIBFs) (8). Before getting into the details, let’s revisit Cushny and Peeble’s data set to see how it works on a real example.

## Example: The Hyoscine Trial at Kalamazoo

Suppose we want to explore the question of whether *L-hyoscine HBr* is a more effective soporific than *L-hyoscyamine HBr. W*e’ll assume that the differences of the average sleep durations are normally distributed with unknown variance and we’ll test the three hypotheses:

```
H_0: L-hyoscyamine HBr and L-hyoscine HBr are equally effective
H_l: L-hyoscyamine HBr is less effective than L-hyoscine HBr
H_m: L-hyoscyamine HBr is more effective than L-hyoscine HBr
```

```
# Create an array of differences from the average
# sleep durations for L-hyoscine HBr subtracted from
# the average sleep durations for L-hyoscyamine HBr
#
# The full data set is available from
# http://www.medicine.mcgill.ca/epidemiology/hanley/bios601/Mean-Quantile/First.t.test.pdf
# page 791
import numpy as np
deltas = np.array((
-1.2, -2.4, -1.3, -1.3, 0. , -1. , -1.8, -0.8,
-4.6, -1.4, -0.5))
# Test the three hypotheses that the mean is less than zero, equal to zero,
# or greater than zero using expected encompassing intrinsic Bayes factors
from bbai.stat import NormalMeanHypothesis
result = NormalMeanHypothesis().test(deltas)
# Print the posterior probabilities for the three hypotheses
print('Probability mean is less than zero: ', result.left)
print('Probability mean is equal to zero: ', result.equal)
print('Probability mean is greater than zero: ', result.right)
```

The table below shows how the posterior probabilities of the three hypotheses evolve as differences are observed:

Table 2: Posterior probabilities for three hypotheses, computed using EEIBFs and objective priors.

### Comparison to Student’s Analysis

*Note: Student made a few mistakes when entering the data from Cushny and Peeble study (documented in (2)). He mislabeled the columns and omitted the observation for Patient 11.*

Student (9) analyzed the data from a Bayesian perspective and considered only the two hypotheses H_l and H_m (i.e. a one-sided test). He computes probabilities consistent with using the prior

*Note: The prior Student uses is improper, which perhaps requires some discussion. For the special case of one-sided testing, improper priors like this can be justified though the prior assumes conditions that are in a sense least favorable to the null hypothesis (see (11) and (12, p. 147–148)).*

Here’s some python code that reproduces the values from Student’s paper:

```
import numpy as np
# Note: Student only considers the first 10 entries from the data set
deltas = np.array((
-1.2, -2.4, -1.3, -1.3, 0. , -1. , -1.8, -0.8, -4.6, -1.4))
n = len(deltas)
t = np.mean(deltas) / (np.std(deltas, ddof=1) / np.sqrt(n))
# Compute the posterior probability for the hypothesis that
# L-hyoscyamine HBr is less effective than L-hyoscine HB
import scipy
P_H_alt = scipy.stats.t(n-1).cdf(-t)
print('Probability L-lyoscine HB is more effective:', P_H_alt)
```

`Probability L-hyoscine HB is more effective: 0.9985835549013079`

Rounding will give us the same result that Student reports.

But I take it the real point of the authors (Cushny and Peebles) was that 2 (L-hyoscine HBr) is better than 1 (L-hyoscyamine HBr). This we must test by making a new series, subtracting 1 from 2. The mean value of this series is +1.58 while the S.D. is 1.17, the mean value being +1.35 times the S.D. From the table the probability is .9985 or the odds are about 666 to 1 that 2 is the better soporific.

To make our results more comparable to Student’s, we’ll adjust the values for the n=10 entry in Table 2 to exclude the equals hypothesis, giving us

or an odds of about 119 to 1. As expected, EEIBFs give a smaller probability than Student since the method is design to provide a more reasonable prior probability for the null hypothesis than what would correspond to a lower bound.

**Bayes Factors**

The building blocks for objective Bayesian hypothesis testing are the Bayes factor and objective priors.

Let M_1 and M_2 denote two competing models with parameters θ_1 and θ_2 and *proper priors* π_1 and π_2. Suppose we then observe data x. Then the posterior probability for one of the models is given by

The *Bayes factor, *B_ji*, *is defined as the ratio

If the models are given equal prior probability, then B_ji is the ratio of the posterior probabilities for the two models.

### Example 1: The binomial distribution with Jeffreys prior

Suppose we observe 5 successes and 4 failures from a binomial distribution with unknown probability of success parameter p. Now imagine that we consider two models corresponding to the hypotheses that p1/2. If we use Jeffreys prior, then the priors for the two models after renormalization are

and the Bayes factor is given by

Let’s now consider the normal distribution. The reference prior for the normal distribution with the mean, µ,* *as the parameter of interest and the variance, σ², as the nuisance parameter is given by

(see example 10.5 of (17)). In this case, π is improper so we can’t use it directly to compute Bayes factors. Improper priors are unnormalized and only specified proportionally. Any multiple c x π(µ, σ²) is also a valid version of the improper prior; and, hence, computing Bayes factors with improper priors gives us a result that’s dependent on the ratio of arbitrary constants.

One way to work around this issue, the *training sample method *(10),* *is to use a minimal subset of the data to form proper priors and then the remaining data to compute the Bayes factors.

*Note: By minimal, I mean the smallest subset of observations such that the posterior using the prior π is proper and well defined.*

### Example 2: Naive version of the training sample method

Suppose we observe the values

from a normal distribution with known variance σ²=1 and we want to test the two hypotheses, µ 0. For the objective priors

we need a single observation to form proper distributions. Using the first observation, gives use the new priors

where φ and Φ represents the PDF and CDF of the standard normal distribution. We can then compute the Bayes factor

Clearly, the Bayes factor computed in Example 2 will depend on which observation is chosen to “train” the prior. As an improvement, we can average the result over all possible minimal training subsets, and this leads us to what Berger and Mortera (8) call *encompassing arithmetic intrinsic Bayes factors* (EIBFs).

Suppose H_1 and H_2 are two hypotheses to be tested and let H_0 denote a hypothesis that encompasses both H_1 and H_2.

*For example, if H_1 and H_2 denote hypotheses that the mean of a normal distribution is less than zero and greater than zero, then a naturally encompassing hypothesis is that the mean is unrestricted.*

The encompassing arithmetic intrinsic Bayes factor is defined as

where B_ji^N denotes the Bayes factor computed using the improper prior and **x**(*l*) denotes a minimal subset of the observed data and *l *enumerates over all possible minimal subsets.

EIBFs can work well with enough observations; but for smaller data sets, the term

can be numerically unstable. To make the factors more general purpose, Berger and Mortera recommend replacing the sum with expected values which gives us the *expected encompassing intrinsic Bayes factor *(EEIBF),

where** X**(*l*) denotes a minimal vector of independent identically distributed random variables large enough to make the posterior proper and distributed according to the model for the encompassing hypothesis H_0 with parameters that maximize the likelihood of the observed data **x**.

### Example 3: EEIBF hypothesis testing of normally distributed data

Suppose we observe the same values from Example 2,

and assume the data was sampled from a normal distribution with unknown variance and we wish to test the two hypotheses, H_1: µ 0. For the objective priors,

we need two observations to form a proper posterior distribution. The encompassing hypothesis, H_0, is that the mean is unrestricted and the maximum likelihood parameters are given by

Thus, the expected encompassing intrinsic* *Bayes factor is given by

where X_1 and X_2 are independent normally distributed variables with mean µ_ML and variance σ²_ML.

We can use the package bbai to do the computation.

```
import numpy as np
y = np.array((1.89, 0.52, 1.10, 2.36, 1.99, -0.85, 1.07))
from bbai.stat import NormalMeanHypothesis
result = NormalMeanHypothesis(with_equal=False).test(y)
print('B_12^EEI =', result.factor_left / result.factor_right)
```

`B_12^EEI = 0.07224015471356342`

Berger and Mortera write:

The EEIBF would appear to be the best procedure. It is satisfactory for even very small sample sizes, as is indicated by its not differing greatly from the corresponding intrinsic Bayes factor. Also, it was “balanced” between the two hypotheses, even in the highly nonsymmetric exponential model. It may be somewhat more computationally intensive than the other procedures, although its computation through simulation is virtually always straightforward. (8)

I’ll add that for the case of normal mean testing with unknown variance, there’s no need to resort to simulation. With appropriate quadrature rules and interpolation in Chebyshev nodes, we can accurately compute EEIBFs deterministically and efficiently. Perhaps I’ll go more into the details in another blog post; but for now, there’s this Jupyter notebook that shows how to do a step-by-step validation of an efficient deterministic implementation: https://github.com/rnburn/bbai/blob/master/example/18-hypothesis-eeibf-validation.ipynb.

## Discussion

**Q1: What other options are there for objective Bayesian hypothesis testing?**

As mentioned in the introduction, Sellke and Berger developed equations that allow you to derive a lower bound for the posterior probability of the null hypothesis, and Clare (16) makes improvements on Sellke and Berger’s work to give tighter bounds.

As another approach, Berger and Mortera developed intrinsic priors that asymptotically give the same answer as various default Bayes factor methods such as EEIBF, and they suggest that these intrinsic priors might be used in place of default Bayes factors:

Furthermore, (intrinsic priors) can be used directly as default priors in computing Bayes factors; this may be especially useful for very small sample sizes. Indeed, such direct use of intrinsicic priors is studied in the paper and leads, in part, to conclusions such as the superiority of the EEIBF (over the other default Bayes factors) for small sample sizes. (8)

**Q2: But if there are multiple reasonable methods, doesn’t that make the EEIBF approach subjective, whereas P-values are an objective approach to hypothesis testing?**

If the goal is to derive an objective answer to a hypothesis test, then P-values are a red herring.

A P-value is determined not just by the model and observed data, but also by experimental intent (22). The same observation and model can result in remarkably different P-values. Berger and Berry (13) provide this example to demonstrate.

Suppose a research studies a coin for bias. The researcher flips the coin 17 times. Heads comes up 13 times and tails comes up 4 times. Let θ represents the probability of heads and suppose the researcher is doing a standard P-value test with the null hypothesis that the coin is not bias, θ=0.5. What P-value would they get?

If the researcher intended to flip the coin 17 times, then the probability of seeing a value *less extreme *than 13 heads under the null hypothesis is given by summing binomial distribution terms representing the probabilities of getting 5 to 12 heads,

which gives us a P-value of 1–0.951=0.049.

If, however, the researcher intended to continue flipping until they got at least 4 heads and 4 tails, then the probability of seeing a value *less extreme *than 17 total flips under the null hypothesis is given by summing negative binomial distribution terms representing the probabilities of getting 8 to 16 total flips,

which gives us a P-value of 1–0.979=0.021.

One can argue whether any statistical method can be truly objective or to what extent objectivity should even be the goal; but methods based on objective Bayesian inference such as EEIBF have at least as strong claim to objectivity as P-values.

**Q3: Does rejection of P-value hypothesis testing make one a Bayesian?**

We shouldn’t equate frequentism with P-value hypothesis testing, and many of the results from objective Bayesian inference can be adapted to use frequentist terminology.

In (18), for example, Berger considers the thought experiment of whether Fisher, Jeffreys, and Neyman’s views on hypothesis testing could be reconciled and he shows how Bayes factor values could be interpreted by a frequentist as likelihood ratios and used to produce conditional error probabilities.

## Conclusion

When Fisher initially advocated for P-value testing, the method was met with skepticism and produce a lively debate. Harold Jeffreys famously criticized the use of P-values:

(A P-value) gives the probability of departures, measured in a particular way, equal to or greater than the observed set, and the contribution from the actual value is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure. (19, p. 357)

We can certainly appreciate the problem Fisher was trying to solve with P-values. At the time, common practice was to use Bayesian statistics (i.e. inverse probability) with a default uniform prior. Such an approach will, of course, produce results that depend arbitrarily on the parameterization of the model (20, 21). But as Berger and Berry show (13), P-values can be just as arbitrary, and the frequent misinterpretation of P-values as probabilities shows us that statistics users naturally want to interpret results as a measurement of degree of belief.

Default Bayes factors such as EEIBF fix the problems of early Bayesian practice by replacing naive use of the uniform prior with reference priors while keeping the goal of delivering results in terms of degree of belief. While no such method can claim to be objective in the purest sense (I doubt a completely objective approach to hypothesis testing exists), they can make a reasonable claim to objectivity and they can produce good results under very general circumstances.

## References

(1): Cushny, A. R. and Peebles, A. R. The action of optical isomers. 11: Hyoscines’, *Journal *of *Physiology, 32, *501–510 (1905). Since the paper is from more than 95 years ago, the data set can be consider to be in the public domain.

(2): Senn S, Richardson W. The first t-test. Stat Med. 1994 Apr 30;13(8):785–803. doi: 10.1002/sim.4780130802. PMID: 8047737.

(4): Bernstein, P. (1998). *Against the Gods: The Remarkable Story of Risk*. Wiley.

(12): Berger, J. (1985). *Statistical Decision Theory and Bayesian Analysis*. Springer.

(14): “View of Michigan Asylum for the Insane, Kalamazoo” Map of Kalamazoo Co., Michigan. Philadelphia: Geil & Harley, 1861. Library of Congress.

(15): Fisher, R. A (1925). Statistical Methods for Research Workers.

(20): Fisher, R. (1930). Inverse probability. *Mathematical Proceedings of the Cambridge Philosophical Society 26*(4), 528–535.