Whether one is interpreting data in government, business, or the soft or hard sciences, hypothesis testing is crucial for drawing proper inferences from that data. The ultimate question we are trying to answer is: “Could these observations really have occurred by chance?”

**What is a Hypothesis?**

The first step in hypothesis testing is to set a research hypothesis. A hypothesis, in the context of research, is the premise or claim we wish to investigate as the explanation of a phenomenon.

These hypotheses are drawn from observations of a representative sample of the population under investigation. In statistical terminology, __the sample__ is taken to represent the larger group called the population. Normally the hypothesis is based on some suspicion of there being a real effect behind the collected observations.

If the sample is representative of the population, hypothesis testing can be used to figure out if any differences or effects discovered in the sample exist in the population. Broadly speaking, this process is referred to as __inferential statistics__. A researcher will first make predictions or state the expectations about the hypothesis prior to conducting the study (or, more formally, calculate the pre-study odds). Once data is collected, the researcher will use inferential statistics to draw conclusions from the data. It is thus incredibly important that the hypothesis is stated in a specific, clear, and testable manner. This brings us to the indispensable scientific concept of falsification.

**Valid Scientific Hypotheses**

*A scientific statement is one that can possibly be proven wrong*

__Falsification__ as a philosophical tenet strives to distinguish valid from invalid scientific hypotheses. Its modus operandi is to question and reject hypotheses rather than to prove them or treat them as valid. In other words, a hypothesis must be disprovable to be considered a valid, testable hypothesis. This questioning usually involves collecting observations that may conflict with the hypothesis and lead to new insights into the research question. It also means that a falsifiable hypothesis must be sufficiently specific to be testable. An unfalsifiable claim must therefore be refined and clarified until it can be properly tested.

An example of an unfalsifiable hypothesis is (from Carl Sagan’s The Demon Haunted World):

*A fire-breathing dragon lives in my garage*

The garage owner then claims the dragon is invisibly floating, incorporeal, and evades detection with scientific equipment like infrared sensors. In this case, the dragon hypothesis is clearly untestable, and therefore unfalsifiable. This is the case so long as new observations don’t come along that refine the (now metaphysical) statement and provide the claim with the means to be tested and falsified. Sagan sums up (un)falsifiability well after introducing this example when he says:

“Claims that cannot be tested, assertions immune to disproof are veridically worthless, whatever value they have in inspiring us or in exciting our sense of wonder.”

Falsification underpins the generalization of scientific results to the population, rather than leaving them as claims about the sample alone: the researcher casts their hypothesis into the language of the null and alternate hypotheses, which is the next step in hypothesis testing. On that solid ground, the researcher can design a well-constructed experiment capable of arriving at useful and impactful conclusions.

**The Alternate and The Null Hypothesis**

Usually when we come across a hypothesis in an article, it is written something like:

*There is a positive relationship between Drug Z and blood pressure levels.*

What is stated here is the **alternate hypothesis**, Ha, that is, what the researcher hopes to find in the data, as an *alternate* way of explaining the phenomenon. Simply put, it states that there is a real effect: that the observations are the result of this real effect, with some chance variation mixed in.

This is contrasted with the **null hypothesis**, H0, which is the ‘status quo’ hypothesis stating that there really is no relationship between the variables (in the above example, Drug Z neither raises nor lowers blood pressure levels) and therefore the observations are purely the result of chance. For all intents and purposes, the null and the alternate hypotheses are mathematical opposites.

As in a court of law, it is hard to prove an outright truth, so what is typically done is that the null hypothesis is assumed true and one then attempts to reject it. The real metric, then, is whether one rejects or fails to reject the null hypothesis.

That is to say, we do not prove the null hypothesis, and we also do not prove the alternate hypothesis. The null and the alternate hypotheses take on no meaning outside the comparison between the finite set of observations that were made. There is no “objectively true” null to prove, since there may very well be other nulls, never thought up to begin with, that explain the lack of effect.

**The Test Statistic**

One goes about rejecting the null with a test statistic, which is calculated from the sample data and assesses the evidence against the null hypothesis. It is traditionally used to determine statistical significance, which is where the line is drawn to help us decide whether to reject the null or not. Critical values are used to judge whether a test statistic is significant.
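As a minimal sketch of computing a test statistic, consider a hypothetical one-sample z-test (the numbers below are invented for illustration, not drawn from this article): a null hypothesis claims the population mean is 120, the population standard deviation is known, and we observe a sample mean from 36 measurements.

```python
# Hypothetical one-sample z-test (all numbers assumed for illustration).
# H0: population mean mu0 = 120, with known population sd sigma = 15.
mu0, sigma, n = 120, 15, 36
sample_mean = 126  # observed sample mean (made up)

# The z statistic counts how many standard errors the sample mean
# lies from the mean claimed by the null hypothesis.
z = (sample_mean - mu0) / (sigma / n ** 0.5)
print(z)  # 2.4
```

A large |z| means the observed sample would be surprising if the null were true, which is exactly the evidence the critical values below are used to judge.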

Basically, critical values are cut-off values that separate rejection regions from non-rejection regions. The rejection region comprises the values in the sampling distribution of the test statistic that are improbable under the null hypothesis. The null hypothesis is rejected if the test statistic lies within this region.

The rejection region is determined by the Type I error rate, which is the probability of rejecting the null hypothesis when it is in fact true, otherwise known as alpha. For alpha = 0.05, a very common value in scientific research, the null hypothesis is rejected 5% of the time despite it being true. For a two-tailed test, the combined area under the curve over the rejection regions equals the level of significance of the test (that is, alpha).
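This can be sketched with Python's standard library: for a two-tailed test on a standard normal distribution at alpha = 0.05, the critical values come out to roughly ±1.96, and a test statistic beyond either one lands in a rejection region.

```python
from statistics import NormalDist

alpha = 0.05
# Two-tailed test: alpha is split across both tails of the standard normal,
# so the critical value is the quantile at 1 - alpha/2.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_crit, 2))  # 1.96

def rejects_null(z_stat: float) -> bool:
    # The statistic falls in a rejection region if it lies beyond
    # either critical value.
    return abs(z_stat) > z_crit

print(rejects_null(2.4))  # True
print(rejects_null(1.5))  # False
```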

Another very important aspect of significance testing is the p-value. When p is less than alpha, we can call a result statistically significant, but more on the use (and misuse) of p-values next time.
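For a normally distributed test statistic, the two-tailed p-value can be computed directly from the standard normal CDF; a sketch (using the hypothetical z = 2.4 as an assumed input):

```python
from statistics import NormalDist

def two_tailed_p(z_stat: float) -> float:
    # Two-tailed p-value: the probability, assuming the null is true,
    # of a statistic at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

p = two_tailed_p(2.4)
print(round(p, 4))  # 0.0164
print(p < 0.05)     # True: significant at alpha = 0.05
```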

As an example of a common test statistic in random experiments, consider the common problem in probability courses called the “two daughter problem”. The problem goes like this: suppose a mother carrying fraternal twins wants to know the probability of having two girls. The space of possible pairs is (boy, girl), (girl, boy), (boy, boy), (girl, girl). This problem exactly mirrors a coin toss, where one asks for the probability of getting tails twice in two flips (only now boy = heads, girl = tails).

The test statistic used in both cases is the binomial random variable X with p = 0.50 (the probabilities of heads and tails are equal at 0.5) and n = 2 (2 total tosses/births). A binomial random variable counts how often a particular event occurs in a fixed number of independent trials (n), each with success probability p.

For independent events, probabilities are multiplied, so the answer to both problems is 0.5*0.5 = 0.25.
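The same answer falls out of the binomial probability mass function, sketched here with Python's standard library:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    # P(X = k): k successes in n independent trials,
    # each succeeding with probability p.
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Two girls (or two tails) out of two independent births (or tosses):
print(binomial_pmf(2, 2, 0.5))  # 0.25
```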
