8.2 Statistical Test for Population Mean (Large Sample)

In this section we will try to answer the following question: It has been known that some population mean is, say, 10, but we suspect that the mean of a population that has "undergone some treatment" is different from 10, perhaps larger than 10. We want to determine whether our suspicion is true or not.

We will follow the outline of a statistical test as described in the previous section, but adjust the four elements of the test to our situation of testing for a population mean (we will see other tests in subsequent sections).

Example 1: A new antihypertensive drug is tested. It is supposed to lower blood pressure more than other drugs. Other drugs have been found to lower the pressure by 10 mmHg on average, so we suspect (or hope) that our drug will lower blood pressure by more than 10 mmHg.

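At this stage we can set up the two competing hypotheses:
Null Hypothesis H0: the new drug is no different from the other drugs, i.e. the mean reduction is 10 mmHg
Alternative Hypothesis Ha: the new drug is different from the other drugs, i.e. the mean reduction is not equal to 10 mmHg

We would like to know whether the new drug shows results different from the other drugs, in particular whether it is better than the old drugs, i.e. does the new drug lower blood pressure more than other drugs? To collect evidence, we select a random sample of size n = 62 (say), which was found to have a sample mean of 11.3 and a sample standard deviation of 5.1.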

Since the sample mean is 11.3, which is more than the 10 achieved by other drugs, it looks like this sample supports our suspicion. But - knowing that we can never be 100% certain - we must compute a probability and associate that with our conclusion.

Assuming that the null hypothesis is true, we will try to compute the probability that a sample mean such as the one we collected could indeed occur. Leaving the details of the computation aside for now, it turns out that the associated probability is

  p = 0.044
But if that computation is correct (which it is), we have a problem: assuming that the null hypothesis is true, the probability of observing a sample mean as far from 10 as 11.3 is quite small (less than 5%). But we have observed a sample mean of 11.3, there is no denying that fact. So, something is not right: either we were extremely lucky to have hit the less than 5% case, or something else is wrong: our assumption that the null hypothesis was true. Since we don't believe in luck, we choose to reject the null hypothesis (even though a true null hypothesis would still produce a sample this extreme about 4.4% of the time).

The only practical consideration is: how do we compute the probability p?

We want to know the chance that a sample mean could be 11.3 (or more), given that we assume the population mean to be 10.0 (our null hypothesis). In other words, we want to compute:

P(sample mean > 11.3) = ...  we could do that if we only knew the distribution to use (see chapter 7.1)

But from chapter 7.2 (Central Limit Theorem) we do know the distribution of sample means: according to that theorem, the mean of the sample means is the same as the population mean, and their standard deviation is the original standard deviation divided by the square root of n (the sample size). In other words, if the original mean is m and the original standard deviation is s, then the distribution of the sample means is N(m, s / sqrt(n)). And since we assumed the null hypothesis was true, we actually know (as per assumption) the population mean, and - since nothing else is available - we use the standard deviation computed from the sample as the standard deviation we need. Therefore, we know that, in our case:
mean to use: 10.0
standard deviation to use: 5.1/sqrt(62)
Now Excel can help: it provides the function "NORMDIST" to compute probabilities such as this one. Therefore, using the Central Limit Theorem:

Normal Distribution, 1-Tail
P(sample mean > 11.3) = 1 - NORMDIST(11.3, 10.0, 5.1/sqrt(62), TRUE) = 0.022

But we are not yet done: so far we only took into account that the sample mean could be 11.3 = 10 + 1.3 or more, whereas our alternative was that the mean is not equal to 10.0. Therefore we should also take into account that the sample mean could be smaller than 10 - 1.3 = 8.7. Again, Excel lets us compute this easily:

Normal Distribution, 1-Tail
P(sample mean < 8.7) = NORMDIST(8.7, 10.0, 5.1/sqrt(62), TRUE) = 0.022

Finally, since the alternative hypothesis is that the mean is not equal to 10.0, we need to consider both probabilities together. In other words, the value of p we need is, using symmetry of the normal distribution:

Normal Distribution, 2-Tail
p = P(sample mean < 8.7) + P(sample mean > 11.3) = 2 * (1 - NORMDIST(11.3, 10.0, 5.1/sqrt(62), TRUE)) = 0.044
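The same tail probabilities can be reproduced outside Excel. The following is a minimal sketch in Python, assuming the standard library class `statistics.NormalDist` as a stand-in for NORMDIST; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Values from Example 1; the null hypothesis supplies the mean of 10.0.
null_mean = 10.0
sample_mean = 11.3
sample_sd = 5.1
n = 62

# Distribution of sample means under the null hypothesis (Central Limit Theorem).
sampling_dist = NormalDist(mu=null_mean, sigma=sample_sd / sqrt(n))

# Upper tail: counterpart of 1 - NORMDIST(11.3, 10.0, 5.1/sqrt(62), TRUE).
p_upper = 1 - sampling_dist.cdf(sample_mean)

# Lower tail: counterpart of NORMDIST(8.7, 10.0, 5.1/sqrt(62), TRUE).
mirror = null_mean - (sample_mean - null_mean)
p_lower = sampling_dist.cdf(mirror)

p_two_tail = p_upper + p_lower

print(f"upper tail: {p_upper:.4f}")
print(f"lower tail: {p_lower:.4f}")
print(f"two-tail p: {p_two_tail:.4f}")
```

The two tails are equal by symmetry, each roughly 0.022, so the two-tail value lands between 0.044 and 0.045.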

In chapter 7.1 we learned how to use the NORMDIST function to compute probabilities such as the ones we are interested in, but NORMDIST is somewhat difficult to use (we have to enter all these parameters at the right place). To simplify our calculation, we will instead use the new Excel function

NORMSDIST(z)  (notice the "S" in the middle of that function name)

which gives the probability using a standard normal distribution (mean 0, standard deviation 1), instead of the usual NORMDIST(x, m, s, TRUE) function. It is simpler to use because it requires only one input value. Both functions are related as follows:

NORMSDIST( (x - m) / (s / sqrt(n)) ) = NORMDIST(x, m, s / sqrt(n), TRUE)

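Therefore, to compute probabilities we now proceed in two steps:
first, compute the z-score z = (x - m) / (s / sqrt(n))
second, compute the probability using NORMSDIST(z)
In the above case we have:
z = (11.3 - 10.0) / (5.1 / sqrt(62)) = 2.007
p = 2 * (1 - NORMSDIST(2.007)) = 0.044

In other words, instead of entering the original mean, standard deviation, and sample size into the NORMDIST function, we first compute a z-score and then use the NORMSDIST function to compute a probability.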
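To check that the two routes really agree, here is a small sketch in Python. It assumes `statistics.NormalDist()` (mean 0, standard deviation 1) plays the role of NORMSDIST, and `NormalDist(mu, sigma)` the role of NORMDIST.

```python
from math import sqrt
from statistics import NormalDist

# Example 1 values: sample mean x, assumed mean m, sample sd s, sample size n.
x, m, s, n = 11.3, 10.0, 5.1, 62
std_err = s / sqrt(n)

# Step 1: convert the sample mean to a z-score.
z = (x - m) / std_err

# Step 2: look up the z-score in the standard normal distribution
# (counterpart of NORMSDIST(z)).
via_z = NormalDist().cdf(z)

# Direct route, counterpart of NORMDIST(x, m, s/sqrt(n), TRUE).
direct = NormalDist(mu=m, sigma=std_err).cdf(x)

print(f"z = {z:.3f}")
print(f"standard normal: {via_z:.6f}, direct: {direct:.6f}")
```

Both probabilities agree up to floating-point rounding, which is why only the z-score is needed from here on.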

Now we are ready to summarize our example into a procedure for testing for a population mean. The good news is that even if the above derivation seems complicated and perhaps confusing, the procedure we will now summarize is relatively simple and straightforward. It works fine even if you did not understand the above calculations, as the subsequent examples will illustrate.

Statistical Test for the Mean (large sample size N > 30):

Fix an error level you are comfortable with (something like 10%, 5%, or 1% is most common). Denote that "comfortable error level" by the letter "A". If no prescribed comfort level A is given, use 0.05 as a default value. Then set up the test as follows:
Null Hypothesis H0:
mean = M, i.e. The mean is a known number M
Alternative Hypothesis Ha:
mean ≠ M, i.e. mean is different from M (2-tail test)
Test Statistic:
Select a random sample of size N, compute its sample mean X and the standard deviation S. Then compute the corresponding z-score as follows:
Z = (X - M) / ( S / sqrt(N) )
Rejection Region (Conclusion):

Compute p = 2*P(z > |Z|) = 2 * (1 - NORMSDIST(ABS(Z)))

If the probability p computed in the above step is less than A (the error level you were comfortable with initially), you reject the null hypothesis H0 and accept the alternative hypothesis. Otherwise you declare your test inconclusive.
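The whole procedure fits in a few lines of code. The sketch below is one possible Python rendering, not part of the Excel workflow used in this section; the function name `z_test_mean` and its parameter names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, sample_sd, n, null_mean, alpha=0.05):
    """Two-tail large-sample test for a population mean.

    Returns (z, p, reject), where reject is True when p < alpha.
    """
    z = (sample_mean - null_mean) / (sample_sd / sqrt(n))
    # p = 2 * P(z > |Z|), counterpart of 2 * (1 - NORMSDIST(ABS(Z))).
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p, p < alpha


# Example 1: n = 62, sample mean 11.3, standard deviation 5.1, M = 10.
z, p, reject = z_test_mean(11.3, 5.1, 62, 10.0)
print(f"Z = {z:.3f}, p = {p:.4f}")
print("reject H0" if reject else "inconclusive")
```

Passing a different `alpha` (for example 0.01) changes only the final comparison, not Z or p.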


Technically speaking, this particular test works under the following assumptions:
the sample is selected at random
the sample size is large (N > 30)

The probability computed in the last step of our test and used to determine the rejection region gives the Level of Significance of the test. The smaller it is, the stronger the evidence against the null hypothesis. Recall that there are two types of error that you could commit:
Type I error: rejecting the null hypothesis even though it is true
Type II error: failing to reject the null hypothesis even though it is false

Please note that even if we reject a null hypothesis (and hence accept the alternative) it is still possible that the null hypothesis is true after all. However, a true null hypothesis would produce a sample this extreme only with probability p, which is small whenever we reject (smaller than our pre-determined comfort level A).
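One way to see what the comfort level A means is to simulate many samples from a population for which the null hypothesis is true and count how often the test rejects anyway. The sketch below is purely illustrative: the normally distributed population, the seed, and the number of trials are all assumptions made for this demonstration and are not taken from the examples.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

ALPHA = 0.05                    # comfort level A
TRIALS = 2000                   # number of simulated samples (assumed)
N = 62                          # sample size, as in Example 1
TRUE_MEAN, TRUE_SD = 10.0, 5.1  # H0 (mean = 10) is true by construction

rejections = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    z = (mean(sample) - TRUE_MEAN) / (stdev(sample) / sqrt(N))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    if p < ALPHA:
        rejections += 1  # rejected a true null hypothesis

false_rejection_rate = rejections / TRIALS
print(f"rejected a true H0 in {false_rejection_rate:.1%} of trials")
```

The fraction of false rejections should come out close to A, which is the sense in which A limits the chance of wrongly rejecting a true null hypothesis.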

Example 2: Bottles of ketchup are filled automatically by a machine which must be adjusted periodically to increase or decrease the average content per bottle. Each bottle is supposed to contain 18 oz. It is important to detect an average content significantly above or below 18 oz so that the machine can be adjusted: too much ketchup per bottle would be unprofitable, while too little would be a poor business practice and open the company up to lawsuits about invalid labeling.

We select a random sample of 32 bottles filled by the machine and compute their average weight to be 18.34 oz with a standard deviation of 0.7334 oz. Should we adjust the machine? Use a comfort level of 5%.

We can see right away that the average weight of our sample, being 18.34 oz, is indeed different from what it's supposed to be (18 oz), but the question is whether the difference is statistically significant. In our particular case we want to know whether the machine is "off" while allowing at most a 5% chance of an error in our conclusion. After all, if we did conclude the difference is significant we would have to adjust the ketchup machine, which is an expensive procedure that we don't want to perform unnecessarily.

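Our statistical test for the mean will provide the answer:
Null Hypothesis H0: mean = 18
Alternative Hypothesis Ha: mean ≠ 18
Test Statistic: N = 32, sample mean 18.34, standard deviation 0.7334, so Z = (18.34 - 18) / (0.7334 / sqrt(32)) = 2.62
Rejection Region: p = 2 * (1 - NORMSDIST(2.62)), which is approximately 0.009 and hence less than 0.05, so we reject the null hypothesis

In other words, we conclude that the difference was statistically significant and that therefore the alternative hypothesis is (likely) true. Therefore, we will adjust the ketchup filling machine. Note that while we feel comfortable rejecting the null hypothesis (and adjusting the machine), a correctly adjusted machine would still produce a sample this far off less than 1% of the time.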
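The ketchup numbers can be checked with a short Python sketch, assuming `statistics.NormalDist` in place of NORMSDIST; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Example 2: 32 bottles, sample mean 18.34 oz, standard deviation 0.7334 oz.
n, x_bar, s = 32, 18.34, 0.7334
target = 18.0  # H0: mean = 18
alpha = 0.05   # comfort level

z = (x_bar - target) / (s / sqrt(n))
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Z = {z:.2f}, p = {p:.4f}")
print("adjust the machine" if p < alpha else "inconclusive")
```

Because p falls well below 0.05, the sketch reaches the same decision as the text: adjust the machine.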

Example 3: In a nutrition study, 48 calves were fed "factor X" exclusively for six weeks. The weight gain was recorded for each calf, yielding a sample mean of 22.4 pounds and a standard deviation of 11.5 pounds. Other nutritional supplements are known to cause an average weight gain of about 20 lb in six weeks. Can we conclude from this evidence that, in general, a six-week diet of "factor X" will yield an average weight gain of 20 pounds or more at the "1% level of significance"? In other words, is "factor X" significantly better than standard supplements?

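Following our procedure we have:
Null Hypothesis H0: mean = 20
Alternative Hypothesis Ha: mean ≠ 20
Test Statistic: N = 48, sample mean 22.4, standard deviation 11.5, so Z = (22.4 - 20) / (11.5 / sqrt(48)) = 1.446
Rejection Region: p = 2 * (1 - NORMSDIST(1.446)), which is approximately 0.148 and hence larger than 0.01, so the test is inconclusive

Our test, being inconclusive, is not really satisfying: all our work was for nothing, and we are unwilling to give a definite answer. In particular, while we are not ready to reject the null hypothesis, we are also not accepting it - we simply say there's insufficient evidence. This is similar to a standard trial in front of a judge or jury: someone being found not guilty does not necessarily mean he/she is really innocent. It just means there was insufficient evidence for a conviction.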
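For the calf data, a short Python sketch (again assuming `statistics.NormalDist` as the counterpart of NORMSDIST) shows how the same p compares against each of the common comfort levels:

```python
from math import sqrt
from statistics import NormalDist

# Example 3: 48 calves, mean gain 22.4 lb, standard deviation 11.5 lb.
n, x_bar, s = 48, 22.4, 11.5
reference = 20.0  # H0: mean = 20

z = (x_bar - reference) / (s / sqrt(n))
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"Z = {z:.3f}, p = {p:.3f}")

# Compare p against the usual comfort levels.
for alpha in (0.10, 0.05, 0.01):
    verdict = "reject H0" if p < alpha else "inconclusive"
    print(f"A = {alpha:.2f}: {verdict}")
```

Since p exceeds even 0.10, the test stays inconclusive at all three levels, not just at the 1% level asked for in the example.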

Example 4: A group of secondary education student teachers were given 2 1/2 days of training in interpersonal communication group work. The effect of such a training session on the dogmatic nature of the student teachers was measured by the difference of scores on the "Rokeach Dogmatism test" given before and after the training session. The difference "post minus pre score" was recorded as follows:

-16, -5, 4, 19, -40, -16, -29, 15, -2, 0, 5, -23, -3, 16, -8, 9, -14, -33, -64, -33
Can we conclude from this evidence that the training session makes student teachers less dogmatic (at the 5% level of significance)?

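We can easily compute (using Excel) that the sample mean is -10.9 and the standard deviation is 21.33. The sample size is N = 20. Our hypothesis testing procedure is as follows:
Null Hypothesis H0: mean = 0, i.e. the training makes no difference
Alternative Hypothesis Ha: mean ≠ 0
Test Statistic: Z = (-10.9 - 0) / (21.33 / sqrt(20)) = -2.285
Rejection Region: p = 2 * (1 - NORMSDIST(2.285)) = 0.022, which is less than 0.05, so we reject the null hypothesis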

Thus, since we reject the null hypothesis we accept the alternative, or in other words we do believe that the training session changes how dogmatic the student teachers are, and since the mean did go down, that it makes them less dogmatic. If the training made no difference, a sample this extreme would occur less than 5% of the time (about 2.2% to be precise). For curious minds: for the 1-tail test p would have worked out to be p = 1.1% (half the 2-tail value), which would also have led to a rejection of the null hypothesis at the 5% level.
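Starting from the raw scores, the summary statistics and the test can be reproduced in one go. The following Python sketch assumes `statistics.mean` and `statistics.stdev` (the sample standard deviation, with N - 1 in the denominator) as counterparts of the Excel calculations; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# "Post minus pre" score differences from Example 4.
diffs = [-16, -5, 4, 19, -40, -16, -29, 15, -2, 0,
         5, -23, -3, 16, -8, 9, -14, -33, -64, -33]

n = len(diffs)
x_bar = mean(diffs)  # sample mean
s = stdev(diffs)     # sample standard deviation

# H0: mean difference = 0 (the training makes no difference).
z = (x_bar - 0) / (s / sqrt(n))
p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
p_one_tail = p_two_tail / 2  # the 1-tail value is half the 2-tail value

print(f"N = {n}, mean = {x_bar:.1f}, sd = {s:.2f}")
print(f"Z = {z:.3f}, 2-tail p = {p_two_tail:.3f}, 1-tail p = {p_one_tail:.3f}")
```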

Please note that technically we were not supposed to use our procedure, since the sample size N = 20 is less than 30. Therefore, while we can still reject the null hypothesis, the true error in making that statement is somewhat larger than 2.2%. So, you ask: "what are we supposed to do for small sample sizes N < 30"? Funny you should ask - there's always another section ...