Testing Hypotheses Involving the Variances of Two Populations
Tests of hypotheses involving the variances of two populations are not as common as tests of hypotheses involving the means or proportions of two populations. Some types of situations in which variances of two populations need to be compared are:
when you need to compare the degrees of uniformity of two populations, that is, when the difference of their means or proportions is not the main issue, but the degree of variation is. If two drugs, two food additives, or two processes lead to similar mean effects, but one produces a more uniform effect than the other, the more uniformly acting alternative may be preferred. For example, in the food industry, you may have two varieties of, say, potato plant, both of which produce essentially equivalent mean yields in tons per hectare. However, if most of the potatoes produced by one variety are nearly the same size, weight, and shape, whereas the other variety produces many small potatoes and many large potatoes, then a processor might prefer the more uniformly sized variety.
Recall that when we wished to compare the means of two populations with one or both of the samples having sizes less than 30, valid application of the t-distribution required that the two populations have the same variance: σ₁² = σ₂². One way to verify that the data is consistent with this requirement would be to carry out a hypothesis test based on the null hypothesis H0: σ₁² = σ₂².
So, the null hypothesis we will be testing here is
H0: σ₁² = σ₂²    (TPVARHT - 1)
where σ₁² is the variance of population #1
and σ₂² is the variance of population #2.
The test exploits the fact that if s₁² is the variance of a random sample of size n₁ drawn from population 1, which is normally distributed, and s₂² is the variance of a random sample of size n₂ drawn from population 2, which is also normally distributed, then the random variable
F = (s₁²/σ₁²) / (s₂²/σ₂²)    (TPVARHT - 2)
has the so-called F-distribution with
numerator degrees of freedom = ν₁ = n₁ - 1
denominator degrees of freedom = ν₂ = n₂ - 1
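These quantities are easy to compute directly from raw data. Here is a minimal Python sketch (the function name and the illustrative samples are our own, not from the course data sets):

```python
import statistics

def f_statistic(sample1, sample2):
    """Return F = s1^2 / s2^2 together with nu1 = n1 - 1 and nu2 = n2 - 1."""
    s1_sq = statistics.variance(sample1)   # sample variance of sample #1
    s2_sq = statistics.variance(sample2)   # sample variance of sample #2
    return s1_sq / s2_sq, len(sample1) - 1, len(sample2) - 1

# Illustrative, made-up measurements:
a = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4]
b = [4.0, 4.1, 4.3, 3.9, 4.2, 4.0, 4.1]
F, nu1, nu2 = f_statistic(a, b)
print(F, nu1, nu2)
```

Note that statistics.variance() computes the sample variance (divisor n - 1), which is what the F-test requires.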
Note that if H0 is true, then σ₁² = σ₂² and so

F = s₁²/s₂²    (TPVARHT - 3)
Further, to the extent that the data contradicts H0, we expect the ratio (TPVARHT - 3) to differ from 1 (since if σ₁² = σ₂², perfectly representative samples would give s₁² = s₂² as well, and so F = 1 in that case). Thus, the hypothesis testing rules become:
reject H0 at a level of significance α if:

(Here F(α; ν₁, ν₂) denotes the critical value of the F random variable with right-tail area α, numerator degrees of freedom ν₁, and denominator degrees of freedom ν₂.)

H0: σ₁² = σ₂²  vs  HA: σ₁² > σ₂²
    reject if F > F(α; ν₁, ν₂)    (single-tailed rejection region)
    p-value = Pr(F > test statistic value)

H0: σ₁² = σ₂²  vs  HA: σ₁² < σ₂²
    reject if F < F(1-α; ν₁, ν₂)    (single-tailed rejection region)
    p-value = Pr(F < test statistic value)

H0: σ₁² = σ₂²  vs  HA: σ₁² ≠ σ₂²
    reject if F > F(α/2; ν₁, ν₂) or F < F(1-α/2; ν₁, ν₂)    (two-tailed rejection region)
    p-value = 2 Pr(F > test statistic value) or 2 Pr(F < test statistic value), whichever is smaller
We will call the rules contained in this table "The F-test."
Note that the test statistic, F, must always be a positive number because s₁² and s₂² are both always positive numbers. Details of the properties and use of the F-distribution, and tables of critical values of the standard F random variable for right-tail areas of 0.05 and 0.01, are given in the short document following this one. Because of the two degree of freedom numbers associated with the F random variable, each page of a printed table can give critical values for just one tail area and a selection of values of each of the degree of freedom numbers. For calculations involving tail areas and degree of freedom numbers other than those represented by the abbreviated (though quite conventional) tables in the next document, you can use the FDIST() and FINV() functions supplied with Excel/97 or equivalent functionality in other computer applications.
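If you have neither Excel nor printed tables at hand, the right-tail area that FDIST() reports can be computed in pure Python through the standard identity relating the F-distribution to the regularized incomplete beta function. The sketch below is our own illustration (the function names are not from the course materials); it uses a conventional continued-fraction evaluation of the incomplete beta function:

```python
from math import lgamma, log, exp

def _betacf(a, b, x):
    """Continued-fraction evaluation used by the incomplete beta function."""
    EPS, FPMIN = 3e-12, 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < FPMIN:
        d = FPMIN
    d = 1.0 / d
    h = d
    for m in range(1, 300):
        m2 = 2 * m
        # even step of the continued fraction
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < FPMIN: d = FPMIN
        c = 1.0 + aa / c
        if abs(c) < FPMIN: c = FPMIN
        d = 1.0 / d
        h *= d * c
        # odd step of the continued fraction
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < FPMIN: d = FPMIN
        c = 1.0 + aa / c
        if abs(c) < FPMIN: c = FPMIN
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < EPS:
            break
    return h

def betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0: return 0.0
    if x >= 1.0: return 1.0
    bt = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
             + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b

def f_right_tail(f, nu1, nu2):
    """Pr(F > f) for the F distribution -- what Excel's FDIST(f, nu1, nu2) reports."""
    return betai(nu2 / 2.0, nu1 / 2.0, nu2 / (nu2 + nu1 * f))

# With equal degree of freedom numbers, Pr(F > 1) is exactly 0.5:
print(f_right_tail(1.0, 14, 14))
# ~0.0568 -- the notes later quote 0.05677 from Excel's FDIST for this case:
print(f_right_tail(2.397, 14, 14))
```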
The rules in the table above reflect the fact that the F-distribution is not symmetric. As a result, some of the rejection criteria stated above (for the left-tailed and two-tailed tests) involve right-tail areas which would normally be quite large numbers, such as 0.95, whereas the standard tables typically give critical values of F only for right-tail areas of 0.05 and 0.01. In such a situation, you can do one of three things:
(i) Note the property of the F-distribution that

F(1-α; ν₁, ν₂) = 1 / F(α; ν₂, ν₁)    (TPVARHT - 4)
You can use this formula to give critical values of the F random variable for right-tail areas of 0.95 using tables of critical values for right-tail areas of 0.05, for example.
(ii) You can always rearrange the hypotheses for one-tailed tests so that they are right-tailed tests.
H0: σ₁² = σ₂²
HA: σ₁² < σ₂²
is equivalent to
H0: σ₂² = σ₁²
HA: σ₂² > σ₁²
(iii) Use a computer application such as Excel/97 to generate the precise critical values that you need. Such computer-based functions will work for any valid right-tail areas and degree of freedom numbers.
Strategies (i) and (ii) really amount to the same thing here. Flipping the hypotheses as indicated in (ii) will result in a standardized test statistic, F, given by (TPVARHT - 3), which is the reciprocal of the former value and has the two degree of freedom numbers swapped.
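Property (TPVARHT - 4) can also be checked numerically by simulation. An F random variable is a ratio of two independent chi-square random variables, each divided by its degrees of freedom, and a chi-square variable is a special case of the gamma distribution available in Python's standard library. The sketch below (degree of freedom numbers chosen arbitrarily for illustration) compares empirical percentiles:

```python
import random

random.seed(12345)

def f_variate(nu1, nu2):
    """One draw from the F distribution: ratio of scaled chi-square variates."""
    chi1 = random.gammavariate(nu1 / 2.0, 2.0)  # chi-square with nu1 d.f.
    chi2 = random.gammavariate(nu2 / 2.0, 2.0)  # chi-square with nu2 d.f.
    return (chi1 / nu1) / (chi2 / nu2)

N = 200_000
draws_a = sorted(f_variate(10, 20) for _ in range(N))
draws_b = sorted(f_variate(20, 10) for _ in range(N))

# F(0.95; 10, 20) has right-tail area 0.95, i.e. it is the 5th percentile:
f_95_a = draws_a[int(0.05 * N)]
# F(0.05; 20, 10) has right-tail area 0.05, i.e. it is the 95th percentile:
f_05_b = draws_b[int(0.95 * N)]

# (TPVARHT - 4) says F(0.95; 10, 20) = 1 / F(0.05; 20, 10):
print(f_95_a, 1.0 / f_05_b)
```

The two printed numbers should agree to within Monte Carlo error.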
Before illustrating these formulas with some examples, we mention one caution voiced by many authors. The accuracy of the F-test described above seems to be quite sensitive to deviations of the populations from being normally distributed. This means that checking whether the data is consistent with the populations being normally distributed should have a fairly high priority here.
In fact, as Devore and others caution, special care must be exercised when this F-test is being used to assess whether the condition of equal variances is met prior to performing a t-test on the difference of two population means. There are several problems here. First, consistency with a normally distributed population is more difficult to assess reliably when only small samples are available. Secondly, the t-test is known to be quite insensitive to moderate departures from normality in the populations. As a result, it may happen that an inappropriately applied F-test will erroneously indicate that σ₁² ≠ σ₂² when the t-test would have worked fine.
Example 1: The suspicion is voiced that apples left longer on the tree become less uniform in size. Is this suspicion supported by the Jonagold apple data for the first two harvest dates? (Refer to the standard data sets distributed earlier in the course.)
The variance of the apple weights is a measure of the uniformity (or lack of uniformity) of the population of apple weights. If the weights of the population of apples harvested on the first date have a variance of σ₁², and the weights of the apples harvested on the second date have a variance of σ₂², then rejecting H0 in
H0: σ₂² = σ₁²
HA: σ₂² > σ₁²
amounts to supporting the conclusion that σ₂² > σ₁², or that population 2 has greater variability than does population 1.
The raw data is given in the standard data sets document. From it, we get for the first harvest date that
n₁ = 60    x̄₁ = 219.73 g    s₁ = 42.88 g
and for the second harvest date
n₂ = 55    x̄₂ = 257.27 g    s₂ = 52.35 g
(Actually, we don't really need the values of the sample means here.) Since the F-test requires that the two populations be normally distributed, we prepare normal probability plots for the two sets of data:

[Figure: normal probability plots of the apple weights for the two harvest dates.]
While the points in these plots are not on the straightest possible lines, neither plot shows clear signs of the data deviating from normality in a serious way, so we are reasonably safe in using the F-test here.
The standardized test statistic is

F = s₂²/s₁² = (52.35)²/(42.88)² = 1.490
We have νN = 55 - 1 = 54 and νD = 60 - 1 = 59, and using α = 0.05, the best we can do from the printed tables of critical values of the F random variable is

F(0.05; 54, 59) ≈ F(0.05; 40, 50) = 1.63

(Since the tables do not contain entries for νN = 54 and νD = 59, we took the entry for the closest more rigorous values, νN = 40 and νD = 50. We are using the symbols νN and νD, respectively, for the numerator and denominator degrees of freedom to avoid the confusion that would result from numerical subscripts, given that we've reversed the roles of populations 1 and 2 in the hypotheses of this example relative to the template rules in the table earlier.)
So, we can reject H0 here at a level of significance of 0.05 if the calculated test statistic is greater than 1.63. But 1.490 is not greater than 1.63, and so we cannot reject H0. (In fact, using the FDIST() function in Excel/97, we find that the p-value for this hypothesis test is 0.0676, which, while not inordinately large, is still a bit larger than the 0.05 that most practitioners would consider the largest allowable p-value for rejecting a null hypothesis.) Thus, the best we can say is that the data presented is not strong evidence that apples harvested at a later date are less uniform than apples harvested at the earlier date.
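The arithmetic of Example 1 is easy to replicate; here is a quick Python check (the variable names are ours):

```python
# Summary statistics quoted above for the Jonagold apple data
s1, n1 = 42.88, 60    # first harvest date: sample std. dev. (g) and sample size
s2, n2 = 52.35, 55    # second harvest date

F = s2**2 / s1**2             # test statistic for HA: sigma2^2 > sigma1^2
nu_N, nu_D = n2 - 1, n1 - 1   # numerator and denominator degrees of freedom

F_crit = 1.63                 # F(0.05; 40, 50), the conservative table stand-in
print(round(F, 3), nu_N, nu_D, F > F_crit)   # 1.49 54 59 False
```

Since F does not exceed the critical value, H0 is not rejected, matching the conclusion above.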
Example 2: Refer to the standard data sets giving percentages of various amino acids in specimens of natural and artificial shark fin. A technologist wishes to test hypotheses to determine if the mean percentage of the amino acid alanine differs between the two shark fin preparations. However, since the sample sizes are just 15 in both cases, and therefore small, this will require the assumption that both populations have equal variances. What does the F-test say about the validity of that assumption here?
We won't repeat the raw data here -- it's available in the document containing the standard data sets (we are referring specifically to the data sets labeled SharkfinNatAla and SharkfinArtAla). We are really being asked to test the hypotheses:
H0: σnat² = σart²
HA: σnat² ≠ σart²
From the data, we have
nnat = 15 snat = 2.423
nart = 15 sart = 1.565
(The s's are in units of percent.) Further, the normal probability plots for each set of data are:

[Figure: normal probability plots of the alanine percentages for the natural and artificial shark fin data.]
These appear to be consistent with the populations being approximately normally distributed, so application of the F-test to the above hypotheses should be valid.
The standardized test statistic is calculated to be

F = snat²/sart² = (2.423)²/(1.565)² = 2.397
Since this is a two-tailed test, and the standard tables of critical values of the F-distribution have values for tail areas of 0.05 and 0.01 only, we will be able to use those tables to test these hypotheses at a level of significance of either 0.10 or 0.02 only (that is, twice the single-tail areas represented in the tables). We choose to use α = 0.10 here.
So, H0 can be rejected at α = 0.10 if either

F > F(0.05; 14, 14) ≈ F(0.05; 12, 14) = 2.53  or  F < F(0.95; 14, 14) ≈ 1/2.53 = 0.395
(Again, we've had to go to the closest more rigorous entry in the table, because it doesn't cover the precise degrees of freedom required in this problem.) Since 2.397 is not greater than 2.53, nor is 2.397 less than 0.395, we cannot reject H0 at a level of significance of 0.10. Thus, we can conclude that the data is not inconsistent with σnat² = σart², and so this condition of validity for the t-test appears to be met. You shouldn't be too concerned that 2.397 is quite close to 2.53, because, after all, we are working with quite a large value of α here. In fact, using the FDIST() function in Excel/97, we get the p-value for this test as
p-value = 2 Pr(F > 2.397; νN = 14, νD = 14)
        = 2 (0.05677) = 0.1135
This is too large a value to seriously consider rejecting H0 here.
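A quick Python check of Example 2's two-tailed decision (the table value 2.53 is the conservative stand-in for F(0.05; 14, 14) quoted above):

```python
s_nat, s_art = 2.423, 1.565   # sample standard deviations (percent), n = 15 each

F = s_nat**2 / s_art**2       # two-tailed test statistic
upper = 2.53                  # conservative table value for F(0.05; 14, 14)
lower = 1 / upper             # ~ F(0.95; 14, 14) by the reciprocal property (~0.395)

reject = (F > upper) or (F < lower)
print(round(F, 3), reject)    # 2.397 False
```

F falls inside (0.395, 2.53), so H0 is not rejected at α = 0.10, matching the conclusion above.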
Example 3: A food technologist frequently carries out experiments in which participants are asked to rate various qualities of potential foods on numerical scales -- for example, from 0 (for awful) to 10 (for excellent). She suspects that male participants tend to give more uniform responses than female participants. Before spending much time speculating on what this might mean about the relationship between sensory perception and gender, she decides that first she must devise an experiment to see if she can find evidence that the effect is real.
This is what she does (in our little fairy story…). She prepares a set of identical food specimens and asks 61 randomly selected men and 61 randomly selected women to rate the food specimens on a scale of 0 to 10. The resulting data is:
Men's ratings:
6 2 7 3 3 4 6 3 5 4 3 6 2 5 3 6 2 5 5 2 5
4 4 8 6 4 2 7 4 2 3 3 7 4 4 6 3 5 2 4 5 2
4 3 2 3 3 1 3 2 4 2 5 3 6 2 4 3 4 4 5

Women's ratings:
3 6 6 2 9 7 7 5 10 2 4 2 6 0 6 9 2 6 4 6 4
2 4 10 5 9 3 1 7 2 3 2 7 6 6 3 6 2 6 0 8 4
6 6 1 8 3 2 6 7 5 8 4 2 4 6 4 5 0 6 4
The standard deviation of the men's ratings is 1.584 and the standard deviation of the women's ratings is 2.523. (The mean ratings are somewhat different between the two samples as well, though that fact is not directly relevant here. You will notice that none of the men gave a rating of 0 or 10, the most extreme ratings possible, whereas several women gave those ratings. In a world in which these numbers were real data from an actual experiment, this might be suggestive of an effect. However, we wouldn't want to base a conclusion only on the occurrence of the most extreme responses in the samples, hence the need to do a more systematic hypothesis test here.)
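The quoted standard deviations can be verified directly from the listed ratings using Python's standard statistics module (here the first block of ratings is taken to be the men's and the second the women's, as the summary statistics indicate):

```python
import statistics

men = [6,2,7,3,3,4,6,3,5,4,3,6,2,5,3,6,2,5,5,2,5,
       4,4,8,6,4,2,7,4,2,3,3,7,4,4,6,3,5,2,4,5,2,
       4,3,2,3,3,1,3,2,4,2,5,3,6,2,4,3,4,4,5]
women = [3,6,6,2,9,7,7,5,10,2,4,2,6,0,6,9,2,6,4,6,4,
         2,4,10,5,9,3,1,7,2,3,2,7,6,6,3,6,2,6,0,8,4,
         6,6,1,8,3,2,6,7,5,8,4,2,4,6,4,5,0,6,4]

print(len(men), len(women))              # 61 61 -- 61 participants in each group
print(round(statistics.stdev(men), 3))   # 1.584
print(round(statistics.stdev(women), 3)) # 2.523
```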
A test of hypotheses involving the variance of these responses should indicate whether the responses of male participants are more uniform than the responses of female participants. Carry out the appropriate hypothesis test and comment on the result.
It appears we need to test the hypotheses:
H0: σwomen² = σmen²
HA: σwomen² > σmen²

We can use the F-test if we have some assurance that the responses are consistent with normally distributed populations. Since each response is a single value chosen from a set of only eleven distinct whole-number values, it's probably worthwhile to examine this assumption a bit.
It's quite easy to construct frequency histograms for the responses:

[Figure: frequency histograms of the men's and the women's ratings.]
The responses by the men form a tighter distribution, but seem to show a bit of a right skew. The responses by the women don't appear to be particularly skewed, but form a rather jagged approximation to the desired bell shape. The corresponding normal probability plots are:

[Figure: normal probability plots of the men's and the women's ratings.]
You can see some evidence of the skewing to the right in the normal probability plot of the men's responses: there is a perceptible curvature to the path through the rough centers of each line of points. However, this is not a dominating feature, and so we'll proceed as if approximate normality has been demonstrated for the men's responses. The normal probability plot of the women's responses raises no concerns.
So, now apply the F-test. The value of the standardized test statistic is

F = swomen²/smen² = (2.523)²/(1.584)² = 2.537
We may reject H0 at a level of significance of 0.05 if
F > F(0.05; 60, 60) = 1.53

(The decision to include 61 participants in each sample was made to get ν₁ = 60 and ν₂ = 60, for which our F-tables have entries.)
But, 2.537 is greater than 1.53, and so we can reject H0. The data supports the claim that the variance of the men's responses is less than the variance of the women's responses. This implies that the men's responses are more uniform than those of the women.
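Replicating Example 3's test in Python, using the rounded standard deviations quoted above:

```python
s_men, s_women = 1.584, 2.523   # quoted sample standard deviations, n = 61 each

F = s_women**2 / s_men**2       # test statistic for HA: women's variance > men's
F_crit = 1.53                   # F(0.05; 60, 60) from the printed table

print(round(F, 3), F > F_crit)  # 2.537 True
```

Since F exceeds the critical value, H0 is rejected at the 0.05 level, as concluded above.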
(NOTE: As with all examples in these course notes, this data is simulated to achieve certain pedagogical goals and should not be used to draw actual conclusions about the real world. For that, you would need to perform this experiment yourself using real world participants and materials!)
Confidence Interval Estimates Involving the Variances of Two Populations

Since, by definition, there is a probability of 100(1 - α)% that

F(1-α/2; ν₁, ν₂) < F < F(α/2; ν₁, ν₂)

or, using (TPVARHT - 2), that

F(1-α/2; ν₁, ν₂) < (s₁²/σ₁²) / (s₂²/σ₂²) < F(α/2; ν₁, ν₂)    (TPVARHT - 5)
we have a way to come up with 100(1 - α)% confidence interval estimates involving the two population variances. Taking the left inequality in the preceding relation,

F(1-α/2; ν₁, ν₂) < (s₁²/σ₁²) / (s₂²/σ₂²)

and rearranging the right-hand side somewhat,

F(1-α/2; ν₁, ν₂) < (s₁²/s₂²) (σ₂²/σ₁²)

you can see that we can write

σ₁²/σ₂² < (s₁²/s₂²) · (1 / F(1-α/2; ν₁, ν₂)) = (s₁²/s₂²) · F(α/2; ν₂, ν₁)

where the last step uses property (TPVARHT - 4).
Similarly, starting with the right-hand inequality of (TPVARHT - 5), we eventually get

(s₁²/s₂²) · (1 / F(α/2; ν₁, ν₂)) < σ₁²/σ₂²
Putting these last two inequalities together into a single interval-like expression then gives the final result

(s₁²/s₂²) · (1 / F(α/2; ν₁, ν₂)) < σ₁²/σ₂² < (s₁²/s₂²) · F(α/2; ν₂, ν₁)    @ 100(1 - α)%    (TPVARHT - 6)
This is a formula for a confidence interval estimate of the ratio of the two population variances.
Example 4: Just a very brief example of how formula (TPVARHT - 6) can be used. Consider the data in Example 3, just above. Using the existing F-tables, we can construct confidence interval estimates with either a 90% confidence level (single-tail area of 0.05) or a 98% confidence level (single-tail area of 0.01). Here we will choose the 90% confidence interval estimate.
For this, we need F(0.05; 60, 60) = 1.53 and, for the swapped degree of freedom numbers, F(0.05; 60, 60) = 1.53 again (the same value here, since the two samples are the same size).
Thus, with swomen = 2.523 and smen = 1.584, we get from (TPVARHT - 6) the result

(2.523)²/(1.584)² × (1/1.53) < σwomen²/σmen² < (2.523)²/(1.584)² × 1.53

that is,

1.659 < σwomen²/σmen² < 3.882    @ 90%
Thus, at a level of confidence of 90%, we conclude that the variance of the population of women's responses is between 1.659 and 3.882 times as large as the variance of the men's responses.
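The interval of Example 4 can be reproduced in a few lines of Python, using the rounded summary statistics quoted above (the limits agree with the quoted interval to within rounding):

```python
s_women, s_men = 2.523, 1.584   # quoted sample standard deviations, n = 61 each
f_05 = 1.53                     # F(0.05; 60, 60) from the printed table

ratio = s_women**2 / s_men**2   # point estimate of the variance ratio
lower = ratio / f_05            # lower 90% confidence limit
upper = ratio * f_05            # upper 90% confidence limit

print(round(lower, 2), round(upper, 2))   # 1.66 3.88
```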