The Real Problem: An Initial Example
Here we summarize some of the observations and discussion that arise out of an experiment that you will actually perform in class. As you read through this document, think back to the things you actually did and observed.
1. The Question
Let us suppose, for the sake of a story, that we wish to find out something about the number of eggs that are eaten each week by a person in the Vancouver area. There could be a number of reasons why this is of interest:
your employer, a major egg producer, is thinking about creating an advertising campaign and either wishes to know if there's the possibility of convincing people to eat more eggs (presumably if every person already eats hundreds of eggs each week, it may not be easy to persuade them to eat even more eggs!) or would like to couple this study with one to be done after the advertising is complete to see if it has had any effect in increasing egg consumption.
maybe you work for a public health researcher who is trying to relate some particular health effect to the level of egg consumption in the community.
One might be able to think of many reasons why this experiment could be a useful one to do. As the course progresses, you will learn that how the experiment is best structured depends to a great extent on the initial question to be answered. For now, we ignore the issue of experimental design almost completely in order to focus on a more fundamental issue -- what is the basic problem that the field of statistics was developed to handle.
Keep in mind, however, that every statistical experiment must begin with a clear statement of the "question" to be answered, including, but not limited to:
a clear description of the precise population being studied (here, we've said "people in the Vancouver area", but notice that this is rather vague: What specific geographical region are we speaking of? Is Burnaby part of the Vancouver area? Is Squamish in the Vancouver area? Are we interested in all people or just adults? What do we consider an adult to be? And so on…)
what precise measure or characteristic of the population do we need to determine? (here, we've just said "find out something", which is about as vague as you can get. Perhaps we need to determine the "average" -- whatever that means -- number of eggs each person eats in a week, but there are other possible "somethings" that will come to mind once we've gotten a bit farther into the subject of statistics.)
how will the sampling process be carried out? (here, we're already assuming that the number of people in the Vancouver area is so large, and the task of speaking to all of them could be so costly that we will have to rely on information in just a sampling of those people. At issue here is how many people should be questioned, how should those people be selected, and so on. Answers to some of these questions may depend on the answers to the previous two, and on issues such as how precise we'd like the final results to be.)
The Experimental Setup
Let us suppose that to begin with, we decide we will survey exactly 100 randomly selected people from the Vancouver area (having settled the issue of what we mean precisely by "people of the Vancouver area"). As already discussed, to get a really random sample of a population, we must ensure that every member of the population has an equal likelihood of being selected. So,
if we simply went from house to house on Willingdon Avenue until we found 100 people at home willing to answer our question, we would not have a really random sample of our target population. Why? Because anyone who doesn't live on Willingdon Avenue has no chance of being included in our sample. Also, because anyone who is not at home on Willingdon Avenue on the day we do our survey has no chance of being included. Also, because we'll only be getting answers from those people who are willing to open their doors to a stranger and relate some relatively personal information to them. If we use this method, we are really sampling the target population of people who live on Willingdon Avenue in Burnaby who are available and willing to discuss their egg consumption with us. This is very likely not at all close to the target population we originally had in mind.
perhaps we could flip open the telephone directory for the area to a random page, throw a dart at that page from some distance, dial the number thus selected, and ask our question. This would certainly open the possibilities up to a larger number of people in the area. We might worry whether each page in the telephone directory has the same likelihood of being chosen (and not just those near the middle of the book -- but there are ways to deal with this). We might wonder whether we should speak only to the person who answers the telephone, or ask to speak to a more "random" member of the household reached. We would need to decide when to make these phone calls, because certain members of the population are more likely to be accessible at some times of the day than at others. And so on … Although almost everyone in our society now belongs to a household with telephone access (some with several telephone lines -- will this give those households a greater weight in our experiment?), we should realize that this approach really restricts our sampling to the population of individuals in households with telephone numbers listed in the telephone directory, which is likely not quite the same as the target population originally identified.
There are many other approaches one might suggest. The goal is to be able to come as close as possible within reasonable time and costs, to sampling the population we are interested in. Some of the major statistical disasters in recent history have resulted from carrying out the sampling so that the sampled population was quite different from the intended target population.
Things are much simpler for us in class. To simulate a truly random sampling of a population, we proceed as follows. Each group is given a largish brown paper bag which contains several thousand small squares of paper, each with a number. Each piece of paper in a bag represents one member of a population, and the number on the piece of paper tells us how many eggs that person ate in the last week. By drawing these pieces of paper from the bag one at a time, recording their value, and returning them to the bag before mixing the contents well and drawing the next piece of paper, we can simulate something very close to a random sampling of what amounts to an infinite population.
So, although the sampling issues raised above are very important ones to deal with, our classroom experiment sidesteps them entirely.
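The bag procedure described above amounts to sampling with replacement. The following is a minimal Python sketch of that idea, not the classroom setup itself: the contents of `bag` are invented for illustration, and the helper name `draw_sample` is hypothetical.

```python
import random

# Hypothetical bag: the values and counts are invented for illustration;
# they are NOT the contents of the classroom bags.
bag = [0] * 300 + [1] * 900 + [2] * 1200 + [4] * 600

def draw_sample(bag, n, rng):
    """Draw n slips one at a time, returning each slip before the next draw."""
    # rng.choice leaves the bag unchanged, which models "record the value,
    # put the slip back, and mix well".
    return [rng.choice(bag) for _ in range(n)]

rng = random.Random(1)  # fixed seed only so that reruns give the same sample
sample = draw_sample(bag, 100, rng)
print(len(sample), "observations; first ten:", sample[:10])
```

Because every slip goes back into the bag, each draw sees the same mix of values, which is why the text can treat the bag as if it were an infinite population.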
Results of the First Experiment
You will do this experiment in class yourself, as part of a group of two or three students. However, to be able to summarize some of the major observations of this experiment, we will look at the data collected by one particular recent group of students.
The division into ten groups of ten values each here is just to make reading easier -- the experiment consisted of 100 consecutive observations.
With the observations displayed in this fashion, we are unable to say very much in general about egg consumption in the population. (Think how much more perplexing this data might be at a glance if we had done 1000 observations or even 10,000 observations.) Clearly, to make some sense of our results, we will need to organize them in some way to reveal patterns, or typical values, or some other more intuitive overall characteristics.
Before doing that, we can note the following "facts" about our sample results:
it seems like everyone in the sample ate at least 1 egg in the last week, and no one in the sample ate more than 12 eggs last week.
we see the values 1, 2, 3, 5, 7, and 12 in our table of observations. However, it doesn't appear that any of our sampled individuals ate 4, 6, 8, 9, 10, or 11 eggs in the past week. This doesn't necessarily mean that nobody in our population ate exactly 4 eggs in the past week (or exactly 6, or exactly 8, etc.) -- just that no one in our sample of 100 individuals from that population did so. It might suggest to us that for some reason, people are very unlikely to eat 4, 6, 8, 9, 10, or 11 eggs in one week and much more likely to eat 1, 2, 3, 5, 7, or 12 eggs in one week, but even that conclusion may not be justified. (Also, remember that this is a very artificial and contrived experiment -- we aren't sampling a real population of people, but rather we are sampling little pieces of paper from a brown paper bag!)
Looking at the list of 100 numbers, it isn't obvious whether one of the six different values occurring is more or less common than the others. Are there more 1's than 2's? Are there more 12's than any other value? Even to answer simple questions like this about the sample data requires some organization of the data.
Organizing and summarizing large collections of data is the job of descriptive statistics. We will deal with the methods of descriptive statistics more systematically and in greater detail shortly, but here, we illustrate just three basic approaches to organizing and summarizing the data above to give you a sense of what can be done.
One way to begin is to create a frequency table or relative frequency table, a tally of how many times each value or group of values occurs in our observations. This is relatively easy to do here because it looks like there are just six distinct values in our data: 1, 2, 3, 5, 7, and 12. Thus, a frequency table can simply be a table listing how often each value occurs in the data.
Value    Frequency    Relative frequency
1        17           0.17 or 17%
2        16           0.16 or 16%
3        13           0.13 or 13%
5        11           0.11 or 11%
7        29           0.29 or 29%
12       14           0.14 or 14%
Total    100          1.00 or 100%
The "tally marks" are just a way of keeping track of the values as they occur in the list of 100 observations. Each mark represents one observation with the indicated value. By frequency, we mean the total count of all occurrences of that observation -- how "frequently" or "often" did that value occur in our observations. By relative frequency, we mean the fraction or percentage of all observations that were that value. Thus, since 13 out of our 100 observations were the value 3, we say that "the value 3 had a relative frequency of 0.13, or of 13 out of 100, or of 13%." You can see that to compute the relative frequency of a particular observation, you just divide its frequency by the total number of observations.
Now, this table appears to be quite a bit more informative than was the original simple list of observations. From the table we can see which were the most frequently occurring values in our sample (the value 7 came up 29 times in the 100 observations, or 29% of the observations were 7's). The least frequent observation was 5, occurring only 11% of the time. With a little bit of work, we can even make more complicated statements about the mix of observations. For example, (17% + 16% + 13%) = 46% of the people in this sample ate three or fewer eggs last week.
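The tallying described above can be sketched in a few lines of Python. The frequencies are the ones reported in this document. The original order of the 100 observations is not available, so the list built here is an arbitrary reordering, and the variable names are illustrative only.

```python
from collections import Counter

# Frequencies reported for the sample of 100 in this document.
reported = {1: 17, 2: 16, 3: 13, 5: 11, 7: 29, 12: 14}

# Rebuild a list of 100 observations (the original order is not known,
# so this ordering is arbitrary) and tally it the way one would by hand.
observations = [value for value, count in reported.items() for _ in range(count)]
frequency = Counter(observations)
n = len(observations)

print("value frequency relative")
for value in sorted(frequency):
    rel = frequency[value] / n  # relative frequency = frequency / total
    print(f"{value:>5} {frequency[value]:>9} {rel:>8.2f}")

# Share of the sample that ate three or fewer eggs.
three_or_fewer = sum(c for v, c in frequency.items() if v <= 3)
print(f"three or fewer: {three_or_fewer} of {n}")
```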
Once a frequency table is available, it is quite easy to represent the information pictorially or graphically. In this case, one might choose to construct a histogram: a bar graph with no gaps between the bars, with the height of each bar representing either the frequency (for a frequency histogram) or the relative frequency (for a relative frequency histogram) of that observation:
Note that we have been careful to include the values with no observations along the horizontal axis in our histogram.
Notice that the frequency histogram and the relative frequency histogram have essentially the same shape -- all that's different is the vertical scale.
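As a rough stand-in for the charts, here is a small Python sketch that prints a text version of the frequency histogram. It uses the frequencies reported in this document; the function name `histogram_rows` is hypothetical. Note how values with no observations still get a row, matching the remark about the horizontal axis.

```python
# Frequencies reported for the sample of 100 in this document.
frequency = {1: 17, 2: 16, 3: 13, 5: 11, 7: 29, 12: 14}

def histogram_rows(frequency, lowest, highest):
    """One text bar per integer value, including values never observed."""
    rows = []
    for value in range(lowest, highest + 1):
        count = frequency.get(value, 0)  # unobserved values get an empty bar
        rows.append(f"{value:>2} | {'#' * count} {count}")
    return rows

rows = histogram_rows(frequency, 1, 12)
print("\n".join(rows))
```

Dividing each count by 100 would relabel the bars as relative frequencies without changing their relative lengths.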
People often use the word distribution to refer to the information in the table above or in the two histograms, since in both of these sorts of ways, we are indicating how the values observed in the sample are "distributed" among all possible values.
The tabulation of frequencies, relative frequencies, and their representation graphically in charts such as histograms achieves the goal of organizing and summarizing the original data in a way that retains much of the information in that data. (In fact, in this simple example, the frequency table and histogram retain all of the information in the original data, but that would have not been possible to do if there hadn't been just a few distinct values present in the data.) Another approach to summarizing the original data is to use it to calculate a few summary values that are meaningful. Shortly, we will look at several of these numerical summaries which play an important role in statistical analysis, but for the moment, we just consider one of the most important: the "average" or more correctly, the arithmetic mean. We need to use this more technical term here, because the word "average" is used loosely to indicate quite a variety of different things by many people.
To calculate the arithmetic mean, we just add up all of the observations (getting 514 in this case), and divide by the total number of observations (100 in this case) to get the value 5.14. We say, "the mean number of eggs eaten per week by people in our sample was 5.14." This number is what is often meant by the statement that "the people in the sample ate an average of 5.14 eggs per week," with the value 5.14 taken to indicate a sort of typical number of eggs consumed by these people per week. Of course, it does not mean that somehow, these people are finding a way to eat 0.14 of an egg each week. Rather, it indicates that if you selected a large number of people from this population, you could estimate the total number of eggs eaten by the entire group by multiplying their number by 5.14. (However, this number does say that to achieve the same total egg consumption among the 100 people but with everyone eating the same number of eggs per week, that common egg consumption would have to be 5.14 eggs.)
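The arithmetic above can be checked from the frequency table rather than the raw list, since adding up all 100 observations is the same as multiplying each value by its frequency and adding the products. A short Python sketch, with illustrative variable names:

```python
# Frequencies reported for the sample of 100 in this document.
frequency = {1: 17, 2: 16, 3: 13, 5: 11, 7: 29, 12: 14}

n = sum(frequency.values())                            # number of observations
total_eggs = sum(v * c for v, c in frequency.items())  # sum of all observations
mean = total_eggs / n

print(total_eggs, n, mean)  # 514 100 5.14
```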
The arithmetic mean is one of the most useful single number summaries we can calculate for a sample, or estimate for a population. For example, suppose using the data above, we concluded that the mean number of eggs eaten by members of this population was about 5.14 eggs per week. (Such a conclusion will need some justification, since the mean value 5.14 really only applies to the random sample of 100 persons in the possibly very much larger population. As you'll see shortly, there is good reason to be very suspicious of the extension of that value to the entire population!) Then, following an advertising campaign, the same 100 people or perhaps another similar random sample from this population yielded a mean of 12.68 eggs eaten by each person per week. This wouldn't necessarily indicate that every person had increased their egg consumption by just over seven eggs per week, but it does indicate that overall, there appears to be more than seven additional eggs being eaten per week for each person in the population. One might hope to interpret such results as indicating that the advertising campaign had been successful.
Finally … The Real Problem!
The "real problem" that the subject of statistics was developed to solve has already shown itself in the last paragraph. We wish to say something about the characteristics of an entire population, and we wish to do so (or perhaps we're forced to do so) using just information available for a random sample selected from that population. What is the justification for assuming that the characteristics of the random sample accurately (or even adequately) reflect the corresponding characteristics of the population? In the example above, are we justified in claiming that 29% of the people in the entire population eat 7 eggs per week, since we found that 29% of the people in our random sample of 100 individuals reported they ate 7 eggs per week? Are we justified in stating that the mean number of eggs eaten per week by the members of the entire population is 5.14, since that was the mean we obtained for the 100 individuals in our sample?
Of course, the answer is NO!
How do we know the answer is NO?
Perhaps the easiest way to see what goes wrong is to repeat the original experiment. If the relative frequencies for the sample are accurate reflections of the relative frequencies in the entire population, we should find that every random sample drawn from that population should have the same table of relative frequencies. Similarly, if the sample mean is an accurate estimate of the population mean, then every random sample selected from that population should give more or less the same mean value.
So you can see exactly what happens, we will do this experiment in class. As in previous editions of this course, there will be enough of us present to be able to select 9 or 10 random samples from this population, and in a few minutes, carry out the computations illustrated above. The following table gives the typical sort of result of such an experiment:
There's a fair amount of information in this table. Recall, 10 random samples of 100 items were selected from this population. The frequencies of the values observed for each of the ten samples are given by the ten columns of numbers inside the heavy lines. Since each sample contained 100 items, it is easy to convert mentally between frequencies and relative frequencies here.
First, look at the two columns to the extreme right, titled "smallest frequency" and "largest frequency", respectively. The first row of the table lists the number of 1's observed in each of the 10 samples, and we see that this varied from as few as 12 (or 12% of the sample) to as many as 20 (or 20% of the sample). If sample number 7 was the one we were using to study the population, we'd end up claiming that only 12% of the population appears to eat just one egg per week. On the other hand, if sample number 4 was the one we were using to study the population, we'd end up claiming 20% of the population eat just one egg per week. The second number is nearly twice as big as the first one, and both are quite different from the 17% value we observed in the original experiment. The same sort of variation in relative frequencies is seen for each of the other 5 values observed. Since the relative frequencies for each observation seem to be so different from one random sample to the next, it isn't clear whether there is any way to use the information in the sample to say something justifiable about relative frequencies in the population.
The row labelled "mean" in the table above gives the arithmetic mean for each of the ten random samples. You see that these values vary from a low of 3.95 for sample number six to a high of 5.56 for sample number three. Again, with this much variation in value from one sample to the next, it is unclear whether the sample mean is telling us anything meaningful about the population mean.
To complete the apparent disaster, there's one more rabbit to pull out of the hat. Unlike all interesting real life populations, the populations from which these samples have been selected are known in complete detail because they were artificially constructed. In fact, there are equal numbers of each of the values 1, 2, 3, 5, 7, and 12, in each of the populations! This means that if the samples precisely reflected the characteristics of the population, we would expect that 1/6 or 16.7% of each sample would be 1's, 1/6 or 16.7% of each sample would be 2's, and so on. Some of the frequencies in the table above are 16 or 17 (the closest we can come in a sample of 100 to a relative frequency of 1/6), but most are not. (The bottom two rows of the figure show how much frequencies differ from 16 or 17 for each sample.) Furthermore, with a bit of thought, you can determine that the population mean is exactly 5. Again, some of the sample means have values near 5.00, but some are quite different from 5.00.
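The "bit of thought" behind the population mean can be written out directly. With equal numbers of each of the six values, the population mean is just the plain average of those six values. A minimal Python sketch using exact fractions (variable names are illustrative):

```python
from fractions import Fraction

# Each population holds equal numbers of these six values (stated in the text).
values = [1, 2, 3, 5, 7, 12]

population_mean = Fraction(sum(values), len(values))  # (1+2+3+5+7+12)/6
expected_share = Fraction(1, len(values))             # share of each value
expected_count = 100 * expected_share                 # per sample of 100

print(population_mean)        # 5
print(float(expected_count))  # about 16.67, so 16 or 17 is the closest a sample can come
```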
So, this is the situation we find ourselves in:
we have a need to be able to estimate the values of some summary characteristics of a large population -- in the present example, these characteristics have been relative frequencies of some particular observations, or the arithmetic mean of the population.
for one or more of several reasons already described (see the earlier document, "What Is/Are Statistics"), we are unable to examine every element of the entire population
hence we would like to rely on information we get from a smaller random sample which we select from that population
but, the information in such samples appears to reflect the characteristics of the population rather poorly. In fact, it appears that different people sampling the same population may obtain data giving quite different information.
Basically, we need to find something that links what we see in a sample with what we want to know about a population. The naïve assumption that the mean of a random sample will by itself be a good estimate of the mean of the population from which the sample was selected does not appear to work out. The sample mean is an example of what we call a random variable -- it is a variable whose value depends on the particular elements of a random sample. Thus, the values of the sample mean are unpredictable in principle. Still, we don't want to abandon any attempt to estimate characteristics of populations using information obtained from random samples of those populations.
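To see the sample mean behaving as a random variable, one can simulate the repeated experiment. The sketch below assumes the population described above (equal numbers of 1, 2, 3, 5, 7, and 12) and draws ten samples of 100 with replacement. The seed and the function name `sample_mean` are arbitrary choices, and the numbers printed are simulated, not the classroom results.

```python
import random

VALUES = [1, 2, 3, 5, 7, 12]  # equally common in the population (per the text)

def sample_mean(n, rng):
    """Mean of n draws with replacement from the population."""
    return sum(rng.choice(VALUES) for _ in range(n)) / n

rng = random.Random(7)  # arbitrary seed, used only for repeatability
means = [sample_mean(100, rng) for _ in range(10)]

for i, m in enumerate(means, start=1):
    print(f"sample {i:>2}: mean = {m:.2f}")
print(f"lowest {min(means):.2f}, highest {max(means):.2f}, population mean 5.00")
```

Changing the seed changes every sample mean, while the population mean stays fixed at 5.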
6. And … The Solution?
There is a solution of sorts to this problem, of course, and it is the development of that solution which will occupy much of our efforts in this course. To begin to see a link between sample characteristics and corresponding population characteristics, take a look at the following figure:
To get this histogram, we collected 400 random samples of 100 items from the population, and calculated the arithmetic mean for each sample. These sample means ranged from a low value of 4.10 to a high value of 5.75. Rather than count up how many times each of the 166 values between 4.10 and 5.75 came up in our list, we divided the horizontal axis up into intervals of width 0.1 units, centered on the values shown, and counted how many values of the sample mean fell within each of these intervals. The result is the frequency histogram seen in the figure. Thus, for example, 45 of the 400 sample means had a value falling between 4.85 and 4.94, the interval centered on 4.9.
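The binning just described can be sketched in Python. This is a simulation under the population described earlier (equal numbers of 1, 2, 3, 5, 7, and 12), so its counts will not match the figure. The helper `bin_center_tenths` is a hypothetical name; it works on the integer sum of a sample of 100, which avoids floating-point rounding at the bin edges.

```python
import random
from collections import Counter

VALUES = [1, 2, 3, 5, 7, 12]  # equally common in the population (per the text)

def bin_center_tenths(total):
    """Map the sum of 100 observations to its bin centre, in tenths.

    A sum of 485..494 (mean 4.85..4.94) lands in the bin centred on 4.9.
    """
    return (total + 5) // 10

rng = random.Random(400)  # arbitrary seed, used only for repeatability
bins = Counter()
for _ in range(400):
    total = sum(rng.choice(VALUES) for _ in range(100))
    bins[bin_center_tenths(total)] += 1

# Text histogram: one row per occupied bin, labelled by its centre.
for center in sorted(bins):
    print(f"{center / 10:.1f} | {'#' * bins[center]} {bins[center]}")
```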
This histogram has two striking features:
it is approximately bell-shaped, and roughly symmetric about its highest point
that center peak falls very close to the (in this case) known population mean of 5.
In fact, as we continue to add new random samples to the collection, this histogram will better and better approximate a bell curve centered exactly at the population mean. It is this type of stable link, illustrated here by the center of the distribution of possible sample means being equal to the population mean, that is exploited by the methods of statistics.
Interestingly, the same sort of relationship arises between sample frequencies and population proportions (the percentage of the population which has a specific value or range of values). The following chart is a frequency histogram for the number of 1's occurring in the 400 random samples described above:
(In keeping with our original story, this would be the number of persons in each random sample of 100 persons who said that they had eaten just one egg in the preceding week). Every one of the 400 samples had at least 8 ones, and none had more than 25 ones. Notice that the distribution of numbers of ones is still faintly suggestive of a bell curve centered on the expected value of approximately 17.
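The same kind of simulation can be run for the number of 1's per sample. As before, this is only a sketch assuming the population described earlier. The seed is arbitrary and the output is simulated, so the smallest and largest counts will generally differ from the 8 and 25 reported for the chart.

```python
import random
from collections import Counter

VALUES = [1, 2, 3, 5, 7, 12]  # equally common in the population (per the text)

rng = random.Random(66)  # arbitrary seed, used only for repeatability
ones_per_sample = []
for _ in range(400):
    sample = [rng.choice(VALUES) for _ in range(100)]
    ones_per_sample.append(sample.count(1))  # how many 1's in this sample

# Frequency of each count of 1's across the 400 samples.
tally = Counter(ones_per_sample)
for k in range(min(tally), max(tally) + 1):
    print(f"{k:>2} | {'#' * tally[k]}")
```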
7. Where To From Here?
This document has raised quite a number of issues in a very superficial way. Over the next several weeks of the course, we will go over these ideas to fill in necessary details. The immediate steps are:
first, we need to look in more depth and breadth at ways of organizing and summarizing data, not just to develop other useful types of charts to complement our use of histograms, but also to look at some of the other numerical characteristics of samples and populations that may be useful in summarizing their general properties.
then, we need to develop some familiarity with some basic concepts of probability theory so that we can quantify the effects of randomness in the sampling process. Our initial focus will be on the properties of the ubiquitous bell-shaped curve (known technically as the Gaussian distribution or Normal distribution). The ideas we develop for dealing with such normal distribution problems carry over pretty much to cases where other types of distribution curves apply.
finally, we need to make the link between such sample distributions and the values of population parameters very precise. This leads us to the twin topics of statistical estimation, and hypothesis testing, the ultimate goals of this course.