Lesson 2: Summarizing Data

Section 6: Measures of Central Location

A measure of central location provides a single value that summarizes an entire distribution of data. Suppose you had data from an outbreak of gastroenteritis affecting 41 persons who had recently attended a wedding. If your supervisor asked you to describe the ages of the affected persons, you could simply list the ages of each person. Alternatively, your supervisor might prefer one summary number — a measure of central location. Saying that the mean (or average) age was 48 years rather than reciting 41 ages is certainly more efficient, and most likely more meaningful.

Measures of central location include the mode, median, arithmetic mean, midrange, and geometric mean. Selecting the best measure to use for a given distribution depends largely on two factors:

  • The shape or skewness of the distribution, and
  • The intended use of the measure.

Each measure — what it is, how to calculate it, and when best to use it — is described in this section.

Mode

Definition of mode

The mode is the value that occurs most often in a set of data. It can be determined simply by tallying the number of times each value occurs. Consider, for example, the number of doses of diphtheria-pertussis-tetanus (DPT) vaccine each of seventeen 2-year-old children in a particular village received:

0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4

Two children received no doses; two children received 1 dose; three received 2 doses; six received 3 doses; and four received all 4 doses. Therefore, the mode is 3 doses, because more children received 3 doses than any other number of doses.

Method for identifying the mode

  1. Step 1. Arrange the observations into a frequency distribution, indicating the values of the variable and the frequency with which each value occurs. (Alternatively, for a data set with only a few values, arrange the actual values in ascending order, as was done with the DPT vaccine doses above.)
  2. Step 2. Identify the value that occurs most often.

EXAMPLES: Identifying the Mode

Example A: Table 2.8 (below) provides data from 30 patients who were hospitalized and received antibiotics. For the variable “length of stay” (LOS) in the hospital, identify the mode.

  1. Step 1. Arrange the data in a frequency distribution.
    LOS  Frequency     LOS  Frequency     LOS  Frequency
      0      1          10      5          20      0
      1      0          11      1          21      0
      2      1          12      3          22      1
      3      1          13      1           .      0
      4      1          14      1           .      0
      5      2          15      0          27      1
      6      1          16      1           .      0
      7      1          17      0           .      0
      8      1          18      2          49      1
      9      3          19      1

    Alternatively, arrange the values in ascending order.

    0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
    9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
    12, 13, 14, 16, 18, 18, 19, 22, 27, 49

  2. Step 2. Identify the value that occurs most often.
    Most values appear once, but the distribution includes two 5s, three 9s, five 10s, three 12s, and two 18s.
    Because 10 appears most frequently, the mode is 10.
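
The two steps in Example A can also be sketched in Python. This is only an illustration, not part of the Epi Info workflow described in this lesson; the variable names are assumptions, and the values are the 30 lengths of stay listed above.

```python
from collections import Counter

# Length-of-stay (LOS) values, in days, for the 30 patients in Example A.
los = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
       9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
       12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Step 1. Arrange the data in a frequency distribution.
frequency = Counter(los)
for value in sorted(frequency):
    print(value, frequency[value])

# Step 2. Identify the value that occurs most often.
# most_common(1) returns a one-item list of (value, count) pairs.
mode_value, mode_count = frequency.most_common(1)[0]
print("Mode:", mode_value, "days, occurring", mode_count, "times")
```

Note that `most_common(1)` reports a single value even when two or more values tie for the highest frequency, so this shortcut is suitable only when the distribution has one mode.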

Example B: Find the mode of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.

  1. Step 1. Arrange the values in ascending order.
    15, 22, 27, 30, and 31 days
  2. Step 2. Identify the value that occurs most often.
    None

Note: When no value occurs more than once, the distribution is said to have no mode.

Example C: Find the mode of the following incubation periods for Bacillus cereus food poisoning:

2, 3, 3, 3, 3, 3, 4, 4, 5, 6, 7, 9, 10, 11, 11, 12, 12, 12, 12, 12, 14, 14, 15, 17, 18, 20, 21 hours
  1. Step 1. Arrange the values in ascending order.
    Done
  2. Step 2. Identify the values that occur most often.
    Five 3s and five 12s

Example C illustrates the fact that a frequency distribution can have more than one mode. When a distribution has two modes, it is said to be bimodal. Indeed, Bacillus cereus is known to cause two syndromes with different incubation periods: a short-incubation-period (1–6 hours) syndrome characterized by vomiting; and a long-incubation-period (6–24 hours) syndrome characterized by diarrhea.
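
Examples B and C show that a distribution can have no mode or more than one mode. A minimal Python sketch that handles both cases follows; the function name `find_modes` is an assumption made for illustration, and it adopts this lesson's convention that a distribution has no mode when no value occurs more than once.

```python
from collections import Counter

def find_modes(values):
    """Return a sorted list of modes, or an empty list if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Lesson convention: no value occurs more than once -> no mode.
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

dpt_doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
hepatitis_a_days = [27, 31, 15, 30, 22]
b_cereus_hours = [2, 3, 3, 3, 3, 3, 4, 4, 5, 6, 7, 9, 10, 11, 11,
                  12, 12, 12, 12, 12, 14, 14, 15, 17, 18, 20, 21]

print(find_modes(dpt_doses))         # [3]
print(find_modes(hepatitis_a_days))  # []
print(find_modes(b_cereus_hours))    # [3, 12]
```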

Table 2.8 Sample Data from the Northeast Consortium Vancomycin Quality Improvement Project

ID Admission Date Discharge Date LOS DOB (mm/dd) DOB (year) Age Sex ESRD
1 1/01 1/10 9 11/18 1928 66 M Y 3 N
2 1/08 1/30 22 01/21 1916 78 F N 10 Y
3 1/16 3/06 49 04/22 1920 74 F N 32 Y
4 1/23 2/04 12 05/14 1919 75 M N 5 Y
5 1/24 2/01 8 08/17 1929 65 M N 4 N
6 1/27 2/14 18 01/11 1918 77 M N 6 Y
7 2/06 2/16 10 01/09 1920 75 F N 2 Y
8 2/12 2/22 10 06/12 1927 67 M N 1 N
9 2/22 3/04 10 05/09 1915 79 M N 8 N
10 2/22 3/08 14 04/09 1920 74 F N 10 N
11 2/25 3/04 7 07/28 1915 79 F N 4 N
12 3/02 3/14 12 04/24 1928 66 F N 8 N
13 3/11 3/17 6 11/09 1925 69 M N 3 N
14 3/18 3/23 5 04/08 1924 70 F N 2 N
15 3/19 3/28 9 09/13 1915 79 F N 1 Y
16 3/27 4/01 5 01/28 1912 83 F N 4 Y
17 3/31 4/02 2 03/14 1921 74 M N 2 Y
18 4/12 4/24 12 02/07 1927 68 F N 3 N
19 4/17 5/06 19 03/04 1921 74 F N 11 Y
20 4/29 5/26 27 02/23 1921 74 F N 14 N
21 5/11 5/15 4 05/05 1923 72 M N 4 Y
22 5/14 5/14 0 01/03 1911 84 F N 1 N
23 5/20 5/30 10 11/11 1922 72 F N 9 Y
24 5/21 6/08 18 08/08 1912 82 M N 14 Y
25 5/26 6/05 10 09/28 1924 70 M Y 5 N
26 5/27 5/30 3 05/14 1899 96 F N 2 N
27 5/28 6/06 9 07/22 1921 73 M N 1 Y
28 6/07 6/20 13 12/30 1896 98 F N 3 N
29 6/07 6/23 16 08/31 1906 88 M N 1 N
30 6/16 6/27 11 07/07 1917 77 F N 7 Y
Epi Info

To identify the mode from a data set in Analysis Module:

Epi Info does not have a Mode command. Thus, the best way to identify the mode is to create a histogram and look for the tallest column(s).

Select graphs, then choose histogram under Graph Type.

The tallest column(s) is(are) the mode(s).

NOTE: The Means command provides a mode, but only the lowest value if a distribution has more than one mode.

Properties and uses of the mode

The mode is the easiest measure of central location to understand and explain. It is also the easiest to identify, and requires no calculations.

  • The mode is the preferred measure of central location for addressing which value is the most popular or the most common. For example, the mode is used to describe which day of the week people most prefer to come to the influenza vaccination clinic, or the “typical” number of doses of DPT the children in a particular community have received by their second birthday.
  • As demonstrated, a distribution can have a single mode. However, a distribution has more than one mode if two or more values tie as the most frequent values. It has no mode if no value appears more than once.
  • The mode is used almost exclusively as a “descriptive” measure. It is almost never used in statistical manipulations or analyses.
  • The mode is not typically affected by one or two extreme values (outliers).
Exercise Question

Exercise 2.3

Using the same vaccination data as in Exercise 2.2, find the mode. (If you answered Exercise 2.2, find the mode from your frequency distribution.)

 

2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1

Check your answers

Median

Definition of median

The median is the middle value of a set of data that has been put into rank order. Similar to the median on a highway that divides the road in two, the statistical median is the value that divides the data into two halves, with one half of the observations being smaller than the median value and the other half being larger. The median is also the 50th percentile of the distribution. Suppose you had the following ages in years for patients with a particular illness:

4, 23, 28, 31, 32

The median age is 28 years, because it is the middle value, with two values smaller than 28 and two values larger than 28.

Method for identifying the median

Step 1. Arrange the observations into increasing or decreasing order.

Step 2. Find the middle position of the distribution by using the following formula:

Middle position = (n + 1) / 2

  • If the number of observations (n) is odd, the middle position falls on a single observation.
  • If the number of observations is even, the middle position falls between two observations.

Step 3. Identify the value at the middle position.

  • If the number of observations (n) is odd and the middle position falls on a single observation, the median equals the value of that observation.
  • If the number of observations is even and the middle position falls between two observations, the median equals the average of the two values.
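
The three steps above can be sketched in Python as follows. This is an illustrative sketch; the function name `median` is chosen here for clarity (Python's standard library offers `statistics.median` for the same purpose).

```python
def median(values):
    """Identify the median using the (n + 1) / 2 middle-position rule."""
    # Step 1. Arrange the observations in increasing order.
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")

    # Step 2. Find the middle position (counting from 1).
    middle_position = (n + 1) / 2

    # Step 3. Identify the value at the middle position.
    if n % 2 == 1:
        # Odd n: the middle position falls on a single observation.
        return ordered[int(middle_position) - 1]
    # Even n: the middle position falls between two observations,
    # so average the values on either side of it.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([4, 23, 28, 31, 32]))  # 28
print(median([15, 27, 30, 31]))     # 28.5
```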

Properties and uses of the median

  • The median is a good descriptive measure, particularly for data that are skewed, because it is the central point of the distribution.
  • The median is relatively easy to identify. It is equal to either a single observed value (if odd number of observations) or the average of two observed values (if even number of observations).
  • The median, like the mode, is not generally affected by one or two extreme values (outliers). For example, if the ages in the example above had been 4, 23, 28, 31, and 131 (instead of 4, 23, 28, 31, and 32), the median would still be 28.
  • The median has less-than-ideal statistical properties. Therefore, it is not often used in statistical manipulations and analyses.
Exercise Question

Exercise 2.4

Determine the median for the same vaccination data used in Exercises 2.2 and 2.3.

2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1

Check your answers

Arithmetic mean

Definition of mean

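The arithmetic mean is a more technical name for what is more commonly called the mean or average. It is the value that is closest to all the other values in a distribution, in the sense that the sum of the squared differences between it and the observed values is smaller than for any other value.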

Method for calculating the mean

Step 1. Add all of the observed values in the distribution.

Step 2. Divide the sum by the number of observations.

EXAMPLE: Finding the Mean

Find the mean of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.

Step 1. Add all of the observed values in the distribution.

27 + 31 + 15 + 30 + 22 = 125

Step 2. Divide the sum by the number of observations.

125 / 5 = 25.0

Therefore, the mean incubation period is 25.0 days.

Properties and uses of the arithmetic mean

  • The mean has excellent statistical properties and is commonly used in additional statistical manipulations and analyses. One such property is called the centering property of the mean. When the mean is subtracted from each observation in the data set, the sum of these differences is zero (i.e., the negative sum is equal to the positive sum). For the data in the previous hepatitis A example:
    Value  minus   Mean   =  Difference
      15     –     25.0   =    -10.0
      22     –     25.0   =     -3.0
      27     –     25.0   =     +2.0
      30     –     25.0   =     +5.0
      31     –     25.0   =     +6.0
     125     –    125.0   =      0.0   (-13.0 + 13.0 = 0)

This demonstrates that the mean is the arithmetic center of the distribution.

  • Because of this centering property, the mean is sometimes called the center of gravity of a frequency distribution. If the frequency distribution is plotted on a graph, and the graph is balanced on a fulcrum, the point at which the distribution would balance would be the mean.
  • The arithmetic mean is the best descriptive measure for data that are normally distributed.
  • On the other hand, the mean is not the measure of choice for data that are severely skewed or have extreme values in one direction or another. Because the arithmetic mean uses all of the observations in the distribution, it is affected by any extreme value. Suppose that the last value in the previous distribution was 131 instead of 31. The mean would be 225 / 5 = 45.0 rather than 25.0. As a result of one extremely large value, the mean is much larger than all values in the distribution except the extreme value (the “outlier”).
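
The centering property and the effect of an outlier described above can be checked with a few lines of Python. This is a minimal sketch using the hepatitis A incubation periods from the example; the variable names are assumptions.

```python
incubation_days = [15, 22, 27, 30, 31]

# Mean = sum of the observed values divided by the number of observations.
mean = sum(incubation_days) / len(incubation_days)
print(mean)             # 25.0

# Centering property: the differences from the mean sum to zero.
deviations = [x - mean for x in incubation_days]
print(deviations)       # [-10.0, -3.0, 2.0, 5.0, 6.0]
print(sum(deviations))  # 0.0

# Replacing 31 with an extreme value of 131 pulls the mean upward.
with_outlier = [15, 22, 27, 30, 131]
mean_with_outlier = sum(with_outlier) / len(with_outlier)
print(mean_with_outlier)  # 45.0
```

With non-integer data, floating-point rounding can leave a tiny nonzero remainder in the sum of the differences, so compare against zero with a small tolerance rather than exactly.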
Epi Info

Epi Info Demonstration: Finding the Mean

Question: In the data set named SMOKE, what is the mean weight of the participants?

Answer: In Epi Info:
Select Analyze Data.
Select Read (Import). The default data set should be Sample.mdb. Under Views, scroll down to view SMOKE, and double click, or click once and then click OK. Note that 9 persons have a weight of 777, and 10 persons have a weight of 999. These are codes for “refused” and “missing.” To exclude these records from the analysis, enter the following commands:
Click on Select. Then type weight < 770, or select weight from the available variables and then type < 770, and click on OK.
Select Means. Then click on the down arrow beneath Means of, scroll down and select WEIGHT, then click OK.

The resulting output should indicate a mean weight of 158.116 pounds.
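
The same idea, excluding coded values such as 777 and 999 before computing a mean, can be sketched outside Epi Info. The weights below are hypothetical values invented for illustration; they are not taken from the SMOKE data set, and the variable names are assumptions.

```python
# Hypothetical weights in pounds (NOT from the SMOKE data set).
# 777 and 999 are codes for "refused" and "missing", not real weights.
weights = [150, 180, 777, 130, 999, 160]

# Keep only genuine measurements, mirroring the Select step (weight < 770).
valid_weights = [w for w in weights if w < 770]

mean_weight = sum(valid_weights) / len(valid_weights)
print(valid_weights)  # [150, 180, 130, 160]
print(mean_weight)    # 155.0
```

Leaving the coded values in would inflate the mean badly, which is why the selection step comes before the Means command.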

Your Turn: What is the mean number of cigarettes smoked per day? [Answer: 17]

Exercise Question

Exercise 2.5

Determine the mean for the same set of vaccination data.

2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1

Check your answers

The midrange (midpoint of an interval)

Definition of midrange
The midrange is the half‑way point or the midpoint of a set of observations. The midrange is usually calculated as an intermediate step in determining other measures.

Method for identifying the midrange

  1. Identify the smallest (minimum) observation and the largest (maximum) observation.
  2. Add the minimum plus the maximum, then divide by two.

Exception: Age differs from most other variables because age does not follow the usual rules for rounding to the nearest integer. Someone who is 17 years and 360 days old cannot claim to be 18 years old for at least 5 more days. Thus, to identify the midrange for age (in years) data, you must add the smallest (minimum) observation plus the largest (maximum) observation plus 1, then divide by two.

Midrange (most types of data) = (minimum + maximum) / 2
Midrange (age data) = (minimum + maximum + 1) / 2

Consider the following example:

In a particular pre-school, children are assigned to rooms on the basis of age on September 1. Room 2 holds all of the children who were at least 2 years old but not yet 3 years old as of September 1. In other words, every child in room 2 was 2 years old on September 1. What is the midrange of ages of the children in room 2 on September 1?

For descriptive purposes, a reasonable answer is 2. However, recall that the midrange is usually calculated as an intermediate step in other calculations. Therefore, more precision is necessary.

Consider that children born in August have just turned 2 years old. Others, born in September the previous year, are almost but not quite 3 years old. Ignoring seasonal trends in births and assuming a very large room of children, birthdays are expected to be uniformly distributed throughout the year. The youngest child, born on September 1, is exactly 2.000 years old. The oldest child, whose birthday is September 2 of the previous year, is 2.997 years old. For statistical purposes, the mean and midrange of this theoretical group of 2-year-olds are both 2.5 years.

Properties and uses of the midrange

  • The midrange is not commonly reported as a measure of central location.
  • The midrange is more commonly used as an intermediate step in other calculations, or for plotting graphs of data collected in intervals.

EXAMPLES: Identifying the Midrange

Example A: Find the midrange of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.

  1. Identify the minimum and maximum values.
    Minimum = 15, maximum = 31
  2. Add the minimum plus the maximum, then divide by two.
    Midrange = (15 + 31) / 2 = 46 / 2 = 23 days

Example B: Find the midrange of the grouping 15–24 (e.g., number of alcoholic beverages consumed in one week).

  1. Identify the minimum and maximum values.
    Minimum = 15, maximum = 24
  2. Add the minimum plus the maximum, then divide by two.
    Midrange = (15 + 24) / 2 = 39 / 2 = 19.5

This calculation assumes that the grouping 15–24 really covers 14.50–24.49…. Since the midrange of 14.50–24.49… = 19.49…, the midrange can be reported as 19.5.

Example C: Find the midrange of the age group 15–24 years.

  1. Identify the minimum and maximum values.
    Minimum = 15, maximum = 24
  2. Add the minimum plus the maximum plus 1, then divide by two.
    Midrange = (15 + 24 + 1) / 2 = 40 / 2 = 20 years

Age differs from the majority of other variables because age does not follow the usual rules for rounding to the nearest integer. For most variables, 15.99 can be rounded to 16. However, an adolescent who is 15 years and 360 days old cannot claim to be 16 years old (and hence get his driver’s license or learner’s permit) for at least 5 more days. Thus, the interval of 15–24 years really spans 15.0–24.99… years. The midrange of 15.0 and 24.99… = 19.99… = 20.0 years.
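
The two midrange formulas can be combined in one small Python function. This is a sketch; the function name `midrange` and the `is_age` flag are assumptions introduced here to switch between the two formulas.

```python
def midrange(minimum, maximum, is_age=False):
    """Return the midrange; add 1 before halving for age (in years) data."""
    adjustment = 1 if is_age else 0
    return (minimum + maximum + adjustment) / 2

# Example A: hepatitis A incubation periods (days).
incubation_days = [27, 31, 15, 30, 22]
print(midrange(min(incubation_days), max(incubation_days)))  # 23.0

# Example B: the grouping 15-24 for a non-age variable.
print(midrange(15, 24))               # 19.5

# Example C: the age group 15-24 years.
print(midrange(15, 24, is_age=True))  # 20.0
```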

Geometric mean

To calculate the geometric mean, you need a scientific calculator with log and y^x keys.

Definition of geometric mean
The geometric mean is the mean or average of a set of data measured on a logarithmic scale. The geometric mean is used when the logarithms of the observations are distributed normally (symmetrically) rather than the observations themselves. The geometric mean is particularly useful in the laboratory for data from serial dilution assays (1/2, 1/4, 1/8, 1/16, etc.) and in environmental sampling data.

More About Logarithms

A logarithm is the power to which a base is raised.

To what power would you need to raise a base of 10 to get a value of 100?
Because 10 times 10, or 10^2, equals 100, the log of 100 at base 10 equals 2. Similarly, the log of 16 at base 2 equals 4, because 2^4 = 2 x 2 x 2 x 2 = 16.

2^0 = 1 (anything raised to the 0 power is 1)
2^1 = 2
2^2 = 2 x 2 = 4
2^3 = 2 x 2 x 2 = 8
2^4 = 2 x 2 x 2 x 2 = 16
2^5 = 2 x 2 x 2 x 2 x 2 = 32
2^6 = 2 x 2 x 2 x 2 x 2 x 2 = 64
2^7 = 2 x 2 x 2 x 2 x 2 x 2 x 2 = 128
and so on.

10^0 = 1 (anything raised to the 0 power equals 1)
10^1 = 10
10^2 = 100
10^3 = 1,000
10^4 = 10,000
10^5 = 100,000
10^6 = 1,000,000
10^7 = 10,000,000
and so on.

An antilog raises the base to the power (logarithm). For example, the antilog of 2 at base 10 is 10^2, or 100. The antilog of 4 at base 2 is 2^4, or 16. The majority of titers are reported as powers of 2 (e.g., 2, 4, 8, etc.); therefore, base 2 is typically used when dealing with titers.
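
For readers working in software rather than on a calculator, the logs and antilogs above can be reproduced with Python's standard `math` module. This is a brief sketch for illustration only.

```python
import math

# Logarithm: the power to which the base must be raised.
log_of_100 = math.log10(100)  # base 10
log_of_16 = math.log2(16)     # base 2
print(log_of_100)  # 2.0
print(log_of_16)   # 4.0

# Antilog: raise the base to the power (logarithm).
antilog_base_10 = 10 ** 2
antilog_base_2 = 2 ** 4
print(antilog_base_10)  # 100
print(antilog_base_2)   # 16
```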

Method for calculating the geometric mean

There are two methods for calculating the geometric mean.

Method A

  1. Take the logarithm of each value.
  2. Calculate the mean of the log values by summing the log values, then dividing by the number of observations.
  3. Take the antilog of the mean of the log values to get the geometric mean.

Method B

  1. Calculate the product of the values by multiplying all of the values together.
  2. Take the nth root of the product (where n is the number of observations) to get the geometric mean.

EXAMPLES: Calculating the Geometric Mean

Example A: Using Method A
Calculate the geometric mean from the following set of data.

10, 10, 100, 100, 100, 100, 10,000, 100,000, 100,000, 1,000,000

Because these values are all powers of 10, it makes sense to use logs of base 10.

  1. Take the log (in this case, to base 10) of each value.
    log10(xi) = 1, 1, 2, 2, 2, 2, 4, 5, 5, 6
  2. Calculate the mean of the log values by summing and dividing by the number of observations (in this case, 10).
    Mean of log10(xi) = (1+1+2+2+2+2+4+5+5+6) / 10 = 30 / 10 = 3
  3. Take the antilog of the mean of the log values to get the geometric mean.
    Antilog10(3) = 10^3 = 1,000.

The geometric mean of the set of data is 1,000.

Example B: Using Method B

Calculate the geometric mean of the following lower and upper 95% confidence limits of an odds ratio: 1.0, 9.0

  1. Calculate the product of the values by multiplying all values together.
    1.0 x 9.0 = 9.0
  2. Take the square root of the product.
    The geometric mean = square root of 9.0 = 3.0.
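
Methods A and B can both be sketched in Python. The function names below are assumptions made for illustration, and both functions assume that every value is greater than zero. Because of floating-point rounding, the results may differ from the hand-calculated answers in the last decimal places. (Python 3.8 and later also provide `statistics.geometric_mean`.)

```python
import math

def geometric_mean_logs(values, base=10):
    """Method A: antilog of the mean of the log values."""
    logs = [math.log(v, base) for v in values]  # 1. take the log of each value
    mean_log = sum(logs) / len(logs)            # 2. mean of the log values
    return base ** mean_log                     # 3. antilog of that mean

def geometric_mean_root(values):
    """Method B: nth root of the product of the values."""
    product = math.prod(values)                 # 1. multiply all values together
    return product ** (1 / len(values))         # 2. take the nth root

example_a = [10, 10, 100, 100, 100, 100, 10_000, 100_000, 100_000, 1_000_000]
example_b = [1.0, 9.0]

print(round(geometric_mean_logs(example_a), 6))  # approximately 1,000
print(round(geometric_mean_root(example_b), 6))  # approximately 3.0
```

Method A is usually the safer choice for long lists or very large values, because the product in Method B can grow too large to handle comfortably.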

Properties and uses of the geometric mean

The geometric mean is the average of logarithmic values, converted back to the base. The geometric mean tends to dampen the effect of extreme values and is never larger than the corresponding arithmetic mean (the two are equal only when all of the values are identical). In that sense, the geometric mean is less sensitive than the arithmetic mean to one or a few extreme values.

  • The geometric mean is the measure of choice for variables measured on an exponential or logarithmic scale, such as dilutional titers or assays.
  • The geometric mean is often used for environmental samples, when levels can range over several orders of magnitude. For example, levels of coliforms in samples taken from a body of water can range from less than 100 to more than 100,000.

 

Exercise Question

Exercise 2.6

Using the dilution titers shown below, calculate the geometric mean titer of convalescent antibodies against tularemia among 10 residents of Martha’s Vineyard. [Hint: Use only the second number in the ratio, i.e., for 1:640, use 640.]

ID # Acute Convalescent
1 1:16 1:512
2 1:16 1:512
3 1:32 1:128
4 not done 1:512
5 1:32 1:1024
6 “negative” 1:1024
7 1:256 1:2048
8 1:32 1:128
9 “negative” 1:4096
10 1:16 1:1024

Check your answers

Selecting the appropriate measure

Measures of central location are single values that summarize the observed values of a distribution. The mode provides the most common value, the median provides the central value, the arithmetic mean provides the average value, the midrange provides the midpoint value, and the geometric mean provides the logarithmic average.

The mode and median are useful as descriptive measures. However, they are not often used for further statistical manipulations. In contrast, the mean is not only a good descriptive measure, but it also has good statistical properties. The mean is used most often in additional statistical manipulations.

While the arithmetic mean is the measure of choice when data are normally distributed, the median is the measure of choice for data that are not normally distributed. Because epidemiologic data tend not to be normally distributed (incubation periods, doses, ages of patients), the median is often preferred. The geometric mean is used most commonly with laboratory data, particularly dilution titers or assays and environmental sampling data.

The arithmetic mean uses all the data, which makes it sensitive to outliers. Although the geometric mean also uses all the data, it is not as sensitive to outliers as the arithmetic mean. The midrange, which is based on the minimum and maximum values, is more sensitive to outliers than any other measures. The mode and median tend not to be affected by outliers.
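
The relative sensitivity to outliers described above can be illustrated with the hepatitis A incubation periods, replacing 31 with 131 as in the earlier discussion of the mean. This is a sketch using Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later); the helper name `summarize` is an assumption. The mode is omitted because these data have no repeated values.

```python
import statistics

def summarize(values):
    """Return several measures of central location for a list of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "midrange": (min(values) + max(values)) / 2,
        "geometric mean": statistics.geometric_mean(values),
    }

original = [27, 31, 15, 30, 22]
with_outlier = [27, 131, 15, 30, 22]  # 31 replaced by an extreme value

before = summarize(original)
after = summarize(with_outlier)

# Show how far each measure moves when the outlier is introduced.
for name in before:
    print(f"{name}: {before[name]:.1f} -> {after[name]:.1f}")
```

In this small example the median does not move, the geometric mean moves less than the arithmetic mean, and the midrange moves the most.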

In summary, each measure of central location — mode, median, mean, midrange, and geometric mean — is a single value that is used to represent all of the observed values of a distribution. Each measure has its advantages and limitations. The selection of the most appropriate measure requires judgment based on the characteristics of the data (e.g., normally distributed or skewed, with or without outliers, arithmetic or log scale) and the reason for calculating the measure (e.g., for descriptive or analytic purposes).

Exercise Question

Exercise 2.7

For each of the variables listed below from the line listing in Table 2.9, identify which measure of central location is best for representing the data.

  • Mode
  • Median
  • Mean
  • Geometric mean
  • No measure of central location is appropriate

________ 1. Year of diagnosis

________ 2. Age (years)

________ 3. Sex

________ 4. Highest IFA titer

________ 5. Platelets x 10^6/L

________ 6. White blood cell count x 10^9/L

Table 2.9 Line Listing for 12 Patients with Human Monocytotropic Ehrlichiosis — Missouri, 1998–1999

Patient ID Year of Diagnosis Age (years) Sex Highest IFA* Titer Platelets x 10^6/L White Blood Cell Count x 10^9/L
01 1999 44 M 1:1024 90 1.9
02 1999 42 M 1:512 114 3.5
03 1999 63 M 1:2048 83 6.4
04 1999 53 F 1:512 180 4.5
05 1999 77 M 1:1024 44 3.5
06 1999 43 F 1:512 89 1.9
10 1998 22 F 1:128 142 2.1
11 1998 59 M 1:256 229 8.8
12 1998 67 M 1:512 36 4.2
14 1998 49 F 1:4096 271 2.6
15 1998 65 M 1:1024 207 4.3
18 1998 27 M 1:64 246 8.5
Mean: 1998.5 50.92 na 1:976.00 144.25 4.35
Median: 1998.5 51 na 1:512 128 3.85
Geometric Mean: 1998.5 48.08 na 1:574.70 120.84 3.81
Mode: none none M 1:512 none 1.9, 3.5

*Immunofluorescence assay

Data Source: Olano JP, Masters E, Hogrefe W, Walker DH. Human monocytotropic ehrlichiosis, Missouri. Emerg Infect Dis 2003;9:1579-86.

Check your answers
