Lesson 2: Summarizing Data

Answers to Self-Assessment Quiz

Line list or line listing. A line listing is a table in which each row typically represents one person or case of disease, and each column represents a variable such as ID, age, sex, etc.
Sex A, D, F
Age B, G, H
Lymphocyte count B, G, H
Sex is a nominal variable, meaning that its categories have names but not numerical values. Nominal variables are qualitative or categorical variables.
Age and lymphocyte count are ratio variables because they are both numeric variables with true zero points. Ratio variables are continuous and quantitative variables.
A. Because the centers of each distribution line up, they have the same measure of central location. But because each distribution is spread differently, they have different measures of spread.
B, C, E. Right/left skewness refers to the tail of a distribution. Because the “hump” of this distribution is on the left and the tail is on the right, it is said to be positively skewed, or skewed to the right. A skewed distribution is not symmetrical.
C. For a distribution such as that shown in Figure 2.12, with its hump to the left, the mode will be smaller than either the median or the mean. The long tail to the right will pull the mean upward, so that the sequence will be mode < median < mean.
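The ordering mode < median < mean can be checked on a small example. The sketch below uses made-up, right-skewed values (they are not the data behind Figure 2.12) and Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed values: most are small, a few are large.
values = [1, 2, 2, 2, 3, 3, 4, 5, 8, 15]

m_mode = mode(values)      # most frequent value
m_median = median(values)  # middle of the sorted values
m_mean = mean(values)      # pulled upward by the long right tail

print(m_mode, m_median, m_mean)
```

With these illustrative values, the long right tail pulls the mean above the median, and the median sits above the mode.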
B. The mode is the value that occurs most often.
C. The median is the value that has half the observations below it and half above it.
D. The mean is the value that is statistically closest to all of the values in the distribution.
D. The geometric mean is the value that is statistically closest to all of the values in the distribution on a log scale.
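As a rough illustration of "on a log scale", the geometric mean can be computed by averaging the logarithms of the values and then converting back. The values below are hypothetical, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import math
from statistics import geometric_mean

# Hypothetical values spanning several orders of magnitude.
values = [1, 10, 100]

# Average the natural logs, then exponentiate to return to the original scale.
log_mean = sum(math.log(x) for x in values) / len(values)
gm_from_logs = math.exp(log_mean)

# The standard library computes the same quantity directly.
gm = geometric_mean(values)

print(gm_from_logs, gm)
```

For these values the arithmetic mean is 37, while the geometric mean is about 10, the midpoint of the values on a log scale.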
C, E. The mode is the value that occurs most often. A distribution can have one mode, more than one mode, or no mode. In this distribution, both 38.0°C and 38.5°C appear 3 times.
D. The median is the value that has half the observations below it and half above it. For a distribution with an even number of values, the median falls between 2 observations, in this situation between the 7th and 8th values. The 7th value is 38.2°C and the 8th value is 38.5°C, so the median is the average of those two values, i.e., 38.35°C.
C. The mean is the average of all the values. Given 14 temperatures that sum to 531.6, the mean is 531.6 ⁄ 14 = 37.97°C, which rounds to 38.0°C.
A. The midrange is halfway between the smallest and largest values. Since the lowest and highest temperatures are 35.1°C and 39.6°C, the midrange is calculated as (35.1 + 39.6) ⁄ 2, or 37.35°C.
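The arithmetic in the three preceding answers can be reproduced from the figures they quote. This sketch uses only those quoted figures (the count, the sum, the 7th and 8th values, and the extremes), not the full list of temperatures; the variable names are illustrative.

```python
# Figures quoted in the answers above, in degrees Celsius.
n = 14                        # number of temperatures
total = 531.6                 # sum of all temperatures
seventh, eighth = 38.2, 38.5  # the two middle values after sorting
lowest, highest = 35.1, 39.6  # extreme values

median = (seventh + eighth) / 2   # average of the two middle values
mean = total / n                  # sum divided by count
midrange = (lowest + highest) / 2 # note the parentheses around the sum

print(round(median, 2), round(mean, 2), round(midrange, 2))
```

The parentheses in the midrange matter: without them, only 39.6 would be divided by 2 before the addition.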
B. In epidemiology, the measure of central location generally preferred for summarizing skewed data such as incubation periods is the median.
A. The measure of central location generally preferred for additional statistical analysis is the mean, which is the only measure that has good statistical properties.
A, C, D, E. Interquartile range, range, standard deviation, and variance are all measures of spread. A percentile identifies a particular place on the distribution, but is not a measure of spread.
B. The range is the difference between the extreme values on either side, so it is most directly affected by those values.
B. The interquartile range covers the central 50% of a distribution.
C. The interquartile range usually accompanies the median, since both are based on percentiles. The interquartile range covers from the 25th to the 75th percentile, while the median marks the 50th percentile.
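A minimal sketch of how the quartiles and the median relate, using hypothetical values. It relies on `statistics.quantiles` with its default method; other quartile conventions, possibly including the one taught in this lesson, can give slightly different cut points on the same data.

```python
from statistics import median, quantiles

# Hypothetical data: the integers 1 through 11.
values = list(range(1, 12))

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1  # width of the central 50% of the distribution

print(q1, q2, q3, iqr)
```

Here the 50th percentile returned by `quantiles` is the same value that `median` returns, which is why the two summaries are usually reported together.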
A. The standard deviation usually accompanies the arithmetic mean.
A. The standard deviation is the square root of the variance.
A, D. Use of the mean and standard deviation is usually restricted to data that are more-or-less normally distributed. Calculation of the standard deviation requires squaring differences and then taking the square root, so you need a calculator that has a square-root function.
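The steps described above (squaring differences from the mean, then taking a square root) can be sketched as follows. The values are hypothetical, and the sketch assumes the sample form of the variance, which divides by n − 1.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical observations.
values = [2, 4, 4, 4, 5, 5, 7, 9]

avg = mean(values)

# Square each observation's difference from the mean.
squared_diffs = [(x - avg) ** 2 for x in values]

# Sample variance: sum of squared differences divided by n - 1 (an assumption here).
var = sum(squared_diffs) / (len(values) - 1)

# The standard deviation is the square root of the variance.
sd = math.sqrt(var)

print(var, sd)
```

The hand calculation agrees with `statistics.variance` and `statistics.stdev`, which use the same n − 1 divisor.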
B. Distributions A, B, and C all range from 1 to 39 and have two central values of 20. Considering the eight values other than the smallest and largest, Distribution C has values close to 20 (from 15 to 25), Distribution A has values from 10 to 30, and Distribution B has values from 3 to 37. So Distribution B has the broadest spread among the first 3 distributions. Distribution D has larger values than the first 3 distributions (41–49 rather than 1–39), but they cluster rather tightly around the central value of 45.
A, E. The area from the 2.5th percentile to the 97.5th percentile includes 95% of the area below the curve, which corresponds to ± 1.96 standard deviations along the x-axis.
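The link between the 95% area and ± 1.96 standard deviations can be checked numerically. This sketch assumes a standard normal curve (mean 0, standard deviation 1) and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area under the curve between -1.96 and +1.96 standard deviations.
area = z.cdf(1.96) - z.cdf(-1.96)

# Going the other way: the x-values at the 2.5th and 97.5th percentiles.
lower = z.inv_cdf(0.025)
upper = z.inv_cdf(0.975)

print(round(area, 4), round(lower, 2), round(upper, 2))
```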
A. The primary use of the standard error of the mean is in calculating a confidence interval.
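As a sketch of that use, the standard error of the mean can be taken as the standard deviation divided by the square root of the sample size, and an approximate 95% confidence interval as the mean ± 1.96 standard errors. The values are hypothetical, and the 1.96 multiplier assumes a normal approximation; for small samples a larger multiplier from the t distribution is commonly used instead.

```python
import math
from statistics import mean, stdev

# Hypothetical observations.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
avg = mean(values)

# Standard error of the mean: sample standard deviation over sqrt(n).
sem = stdev(values) / math.sqrt(n)

# Approximate 95% confidence interval (normal approximation assumed).
ci_lower = avg - 1.96 * sem
ci_upper = avg + 1.96 * sem

print(round(sem, 3), round(ci_lower, 2), round(ci_upper, 2))
```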
