In this lesson, we consider ways to organize and summarize one quantitative variable. This allows us to see the distribution of the variable. The distribution is the values that the variable can take on along with how often they occur.
A histogram is a graphical display of the frequency (or relative frequency) table. Unlike categorical data, quantitative data have no naturally occurring categories, so we must make our own categories in order to construct a frequency table. These categories are called class intervals.
Example 1: Fifteen small businesses were asked how many employees they have: 21, 14, 6, 9, 21, 4, 17, 19, 19, 21, 20, 14, 19, 24, 3
| Number of Employees | Frequency | Relative Frequency |
|---|---|---|
| \(>0-5\) | \(2\) | \(\frac{2}{15}=0.13\) |
| \(>5-10\) | \(2\) | \(\frac{2}{15}=0.13\) |
| \(>10-15\) | \(2\) | \(\frac{2}{15}=0.13\) |
| \(>15-20\) | \(5\) | \(\frac{5}{15}=0.33\) |
| \(>20-25\) | \(4\) | \(\frac{4}{15}=0.27\) |
| Total | \(15\) | \(1\) (the rounded values sum to \(0.99\)) |
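The frequency table above can be reproduced in R. The following is a minimal sketch, assuming the class intervals are closed on the right (so that `>0-5` means greater than 0 and up to and including 5). The names `classes`, `freq` and `rel_freq` are chosen for illustration only.

```r
# Data from Example 1
emp <- c(21, 14, 6, 9, 21, 4, 17, 19, 19, 21, 20, 14, 19, 24, 3)

# Assign each observation to a class interval (0,5], (5,10], ..., (20,25]
classes <- cut(emp, breaks = seq(0, 25, by = 5))

# Frequency and relative frequency for each class interval
freq <- table(classes)
rel_freq <- freq / length(emp)

# Show both columns side by side, relative frequencies rounded to 2 decimals
cbind(Frequency = freq, RelativeFrequency = round(rel_freq, 2))
```

The unrounded relative frequencies sum to exactly 1. Adding the rounded values gives 0.99, which is only a rounding artifact.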
A histogram puts the class intervals on the horizontal axis and the frequency (or relative frequency) on the vertical axis. Rectangles of equal width are placed over each class interval with height equal to the corresponding frequency (or relative frequency). In Example 1, we would put a rectangle of height 5 over the class interval that goes from >15 - 20, since the frequency in that class interval is equal to 5.
Example 2: Construct a histogram for the data in Example 1
Answer: First, enter the data
> emp <- c(21,14,6,9,21,4,17,19,19,21,20,14,19,24,3)
Then, create the histogram using the hist function. The main= argument sets the main title and the xlab= argument sets the x-axis label.
> hist(emp,xlab="Number of Employees",main="Histogram Example")
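By default, `hist` chooses the class intervals itself. To make sure the histogram uses the class intervals from the frequency table in Example 1, the `breaks=` argument can be set explicitly. This is a sketch, assuming the same right-closed intervals of width 5; `h` is an illustrative name.

```r
emp <- c(21, 14, 6, 9, 21, 4, 17, 19, 19, 21, 20, 14, 19, 24, 3)

# Use the class boundaries 0, 5, 10, 15, 20, 25 from the frequency table
h <- hist(emp, breaks = seq(0, 25, by = 5),
          xlab = "Number of Employees", main = "Histogram Example")

# hist() also returns the class boundaries and the bar heights it used
h$breaks
h$counts
```

The bar heights in `h$counts` can be compared directly with the Frequency column of the table.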
A boxplot is a picture of the five number summary. The five number summary consists of the minimum, the first quartile (\(Q_1\)), the median, the third quartile (\(Q_3\)) and the maximum.
The first quartile is the point below which 25% of the data lie (the 25th percentile). The median is the point below which 50% of the data lie (the 50th percentile), and the third quartile is the point below which 75% of the data lie (the 75th percentile). We will consider these statistics in more detail in Lessons 2.4 and 2.5. Right now, all you need to know is that these percentiles divide the data into four pieces. For example, if we had 40 observations, then 10 of them (25%) would lie below the first quartile and 30 of them (75%) would lie below the third quartile.
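To see the 40-observation example concretely, here is a small sketch using the made-up data 1, 2, ..., 40 (these values are not from the lesson). It assumes R's `quantile` function with its default method; other quartile conventions can give slightly different cut points.

```r
# Hypothetical data: 40 distinct observations
x <- 1:40

# First quartile, median and third quartile
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# Count how many observations lie below each quartile
sum(x < q[1])   # below the first quartile
sum(x < q[2])   # below the median
sum(x < q[3])   # below the third quartile
```

The three counts are 10, 20 and 30, that is, 25%, 50% and 75% of the 40 observations.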
Example 3: Twelve economists were asked to predict the percentage growth in the Consumer Price Index over the next year. Their forecasts were as follows: 3.6, 2.1, 2.8, 2.7, 3.5, 3.7, 4.1, 3.1, 3.7, 3.4, 3.8, 3.8. Construct a boxplot of the forecasts.
Answer: First, enter the data
> cpi <- c(3.6,2.1,2.8,2.7,3.5,3.7,4.1,3.1,3.7,3.4,3.8,3.8)
Then we construct the boxplot with the boxplot function
> boxplot(cpi,horizontal=TRUE,xlab="% Growth",main="Boxplot Example")
The horizontal=TRUE argument tells R to make the boxplot horizontal (the default is vertical), the main= argument defines the title and the xlab= argument defines the label for the x-axis.
Here is the five number summary, which lets us check that the boxplot above is indeed a picture of it.
> fivenum(cpi)
[1] 2.10 2.95 3.55 3.75 4.10
The box runs from 2.95 to 3.75 with a line at the median of 3.55. The lower whisker goes to the minimum of 2.10 and the upper whisker goes to the maximum of 4.10.
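The same check can be done in code rather than by eye. This is a sketch, assuming the `boxplot.stats` function (which `boxplot` uses to compute what it draws); its `stats` component holds the lower whisker end, the lower box edge, the median, the upper box edge and the upper whisker end. The name `bp` is illustrative.

```r
cpi <- c(3.6, 2.1, 2.8, 2.7, 3.5, 3.7, 4.1, 3.1, 3.7, 3.4, 3.8, 3.8)

# Values the boxplot draws: whisker end, box edge, median, box edge, whisker end
bp <- boxplot.stats(cpi)
bp$stats

# With no outliers, these agree with the five number summary
all.equal(bp$stats, fivenum(cpi))

# Number of observations flagged as outliers
length(bp$out)
```

For these data no observation is flagged, so the whiskers reach the minimum and the maximum.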
Sometimes, we have an observation in our data that is much smaller or larger than the rest of the data. We call this type of observation an outlier. To determine whether an observation is extreme enough to be considered an outlier, we use the quartile rule. An observation is classified as an outlier if it is less than \(Q_1 - 1.5 \times (Q_3-Q_1)\) or if it is greater than \(Q_3 + 1.5 \times (Q_3-Q_1)\).
All outliers should be taken seriously and should be investigated thoroughly for explanations. Automatic outlier-rejection schemes are particularly dangerous.
The classic case of automatic outlier rejection becoming automatic information rejection was the South Pole ozone depletion problem. Ozone depletion over the South Pole would have been detected years earlier except for the fact that the satellite data recording the low ozone readings had outlier-rejection code that automatically screened out the “outliers” (that is, the low ozone readings) before the analysis was conducted. Such inadvertent (and incorrect) purging went on for years. It was not until ground-based South Pole readings started detecting low ozone readings that someone decided to double-check why the satellite had not picked up this fact. It had, but the readings had been thrown out!
The best attitude is that outliers are our “friends”: they are trying to tell us something, and we should not stop until we are comfortable with the explanation for each outlier.
Example 4: Eight customers were asked how old they were (in years): 30, 32, 45, 45, 46, 51, 52, 79. Are any of these observations outliers?
Answer: First, enter the data
> age <- c(30,32,45,45,46,51,52,79)
Then, we compute the five number summary
> fivenum(age)
[1] 30.0 38.5 45.5 51.5 79.0
So \(Q_1 = 38.5\) and \(Q_3 = 51.5\) and we consider an observation to be an outlier if it is below \[38.5 - 1.5\times(51.5 - 38.5)=19\] or if it is above \[51.5 + 1.5\times(51.5 - 38.5)=71\] So we have one outlier at 79.
Note: An alternative way of doing this is to save the five number summary into an object; let’s call it age5numsum.
> age5numsum <- fivenum(age)
This object is a vector with 5 numbers.
> age5numsum
[1] 30.0 38.5 45.5 51.5 79.0
The minimum is in age5numsum[1], the first quartile is in age5numsum[2], the median is in age5numsum[3], the third quartile is in age5numsum[4] and the maximum is in age5numsum[5]. For example, the first quartile is
> age5numsum[2]
[1] 38.5
Now find the lower and upper cutoffs for outliers
> age5numsum[2] - 1.5*(age5numsum[4]-age5numsum[2])
[1] 19
> age5numsum[4] + 1.5*(age5numsum[4]-age5numsum[2])
[1] 71
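The two cutoff calculations can be wrapped in a small helper so that the quartile rule is easy to reuse on other data. This is only a sketch; the function name `outlier_cutoffs` and the variable `cutoffs` are made up for illustration and are not part of R.

```r
# Hypothetical helper: lower and upper outlier cutoffs from the quartile rule,
# using the quartiles reported by fivenum()
outlier_cutoffs <- function(x) {
  s <- fivenum(x)
  q1 <- s[2]
  q3 <- s[4]
  c(lower = q1 - 1.5 * (q3 - q1), upper = q3 + 1.5 * (q3 - q1))
}

age <- c(30, 32, 45, 45, 46, 51, 52, 79)
cutoffs <- outlier_cutoffs(age)
cutoffs

# Observations outside the cutoffs are classified as outliers
age[age < cutoffs["lower"] | age > cutoffs["upper"]]
```

For the customer ages this returns the cutoffs 19 and 71, and the only observation outside them is 79.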
Example 5: Construct a boxplot of the data in Example 4
Answer:
> boxplot(age,horizontal=TRUE,xlab="Customer Ages")
The boxplot has a box that runs from 38.5 to 51.5 with a line at 45.5. There are no lower outliers, so the lower whisker runs from 30 to 38.5. We determined that 79 was an outlier, so that observation is marked with a special symbol. The upper whisker only goes out to 52, which is the largest observation that is not an outlier.
NOTE: The whisker does NOT go to the cutoff value of 71. The whisker must go to an actual value in the data. It goes to the largest (or smallest) value that is not an outlier.
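Where the whiskers end can be confirmed numerically. This is a sketch, assuming the `boxplot.stats` function, whose `stats` component gives the whisker ends and box edges and whose `out` component lists the points drawn as outliers; `bp` is an illustrative name.

```r
age <- c(30, 32, 45, 45, 46, 51, 52, 79)

bp <- boxplot.stats(age)

# Lower whisker, first quartile, median, third quartile, upper whisker
bp$stats

# Observations marked as outliers
bp$out
```

Here the upper whisker ends at 52, the largest observation that does not exceed the upper cutoff of 71, and 79 is reported separately as an outlier.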
An important feature of histograms and boxplots is the shape of the distribution. There are three main shapes: symmetric, skewed right and skewed left. Distributions that are skewed right have a longer right tail and those that are skewed to the left have a longer left tail. We will consider these shapes in more detail when we learn about measures of center in Lesson 2.4.
How do you decide whether to use a boxplot or a histogram to show the distribution of a quantitative variable? Part of it is personal preference and part of it depends on the data. There’s not necessarily a hard and fast answer but rather something that you get a feel for as you gain experience. However, you can see in the example below that sometimes the boxplot and histogram highlight different aspects of the distribution.
The histogram shows that the distribution has two peaks, which is not apparent in the boxplot. The boxplot makes it easy to identify the values of the outliers.
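The comparison can be tried out with a small sketch. The values below are made up for illustration (two clusters plus one unusually large observation) and are not the data behind the lesson's example; the names `x`, `h` and `old_par` are likewise illustrative.

```r
# Hypothetical data: two clusters of values and one unusually large observation
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5,
       11, 12, 12, 13, 13, 13, 14, 14, 15,
       30)

# Draw the histogram and the boxplot side by side
old_par <- par(mfrow = c(1, 2))
h <- hist(x, breaks = seq(0, 30, by = 5), main = "Histogram", xlab = "x")
boxplot(x, horizontal = TRUE, main = "Boxplot", xlab = "x")
par(old_par)

# The bar heights show two separate peaks
h$counts

# The boxplot singles out the extreme value
boxplot.stats(x)$out
```

With these values the histogram has two tall bars separated by an empty class interval, while the boxplot shows a single box and marks 30 as an outlier.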