There are many graphical representations to illustrate summary measures for numerical variables; however, there are two specific ones that should be at the forefront of every exploratory analysis - histograms and boxplots.
Histograms are the most common type of chart for showing the distribution of a numerical variable. They display the distribution by dividing the data into bins and counting the number of observations in each bin. A histogram gives a broad picture of the data by showing the center of the distribution, the variability, the skewness, and other aspects in one chart.
We use the mtcars data set to illustrate some of the ideas here.
dat <- mtcars
head(dat)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We begin by illustrating a histogram for the numerical variable mpg.
hist(dat$mpg,
xlab = "Miles Per Gallon",
main = "Histogram of MPG",
breaks = 5,
col = "blue",
border = "red")
You should always label your axes and give the plot a title. The breaks argument belongs to the hist() function. Entering a single integer gives R a suggestion for how many bars to use, but it is only a suggestion: R chooses "pretty" break points, so the bin edges fall on round values and the number of bars may differ from what you asked for. For this data set, mpg spans roughly 10 to 34, so R settles on round bin widths such as 1, 2, 5, or 10 rather than honoring every requested count exactly.
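As a rough check on how R treats the breaks suggestion, the sketch below asks hist() for its bins without drawing anything. The object name h is only illustrative; plot = FALSE is a standard hist() argument.

```r
dat <- mtcars

# Compute the histogram without drawing it
h <- hist(dat$mpg, breaks = 5, plot = FALSE)

h$breaks   # bin edges chosen by R
h$counts   # number of observations in each bin

# Every observation falls in exactly one bin
sum(h$counts) == nrow(dat)
```

Trying other values of breaks in the same call is a quick way to see which suggestions R actually honors.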
Skewness is a measure of the asymmetry of the data. When the mean is larger than the median, the distribution is typically right skewed. When the mean is smaller than the median, it is typically left skewed. We illustrate with graphs below.
We note that if your data is highly skewed, the median gives a much clearer idea of the center of the distribution than the mean.
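One way to apply the mean-versus-median comparison is sketched below for the hp variable. The variable names hp_mean and hp_median are illustrative, and the comparison is only a rule of thumb, not a formal skewness statistic.

```r
dat <- mtcars

hp_mean   <- mean(dat$hp)
hp_median <- median(dat$hp)

# A mean above the median suggests a longer right tail
hp_mean > hp_median

# Mark both measures of center on the histogram
hist(dat$hp, xlab = "Horsepower", main = "Histogram of Horsepower")
abline(v = hp_mean, lty = 1, lwd = 2)     # solid line: mean
abline(v = hp_median, lty = 2, lwd = 2)   # dashed line: median
```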
A histogram also gives us a natural way to describe the data in terms of its mean and standard deviation.
Chebyshev’s Theorem: For any number \(k > 1\), at least \(1 - 1/k^2\) of the data values lie within \(k\) standard deviations of the mean.
What this means is that at least 75% of the measurements are within 2 standard deviations of the mean and at least 89% are within 3 standard deviations.
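The bound can be compared with what actually happens in a sample. The sketch below does this for mpg; the helper within_k() is a hypothetical function written for this illustration, and sd() is R's sample standard deviation.

```r
dat <- mtcars
x <- dat$mpg

# Proportion of observations within k standard deviations of the mean
within_k <- function(x, k) mean(abs(x - mean(x)) <= k * sd(x))

k <- c(2, 3)
chebyshev_bound <- 1 - 1 / k^2
observed <- sapply(k, function(kk) within_k(x, kk))

# Observed proportions should be at least as large as the bound
data.frame(k, chebyshev_bound, observed)
```

Chebyshev's Theorem is a guarantee for any distribution, so the observed proportions are usually well above the bound.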
We can do even better when our data is roughly bell shaped.
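Empirical Rule: For a bell shaped (or roughly bell shaped) distribution, approximately 68% of the data lie within 1 standard deviation of the mean, 95% lie within 2 standard deviations, and 99.7% lie within 3 standard deviations.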
This confirms our earlier suggestion that most data lie within 3 standard deviations of the mean, and that any data point with a \(z\)-score satisfying \(|z| > 3\) should be considered an outlier and examined further.
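A minimal sketch of the \(z\)-score check, using the hp variable, might look like the following. The object name z is illustrative, and whether hp is close enough to bell shaped for the Empirical Rule to apply is something to judge from its histogram.

```r
dat <- mtcars

# z-score: number of standard deviations each value lies from the mean
z <- (dat$hp - mean(dat$hp)) / sd(dat$hp)

# Share of observations within 1, 2 and 3 standard deviations
sapply(1:3, function(k) mean(abs(z) <= k))

# Cars (if any) flagged by the |z| > 3 rule
rownames(dat)[abs(z) > 3]
```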
To visualize the relationship between a numerical and a categorical variable, we use boxplots. In essence, a boxplot is a visual 5-number summary. In this example, we consider the hp variable in our data set. To construct the boxplot, we begin by drawing a box from the first quartile to the third quartile. A line goes through the box at the median. We then calculate the IQR, and specifically the values \(Q_1 - 1.5 \cdot IQR\) and \(Q_3 + 1.5 \cdot IQR\). Whiskers are then extended from the box to the farthest data point within those bounds. Finally, any data points that lie outside the interval \([Q_1 - 1.5 \cdot IQR, Q_3 + 1.5 \cdot IQR]\) are marked with a dot. These are outliers (though not in the sense of being 3 standard deviations away from the mean) and they should be examined further.
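The construction steps can be sketched numerically as follows. The names q1, q3, lower_fence and upper_fence are illustrative. Note that boxplot() computes its quartiles as hinges (see fivenum()), which can differ slightly from the default quantile() values used here, so the two approaches need not always agree.

```r
dat <- mtcars

q1  <- quantile(dat$hp, 0.25)
q3  <- quantile(dat$hp, 0.75)
iqr <- q3 - q1          # same value as IQR(dat$hp)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are drawn individually on the boxplot
outliers <- dat$hp[dat$hp < lower_fence | dat$hp > upper_fence]
outliers

# The values R itself would mark when drawing the boxplot
boxplot.stats(dat$hp)$out
```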
The code below shows a basic boxplot. Changing the argument horizontal = FALSE to horizontal = TRUE rotates the diagram 90 degrees and draws the boxplot sideways.
boxplot(dat$hp,
main = "Boxplot of Horsepower",
ylab = "Horsepower",
horizontal = FALSE
)
One advantage of the boxplot is that we are able to compare 5-number summaries between categorical groups. Here, we will compare the horsepower variable across two types of transmission - automatic versus manual.
boxplot(hp ~ am, data = dat,
main = "Boxplot of Horsepower by Transmission Type",
xlab = "Transmission Type",
ylab = "Horsepower",
names = c("Automatic", "Manual"),
horizontal = FALSE
)
The comparison shows that the median horsepower is lower for manual transmissions; however, it also shows that a few manual transmission vehicles have more horsepower than any automatic transmission vehicle in this data set. This is one of the ways that we can use boxplots to quickly compare classes of data.
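The numbers behind the side-by-side boxplots can be pulled out directly. The sketch below assumes the mtcars coding of am (0 = automatic, 1 = manual); the names trans, hp_median and hp_max are illustrative.

```r
dat <- mtcars

# Label the transmission codes (0 = automatic, 1 = manual in mtcars)
trans <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Five-number summary plus mean of horsepower for each group
tapply(dat$hp, trans, summary)

# Medians and maxima on their own
hp_median <- tapply(dat$hp, trans, median)
hp_max    <- tapply(dat$hp, trans, max)
hp_median
hp_max
```

Because summary() also reports the mean, this is a convenient place to check group means, which the boxplot itself does not display.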
The analogue of the histogram for categorical variables is the barplot. The barplot() function takes a table and produces a graph showing the counts on the \(y\)-axis with the categories on the \(x\)-axis. Recall that the output of the table() function is a frequency table of the data.
table(dat$cyl)
##
## 4 6 8
## 11 7 14
barplot(table(dat$cyl),
xlab = "Number of Cylinders",
ylab = "Frequency",
main = "Engine Size",
col = "blue",
border = "red")
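If relative frequencies are more useful than raw counts, one option is to pass the table through prop.table() before plotting. This is a sketch of that variation; the names cyl_counts and cyl_props are illustrative.

```r
dat <- mtcars

cyl_counts <- table(dat$cyl)
cyl_props  <- prop.table(cyl_counts)   # counts divided by the total

round(cyl_props, 3)

# Same barplot, with proportions on the y-axis instead of counts
barplot(cyl_props,
        xlab = "Number of Cylinders",
        ylab = "Relative Frequency",
        main = "Engine Size")
```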
Anderson, David R., Williams, Thomas A. and Sweeney, Dennis J. “Statistics”. Encyclopedia Britannica, 20 Oct. 2020, https://www.britannica.com/science/statistics. Accessed 6 April 2021.
Boehmke, B., & Greenwell, B.M. 2019. Hands-On Machine Learning with R.
Boeree, George. Descriptive Statistics. (2005). http://webspace.ship.edu/cgboer/descstats.html. Accessed 6 April 2021.
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
Hyndman, R. & Fan, Y. (1996). Sample Quantiles in Statistical Packages, The American Statistician, 50(4), 361-365
Prem S. Mann, Introductory Statistics, 8th ed. 2013. John Wiley & Sons.
Wickham, Hadley and Grolemund, Garrett. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st. ed.). O’Reilly Media, Inc.