Learning Objectives

  1. Construct frequency distributions
  2. Construct histograms
  3. Calculate measures of center and spread of a variable
  4. Examine categorical data with tables and graphs

Single Variable Descriptive Statistics for Numerical Data

Frequency Distributions

The simplest way to analyze a numerical variable is by grouping the data into “classes” and analyzing their counts, or frequencies. We introduced that last week by using the table() function on the species variable in the iris dataset.

# Store iris dataframe into environment

iris <- iris 
# View table of species variable in iris

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

That was a very simple example, though, because the variable was categorical and had only three values. What about a variable with far more possible values, even infinitely many continuous values? Let’s create an R chunk and run the table() function on the Sepal.Length variable.

# Table of Sepal.Length variable

table(iris$Sepal.Length)
## 
## 4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 6.1 6.2 
##   1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6   6   4 
## 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9 
##   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1

Well, that wasn’t fun. There are so many unique values that the table doesn’t tell you much. A better approach is to split the variable into classes and tabulate those. To do that, we first want to know the min and max values of the variable. We can find them with the summary() function.

# Summary of Sepal.Length variable

summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900

OK, so I see that this variable has a minimum value of 4.3 and a max of 7.9. There is no single number of classes that is best for a frequency distribution. A good rule of thumb is that you want between 5 and 20 classes. I’ll round the 4.3 down to 4, round the 7.9 up to 8, and see there is a range of 8 - 4 = 4 in this dataset. A total range of 4 divides neatly into 8 groupings of 0.5, so I’ll go with 8 classes, each of width 0.5.

# Setting "bins" for frequency distribution
bins <- seq(from = 4, to = 8, by = 0.5)
bins
## [1] 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0

# Split our Sepal.Length variable into the classes set by the bins vector
sepal_chunks <- cut(iris$Sepal.Length, breaks = bins)
sepal_chunks
##   [1] (5,5.5] (4.5,5] (4.5,5] (4.5,5] (4.5,5] (5,5.5] (4.5,5] (4.5,5] (4,4.5]
##  [10] (4.5,5] (5,5.5] (4.5,5] (4.5,5] (4,4.5] (5.5,6] (5.5,6] (5,5.5] (5,5.5]
##  [19] (5.5,6] (5,5.5] (5,5.5] (5,5.5] (4.5,5] (5,5.5] (4.5,5] (4.5,5] (4.5,5]
##  [28] (5,5.5] (5,5.5] (4.5,5] (4.5,5] (5,5.5] (5,5.5] (5,5.5] (4.5,5] (4.5,5]
##  [37] (5,5.5] (4.5,5] (4,4.5] (5,5.5] (4.5,5] (4,4.5] (4,4.5] (4.5,5] (5,5.5]
##  [46] (4.5,5] (5,5.5] (4.5,5] (5,5.5] (4.5,5] (6.5,7] (6,6.5] (6.5,7] (5,5.5]
##  [55] (6,6.5] (5.5,6] (6,6.5] (4.5,5] (6.5,7] (5,5.5] (4.5,5] (5.5,6] (5.5,6]
##  [64] (6,6.5] (5.5,6] (6.5,7] (5.5,6] (5.5,6] (6,6.5] (5.5,6] (5.5,6] (6,6.5]
##  [73] (6,6.5] (6,6.5] (6,6.5] (6.5,7] (6.5,7] (6.5,7] (5.5,6] (5.5,6] (5,5.5]
##  [82] (5,5.5] (5.5,6] (5.5,6] (5,5.5] (5.5,6] (6.5,7] (6,6.5] (5.5,6] (5,5.5]
##  [91] (5,5.5] (6,6.5] (5.5,6] (4.5,5] (5.5,6] (5.5,6] (5.5,6] (6,6.5] (5,5.5]
## [100] (5.5,6] (6,6.5] (5.5,6] (7,7.5] (6,6.5] (6,6.5] (7.5,8] (4.5,5] (7,7.5]
## [109] (6.5,7] (7,7.5] (6,6.5] (6,6.5] (6.5,7] (5.5,6] (5.5,6] (6,6.5] (6,6.5]
## [118] (7.5,8] (7.5,8] (5.5,6] (6.5,7] (5.5,6] (7.5,8] (6,6.5] (6.5,7] (7,7.5]
## [127] (6,6.5] (6,6.5] (6,6.5] (7,7.5] (7,7.5] (7.5,8] (6,6.5] (6,6.5] (6,6.5]
## [136] (7.5,8] (6,6.5] (6,6.5] (5.5,6] (6.5,7] (6.5,7] (6.5,7] (5.5,6] (6.5,7]
## [145] (6.5,7] (6.5,7] (6,6.5] (6,6.5] (6,6.5] (5.5,6]
## Levels: (4,4.5] (4.5,5] (5,5.5] (5.5,6] (6,6.5] (6.5,7] (7,7.5] (7.5,8]

# Create table
table(sepal_chunks)
## sepal_chunks
## (4,4.5] (4.5,5] (5,5.5] (5.5,6] (6,6.5] (6.5,7] (7,7.5] (7.5,8] 
##       5      27      27      30      31      18       6       6
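
As a quick check on the table above, the class counts should add up to the number of rows in iris, and no value should fall outside the bins. The sketch below reruns the same cut() call and also uses prop.table(), a base R function not used elsewhere in these notes, to turn counts into proportions. Note that cut() by default builds classes that exclude the left endpoint and include the right one, which is why the labels read (4,4.5].

```r
# Rebuild the classes exactly as above
bins <- seq(from = 4, to = 8, by = 0.5)
sepal_chunks <- cut(iris$Sepal.Length, breaks = bins)

# Every flower should fall into exactly one class
total_count <- sum(table(sepal_chunks))
total_count                      # compare with nrow(iris)

# Values outside the bins would be turned into NA by cut()
n_missing <- sum(is.na(sepal_chunks))
n_missing

# Relative frequencies (proportions) instead of raw counts
rel_freq <- prop.table(table(sepal_chunks))
round(rel_freq, 3)
```

If n_missing were greater than zero, the bins would need to be widened (or include.lowest = TRUE added) so that no observation is dropped.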

Visualizing a frequency distribution

OK, if you didn’t follow all that, it’s ok. I, at one time, had to look up how to do it all myself. :)

More often than creating a table of counts, we visualize the counts with a histogram, using the hist() function.

# Histogram of Sepal Length variable with about 7 breaks (hist() treats a single number as a suggestion)
hist(iris$Sepal.Length, breaks = 7, main = "Sepal Length Distribution", xlab = "Sepal Length", col = "darkgray")

Histograms are a very common way for analysts to visualize a variable before diving into analysis. As the textbook mentioned, you can tell whether a distribution is symmetric, left-skewed, or right-skewed by looking at its histogram.
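
To make a histogram line up with the frequency table built earlier, one option is to pass the same vector of break points to hist() instead of a single number. This is a sketch under that assumption; the name h is only illustrative.

```r
# Use the same class boundaries as the frequency table
bins <- seq(from = 4, to = 8, by = 0.5)

# With a vector of breaks, hist() uses exactly those boundaries
h <- hist(iris$Sepal.Length, breaks = bins,
          main = "Sepal Length Distribution",
          xlab = "Sepal Length", col = "darkgray")

# hist() also returns the bar heights, which can be compared with the table
h$counts
```

Because hist() and cut() both include the right endpoint of each class by default, the bar heights should agree with the counts from table(sepal_chunks).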


Center of a Variable

When you have a single variable, a common thing of interest is the “middle” or “center” or “average” or “typical value” of that variable. There are a few known ways to calculate that, but by far the two most popular are by calculating the mean and median.

# Go around the room, with everyone picking a favorite number between 1 and 100
fave_nums <- c(17, 29, 6.28, 92)
# Find the mean of fave_nums
mean(fave_nums)
## [1] 36.07
# Find the median of fave_nums
median(fave_nums)
## [1] 23

Is either one of those statistics a “better” measure of center of the variable?

Generally, we prefer the median over the mean if the data contains outliers, because the median is not pulled toward extreme values the way the mean is.
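
A small experiment illustrates why. The sketch below adds one made-up extreme value (1000, which is outside the 1 to 100 range and purely hypothetical) to the fave_nums vector and compares how far each measure moves.

```r
fave_nums <- c(17, 29, 6.28, 92)

# Add one hypothetical outlier
with_outlier <- c(fave_nums, 1000)

# The mean is pulled strongly toward the outlier
mean(fave_nums)        # 36.07
mean(with_outlier)

# The median only shifts to the next middle value
median(fave_nums)      # 23
median(with_outlier)   # 29
```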

Practice

The two variables below are samples of waiting time for two banks. What is the mean wait time for each bank? What is the median wait time for each bank?

# Waiting times at two banks
bankA <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)
bankB <- c(6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8)
# Mean and median of Bank A
mean(bankA)
## [1] 7.2
median(bankA)
## [1] 7.2
# Mean and median of Bank B
mean(bankB)
## [1] 7.2
median(bankB)
## [1] 7.2

What did you observe about the centers of the two banks? They’re all the same: both banks have a mean and a median of 7.2.

Is there any difference in the wait time of each bank? Not for the average client, but individual waits vary much more at Bank A (4.1 to 11.0) than at Bank B (6.6 to 7.8).

If you had this sample data available to you, which bank would you choose to go to? Bank B has less variability (its wait times are clustered much more tightly around 7.2), so I would probably choose Bank B.


Spread of a Variable

The most common way to measure the spread of a variable is its standard deviation, which is computed by the sd() function. Standard deviation tries to capture some essence of “how far on average the data values are from the mean”.

# Find the standard deviation of bank A and bank B
sd(bankA)
## [1] 1.961122
sd(bankB)
## [1] 0.4449719
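
For reference, sd() computes the sample standard deviation: the square root of the sum of squared deviations from the mean, divided by n - 1. The helper name manual_sd below is hypothetical and only meant to show the arithmetic.

```r
bankA <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)

# Sample standard deviation written out step by step
manual_sd <- function(x) {
  deviations <- x - mean(x)                 # distance of each value from the mean
  sqrt(sum(deviations^2) / (length(x) - 1)) # divide by n - 1, then take the root
}

manual_sd(bankA)
sd(bankA)        # should match the manual calculation
```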

Visualizing spread of a variable with boxplots

Boxplots are a good way to analyze the spread of a variable.

# Boxplots of Bank A and Bank B
boxplot(bankA)

boxplot(bankB)

Boxplots are even more useful side by side, where they let you compare the spread of one variable against another.

# Side by side box plots of Bank A and B

boxplot(bankA, bankB)
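
The numbers behind a boxplot can also be printed directly. As a sketch, quantile() gives the minimum, quartiles, and maximum, and IQR() gives the distance between the first and third quartiles, which is roughly the height of the box. The names and ylab arguments are optional and are used here only to label the plot; the label text is an assumption.

```r
bankA <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)
bankB <- c(6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8)

# Minimum, quartiles, and maximum of each bank
quantile(bankA)
quantile(bankB)

# Interquartile range: the spread of the middle half of the data
iqr_A <- IQR(bankA)
iqr_B <- IQR(bankB)
iqr_A
iqr_B

# Same side-by-side boxplot, with a label under each box
boxplot(bankA, bankB, names = c("Bank A", "Bank B"), ylab = "Wait time")
```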

Example: How does the sepal length of flowers vary based on the species?

# Side by side boxplot based on the factor variable of species

boxplot(iris$Sepal.Length ~ iris$Species)
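
The same comparison can be made numerically. As a sketch, tapply() applies a function to Sepal.Length separately within each level of Species; the variable names below are illustrative.

```r
# Center of sepal length within each species
species_means   <- tapply(iris$Sepal.Length, iris$Species, mean)
species_medians <- tapply(iris$Sepal.Length, iris$Species, median)

# Spread of sepal length within each species
species_sds <- tapply(iris$Sepal.Length, iris$Species, sd)

species_means
species_medians
species_sds
```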


Examining Categorical Data

Categorical data is (often) non-numeric data that we usually store as factors. One of the videos for this chapter looked at the number variable in the email dataset, which can take three values: “none”, “small”, and “big”. These values are called the levels of the factor.
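
To see how R stores a factor, the sketch below inspects the Species variable from iris and then builds a small factor by hand. The sizes vector is made-up data, used only to show how the order of the levels can be set explicitly.

```r
# Species in iris is already a factor
class(iris$Species)     # "factor"
levels(iris$Species)    # "setosa" "versicolor" "virginica"

# Hypothetical character data converted to a factor with a chosen level order
sizes <- c("small", "none", "big", "small", "small")
sizes_f <- factor(sizes, levels = c("none", "small", "big"))

# table() reports counts in the order of the levels
table(sizes_f)
```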

Tables

It is typically desirable and useful to analyze a factor variable by looking at a table.

# Load openintro library.  Install package in console first if needed.
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
# Store email into environment
email <- email

# Create a table of the spam variable
table(email$spam)
## 
##    0    1 
## 3554  367
# Percentage of the 3921 emails in each level of the number variable
round(table(email$number)/3921*100,5)
## 
##     none    small      big 
## 14.00153 72.09895 13.89952

Barplots

It can also be useful to visualize these tables with a barplot.

# Barplot of the table of the number variable
barplot(table(email$number))

If Time Permits…

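# Subset data by number variable
none <- subset(email, number == "none")
small <- subset(email, number == "small")
big <- subset(email, number == "big")
# Find tables of the spam in each subset
table(none$spam)
table(small$spam)
table(big$spam)
# Find percent of spam in none subset
mean(none$spam == 1) * 100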