The simplest way to analyze a numerical variable is by grouping the data into “classes” and analyzing their counts, or frequencies. We introduced that last week by using the table() function on the Species variable in the iris dataset.
# Store iris dataframe into environment
iris <- iris
# View table of species variable in iris
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
That was a very simple example, though, because there were only three values and it was a categorical variable. What about a variable with far more possible values, even infinitely many continuous values? Let’s create an R chunk and run the table() function on the Sepal.Length variable.
# Table of Sepal.Length variable
table(iris$Sepal.Length)
##
## 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1 6.2
## 1 3 1 4 2 5 6 10 9 4 1 6 7 6 8 7 3 6 6 4
## 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
## 9 7 5 2 8 3 4 1 1 3 1 1 1 4 1
Well, that wasn’t much fun. There are so many unique values that the table doesn’t tell us much. A better approach is to split the variable into classes and build the table from those. To do that, we want to know the minimum and maximum values of the variable, which we can find with the summary() function.
# Summary of Sepal.Length variable
summary(iris$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
OK, so I see that this variable has a minimum value of 4.3 and a maximum of 7.9. There is no single number of classes that is best for a frequency distribution; a good rule of thumb is to use between 5 and 20. I’ll round the 4.3 down to 4, round the 7.9 up to 8, and see that there is a range of 8 - 4 = 4 in this dataset. A total range of 4 divides neatly into 8 groupings of 0.5, so I’ll go with 8 classes, each of width 0.5.
# Setting "bins" for frequency distribution
bins <- seq(from = 4, to = 8, by = 0.5)
bins
## [1] 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
# Split our Sepal.Length variable into the classes set by the bins vector
sepal_chunks <- cut(iris$Sepal.Length, breaks = bins)
sepal_chunks
## [1] (5,5.5] (4.5,5] (4.5,5] (4.5,5] (4.5,5] (5,5.5] (4.5,5] (4.5,5] (4,4.5]
## [10] (4.5,5] (5,5.5] (4.5,5] (4.5,5] (4,4.5] (5.5,6] (5.5,6] (5,5.5] (5,5.5]
## [19] (5.5,6] (5,5.5] (5,5.5] (5,5.5] (4.5,5] (5,5.5] (4.5,5] (4.5,5] (4.5,5]
## [28] (5,5.5] (5,5.5] (4.5,5] (4.5,5] (5,5.5] (5,5.5] (5,5.5] (4.5,5] (4.5,5]
## [37] (5,5.5] (4.5,5] (4,4.5] (5,5.5] (4.5,5] (4,4.5] (4,4.5] (4.5,5] (5,5.5]
## [46] (4.5,5] (5,5.5] (4.5,5] (5,5.5] (4.5,5] (6.5,7] (6,6.5] (6.5,7] (5,5.5]
## [55] (6,6.5] (5.5,6] (6,6.5] (4.5,5] (6.5,7] (5,5.5] (4.5,5] (5.5,6] (5.5,6]
## [64] (6,6.5] (5.5,6] (6.5,7] (5.5,6] (5.5,6] (6,6.5] (5.5,6] (5.5,6] (6,6.5]
## [73] (6,6.5] (6,6.5] (6,6.5] (6.5,7] (6.5,7] (6.5,7] (5.5,6] (5.5,6] (5,5.5]
## [82] (5,5.5] (5.5,6] (5.5,6] (5,5.5] (5.5,6] (6.5,7] (6,6.5] (5.5,6] (5,5.5]
## [91] (5,5.5] (6,6.5] (5.5,6] (4.5,5] (5.5,6] (5.5,6] (5.5,6] (6,6.5] (5,5.5]
## [100] (5.5,6] (6,6.5] (5.5,6] (7,7.5] (6,6.5] (6,6.5] (7.5,8] (4.5,5] (7,7.5]
## [109] (6.5,7] (7,7.5] (6,6.5] (6,6.5] (6.5,7] (5.5,6] (5.5,6] (6,6.5] (6,6.5]
## [118] (7.5,8] (7.5,8] (5.5,6] (6.5,7] (5.5,6] (7.5,8] (6,6.5] (6.5,7] (7,7.5]
## [127] (6,6.5] (6,6.5] (6,6.5] (7,7.5] (7,7.5] (7.5,8] (6,6.5] (6,6.5] (6,6.5]
## [136] (7.5,8] (6,6.5] (6,6.5] (5.5,6] (6.5,7] (6.5,7] (6.5,7] (5.5,6] (6.5,7]
## [145] (6.5,7] (6.5,7] (6,6.5] (6,6.5] (6,6.5] (5.5,6]
## Levels: (4,4.5] (4.5,5] (5,5.5] (5.5,6] (6,6.5] (6.5,7] (7,7.5] (7.5,8]
# Create table
table(sepal_chunks)
## sepal_chunks
## (4,4.5] (4.5,5] (5,5.5] (5.5,6] (6,6.5] (6.5,7] (7,7.5] (7.5,8]
## 5 27 27 30 31 18 6 6
OK, if you didn’t follow all that, it’s ok. I, at one time, had to look up how to do it all myself. :)
More often than building a table of those counts, we will visualize the counts with a histogram, using the hist() function.
# Histogram of Sepal.Length variable w/ 7 suggested breaks
hist(iris$Sepal.Length, breaks = 7, main = "Sepal Length Distribution", xlab = "Sepal Length", col = "darkgray")
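As an aside, hist() is doing the same binning we did by hand with cut() and table(). If we pass it our bins vector as the breaks and suppress the plot, it hands back the same eight counts (hist() also uses right-closed intervals like (4,4.5] by default):
# Ask hist() for the class counts without drawing a plot
hist(iris$Sepal.Length, breaks = bins, plot = FALSE)$counts
## [1]  5 27 27 30 31 18  6  6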
Histograms are a very common way for analysts to visualize a dataset before diving into analysis. The textbook mentioned that you can tell whether a dataset is symmetric, left-skewed, or right-skewed by looking at its histogram.
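If you want to see those shapes for yourself, here is a quick sketch using simulated data. (This is my own illustration, not from the textbook; rnorm() and rexp() draw random samples, so your histograms will vary.)
# Simulated examples of the three common histogram shapes
set.seed(1)  # make the random draws reproducible
hist(rnorm(1000), main = "Symmetric")     # bell-shaped around the mean
hist(rexp(1000), main = "Right-skewed")   # long tail off to the right
hist(-rexp(1000), main = "Left-skewed")   # long tail off to the left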
When you have a single variable, a common thing of interest is the “middle” or “center” or “average” or “typical value” of that variable. There are a few ways to calculate that, but by far the two most popular are the mean and the median.
# Go around the room, everyone picking a favorite number between 1 and 100
fave_nums <- c(17, 29, 6.28, 92)
# Find the mean of fave_nums
mean(fave_nums)
## [1] 36.07
# Find the median of fave_nums
median(fave_nums)
## [1] 23
Is either one of those statistics a “better” measure of center of the variable?
Generally, we prefer the median over the mean when the data contain outliers, because the median resists extreme values while the mean gets pulled toward them.
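To see that pull in action, here is a small sketch reusing our fave_nums values with one extreme value tacked on:
# One outlier drags the mean far more than the median
with_outlier <- c(fave_nums, 10000)
mean(with_outlier)    # jumps from 36.07 to about 2028.86
median(with_outlier)  # only moves from 23 to 29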
The two variables below are samples of waiting time for two banks. What is the mean wait time for each bank? What is the median wait time for each bank?
# Waiting times at two banks
bankA <- c(4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0)
bankB <- c(6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8)
# Mean and median of Bank A
mean(bankA)
## [1] 7.2
median(bankA)
## [1] 7.2
# Mean and median of Bank B
mean(bankB)
## [1] 7.2
median(bankB)
## [1] 7.2
What did you observe about the centers of each bank? They’re all the same.
Is there any difference in the wait time of each bank? Not for the average client, but individual experiences vary.
If you had this sample data available to you, which bank would you choose to go to? It looks like bankB has less variability (a tighter spread around the same center), so I would probably choose B.
The most common way to measure the spread of a variable is to analyze its standard deviation, which is found with the sd() function. Standard deviation tries to capture some essence of “how far, on average, the data values are from the mean.”
# Find the standard deviation of bank A and bank B
sd(bankA)
## [1] 1.961122
sd(bankB)
## [1] 0.4449719
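If it helps to see where those numbers come from, here is the calculation sd() performs: the sample standard deviation formula, which averages the squared deviations from the mean (dividing by n - 1) and then takes the square root.
# Compute the standard deviation of bank A "by hand"
n <- length(bankA)
sqrt(sum((bankA - mean(bankA))^2) / (n - 1))
## [1] 1.961122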
Boxplots are another good way to analyze the spread of a variable: the box spans the first quartile to the third quartile, the line inside the box marks the median, and the whiskers extend to the most extreme values within 1.5 times the IQR of the box.
# Boxplots of Bank A and Bank B
boxplot(bankA)
boxplot(bankB)
Boxplots are even more useful for comparing the spread of one variable against another when they are drawn side by side.
# Side by side box plots of Bank A and B
boxplot(bankA, bankB)
Example: How does the sepal length of flowers vary based on the species?
# Side by side boxplot based on the factor variable of species
boxplot(iris$Sepal.Length ~ iris$Species)
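A small style note: the formula interface to boxplot() also accepts a data argument, which avoids the repeated iris$ prefixes and labels the axes with the bare variable names.
# Same plot, written with the data argument
boxplot(Sepal.Length ~ Species, data = iris)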
Categorical data is (often) non-numeric data that we usually store as factors. One of the videos for this chapter looked at the number variable in the email dataset, which has three values it can assume (“none”, “small”, and “big”); those values are called the levels of the factor.
It is typically desirable and useful to analyze a factor variable by looking at a table.
# Load openintro library. Install package in console first if needed.
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
# Store email into environment
email <- email
# Create a table of the spam variable
table(email$spam)
##
## 0 1
## 3554 367
# Percent of emails in each number category (3921 emails in total)
round(table(email$number) / 3921 * 100, 5)
##
## none small big
## 14.00153 72.09895 13.89952
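As an aside, rather than hard-coding the 3921 total, the built-in prop.table() function converts a table of counts into proportions; multiplying by 100 gives the same percentages.
# Same percentages without hard-coding the sample size
round(prop.table(table(email$number)) * 100, 5)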
It can also be useful to visualize these tables with a barplot.
# Barplot of the table of the number variable
barplot(table(email$number))
If Time Permits…
# Subset data by number variable
none <- subset(
small <- subset(
big <- subset(
# Find tables of the spam in each subset
# Find percent of spam in none subset