Now that you’ve explored categorical and numerical data, you’ll learn some useful statistics for describing distributions of data.
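# Load the packages used throughout this chapter
library(readr)     # read_csv()
library(dplyr)     # filter(), group_by(), summarize(), mutate(), %>%
library(ggplot2)   # ggplot(), geom_boxplot(), geom_density()
library(gapminder) # the gapminder dataset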
cars <- read_csv('https://assets.datacamp.com/production/course_1796/datasets/cars04.csv')
## Parsed with column specification:
## cols(
## name = col_character(),
## sports_car = col_logical(),
## suv = col_logical(),
## wagon = col_logical(),
## minivan = col_logical(),
## pickup = col_logical(),
## all_wheel = col_logical(),
## rear_wheel = col_logical(),
## msrp = col_double(),
## dealer_cost = col_double(),
## eng_size = col_double(),
## ncyl = col_double(),
## horsepwr = col_double(),
## city_mpg = col_double(),
## hwy_mpg = col_double(),
## weight = col_double(),
## wheel_base = col_double(),
## length = col_double(),
## width = col_double()
## )
#cars <- cars %>%
# mutate(msrp = as.integer(msrp))
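# Convert the whole-number columns (msrp, dealer_cost, and ncyl through width) from double to integer
cars[, c(9:10, 12:19)] <- sapply(cars[, c(9:10, 12:19)], as.integer)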
The choice of measure for center can have a dramatic impact on what we consider to be a typical observation, so it is important that you consider the shape of the distribution before deciding on the measure.
Which set of measures of central tendency would be worst for describing the two distributions shown here? Source: DataCamp
It’s “A: mean, B: mode”.
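To see how strongly the choice of center can depend on shape, here is a minimal sketch using a made-up, right-skewed vector (`skewed_vals` is hypothetical and not part of the course data). A single large value pulls the mean well away from the bulk of the data, while the median barely notices it.

```r
# Hypothetical right-skewed sample: most values are small, one is very large
skewed_vals <- c(20, 22, 25, 27, 30, 32, 35, 400)

mean(skewed_vals)    # pulled upward by the single large value
median(skewed_vals)  # stays near the bulk of the data
```

For a roughly symmetric distribution the two measures would land close together, which is why the shape should be checked first.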
Throughout this chapter, you will use data from gapminder, which tracks demographic data in countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.
For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize(), a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)
# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))
## # A tibble: 5 x 3
## continent `mean(lifeExp)` `median(lifeExp)`
## <fct> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()
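The group-wise summary computed with group_by() and summarize() follows a split-apply-combine pattern: split the rows by group, apply a function to each piece, and combine the results. As a rough illustration of that idea only, the sketch below does the same thing in base R with tapply() on a small made-up data frame (`toy`, `group`, and `value` are hypothetical names). It is not how dplyr works internally.

```r
# Hypothetical toy data standing in for a grouped dataset
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 9, 10, 11, 12)
)

# tapply() splits value by group and applies the function to each piece
group_means   <- tapply(toy$value, toy$group, mean)
group_medians <- tapply(toy$value, toy$group, median)

group_means
group_medians
```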
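# Take the first 11 life expectancies (rounded) as a small worked example
x <- head(round(gap2007$lifeExp), 11)
x - mean(x) # deviation of each observation from the mean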
## [1] -23.909091 8.090909 4.090909 -24.909091 7.090909 13.090909
## [7] 12.090909 8.090909 -3.909091 11.090909 -10.909091
sum(x - mean(x)) # deviations from the mean sum to 0, up to floating-point error
## [1] 2.842171e-14
sum((x - mean(x))^2) # sum of squared deviations; it keeps growing as more data are added
## [1] 1964.909
n <- 11 # number of observations in x
sum((x - mean(x))^2)/n # mean squared deviation: divide by n
## [1] 178.6281
sum((x - mean(x))^2)/(n-1) # the sample variance: divide by n - 1
## [1] 196.4909
var(x) # R's built-in function
## [1] 196.4909
sqrt(sum((x - mean(x))^2)/(n-1)) # the sample standard deviation: square root of the sample variance
## [1] 14.01752
sd(x)
## [1] 14.01752
# The standard deviation is more commonly used than the interquartile range (IQR):
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.00 60.50 75.00 67.91 77.50 81.00
IQR(x)
## [1] 17
# However, the SD is sensitive to extreme values, whereas the IQR is not, so the IQR is the better choice for strongly skewed data or data with extreme outliers
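The sensitivity of the standard deviation to extreme values can be checked directly. The sketch below uses a made-up vector (`base_vals` is hypothetical) and appends one extreme value to it. The SD jumps, while the IQR moves only slightly.

```r
# Hypothetical sample, then the same sample with one extreme value appended
base_vals    <- c(60, 62, 65, 67, 70, 72, 75, 77, 80)
with_outlier <- c(base_vals, 200)

sd(base_vals)
sd(with_outlier)    # several times larger than before

IQR(base_vals)
IQR(with_outlier)   # changes far less than the SD
```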
Choice of spread measure
The choice of measure for spread can dramatically impact how variable we consider our data to be, so it is important that you consider the shape of the distribution before deciding on the measure.
Which set of measures of spread would be worst for describing the two distributions shown here? Source: DataCamp
It’s “A: Variance, B: Range”. Notice that A has both a high peak and considerable width. What does that tell you about its variance?
Let’s extend the powerful group_by() and summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.
# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())
## # A tibble: 5 x 4
## continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
## <fct> <dbl> <dbl> <int>
## 1 Africa 9.63 11.6 52
## 2 Americas 4.44 4.63 25
## 3 Asia 7.96 10.2 33
## 4 Europe 2.98 4.78 30
## 5 Oceania 0.729 0.516 2
# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)
Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them. Source: DataCamp
# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))
## # A tibble: 1 x 2
## `mean(lifeExp)` `sd(lifeExp)`
## <dbl> <dbl>
## 1 73.6 4.44
# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))
## # A tibble: 1 x 2
## `median(pop)` `IQR(pop)`
## <dbl> <dbl>
## 1 10517531 26702008.
Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.
To build some familiarity with distributions of different shapes, consider the four that are plotted here.
Which of the following options does the best job of describing their shape in terms of modality and skew/symmetry? Source: DataCamp
It’s “A: unimodal left-skewed; B: unimodal symmetric; C: unimodal right-skewed; D: bimodal symmetric.”
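Skew can also be checked numerically rather than only by eye. The sketch below defines a small helper, `skewness_sketch()`, which is a hypothetical illustration and not a base R function. It computes one common moment-based skewness coefficient (third central moment divided by the variance raised to the power 3/2) on made-up vectors. Positive values suggest right skew, negative values left skew, and values near zero suggest symmetry.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
# This helper is illustrative only; it is not part of base R.
skewness_sketch <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^(3/2)
}

# Hypothetical example vectors
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)
left_skewed  <- -right_skewed
symmetric    <- c(1, 2, 3, 4, 5)

skewness_sketch(right_skewed)  # positive
skewness_sketch(left_skewed)   # negative
skewness_sketch(symmetric)     # zero
```

This coefficient says nothing about modality, so a density plot or histogram is still needed to tell unimodal from bimodal shapes.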
Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.
Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).
# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()
# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))
# Create density plot of new variable, using the log_pop column created above
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()
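Why the logarithm helps can be seen on a tiny made-up example (`pop_toy` is hypothetical, not taken from gapminder). Values that span several orders of magnitude become evenly spaced after log(), so the long right tail is compressed and the mean and median move back together.

```r
# Hypothetical populations spanning several orders of magnitude
pop_toy <- c(1e5, 1e6, 1e7, 1e8, 1e9)

# On the raw scale the mean sits far above the median (strong right skew)
mean(pop_toy) / median(pop_toy)

# After log(), the values are evenly spaced and mean and median agree
log_pop_toy <- log(pop_toy)
mean(log_pop_toy) - median(log_pop_toy)
```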
Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using either min() or filter(), then proceed to building a plot with that country removed.
# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()
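The code above flags the outlier with a hand-picked cutoff of 50. Box plots conventionally draw a point individually when it lies more than 1.5 × IQR beyond the quartiles, and that rule can be sketched directly. The vector `life_toy` below is made up for illustration and is not the Asian life expectancy data.

```r
# Hypothetical life expectancies with one notably low value
life_toy <- c(44, 60, 62, 65, 68, 70, 71, 72, 73, 75, 78, 80, 82)

# Fences at 1.5 * IQR below the first quartile and above the third
q <- quantile(life_toy, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(life_toy)
fence_high <- q[[2]] + 1.5 * IQR(life_toy)

# Flag anything outside the fences
is_outlier_toy <- life_toy < fence_low | life_toy > fence_high
life_toy[is_outlier_toy]
```

A rule like this only flags candidates. Whether to drop them, as the exercise does for plotting, remains a judgment call.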