Now that you’ve explored categorical and numerical data, you’ll learn some useful statistics for describing distributions of data.

library(readr)
cars <- read_csv('https://assets.datacamp.com/production/course_1796/datasets/cars04.csv')

## Parsed with column specification:
## cols(
##   name = col_character(),
##   sports_car = col_logical(),
##   suv = col_logical(),
##   wagon = col_logical(),
##   minivan = col_logical(),
##   pickup = col_logical(),
##   all_wheel = col_logical(),
##   rear_wheel = col_logical(),
##   msrp = col_double(),
##   dealer_cost = col_double(),
##   eng_size = col_double(),
##   ncyl = col_double(),
##   horsepwr = col_double(),
##   city_mpg = col_double(),
##   hwy_mpg = col_double(),
##   weight = col_double(),
##   wheel_base = col_double(),
##   length = col_double(),
##   width = col_double()
## )

# Convert the whole-number columns to integer; column 11 (eng_size) is left as double
# cars <- cars %>%
#   mutate(msrp = as.integer(msrp))
cars[, c(9:10, 12:19)] <- lapply(cars[, c(9:10, 12:19)], as.integer)

Video: Measures of center

Question: Choice of center measure

The choice of measure for center can have a dramatic impact on what we consider to be a typical observation, so it is important that you consider the shape of the distribution before deciding on the measure.

Which set of measures of central tendency would be worst for describing the two distributions shown here? (Figure source: DataCamp)

It’s “A: mean, B: mode”.
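
To see why the shape matters, here is a minimal sketch using a small, made-up right-skewed sample (the `incomes` vector is hypothetical and not part of the course data). A single extreme value drags the mean far from the bulk of the observations, while the median barely moves.

```r
# Hypothetical right-skewed sample: seven modest values and one extreme one
incomes <- c(20, 22, 25, 27, 30, 31, 35, 400)

mean_income   <- mean(incomes)    # pulled upward by the extreme value
median_income <- median(incomes)  # stays with the bulk of the data

mean_income
median_income
```

For a skewed sample like this one, the median is usually the more representative description of a typical observation.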

Calculate center measures

Throughout this chapter, you will use data from gapminder, which tracks demographic data in countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.

For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize(), a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.

# Load the packages used in this chapter
library(dplyr)
library(ggplot2)
library(gapminder)

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))

## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fct>               <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()
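
Conceptually, the group-then-summarize pattern splits the data by group, applies a statistic to each piece, and combines the results. The sketch below illustrates that idea in base R on a tiny made-up data frame. The `toy` data frame and its columns are hypothetical, and tapply() is used only as a stand-in to show the idea, not as a description of how dplyr works internally.

```r
# Hypothetical data: two groups with three values each
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 9, 4, 5, 6)
)

# Split value by group, then apply a statistic to each piece
group_means   <- tapply(toy$value, toy$group, mean)
group_medians <- tapply(toy$value, toy$group, median)

group_means
group_medians
```

Group "a" contains one unusually large value, so its mean and median disagree, whereas the two measures coincide for the evenly spaced group "b".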

Video: Measures of variability

x <- head(round(gap2007$lifeExp), 11)

x - mean(x)

##  [1] -23.909091   8.090909   4.090909 -24.909091   7.090909  13.090909
##  [7]  12.090909   8.090909  -3.909091  11.090909 -10.909091

sum(x - mean(x)) # which is close to 0

## [1] 2.842171e-14

sum((x - mean(x))^2) # which will keep getting bigger the more data you add

## [1] 1964.909

n <- 11
sum((x - mean(x))^2)/n

## [1] 178.6281

sum((x - mean(x))^2)/(n-1) # The sample variance

## [1] 196.4909

var(x) # R's built-in function

## [1] 196.4909

sqrt(sum((x - mean(x))^2)/(n-1))

## [1] 14.01752

sd(x)

## [1] 14.01752

# The standard deviation is more commonly used than the interquartile range (IQR)
summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   60.50   75.00   67.91   77.50   81.00

IQR(x)

## [1] 17

# However, the SD is sensitive to extreme values, whereas the IQR is not, so the IQR is the better choice for strongly skewed data or data with extreme outliers
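
The difference in robustness can be checked directly. The following is a minimal sketch on made-up numbers (the vectors `y` and `y_out` are hypothetical): appending one extreme value inflates the standard deviation many times over, while the IQR changes only slightly.

```r
# Hypothetical evenly spaced sample
y <- c(10, 12, 14, 16, 18, 20, 22, 24, 26)

# The same sample with one extreme value appended
y_out <- c(y, 200)

sd_before  <- sd(y)
sd_after   <- sd(y_out)     # grows dramatically
iqr_before <- IQR(y)
iqr_after  <- IQR(y_out)    # barely changes

c(sd_before = sd_before, sd_after = sd_after)
c(iqr_before = iqr_before, iqr_after = iqr_after)
```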

Choice of spread measure

The choice of measure for spread can dramatically impact how variable we consider our data to be, so it is important that you consider the shape of the distribution before deciding on the measure.

Which set of measures of spread would be worst for describing the two distributions shown here? (Figure source: DataCamp)

It’s “A: Variance, B: Range”. Notice that A has both a high peak and considerable width. What does that tell you about its variance?

Calculate spread measures

Let’s extend the powerful group_by() and summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())

## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fct>             <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

Choose measures for center and spread

Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them. (Figure source: DataCamp)

# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))

## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531  26702008.

Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.

Video: Shape and transformations

Describe the shape

To build some familiarity with distributions of different shapes, consider the four that are plotted here.

Which of the following options does the best job of describing their shape in terms of modality and skew/symmetry? (Figure source: DataCamp)

It’s “A: unimodal left-skewed; B: unimodal symmetric; C: unimodal right-skewed; D: bimodal symmetric.”

Transformations

Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.

Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()
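
The effect of the log transformation can also be seen numerically. Here is a minimal sketch on made-up values spanning several orders of magnitude (the `pop_toy` vector is hypothetical, not the gapminder data). On the raw scale the mean sits far above the median, a sign of strong right skew. After taking logs the two agree, because these particular values are evenly spaced on the log scale.

```r
# Hypothetical populations spanning several orders of magnitude
pop_toy <- c(1e5, 1e6, 1e7, 1e8, 1e9)

# Raw scale: the largest value dominates the mean
raw_mean   <- mean(pop_toy)
raw_median <- median(pop_toy)

# Log scale: natural logarithm, as used for pop above
log_mean   <- mean(log(pop_toy))
log_median <- median(log(pop_toy))

c(raw_mean = raw_mean, raw_median = raw_median)
c(log_mean = log_mean, log_median = log_median)
```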

Video: Outliers

Identify outliers

Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using either min() or filter(), then proceed to building a plot with that country removed.

# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()
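
If you want to check an outlier by hand rather than by eye, the conventional rule flags points lying more than 1.5 × IQR beyond the quartiles, which is also the default documented for geom_boxplot(). The sketch below applies that rule to a made-up named vector. The `life` values and the labels "A" through "K" are hypothetical placeholders, not actual countries.

```r
# Hypothetical life expectancies, labelled with placeholder names
life <- c(A = 44, B = 60, C = 62, D = 64, E = 65, F = 70,
          G = 71, H = 72, I = 73, J = 75, K = 82)

# Fences at 1.5 * IQR beyond the first and third quartiles
q     <- quantile(life, c(0.25, 0.75))
lower <- q[[1]] - 1.5 * IQR(life)
upper <- q[[2]] + 1.5 * IQR(life)

# Observations outside the fences are flagged as outliers
outliers <- life[life < lower | life > upper]
outliers

# Label of the smallest observation, the analogue of testing a guess with min()
lowest <- names(life)[which.min(life)]
lowest
```

Applied to the real data, the same idea amounts to filtering gap_asia for rows where lifeExp falls below the lower fence.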
