Disclaimer: The content of this RMarkdown note came from a course called Exploratory Data Analysis in datacamp.

Exploratory Data Analysis

When your dataset is represented as a table or a database, it’s difficult to observe much about it beyond its size and the types of variables it contains. In this course, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. Which variables suggest interesting relationships? Which observations are unusual? By the end of the course, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful.

Chapter 1: Exploring Categorical Data

In this chapter, you will learn how to create graphical and numerical summaries of two categorical variables.

# Install gapminder R package. Once it's installed, you won't have to run this code again
#install.packages("dplyr") 
#install.packages("ggplot2")
#Install.packages("openintro")

# Load dplyr package
library(dplyr) #for use of dplyr functions such as glimpse(), mutate(), and filter()
library(ggplot2) #for use of ggplot2 functions such ggplot()
library(gapminder) #for use of gapminder data (ch3.2)
library(openintro) #for use of email data (ch4)

# Import data
comics <- read.csv("data/Exploratory data analysis/comics.csv") #first_appear var isn't exactly same
str(comics)
## 'data.frame':    23272 obs. of  11 variables:
##  $ name        : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
##  $ id          : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
##  $ align       : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
##  $ eye         : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
##  $ hair        : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
##  $ gender      : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ gsm         : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ alive       : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ appearances : int  4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
##  $ first_appear: Factor w/ 2328 levels "1-Apr","1-Aug",..: 1772 2074 2255 2089 2185 2192 2192 2139 2292 2192 ...
##  $ publisher   : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...

# Convert data to tbl_df
comics <- tbl_df(comics)
str(comics)
## Classes 'tbl_df', 'tbl' and 'data.frame':    23272 obs. of  11 variables:
##  $ name        : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
##  $ id          : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
##  $ align       : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
##  $ eye         : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
##  $ hair        : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
##  $ gender      : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ gsm         : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ alive       : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ appearances : int  4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
##  $ first_appear: Factor w/ 2328 levels "1-Apr","1-Aug",..: 1772 2074 2255 2089 2185 2192 2192 2139 2292 2192 ...
##  $ publisher   : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...

1.1 Contingency table review

In this chapter you’ll continue working with the comics dataset introduced in the video. This is a collection of characteristics on all of the superheroes created by Marvel and DC comics in the last 80 years.

# Print the first rows of the data
comics
## # A tibble: 23,272 x 11
##                                     name      id   align        eye
##                                   <fctr>  <fctr>  <fctr>     <fctr>
##  1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes
##  2       Captain America (Steven Rogers)  Public    Good  Blue Eyes
##  3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes
##  4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes
##  5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes
##  6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes
##  7             Reed Richards (Earth-616)  Public    Good Brown Eyes
##  8            Hulk (Robert Bruce Banner)  Public    Good Brown Eyes
##  9             Scott Summers (Earth-616)  Public Neutral Brown Eyes
## 10            Jonathan Storm (Earth-616)  Public    Good  Blue Eyes
## # ... with 23,262 more rows, and 7 more variables: hair <fctr>,
## #   gender <fctr>, gsm <fctr>, alive <fctr>, appearances <int>,
## #   first_appear <fctr>, publisher <fctr>

# Check levels of align
levels(comics$align)
## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

# Check the levels of gender
levels(comics$gender)
## [1] "Female" "Male"   "Other"

# Create a 2-way contingency table
table(comics$align, comics$gender)
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

1.2 Dropping levels

To simplify the analysis, drop levels that have very low counts.

# Create a 2-way contingency table
tab <- table(comics$align, comics$gender)

# Print tab
tab
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

# Remove align level
comics <- comics %>%
  filter(align != "Reformed Criminals") %>%
  droplevels()

1.3 Side-by-side barcharts

Construct two side-by-side barcharts of the comics data, showing how there are two options for presenting the same data.

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge") #position = "dodge", to have a side-by-side (i.e. not stacked) barchart.


# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

Interpretation

  • Among characters with Neutral alignment, males are the most common.
  • In general, there is an association between gender and alignment.
  • There are more male characters than female characters in this dataset.

1.4 Conditional proportions

Approximately what proportion of all female characters are good? 51%. Look at how align was distributed within each gender

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions
##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888
prop.table(tab, 2)  # Conditional on columns
##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

1.5 Counts vs. proportions (2)

Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, you’ll construct two barcharts in this exercise: one of counts and one of proportions.

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

  
# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill") #position = "fill", to have a stacked barchart

1.6 Marginal barchart

Improve the interpretability of the plot by implementing some sensible ordering. Superheroes that are “Neutral” show an alignment between “Good” and “Bad”, so it makes sense to put that bar in the middle.

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()

1.7 Conditional barchart

Make the distribution of alignment based on gender by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender.

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

1.8 Improve piechart

The piechart is a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than barcharts. Improve the piechart representation of these data by constructing a barchart.

# Import data
pies <- data.frame(read.csv("data/Exploratory data analysis/pie.csv")) 
pies_tb <- table(pies$flavor)
pies_tb
## 
##        apple    blueberry boston creme       cherry     key lime 
##           17           14           15           13           16 
##      pumpkin   strawberry 
##           12           11

# Create a pie chart
pie(pies_tb)


# Put levels of flavor in decending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) + 
  geom_bar(fill = "chartreuse") + 
  theme(axis.text.x = element_text(angle = 90))

Chapter 2: Exploring Numerical Data

In this chapter, you will learn how to graphically summarize numerical data.

2.1 Faceted histogram

The cars dataset records characteristics on all of the new models of cars for sale in the US in a certain year. Investigate the distribution of mileage across a categorial variable.

# Import data
cars <- read.csv("data/Exploratory data analysis/cars.csv", stringsAsFactors = FALSE) 

# Learn data structure
str(cars)
## 'data.frame':    428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_grid(~ suv)

2.2 Boxplots and density plots

Explore the relationship between mpg and the engine size (as measured by number of cylinders).

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()


# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

Interpretation

  • The highest mileage cars have 4 cylinders.
  • The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car.
  • Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.
  • The variability in mileage of 8 cylinder cars seem much smaller than that of 4 cylinder cars.

2.3 Marginal and conditional histograms

Get a sense of the marginal distribution of horsepower and then compare it to the distribution of horsepower conditional on the cars being affordable. How much horsepower do you sacrifice by restricting your budget?

# Create hist of horsepwr
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  ggtitle("Distribution of horsepower")


# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Distribution of horsepower for cars less than $25,000")

Interpretation

  • The highest horsepower car in the affordable range has just under 250 horsepower.

2.4 Three binwidths

Alter the binwidths to see how things change.

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("The distribution of horsepower with a binwidth of 3")


# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("The distribution of horsepower with a binwidth of 30")


# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("The distribution of horsepower with a binwidth of 60")

Interpretation

  • Plot A is the only histogram that shows the count for cars with exactly 200 and 300 horsepower.
  • Plot A has the most bins, which means that it shows the number of cars with more specific horsepower values.

2.5 Box plots for outliers

Box plots are good:

  • when you compare several distributions at once
  • when you want to detect outliers

Bos plots are not good:

  • when the distribution has more than one mode

Apply a box plot to the msrp column (manufacturers suggested retail price) to detect if there are unusually expensive or cheap cars, outliers.

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()


# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

2.6 Plot selection

Plot to display the important features of the distributions of city_mpg and width. Which is most appropriate? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

  
cars %>%
  ggplot(aes(city_mpg)) +
  geom_density()


# Create plot of width
cars %>% 
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()


cars %>%
  ggplot(aes(width)) +
  geom_density()

Interpretation

  • Because the city_mpg variable has a much wider range with its outliers, it’s best to display its distribution as a box plot.

2.7 3 variable plot

Faceting is a valuable technique for looking at several conditional distributions at the same time.

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv, labeller = label_both) +
  ggtitle("Distribution of Highway MPG by ncyl and SUV")

Interpretation

  • Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases.
  • There are fewer SUVs than non-SUVs across all cylinder types.
  • There is more variability in 4-cylinder non-SUVs than in any other type of car.

Chapter 3: Numerical Summaries

Now that we’ve looked at exploring categorical and numerical data, you’ll learn some useful statistics for describing distributions of data. There are four characteristics of the distribution that we are interested in:

  • Center
  • Spread or variability
  • Shape (described in terms of modality and skew)
  • Outliers

Graphical tools for numerical data

  • histograms take continuous data and aggregate it into bins; then draw a bar to height the corresponse to a number of cases in that bin; and can be broken down by a categorical variable by faceting.
  • Boxplots excels at comparing multiple distributions; uses robust measures of median and IQR; and also flags potential outliers; but can hide more complicated shapes such as a bimodal distribution.
  • Density plots summarize data by drawing a smooth line to represent its shape; and can be faceted just like historams or overlaid on one another by maping the color of the fill of the distribution to a second variable.

3.1 Choice of center measure

The choice of measure for center can have a dramatic impact on what we consider to be a typical observation, so it is important that you consider the shape of the distribution before deciding on the measure.

  • Mean is sensitive to extreme values.
  • Median is the middle value in a sorted data set and a more appropriate measure of center when working with a skewed distribution.
  • Mode is the most common value.

3.2 Calculate center measures

Use data from gapminder, which tracks demographic data on countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.

Focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level.

# Load data
data(gapminder)

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))
## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##      <fctr>           <dbl>             <dbl>
## 1    Africa            54.8              52.9
## 2  Americas            73.6              72.9
## 3      Asia            70.7              72.4
## 4    Europe            77.6              78.6
## 5   Oceania            80.7              80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

3.3 Choice of spread measure

The choice of measure for spread can dramatically impact how variable we consider our data to be, so it is important that you consider the shape of the distribution before deciding on the measure.

  • Standard deviation is convenient when interpreting because it’s in the same unit as in the raw data. It’s the most commonly used measure of variation. But SD (and variance) is sensitive to extreme values in the same way that the mean is as a measure of center.
  • Variance is, by comparison, hard to interpret because it’s in the unit in the raw data is squared.
  • IQR is the interquartile range, which is the distance between the first quartile and the third quartile (the height of the box in the boxplot). IQR is a good alternative to the standard deviation when working with a skewed distribution because it’s less sensitive to extreme values.
  • Range is the distance between the maximum and the minimum.

3.4 Calculate spread measures

If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())
## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##      <fctr>         <dbl>          <dbl> <int>
## 1    Africa         9.631         11.610    52
## 2  Americas         4.441          4.632    25
## 3      Asia         7.964         10.152    33
## 4    Europe         2.980          4.782    30
## 5   Oceania         0.729          0.516     2

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

3.5 Choose measures for center and spread

Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them.

# Create the charts given in datacamp
gap2007 %>% 
  filter(continent == "Americas") %>% 
  ggplot(aes(lifeExp)) + geom_density() + ggtitle("Life Expectancy in the Americas")


gap2007 %>% ggplot(aes(pop)) + geom_density()


# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))
## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop)) 
## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531   26702008

Interpretation

  • Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.

3.6 Shape and transformations

There are two shap Modality

  • Unimodal: a single prominent mode
  • Bimodal: two prominent modes
  • Multimodal: three modes or more
  • Uniform: no distinctive modes

Skew

  • Right-skewed: The distribution has a long tail that stretches out to the right.
  • Left_skewed: The distribution has a long tail that streches out to the left.
  • Symmetric: Neither tail is longer than the other.

Transformation

Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure. For example, a log transformation may be considered when comparing two distributions that are both highly right-skewed.

3.7 Transformations

Transform the population variable, which exhibits strong right skew, with the natural logarithm function.

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()


# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()

3.8 Outliers

Outliers are extreme values far from the bulk of the distribution.

Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be?

# Create a given boxplot in datacamp
gap2007 %>%
  filter(continent == "Asia") %>%
  ggplot(aes(x = 1, y = lifeExp)) + geom_boxplot()


# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(! is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

Chapter 4: Case study

Apply what you’ve learned to explore and summarize a real world dataset in this case study of email spam. What characteristics of emails are associated with spam?

data(email)
email$spam <- as.factor(email$spam)
levels(email$spam) <- c("not-spam", "spam")
str(email)
## 'data.frame':    3921 obs. of  21 variables:
##  $ spam        : Factor w/ 2 levels "not-spam","spam": 1 1 1 1 1 1 1 1 1 1 ...
##  $ to_multiple : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ sent_email  : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ time        : POSIXct, format: "2012-01-01 06:16:41" "2012-01-01 07:03:59" ...
##  $ image       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ attach      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ dollar      : num  0 0 4 0 0 0 0 0 0 0 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 2 2 0 0 0 0 ...
##  $ num_char    : num  11.37 10.5 7.77 13.26 1.23 ...
##  $ line_breaks : int  202 202 192 255 29 25 193 237 69 68 ...
##  $ format      : num  1 1 1 1 0 0 1 1 0 1 ...
##  $ re_subj     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  0 1 6 48 1 1 1 18 1 0 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...

# graphic tools for numerical data

# histograms
email %>%
  ggplot(aes(x = num_char)) +
    geom_histogram()


email %>%
  ggplot(aes(x = num_char)) +
    geom_histogram() +
    facet_grid(~ spam)


# boxplots
email %>%
  ggplot(aes(x = spam, y = num_char)) +
    geom_boxplot()


# density plots
email %>%
  ggplot(aes(x = num_char)) +
    geom_density()  


email %>%
  ggplot(aes(x = num_char)) +
    geom_density() + 
    facet_grid(~ spam)


email %>%
  ggplot(aes(x = num_char, fill = spam)) +  #must be a factor var for fill
    geom_density(alpha = 0.3)

4.1 Spam and num_char

Is there an association between spam and the length of an email?

# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarize(median(num_char), IQR(num_char))
## # A tibble: 2 x 3
##       spam `median(num_char)` `IQR(num_char)`
##     <fctr>              <dbl>           <dbl>
## 1 not-spam               6.83           13.58
## 2     spam               1.05            2.82

# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

Interpretation

  • The shortest email in this dataset was spam.
  • The median length of not-spam emails is greater than that of spam emails.
  • However, there is still a reasonable amount of overlap between the two distribution.

4.2 Spam and !!!

Exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.

# Compute center and spread for exclaim_mess by spam
email %>% group_by(spam) %>% summarize(median(exclaim_mess), IQR(exclaim_mess))
## # A tibble: 2 x 3
##       spam `median(exclaim_mess)` `IQR(exclaim_mess)`
##     <fctr>                  <dbl>               <dbl>
## 1 not-spam                      1                   5
## 2     spam                      0                   1



# Create plot for spam and exclaim_mess
# Assign an arbitrary number of -4.6 to log(0) just to compare 0 and non-0 values
email$log_exclaim_mess <- ifelse(email$exclaim_mess > 0, log(email$exclaim_mess), -4.6)


ggplot(email, aes(log_exclaim_mess)) + geom_histogram() + facet_grid(~ spam)

ggplot(email, aes(x = spam, y = log_exclaim_mess)) + geom_boxplot()

ggplot(email, aes(x = log_exclaim_mess, fill = spam)) + geom_density()

Interpreation

  • The most common value of exclaim_mess in both classes of email is zero (a log(exclaim_mess) of -4.6).
  • There are fewer cases of spam in this dataset than not-spam.
  • Even after a transformation, the distribution of exclaim_mess in both classes of email is heavily right-skewed.
  • The typical number of exclamations in the not-spam group appears to be slightly higher than in the spam group.

4.3 Collapsing levels

It is difficult to work with variables of heavy skew. image is such a variable the majority of whose values take zero and only a few positive values. Let’s collapse image into a categorical variable that indicates whether or not the email had at least one image.

# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

Interpretation

  • There are more emails that are not-spam than spam.
  • An email with an image is more likely to be not-spam than spam.
  • An email without an image is more likely to be not-spam than spam.

4.4 Data integrity

In the process of exploring a dataset, you’ll sometimes come across something that will lead you to question how the data were compiled. For example, consider the variables image and attach. Do attached images count as attached files in this dataset?

Design a simple test to determine if images count as attached files.

# Test if images count as attachments
sum(email$image > email$attach) #False is 0 while True is 1.
## [1] 0

Interpretation

  • Since image is never greater than attach, we can infer that images are counted as attachments.

4.5 Answering questions with chains

When you have a specific question about a dataset, you can find your way to an answer by carefully constructing the appropriate chain of R code.

# Question1: For emails containing the word "dollar", does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email? Create a summary statistic that answers this question.
email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))
## # A tibble: 2 x 2
##       spam `median(dollar)`
##     <fctr>            <dbl>
## 1 not-spam                4
## 2     spam                2

# Question2: If you encounter an email with greater than 10 occurrences of the word "dollar", is it more likely to be spam or not-spam? Create a barchart that answers this question.
email %>%
  filter(dollar > 10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

4.6 What’s in a number?

Explore the association between number and spam. Nunber is a factor variable saying whether there was no number, a small number (under 1 million), or a big number.

# Reorder levels
email$number <- factor(email$number, levels = c("none", "small", "big"))

# Construct plot of number
ggplot(email, aes(x = number)) + 
  geom_bar() +
  facet_grid(~ spam)

Interpretation

  • Given that an email contains a small number, it is more likely to be not-spam.
  • Given that an email contains no number, it is more likely to be not-spam.
  • Given that an email contains a big number, it is more likely to be not-spam.
  • Within both spam and not-spam, the most common number is a small one.

Quiz

How can we revitalize a region’s economy? Assume that you live in a rural community and are interested in helping your community develop its economy. So you want to identify a rural community in the U.S. that enjoys a high living standards and investigate what makes them different from the rest of the rural America. Use the countyComplete data set in the openintro R package. Click the link to open the manual and find the information about the data set.

  • Use the code below to create a new variable, rural takes the value of rural if it has less than 1,000 people per square mile and the value of urban otherwise. Click here for the discussion on definitions of rural areas.
# Load data
data(countyComplete) # It comes from the openintro package

# Create a new variable, rural
countyComplete$rural <- ifelse(countyComplete$density < 1000, "rural", "urban")
countyComplete$rural <- factor(countyComplete$rural)

# Examine the data set
glimpse(countyComplete)
## Observations: 3,143
## Variables: 54
## $ state                                     <fctr> Alabama, Alabama, A...
## $ name                                      <fctr> Autauga County, Bal...
## $ FIPS                                      <dbl> 1001, 1003, 1005, 10...
## $ pop2010                                   <dbl> 54571, 182265, 27457...
## $ pop2000                                   <dbl> 43671, 140415, 29038...
## $ age_under_5                               <dbl> 6.6, 6.1, 6.2, 6.0, ...
## $ age_under_18                              <dbl> 26.8, 23.0, 21.9, 22...
## $ age_over_65                               <dbl> 12.0, 16.8, 14.2, 12...
## $ female                                    <dbl> 51.3, 51.1, 46.9, 46...
## $ white                                     <dbl> 78.5, 85.7, 48.0, 75...
## $ black                                     <dbl> 17.7, 9.4, 46.9, 22....
## $ native                                    <dbl> 0.4, 0.7, 0.4, 0.3, ...
## $ asian                                     <dbl> 0.9, 0.7, 0.4, 0.1, ...
## $ pac_isl                                   <dbl> NA, NA, NA, NA, NA, ...
## $ two_plus_races                            <dbl> 1.6, 1.5, 0.9, 0.9, ...
## $ hispanic                                  <dbl> 2.4, 4.4, 5.1, 1.8, ...
## $ white_not_hispanic                        <dbl> 77.2, 83.5, 46.8, 75...
## $ no_move_in_one_plus_year                  <dbl> 86.3, 83.0, 83.0, 90...
## $ foreign_born                              <dbl> 2.0, 3.6, 2.8, 0.7, ...
## $ foreign_spoken_at_home                    <dbl> 3.7, 5.5, 4.7, 1.5, ...
## $ hs_grad                                   <dbl> 85.3, 87.6, 71.9, 74...
## $ bachelors                                 <dbl> 21.7, 26.8, 13.5, 10...
## $ veterans                                  <dbl> 5817, 20396, 2327, 1...
## $ mean_work_travel                          <dbl> 25.1, 25.8, 23.8, 28...
## $ housing_units                             <dbl> 22135, 104061, 11829...
## $ home_ownership                            <dbl> 77.5, 76.7, 68.0, 82...
## $ housing_multi_unit                        <dbl> 7.2, 22.6, 11.1, 6.6...
## $ median_val_owner_occupied                 <dbl> 133900, 177200, 8820...
## $ households                                <dbl> 19718, 69476, 9795, ...
## $ persons_per_household                     <dbl> 2.70, 2.50, 2.52, 3....
## $ per_capita_income                         <dbl> 24568, 26469, 15875,...
## $ median_household_income                   <dbl> 53255, 50147, 33219,...
## $ poverty                                   <dbl> 10.6, 12.2, 25.0, 12...
## $ private_nonfarm_establishments            <dbl> 877, 4812, 522, 318,...
## $ private_nonfarm_employment                <dbl> 10628, 52233, 7990, ...
## $ percent_change_private_nonfarm_employment <dbl> 16.6, 17.4, -27.0, -...
## $ nonemployment_establishments              <dbl> 2971, 14175, 1527, 1...
## $ firms                                     <dbl> 4067, 19035, 1667, 1...
## $ black_owned_firms                         <dbl> 15.2, 2.7, NA, 14.9,...
## $ native_owned_firms                        <dbl> NA, 0.4, NA, NA, NA,...
## $ asian_owned_firms                         <dbl> 1.3, 1.0, NA, NA, NA...
## $ pac_isl_owned_firms                       <dbl> NA, NA, NA, NA, NA, ...
## $ hispanic_owned_firms                      <dbl> 0.7, 1.3, NA, NA, NA...
## $ women_owned_firms                         <dbl> 31.7, 27.3, 27.0, NA...
## $ manufacturer_shipments_2007               <dbl> NA, 1410273, NA, 0, ...
## $ mercent_whole_sales_2007                  <dbl> NA, NA, NA, NA, NA, ...
## $ sales                                     <dbl> 598175, 2966489, 188...
## $ sales_per_capita                          <dbl> 12003, 17166, 6334, ...
## $ accommodation_food_service                <dbl> 88157, 436955, NA, 1...
## $ building_permits                          <dbl> 191, 696, 10, 8, 18,...
## $ fed_spending                              <dbl> 331142, 1119082, 240...
## $ area                                      <dbl> 594, 1590, 885, 623,...
## $ density                                   <dbl> 91.8, 114.6, 31.0, 3...
## $ rural                                     <fctr> rural, rural, rural...
  • Examine the living standard of the US counties, using per_capita_income as a measure, and how it differs by whether the county is rural or urban. To that end, do the following and interpret the results:

    • Compute groupwise mean and median per_capita_income by rural
    • Compute groupwise measures of spread for per_capita_income by rural
    • Create faceted histogram for per_capita_income by rural
    • Create overlaid density plots for per_capita_income by rural
    • Create box plots of per_capita_income by rural
  • Use the box plot you created above and identify an outlier, a rural county that has a per_capita_income higher than $60,000 a year when a typical rural county’s per capita income is less than $23,000 a year. What makes this rural county successful? Compare it with a typical rural county in the country (compare the county’s data with means of that of the rest of the rural counties)

# Compute groupwise mean and median per_capita_income by rural
countyComplete %>%
  group_by(rural) %>%
  summarize(mean(per_capita_income),
            median(per_capita_income))
## # A tibble: 2 x 3
##    rural `mean(per_capita_income)` `median(per_capita_income)`
##   <fctr>                     <dbl>                       <dbl>
## 1  rural                     22117                       21598
## 2  urban                     30576                       28522

# Compute groupwise measures of spread for per_capita_income by rural
countyComplete %>%
  group_by(rural) %>%
  summarize(sd(per_capita_income),
            IQR(per_capita_income),
            n())
## # A tibble: 2 x 4
##    rural `sd(per_capita_income)` `IQR(per_capita_income)` `n()`
##   <fctr>                   <dbl>                    <dbl> <int>
## 1  rural                    4884                     5498  2999
## 2  urban                    8598                     9594   144

# Create faceted histogram for per_capita_income by rural
ggplot(countyComplete, aes(x = per_capita_income)) +
  geom_histogram() +
  facet_grid(~ rural)



# Create overlaid density plots for per_capita_income by rural
ggplot(countyComplete, aes(x = per_capita_income, fill = rural)) +
  geom_density(alpha = .3)



# Create box plots of per_capita_income by rural
ggplot(countyComplete, aes(x = rural, y = per_capita_income)) +
  geom_boxplot()


# Identify the outlier
outlier <- countyComplete %>%
  filter(rural == "rural", per_capita_income > 60000)

outlier <- data.frame(t(outlier))
names(outlier) <- "outlier"
str(outlier)
## 'data.frame':    54 obs. of  1 variable:
##  $ outlier: Factor w/ 49 levels "0","0.3","0.5",..: 47 48 40 16 10 25 14 6 29 44 ...
##   ..- attr(*, "names")= chr  "state" "name" "FIPS" "pop2010" ...

# 
typicalRural <- countyComplete %>%
  filter(rural == "rural")

typicalRural <- data.frame(sapply(typicalRural, mean, na.rm = TRUE))
names(typicalRural) <- "typicalRural"
str(typicalRural)
## 'data.frame':    54 obs. of  1 variable:
##  $ typicalRural: num  NA NA 30234 63931 57041 ...

cbind(typicalRural, outlier)
##                                           typicalRural       outlier
## state                                               NA      Colorado
## name                                                NA Pitkin County
## FIPS                                         30233.737          8097
## pop2010                                      63930.906         17148
## pop2000                                      57040.766         14872
## age_under_5                                      6.251           4.4
## age_under_18                                    23.441          17.5
## age_over_65                                     16.052          11.5
## female                                          49.953            47
## white                                           83.770          93.5
## black                                            8.394           0.5
## native                                           2.101           0.3
## asian                                            0.921           1.2
## pac_isl                                          0.150          <NA>
## two_plus_races                                   1.921           1.4
## hispanic                                         8.015           9.1
## white_not_hispanic                              79.290          87.9
## no_move_in_one_plus_year                        85.930          86.8
## foreign_born                                     3.866          12.6
## foreign_spoken_at_home                           8.480          16.2
## hs_grad                                         82.961          96.1
## bachelors                                       18.311          59.7
## veterans                                      5307.390          1152
## mean_work_travel                                22.581            19
## housing_units                                27929.949         12953
## home_ownership                                  73.838          62.5
## housing_multi_unit                              11.259            40
## median_val_owner_occupied                   125304.602        670200
## households                                   23789.878          7417
## persons_per_household                            2.511          2.21
## per_capita_income                            22117.129         64381
## median_household_income                      43597.515         64502
## poverty                                         15.589           8.4
## private_nonfarm_establishments                1446.135          1620
## private_nonfarm_employment                   20109.331         16278
## percent_change_private_nonfarm_employment        0.814           5.1
## nonemployment_establishments                  4067.842          3286
## firms                                         5711.401          5100
## black_owned_firms                                9.451          <NA>
## native_owned_firms                               4.493          <NA>
## asian_owned_firms                                2.624          <NA>
## pac_isl_owned_firms                              1.221          <NA>
## hispanic_owned_firms                             6.692           2.9
## women_owned_firms                               25.748          26.5
## manufacturer_shipments_2007                1064843.116             0
## mercent_whole_sales_2007                    722079.695         64678
## sales                                       785766.919        399540
## sales_per_capita                             10107.925         26461
## accommodation_food_service                  123717.562        270848
## building_permits                               144.784            44
## fed_spending                                561603.057         45978
## area                                          1158.678           971
## density                                         95.589          17.7
## rural                                               NA         rural