Exploratory Data Analysis

In this course, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. Which variables suggest interesting relationships? Which observations are unusual? By the end of the course, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful.

Chapter 1: Exploring Categorical Data

In this chapter, you will learn how to create graphical and numerical summaries of two categorical variables.

# Install gapminder R package. Once it's installed, you won't have to run this code again
#install.packages("dplyr") 
#install.packages("ggplot2")
#Install.packages("openintro")

# Load dplyr package
library(dplyr) #for use of dplyr functions such as glimpse(), mutate(), and filter()
library(ggplot2) #for use of ggplot2 functions such ggplot()
library(gapminder) #for use of gapminder data (ch3.2)
library(openintro) #for use of email data (ch4)

# Import data
comics <- read.csv("~/resources/rstudio/comics.csv") #first_appear var isn't exactly same
str(comics)
## 'data.frame':    23272 obs. of  11 variables:
##  $ name        : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
##  $ id          : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
##  $ align       : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
##  $ eye         : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
##  $ hair        : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
##  $ gender      : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ gsm         : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ alive       : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ appearances : int  4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
##  $ first_appear: Factor w/ 2328 levels "1-Apr","1-Aug",..: 1772 2074 2255 2089 2185 2192 2192 2139 2292 2192 ...
##  $ publisher   : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...

# Convert data to tbl_df
comics <- tbl_df(comics)
str(comics)
## Classes 'tbl_df', 'tbl' and 'data.frame':    23272 obs. of  11 variables:
##  $ name        : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
##  $ id          : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
##  $ align       : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
##  $ eye         : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
##  $ hair        : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
##  $ gender      : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ gsm         : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ alive       : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ appearances : int  4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
##  $ first_appear: Factor w/ 2328 levels "1-Apr","1-Aug",..: 1772 2074 2255 2089 2185 2192 2192 2139 2292 2192 ...
##  $ publisher   : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...

1.1 Contingency table review

# Print the first rows of the data
comics
## # A tibble: 23,272 x 11
##                                     name      id   align        eye
##                                   <fctr>  <fctr>  <fctr>     <fctr>
##  1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes
##  2       Captain America (Steven Rogers)  Public    Good  Blue Eyes
##  3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes
##  4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes
##  5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes
##  6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes
##  7             Reed Richards (Earth-616)  Public    Good Brown Eyes
##  8            Hulk (Robert Bruce Banner)  Public    Good Brown Eyes
##  9             Scott Summers (Earth-616)  Public Neutral Brown Eyes
## 10            Jonathan Storm (Earth-616)  Public    Good  Blue Eyes
## # ... with 23,262 more rows, and 7 more variables: hair <fctr>,
## #   gender <fctr>, gsm <fctr>, alive <fctr>, appearances <int>,
## #   first_appear <fctr>, publisher <fctr>

# Check levels of align
levels(comics$align)
## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

# Check the levels of gender
levels(comics$gender)
## [1] "Female" "Male"   "Other"

# Create a 2-way contingency table
table(comics$align, comics$gender)
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

1.2 Dropping levels

To simplify the analysis, drop levels that have very low counts.

# Create a 2-way contingency table
tab <- table(comics$align, comics$gender)

# Print tab
tab
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

# Remove align level
comics <- comics %>%
  filter(align != "Reformed Criminals") %>%
  droplevels()

1.3 Side-by-side barcharts

Construct two side-by-side barcharts of the comics data, showing how there are two options for presenting the same data.

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge") #position = "dodge", to have a side-by-side (i.e. not stacked) barchart.


# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

Interpretation

Among characters with Neutral alignment, males are the most common. In general, there is an association between gender and alignment. There are more male characters than female characters in this dataset.

1.4 Conditional proportions

Approximately what proportion of all female characters are good? 51%. Look at how align was distributed within each gender

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions
##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

prop.table(tab, 2)  # Conditional on columns
##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

1.5 Counts vs. proportions (2)

Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, you’ll construct two barcharts in this exercise: one of counts and one of proportions.

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()


# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill") #position = "fill", to have a stacked barchart

1.6 Marginal barchart

Improve the interpretability of the plot by implementing some sensible ordering. Superheroes that are “Neutral” show an alignment between “Good” and “Bad”, so it makes sense to put that bar in the middle.

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()

1.7 Conditional barchart

Make the distribution of alignment based on gender by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender.

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

1.8 Improve piechart

The piechart is a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than barcharts. Improve the piechart representation of these data by constructing a barchart.

# Import data
pies <- data.frame(read.csv("~/resources/rstudio/pie.csv")) 
pies_tb <- table(pies$flavor)
pies_tb
## 
##        apple    blueberry boston creme       cherry     key lime 
##           17           14           15           13           16 
##      pumpkin   strawberry 
##           12           11

# Create a pie chart
pie(pies_tb)

# Put levels of flavor in decending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) + 
  geom_bar(fill = "chartreuse") + 
  theme(axis.text.x = element_text(angle = 90))

Chapter 2: Exploring Numerical Data

In this chapter, you will learn how to graphically summarize numerical data

2.1 Faceted histogram

The cars dataset records characteristics on all of the new models of cars for sale in the US in a certain year. Investigate the distribution of mileage across a categorial variable.

# Import data
cars <- read.csv("~/resources/rstudio/cars.csv", stringsAsFactors = FALSE) 

# Learn data structure
str(cars)
## 'data.frame':    428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_grid(~ suv)

2.2 Boxplots and density plots

Explore the relationship between mpg and the engine size (as measured by number of cylinders).

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()


# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

Interpretation

The highest mileage cars have 4 cylinders. The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car. Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars. The variability in mileage of 8 cylinder cars seem much smaller than that of 4 cylinder cars.

2.3 Marginal and conditional histograms

Get a sense of the marginal distribution of horsepower and then compare it to the distribution of horsepower conditional on the cars being affordable. How much horsepower do you sacrifice by restricting your budget?

# Create hist of horsepwr
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  ggtitle("Distribution of horsepower")


# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Distribution of horsepower for cars less than $25,000")

Interpretation

The highest horsepower car in the affordable range has just under 250 horsepower.

2.4 Three binwidths

Alter the binwidths to see how things change.

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("The distribution of horsepower with a binwidth of 3")


# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("The distribution of horsepower with a binwidth of 30")


# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("The distribution of horsepower with a binwidth of 60")

Interpretation

Plot A is the only histogram that shows the count for cars with exactly 200 and 300 horsepower. Plot A has the most bins, which means that it shows the number of cars with more specific horsepower values.

2.5 Box plots for outliers

Apply a box plot to the msrp column (manufacturers suggested retail price) to detect if there are unusually expensive or cheap cars, outliers.

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()


# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

2.6 Plot selection

Plot to display the important features of the distributions of city_mpg and width. Which is most appropriate? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()


cars %>%
  ggplot(aes(city_mpg)) +
  geom_density()


# Create plot of width
cars %>% 
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()


cars %>%
  ggplot(aes(width)) +
  geom_density()

Interpretation

Because the city_mpg variable has a much wider range with its outliers, it’s best to display its distribution as a box plot.

2.7 3 variable plot

Faceting is a valuable technique for looking at several conditional distributions at the same time.

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv, labeller = label_both) +
  ggtitle("Distribution of Highway MPG by ncyl and SUV")

Interpretation

Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases. There are fewer SUVs than non-SUVs across all cylinder types. There is more variability in 4-cylinder non-SUVs than in any other type of car.

Chapter 3 Numerical Summaries

Now that we’ve looked at exploring categorical and numerical data, you’ll learn some useful statistics for describing distributions of data.

3.1 Calculate center measures

Throughout this chapter, you will use data from gapminder, which tracks demographic data in countries of the world over time. For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize()

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))
## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##      <fctr>           <dbl>             <dbl>
## 1    Africa            54.8              52.9
## 2  Americas            73.6              72.9
## 3      Asia            70.7              72.4
## 4    Europe            77.6              78.6
## 5   Oceania            80.7              80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. Use the group_by() and summarize() functions, a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))
## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##      <fctr>           <dbl>             <dbl>
## 1    Africa            54.8              52.9
## 2  Americas            73.6              72.9
## 3      Asia            70.7              72.4
## 4    Europe            77.6              78.6
## 5   Oceania            80.7              80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

3.2 Calculate spread measures

Let’s extend the powerful group_by() and summarize() syntax to measures of spread.

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())
## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##      <fctr>         <dbl>          <dbl> <int>
## 1    Africa         9.631         11.610    52
## 2  Americas         4.441          4.632    25
## 3      Asia         7.964         10.152    33
## 4    Europe         2.980          4.782    30
## 5   Oceania         0.729          0.516     2

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

3.3 Choose measures for center and spread

Using the shapes of the density plots, calculate the most appropriate measures of center and spread for the following:

The distribution of life expectancy in the countries of the Americas. Note you’ll need to apply a filter here. The distribution of country populations across the entire gap2007 dataset.

# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))
## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))
## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531   26702008

3.4 Transformations

Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()


# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()

3.5 Identify outliers

# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>% 
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

Chapter 4 Case study

4.1 Spam and num_char

Is there an association between spam and the length of an email? use the email dataset to settle that question. Begin by bringing up the help file and learning about all the variables with ?email.

As you explore the association between spam and the length of an email, use this opportunity to try out linking a dplyr chain with the layers in a ggplot2 object.

# Load packages
library(openintro)
library(ggplot2)
library(dplyr)

# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarize(median(num_char),
            IQR(num_char))
## # A tibble: 2 x 3
##    spam `median(num_char)` `IQR(num_char)`
##   <dbl>              <dbl>           <dbl>
## 1     0               6.83           13.58
## 2     1               1.05            2.82

# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

4.2 Spam and !!!

exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.

# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarize(median(exclaim_mess),
            IQR(exclaim_mess))
## # A tibble: 2 x 3
##    spam `median(exclaim_mess)` `IQR(exclaim_mess)`
##   <dbl>                  <dbl>               <dbl>
## 1     0                      1                   5
## 2     1                      0                   1

# Create plot for spam and exclaim_mess
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
  ggplot(aes(x = log_exclaim_mess)) +
  geom_histogram() +
  facet_wrap(~ spam)


# Alternative plot: side-by-side box plots
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
  ggplot(aes(x = 1, y = log_exclaim_mess)) +
  geom_boxplot() +
  facet_wrap(~ spam)


# Alternative plot: Overlaid density plots
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
  ggplot(aes(x = log_exclaim_mess, fill = spam)) +
  geom_density(alpha = 0.3)

4.3 Collapsing levels

If it was difficult to work with the heavy skew of exclaim_mess, the number of images attached to each email (image) poses even more of a challenge. Run the following code at the console to get a sense of its distribution:

table(email$image)
## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

Recall that this tabulates the number of cases in each category (so there were 3811 emails with 0 images, for example). Given the very low counts at the higher number of images, let’s collapse image into a categorical variable that indicates whether or not the email had at least one image. In this exercise, you’ll create this new variable and explore its association with spam.

# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

4.4 Data Integrity

# Test if images count as attachments
sum(email$image > email$attach)
## [1] 0

4.5 Answering questions with chains

When you have a specific question about a dataset, you can find your way to an answer by carefully constructing the appropriate chain of R code. For example, consider the following question:

“Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?” This can be answered with the following chain:

email %>% filter(spam == “not-spam”) %>% group_by(to_multiple) %>% summarize(median(num_char))

# Question 1
email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))
## # A tibble: 2 x 2
##    spam `median(dollar)`
##   <dbl>            <dbl>
## 1     0                4
## 2     1                2

# Question 2
email %>%
  filter(dollar > 10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

4.6 what’s in a number?

practice constructing a faceted barchart.

# Reorder levels
email$number <- factor(email$number, levels = c("none", "small", "big"))

# Construct plot of number
ggplot(email, aes(x = number)) +
  geom_bar() +
  facet_wrap(~ spam)

Interpretation: Given that an email contains no number, it is more likely to be spam.

Exploratory Data Analysis

Gabriel Nichols

7/11/2017

Exploratory Data Analysis

Chapter 1: Exploring Categorical Data

1.1 Contingency table review

1.2 Dropping levels

1.3 Side-by-side barcharts

1.4 Conditional proportions

1.5 Counts vs. proportions (2)

1.6 Marginal barchart

1.7 Conditional barchart

1.8 Improve piechart

Chapter 2: Exploring Numerical Data

2.1 Faceted histogram

2.2 Boxplots and density plots

2.3 Marginal and conditional histograms

2.4 Three binwidths

2.5 Box plots for outliers

2.6 Plot selection

2.7 3 variable plot

Chapter 3 Numerical Summaries

3.1 Calculate center measures

3.2 Calculate spread measures

3.3 Choose measures for center and spread

3.4 Transformations

3.5 Identify outliers

Chapter 4 Case study

4.1 Spam and num_char

4.2 Spam and !!!

4.3 Collapsing levels

4.4 Data Integrity

4.5 Answering questions with chains

4.6 what’s in a number?

Quiz 4

1

2