Introduction

Course notes from the Exploratory Data Analysis course on DataCamp

What's Covered

Exploring Categorical Data
Exploring Numerical Data
Numerical Summaries
Case Study

Libraries and Data

source('create_datasets.R')

library(readr)
library(dplyr)
library(ggplot2)
library(openintro)

cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")

Exploring Categorical Data

Exploring categorical data

– Bar chart expectations

Bar charts with categorical variables on the x axis and in the fill are a common way to visualize a contingency table.
It is essentially what you would get if you used the table function with two variables.
How you show the data can change the perception.
The choice of fill variable and the position of the bars (fill, dodge, stack) can each give a different impression.

– Contingency table review

# Print the first rows of the data
head(comics)

##                                    name      id   align        eye
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes
##         hair gender  gsm             alive appearances first_appear
## 1 Brown Hair   Male <NA> Living Characters        4043       Aug-62
## 2 White Hair   Male <NA> Living Characters        3360       Mar-41
## 3 Black Hair   Male <NA> Living Characters        3061       Oct-74
## 4 Black Hair   Male <NA> Living Characters        2961       Mar-63
## 5 Blond Hair   Male <NA> Living Characters        2258       Nov-50
## 6    No Hair   Male <NA> Living Characters        2255       Nov-61
##   publisher
## 1    marvel
## 2    marvel
## 3    marvel
## 4    marvel
## 5    marvel
## 6    marvel

# Check levels of align
levels(comics$align)

## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

# Check the levels of gender
levels(comics$gender)

## [1] "Female" "Male"   "Other"

# Create a 2-way contingency table
table(comics$align, comics$gender)

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

– Dropping levels

# Load dplyr

# Print tab
tab <- table(comics$align, comics$gender)
tab

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

# Remove align level
comics <- comics %>%
  filter(align != 'Reformed Criminals') %>%
  droplevels()

levels(comics$align)

## [1] "Bad"     "Good"    "Neutral"

– Side-by-side barcharts

# Load ggplot2

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

– Bar chart interpretation

Among characters with “Neutral” alignment, males are the most common.
In general, there is an association between gender and alignment.
There are more male characters than female characters in this dataset.

Counts vs. proportions

# simplify display format
options(scipen = 999, digits = 3) 

## create table of counts
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt

##          
##            Bad Good Neutral
##   No Dual  474  647     390
##   Public  2172 2930     965
##   Secret  4493 2475     959
##   Unknown    7    0       2

# Proportional table
# All values add up to 1
prop.table(tbl_cnt)

##          
##                Bad     Good  Neutral
##   No Dual 0.030553 0.041704 0.025139
##   Public  0.140003 0.188862 0.062202
##   Secret  0.289609 0.159533 0.061815
##   Unknown 0.000451 0.000000 0.000129

sum(prop.table(tbl_cnt))

## [1] 1

# All rows add up to 1
prop.table(tbl_cnt, 1)

##          
##             Bad  Good Neutral
##   No Dual 0.314 0.428   0.258
##   Public  0.358 0.483   0.159
##   Secret  0.567 0.312   0.121
##   Unknown 0.778 0.000   0.222

# All columns add up to 1
prop.table(tbl_cnt, 2)

##          
##                Bad     Good  Neutral
##   No Dual 0.066331 0.106907 0.168394
##   Public  0.303946 0.484137 0.416667
##   Secret  0.628743 0.408956 0.414076
##   Unknown 0.000980 0.000000 0.000864

Look at the proportion of bad characters in the secret and unknown groups
Note there are very few characters with id = unknown

ggplot(comics, aes(x = id, fill = align)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

Swap the x and fill variables. Notice that most bad characters have a secret identity (not unknown).
Here you can see more clearly that there are very few characters at all with id = unknown.

ggplot(comics, aes(x = align, fill = id)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

– Conditional proportions

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions

##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

prop.table(tab, 2)

##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

Approximately what proportion of all female characters are good?
- 51%
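As a rough illustration of what the margin argument of prop.table does, the sketch below rebuilds the three kinds of proportions by hand. The table `tab_demo`, its counts, and the labels `a`, `b`, `u`, `v` are made up for this example and are not taken from the comics data.

```r
# Hypothetical 2 x 2 table of counts (not from the comics data)
tab_demo <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                            dimnames = list(row = c("a", "b"),
                                            col = c("u", "v"))))

# Joint proportions: every cell divided by the grand total
joint <- tab_demo / sum(tab_demo)

# Conditional on rows: each cell divided by its row total
by_row <- tab_demo / rowSums(tab_demo)

# Conditional on columns: each cell divided by its column total
by_col <- sweep(tab_demo, 2, colSums(tab_demo), "/")

# These should agree with prop.table()
all.equal(joint,  prop.table(tab_demo),    check.attributes = FALSE)
all.equal(by_row, prop.table(tab_demo, 1), check.attributes = FALSE)
all.equal(by_col, prop.table(tab_demo, 2), check.attributes = FALSE)
```

Reading the question "what proportion of all female characters are good?" as "condition on the gender column" is what points to margin 2 in the table above.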

– Counts vs. proportions (2)

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill")

Distribution of one variable

# Can use table function on just one variable
# This is called a marginal distribution
table(comics$id)

## 
## No Dual  Public  Secret Unknown 
##    1511    6067    7927       9

# Simple barchart
ggplot(comics, aes(x = id)) + 
  geom_bar()

You can also facet to see the levels of a variable individually.
This is a little easier than filtering each level and plotting it separately.
This is a rearrangement of the bar chart we plotted earlier.
- We facet by alignment rather than coloring the stack.
- This can make it a little easier to answer some questions.

ggplot(comics, aes(x = id)) + 
  geom_bar() + 
  facet_wrap(~align)

– Marginal barchart

It makes more sense to put Neutral between Bad and Good.
We need to reorder the levels so the chart is drawn this way.
Otherwise the levels default to alphabetical order.

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()
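The effect of the levels argument can be checked without plotting. This is a minimal sketch on a made-up character vector (`align_demo` is hypothetical, not the comics data).

```r
# Hypothetical character vector of alignments (not the comics data)
align_demo <- c("Good", "Bad", "Neutral", "Bad", "Good", "Bad")

# Default: levels are sorted alphabetically
levels(factor(align_demo))

# Explicit order: Neutral sits between Bad and Good
align_ord <- factor(align_demo, levels = c("Bad", "Neutral", "Good"))
levels(align_ord)

# table() follows the level order, as does a ggplot2 discrete axis
table(align_ord)
```

Only the order of the levels changes; the underlying values are the same.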

– Conditional barchart

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

– Improve piechart

# Put levels of flavor in descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

head(pies$flavor)

## [1] cherry cherry cherry cherry cherry cherry
## 7 Levels: apple key lime boston creme blueberry cherry ... strawberry

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) + 
  geom_bar(fill = "chartreuse") + 
  theme(axis.text.x = element_text(angle = 90))

Exploring Numerical Data

Exploring numerical data

# A dot plot shows all the datapoints
ggplot(cars, aes(x = weight)) + 
  geom_dotplot(dotsize = 0.4)

# A histogram groups the points into bins so it does not get overwhelming
ggplot(cars, aes(x = weight)) + 
  geom_histogram(binwidth = 500)

# A density plot gives a bigger picture representation of the distribution
# It is more helpful when there is a lot of data
ggplot(cars, aes(x = weight)) + 
  geom_density()

# A boxplot is a good way to show just the summary info of the distribution
ggplot(cars, aes(x = 1, y = weight)) + 
  geom_boxplot() +
  coord_flip()

– Faceted histogram

# Load package
library(ggplot2)

# Learn data structure
str(cars)

## 'data.frame':    428 obs. of  19 variables:
##  $ name       : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ suv)

– Boxplots and density plots

unique(cars$ncyl)

## [1]  4  6  3  8  5 12 10 -1

table(cars$ncyl)

## 
##  -1   3   4   5   6   8  10  12 
##   2   1 136   7 190  87   2   3

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

– Compare distribution via plots

The highest mileage cars have 4 cylinders.
The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car.
Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.

Distribution of one variable

– Marginal and conditional histograms

# Create hist of horsepwr
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  ggtitle("Horsepower distribution")

# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Horsepower distribution for msrp < 25000")

– Marginal and conditional histograms interpretation

The highest horsepower car in the less expensive range has just under 250 horsepower.

– Three binwidths

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("binwidth = 3")

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("binwidth = 30")

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("binwidth = 60")
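To see numerically what the three binwidths do, the sketch below counts observations per bin with base R's hist(). The vector `hp_demo` and the helper `count_bins` are hypothetical, and the bin edges here are aligned to multiples of the binwidth, which may not match the alignment ggplot2 chooses.

```r
# Hypothetical horsepower values (not the cars data)
hp_demo <- c(103, 110, 130, 132, 140, 140, 155, 170, 200, 215, 240, 300)

# Hypothetical helper: counts per bin for a given binwidth
count_bins <- function(x, binwidth) {
  breaks <- seq(floor(min(x) / binwidth) * binwidth,
                ceiling(max(x) / binwidth) * binwidth,
                by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

count_bins(hp_demo, 3)    # many narrow bins, mostly 0 or 1 per bin
count_bins(hp_demo, 30)   # moderate detail
count_bins(hp_demo, 60)   # few wide bins, fine structure is smoothed away
```

A narrow binwidth shows every gap in the data, while a wide one can hide features such as a second mode.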

Box plots

– Box plots for outliers

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()
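The points that a box plot draws individually are, by default, the values lying more than 1.5 * IQR beyond the edges of the box. The sketch below applies that rule by hand to made-up prices (`msrp_demo` is hypothetical, not the cars data) and compares it with base R's boxplot.stats(), which uses hinges that can differ slightly from quantile() on other data.

```r
# Hypothetical prices (not the cars data); one value is far from the rest
msrp_demo <- c(12000, 15000, 18000, 21000, 24000, 27000, 30000, 33000, 190000)

# Quartiles and the conventional 1.5 * IQR fences
q <- quantile(msrp_demo, c(0.25, 0.75))
lower_fence <- q[1] - 1.5 * IQR(msrp_demo)
upper_fence <- q[2] + 1.5 * IQR(msrp_demo)

# Values outside the fences would be drawn as individual points
flagged <- msrp_demo[msrp_demo < lower_fence | msrp_demo > upper_fence]
flagged

# boxplot.stats() applies the same 1.5 rule, based on hinges
boxplot.stats(msrp_demo)$out
```

The fixed cutoff used in the notes (msrp < 100000) is a judgment call made after looking at the plot, not the output of this rule.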

– Plot selection

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

cars %>%
  ggplot(aes(city_mpg)) +
  geom_density()

# Create plot of width
cars %>%
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()

cars %>%
  ggplot(aes(x = width)) +
  geom_density()

Visualization in higher dimensions

– 3 variable plot

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv) +
  ggtitle("hwy_mpg by ncyl and suv")

– Interpret 3 var plot

Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases.

Numerical Summaries

Measures of center

What is a typical value for life expectancy?
- We will look at just a few data points here
- And just the females

head(life)

##     State         County fips Year Female.life.expectancy..years.
## 1 Alabama Autauga County 1001 1985                          77.02
## 2 Alabama Baldwin County 1003 1985                          78.84
## 3 Alabama Barbour County 1005 1985                          76.04
## 4 Alabama    Bibb County 1007 1985                          76.62
## 5 Alabama  Blount County 1009 1985                          78.92
## 6 Alabama Bullock County 1011 1985                          75.12
##   Female.life.expectancy..national..years.
## 1                                    77.79
## 2                                    77.79
## 3                                    77.79
## 4                                    77.79
## 5                                    77.79
## 6                                    77.79
##   Female.life.expectancy..state..years. Male.life.expectancy..years.
## 1                                  76.9                        68.13
## 2                                  76.9                        71.14
## 3                                  76.9                        66.83
## 4                                  76.9                        67.34
## 5                                  76.9                        70.60
## 6                                  76.9                        66.61
##   Male.life.expectancy..national..years.
## 1                                  70.85
## 2                                  70.85
## 3                                  70.85
## 4                                  70.85
## 5                                  70.85
## 6                                  70.85
##   Male.life.expectancy..state..years.
## 1                               69.06
## 2                               69.06
## 3                               69.06
## 4                               69.06
## 5                               69.06
## 6                               69.06

x <- head(round(life$Female.life.expectancy..years.), 11)
x

##  [1] 77 79 76 77 79 75 77 77 77 78 77

mean

balance point of the data
sensitive to extreme values

sum(x)/11

## [1] 77.18182

mean(x)

## [1] 77.18182

median

middle value of the data
robust to extreme values
most appropriate measure when working with skewed data

sort(x)

##  [1] 75 76 77 77 77 77 77 77 78 79 79

median(x)

## [1] 77
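The difference between "sensitive" and "robust" can be seen by adding a single extreme value. In this small sketch the value 120 is hypothetical and is appended only for illustration.

```r
# The eleven rounded values shown above
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Append one hypothetical extreme value (not in the data)
x_extreme <- c(x, 120)

mean(x);   mean(x_extreme)     # the mean is pulled toward 120
median(x); median(x_extreme)   # the median stays at 77
```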

mode

most common value

table(x)

## x
## 75 76 77 78 79 
##  1  1  6  1  2
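Base R has no function that returns the statistical mode directly; mode() reports the storage mode of an object instead. One possible helper, named `stat_mode` here purely for illustration, reads the answer off the table:

```r
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

stat_mode(x)   # 77, which appears 6 times
mode(x)        # "numeric": base R's mode() is about storage, not frequency
```

If several values tie for the highest count, this helper returns all of them.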

– Calculate center measures

str(gapminder)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))

## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##      <fctr>           <dbl>             <dbl>
## 1    Africa        54.80604           52.9265
## 2  Americas        73.60812           72.8990
## 3      Asia        70.72848           72.3960
## 4    Europe        77.64860           78.6085
## 5   Oceania        80.71950           80.7195

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

Measures of variability

We want to know ‘How much is the data spread out from the middle?’
Just looking at the data gives us a sense of this.
- But we want to break it down to one number so we can compare sample distributions.

##  [1] 77 79 76 77 79 75 77 77 77 78 77

We could just take the difference between each point and the mean and add them up.
- But that would equal 0 (up to rounding error). That's the idea of the mean.

# Look at the difference between each point and the mean
sum(x - mean(x))

## [1] -5.684342e-14
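The tiny non-zero result above is floating point rounding error, not a real deviation from zero. A short sketch of how one might check that, using all.equal() rather than an exact comparison:

```r
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

dev_sum <- sum(x - mean(x))

dev_sum == 0                    # may be FALSE because of rounding error
isTRUE(all.equal(dev_sum, 0))   # TRUE: equal within numerical tolerance
```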

So we can square the differences.
- But this number will keep getting bigger as you add more observations.
- We want something that is stable.

# Square each difference to get rid of negatives then sum
sum((x - mean(x))^2)

## [1] 13.63636

Variance

So we divide by n - 1.
This is called the sample variance. It is one of the most useful measures of the spread of a sample distribution.

sum((x - mean(x))^2)/(length(x) - 1)

## [1] 1.363636

var(x)

## [1] 1.363636
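To illustrate why dividing by n - 1 gives something more stable, the sketch below doubles the sample by repeating the same values (a hypothetical larger sample, `x2`) and compares the raw sum of squares with the variance. The helper `ss` is introduced only for this example.

```r
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Hypothetical larger sample: the same values observed twice
x2 <- rep(x, 2)

# Sum of squared deviations from the mean
ss <- function(v) sum((v - mean(v))^2)

ss(x);  ss(x2)    # the sum of squares doubles
var(x); var(x2)   # the variance changes only slightly
```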

Standard Deviation

Another very useful metric is the sample standard deviation.
This is just the square root of the variance.
The nice thing about the std dev is that it is in the same units as the original data.
In this case it's 1.17 years.

sqrt(sum((x - mean(x))^2)/(length(x) - 1))

## [1] 1.167748

sd(x)

## [1] 1.167748

Inter Quartile Range

The IQR is the width of the middle 50% of the data (the third quartile minus the first quartile).
The nice thing about this one is that it is not sensitive to extreme values.
All of the other measures of spread listed here are sensitive to extreme values.

summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   75.00   77.00   77.00   77.18   77.50   79.00

IQR(x)

## [1] 0.5
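As a quick check of both statements, the sketch below computes the IQR from the quartiles and then appends one hypothetical extreme value (120, not in the data) to compare how the standard deviation and the IQR respond.

```r
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# IQR is the third quartile minus the first quartile
q <- quantile(x, c(0.25, 0.75))
unname(q[2] - q[1])   # same as IQR(x)

# Append one hypothetical extreme value and compare the change
x_extreme <- c(x, 120)
c(sd  = sd(x),  sd_extreme  = sd(x_extreme))
c(iqr = IQR(x), iqr_extreme = IQR(x_extreme))
```

With this one extra point the standard deviation grows to more than ten years, while the IQR moves by less than one year.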

Range

max and min are also interesting
as is the range, or the difference between max and min

max(x)

## [1] 79

min(x)

## [1] 75

diff(range(x))

## [1] 4

– Calculate spread measures

str(gap2007)

## Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ year     : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ lifeExp  : num  43.8 76.4 72.3 42.7 75.3 ...
##  $ pop      : int  31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
##  $ gdpPercap: num  975 5937 6223 4797 12779 ...

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())

## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##      <fctr>         <dbl>          <dbl> <int>
## 1    Africa     9.6307807       11.61025    52
## 2  Americas     4.4409476        4.63200    25
## 3      Asia     7.9637245       10.15200    33
## 4    Europe     2.9798127        4.78250    30
## 5   Oceania     0.7290271        0.51550     2

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

– Choose measures for center and spread

# Compute stats for lifeExp in Americas
head(gap2007)

## # A tibble: 6 x 6
##       country continent  year lifeExp      pop  gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>      <dbl>
## 1 Afghanistan      Asia  2007  43.828 31889923   974.5803
## 2     Albania    Europe  2007  76.423  3600523  5937.0295
## 3     Algeria    Africa  2007  72.301 33333216  6223.3675
## 4      Angola    Africa  2007  42.731 12420476  4797.2313
## 5   Argentina  Americas  2007  75.320 40301927 12779.3796
## 6   Australia   Oceania  2007  81.235 20434176 34435.3674

gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1        73.60812      4.440948

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))

## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531   26702008

Shape and transformations

4 characteristics of a distribution that are of interest:

center
- already covered
spread or variability
- already covered
shape
- modality: number of prominent humps (uni, bi, multi, or uniform - no humps)
- skew (right, left, or symmetric)
- Can transform to fix skew
outliers

– Describe the shape

A: unimodal, left-skewed
B: unimodal, symmetric
C: unimodal, right-skewed
D: bimodal, symmetric

– Transformations

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()
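One rough way to see the effect of the log transform without a plot is to compare the mean with the median before and after. The values in `pop_demo` are made up for this sketch and are not the gapminder populations.

```r
# Hypothetical right-skewed populations (not the gapminder data)
pop_demo <- c(2e5, 1e6, 3e6, 8e6, 2e7, 6e7, 1.3e9)

# On the raw scale the mean sits far above the median
c(mean = mean(pop_demo), median = median(pop_demo))

# On the log scale the two are much closer, a sign of reduced skew
log_pop_demo <- log(pop_demo)
c(mean = mean(log_pop_demo), median = median(log_pop_demo))
```

A mean close to the median is only a hint of symmetry, so the density plot remains the better check.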

Outliers

– Identify outliers

# Filter for Asia, add column indicating outliers
str(gapminder)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

Case Study

Introducing the data

– Spam and num_char

# ggplot2, dplyr, and openintro are loaded

# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarize( 
    median(num_char),
    IQR(num_char))

## # A tibble: 2 x 3
##    spam `median(num_char)` `IQR(num_char)`
##   <dbl>              <dbl>           <dbl>
## 1     0              6.831        13.58225
## 2     1              1.046         2.81800

str(email)

## 'data.frame':    3921 obs. of  21 variables:
##  $ spam        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ to_multiple : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ sent_email  : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ time        : POSIXct, format: "2012-01-01 01:16:41" "2012-01-01 02:03:59" ...
##  $ image       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ attach      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ dollar      : num  0 0 4 0 0 0 0 0 0 0 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 2 2 0 0 0 0 ...
##  $ num_char    : num  11.37 10.5 7.77 13.26 1.23 ...
##  $ line_breaks : int  202 202 192 255 29 25 193 237 69 68 ...
##  $ format      : num  1 1 1 1 0 0 1 1 0 1 ...
##  $ re_subj     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  0 1 6 48 1 1 1 18 1 0 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...

table(email$spam)

## 
##    0    1 
## 3554  367

email <- email %>%
  mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")))

# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

– Spam and num_char interpretation

The median length of not-spam emails is greater than that of spam emails

– Spam and !!!

# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarize(
    median(exclaim_mess),
    IQR(exclaim_mess))

## # A tibble: 2 x 3
##       spam `median(exclaim_mess)` `IQR(exclaim_mess)`
##     <fctr>                  <dbl>               <dbl>
## 1 not-spam                      1                   5
## 2     spam                      0                   1

table(email$exclaim_mess)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 1435  733  507  128  190  113  115   51   93   45   85   17   56   20   43 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##   11   29   12   26    5   29    9   15    3   11    6   11    1    6    8 
##   30   31   32   33   34   35   36   38   39   40   41   42   43   44   45 
##   13   12   13    3    3    2    3    3    1    2    1    1    3    3    5 
##   46   47   48   49   52   54   55   57   58   62   71   75   78   89   94 
##    3    2    1    3    1    1    4    2    2    2    1    1    1    1    1 
##   96  139  148  157  187  454  915  939  947 1197 1203 1209 1236 
##    1    1    1    1    1    1    1    1    1    1    2    1    1

# Create plot for spam and exclaim_mess
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + 0.01)) %>%
  ggplot(aes(x = log_exclaim_mess)) + 
  geom_histogram() + 
  facet_wrap(~ spam)

– Spam and !!! interpretation

The most common value of exclaim_mess in both classes of email is zero (a log(exclaim_mess) of -4.6 after adding .01).
Even after a transformation, the distribution of exclaim_mess in both classes of email is right-skewed.
The typical number of exclamations in the not-spam group appears to be slightly higher than in the spam group.
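The small constant matters because the log of zero is not finite. A minimal sketch on made-up counts (`exclaim_demo` is hypothetical):

```r
# Hypothetical counts of exclamation marks, including zeros
exclaim_demo <- c(0, 0, 1, 4, 20)

log(exclaim_demo)          # zeros become -Inf and cannot be placed in a bin
log(exclaim_demo + 0.01)   # zeros become log(0.01), about -4.6
```

The choice of 0.01 is arbitrary; it only decides where the spike of zeros lands on the log scale.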

Check-in 1

Zero inflation in the exclaim_mess variable
- you can analyze the two parts separately
- or turn it into a categorical variable of is-zero, not-zero
Could make a barchart
- need to decide if you are more interested in counts or proportions
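The second option might look like the sketch below, written in base R on a made-up data frame (`demo` and its values are hypothetical, not the email data). It shows both the counts and the proportions, which are the numbers the two kinds of bar chart would display.

```r
# Hypothetical data (not the email data)
demo <- data.frame(
  spam         = c("not-spam", "not-spam", "not-spam", "spam", "spam", "spam"),
  exclaim_mess = c(0, 3, 12, 0, 0, 1)
)

# Collapse the count into a two-level categorical variable
demo$zero <- ifelse(demo$exclaim_mess == 0, "is-zero", "not-zero")

# Counts, then proportions within each spam class (rows sum to 1)
tab_zero <- table(demo$spam, demo$zero)
tab_zero
prop.table(tab_zero, 1)
```

Counts keep the class sizes visible, while proportions make the two classes directly comparable.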

– Collapsing levels

table(email$image)

## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

– Image and spam interpretation

An email without an image is more likely to be not-spam than spam

– Data Integrity

# Test if images count as attachments
sum(email$image > email$attach)

## [1] 0

There are no emails with more images than attachments, so images must be counted as attachments also.

– Answering questions with chains

## Within non-spam emails, is the typical length of emails shorter for 
## those that were sent to multiple people?
email %>%
   filter(spam == "not-spam") %>%
   group_by(to_multiple) %>%
   summarize(median(num_char))

## # A tibble: 2 x 2
##   to_multiple `median(num_char)`
##         <dbl>              <dbl>
## 1           0             7.1960
## 2           1             5.3645

# Question 1
## For emails containing the word "dollar", does the typical spam email 
## contain a greater number of occurrences of the word than the typical non-spam email?
table(email$dollar)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 3175  120  151   10  146   20   44   12   35   10   22   10   20    7   14 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##    5   23    2   14    1   10    7   12    7    7    3    7    1    5    1 
##   30   32   34   36   40   44   46   48   54   63   64 
##    1    2    1    2    3    3    2    1    1    1    3

email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))

## # A tibble: 2 x 2
##       spam `median(dollar)`
##     <fctr>            <dbl>
## 1 not-spam                4
## 2     spam                2

# Question 2
## If you encounter an email with greater than 10 occurrences of the word "dollar", 
## is it more likely to be spam or not-spam?

email %>%
  filter(dollar > 10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

Not-spam, at least in this dataset

Check-in 2

– What’s in a number?

levels(email$number)

## [1] "none"  "small" "big"

table(email$number)

## 
##  none small   big 
##   549  2827   545

# Reorder levels
email$number <- factor(email$number, levels = c("none","small","big"))

# Construct plot of number
ggplot(email, aes(x = number)) +
  geom_bar() + 
  facet_wrap( ~ spam)

– What’s in a number interpretation

Given that an email contains a small number, it is more likely to be not-spam.
Given that an email contains a big number, it is more likely to be not-spam.
Within both spam and not-spam, the most common number is a small one.

Exploratory Data Analysis

William Surles

2017-08-09
