This notebook summarizes DataCamp’s Exploratory Data Analysis course in R.

library(dplyr)
library(ggplot2)
library(gapminder)
library(tidyr)
library(readr)
library(openintro)
options(scipen=999,digits=3)

Exploring categorical data

A factor variable’s unique values can be listed with the levels() function, which does not report NA as a level. To get the joint frequency distribution of two factor variables, base R uses a contingency table: table(fact1, fact2). Since I couldn’t find the comics dataset in a package, I downloaded it from DataCamp’s website.

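# stringsAsFactors = TRUE is needed in R >= 4.0 so that text columns load as factors, as in the output below
comics <- read.csv("comics.csv", stringsAsFactors = TRUE)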
glimpse(comics)

## Observations: 23,272
## Variables: 11
## $ name         <fct> Spider-Man (Peter Parker), Captain America (Steve...
## $ id           <fct> Secret, Public, Public, Public, No Dual, Public, ...
## $ align        <fct> Good, Good, Neutral, Good, Good, Good, Good, Good...
## $ eye          <fct> Hazel Eyes, Blue Eyes, Blue Eyes, Blue Eyes, Blue...
## $ hair         <fct> Brown Hair, White Hair, Black Hair, Black Hair, B...
## $ gender       <fct> Male, Male, Male, Male, Male, Male, Male, Male, M...
## $ gsm          <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ alive        <fct> Living Characters, Living Characters, Living Char...
## $ appearances  <int> 4043, 3360, 3061, 2961, 2258, 2255, 2072, 2017, 1...
## $ first_appear <fct> Aug-62, Mar-41, Oct-74, Mar-63, Nov-50, Nov-61, N...
## $ publisher    <fct> marvel, marvel, marvel, marvel, marvel, marvel, m...

levels(comics$align)

## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

levels(comics$id)

## [1] "No Dual" "Public"  "Secret"  "Unknown"

# Contingency table
table(comics$id,comics$align)

##          
##            Bad Good Neutral Reformed Criminals
##   No Dual  474  647     390                  0
##   Public  2172 2930     965                  1
##   Secret  4493 2475     959                  1
##   Unknown    7    0       2                  0
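To see how missing values behave here, a minimal sketch with made-up factors (toy_id and toy_align are hypothetical, not columns of comics): levels() does not list NA, and table() leaves out any pair with an NA unless useNA is set.

```r
# Toy factors with some missing values (hypothetical data)
toy_id    <- factor(c("Secret", "Public", NA, "Secret", "Public", "Secret"))
toy_align <- factor(c("Bad", "Good", "Good", NA, "Good", "Bad"))

# NA is not reported as a level
levels(toy_id)

# By default, pairs with an NA in either factor are left out of the table
tab_default <- table(toy_id, toy_align)
sum(tab_default)

# useNA = "ifany" keeps NA as its own row/column
tab_with_na <- table(toy_id, toy_align, useNA = "ifany")
sum(tab_with_na)
```

Comparing the two sums is a quick way to find out how many rows a contingency table has silently dropped.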

A ggplot always needs three basic inputs: 1) the dataset, 2) the variables mapped to aesthetics such as the axes, and 3) the layer (geom) to draw. For two categorical variables, a stacked bar chart works well: one categorical variable goes on the x axis, and within each bar the other categorical variable is shown by fill color.

ggplot(comics,aes(x=id,fill=align)) + geom_bar()

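Many characters have a missing (NA) id or alignment: the contingency table above counts only 15,516 of the 23,272 rows, because table() drops rows with an NA, while the chart shows them as NA. The most common combination is a secret identity with bad alignment. Since the alignment proportions are not the same across the different ids, it does appear that id and align are related.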

# Contingency table between gender and align
table(comics$gender,comics$align)

##         
##           Bad Good Neutral Reformed Criminals
##   Female 1573 2490     836                  1
##   Male   7561 4809    1799                  2
##   Other    32   17      17                  0

It appears that females are less likely than males to be portrayed as bad. We can drop levels that contain very little data.

tab <- table(comics$gender,comics$align)

# Reformed Criminals has the lowest frequency, so filter it out and drop the now-empty level

comics <- comics %>% filter(align!="Reformed Criminals") %>% droplevels()
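Why droplevels() is needed after the filter, as a small base R sketch on a made-up factor (toy_alignment is hypothetical): filtering removes the rows, but the empty level stays in the factor and would still show up in tables and plot legends.

```r
# Toy factor (hypothetical data, not the comics dataset)
toy_alignment <- factor(c("Bad", "Good", "Neutral", "Reformed Criminals", "Good"))

# Removing the rows does not remove the level itself
kept_raw <- toy_alignment[toy_alignment != "Reformed Criminals"]
levels(kept_raw)
table(kept_raw)   # "Reformed Criminals" is still listed, with a count of 0

# droplevels() discards levels that no longer occur
kept_clean <- droplevels(kept_raw)
levels(kept_clean)
```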

When you don’t want the bars stacked but side by side, use the position = "dodge" argument in geom_bar(). The theme layer can help you format the axis text; for example, element_text() can rotate the labels for better readability.

ggplot(comics,aes(x=align,fill=gender)) + geom_bar(position="dodge")

ggplot(comics,aes(x=gender,fill=align)) + geom_bar(position="dodge") +
  theme(axis.text.x = element_text(angle=90))

Sometimes proportions matter more than counts. If we pass the contingency table to prop.table(), it computes joint proportions, where the whole table sums to 1. If you want conditional proportions, pass a second argument: 1 makes each row sum to 1, and 2 makes each column sum to 1. If you want each bar in your stacked chart to add up to 1 to show proportions, use position = "fill" in geom_bar(). ylab("label text") sets the y-axis label.

tab_cnt <- table(comics$id,comics$align)
prop.table(tab_cnt)

##          
##                Bad     Good  Neutral
##   No Dual 0.030553 0.041704 0.025139
##   Public  0.140003 0.188862 0.062202
##   Secret  0.289609 0.159533 0.061815
##   Unknown 0.000451 0.000000 0.000129

# Rows sum to 1
prop.table(tab_cnt,1)

##          
##             Bad  Good Neutral
##   No Dual 0.314 0.428   0.258
##   Public  0.358 0.483   0.159
##   Secret  0.567 0.312   0.121
##   Unknown 0.778 0.000   0.222

# Columns sum to 1
prop.table(tab_cnt,2)

##          
##                Bad     Good  Neutral
##   No Dual 0.066331 0.106907 0.168394
##   Public  0.303946 0.484137 0.416667
##   Secret  0.628743 0.408956 0.414076
##   Unknown 0.000980 0.000000 0.000864
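A quick sanity check of the three kinds of proportions, as a sketch that retypes the id by align counts printed earlier (with the Reformed Criminals column left out) instead of reading comics.csv. The object names are my own.

```r
# Counts copied from the id x align contingency table above
counts <- matrix(
  c( 474,  647, 390,
    2172, 2930, 965,
    4493, 2475, 959,
       7,    0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(id    = c("No Dual", "Public", "Secret", "Unknown"),
                  align = c("Bad", "Good", "Neutral"))
)

joint    <- prop.table(counts)      # whole table sums to 1
by_id    <- prop.table(counts, 1)   # each row sums to 1
by_align <- prop.table(counts, 2)   # each column sums to 1

sum(joint)
rowSums(by_id)
colSums(by_align)

# Share of bad characters among those with a secret identity: 4493 / 7927
by_id["Secret", "Bad"]
```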

# 100% stacked bar chart, conditioned on id
ggplot(comics,aes(x=id,fill=align)) + geom_bar(position="fill") + ylab("proportion")

# 100% stacked bar chart, conditioned on alignment
ggplot(comics,aes(x=align,fill=id)) + geom_bar(position="fill") + ylab("proportion")

tab <- table(comics$align, comics$gender)
prop.table(tab)     # Joint proportions

##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

prop.table(tab, 2)  # Conditional on columns

##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill") +
  ylab("proportion")

In the charts, whatever you condition on goes on x axis.

Two-way contingency tables describe two variables jointly. If we sum across either the rows or the columns, we get the distribution of a single variable. These one-way tables are called marginal distributions.

table(comics$id)

## 
## No Dual  Public  Secret Unknown 
##    1511    6067    7927       9
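The link between the two-way table and this marginal distribution can be checked directly. A sketch that retypes the id by align counts shown earlier (after dropping Reformed Criminals) and sums them with margin.table(); the object names are my own.

```r
# id x align counts copied from the contingency table shown earlier
id_align <- matrix(
  c( 474,  647, 390,
    2172, 2930, 965,
    4493, 2475, 959,
       7,    0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(id    = c("No Dual", "Public", "Secret", "Unknown"),
                  align = c("Bad", "Good", "Neutral"))
)

# Summing over the columns (margin 1 keeps the rows) gives the marginal distribution of id
id_margin <- margin.table(id_align, 1)
id_margin

# Summing over the rows (margin 2 keeps the columns) gives the marginal distribution of align
align_margin <- margin.table(id_align, 2)
align_margin
```

The row sums reproduce the counts printed by table(comics$id) above.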

ggplot(comics,aes(x=id)) + geom_bar()

Sometimes you want to analyze one variable’s distribution within each value (or just one value) of another variable, for example the distribution of id for each alignment. We can either filter the dataset for the desired value or use faceting: adding a facet_wrap(~var_name) layer to a simple bar chart achieves this.

ggplot(comics,aes(x=id)) + geom_bar() + facet_wrap(~align)

A pie chart also shows categorical data: the size of each slice is proportional to the share of that value in the variable. Pie charts can make it difficult to compare the sizes of different slices.

To represent a natural order in a factor variable, it can be worthwhile to change the order of its levels.

comics$align <- factor(comics$align,levels=c("Bad","Neutral","Good"))

ggplot(comics, aes(x = align)) + geom_bar()

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + geom_bar() + facet_wrap(~ gender)

Pie charts, especially those with many categories, are usually better off as bar charts.

# Since I could not find the pies dataset , I downloaded it from datacamp and saved as a csv
pies <- read.csv("pies_data.csv")

# Getting the freq distribution of flavors, so one can arrange the levels in descending order
table(pies$flavor)

## 
##        apple    blueberry boston creme       cherry     key lime 
##           17           14           15           13           16 
##      pumpkin   strawberry 
##           12           11

# In descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)
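Instead of typing the levels by hand, the order can also be derived from the frequency table. A sketch that rebuilds a flavor vector from the counts printed above (so it runs without pies_data.csv); flavor_counts, toy_flavor and lev_auto are names I made up.

```r
# Flavor counts copied from the table(pies$flavor) output above
flavor_counts <- c(apple = 17, blueberry = 14, "boston creme" = 15,
                   cherry = 13, "key lime" = 16, pumpkin = 12, strawberry = 11)

# Rebuild a character vector with one entry per pie
toy_flavor <- rep(names(flavor_counts), times = flavor_counts)

# Sort the frequency table in descending order and use its names as the levels
lev_auto <- names(sort(table(toy_flavor), decreasing = TRUE))
toy_flavor_fct <- factor(toy_flavor, levels = lev_auto)

levels(toy_flavor_fct)
```

This gives the same order as the hand-written lev vector and keeps working if the counts change.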

# Bar chart with a different color scheme
ggplot(pies,aes(x=flavor)) + geom_bar(fill="chartreuse") + theme(axis.text.x = element_text(angle=90))

Exploring numeric data

In str() output, num marks double-precision values, which can hold continuous data, whereas int marks integers, which are discrete.

# Loading cars dataset
#data(cars)
# It turns out that the cars dataset being used by datacamp is different so downloading it from there
cars <- read.csv("cars04.csv",stringsAsFactors = FALSE)

str(cars)

## 'data.frame':    428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

# Making a dotplot to show numerical data. IMO, it is like a bar chart only, but with points stacked on top of each other
ggplot(cars,aes(x=weight)) + geom_dotplot(dotsize=0.4)

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bindot).

# Histogram combines the dots, and the y axis now shows the actual count
ggplot(cars,aes(x=weight)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bin).

# The shape of the distribution can be better represented with a density plot, without the stepwise nature of a histogram
ggplot(cars,aes(x=weight)) + geom_density()

## Warning: Removed 2 rows containing non-finite values (stat_density).

# to gain an even bigger picture of the distribution where you are concerned about center, extreme values etc use a boxplot
ggplot(cars,aes(x=1,y=weight)) + geom_boxplot()

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

ggplot(cars,aes(x=1,y=weight)) + geom_boxplot() + coord_flip()

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

# highway mileage faceted by pickup
ggplot(cars,aes(x=hwy_mpg)) + geom_histogram() + facet_wrap(~pickup)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 14 rows containing non-finite values (stat_bin).

# city mileage faceted by suv
ggplot(cars, aes(x = city_mpg)) + geom_histogram() + facet_wrap(~ suv)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 14 rows containing non-finite values (stat_bin).

# Exploring relationship of mileage with number of cylinders
unique(cars$ncyl)

## [1]  4  6  3  8  5 12 10 -1

table(cars$ncyl)

## 
##  -1   3   4   5   6   8  10  12 
##   2   1 136   7 190  87   2   3

# Since only 4,6,8 are common, keeping only those in dataset
common_cyl <- cars %>% filter(ncyl %in% c(4,6,8))

# side by side box plots of city mpg separated by ncyl. Here instead of facetwrap, you can use x,y
ggplot(common_cyl,aes(y=city_mpg,x=as.factor(ncyl))) + geom_boxplot()

## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

# Overlaid density plots of city mpg colored by ncyl
ggplot(common_cyl,aes(x=city_mpg,color=as.factor(ncyl))) + geom_density()

## Warning: Removed 11 rows containing non-finite values (stat_density).

# OR

ggplot(common_cyl,aes(x=city_mpg,fill=as.factor(ncyl))) + geom_density(alpha=0.3)

## Warning: Removed 11 rows containing non-finite values (stat_density).

When you have a categorical variable, you can facet on it. However, if you want to see the distribution of one numerical variable for a specific subset of another numerical variable, you can pipe the filtered output into ggplot. To make the distribution look smoother, you can increase binwidth in a histogram and the bandwidth (bw) in a density plot.

cars %>% filter(eng_size<=2.0) %>% ggplot(aes(x=hwy_mpg)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4 rows containing non-finite values (stat_bin).

# Smoother histogram
cars %>% filter(eng_size<=2.0) %>% ggplot(aes(x=hwy_mpg)) + geom_histogram(binwidth = 5)

## Warning: Removed 4 rows containing non-finite values (stat_bin).

# density plot
cars %>% filter(eng_size<=2.0) %>% ggplot(aes(x=hwy_mpg)) + geom_density()

## Warning: Removed 4 rows containing non-finite values (stat_density).

# Smooth density plot
cars %>% filter(eng_size<=2.0) %>% ggplot(aes(x=hwy_mpg)) + geom_density(bw=5)

## Warning: Removed 4 rows containing non-finite values (stat_density).
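What binwidth and bw actually control can be seen without plotting. A base R sketch on made-up mileage values (toy_mpg is hypothetical), using hist(plot = FALSE) and density() as stand-ins for the ggplot layers above:

```r
# Hypothetical highway mileage values
toy_mpg <- c(24, 26, 27, 29, 31, 33, 33, 34, 36, 37, 41, 51)

# Narrow bins: many bars, most of them holding 0 or 1 observation
bins_narrow <- hist(toy_mpg, breaks = seq(20, 55, by = 1), plot = FALSE)
length(bins_narrow$counts)

# Wide bins: fewer bars, so the shape looks smoother
bins_wide <- hist(toy_mpg, breaks = seq(20, 55, by = 5), plot = FALSE)
bins_wide$counts

# For densities, a larger bandwidth plays the role of a larger binwidth
dens_default <- density(toy_mpg)
dens_smooth  <- density(toy_mpg, bw = 5)
c(default = dens_default$bw, smooth = dens_smooth$bw)
```

Too much smoothing can hide real structure, so it is worth trying a few values, as the horsepower histograms below do.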

# Histogram of Horsepower
cars %>% ggplot(aes(x=horsepwr)) +geom_histogram() + ggtitle("Distribution of Horsepower")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram of horsepower for affordable cars
cars %>%  filter(msrp<25000) %>% ggplot(aes(x=horsepwr)) + geom_histogram() +xlim(c(90, 550)) +
  ggtitle("Distribution of Horsepower for affordable cars")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

# Create hist of horsepwr with binwidth of 3
cars %>% ggplot(aes(x=horsepwr)) + geom_histogram(binwidth = 3) +
  ggtitle("Distribution of Horsepower with Binwidth=3")

# Create hist of horsepwr with binwidth of 30
cars %>% ggplot(aes(x=horsepwr)) + geom_histogram(binwidth = 30) +
  ggtitle("Distribution of Horsepower with Binwidth=30")

# Create hist of horsepwr with binwidth of 60
cars %>% ggplot(aes(x=horsepwr)) + geom_histogram(binwidth = 60) +
  ggtitle("Distribution of Horsepower with Binwidth=60")

A boxplot is built around three summary statistics of the data: the 25th, 50th and 75th percentiles. ggplot expects you to plot several boxplots side by side; if you want to see just one, use x = 1. Boxplots hide the shape of a distribution, so they cannot reveal bimodality. However, because they are based on quartiles, they are resistant to outliers and display extreme values as individual points.

# Making a boxplot of price of cars to identify extreme pricing
cars %>%ggplot(aes(x = 1, y = msrp)) + geom_boxplot()

# Now we will focus on non-outlier prices
# Exclude outliers from data
cars_no_out <- cars %>% filter(msrp<100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>% ggplot(aes(x = 1, y = msrp)) + geom_boxplot()
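How a boxplot decides which points to draw as outliers, sketched in base R on made-up prices (toy_price is hypothetical). This assumes the usual 1.5 x IQR rule, which is the default coef in boxplot.stats() and, as far as I know, also what geom_boxplot() uses. Note that the 100000 cutoff above was picked by eye, not by this rule.

```r
# Hypothetical prices in thousands of dollars, with one extreme value
toy_price <- c(12, 15, 18, 21, 24, 27, 30, 33, 36, 190)

# The three statistics the box is built from
quantile(toy_price, c(0.25, 0.50, 0.75))

# boxplot.stats() returns the five numbers behind the box and whiskers,
# plus the points lying more than 1.5 box lengths beyond the box
bp <- boxplot.stats(toy_price)
bp$stats
bp$out

# The median barely moves when the extreme value is removed, the mean moves a lot
c(median_all = median(toy_price), median_trim = median(toy_price[toy_price < 100]))
c(mean_all = mean(toy_price), mean_trim = mean(toy_price[toy_price < 100]))
```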

# Choosing plot for city_mpg
cars %>% ggplot(aes(x=city_mpg)) + geom_density()

## Warning: Removed 14 rows containing non-finite values (stat_density).

cars %>% ggplot(aes(x=1,y=city_mpg)) + geom_boxplot()

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

# Choosing plot for width
cars %>% ggplot(aes(x=width)) + geom_density()

## Warning: Removed 28 rows containing non-finite values (stat_density).

cars %>% ggplot(aes(x=1,y=width)) + geom_boxplot()

## Warning: Removed 28 rows containing non-finite values (stat_boxplot).

For city mileage, the boxplot does a good job of showing the outliers. For width, the density plot is more informative since the distribution does not have a single clear peak.

To plot three variables, we can use facet grids. The variable before the tilde defines the rows, the other defines the columns. To make the panels easier to read and interpret, we can use the labeller = label_both option in facet_grid().

ggplot(cars,aes(x=msrp)) + geom_density() + facet_grid(pickup~rear_wheel,labeller=label_both)

table(cars$pickup,cars$rear_wheel)

##        
##         FALSE TRUE
##   FALSE   306   98
##   TRUE     12   12

common_cyl %>% ggplot(aes(x = hwy_mpg)) + geom_histogram() + facet_grid(ncyl ~ suv) +
  ggtitle("Distribution of highway mileage across Engine Cylinder & SUVs")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite values (stat_bin).

Numerical summaries

The dataset discussed here contains information, such as life expectancy, about every county in the US. The data used here does not match the one discussed in the videos, so I’ve changed the analysis to match the videos as closely as possible.

# Load the life expectancy data
life <- read_csv("life_exp_raw.csv")

## Parsed with column specification:
## cols(
##   State = col_character(),
##   County = col_character(),
##   fips = col_integer(),
##   Year = col_integer(),
##   `Female life expectancy (years)` = col_double(),
##   `Female life expectancy (national, years)` = col_double(),
##   `Female life expectancy (state, years)` = col_double(),
##   `Male life expectancy (years)` = col_double(),
##   `Male life expectancy (national, years)` = col_double(),
##   `Male life expectancy (state, years)` = col_double()
## )

str(life)

## Classes 'tbl_df', 'tbl' and 'data.frame':    81691 obs. of  10 variables:
##  $ State                                   : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ County                                  : chr  "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ fips                                    : int  1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
##  $ Year                                    : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ Female life expectancy (years)          : num  77 78.8 76 76.6 78.9 ...
##  $ Female life expectancy (national, years): num  77.8 77.8 77.8 77.8 77.8 ...
##  $ Female life expectancy (state, years)   : num  76.9 76.9 76.9 76.9 76.9 76.9 76.9 76.9 76.9 76.9 ...
##  $ Male life expectancy (years)            : num  68.1 71.1 66.8 67.3 70.6 ...
##  $ Male life expectancy (national, years)  : num  70.8 70.8 70.8 70.8 70.8 ...
##  $ Male life expectancy (state, years)     : num  69.1 69.1 69.1 69.1 69.1 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 10
##   .. ..$ State                                   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ County                                  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ fips                                    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Year                                    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Female life expectancy (years)          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Female life expectancy (national, years): list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Female life expectancy (state, years)   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Male life expectancy (years)            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Male life expectancy (national, years)  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Male life expectancy (state, years)     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

# Mean of female life exp in first 11 counties
head(round(life$`Female life expectancy (years)`),11)

##  [1] 77 79 76 77 79 75 77 77 77 78 77

mean(head(round(life$`Female life expectancy (years)`),11))

## [1] 77.2

sum(head(round(life$`Female life expectancy (years)`),11))/11

## [1] 77.2

median(head(round(life$`Female life expectancy (years)`),11))

## [1] 77

The mean is very sensitive to extreme values, so it is not considered a good measure of center for skewed distributions, where it is usually pulled towards the longer tail.
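A tiny illustration with made-up numbers (toy_skewed is hypothetical): a single large value drags the mean far to the right while the median stays put.

```r
# Hypothetical right-skewed data: four small values and one very large one
toy_skewed <- c(1, 2, 3, 4, 100)

mean(toy_skewed)     # pulled towards the long right tail
median(toy_skewed)   # unaffected by how large the biggest value is

# Making the extreme value even larger changes the mean but not the median
toy_more_skewed <- c(1, 2, 3, 4, 1000)
c(mean = mean(toy_more_skewed), median = median(toy_more_skewed))
```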

life %>% mutate(west_coast=State %in% c("California","Oregon","Washington")) %>% group_by(west_coast) %>% summarise(mean(`Female life expectancy (years)`),median(`Female life expectancy (years)`))

## # A tibble: 2 x 3
##   west_coast `mean(\`Female life expectancy (years)\`)` `median(\`Female ~
##   <lgl>                                           <dbl>              <dbl>
## 1 F                                                78.7               78.8
## 2 T                                                79.6               79.5

Using the gapminder package’s data, filtered to 2007 as gap2007

data(gapminder)
gap2007 <- gapminder %>% filter(year==2007)

# Compute groupwise mean and median lifeExp
gap2007 %>% group_by(continent) %>% summarize(mean(lifeExp), median(lifeExp))

## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fct>               <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%ggplot(aes(x = continent, y = lifeExp)) +geom_boxplot()

Measures of variability: how much do the data points vary around the mean?

head(round(life$`Female life expectancy (years)`),11) - mean(head(round(life$`Female life expectancy (years)`),11))

##  [1] -0.182  1.818 -1.182 -0.182  1.818 -2.182 -0.182 -0.182 -0.182  0.818
## [11] -0.182

sum(head(round(life$`Female life expectancy (years)`),11) - mean(head(round(life$`Female life expectancy (years)`),11)))

## [1] -0.0000000000000568

sum((head(round(life$`Female life expectancy (years)`),11) - mean(head(round(life$`Female life expectancy (years)`),11)))^2)

## [1] 13.6

# Issue with the above is that it keeps growing as more observations are added

sum((head(round(life$`Female life expectancy (years)`),11) - mean(head(round(life$`Female life expectancy (years)`),11)))^2)/11

## [1] 1.24

# This is the mean squared distance of each data point from the mean

sum((head(round(life$`Female life expectancy (years)`),11) - mean(head(round(life$`Female life expectancy (years)`),11)))^2)/(10) # This is sample variance

## [1] 1.36

# Square root of variance is standard deviation. Unlike variance, it is in same units as orig data
sd(head(round(life$`Female life expectancy (years)`),11))

## [1] 1.17
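The long expressions above are easier to follow with the 11 rounded values stored once. A sketch that retypes them from the head() output printed earlier (so it runs without life_exp_raw.csv) and checks the hand computation against var() and sd(); the object names are my own.

```r
# The 11 rounded female life expectancies printed by head() above
le <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Sum of squared deviations from the mean
ss <- sum((le - mean(le))^2)

# Dividing by n - 1 gives the sample variance, which is what var() computes
var_by_hand <- ss / (length(le) - 1)
c(by_hand = var_by_hand, builtin = var(le))

# The standard deviation is the square root of the sample variance
c(by_hand = sqrt(var_by_hand), builtin = sd(le))
```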

# Height of the box in a boxplot is the IQR: difference between the 75th and 25th percentiles (third and first quartiles).
IQR(head(round(life$`Female life expectancy (years)`),11))

## [1] 0.5

# Range
diff(range(head(round(life$`Female life expectancy (years)`),11)))

## [1] 4

# Summary stats for gap2007
gap2007 %>% group_by(continent) %>% summarize(sd(lifeExp),IQR(lifeExp),n())

## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fct>             <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2

# Overlaid density plots
gap2007 %>% ggplot(aes(x=lifeExp,fill=continent)) + geom_density(alpha=0.3)

The shape of a distribution can be described in terms of modality and skew. Modality is the number of prominent humps that show up in the distribution, which leads to the classification into unimodal, bimodal and multimodal distributions. When there is no prominent mode, the distribution is uniform. If the left tail is longer, the distribution is left-skewed; if the right tail is longer, it is right-skewed; if neither tail is longer than the other, it is symmetric. A log transformation helps with highly right-skewed data that contains very large values.

# Shape of Male life expectancy on west coast
life %>% mutate(west_coast=State %in% c("California","Oregon","Washington")) %>% 
  ggplot(aes(x=`Male life expectancy (years)`,fill=west_coast)) + geom_density(alpha=0.3)

# Density plots for population in gapminder

# Create density plot of old variable
gap2007 %>% ggplot(aes(x = pop)) + geom_density()

# Transform the skewed pop variable
gap2007 <- gap2007 %>% mutate(log_pop=log(pop))

# Create density plot of new variable
gap2007 %>% ggplot(aes(x = log_pop)) +geom_density()
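Why the log helps, as a sketch on made-up population sizes (toy_pop is hypothetical, not taken from gapminder): values spread over several orders of magnitude are strongly right-skewed on the raw scale but evenly spaced after the transformation.

```r
# Hypothetical populations spanning five orders of magnitude
toy_pop <- c(1e3, 1e4, 1e5, 1e6, 1e7)

# Raw scale: the mean sits far above the median, a sign of right skew
c(mean = mean(toy_pop), median = median(toy_pop))

# Log scale: the values are evenly spaced, so mean and median agree
toy_log_pop <- log10(toy_pop)
c(mean = mean(toy_log_pop), median = median(toy_log_pop))
```

I used log10() here only because it makes the numbers easy to read. The natural log used above changes the scale of the axis, not the shape.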

# Life expectancy in Asia
gap2007 %>% filter(continent=="Asia") %>% ggplot(aes(x=1,y=lifeExp)) + geom_boxplot()

# One country seems to be outlier

# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>% filter(continent=="Asia") %>% mutate(is_outlier = lifeExp<50)

# The country is : Afghanistan
filter(gap_asia,is_outlier)

## # A tibble: 1 x 8
##   country     continent  year lifeExp     pop gdpPercap log_pop is_outlier
##   <fct>       <fct>     <int>   <dbl>   <int>     <dbl>   <dbl> <lgl>     
## 1 Afghanistan Asia       2007    43.8  3.19e7       975    17.3 T

# Remove outliers, create box plot of lifeExp
gap_asia %>% filter(!is_outlier) %>%  ggplot(aes(x = 1, y = lifeExp)) + geom_boxplot()

Case Study

Exploring the relationship between an email being spam and its length, its number of “!” characters, and other features.

# Loading the data
data("email")
# Modifying it to match datacamp's version
email$spam <- as.factor(email$spam)
levels(email$spam) <- c("not-spam", "spam")

# Seeing the data
str(email)

## 'data.frame':    3921 obs. of  21 variables:
##  $ spam        : Factor w/ 2 levels "not-spam","spam": 1 1 1 1 1 1 1 1 1 1 ...
##  $ to_multiple : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ sent_email  : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ time        : POSIXct, format: "2012-01-01 00:16:41" "2012-01-01 01:03:59" ...
##  $ image       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ attach      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ dollar      : num  0 0 4 0 0 0 0 0 0 0 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 2 2 0 0 0 0 ...
##  $ num_char    : num  11.37 10.5 7.77 13.26 1.23 ...
##  $ line_breaks : int  202 202 192 255 29 25 193 237 69 68 ...
##  $ format      : num  1 1 1 1 0 0 1 1 0 1 ...
##  $ re_subj     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  0 1 6 48 1 1 1 18 1 0 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...

# Summarizing stats for length of email for both spam and not-spam
email %>% group_by(spam) %>% summarize(mean(num_char),median(num_char),sd(num_char),IQR(num_char))

## # A tibble: 2 x 5
##   spam    `mean(num_char)` `median(num_cha~ `sd(num_char)` `IQR(num_char)`
##   <fct>              <dbl>            <dbl>          <dbl>           <dbl>
## 1 not-sp~            11.3              6.83           14.5           13.6 
## 2 spam                5.44             1.05           14.9            2.82

# Plotting boxplot and density plot to see which is appropriate measures of center and variability
# to describe the data
email %>% ggplot(aes(x=spam,y=num_char)) + geom_boxplot()

email %>% ggplot(aes(x=num_char,fill=spam)) + geom_density(alpha=0.3)

# Since the distributions are highly skewed, IQR and median are best
email %>% group_by(spam) %>% summarize(median(num_char),IQR(num_char))

## # A tibble: 2 x 3
##   spam     `median(num_char)` `IQR(num_char)`
##   <fct>                 <dbl>           <dbl>
## 1 not-spam               6.83           13.6 
## 2 spam                   1.05            2.82

email %>% mutate(log_num_char = log(num_char)) %>% ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

# Number of exclamation marks and spam. A small constant is added to avoid log(0)
ggplot(email,aes(x=spam,y=log(exclaim_mess+0.01))) + geom_boxplot()

ggplot(email,aes(x=log(exclaim_mess+0.01),fill=spam)) + geom_density(alpha=0.3)

# Due to skewed nature, median and IQR are good
email %>% group_by(spam) %>% summarize(median(exclaim_mess),IQR(exclaim_mess))

## # A tibble: 2 x 3
##   spam     `median(exclaim_mess)` `IQR(exclaim_mess)`
##   <fct>                     <dbl>               <dbl>
## 1 not-spam                   1.00                5.00
## 2 spam                       0                   1.00

In the above analysis there is an overwhelming number of zeroes, a situation known as zero inflation. There are two strategies to handle this:

1) Analyze the zeroes separately
2) Treat the variable as categorical: zero vs. non-zero

# Number of images attached to emails
table(email$image)

## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

# Since most of them are zero, treating it like a categorical variable 
email <- email %>% mutate(has_image = image > 0)

# Plotting to see relationship with spam
email %>% ggplot(aes(x=has_image,fill=spam)) + geom_bar(position="fill")
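How strong the zero inflation is can be read off the table(email$image) output above. A sketch that rebuilds the image column from those printed counts (so it runs without the openintro package); toy_image and toy_has_image are names I made up.

```r
# Number of images per email and how often each value occurs, copied from the table above
image_values <- c(0, 1, 2, 3, 4, 5, 9, 20)
image_counts <- c(3811, 76, 17, 11, 2, 2, 1, 1)
toy_image <- rep(image_values, times = image_counts)

# Treat the variable as categorical: zero vs. non-zero
toy_has_image <- toy_image > 0
table(toy_has_image)

# Share of emails that contain at least one image
mean(toy_has_image)
```

With fewer than 3% of emails containing any image, a histogram of the raw counts would be almost entirely one bar at zero.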

# There are attachments and images in the dataset. However, from the documentation it is not clear whether images are counted as attachments.

# One check: if the number of images is never greater than the number of attachments, images are likely a subset of attachments
sum(email$image > email$attach)

## [1] 0

# Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?

filter(email,spam!="spam") %>% group_by(to_multiple) %>% summarize(median(num_char),mean(num_char))

## # A tibble: 2 x 3
##   to_multiple `median(num_char)` `mean(num_char)`
##         <dbl>              <dbl>            <dbl>
## 1        0                  7.20            11.6 
## 2        1.00               5.36             9.36

# For emails containing the word "dollar", does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email? Create a summary statistic that answers this question.

filter(email,dollar>0) %>% group_by(spam) %>% summarize(median(dollar),mean(dollar))

## # A tibble: 2 x 3
##   spam     `median(dollar)` `mean(dollar)`
##   <fct>               <dbl>          <dbl>
## 1 not-spam             4.00           8.21
## 2 spam                 2.00           3.44

# If you encounter an email with greater than 10 occurrences of the word "dollar", is it more likely to be spam or not-spam? Create a barchart that answers this question.

filter(email,dollar>10) %>% ggplot(aes(x=dollar)) + geom_bar() + facet_wrap(~spam)

#OR
filter(email,dollar>10) %>% ggplot(aes(x=spam)) + geom_bar()

# Number and spam

# Order levels in number
email$number <- factor(email$number,levels=c("none","small","big"))
                       
ggplot(email, aes(x=number))+
  geom_bar() + 
  facet_wrap(~spam)