Source by William Surles : https://rpubs.com/williamsurles/298945

Introduction

Course notes from the Exploratory Data Analysis course on DataCamp

What's Covered

Exploring Categorical Data
Exploring Numerical Data
Numerical Summaries
Case Study

Libraries and Data

library(readr)

## Warning: package 'readr' was built under R version 4.1.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.1.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.1.3

library(openintro)

## Warning: package 'openintro' was built under R version 4.1.3

## Loading required package: airports

## Warning: package 'airports' was built under R version 4.1.3

## Loading required package: cherryblossom

## Warning: package 'cherryblossom' was built under R version 4.1.3

## Loading required package: usdata

## Warning: package 'usdata' was built under R version 4.1.3

cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")

Exploring Categorical Data

Exploring categorical data

– Bar chart expectations

Bar charts with one categorical variable on the x axis and another as the fill are a common way to view a contingency table visually.
It is essentially what you would get if you used the table function with two variables.
How you show the data can change the perception.
The variable you use for the fill and the position of the bars (fill, dodge, stack) can each give a different impression.

– Contingency table review

# Print the first rows of the data
head(comics)

##                                    name      id   align        eye       hair
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes Brown Hair
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes Black Hair
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes Black Hair
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes Blond Hair
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes    No Hair
##   gender  gsm             alive appearances first_appear publisher
## 1   Male <NA> Living Characters        4043       Aug-62    marvel
## 2   Male <NA> Living Characters        3360       Mar-41    marvel
## 3   Male <NA> Living Characters        3061       Oct-74    marvel
## 4   Male <NA> Living Characters        2961       Mar-63    marvel
## 5   Male <NA> Living Characters        2258       Nov-50    marvel
## 6   Male <NA> Living Characters        2255       Nov-61    marvel

str(comics)

## 'data.frame':    23272 obs. of  11 variables:
##  $ name        : chr  "Spider-Man (Peter Parker)" "Captain America (Steven Rogers)" "Wolverine (James \\\"Logan\\\" Howlett)" "Iron Man (Anthony \\\"Tony\\\" Stark)" ...
##  $ id          : chr  "Secret" "Public" "Public" "Public" ...
##  $ align       : chr  "Good" "Good" "Neutral" "Good" ...
##  $ eye         : chr  "Hazel Eyes" "Blue Eyes" "Blue Eyes" "Blue Eyes" ...
##  $ hair        : chr  "Brown Hair" "White Hair" "Black Hair" "Black Hair" ...
##  $ gender      : chr  "Male" "Male" "Male" "Male" ...
##  $ gsm         : chr  NA NA NA NA ...
##  $ alive       : chr  "Living Characters" "Living Characters" "Living Characters" "Living Characters" ...
##  $ appearances : int  4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
##  $ first_appear: chr  "Aug-62" "Mar-41" "Oct-74" "Mar-63" ...
##  $ publisher   : chr  "marvel" "marvel" "marvel" "marvel" ...

There are 10 character variables and 1 integer variable. We want to convert the character variables into factors.

comics[,c(1:8,10,11)] <- lapply(comics[,c(1:8,10,11)],as.factor)

#check the levels of align
levels(comics$align)

## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

Explanation:

There are 4 levels in the align variable of the comics data frame: Bad, Good, Neutral and Reformed Criminals.

#check the levels of gender
levels(comics$gender)

## [1] "Female" "Male"   "Other"

Explanation:

There are 3 levels in the gender variable of the comics data frame: Female, Male and Other.

# Create a 2-way contingency table
table(comics$align, comics$gender)

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

Explanation:

There are 1573 Female, 7561 Male and 32 Other characters with Bad alignment.
There are 2490 Female, 4809 Male and 17 Other characters with Good alignment.
There are 836 Female, 1799 Male and 17 Other characters with Neutral alignment.
There is 1 Female, 2 Male and 0 Other characters with Reformed Criminals alignment.

– Dropping levels

As we saw above, Reformed Criminals contains only 3 characters, so it has little impact on the data and we can drop that level. First, we store the contingency table in a variable.

# dplyr was loaded above

# Print tab
tab <- table(comics$align, comics$gender)
tab

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

# Remove align level
comics <- comics %>%
  filter(align != 'Reformed Criminals') %>%
  droplevels()

levels(comics$align)

## [1] "Bad"     "Good"    "Neutral"

After removing Reformed Criminals, align has only 3 levels: Bad, Good and Neutral.
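The call to droplevels() matters because filtering rows alone leaves the empty level attached to the factor. The following is a minimal base R sketch on a small made-up vector (the values are illustrative and are not taken from comics):

```r
# Toy alignment factor; the values are made up for illustration
align <- factor(c("Bad", "Good", "Neutral", "Reformed Criminals", "Good"))

# Filtering removes the observations but keeps the now-empty level
kept <- align[align != "Reformed Criminals"]
levels(kept)
table(kept)

# droplevels() removes levels that no longer occur in the data
cleaned <- droplevels(kept)
levels(cleaned)
```

Without this step, the empty level would still show up as a zero entry in table() output.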

– Side-by-side barcharts

# ggplot2 was loaded above

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

Explanation:

-Every alignment in the comics data frame is dominated by Male characters

-Bad has the highest count of the three alignments

-Female characters are more often Good than Neutral or Bad

-Some characters have a missing gender, shown as ‘NA’

-Among characters with “Neutral” alignment, males are the most common.

-In general, there is an association between gender and alignment.

Counts vs. proportions

We want to look at the difference between counts and proportions.

First, store the table of id and align in a variable

# simplify display format
options(scipen = 999, digits = 3) 

## create table of counts
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt

##          
##            Bad Good Neutral
##   No Dual  474  647     390
##   Public  2172 2930     965
##   Secret  4493 2475     959
##   Unknown    7    0       2

# Proportional table
# All values add up to 1
prop.table(tbl_cnt)

##          
##                Bad     Good  Neutral
##   No Dual 0.030553 0.041704 0.025139
##   Public  0.140003 0.188862 0.062202
##   Secret  0.289609 0.159533 0.061815
##   Unknown 0.000451 0.000000 0.000129

Explanation: a proportion table gives each cell's share of the whole data set, so the entries of the table sum to 1

sum(prop.table(tbl_cnt))

## [1] 1

What if we want proportions computed within each row or column? We just add one more argument: 1 for rows, 2 for columns

# All rows add up to 1
prop.table(tbl_cnt, 1)

##          
##             Bad  Good Neutral
##   No Dual 0.314 0.428   0.258
##   Public  0.358 0.483   0.159
##   Secret  0.567 0.312   0.121
##   Unknown 0.778 0.000   0.222

# All columns add up to 1
prop.table(tbl_cnt, 2)

##          
##                Bad     Good  Neutral
##   No Dual 0.066331 0.106907 0.168394
##   Public  0.303946 0.484137 0.416667
##   Secret  0.628743 0.408956 0.414076
##   Unknown 0.000980 0.000000 0.000864

Look at the cell for Bad alignment and Unknown id.

By row, these characters make up 0.778, or 77.8%, of the whole row (all Unknown id characters)

But by column, they make up only 0.00098, or about 0.1% (rounded), of the whole column (all Bad characters)
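To make the difference between the two margins concrete, the sketch below rebuilds the count table by hand from the printed output above (so it runs without downloading comics) and checks which sums equal 1. The object names by_row and by_col are only illustrative.

```r
# Counts copied from the printed tbl_cnt above
tbl_cnt <- matrix(
  c( 474,  647, 390,
    2172, 2930, 965,
    4493, 2475, 959,
       7,    0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(id    = c("No Dual", "Public", "Secret", "Unknown"),
                  align = c("Bad", "Good", "Neutral"))
)

by_row <- prop.table(tbl_cnt, 1)  # condition on id: each row sums to 1
by_col <- prop.table(tbl_cnt, 2)  # condition on align: each column sums to 1

rowSums(by_row)
colSums(by_col)

# The same cell, two different denominators
by_row["Unknown", "Bad"]  # 7 out of the 9 Unknown id characters
by_col["Unknown", "Bad"]  # 7 out of the 7146 Bad characters
```

The cell count is 7 in both cases. Only the denominator changes, which is why the two proportions differ so much.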

ggplot(comics, aes(x = id, fill = align)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

The plot of proportions within each id (row proportions) shows that:

-There are no Good characters with Unknown id

-Good is the largest group among Public id characters

-Most Unknown id and Secret id characters are Bad

ggplot(comics, aes(x = align, fill = id)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

The plot of proportions within each alignment (column proportions) shows that:

-Most Bad characters have a Secret id

-Public is the most common id among Good characters

-The proportion of Unknown id is so small that its purple segment is not visible in the plot

– Conditional proportions

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions

##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

prop.table(tab, 2)

##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

Approximately what proportion of all female characters are good?
- 51%

Row and column proportion tables can answer many questions like this. Which one to use depends on the question being asked

Counts vs. proportions (2)

Another example of counts vs. proportions, using the align and gender variables

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

The plot shows that:

-Most Bad characters are male

-Neutral characters have the lowest count

-There are very few characters with Other or NA gender

Now we plot the proportions instead

ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill")

Now each bar shows the parts of one alignment as proportions that add up to 1.

The plot shows that:

-Bad, Good and Neutral characters are all dominated by males, relative to the total for each alignment

-Good characters have the highest proportion of females of any alignment

Distribution of one variable

# Can use table function on just one variable
# This is called a marginal distribution
table(comics$id)

## 
## No Dual  Public  Secret Unknown 
##    1511    6067    7927       9

Explanation: a marginal distribution gives the count or frequency of each level of a single categorical variable. The code above shows that there are 1511 No Dual ids, 6067 Public ids, etc.
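The name "marginal" comes from the margins of the two-way table: summing the id by align counts across the alignments recovers exactly the one-variable counts above. A small sketch, with the counts copied by hand from the earlier printed table so that it runs on its own:

```r
# id-by-align counts copied from the printed table earlier in these notes
tbl_cnt <- matrix(
  c( 474,  647, 390,
    2172, 2930, 965,
    4493, 2475, 959,
       7,    0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(id    = c("No Dual", "Public", "Secret", "Unknown"),
                  align = c("Bad", "Good", "Neutral"))
)

# Summing over align gives the marginal distribution of id
marginal_id <- rowSums(tbl_cnt)
marginal_id

# Summing over id gives the marginal distribution of align
marginal_align <- colSums(tbl_cnt)
marginal_align
```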

We can use a simple bar chart to plot the marginal distribution above

# Simple barchart
ggplot(comics, aes(x = id)) + 
  geom_bar()

You can also facet to see variables individually.
This is a little easier than filtering each level and plotting.
This is a rearrangement of the bar chart we plotted earlier
- We facet by alignment rather than coloring the stack.
- This can make it a little easier to answer some questions.

ggplot(comics, aes(x = id)) + 
  geom_bar() + 
  facet_wrap(~align)

Faceting gives one panel per alignment, which makes it easier to answer questions like

Which id is most common among Bad characters?
= The answer is Secret

Marginal barchart

It makes more sense to put Neutral between Bad and Good.
We need to reorder the levels so the chart is drawn this way.
Otherwise it will default to alphabetical order.

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()

Explanation:

Looking at the plot, Bad characters have the highest count, followed by Good characters, and Neutral characters have the lowest
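Reordering works because a factor stores its level order separately from its values. A minimal sketch on a made-up vector (not the comics data) shows that only the order of the levels changes, and also shows one pitfall worth knowing about:

```r
# Toy alignment values, made up for illustration
align <- factor(c("Good", "Bad", "Neutral", "Bad"))
levels(align)  # alphabetical by default

# Same values, new level order
reordered <- factor(align, levels = c("Bad", "Neutral", "Good"))
levels(reordered)
table(reordered)

# Pitfall: a value missing from `levels` silently becomes NA
misspelled <- factor(align, levels = c("Bad", "Neutral", "good"))
sum(is.na(misspelled))
```

Checking for unexpected NA values after re-leveling is a cheap way to catch a misspelled level name.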

– Conditional barchart

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

With this conditional bar chart we can break down alignment by gender and get more insight, like

1.Male characters are dominated by Bad alignment

2.Other gender has so few characters that its bars are barely visible in the plot

3.Female characters are more often Good than either of the other two alignments

– Improve piechart

# Create a data frame pies with a single factor column, flavor
pies <- data.frame(flavor = as.factor(rep(c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"), times = c(17, 14, 15, 13, 16, 12, 11))))

# Put levels of flavor in descending order of count
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

head(pies$flavor)

## [1] apple apple apple apple apple apple
## Levels: apple key lime boston creme blueberry cherry pumpkin strawberry

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) + 
  geom_bar(fill = "chartreuse") + 
  theme(axis.text.x = element_text(angle = 90))

Like an ordinary bar chart, it shows the frequency of each flavor in the pies data frame: there are 17 apple pies, 16 key lime pies, etc.

Exploring Numerical Data

Exploring numerical data

# A dot plot shows all the datapoints
ggplot(cars, aes(x = weight)) + 
  geom_dotplot(dotsize = 0.4)

## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bindot).

The dot plot shows every observation as a dot. Car weights are most concentrated at around 3600 - 3700

# A histogram groups the points into bins so it does not get overwhelming
# (dotsize only applies to geom_dotplot, so it is not passed here)
ggplot(cars, aes(x = weight)) + 
  geom_histogram(binwidth = 500)

## Warning: Removed 2 rows containing non-finite values (stat_bin).

The histogram is easier to read because it groups the points into weight intervals

# A density plot gives a bigger picture representation of the distribution
# It is more helpful when there is a lot of data
ggplot(cars, aes(x = weight)) + 
  geom_density()

## Warning: Removed 2 rows containing non-finite values (stat_density).

A density plot gives a smooth representation of the distribution, which helps when there is a lot of data

# A boxplot is a good way to just show the summary info of the distribution
ggplot(cars, aes(x = 1, y = weight)) + 
  geom_boxplot() +
  coord_flip()

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

A boxplot summarizes a distribution: it shows the median (the middle line in the box), the IQR (interquartile range, the length of the box) and any outliers flagged by the boxplot rule.
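The numbers behind a boxplot can be inspected directly with base R's boxplot.stats(), which applies the usual 1.5 x IQR rule. The sketch below uses a small made-up vector (not the cars data) with one extreme value, so the outlier and the gap between median and mean are easy to see:

```r
# Toy weights, made up for illustration; 30 is deliberately extreme
w <- c(2, 3, 4, 5, 6, 7, 8, 9, 30)

bs <- boxplot.stats(w)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Points beyond 1.5 box-lengths from the box are reported as outliers
bs$out

# The middle line is the median, not the mean
median(w)
mean(w)
```

Here the single large value pulls the mean above the median, while the box and its middle line stay where the bulk of the data is.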

– Faceted histogram

# Load package
library(ggplot2)

# Learn data structure
str(cars)

## 'data.frame':    428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

There are 19 variables and 428 observations in the cars data frame:

-1 character variable, 7 logical (boolean) variables, 10 integer variables and 1 numeric variable

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ suv)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 14 rows containing non-finite values (stat_bin).

Looking at this faceted histogram, we can see that there are more non-SUV cars than SUVs, and that both groups fall mostly in a similar city_mpg range (around 10-25)

– Boxplots and density plots

unique(cars$ncyl)

## [1]  4  6  3  8  5 12 10 -1

There are 8 unique values of cylinder count in the cars dataset, but one of them is -1. A cylinder count cannot be negative, so this is most likely a code for a missing value

table(cars$ncyl)

## 
##  -1   3   4   5   6   8  10  12 
##   2   1 136   7 190  87   2   3

The marginal table shows that most cars have 6, 4 or 8 cylinders

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

From this boxplot, we can say that:

-Most 4 cylinder cars get around 20-28 city mpg

-Most 6 cylinder cars get around 18-20 city mpg

-Most 8 cylinder cars get around 14-18 city mpg

Simple conclusion: the more cylinders a car has, the lower its city mpg

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

## Warning: Removed 11 rows containing non-finite values (stat_density).

From the density plot, we can say that

-4 cylinder cars peak at around 22 city_mpg

-6 cylinder cars peak at around 18/19 city_mpg

-8 cylinder cars peak at around 18 city_mpg

– Compare distribution via plots

The highest mileage cars have 4 cylinders.
The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car.
Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.

Distribution of one variable

– Marginal and conditional histograms

# Create hist of horsepwr
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  ggtitle("Horsepower distribution")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Horsepower is most concentrated at around 210-220

# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Horsepower distribution for msrp < 25000")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

Most affordable cars have horsepower in the range from 100 to the low 200s (210/220)

– Marginal and conditional histograms interpretation

The highest horsepower car in the less expensive range has just under 250 horsepower.

– Three binwidths

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("binwidth = 3")

With binwidth = 3 we get more bins, because each bin covers a range of only 3 horsepower (for example, about 101-104)

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("binwidth = 30")

With binwidth = 30 we get fewer bins, because each bin covers a range of 30 horsepower (for example, about 90-120)

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("binwidth = 60")

As before, binwidth = 60 gives even fewer bins, because each bin is wider still
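The trade-off between bin width and bin count can also be checked numerically. The sketch below uses base R's hist() with plot = FALSE on made-up horsepower values (not the cars data). The fixed 0-540 range and the helper name bin_counts are assumptions chosen for illustration, and ggplot2 places its bin edges differently, so the exact counts will not match the plots above.

```r
# Made-up horsepower values for illustration (not the cars data)
hp <- c(100, 103, 110, 130, 140, 200, 210, 215, 220, 300, 500)

# Hypothetical helper: number of bins covering 0-540 at a given width,
# and how many of those bins contain at least one observation
bin_counts <- function(x, width) {
  breaks <- seq(0, 540, by = width)
  h <- hist(x, breaks = breaks, plot = FALSE)
  c(bins = length(h$counts), non_empty = sum(h$counts > 0))
}

bin_counts(hp, 3)
bin_counts(hp, 30)
bin_counts(hp, 60)
```

Narrow bins show fine detail but leave most bins empty, while wide bins smooth the picture and can hide gaps and separate peaks.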

Box plots

– Box plots for outliers

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

As we can see, the boxplot rule flags a number of outliers, and some of them are very extreme, such as an msrp of about 180k, so we can exclude the most extreme values from the data

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

Now the most extreme msrp values (100,000 and above) are excluded from the data

– Plot selection

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

cars %>%
  ggplot(aes(city_mpg)) +
  geom_density()

## Warning: Removed 14 rows containing non-finite values (stat_density).

Both plots can be used to see the distribution of the data and to spot outliers. Which one to use depends on the situation and on your preference.

Another example is the width variable

# Create plot of width
cars %>%
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()

## Warning: Removed 28 rows containing non-finite values (stat_boxplot).

cars %>%
  ggplot(aes(x = width)) +
  geom_density()

## Warning: Removed 28 rows containing non-finite values (stat_density).

Both plots show the distribution of the data, but the boxplot gives a compact summary of the distribution, while the density plot uses a smooth curve to show its overall shape

Visualization in higher dimensions

– 3 variable plot

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv) +
  ggtitle("hwy_mpg by ncyl and suv")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite values (stat_bin).

The 3-variable faceted histogram shows that:

-Most non-SUV cars with 4 cylinders have highway mileage around 25-35 mpg

-Most non-SUV cars with 6 cylinders have highway mileage around 25-30 mpg

-Most non-SUV cars with 8 cylinders have highway mileage around 15-25 mpg

– Interpret 3 var plot

Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases.

Numerical Summaries

Measures of center

What is a typical value for life expectancy?
- We will look at just a few data points here
- And just the females

head(life) # Shows the first 6 rows

##     State         County fips Year Female.life.expectancy..years.
## 1 Alabama Autauga County 1001 1985                           77.0
## 2 Alabama Baldwin County 1003 1985                           78.8
## 3 Alabama Barbour County 1005 1985                           76.0
## 4 Alabama    Bibb County 1007 1985                           76.6
## 5 Alabama  Blount County 1009 1985                           78.9
## 6 Alabama Bullock County 1011 1985                           75.1
##   Female.life.expectancy..national..years.
## 1                                     77.8
## 2                                     77.8
## 3                                     77.8
## 4                                     77.8
## 5                                     77.8
## 6                                     77.8
##   Female.life.expectancy..state..years. Male.life.expectancy..years.
## 1                                  76.9                         68.1
## 2                                  76.9                         71.1
## 3                                  76.9                         66.8
## 4                                  76.9                         67.3
## 5                                  76.9                         70.6
## 6                                  76.9                         66.6
##   Male.life.expectancy..national..years. Male.life.expectancy..state..years.
## 1                                   70.8                                69.1
## 2                                   70.8                                69.1
## 3                                   70.8                                69.1
## 4                                   70.8                                69.1
## 5                                   70.8                                69.1
## 6                                   70.8                                69.1

x <- head(round(life$Female.life.expectancy..years.), 11)
x

##  [1] 77 79 76 77 79 75 77 77 77 78 77

x holds the rounded female life expectancy for the first 11 rows of the data frame. The values range from 75 to 79, and most of them are 77

mean

balance point of the data
sensitive to extreme values

Computed as the sum of the values divided by the number of values

sum(x)/11

## [1] 77.2

mean(x)

## [1] 77.2

median

middle value of the data
robust to extreme values
most appropriate measure when working with skewed data

sort(x)

##  [1] 75 76 77 77 77 77 77 77 78 79 79

We can see that the middle value of the sorted x is 77, so the median is 77

or using

median(x)

## [1] 77

mode

most common value

table(x)

## x
## 75 76 77 78 79 
##  1  1  6  1  2

The table shows that 77 appears 6 times, more than any other value, so the mode of x is 77
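Note that base R has no function that returns the statistical mode: mode() reports how an object is stored, not its most common value. A small helper built on table() fills the gap. The name stat_mode is hypothetical, and on ties this sketch returns the smallest of the tied values.

```r
# The same 11 rounded values used above
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# mode() describes the storage type of x, not its most frequent value
mode(x)

# Hypothetical helper: most frequent value (smallest value wins on ties)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

stat_mode(x)
```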

– Calculate center measures

library(gapminder)

## Warning: package 'gapminder' was built under R version 4.1.3

str(gapminder)

## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

The gapminder dataset has 1704 observations, with 2 factor variables, 2 integer variables and 2 numeric variables

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))

## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fct>               <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

Looking at the table and boxplot, we can say that in 2007

Africa has the lowest life expectancy and Oceania has the highest, but Oceania contains very few countries, so its box is very narrow. If Oceania is excluded, Europe has the highest life expectancy

Measures of variability

We want to know ‘How much is the data spread out from the middle?’
Just looking at the data gives us a sense of this
- But we want to break it down to one number so we can compare sample distributions

##  [1] 77 79 76 77 79 75 77 77 77 78 77

We could just take the difference between each point and the mean and add them up
- But that would equal 0 (up to floating-point rounding, as the output below shows). That's how the mean is defined.

# Look at the difference between each point and the mean
sum(x - mean(x))

## [1] -0.0000000000000568

So we can square the differences
- But this number will keep getting bigger as you add more observations
- We want something that is stable

# Square each difference to get rid of negatives then sum
sum((x - mean(x))^2)

## [1] 13.6

Variance

So we divide by n - 1
This is called the sample variance, one of the most useful measures of the spread of a sample distribution

sum((x - mean(x))^2)/(length(x) - 1)

## [1] 1.36

or we can just use

var(x)

## [1] 1.36

Standard Deviation

Another very useful metric is the sample standard deviation
This is just the square root of the variance
The nice thing about the std dev is that it is in the same units as the original data
In this case it's 1.17 years

sqrt(sum((x - mean(x))^2)/(length(x) - 1))

## [1] 1.17

sd(x)

## [1] 1.17

Inter Quartile Range

The IQR is the range of the middle 50% of the data
The nice thing about this one is that it is not sensitive to extreme values
All of the other measures of spread listed here are sensitive to extreme values

summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    75.0    77.0    77.0    77.2    77.5    79.0

We can look at the summary and subtract the 1st quartile from the 3rd quartile to get the IQR, which is 0.5 (77.5 - 77), or just use

IQR(x)

## [1] 0.5
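The robustness of the IQR is easy to demonstrate. In the sketch below, one of the 79s in x is replaced by 120, as if a recording error had occurred (the value 120 is made up for illustration), and each measure is computed before and after:

```r
# The same 11 rounded values used above
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Replace the first maximum with an extreme, made-up value
x_out <- replace(x, which.max(x), 120)

# Measures that use every value react strongly to the extreme point
c(mean = mean(x), sd = sd(x), range = diff(range(x)))
c(mean = mean(x_out), sd = sd(x_out), range = diff(range(x_out)))

# Measures based on the middle of the sorted data do not move here
c(median = median(x), IQR = IQR(x))
c(median = median(x_out), IQR = IQR(x_out))
```

This is the reason the median and IQR are usually preferred for skewed data or data with outliers.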

Range

max and min are also interesting
as is the range, or the difference between max and min

max(x)

## [1] 79

min(x)

## [1] 75

So the range of x is 79 - 75 = 4, or just use

diff(range(x))

## [1] 4

– Calculate spread measures

str(gap2007)

## tibble [142 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ year     : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
##  $ pop      : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
##  $ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...

The data frame has 142 observations and the same variables and variable types as the gapminder data frame above

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())

## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fct>             <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

Oceania has the tallest density curve because it contains only 2 countries with very similar life expectancies, so all of its density is concentrated in a narrow range

– Choose measures for center and spread

# Compute stats for lifeExp in Americas
head(gap2007)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923      975.
## 2 Albania     Europe     2007    76.4  3600523     5937.
## 3 Algeria     Africa     2007    72.3 33333216     6223.
## 4 Angola      Africa     2007    42.7 12420476     4797.
## 5 Argentina   Americas   2007    75.3 40301927    12779.
## 6 Australia   Oceania    2007    81.2 20434176    34435.

gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

The Americas have a mean life expectancy of 73.6 and a standard deviation of 4.44

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))

## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531  26702008.

The median population is 10517531 and the interquartile range (3rd quartile - 1st quartile) is 26702008

Shape and transformations

4 characteristics of a distribution that are of interest:

center
- already covered
spread or variability
- already covered
shape
- modality: number of prominent humps (uni, bi, multi, or uniform - no humps)
- skew (right, left, or symmetric)
- can transform to fix skew
outliers

– Describe the shape

A: unimodal, left-skewed
B: unimodal, symmetric
C: unimodal, right-skewed
D: bimodal, symmetric

– Transformations

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()

The density plot of pop is unimodal and right-skewed.

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()

After the log transformation, the distribution of population is unimodal and roughly symmetric, close to the shape of a normal distribution.
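The effect of a log transformation on skew can also be checked numerically. The sketch below uses a hand-written moment-based `skewness()` helper (a hypothetical name defined here, not a base R function) and made-up values; it is an illustration, not part of the course code.

```r
# Moment-based skewness: positive means right-skewed, near 0 means symmetric
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up values that grow by a factor of 10 at each step
x <- c(1, 10, 100, 1000, 10000)

skew_raw <- skewness(x)       # clearly positive
skew_log <- skewness(log(x))  # essentially 0: the logged values are evenly spaced
```

Values that differ by multiplicative factors become evenly spaced on the log scale, which is why the long right tail of pop collapses.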

Outliers

– Identify outliers

# Inspect the structure of the full gapminder data
str(gapminder)

## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

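Treat any Asian country with a life expectancy below 50 as an outlier, flag it in a new column, and then remove the flagged rows before plotting.

# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%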
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()
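The cutoff of 50 above is a judgment call. A common alternative, similar to the rule box plots use to draw individual points, is to flag values more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule to made-up values in base R; it is a different technique from the fixed threshold used in these notes.

```r
# Made-up values with one point far from the rest
x <- c(10, 12, 13, 14, 15, 16, 40)

q1  <- quantile(x, 0.25, names = FALSE)  # 12.5
q3  <- quantile(x, 0.75, names = FALSE)  # 15.5
iqr <- q3 - q1                           # 3

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr            # 8
upper_fence <- q3 + 1.5 * iqr            # 20

is_outlier <- x < lower_fence | x > upper_fence
x[is_outlier]   # 40
x[!is_outlier]  # the remaining six values
```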

Case Study

Introducing the data

– Spam and num_char

# ggplot2, dplyr, and openintro are loaded

# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarize( 
    median(num_char),
    IQR(num_char))

## # A tibble: 2 x 3
##   spam  `median(num_char)` `IQR(num_char)`
##   <fct>              <dbl>           <dbl>
## 1 0                   6.83           13.6 
## 2 1                   1.05            2.82

Non-spam emails (spam = 0) have a median num_char of 6.83 and an IQR of 13.6, while spam emails (spam = 1) have a median of 1.05 and an IQR of 2.82. In this sample, spam emails tend to be shorter and less variable in length.
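The backticked column names in the output above, such as `median(num_char)`, are awkward to reuse in later steps. A minimal sketch of the same grouped summary with named columns follows; `toy_email`, `median_chars`, and `iqr_chars` are hypothetical names, and the data are made up.

```r
library(dplyr)

# Made-up toy data; num_char mimics message length
toy_email <- data.frame(
  spam     = c(0, 0, 0, 1, 1, 1),
  num_char = c(2, 8, 20, 0.5, 1, 3)
)

# Naming each summary gives plain column names instead of backticked expressions
toy_summary <- toy_email %>%
  group_by(spam) %>%
  summarize(median_chars = median(num_char),
            iqr_chars    = IQR(num_char))

toy_summary
```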

table(email$spam)

## 
##    0    1 
## 3554  367

There are far more non-spam emails (3554) than spam emails (367).

email <- email %>%
  mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")))

# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

– Spam and num_char interpretation

The median length of not-spam emails is greater than that of spam emails

– Spam and !!!

# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarize(
    median(exclaim_mess),
    IQR(exclaim_mess))

## # A tibble: 2 x 3
##   spam     `median(exclaim_mess)` `IQR(exclaim_mess)`
##   <fct>                     <dbl>               <dbl>
## 1 not-spam                      1                   5
## 2 spam                          0                   1

Non-spam emails have a median exclaim_mess of 1 and an IQR of 5.

Spam emails have a median exclaim_mess of 0 and an IQR of 1.

table(email$exclaim_mess)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1435  733  507  128  190  113  115   51   93   45   85   17   56   20   43   11 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##   29   12   26    5   29    9   15    3   11    6   11    1    6    8   13   12 
##   32   33   34   35   36   38   39   40   41   42   43   44   45   46   47   48 
##   13    3    3    2    3    3    1    2    1    1    3    3    5    3    2    1 
##   49   52   54   55   57   58   62   71   75   78   89   94   96  139  148  157 
##    3    1    1    4    2    2    2    1    1    1    1    1    1    1    1    1 
##  187  454  915  939  947 1197 1203 1209 1236 
##    1    1    1    1    1    1    2    1    1

Most emails have an exclaim_mess value of 0, 1, or 2, while a handful have several hundred or more, so the distribution is strongly right-skewed.

# Create plot for spam and exclaim_mess
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess)) %>%
  ggplot(aes(x = log_exclaim_mess)) + 
  geom_histogram() + 
  facet_wrap(~ spam)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1435 rows containing non-finite values (stat_bin).

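Because log(0) is -Inf, the 1435 emails with no exclamation marks are dropped from this plot, which is what the warning above reports. Among the remaining non-spam emails, log(exclaim_mess) falls mostly between 0 and 4, with the largest count at 0 (a single exclamation mark).

For spam emails, log(exclaim_mess) is also concentrated around 0.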

– Spam and !!! interpretation

The most common value of exclaim_mess in both classes of email is zero (a log(exclaim_mess) of -4.6 after adding .01).
Even after a transformation, the distribution of exclaim_mess in both classes of email is right-skewed.
The typical number of exclamations in the not-spam group appears to be slightly higher than in the spam group.
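The -4.6 mentioned in the first statement comes from adding a small constant before taking the log, so that zeros stay in the plot instead of becoming -Inf. A minimal base R sketch with made-up counts:

```r
# Made-up counts of exclamation marks, including zeros
exclaim <- c(0, 0, 1, 5, 20)

log(exclaim)  # zeros become -Inf and cannot be placed in a histogram bin

# Shift by 0.01 before taking the log so every value stays finite
shifted <- log(exclaim + 0.01)

round(shifted[1], 1)     # -4.6, the value zeros take after the shift
all(is.finite(shifted))  # TRUE
```

In the plotting chain above this would correspond to `mutate(log_exclaim_mess = log(exclaim_mess + 0.01))`, which keeps all rows and avoids the non-finite values warning.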

Check-in 1

Zero inflation in the exclaim_mess variable
- you can analyze the two parts separately
- or turn it into a categorical variable of is-zero, not-zero
Could make a barchart
- need to decide if you are more interested in counts or proportions
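The is-zero approach and the counts versus proportions choice can be sketched in base R as follows. The data frame `toy` and its values are made up for illustration.

```r
# Made-up emails with a zero-inflated count of exclamation marks
toy <- data.frame(
  spam         = c("not-spam", "not-spam", "not-spam", "not-spam", "spam", "spam"),
  exclaim_mess = c(0, 1, 4, 0, 0, 0)
)

# Collapse the count into a two-level indicator
toy$zero <- toy$exclaim_mess == 0

# Counts: how many emails fall in each spam / zero combination
counts <- table(toy$spam, toy$zero)

# Proportions within each spam class (each row sums to 1)
props <- prop.table(counts, margin = 1)

counts
props
```

Counts show how much data is behind each bar, while row proportions make the two classes comparable when one class is much larger than the other. In ggplot2 terms this is the difference between `geom_bar()` and `geom_bar(position = "fill")`.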

– Collapsing levels

table(email$image)

## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

Most emails contain no images (3811 of 3921).

# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

– Image and spam interpretation

An email without an image is more likely to be not-spam than spam

– Data Integrity

# Test if images count as attachments
sum(email$image > email$attach)

## [1] 0

There are no emails with more images than attachments, so images must also be counted as attachments.
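The same kind of integrity check can be written as an assertion that halts a script when the assumption fails. A minimal base R sketch on made-up data (the data frame `toy` is hypothetical):

```r
# Made-up counts of images and attachments for four emails
toy <- data.frame(
  image  = c(0, 1, 2, 0),
  attach = c(0, 1, 3, 2)
)

# Number of rows where images exceed attachments
violations <- sum(toy$image > toy$attach)
violations

# The same check as a hard stop inside a script
stopifnot(all(toy$image <= toy$attach))
```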

– Answering questions with chains

## Within non-spam emails, is the typical length of emails shorter for 
## those that were sent to multiple people?
email %>%
   filter(spam == "not-spam") %>%
   group_by(to_multiple) %>%
   summarize(median(num_char))

## # A tibble: 2 x 2
##   to_multiple `median(num_char)`
##   <fct>                    <dbl>
## 1 0                         7.20
## 2 1                         5.36

Yes. Among non-spam emails, those sent to multiple people have a lower median length (5.36) than those sent to a single recipient (7.20).

# Question 1
## For emails containing the word "dollar", does the typical spam email 
## contain a greater number of occurrences of the word than the typical non-spam email?
table(email$dollar)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3175  120  151   10  146   20   44   12   35   10   22   10   20    7   14    5 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   32 
##   23    2   14    1   10    7   12    7    7    3    7    1    5    1    1    2 
##   34   36   40   44   46   48   54   63   64 
##    1    2    3    3    2    1    1    1    3

email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))

## # A tibble: 2 x 2
##   spam     `median(dollar)`
##   <fct>               <dbl>
## 1 not-spam                4
## 2 spam                    2

No. Among emails that contain the word "dollar", spam emails have a median of 2 occurrences, while not-spam emails have a median of 4.

That means the typical not-spam email containing "dollar" uses the word more often than the typical spam email.

# Question 2
## If you encounter an email with greater than 10 occurrences of the word "dollar", 
## is it more likely to be spam or not-spam?

email %>%
  filter(dollar > 10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

Not-spam, at least in this dataset. In the bar chart, not-spam emails outnumber spam emails among those where the word "dollar" occurs more than 10 times.
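A bar chart answers this question visually; the same chain can end in a small table of counts and proportions instead. The sketch below uses made-up data, and the names `toy_email` and `toy_counts` are hypothetical.

```r
library(dplyr)

# Made-up emails with a count of the word "dollar"
toy_email <- data.frame(
  spam   = c("not-spam", "not-spam", "not-spam", "spam", "spam", "not-spam"),
  dollar = c(12, 15, 3, 11, 0, 20)
)

# Keep emails with more than 10 occurrences, then count each class
toy_counts <- toy_email %>%
  filter(dollar > 10) %>%
  count(spam) %>%
  mutate(prop = n / sum(n))

toy_counts
```

The `prop` column gives the share of each class among the filtered emails, which is the quantity the question is really asking about.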

Check-in 2

– What’s in a number?

levels(email$number)

## [1] "none"  "small" "big"

The number variable has 3 levels: none, small, and big.

table(email$number)

## 
##  none small   big 
##   549  2827   545

Of the three levels, small has by far the highest count (2827 emails).

# Reorder levels
email$number <- factor(email$number, levels = c("none","small","big"))

# Construct plot of number
ggplot(email, aes(x = number)) +
  geom_bar() + 
  facet_wrap( ~ spam)
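Reordering matters because a factor's level order controls the order of bars and table columns. In the output above the levels were already none, small, big, so the reorder step leaves them unchanged. The base R sketch below, on made-up values, shows what happens when the order is not set explicitly.

```r
# Made-up values of a categorical variable
size <- c("small", "big", "none", "small")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(size))  # "big" "none" "small"

# Explicit: levels follow the order given
ordered_size <- factor(size, levels = c("none", "small", "big"))
levels(ordered_size)

# table() follows the level order, as do bars on a discrete ggplot2 axis
size_counts <- table(ordered_size)
size_counts
```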

What’s in a number interpretation

Given that an email contains a small number, it is more likely to be not-spam.
Given that an email contains a big number, it is more likely to be not-spam.
Within both spam and not-spam, the most common number is a small one.

Exploratory Data Analysis

Andrew Widjaya

2022-06-02
