Exploratory Data Analysis

#Arvio Anandi #2022-06-02

##Introduction

#source('create_datasets.R')

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(openintro)

## Warning: package 'openintro' was built under R version 4.1.3

## Loading required package: airports

## Warning: package 'airports' was built under R version 4.1.3

## Loading required package: cherryblossom

## Warning: package 'cherryblossom' was built under R version 4.1.3

## Loading required package: usdata

## Warning: package 'usdata' was built under R version 4.1.3

cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")

##Exploring categorical data

#– Bar chart expectations

- Bar charts with one categorical variable on the x axis and another in the fill are a common way to view a contingency table visually.
- They show essentially what you would get from the table function with two variables.
- How you display the data can change the perception.
- Which variable goes in the fill, and which bar position you use (fill, dodge, stack), can each give a different impression.

# print the first rows of the data
head(comics)

##                                    name      id   align        eye       hair
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes Brown Hair
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes Black Hair
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes Black Hair
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes Blond Hair
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes    No Hair
##   gender  gsm             alive appearances first_appear publisher
## 1   Male <NA> Living Characters        4043       Aug-62    marvel
## 2   Male <NA> Living Characters        3360       Mar-41    marvel
## 3   Male <NA> Living Characters        3061       Oct-74    marvel
## 4   Male <NA> Living Characters        2961       Mar-63    marvel
## 5   Male <NA> Living Characters        2258       Nov-50    marvel
## 6   Male <NA> Living Characters        2255       Nov-61    marvel

EXPLANATION The data describe comic-book characters from the Marvel and DC franchises. Each row is one character, and the 11 columns record that character's features.

# check levels of align
comics$align <- as.factor(comics$align)
levels(comics$align)

## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

EXPLANATION The align variable has 4 levels: Bad, Good, Neutral, and Reformed Criminals.

# check levels of gender
comics$gender <- as.factor(comics$gender)
levels(comics$gender)

## [1] "Female" "Male"   "Other"

EXPLANATION The gender variable has 3 levels: Female, Male, and Other.

# create a 2-way contingency table
table(comics$align, comics$gender)

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

EXPLANATION The most common combination is male characters with Bad alignment (villains), with 7561 characters.

Dropping levels

# load dplyr

# print tab
tab <- table(comics$align, comics$gender)
tab

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

# remove align level
comics <- comics %>%
  filter(align != 'Reformed Criminals') %>%
  droplevels()

levels(comics$align)

## [1] "Bad"     "Good"    "Neutral"
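To see why droplevels() is part of the pipeline above, here is a minimal sketch on a small hypothetical factor (toy values, not the comics data). Filtering removes the observations, but the unused level stays attached to the factor until it is dropped explicitly.

```r
# Hypothetical toy factor standing in for comics$align
align <- factor(c("Bad", "Good", "Neutral", "Reformed Criminals", "Good"))

# Subsetting removes the observations but keeps every level
kept <- align[align != "Reformed Criminals"]
levels(kept)   # still lists "Reformed Criminals"
table(kept)    # the unused level appears with a count of 0

# droplevels() discards levels that no longer occur in the data
cleaned <- droplevels(kept)
levels(cleaned)
```

Without this step, the empty level would still show up as an empty row in tables and as an empty slot in bar charts.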

Side-by-side barcharts

# load ggplot2

# create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge")

EXPLANATION
- There are 3 alignments: Bad, Good, and Neutral.
- The plot shows 4 gender groups: Female, Male, Other, and NA.
- The plot suggests an association between gender and alignment.

# create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align))+
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

EXPLANATION
- There are many more male characters than female characters in this dataset.
- Very few characters, if any, fall in the Other gender group.
- Male characters with Bad alignment are the most common combination.

– Bar chart interpretation

Among characters with “Neutral” alignment, males are the most common.
In general, there is an association between gender and alignment.
There are more male characters than female characters in this dataset.

Counts vs. proportions

# simplify display format
options(scipen = 999, digits = 3)

## create table of counts
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt

##          
##            Bad Good Neutral
##   No Dual  474  647     390
##   Public  2172 2930     965
##   Secret  4493 2475     959
##   Unknown    7    0       2

## Proportional table
# All values add up to 1
prop.table(tbl_cnt)

##          
##                Bad     Good  Neutral
##   No Dual 0.030553 0.041704 0.025139
##   Public  0.140003 0.188862 0.062202
##   Secret  0.289609 0.159533 0.061815
##   Unknown 0.000451 0.000000 0.000129

EXPLANATION
- The largest cell is Bad characters with a Secret identity, at about 29% of all characters in the table.
- The smallest cell is Good characters with an Unknown identity, at 0%.

sum(prop.table(tbl_cnt))

## [1] 1

EXPLANATION The joint proportions sum to 1.

# All rows add up to 1
prop.table(tbl_cnt, 1)

##          
##             Bad  Good Neutral
##   No Dual 0.314 0.428   0.258
##   Public  0.358 0.483   0.159
##   Secret  0.567 0.312   0.121
##   Unknown 0.778 0.000   0.222

EXPLANATION
- The proportions are conditioned on the rows because the second argument (the margin) is 1.
- The largest proportion is Bad characters within the Unknown identity group, at about 78%.
- The smallest is Good characters within the Unknown identity group, at 0%.

# columns add up to 1
prop.table(tbl_cnt , 2)

##          
##                Bad     Good  Neutral
##   No Dual 0.066331 0.106907 0.168394
##   Public  0.303946 0.484137 0.416667
##   Secret  0.628743 0.408956 0.414076
##   Unknown 0.000980 0.000000 0.000864

EXPLANATION
- The proportions are conditioned on the columns because the second argument (the margin) is 2.
- The largest proportion is Secret identity among Bad characters, at about 63%.
- The smallest is still Unknown identity among Good characters, at 0%.
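The effect of the margin argument can be checked directly. The sketch below rebuilds the table from the counts printed in the tbl_cnt output above (typed in by hand, so it needs no download) and confirms which direction sums to 1 in each case.

```r
# Counts copied from the tbl_cnt output above (rows: id, columns: align)
tbl_cnt <- as.table(matrix(
  c(474, 2172, 4493, 7,    # Bad
    647, 2930, 2475, 0,    # Good
    390,  965,  959, 2),   # Neutral
  nrow = 4,
  dimnames = list(id = c("No Dual", "Public", "Secret", "Unknown"),
                  align = c("Bad", "Good", "Neutral"))
))

by_row <- prop.table(tbl_cnt, 1)  # condition on id: each row sums to 1
by_col <- prop.table(tbl_cnt, 2)  # condition on align: each column sums to 1

rowSums(by_row)
colSums(by_col)

# The same row-conditional proportion computed by hand
tbl_cnt["Secret", "Bad"] / sum(tbl_cnt["Secret", ])
```

The hand computation divides one cell by its row total, which is all that margin = 1 does for every cell at once.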

Look at the proportion of bad characters in the secret and unknown groups
Note there are very few characters with id = unknown

ggplot(comics, aes(x = id, fill = align))+
  geom_bar(position = "fill")+
  ylab("proportion")

EXPLANATION
- There are no Good characters in the Unknown identity group.
- Every identity group contains Neutral characters.
- Every identity group contains Bad characters.
- The Unknown group has the highest proportion of Bad characters of all the identity groups.

Swap the x and fill variables. Notice that most Bad characters are Secret (not Unknown).
Here you can see more clearly that very few characters have id = Unknown.

ggplot(comics, aes(x = align, fill = id))+
  geom_bar(position = "fill")+
  ylab("proportion")

EXPLANATION
- This plot is conditioned on alignment.
- Among Bad characters, the largest proportion have a Secret identity.
- Among Good and Neutral characters, the largest proportion have a Public identity.
- Apart from the negligible Unknown group, No Dual is the smallest identity share in every alignment.

Conditional proportions

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # print fewer digits
prop.table(tab) # joint proportions

##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

EXPLANATION
- These are joint proportions: every cell is divided by the grand total, so the whole table sums to 1.
- The largest cell is Bad male characters, at about 40% of all characters in the table.
- The smallest cells are Good and Neutral characters of Other gender, at about 0.09% each.

prop.table(tab, 2)

##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

EXPLANATION
- The table is now conditioned on the columns, so each gender column sums to 1.
- The largest proportion is Bad among male characters, at about 53%.
- The smallest proportion is Neutral among male characters, at about 12.7%.

Counts vs. proportions (2)

# plot of gender by align
ggplot(comics, aes(x = align, fill = gender))+
  geom_bar()

EXPLANATION
- In every alignment, male characters have the largest count.
- In every alignment, characters of Other gender have the smallest count.
- Bad has the largest total count of the three alignments.

# plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar(position = "fill")

EXPLANATION
- In every alignment, male characters make up the largest proportion.
- In every alignment, characters of Other gender make up the smallest proportion.
- The gender mix differs between alignments, which suggests an association between alignment and gender.
- Every alignment includes all gender groups.

Distribution of one variable

# Can use table function on just one variable
# this is called a marginal distribution
table(comics$id)

## 
## No Dual  Public  Secret Unknown 
##    1511    6067    7927       9

EXPLANATION
- The id variable has 4 levels: No Dual, Public, Secret, and Unknown.
- Secret has the largest count.
- Unknown has the smallest count.

# simple barchart
ggplot(comics, aes(x = id))+
  geom_bar()

EXPLANATION
- The plot shows an NA identity category with a count of about 4500-4600.
- No Dual has a count of about 1500.
- Public has a count of about 6000.
- Secret has a count of about 8000.
- Unknown has a count close to zero.

ggplot(comics, aes(x = id)) +
  geom_bar() +
  facet_wrap(~align)
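A ggplot2 detail that matters for the faceted plots in this section: layers and facets are joined with `+`, and the `+` has to end the previous line. A line that stops without `+` completes the plot, and whatever follows is evaluated as a separate expression and printed on its own. The sketch below uses a small hypothetical data frame (toy values, not the comics data), and inspecting `p$facet` is just one assumed way to confirm whether the facet was attached.

```r
library(ggplot2)

# Hypothetical toy data standing in for comics
toy <- data.frame(
  id    = c("Secret", "Public", "Secret", "No Dual"),
  align = c("Bad", "Good", "Good", "Bad")
)

# Every line ends in `+`, so the facet becomes part of the plot
p <- ggplot(toy, aes(x = id)) +
  geom_bar() +
  facet_wrap(~align)

# Stopping after geom_bar() leaves the plot unfaceted
p_unfaceted <- ggplot(toy, aes(x = id)) +
  geom_bar()

inherits(p$facet, "FacetWrap")            # facet attached
inherits(p_unfaceted$facet, "FacetWrap")  # no facet attached
```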

Marginal barchart

# change the order of the levels in align
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

# create plot of align
ggplot(comics, aes(x = align))+ geom_bar()

EXPLANATION
- Bad has the largest count, at more than 9000.
- Neutral has a count of about 2600-2700.
- Good has a count of about 7500.
- All three counts are above 2500.

##Conditional barchart

# plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
  geom_bar() +
  facet_wrap(~gender)

Improve piechart

# build the pies data with a single flavor column
pies <- data.frame(flavor = as.factor(rep(c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"), times = c(17, 14, 15, 13, 16, 12, 11))))

# put levels of flavor in descending order of count
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

head(pies$flavor)

## [1] apple apple apple apple apple apple
## Levels: apple key lime boston creme blueberry cherry pumpkin strawberry

# create barchart of flavor
ggplot(pies, aes(x = flavor))+
  geom_bar(fill = "chartreuse")+ theme(axis.text.x = element_text(angle = 90))

EXPLANATION
- There are 7 flavors: apple, key lime, boston creme, blueberry, cherry, pumpkin, and strawberry.
- The bars run from apple to strawberry in descending order of count.
- The counts are 17 (apple), 16 (key lime), 15 (boston creme), 14 (blueberry), 13 (cherry), 12 (pumpkin), and 11 (strawberry).

Exploring Numerical Data

# a dot plot shows all the datapoints
ggplot(cars, aes(x = weight)) + geom_dotplot(dotsize = 0.4)

## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bindot).

EXPLANATION
- Weight ranges from about 1800 to 7300.
- The two tallest stacks are at weights of about 3300 and 3500.
- Five weights have the smallest non-zero count: about 1800, 5600, 6200, 6500, and 7300.
- The data are concentrated around 3500, near the median.

# a histogram groups the points into bins so it does not get overwhelming
ggplot(cars, aes(x = weight)) + geom_histogram(binwidth = 500)

## Warning: Removed 2 rows containing non-finite values (stat_bin).

EXPLANATION
- This histogram has car weight on the x axis.
- The tallest bin covers weights of about 3300-3800, and the data cluster around it.
- The next most common ranges are about 3800-4300, 2800-3300, 2300-2800, and 4300-4800, in that order.

# a density plot gives a bigger picture representation of the distribution
# it is more helpful when there is a lot of data
ggplot(cars, aes(x = weight))+ geom_density()

## Warning: Removed 2 rows containing non-finite values (stat_density).

EXPLANATION
- This density plot has car weight on the x axis.
- The density peaks at about 0.00065.
- Density values range from 0 to about 0.00065.
- The data are concentrated around a weight of about 3400.

# A boxplot is a good way to show just the summary info of the distribution
ggplot(cars, aes(x = 1, y = weight)) + 
  geom_boxplot() +
  coord_flip()

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

EXPLANATION
- This boxplot shows car weight on the horizontal axis (after coord_flip()).
- About 10 weights appear as outliers, judging by eye.
- The median is about 3400.
- The smallest weight is about 1800.
- The largest weight not flagged as an outlier is about 5300.

#– Faceted histogram

# Load package
library(ggplot2)

# Learn data structure
str(cars)

## 'data.frame':    428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

EXPLANATION
- There are 428 observations and 19 variables.
- 1 variable is of type character.
- 7 variables are logical.
- 1 variable is numeric (double).
- 10 variables are integer.

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ suv)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 14 rows containing non-finite values (stat_bin).

EXPLANATION
- For non-SUVs, the bin counts range from 0 to about 90.
- For SUVs, the bin counts range from 0 to about 20.
- City mileage for non-SUVs ranges from about 12 to 62 mpg.
- City mileage for SUVs ranges from about 9 to 24 mpg.

unique(cars$ncyl)

## [1]  4  6  3  8  5 12 10 -1

table(cars$ncyl)

## 
##  -1   3   4   5   6   8  10  12 
##   2   1 136   7 190  87   2   3
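Before filtering, it helps to know how much data the 4/6/8 filter in the next step keeps. The sketch below rebuilds the ncyl vector from the counts in the table above (typed in by hand) and applies the same `%in%` test.

```r
# Counts copied from the table(cars$ncyl) output above
ncyl_counts <- c("-1" = 2, "3" = 1, "4" = 136, "5" = 7,
                 "6" = 190, "8" = 87, "10" = 2, "12" = 3)

# Expand the counts back into one value per car
ncyl <- rep(as.integer(names(ncyl_counts)), times = ncyl_counts)

# %in% returns TRUE for each element that matches any value on the right
keep <- ncyl %in% c(4, 6, 8)

sum(keep)    # number of cars kept
mean(keep)   # share of cars kept
```

On these counts the filter keeps 413 of the 428 cars, so very little data is lost by restricting to the common cylinder counts.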

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

EXPLANATION
- The factor ncyl has 3 levels here: 4, 6, and 8.
- Excluding outliers, city mileage for 4-cylinder cars ranges from about 17 to 33 mpg.
- Excluding outliers, city mileage for 6-cylinder cars ranges from about 15 to 23 mpg.
- City mileage for 8-cylinder cars ranges from about 10 to 17 mpg.
- The 4-cylinder group has 5 outliers above the upper whisker.
- The 6-cylinder group has 1 outlier below the lower whisker.
- Typical city mileage is about 24 mpg for 4-cylinder cars, 19 mpg for 6-cylinder cars, and 17 mpg for 8-cylinder cars.

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

## Warning: Removed 11 rows containing non-finite values (stat_density).

EXPLANATION
- There are 3 levels of ncyl: 4, 6, and 8.
- The 8-cylinder curve has the highest peak, at a density of about 0.23.
- The 4-cylinder curve has the lowest peak, at a density of about 0.12.
- Density values range from 0 to about 0.23.
- City mileage ranges from about 10 to 58 mpg.
- All three distributions are concentrated between about 15 and 25 mpg.

#– Compare distribution via plots

- The highest mileage cars have 4 cylinders.
- The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car.
- Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.

Distribution of one variable

– Marginal and conditional histograms

# Create hist of horsepwr
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  ggtitle("Horsepower distribution")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

EXPLANATION
- The bars span roughly 60 to 510 horsepower.
- Bin counts range from 0 to about 50.
- Some bins are empty: about 360-370, 390-430, and 460-500 horsepower.
- The tallest bin, and the center of the cluster, is at about 220-240 horsepower.

# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Horsepower distribution for msrp < 25000")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

EXPLANATION
- The plot shows the horsepower distribution for cars with msrp below 25000.
- Horsepower ranges from about 105 to 240.
- Bin counts range from 0 to about 32.
- The tallest bin is at about 130-150 horsepower.
- The lowest bin count is at about 225-235 horsepower.

#– Marginal and conditional histograms interpretation

- The highest horsepower car in the less expensive range has just under 250 horsepower.

#– Three binwidths

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("binwidth = 3")

EXPLANATION
- The bars span roughly 65 to 500 horsepower.
- The tallest bin has a count of about 23, at about 200 horsepower.
- The data are concentrated around 180-230 horsepower.
- There are 3 visible peaks, at about 140, 200, and 290 horsepower.
- The lowest bin count is 0.
- There are 6 separate ranges of outlying values.

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("binwidth = 30")

EXPLANATION
- The bars span roughly 75 to 490 horsepower.
- The tallest bin has a count of about 90, at about 190-220 horsepower.
- The data are concentrated around 190-210 horsepower.
- The lowest bin count is 0.
- Outlying values appear at about 460-490 horsepower.

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("binwidth = 60")

EXPLANATION
- The bars span roughly 85 to 510 horsepower.
- The tallest bin has a count of about 140.
- The data are concentrated around 160-210 horsepower, which is also where the tallest bin sits.
- The lowest bin count is 0.
- Bins this wide smooth away detail, so outlying values are hard to pick out.
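The three histograms differ only in how the same values are grouped. The sketch below makes that explicit on simulated horsepower values (hypothetical data from rnorm, not the cars dataset), using a small helper, bin_counts, that is defined here for illustration only and is not how ggplot2 places its bins.

```r
set.seed(1)
# Hypothetical horsepower values; the real cars data are not used here
hp <- round(rnorm(400, mean = 215, sd = 70))

# Illustrative helper: count values in left-closed bins of a given width
bin_counts <- function(v, width) {
  lo <- floor(min(v) / width) * width
  hi <- ceiling(max(v) / width) * width + width
  table(cut(v, breaks = seq(lo, hi, by = width), right = FALSE))
}

widths <- c(3, 30, 60)
binned <- lapply(widths, function(w) bin_counts(hp, w))

# Same observations every time, but fewer and taller bins as the width grows
sapply(binned, sum)
sapply(binned, length)
sapply(binned, max)
```

Narrow bins expose gaps and isolated values but look noisy, while wide bins give a smooth shape at the cost of hiding that detail.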

##Box plots

#– Box plots for outliers

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

EXPLANATION
- msrp ranges from about 11000 to 180000.
- The x-axis values (0.63-1.38) carry no meaning, because x = 1 is only a placeholder.
- The median is about 26500.
- Outliers lie above about 66000.
- The largest msrp not flagged as an outlier is about 65000.
- The smallest msrp, about 11000, is not flagged as an outlier.
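The points that geom_boxplot() draws individually are those lying more than 1.5 times the interquartile range beyond the box. The sketch below applies that rule by hand to a few hypothetical prices (toy values, not the cars data); it uses quantile() with its default method, so the cutoffs may differ slightly from the hinges a plotting function computes.

```r
# Hypothetical prices, not taken from the cars data
msrp_toy <- c(11000, 15000, 20000, 24000, 27000,
              30000, 36000, 42000, 65000, 180000)

# Quartiles and interquartile range
q1  <- quantile(msrp_toy, 0.25, names = FALSE)
q3  <- quantile(msrp_toy, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: anything outside them is flagged as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- msrp_toy[msrp_toy < lower_fence | msrp_toy > upper_fence]
outliers
```

In this toy example only the 180000 value falls outside the fences; 65000 is large but still inside the upper fence, so it would sit at the end of the whisker.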

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

EXPLANATION
- msrp now ranges from about 11000 to just under 100000, the cutoff used in the filter.
- The x-axis values (0.63-1.38) carry no meaning, because x = 1 is only a placeholder.
- The median is about 26500.
- Outliers lie above about 63000.
- The largest msrp not flagged as an outlier is about 63000.
- The smallest msrp, about 11000, is not flagged as an outlier.

#– Plot selection

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

EXPLANATION
- City mileage ranges from about 10 to 60 mpg.
- The x-axis values (0.63-1.38) carry no meaning, because x = 1 is only a placeholder.
- The median is about 19 mpg.
- Several outliers lie above about 27 mpg.
- One outlier lies below about 12 mpg.
- The whiskers run from about 12 to 27 mpg.

cars %>%
  ggplot(aes(city_mpg)) +
  geom_density()

## Warning: Removed 14 rows containing non-finite values (stat_density).

EXPLANATION
- The density peaks at about 0.13.
- Density values range from 0 to about 0.13.
- The density is highest for cars that get about 15-22 mpg in the city.
- City mileage ranges from about 10 to 60 mpg.
- The density is roughly constant from about 34 to 60 mpg.

# Create plot of width
cars %>%
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()

## Warning: Removed 28 rows containing non-finite values (stat_boxplot).

EXPLANATION
- Car width ranges from about 64 to 82.
- The x-axis values (0.63-1.38) carry no meaning, because x = 1 is only a placeholder.
- The median width is about 71.
- There are 2 outliers above about 78.
- The whiskers run from about 64 to 78.

cars %>%
  ggplot(aes(x = width)) +
  geom_density()

## Warning: Removed 28 rows containing non-finite values (stat_density).

EXPLANATION
- The density peaks at about 0.115, at a width of about 72.
- Density values range from 0 to about 0.115.
- The density is concentrated between widths of about 67.5 and 75.
- Car width ranges from about 64 to 82.
- Beyond a width of about 72, the density falls steadily.

Visualization in higher dimensions

#– 3 variable plot

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv) +
  ggtitle("hwy_mpg by ncyl and suv")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite values (stat_bin).

EXPLANATION
- The suv variable has 2 categories: non-SUV and SUV.
- There are 3 categories of ncyl: 4, 6, and 8.
- Non-SUV, 4 cylinders: highway mileage ranges from about 23 to 53 mpg, with the tallest bin at about 33-34 mpg.
- SUV, 4 cylinders: highway mileage ranges from about 24 to 27 mpg, with the tallest bin at about 25 mpg.
- Non-SUV, 6 cylinders: highway mileage ranges from about 16 to 34 mpg, with the tallest bin at about 25-26 mpg.
- SUV, 6 cylinders: highway mileage ranges from about 17 to 26 mpg, with the tallest bin at about 23 mpg.
- Non-SUV, 8 cylinders: highway mileage ranges from about 16 to 28 mpg, with the tallest bin at about 25-26 mpg.
- SUV, 8 cylinders: highway mileage ranges from about 11 to 23 mpg, with the tallest bin at about 17 mpg.

#– Interpret 3 var plot

- Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases.

###Numerical Summaries

##Measures of center

#What is a typical value for life expectancy?

- We will look at just a few data points here
- And just the females

head(life)

##     State         County fips Year Female.life.expectancy..years.
## 1 Alabama Autauga County 1001 1985                           77.0
## 2 Alabama Baldwin County 1003 1985                           78.8
## 3 Alabama Barbour County 1005 1985                           76.0
## 4 Alabama    Bibb County 1007 1985                           76.6
## 5 Alabama  Blount County 1009 1985                           78.9
## 6 Alabama Bullock County 1011 1985                           75.1
##   Female.life.expectancy..national..years.
## 1                                     77.8
## 2                                     77.8
## 3                                     77.8
## 4                                     77.8
## 5                                     77.8
## 6                                     77.8
##   Female.life.expectancy..state..years. Male.life.expectancy..years.
## 1                                  76.9                         68.1
## 2                                  76.9                         71.1
## 3                                  76.9                         66.8
## 4                                  76.9                         67.3
## 5                                  76.9                         70.6
## 6                                  76.9                         66.6
##   Male.life.expectancy..national..years. Male.life.expectancy..state..years.
## 1                                   70.8                                69.1
## 2                                   70.8                                69.1
## 3                                   70.8                                69.1
## 4                                   70.8                                69.1
## 5                                   70.8                                69.1
## 6                                   70.8                                69.1

EXPLANATION - There are 10 features in the life dataset.

x <- head(round(life$Female.life.expectancy..years.), 11)
x

##  [1] 77 79 76 77 79 75 77 77 77 78 77

mean

balance point of the data
sensitive to extreme values

sum(x)/11

## [1] 77.2

mean(x)

## [1] 77.2

median

middle value of the data
robust to extreme values
most appropriate measure when working with skewed data

sort(x)

##  [1] 75 76 77 77 77 77 77 77 78 79 79

median(x)

## [1] 77

mode

most common value

table(x)

## x
## 75 76 77 78 79 
##  1  1  6  1  2
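The three measures can be compared side by side on the 11 values used above. Base R has no built-in function for the statistical mode (mode() reports an object's storage type), so the sketch defines a small helper, stat_mode, which is an illustrative name and returns only the first value in case of ties. The extra value of 150 is a hypothetical extreme added to show sensitivity.

```r
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)  # the 11 values shown above

# Illustrative helper: most frequent value (first one if there are ties)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

stat_mode(x)

# Add one hypothetical extreme value and compare the two measures of center
x_out <- c(x, 150)
c(mean = mean(x), median = median(x))
c(mean = mean(x_out), median = median(x_out))
```

The single extreme value pulls the mean up by several years while the median stays at 77, which is why the median is preferred for skewed data.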

#– Calculate center measures

library("gapminder")

## Warning: package 'gapminder' was built under R version 4.1.3

str(gapminder)

## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

EXPLANATION
- There are 1704 observations of 6 variables.
- 2 variables are factors.
- 2 variables are integer.
- 2 variables are numeric (double).
- No variable is of type character.

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))

## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fct>               <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7

EXPLANATION
- There are 5 continents: Africa, Americas, Asia, Europe, Oceania.
- The mean life expectancies, in that order, are 54.8, 73.6, 70.7, 77.6, and 80.7 years.
- The median life expectancies, in that order, are 52.9, 72.9, 72.4, 78.6, and 80.7 years.
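The backticked column names in the output above, such as `mean(lifeExp)`, are awkward to reuse in later steps. One option, sketched here on the assumption that the dplyr and gapminder packages are installed, is to name each summary inside `summarize()`. The names `mean_life`, `median_life`, and `life_by_continent` are illustrative choices, not part of the original notes.

```r
library(dplyr)
library(gapminder)

# Same groupwise summary as above, but with explicit column names
life_by_continent <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(mean_life   = mean(lifeExp),
            median_life = median(lifeExp))

life_by_continent
```

Comparing the two columns also hints at shape: a mean below the median, as for Asia and Europe in the table above, may point to a left-skewed distribution.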

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

EXPLANATION
- Life expectancy in Africa ranges from about 40 to 76.
- Life expectancy in the Americas ranges from about 61 to 81.
- Life expectancy in Asia ranges from about 44 to 83.
- Life expectancy in Europe ranges from about 71 to 81.
- Life expectancy in Oceania ranges from about 80 to 81.

##Measures of variability
- We want to know ‘How much is the data spread out from the middle?’
- Just looking at the data gives us a sense of this
- But we want to boil it down to one number so we can compare sample distributions

##  [1] 77 79 76 77 79 75 77 77 77 78 77

We could just take the difference between each point and the mean and add them up
- But that would equal 0 (the tiny value below is floating-point rounding error). That's the idea of the mean: it is the balance point.

# Look at the difference between each point and the mean
sum(x - mean(x))

## [1] -0.0000000000000568

So we can square the differences
- But this sum will keep getting bigger as you add more observations
- We want something that is stable

# Square each difference to get rid of negatives then sum
sum((x - mean(x))^2)

## [1] 13.6

Variance

So we divide by n - 1
This is called the sample variance, one of the most useful measures of the spread of a sample distribution

sum((x - mean(x))^2)/(length(x) - 1)

## [1] 1.36

var(x)

## [1] 1.36
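To illustrate why the raw sum of squared differences is not a stable measure while the variance is, the sketch below duplicates the sample, so the spread is the same but there are twice as many observations. Duplicating the data is an artificial device used only for this comparison.

```r
x  <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)
x2 <- rep(x, 2)  # the same 11 values, each observed twice

sum((x  - mean(x))^2)   # sum of squared differences for 11 values
sum((x2 - mean(x2))^2)  # doubles for 22 values

var(x)   # about 1.36
var(x2)  # about 1.30: nearly unchanged
```

The small difference between the two variances comes from dividing by n - 1 (10 versus 21) rather than by n.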

Standard Deviation

Another very useful metric is the sample standard deviation
This is just the square root of the variance
The nice thing about the standard deviation is that it is in the same units as the original data
In this case it is 1.17 years

sqrt(sum((x - mean(x))^2)/(length(x) - 1))

## [1] 1.17

sd(x)

## [1] 1.17

Inter Quartile Range

The IQR is the width of the middle 50% of the data: the third quartile minus the first quartile
The nice thing about this one is that it is not sensitive to extreme values
The variance, standard deviation, and range are all sensitive to extreme values

summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    75.0    77.0    77.0    77.2    77.5    79.0

IQR(x)

## [1] 0.5
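The following sketch shows where the 0.5 comes from and checks the robustness claim. It assumes the default quantile method used by `quantile()` and `IQR()`; the extreme value 120 is invented for illustration.

```r
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# The IQR is the third quartile minus the first quartile
q <- quantile(x, c(0.25, 0.75))
unname(q[2] - q[1])  # 0.5, same as IQR(x)

# Replace one 79 with an invented extreme value and compare spread measures
x_out <- replace(x, which.max(x), 120)
IQR(x_out)          # still 0.5
sd(x_out)           # much larger than sd(x)
diff(range(x_out))  # 45 instead of 4
```

In a sample this small the IQR will not always be completely unaffected by a changed value, but it reacts far less than the standard deviation or the range.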

Range

max and min are also interesting
as is the range, or the difference between max and min

max(x)

## [1] 79

min(x)

## [1] 75

diff(range(x))

## [1] 4

– Calculate spread measures

str(gap2007)

## tibble [142 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ year     : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
##  $ pop      : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
##  $ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...

EXPLANATION
- There are 142 observations of 6 variables.
- Two variables are factors.
- Two variables are integers.
- Two variables are numeric.

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())

## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fct>             <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2

EXPLANATION
- The standard deviation of life expectancy ranges from 0.729 (Oceania) to 9.63 (Africa).
- The IQR of life expectancy ranges from 0.516 (Oceania) to 11.6 (Africa).
- The number of countries per continent ranges from 2 (Oceania) to 52 (Africa).

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

EXPLANATION
- Oceania has by far the highest density peak, because its 2 observations lie very close together.
- The continents other than Oceania have a density of at most about 0.12.
- The continents other than Oceania have broadly similar, much wider distributions.

# – Choose measures for center and spread

# Compute stats for lifeExp in Americas
head(gap2007)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923      975.
## 2 Albania     Europe     2007    76.4  3600523     5937.
## 3 Algeria     Africa     2007    72.3 33333216     6223.
## 4 Angola      Africa     2007    42.7 12420476     4797.
## 5 Argentina   Americas   2007    75.3 40301927    12779.
## 6 Australia   Oceania    2007    81.2 20434176    34435.

gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))

## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531  26702008.

Shape and transformations

4 characteristics of a distribution that are of interest:

center
- already covered
spread or variability
- already covered
shape
- modality: number of prominent humps (uni, bi, multi, or uniform - no humps)
- skew (right, left, or symmetric)
- Can transform to fix skew
outliers
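To make the "can transform to fix skew" point concrete, here is a minimal sketch with invented right-skewed values (each is e times the previous one). The vector name `skewed` is illustrative and the numbers are not from any dataset in these notes.

```r
# Invented right-skewed values: each is e times the previous one
skewed <- exp(1:7)

mean(skewed)    # far above the median, pulled up by the largest values
median(skewed)  # exp(4), about 54.6

# After a log transform the values are 1, 2, ..., 7: symmetric
mean(log(skewed))
median(log(skewed))
```

On the log scale the mean and median agree, which is the usual sign that the skew has been removed.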

– Transformations

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()

EXPLANATION
- The pop values range from close to 0 up to about 1.3 billion.
- The density ranges from 0 to about 0.00000003.
- The density is highest for pop values between roughly 0 and 200000000.
- Beyond about 200000000 the density is low and nearly flat: a long right tail, so the distribution is strongly right-skewed.

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()

EXPLANATION
- The log_pop values range from 12 to 22.
- The highest peak of the density is at a log_pop value of around 16.
- The data are clustered around a log_pop value of 16.

## Outliers

# – Identify outliers

# Filter for Asia, add column indicating outliers
str(gapminder)

## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

EXPLANATION
- With the outlier removed, life expectancy in Asia ranges from about 59 to about 83, matching the upper end of the Asia box plot above.
- The box spans 0.63 to 1.38 on the x axis, but that is only the width of the box: x = 1 is a placeholder, not a data value.
- The median life expectancy is around 72.5.
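The cutoff `lifeExp < 50` above is chosen by eye. One common alternative, shown only as a sketch and not as the method used in these notes, is the 1.5 × IQR rule that box plots conventionally use to mark individual points. The names `asia2007`, `fence`, and `asia_flagged` are illustrative, and the block assumes the dplyr and gapminder packages are installed.

```r
library(dplyr)
library(gapminder)

asia2007 <- gapminder %>%
  filter(year == 2007, continent == "Asia")

# Lower and upper fences: 1.5 * IQR beyond the first and third quartiles
q     <- quantile(asia2007$lifeExp, c(0.25, 0.75))
fence <- c(q[1] - 1.5 * IQR(asia2007$lifeExp),
           q[2] + 1.5 * IQR(asia2007$lifeExp))

# Flag every country outside the fences
asia_flagged <- asia2007 %>%
  mutate(is_outlier = lifeExp < fence[1] | lifeExp > fence[2])

# Countries flagged by the rule
asia_flagged %>%
  filter(is_outlier) %>%
  select(country, lifeExp)
```

A rule like this has the advantage of being reproducible across groups, whereas a hand-picked threshold has to be chosen again for each continent.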

###Case Study

## Introducing the data

# - Spam and num_char

# ggplot2, dplyr, and openintro are loaded

# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarize( 
    median(num_char),
    IQR(num_char))

## # A tibble: 2 x 3
##   spam  `median(num_char)` `IQR(num_char)`
##   <fct>              <dbl>           <dbl>
## 1 0                   6.83           13.6 
## 2 1                   1.05            2.82

EXPLANATION
- The spam variable takes two values, 0 and 1.
- The median of num_char for value 0 is 6.83.
- The median of num_char for value 1 is 1.05.
- The IQR of num_char for value 0 is 13.58.
- The IQR of num_char for value 1 is 2.82.

str(email)

## tibble [3,921 x 21] (S3: tbl_df/tbl/data.frame)
##  $ spam        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ to_multiple : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
##  $ from        : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cc          : int [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
##  $ sent_email  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
##  $ time        : POSIXct[1:3921], format: "2012-01-01 13:16:41" "2012-01-01 14:03:59" ...
##  $ image       : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
##  $ attach      : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
##  $ dollar      : num [1:3921] 0 0 4 0 0 0 0 0 0 0 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num [1:3921] 0 0 1 0 0 0 0 0 0 0 ...
##  $ viagra      : num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num [1:3921] 0 0 0 0 2 2 0 0 0 0 ...
##  $ num_char    : num [1:3921] 11.37 10.5 7.77 13.26 1.23 ...
##  $ line_breaks : int [1:3921] 202 202 192 255 29 25 193 237 69 68 ...
##  $ format      : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 1 2 ...
##  $ re_subj     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ exclaim_subj: num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
##  $ urgent_subj : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ exclaim_mess: num [1:3921] 0 1 6 48 1 1 1 18 1 0 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...

EXPLANATION
- There are 3921 observations of 21 variables in this dataset.
- Nine variables are factors.
- Two variables are integers.
- Nine variables are numeric.
- One variable (time) is of type POSIXct.

table(email$spam)

## 
##    0    1 
## 3554  367

EXPLANATION
- 3554 emails have the value 0 (not spam).
- 367 emails have the value 1 (spam).

email <- email %>%
  mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")))

# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

EXPLANATION
- The log_num_char values of the not-spam category range from -6.0 to 5.1.
- The log_num_char values of the spam category range from -7.2 to 5.1.
- The not-spam category has some outliers below -2.4.
- The spam category has 2 outliers below -2.8.
- The spam category has 4 outliers above 2.75.
- The median of the not-spam category is at about 2.
- The median of the spam category is at about 0.

– Spam and num_char interpretation

The median length of not-spam emails is greater than that of spam emails

– Spam and !!!

# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarize(
    median(exclaim_mess),
    IQR(exclaim_mess))

## # A tibble: 2 x 3
##   spam     `median(exclaim_mess)` `IQR(exclaim_mess)`
##   <fct>                     <dbl>               <dbl>
## 1 not-spam                      1                   5
## 2 spam                          0                   1

table(email$exclaim_mess)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1435  733  507  128  190  113  115   51   93   45   85   17   56   20   43   11 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##   29   12   26    5   29    9   15    3   11    6   11    1    6    8   13   12 
##   32   33   34   35   36   38   39   40   41   42   43   44   45   46   47   48 
##   13    3    3    2    3    3    1    2    1    1    3    3    5    3    2    1 
##   49   52   54   55   57   58   62   71   75   78   89   94   96  139  148  157 
##    3    1    1    4    2    2    2    1    1    1    1    1    1    1    1    1 
##  187  454  915  939  947 1197 1203 1209 1236 
##    1    1    1    1    1    1    2    1    1

# Create plot for spam and exclaim_mess
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess)) %>%
  ggplot(aes(x = log_exclaim_mess)) + 
  geom_histogram() + 
  facet_wrap(~ spam)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1435 rows containing non-finite values (stat_bin).

EXPLANATION
- The log_exclaim_mess values for the not-spam category range from 0 to about 7.
- The log_exclaim_mess values for the spam category range from 0 to about 7.5.
- In both categories the highest count is at around 0, that is, at one exclamation mark (log(1) = 0), because emails with none were removed from the plot.
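The warning about 1435 removed rows above comes from emails with zero exclamation marks: `log(0)` is `-Inf`, which `stat_bin()` drops. A common workaround, assumed here rather than taken from the code above, is to add a small constant before taking the log. The constant 0.01 is a choice, not a rule, and the `counts` vector is invented for illustration.

```r
log(0)         # -Inf: non-finite, so the histogram drops these rows
log(0 + 0.01)  # about -4.6: zeros now land in their own bin on the left

# Applied to a few illustrative counts (0, 1, 5 and 50 exclamation marks)
counts <- c(0, 1, 5, 50)
log(counts + 0.01)
```

In the plotting chain this would correspond to `mutate(log_exclaim_mess = log(exclaim_mess + 0.01))`, which keeps all 3921 emails in the histogram.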

– Spam and !!! interpretation

The most common value of exclaim_mess in both classes of email is zero. Zero has no finite log, so the plot above drops those emails; adding .01 before taking the log would place them at log(.01), about -4.6.
Even after a transformation, the distribution of exclaim_mess in both classes of email is right-skewed.
The typical number of exclamations in the not-spam group appears to be slightly higher than in the spam group.

# Check-in 1

- Zero inflation in the exclaim_mess variable
  - you can analyze the two parts separately
  - or turn it into a categorical variable (is-zero, not-zero)
- Could make a bar chart
  - need to decide whether you are more interested in counts or proportions
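One way to handle the zero inflation described above is sketched below: collapse exclaim_mess into a zero / not-zero indicator and draw a bar chart. It assumes the dplyr, ggplot2, and openintro packages are installed; `openintro::email` is used so that the block starts from the unmodified data, and the names `email_zero` and `zero` are illustrative.

```r
library(dplyr)
library(ggplot2)
library(openintro)

# Recode spam and add an indicator for "no exclamation marks"
email_zero <- openintro::email %>%
  mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")),
         zero = exclaim_mess == 0)

# Proportion of spam among emails with and without exclamation marks
ggplot(email_zero, aes(x = zero, fill = spam)) +
  geom_bar(position = "fill")

# Use geom_bar() without position = "fill" to compare counts instead
```

Whether counts or proportions are more useful depends on the question: counts show how many emails fall in each group, proportions show how the spam share differs between groups.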

– Collapsing levels

table(email$image)

## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

EXPLANATION
- Whether or not an email has an image, most emails are not-spam.
- The proportion of spam is small in both groups.

# – Image and spam interpretation
* An email without an image is more likely to be not-spam than spam

– Data Integrity

# Test if images count as attachments
sum(email$image > email$attach)

## [1] 0

No email has more images than attachments, which suggests that images are also counted as attachments

– Answering questions with chains

## Within non-spam emails, is the typical length of emails shorter for 
## those that were sent to multiple people?
email %>%
   filter(spam == "not-spam") %>%
   group_by(to_multiple) %>%
   summarize(median(num_char))

## # A tibble: 2 x 2
##   to_multiple `median(num_char)`
##   <fct>                    <dbl>
## 1 0                         7.20
## 2 1                         5.36

# Question 1
## For emails containing the word "dollar", does the typical spam email 
## contain a greater number of occurrences of the word than the typical non-spam email?
table(email$dollar)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3175  120  151   10  146   20   44   12   35   10   22   10   20    7   14    5 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   32 
##   23    2   14    1   10    7   12    7    7    3    7    1    5    1    1    2 
##   34   36   40   44   46   48   54   63   64 
##    1    2    3    3    2    1    1    1    3

email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))

## # A tibble: 2 x 2
##   spam     `median(dollar)`
##   <fct>               <dbl>
## 1 not-spam                4
## 2 spam                    2

# Question 2
## If you encounter an email with greater than 10 occurrences of the word "dollar", 
## is it more likely to be spam or not-spam?

email %>%
  filter(dollar > 10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

* Not-spam, at least in this dataset
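The bar chart answers Question 2 visually; the same comparison can be put into numbers. The sketch below assumes the dplyr and openintro packages are installed and starts from the unmodified `openintro::email`; the names `dollar_share` and `prop` are illustrative.

```r
library(dplyr)
library(openintro)

# Counts and proportions of spam among emails with more than 10 "dollar"s
dollar_share <- openintro::email %>%
  mutate(spam = ifelse(spam == 0, "not-spam", "spam")) %>%
  filter(dollar > 10) %>%
  count(spam) %>%
  mutate(prop = n / sum(n))

dollar_share
```

The `prop` column gives the share of each class among these emails, which is the quantity the question asks about.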

Check-in 2

– What’s in a number?

levels(email$number)

## [1] "none"  "small" "big"

table(email$number)

## 
##  none small   big 
##   549  2827   545

# Reorder levels
email$number <- factor(email$number, levels = c("none","small","big"))

# Construct plot of number
ggplot(email, aes(x = number)) +
  geom_bar() + 
  facet_wrap( ~ spam)

# – What’s in a number interpretation
* Given that an email contains a small number, it is more likely to be not-spam.
* Given that an email contains a big number, it is more likely to be not-spam.
* Within both spam and not-spam, the most common number is a small one.
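Statements of the form "given that an email contains a small number" are conditional proportions, and they can be checked with a contingency table. This is a minimal sketch using base R's `table()` and `prop.table()` on the unmodified `openintro::email` data, where spam is coded 0 (not-spam) and 1 (spam); the name `counts` is illustrative.

```r
library(openintro)

# Rows: number category; columns: spam (0 = not-spam, 1 = spam)
counts <- table(openintro::email$number, openintro::email$spam)

# margin = 1 conditions on the row, so each row sums to 1:
# P(spam status | number category)
prop.table(counts, margin = 1)

# margin = 2 conditions on the column instead:
# P(number category | spam status)
prop.table(counts, margin = 2)
```

The first table supports the "given that an email contains..." statements; the second supports the statement about the most common number within spam and within not-spam.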

Exploratory Data Analysis

2501975684_Arvio Anandi

2022-06-02
