• Course notes from the Exploratory Data Analysis course on DataCamp
• Exploring Categorical Data • Exploring Numerical Data • Numerical Summaries • Case Study
#source('create_dataset.R')
library(readr)
## Warning: package 'readr' was built under R version 4.1.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
library(openintro)
## Warning: package 'openintro' was built under R version 4.1.3
## Loading required package: airports
## Warning: package 'airports' was built under R version 4.1.3
## Loading required package: cherryblossom
## Warning: package 'cherryblossom' was built under R version 4.1.3
## Loading required package: usdata
## Warning: package 'usdata' was built under R version 4.1.3
cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")
• Bar charts with one categorical variable on the x axis and another in the fill are a common way to view a contingency table visually.
• It is essentially what you would get if you used the table() function with two variables.
• How you show the data can change the perception.
• The choice of fill variable and of bar position (fill, dodge, stack) can each give a different impression.
#Print the first rows of the data
head(comics)
## name id align eye hair
## 1 Spider-Man (Peter Parker) Secret Good Hazel Eyes Brown Hair
## 2 Captain America (Steven Rogers) Public Good Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral Blue Eyes Black Hair
## 4 Iron Man (Anthony \\"Tony\\" Stark) Public Good Blue Eyes Black Hair
## 5 Thor (Thor Odinson) No Dual Good Blue Eyes Blond Hair
## 6 Benjamin Grimm (Earth-616) Public Good Blue Eyes No Hair
## gender gsm alive appearances first_appear publisher
## 1 Male <NA> Living Characters 4043 Aug-62 marvel
## 2 Male <NA> Living Characters 3360 Mar-41 marvel
## 3 Male <NA> Living Characters 3061 Oct-74 marvel
## 4 Male <NA> Living Characters 2961 Mar-63 marvel
## 5 Male <NA> Living Characters 2258 Nov-50 marvel
## 6 Male <NA> Living Characters 2255 Nov-61 marvel
From the output above, we can see that the “comics” dataset has 11 attributes: name, id, align, eye, hair, gender, gsm, alive, appearances, first_appear, and publisher.
comics$align <- as.factor(comics$align)
levels(comics$align)
## [1] "Bad" "Good" "Neutral"
## [4] "Reformed Criminals"
comics$gender <- as.factor(comics$gender)
levels(comics$gender)
## [1] "Female" "Male" "Other"
The comics dataset has 4 alignment levels: “Bad”, “Good”, “Neutral”, and “Reformed Criminals”. There are 3 gender levels: Female, Male, and Other.
#Create a 2-way contingency table
table(comics$align, comics$gender)
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
We can clearly see that this dataset has more Male characters than Female or Other characters.
###-Dropping levels
#Load dplyr
#Print tab
tab <- table(comics$align, comics$gender)
tab
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
#Remove align level
comics <- comics %>%
filter(align != 'Reformed Criminals') %>%
droplevels()
levels(comics$align)
## [1] "Bad" "Good" "Neutral"
The contingency table shows that the “Reformed Criminals” level of the align attribute has a very low count (3 characters) compared with the other 3 levels. To simplify the analysis, we remove this level; droplevels() then discards the now-empty level from the factor.
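Why droplevels() is needed can be seen on a small example. The sketch below uses a made-up vector (not the comics data) and base R only: filtering removes the observations, but the factor still carries the empty level until droplevels() is called.

```r
# Toy alignment vector (hypothetical values, for illustration only)
align <- factor(c("Bad", "Good", "Neutral", "Reformed Criminals", "Good", "Bad"))

# Filtering removes the observations but keeps the now-empty level
kept <- align[align != "Reformed Criminals"]
levels_before <- levels(kept)   # still lists "Reformed Criminals"
table(kept)                     # the empty level shows up with a count of 0

# droplevels() discards levels that no longer occur in the data
kept <- droplevels(kept)
levels_after <- levels(kept)
levels_after
```

Without this step, the empty level would still appear as a zero row in tables and as an empty slot in plots.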
#Load ggplot2
#Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "dodge")
* The argument position = “dodge” in geom_bar() creates a
side-by-side barchart instead of a stacked one.
As stated before, for every alignment in the comics dataset there are more Male characters than Female or Other characters.
#Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x=gender, fill=align)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle=90))
Earlier we noted that most of the characters are male. The barchart
above shows that most male characters have “Bad” alignment, while most
female characters have “Good” alignment.
– Bar chart interpretation
• Among characters with “Neutral” alignment, males are the most common.
• In general, there is an association between gender and alignment.
• There are more male characters than female characters in this dataset.
# simplify display format
options(scipen = 999, digits = 3)
## create table of counts
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt
##
## Bad Good Neutral
## No Dual 474 647 390
## Public 2172 2930 965
## Secret 4493 2475 959
## Unknown 7 0 2
# Proportional table
# All values add up to 1
prop.table(tbl_cnt)
##
## Bad Good Neutral
## No Dual 0.030553 0.041704 0.025139
## Public 0.140003 0.188862 0.062202
## Secret 0.289609 0.159533 0.061815
## Unknown 0.000451 0.000000 0.000129
Among characters with “Bad” alignment, the “Secret” id is the most common, while for “Good” and “Neutral” alignment the most common id is “Public” (for “Neutral”, only narrowly ahead of “Secret”). The “Unknown” id is by far the rarest.
sum(prop.table(tbl_cnt))
## [1] 1
#All rows add up to 1
prop.table(tbl_cnt, 1)
##
## Bad Good Neutral
## No Dual 0.314 0.428 0.258
## Public 0.358 0.483 0.159
## Secret 0.567 0.312 0.121
## Unknown 0.778 0.000 0.222
#Columns add up to 1
prop.table(tbl_cnt,2)
##
## Bad Good Neutral
## No Dual 0.066331 0.106907 0.168394
## Public 0.303946 0.484137 0.416667
## Secret 0.628743 0.408956 0.414076
## Unknown 0.000980 0.000000 0.000864
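The difference between the two margins is easy to mix up, so here is a small self-contained check. It is only a sketch: the counts are retyped by hand from the id-by-align table above into a plain matrix (the object names by_id and by_align are made up for this example).

```r
# Counts copied from the id-by-align table printed above
tbl_cnt <- matrix(
  c( 474,  647, 390,
    2172, 2930, 965,
    4493, 2475, 959,
       7,    0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(id = c("No Dual", "Public", "Secret", "Unknown"),
                  align = c("Bad", "Good", "Neutral"))
)

by_id    <- prop.table(tbl_cnt, 1)  # condition on id: each row sums to 1
by_align <- prop.table(tbl_cnt, 2)  # condition on align: each column sums to 1

rowSums(by_id)
colSums(by_align)

# Same cell, two different questions:
by_id["Secret", "Bad"]     # share of Secret-id characters that are Bad
by_align["Secret", "Bad"]  # share of Bad characters that have a Secret id
```

The margin argument names the dimension that is held fixed: 1 conditions on the rows, 2 on the columns.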
• Look at the proportion of bad characters in the secret and unknown groups • Note there are very few characters with id = unknown
ggplot(comics, aes(x=id, fill=align)) +
geom_bar(position = "fill") +
ylab("proportion")
• Swap the x and fill variables. Notice that most bad characters are
secret (not unknown). • Here you can see more clearly that there are
very few characters at all with id = unknown
ggplot(comics, aes(x=align, fill = id)) +
geom_bar(position = "fill") +
ylab("proportion")
- Characters with “Bad” alignment most often have a “Secret” id.
- Characters with “Good” alignment most often have a “Public” id.
- Characters with “Neutral” alignment are split almost evenly between “Secret” and “Public” ids.
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab) # Joint proportions
##
## Female Male Other
## Bad 0.082210 0.395160 0.001672
## Good 0.130135 0.251333 0.000888
## Neutral 0.043692 0.094021 0.000888
prop.table(tab, 2)
##
## Female Male Other
## Bad 0.321 0.534 0.485
## Good 0.508 0.339 0.258
## Neutral 0.171 0.127 0.258
• Approximately what proportion of all female characters are good? = 51%
# Plot counts of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "fill")
The two plots differ only in the argument position = “fill” passed to
geom_bar(). With it, every bar fills the entire height of the plotting
window, so the plot shows proportions within each alignment instead of
counts.
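To connect the filled barchart to a table: with x = align and fill = gender, each bar is one alignment, so the segment heights should correspond to proportions conditional on align, which is margin 1 when align is the row variable. A minimal sketch, with the counts retyped from the align-by-gender table above (within_align is a name made up for this example):

```r
# Counts copied from the align-by-gender table printed above
tab <- matrix(
  c(1573, 7561, 32,
    2490, 4809, 17,
     836, 1799, 17),
  nrow = 3, byrow = TRUE,
  dimnames = list(align = c("Bad", "Good", "Neutral"),
                  gender = c("Female", "Male", "Other"))
)

# Proportions of each gender within each alignment (rows sum to 1)
within_align <- prop.table(tab, 1)
round(within_align, 3)
```

Note that prop.table(tab, 2), shown earlier, answers the other question (alignment within each gender) and would match a plot with x = gender and fill = align.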
# Can use table function on just one variable
# This is called a marginal distribution
table(comics$id)
##
## No Dual Public Secret Unknown
## 1511 6067 7927 9
In this dataset, the “Secret” id is the most common, covering just over half of the characters.
# Simple barchart
ggplot(comics, aes(x = id)) +
geom_bar()
• You can also facet to see variables individually • A little easier than
filtering each and plotting.
ggplot(comics, aes(x = id)) +
geom_bar() +
facet_wrap(~align)
• This is a rearrangement of the bar chart we plotted earlier. - We
facet by alignment rather than coloring the stack. - This can make it a
little easier to answer some questions.
• It makes more sense to put Neutral between Bad and Good • We need to reorder the levels so it will chart this way • Otherwise it will default to alphabetical
#Change the order of the levels in align
comics$align <- factor(comics$align,
levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) +
geom_bar()
The order of the alignment levels has now changed. Before: “Bad”,
“Good”, “Neutral”. After: “Bad”, “Neutral”, “Good”.
Placing “Neutral” between “Bad” and “Good” reflects the natural ordering of the alignments.
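The reordering step can be tried in isolation. The sketch below uses a made-up vector rather than the comics data and shows that factor() with an explicit levels argument changes only the order of the levels, not the observations.

```r
# Toy vector (hypothetical values, for illustration only)
align <- factor(c("Good", "Bad", "Neutral", "Good", "Bad", "Bad"))
default_order <- levels(align)          # alphabetical by default

# Re-create the factor with an explicit level order; the data are unchanged
align <- factor(align, levels = c("Bad", "Neutral", "Good"))
new_order <- levels(align)

table(align)   # counts are reported in the new order
```

One thing to watch for: a level name that is misspelled in the levels argument silently turns the matching observations into NA, so it is worth checking table() afterwards.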
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~gender)
When the plot is rearranged by faceting on gender, we can see the
alignment distribution within each gender more clearly than in the
stacked version.
pies <- data.frame(flavor = as.factor(rep(c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"), times = c(17, 14, 15, 13, 16, 12, 11))))
#Put levels of flavor in descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)
head(pies$flavor)
## [1] apple apple apple apple apple apple
## Levels: apple key lime boston creme blueberry cherry pumpkin strawberry
There are 7 flavors in the pies dataset.
# Create barchart of flavor
ggplot(pies, aes(x = flavor)) +
geom_bar(fill = "chartreuse") +
theme(axis.text.x = element_text(angle=90))
The bars are arranged in descending order of count, which makes it
easier to see which flavors are the most and least popular.
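The level order above was typed by hand. As an alternative sketch (not the approach used in the notes), the same order can be derived from the counts themselves with base R, which avoids mistakes when there are many levels. The flavor counts are the ones used to build the pies data.

```r
# Same flavor counts as the pies data above
flavor <- rep(c("apple", "blueberry", "boston creme", "cherry",
                "key lime", "pumpkin", "strawberry"),
              times = c(17, 14, 15, 13, 16, 12, 11))

# Sort the frequency table and use its names as the level order
lev <- names(sort(table(flavor), decreasing = TRUE))
flavor <- factor(flavor, levels = lev)
levels(flavor)
```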
# A dot plot shows all the datapoints
ggplot(cars, aes(x = weight)) +
geom_dotplot(dotsize = 0.4)
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bindot).
# A histogram groups the points into bins so it does not get overwhelming
ggplot(cars, aes(x = weight)) +
geom_histogram(binwidth = 500)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
# A density plot gives a bigger picture representation of the distribution
# It is more helpful when there is a lot of data
ggplot(cars, aes(x = weight)) +
geom_density()
## Warning: Removed 2 rows containing non-finite values (stat_density).
# A boxplot is a good way to just show the summary info of the distribution
ggplot(cars, aes(x = 1, y = weight)) +
geom_boxplot() +
coord_flip()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
# Load package
library(ggplot2)
#Learn data structure
str(cars)
## 'data.frame': 428 obs. of 19 variables:
## $ name : chr "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
## $ sports_car : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ suv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wagon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ minivan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pickup : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ all_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rear_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
## $ length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ width : int 66 66 69 68 69 67 67 67 67 67 ...
The cars dataset has 428 observations and 19 attributes, many of them logical or integer. *logi = logical class; it can take only 2 values, TRUE or FALSE (plus NA for missing), which behave as 1 and 0 in arithmetic.
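Because logical values behave as 1 and 0 in arithmetic, counts and proportions of a logical column come directly from sum() and mean(). A minimal sketch on a made-up vector (not the actual suv column):

```r
# Toy logical column (hypothetical values, for illustration only)
suv <- c(FALSE, TRUE, FALSE, FALSE, TRUE, NA)

n_suv    <- sum(suv, na.rm = TRUE)    # TRUE counts as 1, FALSE as 0
prop_suv <- mean(suv, na.rm = TRUE)   # proportion of TRUE among non-missing values
n_suv
prop_suv
```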
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
geom_histogram() +
facet_wrap(~suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).
In the cars dataset, most cars are not SUVs, and their city_mpg mostly
falls in the range 10-30.
unique(cars$ncyl)
## [1] 4 6 3 8 5 12 10 -1
table(cars$ncyl)
##
## -1 3 4 5 6 8 10 12
## 2 1 136 7 190 87 2 3
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))
# Create boxplot of city_mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
The 4-, 6-, and 8-cylinder cars each have a different median, upper
quartile, and lower quartile of city_mpg.
# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).
Cars with different numbers of cylinders also cover different ranges of city_mpg:
- cars with 8 cylinders have city_mpg ranging from about 10 to 20
- cars with 6 cylinders have city_mpg ranging from about 12 to 25
- cars with 4 cylinders have city_mpg ranging from about 15 to 60
• The highest mileage cars have 4 cylinders. • The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car. • Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.
# Create hist of horsepwr
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
ggtitle("Horsepower distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most cars in the “cars” dataset have a horsepower between 150 and 200.
The histogram looks roughly bell-shaped, which suggests an approximately
normal distribution, although a histogram alone cannot confirm normality.
# Create hist of horsepwr for affordable cars
cars %>%
filter(msrp < 25000) %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
xlim(c(90, 550)) +
ggtitle("Horsepower distribution for msrp < 25000")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
The plot above shows the distribution of horsepower for cars priced
below $25,000. It ranges from about 100 to 250.
• The highest horsepower car in the less expensive range has just under 250 horsepower.
# Create hist of horsepwr with binwidth of 3
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 3) +
ggtitle("binwidth = 3")
# Create hist of horsepwr with binwidth of 30
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 30) +
ggtitle("binwidth = 30")
# Create hist of horsepwr with binwidth of 60
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 60) +
ggtitle("binwidth = 60")
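We can adjust the binwidth to control how smooth the distribution
appears in the plot. The first plot, with binwidth = 3, is detailed
enough to read off how many cars sit at specific values such as 100,
200, 300, 400, and 500 horsepower, while the wider bins give a smoother
overall shape at the cost of that detail.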
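The trade-off can also be seen in numbers rather than pictures. The sketch below uses made-up horsepower values and base R's hist() with plot = FALSE; the fixed limits 0 to 600 and the helper bin_counts() are assumptions for this example, and the bin edges need not match the ones ggplot2 chooses by default.

```r
# Toy horsepower values (hypothetical, for illustration only)
hp <- c(103, 110, 130, 132, 140, 140, 155, 170, 200, 200, 225, 300, 340, 500)

# Count observations per bin for a given bin width, using fixed limits 0-600
bin_counts <- function(x, width) {
  hist(x, breaks = seq(0, 600, by = width), plot = FALSE)$counts
}

narrow <- bin_counts(hp, 3)    # many narrow bins, most of them empty
medium <- bin_counts(hp, 30)
wide   <- bin_counts(hp, 60)

c(length(narrow), length(medium), length(wide))   # number of bins
c(max(narrow), max(medium), max(wide))            # tallest bar in each version
```

Every version contains the same cars; wider bins simply pool more of them into each bar.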
#Construct boxplot of msrp
cars %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
# Exclude outliers from data
cars_no_out <- cars %>%
filter(msrp < 100000)
# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
A boxplot summarizes the distribution of the data and also flags
outliers. After filtering out the cars with an msrp of $100,000 or more,
the second boxplot looks different because the 3-5 largest outliers are
gone.
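The notes above remove outliers with a hand-picked cutoff (msrp < 100000). As an alternative sketch, base R's boxplot.stats() reports which points fall beyond 1.5 times the interquartile range from the box, the same kind of rule a boxplot uses to draw individual points. The prices below are made up, and the variable names are hypothetical.

```r
# Toy prices in thousands of dollars (hypothetical, for illustration only)
msrp_k <- c(12, 15, 18, 21, 25, 30, 35, 40, 120)

# boxplot.stats() flags values beyond 1.5 * IQR from the hinges
flagged <- boxplot.stats(msrp_k)$out
flagged

# Keep only the values that were not flagged
msrp_no_out <- msrp_k[!msrp_k %in% flagged]
msrp_no_out
```

A fixed cutoff is easier to explain; a rule-based flag adapts to the data. Either way, dropping points should be a deliberate choice rather than an automatic step.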
# Create plot of a city_mpg
cars %>%
ggplot(aes(x = 1, y = city_mpg)) +
geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
cars %>%
ggplot(aes(city_mpg)) +
geom_density()
## Warning: Removed 14 rows containing non-finite values (stat_density).
# Create plot of width
cars %>%
ggplot(aes(x = 1, y = width)) +
geom_boxplot()
## Warning: Removed 28 rows containing non-finite values (stat_boxplot).
cars %>%
ggplot(aes(x = width)) +
geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).
The boxplot is the better choice for city_mpg because it makes the
outliers in that attribute visible. In general, it is up to us to decide
which plot best describes the distribution of a given attribute.
# Facet hists using hwy mileage and ncyl
common_cyl %>%
ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv) +
ggtitle("hwy_mpg by ncyl and suv")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).
### - Interpret 3 var plot • Across both SUVs and non-SUVs, mileage
tends to decrease as the number of cylinders increases
• What is a typical value for life expectancy - we will look at just a few data points here - and just the females
head(life)
## State County fips Year Female.life.expectancy..years.
## 1 Alabama Autauga County 1001 1985 77.0
## 2 Alabama Baldwin County 1003 1985 78.8
## 3 Alabama Barbour County 1005 1985 76.0
## 4 Alabama Bibb County 1007 1985 76.6
## 5 Alabama Blount County 1009 1985 78.9
## 6 Alabama Bullock County 1011 1985 75.1
## Female.life.expectancy..national..years.
## 1 77.8
## 2 77.8
## 3 77.8
## 4 77.8
## 5 77.8
## 6 77.8
## Female.life.expectancy..state..years. Male.life.expectancy..years.
## 1 76.9 68.1
## 2 76.9 71.1
## 3 76.9 66.8
## 4 76.9 67.3
## 5 76.9 70.6
## 6 76.9 66.6
## Male.life.expectancy..national..years. Male.life.expectancy..state..years.
## 1 70.8 69.1
## 2 70.8 69.1
## 3 70.8 69.1
## 4 70.8 69.1
## 5 70.8 69.1
## 6 70.8 69.1
We can see that the “life” dataset has 10 different attributes: State, County, fips, Year, Female.life.expectancy..years., Female.life.expectancy..national..years., Female.life.expectancy..state..years., Male.life.expectancy..years., Male.life.expectancy..national..years., and Male.life.expectancy..state..years.
x <- head(round(life$Female.life.expectancy..years.), 11)
x
## [1] 77 79 76 77 79 75 77 77 77 78 77
mean • balance point of the data • sensitive to extreme values
sum(x)/11
## [1] 77.2
mean(x)
## [1] 77.2
The mean of x (the first 11 rounded values of life$Female.life.expectancy..years.) is equal to the sum of the values divided by 11.
median • middle value of the data • robust to extreme values • most appropriate measure when working with skewed data
sort(x)
## [1] 75 76 77 77 77 77 77 77 78 79 79
median(x)
## [1] 77
mode • most common value
table(x)
## x
## 75 76 77 78 79
## 1 1 6 1 2
The most common value of the data is 77.
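Base R has no built-in function for the statistical mode (the existing mode() function reports how an object is stored, not its most common value). A small helper can read it off the frequency table instead. This is only a sketch: stat_mode() is a hypothetical name, and in case of a tie it returns the first of the tied values.

```r
# The 11 rounded life expectancies used above
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Hypothetical helper: most frequent value (first one wins in case of a tie)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

stat_mode(x)
```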
#install.packages("gapminder")
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.1.3
str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)
# Compute groupwise mean and median lifeExp
gap2007 %>%
group_by(continent) %>%
summarize(mean(lifeExp),
median(lifeExp))
## # A tibble: 5 x 3
## continent `mean(lifeExp)` `median(lifeExp)`
## <fct> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
By grouping the data by continent, we can see that the Oceania continent has the highest mean and median of life expectancy, while Africa has the lowest mean and median life expectancy with the data from gapminder.
# Generate boxplots of lifeExp for each continent
gap2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
The boxplot above shows summary information for the data. Each
continent has a different median, upper quartile, and lower quartile.
• We want to know ‘How much is the data spread out from the middle?’ • Just looking at the data gives us a sense of this, but we want to break it down to one number so we can compare sample distributions
x
## [1] 77 79 76 77 79 75 77 77 77 78 77
• We could just take the difference between each point and the mean and add them up - But that would equal 0 (up to floating-point error). That’s the idea of the mean.
# Look at the difference between each point and the mean
sum(x - mean(x))
## [1] -0.0000000000000568
• So we can square the difference - But this number will keep getting bigger as you add more observations - We want something that is stable
# Square each difference to get rid of negatives then sum
sum((x - mean(x))^2)
## [1] 13.6
Variance • so we divide by n - 1 • This is called the sample variance. One of the most useful measures of a sample distribution
sum((x - mean(x))^2)/(length(x)-1)
## [1] 1.36
var(x)
## [1] 1.36
We can calculate the sample variance directly with the var() function. It saves time and reduces the chance of typing errors.
Standard Deviation • Another very useful metric is the sample standard deviation • This is just the square root of the variance • The nice thing about the std dev is that it is in the same units as the original data • In this case it’s 1.17 years
sqrt(sum((x - mean(x))^2)/(length(x)-1))
## [1] 1.17
sd(x)
## [1] 1.17
Using the sd() function is faster and easier than typing out the formula above, and it avoids the typing errors I ran into when writing the standard deviation formula by hand.
Inter Quartile Range • The IQR is the width of the middle 50% of the data • The nice thing about this one is that it is not sensitive to extreme values • All of the other measures listed here are sensitive to extreme values
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75.0 77.0 77.0 77.2 77.5 79.0
IQR(x)
## [1] 0.5
The IQR (Inter Quartile Range) is calculated by subtracting the 1st Quartile (Lower Quartile) from the 3rd Quartile (Upper Quartile): here 77.5 - 77.0 = 0.5.
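The claim that the IQR is robust while the other measures are not can be checked directly. The sketch below takes the same 11 values and introduces a hypothetical data-entry error (one 79 recorded as 790), then compares the measures of spread before and after.

```r
# The 11 rounded life expectancies used above
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Hypothetical data-entry error: the second value (79) recorded as 790
x_extreme <- replace(x, 2, 790)

c(sd = sd(x), IQR = IQR(x), range = diff(range(x)))
c(sd = sd(x_extreme), IQR = IQR(x_extreme), range = diff(range(x_extreme)))
```

The standard deviation and the range change drastically because of the single extreme value, while the IQR stays the same because the middle 50% of the data did not move.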
Range • max and min are also interesting • as is the range, or the difference between max and min
max(x)
## [1] 79
min(x)
## [1] 75
diff(range(x))
## [1] 4
str(gap2007)
## tibble [142 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
## $ year : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ lifeExp : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
## $ pop : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
## $ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...
# Compute groupwise measures of spread
gap2007 %>%
group_by(continent) %>%
summarize(sd(lifeExp),
IQR(lifeExp),
n())
## # A tibble: 5 x 4
## continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
## <fct> <dbl> <dbl> <int>
## 1 Africa 9.63 11.6 52
## 2 Americas 4.44 4.63 25
## 3 Asia 7.96 10.2 33
## 4 Europe 2.98 4.78 30
## 5 Oceania 0.729 0.516 2
The table above shows the standard deviation, IQR, and number of observations of lifeExp, grouped by continent (Africa, Americas, Asia, Europe, Oceania).
# Generate overlaid density plots
gap2007 %>%
ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.3)
We can see that lifeExp in Oceania is the highest among the 5
continents.
# Preview the 2007 data before computing stats for lifeExp in Americas
head(gap2007)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
We can see that this dataset has 6 attributes: country, continent, year, lifeExp, pop, and gdpPercap.
gap2007 %>%
filter(continent == "Americas") %>%
summarize(mean(lifeExp),
sd(lifeExp))
## # A tibble: 1 x 2
## `mean(lifeExp)` `sd(lifeExp)`
## <dbl> <dbl>
## 1 73.6 4.44
#Compute stats for population
gap2007 %>%
summarize(median(pop),
IQR(pop))
## # A tibble: 1 x 2
## `median(pop)` `IQR(pop)`
## <dbl> <dbl>
## 1 10517531 26702008.
4 characteristics of a distribution that are of interest:
• center - already covered • spread or variability - already covered • shape - modality: number of prominent humps (uni, bi, multi, or uniform - no humps) - skew (right, left, or symmetric) - Can transform to fix skew • outliers
– Describe the shape
A: unimodal, left-skewed B: unimodal, symmetric C: unimodal, right-skewed D: bimodal, symmetric
# Create density plot of old variable
gap2007 %>%
ggplot(aes(x = pop)) +
geom_density()
# Transform the skewed pop variable
gap2007 <- gap2007 %>%
mutate(log_pop = log(pop))
# Create density plot of new variable
gap2007 %>%
ggplot(aes(x = log_pop)) +
geom_density()
We transform the variable because the density plot of the original pop
variable is highly right-skewed, which makes the visualization very
difficult to read. Taking the log pulls in the long right tail.
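One quick numeric sign of right skew is a mean far above the median. The sketch below uses made-up population values (not the gapminder data) to show how the log transformation brings the two back together; raw_ratio and log_gap are names invented for this example.

```r
# Toy populations (hypothetical values, for illustration only)
pop <- c(1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e9)
log_pop <- log(pop)

# In a right-skewed variable the mean sits far above the median
raw_ratio <- mean(pop) / median(pop)

# After the log transform, mean and median are close together
log_gap <- abs(mean(log_pop) - median(log_pop))

raw_ratio
log_gap
```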
# Filter for Asia, add column indicating outliers
str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
gap_asia <- gap2007 %>%
filter(continent == "Asia") %>%
mutate(is_outlier = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
gap_asia %>%
filter(!is_outlier) %>%
ggplot(aes(x = 1, y = lifeExp)) +
geom_boxplot()
After removing the outliers, the boxplot above no longer flags any
outliers.
# ggplot2, dplyr, and openintro are loaded
# Compute summary statistics
email %>%
group_by(spam) %>%
summarize(
median(num_char),
IQR(num_char))
## # A tibble: 2 x 3
## spam `median(num_char)` `IQR(num_char)`
## <fct> <dbl> <dbl>
## 1 0 6.83 13.6
## 2 1 1.05 2.82
The median and IQR of num_char are clearly higher for not-spam emails than for spam emails.
str(email)
## tibble [3,921 x 21] (S3: tbl_df/tbl/data.frame)
## $ spam : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ to_multiple : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
## $ from : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ cc : int [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ sent_email : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
## $ time : POSIXct[1:3921], format: "2012-01-01 13:16:41" "2012-01-01 14:03:59" ...
## $ image : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ attach : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ dollar : num [1:3921] 0 0 4 0 0 0 0 0 0 0 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num [1:3921] 0 0 1 0 0 0 0 0 0 0 ...
## $ viagra : num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num [1:3921] 0 0 0 0 2 2 0 0 0 0 ...
## $ num_char : num [1:3921] 11.37 10.5 7.77 13.26 1.23 ...
## $ line_breaks : int [1:3921] 202 202 192 255 29 25 193 237 69 68 ...
## $ format : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 1 2 ...
## $ re_subj : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ exclaim_subj: num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
## $ urgent_subj : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ exclaim_mess: num [1:3921] 0 1 6 48 1 1 1 18 1 0 ...
## $ number : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...
There are 3921 observations with 21 attributes in the email dataset.
table(email$spam)
##
## 0 1
## 3554 367
3554 of the emails are not spam, while 367 are spam.
email <- email %>%
mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")))
# Create plot
email %>%
mutate(log_num_char = log(num_char)) %>%
ggplot(aes(x = spam, y = log_num_char)) +
geom_boxplot()
The boxplot above helps us see the summary of the distribution
better.
• The median length of not-spam emails is greater than that of spam emails
# Compute center and spread for exclaim_mess by spam
email %>%
group_by(spam) %>%
summarize(
median(exclaim_mess),
IQR(exclaim_mess))
## # A tibble: 2 x 3
## spam `median(exclaim_mess)` `IQR(exclaim_mess)`
## <fct> <dbl> <dbl>
## 1 not-spam 1 5
## 2 spam 0 1
table(email$exclaim_mess)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1435 733 507 128 190 113 115 51 93 45 85 17 56 20 43 11
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 29 12 26 5 29 9 15 3 11 6 11 1 6 8 13 12
## 32 33 34 35 36 38 39 40 41 42 43 44 45 46 47 48
## 13 3 3 2 3 3 1 2 1 1 3 3 5 3 2 1
## 49 52 54 55 57 58 62 71 75 78 89 94 96 139 148 157
## 3 1 1 4 2 2 2 1 1 1 1 1 1 1 1 1
## 187 454 915 939 947 1197 1203 1209 1236
## 1 1 1 1 1 1 2 1 1
# Create plot for spam and exclaim_mess
# Add 0.01 before taking the log so that zero counts stay finite
email %>%
mutate(log_exclaim_mess = log(exclaim_mess + 0.01)) %>%
ggplot(aes(x = log_exclaim_mess)) +
geom_histogram() +
facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
• The most common value of exclaim_mess in both classes of email is zero (a log(exclaim_mess) of -4.6 after adding .01). • Even after a transformation, the distribution of exclaim_mess in both classes of email is right-skewed. • The typical number of exclamations in the not-spam group appears to be slightly higher than in the spam group.
• Zero inflation in the exclaim_mess variable - you can analyze the two parts separately - or turn it into a categorical variable of is-zero, not-zero • Could make a barchart - need to decide if you are more interested in counts or proportions
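Both ways of handling the zeros can be sketched on a small made-up vector (not the email data). The shift of 0.01 follows the interpretation above; the names log_shifted and zero are hypothetical.

```r
# Toy exclamation counts (hypothetical values, for illustration only)
exclaim_mess <- c(0, 0, 1, 0, 4, 12, 0, 2)

# Option 1: shift before the log so that zeros stay finite
log_shifted <- log(exclaim_mess + 0.01)
round(log_shifted[1], 1)   # a zero maps to log(0.01)

# Option 2: collapse to a two-level categorical variable
zero <- factor(ifelse(exclaim_mess == 0, "zero", "not-zero"))
table(zero)
```

Without the shift, log(0) is -Inf and those rows are dropped from the plot, which hides exactly the spike at zero that makes the variable zero-inflated.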
table(email$image)
##
## 0 1 2 3 4 5 9 20
## 3811 76 17 11 2 2 1 1
# Create plot of proportion of spam by image
email %>%
mutate(has_image = image > 0) %>%
ggplot(aes(x = has_image, fill = spam)) +
geom_bar(position = "fill")
### – Image and spam interpretation • An email without an image is more
likely to be not-spam than spam. Emails with and without an image are
both mostly not-spam.
# Test if images count as attachments
sum(email$image > email$attach)
## [1] 0
There are no emails with more images than attachments, so we can conclude that images are counted as attachments.
email %>%
filter(spam == "not-spam") %>%
group_by(to_multiple) %>%
summarize(median(num_char))
## # A tibble: 2 x 2
## to_multiple `median(num_char)`
## <fct> <dbl>
## 1 0 7.20
## 2 1 5.36
Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people? Yes: the median num_char is 5.36 for emails sent to multiple people versus 7.20 for the rest.
For emails containing the word “dollar”, does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email?
table(email$dollar)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 3175 120 151 10 146 20 44 12 35 10 22 10 20 7 14 5
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32
## 23 2 14 1 10 7 12 7 7 3 7 1 5 1 1 2
## 34 36 40 44 46 48 54 63 64
## 1 2 3 3 2 1 1 1 3
email %>%
filter(dollar > 0) %>%
group_by(spam) %>%
summarize(median(dollar))
## # A tibble: 2 x 2
## spam `median(dollar)`
## <fct> <dbl>
## 1 not-spam 4
## 2 spam 2
No, for emails containing the word “dollar”, the typical spam email doesn’t contain a greater number of occurrences of the word than the typical non-spam email.
###Question 2 If you encounter an email with greater than 10 occurrences of the word “dollar”, is it more likely to be spam or not-spam?
email %>%
filter(dollar > 10) %>%
ggplot(aes(x = spam)) +
geom_bar()
In this dataset, if I encounter an email with more than 10
occurrences of the word “dollar”, it is more likely to be a not-spam email.
levels(email$number)
## [1] "none" "small" "big"
There are 3 levels of number in the email dataset: “none”, “small”, and “big”.
table(email$number)
##
## none small big
## 549 2827 545
Most of the emails have a “small” number.
# Reorder levels
email$number <- factor(email$number, levels = c("none","small","big"))
# Construct plot of number
ggplot(email, aes(x = number)) +
geom_bar() +
facet_wrap( ~ spam)
### – What’s in a number interpretation • Given that an email contains a
small number, it is more likely to be not-spam. • Given that an email
contains a big number, it is more likely to be not-spam. • Within both
spam and not-spam, the most common number is a small one.