Disclaimer: The content of this RMarkdown note came from a course called Exploratory Data Analysis in datacamp.
When your dataset is represented as a table or a database, it’s difficult to observe much about it beyond its size and the types of variables it contains. In this course, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. Which variables suggest interesting relationships? Which observations are unusual? By the end of the course, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful.
In this chapter, you will learn how to create graphical and numerical summaries of two categorical variables.
# Install gapminder R package. Once it's installed, you won't have to run this code again
#install.packages("dplyr")
#install.packages("ggplot2")
#Install.packages("openintro")
# Load dplyr package
library(dplyr) #for use of dplyr functions such as glimpse(), mutate(), and filter()
library(ggplot2) #for use of ggplot2 functions such ggplot()
library(gapminder) #for use of gapminder data (ch3.2)
library(openintro) #for use of email data (ch4)
# Import data
comics <- read.csv("data/Exploratory data analysis/comics.csv") #first_appear var isn't exactly same
str(comics)
## 'data.frame': 23272 obs. of 11 variables:
## $ name : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
## $ id : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
## $ align : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
## $ eye : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
## $ hair : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ gsm : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
## $ alive : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ appearances : int 4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
## $ first_appear: Factor w/ 2328 levels "1-Apr","1-Aug",..: 1772 2074 2255 2089 2185 2192 2192 2139 2292 2192 ...
## $ publisher : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...
# Convert data to tbl_df
comics <- tbl_df(comics)
str(comics)
## Classes 'tbl_df', 'tbl' and 'data.frame': 23272 obs. of 11 variables:
## $ name : Factor w/ 23272 levels "'Spinner (Earth-616)",..: 19833 3335 22769 9647 20956 2220 17576 9346 18794 10957 ...
## $ id : Factor w/ 4 levels "No Dual","Public",..: 3 2 2 2 1 2 2 2 2 2 ...
## $ align : Factor w/ 4 levels "Bad","Good","Neutral",..: 2 2 3 2 2 2 2 2 3 2 ...
## $ eye : Factor w/ 26 levels "Amber Eyes","Auburn Hair",..: 11 5 5 5 5 5 6 6 6 5 ...
## $ hair : Factor w/ 28 levels "Auburn Hair",..: 7 27 3 3 4 14 7 7 7 4 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ gsm : Factor w/ 6 levels "Bisexual Characters",..: NA NA NA NA NA NA NA NA NA NA ...
## $ alive : Factor w/ 2 levels "Deceased Characters",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ appearances : int 4043 3360 3061 2961 2258 2255 2072 2017 1955 1934 ...
## $ first_appear: Factor w/ 2328 levels "1-Apr","1-Aug",..: 1772 2074 2255 2089 2185 2192 2192 2139 2292 2192 ...
## $ publisher : Factor w/ 2 levels "dc","marvel": 2 2 2 2 2 2 2 2 2 2 ...
In this chapter you’ll continue working with the comics dataset introduced in the video. This is a collection of characteristics on all of the superheroes created by Marvel and DC comics in the last 80 years.
# Print the first rows of the data
comics
## # A tibble: 23,272 x 11
## name id align eye
## <fctr> <fctr> <fctr> <fctr>
## 1 Spider-Man (Peter Parker) Secret Good Hazel Eyes
## 2 Captain America (Steven Rogers) Public Good Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral Blue Eyes
## 4 Iron Man (Anthony \\"Tony\\" Stark) Public Good Blue Eyes
## 5 Thor (Thor Odinson) No Dual Good Blue Eyes
## 6 Benjamin Grimm (Earth-616) Public Good Blue Eyes
## 7 Reed Richards (Earth-616) Public Good Brown Eyes
## 8 Hulk (Robert Bruce Banner) Public Good Brown Eyes
## 9 Scott Summers (Earth-616) Public Neutral Brown Eyes
## 10 Jonathan Storm (Earth-616) Public Good Blue Eyes
## # ... with 23,262 more rows, and 7 more variables: hair <fctr>,
## # gender <fctr>, gsm <fctr>, alive <fctr>, appearances <int>,
## # first_appear <fctr>, publisher <fctr>
# Check levels of align
levels(comics$align)
## [1] "Bad" "Good" "Neutral"
## [4] "Reformed Criminals"
# Check the levels of gender
levels(comics$gender)
## [1] "Female" "Male" "Other"
# Create a 2-way contingency table
table(comics$align, comics$gender)
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
To simplify the analysis, drop levels that have very low counts.
# Create a 2-way contingency table
tab <- table(comics$align, comics$gender)
# Print tab
tab
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
# Remove align level
comics <- comics %>%
filter(align != "Reformed Criminals") %>%
droplevels()
Construct two side-by-side barcharts of the comics data, showing how there are two options for presenting the same data.
# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "dodge") #position = "dodge", to have a side-by-side (i.e. not stacked) barchart.
# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))
Interpretation
Approximately what proportion of all female characters are good? 51%. Look at how align was distributed within each gender
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab) # Joint proportions
##
## Female Male Other
## Bad 0.082210 0.395160 0.001672
## Good 0.130135 0.251333 0.000888
## Neutral 0.043692 0.094021 0.000888
prop.table(tab, 2) # Conditional on columns
##
## Female Male Other
## Bad 0.321 0.534 0.485
## Good 0.508 0.339 0.258
## Neutral 0.171 0.127 0.258
Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, you’ll construct two barcharts in this exercise: one of counts and one of proportions.
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "fill") #position = "fill", to have a stacked barchart
Improve the interpretability of the plot by implementing some sensible ordering. Superheroes that are “Neutral” show an alignment between “Good” and “Bad”, so it makes sense to put that bar in the middle.
# Change the order of the levels in align
comics$align <- factor(comics$align,
levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) +
geom_bar()
Make the distribution of alignment based on gender by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender.
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
The piechart is a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than barcharts. Improve the piechart representation of these data by constructing a barchart.
# Import data
pies <- data.frame(read.csv("data/Exploratory data analysis/pie.csv"))
pies_tb <- table(pies$flavor)
pies_tb
##
## apple blueberry boston creme cherry key lime
## 17 14 15 13 16
## pumpkin strawberry
## 12 11
# Create a pie chart
pie(pies_tb)
# Put levels of flavor in decending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)
# Create barchart of flavor
ggplot(pies, aes(x = flavor)) +
geom_bar(fill = "chartreuse") +
theme(axis.text.x = element_text(angle = 90))
In this chapter, you will learn how to graphically summarize numerical data.
The cars dataset records characteristics on all of the new models of cars for sale in the US in a certain year. Investigate the distribution of mileage across a categorial variable.
# Import data
cars <- read.csv("data/Exploratory data analysis/cars.csv", stringsAsFactors = FALSE)
# Learn data structure
str(cars)
## 'data.frame': 428 obs. of 19 variables:
## $ name : chr "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
## $ sports_car : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ suv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wagon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ minivan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pickup : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ all_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rear_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
## $ length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ width : int 66 66 69 68 69 67 67 67 67 67 ...
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
geom_histogram() +
facet_grid(~ suv)
Explore the relationship between mpg and the engine size (as measured by number of cylinders).
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))
# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
Interpretation
Get a sense of the marginal distribution of horsepower
and then compare it to the distribution of horsepower
conditional on the cars being affordable. How much horsepower do you sacrifice by restricting your budget?
# Create hist of horsepwr
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
ggtitle("Distribution of horsepower")
# Create hist of horsepwr for affordable cars
cars %>%
filter(msrp < 25000) %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
xlim(c(90, 550)) +
ggtitle("Distribution of horsepower for cars less than $25,000")
Interpretation
Alter the binwidths to see how things change.
# Create hist of horsepwr with binwidth of 3
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 3) +
ggtitle("The distribution of horsepower with a binwidth of 3")
# Create hist of horsepwr with binwidth of 30
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 30) +
ggtitle("The distribution of horsepower with a binwidth of 30")
# Create hist of horsepwr with binwidth of 60
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 60) +
ggtitle("The distribution of horsepower with a binwidth of 60")
Interpretation
Box plots are good:
Bos plots are not good:
Apply a box plot to the msrp column (manufacturers suggested retail price) to detect if there are unusually expensive or cheap cars, outliers.
# Construct box plot of msrp
cars %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
# Exclude outliers from data
cars_no_out <- cars %>%
filter(msrp < 100000)
# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
Plot to display the important features of the distributions of city_mpg and width. Which is most appropriate? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.
# Create plot of city_mpg
cars %>%
ggplot(aes(x = 1, y = city_mpg)) +
geom_boxplot()
cars %>%
ggplot(aes(city_mpg)) +
geom_density()
# Create plot of width
cars %>%
ggplot(aes(x = 1, y = width)) +
geom_boxplot()
cars %>%
ggplot(aes(width)) +
geom_density()
Interpretation
Faceting is a valuable technique for looking at several conditional distributions at the same time.
# Facet hists using hwy mileage and ncyl
common_cyl %>%
ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv, labeller = label_both) +
ggtitle("Distribution of Highway MPG by ncyl and SUV")
Interpretation
Now that we’ve looked at exploring categorical and numerical data, you’ll learn some useful statistics for describing distributions of data. There are four characteristics of the distribution that we are interested in:
Graphical tools for numerical data
The choice of measure for center can have a dramatic impact on what we consider to be a typical observation, so it is important that you consider the shape of the distribution before deciding on the measure.
Use data from gapminder, which tracks demographic data on countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.
Focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level.
# Load data
data(gapminder)
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)
# Compute groupwise mean and median lifeExp
gap2007 %>%
group_by(continent) %>%
summarize(mean(lifeExp),
median(lifeExp))
## # A tibble: 5 x 3
## continent `mean(lifeExp)` `median(lifeExp)`
## <fctr> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
The choice of measure for spread can dramatically impact how variable we consider our data to be, so it is important that you consider the shape of the distribution before deciding on the measure.
If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.
# Compute groupwise measures of spread
gap2007 %>%
group_by(continent) %>%
summarize(sd(lifeExp),
IQR(lifeExp),
n())
## # A tibble: 5 x 4
## continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
## <fctr> <dbl> <dbl> <int>
## 1 Africa 9.631 11.610 52
## 2 Americas 4.441 4.632 25
## 3 Asia 7.964 10.152 33
## 4 Europe 2.980 4.782 30
## 5 Oceania 0.729 0.516 2
# Generate overlaid density plots
gap2007 %>%
ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.3)
Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them.
# Create the charts given in datacamp
gap2007 %>%
filter(continent == "Americas") %>%
ggplot(aes(lifeExp)) + geom_density() + ggtitle("Life Expectancy in the Americas")
gap2007 %>% ggplot(aes(pop)) + geom_density()
# Compute stats for lifeExp in Americas
gap2007 %>%
filter(continent == "Americas") %>%
summarize(mean(lifeExp),
sd(lifeExp))
## # A tibble: 1 x 2
## `mean(lifeExp)` `sd(lifeExp)`
## <dbl> <dbl>
## 1 73.6 4.44
# Compute stats for population
gap2007 %>%
summarize(median(pop),
IQR(pop))
## # A tibble: 1 x 2
## `median(pop)` `IQR(pop)`
## <dbl> <dbl>
## 1 10517531 26702008
Interpretation
mean
and standard deviation
, median
and IQR
measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.There are two shap Modality
Skew
Transformation
Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure. For example, a log transformation may be considered when comparing two distributions that are both highly right-skewed.
Transform the population variable, which exhibits strong right skew, with the natural logarithm function.
# Create density plot of old variable
gap2007 %>%
ggplot(aes(x = pop)) +
geom_density()
# Transform the skewed pop variable
gap2007 <- gap2007 %>%
mutate(log_pop = log(pop))
# Create density plot of new variable
gap2007 %>%
ggplot(aes(x = log_pop)) +
geom_density()
Outliers are extreme values far from the bulk of the distribution.
Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be?
# Create a given boxplot in datacamp
gap2007 %>%
filter(continent == "Asia") %>%
ggplot(aes(x = 1, y = lifeExp)) + geom_boxplot()
# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
filter(continent == "Asia") %>%
mutate(is_outlier = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
gap_asia %>%
filter(! is_outlier) %>%
ggplot(aes(x = 1, y = lifeExp)) +
geom_boxplot()
Apply what you’ve learned to explore and summarize a real world dataset in this case study of email spam. What characteristics of emails are associated with spam?
data(email)
email$spam <- as.factor(email$spam)
levels(email$spam) <- c("not-spam", "spam")
str(email)
## 'data.frame': 3921 obs. of 21 variables:
## $ spam : Factor w/ 2 levels "not-spam","spam": 1 1 1 1 1 1 1 1 1 1 ...
## $ to_multiple : num 0 0 0 0 0 0 1 1 0 0 ...
## $ from : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cc : int 0 0 0 0 0 0 0 1 0 0 ...
## $ sent_email : num 0 0 0 0 0 0 1 1 0 0 ...
## $ time : POSIXct, format: "2012-01-01 06:16:41" "2012-01-01 07:03:59" ...
## $ image : num 0 0 0 0 0 0 0 1 0 0 ...
## $ attach : num 0 0 0 0 0 0 0 1 0 0 ...
## $ dollar : num 0 0 4 0 0 0 0 0 0 0 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num 0 0 1 0 0 0 0 0 0 0 ...
## $ viagra : num 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num 0 0 0 0 2 2 0 0 0 0 ...
## $ num_char : num 11.37 10.5 7.77 13.26 1.23 ...
## $ line_breaks : int 202 202 192 255 29 25 193 237 69 68 ...
## $ format : num 1 1 1 1 0 0 1 1 0 1 ...
## $ re_subj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_subj: num 0 0 0 0 0 0 0 0 0 0 ...
## $ urgent_subj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_mess: num 0 1 6 48 1 1 1 18 1 0 ...
## $ number : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...
# graphic tools for numerical data
# histograms
email %>%
ggplot(aes(x = num_char)) +
geom_histogram()
email %>%
ggplot(aes(x = num_char)) +
geom_histogram() +
facet_grid(~ spam)
# boxplots
email %>%
ggplot(aes(x = spam, y = num_char)) +
geom_boxplot()
# density plots
email %>%
ggplot(aes(x = num_char)) +
geom_density()
email %>%
ggplot(aes(x = num_char)) +
geom_density() +
facet_grid(~ spam)
email %>%
ggplot(aes(x = num_char, fill = spam)) + #must be a factor var for fill
geom_density(alpha = 0.3)
Is there an association between spam and the length of an email?
# Compute summary statistics
email %>%
group_by(spam) %>%
summarize(median(num_char), IQR(num_char))
## # A tibble: 2 x 3
## spam `median(num_char)` `IQR(num_char)`
## <fctr> <dbl> <dbl>
## 1 not-spam 6.83 13.58
## 2 spam 1.05 2.82
# Create plot
email %>%
mutate(log_num_char = log(num_char)) %>%
ggplot(aes(x = spam, y = log_num_char)) +
geom_boxplot()
Interpretation
Exclaim_mess
contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.
# Compute center and spread for exclaim_mess by spam
email %>% group_by(spam) %>% summarize(median(exclaim_mess), IQR(exclaim_mess))
## # A tibble: 2 x 3
## spam `median(exclaim_mess)` `IQR(exclaim_mess)`
## <fctr> <dbl> <dbl>
## 1 not-spam 1 5
## 2 spam 0 1
# Create plot for spam and exclaim_mess
# Assign an arbitrary number of -4.6 to log(0) just to compare 0 and non-0 values
email$log_exclaim_mess <- ifelse(email$exclaim_mess > 0, log(email$exclaim_mess), -4.6)
ggplot(email, aes(log_exclaim_mess)) + geom_histogram() + facet_grid(~ spam)
ggplot(email, aes(x = spam, y = log_exclaim_mess)) + geom_boxplot()
ggplot(email, aes(x = log_exclaim_mess, fill = spam)) + geom_density()
Interpreation
It is difficult to work with variables of heavy skew. image
is such a variable the majority of whose values take zero and only a few positive values. Let’s collapse image into a categorical variable that indicates whether or not the email had at least one image.
# Create plot of proportion of spam by image
email %>%
mutate(has_image = image > 0) %>%
ggplot(aes(x = has_image, fill = spam)) +
geom_bar(position = "fill")
Interpretation
In the process of exploring a dataset, you’ll sometimes come across something that will lead you to question how the data were compiled. For example, consider the variables image
and attach
. Do attached images count as attached files in this dataset?
Design a simple test to determine if images count as attached files.
# Test if images count as attachments
sum(email$image > email$attach) #False is 0 while True is 1.
## [1] 0
Interpretation
When you have a specific question about a dataset, you can find your way to an answer by carefully constructing the appropriate chain of R code.
# Question1: For emails containing the word "dollar", does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email? Create a summary statistic that answers this question.
email %>%
filter(dollar > 0) %>%
group_by(spam) %>%
summarize(median(dollar))
## # A tibble: 2 x 2
## spam `median(dollar)`
## <fctr> <dbl>
## 1 not-spam 4
## 2 spam 2
# Question2: If you encounter an email with greater than 10 occurrences of the word "dollar", is it more likely to be spam or not-spam? Create a barchart that answers this question.
email %>%
filter(dollar > 10) %>%
ggplot(aes(x = spam)) +
geom_bar()
Explore the association between number
and spam
. Nunber
is a factor variable saying whether there was no number, a small number (under 1 million), or a big number.
# Reorder levels
email$number <- factor(email$number, levels = c("none", "small", "big"))
# Construct plot of number
ggplot(email, aes(x = number)) +
geom_bar() +
facet_grid(~ spam)
Interpretation
How can we revitalize a region’s economy? Assume that you live in a rural community and are interested in helping your community develop its economy. So you want to identify a rural community in the U.S. that enjoys a high living standards and investigate what makes them different from the rest of the rural America. Use the countyComplete
data set in the openintro R package. Click the link to open the manual and find the information about the data set.
# Load data
data(countyComplete) # It comes from the openintro package
# Create a new variable, rural
countyComplete$rural <- ifelse(countyComplete$density < 1000, "rural", "urban")
countyComplete$rural <- factor(countyComplete$rural)
# Examine the data set
glimpse(countyComplete)
## Observations: 3,143
## Variables: 54
## $ state <fctr> Alabama, Alabama, A...
## $ name <fctr> Autauga County, Bal...
## $ FIPS <dbl> 1001, 1003, 1005, 10...
## $ pop2010 <dbl> 54571, 182265, 27457...
## $ pop2000 <dbl> 43671, 140415, 29038...
## $ age_under_5 <dbl> 6.6, 6.1, 6.2, 6.0, ...
## $ age_under_18 <dbl> 26.8, 23.0, 21.9, 22...
## $ age_over_65 <dbl> 12.0, 16.8, 14.2, 12...
## $ female <dbl> 51.3, 51.1, 46.9, 46...
## $ white <dbl> 78.5, 85.7, 48.0, 75...
## $ black <dbl> 17.7, 9.4, 46.9, 22....
## $ native <dbl> 0.4, 0.7, 0.4, 0.3, ...
## $ asian <dbl> 0.9, 0.7, 0.4, 0.1, ...
## $ pac_isl <dbl> NA, NA, NA, NA, NA, ...
## $ two_plus_races <dbl> 1.6, 1.5, 0.9, 0.9, ...
## $ hispanic <dbl> 2.4, 4.4, 5.1, 1.8, ...
## $ white_not_hispanic <dbl> 77.2, 83.5, 46.8, 75...
## $ no_move_in_one_plus_year <dbl> 86.3, 83.0, 83.0, 90...
## $ foreign_born <dbl> 2.0, 3.6, 2.8, 0.7, ...
## $ foreign_spoken_at_home <dbl> 3.7, 5.5, 4.7, 1.5, ...
## $ hs_grad <dbl> 85.3, 87.6, 71.9, 74...
## $ bachelors <dbl> 21.7, 26.8, 13.5, 10...
## $ veterans <dbl> 5817, 20396, 2327, 1...
## $ mean_work_travel <dbl> 25.1, 25.8, 23.8, 28...
## $ housing_units <dbl> 22135, 104061, 11829...
## $ home_ownership <dbl> 77.5, 76.7, 68.0, 82...
## $ housing_multi_unit <dbl> 7.2, 22.6, 11.1, 6.6...
## $ median_val_owner_occupied <dbl> 133900, 177200, 8820...
## $ households <dbl> 19718, 69476, 9795, ...
## $ persons_per_household <dbl> 2.70, 2.50, 2.52, 3....
## $ per_capita_income <dbl> 24568, 26469, 15875,...
## $ median_household_income <dbl> 53255, 50147, 33219,...
## $ poverty <dbl> 10.6, 12.2, 25.0, 12...
## $ private_nonfarm_establishments <dbl> 877, 4812, 522, 318,...
## $ private_nonfarm_employment <dbl> 10628, 52233, 7990, ...
## $ percent_change_private_nonfarm_employment <dbl> 16.6, 17.4, -27.0, -...
## $ nonemployment_establishments <dbl> 2971, 14175, 1527, 1...
## $ firms <dbl> 4067, 19035, 1667, 1...
## $ black_owned_firms <dbl> 15.2, 2.7, NA, 14.9,...
## $ native_owned_firms <dbl> NA, 0.4, NA, NA, NA,...
## $ asian_owned_firms <dbl> 1.3, 1.0, NA, NA, NA...
## $ pac_isl_owned_firms <dbl> NA, NA, NA, NA, NA, ...
## $ hispanic_owned_firms <dbl> 0.7, 1.3, NA, NA, NA...
## $ women_owned_firms <dbl> 31.7, 27.3, 27.0, NA...
## $ manufacturer_shipments_2007 <dbl> NA, 1410273, NA, 0, ...
## $ mercent_whole_sales_2007 <dbl> NA, NA, NA, NA, NA, ...
## $ sales <dbl> 598175, 2966489, 188...
## $ sales_per_capita <dbl> 12003, 17166, 6334, ...
## $ accommodation_food_service <dbl> 88157, 436955, NA, 1...
## $ building_permits <dbl> 191, 696, 10, 8, 18,...
## $ fed_spending <dbl> 331142, 1119082, 240...
## $ area <dbl> 594, 1590, 885, 623,...
## $ density <dbl> 91.8, 114.6, 31.0, 3...
## $ rural <fctr> rural, rural, rural...
Examine the living standard of the US counties, using per_capita_income
as a measure, and how it differs by whether the county is rural or urban. To that end, do the following and interpret the results:
Use the box plot you created above and identify an outlier, a rural county that has a per_capita_income higher than $60,000 a year when a typical rural county’s per capita income is less than $23,000 a year. What makes this rural county successful? Compare it with a typical rural county in the country (compare the county’s data with means of that of the rest of the rural counties)
# Compute groupwise mean and median per_capita_income by rural
countyComplete %>%
group_by(rural) %>%
summarize(mean(per_capita_income),
median(per_capita_income))
## # A tibble: 2 x 3
## rural `mean(per_capita_income)` `median(per_capita_income)`
## <fctr> <dbl> <dbl>
## 1 rural 22117 21598
## 2 urban 30576 28522
# Compute groupwise measures of spread for per_capita_income by rural
countyComplete %>%
group_by(rural) %>%
summarize(sd(per_capita_income),
IQR(per_capita_income),
n())
## # A tibble: 2 x 4
## rural `sd(per_capita_income)` `IQR(per_capita_income)` `n()`
## <fctr> <dbl> <dbl> <int>
## 1 rural 4884 5498 2999
## 2 urban 8598 9594 144
# Create faceted histogram for per_capita_income by rural
ggplot(countyComplete, aes(x = per_capita_income)) +
geom_histogram() +
facet_grid(~ rural)
# Create overlaid density plots for per_capita_income by rural
ggplot(countyComplete, aes(x = per_capita_income, fill = rural)) +
geom_density(alpha = .3)
# Create box plots of per_capita_income by rural
ggplot(countyComplete, aes(x = rural, y = per_capita_income)) +
geom_boxplot()
# Identify the outlier
outlier <- countyComplete %>%
filter(rural == "rural", per_capita_income > 60000)
outlier <- data.frame(t(outlier))
names(outlier) <- "outlier"
str(outlier)
## 'data.frame': 54 obs. of 1 variable:
## $ outlier: Factor w/ 49 levels "0","0.3","0.5",..: 47 48 40 16 10 25 14 6 29 44 ...
## ..- attr(*, "names")= chr "state" "name" "FIPS" "pop2010" ...
#
typicalRural <- countyComplete %>%
filter(rural == "rural")
typicalRural <- data.frame(sapply(typicalRural, mean, na.rm = TRUE))
names(typicalRural) <- "typicalRural"
str(typicalRural)
## 'data.frame': 54 obs. of 1 variable:
## $ typicalRural: num NA NA 30234 63931 57041 ...
cbind(typicalRural, outlier)
## typicalRural outlier
## state NA Colorado
## name NA Pitkin County
## FIPS 30233.737 8097
## pop2010 63930.906 17148
## pop2000 57040.766 14872
## age_under_5 6.251 4.4
## age_under_18 23.441 17.5
## age_over_65 16.052 11.5
## female 49.953 47
## white 83.770 93.5
## black 8.394 0.5
## native 2.101 0.3
## asian 0.921 1.2
## pac_isl 0.150 <NA>
## two_plus_races 1.921 1.4
## hispanic 8.015 9.1
## white_not_hispanic 79.290 87.9
## no_move_in_one_plus_year 85.930 86.8
## foreign_born 3.866 12.6
## foreign_spoken_at_home 8.480 16.2
## hs_grad 82.961 96.1
## bachelors 18.311 59.7
## veterans 5307.390 1152
## mean_work_travel 22.581 19
## housing_units 27929.949 12953
## home_ownership 73.838 62.5
## housing_multi_unit 11.259 40
## median_val_owner_occupied 125304.602 670200
## households 23789.878 7417
## persons_per_household 2.511 2.21
## per_capita_income 22117.129 64381
## median_household_income 43597.515 64502
## poverty 15.589 8.4
## private_nonfarm_establishments 1446.135 1620
## private_nonfarm_employment 20109.331 16278
## percent_change_private_nonfarm_employment 0.814 5.1
## nonemployment_establishments 4067.842 3286
## firms 5711.401 5100
## black_owned_firms 9.451 <NA>
## native_owned_firms 4.493 <NA>
## asian_owned_firms 2.624 <NA>
## pac_isl_owned_firms 1.221 <NA>
## hispanic_owned_firms 6.692 2.9
## women_owned_firms 25.748 26.5
## manufacturer_shipments_2007 1064843.116 0
## mercent_whole_sales_2007 722079.695 64678
## sales 785766.919 399540
## sales_per_capita 10107.925 26461
## accommodation_food_service 123717.562 270848
## building_permits 144.784 44
## fed_spending 561603.057 45978
## area 1158.678 971
## density 95.589 17.7
## rural NA rural