library(openintro)
library(ggplot2)
library(dplyr)
# Load data
data(countyComplete) # It comes from the openintro package
# Create a new variable, rural
countyComplete$rural <- ifelse(countyComplete$density < 500, "rural", "urban")
countyComplete$rural <- factor(countyComplete$rural)
Create a scatterplot of per_capita_income and bachelors in the data set (1.1 Scatterplots).
# Load packages
library(ggplot2)
ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors)) + geom_point()
Interpretation: positive linear relationship between per capita income and bachelors
Create a boxplot of per_capita_income and bachelors (1.2 Boxplots as discretized/conditioned scatterplots).
# Boxplot of weight vs. weeks
ggplot(data = countyComplete,
aes(x = cut(per_capita_income, breaks = 5), y = bachelors)) +
geom_boxplot()
Interpretation: those with bachelors degrees are above the average per capita income
Add the rural variable to the scatterplot of per_capita_income and bachelors to see whether there is any difference in their relationship between rural and urban (1.3 Creating scatterplots).
# Load the package
library(openintro)
# Body dimensions scatterplot
ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors, color = factor(rural))) +
geom_point()
Interpretation: between urban and rural, people with a bachelors earn more in an urban setting
Compute correlation coefficient between income_per_capita and bachelors (2.1 Computing correlation). Interpret it. Keep in mind that correlation coefficients do not show causation but only association.
The cor(x, y) function will compute the Pearson product-moment correlation between variables, x and y.
# Load the package
library(dplyr)
# Compute correlation
countyComplete %>%
summarize(N = n(), r = cor(per_capita_income, bachelors))
## N r
## 1 3143 0.7924464
# Compute correlation for all non-missing pairs
countyComplete %>%
summarize(N = n(), r = cor(per_capita_income, bachelors, use = "pairwise.complete.obs"))
## N r
## 1 3143 0.7924464