Identify which variables are strongly associated with the standard of living. To that end, do the following:

Chapter 1: Visualizing two variables

library(openintro)
library(ggplot2)
library(dplyr)

# Load data
data(countyComplete) # It comes from the openintro package

# Create a new variable, rural
countyComplete$rural <- ifelse(countyComplete$density < 500, "rural", "urban")
countyComplete$rural <- factor(countyComplete$rural)

1.1 Scatterplots

Create a scatterplot of per_capita_income (x-axis) and bachelors (y-axis) from the data set.

# Scatterplot
ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors)) + geom_point()

Interpretation

  • The two variables are positively associated: counties where a higher percentage of people hold a bachelor's degree tend to have a higher per capita income. The relationship appears roughly linear.
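
One way to check the linearity noted above is to overlay a straight-line fit and a flexible smoother on the scatterplot; if the two curves stay close, a linear description is reasonable. The sketch below uses simulated data (the toy data frame and its coefficients are hypothetical, not taken from countyComplete) so that it runs on its own.

```r
library(ggplot2)

# Simulated stand-in for the county data (values are made up)
set.seed(1)
toy <- data.frame(per_capita_income = runif(100, 15000, 45000))
toy$bachelors <- 5 + 0.0006 * toy$per_capita_income + rnorm(100, sd = 4)

# Blue: least-squares line. Red: loess smoother, free to bend.
p <- ggplot(data = toy, aes(x = per_capita_income, y = bachelors)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "red")
print(p)
```

Swapping toy for countyComplete in the ggplot() call would apply the same check to the real data.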

1.2 Boxplots as discretized/conditioned scatterplots

Create boxplots of bachelors across five equal-width bins of per_capita_income.

# Boxplot 
ggplot(data = countyComplete, 
       aes(x = cut(per_capita_income, breaks = 5), y = bachelors)) + 
  geom_boxplot()

Interpretation

  • The relationship still appears linear. As per capita income increases, the boxplots show fewer outliers, so the association looks tighter in the higher income bins. The spread in the percentage of people with a bachelor's degree is greater in the lower income bins.
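
When reading these boxplots, it may help to check how many observations fall in each bin. cut(x, breaks = 5) splits the range of x into five equal-width intervals, so a right-skewed variable leaves few observations in the upper bins, which by itself can reduce the number of flagged outliers there. A minimal sketch with made-up numbers (the income vector is hypothetical, not taken from countyComplete):

```r
# Hypothetical, right-skewed incomes in thousands (made up for illustration)
income <- c(15, 16, 17, 18, 18, 19, 20, 21, 22, 24, 26, 30, 38, 65)

# Same call pattern as in the boxplot: five equal-width bins
bins <- cut(income, breaks = 5)

# Number of observations per bin; most values land in the lowest bin
counts <- table(bins)
print(counts)
```

The same check on the real data would be table(cut(countyComplete$per_capita_income, breaks = 5)).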

1.3 Creating scatterplots

Add the rural variable to the scatterplot of per_capita_income and bachelors to see whether their relationship differs between rural and urban counties.

# Scatterplot colored by rural/urban (rural is already a factor)
ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors, color = rural)) +
  geom_point()

Interpretation

  • The rural counties generally fall in the lower-to-middle range of both per capita income and the percentage of people with a bachelor's degree. The urban counties overlap with the rural ones but tend to have a higher percentage of people with a bachelor's degree.
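
A visual comparison like this can be backed up with a per-group summary. The sketch below computes group sizes, medians, and the within-group correlation with dplyr; the toy data frame and its values are hypothetical stand-ins, not county data.

```r
library(dplyr)

# Hypothetical toy data standing in for countyComplete (values are made up)
toy <- data.frame(
  rural             = factor(c("rural", "rural", "rural", "urban", "urban", "urban")),
  per_capita_income = c(18000, 21000, 24000, 26000, 30000, 37000),
  bachelors         = c(12, 15, 17, 22, 30, 41)
)

# One row per group: size, medians, and within-group correlation
by_rural <- toy %>%
  group_by(rural) %>%
  summarize(
    N = n(),
    median_income = median(per_capita_income),
    median_bachelors = median(bachelors),
    r = cor(per_capita_income, bachelors)
  )
print(by_rural)
```

Replacing toy with countyComplete (after the rural variable has been created) would give the same summary for the real counties.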

2.1 Computing correlation

Compute the correlation coefficient between per_capita_income and bachelors, and interpret it. Keep in mind that a correlation coefficient shows association, not causation.

# Compute correlation
countyComplete %>%
  summarize(N = n(), r = cor(per_capita_income, bachelors))
##      N         r
## 1 3143 0.7924464

Interpretation

  • The correlation between per_capita_income and bachelors is about 0.79, which indicates a strong positive linear association.
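
As a rough check on what cor() reports, the sketch below recomputes Pearson's r from the covariance and standard deviations, and shows that r squared matches the R-squared of a simple linear regression. The x and y vectors are made-up values, not county data.

```r
# Hypothetical values (made up for illustration)
x <- c(18, 21, 24, 26, 30, 37)   # income, in thousands
y <- c(12, 15, 17, 22, 30, 41)   # percent with a bachelor's degree

# Pearson's r: built-in versus the defining formula
r_builtin <- cor(x, y)
r_manual  <- cov(x, y) / (sd(x) * sd(y))
print(all.equal(r_builtin, r_manual))

# r squared equals the R-squared of the regression of y on x
r_squared <- r_builtin^2
fit_r2    <- summary(lm(y ~ x))$r.squared
print(all.equal(r_squared, fit_r2))
```

For the r of about 0.79 reported above, r squared is about 0.63, so a straight-line fit would account for roughly 63% of the variation in bachelors.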