In this problem set you will analyze data using some visualization and regression tools. The goal is to use these tools to understand the data, so the interpretation is important. This problem set is due on February 10.
Suppose you are interested in understanding campus crime. In this
question we will use a dataset called campus in the package
wooldridge. These data were collected by Daniel Martin, a
former MSU undergraduate, for a final project. They come from the FBI
Uniform Crime Reports and are for the year 1992. The variables are the
following:
The correlation of 0.7233104 is relatively strong and positive. Campuses that employ more police officers tend to report more crimes. This is an association; it does not by itself show that police cause crime.
cor(campus$police, campus$crime)
## [1] 0.7233104
This scatterplot is consistent with part a: the points trend upward from left to right, so campuses with more police officers tend to have more crimes. The points also spread out as the number of officers increases.
plot(campus$crime ~ campus$police, xlab = "Number of Police Officers", ylab = "Number of Crimes")
c. (5 points) Now run a simple linear regression of crime (y) on police
(x). Present the results using the function
summary() and
interpret the result. Explain clearly.
Based on the results, there is a positive relationship between crime and police: the estimated slope of 21.323 means that each additional police officer is associated with about 21.3 more crimes on campus. The relationship is statistically significant (t = 10.21, p < 2e-16).
model <- lm(crime ~ police, data = campus)
summary(model)
##
## Call:
## lm(formula = crime ~ police, data = campus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1024.79 -152.96 -35.38 89.48 1540.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42.557 53.730 -0.792 0.43
## police 21.323 2.089 10.210 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 319.9 on 95 degrees of freedom
## Multiple R-squared: 0.5232, Adjusted R-squared: 0.5182
## F-statistic: 104.2 on 1 and 95 DF, p-value: < 2.2e-16
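As a quick consistency check on the output above: in a simple regression with a single regressor, the R-squared equals the squared sample correlation, and the F-statistic equals the squared t-statistic on the slope. Using the numbers reported in parts a and c:

\[
R^2 = r^2 = (0.7233104)^2 \approx 0.5232, \qquad F = t^2 = (10.210)^2 \approx 104.2
\]

Both agree with the summary() output, so the correlation in part a and the regression in part c describe the same linear association.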
geom_smooth. What do you see?
How good is the fit? Use the R-squared found in c. to complement your
answer. Explain.
From the scatterplot with the regression line, we can see that the regression line generally follows the pattern of the data points, but there is considerable spread around the line. The R-squared of 0.5232 from part c says that police explains about 52% of the variation in crime, so the fit is moderate.
library(ggplot2)
ggplot(data = campus, aes(x = police, y = crime)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("Number of Police Officers") +
ylab("Number of Crimes")
## `geom_smooth()` using formula = 'y ~ x'
e. (8 points) Now use log(crime) and log(police) instead and repeat the
following steps:
library(ggplot2)
ggplot(data = campus, aes(x = log(police), y = log(crime))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("log(Number of Police Officers)") +
ylab("log(Number of Crimes)")
## `geom_smooth()` using formula = 'y ~ x'
How does this change the regression results, the interpretation of the regression results, the figure, and fit? What does the log do? Explain.
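Taking logs compresses large values, which makes the distributions of crime and police less skewed and reduces the influence of outliers. In the log-log regression the slope is interpreted as an elasticity: the percentage change in crime associated with a 1% increase in police.
Not causal, because the conditions for a causal interpretation are not met.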
campus$lcrime <- log(campus$crime)
campus$lenroll <- log(campus$enroll)
model <- lm(lcrime ~ lenroll, data = campus)
summary(model)
##
## Call:
## lm(formula = lcrime ~ lenroll, data = campus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5136 -0.3858 0.1174 0.4363 2.5782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.6314 1.0335 -6.416 5.44e-09 ***
## lenroll 1.2698 0.1098 11.567 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8946 on 95 degrees of freedom
## Multiple R-squared: 0.5848, Adjusted R-squared: 0.5804
## F-statistic: 133.8 on 1 and 95 DF, p-value: < 2.2e-16
library(ggplot2)
ggplot(data = campus, aes(x = lenroll, y = lcrime)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "log(enrollment)", y = "log(crime)")
## `geom_smooth()` using formula = 'y ~ x'
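A worked reading of the slope in the lcrime on lenroll regression above, offered as a sketch: in a log-log model the slope is an elasticity, so the estimate can be turned into a predicted percentage change. The 10% enrollment increase used here is an illustrative value, not something taken from the problem set.

\[
\log(\text{crime}) = \beta_0 + \beta_1 \log(\text{enroll}) + u, \qquad \hat{\beta}_1 = 1.2698
\]

\[
\frac{\widehat{\text{crime}}_{\text{new}}}{\widehat{\text{crime}}_{\text{old}}} = (1.10)^{1.2698} = e^{1.2698 \times \ln(1.10)} \approx 1.129
\]

A campus with 10% more enrollment is therefore predicted to have about 12.9% more crime. The quick approximation \(1.2698 \times 10\% \approx 12.7\%\) is close because the change is small.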
For this question we will use the dataset
Earnings_and_Height.csv taken from the Stock and Watson
textbook. You can find a detailed description of the dataset here.
df <- read.csv("Earnings_and_Height.csv")
head(df)
## sex age mrd educ cworker region race earnings height weight occupation
## 1 0 48 1 13 1 3 1 84054.75 65 133 1
## 2 0 41 6 12 1 2 1 14021.39 65 155 1
## 3 0 26 1 16 1 1 1 84054.75 60 108 1
## 4 0 37 1 16 1 2 1 84054.75 67 150 1
## 5 0 35 6 16 1 1 1 28560.39 68 180 1
## 6 0 25 6 15 1 4 1 23362.87 63 101 1
colnames(df)
## [1] "sex" "age" "mrd" "educ" "cworker"
## [6] "region" "race" "earnings" "height" "weight"
## [11] "occupation"
The median height in the sample is 67 inches.
median(df$height)
## [1] 67
Let’s compare means for tall and not so tall people.
c1. (3 points) Estimate average earnings for workers whose height is at most 67 inches.
Average earnings for workers who are at most 67 inches tall are 44488.44 dollars.
mean(df$earnings[df$height <= 67])
## [1] 44488.44
c2. (3 points) Estimate average earnings for workers whose height is greater than 67 inches.
Average earnings for workers who are taller than 67 inches are 49987.88 dollars.
mean(df$earnings[df$height > 67])
## [1] 49987.88
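c3. (5 points) On average, do taller workers earn more than shorter workers? How much more? Test the hypothesis that the two averages are the same, i.e. that the mean earnings are the same for tall and short people.
Taller workers earn more on average: 49987.88 - 44488.44 = 5499.44 dollars more. The Welch two-sample t-test gives t = -13.59 with p-value < 2.2e-16, and the 95 percent confidence interval for the difference (short minus tall) is -6292.643 to -4706.237, which excludes zero. We reject the hypothesis that mean earnings are the same for the two groups.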
earnings_and_height <- read.csv("Earnings_and_Height.csv")
short <- earnings_and_height[earnings_and_height$height <= 67, ]
tall <- earnings_and_height[earnings_and_height$height > 67, ]
mean_short <- mean(short$earnings)
mean_tall <- mean(tall$earnings)
t.test(short$earnings, tall$earnings, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: short$earnings and tall$earnings
## t = -13.59, df = 16624, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6292.643 -4706.237
## sample estimates:
## mean of x mean of y
## 44488.44 49987.88
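To make explicit what the Welch test above computes, here is a minimal sketch that rebuilds the t statistic, the degrees of freedom, and the two-sided p-value by hand and compares them with t.test(). The data are simulated stand-ins (the names short_sim and tall_sim and all the numbers are assumptions for illustration, not the Earnings_and_Height sample), so the values will not match the output above.

```r
# Simulated stand-in data (assumption: NOT the Earnings_and_Height sample)
set.seed(1)
short_sim <- rnorm(200, mean = 44000, sd = 26000)
tall_sim  <- rnorm(150, mean = 50000, sd = 27000)

# Each group's contribution to the variance of the difference in means
v_short <- var(short_sim) / length(short_sim)
v_tall  <- var(tall_sim) / length(tall_sim)

# Welch t statistic: difference in means divided by its standard error
t_manual <- (mean(short_sim) - mean(tall_sim)) / sqrt(v_short + v_tall)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v_short + v_tall)^2 /
  (v_short^2 / (length(short_sim) - 1) + v_tall^2 / (length(tall_sim) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with the built-in test (Welch is the default, var.equal = FALSE)
welch <- t.test(short_sim, tall_sim)
c(manual = t_manual, t.test = unname(welch$statistic))
```

The sign of the statistic depends only on the order of the two groups: with the shorter group listed first, a negative t means the taller group has the higher mean.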
library(ggplot2)
ggplot(earnings_and_height, aes(x = height, y = earnings)) +
geom_point() +
xlab("Height (inches)") +
ylab("Annual earnings")
e. (3 points) Run a regression of earnings on height.
earnings_and_height_lm <- lm(earnings ~ height, data = earnings_and_height)
summary(earnings_and_height_lm)
##
## Call:
## lm(formula = earnings ~ height, data = earnings_and_height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47836 -21879 -7976 34323 50599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -512.73 3386.86 -0.151 0.88
## height 707.67 50.49 14.016 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared: 0.01088, Adjusted R-squared: 0.01082
## F-statistic: 196.5 on 1 and 17868 DF, p-value: < 2.2e-16
e1. (2 points) What is the estimated slope? Interpret it.
The estimated slope is 707.6716. It represents the change in predicted earnings (in dollars) associated with a one-inch increase in height: each additional inch of height is associated with about 707.67 dollars more in annual earnings.
reg_earnings_height <- lm(earnings ~ height, data = earnings_and_height)
slope_estimate <- coef(reg_earnings_height)[2]
coef(reg_earnings_height)[2]
## height
## 707.6716
e2. (3 points) Use the estimated regression to predict earnings for a worker who is 67 inches tall, for a worker who is 70 inches tall, and for a worker who is 65 inches tall.
predict(reg_earnings_height, data.frame(height=c(67,70,65)))
## 1 2 3
## 46901.26 49024.28 45485.92
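The predictions above can be reproduced by hand from the estimated coefficients, which is a useful check on what predict() is doing. Using the intercept and slope reported in the regression output:

\[
\widehat{\text{earnings}} = -512.7336 + 707.6716 \times \text{height}
\]

\[
\begin{aligned}
\text{height} = 67: &\quad -512.7336 + 707.6716 \times 67 \approx 46901.26 \\
\text{height} = 70: &\quad -512.7336 + 707.6716 \times 70 \approx 49024.28 \\
\text{height} = 65: &\quad -512.7336 + 707.6716 \times 65 \approx 45485.92
\end{aligned}
\]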
height_cm with height
in cms, and answer the following questions about the earnings on height
(in cm) regression.
earnings_and_height$height_cm <- earnings_and_height$height * 2.54
reg_earnings_height_cm <- lm(earnings ~ height_cm, data = earnings_and_height)
predict(reg_earnings_height_cm, data.frame(height_cm = c(165, 170, 178)))
## 1 2 3
## 45458.06 46851.11 49080.00
f1. (3 points) What is the estimated slope of the regression? Interpret it.
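The estimated slope is 278.6108: each additional centimeter of height is associated with an increase in predicted earnings of about 278.61 dollars. This is the slope from the regression in inches divided by 2.54 (707.6716 / 2.54 ≈ 278.61), so only the units have changed.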
reg_earnings_height_cm <- lm(earnings ~ height_cm, data = earnings_and_height)
coef(reg_earnings_height_cm)
## (Intercept) height_cm
## -512.7336 278.6108
f2. (3 points) What is the estimated intercept? Interpret it.
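The estimated intercept is -512.7336: for a worker whose height is zero centimeters, predicted earnings would be -512.7336 dollars. This has no practical meaning, because no worker has a height near zero and the prediction extrapolates far outside the data. The intercept is the same as in the regression in inches.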
reg_earnings_height_cm <- lm(earnings ~ height_cm, data = earnings_and_height)
coef(reg_earnings_height_cm)
## (Intercept) height_cm
## -512.7336 278.6108
f3. (3 points) What is the \(R^2\)? What does it say about the fit? Explain.
The \(R^2\) is 0.0108753. This is small: height explains only about 1.1% of the variance in earnings, so the fit is poor even though the slope is statistically significant.
summary(reg_earnings_height_cm)$r.squared
## [1] 0.0108753
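The answers to f1 through f3 follow a general pattern: rescaling the regressor by a constant rescales the slope by the inverse of that constant and leaves the intercept and the R-squared unchanged. The sketch below checks this on simulated data; the variable names and the data-generating numbers are assumptions chosen for illustration and are not the Earnings_and_Height sample.

```r
# Simulated stand-in data (assumption: NOT the Earnings_and_Height sample)
set.seed(2)
height_in    <- rnorm(500, mean = 67, sd = 4)
earnings_sim <- 700 * height_in + rnorm(500, sd = 20000)

# Same heights expressed in centimeters
height_cm <- height_in * 2.54

fit_in <- lm(earnings_sim ~ height_in)
fit_cm <- lm(earnings_sim ~ height_cm)

# Slope per cm equals slope per inch divided by 2.54
slope_in <- unname(coef(fit_in)["height_in"])
slope_cm <- unname(coef(fit_cm)["height_cm"])
c(slope_in / 2.54, slope_cm)

# Intercept and R-squared do not depend on the units of height
int_in <- unname(coef(fit_in)["(Intercept)"])
int_cm <- unname(coef(fit_cm)["(Intercept)"])
r2_in  <- summary(fit_in)$r.squared
r2_cm  <- summary(fit_cm)$r.squared
c(int_in, int_cm, r2_in, r2_cm)
```

The same logic explains why the predictions at 165, 170, and 178 cm differ slightly from those at 65, 67, and 70 inches: those centimeter values are rounded conversions (65, 67, and 70 inches are 165.1, 170.18, and 177.8 cm), not the same heights.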
female_workers <- earnings_and_height[earnings_and_height$sex == 0, ]
reg_earnings_height_female <- lm(earnings ~ height, data = female_workers)
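g1. (2 points) What is the estimated slope?
The estimated slope is 511.2222: among female workers, each additional inch of height is associated with about 511.22 dollars more in earnings.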
reg_earnings_height_female <- lm(earnings ~ height, data = earnings_and_height, subset = sex == 0)
coef(reg_earnings_height_female)["height"]
## height
## 511.2222
g2. (5 points) A randomly selected woman is 1 inch taller than the average woman in the sample. Would you predict her earnings to be higher or lower than the average earnings for women in the sample? By how much?
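If a randomly selected woman is 1 inch taller than the average woman in the sample, her predicted earnings are 511.2222 dollars higher than the average earnings for women in the sample. This follows because the fitted regression line passes through the point of sample means.
Yes, the slope for male workers is very different from the slope for female workers: for men, each additional inch of height is associated with 1306.86 dollars more in earnings, compared with 511.22 dollars for women.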
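The reasoning behind the g2 answer can be written out in one step. With an intercept in the model, the OLS line passes through the sample means \((\bar{x}, \bar{y})\), where here \(\bar{x}\) is the average height and \(\bar{y}\) the average earnings of women in the sample:

\[
\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}
\quad\Longrightarrow\quad
\hat{y}(\bar{x} + 1) - \bar{y} = \hat{\beta}_0 + \hat{\beta}_1 (\bar{x} + 1) - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = \hat{\beta}_1 = 511.2222
\]

So the prediction does not require knowing the average height or average earnings themselves; only the slope matters.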
earnings_and_height_male <- earnings_and_height[earnings_and_height$sex == 1, ]
reg_earnings_height_male <- lm(earnings ~ height, data = earnings_and_height_male)
coef(reg_earnings_height_male)["height"]
## height
## 1306.86
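The height and earnings regression cannot establish a causal relationship between height and earnings, because height is not randomly assigned and omitted factors correlated with height may also affect earnings.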