In this problem set you will analyze data using some visualization and regression tools. The goal is to use these tools to understand the data, so the interpretation is important. This problem set is due on February 10.

1. Campus Crime, Enrollment, Police

Suppose you are interested in understanding campus crime. In this question we will use a dataset called campus in the package wooldridge. These data were collected by Daniel Martin, a former MSU undergraduate, for a final project. They come from the FBI Uniform Crime Reports and are for the year 1992. The variables are the following:

  a. (5 points) What is the correlation between police and crime? Interpret this finding and explain clearly. Is this high or low?

The correlation is 0.7233104, which is a relatively strong positive correlation. Campuses that employ more police officers tend to report more crimes. A correlation only measures association, so this does not by itself show that police cause crime.

cor(campus$police, campus$crime)
## [1] 0.7233104
  b. (5 points) Use a scatterplot to visualize the relationship between crime and police. Is this consistent with a? Explain clearly. Remember the labels.

This scatterplot is consistent with a. The points trend upward from left to right, so campuses with more police officers tend to have more crimes, which matches the positive correlation.

plot(campus$crime ~ campus$police, xlab = "Number of Police Officers", ylab = "Number of Crimes")

c. (5 points) Now run a simple linear regression of crime (y) on police (x). Present the results using the function summary() and interpret the result. Explain clearly.

The estimated slope is 21.323: each additional police officer is associated with about 21 more crimes on campus. The relationship is positive and statistically significant (t = 10.21, p < 2e-16). The intercept (-42.557) is not significantly different from zero (p = 0.43).

model <- lm(crime ~ police, data = campus)
summary(model)
## 
## Call:
## lm(formula = crime ~ police, data = campus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1024.79  -152.96   -35.38    89.48  1540.16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -42.557     53.730  -0.792     0.43    
## police        21.323      2.089  10.210   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 319.9 on 95 degrees of freedom
## Multiple R-squared:  0.5232, Adjusted R-squared:  0.5182 
## F-statistic: 104.2 on 1 and 95 DF,  p-value: < 2.2e-16
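
As a quick cross-check on the fit, recall that in a regression with a single regressor the R-squared equals the squared correlation between x and y. The sketch below uses only numbers printed in the output above; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the output above
r_police_crime     <- 0.7233104  # cor(campus$police, campus$crime)
r_squared_reported <- 0.5232     # Multiple R-squared from summary(model)

# In a one-regressor OLS model, R-squared is the squared correlation
r_squared_implied <- r_police_crime^2
round(r_squared_implied, 4)
```

The two numbers agree up to rounding, so the correlation in a and the R-squared in c describe the same strength of linear association.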
  d. (5 points) Visualize the regression by adding a regression line to the scatterplot in b using geom_smooth. What do you see? How good is the fit? Use the R-squared found in c. to complement your answer. Explain.

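From the scatterplot with the regression line, we can see that the line generally follows the upward pattern of the data points, but the points are spread out around it. The R-squared of 0.5232 from c means that police explains about 52% of the variation in crime, so the fit is moderate: roughly half of the variation is left unexplained.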

library(ggplot2)
ggplot(data = campus, aes(x = police, y = crime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Number of Police Officers") +
  ylab("Number of Crimes")
## `geom_smooth()` using formula = 'y ~ x'

e. (8 points) Now use log(crime) and log(police) instead and repeat the following steps:

library(ggplot2)
ggplot(data = campus, aes(x = log(police), y = log(crime))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("log(Number of Police Officers)") +
  ylab("log(Number of Crimes)")
## `geom_smooth()` using formula = 'y ~ x'

How does this change the regression results, the interpretation of the regression results, the figure, and fit? What does the log do? Explain.

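Taking logs compresses large values, so the largest campuses have less influence on the fitted line and the variables are less skewed. It also changes the interpretation: in a log-log regression the slope is an elasticity, that is, the percentage change in crime associated with a 1% change in police, rather than the change in the number of crimes per additional officer.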
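
To illustrate how a slope is read after taking logs of both variables, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data are generated with a known elasticity of 1.1 and are not the campus data, and the object names are hypothetical.

```r
# Simulated data with a known elasticity (NOT the campus data)
set.seed(1)
n <- 200
police <- exp(rnorm(n, mean = 3, sd = 0.8))
crime  <- exp(0.5 + 1.1 * log(police) + rnorm(n, sd = 0.3))

# Log-log regression: the slope estimates the elasticity of crime
# with respect to police (percent change in y per 1% change in x)
fit_loglog <- lm(log(crime) ~ log(police))
elasticity <- unname(coef(fit_loglog)["log(police)"])
elasticity
```

The estimated slope should land close to the 1.1 used to generate the data. With the real data, the same call with data = campus would give the estimated elasticity of crime with respect to police.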

  f. (5 points) Are these results causal? I.e., do they capture the causal effect of police on crime? Why or why not? Explain clearly.

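No, these results are not causal. The regression only shows an association, and the conditions for a causal interpretation are not met because police is not randomly assigned across campuses. Larger campuses are likely to have both more officers and more crime (an omitted variable such as enrollment), and campuses with more crime may hire more officers (reverse causality). Either would make the error term correlated with police.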
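
One way to see why a positive slope need not be causal is a small simulation of omitted variable bias. The sketch below is purely illustrative: the data are simulated so that police has no effect on crime at all, and a third variable (a hypothetical stand-in for enrollment) drives both.

```r
# Simulated data: enrollment drives both police and crime;
# police has NO effect on crime by construction
set.seed(42)
n <- 500
enroll <- rnorm(n, mean = 10, sd = 1)
police <- 0.8 * enroll + rnorm(n, sd = 0.5)
crime  <- 1.2 * enroll + rnorm(n, sd = 0.5)

# Omitting enrollment produces a clearly positive slope on police
naive_slope <- unname(coef(lm(crime ~ police))["police"])

# Controlling for enrollment brings the slope back near zero
controlled_slope <- unname(coef(lm(crime ~ police + enroll))["police"])

c(naive = naive_slope, controlled = controlled_slope)
```

In this simulation the simple regression attributes the effect of the omitted variable to police. The campus regression could suffer from the same problem.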

  g. (8 points) Now run a regression of log(crime) on log(enrollment) and plot it together with a scatterplot, as before. How do we interpret these results? Why do you think you observe the relationship that the data shows? Is this relationship causal? Explain.
campus$lcrime <- log(campus$crime)
campus$lenroll <- log(campus$enroll)


model <- lm(lcrime ~ lenroll, data = campus)


summary(model)
## 
## Call:
## lm(formula = lcrime ~ lenroll, data = campus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5136 -0.3858  0.1174  0.4363  2.5782 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.6314     1.0335  -6.416 5.44e-09 ***
## lenroll       1.2698     0.1098  11.567  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8946 on 95 degrees of freedom
## Multiple R-squared:  0.5848, Adjusted R-squared:  0.5804 
## F-statistic: 133.8 on 1 and 95 DF,  p-value: < 2.2e-16
library(ggplot2)
ggplot(data = campus, aes(x = lenroll, y = lcrime)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "log(enrollment)", y = "log(crime)")
## `geom_smooth()` using formula = 'y ~ x'
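
The slope in this log-log regression can be read as an elasticity. The short calculation below uses only the coefficient printed in the summary() output above; the variable names are illustrative.

```r
# Slope on lenroll copied from the summary() output above
elasticity <- 1.2698

# Approximate reading: a 1% rise in enrollment is associated with
# roughly a 1.27% rise in crime
approx_pct_change_1 <- elasticity * 1

# Exact implied change in crime for a 10% rise in enrollment
exact_pct_change_10 <- (1.10^elasticity - 1) * 100
round(exact_pct_change_10, 1)
```

Because the elasticity is above 1, crime rises more than proportionally with enrollment in this sample.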

2. Earnings and Height

For this question we will use the dataset Earnings_and_Height.csv taken from the Stock and Watson textbook. You can find a detailed description of the dataset here.

  a. (3 points) Start by loading the data.
df <- read.csv("Earnings_and_Height.csv")


head(df)
##   sex age mrd educ cworker region race earnings height weight occupation
## 1   0  48   1   13       1      3    1 84054.75     65    133          1
## 2   0  41   6   12       1      2    1 14021.39     65    155          1
## 3   0  26   1   16       1      1    1 84054.75     60    108          1
## 4   0  37   1   16       1      2    1 84054.75     67    150          1
## 5   0  35   6   16       1      1    1 28560.39     68    180          1
## 6   0  25   6   15       1      4    1 23362.87     63    101          1
colnames(df)
##  [1] "sex"        "age"        "mrd"        "educ"       "cworker"   
##  [6] "region"     "race"       "earnings"   "height"     "weight"    
## [11] "occupation"
  b. (3 points) What is the median value of height in the sample?

The median height is 67 inches.

median(df$height)
## [1] 67
  c. Let’s compare means for tall and not so tall people.

    c1. (3 points) Estimate average earnings for workers whose height is at most 67 inches.

    Average earnings for workers at most 67 inches tall are 44488.44 dollars.

mean(df$earnings[df$height <= 67])
## [1] 44488.44

c2. (3 points) Estimate average earnings for workers whose height is greater than 67 inches.

Average earnings for workers taller than 67 inches are 49987.88 dollars.

mean(df$earnings[df$height > 67])
## [1] 49987.88

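c3. (5 points) On average, do taller workers earn more than shorter workers? How much more? Test the hypothesis that the two averages are the same, i.e. that the mean earnings are the same for tall and short people.

Yes. Taller workers earn about 5499.44 dollars more on average (49987.88 versus 44488.44). The Welch two-sample t-test below gives t = -13.59 with a p-value below 2.2e-16, so we reject the hypothesis that mean earnings are the same for the two groups.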

earnings_and_height <- read.csv("Earnings_and_Height.csv")

short <- earnings_and_height[earnings_and_height$height <= 67, ]
tall <- earnings_and_height[earnings_and_height$height > 67, ]

mean_short <- mean(short$earnings)
mean_tall <- mean(tall$earnings)

t.test(short$earnings, tall$earnings, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  short$earnings and tall$earnings
## t = -13.59, df = 16624, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6292.643 -4706.237
## sample estimates:
## mean of x mean of y 
##  44488.44  49987.88
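
The size of the earnings gap can be read directly from the t-test output. The sketch below only re-uses the numbers printed above; the variable names are illustrative.

```r
# Group means and confidence interval copied from the t.test() output above
mean_short <- 44488.44
mean_tall  <- 49987.88
ci_lower   <- -6292.643
ci_upper   <- -4706.237

# Earnings gap (tall minus short)
gap <- mean_tall - mean_short

# t.test(x, y) reports mean(x) - mean(y), so the interval is centered
# on minus the gap
ci_midpoint <- (ci_lower + ci_upper) / 2

c(gap = gap, ci_midpoint = ci_midpoint)
```

The interval is negative throughout because the test was called with the short group first. It does not contain zero, which matches the very small p-value.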
  d. (4 points) Construct a scatterplot of annual earnings on height. Notice that the points on the plot fall along horizontal lines (There are only 23 distinct values of earnings). Why? (Hint: Carefully read the detailed data description.)
library(ggplot2)
ggplot(earnings_and_height, aes(x = height, y = earnings)) +
  geom_point() +
  xlab("Height (inches)") +
  ylab("Annual earnings")

e. (3 points) Run a regression of earnings on height.

earnings_and_height_lm <- lm(earnings ~ height, data = earnings_and_height)
summary(earnings_and_height_lm)
## 
## Call:
## lm(formula = earnings ~ height, data = earnings_and_height)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47836 -21879  -7976  34323  50599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -512.73    3386.86  -0.151     0.88    
## height        707.67      50.49  14.016   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared:  0.01088,    Adjusted R-squared:  0.01082 
## F-statistic: 196.5 on 1 and 17868 DF,  p-value: < 2.2e-16

e1. (2 points) What is the estimated slope? Interpret it.

The estimated slope is 707.6716. It represents the change in predicted earnings (in dollars) associated with a one-inch increase in height, so each additional inch is associated with about 708 dollars more in annual earnings.

reg_earnings_height <- lm(earnings ~ height, data = earnings_and_height)

slope_estimate <- coef(reg_earnings_height)[2]

coef(reg_earnings_height)[2]
##   height 
## 707.6716

e2. (3 points) Use the estimated regression to predict earnings for a worker who is 67 inches tall, for a worker who is 70 inches tall, and for a worker who is 65 inches tall.

predict(reg_earnings_height, data.frame(height=c(67,70,65)))
##        1        2        3 
## 46901.26 49024.28 45485.92
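
The predictions above can be reproduced by hand from the estimated coefficients, which makes explicit what predict() is doing. This sketch uses the intercept and slope printed elsewhere in this document (rounded to four decimals); the variable names are illustrative.

```r
# Coefficients copied from the regression output (rounded to 4 decimals)
intercept <- -512.7336
slope     <- 707.6716

# Predicted earnings = intercept + slope * height (in inches)
heights     <- c(67, 70, 65)
predictions <- intercept + slope * heights
round(predictions, 2)
```

Up to rounding of the coefficients, these match the three values returned by predict().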
  f. (3 points) Suppose height were measured in centimeters instead of inches. Create a new variable called height_cm with height in centimeters, and answer the following questions about the regression of earnings on height (in cm).
earnings_and_height$height_cm <- earnings_and_height$height * 2.54

reg_earnings_height_cm <- lm(earnings ~ height_cm, data = earnings_and_height)

predict(reg_earnings_height_cm, data.frame(height_cm = c(165, 170, 178)))
##        1        2        3 
## 45458.06 46851.11 49080.00

f1. (3 points) What is the estimated slope of the regression? Interpret it.

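The estimated slope is 278.6108: each additional centimeter of height is associated with about 279 dollars more in annual earnings. This is the slope from the inches regression divided by 2.54 (707.6716 / 2.54), since one inch is 2.54 cm.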

reg_earnings_height_cm <- lm(earnings ~ height_cm, data = earnings_and_height)
coef(reg_earnings_height_cm)
## (Intercept)   height_cm 
##   -512.7336    278.6108
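
The link between the two slopes is just a change of units, which can be checked with the coefficients printed in this document. The variable names below are illustrative.

```r
# Slopes copied from the two regression outputs
slope_per_inch <- 707.6716
slope_per_cm   <- 278.6108

# One inch is 2.54 cm, so the per-cm slope is the per-inch slope / 2.54
implied_slope_per_cm <- slope_per_inch / 2.54
round(implied_slope_per_cm, 4)
```

The intercept is unaffected by the rescaling, because height equal to zero is the same point in either unit.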

f2. (3 points) What is the estimated intercept? Interpret it.

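The estimated intercept is -512.7336: the predicted earnings for a worker whose height is zero centimeters would be -512.7336 dollars. This has no practical meaning, since no worker has a height near zero; the intercept only anchors the regression line. It is the same as in the inches regression, because changing the units of height does not move the point where height equals zero.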

reg_earnings_height_cm <- lm(earnings ~ height_cm, data = earnings_and_height)
coef(reg_earnings_height_cm)
## (Intercept)   height_cm 
##   -512.7336    278.6108

f3. (3 points) What is the \(R^2\)? What does it say about the fit? Explain.

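The \(R^2\) is 0.0108753, so height explains only about 1% of the variance in earnings. The fit is poor: almost all of the variation in earnings is due to factors other than height. Converting height to centimeters does not change the \(R^2\) (the inches regression reports 0.01088).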

summary(reg_earnings_height_cm)$r.squared
## [1] 0.0108753
  g. (3 points) Run a regression of earnings on height, using data for female workers only.
female_workers <- earnings_and_height[earnings_and_height$sex == 0, ]
reg_earnings_height_female <- lm(earnings ~ height, data = female_workers)

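g1. (2 points) What is the estimated slope?

The estimated slope is 511.2222: among female workers, each additional inch of height is associated with about 511 dollars more in earnings.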

reg_earnings_height_female <- lm(earnings ~ height, data = earnings_and_height, subset = sex == 0)
coef(reg_earnings_height_female)["height"]
##   height 
## 511.2222

g2. (5 points) A randomly selected woman is 1 inch taller than the average woman in the sample. Would you predict her earnings to be higher or lower than the average earnings for women in the sample? By how much?

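Higher. The OLS regression line passes through the sample means of height and earnings, so a woman of average height has predicted earnings equal to the average earnings of women in the sample. A woman who is 1 inch taller than average therefore has predicted earnings that are 511.2222 dollars higher than that average.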
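
The reasoning for g2 relies on the fact that an OLS line with an intercept passes through the point of sample means. The sketch below checks this on simulated data; the data and object names are hypothetical and are not the earnings dataset.

```r
# Simulated heights and earnings (NOT the Earnings_and_Height data)
set.seed(7)
n <- 300
height   <- rnorm(n, mean = 65, sd = 3)
earnings <- 10000 + 500 * height + rnorm(n, sd = 20000)

fit   <- lm(earnings ~ height)
slope <- unname(coef(fit)["height"])

# Prediction at the average height equals average earnings
pred_at_mean <- unname(predict(fit, data.frame(height = mean(height))))

# One inch above the average, the prediction rises by exactly the slope
pred_one_inch_up <- unname(predict(fit, data.frame(height = mean(height) + 1)))

c(mean_earnings = mean(earnings), pred_at_mean = pred_at_mean,
  difference = pred_one_inch_up - pred_at_mean, slope = slope)
```

The same logic applies to the regression for female workers: the predicted gap relative to average earnings is the estimated slope.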

  h. (5 points) Repeat g. for male workers. Is it different from female workers?

Yes. For men, each additional inch of height is associated with 1306.86 dollars more in earnings, compared with 511.22 dollars for women, so the slope for men is more than twice as large.

earnings_and_height_male <- earnings_and_height[earnings_and_height$sex == 1, ]


reg_earnings_height_male <- lm(earnings ~ height, data = earnings_and_height_male)


coef(reg_earnings_height_male)["height"]
##  height 
## 1306.86
  i. (5 points) Do you think that height is uncorrelated with other factors that cause earnings? That is, do you think that the regression error term, \(u_i\), has a conditional mean of \(0\) given \(Height (X_i)\)?

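Probably not. Height is likely correlated with other determinants of earnings that are left in the error term, such as sex, education, and occupation. The separate regressions in g and h, where the slope is 511.22 for women and 1306.86 for men, already show that sex matters for the height-earnings relationship. If such factors are correlated with height, then \(E(u_i \mid X_i) \neq 0\), and the regression of earnings on height cannot establish a causal relationship between height and earnings.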