1a. How many women are there in this data set? How many of them are
in the labor force?
sum(mroz$inlf==1)
## [1] 428
There are 753 women in this dataset and 428 are in the labor
force.
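The code shown above only counts labor-force participants; the total number of women is the number of rows. A minimal sketch of both counts, using a hypothetical five-row stand-in called `toy` (an assumption for illustration, not the real mroz data):

```r
# Hypothetical stand-in for mroz: one row per woman, inlf = 1 if in the labor force
toy <- data.frame(inlf = c(1, 0, 1, 1, 0))

n_women <- nrow(toy)            # total number of observations (women)
n_inlf  <- sum(toy$inlf == 1)   # number of women in the labor force
share   <- mean(toy$inlf == 1)  # fraction of women in the labor force

c(n_women = n_women, n_inlf = n_inlf, share = share)
```

On the real data the same calls would be nrow(mroz) and sum(mroz$inlf == 1).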
1b. What is the distribution of wages for those women in the labor
force? Calculate summary statistics and produce a histogram. Does the
distribution of wages look normal?
summary(mroz$wage) # computed over all 753 women, not only those in the labor force
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.625 2.375 3.788 25.000
sd(mroz$wage) # 3.241829
## [1] 3.241829
ggplot(mroz, aes(x = wage)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "black")
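### No, this distribution does not look normal. It is skewed to the
right, with a large spike at 0: at least a quarter of the observations
are 0 (the first quartile is 0), which most likely corresponds to the
325 women not in the labor force. Because the code above uses all 753
women, answering the question as posed would require restricting the
sample to women with inlf == 1.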
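Since the question asks about women in the labor force only, one way to restrict the sample before summarizing is sketched below. The `toy` data frame and its values are hypothetical; it assumes, as the output above suggests, that wage is recorded as 0 for non-participants.

```r
# Hypothetical stand-in for mroz: wage is 0 for women not in the labor force
toy <- data.frame(
  inlf = c(1, 1, 1, 0, 0, 1),
  wage = c(3.5, 2.0, 6.25, 0, 0, 4.0)
)

# Keep only labor-force participants before describing wages
workers <- subset(toy, inlf == 1)

summary(workers$wage)  # quartiles and mean for participants only
sd(workers$wage)       # sample standard deviation for participants only
```

The same filtered data frame could then be passed to ggplot in place of mroz to draw the histogram for participants only.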
1d. We are interested in understanding the relationship between the
wage variable and the age variable.
1di.) Use the summary function to understand the age variable.
summary(mroz$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 36.00 43.00 42.54 49.00 60.00
1dii.) Calculate the covariance between wage and age. Interpret the
covariance.
cov(mroz$age, mroz$wage) #-0.9044096
## [1] -0.9044096
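The covariance is negative, which means that age and wage tend to move
in opposite directions: women who are older than average tend to have
below-average wages in this sample. The magnitude (-0.90) is hard to
interpret on its own because covariance depends on the units of both
variables. Looking at the data again, younger women appear to earn
slightly more than older women (which is not something I would have
expected).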
1diii.) Calculate the correlation between wage and age. Interpret
the correlation coefficient.
cor(mroz$age, mroz$wage) #-0.03455915
## [1] -0.03455915
The correlation coefficient is negative but very close to 0, so I
don’t think age and wage are linearly related; if there is any
relationship, it is a very weak negative one.
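The covariance and the correlation are linked: the correlation is the covariance divided by the product of the two standard deviations, which is why it is unit-free and bounded between -1 and 1. A minimal sketch on simulated data (the variables `x` and `y` and their parameters are assumptions for illustration, not the mroz data):

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
x <- rnorm(200, mean = 42, sd = 8)   # hypothetical "age"-like variable
y <- 3 - 0.01 * x + rnorm(200)       # hypothetical "wage"-like variable

r_manual  <- cov(x, y) / (sd(x) * sd(y))  # covariance rescaled by both standard deviations
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)

# Rescaling x (for example, years to months) changes the covariance but not the correlation
cov_months <- cov(12 * x, y)
cor_months <- cor(12 * x, y)
```

This is why the correlation, rather than the covariance, is the quantity to read for the strength of a linear relationship.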
1div.) Use ggplot to produce a scatter plot, in which age is on the
horizontal axis, and wage is on the vertical axis. Can you infer any
relationship between the two variables from the scatter plot?
ggplot(data = mroz) +
  geom_point(aes(x = age, y = wage))
### Just based on the scatter plot, I do not see a relationship between
the two variables.
1dv.) Use ggplot to add a linear fit on top of the scatter plot. Does
this corroborate your instincts from the previous part?
ggplot(data = mroz, mapping = aes(x = age, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
### Yes, the fitted line is almost flat (its slope is close to zero),
which suggests there is little or no linear relationship between the
age and wage variables.
1dvi.) Use the lm function to run a linear regression in which wage
is the outcome and age is the independent variable. What is the
estimator for the intercept? What is the estimator for the slope? How
can we interpret the estimator for the slope?
lm(wage ~ age, data=mroz)
##
## Call:
## lm(formula = wage ~ age, data = mroz)
##
## Coefficients:
## (Intercept) age
## 2.96492 -0.01388
The estimated intercept is 2.96492 and the estimated slope is
-0.01388. The slope means that each additional year of age is
associated with a wage that is about $0.014 lower on average. The slope
is negative but very close to zero, which tells me there is not a
strong linear relationship between age and wage.
1e. We are interested in understanding the relationship between the
wage variable and the educ variable.
1Ei.) Use the summary function to understand the educ variable.
summary(mroz$educ)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 12.00 12.00 12.29 13.00 17.00
1Eii.) Calculate the covariance between wage and educ. Interpret the
covariance.
cov(mroz$educ, mroz$wage) #2.353504
## [1] 2.353504
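The covariance is positive, which means education and wage tend to
move together: women with more education tend to have higher wages. The
covariance alone does not show how strong the relationship is, because
its size depends on the units of both variables.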
1Eiii.) Calculate the correlation between wage and educ. Interpret
the correlation coefficient.
cor(mroz$educ, mroz$wage) #0.3183781
## [1] 0.3183781
The correlation coefficient (0.32) is positive but not very large,
which makes me believe there is a weak-to-moderate positive correlation
between education and wage.
1Eiv.) Use ggplot to produce a scatter plot, in which educ is on
the horizontal axis, and wage is on the vertical axis. Can you infer any
relationship between the two variables from the scatter plot?
ggplot(data = mroz) +
  geom_point(aes(x = educ, y = wage))
1Ev.) Use ggplot to add a linear fit on top of the scatter plot. Does
this corroborate your instincts from the previous part?
ggplot(data = mroz, mapping = aes(x = educ, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
### With the linear regression line, it looks more positively correlated
than I would have thought.
1Evi.) Use the lm function to run a linear regression in which wage
is the outcome and educ is the independent variable. What is the
estimator for the intercept? What is the estimator for the slope? How
can we interpret the estimator for the slope?
lm(wage ~ educ, data=mroz)
##
## Call:
## lm(formula = wage ~ educ, data = mroz)
##
## Coefficients:
## (Intercept) educ
## -3.1869 0.4526
The estimated intercept is -3.1869 and the estimated slope is 0.4526.
Since the slope is positive, wages tend to increase with education:
each additional year of education is associated with a wage that is
about $0.45 higher on average.
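To make the slope interpretation concrete, the fitted line can be evaluated by hand at two education levels one year apart. This is a small arithmetic sketch using the rounded coefficients printed above; the education values 12 and 13 are chosen only for illustration.

```r
# Coefficients copied (rounded) from the lm(wage ~ educ) output above
b0 <- -3.1869
b1 <- 0.4526

# Fitted wage at 12 and 13 years of education (illustrative values)
wage_12 <- b0 + b1 * 12
wage_13 <- b0 + b1 * 13

wage_13 - wage_12  # equals the slope b1, up to floating-point rounding
```

The negative intercept is the fitted wage at zero years of education, which lies outside the observed range (the minimum of educ is 5), so it should not be read literally.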
1f. Calculate the following quantity: Cov(educ, wage)/Var(educ)
cov(mroz$educ, mroz$wage) / var(mroz$educ) #0.4526386
## [1] 0.4526386
Compare this to the results from running a simple linear regression
between wage and educ. What can you conclude about β̂?
lm(wage ~ educ, data=mroz)
##
## Call:
## lm(formula = wage ~ educ, data = mroz)
##
## Coefficients:
## (Intercept) educ
## -3.1869 0.4526
The quantity and the slope from the linear regression are the same
(0.4526). I can conclude that the OLS slope estimator β̂1 equals
Cov(educ, wage)/Var(educ). In this sample, each extra year of school is
associated with a wage that is about $0.45 higher.
1g. Notice that wage = β̂0 + β̂1 educ
Use what you found out about β̂1 in 1f to find β̂0 only using sample
means, covariances, and variances, and not the lm function.
mean(mroz$wage) #2.374565
## [1] 2.374565
mean(mroz$educ)#12.28685
## [1] 12.28685
2.374565 - 0.4526386 * 12.28685
## [1] -3.186938
β̂0 (the y-intercept) can be calculated using the equation
β̂0 = ȳ − β̂1 x̄, where ȳ is the sample mean of wage and x̄ is the sample
mean of educ. The answer is -3.186938, which matches the intercept we
get when we run the regression (-3.1869).
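The two moment formulas used in 1f and 1g can be checked together without hard-coding any numbers. The following is a minimal sketch on simulated data; the variables `x` and `y` and the parameters used to generate them are assumptions for illustration, not the mroz data.

```r
set.seed(7)  # arbitrary seed for reproducibility
x <- rnorm(100, mean = 12, sd = 2)   # hypothetical regressor
y <- -3 + 0.45 * x + rnorm(100)      # hypothetical outcome

b1 <- cov(x, y) / var(x)       # slope from the sample covariance and variance
b0 <- mean(y) - b1 * mean(x)   # intercept: the fitted line passes through (mean(x), mean(y))

fit <- lm(y ~ x)
all.equal(c(b0, b1), unname(coef(fit)))
```

Computing the slope and intercept from stored objects rather than typed-in values avoids the small rounding differences that come from copying printed output.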
2a. Calculate summary statistics for each of the three variables,
and make sure you understand their distribution.
summary(fakedata$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.182 8.874 13.226 13.221 17.162 35.209
summary(fakedata$x1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6195 3.7434 5.0184 5.0323 6.3292 11.4821
summary(fakedata$x2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.641 2.720 4.022 4.068 5.273 10.569
2b.) Calculate the correlation between y and x1, y and x2, and x1
and x2. Briefly interpret each of these correlations.
cor(fakedata$y, fakedata$x1) #0.7477822
## [1] 0.7477822
cor(fakedata$y, fakedata$x2) #0.9733887
## [1] 0.9733887
cor(fakedata$x1, fakedata$x2) #0.8571596
## [1] 0.8571596
y and x2 have the strongest positive correlation (0.97, very close
to 1), x1 and x2 have a strong positive correlation (0.86), and y and x1
also have a fairly strong positive correlation (0.75).
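The three pairwise correlations can also be obtained in a single call by passing a data frame to cor, which returns the full correlation matrix. A minimal sketch on simulated data (the `toy` data frame and the way it is generated are assumptions for illustration, not the fakedata set):

```r
set.seed(3)  # arbitrary seed for reproducibility
x2 <- rnorm(50)
x1 <- 0.8 * x2 + rnorm(50, sd = 0.5)   # x1 built to be correlated with x2
y  <- 2 * x2 + rnorm(50, sd = 0.5)     # y driven mainly by x2
toy <- data.frame(y, x1, x2)

corr_mat <- cor(toy)   # symmetric matrix with 1s on the diagonal
round(corr_mat, 2)
```

Each off-diagonal entry matches the corresponding pairwise call, for example corr_mat["y", "x1"] and cor(toy$y, toy$x1).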
2c. You mistakenly think that the true population model is given
by:
y_i = β0 + β1 x1_i + ε_i
Use ggplot to produce a scatter plot where x1 is in the horizontal
axis and y is on the vertical axis. Use ggplot to put a linear fit on
the scatter plot. Use the lm function to estimate model 1. What are the
estimators β̂0 and β̂1? Interpret β̂1.
ggplot(data = fakedata, mapping = aes(x = x1, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

lm(y ~ x1, data = fakedata)
##
## Call:
## lm(formula = y ~ x1, data = fakedata)
##
## Coefficients:
## (Intercept) x1
## 1.312 2.367
β̂0 is 1.312 and β̂1 is 2.367. β̂1 means that a one-unit increase in
x1 is associated with an increase of about 2.367 in y on average. β̂0 =
1.312 is the predicted value of y when x1 = 0.
2d.) A friend of yours who took DATA-3100 tells you that the true
population model might be:
y_i = α0 + α1 x1_i + α2 x2_i + ε_i
Your friend knows the context well and thinks that x2 is crucial to
understanding y. Even more, given that x1 and x2 are correlated, as you
found in part 2b, model 1 is problematic because it estimates a
relationship in which x2 is in the error term and is correlated with x1,
and that produces an estimator β̂1 that is biased and inconsistent for
β1.
Estimate model 2. What is the value of α̂1? Compare that to the
value of β̂1. What would have happened if you had evaluated a policy
using model 1, and your company/employer had taken action based on your
recommendation? What is the value of understanding the context to decide
between models 1 and 2?
lm(y ~ x1 + x2, data = fakedata)
##
## Call:
## lm(formula = y ~ x1 + x2, data = fakedata)
##
## Coefficients:
## (Intercept) x1 x2
## 2.033 -1.033 4.028
α̂1 is negative (-1.033), whereas β̂1 was 2.367. This is omitted
variable bias: in model 1, x2 is left in the error term, and because x2
is positively correlated with x1 and has a large positive coefficient
(α̂2 = 4.028), β̂1 picks up part of the effect of x2. I would originally
have thought y and x1 had a positive relationship, but once x2 is held
constant, x1 actually has a negative effect. It would be bad if my
employer took my advice, because I would have said that increasing x1
raises y, when in reality, holding x2 fixed, increasing x1 lowers y.
Understanding the context to decide between models 1 and 2 is extremely
important, since here it is the difference between a positive and a
negative effect, and acting on the wrong sign could be riskier than
confusing a positive effect with no effect.
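The sign flip described above can be reproduced and decomposed on simulated data. In the sketch below, the data-generating process (sample size, coefficients, noise) is an assumption chosen so that x1 has a negative true effect while being positively correlated with x2; it is not the process behind fakedata. It also checks the in-sample identity that the slope from the short regression equals the slope from the long regression plus the coefficient on x2 times the slope from regressing x2 on x1.

```r
set.seed(123)  # arbitrary seed for reproducibility
n  <- 500
x2 <- rnorm(n, mean = 4, sd = 2)
x1 <- 1 + x2 + rnorm(n)                 # x1 is positively correlated with x2
y  <- 2 - 1 * x1 + 4 * x2 + rnorm(n)    # true effect of x1 is negative

short <- lm(y ~ x1)        # omits x2 (analogous to model 1)
long  <- lm(y ~ x1 + x2)   # includes x2 (analogous to model 2)
aux   <- lm(x2 ~ x1)       # auxiliary regression of the omitted variable on x1

b1_short <- unname(coef(short)["x1"])
a1_long  <- unname(coef(long)["x1"])
a2_long  <- unname(coef(long)["x2"])
d1_aux   <- unname(coef(aux)["x1"])

# In-sample identity: short slope = long slope + (coef on x2) * (slope of x2 on x1)
all.equal(b1_short, a1_long + a2_long * d1_aux)
```

Applied to the estimates reported above, the same identity reads 2.367 ≈ -1.033 + 4.028 × d, which would imply a slope of roughly 0.84 in a regression of x2 on x1, up to rounding of the printed coefficients.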