1a. How many women are there in this data set? How many of them are in the labor force?

sum(mroz$inlf==1)
## [1] 428

There are 753 women in this dataset and 428 are in the labor force.
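
For completeness, the total number of women is just the number of rows in the data. A minimal check (assuming, as in the standard mroz data, one row per woman):

nrow(mroz)   # total number of women in the dataset; should return 753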

1b. What is the distribution of wages for those women in the labor force? Calculate summary statistics and produce a histogram. Does the distribution of wages look normal?

summary(mroz$wage) # Min.= 0, 1st Qu.=0 ,  Median= 1.625,   Mean=2.375, 3rd Qu.=3.788    Max.=25 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.625   2.375   3.788  25.000
sd(mroz$wage) # 3.241829
## [1] 3.241829
ggplot(mroz, aes(x = wage)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "black")

### No, this distribution does not look normal; it is skewed to the right, and the large spike at 0 comes from the women who are not in the labor force.
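
Since wage is 0 for women who are not in the labor force, a cleaner look at working women's wages restricts the data to inlf == 1 first. A minimal sketch (not part of the original answer, assuming ggplot2 is loaded as above):

# Keep only women in the labor force before summarizing wages.
wage_lf <- mroz$wage[mroz$inlf == 1]
summary(wage_lf)
ggplot(data.frame(wage = wage_lf), aes(x = wage)) +
  geom_histogram(alpha = 0.5, bins = 10, color = "black")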

1d. We are interested in understanding the relationship between the wage variable and the age variable.

1d i.) Use the summary function to understand the age variable.

summary(mroz$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   36.00   43.00   42.54   49.00   60.00

1d ii.) Calculate the covariance between wage and age. Interpret the covariance.

cov(mroz$age, mroz$wage) #-0.9044096
## [1] -0.9044096

The covariance is negative, which means the two variables tend to move in opposite directions: on average, older women in this sample earn slightly lower wages than younger women (which is not something I would have expected).
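
Because a covariance depends on the units of both variables, its magnitude is hard to judge on its own; it relates to the correlation computed in the next part through cov(x, y) = cor(x, y) · sd(x) · sd(y). A quick sketch of that identity:

# Should reproduce the covariance of about -0.904 reported above.
cor(mroz$age, mroz$wage) * sd(mroz$age) * sd(mroz$wage)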

1d iii.) Calculate the correlation between wage and age. Interpret the correlation coefficient.

cor(mroz$age, mroz$wage) #-0.03455915
## [1] -0.03455915

The correlation is about -0.03, which is essentially zero, so there is virtually no linear relationship between age and wage in this sample.

1d iv.) Use ggplot to produce a scatter plot, in which age is in the horizontal axis, and wage is in the vertical axis. Can you infer any relationship between the two variables from the scatter plot?

ggplot(data = mroz)+
  geom_point(aes(x = age, y= wage))

### Just based on the scatter plot, I do not see a relationship between the two variables.

1d v.) Use ggplot to add a linear fit on top of the scatter plot. Does this corroborate your instincts from the previous part?

ggplot(data = mroz, mapping = aes(x = age, y= wage)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x)

### Yes. The fitted line is nearly flat, which supports my impression from the scatter plot that there is little to no relationship between age and wage.

1d vi.) Use the lm function to run a linear regression in which wage is the outcome and age is the independent variable. What is the estimator for the intercept? What is the estimator for the slope? How can we interpret the estimator for the slope?

lm(wage ~ age, data=mroz)
## 
## Call:
## lm(formula = wage ~ age, data = mroz)
## 
## Coefficients:
## (Intercept)          age  
##     2.96492     -0.01388

The estimator for the intercept is 2.96492 and the estimator for the slope is -0.01388. The slope means that each additional year of age is associated with a decrease of about $0.01 in wage; because it is negative but so close to zero, there is essentially no relationship between age and wage.
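
As a sanity check that foreshadows part 1f, the slope from a simple regression equals the sample covariance divided by the sample variance of the regressor. A minimal sketch:

# Should reproduce the age slope of -0.01388 from lm() above.
cov(mroz$age, mroz$wage) / var(mroz$age)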

1e. We are interested in understanding the relationship between the wage variable and the educ variable.

1e i.) Use the summary function to understand the educ variable.

summary(mroz$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   12.00   12.00   12.29   13.00   17.00

1e ii.) Calculate the covariance between wage and educ. Interpret the covariance.

cov(mroz$educ, mroz$wage) #2.353504
## [1] 2.353504

The covariance is positive, which means the two variables tend to move in the same direction: women with more education tend to earn higher wages.

1e iii.) Calculate the correlation between wage and educ. Interpret the correlation coefficient.

cor(mroz$educ, mroz$wage) #0.3183781
## [1] 0.3183781

The correlation coefficient is about 0.32, which indicates a moderate (not especially strong) positive linear relationship between education and wage.

1e iv.) Use ggplot to produce a scatter plot, in which educ is in the horizontal axis, and wage is in the vertical axis. Can you infer any relationship between the two variables from the scatter plot?

ggplot(data = mroz)+
  geom_point(aes(x = educ, y= wage))

### From the scatter plot alone, it is hard to see a clear relationship between educ and wage.

1e v.) Use ggplot to add a linear fit on top of the scatter plot. Does this corroborate your instincts from the previous part?

ggplot(data = mroz, mapping = aes(x = educ, y= wage)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x)

### With the linear fit added, the relationship looks more clearly positive than I would have guessed from the scatter plot alone.

1e vi.) Use the lm function to run a linear regression in which wage is the outcome and educ is the independent variable. What is the estimator for the intercept? What is the estimator for the slope? How can we interpret the estimator for the slope?

lm(wage ~ educ, data=mroz)
## 
## Call:
## lm(formula = wage ~ educ, data = mroz)
## 
## Coefficients:
## (Intercept)         educ  
##     -3.1869       0.4526

The estimator for the intercept is -3.1869 and the estimator for the slope is 0.4526. The slope means that each additional year of education is associated with an increase of about $0.45 in wage, so wages rise with education.
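
As a worked example using the coefficients reported above, the fitted line can be used to predict the wage at a given education level. A sketch (the numbers are taken from the lm() output):

# Predicted wage at 12 years of education using the fitted line.
-3.1869 + 0.4526 * 12   # roughly 2.24, in the wage units of the data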

1f. Calculate the following quantity: Cov(educ, wage)/Var(educ)

cov(mroz$educ, mroz$wage) / var(mroz$educ) #0.4526386
## [1] 0.4526386

Compare this to the results from running a simple linear regression between wage and educ. What can you conclude about βˆ1?

lm(wage ~ educ, data=mroz)
## 
## Call:
## lm(formula = wage ~ educ, data = mroz)
## 
## Coefficients:
## (Intercept)         educ  
##     -3.1869       0.4526

The quantity and the slope from the linear regression are identical, so I can conclude that βˆ1 = Cov(educ, wage)/Var(educ). For every extra year of school, a woman's wage increases by about $0.45.
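
The match can also be checked programmatically by pulling the slope out of the fitted model with coef(). A minimal sketch (the object name fit is mine):

# Compare the educ slope to cov/var; TRUE up to floating-point rounding.
fit <- lm(wage ~ educ, data = mroz)
all.equal(unname(coef(fit)["educ"]),
          cov(mroz$educ, mroz$wage) / var(mroz$educ))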

1g. Notice that ¯wage = βˆ0 + βˆ1 · ¯educ

Use what you found out about βˆ1 in 1f to find βˆ0 only using sample means, covariances, and variances, and not the lm function.

mean(mroz$wage) #2.374565
## [1] 2.374565
mean(mroz$educ)#12.28685
## [1] 12.28685
2.374565 - 0.4526386 * 12.28685
## [1] -3.186938

βˆ0 (the y-intercept) can be calculated using the equation βˆ0 = ¯y − βˆ1 · ¯x. The answer is -3.186938, which is the same value we get when we run the regression.
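
The same calculation can be written without hard-coding any numbers. A minimal sketch (b0 and b1 are my own object names):

b1 <- cov(mroz$educ, mroz$wage) / var(mroz$educ)  # slope from 1f
b0 <- mean(mroz$wage) - b1 * mean(mroz$educ)      # intercept from the sample means
b0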

2a. Calculate summary statistics for each of the three variables, and make sure you understand their distribution.

summary(fakedata$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -8.182   8.874  13.226  13.221  17.162  35.209
summary(fakedata$x1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6195  3.7434  5.0184  5.0323  6.3292 11.4821
summary(fakedata$x2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.641   2.720   4.022   4.068   5.273  10.569

2b.) Calculate the correlation between y and x1, y and x2, and x1 and x2. Briefly interpret each of these correlations.

cor(fakedata$y, fakedata$x1) #0.7477822
## [1] 0.7477822
cor(fakedata$y, fakedata$x2) #0.9733887
## [1] 0.9733887
cor(fakedata$x1, fakedata$x2) #0.8571596
## [1] 0.8571596

y and x2 have the strongest positive correlation (0.97, very close to 1), x1 and x2 have a strong positive correlation (about 0.86), and y and x1 also have a fairly strong positive correlation (about 0.75).
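
All three pairwise correlations can also be read off a single correlation matrix. A minimal sketch:

# Correlation matrix for y, x1, and x2 in one call.
cor(fakedata[, c("y", "x1", "x2")])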

2c. You mistakenly think that the true population model is given by:

y_i = β0 + β1·x1_i + ε_i      (model 1)

Use ggplot to produce a scatter plot where x1 is in the horizontal axis and y is on the vertical axis. Use ggplot to put a linear fit on the scatter plot. Use the lm function to estimate model 1. What are the estimators βˆ0 and βˆ1? Interpret βˆ1.

ggplot(data = fakedata, mapping = aes(x = x1, y= y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x)

lm(y ~x1, data=fakedata)
## 
## Call:
## lm(formula = y ~ x1, data = fakedata)
## 
## Coefficients:
## (Intercept)           x1  
##       1.312        2.367

βˆ0 is 1.312 and βˆ1 is 2.367. βˆ1 means that, under model 1, a one-unit increase in x1 is associated with an increase of about 2.37 in y.

2d.) A friend of yours who took DATA-3100 tells you that the true population model might be:

y_i = α0 + α1·x1_i + α2·x2_i + ε_i      (model 2)

Your friend knows the context well and thinks that x2 is crucial to understanding y. Even more, given that x1 and x2 are correlated, as you found in part 2b, model 1 is problematic because it estimates a relationship in which x2 is in the error term and is correlated with x1, and that produces an estimator βˆ1 that is biased and inconsistent for the true effect of x1.

Estimate model 2. What is the value of αˆ1? Compare that to the value of βˆ1. What would have happened if you had evaluated a policy using model 1, and your company/employer had taken action based on your recommendation? What is the value of understanding the context to decide between models 1 and 2?

lm(y ~x1+x2, data=fakedata)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = fakedata)
## 
## Coefficients:
## (Intercept)           x1           x2  
##       2.033       -1.033        4.028

αˆ1 is negative (-1.033), while βˆ1 was 2.367. This is omitted variable bias: model 1 made it look like x1 had a strong positive relationship with y, but once x2 is accounted for, the effect of x1 is actually negative. It would have been bad if my employer had acted on model 1, because I would have recommended a policy based on a strong positive effect of x1 when, with x2 controlled for, x1 has a negative effect. Understanding the context well enough to choose between models 1 and 2 is extremely important here, since the two models do not just differ in magnitude; they flip the sign of the estimated effect, which could lead to exactly the wrong decision.
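
The gap between βˆ1 and αˆ1 is exactly what the omitted-variable-bias decomposition predicts: βˆ1 = αˆ1 + αˆ2 · δˆ1, where δˆ1 is the slope from regressing the omitted x2 on the included x1. A sketch of that check (the decomposition is a standard OLS identity; the object names are mine):

# Short model (omits x2), long model (includes x2), and the auxiliary
# regression of the omitted variable on the included one.
short <- lm(y ~ x1, data = fakedata)
long  <- lm(y ~ x1 + x2, data = fakedata)
aux   <- lm(x2 ~ x1, data = fakedata)

# These two numbers should match: the short-model slope on x1 equals
# the long-model slope on x1 plus the x2 slope times how strongly
# x2 tracks x1.
coef(short)["x1"]
coef(long)["x1"] + coef(long)["x2"] * coef(aux)["x1"]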