1a. How many women are there in this data set? How many of them are in the labor force?

sum(mroz$inlf==1)
## [1] 428

There are 753 women in this dataset and 428 are in the labor force.
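
For completeness, the total number of women is just the number of rows in the data. A minimal check (assuming, as in the standard mroz data, one row per woman):

nrow(mroz)   # total number of women in the dataset; should return 753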

1b. What is the distribution of wages for those women in the labor force? Calculate summary statistics and produce a histogram. Does the distribution of wages look normal?

summary(mroz$wage) # Min.= 0, 1st Qu.=0 ,  Median= 1.625,   Mean=2.375, 3rd Qu.=3.788    Max.=25 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.625   2.375   3.788  25.000
sd(mroz$wage) # 3.241829
## [1] 3.241829
ggplot(mroz, aes(x = wage)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "black")

### No, this distribution does not look normal; it is skewed to the right, and the large spike at 0 comes from the women who are not in the labor force.
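
Since wage is 0 for women who are not in the labor force, a cleaner look at working women's wages restricts the data to inlf == 1 first. A minimal sketch (not part of the original answer, assuming ggplot2 is loaded as above):

# Keep only women in the labor force before summarizing wages.
wage_lf <- mroz$wage[mroz$inlf == 1]
summary(wage_lf)
ggplot(data.frame(wage = wage_lf), aes(x = wage)) +
  geom_histogram(alpha = 0.5, bins = 10, color = "black")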

1d. We are interested in understanding the relationship between the wage variable and the age variable.

1d i.) Use the summary function to understand the age variable.

summary(mroz$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   36.00   43.00   42.54   49.00   60.00

1d ii.) Calculate the covariance between wage and age. Interpret the covariance.

cov(mroz$age, mroz$wage) #-0.9044096
## [1] -0.9044096

The covariance is negative, which means the two variables tend to move in opposite directions: on average, older women in this sample earn slightly lower wages than younger women (which is not something I would have expected).
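
Because a covariance depends on the units of both variables, its magnitude is hard to judge on its own; it relates to the correlation computed in the next part through cov(x, y) = cor(x, y) · sd(x) · sd(y). A quick sketch of that identity:

# Should reproduce the covariance of about -0.904 reported above.
cor(mroz$age, mroz$wage) * sd(mroz$age) * sd(mroz$wage)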

1d iii.) Calculate the correlation between wage and age. Interpret the correlation coefficient.

cor(mroz$age, mroz$wage) #-0.03455915
## [1] -0.03455915

The correlation is about -0.03, which is essentially zero, so there is virtually no linear relationship between age and wage in this sample.

1d iv.) Use ggplot to produce a scatter plot, in which age is in the horizontal axis, and wage is in the vertical axis. Can you infer any relationship between the two variables from the scatter plot?

ggplot(data = mroz)+
  geom_point(aes(x = age, y= wage))

### Just based on the scatter plot, I do not see a relationship between the two variables.

1d v.) Use ggplot to add a linear fit on top of the scatter plot. Does this corroborate your instincts from the previous part?

ggplot(data = mroz, mapping = aes(x = age, y= wage)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x)

### Yes. The fitted line is nearly flat, which supports my impression from the scatter plot that there is little to no relationship between age and wage.

1d vi.) Use the lm function to run a linear regression in which wage is the outcome and age is the independent variable. What is the estimator for the intercept? What is the estimator for the slope? How can we interpret the estimator for the slope?

lm(wage ~ age, data=mroz)
## 
## Call:
## lm(formula = wage ~ age, data = mroz)
## 
## Coefficients:
## (Intercept)          age  
##     2.96492     -0.01388

The estimator for the intercept is 2.96492 and the estimator for the slope is -0.01388. The slope means that each additional year of age is associated with a decrease of about $0.01 in wage; because it is negative but so close to zero, there is essentially no relationship between age and wage.
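
As a sanity check that foreshadows part 1f, the slope from a simple regression equals the sample covariance divided by the sample variance of the regressor. A minimal sketch:

# Should reproduce the age slope of -0.01388 from lm() above.
cov(mroz$age, mroz$wage) / var(mroz$age)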

1e. We are interested in understanding the relationship between the wage variable and the educ variable.

1e i.) Use the summary function to understand the educ variable.

summary(mroz$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   12.00   12.00   12.29   13.00   17.00

1e ii.) Calculate the covariance between wage and educ. Interpret the covariance.

cov(mroz$educ, mroz$wage) #2.353504
## [1] 2.353504

The covariance is positive, which means the two variables tend to move in the same direction: women with more education tend to earn higher wages.

1e iii.) Calculate the correlation between wage and educ. Interpret the correlation coefficient.

cor(mroz$educ, mroz$wage) #0.3183781
## [1] 0.3183781

The correlation coefficient is about 0.32, which indicates a moderate (not especially strong) positive linear relationship between education and wage.

1e iv.) Use ggplot to produce a scatter plot, in which educ is in the horizontal axis, and wage is in the vertical axis. Can you infer any relationship between the two variables from the scatter plot?

ggplot(data = mroz)+
  geom_point(aes(x = educ, y= wage))

### From the scatter plot alone, it is hard to see a clear relationship between educ and wage.

1e v.) Use ggplot to add a linear fit on top of the scatter plot. Does this corroborate your instincts from the previous part?

ggplot(data = mroz, mapping = aes(x = educ, y= wage)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x)

### With the linear fit added, the relationship looks more clearly positive than I would have guessed from the scatter plot alone.

1e vi.) Use the lm function to run a linear regression in which wage is the outcome and educ is the independent variable. What is the estimator for the intercept? What is the estimator for the slope? How can we interpret the estimator for the slope?

lm(wage ~ educ, data=mroz)
## 
## Call:
## lm(formula = wage ~ educ, data = mroz)
## 
## Coefficients:
## (Intercept)         educ  
##     -3.1869       0.4526

The estimator for the intercept is -3.1869 and the estimator for the slope is 0.4526. The slope means that each additional year of education is associated with an increase of about $0.45 in wage, so wages rise with education.
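
As a worked example using the coefficients reported above, the fitted line can be used to predict the wage at a given education level. A sketch (the numbers are taken from the lm() output):

# Predicted wage at 12 years of education using the fitted line.
-3.1869 + 0.4526 * 12   # roughly 2.24, in the wage units of the data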

1f. Calculate the following quantity: Cov(educ, wage)/Var(educ)

cov(mroz$educ, mroz$wage) / var(mroz$educ) #0.4526386
## [1] 0.4526386

Compare this to the results from running a simple linear regression between wage and educ. What can you conclude about βˆ1?

lm(wage ~ educ, data=mroz)
## 
## Call:
## lm(formula = wage ~ educ, data = mroz)
## 
## Coefficients:
## (Intercept)         educ  
##     -3.1869       0.4526

The quantity and the slope from the linear regression are identical, so I can conclude that βˆ1 = Cov(educ, wage)/Var(educ). For every extra year of school, a woman's wage increases by about $0.45.
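
The match can also be checked programmatically by pulling the slope out of the fitted model with coef(). A minimal sketch (the object name fit is mine):

# Compare the educ slope to cov/var; TRUE up to floating-point rounding.
fit <- lm(wage ~ educ, data = mroz)
all.equal(unname(coef(fit)["educ"]),
          cov(mroz$educ, mroz$wage) / var(mroz$educ))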

1g. Notice that ¯wage = βˆ0 + βˆ1 · ¯educ

Use what you found out about βˆ1 in 1f to find βˆ0 only using sample means, covariances, and variances, and not the lm function.

mean(mroz$wage) #2.374565
## [1] 2.374565
mean(mroz$educ)#12.28685
## [1] 12.28685
2.374565 - 0.4526386 * 12.28685
## [1] -3.186938

βˆ0 (the y-intercept) can be calculated using the equation βˆ0 = ¯y − βˆ1 · ¯x. The answer is -3.186938, which is the same value we get when we run the regression.
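
The same calculation can be written without hard-coding any numbers. A minimal sketch (b0 and b1 are my own object names):

b1 <- cov(mroz$educ, mroz$wage) / var(mroz$educ)  # slope from 1f
b0 <- mean(mroz$wage) - b1 * mean(mroz$educ)      # intercept from the sample means
b0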

2a. Calculate summary statistics for each of the three variables, and make sure you understand their distribution.

summary(fakedata$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -8.182   8.874  13.226  13.221  17.162  35.209
summary(fakedata$x1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6195  3.7434  5.0184  5.0323  6.3292 11.4821
summary(fakedata$x2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.641   2.720   4.022   4.068   5.273  10.569

2b.) Calculate the correlation between y and x1, y and x2, and x1 and x2. Briefly interpret each of these correlations.

cor(fakedata$y, fakedata$x1) #0.7477822
## [1] 0.7477822
cor(fakedata$y, fakedata$x2) #0.9733887
## [1] 0.9733887
cor(fakedata$x1, fakedata$x2) #0.8571596
## [1] 0.8571596

y and x2 have the strongest positive correlation (0.97, very close to 1), x1 and x2 have a strong positive correlation (about 0.86), and y and x1 also have a fairly strong positive correlation (about 0.75).
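
All three pairwise correlations can also be read off a single correlation matrix. A minimal sketch:

# Correlation matrix for y, x1, and x2 in one call.
cor(fakedata[, c("y", "x1", "x2")])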

2c. You mistakenly think that the true population model is given by:

y_i = β0 + β1·x1_i + ε_i      (model 1)

Use ggplot to produce a scatter plot where x1 is in the horizontal axis and y is on the vertical axis. Use ggplot to put a linear fit on the scatter plot. Use the lm function to estimate model 1. What are the estimators βˆ0 and βˆ1? Interpret βˆ1.

ggplot(data = fakedata, mapping = aes(x = x1, y= y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x)

lm(y ~x1, data=fakedata)
## 
## Call:
## lm(formula = y ~ x1, data = fakedata)
## 
## Coefficients:
## (Intercept)           x1  
##       1.312        2.367

βˆ0 is 1.312 and βˆ1 is 2.367. βˆ1 means that, under model 1, a one-unit increase in x1 is associated with an increase of about 2.37 in y.

2d.) A friend of yours who took DATA-3100 tells you that the true population model might be:

y_i = α0 + α1·x1_i + α2·x2_i + ε_i      (model 2)

Your friend knows the context well and thinks that x2 is crucial to understanding y. Even more, given that x1 and x2 are correlated, as you found in part 2b, model 1 is problematic because it estimates a relationship in which x2 is in the error term and is correlated with x1, and that produces an estimator βˆ1 that is biased and inconsistent for the true effect of x1.

Estimate model 2. What is the value of αˆ1? Compare that to the value of βˆ1. What would have happened if you had evaluated a policy using model 1, and your company/employer had taken action based on your recommendation? What is the value of understanding the context to decide between models 1 and 2?

lm(y ~x1+x2, data=fakedata)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = fakedata)
## 
## Coefficients:
## (Intercept)           x1           x2  
##       2.033       -1.033        4.028

αˆ1 is negative (-1.033), while βˆ1 was 2.367. This is omitted variable bias: model 1 made it look like x1 had a strong positive relationship with y, but once x2 is accounted for, the effect of x1 is actually negative. It would have been bad if my employer had acted on model 1, because I would have recommended a policy based on a strong positive effect of x1 when, with x2 controlled for, x1 has a negative effect. Understanding the context well enough to choose between models 1 and 2 is extremely important here, since the two models do not just differ in magnitude; they flip the sign of the estimated effect, which could lead to exactly the wrong decision.
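
The gap between βˆ1 and αˆ1 is exactly what the omitted-variable-bias decomposition predicts: βˆ1 = αˆ1 + αˆ2 · δˆ1, where δˆ1 is the slope from regressing the omitted x2 on the included x1. A sketch of that check (the decomposition is a standard OLS identity; the object names are mine):

# Short model (omits x2), long model (includes x2), and the auxiliary
# regression of the omitted variable on the included one.
short <- lm(y ~ x1, data = fakedata)
long  <- lm(y ~ x1 + x2, data = fakedata)
aux   <- lm(x2 ~ x1, data = fakedata)

# These two numbers should match: the short-model slope on x1 equals
# the long-model slope on x1 plus the x2 slope times how strongly
# x2 tracks x1.
coef(short)["x1"]
coef(long)["x1"] + coef(long)["x2"] * coef(aux)["x1"]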