library(wooldridge)
library(tidyverse)
QUESTION 1: WOOLDRIDGE 2.C1
Use the data in k401k to answer this question
```r
data("k401k")
df = as_tibble(k401k)
view(df)
```
Use ?k401k to learn about the sample and variables
```r
?k401k
```
The data in k401k are a subset of data analyzed by Papke (1995) to study the relationship between participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the percentage of eligible workers with an active account; this is the variable we would like to explain. The measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm contributes to each worker’s plan for each $1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50 cent contribution by the firm.

PART A– Find the average participation rate and the average match rate in the sample of plans
```r
mean(df$prate)
```
```
[1] 87.36291
```
```r
mean(df$mrate)
```
```
[1] 0.7315124
```

PART B– Now, estimate the simple regression equation and report the results along with the sample size and R-squared.
```r
model = lm(prate ~ mrate, data = df)
summary(model)
```
```
Call:
lm(formula = prate ~ mrate, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-82.303  -8.184   5.178  12.712  16.807 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  83.0755     0.5633  147.48   <2e-16 ***
mrate         5.8611     0.5270   11.12   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.09 on 1532 degrees of freedom
Multiple R-squared:  0.0747,    Adjusted R-squared:  0.0741 
F-statistic: 123.7 on 1 and 1532 DF,  p-value: < 2.2e-16
```




The sample size is 1534 and R^2 is 0.075. The estimated intercept (B0 hat) is 83.0755 and the estimated slope (B1 hat) is 5.8611, so the estimated equation is prate^ = 83.0755 + 5.8611*mrate.

PART C-- Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is happening here.

```r
predicted = 83.0755 + 5.8611*(3.5)
print(predicted)
```
```
[1] 103.5894
```

The predicted value is 103.59. This is not a reasonable prediction, because a participation rate cannot exceed 100%. The linear model places no upper bound on its fitted values, so at a match rate as high as mrate = 3.5 (far above the sample average of 0.73) it predicts a participation rate of 103.59%.
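
To see where the fitted line leaves the feasible range, the sketch below uses only the rounded coefficients reported above (not the fitted `model` object) to recompute the prediction and to solve for the match rate at which the predicted prate reaches 100. The names `b0`, `b1`, `pred_3_5`, and `mrate_at_100` are introduced here for illustration.

```r
# Rounded OLS estimates copied from the regression output above
b0 <- 83.0755  # intercept
b1 <- 5.8611   # slope on mrate

# Predicted participation rate at mrate = 3.5
pred_3_5 <- b0 + b1 * 3.5
print(pred_3_5)

# Match rate at which the fitted line crosses prate = 100
mrate_at_100 <- (100 - b0) / b1
print(round(mrate_at_100, 2))
```

Under these rounded estimates, any match rate above roughly 2.89 gives a predicted participation rate over 100%.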

PART D– How much of the variation in prate is explained by mrate? Is this a lot in your opinion?

The R^2 value is about 0.075, so mrate explains about 7.5% of the variation in prate. This is not a lot: roughly 92.5% of the variation is left unexplained, which suggests that other factors affect prate.
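
As a rough check on this interpretation, the sketch below works only from the rounded R^2 reported in the regression output. It computes the unexplained share of the variation and, using the fact that in a simple regression R^2 equals the squared sample correlation between the two variables, the implied size of that correlation. The variable names are illustrative.

```r
# Rounded R-squared copied from the regression output
r_squared <- 0.0747

# Share of the variation in prate that mrate does not explain
unexplained_share <- 1 - r_squared
print(unexplained_share)

# Implied sample correlation between prate and mrate
# (positive root, because the estimated slope is positive)
implied_correlation <- sqrt(r_squared)
print(round(implied_correlation, 3))
```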

QUESTION 2: WOOLDRIDGE 2.C3
Use the data in sleep75 to answer this question
```r
data("sleep75")
df_two = as_tibble(sleep75)
view(df_two)
?sleep75
```
Use the data in sleep75 from Biddle and Hamermesh (1990) to study whether there is a tradeoff between the time spent sleeping per week and the time spent in paid work. We could use either variable as the dependent variable. For concreteness, estimate the model sleep = β0 + β1totwrk + u, where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during the week.
```r
model_two = lm(sleep ~ totwrk, data = df_two)
summary(model_two)
```
```
Call:
lm(formula = sleep ~ totwrk, data = df_two)

Residuals:
     Min       1Q   Median       3Q      Max 
-2429.94  -240.25     4.91   250.53  1339.72 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3586.37695   38.91243  92.165   <2e-16 ***
totwrk        -0.15075    0.01674  -9.005   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 421.1 on 704 degrees of freedom
Multiple R-squared:  0.1033,    Adjusted R-squared:  0.102 
F-statistic: 81.09 on 1 and 704 DF,  p-value: < 2.2e-16
```




PART A-- Report your results in equation form along with the number of observations and R^2. What does the intercept in this equation mean?

Equation: sleep^ = 3586.37695 - 0.15075*totwrk
Sample Size: 706
R^2: 0.1033
The intercept suggests that someone who works 0 minutes per week is estimated to sleep 3586.38 minutes per week (about 59.8 hours).


PART B-- If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a large effect?

```r
# totwrk is measured in minutes, so 2 hours = 120 minutes
change_in_sleep = -0.15075 * 120
print(change_in_sleep)
```
```
[1] -18.09
```

Sleep is estimated to fall by about 18 minutes per week, which is small relative to the roughly 3,586 minutes of weekly sleep predicted at totwrk = 0.

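
Because sleep and totwrk are both measured in minutes per week, the answers to Parts A and B are easier to judge after converting units. The sketch below uses only the rounded coefficients reported in the regression output; the variable names are illustrative.

```r
# Rounded OLS estimates copied from the regression output
b0 <- 3586.37695  # predicted weekly sleep (minutes) at totwrk = 0
b1 <- -0.15075    # change in weekly sleep (minutes) per extra minute of work

# Intercept expressed in hours per week and in hours per night
sleep_hours_week <- b0 / 60
sleep_hours_night <- sleep_hours_week / 7
print(round(c(sleep_hours_week, sleep_hours_night), 2))

# Effect of 2 more hours (120 minutes) of work per week, in minutes of sleep
change_2_hours <- b1 * 2 * 60
print(change_2_hours)
```
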
QUESTION 3:
Use the data in wage1 to answer this question
```r
data("wage1")
df_three = as_tibble(wage1)
view(df_three)
?wage1
```
PART A– Estimate the model wage = β0 + β1educ + u
```r
model_three = lm(wage ~ educ, data = df_three)
summary(model_three)
```
```
Call:
lm(formula = wage ~ educ, data = df_three)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3396 -2.1501 -0.9674  1.1921 16.6085 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.90485    0.68497  -1.321    0.187    
educ         0.54136    0.05325  10.167   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.378 on 524 degrees of freedom
Multiple R-squared:  0.1648,    Adjusted R-squared:  0.1632 
F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16
```




PART B-- Now, create new variables for weekly wages and semesters of education

```r
# assume 8 hours per day and five days per week
df_three$wage_weekly = df_three$wage * 8 * 5
# assume 2 semesters per year
df_three$educ_semester = df_three$educ * 2
view(df_three)
```

PART C– Estimate the model: wage_weekly = β0 + β1educ_semester + u. How do the coefficients compare to those in part a?

```r
model_three = lm(wage_weekly ~ educ_semester, data = df_three)
summary(model_three)
```
```
Call:
lm(formula = wage_weekly ~ educ_semester, data = df_three)

Residuals:
    Min      1Q  Median      3Q     Max 
-213.58  -86.00  -38.70   47.68  664.34 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -36.194     27.399  -1.321    0.187    
educ_semester   10.827      1.065  10.167   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 135.1 on 524 degrees of freedom
Multiple R-squared:  0.1648,    Adjusted R-squared:  0.1632 
F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16
```

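The coefficients are much larger than in part A because the units changed: wage is now measured per week (40 hours) instead of per hour, and education is measured in semesters instead of years. The intercept is 40 times the part A intercept (-0.90485 * 40 = -36.194), and the slope is 40/2 = 20 times the part A slope (0.54136 * 20 = 10.827). The t statistics and R^2 are unchanged.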

PART D– Explain how this relates to Problem 10 in the written part of the homework.

According to Problem 10 of the homework, “B~1=(c1/c2)B^1 and B~0=(c1)B^0”. In this question c1 = 5*8 = 40, the number of hours worked per week (5 days at 8 hours per day), and c2 = 2, the number of semesters in a year. The estimates in part C match these formulas, so the relationship from Problem 10 holds.
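
The relationship can be checked numerically. The sketch below takes the rounded part A estimates, applies the Problem 10 formulas with c1 = 40 and c2 = 2, and prints the rescaled coefficients for comparison with the part C output (-36.194 and 10.827). All variable names are illustrative.

```r
# Rounded part A estimates (hourly wage on years of education)
b0_hat <- -0.90485
b1_hat <- 0.54136

# Scaling factors: wage is multiplied by c1, educ by c2
c1 <- 8 * 5  # hours per week
c2 <- 2      # semesters per year

# Problem 10 formulas for the rescaled regression
b0_tilde <- c1 * b0_hat
b1_tilde <- (c1 / c2) * b1_hat
print(round(c(b0_tilde, b1_tilde), 3))
```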

QUESTION 4: WOOLDRIDGE 2.C11
Use the data in GPA1 for this exercise
```r
data("gpa1")
df_four = as_tibble(gpa1)
view(df_four)
?gpa1
```
Use ?gpa1 to learn about the sample and variables. It is a sample of Michigan State University undergraduates from the mid-1990s, and includes current college GPA (colGPA) and a binary variable indicating whether the student owned a personal computer (PC).

PART A– How many students are in the sample? Find the average and highest college GPAs.
```r
max(df_four$colGPA)
```
```
[1] 4
```
```r
mean(df_four$colGPA)
```
```
[1] 3.056738
```

PART B– How many students owned their own PC?
```r
df_four %>% filter(PC == 1) %>% nrow()
```
```
[1] 56
```
56 students own their own PC.
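
Part A also asks for the sample size, which the output above does not print. With the data loaded, `nrow(df_four)` would return it directly. The sketch below instead recovers it from the residual degrees of freedom (139) reported in the Part C regression output, since a simple regression estimates two coefficients, and then expresses the Part B count as a share of the sample. Variable names are illustrative.

```r
# Residual degrees of freedom from the Part C regression output
resid_df <- 139
# A simple regression estimates an intercept and one slope
n_coefficients <- 2

# Sample size = residual degrees of freedom + estimated coefficients
n_students <- resid_df + n_coefficients
print(n_students)

# Share of students who own a PC, using the Part B count
pc_owners <- 56
pc_share <- pc_owners / n_students
print(round(100 * pc_share, 1))
```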
PART C– Estimate the simple regression equation colGPA=β0 +β1PC+u and report your estimates for β0 and β1. Interpret these estimates, including a discussion of the magnitudes.
```r
model_four = lm(colGPA ~ PC, data = df_four)
summary(model_four)
```
```
Call:
lm(formula = colGPA ~ PC, data = df_four)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.95893 -0.25893  0.01059  0.31059  0.84107 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.98941    0.03950  75.678   <2e-16 ***
PC           0.16952    0.06268   2.704   0.0077 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3642 on 139 degrees of freedom
Multiple R-squared:  0.04999,    Adjusted R-squared:  0.04315 
F-statistic: 7.314 on 1 and 139 DF,  p-value: 0.007697
```

The estimated B0 is 2.98941: the average colGPA of students who do not own a PC is about 2.99. The estimated B1 is 0.16952: because PC is a binary variable, this is the difference in average colGPA between students who own a PC and students who do not, so PC owners have an average colGPA of about 3.16.

A gap of about 0.17 grade points is modest on a 4-point scale, but it is statistically significant (p = 0.0077).
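
With a binary regressor, the OLS intercept is the sample mean of the outcome for the PC = 0 group and the slope is the difference between the two group means. The sketch below uses the rounded estimates reported in the regression output to recover the average colGPA of each group; the variable names are illustrative.

```r
# Rounded OLS estimates copied from the regression output
b0 <- 2.98941  # intercept: mean colGPA when PC = 0
b1 <- 0.16952  # slope: difference in mean colGPA, PC = 1 minus PC = 0

# Implied group means
mean_gpa_no_pc <- b0
mean_gpa_pc <- b0 + b1
print(c(mean_gpa_no_pc, mean_gpa_pc))
```

With the data loaded, the same two numbers could be obtained as group means, for example with `df_four %>% group_by(PC) %>% summarise(mean(colGPA))`.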

PART D– What is the R-squared from the regression? What do you make of its magnitude?

The R^2 is 0.04999, which means that PC ownership explains about 5% of the variation in colGPA. This is a small share, meaning that other factors account for most of the variation in students' GPAs.

PART E– Does your finding in part (c) imply that owning a PC has a causal effect on colGPA? Explain

No, the finding in part (c) does not imply that owning a PC has a causal effect on colGPA. The regression only shows that PC owners have a higher average colGPA. Students were not randomly assigned a PC, so other factors that are correlated with PC ownership and left in the error term u may explain the difference. The low R^2 is not what rules out causality: a higher R^2 would not establish a causal relationship either, it would only mean that the fitted colGPA values were closer to the actual GPA values.

