Chapter 2 Analysis

C9: COUNTYMURDERS Analysis for 1996

(i) How many counties had zero murders in 1996?

## Counties with zero murders in 1996: 1051

How many counties had at least one execution?

## Counties with at least one execution in 1996: 31

What is the largest number of executions?

## Warning: the call max(countymurders$countymurders$execs) repeats the data frame name, so
## max() receives NULL and returns -Inf with a "no non-missing arguments" warning.
## The corrected computation is sketched below.
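A corrected sketch of all three counts, assuming the countymurders data from the wooldridge package:

library(wooldridge)
data("countymurders")
cm96 <- subset(countymurders, year == 1996)
sum(cm96$murders == 0)          # counties with zero murders: 1051
sum(cm96$execs >= 1)            # counties with at least one execution: 31
max(cm96$execs, na.rm = TRUE)   # largest number of executions (the fixed call)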

(ii) Estimate the equation murders = β0 + β1·execs + u by OLS

Report the results

## 
## Call:
## lm(formula = murders ~ execs, data = subset(countymurders, year == 
##     1996))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149.12   -5.46   -4.46   -2.46 1338.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4572     0.8348   6.537 7.79e-11 ***
## execs        58.5555     5.8333  10.038  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared:  0.04389,    Adjusted R-squared:  0.04346 
## F-statistic: 100.8 on 1 and 2195 DF,  p-value: < 2.2e-16

(iii) Slope coefficient

## The slope coefficient (β1 = 58.56) is the estimated change in murders associated with one
## additional execution; a negative β1 would suggest a deterrent effect of capital punishment.
## Here the estimate is large, positive, and statistically significant, so there is no evidence
## of deterrence. The positive coefficient most likely reflects scale: populous counties have
## both more murders and more executions.

(iv) What is the smallest number of murders that can be predicted?

## Smallest number of murders predicted: 5.457241

What is the residual for a county with zero executions and zero murders?

## Residual for a county with zero executions and zero murders: 0 - 5.457241 = -5.457241
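A minimal check of both numbers (the part (ii) model is assumed to be stored as murder_model; that name is an assumption):

b <- coef(murder_model)
fitted0 <- unname(b[1] + b[2] * 0)   # predicted murders when execs = 0: 5.4572
resid0  <- 0 - fitted0               # residual when actual murders = 0: -5.4572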

(v) Explain why a simple regression analysis is not well-suited for determining deterrence.

## A simple regression analysis may suffer from omitted variable bias and endogeneity issues.
## Factors other than executions could influence the murder rate, leading to biased estimates.
## Additionally, the decision to implement capital punishment may be influenced by the crime rate,
## creating endogeneity problems and making causal inference challenging.

(5)

(i) In the model GPA = β0 + β1study + β2sleep + β3work + β4leisure + u,

does it make sense to hold sleep, work, and leisure fixed while changing study?

## In the given model, it does not make sense to hold sleep, work, and leisure fixed while changing study.
## The reason is that the sum of hours in all four activities must be 168 for each student.
## Changing the hours spent on studying would inherently change the hours available for other activities.

(ii) Explain why this model violates Assumption MLR.3.

## This model violates Assumption MLR.3 (no perfect collinearity). Because
## study + sleep + work + leisure = 168 for every student, each regressor is an exact linear
## function of the other three (e.g., leisure = 168 - study - sleep - work), so the regressors
## are perfectly collinear and the OLS parameters cannot be estimated.

(iii) How could you reformulate the model so that its parameters have a useful interpretation

and it satisfies Assumption MLR.3?

## Drop one of the four activities, say leisure, and estimate
## GPA = β0 + β1study + β2sleep + β3work + u.
## Now β1 is the effect of one more hour of study, holding sleep and work fixed, with the extra
## hour implicitly coming out of leisure. With leisure omitted there is no exact linear
## relationship among the regressors, so Assumption MLR.3 is satisfied. A sketch follows.

(10)

(i) If x1 is highly correlated with x2 and x3 in the sample, and x2 and x3 have large partial effects on y,

would you expect β̃1 (the simple regression estimate) and β̂1 (the multiple regression estimate) to be similar or very different? Explain.

## You would expect β̃1 and β̂1 to be very different. Because x1 is highly correlated with x2 and
## x3, and x2 and x3 have large partial effects on y, omitting them imparts a large omitted
## variable bias to the simple regression: β̃1 absorbs much of the effect of x2 and x3.

(ii) If x1 is almost uncorrelated with x2 and x3, but x2 and x3 are highly correlated,

will β̃1 and β̂1 tend to be similar or very different? Explain.

## They will tend to be similar. Because x1 is almost uncorrelated with x2 and x3, omitting x2
## and x3 causes little bias in β̃1. The high correlation between x2 and x3 affects the precision
## of their own coefficients, not the value of the estimate on x1.

(iii) If x1 is highly correlated with x2 and x3, and x2 and x3 have small partial effects on y,

would you expect se(β̃1) or se(β̂1) to be smaller? Explain.

## You would expect se(β̃1) to be smaller. Including x2 and x3 introduces multicollinearity with
## x1, which inflates se(β̂1), while their small partial effects mean they do little to reduce
## the error variance in return.

(iv) If x1 is almost uncorrelated with x2 and x3, x2 and x3 have large partial effects on y,

and x2 and x3 are highly correlated, would you expect se(β̃1) or se(β̂1) to be smaller? Explain.

## You would expect se(β̂1) to be smaller. Including x2 and x3, which have large partial effects
## on y, substantially reduces the error variance, and because x1 is almost uncorrelated with
## them there is no multicollinearity penalty on β̂1. The high correlation between x2 and x3
## inflates their own standard errors, but not se(β̂1).

(C8)

(i) Find the average values of prpblck and income in the sample, along with their standard deviations.

## Average prpblck: NA
## Standard deviation of prpblck: NA
## Average income: NA
## Standard deviation of income: NA
## The NAs appear because both variables contain missing values; the statistics must be computed
## with na.rm = TRUE, as sketched below.
## Units of measurement: prpblck is a proportion between 0 and 1 (the fraction of the zip code's
## population that is black), and income is median family income in the zip code, in dollars.
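A sketch of the statistics with missing values excluded (discrim from the wooldridge package assumed):

library(wooldridge)
data("discrim")
mean(discrim$prpblck, na.rm = TRUE); sd(discrim$prpblck, na.rm = TRUE)
mean(discrim$income,  na.rm = TRUE); sd(discrim$income,  na.rm = TRUE)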

(ii) Estimate the model psoda = B0 + B1prpblck + B2income + u by OLS

## 
## Call:
## lm(formula = psoda ~ prpblck + income, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29401 -0.05242  0.00333  0.04231  0.44322 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.563e-01  1.899e-02  50.354  < 2e-16 ***
## prpblck     1.150e-01  2.600e-02   4.423 1.26e-05 ***
## income      1.603e-06  3.618e-07   4.430 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08611 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06422,    Adjusted R-squared:  0.05952 
## F-statistic: 13.66 on 2 and 398 DF,  p-value: 1.835e-06

Interpret the coefficient on prpblck

## The coefficient on prpblck (0.115) is the estimated change in psoda, in dollars, for a
## one-unit increase in prpblck, holding income fixed: moving from an all-white zip code
## (prpblck = 0) to an all-black one (prpblck = 1) is associated with an 11.5-cent higher soda
## price. Equivalently, a 0.10 increase in prpblck raises the price by about 1.2 cents. Relative
## to an average price of roughly a dollar, the effect is modest but not negligible, and it is
## statistically significant (t ≈ 4.4).

(iii) Compare the estimate from part (ii) with the simple regression estimate from psoda on prpblck.

## 
## Call:
## lm(formula = psoda ~ prpblck, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30884 -0.05963  0.01135  0.03206  0.44840 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03740    0.00519  199.87  < 2e-16 ***
## prpblck      0.06493    0.02396    2.71  0.00702 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0881 on 399 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.01808,    Adjusted R-squared:  0.01561 
## F-statistic: 7.345 on 1 and 399 DF,  p-value: 0.007015

## The simple regression estimate (0.065) is barely half the multiple regression estimate
## (0.115). Omitting income biases the prpblck coefficient toward zero, because prpblck is
## negatively correlated with income while income has a positive effect on psoda.

(iv) Estimate the model log(psoda) = B0 + B1prpblck + B2log(income) + u

## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33563 -0.04695  0.00658  0.04334  0.35413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.79377    0.17943  -4.424 1.25e-05 ***
## prpblck      0.12158    0.02575   4.722 3.24e-06 ***
## log(income)  0.07651    0.01660   4.610 5.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0821 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06809,    Adjusted R-squared:  0.06341 
## F-statistic: 14.54 on 2 and 398 DF,  p-value: 8.039e-07

Calculate the estimated percentage change in psoda when prpblck increases by 0.20 (20 percentage points)

## Estimated percentage change in psoda: 2.46141
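The 2.46 figure is the exact log-change conversion; a sketch of both versions of the computation (the part (iv) model is assumed to be stored as log_model, a name assumed here):

b <- coef(log_model)["prpblck"]   # ≈ 0.1216
100 * (exp(b * 0.20) - 1)         # exact percentage change: ≈ 2.46
100 * b * 0.20                    # linear approximation: ≈ 2.43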

(v) Add the variable prppov to the regression in part (iv)

## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32218 -0.04648  0.00651  0.04272  0.35622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08137 on 397 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.08696,    Adjusted R-squared:  0.08006 
## F-statistic:  12.6 on 3 and 397 DF,  p-value: 6.917e-08

## Adding prppov cuts the coefficient on prpblck from 0.122 to 0.073, but it remains positive
## and statistically significant (p = 0.018), so the evidence of higher soda prices in
## predominantly black zip codes survives controlling for poverty.

(vi) Find the correlation between log(income) and prppov

## Correlation between lincome and prppov: NA
## The NA is caused by missing values; computed on complete cases (sketched below), the
## correlation is strongly negative, since higher-income zip codes have lower poverty rates.
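A sketch of the corrected computation on complete cases:

cor(log(discrim$income), discrim$prppov, use = "complete.obs")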

(vii) Evaluate the following statement

## The statement to evaluate is: "Because lincome and prppov are so highly correlated, they have
## no business being in the same regression." This is too strong. The high negative correlation
## does create multicollinearity, which inflates the two variables' standard errors. But
## correlation among explanatory variables is not by itself grounds for dropping one: lincome
## and prppov control for different things (overall income level versus the incidence of
## poverty), and both are individually significant in part (v). The decision to include both
## should rest on the research question and on theory, not on their sample correlation.

Chapter 4

(3)

Given coefficients and standard errors: on log(sales), 0.321 (se = 0.216); on profmarg, 0.050 (se = 0.046); n = 32, so 29 degrees of freedom.

(i) Estimated percentage-point change in rdintens for a 10% increase in sales

## i) The model is level-log in sales, so Δrdintens = 0.321 × Δlog(sales) ≈ 0.321 × 0.10 = 0.032.
##    A 10% increase in sales is estimated to raise rdintens by only about 0.03 percentage
##    points, a small effect.

(ii) Test for log(sales) coefficient

## ii) p-value for the two-sided test on the log(sales) coefficient: 0.1480413
##    (At 5% level): Fail to reject H0
##    (At 10% level): Fail to reject H0
##    Against the one-sided alternative H1: β1 > 0, p ≈ 0.074: reject at 10% but not at 5%.

(iii) Interpretation of the coefficient on profmarg

## iii) Coefficient on profmarg: 0.050. Holding sales fixed, a one-percentage-point increase in
##     the profit margin is associated with only a 0.05 percentage-point rise in R&D intensity,
##     an economically small effect.

(iv) Test for profmarg coefficient

## iv) p-value for the two-sided test on the profmarg coefficient: 0.2860082
##    (At 5% level): Fail to reject H0
##    (At 10% level): Fail to reject H0 (the one-sided p ≈ 0.14 is not significant either)
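Both p-values can be reproduced from the reported coefficients and standard errors with 29 degrees of freedom:

2 * pt(-abs(0.321 / 0.216), df = 29)   # log(sales): t ≈ 1.49, p ≈ 0.148
2 * pt(-abs(0.050 / 0.046), df = 29)   # profmarg:   t ≈ 1.09, p ≈ 0.286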

(C8)

(i) How many single-person households are there in the data set?

## Number of single-person households: 2017

(ii) Use OLS to estimate the model: nettfa = B0 + B1inc + B2age + u

## 
## Call:
## lm(formula = nettfa ~ inc + age, data = single_person_households)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.95  -14.16   -3.42    6.03 1113.94 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43.03981    4.08039 -10.548   <2e-16 ***
## inc           0.79932    0.05973  13.382   <2e-16 ***
## age           0.84266    0.09202   9.158   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared:  0.1193, Adjusted R-squared:  0.1185 
## F-statistic: 136.5 on 2 and 2014 DF,  p-value: < 2.2e-16

Interpret the slope coefficients

## Interpretation of slope coefficients (nettfa and inc are both measured in $1,000s):
## B1 (inc) = 0.799: holding age fixed, each additional dollar of annual family income is
## associated with about 80 cents more net financial wealth.
## B2 (age) = 0.843: holding income fixed, each additional year of age is associated with about
## $843 more net financial wealth. The age effect may seem surprisingly small if one expects
## wealth to accumulate steeply over the working life.

(iii) Does the intercept from the regression in part (ii) have an interesting meaning? Explain.

## The intercept (B0 = -43.04) is the predicted nettfa when inc = 0 and age = 0. Since nobody in
## the sample has zero income, and age zero is meaningless (the youngest person is 25), the
## intercept has no interesting interpretation on its own.

(iv) Find the p-value for the test H0: B₂ = 1 against H₁: B₂ < 1. Do you reject H0 at the 1% significance level?

## t = (0.84266 - 1) / 0.09202 ≈ -1.71, so the one-sided p-value is P(T ≤ -1.71) ≈ 0.044.
## Since 0.044 > 0.01, we fail to reject H0 at the 1% significance level: the evidence that
## B₂ < 1 is only moderate.
## (A p-value on the order of 1e-19 would correspond to the test of H0: B₂ = 0, which is not the
## hypothesis of interest here.)
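The corrected statistic and one-sided p-value, computed from the estimates above:

t_stat <- (0.84266 - 1) / 0.09202   # ≈ -1.71
pt(t_stat, df = 2014)               # one-sided p ≈ 0.044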

(v) If you do a simple regression of nettfa on inc, is the estimated coefficient on inc much different from the estimate in part (ii)? Why or why not?

## 
## Call:
## lm(formula = nettfa ~ inc, data = single_person_households)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -185.12  -12.85   -4.85    1.78 1112.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.5709     2.0607   -5.13 3.18e-07 ***
## inc           0.8207     0.0609   13.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.59 on 2015 degrees of freedom
## Multiple R-squared:  0.08267,    Adjusted R-squared:  0.08222 
## F-statistic: 181.6 on 1 and 2015 DF,  p-value: < 2.2e-16
## Comparison of the estimated coefficient on inc:
## The simple regression estimate (0.821) is very close to the multiple regression estimate
## (0.799). This is because inc and age are nearly uncorrelated in this sample, so omitting age
## causes almost no omitted variable bias in the coefficient on inc.
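The near-equality of the two estimates can be checked directly (subsample name from part (i) assumed):

cor(single_person_households$inc, single_person_households$age)  # should be close to zero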

Chapter 5

(5)

(Figure: histogram of the score variable with an overlaid normal density.)

(i). Probability that “score” exceeds 100 using the normal distribution

## (i). Probability that 'score' exceeds 100 using the normal distribution: 0.02044288

(ii). Assess the fit in the left tail visually or using statistical tests

## 
##  Shapiro-Wilk normality test
## 
## data:  data
## W = 0.96973, p-value = 2.454e-12

## The Shapiro-Wilk test decisively rejects normality (p ≈ 2.5e-12), confirming that the normal
## distribution is a poor approximation to the score distribution in the tails.
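For the right tail, a sketch of the comparison behind part (i), assuming the score variable comes from the ECONMATH data in the wooldridge package (an assumption about this write-up):

library(wooldridge)
data("econmath")
mu <- mean(econmath$score, na.rm = TRUE)
s  <- sd(econmath$score, na.rm = TRUE)
pnorm(100, mu, s, lower.tail = FALSE)     # normal tail probability: ≈ 0.020
mean(econmath$score > 100, na.rm = TRUE)  # empirical: 0, since scores cannot exceed 100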

(C1)

(i) Estimate the equation wage = b0 + b1educ + b2exper + b3tenure + u.

## 
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6068 -1.7747 -0.6279  1.1969 14.6536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.87273    0.72896  -3.941 9.22e-05 ***
## educ         0.59897    0.05128  11.679  < 2e-16 ***
## exper        0.02234    0.01206   1.853   0.0645 .  
## tenure       0.16927    0.02164   7.820 2.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3024 
## F-statistic: 76.87 on 3 and 522 DF,  p-value: < 2.2e-16

(ii) Repeat part (i), but with log(wage) as the dependent variable.

## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure, data = wage_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05802 -0.29645 -0.03265  0.28788  1.42809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
## educ        0.092029   0.007330  12.555  < 2e-16 ***
## exper       0.004121   0.001723   2.391  0.01714 *  
## tenure      0.022067   0.003094   7.133 3.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared:  0.316,  Adjusted R-squared:  0.3121 
## F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16

(iii) Q-Q plots for normality assessment

par(mfrow = c(2, 2)) # Set up a 2x2 grid for Q-Q plots
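A minimal completion of the plotting sketch, refitting the two models from parts (i) and (ii) (the wage_data name follows the calls above):

model_level <- lm(wage ~ educ + exper + tenure, data = wage_data)
model_log   <- lm(log(wage) ~ educ + exper + tenure, data = wage_data)
par(mfrow = c(1, 2))
qqnorm(resid(model_level), main = "wage model residuals")
qqline(resid(model_level))
qqnorm(resid(model_log), main = "log(wage) model residuals")
qqline(resid(model_log))

The standard finding for this exercise is that the log(wage) residuals lie much closer to the reference line, so the normality assumption MLR.6 is more plausible for the log-level model.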


Chapter 6

(3)

rdintens = 2.613 + 0.00030·sales − 0.0000000070·sales²
           (0.429)  (0.00014)      (0.0000000037)

Given: n = 32, R² = 0.1484

(i) Marginal Effect of Sales on rdintens

To find the point at which the marginal effect of sales on rdintens becomes negative, take the derivative of rdintens with respect to sales and find where it crosses zero:

∂rdintens/∂sales = 0.00030 − 2(0.0000000070)·sales

## The marginal effect becomes negative when sales > 0.00030 / (2 × 0.0000000070) ≈ 21,428.57,
## i.e., at roughly $21.4 billion in sales (sales is measured in millions of dollars).
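The turning point in R:

0.00030 / (2 * 0.0000000070)   # ≈ 21428.57 (millions of dollars)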

(ii) Keeping the Quadratic Term

Whether to keep the quadratic term can be judged by its t statistic. From the regression in part (iii) below, t = −1.864 with a two-sided p-value of about 0.073: significant at the 10% level but not at 5%.

## The quadratic term is borderline significant, so keeping it is defensible. Note, though, that
## the turning point (about $21.4 billion) lies beyond the sales of almost every firm in the
## sample, so a linear specification tells much the same story for the bulk of the data.

(iii) Using salesbil as the Independent Variable

Let salesbil = sales/1000, so that sales is measured in billions of dollars. Since sales = 1000·salesbil, the coefficient on salesbil is 1000 times the original linear coefficient, and the coefficient on salesbil² is 1000² times the original quadratic coefficient:

rdintens = 2.613 + 0.301·salesbil − 0.0070·salesbil²
           (0.429)  (0.139)          (0.0037)

The standard errors rescale by the same factors, as the regression below confirms.
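The rescaling can be verified directly:

0.00030 * 1000            # linear coefficient:    0.30
0.0000000070 * 1000^2     # quadratic coefficient: 0.0070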

## 
## Call:
## lm(formula = rdintens ~ salesbil + I(salesbil^2), data = rdchem)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1418 -1.3630 -0.2257  1.0688  5.5808 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.612512   0.429442   6.084 1.27e-06 ***
## salesbil       0.300571   0.139295   2.158   0.0394 *  
## I(salesbil^2) -0.006946   0.003726  -1.864   0.0725 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.788 on 29 degrees of freedom
## Multiple R-squared:  0.1484, Adjusted R-squared:  0.08969 
## F-statistic: 2.527 on 2 and 29 DF,  p-value: 0.09733

(iv) Preference for Reporting

The choice cannot be made on fit: rescaling a regressor changes neither the t statistics nor R² or adjusted R², and the output below confirms that the two models fit identically. The equation in salesbil is preferable for reporting simply because its coefficients are easier to read, without long strings of leading zeros.

## Adjusted R-squared - Model 1: 0.08969224
## Adjusted R-squared - Model 2: 0.08969224
## Preference: Both models have the same adjusted R-squared.
## 
## Coefficients - Model 1:
##   (Intercept)         sales    I(sales^2) 
##  2.612512e+00  3.005713e-04 -6.945939e-09
## 
## Coefficients - Model 2:
##   (Intercept)      salesbil I(salesbil^2) 
##   2.612512085   0.300571301  -0.006945939

(10)

(i) If you are a policy maker trying to estimate the causal effect of per-student spending on math test performance,

explain why the first equation is more relevant than the second. What is the estimated effect of a 10% increase in expenditures per student?

## The first equation (without read4) is the relevant one: a policy maker wants the total effect
## of spending on math performance, not the effect holding reading performance fixed, since
## reading scores are themselves an outcome of school spending. With a coefficient of about 9.01
## on lexppp, a 10% increase in expenditures per student is estimated to raise math4 by roughly
## 9.01 × 0.10 ≈ 0.9 percentage points.

(ii) Adding read4

Math and reading scores are both outputs of the same schooling process, so read4 is strongly correlated with math4 and with the other explanatory variables. Once read4 is added, the coefficient on lexppp answers a different and less useful question, namely the effect of spending on math scores holding reading scores fixed, and as the output below shows, it strips lexppp and free of most of their explanatory role.

(iii) How would you explain to someone with only basic knowledge of regression why, in this case, you prefer the equation with the smaller adjusted R-squared?

## 
## Call:
## lm(formula = math4 ~ lexppp + free + lmedinc + pctsgle, data = meapsingle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.259  -7.422   1.615   7.274  49.524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.48949   59.23781   0.413   0.6797    
## lexppp       9.00648    4.03530   2.232   0.0266 *  
## free        -0.42164    0.07064  -5.969 9.27e-09 ***
## lmedinc     -0.75221    5.35816  -0.140   0.8885    
## pctsgle     -0.27444    0.16086  -1.706   0.0894 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.59 on 224 degrees of freedom
## Multiple R-squared:  0.4716, Adjusted R-squared:  0.4622 
## F-statistic: 49.98 on 4 and 224 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = math4 ~ lexppp + free + lmedinc + pctsgle + read4, 
##     data = meapsingle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.5690  -4.6729  -0.0349   4.3644  24.8425 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 149.37870   41.70293   3.582 0.000419 ***
## lexppp        1.93215    2.82480   0.684 0.494688    
## free         -0.06004    0.05399  -1.112 0.267297    
## lmedinc     -10.77595    3.75746  -2.868 0.004529 ** 
## pctsgle      -0.39663    0.11143  -3.559 0.000454 ***
## read4         0.66656    0.04249  15.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.012 on 223 degrees of freedom
## Multiple R-squared:  0.7488, Adjusted R-squared:  0.7432 
## F-statistic: 132.9 on 5 and 223 DF,  p-value: < 2.2e-16
## 
## Explanation for (iii):
## Adjusted R-squared measures goodness of fit, not suitability for answering a causal question.
## The equation with read4 fits far better (adjusted R-squared of 0.74 versus 0.46), but read4
## is itself an outcome of the schooling process: holding it fixed absorbs much of the very
## effect of spending we want to measure. For estimating the causal effect of spending on math
## performance, the equation without read4, despite its smaller adjusted R-squared, is the one
## to use.

(C3) Return to Education and Work Experience

# Assuming you have already loaded the "WAGE2" dataset
library(wooldridge)
library(tidyverse)
 str(wage2)
## 'data.frame':    935 obs. of  17 variables:
##  $ wage   : int  769 808 825 650 562 1400 600 1081 1154 1000 ...
##  $ hours  : int  40 50 40 40 40 40 40 40 45 40 ...
##  $ IQ     : int  93 119 108 96 74 116 91 114 111 95 ...
##  $ KWW    : int  35 41 46 32 27 43 24 50 37 44 ...
##  $ educ   : int  12 18 14 12 11 16 10 18 15 12 ...
##  $ exper  : int  11 11 11 13 14 14 13 8 13 16 ...
##  $ tenure : int  2 16 9 7 5 2 0 14 1 16 ...
##  $ age    : int  31 37 33 32 34 35 30 38 36 36 ...
##  $ married: int  1 1 1 1 1 1 0 1 1 1 ...
##  $ black  : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ south  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ urban  : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ sibs   : int  1 1 1 4 10 1 1 2 2 1 ...
##  $ brthord: int  2 NA 2 3 6 2 2 3 3 1 ...
##  $ meduc  : int  8 14 14 12 6 8 8 8 14 12 ...
##  $ feduc  : int  8 14 14 12 11 NA 8 NA 5 11 ...
##  $ lwage  : num  6.65 6.69 6.72 6.48 6.33 ...
##  - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

(i) Show that the return to another year of education, holding exper fixed, is β1 + β3·exper

## In log(wage) = β0 + β1·educ + β2·exper + β3·educ·exper + u, holding exper fixed,
## ∂log(wage)/∂educ = β1 + β3·exper.
## With the estimates below, β̂1 + β̂3·exper = 0.044050 + 0.003203·exper; the printed value
## 0.0473 corresponds to evaluating this at exper = 1.
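A sketch of the computation (the interaction model from part (ii) is assumed to be stored as interact_model; that name is an assumption):

b <- coef(interact_model)
unname(b["educ"] + b["educ:exper"] * 1)   # return to education at exper = 1: ≈ 0.0473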

(ii) State the null hypothesis that the return to education does not depend on the level of exper.

## H0: β3 = 0, i.e., the return to education is β1 at every level of experience. A natural
## alternative is H1: β3 > 0 (education and experience are complements). The interaction model
## is estimated below.

## 
## Call:
## lm(formula = log(wage) ~ educ + exper + educ * exper, data = wage2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88558 -0.24553  0.03558  0.26171  1.28836 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.949455   0.240826  24.704   <2e-16 ***
## educ         0.044050   0.017391   2.533   0.0115 *  
## exper       -0.021496   0.019978  -1.076   0.2822    
## educ:exper   0.003203   0.001529   2.095   0.0365 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3923 on 931 degrees of freedom
## Multiple R-squared:  0.1349, Adjusted R-squared:  0.1321 
## F-statistic: 48.41 on 3 and 931 DF,  p-value: < 2.2e-16

(iii) Hypothesis Testing

## From the regression in part (ii), the t statistic on educ:exper is 2.095 with a two-sided
## p-value of 0.0365 (about 0.018 one-sided). We reject H0 at the 5% level: the return to
## education does appear to rise with experience.

(iv) Return to Education When exper = 10

## θ1 = β1 + 10·β3 = 0.044050 + 10 × 0.003203 ≈ 0.0761, i.e., about a 7.6% return to another
## year of education for someone with ten years of experience.
## A 95% confidence interval for θ1 comes from reparameterizing the model so that θ1 is itself a
## coefficient, as sketched below.
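A sketch of the reparameterization (wage2 from the wooldridge package): substituting educ·exper = educ·(exper − 10) + 10·educ makes θ1 = β1 + 10·β3 the coefficient on educ:

theta_model <- lm(log(wage) ~ educ + exper + I(educ * (exper - 10)), data = wage2)
coef(theta_model)["educ"]          # theta1-hat ≈ 0.0761
confint(theta_model)["educ", ]     # 95% confidence interval for theta1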

(C12)

(i).What is the youngest age of people in this sample? How many people are at that age?

## Youngest age: 25
## Number of people at the youngest age: 211

(ii). What is the literal interpretation of b2? By itself, is it of much interest?

## Literally, b2 (the coefficient on age) is the partial effect of age on nettfa at age = 0,
## since ∂nettfa/∂age = b2 + 2·b3·age. Because the youngest person in the sample is 25, the
## effect at age zero is of little interest by itself.

## 
## Call:
## lm(formula = nettfa ~ inc + age + agesq, data = k401ksubs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -504.93  -18.61   -3.08    9.96 1464.26 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.680388  10.080986   0.464    0.642    
## inc          0.978252   0.025489  38.379  < 2e-16 ***
## age         -2.231489   0.489712  -4.557 5.26e-06 ***
## agesq        0.037722   0.005621   6.710 2.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.18 on 9271 degrees of freedom
## Multiple R-squared:  0.1731, Adjusted R-squared:  0.1728 
## F-statistic: 646.8 on 3 and 9271 DF,  p-value: < 2.2e-16

(iii) The coefficient on age is negative: should we worry?

## The estimated equation for single-person households is
## nettfa = -1.204 + 0.825·inc - 1.322·age + 0.0256·age².
## With b2 < 0 and b3 > 0 the age profile is U-shaped, with its minimum at
## age* = 1.322 / (2 × 0.0256) ≈ 25.9, essentially the youngest age in the sample (25).
## So nettfa is increasing in age over virtually the whole sample, and the negative b2 is not
## troubling by itself.

(iv) θ2 = β2 + 50β3

## θ2 = β2 + 2(25)·β3 is the marginal effect of age evaluated at age = 25. The second regression
## below estimates it directly by replacing age² with age² - 50·age (the column age_50 is
## presumably 50·age): the coefficient on age becomes θ̂2 = -0.044 (se = 0.325), statistically
## indistinguishable from zero.
## 
## Call:
## lm(formula = nettfa ~ inc + age + I(age^2), data = data_c6_12)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.36  -13.58   -2.97    5.67 1116.45 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.204212  15.280667  -0.079  0.93719    
## inc          0.824816   0.060298  13.679  < 2e-16 ***
## age         -1.321815   0.767496  -1.722  0.08518 .  
## I(age^2)     0.025562   0.008999   2.841  0.00455 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared:  0.1229, Adjusted R-squared:  0.1216 
## F-statistic: 93.99 on 3 and 2013 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = nettfa ~ inc + age + I(age^2 - age_50), data = data_c6_12)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.36  -13.58   -2.97    5.67 1116.45 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.204212  15.280667  -0.079  0.93719    
## inc                0.824816   0.060298  13.679  < 2e-16 ***
## age               -0.043695   0.325270  -0.134  0.89315    
## I(age^2 - age_50)  0.025562   0.008999   2.841  0.00455 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared:  0.1229, Adjusted R-squared:  0.1216 
## F-statistic: 93.99 on 3 and 2013 DF,  p-value: < 2.2e-16

(v)

Dropping age and using (age − 25)² alone leaves the R-squared essentially unchanged at 0.1229, so with one fewer parameter the adjusted R-squared rises slightly (0.1220 versus 0.1216). The more parsimonious model nettfa = β0 + β1·inc + β3·(age − 25)² is therefore preferred.

## 
## Call:
## lm(formula = nettfa ~ inc + I(age_25^2), data = data_c6_12)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.37  -13.61   -3.01    5.63 1116.34 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.488105   2.177584  -8.490   <2e-16 ***
## inc           0.823571   0.059567  13.826   <2e-16 ***
## I(age_25^2)   0.024403   0.002541   9.605   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.59 on 2014 degrees of freedom
## Multiple R-squared:  0.1229, Adjusted R-squared:  0.122 
## F-statistic:   141 on 2 and 2014 DF,  p-value: < 2.2e-16

(vi) Holding inc = 30 (i.e., $30,000), the estimated relationship from part (v) is

nettfa = −18.488 + 0.8236(30) + 0.0244·(age − 25)² = 6.219 + 0.0244·(age − 25)²,

an upward-opening parabola in age with its minimum at age = 25, so predicted wealth rises at an increasing rate with age.
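A small sketch of the implied profile (the 25-64 age range is assumed from the sample):

age <- 25:64
nettfa_hat <- -18.488 + 0.8236 * 30 + 0.0244 * (age - 25)^2
plot(age, nettfa_hat, type = "l",
     xlab = "age", ylab = "predicted nettfa ($1,000s)")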

(vii) Adding inc² leaves the fit essentially unchanged, and its coefficient is tiny and insignificant (t = −0.27, p = 0.785 below), so there is no need to include it.

## 
## Call:
## lm(formula = nettfa ~ inc + I(inc^2) + I(age_25^2), data = data_c6_12)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.46  -13.66   -3.00    5.76 1116.08 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.930e+01  3.688e+00  -5.234 1.83e-07 ***
## inc          8.722e-01  1.877e-01   4.648 3.57e-06 ***
## I(inc^2)    -5.405e-04  1.978e-03  -0.273    0.785    
## I(age_25^2)  2.440e-02  2.541e-03   9.603  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared:  0.1229, Adjusted R-squared:  0.1216 
## F-statistic: 94.01 on 3 and 2013 DF,  p-value: < 2.2e-16