library(wooldridge)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
setwd("C:/Users/hasti/OneDrive/Desktop/RStudio Files/Methods 2")
Problem 1: Which of the following can cause the usual OLS t statistics to be invalid (that is, not to have t distributions under H0)? (i) Heteroskedasticity; (ii) a high sample correlation coefficient between two independent variables; (iii) omitting an important explanatory variable.
Problem 1(i) Answer: Both heteroskedasticity and omitting an important explanatory variable can render the t statistic invalid. Heteroskedasticity violates MLR.5; it does not bias the \(\hat{\beta}_j\), but it biases the estimated variance of the residuals, and through the standard errors this distorts the t statistics. Omitting a relevant variable that is correlated with the included regressors violates MLR.4 and biases the \(\hat{\beta}_j\) themselves, which are the basis of the t statistics, so those tests are invalid as well. A high correlation between two independent variables is often not ideal, but MLR.3 is violated only by perfect collinearity, so the t statistics remain valid in that case.
Problem 2: Consider an equation to explain salaries of CEOs in terms of annual firm sales, return on equity (roe, in percentage form), and return on the firm’s stock (ros, in percentage form):
\[log(salary) = \beta_0 + \beta_1log(sales) + \beta_2roe + \beta_3ros + u\] Problem 2(i) In terms of the model parameters, state the null hypothesis that, after controlling for sales and roe, ros has no effect on CEO salary. State the alternative that better stock market performance increases a CEO’s salary.
Problem 2(i) Answer: The null hypothesis is \(H_0: \beta_3 = 0\). The alternative, that better stock market performance increases a CEO’s salary, is one-sided: \(H_1: \beta_3 > 0\)
Problem 2(ii) Using the data in CEOSAL1.RAW, the following equation was obtained by OLS:
\[\hat{log(salary)} = 4.32 + .280log(sales) + .0174roe + .00024ros\] \[n=209, R^2 =.283\] By what percentage is salary predicted to increase if ros increases by 50 points? Does ros have a practically large effect on salary?
Problem 2(ii) Answer:
\(\Delta\hat{log(salary)} = \hat{\beta}_3\Delta ros = .00024 \times 50 = .012\)
Salary is predicted to increase by about 1.2% when ros increases by 50 points, all else being equal. I consider this a small amount: a 50-point increase in ros is substantial, yet it amounts to only a 1.2% increase in salary, showing a limited practical effect on our DV.
Problem 2(iii) Test the null hypothesis that ros has no effect on salary against the alternative that ros has a positive effect. Carry out the test at the 10% significance level.
Problem 2(iii) Answer: To obtain the t statistic I divided the ros coefficient by its standard error.
\(t = .00024/.00054 \approx 0.44\)
With \(df = n-k-1 = 209-3-1 = 205\), the t table gives a one-sided critical value of about 1.29 at the .10 significance level. The t value of 0.44 fails to exceed the critical value, so the null hypothesis cannot be rejected at the .10 significance level.
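The same test can be run in R from the reported values (a sketch using the published coefficient and standard error, not a fresh regression):
# One-sided test of H0: beta_ros = 0 vs. H1: beta_ros > 0 (df = 209 - 3 - 1 = 205)
t_ros <- 0.00024 / 0.00054
t_ros # about 0.44
qt(0.90, df = 205) # 10% one-sided critical value, about 1.29
pt(t_ros, df = 205, lower.tail = FALSE) # one-sided p-value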
Problem 2(iv) Would you include ros in a final model explaining CEO compensation in terms of firm performance? Explain.
Problem 2(iv) Answer: No, I would not include ros in the final model. Its coefficient is not statistically different from 0 (at the .10 significance level), meaning we cannot reject the null hypothesis, so ros contributes little explanatory power for our DV.
Problem 5 Consider the estimated equation from Example 4.3, which can be used to study the effects of skipping class on college GPA:
\[\hat{colGPA} = 1.39 + .412hsGPA + .015ACT - .083skipped\] \[n=141, R^2 = .234\] Problem 5(i) Using the standard normal approximation, find the 95% confidence interval for \(\beta_{hsGPA}\)
Problem 5(i) Answer:
Calculating the 95% confidence interval using the standard normal approximation, so \(c = 1.96\) and \(se(\hat{\beta}_{hsGPA}) = .094\):
\(\text{Upper bound} = \hat{\beta}_{hsGPA} + c \cdot se(\hat{\beta}_{hsGPA}) = .412 + 1.96 \times .094 \approx 0.596\)
\(\text{Lower bound} = \hat{\beta}_{hsGPA} - c \cdot se(\hat{\beta}_{hsGPA}) = .412 - 1.96 \times .094 \approx 0.228\)
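The interval can be reproduced in R from the reported values (a sketch; b_hsGPA and se_hsGPA are just the published numbers):
# 95% CI for beta_hsGPA using the standard normal approximation
b_hsGPA <- 0.412
se_hsGPA <- 0.094
b_hsGPA + c(-1, 1) * qnorm(0.975) * se_hsGPA # about (0.228, 0.596)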
Problem 5(ii) Can you reject the hypothesis H0: \(\beta_{hsGPA}\) = .4 against the two-sided alternative at the 5% level?
Problem 5(ii) Answer: Since .4 lies inside the confidence interval (0.228, 0.596) for \(\beta_{hsGPA}\), we cannot reject the null hypothesis against the two-sided alternative at the 5% level.
Problem 5(iii) Can you reject the hypothesis H0: \(\beta_{hsGPA}\) = 1 against the two-sided alternative at the 5% level?
Problem 5(iii) Answer: Since 1 lies above the upper bound of the confidence interval, we can reject the null hypothesis against the two-sided alternative at the 5% level for \(\beta_{hsGPA}\).
Problem 9: In Problem 3 in Chapter 3, we estimated the equation
\[\hat{Sleep} = 3{,}638.25 - .148totwrk - 11.13educ + 2.20age\] \[(112.28)\quad(.017)\quad(5.88)\quad(1.45)\] \[n=706, R^2 =.113\] where we now report standard errors (in parentheses) along with the estimates.
Problem 9 (i) Is either educ or age individually significant at the 5% level against a two-sided alternative? Show your work.
Problem 9(i) Answer:
Calculating the individual significance of educ and age at the 5% level (two-sided):
\(df = n - k - 1 = 706 - 3 - 1 = 702\)
With 702 degrees of freedom, the two-sided 5% critical value is approximately 1.96.
\(|t_{educ}| = 11.13/5.88 \approx 1.89\)
\(|t_{age}| = 2.20/1.45 \approx 1.52\)
Neither the educ nor the age t value exceeds the critical value of 1.96, so neither is statistically significant at the 5% level (two-sided).
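As a check in R (a sketch using the reported standard errors):
# Two-sided 5% critical value and |t| statistics for educ and age
qt(0.975, df = 702) # about 1.96
11.13 / 5.88 # |t| for educ, about 1.89
2.20 / 1.45 # |t| for age, about 1.52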
Problem 9 (ii) Dropping educ and age from the equation gives
\[\hat{Sleep} = 3{,}586.38 - .151totwrk\] \[n=706, R^2 =.103\] Are educ and age jointly significant in the original equation at the 5% level? Justify your answer.
Problem 9(ii) Answer:
With \(q = 2\) restrictions (educ and age are both dropped):
\(F = \frac{(R^2_{ur}-R^2_{r})/q} {(1-R^2_{ur})/(n-k-1)} = \frac{(0.113-0.103)/2} {(1-0.113)/(706-3-1)} = \frac{0.005} {0.887/702} \approx 3.96\)
The 5% critical value for an F distribution with (2, 702) degrees of freedom is about 3.00, so educ and age are jointly significant in the original model at the 5% level. The test uses the restricted and unrestricted \(R^2\): the \(R^2\) of the model with age and educ included and the \(R^2\) of the model with both taken out.
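The same F test in R (a sketch from the reported \(R^2\) values):
# F test of joint significance of educ and age (q = 2 restrictions)
r2_ur <- 0.113; r2_r <- 0.103; n <- 706; k <- 3; q <- 2
Fstat <- ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k - 1))
Fstat # about 3.96
qf(0.95, df1 = q, df2 = n - k - 1) # 5% critical value, about 3.01
pf(Fstat, df1 = q, df2 = n - k - 1, lower.tail = FALSE) # p-value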
Problem 9 (iii) Does including educ and age in the model greatly affect the estimated trade off between sleeping and working?
Problem 9(iii) Answer: The proportional reduction in unexplained variation from adding educ and age is
\(1-\frac{1-R^2_{ur}}{1-R^2_{r}} = 1-\frac{1-.113}{1-.103} \approx 0.011\)
Overall, including age and educ reduces the unexplained variation by only about 1.1%, and the coefficient on totwrk barely moves (from -.151 to -.148). The estimated tradeoff between sleeping and working is therefore hardly affected.
Problem 9 (iv) Suppose that the sleep equation contains heteroskedasticity. What does this mean about the tests computed in parts (i) and (ii)?
Problem 9(iv) Answer: If the sleep equation contains heteroskedasticity then it violates assumption MLR.5. While this does not bias the coefficients, heteroskedasticity biases the estimated variance of the residuals. Since that variance is used (through the standard errors) to compute both the t and F statistics, the significance tests in 9(i) and 9(ii) are no longer valid.
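One remedy is to recompute heteroskedasticity-robust standard errors. A minimal sketch, assuming the sleep75 data in the wooldridge package correspond to Problem 9 (SleepModel is a name introduced here):
# Refit the sleep equation and get robust SEs via car::hccm
SleepModel <- lm(sleep ~ totwrk + educ + age, data = sleep75)
sqrt(diag(hccm(SleepModel, type = "hc0"))) # heteroskedasticity-robust SEs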
Computer Problem 1 The following model can be used to study whether campaign expenditures affect election outcomes:
\[voteA = \beta_0 + \beta_1log(expendA) + \beta_2log(expendB) + \beta_3prtystrA + u\]
where voteA is the percentage of the vote received by Candidate A, expendA and expendB are campaign expenditures by Candidates A and B, and prtystrA is a measure of party strength for Candidate A (the percentage of the most recent presidential vote that went to A’s party).
Computer Problem 1(i) What is the interpretation of \(\beta_1\)?
Computer Problem 1(i) Answer: Because expendA enters in log form while voteA is measured in levels, \(\beta_1/100\) is the change in voteA, in percentage points, from a 1% increase in Candidate A’s spending (all else being equal). Equivalently, a 10% increase in expendA changes voteA by about \(\beta_1/10\) percentage points on average.
Computer Problem 1(ii) In terms of the parameters, state the null hypothesis that a 1% increase in A’s expenditures is offset by a 1% increase in B’s expenditures.
Computer Problem 1(ii) Answer: \(H_0: \beta_1 = -\beta_2\), or equivalently \(H_0: \beta_1 + \beta_2 = 0\).
Computer Problem 1(iii) Estimate the given model using the data in VOTE1.RAW and report the results in usual form. Do A’s expenditures affect the outcome? What about B’s expenditures? Can you use these results to test the hypothesis in part (ii)?
Computer Problem 1(iii) Answer:
C1Model <- lm(voteA ~ log(expendA) +log(expendB) + prtystrA, data = vote1)
summary(C1Model)
##
## Call:
## lm(formula = voteA ~ log(expendA) + log(expendB) + prtystrA,
## data = vote1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.3968 -5.4174 -0.8679 4.9551 26.0660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.07893 3.92631 11.48 <2e-16 ***
## log(expendA) 6.08332 0.38215 15.92 <2e-16 ***
## log(expendB) -6.61542 0.37882 -17.46 <2e-16 ***
## prtystrA 0.15196 0.06202 2.45 0.0153 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.712 on 169 degrees of freedom
## Multiple R-squared: 0.7926, Adjusted R-squared: 0.7889
## F-statistic: 215.2 on 3 and 169 DF, p-value: < 2.2e-16
All independent variables are statistically significant at the .05 level. Because voteA is in levels and the expenditure variables are in logs, a 10% increase in expendA is predicted to raise vote share (voteA) by about \(6.083/10 \approx 0.61\) percentage points on average, all else being equal, while a 10% increase in expendB is predicted to lower it by about 0.66 percentage points. A one-point increase in prtystrA on average leads to a 0.15 percentage-point increase in voteA, all else being equal. The adjusted R-squared of 0.7889 shows the model explains a large amount of the total variance; the residual standard error is 7.712.
Given how strongly both candidates’ spending predicts the outcome, I consider these effects substantial. These results alone, however, cannot test the hypothesis in part (ii): testing \(\beta_1 + \beta_2 = 0\) requires \(se(\hat{\beta}_1 + \hat{\beta}_2)\), which depends on the covariance between the two estimates, not just their individual standard errors.
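Since car is already loaded, the restriction can also be tested directly from C1Model with a linear-hypothesis F test (a sketch; the contrast vector follows the coefficient order (Intercept), log(expendA), log(expendB), prtystrA):
# F test of H0: beta1 + beta2 = 0
linearHypothesis(C1Model, c(0, 1, 1, 0))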
Computer Problem 1(iv) Estimate a model that directly gives the t statistic for testing the hypothesis in part (ii). What do you conclude? (Use a two-sided alternative.)
Computer Problem 1(iv) Answer:
C1Model2 <- lm(voteA ~ I(log(expendA) + log(expendB)) + log(expendA) + prtystrA, data = vote1)
summary(C1Model2)
##
## Call:
## lm(formula = voteA ~ I(log(expendA) + log(expendB)) + log(expendA) +
## prtystrA, data = vote1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.3968 -5.4174 -0.8679 4.9551 26.0660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.07893 3.92631 11.48 <2e-16 ***
## I(log(expendA) + log(expendB)) -6.61542 0.37882 -17.46 <2e-16 ***
## log(expendA) 12.69873 0.54305 23.38 <2e-16 ***
## prtystrA 0.15196 0.06202 2.45 0.0153 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.712 on 169 degrees of freedom
## Multiple R-squared: 0.7926, Adjusted R-squared: 0.7889
## F-statistic: 215.2 on 3 and 169 DF, p-value: < 2.2e-16
In this parameterization, \(\beta_1log(expendA) + \beta_2log(expendB) = \beta_2[log(expendA) + log(expendB)] + (\beta_1 - \beta_2)log(expendA)\), so the coefficient on the combined term I(log(expendA) + log(expendB)) is \(\beta_2\), and its t statistic of -17.46 tests \(H_0: \beta_2 = 0\) rather than the hypothesis from part (ii). To get the t statistic for \(\theta = \beta_1 + \beta_2\) directly, the model should instead include log(expendA) together with the difference log(expendB) - log(expendA); the coefficient on log(expendA) is then \(\theta\). From the part (iii) estimates, \(\hat{\theta} = 6.083 - 6.615 \approx -0.53\), and its standard error (about .53, which requires the covariance of the two estimates) gives a t statistic of roughly -1. We therefore fail to reject the null hypothesis that a 1% increase in A’s expenditures is offset by a 1% increase in B’s.
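A sketch of the corrected reparameterization (C1Model3 is a name introduced here):
# theta = beta1 + beta2 becomes the coefficient on log(expendA), since
# beta1*logA + beta2*logB = theta*logA + beta2*(logB - logA)
C1Model3 <- lm(voteA ~ log(expendA) + I(log(expendB) - log(expendA)) + prtystrA, data = vote1)
summary(C1Model3) # the t statistic on log(expendA) tests H0: beta1 + beta2 = 0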
Computer Problem 5: Use the data in MLB1.RAW for this exercise.
Computer Problem 5(i) Use the model estimated in equation (4.31) and drop the variable rbisyr. What happens to the statistical significance of hrunsyr? What about the size of the coefficient on hrunsyr?
Computer Problem 5(i) Answer:
C5Model <- lm(log(salary) ~ years + gamesyr + bavg + hrunsyr, data = mlb1)
summary(C5Model)
##
## Call:
## lm(formula = log(salary) ~ years + gamesyr + bavg + hrunsyr,
## data = mlb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0642 -0.4614 -0.0271 0.4654 2.7216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.020913 0.265719 41.476 < 2e-16 ***
## years 0.067732 0.012113 5.592 4.55e-08 ***
## gamesyr 0.015759 0.001564 10.079 < 2e-16 ***
## bavg 0.001419 0.001066 1.331 0.184
## hrunsyr 0.035943 0.007241 4.964 1.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7279 on 348 degrees of freedom
## Multiple R-squared: 0.6254, Adjusted R-squared: 0.6211
## F-statistic: 145.2 on 4 and 348 DF, p-value: < 2.2e-16
In the original model (equation 4.31, which includes rbisyr) hrunsyr is not statistically significant, but when rbisyr is dropped it becomes statistically significant beyond the .001 level. The coefficient also increases by about 0.021, making it roughly two and a half times the estimated effect in the original model (on average, all else being equal).
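For comparison, the original equation (4.31) can be refit with rbisyr included (a sketch; C5Model0 is a name introduced here):
# Equation (4.31): the model before rbisyr is dropped
C5Model0 <- lm(log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr, data = mlb1)
summary(C5Model0)$coefficients["hrunsyr", ] # smaller and individually insignificant here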
Computer Problem 5(ii) Add the variables runsyr (runs per year), fldperc (fielding percentage), and sbasesyr (stolen bases per year) to the model from part (i). Which of these factors are individually significant?
Computer Problem 5(ii) Answer:
C5Model2 <- lm(log(salary) ~ years + gamesyr + bavg + hrunsyr + runsyr + fldperc + sbasesyr, data = mlb1)
summary(C5Model2)
##
## Call:
## lm(formula = log(salary) ~ years + gamesyr + bavg + hrunsyr +
## runsyr + fldperc + sbasesyr, data = mlb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11554 -0.44557 -0.08808 0.48731 2.57872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.4082678 2.0032546 5.196 3.50e-07 ***
## years 0.0699848 0.0119756 5.844 1.18e-08 ***
## gamesyr 0.0078995 0.0026775 2.950 0.003391 **
## bavg 0.0005296 0.0011038 0.480 0.631656
## hrunsyr 0.0232106 0.0086392 2.687 0.007566 **
## runsyr 0.0173922 0.0050641 3.434 0.000666 ***
## fldperc 0.0010351 0.0020046 0.516 0.605936
## sbasesyr -0.0064191 0.0051842 -1.238 0.216479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7176 on 345 degrees of freedom
## Multiple R-squared: 0.639, Adjusted R-squared: 0.6317
## F-statistic: 87.25 on 7 and 345 DF, p-value: < 2.2e-16
After running the expanded model for log(salary), only one of the three added factors, runsyr (t = 3.43), is individually significant at the .05 level; fldperc and sbasesyr are not. Of the original variables, years, gamesyr, and hrunsyr remain statistically significant.
Computer Problem 5(iii) In the model from part (ii), test the joint significance of bavg, fldperc, and sbasesyr.
Computer Problem 5(iii) Answer:
C5Model3 <- lm(log(salary) ~ I(bavg + fldperc + sbasesyr) + years + bavg + gamesyr + hrunsyr + runsyr + fldperc, data = mlb1)
summary(C5Model3)
##
## Call:
## lm(formula = log(salary) ~ I(bavg + fldperc + sbasesyr) + years +
## bavg + gamesyr + hrunsyr + runsyr + fldperc, data = mlb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11554 -0.44557 -0.08808 0.48731 2.57872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.408268 2.003255 5.196 3.50e-07 ***
## I(bavg + fldperc + sbasesyr) -0.006419 0.005184 -1.238 0.216479
## years 0.069985 0.011976 5.844 1.18e-08 ***
## bavg 0.006949 0.005206 1.335 0.182834
## gamesyr 0.007900 0.002677 2.950 0.003391 **
## hrunsyr 0.023211 0.008639 2.687 0.007566 **
## runsyr 0.017392 0.005064 3.434 0.000666 ***
## fldperc 0.007454 0.005560 1.341 0.180892
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7176 on 345 degrees of freedom
## Multiple R-squared: 0.639, Adjusted R-squared: 0.6317
## F-statistic: 87.25 on 7 and 345 DF, p-value: < 2.2e-16
The conjoined model is really just a reparameterization of C5Model2: because bavg and fldperc also appear separately, the coefficient on I(bavg + fldperc + sbasesyr) (-0.0064) equals the sbasesyr coefficient from part (ii), and its t statistic tests only \(H_0: \beta_{sbasesyr} = 0\). A joint test of bavg, fldperc, and sbasesyr instead requires an F test of the three exclusion restrictions; that F statistic is small (about 0.69, with a p-value near .56), so the three variables are not jointly significant at any conventional level.
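A sketch of the joint F test (C5Restricted is a name introduced here; linearHypothesis comes from the already-loaded car package):
# H0: bavg = fldperc = sbasesyr = 0 (three restrictions)
C5Restricted <- lm(log(salary) ~ years + gamesyr + hrunsyr + runsyr, data = mlb1)
anova(C5Restricted, C5Model2)
# equivalently: linearHypothesis(C5Model2, c("bavg = 0", "fldperc = 0", "sbasesyr = 0"))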
Computer Problem 6: Use the data in WAGE2.RAW for this exercise.
Computer Problem 6(i) Consider the standard wage equation
\[log(wage) = \beta_0 + \beta_1educ + \beta_2exper + \beta_3tenure + u \]
State the null hypothesis that another year of general workforce experience has the same effect on log(wage) as another year of tenure with the current employer.
Computer Problem 6(i) Answer: \(H_0: \beta_2 = \beta_3\), or equivalently \(H_0: \beta_2 - \beta_3 = 0\).
Under the null, another year of general workforce experience has the same effect on log(wage) as another year of tenure.
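To see how this null can be tested with a single coefficient, define \(\theta = \beta_2 - \beta_3\) and substitute \(\beta_2 = \theta + \beta_3\) into the wage equation:
\[log(wage) = \beta_0 + \beta_1educ + \theta exper + \beta_3(exper + tenure) + u\]
The coefficient on exper is now \(\theta\), and its t statistic directly tests \(H_0: \theta = 0\); this is the regression estimated in part (ii) below.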
Computer Problem 6(ii): Test the null hypothesis in part (i) against a two-sided alternative, at the 5% significance level, by constructing a 95% confidence interval. What do you conclude?
Computer Problem 6(ii) Answer:
C6Model <- lm(log(wage) ~ I(exper + tenure) + educ + exper, data = wage2)
summary(C6Model)
##
## Call:
## lm(formula = log(wage) ~ I(exper + tenure) + educ + exper, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8282 -0.2401 0.0203 0.2569 1.3400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.496696 0.110528 49.731 < 2e-16 ***
## I(exper + tenure) 0.013375 0.002587 5.170 2.87e-07 ***
## educ 0.074864 0.006512 11.495 < 2e-16 ***
## exper 0.001954 0.004743 0.412 0.681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3877 on 931 degrees of freedom
## Multiple R-squared: 0.1551, Adjusted R-squared: 0.1524
## F-statistic: 56.97 on 3 and 931 DF, p-value: < 2.2e-16
In this parameterization the coefficient on exper is \(\theta = \beta_2 - \beta_3\), so the test reads directly off that row: \(\hat{\theta} = .001954\) with \(se(\hat{\theta}) = .004743\), giving t = 0.41. The 95% confidence interval is \(.001954 \pm 1.96 \times .004743 \approx (-.0073, .0113)\). Because zero lies inside this interval (equivalently, |t| = 0.41 is far below the critical value of 1.96), we cannot reject \(H_0: \beta_2 = \beta_3\) at the 5% level; the data are consistent with a year of experience and a year of tenure having the same effect on log(wage). (Note that the t statistic of 5.170 on I(exper + tenure) tests \(H_0: \beta_3 = 0\), not the null of interest.)
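The confidence interval can also be pulled straight from the fitted model:
# 95% CI for theta = beta2 - beta3, the coefficient on exper in C6Model
confint(C6Model, "exper", level = 0.95) # contains zero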