library(dplyr)
library(dslabs)
library(ISLR2)
library(matlib)
library(wooldridge)
data("countymurders")
data1 <- countymurders
attach(data1)
data1 <- data1 %>% filter(year=="1996")
head(data1)
## arrests countyid density popul perc1019 perc2029 percblack percmale
## 1 8 1001 67.21535 40061 15.89077 13.17491 20.975510 48.70073
## 2 6 1003 77.05643 123023 13.93886 11.63929 13.496660 48.83233
## 3 1 1005 29.91548 26475 15.06327 13.69972 46.190750 49.15203
## 4 0 1009 67.20457 43392 14.17542 12.99318 1.415007 48.97446
## 5 1 1011 17.89899 11188 14.98927 14.13121 72.756520 49.91956
## 6 2 1013 27.71148 21530 15.68509 11.25871 41.384110 46.81839
## rpcincmaint rpcpersinc rpcunemins year murders murdrate arrestrate statefips
## 1 192.038 11852.760 26.796 1996 7 1.7473350 1.9969550 1
## 2 139.084 13583.020 28.710 1996 6 0.4877137 0.4877137 1
## 3 405.768 10760.510 63.162 1996 1 0.3777148 0.3777148 1
## 4 184.382 11094.820 21.692 1996 2 0.4609145 0.0000000 1
## 5 485.518 8349.506 63.162 1996 0 0.0000000 0.8938148 1
## 6 357.918 9947.058 54.868 1996 2 0.9289364 0.9289364 1
## countyfips execs lpopul execrate
## 1 1 0 10.598160 0
## 2 3 0 11.720130 0
## 3 5 0 10.183960 0
## 4 9 0 10.678030 0
## 5 11 0 9.322598 0
## 6 13 0 9.977202 0
nrow(data1[data1$murders==0,])
## [1] 1051
nrow(data1[data1$execs>=1,])
## [1] 31
max(execs)
## [1] 7
model_c9 <- lm(murders ~ execs)
summary(model_c9)
##
## Call:
## lm(formula = murders ~ execs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -202.23 -6.84 -5.84 -3.84 1937.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8382 0.2418 28.28 <2e-16 ***
## execs 65.4650 2.1463 30.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.64 on 37347 degrees of freedom
## Multiple R-squared: 0.02431, Adjusted R-squared: 0.02428
## F-statistic: 930.4 on 1 and 37347 DF, p-value: < 2.2e-16
*y = a0 + a1x1 + e1*
*y = b0 + b1x1 + b2x2 + b3x3 + e2*
data("discrim")
data2 <- discrim
attach(discrim)
mean(prpblck, na.rm = T)
## [1] 0.1134864
mean(income, na.rm = T)
## [1] 47053.78
sd(prpblck, na.rm = T)
## [1] 0.1824165
sd(income, na.rm = T)
## [1] 13179.29
model_c8 <- lm(psoda~prpblck+income, data = discrim)
summary(model_c8)
##
## Call:
## lm(formula = psoda ~ prpblck + income, data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29401 -0.05242 0.00333 0.04231 0.44322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.563e-01 1.899e-02 50.354 < 2e-16 ***
## prpblck 1.150e-01 2.600e-02 4.423 1.26e-05 ***
## income 1.603e-06 3.618e-07 4.430 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08611 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06422, Adjusted R-squared: 0.05952
## F-statistic: 13.66 on 2 and 398 DF, p-value: 1.835e-06
model_c8_1 <- lm(psoda~prpblck, data = discrim)
summary(model_c8_1)
##
## Call:
## lm(formula = psoda ~ prpblck, data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30884 -0.05963 0.01135 0.03206 0.44840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03740 0.00519 199.87 < 2e-16 ***
## prpblck 0.06493 0.02396 2.71 0.00702 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0881 on 399 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.01808, Adjusted R-squared: 0.01561
## F-statistic: 7.345 on 1 and 399 DF, p-value: 0.007015
model_c8_2 <- lm(log(psoda)~prpblck+log(income), data = discrim)
summary(model_c8_2)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33563 -0.04695 0.00658 0.04334 0.35413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.79377 0.17943 -4.424 1.25e-05 ***
## prpblck 0.12158 0.02575 4.722 3.24e-06 ***
## log(income) 0.07651 0.01660 4.610 5.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0821 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06809, Adjusted R-squared: 0.06341
## F-statistic: 14.54 on 2 and 398 DF, p-value: 8.039e-07
paste( (0.2*100)*0.122, "percent increase")
## [1] "2.44 percent increase"
model_c8_3 <- lm(log(psoda)~prpblck+log(income)+prppov)
model_c8_3
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov)
##
## Coefficients:
## (Intercept) prpblck log(income) prppov
## -1.46333 0.07281 0.13696 0.38036
Adding prppov causes the prpblck coefficient to fall to 0.0738.
cor(log(discrim$income), discrim$prppov, method = "pearson", use = "complete.obs")
## [1] -0.838467
data("rdchem")
attach(rdchem)
paste(log(1.1)*0.321*100, "percentages change in rdintens ")
## [1] "3.05945677171883 percentages change in rdintens "
model_c3 <- lm(rdintens ~ log(sales) + profmarg)
summary(model_c3)
##
## Call:
## lm(formula = rdintens ~ log(sales) + profmarg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3016 -1.2707 -0.6895 0.8785 6.0369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.47225 1.67606 0.282 0.780
## log(sales) 0.32135 0.21557 1.491 0.147
## profmarg 0.05004 0.04578 1.093 0.283
##
## Residual standard error: 1.839 on 29 degrees of freedom
## Multiple R-squared: 0.09847, Adjusted R-squared: 0.0363
## F-statistic: 1.584 on 2 and 29 DF, p-value: 0.2224
The 5% critical value for a one-tailedtest, with df = 32 − 3 = 29, is obtained from Table G.2 as 1.699; so we cannot reject H0 at the 5% level. But the 10% critical value is 1.311; since the t statistic is abovethis value, we reject H0 in favor of H1 at the 10% level.
One percentage unit change in profmarg interprets 0.05 percentage unit change in rdintens. Economically that is quite significant, as given a10 % increase in profit margin then they will increase expenditures on R& D by 0.5 percentage point.
###C8: 401 KSUBS
data("k401ksubs")
data4 <- k401ksubs %>% filter(fsize==1)
dim(data4)
## [1] 2017 11
attach(data4)
model_c4 <- lm(nettfa ~ inc+ age)
summary(model_c4)
##
## Call:
## lm(formula = nettfa ~ inc + age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.95 -14.16 -3.42 6.03 1113.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.03981 4.08039 -10.548 <2e-16 ***
## inc 0.79932 0.05973 13.382 <2e-16 ***
## age 0.84266 0.09202 9.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.1185
## F-statistic: 136.5 on 2 and 2014 DF, p-value: < 2.2e-16
min(age)
## [1] 25
min(inc)
## [1] 10.008
income <- lm(inc ~ age)
income
##
## Call:
## lm(formula = inc ~ age)
##
## Coefficients:
## (Intercept) age
## 27.08278 0.06017
cov(log(inc), age)
## [1] 0.1907963
data("wage1")
attach(wage1)
model_wage <- lm( wage ~ educ + exper + tenure )
summary(model_wage)
##
## Call:
## lm(formula = wage ~ educ + exper + tenure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6068 -1.7747 -0.6279 1.1969 14.6536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.87273 0.72896 -3.941 9.22e-05 ***
## educ 0.59897 0.05128 11.679 < 2e-16 ***
## exper 0.02234 0.01206 1.853 0.0645 .
## tenure 0.16927 0.02164 7.820 2.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared: 0.3064, Adjusted R-squared: 0.3024
## F-statistic: 76.87 on 3 and 522 DF, p-value: < 2.2e-16
hist(model_wage$residuals)
model_lg_wage <- lm(log(wage) ~ educ + exper + tenure)
summary(model_lg_wage)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05802 -0.29645 -0.03265 0.28788 1.42809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.284360 0.104190 2.729 0.00656 **
## educ 0.092029 0.007330 12.555 < 2e-16 ***
## exper 0.004121 0.001723 2.391 0.01714 *
## tenure 0.022067 0.003094 7.133 3.29e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared: 0.316, Adjusted R-squared: 0.3121
## F-statistic: 80.39 on 3 and 522 DF, p-value: < 2.2e-16
hist(model_lg_wage$residuals)
log(wage) = .284 + .092 educ + .0041 exper + .022 tenure (.104) (.007) (.0017) (.003) n = 526, R2 = .316, σ = .441. The histogram for the residuals from this equation, with the best-fitting normal distribution overlaid, is given above.
The residuals derived from the log(wage) regression showcase a closer resemblance to a normal distribution. Specifically, when comparing the histograms between part (i) and part (ii), the histogram in part (ii) aligns more closely with the expected normal density. Besides, the histogram illustrating wage residuals displays a noticeable left skew. In the wage regression analysis, there exist certain exceptionally large residuals, approximately valued at 15, positioned nearly five estimated standard deviations (σ = 3.085) away from the mean of the residuals, which inherently is zero. However, these residuals far from zero seem to pose less of an issue in the log(wage) regression analysis.
model_rd6 <- lm(rdintens ~ sales + I(sales^2))
summary(model_rd6)
##
## Call:
## lm(formula = rdintens ~ sales + I(sales^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1418 -1.3630 -0.2257 1.0688 5.5808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.613e+00 4.294e-01 6.084 1.27e-06 ***
## sales 3.006e-04 1.393e-04 2.158 0.0394 *
## I(sales^2) -6.946e-09 3.726e-09 -1.864 0.0725 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.788 on 29 degrees of freedom
## Multiple R-squared: 0.1484, Adjusted R-squared: 0.08969
## F-statistic: 2.527 on 2 and 29 DF, p-value: 0.09733
salesbil <- sales/1000
model_salebil <- lm(rdintens~ salesbil+ I(salesbil^2))
summary(model_salebil)
##
## Call:
## lm(formula = rdintens ~ salesbil + I(salesbil^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1418 -1.3630 -0.2257 1.0688 5.5808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.612512 0.429442 6.084 1.27e-06 ***
## salesbil 0.300571 0.139295 2.158 0.0394 *
## I(salesbil^2) -0.006946 0.003726 -1.864 0.0725 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.788 on 29 degrees of freedom
## Multiple R-squared: 0.1484, Adjusted R-squared: 0.08969
## F-statistic: 2.527 on 2 and 29 DF, p-value: 0.09733
data("wage2")
model_wage2 <- lm(log(wage) ~ educ + exper + educ*exper, data = wage2)
summary(model_wage2)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + educ * exper, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88558 -0.24553 0.03558 0.26171 1.28836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.949455 0.240826 24.704 <2e-16 ***
## educ 0.044050 0.017391 2.533 0.0115 *
## exper -0.021496 0.019978 -1.076 0.2822
## educ:exper 0.003203 0.001529 2.095 0.0365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3923 on 931 degrees of freedom
## Multiple R-squared: 0.1349, Adjusted R-squared: 0.1321
## F-statistic: 48.41 on 3 and 931 DF, p-value: < 2.2e-16
t <- 0.003203/0.001529
t
## [1] 2.094833
min(data4$age)
## [1] 25
data4 %>% filter(age==25) %>% count()
## n
## 1 99
model_c12 <- lm(nettfa ~ inc + age +I(age^2), data = data4)
summary(model_c12)
##
## Call:
## lm(formula = nettfa ~ inc + age + I(age^2), data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.36 -13.58 -2.97 5.67 1116.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.204212 15.280667 -0.079 0.93719
## inc 0.824816 0.060298 13.679 < 2e-16 ***
## age -1.321815 0.767496 -1.722 0.08518 .
## I(age^2) 0.025562 0.008999 2.841 0.00455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1216
## F-statistic: 93.99 on 3 and 2013 DF, p-value: < 2.2e-16
One literal interpretation is that β 2is the increase in nettfa when age increases by one year, holding fixed inc and age^2. But in this application, the partial effect starting at age = 0 is not interesting; the sample represents single people at least 25 years old.
nettfa = −1.20 + .825 inc − 1.322 age + .0256 age2
(15.28) (.060) (0.767) (.0090)
n = 2,017, R2 = .1229 Initially, the negative coefficient on age may
seem counterintuitive. The estimated relationship is a U-shape, but, to
make sense of it, we need to find the turning point in the
quadratic.
model_c12 <- lm(nettfa ~ inc + age + I((age-25)^2), data = data4)
summary(model_c12)
##
## Call:
## lm(formula = nettfa ~ inc + age + I((age - 25)^2), data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.36 -13.58 -2.97 5.67 1116.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.180714 9.973145 -1.723 0.08510 .
## inc 0.824816 0.060298 13.679 < 2e-16 ***
## age -0.043695 0.325270 -0.134 0.89315
## I((age - 25)^2) 0.025562 0.008999 2.841 0.00455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1216
## F-statistic: 93.99 on 3 and 2013 DF, p-value: < 2.2e-16
The result are : nettfa = −17.20 + .825 inc − .0437 age + .0256 (age
−25)^2 (9.97) (.060) (.767) (.0090)
n = 2,017, R2 = .1229 Therefore, the estimated partial effect starting
at age = 25 is only −.044 and very statistically insignificant (t =
−.13). The two-sided p-value is about .89
model_c12 <- lm(nettfa ~ inc + I((age-25)^2), data = data4)
summary(model_c12)
##
## Call:
## lm(formula = nettfa ~ inc + I((age - 25)^2), data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.37 -13.61 -3.01 5.63 1116.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.488105 2.177584 -8.490 <2e-16 ***
## inc 0.823571 0.059567 13.826 <2e-16 ***
## I((age - 25)^2) 0.024403 0.002541 9.605 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.59 on 2014 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.122
## F-statistic: 141 on 2 and 2014 DF, p-value: < 2.2e-16
The adjusted R-squared is slightly higher when we drop age. But the real reason for dropping age is that its t statistic is quite small, and the model without it has a straightforward interpretation
# Generate x values
X_nettf<- seq(0, 50, length.out = 50)
# Compute corresponding y values using the equation
y_age <- 6.219 + 0.024403 * X_nettf^2
# Plot the curve
plot(X_nettf+25, y_age, type = "l", col = "yellow", lwd = 2, xlab = "Age", ylab = "Net Wealth", main = "Plot of relationship between nettfa and age")
The slope of the relationship between nnettfa and age is clearly
increasing. That is, there is an increasing marginal effect. The model
is constructed so that the slope is zero at age = 25; from there, the
slope increases.