According to the data, 1,051 counties had zero murders in 1996.
In addition, 31 counties had at least one execution.
The largest number of executions is 3, in the county with countyid 45019.
library(dplyr)  # for %>% and filter()
data_c2 <- wooldridge::countymurders
data_c2 <- data_c2 %>% filter(year == 1996)  # keep only the 1996 cross-section
head(data_c2, 10)
## arrests countyid density popul perc1019 perc2029 percblack percmale
## 1 8 1001 67.21535 40061 15.89077 13.17491 20.975510 48.70073
## 2 6 1003 77.05643 123023 13.93886 11.63929 13.496660 48.83233
## 3 1 1005 29.91548 26475 15.06327 13.69972 46.190750 49.15203
## 4 0 1009 67.20457 43392 14.17542 12.99318 1.415007 48.97446
## 5 1 1011 17.89899 11188 14.98927 14.13121 72.756520 49.91956
## 6 2 1013 27.71148 21530 15.68509 11.25871 41.384110 46.81839
## 7 20 1015 186.53970 113511 14.71135 14.28936 19.096830 47.99447
## 8 4 1017 61.51258 36748 14.65386 13.13813 37.253730 47.31142
## 9 2 1019 38.27024 21170 14.13321 12.13037 7.042985 49.22060
## 10 0 1021 50.89291 35323 14.80339 12.64332 11.921410 48.60006
## rpcincmaint rpcpersinc rpcunemins year murders murdrate arrestrate
## 1 192.038 11852.760 26.796 1996 7 1.7473350 1.9969550
## 2 139.084 13583.020 28.710 1996 6 0.4877137 0.4877137
## 3 405.768 10760.510 63.162 1996 1 0.3777148 0.3777148
## 4 184.382 11094.820 21.692 1996 2 0.4609145 0.0000000
## 5 485.518 8349.506 63.162 1996 0 0.0000000 0.8938148
## 6 357.918 9947.058 54.868 1996 2 0.9289364 0.9289364
## 7 248.820 11536.320 35.090 1996 14 1.2333610 1.7619440
## 8 243.078 10899.590 41.470 1996 3 0.8163710 1.0884950
## 9 200.970 9806.698 26.796 1996 0 0.0000000 0.9447331
## 10 231.594 10819.840 40.194 1996 0 0.0000000 0.0000000
## statefips countyfips execs lpopul execrate
## 1 1 1 0 10.598160 0
## 2 1 3 0 11.720130 0
## 3 1 5 0 10.183960 0
## 4 1 9 0 10.678030 0
## 5 1 11 0 9.322598 0
## 6 1 13 0 9.977202 0
## 7 1 15 0 11.639660 0
## 8 1 17 0 10.511840 0
## 9 1 19 0 9.960340 0
## 10 1 21 0 10.472290 0
zero_murder <- data_c2 %>% filter(murders==0)
zero_murder %>% count()
## n
## 1 1051
execution <- data_c2 %>% filter(execs !=0)
execution %>% count()
## n
## 1 31
max(execution$execs)
## [1] 3
max_ex <- execution %>% filter(execs==3)
max_ex
## arrests countyid density popul perc1019 perc2029 percblack percmale
## 1 34 45019 302.7346 277721 12.68575 17.78944 36.26301 48.23906
## rpcincmaint rpcpersinc rpcunemins year murders murdrate arrestrate statefips
## 1 253.924 14554.06 37.004 1996 32 1.152236 1.22425 45
## countyfips execs lpopul execrate
## 1 19 3 12.53437 0.1080221
Based on the regression output below, the estimated equation is:
murders = 5.4572 + 58.5555*execs
with n = 2,197 observations and an R-squared of 0.0439.
model <- lm(murders~ execs, data= data_c2)
summary(model)
##
## Call:
## lm(formula = murders ~ execs, data = data_c2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.12 -5.46 -4.46 -2.46 1338.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.4572 0.8348 6.537 7.79e-11 ***
## execs 58.5555 5.8333 10.038 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared: 0.04389, Adjusted R-squared: 0.04346
## F-statistic: 100.8 on 1 and 2195 DF, p-value: < 2.2e-16
The coefficient on execs is 58.5555, meaning that one additional execution is associated with nearly 59 additional murders. Hence, the estimated equation does not suggest a deterrent effect of capital punishment, since more executions are associated with more, not fewer, murders.
The smallest predicted number of murders occurs when execs = 0, giving a fitted value of 5.4572.
Hence, the residual for a county with zero executions and zero murders is:
y - yhat = 0 - 5.4572 = -5.4572
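As a quick check (a sketch reusing the model object fitted above; the helper name yhat_zero is introduced here for illustration), the fitted value and residual at execs = 0 can be computed directly:
# Fitted value for a county with zero executions (equals the intercept)
yhat_zero <- predict(model, newdata = data.frame(execs = 0))
yhat_zero
# Residual for a county with zero executions and zero murders
0 - yhat_zero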
Simple regression cannot capture a deterrent effect of capital punishment on murders because the relationship is complex and not linear. A low number of executions means the "price" of committing murder is cheap: when executions rise from, say, 1 to 5, the price is still cheap, which does not prevent murders. Only when executions become much more frequent might the expected consequences be severe enough to deter potential murderers.
It does not make sense to hold sleep, work, and leisure fixed while changing study, because the four activities depend on each other: they must sum to 24 hours in a day. Hence, we cannot change only one variable and keep the others fixed; at least two variables must change at the same time.
This model violates Assumption MLR.3 (no perfect collinearity) because the information in study is already contained in the other variables: with four activities in a day (study, sleep, work, and leisure), knowing sleep, work, and leisure pins down study exactly.
To reformulate the model so that its parameters have a useful interpretation and MLR.3 is satisfied, we can drop one variable. Since study, sleep, and leisure are arguably more relevant for GPA, we drop work. The model becomes:
GPA = B0 + B1*study + B2*sleep + B3*leisure + u
Each coefficient is then interpreted as the effect of increasing that activity at the expense of work.
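As a minimal illustration of the MLR.3 violation (simulated data with hypothetical coefficients, not the actual survey data), lm() returns NA for one of the four time-use variables because they sum to 24:
# Hedged illustration with simulated data: study + sleep + work + leisure = 24,
# so one regressor is an exact linear combination of the others plus a constant.
set.seed(1)
n <- 100
sleep   <- runif(n, 6, 9)
work    <- runif(n, 0, 8)
leisure <- runif(n, 1, 5)
study   <- 24 - sleep - work - leisure
gpa     <- 2 + 0.10 * study + 0.05 * sleep + rnorm(n, sd = 0.3)
coef(lm(gpa ~ study + sleep + work + leisure))  # one coefficient is returned as NA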
If x1 is highly correlated with x2 and x3, and x2 and x3 have large partial effects on y, the simple regression coefficient on x1 will suffer from omitted variable bias: the simple regression does not account for x2 and x3, so part of their effect is attributed to x1. Adding x2 and x3 lets the model isolate the ceteris paribus relationship between y and x1 and gives a more credible estimate of the effect of x1 on y, so the simple and multiple regression coefficients can differ substantially (the direction of the difference depends on the signs of the correlations and of the partial effects).
Because the relationships among x1, x2, and x3 cannot be treated separately, the high correlation between x2 and x3 still matters. Consequently, the two sets of results will differ.
If x1 is highly correlated with x2 and x3, but x2 and x3 have small partial effects on y, I expect se(B1) from the simple regression to be smaller than se(B1) from the multiple regression. Since the impacts of x2 and x3 are small, leaving them out barely increases the error variance, while including them introduces multicollinearity that inflates the standard error of B1.
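A small simulation sketch (made-up data; the variable names and effect sizes are assumptions chosen only for illustration) is consistent with this: when x2 and x3 are highly collinear with x1 but add little to y, the multiple-regression standard error for x1 is much larger than the simple-regression one.
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.2)  # x2 highly correlated with x1
x3 <- 0.9 * x1 + rnorm(n, sd = 0.2)  # x3 highly correlated with x1
y  <- 1 + x1 + 0.01 * x2 + 0.01 * x3 + rnorm(n)  # x2, x3 have small partial effects
# se(B1) from the simple regression vs the multiple regression
summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
summary(lm(y ~ x1 + x2 + x3))$coefficients["x1", "Std. Error"]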
If x1 is highly correlated with x2 and x3, and x2 and x3 have large partial effects on y, I expect se(B1) from the multiple regression to be smaller than se(B1) from the simple regression. Because the impacts of x2 and x3 are important, excluding them pushes their variation into the error term, which both biases the estimate and inflates the error variance; including them reduces the error variance.
The unit of measurement of income is dollars, while prpblck is a proportion (a number between zero and one).
data8 <- wooldridge::discrim
print(paste("The mean of income will be:", mean(data8$income, na.rm=TRUE)))
## [1] "The mean of income will be: 47053.7848410758"
print(paste("The standard deviation of income will be:", sd(data8$income, na.rm=TRUE)))
## [1] "The standard deviation of income will be: 13179.2860689389"
print(paste("The mean of people black will be:", mean(data8$prpblck,na.rm=TRUE)))
## [1] "The mean of people black will be: 0.113486396497833"
print(paste("The standard deviation of people black will be:", sd(data8$prpblck, na.rm=TRUE)))
## [1] "The standard deviation of people black will be: 0.182416467486231"
The coefficient on prpblck is 0.115 in the regression below, so a 10 percentage point increase in the proportion of black residents in a zip code (an increase of 0.10 in prpblck) is associated with a median soda price about 0.0115 USD (roughly one cent) higher. This is not economically large.
# Count non-missing psoda values (note: the regression below uses 401 observations,
# because a few additional rows are missing prpblck or income)
sample_size <- sum(!is.na(data8$psoda))
model <- lm(psoda~ prpblck+income, data=data8)
summary <- summary(model)
summary
##
## Call:
## lm(formula = psoda ~ prpblck + income, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29401 -0.05242 0.00333 0.04231 0.44322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.563e-01 1.899e-02 50.354 < 2e-16 ***
## prpblck 1.150e-01 2.600e-02 4.423 1.26e-05 ***
## income 1.603e-06 3.618e-07 4.430 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08611 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06422, Adjusted R-squared: 0.05952
## F-statistic: 13.66 on 2 and 398 DF, p-value: 1.835e-06
print(paste("The sample size is:",sample_size))
## [1] "The sample size is: 402"
print(paste("The Rsquared is:",summary$r.squared))
## [1] "The Rsquared is: 0.0642203910903628"
In the simple regression below, the coefficient on prpblck is 0.0649, smaller than the 0.115 obtained with income included. Hence, the estimated discrimination effect is larger, not smaller, once we control for income.
model <- lm(psoda~prpblck, data=data8)
summary(model)
##
## Call:
## lm(formula = psoda ~ prpblck, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30884 -0.05963 0.01135 0.03206 0.44840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03740 0.00519 199.87 < 2e-16 ***
## prpblck 0.06493 0.02396 2.71 0.00702 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0881 on 399 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.01808, Adjusted R-squared: 0.01561
## F-statistic: 7.345 on 1 and 399 DF, p-value: 0.007015
In the log-log model below, the coefficient on prpblck is 0.1216, so, for example, a 0.20 increase in prpblck is associated with a soda price that is roughly 0.1216 x 0.20 = 0.024, or about 2.4%, higher.
model <- lm(log(psoda)~prpblck+ log(income), data=data8)
summary(model)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33563 -0.04695 0.00658 0.04334 0.35413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.79377 0.17943 -4.424 1.25e-05 ***
## prpblck 0.12158 0.02575 4.722 3.24e-06 ***
## log(income) 0.07651 0.01660 4.610 5.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0821 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06809, Adjusted R-squared: 0.06341
## F-statistic: 14.54 on 2 and 398 DF, p-value: 8.039e-07
When prppov is added (below), the coefficient on prpblck falls to 0.0728 and its statistical significance also drops (p = 0.018).
model <- lm(log(psoda)~prpblck+ log(income)+prppov, data=data8)
summary(model)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32218 -0.04648 0.00651 0.04272 0.35622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.46333 0.29371 -4.982 9.4e-07 ***
## prpblck 0.07281 0.03068 2.373 0.0181 *
## log(income) 0.13696 0.02676 5.119 4.8e-07 ***
## prppov 0.38036 0.13279 2.864 0.0044 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08137 on 397 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.08696, Adjusted R-squared: 0.08006
## F-statistic: 12.6 on 3 and 397 DF, p-value: 6.917e-08
The correlation is not what I expected: the correlation between the proportion in poverty (prppov) and the soda price is only 0.026, essentially no relationship. I had expected that the poorer the zip code, the lower the price.
# Remove missing values from the vectors
data8_6 <- data8 %>% select(psoda,prppov) %>% na.omit
cor(data8_6)
## psoda prppov
## psoda 1.00000000 0.02598077
## prppov 0.02598077 1.00000000
Because the correlation between log(income) and prppov is very high in absolute value (about -0.84), prppov adds little information once income is already in the model. Including two highly correlated variables also causes multicollinearity. Hence, the statement is true.
# Remove missing values from the vectors
data8_7 <- data8 %>% select(income,prppov) %>% na.omit %>% mutate(income= log(income))
cor(data8_7)
## income prppov
## income 1.000000 -0.838467
## prppov -0.838467 1.000000
Estimated equation: rdintens = 0.472 + 0.321*log(sales) + 0.050*profmarg
Because this is a level-log relationship, a 1% increase in sales raises rdintens (R&D expenditures as a percentage of sales) by about 0.321/100 = 0.0032 percentage points.
Hence, a 10% increase in sales raises R&D intensity by only about 0.032 percentage points.
This is not an economically large effect: for reference, an E&Y report from 2023 put average R&D spending in the sector at roughly 2.5% of revenue, and 0.03 percentage points is small relative to that.
rdintens = B0 + B1*log(sales) + B2*profmarg
We test H0: B1 = 0 against H1: B1 ≠ 0.
t-value = 0.321/0.216 = 1.486
With n = 32 and three estimated parameters, df = 32 - 3 = 29:
Critical value at the 5% level of significance: about 2.045
Critical value at the 10% level of significance: about 1.699
Because 1.486 < 1.699 < 2.045, we cannot reject H0 at either level. Consequently, we cannot conclude that an increase in sales raises R&D intensity.
For profmarg, a one percentage point increase in the profit margin is associated with a 0.05 percentage point increase in R&D intensity. This is not economically large relative to an average R&D intensity of roughly 2.5%, although for a large firm even a 0.05 percentage point increase can represent a meaningful amount of spending.
rdintens = B0 + B1*log(sales) + B2*profmarg
We test H0: B2 = 0 against H1: B2 ≠ 0.
t-value = 0.050/0.046 = 1.087
With n = 32 and df = 29:
Critical value at the 5% level of significance: about 2.045
Critical value at the 10% level of significance: about 1.699
Because 1.087 < 1.699 < 2.045, we cannot reject H0. Consequently, we cannot conclude that an increase in the profit margin raises R&D intensity.
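As a quick numerical check (a sketch using the coefficients and standard errors reported above, with df = 29), the two-sided p-values confirm that neither t statistic is significant at conventional levels:
# t statistics for log(sales) and profmarg from the reported equation
t_stats <- c(lsales = 0.321 / 0.216, profmarg = 0.050 / 0.046)
t_stats
# Two-sided p-values with 29 degrees of freedom
2 * pt(-abs(t_stats), df = 29)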
There are 2,017 single-person households in the data set.
data_c4 <- wooldridge::k401ksubs
head(data_c4,10)
## e401k inc marr male age fsize nettfa p401k pira incsq agesq
## 1 0 13.170 0 0 40 1 4.575 0 1 173.4489 1600
## 2 1 61.230 0 1 35 1 154.000 1 0 3749.1128 1225
## 3 0 12.858 1 0 44 2 0.000 0 0 165.3282 1936
## 4 0 98.880 1 1 44 2 21.800 0 0 9777.2539 1936
## 5 0 22.614 0 0 53 1 18.450 0 0 511.3930 2809
## 6 0 15.000 1 0 60 3 0.000 0 0 225.0000 3600
## 7 0 37.155 1 0 49 5 3.483 0 1 1380.4939 2401
## 8 0 31.896 1 0 38 5 -2.100 0 0 1017.3548 1444
## 9 0 47.295 1 0 52 2 5.290 0 1 2236.8169 2704
## 10 1 29.100 0 1 45 1 29.600 0 1 846.8100 2025
data_c4 <- data_c4 %>% filter(fsize==1)
data_c4 %>% count()
## n
## 1 2017
Based on the output, the estimated equation is: nettfa = -43.0398 + 0.7993*inc + 0.8427*age
The R-squared is 0.1193, meaning that income and age explain about 11.93% of the variation in net financial wealth.
It makes sense that higher income and higher age are associated with higher net financial wealth. The intercept is negative: taken literally, a person with zero income and zero age would start out borrowing, with net wealth of about -43.04, i.e., roughly -$43,000, since nettfa is measured in thousands of dollars.
However, it is surprising that the slope coefficients are so small; one might expect changes in income and age to have a larger effect on net financial wealth.
model <- lm(nettfa ~ inc+age, data= data_c4)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + age, data = data_c4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.95 -14.16 -3.42 6.03 1113.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.03981 4.08039 -10.548 <2e-16 ***
## inc 0.79932 0.05973 13.382 <2e-16 ***
## age 0.84266 0.09202 9.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.1185
## F-statistic: 136.5 on 2 and 2014 DF, p-value: < 2.2e-16
The intercept itself is not very interesting: it is the predicted nettfa for someone with zero income and zero age, and since age = 0 is not meaningful for this sample, the intercept has no useful interpretation.
We test H0: B2 = 1 against H1: B2 < 1.
The t-value is (0.84266 - 1)/0.09202 = -1.710.
The one-sided p-value for t = -1.710 is about 0.044, which is greater than 0.01.
Hence, at the 1% level of significance, we cannot reject H0.
(0.84266-1)/0.09202
## [1] -1.709846
pnorm((0.84266-1)/0.09202)
## [1] 0.04364721
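For comparison (with 2014 degrees of freedom the t and standard normal distributions are nearly identical), the t distribution gives essentially the same one-sided p-value:
pt((0.84266 - 1) / 0.09202, df = 2014)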
model <- lm(nettfa~inc, data= data_c4)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc, data = data_c4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185.12 -12.85 -4.85 1.78 1112.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.5709 2.0607 -5.13 3.18e-07 ***
## inc 0.8207 0.0609 13.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.59 on 2015 degrees of freedom
## Multiple R-squared: 0.08267, Adjusted R-squared: 0.08222
## F-statistic: 181.6 on 1 and 2015 DF, p-value: < 2.2e-16
cor(data_c4$age, data_c4$inc)
## [1] 0.03905864
With the simple regression, the estimate of B1 (0.821) is still similar to that in part (ii) (0.799) because income and age are only weakly correlated (correlation of about 0.039). However, the intercept changes a lot when age is not controlled for.
If I use a normal distribution for score, the probability that a score exceeds 100 is not zero, because the normal distribution has unbounded support. In the actual data, however, 100 is the highest possible score, so this contradicts the assumption of a normal distribution for score.
The normal distribution does not fit well in the left tail: the distribution of scores is left-skewed, with a longer tail of low scores than a normal distribution would imply.
data_c5 <- wooldridge::wage1
# Level model: wage in levels
model_level <- lm(wage ~ educ + exper + tenure, data = data_c5)
hist(residuals(model_level), main = "Histogram of Residuals (level model)", xlab = "Residuals")
# Log model: log(wage) as the dependent variable
model_log <- lm(log(wage) ~ educ + exper + tenure, data = data_c5)
hist(residuals(model_log), main = "Histogram of Residuals (log model)", xlab = "Residuals")
As can be seen from the figures, MLR.6 is closer to being satisfied in the log-level model, because the distribution of its residuals is much closer to normal.
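As a complementary check (a sketch reusing the two models fitted above), normal QQ-plots make the comparison easier to see than histograms:
# Normal QQ-plots of the residuals from the level and log models
par(mfrow = c(1, 2))
qqnorm(residuals(model_level), main = "Level model"); qqline(residuals(model_level))
qqnorm(residuals(model_log), main = "Log model"); qqline(residuals(model_log))
par(mfrow = c(1, 1))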
The marginal effect of sales on rdintens is 0.00030 - 2(0.0000000070)*sales = 0.00030 - 0.000000014*sales. Hence, it becomes negative when:
0.00030 < 0.000000014*sales, i.e., when sales > 0.00030/0.000000014 = 21,428.57 (millions of dollars).
The t-value on sales^2 is -0.0000000070/0.0000000037 = -1.89. Its absolute value exceeds the 10% two-sided critical value (about 1.70 with 29 df) but not the 5% value (about 2.045), so the quadratic term is statistically significant at the 10% level, and I would keep it.
With sales measured in billions of dollars (salesbil = sales/1000), the equation becomes:
rdintens = 2.613 + 0.30*salesbil - 0.0070*salesbil^2
For the purpose of reporting, I prefer the equation in part (iii) because it contains fewer zeros to the right of the decimal point.
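A quick arithmetic check of the turning point and of the t ratio used above (the standard error on sales^2 is taken to be 0.0000000037, matching the -70/37 ratio):
# Sales level (millions of dollars) at which the marginal effect turns negative
0.00030 / (2 * 0.0000000070)
# t statistic for the coefficient on sales^2
-0.0000000070 / 0.0000000037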
For lexppp:
Model 1 has t-value = 9.01/4.04 = 2.23
Model 2 has t-value = 1.93/2.82 = 0.68
Hence, lexppp is statistically significant in model 1 but not in model 2.
Model 1 is more relevant, because model 2 includes the variable read4, which adds little to the explanation of math4 and is highly correlated with the other explanatory variables in the model.
Using model 1, a 10% increase in expenditures per student is predicted to raise the percentage of students passing the math test by about 9.01/10 = 0.9 percentage points.
Intuitively, reading and math pass rates are determined by many of the same factors, so adding read4 causes strange effects in the estimates: it is highly correlated with the other important explanatory variables and soaks up variation that belongs to them.
When selecting a model, capturing the causal relationship of interest is more fundamental than merely maximizing statistical fit; choosing variables only to improve fit can easily lead to an inadequate interpretation of the economic phenomenon.
The statistical significance and interpretability of the chosen explanatory variables provide a more robust basis for interpretation than a higher, but potentially misleading, R-squared.
Model: log(wage) = B0 + B1*educ + B2*exper + B3*educ*exper + u
The return to another year of education is dlog(wage)/deduc = B1 + B3*exper, which depends on the level of exper whenever B3 is nonzero.
The null hypothesis that the return to education does not depend on the level of exper is:
H0: B3 = 0
A more appropriate alternative is H1: B3 > 0, because people with more work experience may be more productive and therefore gain more from an additional year of education.
data_c6 <- wooldridge::wage2
model <- lm(log(wage) ~educ+ exper+ educ*exper, data= data_c6)
summary(model)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + educ * exper, data = data_c6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88558 -0.24553 0.03558 0.26171 1.28836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.949455 0.240826 24.704 <2e-16 ***
## educ 0.044050 0.017391 2.533 0.0115 *
## exper -0.021496 0.019978 -1.076 0.2822
## educ:exper 0.003203 0.001529 2.095 0.0365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3923 on 931 degrees of freedom
## Multiple R-squared: 0.1349, Adjusted R-squared: 0.1321
## F-statistic: 48.41 on 3 and 931 DF, p-value: < 2.2e-16
t <- 0.003203/0.001529
pnorm(-t)
## [1] 0.01809291
From this result, the one-sided p-value is about 0.018 (1.8%), which is less than 2%. Hence, we can reject H0: B3 = 0 in favor of H1: B3 > 0 at the 2% level of significance (and at any level above it).
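For comparison (with 931 degrees of freedom the t and standard normal distributions are almost identical), the t distribution gives essentially the same one-sided p-value:
pt(-(0.003203 / 0.001529), df = 931)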
We can rearrange the model. Define theta1 = B1 + 10*B3, the return to education when exper = 10, so B1 = theta1 - 10*B3. Then:
log(wage) = B0 + (theta1 - 10*B3)*educ + B2*exper + B3*educ*exper + u
-> log(wage) = B0 + theta1*educ + B2*exper + B3*educ*(exper - 10) + u
data_c6 <- data_c6 %>% mutate(exper_10 = exper - 10)
# exper_10 is perfectly collinear with exper, so lm() drops it (the NA row below);
# the coefficient on educ then equals theta1, the return to education at exper = 10
model <- lm(log(wage) ~ educ + exper + educ * exper_10, data = data_c6)
summary(model)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + educ * exper_10, data = data_c6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88558 -0.24553 0.03558 0.26171 1.28836
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.949455 0.240826 24.704 <2e-16 ***
## educ 0.076080 0.006615 11.501 <2e-16 ***
## exper -0.021496 0.019978 -1.076 0.2822
## exper_10 NA NA NA NA
## educ:exper_10 0.003203 0.001529 2.095 0.0365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3923 on 931 degrees of freedom
## Multiple R-squared: 0.1349, Adjusted R-squared: 0.1321
## F-statistic: 48.41 on 3 and 931 DF, p-value: < 2.2e-16
From the result we get theta1 = 0.07608. The 95% confidence interval for theta1 is 0.07608 ± 1.96 x 0.006615, i.e., approximately 0.063 (lower bound) to 0.089 (upper bound).
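As a cross-check (reusing the reparameterized model above, in which the educ coefficient is theta1), confint() should give essentially the same interval:
confint(model, "educ", level = 0.95)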
The youngest person in this sample is 25 years old, and there are 99 people of that age.
data_c6_12 <- wooldridge::k401ksubs
data_c6_12 <- data_c6_12 %>% filter(fsize==1)
min(data_c6_12$age)
## [1] 25
youngest <- data_c6_12 %>% filter(age==25)
youngest %>% count()
## n
## 1 99
model <- lm(nettfa~inc+age+I(age^2), data= data_c6_12)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + age + I(age^2), data = data_c6_12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.36 -13.58 -2.97 5.67 1116.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.204212 15.280667 -0.079 0.93719
## inc 0.824816 0.060298 13.679 < 2e-16 ***
## age -1.321815 0.767496 -1.722 0.08518 .
## I(age^2) 0.025562 0.008999 2.841 0.00455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1216
## F-statistic: 93.99 on 3 and 2013 DF, p-value: < 2.2e-16
B2 is the coefficient on age. Looking at B2 alone would suggest that the older a person is, the lower their net wealth, but that reading is misleading: because of the quadratic term, the partial effect of age is B2 + 2*B3*age, which is negative only for ages below about 26 and positive thereafter.
We have the estimated equation: nettfa = -1.204 + 0.825*inc - 1.322*age + 0.0256*age^2
b1 = 0.825 indicates that a $1,000 increase in income is associated with an $825 increase in net financial wealth (both variables are measured in thousands of dollars).
b2 < 0 and b3 > 0 mean that age and net wealth have a U-shaped relationship.
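The bottom of this U-shape can be checked directly from the reported coefficients; it lies just above the youngest age in the sample:
# Age at which the partial effect of age on nettfa turns from negative to positive
1.321815 / (2 * 0.025562)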
Define theta2 = b2 + 50*b3, the partial effect of age at age = 25 (since dnettfa/dage = b2 + 2*b3*age), so b2 = theta2 - 50*b3. The model becomes:
nettfa = b0 + b1*inc + (theta2 - 50*b3)*age + b3*age^2
-> nettfa = b0 + b1*inc + theta2*age + b3*(age^2 - 50*age)
data_c6_12 <- data_c6_12 %>% mutate(age_50 = 50 * age)
# The coefficient on age in this regression is theta2, the partial effect of age at age 25
model <- lm(nettfa ~ inc + age + I(age^2 - age_50), data = data_c6_12)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + age + I(age^2 - age_50), data = data_c6_12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.36 -13.58 -2.97 5.67 1116.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.204212 15.280667 -0.079 0.93719
## inc 0.824816 0.060298 13.679 < 2e-16 ***
## age -0.043695 0.325270 -0.134 0.89315
## I(age^2 - age_50) 0.025562 0.008999 2.841 0.00455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1216
## F-statistic: 93.99 on 3 and 2013 DF, p-value: < 2.2e-16
From the result, theta2 is very small (-0.044) and statistically insignificant: at age 25, the marginal effect of age on net wealth is essentially zero.
The result below shows that when age is excluded from the model (keeping only inc and (age - 25)^2), the R-squared does not decrease. With one fewer regressor, the adjusted R-squared is slightly higher, so the more parsimonious model fits at least as well.
data_c6_12 <- data_c6_12 %>% mutate(age_25 = age-25)
model <- lm(nettfa~inc+I(age_25^2), data= data_c6_12)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + I(age_25^2), data = data_c6_12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.37 -13.61 -3.01 5.63 1116.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.488105 2.177584 -8.490 <2e-16 ***
## inc 0.823571 0.059567 13.826 <2e-16 ***
## I(age_25^2) 0.024403 0.002541 9.605 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.59 on 2014 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.122
## F-statistic: 141 on 2 and 2014 DF, p-value: < 2.2e-16
Setting inc = 30 (an income of $30,000), the estimated relationship is:
nettfa = -18.4881 + 0.8236*30 + 0.0244*(age - 25)^2 = 6.219 + 0.0244*(age - 25)^2
# Generate (age - 25) values from 0 to 50, i.e., ages 25 to 75
x_values <- seq(0, 50, length.out = 50)
# Compute predicted nettfa at inc = 30 using the equation above
y_values <- 6.219 + 0.024403 * x_values^2
# Plot the curve
plot(x_values+25, y_values, type = "l", col = "blue", lwd = 2, xlab = "Age", ylab = "Net Wealth", main = "Plot of relationship between nettfa and age")
For ages above 25, predicted net wealth rises with age, and the marginal effect of age itself increases with age, so the curve becomes steeper.
There is no need to add inc^2: its coefficient is tiny and statistically insignificant in the model below (t = -0.27).
data_c6_12 <- data_c6_12 %>% mutate(age_25 = age-25)
model <- lm(nettfa~inc+I(inc^2)+I(age_25^2), data= data_c6_12)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + I(inc^2) + I(age_25^2), data = data_c6_12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.46 -13.66 -3.00 5.76 1116.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.930e+01 3.688e+00 -5.234 1.83e-07 ***
## inc 8.722e-01 1.877e-01 4.648 3.57e-06 ***
## I(inc^2) -5.405e-04 1.978e-03 -0.273 0.785
## I(age_25^2) 2.440e-02 2.541e-03 9.603 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1216
## F-statistic: 94.01 on 3 and 2013 DF, p-value: < 2.2e-16