According to the data, 1,051 counties had zero murders in 1996.
In addition, 31 counties had at least one execution.
The largest number of executions is 3, in the county with countyid 45019.
library(dplyr)  # for %>% and filter()
data_c2 <- wooldridge::countymurders
data_c2 <- data_c2 %>% filter(year == 1996)  # keep only the 1996 cross-section
head(data_c2, 10)
## arrests countyid density popul perc1019 perc2029 percblack percmale
## 1 8 1001 67.21535 40061 15.89077 13.17491 20.975510 48.70073
## 2 6 1003 77.05643 123023 13.93886 11.63929 13.496660 48.83233
## 3 1 1005 29.91548 26475 15.06327 13.69972 46.190750 49.15203
## 4 0 1009 67.20457 43392 14.17542 12.99318 1.415007 48.97446
## 5 1 1011 17.89899 11188 14.98927 14.13121 72.756520 49.91956
## 6 2 1013 27.71148 21530 15.68509 11.25871 41.384110 46.81839
## 7 20 1015 186.53970 113511 14.71135 14.28936 19.096830 47.99447
## 8 4 1017 61.51258 36748 14.65386 13.13813 37.253730 47.31142
## 9 2 1019 38.27024 21170 14.13321 12.13037 7.042985 49.22060
## 10 0 1021 50.89291 35323 14.80339 12.64332 11.921410 48.60006
## rpcincmaint rpcpersinc rpcunemins year murders murdrate arrestrate
## 1 192.038 11852.760 26.796 1996 7 1.7473350 1.9969550
## 2 139.084 13583.020 28.710 1996 6 0.4877137 0.4877137
## 3 405.768 10760.510 63.162 1996 1 0.3777148 0.3777148
## 4 184.382 11094.820 21.692 1996 2 0.4609145 0.0000000
## 5 485.518 8349.506 63.162 1996 0 0.0000000 0.8938148
## 6 357.918 9947.058 54.868 1996 2 0.9289364 0.9289364
## 7 248.820 11536.320 35.090 1996 14 1.2333610 1.7619440
## 8 243.078 10899.590 41.470 1996 3 0.8163710 1.0884950
## 9 200.970 9806.698 26.796 1996 0 0.0000000 0.9447331
## 10 231.594 10819.840 40.194 1996 0 0.0000000 0.0000000
## statefips countyfips execs lpopul execrate
## 1 1 1 0 10.598160 0
## 2 1 3 0 11.720130 0
## 3 1 5 0 10.183960 0
## 4 1 9 0 10.678030 0
## 5 1 11 0 9.322598 0
## 6 1 13 0 9.977202 0
## 7 1 15 0 11.639660 0
## 8 1 17 0 10.511840 0
## 9 1 19 0 9.960340 0
## 10 1 21 0 10.472290 0
zero_murder <- data_c2 %>% filter(murders==0)
zero_murder %>% count()
## n
## 1 1051
execution <- data_c2 %>% filter(execs !=0)
execution %>% count()
## n
## 1 31
max(execution$execs)
## [1] 3
max_ex <- execution %>% filter(execs==3)
max_ex
## arrests countyid density popul perc1019 perc2029 percblack percmale
## 1 34 45019 302.7346 277721 12.68575 17.78944 36.26301 48.23906
## rpcincmaint rpcpersinc rpcunemins year murders murdrate arrestrate statefips
## 1 253.924 14554.06 37.004 1996 32 1.152236 1.22425 45
## countyfips execs lpopul execrate
## 1 19 3 12.53437 0.1080221
Based on the regression output below, the estimated equation is:
murders = 5.4572 + 58.5555*execs
with n = 2,197 observations and an R-squared of 0.0439.
model <- lm(murders~ execs, data= data_c2)
summary(model)
##
## Call:
## lm(formula = murders ~ execs, data = data_c2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.12 -5.46 -4.46 -2.46 1338.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.4572 0.8348 6.537 7.79e-11 ***
## execs 58.5555 5.8333 10.038 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared: 0.04389, Adjusted R-squared: 0.04346
## F-statistic: 100.8 on 1 and 2195 DF, p-value: < 2.2e-16
The coefficient on execs is 58.5555, meaning that one additional execution is associated with nearly 59 additional murders. Hence, the estimated equation does not suggest a deterrent effect of capital punishment, since more executions are associated with more, not fewer, murders.
The smallest predicted number of murders occurs when execs = 0, giving a fitted value of 5.4572.
Hence, the residual for a county with zero executions and zero murders is:
y - yhat = 0 - 5.4572 = -5.4572
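As a quick check (a sketch reusing the model object fitted above; the helper name yhat_zero is introduced here for illustration), the fitted value and residual at execs = 0 can be computed directly:
# Fitted value for a county with zero executions (equals the intercept)
yhat_zero <- predict(model, newdata = data.frame(execs = 0))
yhat_zero
# Residual for a county with zero executions and zero murders
0 - yhat_zero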
Simple regression cannot capture a deterrent effect of capital punishment on murders because the relationship is complex and not linear. A low number of executions means the "price" of committing murder is cheap: when executions rise from, say, 1 to 5, the price is still cheap, which does not prevent murders. Only when executions become much more frequent might the expected consequences be severe enough to deter potential murderers.
It does not make sense to hold sleep, work, and leisure fixed while changing study, because the four activities depend on each other: they must sum to 24 hours in a day. Hence, we cannot change only one variable and keep the others fixed; at least two variables must change at the same time.
This model violates Assumption MLR.3 (no perfect collinearity) because the information in study is already contained in the other variables: with four activities in a day (study, sleep, work, and leisure), knowing sleep, work, and leisure pins down study exactly.
To reformulate the model so that its parameters have a useful interpretation and MLR.3 is satisfied, we can drop one variable. Since study, sleep, and leisure are arguably more relevant for GPA, we drop work. The model becomes:
GPA = B0 + B1*study + B2*sleep + B3*leisure + u
Each coefficient is then interpreted as the effect of increasing that activity at the expense of work.
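As a minimal illustration of the MLR.3 violation (simulated data with hypothetical coefficients, not the actual survey data), lm() returns NA for one of the four time-use variables because they sum to 24:
# Hedged illustration with simulated data: study + sleep + work + leisure = 24,
# so one regressor is an exact linear combination of the others plus a constant.
set.seed(1)
n <- 100
sleep   <- runif(n, 6, 9)
work    <- runif(n, 0, 8)
leisure <- runif(n, 1, 5)
study   <- 24 - sleep - work - leisure
gpa     <- 2 + 0.10 * study + 0.05 * sleep + rnorm(n, sd = 0.3)
coef(lm(gpa ~ study + sleep + work + leisure))  # one coefficient is returned as NA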
If x1 is highly correlated with x2 and x3, and x2 and x3 have large partial effects on y, the simple regression coefficient on x1 will suffer from omitted variable bias: the simple regression does not account for x2 and x3, so part of their effect is attributed to x1. Adding x2 and x3 lets the model isolate the ceteris paribus relationship between y and x1 and gives a more credible estimate of the effect of x1 on y, so the simple and multiple regression coefficients can differ substantially (the direction of the difference depends on the signs of the correlations and of the partial effects).
Because the relationships among x1, x2, and x3 cannot be treated separately, the high correlation between x2 and x3 still matters. Consequently, the two sets of results will differ.
If x1 is highly correlated with x2 and x3, but x2 and x3 have small partial effects on y, I expect se(B1) from the simple regression to be smaller than se(B1) from the multiple regression. Since the impacts of x2 and x3 are small, leaving them out barely increases the error variance, while including them introduces multicollinearity that inflates the standard error of B1.
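A small simulation sketch (made-up data; the variable names and effect sizes are assumptions chosen only for illustration) is consistent with this: when x2 and x3 are highly collinear with x1 but add little to y, the multiple-regression standard error for x1 is much larger than the simple-regression one.
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.2)  # x2 highly correlated with x1
x3 <- 0.9 * x1 + rnorm(n, sd = 0.2)  # x3 highly correlated with x1
y  <- 1 + x1 + 0.01 * x2 + 0.01 * x3 + rnorm(n)  # x2, x3 have small partial effects
# se(B1) from the simple regression vs the multiple regression
summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
summary(lm(y ~ x1 + x2 + x3))$coefficients["x1", "Std. Error"]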
If x1 is highly correlated with x2 and x3, and x2 and x3 have large partial effects on y, I expect se(B1) from the multiple regression to be smaller than se(B1) from the simple regression. Because the impacts of x2 and x3 are important, excluding them pushes their variation into the error term, which both biases the estimate and inflates the error variance; including them reduces the error variance.
The unit of measurement of income is dollars, while prpblck is a proportion (a number between zero and one).
data8 <- wooldridge::discrim
print(paste("The mean of income will be:", mean(data8$income, na.rm=TRUE)))
## [1] "The mean of income will be: 47053.7848410758"
print(paste("The standard deviation of income will be:", sd(data8$income, na.rm=TRUE)))
## [1] "The standard deviation of income will be: 13179.2860689389"
print(paste("The mean of people black will be:", mean(data8$prpblck,na.rm=TRUE)))
## [1] "The mean of people black will be: 0.113486396497833"
print(paste("The standard deviation of people black will be:", sd(data8$prpblck, na.rm=TRUE)))
## [1] "The standard deviation of people black will be: 0.182416467486231"
The coefficient on prpblck is 0.115 in the regression below, so a 10 percentage point increase in the proportion of black residents in a zip code (an increase of 0.10 in prpblck) is associated with a median soda price about 0.0115 USD (roughly one cent) higher. This is not economically large.
# Count non-missing psoda values (note: the regression below uses 401 observations,
# because a few additional rows are missing prpblck or income)
sample_size <- sum(!is.na(data8$psoda))
model <- lm(psoda~ prpblck+income, data=data8)
summary <- summary(model)
summary
##
## Call:
## lm(formula = psoda ~ prpblck + income, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29401 -0.05242 0.00333 0.04231 0.44322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.563e-01 1.899e-02 50.354 < 2e-16 ***
## prpblck 1.150e-01 2.600e-02 4.423 1.26e-05 ***
## income 1.603e-06 3.618e-07 4.430 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08611 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06422, Adjusted R-squared: 0.05952
## F-statistic: 13.66 on 2 and 398 DF, p-value: 1.835e-06
print(paste("The sample size is:",sample_size))
## [1] "The sample size is: 402"
print(paste("The Rsquared is:",summary$r.squared))
## [1] "The Rsquared is: 0.0642203910903628"
In the simple regression below, the coefficient on prpblck is 0.0649, smaller than the 0.115 obtained with income included. Hence, the estimated discrimination effect is larger, not smaller, once we control for income.
model <- lm(psoda~prpblck, data=data8)
summary(model)
##
## Call:
## lm(formula = psoda ~ prpblck, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30884 -0.05963 0.01135 0.03206 0.44840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03740 0.00519 199.87 < 2e-16 ***
## prpblck 0.06493 0.02396 2.71 0.00702 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0881 on 399 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.01808, Adjusted R-squared: 0.01561
## F-statistic: 7.345 on 1 and 399 DF, p-value: 0.007015
In the log-log model below, the coefficient on prpblck is 0.1216, so, for example, a 0.20 increase in prpblck is associated with a soda price that is roughly 0.1216 x 0.20 = 0.024, or about 2.4%, higher.
model <- lm(log(psoda)~prpblck+ log(income), data=data8)
summary(model)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33563 -0.04695 0.00658 0.04334 0.35413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.79377 0.17943 -4.424 1.25e-05 ***
## prpblck 0.12158 0.02575 4.722 3.24e-06 ***
## log(income) 0.07651 0.01660 4.610 5.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0821 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06809, Adjusted R-squared: 0.06341
## F-statistic: 14.54 on 2 and 398 DF, p-value: 8.039e-07
When prppov is added (below), the coefficient on prpblck falls to 0.0728 and its statistical significance also drops (p = 0.018).
model <- lm(log(psoda)~prpblck+ log(income)+prppov, data=data8)
summary(model)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32218 -0.04648 0.00651 0.04272 0.35622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.46333 0.29371 -4.982 9.4e-07 ***
## prpblck 0.07281 0.03068 2.373 0.0181 *
## log(income) 0.13696 0.02676 5.119 4.8e-07 ***
## prppov 0.38036 0.13279 2.864 0.0044 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08137 on 397 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.08696, Adjusted R-squared: 0.08006
## F-statistic: 12.6 on 3 and 397 DF, p-value: 6.917e-08
The correlation is not what I expected: the correlation between the proportion in poverty (prppov) and the soda price is only 0.026, essentially no relationship. I had expected that the poorer the zip code, the lower the price.
# Remove missing values from the vectors
data8_6 <- data8 %>% select(psoda,prppov) %>% na.omit
cor(data8_6)
## psoda prppov
## psoda 1.00000000 0.02598077
## prppov 0.02598077 1.00000000
Because the correlation between log(income) and prppov is very high in absolute value (about -0.84), prppov adds little information once income is already in the model. Including two highly correlated variables also causes multicollinearity. Hence, the statement is true.
# Remove missing values from the vectors
data8_7 <- data8 %>% select(income,prppov) %>% na.omit %>% mutate(income= log(income))
cor(data8_7)
## income prppov
## income 1.000000 -0.838467
## prppov -0.838467 1.000000
Estimated equation: rdintens = 0.472 + 0.321*log(sales) + 0.050*profmarg
Because this is a level-log relationship, a 1% increase in sales raises rdintens (R&D expenditures as a percentage of sales) by about 0.321/100 = 0.0032 percentage points.
Hence, a 10% increase in sales raises R&D intensity by only about 0.032 percentage points.
This is not an economically large effect: for reference, an E&Y report from 2023 put average R&D spending in the sector at roughly 2.5% of revenue, and 0.03 percentage points is small relative to that.
rdintens = B0 + B1*log(sales) + B2*profmarg
We test H0: B1 = 0 against H1: B1 ≠ 0.
t-value = 0.321/0.216 = 1.486
With n = 32 and three estimated parameters, df = 32 - 3 = 29:
Critical value at the 5% level of significance: about 2.045
Critical value at the 10% level of significance: about 1.699
Because 1.486 < 1.699 < 2.045, we cannot reject H0 at either level. Consequently, we cannot conclude that an increase in sales raises R&D intensity.
For profmarg, a one percentage point increase in the profit margin is associated with a 0.05 percentage point increase in R&D intensity. This is not economically large relative to an average R&D intensity of roughly 2.5%, although for a large firm even a 0.05 percentage point increase can represent a meaningful amount of spending.
rdintens = B0 + B1*log(sales) + B2*profmarg
We test H0: B2 = 0 against H1: B2 ≠ 0.
t-value = 0.050/0.046 = 1.087
With n = 32 and df = 29:
Critical value at the 5% level of significance: about 2.045
Critical value at the 10% level of significance: about 1.699
Because 1.087 < 1.699 < 2.045, we cannot reject H0. Consequently, we cannot conclude that an increase in the profit margin raises R&D intensity.
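As a quick numerical check (a sketch using the coefficients and standard errors reported above, with df = 29), the two-sided p-values confirm that neither t statistic is significant at conventional levels:
# t statistics for log(sales) and profmarg from the reported equation
t_stats <- c(lsales = 0.321 / 0.216, profmarg = 0.050 / 0.046)
t_stats
# Two-sided p-values with 29 degrees of freedom
2 * pt(-abs(t_stats), df = 29)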
There are 2,017 single-person households in the data set.
data_c4 <- wooldridge::k401ksubs
head(data_c4,10)
## e401k inc marr male age fsize nettfa p401k pira incsq agesq
## 1 0 13.170 0 0 40 1 4.575 0 1 173.4489 1600
## 2 1 61.230 0 1 35 1 154.000 1 0 3749.1128 1225
## 3 0 12.858 1 0 44 2 0.000 0 0 165.3282 1936
## 4 0 98.880 1 1 44 2 21.800 0 0 9777.2539 1936
## 5 0 22.614 0 0 53 1 18.450 0 0 511.3930 2809
## 6 0 15.000 1 0 60 3 0.000 0 0 225.0000 3600
## 7 0 37.155 1 0 49 5 3.483 0 1 1380.4939 2401
## 8 0 31.896 1 0 38 5 -2.100 0 0 1017.3548 1444
## 9 0 47.295 1 0 52 2 5.290 0 1 2236.8169 2704
## 10 1 29.100 0 1 45 1 29.600 0 1 846.8100 2025
data_c4 <- data_c4 %>% filter(fsize==1)
data_c4 %>% count()
## n
## 1 2017
Based on the output, the estimated equation is: nettfa = -43.0398 + 0.7993*inc + 0.8427*age
The R-squared is 0.1193, meaning that income and age explain about 11.93% of the variation in net financial wealth.
It makes sense that higher income and higher age are associated with higher net financial wealth. The intercept is negative: taken literally, a person with zero income and zero age would start out borrowing, with net wealth of about -43.04, i.e., roughly -$43,000, since nettfa is measured in thousands of dollars.
However, it is surprising that the slope coefficients are so small; one might expect changes in income and age to have a larger effect on net financial wealth.
model <- lm(nettfa ~ inc+age, data= data_c4)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + age, data = data_c4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.95 -14.16 -3.42 6.03 1113.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.03981 4.08039 -10.548 <2e-16 ***
## inc 0.79932 0.05973 13.382 <2e-16 ***
## age 0.84266 0.09202 9.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.1185
## F-statistic: 136.5 on 2 and 2014 DF, p-value: < 2.2e-16
The intercept itself is not very interesting: it is the predicted nettfa for someone with zero income and zero age, and since age = 0 is not meaningful for this sample, the intercept has no useful interpretation.
We test H0: B2 = 1 against H1: B2 < 1.
The t-value is (0.84266 - 1)/0.09202 = -1.710.
The one-sided p-value for t = -1.710 is about 0.044, which is greater than 0.01.
Hence, at the 1% level of significance, we cannot reject H0.
(0.84266-1)/0.09202
## [1] -1.709846
pnorm((0.84266-1)/0.09202)
## [1] 0.04364721
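For comparison (with 2014 degrees of freedom the t and standard normal distributions are nearly identical), the t distribution gives essentially the same one-sided p-value:
pt((0.84266 - 1) / 0.09202, df = 2014)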
model <- lm(nettfa~inc, data= data_c4)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc, data = data_c4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185.12 -12.85 -4.85 1.78 1112.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.5709 2.0607 -5.13 3.18e-07 ***
## inc 0.8207 0.0609 13.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.59 on 2015 degrees of freedom
## Multiple R-squared: 0.08267, Adjusted R-squared: 0.08222
## F-statistic: 181.6 on 1 and 2015 DF, p-value: < 2.2e-16
cor(data_c4$age, data_c4$inc)
## [1] 0.03905864
With the simple regression, the estimate of B1 (0.821) is still similar to that in part (ii) (0.799) because income and age are only weakly correlated (correlation of about 0.039). However, the intercept changes a lot when age is not controlled for.
If I use a normal distribution for score, the probability that a score exceeds 100 is not zero, because the normal distribution has unbounded support. In the actual data, however, 100 is the highest possible score, so this contradicts the assumption of a normal distribution for score.
The normal distribution does not fit well in the left tail: the distribution of scores is left-skewed, with a longer tail of low scores than a normal distribution would imply.
data_c5 <- wooldridge::wage1
# Level model: wage in levels
model_level <- lm(wage ~ educ + exper + tenure, data = data_c5)
hist(residuals(model_level), main = "Histogram of Residuals (level model)", xlab = "Residuals")
# Log model: log(wage) as the dependent variable
model_log <- lm(log(wage) ~ educ + exper + tenure, data = data_c5)
hist(residuals(model_log), main = "Histogram of Residuals (log model)", xlab = "Residuals")
As can be seen from the figures, MLR.6 is closer to being satisfied in the log-level model, because the distribution of its residuals is much closer to normal.
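As a complementary check (a sketch reusing the two models fitted above), normal QQ-plots make the comparison easier to see than histograms:
# Normal QQ-plots of the residuals from the level and log models
par(mfrow = c(1, 2))
qqnorm(residuals(model_level), main = "Level model"); qqline(residuals(model_level))
qqnorm(residuals(model_log), main = "Log model"); qqline(residuals(model_log))
par(mfrow = c(1, 1))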
The marginal effect of sales on rdintens is 0.00030 - 2(0.0000000070)*sales = 0.00030 - 0.000000014*sales. Hence, it becomes negative when:
0.00030 < 0.000000014*sales, i.e., when sales > 0.00030/0.000000014 = 21,428.57 (millions of dollars).
The t-value on sales^2 is -0.0000000070/0.0000000037 = -1.89. Its absolute value exceeds the 10% two-sided critical value (about 1.70 with 29 df) but not the 5% value (about 2.045), so the quadratic term is statistically significant at the 10% level, and I would keep it.
With sales measured in billions of dollars (salesbil = sales/1000), the equation becomes:
rdintens = 2.613 + 0.30*salesbil - 0.0070*salesbil^2
For the purpose of reporting, I prefer the equation in part (iii) because it contains fewer zeros to the right of the decimal point.
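A quick arithmetic check of the turning point and of the t ratio used above (the standard error on sales^2 is taken to be 0.0000000037, matching the -70/37 ratio):
# Sales level (millions of dollars) at which the marginal effect turns negative
0.00030 / (2 * 0.0000000070)
# t statistic for the coefficient on sales^2
-0.0000000070 / 0.0000000037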
For lexppp:
Model 1 has t-value = 9.01/4.04 = 2.23
Model 2 has t-value = 1.93/2.82 = 0.68
Hence, lexppp is statistically significant in model 1 but not in model 2.
Model 1 is more relevant, because model 2 includes the variable read4, which adds little to the explanation of math4 and is highly correlated with the other explanatory variables in the model.
Using model 1, a 10% increase in expenditures per student is predicted to raise the percentage of students passing the math test by about 9.01/10 = 0.9 percentage points.
Intuitively, reading and math pass rates are determined by many of the same factors, so adding read4 causes strange effects in the estimates: it is highly correlated with the other important explanatory variables and soaks up variation that belongs to them.
When selecting a model, capturing the causal relationship of interest is more fundamental than merely maximizing statistical fit; choosing variables only to improve fit can easily lead to an inadequate interpretation of the economic phenomenon.
The statistical significance and interpretability of the chosen explanatory variables provide a more robust basis for interpretation than a higher, but potentially misleading, R-squared.
Model: log(wage) = B0 + B1*educ + B2*exper + B3*educ*exper + u
The return to another year of education is dlog(wage)/deduc = B1 + B3*exper, which depends on the level of exper whenever B3 is nonzero.
The null hypothesis that the return to education does not depend on the level of exper is:
H0: B3 = 0
A more appropriate alternative is H1: B3 > 0, because people with more work experience may be more productive and therefore gain more from an additional year of education.
data_c6 <- wooldridge::wage2
model <- lm(log(wage) ~educ+ exper+ educ*exper, data= data_c6)
summary(model)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + educ * exper, data = data_c6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88558 -0.24553 0.03558 0.26171 1.28836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.949455 0.240826 24.704 <2e-16 ***
## educ 0.044050 0.017391 2.533 0.0115 *
## exper -0.021496 0.019978 -1.076 0.2822
## educ:exper 0.003203 0.001529 2.095 0.0365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3923 on 931 degrees of freedom
## Multiple R-squared: 0.1349, Adjusted R-squared: 0.1321
## F-statistic: 48.41 on 3 and 931 DF, p-value: < 2.2e-16
t <- 0.003203/0.001529
pnorm(-t)
## [1] 0.01809291
From this result, the one-sided p-value is about 0.018 (1.8%), which is less than 2%. Hence, we can reject H0: B3 = 0 in favor of H1: B3 > 0 at the 2% level of significance (and at any level above it).
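For comparison (with 931 degrees of freedom the t and standard normal distributions are almost identical), the t distribution gives essentially the same one-sided p-value:
pt(-(0.003203 / 0.001529), df = 931)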
We can rearrange the model. Define theta1 = B1 + 10*B3, the return to education when exper = 10, so B1 = theta1 - 10*B3. Then:
log(wage) = B0 + (theta1 - 10*B3)*educ + B2*exper + B3*educ*exper + u
-> log(wage) = B0 + theta1*educ + B2*exper + B3*educ*(exper - 10) + u
data_c6 <- data_c6 %>% mutate(exper_10 = exper - 10)
# exper_10 is perfectly collinear with exper, so lm() drops it (the NA row below);
# the coefficient on educ then equals theta1, the return to education at exper = 10
model <- lm(log(wage) ~ educ + exper + educ * exper_10, data = data_c6)
summary(model)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + educ * exper_10, data = data_c6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88558 -0.24553 0.03558 0.26171 1.28836
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.949455 0.240826 24.704 <2e-16 ***
## educ 0.076080 0.006615 11.501 <2e-16 ***
## exper -0.021496 0.019978 -1.076 0.2822
## exper_10 NA NA NA NA
## educ:exper_10 0.003203 0.001529 2.095 0.0365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3923 on 931 degrees of freedom
## Multiple R-squared: 0.1349, Adjusted R-squared: 0.1321
## F-statistic: 48.41 on 3 and 931 DF, p-value: < 2.2e-16
From the result we get theta1 = 0.07608. The 95% confidence interval for theta1 is 0.07608 ± 1.96 x 0.006615, i.e., approximately 0.063 (lower bound) to 0.089 (upper bound).
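As a cross-check (reusing the reparameterized model above, in which the educ coefficient is theta1), confint() should give essentially the same interval:
confint(model, "educ", level = 0.95)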
The youngest person in this sample is 25 years old, and there are 99 people of that age.
data_c6_12 <- wooldridge::k401ksubs
data_c6_12 <- data_c6_12 %>% filter(fsize==1)
min(data_c6_12$age)
## [1] 25
youngest <- data_c6_12 %>% filter(age==25)
youngest %>% count()
## n
## 1 99
model <- lm(nettfa~inc+age+I(age^2), data= data_c6_12)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + age + I(age^2), data = data_c6_12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.36 -13.58 -2.97 5.67 1116.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.204212 15.280667 -0.079 0.93719
## inc 0.824816 0.060298 13.679 < 2e-16 ***
## age -1.321815 0.767496 -1.722 0.08518 .
## I(age^2) 0.025562 0.008999 2.841 0.00455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1216
## F-statistic: 93.99 on 3 and 2013 DF, p-value: < 2.2e-16
B2 is the coefficient on age. Looking at B2 alone would suggest that the older a person is, the lower their net wealth, but that reading is misleading: because of the quadratic term, the partial effect of age is B2 + 2*B3*age, which is negative only for ages below about 26 and positive thereafter.
We have the estimated equation: nettfa = -1.204 + 0.825*inc - 1.322*age + 0.0256*age^2
b1 = 0.825 indicates that a $1,000 increase in income is associated with an $825 increase in net financial wealth (both variables are measured in thousands of dollars).
b2 < 0 and b3 > 0 mean that age and net wealth have a U-shaped relationship.
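The bottom of this U-shape can be checked directly from the reported coefficients; it lies just above the youngest age in the sample:
# Age at which the partial effect of age on nettfa turns from negative to positive
1.321815 / (2 * 0.025562)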
Define theta2 = b2 + 50*b3, the partial effect of age at age = 25 (since dnettfa/dage = b2 + 2*b3*age), so b2 = theta2 - 50*b3. The model becomes:
nettfa = b0 + b1*inc + (theta2 - 50*b3)*age + b3*age^2
-> nettfa = b0 + b1*inc + theta2*age + b3*(age^2 - 50*age)
data_c6_12 <- data_c6_12 %>% mutate(age_50 = 50 * age)
# The coefficient on age in this regression is theta2, the partial effect of age at age 25
model <- lm(nettfa ~ inc + age + I(age^2 - age_50), data = data_c6_12)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + age + I(age^2 - age_50), data = data_c6_12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.36 -13.58 -2.97 5.67 1116.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.204212 15.280667 -0.079 0.93719
## inc 0.824816 0.060298 13.679 < 2e-16 ***
## age -0.043695 0.325270 -0.134 0.89315
## I(age^2 - age_50) 0.025562 0.008999 2.841 0.00455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1216
## F-statistic: 93.99 on 3 and 2013 DF, p-value: < 2.2e-16
From the result, theta2 is very small (-0.044) and statistically insignificant: at age 25, the marginal effect of age on net wealth is essentially zero.
The result below shows that when age is excluded from the model (keeping only inc and (age - 25)^2), the R-squared does not decrease. With one fewer regressor, the adjusted R-squared is slightly higher, so the more parsimonious model fits at least as well.
data_c6_12 <- data_c6_12 %>% mutate(age_25 = age-25)
model <- lm(nettfa~inc+I(age_25^2), data= data_c6_12)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + I(age_25^2), data = data_c6_12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.37 -13.61 -3.01 5.63 1116.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.488105 2.177584 -8.490 <2e-16 ***
## inc 0.823571 0.059567 13.826 <2e-16 ***
## I(age_25^2) 0.024403 0.002541 9.605 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.59 on 2014 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.122
## F-statistic: 141 on 2 and 2014 DF, p-value: < 2.2e-16
Setting inc = 30 (an income of $30,000), the estimated relationship is:
nettfa = -18.4881 + 0.8236*30 + 0.0244*(age - 25)^2 = 6.219 + 0.0244*(age - 25)^2
# Generate (age - 25) values from 0 to 50, i.e., ages 25 to 75
x_values <- seq(0, 50, length.out = 50)
# Compute predicted nettfa at inc = 30 using the equation above
y_values <- 6.219 + 0.024403 * x_values^2
# Plot the curve
plot(x_values+25, y_values, type = "l", col = "blue", lwd = 2, xlab = "Age", ylab = "Net Wealth", main = "Plot of relationship between nettfa and age")
For ages above 25, predicted net wealth rises with age, and the marginal effect of age itself increases with age, so the curve becomes steeper.
There is no need to add inc^2: its coefficient is tiny and statistically insignificant in the model below (t = -0.27).
data_c6_12 <- data_c6_12 %>% mutate(age_25 = age-25)
model <- lm(nettfa~inc+I(inc^2)+I(age_25^2), data= data_c6_12)
summary(model)
##
## Call:
## lm(formula = nettfa ~ inc + I(inc^2) + I(age_25^2), data = data_c6_12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.46 -13.66 -3.00 5.76 1116.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.930e+01 3.688e+00 -5.234 1.83e-07 ***
## inc 8.722e-01 1.877e-01 4.648 3.57e-06 ***
## I(inc^2) -5.405e-04 1.978e-03 -0.273 0.785
## I(age_25^2) 2.440e-02 2.541e-03 9.603 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.6 on 2013 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1216
## F-statistic: 94.01 on 3 and 2013 DF, p-value: < 2.2e-16