Part 2

Note that I have used R (via RStudio) instead of LimDep/NLogit to complete the exercises. The interpretation of the code can be skipped if preferred, as it is not necessary to complete the course.

Exercise 1

Question: Which is a better hedge against inflation, gold or stock market?

Model 1: \(GoldP_i=β1+β2CPI_i+u_i\)
Model 2: \(NYSE_i=β1+β2CPI_i+u_i\)

GoldP is the Gold Price, NYSE is the New York Stock Exchange and CPI is the consumer price index.

Hypothesis 1 and hypothesis 2 are based on model 1 and model 2 respectively. The hypotheses below are used to answer the question.

Hypotheses:
Consumer price index (model 1, GoldP):
\(H0: β2=0\)
\(H1: β2≠0\)
Consumer price index (model 2, NYSE):
\(H0: β2=0\)
\(H1: β2≠0\)

Results. The following results are based on the model above.

When the gold price (GoldP) and the New York Stock Exchange (NYSE) index are plotted against the consumer price index (CPI), a positive sign is expected in the regression results and a positive slope in the plot, as inflation makes nominal prices and values increase, all else equal. Because the gold price values are much smaller than the NYSE index values, the y-axis is logarithmic so that both variables can be shown clearly in the same plot. Positive trends are visually apparent for both variables, and the estimates are also positive and statistically significant according to the summaries of the regression results.

GoldPvsCPI<-lm(Gold$GoldP~Gold$CPI)
NYSEvsCPI<-lm(Gold$NYSE~Gold$CPI)
# Results for GoldP.
summary(GoldPvsCPI)
## 
## Call:
## lm(formula = Gold$GoldP ~ Gold$CPI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149.53  -90.31    5.39   37.39  311.71 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  215.286     54.469    3.95  0.00042 ***
## Gold$CPI       1.038      0.404    2.57  0.01514 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103 on 31 degrees of freedom
## Multiple R-squared:  0.176,  Adjusted R-squared:  0.149 
## F-statistic: 6.61 on 1 and 31 DF,  p-value: 0.0151
# Results for NYSE.
summary(NYSEvsCPI)
## 
## Call:
## lm(formula = Gold$NYSE ~ Gold$CPI)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1322   -824   -344    965   1663 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3444.99     533.97   -6.45  3.4e-07 ***
## Gold$CPI       50.30       3.96   12.71  7.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1010 on 31 degrees of freedom
## Multiple R-squared:  0.839,  Adjusted R-squared:  0.834 
## F-statistic:  161 on 1 and 31 DF,  p-value: 7.89e-14
# Code for plot below. The pipe, pivot_longer() and ggplot() come from the tidyverse.
library(tidyverse)

Gold%>%
  pivot_longer(
    cols=c("GoldP","NYSE"),
    names_to="GoldPNYSE",
    values_to="Value")%>%
  ggplot(aes(x=CPI,y=Value,color=GoldPNYSE,fill=GoldPNYSE))+
  geom_point(na.rm=TRUE)+
  geom_smooth(method=lm)+
  labs(x="Consumer Price Index",
       title="Gold price and NY stock exchange index vs\nconsumer price index")+
  theme_grey(base_size=30)+
  theme(plot.title=element_text(hjust=0.5))+
  scale_y_log10()

Interpretation of the statistical tests:

The t-test is based on the following form:
\(H0: β2=0\)
\(H1: β2≠0\)

With 31 degrees of freedom, the two-sided critical value for statistical significance at 5% is a t-value of approximately 2.04, which is clearly exceeded in both regressions, with t=2.57 in the first and t=12.71 in the second. Both null hypotheses can therefore be rejected at 5%. The estimate for the New York Stock Exchange index is also more statistically significant than the estimate for the gold price.
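The critical value quoted above can be reproduced in R. The following is a minimal sketch that uses only the degrees of freedom and the t-values reported in the two regression summaries; the object names are illustrative and not part of the original code.

```r
# Two-sided 5% critical value for a t-distribution with 31 degrees of freedom
df_resid <- 31
t_crit <- qt(0.975, df = df_resid)

# t-values of the CPI coefficient, copied from the two summaries above
t_values <- c(GoldP = 2.57, NYSE = 12.71)

# TRUE means that H0: beta2 = 0 is rejected at the 5% level
reject_h0 <- abs(t_values) > t_crit

print(round(t_crit, 2))
print(reject_h0)
```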

According to the estimates, an increase of one index point in the consumer price index is on average associated with an increase of 1.038 USD in the gold price, ceteris paribus, and with an increase of 50.3 points in the New York Stock Exchange index, ceteris paribus. The positive signs of both β2 coefficients are visible in the plot, as both fitted regression lines slope upward.

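Answer: As the NYSE index estimate is more statistically significant, and the consumer price index explains far more of the variation in the NYSE index (R²=0.839) than in the gold price (R²=0.176), the stock market is a better hedge against inflation than gold.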

Exercise 2

Question: Does educational achievement depend on intellectual ability?

\(Model: S_YEAR=β1+β2ASVABC\)

ASVABC is a composite measure of numerical and verbal ability among students and S_YEAR is the number of years of schooling.

The hypothesis below is used to answer the question.

Hypothesis
Numerical and verbal ability:
\(H0: β2=0\)
\(H1: β2≠0\)

The critical value for rejecting the null hypothesis in the F-test is calculated by: \(Fcrit=F(k-1,n-k)=F(2-1,570-2)=F(1,568)=3.858\)
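The F critical value can be obtained from the F-distribution in R. As a sketch (the object names are illustrative), the code below also checks that with a single restriction the F critical value equals the square of the two-sided t critical value.

```r
n <- 570  # number of observations
k <- 2    # number of estimated coefficients (intercept and slope)

# 5% critical value of F(k - 1, n - k)
f_crit <- qf(0.95, df1 = k - 1, df2 = n - k)

# Two-sided 5% critical value of t(n - k)
t_crit <- qt(0.975, df = n - k)

print(round(f_crit, 3))
print(abs(f_crit - t_crit^2) < 1e-6)
```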

The confidence interval of the composite measure of numerical and verbal ability (ASVABC) is calculated as:
\(CI=β2±t(SE(β2))=(lower bound,upper bound)\)

Results. The following results are based on the model above.

R1<-lm(EDU$S_YEAR~EDU$ASVABC)
# Regression results
summary(R1)
## 
## Call:
## lm(formula = EDU$S_YEAR ~ EDU$ASVABC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.449 -1.590 -0.307  1.276  6.126 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.50226    0.48299    13.5   <2e-16 ***
## EDU$ASVABC   0.14176    0.00944    15.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 568 degrees of freedom
## Multiple R-squared:  0.284,  Adjusted R-squared:  0.283 
## F-statistic:  225 on 1 and 568 DF,  p-value: <2e-16
# β2 coefficient value
summary(R1)$coefficients[2,1]
## [1] 0.142
# β2 standard error
summary(R1)$coefficients[2,2]
## [1] 0.00944
# Lower bound of the 95% confidence interval
summary(R1)$coefficients[2,1]-1.964149*summary(R1)$coefficients[2,2]
## [1] 0.123
# Upper bound of the 95% confidence interval
summary(R1)$coefficients[2,1]+1.964149*summary(R1)$coefficients[2,2]
## [1] 0.16
# Manually calculating the f-statistic
(summary(R1)$r.squared/1)/((1-summary(R1)$r.squared)/(570-1-1))
## [1] 225

Interpretation of the statistical tests:

The t-test is based on the following form:
\(H0: βi=0\)
\(H1: βi≠0\)

With 568 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964149, which is exceeded in the intercept with t=13.5 and in ASVABC with t=15, which corresponds to rejections of both null hypotheses with p<2*10^-16.

On average, an individual with a zero test score on numerical and verbal ability has approximately 6.5 years of schooling, all else equal. However, there is no such observation in the data, so the intercept has no meaningful interpretation. On average, an increase of one test score point in the composite measure of numerical and verbal ability is associated with an increase of 0.14176 years of schooling, all else equal.

The manually calculated 95% confidence interval for the β2 coefficient of ASVABC is CI=(0.123,0.160); it can be stated with 95% confidence that the interval contains the true value of β2.
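The interval can be reproduced from the reported estimate and standard error. This is a sketch with a hypothetical helper, `ci_from_estimate`, which is not part of the original code; with the fitted model at hand, base R's `confint(R1)` would be expected to give the same bounds up to rounding.

```r
# Hypothetical helper: confidence interval from an estimate, its standard
# error and the residual degrees of freedom.
ci_from_estimate <- function(estimate, std_error, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df = df)
  c(lower = estimate - t_crit * std_error,
    upper = estimate + t_crit * std_error)
}

# Values copied from the regression summary for ASVABC
ci_asvabc <- ci_from_estimate(estimate = 0.14176, std_error = 0.00944, df = 568)
print(round(ci_asvabc, 3))
```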

Goodness of fit F-test for the regression model:
\(H0: β2=0\)
\(H1: β2≠0\)

The manually calculated F-value of 225 is equal to the F-value in the regression results. According to the adjusted coefficient of determination, 28.3% of the variation in the number of years of schooling is explained by the variation in the composite measure of numerical and verbal ability. The null hypothesis of the F-test can be rejected as F=225>3.858 with p<2*10^-16.
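The relation between the F-statistic and the coefficient of determination used in the manual calculation can be written as a small function. This is a sketch with a hypothetical function name, `f_from_r2`; it uses the rounded R² from the summary, so the result only matches the reported F-value approximately.

```r
# F-statistic for the joint significance of all slopes, computed from R-squared.
# n is the number of observations, k the number of estimated coefficients.
f_from_r2 <- function(r2, n, k) {
  (r2 / (k - 1)) / ((1 - r2) / (n - k))
}

# Rounded values from the regression of S_YEAR on ASVABC
f_stat <- f_from_r2(r2 = 0.284, n = 570, k = 2)

# Upper-tail p-value of the F-statistic
p_value <- pf(f_stat, df1 = 1, df2 = 568, lower.tail = FALSE)

print(round(f_stat))
print(p_value < 2e-16)
```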

Answer:

The number of years of schooling depends on numerical and verbal ability, as the estimate of the beta coefficient is statistically significant at p<2*10^-16. According to the adjusted coefficient of determination, 28.3% of the variation in the number of years of schooling is explained by the variation in the composite measure of numerical and verbal ability, and the regression model as a whole is statistically significant according to the F-statistic.

Exercise 3

Question: Do earnings depend on education?

The following model is used to answer the question.

\(Model: EARNINGS=β1+β2S_YEAR\)

S_YEAR is the number of years of schooling and EARNINGS is hourly earnings in 1994 USD.

The hypothesis below is used to answer the question.

Hypothesis:
Number of years of schooling:
\(H0: β2=0\)
\(H1: β2≠0\)

The confidence interval of the number of years of schooling (S_YEAR) is calculated as:
\(CI=β2±t(SE(β2))=(lower bound,upper bound)\)

Results. The following results are based on the model above.

R2<-lm(EDU$EARNINGS~EDU$S_YEAR)
# Regression results
summary(R2)
## 
## Call:
## lm(formula = EDU$EARNINGS ~ EDU$S_YEAR)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.49  -4.69  -1.55   2.26  75.82 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -9.158      2.182    -4.2  3.1e-05 ***
## EDU$S_YEAR     1.675      0.158    10.6  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.87 on 568 degrees of freedom
## Multiple R-squared:  0.166,  Adjusted R-squared:  0.164 
## F-statistic:  113 on 1 and 568 DF,  p-value: <2e-16
# β2 coefficient value
summary(R2)$coefficients[2,1]
## [1] 1.67
# β2 standard error
summary(R2)$coefficients[2,2]
## [1] 0.158
# Lower bound of the 95% confidence interval
summary(R2)$coefficients[2,1]-1.964149*summary(R2)$coefficients[2,2]
## [1] 1.37
# Upper bound of the 95% confidence interval
summary(R2)$coefficients[2,1]+1.964149*summary(R2)$coefficients[2,2]
## [1] 1.98
# Manually calculating the f-statistic
(summary(R2)$r.squared/1)/((1-summary(R2)$r.squared)/(570-1-1))
## [1] 113

Interpretation of the statistical tests:

The t-test is based on the following form:
\(H0: βi=0\)
\(H1: βi≠0\)

With 568 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964149, which is exceeded in the intercept with t=|-4.2|=4.2 and in S_YEAR with t=10.6, which corresponds to rejections of both null hypotheses with p=0.000031 and p<2*10^-16 respectively.

On average, an individual with zero years of schooling has approximately negative 9.158 USD of hourly earnings, all else equal. However, there is no such observation in the data, so the intercept has no meaningful interpretation. On average, an increase of one year of schooling is associated with an increase of 1.675 USD in hourly earnings, all else equal.

The manually calculated 95% confidence interval for the β2 coefficient of S_YEAR is CI=(1.37,1.98); it can be stated with 95% confidence that the interval contains the true value of β2.

The F-test is based on the following form: \(H0: β2=0\)
\(H1: β2 is different from zero (β2≠0)\)

The manually calculated F-value of 113 is equal to the F-value in the regression results. The null hypothesis of the F-test can be rejected as F=113>3.858 with p<2*10^-16. According to the adjusted coefficient of determination, 16.4% of the variation in hourly earnings is explained by the variation in the number of years of schooling.

Answer:

Hourly earnings depend on the number of years of schooling, as the estimate is statistically significant at p<2*10^-16. Also, according to the adjusted coefficient of determination, 16.4% of the variation in hourly earnings is explained by the variation in the number of years of schooling, and the regression model as a whole is statistically significant according to the F-statistic.

Exercise 4

Question: Do earnings depend on intellectual ability as well as education?

The following model is used to answer the question.

\(Model: EARNINGS=β1+β2S_YEAR+β3ASVABC\)

S_YEAR is the number of years of schooling, ASVABC is a composite measure of numerical and verbal ability and EARNINGS is hourly earnings in 1994 USD.

The hypotheses below are used to answer the question.

Hypotheses:
Number of years of schooling:
\(H0: β2=0\)
\(H1: β2≠0\)
Numerical and verbal ability:
\(H0: β3=0\)
\(H1: β3≠0\)

Results. The following results are based on the model above.

# Regression results
R3<-lm(EDU$EARNINGS~EDU$S_YEAR+EDU$ASVABC)
summary(R3)
## 
## Call:
## lm(formula = EDU$EARNINGS ~ EDU$S_YEAR + EDU$ASVABC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16.90  -4.69  -1.31   2.31  75.84 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13.380      2.435   -5.50  5.9e-08 ***
## EDU$S_YEAR     1.308      0.184    7.10  3.7e-12 ***
## EDU$ASVABC     0.183      0.049    3.74    2e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.77 on 567 degrees of freedom
## Multiple R-squared:  0.186,  Adjusted R-squared:  0.183 
## F-statistic: 64.7 on 2 and 567 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-test is based on the following form:
\(H0: βi=0\)
\(H1: βi≠0\)

With 567 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964157, which is exceeded in the intercept with t=|-5.5|=5.5, in S_YEAR with t=7.1 and in ASVABC with t=3.74, which corresponds to rejections of all three null hypotheses with p=5.9e-08, p=3.7e-12 and p=0.0002 respectively.

On average, an individual with zero years of schooling and a test score of zero in numerical and verbal ability has approximately negative 13.38 USD of hourly earnings, all else equal. However, there is no such observation in the data, so the intercept has no meaningful interpretation. On average, an increase of one year of schooling is associated with an increase of 1.308 USD in hourly earnings, ceteris paribus. On average, an increase of one test score point in numerical and verbal ability is associated with an increase of 0.183 USD in hourly earnings, ceteris paribus.

Answer:

Hourly earnings depend on the number of years of schooling, as well as numerical and verbal ability, since the estimates are statistically significant at p=3.7e-12 and p=0.0002 respectively.

Exercise 5

Question: How is the expenditure on different categories of goods and services related to household income?

The following model is used to answer the question.

\(Model: CAT=β1+β2INCOME\)

INCOME is household income and CAT is a general expression for each expenditure variable.

The hypothesis below is used to answer the question.

Hypothesis:
INCOME:
\(H0: β2=0\)
\(H1: β2≠0\)

The following statistics for each dependent variable are given in the order of coefficient of intercept, coefficient of the income estimate, t-value, coefficient of determination and p-value.

As the code is lengthy it is excluded.

Results. The following results are based on the model above.

CAT   b1      b2         t-value  R^2       p-value
FDHO  1903    0.0531     19.7     0.31      1.15e-71
FDAW  -118    0.0446     22.5     0.37      1.26e-88
SHEL  27.1    0.193      30.8     0.524     2.69e-141
TELE  406     0.0104     11.5     0.133     1.22e-28
DOM   -112    0.0118     7.39     0.0594    3.54e-13
TEXT  -43.5   0.00421    10.9     0.12      7.65e-26
FURN  -50.6   0.0106     9.48     0.0943    2.33e-20
MAPP  4.7     0.00492    7.34     0.0586    5.08e-13
SAPP  7.39    0.00147    8.69     0.0804    1.18e-17
CLOT  16.5    0.0401     22.3     0.364     4.08e-87
FOOT  26.6    0.00351    22.6     0.174     7.19e-38
GASO  343     0.0257     19.2     0.298     1.51e-68
TRIP  -212    0.0181     11.1     0.124     8.8e-27
LOCT  70.4    -0.000284  -0.465   0.000481  0.519
HEAL  1062    0.0247     8.6      0.0788    3.83e-17
ENT   -488    0.0704     18.7     0.288     1.12e-65
FEES  -378    0.03       14.3     0.191     1.16e-41
TOYS  -12.9   0.00927    12.2     0.147     1.17e-31
READ  37.9    0.00427    15.5     0.217     8.93e-48
EDUC  -226    0.0194     9.19     0.089     2.86e-19
TOB   183     0.00255    3.3      0.0125    0.001
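Although the original code is omitted, a table with the columns above could be produced with a short loop over the expenditure categories. The following is only a sketch: it uses simulated data in a made-up data frame `CES_demo` with three categories, and the helper `fit_one` is hypothetical, so none of the numbers it prints correspond to the results above.

```r
# Simulated stand-in for the household data; not the real data set.
set.seed(1)
n <- 200
CES_demo <- data.frame(INCOME = runif(n, 10000, 90000))
CES_demo$FDHO <- 1900 + 0.05 * CES_demo$INCOME + rnorm(n, sd = 800)
CES_demo$SHEL <- 30 + 0.19 * CES_demo$INCOME + rnorm(n, sd = 3000)
CES_demo$LOCT <- 70 + rnorm(n, sd = 150)

categories <- c("FDHO", "SHEL", "LOCT")

# Hypothetical helper: regress one category on INCOME and collect the
# statistics reported in the table.
fit_one <- function(cat_name, data) {
  fit <- lm(reformulate("INCOME", response = cat_name), data = data)
  s <- summary(fit)
  data.frame(
    CAT = cat_name,
    b1 = s$coefficients["(Intercept)", "Estimate"],
    b2 = s$coefficients["INCOME", "Estimate"],
    t_value = s$coefficients["INCOME", "t value"],
    R2 = s$r.squared,
    p_value = s$coefficients["INCOME", "Pr(>|t|)"]
  )
}

# One row per category, stacked into a single table
results <- do.call(rbind, lapply(categories, fit_one, data = CES_demo))
print(results, digits = 3)
```

Building the formula with `reformulate()` keeps the loop free of string pasting and makes it straightforward to extend the vector of category names.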

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: β2=0\)
\(H1: β2≠0\)

With 864 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.962713. The null hypothesis is not rejected with LOCT (local public transportation) as the dependent variable, as the t-value of INCOME in this estimate does not exceed the critical value. Every other null hypothesis is rejected at 5%, with p-values ranging from 0.001 to 2.69*10^-141.

All b2-coefficients are positive, with the exception of LOCT.

Answer:

Expenditure increases with household income in almost every category, which in economic terms indicates that these categories are normal goods. The exception is LOCT (local public transportation), which has a negative b2-coefficient that is not statistically significant. If a b2-coefficient were negative and statistically significant, the category would be considered an inferior good in economic terms.

Exercise 6

Question: Does educational attainment depend on parents’ education?

The following model is used to answer the question.

\(Model: S_YEAR=β1+β2ASVABC+β3HGCM+β4HGCF\)

Where ASVABC is composite measure of numerical and verbal ability, HGCM is the years of schooling of the mother, HGCF is the years of schooling of the father and S_YEAR is the number of years of schooling of the respondent.

The hypotheses below are used to answer the question.

Hypotheses:
Maternal years of education:
\(H0: β3=0\)
\(H1: β3≠0\)
Paternal years of education:
\(H0: β4=0\)
\(H1: β4≠0\)

Results. The following results are based on the model above.

# Regression results
R25a<-lm(EDU$S_YEAR~EDU$ASVABC+EDU$HGCM+EDU$HGCF)
summary(R25a)
## 
## Call:
## lm(formula = EDU$S_YEAR ~ EDU$ASVABC + EDU$HGCM + EDU$HGCF)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.193 -1.486 -0.349  1.215  5.724 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.18492    0.52169    9.94  < 2e-16 ***
## EDU$ASVABC   0.11548    0.00992   11.64  < 2e-16 ***
## EDU$HGCM     0.12023    0.03924    3.06  0.00229 ** 
## EDU$HGCF     0.10271    0.02937    3.50  0.00051 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.93 on 566 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.333 
## F-statistic: 95.8 on 3 and 566 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 566 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964164, which is exceeded in the intercept with t=9.94, in ASVABC with t=11.64, in HGCM with t=3.06 and in HGCF with t=3.5, which corresponds to rejections of all four null hypotheses with p<2e-16, p<2e-16, p=0.00229 and p=0.00051 respectively.

On average, an individual with a zero score in the numerical and verbal test and parents who both lack formal education has approximately 5.18 years of schooling. However, there is no such observation in the data, so the intercept has no meaningful interpretation. On average, an increase of one unit in the test score is associated with an increase of 0.11548 years of schooling, ceteris paribus. On average, an increase of one year of maternal schooling is associated with an increase of 0.12023 years of schooling, ceteris paribus. On average, an increase of one year of paternal schooling is associated with an increase of 0.10271 years of schooling, ceteris paribus.

Answer:

As the estimates for maternal and paternal education are positive and statistically significant, educational attainment does depend on the parental level of education. The beta coefficient of maternal education is slightly larger (0.120 versus 0.103), thus indicating a somewhat larger effect on the education of an individual, although the difference is small relative to the standard errors.

Exercise 7

Question: Is expenditure on different categories of goods and services related to household size as well as household income?

The following model is used to answer the question.

\(Model: CAT=β1+β2INCOME+β3SIZE\)

INCOME is household income, SIZE is the number of persons in the household and CAT is a general expression for each expenditure variable.

The hypotheses below are used to answer the question.

Hypotheses:
INCOME:
\(H0: β2=0\)
\(H1: β2≠0\)
SIZE:
\(H0: β3=0\)
\(H1: β3≠0\)

The following statistics are given in the order of coefficient of intercept, coefficient of determination, coefficient of INCOME, standard error of INCOME, p-value of INCOME, coefficient of SIZE, standard error of SIZE and p-value of SIZE.

As the code is too lengthy it is not included below.

Results. The following results are based on the model above.

CAT   b1     R^2    b2     se(b2)   p(b2)     b3     se(b3)  p(b3)
FDHO  871    0.499  0.037  0.00246  2.38e-46  563    31.2    4.87e-62
FDAW  -24.1  0.372  0.046  0.00212  6.14e-84  -51.2  26.9    0.057
SHEL  348    0.526  0.198  0.00670  2.0e-133  -175   85      0.0395
TELE  343    0.141  0.009  0.00096  1.13e-21  34.3   12.2    0.00514
DOM   -134   0.060  0.018  0.00272  2.22e-11  11.7   34.5    0.734
TEXT  -14.8  0.129  0.005  0.00041  1.53e-27  -15.7  5.23    0.00285
FURN  -8.04  0.097  0.011  0.00119  4.09e-20  -23.2  15.1    0.124
MAPP  -15.8  0.060  0.005  0.00072  2.21e-20  11.2   9.1     0.22
SAPP  11.2   0.081  0.002  0.00018  1.27e-16  2.1    2.29    0.361
CLOT  -144   0.374  0.038  0.00191  1.94e-71  87.5   24.2    0.000324
FOOT  -8.41  0.202  0.003  0.00027  6.1e-26   19.1   3.47    4.93e-08
GASO  166    0.321  0.023  0.00141  3.14e-52  96.8   17.9    8.31e-08
TRIP  -85.8  0.134  0.020  0.00174  1.14e-28  -68.8  22      0.0134
LOCT  33.8   0.013  -0.00  0.00047  0.0722    20     5.95    0.000818
HEAL  1121   0.080  0.026  0.00307  3.25e-16  -32.1  39      0.41
ENT   -318   0.290  0.073  0.00402  1.66e-62  -93    51      0.0688
FEES  -209   0.201  0.325  0.00223  3.65e-43  -92.1  28.3    0.00119
TOYS  -58.5  0.152  0.009  0.00081  1.24e-24  24.9   10.3    0.0159
READ  65.9   0.232  0.005  0.00029  6.87e-51  -15.3  3.71    4.28e-05
EDUC  -269   0.089  0.019  0.00226  4.15e-16  23.7   28.7    0.409
TOB   139    0.019  0.002  0.00082  0.0232    24.1   10.4    0.0212

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 863 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.962717. The null hypothesis regarding INCOME is not rejected at 5% for LOCT (p=0.0722) but is rejected for every other category. The null hypothesis regarding SIZE is not rejected at 5% for FDAW, DOM, FURN, MAPP, SAPP, HEAL, ENT and EDUC, but is rejected for every other category.

For estimates with positive β3 coefficients regarding SIZE, on average an expenditure increases when households become larger, ceteris paribus. For estimates with negative β3 coefficients, on average an expenditure decreases when households become larger, ceteris paribus.

The INCOME coefficients are mostly smaller in the latter table, where SIZE is included, which suggests that the model without SIZE suffered from omitted variable bias. A larger household size is arguably correlated with having more than one income, and this effect was partially absorbed by the INCOME variable in the biased former results.
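The mechanism described above can be illustrated with simulated data. The sketch below is not based on the household data used in the exercise; all variable names and parameter values are made up for illustration. Income is constructed to be positively correlated with household size, and size has its own positive effect on expenditure, so leaving size out pushes the income coefficient upward.

```r
set.seed(42)
n <- 1000

# Simulated household size and an income that rises with size
SIZE <- sample(1:6, n, replace = TRUE)
INCOME <- 20000 + 8000 * SIZE + rnorm(n, sd = 10000)

# True model: expenditure depends on both income (0.03) and size (500)
FOOD <- 500 + 0.03 * INCOME + 500 * SIZE + rnorm(n, sd = 600)

# Short regression omits SIZE, long regression includes it
short_fit <- lm(FOOD ~ INCOME)
long_fit <- lm(FOOD ~ INCOME + SIZE)

b_short <- coef(short_fit)[["INCOME"]]
b_long <- coef(long_fit)[["INCOME"]]

print(c(without_SIZE = b_short, with_SIZE = b_long))
```

In this simulation the income coefficient from the short regression is larger than the one from the long regression, which is close to the true value of 0.03.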

Answer: As the majority of the β2 and β3 estimates are statistically significant, and the coefficient of determination is relatively high in general, the expenditure on the different categories of goods and services is related to household income as well as household size, but more so to INCOME than to SIZE.

Exercise 8

Question: Is expenditure per capita on food related to household size as well as household income per capita?

First, the following two per capita variables are generated.

\(FDHOPC=FDHO/SIZE\)
\(INCPC=INCOME/SIZE\)

The following model is used to answer the question.

\(Model: FDHOPC=β1+β2INCPC+β3SIZE\)

INCPC is household income per capita, SIZE is household size and FDHOPC is per capita expenditure on food and non-alcoholic beverages consumed at home.

The hypotheses below are used to answer the question.

Hypotheses:
INCPC:
\(H0: β2=0\)
\(H1: β2≠0\)
SIZE:
\(H0: β3=0\)
\(H1: β3≠0\)

Results. The following results are based on the model above.

CES2<-
  CES%>%
  mutate(FDHOPC=FDHO/SIZE)%>%
  mutate(INCPC=INCOME/SIZE)
R46<-lm(CES2$FDHOPC~CES2$INCPC+CES2$SIZE)
summary(R46)
## 
## Call:
## lm(formula = CES2$FDHOPC ~ CES2$INCPC + CES2$SIZE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1732   -374    -60    260   5024 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.42e+03   6.70e+01   21.21   <2e-16 ***
## CES2$INCPC   3.21e-02   2.68e-03   11.99   <2e-16 ***
## CES2$SIZE   -1.33e+02   1.52e+01   -8.75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 632 on 863 degrees of freedom
## Multiple R-squared:  0.292,  Adjusted R-squared:  0.291 
## F-statistic:  178 on 2 and 863 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 863 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.962717, which is exceeded in the intercept with t=21.21, in INCPC with t=11.99, in SIZE with t=|-8.75|=8.75, which corresponds to rejections of all three null hypotheses with p<2*10^-16.

In a household with 0 people and a per capita income of 0, per capita food expenditure is on average approximately 1420 USD. However, there is no such observation in the data, so the intercept has no meaningful interpretation. On average, an increase of one USD in per capita income is associated with an increase of 0.0321 USD in per capita food expenditure, ceteris paribus. On average, an increase of one individual in household size is associated with a decrease of 133 USD in per capita food expenditure, ceteris paribus.

The F-test is based on the following form: \(H0: β2=β3=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

With a critical F-value of: \(Fcrit=F(2,863)=3.006\), the null hypothesis is rejected at the 5% significance level as the F-statistic is 178.

Answer:

As both estimates regarding the independent variables show statistical significance, expenditure per capita on food is related to household size and household income per capita. The F-statistic shows that the model as a whole is statistically significant and, even though it is not of direct relevance to the exercise, the coefficient of determination is relatively high.

Exercise 9

Question: Is household composition a determinant of expenditure per capita on food, controlling for household income per capita?

The following model is used to answer the question.

\(Model: FDHOPC=β1+β2INCPC+β3SIZEAM+β4SIZEAF+β5SIZEJM+β6SIZEJF+β7SIZEIN\)

INCPC is household income per capita, SIZEAM is the number of adult males in the household, SIZEAF is the number of adult females in the household, SIZEJM is the number of junior males in the household, SIZEJF is the number of junior females in the household, SIZEIN is the number of children below 2 and FDHOPC is per capita expenditure on food and non-alcoholic beverages consumed at home.

The hypotheses below are used to answer the question.

Hypotheses:
SIZEAM:
\(H0: β3=0\)
\(H1: β3≠0\)
SIZEAF:
\(H0: β4=0\)
\(H1: β4≠0\)
SIZEJM:
\(H0: β5=0\)
\(H1: β5≠0\)
SIZEJF:
\(H0: β6=0\)
\(H1: β6≠0\)
SIZEIN:
\(H0: β7=0\)
\(H1: β7≠0\)

Results. The following results are based on the model above.

R47<-lm(CES2$FDHOPC~CES2$INCPC+CES2$SIZEAM+CES2$SIZEAF+CES2$SIZEJM+CES2$SIZEJF+CES2$SIZEIN)
summary(R47)
## 
## Call:
## lm(formula = CES2$FDHOPC ~ CES2$INCPC + CES2$SIZEAM + CES2$SIZEAF + 
##     CES2$SIZEJM + CES2$SIZEJF + CES2$SIZEIN)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1701   -367    -51    265   4994 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1404.1382    72.9208   19.26  < 2e-16 ***
## CES2$INCPC     0.0323     0.0027   11.97  < 2e-16 ***
## CES2$SIZEAM -153.5617    32.7758   -4.69  3.3e-06 ***
## CES2$SIZEAF -100.2695    37.7264   -2.66   0.0080 ** 
## CES2$SIZEJM -103.0498    36.5849   -2.82   0.0050 ** 
## CES2$SIZEJF -153.3433    37.3895   -4.10  4.5e-05 ***
## CES2$SIZEIN -222.1773    85.7235   -2.59   0.0097 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 632 on 859 degrees of freedom
## Multiple R-squared:  0.295,  Adjusted R-squared:  0.29 
## F-statistic: 59.8 on 6 and 859 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 859 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.962729, which is exceeded in the intercept with t=19.26, in INCPC with t=11.97, in SIZEAM with t=|-4.69|=4.69, in SIZEAF with t=|-2.66|=2.66, in SIZEJM with t=|-2.82|=2.82, in SIZEJF with t=|-4.10|=4.10 and in SIZEIN with t=|-2.59|=2.59, which corresponds to rejections of all null hypotheses at 5% significance with p<0.01.

In a household with 0 people and a per capita income of 0, per capita food expenditure is on average approximately 1404.14 USD. However, there is no such observation in the data, so the intercept has no meaningful interpretation. On average, an increase of one USD in per capita income is associated with an increase of 0.0323 USD in per capita food expenditure, ceteris paribus. On average, one additional adult male in the household is associated with a decrease of 153.56 USD in per capita food expenditure, ceteris paribus. On average, one additional adult female in the household is associated with a decrease of 100.27 USD in per capita food expenditure, ceteris paribus. On average, one additional junior male in the household is associated with a decrease of 103.05 USD in per capita food expenditure, ceteris paribus. On average, one additional junior female in the household is associated with a decrease of 153.34 USD in per capita food expenditure, ceteris paribus. On average, one additional child aged less than 2 in the household is associated with a decrease of 222.18 USD in per capita food expenditure, ceteris paribus.

The F-test is based on the following form: \(H0: β2=β3=β4=β5=β6=β7=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

With a critical F-value of: \(Fcrit=F(6,859)=2.1091\), the null hypothesis is rejected at the 5% significance level as the F-statistic is 59.8.

There is no evidence that the new model has more explanatory power, because the adjusted coefficient of determination is essentially unchanged (0.290 compared with 0.291 in Exercise 8).
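One way to make this comparison formal is an F-test of the model from Exercise 8 against the model above. The sketch below assumes that SIZE equals the sum of the five composition variables, so that the Exercise 8 model is a restricted version of this one with four restrictions; that assumption is not verified here. It uses the rounded R² values from the two summaries, so the result is approximate.

```r
# Rounded R-squared values copied from the two regression summaries
r2_restricted <- 0.292    # Exercise 8: INCPC and SIZE
r2_unrestricted <- 0.295  # Exercise 9: INCPC and five composition variables

q <- 4           # number of restrictions (assumed: equal composition slopes)
df_resid <- 859  # residual degrees of freedom of the unrestricted model

# F-statistic for the restrictions, based on the change in R-squared
f_stat <- ((r2_unrestricted - r2_restricted) / q) /
  ((1 - r2_unrestricted) / df_resid)

# 5% critical value of F(q, df_resid)
f_crit <- qf(0.95, df1 = q, df2 = df_resid)

print(round(c(F = f_stat, F_crit = f_crit), 3))
```

Under these assumptions the statistic is below the critical value, which is in line with the conclusion that splitting household size into its components does not add explanatory power.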

Answer:

As the estimates for household composition show negative coefficients, the model likely suffers from bias, since according to the results more mouths to feed on average lead to lower food and beverage expenditure. INCPC is relatively collinear with the household size variables, which can partially explain the odd results.

Exercise 10

Question: Is expenditure on food related to household income and size? An alternative model specification.

The following variables are generated.

\(LGFDHO=ln(FDHO)\) \(LGINC=ln(INCOME)\) \(LGSIZE=ln(SIZE)\)

The following model is used to answer the question.

\(Model: LGFDHO=β1+β2LGINC+β3LGSIZE\)

LGINC is the natural logarithm of household income, LGSIZE is the natural logarithm of the number of persons in the household and LGFDHO is the natural logarithm of household food expenditure in USD.

The hypotheses below are used to answer the question.

Hypotheses:
LGINC:
\(H0: β2=0\)
\(H1: β2≠0\)
LGSIZE:
\(H0: β3=0\)
\(H1: β3≠0\)

Results. The following results are based on the model above.

CES3<-
  CES2%>%
  mutate(LGFDHO=log(FDHO))%>%
  mutate(LGINC=log(INCOME))%>%
  mutate(LGSIZE=log(SIZE))
R48<-lm(CES3$LGFDHO~CES3$LGINC+CES3$LGSIZE)
summary(R48)
## 
## Call:
## lm(formula = CES3$LGFDHO ~ CES3$LGINC + CES3$LGSIZE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8448 -0.2158  0.0294  0.2297  1.1806 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6776     0.2214    21.1   <2e-16 ***
## CES3$LGINC    0.2910     0.0227    12.8   <2e-16 ***
## CES3$LGSIZE   0.4832     0.0256    18.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.387 on 863 degrees of freedom
## Multiple R-squared:  0.518,  Adjusted R-squared:  0.517 
## F-statistic:  464 on 2 and 863 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 863 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.962717, which is exceeded in the intercept with t=21.1, in LGINC with t=12.8 and in LGSIZE with t=18.9, which corresponds to rejections of all null hypotheses at p<2*10^-16.

The intercept is the expected value of LGFDHO when both regressors equal zero, that is, for a one-person household with an income of 1 USD. It corresponds to a food expenditure of about exp(4.6776)≈107.5 USD and has no useful economic interpretation. On average, an increase of 1% in income leads to an increase of 0.291% in household food expenditure, ceteris paribus. On average, an increase of 1% in household size leads to an increase of 0.4832% in household food expenditure, ceteris paribus.

The F-test is based on the following form: \(H0: β2=β3=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

With a critical F-value of: \(Fcrit=F(2,863)=3.00615\), the null hypothesis is rejected with more than 95% certainty as the F-statistic is 464.
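The critical values quoted above can be reproduced with base R's quantile functions. The lines below are a minimal sketch, assuming a two-sided t-test and an F-test, both at the 5% level; the object names are illustrative and not part of the original script.

```r
# Two-sided 5% critical t-value with 863 residual degrees of freedom
t_crit <- qt(0.975, df = 863)

# 5% critical F-value with 2 numerator and 863 denominator degrees of freedom
f_crit <- qf(0.95, df1 = 2, df2 = 863)

print(c(t_crit = t_crit, f_crit = f_crit))
```

The same two calls, with the degrees of freedom changed, give the critical values used in the remaining exercises.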

Answer:

Expenditure on food is related to household income and size as the estimates show high levels of statistical significance, sizable beta-coefficient values, a high coefficient of determination and a high F-value.

Exercise 11

Question: Is expenditure on food per capita related to total income per capita and size? An alternative model specification.

The following variables are generated.

\(LGFDHOPC=ln(FDHO/SIZE)\) \(LGINCPC=ln(INCOME/SIZE)\)

The following model is used to answer the question.

\(Model: LGFDHOPC=β1+β2LGINCPC+β3LGSIZE\)

LGINCPC is the natural logarithm of per capita household income, LGSIZE is the natural logarithm of the number of persons in the household (generated in Exercise 10) and LGFDHOPC is the natural logarithm of per capita household food expenditure in USD.

The hypotheses below are used to answer the question.

Hypotheses:
LGINCPC:
\(H0: β2=0\)
\(H1: β2≠0\)
LGSIZE:
\(H0: β3=0\)
\(H1: β3≠0\)

Results. The following results are based on the model above.

CES4<-
  CES3%>%
  mutate(LGFDHOPC=log(FDHO/SIZE))%>%
  mutate(LGINCPC=log(INCOME/(SIZE)))
R49<-lm(CES4$LGFDHOPC~CES4$LGINCPC+CES3$LGSIZE)
summary(R49)
## 
## Call:
## lm(formula = CES4$LGFDHOPC ~ CES4$LGINCPC + CES3$LGSIZE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8448 -0.2158  0.0294  0.2297  1.1806 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.6776     0.2214   21.13   <2e-16 ***
## CES4$LGINCPC   0.2910     0.0227   12.80   <2e-16 ***
## CES3$LGSIZE   -0.2259     0.0254   -8.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.387 on 863 degrees of freedom
## Multiple R-squared:  0.329,  Adjusted R-squared:  0.328 
## F-statistic:  212 on 2 and 863 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 863 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.962717, which is exceeded in the intercept with t=21.13, in LGINCPC with t=12.8 and in LGSIZE with t=|-8.89|=8.89, which corresponds to rejections of all null hypotheses at p<2*10^-16.

The intercept is the expected value of LGFDHOPC when both regressors equal zero, that is, for a one-person household with a per capita income of 1 USD. It corresponds to a per capita food expenditure of about exp(4.6776)≈107.5 USD and has no useful economic interpretation. On average, an increase of 1% in per capita income leads to an increase of 0.291% in per capita household food expenditure, ceteris paribus. On average, an increase of 1% in household size leads to a decrease of 0.2259% in per capita household food expenditure, ceteris paribus.

The F-test is based on the following form: \(H0: β2=β3=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

With a critical F-value of: \(Fcrit=F(2,863)=3.006\), the null hypothesis is rejected with more than 95% certainty as the F-statistic is 212.

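Also, the intercept and the beta coefficient for income are the same as in Exercise 10. Both the dependent variable and income have been divided by size, which in logarithms subtracts LGSIZE from both sides of the equation, so only the coefficient for LGSIZE changes. The residuals are identical for the same reason.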
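Writing the per capita model as LGFDHO-LGSIZE = β1+β2(LGINC-LGSIZE)+β3LGSIZE and moving LGSIZE to the right-hand side shows that the size coefficient of the per capita model equals the size coefficient of the total model plus the income coefficient minus 1. With the estimates from Exercise 10 this gives 0.4832+0.2910-1 = -0.2258, which matches the reported -0.2259 up to rounding. The sketch below checks the identity on simulated data; all variable names and parameter values are made up for illustration and are not taken from the CES data.

```r
set.seed(42)
n <- 200
size   <- sample(1:6, n, replace = TRUE)        # simulated household size
income <- exp(rnorm(n, mean = 10, sd = 0.5))    # simulated household income
food   <- exp(4 + 0.3 * log(income) + 0.5 * log(size) + rnorm(n, sd = 0.3))

# Specification in totals and specification in per capita terms
total      <- lm(log(food) ~ log(income) + log(size))
per_capita <- lm(log(food / size) ~ log(income / size) + log(size))

b <- unname(coef(total))
g <- unname(coef(per_capita))

# Intercept and income elasticity coincide; the size coefficient shifts by (b[2] - 1)
print(rbind(total = b, per_capita = g))
print(all.equal(g[3], b[3] + b[2] - 1))
```

Because the two specifications are reparameterizations of each other, they fit the data equally well; the lower R-squared in the per capita model only reflects the different dependent variable.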

Answer:

Per capita expenditure on food is related to per capita income and household size, as the estimates are statistically significant and sizable, the coefficient of determination is 0.329 and the F-value is high.

Exercise 12

Question: Does age have an effect on household income?

At first, a new variable is generated.

\(AGE=ifelse(REFAGE≤50,0,1)\)

The following model is used to answer the question.

\(Model: INCOME=β1+β2SIZE+β3AGE\)

SIZE is the number of persons in the household, AGE is a dummy variable equal to 0 for individuals aged 50 or below and 1 for individuals above the age of 50, and INCOME is household income.

The hypothesis below is used to answer the question.

Hypotheses:
AGE:
\(H0: β3=0\)
\(H1: β3≠0\)

Results. The following results are based on the model above.

CES5<-
  CES4%>%
  mutate(AGE=ifelse(REFAGE<=50,0,1))
R50<-lm(CES5$INCOME~CES5$SIZE+CES5$AGE)
summary(R50)
## 
## Call:
## lm(formula = CES5$INCOME ~ CES5$SIZE + CES5$AGE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45004 -12059  -4095   8226 113812 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    22155       1609   13.77  < 2e-16 ***
## CES5$SIZE       3844        432    8.90  < 2e-16 ***
## CES5$AGE       -5408       1330   -4.07  5.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18000 on 863 degrees of freedom
## Multiple R-squared:  0.143,  Adjusted R-squared:  0.141 
## F-statistic: 71.8 on 2 and 863 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 863 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.962717, which is exceeded by the intercept with t=13.77, by SIZE with t=8.9 and by AGE with |t|=4.07. All null hypotheses are therefore rejected, with p<2e-16 for the intercept and SIZE and p=5.2e-5 for AGE.

The intercept implies that an individual aged 50 or below with 0 individuals in the household on average has a household income of 22155 USD, although a household with an income cannot consist of 0 individuals. On average, an increase of one person in household size leads to an increase of 3844 USD in household income, ceteris paribus. As AGE is coded 1 for people above the age of 50, people above the age of 50 on average have a household income 5408 USD lower than people aged 50 or below, ceteris paribus.
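The direction of the comparison depends on which group the dummy variable codes as 1. The short sketch below applies the same `ifelse` coding as the regression code to a few hypothetical ages (the vector `refage` is made up for illustration) to show which group forms the reference category.

```r
# Hypothetical ages, chosen to include the cut-off value
refage <- c(30, 50, 51, 70)

# Same coding as in the regression above: 0 if aged 50 or below, 1 otherwise
age_dummy <- ifelse(refage <= 50, 0, 1)

print(data.frame(refage, age_dummy))
```

With this coding the intercept refers to the group aged 50 or below, and the coefficient for AGE is the difference for the group above 50 relative to it.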

The F-test is based on the following form: \(H0: β2=β3=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

With a critical F-value of: \(Fcrit=F(2,863)=3.00615\), the null hypothesis is rejected with more than 95% certainty as the F-statistic is 71.8.

Answer:

Age does have an effect on income, as the average household income is substantially lower for people above the age of 50 (the group coded AGE=1), with a statistically significant estimate.

Exercise 13

Question: Is there sex discrimination in earnings?

The following model is used to answer the question.

\(Model: EARNINGS=β1+β2S_YEAR+β3ASVABC+β4MALE\)

S_YEAR is years of schooling, ASVABC is a composite measure of numerical and verbal ability, MALE is male (MALE=1) and EARNINGS is hourly earnings in 1994 USD.

The hypothesis below is used to answer the question.

Hypotheses:
MALE:
\(H0: β4=0\)
\(H1: β4≠0\)

Results. The following results are based on the model above.

R51<-lm(EDU$EARNINGS~EDU$S_YEAR+EDU$ASVABC+EDU$MALE)
summary(R51)
## 
## Call:
## lm(formula = EDU$EARNINGS ~ EDU$S_YEAR + EDU$ASVABC + EDU$MALE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18.62  -4.09  -1.14   2.27  74.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.6381     2.4074   -6.50  1.8e-10 ***
## EDU$S_YEAR    1.2960     0.1795    7.22  1.7e-12 ***
## EDU$ASVABC    0.1857     0.0477    3.89  0.00011 ***
## EDU$MALE      4.0240     0.7232    5.56  4.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.55 on 566 degrees of freedom
## Multiple R-squared:  0.228,  Adjusted R-squared:  0.224 
## F-statistic: 55.8 on 3 and 566 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 566 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964164, which is exceeded by the intercept with |t|=6.50, by S_YEAR with t=7.22, by ASVABC with t=3.89 and by MALE with t=5.56, so all null hypotheses are rejected at 5%.

The intercept implies that a female with 0 years of schooling and a zero test score in numerical and verbal ability has hourly earnings of approximately -15.64 USD, but no such individual exists in the data set. An increase of one year of schooling on average leads to an increase of approximately 1.30 USD in hourly earnings, ceteris paribus. An increase of one test score point in numerical and verbal ability on average leads to an increase in hourly earnings of approximately 0.19 USD, ceteris paribus. Males on average earn approximately 4.02 USD more per hour than females, ceteris paribus.
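The negative intercept is less puzzling once the fitted equation is evaluated at values that could plausibly occur. The sketch below plugs the coefficients reported above into the model for a hypothetical person with 12 years of schooling and an ASVABC score of 50; both values and the helper function are assumptions chosen for illustration, not taken from the data.

```r
# Coefficients copied from the regression output above
b <- c(intercept = -15.6381, s_year = 1.2960, asvabc = 0.1857, male = 4.0240)

# Hypothetical helper: fitted hourly earnings for a given profile
predict_earnings <- function(s_year, asvabc, male) {
  unname(b["intercept"] + b["s_year"] * s_year +
           b["asvabc"] * asvabc + b["male"] * male)
}

female <- predict_earnings(s_year = 12, asvabc = 50, male = 0)
male   <- predict_earnings(s_year = 12, asvabc = 50, male = 1)

print(c(female = female, male = male, gap = male - female))
```

The gap between the two fitted values equals the MALE coefficient regardless of the profile chosen, because the model contains no interaction terms.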

The F-test is based on the following form: \(H0: β2=β3=β4=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

With a critical F-value of: \(Fcrit=F(3,566)=2.6206\), the null hypothesis is rejected with more than 95% certainty as the F-statistic is 55.8.

Answer:

The estimate for MALE is statistically significant with a sizable positive beta coefficient, which can be an indicator of sex discrimination in earnings. However, with an adjusted coefficient of determination of only 0.224, most of the variation in hourly earnings is left unexplained, so relevant variables are probably omitted. If these are correlated with MALE, the estimate suffers from omitted variable bias.

Exercise 14

Question: Do earnings depend on type of employment?

The following model is used to answer the question.

\(Model: EARNINGS=β1+β2S_YEAR+β3ASVABC+β4MALE+β5URBAN+β6CATGOV+β7CATPRI\)

S_YEAR is years of schooling, ASVABC is a composite measure of numerical and verbal ability, MALE is male (MALE=1), URBAN is living in an urban area, CATGOV is employed by government, CATPRI is employed by private sector and EARNINGS is hourly earnings in 1994 USD.

The hypotheses below are used to answer the question.

Hypotheses:
CATGOV:
\(H0: β6=0\)
\(H1: β6≠0\)
CATPRI:
\(H0: β7=0\)
\(H1: β7≠0\)

Results. The following results are based on the model above.

R52<-lm(EDU$EARNINGS~EDU$S_YEAR++EDU$ASVABC+EDU$MALE+EDU$URBAN+EDU$CATGOV+EDU$CATPRI)
summary(R52)
## 
## Call:
## lm(formula = EDU$EARNINGS ~ EDU$S_YEAR + +EDU$ASVABC + EDU$MALE + 
##     EDU$URBAN + EDU$CATGOV + EDU$CATPRI)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18.74  -3.96  -1.08   2.32  74.01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.9871     2.7792   -4.67  3.7e-06 ***
## EDU$S_YEAR    1.3037     0.1845    7.07  4.7e-12 ***
## EDU$ASVABC    0.1817     0.0476    3.82  0.00015 ***
## EDU$MALE      3.8122     0.7225    5.28  1.9e-07 ***
## EDU$URBAN     1.0940     0.8659    1.26  0.20693    
## EDU$CATGOV   -4.8327     1.6824   -2.87  0.00423 ** 
## EDU$CATPRI   -3.2937     1.4205   -2.32  0.02077 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.5 on 563 degrees of freedom
## Multiple R-squared:  0.242,  Adjusted R-squared:  0.233 
## F-statistic: 29.9 on 6 and 563 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 563 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964187, which is exceeded by all of the estimates except URBAN (t=1.26, p≈0.207). The remaining estimates are statistically significant, as their p-values are lower than 0.05.

The intercept implies that a self-employed rural female with 0 years of schooling and a zero test score in numerical and verbal ability has hourly earnings of approximately -12.99 USD, but no such individual exists in the data set. An increase of one year of schooling on average leads to an increase of approximately 1.30 USD in hourly earnings, ceteris paribus. An increase of one test score point in numerical and verbal ability on average leads to an increase in hourly earnings of approximately 0.18 USD, ceteris paribus. Males on average earn approximately 3.81 USD more per hour than females, ceteris paribus. People living in urban areas on average earn 1.09 USD more per hour than people living in rural areas, ceteris paribus, although this estimate is not statistically significant. Government employees on average earn approximately 4.83 USD less per hour than the self-employed, ceteris paribus. Private sector employees on average earn approximately 3.29 USD less per hour than the self-employed, ceteris paribus.

The F-test is based on the following form: \(H0: β2=β3=β4=β5=β6=β7=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

With a critical F-value of: \(Fcrit=F(6,563)=2.115\), the null hypothesis is rejected with more than 95% certainty as the F-statistic is 29.9.

Answer:

Earnings depend on type of employment as both government employees and private sector employees on average earn less than the self-employed.

Exercise 15

Question: Does the sex of an individual affect educational attainment?

The following model is used to answer the question.

\(Model: S_YEAR=β1+β2ASVABC+β3HGCM+β4HGCF+β5MALE\)

ASVABC is a composite measure of numerical and verbal ability, HGCM is years of schooling of the respondent’s mother, HGCF is years of schooling of the respondent’s father, MALE is male (MALE=1) and S_YEAR is years of schooling.

The hypotheses below are used to answer the question.

Hypotheses:
MALE:
\(H0: β5=0\)
\(H1: β5≠0\)

Results. The following results are based on the model above.

R51<-lm(EDU$S_YEAR~EDU$ASVABC+EDU$HGCM+EDU$HGCF+EDU$MALE)
summary(R51)
## 
## Call:
## lm(formula = EDU$S_YEAR ~ EDU$ASVABC + EDU$HGCM + EDU$HGCF + 
##     EDU$MALE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.184 -1.478 -0.344  1.224  5.733 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.19406    0.52723    9.85  < 2e-16 ***
## EDU$ASVABC   0.11545    0.00993   11.63  < 2e-16 ***
## EDU$HGCM     0.12071    0.03946    3.06  0.00232 ** 
## EDU$HGCF     0.10258    0.02942    3.49  0.00053 ***
## EDU$MALE    -0.02050    0.16392   -0.13  0.90054    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.93 on 565 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.332 
## F-statistic: 71.7 on 4 and 565 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 565 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964172, which is exceeded in all of the estimates except for MALE where p≈0.9.
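Each t-value in the table is the estimate divided by its standard error, and the p-value is the two-sided tail probability of the t-distribution with the residual degrees of freedom. As a minimal sketch, the lines below recompute both for MALE from the rounded figures printed in the output, so the result agrees with the table only up to rounding; the object names are illustrative.

```r
# Estimate and standard error for MALE, copied from the regression output
estimate  <- -0.02050
std_error <- 0.16392
df_resid  <- 565

# t-statistic and two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(c(t_value = t_value, p_value = p_value))
```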

A female with a zero test score in numerical and verbal ability whose parents both have 0 years of schooling on average has approximately 5.19 years of schooling. An increase of one test score point in numerical and verbal ability on average leads to an increase in years of schooling of approximately 0.115 years, ceteris paribus. An increase of one year in the schooling of the mother on average leads to an increase in years of schooling of approximately 0.12 years, ceteris paribus. An increase of one year in the schooling of the father on average leads to an increase in years of schooling of approximately 0.10 years, ceteris paribus. Males on average have 0.02 years less schooling than females, ceteris paribus, but this difference is not statistically significant.

Answer:

When controlling for the other independent variables, it can not be claimed that sex affects educational attainment as the beta coefficient for MALE is small and highly statistically insignificant.

Exercise 16

Question: Does education have an effect on household income?

The following model is used to answer the question.

\(Model: INCOME=β1+β2SIZE+β3EDUCDO+β4EDUCIC+β5EDUCCO\)

SIZE is the number of persons in the household, EDUCDO is high school educated, EDUCIC is university educated, EDUCCO is graduate school educated and INCOME is household income.

The hypotheses below are used to answer the question.

Hypotheses:
EDUCDO:
\(H0: β3=0\)
\(H1: β3≠0\)
EDUCIC:
\(H0: β4=0\)
\(H1: β4≠0\)
EDUCCO:
\(H0: β5=0\)
\(H1: β5≠0\)

Results. The following results are based on the model above.

CES6<-
  CES5%>%
  mutate(EDUCDO=ifelse(REFEDUC==1,1,0))%>%
  mutate(EDUCIC=ifelse(REFEDUC==2,1,0))%>%
  mutate(EDUCCO=ifelse(REFEDUC==3,1,0))
R52<-lm(CES6$INCOME~CES6$SIZE+CES6$EDUCDO+CES6$EDUCIC+CES6$EDUCCO)
summary(R52)
## 
## Call:
## lm(formula = CES6$INCOME ~ CES6$SIZE + CES6$EDUCDO + CES6$EDUCIC + 
##     CES6$EDUCCO)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34724 -10715  -3536   7852 120028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7376       2090    3.53  0.00044 ***
## CES6$SIZE       4253        379   11.22  < 2e-16 ***
## CES6$EDUCDO     7336       2095    3.50  0.00049 ***
## CES6$EDUCIC    15087       2138    7.06  3.5e-12 ***
## CES6$EDUCCO    24892       2637    9.44  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17000 on 861 degrees of freedom
## Multiple R-squared:  0.239,  Adjusted R-squared:  0.235 
## F-statistic: 67.4 on 4 and 861 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 861 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.962723, which is exceeded in all of the estimates and thus all of the null hypotheses are rejected.

A compulsory school educated individual in a household with 0 people on average has a household income of 7376 USD. An increase of one person in household size on average leads to an increase in household income by approximately 4253 USD, ceteris paribus. A high school educated individual on average has a household income of 7336 USD higher than a compulsory school educated individual, ceteris paribus. A university educated individual on average has a household income of 15087 USD higher than a compulsory school educated individual, ceteris paribus. A graduate school educated individual on average has a household income of 24892 USD higher than a compulsory school educated individual, ceteris paribus.

The F-test is based on the following form: \(H0: β2=β3=β4=β5=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

The null hypothesis regarding the F-test can be rejected as the critical value is Fcrit=F(4,861)=2.3822, which is exceeded when F=67.4>2.382 with p<2*10^-16.

Answer:

Education has an effect on household income, as higher levels of education on average lead to higher household incomes, all else equal.

Exercise 17

Question: Does type of employment affect educational attainment?

The following model is used to answer the question.

\(Model: S_YEAR=β1+β2ASVABC+β3MALE+β4HGCM+β5HGCF+β6CATGOV+β7CATPRI\)

ASVABC is a composite measure of numerical and verbal ability, MALE is male (MALE=1), HGCM is years of maternal schooling, HGCF is years of paternal schooling, CATGOV is employed by the government and CATPRI is employed by the private sector.

The hypotheses below are used to answer the question.

Hypotheses:
CATGOV:
\(H0: β6=0\)
\(H1: β6≠0\)
CATPRI:
\(H0: β7=0\)
\(H1: β7≠0\)

Results. The following results are based on the model above.

R53<-lm(EDU$S_YEAR~EDU$ASVABC+EDU$MALE+EDU$HGCM+EDU$HGCF+EDU$CATGOV+EDU$CATPRI)
summary(R53)
## 
## Call:
## lm(formula = EDU$S_YEAR ~ EDU$ASVABC + EDU$MALE + EDU$HGCM + 
##     EDU$HGCF + EDU$CATGOV + EDU$CATPRI)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.087 -1.452 -0.323  1.219  5.861 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0553     0.6130    8.25  1.1e-15 ***
## EDU$ASVABC    0.1144     0.0098   11.67  < 2e-16 ***
## EDU$MALE      0.0342     0.1623    0.21  0.83342    
## EDU$HGCM      0.1225     0.0389    3.15  0.00174 ** 
## EDU$HGCF      0.0977     0.0290    3.36  0.00083 ***
## EDU$CATGOV    1.0521     0.3751    2.81  0.00520 ** 
## EDU$CATPRI    0.0695     0.3190    0.22  0.82770    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.9 on 563 degrees of freedom
## Multiple R-squared:  0.357,  Adjusted R-squared:  0.35 
## F-statistic: 52.2 on 6 and 563 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 563 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964187, which is exceeded at 5% in all of the estimates except for MALE and CATPRI.

A self-employed female with a zero test score whose parents lack formal education on average has approximately 5.06 years of schooling, ceteris paribus. An increase of one test score point in numerical and verbal ability, on average increases schooling by 0.1144 years, ceteris paribus. Males on average have 0.0342 years more schooling than females, ceteris paribus. An increase of one year in maternal schooling, on average increases schooling by 0.1225 years, ceteris paribus. An increase of one year in paternal schooling, on average increases schooling by 0.0977 years, ceteris paribus. Government employees on average have approximately 1.05 years more schooling than self-employed, ceteris paribus. Private sector employees on average have approximately 0.0695 years more schooling than self-employed, ceteris paribus.

Answer:

Government employment is statistically significantly associated with educational attainment, but that is not the case for private sector employment, as the latter estimate lacks statistical significance.

Exercise 18

Question: Is the effect of education on earnings different for males and females?

At first, a new variable is generated.

\(MALES=MALE*S_YEAR\)

The following model is used to answer the question.

\(Model: EARNINGS=β1+β2S_YEAR+β3ASVABC+β4EDUPROF+β5EDUPHD+β6EDUMAST+β7EDUBA+β8EDUHS+β9MALE+β10MALES\)

S_YEAR is years of schooling, ASVABC is a composite measure of numerical and verbal ability, EDUPROF is a professional degree, EDUPHD is a doctorate degree, EDUMAST is a master’s degree, EDUBA is a bachelor’s degree, EDUHS is a high school degree, MALE is male (MALE=1), MALES is a slope dummy variable (interaction variable) between MALE and S_YEAR and EARNINGS is hourly earnings.

The hypotheses below are used to answer the question.

Hypotheses:
S_YEAR:
\(H0: β2=0\)
\(H1: β2≠0\)
MALE:
\(H0: β9=0\)
\(H1: β9≠0\)
MALES:
\(H0: β10=0\)
\(H1: β10≠0\)

Results. The following results are based on the model above.

EDU<-
  EDU%>%
  mutate(MALES=MALE*S_YEAR)
R54<-lm(EDU$EARNINGS~EDU$S_YEAR+EDU$ASVABC+EDU$EDUPROF+EDU$EDUPHD+
          EDU$EDUMAST+EDU$EDUBA+EDU$EDUHS+EDU$MALE+EDU$MALES)
summary(R54)
## 
## Call:
## lm(formula = EDU$EARNINGS ~ EDU$S_YEAR + EDU$ASVABC + EDU$EDUPROF + 
##     EDU$EDUPHD + EDU$EDUMAST + EDU$EDUBA + EDU$EDUHS + EDU$MALE + 
##     EDU$MALES)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.31  -3.81  -1.12   2.12  54.69 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.9394     4.4696   -2.45    0.015 *  
## EDU$S_YEAR    0.8387     0.3569    2.35    0.019 *  
## EDU$ASVABC    0.1890     0.0463    4.08  5.2e-05 ***
## EDU$EDUPROF  26.7366     4.0994    6.52  1.6e-10 ***
## EDU$EDUPHD   -2.6782     6.4648   -0.41    0.679    
## EDU$EDUMAST   3.3365     2.6922    1.24    0.216    
## EDU$EDUBA     2.0562     1.9210    1.07    0.285    
## EDU$EDUHS     0.9749     1.1429    0.85    0.394    
## EDU$MALE      6.7642     4.2027    1.61    0.108    
## EDU$MALES    -0.2153     0.3043   -0.71    0.480    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.19 on 560 degrees of freedom
## Multiple R-squared:  0.299,  Adjusted R-squared:  0.288 
## F-statistic: 26.6 on 9 and 560 DF,  p-value: <2e-16

Interpretation of the statistical tests:

The t-tests are based on the following form: \(H0: βi=0\)
\(H1: βi≠0\)

With 560 degrees of freedom, the critical value for statistical significance at 5% is a t-value of approximately 1.964209, which is exceeded by the intercept, S_YEAR, ASVABC and, among the educational dummy variables, only EDUPROF. The null hypotheses are not rejected for MALE, MALES and the remainder of the dummy variables.

As interpreting the intercept is absurd in this case, it is not done. An increase of one year of schooling for females on average leads to an increase of 0.8387 USD in hourly earnings, ceteris paribus. An increase of one test score point on average leads to an increase of 0.189 USD in hourly earnings, ceteris paribus. Individuals with a professional degree on average earn approximately 26.74 USD more per hour than compulsory school educated, ceteris paribus. Individuals with a doctorate degree on average earn approximately 2.68 USD less per hour than compulsory school educated, ceteris paribus. Individuals with a master’s degree on average earn approximately 3.34 USD more per hour than compulsory school educated, ceteris paribus. Individuals with a bachelor’s degree on average earn approximately 2.06 USD more per hour than compulsory school educated, ceteris paribus. Individuals with a high school degree on average earn approximately 0.97 USD more per hour than compulsory school educated, ceteris paribus. Because the model contains the interaction variable MALES, the coefficient for MALE is the earnings gap at 0 years of schooling: males on average earn approximately 6.76 USD more per hour than females with 0 years of schooling, ceteris paribus. An increase of one year of schooling for males on average leads to an increase of (0.8387-0.2153) = 0.6234 USD in hourly earnings, ceteris paribus.
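With a slope dummy in the model, the return to schooling differs by sex and the male-female earnings gap depends on the years of schooling. The sketch below works this out from the point estimates reported above; the helper `gap` and the schooling levels 0, 12 and 16 are illustrative assumptions, and since MALE and MALES are statistically insignificant the resulting numbers are only indicative.

```r
# Point estimates copied from the regression output above
b_s_year <- 0.8387    # return to one year of schooling for females
b_male   <- 6.7642    # male-female gap at 0 years of schooling
b_males  <- -0.2153   # change in the return to schooling for males

slope_female <- b_s_year
slope_male   <- b_s_year + b_males

# Hypothetical helper: estimated male-female gap at a given number of school years
gap <- function(s_year) b_male + b_males * s_year

print(c(slope_female = slope_female, slope_male = slope_male))
print(data.frame(s_year = c(0, 12, 16), gap = gap(c(0, 12, 16))))
```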

The F-test is based on the following form: \(H0: β2=β3=β4=β5=β6=β7=β8=β9=β10=0\)
\(H1: At least one coefficient is statistically significantly different from zero\)

The null hypothesis regarding the F-test can be rejected as the critical value is Fcrit=F(9,560)=1.8965, which is exceeded when F=26.6>1.8965 with p<2*10^-16.

EDUCS is not included in the model to avoid perfect collinearity among the educational dummy variables. A regression model with an intercept cannot include a full set of dummy variables that sum to 1 for every observation, as this would lead to the dummy variable trap.
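The dummy variable trap can be seen directly in R. The following sketch uses simulated data with three mutually exclusive categories (all names and values are made up for illustration): when all three dummies are included together with the intercept, `lm` cannot estimate the last one and reports `NA` for it.

```r
set.seed(7)
n <- 90
level <- rep(c("a", "b", "c"), each = n / 3)   # three exclusive categories

d_a <- as.numeric(level == "a")
d_b <- as.numeric(level == "b")
d_c <- as.numeric(level == "c")
y   <- 10 + 2 * d_b + 5 * d_c + rnorm(n)

# d_a + d_b + d_c equals 1 for every row, i.e. the intercept column
trap <- lm(y ~ d_a + d_b + d_c)
# Dropping one dummy makes category "a" the reference group
ok   <- lm(y ~ d_b + d_c)

print(coef(trap))
print(coef(ok))
```

In the second regression the coefficients are differences relative to the omitted category, which is how the educational dummy variables above are interpreted.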

Answer:

As the beta coefficient estimates for MALE and MALES are statistically insignificant while controlling for the educational dummy variables, there is no evidence that the effect of education on earnings differs between males and females.

Exercise 19

Question: Do earnings depend on length of work experience?

New variables are generated to proxy for potential length of work experience.

\(PWE=AGE-S_YEAR-5\) \(PWEBEF=PWE-TENURE\)

The following models are used to answer the question.

\(Model 1: EARNINGS=β1+β2S_YEAR+β3ASVABC+β4MALE+β5URBAN\)
\(Model 2: EARNINGS=β1+β2S_YEAR+β3ASVABC+β4MALE+β5URBAN+β6PWE\)
\(Model 3: EARNINGS=β1+β2S_YEAR+β3ASVABC+β4MALE+β5URBAN+β6PWEBEF+β7TENURE\)
\(Model 4 (male only): EARNINGS_m=β1+β2S_YEAR+β3ASVABC+β4URBAN+β5PWE\)
\(Model 5 (female only): EARNINGS_f=β1+β2S_YEAR+β3ASVABC+β4URBAN+β5PWE\)

S_YEAR is years of schooling, ASVABC is a composite measure of numerical and verbal ability, MALE is male (MALE=1), URBAN is living in urban area, PWE and PWEBEF are proxy variables for potential work experience and EARNINGS is hourly earnings in USD.

The hypotheses below are used to answer the question.

Hypotheses:
PWE (model 2):
\(H0: β6=0\)
\(H1: β6≠0\)
PWEBEF (model 3):
\(H0: β6=0\)
\(H1: β6≠0\)
TENURE (model 3):
\(H0: β7=0\)
\(H1: β7≠0\)
PWE (models 4 and 5):
\(H0: β5=0\)
\(H1: β5≠0\)

Results. The following results are based on the models above.

# Generating PWE
EDU<-
  EDU%>%
  mutate(PWE=AGE-S_YEAR-5)
# Running regressions
R55<-lm(EDU$EARNINGS~EDU$S_YEAR+EDU$ASVABC+EDU$MALE+EDU$URBAN)
summary(R55)
## 
## Call:
## lm(formula = EDU$EARNINGS ~ EDU$S_YEAR + EDU$ASVABC + EDU$MALE + 
##     EDU$URBAN)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18.74  -4.05  -1.25   2.32  74.16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.9950     2.4219   -6.60  9.2e-11 ***
## EDU$S_YEAR    1.2533     0.1824    6.87  1.7e-11 ***
## EDU$ASVABC    0.1874     0.0477    3.92  9.7e-05 ***
## EDU$MALE      4.0087     0.7229    5.55  4.5e-08 ***
## EDU$URBAN     1.1204     0.8703    1.29      0.2    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.54 on 565 degrees of freedom
## Multiple R-squared:  0.23,   Adjusted R-squared:  0.225 
## F-statistic: 42.3 on 4 and 565 DF,  p-value: <2e-16
R56<-lm(EDU$EARNINGS~EDU$S_YEAR+EDU$ASVABC+EDU$MALE+EDU$URBAN+EDU$PWE)
summary(R56)
## 
## Call:
## lm(formula = EDU$EARNINGS ~ EDU$S_YEAR + EDU$ASVABC + EDU$MALE + 
##     EDU$URBAN + EDU$PWE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19.10  -4.17  -1.00   2.32  73.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -26.6438     4.9875   -5.34  1.3e-07 ***
## EDU$S_YEAR    1.6800     0.2522    6.66  6.5e-11 ***
## EDU$ASVABC    0.1675     0.0482    3.47  0.00055 ***
## EDU$MALE      4.0897     0.7205    5.68  2.2e-08 ***
## EDU$URBAN     1.1060     0.8666    1.28  0.20239    
## EDU$PWE       0.4113     0.1686    2.44  0.01503 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.51 on 564 degrees of freedom
## Multiple R-squared:  0.238,  Adjusted R-squared:  0.232 
## F-statistic: 35.3 on 5 and 564 DF,  p-value: <2e-16
# Correlation between S_YEAR and PWE
EDU%>%
  dplyr::select(PWE,S_YEAR)%>%
  cor(EDU$PWE)
##          [,1]
## PWE     1.000
## S_YEAR -0.718
# Generating PWEBEF
EDU<-
  EDU%>%
  mutate(PWEBEF=PWE-TENURE)
# Running regression
R57<-lm(EDU$EARNINGS~EDU$S_YEAR+EDU$ASVABC+EDU$MALE+EDU$URBAN+EDU$PWEBEF+EDU$TENURE)
summary(R57)
## 
## Call:
## lm(formula = EDU$EARNINGS ~ EDU$S_YEAR + EDU$ASVABC + EDU$MALE + 
##     EDU$URBAN + EDU$PWEBEF + EDU$TENURE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.92  -4.06  -0.89   1.95  74.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -25.3762     4.9613   -5.11  4.3e-07 ***
## EDU$S_YEAR    1.6249     0.2507    6.48  2.0e-10 ***
## EDU$ASVABC    0.1571     0.0479    3.28   0.0011 ** 
## EDU$MALE      3.9010     0.7169    5.44  7.9e-08 ***
## EDU$URBAN     1.3143     0.8617    1.53   0.1278    
## EDU$PWEBEF    0.3149     0.1699    1.85   0.0643 .  
## EDU$TENURE    0.5741     0.1746    3.29   0.0011 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.43 on 563 degrees of freedom
## Multiple R-squared:  0.252,  Adjusted R-squared:  0.244 
## F-statistic: 31.7 on 6 and 563 DF,  p-value: <2e-16
# Correlation of PWE with S_YEAR and TENURE.
EDU%>%
  dplyr::select(PWE,S_YEAR,TENURE)%>%
  cor(EDU$PWE)
##          [,1]
## PWE     1.000
## S_YEAR -0.718
## TENURE  0.159
# Subsetting male and female.
EDU2m<-
  EDU%>%
  filter(MALE==1)
EDU2f<-
  EDU%>%
  filter(MALE==0)
# Running regressions
R58<-lm(EDU2m$EARNINGS~EDU2m$S_YEAR+EDU2m$ASVABC+EDU2m$URBAN+EDU2m$PWE)
summary(R58)
## 
## Call:
## lm(formula = EDU2m$EARNINGS ~ EDU2m$S_YEAR + EDU2m$ASVABC + EDU2m$URBAN + 
##     EDU2m$PWE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19.36  -4.96  -1.30   2.00  73.69 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -33.2585     7.6812   -4.33  2.0e-05 ***
## EDU2m$S_YEAR   2.0444     0.3858    5.30  2.2e-07 ***
## EDU2m$ASVABC   0.1772     0.0718    2.47   0.0142 *  
## EDU2m$URBAN    0.0981     1.3853    0.07   0.9436    
## EDU2m$PWE      0.8423     0.2650    3.18   0.0016 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.1 on 320 degrees of freedom
## Multiple R-squared:  0.181,  Adjusted R-squared:  0.171 
## F-statistic: 17.7 on 4 and 320 DF,  p-value: 4.04e-13
R59<-lm(EDU2f$EARNINGS~EDU2f$S_YEAR+EDU2f$ASVABC+EDU2f$URBAN+EDU2f$PWE)
summary(R59)
## 
## Call:
## lm(formula = EDU2f$EARNINGS ~ EDU2f$S_YEAR + EDU2f$ASVABC + EDU2f$URBAN + 
##     EDU2f$PWE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.22  -2.96  -0.47   2.37  38.26 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -12.2759     4.9930   -2.46   0.0147 *  
## EDU2f$S_YEAR   1.1943     0.2588    4.61  6.4e-06 ***
## EDU2f$ASVABC   0.1560     0.0517    3.02   0.0028 ** 
## EDU2f$URBAN    2.3606     0.8353    2.83   0.0051 ** 
## EDU2f$PWE     -0.1603     0.1665   -0.96   0.3364    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.47 on 240 degrees of freedom
## Multiple R-squared:  0.359,  Adjusted R-squared:  0.348 
## F-statistic: 33.6 on 4 and 240 DF,  p-value: <2e-16

Interpretation of the statistical tests:

When comparing the regression results based on model 1 and model 2, the coefficient on S_YEAR is larger when PWE is included. S_YEAR is highly negatively correlated with PWE (correlation of -0.718), as years of schooling enter the calculation of PWE. With such a high correlation, if PWE is included, S_YEAR should be omitted, because collinearity between the two variables makes their individual estimates unreliable.

When comparing the regression results based on model 2 and model 3, there is no clear difference in the size of the beta coefficients. However, the estimates are now affected by collinearity, as PWEBEF is correlated with both S_YEAR and TENURE.

The estimate for PWE is statistically significant for men but not for women, which indicates that work experience only affects earnings for males. For female workers, living in an urban area has a statistically significant effect on earnings, whereas this variable is not statistically significant for males.

Answer:

As the estimates for PWE are statistically significant in models 2, 3 and 4, and in model 3 TENURE is significant at the 5% level and PWEBEF at the 10% level, earnings in general depend on work experience. However, when the regression is split by gender, this significance appears to be driven by the effect of work experience on the earnings of male workers.

Exercise 20

Question: Does parental educational attainment affect family size?

A variable is generated by summing HGCM and HGCF.

\(SP=HGCM+HGCF\)

The following models are used to answer the question.

\(Model 1: SIBLINGS=β1+β2HGCM+β3HGCF\)
\(Model 2: SIBLINGS=β1+β2SP\)

HGCM is years of schooling of respondent’s mother, HGCF is years of schooling of respondent’s father, SP is a summed combination of HGCM and HGCF and SIBLINGS is number of siblings.

The hypotheses below are used to answer the question.

Hypotheses 1:
HGCM:
\(H0: β2=0\)
\(H1: β2≠0\)
HGCF:
\(H0: β3=0\)
\(H1: β3≠0\)

Testing linear restriction.

Hypothesis 2
\(H0:β2=β3\)
\(H1:β2≠β3\)

Critical value for linear restriction.

\(Fcrit=F(1,df)=F(1,567)=3.858\)

Results. The following results are based on the models above.

# Regression of SIBLINGS on HGCM and HGCF
R60<-lm(EDU$SIBLINGS~EDU$HGCM+EDU$HGCF)
summary(R60)
## 
## Call:
## lm(formula = EDU$SIBLINGS ~ EDU$HGCM + EDU$HGCF)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.907 -1.248 -0.188  0.933  8.933 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.9951     0.3980   15.06  < 2e-16 ***
## EDU$HGCM     -0.1987     0.0397   -5.01  7.2e-07 ***
## EDU$HGCF     -0.0453     0.0292   -1.55     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.97 on 567 degrees of freedom
## Multiple R-squared:  0.0883, Adjusted R-squared:  0.0851 
## F-statistic: 27.4 on 2 and 567 DF,  p-value: 4.18e-12
# Correlation between HGCM and HGCF
EDU%>%
  dplyr::select(HGCM,HGCF)%>%
  cor(EDU$HGCM)
##       [,1]
## HGCM 1.000
## HGCF 0.579
# Unrestricted regression
R60_unrest<-lm(EDU$SIBLINGS~EDU$HGCM+EDU$HGCF)
summary(R60_unrest)
## 
## Call:
## lm(formula = EDU$SIBLINGS ~ EDU$HGCM + EDU$HGCF)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.907 -1.248 -0.188  0.933  8.933 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.9951     0.3980   15.06  < 2e-16 ***
## EDU$HGCM     -0.1987     0.0397   -5.01  7.2e-07 ***
## EDU$HGCF     -0.0453     0.0292   -1.55     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.97 on 567 degrees of freedom
## Multiple R-squared:  0.0883, Adjusted R-squared:  0.0851 
## F-statistic: 27.4 on 2 and 567 DF,  p-value: 4.18e-12
# Restricted regression
EDU<-
  EDU%>%
  mutate(SP=HGCM+HGCF)
R60_rest<-lm(EDU$SIBLINGS~EDU$SP)
summary(R60_rest)
## 
## Call:
## lm(formula = EDU$SIBLINGS ~ EDU$SP)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.937 -1.249 -0.219  0.922  8.922 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.6549     0.3757   15.05  < 2e-16 ***
## EDU$SP       -0.1074     0.0155   -6.94  1.1e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.98 on 568 degrees of freedom
## Multiple R-squared:  0.0782, Adjusted R-squared:  0.0766 
## F-statistic: 48.2 on 1 and 568 DF,  p-value: 1.05e-11
# Generating F-value for linear restriction
(RSS(R60_rest)-RSS(R60_unrest))/(RSS(R60_unrest)/(summary(R60_unrest)$df[2]))
## [1] 6.25

Interpretation of the statistical tests:

When regressing SIBLINGS on HGCM and HGCF, only the years of schooling of the respondent’s mother has a statistically significant effect on the number of siblings in the family.
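
The t values and p-values in the summary can be reproduced by hand from the rounded estimates and standard errors reported above. The following is a minimal sketch in base R; the names est, se, t_val and p_val are illustrative and not part of the original code, and small differences from the printed output are due to rounding.

```r
# Rounded estimates and standard errors copied from summary(R60)
est <- c(HGCM = -0.1987, HGCF = -0.0453)
se  <- c(HGCM =  0.0397, HGCF =  0.0292)
df  <- 567                       # residual degrees of freedom

# t value = estimate / standard error
t_val <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_val), df = df)

print(round(t_val, 2))
print(signif(p_val, 2))
```

Only the p-value for HGCM falls below 0.05, which matches the significance stars in the regression output.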

The correlation coefficient between HGCM and HGCF is approximately 0.579.

Precision increases when SIBLINGS is regressed on the combined variable SP: the standard error of the slope falls to 0.0155, the absolute t value rises to 6.94 and the p-value is smaller. However, the null hypothesis of equal coefficients on HGCM and HGCF is rejected, as F=6.249>3.858=Fcrit, so the two variables affect SIBLINGS differently and the restriction imposed by SP is not supported by the data.
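
The F statistic for the linear restriction relies on a helper RSS() that is not defined in this section. The sketch below shows one way the calculation could be set up in base R. It uses simulated data in place of EDU, and the names sim, rss(), f_manual and f_crit are hypothetical; it is an illustration of the technique, not the original code.

```r
set.seed(1)

# Simulated stand-in for EDU (assumption: 570 observations, two correlated regressors)
n  <- 570
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
y  <- 6 - 0.2 * x1 - 0.05 * x2 + rnorm(n, sd = 2)
sim <- data.frame(y = y, x1 = x1, x2 = x2, sp = x1 + x2)

# Unrestricted model and model restricted to equal slopes
unrest <- lm(y ~ x1 + x2, data = sim)
rest   <- lm(y ~ sp, data = sim)

# Hypothetical helper: residual sum of squares of a fitted model
rss <- function(model) sum(resid(model)^2)

# F = ((RSS_r - RSS_u) / 1) / (RSS_u / df_u), with one restriction
f_manual <- (rss(rest) - rss(unrest)) / (rss(unrest) / df.residual(unrest))

# The same statistic from the built-in comparison of nested models
f_anova <- anova(rest, unrest)$F[2]

# 5% critical value of F(1, 567)
f_crit <- qf(0.95, df1 = 1, df2 = df.residual(unrest))

print(c(manual = f_manual, anova = f_anova, critical = f_crit))
```

The restriction is rejected at the 5% level whenever the F statistic exceeds the critical value.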

Answer:

Maternal educational attainment affects family size, but paternal educational attainment does not have a statistically significant effect. The estimates for maternal and paternal educational attainment are different, as the null hypothesis of equal coefficients is rejected.

Exercise 21

Question: Is the disturbance term in the expenditure function heteroscedastic?

The following variables are generated. INCPCRANK is generated by ranking INCPC.

\(FDHOPC=FDHO/SIZE\)
\(LGFDHOPC=log(FDHOPC)\)
\(INCPC=INCOME/SIZE\)

The following models are used to answer the question.

\(Model 1: FDHOPC=β1+β2INCPC+β3SIZE\)
\(Model 2: LGFDHOPC=β1+β2INCPC+β3SIZE\)

INCPC is per capita household income, SIZE is household size, FDHOPC is per capita food and non-alcoholic beverages consumed at home and LGFDHOPC is logarithmic per capita food and non-alcoholic beverages consumed at home.

The hypothesis below is used to answer the question.

Hypotheses:
\(H0: The regression is homoskedastic, i.e. has constant variance in the disturbance terms\)
\(H1: The regression is heteroskedastic, i.e. does not have constant variance in the disturbance terms\)

Results. The following results are based on the models above.

# Generating variables and running regressions
CES<-
  CES%>%
  mutate(FDHOPC=FDHO/SIZE)%>%
  mutate(LGFDHOPC=log(FDHOPC))%>%
  mutate(INCPC=INCOME/SIZE)%>%
  mutate(INCPCRANK=rank(INCPC))
R61<-lm(CES$FDHOPC~CES$INCPC+CES$SIZE)
R62<-lm(CES$LGFDHOPC~CES$INCPC+CES$SIZE)
# Performing Goldfeld-Quandt test for heteroscedasticity.
gqtest(R61,order.by=~CES$INCPC,data=CES,fraction=nrow(CES)*0.2)
## 
##  Goldfeld-Quandt test
## 
## data:  R61
## GQ = 4, df1 = 344, df2 = 343, p-value <2e-16
## alternative hypothesis: variance increases from segment 1 to 2
gqtest(R62,order.by=~CES$INCPC,data=CES,fraction=nrow(CES)*0.2)
## 
##  Goldfeld-Quandt test
## 
## data:  R62
## GQ = 2, df1 = 344, df2 = 343, p-value = 7e-05
## alternative hypothesis: variance increases from segment 1 to 2

Interpretation of the statistical tests:

As the Goldfeld-Quandt tests for both regressions show p-values that are significant at the 5% level, the null hypothesis of homoskedasticity is rejected and the variance of the disturbance term is not constant. Both regressions are heteroskedastic.
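
The mechanics of the Goldfeld-Quandt test can be sketched in base R without lmtest. This is an illustration on simulated data, not a reproduction of gqtest(): the variable names and the rule used to split the sample are assumptions, and gqtest() may place the split points slightly differently.

```r
set.seed(2)

# Simulated data (assumption): disturbance standard deviation grows with x
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
sim <- data.frame(x = x, y = y)

# Step 1: order the observations by the variable suspected of driving the variance
sim <- sim[order(sim$x), ]

# Step 2: omit the central 20% and keep two outer subsamples of equal size
n_drop <- floor(0.2 * n)
n_sub  <- floor((n - n_drop) / 2)
low    <- sim[1:n_sub, ]
high   <- sim[(n - n_sub + 1):n, ]

# Step 3: fit the same regression separately in both subsamples
m_low  <- lm(y ~ x, data = low)
m_high <- lm(y ~ x, data = high)

# Step 4: ratio of the residual variances, high-x segment over low-x segment
s2 <- function(model) sum(resid(model)^2) / df.residual(model)
gq <- s2(m_high) / s2(m_low)

# One-sided p-value: alternative is that the variance increases with x
p_value <- pf(gq, df.residual(m_high), df.residual(m_low), lower.tail = FALSE)

print(c(GQ = gq, p = p_value))
```

A ratio well above 1 with a small p-value indicates that the residual variance is larger in the upper segment, which is the pattern reported for both expenditure regressions.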

Answer:

The disturbance term in the expenditure function is heteroskedastic.

Exercise 22

Question: Is the disturbance term in the earnings function heteroscedastic?

The following model is used to answer the question.

\(Model: EARNINGS=β1+β2S_YEAR+β3AGE+β4URBAN\)

AGE is age in 1994, URBAN is living in an urban area and S_YEAR is years of schooling.

The hypothesis below is used to answer the question.

Hypotheses:
\(H0: The regression is homoskedastic, i.e. has constant variance in the disturbance terms\)
\(H1: The regression is heteroskedastic, i.e. does not have constant variance in the disturbance terms\)

Results. The following results are based on the models above.

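# Running the regression and performing the Goldfeld-Quandt test, ordering by S_YEAR.
# Note that fraction is computed from nrow(CES) rather than nrow(EDU).
R63<-lm(EDU$EARNINGS~EDU$S_YEARRANK+EDU$AGE+EDU$URBAN)
gqtest(R63,order.by=~EDU$S_YEAR,data=EDU,fraction=nrow(CES)*0.2)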
## 
##  Goldfeld-Quandt test
## 
## data:  R63
## GQ = 3, df1 = 195, df2 = 194, p-value = 4e-15
## alternative hypothesis: variance increases from segment 1 to 2

Interpretation of the statistical tests:

As the Goldfeld-Quandt test for the regression shows a p-value that is significant at the 5% level, the null hypothesis of homoskedasticity is rejected and the variance of the disturbance term is not constant. The regression is heteroskedastic.

Answer:

The disturbance term in the earnings function is heteroskedastic.

Exercise 23

Question: Does the sugar cane model suffer from heteroscedasticity?

The following models are used to answer the question.

\(Model 1: SUGAR=β1+β2AGE\)
\(Model 2: SIZE=β1+β2AGE\)

AGE is age in 1994, SIZE is the size of the cane and SUGAR is the sugar content of the cane.

The hypothesis below is used to answer the question.

Hypotheses:
\(H0: The regression is homoskedastic, i.e. has constant variance in the disturbance terms\)
\(H1: The regression is heteroskedastic, i.e. does not have constant variance in the disturbance terms\)

Results. The following results are based on the models above.

# Running both regressions
R64<-lm(CANE$Sugar~CANE$Age)
R65<-lm(CANE$Size~CANE$Age)
# Generating residuals and fitted values for both regressions
CANE2<-
  CANE%>%
  mutate(RESID_R64=resid(R64))%>%
  mutate(FITTED_R64=fitted(R64))%>%
  mutate(RESID_R65=resid(R65))%>%
  mutate(FITTED_R65=fitted(R65))
# Plotting model 1
CANE2%>%
  ggplot(aes(x=FITTED_R64,y=RESID_R64))+
  geom_point(na.rm=TRUE)+
  geom_smooth(method=lm)+
  labs(x="Fitted values",
        y="Residuals",
        title="Fitted values vs residuals")+
  theme(plot.title=element_text(hjust=0.5))

# Plotting model 2
CANE2%>%
  ggplot(aes(x=FITTED_R65,y=RESID_R65))+
  geom_point(na.rm=TRUE)+
  geom_smooth(method=lm)+
  labs(x="Fitted values",
       y="Residuals",
       title="Fitted values vs residuals")+
  theme(plot.title=element_text(hjust=0.5))

# Breusch-Pagan test model 1
bptest(R64)
## 
##  studentized Breusch-Pagan test
## 
## data:  R64
## BP = 1, df = 1, p-value = 0.3
# Breusch-Pagan test model 2
bptest(R65)
## 
##  studentized Breusch-Pagan test
## 
## data:  R65
## BP = 31, df = 1, p-value = 2e-08
# White test model 1
bptest(R64,~Age+I(CANE$Age^2),data=CANE)
## 
##  studentized Breusch-Pagan test
## 
## data:  R64
## BP = 2, df = 2, p-value = 0.3
# White test model 2
bptest(R65,~Age+I(CANE$Age^2),data=CANE)
## 
##  studentized Breusch-Pagan test
## 
## data:  R65
## BP = 40, df = 2, p-value = 2e-09

Interpretation of the statistical tests:

As the Breusch-Pagan test and the White test for the regression of model 1 both show p-values of 0.3, which are not significant at the 5% level, the null hypothesis of homoskedasticity cannot be rejected and there is no evidence that the variance of the disturbance term varies. Also, in the plot for model 1, the spread of the residuals does not visibly change along the x-axis. Altogether, the regression is treated as homoskedastic.

As the Breusch-Pagan test and the White test for the regression of model 2 show p-values of at most 2e-08, which are significant at the 5% level, the null hypothesis of homoskedasticity is rejected and the variance of the disturbance term is not constant. Also, in the plot for model 2, the spread of the residuals visibly varies along the x-axis. Altogether, the regression is heteroskedastic.
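
Both tests above are based on an auxiliary regression of the squared residuals. The sketch below shows that calculation in base R on simulated data. The variable names and the simulated relationship are assumptions made for illustration, and the statistic shown is the studentized form, computed as the number of observations times the R-squared of the auxiliary regression.

```r
set.seed(3)

# Simulated data (assumption): disturbance standard deviation proportional to age
n    <- 200
age  <- runif(n, 10, 100)
size <- 5 + 2 * age + rnorm(n, sd = 0.5 * age)

model <- lm(size ~ age)
u2    <- resid(model)^2          # squared residuals

# Breusch-Pagan: regress the squared residuals on the regressor
aux_bp <- lm(u2 ~ age)
bp     <- n * summary(aux_bp)$r.squared
p_bp   <- pchisq(bp, df = 1, lower.tail = FALSE)

# White: add the squared regressor to the auxiliary regression
aux_white <- lm(u2 ~ age + I(age^2))
white     <- n * summary(aux_white)$r.squared
p_white   <- pchisq(white, df = 2, lower.tail = FALSE)

print(c(BP = bp, p_BP = p_bp, White = white, p_White = p_white))
```

The degrees of freedom equal the number of regressors in the auxiliary regression, which is why the tests above report df = 1 and df = 2.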

Answer:

The disturbance term in the sugar cane model is homoskedastic in model 1, but heteroskedastic in model 2.

Exercise 24

Question: Does correcting the sugar cane model for heteroscedasticity improve its performance?

The following models are used to answer the question.

\(Model 1: SUGAR=β1+β2AGE\)
\(Model 2: SIZE=β1+β2AGE\)

AGE is age in 1994, SIZE is the size of the cane and SUGAR is the sugar content of the cane.

The hypothesis below is used to answer the question.

Hypotheses:
\(H0: The regression is homoskedastic, i.e. has constant variance in the disturbance terms\)
\(H1: The regression is heteroskedastic, i.e. does not have constant variance in the disturbance terms\)

Results. The following results are based on the models above.

# Generating variables
CANE<-
  CANE%>%
  mutate(AgePC=Age/Age)%>%
  mutate(SugarPC=Sugar/Age)%>%
  mutate(SizePC=Size/Age)
# Running both regressions with the dependent variables divided by Age
R66<-lm(CANE$SizePC~CANE$Age)
summary(R66)
## 
## Call:
## lm(formula = CANE$SizePC ~ CANE$Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4523 -0.7223 -0.0345  0.7286  2.3267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.75724    0.21002    8.37  4.2e-13 ***
## CANE$Age     0.00360    0.00361    1.00     0.32    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.04 on 98 degrees of freedom
## Multiple R-squared:  0.01,   Adjusted R-squared:  -7.3e-05 
## F-statistic: 0.993 on 1 and 98 DF,  p-value: 0.322
R67<-lm(CANE$SugarPC~CANE$Age)
summary(R67)
## 
## Call:
## lm(formula = CANE$SugarPC ~ CANE$Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5488 -0.0129  0.0041  0.0351  0.3302 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.927699   0.034067   56.59   <2e-16 ***
## CANE$Age    0.001061   0.000586    1.81    0.073 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.169 on 98 degrees of freedom
## Multiple R-squared:  0.0324, Adjusted R-squared:  0.0225 
## F-statistic: 3.28 on 1 and 98 DF,  p-value: 0.0731
# Breusch-Pagan test for SizePC (model 2, SIZE)
bptest(R66)
## 
##  studentized Breusch-Pagan test
## 
## data:  R66
## BP = 1, df = 1, p-value = 0.3
# Breusch-Pagan test for SugarPC (model 1, SUGAR)
bptest(R67)
## 
##  studentized Breusch-Pagan test
## 
## data:  R67
## BP = 4, df = 1, p-value = 0.05
# White test for SizePC (model 2, SIZE)
bptest(R66,~Age+I(CANE$Age^2),data=CANE)
## 
##  studentized Breusch-Pagan test
## 
## data:  R66
## BP = 2, df = 2, p-value = 0.3
# White test for SugarPC (model 1, SUGAR)
bptest(R67,~Age+I(CANE$Age^2),data=CANE)
## 
##  studentized Breusch-Pagan test
## 
## data:  R67
## BP = 9, df = 2, p-value = 0.009

Interpretation of the statistical tests:

For SIZE divided by AGE (R66), the Breusch-Pagan test and the White test both show p-values of 0.3, which are not significant at the 5% level. These p-values are much larger than in the uncorrected SIZE regression, so dividing by AGE has removed the heteroskedasticity detected in model 2.

For SUGAR divided by AGE (R67), the Breusch-Pagan test shows a p-value of 0.05 and the White test shows a p-value of 0.009. As the uncorrected SUGAR regression was already homoskedastic, the transformation has not improved model 1, and the White test now rejects homoskedasticity at the 5% level.
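
The effect of dividing by AGE can be illustrated on simulated data. The sketch below assumes that the disturbance standard deviation is proportional to age and divides the whole equation by age, so that the regressor becomes 1/age. This is the textbook form of the correction and differs from the regressions above, which keep Age as the regressor; all names, including the helper bp_stat(), are hypothetical.

```r
set.seed(24)

# Simulated data (assumption): disturbance standard deviation proportional to age
n    <- 200
age  <- runif(n, 10, 100)
size <- 5 + 2 * age + rnorm(n, sd = 0.5 * age)

# Hypothetical helper: studentized Breusch-Pagan statistic for one regressor z
bp_stat <- function(model, z) {
  aux <- lm(resid(model)^2 ~ z)
  length(z) * summary(aux)$r.squared
}

# Uncorrected regression
raw <- lm(size ~ age)

# Corrected regression: size/age = b1 * (1/age) + b2 + u/age
scaled <- lm(I(size / age) ~ I(1 / age))

bp_raw    <- bp_stat(raw, age)
bp_scaled <- bp_stat(scaled, 1 / age)

# 5% critical value of the chi-squared distribution with 1 degree of freedom
crit <- qchisq(0.95, df = 1)

print(c(raw = bp_raw, scaled = bp_scaled, critical = crit))
```

If the assumption about the variance holds, the statistic for the corrected regression should fall well below the one for the uncorrected regression.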

Answer:

As indicated in model 2, correcting the sugar cane model for heteroskedasticity can improve its performance: after dividing SIZE by AGE, homoskedasticity is no longer rejected. In model 1, which was already homoskedastic, the same correction brings no improvement.