R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

#Chapter 2
#C9
#Load the wooldridge library
library(wooldridge)

# Load the dataset
data("countymurders")

# (i) How many counties had zero murders in 1996?
zero_murders_count <- sum(countymurders$year == 1996 & countymurders$murders == 0)
cat("Counties with zero murders in 1996:", zero_murders_count, "\n")
## Counties with zero murders in 1996: 1051
# How many counties had at least one execution?
counties_with_execution <- sum(countymurders$year == 1996 & countymurders$execs >= 1)
cat("Counties with at least one execution in 1996:", counties_with_execution, "\n")
## Counties with at least one execution in 1996: 31
# What is the largest number of executions?
max_executions <- max(countymurders$countymurders$execs)
## Warning in max(countymurders$countymurders$execs): no non-missing arguments to
## max; returning -Inf
cat("Largest number of executions in 1996:", max_executions, "\n")
## Largest number of executions in 1996: -Inf
# (ii) Estimate the equation murders = ß0 + ß1execs + u by OLS
model <- lm(murders ~ execs, data = subset(countymurders, year == 1996))

# Report the results
summary(model)
## 
## Call:
## lm(formula = murders ~ execs, data = subset(countymurders, year == 
##     1996))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149.12   -5.46   -4.46   -2.46 1338.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4572     0.8348   6.537 7.79e-11 ***
## execs        58.5555     5.8333  10.038  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared:  0.04389,    Adjusted R-squared:  0.04346 
## F-statistic: 100.8 on 1 and 2195 DF,  p-value: < 2.2e-16
# (iii) Interpret the slope coefficient
cat("The slope coefficient (ß1) represents the change in murders for a one-unit change in executions.\n")
## The slope coefficient (ß1) represents the change in murders for a one-unit change in executions.
cat("If ß1 is negative, it suggests a deterrent effect of capital punishment.\n")
## If ß1 is negative, it suggests a deterrent effect of capital punishment.
# (iv) What is the smallest number of murders that can be predicted?
min_executions <- min(countymurders$execs)
min_predicted_murders <- coef(model)[1] + coef(model)[2] * min_executions
cat("Smallest number of murders predicted:", min_predicted_murders, "\n")
## Smallest number of murders predicted: 5.457241
# What is the residual for a county with zero executions and zero murders?
# What is the residual for a county with zero executions and zero murders?
zero_exec_zero_murders_residual <- predict(model, newdata = data.frame(execs = 0), type = "response")
cat("Residual for a county with zero executions and zero murders:", zero_exec_zero_murders_residual, "\n")
## Residual for a county with zero executions and zero murders: 5.457241
# (v) Explain why a simple regression analysis is not well-suited for determining deterrence.
cat("A simple regression analysis may suffer from omitted variable bias and endogeneity issues.\n")
## A simple regression analysis may suffer from omitted variable bias and endogeneity issues.
cat("Factors other than executions could influence the murder rate, leading to biased estimates.\n")
## Factors other than executions could influence the murder rate, leading to biased estimates.
cat("Additionally, the decision to implement capital punishment may be influenced by the crime rate,\n")
## Additionally, the decision to implement capital punishment may be influenced by the crime rate,
cat("creating endogeneity problems and making causal inference challenging.\n")
## creating endogeneity problems and making causal inference challenging.
#Chapter 3
#5

# (i) In the model GPA = ß0 + ß1study + ß2sleep + ß3work + ß4leisure + u, 
# does it make sense to hold sleep, work, and leisure fixed while changing study?
cat("In the given model, it does not make sense to hold sleep, work, and leisure fixed while changing study.\n")
## In the given model, it does not make sense to hold sleep, work, and leisure fixed while changing study.
cat("The reason is that the sum of hours in all four activities must be 168 for each student.\n")
## The reason is that the sum of hours in all four activities must be 168 for each student.
cat("Changing the hours spent on studying would inherently change the hours available for other activities.\n\n")
## Changing the hours spent on studying would inherently change the hours available for other activities.
# (ii) Explain why this model violates Assumption MLR.3.
cat("This model violates Assumption MLR.3, which assumes that the regressors are fixed and non-stochastic.\n")
## This model violates Assumption MLR.3, which assumes that the regressors are fixed and non-stochastic.
cat("In this case, the hours spent on different activities are not fixed; they must sum up to 168, which introduces\n")
## In this case, the hours spent on different activities are not fixed; they must sum up to 168, which introduces
cat("stochasticity and correlation among the explanatory variables.\n\n")
## stochasticity and correlation among the explanatory variables.
# (iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption MLR.3?
cat("To satisfy Assumption MLR.3, you could reformulate the model by using a set of independent variables that\n")
## To satisfy Assumption MLR.3, you could reformulate the model by using a set of independent variables that
cat("are not constrained to sum to a fixed value. For example, you could use the hours spent on three activities\n")
## are not constrained to sum to a fixed value. For example, you could use the hours spent on three activities
cat("as independent variables, and the fourth one can be derived from the constraint (168 - study - work - leisure).\n")
## as independent variables, and the fourth one can be derived from the constraint (168 - study - work - leisure).
#10
# Load the wooldridge library
library(wooldridge)

# Load the dataset
data("mroz")

# (i) If x1 is highly correlated with x2 and x3 in the sample, and x2 and x3 have large partial effects on y,
# would you expect (B1 with ~ sign) and (adjusted B1) to be similar or very different? Explain.
cat("If x1 is highly correlated with x2 and x3 and x2 and x3 have large partial effects on y,\n")
## If x1 is highly correlated with x2 and x3 and x2 and x3 have large partial effects on y,
cat("you would expect (B1 with ~ sign) and (adjusted B1) to be similar. The inclusion of x2 and x3 in the model\n")
## you would expect (B1 with ~ sign) and (adjusted B1) to be similar. The inclusion of x2 and x3 in the model
cat("should help in capturing the relationship between x1 and y more accurately, resulting in a similar effect.\n\n")
## should help in capturing the relationship between x1 and y more accurately, resulting in a similar effect.
# (ii) If x1 is almost uncorrelated with x2 and x3, but x2 and x3 are highly correlated,
# will (B1 with ~ sign) and (adjusted B1) tend to be similar or very different? Explain.
cat("If x1 is almost uncorrelated with x2 and x3 but x2 and x3 are highly correlated,\n")
## If x1 is almost uncorrelated with x2 and x3 but x2 and x3 are highly correlated,
cat("(B1 with ~ sign) and (adjusted B1) tend to be similar. The high correlation between x2 and x3\n")
## (B1 with ~ sign) and (adjusted B1) tend to be similar. The high correlation between x2 and x3
cat("may result in multicollinearity issues, leading to unstable coefficient estimates for x2 and x3.\n\n")
## may result in multicollinearity issues, leading to unstable coefficient estimates for x2 and x3.
# (iii) If x1 is highly correlated with x2 and x3, and x2 and x3 have small partial effects on y,
# would you expect se(B₁ with ~ sign) or se(adjusted B₁) to be smaller? Explain.
cat("If x1 is highly correlated with x2 and x3, and x2 and x3 have small partial effects on y,\n")
## If x1 is highly correlated with x2 and x3, and x2 and x3 have small partial effects on y,
cat("you would expect se(B₁ with ~ sign) to be smaller. The high correlation can lead to\n")
## you would expect se(B₁ with ~ sign) to be smaller. The high correlation can lead to
cat("multicollinearity, inflating standard errors for the individual coefficients.\n\n")
## multicollinearity, inflating standard errors for the individual coefficients.
# (iv) If x1 is almost uncorrelated with x2 and x3, x2 and x3 have large partial effects on y,
# and x2 and x3 are highly correlated, would you expect se(B₁ with ~ sign) or se (adjusted B₁) to be smaller? Explain.
cat("If x1 is almost uncorrelated with x2 and x3, x2 and x3 have large partial effects on y,\n")
## If x1 is almost uncorrelated with x2 and x3, x2 and x3 have large partial effects on y,
cat("and x2 and x3 are highly correlated, you would expect se(adjusted B₁) to be smaller.\n")
## and x2 and x3 are highly correlated, you would expect se(adjusted B₁) to be smaller.
cat("The inclusion of highly correlated variables (x2 and x3) without much correlation with x1\n")
## The inclusion of highly correlated variables (x2 and x3) without much correlation with x1
cat("can improve the precision of the estimates for x1, resulting in smaller standard errors.\n")
## can improve the precision of the estimates for x1, resulting in smaller standard errors.
#C8
# Load the wooldridge library
library(wooldridge)

# Load the dataset
data("discrim")

# (i) Find the average values of prpblck and income in the sample, along with their standard deviations.
mean_prpblck <- mean(discrim$prpblck)
sd_prpblck <- sd(discrim$prpblck)
mean_income <- mean(discrim$income)
sd_income <- sd(discrim$income)

cat("Average prpblck:", mean_prpblck, "\n")
## Average prpblck: NA
cat("Standard deviation of prpblck:", sd_prpblck, "\n")
## Standard deviation of prpblck: NA
cat("Average income:", mean_income, "\n")
## Average income: NA
cat("Standard deviation of income:", sd_income, "\n\n")
## Standard deviation of income: NA
cat("Units of measurement: prpblck is the proportion of the population that is black (in percentage),\n")
## Units of measurement: prpblck is the proportion of the population that is black (in percentage),
cat("and income is the median income in the zip code.\n\n")
## and income is the median income in the zip code.
# (ii) Estimate the model psoda = B0 + B1prpblck + B2income + u by OLS
model <- lm(psoda ~ prpblck + income, data = discrim)
summary(model)
## 
## Call:
## lm(formula = psoda ~ prpblck + income, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29401 -0.05242  0.00333  0.04231  0.44322 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.563e-01  1.899e-02  50.354  < 2e-16 ***
## prpblck     1.150e-01  2.600e-02   4.423 1.26e-05 ***
## income      1.603e-06  3.618e-07   4.430 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08611 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06422,    Adjusted R-squared:  0.05952 
## F-statistic: 13.66 on 2 and 398 DF,  p-value: 1.835e-06
# Interpret the coefficient on prpblck
cat("The coefficient on prpblck is the estimated change in psoda for a one-unit change in prpblck.\n")
## The coefficient on prpblck is the estimated change in psoda for a one-unit change in prpblck.
cat("In this context, it represents the change in the price of soda for a 1% increase in the proportion\n")
## In this context, it represents the change in the price of soda for a 1% increase in the proportion
cat("of the population that is black. Whether it is economically large depends on the magnitude and significance\n")
## of the population that is black. Whether it is economically large depends on the magnitude and significance
cat("of the coefficient, which can be determined from the summary output.\n\n")
## of the coefficient, which can be determined from the summary output.
# (iii) Compare the estimate from part (ii) with the simple regression estimate from psoda on prpblck.
simple_model <- lm(psoda ~ prpblck, data = discrim)
summary(simple_model)
## 
## Call:
## lm(formula = psoda ~ prpblck, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30884 -0.05963  0.01135  0.03206  0.44840 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03740    0.00519  199.87  < 2e-16 ***
## prpblck      0.06493    0.02396    2.71  0.00702 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0881 on 399 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.01808,    Adjusted R-squared:  0.01561 
## F-statistic: 7.345 on 1 and 399 DF,  p-value: 0.007015
# (iv) Estimate the model log(psoda) = B0 + B1prpblck + B2log(income) + u
log_model <- lm(log(psoda) ~ prpblck + log(income), data = discrim)
summary(log_model)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33563 -0.04695  0.00658  0.04334  0.35413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.79377    0.17943  -4.424 1.25e-05 ***
## prpblck      0.12158    0.02575   4.722 3.24e-06 ***
## log(income)  0.07651    0.01660   4.610 5.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0821 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06809,    Adjusted R-squared:  0.06341 
## F-statistic: 14.54 on 2 and 398 DF,  p-value: 8.039e-07
# Calculate the estimated percentage change in psoda for a 20% increase in prpblck
percentage_change <- exp(coef(log_model)[2] * 0.20) * 100 - 100
cat("Estimated percentage change in psoda for a 20% increase in prpblck:", percentage_change, "\n\n")
## Estimated percentage change in psoda for a 20% increase in prpblck: 2.46141
# (v) Add the variable prppov to the regression in part (iv)
model_with_prppov <- lm(log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
summary(model_with_prppov)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32218 -0.04648  0.00651  0.04272  0.35622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08137 on 397 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.08696,    Adjusted R-squared:  0.08006 
## F-statistic:  12.6 on 3 and 397 DF,  p-value: 6.917e-08
# (vi) Find the correlation between log(income) and prppov
correlation_log_income_prppov <- cor(discrim$lincome, discrim$prppov)
cat("Correlation between lincome and prppov:", correlation_log_income_prppov, "\n\n")
## Correlation between lincome and prppov: NA
# (vii) Evaluate the following statement
cat("The statement 'Because lincome and prppov are so highly correlated, they have no business being in the same regression.'\n")
## The statement 'Because lincome and prppov are so highly correlated, they have no business being in the same regression.'
cat("The high negative correlation between lincome and prppov suggests multicollinearity between these two variables.\n")
## The high negative correlation between lincome and prppov suggests multicollinearity between these two variables.
cat("Multicollinearity can lead to unstable coefficient estimates, making it challenging to interpret the individual effects\n")
## Multicollinearity can lead to unstable coefficient estimates, making it challenging to interpret the individual effects
cat("of the variables. However, the decision to include or exclude variables should be based on the specific research question,\n")
## of the variables. However, the decision to include or exclude variables should be based on the specific research question,
cat("theoretical considerations, and the goals of the analysis.\n")
## theoretical considerations, and the goals of the analysis.
cat("In some cases, including both variables in the regression model might still be justified if they capture different aspects\n")
## In some cases, including both variables in the regression model might still be justified if they capture different aspects
cat("of the relationship with the dependent variable and contribute to a more comprehensive understanding of the phenomenon under study.\n")
## of the relationship with the dependent variable and contribute to a more comprehensive understanding of the phenomenon under study.
#Chapter4

#3
# Given coefficients and standard errors
coeff_log_sales <- 0.321
se_log_sales <- 0.216
coeff_profmarg <- 0.50
se_profmarg <- 0.46

# (i) Estimated percentage point change in Rdintens for a 10% increase in sales
percentage_change <- coeff_log_sales * 10
cat("a) Estimated percentage point change in Rdintens for a 10% increase in sales:", percentage_change, "\n")
## a) Estimated percentage point change in Rdintens for a 10% increase in sales: 3.21
# (ii) Test for log(sales) coefficient
t_stat_log_sales <- coeff_log_sales / se_log_sales
p_value_log_sales <- 2 * (1 - pt(abs(t_stat_log_sales), df = 29))  # two-tailed test
cat("b) p-value for the test on log(sales) coefficient:", p_value_log_sales, "\n")
## b) p-value for the test on log(sales) coefficient: 0.1480413
cat("   (At 5% level):", ifelse(p_value_log_sales < 0.05, "Reject H0", "Fail to reject H0"), "\n")
##    (At 5% level): Fail to reject H0
cat("   (At 10% level):", ifelse(p_value_log_sales < 0.10, "Reject H0", "Fail to reject H0"), "\n")
##    (At 10% level): Fail to reject H0
# (iii) Interpretation of the coefficient on profmarg
cat("c) Coefficient on profmarg:", coeff_profmarg, "\n")
## c) Coefficient on profmarg: 0.5
# (iv)) Test for profmarg coefficient
t_stat_profmarg <- coeff_profmarg / se_profmarg
p_value_profmarg <- 2 * (1 - pt(abs(t_stat_profmarg), df = 29))  # two-tailed test
cat("d) p-value for the test on profmarg coefficient:", p_value_profmarg, "\n")
## d) p-value for the test on profmarg coefficient: 0.2860082
cat("   (At 5% level):", ifelse(p_value_profmarg < 0.05, "Reject H0", "Fail to reject H0"), "\n")
##    (At 5% level): Fail to reject H0
cat("   (At 10% level):", ifelse(p_value_profmarg < 0.10, "Reject H0", "Fail to reject H0"), "\n")
##    (At 10% level): Fail to reject H0
#C8
# Load the wooldridge library
library(wooldridge)

# Load the dataset
data("k401ksubs")

# (i) How many single-person households are there in the data set?
single_person_households <- subset(k401ksubs, fsize == 1)
num_single_person_households <- nrow(single_person_households)
cat("Number of single-person households:", num_single_person_households, "\n\n")
## Number of single-person households: 2017
# (ii) Use OLS to estimate the model: nettfa = B0 + B1inc + B2age + u
model <- lm(nettfa ~ inc + age, data = single_person_households)
summary(model)
## 
## Call:
## lm(formula = nettfa ~ inc + age, data = single_person_households)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.95  -14.16   -3.42    6.03 1113.94 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43.03981    4.08039 -10.548   <2e-16 ***
## inc           0.79932    0.05973  13.382   <2e-16 ***
## age           0.84266    0.09202   9.158   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared:  0.1193, Adjusted R-squared:  0.1185 
## F-statistic: 136.5 on 2 and 2014 DF,  p-value: < 2.2e-16
# Interpret the slope coefficients
cat("Interpretation of slope coefficients:\n")
## Interpretation of slope coefficients:
cat("B1 (inc): The estimated change in nettfa for a one-unit change in inc (annual family income).\n")
## B1 (inc): The estimated change in nettfa for a one-unit change in inc (annual family income).
cat("B2 (age): The estimated change in nettfa for a one-unit change in age.\n")
## B2 (age): The estimated change in nettfa for a one-unit change in age.
cat("There might be surprises depending on the context and expectations of the relationship between variables.\n\n")
## There might be surprises depending on the context and expectations of the relationship between variables.
# (iii) Does the intercept from the regression in part (ii) have an interesting meaning? Explain.
cat("The intercept (B0) represents the estimated net financial wealth (nettfa) when both inc and age are zero.\n")
## The intercept (B0) represents the estimated net financial wealth (nettfa) when both inc and age are zero.
cat("In this context, it may not have a meaningful interpretation, as having zero income and age is not practically meaningful.\n\n")
## In this context, it may not have a meaningful interpretation, as having zero income and age is not practically meaningful.
# (iv) Find the p-value for the test H0: B₂ = 1 against H₁: B₂ < 1. Do you reject H0 at the 1% significance level?
test_result <- summary(model)$coefficient[3, "Pr(>|t|)"]
cat("p-value for the test H0: B₂ = 1 against H₁: B₂ < 1:", test_result, "\n")
## p-value for the test H0: B₂ = 1 against H₁: B₂ < 1: 1.265959e-19
cat("At the 1% significance level, we would reject H0 if the p-value is less than 0.01.\n")
## At the 1% significance level, we would reject H0 if the p-value is less than 0.01.
if (test_result < 0.01) {
  cat("We reject H0; there is evidence that B₂ is less than 1.\n\n")
} else {
  cat("We do not reject H0; there is not enough evidence to conclude that B₂ is less than 1.\n\n")
}
## We reject H0; there is evidence that B₂ is less than 1.
# (v) If you do a simple regression of nettfa on inc, is the estimated coefficient on inc much different from the estimate in part (ii)? Why or why not?
simple_model <- lm(nettfa ~ inc, data = single_person_households)
summary(simple_model)
## 
## Call:
## lm(formula = nettfa ~ inc, data = single_person_households)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -185.12  -12.85   -4.85    1.78 1112.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.5709     2.0607   -5.13 3.18e-07 ***
## inc           0.8207     0.0609   13.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.59 on 2015 degrees of freedom
## Multiple R-squared:  0.08267,    Adjusted R-squared:  0.08222 
## F-statistic: 181.6 on 1 and 2015 DF,  p-value: < 2.2e-16
cat("Comparison of the estimated coefficient on inc:\n")
## Comparison of the estimated coefficient on inc:
cat("The estimated coefficient on inc in the simple regression is compared to the estimate in part (ii).\n")
## The estimated coefficient on inc in the simple regression is compared to the estimate in part (ii).
cat("Differences may arise due to the inclusion of age in the multiple regression model, which may affect\n")
## Differences may arise due to the inclusion of age in the multiple regression model, which may affect
cat("the relationship between nettfa and inc. The context and goals of the analysis will determine\n")
## the relationship between nettfa and inc. The context and goals of the analysis will determine
cat("whether the inclusion of age improves the model.\n")
## whether the inclusion of age improves the model.
#Chapter5

#5

# (i) Probability that score exceeds 100 using normal distribution
# - The probability of score exceeding 100 is unfeasible because course scoring in real life has a maximum value of 100 percent.it is practically impossible to achieve scores more than 100 in any known course in the world. However some courses do allow participants with lower scores a chance to earn extra scores via additional tasks.

# (ii) Explanation of the left tail
cat("(ii) The normal distribution may not fit well in the left tail, as this type of percentile values are bounded by values of 0 and 100, and normal distribution assumes unbounded tails.\n")
## (ii) The normal distribution may not fit well in the left tail, as this type of percentile values are bounded by values of 0 and 100, and normal distribution assumes unbounded tails.
#C1
# Install and load necessary libraries
library(wooldridge)
library(ggplot2)

# Load the 'wage1' dataset
data("wage1")
wage_data <- wage1

# (i) Estimate the equation wage = b0 + b1educ + b2exper + b3tenure + u.
model_level <- lm(wage ~ educ + exper + tenure, data = wage_data)

# Save residuals
residuals_level <- residuals(model_level)

# Display summary statistics for the level-level model
summary(model_level)
## 
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6068 -1.7747 -0.6279  1.1969 14.6536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.87273    0.72896  -3.941 9.22e-05 ***
## educ         0.59897    0.05128  11.679  < 2e-16 ***
## exper        0.02234    0.01206   1.853   0.0645 .  
## tenure       0.16927    0.02164   7.820 2.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3024 
## F-statistic: 76.87 on 3 and 522 DF,  p-value: < 2.2e-16
# (ii) Repeat part (i), but with log(wage) as the dependent variable.
model_log_level <- lm(log(wage) ~ educ + exper + tenure, data = wage_data)

# Save residuals
residuals_log_level <- residuals(model_log_level)

# Plot histogram of residuals for the log-level model
hist(residuals_log_level, main = "Histogram of Residuals (Log-Level Model)", col = "lightblue", border = "black")

# Display summary statistics for the log-level model
summary(model_log_level)
## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure, data = wage_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05802 -0.29645 -0.03265  0.28788  1.42809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
## educ        0.092029   0.007330  12.555  < 2e-16 ***
## exper       0.004121   0.001723   2.391  0.01714 *  
## tenure      0.022067   0.003094   7.133 3.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared:  0.316,  Adjusted R-squared:  0.3121 
## F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16
# (iii) Q-Q plots for normality assessment
par(mfrow = c(2, 2))  # Set up a 2x2 grid for Q-Q plots

qqnorm(residuals_level, main = "Q-Q Plot - Level-Level Model")
qqline(residuals_level)

qqnorm(residuals_log_level, main = "Q-Q Plot - Log-Level Model")
qqline(residuals_log_level)

# Reset the plotting layout
par(mfrow = c(1, 1))

# Print summary statistics to the console for easier interpretation
cat("\nSummary Statistics - Level-Level Model:\n")
## 
## Summary Statistics - Level-Level Model:
summary(model_level)
## 
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6068 -1.7747 -0.6279  1.1969 14.6536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.87273    0.72896  -3.941 9.22e-05 ***
## educ         0.59897    0.05128  11.679  < 2e-16 ***
## exper        0.02234    0.01206   1.853   0.0645 .  
## tenure       0.16927    0.02164   7.820 2.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3024 
## F-statistic: 76.87 on 3 and 522 DF,  p-value: < 2.2e-16
cat("\nSummary Statistics - Log-Level Model:\n")
## 
## Summary Statistics - Log-Level Model:
summary(model_log_level)
## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure, data = wage_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05802 -0.29645 -0.03265  0.28788  1.42809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
## educ        0.092029   0.007330  12.555  < 2e-16 ***
## exper       0.004121   0.001723   2.391  0.01714 *  
## tenure      0.022067   0.003094   7.133 3.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared:  0.316,  Adjusted R-squared:  0.3121 
## F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.