CHAPTER 2

i. How many counties had zero murders in 1996? How many counties had at least one execution? What is the largest number of executions?

library(wooldridge)
data(countymurders)
county_murders_1996 <- subset(countymurders, year == 1996)
# Count counties with zero murders
zero_murders <- sum(county_murders_1996$murders == 0)

# Count counties with at least one execution
at_least_one_execution <- sum(county_murders_1996$execs > 0)

# Find the largest number of executions
max_executions <- max(county_murders_1996$execs)

# Report the answers
cat("Counties with zero murders in 1996:", zero_murders, "\n")
cat("Counties with at least one execution:", at_least_one_execution, "\n")
cat("Largest number of executions:", max_executions, "\n")
# Run the OLS regression of murders on execs using the 1996 data
model <- lm(murders ~ execs, data = county_murders_1996)
summary(model)
## 
## Call:
## lm(formula = murders ~ execs, data = county_murders_1996)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149.12   -5.46   -4.46   -2.46 1338.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4572     0.8348   6.537 7.79e-11 ***
## execs        58.5555     5.8333  10.038  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared:  0.04389,    Adjusted R-squared:  0.04346 
## F-statistic: 100.8 on 1 and 2195 DF,  p-value: < 2.2e-16

ii. Estimate the equation murders = β0 + β1 execs + u by OLS and report the sample size and R-squared.


# Run the regression on the full countymurders sample (all years)
model <- lm(murders ~ execs, data = countymurders)

# Display the summary, including sample size and R-squared
summary(model)
## 
## Call:
## lm(formula = murders ~ execs, data = countymurders)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -202.23   -6.84   -5.84   -3.84 1937.16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.8382     0.2418   28.28   <2e-16 ***
## execs        65.4650     2.1463   30.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.64 on 37347 degrees of freedom
## Multiple R-squared:  0.02431,    Adjusted R-squared:  0.02428 
## F-statistic: 930.4 on 1 and 37347 DF,  p-value: < 2.2e-16
iii. Interpret the slope coefficient reported in part (ii). Does the estimated equation suggest a deterrent effect of capital punishment?
# The slope coefficient on execs is about 65.5: one additional execution is associated with roughly 65 more murders. A deterrent effect would show up as a negative coefficient, so the estimated equation does not suggest any deterrent effect of capital punishment.
iv. What is the smallest number of murders that can be predicted by the equation? What is the residual for a county with zero executions and zero murders?
# Smallest predicted number of murders (when execs = 0)
predicted_murders <- coef(model)[1]

# Residual for a county with zero executions and zero murders
residual_zero_exec_zero_murders <- 0 - predicted_murders

# Print results
cat("Smallest predicted number of murders:", predicted_murders, "\n")
## Smallest predicted number of murders: 6.838201
cat("Residual for zero executions and zero murders:", residual_zero_exec_zero_murders, "\n")
## Residual for zero executions and zero murders: -6.838201

v. Explain why a simple regression analysis may not be well-suited to determine if capital punishment has a deterrent effect on murders.

# A note on the limitations of simple regression
cat("A simple regression does not account for other variables that influence murder rates, such as socio-economic factors, law enforcement differences, or cultural factors. Therefore, it may not accurately capture the causal effect of executions on murders.\n")
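As a rough illustration of this omitted-variables problem, the sketch below uses purely simulated (hypothetical) data in which executions have no causal effect on murders, yet both are driven by an unobserved county-level factor; the simple regression slope on executions is then spuriously positive, while controlling for the confounder removes most of it.

``` r
# Simulated illustration of omitted-variable bias (hypothetical data)
set.seed(123)
n <- 500
crime_level <- rnorm(n)                                       # unobserved confounder
execs_sim   <- rpois(n, lambda = exp(0.5 * crime_level))      # executions rise with crime_level
murders_sim <- rpois(n, lambda = exp(1 + 0.8 * crime_level))  # murders rise with crime_level,
                                                              # but execs has no causal effect
coef(lm(murders_sim ~ execs_sim))                # slope is spuriously positive
coef(lm(murders_sim ~ execs_sim + crime_level))  # slope shrinks toward zero with the control
```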

CHAPTER 3

QUESTION 5

i. Does it make sense to hold sleep, work, and leisure fixed, while changing study?

#No. The four time uses must sum to 168 hours per week, so it is impossible to change study while holding sleep, work, and leisure all fixed.

ii. Explain why this model violates Assumption MLR.3.

# This model violates Assumption MLR.3 (no perfect collinearity) because study + sleep + work + leisure = 168 for every observation, so there is an exact linear relationship among the regressors. An example is sketched below.
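A quick sketch with made-up weekly hours shows what this perfect collinearity does in practice: R cannot estimate all four coefficients and reports NA for the regressor it drops.

``` r
# Hypothetical weekly hours; the four time uses sum to 168 by construction
set.seed(1)
study   <- runif(20, 10, 40)
sleep   <- runif(20, 40, 60)
work    <- runif(20, 10, 40)
leisure <- 168 - study - sleep - work     # exact linear dependence (violates MLR.3)
gpa     <- 2 + 0.03 * study - 0.01 * work + rnorm(20, sd = 0.2)

# lm() drops the aliased regressor and reports NA for its coefficient (here, leisure)
summary(lm(gpa ~ study + sleep + work + leisure))
```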

iii. How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption MLR.3?

#By removing leisure from the model, we get: GPA = β0 + β1 study + β2 sleep + β3 work + u. Each slope is then the effect of shifting an hour into that activity and out of leisure (the omitted time use), holding the other included activities fixed, and MLR.3 is satisfied.
# Sample data: Assuming we have GPA and time spent on each activity for multiple students
data <- data.frame(
  GPA = c(3.5, 3.0, 3.8, 2.9, 3.2),        # Example GPA values
  study = c(30, 25, 40, 20, 35),           # Time spent on studying
  sleep = c(50, 60, 45, 55, 52),           # Time spent on sleeping
  work = c(30, 20, 35, 25, 30)             # Time spent on working
  # leisure is omitted due to the constraint
)

# Reformulated model without the 'leisure' variable
model <- lm(GPA ~ study + sleep + work, data = data)

# Summary of the model to view coefficients
summary(model)
## 
## Call:
## lm(formula = GPA ~ study + sleep + work, data = data)
## 
## Residuals:
##          1          2          3          4          5 
##  1.000e-01  8.327e-17 -5.000e-02 -5.000e-02 -2.671e-16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.36667    5.94844   2.751    0.222
## study        0.03333    0.01491   2.236    0.268
## sleep       -0.18333    0.07454  -2.460    0.246
## work        -0.16000    0.08062  -1.985    0.297
## 
## Residual standard error: 0.1225 on 1 degrees of freedom
## Multiple R-squared:  0.9726, Adjusted R-squared:  0.8905 
## F-statistic: 11.84 on 3 and 1 DF,  p-value: 0.2097

QUESTION 10

i. If x1 is highly correlated with x2 and x3 in the sample, and x2 and x3 have large partial effects on y, would you expect β~1 (from the simple regression of y on x1) and β^1 (from the multiple regression) to be similar or very different? Explain.

#If x1 is highly correlated with x2 and x3, it becomes difficult to isolate the effect of x1 on y when x2 and x3 are included in the model; this multicollinearity increases the variability of the coefficient estimates. Moreover, because x2 and x3 have large partial effects on y, omitting them from the simple regression loads their influence onto x1. So β~1 (the simple regression estimate) and β^1 (the multiple regression estimate) could be very different.

ii. If x1 is almost uncorrelated with x2 and x3, but x2 and x3 are highly correlated, will β~1 and β^1 tend to be similar or very different? Explain.

#Since x1 is almost uncorrelated with x2 and x3, including x2 and x3 in the multiple regression is unlikely to change the estimate much relative to the simple regression, so β~1 and β^1 are expected to be similar. The high correlation between x2 and x3 may inflate the variances of their own coefficients, but it should not affect β^1 much.

iii. If x1 is highly correlated with x2 and x3, and x2 and x3 have small partial effects on y, would you expect se(β~1) or se(β^1) to be smaller? Explain.

#In this case β^1 from the multiple regression is likely to have a larger standard error than β~1 from the simple regression, because of the multicollinearity among x1, x2, and x3. When x2 and x3 have small partial effects on y, they add little explanatory power, but their high correlation with x1 still inflates the sampling variance of β^1, so se(β~1) would typically be smaller. A small simulation of this case is sketched below.
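The sketch below checks this claim with made-up data (it uses the MASS package, which is also loaded for the simulations further down): x1 is highly correlated with x2 and x3, which have only small partial effects on y.

``` r
# Part (iii) sketch: compare se of the x1 slope in the simple regression
# (beta1-tilde) vs. the multiple regression (beta1-hat)
library(MASS)
set.seed(7)
S  <- matrix(c(1, 0.9, 0.9,
               0.9, 1, 0.9,
               0.9, 0.9, 1), nrow = 3)
X  <- mvrnorm(200, mu = c(0, 0, 0), Sigma = S)
y3 <- 3 * X[, 1] + 0.1 * X[, 2] + 0.1 * X[, 3] + rnorm(200)

summary(lm(y3 ~ X[, 1]))$coefficients                    # se(beta1-tilde)
summary(lm(y3 ~ X[, 1] + X[, 2] + X[, 3]))$coefficients  # se(beta1-hat), larger here
```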

iv. If x1 is almost uncorrelated with x2 and x3, x2 and x3 have large partial effects on y, and x2 and x3 are highly correlated, would you expect se(β~1) or se(β^1) to be smaller? Explain.

#Here se(β^1) should be smaller: adding x2 and x3 removes a large part of the error variance, and because x1 is nearly uncorrelated with them there is little multicollinearity penalty on β^1. Scenario 2 in the simulation below illustrates this.

# Load necessary libraries
library(MASS) # For multivariate normal data generation
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:wooldridge':
## 
##     cement
# Set seed for reproducibility
set.seed(42)

# Simulate data for y, x1, x2, and x3
n <- 100  # sample size

# Scenario 1: x1, x2, x3 highly correlated with large effects on y
# Define covariance matrix for high correlation
sigma_high_corr <- matrix(c(1, 0.9, 0.9, 0.9, 1, 0.9, 0.9, 0.9, 1), nrow = 3)
X_high_corr <- mvrnorm(n, mu = c(0, 0, 0), Sigma = sigma_high_corr)
x1 <- X_high_corr[, 1]
x2 <- X_high_corr[, 2]
x3 <- X_high_corr[, 3]
y_high_corr <- 3 * x1 + 2 * x2 + 2 * x3 + rnorm(n)

# Simple and multiple regression models
simple_model_1 <- lm(y_high_corr ~ x1)
multiple_model_1 <- lm(y_high_corr ~ x1 + x2 + x3)

# Print summaries to compare coefficients and standard errors
summary(simple_model_1)
## 
## Call:
## lm(formula = y_high_corr ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6565 -1.1224  0.1515  1.2883  4.7253 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1302     0.1817   0.717    0.475    
## x1            6.8126     0.1801  37.834   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.816 on 98 degrees of freedom
## Multiple R-squared:  0.9359, Adjusted R-squared:  0.9353 
## F-statistic:  1431 on 1 and 98 DF,  p-value: < 2.2e-16
summary(multiple_model_1)
## 
## Call:
## lm(formula = y_high_corr ~ x1 + x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7944 -0.5867 -0.1038  0.6188  2.3280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03150    0.08914   0.353    0.725    
## x1           2.96840    0.23445  12.661  < 2e-16 ***
## x2           1.98742    0.26054   7.628 1.72e-11 ***
## x3           2.10396    0.23779   8.848 4.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8867 on 96 degrees of freedom
## Multiple R-squared:  0.985,  Adjusted R-squared:  0.9846 
## F-statistic:  2107 on 3 and 96 DF,  p-value: < 2.2e-16
# Scenario 2: x1 uncorrelated with x2 and x3, x2 and x3 highly correlated with large effects on y
sigma_uncorr <- matrix(c(1, 0, 0, 0, 1, 0.9, 0, 0.9, 1), nrow = 3)
X_uncorr <- mvrnorm(n, mu = c(0, 0, 0), Sigma = sigma_uncorr)
x1 <- X_uncorr[, 1]
x2 <- X_uncorr[, 2]
x3 <- X_uncorr[, 3]
y_uncorr <- 3 * x1 + 2 * x2 + 2 * x3 + rnorm(n)

# Simple and multiple regression models
simple_model_2 <- lm(y_uncorr ~ x1)
multiple_model_2 <- lm(y_uncorr ~ x1 + x2 + x3)

# Print summaries to compare coefficients and standard errors
summary(simple_model_2)
## 
## Call:
## lm(formula = y_uncorr ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8505 -2.5598 -0.0297  2.2854 11.8318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.6300     0.4049  -1.556    0.123    
## x1            3.5777     0.3827   9.349 3.13e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.049 on 98 degrees of freedom
## Multiple R-squared:  0.4714, Adjusted R-squared:  0.466 
## F-statistic:  87.4 on 1 and 98 DF,  p-value: 3.134e-15
summary(multiple_model_2)
## 
## Call:
## lm(formula = y_uncorr ~ x1 + x2 + x3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.18619 -0.55754  0.00533  0.49589  2.04906 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.16512    0.09052  -1.824   0.0712 .  
## x1           3.01332    0.08631  34.912  < 2e-16 ***
## x2           1.89082    0.21312   8.872 3.94e-14 ***
## x3           2.10894    0.21268   9.916 2.26e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8955 on 96 degrees of freedom
## Multiple R-squared:  0.9747, Adjusted R-squared:  0.9739 
## F-statistic:  1231 on 3 and 96 DF,  p-value: < 2.2e-16

QUESTION C8

# Load necessary packages
library(wooldridge) # For the DISCRIM dataset
library(dplyr)      # For data manipulation
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load data
data("discrim")

i. Average and Standard Deviation of prpblck and income

# Calculate mean and standard deviation of prpblck and income
summary_stats <- discrim %>%
  summarise(
    mean_prpblck = mean(prpblck, na.rm = TRUE),
    sd_prpblck = sd(prpblck, na.rm = TRUE),
    mean_income = mean(income, na.rm = TRUE),
    sd_income = sd(income, na.rm = TRUE)
  )

print(summary_stats)
##   mean_prpblck sd_prpblck mean_income sd_income
## 1    0.1134864  0.1824165    47053.78  13179.29

ii. Model to Explain psoda Using prpblck and income

# Run OLS regression of psoda on prpblck and income
model_1 <- lm(psoda ~ prpblck + income, data = discrim)
summary(model_1)
## 
## Call:
## lm(formula = psoda ~ prpblck + income, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29401 -0.05242  0.00333  0.04231  0.44322 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.563e-01  1.899e-02  50.354  < 2e-16 ***
## prpblck     1.150e-01  2.600e-02   4.423 1.26e-05 ***
## income      1.603e-06  3.618e-07   4.430 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08611 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06422,    Adjusted R-squared:  0.05952 
## F-statistic: 13.66 on 2 and 398 DF,  p-value: 1.835e-06

iii. Compare the multiple regression estimate of the prpblck coefficient from part (ii) with the estimate from a simple regression of psoda on prpblck.

# Run simple regression of psoda on prpblck only
model_2 <- lm(psoda ~ prpblck, data = discrim)
summary(model_2)
## 
## Call:
## lm(formula = psoda ~ prpblck, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30884 -0.05963  0.01135  0.03206  0.44840 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03740    0.00519  199.87  < 2e-16 ***
## prpblck      0.06493    0.02396    2.71  0.00702 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0881 on 399 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.01808,    Adjusted R-squared:  0.01561 
## F-statistic: 7.345 on 1 and 399 DF,  p-value: 0.007015

iv. A model with a constant price elasticity with respect to income may be more appropriate: regress log(psoda) on prpblck and log(income).

# Run OLS regression with log transformations
model_3 <- lm(log(psoda) ~ prpblck + log(income), data = discrim)
summary(model_3)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33563 -0.04695  0.00658  0.04334  0.35413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.79377    0.17943  -4.424 1.25e-05 ***
## prpblck      0.12158    0.02575   4.722 3.24e-06 ***
## log(income)  0.07651    0.01660   4.610 5.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0821 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06809,    Adjusted R-squared:  0.06341 
## F-statistic: 14.54 on 2 and 398 DF,  p-value: 8.039e-07

v. Adding prppov to the Model

# Run model with prppov included
model_4 <- lm(log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
summary(model_4)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32218 -0.04648  0.00651  0.04272  0.35622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08137 on 397 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.08696,    Adjusted R-squared:  0.08006 
## F-statistic:  12.6 on 3 and 397 DF,  p-value: 6.917e-08

vi. Correlation between log(income) and prppov

# Calculate correlation between log(income) and prppov
cor_log_income_prppov <- cor(log(discrim$income), discrim$prppov, use = "complete.obs")
print(cor_log_income_prppov)
## [1] -0.838467

vii. Evaluating the Statement on Multicollinearity

#Multicollinearity (high correlation between independent variables) can lead to instability in coefficient estimates, making it difficult to interpret individual effects. If log(income) and prppov are highly correlated, it might be sensible to include only one of these variables to avoid redundancy and improve model interpretability.
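To put a number on the collinearity between log(income) and prppov, one option is the variance inflation factor. A sketch is below: it computes the VIF by hand from an auxiliary regression, and notes the car package shortcut as an assumption (only if that package is installed).

``` r
# VIF for log(income) in model_4, from the auxiliary regression of log(income)
# on the other regressors: VIF = 1 / (1 - R^2_aux)
aux <- lm(log(income) ~ prpblck + prppov, data = discrim)
1 / (1 - summary(aux)$r.squared)

# Equivalent shortcut if the 'car' package is available:
# car::vif(model_4)
```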

CHAPTER 4

QUESTION 3

# Given values
coef_log_sales <- 0.321      # Coefficient for log(sales)
se_log_sales <- 0.216        # Standard error for log(sales)
coef_profmarg <- 0.050       # Coefficient for profmarg
se_profmarg <- 0.046         # Standard error for profmarg
alpha_5 <- 0.05              # Significance level 5%
alpha_10 <- 0.10             # Significance level 10%

# Question (i): Interpretation of the coefficient on log(sales)
# When sales increase by 10%, log(sales) increases by log(1.10)
percent_change_rdintens <- coef_log_sales * log(1.10)
cat("Estimated percentage point change in rdintens for a 10% increase in sales:", 
    percent_change_rdintens, "\n")
## Estimated percentage point change in rdintens for a 10% increase in sales: 0.03059457
# Question (ii): Hypothesis Test for log(sales) coefficient
# H0: The coefficient on log(sales) is zero (no relationship with rdintens)
# H1: The coefficient on log(sales) is not zero
# Calculate t-statistic for log(sales)
t_stat_log_sales <- coef_log_sales / se_log_sales
p_value_log_sales <- 2 * (1 - pt(abs(t_stat_log_sales), df = 32 - 3))  # df = n - k, where n=32, k=3
cat("t-statistic for log(sales):", t_stat_log_sales, "\n")
## t-statistic for log(sales): 1.486111
cat("p-value for log(sales):", p_value_log_sales, "\n")
## p-value for log(sales): 0.1480413
# Determine if we reject H0 at 5% and 10% levels
reject_5_percent <- p_value_log_sales < alpha_5
reject_10_percent <- p_value_log_sales < alpha_10
cat("Reject H0 at 5% level?", reject_5_percent, "\n")
## Reject H0 at 5% level? FALSE
cat("Reject H0 at 10% level?", reject_10_percent, "\n")
## Reject H0 at 10% level? FALSE
# Question (iii): Interpretation of the coefficient on profmarg
cat("Interpretation: A one percentage point increase in profmarg is associated with an increase of",
    coef_profmarg, "percentage points in rdintens.\n")
## Interpretation: A one percentage point increase in profmarg is associated with an increase of 0.05 percentage points in rdintens.
# Question (iv): Statistical Significance of profmarg
# Calculate t-statistic for profmarg
t_stat_profmarg <- coef_profmarg / se_profmarg
p_value_profmarg <- 2 * (1 - pt(abs(t_stat_profmarg), df = 32 - 3))
cat("t-statistic for profmarg:", t_stat_profmarg, "\n")
## t-statistic for profmarg: 1.086957
cat("p-value for profmarg:", p_value_profmarg, "\n")
## p-value for profmarg: 0.2860082
# Determine if we reject H0 for profmarg at 5% and 10% levels
reject_profmarg_5_percent <- p_value_profmarg < alpha_5
reject_profmarg_10_percent <- p_value_profmarg < alpha_10
cat("Reject H0 for profmarg at 5% level?", reject_profmarg_5_percent, "\n")
## Reject H0 for profmarg at 5% level? FALSE
cat("Reject H0 for profmarg at 10% level?", reject_profmarg_10_percent, "\n")
## Reject H0 for profmarg at 10% level? FALSE

i. Interpret the coefficient on log(sales). In particular, if sales increases by 10%, what is the estimated percentage point change in rdintens? Is this an economically large effect?

#The code calculates the percentage point change in rdintens when sales increases by 10%, using percent change = β_log(sales) × log(1.10) ≈ 0.321 × 0.095 ≈ 0.031 percentage points. This is a fairly small effect economically.
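The arithmetic behind that number, both exactly and with the usual β/100 approximation (a quick check in base R):

``` r
# Change in rdintens (percentage points) for a 10% increase in sales
0.321 * log(1.10)    # exact:       about 0.0306
0.321 / 100 * 10     # approximate: about 0.0321 (beta/100 per 1% change in sales)
```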

ii. Test the hypothesis that R&D intensity does not change with sales against the alternative that it does increase with sales. Do the test at the 5% and 10% levels.

#The code above tests H0: β_log(sales) = 0. The question asks for the one-sided alternative H1: β_log(sales) > 0; since the t statistic (about 1.49) is positive, the one-sided p-value is half the two-sided value, roughly 0.074. So we fail to reject H0 at the 5% level but reject it at the 10% level, giving weak evidence that R&D intensity increases with sales.
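To make the one-sided decision explicit, the t statistic of about 1.49 can be compared with one-sided critical values for 29 degrees of freedom (a quick check in base R):

``` r
# One-sided critical values, df = 32 - 3 = 29
qt(0.95, df = 29)   # 5% level:  about 1.70; 1.49 < 1.70, so fail to reject H0
qt(0.90, df = 29)   # 10% level: about 1.31; 1.49 > 1.31, so reject H0
```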

iii. Interpret the coefficient on profmarg. Is it economically large?

#A one-percentage-point increase in profmarg is associated with only a 0.05 percentage point increase in rdintens, so the effect is not economically large.

iv. Does profmarg have a statistically significant effect on rdintens?

#The code performs a hypothesis test on the profmarg coefficient: the t statistic is about 1.09 with a p-value of about 0.29, so profmarg does not have a statistically significant effect on rdintens at either the 5% or the 10% level.

QUESTION C8

# Load the k401ksubs dataset and filter for single-person households
data("k401ksubs", package = "wooldridge")
single_person <- subset(k401ksubs, fsize == 1)

# (i) Count the number of single-person households
num_single_person <- nrow(single_person)
cat("Number of single-person households:", num_single_person, "\n")
## Number of single-person households: 2017
# (ii) Estimate the model nettfa = β0 + β1 * inc + β2 * age + u
model <- lm(nettfa ~ inc + age, data = single_person)
summary(model)
## 
## Call:
## lm(formula = nettfa ~ inc + age, data = single_person)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.95  -14.16   -3.42    6.03 1113.94 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43.03981    4.08039 -10.548   <2e-16 ***
## inc           0.79932    0.05973  13.382   <2e-16 ***
## age           0.84266    0.09202   9.158   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared:  0.1193, Adjusted R-squared:  0.1185 
## F-statistic: 136.5 on 2 and 2014 DF,  p-value: < 2.2e-16
# (iii) Interpret the intercept
intercept <- coef(model)["(Intercept)"]
cat("Intercept (β0):", intercept, "\n")
## Intercept (β0): -43.03981
# (iv) Test H0: β2 = 1 against H1: β2 < 1
beta2 <- coef(model)["age"]
se_beta2 <- summary(model)$coefficients["age", "Std. Error"]
t_stat <- (beta2 - 1) / se_beta2
p_value <- pt(t_stat, df = model$df.residual)  # One-sided test
cat("t-statistic:", t_stat, "\n")
## t-statistic: -1.709944
cat("p-value for H0: β2 = 1 against H1: β2 < 1:", p_value, "\n")
## p-value for H0: β2 = 1 against H1: β2 < 1: 0.04371514
# (v) Simple regression of nettfa on inc
simple_model <- lm(nettfa ~ inc, data = single_person)
summary(simple_model)
## 
## Call:
## lm(formula = nettfa ~ inc, data = single_person)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -185.12  -12.85   -4.85    1.78 1112.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.5709     2.0607   -5.13 3.18e-07 ***
## inc           0.8207     0.0609   13.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.59 on 2015 degrees of freedom
## Multiple R-squared:  0.08267,    Adjusted R-squared:  0.08222 
## F-statistic: 181.6 on 1 and 2015 DF,  p-value: < 2.2e-16
# Compare coefficients on inc
coef_simple_inc <- coef(simple_model)["inc"]
coef_full_inc <- coef(model)["inc"]
cat("Coefficient on inc in simple model:", coef_simple_inc, "\n")
## Coefficient on inc in simple model: 0.8206815
cat("Coefficient on inc in full model:", coef_full_inc, "\n")
## Coefficient on inc in full model: 0.7993167

i. How many single-person households are there in the dataset?

#Counts the number of single-person households by filtering the dataset where fsize == 1.

ii. Use OLS to estimate the model.

#Estimates the OLS model nettfa = β0 + β1 inc + β2 age + u and displays the summary. Interpret the coefficients on inc and age from their signs and magnitudes in the summary output.

iii. Does the intercept from the regression in part (ii) have an interesting meaning? Explain.

#The intercept is the predicted nettfa for a single person with inc = 0 and age = 0. Since no one in the sample is close to that point, the intercept itself has no particularly interesting meaning.
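To see this, compare the fitted value at the literal zero point with a prediction at a more representative single person. The values inc = 30 and age = 30 below are purely illustrative assumptions, with inc in the dataset's income units:

``` r
# Fitted nettfa at inc = 0, age = 0 (the intercept) vs. an illustrative point
predict(model, newdata = data.frame(inc = c(0, 30), age = c(0, 30)))
```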

iv. Find the p-value for the test H0: β2 = 1 against H1: β2 < 1. Do you reject H0 at the 1% significance level?

#Performs a one-sided test of H0: β2 = 1 against H1: β2 < 1 for the age coefficient. The t statistic is about -1.71 with a one-sided p-value of about 0.044; since 0.044 > 0.01, we do not reject H0 at the 1% significance level.
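For reference, the 1% lower-tail critical value (a quick check in base R):

``` r
# 1% critical value for the one-sided test H1: beta2 < 1, df = 2014
qt(0.01, df = 2014)   # about -2.33; t = -1.71 is not below this, so do not reject H0 at 1%
```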

v. If you do a simple regression of nettfa on inc, is the estimated coefficient on inc much different from the estimate in part (ii)? Why?

#Estimates a simple regression of nettfa on inc and compares its inc coefficient (about 0.821) with the multiple regression estimate (about 0.799). The two are quite similar, which is consistent with inc and age not being strongly correlated in this sample, so adding age changes the inc coefficient very little.

CHAPTER 5

QUESTION 5

library(ggplot2)

# Simulate data if you don't have actual data
# Suppose the scores are normally distributed around 70 with a standard deviation of 15
set.seed(42) # for reproducibility
scores <- rnorm(1000, mean = 70, sd = 15)

# Plot histogram with normal curve
ggplot(data.frame(scores), aes(x = scores)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
  stat_function(fun = dnorm, args = list(mean = mean(scores), sd = sd(scores)), color = "blue", linewidth = 1) +
  labs(x = "Course score (in percentage form)", y = "Density") +
  theme_minimal()

QUESTION C1

i. Estimate the equation wage = β0 + β1 educ + β2 exper + β3 tenure + u. Save the residuals and plot a histogram.

# Load necessary libraries
library(wooldridge) # For the wage1 dataset
library(ggplot2)    # For plotting
data("wage1", package = "wooldridge")



# (i) Estimate the equation: wage = β0 + β1*educ + β2*exper + β3*tenure + u
model1 <- lm(wage ~ educ + exper + tenure, data = wage1)

# Print summary of the model
summary(model1)
## 
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6068 -1.7747 -0.6279  1.1969 14.6536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.87273    0.72896  -3.941 9.22e-05 ***
## educ         0.59897    0.05128  11.679  < 2e-16 ***
## exper        0.02234    0.01206   1.853   0.0645 .  
## tenure       0.16927    0.02164   7.820 2.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3024 
## F-statistic: 76.87 on 3 and 522 DF,  p-value: < 2.2e-16
# Save the residuals and plot a histogram
residuals1 <- residuals(model1)

# Plot histogram of residuals for the level-level model
ggplot(data.frame(residuals1), aes(x = residuals1)) +
    geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
    labs(x = "Residuals", y = "Density", title = "Histogram of Residuals (Level-Level Model)") +
    theme_minimal()

ii. Repeat part i, but with log(wage) as the dependent variable.

model2 <- lm(log(wage) ~ educ + exper + tenure, data = wage1)

# Print summary of the model
summary(model2)
## 
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure, data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05802 -0.29645 -0.03265  0.28788  1.42809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
## educ        0.092029   0.007330  12.555  < 2e-16 ***
## exper       0.004121   0.001723   2.391  0.01714 *  
## tenure      0.022067   0.003094   7.133 3.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared:  0.316,  Adjusted R-squared:  0.3121 
## F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16
# Save the residuals and plot a histogram
residuals2 <- residuals(model2)

# Plot histogram of residuals for the log-level model
ggplot(data.frame(residuals2), aes(x = residuals2)) +
    geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
    labs(x = "Residuals", y = "Density", title = "Histogram of Residuals (Log-Level Model)") +
    theme_minimal()

iii. Would you say that Assumption MLR.6 is closer to being satisfied for the level-level model or the log-level model?
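Assumption MLR.6 is about normality of the errors, so a direct check (a sketch, using base R's Shapiro-Wilk test on the residuals saved in parts i and ii) is to test each set of residuals; for wage data, the log-level residuals typically look closer to normal.

``` r
# Shapiro-Wilk normality tests on the residuals from parts (i) and (ii);
# a larger p-value means less evidence against normality of the errors.
shapiro.test(residuals1)   # level-level model
shapiro.test(residuals2)   # log-level model
```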

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Breusch-Pagan test for the level-level wage model
# (note: this tests heteroskedasticity rather than normality)
bptest(model1)
# Breusch-Pagan test for log-transformed model