library(dplyr)
library(tidyverse)
library(wooldridge)
library(car)
library(ggplot2)
data("countymurders")
countymurders <- countymurders %>% filter(year == 1996)
countymurders %>%
filter(murders == 0) %>%
count()
## n
## 1 1051
There are 1051 counties with zero murders in 1996.
countymurders %>%
filter(execs > 0) %>%
count()
## n
## 1 31
31 counties had at least one execution.
max(countymurders$execs)
## [1] 3
The largest number of executions is 3.
murders <- lm(murders ~ execs, data = countymurders)
summary(murders)
##
## Call:
## lm(formula = murders ~ execs, data = countymurders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.12 -5.46 -4.46 -2.46 1338.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.4572 0.8348 6.537 7.79e-11 ***
## execs 58.5555 5.8333 10.038 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared: 0.04389, Adjusted R-squared: 0.04346
## F-statistic: 100.8 on 1 and 2195 DF, p-value: < 2.2e-16
We have the estimated equation: \(\widehat{murders} = 5.46 + 58.56 \times execs\), based on 2197 observations with an \(R^2\) of 0.044.
The slope coefficient for execs is 58.5555, indicating that each
additional execution is associated with an increase of about 58.56
murders.
This suggests that capital punishment does not have a deterrent effect;
instead, it implies that more executions are linked to higher murder
rates, contrary to the typical expectation that executions would reduce
crime.
The smallest predicted number of murders is the intercept, approximately 5.46, which occurs when execs = 0. For a county with zero executions and zero murders, the residual is \(0 - 5.4572 = -5.4572\).
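These values can be checked directly from the fitted model (a minimal sketch using the `murders` lm object estimated above):
# Smallest predicted value: a county with zero executions
pred_zero <- predict(murders, newdata = data.frame(execs = 0))  # about 5.457
# Residual for a county with zero executions and zero murders
resid_zero <- 0 - pred_zero                                     # about -5.457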
Simple regression analysis is not well suited for determining whether
capital punishment deters murders for several reasons:
- Oversimplification: It only examines one independent variable
(executions) and ignores other important factors influencing murder
rates.
- Omitted Variable Bias: Key variables like socioeconomic conditions and
policing strategies may be left out, leading to misleading
conclusions.
- Non-linear Relationships: The relationship may be complex or
non-linear, which a simple linear model cannot adequately capture.
No, it does not make sense to hold sleep, work, and leisure fixed while changing study in this model. This is because the sum of hours for the four activities (study, sleep, work, and leisure) must always be 168. If you change study, one or more of the other activities must also change to maintain the total time constraint. This makes the variables interdependent, violating the assumption of independent variation in the explanatory variables.
Assumption MLR.3, which states that there must be no perfect collinearity among the independent variables, is violated in this model. Since the total time for all activities must equal 168, the four variables are perfectly collinear. For any given values of three of the variables, the value of the fourth variable is completely determined. This perfect collinearity prevents the model from satisfying MLR.3.
To avoid perfect collinearity, we could exclude one of the activities (e.g., leisure) from the model. This setup avoids perfect collinearity, satisfies MLR.3, and allows the parameters to be interpreted as the effect of each activity on GPA, holding the other included activities constant.
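For example, dropping leisure gives the reparameterized model
\[ GPA = \beta_0 + \beta_1\,study + \beta_2\,sleep + \beta_3\,work + u, \]
where \(\beta_1\) is now interpreted as the effect of shifting one hour from leisure into study, holding sleep and work fixed (and similarly for \(\beta_2\) and \(\beta_3\)).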
We would expect \(\hat{\beta}_1\) and \(\tilde{\beta}_1\) to be different. In the simple regression, the estimate \(\hat{\beta}_1\) would capture not only the effect of \(x_1\) on \(y\) but also some of the effects of \(x_2\) and \(x_3\), due to their high correlation with \(x_1\). In the multiple regression, however, \(\tilde{\beta}_1\) accounts for the effects of \(x_2\) and \(x_3\) separately, isolating the unique effect of \(x_1\) on \(y\), resulting in a potentially different value from \(\hat{\beta}_1\).
If \(x_1\) is almost uncorrelated with \(x_2\) and \(x_3\), \(\hat{\beta}_1\) and \(\tilde{\beta}_1\) will likely be similar. Since \(x_1\) is nearly uncorrelated with \(x_2\) and \(x_3\), including \(x_2\) and \(x_3\) in the model changes the relationship between \(x_1\) and \(y\) very little, so the simple and multiple regression estimates of \(\beta_1\) should be close to each other.
If \(x_1\) is highly correlated with \(x_2\) and \(x_3\), but \(x_2\) and \(x_3\) have small partial effects on \(y\), we would expect \(se(\hat{\beta}_1)\) (from the simple regression) to be smaller than \(se(\tilde{\beta}_1)\) (from the multiple regression). In the multiple regression model, high multicollinearity between \(x_1\), \(x_2\), and \(x_3\) increases the standard error of \(\tilde{\beta}_1\), because it becomes difficult to disentangle the individual effects of each variable. Since the simple regression only includes \(x_1\), this multicollinearity does not inflate \(se(\hat{\beta}_1)\).
If \(x_1\) is almost uncorrelated with \(x_2\) and \(x_3\), and \(x_2\) and \(x_3\) have large partial effects on \(y\), we would expect \(se(\tilde{\beta}_1)\) to be smaller than \(se(\hat{\beta}_1)\). In the multiple regression, \(x_2\) and \(x_3\) explain a large portion of the variance in \(y\), reducing the error variance and thus lowering the standard error of \(\tilde{\beta}_1\). In the simple regression, where \(x_2\) and \(x_3\) are omitted, their effects end up in the error term, increasing the residual variance and therefore \(se(\hat{\beta}_1)\).
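These comparisons follow from the standard formula for the sampling variance of an OLS slope estimator,
\[ \mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{SST_j\,(1 - R_j^2)}, \]
where \(SST_j\) is the total sample variation in \(x_j\) and \(R_j^2\) is the \(R^2\) from regressing \(x_j\) on the other included regressors (for a simple regression, \(R_j^2 = 0\)). High correlation between \(x_1\) and the other regressors raises \(R_1^2\) and inflates the variance, while including \(x_2\) and \(x_3\) with large partial effects on \(y\) lowers the error variance \(\sigma^2\).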
DISCRIM
data('discrim')
prpblck variable:
avg_prpblck <- mean(discrim$prpblck, na.rm = TRUE)
sd_prpblck <- sd(discrim$prpblck, na.rm = TRUE)
sprintf("Mean value: %.2f", avg_prpblck)
## [1] "Mean value: 0.11"
sprintf("Standard deviation value: %.2f", sd_prpblck)
## [1] "Standard deviation value: 0.18"
print("The unit of measurement for the variable is proportion.")
## [1] "The unit of measurement for the variable is proportion."
income variable:
avg_income <- mean(discrim$income, na.rm = TRUE)
sd_income <- sd(discrim$income, na.rm = TRUE)
sprintf("Mean value: %.2f", avg_income)
## [1] "Mean value: 47053.78"
sprintf("Standard deviation value: %.2f", sd_income)
## [1] "Standard deviation value: 13179.29"
print("The unit of measurement for the variable is in USD.")
## [1] "The unit of measurement for the variable is in USD."
ols <- lm(psoda ~ prpblck + income, data = discrim)
summary(ols)
##
## Call:
## lm(formula = psoda ~ prpblck + income, data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29401 -0.05242 0.00333 0.04231 0.44322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.563e-01 1.899e-02 50.354 < 2e-16 ***
## prpblck 1.150e-01 2.600e-02 4.423 1.26e-05 ***
## income 1.603e-06 3.618e-07 4.430 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08611 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06422, Adjusted R-squared: 0.05952
## F-statistic: 13.66 on 2 and 398 DF, p-value: 1.835e-06
The coefficient on prpblck is 0.115; while it is statistically significant, its economic effect is small.
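To see why the economic effect is small, a rough back-of-the-envelope check using the fitted model (a sketch based on the `ols` object above):
# Implied change in psoda for a 0.10 (10 percentage point) increase in prpblck
0.10 * coef(ols)["prpblck"]        # about 0.0115 dollars, i.e. roughly one cent
# Compare with the average soda price in the sample
mean(discrim$psoda, na.rm = TRUE)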
ols_2 <- lm(psoda ~ prpblck, data = discrim)
summary(ols_2)
##
## Call:
## lm(formula = psoda ~ prpblck, data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30884 -0.05963 0.01135 0.03206 0.44840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03740 0.00519 199.87 < 2e-16 ***
## prpblck 0.06493 0.02396 2.71 0.00702 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0881 on 399 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.01808, Adjusted R-squared: 0.01561
## F-statistic: 7.345 on 1 and 399 DF, p-value: 0.007015
The estimated discrimination effect is larger when we control for income (0.115 versus 0.065 without the income control).
ols_3 <- lm(log(psoda) ~ prpblck + log(income), data = discrim)
summary(ols_3)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33563 -0.04695 0.00658 0.04334 0.35413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.79377 0.17943 -4.424 1.25e-05 ***
## prpblck 0.12158 0.02575 4.722 3.24e-06 ***
## log(income) 0.07651 0.01660 4.610 5.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0821 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06809, Adjusted R-squared: 0.06341
## F-statistic: 14.54 on 2 and 398 DF, p-value: 8.039e-07
incr <- coef(ols_3)["prpblck"] * 0.2 * 100
sprintf("If prpblck increates by .20, the estimated percentage change in psoda is %.2f", incr)
## [1] "If prpblck increates by .20, the estimated percentage change in psoda is 2.43"
ols_3_update <- lm(log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
summary(ols_3_update)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32218 -0.04648 0.00651 0.04272 0.35622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.46333 0.29371 -4.982 9.4e-07 ***
## prpblck 0.07281 0.03068 2.373 0.0181 *
## log(income) 0.13696 0.02676 5.119 4.8e-07 ***
## prppov 0.38036 0.13279 2.864 0.0044 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08137 on 397 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.08696, Adjusted R-squared: 0.08006
## F-statistic: 12.6 on 3 and 397 DF, p-value: 6.917e-08
The value of the estimated coefficient for prpblck decreases and the p-value for the coefficient increases.
subset <- discrim %>% select(income, prppov) %>% na.omit()
log_income <- log(subset$income)
cor(log_income, subset$prppov)
## [1] -0.838467
The correlation is roughly what I expected. The value of -0.838467 indicates that areas with a higher proportion of people in poverty have substantially lower median family income.
I do not agree with the statement. Under the assumptions of multiple linear regression, explanatory variables only need to avoid perfect collinearity; as long as the correlation between log(income) and prppov is not exactly 1 in absolute value, both can be kept in the regression.
The coefficient on log(sales) is 0.321. Because rdintens is measured in percentage points while sales enters in logs, a 1% increase in sales is associated with an increase in rdintens of only about 0.321/100 = 0.0032 percentage points. For a 10% increase in sales, the effect on rdintens is approximately 0.321 × 0.10 ≈ 0.032 percentage points.
Whether this is economically large depends on the context, but an increase of roughly 0.03 percentage points in R&D intensity from a 10% increase in sales is a fairly modest effect.
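The exact calculation for the level-log specification, without the percentage approximation:
\[ \Delta rdintens = \hat{\beta}_1\,\Delta \log(sales) = 0.321 \times \log(1.10) \approx 0.031 \ \text{percentage points}. \]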
To test the hypothesis, we can use a t-test. The t-statistic is 1.486. For a one-tailed test with n − k − 1 = 32 − 3 = 29 degrees of freedom:
At the 5% significance level, the critical t-value is approximately 1.699.
At the 10% significance level, the critical t-value is approximately 1.311.
Since 1.486 is less than 1.699 (5% level) but greater than 1.311 (10% level), we reject the null hypothesis at the 10% level but not at the 5% level. This provides only weak evidence that R&D intensity increases with sales.
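For reference, these one-sided critical values can be obtained with qt():
# One-sided critical values, t distribution with 29 degrees of freedom
qt(0.95, df = 29)   # about 1.699 (5% level)
qt(0.90, df = 29)   # about 1.311 (10% level)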
The coefficient on profmarg is 0.050, which means that a one percentage point increase in profmarg is associated with a 0.050 percentage point increase in rdintens. This effect is relatively small: even a large increase in profmarg would raise rdintens only slightly, so the effect of profmarg on rdintens does not appear economically large.
To determine statistical significance, we look at the t-statistic for profmarg, which is 1.087. With 29 degrees of freedom:
The two-tailed critical value at the 5% significance level is approximately 2.045.
The two-tailed critical value at the 10% significance level is approximately 1.699.
Since 1.087 is less than both critical values, profmarg is not statistically significant at the 5% or 10% level. Therefore, we conclude that profmarg does not have a statistically significant effect on rdintens.
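The corresponding two-sided critical values:
# Two-sided critical values, t distribution with 29 degrees of freedom
qt(0.975, df = 29)  # about 2.045 (5% level)
qt(0.95,  df = 29)  # about 1.699 (10% level)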
data("k401ksubs")
k401ksubs <- k401ksubs %>%
filter(fsize == 1)
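A quick check of the resulting sample size:
nrow(k401ksubs)   # number of single-person households remaining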
There are 2017 single-person households in the dataset.
nettfa <- lm(nettfa ~ inc + age, data = k401ksubs)
summary(nettfa)
##
## Call:
## lm(formula = nettfa ~ inc + age, data = k401ksubs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.95 -14.16 -3.42 6.03 1113.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.03981 4.08039 -10.548 <2e-16 ***
## inc 0.79932 0.05973 13.382 <2e-16 ***
## age 0.84266 0.09202 9.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.1185
## F-statistic: 136.5 on 2 and 2014 DF, p-value: < 2.2e-16
Coefficient: 0.79932. For each additional $1,000 in annual income, net financial assets (nettfa) increase by approximately $799.32. This reflects a positive relationship between income and financial assets.
Coefficient: 0.84266. For each additional year of age, net financial assets increase by about $842.66. This suggests that older individuals tend to have more accumulated assets over time.
Surprises in the slope estimates:
- Magnitude: Both coefficients are relatively large, indicating that small increases in income or age can noticeably affect financial assets, which might be surprising.
- Positive relationships: The expected positive relationships hold, but it is noteworthy that in some contexts age might also correlate with increased expenses (e.g., healthcare).
- Single-person focus: The findings are based solely on single-person households, making it interesting that such strong relationships exist without accounting for multi-person household dynamics.
The intercept suggests that a single-person household with an annual income of $0 and an age of 0 years would have net financial assets of approximately -$43,039.81. This negative value has no meaningful interpretation, since zero income and zero age lie far outside the range of the data.
To test \(H_0: \beta_{age} = 1\) against \(H_1: \beta_{age} < 1\), we use the estimated coefficient and standard error on age from the regression above, rather than a t-test on the age variable itself:
b_age <- coef(summary(nettfa))["age", "Estimate"]     # 0.84266
se_age <- coef(summary(nettfa))["age", "Std. Error"]  # 0.09202
t_stat <- (b_age - 1) / se_age
The t-statistic is \((0.84266 - 1)/0.09202 \approx -1.71\). The one-sided 1% critical value with 2014 degrees of freedom is approximately \(-2.33\), so we do not reject \(H_0\) at the 1% significance level.
nettfa_simple <- lm(nettfa ~ inc, data = k401ksubs)
summary(nettfa_simple)
If the estimated coefficient on inc from the simple regression is close to 0.79932 and still statistically significant, it suggests that the effect of inc changes little once age is controlled for, which would be the case if inc and age are only weakly correlated in the sample.
If the coefficient differs substantially, that would point to omitted variable bias from leaving age out of the simple regression.
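The link between the two estimates can be made precise with the omitted-variable relationship, using the estimates from the multiple regression above:
\[ \text{(simple-regression slope on inc)} = 0.79932 + 0.84266 \times \tilde{\delta}, \]
where \(\tilde{\delta}\) is the slope from a simple regression of age on inc. If age and inc are only weakly correlated in the sample, \(\tilde{\delta}\) is close to zero and the simple-regression slope will be close to 0.79932.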
No, the probability would not be zero. The normal distribution has infinite tails, which means it theoretically allows for values greater than 100, even though they would be highly unlikely. This contradicts the assumption of a normal distribution for score because score is a percentage score and thus constrained to a maximum of 100. A normal distribution is unbounded, which makes it an imperfect fit for a bounded variable like score.
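As a minimal numerical illustration (the mean and standard deviation below are hypothetical placeholder values, not estimates from the data):
# Probability of a score above 100 under a normal distribution
# with an illustrative mean of 70 and standard deviation of 15 (hypothetical values)
1 - pnorm(100, mean = 70, sd = 15)   # strictly positive, about 0.023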
In the left tail of the histogram, the observed frequency of low scores (around 20-40) is higher than what the normal distribution predicts. The normal distribution underestimates the probability of these lower scores, suggesting that the data is skewed or has heavier tails on the left than the normal distribution can capture. Therefore, the normal distribution does not fit well in the left tail of the data.
data("wage1")
wage <- lm(wage ~ educ + exper + tenure, data = wage1)
summary(wage)
##
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6068 -1.7747 -0.6279 1.1969 14.6536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.87273 0.72896 -3.941 9.22e-05 ***
## educ 0.59897 0.05128 11.679 < 2e-16 ***
## exper 0.02234 0.01206 1.853 0.0645 .
## tenure 0.16927 0.02164 7.820 2.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared: 0.3064, Adjusted R-squared: 0.3024
## F-statistic: 76.87 on 3 and 522 DF, p-value: < 2.2e-16
residuals <- residuals(wage)
ggplot(data = data.frame(residuals), aes(x = residuals)) +
geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency") +
theme_minimal()
wage_log <- lm(log(wage) ~ educ + exper + tenure, data = wage1)
summary(wage_log)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + tenure, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05802 -0.29645 -0.03265 0.28788 1.42809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.284360 0.104190 2.729 0.00656 **
## educ 0.092029 0.007330 12.555 < 2e-16 ***
## exper 0.004121 0.001723 2.391 0.01714 *
## tenure 0.022067 0.003094 7.133 3.29e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared: 0.316, Adjusted R-squared: 0.3121
## F-statistic: 80.39 on 3 and 522 DF, p-value: < 2.2e-16
residuals <- residuals(wage_log)
ggplot(data = data.frame(residuals), aes(x = residuals)) +
geom_histogram(binwidth = 1, fill = "lavender", color = "black") +
labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency") +
theme_minimal()
Based on the residuals from the two models:
Level-level model: the residuals are strongly right-skewed (first quartile -1.77, median -0.63, maximum 14.65), with a long tail of very large positive residuals, which is hard to reconcile with a normally distributed error term.
Log-level model: the residuals are far more symmetric around zero (first and third quartiles of roughly -0.30 and 0.29, minimum -2.06, maximum 1.43), so a normal distribution is a much more reasonable approximation.
Final assessment: Assumption MLR.6, normality of the error term, is closer to being satisfied for the log-level model than for the level-level model.