# Load libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data("midwest")
Null Hypothesis (H0) for ANOVA test: There is no significant difference in the college education rates (percollege) among the different state in midwest.
model<-aov(percollege ~ state, data=midwest)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## state 4 774 193.52 5.122 0.000485 ***
## Residuals 432 16322 37.78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, the p-value is very small (0.000485), indicating that there is strong evidence against the null hypothesis. Therefore, we reject the null hypothesis. There is a significant difference in college education rates among the different states in the Midwest.
I want to examine the relationship between “percbelowpoverty” (percent of people below poverty-response) and “percollege” (percent college educated) in urban and rural counties using linear regression
urban<- midwest %>% filter(inmetro == 1)
rural<-midwest %>% filter(inmetro == 0)
#linear regression model
model_urban <- lm(percbelowpoverty ~ percollege, urban)
summary(model_urban)
##
## Call:
## lm(formula = percbelowpoverty ~ percollege, data = urban)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9760 -3.5336 -0.1495 2.7718 12.3581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.23876 1.05600 11.590 <2e-16 ***
## percollege -0.08658 0.04465 -1.939 0.0544 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.048 on 148 degrees of freedom
## Multiple R-squared: 0.02478, Adjusted R-squared: 0.01819
## F-statistic: 3.76 on 1 and 148 DF, p-value: 0.0544
percollege: The estimated coefficient here is -0.08658. In this case, it suggests that an increase in percent of college educated is associated with a decrease in percent of people below poverty line.
The model has p-value is 0.0544. This indicates that the overall model might be marginally significant at a significance level of 0.05.
model_rural<- lm(percbelowpoverty ~ percollege, rural)
summary(model_rural)
##
## Call:
## lm(formula = percbelowpoverty ~ percollege, data = rural)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.999 -3.480 -1.024 2.255 33.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.38276 1.23664 14.06 < 2e-16 ***
## percollege -0.23092 0.07449 -3.10 0.00213 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.197 on 285 degrees of freedom
## Multiple R-squared: 0.03262, Adjusted R-squared: 0.02923
## F-statistic: 9.61 on 1 and 285 DF, p-value: 0.002129
intercept has statistical significance with a very low p-value.
percollege: The estimated coefficient is -0.23092, it suggests that an increase in “percollege” is associated with a decrease in “percbelowpoverty.” It is statistically significant (p-value: 0.00213)
The F-statistic is 9.61, and the associated p-value is 0.002129, indicating that the overall model is statistically significant.
ggplot(rural, aes(x = percbelowpoverty, y = percollege)) +
geom_point() +
geom_smooth(method='lm', formula = x~y, se=FALSE, color = "blue")+
labs(x = "percbelowpoverty", y = "College Education Rates") +
ggtitle("Scatter Plot of percbelowpoverty vs. College Education Rates (Inmetro Counties)")
## Warning: Computation failed in `stat_smooth()`
## Caused by error:
## ! object 'y' not found
Wkt from previous outout-> The minimum residual is approximately -7.999, and the maximum is about 33.002. These values suggest that the residuals vary across the dataset, indicating that the model does not fit the data perfectly.
Multiple linear regression model predicting “percbelowpoverty” (the percentage of the population below the poverty line) based on the variables “percollege,” “perchsd,” “percprof,” and their interaction terms in the counties with “inmetro” status.
model_lm2 <- lm(percbelowpoverty ~ percollege * perchsd * percprof, data = urban)
summary(model_lm2)
##
## Call:
## lm(formula = percbelowpoverty ~ percollege * perchsd * percprof,
## data = urban)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3053 -2.2490 -0.0197 2.0597 7.7872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.006842 21.057738 3.087 0.00243 **
## percollege 1.624188 1.428227 1.137 0.25737
## perchsd -0.799990 0.276251 -2.896 0.00438 **
## percprof -7.727712 6.275649 -1.231 0.22022
## percollege:perchsd -0.018744 0.016787 -1.117 0.26606
## percollege:percprof 0.120905 0.172466 0.701 0.48443
## perchsd:percprof 0.108015 0.080185 1.347 0.18010
## percollege:perchsd:percprof -0.001565 0.001979 -0.791 0.43023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.065 on 142 degrees of freedom
## Multiple R-squared: 0.4636, Adjusted R-squared: 0.4371
## F-statistic: 17.53 on 7 and 142 DF, p-value: < 2.2e-16
Coefficients:
->percollege: The estimated coefficient is 1.6242 and p-value: 0.25737- it is not statistically significant
->perchsd: The estimated coefficient is -0.8000. It has statistically significant (p-value: 0.00438) and has a negative effect on “percbelowpoverty.”
->percprof: The estimated coefficient is -7.7277. It is not statistically significant (p-value: 0.22022).
Interaction terms:
2way: percollege:perchsd, percollege:percprof, perchsd:percprof
3way: percollege:perchsd:percprof
But none of the interaction terms are significant.
The F-statistic is 17.53, and the associated p-value is extremely small (< 2.2e-16), indicating that the model as a whole is statistically significant. In the percentage of the population with a high school education (“perchsd”) is a statistically significant predictor of “percbelowpoverty.” The other variables, including “percollege,” “percprof,” and their interaction terms, are not statistically significant predictors of “percbelowpoverty.”
ggplot(data = urban, aes(x = percollege, y = percbelowpoverty)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "lightblue") +
labs(title = "Scatterplot with Fitted Line", x = "percollege", y = "percbelowpoverty")
plot(model_lm2)