#install.packages("GGally")
library(carData)
library(GGally)
library(ggplot2)
data(Prestige)
str(Prestige)
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
ggpairs(Prestige, columns = 1:5, ggplot2::aes(color = type))
There is positive relationship for Education vs Income
Cor = 0.574means that higher education linked to higher income.
Education vs Prestige is showing really high positive relationship with
Cor = 0.866means that Prestige rises strongly with Education.
Income is also showing strong positive relationship vs Prestige with
Cor= 0.703means that higher income jo to have higher prestige.
The is a strong negative relationship on Education vs Census with
Cor= -0.826means that as education increases, census value decreases.
There is a also negative relationship Income vs Women with
Cor=-0.448, suggesting that higher income occupations tend to have lower % of women.
There is another strong negative relationship on Prestige vs Census with
Cor=-0.649, suggesting that higher prestige is associated with lower census values.
There is a very weak negative relationship on Women vs Prestige with
Cor=-0.110also Women vs Census withCor= -0.242is showing weak negative relationship.
Here are some possible non-linear patterns in our data; Income vs Education where it shows linear pattern at first income rises faster after a certain education level can be seen from the plot below:
ggplot(Prestige, aes(x=education, y=income)) +
geom_point() +
geom_smooth(method="loess", col="blue") +
labs(title="Positive Association")
the pattern for Census vs Education seems to be non-linear suggesting that after certain education level census tend to rise.
# Negative Association
ggplot(Prestige, aes(x=education, y=census)) +
geom_point() +
geom_smooth(method="loess", col="red") +
labs(title="Negative Association - Census vs Education")
Income vs Women is showing interesting differences between the group of employment types.
Overall Corr = -0.448
bc = -0.653*
prof = -0.513*
wc = -0.841*
# Negative Association
ggplot(Prestige, aes(x=women, y=income, color = type)) +
geom_point()
Our the model predicts income (in dollars) using
education,women(%),prestige,census, andtypevariables.
full_model <- lm(income ~ education + women + prestige + census + type, data = Prestige)
summary(full_model)
##
## Call:
## lm(formula = income ~ education + women + prestige + census +
## type, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7752.4 -954.6 -331.2 742.6 14301.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.32053 3037.27048 0.002 0.99808
## education 131.18372 288.74961 0.454 0.65068
## women -53.23480 9.83107 -5.415 4.96e-07 ***
## prestige 139.20912 36.40239 3.824 0.00024 ***
## census 0.04209 0.23568 0.179 0.85865
## typeprof 509.15150 1798.87914 0.283 0.77779
## typewc 347.99010 1173.89384 0.296 0.76757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2633 on 91 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.6363, Adjusted R-squared: 0.6123
## F-statistic: 26.54 on 6 and 91 DF, p-value: < 2.2e-16
The reference group for
typeisbc (blue-collar), sincetypeprofandtypewcare included as indicators.
Multiple R-squared = 0.6363 means that our model explain 63.3% variation in income. Adjusted-R squared = 0.6123 means that after adjusting predictors our model still explain 61.2% variation.
Holding women, prestige, census, and type variables fixed, 1 unit increase in education is associated with an increase in predicted income of about $131. This is a small magnitude relative to our incomes values around $10,000–$25,000. However, since p-value = 0.650 is not statistically significant education variable doesn’t have statistical effect in our model.
Women variable coefficient is -53.23 means that holding
education, prestige, census, and type fixed, a 1 percentage point
increase in the proportion of women in the occupation
is associated with a decrease in predicted income of about
$53.23. Our p-value = 4.96e-07 highly significant, so
we could use women variable as predictor variable.
Prestige coefficient is 139.21 means that holding education, women %, census, and type fixed, a 1-unit increase in prestige score is associated with an increase in predicted income of about $139.21. P-value = 0.000024 is highly significant suggesting that prestige has a strong independent positive relationship with income
Census coefficient is 0.042 means that holding education, women %, prestige, and type fixed, a 1-unit increase in census code is associated with an increase in predicted income of about $0.04. It is showing very minimal effect in our income. P-value = 0.859 is not significant.
Interpreting the occupation type indicators (relative to blue collar)
Typeprof coefficient is 509.15 means that holding education, women %, prestige, and census fixed, professional jobs (prof) have predicted income about $509 higher than blue-collar jobs (bc). This is a small positive increase, however our p-value=0.778 is not showing statistical significance.
Typewc coefficient is 347.99 means that holding education, women %, prestige, and census fixed, white collar jobs (wc) have predicted income about $348 higher than blue-collar jobs (bc).
Assess which predictors are important contributors to the model. Use:
The coefficient for women is statistically significant. The t-test suggests that holding education, prestige, census, and type fixed, the percent of women in an occupation has a meaningful association with income. Therefore we should keep women variable in our model.
The coefficient for prestige is statistically significant.The t-test suggests that holding other variables fixed, prestige adds explanatory impact for income. Therefore Prestige should be kept in our model.
Census is not significant and its estimated effect is extremely small.The t-test suggests census does not explain income once other predictors are included.
type variables and it sub categories are not statistically significant. This suggests that once education, women, and prestige are controlled for, occupation type does not add additional explanatory power. Type variable can be remove but it is best for us to confirm with patial
F-testsince it is group of variables.
type_model <- lm(income ~ type, data = Prestige)
summary(type_model)
##
## Call:
## lm(formula = income ~ type, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5945.5 -2003.3 -466.2 1536.9 15319.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5374.1 521.8 10.299 < 2e-16 ***
## typeprof 5185.3 811.6 6.389 6.16e-09 ***
## typewc -321.8 890.6 -0.361 0.719
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3461 on 95 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.3437, Adjusted R-squared: 0.3299
## F-statistic: 24.87 on 2 and 95 DF, p-value: 2.057e-09
In our simple regression model (income ~ type),
typeis a significant predictor of income (p-value= 2.057e-09). Professional jobs earn significantly more than blue-collar jobs, with an estimated increase of about $5,185. However, once other variables are included in our full model, the type coefficients are no longer significant. This suggests that type does not have an independent effect on income, and that the income differences between occupation types are largely explained by those other factors. Therefore type can be removed from our model.
Education is not statistically significant, however in our prior explorotary data analysis education showed high correlation with Prestige therefore we have to do ANOVA testing to confirm if we should remove Education variable from our model.
edu_out_model <- lm(income ~ women + prestige + census + type, data = Prestige)
anova(edu_out_model, full_model)
## Analysis of Variance Table
##
## Model 1: income ~ women + prestige + census + type
## Model 2: income ~ education + women + prestige + census + type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 92 632099120
## 2 91 630668656 1 1430464 0.2064 0.6507
Our p-value way bigger than 0.05 meaning that even though education might relate to income in general, after controlling for women, prestige, census, and type, education does not provide additional predictive value. therefore education can be removed from our model.
As we discussed in our previous step, now we are selecting variables for our reduced model.In our reduced model we will be testing the following hypothesis:
Holding everything else constant, women or
prestige do not have a linear relationship with
income.
reduced_model <- lm(income ~ women + prestige, data=Prestige)
summary(reduced_model)
##
## Call:
## lm(formula = income ~ women + prestige, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7620.9 -1008.7 -240.4 873.1 14180.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 431.574 807.630 0.534 0.594
## women -48.385 8.128 -5.953 4.02e-08 ***
## prestige 165.875 14.988 11.067 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2573 on 99 degrees of freedom
## Multiple R-squared: 0.64, Adjusted R-squared: 0.6327
## F-statistic: 87.98 on 2 and 99 DF, p-value: < 2.2e-16
womenandprestigeboth have highly statistically significant p-values meaning that each contributes independently predicting income.
Our reduced model R² = 0.64 explaining 64% of the variation in income. This is a strong baseline captures most of the structure in our data. This provides us a strong reference model for evaluating whether additional predictors significantly improve model fit.
Prestige_clean <- Prestige[complete.cases(Prestige[, c("income","women","prestige","education","census","type")]),]
reduced_model <- lm(income ~ women + prestige, data = Prestige_clean)
full_model <- lm(income ~ education + women + prestige + census + type,
data = Prestige_clean)
anova(reduced_model, full_model)
## Analysis of Variance Table
##
## Model 1: income ~ women + prestige
## Model 2: income ~ education + women + prestige + census + type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 95 635663898
## 2 91 630668656 4 4995243 0.1802 0.9481
Our partial F-test suggest that after accounting for women and prestige, adding education, census, and occupation type does not significantly improve the model’s ability to explain income.
confint(reduced_model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -1010.64689 2316.99723
## women -67.21638 -33.78294
## prestige 133.05303 194.42855
womenhas lower bound -67.21 and upper bound -33.78 coefficiency meaning that every 1 percentage point increase in the proportion of women variable, income expected to decrease between 33.78 and 67.22 units. The interval does not cross zero, this predictor has a statistically significant negative effect.
prestigehas a lower bound 133.05 and upper bound 194.43 coefficiency meaning that every 1 unit increase in prestige variable, our expected income will increase by between roughly 133 and 194 units. Since both bounds are positive, this predictor has a statistically significant positive effect.
new_obs <- data.frame(women = 45, prestige = 57)
predict(reduced_model, newdata = new_obs, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 7713.916 2535.318 12892.51
Our model’s predicted income is $7,713.91 for the specific values of women = 45 and prestige = 57. Our lower bound of prediction interval is 2535.31, while upper bound of prediction interval is 12892.51. this tells us that 95% certain that the income for this one specific individual fall between these bounds.
While our confidence interval was showing narrower range by estimating, our Prediction Interval is showing wider range in estimate. This is because Confidence interval is predicting average (mean) income for all predictions, Prediction interval is predicting one individial prediction. Individiual prediction can vary due to unobserved factors.
library(car)
vif(full_model)
## GVIF Df GVIF^(1/(2*Df))
## education 8.818285 1 2.969560
## women 1.332216 1 1.154217
## prestige 5.420072 1 2.328105
## census 5.601500 1 2.366749
## type 11.889350 2 1.856904
in our Full model; education, prestige, census and type are showing high varience inflation factor basically they are all fighitng to explain the same thing. Because they explain similar variation in income, including all three in the model leads to multicollinearity and making difficult for us to identify their independent effects.
vif(reduced_model)
## women prestige
## 1.01228 1.01228
One the other hand our reduced model has VIF = 1.012 suggest that virtually no multicollinearity and more stable coefficient estimates.
Perform model diagnostics using residual plots and influence measures. Include and interpret:
If diagnostics suggest problems, propose and implement a remedy if needed (only if justified), such as: * transforming the model properly, * adding additional predictor terms only if supported by plots, * investigating influential points (do not delete blindly—explain what you did and why).
Then provide your final selected model (or a small set of final candidate models) and interpret it carefully. Your final interpretation should reflect what you learned from collinearity and diagnostics.
par(mfrow = c(2,2))
plot(reduced_model)
Residuals are mostly centered around “0” but the spread increases as fitted values increase indicating fluctuating variance.
The Q–Q plot shows noticeable deviation in the upper tail, meaning the residuals are right-skewed and influenced by a few unusually high-income occupations. we have big outliers labeled general.managers and physicians are far above the line.
This plot shows an upward trend, confirming that the variance of residuals increases with fitted income values. This also confirms heteroscedatisity.
The residuals vs leverage plot indicates a few moderately influential observations, but Cook’s distance suggests no point is extreme enough to invalidate the model.
Our income residuals spread out more at higher fitted values, we might use Log Transformation to stabilize our variance caused by high income values.
final_model <- lm(log(income) ~ women+prestige, data=Prestige)
summary(final_model)
##
## Call:
## lm(formula = log(income) ~ women + prestige, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26042 -0.09467 0.03960 0.16794 0.76815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7896875 0.0954283 81.629 < 2e-16 ***
## women -0.0082221 0.0009604 -8.561 1.49e-13 ***
## prestige 0.0236816 0.0017710 13.372 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3041 on 99 degrees of freedom
## Multiple R-squared: 0.7409, Adjusted R-squared: 0.7357
## F-statistic: 141.6 on 2 and 99 DF, p-value: < 2.2e-16
Our Final Model increased Adjusted R² = 0.7357 compare to previous models suggest that about 74% of the variation in log(income) is explained by prestige and women.
Holding percent women fixed, for every 1 point increase in prestige score, predicted income increases by about 2.37%
Holding prestige fixed, for every 1 percentage point increase in the percent of women in an occupation, predicted income decreases by about 0.82%