1) Exploratory visualization and expected relationships

#install.packages("GGally")

library(carData)
library(GGally)
library(ggplot2)

data(Prestige)
str(Prestige)
## 'data.frame':    102 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
ggpairs(Prestige, columns = 1:5, ggplot2::aes(color = type)) 

Positive & Negative Associations

There is positive relationship for Education vs Income Cor = 0.574 means that higher education linked to higher income.

Education vs Prestige is showing really high positive relationship with Cor = 0.866 means that Prestige rises strongly with Education.

Income is also showing strong positive relationship vs Prestige with Cor= 0.703 means that higher income jo to have higher prestige.

The is a strong negative relationship on Education vs Census with Cor= -0.826 means that as education increases, census value decreases.

There is a also negative relationship Income vs Women with Cor=-0.448 , suggesting that higher income occupations tend to have lower % of women.

There is another strong negative relationship on Prestige vs Census with Cor=-0.649, suggesting that higher prestige is associated with lower census values.

There is a very weak negative relationship on Women vs Prestige with Cor=-0.110 also Women vs Census with Cor= -0.242 is showing weak negative relationship.

Possible Nonlinear Patterns

Here are some possible non-linear patterns in our data; Income vs Education where it shows linear pattern at first income rises faster after a certain education level can be seen from the plot below:

ggplot(Prestige, aes(x=education, y=income)) + 
  geom_point() + 
  geom_smooth(method="loess", col="blue") +
  labs(title="Positive Association")

the pattern for Census vs Education seems to be non-linear suggesting that after certain education level census tend to rise.

# Negative Association
ggplot(Prestige, aes(x=education, y=census)) + 
  geom_point() + 
  geom_smooth(method="loess", col="red") +
  labs(title="Negative Association - Census vs Education")

Possible Interactions

Income vs Women is showing interesting differences between the group of employment types.

Overall Corr = -0.448

bc = -0.653*

prof = -0.513*

wc = -0.841*

# Negative Association
ggplot(Prestige, aes(x=women, y=income, color = type)) + 
  geom_point()

Fitting Linear Regression Model and Interpretations of outputs

Our the model predicts income (in dollars) using education, women (%), prestige, census, and type variables.

full_model <- lm(income ~ education + women + prestige + census + type, data = Prestige)
summary(full_model)
## 
## Call:
## lm(formula = income ~ education + women + prestige + census + 
##     type, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7752.4  -954.6  -331.2   742.6 14301.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.32053 3037.27048   0.002  0.99808    
## education    131.18372  288.74961   0.454  0.65068    
## women        -53.23480    9.83107  -5.415 4.96e-07 ***
## prestige     139.20912   36.40239   3.824  0.00024 ***
## census         0.04209    0.23568   0.179  0.85865    
## typeprof     509.15150 1798.87914   0.283  0.77779    
## typewc       347.99010 1173.89384   0.296  0.76757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2633 on 91 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.6363, Adjusted R-squared:  0.6123 
## F-statistic: 26.54 on 6 and 91 DF,  p-value: < 2.2e-16

The reference group for type is bc (blue-collar), since typeprof and typewc are included as indicators.

Multiple R-squared = 0.6363 means that our model explain 63.3% variation in income. Adjusted-R squared = 0.6123 means that after adjusting predictors our model still explain 61.2% variation.

Interpreting the occupation type indicators (relative to blue collar)

3) Regression importance testing (individual and/or group tests)

Assess which predictors are important contributors to the model. Use:

The coefficient for women is statistically significant. The t-test suggests that holding education, prestige, census, and type fixed, the percent of women in an occupation has a meaningful association with income. Therefore we should keep women variable in our model.

The coefficient for prestige is statistically significant.The t-test suggests that holding other variables fixed, prestige adds explanatory impact for income. Therefore Prestige should be kept in our model.

Census is not significant and its estimated effect is extremely small.The t-test suggests census does not explain income once other predictors are included.

type variables and it sub categories are not statistically significant. This suggests that once education, women, and prestige are controlled for, occupation type does not add additional explanatory power. Type variable can be remove but it is best for us to confirm with patial F-test since it is group of variables.

type_model <- lm(income ~ type, data = Prestige)
summary(type_model)
## 
## Call:
## lm(formula = income ~ type, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5945.5 -2003.3  -466.2  1536.9 15319.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5374.1      521.8  10.299  < 2e-16 ***
## typeprof      5185.3      811.6   6.389 6.16e-09 ***
## typewc        -321.8      890.6  -0.361    0.719    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3461 on 95 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.3437, Adjusted R-squared:  0.3299 
## F-statistic: 24.87 on 2 and 95 DF,  p-value: 2.057e-09

In our simple regression model (income ~ type), type is a significant predictor of income (p-value= 2.057e-09). Professional jobs earn significantly more than blue-collar jobs, with an estimated increase of about $5,185. However, once other variables are included in our full model, the type coefficients are no longer significant. This suggests that type does not have an independent effect on income, and that the income differences between occupation types are largely explained by those other factors. Therefore type can be removed from our model.

Education is not statistically significant, however in our prior explorotary data analysis education showed high correlation with Prestige therefore we have to do ANOVA testing to confirm if we should remove Education variable from our model.

edu_out_model <- lm(income ~ women + prestige + census + type, data = Prestige)
anova(edu_out_model, full_model)
## Analysis of Variance Table
## 
## Model 1: income ~ women + prestige + census + type
## Model 2: income ~ education + women + prestige + census + type
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1     92 632099120                           
## 2     91 630668656  1   1430464 0.2064 0.6507

Our p-value way bigger than 0.05 meaning that even though education might relate to income in general, after controlling for women, prestige, census, and type, education does not provide additional predictive value. therefore education can be removed from our model.

4) Nested model testing (full vs. reduced model)

As we discussed in our previous step, now we are selecting variables for our reduced model.In our reduced model we will be testing the following hypothesis:

Holding everything else constant, women or prestige do not have a linear relationship with income.

reduced_model <- lm(income ~ women + prestige, data=Prestige)
summary(reduced_model)
## 
## Call:
## lm(formula = income ~ women + prestige, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7620.9 -1008.7  -240.4   873.1 14180.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  431.574    807.630   0.534    0.594    
## women        -48.385      8.128  -5.953 4.02e-08 ***
## prestige     165.875     14.988  11.067  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2573 on 99 degrees of freedom
## Multiple R-squared:   0.64,  Adjusted R-squared:  0.6327 
## F-statistic: 87.98 on 2 and 99 DF,  p-value: < 2.2e-16

women and prestige both have highly statistically significant p-values meaning that each contributes independently predicting income.

Our reduced model R² = 0.64 explaining 64% of the variation in income. This is a strong baseline captures most of the structure in our data. This provides us a strong reference model for evaluating whether additional predictors significantly improve model fit.

Prestige_clean <- Prestige[complete.cases(Prestige[, c("income","women","prestige","education","census","type")]),]

reduced_model <- lm(income ~ women + prestige, data = Prestige_clean)

full_model <- lm(income ~ education + women + prestige + census + type,
                 data = Prestige_clean)

anova(reduced_model, full_model)
## Analysis of Variance Table
## 
## Model 1: income ~ women + prestige
## Model 2: income ~ education + women + prestige + census + type
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1     95 635663898                           
## 2     91 630668656  4   4995243 0.1802 0.9481

Our partial F-test suggest that after accounting for women and prestige, adding education, census, and occupation type does not significantly improve the model’s ability to explain income.

5) Confidence intervals for regression coefficients

confint(reduced_model, level = 0.95)
##                   2.5 %     97.5 %
## (Intercept) -1010.64689 2316.99723
## women         -67.21638  -33.78294
## prestige      133.05303  194.42855

women has lower bound -67.21 and upper bound -33.78 coefficiency meaning that every 1 percentage point increase in the proportion of women variable, income expected to decrease between 33.78 and 67.22 units. The interval does not cross zero, this predictor has a statistically significant negative effect.

prestige has a lower bound 133.05 and upper bound 194.43 coefficiency meaning that every 1 unit increase in prestige variable, our expected income will increase by between roughly 133 and 194 units. Since both bounds are positive, this predictor has a statistically significant positive effect.

6) Prediction interval for a new observation

new_obs <- data.frame(women = 45, prestige = 57)
predict(reduced_model, newdata = new_obs, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 7713.916 2535.318 12892.51

Our model’s predicted income is $7,713.91 for the specific values of women = 45 and prestige = 57. Our lower bound of prediction interval is 2535.31, while upper bound of prediction interval is 12892.51. this tells us that 95% certain that the income for this one specific individual fall between these bounds.

While our confidence interval was showing narrower range by estimating, our Prediction Interval is showing wider range in estimate. This is because Confidence interval is predicting average (mean) income for all predictions, Prediction interval is predicting one individial prediction. Individiual prediction can vary due to unobserved factors.

7) Multicollinearity investigation

library(car)
vif(full_model)
##                GVIF Df GVIF^(1/(2*Df))
## education  8.818285  1        2.969560
## women      1.332216  1        1.154217
## prestige   5.420072  1        2.328105
## census     5.601500  1        2.366749
## type      11.889350  2        1.856904

in our Full model; education, prestige, census and type are showing high varience inflation factor basically they are all fighitng to explain the same thing. Because they explain similar variation in income, including all three in the model leads to multicollinearity and making difficult for us to identify their independent effects.

vif(reduced_model)
##    women prestige 
##  1.01228  1.01228

One the other hand our reduced model has VIF = 1.012 suggest that virtually no multicollinearity and more stable coefficient estimates.

8) Residual-based model diagnostics + remedy + final model interpretation

Perform model diagnostics using residual plots and influence measures. Include and interpret:

If diagnostics suggest problems, propose and implement a remedy if needed (only if justified), such as: * transforming the model properly, * adding additional predictor terms only if supported by plots, * investigating influential points (do not delete blindly—explain what you did and why).

Then provide your final selected model (or a small set of final candidate models) and interpret it carefully. Your final interpretation should reflect what you learned from collinearity and diagnostics.

par(mfrow = c(2,2))
plot(reduced_model)

Residuals vs Fitted

Residuals are mostly centered around “0” but the spread increases as fitted values increase indicating fluctuating variance.

Q-Q Residuals

The Q–Q plot shows noticeable deviation in the upper tail, meaning the residuals are right-skewed and influenced by a few unusually high-income occupations. we have big outliers labeled general.managers and physicians are far above the line.

Scale - Location

This plot shows an upward trend, confirming that the variance of residuals increases with fitted income values. This also confirms heteroscedatisity.

Residuals vs Leverage

The residuals vs leverage plot indicates a few moderately influential observations, but Cook’s distance suggests no point is extreme enough to invalidate the model.

Remedy

Our income residuals spread out more at higher fitted values, we might use Log Transformation to stabilize our variance caused by high income values.

final_model <- lm(log(income) ~ women+prestige, data=Prestige)
summary(final_model)
## 
## Call:
## lm(formula = log(income) ~ women + prestige, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26042 -0.09467  0.03960  0.16794  0.76815 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.7896875  0.0954283  81.629  < 2e-16 ***
## women       -0.0082221  0.0009604  -8.561 1.49e-13 ***
## prestige     0.0236816  0.0017710  13.372  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3041 on 99 degrees of freedom
## Multiple R-squared:  0.7409, Adjusted R-squared:  0.7357 
## F-statistic: 141.6 on 2 and 99 DF,  p-value: < 2.2e-16

Our Final Model increased Adjusted R² = 0.7357 compare to previous models suggest that about 74% of the variation in log(income) is explained by prestige and women.

Holding percent women fixed, for every 1 point increase in prestige score, predicted income increases by about 2.37%

Holding prestige fixed, for every 1 percentage point increase in the percent of women in an occupation, predicted income decreases by about 0.82%