The dataset used is called Prestige and comes with the carData package, loaded below via library(carData). Each row is an observation corresponding to an occupation. The columns include the average years of education, average income, the percentage of women in the occupation, the prestige score of the occupation, and so on.
# Load the package that contains the full dataset and necessary libraries for our work.
options(warn = -1) # Suppress Warnings
library(carData) # Loading Prestige dataset
library(ggplot2) # for some amazing looking graphs
library(MASS) # for the Box-Cox transformation later on
library(corrplot) # Plotting nice correlation matrix
library(cowplot) # arranging plots into a grid
Let's look into the dataset and understand the characteristics of its variables.
# Inspect and summarize the data.
head(Prestige) # First 6 rows of dataset
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
str(Prestige) # Structure of Prestige dataset
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
summary(Prestige) # Summarize the data of Prestige dataset
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
Notice that the 'type' variable has 4 missing values. Let's keep that in mind and handle them later, when we build a regression model that uses the type variable.
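As a quick check (an illustrative snippet, not part of the original workflow), we can count the missing values and look at the affected rows:
# Count and inspect the observations with a missing 'type'
sum(is.na(Prestige$type)) # 4 missing values, as seen in the summary
Prestige[is.na(Prestige$type), ] # the four occupations without a recorded type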
Let's identify the variables that are highly correlated with the prestige variable.
cor(Prestige[,-6]) # Correlation of Prestige Dataset on numeric variables
## education income women prestige census
## education 1.00000000 0.5775802 0.06185286 0.8501769 -0.8230882
## income 0.57758023 1.0000000 -0.44105927 0.7149057 -0.3610023
## women 0.06185286 -0.4410593 1.00000000 -0.1183342 -0.2270028
## prestige 0.85017689 0.7149057 -0.11833419 1.0000000 -0.6345103
## census -0.82308821 -0.3610023 -0.22700277 -0.6345103 1.0000000
corrplot(cor(Prestige[,-6]) , method = "number") # Plotting Correlation Matrix
Notice that the income and education variables are strongly positively correlated with prestige, census is negatively correlated with prestige, and women shows almost no correlation.
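To make that ranking explicit, a small sketch (not in the original analysis) sorts the correlations of the numeric variables with prestige:
# Rank the numeric variables by their correlation with prestige
sort(cor(Prestige[, -6])[, "prestige"], decreasing = TRUE)
# prestige (1.00) > education (0.85) > income (0.71) > women (-0.12) > census (-0.63)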
Let's visualize the data points with scatter plots and histograms. First of all, let's look at the relationship of prestige against the income, education, women and census variables through scatter plots, coloured by type.
plot_income <- ggplot(data = Prestige, aes(x = prestige, y = income, col = type)) + geom_point()
plot_education <- ggplot(data = Prestige, aes(x = prestige, y = education, col = type)) + geom_point()
plot_women <- ggplot(data = Prestige, aes(x = prestige, y = women, col = type)) + geom_point()
plot_census <- ggplot(data = Prestige, aes(x = prestige, y = census, col = type)) + geom_point()
plot_grid(plot_income, plot_education, plot_women, plot_census, labels = "AUTO")
We can see a strong linear relationship of prestige with income and education, but not with women and census. Let's take a look at the distributions of the income and education variables through histograms and compare them against their mean and median values.
hist_income <- ggplot(Prestige, aes(x = income)) + geom_histogram(binwidth = 1000) +
geom_vline(xintercept = mean(Prestige$income), color = "indianred") +
geom_vline(xintercept = median(Prestige$income), color = "cornflowerblue")
hist_education <- ggplot(Prestige, aes(x = education)) + geom_histogram(binwidth = 1) +
geom_vline(xintercept = mean(Prestige$education), color = "indianred") +
geom_vline(xintercept = median(Prestige$education), color = "cornflowerblue")
plot_grid(hist_income, hist_education, labels = "AUTO")
We can see that the income variable is right-skewed and that education does not follow a normal distribution either. Let's try to transform them towards a normal distribution where possible: we will apply a log transformation to the income variable and, later on, center the education variable on its mean.
# Comparing original income histogram against log of income histogram
hist_income <- ggplot(Prestige, aes(x = income)) + geom_histogram(binwidth = 1000) +
labs(title = "Original Income") + labs(x ="Income") +
geom_vline(xintercept = mean(Prestige$income), color = "indianred") +
geom_vline(xintercept = median(Prestige$income), color = "cornflowerblue")
hist_trnsf_income <- ggplot(Prestige, aes(x = log(income))) + geom_histogram(binwidth = 0.5) +
labs(title = "Transformed Income") + labs(x ="Log of Income") +
geom_vline(xintercept = mean(log(Prestige$income)), color = "indianred") +
geom_vline(xintercept = median(log(Prestige$income)), color = "cornflowerblue")
plot_grid(hist_income, hist_trnsf_income, labels = "AUTO")
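Beyond the visual comparison, a quick numeric check (an illustrative sketch using a hand-rolled helper, not part of the original analysis) shows how the log transform reduces the skew of income:
# Simple moment-based skewness (hypothetical helper, avoids pulling in extra packages)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(Prestige$income)      # clearly positive, i.e. right-skewed
skewness(log(Prestige$income)) # much closer to zero after the log transform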
Now, let's build the linear regression model step by step, eliminating the variables that are not significant along the way to improve the model. This also corresponds to our findings above that the women and census variables have little relationship with prestige.
# Fit a linear model with education, income, women & census variables
lm_mod1 = lm(prestige ~ education + log(income) + women + census, data = Prestige)
summary(lm_mod1) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ education + log(income) + women + census,
## data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2039 -4.4726 -0.0698 4.4158 19.6516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.139e+02 1.573e+01 -7.244 1.04e-10 ***
## education 3.983e+00 5.575e-01 7.144 1.68e-10 ***
## log(income) 1.328e+01 1.940e+00 6.844 6.99e-10 ***
## women 4.954e-02 3.034e-02 1.633 0.106
## census 2.943e-04 5.010e-04 0.587 0.558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.116 on 97 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8289
## F-statistic: 123.3 on 4 and 97 DF, p-value: < 2.2e-16
Notice that the adjusted R2 is about 83%, which is good; however, the p-values for the women and census variables are high. A predictor is generally considered statistically significant only when its p-value is below the pre-determined significance level of 0.05. Education and log(income), on the other hand, have highly significant p-values and drive the model, which matches what we saw earlier in the scatter plots and the correlation plot. Let's remove the women and census variables, rebuild the regression model and check its summary.
# Fit a linear model with education and income variables
lm_mod2 = lm(prestige ~ education + log(income), data = Prestige)
summary(lm_mod2) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ education + log(income), data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0346 -4.5657 -0.1857 4.0577 18.1270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -95.1940 10.9979 -8.656 9.27e-14 ***
## education 4.0020 0.3115 12.846 < 2e-16 ***
## log(income) 11.4375 1.4371 7.959 2.94e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
## F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16
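As an optional sanity check (a brief sketch, not part of the original write-up), a partial F-test via anova() can verify whether dropping women and census significantly worsens the fit:
# Compare the reduced model (lm_mod2) against the full model (lm_mod1)
anova(lm_mod2, lm_mod1)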
The adjusted R2 is still about 83%, but the intercept is a large negative number (about -95) that is hard to interpret, since it corresponds to an occupation with zero years of education. To make it more meaningful, we will add a new variable, education.c, that centers the education variable on its mean, so that the intercept is evaluated at the average education level rather than at zero years of education.
prstg_df = Prestige # create a working copy of Prestige for further manipulations
# center the education variable on its mean (centering only, no rescaling)
set.seed(1)
education.c = scale(prstg_df$education, center=TRUE, scale=FALSE)
prstg_df = cbind(prstg_df, education.c)
# Fit a linear model with centered education and income variables
lm_mod3 = lm(prestige ~ education.c + log(income), data = prstg_df)
summary(lm_mod3) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ education.c + log(income), data = prstg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0346 -4.5657 -0.1857 4.0577 18.1270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52.2209 12.4661 -4.189 6.09e-05 ***
## education.c 4.0020 0.3115 12.846 < 2e-16 ***
## log(income) 11.4375 1.4371 7.959 2.94e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
## F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(lm_mod3) # Diagnostic plots for the lm_mod3 model
par(mfrow = c(1, 1)) # Return plotting panel to original 1 section
Let's look at the residual values and the residual plot of the last regression model we built. Residuals are simply the differences between the actual and predicted values. From the model summary above, the residuals range from about -17 to 18, and you can see in the plot below that they are fairly evenly distributed around zero.
# Compare actual, predicted and residual values of prestige from the lm_mod3 model
prstg_df_pred = as.data.frame(prstg_df$prestige) # Save the actual values
prstg_df_pred$predicted <- predict(lm_mod3) # Save the predicted values
prstg_df_pred$residuals <- residuals(lm_mod3) # Save the residual values
head(prstg_df_pred)
## prstg_df$prestige predicted residuals
## 1 68.8 65.02953 3.7704703
## 2 69.1 70.08810 -0.9880979
## 3 63.4 60.38808 3.0119206
## 4 56.8 54.47327 2.3267295
## 5 73.5 66.66736 6.8326386
## 6 77.6 73.86069 3.7393148
plot(residuals(lm_mod3)) # Residual distribution of the model
abline(a=0,b=0,col='blue')
Now, let's see if we can improve the model. If you remember, the scatter plots above of income and education against prestige used the "type" variable as a category, and the linear relationship looked different for each category. Let's look at that from a different perspective. But before that, let's handle the NA values in the "type" variable.
prstg_df <- na.omit(prstg_df) # remove rows containing na's values via omit function
ggplot_income <- ggplot(data = prstg_df, aes(x = prestige, y = income, col = type)) + geom_smooth()
ggplot_educ <- ggplot(data = prstg_df, aes(x = prestige, y = education, col = type)) + geom_smooth()
plot_grid(ggplot_income, ggplot_educ, labels = "AUTO")
So, let's add the "type" variable to the linear regression model and examine whether the model improves and still makes sense.
# Fit a linear model with centered education, log of income and type variables
lm_mod4 = lm(prestige ~ education.c + log(income) + type, data = prstg_df)
summary(lm_mod4) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ education.c + log(income) + type, data = prstg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.511 -3.746 1.011 4.356 18.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45.9329 15.2089 -3.020 0.00326 **
## education.c 3.2845 0.6081 5.401 5.06e-07 ***
## log(income) 10.4875 1.7167 6.109 2.31e-08 ***
## typeprof 6.7509 3.6185 1.866 0.06524 .
## typewc -1.4394 2.3780 -0.605 0.54645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.637 on 93 degrees of freedom
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8493
## F-statistic: 137.6 on 4 and 93 DF, p-value: < 2.2e-16
Notice that the adjusted R2 has increased to almost 85% and the residuals now range from about -13.5 to 18.4. Let's try another way of adding the "type" variable to the regression model, this time interacting it with the other predictors.
# Fit a linear model with type interacting with centered education and log of income
lm_mod5 = lm(prestige ~ type * (education.c + log(income)), data = prstg_df)
summary(lm_mod5) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ type * (education.c + log(income)), data = prstg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.970 -4.124 1.206 3.829 18.059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -94.9654 23.2304 -4.088 9.52e-05 ***
## typeprof 92.6488 33.5157 2.764 0.00693 **
## typewc 69.3281 37.9303 1.828 0.07093 .
## education.c 2.3357 0.9277 2.518 0.01360 *
## log(income) 15.9825 2.6059 6.133 2.32e-08 ***
## typeprof:education.c 0.6974 1.2895 0.541 0.58998
## typewc:education.c 3.6400 1.7589 2.069 0.04140 *
## typeprof:log(income) -9.4288 3.7751 -2.498 0.01434 *
## typewc:log(income) -8.1556 4.4029 -1.852 0.06730 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.409 on 89 degrees of freedom
## Multiple R-squared: 0.871, Adjusted R-squared: 0.8595
## F-statistic: 75.15 on 8 and 89 DF, p-value: < 2.2e-16
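For a more formal comparison than the change in R2 (an optional sketch, not part of the original post), a partial F-test can check whether the interaction terms improve on the additive model; both models are fitted on the same 98 complete cases, so they can be compared directly:
# Partial F-test: additive model (lm_mod4) vs. interaction model (lm_mod5)
anova(lm_mod4, lm_mod5)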
Now, the adjusted R2 is almost 86%. We can conclude that including the type predictor improves the model's fit. Keep trying more permutations and combinations and surprise me with your results in the comments section. Happy Machine Learning!