The dataset used is called Prestige and comes with the carData package, loaded below via library(carData). Each row is an observation corresponding to an occupation. The columns include the average years of education, average income, the percentage of women in the occupation, the prestige score of the occupation, and so on.
# Load the package that contains the full dataset and necessary libraries for our work.
options(warn = -1) # Suppress Warnings
library(carData) # Loading Prestige dataset
library(ggplot2) # for some amazing looking graphs
library(MASS) # for the Box-Cox transformation later on
library(corrplot) # Plotting nice correlation matrix
library(cowplot) # arranging plots into a grid
Let's look into the dataset and understand the characteristics of its variables.
# Inspect and summarize the data.
head(Prestige) # First 6 rows of dataset
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
str(Prestige) # Structure of Prestige dataset
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
summary(Prestige) # Summarize the data of Prestige dataset
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
Notice that the 'type' variable has 4 missing values. Let's keep that in mind and handle them later, when we build a regression model that uses the type variable.
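As a quick check (an illustrative snippet, not part of the original workflow), we can count the missing values and look at the affected rows:
# Count and inspect the observations with a missing 'type'
sum(is.na(Prestige$type)) # 4 missing values, as seen in the summary
Prestige[is.na(Prestige$type), ] # the four occupations without a recorded type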
Let's identify the variables that are highly correlated with the prestige variable.
cor(Prestige[,-6]) # Correlation of Prestige Dataset on numeric variables
## education income women prestige census
## education 1.00000000 0.5775802 0.06185286 0.8501769 -0.8230882
## income 0.57758023 1.0000000 -0.44105927 0.7149057 -0.3610023
## women 0.06185286 -0.4410593 1.00000000 -0.1183342 -0.2270028
## prestige 0.85017689 0.7149057 -0.11833419 1.0000000 -0.6345103
## census -0.82308821 -0.3610023 -0.22700277 -0.6345103 1.0000000
corrplot(cor(Prestige[,-6]) , method = "number") # Plotting Correlation Matrix
Notice that the income and education variables are strongly positively correlated with prestige, census is negatively correlated with prestige, and women shows almost no correlation.
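To make that ranking explicit, a small sketch (not in the original analysis) sorts the correlations of the numeric variables with prestige:
# Rank the numeric variables by their correlation with prestige
sort(cor(Prestige[, -6])[, "prestige"], decreasing = TRUE)
# prestige (1.00) > education (0.85) > income (0.71) > women (-0.12) > census (-0.63)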
Let's visualize the data points with scatter plots and histograms. First of all, let's look at the relationship of prestige against the income, education, women and census variables through scatter plots, coloured by type.
plot_income <- ggplot(data = Prestige, aes(x = prestige, y = income, col = type)) + geom_point()
plot_education <- ggplot(data = Prestige, aes(x = prestige, y = education, col = type)) + geom_point()
plot_women <- ggplot(data = Prestige, aes(x = prestige, y = women, col = type)) + geom_point()
plot_census <- ggplot(data = Prestige, aes(x = prestige, y = census, col = type)) + geom_point()
plot_grid(plot_income, plot_education, plot_women, plot_census, labels = "AUTO")
We can see a strong linear relationship of prestige with income and education, but not with women and census. Let's take a look at the distributions of the income and education variables through histograms and compare them against their mean and median values.
hist_income <- ggplot(Prestige, aes(x = income)) + geom_histogram(binwidth = 1000) +
geom_vline(xintercept = mean(Prestige$income), color = "indianred") +
geom_vline(xintercept = median(Prestige$income), color = "cornflowerblue")
hist_education <- ggplot(Prestige, aes(x = education)) + geom_histogram(binwidth = 1) +
geom_vline(xintercept = mean(Prestige$education), color = "indianred") +
geom_vline(xintercept = median(Prestige$education), color = "cornflowerblue")
plot_grid(hist_income, hist_education, labels = "AUTO")
We can see that the income variable is right-skewed and that education does not follow a normal distribution either. Let's try to transform them towards a normal distribution where possible: we will apply a log transformation to the income variable and, later on, center the education variable on its mean.
# Comparing original income histogram against log of income histogram
hist_income <- ggplot(Prestige, aes(x = income)) + geom_histogram(binwidth = 1000) +
labs(title = "Original Income") + labs(x ="Income") +
geom_vline(xintercept = mean(Prestige$income), color = "indianred") +
geom_vline(xintercept = median(Prestige$income), color = "cornflowerblue")
hist_trnsf_income <- ggplot(Prestige, aes(x = log(income))) + geom_histogram(binwidth = 0.5) +
labs(title = "Transformed Income") + labs(x ="Log of Income") +
geom_vline(xintercept = mean(log(Prestige$income)), color = "indianred") +
geom_vline(xintercept = median(log(Prestige$income)), color = "cornflowerblue")
plot_grid(hist_income, hist_trnsf_income, labels = "AUTO")
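Beyond the visual comparison, a quick numeric check (an illustrative sketch using a hand-rolled helper, not part of the original analysis) shows how the log transform reduces the skew of income:
# Simple moment-based skewness (hypothetical helper, avoids pulling in extra packages)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(Prestige$income)      # clearly positive, i.e. right-skewed
skewness(log(Prestige$income)) # much closer to zero after the log transform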
Now, let's build the linear regression model step by step, eliminating the variables that are not significant along the way to improve the model. This also corresponds to our findings above that the women and census variables have little relationship with prestige.
# Fit a linear model with education, income, women & census variables
lm_mod1 = lm(prestige ~ education + log(income) + women + census, data = Prestige)
summary(lm_mod1) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ education + log(income) + women + census,
## data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2039 -4.4726 -0.0698 4.4158 19.6516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.139e+02 1.573e+01 -7.244 1.04e-10 ***
## education 3.983e+00 5.575e-01 7.144 1.68e-10 ***
## log(income) 1.328e+01 1.940e+00 6.844 6.99e-10 ***
## women 4.954e-02 3.034e-02 1.633 0.106
## census 2.943e-04 5.010e-04 0.587 0.558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.116 on 97 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8289
## F-statistic: 123.3 on 4 and 97 DF, p-value: < 2.2e-16
Notice that the adjusted R2 is about 83%, which is good; however, the p-values for the women and census variables are high. A predictor is generally considered statistically significant only when its p-value is below the pre-determined significance level of 0.05. Education and log(income), on the other hand, have highly significant p-values and drive the model, which matches what we saw earlier in the scatter plots and the correlation plot. Let's remove the women and census variables, rebuild the regression model and check its summary.
# Fit a linear model with education and income variables
lm_mod2 = lm(prestige ~ education + log(income), data = Prestige)
summary(lm_mod2) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ education + log(income), data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0346 -4.5657 -0.1857 4.0577 18.1270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -95.1940 10.9979 -8.656 9.27e-14 ***
## education 4.0020 0.3115 12.846 < 2e-16 ***
## log(income) 11.4375 1.4371 7.959 2.94e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
## F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16
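As an optional sanity check (a brief sketch, not part of the original write-up), a partial F-test via anova() can verify whether dropping women and census significantly worsens the fit:
# Compare the reduced model (lm_mod2) against the full model (lm_mod1)
anova(lm_mod2, lm_mod1)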
The adjusted R2 is still about 83%, but the intercept is a large negative number (about -95) that is hard to interpret, since it corresponds to an occupation with zero years of education. To make it more meaningful, we will add a new variable, education.c, that centers the education variable on its mean, so that the intercept is evaluated at the average education level rather than at zero years of education.
prstg_df = Prestige # create a working copy of Prestige for further manipulations
# center the education variable on its mean (centering only, no rescaling)
set.seed(1)
education.c = scale(prstg_df$education, center=TRUE, scale=FALSE)
prstg_df = cbind(prstg_df, education.c)
# Fit a linear model with centered education and income variables
lm_mod3 = lm(prestige ~ education.c + log(income), data = prstg_df)
summary(lm_mod3) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ education.c + log(income), data = prstg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0346 -4.5657 -0.1857 4.0577 18.1270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52.2209 12.4661 -4.189 6.09e-05 ***
## education.c 4.0020 0.3115 12.846 < 2e-16 ***
## log(income) 11.4375 1.4371 7.959 2.94e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
## F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(lm_mod3) # Diagnostic plots for the lm_mod3 model
par(mfrow = c(1, 1)) # Return plotting panel to original 1 section
Let's look at the residual values and the residual plot of the last regression model we built. Residuals are simply the differences between the actual and predicted values. From the model summary above, the residuals range from about -17 to 18, and you can see in the plot below that they are fairly evenly distributed around zero.
# Compare actual, predicted and residual values of prestige from the lm_mod3 model
prstg_df_pred = as.data.frame(prstg_df$prestige) # Save the actual values
prstg_df_pred$predicted <- predict(lm_mod3) # Save the predicted values
prstg_df_pred$residuals <- residuals(lm_mod3) # Save the residual values
head(prstg_df_pred)
## prstg_df$prestige predicted residuals
## 1 68.8 65.02953 3.7704703
## 2 69.1 70.08810 -0.9880979
## 3 63.4 60.38808 3.0119206
## 4 56.8 54.47327 2.3267295
## 5 73.5 66.66736 6.8326386
## 6 77.6 73.86069 3.7393148
plot(residuals(lm_mod3)) # Residual distribution of the model
abline(a=0,b=0,col='blue')
Now, let's see if we can improve the model. If you remember, the scatter plots above of income and education against prestige used the "type" variable as a category, and the linear relationship looked different for each category. Let's look at that from a different perspective. But before that, let's handle the NA values in the "type" variable.
prstg_df <- na.omit(prstg_df) # remove rows containing na's values via omit function
ggplot_income <- ggplot(data = prstg_df, aes(x = prestige, y = income, col = type)) + geom_smooth()
ggplot_educ <- ggplot(data = prstg_df, aes(x = prestige, y = education, col = type)) + geom_smooth()
plot_grid(ggplot_income, ggplot_educ, labels = "AUTO")
So, let's add the "type" variable to the linear regression model and examine whether the model improves and still makes sense.
# Fit a linear model with centered education, log of income and type variables
lm_mod4 = lm(prestige ~ education.c + log(income) + type, data = prstg_df)
summary(lm_mod4) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ education.c + log(income) + type, data = prstg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.511 -3.746 1.011 4.356 18.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45.9329 15.2089 -3.020 0.00326 **
## education.c 3.2845 0.6081 5.401 5.06e-07 ***
## log(income) 10.4875 1.7167 6.109 2.31e-08 ***
## typeprof 6.7509 3.6185 1.866 0.06524 .
## typewc -1.4394 2.3780 -0.605 0.54645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.637 on 93 degrees of freedom
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8493
## F-statistic: 137.6 on 4 and 93 DF, p-value: < 2.2e-16
Notice that the adjusted R2 has increased to almost 85% and the residuals now range from about -13.5 to 18.4. Let's try another way of adding the "type" variable to the regression model, this time interacting it with the other predictors.
# Fit a linear model with type interacting with centered education and log of income
lm_mod5 = lm(prestige ~ type * (education.c + log(income)), data = prstg_df)
summary(lm_mod5) # run a summary of its results
##
## Call:
## lm(formula = prestige ~ type * (education.c + log(income)), data = prstg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.970 -4.124 1.206 3.829 18.059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -94.9654 23.2304 -4.088 9.52e-05 ***
## typeprof 92.6488 33.5157 2.764 0.00693 **
## typewc 69.3281 37.9303 1.828 0.07093 .
## education.c 2.3357 0.9277 2.518 0.01360 *
## log(income) 15.9825 2.6059 6.133 2.32e-08 ***
## typeprof:education.c 0.6974 1.2895 0.541 0.58998
## typewc:education.c 3.6400 1.7589 2.069 0.04140 *
## typeprof:log(income) -9.4288 3.7751 -2.498 0.01434 *
## typewc:log(income) -8.1556 4.4029 -1.852 0.06730 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.409 on 89 degrees of freedom
## Multiple R-squared: 0.871, Adjusted R-squared: 0.8595
## F-statistic: 75.15 on 8 and 89 DF, p-value: < 2.2e-16
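For a more formal comparison than the change in R2 (an optional sketch, not part of the original post), a partial F-test can check whether the interaction terms improve on the additive model; both models are fitted on the same 98 complete cases, so they can be compared directly:
# Partial F-test: additive model (lm_mod4) vs. interaction model (lm_mod5)
anova(lm_mod4, lm_mod5)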
Now, the adjusted R2 is almost 86%. We can conclude that including the type predictor improves the model's fit. Keep trying more permutations and combinations and surprise me with your results in the comments section. Happy Machine Learning!