The dataset used in this exercise is called “Prestige” and is included in the \(car\) library. It has 102 rows and 6 columns. Each row is an observation corresponding to an occupation, and the columns record predictors such as years of education, income, percentage of women in the occupation, and the prestige of the occupation.
The variables in the dataset are described as follows:

- \(education\): average years of education of the occupation’s incumbents (1971)
- \(income\): average income of incumbents, in dollars (1971)
- \(women\): percentage of incumbents who are women
- \(prestige\): Pineo-Porter prestige score for the occupation
- \(census\): Canadian Census occupational code
- \(type\): type of occupation, a factor with levels bc (blue collar), wc (white collar), and prof (professional, managerial, and technical)
For multiple regression, we’ll use more than one predictor. We will include \(women\) and \(prestige\) along with \(education\) as our predictor variables, and our response variable will be \(income\).
For multiple regression, we would like to fit the following model:
\[ \hat{income} = B_{0} + (B_{1} * education) + (B_{2} * prestige) + (B_{3} * women) \] The model will estimate the value of the intercept \(B_{0}\) and a slope for each predictor: \(B_{1}\) for education, \(B_{2}\) for prestige, and \(B_{3}\) for women. We will consider a couple of models and see which one is the better fit.
library(car)
library(ggplot2)
library(dplyr)
summary(Prestige)
education income women prestige census type
Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80 Min. :1113 bc :44
1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.:3120 prof:31
Median :10.540 Median : 5930 Median :13.600 Median :43.60 Median :5135 wc :23
Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83 Mean :5402 NA's: 4
3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27 3rd Qu.:8312
Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20 Max. :9517
colnames(Prestige)
[1] "education" "income" "women" "prestige" "census" "type"
ncol(Prestige)
[1] 6
nrow(Prestige)
[1] 102
head(Prestige, 20)
education income women prestige census type
gov.administrators 13.11 12351 11.16 68.8 1113 prof
general.managers 12.26 25879 4.02 69.1 1130 prof
accountants 12.77 9271 15.70 63.4 1171 prof
purchasing.officers 11.42 8865 9.11 56.8 1175 prof
chemists 14.62 8403 11.68 73.5 2111 prof
physicists 15.64 11030 5.13 77.6 2113 prof
biologists 15.09 8258 25.65 72.6 2133 prof
architects 15.44 14163 2.69 78.1 2141 prof
civil.engineers 14.52 11377 1.03 73.1 2143 prof
mining.engineers 14.64 11023 0.94 68.8 2153 prof
surveyors 12.39 5902 1.91 62.0 2161 prof
draughtsmen 12.30 7059 7.83 60.0 2163 prof
computer.programers 13.83 8425 15.33 53.8 2183 prof
economists 14.44 8049 57.31 62.2 2311 prof
psychologists 14.36 7405 48.28 74.9 2315 prof
social.workers 14.21 6336 54.77 55.1 2331 prof
lawyers 15.77 19263 5.13 82.3 2343 prof
librarians 14.15 6112 77.10 58.1 2351 prof
vocational.counsellors 15.22 9593 34.89 58.3 2391 prof
ministers 14.50 4686 4.14 72.8 2511 prof
Prestige_4Column_df = Prestige[,c(1:4)]
summary(Prestige_4Column_df)
education income women prestige
Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
Median :10.540 Median : 5930 Median :13.600 Median :43.60
Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
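Since dplyr is already loaded, the same four-column subset could equivalently be produced with select(); a minor alternative shown only for reference (note that dplyr verbs may drop the data frame’s row names):

```r
Prestige_4Column_df = select(Prestige, education, income, women, prestige)
```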
Prestige_Education_df = arrange(Prestige, education)
head(Prestige_Education_df, 10)
education income women prestige census type
1 6.38 2847 90.67 28.2 8563 bc
2 6.60 5959 0.52 36.2 8782 bc
3 6.67 4696 0.00 27.3 8715 bc
4 6.69 4443 31.36 33.3 8267 bc
5 6.74 3485 39.48 28.8 8278 bc
6 6.84 3643 3.60 44.1 7112 <NA>
7 6.92 5299 0.56 38.9 8781 bc
8 7.11 3472 33.57 17.3 6191 bc
9 7.33 3000 69.31 20.8 6162 bc
10 7.42 1890 72.24 23.2 8221 bc
tail(Prestige_Education_df, 10)
education income women prestige census type
93 15.08 8034 46.80 66.1 2733 prof
94 15.09 8258 25.65 72.6 2133 prof
95 15.21 10432 24.71 69.3 3151 prof
96 15.22 9593 34.89 58.3 2391 prof
97 15.44 14163 2.69 78.1 2141 prof
98 15.64 11030 5.13 77.6 2113 prof
99 15.77 19263 5.13 82.3 2343 prof
100 15.94 14558 4.32 66.7 3115 prof
101 15.96 25308 10.56 87.2 3111 prof
102 15.97 12480 19.59 84.6 2711 prof
ggplot(data = Prestige_Education_df, aes(x = education)) +
  geom_histogram(aes(fill = ..count..), binwidth = 0.25) +
  scale_fill_gradient("Count", low = "green", high = "brown") +
  labs(title = "Histogram - Years of Education", x = "Education", y = "Count")
Prestige_Income_df = arrange(Prestige, income)
head(Prestige_Income_df, 10)
education income women prestige census type
1 9.46 611 96.53 25.9 6147 <NA>
2 9.62 918 7.00 14.8 5143 <NA>
3 8.60 1656 27.75 21.5 7182 bc
4 7.42 1890 72.24 23.2 8221 bc
5 9.93 2370 3.69 23.3 5145 bc
6 10.64 2448 91.76 42.3 4133 wc
7 10.05 2594 67.82 26.5 5137 wc
8 6.38 2847 90.67 28.2 8563 bc
9 11.04 2901 92.86 38.7 4171 wc
10 7.33 3000 69.31 20.8 6162 bc
tail(Prestige_Income_df, 10)
education income women prestige census type
93 14.52 11377 1.03 73.1 2143 prof
94 13.11 12351 11.16 68.8 1113 prof
95 15.97 12480 19.59 84.6 2711 prof
96 12.27 14032 0.58 66.1 9111 prof
97 15.44 14163 2.69 78.1 2141 prof
98 15.94 14558 4.32 66.7 3115 prof
99 14.71 17498 6.91 68.4 3117 prof
100 15.77 19263 5.13 82.3 2343 prof
101 15.96 25308 10.56 87.2 3111 prof
102 12.26 25879 4.02 69.1 1130 prof
ggplot(data = Prestige_Income_df, aes(x = income)) +
  geom_histogram(aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "blue", high = "purple") +
  labs(title = "Histogram - Income", x = "Income", y = "Count")
# Plot matrix of all variables.
plot(Prestige_4Column_df, pch=16, col="red", main="Scatterplot of Income, Education, Women and Prestige")
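As a numeric complement to the scatterplot matrix, we can also look at the pairwise correlations between the four variables; a minimal sketch (output not reproduced here):

```r
# Pairwise Pearson correlations, rounded for readability
round(cor(Prestige_4Column_df), 2)
```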
In this section we will create a few multiple regression models and see which one best captures the relationship between the predictor variables (education, prestige, and women) and the response variable (income).
# Helper functions to pull rounded coefficients from a fitted lm model.
getSlope <- function(m) {
  # slope of the first predictor (second coefficient), rounded to 4 decimals
  slp = round(m$coefficients[2], 4)
  return (slp)
}

getIntercept <- function(m) {
  # intercept (first coefficient), rounded to 4 decimals
  int = round(m$coefficients[1], 4)
  return (int)
}
m1 = lm(income ~ education + prestige + women, data = Prestige)
summary(m1)
Call:
lm(formula = income ~ education + prestige + women, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-7715.3 -929.7 -231.2 689.7 14391.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -253.850 1086.157 -0.234 0.816
education 177.199 187.632 0.944 0.347
prestige 141.435 29.910 4.729 7.58e-06 ***
women -50.896 8.556 -5.948 4.19e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2575 on 98 degrees of freedom
Multiple R-squared: 0.6432, Adjusted R-squared: 0.6323
F-statistic: 58.89 on 3 and 98 DF, p-value: < 2.2e-16
\[ \hat{income} = -253.8497 + (177.199 * education) + (141.4354 * prestige) + (-50.8957 * women) \]
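The helper functions defined earlier can pull these rounded estimates straight from the fitted model:

```r
getIntercept(m1)  # -253.8497, the intercept in the equation above
getSlope(m1)      # 177.199, the education coefficient
```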
m2 = lm(income ~ education + I(prestige^2) + women, data = Prestige)
summary(m2)
Call:
lm(formula = income ~ education + I(prestige^2) + women, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-8024.7 -792.2 -64.2 836.0 14130.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3601.5240 1406.5994 2.560 0.012 *
education 57.1882 190.6623 0.300 0.765
I(prestige^2) 1.5999 0.3011 5.313 6.75e-07 ***
women -48.1680 8.4475 -5.702 1.25e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2514 on 98 degrees of freedom
Multiple R-squared: 0.6598, Adjusted R-squared: 0.6494
F-statistic: 63.35 on 3 and 98 DF, p-value: < 2.2e-16
\[ \hat{income} = 3601.524 + (57.1882 * education) + (1.5999 * prestige^2) + (-48.168 * women) \]
# Plot model residuals.
plot(m1, pch=16, which=1)
plot(m2, pch=16, which=1)
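Because the two models are not nested (one uses \(prestige\), the other \(prestige^2\)), a standard F-test comparison does not apply; an information criterion such as AIC is a convenient alternative. A minimal sketch (output not shown); the model with the lower AIC is preferred:

```r
AIC(m1, m2)
```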
| Model | Regression Equation | Multiple R-squared |
|---|---|---|
| Model-1 | income = -253.8497 + (177.199 * education) + (141.4354 * prestige) + (-50.8957 * women) | 0.6432 |
| Model-2 | income = 3601.524 + (57.1882 * education) + (1.5999 * prestige^2) + (-48.168 * women) | 0.6598 |
We’ve fit a couple of multiple linear regression models to the Prestige dataset, transforming a predictor to see its effect on the fit. With the quadratic term, the R-squared value improves (from 0.6432 to 0.6598) and the residuals tend to sit closer to zero, apart from a few outliers. We can therefore conclude that the response variable is predicted somewhat better when a quadratic term for \(prestige\) is used.
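As a quick illustration of using the preferred model, predicted income for a new observation can be obtained with predict(); the occupation profile below is hypothetical, made up purely for demonstration:

```r
# Hypothetical occupation: 12 years of education, prestige score of 50, 25% women
new_occupation = data.frame(education = 12, prestige = 50, women = 25)
predict(m2, newdata = new_occupation)
```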