This tutorial shows step by step process to conduct a bi-variate linear regression analysis for an introductory statistics class.
The “Prestige (available from {car} package in R)” dataset is used in this analysis, which was obtained from Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage. The Prestige data set has 102 rows and 6 variables. The observations are occupations.
This data frame contains the following columns:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation. bc - Blue Collar; prof - Professional, Managerial, and Technical; wc - White Collar.
education income women prestige census type
gov.administrators 13.11 12351 11.16 68.8 1113 prof
general.managers 12.26 25879 4.02 69.1 1130 prof
accountants 12.77 9271 15.70 63.4 1171 prof
purchasing.officers 11.42 8865 9.11 56.8 1175 prof
chemists 14.62 8403 11.68 73.5 2111 prof
physicists 15.64 11030 5.13 77.6 2113 prof
biologists 15.09 8258 25.65 72.6 2133 prof
architects 15.44 14163 2.69 78.1 2141 prof
civil.engineers 14.52 11377 1.03 73.1 2143 prof
mining.engineers 14.64 11023 0.94 68.8 2153 prof
education income women prestige
Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
Median :10.540 Median : 5930 Median :13.600 Median :43.60
Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
census type
Min. :1113 bc :44
1st Qu.:3120 prof:31
Median :5135 wc :23
Mean :5402 NA's: 4
3rd Qu.:8312
Max. :9517
Notice that the ‘type’ variable has 4 missing values.
prestige education income
prestige 1.0000000 0.8501769 0.7149057
education 0.8501769 1.0000000 0.5775802
income 0.7149057 0.5775802 1.0000000
Let’s focus on three variables of our interest - ‘prestige’, ‘education’, and ‘income’. While we observe that prestige is positively correlated with both education and income, education appears to be correlated more strongly than income with prestige.
Plot the data to look for outliers, non-linear relationships etc.
As expected, the relationship between education and prestige appears to be more linear than that between income and prestige.
Next, we fit response variable ‘prestige’ with explanatory variables ‘education’ and ‘income’ separately and find out which variable is the stronger predictor.
\(prestige = b_{0} + b_{1} * education + e\)
Call:
lm(formula = prestige ~ education, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-26.0397 -6.5228 0.6611 6.7430 18.1636
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.732 3.677 -2.919 0.00434 **
education 5.361 0.332 16.148 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.103 on 100 degrees of freedom
Multiple R-squared: 0.7228, Adjusted R-squared: 0.72
F-statistic: 260.8 on 1 and 100 DF, p-value: < 2.2e-16
95% confidence interval of the estimated coefficients
2.5 % 97.5 %
(Intercept) -18.027220 -3.436744
education 4.702223 6.019533
The education variable appears to be highly significant. The \(R^{2}\) suggests that the model explains 72% of the variability of the response variable ‘prestige’.
Test stat Pr(>|t|)
education 2.596 0.011
Tukey test 2.596 0.009
We observe that there is no linear relationship between the fitted values of ‘prestige’ and the residuals. Also, the Q-Q plot suggests that the residual is normally distributed (approximately).
\(prestige = b_{0} + b_{1}*log(income) + e\)
Call:
lm(formula = prestige ~ log(income), data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-19.114 -9.342 -1.218 8.101 30.454
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -139.856 16.954 -8.249 6.6e-13 ***
log(income) 21.556 1.953 11.037 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.61 on 100 degrees of freedom
Multiple R-squared: 0.5492, Adjusted R-squared: 0.5447
F-statistic: 121.8 on 1 and 100 DF, p-value: < 2.2e-16
95% confidence interval of the estimated coefficients
2.5 % 97.5 %
(Intercept) -173.49239 -106.21906
log(income) 17.68139 25.43134
Comparing \(R^{2}\), we can conclude that education is a better predictor of prestige than income.
Test stat Pr(>|t|)
log(income) 2.694 0.008
Tukey test 2.694 0.007
Do we get a better fit if we use both education and income in the model as predictors of prestige?
\(prestige = b_{0} + b_{1}*education + b_{2}*log(income) + e\)
Call:
lm(formula = prestige ~ education + log(income), data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-17.0346 -4.5657 -0.1857 4.0577 18.1270
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -95.1940 10.9979 -8.656 9.27e-14 ***
education 4.0020 0.3115 12.846 < 2e-16 ***
log(income) 11.4375 1.4371 7.959 2.94e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.145 on 99 degrees of freedom
Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16
2.5 % 97.5 %
(Intercept) -117.016298 -73.371676
education 3.383824 4.620081
log(income) 8.585936 14.288979
Test stat Pr(>|t|)
education 1.615 0.109
log(income) 0.221 0.825
Tukey test 0.653 0.514
Analysis of Variance Table
Model 1: prestige ~ education
Model 2: prestige ~ education + log(income)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 100 8287.0
2 99 5053.6 1 3233.4 63.341 2.942e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this multivariate model, we find that \(R^{2}\) is further improved. Also, the residual sum of squares is significantly lower than the education and income models. This suggests that combination of education and income is a better predictor of prestige than the bi-variate models.