Linear Regression

Introduction

This tutorial walks step by step through a bivariate linear regression analysis for an introductory statistics class.

Dataset

This analysis uses the Prestige dataset, available from the {car} package in R and described in Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage. The Prestige dataset has 102 rows and 6 variables. Each observation is an occupation.
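A minimal sketch of loading the data and confirming its size, assuming the car package is already installed (in recent releases the data frame itself is supplied by the companion carData package, which car attaches automatically):

```r
# Assumption: car is installed, e.g. via install.packages("car")
library(car)

# Prestige is a data frame with occupations as row names
prestige_dim <- dim(Prestige)
print(prestige_dim)   # rows, then columns

# Column names and types at a glance
str(Prestige)
```

If the dimensions differ from the 102 rows and 6 variables described above, the installed package version may ship a different copy of the data.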

Variable Definitions

This data frame contains the following columns:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation. bc - Blue Collar; prof - Professional, Managerial, and Technical; wc - White Collar.

First 10 rows of the dataset

                    education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof
biologists              15.09   8258 25.65     72.6   2133 prof
architects              15.44  14163  2.69     78.1   2141 prof
civil.engineers         14.52  11377  1.03     73.1   2143 prof
mining.engineers        14.64  11023  0.94     68.8   2153 prof
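A listing like the one above can be produced with head(); this is a sketch that assumes the car package is installed, and first_rows is just an illustrative name:

```r
library(car)   # assumption: car is installed and provides Prestige

# Keep the first 10 occupations and print them
first_rows <- head(Prestige, 10)
print(first_rows)
```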

Examine the data before fitting models

   education          income          women           prestige    
 Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
 1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
 Median :10.540   Median : 5930   Median :13.600   Median :43.60  
 Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
 3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
 Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
     census       type   
 Min.   :1113   bc  :44  
 1st Qu.:3120   prof:31  
 Median :5135   wc  :23  
 Mean   :5402   NA's: 4  
 3rd Qu.:8312            
 Max.   :9517

Notice that the ‘type’ variable has 4 missing values.
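The summary and the missing-value count can be checked as follows; a sketch assuming the car package is installed, with missing_by_column as an illustrative name:

```r
library(car)   # assumption: car is installed and provides Prestige

# Five-number summaries for numeric columns, level counts for the factor
summary(Prestige)

# Count missing values in every column; only 'type' should be non-zero
missing_by_column <- colSums(is.na(Prestige))
print(missing_by_column)

# The occupations whose type is missing
rownames(Prestige)[is.na(Prestige$type)]
```

Knowing which rows are incomplete matters later: any model that includes 'type' will silently drop them unless they are handled explicitly.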

Correlation Matrix

           prestige education    income
prestige  1.0000000 0.8501769 0.7149057
education 0.8501769 1.0000000 0.5775802
income    0.7149057 0.5775802 1.0000000

Let’s focus on the three variables of interest: ‘prestige’, ‘education’, and ‘income’. Prestige is positively correlated with both education and income, but more strongly with education (r = 0.85) than with income (r = 0.71).
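The correlation matrix for these three variables can be computed with cor(); a sketch assuming the car package is installed, where vars and cor_mat are illustrative names:

```r
library(car)   # assumption: car is installed and provides Prestige

# Restrict to the three variables of interest
vars <- c("prestige", "education", "income")

# Pearson correlations (the default method for cor)
cor_mat <- cor(Prestige[, vars])
print(cor_mat)
```

None of these three columns has missing values, so the default handling in cor() is sufficient here.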

Plot the data before fitting models

Plot the data to check for outliers, non-linear relationships, and similar problems.

As expected, the relationship between education and prestige appears to be more linear than that between income and prestige.
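One way to draw such plots is sketched below with base R graphics, assuming the car package is installed. The lowess smoother is added here only as a visual aid for judging linearity and is not necessarily what the original figures used; sm_edu and sm_inc are illustrative names.

```r
library(car)   # assumption: car is installed and provides Prestige

# Smooth curves that follow the local trend in each scatterplot
sm_edu <- lowess(Prestige$education, Prestige$prestige)
sm_inc <- lowess(Prestige$income, Prestige$prestige)

# Two panels side by side; restore the old settings afterwards
op <- par(mfrow = c(1, 2))

plot(prestige ~ education, data = Prestige,
     main = "Prestige vs. education")
lines(sm_edu, col = "blue")

plot(prestige ~ income, data = Prestige,
     main = "Prestige vs. income")
lines(sm_inc, col = "blue")

par(op)
```

A smoother that stays close to a straight line supports a linear fit, while a visible bend suggests a non-linear relationship.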