QUESTION

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

ANSWER

Dataset

The dataset used in this exercise is called “Prestige” and is included in the car package. It has 102 rows and 6 columns. Each row is an observation for one occupation. The columns record variables such as years of education, income, percentage of women in the occupation, and prestige of the occupation.

##    education          income          women           prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3120   prof:31  
##  Median :5135   wc  :23  
##  Mean   :5402   NA's: 4  
##  3rd Qu.:8312            
##  Max.   :9517
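
A summary like the one above can be produced with a couple of lines. This is a minimal sketch, assuming a recent version of car in which the data ships in the companion carData package (loading car attaches it as well); the name `prestige_summary` is only illustrative.

```r
# Load the Prestige data; assumes the carData package is installed
# (it is installed alongside car).
data("Prestige", package = "carData")

# Quartiles and means for the numeric columns, level counts
# (including NA's) for the factor column `type`.
prestige_summary <- summary(Prestige)
print(prestige_summary)
```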

Dataset Exploration

## [1] "Number of Columns of the data: 6"
## [1] "Number of Rows of the data: 102"
## [1] "Columns of the data: education" "Columns of the data: income"   
## [3] "Columns of the data: women"     "Columns of the data: prestige" 
## [5] "Columns of the data: census"    "Columns of the data: type"
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
## 'data.frame':    102 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
## Observations: 102
## Variables: 6
## $ education <dbl> 13.11, 12.26, 12.77, 11.42, 14.62, 15.64, 15.09, 15.44, 1...
## $ income    <int> 12351, 25879, 9271, 8865, 8403, 11030, 8258, 14163, 11377...
## $ women     <dbl> 11.16, 4.02, 15.70, 9.11, 11.68, 5.13, 25.65, 2.69, 1.03,...
## $ prestige  <dbl> 68.8, 69.1, 63.4, 56.8, 73.5, 77.6, 72.6, 78.1, 73.1, 68....
## $ census    <int> 1113, 1130, 1171, 1175, 2111, 2113, 2133, 2141, 2143, 215...
## $ type      <fct> prof, prof, prof, prof, prof, prof, prof, prof, prof, pro...
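
The exploration output above could be generated roughly as follows. This sketch uses base R only; the last block of output (with `<dbl>` and `<fct>` tags) looks like `dplyr::glimpse()`, which is an assumption on my part and is left out here to avoid an extra dependency.

```r
# Assumes the carData package is installed (it comes with car).
data("Prestige", package = "carData")

# Dimensions and column names.
print(paste("Number of Columns of the data:", ncol(Prestige)))
print(paste("Number of Rows of the data:", nrow(Prestige)))
print(paste("Columns of the data:", names(Prestige)))

# First six rows, then the type of each column.
print(head(Prestige))
str(Prestige)
```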

Checking missing values in the dataset (percentage missing per column)

## education    income     women  prestige    census      type 
##  0.000000  0.000000  0.000000  0.000000  0.000000  3.921569
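
The percentages above (4 missing values out of 102 rows is about 3.92% for `type`) can be computed in one step. A minimal sketch; the variable name `missing_pct` is illustrative.

```r
# Assumes the carData package is installed (it comes with car).
data("Prestige", package = "carData")

# is.na() returns a logical matrix; the mean of each column is the
# fraction of missing values, so multiplying by 100 gives a percentage.
missing_pct <- colMeans(is.na(Prestige)) * 100
print(missing_pct)
```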

Data Visualization

Education

##    education income women prestige census type
## 1       6.38   2847 90.67     28.2   8563   bc
## 2       6.60   5959  0.52     36.2   8782   bc
## 3       6.67   4696  0.00     27.3   8715   bc
## 4       6.69   4443 31.36     33.3   8267   bc
## 5       6.74   3485 39.48     28.8   8278   bc
## 6       6.84   3643  3.60     44.1   7112 <NA>
## 7       6.92   5299  0.56     38.9   8781   bc
## 8       7.11   3472 33.57     17.3   6191   bc
## 9       7.33   3000 69.31     20.8   6162   bc
## 10      7.42   1890 72.24     23.2   8221   bc
##     education income women prestige census type
## 93      15.08   8034 46.80     66.1   2733 prof
## 94      15.09   8258 25.65     72.6   2133 prof
## 95      15.21  10432 24.71     69.3   3151 prof
## 96      15.22   9593 34.89     58.3   2391 prof
## 97      15.44  14163  2.69     78.1   2141 prof
## 98      15.64  11030  5.13     77.6   2113 prof
## 99      15.77  19263  5.13     82.3   2343 prof
## 100     15.94  14558  4.32     66.7   3115 prof
## 101     15.96  25308 10.56     87.2   3111 prof
## 102     15.97  12480 19.59     84.6   2711 prof

Income

##    education income women prestige census type
## 1       9.46    611 96.53     25.9   6147 <NA>
## 2       9.62    918  7.00     14.8   5143 <NA>
## 3       8.60   1656 27.75     21.5   7182   bc
## 4       7.42   1890 72.24     23.2   8221   bc
## 5       9.93   2370  3.69     23.3   5145   bc
## 6      10.64   2448 91.76     42.3   4133   wc
## 7      10.05   2594 67.82     26.5   5137   wc
## 8       6.38   2847 90.67     28.2   8563   bc
## 9      11.04   2901 92.86     38.7   4171   wc
## 10      7.33   3000 69.31     20.8   6162   bc
##     education income women prestige census type
## 93      14.52  11377  1.03     73.1   2143 prof
## 94      13.11  12351 11.16     68.8   1113 prof
## 95      15.97  12480 19.59     84.6   2711 prof
## 96      12.27  14032  0.58     66.1   9111 prof
## 97      15.44  14163  2.69     78.1   2141 prof
## 98      15.94  14558  4.32     66.7   3115 prof
## 99      14.71  17498  6.91     68.4   3117 prof
## 100     15.77  19263  5.13     82.3   2343 prof
## 101     15.96  25308 10.56     87.2   3111 prof
## 102     12.26  25879  4.02     69.1   1130 prof
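
The two listings above show the ten lowest and ten highest occupations by education and by income. One way to get them in base R is sketched below. Note that base R keeps the occupation names as row names, whereas the output above shows plain row numbers, so the original code probably used a different sorting function.

```r
# Assumes the carData package is installed (it comes with car).
data("Prestige", package = "carData")

# order() returns the row indices that sort a column in ascending order.
by_education <- Prestige[order(Prestige$education), ]
by_income <- Prestige[order(Prestige$income), ]

# Ten lowest and ten highest occupations for each variable.
print(head(by_education, 10))
print(tail(by_education, 10))
print(head(by_income, 10))
print(tail(by_income, 10))
```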

Statistical Analysis

In this section we fit a simple linear regression of income on education and compute the correlation between the two variables to see whether they are related.

Create a function for Linear Model

## 
## Call:
## lm(formula = Prestige$income ~ Prestige$education, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5493.2 -2433.8   -41.9  1491.5 17713.1 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2853.6     1407.0  -2.028   0.0452 *  
## Prestige$education    898.8      127.0   7.075 2.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared:  0.3336, Adjusted R-squared:  0.3269 
## F-statistic: 50.06 on 1 and 100 DF,  p-value: 2.079e-10

income = -2853.5856 + 898.8128 * education
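
The fit above can be reproduced with the sketch below. It writes the formula with bare column names (`income ~ education`) rather than `Prestige$income ~ Prestige$education`; the estimates are the same, and the bare form also lets `predict()` work with new data. The names `income_fit` and `coefs` are illustrative.

```r
# Assumes the carData package is installed (it comes with car).
data("Prestige", package = "carData")

# Simple linear regression of income on years of education.
income_fit <- lm(income ~ education, data = Prestige)
print(summary(income_fit))

# Intercept and slope of the fitted line.
coefs <- coef(income_fit)
print(round(coefs, 4))
```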

Regression Statistics

| Linear Regression Equation | Correlation Coefficient | \(R^2\) |
|---|---|---|
| Income = -2853.5856 + (898.8128 * Education) | 0.5776 | 0.3336 |
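
For a regression with a single predictor, \(R^2\) is the square of the correlation coefficient, which is why 0.5776 squared gives 0.3336. The sketch below checks this numerically; the variable names are illustrative.

```r
# Assumes the carData package is installed (it comes with car).
data("Prestige", package = "carData")

# Pearson correlation between education and income.
r <- cor(Prestige$education, Prestige$income)

# R-squared from the fitted simple linear regression.
income_fit <- lm(income ~ education, data = Prestige)
r_squared <- summary(income_fit)$r.squared

# With one predictor, r^2 and R-squared agree.
print(c(correlation = r, r_squared = r_squared, r_times_r = r^2))
```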

Summary

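Education and income have a correlation of 0.5776, a moderate positive relationship, and the model explains about a third of the variation in income (\(R^2\) = 0.3336). In the residual plot, the spread of the residuals stays roughly constant as we move to the right, and the residuals are scattered fairly evenly above and below zero apart from a few outliers. Those outliers are large positive residuals (the maximum is 17713.1 against a minimum of -5493.2), that is, occupations whose incomes are far above what their education level predicts. Overall, the plot suggests that education is a useful linear predictor of income, so we conclude that the linear model is appropriate in this case, with the caveat that the high-income outliers deserve a closer look.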
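
The residual checks discussed in this summary can be sketched as follows. This is one common set of diagnostics (a residuals-versus-fitted plot and a normal Q-Q plot), not necessarily the exact plots used in the original analysis; the names `income_fit` and `res` are illustrative.

```r
# Assumes the carData package is installed (it comes with car).
data("Prestige", package = "carData")

income_fit <- lm(income ~ education, data = Prestige)
res <- resid(income_fit)

# Residuals vs fitted values: look for a constant spread around zero.
plot(fitted(income_fit), res,
     xlab = "Fitted income", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line indicate non-normal residuals.
qqnorm(res)
qqline(res)

# Numeric summary of the residuals, to compare with the model output.
print(summary(res))
```

A long upper tail in the Q-Q plot would correspond to the few large positive residuals reported in the model output.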