Regression-1

Using R, build a multiple regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

(1) Dataset:

The dataset used in this exercise is called “Prestige” and is included in the \(car\) package. It has 102 rows and 6 columns; each row is an observation corresponding to an occupation. The columns record variables such as average years of education, average income, the percentage of women in the occupation, and a prestige score for the occupation.

library(car)
Warning: package 'car' was built under R version 3.3.2
library(ggplot2)
Warning: package 'ggplot2' was built under R version 3.3.2
library(dplyr)
summary(Prestige)
   education          income          women           prestige         census       type   
 Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80   Min.   :1113   bc  :44  
 1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23   1st Qu.:3120   prof:31  
 Median :10.540   Median : 5930   Median :13.600   Median :43.60   Median :5135   wc  :23  
 Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83   Mean   :5402   NA's: 4  
 3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27   3rd Qu.:8312            
 Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20   Max.   :9517            
colnames(Prestige)
[1] "education" "income"    "women"     "prestige"  "census"    "type"     
ncol(Prestige)
[1] 6
nrow(Prestige)
[1] 102
head(Prestige, 20)
                       education income women prestige census type
gov.administrators         13.11  12351 11.16     68.8   1113 prof
general.managers           12.26  25879  4.02     69.1   1130 prof
accountants                12.77   9271 15.70     63.4   1171 prof
purchasing.officers        11.42   8865  9.11     56.8   1175 prof
chemists                   14.62   8403 11.68     73.5   2111 prof
physicists                 15.64  11030  5.13     77.6   2113 prof
biologists                 15.09   8258 25.65     72.6   2133 prof
architects                 15.44  14163  2.69     78.1   2141 prof
civil.engineers            14.52  11377  1.03     73.1   2143 prof
mining.engineers           14.64  11023  0.94     68.8   2153 prof
surveyors                  12.39   5902  1.91     62.0   2161 prof
draughtsmen                12.30   7059  7.83     60.0   2163 prof
computer.programers        13.83   8425 15.33     53.8   2183 prof
economists                 14.44   8049 57.31     62.2   2311 prof
psychologists              14.36   7405 48.28     74.9   2315 prof
social.workers             14.21   6336 54.77     55.1   2331 prof
lawyers                    15.77  19263  5.13     82.3   2343 prof
librarians                 14.15   6112 77.10     58.1   2351 prof
vocational.counsellors     15.22   9593 34.89     58.3   2391 prof
ministers                  14.50   4686  4.14     72.8   2511 prof
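
The summary above reports 4 missing values in the type column (the other variables are complete). The regression below does not use type, so nothing needs to be removed, but a quick check like the following sketch can confirm which rows are affected; Prestige_complete is just an illustrative name.
# Occupations with a missing type (the summary reports 4 NA's)
Prestige[is.na(Prestige$type), ]

# If type were needed in a later model, those rows could be dropped
Prestige_complete = Prestige[!is.na(Prestige$type), ]
nrow(Prestige_complete)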

(2) Data Visualization:

Prestige_Education_df = arrange(Prestige, education)
head(Prestige_Education_df, 10)
   education income women prestige census type
1       6.38   2847 90.67     28.2   8563   bc
2       6.60   5959  0.52     36.2   8782   bc
3       6.67   4696  0.00     27.3   8715   bc
4       6.69   4443 31.36     33.3   8267   bc
5       6.74   3485 39.48     28.8   8278   bc
6       6.84   3643  3.60     44.1   7112 <NA>
7       6.92   5299  0.56     38.9   8781   bc
8       7.11   3472 33.57     17.3   6191   bc
9       7.33   3000 69.31     20.8   6162   bc
10      7.42   1890 72.24     23.2   8221   bc
tail(Prestige_Education_df, 10)
    education income women prestige census type
93      15.08   8034 46.80     66.1   2733 prof
94      15.09   8258 25.65     72.6   2133 prof
95      15.21  10432 24.71     69.3   3151 prof
96      15.22   9593 34.89     58.3   2391 prof
97      15.44  14163  2.69     78.1   2141 prof
98      15.64  11030  5.13     77.6   2113 prof
99      15.77  19263  5.13     82.3   2343 prof
100     15.94  14558  4.32     66.7   3115 prof
101     15.96  25308 10.56     87.2   3111 prof
102     15.97  12480 19.59     84.6   2711 prof
ggplot(data=Prestige_Education_df, aes(x = education)) + 
  geom_histogram(aes(fill = ..count..), binwidth=0.25) +
  scale_fill_gradient("Count", low = "green", high = "brown") +
  labs(title = "Histogram - Years of Education") +
  labs(x = "Education") +
  labs(y = "Count")

Prestige_Income_df = arrange(Prestige, income)
head(Prestige_Income_df, 10)
   education income women prestige census type
1       9.46    611 96.53     25.9   6147 <NA>
2       9.62    918  7.00     14.8   5143 <NA>
3       8.60   1656 27.75     21.5   7182   bc
4       7.42   1890 72.24     23.2   8221   bc
5       9.93   2370  3.69     23.3   5145   bc
6      10.64   2448 91.76     42.3   4133   wc
7      10.05   2594 67.82     26.5   5137   wc
8       6.38   2847 90.67     28.2   8563   bc
9      11.04   2901 92.86     38.7   4171   wc
10      7.33   3000 69.31     20.8   6162   bc
tail(Prestige_Income_df, 10)
    education income women prestige census type
93      14.52  11377  1.03     73.1   2143 prof
94      13.11  12351 11.16     68.8   1113 prof
95      15.97  12480 19.59     84.6   2711 prof
96      12.27  14032  0.58     66.1   9111 prof
97      15.44  14163  2.69     78.1   2141 prof
98      15.94  14558  4.32     66.7   3115 prof
99      14.71  17498  6.91     68.4   3117 prof
100     15.77  19263  5.13     82.3   2343 prof
101     15.96  25308 10.56     87.2   3111 prof
102     12.26  25879  4.02     69.1   1130 prof
ggplot(data=Prestige_Income_df, aes(x = income)) + 
  geom_histogram(aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "blue", high = "purple") +
  labs(title = "Histogram - Income") +
  labs(x = "Income") +
  labs(y = "Count")

(3) Statistical Analysis:

In this section we create a simple linear regression model and calculate the correlation between education and income to see whether there is a relationship between the two variables.

(3.1) Create a function to calculate the correlation and round it to 4 decimal digits
findCorrelation <- function() {
  x = Prestige$education
  y = Prestige$income
  corr = round(cor(x, y),4)
  print (paste0("Correlation = ",corr))
  return (corr)
}
c = findCorrelation()
[1] "Correlation = 0.5776"
(3.2) Create a function for Linear Model
findStatsFunction <- function() {
  # Fit a simple linear regression of income on education
  m = lm (Prestige$income ~ Prestige$education, data = Prestige)
  s = summary(m)
  print(s)

  # Rounded slope and intercept, reported in the fitted equation below
  slp = round(m$coefficients[2], 4)
  int = round(m$coefficients[1], 4)

  return (m)
}
m = findStatsFunction()

Call:
lm(formula = Prestige$income ~ Prestige$education, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-5493.2 -2433.8   -41.9  1491.5 17713.1 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -2853.6     1407.0  -2.028   0.0452 *  
Prestige$education    898.8      127.0   7.075 2.08e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3483 on 100 degrees of freedom
Multiple R-squared:  0.3336,    Adjusted R-squared:  0.3269 
F-statistic: 50.06 on 1 and 100 DF,  p-value: 2.079e-10

\[ \widehat{\text{income}} = -2853.5856 + 898.8128 \times \text{education} \]
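
To make the fitted equation concrete, the coefficients stored in the returned model object can be used to compute a predicted average income for a chosen number of years of education. This is only an illustrative sketch; the 12-year input is an arbitrary example value.
# Predicted average income for an occupation requiring 12 years of education
coef(m)[1] + coef(m)[2] * 12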

(3.3) Display the Linear Model
ggplot(Prestige, aes(education, income)) + geom_point(colour="blue", size=2) + 
    geom_abline(slope = round(m$coefficients[2], 4), intercept = round(m$coefficients[1], 4)) +
    labs(title = "Education vs Income") +
    xlab("Education") + 
    ylab("Income")

(3.4) Residual Analysis
ggplot(m, aes(.fitted, .resid)) + 
  geom_point(color = "brown", size=2) +
  labs(title = "Fitted Values vs Residuals") +
  labs(x = "Fitted Values") +
  labs(y = "Residuals")

(3.5) Regression Statistics
Linear Regression Equation:  Income = -2853.5856 + (898.8128 * Education)
Correlation Coefficient:     0.5776
Multiple R-Square:           0.3336
R-Square:                    0.3336
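
The same figures can be pulled directly from the fitted model object instead of being typed in by hand, which avoids transcription errors; a small sketch:
round(coef(m), 4)                                     # intercept and slope
round(cor(Prestige$education, Prestige$income), 4)    # correlation coefficient
round(summary(m)$r.squared, 4)                        # R-squared
round(summary(m)$adj.r.squared, 4)                    # adjusted R-squared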


(4) Summary:

We observe that education and income have a correlation of 0.5776. In the residual plot, the spread of the residuals stays roughly constant as we move to the right along the fitted values, and the residuals are scattered fairly evenly above and below zero, apart from a few outliers. Overall, the plot indicates that using education as the predictor of income in the regression model explains the data reasonably well. Therefore, we can conclude that the linear model is appropriate in this case.