The dataset used in this exercise is called “Prestige” and is included in the \(car\) library. It has 102 rows and 6 columns. Each row is an observation corresponding to an occupation, and the columns record predictors such as years of education, income, percentage of women in the occupation, and the prestige of the occupation.
The variables in the dataset are described as follows:

- \(education\): average years of education of the occupation’s incumbents (1971)
- \(income\): average income of incumbents, in dollars (1971)
- \(women\): percentage of incumbents who are women
- \(prestige\): Pineo-Porter prestige score for the occupation
- \(census\): Canadian Census occupational code
- \(type\): type of occupation, a factor with levels bc (blue collar), wc (white collar), and prof (professional, managerial, and technical)
For multiple regression, we’ll use more than one predictor. We will include \(women\) and \(prestige\) along with \(education\) as our predictor variables, and our response variable will be \(income\).
For multiple regression, we would like to fit the following model:
\[ \hat{income} = B_{0} + (B_{1} * education) + (B_{2} * prestige) + (B_{3} * women) \] The model will estimate the value of the intercept \(B_{0}\) and a slope for each predictor: \(B_{1}\) for education, \(B_{2}\) for prestige, and \(B_{3}\) for women. We will consider a couple of models and see which one is the better fit.
library(car)
library(ggplot2)
library(dplyr)
summary(Prestige)
education income women prestige census type
Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80 Min. :1113 bc :44
1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.:3120 prof:31
Median :10.540 Median : 5930 Median :13.600 Median :43.60 Median :5135 wc :23
Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83 Mean :5402 NA's: 4
3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27 3rd Qu.:8312
Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20 Max. :9517
colnames(Prestige)
[1] "education" "income" "women" "prestige" "census" "type"
ncol(Prestige)
[1] 6
nrow(Prestige)
[1] 102
head(Prestige, 20)
education income women prestige census type
gov.administrators 13.11 12351 11.16 68.8 1113 prof
general.managers 12.26 25879 4.02 69.1 1130 prof
accountants 12.77 9271 15.70 63.4 1171 prof
purchasing.officers 11.42 8865 9.11 56.8 1175 prof
chemists 14.62 8403 11.68 73.5 2111 prof
physicists 15.64 11030 5.13 77.6 2113 prof
biologists 15.09 8258 25.65 72.6 2133 prof
architects 15.44 14163 2.69 78.1 2141 prof
civil.engineers 14.52 11377 1.03 73.1 2143 prof
mining.engineers 14.64 11023 0.94 68.8 2153 prof
surveyors 12.39 5902 1.91 62.0 2161 prof
draughtsmen 12.30 7059 7.83 60.0 2163 prof
computer.programers 13.83 8425 15.33 53.8 2183 prof
economists 14.44 8049 57.31 62.2 2311 prof
psychologists 14.36 7405 48.28 74.9 2315 prof
social.workers 14.21 6336 54.77 55.1 2331 prof
lawyers 15.77 19263 5.13 82.3 2343 prof
librarians 14.15 6112 77.10 58.1 2351 prof
vocational.counsellors 15.22 9593 34.89 58.3 2391 prof
ministers 14.50 4686 4.14 72.8 2511 prof
Prestige_4Column_df = Prestige[,c(1:4)]
summary(Prestige_4Column_df)
education income women prestige
Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
Median :10.540 Median : 5930 Median :13.600 Median :43.60
Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
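Since dplyr is already loaded, the same four-column subset could equivalently be produced with select(); a minor alternative shown only for reference (note that dplyr verbs may drop the data frame’s row names):

```r
Prestige_4Column_df = select(Prestige, education, income, women, prestige)
```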
Prestige_Education_df = arrange(Prestige, education)
head(Prestige_Education_df, 10)
education income women prestige census type
1 6.38 2847 90.67 28.2 8563 bc
2 6.60 5959 0.52 36.2 8782 bc
3 6.67 4696 0.00 27.3 8715 bc
4 6.69 4443 31.36 33.3 8267 bc
5 6.74 3485 39.48 28.8 8278 bc
6 6.84 3643 3.60 44.1 7112 <NA>
7 6.92 5299 0.56 38.9 8781 bc
8 7.11 3472 33.57 17.3 6191 bc
9 7.33 3000 69.31 20.8 6162 bc
10 7.42 1890 72.24 23.2 8221 bc
tail(Prestige_Education_df, 10)
education income women prestige census type
93 15.08 8034 46.80 66.1 2733 prof
94 15.09 8258 25.65 72.6 2133 prof
95 15.21 10432 24.71 69.3 3151 prof
96 15.22 9593 34.89 58.3 2391 prof
97 15.44 14163 2.69 78.1 2141 prof
98 15.64 11030 5.13 77.6 2113 prof
99 15.77 19263 5.13 82.3 2343 prof
100 15.94 14558 4.32 66.7 3115 prof
101 15.96 25308 10.56 87.2 3111 prof
102 15.97 12480 19.59 84.6 2711 prof
ggplot(data = Prestige_Education_df, aes(x = education)) +
  geom_histogram(aes(fill = ..count..), binwidth = 0.25) +
  scale_fill_gradient("Count", low = "green", high = "brown") +
  labs(title = "Histogram - Years of Education", x = "Education", y = "Count")
Prestige_Income_df = arrange(Prestige, income)
head(Prestige_Income_df, 10)
education income women prestige census type
1 9.46 611 96.53 25.9 6147 <NA>
2 9.62 918 7.00 14.8 5143 <NA>
3 8.60 1656 27.75 21.5 7182 bc
4 7.42 1890 72.24 23.2 8221 bc
5 9.93 2370 3.69 23.3 5145 bc
6 10.64 2448 91.76 42.3 4133 wc
7 10.05 2594 67.82 26.5 5137 wc
8 6.38 2847 90.67 28.2 8563 bc
9 11.04 2901 92.86 38.7 4171 wc
10 7.33 3000 69.31 20.8 6162 bc
tail(Prestige_Income_df, 10)
education income women prestige census type
93 14.52 11377 1.03 73.1 2143 prof
94 13.11 12351 11.16 68.8 1113 prof
95 15.97 12480 19.59 84.6 2711 prof
96 12.27 14032 0.58 66.1 9111 prof
97 15.44 14163 2.69 78.1 2141 prof
98 15.94 14558 4.32 66.7 3115 prof
99 14.71 17498 6.91 68.4 3117 prof
100 15.77 19263 5.13 82.3 2343 prof
101 15.96 25308 10.56 87.2 3111 prof
102 12.26 25879 4.02 69.1 1130 prof
ggplot(data = Prestige_Income_df, aes(x = income)) +
  geom_histogram(aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "blue", high = "purple") +
  labs(title = "Histogram - Income", x = "Income", y = "Count")
# Plot matrix of all variables.
plot(Prestige_4Column_df, pch=16, col="red", main="Scatterplot of Income, Education, Women and Prestige")
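As a numeric complement to the scatterplot matrix, we can also look at the pairwise correlations between the four variables; a minimal sketch (output not reproduced here):

```r
# Pairwise Pearson correlations, rounded for readability
round(cor(Prestige_4Column_df), 2)
```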
In this section we will create a few multiple regression models and see which one best captures the relationship between the predictor variables (education, prestige, and women) and the response variable (income).
# Helper functions to pull rounded coefficients from a fitted lm model.
getSlope <- function(m) {
  # slope of the first predictor (second coefficient), rounded to 4 decimals
  slp = round(m$coefficients[2], 4)
  return (slp)
}

getIntercept <- function(m) {
  # intercept (first coefficient), rounded to 4 decimals
  int = round(m$coefficients[1], 4)
  return (int)
}
m1 = lm(income ~ education + prestige + women, data = Prestige)
summary(m1)
Call:
lm(formula = income ~ education + prestige + women, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-7715.3 -929.7 -231.2 689.7 14391.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -253.850 1086.157 -0.234 0.816
education 177.199 187.632 0.944 0.347
prestige 141.435 29.910 4.729 7.58e-06 ***
women -50.896 8.556 -5.948 4.19e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2575 on 98 degrees of freedom
Multiple R-squared: 0.6432, Adjusted R-squared: 0.6323
F-statistic: 58.89 on 3 and 98 DF, p-value: < 2.2e-16
\[ \hat{income} = -253.8497 + (177.199 * education) + (141.4354 * prestige) + (-50.8957 * women) \]
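The helper functions defined earlier can pull these rounded estimates straight from the fitted model:

```r
getIntercept(m1)  # -253.8497, the intercept in the equation above
getSlope(m1)      # 177.199, the education coefficient
```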
m2 = lm(income ~ education + I(prestige^2) + women, data = Prestige)
summary(m2)
Call:
lm(formula = income ~ education + I(prestige^2) + women, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-8024.7 -792.2 -64.2 836.0 14130.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3601.5240 1406.5994 2.560 0.012 *
education 57.1882 190.6623 0.300 0.765
I(prestige^2) 1.5999 0.3011 5.313 6.75e-07 ***
women -48.1680 8.4475 -5.702 1.25e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2514 on 98 degrees of freedom
Multiple R-squared: 0.6598, Adjusted R-squared: 0.6494
F-statistic: 63.35 on 3 and 98 DF, p-value: < 2.2e-16
\[ \hat{income} = 3601.524 + (57.1882 * education) + (1.5999 * prestige^2) + (-48.168 * women) \]
# Plot model residuals.
plot(m1, pch=16, which=1)
plot(m2, pch=16, which=1)
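Because the two models are not nested (one uses \(prestige\), the other \(prestige^2\)), a standard F-test comparison does not apply; an information criterion such as AIC is a convenient alternative. A minimal sketch (output not shown); the model with the lower AIC is preferred:

```r
AIC(m1, m2)
```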
| Model | Regression Equation | Multiple R-squared |
|---|---|---|
| Model-1 | income = -253.8497 + (177.199 * education) + (141.4354 * prestige) + (-50.8957 * women) | 0.6432 |
| Model-2 | income = 3601.524 + (57.1882 * education) + (1.5999 * prestige^2) + (-48.168 * women) | 0.6598 |
We’ve fit a couple of multiple linear regression models to the Prestige dataset, transforming a predictor to see its effect on the fit. With the quadratic term, the R-squared value improves (from 0.6432 to 0.6598) and the residuals tend to sit closer to zero, apart from a few outliers. We can therefore conclude that the response variable is predicted somewhat better when a quadratic term for \(prestige\) is used.
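As a quick illustration of using the preferred model, predicted income for a new observation can be obtained with predict(); the occupation profile below is hypothetical, made up purely for demonstration:

```r
# Hypothetical occupation: 12 years of education, prestige score of 50, 25% women
new_occupation = data.frame(education = 12, prestige = 50, women = 25)
predict(m2, newdata = new_occupation)
```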