Regression-2

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

(1) Dataset:

The dataset used in this excercise is called “Prestige” and it is included in \(car\) library. It has 102 rows and 6 columns. Each row is an observation that relates to an occupation. The columns in the dataset relate to predicators such as years of education, income, percentage of women in the occupation, prestige of the occupation, etc.

The description of the variables in the dataset are as follows:
  • education: The average number of years of education for occupational incumbents.
  • income: The average income of occupational incumbents, in dollars.
  • women: The percentage of women in the occupation.
  • prestige: The average prestige rating for the occupation.
  • census: The code of the occupation used in the survey.
  • type: Professional and managerial (prof), white collar(wc), blue collar(bc), or missing(NA)

  • For multiple regression, we’ll use more than one predictor. We will include \(women\), \(prestige\) along with \(education\) as our list of predictor variables and our response variable will be \(income\).

    For multiple regression, we would like to solve the following equation:

    \[ \hat{income} = B_{0} + (B_{1} * education) + (B_{2} * prestige) + (B_{3} * women) \] The model will estimate the value of the intercept \(B_{0}\) and each predictor’s slope \(B_{1}\) for education, \(B_{2}\) for prestige, \(B_{3}\) for women. We will consider a couple of models and see which one is a good fit.

    library(car)
    Warning: package 'car' was built under R version 3.3.2
    library(ggplot2)
    Warning: package 'ggplot2' was built under R version 3.3.2
    library(dplyr)
    summary(Prestige)
       education          income          women           prestige         census       type   
     Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80   Min.   :1113   bc  :44  
     1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23   1st Qu.:3120   prof:31  
     Median :10.540   Median : 5930   Median :13.600   Median :43.60   Median :5135   wc  :23  
     Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83   Mean   :5402   NA's: 4  
     3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27   3rd Qu.:8312            
     Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20   Max.   :9517            
    colnames(Prestige)
    [1] "education" "income"    "women"     "prestige"  "census"    "type"     
    ncol(Prestige)
    [1] 6
    nrow(Prestige)
    [1] 102
    head(Prestige, 20)
                           education income women prestige census type
    gov.administrators         13.11  12351 11.16     68.8   1113 prof
    general.managers           12.26  25879  4.02     69.1   1130 prof
    accountants                12.77   9271 15.70     63.4   1171 prof
    purchasing.officers        11.42   8865  9.11     56.8   1175 prof
    chemists                   14.62   8403 11.68     73.5   2111 prof
    physicists                 15.64  11030  5.13     77.6   2113 prof
    biologists                 15.09   8258 25.65     72.6   2133 prof
    architects                 15.44  14163  2.69     78.1   2141 prof
    civil.engineers            14.52  11377  1.03     73.1   2143 prof
    mining.engineers           14.64  11023  0.94     68.8   2153 prof
    surveyors                  12.39   5902  1.91     62.0   2161 prof
    draughtsmen                12.30   7059  7.83     60.0   2163 prof
    computer.programers        13.83   8425 15.33     53.8   2183 prof
    economists                 14.44   8049 57.31     62.2   2311 prof
    psychologists              14.36   7405 48.28     74.9   2315 prof
    social.workers             14.21   6336 54.77     55.1   2331 prof
    lawyers                    15.77  19263  5.13     82.3   2343 prof
    librarians                 14.15   6112 77.10     58.1   2351 prof
    vocational.counsellors     15.22   9593 34.89     58.3   2391 prof
    ministers                  14.50   4686  4.14     72.8   2511 prof
    Prestige_4Column_df = Prestige[,c(1:4)]
    summary(Prestige_4Column_df)
       education          income          women           prestige    
     Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
     1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
     Median :10.540   Median : 5930   Median :13.600   Median :43.60  
     Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
     3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
     Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  

    (2) Data Visualization:

    Prestige_Education_df = arrange(Prestige, education)
    head(Prestige_Education_df, 10)
       education income women prestige census type
    1       6.38   2847 90.67     28.2   8563   bc
    2       6.60   5959  0.52     36.2   8782   bc
    3       6.67   4696  0.00     27.3   8715   bc
    4       6.69   4443 31.36     33.3   8267   bc
    5       6.74   3485 39.48     28.8   8278   bc
    6       6.84   3643  3.60     44.1   7112 <NA>
    7       6.92   5299  0.56     38.9   8781   bc
    8       7.11   3472 33.57     17.3   6191   bc
    9       7.33   3000 69.31     20.8   6162   bc
    10      7.42   1890 72.24     23.2   8221   bc
    tail(Prestige_Education_df, 10)
        education income women prestige census type
    93      15.08   8034 46.80     66.1   2733 prof
    94      15.09   8258 25.65     72.6   2133 prof
    95      15.21  10432 24.71     69.3   3151 prof
    96      15.22   9593 34.89     58.3   2391 prof
    97      15.44  14163  2.69     78.1   2141 prof
    98      15.64  11030  5.13     77.6   2113 prof
    99      15.77  19263  5.13     82.3   2343 prof
    100     15.94  14558  4.32     66.7   3115 prof
    101     15.96  25308 10.56     87.2   3111 prof
    102     15.97  12480 19.59     84.6   2711 prof
    ggplot(data=Prestige_Education_df, aes(Prestige_Education_df$education)) + 
      geom_histogram(aes(fill = ..count..), binwidth=0.25) +
      scale_fill_gradient("Count", low = "green", high = "brown") +
      labs(title = "Historgram - Years of Education") +
      labs(x = "Education") +
      labs(y = "Count")

    Prestige_Income_df = arrange(Prestige, income)
    head(Prestige_Income_df, 10)
       education income women prestige census type
    1       9.46    611 96.53     25.9   6147 <NA>
    2       9.62    918  7.00     14.8   5143 <NA>
    3       8.60   1656 27.75     21.5   7182   bc
    4       7.42   1890 72.24     23.2   8221   bc
    5       9.93   2370  3.69     23.3   5145   bc
    6      10.64   2448 91.76     42.3   4133   wc
    7      10.05   2594 67.82     26.5   5137   wc
    8       6.38   2847 90.67     28.2   8563   bc
    9      11.04   2901 92.86     38.7   4171   wc
    10      7.33   3000 69.31     20.8   6162   bc
    tail(Prestige_Income_df, 10)
        education income women prestige census type
    93      14.52  11377  1.03     73.1   2143 prof
    94      13.11  12351 11.16     68.8   1113 prof
    95      15.97  12480 19.59     84.6   2711 prof
    96      12.27  14032  0.58     66.1   9111 prof
    97      15.44  14163  2.69     78.1   2141 prof
    98      15.94  14558  4.32     66.7   3115 prof
    99      14.71  17498  6.91     68.4   3117 prof
    100     15.77  19263  5.13     82.3   2343 prof
    101     15.96  25308 10.56     87.2   3111 prof
    102     12.26  25879  4.02     69.1   1130 prof
    ggplot(data=Prestige_Income_df, aes(Prestige_Income_df$income)) + 
      geom_histogram(aes(fill = ..count..)) +
      scale_fill_gradient("Count", low = "blue", high = "purple") +
      labs(title = "Historgram - Income") +
      labs(x = "Income") +
      labs(y = "Count")

    # Plot matrix of all variables.
    plot(Prestige_4Column_df, pch=16, col="red", main="Scatterplot of Income, Education, Women and Prestige")

    (3) Statistical Analysis:

    In this section we will create a few multiple regression models and see which one provides a good relationship between predictor variables (education, prestige and women) and response variable (income).

    (3.1) Create functions to calculate the slope and the intercept
    getSlope <- function(m) {
      slp = round(m$coefficients[2], 4)
      return (slp)
    }
    
    getIntercept <- function(m) {
      int = round(m$coefficients[1], 4)
      return (int)
    }
    (3.2) Create Multiple Regression models
    m1 = lm(Prestige$income ~ Prestige$education + Prestige$prestige + Prestige$women, data=Prestige)
    summary(m1)
    
    Call:
    lm(formula = Prestige$income ~ Prestige$education + Prestige$prestige + 
        Prestige$women, data = Prestige)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -7715.3  -929.7  -231.2   689.7 14391.8 
    
    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
    (Intercept)        -253.850   1086.157  -0.234    0.816    
    Prestige$education  177.199    187.632   0.944    0.347    
    Prestige$prestige   141.435     29.910   4.729 7.58e-06 ***
    Prestige$women      -50.896      8.556  -5.948 4.19e-08 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 2575 on 98 degrees of freedom
    Multiple R-squared:  0.6432,    Adjusted R-squared:  0.6323 
    F-statistic: 58.89 on 3 and 98 DF,  p-value: < 2.2e-16

    \[ \hat{income} = -253.8497 + (177.199 * education) + (141.4354 * prestige) + (-50.8957 * women) \]

    m2 = lm(Prestige$income ~ Prestige$education + I(Prestige$prestige^2) + Prestige$women, data=Prestige)
    summary(m2)
    
    Call:
    lm(formula = Prestige$income ~ Prestige$education + I(Prestige$prestige^2) + 
        Prestige$women, data = Prestige)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -8024.7  -792.2   -64.2   836.0 14130.6 
    
    Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
    (Intercept)            3601.5240  1406.5994   2.560    0.012 *  
    Prestige$education       57.1882   190.6623   0.300    0.765    
    I(Prestige$prestige^2)    1.5999     0.3011   5.313 6.75e-07 ***
    Prestige$women          -48.1680     8.4475  -5.702 1.25e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 2514 on 98 degrees of freedom
    Multiple R-squared:  0.6598,    Adjusted R-squared:  0.6494 
    F-statistic: 63.35 on 3 and 98 DF,  p-value: < 2.2e-16

    \[ \hat{income} = 3601.524 + (57.1882 * education) + (1.5999 * prestige^2) + (-48.168 * women) \]

    (3.3) Residual Analysis
    # Plot model residuals.
    plot(m1, pch=16, which=1)

    plot(m2, pch=16, which=1)

    (3.4) Regression Statistics
    Model Linear Regression Equation R-Square
    Model-1 income = -253.8497 + (177.199 * education) + (141.4354 * prestige) + (-50.8957 * women) 0.6432
    Model-2 income = 3601.524 + (57.1882 * education) + (1.5999 * prestige^2) + (-48.168 * women) 0.6598


    (4) Summary:

    We’ve seen a couple of multiple linear regression models applied to the Prestige dataset. We transformed the variables to see the effect in the models. We observe that with a quadratic term, the R-square value gets better and the residuals tend to get closer to zero (except a few outliers). Therefore, we can conclude that the response variable can be predicted better using a quadratic term for \(Prestige\).