R Notebook

This is an R Markdown Notebook.

MIDTERM II

library(HistData)

## Warning: package 'HistData' was built under R version 3.3.3

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

library(readxl)

## Warning: package 'readxl' was built under R version 3.3.3

Importing the Data and assigning the column Names.

MODEL <- read_excel("C:/Users/annmo/Desktop/SPRING 2017/STAT 6509/MOAData.xlsx")

colnames(MODEL)=c("RACE", "YEAR", "METRO AREA", "HDI", "LIFEXPECT", "DEGREE", "ENROLL", "MIDEARN", "HEALTHINDEX", "EDUINDEX", "INCOMEINDEX")

print(MODEL)

## # A tibble: 91 x 11
##                RACE  YEAR     `METRO AREA`   HDI LIFEXPECT DEGREE ENROLL
##               <chr> <dbl>            <chr> <dbl>     <dbl>  <dbl>  <dbl>
##  1 AFRICAN AMERICAN  2010          Atlanta  4.46      75.9    8.7   78.8
##  2 AFRICAN AMERICAN  2010        Baltimore  4.16      73.4    8.4   74.9
##  3 AFRICAN AMERICAN  2010           Boston  5.09      79.9    8.4   83.4
##  4 AFRICAN AMERICAN  2010          Chicago  3.90      73.7    6.9   78.0
##  5 AFRICAN AMERICAN  2010 Dallas-Ft. Worth  4.40      75.1    6.4   77.4
##  6 AFRICAN AMERICAN  2010           Denver  4.23      76.4    6.9   77.5
##  7 AFRICAN AMERICAN  2010          Detroit  3.58      72.9    6.5   77.4
##  8 AFRICAN AMERICAN  2010          Houston  4.28      74.5    7.8   77.7
##  9 AFRICAN AMERICAN  2010      Los Angeles  4.59      74.6    8.1   78.6
## 10 AFRICAN AMERICAN  2010            Miami  3.93      77.3    5.9   77.8
## # ... with 81 more rows, and 4 more variables: MIDEARN <dbl>,
## #   HEALTHINDEX <dbl>, EDUINDEX <dbl>, INCOMEINDEX <dbl>

1(a) COMPARING THE HD INDEX FOR VAERIOUS RACES USING APPROPRIATE PLOT AND COMMENTING ON IT.

boxplot(MODEL$HDI~MODEL$RACE, 
        main="Box Plot of HD INDEX for Each Race/Ethnicity", 
        xlab="RACE", ylab="HD INDEX")

The Asian American population apperas to have the highest HD Index in the USA. It has the highest variance, with the second widest spread as seen in the boxplot. This is followed by the Caucasian American population, with a HD Index of 6.2. Though it has a few outliers, which may cause some skewness of the variance to be far larger. *Both the African American and the Latin American populations appear to have an almost same mean. However, the African American population has a larger variance in the comparison of the two.

## 1(b) DEVELOP AN ADDITIVE REGRESSION MODEL WITH TWO PREDICTORS (RACE AND INCOME), FOR PREDICTING THE HD INDEX. OBTAIN THE REGRESSION COEFFICIENTS AND INTERPRET THEM.

y_i = ??_0 + ??_1 x_i1 + ??_2 x_i2 + ??_3 x_i3 + ??_4 x_i4 + ??_5 x_i5 + ??_i y_i = HD Index
x_i1 = Income,Measured in Dollars
x_i2 = 1,if the subject is African American
= 0,if the subject is not African American
x_i3 = 1,if the subject is Asian American = 0,if the subject is not Asian American
x_i4 = 1,if the subject is Latin American = 0,if the subject is not Latin American
x_i5 = 1,if the subject is Caucasian American = 0,if the subject is not Caucasian American
??_i = Independent error terms that are normally distributed with mean = 0 and equal variance ??^2

MODEL.lm2 <- lm(HDI ~ INCOMEINDEX + RACE, data = MODEL)

summary(MODEL.lm2)

## 
## Call:
## lm(formula = HDI ~ INCOMEINDEX + RACE, data = MODEL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75691 -0.24459 -0.03259  0.21640  0.82990 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.8376     0.1888   9.733 1.59e-15 ***
## INCOMEINDEX          0.5531     0.0398  13.898  < 2e-16 ***
## RACEASIAN AMERICAN   2.1703     0.1238  17.528  < 2e-16 ***
## RACELATINO           0.9010     0.1200   7.507 5.19e-11 ***
## RACEWHITE            0.6499     0.1343   4.839 5.68e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3433 on 86 degrees of freedom
## Multiple R-squared:  0.9464, Adjusted R-squared:  0.9439 
## F-statistic: 379.4 on 4 and 86 DF,  p-value: < 2.2e-16

Since the indicator variable is not included in the outcome, one of the indictaor variable is dropped.

y_i = (1.205) + (0.0001)x_i1 + (2.254)x_i3 + (0.6802)x_i4 +(0.7049)x_i5 + ??_i

??_0 = 1.205 = Overall average intercept
??_1 = 0.0001 =Change in mean HDI (??_y) for each dollar increase in ???income x???_i1
??_3 = 2.254 = The coefficient that multiplies the Asian American indicator variable
??_4 = 0.6802 = The coefficient that multiplies the Latin American indicator variable ??_5 = 0.7049 = The coefficient that multiplies the Caucasian American indicator variable

Since the pvalue and the rsqrd for this model shows it containing two predictors (RACE & MIDEARNINGS), are more accurate in predicting the HD INDEX.We also take into account that the pvalues for the said predictors are significantly less than the 0.05, which is the default alpha value, proves that they should be included in the model.

1(c) RUN A FULL MODEL WITH ALL THE POSSIBLE INTERACTION TERMS (WITH JUST TWO PREDICTORS), PROVIDE THE REGRESSION COEFFs ALONG WITH ANY RELEVANT INTERPRETATIONS.

y_i = ??_0 + ??_1 x_i1 + ??_2 x_i2 + ??_3 x_i3 + ??_4 x_i4 + ??5 x(i5 )+ ??_12 x_i1 x_i2 + ??_13 x_i1 x_i3 + ??_14 x_i1 x_i4 + ??_15 x_i1 x_i5 + ??_i

y_i = HD Index
x_i1 = Income,Measured in Dollars
x_i2 = 1,if the subject is African American
= 0,if the subject is not African American
x_i3 = 1,if the subject is Asian American = 0,if the subject is not Asian American
x_i4 = 1,if the subject is Latin American = 0,if the subject is not Latin American x_i5 = 1,if the subject is Caucasian American = 0,if the subject is not Caucasian American
x_i1 x_i2 = Interaction term between x_i1 (income) and x_i2 (African American) x_i1 x_i3 = Interaction term between x_i1 (income) and x_i2 (Asian American) x_i1 x_i4 = Interaction term between x_i1 (income) and x_i2 (Latin American) x_i1 x_i5 = Interaction term between x_i1 (income) and x_i2 (Caucasian American) ??_i = Independent error terms that are normally distributed with mean = 0 and equal variance ??^2

MODEL.add <- lm(HDI ~ MIDEARN + factor(RACE), data = MODEL)

summary(MODEL.add)

## 
## Call:
## lm(formula = HDI ~ MIDEARN + factor(RACE), data = MODEL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73075 -0.23235  0.00728  0.19060  0.89931 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.205e+00  2.245e-01   5.370 6.61e-07 ***
## MIDEARN                    1.109e-04  7.721e-06  14.362  < 2e-16 ***
## factor(RACE)ASIAN AMERICAN 2.254e+00  1.174e-01  19.191  < 2e-16 ***
## factor(RACE)LATINO         6.802e-01  1.100e-01   6.187 2.01e-08 ***
## factor(RACE)WHITE          7.049e-01  1.280e-01   5.506 3.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3355 on 86 degrees of freedom
## Multiple R-squared:  0.9488, Adjusted R-squared:  0.9464 
## F-statistic: 398.2 on 4 and 86 DF,  p-value: < 2.2e-16

Since the pvalue is less than alpha, we conclude that its is significantly useful in predicting the HD INDEX.

1(d) CREATE A VISUAL DISPLAY FOR BOTH THE ADDITIVE AND THE INTERACTION MODELS IN PARTS (a) & (b) WITH THE DATA POINTS & ESTIMATED REGRESSION EQUATIONS.

We use the additive model.

plot(MODEL$HDI ~MODEL$MIDEARN, col = factor(MODEL$RACE), main = "REGRESSION OF MEDIAN EARNINGS ON HD INDEX BY RACE", xlab = "MEDIAN EARNINGS", ylab = "HD INDEX")

summary(model.interaction.lm)

mid_fit <- fitted.values(MODEL.lm2)

ggplot(data = MODEL, aes(x = MIDEARN, y = HDI, colour = factor(RACE))) + geom_point() + xlab("MIDEARN") + ylab("HDI") + geom_line(aes(y = mid_fit))

Estimated Regression Equations.

Estimated Regression Equations factor(RACE)AFRICAN AMERICAN: HDI = 1.381 + 0.0001046(MIDEARN) factor(RACE)ASIAN AMERICAN: HDI = 3.371 + 0.0001134(MIDEARN) factor(RACE)LATINO: HDI = -1.155 + 0.0002491(MIDEARN) factor(RACE)WHITE: HDI = 2.084 + 0.0001064(MIDEARN)

Interaction plot.

ggplot(data = MODEL, aes(x = MIDEARN, y = HDI, color = factor(RACE))) + stat_smooth(method = lm, fullrange = FALSE) + geom_point()

1(e) NOW INCLUDE LIFE EXPECTANCY AS A PREDICTOR AND RUN AN ADDITIVE MODEL. CHECK FOR THE MODEL ASSUMPTIONS USING RESIDULA PLOTS & FOR NORMALITY & EQUAL VARIANCES USING FORMAL TESTS. PROVIDE YOUR THOUGHTS ALONG WITH THE OUTPUTS.

y_i = ??_0 + ??_1 x_i1 + ??_2 x_i2 + ??_3 x_i3 + ??_4 x_i4 + ??_5 x_i5 + ??_6 x_i6 + ??_i

y_i = HD Index
x_i1 = Income,Measured in Dollars
x_i2 = Life Expectancy in Years x_i3 = 1,if the subject is African American
= 0,if the subject is not African American
x_i4 = 1,if the subject is Asian American = 0,if the subject is not Asian American
x_i5 = 1,if the subject is Latin American = 0,if the subject is not Latin American
x_i6 = 1,if the subject is Caucasian American = 0,if the subject is not Caucasian American
??_i = Independent error terms that are normally distributed with mean = 0 and equal variance ??^2

MODEL.lm3 <- lm(HDI ~ MIDEARN + LIFEXPECT + RACE, data = MODEL)

summary(MODEL.lm3)

## 
## Call:
## lm(formula = HDI ~ MIDEARN + LIFEXPECT + RACE, data = MODEL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28466 -0.09309 -0.01170  0.09119  0.40081 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.022e+01  6.101e-01 -16.757  < 2e-16 ***
## MIDEARN             9.440e-05  3.503e-06  26.952  < 2e-16 ***
## LIFEXPECT           1.575e-01  8.295e-03  18.982  < 2e-16 ***
## RACEASIAN AMERICAN  5.251e-01  1.047e-01   5.017 2.84e-06 ***
## RACELATINO         -6.691e-01  8.595e-02  -7.785 1.53e-11 ***
## RACEWHITE           2.203e-01  6.178e-02   3.565 0.000599 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1474 on 85 degrees of freedom
## Multiple R-squared:  0.9902, Adjusted R-squared:  0.9896 
## F-statistic:  1722 on 5 and 85 DF,  p-value: < 2.2e-16

We have lost some degrees of freedom due to the estimation, thus we’re using 85 dfs. The R-sqrd is promising since its at 99.02%, despite the pvalue being far much smaller than the alpha.

RESIDUALS vs FITTED PLOT.

plot(x = fitted(MODEL.lm3), y = residuals(MODEL.lm3), xlab = "FITTED VALUES", ylab = "RESIDUALS", panel.last = abline (h = 0, lty = 2))

From the graph, we see that the residuals randomly bounces about the zero line, which quite indicates some linear relationship. This show that the variance is zero in the error term.

PROBABILITY PLOT

MODEL.stdres = rstandard(MODEL.lm3)

MODEL.stdres = rstandard(MODEL.lm3)
qqnorm(MODEL.stdres, ylab="STANDARDIZED RESIDUALS", xlab="NORMAL SCORES",main="NORMALITY PROBABILITY PLOT")
qqline(MODEL.stdres)

The data seems to be lie around the line, thus showing linearity, hence considered normal/

shapiro.test(residuals(MODEL.lm3))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(MODEL.lm3)
## W = 0.98975, p-value = 0.7061

The Shapiro-Wilk test is run to confirm the rather normality considered above. With the pvalue (0.7061), being much higher than the alpha, the assumption that the errors are normally dstributed is taken into coonderation.

VARIANCE’S CONSTANT ERROR

library(lmtest)

## Warning: package 'lmtest' was built under R version 3.3.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.3.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(MODEL.lm3)

## 
##  studentized Breusch-Pagan test
## 
## data:  MODEL.lm3
## BP = 8.3721, df = 5, p-value = 0.1369

The errors seem to have a constant error variance, as depicted by the high pvalue (0.1369) attained in the studentized Breusch-Pagan test.

1(f) CHECK FOR AN OUTLIER(S) PRESENT IN THE DATA. USE ANY APPROPRIATE PLOT TO IDENTIFY 2 POINTS WITH THE LARGEST INFLUENCE.

library(car)

## Warning: package 'car' was built under R version 3.3.3

influencePlot(MODEL.lm3, id.n = 2)

##       StudRes        Hat      CookD
## 27  2.2359394 0.08289576 0.07193068
## 54  2.9108802 0.05106498 0.06985332
## 66 -0.6703976 0.18092972 0.01665423
## 91 -1.4959682 0.18536664 0.08365339

In the plot above, we see that the most influencial plots are 27, 54 & 91, which ideally, are also the outliers. However, 27 & 91 appear to have the highest Cook’s (0.07193068 & 0.08365339 respectively) & Hat values. This therefore makes them the most influencial observations.

1(g) FINALLY USE THE AIC TO JUSTIFY THE CHOICE OF MODEL WITH OUR THREE PREDICTORS. PROVIDE ENTITLED RESULTS.

MODEL.lm3 <- lm(HDI ~ MIDEARN + LIFEXPECT + RACE, data = MODEL)

summary(MODEL.lm3)

## 
## Call:
## lm(formula = HDI ~ MIDEARN + LIFEXPECT + RACE, data = MODEL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28466 -0.09309 -0.01170  0.09119  0.40081 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.022e+01  6.101e-01 -16.757  < 2e-16 ***
## MIDEARN             9.440e-05  3.503e-06  26.952  < 2e-16 ***
## LIFEXPECT           1.575e-01  8.295e-03  18.982  < 2e-16 ***
## RACEASIAN AMERICAN  5.251e-01  1.047e-01   5.017 2.84e-06 ***
## RACELATINO         -6.691e-01  8.595e-02  -7.785 1.53e-11 ***
## RACEWHITE           2.203e-01  6.178e-02   3.565 0.000599 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1474 on 85 degrees of freedom
## Multiple R-squared:  0.9902, Adjusted R-squared:  0.9896 
## F-statistic:  1722 on 5 and 85 DF,  p-value: < 2.2e-16

The predictors’ pvalues (<2.2e-16) in the model, indicates its idealnes for predicting the HD INDEX.

fit0 = lm (HDI ~ 1, data = MODEL)

fit4 = lm(HDI ~ RACE*LIFEXPECT*MIDEARN, data = MODEL)

step(fit0, scope = list(lower = fit0, upper = fit4), direction = "forward")

## Start:  AIC=68.5
## HDI ~ 1
## 
##             Df Sum of Sq     RSS     AIC
## + RACE       3   156.071  32.897 -84.590
## + MIDEARN    1   126.661  62.307 -30.469
## + LIFEXPECT  1    82.313 106.656  18.446
## <none>                   188.968  68.495
## 
## Step:  AIC=-84.59
## HDI ~ RACE
## 
##             Df Sum of Sq    RSS     AIC
## + MIDEARN    1    23.217  9.680 -193.91
## + LIFEXPECT  1    15.260 17.637 -139.32
## <none>                   32.897  -84.59
## 
## Step:  AIC=-193.91
## HDI ~ RACE + MIDEARN
## 
##                Df Sum of Sq    RSS     AIC
## + LIFEXPECT     1    7.8322 1.8476 -342.62
## + RACE:MIDEARN  3    0.7382 8.9416 -195.13
## <none>                      9.6798 -193.91
## 
## Step:  AIC=-342.62
## HDI ~ RACE + MIDEARN + LIFEXPECT
## 
##                     Df Sum of Sq    RSS     AIC
## + RACE:LIFEXPECT     3  0.204668 1.6429 -347.31
## + LIFEXPECT:MIDEARN  1  0.048509 1.7991 -343.05
## <none>                           1.8476 -342.62
## + RACE:MIDEARN       3  0.112857 1.7348 -342.36
## 
## Step:  AIC=-347.31
## HDI ~ RACE + MIDEARN + LIFEXPECT + RACE:LIFEXPECT
## 
##                     Df Sum of Sq    RSS     AIC
## + RACE:MIDEARN       3  0.212313 1.4306 -353.90
## <none>                           1.6429 -347.31
## + LIFEXPECT:MIDEARN  1  0.000002 1.6429 -345.31
## 
## Step:  AIC=-353.9
## HDI ~ RACE + MIDEARN + LIFEXPECT + RACE:LIFEXPECT + RACE:MIDEARN
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           1.4306 -353.90
## + LIFEXPECT:MIDEARN  1 0.0010415 1.4296 -351.97

## 
## Call:
## lm(formula = HDI ~ RACE + MIDEARN + LIFEXPECT + RACE:LIFEXPECT + 
##     RACE:MIDEARN, data = MODEL)
## 
## Coefficients:
##                  (Intercept)            RACEASIAN AMERICAN  
##                   -9.572e+00                    -3.997e+00  
##                   RACELATINO                     RACEWHITE  
##                   -1.932e-01                    -4.038e+00  
##                      MIDEARN                     LIFEXPECT  
##                    1.009e-04                     1.464e-01  
## RACEASIAN AMERICAN:LIFEXPECT          RACELATINO:LIFEXPECT  
##                    5.588e-02                    -1.830e-02  
##          RACEWHITE:LIFEXPECT    RACEASIAN AMERICAN:MIDEARN  
##                    6.231e-02                    -7.787e-06  
##           RACELATINO:MIDEARN             RACEWHITE:MIDEARN  
##                    5.346e-05                    -1.894e-05