This is an R Markdown Notebook.
library(HistData)
## Warning: package 'HistData' was built under R version 3.3.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(readxl)
## Warning: package 'readxl' was built under R version 3.3.3
- Importing the Data and assigning the column Names.
MODEL <- read_excel("C:/Users/annmo/Desktop/SPRING 2017/STAT 6509/MOAData.xlsx")
colnames(MODEL)=c("RACE", "YEAR", "METRO AREA", "HDI", "LIFEXPECT", "DEGREE", "ENROLL", "MIDEARN", "HEALTHINDEX", "EDUINDEX", "INCOMEINDEX")
print(MODEL)
## # A tibble: 91 x 11
## RACE YEAR `METRO AREA` HDI LIFEXPECT DEGREE ENROLL
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AFRICAN AMERICAN 2010 Atlanta 4.46 75.9 8.7 78.8
## 2 AFRICAN AMERICAN 2010 Baltimore 4.16 73.4 8.4 74.9
## 3 AFRICAN AMERICAN 2010 Boston 5.09 79.9 8.4 83.4
## 4 AFRICAN AMERICAN 2010 Chicago 3.90 73.7 6.9 78.0
## 5 AFRICAN AMERICAN 2010 Dallas-Ft. Worth 4.40 75.1 6.4 77.4
## 6 AFRICAN AMERICAN 2010 Denver 4.23 76.4 6.9 77.5
## 7 AFRICAN AMERICAN 2010 Detroit 3.58 72.9 6.5 77.4
## 8 AFRICAN AMERICAN 2010 Houston 4.28 74.5 7.8 77.7
## 9 AFRICAN AMERICAN 2010 Los Angeles 4.59 74.6 8.1 78.6
## 10 AFRICAN AMERICAN 2010 Miami 3.93 77.3 5.9 77.8
## # ... with 81 more rows, and 4 more variables: MIDEARN <dbl>,
## # HEALTHINDEX <dbl>, EDUINDEX <dbl>, INCOMEINDEX <dbl>
boxplot(MODEL$HDI~MODEL$RACE,
main="Box Plot of HD INDEX for Each Race/Ethnicity",
xlab="RACE", ylab="HD INDEX")
- The Asian American population apperas to have the highest HD Index in the USA. It has the highest variance, with the second widest spread as seen in the boxplot. This is followed by the Caucasian American population, with a HD Index of 6.2. Though it has a few outliers, which may cause some skewness of the variance to be far larger. *Both the African American and the Latin American populations appear to have an almost same mean. However, the African American population has a larger variance in the comparison of the two.
## 1(b) DEVELOP AN ADDITIVE REGRESSION MODEL WITH TWO PREDICTORS (RACE AND INCOME), FOR PREDICTING THE HD INDEX. OBTAIN THE REGRESSION COEFFICIENTS AND INTERPRET THEM.
y_i = ??_0 + ??_1 x_i1 + ??_2 x_i2 + ??_3 x_i3 + ??_4 x_i4 + ??_5 x_i5 + ??_i y_i = HD Index
x_i1 = Income,Measured in Dollars
x_i2 = 1,if the subject is African American
= 0,if the subject is not African American
x_i3 = 1,if the subject is Asian American = 0,if the subject is not Asian American
x_i4 = 1,if the subject is Latin American = 0,if the subject is not Latin American
x_i5 = 1,if the subject is Caucasian American = 0,if the subject is not Caucasian American
??_i = Independent error terms that are normally distributed with mean = 0 and equal variance ??^2
MODEL.lm2 <- lm(HDI ~ INCOMEINDEX + RACE, data = MODEL)
summary(MODEL.lm2)
##
## Call:
## lm(formula = HDI ~ INCOMEINDEX + RACE, data = MODEL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75691 -0.24459 -0.03259 0.21640 0.82990
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8376 0.1888 9.733 1.59e-15 ***
## INCOMEINDEX 0.5531 0.0398 13.898 < 2e-16 ***
## RACEASIAN AMERICAN 2.1703 0.1238 17.528 < 2e-16 ***
## RACELATINO 0.9010 0.1200 7.507 5.19e-11 ***
## RACEWHITE 0.6499 0.1343 4.839 5.68e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3433 on 86 degrees of freedom
## Multiple R-squared: 0.9464, Adjusted R-squared: 0.9439
## F-statistic: 379.4 on 4 and 86 DF, p-value: < 2.2e-16
- Since the indicator variable is not included in the outcome, one of the indictaor variable is dropped.
y_i = (1.205) + (0.0001)x_i1 + (2.254)x_i3 + (0.6802)x_i4 +(0.7049)x_i5 + ??_i
??_0 = 1.205 = Overall average intercept
??_1 = 0.0001 =Change in mean HDI (??_y) for each dollar increase in ???income x???_i1
??_3 = 2.254 = The coefficient that multiplies the Asian American indicator variable
??_4 = 0.6802 = The coefficient that multiplies the Latin American indicator variable ??_5 = 0.7049 = The coefficient that multiplies the Caucasian American indicator variable
- Since the pvalue and the rsqrd for this model shows it containing two predictors (RACE & MIDEARNINGS), are more accurate in predicting the HD INDEX.We also take into account that the pvalues for the said predictors are significantly less than the 0.05, which is the default alpha value, proves that they should be included in the model.
y_i = ??_0 + ??_1 x_i1 + ??_2 x_i2 + ??_3 x_i3 + ??_4 x_i4 + ??5 x(i5 )+ ??_12 x_i1 x_i2 + ??_13 x_i1 x_i3 + ??_14 x_i1 x_i4 + ??_15 x_i1 x_i5 + ??_i
y_i = HD Index
x_i1 = Income,Measured in Dollars
x_i2 = 1,if the subject is African American
= 0,if the subject is not African American
x_i3 = 1,if the subject is Asian American = 0,if the subject is not Asian American
x_i4 = 1,if the subject is Latin American = 0,if the subject is not Latin American x_i5 = 1,if the subject is Caucasian American = 0,if the subject is not Caucasian American
x_i1 x_i2 = Interaction term between x_i1 (income) and x_i2 (African American) x_i1 x_i3 = Interaction term between x_i1 (income) and x_i2 (Asian American) x_i1 x_i4 = Interaction term between x_i1 (income) and x_i2 (Latin American) x_i1 x_i5 = Interaction term between x_i1 (income) and x_i2 (Caucasian American) ??_i = Independent error terms that are normally distributed with mean = 0 and equal variance ??^2
MODEL.add <- lm(HDI ~ MIDEARN + factor(RACE), data = MODEL)
summary(MODEL.add)
##
## Call:
## lm(formula = HDI ~ MIDEARN + factor(RACE), data = MODEL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73075 -0.23235 0.00728 0.19060 0.89931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.205e+00 2.245e-01 5.370 6.61e-07 ***
## MIDEARN 1.109e-04 7.721e-06 14.362 < 2e-16 ***
## factor(RACE)ASIAN AMERICAN 2.254e+00 1.174e-01 19.191 < 2e-16 ***
## factor(RACE)LATINO 6.802e-01 1.100e-01 6.187 2.01e-08 ***
## factor(RACE)WHITE 7.049e-01 1.280e-01 5.506 3.74e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3355 on 86 degrees of freedom
## Multiple R-squared: 0.9488, Adjusted R-squared: 0.9464
## F-statistic: 398.2 on 4 and 86 DF, p-value: < 2.2e-16
- Since the pvalue is less than alpha, we conclude that its is significantly useful in predicting the HD INDEX.
- We use the additive model.
plot(MODEL$HDI ~MODEL$MIDEARN, col = factor(MODEL$RACE), main = "REGRESSION OF MEDIAN EARNINGS ON HD INDEX BY RACE", xlab = "MEDIAN EARNINGS", ylab = "HD INDEX")
summary(model.interaction.lm)
mid_fit <- fitted.values(MODEL.lm2)
ggplot(data = MODEL, aes(x = MIDEARN, y = HDI, colour = factor(RACE))) + geom_point() + xlab("MIDEARN") + ylab("HDI") + geom_line(aes(y = mid_fit))
- Estimated Regression Equations.
Estimated Regression Equations factor(RACE)AFRICAN AMERICAN: HDI = 1.381 + 0.0001046(MIDEARN) factor(RACE)ASIAN AMERICAN: HDI = 3.371 + 0.0001134(MIDEARN) factor(RACE)LATINO: HDI = -1.155 + 0.0002491(MIDEARN) factor(RACE)WHITE: HDI = 2.084 + 0.0001064(MIDEARN)
- Interaction plot.
ggplot(data = MODEL, aes(x = MIDEARN, y = HDI, color = factor(RACE))) + stat_smooth(method = lm, fullrange = FALSE) + geom_point()
y_i = ??_0 + ??_1 x_i1 + ??_2 x_i2 + ??_3 x_i3 + ??_4 x_i4 + ??_5 x_i5 + ??_6 x_i6 + ??_i
y_i = HD Index
x_i1 = Income,Measured in Dollars
x_i2 = Life Expectancy in Years x_i3 = 1,if the subject is African American
= 0,if the subject is not African American
x_i4 = 1,if the subject is Asian American = 0,if the subject is not Asian American
x_i5 = 1,if the subject is Latin American = 0,if the subject is not Latin American
x_i6 = 1,if the subject is Caucasian American = 0,if the subject is not Caucasian American
??_i = Independent error terms that are normally distributed with mean = 0 and equal variance ??^2
MODEL.lm3 <- lm(HDI ~ MIDEARN + LIFEXPECT + RACE, data = MODEL)
summary(MODEL.lm3)
##
## Call:
## lm(formula = HDI ~ MIDEARN + LIFEXPECT + RACE, data = MODEL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28466 -0.09309 -0.01170 0.09119 0.40081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.022e+01 6.101e-01 -16.757 < 2e-16 ***
## MIDEARN 9.440e-05 3.503e-06 26.952 < 2e-16 ***
## LIFEXPECT 1.575e-01 8.295e-03 18.982 < 2e-16 ***
## RACEASIAN AMERICAN 5.251e-01 1.047e-01 5.017 2.84e-06 ***
## RACELATINO -6.691e-01 8.595e-02 -7.785 1.53e-11 ***
## RACEWHITE 2.203e-01 6.178e-02 3.565 0.000599 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1474 on 85 degrees of freedom
## Multiple R-squared: 0.9902, Adjusted R-squared: 0.9896
## F-statistic: 1722 on 5 and 85 DF, p-value: < 2.2e-16
- We have lost some degrees of freedom due to the estimation, thus we’re using 85 dfs. The R-sqrd is promising since its at 99.02%, despite the pvalue being far much smaller than the alpha.
RESIDUALS vs FITTED PLOT.
plot(x = fitted(MODEL.lm3), y = residuals(MODEL.lm3), xlab = "FITTED VALUES", ylab = "RESIDUALS", panel.last = abline (h = 0, lty = 2))
- From the graph, we see that the residuals randomly bounces about the zero line, which quite indicates some linear relationship. This show that the variance is zero in the error term.
MODEL.stdres = rstandard(MODEL.lm3)
MODEL.stdres = rstandard(MODEL.lm3)
qqnorm(MODEL.stdres, ylab="STANDARDIZED RESIDUALS", xlab="NORMAL SCORES",main="NORMALITY PROBABILITY PLOT")
qqline(MODEL.stdres)
- The data seems to be lie around the line, thus showing linearity, hence considered normal/
shapiro.test(residuals(MODEL.lm3))
##
## Shapiro-Wilk normality test
##
## data: residuals(MODEL.lm3)
## W = 0.98975, p-value = 0.7061
- The Shapiro-Wilk test is run to confirm the rather normality considered above. With the pvalue (0.7061), being much higher than the alpha, the assumption that the errors are normally dstributed is taken into coonderation.
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.3.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(MODEL.lm3)
##
## studentized Breusch-Pagan test
##
## data: MODEL.lm3
## BP = 8.3721, df = 5, p-value = 0.1369
- The errors seem to have a constant error variance, as depicted by the high pvalue (0.1369) attained in the studentized Breusch-Pagan test.
library(car)
## Warning: package 'car' was built under R version 3.3.3
influencePlot(MODEL.lm3, id.n = 2)
## StudRes Hat CookD
## 27 2.2359394 0.08289576 0.07193068
## 54 2.9108802 0.05106498 0.06985332
## 66 -0.6703976 0.18092972 0.01665423
## 91 -1.4959682 0.18536664 0.08365339
- In the plot above, we see that the most influencial plots are 27, 54 & 91, which ideally, are also the outliers. However, 27 & 91 appear to have the highest Cook’s (0.07193068 & 0.08365339 respectively) & Hat values. This therefore makes them the most influencial observations.
MODEL.lm3 <- lm(HDI ~ MIDEARN + LIFEXPECT + RACE, data = MODEL)
summary(MODEL.lm3)
##
## Call:
## lm(formula = HDI ~ MIDEARN + LIFEXPECT + RACE, data = MODEL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28466 -0.09309 -0.01170 0.09119 0.40081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.022e+01 6.101e-01 -16.757 < 2e-16 ***
## MIDEARN 9.440e-05 3.503e-06 26.952 < 2e-16 ***
## LIFEXPECT 1.575e-01 8.295e-03 18.982 < 2e-16 ***
## RACEASIAN AMERICAN 5.251e-01 1.047e-01 5.017 2.84e-06 ***
## RACELATINO -6.691e-01 8.595e-02 -7.785 1.53e-11 ***
## RACEWHITE 2.203e-01 6.178e-02 3.565 0.000599 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1474 on 85 degrees of freedom
## Multiple R-squared: 0.9902, Adjusted R-squared: 0.9896
## F-statistic: 1722 on 5 and 85 DF, p-value: < 2.2e-16
- The predictors’ pvalues (<2.2e-16) in the model, indicates its idealnes for predicting the HD INDEX.
fit0 = lm (HDI ~ 1, data = MODEL)
fit4 = lm(HDI ~ RACE*LIFEXPECT*MIDEARN, data = MODEL)
step(fit0, scope = list(lower = fit0, upper = fit4), direction = "forward")
## Start: AIC=68.5
## HDI ~ 1
##
## Df Sum of Sq RSS AIC
## + RACE 3 156.071 32.897 -84.590
## + MIDEARN 1 126.661 62.307 -30.469
## + LIFEXPECT 1 82.313 106.656 18.446
## <none> 188.968 68.495
##
## Step: AIC=-84.59
## HDI ~ RACE
##
## Df Sum of Sq RSS AIC
## + MIDEARN 1 23.217 9.680 -193.91
## + LIFEXPECT 1 15.260 17.637 -139.32
## <none> 32.897 -84.59
##
## Step: AIC=-193.91
## HDI ~ RACE + MIDEARN
##
## Df Sum of Sq RSS AIC
## + LIFEXPECT 1 7.8322 1.8476 -342.62
## + RACE:MIDEARN 3 0.7382 8.9416 -195.13
## <none> 9.6798 -193.91
##
## Step: AIC=-342.62
## HDI ~ RACE + MIDEARN + LIFEXPECT
##
## Df Sum of Sq RSS AIC
## + RACE:LIFEXPECT 3 0.204668 1.6429 -347.31
## + LIFEXPECT:MIDEARN 1 0.048509 1.7991 -343.05
## <none> 1.8476 -342.62
## + RACE:MIDEARN 3 0.112857 1.7348 -342.36
##
## Step: AIC=-347.31
## HDI ~ RACE + MIDEARN + LIFEXPECT + RACE:LIFEXPECT
##
## Df Sum of Sq RSS AIC
## + RACE:MIDEARN 3 0.212313 1.4306 -353.90
## <none> 1.6429 -347.31
## + LIFEXPECT:MIDEARN 1 0.000002 1.6429 -345.31
##
## Step: AIC=-353.9
## HDI ~ RACE + MIDEARN + LIFEXPECT + RACE:LIFEXPECT + RACE:MIDEARN
##
## Df Sum of Sq RSS AIC
## <none> 1.4306 -353.90
## + LIFEXPECT:MIDEARN 1 0.0010415 1.4296 -351.97
##
## Call:
## lm(formula = HDI ~ RACE + MIDEARN + LIFEXPECT + RACE:LIFEXPECT +
## RACE:MIDEARN, data = MODEL)
##
## Coefficients:
## (Intercept) RACEASIAN AMERICAN
## -9.572e+00 -3.997e+00
## RACELATINO RACEWHITE
## -1.932e-01 -4.038e+00
## MIDEARN LIFEXPECT
## 1.009e-04 1.464e-01
## RACEASIAN AMERICAN:LIFEXPECT RACELATINO:LIFEXPECT
## 5.588e-02 -1.830e-02
## RACEWHITE:LIFEXPECT RACEASIAN AMERICAN:MIDEARN
## 6.231e-02 -7.787e-06
## RACELATINO:MIDEARN RACEWHITE:MIDEARN
## 5.346e-05 -1.894e-05