1. Download the dataset from the source.
# Read the comma-separated file; it has no header row
pima <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", header = FALSE, sep = ",")
# Assign descriptive column names; "class" is the 0/1 outcome
colnames(pima) <- c("npreg", "glucose", "bp", "triceps", "insulin", "bmi", "diabetes", "age", "class")
head(pima)
##   npreg glucose bp triceps insulin  bmi diabetes age class
## 1     6     148 72      35       0 33.6    0.627  50     1
## 2     1      85 66      29       0 26.6    0.351  31     0
## 3     8     183 64       0       0 23.3    0.672  32     1
## 4     1      89 66      23      94 28.1    0.167  21     0
## 5     0     137 40      35     168 43.1    2.288  33     1
## 6     5     116 74       0       0 25.6    0.201  30     0
pairs(pima, gap = 0.5, col = "navyblue")
pimalm <- lm(class ~ npreg + glucose + bp + triceps + insulin + bmi + diabetes + age, data = pima)
summary(pimalm)
##
## Call:
## lm(formula = class ~ npreg + glucose + bp + triceps + insulin +
##     bmi + diabetes + age, data = pima)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.01348 -0.29513 -0.09541  0.32112  1.24160
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8538943  0.0854850  -9.989  < 2e-16 ***
## npreg        0.0205919  0.0051300   4.014 6.56e-05 ***
## glucose      0.0059203  0.0005151  11.493  < 2e-16 ***
## bp          -0.0023319  0.0008116  -2.873  0.00418 **
## triceps      0.0001545  0.0011122   0.139  0.88954
## insulin     -0.0001805  0.0001498  -1.205  0.22857
## bmi          0.0132440  0.0020878   6.344 3.85e-10 ***
## diabetes     0.1472374  0.0450539   3.268  0.00113 **
## age          0.0026214  0.0015486   1.693  0.09092 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4002 on 759 degrees of freedom
## Multiple R-squared: 0.3033, Adjusted R-squared: 0.2959
## F-statistic: 41.29 on 8 and 759 DF, p-value: < 2.2e-16
Remove the variable with the largest p-value (triceps, p = 0.890), which is far above the 0.005 cutoff.
pimalm <- update(pimalm, . ~ . - triceps, data = pima)
summary(pimalm)
##
## Call:
## lm(formula = class ~ npreg + glucose + bp + insulin + bmi + diabetes +
##     age, data = pima)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.01707 -0.29614 -0.09656  0.32073  1.24183
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8537906  0.0854265  -9.994  < 2e-16 ***
## npreg        0.0205939  0.0051266   4.017 6.48e-05 ***
## glucose      0.0059092  0.0005086  11.619  < 2e-16 ***
## bp          -0.0023152  0.0008022  -2.886  0.00401 **
## insulin     -0.0001721  0.0001370  -1.257  0.20929
## bmi          0.0133382  0.0019733   6.759 2.76e-11 ***
## diabetes     0.1478835  0.0447843   3.302  0.00100 **
## age          0.0025991  0.0015393   1.688  0.09173 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4 on 760 degrees of freedom
## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2968
## F-statistic: 47.25 on 7 and 760 DF, p-value: < 2.2e-16
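The screening step used here (look at each predictor's p-value and flag those above a cutoff) can also be done programmatically instead of by reading the printed table. The following is a minimal sketch under stated assumptions: the data frame `sim` is simulated stand-in data, not the Pima data, and the names `alpha`, `candidates`, and `worst` are illustrative only.

```r
# Simulated stand-in data (NOT the Pima data): y depends on x1 only.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                      # pure-noise predictor
y  <- 1 + 2 * x1 + rnorm(n)
sim <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = sim)

# summary()$coefficients is a matrix; the "Pr(>|t|)" column holds the p-values.
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
pvals <- pvals[names(pvals) != "(Intercept)"]   # screen predictors only

# Cutoff chosen to match the 0.005 used in the text (an assumption of this sketch).
alpha      <- 0.005
candidates <- names(pvals)[pvals > alpha]       # predictors above the cutoff
worst      <- names(which.max(pvals))           # predictor with the largest p-value

print(round(pvals, 4))
print(candidates)
print(worst)
```

Dropping one variable at a time, starting with `worst`, and refitting before looking at the p-values again is usually safer than dropping several at once, because the remaining p-values change after each refit.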
Remove the variables insulin and age, whose p-values (0.209 and 0.092) also exceed the 0.005 cutoff. After these variables are dropped, the R-squared value remains about the same (0.3033 for the full model versus 0.2987 for the reduced model). This suggests that the dropped variables contribute little to the model.
pimalm <- update(pimalm, . ~ . - insulin, data = pima)
pimalm <- update(pimalm, . ~ . - age, data = pima)
summary(pimalm)
##
## Call:
## lm(formula = class ~ npreg + glucose + bp + bmi + diabetes, data = pima)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.12594 -0.29476 -0.09844  0.31568  1.25963
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7966369  0.0815821  -9.765  < 2e-16 ***
## npreg        0.0258008  0.0043741   5.898 5.51e-09 ***
## glucose      0.0058987  0.0004726  12.482  < 2e-16 ***
## bp          -0.0020911  0.0007896  -2.648  0.00826 **
## bmi          0.0128248  0.0019613   6.539 1.14e-10 ***
## diabetes     0.1429909  0.0444358   3.218  0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4007 on 762 degrees of freedom
## Multiple R-squared: 0.2987, Adjusted R-squared: 0.2941
## F-statistic: 64.92 on 5 and 762 DF, p-value: < 2.2e-16
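R-squared can never increase when predictors are removed from a least-squares model, so a small drop on its own is expected. One common way to check whether the drop is negligible is a partial F-test between the reduced and the full model with `anova()`. This is not part of the original analysis; the sketch below uses simulated stand-in data (`sim`, with hypothetical predictors `x1`, `x2`, `x3`) to show the shape of such a comparison.

```r
# Simulated stand-in data (NOT the Pima data): only x1 has a true effect.
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 0.5 + 1.5 * x1 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

full    <- lm(y ~ x1 + x2 + x3, data = sim)
reduced <- update(full, . ~ . - x2 - x3)   # drop the two candidate variables

# R-squared before and after dropping the variables.
r2_full    <- summary(full)$r.squared
r2_reduced <- summary(reduced)$r.squared
print(c(full = r2_full, reduced = r2_reduced))

# Partial F-test: do x2 and x3 jointly explain a significant share of variance?
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row of `cmp` would support the reading that the dropped variables add little, while a small one would suggest that at least one of them should stay.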
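The residual plot shows the points falling along nearly straight lines around zero rather than in a random scatter. This pattern arises because the response (class) takes only the values 0 and 1, and it indicates that a linear model is not a robust fit for this data.
plot(fitted(pimalm), resid(pimalm))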
qqnorm(resid(pimalm), col = "blue")
qqline(resid(pimalm), col = "red")
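The straight-line pattern in the residual plot follows directly from the definition residual = observed - fitted: when the observed value can only be 0 or 1, every residual equals either `-fitted` or `1 - fitted`. A minimal sketch on simulated stand-in data (the data frame `sim` and its coefficients are made up for illustration, not taken from the Pima data) makes this visible:

```r
# Simulated 0/1 response (NOT the Pima data).
set.seed(7)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x))
sim <- data.frame(y, x)

fit <- lm(y ~ x, data = sim)
f   <- fitted(fit)
r   <- resid(fit)

# Each residual lies exactly on one of two parallel lines with slope -1.
on_line0 <- all.equal(unname(r[sim$y == 0]), unname(-f[sim$y == 0]))
on_line1 <- all.equal(unname(r[sim$y == 1]), unname(1 - f[sim$y == 1]))
print(c(on_line0, on_line1))

# Same kind of plot as above, coloured by class, with the two lines overlaid.
plot(f, r, col = ifelse(sim$y == 1, "red", "navyblue"),
     xlab = "Fitted values", ylab = "Residuals")
abline(a = 0, b = -1, lty = 2)   # residuals of observations with y == 0
abline(a = 1, b = -1, lty = 2)   # residuals of observations with y == 1
```

Because the residuals are confined to these two lines, they cannot be normally distributed with constant variance, which is what the Q-Q plot above is checking.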
The second dataset has much simpler variables. Although both variables intuitively affect the output, the size of each variable's effect is of interest. This dataset was examined to get a better sense of how multivariate regression performs.
library(DAAG)
## Warning: package 'DAAG' was built under R version 3.3.3
## Loading required package: lattice
data(allbacks)
plot(allbacks, gap = 0.5, col = "red")
allbacks.lm <- lm(weight ~ volume + area, data = allbacks)
summary(allbacks.lm)
##
## Call:
## lm(formula = weight ~ volume + area, data = allbacks)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -104.06  -30.02  -15.46   16.76  212.30
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.41342   58.40247   0.384 0.707858
## volume       0.70821    0.06107  11.597 7.07e-08 ***
## area         0.46843    0.10195   4.595 0.000616 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.66 on 12 degrees of freedom
## Multiple R-squared: 0.9285, Adjusted R-squared: 0.9166
## F-statistic: 77.89 on 2 and 12 DF, p-value: 1.339e-07
plot(fitted(allbacks.lm), resid(allbacks.lm))
qqnorm(resid(allbacks.lm), col = "blue")
qqline(resid(allbacks.lm), col = "red")
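To make the size of each variable's effect concrete, the fitted equation from the `allbacks.lm` summary can be evaluated by hand. The sketch below copies the three estimates from that output. The helper `predict_weight` and the two example books are hypothetical, chosen only to illustrate the arithmetic.

```r
# Coefficients copied from the summary(allbacks.lm) output above.
b0     <- 22.41342   # intercept
b_vol  <- 0.70821    # change in weight per unit of volume
b_area <- 0.46843    # change in weight per unit of area

# Hypothetical helper: evaluate the fitted linear equation.
predict_weight <- function(volume, area) {
  b0 + b_vol * volume + b_area * area
}

# Two made-up books with the same volume but different area.
w_a <- predict_weight(volume = 1000, area = 0)
w_b <- predict_weight(volume = 1000, area = 400)

print(c(w_a = w_a, w_b = w_b, difference = w_b - w_a))
```

With the fitted model object available, `predict(allbacks.lm, newdata = data.frame(volume = 1000, area = 400))` should give the same value up to the rounding of the printed coefficients.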