1. Download the dataset from the source.

pima <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", header = FALSE, sep = ",")
2. Clean up the data.
2.a Assign the column names.
colnames(pima) <- c("npreg", "glucose","bp", "triceps", "insulin", "bmi", "diabetes", "age", "class")
head(pima)
##   npreg glucose bp triceps insulin  bmi diabetes age class
## 1     6     148 72      35       0 33.6    0.627  50     1
## 2     1      85 66      29       0 26.6    0.351  31     0
## 3     8     183 64       0       0 23.3    0.672  32     1
## 4     1      89 66      23      94 28.1    0.167  21     0
## 5     0     137 40      35     168 43.1    2.288  33     1
## 6     5     116 74       0       0 25.6    0.201  30     0
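The rows printed above contain zeros in triceps and insulin. As a minimal sketch of a further clean-up step, the code below rebuilds those six rows by hand and recodes zeros as NA in selected columns. Treating zero as "not measured" is an assumption, not something stated by the source, and the names `pima_head`, `zero_as_missing`, and `pima_clean` are hypothetical.

```r
# The six rows shown by head(pima), typed in by hand so the sketch runs offline
pima_head <- data.frame(
  npreg    = c(6, 1, 8, 1, 0, 5),
  glucose  = c(148, 85, 183, 89, 137, 116),
  bp       = c(72, 66, 64, 66, 40, 74),
  triceps  = c(35, 29, 0, 23, 35, 0),
  insulin  = c(0, 0, 0, 94, 168, 0),
  bmi      = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  diabetes = c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201),
  age      = c(50, 31, 32, 21, 33, 30),
  class    = c(1, 0, 1, 0, 1, 0)
)

# Assumption: a zero in these columns means the value was not recorded
zero_as_missing <- c("glucose", "bp", "triceps", "insulin", "bmi")

pima_clean <- pima_head
pima_clean[zero_as_missing] <- lapply(
  pima_clean[zero_as_missing],
  function(x) replace(x, x == 0, NA)
)

# Count the missing values per column after recoding
colSums(is.na(pima_clean))
```

npreg is left out of `zero_as_missing` on purpose, since zero pregnancies is a plausible value.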
pairs(pima, gap = 0.5, col = "navyblue")

pimalm <- lm(class ~ npreg + glucose + bp + triceps + insulin + bmi + diabetes + age, data = pima)

summary(pimalm)
## 
## Call:
## lm(formula = class ~ npreg + glucose + bp + triceps + insulin + 
##     bmi + diabetes + age, data = pima)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01348 -0.29513 -0.09541  0.32112  1.24160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8538943  0.0854850  -9.989  < 2e-16 ***
## npreg        0.0205919  0.0051300   4.014 6.56e-05 ***
## glucose      0.0059203  0.0005151  11.493  < 2e-16 ***
## bp          -0.0023319  0.0008116  -2.873  0.00418 ** 
## triceps      0.0001545  0.0011122   0.139  0.88954    
## insulin     -0.0001805  0.0001498  -1.205  0.22857    
## bmi          0.0132440  0.0020878   6.344 3.85e-10 ***
## diabetes     0.1472374  0.0450539   3.268  0.00113 ** 
## age          0.0026214  0.0015486   1.693  0.09092 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4002 on 759 degrees of freedom
## Multiple R-squared:  0.3033, Adjusted R-squared:  0.2959 
## F-statistic: 41.29 on 8 and 759 DF,  p-value: < 2.2e-16

Remove the variable with the largest p-value (triceps, p = 0.88954), which is well above the cutoff of 0.005.

pimalm <- update(pimalm, . ~ . - triceps, data = pima)
summary(pimalm)
## 
## Call:
## lm(formula = class ~ npreg + glucose + bp + insulin + bmi + diabetes + 
##     age, data = pima)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01707 -0.29614 -0.09656  0.32073  1.24183 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8537906  0.0854265  -9.994  < 2e-16 ***
## npreg        0.0205939  0.0051266   4.017 6.48e-05 ***
## glucose      0.0059092  0.0005086  11.619  < 2e-16 ***
## bp          -0.0023152  0.0008022  -2.886  0.00401 ** 
## insulin     -0.0001721  0.0001370  -1.257  0.20929    
## bmi          0.0133382  0.0019733   6.759 2.76e-11 ***
## diabetes     0.1478835  0.0447843   3.302  0.00100 ** 
## age          0.0025991  0.0015393   1.688  0.09173 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4 on 760 degrees of freedom
## Multiple R-squared:  0.3032, Adjusted R-squared:  0.2968 
## F-statistic: 47.25 on 7 and 760 DF,  p-value: < 2.2e-16

Remove the variables insulin and age, which also have large p-values (p-value > 0.005). After these variables are dropped, the R-squared value remains about the same (0.3033 in the full model, 0.2987 in the reduced model). This suggests that the dropped variables have little effect on the model.

pimalm <- update(pimalm, . ~ . - insulin, data = pima)
pimalm <- update(pimalm, . ~ . - age, data = pima)
summary(pimalm)
## 
## Call:
## lm(formula = class ~ npreg + glucose + bp + bmi + diabetes, data = pima)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12594 -0.29476 -0.09844  0.31568  1.25963 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.7966369  0.0815821  -9.765  < 2e-16 ***
## npreg        0.0258008  0.0043741   5.898 5.51e-09 ***
## glucose      0.0058987  0.0004726  12.482  < 2e-16 ***
## bp          -0.0020911  0.0007896  -2.648  0.00826 ** 
## bmi          0.0128248  0.0019613   6.539 1.14e-10 ***
## diabetes     0.1429909  0.0444358   3.218  0.00135 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4007 on 762 degrees of freedom
## Multiple R-squared:  0.2987, Adjusted R-squared:  0.2941 
## F-statistic: 64.92 on 5 and 762 DF,  p-value: < 2.2e-16
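The comparison of R-squared before and after dropping terms can be backed up with a partial F-test on the nested models. The sketch below uses the built-in mtcars data as a stand-in so that it runs without a download. The choice of mpg, wt, hp, and qsec is purely illustrative and has nothing to do with the pima variables.

```r
# Stand-in data: mtcars ships with base R
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Drop one term, in the same way the pima model was reduced with update()
reduced <- update(full, . ~ . - qsec)

# R-squared never increases when a term is dropped;
# adjusted R-squared penalises the number of terms
r2  <- c(full = summary(full)$r.squared,
         reduced = summary(reduced)$r.squared)
adj <- c(full = summary(full)$adj.r.squared,
         reduced = summary(reduced)$adj.r.squared)
print(rbind(r2, adj))

# Partial F-test: does the dropped term explain a significant
# share of the variance that the reduced model leaves over?
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row of the anova table would support dropping the term. A small one would suggest keeping it.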

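The residual plot shows points falling along almost straight lines instead of a random scatter around zero. Such a pattern is typical when a linear model is fitted to a 0/1 response like class, and it indicates that this model is not very robust.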

plot(fitted(pimalm), resid(pimalm))

qqnorm(resid(pimalm), col = "blue")
qqline(resid(pimalm), col = "red")
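The straight-line pattern in the residual plot can be reproduced on simulated data. The sketch below assumes a hypothetical binary outcome `y` and a single predictor `x`, neither of which comes from the pima data. It checks that the residuals of lm fall on exactly two lines: -fitted for y = 0 and 1 - fitted for y = 1.

```r
set.seed(1)

# Hypothetical data: one numeric predictor and a 0/1 outcome
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(x))

# Linear model fitted to the binary outcome, as with class above
fit <- lm(y ~ x)
f <- unname(fitted(fit))
r <- unname(resid(fit))

# residual = observed - fitted, and observed is either 0 or 1
on_lower_line <- isTRUE(all.equal(r[y == 0], -f[y == 0]))
on_upper_line <- isTRUE(all.equal(r[y == 1], 1 - f[y == 1]))
print(c(lower = on_lower_line, upper = on_upper_line))

# The plot shows two parallel bands, one per outcome value
plot(f, r, col = ifelse(y == 1, "red", "navyblue"),
     xlab = "fitted", ylab = "residual")
```

Because the residuals are confined to two lines by construction, they cannot look like random noise around zero, whatever predictors are used.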

The second dataset has much simpler variables. Although intuitively both variables affect the output, the size of each variable's effect is interesting. This dataset was examined to get a better sense of how multivariate regression performs.

library(DAAG)
## Warning: package 'DAAG' was built under R version 3.3.3
## Loading required package: lattice
data(allbacks)
plot(allbacks, gap = 0.5, col = "red")

allbacks.lm <- lm(weight ~ volume + area, data = allbacks)
summary(allbacks.lm)
## 
## Call:
## lm(formula = weight ~ volume + area, data = allbacks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -104.06  -30.02  -15.46   16.76  212.30 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.41342   58.40247   0.384 0.707858    
## volume       0.70821    0.06107  11.597 7.07e-08 ***
## area         0.46843    0.10195   4.595 0.000616 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.66 on 12 degrees of freedom
## Multiple R-squared:  0.9285, Adjusted R-squared:  0.9166 
## F-statistic: 77.89 on 2 and 12 DF,  p-value: 1.339e-07
plot(fitted(allbacks.lm), resid(allbacks.lm))

qqnorm(resid(allbacks.lm), col = "blue")
qqline(resid(allbacks.lm), col = "red")
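To see how much each variable contributes, the fitted coefficients printed by summary(allbacks.lm) can be applied by hand. The sketch below copies the three estimates from that output and evaluates them for a hypothetical book. The values 800 and 200 are made up for illustration and are not taken from allbacks.

```r
# Estimates copied from the summary(allbacks.lm) output
b_intercept <- 22.41342
b_volume    <- 0.70821
b_area      <- 0.46843

# Hypothetical book, in the same units as the dataset
volume <- 800
area   <- 200

# Contribution of each term to the predicted weight
volume_part <- b_volume * volume
area_part   <- b_area * area
predicted_weight <- b_intercept + volume_part + area_part

print(c(volume_part = volume_part,
        area_part = area_part,
        predicted_weight = predicted_weight))
```

With the fitted model object available, `predict(allbacks.lm, newdata = data.frame(volume = 800, area = 200))` should give the same value up to rounding of the printed coefficients.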