R Helpbook

Contents


example 1

1. Boxplot

2. Correlation

3. Basic linear regression

4. ANOVA, pairwise comparison

5. Diagnostic Plots

6. VIF, Predictions, and intervals


example 2

1. Stepwise Regression


example 3

1. Boxplots

2. Binomial logistic regression


example 4

1. Poisson Regression, Offsetting model

2. Chi-Square Significance Test

3. Goodness of Fit, Dispersion parameter

4. Count per area (rate) model

5. Wald test


example 5

1. 10-fold and Leave one out cross validation

2. Mallow’s CP, AIC, BIC

3. Mallow’s CP model

4. Forward/Backward Stepwise

5. Ridge Regression

6. LASSO

7. Elastic Net

8. Predictions



example 1

1. A boxplot gives a visual display of a continuous variable across the levels of a categorical variable.

library(faraway)
library(ggplot2)

data(diabetes)
df=na.omit(diabetes[,c(5,8,10,11,12,17,18)])

ggplot(data=df, aes(x=as.factor(frame),y=ratio, fill=frame))+
  geom_boxplot(alpha=0.3)+
  xlab("frame")+
  ylab("ratio")+
  ggtitle("frame size vs diabetes")+ 
  theme(axis.text.x = element_text(face="bold", color="cornflowerblue", size=14),
          axis.text.y = element_text(face="bold", color="royalblue4", 
          size=16, angle=25), legend.position = "none",
          axis.title=element_text(size=17,face="italic"),
          plot.title = element_text(size=20,face="bold.italic"))


2. Two methods are shown for displaying correlation between continuous variables.

library(GGally)
library(corrplot)

x <- cor(df[,c(1,2,3,4,6,7)])
corrplot.mixed(x)


ggpairs(df[,c(1,2,3,4,6,7)],ggplot2::aes(color="salmon"))


3. Basic Linear Regression with summary display.

library(broom)

model <- lm(ratio~.,data=df)
#broom summary with optional parameters
tidy(model,conf.int = TRUE, conf.level = 0.90)

#default summary with additional information
summary(model)

Call:
lm(formula = ratio ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5543 -1.1074 -0.1708  0.7815 13.9703 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  2.1524812  2.2267713   0.967   0.3343  
age          0.0121734  0.0058294   2.088   0.0374 *
height      -0.0006488  0.0271067  -0.024   0.9809  
weight       0.0097284  0.0055837   1.742   0.0823 .
framemedium  0.4857166  0.2152938   2.256   0.0246 *
framelarge   0.3550940  0.2774862   1.280   0.2014  
waist        0.0822015  0.0336228   2.445   0.0150 *
hip         -0.0771822  0.0348566  -2.214   0.0274 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.628 on 375 degrees of freedom
Multiple R-squared:  0.1399,    Adjusted R-squared:  0.1238 
F-statistic: 8.713 on 7 and 375 DF,  p-value: 6.304e-10

4. ANOVA can be used to carry out a partial F-test. If you want to compare differences between groups, use Tukey’s post-hoc test as a pairwise comparison. ANOVA can also be used to compare two models and determine whether the predictors used in one model are significant and worth keeping. Note the difference between the aov() and anova() functions.

anova.model <- aov(ratio~frame,data=df)
summary(anova.model)
             Df Sum Sq Mean Sq F value   Pr(>F)    
frame         2   76.5   38.27   13.47 2.22e-06 ***
Residuals   380 1079.2    2.84                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TukeyHSD(anova.model)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = ratio ~ frame, data = df)

$frame
                  diff        lwr       upr     p adj
medium-small 0.8329398  0.3409425 1.3249372 0.0002398
large-small  1.1882475  0.6288722 1.7476228 0.0000026
large-medium 0.3553077 -0.1382692 0.8488845 0.2088717
#Compare two models
model2 <- lm(ratio~age,data=df)
anova(model,model2)
Analysis of Variance Table

Model 1: ratio ~ age + height + weight + frame + waist + hip
Model 2: ratio ~ age
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    375  994.04                                  
2    381 1127.62 -6   -133.58 8.3989 1.465e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

5. Diagnostic Plots

library(ggfortify)
library(car)
library(lindia)
#histogram of residuals
gg_reshist(model, bins=30)

#residuals vs fitted
autoplot(model, label.size=3, which=c(1,0))

#Q-Q plot
autoplot(model, label.size=3, which=c(2,0))

#Two ways to display Cook's distance
autoplot(model, label.size=3, which=c(4,0))


df$cook <- cooks.distance(model) #adds column of cooks distance values to original df
cook_val <- 4/nrow(df) #standard "bad" cook distance is 4/n

ggplot(df, aes(x=as.numeric(rownames(df)),y=cook))+ 
  geom_col(color="cornflowerblue",alpha=.3,size=2)+
  ylab("Cook's Distance")+
  theme(axis.text.x = element_text(face="bold", color="royalblue4", size=14),
          axis.text.y = element_text(face="bold", color="royalblue4", 
          size=16, angle=25), legend.position = "none",
          axis.title=element_text(size=17,face="italic"),
          plot.title = element_text(size=20,face="bold.italic"))+
  xlab("")

outliers <- which(df$cook > cook_val)
cat("There are ",length(outliers)," outliers detected by Cook's Distance in the model.")
There are  12  outliers detected by Cook's Distance in the model.

6. VIF, Predictions, and Intervals. The predict() function can be used to make predictions with the model as well as to create both prediction and confidence intervals; a VIF check for multicollinearity follows at the end of this section.

pred.data <- data.frame("age"=30,"frame"="medium","height"=66,"weight"=250,"waist"=45,"hip"=55)
prediction <- predict(model, pred.data,interval=c("prediction"))

cat("The model predicts that a woman with the given parameters will have an HDL/Cholesterol ratio of ", prediction[1],".",sep="","\n")
The model predicts that a woman with the given parameters will have an HDL/Cholesterol ratio of 4.846717.
#confidence interval at 90%
confint <- predict(model, pred.data,interval=c("confidence"), level =0.9)
cat("The 90% confidence interval for this model is (",confint[2],", ",confint[3],").",sep="" )
The 90% confidence interval for this model is (4.449433, 5.244001).
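
The contents also list VIF for this section. Below is a minimal sketch of a multicollinearity check with vif() from the car package (loaded in the diagnostics section above); as a rough rule of thumb, values above about 5-10 flag problematic collinearity, and car reports generalized VIFs for factor terms such as frame.

library(car)
#variance inflation factors for the predictors in the fitted model
vif(model)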

example 2

1. Using forward-backward stepwise regression for variable selection. The final selected variables from this example are Day of Week, Scheduled Departure, Destination Airport, and Airline.

df <- read.csv("flights.csv", header=TRUE, sep=",")
model <- lm(DEPARTURE_DELAY~DAY_OF_WEEK+DISTANCE+AIRLINE+DESTINATION_AIRPORT+SCHEDULED_ARRIVAL+SCHEDULED_DEPARTURE,data=df)

empty <- lm(DEPARTURE_DELAY~1,data=df)
mstep <- step(model,scope=list(lower=empty,upper=model),direction = "both")
Start:  AIC=45981.69
DEPARTURE_DELAY ~ DAY_OF_WEEK + DISTANCE + AIRLINE + DESTINATION_AIRPORT + 
    SCHEDULED_ARRIVAL + SCHEDULED_DEPARTURE


Step:  AIC=45981.69
DEPARTURE_DELAY ~ DAY_OF_WEEK + AIRLINE + DESTINATION_AIRPORT + 
    SCHEDULED_ARRIVAL + SCHEDULED_DEPARTURE

                       Df Sum of Sq     RSS   AIC
- SCHEDULED_ARRIVAL     1       525 5212868 45980
<none>                              5212342 45982
- DAY_OF_WEEK           1      3220 5215562 45984
- SCHEDULED_DEPARTURE   1      3434 5215776 45984
- DESTINATION_AIRPORT 155    364978 5577321 46138
- AIRLINE              10    383447 5595789 46450

Step:  AIC=45980.38
DEPARTURE_DELAY ~ DAY_OF_WEEK + AIRLINE + DESTINATION_AIRPORT + 
    SCHEDULED_DEPARTURE

                       Df Sum of Sq     RSS   AIC
<none>                              5212868 45980
+ SCHEDULED_ARRIVAL     1       525 5212342 45982
- DAY_OF_WEEK           1      3115 5215983 45982
- SCHEDULED_DEPARTURE   1      8547 5221415 45990
- DESTINATION_AIRPORT 155    364459 5577327 46136
- AIRLINE              10    383614 5596481 46449
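
The object returned by step() is itself the selected lm fit, so the chosen formula and coefficient summary can be inspected directly:

#final model chosen by stepwise selection
formula(mstep)
summary(mstep)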

example 3

1. More Boxplots!

df= read.csv("echo.csv", header=TRUE, sep=",")
library(ggplot2)

ggplot(data=df, aes(x=as.factor(still_alive),y=age, fill=still_alive))+
  geom_boxplot(alpha=0.3)+
  xlab("State of Existence")+
  ylab("Age when Heart Attack Occurred")+
  ggtitle("Age x Death")+ 
  scale_x_discrete(labels=c("0" = "Dead", "1" = "Alive"))+
  theme(axis.text.x = element_text(face="bold", color="cornflowerblue", size=14),
          axis.text.y = element_text(face="bold", color="royalblue4", 
          size=16, angle=25), legend.position = "none",
          axis.title=element_text(size=17,face="italic"),
          plot.title = element_text(size=20,face="bold.italic"))

ggplot(data=df, aes(x=as.factor(still_alive),y=fractional_shortening, fill=still_alive))+
  geom_boxplot(alpha=0.3)+
  xlab("Presence of Mortality")+
  ylab("Contracility around heart")+
  ggtitle("fractional shortening x Death")+ 
  scale_x_discrete(labels=c("0" = "Dead", "1" = "Alive"))+
  theme(axis.text.x = element_text(face="bold", color="cornflowerblue", size=14),
          axis.text.y = element_text(face="bold", color="royalblue4", 
          size=16, angle=25), legend.position = "none",
          axis.title=element_text(size=17,face="italic"),
          plot.title = element_text(size=20,face="bold.italic"))


2. Create a binomial generalized linear model with glm(). Use the model summary to determine the coefficients.

Parameters for still_alive:

  • intercept: 8.54944 *significant at p<.04
  • age: -0.08994 *significant at p<.05
  • fractional_shortening: 5.90295 *significant at p<.05
  • lvdd: -0.65886 *significant at p<.05

m <- glm(still_alive~age+fractional_shortening+lvdd, family="binomial", data=df)
tidy(m,conf.int = TRUE, conf.level = 0.90)
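
Because the binomial model uses the logit link, exponentiating the coefficients puts them on the odds-ratio scale; a minimal sketch using the model m fit above:

#odds ratios for a one-unit increase in each predictor
exp(coef(m))
#Wald-based 90% confidence intervals on the odds-ratio scale
exp(confint.default(m, level = 0.90))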

example 4

1. Create a Poisson regression model with the data. Create a second model that offsets the response variable with the log of area. The model parameters are the coefficients of the five predictor variables Area, Elevation, Nearest, Scruz, and Adjacent, along with the intercept. Estimates for each of these parameters are given in the ‘Estimate’ column of the model summary.

Poisson regression uses a log transformation on the desired response, so the equation is \(Species = exp(3.155 - 0.0005799*Area + 0.003541*Elevation + 0.008826*Nearest - 0.005709*Scruz - 0.0006630*Adjacent)\)

For the coefficient for Elevation:

  • For every 1m increase in highest elevation, the log expected number of species (or the log rate of species per island) increases by 0.003541 when holding the other variables constant.
  • The ratio of expected number of species per island for a 1m increase in highest elevation is exp(0.003541)=1.0035, holding the other variables constant.
  • For every 1m increase in highest elevation, the expected number of species increases by a factor of exp(0.003541)=1.0035 (i.e. a 0.35% increase) when holding the other variables constant.

For the coefficient for Scruz:

  • For every 1km increase in distance to Santa Cruz, the log expected number of species (or the log rate of species per island) decreases by 0.005709 when holding the other variables constant.
  • The expected ratio of number of species per island for a 1km increase in distance to Santa Cruz is exp(-0.005709)=0.9943, holding the other variables constant.
  • For every 1km increase in distance to Santa Cruz, the expected number of species decreases by a factor of exp(-0.005709)=0.9943 (i.e. a 0.57% decrease) when holding the other variables constant.
require(faraway)
data(gala)
mydata <- gala
df <- mydata[-2]

model1 <- glm(Species ~ .,family=poisson, df)
model2 = glm(Species ~ Elevation+Nearest+Scruz+Adjacent+offset(log(Area)),
             data = mydata, family = poisson)
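
The coefficient estimates quoted in the equation above come from the first model’s summary, which can be displayed with:

summary(model1)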

2. Chi square: To test if the overall model is significant, we use a chi-square test comparing the fitted model to the null model. The p-value is close to 0 indicating the model is significant overall.

1-pchisq((model1$null.dev-model1$deviance), (model1$df.null-model1$df.resid))
[1] 0

3. Goodness of Fit and Dispersion: We perform goodness of fit hypothesis tests using both deviance and Pearson residual, and calculate the dispersion parameter.

P-values from both goodness of fit tests are close to 0, suggesting our model is not a good fit. These findings do not contradict the overall significance test in step 2, because it is possible for a model to have predictive ability while still being a poor fit. Further, the dispersion parameter is much larger than 2, so there is overdispersion.

#With deviance residuals
1-pchisq(model1$deviance,model1$df.residual)
[1] 0
#with Pearson residuals
pResid <- resid(model1, type = "pearson")
1-pchisq(sum(pResid^2),model1$df.residual)
[1] 0
#dispersion parameter
model1$deviance/model1$df.residual
[1] 29.86857

4. Let’s create a rate-based model for the same dataset. Now the response will be the density of species (number of species/km^2), so the exposure in this case will be Area. Call this model2.

The number of species is now scaled by the area of the island, so the equation is \(Species/Area = exp(2.155 - 0.002966*Elevation - 0.01674*Nearest - 0.001078*Scruz + 0.0001568*Adjacent)\)

For the coefficient of Adjacent:

  • For every 1km^2 increase in adjacent island area, the log expected number of species per square kilometer increases by 0.0001568 when holding the other variables constant.
  • The ratio of the expected number of species per square kilometer with a 1km^2 increase in adjacent island area to the expected number without the increase is exp(0.0001568)=1.000157, holding the other variables constant.
  • For every 1km^2 increase in area of the adjacent island, the estimated rate of species per square kilometer increases by a factor of exp(0.0001568)=1.000157 (i.e. a 0.0157% increase) when holding the other variables constant.

model2 = glm(Species ~ Elevation+Nearest+Scruz+Adjacent+offset(log(Area)),
             data = mydata, family = poisson)
summary(model2)

Call:
glm(formula = Species ~ Elevation + Nearest + Scruz + Adjacent + 
    offset(log(Area)), family = poisson, data = mydata)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-14.2091   -0.6762    3.4247    6.9750   14.4016  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.155e+00  6.222e-02  34.642  < 2e-16 ***
Elevation   -2.966e-03  5.913e-05 -50.153  < 2e-16 ***
Nearest     -1.674e-02  1.553e-03 -10.780  < 2e-16 ***
Scruz       -1.078e-03  7.384e-04  -1.460    0.144    
Adjacent     1.568e-04  2.808e-05   5.582 2.37e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5766.9  on 29  degrees of freedom
Residual deviance: 1565.0  on 25  degrees of freedom
AIC: 1735.9

Number of Fisher Scoring iterations: 6

5. Use a Wald test to compare model2 with a model containing only Elevation and Scruz, to assess whether the information about nearby islands (Nearest and Adjacent) is significant. The near-zero p-value below indicates that these terms are jointly significant.

library(aod)
wald.test(b=coef(model2), Sigma=vcov(model2), Terms=c(3,5))
Wald test:
----------

Chi-squared test:
X2 = 188.4, df = 2, P(> X2) = 0.0

example 5

1. Create a general linear model and find 10-fold and leave one out cross validation scores.

require(faraway)
data(debt)
fullData=na.omit(debt)
set.seed(69)
testRows=sample(nrow(fullData),0.1*nrow(fullData))
testData=fullData[testRows, ]
trainData=fullData[-testRows, ]
library(boot)

mod1=lm(prodebt~.,data=trainData)
n=nrow(trainData)
cat("10-fold cross validation score: ",cv.glm(trainData,mod1,K=10)$delta[1],"\nLeave on out cross validation (LOOCV) score: ", cv.glm(trainData,mod1,K=n)$delta[1])
10-fold cross validation score:  NaN 
Leave on out cross validation (LOOCV) score:  NaN
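
The NaN scores above are likely because cv.glm() expects a glm object, which stores the response in its $y component, while lm() does not keep it by default. A minimal sketch refitting the same model with glm() (the default gaussian family gives the same fit as lm), which should return numeric CV scores:

mod1.glm=glm(prodebt~.,data=trainData) #gaussian glm, equivalent fit to mod1
cv.glm(trainData,mod1.glm,K=10)$delta[1] #10-fold cross validation estimate
cv.glm(trainData,mod1.glm,K=n)$delta[1]  #LOOCV estimate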

2. Mallow’s CP, AIC, and BIC

library(CombMSC)
cat("Mallow's CP: ",Cp(mod1,S2=summary(mod1)$sigma^2),"\nAIC: ", AIC(mod1,k=2),"\nBIC: ",AIC(mod1,k=log(n)))
Mallow's CP:  13 
AIC:  571.4466 
BIC:  622.0304
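
As a cross-check, the base R functions AIC() and BIC() should agree with the values above, since BIC is AIC computed with k = log(n):

#base-R information criteria for the full model
AIC(mod1)
BIC(mod1)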

3. Create a model using Mallow’s CP. The 1’s and 0’s correspond to whether that variable should be used or not. This tells us that variables 1, 5, 8, 9, 11, and 12 should be used to build the new regression model.

library(leaps)
out = leaps(trainData[,-13], trainData$prodebt, method = "Cp")
#cbind(as.matrix(out$which),out$Cp) #uncomment to display full table of Mallows values

best.model = which(out$Cp==min(out$Cp))
cbind(as.matrix(out$which),out$Cp)[best.model,]
       1        2        3        4        5        6        7        8        9        A        B        C          
1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 7.054524 
mod2=lm(formula = prodebt ~ incomegp + agegp + manage +
          ccarduse + cigbuy + xmasbuy + locintrn, data = trainData)
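
To map the selected column positions back to variable names, a quick check using the indices reported above (note that mod2 as written also keeps cigbuy, one variable more than the Cp-optimal set):

#names of the predictors flagged by the lowest-Cp row
colnames(trainData[,-13])[c(1,5,8,9,11,12)]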

4. Forward and Backward stepwise regression

minmod=lm(prodebt~1,data=trainData)
mod3 <- step(minmod, scope = list(lower=minmod,upper=mod1), direction = "forward",trace=F)
backward.step <- step(mod1, scope = list(lower=minmod,upper=mod1), direction = "backward",trace=F)

summary(mod3);summary(backward.step)

Call:
lm(formula = prodebt ~ ccarduse + manage + agegp + locintrn + 
    xmasbuy + incomegp, data = trainData)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.00195 -0.41792  0.03008  0.39156  1.85551 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.99454    0.30192  13.230  < 2e-16 ***
ccarduse     0.19906    0.05449   3.653 0.000312 ***
manage      -0.12607    0.04726  -2.667 0.008113 ** 
agegp       -0.13038    0.04433  -2.941 0.003555 ** 
locintrn    -0.15995    0.04749  -3.368 0.000868 ***
xmasbuy      0.25726    0.12677   2.029 0.043411 *  
incomegp     0.06169    0.03256   1.895 0.059229 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6684 on 267 degrees of freedom
Multiple R-squared:  0.1823,    Adjusted R-squared:  0.164 
F-statistic: 9.923 on 6 and 267 DF,  p-value: 6.902e-10


Call:
lm(formula = prodebt ~ incomegp + agegp + bsocacc + manage + 
    ccarduse + cigbuy + xmasbuy + locintrn, data = trainData)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.99009 -0.45940  0.02515  0.40054  1.85617 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.12022    0.31368  13.135  < 2e-16 ***
incomegp     0.05792    0.03289   1.761  0.07942 .  
agegp       -0.13582    0.04447  -3.055  0.00248 ** 
bsocacc     -0.12329    0.08589  -1.435  0.15233    
manage      -0.12558    0.04773  -2.631  0.00900 ** 
ccarduse     0.18638    0.05550   3.358  0.00090 ***
cigbuy      -0.14622    0.09279  -1.576  0.11625    
xmasbuy      0.27425    0.12663   2.166  0.03122 *  
locintrn    -0.15372    0.04746  -3.239  0.00135 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6659 on 265 degrees of freedom
Multiple R-squared:  0.1945,    Adjusted R-squared:  0.1702 
F-statistic: 7.999 on 8 and 265 DF,  p-value: 1.171e-09

5. Ridge Regression. First, the optimal \(\lambda\) value must be selected, which minimizes the generalized cross validation (GCV) score. This value is then used to find the new values of the coefficients. Remember that ridge regression does not perform variable selection, only shrinkage. This is because it uses the L2 norm for penalization, which does not encourage sparsity.

#Scaling
library(MASS)
#lm.ridge automatically scales the predictors, but if you want to do it manually, use `scale()` function as shown below
y.scaled = scale(trainData$prodebt)
X.scaled = scale(as.matrix(trainData[,-13]))

#Ridge Regression
#candidate lambda values; the GCV minimum below falls at the low end of this grid, so a wider range may be worth trying
lambda = seq(100, 110, by=0.01)
mod4 = lm.ridge(prodebt~.,data=trainData, lambda = lambda)
optimal <- which(mod4$GCV == min(mod4$GCV))

cat("The GVC score is minimized when lambda is equal to ",names(optimal),", making this optimal lambda value.\n\n",sep="")
The GVC score is minimized when lambda is equal to 100.00, making this optimal lambda value.
cat("Ridge Regression Coefficients at this optimal Lambda Value:\n")
Ridge Regression Coefficients at this optimal Lambda Value:
round(coef(mod4)[which(mod4$GCV == min(mod4$GCV)),],4)
         incomegp    house children  singpar    agegp  bankacc  bsocacc   manage ccarduse   cigbuy  xmasbuy locintrn 
  3.8549   0.0519  -0.0614   0.0281   0.0290  -0.0788   0.0885  -0.0752  -0.0972   0.1313  -0.1178   0.1818  -0.1118 
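
Ridge regression can also be fit with the glmnet package used in the next two sections by setting alpha = 0; a minimal sketch on the same training data, with the penalty chosen by 10-fold cross validation:

library(glmnet)
#alpha = 0 gives the pure ridge (L2) penalty
ridge.cv = cv.glmnet(as.matrix(trainData[,-13]), trainData$prodebt, alpha = 0, nfolds = 10)
coef(ridge.cv, s = ridge.cv$lambda.min) #coefficients at the CV-selected lambda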

6. Perform LASSO Regression.

library(glmnet)
#Like lm.ridge, this function automatically scales the predictors.

mod5.cv=cv.glmnet(as.matrix(trainData[,-13]),trainData$prodebt,alpha=1,nfolds=10)
mod5 = glmnet(as.matrix(trainData[,-13]),trainData$prodebt, alpha = 1, nlambda = 100)
optimal <-mod5.cv$lambda.min

cat("The optimal lambda value is ",optimal,".\n\n",sep="")
The optimal lambda value is 0.007221254.
cat("The variables selected by the LASSO model are shown in the table below:\n")
The variables selected by the LASSO model are shown in the table below:
coef(mod5,s=mod5.cv$lambda.min)
13 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  4.08479288
incomegp     0.05419295
house       -0.06823412
children     0.02079927
singpar      .         
agegp       -0.10246998
bankacc      0.08440017
bsocacc     -0.08711751
manage      -0.11882583
ccarduse     0.17326417
cigbuy      -0.14058391
xmasbuy      0.23555527
locintrn    -0.14950893
cat("\nWe can also plot the regression coefficient path to get an understanding of how values were selected for each coefficient:\n")

We can also plot the regression coefficient path to get an understanding of how values were selected for each coefficient:
plot(mod5,xvar="lambda",label=TRUE,lwd=2)
abline(v=log(mod5.cv$lambda.min),col='black',lty = 2,lwd=2)


7. Elastic Net Regression is performed using 100 values for \(\lambda\) and giving equal weight to both penalties. 10-fold cross validation is used to find the optimal \(\lambda\).

#Again, predictors are scaled automatically. 
mod6.cv=cv.glmnet(as.matrix(trainData[,-13]),trainData$prodebt,alpha=0.5,nfolds=10)
mod6 = glmnet(as.matrix(trainData[,-13]), trainData$prodebt, alpha = 0.5, nlambda = 100)
optimal <- mod6.cv$lambda.min
cat("The optimal lambda value is ",optimal,".\n\n",sep="")
The optimal lambda value is 0.02095361.
cat("The variables selected by the LASSO model are shown in the table below:\n")
The variables selected by the LASSO model are shown in the table below:
coef(mod6,s=mod6.cv$lambda.min)
13 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  4.04248407
incomegp     0.05326085
house       -0.06192981
children     0.01974651
singpar      .         
agegp       -0.09955276
bankacc      0.07582673
bsocacc     -0.08018302
manage      -0.11499109
ccarduse     0.16811282
cigbuy      -0.12969537
xmasbuy      0.22113558
locintrn    -0.14251235

8. Predict prodebt for each of the rows in the test data using the full model, lowest Mallow’s Cp model, and the models found using forward stepwise regression, ridge regression, lasso regression, and elastic net.

full=predict(mod1,testData)
mincp=predict(mod2,testData)
stepwise=predict(mod3,testData)
ridge=cbind(1,as.matrix(testData[,-13]))%*%coef(mod4)[which(mod4$GCV == min(mod4$GCV)),] #manual prediction from the ridge coefficients at the optimal lambda
modlasso=lm(prodebt~.,data=trainData[,-4]) #OLS refit on the variables LASSO retained (singpar, column 4, was dropped)
lasso=predict(modlasso,testData)
elastic=as.vector(predict(mod6,as.matrix(testData[,-13]),s=mod6.cv$lambda.min))

preds=data.frame(prodebt=testData$prodebt,full,mincp,stepwise,ridge,lasso,elastic)
preds

Predictions are compared below using mean squared prediction error. The lowest MSPE indicates the “best” model, but these are all so close that it would be difficult to make that determination with a single metric.

sapply(preds[,-1],function(x){mean((x-testData$prodebt)^2)})
     full     mincp  stepwise     ridge     lasso   elastic 
0.3017985 0.2967393 0.3019744 0.2937757 0.3017978 0.2951887 
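
As an aside, predictions could also be taken directly from the penalized LASSO fit itself rather than the OLS refit used above, mirroring the elastic net line; a minimal sketch, not included in the comparison table:

#predictions from the glmnet LASSO fit at the CV-selected lambda
lasso.glmnet = as.vector(predict(mod5, as.matrix(testData[,-13]), s = mod5.cv$lambda.min))
mean((lasso.glmnet - testData$prodebt)^2) #its mean squared prediction error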

Additional Notes

Poisson regression is mainly used for count data with no fixed upper bound, rather than for counts of successes out of a fixed number of trials. For example, how many students pass a test is a binomial-type outcome, since each student who takes it has some probability of passing; how many awards a student earns in high school, or how many people are standing in line at an airport, are open-ended counts that suit a Poisson model. A sketch contrasting the two appears below.
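
A minimal sketch of the distinction, using a small simulated data set (the schools data frame and its columns are made up purely for illustration):

#hypothetical data: 200 students, hours studied, awards earned, and passes out of 5 attempts
set.seed(1)
schools <- data.frame(hours_studied = runif(200, 0, 20))
schools$awards <- rpois(200, lambda = exp(-1 + 0.1*schools$hours_studied))
schools$passes <- rbinom(200, size = 5, prob = plogis(-2 + 0.3*schools$hours_studied))
schools$failures <- 5 - schools$passes

#open-ended count: Poisson regression
pois.mod <- glm(awards ~ hours_studied, family = poisson, data = schools)
#bounded count (successes out of a fixed number of trials): binomial regression
binom.mod <- glm(cbind(passes, failures) ~ hours_studied, family = binomial, data = schools)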
