There are many kinds of regression analysis, and we should be able to distinguish one from another in order to select the appropriate one for a given application. Here is a brief introduction along with some examples.

Types of Regression

Linear Regression

  • definition (from Wikipedia)

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. Note that the model is called linear because it is linear in the parameters, not necessarily in the predictors; for example, \(Y = \beta_{0} + \beta_{1}X + \beta_{2}X^{2} + \epsilon\) is still a linear model.

Simple Linear Regression

The case of one explanatory variable is called simple linear regression.

\(Y = \beta_{0} + \beta_{1}X + \epsilon\)

The method of least squares is applied to obtain estimates of the parameters:

\(\hat{\beta_{1}} = \frac{S_{XY}}{S_{XX}}\) ; \(\hat{\beta_{0}} = \bar{Y} - \hat{\beta_{1}}\bar{X}\)
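
Here \(S_{XY} = \sum_{i}(X_{i} - \bar{X})(Y_{i} - \bar{Y})\) and \(S_{XX} = \sum_{i}(X_{i} - \bar{X})^{2}\) are the corrected sums of cross-products and squares. As a quick check, these closed-form estimates can be computed directly in R and compared with the output of lm (a minimal sketch, assuming x and y are numeric vectors of equal length):

Sxy <- sum((x - mean(x)) * (y - mean(y)))    # corrected cross-product sum
Sxx <- sum((x - mean(x))^2)                  # corrected sum of squares of x
beta1_hat <- Sxy / Sxx                       # slope estimate
beta0_hat <- mean(y) - beta1_hat * mean(x)   # intercept estimate
c(beta0_hat, beta1_hat)                      # should agree with coef(lm(y ~ x))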

  • case study

Use the Boston dataset in the MASS library, with medv as the response variable and lstat as the predictor (explanatory variable), fitted using the lm function.

library(MASS)
library(ggplot2)
attach(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
#scatterplot of medv vs lstat, to have a quick look at the relationship
qplot(lstat, medv, data = Boston, geom = c("point", "smooth"), method = "lm")

From the scatterplot, we can see that the response and predictor are negatively associated, and the relationship looks roughly linear, so we may fit a simple linear regression model.

model <- lm(medv ~ lstat, data = Boston)
summary(model)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

The lm function gives us the regression line \(\hat{Y} = 34.5538 - 0.9501X\) with R-squared 0.5441, which means the model explains 54.41% of the variability of the response around its mean. The corresponding p-values for the intercept and the predictor (lstat) are extremely small, indicating that both parameters are significant. We then conduct a lack-of-fit test to check whether the model is adequate:

model2 <- lm(medv ~ 0 + as.factor(lstat), data = Boston)
anova(model, model2)
## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ 0 + as.factor(lstat)
##   Res.Df   RSS  Df Sum of Sq      F Pr(>F)
## 1    504 19472                            
## 2     51  1751 453     17721 1.1394 0.2886

In the lack-of-fit test, the F-statistic is 1.1394 with p-value 0.2886 > 0.05, so we fail to reject the null hypothesis that there is no lack of fit. We therefore conclude that there is not enough evidence of lack of fit, and the model can be considered adequate in this case.

#confidence interval of estimated parameters
confint(model)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
#fitted value and confidence interval
predict(model, list(lstat=15), interval = "confidence")
##       fit      lwr      upr
## 1 20.3031 19.73159 20.87461
#predicted value and prediction interval
predict(model, list(lstat=15), interval = "prediction")
##       fit      lwr      upr
## 1 20.3031 8.077742 32.52846
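
The fitted value above is simply the estimated regression line evaluated at lstat = 15, which can be verified directly from the coefficients (a minimal sketch):

# evaluate the fitted line at lstat = 15: 34.5538 - 0.9501 * 15, approximately 20.30
unname(coef(model)[1] + coef(model)[2] * 15)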

Then we can examine the residual plots as well as the Q-Q plot to verify the reliability of the fitted model.

par(mfrow=c(2,2))
plot(model)

From the residual plots, we may say it is not a very good model: the residuals show a convex curve, indicating a systematic pattern. We can then use the boxcox function to see what kind of transformation is suggested. The plot shows \(\lambda \approx 0\), suggesting a log-transformation of the response.

boxcox(model,lambda = seq(-2,2,0.1))
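
If we want the maximizing \(\lambda\) numerically rather than reading it off the plot, boxcox returns the grid of \(\lambda\) values together with their profile log-likelihoods (a minimal sketch; the maximizer should be close to 0):

bc <- boxcox(model, lambda = seq(-2, 2, 0.1), plotit = FALSE)  # lambda grid and log-likelihoods
bc$x[which.max(bc$y)]                                          # lambda with the highest log-likelihood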

model2 <- lm(log(medv) ~ lstat, data = Boston)
summary(model2)
## 
## Call:
## lm(formula = log(medv) ~ lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94921 -0.14838 -0.02043  0.11441  0.90958 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.617572   0.021971  164.65   <2e-16 ***
## lstat       -0.046080   0.001513  -30.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2427 on 504 degrees of freedom
## Multiple R-squared:  0.6481, Adjusted R-squared:  0.6474 
## F-statistic: 928.1 on 1 and 504 DF,  p-value: < 2.2e-16
plot(model2)

The lm function gives us the regression line \(\hat{\log{Y}} = 3.617572 - 0.046080X\) with R-squared 0.6481, which means the model explains 64.81% of the variability of the (log-transformed) response around its mean. The corresponding p-values for the intercept and the predictor (lstat) are extremely small, indicating that both parameters are significant. The residual plots no longer show an obvious pattern, and normality is reasonably well satisfied. We can conclude that this is a good simple linear model based on the analysis.
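
Since model2 fits the response on the log scale, its predictions are for \(\log(\text{medv})\); exponentiating the fit and interval endpoints maps them back to the original medv scale (a minimal sketch; note that the back-transformed fit estimates the conditional median rather than the mean):

# prediction from the log model, then back-transformed to the medv scale
log_pred <- predict(model2, list(lstat = 15), interval = "prediction")
exp(log_pred)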

Logistic Regression

  • definition (from Wikipedia)

In statistics, logistic regression, or logit regression, is a regression model where the dependent variable (DV) is categorical. Logistic regression is linear regression on the logit transform of y, where y is the proportion (or probability) of success at each value of x.

When the response y is binary and we try to fit a linear model, say \(\pi = \beta_{0} + \beta_{1}X\), we will get predicted probabilities greater than 1 or less than 0 at some extreme values of X, which does not make sense for a probability. Besides, the constant-variance assumption is also violated. Therefore, we have the following suggestion:

When the response variable is binary, or a binomial proportion, the expected response is more appropriately modeled by some curved relationship with the predictor variable. One such curved relationship is given by the logistic model.

\(E(Y_{i}|X_{i}) = \pi_{i} = \frac{\exp(\beta_{0} + \beta_{1}X_{i})}{1 + \exp(\beta_{0} + \beta_{1}X_{i})}\)

There are several things that should be noted about this function. First of all, it is bounded between zero and one, which eliminates the possibility of getting nonsensical predictions of proportions or probabilities. Secondly, there is a linear model hidden in the function that can be revealed with a proper transformation of the response (the logit transformation: \(\log{\frac{\pi}{1-\pi}} = \beta_{0} + \beta_{1}X_{i}\)). Finally, the sign of the coefficient \(\beta_{1}\) indicates the direction of the curve: a positive value indicates an increasing function while a negative value indicates a decreasing function.
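
The logit transformation and its inverse are straightforward to write in R; base R also provides them as qlogis and plogis (a minimal sketch):

logit <- function(p) log(p / (1 - p))                  # maps a probability in (0, 1) to the real line
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))   # maps the real line back to (0, 1)
inv_logit(logit(0.7))                                  # returns 0.7; equivalent to plogis(qlogis(0.7))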

  • case study

In the experiment, turtle eggs were collected. There were 5 different temperatures, all between 27 and 30 degrees centigrade, with 3 boxes of eggs at each temperature; different boxes contained different numbers of eggs. When the turtles hatched, their sex was determined. The counts used below are pooled over the boxes at each temperature.

temperature <- c(27.2,27.7,28.3,28.4,29.9)
nmale <- c(2,17,26,19,27)
nfemale <- c(25,7,4,8,1)
nturtle <- nmale + nfemale
pmale <- nmale/nturtle
#scatterplot of pmale vs. temperature, with the fitted least-squares line added
model1 <- lm(pmale ~ temperature)
plot(temperature,pmale)
abline(model1)

summary(model1)
## 
## Call:
## lm(formula = pmale ~ temperature)
## 
## Residuals:
##        1        2        3        4        5 
## -0.29528  0.20532  0.20325  0.01356 -0.12685 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -6.9020     3.4737  -1.987    0.141
## temperature   0.2673     0.1227   2.179    0.117
## 
## Residual standard error: 0.2496 on 3 degrees of freedom
## Multiple R-squared:  0.6128, Adjusted R-squared:  0.4838 
## F-statistic: 4.748 on 1 and 3 DF,  p-value: 0.1175
fitted(model1)  # fitted proportions at the observed temperatures
##         1         2         3         4         5 
## 0.3693497 0.5030147 0.6634127 0.6901457 1.0911406

The lm function gives us the regression line \(\hat{\pi} = -6.9020 + 0.2673X\) with R-squared 0.6128, which means the model explains 61.28% of the variability of the response around its mean. The corresponding p-values for the intercept and the predictor (temperature) are 0.141 and 0.117, neither of which is significant at the 5% level. Moreover, we have a fitted value of 1.0911406, which makes no sense as a probability, so we need another method for the parameter estimation.

logitp <- log(pmale/(1-pmale)) #link = logit
model2 <- lm(logitp ~ temperature)
summary(model2)
## 
## Call:
## lm(formula = logitp ~ temperature)
## 
## Residuals:
##       1       2       3       4       5 
## -1.3837  1.1107  0.9930 -0.1976 -0.5224 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -51.1122    16.9415  -3.017   0.0569 .
## temperature   1.8371     0.5983   3.070   0.0545 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.217 on 3 degrees of freedom
## Multiple R-squared:  0.7586, Adjusted R-squared:  0.6781 
## F-statistic: 9.428 on 1 and 3 DF,  p-value: 0.05454
exp(fitted(model2))/(1+exp(fitted(model2)))
##         1         2         3         4         5 
## 0.2419512 0.4443709 0.7065822 0.7431787 0.9785063
par(mfrow=c(2,2))
plot(model2)

par(mfrow=c(1,1))
x=seq(26,31,0.1)
y=exp(model2$coefficients[1]+model2$coefficients[2]*x)/(1+exp(model2$coefficients[1]+model2$coefficients[2]*x))
plot(x,y)

It seems better to perform the regression after the logit transformation, with a higher R-squared of 0.7586, and we no longer have probabilities greater than 1 or less than 0, which solves the boundary problem. A more standard approach, however, is to fit the logistic regression directly with glm and a binomial family, which uses the actual counts at each temperature rather than the transformed proportions:

model3 <- glm(cbind(nfemale,nmale) ~ temperature, family = binomial(logit))
summary(model3)
## 
## Call:
## glm(formula = cbind(nfemale, nmale) ~ temperature, family = binomial(logit))
## 
## Deviance Residuals: 
##      1       2       3       4       5  
##  2.224  -2.248  -1.239   1.382   1.191  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  61.3183    12.0224   5.100 3.39e-07 ***
## temperature  -2.2110     0.4309  -5.132 2.87e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 64.429  on 4  degrees of freedom
## Residual deviance: 14.863  on 3  degrees of freedom
## AIC: 33.542
## 
## Number of Fisher Scoring iterations: 5
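
Because the first column of cbind(nfemale, nmale) is treated as the success count, the fitted values of model3 are estimated probabilities that a hatchling is female at each observed temperature; they can be compared with the observed proportions, and predictions can be made at new temperatures (a minimal sketch):

fitted(model3)                                                      # estimated P(female) at the five temperatures
nfemale / nturtle                                                   # observed proportions of females
predict(model3, data.frame(temperature = 28.5), type = "response")  # estimated P(female) at 28.5 degrees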

Reference: http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html