Logistic Regression

Load Packages and Dataset

Packages used to import and analyse the data include

broom - better view of the model outputs
wvplots - for gaincurveplot
dplyr - data manipulation
ggplot - plotting
tidyr - create tidy data (column is variable, row is observation, cell is single value)
mgcv - GAM model

A. Key points

Fitting the logistic model

should include family = binomial. glm(formula, data, family = binomial)
the family argument describes the error distribution of the model.
encode the results to be 0/1 or False/True

Predicting with a glm()

predict(model, newdata, type = “response” - to get the probabilities
by default: returns log-odds
note that squared R and RMSE are not good measures for logistic regression models
we use deviance (similar to variance) and pseudo R squared (analogous to R-squared) - compares deviance of a model to the null-deviance of the data
a good fit gives pseudo-R-squared near 1

Some coding into it

Fitting a model

summary of the dataset - sparrow

summary(sparrow)

##       status       age             total_length      wingspan    
##  Perished:36   Length:87          Min.   :153.0   Min.   :236.0  
##  Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0  
##                Mode  :character   Median :160.0   Median :247.0  
##                                   Mean   :160.4   Mean   :247.5  
##                                   3rd Qu.:162.5   3rd Qu.:251.0  
##                                   Max.   :167.0   Max.   :256.0  
##      weight       beak_head        humerus           femur       
##  Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
##  1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
##  Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
##  Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
##  3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
##  Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
##     legbone          skull           sternum      
##  Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
##  1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
##  Median :1.130   Median :0.6000   Median :0.8500  
##  Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
##  3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
##  Max.   :1.230   Max.   :0.6400   Max.   :0.9300

create a “survived” col - method used in base R (in dplyr we will use mutate to do the same)

sparrow$survived <- sparrow$status == "Survived"

create a formula with three independent variables (as the material decided)

fmla <- survived ~ total_length + weight + humerus

fit the logistic model - note here we do not put binomial in quotations

sparrow_model <- glm(fmla, data = sparrow, family = binomial)

call summary on the fitted model

summary(sparrow_model)

## 
## Call:
## glm(formula = fmla, family = binomial, data = sparrow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1117  -0.6026   0.2871   0.6577   1.7082  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
## total_length  -0.5435     0.1409  -3.858 0.000115 ***
## weight        -0.5689     0.2771  -2.053 0.040060 *  
## humerus       75.4610    19.1586   3.939 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  75.094  on 83  degrees of freedom
## AIC: 83.094
## 
## Number of Fisher Scoring iterations: 5

save glance results to perf

(perf <- glance(sparrow_model))

## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          118.      86  -37.5  83.1  93.0     75.1          83    87

calculate pseudo-r^2 (1 - deviance/null.deviance) - as the results show the fit is probably not that good (as we noticed it is not close to one and the model did not use all the variables)

(pseudoR2 <- 1 - perf$deviance/perf$null.deviance)

## [1] 0.3636526

Predicting

Predictions - note here we wrapped response in quotation marks.

sparrow$pred <- predict(sparrow_model, sparrow, type = "response")

Examine gain curve - here used the vtreat package

GainCurvePlot(frame, xvar, truthVar, title)

GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")

B. Key Points

Poisson and quasipoisson regression to predict counts

family = poisson (mean = var) or quasipoisson (mean very different from var)
both work better on larger datasets
if the counts to predict are much larger than zero, regular regression will be fine as well
outcome: integer or rates

some coding

Fit a model to predict bike rental counts

check the data

str(bikesJuly)

## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

outcome <- c("cnt")
vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")

fit a model

(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))

## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"

compare mean and var to determine if use poisson or quasipoisson

bikesJuly %>% 
  summarize(mean_bikes = mean(cnt), var_bikes = var(cnt))

##   mean_bikes var_bikes
## 1   273.6653  45863.84

fit the model

bike_model <- glm(fmla, data = bikesJuly, family = quasipoisson)

call glance from broom

(perf <- glance(bike_model))

## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1       133365.     743     NA    NA    NA   28775.         712   744

calculate pseudo-r^2

(pseudoR2 <- 1 - perf$deviance/perf$null.deviance)

## [1] 0.7842393

Predict bike rentals on new data

check on the new data

str(bikesAugust)

## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

summary(bikesAugust)

##        hr       holiday        workingday       weathersit       
##  0      : 31   Mode :logical   Mode :logical   Length:744        
##  1      : 31   FALSE:744       FALSE:192       Class :character  
##  2      : 31                   TRUE :552       Mode  :character  
##  3      : 31                                                     
##  4      : 31                                                     
##  5      : 31                                                     
##  (Other):558                                                     
##       temp            atemp             hum           windspeed     
##  Min.   :0.5600   Min.   :0.2424   Min.   :0.2500   Min.   :0.0000  
##  1st Qu.:0.6400   1st Qu.:0.6061   1st Qu.:0.5200   1st Qu.:0.0896  
##  Median :0.7000   Median :0.6667   Median :0.6600   Median :0.1642  
##  Mean   :0.7118   Mean   :0.6475   Mean   :0.6486   Mean   :0.1551  
##  3rd Qu.:0.7600   3rd Qu.:0.6970   3rd Qu.:0.7800   3rd Qu.:0.2239  
##  Max.   :0.9000   Max.   :0.8485   Max.   :0.9400   Max.   :0.5224  
##                                                                     
##       cnt            instant           mnth         yr   
##  Min.   :  3.00   Min.   :13748   Min.   :8   Min.   :1  
##  1st Qu.: 86.75   1st Qu.:13934   1st Qu.:8   1st Qu.:1  
##  Median :266.00   Median :14120   Median :8   Median :1  
##  Mean   :288.31   Mean   :14120   Mean   :8   Mean   :1  
##  3rd Qu.:428.00   3rd Qu.:14305   3rd Qu.:8   3rd Qu.:1  
##  Max.   :941.00   Max.   :14491   Max.   :8   Max.   :1  
##

make predictions on the august data

bikesAugust$pred <- predict(bike_model, newdata = bikesAugust, type = "response")

calculate the residual and RMSE

bikesAugust %>% 
  mutate(residual = pred - cnt) %>% 
  summarize(rmse = sqrt(mean(residual ^ 2)))

##       rmse
## 1 112.5815

generate plot of predictions to actual counts

ggplot(bikesAugust, aes(pred, cnt)) +
  geom_point() +
  geom_abline(color = "darkblue")

Visualize the bike rental predictions

here we once again not using gather/spread but using pivot
convert instant to be in day units rather than hour
gather cnt and pred into a value column, filter for first two weeks of August (hence the filter formula)
plot value by instant

bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>% 
  pivot_longer(cols = c("cnt", "pred"), names_to = "valuetype", values_to = "value") %>% 
  filter(instant < 14) %>% 
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
  geom_point() +
  geom_line() +
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model")

C. Key Points - GAM

generalized additive models - automatically learn input variable transformations
gam(formula, family, data)
- family:
  - gaussian (default): regular regression
  - binomial: probabilities
  - poisson/quasipoisson: counts
- more likely to overfit so best to use on larger datasets
- use s(variable) to specify the variable should be non-linear continuous variable - input takes on more than about 10 unique values
using GAM to learn the transformations is useful when you don’t have the domain knowledge to tell you the correct transform (i.e. is it linear related, quadratic or cubic..)

Some coding practice

soybean datasets are loaded.

summary(soybean_train)

##       Plot     Variety   Year          Time           weight       
##  1988F6 : 10   F:161   1988:124   Min.   :14.00   Min.   : 0.0290  
##  1988F7 :  9   P:169   1989:102   1st Qu.:27.00   1st Qu.: 0.6663  
##  1988P1 :  9           1990:104   Median :42.00   Median : 3.5233  
##  1988P8 :  9                      Mean   :43.56   Mean   : 6.1645  
##  1988P2 :  9                      3rd Qu.:56.00   3rd Qu.:10.3808  
##  1988F3 :  8                      Max.   :84.00   Max.   :27.3700  
##  (Other):276

a simple plot

ggplot(soybean_train, aes(Time, weight)) +
  geom_point()

start using package mgcv

model avgerage leaf weight on a soybean plant as a function of time. as plotted there is probably not a linear relationship

create gam model - for comparison, also created a linear model

fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, family = gaussian, data = soybean_train)
model.lin <- lm(weight ~ Time, data = soybean_train)

call summary on both models and plot. as measured by R-squared, GAM appears to fit the data better than a linear model.

summary(model.lin)

## 
## Call:
## lm(formula = weight ~ Time, data = soybean_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3933 -1.7100 -0.3909  1.9056 11.4381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.559283   0.358527  -18.30   <2e-16 ***
## Time         0.292094   0.007444   39.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.778 on 328 degrees of freedom
## Multiple R-squared:  0.8244, Adjusted R-squared:  0.8238 
## F-statistic:  1540 on 1 and 328 DF,  p-value: < 2.2e-16

summary(model.gam)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## weight ~ s(Time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.1645     0.1143   53.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Time) 8.495   8.93 338.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.902   Deviance explained = 90.4%
## GCV = 4.4395  Scale est. = 4.3117    n = 330

plot(model.gam)

Predict with soybean model on test data

get predictions from linear model

soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)

get predictions from GAM model. results are in matrix so use as.numeric to coerce it

soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))

pivot longer

soybean_long <- soybean_test %>% 
  pivot_longer(cols = c("pred.lin", "pred.gam"), names_to = "modeltype", values_to = "pred" )

calculate the rmse

soybean_long %>%
  mutate(residual = weight - pred) %>%     # residuals
  group_by(modeltype) %>%                  # group by modeltype
  summarize(rmse = sqrt(mean(residual ^ 2)))

## # A tibble: 2 × 2
##   modeltype  rmse
##   <chr>     <dbl>
## 1 pred.gam   2.29
## 2 pred.lin   3.19

compare the predictions against actual weights on the test data

soybean_long %>% 
  ggplot(aes(x = Time)) + #column for x axis
  geom_point(aes(y = weight, color = modeltype)) + #the y-column for the scatterpl;ot
  geom_point(aes(y = pred, color = modeltype)) + #the y-column for the point and line plot
  geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + #the y-column for point-and-line-plot
  scale_color_brewer(palette = "Dark2")

conclusion: from the GAM plot, we can see that GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative. can see from the linear model the weight was negative in time < 20.