Load Packages and Dataset
Packages used to import and analyse the data include
broom - better view of the model outputs
wvplots - for gaincurveplot
dplyr - data manipulation
ggplot - plotting
tidyr - create tidy data (column is variable, row is observation, cell is single value)
mgcv - GAM model
A. Key points
Fitting the logistic model
should include family = binomial. glm(formula, data, family = binomial)
the family argument describes the error distribution of the model.
encode the results to be 0/1 or False/True
Predicting with a glm()
predict(model, newdata, type = “response” - to get the probabilities
by default: returns log-odds
note that squared R and RMSE are not good measures for logistic regression models
we use deviance (similar to variance) and pseudo R squared (analogous to R-squared) - compares deviance of a model to the null-deviance of the data
a good fit gives pseudo-R-squared near 1
Some coding into it
Fitting a model
summary of the dataset - sparrow
summary(sparrow)## status age total_length wingspan
## Perished:36 Length:87 Min. :153.0 Min. :236.0
## Survived:51 Class :character 1st Qu.:158.0 1st Qu.:245.0
## Mode :character Median :160.0 Median :247.0
## Mean :160.4 Mean :247.5
## 3rd Qu.:162.5 3rd Qu.:251.0
## Max. :167.0 Max. :256.0
## weight beak_head humerus femur
## Min. :23.2 Min. :29.80 Min. :0.6600 Min. :0.6500
## 1st Qu.:24.7 1st Qu.:31.40 1st Qu.:0.7250 1st Qu.:0.7000
## Median :25.8 Median :31.70 Median :0.7400 Median :0.7100
## Mean :25.8 Mean :31.64 Mean :0.7353 Mean :0.7134
## 3rd Qu.:26.7 3rd Qu.:32.10 3rd Qu.:0.7500 3rd Qu.:0.7300
## Max. :31.0 Max. :33.00 Max. :0.7800 Max. :0.7600
## legbone skull sternum
## Min. :1.010 Min. :0.5600 Min. :0.7700
## 1st Qu.:1.110 1st Qu.:0.5900 1st Qu.:0.8300
## Median :1.130 Median :0.6000 Median :0.8500
## Mean :1.131 Mean :0.6032 Mean :0.8511
## 3rd Qu.:1.160 3rd Qu.:0.6100 3rd Qu.:0.8800
## Max. :1.230 Max. :0.6400 Max. :0.9300
create a “survived” col - method used in base R (in dplyr we will use mutate to do the same)
sparrow$survived <- sparrow$status == "Survived"create a formula with three independent variables (as the material decided)
fmla <- survived ~ total_length + weight + humerusfit the logistic model - note here we do not put binomial in quotations
sparrow_model <- glm(fmla, data = sparrow, family = binomial)call summary on the fitted model
summary(sparrow_model)##
## Call:
## glm(formula = fmla, family = binomial, data = sparrow)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1117 -0.6026 0.2871 0.6577 1.7082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 46.8813 16.9631 2.764 0.005715 **
## total_length -0.5435 0.1409 -3.858 0.000115 ***
## weight -0.5689 0.2771 -2.053 0.040060 *
## humerus 75.4610 19.1586 3.939 8.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 118.008 on 86 degrees of freedom
## Residual deviance: 75.094 on 83 degrees of freedom
## AIC: 83.094
##
## Number of Fisher Scoring iterations: 5
save glance results to perf
(perf <- glance(sparrow_model))## # A tibble: 1 × 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 118. 86 -37.5 83.1 93.0 75.1 83 87
calculate pseudo-r^2 (1 - deviance/null.deviance) - as the results show the fit is probably not that good (as we noticed it is not close to one and the model did not use all the variables)
(pseudoR2 <- 1 - perf$deviance/perf$null.deviance)## [1] 0.3636526
Predicting
Predictions - note here we wrapped response in quotation marks.
sparrow$pred <- predict(sparrow_model, sparrow, type = "response")Examine gain curve - here used the vtreat package
GainCurvePlot(frame, xvar, truthVar, title)
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")B. Key Points
Poisson and quasipoisson regression to predict counts
family = poisson (mean = var) or quasipoisson (mean very different from var)
both work better on larger datasets
if the counts to predict are much larger than zero, regular regression will be fine as well
outcome: integer or rates
some coding
Fit a model to predict bike rental counts
check the data
str(bikesJuly)## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
outcome <- c("cnt")
vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")fit a model
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
compare mean and var to determine if use poisson or quasipoisson
bikesJuly %>%
summarize(mean_bikes = mean(cnt), var_bikes = var(cnt))## mean_bikes var_bikes
## 1 273.6653 45863.84
fit the model
bike_model <- glm(fmla, data = bikesJuly, family = quasipoisson)call glance from broom
(perf <- glance(bike_model))## # A tibble: 1 × 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 133365. 743 NA NA NA 28775. 712 744
calculate pseudo-r^2
(pseudoR2 <- 1 - perf$deviance/perf$null.deviance)## [1] 0.7842393
Predict bike rentals on new data
check on the new data
str(bikesAugust)## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
summary(bikesAugust)## hr holiday workingday weathersit
## 0 : 31 Mode :logical Mode :logical Length:744
## 1 : 31 FALSE:744 FALSE:192 Class :character
## 2 : 31 TRUE :552 Mode :character
## 3 : 31
## 4 : 31
## 5 : 31
## (Other):558
## temp atemp hum windspeed
## Min. :0.5600 Min. :0.2424 Min. :0.2500 Min. :0.0000
## 1st Qu.:0.6400 1st Qu.:0.6061 1st Qu.:0.5200 1st Qu.:0.0896
## Median :0.7000 Median :0.6667 Median :0.6600 Median :0.1642
## Mean :0.7118 Mean :0.6475 Mean :0.6486 Mean :0.1551
## 3rd Qu.:0.7600 3rd Qu.:0.6970 3rd Qu.:0.7800 3rd Qu.:0.2239
## Max. :0.9000 Max. :0.8485 Max. :0.9400 Max. :0.5224
##
## cnt instant mnth yr
## Min. : 3.00 Min. :13748 Min. :8 Min. :1
## 1st Qu.: 86.75 1st Qu.:13934 1st Qu.:8 1st Qu.:1
## Median :266.00 Median :14120 Median :8 Median :1
## Mean :288.31 Mean :14120 Mean :8 Mean :1
## 3rd Qu.:428.00 3rd Qu.:14305 3rd Qu.:8 3rd Qu.:1
## Max. :941.00 Max. :14491 Max. :8 Max. :1
##
make predictions on the august data
bikesAugust$pred <- predict(bike_model, newdata = bikesAugust, type = "response")calculate the residual and RMSE
bikesAugust %>%
mutate(residual = pred - cnt) %>%
summarize(rmse = sqrt(mean(residual ^ 2)))## rmse
## 1 112.5815
generate plot of predictions to actual counts
ggplot(bikesAugust, aes(pred, cnt)) +
geom_point() +
geom_abline(color = "darkblue")Visualize the bike rental predictions
here we once again not using gather/spread but using pivot
convert instant to be in day units rather than hour
gather cnt and pred into a value column, filter for first two weeks of August (hence the filter formula)
plot value by instant
bikesAugust %>%
mutate(instant = (instant - min(instant))/24) %>%
pivot_longer(cols = c("cnt", "pred"), names_to = "valuetype", values_to = "value") %>%
filter(instant < 14) %>%
ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Quasipoisson model")C. Key Points - GAM
generalized additive models - automatically learn input variable transformations
gam(formula, family, data)
family:
gaussian (default): regular regression
binomial: probabilities
poisson/quasipoisson: counts
more likely to overfit so best to use on larger datasets
use s(variable) to specify the variable should be non-linear continuous variable - input takes on more than about 10 unique values
using GAM to learn the transformations is useful when you don’t have the domain knowledge to tell you the correct transform (i.e. is it linear related, quadratic or cubic..)
Some coding practice
soybean datasets are loaded.
summary(soybean_train)## Plot Variety Year Time weight
## 1988F6 : 10 F:161 1988:124 Min. :14.00 Min. : 0.0290
## 1988F7 : 9 P:169 1989:102 1st Qu.:27.00 1st Qu.: 0.6663
## 1988P1 : 9 1990:104 Median :42.00 Median : 3.5233
## 1988P8 : 9 Mean :43.56 Mean : 6.1645
## 1988P2 : 9 3rd Qu.:56.00 3rd Qu.:10.3808
## 1988F3 : 8 Max. :84.00 Max. :27.3700
## (Other):276
a simple plot
ggplot(soybean_train, aes(Time, weight)) +
geom_point()start using package mgcv
- model avgerage leaf weight on a soybean plant as a function of time. as plotted there is probably not a linear relationship
create gam model - for comparison, also created a linear model
fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, family = gaussian, data = soybean_train)
model.lin <- lm(weight ~ Time, data = soybean_train)call summary on both models and plot. as measured by R-squared, GAM appears to fit the data better than a linear model.
summary(model.lin)##
## Call:
## lm(formula = weight ~ Time, data = soybean_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3933 -1.7100 -0.3909 1.9056 11.4381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.559283 0.358527 -18.30 <2e-16 ***
## Time 0.292094 0.007444 39.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778 on 328 degrees of freedom
## Multiple R-squared: 0.8244, Adjusted R-squared: 0.8238
## F-statistic: 1540 on 1 and 328 DF, p-value: < 2.2e-16
summary(model.gam)##
## Family: gaussian
## Link function: identity
##
## Formula:
## weight ~ s(Time)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1645 0.1143 53.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Time) 8.495 8.93 338.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.902 Deviance explained = 90.4%
## GCV = 4.4395 Scale est. = 4.3117 n = 330
plot(model.gam)Predict with soybean model on test data
- get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)- get predictions from GAM model. results are in matrix so use as.numeric to coerce it
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))- pivot longer
soybean_long <- soybean_test %>%
pivot_longer(cols = c("pred.lin", "pred.gam"), names_to = "modeltype", values_to = "pred" )- calculate the rmse
soybean_long %>%
mutate(residual = weight - pred) %>% # residuals
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residual ^ 2))) ## # A tibble: 2 × 2
## modeltype rmse
## <chr> <dbl>
## 1 pred.gam 2.29
## 2 pred.lin 3.19
- compare the predictions against actual weights on the test data
soybean_long %>%
ggplot(aes(x = Time)) + #column for x axis
geom_point(aes(y = weight, color = modeltype)) + #the y-column for the scatterpl;ot
geom_point(aes(y = pred, color = modeltype)) + #the y-column for the point and line plot
geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + #the y-column for point-and-line-plot
scale_color_brewer(palette = "Dark2")conclusion: from the GAM plot, we can see that GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative. can see from the linear model the weight was negative in time < 20.