class: center, middle, inverse, title-slide # MATH 4140/5140 ## Simple Logistic Regression ### Ziwei Ma ### UTC ### updated: 2022-04-27 --- ### Logistic regression example > **Election** Suppose that we are interested in the group membership of a hurricane, either tropical or non-tropical, based on the latitude of formation. ``` ## [1] TRUE ``` - Load data ```r library(readxl) hurricanes <- read_excel("hurricanes.xlsx") ``` --- ### Inspect the data - First, we inspect the structure of the data set. ```r str(hurricanes) ``` ``` ## tibble [337 x 12] (S3: tbl_df/tbl/data.frame) ## $ RowNames: chr [1:337] "1" "2" "3" "4" ... ## $ Number : num [1:337] 430 432 433 436 437 438 440 441 445 449 ... ## $ Name : chr [1:337] "NOTNAMED" "NOTNAMED" "NOTNAMED" "NOTNAMED" ... ## $ Year : num [1:337] 1944 1944 1944 1944 1944 ... ## $ Type : num [1:337] 1 0 0 0 0 1 0 1 0 0 ... ## $ FirstLat: num [1:337] 30.2 25.6 14.2 20.8 20 29.2 16.1 27.6 21.6 19 ... ## $ FirstLon: num [1:337] -76.1 -74.9 -65.2 -58 -84.2 -55.8 -80.8 -85.6 -95.2 -56.6 ... ## $ MaxLat : num [1:337] 32.1 31 16.6 26.3 20.6 38 21.9 27.6 28.6 24.9 ... ## $ MaxLon : num [1:337] -74.8 -78.1 -72.2 -72.3 -84.9 -53.2 -82.9 -85.6 -96.1 -79.6 ... ## $ LastLat : num [1:337] 35.1 32.6 20.6 42.1 19.1 50 28.4 31.7 29.5 28.9 ... ## $ LastLon : num [1:337] -69.2 -78.2 -88.5 -71.5 -93.9 -46.5 -82.1 -79.1 -96 -81.8 ... ## $ MaxInt : num [1:337] 80 80 105 120 70 85 105 100 120 120 ... ``` There are 337 observations and 12 variables in the data set. We are primarily interested in the variable **Type**, which is our response variable and the variable **FirstLat**, which corresponds to the latitude of formation, and thus is our predictor variable. --- - In order to get a sense for the data set we plot the number of hurricanes for each year as a bar plot using the ggplot2 package. ```r library(ggplot2) ggplot(hurricanes, aes(x = Year)) + geom_bar(stat = "count") ``` <img src="week-15-3-a_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- - Types of hurricanes ```r ggplot(hurricanes, aes(x = Year, fill = factor(Type))) + geom_bar(stat = "count") + scale_fill_discrete(name = "Type of Hurricane", labels = c("tropical-only", "baroclinic influences", "baroclinic initiation")) ``` <img src="week-15-3-a_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- For a numerical representation of the hurricane classes we summarize the data based on types. ```r table(hurricanes$Type) ``` ``` ## ## 0 1 3 ## 187 77 73 ``` In **class 0**, tropical hurricanes, there are 187 observations, in **class 1**, *baroclinic influences*, there are 77 observations and in **class 3**, *baroclinic initiation*, there are 73 observations. In logistic regression we deal with dichotomous data; thus, for simplicity we re-code the classes and assign the class 1 and class 2, both being influenced by the outer tropics, the label 1. ```r hurricanes$Type.new <- ifelse(test = hurricanes$Type==0, yes = 0, no = 1) ``` Now we have a binary response variable of two classes: Class 0 corresponds tropical hurricanes and class 1 to non-tropical hurricanes. ```r table(hurricanes$Type.new) ``` ``` ## ## 0 1 ## 187 150 ``` --- ### Geographical representation - We plot the data on an interactive map. ```r options(viewer=NULL) library(leaflet) m <- leaflet() m <- addTiles(m) m <- addProviderTiles(m, "Esri.OceanBasemap") cols <- c("red", "navy") m <- addCircleMarkers(m, lng = hurricanes$FirstLon, lat = hurricanes$FirstLat, radius = 2.5, color = cols[hurricanes$Type.new+1], popup = paste('Year:', as.character(hurricanes$Year))) m <- addLegend(m, "topright", colors = cols, labels = c('tropical', 'non-tropical'), title = "Type of Hurricane", opacity = 1) ``` --- ### Geographical representation
--- - **Fit Logistic regression model** We want to build a model that predicts the group membership of a hurricane, either tropical (0) or non-tropical (1), based on the latitude of formation of the hurricane. The response variable is the binary variable `Type.new` and the predictor variable is `FirstLat` ```r log.model <- glm(Type.new ~ FirstLat, data = hurricanes, family = 'binomial') # show the theoretical model library(equatiomatic) equatiomatic::extract_eq(log.model) ``` $$ \log\left[ \frac { P( \operatorname{Type.new} = \operatorname{1} ) }{ 1 - P( \operatorname{Type.new} = \operatorname{1} ) } \right] = \alpha + \beta_{1}(\operatorname{FirstLat}) $$ The fitted model is here ```r # display the actual coefficients equatiomatic::extract_eq(log.model, use_coefs = TRUE) ``` $$ \log\left[ \frac { \widehat{P( \operatorname{Type.new} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{Type.new} = \operatorname{1} )} } \right] = -9.08 + 0.37(\operatorname{FirstLat}) $$ --- ### Interpret the model - The baseline odds ratio `$$e^{\alpha}=e^{-9.0826335}=1.1362198×10^{-4}$$` - This is a very low number! And it makes perfectly sense, since it is very unlikely to observe a non-tropical hurricane at the equator. - The odds ratio per unit increase of predictor `$$e^{\beta}_1=e^{0.3728295}=1.451837$$` Thus, for every degree increase in formation latitude the odds ratio increases on average by a constant factor of 1.4518368 (or 45%). --- ### Inference on the coefficients - we construct a 95% confidence interval for the estimated model coefficient. We exponentiate the result to return the odds-ratio, which is a quantity that is easier to interpret than the log-odds. - log odds ratio ```r confint.default(log.model)[2,] ``` ``` ## 2.5 % 97.5 % ## 0.2954770 0.4501821 ``` - odds ratio ```r exp(confint.default(log.model)[2,]) ``` ``` ## 2.5 % 97.5 % ## 1.343767 1.568598 ``` Thus, we can state that we are 95% confident that for every degree increase in formation latitude the observation of a non-tropical hurricane becomes on average more likely between 34 and 57%. --- ### Prediction - Now, we apply that model to make predictions about the chance of observing a non-tropical hurricane given the latitude of formation. In order to predict the probability that a hurricane picked at random, which forms at a given latitude is of type non-tropical we apply the `predict()` function. ```r predict(log.model, newdata = list(FirstLat = c(10, 23.5,30)), type = "response") ``` ``` ## 1 2 3 ## 0.004705352 0.420398055 0.891121908 ``` **NOTE** We should indicate `type="response"` in `predict()` to get the probability. Otherwise, we will get the following results. ```r predict(log.model, newdata = list(FirstLat = c(10, 23.5,30))) ``` ``` ## 1 2 3 ## -5.3543382 -0.3211396 2.1022524 ``` **Question** Can we use the later results to get the desired probabilities. --- ### Visualization - we create a plot of predictions across a range of latitudes to visualize our results . <img src="week-15-3-a_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> The model predictions show that the probability of a non-tropical hurricane is less than 10% at latitudes south of 20∘N. However, by latitude 26∘N, the probability exceeds 50% and by latitude 33∘N the probability exceeds 90%.