Introduction:
Last week’s analysis covered Generalized Linear Models (GLM), Transformations, and Logistic Regression; however, this data dive will mostly continue the work on Generalized Linear Models (GLM). As discussed earlier, Generalized Linear Models (GLM) are a broad class of statistical models that extend linear regression to accommodate response variables that are not well modeled by a normal distribution.
As discussed in last week’s class, this data dive will follow these statistical topics:
Generalized Linear Models (GLMs): These are a class of models that extend linear regression to handle non-normally distributed response variables by using a link function.
Maximum Likelihood Estimation (MLE): A method used in GLMs to estimate the model parameters by maximizing the likelihood function, providing the most probable parameter values given the observed data.
Model Comparison Metrics: When comparing GLMs, metrics like the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), analysis of variance (ANOVA), and deviance help assess model fit, complexity, and predictive performance, aiding in model selection (a brief sketch of these ideas follows this list).
Variable Expansion and Selection: Variable expansion involves transforming predictors to better fit the data distribution, while variable selection aims to identify the most relevant predictors, enhancing model interpretability and predictive accuracy in GLMs.
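As a quick, hypothetical illustration of the ideas listed above (simulated data, not the bike dataset), the sketch below fits a small logistic GLM by maximum likelihood and extracts the quantities typically used for model comparison:
# Toy example: a GLM with a logit link, fit by maximum likelihood
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(-0.5 + 1.2 * toy$x))
m <- glm(y ~ x, data = toy, family = binomial(link = "logit"))
logLik(m)    # maximized log-likelihood (the MLE criterion)
AIC(m)       # Akaike Information Criterion
BIC(m)       # Bayesian Information Criterion
deviance(m)  # residual deviance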
Continuing from last week’s analysis of Generalized Linear Models (GLM), this deep dive examines how the model relates to and informs our decisions, conclusions, and insights for the bike sales dataset.
Data Loading:
Loading and understanding the data structure is next; the dataset and the necessary libraries for the analysis are loaded below:
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(dplyr)
library(broom)
library(lindia)
library(knitr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
bike_data <- read.csv('bike_data.csv')
Exploratory Data Analysis:
Exploratory data analysis will be used to summarize the main characteristics of the dataset; it can help identify initial patterns, trends, relationships, and anomalies in the data.
In this case, the summary() function will provide the summary statistics for the variables in the bike_data dataset, giving insights into the distribution, central tendency, and variability of the data.
The str() function, on the other hand, will display the structure of the bike_data dataset, including the data type of each variable, allowing for a deeper understanding of the dataset’s composition and structure.
summary(bike_data)
## ID Marital.Status Gender Income
## Min. :11000 Length:1000 Length:1000 Length:1000
## 1st Qu.:15291 Class :character Class :character Class :character
## Median :19744 Mode :character Mode :character Mode :character
## Mean :19966
## 3rd Qu.:24471
## Max. :29447
## Children Education Occupation Home.Owner
## Min. :0.000 Length:1000 Length:1000 Length:1000
## 1st Qu.:0.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.898
## 3rd Qu.:3.000
## Max. :5.000
## Cars Commute.Distance Region Age
## Min. :0.000 Length:1000 Length:1000 Min. :25.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:35.00
## Median :1.000 Mode :character Mode :character Median :43.00
## Mean :1.442 Mean :44.16
## 3rd Qu.:2.000 3rd Qu.:52.00
## Max. :4.000 Max. :89.00
## Age.Brackets Purchased.Bike
## Length:1000 Length:1000
## Class :character Class :character
## Mode :character Mode :character
##
##
##
str(bike_data)
## 'data.frame': 1000 obs. of 14 variables:
## $ ID : int 12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
## $ Marital.Status : chr "Married" "Married" "Married" "Single" ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ Income : chr "40,000" "30,000" "80,000" "70,000" ...
## $ Children : int 1 3 5 0 0 2 2 1 2 2 ...
## $ Education : chr "Bachelors" "Partial College" "Partial College" "Bachelors" ...
## $ Occupation : chr "Skilled Manual" "Clerical" "Professional" "Professional" ...
## $ Home.Owner : chr "Yes" "Yes" "No" "Yes" ...
## $ Cars : int 0 1 2 1 0 0 4 0 2 1 ...
## $ Commute.Distance: chr "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
## $ Region : chr "Europe" "Europe" "Europe" "Pacific" ...
## $ Age : int 42 43 60 41 36 50 33 43 58 40 ...
## $ Age.Brackets : chr "Middle Age" "Middle Age" "Old" "Middle Age" ...
## $ Purchased.Bike : chr "No" "No" "No" "Yes" ...
Building the Generalized Linear Model: Using the variables from the models created in last week’s analysis:
# Convert 'Purchased.Bike' to a binary factor
bike_data$Purchased.Bike <- factor(bike_data$Purchased.Bike, levels = c("No", "Yes"))
# Building the GLM model
model <- glm(Purchased.Bike ~ Age + Income + Cars, data = bike_data, family = binomial())
# Double Checking the structure of the Purchased.Bike
str(bike_data$Purchased.Bike)
## Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 2 1 2 ...
Output Interpretation:
The output from str() indicates that the Purchased.Bike variable in the dataset has been successfully converted into a factor with two levels: “No” (stored internally as 1) and “Yes” (stored internally as 2).
In R, factor levels are always coded internally starting at 1, not 0, even for a binary outcome. So, the display
Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 2 1 2 ...
tells us that:
The model interprets the first level of the factor (“No”) as 0 (failure, non-event) and the second level (“Yes”) as 1 (success, event) when estimating the model parameters.
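As a small optional check (a sketch), this coding can be verified directly:
# Confirm the level ordering and the internal coding used by glm()
levels(bike_data$Purchased.Bike)      # "No" "Yes" -- the first level is the reference
contrasts(bike_data$Purchased.Bike)   # the "Yes" column shows which level is coded as 1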
After converting the response to a binary factor for the Generalized Linear Model, we can proceed with diagnosing the model, checking the summary, and making the necessary interpretations.
If the coefficients turn out to be significant and the diagnostics (for example, the residual plots) look reasonable, we can be reasonably confident that the model is behaving as expected.
Diagnosing the Model:
Diagnostic plots will be used here to check for issues like non-linearity, heteroscedasticity, or outliers.
par(mfrow=c(2,2))
plot(model)
Diagnosis Interpretation:
The figure above presents several diagnostic plots for assessing the assumptions and fit of our Generalized Linear Model (GLM). Here are the insights that can be gathered from each plot:
Residuals vs Fitted: - This plot shows the residuals (i.e. the difference between observed and fitted values) against the fitted values from the model. - The residuals appear to be randomly scattered around zero, indicating that the assumption of constant variance and linearity is likely met. - However, there are a few potential outliers with large positive and negative residuals, which may warrant further investigation.
Normal Q-Q Plot: - This quantile-quantile plot compares the distribution of the residuals to a theoretical normal distribution. - It can be observed that the residuals deviate slightly from the straight line, suggesting a slight violation of the normality assumption. - For large sample sizes, moderate deviations from normality are often acceptable.
Scale-Location Plot: - This plot examines the assumption of constant variance by plotting the square root of the standardized residuals against the fitted values. - The points appear to be randomly scattered, indicating that the assumption of constant variance is likely met. - There are a few potential outliers that may be influencing the results.
Residuals vs Leverage: - The Residuals vs Leverage plot identifies influential observations by plotting the residuals against leverage values, in other words, a measure of how much an observation influences the model. - Most observations have low leverage values, but there are a few observations with higher leverage values (Cook’s distance > 0.5). - These high-leverage observations may be exerting undue influence on the model and should be investigated further.
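As a possible follow-up to the Residuals vs Leverage plot, a short sketch for flagging the potentially influential observations mentioned above (the 0.5 cutoff simply mirrors the Cook’s distance rule of thumb; there may be no rows above it):
# Identify observations with a large Cook's distance for further inspection
cd <- cooks.distance(model)
which(cd > 0.5)
bike_data[cd > 0.5, c("Age", "Income", "Cars", "Purchased.Bike")]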
Interpreting the Coefficients:
It is imperative to interpret the coefficients after seeing the
model, the outcome and the diagnosis that was done earlier. The expected
output of the summary(lm_model) will
contain various statistical details and diagnostic information about the
fitted linear regression model.
summary(model)
##
## Call:
## glm(formula = Purchased.Bike ~ Age + Income + Cars, family = binomial(),
## data = bike_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.64665 0.35152 1.840 0.06583 .
## Age -0.01510 0.00607 -2.487 0.01289 *
## Income100,000 0.88441 0.48911 1.808 0.07057 .
## Income110,000 1.61001 0.59081 2.725 0.00643 **
## Income120,000 1.85662 0.56912 3.262 0.00111 **
## Income130,000 1.42734 0.46873 3.045 0.00233 **
## Income150,000 2.87816 1.19991 2.399 0.01646 *
## Income160,000 16.09254 502.40757 0.032 0.97445
## Income170,000 0.54879 1.36231 0.403 0.68707
## Income20,000 0.22284 0.34557 0.645 0.51902
## Income30,000 0.12893 0.30483 0.423 0.67232
## Income40,000 0.90126 0.29866 3.018 0.00255 **
## Income50,000 0.41541 0.41252 1.007 0.31393
## Income60,000 0.61629 0.29515 2.088 0.03679 *
## Income70,000 0.92849 0.31233 2.973 0.00295 **
## Income80,000 0.37293 0.33629 1.109 0.26746
## Income90,000 1.50161 0.43337 3.465 0.00053 ***
## Cars -0.49564 0.07122 -6.960 3.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1384.9 on 999 degrees of freedom
## Residual deviance: 1289.9 on 982 degrees of freedom
## AIC: 1325.9
##
## Number of Fisher Scoring iterations: 13
The coefficients from the model represent changes in the log-odds of purchasing a bike.
Now, analyzing each coefficient below:
(Intercept): The estimate for the intercept (0.64665) represents the log-odds of purchasing a bike when Age and Cars are zero and Income is at its baseline level. The positive sign indicates higher log-odds of purchasing, but since this scenario is not meaningful (for example, Age cannot realistically be zero), the intercept is usually not interpreted in isolation. Its p-value is greater than 0.05, so it is not statistically significant at the 5% level.
Age: The coefficient for Age (-0.01510) suggests that each additional year of age decreases the log-odds of purchasing a bike by about 0.015, consistent with the negative sign. The p-value (0.01289) indicates that this effect is statistically significant at the 5% level.
Income: These coefficients compare each income level to a baseline (the first income level in the factor’s alphabetical ordering). Each one tells us the change in the log-odds of purchasing a bike relative to that baseline. For example, an Income of $100,000 (0.88441) increases the log-odds compared to the baseline, but its p-value (0.07057) indicates this is not statistically significant at the 5% level. Other income levels, like $110,000 and $120,000, show significant positive increases in log-odds, with their respective p-values indicating statistical significance. The extremely large estimate and standard error for $160,000 (16.09 with a standard error of about 502) most likely reflect very few observations at that income level, so that coefficient should not be over-interpreted.
Cars: The coefficient for Cars
(-0.49564) suggests that for each additional car owned, the
log-odds of purchasing a bike decreases. Given the very small p-value
(3.41e-12), this effect is highly statistically
significant.
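One thing worth noting about the Income terms above: Income is stored as text (e.g. "40,000"), so glm() treats it as a factor and estimates a separate coefficient for each income level against the baseline. An alternative worth exploring, sketched below (this is not the model fitted above), is to parse Income into a number and estimate a single slope, which connects to the variable expansion ideas from the introduction:
# Hypothetical alternative: treat Income as a numeric predictor
bike_data$Income.Num <- as.numeric(gsub(",", "", bike_data$Income))
model_income_numeric <- glm(Purchased.Bike ~ Age + Income.Num + Cars,
                            data = bike_data, family = binomial())
summary(model_income_numeric)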
For clarity, and especially in logistic regression, it is often more intuitive to express these log-odds as odds ratios by exponentiating the coefficients:
exp_coef <- exp(coef(model))
This would give us the multiplicative change in odds for a one-unit change in the predictor.
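For example, a small sketch of how the odds ratios might be displayed alongside approximate 95% Wald confidence intervals (confint.default() uses the standard errors reported in the summary):
# Odds ratios with approximate 95% Wald confidence intervals
round(exp(cbind(OddsRatio = coef(model), confint.default(model))), 3)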
For the p-values, asterisks indicate the level of significance, which
helps in determining which variables are statistically significant
predictors of the outcome. Here, variables with *** are
highly significant, while those with ** or *
are significant at a lesser threshold, and those with . are
suggestive of a relationship but not significant at conventional
levels.
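Beyond the individual coefficients, the overall fit can be judged from the null and residual deviance reported in the summary. A small sketch of a likelihood-ratio (deviance) test of the full model against the intercept-only model:
# Likelihood-ratio test: does the full model fit better than the null model?
with(model, pchisq(null.deviance - deviance,
                   df.null - df.residual,
                   lower.tail = FALSE))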
Conclusion:
This model indicates that certain income levels significantly influence the likelihood of purchasing a bike, and that the number of cars owned is negatively associated with bike purchasing. Age also shows a small but statistically significant negative effect.
Suggestions for Further Analysis:
Investigating the interaction between age and income could be insightful, as younger individuals with higher incomes may have different purchasing patterns.
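A minimal sketch of how that interaction could be explored (a possible follow-up, not a model fitted in this analysis), reusing the AIC and deviance comparison ideas from the introduction:
# Hypothetical follow-up: add an Age x Income interaction and compare the fits
model_interaction <- glm(Purchased.Bike ~ Age * Income + Cars,
                         data = bike_data, family = binomial())
AIC(model, model_interaction)                    # lower AIC suggests a better trade-off
anova(model, model_interaction, test = "Chisq")  # likelihood-ratio (deviance) test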
Also, understanding the geographic and / or cultural context that may affect car ownership and bike purchasing could be valuable.