Introduction:
Last week’s analysis covered Generalized Linear Models (GLM), Transformations, and Logistic Regression; however, this data dive will mostly continue the work on Generalized Linear Models (GLM). As discussed earlier, Generalized Linear Models (GLM) are a broad class of statistical models that extend linear regression to accommodate response variables that are not well modeled by a normal distribution.
As discussed in last week’s class, this data dive will follow these statistical topics:
Generalized Linear Models (GLMs): These are a class of models that extend linear regression to handle non-normally distributed response variables by using a link function.
Maximum Likelihood Estimation (MLE): A method used in GLMs to estimate the model parameters by maximizing the likelihood function, providing the most probable parameter values given the observed data.
Model Comparison Metrics: When comparing GLMs, metrics like the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), analysis of variance (ANOVA), and deviance help assess model fit, complexity, and predictive performance, aiding in model selection (a brief sketch of these ideas follows this list).
Variable Expansion and Selection: Variable expansion involves transforming predictors to better fit the data distribution, while variable selection aims to identify the most relevant predictors, enhancing model interpretability and predictive accuracy in GLMs.
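As a quick, hypothetical illustration of the ideas listed above (simulated data, not the bike dataset), the sketch below fits a small logistic GLM by maximum likelihood and extracts the quantities typically used for model comparison:
# Toy example: a GLM with a logit link, fit by maximum likelihood
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(-0.5 + 1.2 * toy$x))
m <- glm(y ~ x, data = toy, family = binomial(link = "logit"))
logLik(m)    # maximized log-likelihood (the MLE criterion)
AIC(m)       # Akaike Information Criterion
BIC(m)       # Bayesian Information Criterion
deviance(m)  # residual deviance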
Continuing from last week’s analysis of Generalized Linear Models (GLM), this deep dive examines how the model relates to and informs our decisions, conclusions, and insights for the bike sales dataset.
Data Loading:
Loading and understanding the data structure is next; the dataset and the necessary libraries for the analysis are loaded below:
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(dplyr)
library(broom)
library(lindia)
library(knitr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
bike_data <- read.csv('bike_data.csv')
Exploratory Data Analysis:
Exploratory data analysis will be used to summarize the main characteristics of the dataset; it can help identify initial patterns, trends, relationships, and anomalies in the data.
In this case, the summary() function will provide the summary statistics for the variables in the bike_data dataset, giving insights into the distribution, central tendency, and variability of the data.
The str() function, on the other hand, will display the structure of the bike_data dataset, including the data type of each variable, allowing for a deeper understanding of the dataset’s composition and structure.
summary(bike_data)
## ID Marital.Status Gender Income
## Min. :11000 Length:1000 Length:1000 Length:1000
## 1st Qu.:15291 Class :character Class :character Class :character
## Median :19744 Mode :character Mode :character Mode :character
## Mean :19966
## 3rd Qu.:24471
## Max. :29447
## Children Education Occupation Home.Owner
## Min. :0.000 Length:1000 Length:1000 Length:1000
## 1st Qu.:0.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.898
## 3rd Qu.:3.000
## Max. :5.000
## Cars Commute.Distance Region Age
## Min. :0.000 Length:1000 Length:1000 Min. :25.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:35.00
## Median :1.000 Mode :character Mode :character Median :43.00
## Mean :1.442 Mean :44.16
## 3rd Qu.:2.000 3rd Qu.:52.00
## Max. :4.000 Max. :89.00
## Age.Brackets Purchased.Bike
## Length:1000 Length:1000
## Class :character Class :character
## Mode :character Mode :character
##
##
##
str(bike_data)
## 'data.frame': 1000 obs. of 14 variables:
## $ ID : int 12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
## $ Marital.Status : chr "Married" "Married" "Married" "Single" ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ Income : chr "40,000" "30,000" "80,000" "70,000" ...
## $ Children : int 1 3 5 0 0 2 2 1 2 2 ...
## $ Education : chr "Bachelors" "Partial College" "Partial College" "Bachelors" ...
## $ Occupation : chr "Skilled Manual" "Clerical" "Professional" "Professional" ...
## $ Home.Owner : chr "Yes" "Yes" "No" "Yes" ...
## $ Cars : int 0 1 2 1 0 0 4 0 2 1 ...
## $ Commute.Distance: chr "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
## $ Region : chr "Europe" "Europe" "Europe" "Pacific" ...
## $ Age : int 42 43 60 41 36 50 33 43 58 40 ...
## $ Age.Brackets : chr "Middle Age" "Middle Age" "Old" "Middle Age" ...
## $ Purchased.Bike : chr "No" "No" "No" "Yes" ...
Building the Generalized Linear Model: Using the variables from the models created in last week’s analysis:
# Convert 'Purchased.Bike' to a binary factor
bike_data$Purchased.Bike <- factor(bike_data$Purchased.Bike, levels = c("No", "Yes"))
# Building the GLM model
model <- glm(Purchased.Bike ~ Age + Income + Cars, data = bike_data, family = binomial())
# Double Checking the structure of the Purchased.Bike
str(bike_data$Purchased.Bike)
## Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 2 1 2 ...
Output Interpretation:
The output from str() indicates that the Purchased.Bike variable in the dataset has been successfully converted into a factor with two levels: “No” (stored internally as 1) and “Yes” (stored internally as 2).
In R, factor levels are always coded internally starting at 1, not 0, even for a binary outcome. So, the display
Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 2 1 2 ...
tells us that:
The model interprets the first level of the factor (“No”) as 0 (failure, non-event) and the second level (“Yes”) as 1 (success, event) when estimating the model parameters.
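As a small optional check (a sketch), this coding can be verified directly:
# Confirm the level ordering and the internal coding used by glm()
levels(bike_data$Purchased.Bike)      # "No" "Yes" -- the first level is the reference
contrasts(bike_data$Purchased.Bike)   # the "Yes" column shows which level is coded as 1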
After converting the response to a binary factor for the Generalized Linear Model, we can proceed with diagnosing the model, checking the summary, and making the necessary interpretations.
If the coefficients turn out to be significant and the diagnostics (for example, the residual plots) look reasonable, we can be reasonably confident that the model is behaving as expected.
Diagnosing the Model:
Diagnostic plots will be used here to check for issues like non-linearity, heteroscedasticity, or outliers.
par(mfrow=c(2,2))
plot(model)
Diagnosis Interpretation:
The figure above presents several diagnostic plots for assessing the assumptions and fit of our Generalized Linear Model (GLM). Here are the insights that can be gathered from each plot:
Residuals vs Fitted: - This plot shows the residuals (i.e. the difference between observed and fitted values) against the fitted values from the model. - The residuals appear to be randomly scattered around zero, indicating that the assumption of constant variance and linearity is likely met. - However, there are a few potential outliers with large positive and negative residuals, which may warrant further investigation.
Normal Q-Q Plot: - This quantile-quantile plot compares the distribution of the residuals to a theoretical normal distribution. - It can be observed that the residuals deviate slightly from the straight line, suggesting a slight violation of the normality assumption. - For large sample sizes, moderate deviations from normality are often acceptable.
Scale-Location Plot: - This plot examines the assumption of constant variance by plotting the square root of the standardized residuals against the fitted values. - The points appear to be randomly scattered, indicating that the assumption of constant variance is likely met. - There are a few potential outliers that may be influencing the results.
Residuals vs Leverage: - The Residuals vs Leverage plot identifies influential observations by plotting the residuals against leverage values, in other words, a measure of how much an observation influences the model. - Most observations have low leverage values, but there are a few observations with higher leverage values (Cook’s distance > 0.5). - These high-leverage observations may be exerting undue influence on the model and should be investigated further.
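As a possible follow-up to the Residuals vs Leverage plot, a short sketch for flagging the potentially influential observations mentioned above (the 0.5 cutoff simply mirrors the Cook’s distance rule of thumb; there may be no rows above it):
# Identify observations with a large Cook's distance for further inspection
cd <- cooks.distance(model)
which(cd > 0.5)
bike_data[cd > 0.5, c("Age", "Income", "Cars", "Purchased.Bike")]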
Interpreting the Coefficients:
It is imperative to interpret the coefficients after seeing the
model, the outcome and the diagnosis that was done earlier. The expected
output of the summary(lm_model) will
contain various statistical details and diagnostic information about the
fitted linear regression model.
summary(model)
##
## Call:
## glm(formula = Purchased.Bike ~ Age + Income + Cars, family = binomial(),
## data = bike_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.64665 0.35152 1.840 0.06583 .
## Age -0.01510 0.00607 -2.487 0.01289 *
## Income100,000 0.88441 0.48911 1.808 0.07057 .
## Income110,000 1.61001 0.59081 2.725 0.00643 **
## Income120,000 1.85662 0.56912 3.262 0.00111 **
## Income130,000 1.42734 0.46873 3.045 0.00233 **
## Income150,000 2.87816 1.19991 2.399 0.01646 *
## Income160,000 16.09254 502.40757 0.032 0.97445
## Income170,000 0.54879 1.36231 0.403 0.68707
## Income20,000 0.22284 0.34557 0.645 0.51902
## Income30,000 0.12893 0.30483 0.423 0.67232
## Income40,000 0.90126 0.29866 3.018 0.00255 **
## Income50,000 0.41541 0.41252 1.007 0.31393
## Income60,000 0.61629 0.29515 2.088 0.03679 *
## Income70,000 0.92849 0.31233 2.973 0.00295 **
## Income80,000 0.37293 0.33629 1.109 0.26746
## Income90,000 1.50161 0.43337 3.465 0.00053 ***
## Cars -0.49564 0.07122 -6.960 3.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1384.9 on 999 degrees of freedom
## Residual deviance: 1289.9 on 982 degrees of freedom
## AIC: 1325.9
##
## Number of Fisher Scoring iterations: 13
The coefficients from the model represent changes in the log-odds of purchasing a bike.
Now, analyzing each coefficient below:
(Intercept): The estimate for the intercept (0.64665) represents the log-odds of purchasing a bike when Age and Cars are zero and Income is at its baseline level. The positive sign indicates higher log-odds of purchasing, but since this scenario is not meaningful (for example, Age cannot realistically be zero), the intercept is usually not interpreted in isolation. Its p-value is greater than 0.05, so it is not statistically significant at the 5% level.
Age: The coefficient for Age (-0.01510) suggests that each additional year of age decreases the log-odds of purchasing a bike by about 0.015, consistent with the negative sign. The p-value (0.01289) indicates that this effect is statistically significant at the 5% level.
Income: These coefficients compare each income level to a baseline (the first income level in the factor’s alphabetical ordering). Each one tells us the change in the log-odds of purchasing a bike relative to that baseline. For example, an Income of $100,000 (0.88441) increases the log-odds compared to the baseline, but its p-value (0.07057) indicates this is not statistically significant at the 5% level. Other income levels, like $110,000 and $120,000, show significant positive increases in log-odds, with their respective p-values indicating statistical significance. The extremely large estimate and standard error for $160,000 (16.09 with a standard error of about 502) most likely reflect very few observations at that income level, so that coefficient should not be over-interpreted.
Cars: The coefficient for Cars
(-0.49564) suggests that for each additional car owned, the
log-odds of purchasing a bike decreases. Given the very small p-value
(3.41e-12), this effect is highly statistically
significant.
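One thing worth noting about the Income terms above: Income is stored as text (e.g. "40,000"), so glm() treats it as a factor and estimates a separate coefficient for each income level against the baseline. An alternative worth exploring, sketched below (this is not the model fitted above), is to parse Income into a number and estimate a single slope, which connects to the variable expansion ideas from the introduction:
# Hypothetical alternative: treat Income as a numeric predictor
bike_data$Income.Num <- as.numeric(gsub(",", "", bike_data$Income))
model_income_numeric <- glm(Purchased.Bike ~ Age + Income.Num + Cars,
                            data = bike_data, family = binomial())
summary(model_income_numeric)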
For clarity, and especially in logistic regression, it is often more intuitive to express these log-odds as odds ratios by exponentiating the coefficients:
exp_coef <- exp(coef(model))
This would give us the multiplicative change in odds for a one-unit change in the predictor.
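For example, a small sketch of how the odds ratios might be displayed alongside approximate 95% Wald confidence intervals (confint.default() uses the standard errors reported in the summary):
# Odds ratios with approximate 95% Wald confidence intervals
round(exp(cbind(OddsRatio = coef(model), confint.default(model))), 3)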
For the p-values, asterisks indicate the level of significance, which
helps in determining which variables are statistically significant
predictors of the outcome. Here, variables with *** are
highly significant, while those with ** or *
are significant at a lesser threshold, and those with . are
suggestive of a relationship but not significant at conventional
levels.
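Beyond the individual coefficients, the overall fit can be judged from the null and residual deviance reported in the summary. A small sketch of a likelihood-ratio (deviance) test of the full model against the intercept-only model:
# Likelihood-ratio test: does the full model fit better than the null model?
with(model, pchisq(null.deviance - deviance,
                   df.null - df.residual,
                   lower.tail = FALSE))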
Conclusion:
This model indicates that certain income levels significantly influence the likelihood of purchasing a bike, and that the number of cars owned is negatively associated with bike purchasing. Age also shows a small but statistically significant negative effect.
Suggestions for Further Analysis:
Investigating the interaction between age and income could be insightful, as younger individuals with higher incomes may have different purchasing patterns.
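A minimal sketch of how that interaction could be explored (a possible follow-up, not a model fitted in this analysis), reusing the AIC and deviance comparison ideas from the introduction:
# Hypothetical follow-up: add an Age x Income interaction and compare the fits
model_interaction <- glm(Purchased.Bike ~ Age * Income + Cars,
                         data = bike_data, family = binomial())
AIC(model, model_interaction)                    # lower AIC suggests a better trade-off
anova(model, model_interaction, test = "Chisq")  # likelihood-ratio (deviance) test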
Also, understanding the geographic and / or cultural context that may affect car ownership and bike purchasing could be valuable.