This data dive is focused on Generalized Linear Models (GLM), Transformations, and Logistic Regression as discussed in this week’s class. Generalized Linear Models (GLM) are a broad class of statistical models that extend linear regression to accommodate response variables that are not well modeled by a normal distribution.
In a standard linear regression model, we predict an outcome variable from a linear combination of predictor variables and assume that the outcome is normally distributed around that linear prediction.
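For reference, in the binary-outcome case used later in this document, the GLM is a logistic regression: the log-odds of the outcome are modeled as a linear combination of the predictors,

$$
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
$$

where $p$ is the probability of the outcome (here, purchasing a bike).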
The purpose of this week’s data dive is to build a generalized linear model, or to make adjustments to the variables represented in a model, through the following tasks:
Select an interesting binary column from the Bike sales dataset, or one that can reasonably be converted into a binary variable.
Build a logistic regression model for this selected variable, using between one and four explanatory variables.
First, I load the Bike sales dataset and the libraries needed to analyze it and understand its structure.
suppressPackageStartupMessages({
library(tidyverse, quietly = TRUE)
library(ggthemes)
library(ggrepel)
library(dplyr, warn.conflicts = FALSE)
library(readr)
library(broom)
library(lindia)
library(knitr)
library(car, warn.conflicts = FALSE)
})
bike_data <- read.csv('bike_data.csv')
str(bike_data)
## 'data.frame': 1000 obs. of 14 variables:
## $ ID : int 12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
## $ Marital.Status : chr "Married" "Married" "Married" "Single" ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ Income : chr "40,000" "30,000" "80,000" "70,000" ...
## $ Children : int 1 3 5 0 0 2 2 1 2 2 ...
## $ Education : chr "Bachelors" "Partial College" "Partial College" "Bachelors" ...
## $ Occupation : chr "Skilled Manual" "Clerical" "Professional" "Professional" ...
## $ Home.Owner : chr "Yes" "Yes" "No" "Yes" ...
## $ Cars : int 0 1 2 1 0 0 4 0 2 1 ...
## $ Commute.Distance: chr "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
## $ Region : chr "Europe" "Europe" "Europe" "Pacific" ...
## $ Age : int 42 43 60 41 36 50 33 43 58 40 ...
## $ Age.Brackets : chr "Middle Age" "Middle Age" "Old" "Middle Age" ...
## $ Purchased.Bike : chr "No" "No" "No" "Yes" ...
One interesting binary column that I identified in this dataset is Purchased.Bike, which indicates whether a bike was purchased. I believe this column is ideal for a logistic regression model because it inherently represents a binary outcome (Yes/No).
To use this Purchased.Bike column as the dependent variable in a logistic regression model, the first step is to ensure that it is treated as a binary factor, i.e. to convert the Purchased.Bike values into a factor with two levels. This is important for logistic regression, which models the probability of one outcome, in this case “Yes” relative to “No”.
As mentioned above, I will convert the Purchased.Bike column to a binary factor. Categorical predictors also need a numeric representation; in R, glm() handles this automatically by expanding factor (or character) variables into dummy indicator variables, as illustrated in the sketch below.
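As a small illustration (an addition for this write-up, not one of the required steps), model.matrix() can preview the indicator columns that glm() builds behind the scenes for a categorical predictor such as Occupation:
# Preview the dummy (indicator) columns R constructs for a categorical predictor;
# glm() performs this expansion automatically when fitting the model
head(model.matrix(~ Occupation, data = bike_data))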
# Convert 'Purchased Bike' to a factor with levels 'Yes' and 'No'
bike_data$Purchased.Bike <- factor(bike_data$Purchased.Bike, levels = c("No", "Yes"))
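As a quick sanity check (again, an illustrative addition), I can confirm the factor levels and look at the class balance of the response:
# Confirm the two factor levels and count the "No" and "Yes" responses
levels(bike_data$Purchased.Bike)
table(bike_data$Purchased.Bike)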
With the data prepared and the response converted to a factor, I can now build the logistic regression model. For reference, and as discussed in this week’s class, the glm() function fits generalized linear models, including logistic regression. The family = binomial() argument specifies that I am modeling a binary outcome (with the default logit link).
model <- glm(Purchased.Bike ~ Age + Income + Cars, data = bike_data, family = binomial())
What do the components of this call mean? Purchased.Bike ~ Age + Income + Cars places the binary response on the left of the tilde and the explanatory variables on the right; data = bike_data names the dataset; and family = binomial() tells glm() to fit a logistic regression.
After fitting the model, I can use the summary() function to get a detailed summary, including the coefficients for each predictor, which I will explain one after the other.
summary(model)
##
## Call:
## glm(formula = Purchased.Bike ~ Age + Income + Cars, family = binomial(),
## data = bike_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.64665 0.35152 1.840 0.06583 .
## Age -0.01510 0.00607 -2.487 0.01289 *
## Income100,000 0.88441 0.48911 1.808 0.07057 .
## Income110,000 1.61001 0.59081 2.725 0.00643 **
## Income120,000 1.85662 0.56912 3.262 0.00111 **
## Income130,000 1.42734 0.46873 3.045 0.00233 **
## Income150,000 2.87816 1.19991 2.399 0.01646 *
## Income160,000 16.09254 502.40757 0.032 0.97445
## Income170,000 0.54879 1.36231 0.403 0.68707
## Income20,000 0.22284 0.34557 0.645 0.51902
## Income30,000 0.12893 0.30483 0.423 0.67232
## Income40,000 0.90126 0.29866 3.018 0.00255 **
## Income50,000 0.41541 0.41252 1.007 0.31393
## Income60,000 0.61629 0.29515 2.088 0.03679 *
## Income70,000 0.92849 0.31233 2.973 0.00295 **
## Income80,000 0.37293 0.33629 1.109 0.26746
## Income90,000 1.50161 0.43337 3.465 0.00053 ***
## Cars -0.49564 0.07122 -6.960 3.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1384.9 on 999 degrees of freedom
## Residual deviance: 1289.9 on 982 degrees of freedom
## AIC: 1325.9
##
## Number of Fisher Scoring iterations: 13
The logistic regression model above provides comprehensive insight into the factors influencing whether individuals decide to purchase a bike. The breakdown below gives my interpretation of the results, covering key findings and conclusions.
Age: The coefficient for age is negative (-0.01510), indicating that as age increases, the likelihood of purchasing a bike decreases; on the odds scale this is exp(-0.01510) ≈ 0.985, roughly a 1.5% decrease in the odds per additional year (see the odds-ratio sketch after this list). This relationship is statistically significant (p = 0.01289), suggesting that younger individuals are more inclined to buy bikes.
Income: Because Income is stored as text (e.g. "40,000"), R treats it as a categorical variable and estimates a separate coefficient for each income level relative to a baseline level. These coefficients vary considerably, indicating that income plays an important role in the decision to purchase a bike. Higher income brackets, particularly households earning $110,000, $120,000, $130,000, and $150,000, show a significantly higher likelihood of purchasing bikes. The largest coefficient is observed for the $160,000 income level, although this result is not statistically significant (p = 0.97445) due to a very large standard error, likely reflecting too few data points at this income level or an outlier.
Cars: The coefficient for cars is negative (-0.49564) and highly significant (p < 0.001), indicating that individuals who own more cars are less likely to purchase a bike. This is intuitive, as those with more cars may rely less on biking as a mode of transportation.
Demographic targeting: Younger individuals with higher incomes, especially in brackets from $110,000 to $150,000, are more likely to purchase bikes.
Income as a predictor: Income is a strong predictor of bike purchasing behavior, with certain brackets showing a much higher propensity to buy. This indicates a non-linear relationship between income and the likelihood of purchasing a bike, where certain income levels are particularly inclined towards buying bikes.
Alternative transportation: The significant negative relationship between the number of cars owned and the likelihood of purchasing a bike suggests a substitution effect, where bikes are less likely to be purchased by those with alternative modes of transportation available.
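Because the coefficients reported by summary() are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch using the fitted model:
# Convert selected log-odds coefficients to odds ratios; values below 1 mean the
# odds of purchasing a bike decrease as that predictor increases
exp(coef(model))[c("Age", "Cars")]
For example, the Cars odds ratio of roughly exp(-0.496) ≈ 0.61 means each additional car is associated with about a 39% reduction in the odds of purchasing a bike.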
Here, I can use the confint() function
to calculate confidence intervals for the coefficients, which will give
a range of values within which the true coefficient value is likely to
lie.
confint(model, "Age", level = 0.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## -0.027062287 -0.003242654
The above command calculates the 95% confidence interval for the Age coefficient, providing insight into the precision of the model’s estimate. Running confint(model, "Age", level = 0.95) produces the interval shown above, which I interpret as follows:
This provides the range of values based on the model’s estimation and signifies that if I were to replicate this study multiple times, 95% of the confidence intervals calculated from those studies would contain the true coefficient.
Negative relationship: Both ends of the confidence interval are negative, reinforcing the conclusion that an increase in age is associated with a decrease in the likelihood of purchasing a bike. This is consistent with the coefficient estimate that I provided earlier.
Statistical significance: Because the confidence
interval does not include 0, this indicates that the
Age coefficient is statistically
significant at the 95% confidence level. If the interval had included 0,
it would suggest that we cannot be confident in the direction of the
relationship between age and the likelihood of purchasing a
bike.
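Since the Age coefficient is on the log-odds scale, the interval can also be exponentiated to express it as an odds-ratio interval; this is a small extra step beyond what was run above:
# Express the profile-likelihood interval for Age as an odds-ratio interval;
# both endpoints fall below 1, matching the negative coefficient estimate
exp(confint(model, "Age", level = 0.95))
This gives an interval of roughly (0.973, 0.997) for the multiplicative change in the odds of purchase per additional year of age.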
As discussed in the context of regression analysis, transformations are used to address problems with model assumptions, including non-linearity, non-constant variance, and non-normally distributed errors, and to reduce the influence of outliers. To decide whether a transformation is needed for an explanatory variable, a good first step is to inspect scatterplots of the variable against the outcome.
I will use the explanatory variable Age to examine its relationship with the binary outcome Purchased.Bike. The code below produces two plots: a scatterplot with jitter to show the raw relationship, and the same scatterplot with a smooth line showing the fitted probabilities from a logistic regression model.
# Drop rows with missing Age or Purchased.Bike, keeping the result in a
# separate plotting data frame so the full bike_data stays intact
plot_data <- na.omit(bike_data[, c("Age", "Purchased.Bike")])
# Scatterplot with jitter to visualize the raw relationship;
# as.numeric(Purchased.Bike) - 1 converts the factor levels "No"/"Yes" to 0/1
ggplot(plot_data, aes(x = Age, y = as.numeric(Purchased.Bike) - 1)) +
  geom_jitter(alpha = 0.2) +
  labs(y = "Probability of Purchasing Bike", x = "Age") +
  ggtitle("Scatterplot of Age against Probability of Purchasing a Bike")
# Add a smooth line showing the fitted probabilities from a logistic regression
ggplot(plot_data, aes(x = Age, y = as.numeric(Purchased.Bike) - 1)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(y = "Probability of Purchasing Bike", x = "Age") +
  ggtitle("Scatterplot with Smooth Line of Age against Probability of Purchasing a Bike")
## `geom_smooth()` using formula = 'y ~ x'
Scatterplot without Smooth Line: From the first scatterplot without the line, we can see the following:
Data points are widely distributed across all ages, indicating that bike purchase decisions are being made by individuals across the age spectrum.
There is no clear pattern that suggests a strong relationship between age and the probability of purchasing a bike. The spread of data points is fairly consistent across different age groups.
Scatterplot with Smooth Line: The following are observed in the second scatterplot:
The smoothed line (representing the logistic regression model’s fit) is relatively flat and does not indicate a strong slope. However, there seems to be a very slight negative trend, suggesting that as age increases, the probability of purchasing a bike might decrease slightly.
The variance around the line does not appear to change dramatically with age, which suggests there might not be an issue with heteroscedasticity.
From the observations in both plots, I conclude that a transformation of the Age variable may not be strictly necessary: the smoothed fit shows only a slight, roughly linear decline rather than strong curvature, and the spread of points does not change noticeably across ages.
However, as a brief digression, if I were to refine the model further, I might still consider transformations for the following reasons:
Slight Non-Linearity: If the goal is to capture the slight decrease in probability with age more accurately, a mild transformation such as a square root could be tested to see if it improves model fit (a quick sketch of this comparison follows the conclusion below).
Outliers or Subtle Trends: If there are outliers or if a trend becomes more apparent with a larger dataset or a different subset of the data, transformations might be reconsidered.
Model Performance: If the model’s predictive performance is not satisfactory, transformations may be part of a strategy to improve it.
In conclusion, based on the plots above, the data do not show strong evidence of non-linearity or heteroscedasticity, so a transformation of the Age variable is not clearly indicated; in other words, it is not necessary here.
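As a hedged follow-up to the square-root idea mentioned above (a sketch only, assuming the three-predictor model and the full bike_data from earlier), the two fits could be compared by AIC:
# Refit the model with a square-root transform of Age and compare AIC values;
# a meaningfully lower AIC for model_sqrt would be the only reason to revisit
# the conclusion that no transformation is needed
model_sqrt <- glm(Purchased.Bike ~ sqrt(Age) + Income + Cars,
                  data = bike_data, family = binomial())
AIC(model, model_sqrt)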