Introduction:

This data dive focuses on Generalized Linear Models (GLMs), transformations, and logistic regression as discussed in this week’s class. GLMs are a broad class of statistical models that extend linear regression to accommodate response variables that are not well modeled by a normal distribution.

In a standard linear regression model, we predict an outcome variable from a linear combination of predictor variables, and we assume that the outcome variable is normally distributed around that linear prediction.

The purpose of this week’s data dive is to build a generalized linear model, making adjustments to the variables in the model where needed, through the following tasks:

  1. Select an interesting binary column of data in the Bike sales dataset, or one which can reasonably be converted into a binary variable.

  2. Build a logistic regression model for this selected variable, using between 1 and 4 explanatory variables.

       1. Interpret the coefficients, and explain what they mean in my analysis.
       2. Using the standard error for at least one coefficient, build a confidence interval (C.I.) for that coefficient and translate its meaning.

  3. Consider a transformation for any explanatory variable, and illustrate why the transformation is or is not needed, using a visualization (for example, a scatterplot) where applicable.

Data Loading:

First, I load the libraries needed for the analysis, then read in the Bike sales dataset and inspect its structure.

# Load the required libraries, suppressing startup messages
suppressPackageStartupMessages({
  library(tidyverse, quietly = TRUE)
  library(ggthemes)
  library(ggrepel)
  library(dplyr, warn.conflicts = FALSE)
  library(readr)
  library(broom)
  library(lindia)
  library(knitr)
  library(car, warn.conflicts = FALSE)
})

# Read the dataset and inspect its structure
bike_data <- read.csv('bike_data.csv')
str(bike_data)
## 'data.frame':    1000 obs. of  14 variables:
##  $ ID              : int  12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
##  $ Marital.Status  : chr  "Married" "Married" "Married" "Single" ...
##  $ Gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ Income          : chr  "40,000" "30,000" "80,000" "70,000" ...
##  $ Children        : int  1 3 5 0 0 2 2 1 2 2 ...
##  $ Education       : chr  "Bachelors" "Partial College" "Partial College" "Bachelors" ...
##  $ Occupation      : chr  "Skilled Manual" "Clerical" "Professional" "Professional" ...
##  $ Home.Owner      : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Cars            : int  0 1 2 1 0 0 4 0 2 1 ...
##  $ Commute.Distance: chr  "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
##  $ Region          : chr  "Europe" "Europe" "Europe" "Pacific" ...
##  $ Age             : int  42 43 60 41 36 50 33 43 58 40 ...
##  $ Age.Brackets    : chr  "Middle Age" "Middle Age" "Old" "Middle Age" ...
##  $ Purchased.Bike  : chr  "No" "No" "No" "Yes" ...

Binary Column for Modeling:

One interesting binary column in this dataset is Purchased.Bike, which indicates whether a bike was purchased. This column is ideal for a logistic regression model because it inherently represents a binary outcome (Yes/No).

To use Purchased.Bike as the dependent variable in a logistic regression model, the first step is to ensure that the column is treated as a binary factor, i.e., converting its values into a factor with two levels.

This matters because logistic regression models the probability of one outcome, in this case “Yes” relative to “No”.
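Concretely, the model expresses the log-odds of a purchase as a linear function of the predictors, with coefficients estimated from the data:

$$\log\left(\frac{P(\text{Purchased.Bike} = \text{Yes})}{1 - P(\text{Purchased.Bike} = \text{Yes})}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$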

Preparing the Binary Column of Data:

As mentioned above, I will need to convert the Purchased.Bike column to a binary factor. Logistic regression also requires numerical inputs for its predictors; in R, glm() handles this automatically by converting categorical predictors into dummy variables.

# Convert 'Purchased Bike' to a factor with levels 'Yes' and 'No'
bike_data$Purchased.Bike <- factor(bike_data$Purchased.Bike, levels = c("No", "Yes"))
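One thing worth noting at this stage: Income is stored as a character column with comma separators (e.g., “40,000”), so glm() will treat it as a factor and estimate a separate dummy coefficient for each income level, which is exactly what appears in the model summary below. If I instead wanted to treat income as a continuous predictor, a minimal sketch would be the following (Income.Numeric is a hypothetical new column introduced here purely for illustration):

# Hypothetical alternative: strip the commas so income could be used as a
# continuous predictor rather than a factor (Income.Numeric is illustrative)
bike_data$Income.Numeric <- parse_number(bike_data$Income)

For this data dive, I keep Income as-is, i.e., as a categorical predictor.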

Building the Logistic Regression Model:

The data has been prepared and converted into a factor above, so I can now build the logistic regression model. For reference, and as discussed in this week’s class, the glm() function fits generalized linear models, including logistic regression. The family = binomial() argument specifies that I am modeling a binary outcome.

model <- glm(Purchased.Bike ~ Age + Income + Cars, data = bike_data, family = binomial())

What do the components of this call mean? The formula Purchased.Bike ~ Age + Income + Cars specifies the response variable on the left and the three explanatory variables on the right; data = bike_data tells glm() where to find those columns; and family = binomial() selects logistic regression, i.e., a binomial response with a logit link.

Analysis and Interpretation:

After fitting the model, I can now use the summary() function to get a detailed summary, including the coefficients for each predictor, which I will explain one after the other.

summary(model)
## 
## Call:
## glm(formula = Purchased.Bike ~ Age + Income + Cars, family = binomial(), 
##     data = bike_data)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.64665    0.35152   1.840  0.06583 .  
## Age            -0.01510    0.00607  -2.487  0.01289 *  
## Income100,000   0.88441    0.48911   1.808  0.07057 .  
## Income110,000   1.61001    0.59081   2.725  0.00643 ** 
## Income120,000   1.85662    0.56912   3.262  0.00111 ** 
## Income130,000   1.42734    0.46873   3.045  0.00233 ** 
## Income150,000   2.87816    1.19991   2.399  0.01646 *  
## Income160,000  16.09254  502.40757   0.032  0.97445    
## Income170,000   0.54879    1.36231   0.403  0.68707    
## Income20,000    0.22284    0.34557   0.645  0.51902    
## Income30,000    0.12893    0.30483   0.423  0.67232    
## Income40,000    0.90126    0.29866   3.018  0.00255 ** 
## Income50,000    0.41541    0.41252   1.007  0.31393    
## Income60,000    0.61629    0.29515   2.088  0.03679 *  
## Income70,000    0.92849    0.31233   2.973  0.00295 ** 
## Income80,000    0.37293    0.33629   1.109  0.26746    
## Income90,000    1.50161    0.43337   3.465  0.00053 ***
## Cars           -0.49564    0.07122  -6.960 3.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1384.9  on 999  degrees of freedom
## Residual deviance: 1289.9  on 982  degrees of freedom
## AIC: 1325.9
## 
## Number of Fisher Scoring iterations: 13

The logistic regression model above provides insight into the factors influencing whether individuals decide to purchase a bike. The breakdown below gives my interpretation of the model output, covering the key findings and conclusions.

Insights:

  1. Age: The coefficient for age is negative (-0.01510), indicating that each additional year of age decreases the log-odds, and hence the probability, of purchasing a bike. This relationship is statistically significant (p = 0.01289), suggesting that younger individuals are more inclined to buy bikes (see the odds-ratio sketch after this list).

  2. Income: Because Income enters the model as a factor, there is one coefficient per income level, and these vary considerably, indicating that income plays a crucial role in the decision to purchase a bike. Higher income brackets, particularly households earning $110,000, $120,000, $130,000, and $150,000, show a significantly higher likelihood of purchasing bikes. The largest coefficient is observed for the $160,000 level, but it is not statistically significant (p = 0.97445): the enormous standard error (about 502) is a classic symptom of quasi-complete separation, likely reflecting very few data points at this income level.

  3. Cars: The coefficient for cars is negative (-0.49564) and highly significant (p < 0.001), indicating that individuals who own more cars are less likely to purchase a bike. This is intuitive, as those with more cars may rely less on biking as a mode of transportation.
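To make these coefficients easier to read, they can be converted from the log-odds scale to odds ratios by exponentiating them; a quick sketch using the fitted model:

# Odds ratios for the two continuous predictors
exp(coef(model))[c("Age", "Cars")]
# exp(-0.01510) is about 0.985: each additional year of age multiplies the
# odds of purchasing a bike by roughly 0.985 (about a 1.5% decrease).
# exp(-0.49564) is about 0.609: each additional car multiplies the odds
# by roughly 0.609 (about a 39% decrease).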

Further Observations:

Confidence Interval for a Coefficient:

Here, I can use the confint() function to calculate confidence intervals for the coefficients, giving a range of plausible values for each coefficient estimate.

confint(model, "Age", level = 0.95)
## Waiting for profiling to be done...
##        2.5 %       97.5 % 
## -0.027062287 -0.003242654
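Since the task asks to build the interval from the standard error, I can also compute a Wald interval directly from the summary output as a cross-check (this sketch uses the normal approximation, whereas confint() uses the profile likelihood):

# Wald 95% CI for the Age coefficient: estimate +/- 1.96 * standard error
est <- coef(summary(model))["Age", "Estimate"]
se  <- coef(summary(model))["Age", "Std. Error"]
est + c(-1, 1) * qnorm(0.975) * se
# approximately -0.0270 to -0.0032, very close to the profile interval above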

Running confint(model, "Age", level = 0.95) returns the 95% confidence interval for the Age coefficient shown above, which I interpret as follows: if I were to replicate this study many times, 95% of the confidence intervals calculated from those studies would contain the true coefficient.

Interpretation:

  1. Negative relationship: Both ends of the confidence interval are negative, reinforcing the conclusion that an increase in age is associated with a decrease in the likelihood of purchasing a bike. This is consistent with the coefficient estimate that I provided earlier.

  2. Statistical significance: Because the confidence interval does not include 0, this indicates that the Age coefficient is statistically significant at the 95% confidence level. If the interval had included 0, it would suggest that we cannot be confident in the direction of the relationship between age and the likelihood of purchasing a bike.
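One more way to read this interval is to exponentiate its endpoints, which puts it on the odds-ratio scale:

# The Age interval on the odds-ratio scale
exp(c(-0.027062287, -0.003242654))
# roughly 0.973 to 0.997: each additional year of age multiplies the odds of
# purchasing a bike by a factor somewhere between about 0.973 and 0.997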

Transformation of an Explanatory Variable:

As discussed in class, transformations in regression analysis are used to address violations of model assumptions, including non-linearity, non-constant variance, and non-normal errors, and to reduce the influence of outliers. To decide whether a transformation is needed for an explanatory variable, a good first step is to inspect scatterplots of the variable against the outcome.

I will use the explanatory variable Age to examine its relationship with the binary outcome Purchased.Bike, using two plots: a jittered scatterplot to show the raw relationship, and the same scatterplot with a smooth curve showing the fitted probabilities from a logistic regression of Purchased.Bike on Age alone.

# Keep only the two columns needed for plotting and drop rows with missing values
plot_data <- na.omit(bike_data[, c("Age", "Purchased.Bike")])

# Jittered scatterplot of the raw relationship
# (as.numeric() maps the factor levels "No"/"Yes" to 1/2, so subtract 1 to get 0/1)
ggplot(plot_data, aes(x = Age, y = as.numeric(Purchased.Bike) - 1)) +
  geom_jitter(alpha = 0.2) +
  labs(y = "Probability of Purchasing Bike", x = "Age") +
  ggtitle("Scatterplot of Age against Probability of Purchasing a Bike")

# Add a smooth logistic fit to see the estimated probabilities
ggplot(plot_data, aes(x = Age, y = as.numeric(Purchased.Bike) - 1)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(y = "Probability of Purchasing Bike", x = "Age") +
  ggtitle("Scatterplot with Smooth Line of Age against Probability of Purchasing a Bike")
## `geom_smooth()` using formula = 'y ~ x'

Inspecting the Plots and Deciding on a Transformation:

Scatterplot without Smooth Line: In the first scatterplot, the jittered points fall in two bands at 0 (“No”) and 1 (“Yes”) across the entire age range, so both outcomes occur at virtually every age and no clear trend is visible from the raw points alone.

Scatterplot with Smooth Line: In the second scatterplot, the fitted logistic curve is nearly flat with a gentle downward slope, consistent with the small negative Age coefficient: the probability of purchasing a bike decreases only slightly as age increases.

Insights: Taken together, the two plots suggest a weak, approximately linear (on the log-odds scale) negative relationship between Age and the probability of purchasing a bike, with no visible curvature or unusual clusters of points.

Transformation Decision:

From the insights gathered and the observations from both plots, I conclude that a transformation of the Age variable may not be strictly necessary: the relationship already appears approximately linear on the log-odds scale, the smooth fit shows no pronounced curvature, and logistic regression makes no normality or constant-variance assumption about the raw response.

However, as a flexible detour, if I were to refine the model further, I might still consider transformations for the following reasons:

  1. Slight Non-Linearity: If the goal is to capture the slight decrease in probability with age more accurately, a mild transformation such as a square root could be tested to see if it improves model fit (see the sketch after this list).

  2. Outliers or Subtle Trends: If there are outliers, or if a trend becomes more apparent with a larger dataset or a different subset of the data, transformations might be reconsidered.

  3. Model Performance: If the model’s predictive performance is not satisfactory, transformations may be part of a strategy to improve it.
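As a quick sketch of the first point, one could refit the model with a square-root Age term and compare the two fits by AIC (model_sqrt is a name introduced here purely for illustration):

# Hypothetical check: does a square-root transformation of Age improve the fit?
model_sqrt <- glm(Purchased.Bike ~ sqrt(Age) + Income + Cars,
                  data = bike_data, family = binomial())
AIC(model, model_sqrt)  # the model with the lower AIC is preferred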

In conclusion, based on the plots above, the data does not show strong evidence of non-linearity or heteroscedasticity, and therefore a transformation of the Age variable is not clearly indicated; in other words, it is not necessary.
