Introduction:
This week’s data dive continues last week’s dive on linear regression, but this one focuses primarily on critically interpreting and diagnosing regression models.
Building on last week’s regression modeling, I will present a comprehensive data analysis aimed at understanding the factors that influence bike purchases.
The data dive will explore the following techniques, among others, as required by this week’s task:
Regression Characteristics.
Regression Diagnostics.
Simpson’s Paradox (demo).
I will also refer to techniques used in past weeks’ data dives, since the two are closely intertwined and will be helpful throughout the analysis. These include:
ANOVA.
F-test.
Bonferroni Correction.
Linear Regression Theory.
Data Loading:
As usual, I will load the dataset alongside the necessary libraries that will be needed for the analysis.
library(ggthemes)
library(ggrepel)
## Loading required package: ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(lindia)
library(knitr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
bike_data <- read.csv('bike_data.csv')
Preliminary Data Examination:
Using summary(), str(), and head() to pre-analyse the dataset.
summary(bike_data)
## ID Marital.Status Gender Income
## Min. :11000 Length:1000 Length:1000 Length:1000
## 1st Qu.:15291 Class :character Class :character Class :character
## Median :19744 Mode :character Mode :character Mode :character
## Mean :19966
## 3rd Qu.:24471
## Max. :29447
## Children Education Occupation Home.Owner
## Min. :0.000 Length:1000 Length:1000 Length:1000
## 1st Qu.:0.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.898
## 3rd Qu.:3.000
## Max. :5.000
## Cars Commute.Distance Region Age
## Min. :0.000 Length:1000 Length:1000 Min. :25.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:35.00
## Median :1.000 Mode :character Mode :character Median :43.00
## Mean :1.442 Mean :44.16
## 3rd Qu.:2.000 3rd Qu.:52.00
## Max. :4.000 Max. :89.00
## Age.Brackets Purchased.Bike Bike.Sold
## Length:1000 Length:1000 Min. :0.000
## Class :character Class :character 1st Qu.:0.000
## Mode :character Mode :character Median :0.000
## Mean :0.481
## 3rd Qu.:1.000
## Max. :1.000
str(bike_data)
## 'data.frame': 1000 obs. of 15 variables:
## $ ID : int 12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
## $ Marital.Status : chr "Married" "Married" "Married" "Single" ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ Income : chr "40,000" "30,000" "80,000" "70,000" ...
## $ Children : int 1 3 5 0 0 2 2 1 2 2 ...
## $ Education : chr "Bachelors" "Partial College" "Partial College" "Bachelors" ...
## $ Occupation : chr "Skilled Manual" "Clerical" "Professional" "Professional" ...
## $ Home.Owner : chr "Yes" "Yes" "No" "Yes" ...
## $ Cars : int 0 1 2 1 0 0 4 0 2 1 ...
## $ Commute.Distance: chr "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
## $ Region : chr "Europe" "Europe" "Europe" "Pacific" ...
## $ Age : int 42 43 60 41 36 50 33 43 58 40 ...
## $ Age.Brackets : chr "Middle Age" "Middle Age" "Old" "Middle Age" ...
## $ Purchased.Bike : chr "No" "No" "No" "Yes" ...
## $ Bike.Sold : int 0 0 0 1 1 0 1 1 0 1 ...
head(bike_data)
## ID Marital.Status Gender Income Children Education Occupation
## 1 12496 Married Female 40,000 1 Bachelors Skilled Manual
## 2 24107 Married Male 30,000 3 Partial College Clerical
## 3 14177 Married Male 80,000 5 Partial College Professional
## 4 24381 Single Male 70,000 0 Bachelors Professional
## 5 25597 Single Male 30,000 0 Bachelors Clerical
## 6 13507 Married Female 10,000 2 Partial College Manual
## Home.Owner Cars Commute.Distance Region Age Age.Brackets Purchased.Bike
## 1 Yes 0 0-1 Miles Europe 42 Middle Age No
## 2 Yes 1 0-1 Miles Europe 43 Middle Age No
## 3 No 2 2-5 Miles Europe 60 Old No
## 4 Yes 1 5-10 Miles Pacific 41 Middle Age Yes
## 5 No 0 0-1 Miles Europe 36 Middle Age Yes
## 6 Yes 0 1-2 Miles Europe 50 Middle Age No
## Bike.Sold
## 1 0
## 2 0
## 3 0
## 4 1
## 5 1
## 6 0
Initial Linear Regression Model:
Now, let us revisit the initial regression model generated last week; I will also explore an interaction term for this task. Before running the linear regression, however, I need to check the model variables for errors and clean them up.
Last week, I ran into the error message “NA/NaN/Inf in ‘y’”, which indicates that the dependent variable Income in the linear model has missing (NA), not-a-number (NaN), or infinite (Inf) values. The major culprit here is that the Income variable is formatted as currency with commas, so R reads it as a character (or factor) type rather than numeric, which prevents proper numerical operations and causes issues when fitting a model. Since R’s lm function cannot handle these kinds of values when fitting a model, let me first run some code to remove the commas and convert the Income column to numeric before enhancing the regression model. Both steps are shown below:
# Strip the thousands separators and convert Income to numeric
bike_data$Income <- as.numeric(gsub(",", "", bike_data$Income))
# Normalise any remaining NaN/Inf values to NA so they can all be dropped together
bike_data$Income <- ifelse(is.na(bike_data$Income) | is.nan(bike_data$Income) | is.infinite(bike_data$Income), NA, bike_data$Income)
# Keep only complete cases for the two model variables
bike_data_clean <- na.omit(bike_data[, c("Income", "Age")])
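# Sanity check (my addition, not part of the original script): confirm the
# conversion worked and count how many Income values ended up missing
stopifnot(is.numeric(bike_data$Income))
sum(is.na(bike_data$Income))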
initial_model <- lm(Income ~ Age, data = bike_data_clean)
summary(initial_model)
##
## Call:
## lm(formula = Income ~ Age, data = bike_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58380 -22621 -1402 16507 111855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35814.66 3890.80 9.205 < 2e-16 ***
## Age 465.22 85.32 5.452 6.27e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30650 on 998 degrees of freedom
## Multiple R-squared: 0.02893, Adjusted R-squared: 0.02795
## F-statistic: 29.73 on 1 and 998 DF, p-value: 6.266e-08
The output of the lm function above provides a wealth of information about the fitted linear regression model. I will carefully explain the most important parts of the output below, along with the insights we can gain from them:
The summary of residuals gives us a quick insight into the distribution of errors between the actual and predicted values of Income. In this case:
The Min and Max values suggest there are quite large errors in the predictions for at least some points.
The median is close to zero, which is good, but the range suggests that the model may not be predicting equally well across all ages.
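To see this distribution directly, the residuals can be pulled straight from the fitted model; a quick sketch, not part of the original output:
# Numeric summary of the residuals, matching the block printed above
summary(residuals(initial_model))
# Histogram to eyeball the skew suggested by the Min/Max spread
hist(residuals(initial_model), breaks = 30, main = "Residuals of initial model")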
This part includes the estimated coefficients for the intercept and the slope (for Age), their standard errors, t-values, and the p-values for the t-tests on each coefficient:
The intercept is estimated at 35,814.66 with a very low p-value, indicating that it is significantly different from zero.
The coefficient for Age is 465.22, suggesting that for each additional year of age, income increases by this amount, on average.
The p-value for Age is also very low, indicating that age is a statistically significant predictor of income.
The asterisks indicate the level of statistical significance for the coefficients. In this output, both the intercept and the Age coefficient have three asterisks, which denote a significance level of less than 0.001.
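To make the slope concrete, we can plug an age into the fitted line; a small sketch (age 40 is just an illustration, not a value from the analysis):
# Expected income at age 40, i.e. 35814.66 + 465.22 * 40, roughly 54,423
predict(initial_model, newdata = data.frame(Age = 40))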
The residual standard error is the average amount by which the response deviates from the true regression line. At 30,650, it is quite large considering the scale of income.
The F-statistic and its p-value relate to the overall significance of the regression model. The F-statistic is relatively large and the p-value is very small, which suggests that the model is statistically significant as a whole.
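Since broom is already loaded, the same coefficient table and model-level statistics can also be pulled into tidy data frames, which is handy for further processing; a minimal sketch:
# Coefficient estimates, standard errors, t-values and p-values as a data frame
tidy(initial_model)
# One-row summary: R-squared, F-statistic, p-value, residual standard error, etc.
glance(initial_model)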
# Enhance the model with Education and an interaction between Cars and Children
enhanced_model <- lm(Income ~ Age + Education + Cars * Children, data = bike_data)
summary(enhanced_model)
##
## Call:
## lm(formula = Income ~ Age + Education + Cars * Children, data = bike_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51787 -15776 -2741 13407 120508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33231.61 3664.54 9.068 < 2e-16 ***
## Age -35.96 79.23 -0.454 0.65
## EducationGraduate Degree 13613.21 2340.74 5.816 8.14e-09 ***
## EducationHigh School -23202.28 2288.66 -10.138 < 2e-16 ***
## EducationPartial College -9188.06 2020.67 -4.547 6.11e-06 ***
## EducationPartial High School -39939.26 3157.96 -12.647 < 2e-16 ***
## Cars 19870.00 1135.79 17.494 < 2e-16 ***
## Children 5485.31 890.78 6.158 1.07e-09 ***
## Cars:Children -2203.27 415.50 -5.303 1.41e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24000 on 991 degrees of freedom
## Multiple R-squared: 0.4088, Adjusted R-squared: 0.4041
## F-statistic: 85.67 on 8 and 991 DF, p-value: < 2.2e-16
vif(enhanced_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## Age 1.406497 1 1.185958
## Education 1.195727 4 1.022596
## Cars 2.832999 1 1.683151
## Children 3.650941 1 1.910744
## Cars:Children 5.720360 1 2.391727
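The warning above suggests recomputing the VIFs with type = 'predictor' when interaction terms are present; a quick follow-up (output not shown here, and behaviour depends on the installed car version):
# Predictor-level GVIFs that account for the Cars:Children interaction
vif(enhanced_model, type = "predictor")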
The output of the enhanced regression model and the Variance Inflation Factor (VIF) analysis provide a rich set of insights, broken down below:
The Cars:Children interaction term has a high VIF, suggesting its variance is highly inflated due to correlation with other predictors. The elevated VIFs for Cars and Children also point to multicollinearity, which is expected here since an interaction term shares information with its main effects.
To create diagnostic plots for the enhanced model created above, I will use the base plot function directly on the model object. This generates four standard diagnostic plots that help evaluate the assumptions of linear regression: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.
# Arrange all four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(enhanced_model)
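As an alternative, the lindia package loaded earlier can render similar diagnostics in ggplot2 style; a sketch, assuming the current lindia API:
# ggplot2-based panel of standard lm diagnostic plots
gg_diagnose(enhanced_model)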
From the diagnostic plots generated above, we can conclude that while the model captures the general trend, there are anomalies, particularly outliers, that may be affecting the model’s performance. Further analysis should involve checking the influence of these outliers, considering data transformations, or applying robust regression techniques to mitigate their effect.
We can analyse outliers and leverage using Cook’s Distance. The Cook’s Distance plot assesses the influence of individual data points on the fitted regression coefficients. I will run the code below and interpret the plot afterwards:
# Compute Cook's Distance for every observation and plot it as spikes
cooks_dist <- cooks.distance(enhanced_model)
plot(cooks_dist, type = "h", main = "Cook's Distance")
With the plot that Cook’s Distance provided us with, here is a more comprehensive analysis in relation to the linear regression model:
Low Influence Overall: The Cook’s Distance values for most observations are quite low, generally well below 0.1. This indicates that removing any of these points would not significantly change the regression model’s coefficients. This suggests that the model is robust to the removal of most individual points.
Potential Outliers or High-Leverage Points: While most data points are not influential, there are a few spikes visible in the plot. These points may be outliers or high-leverage points that have more influence on the model fit than the typical data point. However, even for these points, the Cook’s Distance does not appear to exceed common thresholds (e.g., 0.5 or 1), indicating they may not be overly influential.
Model Assessment: This plot should be considered in conjunction with other diagnostic plots and metrics. The absence of highly influential points as indicated by Cook’s Distance is a good sign, but it doesn’t guarantee that the model is the best fit for the data.
Data Cleaning: Before finalizing the model, it would be prudent to examine the cases corresponding to the spikes more closely (a short sketch after this list shows one way to pull them out). Even though they are not highly influential according to Cook’s Distance, understanding why these points differ could offer insights into the data and the model’s suitability.
Overall Model Validity: When this plot is analyzed in the context of the full regression diagnostics, it can contribute to the overall confidence in the model. Given that there are no points with a high Cook’s Distance, the model parameters are likely not being unduly influenced by particular data points, which supports the model’s validity.
Relation to Other Diagnostics: If the Residuals vs Fitted plot suggested non-linearity or heteroscedasticity while Cook’s Distance indicates low influence of individual observations, then individual data points are not disproportionately influential, but the model form itself may need to be revisited (for example, by adding polynomial terms or considering a transformation of the response variable).
In summary, the Cook’s Distance plot supports the conclusion that the linear regression model is not being heavily influenced by a few data points. However, the validity and appropriateness of the model also depend on other assumptions of linear regression being met, such as linearity, independence, and homoscedasticity of residuals. If those other diagnostics indicated issues, addressing them would still be necessary to ensure the model’s overall appropriateness for the data.
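As mentioned in the Data Cleaning point above, one way to pull out the spiking cases is to apply a rule-of-thumb cutoff; the 4/n threshold below is my assumption, not something fixed by the analysis:
# Flag observations whose Cook's Distance exceeds the common 4/n rule of thumb
threshold <- 4 / nrow(bike_data)
influential <- which(cooks_dist > threshold)
length(influential)              # how many points exceed the cutoff
head(bike_data[influential, ])   # inspect those rows before acting on them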
Drawing on all the analysis done above, we can further conclude that:
Age is not a strong predictor of income when analyzed alone, necessitating additional variables for a more robust model.
The enhanced model indicated education level and car ownership are significant predictors of income, but the effect is complex due to interactions with the number of children.
While the enhanced model is statistically significant, issues of multicollinearity and the regression assumptions must be addressed for reliable predictions.
Recommendations include exploring additional variables, considering data segmentation, and applying transformations or robust regression methods to address outliers and non-constant variance (a brief robust-regression sketch follows below).
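For the robust regression recommendation, one option is an M-estimator via rlm() from the MASS package; a minimal sketch, not part of the analysis above (note that loading MASS masks dplyr’s select):
library(MASS)
# Re-fit the enhanced specification with a robust estimator that down-weights outliers
robust_model <- rlm(Income ~ Age + Education + Cars * Children, data = bike_data)
summary(robust_model)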
As expected from this week’s data dive, this analysis underscores the importance of a meticulous approach to regression analysis, ensuring that the chosen model not only fits the data well but also adheres to the underlying assumptions of the regression technique used.
Further research could involve exploring alternative modeling techniques that can accommodate the dataset’s peculiarities, such as generalized linear models or non-parametric methods.
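As a pointer toward the generalized linear models mentioned above, the binary Bike.Sold column already in the dataset would support a logistic regression; a hypothetical sketch (the choice of predictors is illustrative only):
# Model the probability of a bike sale rather than income
logit_model <- glm(Bike.Sold ~ Age + Income + Education, family = binomial, data = bike_data)
summary(logit_model)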