Introduction:
This week’s data dive continues last week’s dive on linear regression, but this one focuses primarily on critically interpreting and diagnosing regression models.
Building on last week’s regression modeling, I will present a comprehensive data analysis aimed at understanding the factors that influence bike purchases.
The data dive will explore the following techniques, among others, as required by this week’s task:
Regression Characteristics.
Regression Diagnostics.
Simpson’s Paradox (demo).
I will also refer to techniques used in past weeks’ data dives, since the two are closely intertwined and will be helpful throughout the analysis. These include:
ANOVA.
F-test.
Bonferroni Correction.
Linear Regression Theory.
Data Loading:
As usual, I will load the dataset alongside the necessary libraries that will be needed for the analysis.
library(ggthemes)
library(ggrepel)
## Loading required package: ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(lindia)
library(knitr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
bike_data <- read.csv('bike_data.csv')
Preliminary Data Examination:
Using summary(), str(), and head() to pre-analyse the dataset.
summary(bike_data)
## ID Marital.Status Gender Income
## Min. :11000 Length:1000 Length:1000 Length:1000
## 1st Qu.:15291 Class :character Class :character Class :character
## Median :19744 Mode :character Mode :character Mode :character
## Mean :19966
## 3rd Qu.:24471
## Max. :29447
## Children Education Occupation Home.Owner
## Min. :0.000 Length:1000 Length:1000 Length:1000
## 1st Qu.:0.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.898
## 3rd Qu.:3.000
## Max. :5.000
## Cars Commute.Distance Region Age
## Min. :0.000 Length:1000 Length:1000 Min. :25.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:35.00
## Median :1.000 Mode :character Mode :character Median :43.00
## Mean :1.442 Mean :44.16
## 3rd Qu.:2.000 3rd Qu.:52.00
## Max. :4.000 Max. :89.00
## Age.Brackets Purchased.Bike Bike.Sold
## Length:1000 Length:1000 Min. :0.000
## Class :character Class :character 1st Qu.:0.000
## Mode :character Mode :character Median :0.000
## Mean :0.481
## 3rd Qu.:1.000
## Max. :1.000
str(bike_data)
## 'data.frame': 1000 obs. of 15 variables:
## $ ID : int 12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
## $ Marital.Status : chr "Married" "Married" "Married" "Single" ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ Income : chr "40,000" "30,000" "80,000" "70,000" ...
## $ Children : int 1 3 5 0 0 2 2 1 2 2 ...
## $ Education : chr "Bachelors" "Partial College" "Partial College" "Bachelors" ...
## $ Occupation : chr "Skilled Manual" "Clerical" "Professional" "Professional" ...
## $ Home.Owner : chr "Yes" "Yes" "No" "Yes" ...
## $ Cars : int 0 1 2 1 0 0 4 0 2 1 ...
## $ Commute.Distance: chr "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
## $ Region : chr "Europe" "Europe" "Europe" "Pacific" ...
## $ Age : int 42 43 60 41 36 50 33 43 58 40 ...
## $ Age.Brackets : chr "Middle Age" "Middle Age" "Old" "Middle Age" ...
## $ Purchased.Bike : chr "No" "No" "No" "Yes" ...
## $ Bike.Sold : int 0 0 0 1 1 0 1 1 0 1 ...
head(bike_data)
## ID Marital.Status Gender Income Children Education Occupation
## 1 12496 Married Female 40,000 1 Bachelors Skilled Manual
## 2 24107 Married Male 30,000 3 Partial College Clerical
## 3 14177 Married Male 80,000 5 Partial College Professional
## 4 24381 Single Male 70,000 0 Bachelors Professional
## 5 25597 Single Male 30,000 0 Bachelors Clerical
## 6 13507 Married Female 10,000 2 Partial College Manual
## Home.Owner Cars Commute.Distance Region Age Age.Brackets Purchased.Bike
## 1 Yes 0 0-1 Miles Europe 42 Middle Age No
## 2 Yes 1 0-1 Miles Europe 43 Middle Age No
## 3 No 2 2-5 Miles Europe 60 Old No
## 4 Yes 1 5-10 Miles Pacific 41 Middle Age Yes
## 5 No 0 0-1 Miles Europe 36 Middle Age Yes
## 6 Yes 0 1-2 Miles Europe 50 Middle Age No
## Bike.Sold
## 1 0
## 2 0
## 3 0
## 4 1
## 5 1
## 6 0
Initial Linear Regression Model:
Now, let us revisit the initial regression model generated last week; I will also explore an interaction term for this task. Before running the linear regression, however, I need to check the model variables for errors and clean them up.
Last week, I ran into the error message “NA/NaN/Inf in ‘y’”, which indicates that the dependent variable Income in the linear model has missing (NA), not-a-number (NaN), or infinite (Inf) values. The major culprit here is that the Income variable is formatted as currency with commas, so R reads it as a character (or factor) type rather than numeric, which prevents proper numerical operations and causes issues when fitting a model. Since R’s lm function cannot handle these kinds of values when fitting a model, let me first run some code to remove the commas and convert the Income column to numeric before enhancing the regression model. Both steps are shown below:
# Strip the thousands separators and convert Income to numeric
bike_data$Income <- as.numeric(gsub(",", "", bike_data$Income))
# Normalise any remaining NaN/Inf values to NA so they can all be dropped together
bike_data$Income <- ifelse(is.na(bike_data$Income) | is.nan(bike_data$Income) | is.infinite(bike_data$Income), NA, bike_data$Income)
# Keep only complete cases for the two model variables
bike_data_clean <- na.omit(bike_data[, c("Income", "Age")])
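# Sanity check (my addition, not part of the original script): confirm the
# conversion worked and count how many Income values ended up missing
stopifnot(is.numeric(bike_data$Income))
sum(is.na(bike_data$Income))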
initial_model <- lm(Income ~ Age, data = bike_data_clean)
summary(initial_model)
##
## Call:
## lm(formula = Income ~ Age, data = bike_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58380 -22621 -1402 16507 111855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35814.66 3890.80 9.205 < 2e-16 ***
## Age 465.22 85.32 5.452 6.27e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30650 on 998 degrees of freedom
## Multiple R-squared: 0.02893, Adjusted R-squared: 0.02795
## F-statistic: 29.73 on 1 and 998 DF, p-value: 6.266e-08
The output of the lm function above provides a wealth of information about the fitted linear regression model. I will carefully explain the most important parts of the output below, along with the insights we can gain from them:
The summary of residuals gives us a quick insight into the distribution of errors between the actual and predicted values of Income. In this case:
The Min and Max values suggest there are quite large errors in the predictions for at least some points.
The median is close to zero, which is good, but the range suggests that the model may not be predicting equally well across all ages.
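To see this distribution directly, the residuals can be pulled straight from the fitted model; a quick sketch, not part of the original output:
# Numeric summary of the residuals, matching the block printed above
summary(residuals(initial_model))
# Histogram to eyeball the skew suggested by the Min/Max spread
hist(residuals(initial_model), breaks = 30, main = "Residuals of initial model")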
This part includes the estimated coefficients for the intercept and the slope (for Age), their standard errors, t-values, and the p-values for the t-tests on each coefficient:
The intercept is estimated at 35,814.66 with a very low p-value, indicating that it is significantly different from zero.
The coefficient for Age is 465.22, suggesting that for each additional year of age, income increases by this amount, on average.
The p-value for Age is also very low, indicating that age is a statistically significant predictor of income.
The asterisks indicate the level of statistical significance for the coefficients. In this output, both the intercept and the Age coefficient have three asterisks, which denote a significance level of less than 0.001.
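To make the slope concrete, we can plug an age into the fitted line; a small sketch (age 40 is just an illustration, not a value from the analysis):
# Expected income at age 40, i.e. 35814.66 + 465.22 * 40, roughly 54,423
predict(initial_model, newdata = data.frame(Age = 40))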
The residual standard error is the average amount by which the response deviates from the true regression line. At 30,650, it is quite large considering the scale of income.
The F-statistic and its p-value relate to the overall significance of the regression model. The F-statistic is relatively large and the p-value is very small, which suggests that the model is statistically significant as a whole.
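Since broom is already loaded, the same coefficient table and model-level statistics can also be pulled into tidy data frames, which is handy for further processing; a minimal sketch:
# Coefficient estimates, standard errors, t-values and p-values as a data frame
tidy(initial_model)
# One-row summary: R-squared, F-statistic, p-value, residual standard error, etc.
glance(initial_model)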
# Enhance the model with Education and an interaction between Cars and Children
enhanced_model <- lm(Income ~ Age + Education + Cars * Children, data = bike_data)
summary(enhanced_model)
##
## Call:
## lm(formula = Income ~ Age + Education + Cars * Children, data = bike_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51787 -15776 -2741 13407 120508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33231.61 3664.54 9.068 < 2e-16 ***
## Age -35.96 79.23 -0.454 0.65
## EducationGraduate Degree 13613.21 2340.74 5.816 8.14e-09 ***
## EducationHigh School -23202.28 2288.66 -10.138 < 2e-16 ***
## EducationPartial College -9188.06 2020.67 -4.547 6.11e-06 ***
## EducationPartial High School -39939.26 3157.96 -12.647 < 2e-16 ***
## Cars 19870.00 1135.79 17.494 < 2e-16 ***
## Children 5485.31 890.78 6.158 1.07e-09 ***
## Cars:Children -2203.27 415.50 -5.303 1.41e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24000 on 991 degrees of freedom
## Multiple R-squared: 0.4088, Adjusted R-squared: 0.4041
## F-statistic: 85.67 on 8 and 991 DF, p-value: < 2.2e-16
vif(enhanced_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## Age 1.406497 1 1.185958
## Education 1.195727 4 1.022596
## Cars 2.832999 1 1.683151
## Children 3.650941 1 1.910744
## Cars:Children 5.720360 1 2.391727
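The warning above suggests recomputing the VIFs with type = 'predictor' when interaction terms are present; a quick follow-up (output not shown here, and behaviour depends on the installed car version):
# Predictor-level GVIFs that account for the Cars:Children interaction
vif(enhanced_model, type = "predictor")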
The output of the enhanced regression model and the Variance Inflation Factor (VIF) analysis provide a rich set of insights, broken down below:
The Cars:Children interaction term has a high VIF, suggesting its variance is highly inflated due to correlation with other predictors. The elevated VIFs for Cars and Children also point to multicollinearity, which is expected here since an interaction term shares information with its main effects.
To create diagnostic plots for the enhanced model created above, I will use the base plot function directly on the model object. This generates four standard diagnostic plots that help evaluate the assumptions of linear regression: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.
# Arrange all four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(enhanced_model)
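As an alternative, the lindia package loaded earlier can render similar diagnostics in ggplot2 style; a sketch, assuming the current lindia API:
# ggplot2-based panel of standard lm diagnostic plots
gg_diagnose(enhanced_model)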
From the diagnostic plots generated above, we can conclude that while the model captures the general trend, there are anomalies, particularly outliers, that may be affecting the model’s performance. Further analysis should involve checking the influence of these outliers, considering data transformations, or applying robust regression techniques to mitigate their effect.
We can analyse outliers and leverage using Cook’s Distance. The Cook’s Distance plot assesses the influence of individual data points on the fitted regression coefficients. I will run the code below and interpret the plot afterwards:
# Compute Cook's Distance for every observation and plot it as spikes
cooks_dist <- cooks.distance(enhanced_model)
plot(cooks_dist, type = "h", main = "Cook's Distance")
With the plot that Cook’s Distance provided us with, here is a more comprehensive analysis in relation to the linear regression model:
Low Influence Overall: The Cook’s Distance values for most observations are quite low, generally well below 0.1. This indicates that removing any of these points would not significantly change the regression model’s coefficients. This suggests that the model is robust to the removal of most individual points.
Potential Outliers or High-Leverage Points: While most data points are not influential, there are a few spikes visible in the plot. These points may be outliers or high-leverage points that have more influence on the model fit than the typical data point. However, even for these points, the Cook’s Distance does not appear to exceed common thresholds (e.g., 0.5 or 1), indicating they may not be overly influential.
Model Assessment: This plot should be considered in conjunction with other diagnostic plots and metrics. The absence of highly influential points as indicated by Cook’s Distance is a good sign, but it doesn’t guarantee that the model is the best fit for the data.
Data Cleaning: Before finalizing the model, it would be prudent to examine the cases corresponding to the spikes more closely (a short sketch after this list shows one way to pull them out). Even though they are not highly influential according to Cook’s Distance, understanding why these points differ could offer insights into the data and the model’s suitability.
Overall Model Validity: When this plot is analyzed in the context of the full regression diagnostics, it can contribute to the overall confidence in the model. Given that there are no points with a high Cook’s Distance, the model parameters are likely not being unduly influenced by particular data points, which supports the model’s validity.
Relation to Other Diagnostics: If the Residuals vs Fitted plot suggested non-linearity or heteroscedasticity while Cook’s Distance indicates low influence of individual observations, then individual data points are not disproportionately influential, but the model form itself may need to be revisited (for example, by adding polynomial terms or considering a transformation of the response variable).
In summary, the Cook’s Distance plot supports the conclusion that the linear regression model is not being heavily influenced by a few data points. However, the validity and appropriateness of the model also depend on other assumptions of linear regression being met, such as linearity, independence, and homoscedasticity of residuals. If those other diagnostics indicated issues, addressing them would still be necessary to ensure the model’s overall appropriateness for the data.
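As mentioned in the Data Cleaning point above, one way to pull out the spiking cases is to apply a rule-of-thumb cutoff; the 4/n threshold below is my assumption, not something fixed by the analysis:
# Flag observations whose Cook's Distance exceeds the common 4/n rule of thumb
threshold <- 4 / nrow(bike_data)
influential <- which(cooks_dist > threshold)
length(influential)              # how many points exceed the cutoff
head(bike_data[influential, ])   # inspect those rows before acting on them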
Drawing on all the analysis done above, we can further conclude that:
Age is not a strong predictor of income when analyzed alone, necessitating additional variables for a more robust model.
The enhanced model indicated education level and car ownership are significant predictors of income, but the effect is complex due to interactions with the number of children.
While the enhanced model is statistically significant, issues of multicollinearity and the regression assumptions must be addressed for reliable predictions.
Recommendations include exploring additional variables, considering data segmentation, and applying transformations or robust regression methods to address outliers and non-constant variance (a brief robust-regression sketch follows below).
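For the robust regression recommendation, one option is an M-estimator via rlm() from the MASS package; a minimal sketch, not part of the analysis above (note that loading MASS masks dplyr’s select):
library(MASS)
# Re-fit the enhanced specification with a robust estimator that down-weights outliers
robust_model <- rlm(Income ~ Age + Education + Cars * Children, data = bike_data)
summary(robust_model)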
As expected from this week’s data dive, this analysis underscores the importance of a meticulous approach to regression analysis, ensuring that the chosen model not only fits the data well but also adheres to the underlying assumptions of the regression technique used.
Further research could involve exploring alternative modeling techniques that can accommodate the dataset’s peculiarities, such as generalized linear models or non-parametric methods.
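As a pointer toward the generalized linear models mentioned above, the binary Bike.Sold column already in the dataset would support a logistic regression; a hypothetical sketch (the choice of predictors is illustrative only):
# Model the probability of a bike sale rather than income
logit_model <- glm(Bike.Sold ~ Age + Income + Education, family = binomial, data = bike_data)
summary(logit_model)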