Introduction:

This week’s data dive continues last week’s work on linear regression, but this time the focus is on critically interpreting and diagnosing regression models.

Building on last week’s regression modeling, I will present a comprehensive analysis aimed at understanding the factors that influence bike purchases.

This data dive builds on the techniques used in the past weeks’ data dives, since the two are closely intertwined and those techniques remain helpful throughout the analysis. The technical analysis carried out here includes, but is not limited to, fitting and interpreting a linear regression model, checking multicollinearity with the Variance Inflation Factor (VIF), evaluating residual diagnostic plots, and assessing influential observations with Cook’s Distance.

Data Loading:

As usual, I will begin by loading the dataset alongside the libraries needed for the analysis.

library(ggthemes)
library(ggrepel)
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(broom)
library(lindia)
library(knitr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
bike_data <- read.csv('bike_data.csv')

Preliminary Data Examination:

Using summary(), str(), and head() to get a preliminary look at the dataset.

summary(bike_data)
##        ID        Marital.Status        Gender             Income         
##  Min.   :11000   Length:1000        Length:1000        Length:1000       
##  1st Qu.:15291   Class :character   Class :character   Class :character  
##  Median :19744   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :19966                                                           
##  3rd Qu.:24471                                                           
##  Max.   :29447                                                           
##     Children      Education          Occupation         Home.Owner       
##  Min.   :0.000   Length:1000        Length:1000        Length:1000       
##  1st Qu.:0.000   Class :character   Class :character   Class :character  
##  Median :2.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.898                                                           
##  3rd Qu.:3.000                                                           
##  Max.   :5.000                                                           
##       Cars       Commute.Distance      Region               Age       
##  Min.   :0.000   Length:1000        Length:1000        Min.   :25.00  
##  1st Qu.:1.000   Class :character   Class :character   1st Qu.:35.00  
##  Median :1.000   Mode  :character   Mode  :character   Median :43.00  
##  Mean   :1.442                                         Mean   :44.16  
##  3rd Qu.:2.000                                         3rd Qu.:52.00  
##  Max.   :4.000                                         Max.   :89.00  
##  Age.Brackets       Purchased.Bike       Bike.Sold    
##  Length:1000        Length:1000        Min.   :0.000  
##  Class :character   Class :character   1st Qu.:0.000  
##  Mode  :character   Mode  :character   Median :0.000  
##                                        Mean   :0.481  
##                                        3rd Qu.:1.000  
##                                        Max.   :1.000
str(bike_data)
## 'data.frame':    1000 obs. of  15 variables:
##  $ ID              : int  12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
##  $ Marital.Status  : chr  "Married" "Married" "Married" "Single" ...
##  $ Gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ Income          : chr  "40,000" "30,000" "80,000" "70,000" ...
##  $ Children        : int  1 3 5 0 0 2 2 1 2 2 ...
##  $ Education       : chr  "Bachelors" "Partial College" "Partial College" "Bachelors" ...
##  $ Occupation      : chr  "Skilled Manual" "Clerical" "Professional" "Professional" ...
##  $ Home.Owner      : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Cars            : int  0 1 2 1 0 0 4 0 2 1 ...
##  $ Commute.Distance: chr  "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
##  $ Region          : chr  "Europe" "Europe" "Europe" "Pacific" ...
##  $ Age             : int  42 43 60 41 36 50 33 43 58 40 ...
##  $ Age.Brackets    : chr  "Middle Age" "Middle Age" "Old" "Middle Age" ...
##  $ Purchased.Bike  : chr  "No" "No" "No" "Yes" ...
##  $ Bike.Sold       : int  0 0 0 1 1 0 1 1 0 1 ...
head(bike_data)
##      ID Marital.Status Gender Income Children       Education     Occupation
## 1 12496        Married Female 40,000        1       Bachelors Skilled Manual
## 2 24107        Married   Male 30,000        3 Partial College       Clerical
## 3 14177        Married   Male 80,000        5 Partial College   Professional
## 4 24381         Single   Male 70,000        0       Bachelors   Professional
## 5 25597         Single   Male 30,000        0       Bachelors       Clerical
## 6 13507        Married Female 10,000        2 Partial College         Manual
##   Home.Owner Cars Commute.Distance  Region Age Age.Brackets Purchased.Bike
## 1        Yes    0        0-1 Miles  Europe  42   Middle Age             No
## 2        Yes    1        0-1 Miles  Europe  43   Middle Age             No
## 3         No    2        2-5 Miles  Europe  60          Old             No
## 4        Yes    1       5-10 Miles Pacific  41   Middle Age            Yes
## 5         No    0        0-1 Miles  Europe  36   Middle Age            Yes
## 6        Yes    0        1-2 Miles  Europe  50   Middle Age             No
##   Bike.Sold
## 1         0
## 2         0
## 3         0
## 4         1
## 5         1
## 6         0

Initial Linear Regression Model:

Now, let us return to the initial regression model that was generated last week; I will also explore an interaction term for this task. Before running the linear regression, however, I need to check the model variables for errors and clean them where necessary.

Last week, I ran into the error message “NA/NaN/Inf in ‘y’”, which indicates that the dependent variable Income in the linear model contains missing (NA), not-a-number (NaN), or infinite (Inf) values. The main culprit is that the Income variable is formatted as currency with commas, so R reads it as a character type rather than numeric, which prevents numerical operations and causes the model fit to fail.

Since R’s lm function cannot handle these kinds of values when fitting a model, I will first remove the commas and convert the Income column to numeric before enhancing the regression model. Both steps are carried out below:

# Strip the thousands separators and convert Income to numeric;
# any value that cannot be parsed becomes NA
bike_data$Income <- as.numeric(gsub(",", "", bike_data$Income))

# Standardize any remaining NaN/Inf values to NA, then keep only
# complete rows for the two variables used in the initial model
bike_data$Income <- ifelse(is.na(bike_data$Income) | is.nan(bike_data$Income) | is.infinite(bike_data$Income), NA, bike_data$Income)
bike_data_clean <- na.omit(bike_data[, c("Income", "Age")])

initial_model <- lm(Income ~ Age, data = bike_data_clean)
summary(initial_model)
## 
## Call:
## lm(formula = Income ~ Age, data = bike_data_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58380 -22621  -1402  16507 111855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35814.66    3890.80   9.205  < 2e-16 ***
## Age           465.22      85.32   5.452 6.27e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30650 on 998 degrees of freedom
## Multiple R-squared:  0.02893,    Adjusted R-squared:  0.02795 
## F-statistic: 29.73 on 1 and 998 DF,  p-value: 6.266e-08

The output of the lm function above provides a wealth of information about the fitted linear regression model. Below, I explain the most important parts of this output and the insights we can gain from them:

Residuals:

The summary of residuals gives us a quick insight into the distribution of errors between the actual and predicted values of Income.

In this case, the residuals range from about -58,380 to 111,855, with a median of -1,402. The median sitting close to zero is reassuring, but the long positive tail suggests the model substantially under-predicts income for some customers.
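As a sanity check, the residual summary printed above can be reproduced directly from the fitted model object; a minimal sketch:

summary(residuals(initial_model))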

Coefficients:

This part includes the estimated coefficients for the intercept and the slope (for Age), their standard errors, t-values, and the p-values for the t-tests on each coefficient. Here, the intercept is about 35,815 and the Age coefficient is about 465, meaning that each additional year of age is associated with roughly 465 more in predicted income.
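For downstream work it is often convenient to have this coefficient table as a data frame; since broom is already loaded, a minimal sketch:

tidy(initial_model)  # columns: term, estimate, std.error, statistic, p.value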

Significance Codes:

The asterisks indicate the level of statistical significance for the coefficients. In this output, both the intercept and the age coefficient have three asterisks, which denote a significance level of less than 0.001.

Residual Standard Error:

This is the average amount by which the response deviates from the true regression line. At 30,650, it is quite large relative to the scale of income.
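To make the definition concrete, the residual standard error can be recomputed by hand from the residual sum of squares and the residual degrees of freedom; a minimal sketch:

# RSE = sqrt(RSS / df), with df = n minus the number of estimated coefficients
rss <- sum(residuals(initial_model)^2)
sqrt(rss / df.residual(initial_model))  # matches sigma(initial_model), ~30,650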

F-statistic and p-value:

These are related to the overall significance of the regression model. The F-statistic is relatively large, and the p-value is very small, which suggests that the model is statistically significant.
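With broom already loaded, the same overall fit statistics can be collected into a one-row data frame, which is convenient when comparing several models later; a minimal sketch:

glance(initial_model)  # includes r.squared, sigma, statistic (F), and p.value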

Insights:

Although the Age coefficient is highly significant, the R-squared of roughly 0.029 tells us that Age alone explains less than 3% of the variation in Income, so it is a weak predictor on its own.

Further Considerations for the Linear Regression Model:

Given the low explanatory power of this simple model, the natural next step is to add further predictors and an interaction term, which I do in the enhanced model below.

Enhanced Model:

# Extend the model with Education (a categorical predictor)
# and an interaction term between Cars and Children
enhanced_model <- lm(Income ~ Age + Education + Cars * Children, data = bike_data)

summary(enhanced_model)
## 
## Call:
## lm(formula = Income ~ Age + Education + Cars * Children, data = bike_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51787 -15776  -2741  13407 120508 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   33231.61    3664.54   9.068  < 2e-16 ***
## Age                             -35.96      79.23  -0.454     0.65    
## EducationGraduate Degree      13613.21    2340.74   5.816 8.14e-09 ***
## EducationHigh School         -23202.28    2288.66 -10.138  < 2e-16 ***
## EducationPartial College      -9188.06    2020.67  -4.547 6.11e-06 ***
## EducationPartial High School -39939.26    3157.96 -12.647  < 2e-16 ***
## Cars                          19870.00    1135.79  17.494  < 2e-16 ***
## Children                       5485.31     890.78   6.158 1.07e-09 ***
## Cars:Children                 -2203.27     415.50  -5.303 1.41e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24000 on 991 degrees of freedom
## Multiple R-squared:  0.4088, Adjusted R-squared:  0.4041 
## F-statistic: 85.67 on 8 and 991 DF,  p-value: < 2.2e-16
vif(enhanced_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                   GVIF Df GVIF^(1/(2*Df))
## Age           1.406497  1        1.185958
## Education     1.195727  4        1.022596
## Cars          2.832999  1        1.683151
## Children      3.650941  1        1.910744
## Cars:Children 5.720360  1        2.391727
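The note printed above recommends a predictor-focused VIF when interaction terms are present; in recent versions of car this is available through the type argument (a sketch, assuming a car version that supports it):

vif(enhanced_model, type = "predictor")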

The output of the enhanced regression model above and the Variance Inflation Factor (VIF) analysis provide a rich set of insights, which are broken down below:

Enhanced Model Output

Coefficients:

Once Education and the Cars-Children interaction are included, Age is no longer significant (p = 0.65). Relative to the baseline education level (Bachelors), a Graduate Degree adds about 13,613 to predicted income, while Partial High School subtracts about 39,939. Each additional car is associated with roughly 19,870 more income and each child with about 5,485 more, and the negative interaction term (-2,203) indicates that the income boost associated with cars weakens as the number of children grows.

Model Fit:

The R-squared rises to about 0.41 (adjusted 0.40), a substantial improvement over the initial model's 0.029, and the residual standard error drops from 30,650 to 24,000.

VIF Analysis:

The GVIF^(1/(2*Df)) values are all below about 2.4, so multicollinearity is not a serious concern. The relatively high raw GVIF for the Cars:Children term is expected, since interaction terms are naturally correlated with their components, as the printed note about higher-order terms acknowledges.

Insights and Considerations:

Education, Cars, and Children carry most of the explanatory power in this model, and Age's apparent effect in the simple model appears to have been largely absorbed by these variables.

Next Steps for Further Clarity:

Before trusting these estimates, the model's assumptions should be checked with diagnostic plots and influence measures, which I turn to next.

Model Evaluation with Plots:

To create diagnostic plots for the enhanced model that we already created above, I will use the base plot function directly on the model object. This will generate four standard diagnostic plots that help evaluate the assumptions of linear regression: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage plots.

# Arrange all four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(enhanced_model)
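As an alternative, the lindia package loaded earlier produces ggplot-based versions of these same diagnostics in a single call; a minimal sketch:

gg_diagnose(enhanced_model)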

From the diagnostic plots generated above, here are the insights we can draw from each:

Residuals vs Fitted

This plot checks linearity and constant variance: residuals scattered evenly around zero with no pattern support the model, while a funnel shape or labelled extreme points flag heteroscedasticity and outliers.

Q-Q Plot (Quantile-Quantile)

This compares the residual quantiles to a theoretical normal distribution: points hugging the diagonal support the normality assumption, and departures in the tails point to the same extreme observations.

Scale-Location (Spread vs Level)

This plots the square root of the standardized residuals against the fitted values: a roughly horizontal band indicates homoscedasticity, whereas an upward trend suggests the variance grows with predicted income.

Residuals vs Leverage

This highlights observations that combine large residuals with high leverage: points approaching the Cook's Distance contours are the ones most capable of distorting the coefficient estimates.

Taken together, the plots suggest that while the model captures the general trend, there are anomalies, particularly outliers, that may be affecting the model's performance. Further analysis should involve checking the influence of the outliers, considering data transformations, or applying robust regression techniques to mitigate the effect of these points.
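To illustrate those two remedies, a log transformation of the response and a robust fit might look like the sketch below (MASS::rlm is one robust option among several; neither model is fitted elsewhere in this analysis):

# Log-transforming Income compresses its long right tail
log_model <- lm(log(Income) ~ Age + Education + Cars * Children, data = bike_data)

# Robust regression down-weights observations with large residuals
library(MASS)
robust_model <- rlm(Income ~ Age + Education + Cars * Children, data = bike_data)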

Analyzing the Outliers and Leverage:

We can analyse outliers and leverage using Cook's Distance. The Cook's Distance plot assesses the influence of individual data points on the fitted regression coefficients. I will run the code below and interpret the plot afterwards:

cooks_dist <- cooks.distance(enhanced_model)
plot(cooks_dist, type="h", main="Cook's Distance")
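A common rule of thumb flags observations with a Cook's Distance above 4/n; the sketch below marks that cutoff on the plot and lists the rows that exceed it (the 4/n threshold is a convention, not a strict rule):

cutoff <- 4 / nobs(enhanced_model)
abline(h = cutoff, lty = 2, col = "red")  # reference line on the plot above
which(cooks_dist > cutoff)                # indices of potentially influential rows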

With the Cook's Distance plot above, here is a more comprehensive analysis in relation to the linear regression model: no single observation stands out with a distance dramatically larger than the rest, which indicates that the coefficient estimates are not being driven by a handful of influential points.

Conclusion and Recommendations:

In summary, the Cook’s Distance plot supports the conclusion that the linear regression model is not being heavily influenced by a few data points. However, the validity and appropriateness of the model also depend on other assumptions of linear regression being met, such as linearity, independence, and homoscedasticity of residuals. If those other diagnostics indicated issues, addressing them would still be necessary to ensure the model’s overall appropriateness for the data.
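Since car is already loaded, those remaining assumptions can also be probed with formal tests; a minimal sketch:

ncvTest(enhanced_model)           # score test for non-constant error variance
durbinWatsonTest(enhanced_model)  # test for autocorrelated residuals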

Building on all of the analysis above, and as expected from this week's data dive, this analysis underscores the importance of a meticulous approach to regression analysis, ensuring that the chosen model not only fits the data well but also adheres to the underlying assumptions of the regression technique used.

Further research could involve exploring alternative modeling techniques that can accommodate the dataset’s peculiarities, such as generalized linear models or non-parametric methods.
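As one concrete direction, the binary Bike.Sold column in this dataset lends itself to logistic regression, a standard generalized linear model; a minimal sketch (the predictors chosen here are illustrative, not something analysed above):

purchase_model <- glm(Bike.Sold ~ Income + Age + Cars + Commute.Distance,
                      family = binomial, data = bike_data)
summary(purchase_model)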
