Bike Data Analysis - Week 8 | Regression Modeling

Data Dive Introduction:

This task will guide me in getting experience running ANOVA tests and building regression models using the Bike sales dataset. Earlier in the week, regression model building was discussed alongside other topics such as:

ANOVA.
F-test.
Bonferroni Correction.
Linear Regression Theory.

For clarity purposes and as per the requirement of this task, all the steps taken in the analysis are clearly explained for this week’s data dive.

Loading the Data set and selecting a response variable:

The first thing to do in the analysis process will be to load all the necessary libraries, and read the Bike sales dataset, understanding the dataset structure, variables, and the potential relationships within the dataset. I used the glimpse function to display the columns vertically rather than horizontally,

It is often used for datasets with a large number of variables, however, to make it look more structured like a table and more appealing, I went ahead to still use the head() function, although with a condition of displaying only the first 20 rows of the data frame - I chose 20 so our output won’t look messy.

And lastly for this step, I inspected the dataset before I select a valuable continuous or ordered integer variable.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
library(knitr)


bike_data <- read.csv("bike_data.csv")

# Inspect the dataset to select a response variable
glimpse(bike_data)

## Rows: 1,000
## Columns: 15
## $ ID               <int> 12496, 24107, 14177, 24381, 25597, 13507, 27974, 1936…
## $ Marital.Status   <chr> "Married", "Married", "Married", "Single", "Single", …
## $ Gender           <chr> "Female", "Male", "Male", "Male", "Male", "Female", "…
## $ Income           <chr> "40,000", "30,000", "80,000", "70,000", "30,000", "10…
## $ Children         <int> 1, 3, 5, 0, 0, 2, 2, 1, 2, 2, 3, 0, 5, 2, 1, 2, 3, 1,…
## $ Education        <chr> "Bachelors", "Partial College", "Partial College", "B…
## $ Occupation       <chr> "Skilled Manual", "Clerical", "Professional", "Profes…
## $ Home.Owner       <chr> "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes",…
## $ Cars             <int> 0, 1, 2, 1, 0, 0, 4, 0, 2, 1, 2, 4, 0, 1, 1, 1, 2, 0,…
## $ Commute.Distance <chr> "0-1 Miles", "0-1 Miles", "2-5 Miles", "5-10 Miles", …
## $ Region           <chr> "Europe", "Europe", "Europe", "Pacific", "Europe", "E…
## $ Age              <int> 42, 43, 60, 41, 36, 50, 33, 43, 58, 40, 54, 36, 55, 3…
## $ Age.Brackets     <chr> "Middle Age", "Middle Age", "Old", "Middle Age", "Mid…
## $ Purchased.Bike   <chr> "No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "…
## $ Bike.Sold        <int> 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0,…

# Set options to display 15 columns
options(width = 1000)

# Print the first 20 rows of the dataframe
print(head(bike_data, 20))

##       ID Marital.Status Gender  Income Children           Education     Occupation Home.Owner Cars Commute.Distance  Region Age Age.Brackets Purchased.Bike Bike.Sold
## 1  12496        Married Female  40,000        1           Bachelors Skilled Manual        Yes    0        0-1 Miles  Europe  42   Middle Age             No         0
## 2  24107        Married   Male  30,000        3     Partial College       Clerical        Yes    1        0-1 Miles  Europe  43   Middle Age             No         0
## 3  14177        Married   Male  80,000        5     Partial College   Professional         No    2        2-5 Miles  Europe  60          Old             No         0
## 4  24381         Single   Male  70,000        0           Bachelors   Professional        Yes    1       5-10 Miles Pacific  41   Middle Age            Yes         1
## 5  25597         Single   Male  30,000        0           Bachelors       Clerical         No    0        0-1 Miles  Europe  36   Middle Age            Yes         1
## 6  13507        Married Female  10,000        2     Partial College         Manual        Yes    0        1-2 Miles  Europe  50   Middle Age             No         0
## 7  27974         Single   Male 160,000        2         High School     Management        Yes    4        0-1 Miles Pacific  33   Middle Age            Yes         1
## 8  19364        Married   Male  40,000        1           Bachelors Skilled Manual        Yes    0        0-1 Miles  Europe  43   Middle Age            Yes         1
## 9  22155        Married   Male  20,000        2 Partial High School       Clerical        Yes    2       5-10 Miles Pacific  58          Old             No         0
## 10 19280        Married   Male 120,000        2     Partial College         Manual        Yes    1        0-1 Miles  Europe  40   Middle Age            Yes         1
## 11 22173        Married Female  30,000        3         High School Skilled Manual         No    2        1-2 Miles Pacific  54   Middle Age            Yes         1
## 12 12697         Single Female  90,000        0           Bachelors   Professional         No    4    Over 10 Miles Pacific  36   Middle Age             No         0
## 13 11434        Married   Male 170,000        5     Partial College   Professional        Yes    0        0-1 Miles  Europe  55          Old             No         0
## 14 25323        Married   Male  40,000        2     Partial College       Clerical        Yes    1        1-2 Miles  Europe  35   Middle Age            Yes         1
## 15 23542         Single   Male  60,000        1     Partial College Skilled Manual         No    1        0-1 Miles Pacific  45   Middle Age            Yes         1
## 16 20870         Single Female  10,000        2         High School         Manual        Yes    1        0-1 Miles  Europe  38   Middle Age            Yes         1
## 17 23316         Single   Male  30,000        3     Partial College       Clerical         No    2        1-2 Miles Pacific  59          Old            Yes         1
## 18 12610        Married Female  30,000        1           Bachelors       Clerical        Yes    0        0-1 Miles  Europe  47   Middle Age             No         0
## 19 27183         Single   Male  40,000        2     Partial College       Clerical        Yes    1        1-2 Miles  Europe  35   Middle Age            Yes         1
## 20 25940         Single   Male  20,000        2 Partial High School       Clerical        Yes    2       5-10 Miles Pacific  55          Old            Yes         1

Selecting and Analyzing the Response Variable:

# View the structure of the dataset to understand its variables
str(bike_data)

## 'data.frame':    1000 obs. of  15 variables:
##  $ ID              : int  12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
##  $ Marital.Status  : chr  "Married" "Married" "Married" "Single" ...
##  $ Gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ Income          : chr  "40,000" "30,000" "80,000" "70,000" ...
##  $ Children        : int  1 3 5 0 0 2 2 1 2 2 ...
##  $ Education       : chr  "Bachelors" "Partial College" "Partial College" "Bachelors" ...
##  $ Occupation      : chr  "Skilled Manual" "Clerical" "Professional" "Professional" ...
##  $ Home.Owner      : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Cars            : int  0 1 2 1 0 0 4 0 2 1 ...
##  $ Commute.Distance: chr  "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
##  $ Region          : chr  "Europe" "Europe" "Europe" "Pacific" ...
##  $ Age             : int  42 43 60 41 36 50 33 43 58 40 ...
##  $ Age.Brackets    : chr  "Middle Age" "Middle Age" "Old" "Middle Age" ...
##  $ Purchased.Bike  : chr  "No" "No" "No" "Yes" ...
##  $ Bike.Sold       : int  0 0 0 1 1 0 1 1 0 1 ...

# Summary statistics for income
summary(bike_data$income)

## Length  Class   Mode 
##      0   NULL   NULL

bike_data$Income <- gsub("[^0-9.]", "", bike_data$Income)
bike_data$Income <- as.numeric(bike_data$Income) # Convert to numeric
str(bike_data$Income)

##  num [1:1000] 40000 30000 80000 70000 30000 10000 160000 40000 20000 120000 ...

# View the summary of the Income variable
summary(bike_data$Income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10000   30000   60000   56360   70000  170000

Summary Statistics for `Income`:

After converting the Income variable from character type to numeric format, the summary() function provides actual statistics for the selected variable:

The minimum income is $10,000.
The 1st quartile (25th percentile) is $40,000, indicating that 25% of the observations have an income of $40,000 or less.
The median income is $60,000, which means that half of the observations have an income less than or equal to $60,000.
The mean (average) income is approximately $57,000.
The 3rd quartile (75th percentile) is $70,000, indicating that 75% of the observations have an income of $70,000 or less.
The maximum income reported is $170,000.

Missing Summary for Other Variables:

While the Income variable’s summary statistics attempt has been done above, I really cannot come to a concrete conclusion or summary output for other variables that might also be valuable for the analysis. So, I attempted to juxtapose it with another variable below, using Education as an example here.

Exploring the Relationship Between `Income` and `Education:`

As mentioned above, I would be using Education to find out a relationship with our response variable Income, I want to get a sense of how Income varies across different Education levels. After completing this section, it should help mes in providing meaningful insight for our analysis in task 1.

bike_data$Education <- as.factor(bike_data$Education)

par(mar=c(8, 4, 4, 2) + 0.3) 
boxplot(Income ~ Education, data = bike_data,
        main="Income by Education Level",
        xlab="",  
        ylab="",  
        las=2,    
        cex.axis=0.8)  

mtext("Education Level", side=1, line=7)
mtext("Income", side=2, line=3)

Insights from the Visualization:

Income Distribution: Each education level has a different income distribution, as indicated by the range and median of each box-plot.
Median Income: It appears that individuals with a Graduate Degree tend to have a higher median income compared to other education levels.
Variability: The Graduate Degree category shows the widest interquartile range, suggesting more variability in income among individuals with this level of education.
Outliers: There are several outliers at higher income levels for Bachelors and Graduate Degree categories, indicating that a few individuals have incomes significantly higher than the general population within those categories.

Hypothesis Testing:

For hypothesis testing, let us test whether the mean income is significantly different across education levels. This will be done by using ANOVA below.

# ANOVA to test for differences in mean income across education levels
anova_result <- aov(Income ~ Education, data = bike_data)
summary(anova_result)

##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## Education     4 8.193e+10 2.048e+10   23.07 <2e-16 ***
## Residuals   995 8.834e+11 8.879e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insights from Hypothesis Testing:

The output of this ANOVA test informs me that the mean income significantly differs across different education levels within the dataset. The result indicates a highly significant F-value (23.07) with a p-value less than 2e-16 (practically zero), marked with three asterisks. This suggests that there are indeed statistically significant differences in mean income across education levels in the bike sales dataset.

Validating Assumptions of ANOVA:

Next, we will check for the normality of residuals, which is an assumption of ANOVA.

# Check for normality of residuals
residuals <- anova_result$residuals
shapiro.test(residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.941, p-value < 2.2e-16

Validating the assumptions using the Shapiro-Wilk normality test performed on the residuals from an ANOVA analysis gave an output test statistic value (W) of 0.941, where the p-value is less than 2.2e-16, this indicates that the residuals do not follow a normal distribution. This suggests that the normality assumption required for ANOVA is violated.

In simple English, the result means that the incomes after accounting for education levels don’t spread out in the typical bell-shaped pattern that statistical tests like ANOVA assume. Since the data doesn’t spread out the way it’s expected to, the results of the ANOVA might not be reliable.

We can conclude that there is significant evidence that income variance is not consistent across all education levels.

Visual Inspection of ‘Income’:

Now, we will create two visualizations (histogram and a box-plot) to visually inspect the distribution of Income.

# Histogram and Boxplot for Income
hist(bike_data$Income, main="Histogram of Income", xlab="Income", breaks=30)

boxplot(bike_data$Income, main="Boxplot of Income", ylab="Income")

Visualization Insights:: The data suggests that within the sampled dataset, there’s a concentration of individuals with incomes on the lower end of the scale and fewer individuals with very high incomes.

The plotted histogram and box-plot provide a visual representation of the distribution of income in the bike sales dataset. The histogram showed that the distribution of income is right-skewed, meaning there are a higher number of individuals with lower income and fewer individuals with very high income.

The box-plot reinforces this by showing a median lower than the mean (suggested by the longer tail on the right), and several outliers above the upper quartile, which are individual incomes significantly higher than the rest. We can go ahead to derive the following conclusions:

The skewness might imply economic inequality within the dataset.
The presence of outliers suggests that there are individuals with incomes significantly higher than the typical range, which could represent a distinct segment.
Given the skewness, median would be a better measure of central tendency for this income data than mean.
Further investigation could explore factors contributing to high incomes and consider whether the data should be transformed for any parametric statistical tests.

Inspecting for Outliers and Skewness:

Finally, I will now check to identify outliers that might affect the analysis.

# Identifying outliers
boxplot.stats(bike_data$Income)$out

##  [1] 160000 170000 170000 150000 160000 150000 160000 150000 170000 150000

The output from the boxplot.stats function applied to the Income variable of the bike_data dataframe, shows several values labeled as outliers, although these outliers are all high income values, which may indicate the presence of individuals with significantly higher incomes than the majority of the dataset. This could influence the analysis, particularly if the data is assumed to be normally distributed, as outliers can affect the mean and variance.

Selecting a Categorical Explanatory Variable and Running ANOVA:

Given the context, let’s choose Occupation as the categorical explanatory variable, assuming it could influence Income.

# Using 'Occupation' as the categorical explanatory variable
anova_result <- aov(Income ~ Occupation, data = bike_data)
summary(anova_result)

##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## Occupation    4 5.612e+11 1.403e+11   345.5 <2e-16 ***
## Residuals   995 4.041e+11 4.061e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test results suggest that the variable Occupation has a statistically significant effect on Income. This means that the average income varies across different occupations within the dataset. The F-value of 345.5 is very high, which indicates that the observed differences in means are not likely to have occurred by chance. The p-value, which is practically zero (<2e-16), also reinforces this conclusion.

This could be critical information for stakeholders such as policymakers, it might provide them with informed decisions on educational guidance, labor market interventions, or socioeconomic studies.

Further analysis could involve conducting post hoc tests to identify which specific occupation categories differ from each other, exploring potential reasons behind these income differences, such as education requirements, industry demand, or geographic factors. Assessing the impact of occupation on income while controlling for other variables like education, age, or region to get a more nuanced understanding.

As mentioned in my earlier ANOVA test, it is important to note that while the ANOVA test shows differences in means, it does not provide information about the causal relationship, nor does it measure the size of the effect. Additional analysis would be required to draw conclusions about the direct impact of occupation on income.

Building a Linear Regression Model:

Given the variables available, Age seems like a continuous variable that could have a linear relationship with Income.

# Using 'Age' as a continuous variable that might influence income
lm_result <- lm(Income ~ Age, data = bike_data)
summary(lm_result)

## 
## Call:
## lm(formula = Income ~ Age, data = bike_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58380 -22621  -1402  16507 111855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35814.66    3890.80   9.205  < 2e-16 ***
## Age           465.22      85.32   5.452 6.27e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30650 on 998 degrees of freedom
## Multiple R-squared:  0.02893,    Adjusted R-squared:  0.02795 
## F-statistic: 29.73 on 1 and 998 DF,  p-value: 6.266e-08

Insights from Linear Regression Model:

This model explores how Age might affect Income. The summary provides the regression coefficients, indicating how income changes with age. The coefficient for Age reflects the average change in income for each additional year of age.

Furthermore, the key insights from the output are:

The coefficient for Age is positive (465.22), indicating that as age increases, income is predicted to increase as well.
The p-value for the Age coefficient is extremely small (6.27e-08), which means that the effect of age on income is statistically significant.
The model’s R-squared value is very low (about 2.89%), which suggests that Age alone does not explain much of the variation in Income.
The F-statistic is significant, which tells us that the model is better at predicting income than using the mean income as a prediction for all observations, but given the low R-squared, the model’s overall explanatory power is limited, well this could be directly related to the nature of our dataset.

So in conclusion for this linear regression model, it is safe to say that age has a statistically significant but relatively small impact on income. Other factors not included in this model likely also affect income, and they would need to be identified to improve the model’s explanatory power.

Conclusion and Recommendations

The ANOVA and regression analysis help to uncover relationships between income and factors like occupation and age. Insights from these analyses can guide targeted marketing strategies or product development efforts. For instance, if certain occupations or age groups are found to have higher income levels and are more likely to purchase bikes, marketing efforts could be tailored accordingly.

Recommendations

Based on the findings, recommendations could be made to bike sellers or manufacturers to target specific bike types towards market segments with higher incomes or to consider experience levels in their marketing strategies.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Bike Data Analysis - Week 8 | Regression Modeling

Oluwatosin Agbaakin

2024-02-29

Data Dive Introduction:

Loading the Data set and selecting a response variable:

Summary Statistics for `Income`:

Missing Summary for Other Variables:

Exploring the Relationship Between `Income` and `Education:`

Hypothesis Testing:

Insights from Hypothesis Testing:

Validating Assumptions of ANOVA:

Visual Inspection of ‘Income’:

Inspecting for Outliers and Skewness:

Selecting a Categorical Explanatory Variable and Running ANOVA:

Building a Linear Regression Model:

Insights from Linear Regression Model:

Conclusion and Recommendations

Recommendations

R Markdown

Bike Data Analysis - Week 8 | Regression Modeling

Oluwatosin Agbaakin

2024-02-29

Data Dive Introduction:

Loading the Data set and selecting a response variable:

Summary Statistics for Income:

Missing Summary for Other Variables:

Exploring the Relationship Between Income and Education:

Hypothesis Testing:

Insights from Hypothesis Testing:

Validating Assumptions of ANOVA:

Visual Inspection of ‘Income’:

Inspecting for Outliers and Skewness:

Selecting a Categorical Explanatory Variable and Running ANOVA:

Building a Linear Regression Model:

Insights from Linear Regression Model:

Conclusion and Recommendations

Recommendations

R Markdown

Summary Statistics for `Income`:

Exploring the Relationship Between `Income` and `Education:`