This task will guide me in getting experience running ANOVA tests and building regression models using the Bike sales dataset. Earlier in the week, regression model building was discussed alongside other topics such as:
ANOVA.
F-test.
Bonferroni Correction.
Linear Regression Theory.
For clarity purposes and as per the requirement of this task, all the steps taken in the analysis are clearly explained for this week’s data dive.
The first thing to do in the analysis process will be to load all the
necessary libraries, and read the Bike sales dataset, understanding the
dataset structure, variables, and the potential relationships within the
dataset. I used the glimpse function to
display the columns vertically rather than horizontally,
It is often used for datasets with a large number of variables,
however, to make it look more structured like a table and more
appealing, I went ahead to still use the
head() function, although with a condition
of displaying only the first 20 rows of the data frame - I chose 20 so
our output won’t look messy.
And lastly for this step, I inspected the dataset before I select a valuable continuous or ordered integer variable.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
library(knitr)
bike_data <- read.csv("bike_data.csv")
# Inspect the dataset to select a response variable
glimpse(bike_data)
## Rows: 1,000
## Columns: 15
## $ ID <int> 12496, 24107, 14177, 24381, 25597, 13507, 27974, 1936…
## $ Marital.Status <chr> "Married", "Married", "Married", "Single", "Single", …
## $ Gender <chr> "Female", "Male", "Male", "Male", "Male", "Female", "…
## $ Income <chr> "40,000", "30,000", "80,000", "70,000", "30,000", "10…
## $ Children <int> 1, 3, 5, 0, 0, 2, 2, 1, 2, 2, 3, 0, 5, 2, 1, 2, 3, 1,…
## $ Education <chr> "Bachelors", "Partial College", "Partial College", "B…
## $ Occupation <chr> "Skilled Manual", "Clerical", "Professional", "Profes…
## $ Home.Owner <chr> "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes",…
## $ Cars <int> 0, 1, 2, 1, 0, 0, 4, 0, 2, 1, 2, 4, 0, 1, 1, 1, 2, 0,…
## $ Commute.Distance <chr> "0-1 Miles", "0-1 Miles", "2-5 Miles", "5-10 Miles", …
## $ Region <chr> "Europe", "Europe", "Europe", "Pacific", "Europe", "E…
## $ Age <int> 42, 43, 60, 41, 36, 50, 33, 43, 58, 40, 54, 36, 55, 3…
## $ Age.Brackets <chr> "Middle Age", "Middle Age", "Old", "Middle Age", "Mid…
## $ Purchased.Bike <chr> "No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "…
## $ Bike.Sold <int> 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0,…
# Set options to display 15 columns
options(width = 1000)
# Print the first 20 rows of the dataframe
print(head(bike_data, 20))
## ID Marital.Status Gender Income Children Education Occupation Home.Owner Cars Commute.Distance Region Age Age.Brackets Purchased.Bike Bike.Sold
## 1 12496 Married Female 40,000 1 Bachelors Skilled Manual Yes 0 0-1 Miles Europe 42 Middle Age No 0
## 2 24107 Married Male 30,000 3 Partial College Clerical Yes 1 0-1 Miles Europe 43 Middle Age No 0
## 3 14177 Married Male 80,000 5 Partial College Professional No 2 2-5 Miles Europe 60 Old No 0
## 4 24381 Single Male 70,000 0 Bachelors Professional Yes 1 5-10 Miles Pacific 41 Middle Age Yes 1
## 5 25597 Single Male 30,000 0 Bachelors Clerical No 0 0-1 Miles Europe 36 Middle Age Yes 1
## 6 13507 Married Female 10,000 2 Partial College Manual Yes 0 1-2 Miles Europe 50 Middle Age No 0
## 7 27974 Single Male 160,000 2 High School Management Yes 4 0-1 Miles Pacific 33 Middle Age Yes 1
## 8 19364 Married Male 40,000 1 Bachelors Skilled Manual Yes 0 0-1 Miles Europe 43 Middle Age Yes 1
## 9 22155 Married Male 20,000 2 Partial High School Clerical Yes 2 5-10 Miles Pacific 58 Old No 0
## 10 19280 Married Male 120,000 2 Partial College Manual Yes 1 0-1 Miles Europe 40 Middle Age Yes 1
## 11 22173 Married Female 30,000 3 High School Skilled Manual No 2 1-2 Miles Pacific 54 Middle Age Yes 1
## 12 12697 Single Female 90,000 0 Bachelors Professional No 4 Over 10 Miles Pacific 36 Middle Age No 0
## 13 11434 Married Male 170,000 5 Partial College Professional Yes 0 0-1 Miles Europe 55 Old No 0
## 14 25323 Married Male 40,000 2 Partial College Clerical Yes 1 1-2 Miles Europe 35 Middle Age Yes 1
## 15 23542 Single Male 60,000 1 Partial College Skilled Manual No 1 0-1 Miles Pacific 45 Middle Age Yes 1
## 16 20870 Single Female 10,000 2 High School Manual Yes 1 0-1 Miles Europe 38 Middle Age Yes 1
## 17 23316 Single Male 30,000 3 Partial College Clerical No 2 1-2 Miles Pacific 59 Old Yes 1
## 18 12610 Married Female 30,000 1 Bachelors Clerical Yes 0 0-1 Miles Europe 47 Middle Age No 0
## 19 27183 Single Male 40,000 2 Partial College Clerical Yes 1 1-2 Miles Europe 35 Middle Age Yes 1
## 20 25940 Single Male 20,000 2 Partial High School Clerical Yes 2 5-10 Miles Pacific 55 Old Yes 1
Selecting and Analyzing the Response Variable:
# View the structure of the dataset to understand its variables
str(bike_data)
## 'data.frame': 1000 obs. of 15 variables:
## $ ID : int 12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
## $ Marital.Status : chr "Married" "Married" "Married" "Single" ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ Income : chr "40,000" "30,000" "80,000" "70,000" ...
## $ Children : int 1 3 5 0 0 2 2 1 2 2 ...
## $ Education : chr "Bachelors" "Partial College" "Partial College" "Bachelors" ...
## $ Occupation : chr "Skilled Manual" "Clerical" "Professional" "Professional" ...
## $ Home.Owner : chr "Yes" "Yes" "No" "Yes" ...
## $ Cars : int 0 1 2 1 0 0 4 0 2 1 ...
## $ Commute.Distance: chr "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
## $ Region : chr "Europe" "Europe" "Europe" "Pacific" ...
## $ Age : int 42 43 60 41 36 50 33 43 58 40 ...
## $ Age.Brackets : chr "Middle Age" "Middle Age" "Old" "Middle Age" ...
## $ Purchased.Bike : chr "No" "No" "No" "Yes" ...
## $ Bike.Sold : int 0 0 0 1 1 0 1 1 0 1 ...
# Summary statistics for income
summary(bike_data$income)
## Length Class Mode
## 0 NULL NULL
bike_data$Income <- gsub("[^0-9.]", "", bike_data$Income)
bike_data$Income <- as.numeric(bike_data$Income) # Convert to numeric
str(bike_data$Income)
## num [1:1000] 40000 30000 80000 70000 30000 10000 160000 40000 20000 120000 ...
# View the summary of the Income variable
summary(bike_data$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10000 30000 60000 56360 70000 170000
Income:After converting the Income variable
from character type to numeric format, the
summary() function provides actual
statistics for the selected variable:
The minimum income is $10,000.
The 1st quartile (25th percentile) is $40,000, indicating that 25% of the observations have an income of $40,000 or less.
The median income is $60,000, which means that half of the observations have an income less than or equal to $60,000.
The mean (average) income is approximately $57,000.
The 3rd quartile (75th percentile) is $70,000, indicating that 75% of the observations have an income of $70,000 or less.
The maximum income reported is $170,000.
While the Income variable’s summary
statistics attempt has been done above, I really cannot come to a
concrete conclusion or summary output for other variables that might
also be valuable for the analysis. So, I attempted to juxtapose it with
another variable below, using Education as
an example here.
Income and
Education:As mentioned above, I would be using
Education to find out a relationship with
our response variable Income, I want to
get a sense of how Income varies across
different Education levels. After
completing this section, it should help mes in providing meaningful
insight for our analysis in task 1.
bike_data$Education <- as.factor(bike_data$Education)
par(mar=c(8, 4, 4, 2) + 0.3)
boxplot(Income ~ Education, data = bike_data,
main="Income by Education Level",
xlab="",
ylab="",
las=2,
cex.axis=0.8)
mtext("Education Level", side=1, line=7)
mtext("Income", side=2, line=3)
Insights from the Visualization:
Income Distribution: Each education level has a different income distribution, as indicated by the range and median of each box-plot.
Median Income: It appears that individuals with a Graduate Degree tend to have a higher median income compared to other education levels.
Variability: The Graduate Degree category shows the widest interquartile range, suggesting more variability in income among individuals with this level of education.
Outliers: There are several outliers at higher income levels for Bachelors and Graduate Degree categories, indicating that a few individuals have incomes significantly higher than the general population within those categories.
For hypothesis testing, let us test whether the mean income is significantly different across education levels. This will be done by using ANOVA below.
# ANOVA to test for differences in mean income across education levels
anova_result <- aov(Income ~ Education, data = bike_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Education 4 8.193e+10 2.048e+10 23.07 <2e-16 ***
## Residuals 995 8.834e+11 8.879e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output of this ANOVA test informs me that the mean income significantly differs across different education levels within the dataset. The result indicates a highly significant F-value (23.07) with a p-value less than 2e-16 (practically zero), marked with three asterisks. This suggests that there are indeed statistically significant differences in mean income across education levels in the bike sales dataset.
Next, we will check for the normality of residuals, which is an assumption of ANOVA.
# Check for normality of residuals
residuals <- anova_result$residuals
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.941, p-value < 2.2e-16
Validating the assumptions using the Shapiro-Wilk normality test performed on the residuals from an ANOVA analysis gave an output test statistic value (W) of 0.941, where the p-value is less than 2.2e-16, this indicates that the residuals do not follow a normal distribution. This suggests that the normality assumption required for ANOVA is violated.
In simple English, the result means that the incomes after accounting for education levels don’t spread out in the typical bell-shaped pattern that statistical tests like ANOVA assume. Since the data doesn’t spread out the way it’s expected to, the results of the ANOVA might not be reliable.
We can conclude that there is significant evidence that income variance is not consistent across all education levels.
Now, we will create two visualizations (histogram and a box-plot) to
visually inspect the distribution of
Income.
# Histogram and Boxplot for Income
hist(bike_data$Income, main="Histogram of Income", xlab="Income", breaks=30)
boxplot(bike_data$Income, main="Boxplot of Income", ylab="Income")
Visualization Insights:: The data suggests that within the sampled dataset, there’s a concentration of individuals with incomes on the lower end of the scale and fewer individuals with very high incomes.
The plotted histogram and box-plot provide a visual representation of the distribution of income in the bike sales dataset. The histogram showed that the distribution of income is right-skewed, meaning there are a higher number of individuals with lower income and fewer individuals with very high income.
The box-plot reinforces this by showing a median lower than the mean (suggested by the longer tail on the right), and several outliers above the upper quartile, which are individual incomes significantly higher than the rest. We can go ahead to derive the following conclusions:
The skewness might imply economic inequality within the dataset.
The presence of outliers suggests that there are individuals with incomes significantly higher than the typical range, which could represent a distinct segment.
Given the skewness, median would be a better measure of central tendency for this income data than mean.
Further investigation could explore factors contributing to high incomes and consider whether the data should be transformed for any parametric statistical tests.
Finally, I will now check to identify outliers that might affect the analysis.
# Identifying outliers
boxplot.stats(bike_data$Income)$out
## [1] 160000 170000 170000 150000 160000 150000 160000 150000 170000 150000
The output from the boxplot.stats function applied to
the Income variable of the bike_data
dataframe, shows several values labeled as outliers, although these
outliers are all high income values, which may indicate the presence of
individuals with significantly higher incomes than the majority of the
dataset. This could influence the analysis, particularly if the data is
assumed to be normally distributed, as outliers can affect the mean and
variance.
Given the context, let’s choose
Occupation as the categorical explanatory
variable, assuming it could influence
Income.
# Using 'Occupation' as the categorical explanatory variable
anova_result <- aov(Income ~ Occupation, data = bike_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Occupation 4 5.612e+11 1.403e+11 345.5 <2e-16 ***
## Residuals 995 4.041e+11 4.061e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test results suggest that the variable
Occupation has a statistically significant
effect on Income. This means that the
average income varies across different occupations within the dataset.
The F-value of 345.5 is very high, which indicates that the observed
differences in means are not likely to have occurred by chance. The
p-value, which is practically zero (<2e-16), also reinforces this
conclusion.
This could be critical information for stakeholders such as policymakers, it might provide them with informed decisions on educational guidance, labor market interventions, or socioeconomic studies.
Further analysis could involve conducting post hoc tests to identify which specific occupation categories differ from each other, exploring potential reasons behind these income differences, such as education requirements, industry demand, or geographic factors. Assessing the impact of occupation on income while controlling for other variables like education, age, or region to get a more nuanced understanding.
As mentioned in my earlier ANOVA test, it is important to note that while the ANOVA test shows differences in means, it does not provide information about the causal relationship, nor does it measure the size of the effect. Additional analysis would be required to draw conclusions about the direct impact of occupation on income.
Given the variables available, Age
seems like a continuous variable that could have a linear relationship
with Income.
# Using 'Age' as a continuous variable that might influence income
lm_result <- lm(Income ~ Age, data = bike_data)
summary(lm_result)
##
## Call:
## lm(formula = Income ~ Age, data = bike_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58380 -22621 -1402 16507 111855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35814.66 3890.80 9.205 < 2e-16 ***
## Age 465.22 85.32 5.452 6.27e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30650 on 998 degrees of freedom
## Multiple R-squared: 0.02893, Adjusted R-squared: 0.02795
## F-statistic: 29.73 on 1 and 998 DF, p-value: 6.266e-08
This model explores how Age might
affect Income. The summary provides the
regression coefficients, indicating how income changes with age. The
coefficient for Age reflects the average
change in income for each additional year of age.
Furthermore, the key insights from the output are:
Age is positive
(465.22), indicating that as age increases, income is predicted to
increase as well.Age coefficient is
extremely small (6.27e-08), which means that the effect of age on income
is statistically significant.Age alone does not explain
much of the variation in Income.So in conclusion for this linear regression model, it is safe to say that age has a statistically significant but relatively small impact on income. Other factors not included in this model likely also affect income, and they would need to be identified to improve the model’s explanatory power.
The ANOVA and regression analysis help to uncover relationships between income and factors like occupation and age. Insights from these analyses can guide targeted marketing strategies or product development efforts. For instance, if certain occupations or age groups are found to have higher income levels and are more likely to purchase bikes, marketing efforts could be tailored accordingly.
Based on the findings, recommendations could be made to bike sellers or manufacturers to target specific bike types towards market segments with higher incomes or to consider experience levels in their marketing strategies.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.