This analysis is on the bike dataset, which focuses on exploring relationships between different variables through grouped data frames, visualization, and hypothesis testing.
I dived deep into individual rows of data, and investigated each row and group, looking into their probability, which can help in anomaly detection. I began by loading the necessary libraries and the dataset. Then, I inspected the dataset to fully understand its structure, variables, and initial statistics.
library(tidyverse)
bike_data <- read.csv("bike_data.csv")
head(bike_data)
## ID Marital.Status Gender Income Children Education Occupation
## 1 12496 Married Female $40,000 1 Bachelors Skilled Manual
## 2 24107 Married Male $30,000 3 Partial College Clerical
## 3 14177 Married Male $80,000 5 Partial College Professional
## 4 24381 Single Male $70,000 0 Bachelors Professional
## 5 25597 Single Male $30,000 0 Bachelors Clerical
## 6 13507 Married Female $10,000 2 Partial College Manual
## Home.Owner Cars Commute.Distance Region Age Age.Brackets Purchased.Bike
## 1 Yes 0 0-1 Miles Europe 42 Middle Age No
## 2 Yes 1 0-1 Miles Europe 43 Middle Age No
## 3 No 2 2-5 Miles Europe 60 Old No
## 4 Yes 1 5-10 Miles Pacific 41 Middle Age Yes
## 5 No 0 0-1 Miles Europe 36 Middle Age Yes
## 6 Yes 0 1-2 Miles Europe 50 Middle Age No
str(bike_data)
## 'data.frame': 1000 obs. of 14 variables:
## $ ID : int 12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
## $ Marital.Status : chr "Married" "Married" "Married" "Single" ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ Income : chr "$40,000" "$30,000" "$80,000" "$70,000" ...
## $ Children : int 1 3 5 0 0 2 2 1 2 2 ...
## $ Education : chr "Bachelors" "Partial College" "Partial College" "Bachelors" ...
## $ Occupation : chr "Skilled Manual" "Clerical" "Professional" "Professional" ...
## $ Home.Owner : chr "Yes" "Yes" "No" "Yes" ...
## $ Cars : int 0 1 2 1 0 0 4 0 2 1 ...
## $ Commute.Distance: chr "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
## $ Region : chr "Europe" "Europe" "Europe" "Pacific" ...
## $ Age : int 42 43 60 41 36 50 33 43 58 40 ...
## $ Age.Brackets : chr "Middle Age" "Middle Age" "Old" "Middle Age" ...
## $ Purchased.Bike : chr "No" "No" "No" "Yes" ...
summary(bike_data)
## ID Marital.Status Gender Income
## Min. :11000 Length:1000 Length:1000 Length:1000
## 1st Qu.:15291 Class :character Class :character Class :character
## Median :19744 Mode :character Mode :character Mode :character
## Mean :19966
## 3rd Qu.:24471
## Max. :29447
## Children Education Occupation Home.Owner
## Min. :0.000 Length:1000 Length:1000 Length:1000
## 1st Qu.:0.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.898
## 3rd Qu.:3.000
## Max. :5.000
## Cars Commute.Distance Region Age
## Min. :0.000 Length:1000 Length:1000 Min. :25.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:35.00
## Median :1.000 Mode :character Mode :character Median :43.00
## Mean :1.442 Mean :44.16
## 3rd Qu.:2.000 3rd Qu.:52.00
## Max. :4.000 Max. :89.00
## Age.Brackets Purchased.Bike
## Length:1000 Length:1000
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Here, I grouped the data by Marital Status and summarized the Age variable. I aim to understand the age distribution within each marital status category.
group_by_marital <- bike_data %>%
group_by(Marital.Status) %>%
summarise(Average_Age = mean(Age, na.rm = TRUE))
ggplot(group_by_marital, aes(x = `Marital.Status`, y = Average_Age)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Average Age by Marital Status")
From the visualization above, married individuals constitute the larger group (53.8%), while single individuals form the smaller group (46.2%). The average age is slightly higher in the married group compared to the single group.
In this section, I analyzed the number of children based on different occupations of the household.
The largest group by occupation is the Professional with 27.6%, while the smallest is Manual with 11.9%.
The Management group has the highest average number of children, while Manual and Clerical occupations have lower averages.
I visualized the above figures and summary below:
group_by_occupation <- bike_data %>%
group_by(Occupation) %>%
summarise(Average_Children = mean(Children, na.rm = TRUE))
ggplot(group_by_occupation, aes(x = Occupation, y = Average_Children)) +
geom_bar(stat = "identity", fill = "tomato") +
labs(title = "Average Number of Children by Occupation") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This analysis focuses on the number of cars owned in different regions:
North America is the largest region with 50.8%, while ‘Pacific’ has the smallest with 19.2%.
On average, people from the Pacific region own more cars compared to the other regions.
group_by_region <- bike_data %>%
group_by(Region) %>%
summarise(Average_Cars = mean(Cars, na.rm = TRUE))
ggplot(group_by_region, aes(x = Region, y = Average_Cars)) +
geom_bar(stat = "identity", fill = "brown") +
labs(title = "Average Number of Cars by Region")
I further explored the relationship between two categorical variables, ‘Marital Status’ and ‘Occupation’, to understand their interaction.
marital_occupation_tab <- table(bike_data$`Marital.Status`, bike_data$Occupation)
ggplot(as.data.frame(marital_occupation_tab), aes(Var1, Var2, fill = Freq)) +
geom_tile() +
labs(title = "Cross Tabulation of Marital Status and Occupation", x = "Marital Status", y = "Occupation")
The smallest probability groups identified (Single for Marital Status, Manual for Occupation, and Pacific for Region) could be subject to further investigation to understand why they are less represented.
For instance, the higher number of cars in the Pacific region could be linked to regional characteristics like urbanization, public transport availability, or income levels.
Additional analysis can be performed to see how these smallest groups relate to other variables in the dataset, like income or bike purchase decisions.
The above heatmap visualization shows the distribution of different occupation categories within marital status groups.
It can be clearly observed from the heatmap that the most common occupation for both married and single individuals appears to be Professional as indicated by the darker shade.
Clerical occupations are more common among single individuals compared to married individuals.
Manual and Management occupations have a moderate representation across both marital statuses.
Skilled Manual seems to have a slightly higher frequency among married individuals than single ones, as suggested by the color shading. In relation to the conclusion from the analyzed dataset above:
In relation to the conclusion from the analyzed dataset above:
The heatmap visually reinforces the quantitative findings from the data analysis. It shows which occupations are more or less common among different marital statuses.
It helps in identifying patterns or trends in occupation with respect to marital status, such as which occupations are predominantly chosen by married or single individuals.
The heatmap can support hypotheses regarding socio-economic behavior. For example, if Professional is the most common occupation among both married and single individuals, one might hypothesize that this occupation is a popular choice due to factors like job availability, job desirability, or income levels associated with Professional roles.