Introduction:

This analysis explores the bike dataset, examining relationships between variables through grouped data frames, visualization, and hypothesis testing.

I examined individual rows and groups and looked at the probability of each group, which can help with anomaly detection. I began by loading the necessary libraries and the dataset, then inspected the dataset to understand its structure, variables, and initial statistics.

Data Loading:

library(tidyverse)
bike_data <- read.csv("bike_data.csv")
head(bike_data)

##      ID Marital.Status Gender  Income Children       Education     Occupation
## 1 12496        Married Female $40,000        1       Bachelors Skilled Manual
## 2 24107        Married   Male $30,000        3 Partial College       Clerical
## 3 14177        Married   Male $80,000        5 Partial College   Professional
## 4 24381         Single   Male $70,000        0       Bachelors   Professional
## 5 25597         Single   Male $30,000        0       Bachelors       Clerical
## 6 13507        Married Female $10,000        2 Partial College         Manual
##   Home.Owner Cars Commute.Distance  Region Age Age.Brackets Purchased.Bike
## 1        Yes    0        0-1 Miles  Europe  42   Middle Age             No
## 2        Yes    1        0-1 Miles  Europe  43   Middle Age             No
## 3         No    2        2-5 Miles  Europe  60          Old             No
## 4        Yes    1       5-10 Miles Pacific  41   Middle Age            Yes
## 5         No    0        0-1 Miles  Europe  36   Middle Age            Yes
## 6        Yes    0        1-2 Miles  Europe  50   Middle Age             No

str(bike_data)

## 'data.frame':    1000 obs. of  14 variables:
##  $ ID              : int  12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
##  $ Marital.Status  : chr  "Married" "Married" "Married" "Single" ...
##  $ Gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ Income          : chr  "$40,000" "$30,000" "$80,000" "$70,000" ...
##  $ Children        : int  1 3 5 0 0 2 2 1 2 2 ...
##  $ Education       : chr  "Bachelors" "Partial College" "Partial College" "Bachelors" ...
##  $ Occupation      : chr  "Skilled Manual" "Clerical" "Professional" "Professional" ...
##  $ Home.Owner      : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Cars            : int  0 1 2 1 0 0 4 0 2 1 ...
##  $ Commute.Distance: chr  "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
##  $ Region          : chr  "Europe" "Europe" "Europe" "Pacific" ...
##  $ Age             : int  42 43 60 41 36 50 33 43 58 40 ...
##  $ Age.Brackets    : chr  "Middle Age" "Middle Age" "Old" "Middle Age" ...
##  $ Purchased.Bike  : chr  "No" "No" "No" "Yes" ...

summary(bike_data)

##        ID        Marital.Status        Gender             Income         
##  Min.   :11000   Length:1000        Length:1000        Length:1000       
##  1st Qu.:15291   Class :character   Class :character   Class :character  
##  Median :19744   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :19966                                                           
##  3rd Qu.:24471                                                           
##  Max.   :29447                                                           
##     Children      Education          Occupation         Home.Owner       
##  Min.   :0.000   Length:1000        Length:1000        Length:1000       
##  1st Qu.:0.000   Class :character   Class :character   Class :character  
##  Median :2.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.898                                                           
##  3rd Qu.:3.000                                                           
##  Max.   :5.000                                                           
##       Cars       Commute.Distance      Region               Age       
##  Min.   :0.000   Length:1000        Length:1000        Min.   :25.00  
##  1st Qu.:1.000   Class :character   Class :character   1st Qu.:35.00  
##  Median :1.000   Mode  :character   Mode  :character   Median :43.00  
##  Mean   :1.442                                         Mean   :44.16  
##  3rd Qu.:2.000                                         3rd Qu.:52.00  
##  Max.   :4.000                                         Max.   :89.00  
##  Age.Brackets       Purchased.Bike    
##  Length:1000        Length:1000       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##
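
The structure output above shows Income stored as character strings such as "$40,000", so it cannot be averaged or compared numerically as loaded. One possible conversion is sketched below, assuming every value follows the dollar-sign-and-comma pattern seen in the first six rows. The names `income_chr` and `income_num` are illustrative, and the values are copied from the `head()` output rather than read from the file.

```r
# Income values copied from the head() output above (illustrative subset)
income_chr <- c("$40,000", "$30,000", "$80,000", "$70,000", "$30,000", "$10,000")

# Strip the dollar sign and thousands separators, then convert to numeric.
# Inside a character class, "$" is treated as a literal character.
income_num <- as.numeric(gsub("[$,]", "", income_chr))

# The converted values can now be summarized like any numeric column
income_num
mean(income_num)
```

On the full dataset the same `gsub()` call could be applied to `bike_data$Income`. Any value that does not match the assumed pattern would become `NA` with a warning, which is worth checking before summarizing.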

Grouping and Summary:

Here, I grouped the data by Marital Status and summarized the Age variable to see how the average age differs between marital status categories.

group_by_marital <- bike_data %>%
  group_by(Marital.Status) %>%
  summarise(Average_Age = mean(Age, na.rm = TRUE))

ggplot(group_by_marital, aes(x = `Marital.Status`, y = Average_Age)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Average Age by Marital Status")

Married individuals constitute the larger group (53.8%), while single individuals form the smaller group (46.2%). The bar chart above shows that the average age is slightly higher in the married group than in the single group.
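
The group shares quoted above come from group counts rather than from the bar chart. A minimal sketch of how counts, proportions, and the average age could be computed in one pipeline is shown below. It uses only the six rows from the `head()` output, so its numbers will not match the full-data percentages. The names `sample_rows` and `marital_summary` are illustrative.

```r
library(dplyr)

# Illustrative subset: the six rows shown in the head() output
sample_rows <- data.frame(
  Marital.Status = c("Married", "Married", "Married", "Single", "Single", "Married"),
  Age = c(42, 43, 60, 41, 36, 50)
)

marital_summary <- sample_rows %>%
  group_by(Marital.Status) %>%
  summarise(
    Count = n(),                          # rows in each group
    Average_Age = mean(Age, na.rm = TRUE) # same summary as in the report
  ) %>%
  # summarise() drops the single grouping level, so sum(Count) is the overall total
  mutate(Proportion = Count / sum(Count))

marital_summary
```

Replacing `sample_rows` with `bike_data` would give the empirical probability of each marital status alongside its average age.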

Grouping by Occupation and the Number of Children:

In this section, I analyzed the average number of children by occupation.

The largest occupation group is Professional (27.6%), while the smallest is Manual (11.9%).
The Management group has the highest average number of children, while Manual and Clerical occupations have lower averages.

The averages are computed and visualized below:

group_by_occupation <- bike_data %>%
  group_by(Occupation) %>%
  summarise(Average_Children = mean(Children, na.rm = TRUE))

ggplot(group_by_occupation, aes(x = Occupation, y = Average_Children)) +
  geom_bar(stat = "identity", fill = "tomato") +
  labs(title = "Average Number of Children by Occupation") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Grouping by Region and Summarizing Cars:

This analysis focuses on the average number of cars owned in each region:

North America is the largest region (50.8%), while Pacific is the smallest (19.2%).
On average, people from the Pacific region own more cars than people from the other regions.

group_by_region <- bike_data %>%
  group_by(Region) %>%
  summarise(Average_Cars = mean(Cars, na.rm = TRUE))

ggplot(group_by_region, aes(x = Region, y = Average_Cars)) +
  geom_bar(stat = "identity", fill = "brown") +
  labs(title = "Average Number of Cars by Region")

Cross-Tabulation of Categorical Variables:

I further explored the relationship between two categorical variables, ‘Marital Status’ and ‘Occupation’, to understand their interaction.

marital_occupation_tab <- table(bike_data$`Marital.Status`, bike_data$Occupation)

ggplot(as.data.frame(marital_occupation_tab), aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  labs(title = "Cross Tabulation of Marital Status and Occupation", x = "Marital Status", y = "Occupation")
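
Because the married and single groups differ in size, raw counts can make an occupation look more common in one group simply because that group is larger. One possible complement to the heatmap, sketched here, is to convert each row of the cross-tabulation to proportions with `prop.table()`. The vectors below hold only the six rows from the `head()` output, and the names `marital`, `occupation`, `tab`, and `row_props` are illustrative.

```r
# Illustrative subset: the six rows shown in the head() output
marital <- c("Married", "Married", "Married", "Single", "Single", "Married")
occupation <- c("Skilled Manual", "Clerical", "Professional",
                "Professional", "Clerical", "Manual")

# Raw counts: rows are marital status, columns are occupation
tab <- table(marital, occupation)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_props <- prop.table(tab, margin = 1)

round(row_props, 2)
```

Applied to `marital_occupation_tab`, the resulting table could be passed to the same `as.data.frame()` and `geom_tile()` calls so that the fill shows within-group shares instead of counts.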

Further Analysis:

The lowest-probability groups identified (Single for Marital Status, Manual for Occupation, and Pacific for Region) could be investigated further to understand why they are less represented.
For instance, the higher average number of cars in the Pacific region could be linked to regional characteristics such as urbanization, public transport availability, or income levels.
Additional analysis could examine how these smallest groups relate to other variables in the dataset, such as income or bike purchase decisions.
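
As one way to start on the follow-up suggested above, the share of bike purchasers within each group can be computed with `tapply()`. This is only a sketch on the six rows from the `head()` output, which are far too few to support any conclusion. The names `region`, `purchased`, and `purchase_rate` are illustrative.

```r
# Illustrative subset: the six rows shown in the head() output
region <- c("Europe", "Europe", "Europe", "Pacific", "Europe", "Europe")
purchased <- c("No", "No", "No", "Yes", "Yes", "No")

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the purchase rate within each region
purchase_rate <- tapply(purchased == "Yes", region, mean)

purchase_rate
```

On the full data, `bike_data$Purchased.Bike` and `bike_data$Region` would take the place of the two vectors, and the same pattern works for Marital Status or Occupation.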

Conclusion:

The cross-tabulation heatmap above shows the distribution of occupation categories within each marital status group.

The heatmap shows that the most common occupation for both married and single individuals appears to be Professional, as indicated by the darker shade.
Clerical occupations are more common among single individuals compared to married individuals.
Manual and Management occupations have a moderate representation across both marital statuses.
Skilled Manual seems to have a slightly higher frequency among married individuals than single ones, as suggested by the color shading.

In relation to the conclusion from the analyzed dataset above:

The heatmap visually reinforces the quantitative findings from the data analysis. It shows which occupations are more or less common among different marital statuses.
It helps in identifying patterns or trends in occupation with respect to marital status, such as which occupations are predominantly chosen by married or single individuals.
The heatmap can support hypotheses regarding socio-economic behavior. For example, if Professional is the most common occupation among both married and single individuals, one might hypothesize that this occupation is a popular choice due to factors like job availability, job desirability, or income levels associated with Professional roles.

Bike Data Analysis - Week 3 | Group By and Probabilities

Oluwatosin Agbaakin

2024-01-23