# Load the tidyverse, a collection of data science packages (it attaches dplyr and ggplot2, so the separate library() calls below are redundant but harmless)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load dplyr for data manipulation
library(dplyr)
# Load ggplot2 for data visualisation
library(ggplot2)
# Load the dataset
bike_data <- read.csv("/Users/roshannaidu/Desktop/IU Sem 2/Stats 1/bike+sharing+dataset/hour.csv")
# View structure and data types of variables
str(bike_data)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
# View first few rows of the dataset
head(bike_data)
# View summary statistics for all variables
summary(bike_data)
## instant dteday season yr
## Min. : 1 Length:17379 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 Mode :character Median :3.000 Median :1.0000
## Mean : 8690 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 Max. :4.000 Max. :1.0000
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
# Check number of rows and columns
dim(bike_data)
## [1] 17379 17
# Display all variable names
names(bike_data)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "casual"
## [16] "registered" "cnt"
# Check for missing values in each column
colSums(is.na(bike_data))
## instant dteday season yr mnth hr holiday
## 0 0 0 0 0 0 0
## weekday workingday weathersit temp atemp hum windspeed
## 0 0 0 0 0 0 0
## casual registered cnt
## 0 0 0
Upon analyzing the UCI Bike Sharing Dataset and its documentation, the following columns are identified as unclear without proper documentation:
temp: Represents normalized temperature values. Without documentation, it is unclear what the normalization range is (e.g., 0 to 1) and the original temperature scale (Celsius or Fahrenheit).
weathersit: Encoded as integers (1 to 4) corresponding to different weather situations. Without documentation, the specific weather conditions each integer represents are ambiguous.
atemp: Stands for “feels like” temperature, also normalized. It’s unclear how this differs from the temp column and what factors contribute to this perceived temperature.
temp and atemp: Normalizing temperature values facilitates easier integration with machine learning models by scaling features to a similar range. However, without knowing the normalization parameters, interpreting these values becomes challenging.
weathersit: Encoding categorical weather conditions as integers saves storage space and simplifies analysis. Nonetheless, without clear labels, the encoded integers lose their descriptive meaning.
Misinterpretation of Temperature Values: Assuming incorrect normalization ranges or temperature scales can lead to faulty analyses, such as underestimating the impact of temperature on bike rentals.
Ambiguity in Weather Conditions: Misunderstanding the weathersit encoding can result in incorrect associations between weather and rental patterns, skewing the analysis.
Confusion Between temp and atemp: Without distinguishing factors, it’s unclear how “feels like” temperature influences bike usage compared to actual temperature.
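To make the normalization concern concrete, the sketch below maps normalized values back to degrees Celsius. It assumes the min-max constants given in the UCI readme for hour.csv (t_min = -8 and t_max = 39 for temp, t_min = -16 and t_max = 50 for atemp); these constants and the helper name denormalize are assumptions to verify against your copy of the documentation, and the toy rows use the rounded values printed by str() above.

```r
# Assumed min-max constants from the UCI readme (verify before relying on them):
#   temp  = (t - t_min) / (t_max - t_min), t_min = -8,  t_max = 39  (Celsius)
#   atemp = (t - t_min) / (t_max - t_min), t_min = -16, t_max = 50  (Celsius)

# Hypothetical helper: invert a min-max normalization
denormalize <- function(x, t_min, t_max) {
  x * (t_max - t_min) + t_min
}

# Toy rows using the first three (rounded) values shown by str(bike_data)
toy <- data.frame(
  temp  = c(0.24, 0.22, 0.22),
  atemp = c(0.288, 0.273, 0.273)
)

toy$temp_c  <- denormalize(toy$temp,  t_min = -8,  t_max = 39)
toy$atemp_c <- denormalize(toy$atemp, t_min = -16, t_max = 50)
toy
```

If the assumed constants are wrong, every value derived this way shifts with them, which is exactly the misinterpretation risk described above.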
The detailed definitions of weather categories (1 to 4 in weathersit) lack specific descriptions. For example, category 4 may indicate severe conditions, but this is not explicitly documented, making it challenging to interpret its impact accurately.
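One way to reduce this ambiguity is to attach readable labels to the integer codes. The sketch below assumes label wording paraphrased from the dataset's readme, so the weather_labels vector is an assumption to check against your copy, and toy_weathersit is made-up illustration data rather than rows from hour.csv.

```r
# Assumed, paraphrased labels for the four weathersit codes (verify against the readme)
weather_labels <- c(
  "1" = "Clear / few clouds",
  "2" = "Mist / cloudy",
  "3" = "Light snow or light rain",
  "4" = "Heavy rain, ice pellets, thunderstorm, snow or fog"
)

# Made-up codes for illustration only
toy_weathersit <- c(1, 1, 2, 3, 4, 2)

# Convert the integer codes to a labelled factor; levels = 1:4 keeps
# every category as a level even if it does not occur in the data
weather_factor <- factor(
  toy_weathersit,
  levels = 1:4,
  labels = unname(weather_labels)
)

table(weather_factor)
```

Fixing the levels explicitly means an absent category still shows up with a count of zero instead of silently disappearing from tables and plots.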
The dataset does not explain how extreme weather events or anomalies are handled in rental counts (cnt), potentially affecting model robustness.
To explore the ambiguity surrounding the cnt column, especially its composition from casual and registered users, the following visualization examines the relationship between registered users and total rentals (cnt).
# Scatter plot of Registered Users vs Total Count
ggplot(bike_data, aes(x = registered, y = cnt)) +
geom_point(alpha = 0.6, color = 'steelblue') +
geom_smooth(method = 'lm', color = 'darkred', se = FALSE) +
labs(title = "Relationship Between Registered Users and Total Rentals",
x = "Number of Registered Users",
y = "Total Bike Rentals (cnt)") +
annotate("text", x = max(bike_data$registered)*0.6,
y = max(bike_data$cnt)*0.9,
label = "Unclear if 'cnt' includes only registered users or both \nregistered and casual users",
color = "darkred",
size = 4,
hjust = 0) +
theme_minimal(base_size = 10) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
legend.position = "top"
)
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot illustrates the relationship between the number of registered users and the cnt (total bike rentals). An evident linear trend suggests that as the number of registered users increases, the total rentals also rise. However, the annotation highlights the uncertainty regarding whether cnt exclusively counts registered users or aggregates both registered and casual users. This ambiguity can affect the interpretation of how different user types contribute to overall bike rentals.
# Create a new column to check if cnt equals casual + registered
bike_data <- bike_data %>%
mutate(cnt_difference = cnt - (casual + registered))
# Boxplot of differences (a single flat line at zero means cnt always equals casual + registered)
ggplot(bike_data, aes(y = cnt_difference)) +
geom_boxplot(fill = "steelblue") +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
labs(title = "Verification of Total Rental Count Consistency",
y = "cnt - (casual + registered)") +
theme_minimal()
This visualization checks the documented definition of cnt as the sum of casual and registered rentals. The boxplot collapses to a single line at zero, which indicates that the difference between cnt and (casual + registered) is zero for all observations and confirms the internal consistency of the dataset. This strengthens confidence in the integrity of the total count variable and reduces the risk of modeling errors arising from undocumented aggregation methods.
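A plot can hide a handful of mismatched rows, so the same consistency check can also be run as a direct logical comparison. The sketch below uses a toy data frame built from the first three rows printed by str() above; on the full data the same two expressions would be applied to bike_data.

```r
# Toy rows taken from the first three observations shown by str(bike_data)
toy <- data.frame(
  casual     = c(3L, 8L, 5L),
  registered = c(13L, 32L, 27L),
  cnt        = c(16L, 40L, 32L)
)

# Row-wise check: does cnt equal casual + registered?
cnt_matches <- toy$cnt == toy$casual + toy$registered

# TRUE only if every row is consistent
all(cnt_matches)

# Number of inconsistent rows (useful for locating problems if the check fails)
sum(!cnt_matches)
```

Because all three columns are integers, exact equality is safe here; with floating-point columns a tolerance-based comparison such as all.equal() would be the more cautious choice.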
This section evaluates whether key categorical variables contain:
1. Explicitly missing rows (NA values)
2. Implicitly missing combinations
3. Empty groups (categories defined but not present)
We examine:
- weathersit
- season
# Explicitly missing values
sum(is.na(bike_data$weathersit))
## [1] 0
# Frequency table
table(bike_data$weathersit)
##
## 1 2 3 4
## 11413 4544 1419 3
# Unique categories present
unique(bike_data$weathersit)
## [1] 1 2 3 4
The weathersit variable contains no explicitly missing values. All documented categories (1-4) are present in the dataset, meaning there are no empty groups. Coverage is uneven, however: category 4 appears in only 3 of the 17,379 rows, so it is present but barely represented.
# Explicitly missing values
sum(is.na(bike_data$season))
## [1] 0
# Frequency table
table(bike_data$season)
##
## 1 2 3 4
## 4242 4409 4496 4232
# Group count check
bike_data %>%
group_by(season) %>%
summarise(observations = n())
The season variable also contains no explicitly missing rows. All four seasonal categories (1–4) appear in the dataset. There are no empty seasonal groups, indicating that rental data was collected across all seasons without structural gaps.
# Check implicit missing combinations: Season x Weathersit
season_weather_table <- table(bike_data$season, bike_data$weathersit)
season_weather_table
##
## 1 2 3 4
## 1 2665 1205 369 3
## 2 2859 1144 406 0
## 3 3280 947 269 0
## 4 2609 1248 375 0
# Check if any combinations are zero (missing)
any(season_weather_table == 0)
## [1] TRUE
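Here any(season_weather_table == 0) returns TRUE, so implicitly missing combinations do exist. All three zero cells fall in the weathersit = 4 column: category 4 was recorded only in season 1 (3 observations) and never in seasons 2, 3, or 4.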
Evaluating categorical completeness is essential to ensure that predictive models are not biased by structurally missing groups. Neither weathersit nor season has explicit missing values or empty categories, which supports the dataset’s representativeness at the level of individual variables. However, the three absent season and weathersit combinations, together with only 3 observations of category 4 overall, limit a model’s ability to generalize to severe weather outside season 1.
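Knowing that some combination is missing is less useful than knowing which ones. The sketch below rebuilds the season by weathersit table from the counts printed above and lists the empty cells; the object names counts, cells, and missing_combos are illustrative, and on the full data the same steps would start from season_weather_table.

```r
# Rebuild the cross-tabulation from the counts printed above
counts <- matrix(
  c(2665, 1205, 369, 3,
    2859, 1144, 406, 0,
    3280,  947, 269, 0,
    2609, 1248, 375, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(season = as.character(1:4), weathersit = as.character(1:4))
)
season_weather <- as.table(counts)

# Long format: one row per (season, weathersit) cell with its frequency
cells <- as.data.frame(season_weather)

# Keep only the combinations that never occur
missing_combos <- cells[cells$Freq == 0, c("season", "weathersit")]
missing_combos
```

Listing the empty cells this way makes it easy to decide whether to merge a rare category with a neighbouring one or to flag predictions for those combinations as extrapolation.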
To formally identify extreme rental counts, we use the 1.5 × IQR rule, a standard statistical method for detecting outliers.
# Compute quartiles
Q1 <- quantile(bike_data$cnt, 0.25)
Q3 <- quantile(bike_data$cnt, 0.75)
# Compute IQR
IQR_value <- Q3 - Q1
# Define bounds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
lower_bound
## 25%
## -321.5
upper_bound
## 75%
## 642.5
# Identify outliers
outliers <- bike_data %>%
filter(cnt < lower_bound | cnt > upper_bound)
# Count number of outliers
nrow(outliers)
## [1] 505
# Boxplot of cnt with the lower and upper IQR bounds drawn as dashed lines
ggplot(bike_data, aes(y = cnt)) +
geom_boxplot(fill = "steelblue") +
geom_hline(yintercept = lower_bound, color = "red", linetype = "dashed") +
geom_hline(yintercept = upper_bound, color = "red", linetype = "dashed") +
labs(title = "Outlier Detection for Total Rental Count (cnt)",
y = "Total Rentals (cnt)") +
theme_minimal()
Using the 1.5 × IQR rule, outliers were formally identified as rental counts falling outside the calculated bounds of -321.5 and 642.5. Because cnt is never below 1, the negative lower bound cannot flag anything, so all 505 outliers (about 2.9% of the 17,379 hourly observations) lie above the upper threshold, indicating unusually high demand during certain hours or conditions.
These extreme values may correspond to peak commuting hours, special events, or unusually favorable weather. While they represent legitimate observations rather than data errors, they may disproportionately influence regression models and inflate variance estimates.
Therefore, careful consideration is required when modeling cnt. Robust regression methods or transformation techniques (e.g., log transformation) may help reduce the impact of extreme values while preserving meaningful variation.
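As a rough illustration of the transformation idea, the sketch below applies the same 1.5 × IQR rule before and after a log1p() transform. The helper iqr_flags and the vector toy_cnt are hypothetical: toy_cnt is a small made-up right-skewed sample, not data from hour.csv, so the counts it produces say nothing about the real dataset.

```r
# Hypothetical helper: flag values outside the k * IQR fences
iqr_flags <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up right-skewed counts for illustration only
toy_cnt <- c(1, 5, 12, 40, 60, 95, 142, 180, 230, 281, 350, 977)

# Number of flagged values on the raw scale
raw_outliers <- sum(iqr_flags(toy_cnt))

# Number of flagged values after log1p(), which compresses the upper tail
log_outliers <- sum(iqr_flags(log1p(toy_cnt)))

c(raw = raw_outliers, log1p = log_outliers)
```

On real data a log transform can also stretch the lower tail and create new low-end flags, so the comparison is worth rerunning on bike_data$cnt rather than assumed.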
Several risks emerge from the unclear aspects of the cnt column:
Modeling Inaccuracy: If cnt includes both casual and registered users without differentiation, models may misattribute the influence of user types on bike rentals, leading to biased predictions.
Misguided Business Decisions: Misunderstanding the composition of cnt can result in ineffective strategies, such as overemphasizing one user type over another based on incorrect assumptions.
Data Integrity Issues: Without clarity on how anomalies are handled, outliers may skew the data analysis, compromising the reliability of insights drawn from the dataset.
To mitigate these risks, the following actions are recommended:
Data Decomposition: Separate the cnt column into casual and registered components to analyze their individual contributions to total rentals.
Anomaly Analysis: Investigate instances of extreme cnt values to understand how anomalies are treated and whether they represent genuine spikes or data recording issues.
Seek Clarification: If documentation remains insufficient, contact the data provider or refer to supplementary resources to gain a clearer understanding of the cnt column’s composition.
Robust Modeling Techniques: Employ models that can handle potential data ambiguities, such as ensemble methods or models with built-in mechanisms for uncertainty estimation.
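The data decomposition step above can be sketched by reshaping the two user-type columns into long format and summarising each one separately. The toy data frame uses the first five rows printed by str() above; the names rentals_long, user_type, rentals, and share_by_type are illustrative choices, not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Toy rows taken from the first five observations shown by str(bike_data)
toy <- data.frame(
  hr         = 0:4,
  casual     = c(3, 8, 5, 3, 0),
  registered = c(13, 32, 27, 10, 1)
)

# One row per hour and user type instead of one column per user type
rentals_long <- toy %>%
  pivot_longer(c(casual, registered),
               names_to = "user_type",
               values_to = "rentals")

# Total rentals and share of all rentals for each user type
share_by_type <- rentals_long %>%
  group_by(user_type) %>%
  summarise(total = sum(rentals), .groups = "drop") %>%
  mutate(share = total / sum(total))

share_by_type
```

The long format also works directly with ggplot2, for example by mapping user_type to colour, so the two groups can be compared by hour or by weather condition without separate plotting code.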
This data dive underscored the critical role of comprehensive documentation in data analysis and model building. By identifying unclear columns such as temp, weathersit, and atemp, and exploring the ambiguities surrounding the cnt column, I highlighted potential pitfalls in data interpretation. The visualization emphasized the uncertainty in cnt’s composition, while the discussion on risks and mitigation strategies provided actionable insights to enhance data reliability and model accuracy. Moving forward, ensuring clarity in data documentation will be paramount to deriving meaningful and accurate analytical outcomes.
Composition of cnt: Does cnt exclusively represent the sum of casual and registered users, or are there additional factors involved?
Handling of Anomalies: How are extreme values or anomalies in bike rentals addressed in the dataset? Are they excluded, adjusted, or treated as outliers?
Normalization Parameters: What specific normalization techniques were applied to temp and atemp, and what are their original scales?
Impact of External Factors: How do external factors like public events or city-wide initiatives influence bike rental patterns beyond what’s captured in the dataset?