Roshan R Naidu (16/02/2026)

Importing Libraries

# Load tidyverse, a collection of data science packages (it attaches dplyr, ggplot2 and the other core packages, so separate imports are rarely needed afterwards)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
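# Load dplyr for data manipulation (already attached by tidyverse; kept for explicitness)
library(dplyr)

# Load ggplot2 for data visualisation (already attached by tidyverse; kept for explicitness)
library(ggplot2)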

Loading and Exploring The Dataset

# Load the dataset
bike_data <- read.csv("/Users/roshannaidu/Desktop/IU Sem 2/Stats 1/bike+sharing+dataset/hour.csv")

# View structure and data types of variables
str(bike_data)
## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
# View first few rows of the dataset
head(bike_data)
# View summary statistics for all variables
summary(bike_data)
##     instant         dteday              season            yr        
##  Min.   :    1   Length:17379       Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   : 8690                      Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379                      Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0
# Check number of rows and columns
dim(bike_data)
## [1] 17379    17
# Display all variable names
names(bike_data)
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"
# Check for missing values in each column
colSums(is.na(bike_data))
##    instant     dteday     season         yr       mnth         hr    holiday 
##          0          0          0          0          0          0          0 
##    weekday workingday weathersit       temp      atemp        hum  windspeed 
##          0          0          0          0          0          0          0 
##     casual registered        cnt 
##          0          0          0

Unclear Columns from the Data

Upon analyzing the UCI Bike Sharing Dataset and its documentation, the following columns are identified as unclear without proper documentation:

  1. temp: Represents normalized temperature values. Without documentation, it is unclear what the normalization range is (e.g., 0 to 1) and what the original temperature scale was (Celsius or Fahrenheit).

  2. weathersit: Encoded as integers (1 to 4) corresponding to different weather situations. Without documentation, the specific weather conditions each integer represents are ambiguous.

  3. atemp: Stands for “feels like” temperature, also normalized. It’s unclear how this differs from the temp column and what factors contribute to this perceived temperature.

Why the Data Was Encoded This Way

  1. temp and atemp: Normalizing temperature values facilitates easier integration with machine learning models by scaling features to a similar range. However, without knowing the normalization parameters, interpreting these values becomes challenging.

  2. weathersit: Encoding categorical weather conditions as integers saves storage space and simplifies analysis. Nonetheless, without clear labels, the encoded integers lose their descriptive meaning.
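As an illustration of how the normalization could be reversed, here is a minimal sketch. It assumes the min-max constants that I recall from the dataset's readme for hour.csv (temp: -8 to 39 °C, atemp: -16 to 50 °C); these constants and the helper name `denormalize` are assumptions and should be checked against the documentation before use.

```r
# Assumed min-max constants (from the dataset readme for hour.csv; verify before use):
#   temp  = (t - t_min) / (t_max - t_min), with t_min = -8,  t_max = 39  (deg C)
#   atemp = (t - t_min) / (t_max - t_min), with t_min = -16, t_max = 50  (deg C)
# `denormalize` is a hypothetical helper, not part of any package.
denormalize <- function(x, t_min, t_max) {
  x * (t_max - t_min) + t_min
}

# First three normalized values shown in the str() output above
temp_norm  <- c(0.24, 0.22, 0.22)
atemp_norm <- c(0.288, 0.273, 0.273)

temp_celsius  <- denormalize(temp_norm,  t_min = -8,  t_max = 39)
atemp_celsius <- denormalize(atemp_norm, t_min = -16, t_max = 50)

round(temp_celsius, 2)
round(atemp_celsius, 2)
```

Under these assumed constants, the first hours of the dataset correspond to temperatures only a few degrees above freezing, which is plausible for 1 January.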

Potential Issues if Documentation is Not Read

  1. Misinterpretation of Temperature Values: Assuming incorrect normalization ranges or temperature scales can lead to faulty analyses, such as underestimating the impact of temperature on bike rentals.

  2. Ambiguity in Weather Conditions: Misunderstanding the weathersit encoding can result in incorrect associations between weather and rental patterns, skewing the analysis.

  3. Confusion Between temp and atemp: Without distinguishing factors, it’s unclear how “feels like” temperature influences bike usage compared to actual temperature.
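One way to reduce the risk of misreading the weathersit codes is to attach labels to them as soon as the data is loaded. The sketch below uses base R's `factor()`; the label wording is my own shorthand paraphrased from the dataset readme and is an assumption, so the exact definitions should be confirmed there.

```r
# Hypothetical short labels, paraphrased from the dataset readme (confirm the wording there)
weather_labels <- c("clear/partly cloudy",
                    "mist/cloudy",
                    "light rain or snow",
                    "heavy rain or snow")

# First ten weathersit codes shown in the str() output above
weathersit <- c(1, 1, 1, 1, 1, 2, 1, 1, 1, 1)

# Declaring all four documented levels keeps unused categories visible with a count of 0
weather_factor <- factor(weathersit, levels = 1:4, labels = weather_labels)
weather_counts <- table(weather_factor)
weather_counts
```

With labelled levels, plots and model summaries show the weather description instead of a bare integer, and the codes are no longer treated as a numeric scale.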

Unclear Element Even After Reading Documentation

Visualizing the Internal Consistency of the cnt Variable

To explore the ambiguity surrounding the cnt column, especially its composition from casual and registered users, the following visualization examines the relationship between registered users and total rentals (cnt).

# Scatter plot of Registered Users vs Total Count
ggplot(bike_data, aes(x = registered, y = cnt)) +
  geom_point(alpha = 0.6, color = 'steelblue') +
  geom_smooth(method = 'lm', color = 'darkred', se = FALSE) +
  labs(title = "Relationship Between Registered Users and Total Rentals",
       x = "Number of Registered Users",
       y = "Total Bike Rentals (cnt)") +
  annotate("text", x = max(bike_data$registered)*0.6, 
           y = max(bike_data$cnt)*0.9, 
           label = "Unclear if 'cnt' includes only registered users or both \nregistered and casual users",
           color = "darkred",
           size = 4,
           hjust = 0) +
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "top"
  )
## `geom_smooth()` using formula = 'y ~ x'

Explanation of the visualization

The scatter plot illustrates the relationship between the number of registered users and the cnt (total bike rentals). An evident linear trend suggests that as the number of registered users increases, the total rentals also rise. However, the annotation highlights the uncertainty regarding whether cnt exclusively counts registered users or aggregates both registered and casual users. This ambiguity can affect the interpretation of how different user types contribute to overall bike rentals.

# Create a new column to check if cnt equals casual + registered
bike_data <- bike_data %>%
  mutate(cnt_difference = cnt - (casual + registered))

# Boxplot of differences (a box collapsed onto the line at 0 means cnt always equals casual + registered)
ggplot(bike_data, aes(y = cnt_difference)) +
  geom_boxplot(fill = "steelblue") +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  labs(title = "Verification of Total Rental Count Consistency",
       y = "cnt - (casual + registered)") +
  theme_minimal()

Explanation of the visualization

This visualization validates the documented definition of cnt as the sum of casual and registered rentals. The difference between cnt and (casual + registered) is zero for all observations, which confirms the internal consistency of the dataset. This strengthens confidence in the integrity of the total count variable and reduces the risk of modeling errors arising from undocumented aggregation methods.
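A plot can hide a handful of non-zero differences, so the same check can also be done exactly. The following is a minimal sketch in base R; the small `rentals` data frame is a stand-in built from the first ten rows shown in the str() output, and in the real analysis the same expressions would be applied to bike_data.

```r
# Stand-in data: first ten rows of casual, registered and cnt from the str() output
rentals <- data.frame(
  casual     = c(3, 8, 5, 3, 0, 0, 2, 1, 1, 8),
  registered = c(13, 32, 27, 10, 1, 1, 0, 2, 7, 6),
  cnt        = c(16, 40, 32, 13, 1, 1, 2, 3, 8, 14)
)

# Row-wise difference between the reported total and the sum of its parts
cnt_difference <- rentals$cnt - (rentals$casual + rentals$registered)

# TRUE only if every row is consistent; the count pinpoints how many are not
all_consistent <- all(cnt_difference == 0)
n_mismatched   <- sum(cnt_difference != 0)

all_consistent
n_mismatched
```

An exact test like this is a useful complement to the boxplot because it fails loudly if even a single row is inconsistent.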

Categorical Variable Completeness Analysis

This section evaluates whether key categorical variables contain:

  1. Explicitly missing rows (NA values)

  2. Implicitly missing combinations

  3. Empty groups (categories defined but not present)

We examine weathersit and season.

# Explicitly missing values
sum(is.na(bike_data$weathersit))
## [1] 0
# Frequency table
table(bike_data$weathersit)
## 
##     1     2     3     4 
## 11413  4544  1419     3
# Unique categories present
unique(bike_data$weathersit)
## [1] 1 2 3 4

The weathersit variable contains no explicitly missing values. All documented categories (1-4) are present in the dataset, meaning there are no empty groups. This suggests complete weather classification coverage.
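Note that `unique()` and `table()` on an integer column list only the values that occur, so an empty category would simply be absent from the output. A sketch of a more explicit check is to declare the documented levels up front; here the weathersit vector is rebuilt from the frequency table above, and the subset without category 4 is a hypothetical case used only to show what an empty group looks like.

```r
# Categories the documentation defines for weathersit
documented_levels <- 1:4

# Rebuild the weathersit column from the frequency table shown above
weathersit <- rep(1:4, times = c(11413, 4544, 1419, 3))

# With explicit levels, a category that never occurs appears with a count of 0
counts       <- table(factor(weathersit, levels = documented_levels))
empty_groups <- names(counts)[counts == 0]

# Hypothetical subset in which category 4 never occurs
subset_counts   <- table(factor(weathersit[weathersit != 4], levels = documented_levels))
empty_in_subset <- names(subset_counts)[subset_counts == 0]

empty_groups
empty_in_subset
```

For the full data no level is empty, while the hypothetical subset reports level 4 as an empty group.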

# Explicitly missing values
sum(is.na(bike_data$season))
## [1] 0
# Frequency table
table(bike_data$season)
## 
##    1    2    3    4 
## 4242 4409 4496 4232
# Group count check
bike_data %>%
  group_by(season) %>%
  summarise(observations = n())

The season variable also contains no explicitly missing rows. All four seasonal categories (1–4) appear in the dataset. There are no empty seasonal groups, indicating that rental data was collected across all seasons without structural gaps.

# Check implicit missing combinations: Season x Weathersit
season_weather_table <- table(bike_data$season, bike_data$weathersit)
season_weather_table
##    
##        1    2    3    4
##   1 2665 1205  369    3
##   2 2859 1144  406    0
##   3 3280  947  269    0
##   4 2609 1248  375    0
# Check if any combinations are zero (missing)
any(season_weather_table == 0)
## [1] TRUE

Here any(season_weather_table == 0) returns TRUE, so implicitly missing combinations exist. The zeroes all fall in the weathersit = 4 column: that category was recorded only 3 times, all in season 1, and never in seasons 2, 3, or 4.

Evaluating categorical completeness is essential to ensure that predictive models are not biased by structurally missing groups. The absence of explicit missing values and empty groups strengthens confidence in the dataset’s representativeness. However, because weathersit = 4 is absent from three of the four seasons, a model has almost no data from which to learn the effect of that weather category and may generalize poorly to such conditions.
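The zero cells of the season × weathersit table can also be listed programmatically rather than read off by eye. The sketch below rebuilds the table from the counts printed above as a plain matrix (the object name `season_weather` is a stand-in) and uses `which(..., arr.ind = TRUE)` to locate the empty combinations.

```r
# Season x weathersit counts copied from the table output above
season_weather <- matrix(
  c(2665, 1205, 369, 3,
    2859, 1144, 406, 0,
    3280,  947, 269, 0,
    2609, 1248, 375, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(as.character(1:4), as.character(1:4))
)

# Row/column positions of every cell with zero observations
zero_cells <- which(season_weather == 0, arr.ind = TRUE)

# Translate positions back into category codes (columns indexed by position)
missing_combinations <- data.frame(
  season     = rownames(season_weather)[zero_cells[, 1]],
  weathersit = colnames(season_weather)[zero_cells[, 2]]
)
missing_combinations
```

The same `which()` call should work directly on the `table` object created in the analysis, since a two-way table is indexed like a matrix.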

Outlier Detection in the Continuous Variable cnt

To formally identify extreme rental counts, we use the 1.5 × IQR rule, a standard statistical method for detecting outliers.

# Compute quartiles
Q1 <- quantile(bike_data$cnt, 0.25)
Q3 <- quantile(bike_data$cnt, 0.75)

# Compute IQR
IQR_value <- Q3 - Q1

# Define bounds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

lower_bound
##    25% 
## -321.5
upper_bound
##   75% 
## 642.5
# Identify outliers
outliers <- bike_data %>%
  filter(cnt < lower_bound | cnt > upper_bound)

# Count number of outliers
nrow(outliers)
## [1] 505
ggplot(bike_data, aes(y = cnt)) +
  geom_boxplot(fill = "steelblue") +
  geom_hline(yintercept = lower_bound, color = "red", linetype = "dashed") +
  geom_hline(yintercept = upper_bound, color = "red", linetype = "dashed") +
  labs(title = "Outlier Detection for Total Rental Count (cnt)",
       y = "Total Rentals (cnt)") +
  theme_minimal()

Using the 1.5 × IQR rule, outliers are rental counts falling outside the calculated lower and upper bounds. Because the lower bound (-321.5) is below the minimum observed cnt of 1, all 505 outliers (about 2.9% of the 17,379 hourly observations) lie above the upper bound of 642.5, indicating unusually high demand during certain hours or conditions.
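The bounds can be reproduced by hand from the quartiles that summary() reports for cnt (Q1 = 40, Q3 = 281), and the rule itself can be wrapped in a small function for reuse on other columns. The helper name `iqr_bounds` and the toy vector are hypothetical and serve only as illustration.

```r
# Hypothetical helper: lower and upper fences of the k x IQR rule
iqr_bounds <- function(x, k = 1.5) {
  q   <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Worked check using the quartiles of cnt from summary(bike_data)
q1 <- 40
q3 <- 281
manual_bounds <- c(lower = q1 - 1.5 * (q3 - q1),
                   upper = q3 + 1.5 * (q3 - q1))

# Share of hourly observations flagged as outliers (505 of 17379)
outlier_share <- 505 / 17379

# Toy vector with one extreme value, to show the function in use
toy_bounds <- iqr_bounds(c(1, 2, 3, 4, 100))

manual_bounds
toy_bounds
```

The manual calculation gives an IQR of 241 and fences of -321.5 and 642.5, matching the values printed in the analysis.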

These extreme values may correspond to peak commuting hours, special events, or unusually favorable weather. While they represent legitimate observations rather than data errors, they may disproportionately influence regression models and inflate variance estimates.

Therefore, careful consideration is required when modeling cnt. Robust regression methods or transformation techniques (e.g., log transformation) may help reduce the impact of extreme values while preserving meaningful variation.
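To illustrate how a log transformation compresses the upper tail, here is a minimal sketch using a few cnt values taken from the summary output (minimum, first-row value, quartiles, median, and maximum). Using `log1p()` rather than `log()` is my choice here, since it stays defined if a count of zero ever appears; the vector itself is a stand-in, not the full column.

```r
# A few cnt values from the summary output: min, first row, Q1, median, Q3, max
cnt <- c(1, 16, 40, 142, 281, 977)

# log1p(x) = log(1 + x); remains finite when x is 0
log_cnt <- log1p(cnt)

# How far the maximum sits from the median, before and after the transformation
ratio_raw <- max(cnt) / median(cnt)
ratio_log <- max(log_cnt) / median(log_cnt)

# The transformation is reversible with expm1()
cnt_back <- expm1(log_cnt)

round(log_cnt, 2)
c(raw = ratio_raw, log = ratio_log)
```

On the log scale the maximum is much closer to the median than on the raw scale, which is why extreme hours exert less leverage on a model fitted to the transformed counts. Predictions then need to be back-transformed with `expm1()` before being interpreted as rentals.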

Potential Risks

Several risks emerge from the unclear aspects of the cnt column:

  1. Modeling Inaccuracy: Because cnt aggregates casual and registered users without differentiation, models fitted to cnt alone may misattribute the influence of each user type on bike rentals, leading to biased predictions.

  2. Misguided Business Decisions: Misunderstanding the composition of cnt can result in ineffective strategies, such as overemphasizing one user type over another based on incorrect assumptions.

  3. Data Integrity Issues: Without clarity on how anomalies are handled, outliers may skew the data analysis, compromising the reliability of insights drawn from the dataset.

Risk Mitigation Strategies

To mitigate these risks, the following actions are recommended:

  1. Data Decomposition: Analyze the casual and registered columns separately, rather than only their sum cnt, to assess their individual contributions to total rentals.

  2. Anomaly Analysis: Investigate instances of extreme cnt values to understand how anomalies are treated and whether they represent genuine spikes or data recording issues.

  3. Seek Clarification: If documentation remains insufficient, contact the data provider or refer to supplementary resources to gain a clearer understanding of the cnt column’s composition.

  4. Robust Modeling Techniques: Employ models that can handle potential data ambiguities, such as ensemble methods or models with built-in mechanisms for uncertainty estimation.
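The data decomposition step could start with something as simple as the share of casual riders in each hour. The sketch below is illustrative only: the `rentals` data frame is a stand-in built from the first ten rows shown in the str() output, and the column name `casual_share` is hypothetical.

```r
# Stand-in data: first ten rows of casual and registered from the str() output
rentals <- data.frame(
  casual     = c(3, 8, 5, 3, 0, 0, 2, 1, 1, 8),
  registered = c(13, 32, 27, 10, 1, 1, 0, 2, 7, 6)
)
rentals$cnt <- rentals$casual + rentals$registered

# Hourly share of casual riders; NA guards against hours with no rentals at all
rentals$casual_share <- ifelse(rentals$cnt > 0,
                               rentals$casual / rentals$cnt,
                               NA_real_)

# Overall share across the rows, weighted by volume
overall_casual_share <- sum(rentals$casual) / sum(rentals$cnt)

round(rentals$casual_share, 3)
round(overall_casual_share, 3)
```

Tracking the two user types separately makes it possible to see whether a change in cnt is driven by commuters (registered) or occasional riders (casual), which a model on cnt alone cannot distinguish.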

Conclusion

This data dive underscored the critical role of comprehensive documentation in data analysis and model building. By identifying unclear columns such as temp, weathersit, and atemp, and exploring the ambiguities surrounding the cnt column, I highlighted potential pitfalls in data interpretation. The visualization emphasized the uncertainty in cnt’s composition, while the discussion on risks and mitigation strategies provided actionable insights to enhance data reliability and model accuracy. Moving forward, ensuring clarity in data documentation will be paramount to deriving meaningful and accurate analytical outcomes.

Further Questions

  1. Composition of cnt: Does cnt exclusively represent the sum of casual and registered users, or are there additional factors involved?

  2. Handling of Anomalies: How are extreme values or anomalies in bike rentals addressed in the dataset? Are they excluded, adjusted, or treated as outliers?

  3. Normalization Parameters: What specific normalization techniques were applied to temp and atemp, and what are their original scales?

  4. Impact of External Factors: How do external factors like public events or city-wide initiatives influence bike rental patterns beyond what’s captured in the dataset?