# Loading the necessary libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Loading the dataset
bike_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")
# Exploring the dataset
str(bike_data)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
head(bike_data)
## instant dteday season yr mnth hr holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 0 6 0 1
## 2 2 2011-01-01 1 0 1 1 0 6 0 1
## 3 3 2011-01-01 1 0 1 2 0 6 0 1
## 4 4 2011-01-01 1 0 1 3 0 6 0 1
## 5 5 2011-01-01 1 0 1 4 0 6 0 1
## 6 6 2011-01-01 1 0 1 5 0 6 0 2
## temp atemp hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81 0.0000 3 13 16
## 2 0.22 0.2727 0.80 0.0000 8 32 40
## 3 0.22 0.2727 0.80 0.0000 5 27 32
## 4 0.24 0.2879 0.75 0.0000 3 10 13
## 5 0.24 0.2879 0.75 0.0000 0 1 1
## 6 0.24 0.2576 0.75 0.0896 0 1 1
summary(bike_data)
## instant dteday season yr
## Min. : 1 Length:17379 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 Mode :character Median :3.000 Median :1.0000
## Mean : 8690 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 Max. :4.000 Max. :1.0000
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
# Summary statistics for two numeric columns: temp and cnt (total rentals)
numeric_summary <- bike_data %>%
  summarise(
    min_temp = min(temp),
    max_temp = max(temp),
    mean_temp = mean(temp),
    median_temp = median(temp),
    sd_temp = sd(temp),
    q1_temp = quantile(temp, 0.25),
    q3_temp = quantile(temp, 0.75),
    min_cnt = min(cnt),
    max_cnt = max(cnt),
    mean_cnt = mean(cnt),
    median_cnt = median(cnt),
    sd_cnt = sd(cnt),
    q1_cnt = quantile(cnt, 0.25),
    q3_cnt = quantile(cnt, 0.75)
  )
numeric_summary
## min_temp max_temp mean_temp median_temp sd_temp q1_temp q3_temp min_cnt
## 1 0.02 1 0.4969872 0.5 0.1925561 0.34 0.66 1
## max_cnt mean_cnt median_cnt sd_cnt q1_cnt q3_cnt
## 1 977 189.4631 142 181.3876 40 281
Temperature (temp) Insights:
Range and Spread: The temp variable ranges from 0.02 to 1.00, representing normalized temperature values. The wide range indicates that the dataset covers almost the entire possible spectrum of temperatures throughout the year.
Central Tendency and Distribution: The mean temperature is 0.497 and the median is 0.50, suggesting a roughly symmetric distribution of temperature values around the central point. The standard deviation is 0.19, about a fifth of the 0-to-1 scale, so most temperatures fall within a moderate band around the mean.
Quartiles: The 25th percentile (Q1) is 0.34, and the 75th percentile (Q3) is 0.66, so the middle 50% of the temperatures lie between 0.34 and 0.66. This is a moderately broad spread, suggesting mild to warm weather conditions are common in the dataset.
Implication: Given the relatively even spread around the mean and median, bike rentals likely occur throughout a wide range of temperatures, but may show peaks at certain optimal temperature ranges (e.g., mild weather). The symmetry in temperature distribution suggests that the dataset does not overly represent extreme temperatures, which is useful for modeling typical usage patterns.
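The mean-versus-median and quartile comparisons above can be reproduced for any numeric column with a small helper. The following is a minimal base R sketch; the describe() function and the toy data frame are hypothetical names introduced here for illustration, and the toy values are made up rather than taken from hour.csv.

```r
# Hypothetical helper: min, quartiles, median, mean, max, SD, and IQR for one numeric vector.
describe <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  c(min = min(x), q1 = q[1], median = median(x), mean = mean(x),
    q3 = q[2], max = max(x), sd = sd(x), iqr = q[2] - q[1])
}

# Made-up values for illustration only (not rows from hour.csv).
toy <- data.frame(
  temp = c(0.20, 0.40, 0.60, 0.80),
  cnt  = c(10, 20, 30, 100)
)

# One column of statistics per variable.
summary_table <- sapply(toy[c("temp", "cnt")], describe)
print(round(summary_table, 2))
```

With bike_data loaded, sapply(bike_data[c("temp", "cnt")], describe) should return the same statistics as the summarise() call, arranged one column per variable.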
Count of Total Rentals (cnt) Insights:
Range and Spread: The cnt variable (total bike rentals per hour) ranges from 1 to 977, showing a very broad range of usage. This indicates that the number of rentals per hour can fluctuate greatly depending on factors such as time of day, season, and weather.
Central Tendency and Distribution: The mean rental count is around 189.46, well above the median of 142, which indicates a positive (right) skew: most hours have relatively few rentals, while a small number of very busy hours pull the mean upward.
Standard Deviation and Quartiles: The standard deviation (181.39) is nearly as large as the mean, reflecting significant variability in the number of rentals per hour. The 25th percentile (Q1) is 40, and the 75th percentile (Q3) is 281, so the middle 50% of hourly counts lie within this range. A quarter of all hours see 40 or fewer rentals, while the busiest quarter see 281 or more.
Implication: The high variability in cnt suggests that multiple external factors influence bike rentals. These factors could include time of day (commuting vs. leisure), day of the week (weekday vs. weekend), weather conditions, and special events. The skewed distribution indicates that most of the time, rental counts are lower, but during specific periods, there can be a surge in demand. Understanding the conditions under which these peaks occur is crucial for demand forecasting and resource allocation.
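The skew described above can be quantified instead of inferred from the mean and median alone. The sketch below uses a moment-based skewness formula written in base R; the skewness() helper and the toy_cnt vector are hypothetical names, and the values are made up for illustration rather than drawn from hour.csv.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up hourly counts for illustration only (not rows from hour.csv).
toy_cnt <- c(1, 5, 12, 40, 60, 142, 150, 281, 500, 977)

skew_toy <- skewness(toy_cnt)
mean_above_median <- mean(toy_cnt) > median(toy_cnt)

print(skew_toy)
print(mean_above_median)
```

Applied to the real data, skewness(bike_data$cnt) would give a single number to report alongside the mean and median.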
# Count records per level for categorical columns: season and weathersit
season_counts <- table(bike_data$season)
weather_counts <- table(bike_data$weathersit)
season_counts
##
## 1 2 3 4
## 4242 4409 4496 4232
weather_counts
##
## 1 2 3 4
## 11413 4544 1419 3
Season (season) Insights:
Distribution of Data Across Seasons:
The data shows the count of records for each season:
1 = Spring (4,242 records)
2 = Summer (4,409 records)
3 = Fall (4,496 records)
4 = Winter (4,232 records)
The counts range from 4,232 to 4,496, so each season contributes roughly a quarter of the 17,379 records. The dataset is therefore not noticeably biased toward any one season, and season-specific analyses such as comparing rental trends across the year are not distorted by over-representation.
Rental Trends by Season:
The record counts above describe how many hours were observed, not how many bikes were rented. The differences in the number of rentals across seasons could highlight seasonal usage patterns. For example, we might expect higher bike rentals in warmer seasons (summer and fall) compared to colder ones (winter). Understanding these patterns is crucial for optimizing bike availability and maintenance schedules.
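How evenly the records are spread across seasons can be checked by converting the counts to shares. This short sketch reuses the four counts printed by table(bike_data$season) above and the season labels listed in this section; the variable names season_share and share_gap are introduced here for illustration.

```r
# Record counts per season, copied from the table(bike_data$season) output above.
season_counts <- c(Spring = 4242, Summer = 4409, Fall = 4496, Winter = 4232)

# Share of all hourly records that falls in each season, in percent.
season_share <- round(100 * season_counts / sum(season_counts), 1)
print(season_share)

# Gap between the most and least represented season, in percentage points.
share_gap <- max(season_share) - min(season_share)
print(share_gap)
```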
Weather Situation (weathersit) Insights:
Distribution of Weather Conditions:
The summary shows the count of records for each weather condition:
1 = Clear, Few clouds, Partly cloudy (11,413 records)
2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist (4,544 records)
3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds (1,419 records)
4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog (3 records)
Categories 1 and 2 together account for 15,957 of the 17,379 records (about 92%). Because each record is one hour of observation, these counts show how often each condition occurred rather than how many bikes were rented. The main consequence is limited data for adverse weather analysis: category 4 has only 3 records, too few to support reliable conclusions on its own.
Impact of Adverse Weather on Rentals:
The distribution of bike rentals across different weather conditions indicates how sensitive users are to changes in weather. A sharp drop in rentals for categories 3 and 4 (light snow/rain and heavy rain/snow) would suggest that users avoid cycling during poor weather conditions. Conversely, a moderate decline might indicate that some users are resilient to bad weather, perhaps due to commuting needs.
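The imbalance between weather categories can be made explicit by computing each category's share and flagging categories with too few records. The counts below are copied from the table(bike_data$weathersit) output above; the cut-off of 30 records is an assumption chosen for illustration, not a value from the original analysis.

```r
# Record counts per weather category, copied from the table(bike_data$weathersit) output above.
weather_counts <- c(`1` = 11413, `2` = 4544, `3` = 1419, `4` = 3)

# Percentage of hourly records observed under each weather category.
weather_share <- 100 * weather_counts / sum(weather_counts)
print(round(weather_share, 2))

# Assumed cut-off: treat categories with fewer than 30 records as too sparse
# for reliable per-category summaries. The value 30 is an illustrative choice.
min_records <- 30
sparse_categories <- names(weather_counts)[weather_counts < min_records]
print(sparse_categories)
```

One option suggested by such a check is to merge category 4 into category 3 before modeling, though whether that is appropriate depends on the analysis goal.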
# Average number of rentals by temperature
temp_vs_rentals <- bike_data %>%
  group_by(temp) %>%
  summarise(avg_rentals = mean(cnt))
temp_vs_rentals
## # A tibble: 50 × 2
## temp avg_rentals
## <dbl> <dbl>
## 1 0.02 41.9
## 2 0.04 35.6
## 3 0.06 42
## 4 0.08 28.2
## 5 0.1 49.3
## 6 0.12 58.4
## 7 0.14 55.1
## 8 0.16 65.6
## 9 0.18 60.1
## 10 0.2 79.7
## # ℹ 40 more rows
Effect of Weather Conditions on Rentals Across Seasons:
Clear Weather (weathersit = 1):
Across all seasons, the highest number of bike rentals occurs under clear weather conditions. However, the average rentals in clear weather may vary by season. For example, clear weather in summer (season = 2) may have the highest average rentals, while clear weather in winter (season = 4) may still see fewer rentals compared to other seasons.
Mist or Cloudy Weather (weathersit = 2):
The impact of misty or cloudy weather appears moderate, with a
noticeable reduction in rentals compared to clear weather but not as
severe as more adverse weather conditions. This suggests that while
misty or cloudy conditions may discourage some users, a substantial
portion of riders continue to use the service. However, the impact is
likely more pronounced in winter than in other seasons.
Adverse Weather (weathersit = 3 and 4):
Rentals under light snow or rain (weathersit = 3) and severe weather (weathersit = 4) are significantly lower across all seasons. The drop is particularly steep in winter, indicating that adverse weather, combined with cold temperatures, significantly reduces ridership. Summer may still see some rentals in light rain, but the numbers are much lower than in clear conditions.
Seasonal Variation in Rentals Under Similar Weather Conditions:
Clear Weather Across Seasons:
Clear weather leads to consistently high rentals in all seasons, but there are differences in magnitude. For example, rentals might be highest in fall (season = 3) and summer, slightly lower in spring (season = 1), and lowest in winter. This suggests that even under ideal conditions, seasonality affects bike rentals, possibly due to factors like daylight hours or holidays.
Effect of Cold Weather in Winter:
The impact of adverse weather (weathersit = 3 or 4) is most severe in winter, suggesting that cold and poor weather conditions have a compounding effect on discouraging users. The combination of snow, fog, or heavy rain with cold temperatures likely leads to a sharp decline in rentals.
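The season-by-weather comparisons described above rest on average rentals for each combination of the two variables. A minimal base R sketch of that cross-tabulation follows; the toy data frame is made up for illustration and does not reproduce values from hour.csv, and the names mean_by_group and drop_pct are introduced here.

```r
# Made-up hourly records for illustration only (not rows from hour.csv).
toy <- data.frame(
  season     = c(2, 2, 2, 2, 4, 4, 4, 4),
  weathersit = c(1, 1, 3, 3, 1, 1, 3, 3),
  cnt        = c(300, 340, 120, 100, 200, 180, 40, 60)
)

# Mean rentals for every season-by-weather combination
# (rows: season, columns: weathersit).
mean_by_group <- tapply(
  toy$cnt,
  list(season = toy$season, weathersit = toy$weathersit),
  mean
)
print(mean_by_group)

# Relative drop from clear (1) to light rain/snow (3) within each season, in percent.
drop_pct <- 100 * (1 - mean_by_group[, "3"] / mean_by_group[, "1"])
print(round(drop_pct, 1))
```

Running the same tapply() call on bike_data would show whether the drop under adverse weather is in fact steepest in winter. Combinations with no records appear as NA in the resulting table.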
# Distributions of rentals over temperature
ggplot(bike_data, aes(x = temp, y = cnt, color = factor(season))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Bike Rentals vs Temperature", x = "Normalized Temperature", y = "Count of Rentals") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Positive Correlation Between Temperature and Bike Rentals:
Rental counts tend to rise with temperature, and the linear trend line (geom_smooth) confirms this upward trend.
Seasonal Differences in Rental Patterns:
Color-Coding by Season:
The different colors for each season reveal how temperature impacts bike
rentals differently across seasons:
Summer (season = 2, shown in green under the default ggplot2 palette): Shows the highest concentration of rentals at higher temperatures. This aligns with the expectation that warmer temperatures in summer result in more outdoor activities and bike usage.
Fall (season = 3) and Spring (season = 1): Also show positive trends, with a moderate concentration of rentals as temperatures rise, but fewer rentals than in summer, particularly at the higher end of the temperature range.
Winter (season = 4): Shows the lowest concentration of rentals, even at higher temperatures, which could be due to the overall colder temperatures and other factors like shorter daylight hours or user preferences.
Seasonal Overlaps and Extremes:
There is some overlap between different seasons at the middle range of temperatures. For example, late spring and early fall might have similar temperatures, but the density of rentals may differ, reflecting different user behaviors or preferences during these transitional seasons.
At extreme low temperatures, regardless of the season, the number of rentals is generally low. This suggests a threshold below which users are generally unwilling to rent bikes, likely due to discomfort or safety concerns.
Non-Linear Patterns at Temperature Extremes:
While the overall trend is positive, the distribution of points suggests some non-linearities:
At the highest temperatures, there is a slight plateau or reduction in rental counts, indicating that extremely hot temperatures may deter users from cycling. This pattern is more evident in summer.
Similarly, at lower temperatures, there is a rapid decline in rentals, showing that colder conditions significantly discourage bike usage.
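Both the positive association and the flattening at high temperatures can be tested numerically. The sketch below computes a Pearson correlation and compares a straight-line fit with a fit that adds a squared temperature term. The data are synthetic, generated from an assumed concave curve purely for illustration, so the numbers say nothing about hour.csv itself.

```r
# Synthetic data for illustration only: rentals rise with temperature
# and flatten near the top of the range.
temp <- seq(0.02, 1, by = 0.02)
cnt  <- 50 + 600 * temp - 350 * temp^2 + 15 * sin(40 * temp)
toy  <- data.frame(temp = temp, cnt = cnt)

# Strength of the linear association.
pearson_r <- cor(toy$temp, toy$cnt)

# Straight-line fit versus a fit with an added squared term.
lin_fit  <- lm(cnt ~ temp, data = toy)
quad_fit <- lm(cnt ~ temp + I(temp^2), data = toy)

# A negative squared-term coefficient means the curve bends downward
# at high temperatures.
quad_coef <- coef(quad_fit)[["I(temp^2)"]]

# Residual sum of squares for each fit; a clearly smaller value for the
# quadratic fit supports curvature.
rss <- c(linear = deviance(lin_fit), quadratic = deviance(quad_fit))

print(pearson_r)
print(quad_coef)
print(rss)
```

On the real data the same two lm() calls could be run with data = bike_data. The residual sum of squares can never be larger for the quadratic fit, so the size of the improvement is what matters.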
Weather-Specific Impacts on Temperature-Rental Relationship:
The impact of weather (as shown in previous analyses) likely interacts with temperature:
At a given temperature, clear weather (weathersit = 1) may show a higher number of rentals compared to the same temperature with misty or rainy conditions (weathersit = 2 or 3). This interaction suggests the need to consider both temperature and weather when predicting demand.
# Distribution of rentals by hour of the day, split by working day
# hr is converted to a factor so that boxes are drawn for each hour
ggplot(bike_data, aes(x = factor(hr), y = cnt, fill = factor(workingday))) +
  geom_boxplot() +
  labs(title = "Bike Rentals by Hour of the Day", x = "Hour", y = "Count of Rentals") +
  theme_minimal()
Clear Peaks in Bike Rentals During Commuting Hours:
Differences Between Working Days and Non-Working Days:
Working Days (workingday = 1):
On working days, the peak rentals occur during the morning (around 8 AM)
and evening (around 5-6 PM) rush hours. The rentals remain relatively
low during midday (around 12 PM to 3 PM) and late at night. This
suggests a typical workday commuting pattern, where users primarily rent
bikes to travel to and from work.
Non-Working Days (workingday = 0):
On non-working days, the rental pattern is quite different. There is a
steady increase in rentals starting from late morning (around 10 AM),
peaking in the early afternoon (around 1-3 PM), and then gradually
declining towards the evening. This suggests that on weekends or
holidays, users rent bikes for leisure or recreational purposes rather
than commuting.
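The working-day versus non-working-day contrast described above can be checked by averaging rentals per hour for each day type and locating the peak. This is a minimal base R sketch on made-up records; the toy data frame and the names hourly_means and peak_hour are hypothetical and introduced here for illustration.

```r
# Made-up hourly records for illustration only (not rows from hour.csv).
toy <- data.frame(
  hr         = rep(c(8, 13, 17), times = 4),
  workingday = rep(c(1, 1, 0, 0), each = 3),
  cnt        = c(400, 200, 500, 420, 180, 480, 100, 350, 300, 120, 370, 280)
)

# Mean rentals per hour (rows) and day type
# (columns: 0 = non-working day, 1 = working day).
hourly_means <- tapply(
  toy$cnt,
  list(hr = toy$hr, workingday = toy$workingday),
  mean
)
print(hourly_means)

# Hour with the highest mean rentals for each day type.
peak_idx  <- apply(hourly_means, 2, which.max)
peak_hour <- setNames(
  as.integer(rownames(hourly_means)[peak_idx]),
  colnames(hourly_means)
)
print(peak_hour)
```

The same steps applied to bike_data would return the busiest hour for working and non-working days across all 24 hours.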
Lower Rentals During Night Hours Across All Days:
Midday Variability in Rentals on Non-Working Days:
Potential Lunchtime Rentals Spike:
Higher Evening Rentals on Non-Working Days:
ggplot(bike_data, aes(x = factor(weathersit), y = cnt, fill = factor(weathersit))) +
  geom_boxplot() +
  labs(title = "Bike Rentals by Weather Conditions", x = "Weather Situation", y = "Count of Rentals") +
  theme_minimal()
Significant Impact of Weather on Bike Rentals:
The box plots show clear differences in rental counts across the different weather conditions:
Weather Situation 1 (Clear, Few Clouds, Partly Cloudy):
This weather condition has the highest median and a wide range of bike
rentals, indicating that users prefer to rent bikes when the weather is
clear or partly cloudy. The upper quartile of rentals is also
significantly higher under these conditions, suggesting that pleasant
weather conditions encourage more bike usage.
Weather Situation 2 (Mist + Cloudy, Mist + Broken Clouds, Mist + Few Clouds, Mist):
Rentals under this weather condition show a lower median compared to
Weather Situation 1, with a slightly narrower interquartile range. This
suggests that while users still rent bikes under misty or cloudy
conditions, the number of rentals tends to decrease as compared to clear
weather.
Weather Situation 3 (Light Snow, Light Rain, Thunderstorm, Scattered Clouds):
The median rentals are noticeably lower in this category, with a
significant decrease in both the median and upper quartile. This
suggests that users are less inclined to rent bikes in more adverse
weather conditions such as light rain, light snow, or
thunderstorms.
Weather Situation 4 (Heavy Rain, Ice Pellets, Snow, Fog):
Rentals are the lowest under this weather situation, with a low median and a very narrow interquartile range. Only 3 of the 17,379 hourly records fall in this category, so the box summarizes just three observations and should be read with caution. The pattern is consistent with very few people renting bikes during severe weather conditions like heavy rain, snow, or fog, likely due to safety concerns and discomfort.
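The medians and interquartile ranges read off the box plots can also be computed directly, together with the number of records behind each box. The following base R sketch uses made-up records for illustration; the toy data frame and the box_stats name are hypothetical.

```r
# Made-up hourly records for illustration only (not rows from hour.csv).
toy <- data.frame(
  weathersit = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  cnt        = c(50, 150, 250, 400, 40, 120, 200, 20, 60)
)

# For each weather category: number of records, median, and interquartile range.
box_stats <- sapply(split(toy$cnt, toy$weathersit), function(x) {
  c(n = length(x), median = median(x), iqr = IQR(x))
})
print(box_stats)
```

Reporting n next to each median makes it clear when a box rests on very few observations.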
Weather Situation 1 Dominates Bike Rentals:
Significant Decline in Rentals with Increasing Adversity in Weather:
The data shows a clear declining trend in rentals as weather conditions worsen:
Higher Variability in Rentals Under Clearer Weather:
Narrower Distribution Under Poor Weather Conditions:
Weather Conditions Strongly Influence Bike Rentals:
Peak Rental Hours Align with Commuting Times on Working Days:
Temperature Positively Correlates with Bike Rentals:
User Behavior Varies Significantly by Day Type:
Midday and Evening Rental Variability Points to Diverse Usage Patterns: