Loading and Exploring the Data Set

# Loading the necessary libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Loading the dataset
bike_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")

# Exploring the dataset
str(bike_data)
## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
head(bike_data)
##   instant     dteday season yr mnth hr holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1  0       0       6          0          1
## 2       2 2011-01-01      1  0    1  1       0       6          0          1
## 3       3 2011-01-01      1  0    1  2       0       6          0          1
## 4       4 2011-01-01      1  0    1  3       0       6          0          1
## 5       5 2011-01-01      1  0    1  4       0       6          0          1
## 6       6 2011-01-01      1  0    1  5       0       6          0          2
##   temp  atemp  hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81    0.0000      3         13  16
## 2 0.22 0.2727 0.80    0.0000      8         32  40
## 3 0.22 0.2727 0.80    0.0000      5         27  32
## 4 0.24 0.2879 0.75    0.0000      3         10  13
## 5 0.24 0.2879 0.75    0.0000      0          1   1
## 6 0.24 0.2576 0.75    0.0896      0          1   1
summary(bike_data)
##     instant         dteday              season            yr        
##  Min.   :    1   Length:17379       Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   : 8690                      Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379                      Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0

Numeric Summary for two numerical columns

# Summary statistics for two numeric columns: temp and cnt (total rentals)
numeric_summary <- bike_data %>%
  summarise(
    min_temp = min(temp),
    max_temp = max(temp),
    mean_temp = mean(temp),
    median_temp = median(temp),
    sd_temp = sd(temp),
    q1_temp = quantile(temp, 0.25),
    q3_temp = quantile(temp, 0.75),
    
    min_cnt = min(cnt),
    max_cnt = max(cnt),
    mean_cnt = mean(cnt),
    median_cnt = median(cnt),
    sd_cnt = sd(cnt),
    q1_cnt = quantile(cnt, 0.25),
    q3_cnt = quantile(cnt, 0.75)
  )

numeric_summary
##   min_temp max_temp mean_temp median_temp   sd_temp q1_temp q3_temp min_cnt
## 1     0.02        1 0.4969872         0.5 0.1925561    0.34    0.66       1
##   max_cnt mean_cnt median_cnt   sd_cnt q1_cnt q3_cnt
## 1     977 189.4631        142 181.3876     40    281

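Insights from the Summary Statistics of the two numeric columns:

  • Temperature (temp): The normalized temperature ranges from 0.02 to 1.00. The mean (0.497) and median (0.50) nearly coincide, the standard deviation is about 0.19, and the middle half of the observations lies between 0.34 and 0.66, so the distribution is roughly symmetric around its center.

  • Total rentals (cnt): Hourly rentals range from 1 to 977. The mean (189.5) is well above the median (142), and the standard deviation (181.4) is almost as large as the mean. This points to a right-skewed distribution: half of the hours have between 40 and 281 rentals, while a small number of hours have very high demand.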

Summary for Categorical Columns

# Count records per level of the categorical columns season and weathersit
season_counts <- table(bike_data$season)
weather_counts <- table(bike_data$weathersit)

season_counts
## 
##    1    2    3    4 
## 4242 4409 4496 4232
weather_counts
## 
##     1     2     3     4 
## 11413  4544  1419     3

Insights from the summary of the two categorical columns season and weathersit:

  • Season (season) Insights:

    • Distribution of Data Across Seasons:
      The data shows the count of records for each season:

      • 1 = Spring

      • 2 = Summer

      • 3 = Fall

      • 4 = Winter

      The records are spread almost evenly across the seasons (4,232 to 4,496 each, roughly 24% to 26% of the 17,379 rows). No season is over- or underrepresented, so season-specific comparisons, such as rental trends across the year, are not distorted by unequal sample sizes.

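    • Rental Trends by Season:
      The table above counts hourly records, not rentals, so it does not by itself show seasonal usage. Comparing average or total cnt by season would reveal seasonal patterns; for example, we might expect higher rentals in the warmer seasons than in the colder ones. Note that the first records, dated 2011-01-01, are coded season = 1, so the "Spring" label covers mid-winter dates and the coldest hours may fall under season = 1 rather than season = 4. Understanding these patterns is crucial for optimizing bike availability and maintenance schedules.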

  • Weather Situation (weathersit) Insights:

    • Distribution of Weather Conditions:
      The summary shows the count of records for each weather condition:

      • 1 = Clear, Few clouds, Partly cloudy

      • 2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

      • 3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

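      • 4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

      About 92% of the hourly records fall under good weather (categories 1 and 2, with 11,413 and 4,544 of 17,379 records), about 8% under category 3 (1,419 records), and only 3 records under category 4. Each record is one hour regardless of how many bikes were rented, so these counts reflect how often each weather condition occurred, not how heavily the system was used. They do imply that there is limited data for adverse weather analysis, and almost none for category 4.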

    • Impact of Adverse Weather on Rentals:
      The distribution of bike rentals across different weather conditions indicates how sensitive users are to changes in weather. A sharp drop in rentals for categories 3 and 4 (light snow/rain and heavy rain/snow) would suggest that users avoid cycling during poor weather conditions. Conversely, a moderate decline might indicate that some users are resilient to bad weather, perhaps due to commuting needs.
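The distinction between record counts and rentals can be made concrete by putting both in one table. The following is a minimal sketch in base R; `level_summary` is a hypothetical helper name introduced here for illustration, and it assumes the data frame has the `cnt` column shown in the `str()` output.

```r
# Share of hourly records and mean rentals per level of a categorical column.
# `level_summary` is a hypothetical helper, not part of the original analysis.
level_summary <- function(data, column, target = "cnt") {
  groups <- factor(data[[column]])
  counts <- table(groups)
  data.frame(
    level        = names(counts),                  # category code, e.g. "1".."4"
    n_records    = as.integer(counts),             # number of hourly rows
    share        = as.numeric(prop.table(counts)), # fraction of all rows
    mean_rentals = as.numeric(tapply(data[[target]], groups, mean)),
    stringsAsFactors = FALSE
  )
}
```

Calling `level_summary(bike_data, "season")` or `level_summary(bike_data, "weathersit")` would show, side by side, how often each level occurs and how many bikes are rented per hour on average at that level.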

Questions to investigate:

  1. How does temperature affect the number of bike rentals?
  2. Is there a significant difference in bike rentals between weekdays and weekends?
  3. How does the count of bike rentals vary across different weather conditions?
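Question 2 asks about a significant difference, which calls for a test rather than a plot alone. The sketch below is one possible approach, not a definitive one: it assumes weekday codes 0 and 6 are Sunday and Saturday (the 2011-01-01 rows, a Saturday, carry weekday = 6), and `compare_weekend` is a hypothetical helper name. Hourly counts are not independent observations, so the p-values should be read as indicative only.

```r
# Compare hourly rentals on weekends vs weekdays.
# Assumption: weekday 0 = Sunday and 6 = Saturday.
compare_weekend <- function(data) {
  is_weekend <- data$weekday %in% c(0, 6)
  weekend_cnt <- data$cnt[is_weekend]
  weekday_cnt <- data$cnt[!is_weekend]
  list(
    means  = c(weekday = mean(weekday_cnt), weekend = mean(weekend_cnt)),
    # Welch t-test: does not assume equal variances
    welch  = t.test(weekend_cnt, weekday_cnt),
    # Rank-based alternative, less sensitive to the right skew of cnt
    wilcox = wilcox.test(weekend_cnt, weekday_cnt, exact = FALSE)
  )
}
```

With the data loaded, `compare_weekend(bike_data)$welch` would print the test statistic and a confidence interval for the difference in mean hourly rentals.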

Addressing Q1 (How does temperature affect the number of bike rentals?) using aggregation

# Average number of rentals by temperature
temp_vs_rentals <- bike_data %>%
  group_by(temp) %>%
  summarise(avg_rentals = mean(cnt))

temp_vs_rentals
## # A tibble: 50 × 2
##     temp avg_rentals
##    <dbl>       <dbl>
##  1  0.02        41.9
##  2  0.04        35.6
##  3  0.06        42  
##  4  0.08        28.2
##  5  0.1         49.3
##  6  0.12        58.4
##  7  0.14        55.1
##  8  0.16        65.6
##  9  0.18        60.1
## 10  0.2         79.7
## # ℹ 40 more rows

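Insights based on the aggregation (average number of rentals by temperature)

  • Across the rows displayed, average rentals rise with temperature: roughly 28 to 42 rentals per hour at temp ≤ 0.08, compared with about 80 at temp = 0.20. The increase is not strictly monotonic (for example, temp = 0.08 averages fewer rentals than temp = 0.06), which may reflect temperature values that are backed by few records.

Insights on Weather Conditions Across Seasons

The points below concern the joint effect of season and weathersit. The temperature aggregation above does not compute this breakdown, so they should be read as expectations to verify with an aggregation grouped by season and weathersit.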

  1. Effect of Weather Conditions on Rentals Across Seasons:

    • Clear Weather (weathersit = 1):
      Across all seasons, the highest number of bike rentals occurs under clear weather conditions. However, the average rentals in clear weather may vary by season. For example, clear weather in summer (season = 2) may have the highest average rentals, while clear weather in winter (season = 4) may still see fewer rentals compared to other seasons.

    • Mist or Cloudy Weather (weathersit = 2):
      The impact of misty or cloudy weather appears moderate, with a noticeable reduction in rentals compared to clear weather but not as severe as more adverse weather conditions. This suggests that while misty or cloudy conditions may discourage some users, a substantial portion of riders continue to use the service. However, the impact is likely more pronounced in winter than in other seasons.

    • Adverse Weather (weathersit = 3 and 4):
      Rentals under light snow or rain (weathersit = 3) and severe weather (weathersit = 4) are expected to be markedly lower across all seasons. The drop is likely steepest in winter, where adverse weather combines with cold temperatures. Summer may still see some rentals in light rain, but probably far fewer than in clear conditions.

  2. Seasonal Variation in Rentals Under Similar Weather Conditions:

    • Clear Weather Across Seasons:
      Clear weather leads to consistently high rentals in all seasons, but there are differences in magnitude. For example, rentals might be highest in fall (season = 3) and summer, slightly lower in spring (season = 1), and lowest in winter. This suggests that even under ideal conditions, seasonality affects bike rentals, possibly due to factors like daylight hours or holidays.

    • Effect of Cold Weather in Winter:
      The impact of adverse weather (weathersit = 3 or 4) is most severe in winter, suggesting that cold and poor weather conditions have a compounding effect on discouraging users. The combination of snow, fog, or heavy rain with cold temperatures likely leads to a sharp decline in rentals.
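The season-by-weather observations above can be checked with a grouped aggregation. This is a minimal sketch in base R under the column names shown in the `str()` output; `weather_season_means` is a hypothetical helper name. It also reports the number of hours behind each average, because some combinations (for example weathersit = 4) rest on very few records.

```r
# Average hourly rentals and number of records for every season/weather pair.
# `weather_season_means` is a hypothetical helper, not part of the original analysis.
weather_season_means <- function(data) {
  means  <- aggregate(cnt ~ season + weathersit, data = data, FUN = mean)
  counts <- aggregate(cnt ~ season + weathersit, data = data, FUN = length)
  names(means)[3]  <- "avg_rentals"
  names(counts)[3] <- "n_hours"
  # One row per observed (season, weathersit) combination
  merge(means, counts, by = c("season", "weathersit"))
}
```

`weather_season_means(bike_data)` would return one row per observed combination; averages with a small `n_hours` should be interpreted with caution.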

Visual Summaries of the data

Distributions of Rentals over temperature

# Distributions of rentals over temperature
ggplot(bike_data, aes(x = temp, y = cnt, color = factor(season))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Bike Rentals vs Temperature", x = "Normalized Temperature", y = "Count of Rentals") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Insights from the data visualization - Bike Rentals Vs Temperature

  1. Positive Correlation Between Temperature and Bike Rentals:

    • The scatter plot shows a clear positive trend, indicating that as the temperature increases, the number of bike rentals also tends to increase. This suggests that warmer temperatures are generally associated with higher bike usage. The linear regression line (geom_smooth) confirms this upward trend.
  2. Seasonal Differences in Rental Patterns:

    • Color-Coding by Season:
      The different colors for each season reveal how temperature impacts bike rentals differently across seasons:

      • Summer (season = 2, drawn in green by ggplot2's default discrete palette): Appears to show a high concentration of rentals at higher temperatures. This aligns with the expectation that warmer temperatures in summer result in more outdoor activities and bike usage.

      • Fall (season = 3) and Spring (season = 1): Also show positive trends as temperatures rise. Whether they see fewer rentals than summer at the higher end of the temperature range is hard to judge from overlapping points and is worth confirming with per-season averages.

      • Winter (season = 4): Shows the lowest concentration of rentals, even at higher temperatures, which could be due to the overall colder temperatures and other factors like shorter daylight hours or user preferences.

  3. Seasonal Overlaps and Extremes:

    • There is some overlap between different seasons at the middle range of temperatures. For example, late spring and early fall might have similar temperatures, but the density of rentals may differ, reflecting different user behaviors or preferences during these transitional seasons.

    • At extreme low temperatures, regardless of the season, the number of rentals is generally low. This suggests a threshold below which users are generally unwilling to rent bikes, likely due to discomfort or safety concerns.

  4. Non-Linear Patterns at Temperature Extremes:

    • While the overall trend is positive, the distribution of points suggests some non-linearities:

      • At the highest temperatures, there is a slight plateau or reduction in rental counts, indicating that extremely hot temperatures may deter users from cycling. This pattern is more evident in summer.

      • Similarly, at lower temperatures, there is a rapid decline in rentals, showing that colder conditions significantly discourage bike usage.

  5. Weather-Specific Impacts on Temperature-Rental Relationship:

    • The impact of weather (as shown in previous analyses) likely interacts with temperature:

      • For instance, a warm temperature with clear weather conditions (weathersit = 1) may show a higher number of rentals compared to the same temperature with misty or rainy conditions (weathersit = 2 or 3). This interaction suggests the need to consider both temperature and weather when predicting demand.
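The plateau at high temperatures described above can be checked numerically by comparing a straight-line fit with one that allows curvature. The sketch below uses a quadratic term as one simple choice (an assumption; a spline or loess fit would be alternatives), and `compare_temp_fits` is a hypothetical helper name.

```r
# Compare a linear and a quadratic fit of rentals on normalized temperature.
# `compare_temp_fits` is a hypothetical helper, not part of the original analysis.
compare_temp_fits <- function(data) {
  linear    <- lm(cnt ~ temp, data = data)
  quadratic <- lm(cnt ~ temp + I(temp^2), data = data)
  list(
    correlation  = cor(data$temp, data$cnt),        # Pearson correlation
    r2_linear    = summary(linear)$r.squared,
    r2_quadratic = summary(quadratic)$r.squared,
    # F-test of whether the squared term improves on the straight line
    anova        = anova(linear, quadratic)
  )
}
```

A negative coefficient on the squared term together with a clearly higher `r2_quadratic` would be consistent with rentals levelling off at the hottest temperatures.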

Bike Rentals by Hour of the day

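# Treat hour as a factor so that each hour gets its own pair of boxes
ggplot(bike_data, aes(x = factor(hr), y = cnt, fill = factor(workingday))) +
  geom_boxplot() +
  labs(title = "Bike Rentals by Hour of the Day", x = "Hour", y = "Count of Rentals", fill = "Working day") +
  theme_minimal()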

Insights from the data visualization of Bike Rentals by hour of the day

  1. Clear Peaks in Bike Rentals During Commuting Hours:

    • The box plots reveal two distinct peaks in bike rentals around 8 AM and 5-6 PM, which correspond to typical commuting times on working days. This pattern indicates that a significant portion of bike rentals is driven by commuters traveling to and from work. The rental counts are highest during these hours, and the interquartile range (IQR) is also wider, indicating high variability in the number of rentals.
  2. Differences Between Working Days and Non-Working Days:

    • Working Days (workingday = 1):
      On working days, the peak rentals occur during the morning (around 8 AM) and evening (around 5-6 PM) rush hours. The rentals remain relatively low during midday (around 12 PM to 3 PM) and late at night. This suggests a typical workday commuting pattern, where users primarily rent bikes to travel to and from work.

    • Non-Working Days (workingday = 0):
      On non-working days, the rental pattern is quite different. There is a steady increase in rentals starting from late morning (around 10 AM), peaking in the early afternoon (around 1-3 PM), and then gradually declining towards the evening. This suggests that on weekends or holidays, users rent bikes for leisure or recreational purposes rather than commuting.

  3. Lower Rentals During Night Hours Across All Days:

    • Across both working and non-working days, bike rentals are consistently low during night hours (from around 9 PM to 5 AM). This reflects a general preference for bike use during daylight hours, possibly due to safety concerns, lower temperatures, or lack of visibility.
  4. Midday Variability in Rentals on Non-Working Days:

    • The midday hours on non-working days (10 AM to 3 PM) show a higher median rental count and a wider interquartile range compared to the same hours on working days. This suggests more variability in bike rental behavior during non-working days, possibly due to a mix of leisure, exercise, and other personal activities.
  5. Potential Lunchtime Rentals Spike:

    • On working days, there is a noticeable but smaller spike around noon (12 PM), likely representing lunchtime bike rentals. This suggests that some users may be renting bikes for short trips during their lunch break, possibly for errands or quick rides.
  6. Sustained Evening Rentals on Non-Working Days:

    • On non-working days, rentals decline only gradually after the early-afternoon peak and remain relatively stable into the evening (6-8 PM), suggesting that users extend recreational outings later in the day. Because the working-day evening peak at 5-6 PM (point 1) overlaps this window, any claim that non-working-day evenings are busier than working-day evenings should be checked hour by hour rather than read off the plot.
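The hourly patterns read off the box plots can be verified by tabulating the medians directly. This is a minimal sketch in base R; `hourly_profile` is a hypothetical helper name, and it assumes the `hr`, `workingday`, and `cnt` columns shown in the `str()` output.

```r
# Median hourly rentals for each hour and day type, plus the busiest hour per day type.
# `hourly_profile` is a hypothetical helper, not part of the original analysis.
hourly_profile <- function(data) {
  # Rows are hours, columns are workingday values (0 and 1)
  medians <- tapply(data$cnt,
                    list(hr = data$hr, workingday = data$workingday),
                    median)
  # Row index of the largest median in each column, mapped back to the hour label
  peak_row  <- apply(medians, 2, which.max)
  peak_hour <- as.integer(rownames(medians)[peak_row])
  names(peak_hour) <- colnames(medians)
  list(medians = medians, peak_hour = peak_hour)
}
```

`hourly_profile(bike_data)$medians` would give a 24-by-2 table in which the commuting peaks, the midday level, and the evening comparison between day types can each be checked hour by hour.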

Rentals by Weather Conditions

ggplot(bike_data, aes(x = factor(weathersit), y = cnt, fill = factor(weathersit))) +
  geom_boxplot() +
  labs(title = "Bike Rentals by Weather Conditions", x = "Weather Situation", y = "Count of Rentals") +
  theme_minimal()

Insights from the data visualization of Rentals by Weather Conditions

  1. Significant Impact of Weather on Bike Rentals:

    • The box plots show clear differences in rental counts across the different weather conditions:

      • Weather Situation 1 (Clear, Few Clouds, Partly Cloudy):
        This weather condition has the highest median and a wide range of bike rentals, indicating that users prefer to rent bikes when the weather is clear or partly cloudy. The upper quartile of rentals is also significantly higher under these conditions, suggesting that pleasant weather conditions encourage more bike usage.

      • Weather Situation 2 (Mist + Cloudy, Mist + Broken Clouds, Mist + Few Clouds, Mist):
        Rentals under this weather condition show a lower median compared to Weather Situation 1, with a slightly narrower interquartile range. This suggests that while users still rent bikes under misty or cloudy conditions, the number of rentals tends to decrease as compared to clear weather.

      • Weather Situation 3 (Light Snow, Light Rain, Thunderstorm, Scattered Clouds):
        The median rentals are noticeably lower in this category, with a significant decrease in both the median and upper quartile. This suggests that users are less inclined to rent bikes in more adverse weather conditions such as light rain, light snow, or thunderstorms.

      • Weather Situation 4 (Heavy Rain, Ice Pellets, Snow, Fog):
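        Rentals appear lowest under this weather situation, with a low median and a very narrow interquartile range. However, this category contains only 3 hourly records (see the weathersit counts above), so its box summarizes too few observations to support firm conclusions. The low counts are consistent with few people renting bikes during severe weather like heavy rain, snow, or fog, likely due to safety concerns and discomfort.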

  2. Weather Situation 1 Dominates Bike Rentals:

    • A substantial proportion of the total rentals occurs under Weather Situation 1 (clear or partly cloudy weather), which also accounts for about two thirds of the hourly records. This category shows the highest upper and lower quartiles, meaning that even in hours with fewer rentals, the count is still relatively high. This reinforces the understanding that favorable weather conditions significantly boost bike rentals.
  3. Significant Decline in Rentals with Increasing Adversity in Weather:

    • The data shows a clear declining trend in rentals as weather conditions worsen:

      • From Weather Situation 1 (clear or partly cloudy) to Weather Situation 4 (severe conditions like heavy rain or snow), there is a marked decrease in both the median and the spread of rental counts. This suggests a strong negative correlation between the severity of weather and bike rental activity.
  4. Higher Variability in Rentals Under Clearer Weather:

    • The interquartile range (IQR) is widest for Weather Situation 1, indicating that rentals vary greatly even under clear or partly cloudy conditions. This variability might be influenced by other factors like temperature, day of the week, or events, which have a stronger effect when the weather is good.
  5. Narrower Distribution Under Poor Weather Conditions:

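    • The IQR for Weather Situations 3 and 4 is much narrower, with many hours having very low rentals (the minimum cnt in the dataset is 1, so no recorded hour has zero rentals). This suggests that poor weather conditions create a more consistent (and predictably lower) level of bike usage, likely due to safety concerns or discomfort. For Weather Situation 4 the narrow box also reflects that only 3 records exist.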
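The medians and interquartile ranges discussed above can be tabulated next to the number of records in each category, and the overall difference can be tested. The sketch below uses a Kruskal-Wallis test as one reasonable choice for skewed counts (an assumption, not the author's method); `weather_spread` is a hypothetical helper name.

```r
# Sample size, median, and IQR of hourly rentals per weather category,
# plus a Kruskal-Wallis rank test of whether the categories differ.
# `weather_spread` is a hypothetical helper, not part of the original analysis.
weather_spread <- function(data) {
  groups <- factor(data$weathersit)
  spread <- data.frame(
    weathersit = levels(groups),
    n_hours    = as.integer(table(groups)),
    median_cnt = as.numeric(tapply(data$cnt, groups, median)),
    iqr_cnt    = as.numeric(tapply(data$cnt, groups, IQR)),
    stringsAsFactors = FALSE
  )
  list(table = spread, kruskal = kruskal.test(x = data$cnt, g = groups))
}
```

In `weather_spread(bike_data)$table`, the `n_hours` column makes it easy to see which medians and IQRs are based on too few hours to be reliable.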

Conclusions Based on the Analysis So Far

  • Weather Conditions Strongly Influence Bike Rentals:

    • Clear or partly cloudy weather (Weather Situation 1) is associated with the highest number of bike rentals, while adverse weather (light rain or snow, and the rare severe conditions) is associated with markedly fewer rentals. Because severe weather accounts for only 3 hourly records, conclusions about it are tentative. The overall pattern highlights the need for dynamic strategies based on weather forecasts to optimize operations and maintain user engagement.
  • Peak Rental Hours Align with Commuting Times on Working Days:

    • Bike rentals exhibit clear peaks during typical commuting hours (8 AM and 5-6 PM) on working days, indicating that the bike-sharing service is heavily used for daily commutes. On non-working days, rentals peak in the late morning to early afternoon, suggesting a shift in usage patterns toward leisure or recreational activities. This insight supports a need for dynamic bike redistribution and targeted promotions aligned with different user behaviors on working versus non-working days.
  • Temperature Positively Correlates with Bike Rentals:

    • Warmer temperatures are associated with increased bike rentals. The relationship is visible in the scatter plot and in the temperature aggregation, both of which show higher rentals in mild to warm conditions; whether the effect is stronger in clear weather has not yet been tested. This suggests an opportunity to maximize rentals by promoting bike usage during warmer seasons or by encouraging rentals during milder parts of colder days.
  • User Behavior Varies Significantly by Day Type:

    • The difference in rental patterns between working and non-working days indicates that the bike-sharing service caters to distinct user needs based on the day type. Commuters primarily use the service during peak hours on weekdays, whereas on weekends or holidays, usage patterns are more spread out across the day, driven by recreational or leisurely activities. This insight can guide scheduling, bike placement, and service expansion plans, as well as targeted marketing strategies.
  • Midday and Evening Rental Variability Points to Diverse Usage Patterns:

    • On working days, there is a notable midday spike in rentals, likely due to lunchtime trips. On non-working days, rentals remain high into the evening hours, suggesting prolonged usage. The variability in rental patterns indicates that the service must accommodate diverse user needs, such as short commutes and longer leisure trips, and manage resources accordingly.

Further Questions I Would Like to Investigate

  1. How do humidity and wind speed affect bike rentals?
  2. Is there a pattern in rentals during holidays?
  3. Can we predict the bike rentals using a regression with multiple variables?
  4. What is the combined impact of weather and temperature on rentals during specific hours?
  5. How do sudden weather changes affect real-time rental demand?
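As a starting point for question 3, a multiple linear regression can be fitted on an earlier part of the data and evaluated on a later part. This is only a sketch under stated assumptions: the predictor set is an illustrative choice, `fit_rental_model` and `holdout_rmse` are hypothetical helper names, and ordinary least squares is used for simplicity even though a count model (for example Poisson regression via `glm`) may suit `cnt` better.

```r
# Fit a multiple linear regression for hourly rentals.
# The default formula is an illustrative assumption, not a tuned model.
fit_rental_model <- function(train,
                             formula = cnt ~ temp + hum + windspeed +
                               factor(weathersit) + factor(hr) + workingday) {
  lm(formula, data = train)
}

# Root mean squared error of the model's predictions on held-out rows
holdout_rmse <- function(model, test) {
  predicted <- predict(model, newdata = test)
  sqrt(mean((test$cnt - predicted)^2))
}
```

Because the rows are ordered in time, one option is to train on the first year (`yr == 0`) and evaluate on the second (`yr == 1`), for example `holdout_rmse(fit_rental_model(subset(bike_data, yr == 0)), subset(bike_data, yr == 1))`.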