Let’s examine the average Solar Radiation by month.
solar_month <- SeoulBikeData |>
group_by(month) |>
summarise(mean(solar_radiation), .groups = "drop")
print(solar_month)
## # A tibble: 12 × 2
## month `mean(solar_radiation)`
## <dbl> <dbl>
## 1 1 0.227
## 2 2 0.482
## 3 3 0.600
## 4 4 0.713
## 5 5 0.754
## 6 6 0.837
## 7 7 0.754
## 8 8 0.695
## 9 9 0.654
## 10 10 0.542
## 11 11 0.369
## 12 12 0.204
These results make sense: the averages are higher in summer than in winter because of the tilt of the Earth. However, they don’t tell the full story. Days are longer in summer, and there is no solar radiation at night, so a monthly average mixes the length of the day with the intensity of the sunlight. To separate the two, we group by month and hour and use a heatmap to visualize the grouped summary.
#Group solar radiation by month and hour
solar_month_hour <- SeoulBikeData |>
group_by(month_name, hour) |>
summarise(mean_solar_radiation = mean(solar_radiation))
## `summarise()` has grouped output by 'month_name'. You can override using the
## `.groups` argument.
#Heatmap of mean solar radiation by hour and month
ggplot(solar_month_hour, aes(x = month_name, y = hour, fill = mean_solar_radiation)) +
geom_tile() +
scale_fill_gradient(low="black", high="orange")+
labs(title = "Heatmap of Solar Radiation",
x = "Month",
y = "Hour")
Solar radiation is both more intense and present for more hours of the day near June, and it tapers off toward December and January.
This heatmap demonstrates the seasonal relationship well, but if we were to pick a random date from the dataset, which range of daily solar radiation would be least likely to occur?
#Histogram of daily solar radiation values
solar_date <- SeoulBikeData |>
group_by(date) |>
summarise(solar_radiation_sum = sum(solar_radiation), .groups = "drop")
print(solar_date)
## # A tibble: 365 × 2
## date solar_radiation_sum
## <date> <dbl>
## 1 2017-12-01 5.97
## 2 2017-12-02 6.33
## 3 2017-12-03 3.01
## 4 2017-12-04 6.79
## 5 2017-12-05 0.86
## 6 2017-12-06 6.14
## 7 2017-12-07 5.85
## 8 2017-12-08 6.89
## 9 2017-12-09 6.32
## 10 2017-12-10 2.08
## # ℹ 355 more rows
ggplot(solar_date, aes(x = solar_radiation_sum)) +
geom_histogram(aes(y = after_stat(count / sum(count) * 100)), binwidth = 3, fill = "orange", color = "black") +
labs(title = "Percentage Histogram of Daily Solar Radiation",
x = "Solar Radiation (MJ/m^2)",
y = "Percentage (%)")
Based on the histogram, a daily total over 30 MJ/m^2 or under 3 MJ/m^2 is very rare.
These low-radiation and high-radiation days likely occur close to the yearly solstices, when day length is at its extremes, and may also reflect unusual solar activity. I hypothesize that day length (the amount of time between sunrise and sunset) would have a high R-squared value as a predictor of daily solar radiation.
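One way this hypothesis could be checked is to regress the daily radiation total on a day-length measure and read off R-squared. The sketch below does not use SeoulBikeData: `toy` is a small made-up data frame with an invented radiation curve, and `daylight_hours` (the number of hours per day with non-zero radiation) is an assumed, crude proxy for day length, since the dataset has no sunrise or sunset column. It only illustrates the mechanics, not a result.

```r
# Hypothetical toy data standing in for SeoulBikeData: 12 dates x 24 hours.
dates <- as.Date("2018-01-01") + seq(0, 330, by = 30)
toy <- data.frame(
  date = rep(dates, each = 24),
  hour = rep(0:23, times = length(dates))
)

# Made-up radiation curve: zero at night, a cosine bump around noon whose
# width grows toward late June (day of year 172).
day_of_year <- as.numeric(format(toy$date, "%j"))
half_day <- 6 + cos(2 * pi * (day_of_year - 172) / 365)
toy$solar_radiation <- ifelse(
  abs(toy$hour - 12) < half_day,
  2.5 * cos(pi * (toy$hour - 12) / (2 * half_day)),
  0
)

# Daily total and the assumed day-length proxy (hours with radiation > 0).
toy$is_daylight <- toy$solar_radiation > 0
daily <- aggregate(cbind(solar_radiation, is_daylight) ~ date,
                   data = toy, FUN = sum)
names(daily) <- c("date", "solar_radiation_sum", "daylight_hours")

# Simple linear regression; R-squared is the share of variance explained.
fit <- lm(solar_radiation_sum ~ daylight_hours, data = daily)
r_squared <- summary(fit)$r.squared
print(r_squared)
```

On the real data, the same aggregation could be applied to SeoulBikeData grouped by date, keeping in mind that the proxy is derived from the radiation column itself.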
What is the effect of holidays on ridership? I hypothesize that holidays would see lower ridership and a different ridership distribution throughout the day.
#First, let’s see what percent of the dataset is a holiday.
SeoulBikeData |>
group_by(holiday) |>
summarise(count = n()/nrow(SeoulBikeData), .groups = "drop")
## # A tibble: 2 × 2
## holiday count
## <chr> <dbl>
## 1 Holiday 0.0493
## 2 No Holiday 0.951
#Holidays make up about 5% of data in the data set.
SeoulBikeData |>
group_by(holiday) |>
summarise(mean(rented_bikes))
## # A tibble: 2 × 2
## holiday `mean(rented_bikes)`
## <chr> <dbl>
## 1 Holiday 500.
## 2 No Holiday 715.
The average ridership is roughly 30% lower on holidays.
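The 30% figure follows directly from the two rounded means printed above. As a quick check, with variable names chosen here only for illustration:

```r
# Rounded hourly means taken from the summary table above.
holiday_mean <- 500
no_holiday_mean <- 715

# Relative drop in average ridership on holidays, in percent.
pct_lower <- (no_holiday_mean - holiday_mean) / no_holiday_mean * 100
print(round(pct_lower, 1))  # 30.1
```

Because the inputs are rounded to three significant figures, the exact percentage in the full data may differ slightly.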
Let’s examine temperature and its effect on ridership.
SeoulBikeData <- SeoulBikeData %>%
mutate(temp_category = case_when(
temp_c < 0 ~ "1. Freezing",
temp_c >= 0 & temp_c < 10 ~ "2. Cold",
temp_c >= 10 & temp_c < 20 ~ "3. Cool",
temp_c >= 20 & temp_c < 30 ~ "4. Warm",
temp_c >= 30 ~ "5. Hot"
))
#Let's see what the temperatures typically look like in Seoul.
SeoulBikeData |>
group_by(temp_category) |>
summarise(count=n(), .groups = "drop")
## # A tibble: 5 × 2
## temp_category count
## <chr> <int>
## 1 1. Freezing 1433
## 2 2. Cold 2142
## 3 3. Cool 2257
## 4 4. Warm 2404
## 5 5. Hot 524
A randomly chosen hour is least likely to fall in the Hot category (30 degrees Celsius or above). Let’s hold time of day constant and see how temperature affects ridership.
ggplot(SeoulBikeData, aes(x = hour, y = rented_bikes, color = temp_category)) +
stat_summary(fun = mean, geom = "line") +
labs(title = "Lineplot of average Bike Rentals per Hour by Temperature Category",
x = "Hour",
y = "Average Bikes Rented") +
theme_minimal()
This line plot shows clear differences in ridership based on temperature. Interestingly, when it is “Hot,” average midday ridership is on par with when the temperature is “Cool.” I hypothesize that, between 9am and 5pm, solar radiation has a stronger effect on ridership when the temperature is Hot than when it is Warm.
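A first look at this hypothesis could compare the correlation between solar radiation and rentals within each temperature category, restricted to midday hours. The sketch below uses a made-up data frame, `toy`, with random values and no built-in relationship, so its output says nothing about Seoul. Treating "9am to 5pm" as hours 9 through 16 is also an assumption about how `hour` is coded.

```r
# Hypothetical toy data with the same column names as SeoulBikeData.
set.seed(42)
toy <- expand.grid(
  hour = 0:23,
  temp_category = c("4. Warm", "5. Hot"),
  replicate = 1:5,
  stringsAsFactors = FALSE
)
toy$solar_radiation <- runif(nrow(toy), min = 0, max = 3.5)
toy$rented_bikes <- round(rnorm(nrow(toy), mean = 800, sd = 100))

# Keep only the assumed 9am-5pm window (hours 9 through 16).
midday <- subset(toy, hour >= 9 & hour < 17)

# Pearson correlation of radiation and rentals within each category.
cor_by_temp <- sapply(
  split(midday, midday$temp_category),
  function(d) cor(d$solar_radiation, d$rented_bikes)
)
print(cor_by_temp)
```

On the real data, a noticeably larger absolute correlation for "5. Hot" than for "4. Warm" would be consistent with the hypothesis, though a regression with an interaction term would be a more careful test.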
We will compare temperature with seasons in Seoul.
seasonal_temps <- SeoulBikeData |>
group_by(seasons, temp_category) |>
summarise(count=n(), .groups = "drop")
print(seasonal_temps)
## # A tibble: 15 × 3
## seasons temp_category count
## <chr> <chr> <int>
## 1 Autumn 1. Freezing 21
## 2 Autumn 2. Cold 654
## 3 Autumn 3. Cool 977
## 4 Autumn 4. Warm 530
## 5 Autumn 5. Hot 2
## 6 Spring 1. Freezing 22
## 7 Spring 2. Cold 719
## 8 Spring 3. Cool 1111
## 9 Spring 4. Warm 356
## 10 Summer 3. Cool 168
## 11 Summer 4. Warm 1518
## 12 Summer 5. Hot 522
## 13 Winter 1. Freezing 1390
## 14 Winter 2. Cold 769
## 15 Winter 3. Cool 1
ggplot(seasonal_temps, aes(x = seasons, y = temp_category, fill = count)) +
geom_tile() +
labs(title = "Heatmap of Temperature Categories by Season",
x = "Season",
y = "Temperature Category") +
scale_fill_gradient(low = "lightblue", high = "orange") +
theme_minimal()
To summarize this information: Spring and Autumn are mostly cool, Summer is mostly warm, and Winter is mostly freezing. The missing pairs (for example, no Hot hours in Spring and no Freezing or Cold hours in Summer) reflect Seoul’s climate.
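Raw counts are easier to compare across seasons as shares. The sketch below re-enters the counts printed above into a data frame (named `seasonal_counts` here so it does not overwrite `seasonal_temps`) and uses base R so it runs without extra packages. It computes each category's share within its season and picks the most common category per season.

```r
# Counts copied from the printed seasonal_temps table above.
seasonal_counts <- data.frame(
  seasons = c(rep("Autumn", 5), rep("Spring", 4),
              rep("Summer", 3), rep("Winter", 3)),
  temp_category = c(
    "1. Freezing", "2. Cold", "3. Cool", "4. Warm", "5. Hot",
    "1. Freezing", "2. Cold", "3. Cool", "4. Warm",
    "3. Cool", "4. Warm", "5. Hot",
    "1. Freezing", "2. Cold", "3. Cool"
  ),
  count = c(21, 654, 977, 530, 2,
            22, 719, 1111, 356,
            168, 1518, 522,
            1390, 769, 1),
  stringsAsFactors = FALSE
)

# Share of each temperature category within its season.
season_totals <- ave(seasonal_counts$count, seasonal_counts$seasons, FUN = sum)
seasonal_counts$share <- seasonal_counts$count / season_totals

# Most common temperature category in each season.
dominant <- do.call(rbind, lapply(
  split(seasonal_counts, seasonal_counts$seasons),
  function(d) d[which.max(d$count), ]
))
print(dominant[, c("seasons", "temp_category", "share")])
```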