Group By and Probabilities

Groupby 1

Let’s examine the average Solar Radiation by month.

solar_month = SeoulBikeData |>
  group_by(month) |>
  summarise(mean(solar_radiation), .groups = "drop")
print(solar_month)

## # A tibble: 12 × 2
##    month `mean(solar_radiation)`
##    <dbl>                   <dbl>
##  1     1                   0.227
##  2     2                   0.482
##  3     3                   0.600
##  4     4                   0.713
##  5     5                   0.754
##  6     6                   0.837
##  7     7                   0.754
##  8     8                   0.695
##  9     9                   0.654
## 10    10                   0.542
## 11    11                   0.369
## 12    12                   0.204

These results make sense: the averages are higher in summer than in winter because of the tilt of the Earth. However, they don’t tell the full story. Days are longer in summer, and there is no solar radiation at night, so a monthly average over all hours mixes the intensity of sunlight with the number of daylight hours. To separate the two, we group by month and hour and visualize the summary with a heatmap.

#Group solar radiation by month and hour
solar_month_hour <- SeoulBikeData |>
  group_by(month_name, hour) |>
  summarise(mean_solar_radiation = mean(solar_radiation))

## `summarise()` has grouped output by 'month_name'. You can override using the
## `.groups` argument.

#Heatmap of mean solar radiation by hour and month
ggplot(solar_month_hour, aes(x = month_name, y = hour, fill = mean_solar_radiation)) + 
  geom_tile() +
  scale_fill_gradient(low = "black", high = "orange") +
  labs(title = "Heatmap of Solar Radiation",
       x = "Month",
       y = "Hour")

Solar radiation is both more intense and present for more hours of the day around June, then tapers off toward December and January.
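
The way night-time zeros pull down an all-hours average can also be checked numerically. The following is a minimal sketch on made-up data: the `toy` tibble and its values are hypothetical, and treating "radiation greater than zero" as a stand-in for daylight is an assumption, not something the dataset records.

```r
library(dplyr)

# Hypothetical hourly readings for two months, for illustration only
toy <- tibble(
  month = rep(c(6, 12), each = 4),
  solar_radiation = c(0, 1, 3, 2, 0, 0, 2, 0)
)

monthly <- toy |>
  group_by(month) |>
  summarise(
    # Mean over every hour, including night-time zeros
    mean_all_hours = mean(solar_radiation),
    # Mean over hours with any radiation (assumed proxy for daylight)
    mean_daylight = mean(solar_radiation[solar_radiation > 0]),
    # Fraction of hours with any radiation
    daylight_share = mean(solar_radiation > 0),
    .groups = "drop"
  )

print(monthly)
```

In this toy example both months have the same average intensity during lit hours, yet their all-hours means differ because one month has more lit hours. That is the distinction the month-by-hour heatmap makes visible.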

This information demonstrates the relationship well, but if we were to take a random date from the dataset, which range of daily solar radiation would have the smallest probability of occurring?

#Histogram of daily solar radiation values
solar_date = SeoulBikeData |>
  group_by(date) |>
  summarise(solar_radiation_sum = sum(solar_radiation), .groups = "drop")
print(solar_date)

## # A tibble: 365 × 2
##    date       solar_radiation_sum
##    <date>                   <dbl>
##  1 2017-12-01                5.97
##  2 2017-12-02                6.33
##  3 2017-12-03                3.01
##  4 2017-12-04                6.79
##  5 2017-12-05                0.86
##  6 2017-12-06                6.14
##  7 2017-12-07                5.85
##  8 2017-12-08                6.89
##  9 2017-12-09                6.32
## 10 2017-12-10                2.08
## # ℹ 355 more rows

ggplot(solar_date, aes(x = solar_radiation_sum)) +
  # after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(count / sum(count) * 100)), binwidth = 3, fill = "orange", color = "black") +
  labs(title = "Percentage Histogram of Daily Solar Radiation",
       x = "Solar Radiation (MJ/m^2)",
       y = "Percentage (%)")

Based on the histogram, a daily radiation total over 30 MJ/m^2 or under 3 MJ/m^2 is very rare.
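
The bar heights of a percentage histogram can also be tabulated directly, which makes the rarest range easy to read off. This is a minimal sketch on made-up data: the `daily_totals` values are hypothetical, and the bin edges start at 0 here, which may not match the edges ggplot2 picks for `binwidth = 3`.

```r
library(dplyr)

# Hypothetical daily radiation totals, for illustration only
daily_totals <- tibble(
  solar_radiation_sum = c(0.9, 2.1, 3.0, 5.9, 6.3, 6.8, 12.4, 18.7, 25.2, 31.0)
)

# Assign each day to a width-3 interval, then compute each interval's share of days
bin_probs <- daily_totals |>
  mutate(bin = cut(solar_radiation_sum, breaks = seq(0, 33, by = 3), right = FALSE)) |>
  count(bin, name = "days") |>
  mutate(probability = days / sum(days)) |>
  arrange(probability)

print(bin_probs)
```

Sorting by `probability` puts the least likely ranges first. Intervals with no days at all are dropped by `count()`, so an empty range would not appear in this table.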

These low-radiation and high-radiation days likely occur close to the yearly solstices and may reflect unusual solar activity. I hypothesize that the length of time between sunrise and sunset would have a high R-squared value as a predictor of daily solar radiation.
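
One way to test a hypothesis like this is to regress the daily total on a day-length measure and read off the R-squared. The sketch below uses made-up values. `daylight_hours` is a hypothetical column that does not appear in the analysis above; counting the hours with non-zero radiation per date might serve as a rough proxy for it.

```r
# Hypothetical per-day values, for illustration only
daily <- data.frame(
  daylight_hours = c(9, 10, 11, 12, 13, 14),
  solar_radiation_sum = c(5.1, 7.9, 9.2, 13.5, 15.8, 19.0)
)

# Simple linear regression of daily radiation on day length
fit <- lm(solar_radiation_sum ~ daylight_hours, data = daily)

# Share of variance in daily radiation explained by day length
r_squared <- summary(fit)$r.squared
print(r_squared)
```

With real data, cloud cover would likely lower the R-squared well below what this tidy toy example produces.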

Groupby 2

What is the effect of holidays on ridership? I hypothesize that holidays would see lower ridership and a different ridership distribution throughout the day.

#First, let’s see what percent of the dataset is a holiday.
SeoulBikeData |>
  group_by(holiday) |>
  summarise(count = n()/nrow(SeoulBikeData), .groups = "drop")

## # A tibble: 2 × 2
##   holiday     count
##   <chr>       <dbl>
## 1 Holiday    0.0493
## 2 No Holiday 0.951

#Holidays make up about 5% of data in the data set. 

SeoulBikeData |>
  group_by(holiday) |>
  summarise(mean(rented_bikes))

## # A tibble: 2 × 2
##   holiday    `mean(rented_bikes)`
##   <chr>                     <dbl>
## 1 Holiday                    500.
## 2 No Holiday                 715.

Average hourly ridership is roughly 30% lower on holidays: (715 - 500) / 715 ≈ 0.30.
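
The second half of the hypothesis, that rides are spread differently across the day on holidays, could be checked by grouping on both `holiday` and `hour`. This is a minimal sketch on made-up data. The `toy` tibble and its values are hypothetical, and only the column names mirror the dataset.

```r
library(dplyr)

# Hypothetical hourly records, for illustration only
toy <- tibble(
  holiday = rep(c("Holiday", "No Holiday"), each = 4),
  hour = rep(c(8, 12, 18, 22), times = 2),
  rented_bikes = c(100, 300, 400, 200, 600, 400, 800, 200)
)

# Average rentals per hour within each day type, then each hour's
# share of that day type's total, so the two profiles are comparable
hourly_share <- toy |>
  group_by(holiday, hour) |>
  summarise(mean_rented = mean(rented_bikes), .groups = "drop_last") |>
  mutate(share = mean_rented / sum(mean_rented)) |>
  ungroup()

print(hourly_share)
```

Comparing shares rather than raw means separates the shape of the daily profile from the overall drop in ridership on holidays.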

Groupby 3

Let’s examine temperature and its effect on ridership.

SeoulBikeData <- SeoulBikeData %>%
  mutate(temp_category = case_when(
    temp_c < 0  ~ "1. Freezing",
    temp_c >= 0 & temp_c < 10  ~ "2. Cold",
    temp_c >= 10 & temp_c < 20 ~ "3. Cool",
    temp_c >= 20 & temp_c < 30 ~ "4. Warm",
    temp_c >= 30 ~ "5. Hot"
  ))

#Let's see what the temperatures typically look like in Seoul.
SeoulBikeData |>
  group_by(temp_category) |>
  summarise(count=n(), .groups = "drop")

## # A tibble: 5 × 2
##   temp_category count
##   <chr>         <int>
## 1 1. Freezing    1433
## 2 2. Cold        2142
## 3 3. Cool        2257
## 4 4. Warm        2404
## 5 5. Hot          524

If an hour is chosen at random, the temperature is least likely to be 30 degrees Celsius or above (524 of 8,760 observations, about 6%). Let’s hold time of day constant and see how temperature affects ridership.

ggplot(SeoulBikeData, aes(x = hour, y = rented_bikes, color = temp_category)) +
  stat_summary(fun = mean, geom = "line") +
  labs(title = "Lineplot of average Bike Rentals per Hour by Temperature Category",
       x = "Hour", 
       y = "Average Bikes Rented") +
  theme_minimal()

This line plot shows clear differences in ridership based on temperature. Interestingly, when it is “Hot,” midday ridership is on par with ridership when the temperature is “Cool.” I hypothesize that solar radiation has a stronger effect on ridership when the temperature is Hot than when it is Warm between 9 a.m. and 5 p.m.

Pick two categorical variables, and build a data frame of all their combinations

We will compare temperature with seasons in Seoul.

seasonal_temps <-SeoulBikeData |>
  group_by(seasons, temp_category) |>
  summarise(count=n(), .groups = "drop")
print(seasonal_temps)

## # A tibble: 15 × 3
##    seasons temp_category count
##    <chr>   <chr>         <int>
##  1 Autumn  1. Freezing      21
##  2 Autumn  2. Cold         654
##  3 Autumn  3. Cool         977
##  4 Autumn  4. Warm         530
##  5 Autumn  5. Hot            2
##  6 Spring  1. Freezing      22
##  7 Spring  2. Cold         719
##  8 Spring  3. Cool        1111
##  9 Spring  4. Warm         356
## 10 Summer  3. Cool         168
## 11 Summer  4. Warm        1518
## 12 Summer  5. Hot          522
## 13 Winter  1. Freezing    1390
## 14 Winter  2. Cold         769
## 15 Winter  3. Cool           1

Autumn is the only season with all five temperature categories.
Spring experienced all but Hot.
Summer did not experience Cold or Freezing.
Winter did not experience Warm or Hot.

ggplot(seasonal_temps, aes(x = seasons, y = temp_category, fill = count)) +
  geom_tile() +
  labs(title = "Heatmap of Temperature Categories by Season",
       x = "Season",
       y = "Temperature Category") +
  scale_fill_gradient(low = "lightblue", high = "orange") +
  theme_minimal()

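To summarize this information: Spring and Autumn are mostly Cool, Summer is mostly Warm, and Winter is mostly Freezing. The missing pairs reflect Seoul’s climate: those temperature ranges did not occur in those seasons during the period covered by the data.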
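
A grouped count only returns the combinations that actually occur, so the missing pairs have no rows at all. One way to list every combination explicitly is `tidyr::complete()`. This is a minimal sketch on made-up data. The `observed` tibble is hypothetical, and only the column names mirror `seasonal_temps`.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts with one combination missing, for illustration only
observed <- tibble(
  seasons = c("Summer", "Summer", "Winter"),
  temp_category = c("4. Warm", "5. Hot", "4. Warm"),
  count = c(10L, 5L, 1L)
)

# complete() adds a row for every seasons x temp_category pair and
# fills the count of unobserved pairs with 0
all_pairs <- observed |>
  complete(seasons, temp_category, fill = list(count = 0L))

# Combinations that never occurred
missing_pairs <- all_pairs |>
  filter(count == 0)

print(missing_pairs)
```

Applied to `seasonal_temps`, the same call should give 20 rows (4 seasons by 5 temperature categories), 5 of them with a count of 0.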

Data Dive 3: Seoul Bike Sharing (Attempt 2)

Jakob Morales

2025-02-10