Loading and Preparing the dataset

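# Loading the packages used below (dplyr for %>%, select, group_by, summarise; ggplot2 for the plots)
library(dplyr)
library(ggplot2)

# Loading the dataset
bike_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")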
# Exploring the dataset 
head(bike_data)
##   instant     dteday season yr mnth hr holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1  0       0       6          0          1
## 2       2 2011-01-01      1  0    1  1       0       6          0          1
## 3       3 2011-01-01      1  0    1  2       0       6          0          1
## 4       4 2011-01-01      1  0    1  3       0       6          0          1
## 5       5 2011-01-01      1  0    1  4       0       6          0          1
## 6       6 2011-01-01      1  0    1  5       0       6          0          2
##   temp  atemp  hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81    0.0000      3         13  16
## 2 0.22 0.2727 0.80    0.0000      8         32  40
## 3 0.22 0.2727 0.80    0.0000      5         27  32
## 4 0.24 0.2879 0.75    0.0000      3         10  13
## 5 0.24 0.2879 0.75    0.0000      0          1   1
## 6 0.24 0.2576 0.75    0.0896      0          1   1
# Exploring the structure of the dataset and preparing the dataset
str(bike_data)
## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
# Select relevant columns: categorical (season) and continuous (temp, hum, windspeed, cnt)
bike_data <- bike_data %>%
  select(season, temp, hum, windspeed, cnt)

Insights from exploring the data structure

After selection, the dataset contains one categorical variable, season (stored as an integer code from 1 to 4), and four numeric variables (temp, hum, windspeed, cnt). Each row describes one hour, so cnt is an hourly rental count.
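Because str() reports season as an integer, R will treat it as a number unless it is converted. A minimal sketch of that conversion, assuming a small made-up data frame `toy_bikes` (hypothetical, not taken from hour.csv) in place of bike_data. No season names are attached to the codes, since the mapping is not stated here.

```r
# Toy stand-in for bike_data (hypothetical values, not taken from hour.csv)
toy_bikes <- data.frame(
  season = c(1L, 1L, 2L, 3L, 4L, 4L),
  cnt    = c(16L, 40L, 120L, 250L, 90L, 60L)
)

# Convert the integer codes to a factor so R treats season as categorical
toy_bikes$season <- factor(toy_bikes$season)

str(toy_bikes$season)     # now a factor with one level per season code
table(toy_bikes$season)   # number of rows per season code
```

With a factor, group_by(season) behaves the same as before, but plots and models no longer interpret the codes as a numeric scale.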

Random Sampling of the data

# For reproducibility
set.seed(123)
n <- nrow(bike_data)

# Creating five random subsamples, each half the size of the data and drawn with replacement
subsample_1 <- bike_data[sample(1:n, size = 0.5 * n, replace = TRUE), ]
subsample_2 <- bike_data[sample(1:n, size = 0.5 * n, replace = TRUE), ]
subsample_3 <- bike_data[sample(1:n, size = 0.5 * n, replace = TRUE), ]
subsample_4 <- bike_data[sample(1:n, size = 0.5 * n, replace = TRUE), ]
subsample_5 <- bike_data[sample(1:n, size = 0.5 * n, replace = TRUE), ]
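The five near-identical lines above could also be written as a loop. The following is a sketch only, assuming a made-up data frame `toy_bikes` and the hypothetical names `n_toy`, `m_toy`, and `subsamples`; it draws with replacement, as the code above does, and makes the subsample size an explicit integer.

```r
# Toy stand-in for bike_data (hypothetical values)
set.seed(123)
toy_bikes <- data.frame(season = sample(1:4, 200, replace = TRUE),
                        cnt    = rpois(200, lambda = 150))
n_toy <- nrow(toy_bikes)
m_toy <- floor(0.5 * n_toy)   # explicit integer subsample size

# Draw five subsamples with replacement and keep them together in one list
subsamples <- lapply(1:5, function(i) {
  toy_bikes[sample.int(n_toy, size = m_toy, replace = TRUE), ]
})
names(subsamples) <- paste("Sample", 1:5)

sapply(subsamples, nrow)   # every subsample has m_toy rows
```

Keeping the subsamples in a list means a summary function can be applied to all of them with a single lapply() call.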

Scrutinizing the subsamples

# Function to group by 'season' and calculate means for each subsample
group_means <- function(df) {
  df %>%
    group_by(season) %>%
    summarise(mean_temp = mean(temp),
              mean_hum = mean(hum),
              mean_windspeed = mean(windspeed),
              mean_cnt = mean(cnt))
}

# Applying the function to all of the subsamples
mean_subsample_1 <- group_means(subsample_1)
mean_subsample_2 <- group_means(subsample_2)
mean_subsample_3 <- group_means(subsample_3)
mean_subsample_4 <- group_means(subsample_4)
mean_subsample_5 <- group_means(subsample_5)

# Displaying the results of applying the function on all of the subsamples
mean_subsample_1
## # A tibble: 4 × 5
##   season mean_temp mean_hum mean_windspeed mean_cnt
##    <int>     <dbl>    <dbl>          <dbl>    <dbl>
## 1      1     0.303    0.581          0.213     112.
## 2      2     0.547    0.626          0.202     210.
## 3      3     0.707    0.631          0.172     239.
## 4      4     0.426    0.664          0.172     203.
mean_subsample_2
## # A tibble: 4 × 5
##   season mean_temp mean_hum mean_windspeed mean_cnt
##    <int>     <dbl>    <dbl>          <dbl>    <dbl>
## 1      1     0.297    0.588          0.219     109.
## 2      2     0.540    0.627          0.203     213.
## 3      3     0.709    0.631          0.170     245.
## 4      4     0.423    0.665          0.169     195.
mean_subsample_3
## # A tibble: 4 × 5
##   season mean_temp mean_hum mean_windspeed mean_cnt
##    <int>     <dbl>    <dbl>          <dbl>    <dbl>
## 1      1     0.295    0.576          0.216     110.
## 2      2     0.550    0.617          0.202     209.
## 3      3     0.705    0.635          0.176     231.
## 4      4     0.423    0.663          0.168     200.
mean_subsample_4
## # A tibble: 4 × 5
##   season mean_temp mean_hum mean_windspeed mean_cnt
##    <int>     <dbl>    <dbl>          <dbl>    <dbl>
## 1      1     0.304    0.585          0.212     110.
## 2      2     0.544    0.630          0.202     210.
## 3      3     0.704    0.637          0.168     235.
## 4      4     0.425    0.666          0.172     205.
mean_subsample_5
## # A tibble: 4 × 5
##   season mean_temp mean_hum mean_windspeed mean_cnt
##    <int>     <dbl>    <dbl>          <dbl>    <dbl>
## 1      1     0.299    0.575          0.214     111.
## 2      2     0.543    0.623          0.207     203.
## 3      3     0.707    0.631          0.174     238.
## 4      4     0.422    0.664          0.172     200.

Insights

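  1. Temperature and windspeed means are consistent across the subsamples, with little fluctuation (for example, season 3 mean_temp stays between 0.704 and 0.709).
  2. Humidity (hum) and bike rental counts (cnt) vary somewhat more between subsamples: season 1 mean_hum ranges from 0.575 to 0.588, and season 3 mean_cnt ranges from 231 to 245. This could reflect ordinary sampling variation or potential anomalies.
  3. The ordering of the seasons is the same in every subsample: season 3 has the highest mean temperature and rental count, and season 1 the lowest.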
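Since the four variables are on different scales, one way to compare how much their means move between subsamples is the coefficient of variation (standard deviation divided by mean). The sketch below uses the season 3 rows copied from the five printed tables, so the inputs are already rounded and the result is only approximate. The names `season3` and `cv` are introduced here for illustration.

```r
# Season 3 means copied from the five printed tibbles above (already rounded)
season3 <- data.frame(
  mean_temp      = c(0.707, 0.709, 0.705, 0.704, 0.707),
  mean_hum       = c(0.631, 0.631, 0.635, 0.637, 0.631),
  mean_windspeed = c(0.172, 0.170, 0.176, 0.168, 0.174),
  mean_cnt       = c(239, 245, 231, 235, 238)
)

# Coefficient of variation (sd / mean) puts the variables on a common scale
cv <- sapply(season3, function(x) sd(x) / mean(x))
round(100 * cv, 2)   # relative spread across subsamples, in percent
```

On these rounded values, the rental count means show the largest relative spread and the temperature means the smallest, and all four stay within a few percent.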

Visualizing Subsamples using Boxplots

Boxplot for Temperature

# Combine subsamples into one dataframe for visualization
all_samples <- rbind(data.frame(subsample_1, sample = "Sample 1"),
                     data.frame(subsample_2, sample = "Sample 2"),
                     data.frame(subsample_3, sample = "Sample 3"),
                     data.frame(subsample_4, sample = "Sample 4"),
                     data.frame(subsample_5, sample = "Sample 5"))

# Plot temperature distribution
ggplot(all_samples, aes(x = sample, y = temp, fill = sample)) +
  geom_boxplot() +
  labs(title = "Temperature Distribution Across Subsamples",
       x = "Subsample",
       y = "Temperature") +
  theme_minimal()

Some Insights

  1. The median temperatures (represented by the horizontal lines inside the boxes) appear very similar across all 5 subsamples, clustering around 0.5. This suggests that the central tendencies of the temperature distributions are quite consistent.
  2. The interquartile ranges (boxes) are roughly similar in size for all subsamples, indicating comparable variability in the middle 50% of the data. However, Sample 1 seems to have a slightly larger spread than the others.
  3. Most boxes appear fairly symmetrical around the median, suggesting roughly symmetric distributions (symmetry alone does not establish normality). Sample 1 shows some slight asymmetry with a longer lower whisker.
  4. No outliers are visible in any of the subsamples, as there are no individual points plotted beyond the whiskers.
  5. The overall range of temperatures (from bottom to top whisker) is consistent across samples, spanning from about 0 to 1 in all cases.
  6. The high degree of similarity across all subsamples is what we would expect when every subsample is drawn at random from the same dataset.
  7. While broadly similar, there are subtle differences between samples. For instance, Sample 1 appears to have a slightly lower median than the others.
  8. There’s substantial overlap in the distributions, indicating that differences between subsamples are likely small and may not be statistically significant.
  9. The absence of outliers and the consistency across samples suggest no obvious anomalies or measurement issues in the temperature values.
  10. The temperatures appear to be normalized or scaled, ranging from 0 to 1, which is useful for comparing relative differences but doesn’t provide information about absolute temperature values.
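Reading outliers off a plot can be double-checked numerically. The sketch below counts points beyond the whiskers per subsample with base R's boxplot.stats(), which applies a 1.5 * IQR rule like the default in geom_boxplot(). It computes the box from hinges rather than quantiles, so counts can differ slightly at the margin. The data frame `toy_samples` is made up for illustration and only mimics the shape of all_samples.

```r
# Toy stand-in for all_samples (hypothetical values)
set.seed(123)
toy_samples <- data.frame(
  sample = rep(c("Sample 1", "Sample 2"), each = 100),
  temp   = c(runif(100),                     # spread over 0..1, no outliers expected
             runif(99, 0.4, 0.6), 0.99)      # tight cluster plus one extreme value
)

# Count the points that fall beyond the whiskers in each subsample
outlier_counts <- tapply(toy_samples$temp, toy_samples$sample,
                         function(x) length(boxplot.stats(x)$out))
outlier_counts
```

Applied to the real all_samples, a count of zero for every subsample would back up the statement that no temperature outliers are present.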

Boxplot for Humidity

# Plot humidity distribution
ggplot(all_samples, aes(x = sample, y = hum, fill = sample)) +
  geom_boxplot() +
  labs(title = "Humidity Distribution Across Subsamples",
       x = "Subsample",
       y = "Humidity") +
  theme_minimal()

Some Insights

  1. The median humidity levels (horizontal lines in the boxes) are very similar across all 5 subsamples, hovering around 0.6-0.65. This indicates consistent central tendencies in the humidity distributions.
  2. The interquartile ranges (boxes) are quite similar in size for all subsamples, suggesting comparable variability in the middle 50% of the data across samples.
  3. Most boxes appear roughly symmetrical around the median. There’s a slight tendency for boxes to be a bit longer below the median, suggesting a mild negative skew.
  4. Each subsample shows at least one lower outlier (dots below the whiskers), all close to 0. This indicates some rare instances of very low humidity in all samples.
  5. The overall range of humidity (excluding outliers) is consistent across samples, spanning from about 0.25 to 1.0 in most cases.
  6. There’s a high degree of similarity in distribution characteristics across all subsamples, as expected when each one is a random draw from the same dataset.
  7. While broadly similar, there are subtle differences. For instance, Sample 4 appears to have a slightly higher median and a shorter lower whisker, with a noticeably higher lower bound than the others.
  8. There’s substantial overlap in the distributions, indicating that differences between subsamples are likely small and may not be statistically significant.
  9. The consistency across samples suggests that each subsample is representative of the full dataset.
  10. The humidity values appear to be normalized or represent relative humidity, ranging from 0 to 1 (or 0% to 100%).
  11. All samples reach a maximum humidity close to 1 (100%), suggesting that conditions of very high humidity are observed in all subsamples.
  12. The low-humidity outliers appear in every sample because all subsamples come from the same data. Values at or near 0 could reflect occasional very dry conditions or recording errors, so they are worth checking in the source data.

Boxplot for Windspeed

# Plot windspeed distribution
ggplot(all_samples, aes(x = sample, y = windspeed, fill = sample)) +
  geom_boxplot() +
  labs(title = "Windspeed Distribution Across Subsamples",
       x = "Subsample",
       y = "Windspeed") +
  theme_minimal()

Insights

  1. The median windspeeds (horizontal lines in the boxes) are similar across all 5 subsamples, generally falling between 0.15 and 0.25. Sample 3 appears to have a slightly higher median than the others.
  2. The interquartile ranges (boxes) are relatively consistent across samples, with Sample 3 and Sample 5 showing slightly larger spreads than the others.
  3. Most boxes appear roughly symmetrical around the median, so the middle 50% of the data is fairly balanced, although the outliers described in the following points show that the full distribution is not symmetric.
  4. There are numerous outliers in all samples, particularly on the upper end of the distribution. These are represented by the dots above the upper whiskers, extending up to about 0.8-0.85.
  5. The overall range of windspeeds is consistent across samples, from near 0 to about 0.85, with the majority of data points falling below 0.4.
  6. While there are similarities across samples, there’s more variability here than in the previous plots for temperature and humidity.
  7. The distribution appears positively skewed in all samples, with a long tail of high windspeed outliers.
  8. There’s substantial overlap in the distributions, but also noticeable differences, particularly in the median and spread of Sample 3 compared to the others.
  9. The same pattern of outliers appears in every sample, which is expected because all subsamples come from the same data. It indicates that high wind speeds are a feature of the dataset rather than an artifact of one particular draw.
  10. The values appear to be normalized or represent a specific scale, ranging from 0 to slightly above 0.8.
  11. All samples have lower whiskers extending close to 0, indicating periods of very low wind speeds are common across all subsamples.
  12. The numerous upper outliers suggest frequent occurrences of wind speeds significantly higher than the typical range, possibly representing gusts or storm events.
  13. Sample 3 stands out with a slightly higher median and larger spread. Because each subsample is a random draw from the full dataset rather than a separate time period, this most likely reflects sampling variation.
  14. The outliers seem to cluster at certain levels, particularly visible in Samples 1 and 2, which could indicate that windspeed is recorded at discrete measurement intervals.

These insights suggest that while there’s a consistent overall pattern in wind speed distributions, there’s more variability between samples compared to the temperature and humidity data. The prevalence of high outliers is a key feature, indicating that occasional high wind events are a significant characteristic of this dataset. The differences between samples, particularly Sample 3, might warrant further investigation, for example by drawing more subsamples to see whether they persist.
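The positive skew described above can also be checked without a plot. A rough sketch, assuming a made-up right-skewed vector `toy_wind` in place of the windspeed column: under positive skew the mean usually exceeds the median, and the upper half of the box is longer than the lower half.

```r
# Toy right-skewed variable standing in for windspeed (hypothetical values)
set.seed(123)
toy_wind <- rgamma(5000, shape = 2, rate = 10)

q <- quantile(toy_wind, probs = c(0.25, 0.5, 0.75))

# Gap between mean and median, and the two halves of the box
mean_minus_median <- mean(toy_wind) - q[["50%"]]
upper_half <- q[["75%"]] - q[["50%"]]
lower_half <- q[["50%"]] - q[["25%"]]

c(mean_minus_median = mean_minus_median,
  box_ratio = upper_half / lower_half)   # ratio above 1 points to positive skew
```

Running the same calculation on each real subsample would show whether the skew is of similar size in all five.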

Boxplot for Bike Rentals

# Plot bike rentals distribution
ggplot(all_samples, aes(x = sample, y = cnt, fill = sample)) +
  geom_boxplot() +
  labs(title = "Bike Rental Counts Across Subsamples",
       x = "Subsample",
       y = "Bike Rentals") +
  theme_minimal()

Some Insights

  1. The median bike rental counts (horizontal lines in the boxes) are fairly consistent across all 5 subsamples, generally falling between 150 and 200 rentals.
  2. The interquartile ranges (boxes) are relatively large and similar across samples, indicating considerable variability in hourly rental counts.
  3. The boxes are not symmetrical around the median. They appear to be skewed upwards, with longer upper portions of the boxes, suggesting a positive skew in the distribution.
  4. There are numerous upper outliers in all samples, represented by dots above the upper whiskers. These extend up to about 1000 rentals per hour.
  5. The overall range of rental counts is large and consistent across samples, from near 0 to about 1000 rentals per hour.
  6. The overall pattern is remarkably consistent across all five subsamples, suggesting stable underlying factors influencing bike rentals.
  7. The distribution is positively skewed in all samples, with a long tail of high rental count outliers.
  8. All samples have lower whiskers extending close to 0, indicating hours with very few or no bike rentals across all subsamples.
  9. The upper quartiles (top of the boxes) show some variation across samples, with Sample 1 having a slightly lower upper quartile than the others.
  10. The density of outliers is high, particularly in the range of 600-800 rentals, suggesting frequent occurrences of high-demand hours.
  11. The outlier points appear in horizontal lines, indicating that rental counts are discrete values (whole numbers) rather than continuous.
  12. The consistency in pattern across subsamples is expected, because each subsample is a random draw from the full dataset rather than a separate time period.
  13. The large number of upper outliers reflects the long right tail. A boxplot cannot show whether the distribution is bimodal, so a histogram or density plot would be needed to check for a second mode.
  14. The maximum number of rentals (around 1000) might represent the operational capacity of the bike rental system, though nothing in the data confirms this.

These insights suggest a complex distribution of bike rental counts with high variability. The consistency across samples indicates stable underlying factors influencing rentals, but the wide range and numerous outliers point to significant hour-to-hour fluctuations. The positive skew and high outliers suggest frequent occurrences of high-demand hours, which could be linked to factors like time of day, weather, events, or seasonal trends.

Monte Carlo Simulation

I am performing a Monte Carlo simulation: randomly sampling 50% of the data (with replacement) 1,000 times and calculating the mean bike rental count (cnt) for each subsample.

# for reproducibility
set.seed(123)

# Perform Monte Carlo Simulation
mc_results <- replicate(1000, {
  sample_data <- bike_data[sample(1:n, size = 0.5 * n, replace = TRUE), ]
  mean(sample_data$cnt)
})

# Mean and standard deviation of the results
mc_mean <- mean(mc_results)
mc_sd <- sd(mc_results)

mc_mean
## [1] 189.4469
mc_sd
## [1] 1.959181
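
Insights

The mean of the 1,000 subsample means is about 189.4, with a standard deviation of 1.96. That standard deviation measures how much the mean moves from one random subsample to the next, and it is small relative to the mean (about 1%), showing low variability across random subsamples.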
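The size of that standard deviation can be compared with what sampling theory predicts. When m rows are drawn with replacement, the standard deviation of the subsample mean should be close to the standard deviation of the data divided by sqrt(m). The sketch below illustrates this on a made-up vector `toy_cnt`; the names `m` and `toy_means` are also introduced here, and the real cnt column is not used.

```r
# Toy population standing in for the cnt column (hypothetical values)
set.seed(123)
toy_cnt <- round(rexp(2000, rate = 1 / 190))
m <- floor(0.5 * length(toy_cnt))   # subsample size

# Repeat the resampling step and record the mean of each subsample
toy_means <- replicate(1000, mean(sample(toy_cnt, size = m, replace = TRUE)))

# With replacement, the spread of the means should be close to sd / sqrt(m)
c(simulated = sd(toy_means), theoretical = sd(toy_cnt) / sqrt(m))
```

If the same relationship holds for the real data, an mc_sd of about 1.96 with subsamples of roughly 8,689 rows would correspond to sd(bike_data$cnt) of roughly 183. That figure is a back-of-envelope inference, not a value reported above.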

Visualization of the Monte Carlo Simulation Results

# Plot the distribution of the mean bike rental counts from the Monte Carlo Simulation
mc_df <- data.frame(mean_cnt = mc_results)
ggplot(mc_df, aes(x = mean_cnt)) +
  geom_histogram(binwidth = 1, color = "black", fill = "blue") +
  geom_vline(aes(xintercept = mc_mean), color = "red", linetype = "dashed") +
  labs(title = "Distribution of Mean Bike Rental Counts from Monte Carlo Simulations",
       x = "Mean Bike Rentals",
       y = "Frequency") +
  theme_minimal()

Insights

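The histogram is approximately normal (bell-shaped) and centered near 189, which is what the central limit theorem leads us to expect for means of large samples. This confirms that the overall mean is stable, even with random subsampling.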
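The visual impression of normality can be supplemented with a numeric summary. A minimal sketch, assuming a made-up vector `toy_mc` in place of mc_results: it reports the middle 95% of the simulated means and runs a Shapiro-Wilk test using base R's shapiro.test().

```r
# Toy stand-in for mc_results (hypothetical values, not the real simulation output)
set.seed(123)
toy_mc <- rnorm(1000, mean = 190, sd = 2)

# Middle 95% of the simulated means
ci <- quantile(toy_mc, probs = c(0.025, 0.975))
ci

# Shapiro-Wilk test: a large p-value gives no evidence against normality
p_value <- shapiro.test(toy_mc)$p.value
p_value
```

A normal Q-Q plot (qqnorm() followed by qqline()) would give a graphical version of the same check.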

Some Conclusions and Insights based on the Random Sampling and Monte Carlo Simulation

  1. Consistency Across Subsamples: Temperature and windspeed show stable trends across all subsamples, whereas humidity and bike rental counts show more variability.
  2. Monte Carlo Simulation Results: The Monte Carlo simulation confirms that the mean bike rental count is reliable, with only minor fluctuations across random subsamples.
  3. Anomalies: Individual subsamples can contain outliers or anomalies that don’t appear consistently in other samples.
  4. Future Implications: Relying on one subsample can lead to misleading conclusions, but Monte Carlo simulations provide confidence that the dataset’s overall trend is stable.

Some further questions that I would like to investigate

  1. Could additional variables from the original dataset, such as weather conditions (weathersit), further explain the variability in humidity and bike rentals?
  2. How would removing outliers or extreme anomalies impact the dataset?
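As a starting point for the second question, the sketch below compares the mean before and after dropping values outside the 1.5 * IQR fences. It runs on a made-up right-skewed vector `toy_cnt`, not the real cnt column, and the fence rule is one possible definition of an outlier, chosen here only because it matches the boxplots.

```r
# Toy right-skewed counts standing in for cnt (hypothetical values)
set.seed(123)
toy_cnt <- round(rexp(1000, rate = 1 / 190))

# Fences at 1.5 * IQR beyond the quartiles, as a boxplot would draw them
q <- quantile(toy_cnt, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
fence_lo <- q[[1]] - 1.5 * iqr
fence_hi <- q[[2]] + 1.5 * iqr

# Keep only the values inside the fences
kept <- toy_cnt[toy_cnt >= fence_lo & toy_cnt <= fence_hi]

c(mean_all  = mean(toy_cnt),
  mean_kept = mean(kept),
  n_removed = length(toy_cnt) - length(kept))
```

For a right-skewed variable the removed values sit almost entirely in the upper tail, so the mean drops after filtering. Whether those high-demand hours should count as anomalies at all is a separate judgment.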