Week 4 Sampling and Drawing Conclusions

Roshan R Naidu (09/02/2026)

Importing Libraries

# Load the tidyverse, a collection of data science packages that already includes dplyr and ggplot2
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

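# Load dplyr for data manipulation (already attached by tidyverse; kept for explicitness)
library(dplyr)

# Load ggplot2 for data visualisation (already attached by tidyverse; kept for explicitness)
library(ggplot2)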

Loading and Exploring The Dataset

# Load the dataset
bike_data <- read.csv("/Users/roshannaidu/Desktop/IU Sem 2/Stats 1/bike+sharing+dataset/hour.csv")

# View structure and data types of variables
str(bike_data)

## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...

# View first few rows of the dataset
head(bike_data)

# View summary statistics for all variables
summary(bike_data)

##     instant         dteday              season            yr        
##  Min.   :    1   Length:17379       Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   : 8690                      Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379                      Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0

# Check number of rows and columns
dim(bike_data)

## [1] 17379    17

# Display all variable names
names(bike_data)

##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"

# Check for missing values in each column
colSums(is.na(bike_data))

##    instant     dteday     season         yr       mnth         hr    holiday 
##          0          0          0          0          0          0          0 
##    weekday workingday weathersit       temp      atemp        hum  windspeed 
##          0          0          0          0          0          0          0 
##     casual registered        cnt 
##          0          0          0

# Select relevant columns: categorical (season) and continuous (temp, hum, windspeed, cnt)
bike_data <- bike_data %>%
  select(season, temp, hum, windspeed, cnt)

Insights from exploring the data structure

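The dataset has 17,379 hourly observations of 17 variables and no missing values. The analysis keeps one categorical variable (season, stored as an integer code from 1 to 4) and four numeric variables (temp, hum, windspeed, cnt), where cnt is a whole-number count of rentals per hour.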

Random Sampling of the data

# Fix the random seed for reproducibility
set.seed(123)
n <- nrow(bike_data)

# Subsample size: 50% of the rows, rounded down to a whole number
sample_size <- floor(0.5 * n)

# Create five random subsamples, each drawn with replacement
subsample_1 <- bike_data[sample(1:n, size = sample_size, replace = TRUE), ]
subsample_2 <- bike_data[sample(1:n, size = sample_size, replace = TRUE), ]
subsample_3 <- bike_data[sample(1:n, size = sample_size, replace = TRUE), ]
subsample_4 <- bike_data[sample(1:n, size = sample_size, replace = TRUE), ]
subsample_5 <- bike_data[sample(1:n, size = sample_size, replace = TRUE), ]
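The five near-identical sampling lines can also be written as one call that returns a list. This is a minimal sketch, not the code used in this report: `toy_data` is a hypothetical stand-in for `bike_data`, and the names `n_toy` and `subsamples` are introduced only for illustration.

```r
# Stand-in for bike_data (hypothetical values, for illustration only)
set.seed(123)
toy_data <- data.frame(season = rep(1:4, each = 25),
                       cnt    = rpois(100, lambda = 150))

n_toy       <- nrow(toy_data)
sample_size <- floor(0.5 * n_toy)   # explicit whole-number sample size

# Draw five subsamples with replacement and keep them in a named list
subsamples <- lapply(1:5, function(i) {
  toy_data[sample(n_toy, size = sample_size, replace = TRUE), ]
})
names(subsamples) <- paste0("subsample_", 1:5)

# Number of rows in each subsample
sapply(subsamples, nrow)
```

Keeping the subsamples in a list means later steps can use `lapply()` instead of repeating one line per subsample.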

Scrutinizing the subsamples

# Function to group by 'season' and calculate means for each subsample
group_means <- function(df) {
  df %>%
    group_by(season) %>%
    summarise(mean_temp = mean(temp),
              mean_hum = mean(hum),
              mean_windspeed = mean(windspeed),
              mean_cnt = mean(cnt))
}

# Applying the function to all of the subsamples
mean_subsample_1 <- group_means(subsample_1)
mean_subsample_2 <- group_means(subsample_2)
mean_subsample_3 <- group_means(subsample_3)
mean_subsample_4 <- group_means(subsample_4)
mean_subsample_5 <- group_means(subsample_5)

# Displaying the results of applying the function on all of the subsamples
mean_subsample_1

mean_subsample_2

mean_subsample_3

mean_subsample_4

mean_subsample_5

Insights:

Temperature and windspeed means are consistent across the subsamples, with little fluctuation.
Humidity (hum) and bike rental counts (cnt) differ somewhat more between subsamples.
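One way to put numbers on "how much the seasonal means differ between subsamples" is to stack the per-subsample summaries and take the range of each mean across subsamples. The sketch below assumes a hypothetical stand-in data frame (`toy_data`) rather than the real bike data, so the values it prints say nothing about the actual dataset.

```r
library(dplyr)

# Stand-in data (hypothetical), with the same column names as the report
set.seed(123)
toy_data <- data.frame(season = rep(1:4, each = 50),
                       hum    = runif(200),
                       cnt    = rpois(200, lambda = 150))

# Five subsamples drawn with replacement, stacked with an id column
stacked <- bind_rows(
  lapply(1:5, function(i) {
    toy_data[sample(nrow(toy_data), size = 100, replace = TRUE), ]
  }),
  .id = "sample"
)

# Per-season means within each subsample
season_means <- stacked %>%
  group_by(sample, season) %>%
  summarise(mean_hum = mean(hum), mean_cnt = mean(cnt), .groups = "drop")

# Range (max minus min) of each seasonal mean across the five subsamples
spread_by_season <- season_means %>%
  group_by(season) %>%
  summarise(range_hum = max(mean_hum) - min(mean_hum),
            range_cnt = max(mean_cnt) - min(mean_cnt))

spread_by_season
```

Because hum lies between 0 and 1 while cnt is in the hundreds, the raw ranges are not directly comparable; dividing each range by the corresponding overall mean gives a relative spread that is.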

Visualizing Subsamples using Boxplots

Boxplot for Temperature

# Combine subsamples into one dataframe for visualization
all_samples <- rbind(data.frame(subsample_1, sample = "Sample 1"),
                     data.frame(subsample_2, sample = "Sample 2"),
                     data.frame(subsample_3, sample = "Sample 3"),
                     data.frame(subsample_4, sample = "Sample 4"),
                     data.frame(subsample_5, sample = "Sample 5"))

# Plot temperature distribution
ggplot(all_samples, aes(x = sample, y = temp, fill = sample)) +
  geom_boxplot() +
  labs(title = "Temperature Distribution Across Subsamples",
       x = "Subsample",
       y = "Temperature") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "top"
  )

Insights:

The median temperatures (the horizontal lines inside the boxes) are very similar across all 5 subsamples, clustering around 0.5, so the central tendency of temperature is consistent.
The interquartile ranges (boxes) are roughly similar in size for all subsamples, indicating comparable variability in the middle 50% of the data. Sample 1 appears to have a slightly wider spread and a slightly lower median than the others.
Most boxes are fairly symmetrical around the median, which indicates little skew in the middle of the distribution; a boxplot alone cannot confirm normality. Sample 1 shows slight asymmetry with a longer lower whisker.
No outliers are visible in any of the subsamples, as there are no individual points plotted beyond the whiskers.
The overall range of temperatures (from bottom to top whisker) is consistent across samples, spanning from about 0 to 1 in all cases.
The high degree of similarity across subsamples is what we would expect when every subsample is a random draw from the same dataset.
There’s substantial overlap in the distributions, indicating that differences between subsamples are small and attributable to sampling variation.
The absence of outliers suggests there are no obvious anomalies in the temperature values.
The temperatures are normalized, ranging from about 0 to 1, which is useful for comparing relative differences but doesn’t provide information about absolute temperature values.
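Visual comparisons of medians and box sizes can be backed up with the numbers the boxplot is built from. The following is a sketch under stated assumptions: `toy_samples` is a hypothetical stand-in for `all_samples`, with uniformly distributed values used only so the code runs on its own.

```r
# Stand-in for all_samples (hypothetical values)
set.seed(123)
toy_samples <- data.frame(
  sample = rep(paste("Sample", 1:5), each = 200),
  temp   = runif(1000)
)

# Quartiles and interquartile range of temp for each subsample
box_stats <- do.call(rbind, lapply(
  split(toy_samples$temp, toy_samples$sample),
  function(x) {
    q <- quantile(x, c(0.25, 0.5, 0.75))
    data.frame(q1 = q[[1]], median = q[[2]], q3 = q[[3]],
               iqr = q[[3]] - q[[1]])
  }
))

box_stats
```

Applied to the real `all_samples`, a table like this would show directly whether Sample 1 really has a wider interquartile range than the others.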

Boxplot for Humidity

# Plot humidity distribution
ggplot(all_samples, aes(x = sample, y = hum, fill = sample)) +
  geom_boxplot() +
  labs(title = "Humidity Distribution Across Subsamples",
       x = "Subsample",
       y = "Humidity") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "top"
  )

Insights:

The median humidity levels (horizontal lines in the boxes) are very similar across all 5 subsamples, hovering around 0.6-0.65. This indicates consistent central tendencies in the humidity distributions.
The interquartile ranges (boxes) are quite similar in size for all subsamples, suggesting comparable variability in the middle 50% of the data across samples.
Most boxes appear roughly symmetrical around the median, with a slight tendency to be longer below it, suggesting a mild negative skew.
Each subsample shows at least one lower outlier (dots below the whiskers), all close to 0.
The overall range of humidity (excluding outliers) is consistent across samples, spanning from about 0.25 to 1.0 in most cases.
The high degree of similarity across subsamples is what we would expect when every subsample is a random draw from the same dataset.
While broadly similar, there are subtle differences. Sample 4 appears to have a slightly higher median and a noticeably shorter lower whisker (a higher lower bound) than the others.
There’s substantial overlap in the distributions, indicating that differences between subsamples are likely small and may not be statistically significant.
The humidity values appear to be normalized or represent relative humidity, ranging from 0 to 1 (or 0% to 100%).
All samples reach a maximum humidity close to 1 (100%), suggesting that conditions of very high humidity are observed in all subsamples.
The low-humidity outliers appear in every subsample because all subsamples are drawn from the same dataset, so their recurrence only shows that these records exist in the full data. It cannot tell genuine dry conditions apart from measurement errors. A relative humidity of exactly 0 is physically implausible, so these records are worth checking before drawing conclusions from them.
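The points drawn beyond the whiskers follow ggplot2's default rule of 1.5 times the interquartile range beyond the quartiles, so they can be counted rather than read off the plot. This is a minimal sketch on hypothetical values (`toy_hum`, with three artificial zeros added), not on the report's data.

```r
# Stand-in humidity values (hypothetical) with a few artificial zeros
set.seed(123)
toy_hum <- c(rbeta(500, 5, 3), rep(0, 3))

q1  <- unname(quantile(toy_hum, 0.25))
q3  <- unname(quantile(toy_hum, 0.75))
iqr <- q3 - q1

# Fences used by the 1.5 * IQR outlier rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

n_low  <- sum(toy_hum < lower_fence)   # points below the lower fence
n_high <- sum(toy_hum > upper_fence)   # points above the upper fence
n_zero <- sum(toy_hum == 0)            # readings of exactly zero

c(low = n_low, high = n_high, zero = n_zero)
```

Running the same counts on `bike_data$hum` would show how many records sit at exactly 0 and whether they account for all of the lower outliers.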

Boxplot for Windspeed

# Plot windspeed distribution
ggplot(all_samples, aes(x = sample, y = windspeed, fill = sample)) +
  geom_boxplot() +
  labs(title = "Windspeed Distribution Across Subsamples",
       x = "Subsample",
       y = "Windspeed") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "top"
  )

Insights:

The median windspeeds (horizontal lines in the boxes) are similar across all 5 subsamples, generally falling between 0.15 and 0.25. Sample 3 appears to have a slightly higher median than the others.
The interquartile ranges (boxes) are relatively consistent across samples, with Sample 3 and Sample 5 showing slightly larger spreads than the others.
Most boxes appear roughly symmetrical around the median, so the middle 50% of the data shows little skew even though the upper tail is long.
There are numerous outliers in all samples, particularly on the upper end of the distribution. These are represented by the dots above the upper whiskers, extending up to about 0.8-0.85.
The overall range of windspeeds is consistent across samples, from near 0 to about 0.85, with the majority of data points falling below 0.4.
While there are similarities across samples, there’s more variability here than in the previous plots for temperature and humidity.
The distribution appears positively skewed in all samples, with a long tail of high windspeed outliers.
There’s substantial overlap in the distributions, but also noticeable differences, particularly in the median and spread of Sample 3 compared to the others.
The same pattern of outliers appears in every subsample because all subsamples are drawn from the same dataset. This shows that high values are a stable feature of the wind speed data, but it cannot by itself rule out measurement errors.
The values appear to be normalized or represent a specific scale, ranging from 0 to slightly above 0.8.
All samples have lower whiskers extending close to 0, indicating periods of very low wind speeds are common across all subsamples.
The numerous upper outliers suggest frequent occurrences of wind speeds significantly higher than the typical range, possibly representing gusts or storm events.
Sample 3 stands out with a slightly higher median and larger spread. Because every subsample is a random draw from the whole dataset rather than a separate time period, this difference reflects sampling variation, not a windier period.
The outliers seem to cluster at certain levels, particularly visible in Samples 1 and 2, which could indicate that windspeed is recorded at a limited set of discrete values.

These insights suggest that while there’s a consistent overall pattern in wind speed distributions, there’s more variability between samples compared to the temperature and humidity data. The prevalence of high outliers is a key feature, indicating that occasional high wind events are a significant characteristic of this dataset. The differences between samples, particularly Sample 3, are best explained by random sampling variation, since all subsamples are drawn from the same data.

Boxplot for Bike Rentals

# Plot bike rentals distribution
ggplot(all_samples, aes(x = sample, y = cnt, fill = sample)) +
  geom_boxplot() +
  labs(title = "Bike Rental Counts Across Subsamples",
       x = "Subsample",
       y = "Bike Rentals") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "top"
  )

Insights:

The median bike rental counts (horizontal lines in the boxes) are consistent across all 5 subsamples, close to the full-dataset median of 142 rentals per hour reported by summary().
The interquartile ranges (boxes) are relatively large and similar across samples, indicating considerable variability in hourly rental counts.
The boxes are not symmetrical around the median. They appear to be skewed upwards, with longer upper portions of the boxes, suggesting a positive skew in the distribution.
There are numerous upper outliers in all samples, represented by dots above the upper whiskers. These extend up to nearly 1,000 rentals per hour (the full-dataset maximum is 977).
The overall range of rental counts is large and consistent across samples, from near 0 to nearly 1,000 rentals per hour.
The overall pattern is remarkably consistent across all five subsamples, as expected when each is a random draw from the same dataset.
The distribution is positively skewed in all samples, with a long tail of high rental counts. The full-dataset mean (189.5) exceeding the median (142) points the same way.
All samples have lower whiskers extending close to 0, indicating hours with very few rentals (the minimum in the full data is 1).
The upper quartiles (top of the boxes) show some variation across samples, with Sample 1 having a slightly lower upper quartile than the others.
The density of outliers is high, particularly in the range of 600-800 rentals, suggesting frequent occurrences of high-demand hours.
The outlier points appear in horizontal lines, indicating that rental counts are discrete values (whole numbers) rather than continuous.
The consistency across subsamples does not mean that each one covers a similar time period. Each subsample is drawn at random from the full dataset, so every subsample mixes all seasons and hours in similar proportions.
The large number of upper outliers reflects the long right tail rather than a separate mode. A histogram or density plot would be needed to check for bimodality.
The maximum of 977 rentals is simply the busiest hour observed; the data alone do not show whether it reflects the operational capacity of the system.

These insights suggest a strongly right-skewed distribution of hourly bike rental counts with high variability. The consistency across samples follows from random sampling of the same data, while the wide range and numerous outliers point to large hour-to-hour fluctuations. The positive skew and high outliers suggest frequent high-demand hours, which could be linked to factors like time of day, weather, events, or seasonal trends.
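The positive skew described above can be checked numerically by comparing the mean with the median and computing a moment-based skewness. The sketch below uses hypothetical right-skewed counts (`toy_cnt`, drawn from a negative binomial distribution chosen only for illustration), not the report's data.

```r
# Stand-in for cnt: hypothetical right-skewed counts
set.seed(123)
toy_cnt <- rnbinom(5000, size = 1, mu = 190)

mean_cnt   <- mean(toy_cnt)
median_cnt <- median(toy_cnt)

# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- mean((toy_cnt - mean_cnt)^3) / sd(toy_cnt)^3

c(mean = mean_cnt, median = median_cnt, skewness = skewness)
```

A mean well above the median together with a positive skewness is the numerical counterpart of the long upper tail seen in the boxplots.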

Monte Carlo Simulation

I perform a Monte Carlo simulation by drawing, 1,000 times, a random subsample (with replacement) half the size of the dataset and calculating the mean bike rental count (cnt) for each subsample.

# for reproducibility
set.seed(123)

# Perform Monte Carlo Simulation
mc_results <- replicate(1000, {
  sample_data <- bike_data[sample(1:n, size = 0.5 * n, replace = TRUE), ]
  mean(sample_data$cnt)
})

# Mean and standard deviation of the results
mc_mean <- mean(mc_results)
mc_sd <- sd(mc_results)

mc_mean

## [1] 189.4469

mc_sd

## [1] 1.959181

Insights:

The average of the 1,000 subsample means is 189.45, almost identical to the full-dataset mean of 189.5, and their standard deviation is only 1.96. The mean rental count therefore varies very little from one random subsample to the next.
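The small spread has a simple explanation: for a subsample of size m drawn with replacement, the standard deviation of the subsample mean is approximately the standard deviation of the data divided by the square root of m. The sketch below compares the two on hypothetical counts (`toy_cnt`); the variable names are introduced only for illustration.

```r
# Stand-in for cnt (hypothetical values)
set.seed(123)
toy_cnt <- rpois(2000, lambda = 150)
m       <- floor(0.5 * length(toy_cnt))   # subsample size

# Means of 1,000 subsamples drawn with replacement
sim_means <- replicate(1000, mean(sample(toy_cnt, size = m, replace = TRUE)))

empirical_se   <- sd(sim_means)           # spread observed in the simulation
theoretical_se <- sd(toy_cnt) / sqrt(m)   # spread predicted by sd / sqrt(m)

c(empirical = empirical_se, theoretical = theoretical_se)
```

On the real data, the same comparison would use `sd(bike_data$cnt) / sqrt(floor(0.5 * n))` against `mc_sd`.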

Visualization of the Monte Carlo Simulation Results

# Plot the distribution of the mean bike rental counts from the Monte Carlo Simulation
mc_df <- data.frame(mean_cnt = mc_results)
ggplot(mc_df, aes(x = mean_cnt)) +
  geom_histogram(binwidth = 1, color = "black", fill = "blue") +
  geom_vline(aes(xintercept = mc_mean), color = "red", linetype = "dashed") +
  labs(title = "Distribution of Mean Bike Rental Counts from Monte Carlo Simulations",
       x = "Mean Bike Rentals",
       y = "Frequency") +
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "top"
  )

Insights:

The histogram is approximately bell-shaped and centered around 190, as the central limit theorem predicts for means of large samples. This confirms that the overall mean is stable under random subsampling.
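If the simulated means are close to normal, the middle 95% of them should roughly match the mean plus or minus 1.96 standard deviations. This is a sketch on hypothetical skewed data (`toy_cnt`, drawn from an exponential distribution as an assumption), showing one way to make that check.

```r
# Stand-in for cnt: hypothetical skewed values
set.seed(123)
toy_cnt <- rexp(2000, rate = 1 / 190)
m       <- floor(0.5 * length(toy_cnt))

# Means of 1,000 subsamples drawn with replacement
sim_means <- replicate(1000, mean(sample(toy_cnt, size = m, replace = TRUE)))

# Middle 95% of the simulated means (percentile interval)
interval_95 <- quantile(sim_means, c(0.025, 0.975))

# Interval implied by a normal distribution with the same mean and sd
normal_interval <- mean(sim_means) + c(-1.96, 1.96) * sd(sim_means)

rbind(percentile = unname(interval_95), normal = normal_interval)
```

Close agreement between the two rows supports the normal approximation; `qqnorm(sim_means)` offers a visual version of the same check.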

Conclusions and Insights Based on the Random Sampling and Monte Carlo Simulation

Consistency Across Subsamples: The seasonal means of temperature and windspeed are stable across all subsamples, whereas those of humidity and bike rental counts vary somewhat more.
Monte Carlo Simulation Results: The Monte Carlo simulation confirms that the mean bike rental count is reliable, with only minor fluctuations across random subsamples.
Anomalies: Individual subsamples can contain outliers or anomalies that don’t appear consistently in other samples.
Future Implications: Relying on one subsample can lead to misleading conclusions, but Monte Carlo simulations provide confidence that the dataset’s overall mean is stable.

Further Questions That I Would Like to Investigate

Could variables dropped from this analysis, such as weather situation (weathersit), hour of day (hr), or working day (workingday), further explain the variability in humidity and bike rentals?
How would removing outliers or extreme anomalies impact the dataset?