library(dplyr)
library(purrr)
library(ggplot2)
bike <- read.csv('D:/FALL 2023/STATISTICS/datasets/bike.csv')
## Random sampling of bike dataset
```r
# Number of subsamples, chosen at random between 5 and 10
num <- sample(5:10, 1)
columns <- c("Seasons", "Holiday", "Rented.Bike.Count", "Visibility", "Humidity", "Snowfall")
subsample_list <- list()
for (i in 1:num) {
  # Determine sample size (approximately 50% of the data)
  s_size <- round(0.5 * nrow(bike))
  # Randomly select rows with replacement
  s_index <- sample(1:nrow(bike), size = s_size, replace = TRUE)
  # Create the subsample data frame
  subsample <- bike[s_index, columns]
  # Store the subsample in the list
  subsample_list[[i]] <- subsample
}
# View() opens a data viewer, so call it only in an interactive session
if (interactive()) View(subsample_list)
sapply(subsample_list, dim)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 4380 4380 4380 4380 4380 4380
## [2,] 6 6 6 6 6 6
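# A hedged aside (not in the original): sample() draws different rows on every
# run, so the numbers in this document change each time it is knitted. Below is
# a minimal sketch of a reproducible variant, assuming a fixed seed is
# acceptable. `toy`, `draw_subsample`, `toy_subsamples`, and the seed value 42
# are hypothetical names and values used only for illustration.
set.seed(42)
toy <- data.frame(id = 1:20, value = rnorm(20))
draw_subsample <- function(data, frac = 0.5) {
  # Draw round(frac * nrow(data)) row indices with replacement
  idx <- sample(seq_len(nrow(data)), size = round(frac * nrow(data)), replace = TRUE)
  data[idx, , drop = FALSE]
}
# Three subsamples of the toy data, each holding half as many rows as `toy`
toy_subsamples <- lapply(seq_len(3), function(i) draw_subsample(toy))
sapply(toy_subsamples, dim)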
summary_table <- lapply(subsample_list, function(subsample) {
  # summary() returns a table; kable() renders it as a Markdown table
  summary_df <- summary(subsample)
  knitr::kable(summary_df, caption = "summary statistics")
})
for (i in seq_len(num)) {
  cat("### Subsample", i, "summary statistics \n")
  print(summary_table[[i]])
}
## ### Subsample 1 summary statistics
##
##
## Table: summary statistics
##
## | | Seasons | Holiday |Rented.Bike.Count | Visibility | Humidity | Snowfall |
## |:--|:----------------|:----------------|:-----------------|:--------------|:-------------|:---------------|
## | |Length:4380 |Length:4380 |Min. : 0.0 |Min. : 33.0 |Min. : 0.00 |Min. :0.00000 |
## | |Class :character |Class :character |1st Qu.: 192.0 |1st Qu.: 951.5 |1st Qu.:42.00 |1st Qu.:0.00000 |
## | |Mode :character |Mode :character |Median : 490.5 |Median :1681.0 |Median :57.00 |Median :0.00000 |
## | |NA |NA |Mean : 704.9 |Mean :1435.9 |Mean :58.12 |Mean :0.07011 |
## | |NA |NA |3rd Qu.:1062.0 |3rd Qu.:1999.0 |3rd Qu.:74.00 |3rd Qu.:0.00000 |
## | |NA |NA |Max. :3556.0 |Max. :2000.0 |Max. :98.00 |Max. :8.80000 |
## ### Subsample 2 summary statistics
##
##
## Table: summary statistics
##
## | | Seasons | Holiday |Rented.Bike.Count | Visibility | Humidity | Snowfall |
## |:--|:----------------|:----------------|:-----------------|:------------|:-------------|:---------------|
## | |Length:4380 |Length:4380 |Min. : 0.0 |Min. : 27 |Min. : 0.00 |Min. :0.00000 |
## | |Class :character |Class :character |1st Qu.: 193.0 |1st Qu.: 946 |1st Qu.:42.00 |1st Qu.:0.00000 |
## | |Mode :character |Mode :character |Median : 479.5 |Median :1674 |Median :57.00 |Median :0.00000 |
## | |NA |NA |Mean : 704.7 |Mean :1435 |Mean :58.24 |Mean :0.07023 |
## | |NA |NA |3rd Qu.:1071.2 |3rd Qu.:2000 |3rd Qu.:74.00 |3rd Qu.:0.00000 |
## | |NA |NA |Max. :3298.0 |Max. :2000 |Max. :98.00 |Max. :8.80000 |
## ### Subsample 3 summary statistics
##
##
## Table: summary statistics
##
## | | Seasons | Holiday |Rented.Bike.Count | Visibility | Humidity | Snowfall |
## |:--|:----------------|:----------------|:-----------------|:------------|:-------------|:---------------|
## | |Length:4380 |Length:4380 |Min. : 0.0 |Min. : 27 |Min. : 0.00 |Min. :0.00000 |
## | |Class :character |Class :character |1st Qu.: 185.0 |1st Qu.: 895 |1st Qu.:43.00 |1st Qu.:0.00000 |
## | |Mode :character |Mode :character |Median : 479.5 |Median :1662 |Median :57.00 |Median :0.00000 |
## | |NA |NA |Mean : 692.7 |Mean :1420 |Mean :58.52 |Mean :0.07888 |
## | |NA |NA |3rd Qu.:1044.0 |3rd Qu.:2000 |3rd Qu.:75.00 |3rd Qu.:0.00000 |
## | |NA |NA |Max. :3556.0 |Max. :2000 |Max. :98.00 |Max. :8.80000 |
## ### Subsample 4 summary statistics
##
##
## Table: summary statistics
##
## | | Seasons | Holiday |Rented.Bike.Count | Visibility | Humidity | Snowfall |
## |:--|:----------------|:----------------|:-----------------|:------------|:-------------|:---------------|
## | |Length:4380 |Length:4380 |Min. : 0.0 |Min. : 53 |Min. : 0.00 |Min. :0.00000 |
## | |Class :character |Class :character |1st Qu.: 202.0 |1st Qu.: 955 |1st Qu.:42.00 |1st Qu.:0.00000 |
## | |Mode :character |Mode :character |Median : 518.0 |Median :1715 |Median :57.00 |Median :0.00000 |
## | |NA |NA |Mean : 720.8 |Mean :1447 |Mean :57.96 |Mean :0.07048 |
## | |NA |NA |3rd Qu.:1088.0 |3rd Qu.:2000 |3rd Qu.:74.00 |3rd Qu.:0.00000 |
## | |NA |NA |Max. :3418.0 |Max. :2000 |Max. :98.00 |Max. :6.00000 |
## ### Subsample 5 summary statistics
##
##
## Table: summary statistics
##
## | | Seasons | Holiday |Rented.Bike.Count | Visibility | Humidity | Snowfall |
## |:--|:----------------|:----------------|:-----------------|:--------------|:-------------|:---------------|
## | |Length:4380 |Length:4380 |Min. : 0.0 |Min. : 33.0 |Min. : 0.00 |Min. :0.00000 |
## | |Class :character |Class :character |1st Qu.: 193.0 |1st Qu.: 954.2 |1st Qu.:43.00 |1st Qu.:0.00000 |
## | |Mode :character |Mode :character |Median : 499.0 |Median :1700.0 |Median :57.00 |Median :0.00000 |
## | |NA |NA |Mean : 704.4 |Mean :1444.2 |Mean :58.23 |Mean :0.07781 |
## | |NA |NA |3rd Qu.:1058.0 |3rd Qu.:2000.0 |3rd Qu.:74.00 |3rd Qu.:0.00000 |
## | |NA |NA |Max. :3404.0 |Max. :2000.0 |Max. :98.00 |Max. :7.10000 |
## ### Subsample 6 summary statistics
##
##
## Table: summary statistics
##
## | | Seasons | Holiday |Rented.Bike.Count | Visibility | Humidity | Snowfall |
## |:--|:----------------|:----------------|:-----------------|:------------|:-------------|:---------------|
## | |Length:4380 |Length:4380 |Min. : 0.0 |Min. : 27 |Min. : 0.00 |Min. :0.00000 |
## | |Class :character |Class :character |1st Qu.: 189.8 |1st Qu.: 906 |1st Qu.:43.00 |1st Qu.:0.00000 |
## | |Mode :character |Mode :character |Median : 489.5 |Median :1686 |Median :58.00 |Median :0.00000 |
## | |NA |NA |Mean : 698.4 |Mean :1424 |Mean :58.69 |Mean :0.08171 |
## | |NA |NA |3rd Qu.:1059.0 |3rd Qu.:2000 |3rd Qu.:75.00 |3rd Qu.:0.00000 |
## | |NA |NA |Max. :3404.0 |Max. :2000 |Max. :98.00 |Max. :5.00000 |
# Summary statistics for each subsample
summary_stats <- lapply(subsample_list, summary)
# Build one Humidity histogram per subsample
histograms <- lapply(subsample_list, function(subsample) {
  ggplot(subsample, aes(x = Humidity)) +
    geom_histogram(binwidth = 1, fill = 'blue', color = 'black') +
    labs(title = "Histogram of Humidity", x = 'Humidity', y = 'Frequency')
})
# Display the summary and histogram for each subsample
for (i in seq_len(num)) {
  cat("Subsample", i, "summary statistics:\n")
  print(summary_stats[[i]])
  print(histograms[[i]])
}
## Subsample 1 summary statistics:
## Seasons Holiday Rented.Bike.Count Visibility
## Length:4380 Length:4380 Min. : 0.0 Min. : 33.0
## Class :character Class :character 1st Qu.: 192.0 1st Qu.: 951.5
## Mode :character Mode :character Median : 490.5 Median :1681.0
## Mean : 704.9 Mean :1435.9
## 3rd Qu.:1062.0 3rd Qu.:1999.0
## Max. :3556.0 Max. :2000.0
## Humidity Snowfall
## Min. : 0.00 Min. :0.00000
## 1st Qu.:42.00 1st Qu.:0.00000
## Median :57.00 Median :0.00000
## Mean :58.12 Mean :0.07011
## 3rd Qu.:74.00 3rd Qu.:0.00000
## Max. :98.00 Max. :8.80000
## Subsample 2 summary statistics:
## Seasons Holiday Rented.Bike.Count Visibility
## Length:4380 Length:4380 Min. : 0.0 Min. : 27
## Class :character Class :character 1st Qu.: 193.0 1st Qu.: 946
## Mode :character Mode :character Median : 479.5 Median :1674
## Mean : 704.7 Mean :1435
## 3rd Qu.:1071.2 3rd Qu.:2000
## Max. :3298.0 Max. :2000
## Humidity Snowfall
## Min. : 0.00 Min. :0.00000
## 1st Qu.:42.00 1st Qu.:0.00000
## Median :57.00 Median :0.00000
## Mean :58.24 Mean :0.07023
## 3rd Qu.:74.00 3rd Qu.:0.00000
## Max. :98.00 Max. :8.80000
## Subsample 3 summary statistics:
## Seasons Holiday Rented.Bike.Count Visibility
## Length:4380 Length:4380 Min. : 0.0 Min. : 27
## Class :character Class :character 1st Qu.: 185.0 1st Qu.: 895
## Mode :character Mode :character Median : 479.5 Median :1662
## Mean : 692.7 Mean :1420
## 3rd Qu.:1044.0 3rd Qu.:2000
## Max. :3556.0 Max. :2000
## Humidity Snowfall
## Min. : 0.00 Min. :0.00000
## 1st Qu.:43.00 1st Qu.:0.00000
## Median :57.00 Median :0.00000
## Mean :58.52 Mean :0.07888
## 3rd Qu.:75.00 3rd Qu.:0.00000
## Max. :98.00 Max. :8.80000
## Subsample 4 summary statistics:
## Seasons Holiday Rented.Bike.Count Visibility
## Length:4380 Length:4380 Min. : 0.0 Min. : 53
## Class :character Class :character 1st Qu.: 202.0 1st Qu.: 955
## Mode :character Mode :character Median : 518.0 Median :1715
## Mean : 720.8 Mean :1447
## 3rd Qu.:1088.0 3rd Qu.:2000
## Max. :3418.0 Max. :2000
## Humidity Snowfall
## Min. : 0.00 Min. :0.00000
## 1st Qu.:42.00 1st Qu.:0.00000
## Median :57.00 Median :0.00000
## Mean :57.96 Mean :0.07048
## 3rd Qu.:74.00 3rd Qu.:0.00000
## Max. :98.00 Max. :6.00000
## Subsample 5 summary statistics:
## Seasons Holiday Rented.Bike.Count Visibility
## Length:4380 Length:4380 Min. : 0.0 Min. : 33.0
## Class :character Class :character 1st Qu.: 193.0 1st Qu.: 954.2
## Mode :character Mode :character Median : 499.0 Median :1700.0
## Mean : 704.4 Mean :1444.2
## 3rd Qu.:1058.0 3rd Qu.:2000.0
## Max. :3404.0 Max. :2000.0
## Humidity Snowfall
## Min. : 0.00 Min. :0.00000
## 1st Qu.:43.00 1st Qu.:0.00000
## Median :57.00 Median :0.00000
## Mean :58.23 Mean :0.07781
## 3rd Qu.:74.00 3rd Qu.:0.00000
## Max. :98.00 Max. :7.10000
## Subsample 6 summary statistics:
## Seasons Holiday Rented.Bike.Count Visibility
## Length:4380 Length:4380 Min. : 0.0 Min. : 27
## Class :character Class :character 1st Qu.: 189.8 1st Qu.: 906
## Mode :character Mode :character Median : 489.5 Median :1686
## Mean : 698.4 Mean :1424
## 3rd Qu.:1059.0 3rd Qu.:2000
## Max. :3404.0 Max. :2000
## Humidity Snowfall
## Min. : 0.00 Min. :0.00000
## 1st Qu.:43.00 1st Qu.:0.00000
## Median :58.00 Median :0.00000
## Mean :58.69 Mean :0.08171
## 3rd Qu.:75.00 3rd Qu.:0.00000
## Max. :98.00 Max. :5.00000
# Mean Humidity of each subsample
means <- lapply(subsample_list, function(subsample) {
  mean(subsample$Humidity)
})
print(means)
## [[1]]
## [1] 58.11849
##
## [[2]]
## [1] 58.23653
##
## [[3]]
## [1] 58.51553
##
## [[4]]
## [1] 57.96096
##
## [[5]]
## [1] 58.23333
##
## [[6]]
## [1] 58.68721
# Standard deviation of Visibility in each subsample
sds <- lapply(subsample_list, function(subsample) {
  sd(subsample$Visibility)
})
print(sds)
## [[1]]
## [1] 607.5581
##
## [[2]]
## [1] 610.8125
##
## [[3]]
## [1] 614.9329
##
## [[4]]
## [1] 602.7616
##
## [[5]]
## [1] 602.0797
##
## [[6]]
## [1] 615.1279
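# A hedged aside (not in the original): lapply() returns a list, which is why
# the values above print one element at a time. Below is a minimal sketch,
# assuming a compact table is preferred, that uses vapply() to collect the same
# kind of statistics into one data frame. `toy_list` and `stats_table` are
# hypothetical names, and the toy values are made up for illustration.
toy_list <- list(
  data.frame(Humidity = c(40, 60, 80), Visibility = c(500, 1500, 2000)),
  data.frame(Humidity = c(30, 50, 70), Visibility = c(1000, 1200, 1400))
)
stats_table <- data.frame(
  subsample = seq_along(toy_list),
  # vapply() checks that each result is a single numeric value
  mean_humidity = vapply(toy_list, function(d) mean(d$Humidity), numeric(1)),
  sd_visibility = vapply(toy_list, function(d) sd(d$Visibility), numeric(1))
)
print(stats_table)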
# Print the mean Humidity and the standard deviation of Visibility per subsample
for (i in seq_len(num)) {
  cat("Subsample", i, "mean Humidity:", means[[i]], "\n")
  cat("Subsample", i, "SD Visibility:", sds[[i]], "\n")
}
## Subsample 1 mean Humidity: 58.11849
## Subsample 1 SD Visibility: 607.5581
## Subsample 2 mean Humidity: 58.23653
## Subsample 2 SD Visibility: 610.8125
## Subsample 3 mean Humidity: 58.51553
## Subsample 3 SD Visibility: 614.9329
## Subsample 4 mean Humidity: 57.96096
## Subsample 4 SD Visibility: 602.7616
## Subsample 5 mean Humidity: 58.23333
## Subsample 5 SD Visibility: 602.0797
## Subsample 6 mean Humidity: 58.68721
## Subsample 6 SD Visibility: 615.1279
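# A hedged sketch (not part of the original analysis): one way to put a number
# on how little the subsamples differ is to measure the spread of the six
# Humidity means printed above. The vector is copied from that output; the name
# `humidity_means` and the choice of range and coefficient of variation as
# spread measures are assumptions made for illustration.
humidity_means <- c(58.11849, 58.23653, 58.51553, 57.96096, 58.23333, 58.68721)
# Gap between the largest and smallest subsample mean
mean_range <- diff(range(humidity_means))
# Standard deviation of the means relative to their average
mean_cv <- sd(humidity_means) / mean(humidity_means)
cat("Range of subsample means:", round(mean_range, 3), "\n")
cat("Coefficient of variation:", round(mean_cv, 4), "\n")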
```
ANALYSIS:

1. After drawing random subsamples from the dataset and computing summary statistics for each, there is no substantial variation between the subsamples.
2. Because the deviations are small, anomalies are unlikely. The "Humidity" column illustrates this: its mean stays between 57.96 and 58.69 across all six subsamples.
3. I conclude that the columns are consistent across all the subsamples.