Dataset Description

The dataset was obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

# import data set
hour <- read.csv("~/Dropbox/Project/DataSets/Bike-Sharing-Dataset/hour.csv")

Number of columns:

ncol(hour)

## [1] 17

Number of rows:

nrow(hour)

## [1] 17379

The data come from a two-year historical log covering 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA.

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data they generate make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

Attribute Information:

instant: record index
dteday : date
season : season (1:spring, 2:summer, 3:fall, 4:winter)
yr : year (0: 2011, 1:2012)
mnth : month ( 1 to 12)
hr : hour (0 to 23)
holiday : whether the day is a holiday or not (extracted from [Web Link])
weekday : day of the week
workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered

Questions

How do temperature values change over the seasons? What are the mean, standard deviation and median of the temperature for each season?
For which weather condition is the total number of bike rentals the lowest/highest?
Is there a correlation between the total number of rentals and the season? What are the mean, median and standard deviation of the total number of rentals (count) per season? Which season is the most popular for bike rentals?
Is the correlation between felt air temperature (atemp) and the number of bike rentals significant? Is there a difference between the correlations for the two years (2011 and 2012)?
Is weather condition correlated with the number of bike rentals? What are the minimum, maximum, mean, median, standard deviation and number of occurrences for each weather condition? How does weather condition influence the distribution of bike rentals?
Is there a significant difference between total bike rentals on holidays and working days?

Data preparation

1. Recode season values from 1-4 to Spring-Winter.

### TASK 1

names(hour)

##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"

# define recode function for recoding values:
recodev <- function(original.vector, 
                    old.values, 
                    new.values) {
  new.vector <- original.vector
  for (i in 1:length(old.values)) {
    change.log <- original.vector == old.values[i] & 
      is.na(original.vector) == F
    new.vector[change.log] <- new.values[i] 
    
  }
  return(new.vector)
}
# apply the function for recoding season values
hour$season <- recodev(original.vector = hour$season,
           old.values = c(1:4),
           new.values = c("spring","summer","fall",
                          "winter"))
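As a possible alternative to the hand-written recodev() helper, base R's factor() can perform the same mapping in one call. The sketch below uses a small made-up vector (season_codes is an illustrative stand-in, not the real hour$season column). One difference to keep in mind: a factor keeps the level order given in labels, whereas the character values used in this report are sorted alphabetically in tables and plots.

```r
# Toy season codes standing in for hour$season (illustrative data only)
season_codes <- c(1, 2, 3, 4, 2, NA, 1)

# Map codes 1-4 to labels; the level order follows `levels`, not the alphabet
season_labels <- factor(season_codes,
                        levels = 1:4,
                        labels = c("spring", "summer", "fall", "winter"))

season_labels
levels(season_labels)
```

Missing values stay NA, which matches the behaviour of recodev().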

2. Rename columns “yr” and “mnth” to “year” and “month” and recode year values (0: 2011, 1: 2012).

### TASK 1

# rename columns
names(hour)[4:5] <- c("year","month")
# recode year values
hour$year <- recodev(original.vector = hour$year,
           old.values = c(0,1),
           new.values = c(2011,2012))
# check column names
names(hour)

##  [1] "instant"    "dteday"     "season"     "year"       "month"     
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"

3. Rename “hum” to “humidity” and “cnt” to “count”.

### TASK 1

# rename columns
names(hour)[names(hour)=="hum"] <- "humidity"
names(hour)[names(hour)=="cnt"] <- "count"
names(hour)

##  [1] "instant"    "dteday"     "season"     "year"       "month"     
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [16] "registered" "count"

4. Denormalise “temp” and “atemp” with a custom function.

### TASKS 10, 1

# create a function for denormalisation
tconvert <- function(min, max, vector){
  result <- vector * (max - min) + min
  return (result)
}

# apply the function and denormalise the temperature values
hour$temp <- tconvert(-8, 39, hour$temp)
hour$atemp <- tconvert(-16, 50, hour$atemp)
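A quick way to gain confidence in the inverse transform is a round trip on a few hand-picked values. The sketch below repeats tconvert() so that it runs on its own and adds a forward normalisation helper, tnormalise(), which is a hypothetical name introduced here to mirror the formula (t-t_min)/(t_max-t_min) from the attribute list. The temperatures are illustrative, not taken from the dataset.

```r
# Same inverse transform as tconvert() above
tconvert <- function(min, max, vector) {
  vector * (max - min) + min
}

# Forward normalisation as described in the attribute list
# (hypothetical helper, not part of the original analysis)
tnormalise <- function(min, max, vector) {
  (vector - min) / (max - min)
}

celsius    <- c(-8, 0, 15.5, 39)              # illustrative temperatures
normalised <- tnormalise(-8, 39, celsius)     # scaled to [0, 1]
restored   <- tconvert(-8, 39, normalised)    # back to degrees Celsius

all.equal(restored, celsius)
range(normalised)
```

If the round trip did not return the original values, the min and max arguments would most likely be swapped or mismatched between temp and atemp.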

Analysis

1. How do temperature values change over the seasons? What are the mean, standard deviation and median of the temperature for each season?

### TASKS 2, 9

# calculate mean, st.dev and median for each season
# by aggregation with dplyr library
library(dplyr)
hour.agg <- hour %>%
  group_by(season) %>%
  summarise(
    temp.min = min(temp),
    temp.max = max(temp),
    temp.med = median(temp),
    temp.stdev = sd(temp),
    temp.mean = mean(temp), 
    count = n())
hour.agg

## Source: local data frame [4 x 7]
## 
##   season temp.min temp.max temp.med temp.stdev temp.mean count
## 1   fall     9.86    39.00    24.90   4.413428 25.201277  4496
## 2 spring    -7.06    25.84     5.16   5.580120  6.059892  4242
## 3 summer    -0.48    36.18    18.32   6.543958 17.599170  4409
## 4 winter    -1.42    27.72    11.74   5.741867 11.887486  4232

### TASK 8

# create a boxplot for temperature by season
boxplot(temp ~ season,
        data = hour,
        xlab = "Season",
        ylab = "Temperature",
        main = "Temperature by Season",
        col = "skyblue")

# check seasons and respective months
# fall months
unique(hour$month[hour$season=="fall"])

## [1] 6 7 8 9

# winter months
unique(hour$month[hour$season=="winter"])

## [1]  9 10 11 12

# spring months
unique(hour$month[hour$season=="spring"])

## [1]  1  2  3 12

# summer months
unique(hour$month[hour$season=="summer"])

## [1] 3 4 5 6

As the analysis above shows, spring has both the lowest minimum temperature and the lowest mean temperature (-7.06°C and 6.06°C respectively), while fall has both the highest maximum and the highest mean (39.00°C and 25.20°C respectively). The boxplot confirms that the lowest temperatures occur in spring, followed by winter, and the highest in fall, followed by summer. These atypical values can be explained by a shift in the season coding: the month check above shows that the season labelled spring covers December to March, while the one labelled fall covers June to September.

2. For which weather condition is the total number of bike rentals the lowest/highest?

### TASK 8

# create a beanplot of the number of bike rentals for each weather condition
library("beanplot")
library("RColorBrewer")
bean.cols <- lapply(brewer.pal(6, "Set3"),
                    function(x){return(c(x, "black", "gray", "red"))})
beanplot(count ~ weathersit,
         data = hour,
         main = "Bike Rents by Weather Condition",
         xlab = "Weather Condition",
         ylab = "Number of rentals",
         col = bean.cols,
         lwd = 1,
         what = c (1,1,1,0),
         log = ""
         )

The beanplot shows that the lowest number of rentals occurs under the 4th weather type (heavy rain, thunderstorm etc.), while the highest mean number of rentals occurs under the 1st weather type (clear, partly cloudy etc.).

3. Is there a correlation between total number of rentals and season? What is the mean, median and standard deviation for total number of rentals (count) per season? Which season is the most popular for the bike rentals?

### TASK 11

# create a data frame
df <- data.frame(spring = rep(NA, 3),
                 winter = rep(NA, 3),
                 summer = rep(NA, 3),
                 fall = rep(NA, 3))
row.names(df) <- c("mean", "median", "sd")

# fill the data frame with corresponding mean, median and sd values
vec <- c ("mean","median","sd") 
for (n in vec){
  for (i in unique(hour$season)) {
    my.fun <- get(n)
    res <- my.fun(hour$count[hour$season == i])
    df[n,i] <- res
  }
}  
df

##          spring   winter   summer     fall
## mean   111.1146 198.8689 208.3441 236.0162
## median  76.0000 155.5000 165.0000 199.0000
## sd     119.2240 182.9680 188.3625 197.7116

From the numbers above we can see that fall has the highest mean, median and standard deviation of total bike rentals (236.0162, 199 and 197.7116 respectively), while spring has the lowest (111.1146, 76 and 119.224 respectively).

# statistics (analysis of variance model)
summary(aov(count ~ season, data = hour))

##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## season          3  37729358 12576453   409.2 <2e-16 ***
## Residuals   17375 534032233    30736                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of variance shows that the mean number of rentals differs significantly between seasons (p-value < 2e-16).
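With more than 17,000 observations, a very small p-value is almost guaranteed, so it can help to look at an effect size as well. The sketch below derives eta squared (the share of the total variance attributed to season) directly from the sums of squares printed in the aov() table. This is an additional check that the original analysis does not include, and the variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the aov() output above
ss_season   <- 37729358
ss_residual <- 534032233
df_season   <- 3
df_residual <- 17375

# Share of the total variance in count attributed to season
eta_squared <- ss_season / (ss_season + ss_residual)

# The F statistic is the ratio of the two mean squares
f_value <- (ss_season / df_season) / (ss_residual / df_residual)

round(eta_squared, 3)
round(f_value, 1)
```

By this measure, season accounts for about 6.6% of the variance in hourly rentals, and the recomputed F value agrees with the 409.2 reported in the table.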

# pairwise comparison of means for seasons
# in order to identify any difference between two means that is greater than the expected standard error
TukeyHSD(aov(count ~ season, data = hour))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = count ~ season, data = hour)
## 
## $season
##                      diff        lwr          upr     p adj
## spring-fall   -124.901668 -134.54307 -115.2602613 0.0000000
## summer-fall    -27.672168  -37.21916  -18.1251741 0.0000000
## winter-fall    -37.147380  -46.79465  -27.5001142 0.0000000
## summer-spring   97.229500   87.54202  106.9169764 0.0000000
## winter-spring   87.754288   77.96798   97.5405970 0.0000000
## winter-summer   -9.475213  -19.16852    0.2180949 0.0581801

The pairwise comparison shows that the largest difference in mean rentals is between spring and fall (-124.9), while the smallest is between winter and summer (-9.5), the only pair whose difference is not significant at the 5% level (adjusted p = 0.058). In other words, mean rentals are similar in winter and summer but differ markedly between spring and fall.

### TASK 8

# create a boxplot for count~season in order to reveal the most popular season
# for bike rentals

boxplot(count ~ season,
        data = hour,
        xlab = "Season",
        ylab = "Count",
        main = "Count by Season",
        col = "yellow3")

The boxplots show that the most popular seasons for renting a bike are fall and summer, while the least popular is spring.

4. Is the correlation between felt air temperature (atemp) and the number of bike rentals significant? Is there a difference between the correlations for the two years (2011 and 2012)?

### TASK 4

# correlation test for count~atemp
t1 <- cor.test(hour$atemp[hour$year == 2011],
               hour$count[hour$year == 2011])
t1

## 
##  Pearson's product-moment correlation
## 
## data:  hour$atemp[hour$year == 2011] and hour$count[hour$year == 2011]
## t = 46.4598, df = 8643, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4300004 0.4637388
## sample estimates:
##       cor 
## 0.4470285

t2 <- cor.test(hour$atemp[hour$year == 2012], 
               hour$count[hour$year == 2012])
t2

## 
##  Pearson's product-moment correlation
## 
## data:  hour$atemp[hour$year == 2012] and hour$count[hour$year == 2012]
## t = 40.3462, df = 8732, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3785679 0.4139248
## sample estimates:
##       cor 
## 0.3963933

# apa format
library("yarrr")
apa(t1)

## [1] "r = 0.45, t(8643) = 46.46, p < 0.01 (2-tailed)"

apa(t2)

## [1] "r = 0.4, t(8732) = 40.35, p < 0.01 (2-tailed)"

The correlation test shows a significant correlation between the felt air temperature and the number of bike rentals for both years (p-value < 0.01 in both cases), although the correlation coefficient is higher for 2011 (0.45) than for 2012 (0.40).
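Whether the two coefficients differ by more than chance would allow can be checked with a Fisher r-to-z comparison. This test is not part of the original analysis; the sketch below is one possible approach, using only the correlations and degrees of freedom printed above. It assumes the two yearly samples are independent and that observations within each year are independent, which hourly data only approximately satisfy.

```r
# Correlations and sample sizes taken from the cor.test() output above (n = df + 2)
r_2011 <- 0.4470285; n_2011 <- 8643 + 2
r_2012 <- 0.3963933; n_2012 <- 8732 + 2

# Fisher r-to-z transform
z_2011 <- atanh(r_2011)
z_2012 <- atanh(r_2012)

# Standard error of the difference between two independent z values
se_diff <- sqrt(1 / (n_2011 - 3) + 1 / (n_2012 - 3))

# Two-sided test of the null hypothesis that both correlations are equal
z_stat  <- (z_2011 - z_2012) / se_diff
p_value <- 2 * pnorm(-abs(z_stat))

z_stat
p_value
```

With these inputs the statistic comes out at roughly 4, which suggests the gap between the two years is larger than sampling noise alone would produce. Given the autocorrelation in hourly counts, this is best read as a rough indication.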

### TASKS 5, 6

# plotting the results in a scatterplot with regression lines

# blank plot (type = "n" sets up the axes without drawing a point)
plot(x = 1,
     type = "n",
     xlab = "Temperature",
     ylab = "Number of Rents",
     xlim = c(-25,50),
     ylim = c(0,1000),
     main = "Temperature vs. Count")

# draw points for 2011 year (colours match the regression lines and legend)
points(x = hour$atemp[hour$year == 2011],
       y = hour$count[hour$year == 2011],
       pch = 16,
       col = "darkgreen",
       cex = 0.5
       )
# draw points for 2012 year
points(x = hour$atemp[hour$year == 2012],
       y = hour$count[hour$year == 2012],
       pch = 16,
       col = "red",
       cex = 0.5
       )

# add regression lines for the two years
abline(lm(count~atemp, hour, subset = year == 2011),
       col = "darkgreen",
       lwd = 3)

abline(lm(count~atemp, hour, subset = year == 2012),
       col = "red",
       lwd = 3)

# add legend
legend("topleft",
       legend = c(2011, 2012),
       col = c("darkgreen","red"),
       pch = c(16, 16),
       bg = "white",
       cex = 1
)

The scatterplot with the regression lines for both years illustrates once again the difference between 2011 and 2012. The correlation is stronger for 2011 than for 2012 (r = 0.45 vs. r = 0.40), while the slope of each line shows how many additional rentals are associated with a one-degree increase in felt temperature in that year.

5. Is weather condition correlated with the number of bike rentals? What are the minimum, maximum, mean, median, standard deviation and number of occurrences for each weather condition? How does weather condition influence the distribution of bike rentals?

### TASK 5 

# summary on linear model fitting
summary(lm(count~weathersit, hour))

## 
## Call:
## lm(formula = count ~ weathersit, data = hour)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -205.65 -139.65  -45.65   89.35  790.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  247.054      3.328   74.24   <2e-16 ***
## weathersit   -40.407      2.130  -18.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 179.5 on 17377 degrees of freedom
## Multiple R-squared:  0.02029,    Adjusted R-squared:  0.02023 
## F-statistic: 359.8 on 1 and 17377 DF,  p-value: < 2.2e-16

summary(aov(count~weathersit, hour))

##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## weathersit      1  11598301 11598301   359.8 <2e-16 ***
## Residuals   17377 560163290    32236                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

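Yes, there is a significant association between weather condition and the number of bike rentals (p-value < 2e-16). With weather condition treated as a numeric code, each step towards a worse category is associated with about 40 fewer rentals, although the model explains only about 2% of the variance (R-squared = 0.02).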
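The single degree of freedom for weathersit in the tables above indicates that the codes 1 to 4 were treated as a number, which assumes equal steps between categories. Wrapping the variable in factor() fits one coefficient per category instead. The sketch below shows the difference on simulated data; the toy data frame and its values are made up for illustration and do not reproduce the real results.

```r
set.seed(1)

# Simulated stand-in for the real data (values are made up for illustration)
toy <- data.frame(weathersit = rep(1:4, each = 25))
toy$count <- c(200, 180, 110, 70)[toy$weathersit] + rnorm(100, sd = 20)

# Numeric coding: a single slope, assumes equal steps between categories
fit_numeric <- lm(count ~ weathersit, data = toy)

# Factor coding: one coefficient per category relative to category 1
fit_factor <- lm(count ~ factor(weathersit), data = toy)

length(coef(fit_numeric))   # intercept plus one slope
length(coef(fit_factor))    # intercept plus three category contrasts
anova(fit_factor)
```

In the factor version the ANOVA table reports three degrees of freedom for the weather term, which matches how season was handled in the seasonal analysis.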

### TASK 9

# calculate min, max, mean, st.dev and median for each weather condition
# by aggregation with dplyr library

w.agg <- hour %>%
  group_by(weathersit) %>%
  summarise(
    temp.min = min(temp),
    temp.max = max(temp),
    temp.mean = mean(temp),
    temp.stdev = sd(temp),
    temp.med = median(temp), 
    count = n())
w.agg

## Source: local data frame [4 x 7]
## 
##   weathersit temp.min temp.max  temp.mean temp.stdev temp.med count
## 1          1    -7.06    39.00 16.0195409   9.436434    16.44 11413
## 2          2    -7.06    37.12 14.2989349   8.268867    13.62  4544
## 3          3    -4.24    35.24 13.4643270   7.543913    13.62  1419
## 4          4    -1.42     2.34  0.7733333   1.956766     1.40     3

The aggregation results reveal that the minimum temperatures for the clear/partly cloudy and mist/cloudy weather conditions are the same (-7.06°C). The highest maximum temperature belongs to the 1st weather condition - clear/partly cloudy (39°C). The heavy rain/thunderstorm condition has the lowest maximum, mean and median temperature (though not the lowest minimum), and it is the rarest weather condition in the whole dataset (only 3 records), while the most frequent one is the 1st (11413 records).

### TASKS 7, 11 

# create histograms for each weather condition
# to explore distribution of the bike rentals by 
# weather condition

# create a vector for histograms titles
vec <- c("Clear Weather", "Cloudy Weather", "Rainy Weather", "Thunderstorm Weather")

# parameters for plots combining
par(mfrow = c(2, 2))

# create 4 histograms with a loop
for (i in c(1:4)){
  name.i <- vec[i]
  hist(hour$count[hour$weathersit == i],
       main = name.i,
       xlab = "Number of Rents",
       ylab = "Frequency",
       breaks = 10,
       col = "yellow3",
       border = "black")

  # the line indicating median value
  abline(v = median(hour$count[hour$weathersit == i]),
         col = "black",
         lwd = 3,
         lty = 2)

  # the line indicating mean value
  abline(v = mean(hour$count[hour$weathersit == i]),
         col = "blue",
         lwd = 3,
         lty = 2)
}

The histograms show that the distribution of bike rentals is quite similar for Clear and Cloudy weather (although the frequencies are much higher in the first case), while it differs noticeably for Rainy weather and drastically for Thunderstorm weather, where, as already pointed out, the number of records is extremely low.

6. Is there a significant difference between total bike rentals on holidays and working days?

### TASK 3

t <- t.test(hour$count[hour$holiday == 0],
       hour$count[hour$holiday == 1])

# apa format
apa(t)

## [1] "mean difference = -33.56, t(539.61) = 4.69, p < 0.01 (2-tailed)"

### TASK 8

beanplot(count ~ holiday,
         data = hour,
         main = "Bike Rents by Type of a Day",
         xlab = "Type of Day",
         ylab = "Number of rents",
         col = bean.cols,
         lwd = 1,
         what = c(1,1,1,0),
         log = ""
         )

According to the t-test, there is a significant difference between total bike rentals on holidays and working days (here, days with holiday = 0; p-value < 0.01). The beanplot shows that the maximum number of rentals is considerably higher for working days than for holidays, while moderate numbers of rentals (200-600) occur relatively more often on holidays.
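The fractional degrees of freedom in the output above (539.61) indicate that t.test() ran its default Welch variant, which does not assume equal variances in the two groups. The sketch below reproduces the Welch statistic and its degrees of freedom by hand on simulated data. The two vectors are made up for illustration, with unequal sizes and spreads to mimic the imbalance between holidays and other days.

```r
set.seed(42)

# Simulated counts for two groups of unequal size and spread (made-up data)
regular_days <- rnorm(500, mean = 190, sd = 180)
holidays     <- rnorm(40,  mean = 155, sd = 150)

# Welch statistic: difference in means over the unpooled standard error
v1 <- var(regular_days) / length(regular_days)
v2 <- var(holidays) / length(holidays)
t_manual <- (mean(regular_days) - mean(holidays)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom (generally not an integer)
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(regular_days) - 1) + v2^2 / (length(holidays) - 1))

welch <- t.test(regular_days, holidays)   # var.equal = FALSE is the default
all.equal(t_manual, unname(welch$statistic))
all.equal(df_manual, unname(welch$parameter))
```

Because the holiday group is much smaller than the other group, the unpooled standard error is the safer choice here.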

Conclusion

Conducting a range of statistical tests and plotting the data in a variety of ways on the dataset comprising a two-year historical log of bike rentals in Washington D.C. allows us to draw the following conclusions:

The mean temperatures vary significantly over the seasons.
Total bike rentals change with the season and differ in their means. The largest pairwise mean difference is between spring and fall, while the smallest (and the only non-significant one) is between winter and summer.
There is a significant, moderate positive correlation between felt air temperature and the total number of bike rentals, although it differs between the two years.
Weather condition and the total number of bike rentals also appear to be significantly associated. The two most popular weather conditions for bike rentals are Clear and Cloudy weather.
There is a significant difference in the total number of bike rentals between holidays and working days.

Data analysis and visualization in R - Final Paper: Bike Sharing Dataset Analysis

Anna Martin

3 February 2016