Final Project - Data Set Analysis on Bike Rentals and Weather

1. Dataset Description

Importing the dataset “day”.

day <- read.csv("~/Documents/Studium 6. Semester/Datenanalyse mit R/Bike-Sharing-Dataset(1)/day.csv", stringsAsFactors=FALSE)

The dataset “Bike-Sharing-Dataset” was obtained from the UCI Machine Learning Repository, a collection of databases, domain theories and data generators that the machine learning community uses for empirical analyses. The archive was created in 1987 by David Aha and fellow graduate students at UC Irvine. Since then it has been widely used by students, educators and researchers. The current website was designed in 2007. The repository is based on donations from researchers, mostly outside of UCI. We found the “Bike Sharing Dataset” under the index “regression” and chose the sub-dataset “day”.

This dataset contains the hourly and daily counts of rental bikes in the Capital Bikeshare system for the years 2011 and 2012, together with the corresponding weather and seasonal information. Capital Bikeshare has over 350 stations in Washington, D.C., Arlington and Alexandria, VA, and Montgomery County, MD. Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. According to the dataset documentation there are over 500 bike-sharing programs around the world; the data analysed here come from Capital Bikeshare. The Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto, collected the data, aggregated them on both an hourly and a daily basis and then added the corresponding weather and seasonal information, which was extracted from http://www.freemeteo.com.

The dataset originally has 16 columns and 731 rows. In the course of the analysis we generated additional columns.

nrow(day)

## [1] 731

ncol(day)

## [1] 16

dim(day)

## [1] 731  16

The dataset contains the following columns. A short description of each column follows below:

names(day)

##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "hum"        "windspeed"  "casual"     "registered"
## [16] "cnt"

Original columns:

instant: record index
dteday: date
season: season (1: spring, 2: summer, 3: fall, 4: winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday: day of the week
workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit: weather situation
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist and Cloudy, Mist and Broken clouds, Mist and Few clouds, Mist
- 3: Light Snow, Light Rain and Thunderstorm and Scattered clouds, Light Rain and Scattered clouds
- 4: Heavy Rain and Ice Pellets and Thunderstorm and Mist, Snow and Fog
temp: normalized temperature in Celsius. The values are divided by 41 (max)
atemp: normalized felt temperature in Celsius. The values are divided by 50 (max)
hum: normalized humidity. The values are divided by 100 (max)
windspeed: normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered

New columns:

mnth.name: names of the months
weather: recoded weather situation
- 1 = “nice”
- 2 = “cloudy”
- 3 = “wet”
- 4 = “lousy”
raw.temp: temperature in Celsius, converted back from the normalized value
raw.windspeed: wind speed, converted back from the normalized value
raw.hum: humidity, converted back from the normalized value
raw.atemp: felt temperature in Celsius, converted back from the normalized value
raw.mean.temp.atemp: mean of the converted temperature (raw.temp) and the converted felt temperature (raw.atemp) in Celsius
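
The raw.* columns undo the normalization by multiplying each normalized value by the maximum it was divided by. A minimal sketch of that conversion, using the first two rows that appear in the head(day) output of this report; the data frame name toy is hypothetical and only serves as an illustration:

```r
# Hypothetical two-row example; values copied from the head(day) output in this report.
toy <- data.frame(temp = c(0.344167, 0.363478),
                  atemp = c(0.363625, 0.353739),
                  hum = c(0.805833, 0.696087),
                  windspeed = c(0.1604460, 0.2485390))

# Multiply each normalized value by the maximum it was divided by.
toy$raw.temp      <- toy$temp * 41
toy$raw.atemp     <- toy$atemp * 50
toy$raw.hum       <- toy$hum * 100
toy$raw.windspeed <- toy$windspeed * 67

toy[, c("raw.temp", "raw.atemp", "raw.hum", "raw.windspeed")]
```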

2. Questions

How do the temperatures change across the seasons? What are the mean and median temperatures?
Is there a correlation between the temp/atemp/mean.temp.atemp and the total count of bike rentals?
Is there a difference between the real temperature and the felt temperature? If so, does the difference persist across the seasons?
Is temperature associated with bike rentals (registered vs. casual)?
Can the number of total bike rentals be predicted by holiday and weather?
What are the mean temperature, humidity, windspeed and total rentals per month?
What percentage of days are appropriate for biking concerning the weather with conditions
- A) Temperature > 5°, weather situation 1-3, windspeed < 40 km/h and
- B) Temperature > 10°, weather situation 1-2, windspeed < 20 km/h?

3. Analyses

1. How do the temperatures change across the seasons? What are the mean and median temperatures?

1:spring
2:summer
3:fall
4:winter

First we converted the temperature back to degrees Celsius, because the temperature values in the dataset had been divided by 41. Then we calculated the mean, the median and the standard deviation for each season.

# Converting the normalized temperature:
day$raw.temp <- (day$temp*41)
head(day)

##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt  raw.temp
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985 14.110847
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801 14.902598
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349  8.050924
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562  8.200000
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600  9.305237
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606  8.378268

#2. Calculating Median, Mean and Standard deviation of spring.
spring <- subset(day, season == 1)$raw.temp
sp.mean <- mean(spring) 
sp.median <- median(spring) 
sp.sd <- sd(spring) 


#3. Calculating Median, Mean and Standard deviation of summer.
summer <- subset(day, season == 2)$raw.temp
su.mean <- mean(summer) 
su.median <- median(summer) 
su.sd <- sd(summer) 


#4. Calculating Median, Mean and Standard deviation of fall.
fall <- subset(day, season == 3)$raw.temp
fa.mean <- mean(fall)
fa.median <- median(fall)
fa.sd <- sd(fall)


#5. Calculating Median, Mean and Standard deviation of winter.
winter <- subset(day, season == 4)$raw.temp
wi.mean <- mean(winter)
wi.median <- median(winter)
wi.sd <- sd(winter)

Spring:
- The mean temperature of spring was 12.21.
- The median temperature of spring was 11.72.
- The standard deviation of the temperature in spring was 4.21.
Summer:
- The mean temperature of summer was 22.32.
- The median temperature of summer was 23.05.
- The standard deviation of the temperature in summer was 5.03.
Fall:
- The mean temperature of fall was 28.96.
- The median temperature of fall was 29.3.
- The standard deviation of the temperature in fall was 2.9.
Winter:
- The mean temperature of winter was 17.34.
- The median temperature of winter was 16.78.
- The standard deviation of the temperature in winter was 4.42.
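
The four blocks above repeat the same three calculations for each season. As a sketch of a more compact alternative (not the code used for the results above), the statistics can be computed for all seasons in one pass with split() and sapply(). The data frame toy and its values are made up for illustration:

```r
# Hypothetical toy data: season code and temperature in Celsius.
toy <- data.frame(season = c(1, 1, 1, 2, 2, 2),
                  raw.temp = c(8, 10, 12, 20, 22, 27))

# Split the temperatures by season and summarise each group.
season.stats <- sapply(split(toy$raw.temp, toy$season), function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
})

# One column per season, one row per statistic.
round(season.stats, 2)
```

Applied to the real data, the same call with day$raw.temp and day$season would presumably reproduce the values listed above.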

Next we created a histogram for each season that shows the distribution of temperatures, with lines marking the mean and median temperatures.

#create histogram for the distribution of temperatures in spring.
hist(x = spring, 
     main = "Temperatures in Spring", 
     xlab = "Temperature in Celsius", 
     ylab = "Number of Days",
     xlim = c(0, 25),
     ylim = c(0, 45))

abline(v = sp.mean, lwd = 2, lty = 1, col = "red")  
text(x = 17, y = 35, 
     labels = paste("Mean = ", round(mean(spring),2), sep = ""), col="red" )

abline(v = sp.median, lwd = 2, lty = 3, col = "blue") 
text(x = 6, y = 35, 
     labels = paste("Median = ", round(median(spring),2), sep = ""), col="blue" )

#create a histogram for the distribution of temperatures in summer.
hist(x = summer, 
     main = "Temperatures in Summer", 
     xlab = "Temperature in Celsius", 
     ylab = "Number of Days", 
      xlim = c(0, 35), ylim = c(0, 40)
    )
abline(v = su.mean, lwd = 2, lty = 1, col = "red") 
text( x = 15, y = 40, 
     labels = paste("Mean = ", round(mean(summer),2), sep = ""),
col = "red")

abline(v = su.median, lwd = 2, lty = 3, col = "blue") 
text(x = 31, y = 40, 
     labels = paste("Median = ", round(median(summer),2), sep = ""), col = "blue" )

#create a histogram for the distribution of temperatures in fall. 
hist(x = fall, 
     main = "Temperatures in Fall", 
     xlab = "Temperature in Celsius", 
     ylab = "Number of Days", 
      xlim = c(15, 40), ylim = c(0, 70)
    )
abline(v = fa.mean, lwd = 2, lty = 1, col = "red") 
text(x = 24, y = 60, 
     labels = paste("Mean = ", round(mean(fall),3), sep = ""), col = "red" )

abline(v = fa.median, lwd = 2, lty = 3, col ="blue") 
text(x = 35, y = 60, 
     labels = paste("Median = ", round(median(fall),3), sep = ""), col ="blue" )

#create a histogram for the distribution of temperatures in winter. 
hist(x = winter, 
     main = "Temperatures in Winter", 
     xlab = "Temperature in Celsius", 
     ylab = "Number of Days", 
    xlim = c(0, 30), ylim = c(0, 40)
    )

abline(v = wi.mean, lwd = 2, lty = 1, col = "red")  
text(x = 23, y = 40, 
     labels = paste("Mean = ", round(mean(winter),2), sep = ""), col = "red" )

abline(v = wi.median, lwd = 2, lty = 3, col ="blue") 
text(x = 10, y = 40, 
     labels = paste("Median = ", round(median(winter),2), sep = ""), col ="blue" )
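
The four histogram blocks differ only in the data, the title and the label positions. A possible way to avoid the repetition is a small helper function. The sketch below is an assumption of how such a helper could look: the name hist.with.lines is hypothetical, the temperatures are made up, and a legend replaces the hand-placed text labels:

```r
# Hypothetical helper: histogram with a solid red mean line and a dotted blue median line.
hist.with.lines <- function(x, title) {
  hist(x, main = title,
       xlab = "Temperature in Celsius", ylab = "Number of Days")
  abline(v = mean(x), lwd = 2, lty = 1, col = "red")
  abline(v = median(x), lwd = 2, lty = 3, col = "blue")
  legend("topright",
         legend = c(paste0("Mean = ", round(mean(x), 2)),
                    paste0("Median = ", round(median(x), 2))),
         col = c("red", "blue"), lty = c(1, 3), lwd = 2)
  # Return the plotted statistics invisibly so they can be reused.
  invisible(c(mean = mean(x), median = median(x)))
}

# Toy temperatures, made up for illustration.
toy.temps <- c(8, 9, 11, 12, 12, 13, 15, 18)
stats <- hist.with.lines(toy.temps, "Temperatures (toy data)")
```

With the real data the helper would be called once per season, for example with spring and "Temperatures in Spring".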

2. Is there a correlation between the temp/atemp/mean.temp.atemp and the total count of bike rentals?

First we checked the dataset for integer values, for NA or NULL values and for duplicates. Because the dataset was already recoded and correct, we only had to create the new columns. Afterwards we ran the correlation tests.

#Checking dataset 
#Tests whether the object is of type integer (a data frame as a whole is not, so this returns FALSE)
is.integer(day)

## [1] FALSE

#Tests for NA and NULL
#is.na(day): we ran it, but removed it because of the very long output. There was no NA.
#is.null() tests whether the object itself is NULL
is.null(day)

## [1] FALSE

#Tests for duplicates
#duplicated(day): there were no duplicates.
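
Calling is.integer() or is.null() on the whole data frame only describes the object itself, and is.na() or duplicated() print one value per cell or row. A minimal sketch of checks that give a compact answer instead, shown on a made-up data frame (toy and its values are hypothetical):

```r
# Hypothetical toy data frame with one missing value and one duplicated row.
toy <- data.frame(cnt = c(985L, 801L, 801L, NA),
                  temp = c(0.34, 0.36, 0.36, 0.20))

col.is.integer <- sapply(toy, is.integer)   # type check per column
has.na <- any(is.na(toy))                   # TRUE if any cell is NA
n.duplicates <- sum(duplicated(toy))        # number of repeated rows

col.is.integer
has.na
n.duplicates
```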

Creating a new column.

#The dataset is already recoded and correct.
#For this question we converted "atemp" back, because it had been divided by 50.
day$raw.atemp <-(day$atemp * 50)

#Create a new column of the mean of raw.temp and raw.atemp.
day$raw.mean.temp.atemp <- (day$raw.temp + day$raw.atemp)/2
head(day)

##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt  raw.temp
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985 14.110847
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801 14.902598
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349  8.050924
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562  8.200000
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600  9.305237
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606  8.378268
##   raw.atemp raw.mean.temp.atemp
## 1  18.18125           16.146048
## 2  17.68695           16.294774
## 3   9.47025            8.760587
## 4  10.60610            9.403050
## 5  11.46350           10.384369
## 6  11.66045           10.019359

Correlation Tests.

#Correlation between raw.temp and the total count of bike rentals.

cor.temp <- cor.test(x = day$raw.temp,
y = day$cnt)

cor.temp

## 
##  Pearson's product-moment correlation
## 
## data:  day$raw.temp and day$cnt
## t = 21.7594, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5814369 0.6695422
## sample estimates:
##      cor 
## 0.627494

Temperature <- day$raw.temp
Amount.Rentals <- day$cnt

The correlation was 0.63.

#Correlation between atemp and the total count of bike rentals.
cor.atemp <- cor.test(x = day$raw.atemp,
y = day$cnt)
cor.atemp

## 
##  Pearson's product-moment correlation
## 
## data:  day$raw.atemp and day$cnt
## t = 21.9648, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5853376 0.6727918
## sample estimates:
##       cor 
## 0.6310657

Feeled.Temperature <- day$raw.atemp
Amount.Rentals <- day$cnt

The correlation was 0.63.

#Correlation between mean.temp.atemp and the total count of bike rentals.
day$raw.mean.temp.atemp <-(day$raw.temp + day$raw.atemp)/2
cor.mean.temp.atemp <- cor.test(x = day$raw.mean.temp.atemp,
y = day$cnt)
cor.mean.temp.atemp

## 
##  Pearson's product-moment correlation
## 
## data:  day$raw.mean.temp.atemp and day$cnt
## t = 21.9414, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5848953 0.6724234
## sample estimates:
##       cor 
## 0.6306607

Feeled.Raw.Temperature <- day$raw.mean.temp.atemp
Amount.Rentals <- day$cnt

The correlation was 0.63.
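
The values reported in the text can be read directly from the object that cor.test() returns, which avoids copying them by hand from the printed output. A short sketch with made-up numbers (the vectors temperature and rentals are hypothetical):

```r
# Made-up temperatures (Celsius) and rental counts.
temperature <- c(5, 9, 14, 18, 22, 27, 31)
rentals     <- c(1200, 1900, 3100, 4200, 4800, 5600, 5400)

res <- cor.test(x = temperature, y = rentals)

r  <- unname(res$estimate)   # Pearson correlation
p  <- res$p.value            # p-value of the test
ci <- res$conf.int           # 95 percent confidence interval

round(c(r = r, p = p, lower = ci[1], upper = ci[2]), 3)
```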

Plotting the correlations

par(mfrow=c(2,2))

plot(x = Temperature, y = Amount.Rentals, main = "Correlation", col = "red")
abline(lm(Amount.Rentals ~ Temperature), col = "blue")
legend("topleft", legend = paste("cor = ", round(cor(Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "blue")

plot(x = Feeled.Temperature, y = Amount.Rentals, main = "Correlation", col = "blue")
abline(lm(Amount.Rentals ~ Feeled.Temperature), col = "red")
legend("topleft", legend = paste("cor = ", round(cor(Feeled.Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "red")

plot(x = Feeled.Raw.Temperature, y = Amount.Rentals, main = "Correlation", col = "green")
abline(lm(Amount.Rentals ~ Feeled.Raw.Temperature), col = "orange")
legend("topleft", legend = paste("cor = ", round(cor(Feeled.Raw.Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "orange")


plot(x = 1, y = 1, type = "n", xlab = "Temperature", ylab = "Amount of rentals", xlim = c(0, 40), ylim = c(0, 10000), main = "Three correlations combined")

points(Feeled.Raw.Temperature, Amount.Rentals, pch = 8, col = "green")
points(Temperature, Amount.Rentals, pch = 8, col = "red")
points(Feeled.Temperature, Amount.Rentals, pch = 8, col = "blue")

par(mfrow=c(1,1))

3. Is there a difference between the real temperature and the felt temperature? If so, does the difference persist across the seasons?

A two-sample t-test for real temperature and felt temperature.

test.result.1 <- t.test(x = day$raw.temp, y = day$raw.atemp, alternative = "two.sided")
test.result.1

## 
##  Welch Two Sample t-test
## 
## data:  day$raw.temp and day$raw.atemp
## t = -8.3151, df = 1450.245, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.210643 -2.603203
## sample estimates:
## mean of x mean of y 
##  20.31078  23.71770

hist(day$raw.temp, yaxt = "n", xaxt = "n", xlab = "",
ylab = "", main = "Two Sample t-test", xlim = c(5, 40), col = rgb(0, 0, 1, alpha = .1))
text(x = 13, y = 140, paste("Mean real Temp.\n", round(mean(day$raw.temp), 2), sep = ""), col = "blue")
abline(v = mean(day$raw.temp), lty = 1,
col = rgb(0, 0, 1, alpha = 1), lwd = 4)

par(new = T)
hist(day$raw.atemp, yaxt = "n", xaxt = "n", xlab = "",
ylab = "", main = "", xlim = c(5, 40), col = rgb(1, 0, 0, alpha = .1))

abline(v = mean(day$raw.atemp), lty = 1,
col = rgb(1, 0, 0, alpha = 1), lwd = 4)
text(x = 32, y = 131, paste("Mean felt Temp.\n", round(mean(day$raw.atemp), 2), sep = ""),  col = "red")

mtext(text = "Null hypothesis rejected: true difference in means is not equal to 0", line = 0, side = 3)

The t-value was -8.32.
The means of the x and y variables (real and felt temperature) were 20.31 and 23.72.
The degrees of freedom were 1450.25.
The 95 percent confidence interval for the difference in means is -4.21 to -2.60.
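
The Welch test above treats the two columns as independent samples. Because every day contributes one real and one felt temperature, a paired t-test on the per-day differences is a common alternative. The sketch below only illustrates the difference between the two calls on made-up numbers; it is not part of the analysis reported here, and the vectors real.temp and felt.temp are hypothetical:

```r
# Made-up temperatures for six days (real and felt temperature on the same day).
real.temp <- c(8.2, 14.1, 20.5, 25.3, 29.0, 17.4)
felt.temp <- c(10.6, 18.2, 24.0, 28.9, 32.8, 20.8)

# Welch test: ignores that the values come in pairs.
welch <- t.test(x = real.temp, y = felt.temp, alternative = "two.sided")

# Paired test: works on the per-day differences.
paired <- t.test(x = real.temp, y = felt.temp, paired = TRUE)

welch$parameter    # Welch-adjusted degrees of freedom
paired$parameter   # n - 1 degrees of freedom
paired$estimate    # mean of the per-day differences
```

When the two measurements move together from day to day, the paired version usually yields a much smaller standard error, because the day-to-day variation cancels out in the differences.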

Two Sample t-test across the seasons

# two-sample t-test for real temperature and felt temperature in spring.

temp.spring <- subset(day, subset = season == "1")$raw.temp
atemp.spring <- subset(day, subset = season == "1")$raw.atemp
test.result.spring <- t.test(x = temp.spring, y = atemp.spring, alternative = "two.sided")
test.result.spring

## 
##  Welch Two Sample t-test
## 
## data:  temp.spring and atemp.spring
## t = -5.4597, df = 350.983, p-value = 9.038e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.588345 -1.687750
## sample estimates:
## mean of x mean of y 
##  12.20765  14.84570

# two-sample t-test for real temperature and felt temperature in summer.

temp.summer <- subset(day, subset = season == "2")$raw.temp
atemp.summer <- subset(day, subset = season == "2")$raw.atemp
test.result.summer <- t.test(x = temp.summer, y = atemp.summer, alternative = "two.sided")
test.result.summer

## 
##  Welch Two Sample t-test
## 
## data:  temp.summer and atemp.summer
## t = -6.7914, df = 364.147, p-value = 4.534e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.764599 -2.624911
## sample estimates:
## mean of x mean of y 
##  22.32061  26.01537

# two-sample t-test for real temperature and felt temperature in fall.

temp.fall <- subset(day, subset = season == "3")$raw.temp
atemp.fall <- subset(day, subset = season == "3")$raw.atemp
test.result.fall <- t.test(x = temp.fall, y = atemp.fall, alternative = "two.sided")
test.result.fall

## 
##  Welch Two Sample t-test
## 
## data:  temp.fall and atemp.fall
## t = -11.3657, df = 357.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.500023 -3.172453
## sample estimates:
## mean of x mean of y 
##  28.95868  32.79492

# two-sample t-test for real temperature and felt temperature in winter.

temp.winter <- subset(day, subset = season == "4")$raw.temp
atemp.winter <- subset(day, subset = season == "4")$raw.atemp
test.result.winter <- t.test(x = temp.winter, y = atemp.winter, alternative = "two.sided")
test.result.winter

## 
##  Welch Two Sample t-test
## 
## data:  temp.winter and atemp.winter
## t = -7.0467, df = 351.902, p-value = 9.71e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.397268 -2.478311
## sample estimates:
## mean of x mean of y 
##  17.33915  20.77694

Spring:
- The t-value was -5.46.
- The means of the x and y variables in spring were 12.21 and 14.85.
- The degrees of freedom were 350.98.
- The 95 percent confidence interval is -3.59 to -1.69.
Summer:
- The t-value was -6.79.
- The means of the x and y variables in summer were 22.32 and 26.02.
- The degrees of freedom were 364.15.
- The 95 percent confidence interval is -4.76 to -2.62.
Fall:
- The t-value was -11.37.
- The means of the x and y variables in fall were 28.96 and 32.79.
- The degrees of freedom were 357.9.
- The 95 percent confidence interval is -4.5 to -3.17.
Winter:
- The t-value was -7.05.
- The means of the x and y variables in winter were 17.34 and 20.78.
- The degrees of freedom were 351.9.
- The 95 percent confidence interval is -4.4 to -2.48.
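
The same test is run four times with only the season changing. As a sketch of how the repetition could be avoided, the data can be split by season and the test applied to each part, with the key numbers collected in one table. The data frame toy is made up and uses only two seasons for brevity:

```r
# Made-up data: two seasons, five days each.
toy <- data.frame(
  season    = rep(c(1, 2), each = 5),
  raw.temp  = c(8, 10, 12, 14, 9, 20, 22, 25, 27, 23),
  raw.atemp = c(10, 13, 15, 17, 11, 24, 26, 29, 31, 26)
)

# Run the same Welch test within every season.
season.tests <- lapply(split(toy, toy$season), function(d) {
  t.test(x = d$raw.temp, y = d$raw.atemp, alternative = "two.sided")
})

# Collect t-value, degrees of freedom and p-value in one table (one row per season).
season.table <- t(sapply(season.tests, function(res) {
  c(t = unname(res$statistic), df = unname(res$parameter), p = res$p.value)
}))

round(season.table, 3)
```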

4. Is temperature associated with bike rentals (registered vs. casual)?

# Plotting the association:
plot(x = 1, y = 1, xlab = "Temperature in Celsius", ylab = "Bike rentals", type = "n", main = "Association between temperature and bike rentals",
xlim = c(0, 40), ylim = c(0, 7000))


#Calculating min and max for the x-axis and y-axis:
min(day$raw.temp)

## [1] 2.424346

max(day$raw.temp)

## [1] 35.32835

min(day$casual)

## [1] 2

min(day$registered)

## [1] 20

max(day$casual)

## [1] 3410

max(day$registered)

## [1] 6946

#Adding points to the plot
day$raw.temp <- (day$temp*41)
points(day$raw.temp, day$casual, pch = 16, col = "red")
points(day$raw.temp, day$registered, pch = 16, col = "skyblue")

# Adding a legend to the plot
legend("topleft",legend = c("casual", "registered"), col = c("red","skyblue"), pch = c(16, 16), bg = "white")



# Calculating the correlation between raw.temp and registered users and between raw.temp and casual users
cor.reg <- cor.test(x = day$raw.temp, y = day$registered)
cor.reg

## 
##  Pearson's product-moment correlation
## 
## data:  day$raw.temp and day$registered
## t = 17.3233, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4865508 0.5894440
## sample estimates:
##      cor 
## 0.540012

cor.cas <- cor.test(x = day$raw.temp,
y = day$casual)
cor.cas

## 
##  Pearson's product-moment correlation
## 
## data:  day$raw.temp and day$casual
## t = 17.4721, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4900779 0.5924581
## sample estimates:
##       cor 
## 0.5432847

# Adding the regression lines and the correlation values to the plot
abline(lm(day$registered ~ day$raw.temp), lty = 6, col = "blue")

abline(lm(day$casual ~ day$raw.temp), lty = 6, col = "orange")

reg <- paste("cor = ", round(cor(day$registered, day$raw.temp), 2), sep = "")
cas <- paste("cor = ", round(cor(day$casual, day$raw.temp), 2), sep = "")

legend("left", legend = c(cas, reg), col = c("orange", "blue"), lty = c(6, 6), bg = "white")

# Calculating the max of casual and registered users:
max(day$casual)

## [1] 3410

max(day$registered)

## [1] 6946

5. Can the number of total bike rentals be predicted by holiday and weather?

Coding information:

holiday is coded as:

0 = no holiday

1 = holiday

For this question we recoded the weather situation:

1 = “nice”

2 = “cloudy”

3 = “wet”

4 = “lousy”

Converting weather with “merge”

lookup <- data.frame("numbers" = c("1", "2", "3", "4"),
                     "weather" = c("nice", "cloudy", "wet", "lousy"))

day <- merge(x = day,
             y = lookup,
             by.x = "weathersit",
             by.y = "numbers")

head(day)

##   weathersit instant     dteday season yr mnth holiday weekday workingday
## 1          1     151 2011-05-31      2  0    5       0       2          1
## 2          1      50 2011-02-19      1  0    2       0       6          0
## 3          1     157 2011-06-06      2  0    6       0       1          1
## 4          1     110 2011-04-20      2  0    4       0       3          1
## 5          1       4 2011-01-04      1  0    1       0       2          1
## 6          1     136 2011-05-16      2  0    5       0       1          1
##       temp    atemp      hum windspeed casual registered  cnt raw.temp
## 1 0.775000 0.725383 0.636667  0.111329    673       3309 3982 31.77500
## 2 0.399167 0.391404 0.187917  0.507463    532       1103 1635 16.36585
## 3 0.678333 0.621858 0.600000  0.121896    673       3875 4548 27.81165
## 4 0.595000 0.564392 0.614167  0.241925    613       3331 3944 24.39500
## 5 0.200000 0.212122 0.590435  0.160296    108       1454 1562  8.20000
## 6 0.577500 0.550512 0.787917  0.126871    773       3185 3958 23.67750
##   raw.atemp raw.mean.temp.atemp weather
## 1  36.26915            34.02208    nice
## 2  19.57020            17.96802    nice
## 3  31.09290            29.45228    nice
## 4  28.21960            26.30730    nice
## 5  10.60610             9.40305    nice
## 6  27.52560            25.60155    nice
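
As the output shows, merge() changes the order of the rows. If the original order matters, the codes can also be recoded in place with factor(). The following is a sketch of that alternative on made-up data, not the approach used in this report; toy is a hypothetical stand-in for the real data frame:

```r
# Made-up weather codes and counts.
toy <- data.frame(weathersit = c(2, 1, 3, 1), cnt = c(985, 1562, 441, 1600))

# Recode in place: the row order is kept.
toy$weather <- factor(toy$weathersit,
                      levels = 1:4,
                      labels = c("nice", "cloudy", "wet", "lousy"))

# Make "cloudy" the reference level for later models.
toy$weather <- relevel(toy$weather, ref = "cloudy")

toy
```

The call to relevel() is only needed if a particular reference category is wanted; without it, the first level ("nice") would be the reference in a regression.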

Are weather and holiday good predictors?

#Using linear regression

total.rentals.lm <- lm(cnt ~ holiday + weather, data = day)
summary (total.rentals.lm)

## 
## Call:
## lm(formula = cnt ~ holiday + weather, data = day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4475.9 -1250.2   -40.9  1398.8  4303.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4058.4      117.2  34.619  < 2e-16 ***
## holiday       -929.5      406.8  -2.285   0.0226 *  
## weathernice    848.5      144.7   5.864 6.86e-09 ***
## weatherwet   -2255.2      417.4  -5.403 8.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1836 on 727 degrees of freedom
## Multiple R-squared:  0.1056, Adjusted R-squared:  0.1019 
## F-statistic: 28.61 on 3 and 727 DF,  p-value: < 2.2e-16

# The reference level is cloudy weather, so "nice" and "wet" are each compared to cloudy weather.

The coefficients (Intercept, holiday, weathernice, weatherwet) are 4058.44, -929.48, 848.46 and -2255.16.
The residual degrees of freedom are 727.
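
To make the coefficients concrete: the intercept is the expected number of rentals on a cloudy day that is not a holiday, and the other coefficients are added for the conditions that apply. The short calculation below only uses the four coefficients reported above; the labels in pred are chosen for illustration:

```r
# Coefficients copied from the regression output above.
b <- c(intercept = 4058.44, holiday = -929.48, nice = 848.46, wet = -2255.16)

# Expected rentals for selected combinations of weather and holiday.
pred <- c(
  "cloudy, no holiday" = unname(b["intercept"]),
  "nice, no holiday"   = unname(b["intercept"] + b["nice"]),
  "wet, no holiday"    = unname(b["intercept"] + b["wet"]),
  "wet, holiday"       = unname(b["intercept"] + b["wet"] + b["holiday"])
)

pred
```

In a fitted model the same numbers would normally come from predict() with a small data frame of new cases.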

Is there an effect of weather or not?

# Using an ANOVA
anv.weather <- anova (total.rentals.lm)
anv.weather

## Analysis of Variance Table
## 
## Response: cnt
##            Df     Sum Sq   Mean Sq F value  Pr(>F)    
## holiday     1   12797494  12797494   3.797 0.05173 .  
## weather     2  276444009 138222004  41.010 < 2e-16 ***
## Residuals 727 2450293890   3370418                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#The F-values (holiday, weather, residuals) are:
anv.weather$`F value`

## [1]  3.797005 41.010345        NA

#The p-values (holiday, weather, residuals) are:
anv.weather$`Pr(>F)`

## [1] 5.172871e-02 1.331783e-17           NA
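
Note that anova() on an lm fit reports sequential sums of squares: each term is tested after the terms listed before it in the formula. When the predictors are not balanced, the order of the terms therefore changes the result for each term. A small sketch with made-up, deliberately unbalanced data (toy is hypothetical) shows the effect:

```r
# Made-up, unbalanced data: the two holidays both fall on wet days.
toy <- data.frame(
  cnt     = c(4100, 3900, 4200, 5000, 4900, 5100, 1800, 1700, 900, 1000),
  holiday = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1),
  weather = factor(c("cloudy", "cloudy", "cloudy", "nice", "nice", "nice",
                     "wet", "wet", "wet", "wet"))
)

# Same predictors, different order of terms.
anv.holiday.first <- anova(lm(cnt ~ holiday + weather, data = toy))
anv.weather.first <- anova(lm(cnt ~ weather + holiday, data = toy))

anv.holiday.first
anv.weather.first
```

In this toy example the sum of squares attributed to holiday is much larger when holiday is entered first, because it then also absorbs variation that is due to the wet weather.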

Comparison of the three different types of weather

#Using Tukey-Test
weather.aov <- aov(cnt ~ weather, data = day)
summary(weather.aov)

##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## weather       2 2.716e+08 135822286   40.07 <2e-16 ***
## Residuals   728 2.468e+09   3389960                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(weather.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cnt ~ weather, data = day)
## 
## $weather
##                   diff        lwr       upr p adj
## nice-cloudy   840.9238   500.2174  1181.630 0e+00
## wet-cloudy  -2232.5766 -3215.4542 -1249.699 4e-07
## wet-nice    -3073.5005 -4038.2458 -2108.755 0e+00

6. What are the mean temperature, humidity, windspeed and total rentals per month?

# Months is coded as 1 to 12
# Converting month with "merge"

lookup.month<- data.frame("mnth" = c(1:12),
                          "mnth.name" = c("01Jan", "02Feb", "03March", "04April", "05May", "06June", "07July", "08Aug", "09Sept", "10Oct", "11Nov", "12Dec"), stringsAsFactors = FALSE)

day <- merge(x=day, y= lookup.month, by = 'mnth')


# Convert the normalized windspeed and humidity
day$raw.windspeed <- (day$windspeed*67)
day$raw.hum <- (day$hum * 100)
head(day)

##   mnth weathersit instant     dteday season yr holiday weekday workingday
## 1    1          1      12 2011-01-12      1  0       0       3          1
## 2    1          1     394 2012-01-29      1  1       0       0          0
## 3    1          2       1 2011-01-01      1  0       0       6          0
## 4    1          2     392 2012-01-27      1  1       0       5          1
## 5    1          1       4 2011-01-04      1  0       0       2          1
## 6    1          1       9 2011-01-09      1  0       0       0          0
##       temp    atemp      hum windspeed casual registered  cnt  raw.temp
## 1 0.172727 0.160473 0.599545  0.304627     25       1137 1162  7.081807
## 2 0.282500 0.272721 0.311250  0.240050    558       2685 3243 11.582500
## 3 0.344167 0.363625 0.805833  0.160446    331        654  985 14.110847
## 4 0.425000 0.415383 0.741250  0.342667    269       3187 3456 17.425000
## 5 0.200000 0.212122 0.590435  0.160296    108       1454 1562  8.200000
## 6 0.138333 0.116175 0.434167  0.361950     54        768  822  5.671653
##   raw.atemp raw.mean.temp.atemp weather mnth.name raw.windspeed raw.hum
## 1   8.02365            7.552728    nice     01Jan      20.41001 59.9545
## 2  13.63605           12.609275    nice     01Jan      16.08335 31.1250
## 3  18.18125           16.146048  cloudy     01Jan      10.74988 80.5833
## 4  20.76915           19.097075  cloudy     01Jan      22.95869 74.1250
## 5  10.60610            9.403050    nice     01Jan      10.73983 59.0435
## 6   5.80875            5.740201    nice     01Jan      24.25065 43.4167

#descriptive statistics with the dplyr functions:
require(dplyr)

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

month.agg <- day %>% group_by(mnth.name) %>% summarise(
    mean.temp = mean(raw.temp),
    mean.hum = mean(raw.hum),
    mean.windspeed = mean(raw.windspeed),
    mean.rentals = mean(cnt))

month.agg

## Source: local data frame [12 x 5]
## 
##    mnth.name mean.temp mean.hum mean.windspeed mean.rentals
## 1      01Jan  9.694201 58.58283       13.82229     2176.339
## 2      02Feb 12.268284 56.74647       14.45082     2655.298
## 3    03March 16.012089 58.84750       14.92086     3692.258
## 4    04April 19.269952 58.80631       15.71031     4484.900
## 5      05May 24.386735 68.89583       12.26026     5349.774
## 6     06June 28.047985 57.58055       12.42313     5772.367
## 7     07July 30.974287 59.78763       11.12594     5563.677
## 8      08Aug 29.051844 63.77301       11.58552     5664.419
## 9     09Sept 25.275884 71.47144       11.11832     5766.517
## 10     10Oct 19.885500 69.37609       11.73877     5199.226
## 11     11Nov 15.138010 62.48765       12.31470     4247.183
## 12     12Dec 13.285270 66.60405       11.83280     3403.806
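
The numeric prefixes in names such as "01Jan" are there to keep the months in calendar order when they are sorted as text. A possible alternative, sketched here on made-up data, is a factor whose levels are fixed in calendar order; month.abb is a constant built into R, while toy and its values are hypothetical:

```r
# Made-up counts for three months; mnth uses the same 1-12 coding as the dataset.
toy <- data.frame(mnth = c(12, 1, 2, 1, 12),
                  cnt  = c(3400, 2100, 2600, 2300, 3500))

# Fixing the level order keeps the calendar order without numeric prefixes.
toy$mnth.name <- factor(month.abb[toy$mnth], levels = month.abb)

# Mean rentals per month, reported in level order (Jan, Feb, ..., Dec).
month.means <- aggregate(cnt ~ mnth.name, data = toy, FUN = mean)
month.means
```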

Plotting the different means associated with month

par(mfrow=c(2,2))
barplot(height = month.agg$mean.rentals,
        names.arg = month.agg$mnth.name ,col = "red", main = "Mean rentals" )

barplot(height = month.agg$mean.windspeed,
        names.arg = month.agg$mnth.name,col = "blue", main = "Mean Windspeed (km/h)" )

barplot(height = month.agg$mean.hum,
        names.arg = month.agg$mnth.name,col = "green", main = "Mean Humidity" )


barplot(height = month.agg$mean.temp,
        names.arg = month.agg$mnth.name,col = "skyblue", main = "Mean Temperature" )

par(mfrow=c(1,1))

7. What percentage of days are appropriate for biking concerning the weather with conditions

A) Temperature > 5°, weather situation 1-3, windspeed < 40 km/h and
B) Temperature > 10°, weather situation 1-2, windspeed < 20 km/h?

#Calculating the maximum temperature:

max(day$raw.temp)

## [1] 35.32835

#Creating a custom function on criteria for fine weather for biking
#All three comparisons are strict: weathersit.thresh = 3 keeps weather situations 1-2, weathersit.thresh = 2 keeps weather situation 1 only.

biking.day <- function(temp.thresh, windspeed.thresh, weathersit.thresh) {
  result <- with(day, raw.temp > temp.thresh &
                   raw.windspeed < windspeed.thresh &
                   weathersit < weathersit.thresh)
  return(result)
}

mean(biking.day(5, 40, 3))

## [1] 0.9658003

# with the A) thresholds as coded (weather situation below 3), 97% of days were appropriate for biking

mean(biking.day(10, 20, 2))

## [1] 0.5348837

# with the B) thresholds as coded (weather situation below 2), 53% of days were appropriate for biking
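
The criteria in the question are stated as ranges of weather situations ("1-3" and "1-2"), which corresponds to an inclusive upper bound. The sketch below shows a hypothetical variant of the function that uses an inclusive bound and takes the data as an argument instead of reading the global day object. The name biking.day.incl and the toy data are assumptions for illustration, and the results are not comparable to the percentages above:

```r
# Hypothetical variant: data passed as an argument, weather bound inclusive.
biking.day.incl <- function(data, temp.min, windspeed.max, weathersit.max) {
  with(data, raw.temp > temp.min &
         raw.windspeed < windspeed.max &
         weathersit <= weathersit.max)
}

# Made-up days for illustration.
toy <- data.frame(raw.temp      = c(3, 12, 18, 25),
                  raw.windspeed = c(10, 15, 25, 45),
                  weathersit    = c(1, 2, 3, 1))

share.a <- mean(biking.day.incl(toy, 5, 40, 3))    # criteria A: weather situation 1-3
share.b <- mean(biking.day.incl(toy, 10, 20, 2))   # criteria B: weather situation 1-2

share.a
share.b
```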

4. Conclusion

As the analysis of question no. 1 shows, the temperature is relatively high across all seasons. The lowest mean temperature was in spring (M = 12.21, SD = 4.21) and the highest mean temperature was in fall (M = 28.96, SD = 2.9). In between are summer with a mean temperature of 22.32 degrees Celsius (SD = 5.03) and winter with a mean temperature of 17.34 degrees Celsius (SD = 4.42).

Because of the mild temperatures in spring and winter and the warm weather in summer and fall, we expected temperature to be highly correlated with the total number of bike rentals. In this analysis we correlated the raw temperature (the converted normalized temperature), the converted felt temperature (raw.atemp) and the mean of both with the total count. As the plots of question no. 2 show, all three temperature measures are positively correlated with the total number of bike rentals. The correlation is practically the same for all three measures (cor = 0.63, p < 2.2e-16). The analysis therefore shows a clear relationship between temperature and bike rentals.

Knowing that temperature is associated with the total number of bike rentals, we tested whether the real temperature and the felt temperature differ. The analysis shows a significant difference between the two variables (t = -8.3151, df = 1450.245). The difference remains significant within each season: spring (t = -5.46, df = 350.98), summer (t = -6.79, df = 364.15), fall (t = -11.37, df = 357.9) and winter (t = -7.05, df = 351.9). This could be a cue that temperature plays an important role in bike rentals.

Because of those results we looked at how the rentals of the two user groups (registered vs. casual) depend on temperature. As the plot shows, rentals increase with temperature in both groups. Registered users appear to rent even more bikes than casual users, so we calculated the correlation between each group's rentals and the real temperature. The correlation is the same for both groups (cor = 0.54). The difference in the plot may therefore reflect the unequal number of rentals by registered (max = 6946) and casual (max = 3410) users. Nevertheless, the plot partly confirms the strong role of temperature.

In question 5 we then fitted a linear regression, because we expected weather and holiday to be good predictors of bike rentals. Holiday is a significant negative predictor (estimate = -929.5). Nice weather compared to cloudy weather (the default reference level) is a significant positive predictor (estimate = 848.5), and wet weather compared to cloudy weather is a significant negative predictor (estimate = -2255.2). Lousy weather was not included because there were no bike rentals under this condition.
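The role of the reference level can be sketched as follows. This is a simplified illustration with made-up numbers and weather as the only predictor (the model in the report also includes holiday). R orders factor levels alphabetically by default, so "cloudy" becomes the reference and each coefficient is a difference from the cloudy mean.

```r
# Made-up toy data (NOT taken from the day dataset)
toy <- data.frame(
  weather = factor(rep(c("cloudy", "nice", "wet"), each = 4)),
  cnt     = c(4000, 4200, 3900, 4100,   # cloudy
              4900, 5100, 4800, 5000,   # nice
              1800, 2000, 1700, 1900)   # wet
)

# "cloudy" is the first level and therefore the reference category
levels(toy$weather)

fit <- lm(cnt ~ weather, data = toy)

# (Intercept) = mean rentals on cloudy days (4050)
# weathernice = nice minus cloudy (900)
# weatherwet  = wet minus cloudy (-2200)
coef(fit)
```

A different reference category could be chosen with `relevel(toy$weather, ref = "nice")`, which changes the coefficients but not the fitted values.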

We then ran an ANOVA to test whether weather and holiday have an effect. The ANOVA showed a significant effect for weather (F = 41.010, Df = 2, p < 0.01) but a non-significant effect for holiday (F = 3.797, Df = 1, p = 0.05173). Afterwards we compared the three weather types with TukeyHSD. The mean bike rentals differed significantly in all three pairwise comparisons, “nice-cloudy”, “wet-cloudy” and “wet-nice” (each p < 0.01). Together with the correlation results, this indicates that weather and temperature affect the number of bike rentals.
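The ANOVA and post-hoc steps can be sketched like this. The data are made up for illustration and the model is reduced to the weather factor alone, so the numbers are not comparable to the results reported above.

```r
# Made-up toy data (NOT taken from the day dataset)
toy <- data.frame(
  weather = factor(rep(c("cloudy", "nice", "wet"), each = 4)),
  cnt     = c(4000, 4200, 3900, 4100,
              4900, 5100, 4800, 5000,
              1800, 2000, 1700, 1900)
)

# One-way ANOVA: does mean cnt differ between weather types?
fit <- aov(cnt ~ weather, data = toy)
summary(fit)

# Tukey's honest significant differences for all pairwise comparisons;
# rows are named "nice-cloudy", "wet-cloudy" and "wet-nice"
tuk <- TukeyHSD(fit)
tuk$weather
```

The `diff` column holds the difference in group means for each pair, and `p adj` holds the p-value adjusted for the multiple comparisons.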

In question no. 6 we plotted the mean humidity, mean temperature, mean windspeed and mean total rentals per month. The total number of bike rentals rises with the monthly temperature, whereas the rentals seem to be independent of windspeed and humidity, which are almost constant over the months. This supports both the high correlation between rentals and temperature and the idea that nice weather could be a good predictor.

In question 7 we created a custom function that lets us set our own limits on the weather conditions and find out how many days were suitable for biking under different conditions. The results show that only very few days failed the lenient criteria, i.e. had a temperature of 5 °C or below, a windspeed of at least 40 km/h or wet weather. The distributions in the earlier analyses show that windspeed was generally not very high in Washington, D.C. and that the mean temperatures did not fall below 10 °C. With the stricter limits of more than 10 °C, a windspeed below 20 km/h and the exclusion of wet weather conditions, only about half of the days were suitable for a biking tour. This could be important for future research: with this function one could calculate the number of suitable biking days in different cities. With those results, bike rental systems could be set up in cities with many suitable days, which would in particular save costs.
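To apply the same criteria to other cities, the function would need to take the data as an argument instead of reading the global `day` object. A minimal sketch, assuming each city has a data frame with the same column names; `biking.day.df`, `city.a` and `city.b` are hypothetical names and the values are made up.

```r
# Hypothetical variant of biking.day() that takes the data frame as an argument
biking.day.df <- function(data, temp.thresh, windspeed.thresh, weathersit.thresh) {
  with(data, raw.temp > temp.thresh &
             raw.windspeed < windspeed.thresh &
             weathersit < weathersit.thresh)
}

# Made-up toy data for two cities, only to show the call pattern
city.a <- data.frame(raw.temp      = c(3, 12, 18, 25),
                     raw.windspeed = c(10, 15, 45, 8),
                     weathersit    = c(1, 1, 1, 3))
city.b <- data.frame(raw.temp      = c(11, 14, 20, 22),
                     raw.windspeed = c(5, 12, 18, 9),
                     weathersit    = c(1, 1, 2, 1))

# Share of suitable days per city under the stricter B) thresholds
share <- sapply(list(city.a = city.a, city.b = city.b),
                function(d) mean(biking.day.df(d, 10, 20, 2)))
share
```

With the toy data, one of four days qualifies in `city.a` and three of four in `city.b`, so `share` is 0.25 and 0.75.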

In conclusion, the number of bike rentals depends mainly on the weather and on the real and felt temperature. The analysis shows a positive relationship between bike rentals and temperature, and the plot of question no. 6 shows that the mean number of rentals rises and falls with the temperature. People therefore mainly rent bikes on days with nice weather and pleasant temperatures. This could be important for planning new bike rental stations.

Sabrina Englert

Monday, 03. August 2015
