Import the dataset “day”.
day <- read.csv("~/Documents/Psychologie/6. Semester/Data analysis and visualization (Philips)/Bike-Sharing-Dataset/day.csv", stringsAsFactors=FALSE)
We analyzed the “Bike Sharing Dataset”, which was obtained from the UCI Machine Learning Repository. The UCI Machine Learning Repository is a collection of databases, domain theories and data generators that the machine learning community uses for empirical analyses. The archive was created in 1987 by David Aha and his fellow graduate students at UC Irvine. Since then, it has been widely used by students, educators and researchers. The current website was designed in 2007. The repository depends on donations from researchers, mostly outside of UCI. The “Bike Sharing Dataset” consists of two complete sets of data (hourly and daily counts), of which we chose to analyze the dataset called “day”.
The Bike Sharing Dataset contains the hourly and daily counts of rental bikes in the Capital Bikeshare system over a two-year period (2011-2012), together with the corresponding weather and seasonal information. Capital Bikeshare has over 350 stations in Washington, D.C., Arlington and Alexandria, VA, and Montgomery County, MD. Bike sharing systems are a new way of renting bikes in bigger cities. The entire process, from becoming a registered user to renting and returning bikes, has become completely automatic. According to the dataset documentation, there are more than 500 bike-sharing programs worldwide; the data analyzed here come from Capital Bikeshare and were compiled by the Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto.
This dataset originally consisted of 16 columns and 731 rows. In the course of the analysis we generated further columns.
nrow(day)
## [1] 731
ncol(day)
## [1] 16
dim(day)
## [1] 731 16
The dataset contains the following columns, which are briefly described below the output:
names(day)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "holiday" "weekday" "workingday" "weathersit" "temp"
## [11] "atemp" "hum" "windspeed" "casual" "registered"
## [16] "cnt"
Original columns:
instant: record index
dteday : date
season : season (1:spring, 2:summer, 3:fall, 4:winter)
yr : year (0: 2011, 1:2012)
mnth : month ( 1 to 12)
holiday : whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday : day of the week
workingday : coded as 1 if the day is a regular working day, and as 0 if it is a weekend or holiday.
weathersit : weather situation (1 to 4; the coding is described under question 3)
temp: Normalized temperature in Celsius. The values are divided by 41 (max)
atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered users
New columns: As mentioned above, we have added a number of additional columns:
mnth.name: names of the months
raw.temp: temperature in Celsius, converted back from the normalized values
raw.windspeed: windspeed, converted back from the normalized values
raw.hum: humidity, converted back from the normalized values
raw.atemp: feeling temperature in Celsius, converted back from the normalized values
raw.mean.temp.atemp: mean of the converted temperature (raw.temp) and the converted feeling temperature (raw.atemp) in Celsius
weather: weather situation recoded as a label (nice, cloudy, wet, lousy)
How do the temperatures change across the seasons? What are the mean and median temperatures?
Is there a correlation between temp/atemp/mean.temp.atemp and the total count of bike rentals?
Do registered users rent more bikes than casual users in weather situations no. 1 to no. 3?
Is the temperature associated with bike rentals (registered vs. casual)?
Can the number of total bike rentals be predicted by whether or not it is a holiday and by the weather?
What is the mean temperature, humidity, windspeed and total count of rentals per month?
What percentage of days are appropriate for biking, given the weather?
1. How do the temperatures change across the seasons? What are the mean and median temperatures?
1:spring
2:summer
3:fall
4:winter
First we converted the normalized temperatures back to degrees Celsius by multiplying them by 41, because the original values had been divided by 41. Secondly we calculated the mean, the median and the standard deviation of the temperature for each season.
# Converting the normalized temperature:
day$raw.temp <- (day$temp*41)
head(day)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt raw.temp
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985 14.110847
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801 14.902598
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349 8.050924
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562 8.200000
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600 9.305237
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606 8.378268
#1. Calculating Median, Mean and Standard deviation of spring.
spring <- subset(day, season == 1)$raw.temp
sp.mean <- mean(spring) #calculate the mean temperature of season 1 = spring
sp.median <- median(spring) #calculate the median temperature of season 1 = spring
sp.sd <- sd(spring) #calculate the standard deviation of temperatures in season 1 = spring
#2. Calculating Median, Mean and Standard deviation of summer.
summer <- subset(day, season == 2)$raw.temp
su.mean <- mean(summer) #calculate the mean temperature of season 2 = summer
su.median <- median(summer) #calculate the median temperature of season 2 = summer
su.sd <- sd(summer) #calculate the standard deviation of temperatures in season 2 = summer
#3. Calculating Median, Mean and Standard deviation of fall.
fall <- subset(day, season == 3)$raw.temp
fa.mean <- mean(fall)
fa.median <-median(fall)
fa.sd <- sd(fall)
#4. Calculating Median, Mean and Standard deviation of winter.
winter <- subset(day, season == 4)$raw.temp
wi.mean <- mean(winter)
wi.median <- median(winter)
wi.sd <- sd(winter)
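The four blocks above repeat the same three calculations for each season. A minimal sketch of how the same statistics could be obtained in one step is shown here; the data frame `toy` and its values are synthetic stand-ins for `day`, and `season.stats` is a hypothetical name that is not used elsewhere in this analysis.

```r
# Synthetic stand-in for the `day` data frame (hypothetical values, not the real data)
set.seed(1)
toy <- data.frame(
  season   = rep(1:4, each = 5),
  raw.temp = c(rnorm(5, 12, 4), rnorm(5, 22, 5), rnorm(5, 29, 3), rnorm(5, 17, 4))
)

# Split the temperatures by season and compute all three statistics per group.
# The result is a matrix with one row per statistic and one column per season.
season.stats <- sapply(split(toy$raw.temp, toy$season),
                       function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
round(season.stats, 2)
```

Applied to the real data, `split(day$raw.temp, day$season)` would take the place of the synthetic input.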
Thirdly we created a histogram of the temperatures for each season, including lines for the mean and median temperatures.
#create a histogram for the distribution of temperatures in spring.
hist(x = spring,
main = "Temperatures in Spring",
xlab = "Temperature in Celsius",
ylab = "Number of Days",
xlim = c(0, 25),
ylim = c(0, 45))
abline(v = sp.mean, lwd = 2, lty = 1, col = "red") #include line for the mean
text(x = 17, y = 35,
labels = paste("Mean = ", round(mean(spring),2), sep = ""), col="red" )
abline(v = sp.median, lwd = 2, lty = 3, col = "blue") #include dotted line for the median
text(x = 6, y = 35,
labels = paste("Median = ", round(median(spring),2), sep = ""), col="blue" )
#create a histogram for the distribution of temperatures in summer.
hist(x = summer,
main = "Temperatures in Summer",
xlab = "Temperature in Celsius",
ylab = "Number of Days",
xlim = c(0, 35), ylim = c(0, 40)
)
abline(v = su.mean, lwd = 2, lty = 1, col = "red") #include line for the mean
text( x = 15, y = 40,
labels = paste("Mean = ", round(mean(summer),2), sep = ""),
col = "red")
abline(v = su.median, lwd = 2, lty = 3, col = "blue") #include dotted line for the median
text(x = 31, y = 40,
labels = paste("Median = ", round(median(summer),2), sep = ""), col = "blue" )
#create a histogram for the distribution of temperatures in fall.
hist(x = fall,
main = "Temperatures in Fall",
xlab = "Temperature in Celsius",
ylab = "Number of Days",
xlim = c(15, 40), ylim = c(0, 75)
)
abline(v = fa.mean, lwd = 2, lty = 1, col = "red") #include line for the mean
text(x = 24, y = 40,
labels = paste("Mean = ", round(mean(fall),3), sep = ""), col = "red" )
abline(v = fa.median, lwd = 2, lty = 3, col ="blue") #include dotted line for the median
text(x = 35, y = 40,
labels = paste("Median = ", round(median(fall),3), sep = ""), col ="blue" )
#create a histogram for the distribution of temperatures in winter.
hist(x = winter,
main = "Temperatures in Winter",
xlab = "Temperature in Celsius",
ylab = "Number of Days",
xlim = c(0, 30), ylim = c(0, 40)
)
abline(v = wi.mean, lwd = 2, lty = 1, col = "red") #include line for the mean
text(x = 23, y = 40,
labels = paste("Mean = ", round(mean(winter),2), sep = ""), col = "red" )
abline(v = wi.median, lwd = 2, lty = 3, col ="blue") #include dotted line for the median
text(x = 10, y = 40,
labels = paste("Median = ", round(median(winter),2), sep = ""), col ="blue" )
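The four histograms differ only in the data, the title and the axis limits, so the plotting steps could be wrapped in a small helper. The following is a sketch under that assumption; `temp.hist` is a hypothetical function name, the temperatures are synthetic, and a legend is used instead of manually positioned text labels.

```r
# Hypothetical helper: histogram with a solid red line for the mean
# and a dotted blue line for the median
temp.hist <- function(temps, season.name, ...) {
  h <- hist(x = temps,
            main = paste("Temperatures in", season.name),
            xlab = "Temperature in Celsius",
            ylab = "Number of Days", ...)
  abline(v = mean(temps), lwd = 2, lty = 1, col = "red")
  abline(v = median(temps), lwd = 2, lty = 3, col = "blue")
  legend("topright", bty = "n", lty = c(1, 3), lwd = 2, col = c("red", "blue"),
         legend = c(paste0("Mean = ", round(mean(temps), 2)),
                    paste0("Median = ", round(median(temps), 2))))
  invisible(h)  # return the histogram object without printing it
}

set.seed(2)
example.temps <- rnorm(180, mean = 12, sd = 4)  # synthetic values, not the real data
h <- temp.hist(example.temps, "Spring")
```

Additional arguments such as `xlim` or `ylim` are passed on to `hist()` through `...`, so the per-season axis limits used above could still be set.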
2. Is there a correlation between the temp/atemp/mean.temp.atemp and the total count of bike rentals?
First we checked the dataset for column types, NA or NULL values and duplicates. Because the dataset was already recoded and contained no missing values or duplicates, we only created new columns. Afterwards we ran correlation tests.
#Check dataset
#is.integer() tests the type of the whole object; a data frame is a list, so the result is FALSE regardless of the column types
is.integer(day)
## [1] FALSE
#Test for NA values
#is.na(day): we ran it but removed the very long output. There was no "NA".
#is.null() tests whether the object itself is NULL, not whether it contains missing values
is.null(day)
## [1] FALSE
#Test for duplicates
#duplicated(day): there were no duplicates.
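Checks of this kind can also be run per column and summarized in a single value, which avoids the long output mentioned above. The following sketch uses a small synthetic data frame `toy` (hypothetical values, with one missing value inserted on purpose) in place of `day`.

```r
# Small synthetic data frame standing in for `day` (hypothetical values)
toy <- data.frame(
  instant = 1:4,
  dteday  = c("2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04"),
  temp    = c(0.34, 0.36, NA, 0.20),
  cnt     = c(985L, 801L, 1349L, 1562L),
  stringsAsFactors = FALSE
)

sapply(toy, class)    # type of each column
colSums(is.na(toy))   # number of missing values per column
anyNA(toy)            # TRUE if any value is missing
anyDuplicated(toy)    # 0 if no row is duplicated, otherwise the index of the first duplicate
```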
Creating a new column.
#The dataset is already recoded and correct.
#For this question we converted "atemp" back to Celsius because it had been divided by 50.
day$raw.atemp <-(day$atemp * 50)
#Create a new column of the means of raw.temp and raw.atemp.
day$raw.mean.temp.atemp <- (day$raw.temp + day$raw.atemp)/2
head(day)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt raw.temp
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985 14.110847
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801 14.902598
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349 8.050924
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562 8.200000
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600 9.305237
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606 8.378268
## raw.atemp raw.mean.temp.atemp
## 1 18.18125 16.146048
## 2 17.68695 16.294774
## 3 9.47025 8.760587
## 4 10.60610 9.403050
## 5 11.46350 10.384369
## 6 11.66045 10.019359
Correlation Tests.
#Correlation between raw.temp and the total count of bike rentals.
cor.temp <- cor.test(x = day$raw.temp,
y = day$cnt)
cor.temp
##
## Pearson's product-moment correlation
##
## data: day$raw.temp and day$cnt
## t = 21.7594, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5814369 0.6695422
## sample estimates:
## cor
## 0.627494
Temperature <- day$raw.temp
Amount.Rentals <- day$cnt
plot(x = Temperature, y = Amount.Rentals, main = "Correlation", col = "turquoise3")
abline(lm(Amount.Rentals ~ Temperature), col = "blue")
legend("topleft", legend = paste("cor = ", round(cor(Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "blue")
The correlation was 0.63 (95% confidence interval: 0.58 to 0.67, p-value < 0.001).
#Correlation between atemp and the total count of bike rentals.
cor.atemp <- cor.test(x = day$raw.atemp,
y = day$cnt)
cor.atemp
##
## Pearson's product-moment correlation
##
## data: day$raw.atemp and day$cnt
## t = 21.9648, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5853376 0.6727918
## sample estimates:
## cor
## 0.6310657
Feeled.Temperature <- day$raw.atemp
Amount.Rentals <- day$cnt
plot(x = Feeled.Temperature, y = Amount.Rentals, main = "Correlation", col = "paleturquoise3")
abline(lm(Amount.Rentals ~ Feeled.Temperature), col = "blue")
legend("topleft", legend = paste("cor = ", round(cor(Feeled.Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "blue")
The correlation was 0.63 (95% confidence interval: 0.59 to 0.67, p-value < 0.001).
#Correlation between mean.temp.atemp and the total count of bike rentals.
day$raw.mean.temp.atemp <-(day$raw.temp + day$raw.atemp)/2
cor.mean.temp.atemp <- cor.test(x = day$raw.mean.temp.atemp,
y = day$cnt)
cor.mean.temp.atemp
##
## Pearson's product-moment correlation
##
## data: day$raw.mean.temp.atemp and day$cnt
## t = 21.9414, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5848953 0.6724234
## sample estimates:
## cor
## 0.6306607
#Plot
Feeled.Raw.Temperature <- day$raw.mean.temp.atemp
Amount.Rentals <- day$cnt
plot(x = Feeled.Raw.Temperature, y = Amount.Rentals, main = "Correlation", col = "paleturquoise4")
abline(lm(Amount.Rentals ~ Feeled.Raw.Temperature), col = "blue")
legend("topleft", legend = paste("cor = ", round(cor(Feeled.Raw.Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "blue")
The correlation was 0.63 (95% confidence interval: 0.58 to 0.67, p-value < 0.001).
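Because the three tests have the same structure, their key results could be collected in one table. The sketch below illustrates this with synthetic data; `toy`, `temp.vars` and `cor.table` are hypothetical names, and the simulated values are not meant to reproduce the results reported above.

```r
# Synthetic stand-in for `day` (hypothetical values)
set.seed(3)
n <- 200
toy <- data.frame(raw.temp = runif(n, 2, 35))
toy$raw.atemp <- toy$raw.temp * 1.1 + rnorm(n, sd = 1)
toy$raw.mean.temp.atemp <- (toy$raw.temp + toy$raw.atemp) / 2
toy$cnt <- round(1000 + 120 * toy$raw.temp + rnorm(n, sd = 1200))

# Run cor.test() for each temperature column and keep the estimate,
# the 95% confidence interval and the p-value
temp.vars <- c("raw.temp", "raw.atemp", "raw.mean.temp.atemp")
cor.table <- t(sapply(temp.vars, function(v) {
  res <- cor.test(x = toy[[v]], y = toy$cnt)
  c(r = unname(res$estimate),
    lower = res$conf.int[1],
    upper = res$conf.int[2],
    p = res$p.value)
}))
round(cor.table, 3)
```

Each row of `cor.table` corresponds to one temperature measure, which makes it easy to see how similar the three correlations are.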
3. Do registered users rent more bikes than casual users in weather situations no. 1 to no. 3?
This question was tested with a separate two-sample t-test (Welch) for each weather situation.
Weathersit is coded as:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
# two-sample-t-test for weathersit no. 1:
weathersit.1.reg <- subset(day, subset = weathersit == "1")$registered
weathersit.1.cas <- subset(day, subset = weathersit == "1")$casual
test.result.1 <- t.test(x = weathersit.1.reg, y = weathersit.1.cas, alternative = "two.sided")
test.result.1
##
## Welch Two Sample t-test
##
## data: weathersit.1.reg and weathersit.1.cas
## t = 37.638, df = 646.784, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2794.886 3102.566
## sample estimates:
## mean of x mean of y
## 3912.7559 964.0302
# two-sample-t-test for weathersit no. 2:
weathersit.2.reg <- subset(day, subset = weathersit == "2")$registered
weathersit.2.cas <- subset(day, subset = weathersit == "2")$casual
test.result.2 <- t.test(x = weathersit.2.reg, y = weathersit.2.cas, alternative = "two.sided")
test.result.2
##
## Welch Two Sample t-test
##
## data: weathersit.2.reg and weathersit.2.cas
## t = 26.3186, df = 331.301, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2462.253 2860.062
## sample estimates:
## mean of x mean of y
## 3348.5101 687.3522
# two-sample-t-test for weathersit no. 3:
weathersit.3.reg <- subset(day, subset = weathersit == "3")$registered
weathersit.3.cas <- subset(day, subset = weathersit == "3")$casual
test.result.3 <- t.test(x = weathersit.3.reg, y = weathersit.3.cas, alternative = "two.sided")
test.result.3
##
## Welch Two Sample t-test
##
## data: weathersit.3.reg and weathersit.3.cas
## t = 5.9687, df = 22.379, p-value = 4.89e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 935.1423 1929.5243
## sample estimates:
## mean of x mean of y
## 1617.8095 185.4762
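The three tests above follow the same pattern, so they could be run in a loop over the weather situations. The following is a sketch with synthetic data; `toy` and `t.by.weather` are hypothetical names and the simulated counts do not reproduce the results above.

```r
# Synthetic stand-in for `day` (hypothetical values)
set.seed(4)
toy <- data.frame(weathersit = rep(1:3, times = c(40, 25, 10)))
toy$registered <- round(rnorm(nrow(toy), mean = 4500 - 800 * toy$weathersit, sd = 900))
toy$casual     <- round(rnorm(nrow(toy), mean = 1300 - 350 * toy$weathersit, sd = 300))

# One Welch two-sample t-test per weather situation; keep the key results
t.by.weather <- t(sapply(split(toy, toy$weathersit), function(d) {
  res <- t.test(x = d$registered, y = d$casual, alternative = "two.sided")
  c(t = unname(res$statistic),
    df = unname(res$parameter),
    p = res$p.value,
    mean.registered = mean(d$registered),
    mean.casual = mean(d$casual))
}))
t.by.weather
```

Since both counts are recorded on the same days, passing `paired = TRUE` to `t.test()` would be an alternative worth considering; the sketch keeps the Welch test used in this analysis.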
4. Is the temperature associated with bike rentals (registered vs. casual)?
#Is the temperature associated with bike rentals (registered vs. casual)
plot(x = 1, y = 1, xlab = "Temperature in Celsius", ylab = "Bike rentals", type = "n", main = "Association between temperature and bike rentals",
xlim = c(0, 40), ylim = c(0, 7000))
#Calculating min and max for the x-axis and y-axis
min(day$raw.temp)
## [1] 2.424346
max(day$raw.temp)
## [1] 35.32835
min(day$casual)
## [1] 2
min(day$registered)
## [1] 20
max(day$casual)
## [1] 3410
max(day$registered)
## [1] 6946
#Adding points to the plot
day$raw.temp <- (day$temp*41)
points(day$raw.temp, day$casual, pch = 16, col = "red")
points(day$raw.temp, day$registered, pch = 16, col = "skyblue")
# Adding a legend to the plot
legend("topleft",legend = c("casual", "registered"), col = c('red', 'skyblue'),pch = c(16, 16), bg = "white")
5. Can the number of total bike rentals be predicted by holiday and weather?
Coding information:
holiday is coded as: 0 = no holiday and 1 = holiday
For this question we recoded the weather situation:
1 = “nice”
2 = “cloudy”
3 = “wet”
4 = “lousy”
Recoding weather with “merge”
lookup <- data.frame("numbers"=c("1","2","3","4"),
"weather"=c("nice","cloudy", "wet", "lousy")
)
day <- merge(x = day,
y = lookup,
by.x = "weathersit",
by.y = "numbers"
)
head(day)
## weathersit instant dteday season yr mnth holiday weekday workingday
## 1 1 151 2011-05-31 2 0 5 0 2 1
## 2 1 50 2011-02-19 1 0 2 0 6 0
## 3 1 157 2011-06-06 2 0 6 0 1 1
## 4 1 110 2011-04-20 2 0 4 0 3 1
## 5 1 4 2011-01-04 1 0 1 0 2 1
## 6 1 136 2011-05-16 2 0 5 0 1 1
## temp atemp hum windspeed casual registered cnt raw.temp
## 1 0.775000 0.725383 0.636667 0.111329 673 3309 3982 31.77500
## 2 0.399167 0.391404 0.187917 0.507463 532 1103 1635 16.36585
## 3 0.678333 0.621858 0.600000 0.121896 673 3875 4548 27.81165
## 4 0.595000 0.564392 0.614167 0.241925 613 3331 3944 24.39500
## 5 0.200000 0.212122 0.590435 0.160296 108 1454 1562 8.20000
## 6 0.577500 0.550512 0.787917 0.126871 773 3185 3958 23.67750
## raw.atemp raw.mean.temp.atemp weather
## 1 36.26915 34.02208 nice
## 2 19.57020 17.96802 nice
## 3 31.09290 29.45228 nice
## 4 28.21960 26.30730 nice
## 5 10.60610 9.40305 nice
## 6 27.52560 25.60155 nice
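As the output shows, merge() sorts the rows by the key column, so the original date order of the data frame is lost. If the row order matters, the labels could instead be attached with factor(). This is a sketch with a synthetic data frame `toy` (hypothetical values), not the method used in this analysis.

```r
# Synthetic stand-in for `day` (hypothetical values)
toy <- data.frame(instant = 1:6, weathersit = c(2, 2, 1, 1, 3, 1))

# Attach the labels without changing the row order
toy$weather <- factor(toy$weathersit,
                      levels = 1:4,
                      labels = c("nice", "cloudy", "wet", "lousy"))
toy
table(toy$weather)  # "lousy" is kept as a level with a count of 0
```

One design consequence to keep in mind: the factor levels are now ordered nice, cloudy, wet, lousy rather than alphabetically, so a regression on this column would use "nice" as the reference level unless it is changed with relevel().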
Are weather and holiday good predictors?
#Using linear regression
total.rentals.lm <- lm(cnt ~ holiday + weather, data = day)
summary (total.rentals.lm)
##
## Call:
## lm(formula = cnt ~ holiday + weather, data = day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4475.9 -1250.2 -40.9 1398.8 4303.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4058.4 117.2 34.619 < 2e-16 ***
## holiday -929.5 406.8 -2.285 0.0226 *
## weathernice 848.5 144.7 5.864 6.86e-09 ***
## weatherwet -2255.2 417.4 -5.403 8.91e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1836 on 727 degrees of freedom
## Multiple R-squared: 0.1056, Adjusted R-squared: 0.1019
## F-statistic: 28.61 on 3 and 727 DF, p-value: < 2.2e-16
# cloudy weather is the default -> "nice" and "wet" weather is compared to cloudy weather.
The coefficients (Intercept, holiday, weathernice, weatherwet) were 4058.44, -929.48, 848.46, -2255.16.
The residual degrees of freedom were 727.
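The reason cloudy weather is the default is that lm() uses the first factor level as the reference category, and levels are sorted alphabetically unless stated otherwise. The sketch below illustrates this with synthetic data; `toy`, `weather2`, `fit.cloudy` and `fit.nice` are hypothetical names, and holiday is left out to keep the example short.

```r
# Synthetic stand-in for `day` (hypothetical values)
set.seed(5)
toy <- data.frame(
  weather = rep(c("nice", "cloudy", "wet"), times = c(30, 20, 10)),
  stringsAsFactors = TRUE
)
toy$cnt <- round(4000 + 800 * (toy$weather == "nice") -
                 2200 * (toy$weather == "wet") +
                 rnorm(nrow(toy), sd = 500))

levels(toy$weather)            # alphabetical order, so "cloudy" comes first
fit.cloudy <- lm(cnt ~ weather, data = toy)
names(coef(fit.cloudy))        # "(Intercept)" "weathernice" "weatherwet"

# Choosing a different reference level changes what the coefficients compare against
toy$weather2 <- relevel(toy$weather, ref = "nice")
fit.nice <- lm(cnt ~ weather2, data = toy)
names(coef(fit.nice))          # "(Intercept)" "weather2cloudy" "weather2wet"
```

In a model with only this one factor, the intercept equals the mean of the reference group and each coefficient is the difference between a group mean and that reference mean.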
Is there an effect of weather or not?
#using an ANOVA
anv.weather <- anova (total.rentals.lm)
anv.weather
## Analysis of Variance Table
##
## Response: cnt
## Df Sum Sq Mean Sq F value Pr(>F)
## holiday 1 12797494 12797494 3.797 0.05173 .
## weather 2 276444009 138222004 41.010 < 2e-16 ***
## Residuals 727 2450293890 3370418
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The F-values (holiday, weather, residuals) are:
anv.weather$`F value`
## [1] 3.797005 41.010345 NA
#The p-values (holiday, weather, residuals) are:
anv.weather$`Pr(>F)`
## [1] 5.172871e-02 1.331783e-17 NA
Comparison of the three different types of weather
#Using the Tukey-Test
weather.aov <- aov(cnt ~ weather, data = day)
summary(weather.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## weather 2 2.716e+08 135822286 40.07 <2e-16 ***
## Residuals 728 2.468e+09 3389960
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(weather.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cnt ~ weather, data = day)
##
## $weather
## diff lwr upr p adj
## nice-cloudy 840.9238 500.2174 1181.630 0e+00
## wet-cloudy -2232.5766 -3215.4542 -1249.699 4e-07
## wet-nice -3073.5005 -4038.2458 -2108.755 0e+00
6. What is the mean temperature, humidity, windspeed and total number of rentals per month?
# Months are coded as 1 to 12
# Recode months with the "merge" function
lookup.month <- data.frame("mnth" = c(1:12),
"mnth.name" = c("01Jan", "02Feb", "03March", "04April", "05May", "06June", "07July", "08Aug", "09Sept", "10Oct", "11Nov", "12Dec"), stringsAsFactors = FALSE)
day <- merge(x=day, y= lookup.month, by = 'mnth')
# Convert the normalized windspeed and humidity
day$raw.windspeed <- (day$windspeed*67)
day$raw.hum <- (day$hum * 100)
head(day)
## mnth weathersit instant dteday season yr holiday weekday workingday
## 1 1 1 12 2011-01-12 1 0 0 3 1
## 2 1 1 394 2012-01-29 1 1 0 0 0
## 3 1 2 1 2011-01-01 1 0 0 6 0
## 4 1 2 392 2012-01-27 1 1 0 5 1
## 5 1 1 4 2011-01-04 1 0 0 2 1
## 6 1 1 9 2011-01-09 1 0 0 0 0
## temp atemp hum windspeed casual registered cnt raw.temp
## 1 0.172727 0.160473 0.599545 0.304627 25 1137 1162 7.081807
## 2 0.282500 0.272721 0.311250 0.240050 558 2685 3243 11.582500
## 3 0.344167 0.363625 0.805833 0.160446 331 654 985 14.110847
## 4 0.425000 0.415383 0.741250 0.342667 269 3187 3456 17.425000
## 5 0.200000 0.212122 0.590435 0.160296 108 1454 1562 8.200000
## 6 0.138333 0.116175 0.434167 0.361950 54 768 822 5.671653
## raw.atemp raw.mean.temp.atemp weather mnth.name raw.windspeed raw.hum
## 1 8.02365 7.552728 nice 01Jan 20.41001 59.9545
## 2 13.63605 12.609275 nice 01Jan 16.08335 31.1250
## 3 18.18125 16.146048 cloudy 01Jan 10.74988 80.5833
## 4 20.76915 19.097075 cloudy 01Jan 22.95869 74.1250
## 5 10.60610 9.403050 nice 01Jan 10.73983 59.0435
## 6 5.80875 5.740201 nice 01Jan 24.25065 43.4167
#descriptive statistics with the dplyr-functions:
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
month.agg <- day %>% group_by(mnth.name) %>% summarise(
mean.temp = mean(raw.temp),
mean.hum = mean(raw.hum),
mean.windspeed = mean(raw.windspeed),
mean.rentals = mean(cnt))
month.agg
## Source: local data frame [12 x 5]
##
## mnth.name mean.temp mean.hum mean.windspeed mean.rentals
## 1 01Jan 9.694201 58.58283 13.82229 2176.339
## 2 02Feb 12.268284 56.74647 14.45082 2655.298
## 3 03March 16.012089 58.84750 14.92086 3692.258
## 4 04April 19.269952 58.80631 15.71031 4484.900
## 5 05May 24.386735 68.89583 12.26026 5349.774
## 6 06June 28.047985 57.58055 12.42313 5772.367
## 7 07July 30.974287 59.78763 11.12594 5563.677
## 8 08Aug 29.051844 63.77301 11.58552 5664.419
## 9 09Sept 25.275884 71.47144 11.11832 5766.517
## 10 10Oct 19.885500 69.37609 11.73877 5199.226
## 11 11Nov 15.138010 62.48765 12.31470 4247.183
## 12 12Dec 13.285270 66.60405 11.83280 3403.806
Plotting the different means associated with month
par(mfrow=c(2,2))
barplot(height = month.agg$mean.rentals,
names.arg = month.agg$mnth.name ,col = "coral", main = "Mean rentals" )
barplot(height = month.agg$mean.windspeed,
names.arg = month.agg$mnth.name,col = "brown1", main = "Mean Windspeed (km/h)" )
barplot(height = month.agg$mean.hum,
names.arg = month.agg$mnth.name,col = "brown3", main = "Mean Humidity" )
barplot(height = month.agg$mean.temp,
names.arg = month.agg$mnth.name,col = "brown4", main = "Mean Temperature" )
par(mfrow=c(1,1))
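The numeric prefixes in the month names ("01Jan", "02Feb", ...) keep the months in calendar order, because character values are sorted alphabetically. A factor with explicit levels achieves the same ordering without the prefixes. The following is a sketch with synthetic data; `toy`, `mnth.fac` and `month.means` are hypothetical names, and `month.abb` is the built-in vector of English month abbreviations.

```r
# Synthetic stand-in for `day` (hypothetical values)
toy <- data.frame(mnth = c(12, 1, 3, 1, 12, 7),
                  cnt  = c(3400, 2100, 3700, 2300, 3300, 5600))

# The factor levels fix the calendar order of the months
toy$mnth.fac <- factor(toy$mnth, levels = 1:12, labels = month.abb)

# Mean rentals per month; months without data yield NA and are dropped
month.means <- tapply(toy$cnt, toy$mnth.fac, mean)
month.means <- month.means[!is.na(month.means)]
barplot(height = month.means, main = "Mean rentals")
```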
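7. What percentage of days are appropriate for biking with regard to the weather under the following conditions?
A) temperature above 5 degrees Celsius, windspeed below 40 and weather situation below 3 (i.e. 1 or 2)
B) temperature above 10 degrees Celsius, windspeed below 20 and weather situation below 2 (i.e. 1)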
#Converting the normalized values
max(day$raw.temp)
## [1] 35.32835
#Creating a custom function based on the criteria for appropriate weather for biking
biking.day <- function (temp.thresh, windspeed.thresh, weathersit.thresh)
{result <- with (day, raw.temp > temp.thresh &
raw.windspeed < windspeed.thresh &
weathersit < weathersit.thresh)
return(result)}
mean(biking.day(5, 40, 3))
## [1] 0.9658003
# with A) weather conditions 97% of days were appropriate for biking
mean(biking.day(10, 20, 2))
## [1] 0.5348837
# with B) weather conditions 53% of days were appropriate for biking
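The function above reads the data frame `day` from the global environment. A variant that takes the data as an argument can be tried out on any data frame with the same columns. This is a sketch; `biking.day2` and the synthetic data frame `toy` are hypothetical and not part of the analysis above.

```r
# Hypothetical variant of biking.day() that receives the data as an argument
biking.day2 <- function(data, temp.thresh, windspeed.thresh, weathersit.thresh) {
  with(data, raw.temp > temp.thresh &
             raw.windspeed < windspeed.thresh &
             weathersit < weathersit.thresh)
}

# Synthetic stand-in for `day` (hypothetical values)
toy <- data.frame(
  raw.temp      = c(4, 12, 20, 30, 8),
  raw.windspeed = c(10, 25, 15, 5, 45),
  weathersit    = c(1, 1, 2, 1, 3)
)

mean(biking.day2(toy, 5, 40, 3))   # 0.6: three of five days meet the looser criteria
mean(biking.day2(toy, 10, 20, 2))  # 0.2: one of five days meets the stricter criteria
```

Taking the mean of the logical vector gives the proportion of days that satisfy all three criteria, which is the same idea as in the calls above.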
As we can see in the analysis of question no. 1, the temperature is relatively high across all seasons. The lowest mean temperature was found in spring (M=12.21, SD=4.21) and the highest mean temperature was found in fall (M=28.96, SD=2.9). Summer had a mean temperature of 22.32 degrees Celsius (SD=5.03) and winter had a mean temperature of 17.34 degrees Celsius (SD=4.42).
Given the mild temperatures in spring and winter and the warm weather in summer and fall, we expected temperature to be correlated with the total number of bike rentals. In this analysis we correlated the raw temperatures (the converted normalized temperatures), the converted feeling temperatures (raw.atemp) and the mean of both with the total count. As the plots for question no. 2 show, all three temperature measures are positively correlated with the total number of bike rentals (cor=0.63 in each case). The correlations are statistically significant (each p-value < 0.01).
Additionally, we wanted to know whether there is a difference between the two types of bike rental users (casual vs. registered). We assumed that registered users rent bikes more frequently because they have to pay a monthly fee. In question no. 3, we tested the difference between the two types of users separately for each weather situation. The two-sample t-tests revealed a significant difference between registered and casual users in each weather situation: in weather situation 1 (t=37.638, df=646.784, p-value < 0.01), in weather situation 2 (t=26.3186, df=331.301, p-value < 0.01) and in weather situation 3 (t=5.9687, df=22.379, p-value < 0.01) the registered users have a higher mean than the casual users. These results are in line with our hypothesis. Furthermore, the mean rentals of both groups decrease from weather situation 1 to 3 (registered: 3912.76, 3348.51, 1617.81; casual: 964.03, 687.35, 185.48), which is consistent with worse weather going along with fewer rentals. The decreasing t-values and df mainly reflect the smaller number of days in the worse weather situations. We should also note that no day in the dataset was coded as weather situation 4 (the regression output for question 5 contains only three weather levels), which is why we did not include it in our analysis.
These findings led us to further examine the association between temperature and bike rentals in the two groups (registered vs. casual). As the plot shows, the number of bike rentals increases as the temperature increases. Additionally, we can see that registered users rent more bikes than casual users.
Consequently, we conducted a linear regression to answer question 5, because we thought that weather and holiday could be good predictors of the number of bike rentals. The results show that holiday is a significant negative predictor (estimate = -929.5, p-value = 0.0226). They also show that weathernice compared to weathercloudy (the default) is a significant positive predictor (estimate = 848.5) and that weatherwet compared to weathercloudy is a significant negative predictor (estimate = -2255.2) of bike rentals. The model explains about 11% of the variance in total rentals (R-squared = 0.1056). As in question no. 3, the lousy weather condition is not part of the model because no day in the dataset was coded with it.
We then used an ANOVA to test whether weather and holiday have significant effects. The ANOVA revealed a significant effect of weather (F-value = 41.010, Df = 2, p-value < 0.01) but a non-significant effect of holiday (F-value = 3.797, Df = 1, p-value = 0.05173). The result for holiday differs from the regression because anova() tests the terms sequentially, so holiday is tested before weather is taken into account, whereas the regression coefficient is adjusted for weather. Afterwards, we compared the three weather types with Tukey's HSD test. The mean numbers of bike rentals in the comparisons “nice-cloudy”, “wet-cloudy” and “wet-nice” differed significantly from one another (each p-value < 0.01). This means that the weather situation is associated with the number of bike rentals.
In question no. 6 we plotted the mean humidity, the mean temperature, the mean windspeed and the mean count of total rentals per month. As we can see, the mean number of bike rentals is higher in months with higher temperatures. In contrast, the number of rentals does not seem to follow windspeed or humidity, which vary little across the months. This is consistent with the correlation between rentals and temperature found in question no. 2 and with the finding that nice weather is a positive predictor of bike rentals.
Lastly, we created a custom function based on the criteria for appropriate weather for biking. Under condition A, about 97% of days were appropriate for biking; under the stricter condition B, which corresponds to nice and mild weather, only about 53% were.
To sum up, the total number of bike rentals depends on the weather and also on the users’ status (registered vs. casual).