Importing the dataset “day”.
day <- read.csv("~/Documents/Studium 6. Semester/Datenanalyse mit R/Bike-Sharing-Dataset(1)/day.csv", stringsAsFactors=FALSE)
The dataset “Bike-Sharing-Dataset” was obtained from the UCI Machine Learning Repository, a collection of databases, domain theories and data generators that the machine learning community uses for empirical analyses. The archive was created in 1987 by David Aha and fellow graduate students at UC Irvine. Since then it has been widely used by students, educators and researchers. The current website was designed in 2007. The repository relies on donations from researchers, mostly outside of UCI. We found the “Bike Sharing Dataset” under the index “regression” and chose the sub-dataset “day”.
This dataset contains the hourly and daily counts of rental bikes in 2011 and 2012 in the Capital Bikeshare system, together with the corresponding weather and seasonal information. Capital Bikeshare has over 350 stations in Washington, D.C., Arlington and Alexandria, VA, and Montgomery County, MD. Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. The dataset documentation mentions more than 500 bike-sharing programs worldwide; the counts analysed here come from the Capital Bikeshare system only. The data were collected by the Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto, which aggregated them on an hourly and a daily basis and then added the corresponding weather and seasonal information, with the weather data extracted from http://www.freemeteo.com.
In this dataset there are originally 16 columns and 731 rows. In the course of the analysis we generated more columns.
nrow(day)
## [1] 731
ncol(day)
## [1] 16
dim(day)
## [1] 731 16
The dataset contains the following columns. A short description of each is given below:
names(day)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "holiday" "weekday" "workingday" "weathersit" "temp"
## [11] "atemp" "hum" "windspeed" "casual" "registered"
## [16] "cnt"
Original columns:
instant: record index
dteday: date
season: season (1:spring, 2:summer, 3:fall, 4:winter)
yr: year (0: 2011, 1:2012)
mnth: month ( 1 to 12)
holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday: day of the week
workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit: weather situation (coded 1 to 4; recoded into labels in question 5)
temp: normalized temperature in Celsius. The values are divided by 41 (max)
atemp: normalized felt temperature in Celsius. The values are divided by 50 (max)
hum: normalized humidity. The values are divided by 100 (max)
windspeed: normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
New columns:
mnth.name: names of the months
raw.temp: temperature in Celsius, converted back from the normalized values
raw.windspeed: wind speed, converted back from the normalized values
raw.hum: humidity, converted back from the normalized values
raw.atemp: felt temperature in Celsius, converted back from the normalized values
raw.mean.temp.atemp: mean of the converted temperature (raw.temp) and the converted felt temperature (raw.atemp) in Celsius
weather: weather situation as a label (created in question 5)
How do the temperatures change across the seasons? What are the mean and median temperatures?
Is there a correlation between the temp/atemp/mean.temp.atemp and the total count of bike rentals?
Is there a difference between the real temperature and the felt temperature? If there is a difference, does it persist across the different seasons?
Is temperature associated with bike rentals (registered vs. casual)?
Can the number of total bike rentals be predicted by holiday and weather?
What are the mean temperature, humidity, windspeed and total rentals per month?
What percentage of days are suitable for biking under given weather conditions?
1. How do the temperatures change across the seasons? What are the mean and median temperatures?
1:spring
2:summer
3:fall
4:winter
First we converted the normalized temperature back to degrees Celsius by multiplying it by 41, the value it had been divided by. Then we calculated the mean, the median and the standard deviation of the temperature for each season.
# Converting the normalized temperature:
day$raw.temp <- (day$temp*41)
head(day)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt raw.temp
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985 14.110847
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801 14.902598
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349 8.050924
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562 8.200000
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600 9.305237
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606 8.378268
#2. Calculating Median, Mean and Standard deviation of spring.
spring <- subset(day, season == 1)$raw.temp
sp.mean <- mean(spring)
sp.median <- median(spring)
sp.sd <- sd(spring)
#3. Calculating Median, Mean and Standard deviation of summer.
summer <- subset(day, season == 2)$raw.temp
su.mean <- mean(summer)
su.median <- median(summer)
su.sd <- sd(summer)
#4. Calculating Median, Mean and Standard deviation of fall.
fall <- subset(day, season == 3)$raw.temp
fa.mean <- mean(fall)
fa.median <- median(fall)
fa.sd <- sd(fall)
#5. Calculating Median, Mean and Standard deviation of winter.
winter <- subset(day, season == 4)$raw.temp
wi.mean <- mean(winter)
wi.median <- median(winter)
wi.sd <- sd(winter)
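The four nearly identical blocks above can also be expressed as one grouped computation. The following is a minimal sketch, assuming a small made-up data frame `toy` in place of `day` (its values are invented and not taken from the dataset); `season.stats` is a name introduced here for illustration.

```r
# Toy stand-in for `day`; the values are invented for illustration only.
toy <- data.frame(
  season = c(1, 1, 2, 2, 3, 3, 4, 4),
  temp   = c(0.20, 0.30, 0.50, 0.60, 0.70, 0.80, 0.40, 0.50)
)

# Undo the normalization: the values had been divided by 41.
toy$raw.temp <- toy$temp * 41

# Split the temperatures by season and compute all three statistics per group.
# sapply() returns one column per season, so t() turns seasons into rows.
season.stats <- t(sapply(
  split(toy$raw.temp, toy$season),
  function(x) c(mean = mean(x), median = median(x), sd = sd(x))
))

season.stats
```

Keeping the statistics in one matrix makes it possible to look up a value such as `season.stats["1", "mean"]` instead of maintaining twelve separate variables.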
Next we created a histogram of the temperatures for each season, with lines marking the mean and the median temperature.
#create histogram for the distribution of temperatures in spring.
hist(x = spring,
main = "Temperatures in Spring",
xlab = "Temperature in Celsius",
ylab = "Number of Days",
xlim = c(0, 25),
ylim = c(0, 45))
abline(v = sp.mean, lwd = 2, lty = 1, col = "red")
text(x = 17, y = 35,
labels = paste("Mean = ", round(mean(spring),2), sep = ""), col="red" )
abline(v = sp.median, lwd = 2, lty = 3, col = "blue")
text(x = 6, y = 35,
labels = paste("Median = ", round(median(spring),2), sep = ""), col="blue" )
#create a histogram for the distribution of temperatures in summer.
hist(x = summer,
main = "Temperatures in Summer",
xlab = "Temperature in Celsius",
ylab = "Number of Days",
xlim = c(0, 35), ylim = c(0, 40)
)
abline(v = su.mean, lwd = 2, lty = 1, col = "red")
text( x = 15, y = 40,
labels = paste("Mean = ", round(mean(summer),2), sep = ""),
col = "red")
abline(v = su.median, lwd = 2, lty = 3, col = "blue")
text(x = 31, y = 40,
labels = paste("Median = ", round(median(summer),2), sep = ""), col = "blue" )
#create a histogram for the distribution of temperatures in fall.
hist(x = fall,
main = "Temperatures in Fall",
xlab = "Temperature in Celsius",
ylab = "Number of Days",
xlim = c(15, 40), ylim = c(0, 70)
)
abline(v = fa.mean, lwd = 2, lty = 1, col = "red")
text(x = 24, y = 60,
labels = paste("Mean = ", round(mean(fall),3), sep = ""), col = "red" )
abline(v = fa.median, lwd = 2, lty = 3, col ="blue")
text(x = 35, y = 60,
labels = paste("Median = ", round(median(fall),3), sep = ""), col ="blue" )
#create a histogram for the distribution of temperatures in winter.
hist(x = winter,
main = "Temperatures in Winter",
xlab = "Temperature in Celsius",
ylab = "Number of Days",
xlim = c(0, 30), ylim = c(0, 40)
)
abline(v = wi.mean, lwd = 2, lty = 1, col = "red")
text(x = 23, y = 40,
labels = paste("Mean = ", round(mean(winter),2), sep = ""), col = "red" )
abline(v = wi.median, lwd = 2, lty = 3, col ="blue")
text(x = 10, y = 40,
labels = paste("Median = ", round(median(winter),2), sep = ""), col ="blue" )
2. Is there a correlation between the temp/atemp/mean.temp.atemp and the total count of bike rentals?
First we checked the dataset for its data type, missing values (NA or NULL) and duplicates. Because the dataset was already recoded and showed no such problems, we went on to create a new column. Afterwards we ran correlation tests.
#Checking dataset
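#is.integer() tests the storage type of the object itself; a data frame is stored as a list, so the result is FALSE whatever the column types are
is.integer(day)
## [1] FALSE
#Tests for missing values
#is.na(day): we ran it, but removed the call because of its very long output. There was no "NA". is.null() tests whether the object itself is NULL.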
is.null(day)
## [1] FALSE
#Tests for duplicates
#There were no duplicates: duplicated(day)
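The same checks can be condensed so that their output stays short enough to keep in the report. This is a sketch on a made-up three-row data frame `toy` (not the real data); the names `col.classes`, `na.per.column` and `n.dup` are introduced here for illustration.

```r
# Toy stand-in for `day`; the values are invented for illustration only.
toy <- data.frame(
  dteday = c("2011-01-01", "2011-01-02", "2011-01-03"),
  temp   = c(0.34, 0.36, 0.20),
  cnt    = c(985L, 801L, 1349L),
  stringsAsFactors = FALSE
)

# Class of every column, which is what a type check per column needs.
col.classes <- sapply(toy, class)

# Number of missing values per column instead of a full TRUE/FALSE matrix.
na.per.column <- colSums(is.na(toy))

# Number of rows that repeat an earlier row exactly.
n.dup <- sum(duplicated(toy))

col.classes
na.per.column
n.dup
```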
Creating a new column.
#The Dataset is already recoded and correct.
#For this question we converted "atemp" because it had been divided by 50.
day$raw.atemp <-(day$atemp * 50)
#Create a new column of the mean of raw.temp and raw.atemp.
day$raw.mean.temp.atemp <- (day$raw.temp + day$raw.atemp)/2
head(day)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt raw.temp
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985 14.110847
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801 14.902598
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349 8.050924
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562 8.200000
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600 9.305237
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606 8.378268
## raw.atemp raw.mean.temp.atemp
## 1 18.18125 16.146048
## 2 17.68695 16.294774
## 3 9.47025 8.760587
## 4 10.60610 9.403050
## 5 11.46350 10.384369
## 6 11.66045 10.019359
Correlation Tests.
#Correlation between raw.temp and the total count of bike rentals.
cor.temp <- cor.test(x = day$raw.temp,
y = day$cnt)
cor.temp
##
## Pearson's product-moment correlation
##
## data: day$raw.temp and day$cnt
## t = 21.7594, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5814369 0.6695422
## sample estimates:
## cor
## 0.627494
Temperature <- day$raw.temp
Amount.Rentals <- day$cnt
The correlation was 0.63.
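The t value in the output follows directly from the correlation and the degrees of freedom, via t = r * sqrt(df) / sqrt(1 - r^2). The short check below uses the two numbers reported above; the variable names are introduced here for illustration.

```r
r  <- 0.627494   # sample correlation reported by cor.test()
df <- 729        # degrees of freedom: n - 2 with n = 731 days

# Test statistic of the Pearson correlation test.
t.stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p.value <- 2 * pt(abs(t.stat), df = df, lower.tail = FALSE)

round(t.stat, 2)
p.value < 2.2e-16
```

With 731 observations even a moderate correlation yields a very large t value, which is why all three tests in this section report p < 2.2e-16.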
#Correlation between atemp and the total count of bike rentals.
cor.atemp <- cor.test(x = day$raw.atemp,
y = day$cnt)
cor.atemp
##
## Pearson's product-moment correlation
##
## data: day$raw.atemp and day$cnt
## t = 21.9648, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5853376 0.6727918
## sample estimates:
## cor
## 0.6310657
Feeled.Temperature <- day$raw.atemp
Amount.Rentals <- day$cnt
The correlation was 0.63.
#Correlation between mean.temp.atemp and the total count of bike rentals.
day$raw.mean.temp.atemp <-(day$raw.temp + day$raw.atemp)/2
cor.mean.temp.atemp <- cor.test(x = day$raw.mean.temp.atemp,
y = day$cnt)
cor.mean.temp.atemp
##
## Pearson's product-moment correlation
##
## data: day$raw.mean.temp.atemp and day$cnt
## t = 21.9414, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5848953 0.6724234
## sample estimates:
## cor
## 0.6306607
Feeled.Raw.Temperature <- day$raw.mean.temp.atemp
Amount.Rentals <- day$cnt
The correlation was 0.63.
Plotting the correlations
par(mfrow=c(2,2))
plot(x = Temperature, y = Amount.Rentals, main = "Correlation", col = "red")
abline(lm(Amount.Rentals ~ Temperature), col = "blue")
legend("topleft", legend = paste("cor = ", round(cor(Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "blue")
plot(x = Feeled.Temperature, y = Amount.Rentals, main = "Correlation", col = "blue")
abline(lm(Amount.Rentals ~ Feeled.Temperature), col = "red")
legend("topleft", legend = paste("cor = ", round(cor(Feeled.Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "red")
plot(x = Feeled.Raw.Temperature, y = Amount.Rentals, main = "Correlation", col = "green")
abline(lm(Amount.Rentals ~ Feeled.Raw.Temperature), col = "orange")
legend("topleft", legend = paste("cor = ", round(cor(Feeled.Raw.Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "orange")
plot(x = 1, y = 1, type = "n", xlab = "Temperature", ylab = "Amount of rentals", xlim = c(0, 40), ylim = c(0, 10000), main = "Three correlations combined")
points(Feeled.Raw.Temperature, Amount.Rentals, pch = 8, col = "green")
points(Temperature, Amount.Rentals, pch = 8, col = "red")
points(Feeled.Temperature, Amount.Rentals, pch = 8, col = "blue")
par(mfrow=c(1,1))
3. Is there a difference between the real temperature and the felt temperature? If there is a difference, does it persist across the different seasons?
A two-sample t-test for real temperature and felt temperature.
test.result.1 <- t.test(x = day$raw.temp, y = day$raw.atemp, alternative = "two.sided")
test.result.1
##
## Welch Two Sample t-test
##
## data: day$raw.temp and day$raw.atemp
## t = -8.3151, df = 1450.245, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.210643 -2.603203
## sample estimates:
## mean of x mean of y
## 20.31078 23.71770
hist(day$raw.temp, yaxt = "n", xaxt = "n", xlab = "",
ylab = "", main = "Two Sample t-test", xlim = c(5, 40), col = rgb(0, 0, 1, alpha = .1))
text(x = 13, y = 140, paste("Mean real Temp.\n", round(mean(day$raw.temp), 2), sep = ""), col = "blue")
abline(v = mean(day$raw.temp), lty = 1,
col = rgb(0, 0, 1, alpha = 1), lwd = 4)
par(new = TRUE)
hist(day$raw.atemp, yaxt = "n", xaxt = "n", xlab = "",
ylab = "", main = "", xlim = c(5, 40), col = rgb(1, 0, 0, alpha = .1))
abline(v = mean(day$raw.atemp), lty = 1,
col = rgb(1, 0, 0, alpha = 1), lwd = 4)
text(x = 32, y = 131, paste("Mean felt Temp.\n", round(mean(day$raw.atemp), 2), sep = ""), col = "red")
mtext(text = "The difference in means is significantly different from 0", line = 0, side = 3)
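Real and felt temperature are recorded for the same days, so the two samples are paired rather than independent. The Welch test used above ignores this pairing. As a hedged alternative, a paired t-test compares the day-by-day differences; the sketch below contrasts both tests on six invented value pairs (not the real data).

```r
# Invented example values: the felt temperature is a few degrees above
# the real temperature on every day.
temp  <- c(10, 12, 15, 20, 25, 30)
atemp <- temp + c(2.5, 3, 3.5, 3, 4, 3.5)

# Welch test: treats the two vectors as independent samples.
welch <- t.test(x = temp, y = atemp)

# Paired test: works on the six day-by-day differences.
paired <- t.test(x = temp, y = atemp, paired = TRUE)

c(welch = welch$p.value, paired = paired$p.value)
```

In this toy example the large day-to-day spread hides the shift from the Welch test, while the paired test detects it. On the real data both approaches would have to be run to see how much the conclusion changes.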
Two Sample t-test across the seasons
# two-sample t-test for real temperature and felt temperature in spring.
temp.spring <- subset(day, subset = season == 1)$raw.temp
atemp.spring <- subset(day, subset = season == 1)$raw.atemp
test.result.spring <- t.test(x = temp.spring, y = atemp.spring, alternative = "two.sided")
test.result.spring
##
## Welch Two Sample t-test
##
## data: temp.spring and atemp.spring
## t = -5.4597, df = 350.983, p-value = 9.038e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.588345 -1.687750
## sample estimates:
## mean of x mean of y
## 12.20765 14.84570
# two-sample t-test for real temperature and felt temperature in summer.
temp.summer <- subset(day, subset = season == 2)$raw.temp
atemp.summer <- subset(day, subset = season == 2)$raw.atemp
test.result.summer <- t.test(x = temp.summer, y = atemp.summer, alternative = "two.sided")
test.result.summer
##
## Welch Two Sample t-test
##
## data: temp.summer and atemp.summer
## t = -6.7914, df = 364.147, p-value = 4.534e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.764599 -2.624911
## sample estimates:
## mean of x mean of y
## 22.32061 26.01537
# two-sample t-test for real temperature and felt temperature in fall.
temp.fall <- subset(day, subset = season == 3)$raw.temp
atemp.fall <- subset(day, subset = season == 3)$raw.atemp
test.result.fall <- t.test(x = temp.fall, y = atemp.fall, alternative = "two.sided")
test.result.fall
##
## Welch Two Sample t-test
##
## data: temp.fall and atemp.fall
## t = -11.3657, df = 357.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.500023 -3.172453
## sample estimates:
## mean of x mean of y
## 28.95868 32.79492
# two-sample t-test for real temperature and felt temperature in winter.
temp.winter <- subset(day, subset = season == 4)$raw.temp
atemp.winter <- subset(day, subset = season == 4)$raw.atemp
test.result.winter <- t.test(x = temp.winter, y = atemp.winter, alternative = "two.sided")
test.result.winter
##
## Welch Two Sample t-test
##
## data: temp.winter and atemp.winter
## t = -7.0467, df = 351.902, p-value = 9.71e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.397268 -2.478311
## sample estimates:
## mean of x mean of y
## 17.33915 20.77694
4. Is temperature associated with bike rentals (registered vs. casual)?
# Plotting the association:
plot(x = 1, y = 1, xlab = "Temperature in Celsius", ylab = "Bike rentals", type = "n", main = "Association between temperature and bike rentals",
xlim = c(0, 40), ylim = c(0, 7000))
#Calculating min and max for the x-axis and y-axis:
min(day$raw.temp)
## [1] 2.424346
max(day$raw.temp)
## [1] 35.32835
min(day$casual)
## [1] 2
min(day$registered)
## [1] 20
max(day$casual)
## [1] 3410
max(day$registered)
## [1] 6946
#Adding points to the plot
day$raw.temp <- (day$temp*41)
points(day$raw.temp, day$casual, pch = 16, col = "red")
points(day$raw.temp, day$registered, pch = 16, col = "skyblue")
# Adding a legend to the plot
legend("topleft",legend = c("casual", "registered"), col = c("red","skyblue"), pch = c(16, 16), bg = "white")
# Calculating the correlation between raw.temp and registered users and between raw.temp and casual users
cor.reg <- cor.test(x = day$raw.temp, y = day$registered)
cor.reg
##
## Pearson's product-moment correlation
##
## data: day$raw.temp and day$registered
## t = 17.3233, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4865508 0.5894440
## sample estimates:
## cor
## 0.540012
cor.cas <- cor.test(x = day$raw.temp,
y = day$casual)
cor.cas
##
## Pearson's product-moment correlation
##
## data: day$raw.temp and day$casual
## t = 17.4721, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4900779 0.5924581
## sample estimates:
## cor
## 0.5432847
# Adding the regression lines and the correlation values to the plot
abline(lm(day$registered ~ day$raw.temp), lty = 6, col = "blue")
abline(lm(day$casual ~ day$raw.temp), lty = 6, col = "orange")
reg <- paste("cor = ", round(cor(day$registered, day$raw.temp), 2), sep = "")
cas <- paste("cor = ", round(cor(day$casual, day$raw.temp), 2), sep = "")
# The second legend describes the two lines, so it uses the line type instead of a point symbol
legend("left", legend = c(cas, reg), col = c("orange", "blue"), lty = c(6, 6), bg = "white")
# Calculating the max of casual and registered users:
max(day$casual)
## [1] 3410
max(day$registered)
## [1] 6946
5. Can the number of total bike rentals be predicted by holiday and weather?
Coding information:
holiday is coded as:
0 = no holiday
1 = holiday
For this question we recoded the weather situation:
1 = “nice”
2 = “cloudy”
3 = “wet”
4 = “lousy”
Converting weather with “merge”
lookup <- data.frame("numbers" = c("1", "2", "3", "4"),
                     "weather" = c("nice", "cloudy", "wet", "lousy"))
day <- merge(x = day,
             y = lookup,
             by.x = "weathersit",
             by.y = "numbers")
head(day)
## weathersit instant dteday season yr mnth holiday weekday workingday
## 1 1 151 2011-05-31 2 0 5 0 2 1
## 2 1 50 2011-02-19 1 0 2 0 6 0
## 3 1 157 2011-06-06 2 0 6 0 1 1
## 4 1 110 2011-04-20 2 0 4 0 3 1
## 5 1 4 2011-01-04 1 0 1 0 2 1
## 6 1 136 2011-05-16 2 0 5 0 1 1
## temp atemp hum windspeed casual registered cnt raw.temp
## 1 0.775000 0.725383 0.636667 0.111329 673 3309 3982 31.77500
## 2 0.399167 0.391404 0.187917 0.507463 532 1103 1635 16.36585
## 3 0.678333 0.621858 0.600000 0.121896 673 3875 4548 27.81165
## 4 0.595000 0.564392 0.614167 0.241925 613 3331 3944 24.39500
## 5 0.200000 0.212122 0.590435 0.160296 108 1454 1562 8.20000
## 6 0.577500 0.550512 0.787917 0.126871 773 3185 3958 23.67750
## raw.atemp raw.mean.temp.atemp weather
## 1 36.26915 34.02208 nice
## 2 19.57020 17.96802 nice
## 3 31.09290 29.45228 nice
## 4 28.21960 26.30730 nice
## 5 10.60610 9.40305 nice
## 6 27.52560 25.60155 nice
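`merge()` sorts the result by the key column, which is why `head(day)` no longer starts with 2011-01-01. If the original row order matters, the codes can be turned into labels with `factor()` instead. The sketch below uses six invented codes; `weathersit` and `weather` are local toy vectors here, not columns of `day`.

```r
# Toy weather codes in their original row order (invented values).
weathersit <- c(2, 2, 1, 1, 3, 1)

# Attach the labels without reordering anything.
weather <- factor(weathersit,
                  levels = 1:4,
                  labels = c("nice", "cloudy", "wet", "lousy"))

# Drop labels that never occur and make "cloudy" the explicit reference
# level, which matches the baseline used by the regression further down.
weather <- relevel(droplevels(weather), ref = "cloudy")

levels(weather)
as.character(weather)
```

Setting the reference level explicitly makes the baseline of a later `lm()` call independent of the alphabetical order of the labels.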
Are weather and holiday good predictors?
#Using linear regression
total.rentals.lm <- lm(cnt ~ holiday + weather, data = day)
summary(total.rentals.lm)
##
## Call:
## lm(formula = cnt ~ holiday + weather, data = day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4475.9 -1250.2 -40.9 1398.8 4303.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4058.4 117.2 34.619 < 2e-16 ***
## holiday -929.5 406.8 -2.285 0.0226 *
## weathernice 848.5 144.7 5.864 6.86e-09 ***
## weatherwet -2255.2 417.4 -5.403 8.91e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1836 on 727 degrees of freedom
## Multiple R-squared: 0.1056, Adjusted R-squared: 0.1019
## F-statistic: 28.61 on 3 and 727 DF, p-value: < 2.2e-16
# the reference level is cloudy weather -> "nice" and "wet" weather are compared to cloudy weather.
The coefficients (Intercept, holiday, weathernice, weatherwet) are 4058.44, -929.48, 848.46 and -2255.16.
The residual degrees of freedom are 727.
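To see what the coefficients mean in terms of rentals, they can be added up for every combination of holiday and weather. The sketch below does this with the rounded estimates from the summary above; `b` and `grid` are names introduced here for illustration.

```r
# Rounded estimates taken from the regression summary.
b <- c(intercept = 4058.4, holiday = -929.5, nice = 848.5, wet = -2255.2)

# Every combination of the two predictors.
grid <- expand.grid(holiday = c(0, 1),
                    weather = c("cloudy", "nice", "wet"),
                    stringsAsFactors = FALSE)

# The intercept is the prediction for a cloudy non-holiday; the other
# coefficients are added whenever their condition applies.
grid$predicted <- b[["intercept"]] +
  b[["holiday"]] * grid$holiday +
  b[["nice"]]    * (grid$weather == "nice") +
  b[["wet"]]     * (grid$weather == "wet")

grid
```

For example, the model predicts 4058.4 + 848.5 = 4906.9 rentals for a nice non-holiday and 4058.4 - 929.5 - 2255.2 = 873.7 rentals for a wet holiday.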
Is there an effect of weather or not?
# Using an ANOVA
anv.weather <- anova(total.rentals.lm)
anv.weather
## Analysis of Variance Table
##
## Response: cnt
## Df Sum Sq Mean Sq F value Pr(>F)
## holiday 1 12797494 12797494 3.797 0.05173 .
## weather 2 276444009 138222004 41.010 < 2e-16 ***
## Residuals 727 2450293890 3370418
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The F-values (holiday, weather, residuals) are:
anv.weather$`F value`
## [1] 3.797005 41.010345 NA
#The p-values (holiday, weather, residuals) are:
anv.weather$`Pr(>F)`
## [1] 5.172871e-02 1.331783e-17 NA
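The F values, the p-values and the R-squared of the regression can all be recomputed from the sums of squares in the table, which shows how the columns relate to each other. The sketch below only uses the numbers printed above; the variable names are introduced here for illustration.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss <- c(holiday = 12797494, weather = 276444009, residuals = 2450293890)
df <- c(holiday = 1,        weather = 2,         residuals = 727)

# Mean square = sum of squares / degrees of freedom.
ms <- ss / df

# F value = mean square of the term / residual mean square.
f.values <- ms[c("holiday", "weather")] / ms[["residuals"]]

# Upper-tail probability of the F distribution gives the p-value.
p.values <- pf(f.values,
               df1 = df[c("holiday", "weather")],
               df2 = df[["residuals"]],
               lower.tail = FALSE)

# Share of the total variation that is not left in the residuals.
r.squared <- 1 - ss[["residuals"]] / sum(ss)

round(f.values, 3)
p.values
round(r.squared, 4)
```

The last value reproduces the multiple R-squared of 0.1056 from the regression summary: holiday and weather together account for roughly 11% of the variation in daily rentals.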
Comparison of the three different types of weather
#Using Tukey-Test
weather.aov <- aov(cnt ~ weather, data = day)
summary(weather.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## weather 2 2.716e+08 135822286 40.07 <2e-16 ***
## Residuals 728 2.468e+09 3389960
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(weather.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cnt ~ weather, data = day)
##
## $weather
## diff lwr upr p adj
## nice-cloudy 840.9238 500.2174 1181.630 0e+00
## wet-cloudy -2232.5766 -3215.4542 -1249.699 4e-07
## wet-nice -3073.5005 -4038.2458 -2108.755 0e+00
6. What are the mean temperature, humidity, windspeed and total rentals per month?
# Months is coded as 1 to 12
# Converting month with "merge"
lookup.month<- data.frame("mnth" = c(1:12),
"mnth.name" = c("01Jan", "02Feb", "03March", "04April", "05May", "06June", "07July", "08Aug", "09Sept", "10Oct", "11Nov", "12Dec"), stringsAsFactors = FALSE)
day <- merge(x=day, y= lookup.month, by = 'mnth')
# Convert the normalized windspeed and humidity
day$raw.windspeed <- (day$windspeed*67)
day$raw.hum <- (day$hum * 100)
head(day)
## mnth weathersit instant dteday season yr holiday weekday workingday
## 1 1 1 12 2011-01-12 1 0 0 3 1
## 2 1 1 394 2012-01-29 1 1 0 0 0
## 3 1 2 1 2011-01-01 1 0 0 6 0
## 4 1 2 392 2012-01-27 1 1 0 5 1
## 5 1 1 4 2011-01-04 1 0 0 2 1
## 6 1 1 9 2011-01-09 1 0 0 0 0
## temp atemp hum windspeed casual registered cnt raw.temp
## 1 0.172727 0.160473 0.599545 0.304627 25 1137 1162 7.081807
## 2 0.282500 0.272721 0.311250 0.240050 558 2685 3243 11.582500
## 3 0.344167 0.363625 0.805833 0.160446 331 654 985 14.110847
## 4 0.425000 0.415383 0.741250 0.342667 269 3187 3456 17.425000
## 5 0.200000 0.212122 0.590435 0.160296 108 1454 1562 8.200000
## 6 0.138333 0.116175 0.434167 0.361950 54 768 822 5.671653
## raw.atemp raw.mean.temp.atemp weather mnth.name raw.windspeed raw.hum
## 1 8.02365 7.552728 nice 01Jan 20.41001 59.9545
## 2 13.63605 12.609275 nice 01Jan 16.08335 31.1250
## 3 18.18125 16.146048 cloudy 01Jan 10.74988 80.5833
## 4 20.76915 19.097075 cloudy 01Jan 22.95869 74.1250
## 5 10.60610 9.403050 nice 01Jan 10.73983 59.0435
## 6 5.80875 5.740201 nice 01Jan 24.25065 43.4167
#descriptive statistics with the dplyr-functions:
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
month.agg <- day %>% group_by(mnth.name) %>% summarise(
mean.temp = mean(raw.temp),
mean.hum = mean(raw.hum),
mean.windspeed = mean(raw.windspeed),
mean.rentals = mean(cnt))
month.agg
## Source: local data frame [12 x 5]
##
## mnth.name mean.temp mean.hum mean.windspeed mean.rentals
## 1 01Jan 9.694201 58.58283 13.82229 2176.339
## 2 02Feb 12.268284 56.74647 14.45082 2655.298
## 3 03March 16.012089 58.84750 14.92086 3692.258
## 4 04April 19.269952 58.80631 15.71031 4484.900
## 5 05May 24.386735 68.89583 12.26026 5349.774
## 6 06June 28.047985 57.58055 12.42313 5772.367
## 7 07July 30.974287 59.78763 11.12594 5563.677
## 8 08Aug 29.051844 63.77301 11.58552 5664.419
## 9 09Sept 25.275884 71.47144 11.11832 5766.517
## 10 10Oct 19.885500 69.37609 11.73877 5199.226
## 11 11Nov 15.138010 62.48765 12.31470 4247.183
## 12 12Dec 13.285270 66.60405 11.83280 3403.806
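The numeric prefixes in labels such as "01Jan" presumably serve to keep the months in calendar order, because character labels are otherwise sorted alphabetically. An ordered set of factor levels achieves the same without changing the labels. The sketch below uses base R's built-in `month.abb` constant on invented values; `mnth`, `cnt`, `month.label` and `mean.rentals` are local toy objects.

```r
# Toy month codes and rental counts (invented values).
mnth <- c(12, 1, 3, 1, 12, 7)
cnt  <- c(3400, 2000, 3700, 2400, 3000, 5600)

# month.abb holds "Jan" ... "Dec"; using it as the level order keeps
# the months in calendar order in tables and plots.
month.label <- factor(month.abb[mnth], levels = month.abb)

# Mean rentals per month; months without data come back as NA and are dropped.
mean.rentals <- tapply(cnt, month.label, mean)
mean.rentals <- mean.rentals[!is.na(mean.rentals)]

mean.rentals
```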
Plotting the different means associated with month
par(mfrow=c(2,2))
barplot(height = month.agg$mean.rentals,
names.arg = month.agg$mnth.name ,col = "red", main = "Mean rentals" )
barplot(height = month.agg$mean.windspeed,
names.arg = month.agg$mnth.name,col = "blue", main = "Mean Windspeed (km/h)" )
barplot(height = month.agg$mean.hum,
names.arg = month.agg$mnth.name,col = "green", main = "Mean Humidity" )
barplot(height = month.agg$mean.temp,
names.arg = month.agg$mnth.name,col = "skyblue", main = "Mean Temperature" )
par(mfrow=c(1,1))
7. What percentage of days are suitable for biking under given weather conditions?
#Calculating the maximum temperature:
max(day$raw.temp)
## [1] 35.32835
#Creating a custom function with criteria for fine biking weather
biking.day <- function(temp.thresh, windspeed.thresh, weathersit.thresh) {
  # TRUE for every day that is warmer than temp.thresh, less windy than
  # windspeed.thresh and has a weather code below weathersit.thresh
  result <- with(day, raw.temp > temp.thresh &
                   raw.windspeed < windspeed.thresh &
                   weathersit < weathersit.thresh)
  return(result)
}
mean(biking.day(5, 40, 3))
## [1] 0.9658003
# with conditions A) (warmer than 5 °C, wind below 40 km/h, weather code below 3) 97% of days were suitable for biking
mean(biking.day(10, 20, 2))
## [1] 0.5348837
# with conditions B) (warmer than 10 °C, wind below 20 km/h, weather code below 2) 53% of days were suitable for biking
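The function above reads `day` from the global environment, so it cannot be applied to another data frame without renaming it. A variant that takes the data as an argument is sketched below; `suitable.share` and the four-row data frame `toy` are hypothetical and only assume the column names used in this report.

```r
# Share of days that meet all three criteria in the given data frame.
suitable.share <- function(data, temp.thresh, windspeed.thresh, weathersit.thresh) {
  ok <- data$raw.temp > temp.thresh &
    data$raw.windspeed < windspeed.thresh &
    data$weathersit < weathersit.thresh
  mean(ok)
}

# Toy data with the same column names as `day` (invented values).
toy <- data.frame(
  raw.temp      = c(3, 12, 18, 25),
  raw.windspeed = c(10, 25, 15, 12),
  weathersit    = c(1, 2, 1, 3)
)

share.a <- suitable.share(toy, 5, 40, 3)    # lenient criteria
share.b <- suitable.share(toy, 10, 20, 2)   # strict criteria

c(A = share.a, B = share.b)
```

Passing the data explicitly is what would allow the same criteria to be applied to data from different cities.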
As the analysis of question no. 1 shows, the temperature is relatively high across all seasons. The lowest mean temperature was in spring (M = 12.21, SD = 4.21) and the highest in fall (M = 28.96, SD = 2.9). In between are summer with a mean temperature of 22.32 degrees Celsius (SD = 5.03) and winter with a mean temperature of 17.34 degrees Celsius (SD = 4.42). Note that season code 1 contains the January records shown in the output above, so the season labels taken from the dataset description may be shifted against the calendar seasons.
Given the mild temperatures in spring and winter and the warm weather in summer and fall, we expected temperature to be correlated with the total number of bike rentals. In this analysis we correlated the raw temperature (the normalized temperature converted back to Celsius), the converted felt temperature (raw.atemp) and the mean of both with the total count. As the plots of question no. 2 show, all three temperature measures are positively correlated with the total number of bike rentals, and the correlation is practically the same for all three (cor = 0.63). The analysis therefore shows a clear positive relationship between temperature and rentals.
Now we know that temperature is associated with the total number of bike rentals. For that reason we tested whether the real temperature differs from the felt temperature. The analysis shows a significant difference between the two variables (t = -8.32, df = 1450.25, p < 0.001). The difference remains significant within each season: spring (t = -5.46, df = 350.98), summer (t = -6.79, df = 364.15), fall (t = -11.37, df = 357.9) and winter (t = -7.05, df = 351.9). In every case the felt temperature is on average higher than the real temperature.
Because of those results we examined how the rentals of the two user groups (registered vs. casual) relate to temperature. As the plot shows, the number of bike rentals rises with the temperature in both groups. Registered users appear to rent more bikes than casual users, so we calculated the correlation between each group's rentals and the real temperature. The correlation is the same for both groups (cor = 0.54). The visual difference in the plot therefore reflects the unequal rental volumes of registered (max = 6946) and casual (max = 3410) users rather than a different relationship with temperature. The plot partly supports the strong role of temperature.
In question 5 we fitted a linear regression, because we expected weather and holiday to be good predictors of bike rentals. Holiday is a significant negative predictor (estimate = -929.5, p = 0.0226). Compared to cloudy weather (the reference level), nice weather is a significant positive predictor (estimate = 848.5) and wet weather is a significant negative predictor (estimate = -2255.2) of bike rentals. Lousy weather does not appear in the model because no day in the daily dataset was coded as lousy. Together the two predictors explain about 11% of the variance in daily rentals (multiple R-squared = 0.1056).
We then used an ANOVA to test whether weather and holiday have an effect. The ANOVA showed a significant effect of weather (F = 41.010, Df = 2, p < 0.01) but a non-significant effect of holiday (F = 3.797, Df = 1, p = 0.05173). The holiday result differs from the regression because anova() tests the terms sequentially: holiday is entered first and is tested without adjusting for weather, whereas the regression coefficient is adjusted for weather. Afterwards we compared the three weather types with TukeyHSD. The mean rentals differed significantly for all three pairs, “nice-cloudy”, “wet-cloudy” and “wet-nice” (each p < 0.01). This means that weather is associated with the number of bike rentals.
In question no. 6 we plotted the mean humidity, the mean temperature, the mean windspeed and the mean total rentals per month. The mean number of bike rentals rises and falls with the monthly temperature. The rentals seem to be largely independent of windspeed and humidity, which vary comparatively little over the months. This supports, on the one hand, the high correlation between rentals and temperature and, on the other hand, the idea that nice weather could be a good predictor.
In question 7 we created a custom function. It lets us set our own limits on the weather situation to find out how many days were suitable for biking under different conditions. The results show that only very few days (about 3%) were colder than 5 °C, had a windspeed of 40 km/h or more, or had wet weather. The distributions in the previous assignment show that the windspeed in Washington, D.C. was generally not very high and that the seasonal mean temperatures did not fall below 10 °C. With stricter limits of more than 10 °C, a windspeed below 20 km/h and only nice weather (cloudy and wet days excluded), only about half of the days were suitable for a biking tour. This could be important for future research: with this function one could calculate the number of suitable biking days in different cities. Bike rental systems could then be set up preferentially in cities with many suitable days, which could save costs.
In conclusion, the number of bike rentals depends mainly on the weather and on the real and felt temperature. The analysis shows a positive relationship between the number of bike rentals and temperature. As the plot of question no. 6 shows, the mean number of bike rentals increases and decreases with the temperature. People therefore mainly rent bikes on days with nice weather and pleasant temperatures. This could be important for planning new bike rental stations.