Final Project - Data analysis and visualization in R

1. Dataset Description

Import the dataset “day”.

 day <- read.csv("~/Documents/Psychologie/6. Semester/Data analysis and visualization (Philips)/Bike-Sharing-Dataset/day.csv", stringsAsFactors=FALSE)

We analyzed the “Bike Sharing Dataset”, which was obtained from the UCI Machine Learning Repository. The UCI Machine Learning Repository is a collection of databases, domain theories and data generators that the machine learning community uses for empirical analyses. The archive was created in 1987 by David Aha and his fellow graduate students at UC Irvine. Since then, it has been widely used by students, educators and researchers. The current website was designed in 2007. The repository depends on donations of datasets from other researchers, mostly outside of UCI. The “Bike Sharing Dataset” consists of two complete sets of data (“hour” and “day”), of which we chose to analyze the one called “day”.

The full dataset contains the hourly and daily counts of rental bikes in the Capital Bikeshare system over a two-year period (2011-2012); the “day” file used here holds the daily counts. It also includes the corresponding weather and seasonal information. Capital Bikeshare has over 350 stations in Washington, D.C., Arlington and Alexandria, VA, and Montgomery County, MD. Bike sharing systems are a new way of renting bikes in bigger cities. The entire process, from becoming a registered user to renting and returning bikes, has become completely automatic. There are more than 500 bike-sharing programs around the world; the data analyzed here come from Capital Bikeshare and were compiled by the Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto.

This dataset originally consisted of 16 columns and 731 rows. In the course of the analysis we generated further columns.

nrow(day)

## [1] 731

ncol(day)

## [1] 16

dim(day)

## [1] 731  16

The dataset contains the following columns, which are briefly described below the output:

names(day)

##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "hum"        "windspeed"  "casual"     "registered"
## [16] "cnt"

Original columns:

instant: record index
dteday: date
season: season (1: spring, 2: summer, 3: fall, 4: winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday: day of the week
workingday: 1 if the day is a regular working day, 0 if it is a weekend or a holiday
weathersit:
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: Normalized temperature in Celsius. The values are divided by 41 (max)
atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered users

New columns: As mentioned above, we added a number of columns in the course of the analysis:

mnth.name: names of the months
weather: recoded weather situation (weathersit)
- 1 = “nice”
- 2 = “cloudy”
- 3 = “wet”
- 4 = “lousy”
raw.temp: temperature in Celsius, converted back from the normalized values (temp * 41)
raw.windspeed: windspeed, converted back from the normalized values (windspeed * 67)
raw.hum: humidity, converted back from the normalized values (hum * 100)
raw.atemp: feeling temperature in Celsius, converted back from the normalized values (atemp * 50)
raw.mean.temp.atemp: mean of raw.temp and raw.atemp in Celsius

2. Questions

How do the temperatures change across the seasons? What are the mean and median temperatures?
Is there a correlation between the temp/atemp/mean.temp.atemp and the total count of bike rentals?
Do registered users rent more bikes than casual users depending on the weather situations no. 1 to no. 3?
Is the temperature associated with bike rentals (registered vs. casual)?
Can the number of total bike rentals be predicted by whether or not it is a holiday and the weather is good?
What is the mean temperature, humidity, windspeed and total count of rentals per month?
What percentage of days are appropriate for biking concerning the weather under the following conditions
- A) Temperature > 5°, weather situation 1-3, windspeed < 40 km/h and
- B) Temperature > 10°, weather situation 1-2, windspeed < 20 km/h?

3. Analyses

1. How do the temperatures change across the seasons? What are the mean and median temperatures?

1:spring
2:summer
3:fall
4:winter

First we converted the temperatures back to degrees Celsius, because the temperature values in the dataset had been divided by 41. Then we calculated the mean, the median and the standard deviation for each season.

# Converting the normalized temperature:
day$raw.temp <- (day$temp*41)
head(day)

##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt  raw.temp
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985 14.110847
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801 14.902598
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349  8.050924
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562  8.200000
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600  9.305237
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606  8.378268

#1. Calculating Median, Mean and Standard deviation of spring.
spring <- subset(day, season == 1)$raw.temp
sp.mean <- mean(spring) #calculate the mean temperature of season 1 = spring
sp.median <- median(spring) #calculate the median temperature of season 1 = spring
sp.sd <- sd(spring) #calculate the standard deviation of temperatures in season 1 = spring


#2. Calculating Median, Mean and Standard deviation of summer.
summer <- subset(day, season == 2)$raw.temp
su.mean <- mean(summer) #calculate the mean temperature of season 2 = summer
su.median <- median(summer) #calculate the median temperature of season 2 = summer
su.sd <- sd(summer) #calculate the standard deviation of temperatures in season 2 = summer


#3. Calculating Median, Mean and Standard deviation of fall.
fall <- subset(day, season == 3)$raw.temp
fa.mean <- mean(fall)
fa.median <-median(fall)
fa.sd <- sd(fall)


#4. Calculating Median, Mean and Standard deviation of winter.
winter <- subset(day, season == 4)$raw.temp
wi.mean <- mean(winter)
wi.median <- median(winter)
wi.sd <- sd(winter)

Spring:
- The mean temperature of spring was 12.21.
- The median temperature of spring was 11.72.
- The standard deviation of the temperature in spring was 4.21.
Summer:
- The mean temperature of summer was 22.32.
- The median temperature of summer was 23.05.
- The standard deviation of the temperature in summer was 5.03.
Fall:
- The mean temperature of fall was 28.96.
- The median temperature of fall was 29.3.
- The standard deviation of the temperature in fall was 2.9.
Winter:
- The mean temperature of winter was 17.34.
- The median temperature of winter was 16.78.
- The standard deviation of the temperature in winter was 4.42.
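
The four blocks of season code repeat the same three calculations. As a minimal sketch of a more compact route, the values can be split by season and summarised in one step. The data frame toy and its numbers are invented for illustration and only stand in for day; they are not part of the analysis.

```r
# Toy data standing in for `day`; all values are invented for illustration.
toy <- data.frame(
  season   = c(1, 1, 2, 2, 3, 3, 4, 4),
  raw.temp = c(10, 14, 20, 24, 28, 30, 15, 19)
)

# Split raw.temp by season and compute mean, median and sd for each group.
# The result is a matrix with one row per statistic and one column per season.
season.stats <- sapply(
  split(toy$raw.temp, toy$season),
  function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)

round(season.stats, 2)
```

Applied to day instead of toy, the columns of the resulting matrix would correspond to the seasons coded 1 to 4.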

Next we created a histogram for each season, displaying the distribution of temperatures together with lines for the mean and median temperatures.

#create a histogram for the distribution of temperatures in spring.
hist(x = spring, 
     main = "Temperatures in Spring", 
     xlab = "Temperature in Celsius", 
     ylab = "Number of Days",
     xlim = c(0, 25),
     ylim = c(0, 45))

abline(v = sp.mean, lwd = 2, lty = 1, col = "red") #include line for the mean 
text(x = 17, y = 35, 
     labels = paste("Mean = ", round(mean(spring),2), sep = ""), col="red" )

abline(v = sp.median, lwd = 2, lty = 3, col = "blue") #include dotted line for the median
text(x = 6, y = 35, 
     labels = paste("Median = ", round(median(spring),2), sep = ""), col="blue" )

#create a histogram for the distribution of temperatures in summer.
hist(x = summer, 
     main = "Temperatures in Summer", 
     xlab = "Temperature in Celsius", 
     ylab = "Number of Days", 
     xlim = c(0, 35),
     ylim = c(0, 40))
abline(v = su.mean, lwd = 2, lty = 1, col = "red") #include line for the mean 
text(x = 15, y = 40, 
     labels = paste("Mean = ", round(mean(summer),2), sep = ""), col = "red" )

abline(v = su.median, lwd = 2, lty = 3, col = "blue") #include dotted line for the median
text(x = 31, y = 40, 
     labels = paste("Median = ", round(median(summer),2), sep = ""), col = "blue" )

#create a histogram for the distribution of temperatures in fall. 
hist(x = fall, 
     main = "Temperatures in Fall", 
     xlab = "Temperature in Celsius", 
     ylab = "Number of Days", 
     xlim = c(15, 40),
     ylim = c(0, 75))
abline(v = fa.mean, lwd = 2, lty = 1, col = "red") #include line for the mean 
text(x = 24, y = 40, 
     labels = paste("Mean = ", round(mean(fall),3), sep = ""), col = "red" )

abline(v = fa.median, lwd = 2, lty = 3, col ="blue") #include dotted line for the median
text(x = 35, y = 40, 
     labels = paste("Median = ", round(median(fall),3), sep = ""), col ="blue" )

#create a histogram for the distribution of temperatures in winter. 
hist(x = winter, 
     main = "Temperatures in Winter", 
     xlab = "Temperature in Celsius", 
     ylab = "Number of Days", 
     xlim = c(0, 30),
     ylim = c(0, 40))

abline(v = wi.mean, lwd = 2, lty = 1, col = "red") #include line for the mean 
text(x = 23, y = 40, 
     labels = paste("Mean = ", round(mean(winter),2), sep = ""), col = "red" )

abline(v = wi.median, lwd = 2, lty = 3, col ="blue") #include dotted line for the median
text(x = 10, y = 40, 
     labels = paste("Median = ", round(median(winter),2), sep = ""), col ="blue" )

2. Is there a correlation between the temp/atemp/mean.temp.atemp and the total count of bike rentals?

First we checked the dataset for integer values, NA or NULL values and duplicates. Because the dataset was already recoded and correct, we only had to create the new columns. Afterwards we ran the correlation tests.

#Check dataset 
#is.integer() tests the object as a whole; a data frame is a list and not an integer vector, so the result is FALSE
is.integer(day)

## [1] FALSE

#Test for NA values: we ran is.na(day) but removed the call because of its huge output. There was no "NA".
#is.null() tests whether the object itself is NULL
is.null(day)

## [1] FALSE

#Test for duplicates: we ran duplicated(day). There were no duplicates.
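
Because is.integer() and is.null() look at the data frame as a whole, they say little about the individual values. A minimal sketch of column-wise checks with compact output is given below. The small data frame toy reuses the cnt and temp values of the first three rows shown by head(day) and only stands in for the full dataset; the variable names are assumptions made for this illustration.

```r
# Toy data standing in for `day` (first three rows of cnt and temp).
toy <- data.frame(
  cnt  = c(985L, 801L, 1349L),
  temp = c(0.344167, 0.363478, 0.196364)
)

integer.columns <- sapply(toy, is.integer)  # per column: stored as integer?
n.missing       <- sum(is.na(toy))          # total number of NA cells
n.duplicated    <- sum(duplicated(toy))     # number of repeated rows

integer.columns
n.missing
n.duplicated
```

Each of the three results is a short vector or a single number, so the checks can stay in the report without producing pages of output.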

Creating a new column.

#The dataset is already recoded and correct.
#For this question we converted "atemp" because it was divided by 50.
day$raw.atemp <-(day$atemp * 50)

#Create a new column of the means of raw.temp and raw.atemp.
day$raw.mean.temp.atemp <- (day$raw.temp + day$raw.atemp)/2
head(day)

##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt  raw.temp
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985 14.110847
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801 14.902598
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349  8.050924
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562  8.200000
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600  9.305237
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606  8.378268
##   raw.atemp raw.mean.temp.atemp
## 1  18.18125           16.146048
## 2  17.68695           16.294774
## 3   9.47025            8.760587
## 4  10.60610            9.403050
## 5  11.46350           10.384369
## 6  11.66045           10.019359

Correlation Tests.

#Correlation between raw.temp and the total count of bike rentals.

cor.temp <- cor.test(x = day$raw.temp,
y = day$cnt)

cor.temp

## 
##  Pearson's product-moment correlation
## 
## data:  day$raw.temp and day$cnt
## t = 21.7594, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5814369 0.6695422
## sample estimates:
##      cor 
## 0.627494

Temperature <- day$raw.temp
Amount.Rentals <- day$cnt
plot(x = Temperature, y = Amount.Rentals, main = "Correlation", col = "turquoise3")
abline(lm(Amount.Rentals ~ Temperature), col = "blue")
legend("topleft", legend = paste("cor = ", round(cor(Temperature, Amount.Rentals), 2), sep = ""),lty = 1, col = "blue")

The correlation was 0.63.

#Correlation between atemp and the total count of bike rentals.
cor.atemp <- cor.test(x = day$raw.atemp,
y = day$cnt)
cor.atemp

## 
##  Pearson's product-moment correlation
## 
## data:  day$raw.atemp and day$cnt
## t = 21.9648, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5853376 0.6727918
## sample estimates:
##       cor 
## 0.6310657

Felt.Temperature <- day$raw.atemp
Amount.Rentals <- day$cnt
plot(x = Felt.Temperature, y = Amount.Rentals, main = "Correlation", col = "paleturquoise3")
abline(lm(Amount.Rentals ~ Felt.Temperature), col = "blue")
legend("topleft", legend = paste("cor = ", round(cor(Felt.Temperature, Amount.Rentals), 2), sep = ""), lty = 1, col = "blue")

The correlation was 0.63.

#Correlation between mean.temp.atemp and the total count of bike rentals.
day$raw.mean.temp.atemp <-(day$raw.temp + day$raw.atemp)/2
cor.mean.temp.atemp <- cor.test(x = day$raw.mean.temp.atemp,
y = day$cnt)
cor.mean.temp.atemp

## 
##  Pearson's product-moment correlation
## 
## data:  day$raw.mean.temp.atemp and day$cnt
## t = 21.9414, df = 729, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5848953 0.6724234
## sample estimates:
##       cor 
## 0.6306607

#Plot
Mean.Temperature <- day$raw.mean.temp.atemp
Amount.Rentals <- day$cnt
plot(x = Mean.Temperature, y = Amount.Rentals, main = "Correlation", col = "paleturquoise4")
abline(lm(Amount.Rentals ~ Mean.Temperature), col = "blue")
legend("topleft", legend = paste("cor = ", round(cor(Mean.Temperature, Amount.Rentals), 2), sep = ""), lty = 1, col = "blue")

The correlation was 0.63.
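
The t-value printed by cor.test() follows directly from the correlation and the sample size, t = r * sqrt(n - 2) / sqrt(1 - r^2). As a small worked sketch, the calculation below uses the values reported for raw.temp and cnt; the variable names r, n and t.stat are chosen for this illustration only.

```r
r <- 0.627494   # sample correlation reported for raw.temp and cnt
n <- 731        # number of days in the dataset

df     <- n - 2                          # degrees of freedom of the test
t.stat <- r * sqrt(df) / sqrt(1 - r^2)   # t-statistic of the correlation test

t.stat   # approximately 21.76, in line with the reported t = 21.7594
r^2      # about 0.39: share of variance in cnt linearly related to raw.temp
```

The squared correlation is a reminder that, although the relationship is clearly significant, temperature alone accounts for well under half of the variance in daily rentals.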

3. Do registered users rent more bikes than casual users depending on the weather situations no. 1 to no. 3?

This question was tested by a two-sample t-test.

Weathersit is coded as:

1: Clear, Few clouds, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

# two-sample-t-test for weathersit no. 1:
weathersit.1.reg <- subset(day, subset = weathersit == "1")$registered

weathersit.1.cas <- subset(day, subset = weathersit == "1")$casual

test.result.1 <- t.test(x = weathersit.1.reg, y = weathersit.1.cas, alternative = "two.sided")
test.result.1

## 
##  Welch Two Sample t-test
## 
## data:  weathersit.1.reg and weathersit.1.cas
## t = 37.638, df = 646.784, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2794.886 3102.566
## sample estimates:
## mean of x mean of y 
## 3912.7559  964.0302

# two-sample-t-test for weathersit no. 2:
weathersit.2.reg <- subset(day, subset = weathersit == "2")$registered

weathersit.2.cas <- subset(day, subset = weathersit == "2")$casual

test.result.2 <- t.test(x = weathersit.2.reg, y = weathersit.2.cas, alternative = "two.sided")
test.result.2

## 
##  Welch Two Sample t-test
## 
## data:  weathersit.2.reg and weathersit.2.cas
## t = 26.3186, df = 331.301, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2462.253 2860.062
## sample estimates:
## mean of x mean of y 
## 3348.5101  687.3522

# two-sample-t-test for weathersit no. 3:
weathersit.3.reg <- subset(day, subset = weathersit == "3")$registered

weathersit.3.cas <- subset(day, subset = weathersit == "3")$casual

test.result.3 <- t.test(x = weathersit.3.reg, y = weathersit.3.cas, alternative = "two.sided")
test.result.3

## 
##  Welch Two Sample t-test
## 
## data:  weathersit.3.reg and weathersit.3.cas
## t = 5.9687, df = 22.379, p-value = 4.89e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   935.1423 1929.5243
## sample estimates:
## mean of x mean of y 
## 1617.8095  185.4762

Weather situation 1:
- The t-value of the difference between registered and casual renters in weather situation 1 was 37.64.
- The mean daily rentals in weather situation 1 were 3912.76 for registered users (x) and 964.03 for casual users (y).
Weather situation 2:
- The t-value of the difference between registered and casual renters in weather situation 2 was 26.32.
- The mean daily rentals in weather situation 2 were 3348.51 for registered users (x) and 687.35 for casual users (y).
Weather situation 3:
- The t-value of the difference between registered and casual renters in weather situation 3 was 5.97.
- The mean daily rentals in weather situation 3 were 1617.81 for registered users (x) and 185.48 for casual users (y).
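
The same test is repeated for each weather situation, so the three code blocks could be folded into one call. The following is a minimal sketch under stated assumptions: compare.users is a hypothetical helper that is not part of the original analysis, and the data frame toy contains invented numbers that only stand in for day.

```r
# Toy data standing in for `day`; all numbers are invented for illustration.
toy <- data.frame(
  weathersit = rep(1:3, each = 4),
  registered = c(3900, 4100, 3800, 4000,
                 3300, 3400, 3200, 3500,
                 1600, 1700, 1500, 1650),
  casual     = c(950, 1000, 900, 980,
                 700, 650, 720, 680,
                 180, 200, 170, 190)
)

# Hypothetical helper: Welch two-sample t-test of registered vs. casual counts
# within one subset of the data, returning t, df and the p-value.
compare.users <- function(d) {
  res <- t.test(x = d$registered, y = d$casual, alternative = "two.sided")
  c(t = unname(res$statistic), df = unname(res$parameter), p = res$p.value)
}

# One column of results per weather situation.
results <- sapply(split(toy, toy$weathersit), compare.users)
round(results, 3)
```

With day in place of toy, each column of results would hold the t-value, the degrees of freedom and the p-value for one weather situation.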

4. Is the temperature associated with bike rentals (registered vs. casual)?

#Is the temperature associated with bike rentals (registered vs. casual)
plot(x = 1, y = 1, xlab = "Temperature in Celsius", ylab = "Bike rentals", type = "n", main = "Association between temperature and bike rentals",
xlim = c(0, 40), ylim = c(0, 7000))


#Calculating min and max for the x-axis and y-axis
min(day$raw.temp)

## [1] 2.424346

max(day$raw.temp)

## [1] 35.32835

min(day$casual)

## [1] 2

min(day$registered)

## [1] 20

max(day$casual)

## [1] 3410

max(day$registered)

## [1] 6946

#Adding points to the plot
day$raw.temp <- (day$temp*41)
points(day$raw.temp, day$casual, pch = 16, col = "red")
points(day$raw.temp, day$registered, pch = 16, col = "skyblue")

# Adding a legend to the plot
legend("topleft",legend = c("casual", "registered"), col = c('red', 'skyblue'),pch = c(16, 16), bg = "white")

5. Can the number of total bike rentals be predicted by holiday and weather?

Coding information:

holiday is coded as: 0 = no holiday and 1 = holiday

For this question we recoded the weather situation:

1 = “nice”

2 = “cloudy”

3 = “wet”

4 = “lousy”

Recoding weather with “merge”

lookup <- data.frame("numbers"=c("1","2","3","4"),
                    "weather"=c("nice","cloudy", "wet", "lousy")
                    )

day <- merge(x = day,
             y = lookup,
             by.x = "weathersit",
             by.y = "numbers")

head(day)

##   weathersit instant     dteday season yr mnth holiday weekday workingday
## 1          1     151 2011-05-31      2  0    5       0       2          1
## 2          1      50 2011-02-19      1  0    2       0       6          0
## 3          1     157 2011-06-06      2  0    6       0       1          1
## 4          1     110 2011-04-20      2  0    4       0       3          1
## 5          1       4 2011-01-04      1  0    1       0       2          1
## 6          1     136 2011-05-16      2  0    5       0       1          1
##       temp    atemp      hum windspeed casual registered  cnt raw.temp
## 1 0.775000 0.725383 0.636667  0.111329    673       3309 3982 31.77500
## 2 0.399167 0.391404 0.187917  0.507463    532       1103 1635 16.36585
## 3 0.678333 0.621858 0.600000  0.121896    673       3875 4548 27.81165
## 4 0.595000 0.564392 0.614167  0.241925    613       3331 3944 24.39500
## 5 0.200000 0.212122 0.590435  0.160296    108       1454 1562  8.20000
## 6 0.577500 0.550512 0.787917  0.126871    773       3185 3958 23.67750
##   raw.atemp raw.mean.temp.atemp weather
## 1  36.26915            34.02208    nice
## 2  19.57020            17.96802    nice
## 3  31.09290            29.45228    nice
## 4  28.21960            26.30730    nice
## 5  10.60610             9.40305    nice
## 6  27.52560            25.60155    nice
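
As the head() output shows, merge() reorders the rows by the key column. As a minimal sketch of an alternative that keeps the original row order, the codes can be turned into labels with factor(). The vector weathersit below is a toy stand-in for day$weathersit and is not taken from the analysis.

```r
# Toy codes standing in for day$weathersit.
weathersit <- c(2, 2, 1, 1, 3, 1)

# Map the codes 1-4 to labels without reordering the observations.
weather <- factor(weathersit,
                  levels = 1:4,
                  labels = c("nice", "cloudy", "wet", "lousy"))

as.character(weather)
table(weather)   # "lousy" has a count of 0 because code 4 does not occur here
```

One design consequence to keep in mind: with this level order, "nice" would become the reference level in a regression, whereas the merged column used in this report yields "cloudy" as the reference.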

Are weather and holiday good predictors?

#Using linear regression

total.rentals.lm <- lm(cnt ~ holiday + weather, data = day)
summary (total.rentals.lm)

## 
## Call:
## lm(formula = cnt ~ holiday + weather, data = day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4475.9 -1250.2   -40.9  1398.8  4303.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4058.4      117.2  34.619  < 2e-16 ***
## holiday       -929.5      406.8  -2.285   0.0226 *  
## weathernice    848.5      144.7   5.864 6.86e-09 ***
## weatherwet   -2255.2      417.4  -5.403 8.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1836 on 727 degrees of freedom
## Multiple R-squared:  0.1056, Adjusted R-squared:  0.1019 
## F-statistic: 28.61 on 3 and 727 DF,  p-value: < 2.2e-16

# cloudy weather is the reference level -> the coefficients for "nice" and "wet" weather are differences from cloudy weather.

The coefficients (Intercept, holiday, weathernice, weatherwet) were 4058.44, -929.48, 848.46 and -2255.16.
The residual degrees of freedom were 727.
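
To make the coefficients easier to read, they can be added up into predicted rental counts for particular kinds of days. The short sketch below uses the reported estimates; the vector b and the names of the predictions are chosen for this illustration only.

```r
# Coefficients as reported in the regression output.
b <- c(intercept = 4058.44, holiday = -929.48, nice = 848.46, wet = -2255.16)

# Predicted total rentals; cloudy weather on a non-holiday is the baseline.
pred.cloudy.no.holiday <- b[["intercept"]]
pred.nice.no.holiday   <- b[["intercept"]] + b[["nice"]]
pred.wet.holiday       <- b[["intercept"]] + b[["holiday"]] + b[["wet"]]

c(cloudy.no.holiday = pred.cloudy.no.holiday,
  nice.no.holiday   = pred.nice.no.holiday,
  wet.holiday       = pred.wet.holiday)
```

Under this model a nice non-holiday is predicted at about 4907 rentals and a wet holiday at about 874, compared with about 4058 for a cloudy non-holiday.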

Is there an effect of weather or not?

#using an ANOVA
anv.weather <- anova (total.rentals.lm)
anv.weather

## Analysis of Variance Table
## 
## Response: cnt
##            Df     Sum Sq   Mean Sq F value  Pr(>F)    
## holiday     1   12797494  12797494   3.797 0.05173 .  
## weather     2  276444009 138222004  41.010 < 2e-16 ***
## Residuals 727 2450293890   3370418                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#The F-values (holiday, weather, residuals) are:
anv.weather$`F value`

## [1]  3.797005 41.010345        NA

#The p-values (holiday, weather, residuals) are:
anv.weather$`Pr(>F)`

## [1] 5.172871e-02 1.331783e-17           NA

Comparison of the three different types of weather

#Using the Tukey-Test
weather.aov <- aov(cnt ~ weather, data = day)
summary(weather.aov)

##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## weather       2 2.716e+08 135822286   40.07 <2e-16 ***
## Residuals   728 2.468e+09   3389960                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(weather.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cnt ~ weather, data = day)
## 
## $weather
##                   diff        lwr       upr p adj
## nice-cloudy   840.9238   500.2174  1181.630 0e+00
## wet-cloudy  -2232.5766 -3215.4542 -1249.699 4e-07
## wet-nice    -3073.5005 -4038.2458 -2108.755 0e+00

6. What is the mean temperature, humidity, windspeed and total number of rentals per month?

# Months are coded as 1 to 12
# Recode months with the "merge" function

lookup.month <- data.frame("mnth" = c(1:12),
                          "mnth.name" = c("01Jan", "02Feb", "03March", "04April", "05May", "06June", "07July", "08Aug", "09Sept", "10Oct", "11Nov", "12Dec"), stringsAsFactors = FALSE)
 
day <- merge(x=day, y= lookup.month, by = 'mnth')


# Convert the normalized windspeed and humidity
day$raw.windspeed <- (day$windspeed*67)
day$raw.hum <- (day$hum * 100)
head(day)

##   mnth weathersit instant     dteday season yr holiday weekday workingday
## 1    1          1      12 2011-01-12      1  0       0       3          1
## 2    1          1     394 2012-01-29      1  1       0       0          0
## 3    1          2       1 2011-01-01      1  0       0       6          0
## 4    1          2     392 2012-01-27      1  1       0       5          1
## 5    1          1       4 2011-01-04      1  0       0       2          1
## 6    1          1       9 2011-01-09      1  0       0       0          0
##       temp    atemp      hum windspeed casual registered  cnt  raw.temp
## 1 0.172727 0.160473 0.599545  0.304627     25       1137 1162  7.081807
## 2 0.282500 0.272721 0.311250  0.240050    558       2685 3243 11.582500
## 3 0.344167 0.363625 0.805833  0.160446    331        654  985 14.110847
## 4 0.425000 0.415383 0.741250  0.342667    269       3187 3456 17.425000
## 5 0.200000 0.212122 0.590435  0.160296    108       1454 1562  8.200000
## 6 0.138333 0.116175 0.434167  0.361950     54        768  822  5.671653
##   raw.atemp raw.mean.temp.atemp weather mnth.name raw.windspeed raw.hum
## 1   8.02365            7.552728    nice     01Jan      20.41001 59.9545
## 2  13.63605           12.609275    nice     01Jan      16.08335 31.1250
## 3  18.18125           16.146048  cloudy     01Jan      10.74988 80.5833
## 4  20.76915           19.097075  cloudy     01Jan      22.95869 74.1250
## 5  10.60610            9.403050    nice     01Jan      10.73983 59.0435
## 6   5.80875            5.740201    nice     01Jan      24.25065 43.4167

#descriptive statistics with the dplyr-functions:
require(dplyr)

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

month.agg <- day %>% group_by(mnth.name) %>% summarise(
    mean.temp = mean(raw.temp),
    mean.hum = mean(raw.hum),
    mean.windspeed = mean(raw.windspeed),
    mean.rentals = mean(cnt))

month.agg

## Source: local data frame [12 x 5]
## 
##    mnth.name mean.temp mean.hum mean.windspeed mean.rentals
## 1      01Jan  9.694201 58.58283       13.82229     2176.339
## 2      02Feb 12.268284 56.74647       14.45082     2655.298
## 3    03March 16.012089 58.84750       14.92086     3692.258
## 4    04April 19.269952 58.80631       15.71031     4484.900
## 5      05May 24.386735 68.89583       12.26026     5349.774
## 6     06June 28.047985 57.58055       12.42313     5772.367
## 7     07July 30.974287 59.78763       11.12594     5563.677
## 8      08Aug 29.051844 63.77301       11.58552     5664.419
## 9     09Sept 25.275884 71.47144       11.11832     5766.517
## 10     10Oct 19.885500 69.37609       11.73877     5199.226
## 11     11Nov 15.138010 62.48765       12.31470     4247.183
## 12     12Dec 13.285270 66.60405       11.83280     3403.806
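
The numeric prefixes in the month names ("01Jan", "02Feb", ...) are there so that the groups sort in calendar order. A minimal sketch of another way to get the same ordering is a factor whose levels follow the calendar. It relies on month.abb, a constant built into R; the vector mnth is a toy stand-in for day$mnth.

```r
# Toy month codes standing in for day$mnth.
mnth <- c(3, 1, 12, 1, 7)

# month.abb is built into R: "Jan", "Feb", ..., "Dec".
# Using it as the level order makes the factor sort by calendar position.
mnth.label <- factor(month.abb[mnth], levels = month.abb)

levels(mnth.label)[1:3]
sort(mnth.label)   # calendar order, not alphabetical order
```

Grouping by such a factor would keep the months in calendar order in tables and bar plots while showing plain labels without prefixes.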

Plotting the different means associated with month

par(mfrow=c(2,2))
barplot(height = month.agg$mean.rentals,
        names.arg = month.agg$mnth.name ,col = "coral", main = "Mean rentals" )

barplot(height = month.agg$mean.windspeed,
        names.arg = month.agg$mnth.name,col = "brown1", main = "Mean Windspeed (km/h)" )

barplot(height = month.agg$mean.hum,
        names.arg = month.agg$mnth.name,col = "brown3", main = "Mean Humidity" )


barplot(height = month.agg$mean.temp,
        names.arg = month.agg$mnth.name,col = "brown4", main = "Mean Temperature" )

par(mfrow=c(1,1))

7. What percentage of days are appropriate for biking concerning the weather under the following conditions

A) Temperature > 5°, weather situation 1-3, windspeed < 40 km/h and
B) Temperature > 10°, weather situation 1-2, windspeed < 20 km/h?

#Checking the maximum of the converted temperature

max(day$raw.temp)

## [1] 35.32835

#Creating a custom function based on the criteria for appropriate weather for biking
#Note: all three comparisons are strict, so weathersit.thresh = 3 keeps weather situations 1 and 2 only

biking.day <- function(temp.thresh, windspeed.thresh, weathersit.thresh) {
  result <- with(day, raw.temp > temp.thresh & 
                   raw.windspeed < windspeed.thresh & 
                   weathersit < weathersit.thresh)
  return(result)
}

mean(biking.day(5, 40, 3))

## [1] 0.9658003

# with the A) call (temperature > 5, windspeed < 40, weathersit < 3), 97% of days were appropriate for biking

mean(biking.day(10, 20, 2))

## [1] 0.5348837

# with the B) call (temperature > 10, windspeed < 20, weathersit < 2), 53% of days were appropriate for biking
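
The question states the weather situations as inclusive ranges (1-3 and 1-2), whereas biking.day() compares weathersit with a strict "<". The sketch below shows one way the weather criterion could be made inclusive. The function biking.day.incl is a hypothetical variant, not the function used for the results above, and the data frame toy contains invented values.

```r
# Hypothetical variant of biking.day(): takes the data frame as an argument
# and treats the weather threshold as inclusive (weathersit <= weathersit.max).
biking.day.incl <- function(data, temp.min, windspeed.max, weathersit.max) {
  with(data, raw.temp > temp.min &
             raw.windspeed < windspeed.max &
             weathersit <= weathersit.max)
}

# Toy data standing in for `day`; all values are invented for illustration.
toy <- data.frame(
  raw.temp      = c(12, 4, 20, 25, 8),
  raw.windspeed = c(10, 15, 45, 18, 30),
  weathersit    = c(1, 2, 1, 3, 3)
)

share.A <- mean(biking.day.incl(toy, 5, 40, 3))    # criteria A
share.B <- mean(biking.day.incl(toy, 10, 20, 2))   # criteria B

share.A   # 0.6: three of the five toy days meet criteria A
share.B   # 0.2: one of the five toy days meets criteria B
```

Passing the data frame as an argument also lets the function be tried out on small examples like this one before it is applied to day.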

4. Conclusion

As the analysis of question no. 1 shows, the temperature is relatively high across all seasons. The lowest mean temperature was found in spring (M = 12.21, SD = 4.21) and the highest mean temperature was found in fall (M = 28.96, SD = 2.90). Summer had a mean temperature of 22.32 degrees Celsius (SD = 5.03) and winter had a mean temperature of 17.34 degrees Celsius (SD = 4.42).

Given the mild temperatures in spring and winter and the warm weather in summer and fall, we expected temperatures to be highly correlated with the total number of bike rentals. In this analysis we correlated the raw temperatures (converted normalized temperatures), the converted feeling temperatures (raw.atemp) and the mean of both with the total count of rentals. As the plots for question no. 2 show, all three temperature measures are positively correlated with the total number of bike rentals (r = 0.63 in each case, p < 0.01). The results reveal a significant relationship between the variables.

Additionally, we wanted to know whether there is a difference between the two types of bike rental users (casual vs. registered). We assumed that registered users rent bikes more frequently because they have to pay a monthly fee. In question no. 3, we tested the difference between the two types of users separately for each weather situation. The two-sample t-tests revealed a significant difference between registered and casual users for each of the weather situations. In weather situation 1 (t = 37.638, df = 646.784, p < 0.01), in weather situation 2 (t = 26.3186, df = 331.301, p < 0.01) and in weather situation 3 (t = 5.9687, df = 22.379, p < 0.01) the registered users have a higher mean than the casual users. These results confirm our hypothesis. Furthermore, the t-values and the degrees of freedom decrease from weather situation 1 to 3. This could also be in line with the correlation between the temperature and the total number of bike rentals. We should also note that weather situation 4 does not occur in the daily data (all 731 days fall into weather situations 1 to 3), which is why we did not include it in our analysis.

These findings led us to further examine the two groups (registered vs. casual) in relation to the temperature. As the plot for question no. 4 shows, the number of bike rentals increases as the temperature increases. Additionally, we can see that registered users rent more bikes than casual users.

Consequently, we conducted a linear regression to answer question 5, because we thought that weather and holiday could be good predictors for the number of bike rentals. The results reveal that holiday is a significant negative predictor (estimate = -929.5, p = 0.0226). The results also show that weathernice compared to weathercloudy (the reference level) is a significant positive predictor (estimate = 848.5) and that weatherwet compared to weathercloudy is a significant negative predictor (estimate = -2255.2) for bike rentals. As in question no. 3, the lousy weather condition is not part of the model because no day in the daily data falls into this category.

We then used an ANOVA to test whether there are significant effects for weather and holiday or not. The ANOVA revealed a significant effect for weather (F = 41.010, df = 2, p < 0.01) but a non-significant effect for holiday (F = 3.797, df = 1, p = 0.05173). Afterwards, we compared the three different weather types with the TukeyHSD test. The mean numbers of bike rentals in the comparisons “nice-cloudy”, “wet-cloudy” and “wet-nice” differed significantly from one another (each p < 0.01). This indicates that the weather situation affects the number of bike rentals.

In question no. 6 we plotted the mean humidity, the mean temperature, the mean windspeed and the mean count of total rentals per month. As we can see, the total number of bike rentals increases with increasing temperatures. However, the total number of bike rentals seems to be unaffected by the windspeed and the humidity, because these two variables vary little across the months. This can also be seen as evidence for the high correlation between rentals and temperatures. It also supports the finding that nice weather is a good predictor for bike rentals in general.

Lastly, we created a custom function based on the criteria for appropriate weather for biking. As expected, we found fewer appropriate days for biking under condition B (53%), which corresponds to nice and mild weather, than under condition A (97%).

In summary, the total number of bike rentals depends on the weather and also on the users’ status (registered vs. casual).

Rebekka Herz

August 6, 2015
