In this WPA, you will analyze data on bike sharing. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
Here is the data description (taken directly from the original website):
Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues. Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.
The data are located in a comma separated text file at http://nathanieldphillips.com/wp-content/uploads/2016/04/bikesharing.csv
Here is how the first few rows of the bike data should look:
head(bike)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606
The data has 731 rows and 16 columns. Here are descriptions of the columns:
instant: record index
dteday : date
season : season (1:spring, 2:summer, 3:fall, 4:winter)
yr : year (0: 2011, 1:2012)
mnth : month ( 1 to 12)
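hr : hour (0 to 23); this column belongs to the hourly version of the data and does not appear in the 16 columns shown above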
holiday : whether the day is a holiday or not (extracted from [Web Link])
weekday : day of the week
workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
weathersit : weather situation code. 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
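The normalization formulas in the column descriptions can be inverted to recover raw units. The sketch below assumes the stated formula (t-t_min)/(t_max-t_min) applies to the values in this file (the description notes it was defined on the hourly scale, so treat the results as approximate). The helper name denormalize() is hypothetical and not part of the assignment.

```r
# Hypothetical helper (not part of the dataset or the assignment):
# invert (t - t_min) / (t_max - t_min) to recover degrees Celsius
denormalize <- function(t.norm, t.min, t.max) {
  t.norm * (t.max - t.min) + t.min
}

# First row of the data shown above: temp = 0.344167, atemp = 0.363625
temp.c  <- denormalize(0.344167, t.min = -8,  t.max = 39)
atemp.c <- denormalize(0.363625, t.min = -16, t.max = 50)

temp.c   # roughly 8.18 degrees Celsius
atemp.c  # roughly 8.00 degrees Celsius
```

The same idea applies to hum and windspeed: multiplying by 100 or 67 undoes the division described above.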
A. Open your WPA.RProject and open a new script. Save the script with the name WPA9.R.
B. Using read.table(), load the comma-delimited text file containing the data into R and assign it to a new object called bike.
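As a reminder of how read.table() handles comma-delimited input, here is a minimal sketch on made-up data. The object names csv.text and toy and the values are hypothetical; for the real file you would pass the file location instead of the text argument.

```r
# Toy comma-delimited data standing in for a real file (hypothetical values)
csv.text <- "id,score\n1,3.5\n2,4.0\n3,2.5"

# sep = "," tells read.table() that columns are separated by commas;
# header = TRUE tells it that the first row holds the column names
toy <- read.table(text = csv.text, sep = ",", header = TRUE)

dim(toy)    # number of rows and columns
names(toy)  # column names taken from the header row
```

Checking dim() and names() right after import is a quick way to catch a wrong separator, which typically shows up as a single wide column.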
D. Look at the first few rows of the dataframe with the head() function to make sure it was imported correctly.
E. Using the summary() function, look at summary statistics for each column in the dataframe. There should be 16 columns in the dataset. Make sure everything looks ok.
Your function should look like this:
my.sum <- function( ___) {
output <-____
return(___)
}
Test your function with the arguments a = 2 and b = 10
my.sum2()
## [1] 0
my.stat <- function(x) {
output <- ___
return(___)
}
Test your function on the temp column of the bike dataset.
my.stat(x = bike$temp)
## [1] 0.4953848
Your function should look like this:
my.stat <- function(x, what) {
if(what == ___) {
output <- ____
}
if(____) {
output <-____
}
if(____) {
__________
}
return(output)
}
Here is what my.stat() should return:
my.stat(x = bike$temp, what = "mean")
## [1] 0.4953848
my.stat(x = bike$temp, what = "median")
## [1] 0.498333
my.stat(x = bike$temp, what = "sd")
## [1] 0.183051
$$z = \frac{x - \text{mean}(x)}{\text{sd}(x)}$$
Here is how your function should look:
zscore <- function(x) {
output <- ______
return(___)
}
Here is my zscore() function in action on the numbers from 1 to 10:
zscore(x = 1:10)
## [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446 0.1651446
## [7] 0.4954337 0.8257228 1.1560120 1.4863011
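One way to sanity-check a z-score function is to compare it against base R's scale(), which centers a vector on its mean and divides by its standard deviation by default. This is only a cross-check, not the function the exercise asks you to write.

```r
x <- 1:10

# scale() returns a one-column matrix with attributes;
# as.numeric() reduces it to a plain vector
z <- as.numeric(scale(x))

round(z, 7)  # should match the zscore() output shown above
mean(z)      # z-scores have a mean of (numerically) zero
sd(z)        # and a standard deviation of one
```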
Use your function to add a new column called casual.z to the bike dataset that is a z-transformed version of the original casual data column. Show me a histogram of the new column as follows:
# Histogram of the original casual column
hist(bike$casual)
# Histogram of the new z-score column
hist(bike$casual.z)
Here is how your function could look:
count.na <- function(x) {
output <- sum(is.na(__))
sentence <- paste(__, " NA values found", sep = "")
return(__)
}
Test your function on the following vector:
count.na(x = c(0, 4, 2, NA, 4, 3, NA))
## [1] "2 NA values found"
count.na(bike[,1])
## [1] "0 NA values found"
count.na(bike[,2])
## [1] "0 NA values found"
# ...
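Rather than calling a function once per column by hand, the same check can be applied to every column with sapply(). The sketch below uses a made-up data frame (toy) and counts missing values directly with sum(is.na()); it is an illustration of the pattern, not a replacement for count.na().

```r
# Hypothetical data frame with one missing value in column a
toy <- data.frame(a = c(1, NA, 3),
                  b = c("x", "y", "z"))

# sapply() runs the function on each column and collects the results
# in a named vector, one element per column
na.counts <- sapply(toy, function(column) sum(is.na(column)))

na.counts
```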
my.hist <- function(x) {
hist(__,
__ = __,
__ = __
)
}
For example, here’s what my personal version of my.hist() does:
my.hist(x = bike$temp)
my.hist(x = bike$casual,
xlim = c(0, 5000),
breaks = 10,
main = "Using my.hist() with additional arguments",
xlab = "Casual rentals",
ylab = "Frequency")
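The second call above passes extra arguments such as xlim and breaks through to hist(). That kind of forwarding relies on the special ... argument. Here is a minimal sketch of the mechanism using a hypothetical wrapper, my.mean(), around mean().

```r
# Anything supplied beyond x is captured by ... and handed on to mean()
my.mean <- function(x, ...) {
  mean(x, ...)
}

# No extra arguments: mean() uses its defaults, so the NA propagates
my.mean(c(1, 2, NA))

# na.rm = TRUE is not a named argument of my.mean(); it travels through ...
my.mean(c(1, 2, NA), na.rm = TRUE)
```

The wrapper never has to list every option of the function it wraps, which is why ... is convenient for plotting helpers.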
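Write a function called bivariate() that takes two vectors x and y as arguments and then does two actions: it creates a scatterplot of x and y with plot(), and it runs a correlation test on x and y with cor.test().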
Here’s how your function should look:
bivariate <- function(__, __, ...) {
plot(__,
...)
cor.test(____)
}
Test your function on the temp and windspeed columns in the bike data. For example, here’s what happens when I run bivariate() on temp and windspeed:
bivariate(x = bike$temp,
y = bike$windspeed,
xlab = "Temperature",
ylab = "Windspeed",
main = "My bivariate() function"
)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -4.3187, df = 729, p-value = 1.787e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2278482 -0.0864203
## sample estimates:
## cor
## -0.1579441
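The printed output above is a formatted view of a list-like object, and its pieces can be pulled out by name with $. The sketch below uses made-up vectors to show the components that correspond to the cor, df, and p-value lines of the printout.

```r
# Hypothetical toy data, unrelated to the bike dataset
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 1, 4, 3, 6, 5, 8, 7)

# Store the test result instead of just printing it
test <- cor.test(x, y)

test$estimate   # the correlation coefficient ("cor" in the printout)
test$parameter  # the degrees of freedom ("df")
test$p.value    # the p-value
```

Storing the result in an object lets later code make decisions based on individual values such as the p-value.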
Here’s how your function should look:
bivariate <- function(x, y, add.regression, ...) {
plot(__, ___, ...)
test <- cor.test(_____)
if(add.regression == _____) {
if(______ <= .05) {
abline(lm(y ~ x),
col = "red")
}
if(______) {
________
}
}
}
Test your function on two sets of variables: temp and atemp, then on weekday and windspeed:
bivariate(x = bike$temp,
y = bike$atemp,
add.regression = TRUE,
xlab = "Temperature",
ylab = "Feeling Temperature")
bivariate(x = bike$weekday,
y = bike$windspeed,
add.regression = TRUE,
xlab = "Weekday",
ylab = "Windspeed")
apa.t <- function(x, y) {
test <- t.test(___, ___)
p.value <- test$___
if(_______) {
output <- paste("Significant at the .05 threshold!!! :) t(", _____, ") = ", _____, ", p = ", _____, sep = "")
}
if(______) {
output <- paste("Not Significant at the .05 threshold :( t(", _____, ") = ", _____, ", p = ", _____, sep = "")
}
return(output)
}
Test your function on the bike dataset by comparing the number of rentals in year 0 (2011) versus year 1 (2012):
apa.t(x = bike$cnt[bike$yr == 0],
y = bike$cnt[bike$yr == 1])
## [1] "Significant at the .05 threshold!!! :) t(685.5) = -18.58, p = 0"
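A p-value reported as "p = 0" in a string like the one above is usually a rounding effect: a very small p-value rounded to two decimals prints as 0. The sketch below, on made-up vectors, shows how the pieces of a t.test() result can be rounded and pasted into one string. The two-decimal rounding is an assumption inferred from the output shown, and the object names are hypothetical.

```r
# Hypothetical toy data, unrelated to the bike dataset
x <- c(5.1, 4.9, 6.2, 5.8, 5.5)
y <- c(6.9, 7.4, 6.8, 7.9, 7.1)

test <- t.test(x, y)

# statistic = t value, parameter = degrees of freedom, p.value = p-value
result <- paste("t(", round(test$parameter, 2), ") = ",
                round(test$statistic, 2),
                ", p = ", round(test$p.value, 2),
                sep = "")
result
```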
Test your function by comparing the windspeed between workingday values of 0 and 1. Try it once with a threshold of .05, and once with a threshold of 0.80:
apa.t(x = bike$windspeed[bike$workingday == 0],
y = bike$windspeed[bike$workingday == 1],
threshold = .05)
## [1] "Not Significant at the 0.05 threshold :( t(442.61) = 0.51, p = 0.61"
apa.t(x = bike$windspeed[bike$workingday == 0],
y = bike$windspeed[bike$workingday == 1],
threshold = .80)
## [1] "Significant at the 0.8 threshold!!! :) t(442.61) = 0.51, p = 0.61"
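The two calls above differ only in the threshold argument. A default argument value lets a caller leave that argument out entirely. Here is a minimal sketch with a hypothetical helper, is.significant(), using the p-value of .61 from the output above.

```r
# threshold defaults to .05 when the caller does not supply it
is.significant <- function(p, threshold = .05) {
  p <= threshold
}

is.significant(p = .61)                   # uses the default of .05
is.significant(p = .61, threshold = .80)  # overrides the default
```

The first call returns FALSE and the second TRUE, mirroring the "Not Significant" and "Significant" messages shown above.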
Here is how your function should look:
plot.outliers <- function(x, y, ...) {
# Determine which values of x and y are outliers
x.out <- (x > ____) | x < (____)
y.out <- (y > ____) | y < (____)
# Determine which pairs have an outlier in either x or y
any.out <- x.out | y.out
# Plot the points without outliers
plot(x[any.out == ___],
y[any.out == ___],
...
)
# Add points for the data WITH outliers
points(___,
___,
col = "red")
}
Next, test your function by plotting the relationship between temp and cnt in the bike dataset:
plot.outliers(x = bike$temp,
y = bike$cnt,
xlim = c(0, 1),
ylim = c(0, 10000),
xlab = "Temperature",
ylab = "Number of rentals",
main = "Outliers are in Red!")
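The template above builds logical vectors that flag outliers and then uses them to subset the data. The sketch below shows that flag-and-subset pattern on a made-up vector. It assumes, purely for illustration, that an outlier is any value more than 2 standard deviations from the mean; that cutoff is an assumption and is not stated in this document.

```r
# Hypothetical toy vector with one extreme value
x <- c(9, 10, 11, 10, 9, 11, 10, 10, 10, 100)

# Assumed cutoff for illustration only: 2 standard deviations from the mean
cutoff <- 2
upper <- mean(x) + cutoff * sd(x)
lower <- mean(x) - cutoff * sd(x)

# TRUE where a value falls outside the band, FALSE otherwise
x.out <- (x > upper) | (x < lower)

x[x.out == FALSE]  # values kept as regular points
x[x.out == TRUE]   # values flagged as outliers
```

Two logical vectors of the same length can be combined with | in the same way, so that a pair is flagged if either coordinate is extreme.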
Test your function on the temp and cnt data, and set the out.def argument value to 1 as follows:
plot.outliers(x = bike$temp,
y = bike$cnt,
out.def = 1,
xlim = c(0, 1),
ylim = c(0, 10000),
xlab = "Temperature",
ylab = "Number of rentals",
main = "Outliers are in Red!")