WPA #9: Custom Functions

Bike Sharing

In this WPA, you will analyze data on bike sharing. These data come from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Here is the data description (taken directly from the original website):

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues. Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.

The data are located in a comma separated text file at http://nathanieldphillips.com/wp-content/uploads/2016/04/bikesharing.csv

Here is how the first few rows of the bike data should look:

head(bike)

##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606

Datafile description

The data has 731 rows and 16 columns. Here are descriptions of the columns:

instant: record index
dteday : date
season : season (1:spring, 2:summer, 3:fall, 4:winter)
yr : year (0: 2011, 1:2012)
mnth : month ( 1 to 12)
hr : hour (0 to 23); appears only in the hourly version of the dataset, not in this daily file
holiday : whether the day is a holiday or not
weekday : day of the week
workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
weathersit :
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
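
The normalization formula given for temp and atemp can be inverted to recover approximate temperatures in Celsius. Here is a minimal sketch, assuming the t_min and t_max values quoted above; the helper name denormalize() is hypothetical and not part of the dataset documentation:

```r
# Hypothetical helper: invert z = (t - t_min) / (t_max - t_min)
denormalize <- function(z, t.min, t.max) {
  z * (t.max - t.min) + t.min
}

# First-row values of temp and atemp from the head() output above
temp.c  <- denormalize(0.344167, t.min = -8,  t.max = 39)
atemp.c <- denormalize(0.363625, t.min = -16, t.max = 50)

temp.c   # about 8.18 degrees Celsius
atemp.c  # about 8.00 degrees Celsius
```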

Data loading and preparation

A. Open your WPA.RProject and open a new script. Save the script with the name WPA9.R.

B. Using read.table(), load the comma delimited text file containing the data into R and assign it to a new object called bike
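
By default, read.table() expects whitespace-delimited data with no header row, so a comma-delimited file usually needs the sep and header arguments. Here is a small sketch on an inline sample (a few rows and columns copied from the output above) rather than the real file; the object names csv.text and sample.df are just for illustration:

```r
# A tiny comma-delimited sample with a header row
csv.text <- "instant,dteday,season,cnt
1,2011-01-01,1,985
2,2011-01-02,1,801
3,2011-01-03,1,1349"

# sep = "," sets the delimiter; header = TRUE uses the first line as column names
sample.df <- read.table(text = csv.text, sep = ",", header = TRUE)

head(sample.df)
```

With the real data, the text argument would be replaced by file, pointing at the URL above or at a local copy of the file.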

Understand the data

D. Look at the first few rows of the dataframe with the head() function to make sure it was imported correctly.

E. Using the summary() function, look at summary statistics for each column in the dataframe. There should be 16 columns in the dataset. Make sure everything looks ok.

Writing functions

Write a function called my.sum() that takes two objects a and b as arguments. The function should return the sum of a and b.

Your function should look like this:

my.sum <- function( ___) {
  
  output <-____
  
  return(___)
  
}

Test your function with the arguments a = 2 and b = 10

Now, create a new my.sum2() function that does the same thing as my.sum() but contains default values. Set the default value of a to 5, and the default value of b to -5. Test your function by evaluating it without specifying any arguments as follows:

my.sum2()

## [1] 0
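
Default values are written directly in the function definition as name = value. Here is a minimal sketch using a hypothetical function my.power(), which is not part of the assignment:

```r
# Hypothetical example: 'exponent' defaults to 2 when it is not supplied
my.power <- function(base, exponent = 2) {

  output <- base ^ exponent

  return(output)

}

my.power(base = 3)                # uses the default exponent: 9
my.power(base = 3, exponent = 3)  # overrides the default: 27
```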

Write a function called my.stat() that takes a vector of data as an argument x, and returns the mean of the data. Here is how your function should look:

my.stat <- function(x) {
  
  output <- ___
  
  return(___)
  
}

Test your function on the temp column of the bike dataset.

my.stat(x = bike$temp)

## [1] 0.4953848

Now, add a second argument to your function called what. This argument should be a string in the set “mean”, “median”, “sd”. The function should now calculate whichever statistic is specified in the what string. Test your function on the temp data using all three values of what:

Your function should look like this:

my.stat <- function(x, what) {
  
  if(what == ___) {
    
    output <- ____
  
  }
  
    if(____) {
    
    output <-____
  
  }
  
    if(____) {
    
    __________
  
    }
  
  return(output)
}

Here is what my.stat() should return:

my.stat(x = bike$temp, what = "mean")

## [1] 0.4953848

my.stat(x = bike$temp, what = "median")

## [1] 0.498333

my.stat(x = bike$temp, what = "sd")

## [1] 0.183051
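
The same pattern of comparing a string argument inside if() works for any set of options. Here is a sketch with a hypothetical function my.extreme() (not part of the assignment) that returns either the minimum or the maximum:

```r
# Hypothetical example: choose a calculation based on a string argument
my.extreme <- function(x, what) {

  if(what == "min") {

    output <- min(x)

  }

  if(what == "max") {

    output <- max(x)

  }

  return(output)

}

my.extreme(x = c(4, 8, 15), what = "min")  # 4
my.extreme(x = c(4, 8, 15), what = "max")  # 15
```

Note that if what matches none of the options, output is never created and return(output) fails with an error, so it is worth checking the spelling of the string you pass.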

Write a function called zscore() that takes a vector of data as an argument, and returns a z-transformed version of the vector. Recall that a z-transformation is given by the formula:

$z = \frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}$

Here is how your function should look:

zscore <- function(x) {
  
  output <- ______
  
  return(___)
  
}

Here is my zscore() function in action on the numbers from 1 to 10:

zscore(x = 1:10)

##  [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446  0.1651446
##  [7]  0.4954337  0.8257228  1.1560120  1.4863011
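
One way to sanity-check any z-transformation is to confirm that the result has a mean of 0 and a standard deviation of 1. The sketch below applies the formula above directly and compares it with base R's scale() function; the object names z and z.scale are just for illustration:

```r
x <- 1:10

# The z-transformation formula, applied directly
z <- (x - mean(x)) / sd(x)

mean(z)  # 0, up to floating-point error
sd(z)    # 1

# scale() centers and scales by default; it returns a one-column matrix,
# so as.vector() converts it back to a plain vector for comparison
z.scale <- as.vector(scale(x))

all.equal(z, z.scale)
```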

Use your function to add a new column called casual.z to the bike dataset that is a z-transformed version of the original casual data column. Show me a histogram of the new column as follows:

# Histogram of the original casual column
hist(bike$casual)

# Histogram of the new z-score column
hist(bike$casual.z)

Write a function called count.na() that takes a vector of data as an argument, counts the number of missing (NA) values in the vector, and then prints the number of NA values found in the sentence “X NA values found”.

Here is how your function could look

count.na <- function(x) {
  
  output <- sum(is.na(__))
  
  sentence <- paste(__, " NA values found", sep = "")
  
  return(__)
  
}

Test your function on the following vector:

count.na(x = c(0, 4, 2, NA, 4, 3, NA))

## [1] "2 NA values found"

Now, run your count.na() function on all columns in the bike dataset to see if there are any missing values in any of the columns. Hint: save yourself typing by indexing the dataframe by column number, not with the column names as follows:

count.na(bike[,1])

## [1] "0 NA values found"

count.na(bike[,2])

## [1] "0 NA values found"

# ...
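
Typing one call per column gets tedious with 16 columns. One possible shortcut, sketched here on a hypothetical toy dataframe rather than the bike data, is to let sapply() apply a function to every column at once:

```r
# Hypothetical toy dataframe with a few missing values
toy <- data.frame(a = c(1, NA, 3),
                  b = c("x", "y", NA),
                  c = c(TRUE, FALSE, TRUE))

# sapply() runs the function once per column and collects the results
na.counts <- sapply(toy, function(col) {sum(is.na(col))})

na.counts
```

The result is a named vector with one count per column, so a column with missing data stands out immediately.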

Write a function called my.hist() that takes a vector of data as an argument, and returns a custom histogram with parameter values of your choosing. For example, you can change the colors of the bars with the col parameter, the color of the borders with the border parameter (etc.). Test your function on the temp column in the bike dataset.

my.hist <- function(x) {
  
  hist(__, 
       __ = __, 
       __ = __
       )
  
}

For example, here’s what my personal version of my.hist() does:

my.hist(x = bike$temp)

Now, update your my.hist() function with the ... argument so that the user can pass additional arguments to hist(). Test your function on the casual column in the bike dataset and add a few additional arguments:

my.hist(x = bike$casual,
        xlim = c(0, 5000),
        breaks = 10,
        main = "Using my.hist() with additional arguments",
        xlab = "Casual rentals",
        ylab = "Frequency")
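
The ... argument works the same way for non-plotting functions: anything the user supplies beyond the named arguments is handed on to the inner function call. Here is a minimal sketch using a hypothetical wrapper my.mean() around mean(), which is not part of the assignment:

```r
# Hypothetical example: forward any extra arguments to mean()
my.mean <- function(x, ...) {

  output <- mean(x, ...)

  return(output)

}

my.mean(x = c(1, NA, 3))                # NA, because of the missing value
my.mean(x = c(1, NA, 3), na.rm = TRUE)  # 2, na.rm is passed through to mean()
```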

Checkpoint!!!!

Write a function called bivariate() that takes two vectors x and y as arguments and then does two actions:
- Print the result of a correlation test between x and y to the console
- Create a scatterplot of x and y.

Here’s how your function should look:

bivariate <- function(__, __, ...) {
  
  plot(__, 
         ...)
  
  cor.test(____)
  
}

Test your function on the temp and windspeed columns in the bike data. For example, here’s what happens when I run bivariate() on temp and windspeed:

bivariate(x = bike$temp, 
          y = bike$windspeed,
          xlab = "Temperature",
          ylab = "Windspeed",
          main = "My bivariate() function"
          )

## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -4.3187, df = 729, p-value = 1.787e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2278482 -0.0864203
## sample estimates:
##        cor 
## -0.1579441
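
The printed result above is actually a list-like object, so individual values such as the p-value can be pulled out with $ when the test is assigned to an object. Here is a sketch on small made-up vectors (the data are hypothetical, chosen only so the example runs without the bike data):

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 1, 4, 3, 6, 5, 8, 7)

# Assign the test to an object instead of printing it
test <- cor.test(x, y)

names(test)     # the components stored in the test object

test$estimate   # the correlation coefficient
test$p.value    # the p-value

# A logical value that can be used inside if()
test$p.value < .05
```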

Now, add a new argument to bivariate() called add.regression. This argument should be a logical value indicating whether or not to add a regression line to the plot. If the correlation test is significant (i.e., p < .05), the regression line should be RED. If it is not significant, the line should be BLACK.

Here’s how your function should look:

bivariate <- function(x, y, add.regression, ...) {
  
  plot(__, ___, ...)
  
  test <- cor.test(_____)
  
  if(add.regression == _____) {
  
  if(______ <= .05) {
    
    abline(lm(y ~ x), 
           col = "red")
    
  }
  
    if(______) {
    
   ________
    
    }
    
  }
  
}

Test your function on two sets of variables: temp and atemp, then on weekday and windspeed:

bivariate(x = bike$temp,
          y = bike$atemp,
          add.regression = T,
          xlab = "Temperature",
          ylab = "Feeling Temperature")

bivariate(x = bike$weekday,
          y = bike$windspeed,
          add.regression = T,
          xlab = "Weekday",
          ylab = "Windspeed")

Write a function called apa.t() that takes two vectors x and y as arguments, conducts a two-sample t-test comparing x and y, and returns a sentence with an APA-style conclusion. Your code should look like this:

apa.t <- function(x, y) {
  
  test <- t.test(___, ___)
 
  p.value <- test$___
  
  if(_______) {
    
    output <- paste("Significant at the .05 threshold!!! :)   t(", _____, ") = ", _____, ", p = ", _____, sep = "")
    
  }
  
    if(______) {
    
    output <- paste("Not Significant at the .05 threshold :(   t(", _____, ") = ", _____, ", p = ", _____, sep = "")
    
    }
  
  return(output)
}

Test your function on the bike dataset by comparing the number of rentals in year 0 (2011) versus year 1 (2012):

apa.t(x = bike$cnt[bike$yr == 0],
      y = bike$cnt[bike$yr == 1])

## [1] "Significant at the .05 threshold!!! :)   t(685.5) = -18.58, p = 0"
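
The numbers in a sentence like the one above can be taken from the object that t.test() returns: statistic holds the t-value, parameter holds the degrees of freedom, and p.value holds the p-value. Here is a sketch using R's built-in sleep dataset instead of the bike data; the object names t.stat, df.val, p.val and sentence are just for illustration:

```r
# Two groups from the built-in sleep dataset
test <- t.test(sleep$extra[sleep$group == 1],
               sleep$extra[sleep$group == 2])

# Extract and round the pieces needed for the sentence
t.stat <- round(test$statistic, 2)
df.val <- round(test$parameter, 2)
p.val  <- round(test$p.value, 2)

# Combine the pieces into one string
sentence <- paste("t(", df.val, ") = ", t.stat, ", p = ", p.val, sep = "")

sentence  # "t(17.78) = -1.86, p = 0.08"
```

Rounding the p-value to two digits is also why a very small p-value is displayed as 0.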

Now add a new argument to apa.t() called threshold that allows the user to specify the p-value threshold for determining significance

Test your function by comparing the windspeed between workingday values of 0 and 1. Try it once with a threshold of .05, and once with a threshold of 0.80:

apa.t(x = bike$windspeed[bike$workingday == 0],
      y = bike$windspeed[bike$workingday == 1],
      threshold = .05)

## [1] "Not Significant at the 0.05 threshold :(   t(442.61) = 0.51, p = 0.61"

apa.t(x = bike$windspeed[bike$workingday == 0],
      y = bike$windspeed[bike$workingday == 1],
      threshold = .80)

## [1] "Significant at the 0.8 threshold!!! :)   t(442.61) = 0.51, p = 0.61"

Write a function called plot.outliers() that takes two arguments: x and y (both vectors of data). The function should create a scatterplot of x and y. However, all outliers, defined as datapoints more than 2 standard deviations below or above the mean on either x or y, should be highlighted in red.

Here is how your function should look

plot.outliers <- function(x, y, ...) {
  
  # Determine which values of x and y are outliers
  
  x.out <- (x > ____) | (x < ____)
  y.out <- (y > ____) | (y < ____)
  
  # Determine which pairs have an outlier in either x or y
  
  any.out <- x.out | y.out
  
  # Plot the points without outliers
  plot(x[any.out == ___],
       y[any.out == ___],
       ...
       )
  
  # Add points for the data WITH outliers
  
  points(___,
         ___,
         col = "red")
  
}

Next, test your function by plotting the relationship between temp and cnt in the bike dataset

plot.outliers(x = bike$temp, 
              y = bike$cnt,              
              xlim = c(0, 1),
              ylim = c(0, 10000),
              xlab = "Temperature",
              ylab = "Number of rentals",
              main = "Outliers are in Red!")
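
The key tool in this exercise is a logical vector that flags outliers, which can then be used to index the data. Here is a sketch on a small made-up vector; it uses abs() as one possible way to write the "more than 2 standard deviations from the mean" rule, which is an alternative to the two-sided comparison in the template above:

```r
# Made-up data with one obvious outlier
x <- c(10, 12, 11, 9, 10, 11, 9, 10, 12, 50)

# TRUE where a value is more than 2 standard deviations from the mean
x.out <- abs(x - mean(x)) > 2 * sd(x)

x.out

x[x.out]    # only the outliers: 50
x[!x.out]   # everything except the outliers
```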

Now, make your plot.outliers() function more advanced by adding a new argument out.def which specifies how outliers are determined. Specifically, an outlier must be more than out.def standard deviations away from the mean to be called an outlier.

Test your function on the temp and cnt data, and set the out.def argument value to 1 as follows:

plot.outliers(x = bike$temp, 
              y = bike$cnt,
              out.def = 1,
              xlim = c(0, 1),
              ylim = c(0, 10000),
              xlab = "Temperature",
              ylab = "Number of rentals",
              main = "Outliers are in Red!")