This data spans the District of Columbia, Arlington County, Alexandria, Montgomery County and Fairfax County. The Capital Bikeshare system is owned by the participating jurisdictions and is operated by Motivate, a Brooklyn, NY-based company that operates several other bikesharing systems including Citibike in New York City, Hubway in Boston and Divvy Bikes in Chicago.
library(readr)
bikeshare <- read_csv("C:/Users/eseld/Desktop/NYU/Statistics/mydata/bikesharedailydata.csv")
## Parsed with column specification:
## cols(
## instant = col_integer(),
## dteday = col_character(),
## season = col_integer(),
## yr = col_integer(),
## mnth = col_integer(),
## holiday = col_integer(),
## weekday = col_integer(),
## workingday = col_integer(),
## weathersit = col_integer(),
## temp = col_double(),
## atemp = col_double(),
## hum = col_double(),
## windspeed = col_double(),
## casual = col_integer(),
## registered = col_integer(),
## cnt = col_integer()
## )
Preview the data
You can preview the data using the head function to show the first few observations.
head(bikeshare)
## # A tibble: 6 × 16
## instant dteday season yr mnth holiday weekday workingday weathersit
## <int> <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 1 1/1/11 1 0 1 0 6 0 2
## 2 2 1/2/11 1 0 1 0 0 0 2
## 3 3 1/3/11 1 0 1 0 1 1 1
## 4 4 1/4/11 1 0 1 0 2 1 1
## 5 5 1/5/11 1 0 1 0 3 1 1
## 6 6 1/6/11 1 0 1 0 4 1 1
## # ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
## # windspeed <dbl>, casual <int>, registered <int>, cnt <int>
Next, you can view the variables and types by using the str function.
str(bikeshare)
One of the first things you may notice is the dimensions of the data: the number of rows and columns. Specifically, there are 731 rows and 16 columns.
Rows are commonly referred to as observations or records, and columns as variables or attributes.
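The dimensions and column types can also be checked directly. The sketch below is a minimal illustration on a small stand-in data frame (`bikeshare_demo` is a hypothetical name, built from the first rows shown in the preview) rather than the full file. It also shows one way to turn the character column dteday into a proper Date, assuming the month/day/two-digit-year layout seen in the preview.

```r
# Stand-in for the first rows of the data (values copied from the preview above)
bikeshare_demo <- data.frame(
  instant = 1:6,
  dteday  = c("1/1/11", "1/2/11", "1/3/11", "1/4/11", "1/5/11", "1/6/11"),
  season  = rep(1L, 6),
  stringsAsFactors = FALSE
)

dim(bikeshare_demo)            # number of rows, then number of columns
sapply(bikeshare_demo, class)  # the type of each column

# dteday is stored as text; as.Date() with an explicit format converts it
bikeshare_demo$date <- as.Date(bikeshare_demo$dteday, format = "%m/%d/%y")
class(bikeshare_demo$date)
range(bikeshare_demo$date)
```

Storing dates as Date values rather than text makes later steps such as sorting or plotting over time more reliable.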
However, the variable names in the header row are not very descriptive. Consider the variable season.
bikeshare$season
## [1] 1 1 1 1 1 1 NA 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [24] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [47] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [70] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [93] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [116] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [139] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [162] 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3
## [185] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [208] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [231] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [254] 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4
## [277] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [300] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [323] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [346] 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [369] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [392] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [438] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [461] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [484] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [507] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [530] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [553] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [576] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [599] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [622] 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [645] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [668] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [691] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [714] 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1
What type of variable is it?
It is an integer. You'll notice that in the column season the values are integers that range between 1 and 4.
What do the numbers represent?
If we really think about it, it's unlikely that the numbers represent quantities. Instead, they probably represent the seasons of the year, because we know there are four seasons. The numbers (1 through 4) are probably a code for each of the four seasons. Without additional information, such as a data dictionary or readme file, it would be impossible for the user of the data to know which season each of the values 1 through 4 corresponds to in the categorical variable named season.
This leads us to the next step, reviewing the data dictionary along with the data set to better understand the meaning behind the values.
Review the data dictionary
A data dictionary defines the characteristics of each of the data attributes. If your data comes from a reputable source, odds are that it is accompanied by a data dictionary or metadata. To know which season is represented by each number in the variable season, we can review the data dictionary.
| Field | Definition |
|---|---|
| instant | record index |
| dteday | date |
| season | season (1:spring, 2:summer, 3:fall, 4:winter) |
| yr | year (0: 2011, 1:2012) |
| mnth | month ( 1 to 12) |
| hr | hour (0 to 23) |
| holiday | whether day is holiday or not |
| weekday | day of the week |
| workingday | if day is neither weekend nor holiday is 1, otherwise is 0. |
| weathersit | 1, 2, 3, 4 |
| – 1 | Clear, Few clouds, Partly cloudy, Partly cloudy |
| – 2 | Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
| – 3 | Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
| – 4 | Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog |
| temp | Normalized temperature in Celsius. The values are divided by 41 (max) |
| atemp | Normalized feeling temperature in Celsius. The values are divided by 50 (max) |
| hum | Normalized humidity. The values are divided by 100 (max) |
| windspeed | Normalized wind speed. The values are divided by 67 (max) |
| casual | count of casual users |
| registered | count of registered users |
| cnt | count of total rental bikes including both casual and registered |
For example, season is a categorical variable defined by one of four values, each representing a season (1: spring, 2: summer, 3: fall, 4: winter).
You'll notice that the variable yr is coded with the value 0 for 2011 and 1 for 2012, rather than the actual year value of 2011 or 2012.
The variable weathersit is encoded with four possible values, 1 through 4. The values represent the daily weather situation as defined in the data dictionary.
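One way to make these codes readable is to attach the labels from the data dictionary. The sketch below uses short hypothetical sample vectors (`season_codes`, `yr_codes`) rather than the real columns. It is only an illustration of the idea, using base R's factor function.

```r
# Hypothetical sample of coded values, standing in for bikeshare$season and bikeshare$yr
season_codes <- c(1L, 2L, 3L, 4L, 1L)
yr_codes     <- c(0L, 1L, 1L)

# Attach the labels given in the data dictionary (1: spring, 2: summer, 3: fall, 4: winter)
season_label <- factor(season_codes,
                       levels = 1:4,
                       labels = c("spring", "summer", "fall", "winter"))

# The dictionary codes 0 as 2011 and 1 as 2012, so the calendar year is 2011 plus the code
year_value <- 2011L + yr_codes

table(season_label)
year_value
```

Labeled factors appear by name in tables and plot legends, so the reader does not have to look the codes up again.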
It is essential to go through this process of understanding, because it informs the questions you formulate for exploration and further analysis. Visualizing data without understanding the meaning of the variables will make it difficult for you to interpret the result. By approaching a data visualization task informed about the data and its attributes, you can better formulate questions for visual exploration. The next step is to prepare the data for analytical and visualization tasks.
At this point, you may want to rename the columns in your data set to make the data more usable when you begin the analysis. Renaming columns is a manual process that involves changing each column name. It is best practice to use lowercase lettering and avoid spaces or hyphenation.
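As one illustration of why the dictionary matters, the temperature values are stored on a normalized scale. The sketch below takes the dictionary's description at face value (temp is the Celsius temperature divided by 41) and rescales three values of temp that appear in the summary output later in this document. The variable names are hypothetical.

```r
# Minimum, mean and maximum of temp, as reported by summary(bikeshare)
temp_norm <- c(min = 0.05913, mean = 0.49538, max = 0.86167)

# Assumption: the dictionary's "divided by 41" means a simple rescaling
temp_celsius    <- temp_norm * 41
temp_fahrenheit <- temp_celsius * 9 / 5 + 32

round(temp_celsius, 1)
round(temp_fahrenheit, 1)
```

The same idea would apply to atemp (factor 50), hum (factor 100) and windspeed (factor 67), again assuming the dictionary describes a plain division.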
Preparing your data
Exercise: Renaming columns
There are many ways to rename columns. Two approaches are presented below.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bikeshare <- rename(bikeshare, humidity = hum)
names(bikeshare)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "holiday" "weekday" "workingday" "weathersit" "temp"
## [11] "atemp" "humidity" "windspeed" "casual" "registered"
## [16] "cnt"
# Rename the column whose name is "yr"
names(bikeshare)[names(bikeshare) == "yr"] <- "year"
names(bikeshare)
## [1] "instant" "dteday" "season" "year" "mnth"
## [6] "holiday" "weekday" "workingday" "weathersit" "temp"
## [11] "atemp" "humidity" "windspeed" "casual" "registered"
## [16] "cnt"
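When several columns need new names, a lookup vector can keep the old-to-new mapping in one place. The sketch below is a base R illustration on a small made-up data frame. The data frame `df`, its values, and the new names chosen here are hypothetical.

```r
# Made-up data frame that reuses a few of the original column names
df <- data.frame(instant = 1:2,
                 yr      = c(0L, 1L),
                 mnth    = c(1L, 2L),
                 hum     = c(0.80, 0.70),
                 cnt     = c(100L, 200L))

# Lookup vector: element names are the old column names, values are the new ones
new_names <- c(yr = "year", mnth = "month", hum = "humidity", cnt = "total_rentals")

# Rename only the columns that appear in the lookup vector
matched <- names(df) %in% names(new_names)
names(df)[matched] <- unname(new_names[names(df)[matched]])

names(df)
```

Columns that are not listed in the lookup vector, such as instant, keep their original names.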
Even before you define the questions you want the data to answer, the data needs to be formatted appropriately. The rows should correspond to observations and the columns to the observed variables. This makes it easier to map the data to visual properties such as position, color, size, or shape. A preprocessing step is necessary to check the dataset for correctness and consistency, because incomplete information has a high potential to produce incorrect results.
There are several ways you can tackle working with data that are incomplete. Each has its pros and cons.
In the two cases here, a missing season in row 7 and a missing month in row 10, it's easy to replace the missing value with a known one. We wouldn't want to ignore the records, because the values can be easily determined.
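Before deciding how to handle missing values, it helps to find out where they are. The sketch below is a minimal illustration on a made-up data frame (`df` and its values are hypothetical). It counts the NA values per column, locates them, and shows how many rows would remain if incomplete records were dropped instead of repaired.

```r
# Made-up data frame with one missing season and one missing month
df <- data.frame(season = c(1L, 1L, NA, 1L),
                 mnth   = c(1L, NA, 1L, 1L),
                 cnt    = c(120L, 95L, 143L, 110L))

colSums(is.na(df))         # number of missing values in each column
which(is.na(df$season))    # row positions of the missing season values
df[!complete.cases(df), ]  # the incomplete records themselves

# Alternative approach: drop incomplete records, at the cost of losing rows
nrow(na.omit(df))
```

Dropping rows is simple but discards the other, valid values in those records, which is why replacing a value is preferable when the correct value is known.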
Updating the records
bikeshare$season[7]
## [1] NA
bikeshare$season[7] <- 1
bikeshare$season[7]
## [1] 1
bikeshare$mnth[10]
## [1] NA
bikeshare$mnth[10] <- 1
bikeshare$mnth[10]
## [1] 1
It is helpful to calculate some summary statistics about your data to learn more about the distribution, the median, minimum, maximum values, variance, standard deviation, number of observations and attributes.
summary(bikeshare)
## instant dteday season year
## Min. : 1.0 Length:731 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Mode :character Median :3.000 Median :1.0000
## Mean :366.0 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp humidity
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
The summary function shows the minimum, quartiles, median, mean, and maximum for each numeric variable in the data set. This is particularly useful for numeric variables such as temp, cnt, casual, and registered. For example, you can easily see that the average number of riders per day (cnt, casual and registered combined) is about 4,504.
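The summary function does not report the variance or standard deviation mentioned earlier. One way to obtain them is to apply sd and var to each numeric column. The sketch below does this on a small made-up data frame (`daily` and its values are hypothetical).

```r
# Made-up stand-in for two numeric columns of the data
daily <- data.frame(temp = c(0.34, 0.36, 0.20, 0.20, 0.23),
                    cnt  = c(100L, 150L, 120L, 180L, 200L))

# For every column, compute the mean, standard deviation and variance
spread <- sapply(daily, function(x) c(mean = mean(x), sd = sd(x), var = var(x)))

round(spread, 3)
```

The result is a small matrix with one column per variable, which can be read alongside the output of summary.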
Explore the data visually. As a first step, consider scatterplots to show relationships between variables, histograms for frequencies, density plots to show distributions, and box plots to show the range of values.
Let's say you wanted to know the distribution of the ridership.
Kernel density plots are an effective way to view the distribution of a variable. Create the plot using plot(density(x)), where x is a numeric vector.
A density plot that shows the shape of the data for the number of riders per day.
density_riders <- density(bikeshare$cnt)
# Show the mean in the subtitle; paste() joins the label and the rounded value
plot(density_riders, main = "Number of riders per day", sub = paste("Mean =", round(mean(bikeshare$cnt), 2)), frame = FALSE)
polygon(density_riders, col = "gray", border = "gray")
How would we interpret the density plot?
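A few numbers can support the visual reading of a density plot. The sketch below uses synthetic data (`riders` is generated with rnorm and is not the real ridership) to locate the peak of the estimated density and compare it with the mean and median. When the three values are close together, the distribution is roughly symmetric. A mean that sits well away from the median suggests skew.

```r
set.seed(1)
# Synthetic daily rider counts, for illustration only
riders <- round(rnorm(200, mean = 4500, sd = 1900))

d <- density(riders)

# The x position at which the estimated density is highest
peak <- d$x[which.max(d$y)]

c(mean = mean(riders), median = median(riders), peak = peak)

# The area under a density curve is approximately 1
sum(d$y) * diff(d$x[1:2])
```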
A histogram that shows the frequency of the weather situation by day.
hist(bikeshare$weathersit, col="gray",border="gray", xlab="Weather", main="Frequency of weather situations")
| Value | Meaning |
|---|---|
| 1 | Clear, Few clouds, Partly cloudy, Partly cloudy |
| 2 | Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
| 3 | Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
| 4 | Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog |
How would we interpret the histogram?
You can check that you are reading the histogram correctly by reviewing the count of each value of weathersit.
table(bikeshare$weathersit)
##
## 1 2 3
## 463 247 21
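The counts can also be expressed as shares of all days, and because weathersit is a categorical code, a bar chart is a natural alternative to a histogram. The sketch below re-enters the three counts from the table output by hand. The vector name and the short category labels are assumptions made for illustration.

```r
# Counts copied from the table() output above
weather_counts <- c("1" = 463, "2" = 247, "3" = 21)

# Share of days in each weather situation, in percent
weather_share <- round(100 * weather_counts / sum(weather_counts), 1)
weather_share

# Bar chart with one bar per category (labels abbreviated from the data dictionary)
barplot(weather_counts,
        names.arg = c("1: clear", "2: mist", "3: light rain/snow"),
        col = "gray", border = "gray",
        ylab = "Number of days",
        main = "Frequency of weather situations")
```

Note that no day in this data set falls into weather situation 4, which is why only three values appear.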
To see relationships, scatter plots are useful. In this case, we are looking for positive or negative correlations.
A simple scatter plot that shows the relationship between the rentals and temperature
plot(bikeshare$cnt, bikeshare$atemp, main = "Relationship between bike rentals and feeling temperature", frame = FALSE, xlab = "Number of rentals per day", ylab = "Normalized feeling temperature (atemp)")
To aid in the interpretation, it is helpful to add a linear regression line if the relationship is linear, or a lowess line otherwise. A lowess line follows local patterns in the data, so it can fit a curved relationship more accurately than a straight line.
plot(bikeshare$cnt, bikeshare$atemp, main = "Relationship between bike rentals and feeling temperature", frame = FALSE, xlab = "Number of rentals per day", ylab = "Normalized feeling temperature (atemp)")
# Add fit lines
abline(lm(bikeshare$atemp~bikeshare$cnt), col="blue") # regression line (y~x)
lines(lowess(bikeshare$cnt, bikeshare$atemp), col="orange") # lowess line (x,y)
How would we interpret this scatter plot? Use this to inform the title of your plot.
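The direction and strength of a linear relationship can also be summarized with a correlation coefficient. The sketch below uses synthetic data (the vectors `atemp` and `cnt` are generated here and are not the real columns) in which rentals are constructed to rise with temperature, so the correlation and the regression slope should both come out positive.

```r
set.seed(42)
# Synthetic data: rentals built to increase with temperature, plus random noise
atemp <- runif(100, min = 0.1, max = 0.8)
cnt   <- 1000 + 6000 * atemp + rnorm(100, sd = 800)

# Pearson correlation: values near 1 or -1 indicate a strong linear relationship
r <- cor(cnt, atemp)
r

# The slope of a simple linear regression has the same sign as the correlation
fit <- lm(cnt ~ atemp)
coef(fit)
```

A correlation only measures linear association, so it is best read together with the scatter plot and lowess line rather than on its own.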
Consider using color to group categorical data. In this example, we are grouping the points by season. We’re using the ggvis package.
#static chart
library(ggvis)
# factor() makes ggvis treat the integer season codes as categories rather than a continuous scale
bikeshare %>%
ggvis(x=~cnt, y=~atemp) %>%
layer_points(fill = ~factor(season)) %>%
add_axis("x", title = "Number of rentals per day") %>%
add_axis("y", title = "Normalized feeling temperature (atemp)")
We can even look at the data by year.
#static chart
library(ggvis)
# factor() makes ggvis treat the year codes (0 and 1) as categories rather than a continuous scale
bikeshare %>%
ggvis(x=~cnt, y=~atemp) %>%
layer_points(fill = ~factor(year)) %>%
add_axis("x", title = "Number of rentals per day") %>%
add_axis("y", title = "Normalized feeling temperature (atemp)")