Updated on Wed May 16 15:34:25 2018.

DATA BASICS

This section will guide you in the process of decoding your data into information and ultimately intelligible insights. In doing so, we will explore the use of tidyverse and R base packages.


When working with new data, what initial questions do you have?


Consider the following questions to guide your understanding.


Once you have this basic understanding of your data, you can dig deeper. You can use visualization techniques to explore your data and derive a basic understanding of the phenomena you are studying, such as the largest and smallest values for each variable. In addition, calculating summary statistics translates data into information by revealing the shape of the data: the mean, median, minimum value, maximum value, and variability, all of which can be shown with simple visualizations.


For any data science project there are a few simple steps to follow.


A. Exercise: Importing your data

Using the world internet usage data, we will compare read.csv() to read_csv() for importing data.


utils package using read.csv()

library(utils)
internet_utils <- read.csv("world_internet_usage.csv")
head(internet_utils)
##                country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1                China  1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00
## 2               Mexico  5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81
## 3               Panama  6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29
## 4              Senegal  0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70
## 5            Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90
## 6 United Arab Emirates 23.63 26.27 28.32 29.48 30.13 40.00 52.00 61.00
##   X2008 X2009 X2010 X2011 X2012
## 1 22.60 28.90 34.30 38.30 42.30
## 2 21.71 26.34 31.05 34.96 38.42
## 3 33.82 39.08 40.10 42.70 45.20
## 4 10.60 14.50 16.00 17.50 19.20
## 5 69.00 69.00 71.00 71.00 74.18
## 6 63.00 64.00 68.00 78.00 85.00

Use readr to import the data

library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   `2000` = col_double(),
##   `2001` = col_double(),
##   `2002` = col_double(),
##   `2003` = col_double(),
##   `2004` = col_double(),
##   `2005` = col_double(),
##   `2006` = col_double(),
##   `2007` = col_double(),
##   `2008` = col_double(),
##   `2009` = col_double(),
##   `2010` = col_double(),
##   `2011` = col_double(),
##   `2012` = col_double()
## )
head(internet_readr)
## # A tibble: 6 x 14
##   country   `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China      1.78   2.64    4.60   6.20   7.30   8.52  10.5   16.0    22.6
## 2 Mexico     5.08   7.04   11.9   12.9   14.1   17.2   19.5   20.8    21.7
## 3 Panama     6.55   7.27    8.52   9.99  11.1   11.5   17.4   22.3    33.8
## 4 Senegal    0.400  0.980   1.01   2.10   4.39   4.79   5.61   7.70   10.6
## 5 Singapore 36.0   41.7    47.0   53.8   62.0   61.0   59.0   69.9    69.0
## 6 United A… 23.6   26.3    28.3   29.5   30.1   40.0   52.0   61.0    63.0
## # ... with 4 more variables: `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>

Select the second row, first column.

internet_readr[[2,1]]
## [1] "Mexico"
internet_utils [2,1] # double [[ ]] works too
## [1] Mexico
## 7 Levels: China Mexico Panama Senegal Singapore ... United States

Extract the variable “country”

internet_readr$country
## [1] "China"                "Mexico"               "Panama"              
## [4] "Senegal"              "Singapore"            "United Arab Emirates"
## [7] "United States"
internet_utils$country
## [1] China                Mexico               Panama              
## [4] Senegal              Singapore            United Arab Emirates
## [7] United States       
## 7 Levels: China Mexico Panama Senegal Singapore ... United States
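
The "Levels" lines in the read.csv() output above show that country was imported as a factor, while read_csv() kept it as character. The sketch below uses a two-row inline CSV (illustrative, not the workshop file) to show the two read.csv() arguments that control this behaviour and the X prefix on the year columns. Note that since R 4.0.0 the default for stringsAsFactors is FALSE, so a newer R will not print the factor output shown above unless you ask for it.

```r
# Illustrative inline CSV; values copied from the first two rows shown above
csv_text <- "country,2000,2001\nChina,1.78,2.64\nMexico,5.08,7.04"

# stringsAsFactors = TRUE reproduces the factor behaviour shown in the output above
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_factor$country)
names(as_factor)  # year columns get an X prefix so they become syntactic names

# stringsAsFactors = FALSE keeps strings as character;
# check.names = FALSE keeps the original year names
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE, check.names = FALSE)
class(as_char$country)
names(as_char)
```

With check.names = FALSE the year columns must be quoted with backticks when you refer to them, for example as_char$`2000`.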

An alternative using the pipe operator (%>%)

# %>% comes from magrittr (dplyr and tidyr re-export it); use . to refer to the piped data
library(magrittr)
internet_readr %>% .$country 
## [1] "China"                "Mexico"               "Panama"              
## [4] "Senegal"              "Singapore"            "United Arab Emirates"
## [7] "United States"

B. Exercise: Tidy data - reshaping

Rename columns first to remove the X in front of each year.

names(internet_utils) <-c("country", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012")
names(internet_utils)
##  [1] "country" "2000"    "2001"    "2002"    "2003"    "2004"    "2005"   
##  [8] "2006"    "2007"    "2008"    "2009"    "2010"    "2011"    "2012"
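
Typing every column name by hand works for 14 columns but is easy to get wrong. As a hedged alternative, the sketch below strips the X prefix with a regular expression. The data frame internet is a small illustrative stand-in, not the workshop data.

```r
# Small stand-in for the data frame produced by read.csv()
internet <- data.frame(
  country = c("China", "Mexico"),
  X2000 = c(1.78, 5.08),
  X2001 = c(2.64, 7.04)
)

# Remove a leading X only when it is followed by exactly four digits,
# so other column names are left untouched
names(internet) <- sub("^X([0-9]{4})$", "\\1", names(internet))
names(internet)
```

Anchoring the pattern to a four-digit year avoids accidentally renaming an unrelated column that happens to start with X.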

Reshape a data frame

library(reshape2)
internet_utils_reshaped <- melt(internet_utils,id.vars="country", variable.name="year", value.name="usage")
internet_utils_reshaped
##                 country year usage
## 1                 China 2000  1.78
## 2                Mexico 2000  5.08
## 3                Panama 2000  6.55
## 4               Senegal 2000  0.40
## 5             Singapore 2000 36.00
## 6  United Arab Emirates 2000 23.63
## 7         United States 2000 43.08
## 8                 China 2001  2.64
## 9                Mexico 2001  7.04
## 10               Panama 2001  7.27
## 11              Senegal 2001  0.98
## 12            Singapore 2001 41.67
## 13 United Arab Emirates 2001 26.27
## 14        United States 2001 49.08
## 15                China 2002  4.60
## 16               Mexico 2002 11.90
## 17               Panama 2002  8.52
## 18              Senegal 2002  1.01
## 19            Singapore 2002 47.00
## 20 United Arab Emirates 2002 28.32
## 21        United States 2002 58.79
## 22                China 2003  6.20
## 23               Mexico 2003 12.90
## 24               Panama 2003  9.99
## 25              Senegal 2003  2.10
## 26            Singapore 2003 53.84
## 27 United Arab Emirates 2003 29.48
## 28        United States 2003 61.70
## 29                China 2004  7.30
## 30               Mexico 2004 14.10
## 31               Panama 2004 11.14
## 32              Senegal 2004  4.39
## 33            Singapore 2004 62.00
## 34 United Arab Emirates 2004 30.13
## 35        United States 2004 64.76
## 36                China 2005  8.52
## 37               Mexico 2005 17.21
## 38               Panama 2005 11.48
## 39              Senegal 2005  4.79
## 40            Singapore 2005 61.00
## 41 United Arab Emirates 2005 40.00
## 42        United States 2005 67.97
## 43                China 2006 10.52
## 44               Mexico 2006 19.52
## 45               Panama 2006 17.35
## 46              Senegal 2006  5.61
## 47            Singapore 2006 59.00
## 48 United Arab Emirates 2006 52.00
## 49        United States 2006 68.93
## 50                China 2007 16.00
## 51               Mexico 2007 20.81
## 52               Panama 2007 22.29
## 53              Senegal 2007  7.70
## 54            Singapore 2007 69.90
## 55 United Arab Emirates 2007 61.00
## 56        United States 2007 75.00
## 57                China 2008 22.60
## 58               Mexico 2008 21.71
## 59               Panama 2008 33.82
## 60              Senegal 2008 10.60
## 61            Singapore 2008 69.00
## 62 United Arab Emirates 2008 63.00
## 63        United States 2008 74.00
## 64                China 2009 28.90
## 65               Mexico 2009 26.34
## 66               Panama 2009 39.08
## 67              Senegal 2009 14.50
## 68            Singapore 2009 69.00
## 69 United Arab Emirates 2009 64.00
## 70        United States 2009 71.00
## 71                China 2010 34.30
## 72               Mexico 2010 31.05
## 73               Panama 2010 40.10
## 74              Senegal 2010 16.00
## 75            Singapore 2010 71.00
## 76 United Arab Emirates 2010 68.00
## 77        United States 2010 74.00
## 78                China 2011 38.30
## 79               Mexico 2011 34.96
## 80               Panama 2011 42.70
## 81              Senegal 2011 17.50
## 82            Singapore 2011 71.00
## 83 United Arab Emirates 2011 78.00
## 84        United States 2011 77.86
## 85                China 2012 42.30
## 86               Mexico 2012 38.42
## 87               Panama 2012 45.20
## 88              Senegal 2012 19.20
## 89            Singapore 2012 74.18
## 90 United Arab Emirates 2012 85.00
## 91        United States 2012 81.03

Reshape a tibble

internet_readr_reshaped <- melt(internet_readr,id.vars="country", variable.name="year", value.name="usage")
internet_readr_reshaped
##                 country year usage
## 1                 China 2000  1.78
## 2                Mexico 2000  5.08
## 3                Panama 2000  6.55
## 4               Senegal 2000  0.40
## 5             Singapore 2000 36.00
## 6  United Arab Emirates 2000 23.63
## 7         United States 2000 43.08
## 8                 China 2001  2.64
## 9                Mexico 2001  7.04
## 10               Panama 2001  7.27
## 11              Senegal 2001  0.98
## 12            Singapore 2001 41.67
## 13 United Arab Emirates 2001 26.27
## 14        United States 2001 49.08
## 15                China 2002  4.60
## 16               Mexico 2002 11.90
## 17               Panama 2002  8.52
## 18              Senegal 2002  1.01
## 19            Singapore 2002 47.00
## 20 United Arab Emirates 2002 28.32
## 21        United States 2002 58.79
## 22                China 2003  6.20
## 23               Mexico 2003 12.90
## 24               Panama 2003  9.99
## 25              Senegal 2003  2.10
## 26            Singapore 2003 53.84
## 27 United Arab Emirates 2003 29.48
## 28        United States 2003 61.70
## 29                China 2004  7.30
## 30               Mexico 2004 14.10
## 31               Panama 2004 11.14
## 32              Senegal 2004  4.39
## 33            Singapore 2004 62.00
## 34 United Arab Emirates 2004 30.13
## 35        United States 2004 64.76
## 36                China 2005  8.52
## 37               Mexico 2005 17.21
## 38               Panama 2005 11.48
## 39              Senegal 2005  4.79
## 40            Singapore 2005 61.00
## 41 United Arab Emirates 2005 40.00
## 42        United States 2005 67.97
## 43                China 2006 10.52
## 44               Mexico 2006 19.52
## 45               Panama 2006 17.35
## 46              Senegal 2006  5.61
## 47            Singapore 2006 59.00
## 48 United Arab Emirates 2006 52.00
## 49        United States 2006 68.93
## 50                China 2007 16.00
## 51               Mexico 2007 20.81
## 52               Panama 2007 22.29
## 53              Senegal 2007  7.70
## 54            Singapore 2007 69.90
## 55 United Arab Emirates 2007 61.00
## 56        United States 2007 75.00
## 57                China 2008 22.60
## 58               Mexico 2008 21.71
## 59               Panama 2008 33.82
## 60              Senegal 2008 10.60
## 61            Singapore 2008 69.00
## 62 United Arab Emirates 2008 63.00
## 63        United States 2008 74.00
## 64                China 2009 28.90
## 65               Mexico 2009 26.34
## 66               Panama 2009 39.08
## 67              Senegal 2009 14.50
## 68            Singapore 2009 69.00
## 69 United Arab Emirates 2009 64.00
## 70        United States 2009 71.00
## 71                China 2010 34.30
## 72               Mexico 2010 31.05
## 73               Panama 2010 40.10
## 74              Senegal 2010 16.00
## 75            Singapore 2010 71.00
## 76 United Arab Emirates 2010 68.00
## 77        United States 2010 74.00
## 78                China 2011 38.30
## 79               Mexico 2011 34.96
## 80               Panama 2011 42.70
## 81              Senegal 2011 17.50
## 82            Singapore 2011 71.00
## 83 United Arab Emirates 2011 78.00
## 84        United States 2011 77.86
## 85                China 2012 42.30
## 86               Mexico 2012 38.42
## 87               Panama 2012 45.20
## 88              Senegal 2012 19.20
## 89            Singapore 2012 74.18
## 90 United Arab Emirates 2012 85.00
## 91        United States 2012 81.03
class(internet_readr_reshaped) # turns into a data.frame!
## [1] "data.frame"

Use the gather function to reshape

library(tidyr) # provides gather() and re-exports %>%
tidy_internet_readr <- 
  internet_readr %>%
  gather(`2000`,`2001`,`2002`,`2003`,`2004`,`2005`,`2006`,`2007`,`2008`,`2009`,`2010`,`2011`,`2012`, key="year", value="usage")

tidy_internet_readr
## # A tibble: 91 x 3
##    country              year   usage
##    <chr>                <chr>  <dbl>
##  1 China                2000   1.78 
##  2 Mexico               2000   5.08 
##  3 Panama               2000   6.55 
##  4 Senegal              2000   0.400
##  5 Singapore            2000  36.0  
##  6 United Arab Emirates 2000  23.6  
##  7 United States        2000  43.1  
##  8 China                2001   2.64 
##  9 Mexico               2001   7.04 
## 10 Panama               2001   7.27 
## # ... with 81 more rows
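
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below shows the equivalent reshape on a two-country stand-in data frame (illustrative values taken from the output above), assuming tidyr 1.0.0 or newer is installed. It also converts year to an integer, since the reshaped year column is otherwise character.

```r
library(tidyr)

# Two-country stand-in for internet_readr; check.names = FALSE keeps the year names
wide <- data.frame(
  country = c("China", "Mexico"),
  `2000` = c(1.78, 5.08),
  `2001` = c(2.64, 7.04),
  check.names = FALSE
)

# Every column except country becomes a (year, usage) pair
long <- pivot_longer(wide, cols = -country, names_to = "year", values_to = "usage")

# Column names arrive as text, so convert year to a number for plotting and sorting
long$year <- as.integer(long$year)
long
```

Unlike gather(), pivot_longer() orders the result by the original rows, so all years for one country appear together.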

C. Exercise: Understand - Visualize

Create a few statistical visualizations to understand the makeup of your data.

Build a boxplot

boxplot(internet_readr$`2000`, main="Range of internet users in 2000", sub="Median of 6.55 users per 100 people", col="#999999", frame=FALSE, las=1)


Build multiple box plots

boxplot(internet_readr[,2:14], main="Range of internet users per 100 people", col="#999999", frame=FALSE, las=1)


Build a single histogram


hist(internet_readr$`2000`, main="Frequency of internet users in 2000 per 100 people", xlab="Year: 2000", col="#999999", border="#FFFFFF", labels=TRUE, breaks=6, las=1)


Build a percentage (rather than count) histogram

library(lattice)
histogram(internet_readr$`2000`, main="Frequency of internet users in 2000 per 100 people", xlab="Year: 2000", col="#999999", border="#FFFFFF")

***

Build a histogram matrix

histogram(~ usage | year, data=tidy_internet_readr, layout=c(4,4), main="Histogram matrix: 2000-2012", col="#999999", border="#FFFFFF", xlab="Usage")


Re-arrange the years

h <-histogram(~tidy_internet_readr$usage|tidy_internet_readr$year,col="#999999",breaks=5,layout=c(3,5), xlab="Usage", main="Histogram matrix: 2000-2011")
update (h, index.cond=list(c(10:12, 7:9, 4:6, 1:3)))


Rearrange the years and show all years

tidy_internet_readr$year<-as.character(tidy_internet_readr$year)
h <-histogram(~tidy_internet_readr$usage|tidy_internet_readr$year,col="#999999", xlab="Usage", main="Histogram matrix: 2000-2012", breaks=5,layout=c(4,4))
update(h, index.cond=list(c(10:13, 6:9, 2:5, 1)))


Build column chart using ggplot

library(ggplot2)
library(ggthemes) # provides theme_few()
g <- ggplot(tidy_internet_readr, aes(x = year, y = usage))
g + geom_col() + theme_few() + labs(title = "Internet Usage per 100 people", x = "Year",y ="Usage")


D. Exercise: Communicate

Create charts and reports.

Create a presentation-ready line chart using ggplot and apply a theme from ggthemes.

library(ggthemes)
library(ggplot2)
ggplot(tidy_internet_readr,aes(x=year,y=usage,colour=country,group=country)) + geom_line() + labs(title = "Internet Usage per 100 people", subtitle = "Since 2011, the UAE has surpassed Singapore and the US in internet users", caption = "Source: World Bank, 2013",x = "Year",y ="Usage") + theme_few()


Create a new markdown document and publish it with the graph you created above.


Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

See the sample markdown here: http://rpubs.com/sosulski/msbaR

For more details on using R Markdown see http://rmarkdown.rstudio.com.


APPLICATION: Capital Bikeshare

Understand your data

The type of data you have will dictate the types of questions you use to guide your analysis. To begin, import the bike sharing data from the Capital Bikeshare system.


E. Exercise: Import the bike sharing data

This data spans the District of Columbia, Arlington County, Alexandria, Montgomery County and Fairfax County. The Capital Bikeshare system is owned by the participating jurisdictions and is operated by Motivate, a Brooklyn, NY-based company that operates several other bikesharing systems including Citibike in New York City, Hubway in Boston and Divvy Bikes in Chicago.


Import the data

library(readr)
bikeshare <- read_csv("bikesharedailydata.csv")
## Parsed with column specification:
## cols(
##   instant = col_integer(),
##   dteday = col_character(),
##   season = col_integer(),
##   yr = col_integer(),
##   mnth = col_integer(),
##   holiday = col_integer(),
##   weekday = col_integer(),
##   workingday = col_integer(),
##   weathersit = col_integer(),
##   temp = col_double(),
##   atemp = col_double(),
##   hum = col_double(),
##   windspeed = col_double(),
##   casual = col_integer(),
##   registered = col_integer(),
##   cnt = col_integer()
## )

F. Exercise: Take a look at the data.

Preview the data

You can preview the data using the head function to show the first few observations.

head(bikeshare)
## # A tibble: 6 x 16
##   instant dteday season    yr  mnth holiday weekday workingday weathersit
##     <int> <chr>   <int> <int> <int>   <int>   <int>      <int>      <int>
## 1       1 1/1/11      1     0     1       0       6          0          2
## 2       2 1/2/11      1     0     1       0       0          0          2
## 3       3 1/3/11      1     0     1       0       1          1          1
## 4       4 1/4/11      1     0     1       0       2          1          1
## 5       5 1/5/11      1     0     1       0       3          1          1
## 6       6 1/6/11      1     0     1       0       4          1          1
## # ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
## #   windspeed <dbl>, casual <int>, registered <int>, cnt <int>

Next, you can view the variables and types by using the str function.

str(bikeshare)

One of the first things you may notice is the data dimensions, the number of rows and columns. Specifically there are 731 rows (observations) and 16 columns (variables or attributes).

Rows are commonly referred to as observations or records and columns are described as attributes or variables.
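
If you only want the dimensions and column types rather than the full str() listing, base R has dedicated helpers. A minimal sketch on a three-row stand-in data frame (the temp values are illustrative, not taken from the file):

```r
# Three-row stand-in for the bikeshare data
rides <- data.frame(
  dteday = c("1/1/11", "1/2/11", "1/3/11"),
  season = c(1L, 1L, 1L),
  temp = c(0.34, 0.36, 0.20),
  stringsAsFactors = FALSE
)

dim(rides)            # rows, then columns
nrow(rides)           # number of observations
ncol(rides)           # number of variables
sapply(rides, class)  # storage type of each column
```

Run against the full bikeshare data, dim() should report the 731 rows and 16 columns described above.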

However, the variable names listed at the first row of every column are not very descriptive.


G. Exercise: Understanding the variables

Take a look at the column named season. What is the meaning of season? What are the possible values for this variable?

bikeshare$season

What type of variable is it?

It is an integer. You’ll notice that in the column season the values are integers that range between 1 and 4.


What do the numbers represent?

If we think about it, it’s unlikely that the numbers represent quantities. Instead, they probably represent the seasons of the year, because we know there are four seasons. The numbers (1 through 4) are probably a code for each of the four seasons. Without additional information, such as a data dictionary or readme file, it would be impossible for the user of the data to know which season each of the values 1 through 4 corresponds to in the categorical variable named season.

This leads us to the next step, reviewing the data dictionary along with the data set to better understand the meaning behind the values.


Review the data dictionary

A data dictionary defines the characteristics of each of the data attributes. If your data comes from a reputable source, odds are that it is accompanied by a data dictionary or metadata. To know which season is represented by each number in the variable season, we can review the data dictionary.


Field        Definition
instant      record index
dteday       date
season       season (1: winter, 2: spring, 3: summer, 4: fall)
yr           year (0: 2011, 1: 2012)
mnth         month (1 to 12)
hr           hour (0 to 23)
holiday      whether the day is a holiday or not
weekday      day of the week
workingday   1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit   weather situation (1, 2, 3, 4)
             1: Clear, Few clouds, Partly cloudy
             2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
             3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
             4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp         Normalized temperature in degrees F
atemp        Normalized feeling temperature in degrees F
hum          Normalized humidity
windspeed    Normalized wind speed
casual       count of casual users
registered   count of registered users
cnt          count of total rental bikes including both casual and registered

For example, season is a categorical variable defined by one of four values, each representing a season (1: winter, 2: spring, 3: summer, 4: fall).


You’ll notice that the variable year is coded with the value of 0 for 2011 and 1 for 2012, rather than actual year value of 2011 or 2012.


The variable weathersit is encoded with four possible values, 1 through 4. The values represent the daily weather situation as defined below.

  1. Clear, Few clouds, Partly cloudy
  2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

It is essential to undergo this process of understanding to help you formulate questions for exploration and further analysis. Visualizing data without understanding the meaning of the variables will make it difficult for you to interpret the results. By approaching a data visualization task informed about the data and its attributes, you can better formulate questions for visual exploration. The next step is to prepare the data for analytical and visualization tasks.


At this point, you may want to rename the columns in your data set to make the data more usable when you begin the analysis. Renaming columns is a manual process that involves changing each column name. It is best practice to use lowercase lettering and avoid spaces or hyphens.


Preparing your data

H. Exercise: Renaming columns

There are many ways to rename columns. Two approaches are presented below.

Renaming columns with the rename function from the dplyr library.

library(dplyr)
bikeshare <- rename(bikeshare, humidity = hum)
names(bikeshare)
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"

Renaming columns with R base functions.

# Rename column where names is "yr"
names(bikeshare)[names(bikeshare) == "yr"] <- "year"
names(bikeshare)
##  [1] "instant"    "dteday"     "season"     "year"       "mnth"      
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"

I. Exercise: Dealing with missing values

Even before you define the questions you want the data to answer, the data needs to be formatted appropriately. The rows should correspond to observations and the columns should correspond to the observed variables. This makes it easier to map the data to visual properties such as position, color, size, or shape. A preprocessing step is necessary to verify the dataset for correctness and consistency. Incomplete information has a high potential to produce incorrect results.


Tactics

There are several ways you can tackle working with incomplete data. Each has its pros and cons.

  1. Ignore any record with missing values
  2. Replace empty fields with a pre-defined value
  3. Replace empty fields with the most frequently appeared value
  4. Use the mean value
  5. Manual approach
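
The first four tactics above can be sketched in base R as follows. The data frame rides and its values are illustrative stand-ins, not the bikeshare file, and which tactic is appropriate depends on the variable and the analysis.

```r
# Illustrative data with one missing season and one missing count
rides <- data.frame(
  season = c(1, 1, NA, 2, 2, 2),
  cnt = c(985, 801, 1349, NA, 1600, 1606)
)

# Locate the gaps first: number of missing values per column
colSums(is.na(rides))

# 1. Ignore any record with missing values
complete_rows <- na.omit(rides)

# 2. Replace empty fields with a pre-defined value (here, season 1)
by_constant <- rides
by_constant$season[is.na(by_constant$season)] <- 1

# 3. Replace empty fields with the most frequently appearing value
mode_value <- as.numeric(names(which.max(table(rides$season))))
by_mode <- rides
by_mode$season[is.na(by_mode$season)] <- mode_value

# 4. Use the mean value (suitable for continuous variables only)
by_mean <- rides
by_mean$cnt[is.na(by_mean$cnt)] <- mean(rides$cnt, na.rm = TRUE)
```

Tactics 3 and 4 change the distribution of the variable, so it is worth recording which values were filled in.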

Problem

  • Row 7, column 3: The season variable has no value
  • Row 10, column 5: The month has no value.

Solution

In these two cases it’s easy to replace the value with a pre-known value. We wouldn’t want to ignore the record because the values can be easily determined.


Update the values

bikeshare$season[7]
## [1] NA
bikeshare$season[7] <- 1
bikeshare$season[7]
## [1] 1
bikeshare$mnth[10]
## [1] NA
bikeshare$mnth[10] <- 1
bikeshare$mnth[10]
## [1] 1

J. Exercise: Understand - Calculate basic summary statistics

It is helpful to calculate some summary statistics about your data to learn more about the distribution, the median, minimum, maximum values, variance, standard deviation, number of observations and attributes.


summary(bikeshare)
##     instant         dteday              season           year       
##  Min.   :  1.0   Length:731         Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   :366.0                      Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0                      Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp            humidity     
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714

The summary function shows the minimum, first quartile, median, mean, third quartile, and maximum values for each numeric variable in the data set. This is particularly useful for continuous variables such as temp, cnt, casual, and registered. For example, you can easily see the average number of customers (casual and registered) per day.
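
Variance, standard deviation, and the number of observations are not part of the summary() output, so they need to be calculated separately. A minimal sketch on a short illustrative vector of daily counts (the vector name and values are assumptions for the example):

```r
# Illustrative daily rental counts
cnt <- c(985, 801, 1349, 1562, 1600, 1606)

mean(cnt)    # arithmetic mean
median(cnt)  # middle value
range(cnt)   # minimum and maximum
var(cnt)     # sample variance
sd(cnt)      # sample standard deviation (square root of the variance)
length(cnt)  # number of observations
```

The same calls work on a data frame column, for example sd(bikeshare$cnt).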


K. Exercise: Understand - Visualize

Explore the data visually. As a first step, consider scatterplots to show relationships between variables, histograms for frequencies, density plots to show distributions, and box plots to show the range of values.

Kernel density plot

Let’s say you wanted to know the distribution of the ridership.

Kernel density plots are an effective way to view the distribution of a variable. Create the plot using plot(density(x)) where x is a numeric vector.

Build a density plot that shows the shape of the data for the number of riders per day.

density_riders <- density(bikeshare$cnt)
# paste() builds the subtitle, e.g. "Mean = <value>"
plot(density_riders, main = "Daily ridership", sub = paste("Mean =", round(mean(bikeshare$cnt), 2)), frame = FALSE, las = 1)
polygon(density_riders, col="gray", border="gray")

How would we interpret the density plot?


Histogram

Build a histogram that shows the frequency of the weather situation by day.

hist(bikeshare$weathersit, col="gray",border="gray", las=1, xlab="Weather", main="Frequency of weather situations")

Value  Meaning
1      Clear, Few clouds, Partly cloudy
2      Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3      Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4      Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

How would we interpret the histogram?


You can check whether your histogram is clear by reviewing the count of each value of weathersit.

weathersituation <- table(bikeshare$weathersit)
barplot(weathersituation, las=1)
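
Because weathersit is a code rather than a quantity, counting each value with table() is often easier to read than binning it with hist(). The sketch below uses a short illustrative vector and also converts the counts to proportions.

```r
# Illustrative weather codes for eight days
weathersit <- c(2, 2, 1, 1, 1, 1, 2, 3)

counts <- table(weathersit)    # number of days per code
counts

shares <- prop.table(counts)   # proportion of days per code
round(shares, 2)
```

Passing counts or shares to barplot() gives one bar per code, with no ambiguity about bin edges.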

L. Exercise: Scatter plots

To see relationships, scatter plots are useful. In this case, we are looking for positive or negative correlations.


Scatter plot

Build a simple scatter plot that shows the relationship between the rentals and temperature

plot(bikeshare$cnt, bikeshare$atemp, main= "Relationship between bike rentals and average daily temperature", frame=FALSE, xlab="Number of rentals per day", col="#999999", cex=2, ylab="Average daily temperature in degrees fahrenheit", las=1)


Scatter plot with fit lines

To aid in the interpretation, it is helpful to add a linear regression line if the relationship is linear, or a lowess line otherwise. A lowess line fits a smooth curve that follows the data locally, so it can track non-linear patterns more accurately.

plot(bikeshare$cnt, bikeshare$atemp, main= "Relationship between bike rentals and average daily temperature", frame=FALSE, xlab="Number of rentals per day", ylab="Average daily temperature in degrees fahrenheit", cex=2, las=1)

# Add fit lines
abline(lm(bikeshare$atemp~bikeshare$cnt), col="blue") # regression line (y~x) 
lines(lowess(bikeshare$cnt, bikeshare$atemp), col="orange") # lowess line (x,y)


How would we interpret this scatter plot? Use this to inform the title of your plot.
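
A scatter plot suggests the direction of a relationship, and a correlation coefficient puts a number on it. The sketch below uses simulated data (the variables temp and rentals are generated for illustration, not read from the bikeshare file) to show cor(), lm(), and lowess() together.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-ins: rentals rise with temperature, plus random noise
temp <- runif(50)
rentals <- 1000 + 6000 * temp + rnorm(50, sd = 500)

# Pearson correlation: values near +1 or -1 indicate a strong linear relationship
r <- cor(temp, rentals)
r

# Linear fit: the slope is the expected change in rentals per unit of temp
fit <- lm(rentals ~ temp)
coef(fit)

# Lowess returns smoothed (x, y) points that can be passed to lines()
smooth <- lowess(temp, rentals)
str(smooth)
```

A correlation close to zero with a clearly curved lowess line is a hint that the relationship exists but is not linear.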


Scatter plot with grouped categorical data (season)

Consider using color to group categorical data. In this example, we are grouping the points by season. We’re using the ggvis package.

library(ggvis)
bikeshare %>% 
  ggvis(x=~cnt, y=~atemp) %>% 

layer_points(fill = ~season)   %>% 
  add_axis("x", title = "Number of rentals per day") %>%
  add_axis("y", title = "Average daily temperature in degrees fahrenheit")

Rework the ggvis chart to categorize season as discrete values

library(ggvis)
bikeshare %>% 
  ggvis(x=~cnt, y=~atemp) %>% 

layer_points(fill = ~as.factor(season))   %>% 
  add_axis("x", title = "Number of rentals per day") %>%
  add_axis("y", title = "Average daily temperature in degrees fahrenheit")

Scatter plot with grouped categorical data (year)

Let’s look at the data by year.

library(ggvis)
bikeshare %>% 
  ggvis(x=~cnt, y=~atemp) %>% 

layer_points(fill = ~year)   %>% 
  add_axis("x", title = "Number of rentals per day") %>%
  add_axis("y", title = "Average daily temperature in degrees fahrenheit")

Change year to a factor

bikeshare$year1<-ordered(factor(bikeshare$year, levels =c(0,1),
labels = c("2011", "2012")))

library(ggvis)
 bikeshare %>% 
  ggvis(x=~cnt, y=~atemp) %>% 

layer_points(fill = ~year1)   %>% 
  add_legend("fill", 
  title = "Year", 
  orient = "right")%>%
  add_axis("x", title = "Number of rentals per day") %>%
  add_axis("y", title = "Average daily temperature in degrees fahrenheit")%>%
 scale_ordinal("fill", range = c("#000000", "#4cbea3"))
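
To see what the factor conversion above does to the values, here is a minimal base R sketch on an illustrative vector of year codes (yr is a stand-in, not the bikeshare column).

```r
# Illustrative year codes: 0 stands for 2011, 1 for 2012
yr <- c(0, 0, 1, 1, 0)

# levels lists the codes in order; labels supplies the names shown in legends
year_f <- factor(yr, levels = c(0, 1), labels = c("2011", "2012"))
levels(year_f)
table(year_f)       # observations per year
as.integer(year_f)  # underlying codes are now 1 and 2, not 0 and 1

# ordered = TRUE additionally lets you compare levels
year_ord <- factor(yr, levels = c(0, 1), labels = c("2011", "2012"), ordered = TRUE)
year_ord[1] < year_ord[3]
```

Plotting packages treat a factor as discrete categories, which is why the legend changes from a continuous scale to one entry per year.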

M. Exercise: Interactive chart - Use ggvis to filter

Build on the example above and add a filter to hide and reveal different seasons.

#This code is set to eval=FALSE because it cannot be knit
library(ggvis)
bikeshare %>% 
  ggvis(x=~cnt, y=~atemp) %>% 
  filter(bikeshare$season %in% eval(input_checkboxgroup(choices=unique(bikeshare$season), 
    selected = "1")))%>% 
layer_points(fill = ~factor(season))   %>% 
  add_axis("x", title = "Number of rentals per day") %>%
  add_axis("y", title = "Average daily temperature in degrees fahrenheit") %>%
  add_legend(title ="Seasons", size = 200)

Refined example

bikeshare$season<-ordered(factor(bikeshare$season, levels =c(1,2,3,4),labels = c("Winter", "Spring", "Summer", "Fall")))

library(ggvis)
bikeshare %>% 
  ggvis(x=~cnt, y=~atemp, fill = ~factor(season)) %>% 
  filter(bikeshare$season %in% eval(input_checkboxgroup(choices=unique(bikeshare$season), 
    selected = "Spring")))%>% 
layer_points()   %>% 
  add_legend("fill", 
  title = "Season", 
  orient = "left")%>%
  add_axis("x", title = "Number of rentals per day") %>%
  add_axis("y", title = "Average daily temperature in degrees fahrenheit") %>%
  scale_ordinal("fill", range = c("#000000", "#999999", "#CCCCCC", "#4cbea3"))
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

N. Communicate - Create an RMarkdown document

Complete on your own.

Create an RMarkdown document named Bike_Sharing.Rmd. Include the code and markup for exercises E-L.


ITERATION

This section will introduce control structures such as the while loop, for loop, if/else conditional statements, and functions.

O. Exercise: Iteration using the while loop

Create a while loop

x <- 10
while (x > 0) {
 print(x)
 x <- x - 1 
} 
## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1

Using a variable as a counter

counter <- 0
while (counter < 9) {
  print(counter)
  counter <- counter + 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
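
A while loop is most useful when the number of iterations is not known in advance. The following is a small sketch with made-up values (value and steps are names chosen for illustration): it keeps halving a number until it drops below 1 and counts how many passes that took.

```r
# Hypothetical example: halve a value until it drops below 1
value <- 20
steps <- 0
while (value >= 1) {
  value <- value / 2   # update the value tested by the condition
  steps <- steps + 1   # count how many times the body ran
}
steps   # 5
value   # 0.625
```

If the body never changed value, the condition would stay TRUE and the loop would never end, so every while loop needs an update step like the one above.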

P. Exercise: Iteration using a for loop

Iterate through a vector of numbers using a for loop

for (i in c(1,2,3,4)){
    print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Iterate through a column in the bikeshare data

for (i in bikeshare$atemp){
    print(i)
}
## [1] 0.363625
## [1] 0.353739
## [1] 0.189405
## [1] 0.212122
## [1] 0.22927
## [1] 0.233209
## [1] 0.208839
## [1] 0.162254
## [1] 0.116175
## [1] 0.150888
## [1] 0.191464
## [1] 0.160473
## [1] 0.150883
## [1] 0.188413
## [1] 0.248112
## [1] 0.234217
## [1] 0.176771
## [1] 0.232333
## [1] 0.298422
## [1] 0.25505
## [1] 0.157833
## [1] 0.0790696
## [1] 0.0988391
## [1] 0.11793
## [1] 0.234526
## [1] 0.2036
## [1] 0.2197
## [1] 0.223317
## [1] 0.212126
## [1] 0.250322
## [1] 0.18625
## [1] 0.23453
## [1] 0.254417
## [1] 0.177878
## [1] 0.228587
## [1] 0.243058
## [1] 0.291671
## [1] 0.303658
## [1] 0.198246
## [1] 0.144283
## [1] 0.149548
## [1] 0.213509
## [1] 0.232954
## [1] 0.324113
## [1] 0.39835
## [1] 0.254274
## [1] 0.3162
## [1] 0.428658
## [1] 0.511983
## [1] 0.391404
## [1] 0.27733
## [1] 0.284075
## [1] 0.186033
## [1] 0.245717
## [1] 0.289191
## [1] 0.350461
## [1] 0.282192
## [1] 0.351109
## [1] 0.400118
## [1] 0.263879
## [1] 0.320071
## [1] 0.200133
## [1] 0.255679
## [1] 0.378779
## [1] 0.366252
## [1] 0.238461
## [1] 0.3024
## [1] 0.286608
## [1] 0.385668
## [1] 0.305
## [1] 0.32575
## [1] 0.380091
## [1] 0.332
## [1] 0.318178
## [1] 0.36693
## [1] 0.410333
## [1] 0.527009
## [1] 0.466525
## [1] 0.32575
## [1] 0.409735
## [1] 0.440642
## [1] 0.337939
## [1] 0.270833
## [1] 0.256312
## [1] 0.257571
## [1] 0.250339
## [1] 0.257574
## [1] 0.292908
## [1] 0.29735
## [1] 0.257575
## [1] 0.283454
## [1] 0.315637
## [1] 0.378767
## [1] 0.542929
## [1] 0.39835
## [1] 0.387608
## [1] 0.433696
## [1] 0.324479
## [1] 0.341529
## [1] 0.426737
## [1] 0.565217
## [1] 0.493054
## [1] 0.417283
## [1] 0.462742
## [1] 0.441913
## [1] 0.425492
## [1] 0.445696
## [1] 0.503146
## [1] 0.489258
## [1] 0.564392
## [1] 0.453892
## [1] 0.321954
## [1] 0.450121
## [1] 0.551763
## [1] 0.5745
## [1] 0.594083
## [1] 0.575142
## [1] 0.578929
## [1] 0.497463
## [1] 0.464021
## [1] 0.448204
## [1] 0.532833
## [1] 0.582079
## [1] 0.40465
## [1] 0.441917
## [1] 0.474117
## [1] 0.512621
## [1] 0.518933
## [1] 0.525246
## [1] 0.522721
## [1] 0.5284
## [1] 0.523363
## [1] 0.4943
## [1] 0.500629
## [1] 0.536
## [1] 0.550512
## [1] 0.538529
## [1] 0.527158
## [1] 0.510742
## [1] 0.529042
## [1] 0.571975
## [1] 0.5745
## [1] 0.590296
## [1] 0.604813
## [1] 0.615542
## [1] 0.654688
## [1] 0.637008
## [1] 0.612379
## [1] 0.61555
## [1] 0.671092
## [1] 0.725383
## [1] 0.720967
## [1] 0.643942
## [1] 0.587133
## [1] 0.594696
## [1] 0.616804
## [1] 0.621858
## [1] 0.65595
## [1] 0.727279
## [1] 0.757579
## [1] 0.703292
## [1] 0.678038
## [1] 0.643325
## [1] 0.601654
## [1] 0.591546
## [1] 0.587754
## [1] 0.595346
## [1] 0.600383
## [1] 0.643954
## [1] 0.645846
## [1] 0.595346
## [1] 0.637646
## [1] 0.693829
## [1] 0.693833
## [1] 0.656583
## [1] 0.643313
## [1] 0.637629
## [1] 0.637004
## [1] 0.692558
## [1] 0.654688
## [1] 0.637008
## [1] 0.652162
## [1] 0.667308
## [1] 0.668575
## [1] 0.665417
## [1] 0.696338
## [1] 0.685633
## [1] 0.686871
## [1] 0.670483
## [1] 0.664158
## [1] 0.690025
## [1] 0.729804
## [1] 0.739275
## [1] 0.689404
## [1] 0.635104
## [1] 0.624371
## [1] 0.638263
## [1] 0.669833
## [1] 0.703925
## [1] 0.747479
## [1] 0.74685
## [1] 0.826371
## [1] 0.840896
## [1] 0.804287
## [1] 0.794829
## [1] 0.720958
## [1] 0.696979
## [1] 0.690667
## [1] 0.7399
## [1] 0.785967
## [1] 0.728537
## [1] 0.729796
## [1] 0.703292
## [1] 0.707071
## [1] 0.679937
## [1] 0.664788
## [1] 0.656567
## [1] 0.676154
## [1] 0.715292
## [1] 0.703283
## [1] 0.724121
## [1] 0.684983
## [1] 0.651521
## [1] 0.654042
## [1] 0.645858
## [1] 0.624388
## [1] 0.616167
## [1] 0.645837
## [1] 0.666671
## [1] 0.662258
## [1] 0.633221
## [1] 0.648996
## [1] 0.675525
## [1] 0.638254
## [1] 0.606067
## [1] 0.630692
## [1] 0.645854
## [1] 0.659733
## [1] 0.635556
## [1] 0.647959
## [1] 0.607958
## [1] 0.594704
## [1] 0.611121
## [1] 0.614921
## [1] 0.604808
## [1] 0.633213
## [1] 0.665429
## [1] 0.625646
## [1] 0.5152
## [1] 0.544229
## [1] 0.555361
## [1] 0.578946
## [1] 0.607962
## [1] 0.609229
## [1] 0.60213
## [1] 0.603554
## [1] 0.6269
## [1] 0.553671
## [1] 0.461475
## [1] 0.478512
## [1] 0.490537
## [1] 0.529675
## [1] 0.532217
## [1] 0.550533
## [1] 0.554963
## [1] 0.522125
## [1] 0.564412
## [1] 0.572637
## [1] 0.589042
## [1] 0.574525
## [1] 0.575158
## [1] 0.574512
## [1] 0.544829
## [1] 0.412863
## [1] 0.345317
## [1] 0.392046
## [1] 0.472858
## [1] 0.527138
## [1] 0.480425
## [1] 0.504404
## [1] 0.513242
## [1] 0.523983
## [1] 0.542925
## [1] 0.546096
## [1] 0.517717
## [1] 0.551804
## [1] 0.529675
## [1] 0.498725
## [1] 0.503154
## [1] 0.510725
## [1] 0.522721
## [1] 0.513848
## [1] 0.466525
## [1] 0.423596
## [1] 0.425492
## [1] 0.422333
## [1] 0.457067
## [1] 0.463375
## [1] 0.472846
## [1] 0.457046
## [1] 0.318812
## [1] 0.227913
## [1] 0.321329
## [1] 0.356063
## [1] 0.397088
## [1] 0.390133
## [1] 0.405921
## [1] 0.403392
## [1] 0.323854
## [1] 0.362358
## [1] 0.400871
## [1] 0.412246
## [1] 0.409079
## [1] 0.373721
## [1] 0.306817
## [1] 0.357942
## [1] 0.43055
## [1] 0.524612
## [1] 0.507579
## [1] 0.451988
## [1] 0.323221
## [1] 0.272721
## [1] 0.324483
## [1] 0.457058
## [1] 0.445062
## [1] 0.421696
## [1] 0.430537
## [1] 0.372471
## [1] 0.380671
## [1] 0.385087
## [1] 0.4558
## [1] 0.490122
## [1] 0.451375
## [1] 0.311221
## [1] 0.305554
## [1] 0.331433
## [1] 0.310604
## [1] 0.3491
## [1] 0.393925
## [1] 0.4564
## [1] 0.400246
## [1] 0.256938
## [1] 0.317542
## [1] 0.266412
## [1] 0.253154
## [1] 0.270196
## [1] 0.301138
## [1] 0.338362
## [1] 0.412237
## [1] 0.359825
## [1] 0.249371
## [1] 0.245579
## [1] 0.280933
## [1] 0.396454
## [1] 0.428017
## [1] 0.426121
## [1] 0.377513
## [1] 0.299242
## [1] 0.279961
## [1] 0.315535
## [1] 0.327633
## [1] 0.279974
## [1] 0.263892
## [1] 0.318812
## [1] 0.414121
## [1] 0.375621
## [1] 0.252304
## [1] 0.126275
## [1] 0.119337
## [1] 0.278412
## [1] 0.340267
## [1] 0.390779
## [1] 0.340258
## [1] 0.247479
## [1] 0.318826
## [1] 0.282821
## [1] 0.381938
## [1] 0.249362
## [1] 0.183087
## [1] 0.161625
## [1] 0.190663
## [1] 0.364278
## [1] 0.275254
## [1] 0.190038
## [1] 0.220958
## [1] 0.174875
## [1] 0.16225
## [1] 0.243058
## [1] 0.349108
## [1] 0.294821
## [1] 0.35605
## [1] 0.415383
## [1] 0.326379
## [1] 0.272721
## [1] 0.262625
## [1] 0.381317
## [1] 0.466538
## [1] 0.398971
## [1] 0.309346
## [1] 0.272725
## [1] 0.264521
## [1] 0.296426
## [1] 0.361104
## [1] 0.266421
## [1] 0.261988
## [1] 0.293558
## [1] 0.210867
## [1] 0.101658
## [1] 0.227913
## [1] 0.333946
## [1] 0.351629
## [1] 0.330162
## [1] 0.351629
## [1] 0.355425
## [1] 0.265788
## [1] 0.273391
## [1] 0.295113
## [1] 0.392667
## [1] 0.444446
## [1] 0.410971
## [1] 0.255675
## [1] 0.268308
## [1] 0.357954
## [1] 0.353525
## [1] 0.34847
## [1] 0.475371
## [1] 0.359842
## [1] 0.413492
## [1] 0.303021
## [1] 0.241171
## [1] 0.255042
## [1] 0.3851
## [1] 0.524604
## [1] 0.397083
## [1] 0.277767
## [1] 0.35967
## [1] 0.459592
## [1] 0.542929
## [1] 0.548617
## [1] 0.532825
## [1] 0.436229
## [1] 0.505046
## [1] 0.464
## [1] 0.532821
## [1] 0.538533
## [1] 0.513258
## [1] 0.531567
## [1] 0.570067
## [1] 0.486733
## [1] 0.437488
## [1] 0.43875
## [1] 0.315654
## [1] 0.47095
## [1] 0.482304
## [1] 0.375621
## [1] 0.421708
## [1] 0.417287
## [1] 0.427513
## [1] 0.461483
## [1] 0.53345
## [1] 0.431163
## [1] 0.390767
## [1] 0.426129
## [1] 0.492425
## [1] 0.476638
## [1] 0.436233
## [1] 0.337274
## [1] 0.387604
## [1] 0.431808
## [1] 0.487996
## [1] 0.573875
## [1] 0.614925
## [1] 0.598487
## [1] 0.457038
## [1] 0.493046
## [1] 0.515775
## [1] 0.542921
## [1] 0.389504
## [1] 0.301125
## [1] 0.405283
## [1] 0.470317
## [1] 0.483583
## [1] 0.452637
## [1] 0.377504
## [1] 0.450121
## [1] 0.457696
## [1] 0.577021
## [1] 0.537896
## [1] 0.537242
## [1] 0.590917
## [1] 0.584608
## [1] 0.546737
## [1] 0.527142
## [1] 0.557471
## [1] 0.553025
## [1] 0.491783
## [1] 0.520833
## [1] 0.544817
## [1] 0.585238
## [1] 0.5499
## [1] 0.576404
## [1] 0.595975
## [1] 0.572613
## [1] 0.551121
## [1] 0.566908
## [1] 0.583967
## [1] 0.565667
## [1] 0.580825
## [1] 0.584612
## [1] 0.6067
## [1] 0.627529
## [1] 0.642696
## [1] 0.641425
## [1] 0.6793
## [1] 0.672992
## [1] 0.611129
## [1] 0.631329
## [1] 0.607962
## [1] 0.566288
## [1] 0.575133
## [1] 0.578283
## [1] 0.525892
## [1] 0.542292
## [1] 0.569442
## [1] 0.597862
## [1] 0.648367
## [1] 0.663517
## [1] 0.659721
## [1] 0.597875
## [1] 0.611117
## [1] 0.624383
## [1] 0.599754
## [1] 0.594708
## [1] 0.571975
## [1] 0.544842
## [1] 0.654692
## [1] 0.720975
## [1] 0.752542
## [1] 0.724121
## [1] 0.652792
## [1] 0.674254
## [1] 0.654042
## [1] 0.594704
## [1] 0.640792
## [1] 0.675512
## [1] 0.786613
## [1] 0.687508
## [1] 0.750629
## [1] 0.702038
## [1] 0.70265
## [1] 0.732337
## [1] 0.761367
## [1] 0.752533
## [1] 0.804913
## [1] 0.790396
## [1] 0.654054
## [1] 0.664796
## [1] 0.650271
## [1] 0.654683
## [1] 0.667933
## [1] 0.666042
## [1] 0.705196
## [1] 0.724125
## [1] 0.755683
## [1] 0.745583
## [1] 0.714642
## [1] 0.613025
## [1] 0.549912
## [1] 0.623125
## [1] 0.690017
## [1] 0.70645
## [1] 0.654054
## [1] 0.739263
## [1] 0.734217
## [1] 0.697604
## [1] 0.667933
## [1] 0.684987
## [1] 0.662896
## [1] 0.667308
## [1] 0.707088
## [1] 0.722867
## [1] 0.751267
## [1] 0.731079
## [1] 0.710246
## [1] 0.697621
## [1] 0.707717
## [1] 0.699508
## [1] 0.667942
## [1] 0.638267
## [1] 0.644579
## [1] 0.662254
## [1] 0.676779
## [1] 0.654037
## [1] 0.654688
## [1] 0.2424
## [1] 0.618071
## [1] 0.603554
## [1] 0.595967
## [1] 0.601025
## [1] 0.621854
## [1] 0.637008
## [1] 0.6471
## [1] 0.618696
## [1] 0.595996
## [1] 0.654688
## [1] 0.66605
## [1] 0.635733
## [1] 0.652779
## [1] 0.6894
## [1] 0.702654
## [1] 0.649
## [1] 0.661629
## [1] 0.686888
## [1] 0.708983
## [1] 0.655329
## [1] 0.657204
## [1] 0.611121
## [1] 0.578925
## [1] 0.565654
## [1] 0.554292
## [1] 0.570075
## [1] 0.579558
## [1] 0.594083
## [1] 0.585867
## [1] 0.563125
## [1] 0.55305
## [1] 0.565067
## [1] 0.540404
## [1] 0.532192
## [1] 0.571971
## [1] 0.610488
## [1] 0.518933
## [1] 0.502513
## [1] 0.544179
## [1] 0.596613
## [1] 0.607975
## [1] 0.585863
## [1] 0.530296
## [1] 0.517663
## [1] 0.512
## [1] 0.542333
## [1] 0.599133
## [1] 0.607975
## [1] 0.580187
## [1] 0.538521
## [1] 0.419813
## [1] 0.387608
## [1] 0.438112
## [1] 0.503142
## [1] 0.431167
## [1] 0.433071
## [1] 0.391396
## [1] 0.508204
## [1] 0.53915
## [1] 0.460846
## [1] 0.450108
## [1] 0.512625
## [1] 0.537896
## [1] 0.472842
## [1] 0.456429
## [1] 0.482942
## [1] 0.530304
## [1] 0.558721
## [1] 0.529688
## [1] 0.52275
## [1] 0.515133
## [1] 0.467771
## [1] 0.4394
## [1] 0.309909
## [1] 0.3611
## [1] 0.369942
## [1] 0.356042
## [1] 0.323846
## [1] 0.329538
## [1] 0.308075
## [1] 0.281567
## [1] 0.274621
## [1] 0.341891
## [1] 0.355413
## [1] 0.393937
## [1] 0.421713
## [1] 0.475383
## [1] 0.323225
## [1] 0.281563
## [1] 0.324492
## [1] 0.347204
## [1] 0.326383
## [1] 0.337746
## [1] 0.375621
## [1] 0.380667
## [1] 0.364892
## [1] 0.350371
## [1] 0.378779
## [1] 0.248742
## [1] 0.257583
## [1] 0.339004
## [1] 0.281558
## [1] 0.289762
## [1] 0.298422
## [1] 0.323867
## [1] 0.316904
## [1] 0.359208
## [1] 0.455796
## [1] 0.469054
## [1] 0.428012
## [1] 0.258204
## [1] 0.321958
## [1] 0.389508
## [1] 0.390146
## [1] 0.435575
## [1] 0.338363
## [1] 0.297338
## [1] 0.294188
## [1] 0.294192
## [1] 0.338383
## [1] 0.369938
## [1] 0.4015
## [1] 0.409708
## [1] 0.342162
## [1] 0.335217
## [1] 0.301767
## [1] 0.236113
## [1] 0.259471
## [1] 0.2589
## [1] 0.294465
## [1] 0.220333
## [1] 0.226642
## [1] 0.255046
## [1] 0.2424
## [1] 0.2317
## [1] 0.223487

Next, use a for loop to round each number in bikeshare$atemp

output <- vector("double", nrow(bikeshare)) #1.output
for (i in seq_along(bikeshare$atemp)) { #2. sequence
  output[[i]] <- round(bikeshare$atemp[[i]], 2) #3. body
}
output
##   [1] 0.36 0.35 0.19 0.21 0.23 0.23 0.21 0.16 0.12 0.15 0.19 0.16 0.15 0.19
##  [15] 0.25 0.23 0.18 0.23 0.30 0.26 0.16 0.08 0.10 0.12 0.23 0.20 0.22 0.22
##  [29] 0.21 0.25 0.19 0.23 0.25 0.18 0.23 0.24 0.29 0.30 0.20 0.14 0.15 0.21
##  [43] 0.23 0.32 0.40 0.25 0.32 0.43 0.51 0.39 0.28 0.28 0.19 0.25 0.29 0.35
##  [57] 0.28 0.35 0.40 0.26 0.32 0.20 0.26 0.38 0.37 0.24 0.30 0.29 0.39 0.30
##  [71] 0.33 0.38 0.33 0.32 0.37 0.41 0.53 0.47 0.33 0.41 0.44 0.34 0.27 0.26
##  [85] 0.26 0.25 0.26 0.29 0.30 0.26 0.28 0.32 0.38 0.54 0.40 0.39 0.43 0.32
##  [99] 0.34 0.43 0.57 0.49 0.42 0.46 0.44 0.43 0.45 0.50 0.49 0.56 0.45 0.32
## [113] 0.45 0.55 0.57 0.59 0.58 0.58 0.50 0.46 0.45 0.53 0.58 0.40 0.44 0.47
## [127] 0.51 0.52 0.53 0.52 0.53 0.52 0.49 0.50 0.54 0.55 0.54 0.53 0.51 0.53
## [141] 0.57 0.57 0.59 0.60 0.62 0.65 0.64 0.61 0.62 0.67 0.73 0.72 0.64 0.59
## [155] 0.59 0.62 0.62 0.66 0.73 0.76 0.70 0.68 0.64 0.60 0.59 0.59 0.60 0.60
## [169] 0.64 0.65 0.60 0.64 0.69 0.69 0.66 0.64 0.64 0.64 0.69 0.65 0.64 0.65
## [183] 0.67 0.67 0.67 0.70 0.69 0.69 0.67 0.66 0.69 0.73 0.74 0.69 0.64 0.62
## [197] 0.64 0.67 0.70 0.75 0.75 0.83 0.84 0.80 0.79 0.72 0.70 0.69 0.74 0.79
## [211] 0.73 0.73 0.70 0.71 0.68 0.66 0.66 0.68 0.72 0.70 0.72 0.68 0.65 0.65
## [225] 0.65 0.62 0.62 0.65 0.67 0.66 0.63 0.65 0.68 0.64 0.61 0.63 0.65 0.66
## [239] 0.64 0.65 0.61 0.59 0.61 0.61 0.60 0.63 0.67 0.63 0.52 0.54 0.56 0.58
## [253] 0.61 0.61 0.60 0.60 0.63 0.55 0.46 0.48 0.49 0.53 0.53 0.55 0.55 0.52
## [267] 0.56 0.57 0.59 0.57 0.58 0.57 0.54 0.41 0.35 0.39 0.47 0.53 0.48 0.50
## [281] 0.51 0.52 0.54 0.55 0.52 0.55 0.53 0.50 0.50 0.51 0.52 0.51 0.47 0.42
## [295] 0.43 0.42 0.46 0.46 0.47 0.46 0.32 0.23 0.32 0.36 0.40 0.39 0.41 0.40
## [309] 0.32 0.36 0.40 0.41 0.41 0.37 0.31 0.36 0.43 0.52 0.51 0.45 0.32 0.27
## [323] 0.32 0.46 0.45 0.42 0.43 0.37 0.38 0.39 0.46 0.49 0.45 0.31 0.31 0.33
## [337] 0.31 0.35 0.39 0.46 0.40 0.26 0.32 0.27 0.25 0.27 0.30 0.34 0.41 0.36
## [351] 0.25 0.25 0.28 0.40 0.43 0.43 0.38 0.30 0.28 0.32 0.33 0.28 0.26 0.32
## [365] 0.41 0.38 0.25 0.13 0.12 0.28 0.34 0.39 0.34 0.25 0.32 0.28 0.38 0.25
## [379] 0.18 0.16 0.19 0.36 0.28 0.19 0.22 0.17 0.16 0.24 0.35 0.29 0.36 0.42
## [393] 0.33 0.27 0.26 0.38 0.47 0.40 0.31 0.27 0.26 0.30 0.36 0.27 0.26 0.29
## [407] 0.21 0.10 0.23 0.33 0.35 0.33 0.35 0.36 0.27 0.27 0.30 0.39 0.44 0.41
## [421] 0.26 0.27 0.36 0.35 0.35 0.48 0.36 0.41 0.30 0.24 0.26 0.39 0.52 0.40
## [435] 0.28 0.36 0.46 0.54 0.55 0.53 0.44 0.51 0.46 0.53 0.54 0.51 0.53 0.57
## [449] 0.49 0.44 0.44 0.32 0.47 0.48 0.38 0.42 0.42 0.43 0.46 0.53 0.43 0.39
## [463] 0.43 0.49 0.48 0.44 0.34 0.39 0.43 0.49 0.57 0.61 0.60 0.46 0.49 0.52
## [477] 0.54 0.39 0.30 0.41 0.47 0.48 0.45 0.38 0.45 0.46 0.58 0.54 0.54 0.59
## [491] 0.58 0.55 0.53 0.56 0.55 0.49 0.52 0.54 0.59 0.55 0.58 0.60 0.57 0.55
## [505] 0.57 0.58 0.57 0.58 0.58 0.61 0.63 0.64 0.64 0.68 0.67 0.61 0.63 0.61
## [519] 0.57 0.58 0.58 0.53 0.54 0.57 0.60 0.65 0.66 0.66 0.60 0.61 0.62 0.60
## [533] 0.59 0.57 0.54 0.65 0.72 0.75 0.72 0.65 0.67 0.65 0.59 0.64 0.68 0.79
## [547] 0.69 0.75 0.70 0.70 0.73 0.76 0.75 0.80 0.79 0.65 0.66 0.65 0.65 0.67
## [561] 0.67 0.71 0.72 0.76 0.75 0.71 0.61 0.55 0.62 0.69 0.71 0.65 0.74 0.73
## [575] 0.70 0.67 0.68 0.66 0.67 0.71 0.72 0.75 0.73 0.71 0.70 0.71 0.70 0.67
## [589] 0.64 0.64 0.66 0.68 0.65 0.65 0.24 0.62 0.60 0.60 0.60 0.62 0.64 0.65
## [603] 0.62 0.60 0.65 0.67 0.64 0.65 0.69 0.70 0.65 0.66 0.69 0.71 0.66 0.66
## [617] 0.61 0.58 0.57 0.55 0.57 0.58 0.59 0.59 0.56 0.55 0.57 0.54 0.53 0.57
## [631] 0.61 0.52 0.50 0.54 0.60 0.61 0.59 0.53 0.52 0.51 0.54 0.60 0.61 0.58
## [645] 0.54 0.42 0.39 0.44 0.50 0.43 0.43 0.39 0.51 0.54 0.46 0.45 0.51 0.54
## [659] 0.47 0.46 0.48 0.53 0.56 0.53 0.52 0.52 0.47 0.44 0.31 0.36 0.37 0.36
## [673] 0.32 0.33 0.31 0.28 0.27 0.34 0.36 0.39 0.42 0.48 0.32 0.28 0.32 0.35
## [687] 0.33 0.34 0.38 0.38 0.36 0.35 0.38 0.25 0.26 0.34 0.28 0.29 0.30 0.32
## [701] 0.32 0.36 0.46 0.47 0.43 0.26 0.32 0.39 0.39 0.44 0.34 0.30 0.29 0.29
## [715] 0.34 0.37 0.40 0.41 0.34 0.34 0.30 0.24 0.26 0.26 0.29 0.22 0.23 0.26
## [729] 0.24 0.23 0.22

#simple way to round without a loop
#atemp_rounded<- round(bikeshare$atemp, 2)
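
To check that the loop and the one-line version agree, here is a minimal sketch on a short sample. The vector atemp_sample is an assumption standing in for bikeshare$atemp; its values are the first four printed above.

```r
# Hypothetical sample standing in for the first values of bikeshare$atemp
atemp_sample <- c(0.363625, 0.353739, 0.189405, 0.212122)

# Loop version, following the three steps used above
rounded_loop <- vector("double", length(atemp_sample))  # 1. output
for (i in seq_along(atemp_sample)) {                    # 2. sequence
  rounded_loop[[i]] <- round(atemp_sample[[i]], 2)      # 3. body
}

# Vectorized version: round() works on the whole vector at once
rounded_vec <- round(atemp_sample, 2)

identical(rounded_loop, rounded_vec)  # TRUE
```

Many R functions are vectorized like round(), so the loop is mainly useful here for practicing the output, sequence, body pattern.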

Return to 4.2

How would you compute the individual measures of central tendency and variability for the attitude data set?

You will need to think through this problem. First, you need a place to store the output, then you need to use the seq_along function to iterate through the dataset. Finally, you need to update the output.

Try it.

attitudestats <- vector("double", ncol (attitude)) #1 store output
for (i in seq_along(attitude)) { #2 sequence 
  attitudestats[[i]] <- median(attitude[[i]])
  }
attitudestats
## [1] 65.5 65.0 51.5 56.5 63.5 77.5 41.0
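
The question also asks about variability. As a sketch, the same three-step pattern can be reused with sd() in place of median(). The name attitude_sd is an assumption chosen for illustration, and attaching column names is an optional extra step.

```r
# attitude ships with base R (datasets package)
attitude_sd <- vector("double", ncol(attitude))   # 1. store output
for (i in seq_along(attitude)) {                  # 2. sequence
  attitude_sd[[i]] <- sd(attitude[[i]])           # 3. body
}
names(attitude_sd) <- colnames(attitude)          # label each result
attitude_sd
```

Swapping sd() for mean(), min(), or max() gives the other summary measures without changing the rest of the loop.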

Refined example

attitudestats <- vector("double", ncol (attitude)) #1 store output
for (i in seq_along(attitude)) { #2 sequence 
  attitudestats[[i]] <- median(attitude[[i]])
  print (paste(i, colnames(attitude[i]),":", attitudestats[[i]]))
  }
## [1] "1 rating : 65.5"
## [1] "2 complaints : 65"
## [1] "3 privileges : 51.5"
## [1] "4 learning : 56.5"
## [1] "5 raises : 63.5"
## [1] "6 critical : 77.5"
## [1] "7 advance : 41"

Try to code the same problem using a while loop for fun

i <- 1
attitudestats_while <- vector("double", ncol(attitude))

while (i <= ncol(attitude)) {
  attitudestats_while[[i]] <- median(attitude[[i]])
  print(paste(i, colnames(attitude[i]), ":", median(attitude[[i]])))
  i <- i + 1
}
## [1] "1 rating : 65.5"
## [1] "2 complaints : 65"
## [1] "3 privileges : 51.5"
## [1] "4 learning : 56.5"
## [1] "5 raises : 63.5"
## [1] "6 critical : 77.5"
## [1] "7 advance : 41"

CONDITIONALS

Q. Exercise: Conditionals

Let’s review Boolean values and comparison operators

3 > 4
## [1] FALSE
c(1, 2, 3, 4, 5) > 4
## [1] FALSE FALSE FALSE FALSE  TRUE
c(1, 2, 3, 4, 6) == 3
## [1] FALSE FALSE  TRUE FALSE FALSE
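
Comparisons can be combined with the logical operators & (and), | (or), and ! (not). A minimal sketch on a made-up vector x:

```r
x <- c(1, 2, 3, 4, 5)

x > 1 & x < 5    # AND: TRUE only where both comparisons hold
x < 2 | x > 4    # OR: TRUE where at least one comparison holds
!(x == 3)        # NOT: flips each TRUE/FALSE value
```

Each line returns one TRUE or FALSE per element of x, just like the single comparisons above.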

Conditional statements using if/else logic

Build a program that checks to see which prices are considered “cheap” (here, below 10) and counts them

prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)
numCheap <- 0
for (p in prices){
    if (p < 10){
        numCheap <- numCheap + 1
    }
}  
print(numCheap)
## [1] 3

Alternative approach

prices <- c(12.43, 9.99, 18.22, 7.25, 0.50, 11)
sum(prices < 10)
## [1] 3
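
The loop above uses only the if branch. As a sketch of the full if/else form, the following labels every price instead of counting, then repeats the work with the vectorized ifelse() function. The names labels_loop and labels_vec and the label text are assumptions chosen for illustration.

```r
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)

# Loop version with an explicit else branch
labels_loop <- vector("character", length(prices))
for (i in seq_along(prices)) {
  if (prices[[i]] < 10) {
    labels_loop[[i]] <- "cheap"
  } else {
    labels_loop[[i]] <- "not cheap"
  }
}

# Vectorized equivalent using ifelse()
labels_vec <- ifelse(prices < 10, "cheap", "not cheap")

labels_loop
identical(labels_loop, labels_vec)  # TRUE
```

The shortcut sum(prices < 10) works for a related reason: the comparison returns a logical vector, and sum() counts each TRUE as 1 and each FALSE as 0.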

FUNCTIONS

Some functions are built in, such as:

sqrt(25)
## [1] 5
mean(c(1,2,3,4,5))
## [1] 3
toupper("hello world")
## [1] "HELLO WORLD"

R. Exercise: Functions

Write your own function. Here’s an example of the form, with one minor error: the commented-out call passes a string where a number is needed.

f <- function(x) x + 2
#f("hello world") # causes an error because we need the parameter as a numeric.
f(3)
## [1] 5
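
One way to make that failure easier to understand is to check the input inside the function. This is only a sketch: f_checked and its error message are hypothetical names chosen for illustration.

```r
# Hypothetical variant of f that checks its input first
f_checked <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")
  }
  x + 2
}

f_checked(3)  # 5

# tryCatch() captures the error so the script keeps running
tryCatch(f_checked("hello world"),
         error = function(e) conditionMessage(e))
```

The tryCatch() call returns the error message as a string instead of halting, which is handy when knitting a document.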

Pass in multiple arguments

addTogether <- function(x, y) x + y
addTogether(5, 10)
## [1] 15
addTogether(x = 5, y = 10) #alternative 
## [1] 15
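
Arguments can also be given default values, which are used when the caller leaves them out. A minimal sketch, where addWithDefault is a hypothetical name chosen for illustration:

```r
# Hypothetical variant: y defaults to 1 when the caller omits it
addWithDefault <- function(x, y = 1) x + y

addWithDefault(5)              # uses the default, returns 6
addWithDefault(5, 10)          # positional override, returns 15
addWithDefault(y = 10, x = 5)  # named arguments can come in any order
```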

Create multi-line functions


f <- function(x){
  y <- x^2
  z <- y/2
  z
}
f(2)
## [1] 2

You try it: Write a function that averages two numbers

avg <- function(x,y){
    (x + y)/2
}
avg(1,2)
## [1] 1.5

Apply a function over a vector

f <- function(x) x^2
sapply(c(1,2,3,4,5),f)
## [1]  1  4  9 16 25

sapply(attitude,f)
##       rating complaints privileges learning raises critical advance
##  [1,]   1849       2601        900     1521   3721     8464    2025
##  [2,]   3969       4096       2601     2916   3969     5329    2209
##  [3,]   5041       4900       4624     4761   5776     7396    2304
##  [4,]   3721       3969       2025     2209   2916     7056    1225
##  [5,]   6561       6084       3136     4356   5041     6889    2209
##  [6,]   1849       3025       2401     1936   2916     2401    1156
##  [7,]   3364       4489       1764     3136   4356     4624    1225
##  [8,]   5041       5625       2500     3025   4900     4356    1681
##  [9,]   5184       6724       5184     4489   5041     6889     961
## [10,]   4489       3721       2025     2209   3844     6400    1681
## [11,]   4096       2809       2809     3364   3364     4489    1156
## [12,]   4489       3600       2209     1521   3481     5476    1681
## [13,]   4761       3844       3249     1764   3025     3969     625
## [14,]   4624       6889       6889     2025   3481     5929    1225
## [15,]   5929       5929       2916     5184   6241     5929    2116
## [16,]   6561       8100       2500     5184   3600     2916    1296
## [17,]   5476       7225       4096     4761   6241     6241    3969
## [18,]   4225       3600       4225     5625   3025     6400    3600
## [19,]   4225       4900       2116     3249   5625     7225    2116
## [20,]   2500       3364       4624     2916   4096     6084    2704
## [21,]   2500       1600       1089     1156   1849     4096    1089
## [22,]   4096       3721       2704     3844   4356     6400    1681
## [23,]   2809       4356       2704     2500   3969     6400    1369
## [24,]   1600       1369       1764     3364   2500     3249    2401
## [25,]   3969       2916       1764     2304   4356     5625    1089
## [26,]   4356       5929       4356     3969   7744     5776    5184
## [27,]   6084       5625       3364     5476   6400     6084    2401
## [28,]   2304       3249       1936     2025   2601     6889    1444
## [29,]   7225       7225       5041     5041   5929     5476    3025
## [30,]   6724       6724       1521     3481   4096     6084    1521

Try it using the bikeshare data.

f <- function(x) x^2
sapply(bikeshare$atemp,f)
##   [1] 0.132223141 0.125131280 0.035874254 0.044995743 0.052564733
##   [6] 0.054386438 0.043613728 0.026326361 0.013496631 0.022767189
##  [11] 0.036658463 0.025751584 0.022765680 0.035499459 0.061559565
##  [16] 0.054857603 0.031247986 0.053978623 0.089055690 0.065050502
##  [21] 0.024911256 0.006252002 0.009769168 0.013907485 0.055002445
##  [26] 0.041452960 0.048268090 0.049870482 0.044997440 0.062661104
##  [31] 0.034689062 0.055004321 0.064728010 0.031640583 0.052252017
##  [36] 0.059077191 0.085071972 0.092208181 0.039301477 0.020817584
##  [41] 0.022364604 0.045586093 0.054267566 0.105049237 0.158682722
##  [46] 0.064655267 0.099982440 0.183747681 0.262126592 0.153197091
##  [51] 0.076911929 0.080698606 0.034608277 0.060376844 0.083631434
##  [56] 0.122822913 0.079632325 0.123277530 0.160094414 0.069632127
##  [61] 0.102445445 0.040053218 0.065371751 0.143473531 0.134140528
##  [66] 0.056863649 0.091445760 0.082144146 0.148739806 0.093025000
##  [71] 0.106113062 0.144469168 0.110224000 0.101237240 0.134637625
##  [76] 0.168373171 0.277738486 0.217645576 0.106113062 0.167882770
##  [81] 0.194165372 0.114202768 0.073350514 0.065695841 0.066342820
##  [86] 0.062669615 0.066344365 0.085795096 0.088417022 0.066344881
##  [91] 0.080346170 0.099626716 0.143464440 0.294771899 0.158682722
##  [96] 0.150239962 0.188092220 0.105286621 0.116642058 0.182104467
## [101] 0.319470257 0.243102247 0.174125102 0.214130159 0.195287100
## [106] 0.181043442 0.198644924 0.253155897 0.239373391 0.318538330
## [111] 0.206017948 0.103654378 0.202608915 0.304442408 0.330050250
## [116] 0.352934611 0.330788320 0.335158787 0.247469436 0.215315488
## [121] 0.200886826 0.283911006 0.338815962 0.163741623 0.195290635
## [126] 0.224786930 0.262780290 0.269291458 0.275883361 0.273237244
## [131] 0.279206560 0.273908830 0.244332490 0.250629396 0.287296000
## [136] 0.303063462 0.290013484 0.277895557 0.260857391 0.279885438
## [141] 0.327155401 0.330050250 0.348449368 0.365798765 0.378891954
## [146] 0.428616377 0.405779192 0.375008040 0.378901803 0.450364472
## [151] 0.526180497 0.519793415 0.414661299 0.344725160 0.353663332
## [156] 0.380447174 0.386707372 0.430270403 0.528934744 0.573925941
## [161] 0.494619637 0.459735529 0.413867056 0.361987536 0.349926670
## [166] 0.345454765 0.354436860 0.360459747 0.414676754 0.417117056
## [171] 0.354436860 0.406592421 0.481398681 0.481404232 0.431101236
## [176] 0.413851616 0.406570742 0.405774096 0.479636583 0.428616377
## [181] 0.405779192 0.425315274 0.445299967 0.446992531 0.442779784
## [186] 0.484886610 0.470092611 0.471791771 0.449547453 0.441105849
## [191] 0.476134501 0.532613878 0.546527526 0.475277875 0.403357091
## [196] 0.389839146 0.407379657 0.448676248 0.495510406 0.558724855
## [201] 0.557784923 0.682889030 0.707106083 0.646877578 0.631753139
## [206] 0.519780438 0.485779726 0.477020905 0.547452010 0.617744125
## [211] 0.530766160 0.532602202 0.494619637 0.499949399 0.462314324
## [216] 0.441943085 0.431080225 0.457184232 0.511642645 0.494606978
## [221] 0.524351223 0.469201710 0.424479613 0.427770938 0.417132556
## [226] 0.389860375 0.379661772 0.417105431 0.444450222 0.438585659
## [231] 0.400968835 0.421195808 0.456334026 0.407368169 0.367317208
## [236] 0.397772399 0.417127389 0.435247631 0.403931429 0.419850866
## [241] 0.369612930 0.353672848 0.373468877 0.378127836 0.365792717
## [246] 0.400958703 0.442795754 0.391432917 0.265431040 0.296185204
## [251] 0.308425840 0.335178471 0.369617793 0.371159974 0.362560537
## [256] 0.364277431 0.393003610 0.306551576 0.212959176 0.228973734
## [261] 0.240626548 0.280555606 0.283254935 0.303086584 0.307983931
## [266] 0.272614516 0.318560906 0.327913134 0.346970478 0.330078976
## [271] 0.330806725 0.330064038 0.296838639 0.170455857 0.119243830
## [276] 0.153700066 0.223594688 0.277874471 0.230808181 0.254423395
## [281] 0.263417351 0.274558184 0.294767556 0.298220841 0.268030892
## [286] 0.304487654 0.280555606 0.248726626 0.253163948 0.260840026
## [291] 0.273237244 0.264039767 0.217645576 0.179433571 0.181043442
## [296] 0.178365163 0.208910242 0.214716391 0.223583340 0.208891046
## [301] 0.101641091 0.051944336 0.103252326 0.126780860 0.157678880
## [306] 0.152203758 0.164771858 0.162725106 0.104881413 0.131303320
## [311] 0.160697559 0.169946765 0.167345628 0.139667386 0.094136671
## [316] 0.128122475 0.185373303 0.275217751 0.257636441 0.204293152
## [321] 0.104471815 0.074376744 0.105289217 0.208902015 0.198080184
## [326] 0.177827516 0.185362108 0.138734646 0.144910410 0.148291998
## [331] 0.207753640 0.240219575 0.203739391 0.096858511 0.093363247
## [336] 0.109847833 0.096474845 0.121870810 0.155176906 0.208300960
## [341] 0.160196861 0.066017136 0.100832922 0.070975354 0.064086948
## [346] 0.073005878 0.090684095 0.114488843 0.169939344 0.129474031
## [351] 0.062185896 0.060309045 0.078923350 0.157175774 0.183198552
## [356] 0.181579107 0.142516065 0.089545775 0.078378162 0.099562336
## [361] 0.107343383 0.078385441 0.069638988 0.101641091 0.171496203
## [366] 0.141091136 0.063657308 0.015945376 0.014241320 0.077513242
## [371] 0.115781631 0.152708227 0.115775507 0.061245855 0.101650018
## [376] 0.079987718 0.145876636 0.062181407 0.033520850 0.026122641
## [381] 0.036352380 0.132698461 0.075764765 0.036114441 0.048822438
## [386] 0.030581266 0.026325063 0.059077191 0.121876396 0.086919422
## [391] 0.126771602 0.172543037 0.106523252 0.074376744 0.068971891
## [396] 0.145402654 0.217657705 0.159177859 0.095694948 0.074378926
## [401] 0.069971359 0.087868373 0.130396099 0.070980149 0.068637712
## [406] 0.086176299 0.044464892 0.010334349 0.051944336 0.111519931
## [411] 0.123642954 0.109006946 0.123642954 0.126326931 0.070643261
## [416] 0.074742639 0.087091683 0.154187373 0.197532247 0.168897163
## [421] 0.065369706 0.071989183 0.128131066 0.124979926 0.121431341
## [426] 0.225977588 0.129486265 0.170975634 0.091821726 0.058163451
## [431] 0.065046422 0.148302010 0.275209357 0.157674909 0.077154506
## [436] 0.129362509 0.211224806 0.294771899 0.300980613 0.283902481
## [441] 0.190295740 0.255071462 0.215296000 0.283898218 0.290017792
## [446] 0.263433775 0.282563475 0.324976384 0.236909013 0.191395750
## [451] 0.192501562 0.099637448 0.221793902 0.232617148 0.141091136
## [456] 0.177837637 0.174128440 0.182767365 0.212966559 0.284568902
## [461] 0.185901533 0.152698848 0.181585925 0.242482381 0.227183783
## [466] 0.190299230 0.113753751 0.150236861 0.186458149 0.238140096
## [471] 0.329332516 0.378132756 0.358186689 0.208883733 0.243094358
## [476] 0.266023851 0.294763212 0.151713366 0.090676266 0.164254310
## [481] 0.221198080 0.233852518 0.204880254 0.142509270 0.202608915
## [486] 0.209485628 0.332953234 0.289332107 0.288628967 0.349182901
## [491] 0.341766514 0.298921347 0.277878688 0.310773916 0.305836651
## [496] 0.241850519 0.271267014 0.296825563 0.342503517 0.302390010
## [501] 0.332241571 0.355186201 0.327885648 0.303734357 0.321384680
## [506] 0.341017457 0.319979155 0.337357681 0.341771191 0.368084890
## [511] 0.393792646 0.413058148 0.411426031 0.461448490 0.452918232
## [516] 0.373478655 0.398576306 0.369617793 0.320682099 0.330777968
## [521] 0.334411228 0.276562396 0.294080613 0.324264191 0.357438971
## [526] 0.420379767 0.440254809 0.435231798 0.357454516 0.373463988
## [531] 0.389854131 0.359704861 0.353677605 0.327155401 0.296852805
## [536] 0.428621615 0.519804951 0.566319462 0.524351223 0.426137395
## [541] 0.454618457 0.427770938 0.353672848 0.410614387 0.456316462
## [546] 0.618760012 0.472667250 0.563443896 0.492857353 0.493717022
## [551] 0.536317482 0.579679709 0.566305916 0.647884938 0.624725837
## [556] 0.427786635 0.441953722 0.422852373 0.428609830 0.446134492
## [561] 0.443611946 0.497301398 0.524357016 0.571056796 0.555894010
## [566] 0.510713188 0.375799651 0.302403208 0.388284766 0.476123460
## [571] 0.499071603 0.427786635 0.546509783 0.539074603 0.486651341
## [576] 0.446134492 0.469207190 0.439431107 0.445299967 0.499973440
## [581] 0.522536700 0.564402105 0.534476504 0.504449381 0.486675060
## [586] 0.500863352 0.489311442 0.446146515 0.407384763 0.415482087
## [591] 0.438580361 0.458029815 0.427764397 0.428616377 0.058757760
## [596] 0.382011761 0.364277431 0.355176665 0.361231051 0.386702397
## [601] 0.405779192 0.418738410 0.382784740 0.355211232 0.428616377
## [606] 0.443622603 0.404156447 0.426120423 0.475272360 0.493722644
## [611] 0.421201000 0.437752934 0.471815125 0.502656894 0.429456098
## [616] 0.431917098 0.373468877 0.335154156 0.319964448 0.307239621
## [621] 0.324985506 0.335887475 0.352934611 0.343240142 0.317109766
## [626] 0.305864303 0.319300714 0.292036483 0.283228325 0.327150825
## [631] 0.372695598 0.269291458 0.252519315 0.296130784 0.355947072
## [636] 0.369633601 0.343235455 0.281213848 0.267974982 0.262144000
## [641] 0.294125083 0.358960352 0.369633601 0.336616955 0.290004867
## [646] 0.176242955 0.150239962 0.191942125 0.253151872 0.185904982
## [651] 0.187550491 0.153190829 0.258271306 0.290682723 0.212379036
## [656] 0.202597212 0.262784391 0.289332107 0.223579557 0.208327432
## [661] 0.233232975 0.281222332 0.312169156 0.280569377 0.273267563
## [666] 0.265362008 0.218809708 0.193072360 0.096043588 0.130393210
## [671] 0.136857083 0.126765906 0.104876232 0.108595293 0.094910206
## [676] 0.079279975 0.075416694 0.116889456 0.126318401 0.155186360
## [681] 0.177841854 0.225988997 0.104474401 0.079277723 0.105295058
## [686] 0.120550618 0.106525863 0.114072361 0.141091136 0.144907365
## [691] 0.133146172 0.122759838 0.143473531 0.061872583 0.066349002
## [696] 0.114923712 0.079274907 0.083962017 0.089055690 0.104889834
## [701] 0.100428145 0.129030387 0.207749994 0.220011655 0.183194272
## [706] 0.066669306 0.103656954 0.151716482 0.152213901 0.189725581
## [711] 0.114489520 0.088409886 0.086546579 0.086548933 0.114503055
## [716] 0.136854124 0.161202250 0.167860645 0.117074834 0.112370437
## [721] 0.091063322 0.055749349 0.067325200 0.067029210 0.086709636
## [726] 0.048546631 0.051366596 0.065048462 0.058757760 0.053684890
## [731] 0.049946439

S. Exercise: Functions and for loops

Create a function that computes the basic summary statistics (max, min, median, and mean) for the attitude data set.

col_max <- function(df) {
  output <- vector("double", ncol(df))  # 1. pre-allocate space for the output
  for (i in seq_along(df)) {            # 2. loop over the column indices
    output[[i]] <- max(df[[i]])         # 3. store the maximum of column i
    print(paste(i, colnames(df[i]), ":", output[[i]]))
  }
}

Let’s call the function col_max

col_max(attitude)
## [1] "1 rating : 85"
## [1] "2 complaints : 90"
## [1] "3 privileges : 83"
## [1] "4 learning : 75"
## [1] "5 raises : 88"
## [1] "6 critical : 92"
## [1] "7 advance : 72"

Let’s improve on this function. We’re going to write a function that takes another function as a parameter. This way we can pass in our data together with the function we want to apply, such as mean(), median(), etc.

col_summary <- function(df, fun) {
  output <- vector("double", length(df))  # 1. pre-allocate space for the output
  for (i in seq_along(df)) {              # 2. loop over the column indices
    output[[i]] <- fun(df[[i]])           # 3. apply the supplied function to column i
    print(paste(i, colnames(df[i]), ":", output[[i]]))
  }
}

Now, let’s call the col_summary function

col_summary(attitude, mean)
## [1] "1 rating : 64.6333333333333"
## [1] "2 complaints : 66.6"
## [1] "3 privileges : 53.1333333333333"
## [1] "4 learning : 56.3666666666667"
## [1] "5 raises : 64.6333333333333"
## [1] "6 critical : 74.7666666666667"
## [1] "7 advance : 42.9333333333333"
col_summary(attitude[,2:4],median)
## [1] "1 complaints : 65"
## [1] "2 privileges : 51.5"
## [1] "3 learning : 56.5"
col_summary(attitude, max)
## [1] "1 rating : 85"
## [1] "2 complaints : 90"
## [1] "3 privileges : 83"
## [1] "4 learning : 75"
## [1] "5 raises : 88"
## [1] "6 critical : 92"
## [1] "7 advance : 72"
col_summary(attitude, min)
## [1] "1 rating : 40"
## [1] "2 complaints : 37"
## [1] "3 privileges : 30"
## [1] "4 learning : 34"
## [1] "5 raises : 43"
## [1] "6 critical : 49"
## [1] "7 advance : 25"
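
Both functions above print their results but do not hand them back to the caller. As a minimal sketch, the variant below returns a named numeric vector instead, so the values can be reused later. The name col_summary_values is hypothetical and not part of the original exercise.

```r
# Hypothetical variant of col_summary() that returns its results
# instead of printing them.
col_summary_values <- function(df, fun) {
  output <- vector("double", length(df))  # pre-allocate one slot per column
  for (i in seq_along(df)) {
    output[i] <- fun(df[[i]])             # apply the summary function to column i
  }
  names(output) <- names(df)              # label each value with its column name
  output
}

col_summary_values(attitude, max)
```

Base R's sapply(attitude, max) should give the same values without an explicit loop.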

CODING IN OTHER LANGUAGES

T. Exercise - Python and Unix

Python

execute = True
if execute:
   print("Of course!")
   print("This will execute as well")
## Of course!
## This will execute as well

Unix

pwd
ls -l
## /Users/ksosulsk/Dropbox/_becomingvisual_manuscript_2017/becomingvisual_R
## total 188160
## drwxr-xr-x@ 11 ksosulsk  staff      374 Aug 29  2017 Bike-Sharing-Dataset
## -rw-r--r--@  1 ksosulsk  staff    27512 Aug 14  2017 Bike_Sharing_Carlos_Arias.Rmd
## -rw-r--r--@  1 ksosulsk  staff     1562 Aug 15  2017 ChartTypes.Rmd
## -rw-r--r--@  1 ksosulsk  staff    32774 May 16 14:27 R_In_Class_Session.Rmd
## -rw-r--r--@  1 ksosulsk  staff  5083706 May 16 08:11 R_In_Class_Session.html
## -rw-r--r--@  1 ksosulsk  staff    31116 May 15 23:05 R_In_Class_Session_STUDENTVERSION.Rmd
## drwxr-xr-x@  4 ksosulsk  staff      136 May 15 22:52 R_In_Class_Session_STUDENTVERSION_files
## drwxr-xr-x@  3 ksosulsk  staff      102 May 16 15:34 R_In_Class_Session_files
## -rw-r--r--@  1 ksosulsk  staff   193934 Mar 21  2015 Sidewalk_Cafes.csv
## -rw-r--r--@  1 ksosulsk  staff       90 Apr 13  2017 Untitled.Rnw
## drwxr-xr-x@  3 ksosulsk  staff      102 Aug 15  2017 _bookdown_files
## -rw-r--r--@  1 ksosulsk  staff    70025 May 16  2017 _main.Rmd
## -rw-r--r--@  1 ksosulsk  staff  5511379 May  7  2017 _main.html
## -rw-r--r--@  1 ksosulsk  staff     1494 Oct 18  2017 ansombes.Rmd
## -rw-r--r--@  1 ksosulsk  staff  1195959 Oct 17  2017 ansombes.html
## -rw-r--r--@  1 ksosulsk  staff    54426 May 17  2015 app1.tiff
## -rw-r--r--@  1 ksosulsk  staff    50926 Oct 17  2017 area01.png
## -rw-r--r--@  1 ksosulsk  staff    51827 Oct 17  2017 area02.png
## -rw-r--r--@  1 ksosulsk  staff    50707 Sep 14  2017 area03.png
## -rw-r--r--@  1 ksosulsk  staff    16210 Oct 17  2017 bar01.png
## -rw-r--r--@  1 ksosulsk  staff    31671 Oct 17  2017 bar02.png
## -rw-r--r--@  1 ksosulsk  staff      205 May 10  2017 becoming visual.Rproj
## -rw-r--r--@  1 ksosulsk  staff      205 May 16 09:17 becomingvisual_R.Rproj
## drwxr-xr-x@  3 ksosulsk  staff      102 Aug 15  2017 bikeshare-figure
## -rw-r--r--@  1 ksosulsk  staff   108521 Apr 13  2017 bikeshare-rpubs.html
## -rw-r--r--@  1 ksosulsk  staff      476 Apr 13  2017 bikeshare.Rhtml
## -rw-r--r--@  1 ksosulsk  staff      530 Apr 13  2017 bikeshare.Rpres
## -rw-r--r--@  1 ksosulsk  staff     2036 Apr 13  2017 bikeshare.html
## -rw-r--r--@  1 ksosulsk  staff      808 May 16  2017 bikeshare.md
## -rw-r--r--@  1 ksosulsk  staff    60431 Aug  9  2017 bikeshare_08_07_2017
## -rw-r--r--@  1 ksosulsk  staff    54384 May 16  2017 bikeshare_shinyhistogram.png
## drwxr-xr-x@  4 ksosulsk  staff      136 Aug 15  2017 bikeshareapp
## -rw-r--r--@  1 ksosulsk  staff    54609 May 10  2017 bikesharedailydata.csv
## -rw-r--r--@  1 ksosulsk  staff    19836 Oct 17  2017 boxplot01.png
## -rw-r--r--@  1 ksosulsk  staff    19988 Oct 17  2017 boxplot02.png
## -rw-r--r--@  1 ksosulsk  staff   177343 Sep 27  2017 casino.csv
## -rw-r--r--@  1 ksosulsk  staff  1106825 Dec  8 15:05 casino_games_sub.csv
## -rw-r--r--@  1 ksosulsk  staff   177343 Sep 27  2017 casino_new.csv
## -rw-r--r--@  1 ksosulsk  staff  1106675 Dec  8 14:51 casino_reshaped_2017.csv
## -rw-r--r--@  1 ksosulsk  staff     2057 Nov  1  2017 casinocasestudy.Rmd
## -rw-r--r--@  1 ksosulsk  staff  1963835 Nov  1  2017 casinocasestudy.html
## -rw-r--r--@  1 ksosulsk  staff     4633 Dec 10 13:08 casinoscript.Rmd
## -rw-r--r--@  1 ksosulsk  staff     1293 Jul 10  2017 chapter04_code.R
## -rw-r--r--@  1 ksosulsk  staff    11291 Sep 18  2017 cheatsheet_nicole_version03.Rmd
## -rw-r--r--@  1 ksosulsk  staff      461 Aug 16  2017 columbo.R
## -rw-r--r--@  1 ksosulsk  staff    19028 Oct 12  2015 crime.csv
## -rw-r--r--@  1 ksosulsk  staff    13515 Oct 12  2017 crime_edited.csv
## -rw-r--r--@  1 ksosulsk  staff     5022 Aug 29  2017 daniel_cheatsheet.Rmd
## -rw-r--r--@  1 ksosulsk  staff     6163 Aug 29  2017 daniel_cheatsheet_version02.Rmd
## -rw-r--r--@  1 ksosulsk  staff    71092 May 10  2017 datascienceslides.pptx
## -rw-r--r--@  1 ksosulsk  staff   200374 May 10  2017 datasciencesteps.png
## -rw-r--r--@  1 ksosulsk  staff     5608 May 16  2017 datavisinclasssession_2017.Rmd
## -rw-r--r--@  1 ksosulsk  staff    24527 Oct 17  2017 density01.png
## -rw-r--r--@  1 ksosulsk  staff    25360 Oct 17  2017 density02.png
## -rw-r--r--@  1 ksosulsk  staff    23758 Oct 17  2017 density03.png
## -rw-r--r--@  1 ksosulsk  staff    23997 Oct 17  2017 density04.png
## -rw-r--r--@  1 ksosulsk  staff    24044 Oct 17  2017 density05.png
## -rw-r--r--@  1 ksosulsk  staff    22066 Oct 17  2017 density06.png
## -rw-r--r--@  1 ksosulsk  staff    22164 Oct 17  2017 density07.png
## -rw-r--r--@  1 ksosulsk  staff    23387 Oct 17  2017 density08.png
## drwxr-xr-x@  3 ksosulsk  staff      102 Aug 15  2017 figure
## -rw-r--r--@  1 ksosulsk  staff     5712 Aug 16  2017 ggplot_primer.R
## -rw-r--r--@  1 ksosulsk  staff      791 May 16 11:33 ggplot_test.Rmd
## -rw-r--r--@  1 ksosulsk  staff   961340 May 16 11:22 ggplot_test.html
## -rwxrwxrwx@  1 ksosulsk  staff     5953 Oct 19  2015 ggplot_tutorial.R
## -rw-r--r--@  1 ksosulsk  staff    13534 Sep 14  2017 hist01.jpeg
## -rw-r--r--@  1 ksosulsk  staff    13534 Oct 17  2017 hist01.png
## -rw-r--r--@  1 ksosulsk  staff    13486 Oct 17  2017 hist02.png
## -rw-r--r--@  1 ksosulsk  staff    15952 Oct 17  2017 hist03.png
## -rw-r--r--@  1 ksosulsk  staff    15967 Oct 17  2017 hist04.png
## -rw-r--r--@  1 ksosulsk  staff    13678 Oct 17  2017 hist05.png
## -rw-r--r--@  1 ksosulsk  staff    13766 Oct 17  2017 hist06.png
## -rw-r--r--@  1 ksosulsk  staff    85318 May 16  2015 hist2.tiff
## drwxr-xr-x@  6 ksosulsk  staff      204 Aug 15  2017 histogram
## drwxr-xr-x@  3 ksosulsk  staff      102 May 14 15:47 inclass_presentation-figure
## -rw-r--r--@  1 ksosulsk  staff      541 May 14 15:47 inclass_presentation.Rpres
## -rw-r--r--@  1 ksosulsk  staff      830 May 14 15:47 inclass_presentation.md
## -rw-r--r--@  1 ksosulsk  staff    23039 Nov  1  2017 index.Rmd
## -rw-r--r--@  1 ksosulsk  staff  3109836 May  6  2017 index.html
## -rw-r--r--@  1 ksosulsk  staff  1849558 May  2  2017 index.nb.html
## -rw-r--r--@  1 ksosulsk  staff    24935 Oct 25  2017 lesson05_basic_charts_cheat_sheet.Rmd
## -rw-r--r--@  1 ksosulsk  staff  6258560 Sep 14  2017 lesson05_basic_charts_cheat_sheet.html
## -rw-r--r--@  1 ksosulsk  staff   218877 Sep 14  2017 lesson05_basic_charts_cheat_sheet.md
## drwxr-xr-x@  4 ksosulsk  staff      136 Sep 15  2017 lesson05_basic_charts_cheat_sheet_files
## -rw-r--r--@  1 ksosulsk  staff    17017 Dec 10 13:08 lesson05_solutions_and_demo.Rmd
## -rw-r--r--@  1 ksosulsk  staff   743152 Oct 18  2017 lesson05_solutions_and_demo.docx
## -rw-r--r--@  1 ksosulsk  staff  7395602 Nov 30 16:03 lesson05_solutions_and_demo.html
## -rw-r--r--@  1 ksosulsk  staff   264199 Nov 30 16:03 lesson05_solutions_and_demo.md
## drwxr-xr-x@  4 ksosulsk  staff      136 Feb 20 11:40 lesson05_solutions_and_demo_files
## -rw-r--r--@  1 ksosulsk  staff    68071 Oct 17  2017 line01.png
## -rw-r--r--@  1 ksosulsk  staff    85941 Oct 17  2017 line02.png
## -rw-r--r--@  1 ksosulsk  staff     4837 May 17  2017 markdown_report.Rmd
## -rw-r--r--@  1 ksosulsk  staff  1912940 May 17  2017 markdown_report.html
## -rw-r--r--@  1 ksosulsk  staff     4806 Jul 26  2017 markdown_slides.Rmd
## -rw-r--r--@  1 ksosulsk  staff  1611594 Jul 26  2017 markdown_slides.html
## -rw-r--r--@  1 ksosulsk  staff     1299 Oct 15  2017 multivariate_exercise.R
## -rw-r--r--@  1 ksosulsk  staff     2676 Jul  7  2017 myfile.csv
## -rw-r--r--@  1 ksosulsk  staff       38 Feb 21 10:38 myfirstRScriptToday.R
## -rw-r--r--@  1 ksosulsk  staff      639 Feb 20 09:40 myfirstnotebook.Rmd
## -rw-r--r--@  1 ksosulsk  staff   841650 Feb 20 09:40 myfirstnotebook.nb.html
## -rw-r--r--@  1 ksosulsk  staff     1853 Aug 14  2017 networkdiagram.R
## -rw-r--r--@  1 ksosulsk  staff     8305 Aug 24  2017 nicole_cheatsheet.Rmd
## -rw-r--r--@  1 ksosulsk  staff  7517784 Aug 24  2017 nicole_cheatsheet.html
## -rw-r--r--@  1 ksosulsk  staff    10906 Sep 12  2017 nicole_cheatsheet_02.Rmd
## -rw-r--r--@  1 ksosulsk  staff  7663834 Sep 12  2017 nicole_cheatsheet_02.html
## -rw-r--r--@  1 ksosulsk  staff  1647985 Aug  9  2017 nicole_week05_ck.html
## -rw-r--r--@  1 ksosulsk  staff     5996 Aug 14  2017 nicole_week05_ck.rmd
## -rwxrwxrwx@  1 ksosulsk  staff      308 Jul 30  2015 pakistan.childHIV.csv
## -rw-r--r--@  1 ksosulsk  staff   193443 May  7  2017 plot_id798936310.svg
## drwxr-xr-x@  3 ksosulsk  staff      102 Aug 15  2017 rsconnect
## -rw-r--r--@  1 ksosulsk  staff     2890 Oct 12  2017 sample.Rmd
## -rw-r--r--@  1 ksosulsk  staff      844 May 10  2017 sampleknit.Rmd
## -rw-r--r--@  1 ksosulsk  staff   792958 May 10  2017 sampleknit.html
## -rw-r--r--@  1 ksosulsk  staff    41842 Oct 17  2017 scatter01.png
## -rw-r--r--@  1 ksosulsk  staff    47486 Oct 17  2017 scatter02.png
## -rw-r--r--@  1 ksosulsk  staff    15573 Oct 17  2017 sosulski_visualization_02_cheat_sheet.Rmd
## -rw-r--r--@  1 ksosulsk  staff  5821303 Sep  7  2017 sosulski_visualization_02_cheat_sheet.html
## -rw-r--r--@  1 ksosulsk  staff   163844 Sep  7  2017 sosulski_visualization_02_cheat_sheet.md
## drwxr-xr-x@  3 ksosulsk  staff      102 Aug 30  2017 sosulski_visualization_02_cheat_sheet_files
## -rw-r--r--@  1 ksosulsk  staff     9020 Aug 30  2017 sosulski_visualization_cheat_sheet.Rmd
## -rw-r--r--@  1 ksosulsk  staff  5613184 Aug 30  2017 sosulski_visualization_cheat_sheet.html
## -rw-r--r--@  1 ksosulsk  staff   160975 Aug 30  2017 sosulski_visualization_cheat_sheet.md
## drwxr-xr-x@  4 ksosulsk  staff      136 Aug 29  2017 sosulski_visualization_cheat_sheet_files
## -rw-r--r--@  1 ksosulsk  staff    11187 Nov 30 15:29 sosulski_visualization_cheat_sheet_version02.Rmd
## -rw-r--r--@  1 ksosulsk  staff  5818929 Sep  5  2017 sosulski_visualization_cheat_sheet_version02.html
## -rw-r--r--@  1 ksosulsk  staff   163963 Sep  5  2017 sosulski_visualization_cheat_sheet_version02.md
## drwxr-xr-x@  3 ksosulsk  staff      102 Sep  5  2017 sosulski_visualization_cheat_sheet_version02_files
## -rwxrwxrwx@  1 ksosulsk  staff      794 Jul 30  2015 southasia.csv
## -rw-r--r--@  1 ksosulsk  staff     1782 Jan  8 23:02 sql_script.R
## -rw-r--r--@  1 ksosulsk  staff      447 Oct 18  2017 survey_skills.csv
## -rw-r--r--@  1 ksosulsk  staff      691 Apr 14  2017 test_templatermd.Rmd
## -rw-r--r--@  1 ksosulsk  staff      930 Apr 14  2017 test_templatermd.md
## -rw-r--r--@  1 ksosulsk  staff     1989 Apr 14  2017 vignette_rmdtemplate.Rmd
## -rw-r--r--@  1 ksosulsk  staff    42477 Apr 14  2017 vignette_rmdtemplate.html
## -rw-r--r--@  1 ksosulsk  staff     2330 Feb 25 13:32 visualizing _crime.Rmd
## -rw-r--r--@  1 ksosulsk  staff  1387286 Oct 15  2017 visualizing__crime.docx
## -rw-r--r--@  1 ksosulsk  staff  6267687 Nov  1  2017 visualizing__crime.html
## -rw-r--r--@  1 ksosulsk  staff    22333 Nov  1  2017 visualizing__crime.md
## drwxr-xr-x@  4 ksosulsk  staff      136 Oct 15  2017 visualizing__crime_files
## -rw-r--r--@  1 ksosulsk  staff     1357 Nov  1  2017 visualizing_ancombes.Rmd
## -rw-r--r--@  1 ksosulsk  staff  1028434 Nov  1  2017 visualizing_ancombes.html
## -rw-r--r--@  1 ksosulsk  staff      839 Nov  1  2017 visualizing_ancombes.md
## drwxr-xr-x@  3 ksosulsk  staff      102 Oct 18  2017 visualizing_ancombes_files
## -rw-r--r--@  1 ksosulsk  staff     3420 Oct 17  2017 visualizing_skills.Rmd
## drwxr-xr-x@  3 ksosulsk  staff      102 Oct 17  2017 visualizing_skills_files
## -rw-r--r--@  1 ksosulsk  staff    25886 May 14 21:21 week04_decode.Rmd
## -rw-r--r--@  1 ksosulsk  staff  7846577 Aug 22  2017 week04_decode.html
## -rw-r--r--@  1 ksosulsk  staff   226149 Aug 22  2017 week04_decode.md
## drwxr-xr-x@  3 ksosulsk  staff      102 Aug 17  2017 week04_decode_files
## -rw-r--r--@  1 ksosulsk  staff    13009 Aug 14  2017 week07_shiny.Rmd
## -rw-r--r--@  1 ksosulsk  staff      579 May  7  2017 world_internet_usage.csv
## -rw-r--r--@  1 ksosulsk  staff     4325 Aug  4  2013 worldexports_yearly.csv
## -rw-r--r--@  1 ksosulsk  staff    29263 May  6  2017 worldinternet.png

SQL

Review session 7 from R Fundamentals (Sosulski, 2018): SQL & R http://becomingvisual.com/rfundamentals/sql-r.html


U. Exercise: SQL & R

Write a SELECT query that returns the applicable data for each of the following:

  1. The players on the San Antonio Spurs in 2014

  2. Top 5 blockers in 2010

  3. Top 10 combination power-forwards with the most defensive rebounds

  4. Top 20 player-seasons in the NBA 50-40-90 Club (players who have hit over 50% for FG%, 40% for 3P%, and 90% for FT%, with over 300 field goals, 55 3-pointers, and 125 free throws), ordered by points scored

  5. Top 10 oldest Milwaukee Bucks players with over 1000 points.

Exercise available at: http://becomingvisual.com/rfundamentals/sql-r.html#exercise-7.1
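
The queries in this exercise combine filtering, sorting, and keeping only the top rows. The sketch below shows that pattern from R, assuming the DBI and RSQLite packages are installed. The players table, its column names, and its values are made up for illustration and are not the exercise data.

```r
# Assumes the DBI and RSQLite packages are installed.
library(DBI)

# Toy stand-in for a player-season table; all names and values are made up.
players <- data.frame(
  player = c("A", "B", "C", "D"),
  team   = c("SAS", "SAS", "MIL", "SAS"),
  season = c(2014, 2010, 2014, 2014),
  blocks = c(12, 40, 25, 30),
  stringsAsFactors = FALSE
)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # throwaway in-memory database
dbWriteTable(con, "players", players)

# Filter rows with WHERE, sort with ORDER BY, keep the top rows with LIMIT.
top_blockers <- dbGetQuery(con, "
  SELECT player, blocks
  FROM players
  WHERE team = 'SAS' AND season = 2014
  ORDER BY blocks DESC
  LIMIT 2
")

dbDisconnect(con)
top_blockers
```

The same WHERE, ORDER BY, and LIMIT structure should carry over to the exercise questions once the real table and column names are substituted.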


SHINY

Review session 8 from R Fundamentals (Sosulski, 2018): RShiny http://becomingvisual.com/rfundamentals/rshiny.html


V. Exercise: Build an RShiny app

Modify your shiny app.

Use ggplot to create an interactive scatterplot of the same data.

Exercise available at: http://becomingvisual.com/rfundamentals/rshiny.html#exercise-8.1

W. Homework

Submit all the files for exercises U & V (with your name on them) to NYU Classes > Assignments > R in class session. Due date: Official end date of module 1.

X1. Optional Homework: Determine the average ridership

Write a script to determine the average ridership on weekends versus weekdays.

Let’s imagine it costs $10 per day to rent a bike on a weekday and $12 on a weekend. What is the annual weekday rental revenue in 2011 and 2012? What is the annual weekend revenue in 2011 and 2012?

Hint: Use a for loop and if/else logic.
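
As a rough sketch of the hint, the loop below walks through a toy data frame one day at a time and uses if/else to decide which rate applies. The column names weekday (0 = Sunday through 6 = Saturday) and cnt (rentals that day) are assumptions about the bike sharing file, and the four rows are made up.

```r
# Toy data; the column names `weekday` (0 = Sunday ... 6 = Saturday) and
# `cnt` (rentals that day) are assumptions about the bike sharing file.
rides <- data.frame(
  weekday = c(0, 1, 2, 6),
  cnt     = c(100, 200, 300, 400)
)

weekday_revenue <- 0
weekend_revenue <- 0

for (i in seq_len(nrow(rides))) {
  if (rides$weekday[i] %in% c(0, 6)) {
    # Saturday or Sunday: $12 per rental
    weekend_revenue <- weekend_revenue + 12 * rides$cnt[i]
  } else {
    # Monday through Friday: $10 per rental
    weekday_revenue <- weekday_revenue + 10 * rides$cnt[i]
  }
}

weekday_revenue
weekend_revenue
```

To get annual figures, the same loop could be run separately on the rows for each year.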

X2. Optional Homework: Further analysis of Bikesharing data set

At this point in the process, you should have gained enough insight to frame a question to guide the rest of your analysis. Sometimes you don’t know what to ask of the data, and other times the questions you have cannot be answered by the data that you have. In most visual analytical explorations there will be a back and forth between defining the questions and identifying the data sources that contain the information you need to extract.

Often your question will fall into one of three categories: Past, present, or future.

Some questions that can guide an historical analysis of past events are:

  • Do weather conditions affect rental behaviors?
  • Does the precipitation, day of week, season, hour of the day, etc. affect rental behavior?
  • Which weather conditions affect behavior the most? Do they differ by season?

These questions serve a purpose of guiding reports, where the analyst is reporting on past events.

A question based on the present is:

How many bikes were rented in the past hour or today?

This type of question is used to report the current state of an event.


Can we answer this question?

The data we are using cannot answer this question since it is historical data from 2011 and 2012.


A question about the future could be framed as the following:

Will bike rentals be higher in the summer rather than the winter due to weather?

Questions about the future usually involve analysis that requires prediction or forecasting methods. The analyst in this case is trying to predict the future from past data.

Try to answer the following questions. Show your work as a data visualization.

  • Do weather conditions affect rental behaviors?
  • Does the precipitation, day of week, season, hour of the day, etc. affect rental behavior?
  • Which weather conditions affect behavior the most? Do they differ by season?

X3. Optional Homework: Use temp and humidity to calculate the heat index for temperatures >=80

Use the data from: https://www.weather.gov/media/unr/heatindex.pdf
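
The linked chart tabulates heat index values. One common way to compute them instead of looking them up is the Rothfusz regression published by the National Weather Service. It is shown here as a swapped-in technique, not necessarily the approach this exercise intends. This minimal sketch assumes temperature in degrees Fahrenheit and relative humidity in percent. The function name heat_index is hypothetical, and the adjustments the NWS applies for very dry or very humid conditions are omitted.

```r
# Hypothetical helper: Rothfusz regression for the heat index.
# temp_f: air temperature in degrees Fahrenheit
# rh:     relative humidity in percent (0 to 100)
# Returns NA for temperatures below 80, where this exercise does not apply.
heat_index <- function(temp_f, rh) {
  ifelse(
    temp_f >= 80,
    -42.379 + 2.04901523 * temp_f + 10.14333127 * rh -
      0.22475541 * temp_f * rh - 0.00683783 * temp_f^2 -
      0.05481717 * rh^2 + 0.00122874 * temp_f^2 * rh +
      0.00085282 * temp_f * rh^2 - 0.00000199 * temp_f^2 * rh^2,
    NA_real_
  )
}

heat_index(90, 50)
```

Because ifelse() is vectorized, the function can be applied to whole columns once the data set's temperature and humidity values are converted to these units.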

FURTHER STUDY

As a next step, I encourage you to select a data set from one of the resources provided below and explore it using the process we applied in class.

General Datasets

  1. UCI Machine Learning Repository: A diverse collection of datasets (360 currently and still growing) for analytics and machine learning. http://archive.ics.uci.edu/ml/

  2. Kaggle datasets: Perfect for exploring data through visualization. https://www.kaggle.com/datasets

  3. Amazon Public Datasets: Large datasets, with sizes in the gigabytes or terabytes. https://aws.amazon.com/public-datasets/

  4. Google Public Data: A set of datasets provided by Google, including a book corpus, US names, genome datasets, BigQuery datasets, and many more. https://cloud.google.com/public-datasets/

  5. Open Data by Socrata: Thousands of free datasets for exploration. https://opendata.socrata.com/

  6. Data.gov: A website dedicated to supplying datasets from different domains, e.g., education, nutrition, and sports. https://catalog.data.gov/dataset?res_format=CSV

  7. Datahub: As its tagline says, “The easy way to get, share data”. https://datahub.io/dataset?tags=weather

  8. Harvard Dataverse: Datasets used for research purposes and cited in publications. https://dataverse.harvard.edu/

Challenge-Based Datasets

  1. KDD Data Center: Have a problem coming up with a problem statement? No worries, KDD provides you with the dataset and problem statements through its challenges. http://www.kdd.org/kdd-cup

  2. CrowdANALYTIX: More challenges to solve, with datasets. https://www.crowdanalytix.com/community

  3. DrivenData: Problems for data scientists to solve. https://www.drivendata.org/competitions/

  4. Big Data Innovation Challenge: Tackle real problems with analytics and compete to win a challenge. https://bigdatainnovationchallenge.org/challenges/food-security-nutrition/

Census Datasets

  1. Open Census Data: Population details for cities in different countries are just a click away with this open data. http://census.okfn.org/en/latest/

  2. Census.gov: Census data of United States. http://www.census.gov/data.html

Weather/Climate Datasets

  1. Wunderground: Want to work with weather data? Use Wunderground’s API to get your own dataset. https://www.wunderground.com/weather/api/

  2. National Centers for Environmental Information: Climate datasets available for analytics. https://www.ncdc.noaa.gov/cdo-web/datasets

News Datasets

  1. BBC Dataset: It consists of documents from the BBC news website corresponding to stories in five topical areas. http://mlg.ucd.ie/datasets/bbc.html

  2. The Guardian: A collection of news datasets from The Guardian, which is updated regularly. https://www.theguardian.com/news/datablog/interactive/2013/jan/14/all-our-datasets-index

Food and Nutrition Datasets

  1. United States Department of Agriculture: Datasets provided by the Center for Nutrition Policy and Promotion, covering food prices and the Healthy Eating Index. https://www.cnpp.usda.gov/data

  2. Nutritional Science Blog: A blog listing some datasets related to nutrition. http://nutsci.org/open-nutrition-food-data/