Updated on Tue May 16 20:53:23 2017.
This section will guide you through the process of decoding your data into information and, ultimately, intelligible insights. In doing so, we will explore the use of tidyverse and base R packages.
When working with new data, what initial questions do you have?
Consider the following questions to guide your understanding.
Once you have this basic understanding of your data you can dig deeper. You can then use visualization techniques to explore your data and derive some basic understanding of the phenomena you are studying, such as the largest and smallest values for each variable. In addition, calculating summary statistics translates data into information by revealing the shape of the data: the mean, median, minimum value, maximum value, and variability can all be shown with simple visualizations.
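As a minimal sketch of the summary statistics mentioned above, the snippet below uses base R on the year-2000 values from the internet usage data examined later in this section (users per 100 people, seven countries). The variable names are assumptions chosen for illustration.

```r
# Year-2000 internet users per 100 people for the seven countries
# in the internet usage data (values copied from the data set).
usage_2000 <- c(1.78, 5.08, 6.55, 0.40, 36.00, 23.63, 43.08)

# summary() reports minimum, quartiles, median, mean, and maximum in one call.
summary(usage_2000)

# The same quantities can also be computed one at a time.
usage_mean   <- mean(usage_2000)
usage_median <- median(usage_2000)
usage_range  <- range(usage_2000)  # c(minimum, maximum)
usage_sd     <- sd(usage_2000)     # one measure of variability
```

Because the mean is pulled upward by the few high-usage countries while the median is not, comparing the two is a quick way to spot a skewed distribution.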
For any data science project there are a few simple steps to follow.
Using the world internet usage data, we will compare read.csv to read_csv for importing data.
internet_utils <- read.csv("world_internet_usage.csv")
head(internet_utils)
## country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 China 1.78 2.64 4.60 6.20 7.30 8.52 10.52 16.00
## 2 Mexico 5.08 7.04 11.90 12.90 14.10 17.21 19.52 20.81
## 3 Panama 6.55 7.27 8.52 9.99 11.14 11.48 17.35 22.29
## 4 Senegal 0.40 0.98 1.01 2.10 4.39 4.79 5.61 7.70
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90
## 6 United Arab Emirates 23.63 26.27 28.32 29.48 30.13 40.00 52.00 61.00
## X2008 X2009 X2010 X2011 X2012
## 1 22.60 28.90 34.30 38.30 42.30
## 2 21.71 26.34 31.05 34.96 38.42
## 3 33.82 39.08 40.10 42.70 45.20
## 4 10.60 14.50 16.00 17.50 19.20
## 5 69.00 69.00 71.00 71.00 74.18
## 6 63.00 64.00 68.00 78.00 85.00
library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## `2000` = col_double(),
## `2001` = col_double(),
## `2002` = col_double(),
## `2003` = col_double(),
## `2004` = col_double(),
## `2005` = col_double(),
## `2006` = col_double(),
## `2007` = col_double(),
## `2008` = col_double(),
## `2009` = col_double(),
## `2010` = col_double(),
## `2011` = col_double(),
## `2012` = col_double()
## )
head(internet_readr)
## # A tibble: 6 × 14
## country `2000` `2001` `2002` `2003` `2004` `2005` `2006`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 China 1.78 2.64 4.60 6.20 7.30 8.52 10.52
## 2 Mexico 5.08 7.04 11.90 12.90 14.10 17.21 19.52
## 3 Panama 6.55 7.27 8.52 9.99 11.14 11.48 17.35
## 4 Senegal 0.40 0.98 1.01 2.10 4.39 4.79 5.61
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00
## 6 United Arab Emirates 23.63 26.27 28.32 29.48 30.13 40.00 52.00
## # ... with 6 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>
#extract by position
internet_readr[[2,1]]
## [1] "Mexico"
internet_utils [2,1] # double [[ ]] works too
## [1] Mexico
## 7 Levels: China Mexico Panama Senegal Singapore ... United States
#extract by name
internet_readr$country
## [1] "China" "Mexico" "Panama"
## [4] "Senegal" "Singapore" "United Arab Emirates"
## [7] "United States"
internet_utils$country
## [1] China Mexico Panama
## [4] Senegal Singapore United Arab Emirates
## [7] United States
## 7 Levels: China Mexico Panama Senegal Singapore ... United States
#to use with the pipe operator %>% (from magrittr, also exported by dplyr and tidyr), add a .
internet_readr %>% .$country
## [1] "China" "Mexico" "Panama"
## [4] "Senegal" "Singapore" "United Arab Emirates"
## [7] "United States"
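The factor output from read.csv above reflects the default of R versions before 4.0.0, where strings were converted to factors; since R 4.0.0 the default is stringsAsFactors = FALSE. A small sketch, using a two-row inline CSV invented for illustration, makes the difference explicit:

```r
# Inline CSV text stands in for a file; the rows are illustrative only.
csv_text <- "country,2000\nChina,1.78\nMexico,5.08"

# Explicitly request factors (the default behaviour before R 4.0.0).
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Explicitly keep strings as character vectors (the default since R 4.0.0).
as_character <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$country)     # "factor"
class(as_character$country)  # "character"
```

Setting the argument explicitly keeps the result the same regardless of the R version in use.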
You need to rename columns first to remove the X in front of each year.
names(internet_utils) <-c("country", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012")
names(internet_utils)
## [1] "country" "2000" "2001" "2002" "2003" "2004" "2005"
## [8] "2006" "2007" "2008" "2009" "2010" "2011" "2012"
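Typing every column name by hand is error-prone. As an alternative sketch (not the approach used above), the X prefix can be stripped with a regular expression, or avoided at import time with check.names = FALSE. The small data frame and inline CSV here are made up for illustration:

```r
# A stand-in for the imported data frame, with the X prefix added by read.csv.
df <- data.frame(country = c("China", "Mexico"),
                 X2000 = c(1.78, 5.08),
                 X2001 = c(2.64, 7.04))

# Remove a leading "X" only when it is followed by exactly four digits.
names(df) <- sub("^X([0-9]{4})$", "\\1", names(df))
names(df)

# Alternatively, keep the original header unchanged at import time.
csv_text <- "country,2000,2001\nChina,1.78,2.64"
kept <- read.csv(text = csv_text, check.names = FALSE)
names(kept)
```

Column names that start with a digit are not syntactic in R, so they must be quoted with backticks when used with $, as in df$`2000`.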
Reshape a data frame
library(reshape2)
internet_utils_reshaped <- melt(internet_utils,id.vars="country", variable.name="year", value.name="usage")
Reshape a tibble
internet_readr_reshaped <- melt(internet_readr,id.vars="country", variable.name="year", value.name="usage")
internet_readr_reshaped
## country year usage
## 1 China 2000 1.78
## 2 Mexico 2000 5.08
## 3 Panama 2000 6.55
## 4 Senegal 2000 0.40
## 5 Singapore 2000 36.00
## 6 United Arab Emirates 2000 23.63
## 7 United States 2000 43.08
## 8 China 2001 2.64
## 9 Mexico 2001 7.04
## 10 Panama 2001 7.27
## 11 Senegal 2001 0.98
## 12 Singapore 2001 41.67
## 13 United Arab Emirates 2001 26.27
## 14 United States 2001 49.08
## 15 China 2002 4.60
## 16 Mexico 2002 11.90
## 17 Panama 2002 8.52
## 18 Senegal 2002 1.01
## 19 Singapore 2002 47.00
## 20 United Arab Emirates 2002 28.32
## 21 United States 2002 58.79
## 22 China 2003 6.20
## 23 Mexico 2003 12.90
## 24 Panama 2003 9.99
## 25 Senegal 2003 2.10
## 26 Singapore 2003 53.84
## 27 United Arab Emirates 2003 29.48
## 28 United States 2003 61.70
## 29 China 2004 7.30
## 30 Mexico 2004 14.10
## 31 Panama 2004 11.14
## 32 Senegal 2004 4.39
## 33 Singapore 2004 62.00
## 34 United Arab Emirates 2004 30.13
## 35 United States 2004 64.76
## 36 China 2005 8.52
## 37 Mexico 2005 17.21
## 38 Panama 2005 11.48
## 39 Senegal 2005 4.79
## 40 Singapore 2005 61.00
## 41 United Arab Emirates 2005 40.00
## 42 United States 2005 67.97
## 43 China 2006 10.52
## 44 Mexico 2006 19.52
## 45 Panama 2006 17.35
## 46 Senegal 2006 5.61
## 47 Singapore 2006 59.00
## 48 United Arab Emirates 2006 52.00
## 49 United States 2006 68.93
## 50 China 2007 16.00
## 51 Mexico 2007 20.81
## 52 Panama 2007 22.29
## 53 Senegal 2007 7.70
## 54 Singapore 2007 69.90
## 55 United Arab Emirates 2007 61.00
## 56 United States 2007 75.00
## 57 China 2008 22.60
## 58 Mexico 2008 21.71
## 59 Panama 2008 33.82
## 60 Senegal 2008 10.60
## 61 Singapore 2008 69.00
## 62 United Arab Emirates 2008 63.00
## 63 United States 2008 74.00
## 64 China 2009 28.90
## 65 Mexico 2009 26.34
## 66 Panama 2009 39.08
## 67 Senegal 2009 14.50
## 68 Singapore 2009 69.00
## 69 United Arab Emirates 2009 64.00
## 70 United States 2009 71.00
## 71 China 2010 34.30
## 72 Mexico 2010 31.05
## 73 Panama 2010 40.10
## 74 Senegal 2010 16.00
## 75 Singapore 2010 71.00
## 76 United Arab Emirates 2010 68.00
## 77 United States 2010 74.00
## 78 China 2011 38.30
## 79 Mexico 2011 34.96
## 80 Panama 2011 42.70
## 81 Senegal 2011 17.50
## 82 Singapore 2011 71.00
## 83 United Arab Emirates 2011 78.00
## 84 United States 2011 77.86
## 85 China 2012 42.30
## 86 Mexico 2012 38.42
## 87 Panama 2012 45.20
## 88 Senegal 2012 19.20
## 89 Singapore 2012 74.18
## 90 United Arab Emirates 2012 85.00
## 91 United States 2012 81.03
class(internet_readr_reshaped) # turns into a data.frame!
## [1] "data.frame"
Use the gather function to reshape
library(tidyr) # provides gather() and the %>% pipe
tidy_internet_readr <-
  internet_readr %>%
  gather(`2000`,`2001`,`2002`,`2003`,`2004`,`2005`,`2006`,`2007`,`2008`,`2009`,`2010`,`2011`,`2012`, key="year", value="usage")
tidy_internet_readr
## # A tibble: 91 × 3
## country year usage
## <chr> <chr> <dbl>
## 1 China 2000 1.78
## 2 Mexico 2000 5.08
## 3 Panama 2000 6.55
## 4 Senegal 2000 0.40
## 5 Singapore 2000 36.00
## 6 United Arab Emirates 2000 23.63
## 7 United States 2000 43.08
## 8 China 2001 2.64
## 9 Mexico 2001 7.04
## 10 Panama 2001 7.27
## # ... with 81 more rows
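To make clear what the reshape does, here is a dependency-free sketch of the same wide-to-long transformation in base R. The two-country data frame is a cut-down stand-in for the imported data, and the names wide, year_cols, and long are assumptions for illustration.

```r
# A cut-down wide table: one row per country, one column per year.
wide <- data.frame(country = c("China", "Mexico"),
                   `2000` = c(1.78, 5.08),
                   `2001` = c(2.64, 7.04),
                   check.names = FALSE,
                   stringsAsFactors = FALSE)

# Every column except the identifier holds a year of measurements.
year_cols <- setdiff(names(wide), "country")

# Stack the year columns: each country repeats once per year, each year
# label repeats once per country, and the values are concatenated column
# by column so that all three vectors line up.
long <- data.frame(country = rep(wide$country, times = length(year_cols)),
                   year = rep(year_cols, each = nrow(wide)),
                   usage = unlist(wide[year_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)
long
```

With tidyr 1.0.0 or later, pivot_longer(internet_readr, -country, names_to = "year", values_to = "usage") should produce the same three columns, though its rows are ordered by country rather than by year.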
Create a few statistical visualizations to understand the makeup of your data.
boxplot(internet_readr$`2000`, main="Range of internet users in 2000", sub="Median of 6.55 users per 100 people")
hist(internet_readr$`2000`, main="Frequency of internet users in 2000 per 100 people", xlab="2000")
library(lattice)
histogram(internet_readr$`2000`, main="Frequency of internet users in 2000 per 100 people", xlab="2000")
boxplot(internet_readr[,2:14], main="Range of internet users per 100 people")
plot(tidy_internet_readr$year, tidy_internet_readr$usage,main="Internet usage per 100 people",xlab="Year",ylab="Usage", type="p")
***
Create charts and reports.
library(ggthemes)
library(ggplot2)
#line chart
ggplot(tidy_internet_readr, aes(x=year, y=usage, colour=country, group=country)) +
  geom_line() +
  labs(title = "Internet Usage per 100 people", subtitle = "Since 2011, the UAE has surpassed Singapore and the US in internet users", caption = "Source: World Bank, 2013", x = "Year", y = "Usage") +
  theme_few()
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.
For more details on using R Markdown see http://rmarkdown.rstudio.com.
This section will introduce control structures such as the while loop, for loop, if/else conditional statements, and functions.
x <- 10
while (x > 0) {
print(x)
x <- x - 1
}
## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1
counter <- 0
while (counter < 9) {
  print(counter)
  counter <- counter + 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
Iterate through a vector of numbers
for (i in c(1,2,3,4)){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
output <- vector("double", length(bikeshare$atemp)) #1. output (one slot per value, not per column)
for (i in seq_along(bikeshare$atemp)) { #2. sequence
output[[i]] <- round(bikeshare$atemp[[i]], 2) #3. body
}
output
## [1] 0.36 0.35 0.19 0.21 0.23 0.23 0.21 0.16 0.12 0.15 0.19 0.16 0.15 0.19
## [15] 0.25 0.23 0.18 0.23 0.30 0.26 0.16 0.08 0.10 0.12 0.23 0.20 0.22 0.22
## [29] 0.21 0.25 0.19 0.23 0.25 0.18 0.23 0.24 0.29 0.30 0.20 0.14 0.15 0.21
## [43] 0.23 0.32 0.40 0.25 0.32 0.43 0.51 0.39 0.28 0.28 0.19 0.25 0.29 0.35
## [57] 0.28 0.35 0.40 0.26 0.32 0.20 0.26 0.38 0.37 0.24 0.30 0.29 0.39 0.30
## [71] 0.33 0.38 0.33 0.32 0.37 0.41 0.53 0.47 0.33 0.41 0.44 0.34 0.27 0.26
## [85] 0.26 0.25 0.26 0.29 0.30 0.26 0.28 0.32 0.38 0.54 0.40 0.39 0.43 0.32
## [99] 0.34 0.43 0.57 0.49 0.42 0.46 0.44 0.43 0.45 0.50 0.49 0.56 0.45 0.32
## [113] 0.45 0.55 0.57 0.59 0.58 0.58 0.50 0.46 0.45 0.53 0.58 0.40 0.44 0.47
## [127] 0.51 0.52 0.53 0.52 0.53 0.52 0.49 0.50 0.54 0.55 0.54 0.53 0.51 0.53
## [141] 0.57 0.57 0.59 0.60 0.62 0.65 0.64 0.61 0.62 0.67 0.73 0.72 0.64 0.59
## [155] 0.59 0.62 0.62 0.66 0.73 0.76 0.70 0.68 0.64 0.60 0.59 0.59 0.60 0.60
## [169] 0.64 0.65 0.60 0.64 0.69 0.69 0.66 0.64 0.64 0.64 0.69 0.65 0.64 0.65
## [183] 0.67 0.67 0.67 0.70 0.69 0.69 0.67 0.66 0.69 0.73 0.74 0.69 0.64 0.62
## [197] 0.64 0.67 0.70 0.75 0.75 0.83 0.84 0.80 0.79 0.72 0.70 0.69 0.74 0.79
## [211] 0.73 0.73 0.70 0.71 0.68 0.66 0.66 0.68 0.72 0.70 0.72 0.68 0.65 0.65
## [225] 0.65 0.62 0.62 0.65 0.67 0.66 0.63 0.65 0.68 0.64 0.61 0.63 0.65 0.66
## [239] 0.64 0.65 0.61 0.59 0.61 0.61 0.60 0.63 0.67 0.63 0.52 0.54 0.56 0.58
## [253] 0.61 0.61 0.60 0.60 0.63 0.55 0.46 0.48 0.49 0.53 0.53 0.55 0.55 0.52
## [267] 0.56 0.57 0.59 0.57 0.58 0.57 0.54 0.41 0.35 0.39 0.47 0.53 0.48 0.50
## [281] 0.51 0.52 0.54 0.55 0.52 0.55 0.53 0.50 0.50 0.51 0.52 0.51 0.47 0.42
## [295] 0.43 0.42 0.46 0.46 0.47 0.46 0.32 0.23 0.32 0.36 0.40 0.39 0.41 0.40
## [309] 0.32 0.36 0.40 0.41 0.41 0.37 0.31 0.36 0.43 0.52 0.51 0.45 0.32 0.27
## [323] 0.32 0.46 0.45 0.42 0.43 0.37 0.38 0.39 0.46 0.49 0.45 0.31 0.31 0.33
## [337] 0.31 0.35 0.39 0.46 0.40 0.26 0.32 0.27 0.25 0.27 0.30 0.34 0.41 0.36
## [351] 0.25 0.25 0.28 0.40 0.43 0.43 0.38 0.30 0.28 0.32 0.33 0.28 0.26 0.32
## [365] 0.41 0.38 0.25 0.13 0.12 0.28 0.34 0.39 0.34 0.25 0.32 0.28 0.38 0.25
## [379] 0.18 0.16 0.19 0.36 0.28 0.19 0.22 0.17 0.16 0.24 0.35 0.29 0.36 0.42
## [393] 0.33 0.27 0.26 0.38 0.47 0.40 0.31 0.27 0.26 0.30 0.36 0.27 0.26 0.29
## [407] 0.21 0.10 0.23 0.33 0.35 0.33 0.35 0.36 0.27 0.27 0.30 0.39 0.44 0.41
## [421] 0.26 0.27 0.36 0.35 0.35 0.48 0.36 0.41 0.30 0.24 0.26 0.39 0.52 0.40
## [435] 0.28 0.36 0.46 0.54 0.55 0.53 0.44 0.51 0.46 0.53 0.54 0.51 0.53 0.57
## [449] 0.49 0.44 0.44 0.32 0.47 0.48 0.38 0.42 0.42 0.43 0.46 0.53 0.43 0.39
## [463] 0.43 0.49 0.48 0.44 0.34 0.39 0.43 0.49 0.57 0.61 0.60 0.46 0.49 0.52
## [477] 0.54 0.39 0.30 0.41 0.47 0.48 0.45 0.38 0.45 0.46 0.58 0.54 0.54 0.59
## [491] 0.58 0.55 0.53 0.56 0.55 0.49 0.52 0.54 0.59 0.55 0.58 0.60 0.57 0.55
## [505] 0.57 0.58 0.57 0.58 0.58 0.61 0.63 0.64 0.64 0.68 0.67 0.61 0.63 0.61
## [519] 0.57 0.58 0.58 0.53 0.54 0.57 0.60 0.65 0.66 0.66 0.60 0.61 0.62 0.60
## [533] 0.59 0.57 0.54 0.65 0.72 0.75 0.72 0.65 0.67 0.65 0.59 0.64 0.68 0.79
## [547] 0.69 0.75 0.70 0.70 0.73 0.76 0.75 0.80 0.79 0.65 0.66 0.65 0.65 0.67
## [561] 0.67 0.71 0.72 0.76 0.75 0.71 0.61 0.55 0.62 0.69 0.71 0.65 0.74 0.73
## [575] 0.70 0.67 0.68 0.66 0.67 0.71 0.72 0.75 0.73 0.71 0.70 0.71 0.70 0.67
## [589] 0.64 0.64 0.66 0.68 0.65 0.65 0.24 0.62 0.60 0.60 0.60 0.62 0.64 0.65
## [603] 0.62 0.60 0.65 0.67 0.64 0.65 0.69 0.70 0.65 0.66 0.69 0.71 0.66 0.66
## [617] 0.61 0.58 0.57 0.55 0.57 0.58 0.59 0.59 0.56 0.55 0.57 0.54 0.53 0.57
## [631] 0.61 0.52 0.50 0.54 0.60 0.61 0.59 0.53 0.52 0.51 0.54 0.60 0.61 0.58
## [645] 0.54 0.42 0.39 0.44 0.50 0.43 0.43 0.39 0.51 0.54 0.46 0.45 0.51 0.54
## [659] 0.47 0.46 0.48 0.53 0.56 0.53 0.52 0.52 0.47 0.44 0.31 0.36 0.37 0.36
## [673] 0.32 0.33 0.31 0.28 0.27 0.34 0.36 0.39 0.42 0.48 0.32 0.28 0.32 0.35
## [687] 0.33 0.34 0.38 0.38 0.36 0.35 0.38 0.25 0.26 0.34 0.28 0.29 0.30 0.32
## [701] 0.32 0.36 0.46 0.47 0.43 0.26 0.32 0.39 0.39 0.44 0.34 0.30 0.29 0.29
## [715] 0.34 0.37 0.40 0.41 0.34 0.34 0.30 0.24 0.26 0.26 0.29 0.22 0.23 0.26
## [729] 0.24 0.23 0.22
#simple way to round without a loop
#atemp_rounded<- round(bikeshare$atemp, 2)
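The loop and the vectorized call can be checked against each other on a small example. In this sketch the atemp values are hypothetical stand-ins for bikeshare$atemp, so it runs without the data set.

```r
# Hypothetical values standing in for bikeshare$atemp.
atemp <- c(0.363625, 0.353739, 0.189405, 0.212122)

# 1. Output: pre-allocate one slot per element to be processed.
output <- vector("double", length(atemp))

# 2. Sequence and 3. body: fill each slot in turn.
for (i in seq_along(atemp)) {
  output[[i]] <- round(atemp[[i]], 2)
}

# round() is vectorized, so the loop can be replaced by a single call.
atemp_rounded <- round(atemp, 2)

# Both approaches give the same result.
identical(output, atemp_rounded)
```

The vectorized form is shorter and usually faster; the loop is shown because the same three-part pattern (output, sequence, body) carries over to operations that have no vectorized equivalent.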
3 > 4
## [1] FALSE
c(1, 2, 3, 4, 5) > 4
## [1] FALSE FALSE FALSE FALSE TRUE
c(1, 2, 3, 4, 6) == 3
## [1] FALSE FALSE TRUE FALSE FALSE
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)
numCheap <- 0
for (p in prices){
if (p < 10){
numCheap <- numCheap + 1
}
}
print(numCheap)
## [1] 3
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50, 11)
sum(prices < 10)
## [1] 3
Write a script to determine the average ridership on weekends versus weekdays. Next, let’s imagine it costs $10 per day to rent a bike on a weekday and $12 on a weekend. What is the annual weekday rental revenue in 2011 and 2012? What is the annual weekend revenue in 2011 and 2012?
Hint: Use a for loop and if/else logic.
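The hint refers to if/else, which has not been shown yet. The following sketch illustrates the pattern on a made-up set of days; the names day_type and rentals and all of the values are assumptions for illustration and are not taken from the bike sharing data.

```r
# Made-up data: the type of each day and the number of rentals on that day.
day_type <- c("weekday", "weekend", "weekday", "weekend", "weekday")
rentals  <- c(100, 80, 120, 60, 110)

weekday_total <- 0
weekend_total <- 0

for (i in seq_along(day_type)) {
  if (day_type[[i]] == "weekend") {
    # Runs only when the condition is TRUE.
    weekend_total <- weekend_total + rentals[[i]]
  } else {
    # Runs in every other case.
    weekday_total <- weekday_total + rentals[[i]]
  }
}

weekday_total  # 330
weekend_total  # 140
```

The same structure applies to the exercise once you have identified which column of your data distinguishes weekdays from weekends.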
sqrt(25)
## [1] 5
mean(c(1,2,3,4,5))
## [1] 3
toupper("hello world")
## [1] "HELLO WORLD"
f <- function(x) x + 2
f(3)
## [1] 5
#f("hello world") # causes an error because we need the parameter as a numeric.
addTogether <- function(x, y) x + y
addTogether(5, 10)
## [1] 15
addTogether(x = 5, y = 10) #alternative
## [1] 15
f <- function(x){
y <- x^2
z <- y/2
z
}
f(2)
## [1] 2
avg <- function(x,y){
(x + y)/2
}
avg(1,2)
## [1] 1.5
f <- function(x) x^2
sapply(c(1,2,3,4,5),f)
## [1] 1 4 9 16 25
Use temp and humidity to calculate the heat index for temperatures >= 80.
Use the data from: https://www.weather.gov/media/unr/heatindex.pdf
The first thing you need to do is to make sure you have the Shiny package installed and enabled. Next, we will look at the basic operation of a Shiny app.
The basic structure of a Shiny app consists of a folder in the working directory of R, for example: app_1. That folder then contains two R script files, server.R and ui.R. server.R contains the R commands that govern the server in performing calculations, analyzing data, and creating visualizations. ui.R contains the instructions for the layout of the user interface and controls the interaction with the user. The app is then launched with the command runApp("app name here"). While an app is running, you can no longer interact with the command line interface of R, because the runApp command runs continuously so that it can respond to commands from the user interface.
The server.R script contains the instructions that your computer needs to build your app.
Review: Shiny apps have a basic file structure:
There is a set of built-in examples included with Shiny. Let’s look at the first one, a basic histogram. To run it, first make sure the Shiny package is installed and enabled, then run:
library(shiny)
runExample("01_hello")
This example brings up a sample histogram and a slider to control the bin size. You can also see the code being used to generate the histogram and slider below the histogram.
If you look at the console window of RStudio, you will see a small STOP sign icon. If you click that button, it will stop the execution of the server code and allow you to interact with RStudio again.
Other Shiny examples
library(shiny)
#runExample("02_text")
#runExample("05_sliders")
#runExample("06_tabsets")
#runExample("07_widgets")
As previously stated, the code that runs a Shiny app resides in two files, server.R for the server commands, and ui.R for the user interface. We’ll now explore the elements that are used to build an app and how they interact with one another.
The user interface begins with the fluidPage function. This function creates a blank webpage which is automatically sized to the browser window. Next, panels are embedded in the webpage inside the fluidPage function. It is common to use a title panel and the sidebarLayout function. The sidebarLayout function requires a sidebarPanel and a mainPanel. Each of the above functions takes arguments, which can be in the form of non-interactive text or much more advanced functions. Here is a minimal example, using only non-interactive text:
shinyUI(fluidPage(
titlePanel("Hello Shiny!"),
sidebarLayout(
sidebarPanel("Hello from sidebarPanel"),
mainPanel("Hello from mainPanel")
)
))
Create a directory called “app1” in your working directory and save the above code to ui.R.
A minimal server file consists of the shinyServer function, which serves to receive input from and deliver output to the user interface. The code is shown below.
shinyServer(function(input,output) {
}
)
Copy and paste that code to a file called server.R in the app1 directory. You now have a Shiny app that will display static text in a title panel, a side panel, and a main panel. The server will also be listening for input from the UI. If you run app1, you should see the webpage shown below.
The code in the histogram example should make more sense now, and it is also a good example of some simple interaction between the UI and the server. Let’s review it. (All the comments have been removed for a more compact layout.)
shinyUI(fluidPage(
titlePanel("Bike sharing rental frequency"),
sidebarLayout(
sidebarPanel(
Everything above is the same as our simple “app1” example. However, the next lines in the code define the slider widget.
sliderInput("bins",
"Number of bins:",
min = 1,
max = 50,
value = 30)
),
There are a few new things going on in the above code. First, the sliderInput function is a UI widget. A number of predefined widgets are available for building your page. RStudio provides a tutorial on building widgets and a gallery of the available widgets.
The next thing to notice is in the first line of the code, the sliderInput function is making the value of the slider available to the server through the variable named “bins.” This is how the UI interacts with the server. The remaining lines of code are the values and text that are displayed on the slider.
mainPanel(
plotOutput("distPlot")
)
)
))
The next lines of code illustrate how output from the server is displayed in the UI. The mainPanel displays the plot, which is defined in the server code. The final lines of code in the UI file are quite straightforward, and are just the closing parentheses for the various functions in the UI file.
Now let’s take another look at the server.R file.
shinyServer(function(input, output) {
output$distPlot <- renderPlot({
x <- (bikeshare$cnt)
bins <- seq(min(x), max(x), length.out = input$bins + 1)
The server file opens with the shinyServer function, which is the basic function that sets the server listening for input and output. The next line defines the output element distPlot, which is referenced in the UI file, in the mainPanel. distPlot is defined by the renderPlot function, and as noted in the comments (which were removed), renderPlot causes the plot to be redrawn automatically when one of the inputs changes. These are the functions that add interactivity to your R code, and the Shiny documentation explains them in much more detail.
The next line contains our data, but the following line is a little more complicated. It defines the break points for the histogram bins, using a sequence from the minimum of x to the maximum of x, with the number of break points dictated by the input from the slider. Now the entire sequence should be clear. Moving the slider causes the UI to send the updated bins value to the server. The server takes that input, recalculates the bin boundaries, updates the plot, and returns the plot to the UI for the display to be redrawn.
hist(x, breaks = bins, col = 'darkgray', border = 'white', main='Frequency of bike sharing rentals', ylab="Count", xlab='Rentals')
})
})
The remaining code is simply standard R for drawing a histogram, and the closing braces indicating the end of the various functions.
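The break-point calculation can be tried outside Shiny. In this sketch, x is a made-up vector standing in for bikeshare$cnt and n_bins stands in for the slider value input$bins; both are assumptions for illustration.

```r
# Made-up rental counts standing in for bikeshare$cnt.
x <- c(20, 45, 60, 75, 100, 130, 180, 220)

# Stand-in for the slider value input$bins.
n_bins <- 4

# n_bins + 1 equally spaced break points produce n_bins histogram bins.
bins <- seq(min(x), max(x), length.out = n_bins + 1)
bins  # 20 70 120 170 220

# plot = FALSE returns the histogram object without drawing it.
h <- hist(x, breaks = bins, plot = FALSE)
h$counts
```

Changing n_bins and rerunning the last two statements mimics what the server does each time the slider moves.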
Once you have built your interactive display, there are several ways to share it. First, you can just continue to run it from the R console. There are several reasons why you might want to do this, including data privacy. Your data and visualization can be viewed by anyone with an R console, as long as you share your shiny app and the original data set.
A more complicated method of sharing would be to set up your own web server, and resources are available to help you do this. GitHub is also a popular choice for hosting. Finally, there is also shinyapps.io, which is hosted by RStudio and has hooks for publishing your apps directly from within RStudio.
At this point in the process, you should have gained enough insight to frame a question to guide the rest of your analysis. Sometimes you don’t know what to ask of the data, and other times the questions you have cannot be answered by the data that you have. In most visual analytical explorations there will be a back and forth between defining the questions and identifying the data sources that contain the information you need to extract.
Often your question will fall into one of three categories: Past, present, or future.
Some questions that can guide an historical analysis of past events are:
These questions serve a purpose of guiding reports, where the analyst is reporting on past events.
A question based on the present is:
How many bikes were rented in the past hour or today?
This type of question is reserved for producing a current state of an event.
Can we answer this question?
The data we are using cannot answer this question since it is historical data from 2011 and 2012.
A question about the future could be framed as the following:
Will bike rentals be higher in the summer rather than the winter due to weather?
Questions about the future usually involve analysis that requires prediction or forecasting methods. The analyst in this case is trying to predict the future from past data.
To complete on your own.
Try to answer the following questions. Show your work as a data visualization.
As a next step, I encourage you to select a data set from one of the resources provided below.
General Datasets
UCI Machine Learning Repository: A diverse collection of datasets (360 datasets currently and still growing) for analytics and machine learning. http://archive.ics.uci.edu/ml/
Kaggle datasets: Perfect for exploring data through visualization. https://www.kaggle.com/datasets
Amazon Public Dataset: Large datasets, with sizes measured in gigabytes or terabytes. https://aws.amazon.com/public-datasets/
Google Public Data: A set of datasets provided by Google, including a book corpus, US names, genome datasets, BigQuery datasets, and many more. https://cloud.google.com/public-datasets/
Open Data by Socrata: Thousands of free datasets for exploration. https://opendata.socrata.com/
Data.gov: A website dedicated to supplying datasets from different domains, e.g., education, nutrition, sports. https://catalog.data.gov/dataset?res_format=CSV
Datahub: As its tagline says, “The easy way to get, share data”. https://datahub.io/dataset?tags=weather
Harvard Dataverse: Find many of the datasets used for research purposes and cited in different publications. https://dataverse.harvard.edu/
Challenge-based datasets
KDD Data Center: Having trouble coming up with a problem statement? No worries, KDD provides you with datasets and problem statements through its challenges. http://www.kdd.org/kdd-cup
CrowdANALYTIX: More challenges to solve, with datasets provided. https://www.crowdanalytix.com/community
DrivenData: Problems for data scientists to solve. https://www.drivendata.org/competitions/
Big Data Innovation Challenge: Tackle real problems with analytics, with the chance to win a challenge. https://bigdatainnovationchallenge.org/challenges/food-security-nutrition/
Census Dataset
Open Census Data: Population details for cities in different countries are just a click away with this open data. http://census.okfn.org/en/latest/
Census.gov: Census data of United States. http://www.census.gov/data.html
Weather/Climate dataset
Wunderground: Want to work with weather data? Use Wunderground’s API to get your own dataset. https://www.wunderground.com/weather/api/
National Centers for Environmental Information: Climate datasets available for analytics. https://www.ncdc.noaa.gov/cdo-web/datasets
News Dataset
BBC Dataset: It consists of documents from the BBC news website corresponding to stories in five topical areas. http://mlg.ucd.ie/datasets/bbc.html
The Guardian: A collection of news datasets from The Guardian, which is updated regularly. https://www.theguardian.com/news/datablog/interactive/2013/jan/14/all-our-datasets-index
Food and Nutrition Datasets
Nutritional Science Blog: A blog listing some datasets relating to the domain of nutrition. http://nutsci.org/open-nutrition-food-data/
Complete items N, R, and V. Submit on NYU Classes > Assignments