Updated on Thu Aug 17 00:10:42 2017.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
library(ggthemes)
library(ggvis)
##
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
##
## resolution
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(knitr)
library(shiny)
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:ggvis':
##
## fullseq, zero_range
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
This section guides you through decoding your data into information and, ultimately, intelligible insights. Along the way, we will use packages from the tidyverse alongside base R.
When working with new data, what initial questions do you have?
Consider the following questions to guide your understanding.
Once you have this basic understanding of your data, you can dig deeper. Visualization techniques let you explore the data and draw basic conclusions about the phenomena you are studying, such as the largest and smallest values for each variable. In addition, summary statistics translate data into information by revealing the shape of the data: the mean, median, minimum value, maximum value, and variability can all be shown with simple visualizations.
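As a minimal sketch of the summary statistics mentioned above, the year-2000 values from the World internet usage data used later in this section can be summarized with a few base R calls. The vector name usage_2000 is an assumption made for illustration only.

```r
# Internet users per 100 people in 2000, copied from the data set used in this section
usage_2000 <- c("China" = 1.78, "Mexico" = 5.08, "Panama" = 6.55, "Senegal" = 0.40,
                "Singapore" = 36.00, "United Arab Emirates" = 23.63,
                "United States" = 43.08)

summary(usage_2000)            # minimum, quartiles, median, mean, maximum
sd(usage_2000)                 # variability
range(usage_2000)              # smallest and largest values
names(which.max(usage_2000))   # country with the largest value
```

The same calls work on any numeric column of a data frame or tibble, for example summary(internet_readr$`2000`).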
For any data science project there are a few simple steps to follow.
Using the World internet usage data, we will compare read.csv to read_csv for importing data.
internet_utils <- read.csv("world_internet_usage.csv")
head(internet_utils)
## country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 China 1.78 2.64 4.60 6.20 7.30 8.52 10.52 16.00
## 2 Mexico 5.08 7.04 11.90 12.90 14.10 17.21 19.52 20.81
## 3 Panama 6.55 7.27 8.52 9.99 11.14 11.48 17.35 22.29
## 4 Senegal 0.40 0.98 1.01 2.10 4.39 4.79 5.61 7.70
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90
## 6 United Arab Emirates 23.63 26.27 28.32 29.48 30.13 40.00 52.00 61.00
## X2008 X2009 X2010 X2011 X2012
## 1 22.60 28.90 34.30 38.30 42.30
## 2 21.71 26.34 31.05 34.96 38.42
## 3 33.82 39.08 40.10 42.70 45.20
## 4 10.60 14.50 16.00 17.50 19.20
## 5 69.00 69.00 71.00 71.00 74.18
## 6 63.00 64.00 68.00 78.00 85.00
library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## `2000` = col_double(),
## `2001` = col_double(),
## `2002` = col_double(),
## `2003` = col_double(),
## `2004` = col_double(),
## `2005` = col_double(),
## `2006` = col_double(),
## `2007` = col_double(),
## `2008` = col_double(),
## `2009` = col_double(),
## `2010` = col_double(),
## `2011` = col_double(),
## `2012` = col_double()
## )
head(internet_readr)
## # A tibble: 6 x 14
## country `2000` `2001` `2002` `2003` `2004` `2005` `2006`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 China 1.78 2.64 4.60 6.20 7.30 8.52 10.52
## 2 Mexico 5.08 7.04 11.90 12.90 14.10 17.21 19.52
## 3 Panama 6.55 7.27 8.52 9.99 11.14 11.48 17.35
## 4 Senegal 0.40 0.98 1.01 2.10 4.39 4.79 5.61
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00
## 6 United Arab Emirates 23.63 26.27 28.32 29.48 30.13 40.00 52.00
## # ... with 6 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>
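The differences between the two importers can be reproduced on a small scale. The sketch below writes a tiny stand-in file (hypothetical, not the real world_internet_usage.csv) to a temporary location and reads it both ways. Note that the factor output shown in this section comes from an older R; since R 4.0.0, read.csv no longer converts strings to factors by default.

```r
library(readr)

# A tiny stand-in file (hypothetical) so the comparison runs without the real CSV
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,2000,2001",
             "China,1.78,2.64",
             "Mexico,5.08,7.04"), tmp)

base_df  <- read.csv(tmp)                     # base R importer
readr_df <- read_csv(tmp, col_types = cols()) # readr importer; cols() guesses types quietly

names(base_df)    # non-syntactic names such as 2000 get an X prefix
names(readr_df)   # names are kept exactly as written in the file
class(base_df)    # a plain data.frame
class(readr_df)   # a tibble, which also inherits from data.frame

# check.names = FALSE asks read.csv to keep the original column names
names(read.csv(tmp, check.names = FALSE))
```

Keeping the names as written means the year columns must be quoted with backticks, as in internet_readr$`2000`.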
#extract by position
internet_readr[[2,1]]
## [1] "Mexico"
internet_utils[2,1] # double [[ ]] works too
## [1] Mexico
## 7 Levels: China Mexico Panama Senegal Singapore ... United States
#extract by name
internet_readr$country
## [1] "China" "Mexico" "Panama"
## [4] "Senegal" "Singapore" "United Arab Emirates"
## [7] "United States"
internet_utils$country
## [1] China Mexico Panama
## [4] Senegal Singapore United Arab Emirates
## [7] United States
## 7 Levels: China Mexico Panama Senegal Singapore ... United States
#to use with infix function add a .
internet_readr %>% .$country
## [1] "China" "Mexico" "Panama"
## [4] "Senegal" "Singapore" "United Arab Emirates"
## [7] "United States"
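The reason the tibble needs double brackets is that single-bracket subsetting of a tibble always returns another tibble, while a data frame drops to a bare value. A minimal sketch of this, using a two-row stand-in (the names tb and df are assumptions for illustration); pull() from dplyr is shown as an alternative to .$country inside a pipe:

```r
library(tibble)
library(dplyr)

# Two-row stand-in for the imported data (illustrative names)
tb <- tibble(country = c("China", "Mexico"), `2000` = c(1.78, 5.08))
df <- as.data.frame(tb)

tb[2, 1]               # still a 1 x 1 tibble
tb[[2, 1]]             # the bare value
df[2, 1]               # data frames drop to the bare value here
tb %>% pull(country)   # same result as tb %>% .$country
```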
You need to rename columns first to remove the X in front of each year.
names(internet_utils) <- c("country", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012")
names(internet_utils)
## [1] "country" "2000" "2001" "2002" "2003" "2004" "2005"
## [8] "2006" "2007" "2008" "2009" "2010" "2011" "2012"
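Typing every column name by hand is easy to get wrong when there are many years. As a sketch of an alternative, the X prefix can be stripped with a regular expression. The data frame internet_demo below is a hypothetical stand-in for the result of read.csv.

```r
# Hypothetical stand-in for the data frame returned by read.csv()
internet_demo <- data.frame(country = c("China", "Mexico"),
                            X2000 = c(1.78, 5.08),
                            X2001 = c(2.64, 7.04))

# Strip a leading X only when it is followed by exactly four digits,
# so other column names (such as country) are left untouched
names(internet_demo) <- sub("^X([0-9]{4})$", "\\1", names(internet_demo))
names(internet_demo)
```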
Reshape a data frame
library(reshape2)
internet_utils_reshaped <- melt(internet_utils, id.vars="country", variable.name="year", value.name="usage")
Reshape a tibble
internet_readr_reshaped <- melt(internet_readr, id.vars="country", variable.name="year", value.name="usage")
internet_readr_reshaped
## country year usage
## 1 China 2000 1.78
## 2 Mexico 2000 5.08
## 3 Panama 2000 6.55
## 4 Senegal 2000 0.40
## 5 Singapore 2000 36.00
## 6 United Arab Emirates 2000 23.63
## 7 United States 2000 43.08
## 8 China 2001 2.64
## 9 Mexico 2001 7.04
## 10 Panama 2001 7.27
## 11 Senegal 2001 0.98
## 12 Singapore 2001 41.67
## 13 United Arab Emirates 2001 26.27
## 14 United States 2001 49.08
## 15 China 2002 4.60
## 16 Mexico 2002 11.90
## 17 Panama 2002 8.52
## 18 Senegal 2002 1.01
## 19 Singapore 2002 47.00
## 20 United Arab Emirates 2002 28.32
## 21 United States 2002 58.79
## 22 China 2003 6.20
## 23 Mexico 2003 12.90
## 24 Panama 2003 9.99
## 25 Senegal 2003 2.10
## 26 Singapore 2003 53.84
## 27 United Arab Emirates 2003 29.48
## 28 United States 2003 61.70
## 29 China 2004 7.30
## 30 Mexico 2004 14.10
## 31 Panama 2004 11.14
## 32 Senegal 2004 4.39
## 33 Singapore 2004 62.00
## 34 United Arab Emirates 2004 30.13
## 35 United States 2004 64.76
## 36 China 2005 8.52
## 37 Mexico 2005 17.21
## 38 Panama 2005 11.48
## 39 Senegal 2005 4.79
## 40 Singapore 2005 61.00
## 41 United Arab Emirates 2005 40.00
## 42 United States 2005 67.97
## 43 China 2006 10.52
## 44 Mexico 2006 19.52
## 45 Panama 2006 17.35
## 46 Senegal 2006 5.61
## 47 Singapore 2006 59.00
## 48 United Arab Emirates 2006 52.00
## 49 United States 2006 68.93
## 50 China 2007 16.00
## 51 Mexico 2007 20.81
## 52 Panama 2007 22.29
## 53 Senegal 2007 7.70
## 54 Singapore 2007 69.90
## 55 United Arab Emirates 2007 61.00
## 56 United States 2007 75.00
## 57 China 2008 22.60
## 58 Mexico 2008 21.71
## 59 Panama 2008 33.82
## 60 Senegal 2008 10.60
## 61 Singapore 2008 69.00
## 62 United Arab Emirates 2008 63.00
## 63 United States 2008 74.00
## 64 China 2009 28.90
## 65 Mexico 2009 26.34
## 66 Panama 2009 39.08
## 67 Senegal 2009 14.50
## 68 Singapore 2009 69.00
## 69 United Arab Emirates 2009 64.00
## 70 United States 2009 71.00
## 71 China 2010 34.30
## 72 Mexico 2010 31.05
## 73 Panama 2010 40.10
## 74 Senegal 2010 16.00
## 75 Singapore 2010 71.00
## 76 United Arab Emirates 2010 68.00
## 77 United States 2010 74.00
## 78 China 2011 38.30
## 79 Mexico 2011 34.96
## 80 Panama 2011 42.70
## 81 Senegal 2011 17.50
## 82 Singapore 2011 71.00
## 83 United Arab Emirates 2011 78.00
## 84 United States 2011 77.86
## 85 China 2012 42.30
## 86 Mexico 2012 38.42
## 87 Panama 2012 45.20
## 88 Senegal 2012 19.20
## 89 Singapore 2012 74.18
## 90 United Arab Emirates 2012 85.00
## 91 United States 2012 81.03
class(internet_readr_reshaped) # turns into a data.frame!
## [1] "data.frame"
Use the gather function to reshape
tidy_internet_readr <-
  internet_readr %>%
  gather(`2000`,`2001`,`2002`,`2003`,`2004`,`2005`,`2006`,`2007`,`2008`,`2009`,`2010`,`2011`,`2012`, key="year", value="usage")
tidy_internet_readr
## # A tibble: 91 x 3
## country year usage
## <chr> <chr> <dbl>
## 1 China 2000 1.78
## 2 Mexico 2000 5.08
## 3 Panama 2000 6.55
## 4 Senegal 2000 0.40
## 5 Singapore 2000 36.00
## 6 United Arab Emirates 2000 23.63
## 7 United States 2000 43.08
## 8 China 2001 2.64
## 9 Mexico 2001 7.04
## 10 Panama 2001 7.27
## # ... with 81 more rows
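In tidyr 1.0.0 and later, gather is superseded by pivot_longer. The sketch below shows the equivalent reshape on a small stand-in tibble (the names wide and long are assumptions for illustration). The names_transform argument, available from tidyr 1.1.0, converts the year labels to integers in the same step.

```r
library(tidyr)
library(tibble)

# Small wide-format stand-in with values copied from the data above
wide <- tibble(country = c("China", "Mexico"),
               `2000` = c(1.78, 5.08),
               `2001` = c(2.64, 7.04))

long <- pivot_longer(wide,
                     cols = -country,          # every column except country
                     names_to = "year",        # old column names go here
                     values_to = "usage",      # cell values go here
                     names_transform = list(year = as.integer))
long
```

Unlike gather, pivot_longer keeps the rows for each country together, so the row order differs from the output shown above.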
Create a few statistical visualizations to understand the makeup of your data.
boxplot(internet_readr$`2000`, main="Range of internet users in 2000", sub="Median of 6.55 users per 100 people")
boxplot(internet_readr$`2001`, main="Range of internet users in 2001", sub="Median of 7.27 users per 100 people")
hist(internet_readr$`2000`, main="Frequency of internet users in 2000 per 100 people", xlab="2000")
hist(internet_readr$`2001`, main="Frequency of internet users in 2001 per 100 people", xlab="2001")
library(lattice)
histogram(internet_readr$`2000`, main="Frequency of internet users in 2000 per 100 people", xlab="2000")
histogram(internet_readr$`2001`, main="Frequency of internet users in 2001 per 100 people", xlab="2001")
Histogram Matrix
Version 1
histogram(~ usage | year, data=tidy_internet_readr, layout=c(4,4))
h <- histogram(~tidy_internet_readr$usage|tidy_internet_readr$year, col=("lightgreen"), breaks=5, layout=c(3,5))
update(h, index.cond=list(c(10:12, 7:9, 4:6, 1:3))) #13, 10:12, 7:9, 4:6, 1:3
tidy_internet_readr$year <- as.character(tidy_internet_readr$year)
h <- histogram(~tidy_internet_readr$usage|tidy_internet_readr$year, col=("lightgreen"),
xlab="Usage", breaks=5, layout=c(4,4), ylab="Year")
update(h, index.cond=list(c(10:13, 6:9, 2:5, 1)))
boxplot(internet_readr[,2:14], main="Range of internet users per 100 people")
plot(as.numeric(tidy_internet_readr$year), tidy_internet_readr$usage, main="Internet usage per 100 people", xlab="Year", ylab="Usage", type="p")
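A histogram matrix can also be drawn with ggplot2, where facet_wrap produces one panel per year. This is a minimal sketch on a small long-format stand-in (the name usage_long is an assumption; the values are copied from the data above), not a replacement for the lattice version.

```r
library(ggplot2)

# Small long-format stand-in: four countries, two years
usage_long <- data.frame(
  country = rep(c("China", "Mexico", "Panama", "Senegal"), times = 2),
  year    = rep(c("2000", "2001"), each = 4),
  usage   = c(1.78, 5.08, 6.55, 0.40, 2.64, 7.04, 7.27, 0.98)
)

# One histogram panel per year
p <- ggplot(usage_long, aes(x = usage)) +
  geom_histogram(bins = 5, fill = "lightgreen", colour = "black") +
  facet_wrap(~ year) +
  labs(x = "Usage", y = "Count")
p
```

With the full data, passing tidy_internet_readr in place of usage_long would give one panel for each of the 13 years.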
Create charts and reports.
library(ggthemes)
library(ggplot2)
#line chart
ggplot(tidy_internet_readr,aes(x=year,y=usage,colour=country,group=country)) + geom_line() + labs(title = "Internet Usage per 100 people", subtitle = "Since 2011, the UAE has surpassed Singapore and the US in internet users", caption = "Source: World Bank, 2013",x = "Year",y ="Usage") + theme_excel()
#line chart
#reference: http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/
lineplot <- ggplot(bikeshare,aes(x=dteday,y=cnt,color=factor(year),group=factor(year))) + geom_line() + labs(title = "Rentals by day", subtitle = "Insight here", caption = "Source: Capital Bikeshare",x = "day",y ="Rentals")
lineplot + scale_color_manual(values=c("#999999", "#56B4E9"),
name="Year",
breaks=c("0", "1"),
labels=c("2011", "2012"))
areaplot <- ggplot(bikeshare,aes(x=dteday,y=cnt,color=factor(year),group=factor(year))) + geom_area() + labs(title = "Rentals by day", subtitle = "Insight here", caption = "Source: Capital Bikeshare",x = "day",y ="Rentals") + theme_fivethirtyeight()
areaplot + scale_color_manual(values=c("#999999", "#56B4E9"),
name="Year",
breaks=c("0", "1"),
labels=c("2011", "2012"))
#source https://chrisalbon.com/r-stats/stacked-area-graph.html
####
#geom_area() +
# change the colors and reserve the legend order (to match to stacking order)
# scale_fill_brewer(palette="Blues", breaks=rev(levels(uspopage$AgeGroup)))
histogram(~ cnt | as.factor(mnth), data=bikeshare_2012, layout=c(4,3))
#levels
#mydata$v1 <- ordered(factor(mydata$v1,
#levels = c(1,2,3,4,5,6,7,8,9,10,11,12)),
#labels = c("red", "blue", "green"))
bikeshare_2012$mnth<-ordered(factor(bikeshare_2012$mnth, levels =c(1,2,3,4,5,6,7,8,9,10,11,12),
labels = c("Jan", "Feb", "March", "April", "May", "June", "July", "Aug.", "Sept", "Oct", "Nov.","Dec.")))
#reference: http://www.statmethods.net/RiA/lattice.pdf
require (lattice)
bikesharebymonth <- histogram(~bikeshare_2012$cnt|(bikeshare_2012$mnth),type=c("count"),col=("lightgreen"),strip =strip.custom(bg="lightgrey",
par.strip.text=list(col="black", cex=1, font=1)),main="The frequency of bicycle rentals by month in 2012\n",
xlab="Rentals", breaks=5,layout=c(4,3), ylab ="Month", sub=("\n Kristen Sosulski | Capital Bikeshare, 2012"))
update(bikesharebymonth, index.cond=list(c(9:12, 5:8, 1:4)))
##basic bar
options(scipen=10000)
ggplot(bikeshare,aes(x=season,y=cnt)) + geom_bar(stat="identity") + labs(title = "Rentals by day", subtitle = "Insight here", caption = "Source: Capital Bikeshare",x = "Season",y ="Rentals") + scale_y_continuous(limits=c(0,1000000),oob = rescale_none)
#stacked bar
bar <- ggplot(bikeshare,aes(x=season,y=cnt,color=factor(year),group=factor(year))) + geom_bar(stat="identity") + labs(title = "Rentals by day", subtitle = "Insight here", caption = "Source: Capital Bikeshare",x = "Season",y ="Rentals") + scale_y_continuous(limits=c(0,1000000),oob = rescale_none)
bar + scale_color_manual(values=c("#999999", "#56B4E9"),
name="Year",
breaks=c("0", "1"),
labels=c("2011", "2012"))
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
parcoord(bikeshare[, c(16, 10,12,13)], col="#4cbea3", lty=7, var.label=TRUE, lwd = .4)
#http://www.buildingwidgets.com/blog/2015/1/30/week-04-interactive-parallel-coordinates-1
library(devtools)
devtools::install_github("timelyportfolio/parcoords")
## Skipping install of 'parcoords' from a github remote, the SHA1 (324d00b8) has not changed since last install.
## Use `force = TRUE` to force installation
library(parcoords)
parcoords(bikeshare)
Brush On and Reorder
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:ggvis':
##
## add_data, hide_legend
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- ggplot(data = bikeshare, aes(x = as.factor(weathersit), fill =season)) + geom_bar(position = "dodge")
ggplotly(p)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
scatter
# plot_ly() has no xlab attribute, so the axis title is set with layout();
# type = "scatter" is given explicitly so plotly does not have to guess the trace type
plot_ly(bikeshare, x = bikeshare$humidity, y = bikeshare$cnt,
text = paste("Weather situation: ", as.factor(bikeshare$weathersit)),
type = "scatter", mode = "markers", color = as.factor(bikeshare$weathersit)) %>% # size = bikeshare$weathersit
layout(xaxis = list(title = "humidity"))
At this point in the process, you should have gained enough insight to frame a question to guide the rest of your analysis. Sometimes you don’t know what to ask of the data, and other times the questions you have cannot be answered by the data that you have. In most visual analytical explorations there will be a back and forth between defining the questions and identifying the data sources that contain the information you need to extract.
Often your question will fall into one of three categories: past, present, or future.
Some questions that can guide an historical analysis of past events are:
These questions serve to guide reports, in which the analyst describes past events.
A question based on the present is:
How many bikes were rented in the past hour or today?
This type of question is used to report the current state of an event.
Can we answer this question?
The data we are using cannot answer this question since it is historical data from 2011 and 2012.
A question about the future could be framed as the following:
Will bike rentals be higher in the summer rather than the winter due to weather?
Questions about the future usually involve analysis that requires prediction or forecasting methods. The analyst in this case is trying to predict the future from past data.
To complete on your own.
Try to answer the following questions. Show your work as a data visualization.
SOLUTION