8/20/2020
For any data science project, there are a few simple steps to follow.
world_internet_usage.csv
R Markdown document
internet.Rmd
Planning your programs, presenting your code, and sharing your work.
Resource: https://bookdown.org/yihui/rmarkdown/
---
title: "Hello R Markdown"
author: "Kristen Sosulski"
date: "2020-08-17"
output: html_document
---
This is a paragraph in an R Markdown document.
Below is a sample code chunk:
#{r chunkname, echo=TRUE, eval=TRUE}
myformula <- (2+2)
YAML Ain't Markup Language
#title: "Hello R Markdown"
#author: "Kristen Sosulski"
#date: "2020-08-20"
#output: html_document
| Option | Creates |
|---|---|
| html_document | HTML |
| pdf_document | PDF (requires TeX) |
| word_document | Microsoft Word (.docx) |
| github_document | GitHub-compatible Markdown |
| ioslides_presentation | ioslides HTML slides |
# Header 1
## Header 2
### Header 3
**Bold**
_Italics_

| Option | Default |
|---|---|
| eval | TRUE |
| echo | TRUE |
| warning | TRUE |
| error | FALSE |
echo=TRUE
2+2
## [1] 4
echo=FALSE
## [1] 4
plot(attitude)
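The chunk options above can also be set once for a whole document instead of per chunk. A minimal sketch, assuming the knitr package is installed and that this code sits in a setup chunk near the top of the .Rmd file; the values chosen here are illustrative only:

```r
library(knitr)

# Set document-wide defaults; an individual chunk can still override them
# in its own header, e.g. {r chunkname, echo=TRUE}.
opts_chunk$set(echo = FALSE, warning = FALSE)

# Inspect the values now in effect.
opts_chunk$get("echo")
opts_chunk$get("warning")
```

With these settings, code is hidden and warnings are suppressed unless a chunk says otherwise.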
https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf
We have a couple of different options for importing the CSV file.
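read_csv() function from the readr library
read.csv() function from the utils library
The two differ in how they handle column names. read_csv() keeps 2002 as 2002, whereas read.csv() converts it to X2002. read_csv() keeps Country Name as "Country Name", whereas read.csv() converts it to Country.Name.
read.csv()
internet_baser <- read.csv("world_internet_usage.csv")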
head(internet_baser, 5)
##     country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009
## 1     China  1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00 22.60 28.90
## 2    Mexico  5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81 21.71 26.34
## 3    Panama  6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29 33.82 39.08
## 4   Senegal  0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70 10.60 14.50
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90 69.00 69.00
##   X2010 X2011 X2012
## 1 34.30 38.30 42.30
## 2 31.05 34.96 38.42
## 3 40.10 42.70 45.20
## 4 16.00 17.50 19.20
## 5 71.00 71.00 74.18
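The X prefix in the output above comes from read.csv() making column names syntactically valid by default. A small sketch of that behavior, using an inline two-row CSV (made up for illustration from values in this data set) instead of the real file; the check.names argument is part of base R's read.csv():

```r
# Inline CSV text standing in for a file, so the example is self-contained.
csv_text <- "Country Name,2002\nChina,4.60\nMexico,11.90"

# Default behavior: names are passed through make.names().
base_default <- read.csv(text = csv_text)
names(base_default)
## [1] "Country.Name" "X2002"

# check.names = FALSE keeps the header exactly as written.
base_kept <- read.csv(text = csv_text, check.names = FALSE)
names(base_kept)
## [1] "Country Name" "2002"

# Non-syntactic names must be quoted with backticks when used with $.
base_kept$`2002`
```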
read_csv()
library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
head(internet_readr, 5)
## # A tibble: 5 x 14
##   country `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China     1.78   2.64   4.6    6.2    7.3    8.52  10.5   16     22.6
## 2 Mexico    5.08   7.04  11.9   12.9   14.1   17.2   19.5   20.8   21.7
## 3 Panama    6.55   7.27   8.52   9.99  11.1   11.5   17.4   22.3   33.8
## 4 Senegal   0.4    0.98   1.01   2.1    4.39   4.79   5.61   7.7   10.6
## 5 Singap…  36     41.7   47     53.8   62     61     59     69.9   69
## # … with 4 more variables: `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>
read_csv()
As a best practice, we're going to use the read_csv() function from the readr library to import the world_internet_usage.csv data as a tibble data frame.
class(internet_readr)
## [1] "tbl_df" "tbl" "data.frame"
Take a look at your data.
knitr::kable(head(internet_readr, 10))
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 1.78 | 2.64 | 4.60 | 6.20 | 7.30 | 8.52 | 10.52 | 16.00 | 22.60 | 28.90 | 34.30 | 38.30 | 42.30 |
| Mexico | 5.08 | 7.04 | 11.90 | 12.90 | 14.10 | 17.21 | 19.52 | 20.81 | 21.71 | 26.34 | 31.05 | 34.96 | 38.42 |
| Panama | 6.55 | 7.27 | 8.52 | 9.99 | 11.14 | 11.48 | 17.35 | 22.29 | 33.82 | 39.08 | 40.10 | 42.70 | 45.20 |
| Senegal | 0.40 | 0.98 | 1.01 | 2.10 | 4.39 | 4.79 | 5.61 | 7.70 | 10.60 | 14.50 | 16.00 | 17.50 | 19.20 |
| Singapore | 36.00 | 41.67 | 47.00 | 53.84 | 62.00 | 61.00 | 59.00 | 69.90 | 69.00 | 69.00 | 71.00 | 71.00 | 74.18 |
| United Arab Emirates | 23.63 | 26.27 | 28.32 | 29.48 | 30.13 | 40.00 | 52.00 | 61.00 | 63.00 | 64.00 | 68.00 | 78.00 | 85.00 |
| United States | 43.08 | 49.08 | 58.79 | 61.70 | 64.76 | 67.97 | 68.93 | 75.00 | 74.00 | 71.00 | 74.00 | 77.86 | 81.03 |
hist(internet_readr$`2012`)
hist(internet_readr$`2012`, breaks=8,
main="Internet usage for 2012",
col="magenta", xlab=" ", labels=TRUE)
hist(internet_readr$`2012`, breaks=4,
main="Internet usage for 2012",
col="magenta", xlab=" ", labels=TRUE)
par(mfrow=c(6,3))
hist(internet_readr$`2000`)
hist(internet_readr$`2001`)
hist(internet_readr$`2002`)
hist(internet_readr$`2003`)
hist(internet_readr$`2004`)
hist(internet_readr$`2005`)
hist(internet_readr$`2006`)
hist(internet_readr$`2007`)
hist(internet_readr$`2008`)
hist(internet_readr$`2009`)
hist(internet_readr$`2010`)
hist(internet_readr$`2011`)
hist(internet_readr$`2012`)
#Redo the histogram matrix
#and add aesthetics
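One possible way to approach a histogram matrix with aesthetics is to loop over the year columns instead of writing one hist() call per year. This is only a sketch: the `usage` data frame below is a hypothetical stand-in holding three of the year columns from the table shown earlier, and the 1 x 3 layout and labels are illustrative choices.

```r
# Stand-in data: three year columns copied from the table above.
usage <- data.frame(
  country = c("China", "Mexico", "Panama", "Senegal", "Singapore",
              "United Arab Emirates", "United States"),
  `2000` = c(1.78, 5.08, 6.55, 0.40, 36.00, 23.63, 43.08),
  `2006` = c(10.52, 19.52, 17.35, 5.61, 59.00, 52.00, 68.93),
  `2012` = c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03),
  check.names = FALSE
)

# Every column except country is a year.
year_cols <- setdiff(names(usage), "country")

# One row, one panel per year; keep the old settings to restore them later.
old_par <- par(mfrow = c(1, length(year_cols)))
for (yr in year_cols) {
  hist(usage[[yr]],
       main = paste("Internet usage for", yr),
       col = "magenta", xlab = " ", labels = TRUE)
}
par(old_par)
```

For all thirteen years, the same loop works with a larger grid, for example par(mfrow = c(5, 3)).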
boxplot(internet_readr$`2012`,
main="Internet usage for 2012",
col="magenta",
xlab=paste("The median is:", median(internet_readr$`2012`)),
frame.plot=FALSE, horizontal=TRUE,
border="dark blue")
#Build box plots for 2000-2012
#based on the example above using par(mfrow=c(INSERT NUMBER OF ROWS, NUMBER OF COLUMNS))
#Include aesthetics.
geom_col(), geom_bar() or geom_line().
ggplot(internet_readr,aes(XVALUE, YVALUE)) + geom_col()
ggplot() function.
ggplot(internet_readr,aes(X,Y)) + geom_col()
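As one way to fill in the X and Y placeholders in the template above, the sketch below plots usage by country for a single year. It assumes the ggplot2 package is installed; `usage_2012` is a hypothetical stand-in built from the 2012 column of the table shown earlier, and the choice of country on the x-axis is an assumption, not the only valid mapping.

```r
library(ggplot2)

# Stand-in data: the 2012 column copied from the table above.
usage_2012 <- data.frame(
  country = c("China", "Mexico", "Panama", "Senegal", "Singapore",
              "United Arab Emirates", "United States"),
  `2012` = c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03),
  check.names = FALSE
)

# Column names that start with a digit need backticks inside aes().
bar_2012 <- ggplot(usage_2012, aes(x = country, y = `2012`)) +
  geom_col()
bar_2012
```

In the wide format, each year needs its own plot like this one, which is part of the motivation for reshaping the data to long format.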
wide to long formats
## [1] "Wide format"
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 1.78 | 2.64 | 4.6 | 6.2 | 7.3 | 8.52 | 10.52 | 16.00 | 22.60 | 28.90 | 34.30 | 38.30 | 42.30 |
| Mexico | 5.08 | 7.04 | 11.9 | 12.9 | 14.1 | 17.21 | 19.52 | 20.81 | 21.71 | 26.34 | 31.05 | 34.96 | 38.42 |
## [1] "Long format"
| country | year | usage |
|---|---|---|
| China | 2000 | 1.78 |
| Mexico | 2000 | 5.08 |
Use the gather() function to reshape.
Use the gather() function from the tidyr package to reshape a tibble from wide to long form.
The pipe operator %>% passes the left-hand side of the operator to the first argument of the right-hand side of the operator.
library(tidyr)
tidy_internet_readr <-
internet_readr %>%
gather(`2000`:`2012`, key="year",
value="usage")
| country | year | usage |
|---|---|---|
| China | 2000 | 1.78 |
| Mexico | 2000 | 5.08 |
| Panama | 2000 | 6.55 |
| Senegal | 2000 | 0.40 |
| Singapore | 2000 | 36.00 |
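In more recent versions of tidyr, gather() is marked as superseded and pivot_longer() is the suggested replacement. The sketch below shows the same wide-to-long reshape with pivot_longer(), assuming tidyr 1.0.0 or later; `wide` is a hypothetical two-country, two-year stand-in built from the table values above.

```r
library(tidyr)

# Stand-in wide data: one row per country, one column per year.
wide <- data.frame(
  country = c("China", "Mexico"),
  `2000` = c(1.78, 5.08),
  `2001` = c(2.64, 7.04),
  check.names = FALSE
)

# Every column except country becomes a (year, usage) pair.
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "usage")

# Column names arrive as text, so convert year to an integer.
long$year <- as.integer(long$year)
long
```

Unlike gather(), pivot_longer() keeps each country's rows together, so the result is ordered China 2000, China 2001, Mexico 2000, Mexico 2001.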
write_csv(file, path)
library(readr)
internet_readr <- read_csv("worldtidy.csv")
head(internet_readr, 5)
## # A tibble: 5 x 3
##   country    year usage
##   <chr>     <int> <dbl>
## 1 China      2000  1.78
## 2 Mexico     2000  5.08
## 3 Panama     2000  6.55
## 4 Senegal    2000  0.4
## 5 Singapore  2000 36
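The step that produced worldtidy.csv is not shown here; presumably the tidy tibble was saved with write_csv(). A minimal round-trip sketch under that assumption, using a hypothetical two-row stand-in and a temporary file so that nothing in the working directory is overwritten:

```r
library(readr)

# Stand-in for the tidy data: two rows copied from the long table above.
tidy_sample <- data.frame(
  country = c("China", "Mexico"),
  year = c(2000L, 2000L),
  usage = c(1.78, 5.08)
)

# Write to a temporary file; in the course this would be "worldtidy.csv".
tmp_path <- tempfile(fileext = ".csv")
write_csv(tidy_sample, tmp_path)

# Read the file back to confirm the round trip.
read_back <- read_csv(tmp_path)
head(read_back, 5)
```

The second argument is passed by position because its name differs between readr versions (path in older releases, file in newer ones).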
ggplot2 package
ggplot(data, aes(x,y,color, group)) + geom_line() + ... + ...
library(ggplot2)
#assign the ggplot call to a variable
line01 <- ggplot(tidy_internet_readr,
aes(x=year,y=usage,color=country,
group=country)) + geom_line()
line01
+labs(title="", subtitle="",x="", y="", caption="")
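The labs() template above can be filled in and added to a stored plot. A sketch assuming ggplot2 is installed; `tidy_sample` is a hypothetical stand-in in long format built from table values shown earlier, and every label string is a placeholder.

```r
library(ggplot2)

# Stand-in long-format data: two countries, two years.
tidy_sample <- data.frame(
  country = rep(c("China", "Mexico"), times = 2),
  year = rep(c(2000L, 2012L), each = 2),
  usage = c(1.78, 5.08, 42.30, 38.42)
)

# Build the line chart, one line per country.
line_sample <- ggplot(tidy_sample,
                      aes(x = year, y = usage, color = country,
                          group = country)) +
  geom_line()

# Add labels to the stored plot; the text here is illustrative only.
line_labeled <- line_sample +
  labs(title = "Internet usage by country",
       subtitle = "2000 and 2012",
       x = "Year", y = "Usage",
       caption = "Data: world_internet_usage.csv")
line_labeled
```

Because the plot is stored in a variable, layers and labels can be added step by step without retyping the ggplot() call.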