R Programming for Data: Meetup 2

8/20/2020

I. BEGINNING DATA PROJECTS

When working with a new data set, what initial questions do you have?

What does your data represent in the real world?
How is this real-world phenomenon characterized by the data that you have?
From what time period is the data?
What's the source?

Basic understanding

Once you have this basic understanding of your data you can dig deeper.

You can use visualization techniques to explore your data and derive some basic understandings of the phenomena you are studying, such as the largest and smallest values for each variable.

Calculating summary statistics can translate the data into information by revealing the shape of the data, the mean, median, minimum value, maximum value, and variability.
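As a minimal sketch of the summary statistics mentioned above, the base R code below uses the 2012 values from the world internet usage data that appears later in this session. The vector name usage_2012 is an assumption made for illustration.

```r
# 2012 values copied from the world internet usage table shown later in this session
usage_2012 <- c(
  China = 42.30, Mexico = 38.42, Panama = 45.20, Senegal = 19.20,
  Singapore = 74.18, `United Arab Emirates` = 85.00, `United States` = 81.03
)

# Minimum, quartiles, median, mean, and maximum in one call
summary(usage_2012)

# Variability: standard deviation and range
sd(usage_2012)
range(usage_2012)

# Which countries have the largest and smallest values?
names(which.max(usage_2012))
names(which.min(usage_2012))
```

summary() covers most of the statistics listed above in one call; sd() and range() add the variability.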

The process

For any data science project there are a few simple steps to follow.

II. APPLICATION SAMPLE PROJECT. WORLD INTERNET USAGE

world_internet_usage.csv

1. Set up your workspace

Begin by creating a new folder that contains your data.
Then create a new project in RStudio.
Set your working directory to the folder you created above.
Create a new RMarkdown document.
Save it in your working directory.
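The setup steps above are usually done through the RStudio menus, and an RStudio project sets the working directory to the project folder for you. If you prefer to do it from the console, a rough sketch could look like this. The folder name internet_project is hypothetical, and a temporary location is used so the sketch is safe to run anywhere.

```r
# Hypothetical project folder (created under tempdir() so nothing permanent is written)
project_dir <- file.path(tempdir(), "internet_project")
dir.create(project_dir, showWarnings = FALSE)

# setwd() returns the previous working directory, so we can restore it later
old_wd <- setwd(project_dir)

# Confirm where R is now looking for files
basename(getwd())

# Check that the data file is in the working directory
# (FALSE here until you copy world_internet_usage.csv into the folder)
file.exists("world_internet_usage.csv")

# Restore the original working directory
setwd(old_wd)
```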

2. Create a new `RMarkdown` document

Go to File > New File > R Markdown
Save the file in your working directory
Name it internet.Rmd

Using R Markdown to write your programs

Planning your programs, presenting your code, and sharing your work.

Create .Rmd
Write text
Embed code
Render output

Resource: https://bookdown.org/yihui/rmarkdown/

R Markdown Example

---
title: "Hello R Markdown"
author: "Kristen Sosulski"
date: "2020-08-17"
output: html_document
---

This is a paragraph in an R Markdown document.
Below is a sample code chunk:

```{r chunkname, echo=TRUE, eval=TRUE}
myformula <- (2+2)
```

R Markdown Components

metadata - YAML header
text - Text outside of code chunks
code - R code chunks

R Markdown YAML header

YAML: YAML Ain't Markup Language

---
title: "Hello R Markdown"
author: "Kristen Sosulski"
date: "2020-08-20"
output: html_document
---

R Markdown YAML common output options

Option	Creates
html_document	HTML
pdf_document	PDF (requires TeX)
word_document	Microsoft Word (.docx)
github_document	GitHub-compatible Markdown
ioslides_presentation	ioslides HTML slides

R Markdown common text options

# Header 1
## Header 2
### Header 3
**Bold**
_Italics_

R Markdown code chunk options

Option	Default
eval	TRUE
echo	TRUE
warning	TRUE
error	FALSE

echo=TRUE and echo=FALSE

echo=TRUE

2+2

## [1] 4

echo=FALSE

## [1] 4

fig.height=3, fig.width=4

plot(attitude)

See R Markdown cheatsheet

https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf

3. Import your data

We have a couple of different options:

read_csv() function from the readr package
read.csv() function from the utils package (base R)

readr versus base R

Compared with their base R equivalents, readr functions:

Are typically faster (~10x).
Produce tibbles; they don't convert character vectors to factors, use row names, or munge the column names.
Are more reproducible. Base R functions inherit some behavior from your OS and environment variables, so import code that works on your computer might not work on someone else's.
Keep column names that are numbers: 2002 stays `2002` with readr, while base R converts it to X2002.
Keep column names that contain spaces (bad practice): Country Name stays `Country Name` with readr, while base R converts it to Country.Name.
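The column-name behavior described above can be seen without any files. The sketch below uses only base R and a small inline CSV string (csv_text is made up for illustration, with values taken from the data set used in this session). It also shows that read.csv() can be told to keep the original names with check.names = FALSE.

```r
# A tiny CSV held in a string, so no file is needed
csv_text <- "country,2000,2001\nChina,1.78,2.64\nMexico,5.08,7.04"

# Default base R behavior: numeric column names are made syntactically valid
default_names <- read.csv(text = csv_text)
names(default_names)
# "country" "X2000" "X2001"

# check.names = FALSE keeps the names exactly as they appear in the file
kept_names <- read.csv(text = csv_text, check.names = FALSE)
names(kept_names)
# "country" "2000" "2001"

# Non-syntactic names must be wrapped in backticks when used with $
kept_names$`2000`
```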

`read.csv()`

internet_baser <- read.csv("world_internet_usage.csv")
head(internet_baser, 5)

##     country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009
## 1     China  1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00 22.60 28.90
## 2    Mexico  5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81 21.71 26.34
## 3    Panama  6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29 33.82 39.08
## 4   Senegal  0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70 10.60 14.50
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90 69.00 69.00
##   X2010 X2011 X2012
## 1 34.30 38.30 42.30
## 2 31.05 34.96 38.42
## 3 40.10 42.70 45.20
## 4 16.00 17.50 19.20
## 5 71.00 71.00 74.18

`read_csv()`

library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
head(internet_readr, 5)

## # A tibble: 5 x 14
##   country `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China     1.78   2.64   4.6    6.2    7.3    8.52  10.5    16     22.6
## 2 Mexico    5.08   7.04  11.9   12.9   14.1   17.2   19.5    20.8   21.7
## 3 Panama    6.55   7.27   8.52   9.99  11.1   11.5   17.4    22.3   33.8
## 4 Senegal   0.4    0.98   1.01   2.1    4.39   4.79   5.61    7.7   10.6
## 5 Singap…  36     41.7   47     53.8   62     61     59      69.9   69  
## # … with 4 more variables: `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>

Use `read_csv()`

As a best practice, we're going to use the read_csv() function from the readr package to import the world_internet_usage.csv data as a tibble.

class(internet_readr)

## [1] "tbl_df"     "tbl"        "data.frame"

4. Prepare or tidy your data

Take a look at your data.

knitr::kable(head(internet_readr, 10))

country	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
China	1.78	2.64	4.60	6.20	7.30	8.52	10.52	16.00	22.60	28.90	34.30	38.30	42.30
Mexico	5.08	7.04	11.90	12.90	14.10	17.21	19.52	20.81	21.71	26.34	31.05	34.96	38.42
Panama	6.55	7.27	8.52	9.99	11.14	11.48	17.35	22.29	33.82	39.08	40.10	42.70	45.20
Senegal	0.40	0.98	1.01	2.10	4.39	4.79	5.61	7.70	10.60	14.50	16.00	17.50	19.20
Singapore	36.00	41.67	47.00	53.84	62.00	61.00	59.00	69.90	69.00	69.00	71.00	71.00	74.18
United Arab Emirates	23.63	26.27	28.32	29.48	30.13	40.00	52.00	61.00	63.00	64.00	68.00	78.00	85.00
United States	43.08	49.08	58.79	61.70	64.76	67.97	68.93	75.00	74.00	71.00	74.00	77.86	81.03

5. Visualize your data

How could you visualize this to better understand it?

A histogram?

What function can we use?

Building a histogram

hist(internet_readr$`2012`)

Adding aesthetics

hist(internet_readr$`2012`, breaks=8, 
     main="Internet usage for 2012", 
     col="magenta", xlab=" ", labels=TRUE)

Maybe too many breaks…

hist(internet_readr$`2012`, breaks=4, 
     main="Internet usage for 2012", 
     col="magenta", xlab=" ", labels=TRUE)
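When breaks is a single number, hist() treats it as a suggestion and picks "pretty" break points, so the number of bars may not match what you asked for. A small sketch to check what was actually used, with the 2012 values typed in by hand (the vector name usage_2012 is an assumption):

```r
# 2012 values copied from the table shown earlier
usage_2012 <- c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03)

# plot = FALSE returns the histogram object without drawing it
h8 <- hist(usage_2012, breaks = 8, plot = FALSE)
h4 <- hist(usage_2012, breaks = 4, plot = FALSE)

# Break points R actually chose, and the number of bins that result
h8$breaks
length(h8$counts)

h4$breaks
length(h4$counts)

# Counts per bin always add up to the number of observations
sum(h8$counts)
```

To force exact bin edges, pass a vector of break points instead of a single number.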

A histogram for every year using par(mfrow=c(6,3))

par(mfrow=c(6,3))
hist(internet_readr$`2000`)
hist(internet_readr$`2001`)
hist(internet_readr$`2002`)
hist(internet_readr$`2003`)
hist(internet_readr$`2004`)
hist(internet_readr$`2005`)
hist(internet_readr$`2006`)
hist(internet_readr$`2007`)
hist(internet_readr$`2008`)
hist(internet_readr$`2009`)
hist(internet_readr$`2010`)
hist(internet_readr$`2011`)
hist(internet_readr$`2012`)
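The repeated hist() calls above can also be written as a loop over the year columns. This is only a sketch: it builds a small data frame by hand (named internet here, with four of the years from the table) and uses a 2 x 2 grid so that it runs on a default-sized plotting device.

```r
# Four of the year columns, copied from the table shown earlier
internet <- data.frame(
  country = c("China", "Mexico", "Panama", "Senegal", "Singapore",
              "United Arab Emirates", "United States"),
  `2000` = c(1.78, 5.08, 6.55, 0.40, 36.00, 23.63, 43.08),
  `2004` = c(7.30, 14.10, 11.14, 4.39, 62.00, 30.13, 64.76),
  `2008` = c(22.60, 21.71, 33.82, 10.60, 69.00, 63.00, 74.00),
  `2012` = c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03),
  check.names = FALSE  # keep the year names as "2000", not "X2000"
)

# Every column except country holds one year of data
year_cols <- setdiff(names(internet), "country")

# par() returns the previous settings so they can be restored afterwards
old_par <- par(mfrow = c(2, 2))
for (yr in year_cols) {
  hist(internet[[yr]], main = yr, xlab = "")
}
par(old_par)
```

With all 13 year columns, the same loop works once the grid in par(mfrow = ...) has at least 13 cells.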

Histogram matrix

EXERCISE 01 - COMPLETE


#Redo the histogram matrix 
#and add aesthetics

Boxplots

boxplot(internet_readr$`2012`, 
        main="Internet usage for 2012", 
        col="magenta", 
        xlab=paste("The median is:", median(internet_readr$`2012`)),
        frame.plot=FALSE, horizontal=TRUE, 
        border="dark blue")

Multiple boxplots

EXERCISE 02 - COMPLETE


#Build box plots for 2000-2012 
#based on the example above using par(mfrow=c(INSERT NUMBER OF ROWS, NUMBER OF COLUMNS)) 
#Include aesthetics.

How else might you want to visualize this data to better understand it?

How would you create a bar or line graph using ggplot?

Technically, we would use geom_col(), geom_bar() or geom_line().

However, what variable would you map for the x-axis and the y-axis?

For example…

# XVALUE and YVALUE are placeholders; the + must end the line so R continues the expression
ggplot(internet_readr, aes(XVALUE, YVALUE)) +
  geom_col()

Let's think about how we would use the `ggplot()` function.

# X and Y are placeholders; the + must end the line so R continues the expression
ggplot(internet_readr, aes(X, Y)) +
  geom_col()

III. DATA TRANSFORMATION & ADVANCED VISUALIZATION

Wide to long formats

We need to reshape our data from wide to long.

## [1] "Wide format"

country	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
China	1.78	2.64	4.6	6.2	7.3	8.52	10.52	16.00	22.60	28.90	34.30	38.30	42.30
Mexico	5.08	7.04	11.9	12.9	14.1	17.21	19.52	20.81	21.71	26.34	31.05	34.96	38.42

## [1] "Long format"

country	year	usage
China	2000	1.78
Mexico	2000	5.08

Using the `gather()` function to reshape

Use the gather() function from the tidyr package to reshape a tibble from wide to long form.
Note the use of the pipe %>%, which passes the left-hand side of the operator as the first argument to the function on the right-hand side.

library(tidyr)  # provides gather() and re-exports the pipe %>%

tidy_internet_readr <- 
  internet_readr %>%
  gather(`2000`:`2012`, key="year", 
         value="usage")
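gather() still works, but the tidyr documentation marks it as superseded by pivot_longer() (tidyr 1.0.0 and later). As a hedged sketch of the same reshape with the newer function, the code below uses a two-country, two-year slice of the data built by hand; the names wide and long are made up for illustration.

```r
library(tidyr)

# A small slice of the wide data, typed in by hand
wide <- data.frame(
  country = c("China", "Mexico"),
  `2000` = c(1.78, 5.08),
  `2001` = c(2.64, 7.04),
  check.names = FALSE  # keep "2000" and "2001" as column names
)

# Every column except country becomes a (year, usage) pair
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "usage")

# Column names always arrive as text; convert year to a number for plotting
long$year <- as.integer(long$year)

long
```

One visible difference: pivot_longer() returns the rows country by country (China 2000, China 2001, Mexico 2000, ...), while gather() stacks them year by year. The values are the same.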

View the data

country	year	usage
China	2000	1.78
Mexico	2000	5.08
Panama	2000	6.55
Senegal	2000	0.40
Singapore	2000	36.00

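Option to write the data back to a new file.

Use write_csv(x, path), where x is the data frame to save and path is the output file, for example write_csv(tidy_internet_readr, "worldtidy.csv").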

You can reimport it

library(readr)
internet_readr <- read_csv("worldtidy.csv")
head(internet_readr, 5)

## # A tibble: 5 x 3
##   country    year usage
##   <chr>     <int> <dbl>
## 1 China      2000  1.78
## 2 Mexico     2000  5.08
## 3 Panama     2000  6.55
## 4 Senegal    2000  0.4 
## 5 Singapore  2000 36

4) Understand - Visualize

Let's create a time series line graph.
Use the ggplot2 package.

ggplot(data, aes(x,y,color, group)) + geom_line() + ... + ...

Build the chart

library(ggplot2)
# assign the ggplot call to a variable
line01 <- ggplot(tidy_internet_readr,
                 aes(x=year, y=usage, color=country,
                     group=country)) +
  geom_line()

View the chart

line01

Adding aesthetics: Labels

Labels: +labs(title="", subtitle="",x="", y="", caption="")
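As a sketch of how the labs() layer could be added to the line chart, the example below rebuilds a small long-format data frame by hand (tidy_usage, with two countries and three years from the table) so it runs on its own. The label text is placeholder wording, not taken from the original slides.

```r
library(ggplot2)

# A small long-format slice of the data, typed in by hand
tidy_usage <- data.frame(
  country = rep(c("China", "Mexico"), times = 3),
  year    = rep(c(2000L, 2006L, 2012L), each = 2),
  usage   = c(1.78, 5.08, 10.52, 19.52, 42.30, 38.42)
)

# Base line chart, one line per country
line01 <- ggplot(tidy_usage,
                 aes(x = year, y = usage, color = country, group = country)) +
  geom_line()

# Add labels as another layer; the text here is illustrative only
line02 <- line01 +
  labs(title = "Internet usage by country",
       subtitle = "2000 to 2012",
       x = "Year",
       y = "Internet usage",
       caption = "Source: world_internet_usage.csv")

line02
```

Because the chart is stored in a variable, each new layer can be added to the previous object without retyping the whole ggplot() call.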