8/20/2020

I. BEGINNING DATA PROJECTS

When working with a new data set, what initial questions do you have?

  • What does your data represent in the real world?
  • How is this real-world phenomenon characterized by the data that you have?
  • From what time period is the data?
  • What's the source?

Basic understanding

  • Once you have this basic understanding of your data you can dig deeper.
  • You can use visualization techniques to explore your data and derive some basic understandings of the phenomena you are studying, such as the largest and smallest values for each variable.
  • Calculating summary statistics can translate the data into information by revealing the shape of the data, the mean, median, minimum value, maximum value, and variability.
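
As a minimal sketch of the summary statistics mentioned above, the base R snippet below uses the 2012 values that appear later in these notes. The vector name usage_2012 is only illustrative.

```r
# 2012 internet usage values for the seven countries in the sample data set
usage_2012 <- c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03)

summary(usage_2012)   # minimum, quartiles, median, mean, maximum
sd(usage_2012)        # standard deviation, one measure of variability
range(usage_2012)     # smallest and largest values
```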

The process

For any data science project there are a few simple steps to follow.

II. APPLICATION SAMPLE PROJECT. WORLD INTERNET USAGE

world_internet_usage.csv

1. Set up your workspace

  • Begin by creating a new folder that contains your data.
  • Then create a new project in RStudio.
  • Set your working directory to the folder you created above.
  • Create a new RMarkdown document.
  • Save it in your working directory.

2. Create a new RMarkdown document

  • Go to File > New File > R Markdown
  • Save the file in your working directory
  • Name it internet.Rmd

Using R markdown to write your programs

R Markdown Example

---
title: "Hello R Markdown"
author: "Kristen Sosulski"
date: "2020-08-17"
output: html_document
---

This is a paragraph in an R Markdown document.
Below is a sample code chunk:

#{r chunkname, echo=TRUE, eval=TRUE}

myformula <- (2+2)

R Markdown Components

  • metadata - YAML header
  • text - Text outside of code chunks
  • code - R code chunks

R Markdown YAML header

YAML stands for "YAML Ain't Markup Language".

#title: "Hello R Markdown"
#author: "Kristen Sosulski"
#date: "2020-08-20"
#output: html_document

R Markdown YAML common output options

Option                  Creates
html_document           HTML
pdf_document            PDF (requires TeX)
word_document           Microsoft Word (.docx)
github_document         GitHub-compatible Markdown
ioslides_presentation   ioslides HTML slides

R Markdown common text options

  • # Header 1
  • ## Header 2
  • ### Header 3
  • **Bold**
  • _Italics_

R Markdown code chunk options

Option    Default
eval      TRUE
echo      TRUE
warning   TRUE
error     FALSE
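
Chunk options can also be set once for a whole document rather than chunk by chunk. A minimal sketch, assuming the knitr package is installed. The call would normally sit in a setup chunk near the top of the .Rmd file, and the option values here are only examples.

```r
library(knitr)

# Set chunk options once for every chunk that follows
opts_chunk$set(echo = TRUE, warning = FALSE, fig.height = 3, fig.width = 4)

# Inspect the value now in effect
opts_chunk$get("warning")
```

Options written in an individual chunk header still override these document-wide settings for that chunk.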

echo=TRUE and echo=FALSE

echo=TRUE

2+2
## [1] 4

echo=FALSE

## [1] 4

fig.height=3, fig.width=4

plot(attitude)

See R Markdown cheatsheet

3. Import your data

We have a couple of different options.

  • read_csv() function from the readr library
  • read.csv() function from the utils library.

readr versus base R

  • readr functions are typically faster (~10x).
  • They produce tibbles and don’t convert character vectors to factors, use row names, or munge the column names.
  • Base R functions inherit some behavior from your OS and environment variables, so import code that works on your computer might not work on someone else’s.
  • Column names that are numbers are kept as they are by readr (2002 stays `2002`), whereas base R prefixes them with X (X2002).
  • Column names that contain spaces (bad practice) are kept by readr (`Country Name`), whereas base R replaces the space with a period (Country.Name).
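
The column-name difference can be checked on a tiny file. This is a sketch that assumes readr is installed. The two-row file is written to a temporary location only so the example is self-contained, and the object names are illustrative.

```r
library(readr)

# Write a tiny two-column file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country Name,2002", "China,4.6", "Mexico,11.9"), tmp)

base_df  <- read.csv(tmp)                      # base R (utils)
readr_df <- suppressMessages(read_csv(tmp))    # readr; column-spec message silenced

names(base_df)    # "Country.Name" "X2002"
names(readr_df)   # "Country Name" "2002"
```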

read.csv()

internet_baser <- read.csv("world_internet_usage.csv")
head(internet_baser, 5)
##     country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009
## 1     China  1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00 22.60 28.90
## 2    Mexico  5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81 21.71 26.34
## 3    Panama  6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29 33.82 39.08
## 4   Senegal  0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70 10.60 14.50
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90 69.00 69.00
##   X2010 X2011 X2012
## 1 34.30 38.30 42.30
## 2 31.05 34.96 38.42
## 3 40.10 42.70 45.20
## 4 16.00 17.50 19.20
## 5 71.00 71.00 74.18

read_csv()

library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
head(internet_readr, 5)
## # A tibble: 5 x 14
##   country `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China     1.78   2.64   4.6    6.2    7.3    8.52  10.5    16     22.6
## 2 Mexico    5.08   7.04  11.9   12.9   14.1   17.2   19.5    20.8   21.7
## 3 Panama    6.55   7.27   8.52   9.99  11.1   11.5   17.4    22.3   33.8
## 4 Senegal   0.4    0.98   1.01   2.1    4.39   4.79   5.61    7.7   10.6
## 5 Singap…  36     41.7   47     53.8   62     61     59      69.9   69  
## # … with 4 more variables: `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>

Use read_csv()

As a best practice, we're going to use the read_csv() function from the readr package to import the world_internet_usage.csv data as a tibble.

class(internet_readr)
## [1] "tbl_df"     "tbl"        "data.frame"

4. Prepare or tidy your data

Take a look at your data.

knitr::kable(head(internet_readr, 10))
country               2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
China                 1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00 22.60 28.90 34.30 38.30 42.30
Mexico                5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81 21.71 26.34 31.05 34.96 38.42
Panama                6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29 33.82 39.08 40.10 42.70 45.20
Senegal               0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70 10.60 14.50 16.00 17.50 19.20
Singapore            36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90 69.00 69.00 71.00 71.00 74.18
United Arab Emirates 23.63 26.27 28.32 29.48 30.13 40.00 52.00 61.00 63.00 64.00 68.00 78.00 85.00
United States        43.08 49.08 58.79 61.70 64.76 67.97 68.93 75.00 74.00 71.00 74.00 77.86 81.03

5. Visualize your data

How could you visualize this to better understand it?

A histogram?

  • What function can we use?

Building a histogram

hist(internet_readr$`2012`)

Adding aesthetics

hist(internet_readr$`2012`, breaks=8, 
     main="Internet usage for 2012", 
     col="magenta", xlab=" ", labels=TRUE)

Maybe too many breaks…

hist(internet_readr$`2012`, breaks=4, 
     main="Internet usage for 2012", 
     col="magenta", xlab=" ", labels=TRUE)
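
When breaks is a single number, hist() treats it as a suggestion and chooses "pretty" bin edges, so the number of bars drawn may differ from the number requested. A small base R sketch to inspect what was actually chosen, using the 2012 values from the table above (the object names are illustrative):

```r
usage_2012 <- c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03)

# plot = FALSE returns the histogram object without drawing it
h8 <- hist(usage_2012, breaks = 8, plot = FALSE)
h4 <- hist(usage_2012, breaks = 4, plot = FALSE)

h8$breaks   # bin edges R actually chose when asked for about 8 bins
h4$breaks   # bin edges R actually chose when asked for about 4 bins
h4$counts   # number of countries falling in each bin
```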

A histogram for every year using par(mfrow=c(6,3))

par(mfrow=c(6,3))
hist(internet_readr$`2000`)
hist(internet_readr$`2001`)
hist(internet_readr$`2002`)
hist(internet_readr$`2003`)
hist(internet_readr$`2004`)
hist(internet_readr$`2005`)
hist(internet_readr$`2006`)
hist(internet_readr$`2007`)
hist(internet_readr$`2008`)
hist(internet_readr$`2009`)
hist(internet_readr$`2010`)
hist(internet_readr$`2011`)
hist(internet_readr$`2012`)
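
The repeated hist() calls can also be written as a loop over the column names. A base R sketch, assuming a small illustrative data frame called internet that holds three of the year columns from the table above:

```r
# Three of the year columns from the sample data, for the seven countries
internet <- data.frame(
  `2000` = c(1.78, 5.08, 6.55, 0.40, 36.00, 23.63, 43.08),
  `2006` = c(10.52, 19.52, 17.35, 5.61, 59.00, 52.00, 68.93),
  `2012` = c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03),
  check.names = FALSE   # keep the numeric column names as they are
)

old_par <- par(mfrow = c(1, 3))        # one row, three panels
for (yr in names(internet)) {
  # [[yr]] selects a column by its name stored in a string
  hist(internet[[yr]], main = yr, xlab = "usage")
}
par(old_par)                           # restore the previous layout
```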

Histogram matrix

EXERCISE 01 - COMPLETE


#Redo the histogram matrix 
#and add aesthetics

Boxplots

boxplot(internet_readr$`2012`, 
        main="Internet usage for 2012", 
        col="magenta", 
        xlab=paste("The median is:", median(internet_readr$`2012`)),
        frame.plot=FALSE, horizontal=TRUE, 
        border="dark blue")
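
The numbers behind a boxplot can be inspected directly. A base R sketch using the 2012 values from the table above (usage_2012 and box_values are illustrative names):

```r
usage_2012 <- c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03)

# Tukey's five numbers: minimum, lower hinge, median, upper hinge, maximum
fivenum(usage_2012)

# boxplot.stats() returns the values boxplot() draws, plus any outliers
box_values <- boxplot.stats(usage_2012)
box_values$stats   # whisker ends, hinges, and median
box_values$out     # points that would be plotted individually as outliers
```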

Multiple boxplots

EXERCISE 02 - COMPLETE


#Build box plots for 2000-2012 
#based on the example above using par(mfrow=c(INSERT NUMBER OF ROWS, NUMBER OF COLUMNS)) 
#Include aesthetics.

How else might you want to visualize this data to better understand it?

How would you create a bar or line graph using ggplot?

  • Technically, we would use geom_col(), geom_bar() or geom_line().
  • However, what variable would you map for the x-axis and the y-axis?

For example…

# The + must end the line; a line that starts with + is parsed as a separate expression
ggplot(internet_readr, aes(XVALUE, YVALUE)) +
  geom_col()

Let's think about how we would use the ggplot() function.

# X and Y are placeholders for the variables to map
ggplot(internet_readr, aes(X, Y)) +
  geom_col()
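
In the wide layout, one mapping that does work is country on the x-axis and a single year column on the y-axis. A sketch assuming ggplot2 is installed. The data frame wide is an illustrative three-country subset of the table above.

```r
library(ggplot2)

wide <- data.frame(
  country = c("China", "Mexico", "Panama"),
  `2012`  = c(42.30, 38.42, 45.20),
  check.names = FALSE   # keep "2012" as the column name
)

# Backticks are needed because 2012 is not a syntactic R name
bar_2012 <- ggplot(wide, aes(x = country, y = `2012`)) +
  geom_col()
bar_2012
```

This shows only one year at a time. There is no single column holding the year or the usage value, so a chart across all years has nothing to map to x and y.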

III. DATA TRANSFORMATION & ADVANCED VISUALIZATION

Wide to long formats

We need to reshape our data from wide to long.

## [1] "Wide format"
country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
China 1.78 2.64 4.6 6.2 7.3 8.52 10.52 16.00 22.60 28.90 34.30 38.30 42.30
Mexico 5.08 7.04 11.9 12.9 14.1 17.21 19.52 20.81 21.71 26.34 31.05 34.96 38.42
## [1] "Long format"
country year usage
China 2000 1.78
Mexico 2000 5.08

Using the gather() function to reshape.

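  • Use the gather() function from the tidyr package to reshape a tibble from wide to long form.
  • Note the use of the pipe %>%, which passes the left-hand side of the operator to the first argument of the function on the right-hand side.

library(tidyr)  # provides gather() and re-exports the %>% pipe
# Collapse the year columns into two columns: year (the old column name) and usage (its value)
tidy_internet_readr <- 
  internet_readr %>%
  gather(`2000`:`2012`, key="year", 
         value="usage")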
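
gather() still works, but the tidyr documentation marks it as superseded by pivot_longer() (tidyr 1.0.0 and later). A sketch of the equivalent reshape, assuming a recent tidyr is installed. The data frame wide is an illustrative two-country, two-year subset.

```r
library(tidyr)

wide <- data.frame(
  country = c("China", "Mexico"),
  `2000`  = c(1.78, 5.08),
  `2001`  = c(2.64, 7.04),
  check.names = FALSE   # keep the numeric column names
)

# Every year column becomes a (year, usage) pair, one row per country and year
long <- wide %>%
  pivot_longer(cols = `2000`:`2001`, names_to = "year", values_to = "usage")
long
```

The result has the same columns as the gather() output, but the rows are ordered country by country rather than year by year.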

View the data

country year usage
China 2000 1.78
Mexico 2000 5.08
Panama 2000 6.55
Senegal 2000 0.40
Singapore 2000 36.00

Option to write the data back to a new file.

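  • Use write_csv(x, path), where x is the data frame to save and path is the output file, for example write_csv(tidy_internet_readr, "worldtidy.csv").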
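
A minimal round-trip sketch, assuming readr is installed. The small data frame tidy and the temporary output location are illustrative. In a real project the file would go in the working directory.

```r
library(readr)

tidy <- data.frame(
  country = c("China", "Mexico"),
  year    = c(2000L, 2000L),
  usage   = c(1.78, 5.08)
)

out_file <- file.path(tempdir(), "worldtidy.csv")   # temporary location for the demo
write_csv(tidy, out_file)                           # data frame first, then the file

reread <- suppressMessages(read_csv(out_file))      # column-spec message silenced
reread
```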

You can reimport it

library(readr)
internet_readr <- read_csv("worldtidy.csv")
head(internet_readr, 5)
## # A tibble: 5 x 3
##   country    year usage
##   <chr>     <int> <dbl>
## 1 China      2000  1.78
## 2 Mexico     2000  5.08
## 3 Panama     2000  6.55
## 4 Senegal    2000  0.4 
## 5 Singapore  2000 36

4) Understand - Visualize

  • Let's create a time series line graph.
  • Use the ggplot2 package

ggplot(data, aes(x,y,color, group)) + geom_line() + ... + ...

Build the chart

library(ggplot2)
# assign the ggplot call to a variable
line01 <- ggplot(tidy_internet_readr,
        aes(x=year,y=usage,color=country,
            group=country)) + geom_line()

View the chart

line01

Adding aesthetics: Labels

  • Labels: +labs(title="", subtitle="",x="", y="", caption="")
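
A sketch of labs() applied to a small line chart, assuming ggplot2 is installed. The data frame long is an illustrative two-country subset in long form, and the label text is a placeholder rather than the wording used in the original chart.

```r
library(ggplot2)

long <- data.frame(
  country = rep(c("China", "Mexico"), times = 2),
  year    = rep(c(2000, 2001), each = 2),
  usage   = c(1.78, 5.08, 2.64, 7.04)
)

line_demo <- ggplot(long, aes(x = year, y = usage,
                              color = country, group = country)) +
  geom_line() +
  labs(title = "Internet usage by country",       # placeholder label text
       subtitle = "2000-2001",
       x = "Year", y = "Usage",
       caption = "Source: world_internet_usage.csv")
line_demo
```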