2020

#Foundations of Statistics Using R

New to programming?

Prior knowledge of programming

Learning programming

  • A new way of thinking, not just a new skill

  • A new language for speaking and reading (vectors, data frames, functions, objects, etc.)

  • A new syntax for writing c(), print(), cat(), sort(), require(), subset()

New to R

Prior knowledge of R

Breakout / Warm up: Talk with your partner and share the following:

  • Identify two major takeaways from the pre-module.
  • Describe what you do when things go wrong for you in R.

Outcomes

  • Review what you’ve learned in lessons 1 through 6.
  • Introduce you to techniques and approaches to programming in R.
  • Give you an opportunity to practice and solve problems.

#I. PRE MODULE REVIEW

Jeopardy!!!

#II. SETUP

##Setting up your R world

  • Download the rmsba.zip file and unzip it.
  • Rename the unzipped folder to rmsba and move it to a location you can find.
  • Create a new project in RStudio named rmsba
  • Set your working directory to the rmsba folder you created above.

#III. BEGINNING DATA PROJECTS

When working with new data, what initial questions do you have?

  • What does your data represent in the real world?
  • How is this real-world phenomenon characterized by the data that you have?
  • From what time period is the data?
  • What’s the source?

Basic understanding

  • Once you have this basic understanding of your data you can dig deeper.
  • You can use visualization techniques to explore your data and derive some basic understandings of the phenomena you are studying, such as the largest and smallest values for each variable.
  • Calculating summary statistics can translate the data into information by revealing the shape of the data, the mean, median, minimum value, maximum value, and variability.
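
As a minimal sketch of these summary statistics, the code below assumes R's built-in attitude data set (used again later in this module) stands in for your own data.

```r
# attitude ships with base R (package datasets); rating is one numeric column
ratings <- attitude$rating

summary(ratings)   # minimum, quartiles, median, mean, maximum
sd(ratings)        # standard deviation, one measure of variability
range(ratings)     # smallest and largest values

# A quick visual check of the shape of the distribution
hist(ratings, main = "Distribution of rating", xlab = "rating")
```

summary() covers the center and the extremes in one call, while hist() shows the shape that the numbers alone can hide.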

The process

For any data science project there are a few simple steps to follow.

#IV. APPLICATION SAMPLE PROJECT. WORLD INTERNET USAGE

world_internet_usage.csv

1. Set up your workspace (generic steps, you already did this…)

  • Begin by creating a new folder that contains your data.
  • Then create a new project in RStudio.
  • Set your working directory to the folder you created above.
  • Create a new RMarkdown document.
  • Save it in your working directory.

2. Create a new RMarkdown document

  • Go to File > New File > R Markdown
  • Save the file in your working directory
  • Name it internet.Rmd

##Using R markdown to write your programs

Planning your programs, presenting your code, and sharing your work.

  • Create .Rmd
  • Write text
  • Embed code
  • Render output

Resource: https://bookdown.org/yihui/rmarkdown/

##R Markdown Example

---
title: "Hello R Markdown"
author: "Kristen Sosulski"
date: "2020-08-17"
output: html_document
---

This is a paragraph in an R Markdown document.
Below is a sample code chunk:

#{r chunkname, echo=TRUE, eval=TRUE}

myformula <- (2+2)

##R Markdown Components

  • metadata - YAML header
  • text - Text outside of code chunks
  • code - R code chunks
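
To see the three components side by side, here is a minimal sketch that assembles a small .Rmd file from R. The file name hello.Rmd and the temporary folder are assumptions for illustration. The chunk fence is built with strrep() only so that this example can itself be shown inside a code block.

```r
# Three backticks, built programmatically
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                                  # metadata: YAML header starts
  'title: "Hello R Markdown"',
  "output: html_document",
  "---",                                                  # metadata: YAML header ends
  "",
  "This is a paragraph in an R Markdown document.",       # text
  "",
  paste0(fence, "{r chunkname, echo=TRUE, eval=TRUE}"),   # code chunk opens
  "myformula <- (2 + 2)",
  "myformula",
  fence                                                   # code chunk closes
)

rmd_file <- file.path(tempdir(), "hello.Rmd")   # hypothetical file name
writeLines(rmd_lines, rmd_file)

# Rendering needs the rmarkdown package and pandoc, so it is only attempted if both are present
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In RStudio you would normally type these lines straight into the .Rmd file and press Knit. Writing them from R here just makes the structure explicit.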

##R Markdown YAML header

YAML Ain’t Markup Language

---
title: "Hello R Markdown"
author: "Kristen Sosulski"
date: "2020-08-14"
output: html_document
---

##R Markdown YAML common output options

| Option | Creates |
|---|---|
| html_document | HTML |
| pdf_document | PDF (requires TeX) |
| word_document | Microsoft Word (.docx) |
| github_document | GitHub-compatible Markdown |
| ioslides_presentation | ioslides HTML slides |

##R Markdown common text options

  • # Header 1
  • ## Header 2
  • ### Header 3
  • **Bold**
  • _Italics_

##R Markdown rchunks

| Option | Default |
|---|---|
| eval | TRUE |
| echo | TRUE |
| warning | TRUE |
| error | FALSE |

echo=TRUE and echo=FALSE

3. Import your data

We have a couple of different options.

  • read_csv() function from the readr package
  • read.csv() function from the utils package (part of base R).

##readr versus Base R

  • readr functions are typically faster (~10x).
  • They produce tibbles, and they don’t convert character vectors to factors, use row names, or munge the column names.
  • Base R functions inherit some behavior from your OS and environment variables, so import code that works on your computer might not work on someone else’s.
  • Column names that are numbers: readr keeps 2002 as 2002, while base R converts it to X2002.
  • Column names that contain spaces (bad practice): readr keeps Country Name as “Country Name”, while base R converts it to Country.Name.
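
The column-name behavior can be checked without any file. This sketch uses only base R and supplies two rows of the internet usage data as inline text, which is an assumption made to keep the example self-contained. One related point: since R 4.0.0, read.csv() no longer converts strings to factors by default, so that difference applies mainly to older R versions.

```r
# Two rows of the internet usage data, supplied inline instead of from a file
csv_text <- "country,2000,2001\nChina,1.78,2.64\nMexico,5.08,7.04"

# Default behaviour: names are made syntactically valid, so 2000 becomes X2000
munged <- read.csv(text = csv_text)
names(munged)

# check.names = FALSE keeps the headers exactly as written
kept <- read.csv(text = csv_text, check.names = FALSE)
names(kept)

# A name that is not syntactically valid needs backticks when used with $
kept$`2000`
```

This is why the readr examples in this module write internet_readr$`2012` with backticks.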

##read.csv()

internet_baser <- read.csv("world_internet_usage.csv")
head(internet_baser, 5)
##     country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010
## 1     China  1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00 22.60 28.90 34.30
## 2    Mexico  5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81 21.71 26.34 31.05
## 3    Panama  6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29 33.82 39.08 40.10
## 4   Senegal  0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70 10.60 14.50 16.00
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90 69.00 69.00 71.00
##   X2011 X2012
## 1 38.30 42.30
## 2 34.96 38.42
## 3 42.70 45.20
## 4 17.50 19.20
## 5 71.00 74.18

##read_csv()

library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
head(internet_readr, 5)
## # A tibble: 5 x 14
##   country `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China     1.78   2.64   4.6    6.2    7.3    8.52  10.5    16     22.6   28.9
## 2 Mexico    5.08   7.04  11.9   12.9   14.1   17.2   19.5    20.8   21.7   26.3
## 3 Panama    6.55   7.27   8.52   9.99  11.1   11.5   17.4    22.3   33.8   39.1
## 4 Senegal   0.4    0.98   1.01   2.1    4.39   4.79   5.61    7.7   10.6   14.5
## 5 Singap~  36     41.7   47     53.8   62     61     59      69.9   69     69  
## # ... with 3 more variables: `2010` <dbl>, `2011` <dbl>, `2012` <dbl>

##use read_csv()

As a best practice, we’re going to use the read_csv() function from the readr package to import the world_internet_usage.csv data as a tibble.

class(internet_readr)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

4. Prepare or tidy your data

Take a look at your data.

knitr::kable(head(internet_readr, 10))
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 1.78 | 2.64 | 4.60 | 6.20 | 7.30 | 8.52 | 10.52 | 16.00 | 22.60 | 28.90 | 34.30 | 38.30 | 42.30 |
| Mexico | 5.08 | 7.04 | 11.90 | 12.90 | 14.10 | 17.21 | 19.52 | 20.81 | 21.71 | 26.34 | 31.05 | 34.96 | 38.42 |
| Panama | 6.55 | 7.27 | 8.52 | 9.99 | 11.14 | 11.48 | 17.35 | 22.29 | 33.82 | 39.08 | 40.10 | 42.70 | 45.20 |
| Senegal | 0.40 | 0.98 | 1.01 | 2.10 | 4.39 | 4.79 | 5.61 | 7.70 | 10.60 | 14.50 | 16.00 | 17.50 | 19.20 |
| Singapore | 36.00 | 41.67 | 47.00 | 53.84 | 62.00 | 61.00 | 59.00 | 69.90 | 69.00 | 69.00 | 71.00 | 71.00 | 74.18 |
| United Arab Emirates | 23.63 | 26.27 | 28.32 | 29.48 | 30.13 | 40.00 | 52.00 | 61.00 | 63.00 | 64.00 | 68.00 | 78.00 | 85.00 |
| United States | 43.08 | 49.08 | 58.79 | 61.70 | 64.76 | 67.97 | 68.93 | 75.00 | 74.00 | 71.00 | 74.00 | 77.86 | 81.03 |

5. Visualize your data

##How could you visualize this to better understand it?

##A histogram?

  • What function can we use?

##Building a histogram

hist(internet_readr$`2012`)

##Adding aesthetics

hist(internet_readr$`2012`, breaks=8, 
     main="Internet usage for 2012", 
     col="magenta", xlab=" ", labels=TRUE)

##Maybe too many breaks…

hist(internet_readr$`2012`, breaks=4, 
     main="Internet usage for 2012", 
     col="magenta", xlab=" ", labels=TRUE)

##A histogram for every year using par(mfrow=c(6,3))

par(mfrow=c(6,3))
hist(internet_readr$`2000`)
hist(internet_readr$`2001`)
hist(internet_readr$`2002`)
hist(internet_readr$`2003`)
hist(internet_readr$`2004`)
hist(internet_readr$`2005`)
hist(internet_readr$`2006`)
hist(internet_readr$`2007`)
hist(internet_readr$`2008`)
hist(internet_readr$`2009`)
hist(internet_readr$`2010`)
hist(internet_readr$`2011`)
hist(internet_readr$`2012`)
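
The repeated hist() calls can also be written as a loop over the year columns. The sketch below is one possible approach rather than the course's own solution. It assumes a small data frame named usage with three year columns transcribed from the table above, so it runs without the CSV file.

```r
# Three of the year columns, transcribed from the table above (not read from the CSV)
usage <- data.frame(
  country = c("China", "Mexico", "Panama", "Senegal", "Singapore",
              "United Arab Emirates", "United States"),
  `2000` = c(1.78, 5.08, 6.55, 0.40, 36.00, 23.63, 43.08),
  `2006` = c(10.52, 19.52, 17.35, 5.61, 59.00, 52.00, 68.93),
  `2012` = c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03),
  check.names = FALSE   # keep the numeric column names as they are
)

year_cols <- setdiff(names(usage), "country")

old_par <- par(mfrow = c(1, length(year_cols)))   # one row, one panel per year
for (yr in year_cols) {
  hist(usage[[yr]], main = paste("Internet usage,", yr), xlab = "Usage")
}
par(old_par)   # restore the previous plot layout
```

With the full data set, the same loop would cover all thirteen years once mfrow is set to a grid large enough to hold them.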

##Histogram matrix

#EXERCISE 01 - COMPLETE

#Redo the histogram matrix 
#and add aesthetics

##Boxplots

boxplot(internet_readr$`2012`, 
        main="Internet usage for 2012", 
        col="magenta", 
        xlab=paste("The median is:", median(internet_readr$`2012`)),
        frame.plot=FALSE, horizontal=TRUE, 
        border="dark blue")

##Multiple boxplots

#EXERCISE 02 - COMPLETE

#Build box plots for 2000-2012 
#based on the example above using par(mfrow=c(INSERT NUMBER OF ROWS, NUMBER OF COLUMNS)) 
#Include aesthetics.

How else might you want to visualize this data to better understand it?

How would you create a bar or line graph using ggplot?

  • Technically, we would use geom_col(), geom_bar() or geom_line().
  • However, what variable would you map for the x-axis and the y-axis?

For example…

ggplot(internet_readr, aes(XVALUE, YVALUE)) +
  geom_col()

##Let’s think about how we would use the ggplot() function.

ggplot(internet_readr, aes(X, Y)) +
  geom_col()

#V. DATA TRANSFORMATION & ADVANCED VISUALIZATION

Wide to long formats

##We need to reshape our data from wide to long.

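Wide format

| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 1.78 | 2.64 | 4.6 | 6.2 | 7.3 | 8.52 | 10.52 | 16.00 | 22.60 | 28.90 | 34.30 | 38.30 | 42.30 |
| Mexico | 5.08 | 7.04 | 11.9 | 12.9 | 14.1 | 17.21 | 19.52 | 20.81 | 21.71 | 26.34 | 31.05 | 34.96 | 38.42 |

Long format

| country | year | usage |
|---|---|---|
| China | 2000 | 1.78 |
| Mexico | 2000 | 5.08 |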

Using the gather() function to reshape.

  • Use the gather() function from the tidyr package to reshape a tibble from wide to long form.
  • Note the use of the pipe %>%, which passes the left-hand side of the operator to the first argument of the function on the right-hand side.

library(tidyr)
tidy_internet_readr <- 
  internet_readr %>%
  gather(`2000`:`2012`, key="year", 
         value="usage")
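
gather() still works, but tidyr now recommends pivot_longer() for new code. The sketch below shows an equivalent reshape, assuming a small data frame named wide built from two rows of the table above.

```r
library(tidyr)

# Two rows and three year columns transcribed from the wide table above
wide <- data.frame(
  country = c("China", "Mexico"),
  `2000` = c(1.78, 5.08),
  `2001` = c(2.64, 7.04),
  `2002` = c(4.60, 11.90),
  check.names = FALSE
)

# names_to and values_to play the roles of gather()'s key and value
long <- pivot_longer(wide, cols = -country, names_to = "year", values_to = "usage")
long
```

The result holds the same values as gather() but orders the rows by country rather than by year. In both cases year comes back as a character column.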

View the data

| country | year | usage |
|---|---|---|
| China | 2000 | 1.78 |
| Mexico | 2000 | 5.08 |
| Panama | 2000 | 6.55 |
| Senegal | 2000 | 0.40 |
| Singapore | 2000 | 36.00 |

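Option to write the data back to a new file.

  • Use write_csv(x, file), where x is the data frame and file is the output path, for example write_csv(tidy_internet_readr, "worldtidy.csv")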

You can reimport it

library(readr)
internet_readr <- read_csv("worldtidy.csv")
head(internet_readr, 5)
## # A tibble: 5 x 3
##   country    year usage
##   <chr>     <dbl> <dbl>
## 1 China      2000  1.78
## 2 Mexico     2000  5.08
## 3 Panama     2000  6.55
## 4 Senegal    2000  0.4 
## 5 Singapore  2000 36

##4) Understand - Visualize

  • Let’s create a time series line graph.
  • Use the ggplot2 package

ggplot(data, aes(x,y,color, group)) + geom_line() + ... + ...

Build the chart

library(ggplot2)
#assign the ggplot call to a variable
line01 <- ggplot(tidy_internet_readr,
        aes(x=year,y=usage,color=country,
            group=country)) + geom_line()

View the chart

line01

##Adding aesthetics: Labels

  • Labels: +labs(title="", subtitle="",x="", y="", caption="")

Refine your line chart

library(ggthemes)
library(ggplot2)
line02 <- ggplot(tidy_internet_readr,
                 aes(x=year, y=usage, color=country,
                     group=country)) + geom_line() + 
  labs(title = "Internet Usage per 100 people", 
       subtitle = "Since 2011, the UAE has surpassed Singapore and the US in internet users", 
       caption = "Source: World Bank (2013)",
       x = " ", y = "Usage")

Refined line chart

##Let’s create a bar chart with the same data

library(ggplot2)
library(ggthemes)  #provides theme_few()
#refer to columns by bare name inside aes();
#writing tidy_internet_readr$year there triggers a ggplot2 warning
bar01 <- ggplot(tidy_internet_readr,
        aes(x=year, y=usage))

bar01 <- bar01 + geom_col() + theme_few() +
  labs(title = "Internet Usage per 100 people", 
       x = "Year",y ="Usage", 
       caption="World Bank (2013)")

Let’s see it

bar01

#Refining aesthetics: Specifying parameters

##Adding a fill / Color

  • fill="#4cbea3", color="#4cbea3"

bar01a <- bar01 + geom_col(fill="#4cbea3", color="#4cbea3") + theme_few() +
  labs(title = "Internet Usage per 100 people", 
       x = " ",y ="Usage", caption="World Bank (2013)")

##Add a fill

Changing the font

+ theme(text=element_text(family="Avenir"))

bar01b <- bar01 + geom_col(fill="#4cbea3", color="#4cbea3") + theme_few() +
  labs(title = "Internet Usage per 100 people", 
       x = " ",y ="Usage", caption="World Bank (2013)") + theme(text=element_text(family="Avenir"))

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database

Removing chart junk

+ theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(color = "gray"),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank())

bar01c <- bar01 + geom_col(fill="#4cbea3", color="#4cbea3") + theme_few() +
  labs(title = "Internet Usage per 100 people", 
       x = " ",y ="Usage", 
       caption="World Bank (2013)") +
  theme(text=element_text(family="Avenir"), 
        panel.border = element_blank(), panel.grid.major =
          element_blank(),panel.grid.minor =
          element_blank(), 
        axis.line = element_line(color= "gray"),
        axis.ticks.x=element_blank(),
        axis.ticks.y=element_blank())

##Adding aesthetics: Themes

  • The ggthemes package provides additional themes
  • theme_classic(): white background, no grid lines

Types of themes

t1 <- bar01 + theme_classic()
t2 <- bar01 + theme_bw()
t3 <- bar01 + theme_minimal()
t4 <- bar01 + theme_economist()
t5 <- bar01 + theme_fivethirtyeight()
t6 <- bar01 + theme_hc()

Themes: theme_classic()

Themes: theme_bw()

Themes: theme_minimal()

Themes: theme_economist()

Themes: theme_fivethirtyeight()

Themes: theme_hc()

Use cowplot::plot_grid() for layout

Reference: https://wilkelab.org/cowplot/articles/plot_grid.html

cowplot::plot_grid(t1, t2, t3, t4, t5, t6,
  labels = c("Classic", "Black & White", "Minimal", "Economist", "538", "High Charts"),
  label_size = 12, label_x = 0, label_y = 0, hjust = -0.5, vjust = -0.5,
  label_fontfamily = "serif",
  label_fontface = "bold",
  label_colour = "#4cbea3")

##Saving your charts

  • ggsave("plot.png", width=5, height=5, units="in")
  • Saves in your working directory

#VI. WRITING CONDITIONAL STATEMENTS

Early on, we learned about different types of comparison and logical operators. For example, comparison operators include <, <=, ==, and so on. These operators return a logical value, TRUE or FALSE.

Let’s do a quick review of comparison operators.

3 > 4
c(1, 2, 3, 4, 5) > 4
c(1, 2, 3, 4, 6) == 3

3 > 4
## [1] FALSE

c(1, 2, 3, 4, 5) > 4
## [1] FALSE FALSE FALSE FALSE  TRUE

c(1, 2, 3, 4, 6) == 3 
## [1] FALSE FALSE  TRUE FALSE FALSE

There is also a set of logical operators.

These include:

  • ! (not)
  • & (and)
  • | (or)

Let’s do a quick review of logical operators.

Example 1

We assign the Boolean value TRUE to legal and then use the logical not (!) operator to negate legal, which returns FALSE.

legal<-TRUE

!legal
## [1] FALSE

Example 2

We assign the Boolean value TRUE to customer and FALSE to over50. Then we use the logical and (&) operator to evaluate both variables. If both are TRUE, the result is TRUE. If only one or neither is TRUE, the result is FALSE.

customer <-TRUE
over50 <-FALSE

customer & over50
## [1] FALSE

##Example 3

Just as in the previous example, we assign TRUE to customer and FALSE to over50. Then we use the logical or (|) operator to evaluate both variables. If at least one is TRUE, the result is TRUE. If neither is TRUE, the result is FALSE.

customer <-TRUE
over50 <-FALSE
customer | over50
## [1] TRUE
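
These operators also work element by element on vectors, which is how they are usually applied to data. A minimal sketch, assuming made-up ages and customer flags:

```r
# Hypothetical data: ages and whether each person is a customer
age      <- c(34, 61, 52, 19)
customer <- c(TRUE, TRUE, FALSE, TRUE)

over50 <- age > 50        # a comparison returns one logical value per element
customer & over50         # TRUE only where both are TRUE
customer | over50         # TRUE where at least one is TRUE
!customer                 # flips each value

sum(customer & over50)    # TRUE counts as 1, so this counts the matches
```

Here only the second person is both a customer and over 50, so the final sum is 1.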

##Conditional statements

  • A conditional statement evaluates a condition to see if it is TRUE. Conditional statements may make use of comparison and logical operators.

  • A type of conditional statement used to evaluate if a condition is TRUE is called an if statement.

  • Let’s look at an example of the structure of an if statement:

  • The if is followed by a set of parentheses and inside is the condition being evaluated for truth. If it is true, then we may specify an action such as update a variable. If the condition is FALSE no action occurs.
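
The structure described above can be sketched as follows. The names in_stock and reorder_needed are hypothetical and chosen only for illustration.

```r
# General shape:
#   if (condition) {
#     action
#   }

in_stock <- 3             # hypothetical variable
reorder_needed <- FALSE

if (in_stock < 5) {
  reorder_needed <- TRUE  # runs only because the condition is TRUE
}

reorder_needed
```

If in_stock were 5 or more, the condition would be FALSE and reorder_needed would keep its initial value.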

Example 1

Let’s construct a real example that evaluates the value assigned to a variable price. The condition is price < 10. This condition will evaluate to FALSE. Therefore nothing is printed.

price <- 15.99
if (price < 10) {
  print("This is an excellent deal!")
}

##Example 2.

Let’s change Example 1. We’ll change the value of price to 9.99. In this case, "This is an excellent deal!" will be printed to the screen. In other words, the code between the opening and closing curly brackets will run.

price <- 9.99

if (price < 10) {
  print("This is an excellent deal!")
}
## [1] "This is an excellent deal!"

##Conditional statements using if/else logic

You can add an else clause to an if statement. If the test condition is not met, that is, it evaluates to FALSE, the else code will run.

The basic structure is:

if (test_expression) {
  statement1
} else {
  statement2
}

##Example 3

Let’s build on Ex. 1 where price is set to 15.99 and add an else statement. You’ll notice that the print statement following the if is ignored and the print statement within the else clause has run.

price <- 15.99
if (price < 10) {
  print("This is an excellent deal!")
} else {
  print("This product is too expensive")
}
## [1] "This product is too expensive"

Take note of the curly brackets in this example. The placement is important. The if and else both have an opening and closing brace. The else keyword must appear on the same line as the closing curly bracket of the if block.

##Nested if...else if statements

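  • The nested if...else if statement allows you to execute one block of code among more than two alternatives.

  • The syntax of the if...else if statement is:

if (test_expression1) {
  statement1
} else if (test_expression2) {
  statement2
} else {
  statement3
}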

Example 4

Let’s modify Example 3
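
A minimal sketch of the if...else if form, building on Example 3. The thresholds of 10 and 20 and the middle message are illustrative assumptions, not the original example.

```r
price <- 15.99

# The cut-offs 10 and 20 are assumed for illustration
if (price < 10) {
  verdict <- "This is an excellent deal!"
} else if (price < 20) {
  verdict <- "This is a fair price"
} else {
  verdict <- "This product is too expensive"
}

print(verdict)
```

The conditions are checked from top to bottom and only the first branch whose condition is TRUE runs, so a price of 15.99 lands in the middle branch.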

##Try it.

Test your program out by changing the value of price to ensure it evaluates the condition as you have planned.

R ifelse() function

The ifelse() function is a vectorized shorthand for the traditional if…else statement.

Syntax of ifelse() function

ifelse(test_expression, x, y)

Here, test_expression must be a logical vector (or an object that can be coerced to logical). The return value is a vector with the same length as test_expression.

This returned vector has an element from x if the corresponding value of test_expression is TRUE, or from y if the corresponding value of test_expression is FALSE.

Example 5

In the example below, the ifelse() function evaluates the vector FALSE FALSE TRUE FALSE that resulted from the expression: a %% 2 == 0

a <- c(5,7,2,9)
ifelse(a %% 2 == 0,"even","odd")
## [1] "odd"  "odd"  "even" "odd"

Learn more at: https://www.datamentor.io/r-programming/ifelse-function/

Example 6

Let’s say you want to evaluate a vector for NA values and print output to the screen based on whether there is an NA value or not. We can use the ifelse() function along with the is.na() function.

a <- c(NA,7,2,9)
ifelse(is.na(a),"NA","Not NA")
## [1] "NA"     "Not NA" "Not NA" "Not NA"

#VII. ITERATION

Creating loops

  • Loops are used in programming to repeat a specific block of code. Two common looping control structures in R are the while loop and the for loop.
  • The while loop is used when you want to execute some code some number (possibly an unknown number) of times.

Syntax of while loop

while (test_expression)
{
  statement
}

Here, test_expression is evaluated and the body of the loop is entered if the result is TRUE.

Structure of a while loop

while loop

  • The statements inside the loop are executed and the flow returns to evaluate the test_expression again.

  • This is repeated each time until test_expression evaluates to FALSE, in which case, the loop exits.

Let’s look at a few examples:

Example 1

In the example below, how many times will the loop iterate?

x <- 10
while (x > 0) {
 print(x)
 x <- x - 1 
} 

Run the code to see it for yourself.

You’ll note that the loop iterates ten times. x is initialized to 10 and decreases by 1 on each pass until x = 0, at which point x > 0 evaluates to FALSE and the loop exits.

Result

x <- 10
while (x > 0) {
 print(x)
 x <- x - 1 
} 
## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1

Example 2

  • In this example, counter is initialized to zero. The test expression in the while statement checks whether counter is less than 9; the loop body prints the value of counter and then increments it by 1.

  • How many times will this loop iterate?

counter <- 0
while (counter < 9) {
  print(counter)
  counter <- counter + 1
}

  • The loop will iterate 9 times, printing the sequence 0 through 8.

What if we initialize counter to 10 and change the test condition to while (counter > 9)? How many times would the loop iterate?

  • This would create an infinite loop. That is, a loop that executes in theory forever (or until you interrupt it or power down your machine).

  • It is important to examine the relationship between the test expression and the control variable, which in this case is counter.
  • In this example, counter is initialized to 10 and the test expression counter > 9 will always evaluate to TRUE, since 10 is greater than 9 and the loop body adds 1 to counter rather than subtracting from it. Subtracting 1 instead would make the test condition evaluate to FALSE after a single iteration. (If counter were still initialized to 0, the condition would be FALSE immediately and the loop body would never run.)
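
While experimenting, one way to explore a loop like this safely is to add a second stopping condition. The cap below, max_iterations, is a hypothetical safeguard and not part of the original example.

```r
counter <- 10
iterations <- 0
max_iterations <- 5   # hypothetical safety cap

# counter only grows, so counter > 9 on its own would never become FALSE
while (counter > 9 && iterations < max_iterations) {
  counter <- counter + 1
  iterations <- iterations + 1
}

iterations   # the cap, not the original condition, ended the loop
counter
```

The loop stops after five passes with counter at 15, which confirms that the original condition was never going to end it.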

##Try it - Use the attitude data set

#pseudo code

while (there are still variables to evaluate)
{
  #take the mean
  #print the mean
  #move to the next column
} 

Solution

column <- ncol(attitude)
while (column > 0){
  print(paste(names(attitude[column]),
              ":",mean(attitude[,column])))
  column <- column - 1
}
## [1] "advance : 42.9333333333333"
## [1] "critical : 74.7666666666667"
## [1] "raises : 64.6333333333333"
## [1] "learning : 56.3666666666667"
## [1] "privileges : 53.1333333333333"
## [1] "complaints : 66.6"
## [1] "rating : 64.6333333333333"

The for loop

Another type of loop used in R is called a for loop. A for loop is used to iterate over a vector, such as a column in a data frame.

The syntax is as follows:

for (value in sequence)
{
  statement
}

Here, sequence is a vector and value takes on each of the values in the sequence. During each iteration, statement is evaluated.

The for loop: specifies the number of iterations.

In this code, the iterator (in this case, i) takes on the values in the vector c(1,1,3,4) sequentially through each “loop” of the code that is between the brackets, in this case print(i).

for (i in c(1,1,3,4)){
    print(i)
}
## [1] 1
## [1] 1
## [1] 3
## [1] 4

Loops and conditional statements using if/else logic

Let’s build a program that checks to see which prices are considered “cheap” (less than $10) based on the following vector.

prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)

##You can approach it the following way:

First, create the prices vector. Second, initialize a variable called numCheap. Third, create the for loop to traverse the prices vector. Fourth, check whether p is less than 10 and, if it is, add 1 to numCheap. Fifth, when the loop exits, print out the number of inexpensive items.

prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)
numCheap <- 0
for (p in prices){
    if (p < 10){
        numCheap <- numCheap + 1
    }
}  
print(numCheap)
## [1] 3
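
Because comparison operators work on whole vectors, the same count can be obtained without a loop. This is a sketch of an alternative, with is_cheap as an assumed helper name.

```r
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)

is_cheap <- prices < 10     # one TRUE or FALSE per price
numCheap <- sum(is_cheap)   # each TRUE counts as 1
numCheap

prices[is_cheap]            # the cheap prices themselves
```

The loop version makes each step visible, which helps while learning. The vectorized version is shorter and is the more common style in everyday R code.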

#VIII. APPLICATION CAPITAL BIKESHARE

bikesharedailydata.csv

1. Set up your workspace

2. Import the data

  • This data spans the District of Columbia, Arlington County, Alexandria, Montgomery County and Fairfax County.
  • The Capital Bikeshare system is owned by the participating jurisdictions and is operated by Motivate, a Brooklyn, NY-based company that operates several other bikesharing systems including Citibike in New York City, Hubway in Boston and Divvy Bikes in Chicago.

library(readr)
bikeshare <- read_csv("bikesharedailydata.csv")

View the data.

## # A tibble: 6 x 16
##   instant dteday season    yr  mnth holiday weekday workingday weathersit  temp
##     <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl> <dbl>
## 1       1 1/1/11      1     0     1       0       6          0          2 0.344
## 2       2 1/2/11      1     0     1       0       0          0          2 0.363
## 3       3 1/3/11      1     0     1       0       1          1          1 0.196
## 4       4 1/4/11      1     0     1       0       2          1          1 0.2  
## 5       5 1/5/11      1     0     1       0       3          1          1 0.227
## 6       6 1/6/11      1     0     1       0       4          1          1 0.204
## # ... with 6 more variables: atemp <dbl>, hum <dbl>, windspeed <dbl>,
## #   casual <dbl>, registered <dbl>, cnt <dbl>

Understand the type of data you are working with.

Observations

  • One of the first things you may notice is the data dimensions, the number of rows and columns. Specifically there are 731 rows (observations) and 16 columns (variables or attributes).
  • However, the variable names listed at the first row of every column are not very descriptive.

Determine what the variables mean in the real world.

  • Take a look at the column named season. What is the meaning of season?
  • What are the possible values for this variable?

bikeshare$season
##   [1]  1  1  1  1  1  1 NA  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [26]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [51]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [76]  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [101]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [126]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [151]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3
## [176]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [201]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [226]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [251]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4
## [276]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [301]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [326]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [351]  4  4  4  4  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [376]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [401]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [426]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2
## [451]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [476]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [501]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [526]  2  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3
## [551]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [576]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [601]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [626]  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [651]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [676]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [701]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  1  1  1  1  1
## [726]  1  1  1  1  1  1

##What type of variable is it?

  • It is of type numeric: read_csv() stored it as a double (<dbl>), even though every value is a whole number.
  • You’ll notice that in the column season the values are whole numbers that range between 1 and 4.

What do the numbers represent?

  • If we think about it, it’s unlikely that the numbers represent quantities.
  • Instead, they probably represent the seasons of the year because we know there are four seasons.

Understanding the four seasons

  • The numbers (1 through 4) are probably a code for each of the four seasons of the year.
  • Without additional information, such as a data dictionary or readme file, it would be impossible for the user of the data to know what the possible values of 1 through 4 correspond to in the categorical variable named season.

Review the data dictionary

This leads us to the next step, reviewing the data dictionary along with the data set to better understand the meaning behind the values.

The data dictionary

  • A data dictionary defines the characteristics of each of the data attributes.
  • If your data comes from a reputable source, odds are that it is accompanied by a data dictionary or metadata.
  • To know which season is represented by each number in the variable season we can review the data dictionary.

Reviewing the data dictionary

| Field | Definition |
|---|---|
| instant | record index |
| dteday | date |
| season | season (1: winter, 2: spring, 3: summer, 4: fall) |
| yr | year (0: 2011, 1: 2012) |
| mnth | month (1 to 12) |
| hr | hour (0 to 23) |
| holiday | whether the day is a holiday or not |
| weekday | day of the week |
| workingday | 1 if the day is neither a weekend nor a holiday, otherwise 0 |
| weathersit | 1, 2, 3, or 4: |
| | 1: Clear, Few clouds, Partly cloudy |
| | 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
| | 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
| | 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog |
| temp | normalized temperature in degrees F |
| atemp | normalized feeling temperature in degrees F |
| hum | normalized humidity |
| windspeed | normalized wind speed |
| casual | count of casual users |
| registered | count of registered users |
| cnt | count of total rental bikes, including both casual and registered |

What did we learn?

  • season is a categorical variable defined by one of four values, each representing a season (1: winter, 2: spring, 3: summer, 4: fall).
  • year is coded with the value of 0 for 2011 and 1 for 2012, rather than actual year value of 2011 or 2012.
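
One way to make coded categories readable is to convert them to a factor whose labels come from the data dictionary. The sketch below uses a short made-up vector (season_codes, a hypothetical stand-in for the season column) rather than the CSV itself.

```r
# Hypothetical sample of season codes (not read from the CSV)
season_codes <- c(1, 2, 3, 4, 1, 3)

# Attach the labels given in the data dictionary
season_factor <- factor(
  season_codes,
  levels = 1:4,
  labels = c("winter", "spring", "summer", "fall")
)
print(season_factor)

# Count how many observations fall in each season
print(table(season_factor))
```

The same call could be applied to a data frame column, assuming its codes follow the dictionary.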

##3. Prepare or tidy your data

  • At this point, you may want to rename the columns in your data set to make the data more usable when you begin the analysis.
  • Renaming columns is a manual process that involves changing each column name.
  • It is best practice to use lower case lettering and avoid spaces and hyphenation.

Renaming columns

There are two key ways to rename columns.

  • with rename() from dplyr
  • with names() from base

Way 1 - Renaming columns with rename() from the dplyr library.

library(dplyr)
bikeshare <- rename(bikeshare, humidity = hum, month=mnth)
names(bikeshare)
##  [1] "instant"    "dteday"     "season"     "yr"         "month"     
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"

Way 2 - Renaming columns with R base functions.

# Rename column where names is equal to "yr"
names(bikeshare)[names(bikeshare) == "yr"] <- "year"
names(bikeshare)
##  [1] "instant"    "dteday"     "season"     "year"       "month"     
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"

Identify missing values

  • We can use the is.na() function to count the number of NA values.
  • Let’s look for missing values in the season column.

…

sum(is.na(bikeshare$season))
## [1] 1

##Using iteration to identify missing values

  • In this case it seems necessary to “loop” through the entire data set and to identify which fields need closer inspection.
  • We can build a simple for loop to do this.

Let’s use the concept of iteration to traverse the data set and print out the rows that contain NA values.

…

counter <- 0
for (i in bikeshare$season) {
  # counter tracks the row number of the value currently being checked
  counter <- counter + 1
  if (is.na(i)) {
    print(paste("It's true. There's an NA value on row", counter))
    print(bikeshare[counter, ])
  }
}
## [1] "It's true. There's an NA value on row 7"
## # A tibble: 1 x 16
##   instant dteday season  year month holiday weekday workingday weathersit  temp
##     <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl> <dbl>
## 1       7 1/7/11     NA     0     1       0       5          1          2 0.197
## # ... with 6 more variables: atemp <dbl>, humidity <dbl>, windspeed <dbl>,
## #   casual <dbl>, registered <dbl>, cnt <dbl>
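
The same check can also be written without an explicit loop. The sketch below uses a small stand-in data frame (demo, a hypothetical example, not the bikeshare file): which() reports the row positions of the NA values in one column, and colSums(is.na()) counts the missing values in every column at once.

```r
# Hypothetical stand-in for a few bikeshare columns
demo <- data.frame(
  season = c(1, 1, NA, 2),
  temp   = c(0.34, 0.36, 0.20, NA),
  cnt    = c(985, 801, 1349, 1562)
)

# Row positions where season is missing
na_rows <- which(is.na(demo$season))
print(na_rows)

# Number of missing values in every column at once
na_per_column <- colSums(is.na(demo))
print(na_per_column)
```

This gives a quick overview of which fields need closer inspection before deciding how to handle them.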

Dealing with missing values

There are several ways you can tackle working with data that are incomplete. Each has its pros and cons.

  1. Ignore any record with missing values
  2. Replace empty fields with a pre-defined value
  3. Replace empty fields with the most frequently appeared value
  4. Use the mean value
  5. Manual approach
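
The first four options can be sketched on a small made-up vector. Everything here is illustrative: x is hypothetical, and the pre-defined value 0 is an arbitrary choice, not a recommendation.

```r
# Hypothetical vector with one missing value
x <- c(4, 2, 2, NA, 7)

# 1. Ignore records with missing values
dropped <- x[!is.na(x)]

# 2. Replace with a pre-defined value (here 0, an arbitrary choice)
predefined <- ifelse(is.na(x), 0, x)

# 3. Replace with the most frequently appearing value
most_frequent <- as.numeric(names(which.max(table(x))))
by_mode <- ifelse(is.na(x), most_frequent, x)

# 4. Replace with the mean of the observed values
by_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

print(list(dropped = dropped, predefined = predefined,
           by_mode = by_mode, by_mean = by_mean))
```

The fifth option, the manual approach, means looking up or reasoning out the correct value and assigning it directly.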

Solution

  • In this case it’s easy to replace the value with a pre-defined value.

  • We wouldn’t want to ignore the record because the values can be easily determined.

##Update the values

bikeshare$season[7]
## [1] NA
bikeshare$season[7] <- 1
bikeshare$season[7]
## [1] 1

##4. Understand and visualize

library(ggplot2)
bikesall <- ggplot(bikeshare, aes(atemp, cnt)) + geom_point(color = "#4cbea3")

##Let’s see it!

bikesall

##Refine it

library(ggthemes)  # provides theme_few()
bikes2011 <- ggplot(bikeshare[bikeshare$year < 1,],
                    aes(atemp,cnt)) +
  geom_point(color="#4cbea3") + 
  theme_few() + labs(title = "Rentals in 2011", 
                     x = "Average temp",y=" ")

##2011

bikes2011

##Plot 2012

bikes2012 <- ggplot(bikeshare[bikeshare$year > 0,],
                    aes(atemp,cnt)) +
  geom_point(color="#4cbea3") +
  theme_few() + labs(title = "Rentals in 2012",
                     x = "Average temp",y=" ")

##2012

bikes2012

Now let’s arrange the charts side by side

We can do this by using the plot_grid function from the cowplot package

cowplot::plot_grid

  • Pass the two plot objects bikes2011 and bikes2012 into the plot_grid function to see the graphs plotted side by side.

##plot_grid

cowplot::plot_grid(bikes2011, bikes2012, labels =c(" ", " "))

#IX. User Defined Functions

##User Defined Functions

  • Functions help us achieve our programmatic goals. They are sets of instructions that we can repeatedly call (or use) to manipulate data, objects, and states.

  • Some functions are built in such as:

##toupper

toupper("hello world")
## [1] "HELLO WORLD"

##mean

mean(c(1,2,3,4,5))
## [1] 3

##is.numeric

is.numeric(4)
## [1] TRUE

##is.na

is.na(NA)
## [1] TRUE

##sqrt

sqrt(25)
## [1] 5

Functions

Functions operate on specified arguments and generally return a value (a number, a string, etc.). For example, passing the value 25 to the square root function sqrt() returns 5.

In addition to using pre-existing functions from R packages, we can write our own.

  • Functions are useful for executing repetitive commands.
  • Planning is key to writing effective functions.

..

This code shows the simple construction of a user-defined function. Each function has two parts: the function definition and the function call.

Part 1 - The function definition

functionname <- function(x){
  # paste() builds the message; return() hands it back to the caller
  return(paste("The value", x, "is returned"))
}

Part 2 - The function call

functionname(34)
## [1] "The value 34 is returned"

Let’s apply this structure to a short problem.

Suppose we want to have a function to act on our data that adds 2 to every value. How would we design this function?

Function Pseudocode

  • We begin by drafting out the main components of a function.

  • First, let’s give our function a name. We’ll call it myfunction.

  • Next, we assign it a function declaration, namely function(), followed by an opening and closing curly bracket. Then we add a variable as a parameter to the function declaration. In this case, we’ll call it myparameter.

myfunction <- function(myparameter){

}

Now let’s set up our function call.

myfunction(22)

  • If we run the function declaration, it works without error.
  • However, when we call it, it returns the value NULL.
  • This is because we have not specified in our function declaration what to do with the parameter that is passed into the function.

Let’s go ahead and do that.

In this case, we are just adding 2 to myparameter.

myfunction <- function(myparameter) {
  myparameter + 2
}

Now, when we call myfunction and pass in a number, we see a value is returned which is myparameter + 2.

myfunction(22)
## [1] 24

What happens when we don’t pass in a numeric value?

myfunction("hello")

The error is: Error in myparameter + 2 : non-numeric argument to binary operator

Can you see why this error was given? You’ll notice that the string hello caused this error.

How can we plan for user error?

  • We can evaluate the input being passed into the function and see if it is of type numeric by using the test expression is.numeric(myparameter).
  • If the data is of type numeric, then we can proceed and add 2 to myparameter.
  • If it is not, then we can provide a friendly message to the user.

..

myfunction <- function(myparameter) {
  # is.numeric() already returns TRUE or FALSE, so no comparison is needed
  if (is.numeric(myparameter)) {
    myparameter + 2
  } else {
    print("Sorry. This function requires a value of type numeric.")
  }
}

..

Let’s see this in action by calling myfunction and passing in a string. We can see the friendly message as output to the user.

myfunction("hello")
## [1] "Sorry. This function requires a value of type numeric."

Let’s try calling myfunction again with a numeric value.

You can see the function works as it should when a numeric value is provided.

myfunction(3)
## [1] 5
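
Printing a message works for interactive use, but the function then returns the message text as if it were a result. A common alternative, sketched below, is to signal an error with stop() and let the caller handle it with tryCatch(). The name addTwoStrict is hypothetical and used only for this illustration.

```r
# Hypothetical variant of myfunction that raises an error for bad input
addTwoStrict <- function(myparameter) {
  if (!is.numeric(myparameter)) {
    stop("This function requires a value of type numeric.")
  }
  myparameter + 2
}

# tryCatch() lets the caller decide how to respond to the error
result <- tryCatch(
  addTwoStrict("hello"),
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA  # value used in place of a result
  }
)
print(result)
print(addTwoStrict(3))
```

Which style to use is a design choice: a printed message is friendlier for beginners, while an error prevents a bad value from quietly flowing into later calculations.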

Pass in multiple arguments

  • Let’s create a function that takes more than one argument.

  • We’ll call this function addTogether and include 2 parameters or arguments in the function declaration.

addTogether <- function(x, y) {
  # Both arguments must be numeric before they can be added
  if (is.numeric(x) && is.numeric(y)) {
    x + y
  } else {
    print("Sorry, please enter two numbers")
  }
}

Now, let’s call the function, addTogether.

addTogether(5, 15)
## [1] 20

##Let’s call it again, this time passing in a number and a string.

addTogether(5, "d")
## [1] "Sorry, please enter two numbers"

Alternative function call, with named arguments

addTogether(x = 5, y = 10) 
## [1] 15

Try it. Write a function that averages two numbers

##CODE HERE

This can be easily achieved by modifying the addTogether function and changing the computation to (x + y)/2.

avg <- function(x, y) {
  # Both arguments must be numeric before they can be averaged
  if (is.numeric(x) && is.numeric(y)) {
    (x + y) / 2
  } else {
    print("Sorry, please enter two numbers")
  }
}
avg(1,"2")
## [1] "Sorry, please enter two numbers"

..

  • This approach works if you enter two numeric values.
  • It also handles the case where one or both of the arguments are non-numeric by printing the friendly message.

Where it does not work is when only one parameter is provided. This throws the following error:

Error in avg(1) : argument "y" is missing, with no default

This error message is hard for a user to interpret. A big part of programming is planning what to do when someone uses your program, function, or application and enters input that causes it to break.

Anticipating that the user may not supply the correct number of arguments can make your functions more usable.

..

addTogether <- function(x, y) {
  # hasArg() reports whether the caller supplied the named argument
  if (!hasArg(x) && !hasArg(y)) {
    print("You didn't enter any values. Please enter two numbers.")
  } else if (!hasArg(x) || !hasArg(y)) {
    print("You only entered one value. Please enter two values.")
  } else if (is.numeric(x) && is.numeric(y)) {
    x + y
  } else {
    print("Sorry, please enter two numbers.")
  }
}

#Call the function
addTogether("3",3)

Let’s run it.

## [1] "Sorry, please enter two numbers."
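
Base R also provides missing(), which reports whether an argument was supplied in the call. The sketch below is one possible variant; addTogetherChecked is a hypothetical name, and it returns the message text instead of printing it.

```r
# Hypothetical variant of addTogether() that uses missing() from base R
addTogetherChecked <- function(x, y) {
  # missing() is TRUE when the caller did not supply that argument
  if (missing(x) || missing(y)) {
    return("Please enter two numbers.")
  }
  if (!is.numeric(x) || !is.numeric(y)) {
    return("Sorry, please enter two numbers.")
  }
  x + y
}

print(addTogetherChecked(5))
print(addTogetherChecked(5, 15))
```

Checking for missing arguments first means the type check never touches an argument that was not provided.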

4. Apply family of functions

What if we could apply a function to all the elements of the input? We can do this with a function called sapply().

sapply() comes from the apply family of functions.

The syntax is as follows: The function call is sapply with the arguments X, and FUN.

sapply(X, FUN)

Arguments:

  • X: The input, which needs to be a list, vector, or data frame.
  • FUN: The function applied to each element of X.

The output from sapply() is a vector or matrix when the results can be simplified; otherwise it is a list.
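
As a rough illustration of the shapes sapply() can return, the sketch below applies three small anonymous functions to the vector 1:3. The variable names are chosen only for this example.

```r
# One value per element: the result is a vector
squares <- sapply(1:3, function(i) i^2)
print(squares)

# Two values per element: the result is a matrix with one column per element
pairs <- sapply(1:3, function(i) c(i, i^2))
print(pairs)

# Results of different lengths cannot be simplified: the result is a list
sequences <- sapply(1:3, seq_len)
print(sequences)
```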

Let’s work through an example.

Example - sapply()

df1 is a data frame that we created. To find the max value, we can use sapply() and pass in our data frame and then the max function.

df1 <-as.data.frame(c(1,2,3,4,5,6,7))

sapply(df1, max)
## c(1, 2, 3, 4, 5, 6, 7) 
##                      7

Let’s write a function that takes the mean.

takemean <-function(x){
  mean(x, na.rm=TRUE)
}

Next, let’s apply the takemean() function to every column of a data frame using sapply().

##     rating complaints privileges   learning     raises   critical    advance 
##   64.63333   66.60000   53.13333   56.36667   64.63333   74.76667   42.93333

We can check our work:

##      rating        complaints     privileges       learning         raises     
##  Min.   :40.00   Min.   :37.0   Min.   :30.00   Min.   :34.00   Min.   :43.00  
##  1st Qu.:58.75   1st Qu.:58.5   1st Qu.:45.00   1st Qu.:47.00   1st Qu.:58.25  
##  Median :65.50   Median :65.0   Median :51.50   Median :56.50   Median :63.50  
##  Mean   :64.63   Mean   :66.6   Mean   :53.13   Mean   :56.37   Mean   :64.63  
##  3rd Qu.:71.75   3rd Qu.:77.0   3rd Qu.:62.50   3rd Qu.:66.75   3rd Qu.:71.00  
##  Max.   :85.00   Max.   :90.0   Max.   :83.00   Max.   :75.00   Max.   :88.00  
##     critical        advance     
##  Min.   :49.00   Min.   :25.00  
##  1st Qu.:69.25   1st Qu.:35.00  
##  Median :77.50   Median :41.00  
##  Mean   :74.77   Mean   :42.93  
##  3rd Qu.:80.00   3rd Qu.:47.75  
##  Max.   :92.00   Max.   :72.00
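
The calls that produced the two outputs above are not shown. The column names match R's built-in attitude data set, so, assuming that is the data used, the calls were probably along these lines:

```r
takemean <- function(x) {
  mean(x, na.rm = TRUE)
}

# Assumption: the outputs above come from the built-in attitude data set
column_means <- sapply(attitude, takemean)
print(column_means)

# summary() also reports the mean of each column, so the two should agree
summary(attitude)
```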

#EXERCISE 03 - COMPLETE

Try using your function on a column in the bikeshare data.

##CODE HERE

X. Working with APIs

Keys, endpoints, and methods (and in R, some extra packages)

Working with OpenWeatherMap API

  • Basic call:
  • get_current_weather(api_key, cityID = NA, city = "", country = "", coordinates = NA, zip_code = NA)

library(ROpenWeatherMap)
library(tidyverse)

api_key <- "ffb7b9808e07c9135bdcc7d1e867253d" #Kristen's key please get your own. =)

#API Call that reads in as a list
newyork <- get_current_weather(api_key, city = "New York")
class(newyork)
## [1] "list"

##Cast to data frame or tibble

newyork <-data.frame(newyork)
newyork$city <- "New York"
class(newyork)
## [1] "data.frame"

Reorder columns

newyork[c(length(newyork),1,2,3,4,5)]
##       city coord.lon coord.lat weather.id weather.main weather.description
## 1 New York    -74.01     40.71        800        Clear           clear sky
#Reorder again and save it back to newyork
newyork <- newyork[c(length(newyork),1,2,3,4,5)]

Let’s make another call to API with a different city

ct <- get_current_weather(api_key, city = "Cooperstown")
ct <- data.frame(ct)
ct$city <- "Cooperstown"

ct[c(length(ct),1,2,3,4,5)] #reorder
##          city coord.lon coord.lat weather.id weather.main weather.description
## 1 Cooperstown    -74.92      42.7        800        Clear           clear sky
ct <- ct[c(length(ct),1,2,3,4,5)]

Now, let’s combine the two data frames.

cityweather <- rbind(newyork[,1:6], ct[,1:6])
#good resources on joins: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right
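
The steps repeated for each city (cast to a data frame, add a city column, move it to the front) could be wrapped in a user-defined function. The sketch below is one way to do that. tidy_weather is a hypothetical helper, and the two mock lists are hand-built stand-ins that imitate only the first few fields of the API results shown above, so no API key or network access is needed to run it.

```r
# Hypothetical helper: turn one API result (a list) into a one-row data frame
tidy_weather <- function(weather_list, city_name) {
  df <- data.frame(weather_list)
  df$city <- city_name
  # Move the new city column to the front, followed by the first five columns
  df[c(ncol(df), 1:5)]
}

# Mock results standing in for get_current_weather() output
# (values copied from the printed rows above)
newyork_mock <- list(
  coord   = list(lon = -74.01, lat = 40.71),
  weather = list(id = 800, main = "Clear", description = "clear sky")
)
ct_mock <- list(
  coord   = list(lon = -74.92, lat = 42.7),
  weather = list(id = 800, main = "Clear", description = "clear sky")
)

cityweather_demo <- rbind(
  tidy_weather(newyork_mock, "New York"),
  tidy_weather(ct_mock, "Cooperstown")
)
print(cityweather_demo)
```

With a real key, the mock lists would be replaced by the lists returned from the API, assuming they have at least the same leading fields.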

On to visualizing our data…

Let’s plot our cities on a map and label it with the weather.

Basic map

library(maps)
map(database="county", region=c("New York"), col="#cccccc")
symbols(cityweather$coord.lon, cityweather$coord.lat, bg="#e2373f", fg="#ffffff", lwd=0.5, circles = rep(1,length(cityweather$coord.lon)), inches=0.1, add=TRUE)

Leaflet map

library(leaflet)
m <- leaflet()
m <- addTiles(m)
m <- addMarkers(m, lng=cityweather$coord.lon, lat=cityweather$coord.lat, popup=paste(cityweather$city, cityweather$weather.main))
m

#XI. RSHINY

install.packages("shiny")

##What is shiny?

  • Shiny is a web application framework for R.
  • Shiny makes it incredibly easy to build interactive web applications with R.
  • You can publish these apps to the web without detailed knowledge of the underlying web code such as HTML, JavaScript, or CSS.
  • Learn more at: https://cran.r-project.org/web/packages/shiny/shiny.pdf

##Video - Movie Explorer

Here’s a simple example of a shiny app. It’s a movie explorer app that allows you to select which variables you want to plot on the x and y axes. This allows you to interact with the data and explore it through a visualization interface.

https://shiny.rstudio.com/gallery/movie-explorer.html

##Video - Marathon Training

##Video - Intelligencia

Interactivity

There are many user interface features you can plug and play into your R code and turn it into an App.

ui Inputs

These interactive features are called UI inputs. For example, there are:

  • Sliders, slider ranges, and text input
  • Numeric input, radio buttons, and select boxes
  • Date input, date ranges, and file input
  • Action buttons, single check boxes, and check box groups

All of these UI inputs allow users of your app to interact with your R program to render a plot, table, or update text.

Layouts

The user interface for a Shiny app can use one of many layout options. These include panels that contain UI inputs, such as

wellPanel(dateInput("a", ""), submitButton())

which renders a single element called a panel.

You can then organize panels and UI elements into a layout with a layout function such as:

  • fluidRow()
  • flowLayout()
  • sidebarLayout()
  • splitLayout(), or
  • verticalLayout().

..

server Outputs

A Shiny app can display several types of output: a plot via renderPlot(), a table via renderTable(), or text via renderPrint().

In the case of the movie explorer, the output of the app was a plot.

Building a shiny app

To start building a Shiny app, the first thing you need to do is install the shiny package.

install.packages("shiny")

##Then load it by calling library(shiny)

library(shiny)

Now, let’s walk through an example.

  • Create a new shiny app. See the instructor demo.

Let’s explore what we see on the screen.

  • You’ll notice this shiny app has one of the UI elements we discussed, the slider.

  • This allows the user to specify the number of bins to show in the histogram.

  • Here the server calls the renderPlot() function to draw the histogram.

  • In this example, the code is also displayed.

Shiny app template (app.R)

Let’s look at the building blocks of a Shiny app.

library(shiny)

# Defines the user interface through nested R functions
ui <- fluidPage()

# Specifies how to build and rebuild the R objects shown in the ui
server <- function(input, output) {}

# Combines ui and server into an app that can be launched with runApp()
shinyApp(ui = ui, server = server)

Shiny divides the functions of its app into three distinct sections:

Section 1

  • Section 1 is the ui - nested R functions that assemble an HTML user interface for your app

  • fluidPage() contains the elements in the app.

  • The fluidPage contains both the input and output functions, for example, the titlePanel, sliderInput, and plotOutput.

  • You can change the title, what type of input you want (numericInput( ), selectInput( ), and dateInput( ) are popular as well), and the elements within the input (like the minimum, maximum and preset values).

Section 2

  • The second section is the server - a function with instructions on how to build and rebuild the R objects displayed in the UI
  • The function(input, output) {} actually builds the output.

Section 3

  • The third section is shinyApp() - it combines ui and server into an app object that can be launched with runApp().
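
The three sections can be sketched as a small working app. This is a minimal illustration in the spirit of the slider-and-histogram demo, not the instructor's code: the input id "bins", the output id "histogram", and the use of the built-in faithful data set are assumptions made for this example.

```r
library(shiny)

# Section 1: the ui holds a title, one input (a slider), and one output (a plot)
ui <- fluidPage(
  titlePanel("Old Faithful waiting times"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("histogram")
)

# Section 2: the server rebuilds the plot whenever input$bins changes
server <- function(input, output) {
  output$histogram <- renderPlot({
    # hist() treats a single number in breaks as a suggested bin count
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "Minutes")
  })
}

# Section 3: combine ui and server into an app object
app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch the app in a browser
```

The app object is stored in a variable here so that the script can be run without opening a browser; in an app.R file the last line is usually the shinyApp() call itself.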

Revisit the example

  • Look for the three sections: ui, server, shinyApp.

Next steps..

Now that you are familiar with what a Shiny app is and with its components, try building your own.

  • To build a Shiny app in R, it’s best to start with a template.
  • In RStudio, go to File -> New File -> Shiny Web App. Choose a single file web application.
  • app.R will be the file you modify. It is saved in a new directory.
  • The directory name is the name of your app.
  • Then launch your app with runApp(), passing the path to your directory as a string.

runApp("nyuclasses")

App walkthrough (shiny and advanced data handling)

#EXERCISE 04 - COMPLETE

..

Try your best to achieve the following:

  • Revise app to provide a default view of most recent distributions by most recent assignment due date

  • Revise app to include a selector by one or more students

  • Revise app to include doughnut charts to show completion, late or uncompleted assessments by assignment

  • Revise app to include a student list

##Prototype specification

Review

Homework - Due 10/11 by 11:55pm ET

  • Submit 2 files. Your .Rmd file from today with exercises 1-3 completed and your app.R file (for exercise 4) via our NYU Classes site > Assignments > In Class Worksheet.
  • Rename your .Rmd file lastname_first_inclass.Rmd
  • Keep app.R named the same.

Extra Credit

THANK YOU