5/22/2019

Foundations of Statistics Using R

New to programming?

##          group value
## 1          New    20
## 2       Novice    20
## 3 Intermediate    11
## 4     Advanced     3

Prior knowledge of programming

Learning programming

  • A new way of thinking, not just a new skill

  • A new language for speaking and reading (vectors, data frames, functions, objects, etc.)

  • A new syntax for writing c(), print(), cat(), sort(), require(), subset()

New to R

Prior knowledge of R

Turn to the person next to you and share the following:

  • Identify two major takeaways from the pre-module.
  • Describe what you do when things go wrong for you in R.

Outcomes

  • Review what you’ve learned in lessons 1 through 6.
  • Introduce you to techniques and approaches to programming in R.
  • Give you an opportunity to practice and solve problems.

PRE MODULE REVIEW

SETUP

Setting up your R world

  • Download rmsba.zip file and unzip
  • Rename the unzipped folder that contains the files to rmsba. Move it to a location you can find.
  • Create a new project in RStudio named rmsba
  • Set your working directory to the rmsba folder you created above.

PART I: BEGINNING DATA PROJECTS

When working with new data, what initial questions do you have?

  • What does your data represent in the real world?
  • How is this real-world phenomenon characterized by the data that you have?
  • From what time period is the data?
  • What's the source?

Basic understanding

  • Once you have this basic understanding of your data you can dig deeper.
  • You can use visualization techniques to explore your data and derive some basic understandings of the phenomena you are studying, such as the largest and smallest values for each variable.
  • Calculating summary statistics can translate the data into information by revealing the shape of the data, the mean, median, minimum value, maximum value, and variability.
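
As a minimal sketch of the summary statistics described above, the following uses R's built-in attitude data set as a stand-in for your own data (an assumption made only for illustration; the variable names ratings, stats and spread are hypothetical).

```r
# Illustrative only: the built-in `attitude` data set stands in for your own data.
ratings <- attitude$rating

# Minimum, quartiles, median, mean and maximum in one call
stats <- summary(ratings)
print(stats)

# Variability: standard deviation, plus the smallest and largest values
spread <- c(sd = sd(ratings), min = min(ratings), max = max(ratings))
print(spread)
```

summary() also accepts a whole data frame, in which case it reports the same statistics for every column.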

The process

For any data science project there are a few simple steps to follow.

EXAMPLE PROJECT. WORLD INTERNET USAGE

world_internet_usage.csv

1) Set up your workspace (generic steps)

  • Begin by creating a new folder that contains your data.
  • Then create a new project in RStudio.
  • Set your working directory to the folder you created above.
  • Create a new RMarkdown document.
  • Save it in your working directory.

Let's create a new RMarkdown document

  • Go to File > New File > R Markdown
  • Save the file in your working directory
  • Name it internet.Rmd

RMARKDOWN

R Markdown Example

---
title: "Hello R Markdown"
author: "Awesome Me"
date: "2018-02-14"
output: html_document
---

This is a paragraph in an R Markdown document.
Below is a sample code chunk:

#{r chunkname, echo=TRUE, eval=TRUE}

myformula <- (2+2)

R Markdown Components

  • metadata - YAML header
  • text - Text outside of code chunks
  • code - R code chunks

R Markdown YAML header

YAML stands for "YAML Ain't Markup Language"

#title: "Hello R Markdown"
#author: "Awesome Me"
#date: "2019-05-20"
#output: html_document

R Markdown YAML common output options

Option                 Creates
html_document          HTML
pdf_document           PDF (requires TeX)
word_document          Microsoft Word (.docx)
github_document        GitHub-compatible Markdown
ioslides_presentation  ioslides HTML slides

R Markdown common text options

  • # Header 1
  • ## Header 2
  • ### Header 3
  • **Bold**
  • _Italics_

R Markdown code chunk options

Option   Default
eval     TRUE
echo     TRUE
warning  TRUE
error    FALSE

echo=TRUE and echo=FALSE

echo=TRUE

2+2
## [1] 4

echo=FALSE

## [1] 4

fig.height=3, fig.width=4

plot(attitude)

See R Markdown cheatsheet

2) Import your data

We have a couple different options.

  • read_csv() function from the readr package
  • read.csv() function from the utils package (base R)

readr versus base R

  • readr functions are typically faster (~10x).
  • They produce tibbles, and they don’t convert character vectors to factors, use row names, or munge the column names.
  • Base R functions inherit some behavior from your OS and environment variables, so import code that works on your computer might not work on someone else’s.
  • Column names that are numbers: readr keeps 2002 as 2002, while base R converts it to X2002.
  • Column names that contain spaces (bad practice): readr keeps Country Name as "Country Name", while base R converts it to Country.Name.
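
The column-name behavior can be checked without the course data. The sketch below writes a tiny made-up CSV to a temporary file and reads it with base R only; the file contents and the names munged and kept are hypothetical. It also shows that read.csv() has a check.names argument that switches the renaming off.

```r
# Write a tiny, made-up CSV to a temporary file (hypothetical data for illustration).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country Name,2002", "China,4.6", "Mexico,11.9"), tmp)

# Default base R behavior: names are made syntactically valid.
munged <- read.csv(tmp)
names(munged)

# check.names = FALSE keeps the header text exactly as written.
kept <- read.csv(tmp, check.names = FALSE)
names(kept)

# Clean up the temporary file
unlink(tmp)
```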

read.csv()

internet_baser <- read.csv("world_internet_usage.csv")
head(internet_baser, 5)
##     country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009
## 1     China  1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00 22.60 28.90
## 2    Mexico  5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81 21.71 26.34
## 3    Panama  6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29 33.82 39.08
## 4   Senegal  0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70 10.60 14.50
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90 69.00 69.00
##   X2010 X2011 X2012
## 1 34.30 38.30 42.30
## 2 31.05 34.96 38.42
## 3 40.10 42.70 45.20
## 4 16.00 17.50 19.20
## 5 71.00 71.00 74.18

read_csv()

library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
head(internet_readr, 5)
## # A tibble: 5 x 14
##   country `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China     1.78   2.64   4.6    6.2    7.3    8.52  10.5    16     22.6
## 2 Mexico    5.08   7.04  11.9   12.9   14.1   17.2   19.5    20.8   21.7
## 3 Panama    6.55   7.27   8.52   9.99  11.1   11.5   17.4    22.3   33.8
## 4 Senegal   0.4    0.98   1.01   2.1    4.39   4.79   5.61    7.7   10.6
## 5 Singap~  36     41.7   47     53.8   62     61     59      69.9   69  
## # ... with 4 more variables: `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>

use read_csv()

As a best practice, we're going to use the read_csv() function from the readr package to import the world_internet_usage.csv data as a tibble.

class(internet_readr)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

3) Prepare or tidy your data

Take a look at your data.

knitr::kable(head(internet_readr, 10))
country                2000   2001   2002   2003   2004   2005   2006   2007   2008   2009   2010   2011   2012
China                  1.78   2.64   4.60   6.20   7.30   8.52  10.52  16.00  22.60  28.90  34.30  38.30  42.30
Mexico                 5.08   7.04  11.90  12.90  14.10  17.21  19.52  20.81  21.71  26.34  31.05  34.96  38.42
Panama                 6.55   7.27   8.52   9.99  11.14  11.48  17.35  22.29  33.82  39.08  40.10  42.70  45.20
Senegal                0.40   0.98   1.01   2.10   4.39   4.79   5.61   7.70  10.60  14.50  16.00  17.50  19.20
Singapore             36.00  41.67  47.00  53.84  62.00  61.00  59.00  69.90  69.00  69.00  71.00  71.00  74.18
United Arab Emirates  23.63  26.27  28.32  29.48  30.13  40.00  52.00  61.00  63.00  64.00  68.00  78.00  85.00
United States         43.08  49.08  58.79  61.70  64.76  67.97  68.93  75.00  74.00  71.00  74.00  77.86  81.03

How could you visualize this to better understand it?

A histogram?

  • What function can we use?

Building a histogram

hist(internet_readr$`2000`)

Adding aesthetics

hist(internet_readr$`2000`, breaks=8, 
     main="Internet usage for 2000", 
     col="magenta", xlab=" ", labels=TRUE)

A histogram for every year using par(mfrow=c(8,2))

par(mfrow=c(8,2))
hist(internet_readr$`2000`, col= "blue", labels=TRUE, breaks = 4)
hist(internet_readr$`2001`)
hist(internet_readr$`2002`)
hist(internet_readr$`2003`)
hist(internet_readr$`2004`)
hist(internet_readr$`2005`)
hist(internet_readr$`2006`)
hist(internet_readr$`2007`)
hist(internet_readr$`2008`)
hist(internet_readr$`2009`)
hist(internet_readr$`2010`)
hist(internet_readr$`2011`)
hist(internet_readr$`2012`)
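
The thirteen near-identical hist() calls above could also be written as a loop over the year columns. The following is only a sketch: it uses a small made-up data frame named usage in place of internet_readr, and it draws to a null PDF device so that it runs in any session (drop the pdf(NULL) and dev.off() lines when working interactively in RStudio).

```r
# Hypothetical stand-in for internet_readr: one column per year, named by year.
usage <- data.frame(
  country = c("A", "B", "C", "D"),
  `2000` = c(1.8, 5.1, 6.6, 36.0),
  `2001` = c(2.6, 7.0, 7.3, 41.7),
  `2002` = c(4.6, 11.9, 8.5, 47.0),
  check.names = FALSE
)

# Every column except country holds one year of data
year_cols <- setdiff(names(usage), "country")

pdf(NULL)                                        # null device: nothing is written to disk
old_par <- par(mfrow = c(1, length(year_cols)))  # one panel per year, side by side
for (yr in year_cols) {
  hist(usage[[yr]], main = paste("Internet usage for", yr), xlab = "")
}
par(old_par)                                     # restore the previous layout
invisible(dev.off())
```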

Histogram matrix

Exercise 01

#Redo the histogram matrix 
#and add aesthetics

Boxplots

boxplot(internet_readr$`2000`, 
        main="Internet usage for 2000", 
        col="magenta", 
        xlab=median(internet_readr$`2000`),
        frame.plot=FALSE, horizontal=TRUE, 
        border="dark blue")

Multiple boxplots

Exercise 02

#Build box plots for 2000 -2012 
#based on the example above.  
#Include aesthetics.

How else might you want to visualize this data to better understand it?

How would you create a bar or line graph using ggplot?

  • Technically, we would use geom_col(), geom_bar() or geom_line().
  • However, what variable would you map for the x-axis and the y-axis?

For example…

ggplot(internet_readr,aes(XVALUE, YVALUE)) +
  geom_col()

Let's think about how we would use the ggplot() function.

ggplot(internet_readr,aes(X,Y)) +
  geom_col()

PART II: Data transformation and advanced visualization

Wide to long formats

We need to reshape our data from wide to long.

## [1] "Wide format"
country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
China 1.78 2.64 4.6 6.2 7.3 8.52 10.52 16.00 22.60 28.90 34.30 38.30 42.30
Mexico 5.08 7.04 11.9 12.9 14.1 17.21 19.52 20.81 21.71 26.34 31.05 34.96 38.42
## [1] "Long format"
country year usage
China 2000 1.78
Mexico 2000 5.08

Using the gather() function to reshape.

  • The gather() function from the tidyr package reshapes a tibble from wide to long form.
  • Note the use of the pipe, %>%, which passes the left-hand side of the operator to the first argument of the function on the right-hand side.

library(tidyr)  # provides gather() and re-exports the pipe %>%

tidy_internet_readr <- 
  internet_readr %>%
  gather(`2000`:`2012`, key="year", 
         value="usage")
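
For comparison, tidyr 1.0.0 and later also offer pivot_longer() for the same wide-to-long reshape. The sketch below assumes such a tidyr version is installed and uses a made-up two-row data frame named wide in place of internet_readr; it is one possible equivalent, not the approach used in these slides.

```r
library(tidyr)

# Hypothetical two-row stand-in for internet_readr.
wide <- data.frame(
  country = c("China", "Mexico"),
  `2000` = c(1.78, 5.08),
  `2001` = c(2.64, 7.04),
  check.names = FALSE
)

# cols selects the year columns; their names go to "year", their values to "usage".
long <- pivot_longer(wide, cols = `2000`:`2001`,
                     names_to = "year", values_to = "usage")
long
```

The rows come back grouped by country rather than by year, so the order differs from gather() even though the content is the same.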

View the data

country year usage
China 2000 1.78
Mexico 2000 5.08
Panama 2000 6.55
Senegal 2000 0.40
Singapore 2000 36.00

4) Understand - Visualize

  • Let's create a time series line graph.
  • Use the ggplot2 package

ggplot(data, aes(x,y,color, group)) + geom_line() + ... + ...

Build the chart

library(ggplot2)
# assign the ggplot call to a variable
line01 <- ggplot(tidy_internet_readr,
        aes(x=year,y=usage,color=country,
            group=country)) + geom_line()

View the chart

line01

Adding aesthetics: Labels

  • Labels: +labs(title="", subtitle="",x="", y="", caption="")

Refine your line chart

library(ggthemes)
library(ggplot2)
line02<-ggplot(tidy_internet_readr,
               aes(x=year,y=usage,color=country,
                   group=country)) + geom_line() + 
  labs(title = "Internet Usage per 100 people", 
       subtitle = "Since 2011, the UAE has surpassed Singapore and the US in internet users", 
       caption = "Source: World Bank (2013)",
       x = " ",y ="Usage")

Refined line chart

Let's create a bar chart with the same data

library(ggplot2)
library(ggthemes)  # provides theme_few()

# Map columns by name inside aes(); avoid the data$column form
bar01 <- ggplot(tidy_internet_readr, aes(x = year, y = usage))

bar01 <- bar01 + geom_col() + theme_few() +
  labs(title = "Internet Usage per 100 people", 
       x = "Year",y ="Usage", 
       caption="World Bank (2013)")

Let's see it

bar01

Refining aesthetics: Specifying parameters

Adding a fill / Color

  • fill="#4cbea3", color="#4cbea3"

Add a fill

Changing the font

  • + theme(text=element_text(family="Avenir"))

Changing the font

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

Removing chart junk

theme(panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.line = element_line(color = "gray"),
      axis.ticks.x = element_blank(),
      axis.ticks.y = element_blank())

Removing chart junk

bar01c <- bar01 + geom_col(fill="#4cbea3", color="#4cbea3") + theme_few() +
  labs(title = "Internet Usage per 100 people", 
       x = " ",y ="Usage", 
       caption="World Bank (2013)") +
  theme(text=element_text(family="Avenir"), 
        panel.border = element_blank(), panel.grid.major =
          element_blank(),panel.grid.minor =
          element_blank(), 
        axis.line = element_line(color= "gray"),
        axis.ticks.x=element_blank(),
        axis.ticks.y=element_blank())

Adding aesthetics: Themes

  • The ggthemes package provides additional themes.
  • theme_classic(): white background, no grid lines.

Types of themes

t1 <- bar01 + theme_classic()
t2 <- bar01 + theme_bw()
t3 <- bar01 + theme_minimal()
t4 <- bar01 + theme_economist()
t5 <- bar01 + theme_fivethirtyeight()
t6 <- bar01 + theme_hc()

Themes: theme_classic()

Themes: theme_bw()

Themes: theme_minimal()

Themes: theme_economist()

Themes: theme_fivethirtyeight()

Themes: theme_hc()

Saving your charts

  • ggsave("plot.png", width=5, height=5, units="in")
  • Saves in your working directory

PROJECT. Capital Bikeshare

bikesharedailydata.csv

1. Set up your workspace

2. Import the data

  • This data spans the District of Columbia, Arlington County, Alexandria, Montgomery County and Fairfax County.
  • The Capital Bikeshare system is owned by the participating jurisdictions and is operated by Motivate, a Brooklyn, NY-based company that operates several other bikesharing systems including Citibike in New York City, Hubway in Boston and Divvy Bikes in Chicago.

library(readr)
bikeshare <- read_csv("bikesharedailydata.csv")

View the data.

## # A tibble: 6 x 16
##   instant dteday season    yr  mnth holiday weekday workingday weathersit
##     <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
## 1       1 1/1/11      1     0     1       0       6          0          2
## 2       2 1/2/11      1     0     1       0       0          0          2
## 3       3 1/3/11      1     0     1       0       1          1          1
## 4       4 1/4/11      1     0     1       0       2          1          1
## 5       5 1/5/11      1     0     1       0       3          1          1
## 6       6 1/6/11      1     0     1       0       4          1          1
## # ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
## #   windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>

Understand the type of data you are working with.

Observations

  • One of the first things you may notice is the data dimensions, the number of rows and columns. Specifically there are 731 rows (observations) and 16 columns (variables or attributes).
  • However, the variable names listed at the first row of every column are not very descriptive.

Determine what the variables mean in the real world.

  • Take a look at the column named season. What is the meaning of season?
  • What are the possible values for this variable?

bikeshare$season
##   [1]  1  1  1  1  1  1 NA  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [24]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [47]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [70]  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2
##  [93]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [116]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [139]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [162]  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3
## [185]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [208]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [231]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [254]  3  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4
## [277]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [300]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [323]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [346]  4  4  4  4  4  4  4  4  4  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [369]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [392]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [415]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [438]  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [461]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [484]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [507]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [530]  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [553]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [576]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [599]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [622]  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4  4
## [645]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [668]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [691]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [714]  4  4  4  4  4  4  4  1  1  1  1  1  1  1  1  1  1  1

What type of variable is it?

  • It is numeric: read_csv() imported it as a double (<dbl>), although every value is a whole number.
  • You’ll notice that in the column season the values are whole numbers that range between 1 and 4.
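
Rather than reading through all 731 printed values, the type and the distinct values can be checked directly. This is a sketch on a short made-up vector named season that stands in for bikeshare$season; the same calls should work on the real column.

```r
# Hypothetical stand-in for bikeshare$season, including one missing value.
season <- c(1, 1, NA, 2, 2, 3, 4, 4, 1)

class(season)                              # "numeric": stored as doubles
sort(unique(season))                       # distinct codes; sort() drops the NA
counts <- table(season, useNA = "ifany")   # frequency of each code, plus NA
counts
```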

What do the numbers represent?

  • If we think about it, it’s unlikely that the numbers represent quantities.
  • Instead, they probably represent the seasons of the year because we know there are four seasons.

Understanding the four seasons

  • The numbers (1 through 4) are probably a code for each of the four seasons of the year.
  • Without additional information, such as a data dictionary or readme file, it would be impossible for the user of the data to know what the possible values of 1 through 4 correspond to in the categorical variable named season.

Review the data dictionary

This leads us to the next step, reviewing the data dictionary along with the data set to better understand the meaning behind the values.

The data dictionary

  • A data dictionary defines the characteristics of each of the data attributes.
  • If your data comes from a reputable source, odds are that it is accompanied by a data dictionary or metadata.
  • To know which season is represented by each number in the variable season we can review the data dictionary.

Reviewing the data dictionary

Field       Definition
instant     record index
dteday      date
season      season (1: winter, 2: spring, 3: summer, 4: fall)
yr          year (0: 2011, 1: 2012)
mnth        month (1 to 12)
hr          hour (0 to 23)
holiday     whether the day is a holiday or not
weekday     day of the week
workingday  1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit  1, 2, 3, 4:
            – 1: Clear, Few clouds, Partly cloudy
            – 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
            – 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
            – 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp        Normalized temperature in degrees F
atemp       Normalized feeling temperature in degrees F
hum         Normalized humidity
windspeed   Normalized wind speed
casual      count of casual users
registered  count of registered users
cnt         count of total rental bikes including both casual and registered

What did we learn?

  • season is a categorical variable defined by one of four values, each representing a season (1: winter, 2: spring, 3: summer, 4: fall).
  • year is coded with the value of 0 for 2011 and 1 for 2012, rather than actual year value of 2011 or 2012.

3. Prepare or tidy your data

  • At this point, you may want to rename the columns in your data set to make the data more usable when you begin the analysis.
  • Renaming columns is a manual process that involves changing each column name.
  • It is best practice to use lower case lettering and avoid spaces and hyphenation.

Renaming columns

There are two key ways to rename columns.

  • with rename() from dplyr
  • with names() from base

Way 1 - Renaming columns with rename() from the dplyr package.

library(dplyr)
bikeshare <- rename(bikeshare, humidity = hum, month=mnth)
names(bikeshare)
##  [1] "instant"    "dteday"     "season"     "yr"         "month"     
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"

Way 2 - Renaming columns with base R functions.

# Rename column where names is equal to "yr"
names(bikeshare)[names(bikeshare) == "yr"] <- "year"
names(bikeshare)
##  [1] "instant"    "dteday"     "season"     "year"       "month"     
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"

Identify missing values

  • We can use the is.na() function to count the number of NA values.
  • Let's look for missing values in the season column.

sum(is.na(bikeshare$season))
## [1] 1

Using iteration to identify missing values

  • In this case it seems necessary to "loop" through the entire data set and to identify which fields need closer inspection.
  • We can build a simple for loop to do this.

The for loop structure

Let's iterate through a vector of numbers using a for loop

myvector <- c(1,2,3,4)
for (i in myvector){
    print(paste ("Loop", i))
}
## [1] "Loop 1"
## [1] "Loop 2"
## [1] "Loop 3"
## [1] "Loop 4"

Let's refine the loop to include a counter and change the data a little.

myvector <- c(1,2,3,4,9,11)
mycounter <- 1
for (i in myvector){
    print(paste ("The value for loop number", mycounter,"is:", i))
    mycounter <- mycounter +1 # update 
}
## [1] "The value for loop number 1 is: 1"
## [1] "The value for loop number 2 is: 2"
## [1] "The value for loop number 3 is: 3"
## [1] "The value for loop number 4 is: 4"
## [1] "The value for loop number 5 is: 9"
## [1] "The value for loop number 6 is: 11"

Let's use the concept of iteration to traverse the data set and print out those NA values.

counter <- 0
for (i in bikeshare$season){
  counter <- counter + 1
  if (is.na(i)){
    # keep the message on one line so no line breaks end up in the output
    print(paste("It's true. There's an NA value on row", counter))
    print(bikeshare[counter,])
  }
}
## [1] "It's true. There's an NA value on row 7"
## # A tibble: 1 x 16
##   instant dteday season  year month holiday weekday workingday weathersit
##     <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
## 1       7 1/7/11     NA     0     1       0       5          1          2
## # ... with 7 more variables: temp <dbl>, atemp <dbl>, humidity <dbl>,
## #   windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>
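
The loop above checks one column. As a sketch of how the whole data set could be screened at once, the vectorized calls below count the NA values in every column and locate the affected rows. The small data frame bikes is made up for illustration and only mimics a few bikeshare columns.

```r
# Hypothetical miniature of the bikeshare data with two missing values.
bikes <- data.frame(
  season = c(1, 1, NA, 2),
  temp   = c(0.34, NA, 0.20, 0.23),
  cnt    = c(985, 801, 1349, 1562)
)

na_per_column <- colSums(is.na(bikes))   # number of NA values in each column
na_rows <- which(is.na(bikes$season))    # row positions where season is NA

na_per_column
na_rows
```

Columns with a count above zero are the ones that need closer inspection.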

Dealing with missing values

There are several ways you can tackle working with data that are incomplete. Each has its pros and cons.

  1. Ignore any record with missing values
  2. Replace empty fields with a pre-defined value
  3. Replace empty fields with the most frequently occurring value
  4. Use the mean value
  5. Manual approach
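
The first four options in the list above can be sketched in a few lines of base R. The vector x and the replacement value 0 are made up for illustration; which option is appropriate depends on the data.

```r
# Hypothetical vector with one missing value.
x <- c(2, 4, 4, NA, 7)

# 1. Ignore records with missing values
dropped <- x[!is.na(x)]

# 2. Replace with a pre-defined value (0 is an arbitrary choice here)
fixed <- ifelse(is.na(x), 0, x)

# 3. Replace with the most frequently occurring value
most_common <- as.numeric(names(which.max(table(x))))
by_mode <- ifelse(is.na(x), most_common, x)

# 4. Replace with the mean of the observed values
by_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
```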

Solution

  • In this case it's easy to replace the value with a pre-defined value.

  • We wouldn't want to ignore the record because the values can be easily determined.

Update the values

bikeshare$season[7]
## [1] NA
bikeshare$season[7] <- 1
bikeshare$season[7]
## [1] 1

4. Understand and visualize

bikesall <- ggplot(bikeshare, aes(atemp,cnt)) + geom_point(color="#4cbea3")

Let's see it!

bikesall

Refine it

bikes2011 <- ggplot(bikeshare[bikeshare$year < 1,],
                    aes(atemp,cnt)) +
  geom_point(color="#4cbea3") + 
  theme_few() + labs(title = "Rentals in 2011", 
                     x = "Average temp",y=" ")

2011

bikes2011

Plot 2012

bikes2012 <- ggplot(bikeshare[bikeshare$year > 0,],
                    aes(atemp,cnt)) +
  geom_point(color="#4cbea3") + 
  theme_few() + labs(title = "Rentals in 2012", 
                     x = "Average temp",y=" ")

2012

bikes2012

Now let’s arrange the charts side by side

We can do this by using the plot_grid() function from the cowplot package.

cowplot::plot_grid

  • Pass the two plot objects, bikes2011 and bikes2012, into the plot_grid() function to see the graphs plotted side by side.

plot_grid

cowplot::plot_grid(bikes2011, bikes2012, labels =c(" ", " "))

ITERATION

Loops

This section will introduce control structures for iteration known as loops. We will cover two types of loops:

  • while loop
  • for loop

Basic structure for a while loop.

# pseudocode

while (condition)
{
  # do something for as long as the condition is TRUE

}
# the loop exits once the condition becomes FALSE

Create a while loop

x <- 10
while (x > 0) {
 print(x)
 x <- x - 1 
} 
## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1

Create a while loop with a counter

counter <- 0
while (counter < 9) {
  print(counter)
  counter <- counter + 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8

Exercise 03 - Use the attitude data set

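# Work backward through the columns with a while loop

column <- ncol(attitude)

while (column > 0)
{
  # print the column name and its mean
  print(paste(names(attitude)[column], ":", mean(attitude[, column])))
  column <- column - 1  # move to the previous column
}

# The same calculation with a for loop, storing the results.
# The vector is named `means` so it does not mask the mean() function.
means <- NULL

for (i in names(attitude)){
  means[i] <- mean(attitude[,i])
}

means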

Solution

#

Basic structure for a for loop.

#pseudo code

Iteration using a for loop

Iterate through a vector of numbers using a for loop

for (i in c(1,2,3,4)){
    print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Exercise 04: Iterate through a column in the bikeshare data

#pseudo code

for (a in bikeshare$dteday){
  print (a)
}

Solution: Iterate through a column in the bikeshare data

# print each column name in the bikeshare data
for (i in names(bikeshare)){ 
  print(i)
}

CONDITIONALS

Let's review Boolean variables and logical operators

3 > 4
## [1] FALSE
c(1, 2, 3, 4, 5) > 4
## [1] FALSE FALSE FALSE FALSE  TRUE
c(1, 2, 3, 4, 6) == 3
## [1] FALSE FALSE  TRUE FALSE FALSE

Loops and conditional statements using if/else logic

Build a program that checks to see which prices are considered "cheap".

prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)
 
v <- c()
for (p in prices){
  # keep the price only if it is below 10
  if (p < 10) {
    v <- c(v, p)
  }
}
    
v
## [1] 9.99 7.25 0.50

Solution

#

Alternative approach

#
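
One possible alternative, sketched under the assumption that "cheap" means a price below 10 (the threshold used in the p<10 comparison above): R's comparison operators are vectorized, so the filtering can be done without a loop. The names is_cheap and cheap are hypothetical.

```r
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)

# The comparison returns one TRUE/FALSE per price...
is_cheap <- prices < 10

# ...and indexing with that logical vector keeps only the TRUE positions.
cheap <- prices[is_cheap]
cheap
```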

INFIX OPERATORS

  • Infix operators are very similar to functions.
  • You've been introduced to one infix operator, the %% remainder operator.

print(paste("The remainder from 5 / 3 is", 5%%3))
## [1] "The remainder from 5 / 3 is 2"

Table of infix operators

Operation              Operator  Example input        Example output
Remainder              %%        7 %% 2               1
Integer division       %/%       10 %/% 3             3
Matrix multiplication  %*%       3 %*% 6              1 x 1 matrix containing 18
Outer product          %o%       c(1:3) %o% c(0,1,2)  3 x 3 matrix (by column: 0,0,0,1,2,3,2,4,6)
Matching               %in%      1 %in% 1             TRUE

Integer division

We use the %/% operator for integer division. It is helpful in situations where non-integer values are not possible, for example when counting products such as tickets, furniture, or cars.

print(paste("Using integer division 5 / 3 is", 5%/%3))
## [1] "Using integer division 5 / 3 is 1"
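
To make the ticket example concrete, here is a small sketch that pairs %/% with the remainder operator %%. The budget and price are made-up numbers.

```r
# Hypothetical numbers: a budget of 50 and a ticket price of 12.
budget <- 50
ticket_price <- 12

tickets  <- budget %/% ticket_price   # whole tickets we can buy
leftover <- budget %%  ticket_price   # money left over afterwards
print(paste("Tickets:", tickets, "- left over:", leftover))
```

The two results always satisfy tickets * ticket_price + leftover == budget.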

Matrix multiplication

Unlike the * operator, which multiplies element by element, the %*% operator performs matrix multiplication. The result is a matrix.

mmatrix <- 2%*%5
mmatrix
##      [,1]
## [1,]   10
class(mmatrix)
## [1] "matrix"

Matrix Example

Matrix Example with code

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    4
##      [,1] [,2]
## [1,]    0    0
## [2,]    2    3
##      [,1] [,2]
## [1,]    4    6
## [2,]    8   12
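
The code for this slide is not shown in the source. One way to reproduce the three matrices printed above, assuming the third is the matrix product of the first two (the names A, B and AB are hypothetical):

```r
# Hypothetical names; matrix() fills its values column by column.
A <- matrix(c(1, 2, 2, 4), nrow = 2)
B <- matrix(c(0, 2, 0, 3), nrow = 2)

A
B

# Each entry of AB is a row of A multiplied by a column of B, then summed.
AB <- A %*% B
AB
```

For example, the top-left entry is 1*0 + 2*2 = 4.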

Outer product

The outer product operator, %o%, is handy for linear algebra and linear programming.

c(1:3)%o%c(0,1,2)
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    0    2    4
## [3,]    0    3    6

Outer product example

Outer product - example continued.

vector01 <-c(1:3)
vector02 <-c(0,1,2)
opvector <- vector01%o%vector02
opvector
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    0    2    4
## [3,]    0    3    6

Matching %in% operator

  • Useful for checking whether values are present in another set of values.
  • %in% tells you which items from the left-hand side are also in the right-hand side.

Using the %in% operator

Let's look at an example.

Example

v1 <- 3
v2 <- 101
t <- c(1,2,3,4,5,6,7,8)

Next, we use the %in% operator

v1 %in% t
## [1] TRUE

You try it with v2

Is the value of v2 present in the vector t?

Solution

v2 %in% t
## [1] FALSE

Example

 c(1:5) %in% c(3:8)
## [1] FALSE FALSE  TRUE  TRUE  TRUE

Example

We can use %in% to determine which values in one vector also appear in another vector.

Let's return to the example: the Country Name variable in the gdp data.

Read in the data

library(readr)
gdp <- read_csv("gdp.csv")

View the data

Country Name Country Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Aruba ABW NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1330167598 1320670391 1379888268 1531843575 1665363128 1722798883 1873452514 1920262570 1941094972 2021301676 2228279330 2.331006e+09 2.421475e+09 2.623726e+09 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NA NA NA NA NA NA
Afghanistan AFG 537777811 548888896 546666678 751111191 800000044 1006666638 1399999967 1673333418 1373333367 1408888922 1748886596 1831108971 1595555476 1733333264 2155555498 2366666616 2555555567 2953333418 3300000109 3697940410 3641723322 3478787909 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2461665938 4128820723 4583644246 5285465686 6.275074e+09 7.057598e+09 9.843842e+09 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.026425e+10 2.061610e+10 1.921556e+10 1.946902e+10 2.081530e+10
Angola AGO NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5930503401 5550483036 5550483036 5784341596 6131475065 7553560459 7072063345 8083872012 8769250550 10201099040 11228764963 10603784541 8307810974 5768720422 4438321017 5538749260 7526446606 7648377413 6506229607 6152922943 9129594819 8936063723 12497347956 14188949398 19640853734 2.823371e+10 4.178948e+10 6.044892e+10 8.417804e+10 7.549239e+10 8.252614e+10 1.041158e+11 1.139232e+11 1.249125e+11 1.267302e+11 1.026212e+11 9.533720e+10 1.242094e+11
Albania ALB NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1924242453 1965384586 2173750013 2156624900 2126000000 2335124988 2101624963 1139166646 709452584 1228071038 1985673798 2424499009 3314898292 2359903108 2707123772 3414760915 3632043908 4060758804 4435078648 5746945913 7314865176 8.158549e+09 8.992642e+09 1.070101e+10 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.138693e+10 1.188368e+10 1.303935e+10
Andorra AND NA NA NA NA NA NA NA NA NA NA 78619206 89409820 113408232 150820103 186558696 220127246 227281025 254020153 308008898 411578334 446416106 388958731 375895956 327861833 330070689 346737965 482000594 611316399 721425939 795449332 1029048482 1106928583 1210013652 1007025755 1017549124 1178738991 1223945357 1180597273 1211932398 1239876305 1434429703 1496912752 1733116883 2398645598 2935659300 3.255789e+09 3.543257e+09 4.016972e+09 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.877312e+09 3.012914e+09
Arab World ARB NA NA NA NA NA NA NA NA 25752663334 28425351599 31375728863 36415569616 43302571643 55001266851 105113069536 116300804387 144801082500 167256241961 183498400604 248568798879 338072174739 348484272974 324227785106 303867911387 307844905036 303799011536 288939171303 312584335589 307407305095 322224795592 446738041822 439642267667 471016834842 476365284407 487375131447 523357362511 578012134733 612896365522 590692883162 643147698720 734768117003 723282816386 729051715399 823110541435 963862340514 1.184662e+12 1.404114e+12 1.637573e+12 2.078116e+12 1.795820e+12 2.109551e+12 2.501305e+12 2.786139e+12 2.866038e+12 2.906918e+12 2.554480e+12 2.500164e+12 2.591047e+12
United Arab Emirates ARE NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 14720672507 19213022691 24871775165 23775831783 31225463218 43598748449 49333424135 46622718605 42803323345 41807954236 40603650232 33943612095 36384908744 36275674203 41464995914 50701443748 51552165622 54239171888 55625170253 59305093980 65743666576 73571233996 78839008445 75674336283 84445473111 104337372362 103311640572 109816201498 124346358067 147824370320 1.806170e+11 2.221165e+11 2.579161e+11 3.154746e+11 2.535474e+11 2.897873e+11 3.506660e+11 3.745906e+11 3.901076e+11 4.031371e+11 3.581351e+11 3.570451e+11 3.825751e+11
Argentina ARG NA NA 24450604878 18272123664 25605249382 28344705967 28630474728 24256667553 26436857248 31256284544 31584210366 33293199095 34733000536 52544000117 72436777342 52438647922 51169499891 56781000101 58082870156 69252328953 76961923742 78676842366 84307486837 103979106778 79092001998 88416668900 110934442763 111106191358 126206817196 76636898036 141352368715 189719984268 228788617202 236741715015 257440000000 258031750000 272149750000 292859000000 298948250000 283523000000 284203750000 268696750000 97724004252 127586973492 164657930453 1.987371e+11 2.325573e+11 2.875305e+11 3.615580e+11 3.329765e+11 4.236274e+11 5.301633e+11 5.459824e+11 5.520251e+11 5.263197e+11 5.947493e+11 5.548609e+11 6.375904e+11
Armenia ARM NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2256838858 2068526522 1272577456 1201313201 1315158670 1468317350 1596968913 1639492424 1893726437 1845482181 1911563665 2118467913 2376335048 2807061009 3576615240 4.900470e+09 6.384452e+09 9.206302e+09 1.166204e+10 8.647937e+09 9.260285e+09 1.014211e+10 1.061932e+10 1.112147e+10 1.160951e+10 1.055334e+10 1.054614e+10 1.153659e+10
American Samoa ASM NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 514000000 527000000 512000000 5.030000e+08 4.960000e+08 5.200000e+08 5.630000e+08 6.780000e+08 5.760000e+08 5.740000e+08 6.440000e+08 6.410000e+08 6.430000e+08 6.590000e+08 6.580000e+08 NA
##  [1] "Country Name" "Country Code" "1960"         "1961"        
##  [5] "1962"         "1963"         "1964"         "1965"        
##  [9] "1966"         "1967"         "1968"         "1969"        
## [13] "1970"         "1971"         "1972"         "1973"        
## [17] "1974"         "1975"         "1976"         "1977"        
## [21] "1978"         "1979"         "1980"         "1981"        
## [25] "1982"         "1983"         "1984"         "1985"        
## [29] "1986"         "1987"         "1988"         "1989"        
## [33] "1990"         "1991"         "1992"         "1993"        
## [37] "1994"         "1995"         "1996"         "1997"        
## [41] "1998"         "1999"         "2000"         "2001"        
## [45] "2002"         "2003"         "2004"         "2005"        
## [49] "2006"         "2007"         "2008"         "2009"        
## [53] "2010"         "2011"         "2012"         "2013"        
## [57] "2014"         "2015"         "2016"         "2017"

Example - continued 1 of 2

Luckily, your professor has done the work of identifying which of these country codes represent aggregate values.

You've been provided with the file aggregatecodes.csv.

agg <- read_csv("aggregatecodes.csv")
summary(agg)
##  aggregate_code    
##  Length:46         
##  Class :character  
##  Mode  :character

View the data

aggregate_code
ARB
CEB
CSS
EAP
EAR
EAS
ECA
ECS
EMU
EUU
FCS
HIC
HPC
IBD

Example - Continued 2 of 2

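Let's use the %in% infix function. The leading ! negates the match, so we keep only the rows whose Country Code is not in the list of aggregate codes.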

gdpagg<-gdp[!gdp$`Country Code` %in% 
              agg$aggregate_code, ,drop= FALSE]
knitr::kable(head(gdpagg,10))
Country Name Country Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Aruba ABW NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1330167598 1320670391 1379888268 1531843575 1665363128 1722798883 1873452514 1920262570 1941094972 2021301676 2228279330 2331005587 2421474860 2623726257 2791960894 2498932961 2467703911 2584463687 NA NA NA NA NA NA
Afghanistan AFG 537777811 548888896 546666678 751111191 800000044 1006666638 1399999967 1673333418 1373333367 1408888922 1748886596 1831108971 1595555476 1733333264 2155555498 2366666616 2555555567 2953333418 3300000109 3697940410 3641723322 3478787909 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2461665938 4128820723 4583644246 5285465686 6275073572 7057598407 9843842455 10190529882 12486943506 15936800636 17930239400 20536542737 20264253974 20616104299 19215562179 19469022208 20815300220
Angola AGO NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5930503401 5550483036 5550483036 5784341596 6131475065 7553560459 7072063345 8083872012 8769250550 10201099040 11228764963 10603784541 8307810974 5768720422 4438321017 5538749260 7526446606 7648377413 6506229607 6152922943 9129594819 8936063723 12497347956 14188949398 19640853734 28233712738 41789479932 60448924662 84178035579 75492385928 82526143645 104115807986 113923162050 124912503781 126730196125 102621215573 95337203468 124209385825
Albania ALB NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1924242453 1965384586 2173750013 2156624900 2126000000 2335124988 2101624963 1139166646 709452584 1228071038 1985673798 2424499009 3314898292 2359903108 2707123772 3414760915 3632043908 4060758804 4435078648 5746945913 7314865176 8158548717 8992642349 10701011897 12881352688 12044212904 11926953259 12890867539 12319784787 12776277515 13228244357 11386931490 11883682171 13039352744
Andorra AND NA NA NA NA NA NA NA NA NA NA 78619206 89409820 113408232 150820103 186558696 220127246 227281025 254020153 308008898 411578334 446416106 388958731 375895956 327861833 330070689 346737965 482000594 611316399 721425939 795449332 1029048482 1106928583 1210013652 1007025755 1017549124 1178738991 1223945357 1180597273 1211932398 1239876305 1434429703 1496912752 1733116883 2398645598 2935659300 3255789081 3543256806 4016972351 4007353157 3660530703 3355695364 3442062830 3164615187 3281585236 3350736367 2811489409 2877311947 3012914131
United Arab Emirates ARE NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 14720672507 19213022691 24871775165 23775831783 31225463218 43598748449 49333424135 46622718605 42803323345 41807954236 40603650232 33943612095 36384908744 36275674203 41464995914 50701443748 51552165622 54239171888 55625170253 59305093980 65743666576 73571233996 78839008445 75674336283 84445473111 104337372362 103311640572 109816201498 124346358067 147824370320 180617018380 222116541865 257916133424 315474615739 253547358747 289787338325 350666031314 374590605854 390107556161 403137100068 358135057862 357045064670 382575085092
Argentina ARG NA NA 24450604878 18272123664 25605249382 28344705967 28630474728 24256667553 26436857248 31256284544 31584210366 33293199095 34733000536 52544000117 72436777342 52438647922 51169499891 56781000101 58082870156 69252328953 76961923742 78676842366 84307486837 103979106778 79092001998 88416668900 110934442763 111106191358 126206817196 76636898036 141352368715 189719984268 228788617202 236741715015 257440000000 258031750000 272149750000 292859000000 298948250000 283523000000 284203750000 268696750000 97724004252 127586973492 164657930453 198737095012 232557260817 287530508431 361558037110 332976484578 423627422092 530163281575 545982375701 552025140252 526319673732 594749285413 554860945014 637590419269
Armenia ARM NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2256838858 2068526522 1272577456 1201313201 1315158670 1468317350 1596968913 1639492424 1893726437 1845482181 1911563665 2118467913 2376335048 2807061009 3576615240 4900469950 6384451606 9206301700 11662040714 8647936748 9260284938 10142111335 10619320049 11121465767 11609512940 10553337673 10546135160 11536590636
American Samoa ASM NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 514000000 527000000 512000000 503000000 496000000 520000000 563000000 678000000 576000000 574000000 644000000 641000000 643000000 659000000 658000000 NA
Antigua and Barbuda ATG NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 77496741 87879333 109079963 131431037 147841741 164369296 182144111 208372852 240923926 290440148 337174852 398637741 438794778 459469074 481706333 499281148 535172778 589429593 577280741 633730630 680617111 727860593 766198926 830158778 800740259 814615333 855643111 919577148 1022191296 1157005444 1311401333 1368431037 1224253000 1152469074 1142042926 1211411704 1192925407 1280133333 1364863037 1460144704 1532397556
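The same filtering pattern is easier to inspect on a small data frame. The sketch below is illustrative only: toy, toy_agg and the value column are hypothetical, and the numbers are placeholders rather than GDP figures.

```r
# check.names = FALSE keeps the space in the column name, as in the gdp data
toy <- data.frame(
  "Country Code" = c("ABW", "ARB", "AFG", "EUU"),
  value = c(10, 20, 30, 40),   # placeholder values
  check.names = FALSE,
  stringsAsFactors = FALSE
)
toy_agg <- c("ARB", "EUU")     # codes treated as aggregates in this sketch

# TRUE for rows whose code is an aggregate
is_aggregate <- toy$`Country Code` %in% toy_agg

# Keep the non-aggregates; drop = FALSE guarantees a data frame comes back
countries_only  <- toy[!is_aggregate, , drop = FALSE]
aggregates_only <- toy[is_aggregate, , drop = FALSE]

countries_only
```

Storing the logical vector in its own variable first makes it possible to check how many rows will be dropped, for example with sum(is_aggregate), before subsetting.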

FUNCTIONS

Some functions are built in, such as:

toupper("hello world")
## [1] "HELLO WORLD"
mean(c(1,2,3,4,5))
## [1] 3
is.numeric(4)
## [1] TRUE
is.na(NA)
## [1] TRUE
sqrt(25)
## [1] 5

User Defined Functions

  • Functions are useful for executing repetitive commands.
  • Planning is key to writing effective functions.

Example

functionname <- function(x){
  return(print(paste("The value", x,   "is returned")))
}
functionname(34)
## [1] "The value 34 is returned"

Writing our own functions

  • We write our own functions for repetitive tasks or for algorithms.
  • For example, suppose we want to have a function to act on our data that adds 2 to every value.

Function Pseudocode

myfunction <- function(x) {
  # add two to x
  x <- x + 2
  # print the result; print() also returns it, so it becomes the function's value
  print(x)
}

#apply the function

myfunction(5)

#x+2 is returned from the function

Exercise 06 - Write a function to act on our data that adds 2 to every value.

##CODE HERE

How can we plan for user error?

f <- function(x) {
  # Check the input type before doing any arithmetic
  if (!is.numeric(x)) {
    print("Sorry. This function requires a value of type numeric.")
  } else {
    x + 2
  }
}
f(37)
## [1] 39
f("Hi class of 2020!")
## [1] "Sorry. This function requires a value of type numeric."
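Printing a message is one way to respond to bad input. Another common base R approach is to signal an error with stop() and let the caller decide what to do with tryCatch(). The sketch below is an alternative to the function above, not a replacement for it; the name add_two and the message text are assumptions.

```r
# Hypothetical variant that raises an error instead of printing a message
add_two <- function(x) {
  if (!is.numeric(x)) {
    stop("add_two() requires a value of type numeric.")
  }
  x + 2
}

add_two(37)

# The caller can catch the error and fall back to NA
result <- tryCatch(
  add_two("Hi class of 2020!"),
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA
  }
)
result
```

One reason to prefer stop() in larger scripts: a function that only prints its apology still returns that character string, so later code may carry on with a value it did not expect.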

Pass in multiple arguments

addTogether <- function(x, y) {
  # Both arguments must be numeric
  if (is.numeric(x) && is.numeric(y)) {
    x + y
  } else {
    print("Sorry, please enter two numbers")
  }
}

#Call the function
addTogether(5, 10)
## [1] 15

Alternative function call, with arguments specified by name

addTogether(x = 5, y = 10) 
## [1] 15
#Passing in non-numeric data
addTogether(x=4, y="Hey what's up?")
## [1] "Sorry, please enter two numbers"
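Naming arguments matters most when their order affects the result. As a small sketch, the hypothetical function subtractValues below (not part of the course materials) shows that named arguments can be given in any order and that an argument with a default value can be left out.

```r
# Hypothetical function: y has a default value of 1
subtractValues <- function(x, y = 1) {
  x - y
}

by_position <- subtractValues(10, 3)          # x = 10, y = 3
by_name     <- subtractValues(y = 3, x = 10)  # same call, order swapped
with_default <- subtractValues(10)            # y falls back to 1

by_position
by_name
with_default
```

Calling by name costs a few extra characters but makes the call self-documenting, which helps once a function has more than two or three arguments.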

Exercise 07: Write a function that averages two numbers

##CODE HERE
f <- function(x, y) {
  if (is.numeric(x) && is.numeric(y))
  {
  mean(c(x,y))
  } else {
    print("Sorry, please enter two numbers")
  }
}

f(2,3)
## [1] 2.5

Solution

#
#
#
#

Apply a function to all elements of input

What if we could apply a function to all the elements of the input?

  • Input is a list, vector, or data frame
  • Output is a vector (or matrix)

sapply() from the apply family of functions

sapply(X, FUN)

Arguments:

  • X: A vector, list, or data frame
  • FUN: Function applied to each element of X

Example - sapply()

df1 <-as.data.frame(c(1,2,3,4,5,6,7))

sapply(df1, max)
## c(1, 2, 3, 4, 5, 6, 7) 
##                      7

Alternative to using as.data.frame

library(tibble)
df1 <-as_tibble(c(1,2,3,4,5,6,7))
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.
sapply(df1, max)
## value 
##     7

What if you didn't convert to a data frame or tibble?

df3 <-c(1,2,3,4,5,6,7)
sapply(df3, max)
## [1] 1 2 3 4 5 6 7
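The difference comes from what sapply() treats as an "element". For a plain vector, each number is its own element; for a data frame or list, each column or list entry is an element. A minimal sketch, with all variable names chosen for illustration:

```r
values <- c(1, 2, 3, 4, 5, 6, 7)

# On a vector, max() is called once per number, so nothing changes
per_element <- sapply(values, max)

# Calling max() directly treats the whole vector as one input
whole_vector <- max(values)

# On a data frame, max() is called once per column
as_column <- sapply(data.frame(values = values), max)

# On a list, max() is called once per list entry
as_list <- sapply(list(a = 1:3, b = 4:7), max)

per_element
whole_vector
as_column
as_list
```

So for a single vector there is usually no need for sapply() at all; it earns its place when there are several columns or list entries to summarise.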

A function that takes the mean

avg <-function(x){
  mean(x, na.rm=TRUE)
}
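To see why na.rm = TRUE is included, it helps to compare avg() with plain mean() on data that contains a missing value. The sketch below repeats the avg() definition so that it runs on its own; with_na and col_means are hypothetical names, and attitude is a data set that ships with R.

```r
avg <- function(x) {
  mean(x, na.rm = TRUE)
}

with_na <- c(2, 4, NA, 6)

mean(with_na)   # NA, because one value is missing
avg(with_na)    # missing value is dropped before averaging

# Apply avg() to every column of a built-in data frame
col_means <- sapply(attitude, avg)
col_means
```

The result of the sapply() call is a named numeric vector with one average per column.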

Apply a function over a vector using sapply()

f <- function(x) x^2
sapply(c(1,2,3,4,5),f)
## [1]  1  4  9 16 25

sapply(attitude,f)
##       rating complaints privileges learning raises critical advance
##  [1,]   1849       2601        900     1521   3721     8464    2025
##  [2,]   3969       4096       2601     2916   3969     5329    2209
##  [3,]   5041       4900       4624     4761   5776     7396    2304
##  [4,]   3721       3969       2025     2209   2916     7056    1225
##  [5,]   6561       6084       3136     4356   5041     6889    2209
##  [6,]   1849       3025       2401     1936   2916     2401    1156
##  [7,]   3364       4489       1764     3136   4356     4624    1225
##  [8,]   5041       5625       2500     3025   4900     4356    1681
##  [9,]   5184       6724       5184     4489   5041     6889     961
## [10,]   4489       3721       2025     2209   3844     6400    1681
## [11,]   4096       2809       2809     3364   3364     4489    1156
## [12,]   4489       3600       2209     1521   3481     5476    1681
## [13,]   4761       3844       3249     1764   3025     3969     625
## [14,]   4624       6889       6889     2025   3481     5929    1225
## [15,]   5929       5929       2916     5184   6241     5929    2116
## [16,]   6561       8100       2500     5184   3600     2916    1296
## [17,]   5476       7225       4096     4761   6241     6241    3969
## [18,]   4225       3600       4225     5625   3025     6400    3600
## [19,]   4225       4900       2116     3249   5625     7225    2116
## [20,]   2500       3364       4624     2916   4096     6084    2704
## [21,]   2500       1600       1089     1156   1849     4096    1089
## [22,]   4096       3721       2704     3844   4356     6400    1681
## [23,]   2809       4356       2704     2500   3969     6400    1369
## [24,]   1600       1369       1764     3364   2500     3249    2401
## [25,]   3969       2916       1764     2304   4356     5625    1089
## [26,]   4356       5929       4356     3969   7744     5776    5184
## [27,]   6084       5625       3364     5476   6400     6084    2401
## [28,]   2304       3249       1936     2025   2601     6889    1444
## [29,]   7225       7225       5041     5041   5929     5476    3025
## [30,]   6724       6724       1521     3481   4096     6084    1521

Exercise 08: Try it using a function on a column in the bikeshare data.

##CODE HERE

Solution

#
#

PROJECT: GDP Analysis

Steps

  • Import data set
  • Prepare - remove values that are not countries.
  • Understand - summary stats & visualize
  • Communicate

Understand

Use our avg() function

Let's take the average of selected columns.

gdpagg_avg<-sapply(gdpagg[3:50],avg)

sapply()

gdpagg_avg <- as.data.frame(gdpagg_avg)
knitr::kable(head(gdpagg_avg, 10))
gdpagg_avg
1960 11472836298
1961 11934440519
1962 12805007499
1963 13783853462
1964 15099033075
1965 15198438002
1966 16343121604
1967 16987893024
1968 18073927393
1969 19902605245
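In the table above the years live in the row names, which makes them awkward to plot or filter. One possible way to turn them into a regular column is sketched below on a toy named vector; toy_avg, avg_by_year and the numbers are placeholders, not values from the gdp data.

```r
# A named numeric vector, shaped like the result of sapply() over year columns
toy_avg <- c("1960" = 100, "1961" = 110, "1962" = 125)   # placeholder values

# Move the names into their own column and drop them from the values
avg_by_year <- data.frame(
  year = as.integer(names(toy_avg)),
  avg_gdp = unname(toy_avg)
)

avg_by_year
```

With year stored as an integer column, the averages can be sorted, subset, or passed to a plotting function like any other variable.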

Calculate summary statistics

summaryfun <- function(x) {
  xmean <- mean(x, na.rm = TRUE)
  xmedian <- median(x, na.rm = TRUE)
  print(paste("The mean is", prettyNum(xmean, big.mark = ",", scientific = FALSE)))
  print(paste("The median is", prettyNum(xmedian, big.mark = ",", scientific = FALSE)))
}

summaryfun()

  • After we write the function, it is available for us to use at any point in our workspace, just like a variable.
  • X is the data and FUN is the function. In our case, the data is gdpagg$`2017` and the function is summaryfun.

summaryfun(gdpagg$`2017`)
## [1] "The mean is 421,352,224,909"
## [1] "The median is 35,052,862,071"
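A function that prints its results is convenient for reading, but the numbers cannot easily be reused. As a hedged alternative, the sketch below returns the statistics as a named vector so that sapply() can collect them across columns. The name summary_values and the toy data frame are assumptions for illustration.

```r
# Hypothetical variant of summaryfun() that returns values instead of printing
summary_values <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE))
}

# Placeholder data with one missing value
toy <- data.frame(a = c(1, 2, 3, 100), b = c(10, NA, 30, 50))

# One column of results per column of input
stats <- sapply(toy, summary_values)
stats

# Formatting can then be applied at display time
prettyNum(stats["mean", "a"], big.mark = ",")
```

Separating the calculation from the formatting keeps the function useful both for reports and for further analysis.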

RSHINY

install.packages("shiny")

What is shiny?

Video - Movie Explorer

Video - Marathon Training

Video - Intelligencia

Building a shiny app

The components of a shiny app

  • To build a Shiny app in R, start with a template.
  • In RStudio, go to File -> New File -> Shiny Web App. Choose a single file web application.
  • app.R will be the file you modify. It is saved in a new directory.
  • The directory name is your app name.

Shiny app template (app.R)

library(shiny)

# Defines the user interface through nested R functions
ui <- fluidPage()

# Specifies how to build and rebuild the R objects displayed in the ui
server <- function(input, output) {}

# Combines ui and server into an app; launch it with runApp()
shinyApp(ui = ui, server = server)
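To make the template concrete, here is a minimal sketch of a filled-in app with one input and one output. It assumes the shiny package is installed; the ids "n" and "hist", the labels, and the choice of a histogram are all illustrative rather than taken from the course app.

```r
library(shiny)

# ui: one slider input and one placeholder for a plot
ui <- fluidPage(
  sliderInput(inputId = "n", label = "Number of values",
              min = 1, max = 100, value = 25),
  plotOutput(outputId = "hist")
)

# server: rebuild the plot whenever the slider value changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal values")
  })
}

# Create the app object; assigning it does not launch the app
app <- shinyApp(ui = ui, server = server)

# In an interactive session, launch it with:
# runApp(app)
```

The inputId given in the ui is the name used to read the value inside server (input$n), and the outputId is the name the server assigns to (output$hist).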

ui Inputs

server Outputs

Example prototypes - shiny apps

runExample("02_text")
runExample("03_reactivity")
runExample("04_mpg")
runExample("05_sliders")
runExample("06_tabsets")
runExample("07_widgets")
runExample("08_html")
runExample("09_upload")
runExample("11_timer")

runApp("nyuclasses")

App walkthrough

Exercise 09

  • Revise the app to provide a default view of the most recent distributions by most recent assignment due date

  • Revise the app to include a selector for one or more students

  • Revise the app to include doughnut charts that show completed, late, or incomplete assessments by assignment

  • Revise the app to include a student list

Exercise 09 - Prototype specification

Review

Homework

  • Submit 2 files from today
  • .Rmd and app.R via NYU Classes > Assignments > In Class Worksheet.
  • Name your files lastname_first_inclass.Rmd & lastname_first_app.R

THANK YOU