Foundations of Statistics Using R

5/22/2019

New to programming?

##          group value
## 1          New    20
## 2       Novice    20
## 3 Intermediate    11
## 4     Advanced     3

Prior knowledge of programming

Learning programming

A new way of thinking, not just a new skill
A new language for speaking and reading (vectors, data frames, functions, objects, etc.)
A new syntax for writing c(), print(), cat(), sort(), require(), subset()

New to R

Prior knowledge of R

Turn to the person next to you and share the following:

Outcomes

Review what you’ve learned in lessons 1 through 6.

Introduce you to techniques and approaches to programming in R.

Give you an opportunity to practice and solve problems.

PRE MODULE REVIEW

SETUP

Setting up your R world

Download the rmsba.zip file and unzip it.
Rename the unzipped folder that contains the files to rmsba. Move it to a location you can find.
Create a new project in RStudio named rmsba.
Set your working directory to the rmsba folder you created above.
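If you prefer to set the working directory from the console rather than the RStudio menus, the sketch below shows one way. The folder location is an assumption: a temporary directory is used here so the example runs anywhere, and you would substitute the path where you actually placed rmsba.

```r
# Hypothetical location for the rmsba folder; replace with your own path.
project_dir <- file.path(tempdir(), "rmsba")
dir.create(project_dir, showWarnings = FALSE)

# setwd() returns the previous working directory, so it can be restored later.
previous_dir <- setwd(project_dir)

# Confirm where R is now looking for files.
current_folder <- basename(getwd())
print(current_folder)

# List the files R can see in the working directory.
print(list.files())

# Restore the original working directory.
setwd(previous_dir)
```

When you open an RStudio project, the working directory is normally set to the project folder for you, so this step mostly matters when you work outside a project.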

PART I: BEGINNING DATA PROJECTS

When working with new data, what initial questions do you have?

What does your data represent in the real world?
How is this real-world phenomenon characterized by the data that you have?
From what time period is the data?
What's the source?

Basic understanding

Once you have this basic understanding of your data you can dig deeper.

You can use visualization techniques to explore your data and derive some basic understandings of the phenomena you are studying, such as the largest and smallest values for each variable.

Calculating summary statistics can translate the data into information by revealing the shape of the data, the mean, median, minimum value, maximum value, and variability.
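As a small illustration of those summary statistics, the sketch below uses the seven values from the 2000 column of the world internet usage data introduced later in this deck. The vector name usage_2000 is only an illustrative choice.

```r
# Internet users per 100 people in 2000, in this order: China, Mexico, Panama,
# Senegal, Singapore, United Arab Emirates, United States.
usage_2000 <- c(1.78, 5.08, 6.55, 0.40, 36.00, 23.63, 43.08)

# Minimum, quartiles, median, mean and maximum in one call.
summary(usage_2000)

# Standard deviation, one common measure of variability.
sd(usage_2000)

# Individual statistics can also be computed one at a time.
usage_median <- median(usage_2000)
usage_range <- range(usage_2000)   # smallest and largest values
usage_median
usage_range
```

The mean sits well above the median here, which hints that a few high-usage countries pull the distribution to the right.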

The process

For any data science project there are a few simple steps to follow.

EXAMPLE PROJECT. WORLD INTERNET USAGE

world_internet_usage.csv

1) Set up your workspace (generic steps)

Begin by creating a new folder that contains your data.
Then create a new project in RStudio.
Set your working directory to the folder you created above.
Create a new RMarkdown document.
Save it in your working directory.

Let's create a new `RMarkdown` document

Go to File > New File > R Markdown
Save the file in your working directory
Name it internet.Rmd

R MARKDOWN

Planning your programs, presenting your code, and sharing your work.

Create .Rmd
Write text
Embed code
Render output

Resource: https://bookdown.org/yihui/rmarkdown/

R Markdown Example

---
title: "Hello R Markdown"
author: "Awesome Me"
date: "2018-02-14"
output: html_document
---

This is a paragraph in an R Markdown document.
Below is a sample code chunk:

#{r chunkname, echo=TRUE, eval=TRUE}

myformula <- (2+2)

R Markdown Components

metadata - YAML header
text - Text outside of code chunks
code - R code chunks
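To see how the three components fit together in an actual file, the sketch below writes a minimal .Rmd from R. It is only an illustration: the file goes to a temporary folder, the fence variable is a helper introduced here to build the three-backtick chunk delimiter, and the render step is commented out because it assumes the rmarkdown package and pandoc are installed.

```r
# A code chunk starts and ends with a line of three backticks.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                                  # metadata: YAML header
  'title: "Hello R Markdown"',
  "output: html_document",
  "---",
  "",
  "This is a paragraph in an R Markdown document.",       # text
  "",
  paste0(fence, "{r chunkname, echo=TRUE, eval=TRUE}"),   # code chunk opens
  "myformula <- (2+2)",
  fence                                                   # code chunk closes
)

# Write the document to a temporary location (an assumption for this sketch).
rmd_path <- file.path(tempdir(), "internet.Rmd")
writeLines(rmd_lines, rmd_path)

# Show the file exactly as it was written.
cat(readLines(rmd_path), sep = "\n")

# Rendering assumes rmarkdown and pandoc are available:
# rmarkdown::render(rmd_path)
```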

R Markdown YAML header

YAML: YAML Ain't Markup Language

#title: "Hello R Markdown"
#author: "Awesome Me"
#date: "2019-05-20"
#output: html_document

R Markdown YAML common output options

Option	Creates
html_document	HTML
pdf_document	PDF (requires TeX)
word_document	Microsoft Word (.docx)
github_document	GitHub-compatible Markdown
ioslides_presentation	ioslides HTML slides

R Markdown common text options

# Header 1
## Header 2
### Header 3
**Bold**
_Italics_

R Markdown chunk options

Option	Default
eval	TRUE
echo	TRUE
warning	TRUE
error	FALSE

echo=TRUE and echo=FALSE

echo=TRUE

2+2

## [1] 4

echo=FALSE

## [1] 4

fig.height=3, fig.width=4

plot(attitude)

See R Markdown cheatsheet

https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf

2) Import your data

We have a couple different options.

read_csv() function from the readr package
read.csv() function from the utils package

readr versus base R

readr functions are typically faster (~10x).
They produce tibbles, and they don’t convert character vectors to factors, use row names, or munge the column names.
Base R functions inherit some behavior from your OS and environment variables, so import code that works on your computer might not work on someone else’s.
Column names that are numbers stay as 2002 with readr; base R converts them to X2002.
Column names that contain spaces (bad practice) stay as "Country Name" with readr; base R converts them to Country.Name.
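The column-name behavior can be seen with base R alone. The sketch below reads a tiny made-up CSV from a string (the csv_text contents are an assumption for illustration) and compares the default with check.names = FALSE, which keeps the names as written in the file.

```r
# A tiny, made-up CSV with a space in one header and numbers in the others.
csv_text <- "Country Name,2000,2001\nChina,1.78,2.64\nMexico,5.08,7.04"

# Default base R behavior: headers are converted to syntactically valid names.
munged <- read.csv(text = csv_text)
names(munged)

# check.names = FALSE keeps the headers exactly as they appear in the file.
kept <- read.csv(text = csv_text, check.names = FALSE)
names(kept)

# Names that are not syntactically valid need backticks when used with $.
kept$`2000`
```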

`read.csv()`

internet_baser <- read.csv("world_internet_usage.csv")
head(internet_baser, 5)

##     country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009
## 1     China  1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00 22.60 28.90
## 2    Mexico  5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81 21.71 26.34
## 3    Panama  6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29 33.82 39.08
## 4   Senegal  0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70 10.60 14.50
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90 69.00 69.00
##   X2010 X2011 X2012
## 1 34.30 38.30 42.30
## 2 31.05 34.96 38.42
## 3 40.10 42.70 45.20
## 4 16.00 17.50 19.20
## 5 71.00 71.00 74.18

`read_csv()`

library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
head(internet_readr, 5)

## # A tibble: 5 x 14
##   country `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China     1.78   2.64   4.6    6.2    7.3    8.52  10.5    16     22.6
## 2 Mexico    5.08   7.04  11.9   12.9   14.1   17.2   19.5    20.8   21.7
## 3 Panama    6.55   7.27   8.52   9.99  11.1   11.5   17.4    22.3   33.8
## 4 Senegal   0.4    0.98   1.01   2.1    4.39   4.79   5.61    7.7   10.6
## 5 Singap~  36     41.7   47     53.8   62     61     59      69.9   69  
## # ... with 4 more variables: `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>

Use `read_csv()`

As a best practice, we're going to use the read_csv() function from the readr package to import the world_internet_usage.csv data as a tibble data frame.

class(internet_readr)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

3) Prepare or tidy your data

Take a look at your data.

knitr::kable(head(internet_readr, 10))

country	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
China	1.78	2.64	4.60	6.20	7.30	8.52	10.52	16.00	22.60	28.90	34.30	38.30	42.30
Mexico	5.08	7.04	11.90	12.90	14.10	17.21	19.52	20.81	21.71	26.34	31.05	34.96	38.42
Panama	6.55	7.27	8.52	9.99	11.14	11.48	17.35	22.29	33.82	39.08	40.10	42.70	45.20
Senegal	0.40	0.98	1.01	2.10	4.39	4.79	5.61	7.70	10.60	14.50	16.00	17.50	19.20
Singapore	36.00	41.67	47.00	53.84	62.00	61.00	59.00	69.90	69.00	69.00	71.00	71.00	74.18
United Arab Emirates	23.63	26.27	28.32	29.48	30.13	40.00	52.00	61.00	63.00	64.00	68.00	78.00	85.00
United States	43.08	49.08	58.79	61.70	64.76	67.97	68.93	75.00	74.00	71.00	74.00	77.86	81.03

How could you visualize this to better understand it?

A histogram?

What function can we use?

Building a histogram

hist(internet_readr$`2000`)

Adding aesthetics

hist(internet_readr$`2000`, breaks=8, 
     main="Internet usage for 2000", 
     col="magenta", xlab=" ", labels=TRUE)

A histogram for every year using par(mfrow=c(8,2))

par(mfrow=c(8,2))
hist(internet_readr$`2000`, col= "blue", labels=TRUE, breaks = 4)
hist(internet_readr$`2001`)
hist(internet_readr$`2002`)
hist(internet_readr$`2003`)
hist(internet_readr$`2004`)
hist(internet_readr$`2005`)
hist(internet_readr$`2006`)
hist(internet_readr$`2007`)
hist(internet_readr$`2008`)
hist(internet_readr$`2009`)
hist(internet_readr$`2010`)
hist(internet_readr$`2011`)
hist(internet_readr$`2012`)
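The repeated hist() calls can also be written as a loop over the year columns. The sketch below is one possible approach, not the only one: it uses a small stand-in data frame named usage (values copied from the first three year columns shown earlier) so that it runs without the CSV file.

```r
# Stand-in for internet_readr: the first three year columns of the example data.
usage <- data.frame(
  country = c("China", "Mexico", "Panama", "Senegal", "Singapore",
              "United Arab Emirates", "United States"),
  `2000` = c(1.78, 5.08, 6.55, 0.40, 36.00, 23.63, 43.08),
  `2001` = c(2.64, 7.04, 7.27, 0.98, 41.67, 26.27, 49.08),
  `2002` = c(4.60, 11.90, 8.52, 1.01, 47.00, 28.32, 58.79),
  check.names = FALSE   # keep the numeric column names as they are
)

# Every column except country holds one year of data.
year_cols <- setdiff(names(usage), "country")

# One row of panels, one panel per year; par() returns the old settings.
old_par <- par(mfrow = c(1, length(year_cols)))

for (yr in year_cols) {
  hist(usage[[yr]],
       main = paste("Internet usage for", yr),
       xlab = "Users per 100 people")
}

# Restore the previous plotting layout.
par(old_par)
```

With the full data set, the same loop would cover all thirteen years once the panel layout is enlarged accordingly.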

Histogram matrix

Exercise 01

#Redo the histogram matrix 
#and add aesthetics

Boxplots

boxplot(internet_readr$`2000`, 
        main="Internet usage for 2000", 
        col="magenta", 
        xlab=median(internet_readr$`2000`),
        frame.plot=FALSE, horizontal=TRUE, 
        border="dark blue")

Multiple boxplots

Exercise 02

#Build box plots for 2000 -2012 
#based on the example above.  
#Include aesthetics.

How else might you want to visualize this data to better understand it?

How would you create a bar or line graph using ggplot?

Technically, we would use geom_col(), geom_bar() or geom_line().

However, what variable would you map for the x-axis and the y-axis?

For example…

ggplot(internet_readr, aes(XVALUE, YVALUE)) +
  geom_col()

Let's think about how we would use the `ggplot()` function.

ggplot(internet_readr, aes(X, Y)) +
  geom_col()

PART II: Data transformation and advanced visualization

Wide to long formats

We need to reshape our data from wide to long.

## [1] "Wide format"

country	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
China	1.78	2.64	4.6	6.2	7.3	8.52	10.52	16.00	22.60	28.90	34.30	38.30	42.30
Mexico	5.08	7.04	11.9	12.9	14.1	17.21	19.52	20.81	21.71	26.34	31.05	34.96	38.42

## [1] "Long format"

country	year	usage
China	2000	1.78
Mexico	2000	5.08

Using the `gather()` function to reshape.

We use the gather() function from the tidyr package to reshape a tibble from wide to long form.
Note the use of the pipe %>% that passes the left hand side of the operator to the first argument of the right hand side of the operator.

library(tidyr)  # provides gather() and the %>% pipe
tidy_internet_readr <- 
  internet_readr %>%
  gather(`2000`:`2012`, key="year", 
         value="usage")

View the data

country	year	usage
China	2000	1.78
Mexico	2000	5.08
Panama	2000	6.55
Senegal	2000	0.40
Singapore	2000	36.00
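To make the reshape less of a black box, here is a base R sketch that produces the same long layout by hand. It illustrates the idea only and is not how gather() is implemented; the data frame wide is a two-country, two-year stand-in for the full data.

```r
# Two-country, two-year stand-in for the wide data.
wide <- data.frame(
  country = c("China", "Mexico"),
  `2000` = c(1.78, 5.08),
  `2001` = c(2.64, 7.04),
  check.names = FALSE   # keep "2000" and "2001" as column names
)

# The columns to stack are all the year columns.
year_cols <- setdiff(names(wide), "country")

long <- data.frame(
  # the country names repeat once for every year column
  country = rep(wide$country, times = length(year_cols)),
  # each year label repeats once for every country
  year = rep(year_cols, each = nrow(wide)),
  # the year columns are stacked on top of each other, one after another
  usage = unlist(wide[year_cols], use.names = FALSE)
)

long
```

Each row of the long form now holds one country-year observation, which is the shape ggplot expects when year goes on the x-axis and usage on the y-axis.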

4) Understand - Visualize

Let's create a time series line graph.
Use the ggplot2 package

ggplot(data, aes(x,y,color, group)) + geom_line() + ... + ...

Build the chart

library(ggplot2)
# assign the ggplot call to a variable
line01 <- ggplot(tidy_internet_readr,
        aes(x=year,y=usage,color=country,
            group=country)) + geom_line()

View the chart

line01

Adding aesthetics: Labels

Labels: +labs(title="", subtitle="",x="", y="", caption="")

Refine your line chart

library(ggthemes)
library(ggplot2)
line02 <- ggplot(tidy_internet_readr,
                 aes(x=year, y=usage, color=country,
                     group=country)) + geom_line() + 
  labs(title = "Internet Usage per 100 people", 
       # keep the subtitle in one string so no stray line break or indent appears
       subtitle = "Since 2011, the UAE has surpassed Singapore and the US in internet users", 
       caption = "Source: World Bank (2013)",
       x = " ", y ="Usage")

Refined line chart

Let's create a bar chart with the same data

..

library(ggplot2)
library(ggthemes)  # provides theme_few()
# refer to columns by name inside aes(); the data argument supplies the data frame
bar01 <- ggplot(tidy_internet_readr,
        aes(year, usage))

bar01 <- bar01 + geom_col() + theme_few() +
  labs(title = "Internet Usage per 100 people", 
       x = "Year",y ="Usage", 
       caption="World Bank (2013)")

Let's see it

bar01

Refining aesthetics: Specifying parameters

Adding a fill / Color

fill="#4cbea3", color="#4cbea3"

Add a fill

Changing the font

+ theme(text=element_text(family="Avenir"))

Changing the font

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

Removing chart junk

theme(panel.border = element_blank(), panel.grid.major = element_blank(),panel.grid.minor = element_blank(), axis.line = element_line(color = "gray"), axis.ticks.x=element_blank(), axis.ticks.y=element_blank())

Removing chart junk

bar01c <- bar01 + geom_col(fill="#4cbea3", color="#4cbea3") + theme_few() +
  labs(title = "Internet Usage per 100 people", 
       x = " ",y ="Usage", 
       caption="World Bank (2013)") +
  theme(text=element_text(family="Avenir"), 
        panel.border = element_blank(), panel.grid.major =
          element_blank(),panel.grid.minor =
          element_blank(), 
        axis.line = element_line(color= "gray"),
        axis.ticks.x=element_blank(),
        axis.ticks.y=element_blank())

Removing chart junk

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

Adding aesthetics: Themes

The ggthemes package provides additional themes.
theme_classic(): white background, no grid lines.

Types of themes

t1 <- bar01 + theme_classic()
t2 <- bar01 + theme_bw()
t3 <- bar01 + theme_minimal()
t4 <- bar01 + theme_economist()
t5 <- bar01 + theme_fivethirtyeight()
t6 <- bar01 + theme_hc()

Themes: `theme_classic()`

Themes: `theme_bw()`

Themes: `theme_minimal()`

Themes: `theme_economist()`

Themes: `theme_fivethirtyeight()`

Themes: `theme_hc()`

Saving your charts

ggsave("plot.png", width=5, height=5, units="in")
This saves the most recent plot in your working directory.

PROJECT. Capital Bikeshare

bikesharedailydata.csv

1. Set up your workspace

2. Import the data

This data spans the District of Columbia, Arlington County, Alexandria, Montgomery County and Fairfax County.
The Capital Bikeshare system is owned by the participating jurisdictions and is operated by Motivate, a Brooklyn, NY-based company that operates several other bikesharing systems including Citibike in New York City, Hubway in Boston and Divvy Bikes in Chicago.

…

library(readr)
bikeshare <- read_csv("bikesharedailydata.csv")

View the data.

## # A tibble: 6 x 16
##   instant dteday season    yr  mnth holiday weekday workingday weathersit
##     <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
## 1       1 1/1/11      1     0     1       0       6          0          2
## 2       2 1/2/11      1     0     1       0       0          0          2
## 3       3 1/3/11      1     0     1       0       1          1          1
## 4       4 1/4/11      1     0     1       0       2          1          1
## 5       5 1/5/11      1     0     1       0       3          1          1
## 6       6 1/6/11      1     0     1       0       4          1          1
## # ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
## #   windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>

Understand the type of data you are working with.

Observations

One of the first things you may notice is the data dimensions, the number of rows and columns. Specifically there are 731 rows (observations) and 16 columns (variables or attributes).

However, the variable names listed at the first row of every column are not very descriptive.

Determine what the variables mean in the real world.

Take a look at the column named season. What is the meaning of season?

What are the possible values for this variable?

bikeshare$season

bikeshare$season

##   [1]  1  1  1  1  1  1 NA  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [24]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [47]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [70]  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2
##  [93]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [116]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [139]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [162]  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3
## [185]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [208]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [231]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [254]  3  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4
## [277]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [300]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [323]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [346]  4  4  4  4  4  4  4  4  4  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [369]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [392]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [415]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [438]  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [461]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [484]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [507]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [530]  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [553]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [576]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [599]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [622]  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4  4
## [645]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [668]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [691]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [714]  4  4  4  4  4  4  4  1  1  1  1  1  1  1  1  1  1  1
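Printing every value makes the possible values hard to pick out. A more compact check, sketched below with a short made-up vector standing in for bikeshare$season, lists the distinct values and counts how often each occurs.

```r
# Hypothetical stand-in for bikeshare$season: codes 1 to 4 and one missing value.
season <- c(1, 1, NA, 2, 2, 3, 3, 4, 4, 1)

# Distinct values, in order of first appearance (NA is included).
unique(season)

# How often each value occurs; useNA = "ifany" adds a count for NA when present.
season_counts <- table(season, useNA = "ifany")
season_counts
```

Applied to the real column, the same two calls would confirm at a glance that only 1, 2, 3, 4 and NA occur.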

What type of variable is it?

It is stored as a numeric column (type double, shown as <dbl> in the tibble), even though every value is a whole number.

You’ll notice that in the column season the values are whole numbers that range between 1 and 4.

What do the numbers represent?

If we think about it, it’s unlikely that the numbers represent quantities.

Instead, they probably represent the seasons of the year because we know there are four seasons.

Understanding the four seasons

The numbers (1 through 4) are probably a code for each of the four seasons of the year.
Without additional information, such as a data dictionary or readme file, it would be impossible for the user of the data to know what the possible values of 1 through 4 correspond to in the categorical variable named season.

Review the data dictionary

This leads us to the next step, reviewing the data dictionary along with the data set to better understand the meaning behind the values.

The data dictionary

A data dictionary defines the characteristics of each of the data attributes.
If your data comes from a reputable source, odds are that it is accompanied by a data dictionary or metadata.
To know which season is represented by each number in the variable season we can review the data dictionary.

Reviewing the data dictionary

Field	Definition
instant	record index
dteday	date
season	season (1:winter, 2:spring, 3:summer, 4:fall)
yr	year (0: 2011, 1: 2012)
mnth	month (1 to 12)
hr	hour (0 to 23)
holiday	whether the day is a holiday or not
weekday	day of the week
workingday	if day is neither weekend nor holiday is 1, otherwise is 0.
weathersit	1, 2, 3, 4
– 1	Clear, Few clouds, Partly cloudy, Partly cloudy
– 2	Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
– 3	Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
– 4	Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp	Normalized temperature in degrees F
atemp	Normalized feeling temperature in degrees F
hum	Normalized humidity.
windspeed	Normalized wind speed
casual	count of casual users
registered	count of registered users
cnt	count of total rental bikes including both casual and registered

What did we learn?

season is a categorical variable defined by one of four values, each representing a season (1: winter, 2: spring, 3: summer, 4: fall).

year is coded with the value of 0 for 2011 and 1 for 2012, rather than actual year value of 2011 or 2012.
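Once the codes are known, they can be turned into readable labels. The sketch below shows one way to do that with factor(), using short made-up vectors in place of the real season and yr columns; the label order follows the data dictionary.

```r
# Hypothetical stand-ins for the season and yr columns.
season_codes <- c(1, 2, 3, 4, 1)
yr_codes <- c(0, 0, 1, 1, 1)

# Map each season code to its name from the data dictionary.
season_label <- factor(season_codes,
                       levels = 1:4,
                       labels = c("winter", "spring", "summer", "fall"))
season_label

# The year code is an offset from 2011 (0: 2011, 1: 2012).
calendar_year <- 2011 + yr_codes
calendar_year
```

Storing season as a factor also tells plotting and modeling functions to treat it as a category rather than a quantity.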

3. Prepare or tidy your data

At this point, you may want to rename the columns in your data set to make the data more usable when you begin the analysis.
Renaming columns is a manual process that involves changing each column name.
It is best practice to use lowercase lettering and avoid spaces and hyphenation.

Renaming columns

There are two key ways to rename columns.

with rename() from dplyr
with names() from base

Way 1 - Renaming columns with `rename()` from the `dplyr` library.

library(dplyr)
bikeshare <- rename(bikeshare, humidity = hum, month=mnth)
names(bikeshare)

##  [1] "instant"    "dteday"     "season"     "yr"         "month"     
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"

Way 2 - Renaming columns with base R functions.

# Rename column where names is equal to "yr"
names(bikeshare)[names(bikeshare) == "yr"] <- "year"
names(bikeshare)

##  [1] "instant"    "dteday"     "season"     "year"       "month"     
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"

Identify missing values

We can use the is.na() function to count the number of NA values.
Let's look for missing values in the season column.

…

sum(is.na(bikeshare$season))

## [1] 1

Using iteration to identify missing values

In this case it seems necessary to "loop" through the entire data set and to identify which fields need closer inspection.
We can build a simple for loop to do this.

The `for` loop structure

Let's iterate through a vector of numbers using a for loop

…

myvector <- c(1,2,3,4)
for (i in myvector){
    print(paste ("Loop", i))
}

## [1] "Loop 1"
## [1] "Loop 2"
## [1] "Loop 3"
## [1] "Loop 4"

Let's refine the loop to include a counter and change the data a little.

..

myvector <- c(1,2,3,4,9,11)
mycounter <- 1
for (i in myvector){
    print(paste ("The value for loop number", mycounter,"is:", i))
    mycounter <- mycounter +1 # update 
}

## [1] "The value for loop number 1 is: 1"
## [1] "The value for loop number 2 is: 2"
## [1] "The value for loop number 3 is: 3"
## [1] "The value for loop number 4 is: 4"
## [1] "The value for loop number 5 is: 9"
## [1] "The value for loop number 6 is: 11"

Let's use the concept of iteration to traverse the data set and print out those NA values.

…

counter <- 0
for (i in bikeshare$season){
  counter <- counter + 1
  if (is.na(i)){
    # keep the message in one string so the output has no stray line breaks
    print(paste("It's true. There's an NA value on row", counter))
    print(bikeshare[counter,])
  }
}

## [1] "It's true. There's an NA value on row 7"
## # A tibble: 1 x 16
##   instant dteday season  year month holiday weekday workingday weathersit
##     <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
## 1       7 1/7/11     NA     0     1       0       5          1          2
## # ... with 7 more variables: temp <dbl>, atemp <dbl>, humidity <dbl>,
## #   windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>
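The loop checks one column. For a quick scan of the whole data set, base R also offers vectorized helpers. The sketch below uses a made-up four-row data frame named bikes in place of the real data.

```r
# Hypothetical miniature of the bikeshare data with two missing values.
bikes <- data.frame(
  season = c(1, 1, NA, 2),
  temp   = c(0.34, NA, 0.20, 0.23),
  cnt    = c(985, 801, 1349, 1562)
)

# Number of NA values in every column at once.
na_per_column <- colSums(is.na(bikes))
na_per_column

# Row numbers where season is NA.
na_rows <- which(is.na(bikes$season))
na_rows

# Every row that has at least one NA in any column.
bikes[!complete.cases(bikes), ]
```

Columns with a non-zero count are the ones that need the closer inspection described above.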

Dealing with missing values

There are several ways you can tackle working with data that are incomplete. Each has its pros and cons.

Ignore any record with missing values
Replace empty fields with a pre-defined value
Replace empty fields with the most frequently occurring value
Use the mean value
Manual approach
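The first four options can each be written in a line or two of base R. The sketch below applies them to a short made-up vector x; the pre-defined value 1 is an arbitrary choice for illustration.

```r
# Hypothetical vector with one missing value.
x <- c(1, 1, NA, 2, 1, 4)

# Option 1: ignore the missing record.
dropped <- x[!is.na(x)]

# Option 2: replace it with a pre-defined value (1 is chosen arbitrarily here).
fixed <- replace(x, is.na(x), 1)

# Option 3: replace it with the most frequently occurring value.
counts <- table(x)                                   # table() skips NA by default
most_common <- as.numeric(names(counts)[which.max(counts)])
mode_filled <- replace(x, is.na(x), most_common)

# Option 4: replace it with the mean of the observed values.
mean_filled <- replace(x, is.na(x), mean(x, na.rm = TRUE))

dropped
fixed
mode_filled
mean_filled
```

Note that the mean (1.8 here) is not a valid code for a categorical variable, which is one reason the choice of method depends on the type of data.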

Solution

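In this case it's easy to replace the missing value with a pre-defined value. Row 7 is dated 1/7/11 and the neighboring January rows all have season 1 (winter), so the missing value must be 1.
We wouldn't want to ignore the record because the value can be easily determined.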

Update the values

bikeshare$season[7]

## [1] NA

bikeshare$season[7] <- 1
bikeshare$season[7]

## [1] 1

4. Understand and visualize

bikesall <- ggplot(bikeshare, aes(atemp,cnt)) + geom_point(color="#4cbea3")

Let's see it!

bikesall

Refine it

# year is coded 0 for 2011; atemp is the normalized feeling temperature
bikes2011 <- ggplot(bikeshare[bikeshare$year == 0,],
                    aes(atemp,cnt)) +
  geom_point(color="#4cbea3") + 
  theme_few() + labs(title = "Rentals in 2011", 
                     x = "Feeling temperature (normalized)",y=" ")

2011

bikes2011

Plot 2012

# year is coded 1 for 2012
bikes2012 <- ggplot(bikeshare[bikeshare$year == 1,],
                    aes(atemp,cnt)) +
  geom_point(color="#4cbea3") + 
  theme_few() + labs(title = "Rentals in 2012", 
                     x = "Feeling temperature (normalized)",y=" ")

2012

bikes2012

Now let’s arrange the charts side by side

We can do this by using the plot_grid function from the cowplot package

cowplot::plot_grid

Pass the two plot objects bikes2011 and bikes2012 into the plot_grid function to see the graphs plotted side by side.

plot_grid

cowplot::plot_grid(bikes2011, bikes2012, labels =c(" ", " "))

ITERATION

Loops

This section will introduce control structures for iteration known as loops. We will cover two types of loops:

while loop
for loop

Basic structure for a `while` loop.

#pseudo code

while (TRUE)
{
  ##do something...
  
} #exit the loop
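Written literally, while (TRUE) never stops on its own, so the body has to contain a break. The sketch below is a minimal runnable version of that structure; the variable name attempts and the limit of 3 are arbitrary choices for illustration.

```r
attempts <- 0

while (TRUE) {
  # do something: here we just count and report the attempt
  attempts <- attempts + 1
  print(paste("Attempt", attempts))

  # leave the loop once the stopping condition is met
  if (attempts >= 3) {
    break
  }
}

attempts
```

In most cases it is clearer to put the stopping condition directly in the while () test, as the next examples do.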

Create a `while loop`

x <- 10
while (x > 0) {
 print(x)
 x <- x - 1 
}

## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1

Create a `while` loop with a counter

counter <- 0
while (counter < 9) {
  print(counter)
  counter <- counter + 1 }

## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8

Exercise 03 - Use the attitude data set

#pseudo code

column <- ncol(attitude)

while (column > 0)
{
 print(paste(names(attitude[column]), ":", mean(attitude[,column]))) #take the mean
  column <- column -1
  #move to the next ncol
} 


# use a name other than mean so the result does not mask the mean() function
col_means <- numeric(0)

for (i in names(attitude)){
  col_means[i] <- mean(attitude[,i])
}

col_means
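The per-column means can also be computed without an explicit loop. The sketch below shows two base R alternatives on the built-in attitude data set; the variable names are illustrative.

```r
# colMeans() returns the mean of every column of a numeric data frame.
column_means <- colMeans(attitude)
column_means

# sapply() applies a function to each column and returns a named vector.
same_with_sapply <- sapply(attitude, mean)
same_with_sapply
```

Both return a named numeric vector with one entry per column, which is the same shape the loop builds up step by step.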

Solution

Basic structure for a `for` loop.

#pseudo code

Iteration using a `for` loop

Iterate through a vector of numbers using a for loop

for (i in c(1,2,3,4)){
    print(i)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4

Exercise 04: Iterate through a column in the bikeshare data

#pseudo code

for (a in bikeshare$dteday){
  print (a)
}

Solution: Iterate through a column in the bikeshare data

# Iterate through the season column and print each value
for (i in bikeshare$season){ 
  print(i) }

CONDITIONALS

Let's review Boolean variables and logical operators

3 > 4

## [1] FALSE

c(1, 2, 3, 4, 5) > 4

## [1] FALSE FALSE FALSE FALSE  TRUE

c(1, 2, 3, 4, 6) == 3

## [1] FALSE FALSE  TRUE FALSE FALSE

Loops and conditional statements using if/else logic

Build a program that checks to see which prices are considered "cheap".

prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)
 
v <- c()
for (p in prices){
  # keep the price only when it is under 10
  if (p < 10){
    v <- c(v, p)
  }
}
    
v

## [1] 9.99 7.25 0.50

##CODE HERE

Solution

Alternative approach
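One alternative is to skip the loop and let R compare the whole vector at once. The sketch below assumes, as in the loop version, that "cheap" means a price under 10; the names is_cheap, cheap_prices and price_labels are illustrative.

```r
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)

# The comparison is applied to every element and returns a logical vector.
is_cheap <- prices < 10
is_cheap

# Use the logical vector to keep only the cheap prices.
cheap_prices <- prices[is_cheap]
cheap_prices

# ifelse() is the vectorized form of if/else: one label per price.
price_labels <- ifelse(is_cheap, "cheap", "not cheap")
price_labels
```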

INFIX OPERATORS

Infix operators are very similar to functions.
You've been introduced to one infix operator, the %% remainder operator.

print(paste("The remainder from 5 / 3 is", 5%%3))

## [1] "The remainder from 5 / 3 is 2"

Table of infix operators

Operation	Operator	Example Input	Example Output
Remainder operator	`%%`	`7%%2`	`1`
Integer division	`%/%`	`10%/%3`	`3`
Matrix multiplication	`%*%`	`3%*%6`	1 x 1 matrix containing `18`
Outer product	`%o%`	`c(1:3)%o%c(0,1,2)`	3 x 3 matrix with columns `0,0,0`, `1,2,3`, `2,4,6`
Matching operator	`%in%`	`1%in%1`	`TRUE`

Integer division

We use the `%/%` operator for integer division. It is helpful for situations where non-integer values are not possible. Examples include products such as tickets, furniture, cars, etc.

print(paste("Using integer division 5 / 3 is", 5%/%3))

## [1] "Using integer division 5 / 3 is 1"
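Integer division pairs naturally with the remainder operator. The sketch below uses made-up numbers (a budget of 100 and a ticket price of 30) to show how the two together account for the whole amount.

```r
# Hypothetical figures for illustration.
budget <- 100
ticket_price <- 30

# How many whole tickets can be bought?
tickets <- budget %/% ticket_price
tickets

# How much money is left over?
left_over <- budget %% ticket_price
left_over

# The two results always add back up to the original amount.
tickets * ticket_price + left_over == budget
```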

Matrix multiplication

Unlike the * operator, which multiplies element by element, the %*% operator performs matrix multiplication. The result is a matrix data structure.

mmatrix <- 2%*%5
mmatrix

##      [,1]
## [1,]   10

class(mmatrix)

## [1] "matrix"

Matrix Example

Matrix Example with code

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    4

##      [,1] [,2]
## [1,]    0    0
## [2,]    2    3

##      [,1] [,2]
## [1,]    4    6
## [2,]    8   12
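The three matrices printed above are consistent with multiplying the first by the second. The code below is a reconstruction under that assumption, since the original code is not shown; the names a and b are hypothetical.

```r
# matrix() fills column by column, so these give the rows (1, 2) and (2, 4).
a <- matrix(c(1, 2, 2, 4), nrow = 2)
a

# Rows (0, 0) and (2, 3).
b <- matrix(c(0, 2, 0, 3), nrow = 2)
b

# Matrix multiplication: each entry is a row of a times a column of b.
product <- a %*% b
product

# For comparison, * multiplies the matrices element by element.
elementwise <- a * b
elementwise
```

The top-left entry of the product, for example, is 1 * 0 + 2 * 2 = 4, whereas the element-wise result in the same position is 1 * 0 = 0.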

Outer product

The outer product %o% is a handy function for linear algebra and linear programming.

c(1:3)%o%c(0,1,2)

##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    0    2    4
## [3,]    0    3    6

Outer product example

Outer product - example continued.

vector01 <-c(1:3)
vector02 <-c(0,1,2)
opvector <- vector01%o%vector02
opvector

##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    0    2    4
## [3,]    0    3    6

Matching `%in%` operator

Useful for checking whether values from one set also appear in another set.
%in% tells you which items from the left hand side are also in the right hand side.

Using the `%in%` operator

Let's look at an example.

Example

v1 <- 3
v2 <- 101
t <- c(1,2,3,4,5,6,7,8)

Next, we use the %in% operator

v1 %in% t

## [1] TRUE

You try it with v2

Is the value of `v2` present in the vector `t`?

Solution

v2 %in% t

## [1] FALSE

Example

 c(1:5) %in% c(3:8)

## [1] FALSE FALSE  TRUE  TRUE  TRUE

Example

We can use %in% to determine which values in one vector also appear in another vector.
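A minimal sketch with character vectors follows. Both vectors are made up for illustration: the first borrows a few country names from the internet usage data, and the second is a hypothetical short list standing in for a country name column.

```r
# Countries we want to look up.
survey_countries <- c("China", "Mexico", "Panama", "Senegal")

# Hypothetical extract of a country name column from another data set.
gdp_countries <- c("Aruba", "Afghanistan", "Angola", "Albania", "Andorra",
                   "China", "Mexico")

# One TRUE or FALSE per element of the left hand side.
found <- survey_countries %in% gdp_countries
found

# Countries present in both vectors.
survey_countries[found]

# Countries missing from the second vector.
survey_countries[!found]
```

Because the result is a logical vector, it can be used directly to subset rows of a data frame in the same way.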

Let's return to the example, the Country Name variable in the gdp data.

Read in the data

library(readr)
gdp <- read_csv("gdp.csv")

View the data

Country Name	Country Code	1960	1961	1962	1963	1964	1965	1966	1967	1968	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	1979	1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017
Aruba	ABW	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1330167598	1320670391	1379888268	1531843575	1665363128	1722798883	1873452514	1920262570	1941094972	2021301676	2228279330	2.331006e+09	2.421475e+09	2.623726e+09	2.791961e+09	2.498933e+09	2.467704e+09	2.584464e+09	NA	NA	NA	NA	NA	NA
Afghanistan	AFG	537777811	548888896	546666678	751111191	800000044	1006666638	1399999967	1673333418	1373333367	1408888922	1748886596	1831108971	1595555476	1733333264	2155555498	2366666616	2555555567	2953333418	3300000109	3697940410	3641723322	3478787909	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2461665938	4128820723	4583644246	5285465686	6.275074e+09	7.057598e+09	9.843842e+09	1.019053e+10	1.248694e+10	1.593680e+10	1.793024e+10	2.053654e+10	2.026425e+10	2.061610e+10	1.921556e+10	1.946902e+10	2.081530e+10
Angola	AGO	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	5930503401	5550483036	5550483036	5784341596	6131475065	7553560459	7072063345	8083872012	8769250550	10201099040	11228764963	10603784541	8307810974	5768720422	4438321017	5538749260	7526446606	7648377413	6506229607	6152922943	9129594819	8936063723	12497347956	14188949398	19640853734	2.823371e+10	4.178948e+10	6.044892e+10	8.417804e+10	7.549239e+10	8.252614e+10	1.041158e+11	1.139232e+11	1.249125e+11	1.267302e+11	1.026212e+11	9.533720e+10	1.242094e+11
Albania	ALB	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1924242453	1965384586	2173750013	2156624900	2126000000	2335124988	2101624963	1139166646	709452584	1228071038	1985673798	2424499009	3314898292	2359903108	2707123772	3414760915	3632043908	4060758804	4435078648	5746945913	7314865176	8.158549e+09	8.992642e+09	1.070101e+10	1.288135e+10	1.204421e+10	1.192695e+10	1.289087e+10	1.231978e+10	1.277628e+10	1.322824e+10	1.138693e+10	1.188368e+10	1.303935e+10
Andorra	AND	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	78619206	89409820	113408232	150820103	186558696	220127246	227281025	254020153	308008898	411578334	446416106	388958731	375895956	327861833	330070689	346737965	482000594	611316399	721425939	795449332	1029048482	1106928583	1210013652	1007025755	1017549124	1178738991	1223945357	1180597273	1211932398	1239876305	1434429703	1496912752	1733116883	2398645598	2935659300	3.255789e+09	3.543257e+09	4.016972e+09	4.007353e+09	3.660531e+09	3.355695e+09	3.442063e+09	3.164615e+09	3.281585e+09	3.350736e+09	2.811489e+09	2.877312e+09	3.012914e+09
Arab World	ARB	NA	NA	NA	NA	NA	NA	NA	NA	25752663334	28425351599	31375728863	36415569616	43302571643	55001266851	105113069536	116300804387	144801082500	167256241961	183498400604	248568798879	338072174739	348484272974	324227785106	303867911387	307844905036	303799011536	288939171303	312584335589	307407305095	322224795592	446738041822	439642267667	471016834842	476365284407	487375131447	523357362511	578012134733	612896365522	590692883162	643147698720	734768117003	723282816386	729051715399	823110541435	963862340514	1.184662e+12	1.404114e+12	1.637573e+12	2.078116e+12	1.795820e+12	2.109551e+12	2.501305e+12	2.786139e+12	2.866038e+12	2.906918e+12	2.554480e+12	2.500164e+12	2.591047e+12
United Arab Emirates	ARE	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	14720672507	19213022691	24871775165	23775831783	31225463218	43598748449	49333424135	46622718605	42803323345	41807954236	40603650232	33943612095	36384908744	36275674203	41464995914	50701443748	51552165622	54239171888	55625170253	59305093980	65743666576	73571233996	78839008445	75674336283	84445473111	104337372362	103311640572	109816201498	124346358067	147824370320	1.806170e+11	2.221165e+11	2.579161e+11	3.154746e+11	2.535474e+11	2.897873e+11	3.506660e+11	3.745906e+11	3.901076e+11	4.031371e+11	3.581351e+11	3.570451e+11	3.825751e+11
Argentina	ARG	NA	NA	24450604878	18272123664	25605249382	28344705967	28630474728	24256667553	26436857248	31256284544	31584210366	33293199095	34733000536	52544000117	72436777342	52438647922	51169499891	56781000101	58082870156	69252328953	76961923742	78676842366	84307486837	103979106778	79092001998	88416668900	110934442763	111106191358	126206817196	76636898036	141352368715	189719984268	228788617202	236741715015	257440000000	258031750000	272149750000	292859000000	298948250000	283523000000	284203750000	268696750000	97724004252	127586973492	164657930453	1.987371e+11	2.325573e+11	2.875305e+11	3.615580e+11	3.329765e+11	4.236274e+11	5.301633e+11	5.459824e+11	5.520251e+11	5.263197e+11	5.947493e+11	5.548609e+11	6.375904e+11
Armenia	ARM	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2256838858	2068526522	1272577456	1201313201	1315158670	1468317350	1596968913	1639492424	1893726437	1845482181	1911563665	2118467913	2376335048	2807061009	3576615240	4.900470e+09	6.384452e+09	9.206302e+09	1.166204e+10	8.647937e+09	9.260285e+09	1.014211e+10	1.061932e+10	1.112147e+10	1.160951e+10	1.055334e+10	1.054614e+10	1.153659e+10
American Samoa	ASM	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	514000000	527000000	512000000	5.030000e+08	4.960000e+08	5.200000e+08	5.630000e+08	6.780000e+08	5.760000e+08	5.740000e+08	6.440000e+08	6.410000e+08	6.430000e+08	6.590000e+08	6.580000e+08	NA

##  [1] "Country Name" "Country Code" "1960"         "1961"        
##  [5] "1962"         "1963"         "1964"         "1965"        
##  [9] "1966"         "1967"         "1968"         "1969"        
## [13] "1970"         "1971"         "1972"         "1973"        
## [17] "1974"         "1975"         "1976"         "1977"        
## [21] "1978"         "1979"         "1980"         "1981"        
## [25] "1982"         "1983"         "1984"         "1985"        
## [29] "1986"         "1987"         "1988"         "1989"        
## [33] "1990"         "1991"         "1992"         "1993"        
## [37] "1994"         "1995"         "1996"         "1997"        
## [41] "1998"         "1999"         "2000"         "2001"        
## [45] "2002"         "2003"         "2004"         "2005"        
## [49] "2006"         "2007"         "2008"         "2009"        
## [53] "2010"         "2011"         "2012"         "2013"        
## [57] "2014"         "2015"         "2016"         "2017"

Example - continued 1 of 2

Luckily, your professor has done the work of identifying which of these country codes represent aggregate values.

You've been provided with a file, aggregatecodes.csv

agg <- read_csv("aggregatecodes.csv")
summary(agg)

##  aggregate_code    
##  Length:46         
##  Class :character  
##  Mode  :character

View the data

aggregate_code
ARB
CEB
CSS
EAP
EAR
EAS
ECA
ECS
EMU
EUU
FCS
HIC
HPC
IBD

Example - Continued 2 of 2

Let's use the %in% infix function.

gdpagg<-gdp[!gdp$`Country Code` %in% 
              agg$aggregate_code, ,drop= FALSE]
knitr::kable(head(gdpagg,10))

Country Name	Country Code	1960	1961	1962	1963	1964	1965	1966	1967	1968	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	1979	1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017
Aruba	ABW	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1330167598	1320670391	1379888268	1531843575	1665363128	1722798883	1873452514	1920262570	1941094972	2021301676	2228279330	2331005587	2421474860	2623726257	2791960894	2498932961	2467703911	2584463687	NA	NA	NA	NA	NA	NA
Afghanistan	AFG	537777811	548888896	546666678	751111191	800000044	1006666638	1399999967	1673333418	1373333367	1408888922	1748886596	1831108971	1595555476	1733333264	2155555498	2366666616	2555555567	2953333418	3300000109	3697940410	3641723322	3478787909	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2461665938	4128820723	4583644246	5285465686	6275073572	7057598407	9843842455	10190529882	12486943506	15936800636	17930239400	20536542737	20264253974	20616104299	19215562179	19469022208	20815300220
Angola	AGO	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	5930503401	5550483036	5550483036	5784341596	6131475065	7553560459	7072063345	8083872012	8769250550	10201099040	11228764963	10603784541	8307810974	5768720422	4438321017	5538749260	7526446606	7648377413	6506229607	6152922943	9129594819	8936063723	12497347956	14188949398	19640853734	28233712738	41789479932	60448924662	84178035579	75492385928	82526143645	104115807986	113923162050	124912503781	126730196125	102621215573	95337203468	124209385825
Albania	ALB	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1924242453	1965384586	2173750013	2156624900	2126000000	2335124988	2101624963	1139166646	709452584	1228071038	1985673798	2424499009	3314898292	2359903108	2707123772	3414760915	3632043908	4060758804	4435078648	5746945913	7314865176	8158548717	8992642349	10701011897	12881352688	12044212904	11926953259	12890867539	12319784787	12776277515	13228244357	11386931490	11883682171	13039352744
Andorra	AND	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	78619206	89409820	113408232	150820103	186558696	220127246	227281025	254020153	308008898	411578334	446416106	388958731	375895956	327861833	330070689	346737965	482000594	611316399	721425939	795449332	1029048482	1106928583	1210013652	1007025755	1017549124	1178738991	1223945357	1180597273	1211932398	1239876305	1434429703	1496912752	1733116883	2398645598	2935659300	3255789081	3543256806	4016972351	4007353157	3660530703	3355695364	3442062830	3164615187	3281585236	3350736367	2811489409	2877311947	3012914131
United Arab Emirates	ARE	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	14720672507	19213022691	24871775165	23775831783	31225463218	43598748449	49333424135	46622718605	42803323345	41807954236	40603650232	33943612095	36384908744	36275674203	41464995914	50701443748	51552165622	54239171888	55625170253	59305093980	65743666576	73571233996	78839008445	75674336283	84445473111	104337372362	103311640572	109816201498	124346358067	147824370320	180617018380	222116541865	257916133424	315474615739	253547358747	289787338325	350666031314	374590605854	390107556161	403137100068	358135057862	357045064670	382575085092
Argentina	ARG	NA	NA	24450604878	18272123664	25605249382	28344705967	28630474728	24256667553	26436857248	31256284544	31584210366	33293199095	34733000536	52544000117	72436777342	52438647922	51169499891	56781000101	58082870156	69252328953	76961923742	78676842366	84307486837	103979106778	79092001998	88416668900	110934442763	111106191358	126206817196	76636898036	141352368715	189719984268	228788617202	236741715015	257440000000	258031750000	272149750000	292859000000	298948250000	283523000000	284203750000	268696750000	97724004252	127586973492	164657930453	198737095012	232557260817	287530508431	361558037110	332976484578	423627422092	530163281575	545982375701	552025140252	526319673732	594749285413	554860945014	637590419269
Armenia	ARM	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2256838858	2068526522	1272577456	1201313201	1315158670	1468317350	1596968913	1639492424	1893726437	1845482181	1911563665	2118467913	2376335048	2807061009	3576615240	4900469950	6384451606	9206301700	11662040714	8647936748	9260284938	10142111335	10619320049	11121465767	11609512940	10553337673	10546135160	11536590636
American Samoa	ASM	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	514000000	527000000	512000000	503000000	496000000	520000000	563000000	678000000	576000000	574000000	644000000	641000000	643000000	659000000	658000000	NA
Antigua and Barbuda	ATG	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	77496741	87879333	109079963	131431037	147841741	164369296	182144111	208372852	240923926	290440148	337174852	398637741	438794778	459469074	481706333	499281148	535172778	589429593	577280741	633730630	680617111	727860593	766198926	830158778	800740259	814615333	855643111	919577148	1022191296	1157005444	1311401333	1368431037	1224253000	1152469074	1142042926	1211411704	1192925407	1280133333	1364863037	1460144704	1532397556
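
The same pattern on a toy data frame may make the logic easier to see. The data below is made up for illustration; only the column names mirror the gdp and agg tables, and the toy_ names are hypothetical:

```r
# Toy stand-ins for gdp and agg (values are invented)
toy_gdp <- data.frame(
  `Country Code` = c("ABW", "ARB", "ARG", "EUU"),
  value = c(1, 2, 3, 4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
toy_agg <- data.frame(
  aggregate_code = c("ARB", "EUU"),
  stringsAsFactors = FALSE
)

# TRUE where the code is an aggregate
is_aggregate <- toy_gdp$`Country Code` %in% toy_agg$aggregate_code

# Negate with ! to keep only the rows that are not aggregates
toy_countries <- toy_gdp[!is_aggregate, , drop = FALSE]
toy_countries$`Country Code`
```

Storing the logical vector in its own variable first is a design choice: it makes the negation step easy to inspect before any rows are dropped.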

FUNCTIONS

Some functions are built in, such as:

toupper("hello world")

## [1] "HELLO WORLD"

mean(c(1,2,3,4,5))

## [1] 3

is.numeric(4)

## [1] TRUE

is.na(NA)

## [1] TRUE

sqrt(25)

## [1] 5

User Defined Functions

Functions are useful for executing repetitive commands.
Planning is key to writing effective functions.

Example

functionname <- function(x) {
  # Build the message and return it; R prints the result at the console
  paste("The value", x, "is returned")
}

functionname(34)

## [1] "The value 34 is returned"

Writing our own functions

We write our own functions for repetitive tasks or for algorithms.
For example, suppose we want to have a function to act on our data that adds 2 to every value.

Function Pseudocode

myfunction <- function(x) {
  # add two to x
  x <- x + 2
  # print the result (print() also returns it invisibly)
  print(x)
}

#apply the function

myfunction(5)

#x+2 is returned from the function

Exercise 06 - Write a function to act on our data that adds 2 to every value.

##CODE HERE

How can we plan for user error?

…

f <- function(x) {
  # Check the input type before doing any arithmetic
  if (!is.numeric(x)) {
    print("Sorry. This function requires a value of type numeric.")
  } else {
    x + 2
  }
}
f(37)

## [1] 39

f("Hi class of 2020!")

## [1] "Sorry. This function requires a value of type numeric."
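
Printing a message lets the calling code continue as if nothing went wrong. A common alternative, sketched here as an option rather than the approach used in class, is to signal an error with stop(). The name add_two is hypothetical:

```r
add_two <- function(x) {
  # Halt with an error if the input is not numeric
  if (!is.numeric(x)) {
    stop("x must be numeric")
  }
  x + 2
}

add_two(37)

# tryCatch() lets the caller decide how to handle the error;
# here the fallback value is NA
result <- tryCatch(
  add_two("Hi class of 2020!"),
  error = function(e) NA
)
result
```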

Pass in multiple arguments

addTogether <- function(x, y) {
  # Both arguments must be numeric; && compares single TRUE/FALSE values
  if (is.numeric(x) && is.numeric(y)) {
    x + y
  } else {
    print("Sorry, please enter two numbers")
  }
}

#Call the function
addTogether(5, 10)

## [1] 15

Alternative function call, with named arguments

addTogether(x = 5, y = 10)

## [1] 15

#Passing in non-numeric data
addTogether(x=4, y="Hey what's up?")

## [1] "Sorry, please enter two numbers"

Exercise 07: Write a function that averages two numbers

##CODE HERE
f <- function(x, y) {
  # Both arguments must be numeric before averaging
  if (is.numeric(x) && is.numeric(y)) {
    mean(c(x, y))
  } else {
    print("Sorry, please enter two numbers")
  }
}

f(2,3)

## [1] 2.5

Solution

#
#
#
#

Apply a function to all elements of input

What if we could apply a function to all the elements of the input?

Input is a list, vector, or data frame
Output is a vector (or matrix)

`sapply()` from the apply family of functions

sapply(X, FUN)

Arguments:

X: A vector, list, or data frame
FUN: Function applied to each element of X

Example - `sapply()`

df1 <-as.data.frame(c(1,2,3,4,5,6,7))

sapply(df1, max)

## c(1, 2, 3, 4, 5, 6, 7) 
##                      7

Alternative to using `as.data.frame`

library(tibble)
df1 <-as_tibble(c(1,2,3,4,5,6,7))

## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.

sapply(df1, max)

## value 
##     7

What if you didn't convert to a data frame or tibble?

df3 <-c(1,2,3,4,5,6,7)
sapply(df3, max)

## [1] 1 2 3 4 5 6 7
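
Why the difference? sapply() calls the function once per element. A data frame is a list of columns, so max sees a whole column. A plain vector is handed over one value at a time, and the max of a single value is just that value. A short sketch (the names v and d are illustrative):

```r
v <- c(1, 2, 3, 4, 5, 6, 7)
d <- data.frame(value = v)

# A data frame has one element per column; a vector has one per value
length(d)
length(v)

# To get the maximum of a vector, call the function directly
max(v)

# sapply() over the data frame gives the same answer, named by column
sapply(d, max)
```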

A function that takes the mean

avg <-function(x){
  mean(x, na.rm=TRUE)
}
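
The na.rm = TRUE argument matters when a column has missing values, as many of the gdp columns do. A quick sketch on made-up data (the data frame scores is hypothetical):

```r
avg <- function(x) {
  mean(x, na.rm = TRUE)
}

# Invented data with a missing value in each column
scores <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))

# Without na.rm the result is NA for both columns
sapply(scores, mean)

# avg() drops the missing values first
sapply(scores, avg)
```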

Apply a function over a vector using `sapply()`

f <- function(x) x^2
sapply(c(1,2,3,4,5),f)
## [1]  1  4  9 16 25

sapply(attitude,f)
##       rating complaints privileges learning raises critical advance
##  [1,]   1849       2601        900     1521   3721     8464    2025
##  [2,]   3969       4096       2601     2916   3969     5329    2209
##  [3,]   5041       4900       4624     4761   5776     7396    2304
##  [4,]   3721       3969       2025     2209   2916     7056    1225
##  [5,]   6561       6084       3136     4356   5041     6889    2209
##  [6,]   1849       3025       2401     1936   2916     2401    1156
##  [7,]   3364       4489       1764     3136   4356     4624    1225
##  [8,]   5041       5625       2500     3025   4900     4356    1681
##  [9,]   5184       6724       5184     4489   5041     6889     961
## [10,]   4489       3721       2025     2209   3844     6400    1681
## [11,]   4096       2809       2809     3364   3364     4489    1156
## [12,]   4489       3600       2209     1521   3481     5476    1681
## [13,]   4761       3844       3249     1764   3025     3969     625
## [14,]   4624       6889       6889     2025   3481     5929    1225
## [15,]   5929       5929       2916     5184   6241     5929    2116
## [16,]   6561       8100       2500     5184   3600     2916    1296
## [17,]   5476       7225       4096     4761   6241     6241    3969
## [18,]   4225       3600       4225     5625   3025     6400    3600
## [19,]   4225       4900       2116     3249   5625     7225    2116
## [20,]   2500       3364       4624     2916   4096     6084    2704
## [21,]   2500       1600       1089     1156   1849     4096    1089
## [22,]   4096       3721       2704     3844   4356     6400    1681
## [23,]   2809       4356       2704     2500   3969     6400    1369
## [24,]   1600       1369       1764     3364   2500     3249    2401
## [25,]   3969       2916       1764     2304   4356     5625    1089
## [26,]   4356       5929       4356     3969   7744     5776    5184
## [27,]   6084       5625       3364     5476   6400     6084    2401
## [28,]   2304       3249       1936     2025   2601     6889    1444
## [29,]   7225       7225       5041     5041   5929     5476    3025
## [30,]   6724       6724       1521     3481   4096     6084    1521

Exercise 08: Try using a function on a column in the bikeshare data.

##CODE HERE

Solution

#
#

PROJECT: GDP Analysis

Steps

Import data set
Prepare - remove values that are not countries.
Understand - summary stats & visualize
Communicate

Understand

Use our `avg()` function

Let's take the average of selected columns.

gdpagg_avg<-sapply(gdpagg[3:50],avg)

`sapply()`

gdpagg_avg <- as.data.frame(gdpagg_avg)
knitr::kable(head(gdpagg_avg, 10))

	gdpagg_avg
1960	11472836298
1961	11934440519
1962	12805007499
1963	13783853462
1964	15099033075
1965	15198438002
1966	16343121604
1967	16987893024
1968	18073927393
1969	19902605245

Calculate summary statistics

summaryfun <- function(x) {
  xmean <- mean(x, na.rm = TRUE)
  xmedian <- median(x, na.rm = TRUE)
  print(paste("The mean is", prettyNum(xmean, big.mark = ",", scientific = FALSE)))
  print(paste("The median is", prettyNum(xmedian, big.mark = ",", scientific = FALSE)))
}

`summaryfun()`

After we write the function, it is available for us to use at any point in our workspace (just like a variable).
X is the data and FUN is the function. In our case, the data is gdpagg$`2017` and the function is `summaryfun()`.

..

summaryfun(gdpagg$`2017`)

## [1] "The mean is 421,352,224,909"
## [1] "The median is 35,052,862,071"
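
summaryfun() prints its results, which is fine for reading but hard to reuse. A possible variant, sketched under the assumption that you want the numbers back for further work, returns a named vector instead. The name summary_values and the data are hypothetical:

```r
summary_values <- function(x) {
  # Return the statistics instead of printing them
  c(mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE))
}

# Invented data with one missing value
x <- c(2, 4, 9, NA)
stats <- summary_values(x)
stats["mean"]

# Because it returns a vector, it also works column by column with sapply()
sapply(data.frame(a = c(1, 2, 3), b = c(4, 5, 60)), summary_values)
```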

RSHINY

install.packages("shiny")

What is shiny?

It’s a web application framework for R
Makes it incredibly easy to build interactive web applications with R.
Learn more at: https://cran.r-project.org/web/packages/shiny/shiny.pdf

Video - Movie Explorer

Video - Marathon Training

Video - Intelligencia

Building a shiny app

The components of a shiny app

To build a Shiny app in R, start with a template.
In RStudio, go to File -> New File -> Shiny Web App. Choose a single file web application.
app.R will be the file you modify. It is saved in a new directory.
The directory name is your app name.

Shiny app template (app.R)

library(shiny)

# Defines the user interface through nested R functions
ui <- fluidPage()

# Specifies how to build and rebuild the R objects shown in the ui
server <- function(input, output) {}

# Combines ui and server into an app, which can be launched with runApp()
shinyApp(ui = ui, server = server)
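
To see how an input and an output connect, here is a minimal sketch that fills in the template. The ids "n" and "hist", the labels, and the choice of a histogram are all illustrative assumptions, not part of the class app:

```r
library(shiny)

# UI: one slider input and one plot output
ui <- fluidPage(
  sliderInput(inputId = "n", label = "Number of observations",
              min = 10, max = 500, value = 100),
  plotOutput(outputId = "hist")
)

# Server: rebuild the histogram whenever input$n changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal values")
  })
}

# Create the app object; launch it interactively with runApp(app)
app <- shinyApp(ui = ui, server = server)
```

The app object is stored rather than printed so that the script does not start a server until you ask it to.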

ui Inputs

server Outputs

Example prototypes - shiny apps

..

runExample("02_text")
runExample("03_reactivity")
runExample("04_mpg")
runExample("05_sliders")
runExample("06_tabsets")
runExample("07_widgets")
runExample("08_html")
runExample("09_upload")
runExample("11_timer")

runApp("nyuclasses")

App walkthrough

Exercise 09

Revise app to provide a default view of most recent distributions by most recent assignment due date
Revise app to include a selector by one or more students
Revise app to include doughnut charts to show completed, late, or incomplete assessments by assignment
Revise app to include a student list

Exercise 09 - Prototype specification

Review

Review session 8 from R Fundamentals (Sosulski, 2019): RShiny http://becomingvisual.com/rfundamentals/rshiny.html

Homework

Submit 2 files from today
.Rmd and app.R via NYU Classes > Assignments > In Class Worksheet.
Name your files lastname_first_inclass.Rmd & lastname_first_app.R