2020
#Foundations of Statistics Using R
A new way of thinking, not just a new skill
A new language for speaking and reading (vectors, data frames, functions, objects, etc.)
A new syntax for writing c(), print(), cat(), sort(), require(), subset()
#I. PRE MODULE REVIEW
Jeopardy!!!
#II. SETUP
##Setting up your R world
#III. BEGINNING DATA PROJECTS
For any data science project there are a few simple steps to follow.
world_internet_usage.csv
RMarkdown document: internet.Rmd
##Using R markdown to write your programs
Planning your programs, presenting your code, and sharing your work.
Resource: https://bookdown.org/yihui/rmarkdown/
##R Markdown Example
---
title: "Hello R Markdown"
author: "Kristen Sosulski"
date: "2020-08-17"
output: html_document
---
This is a paragraph in an R Markdown document.
Below is a sample code chunk:
#{r chunkname, echo=TRUE, eval=TRUE}
myformula <- (2+2)
##R Markdown Components
##R Markdown YAML header
YAML Ain’t markup language
#title: "Hello R Markdown"
#author: "Kristen Sosulski"
#date: "2020-08-14"
#output: html_document
##R Markdown YAML common output options
| Option | Creates |
|---|---|
| html_document | HTML |
| pdf_document | PDF (requires TeX) |
| word_document | Microsoft Word (.docx) |
| github_document | GitHub-compatible Markdown |
| ioslides_presentation | ioslides HTML slides |
##R Markdown common text options
# Header 1
## Header 2
### Header 3
**Bold**
_Italics_
##R Markdown R chunks
| Option | Default |
|---|---|
| eval | TRUE |
| echo | TRUE |
| warning | TRUE |
| error | FALSE |
echo=TRUE
2+2
## [1] 4
echo=FALSE
## [1] 4
##fig.height=3, fig.width=4
plot(attitude)
##See R Markdown cheatsheet
https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf
We have a couple of different options.
read_csv() function from the readr library
read.csv() function from the utils library
##readr versus Base R
readr keeps column names as written, while base R alters them: 2002 stays 2002 vs. X2002, and Country Name stays “Country Name” vs. Country.Name.
##read.csv()
internet_baser <- read.csv("world_internet_usage.csv")
head(internet_baser, 5)
##     country X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010
## 1     China  1.78  2.64  4.60  6.20  7.30  8.52 10.52 16.00 22.60 28.90 34.30
## 2    Mexico  5.08  7.04 11.90 12.90 14.10 17.21 19.52 20.81 21.71 26.34 31.05
## 3    Panama  6.55  7.27  8.52  9.99 11.14 11.48 17.35 22.29 33.82 39.08 40.10
## 4   Senegal  0.40  0.98  1.01  2.10  4.39  4.79  5.61  7.70 10.60 14.50 16.00
## 5 Singapore 36.00 41.67 47.00 53.84 62.00 61.00 59.00 69.90 69.00 69.00 71.00
##   X2011 X2012
## 1 38.30 42.30
## 2 34.96 38.42
## 3 42.70 45.20
## 4 17.50 19.20
## 5 71.00 74.18
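Why does base R show X2000 where the file says 2000? By default, read.csv() makes column names syntactically valid (its check.names argument is TRUE), which prefixes names that start with a digit. A small sketch, using a made-up two-row CSV string rather than the course file, shows the difference:

```r
# A tiny, made-up CSV held in a string (not the course file)
csv_text <- "country,2000,2001\nChina,1.78,2.64\nMexico,5.08,7.04"

# Default behaviour: names are made syntactically valid, so 2000 becomes X2000
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the names exactly as they appear in the file
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(kept_names)
```

With check.names = FALSE, columns such as 2000 need backticks when you refer to them, just as with the readr version.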
##read_csv()
library(readr)
internet_readr <- read_csv("world_internet_usage.csv")
head(internet_readr, 5)
## # A tibble: 5 x 14
##   country `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China     1.78   2.64   4.6    6.2    7.3    8.52   10.5   16     22.6   28.9
## 2 Mexico    5.08   7.04  11.9   12.9   14.1   17.2    19.5   20.8   21.7   26.3
## 3 Panama    6.55   7.27   8.52   9.99  11.1   11.5    17.4   22.3   33.8   39.1
## 4 Senegal   0.4    0.98   1.01   2.1    4.39   4.79    5.61   7.7   10.6   14.5
## 5 Singap~  36     41.7   47     53.8   62     61      59     69.9   69     69
## # ... with 3 more variables: `2010` <dbl>, `2011` <dbl>, `2012` <dbl>
##use read_csv()
As a best practice, we’re going to use the read_csv() function from the readr library to import the world_internet_usage.csv data as a tibble data frame.
class(internet_readr)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Take a look at your data.
knitr::kable(head(internet_readr, 10))
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 1.78 | 2.64 | 4.60 | 6.20 | 7.30 | 8.52 | 10.52 | 16.00 | 22.60 | 28.90 | 34.30 | 38.30 | 42.30 |
| Mexico | 5.08 | 7.04 | 11.90 | 12.90 | 14.10 | 17.21 | 19.52 | 20.81 | 21.71 | 26.34 | 31.05 | 34.96 | 38.42 |
| Panama | 6.55 | 7.27 | 8.52 | 9.99 | 11.14 | 11.48 | 17.35 | 22.29 | 33.82 | 39.08 | 40.10 | 42.70 | 45.20 |
| Senegal | 0.40 | 0.98 | 1.01 | 2.10 | 4.39 | 4.79 | 5.61 | 7.70 | 10.60 | 14.50 | 16.00 | 17.50 | 19.20 |
| Singapore | 36.00 | 41.67 | 47.00 | 53.84 | 62.00 | 61.00 | 59.00 | 69.90 | 69.00 | 69.00 | 71.00 | 71.00 | 74.18 |
| United Arab Emirates | 23.63 | 26.27 | 28.32 | 29.48 | 30.13 | 40.00 | 52.00 | 61.00 | 63.00 | 64.00 | 68.00 | 78.00 | 85.00 |
| United States | 43.08 | 49.08 | 58.79 | 61.70 | 64.76 | 67.97 | 68.93 | 75.00 | 74.00 | 71.00 | 74.00 | 77.86 | 81.03 |
##How could you visualize this to better understand it?
##A histogram?
##Building a histogram
hist(internet_readr$`2012`)
##Adding aesthetics
hist(internet_readr$`2012`, breaks=8,
main="Internet usage for 2012",
col="magenta", xlab=" ", labels=TRUE)
##Maybe too many breaks…
hist(internet_readr$`2012`, breaks=4,
main="Internet usage for 2012",
col="magenta", xlab=" ", labels=TRUE)
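One thing worth knowing when tuning breaks: when you pass a single number, hist() treats it as a suggestion and picks "pretty" breakpoints, so you may not get exactly that many bars. The sketch below types in the seven 2012 values from the table above (so it runs without the CSV file) and uses plot=FALSE to inspect the bins instead of drawing them.

```r
# 2012 values copied from the table above
usage_2012 <- c(42.30, 38.42, 45.20, 19.20, 74.18, 85.00, 81.03)

# plot = FALSE returns the histogram object without drawing it
h <- hist(usage_2012, breaks = 4, plot = FALSE)

h$breaks   # the breakpoints R actually chose
h$counts   # how many countries fall in each bin
```

If you need exact bin edges, you can pass a vector of breakpoints to breaks instead of a single number.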
##A histogram for every year using par(mfrow=c(6,3))
par(mfrow=c(6,3))
hist(internet_readr$`2000`)
hist(internet_readr$`2001`)
hist(internet_readr$`2002`)
hist(internet_readr$`2003`)
hist(internet_readr$`2004`)
hist(internet_readr$`2005`)
hist(internet_readr$`2006`)
hist(internet_readr$`2007`)
hist(internet_readr$`2008`)
hist(internet_readr$`2009`)
hist(internet_readr$`2010`)
hist(internet_readr$`2011`)
hist(internet_readr$`2012`)
##Histogram matrix
#EXERCISE 01 - COMPLETE
#Redo the histogram matrix
#and add aesthetics
##Boxplots
boxplot(internet_readr$`2012`,
main="Internet usage for 2012",
col="magenta",
xlab=paste("The median is:", median(internet_readr$`2012`)),
frame.plot=FALSE, horizontal=TRUE,
border="dark blue")
##Multiple boxplots
#Build box plots for 2000-2012
#based on the example above using par(mfrow=c(INSERT NUMBER OF ROWS, NUMBER OF COLUMNS))
#Include aesthetics.
geom_col(), geom_bar() or geom_line().
ggplot(internet_readr,aes(XVALUE, YVALUE)) + geom_col()
##Let’s think about how we would use the ggplot() function.
ggplot(internet_readr,aes(X,Y)) + geom_col()
wide to long formats
##We need to reshape our data from wide to long.
## [1] "Wide format"
| country | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 1.78 | 2.64 | 4.6 | 6.2 | 7.3 | 8.52 | 10.52 | 16.00 | 22.60 | 28.90 | 34.30 | 38.30 | 42.30 |
| Mexico | 5.08 | 7.04 | 11.9 | 12.9 | 14.1 | 17.21 | 19.52 | 20.81 | 21.71 | 26.34 | 31.05 | 34.96 | 38.42 |
## [1] "Long format"
| country | year | usage |
|---|---|---|
| China | 2000 | 1.78 |
| Mexico | 2000 | 5.08 |
Use the gather() function to reshape.
The gather() function from the tidyr package reshapes a tibble from wide to long form.
The pipe operator %>% passes the left-hand side of the operator to the first argument of the right-hand side of the operator.
library(tidyr)
tidy_internet_readr <-
internet_readr %>%
gather(`2000`:`2012`, key="year",
value="usage")
| country | year | usage |
|---|---|---|
| China | 2000 | 1.78 |
| Mexico | 2000 | 5.08 |
| Panama | 2000 | 6.55 |
| Senegal | 2000 | 0.40 |
| Singapore | 2000 | 36.00 |
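To see what a wide-to-long reshape involves, here is a rough base R sketch of the same idea on a two-country, two-year toy table (values copied from the wide table above). The object names wide_toy and long_toy are made up for this illustration; in the course code you would keep using gather().

```r
# Toy wide table; check.names = FALSE keeps the year column names as-is
wide_toy <- data.frame(
  country = c("China", "Mexico"),
  `2000` = c(1.78, 5.08),
  `2001` = c(2.64, 7.04),
  check.names = FALSE
)

# Every column except country holds one year of usage values
year_cols <- setdiff(names(wide_toy), "country")

# Stack the year columns: one row per country-year combination
long_toy <- data.frame(
  country = rep(wide_toy$country, times = length(year_cols)),
  year    = rep(year_cols, each = nrow(wide_toy)),
  usage   = unlist(wide_toy[year_cols], use.names = FALSE)
)

print(long_toy)
```

Each country now appears once per year, which is the shape ggplot() expects when year goes on one axis and usage on the other.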
write_csv(file, path)
library(readr)
internet_readr <- read_csv("worldtidy.csv")
head(internet_readr, 5)
## # A tibble: 5 x 3
##   country    year usage
##   <chr>     <dbl> <dbl>
## 1 China      2000  1.78
## 2 Mexico     2000  5.08
## 3 Panama     2000  6.55
## 4 Senegal    2000  0.4
## 5 Singapore  2000 36
##4) Understand - Visualize
* Let’s create a time series line graph.
* Use the ggplot2 package
ggplot(data, aes(x,y,color, group)) + geom_line() + ... + ...
library(ggplot2)
#assign the ggplot call to a variable
line01 <- ggplot(tidy_internet_readr,
aes(x=year,y=usage,color=country,
group=country)) + geom_line()
line01
##Adding aesthetics: Labels
+labs(title="", subtitle="",x="", y="", caption="")
library(ggthemes)
library(ggplot2)
line02<-ggplot(tidy_internet_readr,
aes(x=year,y=usage,color=country,
group=country)) + geom_line() +
labs(title = "Internet Usage per 100 people",
subtitle = "Since 2011,
the UAE has surpassed Singapore and the US in internet users",
caption = "Source: World Bank (2013)",
x = " ",y ="Usage")
##Let’s create a bar chart with the same data
library(ggplot2)
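#ggplot2 prefers bare column names, aes(year, usage); using $ inside aes() produces the warnings shown below
bar01 <- ggplot(tidy_internet_readr,
aes(tidy_internet_readr$year, tidy_internet_readr$usage))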
bar01 <- bar01 + geom_col() + theme_few() +
labs(title = "Internet Usage per 100 people",
x = "Year",y ="Usage",
caption="World Bank (2013)")
bar01
## Warning: Use of `tidy_internet_readr$year` is discouraged. Use `year` instead.
## Warning: Use of `tidy_internet_readr$usage` is discouraged. Use `usage` instead.
#Refining aesthetics: Specifying parameters
##Adding a fill / Color
fill="#4cbea3", color="#4cbea3"
bar01a <- bar01 + geom_col(fill="#4cbea3", color="#4cbea3") + theme_few() +
labs(title = "Internet Usage per 100 people",
x = " ",y ="Usage", caption="World Bank (2013)")
##Add a fill
+ theme(text=element_text(family="Avenir"))
bar01b <- bar01 + geom_col(fill="#4cbea3", color="#4cbea3") + theme_few() +
labs(title = "Internet Usage per 100 people",
x = " ",y ="Usage", caption="World Bank (2013)") + theme(text=element_text(family="Avenir"))
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not found in Windows font database
_theme(panel.border = element_blank(), panel.grid.major = element_blank(),panel.grid.minor = element_blank(), axis.line = element_line(color = "gray"), axis.ticks.x=element_blank(), axis.ticks.y=element_blank())_
bar01c <- bar01 + geom_col(fill="#4cbea3", color="#4cbea3") + theme_few() +
labs(title = "Internet Usage per 100 people",
x = " ",y ="Usage",
caption="World Bank (2013)") +
theme(text=element_text(family="Avenir"),
panel.border = element_blank(), panel.grid.major =
element_blank(),panel.grid.minor =
element_blank(),
axis.line = element_line(color= "gray"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank())
##Adding aesthetics: Themes
ggthemes package with additional themes
theme_classic(): white background, no grid lines
t1 <- bar01 + theme_classic()
t2 <- bar01 + theme_bw()
t3 <- bar01 + theme_minimal()
t4 <- bar01 + theme_economist()
t5 <- bar01 + theme_fivethirtyeight()
t6 <- bar01 + theme_hc()
theme_classic()
theme_bw()
theme_minimal()
theme_economist()
theme_fivethirtyeight()
##Themes: theme_hc()
cowplot::plot_grid() for layout
Reference: https://wilkelab.org/cowplot/articles/plot_grid.html
cowplot::plot_grid(t1, t2, t3,t4,t5,t6,labels=c("Classic","Black & White","Minimal","Economist","538", "High Charts"), label_size = 12, label_x = 0, label_y = 0, hjust = -0.5, vjust = -0.5, label_fontfamily="serif",
label_fontface = "bold",
label_colour = "#4cbea3")
##Saving your charts
ggsave("plot.png", width=5, height=5, units="in")
#VI. WRITING CONDITIONAL STATEMENTS
Early on, we learned about different types of comparison and logical operators. For example, comparison operators include <, <=, == and so on. These operators return a logical value, TRUE or FALSE.
3 > 4
c(1, 2, 3, 4, 5) > 4
c(1, 2, 3, 4, 6) == 3
3 > 4
## [1] FALSE
c(1, 2, 3, 4, 5) > 4
## [1] FALSE FALSE FALSE FALSE TRUE
c(1, 2, 3, 4, 6) == 3
## [1] FALSE FALSE TRUE FALSE FALSE
Logical operators include:
! (not), & (and), | (or)
##Example 1
We assign legal to the Boolean value TRUE and then use the logical not (!) operator to negate the variable legal which returns the value of FALSE.
legal <- TRUE
!legal
## [1] FALSE
##Example 2
We assign customer to the Boolean value TRUE and over50 to the value FALSE. Then we use the logical and (&) operator to evaluate both variables. If both are TRUE, then the result returned is TRUE. If only one or neither of the values is TRUE, then the result is FALSE.
customer <- TRUE
over50 <- FALSE
customer & over50
## [1] FALSE
##Example 3
Just as in the previous example, we assign customer to the Boolean value TRUE and over50 to the value FALSE. Then we use the logical or (|) operator to evaluate both variables. If both are TRUE, then the result returned is TRUE. If only one is TRUE, then the result returned is still TRUE. If neither of the values is TRUE, then the result is FALSE.
customer <- TRUE
over50 <- FALSE
customer | over50
## [1] TRUE
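The same operators also work element by element on whole vectors, which becomes useful once you start filtering data. A small sketch with made-up flags for three people (the values are invented for illustration):

```r
# Made-up flags for three people
customer <- c(TRUE, TRUE, FALSE)
over50   <- c(TRUE, FALSE, FALSE)

both         <- customer & over50   # TRUE only where both are TRUE
either       <- customer | over50   # TRUE where at least one is TRUE
not_customer <- !customer           # flips every value

print(both)
print(either)
print(not_customer)
```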
##Conditional statements
A conditional statement evaluates a condition to see if it is TRUE. Conditional statements may make use of comparative and logical operators.
A type of conditional statement used to evaluate if a condition is TRUE is called an if statement.
Let’s look at an example of the structure of an if statement:
if (test_expression) {
statement
}
The if is followed by a set of parentheses and inside is the condition being evaluated for truth. If it is TRUE, then we may specify an action such as updating a variable. If the condition is FALSE, no action occurs.
##Example 1
Let’s construct a real example that evaluates the value assigned to a variable price. The condition is price < 10. This condition will evaluate to FALSE. Therefore nothing is printed.
price <- 15.99
if (price < 10) {
print("This is an excellent deal!")
}
##Example 2.
Let’s change Example 1. We’ll change the value of price to 9.99. In this case, "This is an excellent deal!" will be printed to the screen. In other words, the code between the opening and closing curly brackets will run.
price <- 9.99
if (price < 10) {
print("This is an excellent deal!")
}
## [1] "This is an excellent deal!"
##Conditional statements using if/else logic
You can add an else clause to an if statement. If the test condition is not met, that is, it evaluates to FALSE, the else code will run.
The basic structure is:
if (test_expression) {
statement1
} else {
statement2
}
##Example 3
Let’s build on Example 1, where price is set to 15.99, and add an else statement. You’ll notice that the print statement following the if is ignored and the print statement within the else clause runs.
price <- 15.99
if (price < 10) {
print("This is an excellent deal!")
} else {
print("This product is too expensive")
}
## [1] "This product is too expensive"
Take note of the curly brackets in this example. The placement is important. The if and else both have an opening and closing brace. The else keyword must appear on the same line as the closing curly bracket of the if block.
##Nested if...else if statements
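The nested if...else if statement allows you to execute one block of code among more than two alternatives.
The syntax of the if...else if statement is:
if (test_expression1) {
statement1
} else if (test_expression2) {
statement2
} else {
statement3
}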
Let’s modify Example 3
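One way the modified Example 3 could look, as a sketch: it adds a middle tier between the "deal" and "too expensive" branches. The $20 cutoff, the middle message, and the message_text variable are assumptions made up for illustration.

```r
price <- 15.99

# Conditions are checked top to bottom; only the first TRUE branch runs
if (price < 10) {
  message_text <- "This is an excellent deal!"
} else if (price < 20) {   # assumed cutoff, for illustration only
  message_text <- "This is a fair price"
} else {
  message_text <- "This product is too expensive"
}

print(message_text)
```

Because the branches are tested in order, the second condition only needs to check price < 20; any price that reaches it is already known to be 10 or more.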
##Try it.
Test your program out by changing the value of price to ensure it evaluates the condition as you have planned.
The ifelse() function is a vectorized shorthand for the traditional if…else statement.
Syntax of the ifelse() function:
ifelse(test_expression, x, y)
Here, test_expression must be a logical vector (or an object that can be coerced to logical). The return value is a vector with the same length as test_expression.
This returned vector takes its element from x if the corresponding value of test_expression is TRUE or from y if the corresponding value of test_expression is FALSE.
In the example below, the ifelse() function evaluates the vector FALSE FALSE TRUE FALSE that resulted from the expression: a %% 2 == 0
a <- c(5,7,2,9)
ifelse(a %% 2 == 0,"even","odd")
## [1] "odd" "odd" "even" "odd"
Learn more at: https://www.datamentor.io/r-programming/ifelse-function/
Let’s say you want to evaluate a vector for NA values and print output to the screen based on whether there is an NA value or not. We can use the ifelse() function along with the is.na() function.
a <- c(NA,7,2,9)
ifelse(is.na(a),"NA","Not NA")
## [1] "NA" "Not NA" "Not NA" "Not NA"
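The same pattern can return values instead of labels. As a sketch, ifelse() can swap each NA for a fallback value while leaving the other elements untouched; the fallback of 0 and the name a_filled are arbitrary choices for illustration.

```r
a <- c(NA, 7, 2, 9)

# Where a is NA use 0, otherwise keep the original element
a_filled <- ifelse(is.na(a), 0, a)

print(a_filled)
```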
…
We will look at two kinds of loops: while and for.
A while loop is used when you want to execute some code some number (possibly an unknown number) of times.
while (test_expression)
{
statement
}
Here, test_expression is evaluated and the body of the loop is entered if the result is TRUE.
##while loop
The statements inside the loop are executed and the flow returns to evaluate the test_expression again.
This is repeated each time until test_expression evaluates to FALSE, in which case, the loop exits.
In the example below, how many times will the loop iterate?
x <- 10
while (x > 0) {
print(x)
x <- x - 1
}
You’ll note that the loop iterates ten times. x is initialized to 10, and each pass decreases x by 1 until x = 0, where x > 0 finally evaluates to FALSE.
x <- 10
while (x > 0) {
print(x)
x <- x - 1
}
## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1
In this example, counter is initialized to zero. The test expression in the while statement checks whether counter is less than 9; if so, the loop prints the value of counter and then increments it by 1.
How many times will this loop iterate?
counter <- 0
while (counter < 9) {
print(counter)
counter = counter + 1
}
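One way to check your answer is to let R do the counting. This sketch adds a second variable, iterations (a name chosen here for illustration), that goes up by one on every pass through the loop.

```r
counter <- 0
iterations <- 0   # extra tally, not part of the original example

while (counter < 9) {
  counter <- counter + 1
  iterations <- iterations + 1
}

print(iterations)
```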
What if counter were initialized to 10 and the test expression changed to while(counter > 9)? How many times would the loop iterate?
The loop would never stop: counter > 9 will always evaluate to TRUE since 10 is greater than 9. The statement counter = counter + 1 increments the control variable rather than decreasing it. Using counter = counter - 1 instead would create a state where the test condition evaluates to FALSE after a single iteration.
##Try it - Use the attitude data set
#pseudo code
while (there are still variables to evaluate)
{
#take the mean
#print the mean
#move to the next ncol
}
column <- ncol(attitude)
while (column > 0){
print(paste(names(attitude[column]),
":",mean(attitude[,column])))
column <- column - 1
}
## [1] "advance : 42.9333333333333"
## [1] "critical : 74.7666666666667"
## [1] "raises : 64.6333333333333"
## [1] "learning : 56.3666666666667"
## [1] "privileges : 53.1333333333333"
## [1] "complaints : 66.6"
## [1] "rating : 64.6333333333333"
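Writing the loop by hand is good practice, but for this particular task base R also has a built-in shortcut. As a cross-check, colMeans() returns the mean of every column of the attitude data set in one call (column_means is just a name chosen for this sketch):

```r
# attitude ships with R, so nothing needs to be loaded first
column_means <- colMeans(attitude)

# Round for easier reading
print(round(column_means, 2))
```

The values should match the loop output above, listed in the data set's column order rather than in reverse.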
##for loop
Another type of loop used in R is called a for loop. A for loop is used to iterate over a vector, such as a column in a data frame.
The syntax is as follows:
for (value in sequence)
{
statement
}
Here, sequence is a vector and value takes on each of the values in the sequence in turn. During each iteration, statement is evaluated.
for loop: Specifies the number of iterations.
In this code, the iterator (in this case, i) takes on the values in the vector c(1,1,3,4) sequentially through each “loop” of the code that is between the brackets, in this case print(i).
for (i in c(1,1,3,4)){
print(i)
}
## [1] 1
## [1] 1
## [1] 3
## [1] 4
Let’s build a program that checks to see which prices are considered “cheap” (less than $10) based on the following vector.
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)
##You can approach it the following way:
First, create the prices vector. Next, initialize a variable called numCheap. Third, create the for loop to traverse the prices vector. Fourth, check whether p is less than 10 and, if it is, add 1 to numCheap. Fifth, when the loop exits, print out the number of inexpensive items.
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)
numCheap <- 0
for (p in prices){
if (p < 10){
numCheap <- numCheap + 1
}
}
print(numCheap)
## [1] 3
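Once the loop version makes sense, it is worth seeing the vectorized shortcut. A comparison on a vector returns TRUE or FALSE for each element, and sum() counts each TRUE as 1, so the same answer fits on one line:

```r
prices <- c(12.43, 9.99, 18.22, 7.25, 0.50)

# prices < 10 gives one TRUE/FALSE per price; sum() counts the TRUEs
numCheap <- sum(prices < 10)

print(numCheap)
```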
#VIII. APPLICATION CAPITAL BIKESHARE
bikesharedailydata.csv
library(readr)
bikeshare <- read_csv("bikesharedailydata.csv")
head(bikeshare)
## # A tibble: 6 x 16
##   instant dteday season    yr  mnth holiday weekday workingday weathersit  temp
##     <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl> <dbl>
## 1       1 1/1/11      1     0     1       0       6          0          2 0.344
## 2       2 1/2/11      1     0     1       0       0          0          2 0.363
## 3       3 1/3/11      1     0     1       0       1          1          1 0.196
## 4       4 1/4/11      1     0     1       0       2          1          1 0.2
## 5       5 1/5/11      1     0     1       0       3          1          1 0.227
## 6       6 1/6/11      1     0     1       0       4          1          1 0.204
## # ... with 6 more variables: atemp <dbl>, hum <dbl>, windspeed <dbl>,
## #   casual <dbl>, registered <dbl>, cnt <dbl>
731 rows (observations) and 16 columns (variables or attributes).
bikeshare$season
## [1] 1 1 1 1 1 1 NA 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [26] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [51] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [76] 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [101] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [126] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [151] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 ## [176] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [201] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [226] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [251] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 ## [276] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ## [301] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ## [326] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ## [351] 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [376] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [401] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [426] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 ## [451] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [476] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [501] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [526] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [551] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [576] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [601] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [626] 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ## [651] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ## [676] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ## [701] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 ## [726] 1 1 1 1 1 1
##What type of variable is it?
The values of season range between 1 and 4.
This leads us to the next step, reviewing the data dictionary along with the data set to better understand the meaning behind the values.
To see which season is represented by each number in the variable season, we can review the data dictionary.
| Field | Definition |
|---|---|
| instant | record index |
| dteday | date |
| season | season (1:winter, 2:spring, 3:summer, 4:fall) |
| yr | year (0: 2011, 1:2012) |
| mnth | month ( 1 to 12) |
| hr | hour (0 to 23) |
| holiday | whether day is holiday or not |
| weekday | day of the week |
| workingday | if day is neither weekend nor holiday is 1, otherwise is 0. |
| weathersit | 1, 2, 3, 4 |
| – 1 | Clear, Few clouds, Partly cloudy, Partly cloudy |
| – 2 | Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
| – 3 | Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
| – 4 | Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog |
| temp | Normalized temperature in degrees F |
| atemp | Normalized feeling temperature in degrees F |
| hum | Normalized humidity. |
| windspeed | Normalized wind speed |
| casual | count of casual users |
| registered | count of registered users |
| cnt | count of total rental bikes including both casual and registered |
season is a categorical variable defined by one of four values, each representing a season (1: winter, 2: spring, 3: summer, 4: fall).
year is coded with the value of 0 for 2011 and 1 for 2012, rather than the actual year value of 2011 or 2012.
##3. Prepare or tidy your data
There are two key ways to rename columns.
rename() from dplyr
names() from base
Using rename() from the dplyr library:
library(dplyr)
bikeshare <- rename(bikeshare, humidity = hum, month=mnth)
names(bikeshare)
##  [1] "instant"    "dteday"     "season"     "yr"         "month"
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"
# Rename column where names is equal to "yr"
names(bikeshare)[names(bikeshare) == "yr"] <- "year"
names(bikeshare)
##  [1] "instant"    "dteday"     "season"     "year"       "month"
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"
## [11] "atemp"      "humidity"   "windspeed"  "casual"     "registered"
## [16] "cnt"
Use is.na() to count the number of NA values in the season column.
sum(is.na(bikeshare$season))
## [1] 1
##Using iteration to identify missing values
We can use a for loop to do this.
counter <- 0
for (i in bikeshare$season){
counter <- counter +1
if(is.na(i)==TRUE){
print(paste("It's true. There's an NA value on row",counter))
print(bikeshare[counter,])
}
}
## [1] "It's true. There's an NA value on row 7"
## # A tibble: 1 x 16
##   instant dteday season  year month holiday weekday workingday weathersit  temp
##     <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl> <dbl>
## 1       7 1/7/11     NA     0     1       0       5          1          2 0.197
## # ... with 6 more variables: atemp <dbl>, humidity <dbl>, windspeed <dbl>,
## #   casual <dbl>, registered <dbl>, cnt <dbl>
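The loop spells out every step; a more compact route to the same row numbers is which(), which returns the positions where a logical vector is TRUE. The sketch below uses a short made-up vector (season_toy) in place of bikeshare$season so it runs on its own.

```r
# Made-up stand-in for bikeshare$season
season_toy <- c(1, 1, NA, 2, 3, NA, 4)

# Positions of the missing values
na_rows <- which(is.na(season_toy))

print(na_rows)
```

On the real data, which(is.na(bikeshare$season)) should point to row 7, matching the loop output above.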
There are several ways you can tackle working with data that are incomplete. Each has its pros and cons.
In this case it’s easy to replace the value with a pre-defined value.
We wouldn’t want to ignore the record because the values can be easily determined.
##Update the values
bikeshare$season[7]
## [1] NA
bikeshare$season[7] <- 1
bikeshare$season[7]
## [1] 1
##4. Understand and visualize
bikesall <- ggplot(bikeshare, aes(atemp,cnt)) + geom_point(color="#4cbea3")
##Let’s see it!
bikesall
##Refine it
bikes2011 <- ggplot(bikeshare[bikeshare$year < 1,],
aes(atemp,cnt)) +
geom_point(color="#4cbea3") +
theme_few() + labs(title = "Rentals in 2011",
x = "Average temp",y=" ")
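The expression bikeshare[bikeshare$year < 1,] is doing the filtering here: the comparison produces one TRUE or FALSE per row, and only the TRUE rows are kept. A sketch on a tiny made-up data frame (bikes_toy and its values are invented for illustration):

```r
# Invented stand-in for the bikeshare data
bikes_toy <- data.frame(
  year = c(0, 0, 1, 1),
  cnt  = c(985, 801, 2294, 1951)
)

# Keep rows where year is 0 (2011); the empty slot after the comma keeps all columns
rows_2011 <- bikes_toy[bikes_toy$year < 1, ]

print(rows_2011)
```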
##2011
bikes2011
##Plot 2012
bikes2012 <- ggplot(bikeshare[bikeshare$year > 0,], aes(atemp,cnt)) + geom_point(color="#4cbea3") + theme_few() + labs(title = "Rentals in 2012", x = "Average temp",y=" ")
##2012
bikes2012
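The two plots above select rows with `year < 1` and `year > 0`, which relies on the coding 0 = 2011 and 1 = 2012. A minimal sketch of making that coding explicit with a labelled factor follows; the `toy` data frame and its counts are made up for illustration.

```r
# Hypothetical toy data using the same 0/1 coding for year
toy <- data.frame(
  year = c(0, 0, 1, 1, 1),
  cnt  = c(100, 150, 200, 250, 300)
)

# Attach readable labels to the coded values
toy$year_label <- factor(toy$year, levels = c(0, 1), labels = c("2011", "2012"))

# Subset with bracket indexing, as in the plots above
rides2011 <- toy[toy$year_label == "2011", ]

# subset() is an equivalent base R alternative
rides2012 <- subset(toy, year_label == "2012")

print(rides2011)
print(rides2012)
```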
We can show the two plots side by side by using the plot_grid function from the cowplot package (cowplot::plot_grid).
Pass the two plot objects bikes2011 and bikes2012 into the plot_grid function to see the graphs plotted side by side.
##plot_grid
cowplot::plot_grid(bikes2011, bikes2012, labels =c(" ", " "))
#IX. User Defined Functions
##User Defined Functions
Functions help us achieve our programmatic goals. They are the set of instructions that we can repeatedly call (or use) to manipulate data, objects, and states.
Some functions are built in such as:
##toupper
toupper("hello world")
## [1] "HELLO WORLD"
##mean
mean(c(1,2,3,4,5))
## [1] 3
##is.numeric
is.numeric(4)
## [1] TRUE
##is.na
is.na(NA)
## [1] TRUE
##sqrt
sqrt(25)
## [1] 5
Functions operate on some specified arguments. They generally return some value (a number, a string, etc.). For example, passing the value 25 to the square root function sqrt() returns the value 5.
This code shows the simple construction of a user-defined function. Each function has two parts: the function definition and the function call.
functionname <- function(x){
return(print(paste("The value", x, "is returned")))
}
functionname(34)
## [1] "The value 34 is returned"
Suppose we want to have a function to act on our data that adds 2 to every value. How would we design this function?
We begin by drafting out the main components of a function.
First, let’s give our function a name. We’ll call it myfunction.
Next, we have to assign it to a function declaration, namely function(), with an opening and closing curly bracket following the declaration. Then we add a variable as a parameter to the function declaration. In this case, we’ll call it myparameter.
myfunction <- function(myparameter){
}
myfunction(22)
Next, we write the body of the function. In this case, we are just adding 2 to myparameter.
myfunction <- function(myparameter) {
myparameter + 2
}
Now, when we call myfunction and pass in a number, we see a value is returned which is myparameter + 2.
myfunction(22)
## [1] 24
myfunction("hello")
The error is: Error in myparameter + 2 : non-numeric argument to binary operator
Can you see why this error was given? You’ll notice that the string hello caused this error.
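One way to inspect an error like this without stopping the script is `tryCatch()`. The sketch below redefines the unguarded `myfunction` from above and captures the error message as a string; the variable name `msg` is an assumption for illustration.

```r
# The unguarded version of the function from above
myfunction <- function(myparameter) {
  myparameter + 2
}

# tryCatch() runs the call; if an error occurs, the handler receives
# the error condition and we extract its message as text
msg <- tryCatch(
  myfunction("hello"),
  error = function(e) conditionMessage(e)
)

print(msg)
```

The message comes from the `+` operator, which cannot add a number to a character string.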
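To avoid this error, we can check the input before using it:
Test whether the argument is numeric with is.numeric(myparameter)==TRUE.
If it is, add 2 to myparameter.
myfunction <- function(myparameter) {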
if (is.numeric(myparameter)==TRUE)
{
myparameter + 2
} else{
print("Sorry. This function requires a value of type numeric.")
}
}
Let’s see this in action by calling myfunction and passing in a string. We can see the friendly message as output to the user.
myfunction("hello")
## [1] "Sorry. This function requires a value of type numeric."
Let’s call myfunction again with a numeric value. You can see the function works as it should when a numeric value is provided.
myfunction(3)
## [1] 5
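Printing a message works for interactive use, but the function then returns a string where a number was expected. A hedged alternative, not the approach used in these notes, is to signal a real error with `stop()`. The name `addTwo` is hypothetical.

```r
# Hypothetical variant that raises an error instead of printing a message
addTwo <- function(myparameter) {
  if (!is.numeric(myparameter)) {
    stop("This function requires a value of type numeric.")
  }
  myparameter + 2
}

# Works as before for numeric input
ok <- addTwo(3)
print(ok)

# The caller decides how to handle the failure
caught <- tryCatch(
  addTwo("hello"),
  error = function(e) conditionMessage(e)
)
print(caught)
```

With `stop()`, code that calls the function cannot accidentally continue with a message string in place of a result.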
Let’s create a function that takes more than one argument.
We’ll call this function addTogether and include 2 parameters or arguments in the function declaration.
addTogether <- function(x, y) {
if (is.numeric(x) & is.numeric(y)==TRUE)
{
x + y
} else {
print("Sorry, please enter two numbers")
}
}
Let’s call addTogether.
addTogether(5, 15)
## [1] 20
##Let’s call it, yet again, but this time passing in a number and a string.
addTogether(5, "d")
## [1] "Sorry, please enter two numbers"
addTogether(x = 5, y = 10)
## [1] 15
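The call above passes its arguments by name. Because addition gives the same result in either order, the sketch below uses a hypothetical `divide` function to show why naming arguments can matter.

```r
# Hypothetical function where argument order changes the result
divide <- function(x, y) {
  x / y
}

# Positional matching: the first value goes to x, the second to y
by_position <- divide(10, 2)

# Named matching: the order in the call no longer matters
by_name <- divide(y = 2, x = 10)

# Swapping positional arguments gives a different answer
swapped <- divide(2, 10)

print(c(by_position, by_name, swapped))
```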
##CODE HERE
avg <- function(x,y){
if (is.numeric(x) & is.numeric(y)==TRUE){
(x + y)/2
} else {
print("Sorry, please enter two numbers")
}
}
avg(1,"2")
## [1] "Sorry, please enter two numbers"
Calling avg(1) with only one argument produces:
Error in avg(1) : argument "y" is missing, with no default
The error produced is hard for a user to interpret. A big part of programming is planning what to do when someone uses your program, function, application, etc. and enters input that causes it to break.
Anticipating that the user may not enter the correct number of arguments can make your functions more usable.
addTogether <- function(x, y) {
if ( (hasArg(x) == FALSE) &(hasArg(y)==FALSE))
{
print("You didn't enter any values. Please enter two numbers.")
}
else if ( (hasArg(x) == FALSE) |(hasArg(y)==FALSE))
{
print("You only entered one value. Please enter two values.")
}
else if (is.numeric(x) & is.numeric(y)==TRUE)
{
x + y
} else {
print("Sorry, please enter two numbers.")
}
}
#Call the function
addTogether("3",3)
## [1] "Sorry, please enter two numbers."
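The version above uses hasArg(). As a sketch of an alternative, base R also provides `missing()`, which reports whether an argument was supplied. The function name `addTogetherChecked` is hypothetical, and this variant returns its messages rather than printing them.

```r
# Hypothetical variant of addTogether that uses missing() for the checks
addTogetherChecked <- function(x, y) {
  if (missing(x) && missing(y)) {
    return("You didn't enter any values. Please enter two numbers.")
  }
  if (missing(x) || missing(y)) {
    return("You only entered one value. Please enter two values.")
  }
  if (!is.numeric(x) || !is.numeric(y)) {
    return("Sorry, please enter two numbers.")
  }
  x + y
}

print(addTogetherChecked())
print(addTogetherChecked(3))
print(addTogetherChecked("3", 3))
print(addTogetherChecked(3, 4))
```

Returning early from each check keeps the happy path, `x + y`, at the bottom without nested else branches.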
What if we could apply a function to all the elements of the input? We can do this with a function called sapply().
sapply() comes from the apply family of functions.
The syntax is as follows: The function call is sapply with the arguments X, and FUN.
sapply(X, FUN)
Arguments: X is a vector, list, or data frame, and FUN is the function applied to each element of X.
The output from sapply is a vector or matrix.
Example - sapply()
df1 is a data frame that we created. To find the max value, we can use sapply() and pass in our data frame and then the max function.
df1 <- as.data.frame(c(1,2,3,4,5,6,7))
sapply(df1, max)
## c(1, 2, 3, 4, 5, 6, 7) ## 7
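To illustrate the statement that sapply() returns a vector or a matrix, here is a small sketch. The object names `roots` and `powers` are assumptions for illustration.

```r
# One value per element: sapply() simplifies the result to a vector
roots <- sapply(c(1, 4, 9), sqrt)
print(roots)

# Two values per element: sapply() simplifies the result to a matrix,
# with one column per input element
powers <- sapply(1:3, function(i) c(value = i, square = i^2))
print(powers)
print(is.matrix(powers))
```

The shape of the result depends on what FUN returns for each element, so it is worth checking with `class()` or `dim()` when a function is new to you.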
takemean <-function(x){
mean(x, na.rm=TRUE)
}
Apply the takemean() function over a vector using sapply().
##     rating complaints privileges   learning     raises   critical    advance
##   64.63333   66.60000   53.13333   56.36667   64.63333   74.76667   42.93333
##      rating        complaints     privileges       learning         raises
##  Min.   :40.00   Min.   :37.0   Min.   :30.00   Min.   :34.00   Min.   :43.00
##  1st Qu.:58.75   1st Qu.:58.5   1st Qu.:45.00   1st Qu.:47.00   1st Qu.:58.25
##  Median :65.50   Median :65.0   Median :51.50   Median :56.50   Median :63.50
##  Mean   :64.63   Mean   :66.6   Mean   :53.13   Mean   :56.37   Mean   :64.63
##  3rd Qu.:71.75   3rd Qu.:77.0   3rd Qu.:62.50   3rd Qu.:66.75   3rd Qu.:71.00
##  Max.   :85.00   Max.   :90.0   Max.   :83.00   Max.   :75.00   Max.   :88.00
##     critical        advance
##  Min.   :49.00   Min.   :25.00
##  1st Qu.:69.25   1st Qu.:35.00
##  Median :77.50   Median :41.00
##  Mean   :74.77   Mean   :42.93
##  3rd Qu.:80.00   3rd Qu.:47.75
##  Max.   :92.00   Max.   :72.00
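The column names in the output above match R's built-in attitude data set, so the sketch below assumes that data set was the input. It also shows why takemean() sets na.rm=TRUE, using a made-up vector `with_na`.

```r
takemean <- function(x) {
  mean(x, na.rm = TRUE)
}

# A made-up vector with a missing value
with_na <- c(10, NA, 20)

# Without na.rm = TRUE, a single NA makes the mean NA
plain <- mean(with_na)

# takemean() ignores the NA
safe <- takemean(with_na)
print(c(plain, safe))

# Assumption: the means shown above come from the built-in attitude data
column_means <- sapply(attitude, takemean)
print(column_means)
```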
#EXERCISE 03 - COMPLETE
##CODE HERE
Keys, endpoints, and methods (and in R, some extra packages)
The get_current_weather() function comes from the ROpenWeatherMap package:
get_current_weather(api_key, cityID = NA, city = "", country = "", coordinates = NA, zip_code = NA)
library(ROpenWeatherMap)
library(tidyverse)
api_key <- "ffb7b9808e07c9135bdcc7d1e867253d" #Kristen's key please get your own. =)
#API Call that reads in as a list
newyork <- get_current_weather(api_key,city="New York")
class(newyork)
## [1] "list"
##Cast to data frame or tibble
newyork <- data.frame(newyork)
newyork$city <- "New York"
class(newyork)
## [1] "data.frame"
newyork[c(length(newyork),1,2,3,4,5)]
##       city coord.lon coord.lat weather.id weather.main weather.description
## 1 New York    -74.01     40.71        800        Clear           clear sky
#Reorder again and save it back to newyork
newyork <- newyork[c(length(newyork),1,2,3,4,5)]
ct <- get_current_weather(api_key,city="Cooperstown")
ct <- data.frame(ct)
ct$city <- "Cooperstown"
ct[c(length(ct),1,2,3,4,5)] #reorder
##          city coord.lon coord.lat weather.id weather.main weather.description
## 1 Cooperstown    -74.92      42.7        800        Clear           clear sky
ct <- ct[c(length(ct),1,2,3,4,5)]
cityweather <- rbind(newyork[,1:6], ct[,1:6]) #good resources on joins: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right
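rbind() stacks data frames row by row, which only works when both have the same columns. The sketch below demonstrates this without calling the API. The data frames `newyork_toy` and `ct_toy` are hand-built stand-ins that reuse the coordinates printed above.

```r
# Hand-built stand-ins for the API results (no API call is made)
newyork_toy <- data.frame(
  city = "New York", coord.lon = -74.01, coord.lat = 40.71,
  weather.main = "Clear"
)
ct_toy <- data.frame(
  city = "Cooperstown", coord.lon = -74.92, coord.lat = 42.7,
  weather.main = "Clear"
)

# Same columns in both data frames: rbind() stacks the rows
combined <- rbind(newyork_toy, ct_toy)
print(combined)

# Different number of columns: rbind() raises an error, caught here
mismatch <- tryCatch(
  rbind(newyork_toy, ct_toy[, 1:3]),
  error = function(e) "error"
)
print(mismatch)
```

This is why the code above selects the same six columns, `[,1:6]`, from both data frames before binding them.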
Let’s plot our cities on a map and label it with the weather.
library(maps)
map(database="county", region=c("New York"), col="#cccccc")
symbols(cityweather$coord.lon, cityweather$coord.lat, bg="#e2373f", fg="#ffffff", lwd=0.5, circles = rep(1,length(cityweather$coord.lon)), inches=0.1, add=TRUE)
library(leaflet)
m <- leaflet()
m <- addTiles(m)
m <- addMarkers(m, lng=cityweather$coord.lon, lat=cityweather$coord.lat, popup=paste(cityweather$city, cityweather$weather.main))
m
#XI. RSHINY
install.packages("shiny")
##What is shiny?
##Video - Movie Explorer
Here’s a simple example of a shiny app. It’s a movie explorer app that allows you to select which variables you want to plot on the x and y axes. This allows you to interact with the data and explore it through a visualization interface.
https://shiny.rstudio.com/gallery/movie-explorer.html
##Video - Marathon Training
##Video - Intelligencia
There are many user interface features you can plug and play into your R code and turn it into an App.
These interactive features are called UI inputs. For example there are:
Sliders, slider range, text input,
Numeric input, radio buttons, select boxes,
Date input, date range, file input,
The user interface for a shiny app can use one of many layout options. These include panels that contain UI inputs, such as
wellPanel(dateInput("a", ""), submitButton())
This renders a single element called a panel.
A shiny app can display several types of output: it can render a plot, render a table, or show text on the screen via renderPrint.
In the case of the movie explorer, the output of the app was a plot.
To start building a shiny app, the first thing you need to do is install the shiny package.
install.packages("shiny")
##And then enable it by calling library(shiny)
library(shiny)
You’ll notice this shiny app has one of the UI elements we discussed, the slider.
This allows the user to specify the number bins to show in the histogram.
Here the server calls the renderPlot function to show the histogram.
In this example, the code is also displayed.
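An app like the one described above can be sketched as follows. This is an illustration under stated assumptions, not the exact code of the example app: the helper `make_breaks`, the input id "bins", the output id "distPlot", the title, and the use of the built-in faithful data set are all choices made here. The shiny part is wrapped in a `requireNamespace()` check so the script still runs where shiny is not installed.

```r
# Hypothetical helper: turn a requested number of bins into break points
make_breaks <- function(x, bins) {
  seq(min(x), max(x), length.out = bins + 1)
}

if (requireNamespace("shiny", quietly = TRUE)) {
  # UI: a title, a slider for the number of bins, and a plot placeholder
  ui <- shiny::fluidPage(
    shiny::titlePanel("Histogram explorer"),
    shiny::sliderInput("bins", "Number of bins:", min = 1, max = 50, value = 30),
    shiny::plotOutput("distPlot")
  )

  # Server: redraw the histogram whenever the slider value changes
  server <- function(input, output) {
    output$distPlot <- shiny::renderPlot({
      x <- faithful$waiting
      hist(x, breaks = make_breaks(x, input$bins), main = "Waiting times")
    })
  }

  app <- shiny::shinyApp(ui = ui, server = server)
  # In an interactive session, shiny::runApp(app) launches the app
}
```

Keeping the bin calculation in a plain function makes it possible to check that logic without starting the app.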
Let’s look at the building blocks of a shiny app.
library (shiny)
#Defines the user interface through nested R functions
ui<-fluidPage()
#Specifies how to build and rebuild R objects in the ui
server <- function(input,output){}
#Combines ui and server into an app. Call with runApp()
shinyApp(ui=ui, server=server)
Section 1 is the ui - nested R functions that assemble an HTML user interface for your app
fluidPage() contains the elements in the app.
The fluidPage contains both the input and output functions, for example, the titlePanel, sliderInput, and plotOutput.
You can change the title, what type of input you want (numericInput( ), selectInput( ), and dateInput( ) are popular as well), and the elements within the input (like the minimum, maximum and preset values).
The server function, function(input, output) {}, actually builds the output.
Now that you are familiar with shiny apps and their components, try building your own.
app.R will be the file you modify. It is saved in a new directory.
Run the app with runApp(path to your directory).
#EXERCISE 04 - COMPLETE
Try your best to achieve the following:
Revise app to provide a default view of most recent distributions by most recent assignment due date
Revise app to include a selector by one or more students
Revise app to include doughnut charts to show completion, late or uncompleted assessments by assignment
Revise app to include a student list
##Prototype specification
Review lesson 8 from R Fundamentals (Sosulski, 2020): RShiny http://becomingvisual.com/rfundamentals/rshiny.html
Must complete both of the following. No partial credit.
Complete Assignment 7 http://becomingvisual.com/rfundamentals/conditionals-controls-functions.html#assignment-7
Complete Assignment 8 http://becomingvisual.com/rfundamentals/interactive-applications-using-rshiny.html#assignment-8
Submit both to NYU Classes > R Extra Credit