Intro to R & Simple Analytics

The topics to be covered are:

The R language, RStudio, Math Operations, Probability, Linear Algebra (lines/spaces)
Manipulating Data Structures, Building Functions, and Using Probability
Summarizing Data and Building Tables
Building Graphs, Titanic Task
Project Brainstorming

Introduction to Data Science, the R language, RStudio, and Basic Math Operations

Data science is a huge and lucrative field, so there are many ways to learn it and many levels of complexity at which to understand it. Because the world is increasingly data driven, knowing how to read, manipulate, and gain insight from data can help you in your career as well as in everyday life. Furthermore, given how much data is now accessible, drawing insight about a topic, field, or situation is much easier than before. The R language is a tool that lets you manipulate and understand data intuitively and efficiently, and visualize many new mathematical concepts along the way. RStudio is an IDE (integrated development environment) that lets you organize your work and immediately see the effect of your code and model building.

We generally work on 3 problems in data science:

Regression: We get a set of inputs (X) and we get a set of values (Y) and our job is to use the given information to model a general relationship between X and Y so that for future X’s, we can predict the associated Y.
Classification: This is the same as regression except that the values (Y) are replaced with categories. In other words, given inputs X, we are trying to build a function that maps each X to the correct category.
Clustering: Here we are given inputs X with no labels, and we want to discover a structure or pattern for these inputs.
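To make two of these problem types concrete, here is a minimal sketch in R. It is only an illustration: the choice of the built-in "cars" and "iris" datasets, and of the base functions lm() and kmeans(), are assumptions made for this example and are not the only way to approach these problems. Classification is left out to keep the sketch short.

```r
# Regression: model stopping distance (Y) from speed (X) using the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
slope <- unname(coef(fit)["speed"])

# Use the fitted relationship to predict Y for a new X (speed = 10).
predicted <- unname(predict(fit, newdata = data.frame(speed = 10)))

# Clustering: group the iris measurements without using the species labels.
set.seed(1)  # k-means starts from random centers, so fix the seed
groups <- kmeans(iris[, 1:4], centers = 3)
cluster_sizes <- table(groups$cluster)

print(slope)
print(predicted)
print(cluster_sizes)
```

The regression part learns one line from known (X, Y) pairs, while the clustering part only ever sees X and has to find the groups on its own.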

R

R is a statistical language that is excellent for experimenting and testing different ideas. Its advantages include that it manages memory automatically and does not require you to declare data types, that it is built to handle vectors (arrays), lists, and matrices efficiently, and that it can build models flexibly. It also supports a functional programming style: models, numbers, vectors, and functions are all objects that are assigned and used in the same way. RStudio is an IDE (integrated development environment) for R, and R Markdown provides an easy way to create reports that integrate code with the results of that code. We can also run ordinary (non-data-related) scripts in R, so it can be used for general-purpose tasks as well. One of the challenges in starting out with R is that it is very particular about syntax and punctuation, which can be frustrating early on. With repetition, however, the language becomes quick to learn.

This section gives a light introduction to R. In particular, it covers some basic usage of RStudio, how an R Markdown file works, and an overview of the R programming language. At the end, there are 5 relatively simple questions that make use of all the features covered so far.

You must download R from here (https://cran.r-project.org/mirrors.html) and RStudio from here (https://www.rstudio.com/products/rstudio/download/).

RStudio

RStudio is an IDE for R with many features that make it effective for experimenting. Its window is divided into four panes, referred to here as Help/Plots, Environment, Scripts, and Console.

Help/Plots

The help/plots section appears in the lower right corner of the RStudio window, and is used to display images/plots, browse files, and show help pages for functions/packages. Whenever an image (plot/graph) is generated in an R script, it is shown in this help/plots tab. Similarly, whenever the “help()” command is called on a function (e.g. help(plot)), the output is shown in this tab as well. Run the command below for an example.

help(plot)

## starting httpd help server ... done

Environment

The environment tab (top right) keeps track of everything currently stored in R's memory in the given working session, including all data structures and variables. The neighbouring History tab shows a history of commands. To clear the environment and reset the working session, use the following command.

rm(list=ls())

Now, to see a variable “a” with value 3 be added to the environment, run the following and see the update in the environment section:

a <- 3

Scripts Section

The scripts section (top left) is used to write R scripts, build R Markdown files, and even build web applications using “Shiny”. Many other types of R-based files can be created here too. R scripts are plain files of R code that can be run line by line or all at once. Details about R Markdown documents are discussed in the RMarkdown section.

Console

The console (bottom left) is an interactive place to run R commands. When an R script is run, its commands are typically passed through the console. In the console, you can even enter math commands such as “2+3” and get an immediate result.

The console also typically has a terminal window, which can take linux commands to navigate through the directory of your computer. This will not be used extensively in these notebooks, but is still useful to know about.

RMarkdown

R Markdown is a file type that allows a user to quickly build a static webpage that integrates analysis with the actual code and its output. All of the notebooks in this set are written with R Markdown. In R Markdown it is easy to display both the code used and its results, and to add text that analyzes the results on the same page.

Code Blocks

To create a code block, use the symbols " ``` " followed by an opening curly brace “{”, the language being used (in this case “r”), the name of the code block, and a closing curly brace “}”. After the code segment is finished, the block is closed with another " ``` " symbol set. For instance, to build code written in the R language with the name “firstCodeBlock” in an R Markdown document, the sequence of commands is:

```{r firstCodeBlock}

print("my first code block")

```

This will look as follows in the actual html/pdf file:

print("my first code block")

## [1] "my first code block"

To run a code block from inside the RMarkdown file, click the small “play” icon in the top right corner of the block.

Similarly, to include a plot, the following code is used in an R code block:

plot(pressure)

Of the many additional arguments that can be added to a code block, one is the “echo” flag. If “echo” is TRUE, the code itself appears in the RMarkdown document along with its output; if it is FALSE, the code is hidden and only its output appears. The default value is TRUE, so the first two blocks of code appeared along with their output. In the next block, setting the flag with the header “{r pressure2, echo=FALSE}” hides the code and shows only the plot:

Building Documents

The “Knit” command is located in the command bar, and can generate an html file with the RMarkdown content and results. This can then be saved as a PDF or attached to a webpage as a static page.

Basic R Commands

This section covers assignments and printing, math operations, and functions. First, though, it is important to emphasize that comments in R are done with the “#” symbol. This way, if an R script reads a “#” symbol the remainder of the line will be a comment, unless the “#” symbol is between quotation marks. In the code below, “#” are used to indicate a new idea within the code.

Assignment Operator and Printing

R has 2 methods for assigning a value to a variable:

“<-”: a <- 3
“=”: a = 3

Both provide essentially the same functionality, but it is recommended to use “<-”. The assignment operator is also used for storing a model or function or data frame or vector, and so it is very important.

To print a value, you can either use the “print()” function or just place the value on a line on its own. For instance, to print the variable a defined above, one can use:

a
print(a)

Note that print(“a”) will print the letter “a”. Also, assigning a new value to an existing name will overwrite that name.

Below are some examples in code:

a <- 3
b <- 5
a

## [1] 3

b

## [1] 5

a <- 2
b <- 6
print(a)

## [1] 2

print(b)

## [1] 6

Math Operations

R can handle all of the standard math operations such as addition, subtraction, division, multiplication, exponents, and modulo (+, -, /, *, ^, %%), and all standard comparisons and logical operations such as equal, greater than, less than, and, or, not, and exclusive or (==, >, <, &, |, !, xor). It can even handle “set” commands such as intersect, union, setdiff, and setequal (using the names given), but these will be explored more in future notebooks.

Below with code and comments there is a short illustration of how to run math commands. It’s also easy to store results from operations into variables.

# math Operations
a <- 2
b <- 4
2 + 3 # add

## [1] 5

3 - 1 # subtract

## [1] 2

a * 2 # multiply

## [1] 4

b / 2 # divide

## [1] 2

b %% a # modulo (remainder)

## [1] 0

a ^ b # exponents

## [1] 16

a ** b # also exponents

## [1] 16

# storing results
print("storing results")

## [1] "storing results"

c <- a ** b
c

## [1] 16

d <- b %% a
d

## [1] 0

e <- c * d
a <- e

# comparisons
a < b

## [1] TRUE

a != b

## [1] TRUE

e != a

## [1] FALSE

QUESTION: What value will “a” hold?
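The logical and set operators listed earlier are not exercised in the examples above, so here is a brief sketch of them. The variable names (p, q, s1, s2) are made up for this illustration and do not touch the variables used in the question.

```r
p <- TRUE
q <- FALSE

both <- p & q             # and: FALSE
either <- p | q           # or: TRUE
negated <- !p             # not: FALSE
exactly_one <- xor(p, q)  # exclusive or: TRUE

s1 <- c(1, 2, 3, 4)
s2 <- c(3, 4, 5)

common <- intersect(s1, s2)    # values in both: 3 4
combined <- union(s1, s2)      # values in either: 1 2 3 4 5
only_first <- setdiff(s1, s2)  # values in s1 but not s2: 1 2
same_set <- setequal(s1, s2)   # FALSE
```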

Functions

R has many built-in functions, as well as an easy way for users to write their own. Functions are dynamically typed, meaning that, unlike in many other languages, you do not need to specify the input types or the return type. Functions are stored the same way a variable is stored, and can therefore be referred to by name the way a variable is. Simply entering a function’s name in R prints its definition, which gives you an idea of what the function does and how.

For a function named “hello” which takes in a string “input” and returns the string “hello input”, the syntax is:

hello <- function(input) {
  paste("hello", input)
}

Even though the function expects a string, declaring the input as a string was not required. This is both an advantage and a disadvantage: while it is easier to write functions and make them general purpose, sometimes they will behave unexpectedly without warning the user!

Note, “paste” is an existing function in R which takes any number of inputs and returns them concatenated, separated by spaces by default. To see how “paste” is defined, type the function name (paste).
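As a small sketch of what "unexpected" can look like, the same “hello” function can be called with inputs that are not a single string. The inputs below are chosen purely for illustration.

```r
hello <- function(input) {
  paste("hello", input)
}

from_string <- hello("harry")               # "hello harry"
from_number <- hello(3)                     # "hello 3": the number is converted to text
from_vector <- hello(c("harry", "potter"))  # two results, one per element

print(from_string)
print(from_number)
print(from_vector)
```

None of these calls produces an error or a warning, so it is up to the author of the function to decide whether such inputs should be allowed.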

paste

## function (..., sep = " ", collapse = NULL) 
## .Internal(paste(list(...), sep, collapse))
## <bytecode: 0x0000000012e66388>
## <environment: namespace:base>

To call a function, use the function name followed by any inputs in parentheses. R also supports default arguments, so if the above function is rewritten as follows, it can be called with or without an argument. Default arguments are usually a good idea, both to help the user and to make it immediately obvious when something went wrong.

hello <- function(input="nobody") {
  print(paste("hello", input))
}
# default
hello()

## [1] "hello nobody"

# new
hello("harry")

## [1] "hello harry"

hello("potter")

## [1] "hello potter"

If Statements

We can use traditional if statements in R as well. If statements give a method of running a section of code if a condition is satisfied. They follow the syntax below:

if (a == 0) {
  print("hello")
} else if (b == 3) {
  print("helllloooo")
} else {
  print("goodbye")
}

## [1] "hello"

They can be added to a function as well with the following syntax:

myIfFn <- function(a,b) {
  if (a == 0) {
    print("hello")
  } else if (b == 3) {
    print("helllloooo")
  } else {
    print("goodbye")
  }
}
myIfFn(0,1)

## [1] "hello"

myIfFn(3,3)

## [1] "helllloooo"

myIfFn(5,5)

## [1] "goodbye"

Help Function

The most important function for practising, and least important for production is “help()”. You can use the help function along with any function name to see how a function works. For instance, to learn more about paste:

help(paste)

QUESTION: Use the help command on paste to change the function above so it prints the inputted words without any spaces between them (e.g. an input of “harry” would result in the output “helloharry” instead of “hello harry”).

Another method to get help is to add a question mark before the command being searched. For example:

?paste

Practical (Questions)

Create a code block labelled q0 and inside it create a variable “a” which holds the value 3.

a <- 3

Create a code block labelled q1 and inside it create a variable “b” which holds the value 2. For all future questions you must create a code block with the question number and answer it with code inside the block.
Build a function that takes in 2 numbers and returns their sum.
Build a function that takes 2 strings and returns their concatenation without spaces. Check the “paste” function documentation for hints. For example:

input: “a”, “b” returns “ab” (not “a b”)

Use the seq function (search for it using help) to generate a sequence of 100 numbers separated by 0.5.
Write a function which takes in two numbers a and b and returns the larger of the sum or multiplication of those numbers.

Day 2. Manipulating Data Structures in R and Statistical Features

Today we will look at basic data structures in R and some simple statistical features!

1. Basic Data Structures

In R, the basic data structures are vectors, lists, matrices, and data frames. Typically, data structures are organized by whether they are homogeneous or heterogeneous (all elements must have the same type versus elements may have different types) and by their dimension (1d, 2d, nd). To classify the four types:

Vectors and matrices are homogeneous, while lists and data frames are heterogeneous.
Vectors and lists are 1d, while matrices and data frames are 2d.
There is also an array structure, which is nd, but it is not important for now.

Vectors

Vectors are one of the more unique and powerful data types in R. Vectors are 1 dimensional and homogeneous (all their elements are of the same type). These resemble a vector in linear algebra. R is often used in statistics because it has the ability to do operations on vectors quickly. This also allows much of the code to avoid loops and repetitive functions, as subsetting/aggregating and applying operations to all elements of a vector can be done with more efficient commands.

An important subtlety is that R does not have “0 dimension” (scalar) variables the way most languages do. That means the earlier definition (“a <- 3”) actually creates a vector of length 1, not an object that holds only one value. Vectors are built with the “c()” function, using commas to separate elements, so a length-one vector is defined with “a <- c(3)” and a longer vector with “b <- c(1,2,3)”.

a <- c(3)
aPrime <- 3
a == aPrime

## [1] TRUE

b <- c(1,2,3)
a

## [1] 3

b

## [1] 1 2 3

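Note that { a <- c(1, “2”, 3) } will not produce a vector of mixed types: because vectors are homogeneous, R silently converts every element to a character string, giving the same result as { a <- c(“1”,“2”,“3”) }.

Calls to c() can also be nested, and the result is flattened, so c(1,c(2,c(3,4))) is equivalent to c(1,2,3,4).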
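Mixing types inside c() is worth seeing concretely. The sketch below shows what R does when a number and a string are combined in one vector; the variable names are made up for this example.

```r
mixed <- c(1, "2", 3)
class(mixed)  # "character"
mixed         # "1" "2" "3"

# Convert back to numbers explicitly when that is what you want.
numbers <- as.numeric(mixed)
total <- sum(numbers)  # 6

# Logical values mixed with numbers become numbers (TRUE -> 1, FALSE -> 0).
flags <- c(TRUE, FALSE, 5)  # 1 0 5
```

Because the conversion happens without any message, it is a good habit to check class() on a vector when results look strange.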

c(1,2,3,4) == c(1,c(2,c(3,4)))

## [1] TRUE TRUE TRUE TRUE

Subsetting Items

To choose items from a vector, use square brackets “[ ]”. Vectors start counting at index 1 (unlike most programming languages, which start at 0), so the second item of vector “b” above is selected with b[2]. If an index is chosen that does not exist, R returns “NA”.

# returns 2nd value of b
b[2]

## [1] 2

# a has only one value so returns NA
a[2]

## [1] NA

Vectors can also be subset with conditions, e.g. selecting all elements of b above with a value greater than 2:

b[b>2]

## [1] 3
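A few more subsetting patterns are sketched below on a made-up vector “v” (an assumption for this example): selecting several positions at once, dropping a position with a negative index, combining conditions, and asking which positions satisfy a condition.

```r
v <- c(10, 20, 30, 40)

first_and_third <- v[c(1, 3)]  # 10 30
all_but_first <- v[-1]         # 20 30 40
middle <- v[v > 15 & v < 40]   # 20 30
positions <- which(v > 15)     # 2 3 4 (indices, not values)
```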

Note we can also do math operations on vectors, and this can lead to interesting results when the vectors do not have the same length!

c <- c(3,4,5)
a + c

## [1] 6 7 8

a + b

## [1] 4 5 6

a * b

## [1] 3 6 9

c / b

## [1] 3.000000 2.000000 1.666667
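The behaviour with unequal lengths is called recycling: the shorter vector is repeated until it matches the longer one. The sketch below uses made-up vectors to show the two cases, one where the lengths fit evenly and one where they do not.

```r
long <- c(1, 2, 3, 4)
short <- c(10, 20)

# Length 4 is a multiple of length 2, so short is silently reused as 10 20 10 20.
even_fit <- long + short  # 11 22 13 24

# Length 3 is not a multiple of length 2: R still recycles, but issues a warning.
uneven <- c(1, 2, 3)
partial_fit <- suppressWarnings(uneven + short)  # 11 22 13
```

suppressWarnings() is used here only so the sketch runs quietly; remove it to see the warning R prints.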

Lists

Lists are similar to vectors except that their elements can be of mixed types (including other lists). Lists are built using the “list()” command, and each argument becomes one element. For example, the following is a list composed of:

1) a vector of 3 numbers
2) one character
3) one character
4) a 2 character vector
5) a nested list of 3 elements, each of which is a number

x <- list(1:3, "a","b", c("c", "d"), list(1,2,3))
x

## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] "b"
## 
## [[4]]
## [1] "c" "d"
## 
## [[5]]
## [[5]][[1]]
## [1] 1
## 
## [[5]][[2]]
## [1] 2
## 
## [[5]][[3]]
## [1] 3

To select specific elements from the larger list, a double square bracket “[[]]” is used. For example, each element of the above list can be stored in a separate variable, and then a new list can be created by combining those variables:

a <- x[[1]]
b <- x[[2]]
c <- x[[3]]
d <- x[[4]]
e <- x[[5]]
a

## [1] 1 2 3

b

## [1] "a"

c

## [1] "b"

d

## [1] "c" "d"

e

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3

y <- list(a,b,c,d,e)
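The difference between single and double square brackets on a list is easy to miss, so here is a short sketch. The named list “person” is a made-up example, and naming elements is an addition not covered in the text above.

```r
x <- list(1:3, "a", "b", c("c", "d"), list(1, 2, 3))

single <- x[1]         # still a list, of length 1
double <- x[[1]]       # the integer vector 1 2 3 itself
nested <- x[[5]][[2]]  # second element of the nested list: 2

# Naming the elements lets you use $ instead of positions.
person <- list(name = "harry", scores = c(90, 85))
best <- max(person$scores)  # 90
```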

Matrices

Matrices are built using the “matrix()” function. These resemble the matrices from linear algebra. They only store one type, and typically are used in applications to help with calculations.

myMatrix <- matrix(1:15, nrow=5, ncol=3)
myMatrix

##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15

To choose an element of the matrix, use a comma inside the square brackets to separate the row and column, e.g. to choose the 2nd row and 3rd column: myMatrix[2,3]. To isolate just the 2nd row or 3rd column, use myMatrix[2,] or myMatrix[,3].

Below is an example on the “myMatrix” object built above.

myMatrix[2,3]

## [1] 12

myMatrix[2,]

## [1]  2  7 12

myMatrix[,3]

## [1] 11 12 13 14 15
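Since matrices are mostly used for calculations, a few common operations are sketched below on a small made-up matrix “m”. The functions used (dim, t, and the %*% operator) are part of base R.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)                     # filled column by column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled row by row

dim(m)                 # 2 3 (rows, columns)
flipped <- t(m)        # transpose: 3 rows, 2 columns
doubled <- m * 2       # element-wise multiplication
product <- m %*% t(m)  # matrix multiplication: a 2 x 2 result

print(by_row)
print(product)
```

Note that “*” multiplies matching elements, while “%*%” is the matrix product from linear algebra.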

Data Frames

Data Frames are very commonly used to store data in R, and help make analysis much easier. These are essentially tables. They can be created with the “data.frame()” command, combined with “cbind” and “rbind”, and then specific columns can be selected by name with the “$” symbol, and rows with the row indices in square brackets (“[1:3,]” or “[c(1,3,5,7),]”).

# build frame
df <- data.frame(x=1:4, y=c("a","b","c","d"), stringsAsFactors = FALSE)

# add column
col <- c(4,6,8,10)
df <- cbind(df, col)

# add row
row <- c("a", "b", "c")
rbind(df, row)

##   x y col
## 1 1 a   4
## 2 2 b   6
## 3 3 c   8
## 4 4 d  10
## 5 a b   c

# name columns
colnames(df) <- c("col1", "col2", "col3")

# return first column
df$col1

## [1] 1 2 3 4

# return second row
df[2,]

##   col1 col2 col3
## 2    2    b    6

# return second through fourth row
df[2:4,]

##   col1 col2 col3
## 2    2    b    6
## 3    3    c    8
## 4    4    d   10
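Two more everyday data frame operations are sketched below: adding a column with “$” and keeping only the rows that meet a condition. The column name “double_x” is made up for this example. The last lines illustrate why the numbers in the “rbind” output above share a row with letters: binding a character row converts the numeric columns to text.

```r
df <- data.frame(x = 1:4, y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)

# Add a column directly with $.
df$double_x <- df$x * 2

# Keep only the rows where a condition holds (note the trailing comma).
big <- df[df$x > 2, ]

# rbind with a character row converts the numeric columns to text.
combined <- rbind(df, c("a", "b", "c"))
class(df$x)        # "integer"
class(combined$x)  # "character"
```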

2. Built in Features

R has many built in features. We will show some examples concerning random variables, summaries, and plotting. These will be developed in more depth in future notebooks, but the basics of random variables will be shown here.

Plotting

Plotting two variables can be done with the “plot” function. For example, using the built-in “cars” dataset, the following command plots the speed and stopping distance of cars:

plot(cars$speed, cars$dist, main="SpeedvsDist", xlab = "speed", ylab = "dist")

Note that the main argument specifies the main title of the graph, and xlab and ylab specify the x axis and y axis labels.

Random Variables

R provides four functions for the normal distribution: rnorm, dnorm, pnorm, and qnorm. Only rnorm generates random values; dnorm gives the density, pnorm the cumulative probability, and qnorm the quantiles. Generating random values from a normal distribution with a given mean and standard deviation is therefore done with rnorm. The following generates 50 random values with mean 0 and standard deviation 1 and stores the results in the variable “rda”. It is a good idea to look up qnorm, pnorm, and dnorm for future reference.

set.seed(100)
rda <- rnorm(50, 0, 1)
rda

##  [1] -0.50219235  0.13153117 -0.07891709  0.88678481  0.11697127
##  [6]  0.31863009 -0.58179068  0.71453271 -0.82525943 -0.35986213
## [11]  0.08988614  0.09627446 -0.20163395  0.73984050  0.12337950
## [16] -0.02931671 -0.38885425  0.51085626 -0.91381419  2.31029682
## [21] -0.43808998  0.76406062  0.26196129  0.77340460 -0.81437912
## [26] -0.43845057 -0.72022155  0.23094453 -1.15772946  0.24707599
## [31] -0.09111356  1.75737562 -0.13792961 -0.11119350 -0.69001432
## [36] -0.22179423  0.18290768  0.41732329  1.06540233  0.97020202
## [41] -0.10162924  1.40320349 -1.77677563  0.62286739 -0.52228335
## [46]  1.32223096 -0.36344033  1.31906574  0.04377907 -1.87865588

Note, for testing purposes R also is able to fix the random sequence using the “set.seed()” command. This means that the numbers will be generated randomly, but in a repeatable way depending on the seed. This will still simulate randomness but fix the results to be the same. Often, in a situation involving randomness it is good to set a seed so that results can be tested in a reproducible way.
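As a brief sketch of both ideas, the code below checks that the same seed reproduces the same draws, and then calls the other three normal-distribution functions on the standard normal (mean 0, standard deviation 1). The variable names are made up for this example.

```r
set.seed(100)
first_draw <- rnorm(5, mean = 0, sd = 1)
set.seed(100)
second_draw <- rnorm(5, mean = 0, sd = 1)
same_draws <- identical(first_draw, second_draw)  # TRUE: same seed, same numbers

density_at_0 <- dnorm(0)           # height of the bell curve at 0, about 0.399
prob_below_0 <- pnorm(0)           # P(X <= 0) = 0.5
cutoff_975 <- qnorm(0.975)         # value with 97.5% of the probability below it, about 1.96
round_trip <- pnorm(qnorm(0.975))  # pnorm undoes qnorm: 0.975
```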

Loading Libraries

Another very important part of R is installing and loading packages. This allows a user to quickly run many standard functions that do not come pre-built in R.

To install a package, the command is “install.packages()” with the package name entered as a string (e.g. “packagename”). To install multiple packages, pass the names as a character vector built with “c()”. For instance, to install the packages “dplyr”, which is excellent for filtering data, and “ggplot2”, which is excellent for building advanced plots, use the command “install.packages(c("dplyr", "ggplot2"))”. Note that once a package is installed it stays installed after the R session closes, so it does not need to be reinstalled.

To load the functions, the command is “library()”. This has to be called every time R is re-opened, and it loads one package at a time. The package name can be entered either directly or as a string. To load the two packages above, use the following commands:

#install.packages(c("dplyr", "ggplot2"))
library("dplyr")

## Warning: package 'dplyr' was built under R version 3.6.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("ggplot2")

## Warning: package 'ggplot2' was built under R version 3.6.1

Good style when building a notebook involves commenting out the install line (similar to above) so the user is aware which packages to download. It is then good to put all the library commands in a block right away so the notebook runs smoothly.

The install line will break a notebook in the knitting step if it is not commented out. When installing a package it is best to do it from the console.
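One possible way to keep a notebook from breaking is to check which packages are missing instead of installing them inside the notebook. The sketch below is only a suggestion and not part of the original workflow; it uses the base function requireNamespace(), and the variable names are made up.

```r
needed <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE when a package is installed, without attaching it.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Run in the console: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All packages are installed; load them with library().")
}
```

This only prints a reminder, so the knitting step never tries to download anything.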

Applying Functions on Vectors and Lists in R

This is not a “dplyr” feature, but it is an important idea for working with vectors and lists. In R, it is usually best to avoid explicit loops over vectors and instead use the “apply” family of functions, which apply the same function to every element of a vector or list. Code written this way is shorter, and because each element is processed independently it is easier to parallelize later.

To use apply effectively, first write a function that works on a single element of a vector. Then use lapply to apply that function to every element and store the results in a list. The unlist command converts a list back into a vector. An example is below: the first function adds 1 to every element in a numeric vector, and the second function adds the letter “a” to the start of each string.

add1 <- function(x) {
  return (x+1)
}

adda <- function(x) {
  return (paste("a", x, sep=""))
}

unlist(lapply(c(1,2,3,4,5), add1))

## [1] 2 3 4 5 6

unlist(lapply(c("a","ab", "abc", "abcd"), adda))

## [1] "aa"    "aab"   "aabc"  "aabcd"

There are other methods of apply too which work on different data structures; run help(lapply) to learn more!
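As a sketch of two of those other methods, sapply and vapply both return a vector directly, so unlist is not needed. The last line is a reminder that simple arithmetic already works on whole vectors without any apply function.

```r
add1 <- function(x) {
  return(x + 1)
}

nums <- c(1, 2, 3, 4, 5)

as_list <- lapply(nums, add1)              # a list of five numbers
as_vector <- sapply(nums, add1)            # simplified to a vector: 2 3 4 5 6
checked <- vapply(nums, add1, numeric(1))  # same, but enforces one number per element
vectorised <- nums + 1                     # arithmetic already works element-wise
```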

The extra topics in these notebooks called “functional programming” are also great tools to program smarter in R, and will be especially useful for assignments.

3. More Basic R Questions

Here are some more questions to work on the basics from above. Please make a code block labelled like the previous day when answering each question. Part of the first question has been done for you.

Generate 1000 numbers with a mean 0 and stdev 1 using the function rnorm, and find how many of them are greater than 0.2 using subsets. Store these as a vector named “six”

generated <- rnorm(1000, mean=0, sd=1)

Generate 100 numbers with mean 0 and stdev 0.5 using “rnorm”. Multiply this vector by first the vector “six” and then the vector “generated” above. Which one of these gives a warning?
Use the lapply function to build a vector with the following rules: for each number in “six” if the value is greater than 0 the new vector should hold the value 1 in that spot, and if it is less than or equal to 0 it should hold the value -1.
Install the “housingData” package (install.packages("housingData")) and store the “fipsCounty” data (housingData::fipsCounty) into a data frame. Which state in fipsCounty appears the most times? (hint: use the functions sort and table to find this, and also don’t forget to run library(housingData) to use the info in that package).
Load the housing data from the housingData package into a dataframe. Which state has the most houses sold, and which state has the highest average individual difference between list and sold price? (Note the which.max function can help you to solve this)
Plot a graph of list and sold price of all the houses. Use the plot function for this. Then plot a graph of the list price and selling price of all houses with a sold price greater than 2500. What do you notice about the two plots?

Day 3: Building Summaries and Tables

How would you summarise or display data in general?

Now we begin exploring methods for visualizing and analyzing data in R using tables and graphs. The library (package) most often used for this is called tidyverse, which is the standard for simple visualizations (and some complicated ones) in R. When there is not much data, or the audience is only looking for overviews and summary information, graphs and tables are usually enough to get a point across. This lesson will use the nycflights13 dataset, which contains information about flights. We will then do a task using the Titanic dataset, which contains passengers and whether or not they survived the shipwreck.

The different steps in this day are:

Reading and Cleaning Data
Summary Functions
Problems

Important Packages and Link to Reference Book

The two important packages inside tidyverse (which is a set of packages compiled into one) are “dplyr” and “ggplot2”.

#install.packages(c("tidyverse", "nycflights13"))
#install.packages(c("dplyr", "ggplot2", "magrittr", "nycflights13"))

library("tidyverse")

## Warning: package 'tidyverse' was built under R version 3.6.1

## -- Attaching packages ----------------------------------- tidyverse 1.2.1 --

## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'readr' was built under R version 3.6.1

## -- Conflicts -------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("nycflights13")

The tidyverse book reference (filtering plotting and more): http://r4ds.had.co.nz/introduction.html

Filter, Arrange, Select, Mutate, Summarise, and Group By

We will first load some data to transform using the above functions. The dataset we will use for this lesson is “nycflights13”, and the first step to analyzing any dataset is to see what columns and datatypes exist. We use the “str()” function for this. We will also use the “nrow()” function to see how many rows exist in the dataset.

str(flights)

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

nrow(flights)

## [1] 336776

The str output shows what type of data belongs to each column as well as some examples from each column, and nrow gives a count of rows. We will now begin to look for specific information.

Filter

Sometimes it’s more useful to take a summary of very specific instances in the data, for example, a summary of the Age of all survivors where they are female or a summary of all variables for passengers who are 18. The first step to find these summaries is to filter the data and pull only the necessary rows.

Such filters are done very effectively using the “dplyr” library inside the tidyverse package. Below are some examples of filters. The full filtered data is saved in a variable, and only the first few rows are displayed using the “head()” function. We start by looking for all flights that occurred on January 1st in any year, as well as on December 25th in any year.

jan1 <- filter(flights, month==1, day==1)
head(jan1)

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

head(dec25 <- filter(flights, month == 12, day == 25))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013    12    25      456            500        -4      649
## 2  2013    12    25      524            515         9      805
## 3  2013    12    25      542            540         2      832
## 4  2013    12    25      546            550        -4     1022
## 5  2013    12    25      556            600        -4      730
## 6  2013    12    25      557            600        -3      743
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

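We can also use logical operators such as %in%, | (or), and ! (not). Each pair below builds the same filter in two different ways, and identical() confirms that the results match: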

nov_dec <- filter(flights, month %in% c(11, 12))
nov_dec2 <- filter(flights, month == 11 | month == 12)
identical(nov_dec, nov_dec2)

## [1] TRUE

gl <- filter(flights, !(arr_delay > 120 | dep_delay > 120))
gl2 <- filter(flights, arr_delay <= 120, dep_delay <= 120)
identical(gl, gl2)

## [1] TRUE

Filters are powerful tools for getting an initial idea about specific parts of the data. They are also useful for figuring out how much data is missing, and whether or not there are any anomalies.
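As a minimal sketch of the missing-data use, the snippet below counts flights with no recorded departure delay. It assumes the flights table comes from the nycflights13 package (an assumption based on the columns shown above), and the variable names are only illustrative. Note that filter() drops rows where the condition evaluates to NA, so missing values have to be asked for explicitly with is.na().

```r
library(dplyr)
library(nycflights13)  # assumed source of the flights table

# Rows where the departure delay was never recorded
missingDep <- filter(flights, is.na(dep_delay))
nrow(missingDep)

# A comparison against NA is NA, so those rows are dropped by both filters
delayed <- filter(flights, dep_delay > 120)
notDelayed <- filter(flights, dep_delay <= 120)

# The three pieces together account for every row
nrow(delayed) + nrow(notDelayed) + nrow(missingDep) == nrow(flights)
```

This is also why the two-way comparisons above return TRUE: both versions of each filter drop the same rows with missing delays.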

Arrange

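Arrange sorts rows. Here are some examples: arranging flights by date (year, then month, then day), and arranging flights by departure delay in both increasing and decreasing order. The sorted tables are saved in variables, so wrap any of them in head() to see the first few rows.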

sortFlights <- arrange(flights, year, month, day)
sortDelayUp <- arrange(flights, dep_delay)
sortDelayDown <- arrange(flights, desc(dep_delay))
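One detail worth knowing when sorting columns with missing values: arrange() places NA values at the end, whether the sort is ascending or descending. The sketch below shows this on a tiny made-up table (the name toy and its values are hypothetical).

```r
library(dplyr)

# Tiny illustrative table (hypothetical values)
toy <- tibble(x = c(5, 2, NA, 8))

# Ascending sort: the NA row comes last
sortedUp <- arrange(toy, x)
sortedUp

# Descending sort: the NA row still comes last
sortedDown <- arrange(toy, desc(x))
sortedDown
```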

Select

Select picks out columns. It is especially useful when there are many columns and we only want to work with a subset of them. We can also use it to re-order the columns in a dataset, as in the second example, where everything() keeps all the remaining columns after the ones listed first.

head(select(flights, year, month, day))

## # A tibble: 6 x 3
##    year month   day
##   <int> <int> <int>
## 1  2013     1     1
## 2  2013     1     1
## 3  2013     1     1
## 4  2013     1     1
## 5  2013     1     1
## 6  2013     1     1

head(select(flights, dep_time, sched_dep_time, everything()))

## # A tibble: 6 x 19
##   dep_time sched_dep_time  year month   day dep_delay arr_time
##      <int>          <int> <int> <int> <int>     <dbl>    <int>
## 1      517            515  2013     1     1         2      830
## 2      533            529  2013     1     1         4      850
## 3      542            540  2013     1     1         2      923
## 4      544            545  2013     1     1        -1     1004
## 5      554            600  2013     1     1        -6      812
## 6      554            558  2013     1     1        -4      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
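Select also accepts helper functions that match column names by pattern. The following is a small sketch, again assuming the flights table comes from nycflights13; the variable names and the new column name tail_number are made up for illustration.

```r
library(dplyr)
library(nycflights13)  # assumed source of the flights table

# Columns whose names start with "dep"
depCols <- select(flights, starts_with("dep"))
names(depCols)

# Drop a range of columns with a minus sign
noDate <- select(flights, -(year:day))
ncol(noDate)

# rename() keeps every column and only changes the chosen name
renamed <- rename(flights, tail_number = tailnum)
names(renamed)
```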

Mutate

Mutate is especially useful for creating a new field out of the existing fields (a field in this case is a column). New columns are added to the end of the dataset. Here are a few examples (the new fields look useful, but they are mainly here to illustrate the concept). The second example uses transmute(), which works the same way but keeps only the newly created columns.

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
) %>% head()

## # A tibble: 6 x 9
##    year month   day dep_delay arr_delay distance air_time  gain speed
##   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
## 1  2013     1     1         2        11     1400      227    -9  370.
## 2  2013     1     1         4        20     1416      227   -16  374.
## 3  2013     1     1         2        33     1089      160   -31  408.
## 4  2013     1     1        -1       -18     1576      183    17  517.
## 5  2013     1     1        -6       -25      762      116    19  394.
## 6  2013     1     1        -4        12      719      150   -16  288.

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
) %>% head()

## # A tibble: 6 x 3
##    gain hours gain_per_hour
##   <dbl> <dbl>         <dbl>
## 1    -9  3.78         -2.38
## 2   -16  3.78         -4.23
## 3   -31  2.67        -11.6 
## 4    17  3.05          5.57
## 5    19  1.93          9.83
## 6   -16  2.5          -6.4

Summarise and group_by

Finally, the summarise function in dplyr collapses a table into specific summary statistics, and running the built-in “summary” function on a filtered dplyr table also gives good results. Here is an example:

# dplyr summarise
summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) %>% head()

## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

It is also possible to use group_by() to get summaries about specific subgroups.
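A minimal sketch of a grouped summary, assuming the flights table comes from nycflights13 (the name delayByOrigin is only illustrative): group_by() marks the subgroups, and summarise() then returns one row per subgroup instead of one row overall.

```r
library(dplyr)
library(nycflights13)  # assumed source of the flights table

# Number of flights and mean departure delay for each origin airport
delayByOrigin <- flights %>%
  group_by(origin) %>%
  summarise(
    count = n(),
    delay = mean(dep_delay, na.rm = TRUE)
  )
delayByOrigin
```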

Simple Problems

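First we will do some problems using the above functions on another dataset, msleep, which ships with the ggplot2 package (so ggplot2 or the tidyverse must be loaded before running the line below).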

data("msleep")

Q1: Find the number of carnivores belonging to order Rodentia or Primates.

Q2: Find the average difference between brainwt and bodywt for all domesticated animals.

Q3: Find the total number of “genus” categories

Q4: Find the animal with the maximum and minimum bodywt, and maximum and minimum brainwt.

Q5: Find the number of rows containing at least 1 “NA” value

Reading and Cleaning Data

Data can come from many sources. A csv file works well for smaller datasets, while a serialized format with a schema (such as an avro file) may work better for larger amounts of data. R has many functions for reading external data and for writing data from R out to other software. It is usually best to look these up for the specific case, but read.csv() and write.csv() are among the most important ones. Use the help command (for example ?read.csv) to learn more about them.
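As a quick sketch of the two functions, the snippet below writes a small made-up data frame to a temporary csv file and reads it back. The data frame scores and its values are hypothetical.

```r
# Small illustrative data frame (hypothetical values)
scores <- data.frame(
  name = c("Ann", "Bob", "Cara"),
  score = c(90.5, 72, 88),
  stringsAsFactors = FALSE
)

# Write to a temporary csv file; row.names = FALSE avoids an extra index column
csvPath <- tempfile(fileext = ".csv")
write.csv(scores, csvPath, row.names = FALSE)

# Read it back; header = TRUE treats the first line as the column names
scoresBack <- read.csv(csvPath, header = TRUE, stringsAsFactors = FALSE)
str(scoresBack)
```

Setting stringsAsFactors explicitly keeps text columns as plain strings regardless of the R version in use.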

Another very common problem with data is that it does not always come in a form that is easy to analyze. That is why it is important to reshape, filter, and clean the data to meet the relevant requirements. R is not always the best tool for the cleaning itself, especially compared to languages that make it easy to open and modify file contents directly (C++, Python, etc.). There are also specific programs for specific types of files (e.g., Excel is good for xlsx and csv, and Tableau is good for twb).

There is no real scientific method for cleaning data or decoding it so that it can be read into R smoothly, which unfortunately can make the task very frustrating. However, starting with a clean dataset and seeing some results will hopefully be motivating enough to justify spending the time to clean data in the future! (Most small in-class assignments will be done with clean data, but independent or class projects are likely to have some mess.)
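To give a feel for what simple cleaning steps can look like in R, here is a small sketch on a made-up data frame. The column names imitate the Titanic data, but the values are hypothetical. It treats empty strings as missing, counts missing values per column, and keeps only the complete rows.

```r
# Small illustrative data frame with typical problems (hypothetical values)
raw <- data.frame(
  Age = c(22, NA, 35, 41),
  Cabin = c("C85", "", "E46", ""),
  Fare = c(7.25, 71.28, NA, 8.05),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing values
raw$Cabin[raw$Cabin == ""] <- NA

# Count the missing values in each column
naPerColumn <- colSums(is.na(raw))
naPerColumn

# Keep only the rows that have no missing values at all
cleaned <- raw[complete.cases(raw), ]
nrow(cleaned)
```

Dropping incomplete rows is only one option and can discard a lot of data, so it is worth checking the counts first.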

In this notebook the example used is the Titanic data, which gives information about different passengers on the Titanic and whether or not each passenger survived. Below is an example of using the read.csv function to read the Titanic data into a dataframe.

The data can be downloaded here: https://www.kaggle.com/c/titanic/data

Be sure to change “filePathTrain” (and “filePathTest”, if you use it) to match the path of the Titanic data files on the computer this code is being run on.

# Change filepath to match where you downloaded
filePathTrain <- "E:/Datasets/trainTitanic.csv"
filePathTest <- "E:/Datasets/test.csv"
# Save into dataframes
dfTrain <- as.data.frame(read.csv(filePathTrain, header=TRUE))
#dfTest <- as.data.frame(read.csv(filePathTest, header=TRUE))

Examples of table processing with Titanic Data

# summarize fare
summary(dfTrain$Fare)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.33

# summarize fare of survivors
dfTrain %>% filter(Survived==1) %>% select(Fare) %>% summary()

##       Fare       
##  Min.   :  0.00  
##  1st Qu.: 12.47  
##  Median : 26.00  
##  Mean   : 48.40  
##  3rd Qu.: 57.00  
##  Max.   :512.33

# keep all fields, but only rows for female passengers
females <- filter(dfTrain, Sex == "female")

# keep all fields, but only rows for females who survived
survivingFemales <- filter(dfTrain, Survived == 1, Sex == "female")

# keep all passengers who were at least 18 years old
age18 <- filter(dfTrain, Age >= 18)

head(females)

##   PassengerId Survived Pclass
## 1           2        1      1
## 2           3        1      3
## 3           4        1      1
## 4           9        1      3
## 5          10        1      2
## 6          11        1      3
##                                                  Name    Sex Age SibSp
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 2                              Heikkinen, Miss. Laina female  26     0
## 3        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 4   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0
## 5                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1
## 6                     Sandstrom, Miss. Marguerite Rut female   4     1
##   Parch           Ticket    Fare Cabin Embarked
## 1     0         PC 17599 71.2833   C85        C
## 2     0 STON/O2. 3101282  7.9250              S
## 3     0           113803 53.1000  C123        S
## 4     2           347742 11.1333              S
## 5     0           237736 30.0708              C
## 6     1          PP 9549 16.7000    G6        S

head(survivingFemales)

##   PassengerId Survived Pclass
## 1           2        1      1
## 2           3        1      3
## 3           4        1      1
## 4           9        1      3
## 5          10        1      2
## 6          11        1      3
##                                                  Name    Sex Age SibSp
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 2                              Heikkinen, Miss. Laina female  26     0
## 3        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 4   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0
## 5                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1
## 6                     Sandstrom, Miss. Marguerite Rut female   4     1
##   Parch           Ticket    Fare Cabin Embarked
## 1     0         PC 17599 71.2833   C85        C
## 2     0 STON/O2. 3101282  7.9250              S
## 3     0           113803 53.1000  C123        S
## 4     2           347742 11.1333              S
## 5     0           237736 30.0708              C
## 6     1          PP 9549 16.7000    G6        S

head(age18)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           7        0      1
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                             McCarthy, Mr. Timothy J   male  54     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0            17463 51.8625   E46        S

summarise(survivingFemales)

## data frame with 0 columns and 1 row

# Arrange
sortAge <- arrange(dfTrain, Age)
sortId <- arrange(dfTrain, desc(PassengerId))

head(sortAge)

##   PassengerId Survived Pclass                            Name    Sex  Age
## 1         804        1      3 Thomas, Master. Assad Alexander   male 0.42
## 2         756        1      2       Hamalainen, Master. Viljo   male 0.67
## 3         470        1      3   Baclini, Miss. Helene Barbara female 0.75
## 4         645        1      3          Baclini, Miss. Eugenie female 0.75
## 5          79        1      2   Caldwell, Master. Alden Gates   male 0.83
## 6         832        1      2 Richards, Master. George Sibley   male 0.83
##   SibSp Parch Ticket    Fare Cabin Embarked
## 1     0     1   2625  8.5167              C
## 2     1     1 250649 14.5000              S
## 3     2     1   2666 19.2583              C
## 4     2     1   2666 19.2583              C
## 5     0     2 248738 29.0000              S
## 6     1     1  29106 18.7500              S

head(sortId)

##   PassengerId Survived Pclass                                     Name
## 1         891        0      3                      Dooley, Mr. Patrick
## 2         890        1      1                    Behr, Mr. Karl Howell
## 3         889        0      3 Johnston, Miss. Catherine Helen "Carrie"
## 4         888        1      1             Graham, Miss. Margaret Edith
## 5         887        0      2                    Montvila, Rev. Juozas
## 6         886        0      3     Rice, Mrs. William (Margaret Norton)
##      Sex Age SibSp Parch     Ticket   Fare Cabin Embarked
## 1   male  32     0     0     370376  7.750              Q
## 2   male  26     0     0     111369 30.000  C148        C
## 3 female  NA     1     2 W./C. 6607 23.450              S
## 4 female  19     0     0     112053 30.000   B42        S
## 5   male  27     0     0     211536 13.000              S
## 6 female  39     0     5     382652 29.125              Q

# Select columns
selectUsefulCols <- select(dfTrain, Age, Fare, Pclass, Survived, everything())
head(selectUsefulCols)

##   Age    Fare Pclass Survived PassengerId
## 1  22  7.2500      3        0           1
## 2  38 71.2833      1        1           2
## 3  26  7.9250      3        1           3
## 4  35 53.1000      1        1           4
## 5  35  8.0500      3        0           5
## 6  NA  8.4583      3        0           6
##                                                  Name    Sex SibSp Parch
## 1                             Braund, Mr. Owen Harris   male     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female     1     0
## 3                              Heikkinen, Miss. Laina female     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female     1     0
## 5                            Allen, Mr. William Henry   male     0     0
## 6                                    Moran, Mr. James   male     0     0
##             Ticket Cabin Embarked
## 1        A/5 21171              S
## 2         PC 17599   C85        C
## 3 STON/O2. 3101282              S
## 4           113803  C123        S
## 5           373450              S
## 6           330877              Q

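# Mutate
# Note: length(Name) returns the number of rows (891) for every passenger, not the
# number of letters in each name; nchar(as.character(Name)) would give per-name counts.
funnyNewFields <- mutate(dfTrain, lettersInName = length(Name), farePerYear = Fare/Age)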
head(funnyNewFields)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked lettersInName farePerYear
## 1     0        A/5 21171  7.2500              S           891   0.3295455
## 2     0         PC 17599 71.2833   C85        C           891   1.8758763
## 3     0 STON/O2. 3101282  7.9250              S           891   0.3048077
## 4     0           113803 53.1000  C123        S           891   1.5171429
## 5     0           373450  8.0500              S           891   0.2300000
## 6     0           330877  8.4583              Q           891          NA

# Add two bucket columns, then keep only the columns of interest
selectAndMutate <- dfTrain %>%
  mutate(
    ageBuckets = as.numeric(Age > 50 | is.na(Age)),
    fareBuckets = ifelse(Fare < 20, 2, ifelse(Fare < 40, 1, 0))
  ) %>%
  select(Name, Age, Fare, Pclass, ageBuckets, fareBuckets, Survived)

# Summarise
summarise(females, count = n(), age = mean(Age, na.rm=TRUE), fare = mean(Fare,na.rm=TRUE)) %>% head()

##   count      age     fare
## 1   314 27.91571 44.47982

# base-summary
summary(females)

##   PassengerId       Survived         Pclass     
##  Min.   :  2.0   Min.   :0.000   Min.   :1.000  
##  1st Qu.:231.8   1st Qu.:0.000   1st Qu.:1.000  
##  Median :414.5   Median :1.000   Median :2.000  
##  Mean   :431.0   Mean   :0.742   Mean   :2.159  
##  3rd Qu.:641.2   3rd Qu.:1.000   3rd Qu.:3.000  
##  Max.   :889.0   Max.   :1.000   Max.   :3.000  
##                                                 
##                                              Name         Sex     
##  Abbott, Mrs. Stanton (Rosa Hunt)              :  1   female:314  
##  Abelson, Mrs. Samuel (Hannah Wizosky)         :  1   male  :  0  
##  Ahlin, Mrs. Johan (Johanna Persdotter Larsson):  1               
##  Aks, Mrs. Sam (Leah Rosen)                    :  1               
##  Allen, Miss. Elisabeth Walton                 :  1               
##  Allison, Miss. Helen Loraine                  :  1               
##  (Other)                                       :308               
##       Age            SibSp            Parch            Ticket   
##  Min.   : 0.75   Min.   :0.0000   Min.   :0.0000   347082 :  5  
##  1st Qu.:18.00   1st Qu.:0.0000   1st Qu.:0.0000   2666   :  4  
##  Median :27.00   Median :0.0000   Median :0.0000   110152 :  3  
##  Mean   :27.92   Mean   :0.6943   Mean   :0.6497   113781 :  3  
##  3rd Qu.:37.00   3rd Qu.:1.0000   3rd Qu.:1.0000   13502  :  3  
##  Max.   :63.00   Max.   :8.0000   Max.   :6.0000   24160  :  3  
##  NA's   :53                                        (Other):293  
##       Fare            Cabin     Embarked
##  Min.   :  6.75          :217    :  2   
##  1st Qu.: 12.07   G6     :  4   C: 73   
##  Median : 23.00   E101   :  3   Q: 36   
##  Mean   : 44.48   F33    :  3   S:203   
##  3rd Qu.: 55.00   B18    :  2           
##  Max.   :512.33   B28    :  2           
##                   (Other): 83

summary(survivingFemales)

##   PassengerId       Survived     Pclass     
##  Min.   :  2.0   Min.   :1   Min.   :1.000  
##  1st Qu.:238.0   1st Qu.:1   1st Qu.:1.000  
##  Median :400.0   Median :1   Median :2.000  
##  Mean   :429.7   Mean   :1   Mean   :1.918  
##  3rd Qu.:636.0   3rd Qu.:1   3rd Qu.:3.000  
##  Max.   :888.0   Max.   :1   Max.   :3.000  
##                                             
##                                               Name         Sex     
##  Abbott, Mrs. Stanton (Rosa Hunt)               :  1   female:233  
##  Abelson, Mrs. Samuel (Hannah Wizosky)          :  1   male  :  0  
##  Aks, Mrs. Sam (Leah Rosen)                     :  1               
##  Allen, Miss. Elisabeth Walton                  :  1               
##  Andersen-Jensen, Miss. Carla Christine Nielsine:  1               
##  Andersson, Miss. Erna Alexandra                :  1               
##  (Other)                                        :227               
##       Age            SibSp           Parch            Ticket   
##  Min.   : 0.75   Min.   :0.000   Min.   :0.000   2666    :  4  
##  1st Qu.:19.00   1st Qu.:0.000   1st Qu.:0.000   110152  :  3  
##  Median :28.00   Median :0.000   Median :0.000   13502   :  3  
##  Mean   :28.85   Mean   :0.515   Mean   :0.515   24160   :  3  
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:1.000   PC 17757:  3  
##  Max.   :63.00   Max.   :4.000   Max.   :5.000   110413  :  2  
##  NA's   :36                                      (Other) :215  
##       Fare             Cabin     Embarked
##  Min.   :  7.225          :142    :  2   
##  1st Qu.: 13.000   E101   :  3   C: 64   
##  Median : 26.000   F33    :  3   Q: 27   
##  Mean   : 51.939   B18    :  2   S:140   
##  3rd Qu.: 76.292   B28    :  2           
##  Max.   :512.329   B35    :  2           
##                    (Other): 79

# Using pipe commands to build interesting summaries
selectAndMutate %>% group_by(fareBuckets) %>% summarise(count=n(), ages=mean(Age,na.rm=TRUE))

## # A tibble: 3 x 3
##   fareBuckets count  ages
##         <dbl> <int> <dbl>
## 1           0   176  33.9
## 2           1   200  29.8
## 3           2   515  28.0

selectAndMutate %>% group_by(Pclass) %>% summarise(count=n(), fare=mean(Fare,na.rm=TRUE))

## # A tibble: 3 x 3
##   Pclass count  fare
##    <int> <int> <dbl>
## 1      1   216  84.2
## 2      2   184  20.7
## 3      3   491  13.7

Task with Titanic Data

Using the training set, try to come up with an idea of which types of passengers survived based on the features given. To do this, use various summarizing functions to get an idea of how variables are distributed for passengers of different types.

This task is open ended - but in the end you will use whatever indicators you have built with the training set and try to classify people in the testing set. This means in the end you must write a function that takes in certain variables and outputs the result (1 if survived, 0 otherwise).

Show all of your work! That includes what summary functions and filters you used and how you analyzed them to reach your results, as well as the final function and some sample inputs+outputs.

# This is a very simplistic example, meant as a template for you to extend.

isLetterM <- function(x) {
  if ("m" %in% unlist(strsplit(tolower(x), ""))) {
    return(1)
  } else {
    return(0)
  }
}

letterM <- unlist(lapply(dfTrain$Name, isLetterM))

# feature letterM is built

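survivalTest <- function(Age, Name, Fare) {
  # Missing values cannot be compared, so classify them as 0
  # (an arbitrary choice for this template)
  if (is.na(Age) || is.na(Fare)) {
    return(0)
  }
  if (Age > 20) {
    if (isLetterM(Name)) {
      if (Fare > 15) {
        return(1)
      } else {
        return(0)
      }
    } else if (Fare < 8) {
      return(1)
    } else {
      return(0)
    }
  } else {
    # Age 20 or younger
    return(0)
  }
}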
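One way to judge a rule like this is to apply it to every row of the training set and measure how often it matches the Survived column. The sketch below shows the idea on a made-up four-row table with a hypothetical one-line rule (toyTrain and fareRule are illustrative names, not part of the Titanic data).

```r
# Hypothetical mini training set (values are made up for illustration)
toyTrain <- data.frame(
  Age = c(22, 38, 4, 54),
  Fare = c(7.25, 71.28, 16.70, 51.86),
  Survived = c(0, 1, 1, 0)
)

# Hypothetical rule: predict survival when the fare is above 15
fareRule <- function(Age, Fare) {
  if (Fare > 15) 1 else 0
}

# Apply the rule row by row, then compare predictions with the true labels
predicted <- mapply(fareRule, toyTrain$Age, toyTrain$Fare)
accuracy <- mean(predicted == toyTrain$Survived)
accuracy
```

The same pattern should work with survivalTest and dfTrain by passing dfTrain$Age, dfTrain$Name, and dfTrain$Fare to mapply().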

Datasets references

https://www.kaggle.com/c/titanic/data

Day 4: Plotting

One very common way to display data to a user is using a graph or plot. Graphs and plots are good because they allow someone to compare things or view a trend very quickly. They can be bad because it is very easy to mislead someone using a plot.

We use a library called ggplot2 to do plotting.

library("dplyr")
library("ggplot2")

The general form of a plot in ggplot is as follows:

ggplot(data=<DATA>)+<GEOM_FUNCTION>(mapping=aes(<MAPPING>))+<ADDITIONAL SPECIFICATIONS>
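As a concrete instance of this template, the sketch below fills in each placeholder using the built-in mtcars dataset. The axis labels and the theme are just one possible choice of additional specifications.

```r
library(ggplot2)

# <DATA> = mtcars, <GEOM_FUNCTION> = geom_point, <MAPPING> = x and y columns,
# <ADDITIONAL SPECIFICATIONS> = axis labels and a theme
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_minimal()
p
```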

There are many types of plots all of which can be useful in different situations. The ones we will cover today are:

Scatter
Line
Histogram
Bar Plot
Boxplot
Density Plot

The library we require for plotting is called “ggplot2”, which is widely regarded as the standard for plotting in R. However, we will also show some plotting functions built into R’s core. We will demonstrate these plotting styles with datasets that already exist in R.
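For comparison, here is a minimal sketch of three plotting functions from R's core, using the built-in mtcars dataset. They need no extra packages, although they offer less control over layering than ggplot2.

```r
data("mtcars")

# Scatter plot of weight against fuel efficiency
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg", main = "Base R scatter plot")

# Histogram of fuel efficiency; the returned object holds the bin counts
h <- hist(mtcars$mpg, xlab = "mpg", main = "Base R histogram")

# Boxplot of fuel efficiency grouped by number of cylinders
boxplot(mpg ~ cyl, data = mtcars, xlab = "cyl", ylab = "mpg")
```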

Scatter Plot

A scatter plot can either compare two variables or just show the trend in one variable. With ggplot we can also encode additional dimensions through aesthetics such as size, shape, and transparency. It is one of the most commonly used plots for data applications. The aesthetics we can map include: x, y, colour, shape, size, alpha.

data("mtcars")

# Basic scatter plot
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()

# Change the point size, and shape
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(size=2, shape=23)

# Map the point size to a variable (qsec)
ggplot(mtcars, aes(x=wt, y=mpg)) + 
  geom_point(aes(size=qsec))

# Add text
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point() + 
  geom_text(label=rownames(mtcars))

# Add the regression line
ggplot(mtcars, aes(x=wt, y=mpg)) + 
  geom_point()+
  geom_smooth(method=lm)

# Remove the confidence interval
ggplot(mtcars, aes(x=wt, y=mpg)) + 
  geom_point()+
  geom_smooth(method=lm, se=FALSE)

# Loess method
ggplot(mtcars, aes(x=wt, y=mpg)) + 
  geom_point()+
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Add marginal rugs
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point() + geom_rug()

# Color the points by number of cylinders (cyl)
ggplot(mtcars, aes(x=wt, y=mpg, color=cyl)) +
  geom_point() + geom_rug()
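Because cyl is stored as a number, color=cyl produces a continuous color gradient. The sketch below shows one way to get a distinct color per cylinder count by wrapping the column in factor(), and it also uses the size and alpha aesthetics mentioned earlier. The choice of hp for size and 0.6 for alpha is arbitrary.

```r
library(ggplot2)
data("mtcars")

# factor(cyl) gives one distinct color per cylinder count instead of a gradient
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), size = hp)) +
  geom_point(alpha = 0.6) +  # alpha below 1 keeps overlapping points visible
  labs(color = "cyl")
p
```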

# Add marginal rugs using faithful data
ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point() + geom_rug()

# Scatter plot with the 2d density estimation
sp <- ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point()
sp + geom_density_2d()

# Gradient color
sp + stat_density_2d(aes(fill = ..level..), geom="polygon")

# Change the gradient color
sp + stat_density_2d(aes(fill = ..level..), geom="polygon")+
  scale_fill_gradient(low="blue", high="red")

# One ellipse around all points
ggplot(faithful, aes(waiting, eruptions))+
  geom_point()+
  stat_ellipse()

# Ellipse by groups
p <- ggplot(faithful, aes(waiting, eruptions, color = eruptions > 3))+
  geom_point()
p + stat_ellipse()

# Change the type of ellipses: possible values are "t", "norm", "euclid"
p + stat_ellipse(type = "norm")

You can find many more examples here: http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization

Line Plot

Line plots are similar to scatter plots except that the points are connected (usually in the order in which they appear along the x axis). They are often used to show a relationship, or several relationships, directly. We can use the “linetype” or “group” arguments to draw multiple lines. We can also add points and lines in the same graph, each with their own mapping.

df <- data.frame(dose=c("D0.5", "D1", "D2"),
                len=c(4.2, 10, 29.5))
head(df)

##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5

# Basic line plot with points
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line()+
  geom_point()
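The group=1 argument matters here because dose is a text column. By default ggplot2 treats each distinct dose as its own group, and a group with a single point has nothing to connect. The sketch below compares the two cases; layer_data() is used only to inspect the group assigned to each row.

```r
library(ggplot2)

df <- data.frame(dose = c("D0.5", "D1", "D2"),
                 len = c(4.2, 10, 29.5))

# Without group, each dose forms its own one-point group, so no line is drawn
pUngrouped <- ggplot(df, aes(x = dose, y = len)) + geom_line()

# group = 1 puts every row in a single group, so the points are connected
pGrouped <- ggplot(df, aes(x = dose, y = len, group = 1)) + geom_line()

# Inspect the group assigned to each row
layer_data(pUngrouped)$group
layer_data(pGrouped)$group
```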

# Change the line type
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()

# Change the color
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(color="red")+
  geom_point()

library(grid)
# Add an arrow
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(arrow = arrow())+
  geom_point()

# Add a closed arrow to both ends of the line
myarrow=arrow(angle = 15, ends = "both", type = "closed")
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(arrow=myarrow)+
  geom_point()

# Step plot (stairs between points)
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_step()+
  geom_point()

# Connect the points in the order they appear in the data (geom_path)
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_path()+
  geom_point()

# multiple groups
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)

##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

# Line plot with multiple groups
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
  geom_line()+
  geom_point()

# Change line types
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
  geom_line(linetype="dashed", color="blue", size=1.2)+
  geom_point(color="red", size=3)

# Change line types by groups (supp)
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()

# Change line types and point shapes
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point(aes(shape=supp))

# Set line types manually
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()+
  scale_linetype_manual(values=c("twodash", "dotted"))

# Colour by group
p<-ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(color=supp))+
  geom_point(aes(color=supp))
p

# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")

# Use grey scale
p + scale_color_grey() + theme_classic()

# Line graph that encodes a third variable
data("economics")
# Map the line size to the unemployment rate (unemploy/pop)
ggplot(data=economics, aes(x=date, y=pop, size=unemploy/pop))+
  geom_line()

For more ideas look here: http://www.sthda.com/english/wiki/ggplot2-line-plot-quick-start-guide-r-software-and-data-visualization

Histogram

Histograms show the distribution of one variable, typically by counts or densities within bins. They are a quick way to see where the values concentrate, how spread out they are, and whether there are any unusual values.

set.seed(1234)
df <- data.frame(
  sex=factor(rep(c("F", "M"), each=200)),
  weight=round(c(rnorm(200, mean=55, sd=5), rnorm(200, mean=65, sd=5)))
  )
head(df)

##   sex weight
## 1   F     49
## 2   F     56
## 3   F     60
## 4   F     43
## 5   F     57
## 6   F     58

# Basic histogram
ggplot(df, aes(x=weight)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Change the width of bins
ggplot(df, aes(x=weight)) + 
  geom_histogram(binwidth=1)
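The message above asks for a better bin width. One common heuristic, shown here only as an illustration and not as the method used in this notebook, is the Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3). The data frame weights below is made up for the sketch.

```r
library(ggplot2)

set.seed(1234)
weights <- data.frame(weight = round(rnorm(400, mean = 60, sd = 7)))

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fdWidth <- 2 * IQR(weights$weight) / nrow(weights)^(1/3)
fdWidth

# Either fix the bin width or fix the number of bins
pWidth <- ggplot(weights, aes(x = weight)) + geom_histogram(binwidth = fdWidth)
pBins <- ggplot(weights, aes(x = weight)) + geom_histogram(bins = 20)
pWidth
```

Supplying either binwidth or bins also silences the `stat_bin()` message.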

# Change colors
p<-ggplot(df, aes(x=weight)) + 
  geom_histogram(color="black", fill="white")
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Add mean line
p+ geom_vline(aes(xintercept=mean(weight)),
            color="blue", linetype="dashed", size=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram with density plot
ggplot(df, aes(x=weight)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Change histogram plot line colors by groups
ggplot(df, aes(x=weight, color=sex)) +
  geom_histogram(fill="white")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Overlaid histograms
ggplot(df, aes(x=weight, color=sex)) +
  geom_histogram(fill="white", alpha=0.5, position="identity")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mu <- data.frame(sex=c("M", "F"), grp.mean=c(mean(filter(df, sex=="M")$weight), mean(filter(df, sex=="F")$weight)))
head(mu)

##   sex grp.mean
## 1   M    65.36
## 2   F    54.70
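The same group means can also be computed with group_by() and summarise(), which scales better when there are more than two groups. This is a sketch that rebuilds the same simulated data so it can run on its own; the name muGrouped is illustrative.

```r
library(dplyr)

set.seed(1234)
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5), rnorm(200, mean = 65, sd = 5)))
)

# One row per sex with the mean weight of that group
muGrouped <- df %>%
  group_by(sex) %>%
  summarise(grp.mean = mean(weight))
muGrouped
```

The rows come out in factor-level order (F, then M), which differs from the order in mu above but works the same way with geom_vline().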

# Interleaved histograms
ggplot(df, aes(x=weight, color=sex)) +
  geom_histogram(fill="white", position="dodge")+
  theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Add mean lines
p<-ggplot(df, aes(x=weight, color=sex)) +
  geom_histogram(fill="white", position="dodge")+
  geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed")+
  theme(legend.position="top")
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Use grey scale
p + scale_color_grey() + theme_classic() +
  theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Use semi-transparent fill
p<-ggplot(df, aes(x=weight, fill=sex, color=sex)) +
  geom_histogram(position="identity", alpha=0.5)
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Multi-panel
p<-ggplot(df, aes(x=weight))+
  geom_histogram(color="black", fill="white")+
  facet_grid(sex ~ .)
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Add mean lines
p+geom_vline(data=mu, aes(xintercept=grp.mean), color="red",
             linetype="dashed")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram with a mean line, title, axis labels, and the classic theme
ggplot(df, aes(x=weight, fill=sex)) +
  geom_histogram(fill="white", color="black")+
  geom_vline(aes(xintercept=mean(weight)), color="blue",
             linetype="dashed")+
  labs(title="Weight histogram plot",x="Weight(kg)", y = "Count")+
  theme_classic()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Change line colors by groups
ggplot(df, aes(x=weight, color=sex, fill=sex)) +
  geom_histogram(position="identity", alpha=0.5)+
  geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed")+
  scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))+
  scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))+
  labs(title="Weight histogram plot",x="Weight(kg)", y = "Count")+
  theme_classic()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Change line colors by groups, with the bars side by side
p<-ggplot(df, aes(x=weight, color=sex)) +
  geom_histogram(fill="white", position="dodge")+
  geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
             linetype="dashed")
# Brewer "Paired" palette (scale_color_brewer palettes are all discrete)
p + scale_color_brewer(palette="Paired") + 
  theme_classic()+theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Brewer "Dark2" palette
p + scale_color_brewer(palette="Dark2") +
  theme_classic()+theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Brewer "Accent" palette
p + scale_color_brewer(palette="Accent") + 
  theme_minimal()+theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
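Each histogram above prints the stat_bin() message because no bin width was supplied, so ggplot2 falls back to 30 bins. The following is a minimal sketch of setting the bins explicitly. It assumes the same simulated weight data used in these notes, and the values binwidth = 2 and bins = 20 are arbitrary illustrative choices rather than recommendations.

```r
library(ggplot2)

# Simulated weights (assumed to match the data frame used for the histograms)
set.seed(1234)
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5),
                   rnorm(200, mean = 65, sd = 5)))
)

# binwidth is given in the units of x (here kg); supplying it silences the message
p_bw <- ggplot(df, aes(x = weight)) +
  geom_histogram(binwidth = 2, color = "black", fill = "white")
p_bw

# Alternatively, fix the number of bins instead of their width
p_bins <- ggplot(df, aes(x = weight)) +
  geom_histogram(bins = 20, color = "black", fill = "white")
p_bins
```

Narrow bins show more detail but also more noise, so it is worth trying a few widths before settling on one.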

Bar Plot

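Bar plots look similar to histograms, but they answer a different question. A histogram bins one continuous variable and counts the observations in each bin, whereas a bar plot draws one bar per category and lets you choose what the bar height means: either a value you supply yourself (stat="identity") or the number of rows in that category (stat="count"). This also gives you more control over both the X and the Y axis.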

df <- data.frame(dose=c("D0.5", "D1", "D2"),
                len=c(4.2, 10, 29.5))
head(df)

##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5

# Basic barplot
p<-ggplot(data=df, aes(x=dose, y=len)) +
  geom_bar(stat="identity")
p
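As a side note, ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat="identity"). The sketch below rebuilds the same three-row data frame and checks that both layers produce the same bar heights. The names p_bar and p_col are only illustrative.

```r
library(ggplot2)

df <- data.frame(dose = c("D0.5", "D1", "D2"),
                 len = c(4.2, 10, 29.5))

# geom_bar(stat = "identity") uses len directly as the bar height
p_bar <- ggplot(df, aes(x = dose, y = len)) +
  geom_bar(stat = "identity")

# geom_col() is ggplot2's shorthand for the same thing
p_col <- ggplot(df, aes(x = dose, y = len)) +
  geom_col()

# layer_data() returns the values each layer actually draws
all.equal(layer_data(p_bar)$y, layer_data(p_col)$y)
```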

# Horizontal bar plot
p + coord_flip()

# Change the width of bars
ggplot(data=df, aes(x=dose, y=len)) +
  geom_bar(stat="identity", width=0.5)

# Change colors
ggplot(data=df, aes(x=dose, y=len)) +
  geom_bar(stat="identity", color="blue", fill="white")

# Minimal theme + blue fill color
p<-ggplot(data=df, aes(x=dose, y=len)) +
  geom_bar(stat="identity", fill="steelblue")+
  theme_minimal()
p

# Choose X axis categories
p + scale_x_discrete(limits=c("D0.5", "D2"))

## Warning: Removed 1 rows containing missing values (position_stack).

# Value labels above the bars
ggplot(data=df, aes(x=dose, y=len)) +
  geom_bar(stat="identity", fill="steelblue")+
  geom_text(aes(label=len), vjust=-0.3, size=3.5)+
  theme_minimal()

# Value labels inside the bars
ggplot(data=df, aes(x=dose, y=len)) +
  geom_bar(stat="identity", fill="steelblue")+
  geom_text(aes(label=len), vjust=1.6, color="white", size=3.5)+
  theme_minimal()

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Leave y unmapped: stat="count" counts the rows in each cyl group
ggplot(mtcars, aes(x=factor(cyl)))+
  geom_bar(stat="count", width=0.7, fill="steelblue")+
  theme_minimal()
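To see what stat="count" computes, here is a small base R sketch that tallies the same groups by hand. In mtcars there are 11, 7, and 14 cars with 4, 6, and 8 cylinders respectively. The names cyl_counts and cyl_df are illustrative only.

```r
# Count the rows in each cylinder group, as stat = "count" does
cyl_counts <- table(cyl = factor(mtcars$cyl))
print(cyl_counts)

# The same counts as a data frame, ready for geom_bar(stat = "identity")
cyl_df <- as.data.frame(cyl_counts, responseName = "n")
print(cyl_df)
```

Pre-computing the counts like this is useful when you also want to label, filter, or reorder the bars before plotting.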

# Change barplot line colors by groups
p<-ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_bar(stat="identity", fill="white")
p

# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")

# Use grey scale
p + scale_color_grey() + theme_classic()

# Change barplot fill colors by groups
p<-ggplot(df, aes(x=dose, y=len, fill=dose)) +
  geom_bar(stat="identity")+theme_minimal()
p

# Use custom color palettes
p+scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))

# use brewer color palettes
p+scale_fill_brewer(palette="Dark2")

# Use grey scale
p + scale_fill_grey()

ggplot(df, aes(x=dose, y=len, fill=dose))+
geom_bar(stat="identity", color="black")+
scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))+
  theme_minimal()

# Change bar fill colors to blues
p <- p+scale_fill_brewer(palette="Blues")
p + theme(legend.position="top")

p + theme(legend.position="bottom")

# Remove legend
p + theme(legend.position="none")

# Change the order of the bars on the x axis
p + scale_x_discrete(limits=c("D2", "D0.5", "D1"))

# Multiple Groups
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)

##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

# Stacked barplot with multiple groups
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
  geom_bar(stat="identity")

# Use position=position_dodge()
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
geom_bar(stat="identity", position=position_dodge())

# Change the colors manually
p <- ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
geom_bar(stat="identity", color="black", position=position_dodge())+
  theme_minimal()
# Use custom colors
p + scale_fill_manual(values=c('#999999','#E69F00'))

# Use brewer color palettes
p + scale_fill_brewer(palette="Blues")

# Add labels inside the plot
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
  geom_bar(stat="identity", position=position_dodge())+
  geom_text(aes(label=len), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5)+
  scale_fill_brewer(palette="Paired")+
  theme_minimal()

# Sort by dose and supp (arrange() is loaded here from dplyr)
library(dplyr)
df_sorted <- arrange(df2, dose, supp)
head(df_sorted)

##   supp dose  len
## 1   OJ D0.5  4.2
## 2   VC D0.5  6.8
## 3   OJ   D1 10.0
## 4   VC   D1 15.0
## 5   OJ   D2 29.5
## 6   VC   D2 33.0
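The same ordering can be produced in base R with order(), which is handy when no extra package is loaded. This is a sketch on a copy of the data frame defined above. The name df_sorted_base is hypothetical.

```r
df2 <- data.frame(supp = rep(c("VC", "OJ"), each = 3),
                  dose = rep(c("D0.5", "D1", "D2"), 2),
                  len = c(6.8, 15, 33, 4.2, 10, 29.5))

# order() returns the row positions that sort by dose first, then by supp
df_sorted_base <- df2[order(df2$dose, df2$supp), ]

# Reset the row names so they run from 1 to 6, as in the arrange() output
rownames(df_sorted_base) <- NULL
print(df_sorted_base)
```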

# Continuous X axis
# Create some data
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("0.5", "1", "2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)

##   supp dose  len
## 1   VC  0.5  6.8
## 2   VC    1 15.0
## 3   VC    2 33.0
## 4   OJ  0.5  4.2
## 5   OJ    1 10.0
## 6   OJ    2 29.5

# x axis treated as continuous variable
df2$dose <- as.numeric(as.vector(df2$dose))
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_brewer(palette="Paired")+
  theme_minimal()

# Axis treated as discrete variable
df2$dose<-as.factor(df2$dose)
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_brewer(palette="Paired")+
  theme_minimal()
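A common pitfall when switching between the discrete and continuous versions of a variable: calling as.numeric() directly on a factor returns the internal level codes, not the original numbers. That is why the conversion above goes through a plain vector first. A minimal sketch with a stand-alone factor follows. The names dose, codes, and values are illustrative.

```r
dose <- factor(c("0.5", "1", "2", "0.5", "1", "2"))

# Direct conversion returns the internal level codes, not the doses
codes <- as.numeric(dose)
print(codes)     # 1 2 3 1 2 3

# Going through character recovers the actual numbers
values <- as.numeric(as.character(dose))
print(values)    # 0.5 1.0 2.0 0.5 1.0 2.0
```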

Density Plot

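Density plots draw a smoothed estimate of the probability distribution of a variable. Where a histogram shows blocky counts per bin, a density plot shows a continuous curve whose total area is 1, even when the underlying data are lumpy or rounded. They are used a lot in probability applications.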

set.seed(1234)
df <- data.frame(
  sex=factor(rep(c("F", "M"), each=200)),
  weight=round(c(rnorm(200, mean=55, sd=5),
                 rnorm(200, mean=65, sd=5)))
  )
head(df)

##   sex weight
## 1   F     49
## 2   F     56
## 3   F     60
## 4   F     43
## 5   F     57
## 6   F     58

# Basic density
p <- ggplot(df, aes(x=weight)) + 
  geom_density()
p

# Add mean line
# linewidth replaces size for lines in ggplot2 3.4.0 and later; use size=1 on older versions
p+ geom_vline(aes(xintercept=mean(weight)),
            color="blue", linetype="dashed", linewidth=1)
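To make the idea of a smoothed distribution concrete, the sketch below uses base R's density() function, which computes the same kind of kernel density estimate that a density plot draws. It assumes the same simulated weights as above, checks that the area under the curve is close to 1, and shows that the adjust argument scales the bandwidth (larger values give a smoother curve). The trapezoid-rule integration is only an approximation.

```r
set.seed(1234)
weight <- round(c(rnorm(200, mean = 55, sd = 5),
                  rnorm(200, mean = 65, sd = 5)))

# Kernel density estimate with the default Gaussian kernel and bandwidth rule
d <- density(weight)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(area)      # close to 1

# adjust multiplies the bandwidth: 2 means twice as much smoothing
d_smooth <- density(weight, adjust = 2)
print(d_smooth$bw / d$bw)
```

In ggplot2 the corresponding control is the adjust argument of geom_density().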

http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization

Box Plot

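Box plots summarise a single numeric variable using the basic summaries we have already seen. The box spans the 1st to the 3rd quartile, the line inside it marks the median, the whiskers reach out to the most extreme points within 1.5 times the interquartile range of the box, and anything beyond the whiskers is drawn as an individual outlier point.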

data("ToothGrowth")
# Convert the variable dose from a numeric to a factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

library(ggplot2)
# Basic box plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) + 
  geom_boxplot()
p

# Rotate the box plot
p + coord_flip()

# Notched box plot
ggplot(ToothGrowth, aes(x=dose, y=len)) + 
  geom_boxplot(notch=TRUE)

# Change outlier color, shape, and size
ggplot(ToothGrowth, aes(x=dose, y=len)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4)
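The numbers behind each box can be computed directly. The following base R sketch works out the quartiles of len for every dose and flags points outside the 1.5 times IQR fences, which is the rule geom_boxplot() uses by default for outliers. The helper find_outliers is a hypothetical name introduced for illustration.

```r
data("ToothGrowth")

# Split tooth length into one vector per dose level
len_by_dose <- split(ToothGrowth$len, ToothGrowth$dose)

# 1st quartile, median, and 3rd quartile for each dose (one column per dose)
quartiles <- sapply(len_by_dose, quantile, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Points farther than 1.5 * IQR from the box are treated as outliers
find_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}
outliers <- lapply(len_by_dose, find_outliers)
print(outliers)
```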

http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization

Interesting Additional Plots

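# Outline map of New Zealand (map_data() needs the maps package installed)
nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

# Same map, with coord_quickmap() setting a sensible aspect ratio
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

# Bar chart of diamond cuts, saved so it can be redrawn in other coordinate systems
bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

# Horizontal bars
bar + coord_flip()

# Polar coordinates
bar + coord_polar()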

Excellent example of visualization in general

https://www.gapminder.org/

Project Brainstorming

https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
https://archive.ics.uci.edu/ml/datasets/Abalone
https://archive.ics.uci.edu/ml/datasets/Adult
https://archive.ics.uci.edu/ml/datasets/Artificial+Characters
https://archive.ics.uci.edu/ml/datasets/Bach+Choral+Harmony
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-model-deployment.html
https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
https://aws.amazon.com/blogs/aws/amazon-sagemaker-ground-truth-build-highly-accurate-datasets-and-reduce-labeling-costs-by-up-to-70/

https://www.dataquest.io/blog/topics/data-science-projects/
https://fivethirtyeight.com/
https://github.com/fivethirtyeight/data
https://data.fivethirtyeight.com/
https://github.com/BuzzFeedNews/everything
https://www.propublica.org/datastore/apis
https://www.propublica.org/datastore/datasets
https://opendata.socrata.com/
https://en.wikipedia.org/wiki/Wikipedia:Database_download
https://www.quandl.com/search

https://www.meteoblue.com/en/weather/archive/export/india_el-salvador_3585481
https://data.gov.in/
http://www.surveyofindia.gov.in/details/view/7
http://mospi.nic.in/data
https://www.india.gov.in/
https://open.canada.ca/en/open-data

Introduction to R & Analytics

Satyajeet Singh