The topics are to be covered:
Data Science is a huge and lucrative field, so there are many ways to learn it and many levels of complexity at which you can understand it. Because the world is so data dominated, knowing how to read, manipulate, and gain insight from data can help you in both your career and your everyday life. Furthermore, given how much data we can now access, drawing insight about a topic, field, or situation is much easier than before. The R language is a tool that lets you manipulate and understand data intuitively and efficiently, and also visualize many new mathematical concepts. RStudio is an IDE (integrated development environment) that lets you organize your code and immediately see the effect of your code and model building.
We generally work on 3 problems in data science:
R is a statistical language that is excellent for experimenting and testing different ideas. Its advantages include that it supports a functional programming style, manages memory and data types automatically, is built to handle vectors (arrays), lists, and matrices efficiently, and can flexibly build models. In practice this means all objects (models, numbers, vectors, and so on) are defined and used in the same way. RStudio is an IDE (integrated development environment) for R, and R Markdown provides an easy way to create reports that integrate code with its results. We can also run ordinary (non-data related) scripts here, so R can be used for general-purpose tasks as well. One of the challenges in starting out with R is that it is very particular about syntax and punctuation, and this can be frustrating for some people early on. With repetition, the language becomes quick to learn.
This section gives a light introduction to R. In particular, it covers some basic usage of RStudio, how an R Markdown file works, and an overview of the R programming language. At the end, there are 5 relatively simple questions that use all the features covered so far.
You must download R from here (https://cran.r-project.org/mirrors.html) and RStudio from here (https://www.rstudio.com/products/rstudio/download/).
RStudio is an IDE for R, and has many features which make it effective for experimenting. The IDE has four major sections, referred to here as Help/Plots, Environment, Scripts, and Console.
The help/plots section appears in the lower right corner of the RStudio window, and is used to display images/plots, browse files, and show help pages for functions/packages. Whenever an image (plot/graph) is generated in an R script, the plot will be shown in this help/plots tab. Similarly, whenever the “help()” command is called on a function (e.g. help(plot)) the output will be shown in this tab as well. Run the below for an example.
help(plot)
## starting httpd help server ... done
The environment tab (top right) keeps track of all variables currently stored in R memory in the given working session, including all data structures. The neighbouring History tab shows a history of commands. To clear the environment and reset the working session, use the following command.
rm(list=ls())
Now, to see a variable “a” with value 3 be added to the environment, run the following and see the update in the environment section:
a <- 3
The scripts section (top left) is used to write R scripts, build R Markdown files, and even build web applications using “Shiny”. There are many other types of R based files that can be created too. R scripts are files of R code that can be run as a whole. Details about R Markdown documents will be discussed in the R Markdown section.
The console (bottom left) is an interactive place to run R commands. When an R script is run, it is typically passed through the console. In the console, you can even enter math commands such as “2+3” and get an immediate result.
Next to the console there is typically a terminal window, which can take Linux commands to navigate through the directories of your computer. This will not be used extensively in these notebooks, but is still useful to know about.
R Markdown is a file type that allows a user to quickly build a static webpage, and to integrate analysis along with actual code and output. All of the notebooks in this set are written with R Markdown. In R Markdown it is easy to display both the code used and the results, and also to add text that analyzes the results on the same page.
To create a code block, use the symbols " ``` " followed by an opening curly brace “{”, the language being used (in this case “r”), the name of the code block, and a closing curly brace “}”. After the code segment is finished, the block is closed with another " ``` " symbol set. For instance, to build code written in the R language with the name “firstCodeBlock” in an R Markdown document, the sequence of commands is:
```{r firstCodeBlock}
print("my first code block")
```
This will look as follows in the actual html/pdf file:
print("my first code block")
## [1] "my first code block"
To run a code block from inside the R Markdown file, click the small “play” icon in the top right corner of the block.
Similarly, to include a plot, the following code is used in an R code block:
plot(pressure)
Out of the many additional arguments that can be added to a code block, one is the “echo” flag. If “echo” is TRUE, the code in the block appears in the R Markdown document; if it is FALSE, the code is hidden and only its output appears. The default value is TRUE, so the first two blocks of code appeared along with their output, but in the next one, setting the flag with the header “{r pressure2, echo=FALSE}” hides the code and shows just the plot:
The “Knit” command is located in the command bar, and can generate an html file with the RMarkdown content and results. This can then be saved as a PDF or attached to a webpage as a static page.
This section covers assignments and printing, math operations, and functions. First, though, it is important to emphasize that comments in R are done with the “#” symbol. This way, if an R script reads a “#” symbol the remainder of the line will be a comment, unless the “#” symbol is between quotation marks. In the code below, “#” are used to indicate a new idea within the code.
R has 2 methods for assigning a value to a variable: “<-” and “=”.
Both provide essentially the same functionality, but it is recommended to use “<-”. The assignment operator is also used for storing a model or function or data frame or vector, and so it is very important.
To print a value, you can either use the “print()” function or just place the value on a line of its own. For instance, to print the variable a defined above, one can use either print(a) or simply a.
Note that print(“a”) will print the letter “a”. Also, assigning a new value to an existing name will overwrite that name.
Below are some examples in code:
a <- 3
b <- 5
a
## [1] 3
b
## [1] 5
a <- 2
b <- 6
print(a)
## [1] 2
print(b)
## [1] 6
R can handle all of the standard math operations such as addition, subtraction, division, multiplication, exponents, and modulo (+, -, /, *, ^, %%), and all standard comparisons and logical operations such as equal, greater than, less than, and, or, not, exclusive or (==, >, <, &, |, !, xor). It can even handle “set” commands such as intersect, union, setdiff, setequal (using the names given), but these will be explored more in future notebooks.
Below with code and comments there is a short illustration of how to run math commands. It’s also easy to store results from operations into variables.
# math Operations
a <- 2
b <- 4
2 + 3 # add
## [1] 5
3 - 1 # subtract
## [1] 2
a * 2 # multiply
## [1] 4
b / 2 # divide
## [1] 2
b %% a # modulo (remainder)
## [1] 0
a ^ b # exponents
## [1] 16
a ** b # also exponents
## [1] 16
# storing results
print("storing results")
## [1] "storing results"
c <- a ** b
c
## [1] 16
d <- b %% a
d
## [1] 0
e <- c * d
a <- e
# comparisons
a < b
## [1] TRUE
a != b
## [1] TRUE
e != a
## [1] FALSE
QUESTION: What value will “a” hold?
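The set commands and logical operators listed earlier can be tried on small vectors. The following is a minimal sketch; the vectors u and v are made-up examples and are not part of the notebook.

```r
# Two small example vectors (illustrative values only)
u <- c(1, 2, 3, 4)
v <- c(3, 4, 5)

intersect(u, v)   # elements found in both
union(u, v)       # elements found in either, without duplicates
setdiff(u, v)     # elements of u that are not in v
setequal(u, v)    # TRUE only if both hold the same set of values

# Logical operators combine comparisons
(2 < 3) & (3 < 2)    # and
(2 < 3) | (3 < 2)    # or
!(2 < 3)             # not
xor(2 < 3, 3 < 2)    # exclusive or: TRUE when exactly one side is TRUE
```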
R has both many built-in functions and an easy way for a user to develop their own functions. Functions are dynamically typed, meaning you do not need to specify a return type or input types, unlike in many other languages. Functions are saved the same way a variable is saved, and therefore can also be referred to the way a variable is. Simply entering a function's name in R (without parentheses) prints its definition, which gives you an idea of what the function does and how.
For a function named “hello” which takes in a string “input” and returns the string “hello input”, the syntax is:
hello <- function(input) {
paste("hello", input)
}
Even though the function expects a string, declaring the input as a string was not required. This is both an advantage and a disadvantage: while it is easier to write functions and make them general purpose, sometimes they will behave unexpectedly without warning the user!
Note, “paste” is an existing function in R which takes any number of inputs and returns them concatenated, separated by spaces. To get an idea of how the backend of “paste” looks, type the function name (paste).
paste
## function (..., sep = " ", collapse = NULL)
## .Internal(paste(list(...), sep, collapse))
## <bytecode: 0x0000000012e66388>
## <environment: namespace:base>
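Because no types are declared, the hello function from earlier accepts more than strings. A small sketch of what that looks like in practice, reusing the same definition:

```r
# Same definition as in the text: no input type is declared
hello <- function(input) {
  paste("hello", input)
}

hello("world")        # a string, as intended
hello(42)             # a number is silently converted to text
hello(c("a", "b"))    # a vector gives one result per element
```

None of these calls produce a warning, which is the kind of silent behaviour to watch for.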
To call a function, use the function name followed by any inputs in parentheses. Note that R also supports default arguments, so if the above function is rewritten as follows, it can be called with or without an argument. Default arguments are usually a good idea, since they help the user and also make it easy to tell immediately if something went wrong.
hello <- function(input="nobody") {
print(paste("hello", input))
}
# default
hello()
## [1] "hello nobody"
# new
hello("harry")
## [1] "hello harry"
hello("potter")
## [1] "hello potter"
We can use traditional if statements in R as well. If statements give a method of running a section of code if a condition is satisfied. They follow the syntax below:
if (a == 0) {
print("hello")
} else if (b == 3) {
print("helllloooo")
} else {
print("goodbye")
}
## [1] "hello"
They can be added to a function as well with the following syntax:
myIfFn <- function(a,b) {
if (a == 0) {
print("hello")
} else if (b == 3) {
print("helllloooo")
} else {
print("goodbye")
}
}
myIfFn(0,1)
## [1] "hello"
myIfFn(3,3)
## [1] "helllloooo"
myIfFn(5,5)
## [1] "goodbye"
The most important function for practising, and least important for production is “help()”. You can use the help function along with any function name to see how a function works. For instance, to learn more about paste:
help(paste)
QUESTION: Use the help command on paste to change the function above so it prints the inputted words without any spaces between them (e.g. an input of “harry” would result in the output “helloharry” instead of “hello harry”).
Another method to get help is to add a question mark before the command being searched. For example:
?paste
a <- 3
Create a code block labelled q1 and inside it create a variable “b” which holds the value 2. For all future questions you must create a code block with the question number and answer it with code inside the block.
Build a function that takes in 2 numbers and returns their sum.
Build a function that takes 2 strings and returns their concatenation without spaces. Check the “paste” function documentation for hints. For example:
Use the seq function (search for it using help) to generate a sequence of 100 numbers separated by 0.5.
Write a function which takes in two numbers a and b and returns the larger of the sum or multiplication of those numbers.
Today we will look at basic data structures in R and some simple statistical features!
In R, the basic data structures are vectors, lists, matrices, and data frames. Typically, data structures are organized by whether they are homogeneous or heterogeneous (all elements have the same type versus elements may have different types) and by their dimension (1d, 2d, nd). To classify the four types:
+ Vectors and matrices are homogeneous, while lists and data frames are heterogeneous.
+ Vectors and lists are 1d, while matrices and data frames are 2d.
+ There is also an array structure which is nd, but it is not important for now.
Vectors are one of the more distinctive and powerful data types in R. Vectors are 1 dimensional and homogeneous (all their elements are of the same type). They resemble vectors in linear algebra. R is often used in statistics because it can do operations on vectors quickly. This also allows much of the code to avoid loops and repetitive functions, as subsetting, aggregating, and applying operations to all elements of a vector can be done with more efficient commands.
An important subtlety to note is that R does not have “0 dimension” variables the way most languages do. That means the earlier definition (“a <- 3”) actually creates a vector of length 1 and not an object that only holds one value. Vectors are built with the “c()” function, using commas to separate elements, so a length one vector is defined with “a <- c(3)”, and a longer vector with “b <- c(1,2,3)”.
a <- c(3)
aPrime <- 3
a == aPrime
## [1] TRUE
b <- c(1,2,3)
a
## [1] 3
b
## [1] 1 2 3
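Note that { a <- c(1, “2”, 3) } does not keep its numbers as numbers: because vectors are homogeneous, R silently converts every element to a character string, so the result is the same as { a <- c(“1”,“2”,“3”) }.
Nested calls to c() are flattened, so c(1,c(2,c(3,4))) is equivalent to c(1,2,3,4).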
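Running the mixed-type case shows what R actually does with it. This is a small sketch; the name mixed is chosen here only to avoid overwriting the variable a used in the surrounding examples.

```r
mixed <- c(1, "2", 3)
mixed            # all three elements are now character strings
class(mixed)     # "character"

# Converting back to numbers has to be done explicitly
as.numeric(mixed)
```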
c(1,2,3,4) == c(1,c(2,c(3,4)))
## [1] TRUE TRUE TRUE TRUE
To choose items from a vector, square brackets “[ ]” can be used. Vectors start counting at index 1 (unlike many programming languages, which start at 0), so the second item of vector “b” above is selected with b[2]. If an index is chosen that does not exist, R will return “NA”.
# returns 2nd value of b
b[2]
## [1] 2
# a has only one value so returns NA
a[2]
## [1] NA
Vectors can also be subset with conditions, e.g. selecting all elements with value greater than 2 in b above:
b[b>2]
## [1] 3
Note we can also do math operations on vectors, which work element by element. When the vectors do not have the same length, R recycles the shorter one, and this can lead to interesting results!
c <- c(3,4,5)
a + c
## [1] 6 7 8
a + b
## [1] 4 5 6
a * b
## [1] 3 6 9
c / b
## [1] 3.000000 2.000000 1.666667
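The behaviour with unequal lengths is called recycling: the shorter vector is repeated until it matches the longer one. A brief sketch with made-up vectors (long, short, and odd are illustrative names, not from the notebook):

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is reused as 10, 20, 10, 20
long + short

# Lengths 3 and 2 do not divide evenly: R still recycles, but gives a warning
odd <- c(1, 2, 3)
odd + short
```

The warning in the second case is worth paying attention to, since the result is rarely what was intended.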
Lists are similar to vectors except their elements can be of mixed types (including other lists). Lists are built using the “list()” command, and each argument becomes one element. For example, the following is a list composed of: 1) a vector of 3 numbers, 2) one character, 3) another character, 4) a 2-element character vector, and 5) a nested list of 3 elements, each of which is a number.
x <- list(1:3, "a","b", c("c", "d"), list(1,2,3))
x
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] "b"
##
## [[4]]
## [1] "c" "d"
##
## [[5]]
## [[5]][[1]]
## [1] 1
##
## [[5]][[2]]
## [1] 2
##
## [[5]][[3]]
## [1] 3
To select specific elements from the larger list, a double square bracket “[[]]” is used. For example, each element of the above list can be stored in a separate variable and then a new list can be created by combining those variables:
a <- x[[1]]
b <- x[[2]]
c <- x[[3]]
d <- x[[4]]
e <- x[[5]]
a
## [1] 1 2 3
b
## [1] "a"
c
## [1] "b"
d
## [1] "c" "d"
e
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
y <- list(a,b,c,d,e)
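It can help to compare the double bracket with the single bracket on the same list. A short sketch, rebuilding the list x from above:

```r
x <- list(1:3, "a", "b", c("c", "d"), list(1, 2, 3))

x[[1]]          # the element itself: an integer vector
x[1]            # a list of length 1 that contains that vector
class(x[[1]])   # "integer"
class(x[1])     # "list"

x[[1]][2]       # second value inside the first element
x[[5]][[3]]     # third element of the nested list
```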
Matrices are built using the “matrix()” function. These resemble the matrices from linear algebra. They only store one type, and typically are used in applications to help with calculations.
myMatrix <- matrix(1:15, nrow=5, ncol=3)
myMatrix
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
To choose an element of the matrix, use a comma in the square bracket selectors to separate row and column, e.g. to choose the 2nd row and 3rd column: matrix[2,3]. To isolate just the 2nd row or 3rd column, use: matrix[2,] or matrix[,3].
Below is an example on the “myMatrix” object built above.
myMatrix[2,3]
## [1] 12
myMatrix[2,]
## [1] 2 7 12
myMatrix[,3]
## [1] 11 12 13 14 15
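Since matrices are mostly used for calculations, a few common operations are sketched below on the same myMatrix. The name prod33 is just an illustrative choice.

```r
myMatrix <- matrix(1:15, nrow = 5, ncol = 3)

dim(myMatrix)        # 5 rows, 3 columns
t(myMatrix)          # transpose: 3 rows, 5 columns
myMatrix * 2         # element-wise: every entry doubled

# Matrix product of a 3x5 and a 5x3 matrix gives a 3x3 matrix
prod33 <- t(myMatrix) %*% myMatrix
dim(prod33)
```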
Data Frames are very commonly used to store data in R, and help make analysis much easier. These are essentially tables. They can be created with the “data.frame()” command, combined with “cbind” and “rbind”, and then specific columns can be selected by name with the “$” symbol, and rows with the row indices in square brackets (“[1:3,]” or “[c(1,3,5,7),]”).
# build frame
df <- data.frame(x=1:4, y=c("a","b","c","d"), stringsAsFactors = FALSE)
# add column
col <- c(4,6,8,10)
df <- cbind(df, col)
# add row (the result is printed but not stored, so df itself is unchanged)
row <- c("a", "b", "c")
rbind(df, row)
## x y col
## 1 1 a 4
## 2 2 b 6
## 3 3 c 8
## 4 4 d 10
## 5 a b c
# name columns
colnames(df) <- c("col1", "col2", "col3")
# return first column
df$col1
## [1] 1 2 3 4
# return second row
df[2,]
## col1 col2 col3
## 2 2 b 6
# return second through fourth row
df[2:4,]
## col1 col2 col3
## 2 2 b 6
## 3 3 c 8
## 4 4 d 10
R has many built in features. We will show some examples concerning random variables, summaries, and plotting. These will be developed in more depth in future notebooks, but the basics of random variables will be shown here.
Plotting two variables can be done with the “plot” function. For example, using the built-in cars dataset, the following command plots the speed and stopping distance of cars:
plot(cars$speed, cars$dist, main="SpeedvsDist", xlab = "speed", ylab = "dist")
Note that the main argument specifies the main title of the graph, and xlab and ylab specify the x axis and y axis labels.
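For the summaries mentioned above, base R has a handful of functions that work on any numeric vector. A minimal sketch, where vals holds made-up numbers chosen only for illustration:

```r
vals <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values for illustration

mean(vals)      # arithmetic mean
median(vals)    # middle value
sd(vals)        # sample standard deviation
summary(vals)   # min, quartiles, median, mean, and max in one call

# summary() also works column by column on a data frame
summary(cars)
```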
R provides four functions for the normal distribution: rnorm, dnorm, pnorm, and qnorm. Only rnorm generates random values; dnorm gives the density, pnorm the cumulative probability, and qnorm the quantiles. Generating random values from a normal distribution with a given mean and standard deviation is done with rnorm. The following generates 50 random values with mean 0 and standard deviation 1 and stores the results in the variable “rda”. It is a good idea to look up qnorm, pnorm and dnorm for future reference.
set.seed(100)
rda <- rnorm(50, 0, 1)
rda
## [1] -0.50219235 0.13153117 -0.07891709 0.88678481 0.11697127
## [6] 0.31863009 -0.58179068 0.71453271 -0.82525943 -0.35986213
## [11] 0.08988614 0.09627446 -0.20163395 0.73984050 0.12337950
## [16] -0.02931671 -0.38885425 0.51085626 -0.91381419 2.31029682
## [21] -0.43808998 0.76406062 0.26196129 0.77340460 -0.81437912
## [26] -0.43845057 -0.72022155 0.23094453 -1.15772946 0.24707599
## [31] -0.09111356 1.75737562 -0.13792961 -0.11119350 -0.69001432
## [36] -0.22179423 0.18290768 0.41732329 1.06540233 0.97020202
## [41] -0.10162924 1.40320349 -1.77677563 0.62286739 -0.52228335
## [46] 1.32223096 -0.36344033 1.31906574 0.04377907 -1.87865588
Note, for testing purposes R also is able to fix the random sequence using the “set.seed()” command. This means that the numbers will be generated randomly, but in a repeatable way depending on the seed. This will still simulate randomness but fix the results to be the same. Often, in a situation involving randomness it is good to set a seed so that results can be tested in a reproducible way.
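Both ideas can be checked directly. The sketch below first confirms that resetting the seed repeats the same draws, then shows what the other three normal-distribution functions return for the standard normal; the names first and second are illustrative.

```r
# Same seed, same draws
set.seed(100)
first <- rnorm(5, mean = 0, sd = 1)
set.seed(100)
second <- rnorm(5, mean = 0, sd = 1)
identical(first, second)

# The other three functions describe the distribution rather than sample from it
dnorm(0)      # density (height of the bell curve) at 0
pnorm(0)      # probability of a value below 0: 0.5
qnorm(0.5)    # value with 50% of the distribution below it: 0
```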
Another very important part of R is installing and loading packages. This allows a user to quickly run many standard functions that do not come pre-built in R.
To install a package, the command is “install.packages()” with the package name entered as a string (e.g. “packagename”). To install multiple packages at once, pass the names as a character vector built with “c()”. For instance, to install the packages “dplyr”, which is excellent for filtering data, and “ggplot2”, which is excellent for building advanced plots, use the command “install.packages(c("dplyr", "ggplot2"))”. Note that once a package is installed it stays installed after the R session closes, so it does not need to be installed again.
To load the functions, the command is “library()”. This has to be called every time R is re-opened, and it loads one package at a time. The package name can be entered either directly or as a string. To load the two packages above, use the following commands:
#install.packages(c("dplyr", "ggplot2"))
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.6.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.6.1
Good style when building a notebook involves commenting out the install line (similar to above) so the user is aware which packages to download. It is then good to put all the library commands in a block right away so the notebook runs smoothly.
The install line can break a notebook in the knitting step if it is not commented out. When installing a package it is best to do it from the console.
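One way to keep a notebook safe to knit is to check whether packages are present instead of installing them inside the document. This is only a sketch of that idea, not the notebook's own approach; the variable names are illustrative.

```r
needed <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Install from the console first: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed")
}
```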
This is not a “dplyr” feature, but it is an important idea for working with vectors and lists. In R, it is usually best to avoid explicit loops over vectors and instead use the “apply” family of functions, which apply the same function to every element of a vector or list. Code written this way is shorter, and it is also easier to parallelize later if it needs to run faster.
To use apply effectively, first write a function that works on a single element. Then use lapply to apply that function to every element and store the results in a list. The unlist command converts a list back into a vector. An example is below: the first function, add1, adds 1 to every element of a numeric vector, and the second, adda, adds the letter “a” to the start of each string.
add1 <- function(x) {
return (x+1)
}
adda <- function(x) {
return (paste("a", x, sep=""))
}
unlist(lapply(c(1,2,3,4,5), add1))
## [1] 2 3 4 5 6
unlist(lapply(c("a","ab", "abc", "abcd"), adda))
## [1] "aa" "aab" "aabc" "aabcd"
There are other methods of apply too which work on different data structures; run help(lapply) to learn more!
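Two of those other variants, sapply and vapply, return a vector directly, which avoids the unlist step. A short sketch reusing add1 (the vector nums is a made-up example):

```r
add1 <- function(x) {
  x + 1
}

nums <- c(1, 2, 3, 4, 5)

sapply(nums, add1)               # like lapply, but simplifies to a vector
vapply(nums, add1, numeric(1))   # same, and checks each result is one number

# Simple arithmetic is already vectorized, so no apply is needed at all
nums + 1
```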
The extra topics in these notebooks called “functional programming” are also great tools to program smarter in R, and will be especially useful for assignments.
Here are some more questions to work on the basics from above. Please make a code block labelled like the previous day when answering each question. Part of the first question has been done for you.
Generate 1000 numbers using rnorm (the line below does this for you), and find how many of them are greater than 0.2 using subsets. Store these as a vector named “six”.
generated <- rnorm(1000, mean=0, sd=1)
Generate 100 numbers with mean 0 and stdev 0.5 using “rnorm”. Multiply this vector by first the vector “six” and then the vector “generated” above. Which one of these gives a warning?
Use the lapply function to build a vector with the following rules: for each number in “six” if the value is greater than 0 the new vector should hold the value 1 in that spot, and if it is less than or equal to 0 it should hold the value -1.
Install the “housingData” package (install.packages("housingData")) and store the “fipsCounty” data (housingData::fipsCounty) into a data frame. Which state in fipsCounty appears the most times? (hint: use the functions sort and table to find this, and also don’t forget to run library(housingData) to use the info in that package).
Load the housing data from the housingData package into a dataframe. Which state has the most houses sold, and which state has the highest average individual difference between list and sold price? (Note the which.max function can help you to solve this)
Plot a graph of list and sold price of all the houses. Use the plot function for this. Then plot a graph of the list price and selling price of all houses with a sold price greater than 2500. What do you notice about the two plots?
How would you summarise or display data in general?
Now we begin exploring methods for visualizing and analyzing data in R using tables and graphs. The library (package) most often used for this is called tidyverse, which is the standard for doing simple visualizations (and some complicated ones) in R. When there is not much data, or the audience is only looking for overviews and summary information, graphs and tables are usually enough to get a point across. This lesson will use the nycflights13 dataset, which contains information about flights. We will then do a task using the Titanic dataset, which contains passengers and whether or not they survived the shipwreck.
The different steps in this day are:
- Reading and Cleaning Data
- Summary Functions
- Problems
The two important packages inside tidyverse (which is a set of packages compiled into one) are “dplyr” and “ggplot2”.
#install.packages(c("tidyverse", "nycflights13"))
#install.packages(c("dplyr", "ggplot2", "magrittr", "nycflights13"))
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.6.1
## -- Attaching packages ----------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'readr' was built under R version 3.6.1
## -- Conflicts -------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("nycflights13")
The tidyverse book reference (filtering plotting and more): http://r4ds.had.co.nz/introduction.html
We will first load some data to transform using the above functions. The dataset we will use for this lesson is “nycflights13”, and the first step to analyzing any dataset is to see what columns and datatypes exist. We use the “str()” function for this. We will also use the “nrow()” function to see how many rows exist in the dataset.
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
nrow(flights)
## [1] 336776
The str output shows what type of data belongs to each column as well as some examples from each column, and nrow gives a count of rows. We will now begin to look for specific information.
Sometimes it’s more useful to take a summary of very specific instances in the data, for example, a summary of the Age of all survivors where they are female or a summary of all variables for passengers who are 18. The first step to find these summaries is to filter the data and pull only the necessary rows.
Such filters are done very effectively using the “dplyr” library inside the tidyverse package. Below are some examples of filters. The full filtered data is saved in a variable, and only the first few rows are displayed using the “head()” function. We start by looking for all flights that occurred on January 1st in any year, as well as on December 25th in any year.
jan1 <- filter(flights, month==1, day==1)
head(jan1)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
head(dec25 <- filter(flights, month == 12, day == 25))
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 25 456 500 -4 649
## 2 2013 12 25 524 515 9 805
## 3 2013 12 25 542 540 2 832
## 4 2013 12 25 546 550 -4 1022
## 5 2013 12 25 556 600 -4 730
## 6 2013 12 25 557 600 -3 743
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
We can also use logical operators such as %in%, | (or), and ! (not) inside a filter:
nov_dec <- filter(flights, month %in% c(11, 12))
nov_dec2 <- filter(flights, month == 11 | month == 12)
identical(nov_dec, nov_dec2)
## [1] TRUE
gl <- filter(flights, !(arr_delay > 120 | dep_delay > 120))
gl2 <- filter(flights, arr_delay <= 120, dep_delay <= 120)
identical(gl, gl2)
## [1] TRUE
Filters are quite powerful tools for getting initial ideas about specific parts of the data. They are also great for figuring out how much data is missing, and whether or not there are any anomalies.
Arrange sorts rows. Here are some examples of arranging flights by date (year, month, day), and then arranging flights by both increasing and decreasing departure delay.
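As a sketch of using a filter to count missing data, the example below runs on a tiny made-up table rather than the real flights data, so the numbers are illustrative only. It assumes dplyr is installed.

```r
library(dplyr)

# Made-up stand-in for the flights table (not real data)
toy <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(10, NA, 5, 15, NA)
)

missing_delay <- filter(toy, is.na(dep_delay))
nrow(missing_delay)    # number of rows with no recorded delay

known_delay <- filter(toy, !is.na(dep_delay))
nrow(known_delay)      # number of rows with a recorded delay
```

The same two filter calls can be pointed at flights in place of toy.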
sortFlights <- arrange(flights, year, month, day)
sortDelayUp <- arrange(flights, dep_delay)
sortDelayDown <- arrange(flights, desc(dep_delay))
Select is especially useful when there are many columns or when we only want a subset of the columns. We can also use it to re-order the columns in a dataset.
head(select(flights, year, month, day))
## # A tibble: 6 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
head(select(flights, dep_time, sched_dep_time, everything()))
## # A tibble: 6 x 19
## dep_time sched_dep_time year month day dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 517 515 2013 1 1 2 830
## 2 533 529 2013 1 1 4 850
## 3 542 540 2013 1 1 2 923
## 4 544 545 2013 1 1 -1 1004
## 5 554 600 2013 1 1 -6 812
## 6 554 558 2013 1 1 -4 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Mutate is especially useful for creating a new field out of the existing fields (a field in this case is a column). It adds the new columns to the end of the dataset, while transmute keeps only the new columns. Here are a few examples (they are meant to illustrate the concept rather than to be useful on their own).
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
) %>% head()
## # A tibble: 6 x 9
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 -9 370.
## 2 2013 1 1 4 20 1416 227 -16 374.
## 3 2013 1 1 2 33 1089 160 -31 408.
## 4 2013 1 1 -1 -18 1576 183 17 517.
## 5 2013 1 1 -6 -25 762 116 19 394.
## 6 2013 1 1 -4 12 719 150 -16 288.
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
) %>% head()
## # A tibble: 6 x 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 -9 3.78 -2.38
## 2 -16 3.78 -4.23
## 3 -31 2.67 -11.6
## 4 17 3.05 5.57
## 5 19 1.93 9.83
## 6 -16 2.5 -6.4
Finally, the summarise function in dplyr computes specific summary statistics, and running the built-in “summary” function on a filtered dplyr table also gives a good overview. Here is an example:
# dplyr summarise
summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) %>% head()
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
It is also possible to use group_by() to get summaries about specific subgroups.
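As a minimal sketch of that idea, the example below groups the built-in mtcars data by number of cylinders and summarises each group. The column names n_cars and avg_mpg are chosen here purely for illustration.

```r
library(dplyr)

# Group the built-in mtcars data by cylinder count, then summarise each group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars  = n(),                      # number of rows in each group
    avg_mpg = mean(mpg, na.rm = TRUE)   # mean miles per gallon in each group
  )

by_cyl
```

The result has one row per group, so any summary that works on a whole table can be reused per subgroup by adding group_by() in front of it.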
First we will do some problems using the above functions on another built-in dataset, msleep, which ships with the ggplot2 package (so ggplot2 must be loaded first).
data("msleep")
Q1: Find the number of carnivores belonging to order Rodentia or Primates.
Q2: Find the average difference between brainwt and bodywt for all domesticated animals.
Q3: Find the total number of “genus” categories.
Q4: Find the animal with the maximum and minimum bodywt, and maximum and minimum brainwt.
Q5: Find the number of rows containing at least 1 “NA” value.
Data can come from many sources: a CSV file works for smaller datasets, while a serialized format with a schema, such as Avro, may suit larger amounts of data. R has many functions for reading external data and for writing data from R for use in other software. It is usually best to look up the right function for a specific case, but read.csv() and write.csv() are the ones needed most often. Use the help command to learn more about them.
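A minimal round trip with these two functions might look like the following sketch. The scores table and the temporary file are made up for illustration.

```r
# A small made-up table to write out
scores <- data.frame(
  name  = c("Ana", "Ben", "Cy"),
  score = c(90.5, 82, 77.25)
)

tmp <- tempfile(fileext = ".csv")           # throwaway file path for this sketch
write.csv(scores, tmp, row.names = FALSE)   # row.names = FALSE skips the row-number column

# Read the file back into a data frame
scores_back <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)
str(scores_back)

unlink(tmp)                                 # delete the temporary file

# ?read.csv or help(write.csv) opens the documentation for these functions
```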
Another very common problem with data is that it does not always come in a form that is easy to analyze. That is why it is important to rearrange, filter, and clean the data to meet relevant requirements. R is not always the best language for cleaning data, especially compared to general-purpose languages that make it easy to open and modify file contents (C++, Python, etc.). Also, there are specific programs for specific types of files (e.g. Excel is good for xlsx and csv, and Tableau is good for twb).
There is no real scientific method for cleaning data or decoding it so that it can be read into R smoothly, which unfortunately can make the task very frustrating. However, starting with a clean dataset and seeing some results will hopefully be motivating enough to justify spending the time to clean data in the future! (Most small in-class assignments will use clean data, but independent or class projects are likely to have some mess.)
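Simple clean-up steps can still be done in R itself. The sketch below uses only base R on a made-up table; the column names and the strings treated as missing ("N/A" and the empty string) are assumptions for illustration.

```r
# A small, made-up messy table
messy <- data.frame(
  name = c(" Ana", "Ben ", "Cy", "Dee"),
  age  = c("34", "N/A", "27", ""),
  stringsAsFactors = FALSE
)

clean <- messy
clean$name <- trimws(clean$name)               # strip stray leading/trailing spaces
clean$age[clean$age %in% c("N/A", "")] <- NA   # mark missing values as NA
clean$age <- as.numeric(clean$age)             # convert text to numbers

# Keep only the rows that have no missing values
complete <- clean[complete.cases(clean), ]
complete
```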
In this notebook the example used is the Titanic data, which gives information about different passengers on the Titanic and whether or not they survived. Below is an example of using the read.csv function to read the Titanic data into a dataframe.
The data can be downloaded here: https://www.kaggle.com/c/titanic/data
Be sure to change filePathTrain (and filePathTest) to match the path of the Titanic data files on the computer this code is being run on.
# Change filepath to match where you downloaded
filePathTrain <- "E:/Datasets/trainTitanic.csv"
filePathTest <- "E:/Datasets/test.csv"
# Save into dataframes
dfTrain <- as.data.frame(read.csv(filePathTrain, header=TRUE))
#dfTest <- as.data.frame(read.csv(filePathTest, header=TRUE))
# summarize fare
summary(dfTrain$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.91 14.45 32.20 31.00 512.33
# summarize fare of survivors
dfTrain %>% filter(Survived==1) %>% select(Fare) %>% summary()
## Fare
## Min. : 0.00
## 1st Qu.: 12.47
## Median : 26.00
## Mean : 48.40
## 3rd Qu.: 57.00
## Max. :512.33
# keep all fields, but only the rows for female passengers
females <- filter(dfTrain, Sex == "female")
# filter all fields for only females that have survived
survivingFemales <- filter(dfTrain, Survived == 1, Sex == "female")
# filter all passengers who were at least 18 years old
age18 <- filter(dfTrain, Age >= 18)
head(females)
## PassengerId Survived Pclass
## 1 2 1 1
## 2 3 1 3
## 3 4 1 1
## 4 9 1 3
## 5 10 1 2
## 6 11 1 3
## Name Sex Age SibSp
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 2 Heikkinen, Miss. Laina female 26 0
## 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 4 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0
## 5 Nasser, Mrs. Nicholas (Adele Achem) female 14 1
## 6 Sandstrom, Miss. Marguerite Rut female 4 1
## Parch Ticket Fare Cabin Embarked
## 1 0 PC 17599 71.2833 C85 C
## 2 0 STON/O2. 3101282 7.9250 S
## 3 0 113803 53.1000 C123 S
## 4 2 347742 11.1333 S
## 5 0 237736 30.0708 C
## 6 1 PP 9549 16.7000 G6 S
head(survivingFemales)
## PassengerId Survived Pclass
## 1 2 1 1
## 2 3 1 3
## 3 4 1 1
## 4 9 1 3
## 5 10 1 2
## 6 11 1 3
## Name Sex Age SibSp
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 2 Heikkinen, Miss. Laina female 26 0
## 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 4 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0
## 5 Nasser, Mrs. Nicholas (Adele Achem) female 14 1
## 6 Sandstrom, Miss. Marguerite Rut female 4 1
## Parch Ticket Fare Cabin Embarked
## 1 0 PC 17599 71.2833 C85 C
## 2 0 STON/O2. 3101282 7.9250 S
## 3 0 113803 53.1000 C123 S
## 4 2 347742 11.1333 S
## 5 0 237736 30.0708 C
## 6 1 PP 9549 16.7000 G6 S
head(age18)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 7 0 1
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 McCarthy, Mr. Timothy J male 54 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 17463 51.8625 E46 S
# summarise() with no summary expressions returns an empty result
summarise(survivingFemales)
## data frame with 0 columns and 1 row
# Arrange
sortAge <- arrange(dfTrain, Age)
sortId <- arrange(dfTrain, desc(PassengerId))
head(sortAge)
## PassengerId Survived Pclass Name Sex Age
## 1 804 1 3 Thomas, Master. Assad Alexander male 0.42
## 2 756 1 2 Hamalainen, Master. Viljo male 0.67
## 3 470 1 3 Baclini, Miss. Helene Barbara female 0.75
## 4 645 1 3 Baclini, Miss. Eugenie female 0.75
## 5 79 1 2 Caldwell, Master. Alden Gates male 0.83
## 6 832 1 2 Richards, Master. George Sibley male 0.83
## SibSp Parch Ticket Fare Cabin Embarked
## 1 0 1 2625 8.5167 C
## 2 1 1 250649 14.5000 S
## 3 2 1 2666 19.2583 C
## 4 2 1 2666 19.2583 C
## 5 0 2 248738 29.0000 S
## 6 1 1 29106 18.7500 S
head(sortId)
## PassengerId Survived Pclass Name
## 1 891 0 3 Dooley, Mr. Patrick
## 2 890 1 1 Behr, Mr. Karl Howell
## 3 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
## 4 888 1 1 Graham, Miss. Margaret Edith
## 5 887 0 2 Montvila, Rev. Juozas
## 6 886 0 3 Rice, Mrs. William (Margaret Norton)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 male 32 0 0 370376 7.750 Q
## 2 male 26 0 0 111369 30.000 C148 C
## 3 female NA 1 2 W./C. 6607 23.450 S
## 4 female 19 0 0 112053 30.000 B42 S
## 5 male 27 0 0 211536 13.000 S
## 6 female 39 0 5 382652 29.125 Q
# Select columns
selectUsefulCols <- select(dfTrain, Age, Fare, Pclass, Survived, everything())
head(selectUsefulCols)
## Age Fare Pclass Survived PassengerId
## 1 22 7.2500 3 0 1
## 2 38 71.2833 1 1 2
## 3 26 7.9250 3 1 3
## 4 35 53.1000 1 1 4
## 5 35 8.0500 3 0 5
## 6 NA 8.4583 3 0 6
## Name Sex SibSp Parch
## 1 Braund, Mr. Owen Harris male 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 1 0
## 3 Heikkinen, Miss. Laina female 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0
## 5 Allen, Mr. William Henry male 0 0
## 6 Moran, Mr. James male 0 0
## Ticket Cabin Embarked
## 1 A/5 21171 S
## 2 PC 17599 C85 C
## 3 STON/O2. 3101282 S
## 4 113803 C123 S
## 5 373450 S
## 6 330877 Q
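# Mutate
# Note: length(Name) is the length of the whole column (891), not of each name;
# nchar(as.character(Name)) would give the number of characters per name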
funnyNewFields <- mutate(dfTrain, lettersInName = length(Name), farePerYear = Fare/Age)
head(funnyNewFields)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked lettersInName farePerYear
## 1 0 A/5 21171 7.2500 S 891 0.3295455
## 2 0 PC 17599 71.2833 C85 C 891 1.8758763
## 3 0 STON/O2. 3101282 7.9250 S 891 0.3048077
## 4 0 113803 53.1000 C123 S 891 1.5171429
## 5 0 373450 8.0500 S 891 0.2300000
## 6 0 330877 8.4583 Q 891 NA
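The lettersInName column above is 891 in every row because length() returns one number for the whole column, which mutate() then repeats on each row. A small sketch of the difference, using made-up names (the data frame and column names here are for illustration only):

```r
library(dplyr)

# Made-up names, for illustration only
people <- data.frame(
  Name = c("Ann Lee", "Bob", "Catherine O'Hara"),
  stringsAsFactors = FALSE
)

people <- mutate(people,
  rowsInColumn = length(Name),   # one value for the whole column, recycled to every row
  charsInName  = nchar(Name)     # one value per row (spaces and punctuation count too)
)
people
```

Functions that return one value per element, such as nchar(), are usually what a new per-row column needs.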
selectAndMutate <- select(
  mutate(dfTrain,
    ageBuckets = as.numeric(Age > 50 | is.na(Age)),
    fareBuckets = ifelse(Fare < 20, 2, ifelse(Fare < 40, 1, 0))
  ),
  Name, Age, Fare, Pclass, ageBuckets, fareBuckets, Survived
)
# Summarise
summarise(females, count = n(), age = mean(Age, na.rm=TRUE), fare = mean(Fare,na.rm=TRUE)) %>% head()
## count age fare
## 1 314 27.91571 44.47982
# base-summary
summary(females)
## PassengerId Survived Pclass
## Min. : 2.0 Min. :0.000 Min. :1.000
## 1st Qu.:231.8 1st Qu.:0.000 1st Qu.:1.000
## Median :414.5 Median :1.000 Median :2.000
## Mean :431.0 Mean :0.742 Mean :2.159
## 3rd Qu.:641.2 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :889.0 Max. :1.000 Max. :3.000
##
## Name Sex
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 female:314
## Abelson, Mrs. Samuel (Hannah Wizosky) : 1 male : 0
## Ahlin, Mrs. Johan (Johanna Persdotter Larsson): 1
## Aks, Mrs. Sam (Leah Rosen) : 1
## Allen, Miss. Elisabeth Walton : 1
## Allison, Miss. Helen Loraine : 1
## (Other) :308
## Age SibSp Parch Ticket
## Min. : 0.75 Min. :0.0000 Min. :0.0000 347082 : 5
## 1st Qu.:18.00 1st Qu.:0.0000 1st Qu.:0.0000 2666 : 4
## Median :27.00 Median :0.0000 Median :0.0000 110152 : 3
## Mean :27.92 Mean :0.6943 Mean :0.6497 113781 : 3
## 3rd Qu.:37.00 3rd Qu.:1.0000 3rd Qu.:1.0000 13502 : 3
## Max. :63.00 Max. :8.0000 Max. :6.0000 24160 : 3
## NA's :53 (Other):293
## Fare Cabin Embarked
## Min. : 6.75 :217 : 2
## 1st Qu.: 12.07 G6 : 4 C: 73
## Median : 23.00 E101 : 3 Q: 36
## Mean : 44.48 F33 : 3 S:203
## 3rd Qu.: 55.00 B18 : 2
## Max. :512.33 B28 : 2
## (Other): 83
summary(survivingFemales)
## PassengerId Survived Pclass
## Min. : 2.0 Min. :1 Min. :1.000
## 1st Qu.:238.0 1st Qu.:1 1st Qu.:1.000
## Median :400.0 Median :1 Median :2.000
## Mean :429.7 Mean :1 Mean :1.918
## 3rd Qu.:636.0 3rd Qu.:1 3rd Qu.:3.000
## Max. :888.0 Max. :1 Max. :3.000
##
## Name Sex
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 female:233
## Abelson, Mrs. Samuel (Hannah Wizosky) : 1 male : 0
## Aks, Mrs. Sam (Leah Rosen) : 1
## Allen, Miss. Elisabeth Walton : 1
## Andersen-Jensen, Miss. Carla Christine Nielsine: 1
## Andersson, Miss. Erna Alexandra : 1
## (Other) :227
## Age SibSp Parch Ticket
## Min. : 0.75 Min. :0.000 Min. :0.000 2666 : 4
## 1st Qu.:19.00 1st Qu.:0.000 1st Qu.:0.000 110152 : 3
## Median :28.00 Median :0.000 Median :0.000 13502 : 3
## Mean :28.85 Mean :0.515 Mean :0.515 24160 : 3
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:1.000 PC 17757: 3
## Max. :63.00 Max. :4.000 Max. :5.000 110413 : 2
## NA's :36 (Other) :215
## Fare Cabin Embarked
## Min. : 7.225 :142 : 2
## 1st Qu.: 13.000 E101 : 3 C: 64
## Median : 26.000 F33 : 3 Q: 27
## Mean : 51.939 B18 : 2 S:140
## 3rd Qu.: 76.292 B28 : 2
## Max. :512.329 B35 : 2
## (Other): 79
# Using pipe commands to build interesting summaries
selectAndMutate %>% group_by(fareBuckets) %>% summarise(count=n(), ages=mean(Age,na.rm=TRUE))
## # A tibble: 3 x 3
## fareBuckets count ages
## <dbl> <int> <dbl>
## 1 0 176 33.9
## 2 1 200 29.8
## 3 2 515 28.0
selectAndMutate %>% group_by(Pclass) %>% summarise(count=n(), avgFare=mean(Fare,na.rm=TRUE))
## # A tibble: 3 x 3
## Pclass count avgFare
## <int> <int> <dbl>
## 1 1 216 84.2
## 2 2 184 20.7
## 3 3 491 13.7
Using the training set, try to come up with an idea of which types of passengers survived based on the features given. To do this, use various summarizing functions to get an idea of how variables are distributed for passengers of different types.
This task is open-ended, but in the end you will use whatever indicators you have built with the training set to classify people in the testing set. This means you must write a function that takes in certain variables and outputs the result (1 if survived, 0 otherwise).
Show all of your work! That includes which summary functions and filters you used and how you analyzed them to reach your results, as well as the final function and some sample inputs and outputs.
# This is a very simplistic example, meant as a template for you to build on
isLetterM <- function(x) {
if ("m" %in% unlist(strsplit(tolower(x), ""))) {
return(1)
} else {
return(0)
}
}
letterM <- unlist(lapply(dfTrain$Name, isLetterM))
# feature letterM is built
survivalTest <- function(Age, Name, Fare) {
  # A missing age cannot be compared, so it falls through to the final else
  if (!is.na(Age) && Age > 20) {
    if (isLetterM(Name)) {
      if (Fare > 15) {
        return(1)
      } else {
        return(0)
      }
    } else if (Fare < 8) {
      return(1)
    } else {
      return(0)
    }
  } else {
    # Age of 20 or under, or age missing
    return(0)
  }
}
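One way to produce the sample inputs and outputs asked for above is to apply a classifier function to every row and compare the result with the known outcome. The sketch below is only an illustration: predictSurvival, its rule, and the toy passengers are all made up here, and mapply() is just one of several ways to call a function row by row.

```r
# Toy stand-in for a rule-based classifier (hypothetical rule, not a recommendation)
predictSurvival <- function(Sex, Pclass) {
  if (Sex == "female" && Pclass < 3) 1 else 0
}

# Made-up passengers with known outcomes, for illustration only
toy <- data.frame(
  Sex      = c("female", "male", "female", "male"),
  Pclass   = c(1, 3, 3, 1),
  Survived = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# mapply() calls the function once per row, pairing up the two columns
toy$Predicted <- mapply(predictSurvival, toy$Sex, toy$Pclass, USE.NAMES = FALSE)

# Share of rows where the prediction matches the known outcome
accuracy <- mean(toy$Predicted == toy$Survived)
toy
accuracy
```

The same pattern works on dfTrain once the function's arguments match the columns you chose as indicators.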
One very common way to display data to a user is using a graph or plot. Graphs and plots are good because they allow someone to compare things or view a trend very quickly. They can be bad because it is very easy to mislead someone using a plot.
We use a library called ggplot2 to do plotting.
library("dplyr")
library("ggplot2")
The general form of a plot in ggplot is as follows:
ggplot(data=<DATA>)+<GEOM_FUNCTION>(mapping=aes(<MAPPING>))+<ADDITIONAL SPECIFICATIONS>
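To make the template concrete, here is a minimal sketch that fills in each slot using the built-in mtcars data. The choice of columns, the title, and the axis labels are assumptions made for illustration.

```r
library(ggplot2)

# <DATA>                      -> mtcars
# <GEOM_FUNCTION>             -> geom_point
# <MAPPING>                   -> x = wt, y = mpg
# <ADDITIONAL SPECIFICATIONS> -> labs(...)
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +
  labs(title = "Fuel economy vs. weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

p   # printing the object draws the plot
```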
There are many types of plots, all of which can be useful in different situations. The ones we will cover today are scatter plots, line plots, histograms, bar plots, density plots, and box plots.
The library we require for plotting is called “ggplot2”. However, we will also show some plotting functions built into R’s core. The ggplot2 package is widely treated as the standard for plotting in R. We will demonstrate these plotting styles with datasets that already exist in R.
A scatter plot can either compare two variables or just show the trend in one variable. With ggplot we can also encode additional dimensions through aesthetics such as size, shape, and transparency. This is probably the most commonly used plot for data applications. The aesthetics we can map include x, y, colour, shape, size, and alpha (transparency).
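As a small sketch of several of these aesthetics used together, the plot below maps colour and shape to categorical variables and sets size and alpha to constants. The choice of cyl and gear as the grouping variables is an assumption made for illustration.

```r
library(ggplot2)

# colour and shape are mapped to variables inside aes();
# size and alpha are fixed values, so they go outside aes()
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl), shape = factor(gear)),
             size = 3, alpha = 0.6)

p
```

Anything placed inside aes() varies with the data and gets a legend; anything outside applies the same value to every point.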
data("mtcars")
# Basic scatter plot
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()
# Change the point size, and shape
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point(size=2, shape=23)
# Map the point size to a variable (qsec)
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point(aes(size=qsec))
# Add text
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point() +
geom_text(label=rownames(mtcars))
# Add the regression line
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point()+
geom_smooth(method=lm)
# Remove the confidence interval
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point()+
geom_smooth(method=lm, se=FALSE)
# Loess method
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point()+
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Add marginal rugs
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point() + geom_rug()
# Change colors
ggplot(mtcars, aes(x=wt, y=mpg, color=cyl)) +
geom_point() + geom_rug()
# Add marginal rugs using faithful data
ggplot(faithful, aes(x=eruptions, y=waiting)) +
geom_point() + geom_rug()
# Scatter plot with the 2d density estimation
sp <- ggplot(faithful, aes(x=eruptions, y=waiting)) +
geom_point()
sp + geom_density_2d()
# Gradient color
sp + stat_density_2d(aes(fill = ..level..), geom="polygon")
# Change the gradient color
sp + stat_density_2d(aes(fill = ..level..), geom="polygon")+
scale_fill_gradient(low="blue", high="red")
# One ellipse around all points
ggplot(faithful, aes(waiting, eruptions))+
geom_point()+
stat_ellipse()
# Ellipse by groups
p <- ggplot(faithful, aes(waiting, eruptions, color = eruptions > 3))+
geom_point()
p + stat_ellipse()
# Change the type of ellipses: possible values are "t", "norm", "euclid"
p + stat_ellipse(type = "norm")
You can find many more examples here: http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization
Line plots are similar to scatter plots except that the points are connected (usually in the order they appear along the x axis). They are often used to show one or more relationships directly. We can use the “linetype” or “group” argument to draw multiple lines, and we can add points and lines to the same graph, each with their own mapping.
df <- data.frame(dose=c("D0.5", "D1", "D2"),
len=c(4.2, 10, 29.5))
head(df)
## dose len
## 1 D0.5 4.2
## 2 D1 10.0
## 3 D2 29.5
# Basic line plot with points
ggplot(data=df, aes(x=dose, y=len, group=1)) +
geom_line()+
geom_point()
# Change the line type
ggplot(data=df, aes(x=dose, y=len, group=1)) +
geom_line(linetype = "dashed")+
geom_point()
# Change the color
ggplot(data=df, aes(x=dose, y=len, group=1)) +
geom_line(color="red")+
geom_point()
library(grid)
# Add an arrow
ggplot(data=df, aes(x=dose, y=len, group=1)) +
geom_line(arrow = arrow())+
geom_point()
# Add a closed arrow to the end of the line
myarrow=arrow(angle = 15, ends = "both", type = "closed")
ggplot(data=df, aes(x=dose, y=len, group=1)) +
geom_line(arrow=myarrow)+
geom_point()
# Grid steps
ggplot(data=df, aes(x=dose, y=len, group=1)) +
geom_step()+
geom_point()
# Connect the points in the order they appear in the data
ggplot(data=df, aes(x=dose, y=len, group=1)) +
geom_path()+
geom_point()
# multiple groups
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
dose=rep(c("D0.5", "D1", "D2"),2),
len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)
## supp dose len
## 1 VC D0.5 6.8
## 2 VC D1 15.0
## 3 VC D2 33.0
## 4 OJ D0.5 4.2
## 5 OJ D1 10.0
## 6 OJ D2 29.5
# Line plot with multiple groups
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
geom_line()+
geom_point()
# Change line types
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
geom_line(linetype="dashed", color="blue", size=1.2)+
geom_point(color="red", size=3)
# Change line types by groups (supp)
ggplot(df2, aes(x=dose, y=len, group=supp)) +
geom_line(aes(linetype=supp))+
geom_point()
# Change line types and point shapes
ggplot(df2, aes(x=dose, y=len, group=supp)) +
geom_line(aes(linetype=supp))+
geom_point(aes(shape=supp))
# Set line types manually
ggplot(df2, aes(x=dose, y=len, group=supp)) +
geom_line(aes(linetype=supp))+
geom_point()+
scale_linetype_manual(values=c("twodash", "dotted"))
# Colour by group
p<-ggplot(df2, aes(x=dose, y=len, group=supp)) +
geom_line(aes(color=supp))+
geom_point(aes(color=supp))
p
# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))
# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")
# Use grey scale
p + scale_color_grey() + theme_classic()
# Line plot with a third variable shown through line width
data("economics")
# Map line size to the unemployment rate
ggplot(data=economics, aes(x=date, y=pop, size=unemploy/pop))+
geom_line()
For more ideas look here: http://www.sthda.com/english/wiki/ggplot2-line-plot-quick-start-guide-r-software-and-data-visualization
Histograms show the distribution of one variable, typically by counts or densities. These are extremely important when explaining how values are spread in a dataset.
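As a rough sketch of what the binning step does, base R's hist() can return the bins without drawing them. The simulated values and the bin width of 5 are assumptions chosen for illustration.

```r
set.seed(1234)
values <- round(rnorm(200, mean = 55, sd = 5))   # simulated weights

# Bin edges every 5 units, wide enough to cover every value
edges <- seq(floor(min(values) / 5) * 5, ceiling(max(values) / 5) * 5, by = 5)

# plot = FALSE returns the bins instead of drawing them
h <- hist(values, breaks = edges, plot = FALSE)

h$counts    # number of values in each bin
h$density   # counts rescaled so that the bar areas add up to 1
```

Counts and densities describe the same bins; only the scale of the vertical axis differs.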
set.seed(1234)
df <- data.frame(
sex=factor(rep(c("F", "M"), each=200)),
weight=round(c(rnorm(200, mean=55, sd=5), rnorm(200, mean=65, sd=5)))
)
head(df)
## sex weight
## 1 F 49
## 2 F 56
## 3 F 60
## 4 F 43
## 5 F 57
## 6 F 58
# Basic histogram
ggplot(df, aes(x=weight)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Change the width of bins
ggplot(df, aes(x=weight)) +
geom_histogram(binwidth=1)
# Change colors
p<-ggplot(df, aes(x=weight)) +
geom_histogram(color="black", fill="white")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Add mean line
p+ geom_vline(aes(xintercept=mean(weight)),
color="blue", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Histogram with density plot
ggplot(df, aes(x=weight)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Change histogram plot line colors by groups
ggplot(df, aes(x=weight, color=sex)) +
geom_histogram(fill="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Overlaid histograms
ggplot(df, aes(x=weight, color=sex)) +
geom_histogram(fill="white", alpha=0.5, position="identity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mu <- data.frame(sex=c("M", "F"), grp.mean=c(mean(filter(df, sex=="M")$weight), mean(filter(df, sex=="F")$weight)))
head(mu)
## sex grp.mean
## 1 M 65.36
## 2 F 54.70
# Interleaved histograms
ggplot(df, aes(x=weight, color=sex)) +
geom_histogram(fill="white", position="dodge")+
theme(legend.position="top")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Add mean lines
p<-ggplot(df, aes(x=weight, color=sex)) +
geom_histogram(fill="white", position="dodge")+
geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
linetype="dashed")+
theme(legend.position="top")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Use grey scale
p + scale_color_grey() + theme_classic() +
theme(legend.position="top")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Use semi-transparent fill
p<-ggplot(df, aes(x=weight, fill=sex, color=sex)) +
geom_histogram(position="identity", alpha=0.5)
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Multi-panel
p<-ggplot(df, aes(x=weight))+
geom_histogram(color="black", fill="white")+
facet_grid(sex ~ .)
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Add mean lines
p+geom_vline(data=mu, aes(xintercept=grp.mean), color="red",
linetype="dashed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Histogram with a mean line, title, and axis labels
ggplot(df, aes(x=weight, fill=sex)) +
geom_histogram(fill="white", color="black")+
geom_vline(aes(xintercept=mean(weight)), color="blue",
linetype="dashed")+
labs(title="Weight histogram plot",x="Weight(kg)", y = "Count")+
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Change line colors by groups
ggplot(df, aes(x=weight, color=sex, fill=sex)) +
geom_histogram(position="identity", alpha=0.5)+
geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
linetype="dashed")+
scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))+
scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))+
labs(title="Weight histogram plot",x="Weight(kg)", y = "Count")+
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Line Colours
p<-ggplot(df, aes(x=weight, color=sex)) +
geom_histogram(fill="white", position="dodge")+
geom_vline(data=mu, aes(xintercept=grp.mean, color=sex),
linetype="dashed")
# Brewer palette "Paired"
p + scale_color_brewer(palette="Paired") +
theme_classic()+theme(legend.position="top")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Brewer palette "Dark2"
p + scale_color_brewer(palette="Dark2") +
theme_classic()+theme(legend.position="top")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Brewer palette "Accent"
p + scale_color_brewer(palette="Accent") +
theme_minimal()+theme(legend.position="top")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Bar plots are similar to histograms except that you have more control over the X and Y axes: each bar shows a fixed count or value for a category rather than a binned distribution of a continuous variable.
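The two common cases can be sketched side by side: bars whose heights come straight from a column, and bars that count rows per category. The sales table is made up for illustration; geom_col() is the ggplot2 shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Made-up values per category, for illustration only
sales <- data.frame(item = c("A", "B", "C"), units = c(12, 30, 7))

# Bar heights taken directly from the units column
p_values <- ggplot(sales, aes(x = item, y = units)) +
  geom_col()

# No y mapping: geom_bar() counts the rows in each category
p_counts <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

p_values
p_counts
```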
df <- data.frame(dose=c("D0.5", "D1", "D2"),
len=c(4.2, 10, 29.5))
head(df)
## dose len
## 1 D0.5 4.2
## 2 D1 10.0
## 3 D2 29.5
# Basic barplot
p<-ggplot(data=df, aes(x=dose, y=len)) +
geom_bar(stat="identity")
p
# Horizontal bar plot
p + coord_flip()
# Change the width of bars
ggplot(data=df, aes(x=dose, y=len)) +
geom_bar(stat="identity", width=0.5)
# Change colors
ggplot(data=df, aes(x=dose, y=len)) +
geom_bar(stat="identity", color="blue", fill="white")
# Minimal theme + blue fill color
p<-ggplot(data=df, aes(x=dose, y=len)) +
geom_bar(stat="identity", fill="steelblue")+
theme_minimal()
p
# Choose X axis categories
p + scale_x_discrete(limits=c("D0.5", "D2"))
## Warning: Removed 1 rows containing missing values (position_stack).
# Outside bars
ggplot(data=df, aes(x=dose, y=len)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=len), vjust=-0.3, size=3.5)+
theme_minimal()
# Inside bars
ggplot(data=df, aes(x=dose, y=len)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=len), vjust=1.6, color="white", size=3.5)+
theme_minimal()
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Don't map a variable to y
ggplot(mtcars, aes(x=factor(cyl)))+
geom_bar(stat="count", width=0.7, fill="steelblue")+
theme_minimal()
# Change barplot line colors by groups
p<-ggplot(df, aes(x=dose, y=len, color=dose)) +
geom_bar(stat="identity", fill="white")
p
# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))
# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")
# Use grey scale
p + scale_color_grey() + theme_classic()
# Change barplot fill colors by groups
p<-ggplot(df, aes(x=dose, y=len, fill=dose)) +
geom_bar(stat="identity")+theme_minimal()
p
# Use custom color palettes
p+scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))
# use brewer color palettes
p+scale_fill_brewer(palette="Dark2")
# Use grey scale
p + scale_fill_grey()
ggplot(df, aes(x=dose, y=len, fill=dose))+
geom_bar(stat="identity", color="black")+
scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))+
theme_minimal()
# Change bar fill colors to blues
p <- p+scale_fill_brewer(palette="Blues")
p + theme(legend.position="top")
p + theme(legend.position="bottom")
# Remove legend
p + theme(legend.position="none")
# Change the order of the categories on the x axis
p + scale_x_discrete(limits=c("D2", "D0.5", "D1"))
# Multiple Groups
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
dose=rep(c("D0.5", "D1", "D2"),2),
len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)
## supp dose len
## 1 VC D0.5 6.8
## 2 VC D1 15.0
## 3 VC D2 33.0
## 4 OJ D0.5 4.2
## 5 OJ D1 10.0
## 6 OJ D2 29.5
# Stacked barplot with multiple groups
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
geom_bar(stat="identity")
# Use position=position_dodge()
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
geom_bar(stat="identity", position=position_dodge())
# Change the colors manually
p <- ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
geom_bar(stat="identity", color="black", position=position_dodge())+
theme_minimal()
# Use custom colors
p + scale_fill_manual(values=c('#999999','#E69F00'))
# Use brewer color palettes
p + scale_fill_brewer(palette="Blues")
# Add labels inside the plot
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
geom_bar(stat="identity", position=position_dodge())+
geom_text(aes(label=len), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5)+
scale_fill_brewer(palette="Paired")+
theme_minimal()
# Sort by dose and supp
df_sorted <- arrange(df2, dose, supp)
head(df_sorted)
## supp dose len
## 1 OJ D0.5 4.2
## 2 VC D0.5 6.8
## 3 OJ D1 10.0
## 4 VC D1 15.0
## 5 OJ D2 29.5
## 6 VC D2 33.0
# Continuous X axis
# Create some data
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
dose=rep(c("0.5", "1", "2"),2),
len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)
## supp dose len
## 1 VC 0.5 6.8
## 2 VC 1 15.0
## 3 VC 2 33.0
## 4 OJ 0.5 4.2
## 5 OJ 1 10.0
## 6 OJ 2 29.5
# x axis treated as continuous variable
df2$dose <- as.numeric(as.vector(df2$dose))
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_brewer(palette="Paired")+
theme_minimal()
# Axis treated as discrete variable
df2$dose<-as.factor(df2$dose)
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_brewer(palette="Paired")+
theme_minimal()
Density plots give you a smooth estimate of the probability distribution of a dataset, whether or not the underlying data is smooth. They are used a lot in probability applications.
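To see what such a curve represents, the sketch below uses base R's density() function on simulated values and checks that the area under the estimated curve is close to 1, as expected for a probability density. The simulated data and the simple sum used for the area are assumptions made for illustration.

```r
set.seed(1234)
weights <- rnorm(200, mean = 55, sd = 5)   # simulated values

# density() computes a kernel density estimate on an evenly spaced grid of x values
d <- density(weights)

# Approximate the area under the curve: sum of heights times grid spacing
step <- diff(d$x)[1]
area <- sum(d$y) * step
area
```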
set.seed(1234)
df <- data.frame(
sex=factor(rep(c("F", "M"), each=200)),
weight=round(c(rnorm(200, mean=55, sd=5),
rnorm(200, mean=65, sd=5)))
)
head(df)
## sex weight
## 1 F 49
## 2 F 56
## 3 F 60
## 4 F 43
## 5 F 57
## 6 F 58
# Basic density
p <- ggplot(df, aes(x=weight)) +
geom_density()
p
# Add mean line
p+ geom_vline(aes(xintercept=mean(weight)),
color="blue", linetype="dashed", size=1)
Box plots are a way to summarise an individual variable using our basic summaries (1st quartile, median, 3rd quartile, outliers), often split by group so the groups can be compared side by side.
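The numbers behind a box plot can be computed directly. The sketch below uses base R on a made-up vector; note that boxplot.stats() is what base R's boxplot() uses, and ggplot2 may compute the quartiles slightly differently on some data.

```r
# Made-up values with one unusually large point
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

quantile(x, c(0.25, 0.5, 0.75))   # 1st quartile, median, 3rd quartile

# boxplot.stats() returns the numbers a base R box plot is drawn from
stats <- boxplot.stats(x)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points more than 1.5 box lengths beyond the box, drawn as outliers
```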
data("ToothGrowth")
# Convert the variable dose from a numeric to a factor variable
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
library(ggplot2)
# Basic box plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) +
geom_boxplot()
p
# Rotate the box plot
p + coord_flip()
# Notched box plot
ggplot(ToothGrowth, aes(x=dose, y=len)) +
geom_boxplot(notch=TRUE)
# Change outlier, color, shape and size
ggplot(ToothGrowth, aes(x=dose, y=len)) +
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4)
# map_data() requires the "maps" package to be installed
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
https://www.kaggle.com/datasets
https://www.kaggle.com/pranavbadami/nj-transit-amtrak-nec-performance
https://www.kaggle.com/mohansacharya/graduate-admissions
https://www.kaggle.com/pavansanagapati/urban-sound-classification
https://www.kaggle.com/xvivancos/barcelona-data-sets
https://www.kaggle.com/ronitf/heart-disease-uci
https://www.kaggle.com/karangadiya/fifa19
https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python
https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs

https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
https://archive.ics.uci.edu/ml/datasets/Abalone
https://archive.ics.uci.edu/ml/datasets/Adult
https://archive.ics.uci.edu/ml/datasets/Artificial+Characters
https://archive.ics.uci.edu/ml/datasets/Bach+Choral+Harmony
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-model-deployment.html
https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
https://aws.amazon.com/blogs/aws/amazon-sagemaker-ground-truth-build-highly-accurate-datasets-and-reduce-labeling-costs-by-up-to-70/

https://www.dataquest.io/blog/topics/data-science-projects/
https://fivethirtyeight.com/
https://github.com/fivethirtyeight/data
https://data.fivethirtyeight.com/
https://github.com/BuzzFeedNews/everything
https://www.propublica.org/datastore/apis
https://www.propublica.org/datastore/datasets
https://opendata.socrata.com/
https://en.wikipedia.org/wiki/Wikipedia:Database_download
https://www.quandl.com/search

https://www.meteoblue.com/en/weather/archive/export/india_el-salvador_3585481
https://data.gov.in/
http://www.surveyofindia.gov.in/details/view/7
http://mospi.nic.in/data
https://www.india.gov.in/
https://open.canada.ca/en/open-data