<!DOCTYPE html>
Please complete the following questions and submit the finished Rmd and HTML files on Canvas. For this file to knit, a file called “lab.css” must be in the same folder as the .Rmd file.
Don’t forget to change the name field on line 3 to your first and last name.
The most commonly used functions for wrangling data nowadays are in the
dplyr package. The dplyr package is part of a
suite of packages known as the tidyverse. Tidyverse also
automatically loads other useful packages like ggplot2,
which is the package for visualization in R. You
can simplify things by just installing the tidyverse by calling
install.packages("tidyverse") then loading the tidyverse
using the library command. All of the canonical statistical
tests we will use for the hypothesis testing portion of the lab are in
base R. The first part of the following code chunk checks
to see if a package is installed, and if it is not installed, it
installs it. The second part runs the function then loads the library.
iini <- function(x) {
  # stands for "install if not installed"
  if (!x %in% rownames(installed.packages())) install.packages(x)
}
iini("tidyverse")
iini("dragracer")
iini("nycflights13")
library(tidyverse)
library(dragracer)
library(nycflights13)
The instructions may be a little lengthy, but it is beneficial to work through the R markdown file (.Rmd). This will allow you to tinker with the code as you’re going through.
R is an object oriented programming language. This means that when you use R, your computer essentially becomes a filing system for things that you create. Like any good filing system, things have addresses or call numbers that you can quickly use to retrieve things. However, the bookshelf used to store the objects you create is limited in size by the amount of memory installed in your computer. You’re unlikely to bump into the memory limits of your computer in this class, but it’s important to realize that R objects actually occupy space in your computer.
The code a <- 10 creates an object called a
that contains the number 10. The assignment operator
<- (you can also use = which is frowned
upon in R for reasons that are beyond the scope of this lab) is used to
assign the number 10 to the object a. You will
use the assignment operator a lot! It is used to assign an
object to a name, thus storing it in memory (the filing system) with the
name as an address. The name a is the address that will be
used to retrieve the object in the future. The name is completely
arbitrary; I could have called the object bob or
zx10sf. You can think of the object’s name as an address:
typing the name into the R console retrieves the object stored at that
address and displays it in the console. As a general rule, it’s helpful to
give the objects you create meaningful names. Names cannot begin with
numbers, names are case sensitive (A and a
would be different addresses), and names shouldn’t be the same as the
names of functions in R (like mean or sum).
a <- 10
a
## [1] 10
In the code above a stores a number, thus it is an object
of the type numeric. We can see the type of any R object by
typing class(objectName) into the console:
class(a)
## [1] "numeric"
# Overwrite a and store a character text string
a <- "bacon"
class(a)
## [1] "character"
a
## [1] "bacon"
Objects of any type can be combined into a vector using the combine
function, c(). Vectors are useful for storing information
that logically belong together. Vectors can be combined into longer
vectors:
statesAdamHasLived <- c("MN", "MS", "AZ", "CA", "CO", "VA", "UT", "WI") # A vector with eight elements
length(statesAdamHasLived) # Returns the length of the vector
## [1] 8
statesAdamWillVisitThisSemester <- c("AZ", "CO")
# Combining vectors
statesAdamKnows <- c(statesAdamWillVisitThisSemester, statesAdamHasLived)
statesAdamKnows
## [1] "AZ" "CO" "MN" "MS" "AZ" "CA" "CO" "VA" "UT" "WI"
length(statesAdamKnows)
## [1] 10
Notice how CO is in the vector twice because R just appended the two vectors. If we had combined the two vectors with statesAdamHasLived first and statesAdamWillVisitThisSemester second, how would it look?
Indexing is a critical aspect of vectors. It’s possible to pull individual elements of a vector or a range of elements from a vector.
statesAdamKnows[1] # Access the first element of the vector (index position 1)
## [1] "AZ"
statesAdamKnows[3] # Access index position 3
## [1] "MN"
statesAdamKnows[2:4] # The colon denotes the range of integers from 2 to 4
## [1] "CO" "MN" "MS"
# A vector of numbers to reorder the vector or access the elements in a specific index order
statesAdamKnows[c(4,3,1,2)]
## [1] "MS" "MN" "AZ" "CO"
# Permanently reorder the vector and keep only those elements, because you are now re-assigning statesAdamKnows with <-
statesAdamKnows <- statesAdamKnows[c(4,3,1,2)]
statesAdamKnows
## [1] "MS" "MN" "AZ" "CO"
# Alphabetize the vector with sort
alpha.statesAdamKnows <- sort(statesAdamKnows)
alpha.statesAdamKnows
## [1] "AZ" "CO" "MN" "MS"
Elements of a vector can be named, which makes it possible to access the vector using an indexing scheme that may be more meaningful than numbers.
# The schools Carson Farmer went to as a kid - note these are from Carson Farmer who taught the course in 2018 because I wanted to keep "Smiles & Chuckles"
schools <- c(kindergarten="Smiles & Chuckles", elementary="Deep Cove", middle="Bay Side", high="Stelly's")
schools
## kindergarten elementary middle high
## "Smiles & Chuckles" "Deep Cove" "Bay Side" "Stelly's"
# index by the name of each item in the vector
schools["high"]
## high
## "Stelly's"
schools[c("high", "middle")]
## high middle
## "Stelly's" "Bay Side"
In R, tabular data is stored in a ‘data.frame’ object. Data.frames are
like a vector with dimensions \(n \times p\) (\(n\) rows by \(p\) columns).
This is useful because both rows and columns can be used as named (or
numbered) indices. Thus it is possible to ask for small sections of the
table, like rows 1 to 5 and columns 2 to 4. However, because data frames
have rows and columns, our addressing scheme needs to include
row addresses and column addresses. Note the use of the plural,
addresses. When working with tables, one often wants to retrieve
multiple rows and columns, and this is done using vectors (like we
played with in the previous section) created explicitly using
c() or implicitly using other types of operators/functions.
| Entity | Name | Type | Weight | Health | Age Group |
|---|---|---|---|---|---|
| 1 | Fluffy | Cat | 27lbs | Obese | Adult |
| 2 | Felix | Cat | 12lbs | Healthy | Adult |
| 3 | Patches | Cat | 6lbs | NA | Kitten |
| n | … | … | … | … | … |
In CSV format, the above tabular data would look something like the
following, perhaps stored in a file called cats.csv:
"Entity","Name","Type","Weight","Health","Age Group"
1,"Fluffy","Cat",27,"Obese","Adult"
2,"Felix","Cat",12,"Healthy","Adult"
3,"Patches","Cat",6,,"Kitten"
...
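As a rough sketch of how such a file ends up in R, the three complete rows above can be read with read.csv(). The text= argument stands in for a real cats.csv file so the example is self-contained, and na.strings = c("", "NA") is an assumption about how the blank Health field should be treated.

```r
# Inline copy of the three complete rows shown above; with a real file
# you would pass the file name, e.g. read.csv("cats.csv"), instead of text=
csv_text <- '"Entity","Name","Type","Weight","Health","Age Group"
1,"Fluffy","Cat",27,"Obese","Adult"
2,"Felix","Cat",12,"Healthy","Adult"
3,"Patches","Cat",6,,"Kitten"'

# Treat empty fields as missing (NA); this choice is an assumption
cats <- read.csv(text = csv_text, na.strings = c("", "NA"))

names(cats)   # read.csv turns the header "Age Group" into Age.Group
cats[3, ]     # Patches has NA for Health
```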
R is distributed with a number of example data sets. To see a list of
them, type data() in the console. We’ll load a data set
from the package dragracer, a database of all of the
important information from RuPaul’s Drag Race. For an explanation of the
data, type ?dragracer::rpdr_contestants into the console.
data("rpdr_contestants")
# To see the size of the table use dim()
dim(rpdr_contestants) # 184 rows and 5 columns
## [1] 184 5
# To see the names of the columns
names(rpdr_contestants)
## [1] "season" "contestant" "age" "dob" "hometown"
# Show the first 10 rows of the rpdr_contestants data.frame
head(rpdr_contestants, n=10)
## # A tibble: 10 × 5
## season contestant age dob hometown
## <chr> <chr> <dbl> <date> <chr>
## 1 S01 BeBe Zahara Benet 28 1981-03-20 Minneapolis, Minnesota
## 2 S01 Nina Flowers 34 1974-02-22 Bayamón, Puerto Rico
## 3 S01 Rebecca Glasscock 26 1983-05-25 Fort Lauderdale, Florida
## 4 S01 Shannel 26 1979-07-03 Las Vegas, Nevada
## 5 S01 Ongina 26 1982-01-06 Los Angeles, California
## 6 S01 Jade 32 1984-11-18 Chicago, Illinois
## 7 S01 Akashia 32 1985-02-19 Cleveland, Ohio
## 8 S01 Tammie Brown 36 1980-09-15 Los Angeles, California
## 9 S01 Victoria (Porkchop) Parker 39 1969-01-16 Raleigh, North Carolina
## 10 S02 Tyra Sanchez 21 1988-04-22 Orlando, Florida
Indexing of ‘data.frames’ is similar to the indexing of vectors, except
that one has to provide row and column addresses. Data.frames are
indexed using the convention dfname[row vector, column
vector].
# Get the contents of the first row and the first column
rpdr_contestants[1, 1]
# Get the first 10 rows?
rpdr_contestants[1:10] # NOTE THE ERROR! I have not provided valid column addresses.
# This works.
rpdr_contestants[1:10,] #The blank after the comma indicates all columns.
# Get the season and contestant name for the first 10 contestants
rpdr_contestants[1:10, c("season", "contestant")]
# Note 1:10 is a vector and c("season", "contestant") is a vector
# This is equivalent to the previous line
rpdr_contestants[1:10, 1:2]
‘data.frames’ can also be indexed using column names with the syntax
df$colname.
rpdr_contestants$contestant
# note that this gives you ALL the data, which might be too much!
rpdr_contestants$contestant[1:10] # Returns the first 10 entries in the contestant column
# The line above selects the same values as
rpdr_contestants[1:10, "contestant"]
Logical Expressions in R
Logical expressions are expressions that evaluate to TRUE or FALSE. For example, “is this bigger than that” or “are these things equal”. Logical expressions are created with logical operators. R uses the following logical operators:
- < Less than
- > Greater than
- == Equal to (note that this is different from ‘=’, which is an assignment operator)
- >= Greater than or equal to
- <= Less than or equal to
- && And
- || Or
- %in% Matching operator
# Examples of logical expressions
10 == 5
## [1] FALSE
10 > 5
## [1] TRUE
a = 10 # Create an object that stores the number 10
a == 10
## [1] TRUE
a > 10
## [1] FALSE
a >= a + 1
## [1] FALSE
"CO" %in% statesAdamKnows
## [1] TRUE
c("MT", "CA") %in% statesAdamKnows
## [1] FALSE FALSE
# Which contestants won which episodes?
# use head to display the first 10 answers
head(rpdr_contep$outcome == "WIN", n = 10)
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
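One detail worth a quick sketch: && and || compare single TRUE/FALSE values, while their single-character forms & and | work element by element on whole vectors, which is what you need when indexing. The vector x below is made up for illustration.

```r
x <- c(3, 8, 12)            # hypothetical values

# & and | compare element by element and return one answer per element
between <- x > 5 & x < 10   # FALSE TRUE FALSE
outside <- x < 5 | x > 10   # TRUE FALSE TRUE

# && and || expect a single TRUE/FALSE on each side
both <- length(x) == 3 && all(x > 0)   # TRUE

x[between]                  # a logical vector can be used as an index: 8
```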
Tables can also be indexed using logical expressions. This is really
useful!! R will return all rows for which the logical expression
evaluates to true. Logical expressions are typically used as row index
selectors. But now we’re going to start using the dplyr
functions side by side with base R square bracketing.
data("rpdr_contep")
# Select the contestants that won each episode
rpdr_contep[rpdr_contep$outcome == "WIN",]
## # A tibble: 1,060 × 11
## season rank missc contestant episode outcome eliminated participant
## <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 S01 2 1 Nina Flowers 1 WIN 0 1
## 2 S01 5 0 Ongina 2 WIN 0 1
## 3 <NA> NA <NA> <NA> NA <NA> <NA> <NA>
## 4 S01 1 0 BeBe Zahara Benet 3 WIN 0 1
## 5 <NA> NA <NA> <NA> NA <NA> <NA> <NA>
## 6 <NA> NA <NA> <NA> NA <NA> <NA> <NA>
## 7 S01 5 0 Ongina 4 WIN 0 1
## 8 <NA> NA <NA> <NA> NA <NA> <NA> <NA>
## 9 <NA> NA <NA> <NA> NA <NA> <NA> <NA>
## 10 <NA> NA <NA> <NA> NA <NA> <NA> <NA>
## # ℹ 1,050 more rows
## # ℹ 3 more variables: minichalw <chr>, finale <dbl>, penultimate <dbl>
# Subsets of tables can be saved as new objects.
# Something's strange with the NA's: rows where outcome is NA come back as all-NA rows, so we wrapped the whole thing in na.omit()
winners <- na.omit(rpdr_contep[rpdr_contep$outcome == "WIN", ])
# now, the filter function in dplyr
winners <- filter(rpdr_contep, outcome == "WIN")
winners <- rpdr_contep %>%
filter(outcome == "WIN")
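The all-NA rows in the base R output appear because outcome is missing for some rows, and a comparison with NA gives NA rather than FALSE. A small sketch with a made-up table (the names toy, with_na, no_na, and no_na2 are only for illustration) shows the difference between square brackets alone, which(), and subset():

```r
toy <- data.frame(
  contestant = c("A", "B", "C", "D"),
  outcome    = c("WIN", NA, "SAFE", "WIN")
)

# The logical index is TRUE, NA, FALSE, TRUE; the NA becomes an all-NA row
with_na <- toy[toy$outcome == "WIN", ]

# which() keeps only the positions that are TRUE, so the NA is dropped
no_na <- toy[which(toy$outcome == "WIN"), ]

# subset() in base R also drops rows where the condition is NA
no_na2 <- subset(toy, outcome == "WIN")

nrow(with_na)  # 3
nrow(no_na)    # 2
```

dplyr's filter() behaves like the last two: it keeps only rows where the condition is TRUE.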
Now we’ll use the quakes dataset. It is fairly easy to create new variables in a data frame. All one has to do is name the column and assign values to it:
# Create a column called "big" that stores the value "YES" for all quakes over magnitude 5.7
data(quakes)
quakes[quakes$mag > 5.7, "big"] <- "YES"
This syntax is a little bit complicated. First, we are selecting all rows of the table that contain a quake with a magnitude greater than 5.7. Next, we are “creating” a column called “big” simply by entering “big” in the column address. Next we’re assigning the word “YES” to the selected row column combination. The net result of this is that we’ve made a new column but most of it is empty, because very few quakes are greater than 5.7.
# now to look at some rows of data to see if the "big" column was created correctly
quakes[10:20,] # Notice the <NA>? These are missing data.
## lat long depth mag stations big
## 10 -17.47 179.59 622 4.3 19 <NA>
## 11 -21.44 180.69 583 4.4 13 <NA>
## 12 -12.26 167.00 249 4.6 16 <NA>
## 13 -18.54 182.11 554 4.4 19 <NA>
## 14 -21.00 181.66 600 4.4 10 <NA>
## 15 -20.70 169.92 139 6.1 94 YES
## 16 -15.94 184.95 306 4.3 11 <NA>
## 17 -13.64 165.96 50 6.0 83 YES
## 18 -17.83 181.50 590 4.5 21 <NA>
## 19 -23.50 179.78 570 4.4 13 <NA>
## 20 -22.63 180.31 598 4.4 18 <NA>
We could fix the missing data simply by assigning the value “NO” to all quakes that are smaller than 5.7.
# Fix missing data
quakes[quakes$mag < 5.7, "big"] <- "NO"
We could also have used a function called ifelse to create
the column big. Try to figure it out. You can get help on the
ifelse function by typing ?ifelse into the
console. You can get help for other functions that way as long as they
are already loaded into R, either in the base package or in any packages
you have loaded (we will get to that later).
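To show the shape of an ifelse() call without giving away the quakes exercise, here is a minimal sketch on a made-up vector; temps_c and the 30-degree cutoff are purely illustrative.

```r
temps_c <- c(12, 31, 25, 35)   # hypothetical temperatures

# ifelse(test, yes, no) checks each element and picks "HOT" or "MILD"
label <- ifelse(temps_c > 30, "HOT", "MILD")
label
```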
R is terrible for ‘looking’ at data. It’s really hard and not very
useful to look at tabular representations of your data in R. Instead of
looking at the raw data, I use the summary function. It
displays the max, min, mean, 25th, 75th, and 50th percentile of numeric
variables. You can actually use the summary function on
almost any object in R, and it will give you some kind of useful output
relevant to the object at hand…
summary(quakes)
## lat long depth mag
## Min. :-38.59 Min. :165.7 Min. : 40.0 Min. :4.00
## 1st Qu.:-23.47 1st Qu.:179.6 1st Qu.: 99.0 1st Qu.:4.30
## Median :-20.30 Median :181.4 Median :247.0 Median :4.60
## Mean :-20.64 Mean :179.5 Mean :311.4 Mean :4.62
## 3rd Qu.:-17.64 3rd Qu.:183.2 3rd Qu.:543.0 3rd Qu.:4.90
## Max. :-10.72 Max. :188.1 Max. :680.0 Max. :6.40
## stations big
## Min. : 10.00 Length:1000
## 1st Qu.: 18.00 Class :character
## Median : 27.00 Mode :character
## Mean : 33.42
## 3rd Qu.: 42.00
## Max. :132.00
Notice that the summary of the “big” column is not
especially useful. It tells us that the column stores characters.
Actually, the column stores a variable that has two non-numeric levels.
Many variables that you’ll work with, even if they appear as numbers,
aren’t really numeric variables. Consider a variable called “habitat
type” where the value “1” represents “wetlands” and the value “2”
represents “forest”. R would see these as numeric variables even though
the numbers are simply codes.
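The habitat example can be sketched with factor(), which the next paragraphs introduce. The codes and labels below are made up to match the description above.

```r
habitat_code <- c(1, 2, 2, 1, 2)   # hypothetical coded observations
mean(habitat_code)                 # R happily averages the codes: 1.6

# Tell R the numbers are category codes and attach readable labels
habitat <- factor(habitat_code, levels = c(1, 2),
                  labels = c("wetlands", "forest"))
table(habitat)                     # counts per category instead of arithmetic
```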
Categorical variables are called factors in R. Factors are variables
that contain any number of distinct levels. R treats factors differently
from numeric variables. Consider the vector of states Adam knows from
earlier. All states in the US have numeric IDs called Federal
Information Processing Standard (FIPS) codes. We could create a
vector like statesAdamKnows using FIPS codes for the states.
statesAdamKnows = c(44, 25, 6, 22)
The problem is that adding Massachusetts (code 25) and Louisiana (code
22) together is nonsense. It is certainly not true that adding
Massachusetts and Louisiana together equals Tennessee, even though
25+22=47. However, R has no problem doing just that.
statesAdamKnows[2] + statesAdamKnows[4]
## [1] 47
# Notice that because statesAdamKnows is a vector not a data frame, there is only one indexed value
We need to tell R that the vector statesAdamKnows is a
factor, not a numeric variable. We can do this by telling R to store the
vector as a factor.
statesAdamKnows = as.factor(statesAdamKnows)
After converting the vector to a factor, R behaves more reasonably. Trying to add MA and LA now returns NA with the warning “‘+’ not meaningful for factors.” Similarly, converting the “big” column of the quakes table to a factor yields a more reasonable and useful summary.
quakes$big = as.factor(quakes$big)
summary(quakes)
## lat long depth mag
## Min. :-38.59 Min. :165.7 Min. : 40.0 Min. :4.00
## 1st Qu.:-23.47 1st Qu.:179.6 1st Qu.: 99.0 1st Qu.:4.30
## Median :-20.30 Median :181.4 Median :247.0 Median :4.60
## Mean :-20.64 Mean :179.5 Mean :311.4 Mean :4.62
## 3rd Qu.:-17.64 3rd Qu.:183.2 3rd Qu.:543.0 3rd Qu.:4.90
## Max. :-10.72 Max. :188.1 Max. :680.0 Max. :6.40
## stations big
## Min. : 10.00 NO :985
## 1st Qu.: 18.00 YES : 7
## Median : 27.00 NA's: 8
## Mean : 33.42
## 3rd Qu.: 42.00
## Max. :132.00
We have a problem. Why are there 8 NA’s in the big column? Try using
is.na(quakes$big) to select rows from the quakes table. The
goal is to produce output like what is shown below.
Which rows have NAs for quakes$big?
The ones with a magnitude exactly equal to 5.7.
Fix the NA’s so that summary(quakes) looks like this:
## lat long depth mag
## Min. :-38.59 Min. :165.7 Min. : 40.0 Min. :4.00
## 1st Qu.:-23.47 1st Qu.:179.6 1st Qu.: 99.0 1st Qu.:4.30
## Median :-20.30 Median :181.4 Median :247.0 Median :4.60
## Mean :-20.64 Mean :179.5 Mean :311.4 Mean :4.62
## 3rd Qu.:-17.64 3rd Qu.:183.2 3rd Qu.:543.0 3rd Qu.:4.90
## Max. :-10.72 Max. :188.1 Max. :680.0 Max. :6.40
## stations big
## Min. : 10.00 NO :985
## 1st Qu.: 18.00 YES: 15
## Median : 27.00
## Mean : 33.42
## 3rd Qu.: 42.00
## Max. :132.00
In a data.frame, factors often represent distinct types or categories.
Sometimes we will want more detailed summaries than the generic
summary function provides. We might want to summarize the
big quakes and small quakes separately. We can do this by using the
aggregate function. The ~ is used a lot in R. It stands for
“described by” (or “is a function of”…). In this case, we’re describing
the magnitude by the big variable we created using logical expressions.
To each aggregate of the data (level of big), we are
applying the FUN (which stands for function) summary.
## quakes$big quakes$mag.Min. quakes$mag.1st Qu. quakes$mag.Median
## 1 NO 4.000000 4.300000 4.600000
## 2 YES 5.700000 5.700000 5.700000
## quakes$mag.Mean quakes$mag.3rd Qu. quakes$mag.Max.
## 1 4.601523 4.800000 5.600000
## 2 5.860000 6.000000 6.400000
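The call that produced the table above is not shown. The column names suggest a formula of the form quakes$mag ~ quakes$big, but that is a guess. As a sketch of the same pattern that does not depend on the big column, the example below groups magnitudes by a hypothetical logical column deep (depth over 300 km) on a fresh copy of the data.

```r
q <- datasets::quakes        # fresh copy, so the lab's quakes object is untouched
q$deep <- q$depth > 300      # hypothetical grouping column, not part of the lab

# "mag described by deep": apply summary() to mag within each level of deep
by_depth <- aggregate(mag ~ deep, data = q, FUN = summary)
by_depth
```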
We can even do basic group by in plots:
boxplot(mag ~ big, data=quakes)
Most of the above was the “old” way to work with data sets in R. The new way uses a set of R packages that are grouped together under the name the “tidyverse”. Increasingly, everyone is working with data this way instead of the old way – in fact, many of your classmates are likely better at ‘wrangling’ data with tools in the tidyverse than the way that I just showed above. Here is a quick intro to these tools. Check out this webpage for more information on all of the packages and tools included in the tidyverse: https://www.tidyverse.org/
Let’s look at the data about flights. What information does each of these commands give you?
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
?flights
Notice that the flights data is in a format called a “tibble”. Tibbles are data frames with slightly different printing and subsetting behavior that suits the tidyverse. You can convert data.frames to tibbles and back again with these commands.
#convert the quakes data frame from before into a tibble
quakes <- as_tibble(quakes)
str(quakes) #this gives you information about the structure of the dataset as well as of each column within the data frame
## tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
## $ lat : num [1:1000] -20.4 -20.6 -26 -18 -20.4 ...
## $ long : num [1:1000] 182 181 184 182 182 ...
## $ depth : int [1:1000] 562 650 42 626 649 195 82 194 211 622 ...
## $ mag : num [1:1000] 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
## $ stations: int [1:1000] 41 15 43 19 11 12 43 15 35 19 ...
## $ big : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
glimpse(quakes) # this is similar to the str command, and shows as many values from each column as fit on the screen
## Rows: 1,000
## Columns: 6
## $ lat <dbl> -20.42, -20.62, -26.00, -17.97, -20.42, -19.68, -11.70, -28.1…
## $ long <dbl> 181.62, 181.03, 184.10, 181.66, 181.96, 184.31, 166.10, 181.9…
## $ depth <int> 562, 650, 42, 626, 649, 195, 82, 194, 211, 622, 583, 249, 554…
## $ mag <dbl> 4.8, 4.2, 5.4, 4.1, 4.0, 4.0, 4.8, 4.4, 4.7, 4.3, 4.4, 4.6, 4…
## $ stations <int> 41, 15, 43, 19, 11, 12, 43, 15, 35, 19, 13, 16, 19, 10, 94, 1…
## $ big <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, YES, …
# and convert back from tibbles to data.frames
quakes <- as.data.frame(quakes)
str(quakes)
## 'data.frame': 1000 obs. of 6 variables:
## $ lat : num -20.4 -20.6 -26 -18 -20.4 ...
## $ long : num 182 181 184 182 182 ...
## $ depth : int 562 650 42 626 649 195 82 194 211 622 ...
## $ mag : num 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
## $ stations: int 41 15 43 19 11 12 43 15 35 19 ...
## $ big : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
Do you see how they have different structures? If not, talk to another student or the TA to get help (that is true for each step in this lab.)
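Beyond how they print, one practical difference is what single-column indexing returns. A minimal sketch, assuming the tibble package is installed (it comes with the tidyverse); df and tb are throwaway names:

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

class(df[, "x"])   # "integer": a data.frame drops to a plain vector
class(tb[, "x"])   # "tbl_df" "tbl" "data.frame": a tibble stays a table

tb$x               # $ returns the column as a vector in both cases
```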
The package dplyr (already loaded when you loaded the tidyverse) is a package that can do a lot of data manipulation for you.
Filtering the data allows you to select observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on December 31 with:
filter(flights, month==12, day==31)
## # A tibble: 776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 31 13 2359 14 439 437
## 2 2013 12 31 18 2359 19 449 444
## 3 2013 12 31 26 2245 101 129 2353
## 4 2013 12 31 459 500 -1 655 651
## 5 2013 12 31 514 515 -1 814 812
## 6 2013 12 31 549 551 -2 925 900
## 7 2013 12 31 550 600 -10 725 745
## 8 2013 12 31 552 600 -8 811 826
## 9 2013 12 31 553 600 -7 741 754
## 10 2013 12 31 554 550 4 1024 1027
## # ℹ 766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#if you want to save the manipulation, assign it to an object
NYE_flights <- filter(flights, month==12, day==31)
Whereas filter() selects rows by their values, select() selects columns. If you only need some of the variables in your data frame, you can select just those columns and save them as a new object, so that you only have to analyze a smaller data frame. This can be particularly useful when you have a large dataset and only need some of the variables. Make sure to save the result as an object with a different name than the original unless you do want to overwrite it in your current R session. Otherwise, if you need anything else from the original dataset, you will have to load it again.
flights2 <- select(flights, carrier, air_time, distance)
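mutate() creates new variables from existing ones. Here, speed is in miles per hour because distance is in miles and air_time is in minutes.
mutate(flights,
  speed = distance / air_time * 60
)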
## # A tibble: 336,776 × 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, speed <dbl>
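Note that mutate() above only prints its result; to keep the new column you assign the result to a name. The sketch below chains filter(), select(), and mutate() with the pipe on a small made-up table (toy_flights and its values are hypothetical), assuming dplyr is installed.

```r
library(dplyr)

toy_flights <- data.frame(
  carrier  = c("AA", "UA", "AA"),
  distance = c(1000, 500, 300),   # miles
  air_time = c(120, 60, 45)       # minutes
)

aa_speed <- toy_flights %>%
  filter(carrier == "AA") %>%               # keep rows for one carrier
  select(carrier, distance, air_time) %>%   # keep only the needed columns
  mutate(speed = distance / air_time * 60)  # miles per hour

aa_speed
```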
The summarize function (also spelled “summarise”; both spellings work) is similar to the aggregate function that we learned about above. Here is an example:
summarize(flights, ave_dep = mean(dep_time, na.rm = TRUE))
## # A tibble: 1 × 1
## ave_dep
## <dbl>
## 1 1349.
Note that this created just one value: the average departure time for all flights in 2013 from a NYC airport. This probably isn’t what we wanted. What if we wanted to compare the average departure delay times by airport or by airline? We could combine two steps into one statement using the pipe (%>%) and the group_by() function.
flights %>%
group_by(origin) %>%
summarize(mean_delay = mean(dep_delay))
## # A tibble: 3 × 2
## origin mean_delay
## <chr> <dbl>
## 1 EWR NA
## 2 JFK NA
## 3 LGA NA
Hmmm… why didn’t that work? Because there are many NA values for dep_delay. We can fix that by using the argument “na.rm = TRUE”, which means to remove (rm) the NA values before taking the mean. This is quite common for many functions in R.
flights %>%
group_by(origin) %>%
summarize(mean_delay = mean(dep_delay, na.rm=TRUE))
## # A tibble: 3 × 2
## origin mean_delay
## <chr> <dbl>
## 1 EWR 15.1
## 2 JFK 12.1
## 3 LGA 10.3
You can also add multiple new variables with summarize. A key new
function here is n(), which simply counts rows within the
grouping.
carrier_data <- flights %>%
group_by(carrier) %>%
summarize(mean_delay = mean(dep_delay, na.rm=TRUE),
count = n(),
max_delay = max(dep_delay, na.rm=TRUE),
mean_distance = mean(distance, na.rm=TRUE))
carrier_data
## # A tibble: 16 × 5
## carrier mean_delay count max_delay mean_distance
## <chr> <dbl> <int> <dbl> <dbl>
## 1 9E 16.7 18460 747 530.
## 2 AA 8.59 32729 1014 1340.
## 3 AS 5.80 714 225 2402
## 4 B6 13.0 54635 502 1069.
## 5 DL 9.26 48110 960 1237.
## 6 EV 20.0 54173 548 563.
## 7 F9 20.2 685 853 1620
## 8 FL 18.7 3260 602 665.
## 9 HA 4.90 342 1301 4983
## 10 MQ 10.6 26397 1137 570.
## 11 OO 12.6 32 154 501.
## 12 UA 12.1 58665 483 1529.
## 13 US 3.78 20536 500 553.
## 14 VX 12.9 5162 653 2499.
## 15 WN 17.7 12275 471 996.
## 16 YV 19.0 601 387 375.
Sometimes you might want to convert a column into a vector. For this, we
use the pull() function. Let’s say we hypothesize that a
plane that arrives on time on New Year’s Day has good luck, and will then
arrive on time the rest of the year.
lucky_planes <- flights %>%
filter(month == 1 & day == 1 & arr_delay < 1) %>%
pull(tailnum) %>%
unique()
So, we filtered for flights on January 1 with an arrival delay of less than one minute. Then we pulled the tailnum column and called unique() to get rid of repeats. Now, let’s find out if those planes were lucky:
flights %>%
filter(tailnum %in% lucky_planes) %>%
pull(arr_delay) %>%
mean(na.rm=TRUE) %>%
paste("lucky planes:", .) %>%
print()
## [1] "lucky planes: 6.92185807693939"
flights %>%
filter(!tailnum %in% lucky_planes) %>%
pull(arr_delay) %>%
mean(na.rm=TRUE) %>%
paste("unlucky planes:", .) %>%
print()
## [1] "unlucky planes: 6.89061869341345"
Doesn’t look like a big difference.
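The reason pull() was needed above is that select() would have returned a one-column tibble rather than a vector. Here is a small sketch of the difference, using a made-up tibble (`toy`, with invented tail numbers):

```r
library(dplyr)

# Made-up data: tail number "N1" appears twice
toy <- tibble(tailnum = c("N1", "N2", "N1"), arr_delay = c(-3, 12, 0))

as_column <- toy %>% select(tailnum)  # still a tibble with one column
as_vector <- toy %>% pull(tailnum)    # a plain character vector

class(as_column)
class(as_vector)
unique(as_vector)      # unique() drops the repeated "N1"
"N2" %in% as_vector    # vectors work directly with %in%
```

Functions like mean() and operators like %in% expect vectors, which is why pull() is handy in the middle of a pipeline.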
Merging data can be done with older base R functions such as merge(), but can also be done with various functions in the tidyverse. You might need to merge different datasets together by a common identifying variable for your final project in this course or for a future research or work project.
There are a bunch of datasets that we can merge to our flight data. Airlines, airports, planes, and weather can all be joined to flights. First look at all of the data:
data(airlines)
data(airports)
data(planes)
data(weather)
#use the head() function to see if there are any variables that you can merge on
We have to have a variable in common on which to merge datasets together. You can see that the airlines dataset could merge to the flights dataset by the variable “carrier”.
There are various kinds of merges, called joins in the tidyverse. A left_join merges observations from the second dataset (the one on the right) onto observations in the first dataset (the one on the left) by the variable(s) that you state they have in common, and keeps all observations in the left (first) dataset. This can create missing values if there is an observation in the first dataset without a match in the second dataset, and it will drop all observations from the second dataset that don’t have a match in the first dataset. A right_join is the opposite: all observations are merged based on the merging variable(s), but all observations in the second dataset are kept (with NA values if there is no match) and those in the first are dropped if they don’t match. An inner_join only keeps observations that match in both datasets, and a full_join (sometimes called an outer join) keeps all observations regardless of match (no match will create NA values). When an observation in one dataset matches more than one observation in the other dataset, it will be merged to all that it matches. For more information, look here: https://r4ds.had.co.nz/relational-data.html#nycflights13-relational
Here are a couple of examples:
flights3 <- flights %>%
left_join(airlines, by="carrier")
dim(flights)
## [1] 336776 19
dim(airlines)
## [1] 16 2
dim(flights3)
## [1] 336776 20
#do you see how the left_join operates?
#when the key is different for the two datasets, you state the key from the left = the key from the right
flights4 <- flights %>%
inner_join(airports, by=c("dest" = "faa"))
dim(flights)
## [1] 336776 19
dim(airports)
## [1] 1458 8
dim(flights4)
## [1] 329174 26
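The row counts are easier to reason about on a tiny example. The sketch below uses two made-up tables (`toy_flights` and `toy_airlines`, invented for illustration) where one carrier has no matching name and one airline has no flights, and runs dplyr's four main joins on them. Note that the join that keeps everything from both tables is called full_join() in dplyr.

```r
library(dplyr)

# Made-up tables: carrier "ZZ" has no name, and "DL" has no flights
toy_flights  <- tibble(flight = 1:3, carrier = c("AA", "UA", "ZZ"))
toy_airlines <- tibble(carrier = c("AA", "UA", "DL"),
                       name    = c("American", "United", "Delta"))

left  <- left_join(toy_flights, toy_airlines, by = "carrier")   # keeps ZZ with an NA name
right <- right_join(toy_flights, toy_airlines, by = "carrier")  # keeps DL with an NA flight
inner <- inner_join(toy_flights, toy_airlines, by = "carrier")  # AA and UA only
full  <- full_join(toy_flights, toy_airlines, by = "carrier")   # AA, UA, ZZ and DL

# Compare the number of rows each join returns
c(left = nrow(left), right = nrow(right), inner = nrow(inner), full = nrow(full))
```

This is the same pattern as the flights4 example above: an inner join returns fewer rows than the left table whenever some keys have no match.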
Whenever I merge data, I do many checks to make sure the merge worked correctly. You don’t want to have a situation in which you do analysis on incorrectly merged data.
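What those checks look like depends on the data, but here is a sketch of three that are often useful, shown on made-up tables (`toy_flights` and `toy_airports` are invented, and "XXX" is a deliberately unmatched code). It assumes the key in the right-hand table is supposed to be unique.

```r
library(dplyr)

# Made-up tables: destination "XXX" has no entry in the airports table
toy_flights  <- tibble(flight = 1:4, dest = c("ORD", "LAX", "XXX", "ORD"))
toy_airports <- tibble(faa  = c("ORD", "LAX"),
                       name = c("Chicago O'Hare", "Los Angeles Intl"))

joined <- left_join(toy_flights, toy_airports, by = c("dest" = "faa"))

# Check 1: a left join on a unique key should not change the row count
nrow(joined) == nrow(toy_flights)

# Check 2: the key in the right-hand table should have no duplicates
anyDuplicated(toy_airports$faa) == 0

# Check 3: list the rows on the left that found no match
unmatched <- anti_join(toy_flights, toy_airports, by = c("dest" = "faa"))
unmatched
```

anti_join() returns only the rows of the first table that have no match in the second, which makes it a convenient way to see exactly what a join would drop or fill with NA.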
You saw above that when a value is missing, R creates a value of NA. The function is.na() tells us whether a value is missing or not. We can also use this to remove missing values by using !is.na() within a filter() statement.
summary(flights) #this shows us which variables have NA values. Which columns have NA values?
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00.00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00.00
## Median :29.00 Median :2013-07-03 10:00:00.00
## Mean :26.23 Mean :2013-07-03 05:22:54.64
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00.00
## Max. :59.00 Max. :2013-12-31 23:00:00.00
##
#we want to remove all flights that were never in the air
not_canceled <- flights %>%
filter(!is.na(air_time))
#what if we wanted the information on canceled flights?
canceled <- flights %>%
filter(is.na(air_time))
#which airlines have the most canceled flights?
canceled %>%
group_by(carrier) %>%
summarize(n_canceled = n()) %>%
ungroup() # good habit to always ungroup
## # A tibble: 15 × 2
## carrier n_canceled
## <chr> <int>
## 1 9E 1166
## 2 AA 782
## 3 AS 5
## 4 B6 586
## 5 DL 452
## 6 EV 3065
## 7 F9 4
## 8 FL 85
## 9 MQ 1360
## 10 OO 3
## 11 UA 883
## 12 US 705
## 13 VX 46
## 14 WN 231
## 15 YV 57
Data manipulation is a large part of any statistical analysis. In this exercise we have just scratched the surface. In the first lab, we’ll do some very complex data manipulation. It’s really easy to find help in online forums (fora) on data manipulation problems in R. However, it’s important to use the right vocabulary in your Google searches: “data frame”, “factor”, and “vector” are all important keywords for finding relevant help. Words like “table” are not useful. It’s important to know that because R is open source, anyone can contribute. This sometimes leads to messy things. For example, a “table” is a special data structure in R that is different from a data.frame!
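To make the table versus data frame distinction concrete, here is a small base R sketch. The `carrier` vector is made up for illustration.

```r
# Made-up vector of carrier codes
carrier <- c("AA", "UA", "AA", "DL")

counts <- table(carrier)   # a "table" object, not a data frame
class(counts)

# Convert the table to a data frame with columns carrier and Freq
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
class(counts_df)
counts_df
```

Most dplyr functions expect a data frame, so converting with as.data.frame() is usually the first step if you end up with a table object.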
For the following questions, please complete the code to produce the required answers.
Q1: Create a numeric vector with 10 random numbers, find the median value, and plot the histogram (see lecture slides) (10 pts)
num <- c(1,4,5,6,7,3,4,10,11,34)
median(num)
hist(num,
col = 'red'
)
### The median is 5.5.
Q2a: Load the flights data into R, and find the summary information about departure delays (5 pts).
flights
# summary() reports the number of NA values itself, so no NA-handling argument is needed
summary(flights$dep_delay)
Q2b: Explain what all the summary measures mean for the
dep_time column (5 pts).
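The summary measures for the dep_time column describe the distribution of actual departure times, which are recorded as clock times in HHMM format (sched_dep_time holds the scheduled times). The minimum (1) and maximum (2400) are the earliest and latest clock times of departure, i.e. 12:01 am and midnight. The 1st quartile (907), median (1401), and 3rd quartile (1744) are the clock times below which 25%, 50%, and 75% of departures fall. The mean (1349) is the average of the HHMM values, which is only a rough guide because these are clock times rather than true numbers. NA's (8255) is the number of flights with no recorded departure time.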
Q3: Find all flights that flew to Houston (10 pts).
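unique(flights$dest) # list the destination codes rather than printing every row
hou_flights <- flights %>%
filter(dest == 'HOU')
nrow(hou_flights)
### There were 2,115 flights that flew to Houston's Hobby airport (HOU). Houston's other major airport, IAH, is not included in this filter.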
Q4: Find all flights that had an arrival delay of more than two hours (10 pts).
late_flights <- flights %>%
filter(arr_delay > 120)
nrow(late_flights)
### There were 10,034 flights that were delayed more than 2 hours.
Q5: Find all flights that departed in summer (July, August, and September). (10 pts)
summer_flights <- flights %>%
filter(month >= 7 & month <=9)
nrow(summer_flights)
### There were 86,326 flights that departed in the summer.
Q6: Find all flights that arrived more than two hours late, but didn’t leave late. (10 pts)
# complete your answer in the following code chunk
flights
del_flights <- flights %>%
filter(arr_delay > 120 & dep_delay <= 0)
nrow(del_flights)
### There were 29 flights that arrived more than 2 hours late but did not leave late
Q7: Which plane (tailnum) has the worst on-time record? There are a few different ways to define the on-time record (most flights that are late, average arrival delay, etc.). Do whatever makes sense to you. (10 pts)
flights
worst_plane <- flights %>%
group_by(tailnum)%>%
summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
na.omit() %>%
filter(avg_delay == max(avg_delay))
print(worst_plane)
### The plane with the worst on-time record was plane N844MH with an average delay of 320 minutes.
Q8: What time of day should you fly if you want to
avoid delays as much as possible? HINT: the column hour is
the scheduled departure hour. (10 pts)
flights
best_time <- flights %>%
group_by(hour)%>%
summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
na.omit() %>%
filter(avg_delay == min(avg_delay))
print(best_time)
Answer: You should plan to leave around 5am to avoid delays as much as possible
Q9: How many flights were flown by planes that flew over 100 flights? You might need to do this in two steps, where first you create a vector of planes with more than 100 flights, then use that to filter out the flights. (20 pts)
planes_100 <- flights %>%
group_by(tailnum) %>%
summarize(num_flights = n()) %>%
filter(num_flights > 100)
planes_100 <- planes_100$tailnum
print(planes_100)
flight_count <- flights %>%
filter(tailnum %in% planes_100)
nrow(flight_count)
# There were 229,202 flights flown by planes that flew more than 100 flights.
This lab is a modified version of many previous labs created by many people. Thanks to Carson Farmer, Seth Spielman, Colleen Reid and Adam Mahood.