lab_1_scherer

Geog 4023/5023 Lab 1: Data Wrangling with R

Please complete the following questions and submit the finished Rmd and HTML files to Canvas. In order for this file to knit, a file called “lab.css” needs to be in the same folder as the .Rmd file.

Don’t forget to change the name field on line 3 to your first and last name.

Objectives

Learn the basics of how to wrangle data in R
Learn the basics of visualization and mapping in R

Introduction

Setup

The most commonly used functions for wrangling data nowadays are in the dplyr package, which is part of a suite of packages known as the tidyverse. Loading the tidyverse also loads other useful packages like ggplot2, the most widely used package for visualization in R. You can simplify things by installing the whole tidyverse with install.packages("tidyverse") and then loading it with the library command. All of the canonical statistical tests we will use for the hypothesis testing portion of the lab are in base R. The first part of the following code chunk defines a function that checks whether a package is installed and installs it if it is not. The second part runs the function and then loads the libraries.

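iini <- function(x) {
  # stands for "install if not installed"
  if (!x %in% rownames(installed.packages())) install.packages(x)
}
iini("tidyverse")
iini("dragracer")     # data from RuPaul's Drag Race
iini("nycflights13")  # flights that departed NYC airports in 2013
library(tidyverse)
library(dragracer)
library(nycflights13)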

Part 1: R Basics

The instructions may be a little lengthy, but it is beneficial to work through the R markdown file (.Rmd). This will allow you to tinker with the code as you’re going through.

Objects in R

R is an object oriented programming language. This means that when you use R, your computer essentially becomes a filing system for things that you create. Like any good filing system, things have addresses or call numbers that you can quickly use to retrieve things. However, the bookshelf used to store the objects you create is limited in size by the amount of memory installed in your computer. You’re unlikely to bump into the memory limits of your computer in this class, but it’s important to realize that R objects actually occupy space in your computer.

The code a <- 10 creates an object called a that contains the number 10. The assignment operator <- assigns the number 10 to the object a (you can also use =, which is frowned upon in R for reasons that are beyond the scope of this lab). You will use the assignment operator a lot! It assigns an object to a name, thus storing it in memory (the filing system) with the name as an address. The name a is the address that will be used to retrieve the object in the future. The name is completely arbitrary; I could have called the object bob or zx10sf. Typing the name into the R console retrieves the object stored at that address and displays it in the console. As a general rule, it’s helpful to give the objects you create meaningful names. Names cannot begin with numbers, names are case sensitive (A and a would be different addresses), and names shouldn’t be the same as the names of functions in R (like mean or sum).

a <- 10
a

## [1] 10

In the code above a stores a number, thus it is an object of the type numeric. We can see the type of any R object by typing class(objectName) into the console:

class(a)

## [1] "numeric"

# Overwrite a and store a character text string
a <- "bacon"
class(a)

## [1] "character"

a

## [1] "bacon"
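As a quick sketch of how class() responds to a few other common kinds of values (the object names and values below are made up for illustration):

```r
# Hypothetical objects, one per common type
n_students <- 25L          # the L suffix makes an integer
avg_score  <- 87.5         # a plain number is "numeric"
course     <- "Geog 4023"  # text in quotes is "character"
is_fun     <- TRUE         # TRUE/FALSE values are "logical"

class(n_students)
class(avg_score)
class(course)
class(is_fun)
```

Running each class() call in the console is a low-stakes way to check what kind of object you are holding before you try to do math with it.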

Values of the same type can be combined into a vector using the combine function, c(). Vectors are useful for storing information that logically belongs together. Vectors can also be combined into longer vectors:

statesAdamHasLived <- c("MN", "MS", "AZ", "CA", "CO", "VA", "UT", "WI")  # A vector with eight elements
length(statesAdamHasLived)  # Returns the length of the vector

## [1] 8

statesAdamWillVisitThisSemester <- c("AZ", "CO")

# Combining vectors
statesAdamKnows <- c(statesAdamWillVisitThisSemester, statesAdamHasLived)
statesAdamKnows

##  [1] "AZ" "CO" "MN" "MS" "AZ" "CA" "CO" "VA" "UT" "WI"

length(statesAdamKnows)

## [1] 10

Notice how CO is in the vector twice because R just appended the two vectors. If we had combined the two vectors with statesAdamHasLived first and statesAdamWillVisitThisSemester second, how would it look?
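One caveat when combining values with c(): a vector holds a single type, so mixing types makes R quietly convert everything to the most flexible one. A minimal sketch with made-up values:

```r
# Mixing numbers, text, and logicals: every element becomes character
mixed <- c(1, "MN", TRUE)
mixed
class(mixed)

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
nums <- c(10, TRUE, FALSE)
nums
class(nums)
```

If a column of numbers unexpectedly shows up as character, a stray text value somewhere in the vector is a likely cause.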

Indexing is a critical aspect of working with vectors. It’s possible to pull individual elements or a range of elements from a vector.

statesAdamKnows[1] # Access the first element of the vector (index position 1)

## [1] "AZ"

statesAdamKnows[3] # Access index position 3

## [1] "MN"

statesAdamKnows[2:4] # The colon denotes the range of integers from 2 to 4

## [1] "CO" "MN" "MS"

# A vector of numbers to reorder the vector or access the elements in a specific index order
statesAdamKnows[c(4,3,1,2)]

## [1] "MS" "MN" "AZ" "CO"

# Permanently reorder the vector and keep only those elements - because you are now re-assigning statesAdamKnows with <-
statesAdamKnows <- statesAdamKnows[c(4,3,1,2)]
statesAdamKnows

## [1] "MS" "MN" "AZ" "CO"

# Alphabetize the vector with sort
alpha.statesAdamKnows <- sort(statesAdamKnows)
alpha.statesAdamKnows

## [1] "AZ" "CO" "MN" "MS"

Elements of a vector can be named. This makes it possible to access the vector using an indexing scheme that may be more meaningful than numbers.

# The schools Carson Farmer went to as a kid - note these are from Carson Farmer who taught the course in 2018 because I wanted to keep "Smiles & Chuckles" 
schools <- c(kindergarten="Smiles & Chuckles", elementary="Deep Cove", middle="Bay Side", high="Stelly's")
schools

##        kindergarten          elementary              middle                high 
## "Smiles & Chuckles"         "Deep Cove"          "Bay Side"          "Stelly's"

# index by the name of each item in the vector
schools["high"]

##       high 
## "Stelly's"

schools[c("high", "middle")]

##       high     middle 
## "Stelly's" "Bay Side"
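Names can also be read back or added after a vector has been created. The following is a small sketch using an invented vector of scores (the names and numbers are hypothetical):

```r
# A small named vector (values invented for this sketch)
scores <- c(alice = 90, bob = 85)

names(scores)             # retrieve the names as a character vector
scores["carol"] <- 78     # assigning to a new name appends an element
scores[c("carol", "alice")]
```

Because the lookup is by name, the order of the elements in the vector does not matter when you retrieve them.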

Tabular Data in R

In R, tabular data is stored in a ‘data.frame’ object. Data.frames are like a vector with dimensions \(n×p\). This is useful because both rows and columns can be used as named (or numbered) indices. Thus it is possible to ask for small sections of the table, like rows 1 to 5 and columns 2 to 4. However, because data frames have rows and columns, our addressing scheme needs to include row addresses and column addresses. Note the use of the plural, addresses. When working with tables, one often wants to retrieve multiple rows and columns, and this is done using vectors (like we played with in the previous section) created explicitly using c() or implicitly using other types of operators/functions.

An Example of Tabular Data

Entity	Name	Type	Weight	Health	Age Group
1	Fluffy	Cat	27lbs	Obese	Adult
2	Felix	Cat	12lbs	Healthy	Adult
3	Patches	Cat	6lbs	`NA`	Kitten
n	…	…	…	…	…

In CSV format, the above tabular data would look something like the following, perhaps stored in a file called cats.csv:

"Entity","Name","Type","Weight","Health","Age Group"
1,"Fluffy","Cat",27,"Obese","Adult"
2,"Felix","Cat",12,"Healthy","Adult"
3,"Patches","Cat",6,,"Kitten"
...
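To see how CSV text like this turns into a data.frame, here is a minimal sketch. Since cats.csv is only a hypothetical file, the sketch passes the three complete rows to read.csv() as a string through its text argument; with a real file you would pass the file name instead.

```r
# The CSV text from above, without the trailing "..." line
csv_text <- '"Entity","Name","Type","Weight","Health","Age Group"
1,"Fluffy","Cat",27,"Obese","Adult"
2,"Felix","Cat",12,"Healthy","Adult"
3,"Patches","Cat",6,,"Kitten"'

# text = lets read.csv() parse a string instead of a file;
# na.strings tells it to treat empty fields (and "NA") as missing
cats <- read.csv(text = csv_text, na.strings = c("", "NA"))

cats
names(cats)   # "Age Group" becomes Age.Group, a valid R column name
```

Note that the empty Health field for Patches is read in as NA, matching the table above.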

R is distributed with a bunch of example data sets. To see a list of these, type data() in the console. We’ll load a data set from the package dragracer, a database of all of the important information from RuPaul’s Drag Race. For an explanation of the data, type ?dragracer::rpdr_contestants into the console.

data("rpdr_contestants")

# To see the size of the table use dim()
dim(rpdr_contestants) # 184 rows and 5 columns

## [1] 184   5

# To see the names of the columns
names(rpdr_contestants)

## [1] "season"     "contestant" "age"        "dob"        "hometown"

# Show the first 10 rows of the rpdr_contestants table
head(rpdr_contestants, n=10)

## # A tibble: 10 × 5
##    season contestant                   age dob        hometown                
##    <chr>  <chr>                      <dbl> <date>     <chr>                   
##  1 S01    BeBe Zahara Benet             28 1981-03-20 Minneapolis, Minnesota  
##  2 S01    Nina Flowers                  34 1974-02-22 Bayamón, Puerto Rico    
##  3 S01    Rebecca Glasscock             26 1983-05-25 Fort Lauderdale, Florida
##  4 S01    Shannel                       26 1979-07-03 Las Vegas, Nevada       
##  5 S01    Ongina                        26 1982-01-06 Los Angeles, California 
##  6 S01    Jade                          32 1984-11-18 Chicago, Illinois       
##  7 S01    Akashia                       32 1985-02-19 Cleveland, Ohio         
##  8 S01    Tammie Brown                  36 1980-09-15 Los Angeles, California 
##  9 S01    Victoria (Porkchop) Parker    39 1969-01-16 Raleigh, North Carolina 
## 10 S02    Tyra Sanchez                  21 1988-04-22 Orlando, Florida

Indexing of ‘data.frames’ is similar to the indexing of vectors, except that one has to provide row and column addresses. Data.frames are indexed using the convention dfname[row vector, column vector].

# Get the contents of the first row and the first column
rpdr_contestants[1, 1]

# Get the first 10 rows?
rpdr_contestants[1:10] # NOTE THE ERROR!  I have not provided valid column addresses.

# This works.  
rpdr_contestants[1:10,] #The blank after the comma indicates all columns.

# Get the season and contestant name for the first 10 contestants
rpdr_contestants[1:10, c("season", "contestant")]  

# Note 1:10 is a vector and c("season", "contestant") is a vector

# This is equivalent to the previous line
rpdr_contestants[1:10, 1:2]
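The role of the comma is easier to see on a table small enough to read in full. This is a sketch with a made-up data frame (pets is not part of the lab data). It also suggests why rpdr_contestants[1:10] fails: without a comma the numbers are treated as column positions, and that table only has 5 columns.

```r
# A tiny made-up data frame
pets <- data.frame(
  name   = c("Fluffy", "Felix", "Patches"),
  weight = c(27, 12, 6),
  age    = c("Adult", "Adult", "Kitten")
)

pets[2, 3]                    # row 2, column 3: a single value
pets[1:2, ]                   # rows 1 to 2, all columns
pets[, c("name", "weight")]   # all rows, two columns chosen by name
pets[1:2]                     # no comma: treated as COLUMNS 1 to 2
```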

‘data.frames’ can also be indexed using column names with the syntax df$colname.

rpdr_contestants$contestant # note that this gives you ALL the data, which might be too much!

rpdr_contestants$contestant[1:10] # Returns the first 10 entries in the contestant column

# The line above picks out the same 10 entries as
rpdr_contestants[1:10, "contestant"]

Logical Expressions in R

Logical expressions are expressions that evaluate to TRUE or FALSE. For example, “is this bigger than that” or “are these things equal”. Logical expressions are created with logical operators. R uses the following logical operators:

< Less than
> Greater than
== Equal to (Note that this is different from ‘=’ which is the assignment operator)
>= Greater than or equal to
<= Less than or equal to
&& And
|| Or
%in% Matching operator

# Examples of logical expressions
10 == 5

## [1] FALSE

10 > 5

## [1] TRUE

a = 10 # Create an object that stores the number 10
a == 10

## [1] TRUE

a > 10

## [1] FALSE

a >= a + 1

## [1] FALSE

"CO" %in% statesAdamKnows

## [1] TRUE

c("MT", "CA") %in% statesAdamKnows

## [1] FALSE FALSE

# Which contestants won which episodes?
# use head to display the first 10 answers
head(rpdr_contep$outcome == "WIN", n = 10)

##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
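A note on the “and”/“or” operators: when a comparison is applied to a whole vector or column, as in the output above, the single-character forms & and | are the ones that work element by element, while && and || expect a single TRUE or FALSE on each side (recent versions of R raise an error otherwise). A minimal sketch with made-up magnitudes:

```r
mags <- c(4.1, 5.9, 5.2, 6.3)   # made-up magnitudes

# Element-wise AND / OR: one TRUE or FALSE per element
mags > 5 & mags < 6
mags < 4.5 | mags > 6

# %in% also returns one answer per element on its left-hand side
c("CO", "MT") %in% c("AZ", "CO")

# && and || combine two single TRUE/FALSE values
length(mags) == 4 && all(mags > 4)
```

For selecting rows of a table you will almost always want & and |.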

Tables can also be indexed using logical expressions. This is really useful!! R will return all rows for which the logical expression evaluates to true. Logical expressions are typically used as row index selectors. But now we’re going to start using the dplyr functions side by side with base R square bracketing.

data("rpdr_contep")

# Select the contestants that won each episode
rpdr_contep[rpdr_contep$outcome == "WIN",]

## # A tibble: 1,060 × 11
##    season  rank missc contestant        episode outcome eliminated participant
##    <chr>  <dbl> <chr> <chr>               <dbl> <chr>   <chr>      <chr>      
##  1 S01        2 1     Nina Flowers            1 WIN     0          1          
##  2 S01        5 0     Ongina                  2 WIN     0          1          
##  3 <NA>      NA <NA>  <NA>                   NA <NA>    <NA>       <NA>       
##  4 S01        1 0     BeBe Zahara Benet       3 WIN     0          1          
##  5 <NA>      NA <NA>  <NA>                   NA <NA>    <NA>       <NA>       
##  6 <NA>      NA <NA>  <NA>                   NA <NA>    <NA>       <NA>       
##  7 S01        5 0     Ongina                  4 WIN     0          1          
##  8 <NA>      NA <NA>  <NA>                   NA <NA>    <NA>       <NA>       
##  9 <NA>      NA <NA>  <NA>                   NA <NA>    <NA>       <NA>       
## 10 <NA>      NA <NA>  <NA>                   NA <NA>    <NA>       <NA>       
## # ℹ 1,050 more rows
## # ℹ 3 more variables: minichalw <chr>, finale <dbl>, penultimate <dbl>

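# Subsets of tables can be saved as new objects. 
# The rows of NAs appear because outcome is missing (NA) for some rows: NA == "WIN" evaluates to NA, not FALSE,
# and square brackets return a row of NAs for every NA in the row index. Wrapping the whole thing in na.omit() drops them.
winners <- na.omit(rpdr_contep[rpdr_contep$outcome == "WIN", ])

# now, the filter function in dplyr (it drops rows where the condition is NA)
winners <- filter(rpdr_contep, outcome == "WIN")

# the same filter written with the pipe
winners <- rpdr_contep %>%
  filter(outcome == "WIN")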
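The rows full of NA in the square-bracket output come from how R compares missing values: NA == "WIN" is neither TRUE nor FALSE but NA, and indexing with NA returns a row of NAs. dplyr's filter() drops those rows for you. The sketch below reproduces the behavior in base R on a made-up table (results is hypothetical) and shows which() as one base R way to keep only the TRUE rows.

```r
# A made-up outcome column with one missing value
results <- data.frame(
  contestant = c("A", "B", "C", "D"),
  outcome    = c("WIN", NA, "SAFE", "WIN")
)

results$outcome == "WIN"        # TRUE NA FALSE TRUE

# Square brackets return a row of NAs for the NA comparison
results[results$outcome == "WIN", ]

# which() keeps only the positions that are TRUE, dropping the NA
results[which(results$outcome == "WIN"), ]
```

Unlike na.omit(), which() does not discard rows that merely have a missing value in some other column.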

Creating new variables

Now we’ll use the quakes dataset. It is fairly easy to create new variables in a data frame. All one has to do is name the column and assign values to it:

# Create a column called "big" that stores the value "YES" for all quakes over magnitude 5.7
data(quakes)
quakes[quakes$mag > 5.7, "big"] <- "YES"

This syntax is a little bit complicated. First, we are selecting all rows of the table that contain a quake with a magnitude greater than 5.7. Next, we are “creating” a column called “big” simply by entering “big” in the column address. Next we’re assigning the word “YES” to the selected row column combination. The net result of this is that we’ve made a new column but most of it is empty, because very few quakes are greater than 5.7.

# now to look at some rows of data to see if the "big" column was created correctly
quakes[10:20,] # Notice the <NA>?  These are missing data.

##       lat   long depth mag stations  big
## 10 -17.47 179.59   622 4.3       19 <NA>
## 11 -21.44 180.69   583 4.4       13 <NA>
## 12 -12.26 167.00   249 4.6       16 <NA>
## 13 -18.54 182.11   554 4.4       19 <NA>
## 14 -21.00 181.66   600 4.4       10 <NA>
## 15 -20.70 169.92   139 6.1       94  YES
## 16 -15.94 184.95   306 4.3       11 <NA>
## 17 -13.64 165.96    50 6.0       83  YES
## 18 -17.83 181.50   590 4.5       21 <NA>
## 19 -23.50 179.78   570 4.4       13 <NA>
## 20 -22.63 180.31   598 4.4       18 <NA>

We could fix the missing data simply by assigning the value “NO” to all quakes that are smaller than 5.7.

# Fix missing data
quakes[quakes$mag < 5.7, "big"] <- "NO"

We could also have used a function called ifelse to create the column big. Try to figure it out. You can get help on the ifelse function by typing ?ifelse into the console. You can get help for other functions the same way as long as they are already loaded into R, either in base R or in any packages you have loaded (we will get to that later).
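As a hint for the general pattern (deliberately on an unrelated, made-up vector rather than the quakes table), ifelse() takes a logical test, a value to use where the test is TRUE, and a value to use where it is FALSE:

```r
temps_f <- c(28, 41, 35, 32)    # made-up temperatures in Fahrenheit

# ifelse(test, value_if_true, value_if_false) works element by element
freezing <- ifelse(temps_f <= 32, "YES", "NO")
freezing
```

Because every element gets one of the two values, no gaps are left behind, provided the input itself has no missing values.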

Summarizing objects

R is terrible for ‘looking’ at data. It’s really hard and not very useful to look at tabular representations of your data in R. Instead of looking at the raw data, I use the summary function. For numeric variables it displays the minimum, maximum, mean, and the 25th, 50th (median), and 75th percentiles. You can actually use the summary function on almost any object in R, and it will give you some kind of useful output relevant to the object at hand…

summary(quakes)

##       lat              long           depth            mag      
##  Min.   :-38.59   Min.   :165.7   Min.   : 40.0   Min.   :4.00  
##  1st Qu.:-23.47   1st Qu.:179.6   1st Qu.: 99.0   1st Qu.:4.30  
##  Median :-20.30   Median :181.4   Median :247.0   Median :4.60  
##  Mean   :-20.64   Mean   :179.5   Mean   :311.4   Mean   :4.62  
##  3rd Qu.:-17.64   3rd Qu.:183.2   3rd Qu.:543.0   3rd Qu.:4.90  
##  Max.   :-10.72   Max.   :188.1   Max.   :680.0   Max.   :6.40  
##     stations          big           
##  Min.   : 10.00   Length:1000       
##  1st Qu.: 18.00   Class :character  
##  Median : 27.00   Mode  :character  
##  Mean   : 33.42                     
##  3rd Qu.: 42.00                     
##  Max.   :132.00

Factors: Non-numeric Variables

Notice that the summary of the “big” column is not especially useful. It tells us that the column stores characters. Actually, the column stores a variable that has two non-numeric levels. Many variables that you’ll work with, even if they appear as numbers, aren’t really numeric variables. Consider a variable called “habitat type” where the value “1” represents “wetlands” and the value “2” represents “forest”. R would see these as numeric variables even though the numbers are simply codes.

Categorical variables like these are stored as factors in R. Factors are variables that contain any number of distinct levels. R treats factors differently from numeric variables. Consider the vector of states Adam knows from earlier. All states in the US have numeric IDs called Federal Information Processing Standard (FIPS) codes. We could create a version of the vector statesAdamKnows that stores states by their FIPS codes.

statesAdamKnows = c(44, 25, 6, 22)

The problem is that adding Massachusetts (code 25) and Louisiana (code 22) together is nonsense. It is certainly not true that adding Massachusetts and Louisiana together equals Tennessee, even though 25+22=47. However, R has no problem doing just that.

statesAdamKnows[2] + statesAdamKnows[4]

## [1] 47

# Notice that because statesAdamKnows is a vector, not a data frame, each element needs only one index

We need to tell R that the vector statesAdamKnows is a factor, not a numeric variable. We can do this by telling R to store the vector as a factor.

statesAdamKnows = as.factor(statesAdamKnows)

After converting the vector to a factor, R behaves more reasonably. Trying to add MA and LA now returns NA along with the warning “‘+’ not meaningful for factors.” Converting the “big” column of the quakes table to a factor likewise yields a more reasonable and useful summary.

quakes$big = as.factor(quakes$big)
summary(quakes)

##       lat              long           depth            mag      
##  Min.   :-38.59   Min.   :165.7   Min.   : 40.0   Min.   :4.00  
##  1st Qu.:-23.47   1st Qu.:179.6   1st Qu.: 99.0   1st Qu.:4.30  
##  Median :-20.30   Median :181.4   Median :247.0   Median :4.60  
##  Mean   :-20.64   Mean   :179.5   Mean   :311.4   Mean   :4.62  
##  3rd Qu.:-17.64   3rd Qu.:183.2   3rd Qu.:543.0   3rd Qu.:4.90  
##  Max.   :-10.72   Max.   :188.1   Max.   :680.0   Max.   :6.40  
##     stations        big     
##  Min.   : 10.00   NO  :985  
##  1st Qu.: 18.00   YES :  7  
##  Median : 27.00   NA's:  8  
##  Mean   : 33.42             
##  3rd Qu.: 42.00             
##  Max.   :132.00

We have a problem. Why are there 8 NA’s in the big column? Try using is.na(quakes$big) to select rows from the quakes table. The goal is to produce output like what is shown below.

Which rows have NAs for quakes$big? The ones with a magnitude exactly equal to 5.7.

Fix the NA’s so that summary(quakes) looks like this:

##       lat              long           depth            mag      
##  Min.   :-38.59   Min.   :165.7   Min.   : 40.0   Min.   :4.00  
##  1st Qu.:-23.47   1st Qu.:179.6   1st Qu.: 99.0   1st Qu.:4.30  
##  Median :-20.30   Median :181.4   Median :247.0   Median :4.60  
##  Mean   :-20.64   Mean   :179.5   Mean   :311.4   Mean   :4.62  
##  3rd Qu.:-17.64   3rd Qu.:183.2   3rd Qu.:543.0   3rd Qu.:4.90  
##  Max.   :-10.72   Max.   :188.1   Max.   :680.0   Max.   :6.40  
##     stations       big     
##  Min.   : 10.00   NO :985  
##  1st Qu.: 18.00   YES: 15  
##  Median : 27.00            
##  Mean   : 33.42            
##  3rd Qu.: 42.00            
##  Max.   :132.00
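To make the habitat-type example from the start of this section concrete, here is a minimal sketch with made-up codes. It uses factor() with its levels and labels arguments, which converts the codes and attaches readable names in one step:

```r
# Habitat codes as they might arrive in a data file (made-up values)
habitat_code <- c(1, 2, 2, 1, 2)
mean(habitat_code)              # R happily averages the codes

# Tell R the codes are categories and give them readable labels
habitat <- factor(habitat_code, levels = c(1, 2),
                  labels = c("wetlands", "forest"))
levels(habitat)
table(habitat)                  # counts per category
summary(habitat)
```

Once the variable is a factor, summary() and table() report counts per category instead of meaningless averages.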

By group summaries

In a data.frame, factors often represent distinct types or categories. Sometimes we will want more detailed summaries than the generic summary function provides. We might want to summarize the big quakes and small quakes separately. We can do this by using the aggregate function. The ~ is used a lot in R. It stands for “described by” (or “is a function of”…). In this case, we’re describing the magnitude by the big variable we created using logical expressions. To each aggregate of the data (level of big), we are applying the FUN (which stands for function) summary.

aggregate(quakes$mag ~ quakes$big, FUN = summary)

##   quakes$big quakes$mag.Min. quakes$mag.1st Qu. quakes$mag.Median
## 1         NO        4.000000           4.300000          4.600000
## 2        YES        5.700000           5.700000          5.700000
##   quakes$mag.Mean quakes$mag.3rd Qu. quakes$mag.Max.
## 1        4.601523           4.800000        5.600000
## 2        5.860000           6.000000        6.400000

We can even do basic group by in plots:

boxplot(mag ~ big, data=quakes)
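The same formula pattern works on any table with a grouping column. As a self-contained sketch, the built-in mtcars data set (not part of this lab) can be summarized by number of cylinders. Passing the table through the data argument, as the boxplot call above does, keeps the column names in the result short:

```r
# mtcars ships with R: mpg is fuel economy, cyl is the number of cylinders
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mean_mpg

# Any function that summarizes a vector can be supplied as FUN
aggregate(mpg ~ cyl, data = mtcars, FUN = max)
```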

Part 2: Data Wrangling with the “tidyverse”

Most of the above was the “old” way to work with data sets in R. The new way uses a set of R packages that are grouped together under the name the “tidyverse”. Increasingly, everyone is working with data this way instead of the old way – in fact, many of your classmates are likely better at ‘wrangling’ data with tools in the tidyverse than the way that I just showed above. Here is a quick intro to these tools. Check out this webpage for more information on all of the packages and tools included in the tidyverse: https://www.tidyverse.org/

Let’s look at the data about flights. What information does each of these commands give you?

flights

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

?flights

Notice that the flights data is in a format called a “tibble”. Tibbles are data frames with a few tweaks that make them work more smoothly with tidyverse functions: for example, printing a tibble shows only the first 10 rows along with each column’s type. You can transform data.frames to tibbles and back again with these commands.

#convert the quakes data frame from before into a tibble
quakes <- as_tibble(quakes)

str(quakes) #this gives you information about the structure of the dataset as well as of each column within the data frame

## tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ lat     : num [1:1000] -20.4 -20.6 -26 -18 -20.4 ...
##  $ long    : num [1:1000] 182 181 184 182 182 ...
##  $ depth   : int [1:1000] 562 650 42 626 649 195 82 194 211 622 ...
##  $ mag     : num [1:1000] 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
##  $ stations: int [1:1000] 41 15 43 19 11 12 43 15 35 19 ...
##  $ big     : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...

glimpse(quakes) # this is similar to the str command, but also lets you look at the first several rows of each column

## Rows: 1,000
## Columns: 6
## $ lat      <dbl> -20.42, -20.62, -26.00, -17.97, -20.42, -19.68, -11.70, -28.1…
## $ long     <dbl> 181.62, 181.03, 184.10, 181.66, 181.96, 184.31, 166.10, 181.9…
## $ depth    <int> 562, 650, 42, 626, 649, 195, 82, 194, 211, 622, 583, 249, 554…
## $ mag      <dbl> 4.8, 4.2, 5.4, 4.1, 4.0, 4.0, 4.8, 4.4, 4.7, 4.3, 4.4, 4.6, 4…
## $ stations <int> 41, 15, 43, 19, 11, 12, 43, 15, 35, 19, 13, 16, 19, 10, 94, 1…
## $ big      <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, YES, …

# and convert back from tibbles to data.frames
quakes <- as.data.frame(quakes)

str(quakes)

## 'data.frame':    1000 obs. of  6 variables:
##  $ lat     : num  -20.4 -20.6 -26 -18 -20.4 ...
##  $ long    : num  182 181 184 182 182 ...
##  $ depth   : int  562 650 42 626 649 195 82 194 211 622 ...
##  $ mag     : num  4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
##  $ stations: int  41 15 43 19 11 12 43 15 35 19 ...
##  $ big     : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...

Do you see how they have different structures? If not, talk to another student or the TA to get help (that is true for each step in this lab.)
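One practical difference shows up when selecting a single column with square brackets. The sketch below uses a small made-up table and assumes the tibble package is installed (it is loaded along with the tidyverse):

```r
library(tibble)   # loaded automatically by library(tidyverse)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

class(df[, "x"])   # a data.frame drops a single column to a plain vector
class(tb[, "x"])   # a tibble stays a (one-column) tibble
tb$x               # use $ to pull the column out as a vector
```

This is why code written for data.frames occasionally behaves differently when handed a tibble.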

Data manipulation with dplyr (a package within the tidyverse)

The package dplyr (already loaded when you loaded the tidyverse) is a package that can do a lot of data manipulation for you.

Filter

Filtering the data allows you to select observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on December 31 with:

filter(flights, month==12, day==31)

## # A tibble: 776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    31       13           2359        14      439            437
##  2  2013    12    31       18           2359        19      449            444
##  3  2013    12    31       26           2245       101      129           2353
##  4  2013    12    31      459            500        -1      655            651
##  5  2013    12    31      514            515        -1      814            812
##  6  2013    12    31      549            551        -2      925            900
##  7  2013    12    31      550            600       -10      725            745
##  8  2013    12    31      552            600        -8      811            826
##  9  2013    12    31      553            600        -7      741            754
## 10  2013    12    31      554            550         4     1024           1027
## # ℹ 766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

#if you want to save the manipulation, assign it to an object
NYE_flights <- filter(flights, month==12, day==31)

Select

Whereas filter() selects rows by their values, select() selects columns. If you only need some of the variables in your data frame, you can select just those columns and save them as a new object, so that you only have to analyze a smaller data frame. This can be particularly useful when you have a large dataset and only need some of the variables. Make sure to save the result under a different name than the original unless you do want to overwrite it in your current R session. Otherwise, if you need anything else from the original dataset, you will have to load it in again.

flights2 <- select(flights, carrier, air_time, distance)
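Besides listing columns one by one, select() accepts ranges and helper functions. The sketch below uses a small made-up data frame (`toy` is a hypothetical name) so the results are easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data with a few flights-like columns
toy <- tibble(
  year = 2013, month = 1:3, day = 1:3,
  dep_delay = c(2, 4, NA), arr_delay = c(11, 20, NA),
  carrier = c("UA", "AA", "B6")
)

# A range of adjacent columns
dates <- select(toy, year:day)

# Columns whose names end with a given suffix
delays <- select(toy, ends_with("delay"))

# Everything except the named column
no_carrier <- select(toy, -carrier)

names(dates)       # "year" "month" "day"
names(delays)      # "dep_delay" "arr_delay"
names(no_carrier)  # "year" "month" "day" "dep_delay" "arr_delay"
```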

Mutate

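mutate() creates new variables (columns) from existing ones and adds them to the end of the data frame. In the flights data, distance is measured in miles and air_time in minutes, so the speed computed here is in miles per hour.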

mutate(flights,
  speed = distance / air_time * 60
)

## # A tibble: 336,776 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, speed <dbl>
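Two details of mutate() are easy to miss: it returns a new data frame rather than changing the original, and a column created earlier in the same call can be used later in that call. A minimal sketch with made-up numbers (`toy` and `toy_speed` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: distance in miles, air_time in minutes
toy <- tibble(distance = c(100, 300), air_time = c(30, 60))

toy_speed <- mutate(toy,
  hours = air_time / 60,      # created first...
  speed = distance / hours    # ...and used immediately
)

ncol(toy)        # 2: the original is unchanged
ncol(toy_speed)  # 4
toy_speed$speed  # 200 300
```

To keep the new columns, assign the result to an object, as is done here with `toy_speed`.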

Summarize

The summarize() function (also spelled summarise(); both spellings work) is similar to the aggregate() function that we learned about above: it collapses a data frame into summary values. Here is an example:

summarize(flights, ave_dep = mean(dep_time, na.rm = TRUE))

## # A tibble: 1 × 1
##   ave_dep
##     <dbl>
## 1   1349.

Note that this created just one value: the average of dep_time for all flights in 2013 from a NYC airport. (Because dep_time is stored as a number in HHMM format, this average is only a rough guide to the typical departure time.) This probably isn’t what we wanted. What if we wanted to compare the average departure delay by airport or by airline? We could combine two steps into one using the pipe (%>%) and the group_by() function.

Combining Steps with the Pipe

flights %>%
    group_by(origin) %>%
    summarize(mean_delay = mean(dep_delay))

## # A tibble: 3 × 2
##   origin mean_delay
##   <chr>       <dbl>
## 1 EWR            NA
## 2 JFK            NA
## 3 LGA            NA

Hmmm… why didn’t that work? Because dep_delay contains NA values, and the mean of any set of values that includes an NA is itself NA. We can fix that by adding the argument “na.rm = TRUE”, which means to remove (rm) the NA values before taking the mean. Many functions in R accept this argument.

flights %>%
    group_by(origin) %>%
    summarize(mean_delay = mean(dep_delay, na.rm=TRUE))

## # A tibble: 3 × 2
##   origin mean_delay
##   <chr>       <dbl>
## 1 EWR          15.1
## 2 JFK          12.1
## 3 LGA          10.3
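The effect of na.rm can be seen on a three-element vector in base R. This is a minimal sketch with made-up values (`delays` is a hypothetical name):

```r
# A made-up vector of delays with one missing value
delays <- c(5, NA, 10)

mean(delays)                # NA: one missing value makes the whole mean missing
mean(delays, na.rm = TRUE)  # 7.5: the NA is dropped before averaging
sum(is.na(delays))          # 1: how many values were missing
```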

You can also compute several summary variables in a single summarize() call. A key new function here is n(), which simply counts the rows within each group.

carrier_data <- flights %>%
    group_by(carrier) %>%
    summarize(mean_delay = mean(dep_delay, na.rm=TRUE), 
              count = n(), 
              max_delay = max(dep_delay, na.rm=TRUE), 
              mean_distance = mean(distance, na.rm=TRUE))
carrier_data

## # A tibble: 16 × 5
##    carrier mean_delay count max_delay mean_distance
##    <chr>        <dbl> <int>     <dbl>         <dbl>
##  1 9E           16.7  18460       747          530.
##  2 AA            8.59 32729      1014         1340.
##  3 AS            5.80   714       225         2402 
##  4 B6           13.0  54635       502         1069.
##  5 DL            9.26 48110       960         1237.
##  6 EV           20.0  54173       548          563.
##  7 F9           20.2    685       853         1620 
##  8 FL           18.7   3260       602          665.
##  9 HA            4.90   342      1301         4983 
## 10 MQ           10.6  26397      1137          570.
## 11 OO           12.6     32       154          501.
## 12 UA           12.1  58665       483         1529.
## 13 US            3.78 20536       500          553.
## 14 VX           12.9   5162       653         2499.
## 15 WN           17.7  12275       471          996.
## 16 YV           19.0    601       387          375.
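Keep in mind that n() counts every row in a group, including rows where the variable being summarized is NA. The sketch below, on made-up data (`toy` and the column `n_with_delay` are hypothetical names), contrasts n() with a count of non-missing values.

```r
library(dplyr)

# Hypothetical toy data with some missing delays
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(5, NA, -2, 10, NA)
)

toy_summary <- toy %>%
  group_by(carrier) %>%
  summarize(
    count        = n(),                    # all rows in the group
    n_with_delay = sum(!is.na(dep_delay)), # rows with a recorded delay
    mean_delay   = mean(dep_delay, na.rm = TRUE)
  )
toy_summary
```

Here AA has 2 rows but only 1 recorded delay, and UA has 3 rows but only 2 recorded delays, so the means are based on fewer observations than `count` suggests.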

Pulling

Sometimes you might want to convert a column into a vector. For this, we use the pull() function. Let’s say we hypothesize that a plane that arrives on time on New Year’s Day has good luck, and will then arrive on time for the rest of the year.

lucky_planes <- flights %>%
  filter(month == 1 & day == 1 & arr_delay < 1) %>%
  pull(tailnum) %>%
  unique()

So, we filtered for flights on January 1 with an arrival delay of less than one minute. Then we pulled the tailnum column and called unique() to get rid of repeats. Now, let’s find out if those planes were lucky.

flights %>%
  filter(tailnum %in% lucky_planes) %>%
  pull(arr_delay) %>%
  mean(na.rm = TRUE) %>%
  paste("lucky planes:", .) %>%
  print()

## [1] "lucky planes: 6.92185807693939"

flights %>%
  filter(!tailnum %in% lucky_planes) %>%
  pull(arr_delay) %>%
  mean(na.rm = TRUE) %>%
  paste("unlucky planes:", .) %>%
  print()

## [1] "unlucky planes: 6.89061869341345"

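That doesn’t look like a big difference: the average arrival delay is about 6.92 minutes for the “lucky” planes and about 6.89 minutes for all other flights.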
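To make the difference between selecting and pulling a column concrete, here is a minimal sketch on made-up data (`toy`, `as_column`, and `as_vector` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(tailnum = c("N1", "N2", "N1"), arr_delay = c(-3, 12, 0))

as_column <- select(toy, tailnum)  # still a data frame with one column
as_vector <- pull(toy, tailnum)    # a plain character vector

is.data.frame(as_column)  # TRUE
is.character(as_vector)   # TRUE
unique(as_vector)         # "N1" "N2"
```

Functions such as mean() and unique(), and the right-hand side of `%in%`, expect a vector, which is why pull() is used in the pipelines above.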

Merging

Merging data can be done with older functions such as merge(), but it can also be done with various functions in the tidyverse. You might need to merge different datasets together by a common identifying variable for your final project in this course, or for a future research or job project.

There are a bunch of datasets that we can merge to our flight data. Airlines, airports, planes, and weather can all be joined to flights. First look at all of the data:

data(airlines)
data(airports)
data(planes)
data(weather)

#use the head() function to see if there are any variables that you can merge on

We have to have a variable in common on which to merge datasets together. You can see that the airlines dataset could merge to the flights dataset by the variable “carrier”.

There are various kinds of merges, called joins in the tidyverse. A left_join() keeps every observation in the first (left) dataset and attaches the matching columns from the second (right) dataset, using the variable(s) that you state they have in common. Observations in the left dataset without a match get NA values in the new columns, and observations in the right dataset without a match are dropped. A right_join() is the opposite: all observations in the second dataset are kept (with NA values where there is no match) and those in the first are dropped if they don’t match. An inner_join() keeps only observations that match in both datasets, and a full_join() (an “outer join”; dplyr has no function named outer_join()) keeps all observations regardless of match, creating NA values where there is no match. When an observation in one dataset matches more than one observation in the other, it is merged to every observation that it matches. For more information, look here: https://r4ds.had.co.nz/relational-data.html#nycflights13-relational
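The four join types can be compared on two tiny made-up tables that share only some keys. This is a minimal sketch; `left_tbl`, `right_tbl`, and the result names are hypothetical.

```r
library(dplyr)

# Hypothetical toy tables: keys "b" and "c" appear in both
left_tbl  <- tibble(key = c("a", "b", "c"), x = 1:3)
right_tbl <- tibble(key = c("b", "c", "d"), y = c(20, 30, 40))

left_res  <- left_join(left_tbl, right_tbl, by = "key")   # a, b, c (y is NA for a)
right_res <- right_join(left_tbl, right_tbl, by = "key")  # b, c, d (x is NA for d)
inner_res <- inner_join(left_tbl, right_tbl, by = "key")  # b, c
full_res  <- full_join(left_tbl, right_tbl, by = "key")   # a, b, c, d

nrow(left_res)   # 3
nrow(right_res)  # 3
nrow(inner_res)  # 2
nrow(full_res)   # 4
```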

Here are a couple of examples:

flights3 <- flights %>%
  left_join(airlines, by="carrier")
dim(flights)

## [1] 336776     19

dim(airlines)

## [1] 16  2

dim(flights3)

## [1] 336776     20

#do you see how the left_join operates?

#when the key is different for the two datasets, you state the key from the left = the key from the right
flights4 <- flights %>% 
  inner_join(airports, by=c("dest" = "faa"))
dim(flights)

## [1] 336776     19

dim(airports)

## [1] 1458    8

dim(flights4)

## [1] 329174     26

Whenever I merge data, I do many checks to make sure the merge worked correctly. You don’t want to have a situation in which you do analysis on incorrectly merged data.
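A few such checks can be sketched on made-up data. The tables `toy_flights` and `toy_airports`, the destination code "XXX", and the airport names are all hypothetical; the functions used (left_join(), anti_join(), anyDuplicated()) are standard dplyr and base R.

```r
library(dplyr)

# Hypothetical toy data: one destination ("XXX") is missing from the lookup table
toy_flights  <- tibble(flight = 1:4, dest = c("ORD", "IAH", "XXX", "ORD"))
toy_airports <- tibble(faa = c("ORD", "IAH"), name = c("Airport A", "Airport B"))

joined <- left_join(toy_flights, toy_airports, by = c("dest" = "faa"))

# Check 1: a left join on a unique key should not change the number of rows
same_rows <- nrow(joined) == nrow(toy_flights)

# Check 2: the key should be unique in the lookup table
key_is_unique <- anyDuplicated(toy_airports$faa) == 0

# Check 3: which rows on the left found no match on the right?
unmatched <- anti_join(toy_flights, toy_airports, by = c("dest" = "faa"))

same_rows       # TRUE
key_is_unique   # TRUE
unmatched$dest  # "XXX"
```

The same logic explains the dimensions printed above: the inner_join() with airports returned 329,174 of the 336,776 flights, so 7,602 flights have a dest with no match in airports.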

More on missing data

You saw above that when a value is missing, R represents it as NA. The function is.na() tells us whether a value is missing or not. We can also use it to remove missing values by putting !is.na() inside a filter() call.

summary(flights) #this shows us which variables have NA values. Which columns have NA values?

##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
##  Median :29.00   Median :2013-07-03 10:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
##

#keep only flights with a recorded air time
#(this drops canceled flights, plus some flights that departed but have no recorded air time)
not_canceled <- flights %>%
  filter(!is.na(air_time))

#what if we wanted the information on canceled flights?
canceled <- flights %>%
  filter(is.na(air_time))

#which airlines have the most canceled flights?
canceled %>%
  group_by(carrier) %>%
  summarize(n_canceled = n()) %>%
  ungroup() # good habit to always ungroup

## # A tibble: 15 × 2
##    carrier n_canceled
##    <chr>        <int>
##  1 9E            1166
##  2 AA             782
##  3 AS               5
##  4 B6             586
##  5 DL             452
##  6 EV            3065
##  7 F9               4
##  8 FL              85
##  9 MQ            1360
## 10 OO               3
## 11 UA             883
## 12 US             705
## 13 VX              46
## 14 WN             231
## 15 YV              57
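One reason is.na() is needed is that comparing a value to NA with `==` never returns TRUE. A minimal base R sketch with a made-up vector (`x` is a hypothetical name):

```r
# A made-up vector with one missing value
x <- c(1, NA, 3)

x == NA        # NA NA NA: comparing with NA always gives NA
is.na(x)       # FALSE TRUE FALSE
x[!is.na(x)]   # 1 3: keep only the non-missing values
```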

Conclusion

Data manipulation is a large part of any statistical analysis. In this exercise we have just scratched the surface. In the first lab, we’ll do some very complex data manipulation. It’s really easy to find help in online forums on data manipulation problems in R. However, it’s important to use the right vocabulary in your Google searches: “data frame”, “factor”, and “vector” are all important keywords for finding relevant help. Words like “table” are not useful. Because R is open source, anyone can contribute, which sometimes leads to messy things. For example, “tables” are a special data structure in R that is different from data frames!

Assignment (100 pts)

For the following questions, please complete the code to produce the required answers.

Q1: Create a numeric vector with 10 random numbers, find the median value, and plot the histogram (see lecture slides) (10 pts)

num <- c(1,4,5,6,7,3,4,10,11,34)
median(num)
hist(num,
col = 'red'
)

### The median is 5.5.

Q2a: Load the flights data into R, and find the summary information about departure delays (5 pts).

data(flights)
# summary() reports the number of NA values on its own, so no extra argument is needed
summary(flights$dep_delay)

Q2b: Explain what all the summary measures mean for the dep_time column (5 pts).

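The summary measures describe the distribution of dep_time, the actual departure time recorded in HHMM format (for example, 517 means 5:17 am). Min. and Max. are the earliest and latest departure times (1 and 2400). The 1st Qu., Median, and 3rd Qu. are the times by which 25%, 50%, and 75% of flights had departed (907, 1401, and 1744). Mean is the arithmetic average (1349), and NA's counts the 8,255 flights with no recorded departure time.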

Q3: Find all flights that flew to Houston (10 pts).

# Houston has two airports in this dataset: IAH and HOU
hou_flights <- flights %>%
  filter(dest %in% c("IAH", "HOU"))
nrow(hou_flights)

### There were 9,313 flights that flew to Houston (7,198 to IAH and 2,115 to HOU).

Q4: Find all flights that had an arrival delay of more than two hours (10 pts).

late_flights <- flights %>% 
  filter(arr_delay > 120)
nrow(late_flights)

### There were 10,034 flights that arrived more than 2 hours late.

Q5: Find all flights that departed in summer (July, August, and September). (10 pts)

summer_flights <- flights %>%
  filter(month >= 7 & month <=9)
nrow(summer_flights)

### There were 86,326 flights that departed in the summer.

Q6: Find all flights that arrived more than two hours late, but didn’t leave late. (10 pts)

del_flights <- flights %>%
  filter(arr_delay > 120 & dep_delay <= 0)
nrow(del_flights)

### There were 29 flights that arrived more than 2 hours late but did not leave late.

Q7: Which plane (tailnum) has the worst on-time record?. There are a few different ways to define the on-time record, (most flights that are late, average arrival delay, etc.). Do whatever makes sense to you. (10 pts)

worst_plane <- flights %>%
  group_by(tailnum) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  na.omit() %>%
  filter(avg_delay == max(avg_delay))
print(worst_plane)

### The plane with the worst on-time record was plane N844MH, with an average arrival delay of 320 minutes.

Q8: What time of day should you fly if you want to avoid delays as much as possible?. HINT: the column hour is the scheduled departure hour. (10 pts)

best_time <- flights %>%
  group_by(hour) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  na.omit() %>%
  filter(avg_delay == min(avg_delay))
print(best_time)

Answer: You should plan to leave around 5 am to avoid delays as much as possible.

Q9: How many flights were flown by planes that flew over 100 flights? You might need to do this in two steps, where first you create a vector of planes with more than 100 flights, then use that to filter out the flights. (20 pts)

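planes_100 <- flights %>%
  group_by(tailnum) %>%
  summarize(num_flights = n()) %>%
  filter(num_flights > 100) %>%
  pull(tailnum)
print(planes_100)

flight_count <- flights %>%
  filter(tailnum %in% planes_100)
nrow(flight_count)

# 229,202 flights were flown by planes that flew more than 100 flights.
# Note: flights with a missing tailnum are grouped together under NA, so this count can include them;
# add filter(!is.na(tailnum)) as the first step to exclude them.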

Acknowledgements

This lab is a modified version of many previous labs created by many people. Thanks to Carson Farmer, Seth Spielman, Colleen Reid and Adam Mahood.

Geog 4023/5023 Lab 1: Data Wrangling with R

FirstName LastName

01/27/2025
