R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

This exercise is under construction. Please report any errors at https://forms.gle/2W4tffs4YJA1jeBv9

Goal: Experience the Fraud Analytics Process Model

Background: The Fraud Analytics Process Model:

  1. ID Problem
  2. ID Data Sources
  3. Select Data
  4. Clean Data
  5. Transform Data
  6. Analyze Data
  7. Interpret Model
  8. Refine Model

The first ten questions are for the individual assignment and the remaining questions are for the team assignment. Specifically:

  1. For the individual assignment: delete everything below Q10, then knit and submit your html and Rmd files.
  2. For the team assignment: pick any individual team member’s answers for the first ten questions, then complete and knit the remaining questions to submit.

Individual assignment: 25 total points. Team assignment: 115 total points.

[1 point] Q1.

Start by entering your name and today’s date in Lines 3 and 4, respectively. Then, run the chunk of code below by clicking the green right-pointing arrow at the top right of the chunk. Tip: This line of code (knitr…) is generated by RStudio. It lets you format (knit) your output so you can share it in a readable format. We will use html output, as specified in Line 5. You will be instructed to “knit” after completing the assignment without any errors. Tip: I numbered the code chunks to match their question numbers. Chunk 1 specifies the knitting parameters.

1. ID Problem:

Flight cancellations and delays due to weather are not the airlines’ responsibility. Our goal is to detect and report any delays that are not due to weather for further investigation.

2. ID Data Sources

We will analyze data in nycflights13 because it contains information about all flights originating from NYC airports. Please take a moment to skim https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf with a focus on finding information about cancellations/delays and weather.

[2 points] Q2.

Install and load the packages tidyverse and nycflights13. Tip: The tidyverse package gives you a tidy way of manipulating data. Tip: The command install.packages("packageName") installs the package packageName, and library(packageName) loads it. Tip: You only need to install a package once; after that, you only need to load it. In fact, you will need to remove (or comment out) the install command to knit and submit your solutions.

# install.packages("tidyverse")

# Load tidyverse package
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("nycflights13")

# Load nycflights13 package
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.3.2
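
The install-once, load-every-time tip can be sketched as a small guard. This is a minimal sketch and not part of the assignment: `load_or_install` is a hypothetical helper name, and the demonstration uses `stats` (which ships with R) so that nothing is downloaded. For this exercise you would pass "tidyverse" or "nycflights13" instead.

```r
# Hypothetical helper (not from the assignment): install a package only when
# it is missing, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is not yet installed
  }
  library(pkg, character.only = TRUE)  # attach the package named by the string
}

# Demonstration with a package that ships with R, so nothing is installed.
load_or_install("stats")
"package:stats" %in% search()
```

The assignment still asks you to remove the install command before knitting, so treat this guard as a convenience for interactive work rather than something to submit.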

[1 point] Q3.

Confirm that the packages you loaded in the previous chunk are attached by using the search() function. Nothing is enclosed in the parentheses because search() doesn’t need any information (parameters).

search()
##  [1] ".GlobalEnv"           "package:nycflights13" "package:lubridate"   
##  [4] "package:forcats"      "package:stringr"      "package:dplyr"       
##  [7] "package:purrr"        "package:readr"        "package:tidyr"       
## [10] "package:tibble"       "package:ggplot2"      "package:tidyverse"   
## [13] "package:stats"        "package:graphics"     "package:grDevices"   
## [16] "package:utils"        "package:datasets"     "package:methods"     
## [19] "Autoloads"            "package:base"
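
Instead of reading the search() output by eye, the same check can be done programmatically. A minimal sketch, assuming a hypothetical helper name `is_attached`; it relies only on the fact that attached packages appear in search() with the prefix "package:", as in the output above.

```r
# Hypothetical helper: TRUE when the named package is on the search path.
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

is_attached("base")          # base is always attached
is_attached("nycflights13")  # TRUE only after library(nycflights13) has run
```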

[1 point] Q4.

Now, take a peek inside nycflights13 by using the command ls("package:nycflights13"). Tip: ls("package:packageName") gives the names of the objects in packageName.

library(ggplot2)

ls("package:nycflights13")
## [1] "airlines" "airports" "flights"  "planes"   "weather"
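
The same ls("package:...") pattern works for any attached package. The sketch below uses the `datasets` package that ships with R (chosen only so the example runs without extra installs) and also looks up what kind of object each name refers to; the variable names `vObjects` and `vKinds` are illustrative.

```r
library(datasets)  # ships with R; attached here explicitly for clarity

# Names of all objects exported by the attached package
vObjects <- ls("package:datasets")
length(vObjects)
head(vObjects)

# Check whether a particular object is present
"mtcars" %in% vObjects

# Look up the class of a few objects by name
vKinds <- sapply(
  c("mtcars", "airquality"),
  function(nm) class(get(nm, envir = as.environment("package:datasets")))[1]
)
vKinds
```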

[1 point] Q5.

Now, see what is in your environment by using the command ls() without anything inside the parentheses. Tip: ls() provides information about data sets and functions defined by the user. This will give character(0) if there is nothing in the environment. It is a good idea to start every R script with a clean environment so previous data doesn’t corrupt your script. Tip: You may clean your environment at any time by using the broom in the environment tab (likely near top right of RStudio) or using the command rm(list = ls()).

ls()
## character(0)
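
To see what ls() and rm(list = ls()) do without touching your real workspace, the sketch below runs them against a scratch environment. The environment `scratch` and the object names in it are made up for illustration; in the assignment you would call ls() and rm(list = ls()) with no environment argument.

```r
# A scratch environment stands in for the global environment.
scratch <- new.env()

# Create two objects inside it
assign("tExample", data.frame(a = 1:3), envir = scratch)
assign("nCount", 3L, envir = scratch)

ls(scratch)  # lists the two names just created

# Remove everything in the scratch environment, mirroring rm(list = ls())
rm(list = ls(scratch), envir = scratch)

ls(scratch)  # character(0): the environment is clean again
```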

[2 points] Q6.

Find the tibbles that contain information about flight delays and weather using the nycflights13 documentation, and assign them to variable names of your choosing. Tip: Tibbles are a more robust and refined version of data frames. (Google tibble if you want to learn more about tibbles, but it is not necessary at this time.) Tip: I name variables with an abbreviation of the data type followed by a description of the variable. For example, I would use tFlights for the tibble containing flights data so I know that it is a tibble just by reading the variable name. I don’t mind long variable names because I copy-paste them. (You may learn more about naming conventions at https://en.wikipedia.org/wiki/Naming_convention_(programming).)

# Assign flights tibble to tFlights
tFlights <- flights

# Assign weather tibble to tWeather
tWeather <- weather
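
A tibble is a data frame that carries the extra class `tbl_df`, so the two can be told apart by class. The sketch below is a base-R illustration under that assumption; `is_tibble_like` and `dfPlain` are hypothetical names, and the tibble package's own `is_tibble()` does the same job when tidyverse is loaded.

```r
# Hypothetical helper: TRUE when an object carries the tibble class "tbl_df".
is_tibble_like <- function(x) inherits(x, "tbl_df")

# A plain data frame for comparison (airport codes taken from the exercise)
dfPlain <- data.frame(origin = c("EWR", "LGA", "JFK"))

is_tibble_like(dfPlain)  # FALSE: a plain data frame has no tbl_df class
is.data.frame(dfPlain)   # TRUE
class(dfPlain)
```

Calling `is_tibble_like(tFlights)` after the assignment above should return TRUE, since flights is stored as a tibble.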

[1 point] Q7.

Now, use ls() to see how your environment has changed. Tip: Also, check out the new variables in the environment tab.

ls()
## [1] "tFlights" "tWeather"

[2 points] Q8.

Use the head function to print and check out the first 20 rows of the tibbles containing the flights and weather data. Tip: Use https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf to understand the details of each data set. Tip: Always check what you read into your program. Tip: Get accustomed to vertical and horizontal scrolling. Tip: Get accustomed to clicking on different tabs when they are printed by one code chunk.

head(tFlights, 20)
## # A tibble: 20 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## 11  2013     1     1      558            600        -2      849            851
## 12  2013     1     1      558            600        -2      853            856
## 13  2013     1     1      558            600        -2      924            917
## 14  2013     1     1      558            600        -2      923            937
## 15  2013     1     1      559            600        -1      941            910
## 16  2013     1     1      559            559         0      702            706
## 17  2013     1     1      559            600        -1      854            902
## 18  2013     1     1      600            600         0      851            858
## 19  2013     1     1      600            600         0      837            825
## 20  2013     1     1      601            600         1      844            850
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
head(tWeather, 20)
## # A tibble: 20 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## 11 EWR     2013     1     1    11  41    27.0  57.1      260      15.0 
## 12 EWR     2013     1     1    13  39.2  28.4  69.7      330      16.1 
## 13 EWR     2013     1     1    14  39.0  24.1  54.7      280      13.8 
## 14 EWR     2013     1     1    15  37.9  24.1  57.0      290       9.21
## 15 EWR     2013     1     1    16  37.0  19.9  49.6      300      13.8 
## 16 EWR     2013     1     1    17  36.0  19.0  49.8      330      11.5 
## 17 EWR     2013     1     1    18  34.0  15.1  45.4      310      12.7 
## 18 EWR     2013     1     1    19  33.1  12.9  42.8      320      10.4 
## 19 EWR     2013     1     1    20  32    15.1  49.2      310      15.0 
## 20 EWR     2013     1     1    21  30.0  12.9  48.5      320      18.4 
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>
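
The second argument of head() controls how many rows come back. A minimal sketch using the built-in `mtcars` data frame (chosen only because it needs no packages; the variable names are illustrative):

```r
# mtcars ships with R and has 32 rows
nTotal <- nrow(mtcars)

# Without a second argument, head() returns the first 6 rows
dfDefault <- head(mtcars)

# With n = 20, it returns the first 20 rows, as in head(tFlights, 20)
dfFirst20 <- head(mtcars, 20)

# tail() is the counterpart for the last rows
dfLast3 <- tail(mtcars, 3)

c(total = nTotal, default = nrow(dfDefault),
  first20 = nrow(dfFirst20), last3 = nrow(dfLast3))
```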

[4 points] Q9.

Now, print the structure and summary statistics of the tibbles containing the flights and weather data. Tip: Get accustomed to working with datasets that can’t fit on a spreadsheet (or in your head).

# Print the structure of the flights tibble
str(tFlights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# Print summary statistics of the flights tibble
summary(tFlights)
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
##  Median :29.00   Median :2013-07-03 10:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
## 
# Print the structure of the weather tibble
str(tWeather)
## tibble [26,115 × 15] (S3: tbl_df/tbl/data.frame)
##  $ origin    : chr [1:26115] "EWR" "EWR" "EWR" "EWR" ...
##  $ year      : int [1:26115] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month     : int [1:26115] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : int [1:26115] 1 1 1 1 1 1 1 1 1 1 ...
##  $ hour      : int [1:26115] 1 2 3 4 5 6 7 8 9 10 ...
##  $ temp      : num [1:26115] 39 39 39 39.9 39 ...
##  $ dewp      : num [1:26115] 26.1 27 28 28 28 ...
##  $ humid     : num [1:26115] 59.4 61.6 64.4 62.2 64.4 ...
##  $ wind_dir  : num [1:26115] 270 250 240 250 260 240 240 250 260 260 ...
##  $ wind_speed: num [1:26115] 10.36 8.06 11.51 12.66 12.66 ...
##  $ wind_gust : num [1:26115] NA NA NA NA NA NA NA NA NA NA ...
##  $ precip    : num [1:26115] 0 0 0 0 0 0 0 0 0 0 ...
##  $ pressure  : num [1:26115] 1012 1012 1012 1012 1012 ...
##  $ visib     : num [1:26115] 10 10 10 10 10 10 10 10 10 10 ...
##  $ time_hour : POSIXct[1:26115], format: "2013-01-01 01:00:00" "2013-01-01 02:00:00" ...
# Print summary statistics of the weather tibble
summary(tWeather)
##     origin               year          month             day       
##  Length:26115       Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  Mode  :character   Median :2013   Median : 7.000   Median :16.00  
##                     Mean   :2013   Mean   : 6.504   Mean   :15.68  
##                     3rd Qu.:2013   3rd Qu.: 9.000   3rd Qu.:23.00  
##                     Max.   :2013   Max.   :12.000   Max.   :31.00  
##                                                                    
##       hour            temp             dewp           humid       
##  Min.   : 0.00   Min.   : 10.94   Min.   :-9.94   Min.   : 12.74  
##  1st Qu.: 6.00   1st Qu.: 39.92   1st Qu.:26.06   1st Qu.: 47.05  
##  Median :11.00   Median : 55.40   Median :42.08   Median : 61.79  
##  Mean   :11.49   Mean   : 55.26   Mean   :41.44   Mean   : 62.53  
##  3rd Qu.:17.00   3rd Qu.: 69.98   3rd Qu.:57.92   3rd Qu.: 78.79  
##  Max.   :23.00   Max.   :100.04   Max.   :78.08   Max.   :100.00  
##                  NA's   :1        NA's   :1       NA's   :1       
##     wind_dir       wind_speed         wind_gust         precip        
##  Min.   :  0.0   Min.   :   0.000   Min.   :16.11   Min.   :0.000000  
##  1st Qu.:120.0   1st Qu.:   6.905   1st Qu.:20.71   1st Qu.:0.000000  
##  Median :220.0   Median :  10.357   Median :24.17   Median :0.000000  
##  Mean   :199.8   Mean   :  10.518   Mean   :25.49   Mean   :0.004469  
##  3rd Qu.:290.0   3rd Qu.:  13.809   3rd Qu.:28.77   3rd Qu.:0.000000  
##  Max.   :360.0   Max.   :1048.361   Max.   :66.75   Max.   :1.210000  
##  NA's   :460     NA's   :4          NA's   :20778                     
##     pressure          visib          time_hour                    
##  Min.   : 983.8   Min.   : 0.000   Min.   :2013-01-01 01:00:00.0  
##  1st Qu.:1012.9   1st Qu.:10.000   1st Qu.:2013-04-01 21:30:00.0  
##  Median :1017.6   Median :10.000   Median :2013-07-01 14:00:00.0  
##  Mean   :1017.9   Mean   : 9.255   Mean   :2013-07-01 18:26:37.7  
##  3rd Qu.:1023.0   3rd Qu.:10.000   3rd Qu.:2013-09-30 13:00:00.0  
##  Max.   :1042.1   Max.   :10.000   Max.   :2013-12-30 18:00:00.0  
##  NA's   :2729
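
The NA's rows in the summaries above can also be pulled out as a compact table of missing-value counts. This is a sketch on the built-in `airquality` data frame, used here only so the example runs without nycflights13; applying the same two lines to tFlights or tWeather is an assumption about how you might continue, not a required step.

```r
# Count missing values in every column
vNaCounts <- colSums(is.na(airquality))

# Keep only the columns that actually have missing values
vNaCounts[vNaCounts > 0]

# Express the counts as a percentage of all rows
vNaShare <- round(100 * vNaCounts / nrow(airquality), 1)
vNaShare
```

A view like this makes it easier to spot columns such as wind_gust, where the summary shows NA's for most of the rows.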

3. Select Data

[10 points] Q10.

Using the information above, identify columns of each tibble that you will need for your analysis. Tip: Copy-paste from https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf and then delete what you don’t need after completing your analysis. This will serve as your data dictionary. Tip: You may update this list as you progress through the rest of the assignment. Don’t worry about getting it perfect right now.

### This section doesn't require code. Just list all the columns of each dataset you'll need for your analysis.### 
#tFlights Tibble:
#year: The year of the flight.
#month: The month of the flight.
#day: The day of the flight.
#dep_time: The actual departure time (in local time).
#dep_delay: The delay in departure time (in minutes).
#arr_time: The actual arrival time (in local time).
#arr_delay: The delay in arrival time (in minutes).
#carrier: The airline carrier code.
#flight: The flight number.
#origin: The airport code of the origin airport.
#dest: The airport code of the destination airport.
#air_time: The duration of the flight (in minutes).
#distance: The distance of the flight (in miles).
#hour: The hour of the scheduled departure (in local time).
#minute: The minute of the scheduled departure (in local time).
#time_hour: The scheduled date and hour of the flight, stored as a POSIXct date-time.
#tWeather Tibble:
#origin: The weather station (airport code), named origin to match the flights data.
#year: The year of the weather observation.
#month: The month of the weather observation.
#day: The day of the weather observation.
#hour: The hour of the day of the weather observation.
#temp: The temperature (in degrees Fahrenheit).
#dewp: The dew point (in degrees Fahrenheit).
#humid: The relative humidity (in percentage).
#wind_dir: The wind direction (in degrees).
#wind_speed: The wind speed (in miles per hour).
#wind_gust: The wind gust speed (in miles per hour).
#precip: The precipitation (in inches).
#pressure: The sea level pressure (in millibars).
#visib: The visibility (in miles).
#time_hour: The date and time of the weather observation, including the year, month, day, and hour.
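
Although Q10 needs no code, the column list can later be turned into a selection step. The sketch below is one possible way, not the assignment's prescribed method: `tToy` is a three-row stand-in built from values printed earlier in this document, and `vNeeded`, `vMissing`, and `tSelected` are illustrative names.

```r
# Three-row stand-in for tFlights, using values printed earlier in this document
tToy <- data.frame(
  year = 2013L, month = 1L, day = 1L,
  dep_delay = c(2, 4, 2),
  arr_delay = c(11, 20, 33),
  carrier = c("UA", "UA", "AA"),
  tailnum = c("N14228", "N24211", "N619AA"),
  origin = c("EWR", "LGA", "JFK")
)

# Columns chosen for the analysis (tailnum is deliberately left out)
vNeeded <- c("year", "month", "day", "dep_delay", "arr_delay", "carrier", "origin")

# Fail early if the data dictionary names a column that does not exist
vMissing <- setdiff(vNeeded, names(tToy))
stopifnot(length(vMissing) == 0)

# Keep only the needed columns; drop = FALSE guarantees a data frame result
tSelected <- tToy[, vNeeded, drop = FALSE]
names(tSelected)
```

With tidyverse loaded, `dplyr::select(tFlights, all_of(vNeeded))` would be the equivalent call on the real data.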

For the individual assignment submissions: Delete everything below this line. Then, knit and submit your html and Rmd files. For the team assignment submissions: Pick any individual team member’s answers for the first ten questions. Then complete the remaining questions to submit for the team assignment.