This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
This exercise is under construction. Please report any errors at https://forms.gle/2W4tffs4YJA1jeBv9
Goal: Experience the Fraud Analytics Process Model
Background: The Fraud Analytics Process Model:
1. ID Problem
2. ID Data Sources
3. Select Data
4. Clean Data
5. Transform Data
6. Analyze Data
7. Interpret Model
8. Refine Model
The first ten questions are for the individual assignment and the remaining questions are for the team assignment. Specifically: For the individual assignment submissions: Delete everything below Q10. Then, knit and submit your html and Rmd files. For the team assignment submissions: Pick any individual team member’s answers for the first ten questions. Then complete the remaining questions to submit for the team assignment.
Individual assignment: 25 total points
Team assignment: 115 total points
Start by entering your name and today’s date in Lines 3 and 4, respectively. Then, run the chunk of code below by clicking on the green arrow (that points to the right) on the top right of the chunk. Tip: This line of code (knitr…) is generated by RStudio. It allows you to format (knit) your output so you can share it in a readable format. We will use an html format for output as specified in Line 5. You will be instructed to “knit” after completing the assignment without any errors. Tip: I numbered the code chunks to match their question numbers. Chunk 1 specifies the knitting parameters.
Flight cancellations and delays due to weather are not the airlines’ responsibility. Our goal is to detect and report any delays that are not due to weather for further investigation.
We will analyze data in nycflights13 because it contains information about all flights originating from NYC airports. Please take a moment to skim https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf with a focus on finding information about cancellations/delays and weather.
Install and load the packages tidyverse and nycflights13. Tip: The tidyverse package gives you a tidy way of manipulating data. Tip: The install.packages("packageName") command installs the package packageName. library(packageName) loads the package packageName. Tip: You only need to install a package once; after that, you can load it whenever you need it. In fact, you will need to remove (or comment out) the install command to knit and submit your solutions.
# install.packages("tidyverse")
# Load tidyverse package
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("nycflights13")
# Load nycflights13 package
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.3.2
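One way to keep the install step out of the knitted document is to check which packages are missing before installing anything. The following is a minimal sketch, not part of the assignment: the helper name missing_packages is hypothetical, and it relies only on base R's requireNamespace().

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "nycflights13")
to_install <- missing_packages(needed)

# Run the install interactively, not while knitting:
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

If to_install is empty, both packages are already installed and library() can be called directly.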
Confirm that the packages you loaded in the previous chunk are attached by using the search() function. There is nothing enclosed in the parentheses because search() doesn’t need any information (parameters).
search()
## [1] ".GlobalEnv" "package:nycflights13" "package:lubridate"
## [4] "package:forcats" "package:stringr" "package:dplyr"
## [7] "package:purrr" "package:readr" "package:tidyr"
## [10] "package:tibble" "package:ggplot2" "package:tidyverse"
## [13] "package:stats" "package:graphics" "package:grDevices"
## [16] "package:utils" "package:datasets" "package:methods"
## [19] "Autoloads" "package:base"
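Because search() returns a plain character vector, the check can also be done programmatically rather than by reading the printout. A small sketch, assuming a hypothetical helper named is_attached:

```r
# search() returns the attached packages and environments as a character vector,
# so membership can be tested with %in%.
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

is_attached("base")          # the base package is always on the search path
is_attached("nycflights13")  # TRUE only after library(nycflights13) has run
```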
Now, take a peek inside nycflights13 by using the command ls("package:nycflights13"). Tip: ls("package:packageName") gives the names of the objects in packageName.
library(ggplot2)
ls("package:nycflights13")
## [1] "airlines" "airports" "flights" "planes" "weather"
Now, see what is in your environment by using the command ls() without anything inside the parentheses. Tip: ls() provides information about data sets and functions defined by the user. This will give character(0) if there is nothing in the environment. It is a good idea to start every R script with a clean environment so previous data doesn’t corrupt your script. Tip: You may clean your environment at any time by using the broom in the environment tab (likely near top right of RStudio) or using the command rm(list = ls()).
ls()
## character(0)
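To see how ls() and rm(list = ls()) behave without touching your own workspace, the same calls can be pointed at a throwaway environment. This is only an illustrative sketch; the name demo_env is an assumption and is not used elsewhere in the assignment.

```r
# Use a separate environment so the demonstration does not touch your workspace.
demo_env <- new.env()

ls(envir = demo_env)                 # nothing defined yet: character(0)

assign("x", 42, envir = demo_env)    # define one object
ls(envir = demo_env)                 # now lists "x"

# Same idea as rm(list = ls()), but restricted to demo_env
rm(list = ls(envir = demo_env), envir = demo_env)
length(ls(envir = demo_env))         # back to 0
```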
Find the tibbles that contain information about flight delays and weather using the information from the nycflights13 documentation, and assign them to variable names of your choosing. Tip: Tibbles are a more robust and refined version of data frames. (Google tibble if you want to learn more about tibbles, but it is not necessary at this time.) Tip: I name variables with an abbreviation of the data type followed by a description of the variable. For example, I would use tFlights for the tibble containing flights data so I know that it is a tibble by just reading the variable name. I don’t mind long variable names because I copy-paste them. (You may learn more about naming conventions at https://en.wikipedia.org/wiki/Naming_convention_(programming).)
# Assign flights tibble to tFlights
tFlights <- flights
# Assign weather tibble to tWeather
tWeather <- weather
Now, use ls() to see how your environment has changed. Tip: Also, check out the new variables in the environment tab.
ls()
## [1] "tFlights" "tWeather"
Use the head function to print and check out the first 20 rows of the tibbles containing the flights and weather data. Tip: Use https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf to understand the details of each data set. Tip: Always check what you read into your program. Tip: Get accustomed to vertical and horizontal scrolling. Tip: Get accustomed to clicking on different tabs when they are printed by one code chunk.
head(tFlights, 20)
## # A tibble: 20 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## 11 2013 1 1 558 600 -2 849 851
## 12 2013 1 1 558 600 -2 853 856
## 13 2013 1 1 558 600 -2 924 917
## 14 2013 1 1 558 600 -2 923 937
## 15 2013 1 1 559 600 -1 941 910
## 16 2013 1 1 559 559 0 702 706
## 17 2013 1 1 559 600 -1 854 902
## 18 2013 1 1 600 600 0 851 858
## 19 2013 1 1 600 600 0 837 825
## 20 2013 1 1 601 600 1 844 850
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
head(tWeather, 20)
## # A tibble: 20 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## 11 EWR 2013 1 1 11 41 27.0 57.1 260 15.0
## 12 EWR 2013 1 1 13 39.2 28.4 69.7 330 16.1
## 13 EWR 2013 1 1 14 39.0 24.1 54.7 280 13.8
## 14 EWR 2013 1 1 15 37.9 24.1 57.0 290 9.21
## 15 EWR 2013 1 1 16 37.0 19.9 49.6 300 13.8
## 16 EWR 2013 1 1 17 36.0 19.0 49.8 330 11.5
## 17 EWR 2013 1 1 18 34.0 15.1 45.4 310 12.7
## 18 EWR 2013 1 1 19 33.1 12.9 42.8 320 10.4
## 19 EWR 2013 1 1 20 32 15.1 49.2 310 15.0
## 20 EWR 2013 1 1 21 30.0 12.9 48.5 320 18.4
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
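The printouts above hide some columns ("11 more variables", "5 more variables"), so it can help to check dimensions and column names separately. The sketch below uses the built-in mtcars data frame as a stand-in so that it runs without nycflights13; the same calls should apply to tFlights and tWeather.

```r
# mtcars ships with base R (datasets package) and has 32 rows and 11 columns.
first_rows <- head(mtcars, 20)   # first 20 rows
last_rows  <- tail(mtcars, 5)    # last 5 rows

dim(mtcars)        # rows and columns of the full data set
nrow(first_rows)   # 20
names(mtcars)      # column names, useful when columns are hidden in the printout
```

With the tidyverse loaded, glimpse(tFlights) is another option that lists every column on its own line.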
Now, print the structure and summary statistics of the tibbles containing the flights and weather data. Tip: Get accustomed to working with datasets that can’t fit on a spreadsheet (or in your head).
# Print the structure of the flights tibble
str(tFlights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# Print summary statistics of the flights tibble
summary(tFlights)
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00.00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00.00
## Median :29.00 Median :2013-07-03 10:00:00.00
## Mean :26.23 Mean :2013-07-03 05:22:54.64
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00.00
## Max. :59.00 Max. :2013-12-31 23:00:00.00
##
# Print the structure of the weather tibble
str(tWeather)
## tibble [26,115 × 15] (S3: tbl_df/tbl/data.frame)
## $ origin : chr [1:26115] "EWR" "EWR" "EWR" "EWR" ...
## $ year : int [1:26115] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:26115] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:26115] 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int [1:26115] 1 2 3 4 5 6 7 8 9 10 ...
## $ temp : num [1:26115] 39 39 39 39.9 39 ...
## $ dewp : num [1:26115] 26.1 27 28 28 28 ...
## $ humid : num [1:26115] 59.4 61.6 64.4 62.2 64.4 ...
## $ wind_dir : num [1:26115] 270 250 240 250 260 240 240 250 260 260 ...
## $ wind_speed: num [1:26115] 10.36 8.06 11.51 12.66 12.66 ...
## $ wind_gust : num [1:26115] NA NA NA NA NA NA NA NA NA NA ...
## $ precip : num [1:26115] 0 0 0 0 0 0 0 0 0 0 ...
## $ pressure : num [1:26115] 1012 1012 1012 1012 1012 ...
## $ visib : num [1:26115] 10 10 10 10 10 10 10 10 10 10 ...
## $ time_hour : POSIXct[1:26115], format: "2013-01-01 01:00:00" "2013-01-01 02:00:00" ...
# Print summary statistics of the weather tibble
summary(tWeather)
## origin year month day
## Length:26115 Min. :2013 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
## Mode :character Median :2013 Median : 7.000 Median :16.00
## Mean :2013 Mean : 6.504 Mean :15.68
## 3rd Qu.:2013 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :2013 Max. :12.000 Max. :31.00
##
## hour temp dewp humid
## Min. : 0.00 Min. : 10.94 Min. :-9.94 Min. : 12.74
## 1st Qu.: 6.00 1st Qu.: 39.92 1st Qu.:26.06 1st Qu.: 47.05
## Median :11.00 Median : 55.40 Median :42.08 Median : 61.79
## Mean :11.49 Mean : 55.26 Mean :41.44 Mean : 62.53
## 3rd Qu.:17.00 3rd Qu.: 69.98 3rd Qu.:57.92 3rd Qu.: 78.79
## Max. :23.00 Max. :100.04 Max. :78.08 Max. :100.00
## NA's :1 NA's :1 NA's :1
## wind_dir wind_speed wind_gust precip
## Min. : 0.0 Min. : 0.000 Min. :16.11 Min. :0.000000
## 1st Qu.:120.0 1st Qu.: 6.905 1st Qu.:20.71 1st Qu.:0.000000
## Median :220.0 Median : 10.357 Median :24.17 Median :0.000000
## Mean :199.8 Mean : 10.518 Mean :25.49 Mean :0.004469
## 3rd Qu.:290.0 3rd Qu.: 13.809 3rd Qu.:28.77 3rd Qu.:0.000000
## Max. :360.0 Max. :1048.361 Max. :66.75 Max. :1.210000
## NA's :460 NA's :4 NA's :20778
## pressure visib time_hour
## Min. : 983.8 Min. : 0.000 Min. :2013-01-01 01:00:00.0
## 1st Qu.:1012.9 1st Qu.:10.000 1st Qu.:2013-04-01 21:30:00.0
## Median :1017.6 Median :10.000 Median :2013-07-01 14:00:00.0
## Mean :1017.9 Mean : 9.255 Mean :2013-07-01 18:26:37.7
## 3rd Qu.:1023.0 3rd Qu.:10.000 3rd Qu.:2013-09-30 13:00:00.0
## Max. :1042.1 Max. :10.000 Max. :2013-12-30 18:00:00.0
## NA's :2729
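The summaries above report missing values (for example, 8255 NA's in dep_time and 20778 in wind_gust) and a wind_speed maximum of 1048.361, which looks implausible for a speed in mph. A minimal sketch of how such checks might be automated follows; it uses a small made-up data frame, and the 200 mph limit is an assumption chosen only for illustration.

```r
# Toy stand-in for a few weather columns; the values are illustrative only.
toy_weather <- data.frame(
  wind_speed = c(10.4, 8.06, 1048.361, NA),
  wind_gust  = c(NA, NA, NA, 20.7),
  visib      = c(10, 10, 0.5, 10)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy_weather))
print(na_counts)

# Rows whose wind speed exceeds a hypothetical plausibility limit (assumed: 200 mph)
max_plausible_wind <- 200
suspect_rows <- which(toy_weather$wind_speed > max_plausible_wind)
print(toy_weather[suspect_rows, ])
```

Replacing toy_weather with tWeather should give the per-column NA counts that summary() reports, in a form that is easier to reuse later in the cleaning step.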
Using the information above, identify columns of each tibble that you will need for your analysis. Tip: Copy-paste from https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf and then delete what you don’t need after completing your analysis. This will serve as your data dictionary. Tip: You may update this list as you progress through the rest of the assignment. Don’t worry about getting it perfect right now.
### This section doesn't require code. Just list all the columns of each dataset you'll need for your analysis.###
#tFlights Tibble:
#year, month, day
#Date of departure.
#dep_time, arr_time
#Actual departure and arrival times (format HHMM or HMM), local tz.
#sched_dep_time, sched_arr_time
#Scheduled departure and arrival times (format HHMM or HMM), local tz.
#dep_delay, arr_delay
#Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
#carrier
#Two letter carrier abbreviation. See airlines to get name.
#flight
#Flight number.
#tailnum
#Plane tail number. See planes for additional metadata.
#origin, dest
#Origin and destination. See airports for additional metadata.
#air_time
#Amount of time spent in the air, in minutes.
#distance
#Distance between airports, in miles.
#hour, minute
#Time of scheduled departure broken into hour and minutes.
#time_hour
#Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.
#tWeather Tibble:
#origin
#Weather station. Named origin to facilitate merging with flights data.
#year, month, day, hour
#Time of recording.
#temp, dewp
#Temperature and dewpoint in F.
#humid
#Relative humidity.
#wind_dir, wind_speed, wind_gust
#Wind direction (in degrees), speed and gust speed (in mph).
#precip
#Precipitation, in inches.
#pressure
#Sea level pressure in millibars.
#visib
#Visibility in miles.
#time_hour
#Date and hour of the recording as a POSIXct date.
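Two notes in the data dictionary above lend themselves to a quick illustration: clock times are stored as HHMM numbers, and origin together with time_hour can be used to join flights to weather. The sketch below uses made-up rows and base R only. The toy values and the UTC time zone are assumptions chosen so the example is self-contained; they are not taken from the real tibbles.

```r
# Split an HHMM (or HMM) clock value into hour and minute.
hhmm <- c(517, 1004, 2359)
hour_part   <- hhmm %/% 100   # integer division: 5, 10, 23
minute_part <- hhmm %% 100    # remainder: 17, 4, 59

# Toy flights and weather rows; values are illustrative only (time zone assumed UTC).
toy_flights <- data.frame(
  origin    = c("EWR", "JFK", "LGA"),
  time_hour = as.POSIXct(c("2013-01-01 05:00:00", "2013-01-01 05:00:00",
                           "2013-01-01 06:00:00"), tz = "UTC"),
  dep_delay = c(2, 4, -1)
)
toy_weather <- data.frame(
  origin    = c("EWR", "JFK"),
  time_hour = as.POSIXct(c("2013-01-01 05:00:00", "2013-01-01 05:00:00"), tz = "UTC"),
  visib     = c(10, 8)
)

# Left join: keep every flight, attach weather where origin and hour match.
joined <- merge(toy_flights, toy_weather,
                by = c("origin", "time_hour"), all.x = TRUE)
print(joined)
```

Flights without a matching weather row keep NA in the weather columns. With dplyr loaded, left_join(tFlights, tWeather, by = c("origin", "time_hour")) would be the tidyverse counterpart of this merge() call.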
For the individual assignment submissions: Delete everything below this line. Then, knit and submit your html and Rmd files. For the team assignment submissions: Pick any individual team member’s answers for the first ten questions. Then complete the remaining questions to submit for the team assignment.