Load the nycflights13 dataset in R using the following code: library(nycflights13)
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.2.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.1
## Warning: package 'purrr' was built under R version 4.2.1
## Warning: package 'dplyr' was built under R version 4.2.1
## Warning: package 'stringr' was built under R version 4.2.1
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The nycflights13 dataset contains multiple data frames. One of them is called flights, another is called airports, and another is called planes. To make it easier to refer to these data frames without needing to type the name of the dataset (i.e., nycflights13), type the following code in R: flights <- nycflights13::flights planes <- nycflights13::planes airports <- nycflights13::airports
flights <- nycflights13::flights
planes <- nycflights13::planes
airports <- nycflights13::airports
str(planes)
## tibble [3,322 × 9] (S3: tbl_df/tbl/data.frame)
## $ tailnum : chr [1:3322] "N10156" "N102UW" "N103US" "N104UW" ...
## $ year : int [1:3322] 2004 1998 1999 1999 2002 1999 1999 1999 1999 1999 ...
## $ type : chr [1:3322] "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" ...
## $ manufacturer: chr [1:3322] "EMBRAER" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" ...
## $ model : chr [1:3322] "EMB-145XR" "A320-214" "A320-214" "A320-214" ...
## $ engines : int [1:3322] 2 2 2 2 2 2 2 2 2 2 ...
## $ seats : int [1:3322] 55 182 182 182 55 182 182 182 182 182 ...
## $ speed : int [1:3322] NA NA NA NA NA NA NA NA NA NA ...
## $ engine : chr [1:3322] "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" ...
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
Using a filter join, determine the number of records in the flights dataset that have tail numbers matching records in the planes dataset. (Hint: In both datasets, the column containing tail numbers is called tailnum. Also, you can count the number of rows in a data frame using the function nrow.)
filter <- flights %>% semi_join(planes, by = "tailnum")
nrow(filter)
## [1] 284170
Using a filter join, determine the number of records in the flights dataset with tail numbers that do not match records in the planes dataset.
no_match <- flights %>% anti_join(planes, by = "tailnum")
nrow(no_match)
## [1] 52606
Using a filter join, filter the airports dataset for those airports that do not have matching destination values in the flights dataset. (Hint: The destination is listed in the flights dataset in the column called dest. The names of airports are listed in the dest column in the same format that they appear in the faa column in the flights dataset. For filter joins, you can join datasets with different column names in the same manner that we did for mutating joins.)
destination_not_matching <- airports %>% anti_join(flights, by = c("faa" = "dest"))
nrow(destination_not_matching)
## [1] 1357
Load the data_Windmill csv file into R. This datafile can be found in the “Week 7” folder of our Google Drive. a) Perform regression on this dataset using Velocity as the covariate (i.e., the independent variable), and use Output as the target variable.
library(readr)
data_Windmill <- read_csv("C:/Users/Lynx/Documents/MSDA/621/data_Windmill.csv")
## Rows: 25 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Observation_Number, Velocity, Output
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
B1 <- data_Windmill$Output
B2 <- data_Windmill$Velocity
model <- lm(B1 ~ B2, data = data_Windmill)
summary(model)
##
## Call:
## lm(formula = B1 ~ B2, data = data_Windmill)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59869 -0.14099 0.06059 0.17262 0.32184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13088 0.12599 1.039 0.31
## B2 0.24115 0.01905 12.659 7.55e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared: 0.8745, Adjusted R-squared: 0.869
## F-statistic: 160.3 on 1 and 23 DF, p-value: 7.546e-12
plot(model)
plot(model$fitted.values, model$residuals, pch = 20, col = "blue")
abline(h = 0)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
boxcox(model)
model2 = lm(I(sqrt(B1)) ~ B2, data = data_Windmill)
boxcox(model2)
Load the data_Windmill2 and data_Windmill3 csv files into R. These datafiles can also be found in the “Week 7” folder of our Google Drive.
data_Windmill2 <- read_csv("C:/Users/Lynx/Documents/MSDA/621/data_Windmill2.csv")
## Rows: 25 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Velocity, Output
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_Windmill3 <- read_csv("C:/Users/Lynx/Documents/MSDA/621/data_Windmill3.csv")
## Rows: 11 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Velocity, Output
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
intersect(data_Windmill2, data_Windmill3)
## # A tibble: 10 × 2
## Velocity Output
## <dbl> <dbl>
## 1 9.7 2.39
## 2 9.55 2.29
## 3 3.05 0.558
## 4 8.15 2.17
## 5 6.2 1.87
## 6 2.9 0.653
## 7 6.35 1.93
## 8 4.6 1.56
## 9 5.8 1.74
## 10 7.4 2.09
setdiff(data_Windmill2, data_Windmill3)
## # A tibble: 15 × 2
## Velocity Output
## <dbl> <dbl>
## 1 5 1.58
## 2 6 1.82
## 3 3.4 1.06
## 4 2.7 0.5
## 5 10 2.24
## 6 3.6 1.14
## 7 7.85 2.18
## 8 8.8 2.11
## 9 7 1.8
## 10 5.45 1.50
## 11 9.1 2.30
## 12 10.2 2.31
## 13 4.1 1.19
## 14 3.95 1.14
## 15 2.45 0.123
union(data_Windmill2, data_Windmill3)
## # A tibble: 26 × 2
## Velocity Output
## <dbl> <dbl>
## 1 5 1.58
## 2 6 1.82
## 3 3.4 1.06
## 4 2.7 0.5
## 5 10 2.24
## 6 9.7 2.39
## 7 9.55 2.29
## 8 3.05 0.558
## 9 8.15 2.17
## 10 6.2 1.87
## # … with 16 more rows