2

Create References for NYC Flights Data Sets

library(dplyr) ## provides %>%, the join functions, and the set operations used below

flights <- nycflights13::flights
planes <- nycflights13::planes
airports <- nycflights13::airports

3

Number of records in the flights dataset that have tail numbers matching records in the planes data set

Updated_Records <- flights %>%
  inner_join(planes, by = "tailnum")

Count_records <- nrow(Updated_Records)
Count_records
## [1] 284170

There are 284,170 records that have matching tail numbers.

4

Number of records that do not have matching tail numbers.

Updated_Anti <- flights %>%
  anti_join(planes, by = "tailnum")

Count_record <- nrow(Updated_Anti)
Count_record
## [1] 52606

There are 52,606 records that do not have matching tail numbers.
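
As a quick sanity check on the two counts above: every flight either matches a plane or does not, so the inner-join and anti-join counts should add up to the number of rows in flights (284,170 + 52,606 = 336,776). The sketch below illustrates this on two small made-up tables. flights_toy and planes_toy are hypothetical stand-ins, not the nycflights13 data, and the identity assumes that tailnum is unique in the lookup table.

```r
library(dplyr)

# Hypothetical toy tables standing in for flights and planes
flights_toy <- tibble(
  flight  = 1:5,
  tailnum = c("N1", "N2", "N2", "N9", NA)
)
planes_toy <- tibble(
  tailnum = c("N1", "N2", "N3"),
  seats   = c(100, 150, 200)
)

# Rows of flights_toy with a matching tail number in planes_toy
matched <- flights_toy %>%
  inner_join(planes_toy, by = "tailnum")

# Rows of flights_toy with no matching tail number
unmatched <- flights_toy %>%
  anti_join(planes_toy, by = "tailnum")

# Because tailnum is unique in planes_toy, the two counts partition flights_toy
nrow(matched) + nrow(unmatched) == nrow(flights_toy)
```

With the real data, the equivalent check would be nrow(Updated_Records) + nrow(Updated_Anti) == nrow(flights).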

5

Airports in the airports data set that do not appear as a destination in the flights data set

filtered <- airports %>% 
  anti_join(flights, by = c("faa" = "dest") )

head(filtered)
## # A tibble: 6 × 8
##   faa   name                             lat   lon   alt    tz dst   tzone      
##   <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>      
## 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne…
## 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch…
## 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch…
## 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne…
## 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne…
## 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne…

6 Windmill Data Set

6a.

Perform a regression using velocity as the covariate and the output as the target variable.

wind_lm <- lm(Output ~. , data = Windmill)
summary(wind_lm)
## 
## Call:
## lm(formula = Output ~ ., data = Windmill)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59869 -0.14099  0.06059  0.17262  0.32184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.13088    0.12599   1.039     0.31    
## Velocity     0.24115    0.01905  12.659 7.55e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.869 
## F-statistic: 160.3 on 1 and 23 DF,  p-value: 7.546e-12

The regression indicates a strong linear relationship between Output and Velocity. The Multiple R-squared is high: the model explains 87.5% of the variance in Output, meaning the fitted values lie relatively close to the observed values. The estimated slope is also highly significant (p = 7.55e-12). The intercept is not (p = 0.31), but that matters little here, because a velocity of 0 lies well outside the range of velocities in the data set, so the intercept has no practical interpretation. Residual analysis is still needed to confirm that the model is adequate.
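
The figures discussed above can also be pulled out of the fitted model programmatically instead of being read off the printed summary. The following is a minimal sketch on simulated data; toy, velocity, and output are hypothetical names and values, not the Windmill data set.

```r
# Simulated stand-in data (hypothetical values, not the Windmill measurements)
set.seed(1)
toy <- data.frame(velocity = seq(2.5, 10, length.out = 25))
toy$output <- 0.1 + 0.25 * toy$velocity + rnorm(25, sd = 0.2)

toy_lm  <- lm(output ~ velocity, data = toy)
toy_sum <- summary(toy_lm)

# Share of the variance in the response explained by the model
r_squared <- toy_sum$r.squared

# Coefficient table: estimate, standard error, t value, and p-value per term
coef_table  <- coef(toy_sum)
slope_p     <- coef_table["velocity", "Pr(>|t|)"]
intercept_p <- coef_table["(Intercept)", "Pr(>|t|)"]

# Observed range of the covariate, to judge whether velocity = 0 is an extrapolation
range(toy$velocity)
```

Applied to the real model, the same calls would be summary(wind_lm)$r.squared and coef(summary(wind_lm)).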

6b.

Create a scatter plot comparing the residuals to the predictions made by the regression model.

plot(wind_lm, which = 1) ## quick way: built-in residuals vs. fitted plot

library(ggplot2)

pred <- fitted(wind_lm)
resid <- resid(wind_lm)
df_red <- data.frame(pred, resid)

## same plot built by hand with ggplot2, for more control over the layout
ggplot(df_red, aes(x = pred, y = resid)) +
  geom_point() +
  labs(title = "Residuals vs. Fitted Values", x = "Predicted Values", y = "Residual Values") +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  geom_hline(yintercept = 0, linetype = "dotted", color = "black")
## `geom_smooth()` using formula = 'y ~ x'

The residuals follow a clear curve instead of scattering randomly around 0, which indicates that the linear fit is missing some structure and that the data should be transformed.

6c.

Determine the best value for 𝜆 to use in a Box-Cox transformation. Then create a new regression model where Output is transformed using this ideal value for 𝜆.

library(MASS) ## provides boxcox()

trans <- boxcox(Windmill$Output ~ Windmill$Velocity)

lambda <- trans$x[which.max(trans$y)]
lambda
## [1] 2
new_model <- lm(((Output^lambda-1)/lambda) ~ Velocity, data = Windmill)
summary(new_model)
## 
## Call:
## lm(formula = ((Output^lambda - 1)/lambda) ~ Velocity, data = Windmill)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37420 -0.15514  0.02976  0.15397  0.28536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.17926    0.10620  -11.11 1.02e-10 ***
## Velocity     0.35533    0.01606   22.13  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.199 on 23 degrees of freedom
## Multiple R-squared:  0.9551, Adjusted R-squared:  0.9532 
## F-statistic: 489.7 on 1 and 23 DF,  p-value: < 2.2e-16

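The transformation noticeably improved the fit. The Multiple R-squared rises from 0.8745 to 0.9551, so more of the variance is explained (the two values are computed on different response scales, so the comparison is only approximate), and both the intercept and the slope now have significant p-values.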
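
One caveat about the value of 𝜆 above: boxcox() searches the grid seq(-2, 2, 1/10) by default, so a result of exactly 2 sits on the upper edge of that grid and the likelihood may still be increasing beyond it. The sketch below shows one way to widen the search range and check for a boundary result. It uses simulated data (velocity and output are hypothetical values, not the Windmill data set), and the range from -2 to 4 is an arbitrary choice for illustration.

```r
library(MASS)

# Simulated stand-in data (hypothetical; the response is strictly positive,
# which the Box-Cox transformation requires)
set.seed(42)
velocity <- seq(2.5, 10, length.out = 25)
output   <- 0.3 + 0.25 * velocity + rnorm(25, sd = 0.1)

# Search a wider grid than the default seq(-2, 2, 1/10); skip the plot
grid <- seq(-2, 4, by = 0.1)
bc   <- boxcox(output ~ velocity, lambda = grid, plotit = FALSE)

# Lambda with the highest profile log-likelihood on the grid
lambda_hat <- bc$x[which.max(bc$y)]

# TRUE would mean the maximum is on the edge of the grid,
# so the range should be widened again
on_boundary <- lambda_hat %in% range(grid)

# Box-Cox transform: (y^lambda - 1) / lambda, or log(y) when lambda is 0
transformed <- if (abs(lambda_hat) < 1e-8) {
  log(output)
} else {
  (output^lambda_hat - 1) / lambda_hat
}
toy_model <- lm(transformed ~ velocity)
```

If on_boundary comes back TRUE for the real data, rerunning boxcox() with a wider lambda argument would show whether 2 is really the best value or only the largest one that was tried.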

7

7a.

Using a set operation, determine the number of rows that are found in both datasets.

nrow(union_all(Windmill2, Windmill3))
## [1] 36
nrow(intersect(Windmill2, Windmill3))
## [1] 10

I was not sure whether every set operation should be included here, so both results are explained below. The answer to the question is the intersect count: 10 rows are found in both data sets.

  • union_all keeps every row from both data sets, including duplicates: 36 rows.

  • union is left out here because I believe it is the answer to 7c.

  • intersect keeps only the rows that are common to both data sets: 10 rows.

7b.

Find all rows in Windmill2 that aren’t in Windmill3.

nrow(setdiff(Windmill2, Windmill3))
## [1] 15

There are 15 rows that are in Windmill2 that aren’t in Windmill3.

7c.

Using a set operation, determine the number of unique rows that are contained in the datasets.

union(Windmill2, Windmill3)
## # A tibble: 26 × 2
##    Velocity Output
##       <dbl>  <dbl>
##  1     5     1.58 
##  2     6     1.82 
##  3     3.4   1.06 
##  4     2.7   0.5  
##  5    10     2.24 
##  6     9.7   2.39 
##  7     9.55  2.29 
##  8     3.05  0.558
##  9     8.15  2.17 
## 10     6.2   1.87 
## # … with 16 more rows

There are 26 unique rows. The union function removes duplicate rows, so each row shown here appears only once, even if it occurs in both data sets.
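
The counts in this section are consistent with each other: 15 rows only in Windmill2 plus 10 shared rows leaves 26 - 15 - 10 = 1 row that is only in Windmill3, and 36 - 10 = 26 matches the union count, which is what one would expect when neither data set contains internal duplicates. The sketch below reproduces both identities on two small made-up tables; a and b are hypothetical stand-ins, not the Windmill data sets.

```r
library(dplyr)

# Hypothetical toy tables with the same column names as the Windmill data sets
a <- tibble(Velocity = c(1, 2, 3), Output = c(0.5, 1.0, 1.5))
b <- tibble(Velocity = c(2, 3, 4), Output = c(1.0, 1.5, 2.0))

n_all    <- nrow(union_all(a, b))  # every row, duplicates kept
n_both   <- nrow(intersect(a, b))  # rows present in both tables
n_only_a <- nrow(setdiff(a, b))    # rows in a that are not in b
n_only_b <- nrow(setdiff(b, a))    # rows in b that are not in a
n_unique <- nrow(union(a, b))      # distinct rows across both tables

# The distinct rows split into three disjoint groups
n_only_a + n_both + n_only_b == n_unique

# Holds here because neither toy table has internal duplicates
n_all - n_both == n_unique
```

For the real data, nrow(setdiff(Windmill3, Windmill2)) would give the one count that is not computed above.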