2
Create References for NYC Flights Data Sets
3
Number of records in the flights data set whose tail numbers match records in the planes data set.
Updated_Records <- flights %>%
  inner_join(planes, by = "tailnum")
Count_records <- nrow(Updated_Records)
Count_records
## [1] 284170
There are 284,170 records that have matching tail numbers.
4
Number of records in the flights data set that do not have matching tail numbers in the planes data set.
Updated_Anti <- flights %>%
  anti_join(planes, by = "tailnum")
Count_record <- nrow(Updated_Anti)
Count_record
## [1] 52606
There are 52,606 records that do not have matching tail numbers.
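As a quick consistency check, the two counts should add up to the total number of flights, because every flight either has a matching tail number in planes or does not: 284,170 + 52,606 = 336,776. The sketch below illustrates the idea on toy data. `toy_flights` and `toy_planes` are made-up stand-ins, not the real data sets, and the check assumes that `tailnum` is unique in the lookup table so the inner join does not duplicate rows.

```r
library(dplyr)

# Toy stand-ins for flights and planes (hypothetical values, not the real data)
toy_flights <- tibble(
  flight  = c(1, 2, 3, 4, 5),
  tailnum = c("N1", "N2", "N9", NA, "N1")
)
toy_planes <- tibble(
  tailnum = c("N1", "N2", "N3"),
  year    = c(1999, 2004, 2010)
)

# Rows of toy_flights with a matching tail number in toy_planes
matched <- inner_join(toy_flights, toy_planes, by = "tailnum")

# Rows of toy_flights with no matching tail number (including the missing one)
unmatched <- anti_join(toy_flights, toy_planes, by = "tailnum")

# With a unique key in toy_planes, the two results partition toy_flights
nrow(matched) + nrow(unmatched) == nrow(toy_flights)
```

If the key were duplicated in the lookup table, the inner join would return more rows than there are matching flights, and `semi_join()` would be the safer way to count matches.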
5
## # A tibble: 6 × 8
##   faa   name                             lat   lon   alt    tz dst   tzone
##   <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne…
## 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch…
## 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch…
## 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne…
## 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne…
## 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne…
6 Windmill Data Set
6a.
Perform a regression using velocity as the covariate and the output as the target variable.
##
## Call:
## lm(formula = Output ~ ., data = Windmill)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.59869 -0.14099  0.06059  0.17262  0.32184
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.13088    0.12599   1.039     0.31
## Velocity     0.24115    0.01905  12.659 7.55e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared: 0.8745, Adjusted R-squared: 0.869
## F-statistic: 160.3 on 1 and 23 DF, p-value: 7.546e-12
This regression indicates a strong linear relationship between output and velocity. The Multiple R-squared is high: velocity explains 87.5% of the variance in output, meaning that the fitted values lie relatively close to the observed values. The estimated slope is also highly significant (p = 7.55e-12). The intercept, by contrast, is not significant (p = 0.31), but this is of little practical concern because a velocity of 0 lies well outside the values observed in the data set. Residual analysis still needs to be performed to confirm that this is an adequate model.
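The quantities discussed above can be pulled directly out of a fitted model rather than read off the printed summary. The following is a minimal sketch, not the code used for this report: `wind_subset` is a hypothetical data frame built from the ten (Velocity, Output) pairs visible in the tibble printed in 7c, so its estimates will differ from the full-data output.

```r
# Ten (Velocity, Output) pairs copied from the tibble printed in 7c.
# This is only a subset, so the numbers will not match the report.
wind_subset <- data.frame(
  Velocity = c(5, 6, 3.4, 2.7, 10, 9.7, 9.55, 3.05, 8.15, 6.2),
  Output   = c(1.58, 1.82, 1.06, 0.5, 2.24, 2.39, 2.29, 0.558, 2.17, 1.87)
)

# Simple linear regression of Output on Velocity
fit <- lm(Output ~ Velocity, data = wind_subset)
fit_summary <- summary(fit)

# Share of the variance in Output explained by Velocity
r_squared <- fit_summary$r.squared

# p-values for the intercept and the slope
p_values <- coef(fit_summary)[, "Pr(>|t|)"]

r_squared
p_values
```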
6b.
Create a scatter plot comparing the residuals to the predictions made by the regression model.
library(ggplot2)

# wind_lm: the regression model fitted in 6a
pred <- fitted(wind_lm)
resid <- resid(wind_lm)
df_red <- data.frame(pred, resid)
ggplot(df_red, aes(x = pred, y = resid)) +
  geom_point() +
  labs(title = "Residuals vs. Fitted Values", x = "Predicted Values", y = "Residual Values") +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  # dotted reference line at a residual of zero
  geom_hline(yintercept = 0, linetype = "dotted", color = "black")
## `geom_smooth()` using formula = 'y ~ x'
The residuals show a clear curved pattern rather than a random scatter around zero, which indicates that the linear fit is inadequate and that the data should be transformed.
6c.
Determine the best value for 𝜆 to use in a Box-Cox transformation. Then create a new regression model where Output is transformed using this ideal value for 𝜆
## [1] 2
##
## Call:
## lm(formula = ((Output^lambda - 1)/lambda) ~ Velocity, data = Windmill)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.37420 -0.15514  0.02976  0.15397  0.28536
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.17926    0.10620  -11.11 1.02e-10 ***
## Velocity     0.35533    0.01606   22.13  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.199 on 23 degrees of freedom
## Multiple R-squared: 0.9551, Adjusted R-squared: 0.9532
## F-statistic: 489.7 on 1 and 23 DF, p-value: < 2.2e-16
The transformation noticeably improved the model. The Multiple R-squared rose from 0.8745 to 0.9551, so an additional share of the variance is explained, and both the intercept and the slope now have significant p-values.
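One common way to choose 𝜆 is to maximize the profile log-likelihood with `boxcox()` from the MASS package. The sketch below assumes that approach; the report does not show which function was actually used. `wind_subset` (ten pairs copied from the tibble printed in 7c) and the helper `box_cox()` are hypothetical names introduced for illustration. Note that a value of exactly 2 is the upper end of the default grid of `boxcox()` (`seq(-2, 2, 1/10)`), so if the default grid was used, searching a wider range as done here is a cheap check that the maximum is not clipped at the boundary.

```r
library(MASS)

# Hypothetical subset: ten (Velocity, Output) pairs copied from the 7c tibble
wind_subset <- data.frame(
  Velocity = c(5, 6, 3.4, 2.7, 10, 9.7, 9.55, 3.05, 8.15, 6.2),
  Output   = c(1.58, 1.82, 1.06, 0.5, 2.24, 2.39, 2.29, 0.558, 2.17, 1.87)
)

# Box-Cox transform; falls back to log(y) when lambda is (numerically) zero
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

base_fit <- lm(Output ~ Velocity, data = wind_subset)

# Profile log-likelihood on a grid wider than the default seq(-2, 2, 1/10)
bc <- boxcox(base_fit, lambda = seq(-2, 4, by = 0.1), plotit = FALSE)

# Grid value with the highest log-likelihood
lambda <- bc$x[which.max(bc$y)]

# Refit with the transformed response
bc_fit <- lm(box_cox(Output, lambda) ~ Velocity, data = wind_subset)

lambda
summary(bc_fit)$r.squared
```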
7
7a.
Using a set operation, determine the number of rows that are found in both datasets.
## [1] 36
## [1] 10
I was not sure whether every set operation was meant to be included, so I report two of them and explain each.
The union_all function returns all rows from both data sets, including duplicates: 36 rows.
I did not include union here because I believe it is the correct answer to the last question (7c).
The intersect function returns only the rows that are common to both data sets: 10 rows.
7b.
Find all rows in Windmill2 that aren't in Windmill3.
## [1] 15
There are 15 rows in Windmill2 that aren’t in Windmill3.
7c.
Using a set operation, determine the number of unique rows that are contained in the datasets.
## # A tibble: 26 × 2
##    Velocity Output
##       <dbl>  <dbl>
##  1     5     1.58
##  2     6     1.82
##  3     3.4   1.06
##  4     2.7   0.5
##  5    10     2.24
##  6     9.7   2.39
##  7     9.55  2.29
##  8     3.05  0.558
##  9     8.15  2.17
## 10     6.2   1.87
## # … with 16 more rows
There are 26 unique rows. The union function removes duplicate rows, so each row shown here appears only once even if it is present in both data sets.
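The four counts reported in question 7 are consistent with one another: 36 rows from union_all minus the 10 rows in common gives the 26 unique rows. The sketch below shows the same relationship on toy data using dplyr's set operations. The tibbles `a` and `b` are made-up examples, not Windmill2 and Windmill3, and the identity assumes neither data set contains duplicate rows within itself.

```r
library(dplyr)

# Toy data sets sharing two rows (hypothetical stand-ins for Windmill2 and Windmill3)
a <- tibble(Velocity = c(5, 6, 3.4, 2.7), Output = c(1.58, 1.82, 1.06, 0.5))
b <- tibble(Velocity = c(3.4, 2.7, 10),   Output = c(1.06, 0.5, 2.24))

n_all    <- nrow(union_all(a, b))  # all rows, duplicates kept
n_both   <- nrow(intersect(a, b))  # rows present in both
n_a_only <- nrow(setdiff(a, b))    # rows in a that are not in b
n_unique <- nrow(union(a, b))      # all rows, duplicates removed

# Without duplicates inside a or b, unique rows = all rows - shared rows
n_unique == n_all - n_both
```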