#These first chunks are the steps I took to clean and mutate the data set. They are not exactly the same as the final product, and are simply there to demonstrate what the varying lines of code within the final product are for. I broke the coding into many chunks when working on the project for ease of use, and believe that providing these chunks separately may help others understand the code better.
#load data set
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
data(flights)
#rename variables, remove NAs, filter by month, and arrange data. I know you told us not to use na.omit(), and I understand why, but this data set was so large that it needed to be reduced in size anyway to complete the assignment, so I figured that it wouldn’t matter if I used it. It only removed 2775 rows, many of which were flights that never left (i.e. were cancelled), so those rows were useless anyway. In this particular case, I decided that it was OK to use na.omit(), as it is quick and effective and didn’t prevent us from getting accurate and reliable results.
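A minimal sketch of what a cleaning step like the one described above could look like. It is an illustration under stated assumptions, not the chunk used for the final product: the helper names `hhmm_to_min` and `clean_month` and the toy rows are hypothetical, and it assumes the flight times are derived from the HHMM clock columns of `nycflights13::flights`. Arrival times are local to the destination, so the two durations carry the same time-zone offset, which cancels in `flight_time_diff`.

```r
library(dplyr)

# Hypothetical helper: convert an HHMM clock value (e.g. 1545) to minutes after midnight.
hhmm_to_min <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Toy rows using column names from nycflights13::flights (all values invented).
toy <- tibble(
  month          = c(1, 1, 2, 1),
  origin         = c("JFK", "EWR", "LGA", "LGA"),
  dep_time       = c(600, 2330, 900, NA),   # NA: a flight that never left
  arr_time       = c(905, 130, 1100, NA),
  sched_dep_time = c(555, 2320, 905, 700),
  sched_arr_time = c(910, 115, 1110, 900)
)

# Hypothetical cleaning function: keep one month, drop rows with NAs,
# compute scheduled and actual durations in minutes, and sort by origin.
clean_month <- function(df, keep_month = 1) {
  df |>
    filter(month == keep_month) |>
    na.omit() |>
    mutate(
      # %% 1440 wraps flights that arrive after midnight
      sched_flight_time = (hhmm_to_min(sched_arr_time) - hhmm_to_min(sched_dep_time)) %% 1440,
      flight_time       = (hhmm_to_min(arr_time) - hhmm_to_min(dep_time)) %% 1440,
      flight_time_diff  = flight_time - sched_flight_time
    ) |>
    arrange(origin)
}

toy_clean <- clean_month(toy)
toy_clean
```

The same function could in principle be applied to `flights` itself, for example `clean_month(flights, 1)`, though the row counts it produces may differ from the ones reported here.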
scatterplot <- all_in_one_go |>
  group_by(origin) |>
  ggplot() +
  xlim(0, 800) +
  ylim(0, 800) +
  geom_count(aes(x = sched_flight_time, y = flight_time, col = origin), alpha = 0.5) +
  geom_smooth(aes(x = sched_flight_time, y = flight_time)) +
  scale_color_manual(values = c("pink", "purple", "orange")) +
  theme_bw() +
  theme(aspect.ratio = 1) +
  labs(
    x = "Scheduled Flight Time\n(in minutes)",
    y = "Actual Flight Time\n(in minutes)",
    title = "A Scatterplot with the Relationship between\nScheduled and Actual Flight Times\nOriginating in Three Different Airports\nin January of 2013",
    caption = "FAA Aircraft Registry"
  )
scatterplot +
  geom_abline(intercept = 0, slope = 1, col = "red", lty = 2, lwd = 1, alpha = 1)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
mean(all_in_one_go$flight_time_diff)
[1] -3.855519
summary(all_in_one_go$flight_time_diff)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-69.000 -14.000 -5.000 -3.856 5.000 129.000
sum(all_in_one_go$flight_time_diff <0)/26398
[1] 0.6260702
I created a scatterplot to show the relationship between scheduled flight times and actual flight times for flights originating in Newark, LaGuardia, and JFK in January of 2013. “Scheduled flight time” is defined as the time from when a plane is scheduled to leave the terminal to when it’s scheduled to arrive at the next terminal. “Flight time” is how long it actually took for the plane to get from one terminal to the next. The dashed line (y=x) represents flights that took exactly as long as they were scheduled to. As you can see on the graph, flights don’t fall very far below this line, but they do reach much farther above it, even though a larger number of flights (about 63% of them) are below the line. The distribution of flight time differences is therefore skewed right (skewed towards flights taking longer than scheduled). This can be confirmed by looking at the summary statistics provided under the graph: the mode for flight time difference (flight time - scheduled flight time) is less than the median, which is less than the mean, a pattern that occurs in right-skewed data sets. Despite the data being skewed towards flights taking longer than scheduled, the mean, median, and mode are all negative, which means the typical flight was shorter than scheduled. The skew can probably be explained by the fact that it is difficult for a flight to be much quicker than scheduled due to tight scheduling, while it is much easier for a flight to be delayed and take much longer than scheduled.
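The skewness check described above can be illustrated on a small invented vector. This is only a sketch: `toy_diff` is hypothetical data, not the flight data, and it is chosen so that most values are negative while a single long delay pulls the mean above the median.

```r
# Invented flight time differences in minutes (actual - scheduled).
toy_diff <- c(-14, -12, -10, -8, -8, -5, -5, 0, 5, 27)

# In a right-skewed distribution the mean sits above the median.
toy_mean   <- mean(toy_diff)
toy_median <- median(toy_diff)

# Share of values below zero; taking the mean of a logical vector
# avoids typing the number of rows by hand.
share_early <- mean(toy_diff < 0)

# The upper tail (max - median) is longer than the lower tail (median - min).
upper_tail <- max(toy_diff) - toy_median
lower_tail <- toy_median - min(toy_diff)

c(mean = toy_mean, median = toy_median, share_early = share_early,
  upper_tail = upper_tail, lower_tail = lower_tail)
```

Both the mean and the median are negative here, yet the upper tail is several times longer than the lower one, which is the same combination seen in the summary output.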
P.S. I was unable to render the document to RPubs with the Mode() function, but the mode for flight time difference was -8 minutes with a frequency of 799.
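One possible workaround, sketched here as an assumption rather than what was actually run, is to compute the mode with base R's `table()` so that no extra package has to load when the document is rendered. The helper name `mode_with_freq` and the toy vector are hypothetical.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric
# vector together with how often they occur. Ties give several rows.
mode_with_freq <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  top <- counts[counts == max(counts)]    # keep only the most frequent
  data.frame(value = as.numeric(names(top)), freq = as.integer(top))
}

# Invented differences in minutes, only to show the call.
toy_diff <- c(-8, -8, -5, 3, -8, 12, -5)
toy_mode <- mode_with_freq(toy_diff)
toy_mode
```

On the real data the call would presumably be `mode_with_freq(all_in_one_go$flight_time_diff)`.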