Initial data load

data(nycflights)
names(nycflights)
##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay" "arr_time" 
##  [7] "arr_delay" "carrier"   "tailnum"   "flight"    "origin"    "dest"     
## [13] "air_time"  "distance"  "hour"      "minute"
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ month     <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8,...
## $ day       <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 2...
## $ dep_time  <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, ...
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -...
## $ arr_time  <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 154...
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -...
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV...
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA...
## $ flight    <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 2...
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "...
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "...
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, ...
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 2...
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20...
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17,...

Plot Histogram

Analysis

Let’s start by examining the distribution of departure delays of all flights with a histogram.

Histogram 1

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram 2

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

Histogram 3

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

Exercise 1

Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

The smaller the binwidth, the finer the detail. The second histogram (binwidth = 15) has the smallest binwidth and shows the most detail. The third histogram (binwidth = 150) has the largest binwidth and lumps most of the data into a few bars, hiding much of the detail. The first histogram, which uses the default of 30 bins, falls between the other two. Although the second histogram shows the most detail, the first strikes the best balance and is the easiest to read.

arbuthnot$girls
##  [1] 4683 4457 4102 4590 4839 4820 4928 4605 4457 4952 4784 5332 5200 4910 4617
## [16] 3997 3919 3395 3536 3181 2746 2722 2840 2908 2959 3179 3349 3382 3289 3013
## [31] 2781 3247 4107 4803 4881 5681 4858 4319 5322 5560 5829 5719 6061 6120 5822
## [46] 5738 5717 5847 6203 6033 6041 6299 6533 6744 7158 7127 7246 7119 7214 7101
## [61] 7167 7302 7392 7316 7483 6647 6713 7229 7767 7626 7452 7061 7514 7656 7683
## [76] 5738 7779 7417 7687 7623 7380 7288

Exercise 2

Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

Number of flights that meet these criteria

nrow(sfo_feb_flights)
## [1] 68

Exercise 3

Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

ggplot(data = sfo_feb_flights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)


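The histogram of departure delays (dep_delay) is right-skewed, so the mean and standard deviation are pulled toward the long right tail and describe the distribution poorly. The median and interquartile range, computed here by origin, are more appropriate summaries.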

sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay), 
            iqr_dd = IQR(dep_delay), 
            n_flights = n())
## # A tibble: 2 x 4
##   origin median_dd iqr_dd n_flights
## * <chr>      <dbl>  <dbl>     <int>
## 1 EWR          0.5   5.75         8
## 2 JFK         -2.5  15.2         60

Exercise 4

Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median = median(arr_delay), 
            iqr = IQR(arr_delay), 
            n_flights = n()) %>%
  arrange(desc(iqr))
## # A tibble: 5 x 4
##   carrier median   iqr n_flights
##   <chr>    <dbl> <dbl>     <int>
## 1 DL       -15    22          19
## 2 UA       -10    22          21
## 3 VX       -22.5  21.2        12
## 4 AA         5    17.5        10
## 5 B6       -10.5  12.2         6

Carriers with the most variable arrival delays: DL and UA tie, each with an IQR of 22 minutes.

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(iqr = IQR(arr_delay)) %>%
  arrange(desc(iqr))
## # A tibble: 5 x 2
##   carrier   iqr
##   <chr>   <dbl>
## 1 DL       22  
## 2 UA       22  
## 3 VX       21.2
## 4 AA       17.5
## 5 B6       12.2

Departure delays by month

nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
## # A tibble: 12 x 2
##    month mean_dd
##    <int>   <dbl>
##  1     7   20.8 
##  2     6   20.4 
##  3    12   17.4 
##  4     4   14.6 
##  5     3   13.5 
##  6     5   13.3 
##  7     8   12.6 
##  8     2   10.7 
##  9     1   10.2 
## 10     9    6.87
## 11    11    6.10
## 12    10    5.88

Exercise 5

Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

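The mean uses every flight, so it reflects the delay you should expect on average, including the risk of rare but very long delays; its drawback is that a few extreme delays can pull it far from what a typical flight experiences. The median describes the typical flight and is robust to those extremes, but it says nothing about how severe the worst delays are. If you really dislike delays, the mean is the more informative measure for choosing a month, because it accounts for the extreme delays you most want to avoid.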

nycflights %>%
  group_by(month) %>%
  summarise(median = median(dep_delay)) %>%
  arrange(desc(median))
## # A tibble: 12 x 2
##    month median
##    <int>  <dbl>
##  1    12      1
##  2     6      0
##  3     7      0
##  4     3     -1
##  5     5     -1
##  6     8     -1
##  7     1     -2
##  8     2     -2
##  9     4     -2
## 10    11     -2
## 11     9     -3
## 12    10     -3

On time departure rate for NYC airports

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
## # A tibble: 3 x 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637

Exercise 6

If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()

LGA has the best on-time departure percentage (72.8%), so LGA is the airport to choose.

Exercise 7

Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))
glimpse(nycflights)
## Rows: 32,735
## Columns: 18
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201...
## $ month     <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8,...
## $ day       <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 2...
## $ dep_time  <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, ...
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -...
## $ arr_time  <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 154...
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -...
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV...
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA...
## $ flight    <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 2...
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "...
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "...
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, ...
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 2...
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20...
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17,...
## $ dep_type  <chr> "delayed", "on time", "on time", "on time", "on time", "o...
## $ avg_speed <dbl> 474.4409, 443.8889, 394.9468, 446.6667, 355.2000, 318.695...

Exercise 8

Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().

ggplot(data = nycflights, aes(x = distance, y = avg_speed)) + geom_point()

The scatterplot shows that average speed tends to increase as distance increases, so average speed and distance have a positive relationship.

Exercise 9

Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

nycflights_3carriers <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
ggplot(data = nycflights_3carriers,
       aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()

---
title: "Lab 2 : Introduction to data"
author: "Ramnivas Singh"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(eval = TRUE, results = TRUE, fig.show = "show", message = TRUE, warning = FALSE)
library(tidyverse)
library(openintro)
```
\

# Initial data load
```{r load-data}
data(nycflights)
```

```{r names}
names(nycflights)
```
```{r}
glimpse(nycflights)
```

# Plot Histogram

# Analysis

Let's start by examining the distribution of departure delays of all flights with a
histogram.

## Histogram 1
```{r hist-dep-delay}
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
```

## Histogram 2
```{r hist-dep-delay-bins15}
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
```

## Histogram 3
```{r hist-dep-delay-bins150}
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
```

## Exercise 1
Look carefully at these three histograms. How do they compare? Are features
revealed in one that are obscured in another?

The smaller the binwidth, the finer the detail. The second histogram (binwidth = 15) has the smallest binwidth and shows the most detail. The third histogram (binwidth = 150) has the largest binwidth and lumps most of the data into a few bars, hiding much of the detail. The first histogram, which uses the default of 30 bins, falls between the other two. Although the second histogram shows the most detail, the first strikes the best balance and is the easiest to read.
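
The effect of binwidth can also be seen without a plot by counting how many values fall into each bin. The sketch below uses base R and a small made-up vector of delays (not taken from `nycflights`); the helper `bin_counts()` is hypothetical and only a rough stand-in for the grouping that `geom_histogram(binwidth = ...)` performs, since ggplot2 places its bin edges differently.

```r
# Hypothetical toy delays (minutes), not taken from nycflights.
delays <- c(-5, -3, -2, 0, 1, 4, 8, 12, 20, 33, 47, 95, 140, 290)

# Hypothetical helper: count observations per bin for a given bin width.
bin_counts <- function(x, binwidth) {
  # Breaks start at or below min(x) and extend one bin past max(x),
  # so every value falls inside a left-closed interval [a, b).
  breaks <- seq(floor(min(x) / binwidth) * binwidth,
                ceiling(max(x) / binwidth) * binwidth + binwidth,
                by = binwidth)
  table(cut(x, breaks = breaks, right = FALSE))
}

narrow <- bin_counts(delays, 15)   # many bins, each holding few values
wide   <- bin_counts(delays, 150)  # few bins, most values lumped together

length(narrow)  # number of bins with binwidth 15
length(wide)    # number of bins with binwidth 150
wide            # counts per wide bin
```

With the wide bins nearly all of the toy values land in one or two bars, which is the loss of detail described above.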

```{r view-girls-counts}
arbuthnot$girls
```


## Exercise 2
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
```{r sfo_feb_flights}
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
```
**Number of flights that meet these criteria**
```{r}
nrow(sfo_feb_flights)
```

## Exercise 3
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.\

```{r plot-sfo_feb_flights}
ggplot(data = sfo_feb_flights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

```
\
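The histogram of departure delays (dep_delay) is right-skewed, so the mean and standard deviation are pulled toward the long right tail and describe the distribution poorly. The median and interquartile range, computed here by origin, are more appropriate summaries.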
```{r}
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay), 
            iqr_dd = IQR(dep_delay), 
            n_flights = n())
```
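The reason for preferring the median and IQR on a right-skewed distribution can be illustrated with a small made-up sample. This is a base R sketch on hypothetical numbers, not on the `sfo_feb_flights` data.

```r
# Hypothetical right-skewed delays (minutes); one long delay forms the tail.
delays <- c(-8, -5, -4, -2, -1, 0, 2, 3, 6, 180)

mean(delays)    # pulled upward by the single 180-minute delay
median(delays)  # stays near the bulk of the data
sd(delays)      # inflated by the tail
IQR(delays)     # spread of the middle 50%, barely affected by the tail
```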
## Exercise 4
Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
```{r}
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median = median(arr_delay), 
            iqr = IQR(arr_delay), 
            n_flights = n()) %>%
  arrange(desc(iqr))
```
Carriers with the most variable arrival delays: DL and UA tie, each with an IQR of 22 minutes.
```{r}
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(iqr = IQR(arr_delay)) %>%
  arrange(desc(iqr))

```
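Sorting by IQR and reading the first row can hide a tie for the largest value. A minimal base R sketch of a tie-aware selection, using made-up carriers and delays (the names `toy`, `iqr_by_carrier`, and `most_variable` are illustrative only):

```r
# Hypothetical arrival delays (minutes) for three made-up carriers.
toy <- data.frame(
  carrier   = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
  arr_delay = c(-10, 0, 10, 20, -10, 0, 10, 20, -2, 0, 1, 3)
)

# IQR of arrival delay per carrier.
iqr_by_carrier <- tapply(toy$arr_delay, toy$carrier, IQR)

# Keep every carrier whose IQR equals the maximum, so ties are not dropped.
most_variable <- names(iqr_by_carrier)[iqr_by_carrier == max(iqr_by_carrier)]
most_variable
```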
Departure delays by month

```{r}
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))

```

## Exercise 5
Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

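The mean uses every flight, so it reflects the delay you should expect on average, including the risk of rare but very long delays; its drawback is that a few extreme delays can pull it far from what a typical flight experiences. The median describes the typical flight and is robust to those extremes, but it says nothing about how severe the worst delays are. If you really dislike delays, the mean is the more informative measure for choosing a month, because it accounts for the extreme delays you most want to avoid.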
```{r}
nycflights %>%
  group_by(month) %>%
  summarise(median = median(dep_delay)) %>%
  arrange(desc(median))

```
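The mean and the median can rank months differently. The sketch below uses two made-up months (hypothetical values, not from `nycflights`) to show how a single extreme delay changes which month looks better under each measure.

```r
# Hypothetical departure delays (minutes) for two made-up months.
month_a <- c(-3, -2, -1, 0, 1, 2, 240)  # mostly punctual, one extreme delay
month_b <- c(4, 5, 5, 6, 6, 7, 8)       # consistently a few minutes late

# By the mean, month_b looks better: the 240-minute delay inflates month_a.
c(mean_a = mean(month_a), mean_b = mean(month_b))

# By the median, month_a looks better: the typical flight leaves on time.
c(median_a = median(month_a), median_b = median(month_b))
```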
On time departure rate for NYC airports
```{r}
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
```
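The on-time rate is the share of flights whose `dep_type` is "on time". A base R sketch on made-up data (the airports "X" and "Y" and their delays are hypothetical) shows that taking the mean of the logical comparison gives the same proportion as `sum(dep_type == "on time") / n()`.

```r
# Hypothetical departure delays (minutes) and origins, not from nycflights.
toy <- data.frame(
  origin    = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  dep_delay = c(-2, 0, 4, 30, 5, 12, -1, 60)
)

# Same rule as in the lab: a delay under 5 minutes counts as on time.
toy$dep_type <- ifelse(toy$dep_delay < 5, "on time", "delayed")

# The mean of a logical vector is the proportion of TRUE values,
# so this is the on-time departure rate per origin.
ot_rate <- tapply(toy$dep_type == "on time", toy$origin, mean)
ot_rate
```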
## Exercise 6
If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
```{r}
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()
```
LGA has the best on-time departure percentage (72.8%), so LGA is the airport to choose.

## Exercise 7
Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

```{r}
nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))
glimpse(nycflights)

```
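As a check on the unit conversion, the formula can be applied by hand to the first two flights listed in the rendered `glimpse()` output (distance 2475 and 1598 miles, air_time 313 and 216 minutes). This is a small base R sketch of that arithmetic.

```r
# distance (miles) and air_time (minutes) of the first two flights
# shown in the glimpse() output.
distance <- c(2475, 1598)
air_time <- c(313, 216)

# Convert minutes to hours, then divide miles by hours to get mph.
avg_speed <- distance / (air_time / 60)
round(avg_speed, 4)  # 474.4409 443.8889, matching the avg_speed column
```

Dividing by `air_time` alone would give miles per minute, which is why the division by 60 is needed.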

## Exercise 8
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().

```{r}
ggplot(data = nycflights, aes(x = distance, y = avg_speed)) + geom_point()

```
The scatterplot shows that average speed tends to increase as distance increases, so average speed and distance have a positive relationship.
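
One way to back up a visual impression with a number is the correlation coefficient. The sketch below uses only the first six flights listed in the rendered `glimpse()` output, so it is an illustration of the calculation and not a summary of the full data set.

```r
# distance (miles) and air_time (minutes) of the first six flights
# listed in the glimpse() output; six rows are only an illustration.
distance <- c(2475, 1598, 2475, 1005, 296, 733)
air_time <- c(313, 216, 376, 135, 50, 138)
avg_speed <- distance / (air_time / 60)

# Pearson correlation: a positive value is consistent with
# average speed rising as distance rises.
r <- cor(distance, avg_speed)
r > 0
```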

## Exercise 9

Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
```{r}
nycflights_3carriers <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
ggplot(data = nycflights_3carriers,
       aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()
```
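
The exercise asks for a rough cutoff read off the plot. One crude way to get a comparable number from data is to take the largest departure delay among flights that still arrived on time. The sketch below does this in base R on made-up values (the data frame `toy` and the variable `cutoff` are hypothetical); it is one possible definition, not the lab's method.

```r
# Hypothetical paired delays (minutes), not taken from nycflights.
toy <- data.frame(
  dep_delay = c(-5, 0, 10, 20, 30, 45, 60, 90),
  arr_delay = c(-20, -15, -8, -1, 6, 25, 41, 70)
)

# Flights that still arrived on time or early.
on_time <- toy[toy$arr_delay <= 0, ]

# Largest departure delay among flights that still arrived on time:
# a crude stand-in for reading the cutoff off the scatterplot.
cutoff <- max(on_time$dep_delay)
cutoff
```

Because it relies on a maximum, this estimate is sensitive to a single unusually fast flight, so on real data a high quantile may be a steadier choice.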

