library(tidyverse)
library(openintro)
data(nycflights)
names(nycflights)
##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay" "arr_time" 
##  [7] "arr_delay" "carrier"   "tailnum"   "flight"    "origin"    "dest"     
## [13] "air_time"  "distance"  "hour"      "minute"
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ~
## $ month     <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10~
## $ day       <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, ~
## $ dep_time  <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940~
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, ~
## $ arr_time  <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, ~
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, ~
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", ~
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", ~
## $ flight    <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, ~
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA~
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA~
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,~
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,~
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6~
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24~
### First histogram
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()

### Second histogram
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

### Third histogram
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

Exercise 1

Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

#The first histogram uses the default of 30 bins (not a bin width of 30), and its
#tallest bar is a bit below 30000. The second histogram has a bin width of 15 and
#a tallest bar a bit above 20000. Finally, the third histogram has a bin width of
#150 and a tallest bar a bit above 30000. The second histogram has the narrowest
#bins, so it shows the most detail in the distribution of departure delays.
#The third histogram has the widest bins and shows the least detail, because
#each bin lumps together a very wide range of flight delays.
#The first histogram falls between the two: it obscures more than the second
#histogram but less than the third.

Exercise 2

Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

count(sfo_feb_flights)
## # A tibble: 1 x 1
##       n
##   <int>
## 1    68

Exercise 3

Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

# Creating a histogram of arr_delay for sfo_feb_flights with a bin width of 15
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 15)

# Summarizing the results: min, max, median, and IQR of arr_delay in sfo_feb_flights
sfo_feb_flights %>%
  summarise( 
            min = min(arr_delay),
            max = max(arr_delay),        
            median = median(arr_delay),
            IQR = IQR(arr_delay)
            )
## # A tibble: 1 x 4
##     min   max median   IQR
##   <dbl> <dbl>  <dbl> <dbl>
## 1   -66   196    -11  23.2

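The median arrival delay is -11 minutes, so at least half of these flights arrived early. The histogram is right-skewed: most flights cluster near the median, while a few arrived 100 to 200 minutes late (the maximum is 196). Because of that skew, the median and IQR (about 23 minutes) describe this distribution better than the mean and standard deviation would.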

Exercise 4

Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

sfo_feb_flights %>%
  group_by(carrier) %>%
  # var() already returns a single number per carrier, so wrapping it in median() had no effect
  summarize(variance_arrival_delay = var(arr_delay)) %>%
  arrange(desc(variance_arrival_delay))
## # A tibble: 5 x 2
##   carrier variance_arrival_delay
##   <chr>                    <dbl>
## 1 UA                       2335.
## 2 VX                       1669.
## 3 AA                        868.
## 4 DL                        485.
## 5 B6                        121.

The code above ranks carriers by the variance of arr_delay, used here as the measure of spread in place of the interquartile range. United Airlines (UA) has the highest variance, so by this measure it has the most variable arrival delays. Variance is sensitive to outliers, so a ranking by IQR could differ.

Exercise 5

Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

#Choosing the month with the lowest mean departure delay has the advantage that
#the mean uses every flight, so it reflects how much delay to expect on average,
#including rare long delays. The con is that the mean can be pulled upward by a
#few extreme delays, so it may not describe a typical flight.
#Choosing the month with the lowest median departure delay has the advantage that
#the median is not affected by outliers, so it describes the typical flight.
#The con is that the median ignores how large the extreme delays are, so a month
#with a low median can still carry a risk of very long delays.

Exercise 6

If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

nycflights %>% 
  group_by(origin) %>%
  summarize(average_departure = mean(dep_delay)) %>%
  arrange(average_departure)
## # A tibble: 3 x 2
##   origin average_departure
##   <chr>              <dbl>
## 1 LGA                 10.1
## 2 JFK                 12.3
## 3 EWR                 15.3

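I would choose to fly out of LaGuardia (LGA). The code above ranks the airports by mean departure delay rather than by on-time departure percentage, and LGA has the lowest mean delay (about 10.1 minutes), which suggests the best on-time performance of the three.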

Exercise 7

Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

nycflights <- nycflights %>%
  mutate(avg_speed = distance / air_time * 60)

Exercise 8

Make a scatter plot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point()

# "avg_speed vs. distance" puts avg_speed on the y-axis and distance on the x-axis
nycflights %>% ggplot() +
  geom_point(aes(x = distance, y = avg_speed, color = carrier))

Looking at the scatter plot, average speed increases as distance increases. However, beyond a distance of about 2,500 miles, the average speed levels off and stays roughly constant.

Exercise 9

Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

c_delays <- nycflights %>%
  filter(carrier == 'AA' | carrier == 'DL' | carrier == 'UA')
ggplot(c_delays, aes(dep_delay, arr_delay, color = carrier)) + geom_point()

Looking at the replicated scatter plot, the cutoff appears to be a departure delay of about half an hour. Beyond that point, arrival delays rise along with departure delays.

---
title: "Lab 2: Intro to Data"
author: "Vladimir Nimchenko"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
```

```{r the-data, message=FALSE}
data(nycflights)
names(nycflights)
glimpse(nycflights)
```
```{r Analysis, message=FALSE}
### First histogram
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
### Second histogram
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
### Third histogram
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

```




### Exercise 1
Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

```{r histogram-comparison}
#The first histogram uses the default of 30 bins (not a bin width of 30), and its
#tallest bar is a bit below 30000. The second histogram has a bin width of 15 and
#a tallest bar a bit above 20000. Finally, the third histogram has a bin width of
#150 and a tallest bar a bit above 30000. The second histogram has the narrowest
#bins, so it shows the most detail in the distribution of departure delays.
#The third histogram has the widest bins and shows the least detail, because
#each bin lumps together a very wide range of flight delays.
#The first histogram falls between the two: it obscures more than the second
#histogram but less than the third.
```
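
The effect of bin width can be sketched without ggplot2. The example below is a minimal illustration on made-up delays: the vector `delays` and the helper `bin_counts()` are hypothetical, and the bin alignment is simpler than what `geom_histogram()` does. It shows how a wide bin merges values that a narrow bin keeps apart.

```r
# Hypothetical departure delays in minutes (not taken from nycflights)
delays <- c(-5, -3, 0, 2, 8, 12, 20, 35, 40, 95, 140, 310)

# Count observations per bin, with bins starting at multiples of `width`
bin_counts <- function(x, width) {
  table(floor(x / width) * width)
}

narrow <- bin_counts(delays, 15)   # many bins: small and moderate delays stay separate
wide   <- bin_counts(delays, 150)  # few bins: almost everything lands in one bar

print(narrow)
print(wide)
```

With the wide bins, every delay from 0 up to 149 minutes ends up in one bar, so the difference between a 2-minute and a 140-minute delay is no longer visible.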


### Exercise 2
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

```{r create-dataframe}
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

count(sfo_feb_flights)
```


### Exercise 3
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

```{r summary-statistics}
# Creating a histogram of arr_delay for sfo_feb_flights with a bin width of 15
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 15)
# Summarizing the results: min, max, median, and IQR of arr_delay in sfo_feb_flights
sfo_feb_flights %>%
  summarise( 
            min = min(arr_delay),
            max = max(arr_delay),        
            median = median(arr_delay),
            IQR = IQR(arr_delay)
            )
```
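The median arrival delay is -11 minutes, so at least half of these flights arrived early. The histogram is right-skewed: most flights cluster near the median, while a few arrived 100 to 200 minutes late (the maximum is 196). Because of that skew, the median and IQR (about 23 minutes) describe this distribution better than the mean and standard deviation would.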
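
The hint about matching summary statistics to the shape of the distribution can be illustrated with a small sketch. The delays below are invented for illustration (they are not the SFO February flights): mostly early arrivals plus two long delays, which produces a right skew.

```r
# Hypothetical arrival delays in minutes: mostly early, two very late
arr_delay <- c(-30, -20, -15, -12, -10, -8, -5, 0, 120, 190)

center_mean   <- mean(arr_delay)    # pulled upward by the two long delays
center_median <- median(arr_delay)  # stays with the bulk of the flights
spread_sd     <- sd(arr_delay)      # inflated by the two long delays
spread_iqr    <- IQR(arr_delay)     # spread of the middle half only

print(c(mean = center_mean, median = center_median,
        sd = spread_sd, IQR = spread_iqr))
```

In this toy sample the mean is positive even though eight of the ten flights were on time or early, which is why the median and IQR are the safer choice for skewed data.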

### Exercise 4
Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

```{r summary-median-iqr}
sfo_feb_flights %>%
  group_by(carrier) %>%
  # var() already returns a single number per carrier, so wrapping it in median() had no effect
  summarize(variance_arrival_delay = var(arr_delay)) %>%
  arrange(desc(variance_arrival_delay))
```
The code above ranks carriers by the variance of arr_delay, used here as the measure of spread in place of the interquartile range. United Airlines (UA) has the highest variance, so by this measure it has the most variable arrival delays. Variance is sensitive to outliers, so a ranking by IQR could differ.
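
The exercise asks for the median and interquartile range per carrier. A minimal sketch of that calculation follows, using a small made-up tibble (`toy_flights` and its values are hypothetical) so the numbers are easy to check by hand. The same pipeline should apply to `sfo_feb_flights`.

```r
library(dplyr)

# Hypothetical arrival delays for two carriers (not the real SFO data)
toy_flights <- tibble(
  carrier   = c("AA", "AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  arr_delay = c(-10, -5, 0, 5, -40, -10, 20, 60)
)

by_carrier <- toy_flights %>%
  group_by(carrier) %>%
  summarise(
    median_ad = median(arr_delay),  # typical arrival delay per carrier
    iqr_ad    = IQR(arr_delay),     # spread of the middle half per carrier
    n_flights = n()
  ) %>%
  arrange(desc(iqr_ad))             # most variable carrier first

print(by_carrier)
```

Sorting by `iqr_ad` in descending order puts the carrier with the most variable arrival delays in the first row.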

### Exercise 5
Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

```{r lowest-departure-delay}
#Choosing the month with the lowest mean departure delay has the advantage that
#the mean uses every flight, so it reflects how much delay to expect on average,
#including rare long delays. The con is that the mean can be pulled upward by a
#few extreme delays, so it may not describe a typical flight.
#Choosing the month with the lowest median departure delay has the advantage that
#the median is not affected by outliers, so it describes the typical flight.
#The con is that the median ignores how large the extreme delays are, so a month
#with a low median can still carry a risk of very long delays.
```
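
The trade-off between the two choices can be seen in a small sketch. The two months below are invented for illustration (`toy_flights` is hypothetical): month 1 has mostly punctual flights plus one extreme delay, while month 2 has consistently moderate delays.

```r
library(dplyr)

# Hypothetical departure delays in minutes for two months
toy_flights <- tibble(
  month     = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  dep_delay = c(-2, 0, 1, 3, 200, 8, 9, 10, 11, 12)
)

monthly <- toy_flights %>%
  group_by(month) %>%
  summarise(
    mean_dd   = mean(dep_delay),    # sensitive to the single 200-minute delay
    median_dd = median(dep_delay)   # unaffected by that outlier
  )

best_by_mean   <- monthly$month[which.min(monthly$mean_dd)]
best_by_median <- monthly$month[which.min(monthly$median_dd)]

print(monthly)
```

Here the two criteria pick different months: the mean favors month 2 because it penalizes the one very long delay, while the median favors month 1 because the typical flight there leaves almost on time.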

### Exercise 6
If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
```{r nyc-airport-choice}
nycflights %>% 
  group_by(origin) %>%
  summarize(average_departure = mean(dep_delay)) %>%
  arrange(average_departure)
```

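I would choose to fly out of LaGuardia (LGA). The code above ranks the airports by mean departure delay rather than by on-time departure percentage, and LGA has the lowest mean delay of the three, which suggests the best on-time performance.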
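
An on-time departure percentage can also be computed directly. The sketch below assumes a flight counts as "on time" when it departs less than 5 minutes late; that threshold, the tibble `toy_flights`, and the column names `dep_type` and `ot_dep_rate` are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical departure delays in minutes for the three NYC airports
toy_flights <- tibble(
  origin    = c("EWR", "EWR", "EWR", "EWR",
                "JFK", "JFK", "JFK", "JFK",
                "LGA", "LGA", "LGA", "LGA"),
  dep_delay = c(10, 45, 8, 2,
                -1, 6, 7, 4,
                -5, -2, 3, 60)
)

on_time <- toy_flights %>%
  # Assumed rule: less than 5 minutes late counts as on time
  mutate(dep_type = if_else(dep_delay < 5, "on time", "delayed")) %>%
  group_by(origin) %>%
  summarise(
    ot_dep_rate = mean(dep_type == "on time"),  # share of on-time departures
    mean_dd     = mean(dep_delay)
  ) %>%
  arrange(desc(ot_dep_rate))

print(on_time)
```

In this toy data the airport with the highest on-time rate is not the one with the lowest mean delay, so the two measures can rank airports differently.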

### Exercise 7
Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
```{r mutate-data-frame}
nycflights <- nycflights %>%
  mutate(avg_speed = distance / air_time * 60)
```
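
As a quick check of the unit conversion, the arithmetic can be worked by hand for the first flight shown in the `glimpse()` output, which lists a distance of 2475 miles and an air time of 313 minutes. The variable names below are chosen for illustration only.

```r
# Values from the first row of the glimpse() output
distance_miles <- 2475
air_time_min   <- 313

# Convert minutes to hours, then divide distance by hours
air_time_hours <- air_time_min / 60
avg_speed_mph  <- distance_miles / air_time_hours

# Equivalent form used in the mutate() call: distance / air_time * 60
avg_speed_alt <- distance_miles / air_time_min * 60

print(round(avg_speed_mph, 1))
```

Both forms give roughly 474 mph, a plausible cruising speed for a transcontinental flight.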

### Exercise 8
Make a scatter plot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point()

```{r scatter-plot}
# "avg_speed vs. distance" puts avg_speed on the y-axis and distance on the x-axis
nycflights %>% ggplot() +
  geom_point(aes(x = distance, y = avg_speed, color = carrier))
```

Looking at the scatter plot, average speed increases as distance increases. However, beyond a distance of about 2,500 miles, the average speed levels off and stays roughly constant.

### Exercise 9
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

```{r scatter-plot-replication}
c_delays <- nycflights %>%
  filter(carrier == 'AA' | carrier == 'DL' | carrier == 'UA')
ggplot(c_delays, aes(dep_delay, arr_delay, color = carrier)) + geom_point()
```

Looking at the replicated scatter plot, the cutoff appears to be a departure delay of about half an hour. Beyond that point, arrival delays rise along with departure delays.
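
The carrier filter can also be written with `%in%`, which is easier to extend than a chain of `|` comparisons. The sketch below uses a small hypothetical tibble (`toy_flights`) in place of `nycflights`.

```r
library(dplyr)

# Hypothetical flights from five carriers
toy_flights <- tibble(
  carrier   = c("AA", "DL", "UA", "B6", "VX", "AA"),
  dep_delay = c(5, -2, 40, 10, 0, 90),
  arr_delay = c(-3, -10, 35, 12, -8, 85)
)

# Keep only American, Delta, and United flights
c_delays_toy <- toy_flights %>%
  filter(carrier %in% c("AA", "DL", "UA"))

print(c_delays_toy)
```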