Dataset Exploration
- Description of columns
Question 1
Question 2 (number of flights and average arrival delay)
Question 3

Dataset Exploration

The diamonds dataset contains measurements on 10 different variables (like price, color, clarity, etc.) for 53,940 different diamonds.

cat('number of rows:',nrow(diamonds), #53,940 
   '\nnumber of columns:',ncol(diamonds)) #10

## number of rows: 53940 
## number of columns: 10

glimpse(flights) # shows the row count, column count, and a preview of every column

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

summary(flights) # summary statistics for each column

##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
##  Median :29.00   Median :2013-07-03 10:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
##

Description of columns

Description of this dataset on RPubs

Column Name	Variable Type	Variable Description
carat	numerical continuous	diamond carat (the unit of measurement that indicates the diamond’s weight)
cut	categorical ordinal	The diamond’s quality in terms of its proportions, symmetry, and polish, which determine how well it interacts with light to produce brilliance and sparkle
color	categorical ordinal	A diamond’s hue, from colorless to yellow, gray, brown and nearly every shade of the rainbow
clarity	categorical ordinal	A diamond’s internal characteristics, or inclusions, and surface imperfections, or blemishes, under 10-power magnification
depth	numerical continuous	total depth percentage: the vertical measurement from the top table to the pointed culet at the bottom (z), shown as a percentage of the average of the diamond’s length and width (x and y)
table	numerical continuous	The diamond’s flat, top facet which is clearly visible when viewed from above, recorded as its width relative to the diamond’s widest point. The ideal table percentage varies by shape
price	numerical discrete	price of diamond in US dollars
x	numerical continuous	diamond length in mm
y	numerical continuous	diamond width in mm
z	numerical continuous	diamond depth in mm

Question 1

Above, we analyzed the effect of carat, cut, and color on a diamond’s price. But what about the fourth C, clarity?

If you need more information about a diamond’s clarity, color, or cut, you can consult this link from the American Gem Society.

Part A (diamond clarity and price)

How does clarity matter to a diamond’s price?

Answer

To figure out how diamond clarity (a categorical ordinal variable) relates to diamond price (a numerical variable), I first looked at the frequency distribution of the clarity types with a bar plot. The bar plot showed that diamonds with clarity grades of 6, 5, 4, and 3 (SI2, SI1, VS2, and VS1) occur most often. I should therefore focus on those clarity types in any analysis of clarity and price, because larger groups give more stable estimates and I can say with greater confidence that results derived from them are reliable. Diamonds with the other clarity types, 8, 2, 1, and 0 (I1, VVS2, VVS1, and IF), have comparatively small counts in the dataset, so any insights derived from them should be treated with caution.

After I analysed the frequency distribution of the diamond clarity types, I compared diamond clarity and price through a box plot. For all box plots, the mean was greater than the median, which indicates that all of the distributions were right skewed. As a result, I relied on the median, since it is a more reliable measure of central tendency for a skewed distribution.

When looking at the median prices for clarity grades 6, 5, 4, and 3, I noticed that diamonds with a lower clarity grade (6 and 5) have a higher median price than diamonds with a higher clarity grade (4 and 3). I noticed the same pattern among the diamonds with clarity grades of 0, 1, and 2.

In conclusion, diamonds with a lower clarity grade tend to have a higher price than diamonds with a higher clarity grade. To examine why, we would need to look at whether the other C’s (cut, color, and carat) play a role in bringing up the price of diamonds with a low clarity grade.

Here are the steps I used to answer this question:

Step 1: Exploratory Data Analysis (frequency distribution for a categorical variable with bar charts)

I created a bar chart of the clarity variable to examine the frequency distribution of diamonds across clarity types. A frequency distribution helps assess how balanced or uneven the sample sizes are among categories, which is important because unequal group sizes can affect the reliability of comparisons in later steps (e.g., when relating clarity to price).

It is important to note that frequency distributions are used to summarize categorical variables. A frequency distribution lets us see the mode, the category that occurs most often, which plays a role similar to a measure of central tendency for numerical variables. The analogy is not exact, though, since central tendency for numerical variables also covers the mean and the median. Uneven frequency distributions don’t invalidate our analysis; they inform us about how much confidence we should have in the averages and comparisons computed for each group.

Bar charts are not the only type of graphs which can show frequency distributions. I have included a table below which indicates the different graphs that you can use to illustrate frequency distributions.

Graph Name	Frequency Distribution Type	Purpose
Bar Chart	frequency distribution of a single variable	To assess how balanced or uneven the sample sizes are among categories. Unequal group sizes can affect the reliability of comparisons to different variables
Bubble Chart	joint frequency distribution	To assess how balanced or uneven the sample sizes are among two categories/variables which will inform us whether the averages we are calculating within those groups are accurate or not.
Heat map	joint frequency distribution	To assess how balanced or uneven the sample sizes are among two categories/variables which will inform us whether the averages we are calculating within those groups are accurate or not.

# STEP 1: creating the barchart to count 
#         number of clarity types in dataset
# 8 (I1) = Lowest clarity grade
# 0 (IF) = Highest clarity grade

ggplot(data=diamonds) + 
  geom_bar(aes(x=clarity)) +
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500)) + 
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(y = "Number of Diamonds",
       x = 'Diamond Clarity Scale', 
       title = "Number of Diamond Clarity Types",
       caption = "Dataset: Diamonds Dataset | Data Source: ggplot2 package")

Getting the exact counts through a table:

diamonds %>%
  group_by(clarity) %>% 
  count()

## # A tibble: 8 × 2
## # Groups:   clarity [8]
##   clarity     n
##   <ord>   <int>
## 1 I1        741
## 2 SI2      9194
## 3 SI1     13065
## 4 VS2     12258
## 5 VS1      8171
## 6 VVS2     5066
## 7 VVS1     3655
## 8 IF       1790
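
To make the imbalance between groups concrete, the counts above can be expressed as percentages of the whole dataset. The sketch below re-enters the eight counts from the table by hand in base R, so it runs without loading any packages; the names clarity_counts and clarity_share are only illustrative.

```r
# Counts copied from the table above (lowest to highest clarity grade)
clarity_counts <- c(I1 = 741, SI2 = 9194, SI1 = 13065, VS2 = 12258,
                    VS1 = 8171, VVS2 = 5066, VVS1 = 3655, IF = 1790)

# Share of each clarity type as a percentage of all diamonds
clarity_share <- round(100 * clarity_counts / sum(clarity_counts), 1)

print(sum(clarity_counts))   # total number of diamonds
print(clarity_share)
```

With the diamonds tibble loaded, diamonds %>% count(clarity) %>% mutate(share = 100 * n / sum(n)) should give the same percentages before rounding.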

Step 2: Exploratory Data Analysis (distribution for a numerical variable with box plots)

Based on the bar plot from step 1, the diamonds data set has relatively low sample sizes for clarity types “8/I1”, “2/VVS2”, “1/VVS1”, and “0/IF”. When interpreting the box plot of price by clarity, the results for these four groups should therefore be treated with caution. Note: Keep in mind that 8 is the lowest clarity grade while 0 is the highest.

The box plot highlights the median and the shape of the distribution (via quartiles, spread, and skewness), while also showing potential outliers that distort the mean. A box plot has:

Minimum: The smallest value in the dataset
Lower Quartile Q1: The value below which 25% of the data fall
Median (Quartile Q2): The middle value in the dataset
Upper Quartile Q3: The value below which 75% of the data fall
Maximum: The largest value in the dataset
Whiskers: Lines extending from the box (Q1 and Q3) to the smallest and largest values that are not outliers; by default geom_boxplot() draws them to the furthest points within 1.5 × IQR of the box
Interquartile range (IQR): The spread of the middle 50% of a data set (Q3 − Q1)
Outliers: Points beyond whiskers
Mean (optional, sometimes plotted as a dot): The arithmetic average.
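
As a rough sketch of how these components can be computed by hand, the base R code below works through a small made-up sample. It assumes the 1.5 × IQR rule for whiskers and outliers that geom_boxplot() uses by default; the sample values and variable names are hypothetical.

```r
# Hypothetical sample with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1  <- unname(quantile(x, 0.25))   # lower quartile
med <- median(x)                   # median (Q2)
q3  <- unname(quantile(x, 0.75))   # upper quartile
iqr <- q3 - q1                     # interquartile range

# Fences: points outside these limits are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])
outliers <- x[x < lower_fence | x > upper_fence]

print(c(q1 = q1, median = med, q3 = q3, iqr = iqr))
print(c(lower_whisker = lower_whisker, upper_whisker = upper_whisker))
print(outliers)
```

In this sample the value 50 lies beyond the upper fence, so it would be drawn as a separate point while the upper whisker stops at 9.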

The table below highlights the skew of the distribution based on the mean and median in the box plot. The skew helps you decide which measure of central tendency to trust (median vs. mean).

Skew	Median and Mean positions	Purpose
Right Skew	Mean > Median	The mean is inflated relative to most of the data, so the median is more reliable for central tendency. If you use the mean, it might suggest the typical value is more expensive/higher than most of the data points
Left Skew	Mean < Median	The mean is deflated relative to most of the data, so the median is more reliable for central tendency. If you use the mean, it might suggest the typical value is cheaper/lower than most of the data points
No Skew / Symmetrical Distribution	Mean = Median = Mode	The mean is reliable for central tendency. The data is predictable and well-behaved
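
The comparison in the table can be sketched in a few lines of base R. Note that skew_direction() is a hypothetical helper written for this illustration (not a function from any package), the prices are made up, and comparing the mean with the median is a rule of thumb rather than a formal test of skewness.

```r
# Compare mean and median to label the direction of skew.
# skew_direction() is a hypothetical helper, not a function from any package.
skew_direction <- function(values) {
  m  <- mean(values)
  md <- median(values)
  if (m > md) {
    "right"
  } else if (m < md) {
    "left"
  } else {
    "symmetric"
  }
}

# Hypothetical prices: one expensive diamond pulls the mean upward
prices <- c(500, 700, 900, 1200, 15000)

print(mean(prices))
print(median(prices))
print(skew_direction(prices))
```

Here the single large price drags the mean well above the median, which is the right-skew situation in which the median is the safer summary.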

ggplot(diamonds, mapping=aes(x=clarity,y=price)) + 
  geom_boxplot() + 
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "red") +
  coord_flip() +
  scale_y_continuous(breaks=c(0,2500,5000,10000,15000,20000)) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))  +
  labs(y = "Price",
       x = 'Diamond Clarity Scale', 
       title = "Price of Diamond Clarity Types",
       caption = "Dataset: Diamonds Dataset | Data Source: ggplot2 package")

Part B (diamond clarity relation to carat, cut and color)

How is clarity related to the other Cs?

Answer

I used heat maps to graph the relationship between diamond clarity (a categorical ordinal variable) and diamond color and cut (both categorical ordinal variables). I used a box plot to graph the relationship between diamond clarity and carat (a numerical continuous variable).

I noticed that in both heat maps the diamond clarity grades of 6, 5, 4, and 3 had high counts (I got an idea of these counts by looking at the column for each clarity type and seeing which columns have the darkest shades).

In terms of color, the data set has high counts of colors D, E, F, G, and H. In terms of cut, Ideal and Premium had high counts. When examining the relationship between these variables in a scatter plot, I would need to treat the underrepresented groups with caution.

All of the box plots in the clarity and carat (weight) graph were right skewed, since the mean was greater than the median, which also means that the median is the more reliable measure of central tendency. Looking at the graph, I noticed that diamonds with a low clarity grade tend to have a higher weight.

The observations from the box plot lead me to a working hypothesis that, despite having lower grades in cut, clarity, and color, the weight of the diamond can keep its price high.

To answer this question I followed these steps:

Step 3: Exploratory Data Analysis (joint frequency distributions for categorical ordinal variables with heat maps)

# heat map: joint counts of clarity and color
clarity_color_counts <- diamonds %>%
  group_by(clarity, color) %>% 
  count()

ggplot(clarity_color_counts, aes(x=clarity, y=color, fill=n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low="white", high="blue") +
  labs(fill = "Count", 
       y = "Diamond Color",
       x = 'clarity grade', 
       title = "Heat map to display the joint frequency distributions of clarity and color",
       caption = "Dataset: Diamonds Dataset | Data Source: ggplot2 package")

# heat map: joint counts of clarity and cut
clarity_cut_counts <- diamonds %>%
  group_by(clarity, cut) %>% 
  count()

ggplot(clarity_cut_counts, aes(x=clarity, y=cut, fill=n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low="white", high="blue") +
  labs(fill = "Count", 
       y = "Cut Grade",
       x = 'clarity grade', 
       title = "Heat map to display the joint frequency distributions of clarity and cut",
       caption = "Dataset: Diamonds Dataset | Data Source: ggplot2 package")

Step 4: Exploratory Data Analysis (distribution for a numerical variable with box plots)

ggplot(diamonds, mapping=aes(x=clarity,y=carat)) + 
  geom_boxplot() + 
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "red") +
  coord_flip() + 
  scale_x_discrete(labels = c(
    "I1" = 8, "SI2"= 6 , "SI1"= 5 , "VS2"= 4,
    "VS1"= 3, "VVS2"= 2, "VVS1"= 1, "IF" = 0 )) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))  +
  labs(y = "Carat/Weight",
       x = 'Diamond Clarity Scale', 
       title = "Diamond carat for each clarity grade/group",
       caption = "Dataset: Diamonds Dataset | Data Source: ggplot2 package")

Part C (conclusion)

What would you conclude based on your analysis?

Answer

Based on my analysis of the scatter plots, which pair diamond carat with price and color the points by diamond color, cut, and clarity in turn, I observed that diamonds with high-grade features usually weigh less.

For example, diamonds with a color of D, a cut of Ideal, or a clarity of IF tend to have less weight. All of these are high-grade features.

Diamonds with less weight can have the same price as diamonds with more weight, but the diamonds with more weight tend not to have high-grade features.

These observations from the scatter plots strengthen the working hypothesis that I was forming in part B. Diamonds with high weight have a high price because of their weight, not because of their features, while diamonds with less weight have a high price because of their high-grade features.

Step 5: Exploratory Data Analysis (scatter plot comparing two numerical variables, grouped by categorical variable)

# STEP 5: creating scatterplots of carat-price and adding a 
#         color aesthetic that represents color, cut, and clarity

#  carat-price scatterplot with a color aesthetic 
ggplot(data = diamonds) + 
  geom_point(aes(x = carat, y = price, color=color), alpha=0.5) + 
  ylab("Price") + 
  xlab("Diamond Weight (carat)") + 
  scale_y_continuous(breaks=c(0,5000,10000,15000,20000)) + 
  ggtitle("Price of Diamond by their weight, grouped by color") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name="Diamond Color")

#  carat-price scatterplot with a cut aesthetic 
ggplot(data = diamonds) + 
  geom_point(aes(x = carat, y = price, color=cut), alpha=0.5) + 
  ylab("Price") + 
  xlab("Diamond Weight (carat)") + 
  scale_y_continuous(breaks=c(0,5000,10000,15000,20000)) + 
  ggtitle("Price of Diamond by their weight, grouped by cut") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name="Diamond Cut")

#  carat-price scatterplot with a clarity aesthetic 
ggplot(data = diamonds) + 
  geom_point(aes(x = carat, y = price, color=clarity), alpha=0.5) + 
  ylab("Price") + 
  xlab("Diamond Weight (carat)") + 
  scale_y_continuous(breaks=c(0,5000,10000,15000,20000)) + 
  ggtitle("Price of Diamond by their weight, grouped by clarity") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name="Diamond Clarity")

Used the carat and price scatter plot and added a color aesthetic which represents the various clarity types in our dataset.
Compared all the carat and price scatter plots, each having a different aesthetic of color, cut, and clarity
Created a geom_count graph that compares clarity with color and cut

Question 2 (number of flights and average arrival delay)

Calculate the number of flights for each destination in the NYC flights data. Plot that against the average arrival delay for each destination. What, if anything, would you conclude? Is this surprising? Why or why not?

Answer

This question highlights how more data points reduce the variability of the mean (standard error). This makes measures of central tendency like the mean more stable and representative of the underlying trend.
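
The link between sample size and stability follows from the standard error of the mean, sd / sqrt(n). A minimal sketch, assuming for illustration that arrival delays have a standard deviation of 40 minutes at every destination (an assumed value, not one computed from the flights data); standard_error() is a hypothetical helper:

```r
# Standard error of the mean: typical distance between a sample mean
# and the underlying mean, for a sample of size n
standard_error <- function(sd, n) {
  sd / sqrt(n)
}

assumed_sd <- 40                       # assumed, for illustration only
n_flights  <- c(25, 100, 2500, 10000)  # hypothetical flight counts

se <- standard_error(assumed_sd, n_flights)
print(data.frame(n_flights = n_flights, standard_error = se))
```

Under this assumption, the average for a destination with 25 flights can easily be off by several minutes, while the average for a destination with 10,000 flights is pinned down to within a fraction of a minute.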

Based on my output, I came to the conclusion that as the number of flights for each destination increases, the average arrival delay decreases and slowly moves towards zero. I reached this conclusion by looking at how the geom_smooth line gradually goes down. I also saw that there were many high average arrival delays when the number of flights was close to zero. As the number of flights reached 2500 and onwards, the highest average arrival delays went down from 30 to around 10.

# Question 2

# STEP 1: creating an object that has non-cancelled flights
not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

# STEP 2: scatterplot of flight # for each destination against
#         ave. arrival delay for each destination

not_cancelled %>% 
  group_by(dest) %>% 
  summarize(count = n(),
            avg_delay1 = mean(arr_delay)) %>%
  ggplot(aes(x = count, y = avg_delay1)) +   
  geom_point() + 
  geom_smooth() +
  ylab("Average Arrival Delay") + 
  xlab("Number of Flights for each Destination") + 
  scale_y_continuous(breaks=c(-20,-10,0,10,20,30,40,50)) +
  scale_x_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17000))+
  ggtitle("Destination arrival delays and flight numbers") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 3

This question will ask you to solve some problems with the nycflights dataset.

Part A (airplane average speed and distance travelled)

Plot average speed against distance. What do you conclude?

Answer

My output shows that as an airplane’s average speed increases, the distance it travels also increases. Speed here is distance divided by air time, so it is measured in miles per minute. There are some outliers, such as the airplanes with a speed of 10-12, which travel only a short distance.

One possible explanation for these outliers is that the planes with a high average speed did not travel far because of the plane itself, for example an engine that was too large or too small. Another possibility is that some airplanes had to re-route or take a longer way to their destination (this could explain why there are some outliers in the upper part of the graph). Weather could also be a factor that prevented planes with high average speeds from going far.

# Question 3A

# scatterplot of average airplane speed and distance airplane travelled 
not_cancelled %>% 
  mutate(avg_speed = distance/air_time) %>%
  ggplot(aes(x = avg_speed, y = distance)) +   
    geom_point() +   
  ylab("Distance Airplane Travelled") + 
  xlab("Airplane Average Speed") + 
  scale_y_continuous(breaks=c(0,1000,2000,3000,4000,5000)) +
  scale_x_continuous(breaks=c(0,2,4,6,8,10,12))+
  ggtitle("Distance Travelled by Average Airplane Speed") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
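
Because distance is recorded in miles and air_time in minutes (per the nycflights13 documentation), avg_speed in the plot above is in miles per minute. As a small sketch, the base R code below converts the speed to miles per hour for the first three flights shown in the glimpse() output; the variable names are only illustrative.

```r
# First three flights from the glimpse() output above
distance <- c(1400, 1416, 1089)   # miles
air_time <- c(227, 227, 160)      # minutes

speed_mi_per_min <- distance / air_time
speed_mph <- speed_mi_per_min * 60  # 60 minutes in an hour

print(round(speed_mi_per_min, 2))
print(round(speed_mph))
```

In the pipeline above, the same conversion would be mutate(avg_speed_mph = distance / air_time * 60), which may make the x-axis easier to interpret.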

Part B (on time flight proportions)

Create a new variable that records whether or not a flight arrived on time.

Hint: define a flight to be “on time” if it gets to the destination on time or earlier than expected, regardless of any departure delays. Then, determine the on-time arrival percentage based on whether the flight departed on time/early or not (i.e., whether there was a departure delay). What percent of flights that were “delayed” departing arrive “on time”?

Answer

27.7 percent of flights that were delayed for departure arrived on time.

I coded each flight’s arrival as 1 (on time or early) or 0 (late), so that the proportion of early/on-time arrivals could be calculated. I did the same for flights that had a departure delay.

Since the question asks about flights that were delayed and still arrived on time, I had to make sure I was getting the proportion of on-time arrivals among delayed flights only. This is why I added the filter function (to keep only the data for delayed flights).

After that, I used the summarize function to calculate the percentage of flights that had a departure delay and still arrived early/on-time.

# Question 3B

# calculation to get the percentage of delayed departures that 
# still arrived on time or early
not_cancelled %>% 
  mutate(on_time = ifelse(arr_delay <= 0, 1, 0),
         delay_flights = ifelse(dep_delay > 0, 1, 0)) %>%
  filter(delay_flights == 1) %>%
  summarize(on_time_proportion = 100 * mean(on_time))

## # A tibble: 1 × 1
##   on_time_proportion
##                <dbl>
## 1               27.7
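
The hint also asks for the on-time percentage in both departure groups. A minimal base R sketch of that comparison is shown below on a made-up data frame; toy_flights and its values are hypothetical and are not rows from the flights data.

```r
# Hypothetical flights (delays in minutes; negative means early)
toy_flights <- data.frame(
  dep_delay = c(5, 12, -3, 30, 2, 0),
  arr_delay = c(-4, 10, -10, 25, 0, -1)
)

# Logical flags: TRUE/FALSE average the same way as 1/0
toy_flights$on_time     <- toy_flights$arr_delay <= 0
toy_flights$dep_delayed <- toy_flights$dep_delay > 0

# On-time arrival percentage within each departure group
on_time_pct <- 100 * tapply(toy_flights$on_time, toy_flights$dep_delayed, mean)
print(on_time_pct)
```

The entry named TRUE is the percentage for flights that departed late, and the entry named FALSE is the percentage for flights that departed on time or early.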

Part C (departure delay and on time flights proportions)

Building on the question above, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

Answer

The rough cutoff point for departure delays is 7.47 minutes. This is a rough estimate: if your departure delay is about 7.47 minutes long, there is still a chance that your flight will arrive on time/early. I reached this conclusion from the code I wrote below.

I kept the code similar to the one in part B. The detail that I changed was the filter: I kept only flights that were delayed at departure and still arrived early/on-time. I did this because the question asks for a rough estimate of a cutoff point for departure delays at which an early/on-time arrival can still be expected. After that, I took the mean of the departure delays of those flights, which gives an estimate of the typical departure delay that still allows an on-time/early arrival.

# calculation to get the average departure delay of delayed flights 
# that still arrived on time or early

not_cancelled %>% 
  mutate(on_time = ifelse(arr_delay <= 0, 1, 0),
         delay_flights = ifelse(dep_delay > 0, 1, 0)) %>%
  filter(delay_flights == 1, on_time == 1) %>%
  summarize(mean_of_dep_delay = mean(dep_delay))

## # A tibble: 1 × 1
##   mean_of_dep_delay
##               <dbl>
## 1              7.47
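
As a complementary check on the cutoff (not the calculation used above), one could look at how the on-time percentage falls as the departure delay grows and read off where it drops below an acceptable level. A minimal sketch on made-up delays; the values and the bin edges are assumptions for illustration only.

```r
# Hypothetical delayed flights (minutes)
dep_delay <- c(1, 3, 4, 6, 8, 9, 12, 14, 20, 25)
arr_delay <- c(-10, -5, -2, 3, -1, -4, 5, -3, 15, 30)

# Group departure delays into bins: (0,5], (5,10], (10,15], (15,Inf]
delay_bin <- cut(dep_delay, breaks = c(0, 5, 10, 15, Inf))

# Percentage of on-time (or early) arrivals within each bin
on_time_by_bin <- 100 * tapply(arr_delay <= 0, delay_bin, mean)
print(round(on_time_by_bin, 1))
```

Applied to the delayed flights in not_cancelled, the same binning would show the delay range in which on-time arrivals stop being the common outcome.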

SARFRAZ_HW3

Hussain Sarfraz

2025-09-24
