For this project, we focus on the following datasets: airlines, airports, flights, planes, and weather.We are going to invstigate these datasets to answeer the following questions:
Construct a barplot displaying number of flights per month.
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_~
## 2 06A Moton Field Municip~ 32.5 -85.7 264 -6 A America/Chic~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic~
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_~
## 5 09J Jekyll Island Airpo~ 31.1 -81.4 11 -5 A America/New_~
## 6 0A9 Elizabethton Munici~ 36.4 -82.2 1593 -5 A America/New_~
## 7 0G6 Williams County Air~ 41.5 -84.5 730 -5 A America/New_~
## 8 0G7 Finger Lakes Region~ 42.9 -76.8 492 -5 A America/New_~
## 9 0P2 Shoestring Aviation~ 39.8 -76.6 1000 -5 U America/New_~
## 10 0S9 Jefferson County In~ 48.1 -123. 108 -8 A America/Los_~
## # ... with 1,448 more rows
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wi~ EMBRAER EMB-1~ 2 55 NA Turbo~
## 2 N102UW 1998 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 3 N103US 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 4 N104UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 5 N10575 2002 Fixed wi~ EMBRAER EMB-1~ 2 55 NA Turbo~
## 6 N105UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 7 N107US 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 8 N108UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 9 N109UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 10 N110UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## # ... with 3,312 more rows
## # A tibble: 26,115 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
Remember to read about the datasets.
Construct a barplot displaying number of flights per month.
# Here, we convert these variables to the categorical ones.
flights$month <- as.factor(flights$month)
flights$origin <- as.factor(flights$origin)
flights$dest <- as.factor(flights$dest)
planes$manufacturer <-as.factor(planes$manufacturer)
# Vector Name of months
v_months <- c("1" = "Jan", "2" = "Feb","3" = "Mar",
"4" = "Apr", "5" = "May", "6" = "Jun",
"7" = "Jul", "8" = "Aug","9" = "sep",
"10" = "Oct","11" = "Nov", "12" = "Dec")
Find the number of flights per month.
flights%>%
group_by(month)%>% na.omit()%>% # aggregation by months
summarise(Flights =n())%>%
pander(big.mark = '.', justify = c('left', 'right'))
| month | Flights |
|---|---|
| 1 | 26.398 |
| 2 | 23.611 |
| 3 | 27.902 |
| 4 | 27.564 |
| 5 | 28.128 |
| 6 | 27.075 |
| 7 | 28.293 |
| 8 | 28.756 |
| 9 | 27.010 |
| 10 | 28.618 |
| 11 | 26.971 |
| 12 | 27.020 |
Now we can construct a barplot displaying number of flights per month.
nfly_per_month <- flights%>%
group_by(year, month)%>%
ggplot(aes(x = month))+
geom_bar() +
labs(x = "Months",
y = "Number of flights",
title = "Number of flights per month")+
scale_x_discrete(labels = v_months)
nfly_per_month
# ggplotly(nfly_per_month) # to makee it interactive.
Now, in the barplot (showing number of flights per month), make a separate bar for each origin.
origin_flights <- flights%>%
ggplot(aes( x = month))+
geom_bar(aes(fill = origin), position = "dodge") +
labs(x = "Months",
y = "Number of flights",
title = "flights for each origin per month") +
scale_x_discrete(labels = v_months)
# ggplotly(origin_flights) # make it interactive
origin_flights
List of the number of flights for each origin per month
months_list <- flights%>%
group_by(month, origin)%>%
summarise(n_of_flights = n())
months_list%>%
pander(big.mark = '.', justify = c('left','center', 'right'))
| month | origin | n_of_flights |
|---|---|---|
| 1 | EWR | 9.893 |
| 1 | JFK | 9.161 |
| 1 | LGA | 7.950 |
| 2 | EWR | 9.107 |
| 2 | JFK | 8.421 |
| 2 | LGA | 7.423 |
| 3 | EWR | 10.420 |
| 3 | JFK | 9.697 |
| 3 | LGA | 8.717 |
| 4 | EWR | 10.531 |
| 4 | JFK | 9.218 |
| 4 | LGA | 8.581 |
| 5 | EWR | 10.592 |
| 5 | JFK | 9.397 |
| 5 | LGA | 8.807 |
| 6 | EWR | 10.175 |
| 6 | JFK | 9.472 |
| 6 | LGA | 8.596 |
| 7 | EWR | 10.475 |
| 7 | JFK | 10.023 |
| 7 | LGA | 8.927 |
| 8 | EWR | 10.359 |
| 8 | JFK | 9.983 |
| 8 | LGA | 8.985 |
| 9 | EWR | 9.550 |
| 9 | JFK | 8.908 |
| 9 | LGA | 9.116 |
| 10 | EWR | 10.104 |
| 10 | JFK | 9.143 |
| 10 | LGA | 9.642 |
| 11 | EWR | 9.707 |
| 11 | JFK | 8.710 |
| 11 | LGA | 8.851 |
| 12 | EWR | 9.922 |
| 12 | JFK | 9.146 |
| 12 | LGA | 9.067 |
kable(months_list, "html") %>%
kable_styling("striped","hover") %>%
scroll_box(width = "500px", height = "600px")
| month | origin | n_of_flights |
|---|---|---|
| 1 | EWR | 9893 |
| 1 | JFK | 9161 |
| 1 | LGA | 7950 |
| 2 | EWR | 9107 |
| 2 | JFK | 8421 |
| 2 | LGA | 7423 |
| 3 | EWR | 10420 |
| 3 | JFK | 9697 |
| 3 | LGA | 8717 |
| 4 | EWR | 10531 |
| 4 | JFK | 9218 |
| 4 | LGA | 8581 |
| 5 | EWR | 10592 |
| 5 | JFK | 9397 |
| 5 | LGA | 8807 |
| 6 | EWR | 10175 |
| 6 | JFK | 9472 |
| 6 | LGA | 8596 |
| 7 | EWR | 10475 |
| 7 | JFK | 10023 |
| 7 | LGA | 8927 |
| 8 | EWR | 10359 |
| 8 | JFK | 9983 |
| 8 | LGA | 8985 |
| 9 | EWR | 9550 |
| 9 | JFK | 8908 |
| 9 | LGA | 9116 |
| 10 | EWR | 10104 |
| 10 | JFK | 9143 |
| 10 | LGA | 9642 |
| 11 | EWR | 9707 |
| 11 | JFK | 8710 |
| 11 | LGA | 8851 |
| 12 | EWR | 9922 |
| 12 | JFK | 9146 |
| 12 | LGA | 9067 |
What are the top-10 destinations and how many flights were made to these?
# The top-10 destinations and number of flights to these.
top10 <-flights%>%na.omit()%>%
group_by(dest)%>%
summarise(n_flight= n())%>%
arrange(desc(n_flight))%>%
select( dest, n_flight)%>%
head(10)
# List of top-10 destinations and number of flights to these.
top10%>%
pander(big.mark = '.', decimal.mark = ',', justify = c('left', 'right'))
| dest | n_flight |
|---|---|
| ATL | 16.837 |
| ORD | 16.566 |
| LAX | 16.026 |
| BOS | 15.022 |
| MCO | 13.967 |
| CLT | 13.674 |
| SFO | 13.173 |
| FLL | 11.897 |
| MIA | 11.593 |
| DCA | 9.111 |
For these 10 destinations, make a barplot illustrating the number of flights from origin to top-10 destination.
nfly_to_top10 <-top10%>%
ggplot(aes(dest, n_flight))+
geom_col() +
labs(x = "Destinations",
y = "Number of flights",
title = "flights to top-10 destinations")
# ggplotly(nfly_to_top10) # Make it interactive
nfly_to_top10
Now, order the bars (destinations) according to the total number of flights to these destinations.
nfly_to_top10 <- top10 %>%
ggplot( aes(x= reorder(dest,-n_flight),n_flight))+
geom_bar(stat = "identity")+
labs(x = "Top-10 destinations",
y = "Number of flights",
title = "Number of flights")
nfly_to_top10
# ggplotly(nfly_to_top10) # To make it interactive
Is the mean air time from each of the three origins different? Further, are these differences statistically significant?
flight_air_time <- flights %>%
filter(!is.na(air_time)) %>%
select(air_time, origin)
flight_air_time
## # A tibble: 327,346 x 2
## air_time origin
## <dbl> <fct>
## 1 227 EWR
## 2 227 LGA
## 3 160 JFK
## 4 183 JFK
## 5 116 LGA
## 6 150 EWR
## 7 158 EWR
## 8 53 LGA
## 9 140 JFK
## 10 138 LGA
## # ... with 327,336 more rows
# perform anova test
plot(air_time ~ origin, data = flight_air_time)
flight_air_time %>%
aov(air_time ~ origin, data = .) %>%
summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## origin 2 1.935e+08 96741314 11817 <2e-16 ***
## Residuals 327343 2.680e+09 8186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Visualize the mean air time for all origins to see the differences between them.
m_air_time_month <- flights%>%
group_by(origin, month, day)%>%
summarise(mean_air = mean(air_time, na.rm = TRUE))%>%
arrange(desc(mean_air))
m_air_plot <- m_air_time_month%>%
ggplot(aes(origin, mean_air))+ theme_bw() +
geom_boxplot(alpha = 0.4)+
labs(x = "Origin",
y = "Mean of air times",
title = "Origin mean time") +
# added after submission
# geom_boxplot(alpha=0.4) +
stat_summary(fun.y=mean, geom="point", shape=4,
size=2, color="red", fill="red") +
theme(legend.position="none")
m_air_plot
# ggplotly(m_air_plot) # make it interactive
How many weather observations are there for each origin?
# Compute the number of observations for each origin.
weather%>%
group_by(origin)%>% na.omit()%>%
summarise(n_observ = n())%>% # Number of observations
arrange(desc(n_observ))%>%
pander(big.mark = ',', justify = c('left', 'right'))
| origin | n_observ |
|---|---|
| LGA | 1,892 |
| EWR | 1,689 |
| JFK | 1,399 |
Convert temperature to degrees Celsius. This will be used in the reminder of this miniproject. (We do this for both temp and dewp using mutate_at)
# Convert degree fahrenheit to degree celcius.
fahrenheit_to_celcius <- function(df)
{
t_celcius <- (df - 32)* 5/9
return(t_celcius)
}
# Convert temp and dewp to degree selcius.
weather_dc <- mutate_at(weather, vars(temp,dewp),funs(fahrenheit_to_celcius))
head(weather_dc)
## # A tibble: 6 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 3.9 -3.3 59.4 270 10.4
## 2 EWR 2013 1 1 2 3.9 -2.80 61.6 250 8.06
## 3 EWR 2013 1 1 3 3.9 -2.2 64.4 240 11.5
## 4 EWR 2013 1 1 4 4.4 -2.2 62.2 250 12.7
## 5 EWR 2013 1 1 5 3.9 -2.2 64.4 260 12.7
## 6 EWR 2013 1 1 6 3.30 -2.2 67.2 240 11.5
## # ... with 5 more variables: wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>, time_hour <dttm>
# convert variables origin and month to fator.
weather_dc$origin <- as.factor(weather_dc$origin)
weather_dc$month <- as.factor(weather_dc$month)
Construct a graph displaying the temperature at JFK.
jfk_temp <- weather_dc%>%
group_by(year, month, day)%>%
filter(origin =="JFK")%>%
ggplot(aes(month, temp))+
geom_boxplot() +
labs(x = "Months",
y = "Temperature in degree C",
title = "Graph temperature at JFK") +
scale_x_discrete(labels = v_months)+
# added after submission
# geom_boxplot(alpha=0.4) +
stat_summary(fun.y=mean, geom="point", shape=4,
size=2, color="red", fill="red") +
theme(legend.position="none")
# end added
jfk_temp
# ggplotly(jfk_temp)
Add a red line showing the mean temperature for each day.
l_mean_temp <- weather_dc%>% # with temp in degree celcius
mutate(mydate = make_date(year, month, day))%>%
group_by(year,month,day, mydate)%>%
# filter(origin =="JFK")%>%
summarise(m_temp = mean(temp, na.rm = TRUE))%>%
ggplot(aes(mydate, m_temp))+
geom_point()+
geom_smooth(se = FALSE, col = "red", method = 'loess') +
labs(x = "days",
y = "Temperature",
title = "Mean temperature for each day")
l_mean_temp
# ggplotly(l_mean_temp)
# Solution after submission
mean_temp_orign <- weather_dc%>%
group_by(year,month, day)%>%
summarise(mean_temp = mean(temp, na.rm = TRUE))%>%
ggplot(aes(day,mean_temp))+
geom_point()+
#geom_hline()+
geom_smooth(se = FALSE, col = "red", method = 'loess') +
# added after submission
labs(x = "Origin airport ",
y = "Mean temperature",
title = "Mean temperature ")
mean_temp_orign
# ggplotly(mean_temp_orign)
Now, visualize the daily mean temperature for each origin airport.
mean_temp_orign <- weather_dc%>%
group_by(origin,year,month, day)%>%
summarise(mean_temp = mean(temp, na.rm = TRUE))%>%
ggplot(aes(origin,mean_temp))+
geom_boxplot(aes(fill = origin))+
# added after submission
# geom_boxplot(alpha=0.4) +
stat_summary(fun.y=mean, geom="point", shape=4, size=2, color="red", fill="red") +
# theme(legend.position="none") +
# end added
#geom_point()+
labs(x = "Origins ",
y = "Mean temperature",
title = "Mean temperature ")
mean_temp_orign
# ggplotly(mean_temp_orign)
Investigate if arrival delay is associated with the flight distance (and also departure delay).
In order to investigate if there is any statistical dependence or association between some variables, one can calculate the correlation between these variables. To do so, we will first investigate if these variables are numeric and if there is no missing values.
select_cols <- flights%>%
select(arr_delay, dep_delay, distance) # Select relevant columns from flights
#Take complete cases.
flights_complet_cases <- select_cols[complete.cases(select_cols),]
Calculate the correlation between variables
# Correlation test between different variables or columns.
cor_test1 <- cor(flights_complet_cases)
cor_test_tb <- as.data.frame(cor_test1)
cor_test_tb
## arr_delay dep_delay distance
## arr_delay 1.00000000 0.9148028 -0.06186776
## dep_delay 0.91480276 1.0000000 -0.02168090
## distance -0.06186776 -0.0216809 1.00000000
Format the correlation data
kable(cor_test_tb, "html") %>% # format correlation table.
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered", "responsive"))
| arr_delay | dep_delay | distance | |
|---|---|---|---|
| arr_delay | 1.0000000 | 0.9148028 | -0.0618678 |
| dep_delay | 0.9148028 | 1.0000000 | -0.0216809 |
| distance | -0.0618678 | -0.0216809 | 1.0000000 |
We can also plot the correlation data to visualize the relation between variables.
corrplot(cor_test1, method="color", type = "upper",
tl.srt = 45,addCoef.col = "black")
As we can see from the different graphs and correlation parameters , there is a strong correlation between arrival delay and departure delay. In the contrast, there is a negative correlation between the distance and arrival delay. In the next step, I will build a linear regression model using these three variables.
Linear model arrival delay and departure delay
# Model arrival delay and departure delay
del_model1 <- lm(arr_delay ~ dep_delay, data = flights_complet_cases)
coef(del_model1)
## (Intercept) dep_delay
## -5.899493 1.019093
summary(del_model1)
##
## Call:
## lm(formula = arr_delay ~ dep_delay, data = flights_complet_cases)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.587 -11.005 -1.883 8.938 201.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.8994935 0.0330195 -178.7 <2e-16 ***
## dep_delay 1.0190929 0.0007864 1295.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.03 on 327344 degrees of freedom
## Multiple R-squared: 0.8369, Adjusted R-squared: 0.8369
## F-statistic: 1.679e+06 on 1 and 327344 DF, p-value: < 2.2e-16
from this result, one can conclude that departure delay is an important predictor of arrival delay. The changes of departure delay can explain the changes in arrival delay.
pred_delay <- flights_complet_cases%>%
add_predictions(del_model1)%>%
add_residuals(del_model1)
head(pred_delay, 10)
## # A tibble: 10 x 5
## arr_delay dep_delay distance pred resid
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11 2 1400 -3.86 14.9
## 2 20 4 1416 -1.82 21.8
## 3 33 2 1089 -3.86 36.9
## 4 -18 -1 1576 -6.92 -11.1
## 5 -25 -6 762 -12.0 -13.0
## 6 12 -4 719 -9.98 22.0
## 7 19 -5 1065 -11.0 30.0
## 8 -14 -3 229 -8.96 -5.04
## 9 -8 -3 944 -8.96 0.957
## 10 8 -2 733 -7.94 15.9
# After submission of report
pred_plot <- pred_delay%>%
ggplot(aes(dep_delay, pred))+
geom_line()+
geom_point(aes(dep_delay, arr_delay),
col = "red")
pred_plot
# Graph arr_delay vs dep_dealy using intercept and slope from the model.
flights_complet_cases%>%
ggplot(aes(arr_delay, dep_delay))+
geom_point()+
geom_abline (intercept = -5.89, slope = 1, col= "red") +
labs(x = "arrivel delay ",
y = "departure delay",
title = "arrival vs departure delay ")
Model arrival delay, departure delay and distance
# Model arrival delay, departure delay and distance
del_model2 <- lm(arr_delay ~ dep_delay+ distance, data = flights_complet_cases)
coef(del_model2)
## (Intercept) dep_delay distance
## -3.212779441 1.018077208 -0.002550586
summary(del_model2)
##
## Call:
## lm(formula = arr_delay ~ dep_delay + distance, data = flights_complet_cases)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.944 -11.069 -2.019 8.728 205.562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.213e+00 5.560e-02 -57.78 <2e-16 ***
## dep_delay 1.018e+00 7.823e-04 1301.32 <2e-16 ***
## distance -2.551e-03 4.259e-05 -59.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.93 on 327343 degrees of freedom
## Multiple R-squared: 0.8386, Adjusted R-squared: 0.8386
## F-statistic: 8.506e+05 on 2 and 327343 DF, p-value: < 2.2e-16
From this result, one can conclude that the changes of arrival delay can not be explained by the distance.
Investigate if departure delay is associated with weather conditions at the origin airport. This includes descriptives (mean departure delay), plotting, regression modelling, considering missing values etc.
# We selct the relevant columns from both flights and weather datasets
# then join the two subsets.
flight_delay <- flights%>%
select(year:day, origin, hour, arr_delay, dep_delay, tailnum)
#flight_delay
Join subset flights and planes datasets
fly_w_cd <- inner_join(flight_delay, weather_dc,
by = c("origin" = "origin",
"year" = "year",
"month" = "month",
"day" = "day",
"hour" = "hour"))
#c("year", "day","month", "hour", "origin")
# Here I consider only the completecases from dataset.
fly_weather_cd <- fly_w_cd[complete.cases(fly_w_cd),]
head(fly_weather_cd)
## # A tibble: 6 x 18
## year month day origin hour arr_delay dep_delay tailnum temp dewp
## <dbl> <fct> <int> <fct> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2013 1 1 LGA 5 20 4 N24211 4.4 -3.90
## 2 2013 1 1 LGA 6 -25 -6 N668DN 4.4 -3.90
## 3 2013 1 1 LGA 6 -14 -3 N829AS 4.4 -3.90
## 4 2013 1 1 LGA 6 8 -2 N3ALAA 4.4 -3.90
## 5 2013 1 1 LGA 6 31 -1 N3DUAA 4.4 -3.90
## 6 2013 1 1 LGA 6 -7 0 N595JB 4.4 -3.90
## # ... with 8 more variables: humid <dbl>, wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
Now we can compute the correlation between numeric relevant variables.
num_cols <- sapply(fly_weather_cd, is.numeric)
fly_weather_cd <- fly_weather_cd[, num_cols] # numeric columns
delay_weather_cd <- fly_weather_cd%>%
select(dep_delay, humid,precip, visib,
wind_dir, wind_speed, temp, dewp)
cor_delay_wc <- cor(delay_weather_cd)
cor_delay_wc
## dep_delay humid precip visib wind_dir
## dep_delay 1.00000000 0.122166383 0.048397908 -0.09745755 -0.06842713
## humid 0.12216638 1.000000000 0.262944487 -0.45001107 -0.41160597
## precip 0.04839791 0.262944487 1.000000000 -0.47729425 -0.15145866
## visib -0.09745755 -0.450011070 -0.477294245 1.00000000 0.20250473
## wind_dir -0.06842713 -0.411605970 -0.151458658 0.20250473 1.00000000
## wind_speed 0.02100191 -0.009770943 0.003165484 -0.06396121 0.11603287
## temp 0.06124366 0.077493589 -0.016366496 0.04275419 -0.21351987
## dewp 0.10067971 0.484751473 0.073763317 -0.11416287 -0.34985317
## wind_speed temp dewp
## dep_delay 0.021001914 0.06124366 0.10067971
## humid -0.009770943 0.07749359 0.48475147
## precip 0.003165484 -0.01636650 0.07376332
## visib -0.063961212 0.04275419 -0.11416287
## wind_dir 0.116032873 -0.21351987 -0.34985317
## wind_speed 1.000000000 -0.31494220 -0.28646035
## temp -0.314942198 1.00000000 0.90528731
## dewp -0.286460353 0.90528731 1.00000000
We can Plot the correlation data to see the relations between variables
corrplot(cor_delay_wc, method="color", type = "upper",
tl.srt = 45,addCoef.col = "black")
#corrplot(cor_delay_wc, method = 'number')
According to correlation plots, there is no strong correlation between departure delay and any of the weather conditions variables and there is a tight competition between positive and negative correlations. To get more information we will visualize the relationship between delay and some of the variable such as temp and `precip.
fly_w_cd%>%
group_by(precip)%>%
summarise (dp_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(precip, dp_delay))+
geom_line()+
geom_point()
As we can see from the graph, the precip doesn’t provide more information explanation about departure delay and the correlation between these two variables is weak.
fly_w_cd%>%
group_by(temp)%>%na.omit()%>%
summarise (dp_delay = mean(dep_delay, na.rm = TRUE))%>%
ggplot(aes(temp, dp_delay))+
# geom_line(col = "red")+
geom_point()+
geom_smooth( se = FALSE, method = 'loess')
Surprisingly, can that the departure delay increase with temperature but this variation is not significant to provide more explanation about departure delay changes. In the next step I will build the linear regression model.
Modelling weather conditions delay model
set.seed(101)
#delay_weather_cd%>%
wc_delay_sample <- sample.split(delay_weather_cd$dep_delay,
SplitRatio = 0.7)
# train set
train <- subset(delay_weather_cd, wc_delay_sample == TRUE)
head(train)
## # A tibble: 6 x 8
## dep_delay humid precip visib wind_dir wind_speed temp dewp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 54.8 0 10 250 15.0 4.4 -3.90
## 2 -1 54.8 0 10 260 16.1 4.4 -3.90
## 3 -3 54.8 0 10 260 16.1 4.4 -3.90
## 4 13 54.8 0 10 260 16.1 4.4 -3.90
## 5 -4 54.8 0 10 260 16.1 4.4 -3.90
## 6 -6 54.8 0 10 260 16.1 4.4 -3.90
#Test set
test <- subset(delay_weather_cd, wc_delay_sample == FALSE)
head(test)
## # A tibble: 6 x 8
## dep_delay humid precip visib wind_dir wind_speed temp dewp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -6 54.8 0 10 260 16.1 4.4 -3.90
## 2 -3 54.8 0 10 260 16.1 4.4 -3.90
## 3 -2 54.8 0 10 260 16.1 4.4 -3.90
## 4 0 54.8 0 10 260 16.1 4.4 -3.90
## 5 0 54.8 0 10 260 16.1 4.4 -3.90
## 6 -8 54.8 0 10 260 16.1 4.4 -3.90
Build linear model
model <- lm(dep_delay~ ., data = train)
summary(model)
##
## Call:
## lm(formula = dep_delay ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.93 -16.97 -11.83 -0.15 754.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.352674 6.327225 -0.846 0.397571
## humid 0.427979 0.062583 6.839 8.09e-12 ***
## precip 3.028329 14.635779 0.207 0.836079
## visib -1.605281 0.211482 -7.591 3.24e-14 ***
## wind_dir -0.006602 0.002347 -2.813 0.004914 **
## wind_speed 0.287092 0.036878 7.785 7.11e-15 ***
## temp 0.882015 0.182450 4.834 1.34e-06 ***
## dewp -0.717097 0.200783 -3.572 0.000355 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.25 on 50911 degrees of freedom
## Multiple R-squared: 0.02183, Adjusted R-squared: 0.0217
## F-statistic: 162.3 on 7 and 50911 DF, p-value: < 2.2e-16
# Residuals
residus <- residuals(model)
residus <- as.data.frame(residus)
#head(residus)
residus%>%
ggplot(aes(residus)) + geom_histogram(fill='blue',alpha=0.5, bins = 20)
# After submission of report
delay_weather_cd%>%
ggplot(aes(dep_delay)) + geom_histogram(fill='blue',alpha=0.5, bins = 20)
Plotting model
Prediction computation
delay_predict <- predict(model, test)
predict_res <- cbind(delay_predict, test$dep_delay)
colnames(predict_res) <- c("predict", "actual")
predict_res<- as.data.frame(predict_res)
head(predict_res)
## predict actual
## 1 11.63846 -6
## 2 11.63846 -3
## 3 11.63846 -2
## 4 11.63846 0
## 5 11.63846 0
## 6 11.63846 -8
head(delay_predict)
## 1 2 3 4 5 6
## 11.63846 11.63846 11.63846 11.63846 11.63846 11.63846
Is the age of the plane associated to delay?
Here, we start by computing the age of each and store it in new column names age.
# This function compute the age the plane as integer.
cur_year <- as.integer(format(Sys.Date(), "%Y"))
year_diff <- function(x_year, y_year = cur_year)
{
x_year <- as.integer(x_year)
return(y_year - x_year)
}
pl_ages <- planes%>%
mutate(age = year_diff(year))%>%
select(tailnum, age, seats)
pl_ages <- pl_ages[complete.cases(pl_ages),]
flight_delay <- flights%>%
select(year:day, origin, hour, arr_delay, dep_delay, tailnum)
# join planes and flights datasets
join_fly_planes <- flight_delay%>%
inner_join(pl_ages, by = "tailnum")
#join_fly_planes
fly_planes <-join_fly_planes%>%
select(age,dep_delay,arr_delay)
fly_planes <-fly_planes[complete.cases(fly_planes),]
fly_planes
## # A tibble: 273,853 x 3
## age dep_delay arr_delay
## <int> <dbl> <dbl>
## 1 19 2 11
## 2 20 4 20
## 3 28 2 33
## 4 6 -1 -18
## 5 27 -6 -25
## 6 6 -4 12
## 7 18 -5 19
## 8 20 -3 -14
## 9 14 -3 -8
## 10 7 -2 -2
## # ... with 273,843 more rows
Now I will perform some plotting to visualize the graphic (age vs delay)
age_delay <- fly_planes%>%
group_by(age)%>%
filter(!is.na(dep_delay)) %>%
summarise(delay = mean(dep_delay)) %>%
ggplot(aes(x = age, y = delay)) +
# geom_point() +
geom_line() +
labs(x = "Age of planes",
y = "Mean of departure delay",
title = "Graph dep_delay vs age")
age_delay
#ggplotly(age_delay)
try correlation between age of planes and arrival delay.
cor_delay_age <-cor(fly_planes)
cor_delay_age
## age dep_delay arr_delay
## age 1.00000000 -0.01513102 -0.01767153
## dep_delay -0.01513102 1.00000000 0.91674159
## arr_delay -0.01767153 0.91674159 1.00000000
According to the correlation table, There is a weak negative correlation between age and both arrival delay and departure delay. Meaning, the age of the planes can not explained the changes of delay.
Format the correlation data
| age | dep_delay | arr_delay | |
|---|---|---|---|
| age | 1.0000000 | -0.0151310 | -0.0176715 |
| dep_delay | -0.0151310 | 1.0000000 | 0.9167416 |
| arr_delay | -0.0176715 | 0.9167416 | 1.0000000 |
It seems like the plane manufacturer could use a cleaning. After that, how many manufacturers have more than 200 planes? And how many flights are each manufacturer with more than 200 planes responsible for?
# Find list of manufacturers that have more than 200 planes.
manufact_200_planes <- planes%>%
filter(!is.na(manufacturer)) %>%
group_by( manufacturer)%>%
summarise(n_planes =n())%>%
arrange(desc(n_planes))%>%
filter(n_planes > 200)
#print list
manufact_200_planes%>%
pander(big.mark = ',', justify = c('left', 'right'))
| manufacturer | n_planes |
|---|---|
| BOEING | 1,630 |
| AIRBUS INDUSTRIE | 400 |
| BOMBARDIER INC | 368 |
| AIRBUS | 336 |
| EMBRAER | 299 |
# Combine the subset manufact_200_planes with flights dataset to find number of flights.
num_planes200 <- planes%>%
inner_join(flights,"tailnum")%>%
filter(manufacturer %in%
manufact_200_planes$manufacturer)%>%
group_by( manufacturer)%>%
summarise(n_flight =n())%>%
arrange(desc(n_flight))
# print list
num_planes200%>%
pander(big.mark = ',', justify = c('left', 'right'))
| manufacturer | n_flight |
|---|---|
| BOEING | 82,912 |
| EMBRAER | 66,068 |
| AIRBUS | 47,302 |
| AIRBUS INDUSTRIE | 40,891 |
| BOMBARDIER INC | 28,272 |
# format number of flights
kable(num_planes200, "html") %>% # format correlation table.
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"))
| manufacturer | n_flight |
|---|---|
| BOEING | 82912 |
| EMBRAER | 66068 |
| AIRBUS | 47302 |
| AIRBUS INDUSTRIE | 40891 |
| BOMBARDIER INC | 28272 |
It turns out that Airbus has several main models, e.g. several A320 (A320-211, A320-212 etc.) and so on. Create a frequency table of the main models for Airbus and how many planes there are in each.
# Find number of model for "AIRBUS"
airbus_models <- planes%>%
group_by(model)%>%
filter(manufacturer =="AIRBUS")%>%
summarise(n_of_planes = n())%>%
arrange(desc(n_of_planes))
# Split column model to find main models
#airbus_models%>%
split_m <- strsplit( airbus_models$model, "-")
main_mdl <- vector("character", length = length(split_m))
for (m in 1: length(split_m)) {
main_mdl[m] <- split_m[[m]][1]
}
# List of main models
m_models <- airbus_models%>%
mutate(main_model = main_mdl)
m_models %>%
pander(big.mark = ',', justify = c('left','center', 'right'))
| model | n_of_planes | main_model |
|---|---|---|
| A320-232 | 129 | A320 |
| A320-214 | 61 | A320 |
| A321-231 | 49 | A321 |
| A319-114 | 31 | A319 |
| A321-211 | 18 | A321 |
| A319-112 | 14 | A319 |
| A330-243 | 14 | A330 |
| A319-131 | 5 | A319 |
| A319-132 | 5 | A319 |
| A319-115 | 3 | A319 |
| A320-211 | 2 | A320 |
| A320-212 | 2 | A320 |
| A330-223 | 2 | A330 |
| A330-323 | 1 | A330 |
# format list
kable(m_models, "html") %>% # format correlation table.
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"))
| model | n_of_planes | main_model |
|---|---|---|
| A320-232 | 129 | A320 |
| A320-214 | 61 | A320 |
| A321-231 | 49 | A321 |
| A319-114 | 31 | A319 |
| A321-211 | 18 | A321 |
| A319-112 | 14 | A319 |
| A330-243 | 14 | A330 |
| A319-131 | 5 | A319 |
| A319-132 | 5 | A319 |
| A319-115 | 3 | A319 |
| A320-211 | 2 | A320 |
| A320-212 | 2 | A320 |
| A330-223 | 2 | A330 |
| A330-323 | 1 | A330 |
# A Frequency table of main models
m_models%>%
group_by(main_model)%>%
summarise( num_model = n(), num_of_planes = sum(n_of_planes))%>%
pander(big.mark = ',', justify = c('left','center', 'right'))
| main_model | num_model | num_of_planes |
|---|---|---|
| A319 | 5 | 58 |
| A320 | 4 | 194 |
| A321 | 2 | 67 |
| A330 | 3 | 17 |
Are larger planes (measured by number of seats) more or less delayed than smaller planes? To answer this question one can visualize the plot (seats vs arr_delay)
delay_seats <- join_fly_planes%>%
group_by(seats)%>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE))%>%
arrange(desc(seats))
delay_seats%>%
pander(big.mark = ',', justify = c('left', 'right'))
| seats | mean_delay |
|---|---|
| 450 | 120 |
| 400 | 14.33 |
| 379 | 3.29 |
| 377 | -6.635 |
| 330 | 5.281 |
| 300 | 24.2 |
| 292 | 6.216 |
| 290 | 7.736 |
| 275 | 11.95 |
| 269 | -4.75 |
| 260 | -4.5 |
| 255 | -1.106 |
| 222 | -8.819 |
| 200 | 7.599 |
| 199 | 2.683 |
| 191 | 0.8034 |
| 189 | -1.337 |
| 182 | 3.222 |
| 179 | 4.298 |
| 178 | 2.286 |
| 172 | -0.8459 |
| 149 | 3.388 |
| 147 | 33.47 |
| 145 | 0.3996 |
| 142 | 5.905 |
| 140 | 9.651 |
| 139 | 2.341 |
| 128 | -3.053 |
| 102 | 3.5 |
| 100 | 17.79 |
| 95 | 8.039 |
| 80 | 8.247 |
| 55 | 15.81 |
| 22 | 7.173 |
| 20 | 8.056 |
| 16 | -4.295 |
| 14 | -1.346 |
| 12 | -10.25 |
| 11 | 22.43 |
| 10 | 5.667 |
| 9 | 4.045 |
| 8 | 7.822 |
| 7 | -3.034 |
| 6 | 10.67 |
| 5 | 7.137 |
| 4 | 6.365 |
| 2 | 5.497 |
kable(delay_seats, "html") %>%
kable_styling() %>%
scroll_box(width = "500px", height = "600px")
| seats | mean_delay |
|---|---|
| 450 | 120.0000000 |
| 400 | 14.3333333 |
| 379 | 3.2904795 |
| 377 | -6.6354839 |
| 330 | 5.2810127 |
| 300 | 24.2000000 |
| 292 | 6.2159091 |
| 290 | 7.7356322 |
| 275 | 11.9455645 |
| 269 | -4.7500000 |
| 260 | -4.5000000 |
| 255 | -1.1056853 |
| 222 | -8.8192771 |
| 200 | 7.5991997 |
| 199 | 2.6828016 |
| 191 | 0.8033501 |
| 189 | -1.3370618 |
| 182 | 3.2215930 |
| 179 | 4.2978482 |
| 178 | 2.2861217 |
| 172 | -0.8459043 |
| 149 | 3.3884528 |
| 147 | 33.4731183 |
| 145 | 0.3995519 |
| 142 | 5.9049074 |
| 140 | 9.6505463 |
| 139 | 2.3406593 |
| 128 | -3.0526316 |
| 102 | 3.5000000 |
| 100 | 17.7885978 |
| 95 | 8.0389666 |
| 80 | 8.2470027 |
| 55 | 15.8059566 |
| 22 | 7.1729958 |
| 20 | 8.0563663 |
| 16 | -4.2950820 |
| 14 | -1.3461538 |
| 12 | -10.2500000 |
| 11 | 22.4347826 |
| 10 | 5.6666667 |
| 9 | 4.0454545 |
| 8 | 7.8216561 |
| 7 | -3.0338983 |
| 6 | 10.6735395 |
| 5 | 7.1370717 |
| 4 | 6.3654917 |
| 2 | 5.4972527 |
According to this graph, there is no sign that the larger planes are more delayed than the smaller ones.
One can also try to compute the correlation between age, arr_delay and dep_delay.
| seats | arr_delay | dep_delay | |
|---|---|---|---|
| seats | 1.0000000 | -0.0726318 | -0.0541665 |
| arr_delay | -0.0726318 | 1.0000000 | 0.9167416 |
| dep_delay | -0.0541665 | 0.9167416 | 1.0000000 |
There is a weak negative correlation between a number of seats and delay meaning that the number of seats (smaller/ larger) can not explain the arrival delay.
##
## Call:
## lm(formula = seat_delay$arr_delay ~ seat_delay$seats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.03 -24.21 -11.82 7.29 1275.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.333210 0.186109 71.64 <2e-16 ***
## seat_delay$seats -0.045604 0.001197 -38.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.81 on 273851 degrees of freedom
## Multiple R-squared: 0.005275, Adjusted R-squared: 0.005272
## F-statistic: 1452 on 1 and 273851 DF, p-value: < 2.2e-16
## (Intercept) seat_delay$seats
## 13.33320972 -0.04560441
On a map (map_data("usa")), plot the airports that has flights to them.
usa <- map_data("usa")
#airports$faa <- as.factor(airports$faa)
fli_dest <- flights%>%
group_by(dest)%>%
summarise(n = n())%>%
inner_join(airports, by =c(dest = "faa"))
## Warning: Column `dest`/`faa` joining factor and character vector, coercing
## into character vector
fli_dest
## # A tibble: 101 x 9
## dest n name lat lon alt tz dst tzone
## <chr> <int> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 ABQ 254 Albuquerque Inte~ 35.0 -107. 5355 -7 A America/D~
## 2 ACK 265 Nantucket Mem 41.3 -70.1 48 -5 A America/N~
## 3 ALB 439 Albany Intl 42.7 -73.8 285 -5 A America/N~
## 4 ANC 8 Ted Stevens Anch~ 61.2 -150. 152 -9 A America/A~
## 5 ATL 17215 Hartsfield Jacks~ 33.6 -84.4 1026 -5 A America/N~
## 6 AUS 2439 Austin Bergstrom~ 30.2 -97.7 542 -6 A America/C~
## 7 AVL 275 Asheville Region~ 35.4 -82.5 2165 -5 A America/N~
## 8 BDL 443 Bradley Intl 41.9 -72.7 173 -5 A America/N~
## 9 BGR 375 Bangor Intl 44.8 -68.8 192 -5 A America/N~
## 10 BHM 297 Birmingham Intl 33.6 -86.8 644 -6 A America/C~
## # ... with 91 more rows
Make a similar plot, but now points must have size relative to the number of flights each airport is destination for.
map_data <- ggplot() +
geom_polygon(data = usa, aes(long, lat, group = group), fill = "grey") +
geom_point(data =fli_dest, aes(x = lon, y = lat, size = n , color = n)) +
borders("state")+
scale_color_viridis()+
labs(x = "longitude",
y = "latitude",
title = "map flights for destinations")+
coord_quickmap()
map_data
#ggplotly(map_data)
Do a principal component analysis of the weather at JFK using the following columns: temp, dewp, humid, wind_dir, wind_speed, precip, visib (only on complete.cases()). How many principal components should be used to capture the variability in the weather data?
weather_dc <- mutate_at(weather, vars(temp, dewp),funs(fahrenheit_to_celcius))
weather_jfk_df <- weather_dc%>%
group_by(origin)%>%
filter(origin == "JFK") %>%
select(origin,temp, dewp, humid, wind_dir, wind_speed, precip, visib)
head(weather_jfk_df)
## # A tibble: 6 x 8
## # Groups: origin [1]
## origin temp dewp humid wind_dir wind_speed precip visib
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 JFK 3.9 -3.3 59.4 260 12.7 0 10
## 2 JFK 3.9 -3.3 59.4 270 11.5 0 10
## 3 JFK 4.4 -2.80 59.5 260 15.0 0 10
## 4 JFK 4.4 -2.2 62.2 250 17.3 0 10
## 5 JFK 3.9 -2.80 61.6 260 15.0 0 10
## 6 JFK 3.30 -2.80 64.3 260 13.8 0 10
# Here we consider the completecases
weather_jfk<- weather_jfk_df[complete.cases(weather_jfk_df),]
weather_jfk <- weather_jfk[, sapply(weather_jfk, is.numeric)]
summary and correlation plot
Principal component analysis PCA
## # A tibble: 6 x 7
## temp dewp humid wind_dir wind_speed precip visib
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.9 -3.3 59.4 260 12.7 0 10
## 2 3.9 -3.3 59.4 270 11.5 0 10
## 3 4.4 -2.80 59.5 260 15.0 0 10
## 4 4.4 -2.2 62.2 250 17.3 0 10
## 5 3.9 -2.80 61.6 260 15.0 0 10
## 6 3.30 -2.80 64.3 260 13.8 0 10
## temp dewp humid wind_dir
## Min. :-11.10 Min. :-23.300 Min. : 15.21 Min. : 0.0
## 1st Qu.: 4.40 1st Qu.: -2.800 1st Qu.: 49.19 1st Qu.:140.0
## Median : 12.20 Median : 6.100 Median : 65.90 Median :220.0
## Mean : 12.45 Mean : 5.462 Mean : 65.26 Mean :204.2
## 3rd Qu.: 20.60 3rd Qu.: 14.400 3rd Qu.: 82.81 3rd Qu.:290.0
## Max. : 36.70 Max. : 25.600 Max. :100.00 Max. :360.0
## wind_speed precip visib
## Min. : 0.000 Min. :0.000000 Min. : 0.000
## 1st Qu.: 6.905 1st Qu.:0.000000 1st Qu.:10.000
## Median :11.508 Median :0.000000 Median :10.000
## Mean :11.503 Mean :0.003948 Mean : 9.175
## 3rd Qu.:14.960 3rd Qu.:0.000000 3rd Qu.:10.000
## Max. :42.579 Max. :0.660000 Max. :10.000
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6100 1.2085 1.0722 0.8732 0.78761 0.64222 0.05141
## Proportion of Variance 0.3703 0.2086 0.1642 0.1089 0.08862 0.05892 0.00038
## Cumulative Proportion 0.3703 0.5789 0.7431 0.8521 0.94070 0.99962 1.00000
## Standard deviations (1, .., p=7):
## [1] 1.60996245 1.20847381 1.07220661 0.87324870 0.78761090 0.64222082
## [7] 0.05141167
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5
## temp 0.4258354 -0.45733088 0.39167074 -0.10637799 -0.061768190
## dewp 0.5565280 -0.26675384 0.27395954 0.05549235 0.004384504
## humid 0.4795867 0.28491715 -0.08595086 0.33572394 0.111121220
## wind_dir -0.3322328 -0.07787344 0.50057333 0.41389629 0.679300151
## wind_speed -0.2772188 0.17817362 0.64084659 0.06113809 -0.630844077
## precip 0.1471361 0.50302298 0.31168170 -0.71717007 0.336338691
## visib -0.2646875 -0.58963620 -0.07582321 -0.42838399 0.106211496
## PC6 PC7
## temp 0.29644440 0.5942186847
## dewp -0.08850770 -0.7301595113
## humid -0.66588955 0.3361009569
## wind_dir -0.01498101 0.0003908938
## wind_speed -0.28109860 -0.0010084427
## precip -0.02647824 -0.0039567227
## visib -0.61720034 0.0281169743
Accordingly, to the different plots and summary table, one can see that more than 94% of weather variability can be captured by 5 PCs. So we choose the 5 first components as principal ones.