Here I load the libraries required for the analysis.
# Load standard libraries
library(plyr)          # loaded first so that it does not mask dplyr verbs
library(tidyverse)     # attaches ggplot2, dplyr, tidyr, and others
library(nycflights13)  # flights and weather data sets
Flight delays are often linked to weather conditions. How does weather impact flights from NYC? I use both the flights and weather datasets from the nycflights13 package to explore this question.
dat_original <- as.data.frame(nycflights13::flights, row.names = NULL) #storing nycflights data into a data frame
#Getting rid of incomplete cases
dat_original <- drop_na(dat_original)
flght <- dat_original
wthr <- as.data.frame(nycflights13::weather, row.names = NULL) #storing nycflights weather data into a data frame
dat <- merge(flght, wthr)
#For arrival and departure (Consider only positive delays)
dat$arr_delay <- ifelse(dat$arr_delay >= 0, dat$arr_delay, 0)
dat$dep_delay <- ifelse(dat$dep_delay >= 0, dat$dep_delay, 0)
#Adding a column for total delay
dat$total_delay <- dat$arr_delay + dat$dep_delay
#selecting only the relevant columns: the weather variables (20 to 28) and total_delay (29)
datRelevant <- dat[,c(20:29)]
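The preparation steps above can be sketched on a toy example. The two small data frames below are invented for illustration and only mimic the shape of `flights` and `weather`; `pmax()` is used here as an assumed shorthand for the `ifelse()` clamping, and selecting columns by name is shown as an alternative to positional indices.

```r
# Toy stand-ins for flights and weather (values are invented for illustration)
toy_flights <- data.frame(
  origin    = c("JFK", "JFK", "LGA"),
  hour      = c(6, 7, 6),
  dep_delay = c(-3, 12, 40),
  arr_delay = c(-10, 5, 55)
)
toy_weather <- data.frame(
  origin = c("JFK", "JFK", "LGA"),
  hour   = c(6, 7, 6),
  humid  = c(60, 72, 85)
)

# With no `by` argument, merge() joins on every column name the two
# data frames share (here: origin and hour)
toy <- merge(toy_flights, toy_weather)

# Count early departures and arrivals as zero delay, as the ifelse() calls do
toy$dep_delay <- pmax(toy$dep_delay, 0)
toy$arr_delay <- pmax(toy$arr_delay, 0)
toy$total_delay <- toy$arr_delay + toy$dep_delay

# Selecting by name avoids depending on column positions after the merge
toy[, c("humid", "total_delay")]
```

Because `merge()` silently uses all shared column names as keys, it is worth checking `intersect(names(flght), names(wthr))` on the real data before relying on the result.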
g <- ggplot(datRelevant, aes(y = humid, x = total_delay))
g + geom_smooth() + ggtitle("Total Delay v/s Relative Humidity") +
ylab("Relative Humidity") + xlab("Total Delay (mins)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 17 rows containing non-finite values (stat_smooth).
g <- ggplot(datRelevant, aes(y = temp, x = total_delay))
g + geom_smooth() + ggtitle("Total Delay v/s Temperature") +
ylab("Temperature [F]") + xlab("Total Delay (mins)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 17 rows containing non-finite values (stat_smooth).
The smoothed curve for humidity suggests that total_delay tends to increase with relative humidity.
The temperature plot also suggests that higher temperatures (i.e., in summer) go together with longer delays. The tail of the smoother falls away because of extreme outliers: days with extreme delays, possibly for reasons outside the variables in this data set.
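One way to check the direction of these relationships numerically is a correlation that skips incomplete rows, much as `geom_smooth()` dropped the 17 non-finite rows. This is only a minimal sketch: `weather_delay_cor()` is a hypothetical helper, it assumes a data frame shaped like `datRelevant` with a `total_delay` column, and the rows below are made up.

```r
# Hypothetical helper: Pearson correlation between total_delay and one
# weather variable, using only rows where both values are present
weather_delay_cor <- function(df, weather_var) {
  cor(df[[weather_var]], df$total_delay, use = "complete.obs")
}

# Made-up rows for illustration only; with the real data one would call
# weather_delay_cor(datRelevant, "humid") or weather_delay_cor(datRelevant, "temp")
toy <- data.frame(
  humid       = c(40, 55, NA, 70, 90),
  total_delay = c(0, 5, 12, 20, 35)
)
weather_delay_cor(toy, "humid")
```

A correlation coefficient only captures linear association, so it complements the smoothed curves rather than replacing them.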
In this problem we will use the state dataset, which ships with the R statistical computing platform. This data is related to the 50 states of the United States of America.
#Loading the state data
data(state)
#storing the state data into a data frame
dataState <- as.data.frame(state.x77)
#Converting the row names into the first column
#dataState <- rownames_to_column(dataState, var = "State")
cat("\n\nStructure of the state data:\n\n")
##
##
## Structure of the state data:
str(dataState)
## 'data.frame': 50 obs. of 8 variables:
## $ Population: num 3615 365 2212 2110 21198 ...
## $ Income : num 3624 6315 4530 3378 5114 ...
## $ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ Life Exp : num 69 69.3 70.5 70.7 71.7 ...
## $ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ HS Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
## $ Area : num 50708 566432 113417 51945 156361 ...
State is a collection of data sets related to the 50 states of the United States of America. For this question, we use the state.x77 data set. Each of the columns in the data set represents:
Area: land area in square miles
We relate the murder rate to other characteristics of the state, for example population, illiteracy rate, and more. We begin by examining the bivariate relationships present in the data.
model <- lm(Murder ~ ., data = dataState)
options(scipen=999)
summary(model)
##
## Call:
## lm(formula = Murder ~ ., data = dataState)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4452 -1.1016 -0.0598 1.1758 3.2355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122.180392646 17.886225407 6.831 0.0000000254 ***
## Population 0.000188036 0.000064737 2.905 0.00584 **
## Income -0.000159207 0.000572530 -0.278 0.78232
## Illiteracy 1.373109504 0.832202602 1.650 0.10641
## `Life Exp` -1.654869830 0.256211567 -6.459 0.0000000868 ***
## `HS Grad` 0.032338308 0.057252663 0.565 0.57519
## Frost -0.012884070 0.007392415 -1.743 0.08867 .
## Area 0.000005967 0.000003801 1.570 0.12391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.746 on 42 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.7763
## F-statistic: 25.29 on 7 and 42 DF, p-value: 0.0000000000003872
As the summary above shows, the R-squared value is 0.81. This suggests that the model explains about 81% of the total variance in the murder rate.
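To make the "variance explained" reading concrete, the R-squared value can be recomputed by hand from the residuals. The sketch below refits the same model on the built-in `state.x77` data; the names `rss`, `tss`, and `r_squared` are introduced here for illustration.

```r
# Refit the full model on the built-in state.x77 data
dataState <- as.data.frame(state.x77)
model <- lm(Murder ~ ., data = dataState)

# R-squared = 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(model)^2)
tss <- sum((dataState$Murder - mean(dataState$Murder))^2)
r_squared <- 1 - rss / tss

# Should agree with the value reported by summary()
all.equal(r_squared, summary(model)$r.squared)
```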
A quick glance at the coefficients from this model suggests that Population and Life Expectancy are significantly associated with the murder rate, since their p-values are well below 0.05.
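Rather than reading the p-values off the printed table, they can be pulled out of the fitted model programmatically. This is a small sketch on the same built-in data; the 0.05 cut-off is the one used in the text, and the variable names are illustrative.

```r
dataState <- as.data.frame(state.x77)
model <- lm(Murder ~ ., data = dataState)

# Coefficient table as a matrix; the "Pr(>|t|)" column holds the p-values
coef_table <- coef(summary(model))
p_values <- coef_table[, "Pr(>|t|)"]

# Predictors (excluding the intercept) with p < 0.05
significant <- names(p_values)[p_values < 0.05]
setdiff(significant, "(Intercept)")
```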
Now, to find the set of variables that best explains the variance in the murder rate, we use the step function for a stepwise regression, in which the choice of predictors is carried out automatically by comparing a criterion (here, AIC).
fit.best <- step(lm( Murder~.,data = dataState))
## Start: AIC=63.01
## Murder ~ Population + Income + Illiteracy + `Life Exp` + `HS Grad` +
## Frost + Area
##
## Df Sum of Sq RSS AIC
## - Income 1 0.236 128.27 61.105
## - `HS Grad` 1 0.973 129.01 61.392
## <none> 128.03 63.013
## - Area 1 7.514 135.55 63.865
## - Illiteracy 1 8.299 136.33 64.154
## - Frost 1 9.260 137.29 64.505
## - Population 1 25.719 153.75 70.166
## - `Life Exp` 1 127.175 255.21 95.503
##
## Step: AIC=61.11
## Murder ~ Population + Illiteracy + `Life Exp` + `HS Grad` + Frost +
## Area
##
## Df Sum of Sq RSS AIC
## - `HS Grad` 1 0.763 129.03 59.402
## <none> 128.27 61.105
## - Area 1 7.310 135.58 61.877
## - Illiteracy 1 8.715 136.98 62.392
## - Frost 1 9.345 137.61 62.621
## - Population 1 27.142 155.41 68.702
## - `Life Exp` 1 127.500 255.77 93.613
##
## Step: AIC=59.4
## Murder ~ Population + Illiteracy + `Life Exp` + Frost + Area
##
## Df Sum of Sq RSS AIC
## <none> 129.03 59.402
## - Illiteracy 1 8.723 137.75 60.672
## - Frost 1 11.030 140.06 61.503
## - Area 1 15.937 144.97 63.225
## - Population 1 26.415 155.45 66.714
## - `Life Exp` 1 140.391 269.42 94.213
summary(fit.best)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + `Life Exp` +
## Frost + Area, data = dataState)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2976 -1.0711 -0.1123 1.1092 3.4671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120.164031804 17.181610452 6.994 0.0000000117 ***
## Population 0.000177981 0.000059303 3.001 0.00442 **
## Illiteracy 1.172980493 0.680121662 1.725 0.09161 .
## `Life Exp` -1.607836823 0.232377225 -6.919 0.0000000150 ***
## Frost -0.013730312 0.007079737 -1.939 0.05888 .
## Area 0.000006804 0.000002919 2.331 0.02439 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.712 on 44 degrees of freedom
## Multiple R-squared: 0.8068, Adjusted R-squared: 0.7848
## F-statistic: 36.74 on 5 and 44 DF, p-value: 0.00000000000001221
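The AIC values that `step()` prints can be reproduced by hand, which helps in reading the trace above. The sketch below assumes the formula used by `extractAIC()` for `lm` fits, n * log(RSS / n) + 2 * k, and should reproduce the starting value of 63.01 for the full model; the variable names are illustrative.

```r
dataState <- as.data.frame(state.x77)
full <- lm(Murder ~ ., data = dataState)

n   <- nrow(dataState)          # 50 states
rss <- sum(residuals(full)^2)   # residual sum of squares
k   <- length(coef(full))       # estimated coefficients, including the intercept

# AIC as used by step() for lm fits (defined up to an additive constant)
aic_manual <- n * log(rss / n) + 2 * k

# extractAIC() returns c(edf, AIC); the second element is what step() prints
aic_step <- extractAIC(full)[2]
all.equal(aic_manual, aic_step)
```

At each step, `step()` drops the term whose removal gives the lowest AIC and stops once `<none>` has the lowest value, which is why the trace ends with five predictors.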
The next model is fitted with the lm() function in R. Also, I have disregarded Life Expectancy, as it is intuitive that life expectancy will decrease with an increase in the murder rate. In the following model, I have taken Illiteracy as my predictor variable.
model2 <- lm(Murder ~ Illiteracy, dataState)
summary(model2)
##
## Call:
## lm(formula = Murder ~ Illiteracy, data = dataState)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5315 -2.0602 -0.2503 1.6916 6.9745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3968 0.8184 2.928 0.0052 **
## Illiteracy 4.2575 0.6217 6.848 0.0000000126 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.653 on 48 degrees of freedom
## Multiple R-squared: 0.4942, Adjusted R-squared: 0.4836
## F-statistic: 46.89 on 1 and 48 DF, p-value: 0.00000001258
For the state dataset, is the model with Illiteracy as the predictor variable a good model for determining the murder rate in a state?
To answer this question, I have drawn the residual diagnostic plots.
par(mfrow=c(2,2))
plot(model2)
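The diagnostic plots can be complemented with a few direct checks on the residuals. The following is a minimal sketch, not a full diagnostic workflow: it redraws the residuals-versus-fitted panel by hand and adds a Shapiro-Wilk test as a numeric companion to the normal Q-Q plot.

```r
dataState <- as.data.frame(state.x77)
model2 <- lm(Murder ~ Illiteracy, data = dataState)

res <- residuals(model2)
fit <- fitted(model2)

# Residuals vs fitted: a patternless band around zero supports linearity
# and constant variance
plot(fit, res, xlab = "Fitted murder rate", ylab = "Residual")
abline(h = 0, lty = 2)

# Shapiro-Wilk test of the residuals; a small p-value would cast doubt
# on the normality assumption
shapiro.test(res)
```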
The scatterplot below shows the relationship between per capita income (in thousands of dollars) and percent of population with a bachelor’s degree in 3,143 counties in the US in 2010.
Per Capita Income and Education
Explanatory Variable: Percent of people with Bachelor’s Degree
Response Variable: Per Capita Income
The per capita income (in thousands of dollars) and percent of population with a bachelor’s degree are positively correlated.
A very high percentage of people with a bachelor’s degree within a county is associated with a higher per capita income in that county.
There are many counties in which 10% to 40% of residents hold a bachelor’s degree and yet per capita income is comparatively low.
We cannot conclude that having a bachelor’s degree increases one’s income: the data are observational, and many counties with 10% to 40% bachelor’s degree holders still have a comparatively low per capita income.
There are probably other factors that are to be considered when predicting the per capita income of the US counties.