Setup:

Here I load the libraries required for the analysis.

# Load standard libraries
library(tidyverse)
library(nycflights13)
library(ggplot2)
library(dplyr)
library(plyr)

Problem 1: Flight Delays

Flight delays are often linked to weather conditions. How does weather impact flights from NYC? I use both the flights and weather datasets from the nycflights13 package to explore this question.

dat_original <- as.data.frame(nycflights13::flights, row.names = NULL) #storing nycflights data into a data frame

#Getting rid of incomplete cases
dat_original <- drop_na(dat_original)

flght <- dat_original

wthr <- as.data.frame(nycflights13::weather, row.names = NULL) #storing nycflights weather data into a data frame

dat <- merge(flght, wthr)


#For arrival and departure (Consider only positive delays)
dat$arr_delay <- ifelse(dat$arr_delay >= 0, dat$arr_delay, 0)
dat$dep_delay <- ifelse(dat$dep_delay >= 0, dat$dep_delay, 0)

#Adding a column for total delay
dat$total_delay <- dat$arr_delay + dat$dep_delay

#selecting only relevant columns (columns 20 to 29 of the merged data: the weather variables and total_delay)
datRelevant <- dat[, 20:29]
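
The merge and delay-clipping steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: `flights_toy` and `weather_toy` are made-up stand-ins, not the nycflights13 tables, and the keys are reduced to `origin` and `hour` (in the real data the shared columns would presumably also include the date fields). It shows how to name the join keys explicitly rather than relying on the default of `merge()`, and how `pmax()` gives the same clipping as the `ifelse()` calls.

```r
# Toy stand-ins for the flights and weather tables (all values are made up).
flights_toy <- data.frame(
  origin    = c("JFK", "JFK", "LGA"),
  hour      = c(6, 7, 6),
  dep_delay = c(-3, 12, NA),
  arr_delay = c(-10, 20, 5)
)
weather_toy <- data.frame(
  origin = c("JFK", "JFK", "LGA"),
  hour   = c(6, 7, 6),
  humid  = c(60, 72, 55)
)

# Name the join keys explicitly instead of relying on merge()'s default,
# which joins on every column the two data frames have in common.
joined <- merge(flights_toy, weather_toy, by = c("origin", "hour"))

# pmax() replaces negative (early) values with 0 and keeps NA as NA,
# which matches ifelse(x >= 0, x, 0).
joined$dep_delay <- pmax(joined$dep_delay, 0)
joined$arr_delay <- pmax(joined$arr_delay, 0)

# Total delay is the sum of the two clipped delays.
joined$total_delay <- joined$arr_delay + joined$dep_delay
print(joined)
```

Spelling out `by` makes the join robust to columns that happen to share a name but not a meaning.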

# title is not an aesthetic, so it is set with ggtitle() rather than inside aes()
g <- ggplot(datRelevant, aes(x = total_delay, y = humid))
g + geom_smooth() + ggtitle("Total Delay vs. Relative Humidity") +
  ylab("Relative Humidity") +
  xlab("Total Delay (mins)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 17 rows containing non-finite values (stat_smooth).

# title is not an aesthetic, so it is set with ggtitle() rather than inside aes()
g <- ggplot(datRelevant, aes(x = total_delay, y = temp))
g + geom_smooth() + ggtitle("Total Delay vs. Temperature") +
  ylab("Temperature [F]") +
  xlab("Total Delay (mins)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 17 rows containing non-finite values (stat_smooth).
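
The plots above put total delay on the x-axis. As a complementary view, the following minimal sketch groups observations into humidity bins and compares the mean delay per bin, which treats the weather variable as the explanatory one. It runs on made-up toy values, not the nycflights13 data; the column names mirror those used above, while `toy`, `humid_bin` and the bin edges are illustrative assumptions.

```r
# Made-up toy values standing in for the humid and total_delay columns.
toy <- data.frame(
  humid       = c(35, 42, 55, 61, 78, 83, 90, 96),
  total_delay = c(0, 5, 10, 0, 30, 20, 60, 45)
)

# Cut relative humidity into three bins: (0,50], (50,75] and (75,100].
toy$humid_bin <- cut(toy$humid, breaks = c(0, 50, 75, 100))

# Mean total delay within each humidity bin.
mean_delay <- tapply(toy$total_delay, toy$humid_bin, mean)
print(mean_delay)
```

On the real data the same two lines (`cut()` followed by `tapply()`) could be applied to `datRelevant$humid` and `datRelevant$total_delay`, with `na.rm = TRUE` passed to `mean` if missing values remain.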

Problem 2: 50 States in the USA

In this problem we will use the state dataset, which ships with the R statistical computing platform. The data describe the 50 states of the United States of America.

(a) Describing the data and each variable it contains. Tidying the data, preparing it for a data analysis.
#Loading the state data
data(state)

#storing the state data into a data frame
dataState <- as.data.frame(state.x77)

#Converting the row names into the first column (left commented out so that the models below use only the numeric variables)
#dataState <- rownames_to_column(dataState, var = "State")

cat("\n\nStructure of the state data:\n\n")
## 
## 
## Structure of the state data:
str(dataState)
## 'data.frame':    50 obs. of  8 variables:
##  $ Population: num  3615 365 2212 2110 21198 ...
##  $ Income    : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : num  50708 566432 113417 51945 156361 ...
  • State is a collection of data sets related to the 50 states of the United States of America. For this question, we use the state.x77 data set. Its columns represent:

  • Population: population estimate as of July 1, 1975 (in thousands)
  • Income: per capita income (1974)
  • Illiteracy: illiteracy (1970, percent of population)
  • Life Exp: life expectancy in years (1969–71)
  • Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)
  • HS Grad: percent high-school graduates (1970)
  • Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
  • Area: land area in square miles

(b) Suppose we want to explore the relationship between a state’s Murder rate and other characteristics of the state, for example population, illiteracy rate, and more. We begin by examining the bivariate relationships present in the data.
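
One way to look at the bivariate relationships mentioned above, before fitting any model, is a correlation matrix together with a scatterplot matrix. The sketch below is a minimal example that relies only on the `state.x77` data shipped with base R; the names `state_df` and `murder_cor` are introduced here for illustration.

```r
# state.x77 ships with base R (package "datasets"), so nothing needs installing.
state_df <- as.data.frame(state.x77)

# Pearson correlation of every other variable with Murder.
murder_cor <- cor(state_df)[, "Murder"]
murder_cor <- murder_cor[names(murder_cor) != "Murder"]

# Show the correlations from most positive to most negative.
print(round(sort(murder_cor, decreasing = TRUE), 2))

# Scatterplot matrix of all pairwise relationships.
pairs(state_df, main = "Pairwise relationships in state.x77")
```

Variables with a large absolute correlation are natural candidates for the regression models that follow.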
model <- lm(Murder ~., data=dataState)
options(scipen=999)
summary(model)
## 
## Call:
## lm(formula = Murder ~ ., data = dataState)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4452 -1.1016 -0.0598  1.1758  3.2355 
## 
## Coefficients:
##                  Estimate    Std. Error t value     Pr(>|t|)    
## (Intercept) 122.180392646  17.886225407   6.831 0.0000000254 ***
## Population    0.000188036   0.000064737   2.905      0.00584 ** 
## Income       -0.000159207   0.000572530  -0.278      0.78232    
## Illiteracy    1.373109504   0.832202602   1.650      0.10641    
## `Life Exp`   -1.654869830   0.256211567  -6.459 0.0000000868 ***
## `HS Grad`     0.032338308   0.057252663   0.565      0.57519    
## Frost        -0.012884070   0.007392415  -1.743      0.08867 .  
## Area          0.000005967   0.000003801   1.570      0.12391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.746 on 42 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.7763 
## F-statistic: 25.29 on 7 and 42 DF,  p-value: 0.0000000000003872
  • As is evident from the above analysis, the R-squared value is 0.81. This suggests that the model explains about 81% of the total variance in the murder rate.

  • A quick glance at the coefficients suggests that Population and Life Expectancy are the significant predictors of the murder rate: their p-values (0.006 and about 0.00000009) are well below 0.05.

  • To find a smaller set of variables that still explains most of the variance in the murder rate, we use the step function, that is, a stepwise regression, in which predictors are dropped automatically by comparing the AIC of the candidate models.

fit.best <- step(lm( Murder~.,data = dataState))
## Start:  AIC=63.01
## Murder ~ Population + Income + Illiteracy + `Life Exp` + `HS Grad` + 
##     Frost + Area
## 
##              Df Sum of Sq    RSS    AIC
## - Income      1     0.236 128.27 61.105
## - `HS Grad`   1     0.973 129.01 61.392
## <none>                    128.03 63.013
## - Area        1     7.514 135.55 63.865
## - Illiteracy  1     8.299 136.33 64.154
## - Frost       1     9.260 137.29 64.505
## - Population  1    25.719 153.75 70.166
## - `Life Exp`  1   127.175 255.21 95.503
## 
## Step:  AIC=61.11
## Murder ~ Population + Illiteracy + `Life Exp` + `HS Grad` + Frost + 
##     Area
## 
##              Df Sum of Sq    RSS    AIC
## - `HS Grad`   1     0.763 129.03 59.402
## <none>                    128.27 61.105
## - Area        1     7.310 135.58 61.877
## - Illiteracy  1     8.715 136.98 62.392
## - Frost       1     9.345 137.61 62.621
## - Population  1    27.142 155.41 68.702
## - `Life Exp`  1   127.500 255.77 93.613
## 
## Step:  AIC=59.4
## Murder ~ Population + Illiteracy + `Life Exp` + Frost + Area
## 
##              Df Sum of Sq    RSS    AIC
## <none>                    129.03 59.402
## - Illiteracy  1     8.723 137.75 60.672
## - Frost       1    11.030 140.06 61.503
## - Area        1    15.937 144.97 63.225
## - Population  1    26.415 155.45 66.714
## - `Life Exp`  1   140.391 269.42 94.213
summary(fit.best)
## 
## Call:
## lm(formula = Murder ~ Population + Illiteracy + `Life Exp` + 
##     Frost + Area, data = dataState)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2976 -1.0711 -0.1123  1.1092  3.4671 
## 
## Coefficients:
##                  Estimate    Std. Error t value     Pr(>|t|)    
## (Intercept) 120.164031804  17.181610452   6.994 0.0000000117 ***
## Population    0.000177981   0.000059303   3.001      0.00442 ** 
## Illiteracy    1.172980493   0.680121662   1.725      0.09161 .  
## `Life Exp`   -1.607836823   0.232377225  -6.919 0.0000000150 ***
## Frost        -0.013730312   0.007079737  -1.939      0.05888 .  
## Area          0.000006804   0.000002919   2.331      0.02439 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.712 on 44 degrees of freedom
## Multiple R-squared:  0.8068, Adjusted R-squared:  0.7848 
## F-statistic: 36.74 on 5 and 44 DF,  p-value: 0.00000000000001221
  • As the last step shows, a combination of just five variables (Population, Illiteracy, Life Expectancy, Frost and Area) explains about 81% of the variance in murder rate (R-squared = 0.8068), essentially the same as the full model with all seven predictors.
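
The comparison that the stepwise search makes can also be reproduced by hand. The following sketch, assuming only the base R `state.x77` data, refits the full and the five-variable model and compares them by AIC, adjusted R-squared and a partial F-test; the object names (`state_df`, `full`, `reduced`, `aic_full`, `aic_reduced`, `adj_r2`) are chosen here for illustration.

```r
state_df <- as.data.frame(state.x77)

# Full model and the five-variable model selected by step().
full    <- lm(Murder ~ ., data = state_df)
reduced <- lm(Murder ~ Population + Illiteracy + `Life Exp` + Frost + Area,
              data = state_df)

# extractAIC() returns c(equivalent degrees of freedom, AIC) on the same
# scale that step() prints.
aic_full    <- extractAIC(full)[2]
aic_reduced <- extractAIC(reduced)[2]
print(c(full = aic_full, reduced = aic_reduced))

# Adjusted R-squared penalises the extra predictors of the full model.
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(adj_r2, 4))

# Partial F-test: do Income and HS Grad jointly improve the fit?
print(anova(reduced, full))
```

A lower AIC and a higher adjusted R-squared for the reduced model both point the same way as the stepwise output above.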
(c) Choosing one variable and fitting a simple linear regression model, \(Y = \beta_1X + \beta_0\), using the lm() function in R.

I have disregarded Life Expectancy, as it is intuitive that life expectancy decreases as the murder rate increases. In the following model, I have taken Illiteracy as my predictor variable.

model2 <- lm(Murder ~ Illiteracy, dataState)
summary(model2)
## 
## Call:
## lm(formula = Murder ~ Illiteracy, data = dataState)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5315 -2.0602 -0.2503  1.6916  6.9745 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)   2.3968     0.8184   2.928       0.0052 ** 
## Illiteracy    4.2575     0.6217   6.848 0.0000000126 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.653 on 48 degrees of freedom
## Multiple R-squared:  0.4942, Adjusted R-squared:  0.4836 
## F-statistic: 46.89 on 1 and 48 DF,  p-value: 0.00000001258
  • From this model it is clear that Illiteracy alone accounts for about 49% of the variance in the murder rate. Its p-value (about 0.00000001) is far below 0.05, so the association is very unlikely to be due to chance. The estimated slope indicates that each additional percentage point of illiteracy is associated with about 4.26 more murders per 100,000 population.
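
To make the interpretation of the slope and its p-value concrete, the sketch below refits the same simple regression on the base R `state.x77` data and extracts the p-value, the R-squared value and a 95% confidence interval for the slope. The names `fit`, `slope_ci`, `p_value` and `r2` are illustrative and not part of the original analysis.

```r
state_df <- as.data.frame(state.x77)

# Same simple linear regression as model2 above.
fit <- lm(Murder ~ Illiteracy, data = state_df)

# p-value of the slope: a small value is evidence against "no association".
p_value <- summary(fit)$coefficients["Illiteracy", "Pr(>|t|)"]

# Share of the variance in Murder explained by Illiteracy.
r2 <- summary(fit)$r.squared

# 95% confidence interval for the slope (returned as a 1 x 2 matrix).
slope_ci <- confint(fit, "Illiteracy", level = 0.95)

print(p_value)
print(r2)
print(slope_ci)
```

If the confidence interval excludes zero, that agrees with a p-value below 0.05. In a simple regression, R-squared equals the squared correlation between the two variables.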
(d) Developing a new research question of your own that you can address using the state dataset.
  • Is the model with Illiteracy as the predictor variable a good model for determining the murder rate in a state?

  • To answer this question, I have produced the diagnostic (residual) plots for the model.

par(mfrow=c(2,2))
plot(model2)

  • As per the above plots, the residuals appear to be randomly scattered around zero with no clear pattern. This indicates that a linear model provides a decent fit to the data, although a transformation of the variables might still improve linearity.
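
The visual impression from the diagnostic plots can be supplemented with a couple of simple checks. This is a minimal sketch on the base R `state.x77` data: it redraws the residuals-versus-fitted plot with a smoothed trend line and runs a Shapiro-Wilk normality test on the residuals. The names `fit`, `res` and `sw` are illustrative, and the choice of the Shapiro-Wilk test is an assumption rather than part of the original analysis.

```r
state_df <- as.data.frame(state.x77)
fit <- lm(Murder ~ Illiteracy, data = state_df)

# Residuals of the simple regression, one per state.
res <- resid(fit)

# Residuals vs fitted values: a good linear fit shows no systematic trend.
plot(fitted(fit), res,
     xlab = "Fitted murder rate", ylab = "Residual",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)
lines(lowess(fitted(fit), res), col = "red")  # smoothed trend of the residuals

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals.
sw <- shapiro.test(res)
print(sw)
```

A roughly flat red line near zero supports the linearity assumption. A curved line would be a reason to try a transformation.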

Problem 3: Income and Education

The scatterplot below shows the relationship between per capita income (in thousands of dollars) and percent of population with a bachelor’s degree in 3,143 counties in the US in 2010.

[Figure: Per Capita Income and Education]

(a) Explanatory and response variables?
  • Explanatory Variable: Percent of people with Bachelor’s Degree

  • Response Variable: Per Capita Income

(b) Describing the relationship between the two variables.
  • The per capita income (in thousands of dollars) and percent of population with a bachelor’s degree are positively correlated.

  • Counties with a very high percentage of people with a bachelor’s degree tend to have a higher per capita income.

  • There is a substantial number of counties in which between 10% and 40% of the population hold a bachelor’s degree but per capita income is nonetheless relatively low.

(c) Can we conclude that having a bachelor’s degree increases one’s income? Why or why not?
  • We cannot conclude that having a bachelor’s degree increases one’s income. The data are observational and aggregated at the county level, so a positive association does not establish causation for individuals. In addition, there is a substantial number of counties with between 10% and 40% bachelor’s degree holders that have relatively low per capita income.

  • There are probably other factors that should be considered when predicting the per capita income of US counties.