Data Analysis

Setup:

Here I load the libraries required for the analysis.

# Load standard libraries
library(tidyverse)    # attaches ggplot2, dplyr and tidyr
library(nycflights13)

Problem 1: Flight Delays

Flight delays are often linked to weather conditions. How does weather impact flights from NYC? I use both the flights and weather datasets from the nycflights13 package to explore this question.

dat_original <- as.data.frame(nycflights13::flights, row.names = NULL) #storing nycflights data into a data frame

#Getting rid of incomplete cases
dat_original <- drop_na(dat_original)

flght <- dat_original

wthr <- as.data.frame(nycflights13::weather, row.names = NULL) #storing nycflights weather data into a data frame

dat <- merge(flght, wthr)


#For arrival and departure (Consider only positive delays)
dat$arr_delay <- ifelse(dat$arr_delay >= 0, dat$arr_delay, 0)
dat$dep_delay <- ifelse(dat$dep_delay >= 0, dat$dep_delay, 0)

#Adding a column for total delay
dat$total_delay <- dat$arr_delay + dat$dep_delay

#selecting only relevant columns
datRelevant <- dat[,c(20:29)]

# title is not an aesthetic, so it is set with ggtitle() rather than inside aes()
g <- ggplot(datRelevant, aes(y = humid, x = total_delay))
g + geom_smooth() + ggtitle("Total Delay v/s Relative Humidity") + 
  ylab("Relative Humidity") + xlab("Total Delay (mins)")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 17 rows containing non-finite values (stat_smooth).

# title is not an aesthetic, so it is set with ggtitle() rather than inside aes()
g <- ggplot(datRelevant, aes(y = temp, x = total_delay))
g + geom_smooth() + ggtitle("Total Delay v/s Temperature") + 
  ylab("Temperature [F]") + xlab("Total Delay (mins)")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 17 rows containing non-finite values (stat_smooth).

The humidity smoother suggests that longer total delays are associated with higher relative humidity.
The temperature plot likewise suggests that longer delays are associated with higher temperatures (i.e., in summer). The tail of the smoother falls because of extreme outliers: flights on days that saw extreme delays, possibly for reasons outside the variables in this data set.
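The data preparation used above (joining flights to weather on their shared columns, clamping early departures and arrivals to zero, and summing the two delays) can be sketched on a small example. The two toy data frames and their values are hypothetical stand-ins, not rows from nycflights13, and `pmax()` is used here as an equivalent of the `ifelse()` clamp.

```r
# Toy stand-ins for the flights and weather tables (hypothetical values).
flights_toy <- data.frame(
  origin    = c("JFK", "JFK", "LGA"),
  hour      = c(6, 7, 6),
  dep_delay = c(-3, 12, 40),
  arr_delay = c(-10, 5, 55)
)
weather_toy <- data.frame(
  origin = c("JFK", "JFK", "LGA"),
  hour   = c(6, 7, 6),
  humid  = c(60, 72, 85)
)

# merge() joins on all shared column names by default;
# naming them in `by` makes the join keys explicit.
joined <- merge(flights_toy, weather_toy, by = c("origin", "hour"))

# pmax() clamps negative (early) values to zero, like ifelse(x >= 0, x, 0).
joined$arr_delay <- pmax(joined$arr_delay, 0)
joined$dep_delay <- pmax(joined$dep_delay, 0)
joined$total_delay <- joined$arr_delay + joined$dep_delay

joined[, c("origin", "hour", "humid", "total_delay")]
```

Selecting columns by name, as in the last line, is less fragile than selecting by position if the column order of either table changes.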

Problem 2: 50 States in the USA

In this problem we will use the state dataset, available as part of the R statistical computing platform. This data is related to the 50 states of the United States of America.

(a) Describing the data and each variable it contains. Tidying the data, preparing it for a data analysis.

#Loading the state data
data(state)

#storing the state data into a data frame
dataState <- as.data.frame(state.x77)

#Converting the row names into the first column
#dataState <- rownames_to_column(dataState, var = "State")

cat("\n\nStructure of the state data:\n\n")

## 
## 
## Structure of the state data:

str(dataState)

## 'data.frame':    50 obs. of  8 variables:
##  $ Population: num  3615 365 2212 2110 21198 ...
##  $ Income    : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : num  50708 566432 113417 51945 156361 ...

State is a collection of data sets related to the 50 states of the United States of America. For this question, we use the state.x77 data set. Each of its columns represents:
Population: population estimate as of July 1, 1975
Income: per capita income (1974)
Illiteracy: illiteracy (1970, percent of population)
Life Exp: life expectancy in years (1969–71)
Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)
HS Grad: percent high-school graduates (1970)
Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
Area: land area in square miles
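As a possible tidying step, the state names can be moved out of the row names and the two column names containing spaces can be made syntactic. This is only a sketch in base R: `tidyState` is a hypothetical name, and the rest of this report keeps using `dataState` with the original column names.

```r
# state.x77 is a numeric matrix whose row names are the state names.
dataState <- as.data.frame(state.x77)

# Move the row names into an explicit State column
# (a base R counterpart of tibble::rownames_to_column()).
tidyState <- data.frame(State = rownames(dataState), dataState,
                        row.names = NULL, check.names = TRUE)

# check.names = TRUE turns "Life Exp" and "HS Grad" into Life.Exp and HS.Grad.
names(tidyState)

# Count missing values per column before modelling.
colSums(is.na(tidyState))
```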

(b) Suppose we want to explore the relationship between a state’s `Murder` rate and other characteristics of the state, for example population, illiteracy rate, and more. We begin by examining the bivariate relationships present in the data.
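One way to look at the bivariate relationships themselves, before fitting any model, is the Pearson correlation between Murder and each other variable. The sketch below uses only base R; the variable name `cors` is an assumption of this example.

```r
# Pearson correlations between Murder and every other column of state.x77.
cors <- cor(state.x77)["Murder", ]
cors <- cors[names(cors) != "Murder"]

# Sort by absolute strength so the strongest relationships come first.
round(cors[order(abs(cors), decreasing = TRUE)], 3)
```

Calling `pairs(state.x77)` draws the matching scatterplot matrix for a visual check.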

model <- lm(Murder ~., data=dataState)
options(scipen=999)
summary(model)

## 
## Call:
## lm(formula = Murder ~ ., data = dataState)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4452 -1.1016 -0.0598  1.1758  3.2355 
## 
## Coefficients:
##                  Estimate    Std. Error t value     Pr(>|t|)    
## (Intercept) 122.180392646  17.886225407   6.831 0.0000000254 ***
## Population    0.000188036   0.000064737   2.905      0.00584 ** 
## Income       -0.000159207   0.000572530  -0.278      0.78232    
## Illiteracy    1.373109504   0.832202602   1.650      0.10641    
## `Life Exp`   -1.654869830   0.256211567  -6.459 0.0000000868 ***
## `HS Grad`     0.032338308   0.057252663   0.565      0.57519    
## Frost        -0.012884070   0.007392415  -1.743      0.08867 .  
## Area          0.000005967   0.000003801   1.570      0.12391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.746 on 42 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.7763 
## F-statistic: 25.29 on 7 and 42 DF,  p-value: 0.0000000000003872
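The Multiple R-squared and Adjusted R-squared in the summary above follow from \(R^2 = 1 - RSS/TSS\) and \(R^2_{adj} = 1 - (1 - R^2)(n - 1)/(n - p - 1)\), where \(p\) is the number of predictors. A minimal sketch that recomputes both by hand (the names `rss`, `tss`, `r2` and `adj_r2` are introduced here for illustration):

```r
dataState <- as.data.frame(state.x77)
model <- lm(Murder ~ ., data = dataState)

rss <- sum(residuals(model)^2)                             # residual sum of squares
tss <- sum((dataState$Murder - mean(dataState$Murder))^2)  # total sum of squares
n   <- nrow(dataState)
p   <- length(coef(model)) - 1                             # number of predictors

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r_squared = r2, adj_r_squared = adj_r2)
```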

As is evident from the above analysis, the R-squared value is 0.81, so the model explains about 81% of the total variance in the murder rate.
A quick glance at the coefficients suggests that Population and Life Expectancy are significantly associated with the murder rate: both have p-values well below 0.05.
Now, to find a smaller set of variables that still explains most of the variance in murder rate, we use the step function, or stepwise regression, in which predictors are chosen automatically by comparing a criterion (here AIC).

fit.best <- step(lm( Murder~.,data = dataState))

## Start:  AIC=63.01
## Murder ~ Population + Income + Illiteracy + `Life Exp` + `HS Grad` + 
##     Frost + Area
## 
##              Df Sum of Sq    RSS    AIC
## - Income      1     0.236 128.27 61.105
## - `HS Grad`   1     0.973 129.01 61.392
## <none>                    128.03 63.013
## - Area        1     7.514 135.55 63.865
## - Illiteracy  1     8.299 136.33 64.154
## - Frost       1     9.260 137.29 64.505
## - Population  1    25.719 153.75 70.166
## - `Life Exp`  1   127.175 255.21 95.503
## 
## Step:  AIC=61.11
## Murder ~ Population + Illiteracy + `Life Exp` + `HS Grad` + Frost + 
##     Area
## 
##              Df Sum of Sq    RSS    AIC
## - `HS Grad`   1     0.763 129.03 59.402
## <none>                    128.27 61.105
## - Area        1     7.310 135.58 61.877
## - Illiteracy  1     8.715 136.98 62.392
## - Frost       1     9.345 137.61 62.621
## - Population  1    27.142 155.41 68.702
## - `Life Exp`  1   127.500 255.77 93.613
## 
## Step:  AIC=59.4
## Murder ~ Population + Illiteracy + `Life Exp` + Frost + Area
## 
##              Df Sum of Sq    RSS    AIC
## <none>                    129.03 59.402
## - Illiteracy  1     8.723 137.75 60.672
## - Frost       1    11.030 140.06 61.503
## - Area        1    15.937 144.97 63.225
## - Population  1    26.415 155.45 66.714
## - `Life Exp`  1   140.391 269.42 94.213

summary(fit.best)

## 
## Call:
## lm(formula = Murder ~ Population + Illiteracy + `Life Exp` + 
##     Frost + Area, data = dataState)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2976 -1.0711 -0.1123  1.1092  3.4671 
## 
## Coefficients:
##                  Estimate    Std. Error t value     Pr(>|t|)    
## (Intercept) 120.164031804  17.181610452   6.994 0.0000000117 ***
## Population    0.000177981   0.000059303   3.001      0.00442 ** 
## Illiteracy    1.172980493   0.680121662   1.725      0.09161 .  
## `Life Exp`   -1.607836823   0.232377225  -6.919 0.0000000150 ***
## Frost        -0.013730312   0.007079737  -1.939      0.05888 .  
## Area          0.000006804   0.000002919   2.331      0.02439 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.712 on 44 degrees of freedom
## Multiple R-squared:  0.8068, Adjusted R-squared:  0.7848 
## F-statistic: 36.74 on 5 and 44 DF,  p-value: 0.00000000000001221

As per the above calculations, the final step keeps a combination of just 5 variables: Population, Illiteracy, Life Expectancy, Frost and Area. Together they explain about 81% (R-squared of 0.8068) of the variance in murder rate, essentially the same as the full model.
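The AIC values printed in the step() trace can be reproduced by hand. For lm fits, step() relies on `extractAIC()`, which computes \(n \log(RSS/n) + 2k\), with \(k\) the number of estimated coefficients. The sketch below checks this for the starting model; `full`, `k` and `aic_manual` are names chosen for this example.

```r
dataState <- as.data.frame(state.x77)
full <- lm(Murder ~ ., data = dataState)

n   <- nrow(dataState)
k   <- length(coef(full))       # estimated coefficients, including the intercept
rss <- sum(residuals(full)^2)   # residual sum of squares

# Criterion used by step() for lm fits: n * log(RSS / n) + 2 * k
aic_manual <- n * log(rss / n) + 2 * k

c(manual = aic_manual, extractAIC = extractAIC(full)[2])
```

This value differs from the one returned by `AIC()` by an additive constant, which does not affect the ranking of models fitted to the same data.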

(c) Choosing one variable and fitting a simple linear regression model, \(Y = \beta_1X + \beta_0\), using the `lm()` function in R.

I have disregarded Life Expectancy, since it is intuitive that life expectancy decreases as the murder rate increases. In the following model, I have taken Illiteracy as my predictor variable.

model2 <- lm(Murder ~ Illiteracy, dataState)
summary(model2)

## 
## Call:
## lm(formula = Murder ~ Illiteracy, data = dataState)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5315 -2.0602 -0.2503  1.6916  6.9745 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)   2.3968     0.8184   2.928       0.0052 ** 
## Illiteracy    4.2575     0.6217   6.848 0.0000000126 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.653 on 48 degrees of freedom
## Multiple R-squared:  0.4942, Adjusted R-squared:  0.4836 
## F-statistic: 46.89 on 1 and 48 DF,  p-value: 0.00000001258

From this model, Illiteracy alone accounts for about 49% of the variance in the murder rate. Its p-value (0.0000000126) is far below 0.05, so the association is very unlikely to be due to chance alone. This does not by itself show that illiteracy causes a higher murder rate.
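The slope, its p-value and a confidence interval can also be pulled out of the fitted model directly rather than read off the printed summary. A minimal sketch, with `coefs`, `slope`, `p_value` and `ci` as illustrative names:

```r
dataState <- as.data.frame(state.x77)
model2 <- lm(Murder ~ Illiteracy, data = dataState)

# Coefficient table: estimate, standard error, t value and p-value per term.
coefs   <- coef(summary(model2))
slope   <- coefs["Illiteracy", "Estimate"]
p_value <- coefs["Illiteracy", "Pr(>|t|)"]

# 95% confidence interval for the slope.
ci <- confint(model2, "Illiteracy", level = 0.95)

slope
p_value < 0.05
ci
```

An interval that excludes zero is consistent with the small p-value reported in the summary.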

(d) Developing a new research question of your own that you can address using the `state` dataset.

Is the model with Illiteracy as the predictor variable a good model for determining the murder rate in a state?
To answer this question, I have produced the residual diagnostic plots.

par(mfrow=c(2,2))
plot(model2)

In the plots above, the residuals appear to be randomly scattered, although some transformation may still improve linearity. This random pattern indicates that a linear model provides a decent fit to the data.
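The visual reading of the diagnostic plots can be complemented with a few numeric checks. This is a sketch of one possible approach, not part of the original analysis: `model2_quad` is a hypothetical comparison model with an added quadratic term.

```r
dataState <- as.data.frame(state.x77)
model2 <- lm(Murder ~ Illiteracy, data = dataState)

# With an intercept in the model, least-squares residuals average to zero.
mean(residuals(model2))

# Compare the straight-line fit with one that adds a quadratic term;
# a small p-value in the anova table would point to curvature.
model2_quad <- lm(Murder ~ poly(Illiteracy, 2), data = dataState)
anova(model2, model2_quad)

# Shapiro-Wilk test of the residuals as a rough normality check.
shapiro.test(residuals(model2))
```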

Problem 3: Income and Education

The scatterplot below shows the relationship between per capita income (in thousands of dollars) and percent of population with a bachelor’s degree in 3,143 counties in the US in 2010.

[Figure: Per Capita Income and Education]

(a) Explanatory and response variables?

Explanatory Variable: Percent of people with Bachelor’s Degree
Response Variable: Per Capita Income

(b) Describing the relationship between the two variables.

The per capita income (in thousands of dollars) and the percent of population with a bachelor’s degree are positively correlated.
Counties with a very high percentage of people with a bachelor’s degree tend to have a higher per capita income.
There are, however, many counties with between 10% and 40% bachelor’s degree holders that have a relatively low per capita income.

(c) Can we conclude that having a bachelor’s degree increases one’s income? Why or why not?

We cannot conclude that having a bachelor’s degree increases one’s income. The data are observational and aggregated by county, so an association between counties does not establish a causal effect for individuals. In addition, many counties with between 10% and 40% bachelor’s degree holders have a relatively low per capita income.
There are probably other factors to consider when predicting the per capita income of US counties.
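The point about other factors can be illustrated with a small simulation. All values below are simulated, not the county data: `z` is a hypothetical unobserved factor that drives both variables, while `x` has no direct effect on `y` at all.

```r
set.seed(1)
n <- 3143            # same number of rows as the county data; values are simulated

z <- rnorm(n)        # hypothetical unobserved factor
x <- z + rnorm(n)    # stand-in for percent with a bachelor's degree
y <- z + rnorm(n)    # stand-in for per capita income; no direct effect of x

# x and y are positively correlated even though x does not cause y.
cor(x, y)

# Once z is included, the coefficient on x shrinks towards zero.
coef(lm(y ~ x))["x"]
coef(lm(y ~ x + z))["x"]
```

A scatterplot of `y` against `x` from this simulation would show a positive trend, which is why a positive trend alone cannot settle the causal question.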

Akshay Khanna

October 23, 2018
