Lesson 13 | 12 June 2020

Some Categories of Data Science

1.) Data Cleaning

2.) Linear Regressions (Single-Variate Modeling)

3.) Data Visualization (e.g. Spatial Mapping, Graphing, and Plotting)

1.) Data Cleaning

Let’s recreate the clean_flight_data.R script. What steps do we need to take?

  1. Read the data.
  2. Remove unnecessary columns.
  3. Recode column values: convert all binary variables to 0’s and 1’s, and center continuous variables around their respective means. Why center variables (especially when conducting multiple regression)? Centering gives each predictor a mean of 0, so the intercept term can be interpreted as the expected value of \(y_i\) when all predictors are set to their means. In other words, when you center X so that the mean of X becomes 0, the intercept becomes the mean of Y at the value(s) you centered on.
source("clean_flight_data.R")
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
data_all <- read_flight_data("complete_flight_data-Winter2020.csv")
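As a rough sketch of step 3 (the raw column name flew and its "Y"/"N" coding are assumptions here, but the _b and _c suffixes match the columns used below):

data_all$flew_b <- ifelse(data_all$flew == "Y", 1, 0)                 # recode binary to 0/1
data_all$mass_c <- data_all$mass - mean(data_all$mass, na.rm = TRUE)  # center on the mean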

2.) Linear Regressions (Single-Variate Modeling)

What are linear regressions?

“Linear regression is the geocentric model of applied statistics. By “linear regression,” we will mean a family of simple statistical models that attempt to learn about the mean and variance of some measurement, using an additive combination of other measurements. Like the geocentric model, linear regression can usefully describe a very large variety of natural phenomena. Like the geocentric model, linear regression is a descriptive model that corresponds to many different process models. If we read its structure too literally, we’re likely to make mistakes. But used wisely, these little linear models continue to be useful." - Richard McElreath, Statistical Rethinking, Chapter 4

Why do we use linear regressions?

Because these models have predictive and inferential power. They can be used to test the association of an outcome with a predictor, to quantify the degree of association, or to estimate the mean value of the outcome for given values of the predictors.

What do linear regressions look like?

For single-variate modeling (as opposed to multivariate or mixed-effect modeling), linear regressions are just like y = mx + b, where y is the response or dependent variable and x is the explanatory, independent, or predictor variable (yes, there are many names for the same thing). m is the slope of the regression line, estimated by \(\beta\), the effect or coefficient of your explanatory variable x. Finally, b is the intercept, estimated by \(\alpha\), which equals y when x = 0 (assuming you didn’t center the data). In statistics, these are the equations:

\(y_i \sim Normal(\mu_i, \sigma)\)

This probabilistic equation is known as the ‘likelihood’. The special \(\sim\) notation indicates that y is normally distributed with these parameters: the mean \(\mu\) and the standard deviation \(\sigma\) (or the variance \(\sigma^2\)). Normal distributions are symmetric, unimodal, bell-shaped curves, which you can plot like so:

x <- seq(-4, 4, length=100)
hx <- dnorm(x)  # density function of the standard normal distribution
plot(x, hx, type="l")  # draw as a curve rather than points

\(\mu_i = \alpha + \beta_1 x_i\)

But when conducting a linear regression, \(\mu\) is no longer a parameter to be estimated. Instead, it is constructed from other parameters, \(\alpha\) and \(\beta\), and the predictor variable x. This line is not a stochastic relationship - there is no \(\sim\) in it because \(\mu\) is deterministic, not probabilistic.

Finally, \(x_i, y_i\) and \(\mu_i\) all refer to some row i in the data.
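A tiny simulation makes these two equations concrete (the values chosen for \(\alpha\), \(\beta\), and \(\sigma\) below are arbitrary):

set.seed(1)
x <- runif(50, -2, 2)
mu <- 0.5 + 1.2 * x                  # deterministic: mu_i = alpha + beta * x_i
y <- rnorm(50, mean = mu, sd = 0.3)  # stochastic: y_i ~ Normal(mu_i, sigma)
plot(x, y)
abline(a = 0.5, b = 1.2, col = "blue")  # the true regression line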

What do single-variate linear regressions look like in R?

In R, for simple linear regressions you can use lm() or glm(). The lm() function is the common choice because it is simple, but glm() with the Gaussian family and identity link gives the same results. The technical difference is that glm() fits by maximum likelihood (useful if you want AIC values) whereas lm() uses least squares.

We will just be using lm() today and interpreting the resulting p-value, where a significant p-value is one less than 0.05, which rejects the null hypothesis (the null being that the explanatory variable has no effect on the response variable).

# Coding difference:

# lm(response_var ~ explanatory_var, data=data)
# glm(response_var ~ explanatory_var, data=data, family=...)  # binomial, gaussian, Gamma, or poisson
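For instance, a minimal sketch (assuming data_all from the cleaning step above) showing the two functions agree:

fit_lm  <- lm(mass ~ sym_dist, data = data_all)
fit_glm <- glm(mass ~ sym_dist, data = data_all, family = gaussian(link = "identity"))
all.equal(coef(fit_lm), coef(fit_glm))                # TRUE: identical point estimates
AIC(fit_glm)                                          # available because glm() fits by likelihood
summary(fit_lm)$coefficients["sym_dist", "Pr(>|t|)"]  # the p-value we will interpret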

How can linear regressions be applied to experimental data?

First, you’ll need to differentiate between independent variables and covariates:

“Similar to an independent variable, a covariate is complementary to the dependent, or response, variable. A variable is a covariate if it is related to the dependent variable. According to this definition, any variable that is measurable and considered to have a statistical relationship with the dependent variable would qualify as a potential covariate. A covariate is thus a possible predictive or explanatory variable of the dependent variable. This may be the reason that in regression analyses, independent variables (i.e., the regressors) are sometimes called covariates. Used in this context, covariates are of primary interest. In most other circumstances, however, covariates are of no primary interest compared with the independent variables. They arise because the experimental or observational units are heterogeneous.” - Encyclopedia of Research Design

Now we can jump in and run some regressions!

Let’s say we want to predict how certain variables affected whether a bug flew or not.

Experimental Set-Up Covariates

Biological Covariates

Morphology Covariates

Experimental Set-Up Covariates:

There are three: 1.) chamber, 2.) test_date, 3.) test_time.

####### No effect of chamber
summary(glm(flew_b~chamber, data=data_all, family=binomial))$coefficients
##                  Estimate Std. Error       z value  Pr(>|z|)
## (Intercept) -4.182588e-15  0.2626129 -1.592682e-14 1.0000000
## chamberA-2   2.113091e-01  0.3495523  6.045135e-01 0.5455023
## chamberA-3   4.489502e-01  0.3515330  1.277121e+00 0.2015595
## chamberA-4   1.262937e-01  0.3242614  3.894812e-01 0.6969202
## chamberB-1   5.007753e-01  0.3863529  1.296160e+00 0.1949202
## chamberB-2   4.737844e-01  0.3437099  1.378443e+00 0.1680666
## chamberB-3  -1.000835e-01  0.3450954 -2.900168e-01 0.7718034
## chamberB-4   1.923719e-01  0.3525174  5.457089e-01 0.5852661
####### Effect of test date (but no effect of test date when you split between T1 and T2)
summary(glm(flew_b~days_from_start_c, data=data_all, family=binomial))$coefficients
##                      Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)        0.22565827 0.08181221  2.758247 0.005811230
## days_from_start_c -0.03429809 0.01181618 -2.902637 0.003700351
####### No effect of test time
summary(glm(flew_b~min_from_IncStart_c, data=data_all, family=binomial))$coefficients
##                         Estimate   Std. Error   z value    Pr(>|z|)
## (Intercept)         0.2224922320 0.0812272689 2.7391323 0.006160159
## min_from_IncStart_c 0.0002672204 0.0005770423 0.4630863 0.643302527

Biological Covariates:

There are three: 1.) mass, 2.) number of eggs laid, 3.) whether eggs were laid or not.

####### (Strong) Effect of mass
summary(glm(flew_b~mass_c, data=data_all, family=binomial))$coefficients
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)   0.207567 0.08755761  2.370633 1.775765e-02
## mass_c      -33.709714 4.24949041 -7.932649 2.145209e-15
####### Effect of number of eggs laid
summary(glm(flew_b~total_eggs, data=data_all, family=binomial))$coefficients
##                Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  0.05800016 0.242715366  0.2389637 8.111337e-01
## total_eggs  -0.01174514 0.002733288 -4.2970756 1.730662e-05
####### Effect of whether eggs were laid or not
summary(glm(flew_b~eggs_b, data=data_all, family=binomial))$coefficients
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  0.7154767 0.09801114  7.299953 2.878690e-13
## eggs_b      -2.3081075 0.24443287 -9.442705 3.632775e-21

Morphology Covariates:

There are five: 1.) beak length, 2.) thorax length, 3.) body size, 4.) wing length, and 5.) wing morph.

####### Effect of beak length
summary(glm(flew_b~beak_c, data=data_all, family=binomial))$coefficients
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  0.2301153 0.08299943  2.772492 5.562883e-03
## beak_c      -0.4081347 0.08355515 -4.884614 1.036317e-06
####### Effect of thorax length
summary(glm(flew_b~thorax_c, data=data_all, family=binomial))$coefficients
##              Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  0.231718 0.08263267  2.804194 5.044251e-03
## thorax_c    -1.192777 0.27662201 -4.311938 1.618296e-05
####### Effect of body length
summary(glm(flew_b~body_c, data=data_all, family=binomial))$coefficients
##               Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)  0.2289800 0.08191673  2.795278 0.005185512
## body_c      -0.2311964 0.07777222 -2.972738 0.002951563
####### No effect of wing length
summary(glm(flew_b~wing_c, data=data_all, family=binomial))$coefficients
##               Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)  0.2268357 0.08144024  2.785302 0.005347782
## wing_c      -0.1433543 0.09977016 -1.436845 0.150762019
####### No effect of wing morph (check how wing morph was annotated) - don't include it
summary(glm(flew_b~w_morph_c, data=data_all, family=binomial)) # but close p val = 0.0512
## 
## Call:
## glm(formula = flew_b ~ w_morph_c, family = binomial, data = data_all)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.302  -1.302   1.058   1.058   2.246  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0754     0.4108  -2.618 0.008843 ** 
## w_morph_c     1.3637     0.4121   3.309 0.000936 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 843.64  on 613  degrees of freedom
## Residual deviance: 823.69  on 612  degrees of freedom
## AIC: 827.69
## 
## Number of Fisher Scoring iterations: 4

3.) Data Visualizations (Plotting lms and glms)

m1 <- glm(flew_b~beak_c, data=data_all, family=binomial)
summary(m1)
## 
## Call:
## glm(formula = flew_b ~ beak_c, family = binomial, data = data_all)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5815  -1.2435   0.9173   1.0198   1.6964  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.23012    0.08300   2.772  0.00556 ** 
## beak_c      -0.40813    0.08356  -4.885 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 842.02  on 612  degrees of freedom
## Residual deviance: 816.99  on 611  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 820.99
## 
## Number of Fisher Scoring iterations: 4
plot(data_all$beak_c, data_all$flew_b, xlab="Beak Length (mm)", ylab="Yes-no flew")
abline(m1, col="blue") # abline only works for single-variate models; for a binomial glm it draws the logit-scale line, not the fitted probabilities
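To overlay the model’s fitted probabilities rather than that raw line, one option (a sketch reusing m1 and data_all from above) is:

beak_grid <- seq(min(data_all$beak_c, na.rm = TRUE),
                 max(data_all$beak_c, na.rm = TRUE), length.out = 100)
p_flew <- predict(m1, newdata = data.frame(beak_c = beak_grid), type = "response")
lines(beak_grid, p_flew, col = "red")  # predicted probability of flying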

Figure 1. As beak length increases, the bug is less likely to fly. But is this due to beak length itself? Females have longer beaks than males… so is the real cause a general difference between the sexes?
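One way to start probing that question (the column name sex is an assumption here, and adding a second predictor moves beyond today’s single-variate models):

# summary(glm(flew_b ~ beak_c + sex, data = data_all, family = binomial))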

m2 <- lm(mass ~ sym_dist, data=data_all)
summary(m2)
## 
## Call:
## lm(formula = mass ~ sym_dist, data = data_all)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.031013 -0.016330 -0.008832  0.010396  0.124783 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0550152  0.0012169  45.211  < 2e-16 ***
## sym_dist    -0.0030184  0.0009394  -3.213  0.00138 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02433 on 609 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.01667,    Adjusted R-squared:  0.01506 
## F-statistic: 10.32 on 1 and 609 DF,  p-value: 0.001382
plot(data_all$sym_dist, data_all$mass, xlab="Distance from Sympatric Zone", ylab="Mass (g)")
abline(m2, col="blue")

Figure 2. The farther from the sympatric zone a bug grows up, the lower its mass.