https://github.com/rjmaitri/Midterm.git

library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(plotly)
library(profileModel)
library(brms)
library(bayesplot)
library(reactable)
library(tidybayes)
library(rstanarm)
library(modelr)
library(loo)
library(AICcmodavg)
library(htmlwidgets)
options(mc.cores = parallel::detectCores())

1) Sampling your system (10 points)

Each of you has a study system you work in and a question of interest.

Tumor evolution and heterogeneity create drug resistance; understanding the diversity that a tumor undergoes has great clinical significance. High-throughput sequencing provides data on transcript abundance and diversity, allowing us to interrogate the transcriptomic effects of chemotherapeutic drug treatments on breast cancer cell (BCC) lines. Statistical analysis of these transcriptomic changes is a powerful tool for discovering drug-resistant clones, a major obstacle in personalized cancer therapy.

Random sampling from breast cancer tissue across the global population would be the ideal sampling method; however, two-thirds of all in vitro breast cancer research uses three BCC lines, with the oldest line (MCF-7) dating back to 1970. MCF-7 retains many features of the tumor it originated from, making it useful in cancer research. The validity of statistical relationships is enhanced by using technical replicates, meaning the MCF-7 line is divided and allowed to grow into numerous samples.

This type of experiment is prone to errors stemming from improper experimental design. Running the experiments asynchronously, using different reagents, or applying inconsistent cell culture techniques can introduce batch and confounding effects into the results. The experiment therefore needs to be designed with consistent treatment in order to protect the validity of the statistical analysis.

Further, differences in transcriptome abundance and diversity across samples are confounded by the genomic instability of cancer. This raises the question: is the differential expression due to treatment or to genomic instability? These confounders need to be minimized to lend credibility to the statistical analysis of the chemotherapeutic treatments.

The negative binomial distribution is suited to these data because they are counts whose variance shifts away from the mean at higher expression levels. The variance behaves this way because experiments typically have 3-5 samples with a few highly expressed genes and many lowly expressed genes. Theoretically, a Poisson distribution could work with a much larger sample size, but that is financially burdensome.
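A quick simulated sketch (hypothetical parameters, not real RNA-seq data) illustrates the overdispersion that motivates the negative binomial choice:

#simulate counts: negative binomial variance far exceeds the mean
set.seed(42)
mu <- 100                                      #hypothetical mean expression
pois_counts <- rpois(1e4, lambda = mu)         #Poisson: variance equals the mean
nb_counts <- rnbinom(1e4, mu = mu, size = 2)   #NB: variance = mu + mu^2/size

var(pois_counts)  #~100, tracks the mean
var(nb_counts)    #~5100, overdispersed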

  1. Data Reshaping and Visualization Johns Hopkins has been maintaining one of the best COVID-19 time series data sets out there. The data on the US can be found here with information about what is in the data at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data

2a) Access (5 points) Download and read in the data. Can you do this without downloading, but read directly from the archive (+1).

#read raw github from internet

Covid_JHU <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")

#look at the file
head(Covid_JHU)
## # A tibble: 6 x 331
##      UID iso2  iso3  code3  FIPS Admin2 Province_State
##    <dbl> <chr> <chr> <dbl> <dbl> <chr>  <chr>         
## 1 8.40e7 US    USA     840  1001 Autau~ Alabama       
## 2 8.40e7 US    USA     840  1003 Baldw~ Alabama       
## 3 8.40e7 US    USA     840  1005 Barbo~ Alabama       
## 4 8.40e7 US    USA     840  1007 Bibb   Alabama       
## 5 8.40e7 US    USA     840  1009 Blount Alabama       
## 6 8.40e7 US    USA     840  1011 Bullo~ Alabama       
## # ... with 324 more variables: Country_Region <chr>,
## #   Lat <dbl>, Long_ <dbl>, Combined_Key <chr>,
## #   `1/22/20` <dbl>, `1/23/20` <dbl>, `1/24/20` <dbl>,
## #   `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
## #   `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>,
## #   `1/31/20` <dbl>, `2/1/20` <dbl>, `2/2/20` <dbl>,
## #   `2/3/20` <dbl>, `2/4/20` <dbl>, `2/5/20` <dbl>,
## #   `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>,
## #   `2/9/20` <dbl>, `2/10/20` <dbl>, `2/11/20` <dbl>,
## #   `2/12/20` <dbl>, `2/13/20` <dbl>, `2/14/20` <dbl>,
## #   `2/15/20` <dbl>, `2/16/20` <dbl>, `2/17/20` <dbl>,
## #   `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>,
## #   `2/21/20` <dbl>, `2/22/20` <dbl>, `2/23/20` <dbl>,
## #   `2/24/20` <dbl>, `2/25/20` <dbl>, `2/26/20` <dbl>,
## #   `2/27/20` <dbl>, `2/28/20` <dbl>, `2/29/20` <dbl>,
## #   `3/1/20` <dbl>, `3/2/20` <dbl>, `3/3/20` <dbl>,
## #   `3/4/20` <dbl>, `3/5/20` <dbl>, `3/6/20` <dbl>,
## #   `3/7/20` <dbl>, `3/8/20` <dbl>, `3/9/20` <dbl>,
## #   `3/10/20` <dbl>, `3/11/20` <dbl>, `3/12/20` <dbl>,
## #   `3/13/20` <dbl>, `3/14/20` <dbl>, `3/15/20` <dbl>,
## #   `3/16/20` <dbl>, `3/17/20` <dbl>, `3/18/20` <dbl>,
## #   `3/19/20` <dbl>, `3/20/20` <dbl>, `3/21/20` <dbl>,
## #   `3/22/20` <dbl>, `3/23/20` <dbl>, `3/24/20` <dbl>,
## #   `3/25/20` <dbl>, `3/26/20` <dbl>, `3/27/20` <dbl>,
## #   `3/28/20` <dbl>, `3/29/20` <dbl>, `3/30/20` <dbl>,
## #   `3/31/20` <dbl>, `4/1/20` <dbl>, `4/2/20` <dbl>,
## #   `4/3/20` <dbl>, `4/4/20` <dbl>, `4/5/20` <dbl>,
## #   `4/6/20` <dbl>, `4/7/20` <dbl>, `4/8/20` <dbl>,
## #   `4/9/20` <dbl>, `4/10/20` <dbl>, `4/11/20` <dbl>,
## #   `4/12/20` <dbl>, `4/13/20` <dbl>, `4/14/20` <dbl>,
## #   `4/15/20` <dbl>, `4/16/20` <dbl>, `4/17/20` <dbl>,
## #   `4/18/20` <dbl>, `4/19/20` <dbl>, `4/20/20` <dbl>,
## #   `4/21/20` <dbl>, `4/22/20` <dbl>, `4/23/20` <dbl>,
## #   `4/24/20` <dbl>, `4/25/20` <dbl>, `4/26/20` <dbl>, ...

2b) It’s big and wide! (10 Points) The data is, well, huge. It’s also wide, with dates as columns. Write a function that, given a state, will output a time series (long data where every row is a day) of cumulative cases in that state as well as new daily cases.

#write a function/figure out way to update by day

state_function <- function(x) {

#select the state column and the date columns (dates run from column 12 onward)
clean_covid <- Covid_JHU %>%
  select(state = Province_State, 12:last_col())

#pivot long: one row per county per day
Covid_long <- pivot_longer(clean_covid,
                           cols = !state,
                           names_to = "Date",
                           values_to = "Cases")

#lubridate the m/d/y column names into Date objects
Covid_dates <- Covid_long %>% 
  mutate(Date = mdy(Date))

#group by date and state, then sum county counts to state totals
tidying_up <- Covid_dates %>%
      group_by(Date, state) %>%
      summarise(Cases = sum(Cases), .groups = "drop")

#filter to the requested state and add new daily cases from the cumulative counts
tidy_covid <- tidying_up %>%
  filter(state == x) %>%
  arrange(Date) %>%
  mutate(New_Cases = Cases - lag(Cases, default = 0))

return(tidy_covid)

}

Mass_data <- state_function("Massachusetts")

reactable(Mass_data)

2c) Let’s get visual! (10 Points) Great! Make a compelling plot of the time series for Massachusetts! Points for style, class, ease of understanding major trends, etc. Note, 10/10 only for the most killer figures. Don’t phone it in! Also, note what the data from JHU is. Do you want the cumulatives, or daily, or what?

The JHU series is a cumulative count of confirmed cases, so the plot below shows cumulative cases over time.

p <- Mass_data %>%
  ggplot(aes(x = Date, y = Cases)) +
    geom_area(fill = "#69b3a2", alpha = 0.5) +
    geom_line(color = "#69b3a2") +
    ylab("Cumulative COVID-19 cases") +
    theme_dark()

# Turn it interactive with ggplotly
p <- ggplotly(p)
p

2d) At our fingertips (10 Points) Cool. Now, write a function that will take what you did above, and create a plot for any state - so, I enter Alaska and I get the plot for Alaska! +2 if it can do daily or cumulative cases - or cases per 100,000 if you did that above. +3 EC if you highlight points of interest - but dynamically using the data. Note, you might need to do some funky stuff to make things fit well in the plot for this one. Or, meh.

#modularize state data and plot
state_plot <- function(x){
  
  #state function
  data <- state_function(x)
  
#create a ggplot object with state data, titled dynamically by state
  p <- data %>%
    ggplot(aes(x = Date, y = Cases)) +
    geom_area(fill = "#69b3a2", alpha = 0.5) +
    geom_line(color = "#69b3a2") +
    labs(y = "Cumulative COVID-19 cases",
         title = paste("Cumulative COVID-19 cases in", x)) +
    theme_dark()

# Turn it interactive with ggplotly
out <- ggplotly(p)

  return(out)
}

state_plot("Alabama")
  1. Let’s get philosophical. (10 points) We have discussed multiple inferential frameworks this semester. Frequentist NHST, Likelihood and model comparison, Bayesian probabilistic thinking, Assessment of Predictive Ability (which spans frameworks!), and more. We’ve talked about Popper and Lakatos. Put these pieces of the puzzle together and look deep within yourself.

What do you feel is the inferential framework that you adopt as a scientist? Why? Include in your answer why you prefer the inferential tools (e.g. confidence intervals, test statistics, out-of-sample prediction, posterior probabilities, etc.) of your chosen worldview and why you do not like the ones of the other one. This includes defining just what those different tools mean, as well as relating them to the things you study. extra credit for citing and discussing outside sources - one point per source/point

I tend to reach conclusions by observing patterns, so I am inclined toward Bayesian probabilistic thinking and likelihood: p(H|D). For example, I am often aware of and influenced by prior knowledge, which is analogous to a conditional probability. That said, I can be misled by conditional probabilities when the math is not shown explicitly. For example, I don’t naturally perform the mental math needed to determine the posterior probability of having a disease given a positive test result. The statistics behind that posterior are easily misinterpreted piecemeal: a low false-positive rate and high test sensitivity give the impression that P(have the disease | positive test result) is high, but the base rate of the disease has to be taken into account. Once prevalence is factored in, the false positives arising from the large healthy population can overwhelm the posterior and contradict my initial appraisal.
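A concrete sketch of that calculation, with hypothetical numbers (1% prevalence, 99% sensitivity, 95% specificity; none of these come from a real test):

#hypothetical illustration of P(disease | positive test) via Bayes theorem
prevalence  <- 0.01   #prior: P(disease)
sensitivity <- 0.99   #P(positive | disease)
specificity <- 0.95   #P(negative | no disease)

#law of total probability for a positive result
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

#posterior: despite the accurate test, most positives are false positives
posterior <- (sensitivity * prevalence) / p_positive
posterior  #~0.167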

As an undergraduate scientist, frequentist Null Hypothesis Significance Testing has been the most common inferential framework in my studies. For instance, hypotheses in Genetics (BIOL254) are tested by calculating probabilities of allele inheritance, comparing observed to expected frequencies with the chi-square distribution at the appropriate degrees of freedom. The resulting probability is then judged against a significance threshold: p-values below the threshold are considered statistically significant, and the null hypothesis is rejected. These tools are useful for linear relationships with normally distributed error-generating processes; more complex models with non-normal error distributions require a different set of tools.
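A minimal sketch of that workflow, with made-up counts for a monohybrid cross (hypothetical data, not from BIOL254):

#hypothetical F2 counts for a cross expected to segregate 3:1
observed <- c(dominant = 290, recessive = 110)

#chi-square goodness-of-fit test against the 3:1 expectation (df = 1)
chisq.test(observed, p = c(3/4, 1/4))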

Although I am more experienced with NHST, I am growing a preference for Bayesian probabilistic thinking. NHST is too binary in its conclusions: an association is judged true or false depending on alpha (the significance level). A frequentist confidence interval only tells us how often intervals constructed the same way would contain the true parameter under repeated sampling. A Bayesian credible interval tells us the probability that the true parameter is contained within a given interval.

Model comparison and assessment of predictive ability are powerful tools for judging how models fit the data. Out-of-sample prediction, k-fold cross-validation, and leave-one-out cross-validation show the strength of model fits and give insight into which terms fit the data and which overfit it.
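A minimal k-fold sketch on a built-in data set (assuming purrr, which is not among the libraries loaded above; modelr is), comparing out-of-sample RMSE for two candidate models:

library(purrr)

#5-fold cross-validation: each fold holds out ~20% of rows as a test set
cv <- modelr::crossv_kfold(mtcars, k = 5)

rmse_linear <- map2_dbl(map(cv$train, ~ lm(mpg ~ wt, data = .x)),
                        cv$test, modelr::rmse)
rmse_quad   <- map2_dbl(map(cv$train, ~ lm(mpg ~ poly(wt, 2), data = .x)),
                        cv$test, modelr::rmse)

mean(rmse_linear)
mean(rmse_quad)   #the lower mean RMSE suggests better predictive ability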

  1. Bayes Theorem (10 points) I’ve referenced the following figure a few times. I’d like you to demonstrate your understanding of Bayes Theorem by hand (e.g. calculate it out and show your work - you can do this all in R, I’m not a monster) showing what is the probability of the sun exploding is given that the device said yes. Assume that your prior probability that the sun explodes is p(Sun Explodes) = 0.0001 (I’ll leave it to you to get p(Sun Doesn’t Explode). The rest of the information you need - and some you don’t - is in the cartoon - p(Yes | Explodes), p(Yes | Doesn’t Explode), p(No | Explodes), p(No | Doesn’t Explode).
#Calculate the Posterior

#Calculate the likelihood ####
#P(Yes|Explodes)
Likelihood = 35/36

#Prior ~ p(Explodes) ####
Prior = .0001

#calculate the denominator - law of total probability for the machine saying yes ####
# p(Yes|Explodes)p(Explodes) + p(Yes|Doesn't Explode)p(Doesn't Explode)

Marginal_Likelihood = (35/36)*.0001 + (1/36)*(1-.0001)


############### CALCULATE THE POSTERIOR
#P(Explodes|Yes) = P(Yes|Explodes)p(Explodes)/p(Yes)

Posterior = (Likelihood*Prior)/Marginal_Likelihood

The probability that the sun exploded given that the machine said yes is approximately 0.0035, or about 1 in 287.

4a Extra Credit (10 Points) Why is this a bad parody of frequentist statistics?

The stick-figure scientist never considered the prior probability of the sun exploding. The p-value of 1/36 is just the machine’s false-positive rate (rolling two sixes), which is independent of whether the sun exploded. The parody is bad because it rejects the null hypothesis that the sun did not explode using only the probability of obtaining two sixes: while it is unlikely that the machine lied, it is far more unlikely that the sun went nova.

  1. Quailing at the Prospect of Linear Models I’d like us to walk through the three different ‘engines’ that we have learned about to fit linear models. To motivate this, we’ll look at Burness et al.’s 2012 study "Post-hatch heat warms adult beaks: irreversible physiological plasticity in Japanese quail" (http://rspb.royalsocietypublishing.org/content/280/1767/20131436.short), the data for which they have made available at Data Dryad at http://datadryad.org/resource/doi:10.5061/dryad.gs661. We’ll be looking at the morphology data.
#load quail morphology data (NAs are handled downstream, where needed)
quail_data <- read.csv("data/Morphology data.csv")

#structure of data frame
str(quail_data)
## 'data.frame':    880 obs. of  10 variables:
##  $ Bird..               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sex                  : chr  "" "m" "f" "m" ...
##  $ Age..days.           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Exp..Temp...degree.C.: int  15 15 15 15 15 15 15 15 15 15 ...
##  $ Mass..g.             : num  16.1 19.2 17.5 14.4 17.4 ...
##  $ Tarsus..mm.          : num  19.4 20.4 19 20.1 21.8 ...
##  $ Culmen..mm.          : num  7.64 7.49 7.31 7.34 8.24 6.82 7.84 7.39 7.4 7.81 ...
##  $ Depth..mm.           : num  4.23 4.46 3.92 3.85 4.42 3.65 3.94 3.72 4.5 4.09 ...
##  $ Width..mm.           : num  4.49 4.44 4.01 4.22 4.56 3.73 4.6 4.66 3.83 3.89 ...
##  $ NOTES                : chr  "" "" "" "" ...
skimr::skim(quail_data)

Data summary
  Name                    quail_data
  Number of rows          880
  Number of columns       10
  Column type frequency:  character 2, numeric 8
  Group variables         None

Variable type: character

  skim_variable n_missing complete_rate min max empty n_unique whitespace
  Sex                   0             1   0   1    88        3          0
  NOTES                 0             1   0  35   872        8          0

Variable type: numeric

  skim_variable         n_missing complete_rate   mean    sd    p0   p25    p50    p75   p100 hist
  Bird..                        0          1.00  20.50 11.55  1.00 10.75  20.50  30.25  40.00 ▇▇▇▇▇
  Age..days.                    0          1.00  37.00 31.15  5.00 15.00  26.00  49.00 123.00 ▇▃▁▁▁
  Exp..Temp...degree.C.         0          1.00  22.50  7.50 15.00 15.00  22.50  30.00  30.00 ▇▁▁▁▇
  Mass..g.                     76          0.91 138.75 93.09 12.51 50.07 127.58 219.59 375.00 ▇▅▅▃▁
  Tarsus..mm.                 112          0.87  31.73  6.74 16.19 26.19  33.81  37.17  49.42 ▃▆▇▇▁
  Culmen..mm.                 114          0.87  11.73  2.80  4.23  9.31  11.98  14.03  18.14 ▁▆▆▇▂
  Depth..mm.                  112          0.87   5.59  0.80  3.30  5.06   5.60   6.11   9.52 ▂▇▆▁▁
  Width..mm.                  112          0.87   4.82  0.53  3.29  4.45   4.84   5.17   6.92 ▁▆▇▂▁

5a) Three fits (10 points) To begin with, I’d like you to fit the relationship that describes how Tarsus (leg) length predicts upper beak (Culmen) length. Fit this relationship using least squares, likelihood, and Bayesian techniques. For each fit, demonstrate that the necessary assumptions have been met. Note, functions used to fit with likelihood and Bayes may or may not behave well when fed NAs. So look out for those errors.

#generate a plot of the quail data
quail_plot <- ggplot(data = quail_data,
       mapping = aes(x = Tarsus..mm., y = Culmen..mm.)) +
  geom_point(alpha = 0.5)+
  theme_classic()

#plot with line
quail_plot + 
  stat_smooth(method=lm, formula=y~x)

The variance of the residuals appears to violate homoscedasticity, as the spread increases with the mean. This suggests a GLM may be better suited, since it can reweight observations with larger variances.

#fit the relationship of tarsus length predicting upper beak using least squares

LSquail <- lm(Culmen..mm. ~ Tarsus..mm., data = quail_data)

#simulate the data
LSquail_sims <- simulate(LSquail, nsim = 20) %>%
  pivot_longer(
    cols = everything(),
    names_to = "sim",
    values_to = "Culmen..mm."
  )
#plot distribution of simulated data over observed data
ggplot() +
  geom_density(data = LSquail_sims,
               mapping = aes(x = Culmen..mm., group = sim), 
               size = 0.2)  +
  geom_density(data = quail_data,
               mapping = aes(x = Culmen..mm.),
               size = 2, color = "blue")

The simulated data does not recapitulate our observed values.

#check the relationship of residual vs. fitted values
plot(LSquail, which =1)

The residuals vs. fitted plot clusters in two regions; ideally residuals form a single unstructured cloud, so this may be an artifact of how the observations are distributed. The pattern may suggest non-linearity and a need to restructure the model.
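As a numeric complement to the eyeball test (a sketch assuming the lmtest package, which is not loaded above):

#Breusch-Pagan test: a small p-value is evidence of heteroscedastic residuals
lmtest::bptest(LSquail)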

#check normality of residuals
residuals(LSquail) %>% hist()

Residuals appear approximately normally distributed.

plot(LSquail, which = 2)

The linearity of this Q-Q plot is suspect.

Likelihood Regression

#Plot GLM for quail data
ggplot(quail_data, 
       mapping = aes(x = Tarsus..mm., y = Culmen..mm.)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = gaussian(link="identity")))

#fit a likelihood model using the iteratively reweighted least squares algorithm in GLM
quail_mle <- glm(Culmen..mm. ~ Tarsus..mm.,
                 #identity link: the linear predictor maps directly onto the mean
                family = gaussian(link = "identity"),
                data = quail_data)
#extract predicted and residuals for plot
fitted <- predict(quail_mle)
res <- residuals(quail_mle)

qplot(fitted, res)

qqnorm(res)

The Q-Q plot of the GLM is essentially the same as the least-squares one, which is expected: a Gaussian GLM with an identity link fits the same model as lm(), just via a different algorithm.
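A one-line check of that equivalence:

#the least-squares and Gaussian-identity GLM coefficients should match
all.equal(coef(LSquail), coef(quail_mle))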

hist(res)

Residuals have a normal distribution.

Bayes Regression

options(mc.cores = parallel::detectCores())
set.seed(600)
color_scheme_set("orange")
quail_brm <- brm(Culmen..mm. ~ Tarsus..mm.,
                  data = quail_data,
                  family = gaussian(link = "identity"))

plot(quail_brm)

The parameters are well behaved and the chains show convergence.
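Convergence can also be checked numerically, a quick sketch using brms’ accessors:

#Rhat near 1 for every parameter indicates converged chains
rhat(quail_brm)

#effective sample size ratios; values near 1 indicate efficient sampling
neff_ratio(quail_brm)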

#does our data match the chains for y-distributions?
color_scheme_set("green")
pp_check(quail_brm, "dens_overlay")+
  theme_light()

The data contains trends that are not well-reflected in the predicted values.

#fitted vs. residual, do we meet linearity assumption?
quail_fit <- fitted(quail_brm) %>% as.data.frame()
quail_res <- residuals(quail_brm) %>% as.data.frame()

plot(quail_fit$Estimate, quail_res$Estimate)

The residuals-vs-fitted plot again clusters in two regions, suggesting this model violates the assumption of linearity.

qqnorm(quail_res$Estimate)
qqline(quail_res$Estimate)

5b) Three interpretations (10 points) OK, now that we have fits, take a look! Do the coefficients and their associated measures of error in their estimation match? How would we interpret the results from these different analyses differently? Or would we? Note, confint works on lm objects as well.

#LM coef ####

LSquail <- lm(Culmen..mm. ~ Tarsus..mm., data = quail_data)

coef(LSquail)
## (Intercept) Tarsus..mm. 
## -0.09870712  0.37292677
summary(LSquail)
## 
## Call:
## lm(formula = Culmen..mm. ~ Tarsus..mm., data = quail_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4081 -0.7029 -0.0328  0.7263  3.5970 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.098707   0.215450  -0.458    0.647    
## Tarsus..mm.  0.372927   0.006646  56.116   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.238 on 764 degrees of freedom
##   (114 observations deleted due to missingness)
## Multiple R-squared:  0.8048, Adjusted R-squared:  0.8045 
## F-statistic:  3149 on 1 and 764 DF,  p-value: < 2.2e-16
#Least squares confidence intervals
confint(LSquail)
##                  2.5 %    97.5 %
## (Intercept) -0.5216505 0.3242363
## Tarsus..mm.  0.3598809 0.3859727
#Likelihood ~ GLM ####
coef(quail_mle)
## (Intercept) Tarsus..mm. 
## -0.09870712  0.37292677
confint(quail_mle)
##                  2.5 %    97.5 %
## (Intercept) -0.5209805 0.3235663
## Tarsus..mm.  0.3599015 0.3859520
#Bayes Regression ####
fixef(quail_brm)
##               Estimate   Est.Error       Q2.5     Q97.5
## Intercept   -0.1006079 0.218081111 -0.5226992 0.3349920
## Tarsus..mm.  0.3730012 0.006711794  0.3595569 0.3860978

The slope for the predictor (Tarsus) and the intercept are approximately the same across the three fits. We can interpret these results as mean culmen length increasing by ~0.37 mm for each 1 mm increase in tarsus length. The (biologically arbitrary) tarsus length of 0 mm is associated with a mean culmen length of approximately -0.10 mm.
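As a quick worked example of that interpretation (using the least-squares fit; any of the three would give nearly the same answer):

#predicted culmen length (mm) for a hypothetical 30 mm tarsus
predict(LSquail, newdata = data.frame(Tarsus..mm. = 30))
#-0.0987 + 0.3729 * 30 = ~11.09 mm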

The least squares and likelihood models have confidence intervals, which are numerically almost identical to the Bayesian credible intervals here, but they mean different things: the frequentist intervals describe how often intervals built this way would capture the true parameter under repeated sampling, while the Bayesian interval gives the probability that the parameter lies within it, given the data and prior.

5c) Everyday I’m Profilin’ (10 points) For your likelihood fit, are your profiles well behaved? For just the slope, use grid sampling to create a profile. You’ll need to write functions for this, sampling the whole grid of slope and intercept, and then take out the relevant slices as we have done before. Use the results from the fit above to provide the reasonable bounds of what you should be profiling over (3SE should do). Is it well behaved? Plot the profile and give the 80% and 95% CI (remember how we use the chisq here!). Verify your results with profileModel.

clean_quails <- na.omit(quail_data)
############likelihood with grid sampling #################################
norm_likelihood <- function(obs, B0, B1, sd){
  
  #data generating process: predicted culmen length from the linear model
  est <- clean_quails$Tarsus..mm.  * B1 + B0
  
  #sum of log likelihoods of the observations under Normal(est, sd)
  sum(dnorm(obs, mean = est, sd = sd, log = TRUE))
  }


#grid sample for likelihood 
quail_dist <- crossing(Slope = seq(0.33, 0.4, by = 0.01), 
                      Int =seq(-0.09, 0.10, by = 0.01),
                      stdd = seq(1.1,1.4, by = 0.01)) %>%
  rowwise() %>%
  mutate(log_lik = norm_likelihood(obs = clean_quails$Culmen..mm., B0 = Int, B1 = Slope, sd = stdd)) %>% 
  ungroup()

#extract the MLE
quail_MLE <- quail_dist %>%
  filter(log_lik == max(log_lik))
#profile for the intercept: for each intercept value, maximize over slope and sd
quail_mean_profile <- quail_dist %>%
  #group by Int
  group_by(Int) %>%
  #keep the conditional MLE within each intercept slice
  filter(log_lik == max(log_lik)) %>%
  ungroup()


#Check to see if we have a profile
qplot(Int, log_lik, data=quail_mean_profile, geom = "line")

The profile of the intercept is smooth and unimodal, so it is well behaved.

#obtain the 95% profile CI
ninetyfive_CI <- quail_mean_profile %>%
  filter(log_lik > max(log_lik) - qchisq(0.95, df = 1)/2) %>%
    filter(row_number()==1 | row_number()==n())

ninetyfive_CI
## # A tibble: 2 x 4
##   Slope   Int  stdd log_lik
##   <dbl> <dbl> <dbl>   <dbl>
## 1  0.37 -0.09  1.24  -1252.
## 2  0.37  0.08  1.24  -1252.
#Obtain 80% CI
eighty_CI <- quail_mean_profile %>%
  filter(log_lik > max(log_lik) - qchisq(0.80, df = 1)/2) %>%
    filter(row_number()==1 | row_number()==n())

eighty_CI
## # A tibble: 2 x 4
##   Slope   Int  stdd log_lik
##   <dbl> <dbl> <dbl>   <dbl>
## 1  0.37 -0.06  1.24  -1251.
## 2  0.37  0.05  1.24  -1251.
#verify 95% CI results
prof <- profileModel(quail_mle,
                     objective = "ordinaryDeviance",
                     quantile = qchisq(0.95,1))
## Preliminary iteration .. Done
## 
## Profiling for parameter (Intercept) ... Done
## Profiling for parameter Tarsus..mm. ... Done
plot(prof)

#verify 80% CI results
prof80 <- profileModel(quail_mle,
                     objective = "ordinaryDeviance",
                     quantile = qchisq(0.80,1))
## Preliminary iteration .. Done
## 
## Profiling for parameter (Intercept) ... Done
## Profiling for parameter Tarsus..mm. ... Done
plot(prof80)

The profileModel confidence intervals agree with those from the grid-sampling profile.

5d) The Power of the Prior (10 points) This data set is pretty big. After excluding NAs in the variables we’re interested in, it’s over 766 lines of data! Now, a lot of data can overwhelm a strong prior. But only to a point. Show first that there is enough data here that a prior for the slope with an estimate of 0.7 and a sd of 0.01 is overwhelmed and produces similar results to the default prior. How different are the results from the original?

#construct a BRM with adjusted prior slope and sd
adj_brm_prior <- brm(Culmen..mm. ~ Tarsus..mm.,
                         data = quail_data,
                         family=gaussian(),
                   prior = (prior(normal(0.7, 0.01), class = b)))
adj_brm_prior
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Culmen..mm. ~ Tarsus..mm. 
##    Data: quail_data (Number of observations: 766) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat
## Intercept      -4.24      0.27    -4.78    -3.71 1.00
## Tarsus..mm.     0.50      0.01     0.49     0.52 1.00
##             Bulk_ESS Tail_ESS
## Intercept       2312     2750
## Tarsus..mm.     2362     2737
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sigma     1.52      0.05     1.43     1.63 1.00     2273
##       Tail_ESS
## sigma     2403
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
#Prepare coefficients and credible intervals from the BRMs for graphing

#adjusted prior coefficient and credible interval
adjPriorCoef <- fixef(adj_brm_prior)
adjCoef_wide <- spread_draws(adj_brm_prior,
                             b_Intercept,
                             b_Tarsus..mm.)

#default prior coefficient and credible interval
quail_coefs <- fixef(quail_brm)
quail_coefs_wide <- spread_draws(quail_brm,
                                b_Intercept,
                                b_Tarsus..mm.)
#tarsus v. culmen plot
quail_plot +  
#credible interval of default prior
  geom_abline(data = quail_coefs_wide,
              aes(slope = b_Tarsus..mm.,
                  intercept = b_Intercept),
              color = "blue",
              alpha = 0.06) +
  #median fit
  geom_abline(slope = quail_coefs[2,1],
              intercept = quail_coefs[1,1],
              color = "red",
              size = 1)+
# credible interval of adjusted prior   
  geom_abline(data = adjCoef_wide,
              aes(slope = b_Tarsus..mm.,
                  intercept = b_Intercept),
              color = "orange",
              alpha = 0.08) +
  #median fit of adjusted prior model
  geom_abline(slope = adjPriorCoef[2,1],
              intercept = adjPriorCoef[1,1],
              color = "green",
              size = 1)

Even with all 766 complete rows, the strong prior is not fully overwhelmed: the posterior slope under the normal(0.7, 0.01) prior is 0.50, sitting between the prior mean (0.7) and the default-prior estimate (0.37). So a prior this tight still produces results noticeably different from the original fit.

Second, randomly sample 10, 100, 300, and 500 data points. At which level is our prior overwhelmed (e.g., the prior slope becomes highly unlikely)? Communicate that visually in the best way you feel gets the point across, and explain your reasoning.

#draw random samples from quail data


Functionally <- function(x) {

#drop NAs first so the requested sample size is exact
samples <- quail_data %>%
  na.omit() %>%
  slice_sample(n = x)

#input into BRMS

Brm_object <- brm(Culmen..mm. ~ Tarsus..mm.,
                  data = samples,
                  family = gaussian(link = "identity"),
                  chains = 2,
                  iter = 500)

#trace plots for a quick convergence check
plot(Brm_object)

return(Brm_object)

}

tensamps <- Functionally(10)

hundredsamps <- Functionally(100)

threehunsamps <- Functionally(300)

fivehunsamps <- Functionally(500)

#ten sample coefs and draws
tensampCoef <- fixef(tensamps)
tensampcoefs_wide <- spread_draws(tensamps,
                                b_Intercept,
                                b_Tarsus..mm.)


#100 sample coefs and draws
hundredsampCoef <- fixef(hundredsamps)
hundredcoefs_wide <- spread_draws(hundredsamps,
                                b_Intercept,
                                b_Tarsus..mm.)

#500 sample coefs and draws
fivesampCoef <- fixef(fivehunsamps)
fivesampcoefs_wide <- spread_draws(fivehunsamps,
                                b_Intercept,
                                b_Tarsus..mm.)
###Plot the sample draws at each level to see how well they fit the data
quail_plot +  
#simulated draws from ten samps
  geom_abline(data = tensampcoefs_wide,
              aes(slope = b_Tarsus..mm.,
                  intercept = b_Intercept),
              color = "orange",
              alpha = 0.26) +
  #median fit of ten samples
  geom_abline(slope = tensampCoef[2,1],
              intercept = tensampCoef[1,1],
              color = "yellow",
              size = 1)+
  #simulated draws of 100 samples
  geom_abline(data = hundredcoefs_wide,
              aes(slope = b_Tarsus..mm.,
                  intercept = b_Intercept),
              color = "purple",
              alpha = 0.16) +
  
  #median fit of 100 samples
  geom_abline(slope = hundredsampCoef[2,1],
              intercept = hundredsampCoef[1,1],
              color = "white",
              size = 1)+
  
  
# credible interval of 500 samps  
  geom_abline(data = fivesampcoefs_wide,
              aes(slope = b_Tarsus..mm.,
                  intercept = b_Intercept),
              color = "blue",
              alpha = 0.08) +
  #median fit of 500 samps
  geom_abline(slope = fivesampCoef[2,1],
              intercept = fivesampCoef[1,1],
              color = "green",
              size = 1)

With only ten samples, the weakly informative default prior leaves the parameter estimates highly uncertain. Certainty grows as the posterior is updated with more data, and by around 300 samples the estimate interval begins to resemble that of the entire data set. Judging by the spread of the posterior draws, a model fit to 100 samples should already generalize fairly well to new data.

+4 for a function that means you don’t have to copy and paste the model over and over. + 4 more if you use map() in combination with a tibble to make this as code-efficient as possible. This will also make visualization easier.
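A sketch of that map()-plus-tibble approach (assuming purrr, which is not among the libraries loaded above), reusing the Functionally() wrapper so the model is specified only once:

library(purrr)

#one row per sample size; list columns hold the fitted models and their draws
sample_fits <- tibble(n = c(10, 100, 300, 500)) %>%
  mutate(fit = map(n, Functionally),
         draws = map(fit, ~ spread_draws(.x, b_Intercept, b_Tarsus..mm.)))

#unnest the posterior draws for a single faceted plot across sample sizes
all_draws <- sample_fits %>%
  select(n, draws) %>%
  unnest(draws)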

  1. Cross-Validation and Priors (15 points) There is some interesting curvature in the culmen-tarsus relationship. Is the relationship really linear? Squared? Cubic? Exponential? Use one of the cross-validation techniques we explored to show which model is more predictive. Justify your choice of technique. Do you get a clear answer? What does it say?

The culmen-tarsus relationship appears cubic.

Brm_object1 <- brm(Culmen..mm. ~ poly(Tarsus..mm.,3),
                  data = na.omit(quail_data),
                  family = gaussian(link = "identity"))

color_scheme_set("green")
pp_check(Brm_object1, "dens_overlay")+
  theme_light()

This is a much better fit than the linear model from before.

#widely applicable information criterion for culmen~tarsus brm
quail_waic <- waic(quail_brm)
quail_waic
## 
## Computed from 4000 by 766 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1252.8 21.9
## p_waic         3.1  0.3
## waic        2505.5 43.7
#widely applicable information criterion for cubic-polynomial culmen~tarsus brm
cubic <- waic(Brm_object1)
cubic
## 
## Computed from 4000 by 766 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1240.6 22.0
## p_waic         4.5  0.4
## waic        2481.2 44.0
#compare both models
compare_mod <- compare_ic(quail_waic, cubic)
compare_mod
##                            WAIC    SE
## quail_brm               2505.53 43.73
## Brm_object1             2481.15 44.03
## quail_brm - Brm_object1   24.38 10.53
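Note that compare_ic() is deprecated in more recent brms releases; a sketch of the equivalent comparison with loo_compare() (assuming a recent brms/loo):

#loo_compare lists the best model first and reports elpd differences
loo_compare(quail_waic, cubic)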
####Enter {AICcmodavg}! #####

LSquail <- lm(Culmen..mm. ~ Tarsus..mm., data = quail_data)

LScubicQuail <- lm(Culmen..mm. ~ poly(Tarsus..mm.,3), data = na.omit(quail_data))

aictab(list(LSquail, LScubicQuail),
       c("QuailLM", "CubicQuailLM"))
## 
## Model selection based on AICc:
## 
##              K    AICc Delta_AICc AICcWt Cum.Wt       LL
## CubicQuailLM 5 2481.70       0.00      1      1 -1235.81
## QuailLM      3 2505.39      23.69      0      1 -1249.68

The cubic fit gives the better score, and the margin is meaningful: the WAIC difference of 24.4 is more than twice its standard error (10.5), and essentially all of the AICc weight falls on the cubic model. Combined with how much better the cubic curve tracks the curvature in the graph, this indicates the cubic model has the better predictive ability.