To begin,

Filter data set before all analysis, i.e. this filtered dataset enters the analysis with filter parameters set to user prefs.

Acronyms:

SR: supporting reads, i.e. reads with (A/G switch)
DP/depth: total reads mapped to location
Edit Level SR/DP, estimated proportion of edited transcripts


SR and DP Filtration

## [1] "Starting number of data points: 197439"
##   DP_level SR_level SR_top editLevel_mean editLevel_sd   AH   NJ    P   NP
## 1       80        5    500          0.216        0.206 1257 1433 1422 1268
##   Total_survive AH_NJ_rat P_NP_rat AHP AHNP NJP NJNP
## 1          2690     0.877    1.121 608  649 814  619

Exploratory Plots


Standardize all variables (Ind. and dep.), i.e. Z-scores, subtract mean from each value and then divide these new values by the standard deviation of the var.

Super long tail but only a few observations make up the tail

After removing outliers

Still has long tail but ~5 sd, started 20 sd

SR minimum filter: 5 Depth: 80 SR top filter: 500

Filtered for at least 2

    #file.tmp$Location = as.integer(ffile.tmp$Location)
    
    file.tmp %>%
      filter(N_Fish >= N_fish_filt) -> tmp
      

    tmp[,c("ChromLoc", "Pop", "Env", "Family", "SupReads", "DP")] -> tmp
    
    #tmp$SupReads = scale(tmp$SupReads, center = TRUE, scale = TRUE) 
    #tmp$DP = scale(tmp$DP, center = T, scale = T)
    
    tmp$Family = as.factor(tmp$Family)
    
    #m1 <- glmmTMB(SupReads ~ Pop*Env + Family + offset(log(DP)), data = tmp, family = nbinom1())
    
    m2 <- glm.nb(SupReads ~ Pop*Env + offset(log(DP)), data = tmp)
    
    rootogram(m2)

    summary(m2)
## 
## Call:
## glm.nb(formula = SupReads ~ Pop * Env + offset(log(DP)), data = tmp, 
##     init.theta = 1.475156545, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6138  -1.0864  -0.5205   0.3927   2.3361  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.57282    0.03470 -45.322   <2e-16 ***
## PopNJ        0.04627    0.05004   0.925    0.355    
## EnvP         0.04362    0.05000   0.872    0.383    
## PopNJ:EnvP  -0.03621    0.06908  -0.524    0.600    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.4752) family taken to be 1)
## 
##     Null deviance: 2640.4  on 2443  degrees of freedom
## Residual deviance: 2638.9  on 2440  degrees of freedom
## AIC: 20676
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.4752 
##           Std. Err.:  0.0410 
## 
##  2 x log-likelihood:  -20666.0090
#    tmp[order(tmp$ChromLoc), ] -> tmp
    
   

popNj_pval = 0
EnvP_pval = 0
popNj_EnvP_pval = 0

The red curved line is the theoretical Poisson fit. “Hanging” from each point on the red line is a bar, the height of which represents the difference between expected and observed counts. A bar hanging below 0 indicates underfitting. A bar hanging above 0 indicates overfitting. The counts have been transformed with a square root transformation to prevent smaller counts from getting obscured and overwhelmed by larger counts. We see a great deal of underfitting for counts 2 and higher and massive overfitting for the 1 count.

\(AIC = −2log(L)+2K\)

interpretation of a model with offset in Poisson regression is the change in the rate of the outcome per the offset variable for changes in the independent variables. Use incident rate ratios ….IRR.