Filter data set before all analysis, i.e. this filtered dataset enters the analysis with filter parameters set to user prefs.
Acronyms:
SR: supporting reads, i.e. reads with (A/G switch)
DP/depth: total reads mapped to location
Edit Level SR/DP, estimated proportion of edited transcripts
## [1] "Starting number of data points: 197439"
## DP_level SR_level SR_top editLevel_mean editLevel_sd AH NJ P NP
## 1 80 5 500 0.216 0.206 1257 1433 1422 1268
## Total_survive AH_NJ_rat P_NP_rat AHP AHNP NJP NJNP
## 1 2690 0.877 1.121 608 649 814 619
Standardize all variables (Ind. and dep.), i.e. Z-scores, subtract mean from each value and then divide these new values by the standard deviation of the var.
Super long tail but only a few observations make up the tail
After removing outliers
Still has long tail but ~5 sd, started 20 sd
SR minimum filter: 5 Depth: 80 SR top filter: 500
Filtered for at least 2
#file.tmp$Location = as.integer(ffile.tmp$Location)
file.tmp %>%
filter(N_Fish >= N_fish_filt) -> tmp
tmp[,c("ChromLoc", "Pop", "Env", "Family", "SupReads", "DP")] -> tmp
#tmp$SupReads = scale(tmp$SupReads, center = TRUE, scale = TRUE)
#tmp$DP = scale(tmp$DP, center = T, scale = T)
tmp$Family = as.factor(tmp$Family)
#m1 <- glmmTMB(SupReads ~ Pop*Env + Family + offset(log(DP)), data = tmp, family = nbinom1())
m2 <- glm.nb(SupReads ~ Pop*Env + offset(log(DP)), data = tmp)
rootogram(m2)
summary(m2)
##
## Call:
## glm.nb(formula = SupReads ~ Pop * Env + offset(log(DP)), data = tmp,
## init.theta = 1.475156545, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6138 -1.0864 -0.5205 0.3927 2.3361
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.57282 0.03470 -45.322 <2e-16 ***
## PopNJ 0.04627 0.05004 0.925 0.355
## EnvP 0.04362 0.05000 0.872 0.383
## PopNJ:EnvP -0.03621 0.06908 -0.524 0.600
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.4752) family taken to be 1)
##
## Null deviance: 2640.4 on 2443 degrees of freedom
## Residual deviance: 2638.9 on 2440 degrees of freedom
## AIC: 20676
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.4752
## Std. Err.: 0.0410
##
## 2 x log-likelihood: -20666.0090
# tmp[order(tmp$ChromLoc), ] -> tmp
popNj_pval = 0
EnvP_pval = 0
popNj_EnvP_pval = 0
The red curved line is the theoretical Poisson fit. “Hanging” from each point on the red line is a bar, the height of which represents the difference between expected and observed counts. A bar hanging below 0 indicates underfitting. A bar hanging above 0 indicates overfitting. The counts have been transformed with a square root transformation to prevent smaller counts from getting obscured and overwhelmed by larger counts. We see a great deal of underfitting for counts 2 and higher and massive overfitting for the 1 count.
\(AIC = −2log(L)+2K\)
interpretation of a model with offset in Poisson regression is the change in the rate of the outcome per the offset variable for changes in the independent variables. Use incident rate ratios ….IRR.