Question: Create a table1 and use GGally::ggpairs to explore your data.
| conirostris (N=185) |
difficilis (N=102) |
fortis (N=510) |
fuliginosa (N=318) |
magnirostris (N=90) |
scandens (N=229) |
|
|---|---|---|---|---|---|---|
| Beak height (mm) | ||||||
| Mean (SD) | 14.9 (1.63) | 8.17 (0.494) | 12.3 (1.29) | 8.00 (0.508) | 19.9 (1.83) | 9.56 (0.584) |
| Median [Min, Max] | 15.0 [10.5, 18.7] | 8.10 [7.10, 9.70] | 12.0 [8.60, 16.6] | 8.00 [6.70, 10.2] | 20.0 [11.5, 23.2] | 9.60 [8.20, 11.9] |
| Wing (length) (mm) | ||||||
| Mean (SD) | 77.3 (2.94) | 64.7 (3.35) | 69.9 (3.45) | 61.9 (2.39) | 81.4 (3.92) | 70.6 (2.20) |
| Median [Min, Max] | 77.0 [68.0, 84.0] | 64.0 [58.0, 71.0] | 70.0 [60.0, 80.0] | 62.0 [55.0, 70.0] | 81.0 [70.0, 89.0] | 71.0 [65.0, 76.0] |
| Upper beak (mm) | ||||||
| Mean (SD) | 21.0 (1.24) | 13.8 (0.720) | 16.5 (1.21) | 12.5 (0.649) | 22.8 (1.64) | 19.6 (1.09) |
| Median [Min, Max] | 21.0 [18.1, 24.1] | 13.8 [12.3, 15.5] | 16.3 [13.4, 20.2] | 12.5 [11.0, 15.0] | 22.9 [15.0, 25.9] | 19.6 [17.1, 22.5] |
| Sex | ||||||
| F | 59 (31.9%) | 29 (28.4%) | 155 (30.4%) | 128 (40.3%) | 29 (32.2%) | 72 (31.4%) |
| M | 126 (68.1%) | 73 (71.6%) | 355 (69.6%) | 190 (59.7%) | 61 (67.8%) | 157 (68.6%) |
function (data, mapping = NULL, columns = 1:ncol(data), title = NULL,
upper = list(continuous = "cor", combo = "box_no_facet",
discrete = "count", na = "na"), lower = list(continuous = "points",
combo = "facethist", discrete = "facetbar", na = "na"),
diag = list(continuous = "densityDiag", discrete = "barDiag",
na = "naDiag"), params = NULL, ..., xlab = NULL, ylab = NULL,
axisLabels = c("show", "internal", "none"), columnLabels = colnames(data[columns]),
labeller = "label_value", switch = NULL, showStrips = NULL,
legend = NULL, cardinality_threshold = 15, progress = NULL,
proportions = NULL, legends = stop("deprecated"))
{
warn_deprecated(!missing(legends), "legends")
warn_if_args_exist(list(...))
stop_if_params_exist(params)
isSharedData <- inherits(data, "SharedData")
data_ <- fix_data(data)
data <- fix_data_slim(data_, isSharedData)
if (!missing(mapping) & !is.list(mapping) & missing(columns)) {
columns <- mapping
mapping <- NULL
}
stop_if_bad_mapping(mapping)
columns <- fix_column_values(data, columns, columnLabels,
"columns", "columnLabels")
stop_if_high_cardinality(data, columns, cardinality_threshold)
upper <- check_and_set_ggpairs_defaults("upper", upper, continuous = "cor",
combo = "box_no_facet", discrete = "count", na = "na")
lower <- check_and_set_ggpairs_defaults("lower", lower, continuous = "points",
combo = "facethist", discrete = "facetbar", na = "na")
diag <- check_and_set_ggpairs_defaults("diag", diag, continuous = "densityDiag",
discrete = "barDiag", na = "naDiag", isDiag = TRUE)
axisLabels <- fix_axis_label_choice(axisLabels, c("show",
"internal", "none"))
proportions <- ggmatrix_proportions(proportions, data, columns)
dataTypes <- plot_types(data, columns, columns, allowDiag = TRUE)
if (identical(axisLabels, "internal")) {
dataTypes$plotType[dataTypes$posX == dataTypes$posY] <- "label"
}
ggpairsPlots <- lapply(seq_len(nrow(dataTypes)), function(i) {
plotType <- dataTypes[i, "plotType"]
posX <- dataTypes[i, "posX"]
posY <- dataTypes[i, "posY"]
xColName <- dataTypes[i, "xVar"]
yColName <- dataTypes[i, "yVar"]
if (posX > posY) {
types <- upper
}
else if (posX < posY) {
types <- lower
}
else {
types <- diag
}
sectionAes <- add_and_overwrite_aes(add_and_overwrite_aes(aes_(x = as.name(xColName),
y = as.name(yColName)), mapping), types$mapping)
args <- list(types = types, sectionAes = sectionAes)
if (plotType == "label") {
args$label <- columnLabels[posX]
}
plot_fn <- ggmatrix_plot_list(plotType)
p <- do.call(plot_fn, args)
return(p)
})
plotMatrix <- ggmatrix(plots = ggpairsPlots, byrow = TRUE,
nrow = length(columns), ncol = length(columns), xAxisLabels = (if (axisLabels ==
"internal")
NULL
else columnLabels), yAxisLabels = (if (axisLabels ==
"internal")
NULL
else columnLabels), labeller = labeller, switch = switch,
showStrips = showStrips, showXAxisPlotLabels = identical(axisLabels,
"show"), showYAxisPlotLabels = identical(axisLabels,
"show"), title = title, xlab = xlab, ylab = ylab,
data = data_, gg = NULL, progress = progress, legend = legend,
xProportions = proportions, yProportions = proportions)
plotMatrix
}
<bytecode: 0x559e7ba3aed0>
<environment: namespace:GGally>
Question: Try to replicate as many graphs as you can from those shown above, and try to explain what could be the limitations of a fixed effects models we created?
##
## Call:
## lm(formula = beakh ~ wingl, data = finches)
##
## Coefficients:
## (Intercept) wingl
## -20.3974 0.4576
##
## Call:
## lm(formula = beakh ~ wingl, data = finches)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9257 -1.0104 0.0236 1.0507 6.1625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.397361 0.503009 -40.55 <2e-16 ***
## wingl 0.457641 0.007204 63.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.734 on 1432 degrees of freedom
## Multiple R-squared: 0.7381, Adjusted R-squared: 0.7379
## F-statistic: 4035 on 1 and 1432 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = beakh ~ wingl + wingl2, data = finches)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6554 -1.0030 -0.1342 0.9923 6.6615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.4954288 4.0697832 9.705 <2e-16 ***
## wingl -1.2530210 0.1156634 -10.833 <2e-16 ***
## wingl2 0.0121132 0.0008176 14.815 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.615 on 1431 degrees of freedom
## Multiple R-squared: 0.7729, Adjusted R-squared: 0.7726
## F-statistic: 2435 on 2 and 1431 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = beakh ~ wingl + wingl2 + species, data = finches)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5830 -0.4802 -0.0155 0.4421 3.1512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.9526799 3.1630443 10.418 < 2e-16 ***
## wingl -0.7882287 0.0892474 -8.832 < 2e-16 ***
## wingl2 0.0071660 0.0006295 11.384 < 2e-16 ***
## speciesdifficilis -3.8668081 0.1429794 -27.045 < 2e-16 ***
## speciesfortis -0.6566894 0.0963176 -6.818 1.36e-11 ***
## speciesfuliginosa -3.6559863 0.1425631 -25.645 < 2e-16 ***
## speciesmagnirostris 3.4833861 0.1258761 27.673 < 2e-16 ***
## speciesscandens -3.4835950 0.1039663 -33.507 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.866 on 1426 degrees of freedom
## Multiple R-squared: 0.935, Adjusted R-squared: 0.9346
## F-statistic: 2928 on 7 and 1426 DF, p-value: < 2.2e-16
Answer: Since the independence assumption is violated, the data will explain variance that is not of interest. This is because birds of the same species are more similar than birds of different species. Also, we see heteroscedasticity in the graph, which is a violation of the linear regression assumption. The solution might be to create a mixed model with fixed effects for species. This will essentially add 5 dummy variables to our regression model which reflect the different intercepts for the different species. This will allow the model to explain the between-species variance for each species seperately. Our model will thus be a more accurate representation of realtiy.
Question: Try to replicate as many figures above as you can. What do you notice about the intercepts? What do you notice about the slopes? How does this compare to the previous models we fit?
Answer: So we see that the intercept vary across
species but the slopes are the same for all species. Compared to the
previous model this model explains the variation that we are interested
in: the relation between wing length and beakheight. It models the
variation between species separately so we can distinguish between them.
The previous model does not do this.
Question: How might we make the intercept more meaningful if we were to refit this model? (Hint: consider centering your predictor variable!)
## $species
## (Intercept)
## conirostris 1.3417493
## difficilis -2.5605229
## fortis 0.4037595
## fuliginosa -2.0911466
## magnirostris 5.3831675
## scandens -2.4770068
##
## with conditional variances for "species"
## [1] 1.3417493 -2.5605229 0.4037595 -2.0911466 5.3831675 -2.4770068
## intercept species
## conirostris 1.3417493 conirostris
## difficilis -2.5605229 difficilis
## fortis 0.4037595 fortis
## fuliginosa -2.0911466 fuliginosa
## magnirostris 5.3831675 magnirostris
## scandens -2.4770068 scandens
## [1] 0
## (Intercept) wingl
## -3.8188397 0.2247786
## [1] -3.81884
## [1] 0.2247786
## numeric(0)
## $species
## (Intercept) wingl
## conirostris -2.477090 0.2247786
## difficilis -6.379363 0.2247786
## fortis -3.415080 0.2247786
## fuliginosa -5.909986 0.2247786
## magnirostris 1.564328 0.2247786
## scandens -6.295846 0.2247786
##
## attr(,"class")
## [1] "coef.mer"
## Linear mixed model fit by REML ['lmerMod']
## Formula: beakh ~ wingl + (1 | species)
## Data: finches
##
## REML criterion at convergence: 3830.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -7.1590 -0.5683 -0.0215 0.5570 3.4892
##
## Random effects:
## Groups Name Variance Std.Dev.
## species (Intercept) 9.6096 3.0999
## Residual 0.8176 0.9042
## Number of obs: 1434, groups: species, 6
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -3.818840 1.384784 -2.758
## wingl 0.224779 0.007912 28.409
##
## Correlation of Fixed Effects:
## (Intr)
## wingl -0.405
## lmer(formula = beakh ~ wingl + (1 | species), data = finches)
## coef.est coef.se
## (Intercept) -3.82 1.38
## wingl 0.22 0.01
##
## Error terms:
## Groups Name Std.Dev.
## species (Intercept) 3.10
## Residual 0.90
## ---
## number of obs: 1434, groups: species, 6
## AIC = 3838.8, DIC = 3819.5
## deviance = 3825.1
## [1] 0.2
Answer: In general our model predicts that beak height increases as wing length increases across all species. This is actually not the case; our mean is zero. This is because the Bj has a normal distribution where N(0, s^2).
We expected the mean intercept to be the same as the intercept in the fixed effect model since the mean intercept reflects the average intercept across species.
The negative correlation between the intercept and the slope tells us that within a species the predictor and outcome variable are related to each other. Thus their variation is actually different. This we don’t want in our model because we want to observe the difference in beak height independent of species.
To make the intercept more meaningful we could center the predictor. The variable wing length will then have a mean of 0. The intercept can then be predicted as the expected value when all predictors are at their mean. Additionally, we could also make the slopes random in order to account for the correlation between the slopes and intercepts to make further partition the variance within species.