Introduction

The first part of the assignment is performed to statistically analyze sector based greenhouse gas emissions data. The purpose of the first part of the assignment is to statistically analyze the data with a mixed-effects model while focusing on the fixed effects of different sectors. The second part of the assignment is performed to statistically analyze an experiment based on the fish species guppy, by the latin term of the Poecilia reticulata. The purpose of the second part of the assignment is to statistically analyze the data of the experiment with a generalized linear mixed-effects model.

An online version of the course curriculum “Modern Statistics with R” is provided by the university and is authored by Måns Thulin (2021), which is the literature that is used as a foundation to perform statistical tests by using the mixed-effects model. The default help function in R and online error searches are also used while completing assignment II.

The first research question for part I is on how greenhouse gas emissions differ between different sectors. The second research question is on whether the 17α-ethinyl estradiol (EE2) treatment affects guppy fish anxiety and if there are gender differences based on the test data.

\(Statistical\) \(hypothesis\) \(for\) \(part\) \(1:\)

\(Hypothesis\) \(1.\)

\(H_0:\) \(The\) \(fixed\) \(effects\) \(on\) \(greenhouse\) \(gas\) \(emissions\) \(differ\) \(between\) \(the\) \(sectors.\)

\(H_A:\) \(The\) \(fixed\) \(effects\) \(on\) \(greenhouse\) \(gas\) \(emissions\) \(do\) \(not\) \(differ\) \(between\) \(the\) \(sectors.\)

Part 1 Methods

The gathering of the data set for part I is conducted by the Swedish emissions data base and is organized by the Swedish government agency Statistics Sweden. In part 1, five variables are included in the data set. The first variable measures carbon dioxide emissions in kilotonnes. The second variable measures the population by thousands of inhabitants. The third and fourth variables are categorical factors, that and subdivide each observation on emission data by municipality and regional county. The fifth variable categorizes the data set by 7 sectors, such as industry, agriculture and heating.

When choosing and constructing a statistical model that is suitable for the hypothesis of part 1, it is important to first assess the data. The only variable with unique numeric observations is greenhouse gas (GG2017) and is part of the basis of why this variable is chosen as the dependent variable. The categorical variables have more than two factor categories and therefore it is not possible to perform a logistic regression, which further indicates why GG2017 is the only variable that can be used as the dependent variable. As the data lacks a variable that measures time, a different approach is needed as the mixed-effects model is not be applied on panel data. Therefore, at least one random variable that does not measure time is required instead. As the counties (county) and municipalities (municipality) have an absence of other distinguishable characteristics, these two variables are useful as random factors. First, the data is subdivided into counties and secondly each county is subdivided into municipalities with mutual exclusivity to other counties. It is essentially a double subdivision that randomizes data on greenhouse gases. The variable measuring sectors has different constant categories and is therefore used for fixed effects in the mixed-effects model. To specifically clarify the mixed-effects model, the variables measuring county and municipality are identified as random factors and the variable measuring sectors is identified for fixed effects. As the data on greenhouse gases varies a lot with population size of each municipality, the variable measuring population (pop2017) is used to control for this factor to further isolate the differences and similarities in greenhouse gas emissions between the sectors. The chosen statistical model below has an unorthodox structure as the isolating variable of population is not a separate factor, which would be typical for an OLS approach. The vertical conditional lines that are used in the model specification below, as in “POP2017|MUNIC” and “POP2017|CNTY”, indicate that the randomization is conditional on “county” and “municipality”.

\(Chosen\) \(statistical\) \(model:\) \(Linear\) \(Mixed-effects\) \(Modelling\)

\(Model\) \(1\) \(for\) \(part\) \(1:\)

\(GG2017 = δ_0+δ_1SECTOR+δ_2POP2017|MUNIC+δ_3POP2017|CNTY\)

As a first step of developing the model, every categorical variable is transformed into factor classes and the continuous variables are transformed into numeric variables. Then the coefficients of the effect of sector on the dependent variables is performed by municipality and regional county respectively. The plot indicates a highly correlated negative relationship between intercept and sector by municipality, as well as a highly correlated negative relationship between intercept and sector by regional county. This absence of statistical bias is preferable to show before the regression results, even though it can partially be considered to be a plot for diagnostics. In the appendix it is also shown that the correlation between the constant and the sector based coefficients of the random variables are approximately perfectly negative.

Part 1 Results

Anova test for random constants. Table 1.
	npar	AIC	BIC	logLik	deviance	Chisq	Df	Pr(>Chisq)
LM	9	13701	13739	-6841	13683	NA	NA	NA
MEREG	14	13701	13761	-6836	13673	10	5	0.075

In the table above, an anova test is performed between the mixed effects regression and an OLS regression with the same variables. If p < x/2 = 0.05 the random effects of the model are appropriate when the null hypothesis of similarity between an OLS regression and mixed-effects regression is rejected. In this case, x/2 = 0.075/2 = 0.0375 < 0.05 = α, which indicates that random effects impact the regression and that random intercepts are appropriate to use.

Results. Regression table 2.
Dependent variable: Greenhouse gas emissions (kt CO2)
	ME (1)
Constant	25287.844*
	(12605.109)
SectorHeating	-24461.035
	(17414.305)
SectorIndustry	55810.003**
	(17414.305)
SectorMachinery	-15165.620
	(17414.305)
SectorProduct use	-22952.597
	(17414.305)
SectorTransport	29912.767+
	(17414.305)
SectorWaste	-24262.897
	(17414.305)
SD (Intercept)	1433.245
	58085.004
SD (pop2017)	2305.373
	25.034
Cor (Intercept~pop2017: municipality)	-1.000
Cor (Intercept~pop2017: county)	-1.000
SD (Observations)	106640.401
Num.Obs.	525
RMSE	104743.09
Marginal R-squared	0.048
Conditional R-squared	0.366
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

After the randomization is performed to better isolate the effect of sectors on greenhouse gas emissions, the significance levels vary greatly. It must also be clarified that a plus-sign indicates 10% significance and one, two and three stars indicate significance at 5%, 1% and 0.1% respectively. The low significance levels can mainly be explained by high standard deviations, that in turn indicate a large level of variance in greenhouse gas emissions between municipalities and counties. In the appendix, the bootstrapped standard errors are shown and after having manually calculated the t-values when the standard errors are generated within each sector, they do not change sufficiently for having to modify the results above. The standard errors shown above are not bootstrapped and are measured on the entirety of the sector variables, but are as mentioned approximately correct and suitable for the analysis of the results.

The marginal coefficient of determination is 0.048 and the conditional coefficient of determination is 0.366, where the former figure only includes fixed effects and the latter figure includes both fixed effects and the condition of the random effects. This indicates that the explanatory power is highly driven by only sector variables, “Industry” and “Transport” that are significant at 10%. As the standard errors of the sector variables are uniform, this means that the slope coefficients of the estimates vary greatly and that only two sector variables hold sufficient significance to mainly explain the variation in greenhouse gas emissions.

Part 1 Diagnostics

In the diagnostic plots above, the residual variance leads to heteroskedastic bias, mainly for the sector variables of “Agriculture” and “Transport”, when the remainder of the sector variables indicate a roughly homoskedastic spread of residuals. “Industry” strongly deviates with a highly positive fit, which in turn is a strong contributor to the high significance level for “Industry” in the regression table above.

An important mutation is performed before plotting the variables above where every residual is divided by its standard deviation within each sector, to instead create a box plot showing the spread in standard deviations. This is necessary as the greenhouse emission levels differ substantially between the sectors and the spread of the residuals needs to be made comparable. A few extreme outliers beyond 4 standard deviations in absolute numbers are ignored. As the random effects treat the sectors separately and the spread of residuals for each sector does not differ sufficiently, it can be stated that most sectors, except for “Transport” and “Agriculture” have concentrations of residuals around the median. Though, “Industry” is exceptional as the residuals are very highly centered near the median and can once again partially explain its specific contribution to the regression results.

The observed values and the fitted values are also mutated and divided by standard deviation to make the sectors visually comparable. Nearly every fitted line above indicates a less positive slope than the observations at the angle between beyond the centered cluster of each sector. This can reduce significance levels when smaller slope coefficients are calculated.

As stated in the assignment, the QQ-plots above mainly display S-formed structures. These are additional indications that outliers create biased mixed-effects regressions. Though, the recurrent sector is “Industry”, that does not have an S-shaped QQ-plot as the distribution only deviates on the upper end indicates a better fit for this sector variable. The division of sectors in the randomization reduces the number of observations for each fixed effects variable, which increases risks associated with outliers. A similar assignment on a larger country with more observations in a larger data set could potentially solve this issue and create QQ-plots that are less S-formed in shape.

Post-hoc tests

PWSECTOR	t.value	p.value
Agriculture - Heating	1.4047	0.799263
Agriculture - Industry	-3.2048	0.024254
Agriculture - Machinery	0.8709	0.976674
Agriculture - Product use	1.3180	0.843360
Agriculture - Transport	-1.7177	0.604502
Agriculture - Waste	1.3933	0.805365
Heating - Industry	-4.6095	0.000107
Heating - Machinery	-0.5338	0.998335
Heating - Product use	-0.0866	1.000000
Heating - Transport	-3.1224	0.031250
Heating - Waste	-0.0114	1.000000
Industry - Machinery	4.0757	0.001061
Industry - Product use	4.5229	0.000158
Industry - Transport	1.4871	0.752493
Industry - Waste	4.5981	0.000113
Machinery - Product use	0.4472	0.999392
Machinery - Transport	-2.5886	0.131925
Machinery - Waste	0.5224	0.998526
Product use - Transport	-3.0357	0.040431
Product use - Waste	0.0752	1.000000
Transport - Waste	3.1110	0.032342

The Tukey test above shows that only 7 of the differences between the estimates are significantly different at 5%. Combined with the low significance levels in the sector variables in the results, these are further indications that most estimates are equally irrelevant in explaining the variation in greenhouse gas emissions. Also, the estimate for “Industry” is significantly different from every other estimate except for “Waste” which further indicates that the estimate of “Industry” is exceptional.

Part 1 Discussion

The correlation between the constant and the sector based coefficients of the random variables are approximately perfectly negative. In simple terms, this can be formulated as the larger the constant, the smaller the slope with an absence of any relevant deviation from the relationship. The standard deviations of the sector variables are high and indicate a low fit for the model. These two different aspects are made clear by the large difference between the marginal coefficient of determination of 0.048 and the conditional coefficient of determination of 0.366, where the former figure only includes fixed effects and the latter figure includes both fixed effects and the condition of the random effects. Therefore, when the model is based on the conditionality of random effects, the nearly perfect negative correlation contributes to the explanatory power of the model. The following analysis is mainly based on comparing the sector variables “Industry” and partially also “Transport” that both are significant at 10% and the remainder of the variables that lack statistical insignificance. Without a mixed effects model, the explanatory power of every sector variable would nearly be lost. The variation of greenhouse gas emissions due to transport and industry are seemingly high enough for these to matter, when the remainder of the sector variables lack such quantitative structures. As previously mentioned in the diagnostics, a larger data set is needed for future research as the mixed-effects regression suffers from bias due to being performed on a number of observations too few in number. Even though the hypothesis in the introduction might not be required, the null hypothesis is rejected as the effect on greenhouse gas emissions differs between the sectors.

Part 2 Methods

The gathering of the data for part II is conducted by Thomas Kristensen, Erik Baatrup and Mark Bayley for a 2005 scientific article called .17α-Ethinylestradiol Reduces the Competitive Reproductive Fitness of the Male Guppy (Poecilia reticulata)“. In part 2, four variables are included in the data set. The first variable”Family” consists of categories of siblings born to the same parents. The second variable “Gender” measures gender, as categorical factors of either male or female. The third variable “Treatment” measures treatment of the synthetic hormone 17α-ethinyl estradiol (EE2) with two categories of either 20 nanograms EE2 in 10ppm acetone or 0ng EE2 in 10 ppm acetone, which is to the sample prior to the female fishes giving birth. The fourth variable “Transitions” measures the number times an individual fish has transitioned between the lower and upper compartments of an aquarium.

When choosing and constructing a statistical model that is suitable for the hypothesis of part 2, it is important to first assess the data. As stated in the assignment, the response variable in part 2 should be “Transitions”. Though, based on the structure it is important to establish which family of statistical distribution that is suitable for the regression. For the reader it needs to be clarified that a family of statistical distribution should not be confused with the variable “Family” that measures groups of siblings. As the gamma family requires that the regression is defined with non-zero divisions, this family of statistical distribution is not suitable as the dependent variable “Transitions” includes observations where individual fishes have not transitioned between the compartments of an aquarium. The binomial family is not suitable either as it requires a binary dependent variable, where the variable “Transition” is numeric, which rules out the possibility of a logistic generalized linear mixed-effects model. Given the regression model, the Gaussian family of statistical distribution is statistically implementable but would make the regression equivalent to a non-generalized mixed effects model and is therefore not appropriate for part 2 of the assignment. Partially by having ruled out every other family of statistical error distribution, the Poisson family is the only viable option where the estimation is performed on the probability of transitions between the upper and lower part of an aquarium. After having chosen the probability distribution based on the dependent variable, it needs to be established whether the three remaining variables are suitable for either fixed effects or random effects. As the variable “Gender” consists of two constant categories it is suitable for fixed effects and according to the instructions of assignment, its potential effect on transitions needs to be directly tested and should not be used for randomization. The variable “Treatment” is the most important of the independent variables and has two categories where the control group received 0 nanograms of EE2 and 10ppm of acetone and the treatment group received 20 nanograms of EE2 and 10 ppm of acetone. Though, the equivalent treatment of acetone must be assumed to have an equal or no effect on anxiety and in turn, the probability of fish transitioning between the lower and upper end of an aquarium. As the potential effect of the variable “Treatment” needs to be tested on the dependent variable, it needs to used for fixed effects in the generalized linear mixed-effects model. As the model violates the Gauss-Markov assumption of independent observations, the variable “Family” needs to be used as a factor categorization and randomized within the generalized linear mixed model, where the correlations within families are used to benefit the rather than create bias in the regression. For further clarification, the Poisson family is chosen for error distribution.

\(Chosen\) \(statistical\) \(model:\) \(Linear\) \(Mixed-effects\) \(Modelling\)

\(Model\) \(1\) \(for\) \(part\) \(2:\)

\(TRANSITIONS = δ_0+δ_1TREATMENT+δ_2GENDER + δ_3(1|FAMILY)\)

Part 2 Results

To test the correlation between the intercepts and slopes by “Family” is not possible as the variables used for fixed effects are binary and do not create any usable variation with more than two more categories. As the standard deviation of the correlation between the intercepts and coefficients is zero and an error message is present, uncorrelated intercepts and slopes are assumed out of necessity.

Anova test for random constants. Table 3.
	npar	AIC	BIC	logLik	deviance	Chisq	Df	Pr(>Chisq)
MEGLMM	4	853	865	-422	845	NA	NA	NA
LM2	21	817	879	-387	775	70.2	17	0

In the table above, an anova test is performed between the generalized mixed effects model regression and an OLS regression with the same variables. As in the equivalent diagnostics in part 1, if p < x/2 = 0.05 the random effects of the model are appropriate when the null hypothesis of similarity between an OLS regression and mixed-effects regression is rejected. In this case, x/2 = (1.97/2)e-08 = 9.85e-09 < 0.001, which indicates that random effects impact the regression and that random intercepts are appropriate to use at below 0.1% significance.

Results. Regression table 4.
Dependent variable: Transitions
	ME GLMM (1)
Constant	1.448***
	(0.150)
Treatment.L	-0.211
	(0.212)
Gender.L	-0.005
	(0.062)
SD (Intercept)	0.620
SD (Observations)	1.000
Num.Obs.	145
RMSE	3.51
Marginal R-squared	0.038
Conditional R-squared	0.681
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Both fixed factor estimates are clearly statistically insignificant with p-values of 0.32 and 0.94 for the estimates of “Treatment” and “Gender” respectively. Treatment has a negative statistically insignificant effect on anxiety, as quantified by transitions, all else equal. This directly corresponds to the marginal coefficient of determination of 0.038 which indicates that the variation in transitions is not explained solely by the fixed factor variables to a degree of any relevance. Though, as there is a large difference between the marginal- and the conditional coefficient of determination, the randomization through “Family” must hold relevance.

The effect plots above further visualize the negative coefficients of the binary estimates of treatment and gender, where the b-value of treatment clearly is the largest by absolute value. # Part 2 Diagnostics

As the independent variables used in the fixed effects are binary and the individual fishes received treatment within each family, in the first plot above the fitted values by each family have no residual spread on the x-axis and as stated in the assignment, a normal distribution of residuals can not be assumed. As the second plot above does not take the division of families into account, it shows a large but relatively homogeneous spread of residuals. As based on the two plots above it is difficult to make any statement on whether there is heteroskedastic or homoskedastic variance in the residuals of the regression.

Certain families have larger or smaller spreads of residuals, but there is no mentioned idiosyncratic difference between the families, and thus the plot above holds little to no relevance. A different visual approach is necessary.

The QQ-plot above visualizes that issues with heteroskedasticity is present within the randomization of most families with a sufficient number of observations.

Part 2 Discussion

In absence of randomization with a marginal coefficient of determination of 0.038, the fixed effects variables of treatment and gender alone indicate a very low degree of statistical relevance in explaining the variation in transitions. The randomization by family of fishes create a condition where these two fixed effects variables are given indirect explanatory power. Neither can the randomization through factor categorization alone generate such a high conditional coefficient of determination. This strongly indicates that the effect of treatment and possibly also gender on transitions are conditional on the dependency by family of fishes. The relevance of treatment and gender would therefore be lost without using the generalized linear mixed-effects model. The lack of statistical significance in table 4 is misleading from this perspective as the constant alone can not explain the variation in transitions under the condition of random effects. Though, gender has a very high p-value of 0.937 and treatment has a lower p-value of 0.320. Under the Poisson error distribution this indicates that a treatment of 20 nanograms EE2 under the condition by randomization reduces anxiety as quantified by transitions, but gender does not. Though, most of the diagnostics are relatively inconclusive and the conditional QQ-plot indicates bias in most of the family categories. As in part 1, further research could benefit from a larger data set that could reduce statistical bias generated by possible outliers.

References

Måns Thulin (2021). Modern Statistics with R. From wrangling and exploring data to inference and predictive modelling. http://www.modernstatisticswithr.com/

Appendix

Appendix Part 1

# Extraction of coefficients by municipality.

GG <- greenhousegases

GG$county <- as.factor(GG$county)
GG$sector <- as.factor(GG$sector)
GG$municipality <- as.factor(GG$municipality)
GG$GG2017 <- as.numeric(GG$GG2017)
GG$pop2017 <- as.numeric(GG$pop2017)

GG %>% 
  split(.$municipality) %>% 
  map(~ lm(GG2017 ~ sector, data = .)) %>% 
  map(coef) -> coefficients1

# Data frame:
coefficients1 <- data.frame(matrix(unlist(coefficients1),
                                  nrow = length(coefficients1),
                                  byrow = TRUE),
                           row.names = names(coefficients1))
names(coefficients1) <- c("Intercept", "sector")

# Extraction of coefficients by county.

GG %>% 
  split(.$county) %>% 
  map(~ lm(GG2017 ~ sector, data = .)) %>% 
  map(coef) -> coefficients2

# Data frame:
coefficients2 <- data.frame(matrix(unlist(coefficients2),
                                   nrow = length(coefficients2),
                                   byrow = TRUE),
                            row.names = names(coefficients2))
names(coefficients2) <- c("Intercept", "sector")

# Plot the coefficients by municipality:
ggplot(coefficients1, aes(Intercept, sector,
                         colour = row.names(coefficients1))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", colour = "black", se = FALSE) +
  ggtitle("Figure 1") +
  labs(caption = "Intercept vs sector by municipality") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

# Plot the coefficients by county:
ggplot(coefficients2, aes(Intercept, sector,
                          colour = row.names(coefficients2))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", colour = "black", se = FALSE) +
  ggtitle("Figure 2") +
  labs(caption = "Intercept vs sector by county") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

# Renaming "sector" to "Sector" with a capital letter.

GG2 <- GG
GG2$Sector <- GG2$sector
GG2<-
  GG2 %>%
  select(-2)

# Saving the regression by specifying statistical model and 

RSQRREG <- lmer(GG2017 ~ Sector + (1 + pop2017|county) + 
               (1 + pop2017|municipality), data = GG2)
RSQR <- r.squaredGLMM(RSQRREG)

part1reg <- list()
part1reg[['ME (1)']] <- lmer(GG2017 ~ Sector + (1 + pop2017|county) + 
                                (1 + pop2017|municipality), data = GG2)
varnames <- c('(Intercept)' = 'Constant')
rsqr <- tribble(~term,          ~"ME (1)",
                'Marginal R-squared',    r.squaredGLMM(RSQRREG)[1],
                'Conditional R-squared',    r.squaredGLMM(RSQRREG)[2])

# Running regression by summary

tablepart1 <- modelsummary(part1reg, output = 'kableExtra',
                           coef_rename = varnames, stars = TRUE, 
                           gof_omit = 'IC|Log|Adj', add_rows = rsqr)

# Correlated random intercept and slope.
MEREG <- lmer(GG2017 ~ sector + (1 + pop2017|county) + 
                      (1 + pop2017|municipality), data = GG)

LM <- lm(GG2017 ~ sector + pop2017, data = GG)
anova1 <- anova(MEREG, LM)  

anova1 %>% 
  kable("html", digits = 3) %>% 
  kable_styling(full_width = F) %>%
  add_header_above(c("Anova test for random constants. Table 1."  = 8, " " = 1), 
                   font_size = 22, align = "l")

# Test the correlation:

cor.test(coefficients1$Intercept, coefficients1$sector)[4]
cor.test(coefficients2$Intercept, coefficients2$sector)[4]

tablepart1 %>%
  add_header_above(c("Dependent variable: Greenhouse gas emissions (kt CO2)"  = 1, " " = 1), font_size = 14, align = "l") %>%
  add_header_above(c("Results. Regression table 2." = 1, " " = 1), font_size = 20, align = "l")

# Fitted values vs residuals.

GG <-
  GG %>%
  mutate(RESID = resid(MEREG)) %>%
  mutate(FITTED = fitted(MEREG))

SECTORRESID <- 
  GG %>%
  select(2,4,6,7)

ggplot(SECTORRESID, aes(FITTED, RESID)) +
  geom_point(aes(color = sector)) +
  facet_wrap(~ sector, nrow = 2) +
  coord_cartesian(xlim = c(-25000, 135000),
                  ylim = c(-98300,187006)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.3) +
  xlab("Fitted values") + 
  ylab("Residuals") +
  ggtitle("Figure 3") +
  labs(caption = "Diagnostics plot for spread of residual variance") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

# Box plot of residuals.

SECTORRESID <-
  SECTORRESID %>%
  group_by(sector) %>%
  mutate(STDRESID = RESID/sd(RESID))

ggplot(SECTORRESID, aes(sector, STDRESID, color = sector)) +
  geom_boxplot() +
  coord_flip() +
  ylim(c(-4,4)) +
  xlab("Residuals in number of standard deviations") + 
  ylab("Sector") +
  ggtitle("Figure 4") +
  labs(caption = "Diagnostics box plot for spread of residual standard deviations") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

# Observed values vs fitted values.

SECTORRESID <-
  SECTORRESID %>%
  group_by(sector) %>%
  mutate(STDFITTED = FITTED/sd(FITTED)) %>%
  mutate(STDGG2017 = GG2017/sd(GG2017))

ggplot(SECTORRESID, aes(STDFITTED, STDGG2017, color = sector)) +
  geom_point() +
  facet_wrap(~ sector, nrow = 2) +
  geom_smooth(method='lm') +
  xlim(c(-1,4)) +
  xlab("Fitted values div. by sector stdev.") + 
  ylab("Observed values div. by sector stdev.") +
  ggtitle("Figure 5") +
  labs(caption = "Observed values vs fitted values") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

## Q-Q plot of residuals.

SECTORRESID %>%
  ggplot(aes(sample = STDRESID, color = sector)) +
  facet_wrap(~ sector, nrow = 2) +
  geom_qq() +
  geom_qq_line() +
  ylim(c(-4,4)) +
  xlab("Theoretical quantiles") +
  ylab("Sample quantiles") +
  ggtitle("Figure 6") +
  labs(caption = "QQ-plots") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

# Post-hoc tests

tukey <- emmeans(MEREG, list(pairwise ~ sector), adjust = "tukey")
tukeypair <- tukey[2]
tukeydataframe <- as.data.frame(tukeypair)
TDF <-
  tukeydataframe %>%
  select(1,5,6)
TDF$PWSECTOR <- TDF$pairwise.differences.of.sector.1
TDF$t.value <- TDF$pairwise.differences.of.sector.t.ratio
TDF$p.value <- TDF$pairwise.differences.of.sector.p.value
TDF <-
  TDF %>%
  select(4,5,6)
TDF <- as_tibble(TDF)
flextable(TDF)

# Test the correlations between intercept and randomized variables:

cor.test(coefficients1$Intercept, coefficients1$sector)[4]

## $estimate
##    cor 
## -0.999

cor.test(coefficients2$Intercept, coefficients2$sector)[4]

## $estimate
## cor 
##  -1

# Bootstrapped standard deviations to show that they do not change sufficiently
# enough to create changes in significance cut-off percentage levels.

# summary(bootstrapped)

btstrtable <- matrix(c(25288,13855,-24461,17803,55810,18229,-15166,17686,-22953,17758,29913,18056,-24263,18148), ncol=2, byrow=TRUE)
colnames(btstrtable) <- c('original ','bootSE ')
rownames(btstrtable) <- c('(Intercept)','sectorHeating','sectorIndustry','sectorMachinery','sectorProduct use','sectorTransport','sectorWaste')
btstrtable <- as.table(btstrtable)
btstrtable

##                   original  bootSE 
## (Intercept)           25288   13855
## sectorHeating        -24461   17803
## sectorIndustry        55810   18229
## sectorMachinery      -15166   17686
## sectorProduct use    -22953   17758
## sectorTransport       29913   18056
## sectorWaste          -24263   18148

Appendix Part 2

GP2 <- GP
GP2$Family <- as.factor(GP2$Family)
GP2$Gender <- as.factor(GP2$Gender)
GP2$Treatment <- as.factor(GP2$Treatment)

GP$Family <- as.factor(GP$Family)
GP$Gender <- ordered(GP$Gender, levels = c("F", "M"))
GP$Treatment <- ordered(GP$Treatment, levels = c("0ng", "20ng"))

LM2 <- lm(Transitions ~ Treatment + Gender + Family, data = GP2)
MEGLMM <- glmer(Transitions ~ Treatment + Gender + (1|Family), family = poisson, data = GP2)
anova1 <- anova(MEGLMM, LM2)  

anova1 %>% 
  kable("html", digits = 3) %>% 
  kable_styling(full_width = F) %>%
  add_header_above(c("Anova test for random constants. Table 3."  = 8, " " = 1), 
                   font_size = 22, align = "l")

GP$Family <- as.factor(GP$Family)
GP$Gender <- ordered(GP$Gender, levels = c("F", "M"))
GP$Treatment <- ordered(GP$Treatment, levels = c("0ng", "20ng"))

MGLMSQRREG <- glmer(Transitions ~ Treatment + Gender + (1|Family), family = poisson, data = GP)
MLGLMRSQR <- r.squaredGLMM(MGLMSQRREG)

part2reg <- list()
part2reg[['ME GLMM (1)']] <- glmer(Transitions ~ Treatment + Gender + (1|Family), 
                                   family = poisson, data = GP)
varnames <- c('(Intercept)' = 'Constant')
rsqr2 <- tribble(~term,          ~"ME (1)",
                'Marginal R-squared',    MLGLMRSQR[1],
                'Conditional R-squared',    MLGLMRSQR[4])
tablepart2 <- modelsummary(part2reg, output = 'kableExtra',
                           coef_rename = varnames, stars = TRUE, 
                           gof_omit = 'IC|Log|Adj', add_rows = rsqr2)

tablepart2 %>%
  add_header_above(c("Dependent variable: Transitions"  = 1, " " = 1), font_size = 14, align = "l") %>%
  add_header_above(c("Results. Regression table 4." = 1, " " = 1), font_size = 20, align = "l")

# Effects plots

plot(allEffects(MEGLMM)[1], sub = "Treatment effect plot", main = "Figure 7")

plot(allEffects(MEGLMM)[2], sub = "Gender effect plot", main = "Figure 8")

# Fitted values vs residuals by family.

MEGLMM <- glmer(Transitions ~ Treatment + Gender + (1|Family), family = poisson, data = GP)
 
GP <-
  GP %>%
  mutate(RESID = resid(MEGLMM)) %>%
  mutate(FITTED = fitted(MEGLMM))

ggplot(GP, aes(FITTED, RESID)) +
  geom_point(aes(color = Family)) +
  facet_wrap(~ Family) +
  xlab("Fitted values") + 
  ylab("Residuals") +
  ggtitle("Figure 9") +
  labs(caption = "Diagnostics plot for spread of residual variance") +
  theme(plot.title = element_text(size=16, hjust = 0.5)) + 
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.3)

# Fitted values vs residuals in general.

plot(MEGLMM, type = c("p", "smooth"), sub = "Residuals vs Fitted values",
     main = "Figure 10")

# Diagnostics box plot for spread of residuals

ggplot(GP, aes(Family, RESID, color = Family)) +
  geom_boxplot() +
  coord_flip() +
  ylim(c(-4,4)) +
  xlab("Residuals") + 
  ylab("Sector") +
  ggtitle("Figure 11") +
  labs(caption = "Diagnostics box plot for spread of residuals") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

# Observed vs fitted values

ggplot(GP, aes(FITTED, Transitions, color = Family)) +
  geom_point() +
  facet_wrap(~ Family, nrow = 3) +
  geom_smooth(method='lm') +
  xlab("Fitted values") + 
  ylab("Observed values") +
  ggtitle("Figure 12") +
  labs(caption = "Observed values vs fitted values") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

## QQ-plot of residuals:

GP %>%
  ggplot(aes(sample = RESID, color = Family)) +
  facet_wrap(~ Family, nrow = 3) +
  geom_qq() +
  geom_qq_line() +
  xlab("Theoretical quantiles") +
  ylab("Sample quantiles") +
  ggtitle("Figure 13") +
  labs(caption = "QQ-plots") +
  theme(plot.title = element_text(size = 20, hjust = 0.5),
        plot.caption=element_text(size = 14, hjust = 0.5))

Assignment 2 Part 3 Mixed-effects Models

1041MA: Statistical analyses and visualization in R: II

Olle Scherling

2022-04-11