This is a simple code I wrote for the SIB students to demonstrate some of the things we have learned during the classes in multiple regression analysis and with inclusion of categorical variables.
With the code comes a small dataset that I downloaded from the Our World in Data site - it is some data related with the policy response of different countries to the Pandemic.
In this dataset the categorical variables are better because they have no missing observations.
We try to investigate with a simple multiple regression model whether the stringency_index and facial_coverings can explain any of the variation in the total_cases_per_million across the different 172 countries sampled in this dataset.
We then want to run the diagnostics for our model.
Multicollinearity can be hard to investigate with this data. That is where we could need to construct the chisquare contingency table. You can also do it yourself in Excel following this video: https://www.youtube.com/watch?v=hpiI_HZfmIY.
However, as you can see below it is quite easily done in R Studio with a single line of code.
If you want to run the R code yourself, you need to place it on your desktop. Alternatively you can open the data in Excel and work with it from there. Any statistics programme should be able to reproduce these calculations.
load('~/Desktop/covid_april_21.rda')
summary(covid_april_21)
## iso_code date continent location
## Length:172 Length:172 Length:172 Length:172
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## total_cases_per_million total_tests_per_thousand people_vaccinated_per_hundred
## Min. : 3.26 Min. : 4.587 Min. : 0.000
## 1st Qu.: 996.35 1st Qu.: 36.569 1st Qu.: 0.055
## Median : 7924.71 Median : 167.367 Median : 0.180
## Mean : 16408.65 Mean : 356.983 Mean : 1.127
## 3rd Qu.: 27942.57 3rd Qu.: 457.563 3rd Qu.: 0.385
## Max. :106762.44 Max. :2661.225 Max. :16.560
## NA's :83 NA's :149
## stringency_index facial_coverings
## Min. : 6.48 Min. :0.00
## 1st Qu.:45.37 1st Qu.:3.00
## Median :60.19 Median :3.00
## Mean :57.05 Mean :2.93
## 3rd Qu.:71.30 3rd Qu.:4.00
## Max. :90.74 Max. :4.00
##
You can look up all the variables on Our World in Data. In this data there are some basic country descriptors and then the dependent variable (total number of cases observed) and the two policy variables - where the facial coverings is a multinomial variable taking the values from 0 to 4 (0: no policy, 1: recommended, 2: required for some public spaces, 3: required in all public spaces, 4: always outside home) and the stringency index takes values between 0 and 100.
I ran a multiple regression (OLS) using the total_cases_per_million as dependent variable (y) and the stringency_index and facial_coverings as explanatory variables:
fit1 <- lm(total_cases_per_million ~ stringency_index + facial_coverings, data=covid_april_21)
summary(fit1)
##
## Call:
## lm(formula = total_cases_per_million ~ stringency_index + facial_coverings,
## data = covid_april_21)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31974 -12528 -4403 7562 92364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5202.34 5422.87 -0.959 0.339
## stringency_index 448.40 77.43 5.791 3.34e-08 ***
## facial_coverings -1355.26 1630.62 -0.831 0.407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18140 on 169 degrees of freedom
## Multiple R-squared: 0.1724, Adjusted R-squared: 0.1626
## F-statistic: 17.6 on 2 and 169 DF, p-value: 1.143e-07
This is a good example of a multiple regression model where we can demonstrate correlation but not causation. There could be many reasons why we find there to be a positive correlation between the stringency_index and the number of cases observed in the population (controlling for different population sizes since total cases are measured in per million inhabitants). Try to think of some…
Facial coverings do not seem to have any impact on the number of cases.
I ran another model where I also controlled for the continental belonging of each country:
fit2 <- lm(total_cases_per_million ~ continent + stringency_index + facial_coverings, data=covid_april_21)
summary(fit2)
##
## Call:
## lm(formula = total_cases_per_million ~ continent + stringency_index +
## facial_coverings, data = covid_april_21)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28685 -8298 -2252 4063 71716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6543.06 4509.47 -1.451 0.14870
## continentAsia 6820.85 3101.62 2.199 0.02927 *
## continentEurope 30741.15 3269.46 9.403 < 2e-16 ***
## continentNorth America 10497.42 4060.40 2.585 0.01060 *
## continentOceania -2072.07 6315.90 -0.328 0.74328
## continentSouth America 12678.17 4836.27 2.621 0.00958 **
## stringency_index 217.01 67.93 3.195 0.00168 **
## facial_coverings -201.79 1319.95 -0.153 0.87869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14470 on 164 degrees of freedom
## Multiple R-squared: 0.489, Adjusted R-squared: 0.4672
## F-statistic: 22.42 on 7 and 164 DF, p-value: < 2.2e-16
This model is better and explains nearly 50% of the variation in the data.
Let us next check the residual diagnostics of the model - to see if it lives up to the normality assumptions for the residuals as we have just discussed in class:
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
plot(fit1, 1)
plot(fit2, 2)
We are also interested in investigating the assumption about independence between the explanatory variables. But here it is difficult because many of the explanatory variables are categorical. For some we could use a simple t-test (for a dichotomous categorical variable) or the Pearson correlation coefficient (numerical variables only) - for other - especially for example correlation between continent and facial covering it will not work.
For the correlation between continent and the stringency index we cannot use a simple t-test, because there are more than two groups - in this situation a one-way Anova is a good choice:
res.aov <- aov(stringency_index ~ continent, data = covid_april_21)
# Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 5 10805 2161 6.926 6.71e-06 ***
## Residuals 166 51793 312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test shows that there are differences across continents for the stringency index.
For the correlation between continent and facial covering it is better to use the chisquare test, because both are categorical and multinomial - this is the type of situation where we could need the test:
chisq.test(covid_april_21$continent, covid_april_21$facial_coverings)
## Warning in chisq.test(covid_april_21$continent,
## covid_april_21$facial_coverings): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: covid_april_21$continent and covid_april_21$facial_coverings
## X-squared = 26.74, df = 20, p-value = 0.1427
The test suggests that there is no correlation for these two categorical variables. It could for example suggest that facial covering is the most commonly applied policy across continents and/or that variation in this policy is not based on continental belonging. It seems reasonable as it is a costless and easy to adopt policy as opposed to many of the other stringency measures.
But also notice that we are told the chisquare test may be incorrect or does not meet assumptions. What is the problem here and how does it relate to the count of country cases in the contingency table??
Finally, we can also look at the two-way contingency table to check the tendencies here. This contingency table shows the percentage count - that is the observed percentages - what would the theoretical or expected percentages be in this case?
prop.table(table(covid_april_21$continent, covid_april_21$facial_coverings))*100
##
## 0 1 2 3 4
## Africa 0.0000000 2.3255814 4.0697674 16.8604651 5.8139535
## Asia 0.0000000 2.3255814 2.9069767 12.2093023 8.1395349
## Europe 0.5813953 0.5813953 5.8139535 9.8837209 7.5581395
## North America 0.5813953 0.0000000 1.1627907 6.3953488 2.3255814
## Oceania 0.5813953 0.0000000 1.7441860 0.5813953 0.5813953
## South America 0.0000000 0.5813953 1.1627907 2.9069767 2.3255814