Workshop resources

Variables and analysis methods

Identify your variables and statistical tests.

Figure 1: Types of predictors and Statistical tests

Figure 1: Types of predictors and Statistical tests

Keep in mind the workflow:

Figure 2: Workflow

Analysis of continuous data: Correlation vs Regression

Definition

Figure 3: Examples of Correlation coefficients

Figure 3: Examples of Correlation coefficients

Example case I: Chocolate consumption and Nobel laureates

Table 1: Chocolate consumption and Nobel laureates by Country
Country Chocolate Nobel GDP
Australia 4.8 4.84 52518
Austria 8.5 25.14 55098
Belgium 6.4 8.70 51968
Brazil 1.2 0.05 14836
Canada 3.9 5.95 48073
China 0.7 0.04 17312
Croatia 6.6 4.80 28504
Denmark 6.9 17.38 60399
Finland 7.4 9.02 51090
France 3.4 5.98 46227
Germany 11.1 11.18 53694
Greece 2.5 1.79 28464
Hungary 2.9 12.38 33084
Ireland 8.8 22.90 93612
Italy 3.1 2.19 41840
Japan 1.8 1.89 42197
Netherlands 4.5 11.71 59229
Norway 9.2 24.28 63198
Poland 2.2 1.31 34265
Portugal 3.6 1.94 34496
Romania 3.6 1.02 31946
Spain 4.5 0.43 38335
Switzerland 10.3 31.60 71352
UK 8.1 19.43 44916
USA 5.3 11.72 63544



We have a sample data (n=25) with chocolate consumption(kg per year per capita), number of nobel laureats for every 10 million populations, and GDP per capita (PPP) indexed by country. We are interested in investigating the relationship between chocolate consumption and nobel prize awards: \(Nobel = \alpha + \beta*Chocolate + \epsilon\).





Check the distribution of our variables (check the Nearly Normal Condition for each variable).


    Shapiro-Wilk normality test

data:  choco$sqrNobel
W = 0.96027, p-value = 0.4198

    Shapiro-Wilk normality test

data:  choco$Chocolate
W = 0.95867, p-value = 0.3886

Ways to get correlation coeffient:

> cor(choco$Chocolate, choco$Nobel) 
[1] 0.8082649
> cor.test(choco$Chocolate, choco$Nobel, method="pearson")

    Pearson's product-moment correlation

data:  choco$Chocolate and choco$Nobel
t = 6.5832, df = 23, p-value = 1.022e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6069881 0.9120985
sample estimates:
      cor 
0.8082649 
> cor(choco$Chocolate, choco$sqrNobel) 
[1] 0.8303721
> cor.test(choco$Chocolate, choco$sqrNobel, method="pearson")

    Pearson's product-moment correlation

data:  choco$Chocolate and choco$sqrNobel
t = 7.1469, df = 23, p-value = 2.805e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6477822 0.9227451
sample estimates:
      cor 
0.8303721 


Add this correlation coefficient on to the graph (your r and p values):



Let’s move on to the regression analysis.
LINEAR REGRESSION ASSUMPTIONS(LINE)
1.Linearity. The points should be able to be described in a linear relationship.
2.Independence. Each observation must be independent of all other observations (e.g., each experimental or sampling unit is independent of each other experimental sampling unit).
3.Normality. The error terms (residuals) come from normally distributed populations. Check for this on the residuals using box plots, q-q plots, histograms, shapiro-wilks, etc.
4.Equal Variance (homoscedasticity). The error terms (redisuals) must have have equal variances. Check for this on the residuals using bptest.


Now we test each assumption one by one for our model: \(\sqrt{Nobel}\) = \(\alpha\) + \(\beta\)Chocolate + \(\epsilon\)

Always specify your model.

> c <- lm(sqrNobel ~ Chocolate, data = choco)


Also, we always visualize. So why not visualize our residuals.

> res <- resid(c)
> plot(fitted(c), res)
> abline(0,0)

> # Assumption2: Independence 
> # check whether the mean of residuals is zero and check autocorrelation of residuals (the residuals are independent)
> mean(c$residuals) 
[1] -1.221462e-17
> cor.test(choco$Chocolate, c$residuals)

    Pearson's product-moment correlation

data:  choco$Chocolate and c$residuals
t = -7.9609e-16, df = 23, p-value = 1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3951309  0.3951309
sample estimates:
          cor 
-1.659968e-16 
> #Ho: There is no correlation 
> #Ha: There is correlation
> acf(c$residuals) # when you use time-series data

> # Assumption3: Normality 
> # check the distribution of residuals 
> qq <- rstandard(c)
> qqnorm(qq, ylab="Standardized Residuals", 
+        xlab="Theoretical Quantiles: lm(sqrNobel ~ Chocolate)", 
+        main="Normal Q-Q") 
> qqline(qq)

> #Assumption4: Homoscedasticity
> #Ho: The residuals are homoscedastic (i.e. evenly discributed)
> #Ha: The residuals are heteroscedastic (i.e. not evenly spread)
> bptest(c)

    studentized Breusch-Pagan test

data:  c
BP = 1.127, df = 1, p-value = 0.2884
> # As the p-value is large, we cannot reject Ho. The residuals have equal variance 

Add regression line on to the graph



Dependent variable:
sqrNobel
Chocolate 0.449***
(0.063)
Constant 0.310
(0.376)
Observations 25
R2 0.690
Adjusted R2 0.676
Residual Std. Error 0.899 (df = 23)
F Statistic 51.078*** (df = 1; 23)
Note: p<0.1; p<0.05; p<0.01



Now, let’s look at the other variable - GDP.







OLS regression: Nobel ~ GDP
Dependent variable:
sqrNobel
(1) (2)
GDPperCapita 0.0001***
(0.00001)
lnGDP 3.018***
(0.475)
Constant -0.734 -29.532***
(0.552) (5.076)
Observations 25 25
R2 0.652 0.637
Adjusted R2 0.637 0.621
Residual Std. Error (df = 23) 0.952 0.973
F Statistic (df = 1; 23) 43.141*** 40.294***
Note: p<0.1; p<0.05; p<0.01



Multivariate regression model: \(\sqrt{Nobel}\) = \(\beta_0\) + \(\beta_1\)Chocolate + \(\beta_2\)lnGDP + \(\epsilon\). Can we use this model?

          Chocolate sqrNobel lnGDP
Chocolate      1.00     0.83  0.72
sqrNobel       0.83     1.00  0.80
lnGDP          0.72     0.80  1.00
             Chocolate     sqrNobel        lnGDP
Chocolate 0.000000e+00 2.805328e-07 5.437817e-05
sqrNobel  2.805328e-07 0.000000e+00 1.775004e-06
lnGDP     5.437817e-05 1.775004e-06 0.000000e+00

> #c_g <- lm(Nobel ~ Chocolate + GDPperCapita, data = choco)
> c_lng <- lm(sqrNobel ~ Chocolate + lnGDP, data = choco)
> 
> stargazer(c_lng, header=FALSE, type="html",
+           title = 'OLS regression: sqrNobel ~ Chocolate + lnGDP',
+           align = TRUE,
+           column.sep.width = "1pt",
+           no.space = TRUE, font.size = "small") 
OLS regression: sqrNobel ~ Chocolate + lnGDP
Dependent variable:
sqrNobel
Chocolate 0.287***
(0.079)
lnGDP 1.576***
(0.551)
Constant -15.653**
(5.587)
Observations 25
R2 0.774
Adjusted R2 0.753
Residual Std. Error 0.785 (df = 22)
F Statistic 37.619*** (df = 2; 22)
Note: p<0.1; p<0.05; p<0.01


Example case II: \(CO_2\) concentration in the atmosphere and temperature increase

\[y = \alpha + \beta x + \epsilon\]



> cor(co2_temp$co2conc, co2_temp$temp) 
[1] 0.9591698
> cor.test(co2_temp$co2conc, co2_temp$temp, method="pearson")

    Pearson's product-moment correlation

data:  co2_temp$co2conc and co2_temp$temp
t = 26.487, df = 61, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9331719 0.9751837
sample estimates:
      cor 
0.9591698 



[1] -4.5188e-18


    Durbin-Watson test

data:  r
DW = 1.6824, p-value = 0.07969
alternative hypothesis: true autocorrelation is greater than 0

    Pearson's product-moment correlation

data:  co2_temp$co2conc and r$residuals
t = 1.9513e-15, df = 61, p-value = 1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.247765  0.247765
sample estimates:
         cor 
2.498414e-16 
> r <- lm(temp ~ co2conc, data = co2_temp)
> stargazer(r, type="html")
Dependent variable:
temp
co2conc 0.011***
(0.0004)
Constant -3.401***
(0.142)
Observations 63
R2 0.920
Adjusted R2 0.919
Residual Std. Error 0.093 (df = 61)
F Statistic 701.564*** (df = 1; 61)
Note: p<0.1; p<0.05; p<0.01



How can we interprete \(R^2\)?



Writing tips for your Research Methods and Results sections

Sample equations

\[\begin{equation} ln(\text{Carbon Intensity}_\text{i})_\text{t+1} = \beta_{0} + \beta_{1}\text{Carbon Management Quality}_\text{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]

It is designed to predict cross-sectional and intertemporal variations in emissions intensity within each corporation. I am interested in whether the explanatory variables of interest associate negatively with the following year’s emissions per unit of turnover.

\[\begin{equation} \tag{1} ln(GHG_{i}/Rev_{i})_\text{t+1} = \beta_{0} + \beta_{1}\text{Total score}_\text{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]

\[\begin{equation}\tag{2} ln(GHG_{i}/Rev_{i})_\text{t+1} = \beta_{0} + \beta_{1}\text{G score}_\text{i,t} + \beta_{2}\text{S score}_\text{i,t} + \beta_{3}\text{R score}_\text{i,t} + \beta_{4}\text{MT score}_\text{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]

Sample graphs

> vs_f <-read.csv("chart1.csv")
> ggplot(data = vs_f,
+        aes(x = g, y = percent, fill = Industry_s)) + 
+   geom_bar(stat = "identity", width = 0.3) +
+   geom_col(width = 0.5, colour = "gray75") +
+   scale_fill_manual(name = "Industry",
+                     values = c( "white","grey100","grey96","grey93", 
+                                 "grey90","grey87","gray84","tomato4", 
+                                 "saddlebrown", "tan4", 
+                                 "darkslategray", "gray20", "gray12"), 
+                     labels = gsub("[0-9]+","",vs_f$Industry_s)) +
+   labs(title = "(N = 470, \nEmissions Total = 2.8 Gt CO2e)", 
+        x = " ", y = " ") + 
+   theme(plot.title = element_text(size=9), 
+         legend.title = element_text(size = 7), legend.text = element_text(size = 6)) +
+   scale_y_continuous(breaks=seq(0,1, 0.2))
Sample compositions by Industry

Sample compositions by Industry

> score = read.csv("Sample_v4.csv")
> vs1 = score %>% data.frame() %>% 
+   select(c(Organization, lnCI, Industry, Year))
> 
> ggplot(vs1, aes(x = reorder(Industry, lnCI, FUN = median, .desc = TRUE), y = lnCI, fill = Industry)) +
+   geom_boxplot() +
+   scale_fill_manual(values = c("white", "lightpink4", "lightpink4", "white", 
+                                "lightpink4", "white", "white", "lightpink4", 
+                                "lightpink4","lightpink4","white","white","lightpink4"),guide = "none") + 
+   labs(x = " ", y = "Carbon intensity(tCO2e/mn$)") +
+   theme(axis.text.x = element_text(angle = 70, hjust = 0.5, vjust = 1)) 
Emissions intensity by Industry

Emissions intensity by Industry




Reference

Luo, L., & Tang, Q. (2014). Does voluntary carbon disclosure reflect underlying carbon performance? Journal of Contemporary Accounting & Economics, 10(3), 191–205. https://doi.org/10.1016/j.jcae.2014.08.003


Maurage, P., Heeren, A., & Pesenti, M. (2013). Does chocolate consumption really boost nobel award chances? The peril of over-interpreting correlations in health studies. The Journal of Nutrition, 143(6), 931–933. https://doi.org/10.3945/jn.113.174813


Stock J, Watson M. Introduction to Econometrics (3rd edition). Addison Wesley Longman; 2011.