Identify your variables and statistical tests.
Figure 1: Types of predictors and Statistical tests
Keep in mind the workflow:
Figure 2: Workflow
Definition
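For reference, the Pearson correlation coefficient between two variables \(x\) and \(y\) is

\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

It ranges from -1 (perfect negative linear association) to +1 (perfect positive linear association), with 0 indicating no linear association.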
Figure 3: Examples of Correlation coefficients
| Country | Chocolate (kg/yr per capita) | Nobel laureates (per 10M) | GDP per capita (PPP) |
|---|---|---|---|
| Australia | 4.8 | 4.84 | 52518 |
| Austria | 8.5 | 25.14 | 55098 |
| Belgium | 6.4 | 8.70 | 51968 |
| Brazil | 1.2 | 0.05 | 14836 |
| Canada | 3.9 | 5.95 | 48073 |
| China | 0.7 | 0.04 | 17312 |
| Croatia | 6.6 | 4.80 | 28504 |
| Denmark | 6.9 | 17.38 | 60399 |
| Finland | 7.4 | 9.02 | 51090 |
| France | 3.4 | 5.98 | 46227 |
| Germany | 11.1 | 11.18 | 53694 |
| Greece | 2.5 | 1.79 | 28464 |
| Hungary | 2.9 | 12.38 | 33084 |
| Ireland | 8.8 | 22.90 | 93612 |
| Italy | 3.1 | 2.19 | 41840 |
| Japan | 1.8 | 1.89 | 42197 |
| Netherlands | 4.5 | 11.71 | 59229 |
| Norway | 9.2 | 24.28 | 63198 |
| Poland | 2.2 | 1.31 | 34265 |
| Portugal | 3.6 | 1.94 | 34496 |
| Romania | 3.6 | 1.02 | 31946 |
| Spain | 4.5 | 0.43 | 38335 |
| Switzerland | 10.3 | 31.60 | 71352 |
| UK | 8.1 | 19.43 | 44916 |
| USA | 5.3 | 11.72 | 63544 |
We have sample data (n = 25) on chocolate consumption (kg per year per capita), the number of Nobel laureates per 10 million population, and GDP per capita (PPP), indexed by country. We are interested in investigating the relationship between chocolate consumption and Nobel prize awards: \(Nobel = \alpha + \beta \, Chocolate + \epsilon\).
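A minimal setup sketch, assuming the table above is saved as choco.csv (the file name is an assumption; later output refers to the GDP column as GDPperCapita):

> choco <- read.csv("choco.csv")   # load the sample data
> str(choco)                       # inspect column names and types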
Check the distribution of our variables (check the Nearly Normal Condition for each variable).
> choco$sqrNobel <- sqrt(choco$Nobel)  # square-root transform of Nobel
> shapiro.test(choco$sqrNobel)
Shapiro-Wilk normality test
data: choco$sqrNobel
W = 0.96027, p-value = 0.4198
> shapiro.test(choco$Chocolate)
Shapiro-Wilk normality test
data: choco$Chocolate
W = 0.95867, p-value = 0.3886
Both p-values exceed 0.05, so we do not reject normality for either variable.
Ways to get the correlation coefficient:
> cor(choco$Chocolate, choco$Nobel)
[1] 0.8082649
> cor.test(choco$Chocolate, choco$Nobel, method="pearson")
Pearson's product-moment correlation
data: choco$Chocolate and choco$Nobel
t = 6.5832, df = 23, p-value = 1.022e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6069881 0.9120985
sample estimates:
cor
0.8082649
> cor(choco$Chocolate, choco$sqrNobel)
[1] 0.8303721
> cor.test(choco$Chocolate, choco$sqrNobel, method="pearson")
Pearson's product-moment correlation
data: choco$Chocolate and choco$sqrNobel
t = 7.1469, df = 23, p-value = 2.805e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6477822 0.9227451
sample estimates:
cor
0.8303721
Add this correlation coefficient onto the graph (your r and p values):
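One way to annotate the scatter plot, a sketch using ggplot2 (the label position is arbitrary):

> library(ggplot2)
> ggplot(choco, aes(x = Chocolate, y = sqrNobel)) +
+   geom_point() +
+   annotate("text", x = 3, y = 5, label = "r = 0.83, p < 0.001") +
+   labs(x = "Chocolate consumption (kg/yr per capita)",
+        y = expression(sqrt(Nobel)))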
Let’s move on to the regression analysis.
LINEAR REGRESSION ASSUMPTIONS (LINE)
1. Linearity. The relationship between the predictor and the response should be linear (the points should be well described by a straight line).
2. Independence. Each observation must be independent of all other observations (e.g., each experimental or sampling unit is independent of every other unit).
3. Normality. The error terms (residuals) come from normally distributed populations. Check this on the residuals using box plots, Q-Q plots, histograms, the Shapiro-Wilk test, etc.
4. Equal Variance (homoscedasticity). The error terms (residuals) must have equal variances. Check this on the residuals using the Breusch-Pagan test (bptest).
Now we test each assumption one by one for our model: \(\sqrt{Nobel} = \alpha + \beta \, Chocolate + \epsilon\)
Always specify your model.
> c <- lm(sqrNobel ~ Chocolate, data = choco)
Also, we always visualize, so let's start by plotting the residuals against the fitted values.
> # Assumption 1: Linearity (residuals vs. fitted values should show no pattern)
> res <- resid(c)
> plot(fitted(c), res)
> abline(0, 0)   # horizontal reference line at zero
> # Assumption2: Independence
> # check whether the mean of residuals is zero and check autocorrelation of residuals (the residuals are independent)
> mean(c$residuals)
[1] -1.221462e-17
> cor.test(choco$Chocolate, c$residuals)
Pearson's product-moment correlation
data: choco$Chocolate and c$residuals
t = -7.9609e-16, df = 23, p-value = 1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3951309 0.3951309
sample estimates:
cor
-1.659968e-16
> #Ho: There is no correlation
> #Ha: There is correlation
> acf(c$residuals)  # check residual autocorrelation (mainly for time-series data)
> # Assumption3: Normality
> # check the distribution of residuals
> qq <- rstandard(c)
> qqnorm(qq, ylab="Standardized Residuals",
+ xlab="Theoretical Quantiles: lm(sqrNobel ~ Chocolate)",
+ main="Normal Q-Q")
> qqline(qq)
> #Assumption4: Homoscedasticity
> #Ho: The residuals are homoscedastic (i.e., evenly distributed)
> #Ha: The residuals are heteroscedastic (i.e., not evenly spread)
> bptest(c)  # Breusch-Pagan test, from the lmtest package
studentized Breusch-Pagan test
data: c
BP = 1.127, df = 1, p-value = 0.2884
> # As the p-value is large, we fail to reject Ho: the residuals have equal variance
Add the regression line onto the graph:
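For instance, a sketch using ggplot2 (geom_smooth with method = "lm" overlays the same OLS fit):

> ggplot(choco, aes(x = Chocolate, y = sqrNobel)) +
+   geom_point() +
+   geom_smooth(method = "lm", se = TRUE)   # OLS line with confidence band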
| | Dependent variable: sqrNobel |
|---|---|
| Chocolate | 0.449*** |
| | (0.063) |
| Constant | 0.310 |
| | (0.376) |
| Observations | 25 |
| R2 | 0.690 |
| Adjusted R2 | 0.676 |
| Residual Std. Error | 0.899 (df = 23) |
| F Statistic | 51.078*** (df = 1; 23) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Now, let's look at the other variable: GDP.
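A sketch of the two specifications reported below, assuming lnGDP is the natural log of GDPperCapita (the object names g_lin and g_log are assumptions):

> choco$lnGDP <- log(choco$GDPperCapita)            # natural log of GDP per capita
> g_lin <- lm(sqrNobel ~ GDPperCapita, data = choco)  # model (1)
> g_log <- lm(sqrNobel ~ lnGDP, data = choco)         # model (2)
> stargazer(g_lin, g_log, type = "html")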
| Dependent variable: sqrNobel | (1) | (2) |
|---|---|---|
| GDPperCapita | 0.0001*** | |
| | (0.00001) | |
| lnGDP | | 3.018*** |
| | | (0.475) |
| Constant | -0.734 | -29.532*** |
| | (0.552) | (5.076) |
| Observations | 25 | 25 |
| R2 | 0.652 | 0.637 |
| Adjusted R2 | 0.637 | 0.621 |
| Residual Std. Error (df = 23) | 0.952 | 0.973 |
| F Statistic (df = 1; 23) | 43.141*** | 40.294*** |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
Multiple regression model: \(\sqrt{Nobel} = \beta_0 + \beta_1 \, Chocolate + \beta_2 \, lnGDP + \epsilon\). Can we use this model? Since the predictors may be correlated with each other, first check the pairwise correlations and their p-values:
Pairwise Pearson correlations:
          Chocolate sqrNobel lnGDP
Chocolate      1.00     0.83  0.72
sqrNobel       0.83     1.00  0.80
lnGDP          0.72     0.80  1.00
Corresponding p-values:
          Chocolate    sqrNobel     lnGDP
Chocolate 0.000000e+00 2.805328e-07 5.437817e-05
sqrNobel  2.805328e-07 0.000000e+00 1.775004e-06
lnGDP     5.437817e-05 1.775004e-06 0.000000e+00
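One way to produce both matrices at once, a sketch using Hmisc::rcorr (the original call is not shown; note that rcorr reports NA rather than 0 on the p-value diagonal):

> library(Hmisc)
> rc <- rcorr(as.matrix(choco[, c("Chocolate", "sqrNobel", "lnGDP")]))
> round(rc$r, 2)   # pairwise Pearson correlations
> rc$P             # corresponding p-values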
> #c_g <- lm(Nobel ~ Chocolate + GDPperCapita, data = choco)
> c_lng <- lm(sqrNobel ~ Chocolate + lnGDP, data = choco)
>
> stargazer(c_lng, header=FALSE, type="html",
+ title = 'OLS regression: sqrNobel ~ Chocolate + lnGDP',
+ align = TRUE,
+ column.sep.width = "1pt",
+ no.space = TRUE, font.size = "small")
| | Dependent variable: sqrNobel |
|---|---|
| Chocolate | 0.287*** |
| | (0.079) |
| lnGDP | 1.576*** |
| | (0.551) |
| Constant | -15.653** |
| | (5.587) |
| Observations | 25 |
| R2 | 0.774 |
| Adjusted R2 | 0.753 |
| Residual Std. Error | 0.785 (df = 22) |
| F Statistic | 37.619*** (df = 2; 22) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
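Since Chocolate and lnGDP are themselves correlated (r = 0.72), a collinearity check is prudent. A sketch using car::vif (not part of the original output):

> library(car)
> vif(c_lng)   # variance inflation factors; values above ~5-10 flag problematic collinearity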
Another example: the relationship between atmospheric CO2 concentration and global temperature, modeled as

\[y = \alpha + \beta x + \epsilon\]
> cor(co2_temp$co2conc, co2_temp$temp)
[1] 0.9591698
> cor.test(co2_temp$co2conc, co2_temp$temp, method="pearson")
Pearson's product-moment correlation
data: co2_temp$co2conc and co2_temp$temp
t = 26.487, df = 61, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9331719 0.9751837
sample estimates:
cor
0.9591698
> r <- lm(temp ~ co2conc, data = co2_temp)
> mean(r$residuals)   # mean of residuals should be ~0
[1] -4.5188e-18
> dwtest(r)   # Durbin-Watson test for residual autocorrelation (lmtest package)
Durbin-Watson test
data: r
DW = 1.6824, p-value = 0.07969
alternative hypothesis: true autocorrelation is greater than 0
> cor.test(co2_temp$co2conc, r$residuals)   # predictor should be uncorrelated with residuals
Pearson's product-moment correlation
data: co2_temp$co2conc and r$residuals
t = 1.9513e-15, df = 61, p-value = 1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.247765 0.247765
sample estimates:
cor
2.498414e-16
> stargazer(r, type="html")
| | Dependent variable: temp |
|---|---|
| co2conc | 0.011*** |
| | (0.0004) |
| Constant | -3.401*** |
| | (0.142) |
| Observations | 63 |
| R2 | 0.920 |
| Adjusted R2 | 0.919 |
| Residual Std. Error | 0.093 (df = 61) |
| F Statistic | 701.564*** (df = 1; 61) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
How can we interpret \(R^2\)?
In simple linear regression, the R-squared value, denoted \(R^2\), is the square of the correlation coefficient (in this case, our r value was 0.96, so \(R^2 = 0.96^2 = 0.92\)). It measures the proportion of variation in the dependent variable that can be attributed to the independent variable; in other words, it is a measure of explained variance and a goodness-of-fit measure for linear regression models. \(R^2\) always lies between 0 and 1 inclusive. So can we say that our model is near perfect, given the strong goodness of fit (\(R^2\) = 0.92)?
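A quick check of this identity in R, using the fitted model r from above:

> cor(co2_temp$co2conc, co2_temp$temp)^2   # square of the correlation, ~0.92
> summary(r)$r.squared                     # same value, taken from the fitted model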
Deterministic relationship vs Statistical relationship
\[\begin{equation} \ln(\text{Carbon Intensity}_{i})_{t+1} = \beta_{0} + \beta_{1}\,\text{Carbon Management Quality}_{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]
The model is designed to predict cross-sectional and intertemporal variation in emissions intensity within each corporation. I am interested in whether the explanatory variables of interest are negatively associated with the following year's emissions per unit of turnover.
\[\begin{equation} \tag{1} \ln(GHG_{i}/Rev_{i})_{t+1} = \beta_{0} + \beta_{1}\,\text{Total score}_{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]
\[\begin{equation} \tag{2} \ln(GHG_{i}/Rev_{i})_{t+1} = \beta_{0} + \beta_{1}\,\text{G score}_{i,t} + \beta_{2}\,\text{S score}_{i,t} + \beta_{3}\,\text{R score}_{i,t} + \beta_{4}\,\text{MT score}_{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]
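For illustration only, Eq. (1) might be estimated along these lines; every column name below is a hypothetical stand-in, since the source does not define R variables for the controls \(I\) (e.g., industry) and \(S\) (e.g., size):

> # Hypothetical sketch of Eq. (1); all column names are assumptions
> eq1 <- lm(lnCI_next ~ Total_score + Industry + Size, data = score)
> summary(eq1)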
> vs_f <- read.csv("chart1.csv")
> ggplot(data = vs_f,
+        aes(x = g, y = percent, fill = Industry_s)) +
+   geom_col(width = 0.5, colour = "gray75") +  # geom_col is geom_bar(stat = "identity"); the duplicate bar layer is dropped
+   scale_fill_manual(name = "Industry",
+                     values = c("white", "grey100", "grey96", "grey93",
+                                "grey90", "grey87", "gray84", "tomato4",
+                                "saddlebrown", "tan4",
+                                "darkslategray", "gray20", "gray12"),
+                     labels = gsub("[0-9]+", "", vs_f$Industry_s)) +  # strip the ordering digits from legend labels
+   labs(title = "(N = 470, \nEmissions Total = 2.8 Gt CO2e)",
+        x = " ", y = " ") +
+   theme(plot.title = element_text(size = 9),
+         legend.title = element_text(size = 7),
+         legend.text = element_text(size = 6)) +
+   scale_y_continuous(breaks = seq(0, 1, 0.2))
Sample composition by Industry
> score <- read.csv("Sample_v4.csv")
> vs1 <- score %>% data.frame() %>%   # requires dplyr for %>% and select()
+   select(c(Organization, lnCI, Industry, Year))
>
> # reorder() has no .desc argument; negate lnCI to sort industries by descending median
> ggplot(vs1, aes(x = reorder(Industry, -lnCI, FUN = median), y = lnCI, fill = Industry)) +
+   geom_boxplot() +
+   scale_fill_manual(values = c("white", "lightpink4", "lightpink4", "white",
+                                "lightpink4", "white", "white", "lightpink4",
+                                "lightpink4", "lightpink4", "white", "white", "lightpink4"),
+                     guide = "none") +
+   labs(x = " ", y = "Carbon intensity (tCO2e/mn$)") +
+   theme(axis.text.x = element_text(angle = 70, hjust = 0.5, vjust = 1))
Emissions intensity by Industry
Luo, L., & Tang, Q. (2014). Does voluntary carbon disclosure reflect underlying carbon performance? Journal of Contemporary Accounting & Economics, 10(3), 191–205. https://doi.org/10.1016/j.jcae.2014.08.003
Maurage, P., Heeren, A., & Pesenti, M. (2013). Does chocolate consumption really boost Nobel award chances? The peril of over-interpreting correlations in health studies. The Journal of Nutrition, 143(6), 931–933. https://doi.org/10.3945/jn.113.174813
Stock, J. H., & Watson, M. W. (2011). Introduction to econometrics (3rd ed.). Addison-Wesley Longman.