Identify your variables and statistical tests.
Figure 1: Types of predictors and Statistical tests
Keep in mind the workflow:
Figure 2: Workflow
Definition
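For reference, the Pearson correlation coefficient between two variables \(x\) and \(y\) is

\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

It ranges from -1 (perfect negative linear association) to +1 (perfect positive linear association), with 0 indicating no linear association.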
Figure 3: Examples of Correlation coefficients
| Country | Chocolate (kg/yr per capita) | Nobel laureates (per 10M) | GDP per capita (PPP) |
|---|---|---|---|
| Australia | 4.8 | 4.84 | 52518 |
| Austria | 8.5 | 25.14 | 55098 |
| Belgium | 6.4 | 8.70 | 51968 |
| Brazil | 1.2 | 0.05 | 14836 |
| Canada | 3.9 | 5.95 | 48073 |
| China | 0.7 | 0.04 | 17312 |
| Croatia | 6.6 | 4.80 | 28504 |
| Denmark | 6.9 | 17.38 | 60399 |
| Finland | 7.4 | 9.02 | 51090 |
| France | 3.4 | 5.98 | 46227 |
| Germany | 11.1 | 11.18 | 53694 |
| Greece | 2.5 | 1.79 | 28464 |
| Hungary | 2.9 | 12.38 | 33084 |
| Ireland | 8.8 | 22.90 | 93612 |
| Italy | 3.1 | 2.19 | 41840 |
| Japan | 1.8 | 1.89 | 42197 |
| Netherlands | 4.5 | 11.71 | 59229 |
| Norway | 9.2 | 24.28 | 63198 |
| Poland | 2.2 | 1.31 | 34265 |
| Portugal | 3.6 | 1.94 | 34496 |
| Romania | 3.6 | 1.02 | 31946 |
| Spain | 4.5 | 0.43 | 38335 |
| Switzerland | 10.3 | 31.60 | 71352 |
| UK | 8.1 | 19.43 | 44916 |
| USA | 5.3 | 11.72 | 63544 |
We have sample data (n = 25) on chocolate consumption (kg per year per capita), the number of Nobel laureates per 10 million population, and GDP per capita (PPP), indexed by country. We are interested in investigating the relationship between chocolate consumption and Nobel prize awards: \(Nobel = \alpha + \beta \, Chocolate + \epsilon\).
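A minimal setup sketch, assuming the table above is saved as choco.csv (the file name is an assumption; later output refers to the GDP column as GDPperCapita):

> choco <- read.csv("choco.csv")   # load the sample data
> str(choco)                       # inspect column names and types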
Check the distribution of our variables (check the Nearly Normal Condition for each variable).
> choco$sqrNobel <- sqrt(choco$Nobel)  # square-root transform of Nobel
> shapiro.test(choco$sqrNobel)
Shapiro-Wilk normality test
data: choco$sqrNobel
W = 0.96027, p-value = 0.4198
> shapiro.test(choco$Chocolate)
Shapiro-Wilk normality test
data: choco$Chocolate
W = 0.95867, p-value = 0.3886
Both p-values exceed 0.05, so we do not reject normality for either variable.
Ways to get the correlation coefficient:
> cor(choco$Chocolate, choco$Nobel)
[1] 0.8082649
> cor.test(choco$Chocolate, choco$Nobel, method="pearson")
Pearson's product-moment correlation
data: choco$Chocolate and choco$Nobel
t = 6.5832, df = 23, p-value = 1.022e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6069881 0.9120985
sample estimates:
cor
0.8082649
> cor(choco$Chocolate, choco$sqrNobel)
[1] 0.8303721
> cor.test(choco$Chocolate, choco$sqrNobel, method="pearson")
Pearson's product-moment correlation
data: choco$Chocolate and choco$sqrNobel
t = 7.1469, df = 23, p-value = 2.805e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6477822 0.9227451
sample estimates:
cor
0.8303721
Add this correlation coefficient onto the graph (your r and p values):
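One way to annotate the scatter plot, a sketch using ggplot2 (the label position is arbitrary):

> library(ggplot2)
> ggplot(choco, aes(x = Chocolate, y = sqrNobel)) +
+   geom_point() +
+   annotate("text", x = 3, y = 5, label = "r = 0.83, p < 0.001") +
+   labs(x = "Chocolate consumption (kg/yr per capita)",
+        y = expression(sqrt(Nobel)))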
Let’s move on to the regression analysis.
LINEAR REGRESSION ASSUMPTIONS (LINE)
1. Linearity. The relationship between the predictor and the response should be linear (the points should be well described by a straight line).
2. Independence. Each observation must be independent of all other observations (e.g., each experimental or sampling unit is independent of every other unit).
3. Normality. The error terms (residuals) come from normally distributed populations. Check this on the residuals using box plots, Q-Q plots, histograms, the Shapiro-Wilk test, etc.
4. Equal Variance (homoscedasticity). The error terms (residuals) must have equal variances. Check this on the residuals using the Breusch-Pagan test (bptest).
Now we test each assumption one by one for our model: \(\sqrt{Nobel} = \alpha + \beta \, Chocolate + \epsilon\)
Always specify your model.
> c <- lm(sqrNobel ~ Chocolate, data = choco)
Also, we always visualize, so let's start by plotting the residuals against the fitted values.
> # Assumption 1: Linearity (residuals vs. fitted values should show no pattern)
> res <- resid(c)
> plot(fitted(c), res)
> abline(0, 0)   # horizontal reference line at zero
> # Assumption2: Independence
> # check whether the mean of residuals is zero and check autocorrelation of residuals (the residuals are independent)
> mean(c$residuals)
[1] -1.221462e-17
> cor.test(choco$Chocolate, c$residuals)
Pearson's product-moment correlation
data: choco$Chocolate and c$residuals
t = -7.9609e-16, df = 23, p-value = 1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3951309 0.3951309
sample estimates:
cor
-1.659968e-16
> #Ho: There is no correlation
> #Ha: There is correlation
> acf(c$residuals)  # check residual autocorrelation (mainly for time-series data)
> # Assumption3: Normality
> # check the distribution of residuals
> qq <- rstandard(c)
> qqnorm(qq, ylab="Standardized Residuals",
+ xlab="Theoretical Quantiles: lm(sqrNobel ~ Chocolate)",
+ main="Normal Q-Q")
> qqline(qq)
> #Assumption4: Homoscedasticity
> #Ho: The residuals are homoscedastic (i.e., evenly distributed)
> #Ha: The residuals are heteroscedastic (i.e., not evenly spread)
> bptest(c)  # Breusch-Pagan test, from the lmtest package
studentized Breusch-Pagan test
data: c
BP = 1.127, df = 1, p-value = 0.2884
> # As the p-value is large, we fail to reject Ho: the residuals have equal variance
Add the regression line onto the graph:
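For instance, a sketch using ggplot2 (geom_smooth with method = "lm" overlays the same OLS fit):

> ggplot(choco, aes(x = Chocolate, y = sqrNobel)) +
+   geom_point() +
+   geom_smooth(method = "lm", se = TRUE)   # OLS line with confidence band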
| | Dependent variable: sqrNobel |
|---|---|
| Chocolate | 0.449*** |
| | (0.063) |
| Constant | 0.310 |
| | (0.376) |
| Observations | 25 |
| R2 | 0.690 |
| Adjusted R2 | 0.676 |
| Residual Std. Error | 0.899 (df = 23) |
| F Statistic | 51.078*** (df = 1; 23) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Now, let's look at the other variable: GDP.
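A sketch of the two specifications reported below, assuming lnGDP is the natural log of GDPperCapita (the object names g_lin and g_log are assumptions):

> choco$lnGDP <- log(choco$GDPperCapita)            # natural log of GDP per capita
> g_lin <- lm(sqrNobel ~ GDPperCapita, data = choco)  # model (1)
> g_log <- lm(sqrNobel ~ lnGDP, data = choco)         # model (2)
> stargazer(g_lin, g_log, type = "html")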
| Dependent variable: sqrNobel | (1) | (2) |
|---|---|---|
| GDPperCapita | 0.0001*** | |
| | (0.00001) | |
| lnGDP | | 3.018*** |
| | | (0.475) |
| Constant | -0.734 | -29.532*** |
| | (0.552) | (5.076) |
| Observations | 25 | 25 |
| R2 | 0.652 | 0.637 |
| Adjusted R2 | 0.637 | 0.621 |
| Residual Std. Error (df = 23) | 0.952 | 0.973 |
| F Statistic (df = 1; 23) | 43.141*** | 40.294*** |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
Multiple regression model: \(\sqrt{Nobel} = \beta_0 + \beta_1 \, Chocolate + \beta_2 \, lnGDP + \epsilon\). Can we use this model? Since the predictors may be correlated with each other, first check the pairwise correlations and their p-values:
Pairwise Pearson correlations:
          Chocolate sqrNobel lnGDP
Chocolate      1.00     0.83  0.72
sqrNobel       0.83     1.00  0.80
lnGDP          0.72     0.80  1.00
Corresponding p-values:
          Chocolate    sqrNobel     lnGDP
Chocolate 0.000000e+00 2.805328e-07 5.437817e-05
sqrNobel  2.805328e-07 0.000000e+00 1.775004e-06
lnGDP     5.437817e-05 1.775004e-06 0.000000e+00
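One way to produce both matrices at once, a sketch using Hmisc::rcorr (the original call is not shown; note that rcorr reports NA rather than 0 on the p-value diagonal):

> library(Hmisc)
> rc <- rcorr(as.matrix(choco[, c("Chocolate", "sqrNobel", "lnGDP")]))
> round(rc$r, 2)   # pairwise Pearson correlations
> rc$P             # corresponding p-values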
> #c_g <- lm(Nobel ~ Chocolate + GDPperCapita, data = choco)
> c_lng <- lm(sqrNobel ~ Chocolate + lnGDP, data = choco)
>
> stargazer(c_lng, header=FALSE, type="html",
+ title = 'OLS regression: sqrNobel ~ Chocolate + lnGDP',
+ align = TRUE,
+ column.sep.width = "1pt",
+ no.space = TRUE, font.size = "small")
| | Dependent variable: sqrNobel |
|---|---|
| Chocolate | 0.287*** |
| | (0.079) |
| lnGDP | 1.576*** |
| | (0.551) |
| Constant | -15.653** |
| | (5.587) |
| Observations | 25 |
| R2 | 0.774 |
| Adjusted R2 | 0.753 |
| Residual Std. Error | 0.785 (df = 22) |
| F Statistic | 37.619*** (df = 2; 22) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
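Since Chocolate and lnGDP are themselves correlated (r = 0.72), a collinearity check is prudent. A sketch using car::vif (not part of the original output):

> library(car)
> vif(c_lng)   # variance inflation factors; values above ~5-10 flag problematic collinearity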
Another example: the relationship between atmospheric CO2 concentration and global temperature, modeled as

\[y = \alpha + \beta x + \epsilon\]
> cor(co2_temp$co2conc, co2_temp$temp)
[1] 0.9591698
> cor.test(co2_temp$co2conc, co2_temp$temp, method="pearson")
Pearson's product-moment correlation
data: co2_temp$co2conc and co2_temp$temp
t = 26.487, df = 61, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9331719 0.9751837
sample estimates:
cor
0.9591698
> r <- lm(temp ~ co2conc, data = co2_temp)
> mean(r$residuals)   # mean of residuals should be ~0
[1] -4.5188e-18
> dwtest(r)   # Durbin-Watson test for residual autocorrelation (lmtest package)
Durbin-Watson test
data: r
DW = 1.6824, p-value = 0.07969
alternative hypothesis: true autocorrelation is greater than 0
> cor.test(co2_temp$co2conc, r$residuals)   # predictor should be uncorrelated with residuals
Pearson's product-moment correlation
data: co2_temp$co2conc and r$residuals
t = 1.9513e-15, df = 61, p-value = 1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.247765 0.247765
sample estimates:
cor
2.498414e-16
> stargazer(r, type="html")
| | Dependent variable: temp |
|---|---|
| co2conc | 0.011*** |
| | (0.0004) |
| Constant | -3.401*** |
| | (0.142) |
| Observations | 63 |
| R2 | 0.920 |
| Adjusted R2 | 0.919 |
| Residual Std. Error | 0.093 (df = 61) |
| F Statistic | 701.564*** (df = 1; 61) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
How can we interpret \(R^2\)?
In simple linear regression, the R-squared value, denoted \(R^2\), is the square of the correlation coefficient (in this case, our r value was 0.96, so \(R^2 = 0.96^2 = 0.92\)). It measures the proportion of variation in the dependent variable that can be attributed to the independent variable; in other words, it is a measure of explained variance and a goodness-of-fit measure for linear regression models. \(R^2\) always lies between 0 and 1 inclusive. So can we say that our model is near perfect, given the strong goodness of fit (\(R^2\) = 0.92)?
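A quick check of this identity in R, using the fitted model r from above:

> cor(co2_temp$co2conc, co2_temp$temp)^2   # square of the correlation, ~0.92
> summary(r)$r.squared                     # same value, taken from the fitted model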
Deterministic relationship vs Statistical relationship
\[\begin{equation} \ln(\text{Carbon Intensity}_{i})_{t+1} = \beta_{0} + \beta_{1}\,\text{Carbon Management Quality}_{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]
The model is designed to predict cross-sectional and intertemporal variation in emissions intensity within each corporation. I am interested in whether the explanatory variables of interest are negatively associated with the following year's emissions per unit of turnover.
\[\begin{equation} \tag{1} \ln(GHG_{i}/Rev_{i})_{t+1} = \beta_{0} + \beta_{1}\,\text{Total score}_{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]
\[\begin{equation} \tag{2} \ln(GHG_{i}/Rev_{i})_{t+1} = \beta_{0} + \beta_{1}\,\text{G score}_{i,t} + \beta_{2}\,\text{S score}_{i,t} + \beta_{3}\,\text{R score}_{i,t} + \beta_{4}\,\text{MT score}_{i,t} + \gamma_{1} I_{i,t} + \gamma_{2} S_{i,t} + \epsilon_{i,t} \end{equation}\]
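For illustration only, Eq. (1) might be estimated along these lines; every column name below is a hypothetical stand-in, since the source does not define R variables for the controls \(I\) (e.g., industry) and \(S\) (e.g., size):

> # Hypothetical sketch of Eq. (1); all column names are assumptions
> eq1 <- lm(lnCI_next ~ Total_score + Industry + Size, data = score)
> summary(eq1)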
> vs_f <- read.csv("chart1.csv")
> ggplot(data = vs_f,
+        aes(x = g, y = percent, fill = Industry_s)) +
+   geom_col(width = 0.5, colour = "gray75") +  # geom_col is geom_bar(stat = "identity"); the duplicate bar layer is dropped
+   scale_fill_manual(name = "Industry",
+                     values = c("white", "grey100", "grey96", "grey93",
+                                "grey90", "grey87", "gray84", "tomato4",
+                                "saddlebrown", "tan4",
+                                "darkslategray", "gray20", "gray12"),
+                     labels = gsub("[0-9]+", "", vs_f$Industry_s)) +  # strip the ordering digits from legend labels
+   labs(title = "(N = 470, \nEmissions Total = 2.8 Gt CO2e)",
+        x = " ", y = " ") +
+   theme(plot.title = element_text(size = 9),
+         legend.title = element_text(size = 7),
+         legend.text = element_text(size = 6)) +
+   scale_y_continuous(breaks = seq(0, 1, 0.2))
Sample composition by Industry
> score <- read.csv("Sample_v4.csv")
> vs1 <- score %>% data.frame() %>%   # requires dplyr for %>% and select()
+   select(c(Organization, lnCI, Industry, Year))
>
> # reorder() has no .desc argument; negate lnCI to sort industries by descending median
> ggplot(vs1, aes(x = reorder(Industry, -lnCI, FUN = median), y = lnCI, fill = Industry)) +
+   geom_boxplot() +
+   scale_fill_manual(values = c("white", "lightpink4", "lightpink4", "white",
+                                "lightpink4", "white", "white", "lightpink4",
+                                "lightpink4", "lightpink4", "white", "white", "lightpink4"),
+                     guide = "none") +
+   labs(x = " ", y = "Carbon intensity (tCO2e/mn$)") +
+   theme(axis.text.x = element_text(angle = 70, hjust = 0.5, vjust = 1))
Emissions intensity by Industry
Luo, L., & Tang, Q. (2014). Does voluntary carbon disclosure reflect underlying carbon performance? Journal of Contemporary Accounting & Economics, 10(3), 191–205. https://doi.org/10.1016/j.jcae.2014.08.003
Maurage, P., Heeren, A., & Pesenti, M. (2013). Does chocolate consumption really boost Nobel award chances? The peril of over-interpreting correlations in health studies. The Journal of Nutrition, 143(6), 931–933. https://doi.org/10.3945/jn.113.174813
Stock, J. H., & Watson, M. W. (2011). Introduction to econometrics (3rd ed.). Addison-Wesley Longman.