Linear regression:

We are estimating a model based on fitting a line that minimizes the squared errors!
This means that the linear regression model finds the line that minimizes the (squared) distance between it and the observations
It looks something like this:

\[ y_i = \alpha+x_1\beta_i + x_2\beta_i + x_3\beta_i + \varepsilon_i \]

Interpretation:

The intercept (also known as constant), gives us the predicted value for the dependent variable when all the independent variables equal zero.
The \(\beta\)’s tell us the marginal effect/slope associated with each variable (what would one-unit change in each independent variable mean in terms of change in values of the dependent variable?)
Inference regarding significance is based on the \(t\) value (greater than 2 in absolute value?), on the \(p\)-value (smaller than 0.05?) etc.
The \(p\)-value is the risk we are willing to make in rejecting the null-hypothesis by mistake
Therefore, smaller \(p\)-values indicate higher confidence that we are making the correct decision in rejecting the null.

\(R^2\) or the coefficient of determination

One way to evaluate model fit is to think about how close the line falls relative to the data
In other words, how small the residual (errors) are
To measure model fit, we sum the squared errors \(SS_{res}=\sum e^2_i\)
And compare this sum to the variation in the dependent variable \(SS_{tot}=\sum (y_i-\bar{y})^2\)
Like this: \(R^2=1-\frac{SS_{res}}{SS_{tot}}\)
Higher values (up to 1) indicate that the model is explaining more of the variation in the dependent variable
But \(R^2\) are at least weakly increasing with more predictors/variables

Prepare our data

library("ggplot2")
library("tidyverse")
library("gtsummary")

load("C:/Users/vxs15qru/OneDrive - University of East Anglia/PPLX7012A 21-22/data/leaders.Rdata")

Testing differences between two groups

So far, we’ve looked at regression w/ 2 continuous variables
Sometimes, our independent variable can be nominal
For example: gender differences; differences between Home v. EU /RoW students; religious vs. non-religious; conservative vs. liberals etc.

The t-test

When testing for difference between groups, we use a t-test
This gives us a \(t-value\), and a corresponding \(p-value\) (larger t-values correspond to smaller p values)
The null hypothesis is that there are no differences between groups
if the \(p-value\) is “very small”, we reject this null

Hypothesis

There is a difference in the tenure of male and female leaders
Independent variable: gender
Dependent variable: tenure in months

The data

str(leaders)

## 'data.frame':    497 obs. of  6 variables:
##  $ country: Factor w/ 12 levels "AUS","AUT","BEL",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ name   : Factor w/ 473 levels "(death)","(party merger)",..: 13 110 316 242 457 196 331 96 180 211 ...
##  $ party  : Factor w/ 82 levels "","Ã–VP","ALP",..: 15 15 15 15 15 15 15 15 36 36 ...
##  $ gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 2 2 2 2 2 ...
##  $ age    : num  60 49 49 47 36 47 62 39 46 64 ...
##  $ length : num  20 114 170 78 50 25 25 NA 156 56 ...

Let’s check

leaders%>%
  group_by(gender)%>%
  summarise(mean_tenure=mean(length, na.rm=T))

## # A tibble: 2 x 2
##   gender mean_tenure
##   <fct>        <dbl>
## 1 Female        44.1
## 2 Male          68.4

Let’s plot

ggplot(leaders,aes(y=length, x=gender))+ 
  geom_boxplot()+
  scale_x_discrete("Sex of Leader") +
  scale_y_continuous(breaks =seq(0,560,40),"Time in month of party leader tenure") +
  stat_summary(fun=mean, geom="point", shape=20, color="red",
               fill="red")+
  theme_minimal()

Let’s test

len_gen_t<- t.test(length ~ gender,data = leaders, var.equal=T)
len_gen_t

## 
##  Two Sample t-test
## 
## data:  length by gender
## t = -2.3897, df = 428, p-value = 0.0173
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -44.271149  -4.311499
## sample estimates:
## mean in group Female   mean in group Male 
##             44.11765             68.40897

Which looks a lot like…

len_gen_lm<- lm(length ~ gender,data = leaders)
summary(len_gen_lm)

## 
## Call:
## lm(formula = length ~ gender, data = leaders)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.41 -43.41 -20.41  16.81 473.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   44.118      9.543   4.623 5.02e-06 ***
## genderMale    24.291     10.165   2.390   0.0173 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.15 on 428 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.01317,    Adjusted R-squared:  0.01086 
## F-statistic: 5.711 on 1 and 428 DF,  p-value: 0.0173

Nominal variables

\(R\) lets us run models with nominal variables
But we need to make sure they are factor variables, for example:

leaders$gender<- as.factor(leaders$gender)
# and we can check
class(leaders$gender)

## [1] "factor"

So, now we have two categories in our variable, and the reference category is female.

Interpretation

The coefficient attached to this variable gives you the predicted value of the dependent variable for observations with this value
The reference category is subsumed within the intercept

Let’s examine this again, shall we…

len_gen_lm<- lm(length ~ gender,data = leaders)
summary(len_gen_lm)

## 
## Call:
## lm(formula = length ~ gender, data = leaders)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.41 -43.41 -20.41  16.81 473.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   44.118      9.543   4.623 5.02e-06 ***
## genderMale    24.291     10.165   2.390   0.0173 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.15 on 428 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.01317,    Adjusted R-squared:  0.01086 
## F-statistic: 5.711 on 1 and 428 DF,  p-value: 0.0173

More nominal variables

As some of you have probably noted, sometimes we have nominal variables with more than two values
Examples: country, party
Here, we pick a reference category
All other values become dummy variables

A basic model, not very interesting

len_cou_lm<- lm(length ~ country,data = leaders)
summary(len_cou_lm)

## 
## Call:
## lm(formula = length ~ country, data = leaders)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81.24 -40.22 -20.06  19.87 487.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.1951    10.6826   5.354 1.42e-07 ***
## countryAUT    8.3674    16.1348   0.519   0.6043    
## countryBEL   -1.3156    13.0571  -0.101   0.9198    
## countryCAN   20.1382    17.5804   1.145   0.2527    
## countryDEU   -2.2259    13.6418  -0.163   0.8705    
## countryESP   23.9185    14.8477   1.611   0.1080    
## countryGBR   16.6310    17.8198   0.933   0.3512    
## countryHUN  -18.2951    24.1247  -0.758   0.4487    
## countryISR   15.1049    18.6563   0.810   0.4186    
## countryITA    0.6549    18.6563   0.035   0.9720    
## countryNOR   26.0402    14.3478   1.815   0.0703 .  
## countryROU   11.2166    19.7318   0.568   0.5700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.4 on 418 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.02917,    Adjusted R-squared:  0.003618 
## F-statistic: 1.142 on 11 and 418 DF,  p-value: 0.3269

We can choose the reference category

leaders$country<- as.factor(leaders$country)#making sure
leaders$country<- relevel(leaders$country, "GBR")

And estimate our model

len_cou_lm<- lm(length ~ country,data = leaders)
summary(len_cou_lm)

## 
## Call:
## lm(formula = length ~ country, data = leaders)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81.24 -40.22 -20.06  19.87 487.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   73.826     14.263   5.176 3.53e-07 ***
## countryAUS   -16.631     17.820  -0.933    0.351    
## countryAUT    -8.264     18.699  -0.442    0.659    
## countryBEL   -17.947     16.118  -1.113    0.266    
## countryCAN     3.507     19.959   0.176    0.861    
## countryDEU   -18.857     16.595  -1.136    0.256    
## countryESP     7.288     17.600   0.414    0.679    
## countryHUN   -34.926     25.910  -1.348    0.178    
## countryISR    -1.526     20.913  -0.073    0.942    
## countryITA   -15.976     20.913  -0.764    0.445    
## countryNOR     9.409     17.180   0.548    0.584    
## countryROU    -5.414     21.878  -0.247    0.805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.4 on 418 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.02917,    Adjusted R-squared:  0.003618 
## F-statistic: 1.142 on 11 and 418 DF,  p-value: 0.3269

Important for making a nuanced argument

Sometimes we want to make conditional hypotheses
Examples:

“The more x, the more y, but only under condition z”

“The positive effect of x on y will be greater the higher z”

“the effect of economic development on levels of democracy is positive, but only when the country is not a producer of oil”

The effect of one independent variable on the dependent variable depends on the level of another independent variable
Make sure you understand this

Let’s set this up

load("C:/Users/vxs15qru/OneDrive - University of East Anglia/PPLX7012A 19-20/labs/data/Fish_correlation.RData")

Let’s set this up (2)

class(dat$OPEC)# What class do we need it to be?

## [1] "integer"

##Now generate the factor variable
dat<-dat%>%
  mutate(OPEC=case_when(OPEC==1 ~ "OPEC member",
                               OPEC==0 ~ "Not an OPEC member"))

Let’s plot the relationship

ggplot(dat, aes(x=GDP90LGN, y=POLITY, color=OPEC))+  
  geom_point()+  
  scale_x_continuous("Polity scores")+  
  scale_y_continuous("GDP, 1990s Dollars")+
  ggtitle("Democracy Level and GDP",
          subtitle = "Color indicate OPEC membership")+
  theme_bw()

Let’s add trend lines

dat%>%
ggplot(aes(x=GDP90LGN, y=POLITY, color=OPEC))+  
  geom_point()+  
  geom_smooth(method="lm", se=FALSE)+
  scale_x_continuous("Polity scores")+  
  scale_y_continuous("GDP, 1990s Dollars")+
  ggtitle("Democracy Level and GDP",
          subtitle = "Color indicate OPEC membership")+
  theme_bw()

These lines capture the relationship between GDP and democracy, conditional on OPEC membership

But… how do I specify an interaction/conditional relationship in a regression model?

Simple, include the terms that interact and their multiplication

Let’s estimate and interpret the model

polity_interact<-summary(lm(POLITY~GDP90LGN+OPEC+GDP90LGN*OPEC, data=dat))
polity_interact

## 
## Call:
## lm(formula = POLITY ~ GDP90LGN + OPEC + GDP90LGN * OPEC, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9633  -3.4897   0.5645   3.8950  14.0723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -15.649      3.358  -4.661 1.16e-05 ***
## GDP90LGN                    5.687      1.027   5.537 3.37e-07 ***
## OPECOPEC member            15.853     10.985   1.443    0.153    
## GDP90LGN:OPECOPEC member   -7.431      3.108  -2.391    0.019 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.556 on 85 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3876, Adjusted R-squared:  0.366 
## F-statistic: 17.94 on 3 and 85 DF,  p-value: 4.168e-09

Interpretation of interaction modelS:

When none of the terms equals zero we use all terms to make our prediction, i.e

\(y = a + \beta_1\times x_1 + \beta_2\times x_2 + \beta_3\times x_1 \times x_2\),

or, in our case:

POLITY = a + \(\beta_1\times\) GDP +\(\beta_2\times\) OPEC+ \(\beta_3\times\) GDP\(\times\) OPEC

But when the country isn’t a member (OPEC\(=0\)) …

\(y = a + \beta_1\times x_1\)

or,

POLITY = a + \(\beta_1\times\) GDP

In practice

Let’s use the model estimates and make a ‘prediction’ about a specific country:

polity_interact<-summary(lm(POLITY~GDP90LGN+OPEC+GDP90LGN*OPEC, data=dat))
tbl_regression(polity_interact, intercept = T)

Characteristic	Beta	95% CI¹	p-value
(Intercept)	-16	-22, -9.0	<0.001
GDP90LGN	5.7	3.6, 7.7	<0.001
OPEC
Not an OPEC member	—	—
OPEC member	16	-6.0, 38	0.2
GDP90LGN * OPEC
GDP90LGN * OPEC member	-7.4	-14, -1.3	0.019
¹ CI = Confidence Interval

Predictions

Remember a country called Nambia?¹ Suppose Nambia is an OPEC member, with a GDP value of 4

\(\hat{y}=-16 + 5.7 \times 4 + 16 \times 1 + (-7.4 \times 4 \times 1)\)

\(\hat{y}=-6.8\)
Now, let’s Suppose the same Nambia is not an OPEC member:

\(\hat{y}=-16 + 5.7\times 4\)

Now, \(\hat{y}=6.8\)
AKA, the “Oil Curse”

More Regression

What did we talk about last week?

Linear regression:

Interpretation:

\(R^2\) or the coefficient of determination

Introducing t-tests

Prepare our data

Testing differences between two groups

The t-test

Hypothesis

The data

Let’s check

Let’s plot

Let’s plot

Let’s test

Which looks a lot like…

Dummy variables

Nominal variables

Interpretation

Let’s examine this again, shall we…

More nominal variables

A basic model, not very interesting

We can choose the reference category

And estimate our model

Interactions

Important for making a nuanced argument

Let’s set this up

Let’s set this up (2)

Let’s plot the relationship

Produces

Let’s add trend lines

But… how do I specify an interaction/conditional relationship in a regression model?

Let’s estimate and interpret the model

Interpretation of interaction modelS:

In practice

Predictions