More Regression

Dr Eitan Tzelgov

What did we talk about last week?

Linear regression:

\[ y_i = \alpha+x_1\beta_i + x_2\beta_i + x_3\beta_i + \varepsilon_i \]

Interpretation:

\(R^2\) or the coefficient of determination

Introducing t-tests

Prepare our data

library("ggplot2")
library("tidyverse")
library("gtsummary")

load("C:/Users/vxs15qru/OneDrive - University of East Anglia/PPLX7012A 21-22/data/leaders.Rdata")

Testing differences between two groups

The t-test

Hypothesis

The data

str(leaders)
## 'data.frame':    497 obs. of  6 variables:
##  $ country: Factor w/ 12 levels "AUS","AUT","BEL",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ name   : Factor w/ 473 levels "(death)","(party merger)",..: 13 110 316 242 457 196 331 96 180 211 ...
##  $ party  : Factor w/ 82 levels "","ÖVP","ALP",..: 15 15 15 15 15 15 15 15 36 36 ...
##  $ gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 2 2 2 2 2 ...
##  $ age    : num  60 49 49 47 36 47 62 39 46 64 ...
##  $ length : num  20 114 170 78 50 25 25 NA 156 56 ...

Let’s check

leaders%>%
  group_by(gender)%>%
  summarise(mean_tenure=mean(length, na.rm=T))
## # A tibble: 2 x 2
##   gender mean_tenure
##   <fct>        <dbl>
## 1 Female        44.1
## 2 Male          68.4

Let’s plot

ggplot(leaders,aes(y=length, x=gender))+ 
  geom_boxplot()+
  scale_x_discrete("Sex of Leader") +
  scale_y_continuous(breaks =seq(0,560,40),"Time in month of party leader tenure") +
  stat_summary(fun=mean, geom="point", shape=20, color="red",
               fill="red")+
  theme_minimal()

Let’s plot

Let’s test

len_gen_t<- t.test(length ~ gender,data = leaders, var.equal=T)
len_gen_t
## 
##  Two Sample t-test
## 
## data:  length by gender
## t = -2.3897, df = 428, p-value = 0.0173
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -44.271149  -4.311499
## sample estimates:
## mean in group Female   mean in group Male 
##             44.11765             68.40897

Which looks a lot like…

len_gen_lm<- lm(length ~ gender,data = leaders)
summary(len_gen_lm)
## 
## Call:
## lm(formula = length ~ gender, data = leaders)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.41 -43.41 -20.41  16.81 473.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   44.118      9.543   4.623 5.02e-06 ***
## genderMale    24.291     10.165   2.390   0.0173 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.15 on 428 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.01317,    Adjusted R-squared:  0.01086 
## F-statistic: 5.711 on 1 and 428 DF,  p-value: 0.0173

Dummy variables

Nominal variables

leaders$gender<- as.factor(leaders$gender)
# and we can check
class(leaders$gender)
## [1] "factor"

Interpretation

Let’s examine this again, shall we…

len_gen_lm<- lm(length ~ gender,data = leaders)
summary(len_gen_lm)
## 
## Call:
## lm(formula = length ~ gender, data = leaders)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.41 -43.41 -20.41  16.81 473.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   44.118      9.543   4.623 5.02e-06 ***
## genderMale    24.291     10.165   2.390   0.0173 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.15 on 428 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.01317,    Adjusted R-squared:  0.01086 
## F-statistic: 5.711 on 1 and 428 DF,  p-value: 0.0173

More nominal variables

A basic model, not very interesting

len_cou_lm<- lm(length ~ country,data = leaders)
summary(len_cou_lm)
## 
## Call:
## lm(formula = length ~ country, data = leaders)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81.24 -40.22 -20.06  19.87 487.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.1951    10.6826   5.354 1.42e-07 ***
## countryAUT    8.3674    16.1348   0.519   0.6043    
## countryBEL   -1.3156    13.0571  -0.101   0.9198    
## countryCAN   20.1382    17.5804   1.145   0.2527    
## countryDEU   -2.2259    13.6418  -0.163   0.8705    
## countryESP   23.9185    14.8477   1.611   0.1080    
## countryGBR   16.6310    17.8198   0.933   0.3512    
## countryHUN  -18.2951    24.1247  -0.758   0.4487    
## countryISR   15.1049    18.6563   0.810   0.4186    
## countryITA    0.6549    18.6563   0.035   0.9720    
## countryNOR   26.0402    14.3478   1.815   0.0703 .  
## countryROU   11.2166    19.7318   0.568   0.5700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.4 on 418 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.02917,    Adjusted R-squared:  0.003618 
## F-statistic: 1.142 on 11 and 418 DF,  p-value: 0.3269

We can choose the reference category

leaders$country<- as.factor(leaders$country)#making sure
leaders$country<- relevel(leaders$country, "GBR")

And estimate our model

len_cou_lm<- lm(length ~ country,data = leaders)
summary(len_cou_lm)
## 
## Call:
## lm(formula = length ~ country, data = leaders)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81.24 -40.22 -20.06  19.87 487.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   73.826     14.263   5.176 3.53e-07 ***
## countryAUS   -16.631     17.820  -0.933    0.351    
## countryAUT    -8.264     18.699  -0.442    0.659    
## countryBEL   -17.947     16.118  -1.113    0.266    
## countryCAN     3.507     19.959   0.176    0.861    
## countryDEU   -18.857     16.595  -1.136    0.256    
## countryESP     7.288     17.600   0.414    0.679    
## countryHUN   -34.926     25.910  -1.348    0.178    
## countryISR    -1.526     20.913  -0.073    0.942    
## countryITA   -15.976     20.913  -0.764    0.445    
## countryNOR     9.409     17.180   0.548    0.584    
## countryROU    -5.414     21.878  -0.247    0.805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.4 on 418 degrees of freedom
##   (67 observations deleted due to missingness)
## Multiple R-squared:  0.02917,    Adjusted R-squared:  0.003618 
## F-statistic: 1.142 on 11 and 418 DF,  p-value: 0.3269

Interactions

Important for making a nuanced argument

“The more x, the more y, but only under condition z”

“The positive effect of x on y will be greater the higher z”

“the effect of economic development on levels of democracy is positive, but only when the country is not a producer of oil”

Let’s set this up

load("C:/Users/vxs15qru/OneDrive - University of East Anglia/PPLX7012A 19-20/labs/data/Fish_correlation.RData")

Let’s set this up (2)

class(dat$OPEC)# What class do we need it to be?
## [1] "integer"
##Now generate the factor variable
dat<-dat%>%
  mutate(OPEC=case_when(OPEC==1 ~ "OPEC member",
                               OPEC==0 ~ "Not an OPEC member"))

Let’s plot the relationship

ggplot(dat, aes(x=GDP90LGN, y=POLITY, color=OPEC))+  
  geom_point()+  
  scale_x_continuous("Polity scores")+  
  scale_y_continuous("GDP, 1990s Dollars")+
  ggtitle("Democracy Level and GDP",
          subtitle = "Color indicate OPEC membership")+
  theme_bw()

Produces

Let’s add trend lines

dat%>%
ggplot(aes(x=GDP90LGN, y=POLITY, color=OPEC))+  
  geom_point()+  
  geom_smooth(method="lm", se=FALSE)+
  scale_x_continuous("Polity scores")+  
  scale_y_continuous("GDP, 1990s Dollars")+
  ggtitle("Democracy Level and GDP",
          subtitle = "Color indicate OPEC membership")+
  theme_bw()

But… how do I specify an interaction/conditional relationship in a regression model?

Let’s estimate and interpret the model

polity_interact<-summary(lm(POLITY~GDP90LGN+OPEC+GDP90LGN*OPEC, data=dat))
polity_interact
## 
## Call:
## lm(formula = POLITY ~ GDP90LGN + OPEC + GDP90LGN * OPEC, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9633  -3.4897   0.5645   3.8950  14.0723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -15.649      3.358  -4.661 1.16e-05 ***
## GDP90LGN                    5.687      1.027   5.537 3.37e-07 ***
## OPECOPEC member            15.853     10.985   1.443    0.153    
## GDP90LGN:OPECOPEC member   -7.431      3.108  -2.391    0.019 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.556 on 85 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3876, Adjusted R-squared:  0.366 
## F-statistic: 17.94 on 3 and 85 DF,  p-value: 4.168e-09

Interpretation of interaction modelS:

\(y = a + \beta_1\times x_1 + \beta_2\times x_2 + \beta_3\times x_1 \times x_2\),

or, in our case:

POLITY = a + \(\beta_1\times\) GDP +\(\beta_2\times\) OPEC+ \(\beta_3\times\) GDP\(\times\) OPEC

\(y = a + \beta_1\times x_1\)

or,

POLITY = a + \(\beta_1\times\) GDP

In practice

Let’s use the model estimates and make a ‘prediction’ about a specific country:

polity_interact<-summary(lm(POLITY~GDP90LGN+OPEC+GDP90LGN*OPEC, data=dat))
tbl_regression(polity_interact, intercept = T)
Characteristic Beta 95% CI1 p-value
(Intercept) -16 -22, -9.0 <0.001
GDP90LGN 5.7 3.6, 7.7 <0.001
OPEC
    Not an OPEC member
    OPEC member 16 -6.0, 38 0.2
GDP90LGN * OPEC
    GDP90LGN * OPEC member -7.4 -14, -1.3 0.019
1 CI = Confidence Interval

Predictions

\(\hat{y}=-16 + 5.7 \times 4 + 16 \times 1 + (-7.4 \times 4 \times 1)\)

\(\hat{y}=-16 + 5.7\times 4\)


  1. if not, google Trump↩︎