Introduction

sub heading

An illustration of R Markdown….

What’s the link between foreigners and crime?

Illustration

To illustrate your story you can include images (even of thugs who attack foreigners)

You can also embedd tweets. To illustrate that, let’s hear some remarks from Nigel Farage (The patron saint of the Farage Garage):

Farage Garage

OK some more serious work now

Now let’s load some data. To do that you can include chunks of are code like this:

ff=read.csv("https://www.dropbox.com/s/g1w75gkw7g91zef/foreigners.csv?dl=1")

This loads the local authority dataset we have seen before. Note that you can include inline are code as well. For instance: the dataset has 348 observations and contains 5 variables.

As before we can summarise the data:

summary(ff)

##        X             crimes11          b_migr11          pop11        
##  Min.   :  1.00   Min.   :   1134   Min.   : 2.241   Min.   :   2203  
##  1st Qu.: 87.75   1st Qu.: 107618   1st Qu.: 4.899   1st Qu.:  94263  
##  Median :174.50   Median : 160556   Median : 7.603   Median : 125746  
##  Mean   :174.50   Mean   : 236988   Mean   :11.226   Mean   : 161434  
##  3rd Qu.:261.25   3rd Qu.: 309377   3rd Qu.:12.382   3rd Qu.: 200247  
##  Max.   :348.00   Max.   :1714295   Max.   :55.161   Max.   :1072372  
##                   NA's   :24        NA's   :9        NA's   :9        
##      area          
##  Length:348        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Note, you might want to see the output of a command in your final document, but you might not want to see the command. Just do it like this:

##        X             crimes11          b_migr11          pop11        
##  Min.   :  1.00   Min.   :   1134   Min.   : 2.241   Min.   :   2203  
##  1st Qu.: 87.75   1st Qu.: 107618   1st Qu.: 4.899   1st Qu.:  94263  
##  Median :174.50   Median : 160556   Median : 7.603   Median : 125746  
##  Mean   :174.50   Mean   : 236988   Mean   :11.226   Mean   : 161434  
##  3rd Qu.:261.25   3rd Qu.: 309377   3rd Qu.:12.382   3rd Qu.: 200247  
##  Max.   :348.00   Max.   :1714295   Max.   :55.161   Max.   :1072372  
##                   NA's   :24        NA's   :9        NA's   :9        
##      area          
##  Length:348        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Warning: Removed 24 rows containing missing values (geom_point).

Let’s get rid of outliers…

#ff= filter(ff,crimesPc<15)

ff= ff %>% filter(crimesPc<15)

…and do some other stuff to make it look nicer..

ggplot(ff, aes(y=crimesPc,x=b_migr11)) + geom_point()   +   # Make sure + is on this line so R understands the command is not finished
                   xlab("Share of foreign born in %")   +  
                   ylab("Number of Crimes per capita")  + 
                   geom_smooth(method = "lm", se = FALSE) +
                   theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

Running regressions:

We said that putting in a trend line in a scatter plot is a way of estimating an econometric model that describes the relationship between the dependent (or outcome) variable on the Y axis and an explanatory variable on the x axis.
If you want a computer to do this for you (rather take out a ruler and a pen) you need a precise algorithm.
The most commonly used algorithm for that is called Ordinary Least Squares estimator (OLS).

r1=lm(crimesPc~b_migr11,ff)  
# Note we can assign the output of a regression to a variable...

summary(r1)

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ff)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.13314 -0.33959 -0.06763  0.22302  2.92572 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.091273   0.045146   24.17  < 2e-16 ***
## b_migr11    0.025164   0.002922    8.61 3.33e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5482 on 321 degrees of freedom
## Multiple R-squared:  0.1876, Adjusted R-squared:  0.1851 
## F-statistic: 74.14 on 1 and 321 DF,  p-value: 3.325e-16

# ...and then refer back to this variable in subsequent commands.
# You can check how the output of the lm command looks like by clicking on r1 in the 
# variable browser.

Note this identifies a linear model of the form \[ Y_i=\beta_0 + \beta_1 \times X_i +\epsilon_i\] where \(Y_i\) is Crimes per capita and \(X_i\) is the share of foreigners in %. This also shows you how you can integrate formulas in a markdown document and how easily format text to make it italic. More on this under this link.
Now this of course is the true model. What we get out of the OLS procedure is an estimate of the above model; i.e. \[ Y_i=\hat{\beta}_0 + \hat{\beta}_1 \times X_i +\hat{\epsilon}_i\] which given the above R results becomes

\[ Y_i=1.091 + 0.025 \times X_i +\hat{\epsilon}_i\]

where we round the results up to 3 digit precision. Check out the code for this in the underlying Rmd file as it also shows how you can refer back to a previously caclulated result within an inline formula; i.e. rather than manually typing the value for \(\beta_0\) and \(\beta_1\) I let R work it out from the lm command.

Implication of OLS

An important implication of the OLS estimator is that it works out the \(\beta\) parameters in a way that sets correlation between \(\hat{\epsilon}\) and \(X\) equal to 0. To check that this is the case we extract the \(\epsilon\) from the r1 variable and add it to our ff dataframe.

ff['residuals']=r1$residuals  
#Note this is an alternative way of assigning a new variable to a dataframe.
#Alternatively use:
ff=ff %>% mutate(residualsx1=r1$residuals)

Note that the correlation is indeed 0:

  ff %>% select(residuals,b_migr11, pop11)  %>% cor(use="complete.obs")

##              residuals     b_migr11       pop11
## residuals 1.000000e+00 3.310480e-17 0.003356407
## b_migr11  3.310480e-17 1.000000e+00 0.333636733
## pop11     3.356407e-03 3.336367e-01 1.000000000

  10^17

## [1] 1e+17

Which we can also see in a scatterplot:

Merging data

A big part of applied data analysis involves combining different bit of data. For that you need merge or join commands. Here is an example where we add more area level variables to our foreigners dataset:

library(dplyr)
ff_raw=read.csv("https://www.dropbox.com/s/g1w75gkw7g91zef/foreigners.csv?dl=1")  




ff_more=read.csv("https://www.dropbox.com/s/gwq2wmmxr8s3v7t/foreigners_more.csv?dl=1")  

names(ff_more)

##  [1] "X"         "urate2010" "urate2012" "urate2004" "urate2011" "area"     
##  [7] "meanage"   "medianage" "region"    "pct_leave" "mus_sh"    "citshare" 
## [13] "urbshare"

inner=ff_raw %>% inner_join(ff_more,by="area")
inner=ff_raw %>% full_join(ff_more,by="area")

inner=ff %>% inner_join(ff_more,by="area")
full=ff %>% full_join(ff_more,by="area")

Dataframe ff has 323 observations. Dataframe ff_more has 348. The inner join has 323, the full join 348.

Doing stuff by region and how to use functions

One of the variables we merged earlier includes a region variable. The following command counts the number of observations for every region.

inner %>% group_by(region) %>% summarise(n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 10 x 2
##    region                   `n()`
##    <chr>                    <int>
##  1 East                        43
##  2 East Midlands               40
##  3 London                      32
##  4 North East                  10
##  5 North West                  39
##  6 South East                  64
##  7 South West                  32
##  8 Wales                       22
##  9 West Midlands               27
## 10 Yorkshire and The Humber    14

Define the function:

plotter=function(r){
  
  ffx=inner %>% filter(region==r)
  
  plot=ggplot(ffx, aes(x = b_migr11, y = crimesPc)) +
    geom_smooth(method = "lm", se = FALSE) +
    geom_point()+theme_minimal()+
    xlab("% foreign born")+ylab("Crimes per capita")+
    ggtitle(r)
  
  
  reg=(lm(crimesPc~b_migr11,ffx))
  return(list(plot,reg))
}

Call the function:

p=plotter("London") 
p[[1]];summary(p[[2]])

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00220 -0.35611 -0.06671  0.18426  2.64380 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.62044    0.39045   1.589  0.12253   
## b_migr11     0.03282    0.01025   3.202  0.00322 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6392 on 30 degrees of freedom
## Multiple R-squared:  0.2547, Adjusted R-squared:  0.2298 
## F-statistic: 10.25 on 1 and 30 DF,  p-value: 0.003225

p=plotter("East")
p[[1]];summary(p[[2]])

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8849 -0.3370 -0.2046  0.1440  2.4343 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.80217    0.19602   4.092 0.000195 ***
## b_migr11     0.04982    0.01579   3.156 0.002997 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6717 on 41 degrees of freedom
## Multiple R-squared:  0.1954, Adjusted R-squared:  0.1758 
## F-statistic: 9.959 on 1 and 41 DF,  p-value: 0.002997

p=plotter("South West")
p[[1]];summary(p[[2]])

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99207 -0.34065 -0.06424  0.28628  1.47611 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.48027    0.23850   2.014  0.05308 . 
## b_migr11     0.09821    0.03040   3.231  0.00299 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5407 on 30 degrees of freedom
## Multiple R-squared:  0.2581, Adjusted R-squared:  0.2334 
## F-statistic: 10.44 on 1 and 30 DF,  p-value: 0.002991

Or maybe doing it with a loop?

regions=inner$region %>% unique()



for(rrr in regions){

  p=plotter(rrr)
  print(p[[1]])
  print(summary(p[[2]]))

}

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95549 -0.28291  0.01332  0.33373  0.82333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.04807    0.31537   3.323   0.0105 *
## b_migr11     0.07821    0.05092   1.536   0.1631  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5278 on 8 degrees of freedom
## Multiple R-squared:  0.2278, Adjusted R-squared:  0.1312 
## F-statistic: 2.359 on 1 and 8 DF,  p-value: 0.1631

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96912 -0.14052 -0.01849  0.21011  0.95806 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.86423    0.11455   7.544 5.40e-09 ***
## b_migr11     0.07270    0.01416   5.134 9.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3849 on 37 degrees of freedom
## Multiple R-squared:  0.416,  Adjusted R-squared:  0.4003 
## F-statistic: 26.36 on 1 and 37 DF,  p-value: 9.308e-06

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65522 -0.19623 -0.08563  0.09058  0.71980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.53085    0.26058   5.875 7.54e-05 ***
## b_migr11     0.01987    0.03042   0.653    0.526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3813 on 12 degrees of freedom
## Multiple R-squared:  0.03433,    Adjusted R-squared:  -0.04614 
## F-statistic: 0.4266 on 1 and 12 DF,  p-value: 0.526

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53908 -0.26282 -0.07753  0.21221  0.85058 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.88010    0.09965   8.832 9.57e-11 ***
## b_migr11     0.05044    0.01023   4.931 1.65e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3752 on 38 degrees of freedom
## Multiple R-squared:  0.3902, Adjusted R-squared:  0.3741 
## F-statistic: 24.31 on 1 and 38 DF,  p-value: 1.649e-05

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16156 -0.39571 -0.13070  0.00326  2.72541 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.35201    0.28400   4.761 6.93e-05 ***
## b_migr11     0.01332    0.02961   0.450    0.657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7865 on 25 degrees of freedom
## Multiple R-squared:  0.008033,   Adjusted R-squared:  -0.03165 
## F-statistic: 0.2025 on 1 and 25 DF,  p-value: 0.6566

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99207 -0.34065 -0.06424  0.28628  1.47611 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.48027    0.23850   2.014  0.05308 . 
## b_migr11     0.09821    0.03040   3.231  0.00299 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5407 on 30 degrees of freedom
## Multiple R-squared:  0.2581, Adjusted R-squared:  0.2334 
## F-statistic: 10.44 on 1 and 30 DF,  p-value: 0.002991

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8849 -0.3370 -0.2046  0.1440  2.4343 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.80217    0.19602   4.092 0.000195 ***
## b_migr11     0.04982    0.01579   3.156 0.002997 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6717 on 41 degrees of freedom
## Multiple R-squared:  0.1954, Adjusted R-squared:  0.1758 
## F-statistic: 9.959 on 1 and 41 DF,  p-value: 0.002997

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59422 -0.27411 -0.08116  0.17012  1.70115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.759918   0.122436   6.207 4.96e-08 ***
## b_migr11    0.044783   0.009276   4.828 9.40e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.431 on 62 degrees of freedom
## Multiple R-squared:  0.2732, Adjusted R-squared:  0.2615 
## F-statistic: 23.31 on 1 and 62 DF,  p-value: 9.397e-06

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00220 -0.35611 -0.06671  0.18426  2.64380 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.62044    0.39045   1.589  0.12253   
## b_migr11     0.03282    0.01025   3.202  0.00322 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6392 on 30 degrees of freedom
## Multiple R-squared:  0.2547, Adjusted R-squared:  0.2298 
## F-statistic: 10.25 on 1 and 30 DF,  p-value: 0.003225

## 
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54065 -0.21328  0.00295  0.16357  0.56448 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.83100    0.14682    5.66 1.54e-05 ***
## b_migr11     0.08626    0.02704    3.19   0.0046 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3088 on 20 degrees of freedom
## Multiple R-squared:  0.3372, Adjusted R-squared:  0.3041 
## F-statistic: 10.18 on 1 and 20 DF,  p-value: 0.0046

```