An illustration of R Markdown….
What’s the link between foreigners and crime?
To illustrate your story you can include images (even of thugs who attack foreigners)
You can also embedd tweets. To illustrate that, let’s hear some remarks from Nigel Farage (The patron saint of the Farage Garage):
Now let’s load some data. To do that you can include chunks of are code like this:
ff=read.csv("https://www.dropbox.com/s/g1w75gkw7g91zef/foreigners.csv?dl=1")
This loads the local authority dataset we have seen before. Note that you can include inline are code as well. For instance: the dataset has 348 observations and contains 5 variables.
As before we can summarise the data:
summary(ff)
## X crimes11 b_migr11 pop11
## Min. : 1.00 Min. : 1134 Min. : 2.241 Min. : 2203
## 1st Qu.: 87.75 1st Qu.: 107618 1st Qu.: 4.899 1st Qu.: 94263
## Median :174.50 Median : 160556 Median : 7.603 Median : 125746
## Mean :174.50 Mean : 236988 Mean :11.226 Mean : 161434
## 3rd Qu.:261.25 3rd Qu.: 309377 3rd Qu.:12.382 3rd Qu.: 200247
## Max. :348.00 Max. :1714295 Max. :55.161 Max. :1072372
## NA's :24 NA's :9 NA's :9
## area
## Length:348
## Class :character
## Mode :character
##
##
##
##
Note, you might want to see the output of a command in your final document, but you might not want to see the command. Just do it like this:
## X crimes11 b_migr11 pop11
## Min. : 1.00 Min. : 1134 Min. : 2.241 Min. : 2203
## 1st Qu.: 87.75 1st Qu.: 107618 1st Qu.: 4.899 1st Qu.: 94263
## Median :174.50 Median : 160556 Median : 7.603 Median : 125746
## Mean :174.50 Mean : 236988 Mean :11.226 Mean : 161434
## 3rd Qu.:261.25 3rd Qu.: 309377 3rd Qu.:12.382 3rd Qu.: 200247
## Max. :348.00 Max. :1714295 Max. :55.161 Max. :1072372
## NA's :24 NA's :9 NA's :9
## area
## Length:348
## Class :character
## Mode :character
##
##
##
##
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: Removed 24 rows containing missing values (geom_point).
Let’s get rid of outliers…
#ff= filter(ff,crimesPc<15)
ff= ff %>% filter(crimesPc<15)
…and do some other stuff to make it look nicer..
ggplot(ff, aes(y=crimesPc,x=b_migr11)) + geom_point() + # Make sure + is on this line so R understands the command is not finished
xlab("Share of foreign born in %") +
ylab("Number of Crimes per capita") +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
We said that putting in a trend line in a scatter plot is a way of estimating an econometric model that describes the relationship between the dependent (or outcome) variable on the Y axis and an explanatory variable on the x axis.
If you want a computer to do this for you (rather take out a ruler and a pen) you need a precise algorithm.
The most commonly used algorithm for that is called Ordinary Least Squares estimator (OLS).
r1=lm(crimesPc~b_migr11,ff)
# Note we can assign the output of a regression to a variable...
summary(r1)
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.13314 -0.33959 -0.06763 0.22302 2.92572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.091273 0.045146 24.17 < 2e-16 ***
## b_migr11 0.025164 0.002922 8.61 3.33e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5482 on 321 degrees of freedom
## Multiple R-squared: 0.1876, Adjusted R-squared: 0.1851
## F-statistic: 74.14 on 1 and 321 DF, p-value: 3.325e-16
# ...and then refer back to this variable in subsequent commands.
# You can check how the output of the lm command looks like by clicking on r1 in the
# variable browser.
Note this identifies a linear model of the form \[ Y_i=\beta_0 + \beta_1 \times X_i +\epsilon_i\] where \(Y_i\) is Crimes per capita and \(X_i\) is the share of foreigners in %. This also shows you how you can integrate formulas in a markdown document and how easily format text to make it italic. More on this under this link.
Now this of course is the true model. What we get out of the OLS procedure is an estimate of the above model; i.e. \[ Y_i=\hat{\beta}_0 + \hat{\beta}_1 \times X_i +\hat{\epsilon}_i\] which given the above R results becomes
\[ Y_i=1.091 + 0.025 \times X_i +\hat{\epsilon}_i\]
where we round the results up to 3 digit precision. Check out the code for this in the underlying Rmd file as it also shows how you can refer back to a previously caclulated result within an inline formula; i.e. rather than manually typing the value for \(\beta_0\) and \(\beta_1\) I let R work it out from the lm command.
An important implication of the OLS estimator is that it works out the \(\beta\) parameters in a way that sets correlation between \(\hat{\epsilon}\) and \(X\) equal to 0. To check that this is the case we extract the \(\epsilon\) from the r1 variable and add it to our ff dataframe.
ff['residuals']=r1$residuals
#Note this is an alternative way of assigning a new variable to a dataframe.
#Alternatively use:
ff=ff %>% mutate(residualsx1=r1$residuals)
Note that the correlation is indeed 0:
ff %>% select(residuals,b_migr11, pop11) %>% cor(use="complete.obs")
## residuals b_migr11 pop11
## residuals 1.000000e+00 3.310480e-17 0.003356407
## b_migr11 3.310480e-17 1.000000e+00 0.333636733
## pop11 3.356407e-03 3.336367e-01 1.000000000
10^17
## [1] 1e+17
Which we can also see in a scatterplot:
A big part of applied data analysis involves combining different bit of data. For that you need merge or join commands. Here is an example where we add more area level variables to our foreigners dataset:
library(dplyr)
ff_raw=read.csv("https://www.dropbox.com/s/g1w75gkw7g91zef/foreigners.csv?dl=1")
ff_more=read.csv("https://www.dropbox.com/s/gwq2wmmxr8s3v7t/foreigners_more.csv?dl=1")
names(ff_more)
## [1] "X" "urate2010" "urate2012" "urate2004" "urate2011" "area"
## [7] "meanage" "medianage" "region" "pct_leave" "mus_sh" "citshare"
## [13] "urbshare"
inner=ff_raw %>% inner_join(ff_more,by="area")
inner=ff_raw %>% full_join(ff_more,by="area")
inner=ff %>% inner_join(ff_more,by="area")
full=ff %>% full_join(ff_more,by="area")
Dataframe ff has 323 observations. Dataframe ff_more has 348. The inner join has 323, the full join 348.
One of the variables we merged earlier includes a region variable. The following command counts the number of observations for every region.
inner %>% group_by(region) %>% summarise(n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 2
## region `n()`
## <chr> <int>
## 1 East 43
## 2 East Midlands 40
## 3 London 32
## 4 North East 10
## 5 North West 39
## 6 South East 64
## 7 South West 32
## 8 Wales 22
## 9 West Midlands 27
## 10 Yorkshire and The Humber 14
plotter=function(r){
ffx=inner %>% filter(region==r)
plot=ggplot(ffx, aes(x = b_migr11, y = crimesPc)) +
geom_smooth(method = "lm", se = FALSE) +
geom_point()+theme_minimal()+
xlab("% foreign born")+ylab("Crimes per capita")+
ggtitle(r)
reg=(lm(crimesPc~b_migr11,ffx))
return(list(plot,reg))
}
p=plotter("London")
p[[1]];summary(p[[2]])
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00220 -0.35611 -0.06671 0.18426 2.64380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.62044 0.39045 1.589 0.12253
## b_migr11 0.03282 0.01025 3.202 0.00322 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6392 on 30 degrees of freedom
## Multiple R-squared: 0.2547, Adjusted R-squared: 0.2298
## F-statistic: 10.25 on 1 and 30 DF, p-value: 0.003225
p=plotter("East")
p[[1]];summary(p[[2]])
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8849 -0.3370 -0.2046 0.1440 2.4343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.80217 0.19602 4.092 0.000195 ***
## b_migr11 0.04982 0.01579 3.156 0.002997 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6717 on 41 degrees of freedom
## Multiple R-squared: 0.1954, Adjusted R-squared: 0.1758
## F-statistic: 9.959 on 1 and 41 DF, p-value: 0.002997
p=plotter("South West")
p[[1]];summary(p[[2]])
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.99207 -0.34065 -0.06424 0.28628 1.47611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.48027 0.23850 2.014 0.05308 .
## b_migr11 0.09821 0.03040 3.231 0.00299 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5407 on 30 degrees of freedom
## Multiple R-squared: 0.2581, Adjusted R-squared: 0.2334
## F-statistic: 10.44 on 1 and 30 DF, p-value: 0.002991
regions=inner$region %>% unique()
for(rrr in regions){
p=plotter(rrr)
print(p[[1]])
print(summary(p[[2]]))
}
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95549 -0.28291 0.01332 0.33373 0.82333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.04807 0.31537 3.323 0.0105 *
## b_migr11 0.07821 0.05092 1.536 0.1631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5278 on 8 degrees of freedom
## Multiple R-squared: 0.2278, Adjusted R-squared: 0.1312
## F-statistic: 2.359 on 1 and 8 DF, p-value: 0.1631
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96912 -0.14052 -0.01849 0.21011 0.95806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.86423 0.11455 7.544 5.40e-09 ***
## b_migr11 0.07270 0.01416 5.134 9.31e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3849 on 37 degrees of freedom
## Multiple R-squared: 0.416, Adjusted R-squared: 0.4003
## F-statistic: 26.36 on 1 and 37 DF, p-value: 9.308e-06
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65522 -0.19623 -0.08563 0.09058 0.71980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.53085 0.26058 5.875 7.54e-05 ***
## b_migr11 0.01987 0.03042 0.653 0.526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3813 on 12 degrees of freedom
## Multiple R-squared: 0.03433, Adjusted R-squared: -0.04614
## F-statistic: 0.4266 on 1 and 12 DF, p-value: 0.526
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53908 -0.26282 -0.07753 0.21221 0.85058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.88010 0.09965 8.832 9.57e-11 ***
## b_migr11 0.05044 0.01023 4.931 1.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3752 on 38 degrees of freedom
## Multiple R-squared: 0.3902, Adjusted R-squared: 0.3741
## F-statistic: 24.31 on 1 and 38 DF, p-value: 1.649e-05
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16156 -0.39571 -0.13070 0.00326 2.72541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.35201 0.28400 4.761 6.93e-05 ***
## b_migr11 0.01332 0.02961 0.450 0.657
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7865 on 25 degrees of freedom
## Multiple R-squared: 0.008033, Adjusted R-squared: -0.03165
## F-statistic: 0.2025 on 1 and 25 DF, p-value: 0.6566
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.99207 -0.34065 -0.06424 0.28628 1.47611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.48027 0.23850 2.014 0.05308 .
## b_migr11 0.09821 0.03040 3.231 0.00299 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5407 on 30 degrees of freedom
## Multiple R-squared: 0.2581, Adjusted R-squared: 0.2334
## F-statistic: 10.44 on 1 and 30 DF, p-value: 0.002991
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8849 -0.3370 -0.2046 0.1440 2.4343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.80217 0.19602 4.092 0.000195 ***
## b_migr11 0.04982 0.01579 3.156 0.002997 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6717 on 41 degrees of freedom
## Multiple R-squared: 0.1954, Adjusted R-squared: 0.1758
## F-statistic: 9.959 on 1 and 41 DF, p-value: 0.002997
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59422 -0.27411 -0.08116 0.17012 1.70115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.759918 0.122436 6.207 4.96e-08 ***
## b_migr11 0.044783 0.009276 4.828 9.40e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.431 on 62 degrees of freedom
## Multiple R-squared: 0.2732, Adjusted R-squared: 0.2615
## F-statistic: 23.31 on 1 and 62 DF, p-value: 9.397e-06
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00220 -0.35611 -0.06671 0.18426 2.64380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.62044 0.39045 1.589 0.12253
## b_migr11 0.03282 0.01025 3.202 0.00322 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6392 on 30 degrees of freedom
## Multiple R-squared: 0.2547, Adjusted R-squared: 0.2298
## F-statistic: 10.25 on 1 and 30 DF, p-value: 0.003225
##
## Call:
## lm(formula = crimesPc ~ b_migr11, data = ffx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54065 -0.21328 0.00295 0.16357 0.56448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.83100 0.14682 5.66 1.54e-05 ***
## b_migr11 0.08626 0.02704 3.19 0.0046 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3088 on 20 degrees of freedom
## Multiple R-squared: 0.3372, Adjusted R-squared: 0.3041
## F-statistic: 10.18 on 1 and 20 DF, p-value: 0.0046
```