The purpose of this tutorial is to provide a very brief introduction to Data Science using the open source software R, together with RStudio. The data scientist Hadley Wickham conceptualises doing data science as a process of importing, tidying, transforming, visualising, modelling and communicating data. This is not necessarily a linear sequence but may require some ‘going around in circles’ (see below) as we go through processes of data wrangling and exploration, first to draw meaning from the data and then to communicate that knowledge.
Please note: the intention of this tutorial is not to teach you R; there is no expectation that you will need to learn, nor be able to follow, all the snippets of code that are given, which you can simply cut and paste into the Console of RStudio to run. Instead, the purpose is more experiential - to get a flavour for some of the processes of data science listed above.
Processes of Data Science (source: https://r4ds.had.co.nz/)
The data we shall use will be an example of what is known as volunteered data - information provided voluntarily by individuals (it’s not entirely voluntary in this case given you have to do the practical!).
Please go to the website http://map-me.org/sites/political and follow the instructions there.
When using the spray can on that site, make sure the spray can is always turned off when panning around, zooming in or out, or moving on to the next question.
When you have finished, please wait a while for others around you to finish, and until you are instructed to proceed.
When instructed - and not before, please - open RStudio on your PC. Using the drop-down menus at the top of the screen, check your working directory is set to somewhere in your own file-space (use Session -> Set Working Directory -> Choose Directory to do this).
getwd()  # print the current working directory
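The working directory can also be set from the Console rather than the menus; the path below is just a placeholder, so substitute a folder in your own file-space before running it.

# set the working directory by code instead of the menus
# (the path here is a placeholder - replace it with your own)
# setwd("M:/data_science_practical")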
We also need to check whether some packages are available to R and, if not, install them:
install.packages("tidyverse")
ip <- installed.packages()[,1]
if(!length(grep("plotly", ip))) install.packages("plotly")
if(!length(grep("ggplot2", ip))) install.packages("ggplot2")
if(!length(grep("ggmap", ip))) install.packages("ggmap")
To look at the answers to the survey questions contained in the data you have been sent, we first need to bring the data into R:
library(tidyverse)
mydata <- read_csv("https://www.dropbox.com/s/mbge69m5x6twtmq/answers.csv?dl=1")
# you can ignore the warnings about any parsing failures
The data set I am looking at here will have different answers from yours, so the output that follows will not be the same. However, it will be of a similar structure, which can be observed by looking at it:
mydata
## # A tibble: 210 x 6
## id_dem_answer id_website id_dem_question id_person dem_answer X6
## <int> <int> <int> <int> <chr> <chr>
## 1 54163 553 12920 40269 Oxford <NA>
## 2 54164 553 12921 40269 Labour <NA>
## 3 54165 553 12922 40269 Not Given <NA>
## 4 54166 553 12923 40269 Chuka Umunna <NA>
## 5 54167 553 12924 40269 Against leavi~ <NA>
## 6 54173 553 12920 40274 yorkshire <NA>
## 7 54174 553 12921 40274 Other <NA>
## 8 54175 553 12922 40274 green <NA>
## 9 54176 553 12923 40274 None of the a~ <NA>
## 10 54177 553 12924 40274 Against leavi~ <NA>
## # ... with 200 more rows
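Another way to inspect the structure - each column’s name and type, plus the first few values - is with glimpse(), which the tidyverse loads:

glimpse(mydata)  # compact, transposed view of the data: one line per column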
These data need some tidying up: we keep only the columns required for the analysis, remove any incomplete rows and remove some duplicated records:
mydata <- mydata %>%
dplyr::select(id_dem_question, id_person, dem_answer) %>%
na.omit() %>%
arrange(id_person, id_dem_question)
keep <- mydata %>%
group_by(id_dem_question) %>%
summarise(n=n()) %>%
arrange(desc(n)) %>%
slice(1:5) %>%
dplyr::select(id_dem_question)
mydata <- mydata %>%
filter(id_dem_question %in% unlist(keep)) %>%
unique()
mydata
## # A tibble: 180 x 3
## id_dem_question id_person dem_answer
## <int> <int> <chr>
## 1 12925 40299 halifax
## 2 12926 40299 Other
## 3 12927 40299 green
## 4 12928 40299 None of the above
## 5 12929 40299 Against leaving
## 6 12925 40300 Washington
## 7 12926 40300 Lib Dems
## 8 12927 40300 Not Given
## 9 12928 40300 Jeremy Corbyn
## 10 12929 40300 Against leaving
## # ... with 170 more rows
To check the number of questions asked and the number of people who took the survey, we can use:
mydata %>%
group_by(id_dem_question) %>%
summarise(n=n())
## # A tibble: 5 x 2
## id_dem_question n
## <int> <int>
## 1 12925 36
## 2 12926 36
## 3 12927 36
## 4 12928 36
## 5 12929 36
From this I can see that there were 5 questions and that 36 people answered each of them.
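The same figures can be cross-checked directly with dplyr’s n_distinct(), which counts the unique values in a column:

n_distinct(mydata$id_dem_question)  # number of distinct questions
n_distinct(mydata$id_person)        # number of distinct respondents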
Currently the data are in what is known as long format (the answers given by each person are stacked downwards in the data so the same person appears in multiple rows of the data). It will be helpful to change it to wide format (one row per person) and to give the questions a more intuitive numbering:
mydata <- mydata %>%
spread(id_dem_question, dem_answer)
k <- ncol(mydata)
names(mydata)[2:k] <- paste0("Q", 1:(k-1))
mydata
## # A tibble: 36 x 6
## id_person Q1 Q2 Q3 Q4 Q5
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 40299 halifax Other green None of the ~ Against le~
## 2 40300 Washington Lib Dems Not Given Jeremy Corbyn Against le~
## 3 40301 Wales Labour Not Given Jeremy Corbyn Against le~
## 4 40302 Brighton Other Green pa~ None of the ~ Against le~
## 5 40303 Weston-super-~ Not Given Not Given Not Given For leaving
## 6 40304 China Labour Not Given None of the ~ Against le~
## 7 40306 Brighton Not Given Not Given Not Given Against le~
## 8 40308 New York City Not Given Not Given Not Given Against le~
## 9 40309 Macclesfield Not Given Not Given Not Given Against le~
## 10 40310 Berkshire Conservat~ Not Given Theresa May Against le~
## # ... with 26 more rows
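As an aside, in more recent versions of tidyr the spread() function has been superseded by pivot_wider(); a sketch of the equivalent call is below, commented out because it would replace the spread() above rather than follow it.

# modern tidyr equivalent of the spread() call (do not run both):
# mydata <- mydata %>%
#   pivot_wider(names_from = id_dem_question, values_from = dem_answer)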
Having tidied the data, it is easy to look specifically at the answers to the second question (about voting intentions):
mydata %>%
dplyr::select(Q2) %>%
group_by(Q2) %>%
summarise(n = n()) %>%
arrange(desc(n))
## # A tibble: 6 x 2
## Q2 n
## <chr> <int>
## 1 Labour 15
## 2 Conservative 7
## 3 Not Given 6
## 4 Other 4
## 5 Lib Dems 3
## 6 I would not vote 1
And, in regard to who would make the best PM (the 4th question)…
mydata %>%
dplyr::select(Q4) %>%
group_by(Q4) %>%
summarise(n = n()) %>%
arrange(desc(n))
## # A tibble: 6 x 2
## Q4 n
## <chr> <int>
## 1 None of the above 15
## 2 Jeremy Corbyn 8
## 3 Not Given 7
## 4 Chuka Umunna 2
## 5 Theresa May 2
## 6 Tony Blair 2
However, rather than repeating code (which, in programming terms, is wasteful and prone to introducing error), it is better to write a generic function that takes the question as an argument. For example,
tally <- function(mydata, question) {
mydata %>%
dplyr::select(question) %>%
group_by_at(1) %>%
summarise(n = n()) %>%
arrange(desc(n))
}
Now the tallies for the fourth question can be obtained simply by typing
tally(mydata, "Q4")
## # A tibble: 6 x 2
## Q4 n
## <chr> <int>
## 1 None of the above 15
## 2 Jeremy Corbyn 8
## 3 Not Given 7
## 4 Chuka Umunna 2
## 5 Theresa May 2
## 6 Tony Blair 2
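As an aside, dplyr’s count() verb collapses the group-summarise-arrange pattern into a single call; a sketch of an equivalent function (tally2 is my name for it, not part of the original code) is:

# a sketch of the same tally using dplyr::count();
# .data[[question]] looks up the column named by the question string
tally2 <- function(mydata, question) {
  mydata %>%
    count(.data[[question]], sort = TRUE)
}
tally2(mydata, "Q4")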
We might reasonably expect voting intention and preferred future PM to be related. A very simple model exploring the relationship between these two variables is a cross-tabulation:
with(mydata, table(Q2, Q4))
## Q4
## Q2 Chuka Umunna Jeremy Corbyn None of the above Not Given
## Conservative 0 0 5 0
## I would not vote 0 0 0 0
## Labour 1 7 6 1
## Lib Dems 0 1 1 0
## Not Given 0 0 0 6
## Other 1 0 3 0
## Q4
## Q2 Theresa May Tony Blair
## Conservative 2 0
## I would not vote 0 1
## Labour 0 0
## Lib Dems 0 1
## Not Given 0 0
## Other 0 0
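An optional extra step (not needed for the practical) is to test formally for an association between the two questions; with cell counts as small as these the usual chi-squared approximation is unreliable, so a simulated p-value is used.

# chi-squared test of association between Q2 and Q4; simulate the
# p-value because many cells of the table have very small counts
chisq.test(with(mydata, table(Q2, Q4)), simulate.p.value = TRUE)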
Graphical communication is one of the most important components of both understanding and explaining data. The communication can be to you - to aid understanding of the data - or to others, to help explain your results.
There are various ways to produce graphics in R, one of which is to use the ggplot2 library. This is not always the easiest, but it is built on an interesting foundation: the idea that there is a grammar of graphics, providing a coherent system for describing and building graphs. At its simplest, one or more variables in the data set are mapped (in a non-geographical sense) to aesthetics such as colour, size or symbol type. For example, to produce a bar plot of the voting intentions:
ggplot(data = mydata) +
  geom_bar(mapping = aes(Q2)) +
  xlab("political party") +
  ylab("count") +
  theme_minimal()
Or, with a bit of data wrangling, a potentially better version …
mydata <- transform(mydata,
Q2 = ordered(Q2, levels = names(sort(-table(Q2)))))
g <- ggplot(data = mydata) +
geom_bar(mapping = aes(Q2)) +
xlab("political party") +
ylab("count") +
theme_minimal()
plotly::ggplotly(g)
… which is also interactive - move your mouse across the chart and see what happens.
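If you would like a static copy of the chart for a report, ggsave() from ggplot2 writes a plot to file; here it saves the static version stored in g, and the filename is just a placeholder.

# save the static bar chart to file (the filename is a placeholder)
ggsave("voting_intentions.png", plot = g, width = 6, height = 4)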
The next stage is to have a look at the perceptions of the political geography pooled across the group. The information is contained in the file blobs.csv.
blobs <- read_csv("https://www.dropbox.com/s/v0uudld1ipganvi/blobs.csv?dl=1")
blobs
## # A tibble: 69,553 x 10
## id_blob id_person id_website id_question spray_number lat lng
## <int> <int> <int> <int> <int> <dbl> <dbl>
## 1 1.90e7 40286 553 8511 1 50.8 -0.105
## 2 1.90e7 40286 553 8511 1 50.8 -0.134
## 3 1.90e7 40286 553 8511 1 50.8 -0.170
## 4 1.90e7 40286 553 8511 1 50.8 -0.170
## 5 1.90e7 40286 553 8511 1 50.8 -0.151
## 6 1.90e7 40286 553 8511 1 50.8 -0.162
## 7 1.90e7 40286 553 8511 1 50.8 -0.145
## 8 1.90e7 40286 553 8511 1 50.8 -0.156
## 9 1.90e7 40286 553 8511 1 50.8 -0.103
## 10 1.90e7 40286 553 8511 1 50.8 -0.167
## # ... with 69,543 more rows, and 3 more variables: time_blob <dbl>,
## # zoom <int>, X10 <chr>
For these data it is better to keep them in long format, though we will still find it easier to change the question names to something more intuitive, at the same time removing any blobs that fall outside the UK.
old_ids <- unique(blobs$id_question)
for(i in 1: length(old_ids)) {
blobs <- blobs %>%
mutate(id_question = ifelse(id_question == old_ids[i],
paste0("Q",i), id_question))
}
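As an aside, the loop can be avoided: match() returns each id’s position in old_ids, which is exactly the question number we want. This sketch is for reference only and would replace the loop above, not follow it.

# loop-free alternative to the renaming (do not run after the loop,
# as the original ids will no longer match):
# blobs <- blobs %>%
#   mutate(id_question = paste0("Q", match(id_question, old_ids)))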
blobs <- blobs %>%
filter(lng > -7.6, lng < 1.7) %>%
filter(lat > 49.9, lat < 58.7)
blobs
## # A tibble: 65,709 x 10
## id_blob id_person id_website id_question spray_number lat lng
## <int> <int> <int> <chr> <int> <dbl> <dbl>
## 1 1.90e7 40286 553 Q1 1 50.8 -0.105
## 2 1.90e7 40286 553 Q1 1 50.8 -0.134
## 3 1.90e7 40286 553 Q1 1 50.8 -0.170
## 4 1.90e7 40286 553 Q1 1 50.8 -0.170
## 5 1.90e7 40286 553 Q1 1 50.8 -0.151
## 6 1.90e7 40286 553 Q1 1 50.8 -0.162
## 7 1.90e7 40286 553 Q1 1 50.8 -0.145
## 8 1.90e7 40286 553 Q1 1 50.8 -0.156
## 9 1.90e7 40286 553 Q1 1 50.8 -0.103
## 10 1.90e7 40286 553 Q1 1 50.8 -0.167
## # ... with 65,699 more rows, and 3 more variables: time_blob <dbl>,
## # zoom <int>, X10 <chr>
It is now relatively straightforward to map where people thought Conservative voters are most likely to live (question 2):
library(ggmap)
qmplot(lng, lat, data = filter(blobs, id_question == "Q2"),
       maptype = "toner-lite", geom = "density2d", color = I("red"))
Writing it as a function …
mapit <- function(blobs, question) {
qmplot(lng, lat, data = filter(blobs, id_question == question),
maptype = "toner-lite", geom = "density2d", color = I("red"))
}
… then to see where people thought UKIP voters live (Q5) we can use,
mapit(blobs, "Q5")
R is a powerful language for data science because it is free and flexible, and the sharing of code is encouraged (this document was written in R, using R Markdown). Much of what I have written here is drawn from here and here. As with the learning of any language, it takes a while to get used to, but once you have done so, it opens the door to doing the sorts of research and analysis that just aren’t possible using standard packages like Excel. Because we live in an increasingly data-based world, making effective use of those data to answer questions of relevance to the social sciences, sciences and humanities is an ongoing and important challenge. Here at the University of Bristol, both Bristol Q-Step and the Jean Golding Institute for data science are helping to equip students and researchers to respond to that challenge.