The purpose of this tutorial is to provide a very brief introduction to Data Science using the open source software R, together with RStudio. The data scientist Hadley Wickham conceptualises doing data science as a process of importing, tidying, transforming, visualising, modelling and communicating data. This is not necessarily a linear sequence but may require some ‘going around in circles’ (see below) as we go through processes of data wrangling and exploration, first to draw meaning from the data and then to communicate that knowledge.
Please note: the intention of this tutorial is not to teach you R; there is no expectation that you will need to learn, nor be able to follow, all the snippets of code that are given, which you can simply cut and paste into the Console of RStudio to run. Instead, the purpose is more experiential - to get a flavour for some of the processes of data science listed above.
Processes of Data Science (source: https://r4ds.had.co.nz/)
The data we shall use will be an example of what is known as volunteered data - information provided voluntarily by individuals (it’s not entirely voluntary in this case given you have to do the practical!).
Please go to the website http://map-me.org/sites/political and follow the instructions there.
When using the spray can on that site, make sure the spray can is always turned off when panning around, zooming in or out, or moving on to the next question.
When you have finished, please wait a while for others around you to finish, and until you are instructed to proceed.
When instructed - and not before, please - open RStudio on your PC. Using the drop-down menus at the top of the screen, check your working directory is set to somewhere in your own file-space (use Session -> Set Working Directory -> Choose Directory to do this).
getwd()  # print the current working directory
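The working directory can also be set from the Console rather than the menus; the path below is just a placeholder, so substitute a folder in your own file-space before running it.

# set the working directory by code instead of the menus
# (the path here is a placeholder - replace it with your own)
# setwd("M:/data_science_practical")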
We also need to check whether some packages are available to R and, if not, install them:
install.packages("tidyverse")
ip <- installed.packages()[,1]
if(!length(grep("plotly", ip))) install.packages("plotly")
if(!length(grep("ggplot2", ip))) install.packages("ggplot2")
if(!length(grep("ggmap", ip))) install.packages("ggmap")
To look at the answers to the survey questions contained in the data you have been sent, we first need to bring the data into R:
library(tidyverse)
mydata <- read_csv("https://www.dropbox.com/s/mbge69m5x6twtmq/answers.csv?dl=1")
# you can ignore the warnings about any parsing failures
The data set I am looking at here will have different answers from yours, so the output that follows will not be the same. However, it will be of a similar structure, which can be observed by looking at it:
mydata
## # A tibble: 210 x 6
## id_dem_answer id_website id_dem_question id_person dem_answer X6
## <int> <int> <int> <int> <chr> <chr>
## 1 54163 553 12920 40269 Oxford <NA>
## 2 54164 553 12921 40269 Labour <NA>
## 3 54165 553 12922 40269 Not Given <NA>
## 4 54166 553 12923 40269 Chuka Umunna <NA>
## 5 54167 553 12924 40269 Against leavi~ <NA>
## 6 54173 553 12920 40274 yorkshire <NA>
## 7 54174 553 12921 40274 Other <NA>
## 8 54175 553 12922 40274 green <NA>
## 9 54176 553 12923 40274 None of the a~ <NA>
## 10 54177 553 12924 40274 Against leavi~ <NA>
## # ... with 200 more rows
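Another way to inspect the structure - each column’s name and type, plus the first few values - is with glimpse(), which the tidyverse loads:

glimpse(mydata)  # compact, transposed view of the data: one line per column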
These data need some tidying up: we keep only the columns required for the analysis, remove any incomplete rows and remove some duplicated records:
mydata <- mydata %>%
dplyr::select(id_dem_question, id_person, dem_answer) %>%
na.omit() %>%
arrange(id_person, id_dem_question)
keep <- mydata %>%
group_by(id_dem_question) %>%
summarise(n=n()) %>%
arrange(desc(n)) %>%
slice(1:5) %>%
dplyr::select(id_dem_question)
mydata <- mydata %>%
filter(id_dem_question %in% unlist(keep)) %>%
unique()
mydata
## # A tibble: 180 x 3
## id_dem_question id_person dem_answer
## <int> <int> <chr>
## 1 12925 40299 halifax
## 2 12926 40299 Other
## 3 12927 40299 green
## 4 12928 40299 None of the above
## 5 12929 40299 Against leaving
## 6 12925 40300 Washington
## 7 12926 40300 Lib Dems
## 8 12927 40300 Not Given
## 9 12928 40300 Jeremy Corbyn
## 10 12929 40300 Against leaving
## # ... with 170 more rows
To check the number of questions asked and the number of people who took the survey, we can use:
mydata %>%
group_by(id_dem_question) %>%
summarise(n=n())
## # A tibble: 5 x 2
## id_dem_question n
## <int> <int>
## 1 12925 36
## 2 12926 36
## 3 12927 36
## 4 12928 36
## 5 12929 36
From this I can see that there were 5 questions and that 36 people answered each of them.
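The same figures can be cross-checked directly with dplyr’s n_distinct(), which counts the unique values in a column:

n_distinct(mydata$id_dem_question)  # number of distinct questions
n_distinct(mydata$id_person)        # number of distinct respondents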
Currently the data are in what is known as long format (the answers given by each person are stacked downwards in the data so the same person appears in multiple rows of the data). It will be helpful to change it to wide format (one row per person) and to give the questions a more intuitive numbering:
mydata <- mydata %>%
spread(id_dem_question, dem_answer)
k <- ncol(mydata)
names(mydata)[2:k] <- paste0("Q", 1:(k-1))
mydata
## # A tibble: 36 x 6
## id_person Q1 Q2 Q3 Q4 Q5
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 40299 halifax Other green None of the ~ Against le~
## 2 40300 Washington Lib Dems Not Given Jeremy Corbyn Against le~
## 3 40301 Wales Labour Not Given Jeremy Corbyn Against le~
## 4 40302 Brighton Other Green pa~ None of the ~ Against le~
## 5 40303 Weston-super-~ Not Given Not Given Not Given For leaving
## 6 40304 China Labour Not Given None of the ~ Against le~
## 7 40306 Brighton Not Given Not Given Not Given Against le~
## 8 40308 New York City Not Given Not Given Not Given Against le~
## 9 40309 Macclesfield Not Given Not Given Not Given Against le~
## 10 40310 Berkshire Conservat~ Not Given Theresa May Against le~
## # ... with 26 more rows
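As an aside, in more recent versions of tidyr the spread() function has been superseded by pivot_wider(); a sketch of the equivalent call is below, commented out because it would replace the spread() above rather than follow it.

# modern tidyr equivalent of the spread() call (do not run both):
# mydata <- mydata %>%
#   pivot_wider(names_from = id_dem_question, values_from = dem_answer)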
Having tidied the data, it is easy to look specifically at the answers to the second question (about voting intentions):
mydata %>%
dplyr::select(Q2) %>%
group_by(Q2) %>%
summarise(n = n()) %>%
arrange(desc(n))
## # A tibble: 6 x 2
## Q2 n
## <chr> <int>
## 1 Labour 15
## 2 Conservative 7
## 3 Not Given 6
## 4 Other 4
## 5 Lib Dems 3
## 6 I would not vote 1
And, in regard to who would make the best PM (the 4th question)…
mydata %>%
dplyr::select(Q4) %>%
group_by(Q4) %>%
summarise(n = n()) %>%
arrange(desc(n))
## # A tibble: 6 x 2
## Q4 n
## <chr> <int>
## 1 None of the above 15
## 2 Jeremy Corbyn 8
## 3 Not Given 7
## 4 Chuka Umunna 2
## 5 Theresa May 2
## 6 Tony Blair 2
However, rather than repeating code (which, in programming terms, is wasteful and prone to introducing error), it is better to write a generic function that takes the question as an argument. For example,
tally <- function(mydata, question) {
mydata %>%
dplyr::select(question) %>%
group_by_at(1) %>%
summarise(n = n()) %>%
arrange(desc(n))
}
Now the tallies for the fourth question can be obtained simply by typing
tally(mydata, "Q4")
## # A tibble: 6 x 2
## Q4 n
## <chr> <int>
## 1 None of the above 15
## 2 Jeremy Corbyn 8
## 3 Not Given 7
## 4 Chuka Umunna 2
## 5 Theresa May 2
## 6 Tony Blair 2
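As an aside, dplyr’s count() verb collapses the group-summarise-arrange pattern into a single call; a sketch of an equivalent function (tally2 is my name for it, not part of the original code) is:

# a sketch of the same tally using dplyr::count();
# .data[[question]] looks up the column named by the question string
tally2 <- function(mydata, question) {
  mydata %>%
    count(.data[[question]], sort = TRUE)
}
tally2(mydata, "Q4")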
We might reasonably expect voting intention and preferred future PM to be related. A very simple model exploring the relationship between these two variables is a cross-tabulation:
with(mydata, table(Q2, Q4))
## Q4
## Q2 Chuka Umunna Jeremy Corbyn None of the above Not Given
## Conservative 0 0 5 0
## I would not vote 0 0 0 0
## Labour 1 7 6 1
## Lib Dems 0 1 1 0
## Not Given 0 0 0 6
## Other 1 0 3 0
## Q4
## Q2 Theresa May Tony Blair
## Conservative 2 0
## I would not vote 0 1
## Labour 0 0
## Lib Dems 0 1
## Not Given 0 0
## Other 0 0
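An optional extra step (not needed for the practical) is to test formally for an association between the two questions; with cell counts as small as these the usual chi-squared approximation is unreliable, so a simulated p-value is used.

# chi-squared test of association between Q2 and Q4; simulate the
# p-value because many cells of the table have very small counts
chisq.test(with(mydata, table(Q2, Q4)), simulate.p.value = TRUE)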
Graphical communication is one of the most important components of both understanding and explaining data. The communication can be to you - to aid understanding of the data - or to others, to help explain your results.
There are various ways to produce graphics in R, one of which is to use the ggplot2 library. This is not always the easiest, but it is built on an interesting foundation: the idea that there is a grammar of graphics, providing a coherent system for describing and building graphs. At its simplest, one or more variables in the data set are mapped (in a non-geographical sense) to aesthetics such as colour, size or symbol type. For example, to produce a bar plot of the voting intentions:
ggplot(data = mydata) +
  geom_bar(mapping = aes(Q2)) +
  xlab("political party") +
  ylab("count") +
  theme_minimal()
Or, with a bit of data wrangling, a potentially better version …
mydata <- transform(mydata,
Q2 = ordered(Q2, levels = names(sort(-table(Q2)))))
g <- ggplot(data = mydata) +
geom_bar(mapping = aes(Q2)) +
xlab("political party") +
ylab("count") +
theme_minimal()
plotly::ggplotly(g)
… which is also interactive - move your mouse across the chart and see what happens.
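If you would like a static copy of the chart for a report, ggsave() from ggplot2 writes a plot to file; here it saves the static version stored in g, and the filename is just a placeholder.

# save the static bar chart to file (the filename is a placeholder)
ggsave("voting_intentions.png", plot = g, width = 6, height = 4)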
The next stage is to have a look at the perceptions of the political geography pooled across the group. The information is contained in the file blobs.csv.
blobs <- read_csv("https://www.dropbox.com/s/v0uudld1ipganvi/blobs.csv?dl=1")
blobs
## # A tibble: 69,553 x 10
## id_blob id_person id_website id_question spray_number lat lng
## <int> <int> <int> <int> <int> <dbl> <dbl>
## 1 1.90e7 40286 553 8511 1 50.8 -0.105
## 2 1.90e7 40286 553 8511 1 50.8 -0.134
## 3 1.90e7 40286 553 8511 1 50.8 -0.170
## 4 1.90e7 40286 553 8511 1 50.8 -0.170
## 5 1.90e7 40286 553 8511 1 50.8 -0.151
## 6 1.90e7 40286 553 8511 1 50.8 -0.162
## 7 1.90e7 40286 553 8511 1 50.8 -0.145
## 8 1.90e7 40286 553 8511 1 50.8 -0.156
## 9 1.90e7 40286 553 8511 1 50.8 -0.103
## 10 1.90e7 40286 553 8511 1 50.8 -0.167
## # ... with 69,543 more rows, and 3 more variables: time_blob <dbl>,
## # zoom <int>, X10 <chr>
For these data it is better to keep them in long format, though we will still find it easier to change the question names to something more intuitive, at the same time removing any blobs that fall outside the UK.
old_ids <- unique(blobs$id_question)
for(i in 1: length(old_ids)) {
blobs <- blobs %>%
mutate(id_question = ifelse(id_question == old_ids[i],
paste0("Q",i), id_question))
}
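As an aside, the loop can be avoided: match() returns each id’s position in old_ids, which is exactly the question number we want. This sketch is for reference only and would replace the loop above, not follow it.

# loop-free alternative to the renaming (do not run after the loop,
# as the original ids will no longer match):
# blobs <- blobs %>%
#   mutate(id_question = paste0("Q", match(id_question, old_ids)))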
blobs <- blobs %>%
filter(lng > -7.6, lng < 1.7) %>%
filter(lat > 49.9, lat < 58.7)
blobs
## # A tibble: 65,709 x 10
## id_blob id_person id_website id_question spray_number lat lng
## <int> <int> <int> <chr> <int> <dbl> <dbl>
## 1 1.90e7 40286 553 Q1 1 50.8 -0.105
## 2 1.90e7 40286 553 Q1 1 50.8 -0.134
## 3 1.90e7 40286 553 Q1 1 50.8 -0.170
## 4 1.90e7 40286 553 Q1 1 50.8 -0.170
## 5 1.90e7 40286 553 Q1 1 50.8 -0.151
## 6 1.90e7 40286 553 Q1 1 50.8 -0.162
## 7 1.90e7 40286 553 Q1 1 50.8 -0.145
## 8 1.90e7 40286 553 Q1 1 50.8 -0.156
## 9 1.90e7 40286 553 Q1 1 50.8 -0.103
## 10 1.90e7 40286 553 Q1 1 50.8 -0.167
## # ... with 65,699 more rows, and 3 more variables: time_blob <dbl>,
## # zoom <int>, X10 <chr>
It is now relatively straightforward to map where people thought Conservative voters are most likely to live (question 2):
library(ggmap)
qmplot(lng, lat, data = filter(blobs, id_question == "Q2"),
       maptype = "toner-lite", geom = "density2d", color = I("red"))
Writing it as a function …
mapit <- function(blobs, question) {
qmplot(lng, lat, data = filter(blobs, id_question == question),
maptype = "toner-lite", geom = "density2d", color = I("red"))
}
… then to see where people thought UKIP voters live (Q5) we can use,
mapit(blobs, "Q5")
R is a powerful language for data science because it is free and flexible, and the sharing of code is encouraged (this document was written in R, using R Markdown). Much of what I have written here is drawn from here and here. As with the learning of any language, it takes a while to get used to, but once you have done so, it opens the door to doing the sorts of research and analysis that just aren’t possible using standard packages like Excel. Because we live in an increasingly data-based world, making effective use of those data to answer questions of relevance to the social sciences, sciences and humanities is an ongoing and important challenge. Here at the University of Bristol, both Bristol Q-Step and the Jean Golding Institute for data science are helping to equip students and researchers to respond to that challenge.