This is the third and final part of the Project 2.
The goal of this exercise is to transform the data on the locality profiles (census areas) of Edinburgh and answer the following questions about the life in the city:
The original dataset is an XLS file with the results of various surveys conducted in Edinburgh, available for download from this page.
For the sake of simplicity, I have saved it into CSV format and uploaded to GitHub.
# Load the required packages
library(readr)
library(knitr)
library(tidyr)
library(dplyr)
library(plotly)
library(DT)
raw = read_csv2("https://raw.githubusercontent.com/datafeelings/data607/master/Project_2/localityprofilesbasedatacsv.csv")
The raw data looks pretty complicated with almost 1000 rows and 110 variables:
dim(raw)
## [1] 999 110
However, removing empty columns we somewhat reduce its size
# Remove last empty columns and the first column with numbers
raw = select(raw, -starts_with("X")) %>% select(-no)
Before transforming the data, we need to subset the information required to answer the questions. We filter the data using the keywords from the research questions.
students = raw %>% filter(grepl("Student",Indicator,ignore.case=T))
health = raw %>% filter(grepl("health",Indicator, ignore.case=T))
lazy = raw %>% filter(grepl("exercise",Indicator, ignore.case=T))
criminal = raw %>% filter(grepl("crime",Indicator, ignore.case = T))
Now we have more manageable data for each of the questions, an example for the question about students:
kable(students[,1:10])
| Indicator | Source | Date | Scotland | Edinburgh | Edinburgh% | Edinburghindex | High | Low | Almond |
|---|---|---|---|---|---|---|---|---|---|
| Full-time student - employed | Census | Mar-11 | NA | 20.255 | 5,5% | NA | 10,4% | 2,8% | 507 |
| Student | Census | Mar-11 | NA | 38.223 | 10,3% | NA | 30,8% | 4,0% | 754 |
| Male: Full-time student - employed | Census | Mar-11 | NA | 9.280 | 5,1% | NA | 9,3% | 2,6% | 222 |
| Male: Student | Census | Mar-11 | NA | 19.149 | 10,5% | NA | 29,8% | 4,1% | 383 |
| Female: Full-time student - employed | Census | Mar-11 | NA | 10.975 | 5,8% | NA | 12,0% | 3,1% | 285 |
| Female: Student | Census | Mar-11 | NA | 19.074 | 10,2% | NA | 31,8% | 3,6% | 371 |
All all subsets have the same column structure, we can construct a function performing the required transformation.
transformer = function(x, filterValue) {
# Conveniently, all columns with percentage of population have "%" in the name
x = x %>% select(Indicator, contains("%"))
x = x %>% gather("Locality", "Share", 2:ncol(x)) %>%
# Remove punctuation and digits from locality
mutate(Locality = gsub("[[:punct:]]", "", Locality)) %>%
mutate(Locality = gsub("[[:digit:]]", "", Locality)) %>%
# Convert the share field to a numeric value
mutate(Share = gsub("%", "", Share)) %>%
mutate(Share = gsub(",", ".", Share)) %>%
mutate(Share = as.numeric(Share)) %>%
unique() # Remove duplicate records
# Filter the required Indicator value to the required string
x = filter(x, Indicator == filterValue)
}
After inspecting each of the subsets, we have identified the precise strings to filter the values by. Now we can apply the transformer function to each of the subsets to get a tidy dataset.
students1 = transformer(students, "Student")
health1 = transformer(health, "Very good health")
lazy1 = transformer(lazy, "No exercise taken")
criminal1 = transformer(criminal, "Crime deprived 15%")
Now the dataset for students looks tidy:
datatable(students1)
Now after the data has been prepared, it is very easy to answer the questions, as the proportions per locality have already been calculated in the data. Thus we only need to filter the rows.
students_ans = filter(students1, Share == max(Share))
kable(students_ans)
| Indicator | Locality | Share |
|---|---|---|
| Student | Southside Newington | 30.8 |
The proportion of stundents in the population is very different between the localities: from 4 to 31 per cent.
health_ans = filter(health1, Share == max(Share))
kable(health_ans)
| Indicator | Locality | Share |
|---|---|---|
| Very good health | Meadows Morningside | 66.3 |
summary(health1$Share)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.60 54.32 57.85 57.67 60.48 66.30
We can see that compared to the share of students, the variation in the share of healthy people is not big.
criminal_ans = filter(criminal1, Share == max(Share))
kable(criminal_ans)
| Indicator | Locality | Share |
|---|---|---|
| Crime deprived 15% | Leith | 43.5 |
summary(criminal1$Share)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.175 12.300 15.780 21.780 43.500
Again we see a strong variation between the localities in terms of the crime deprivation variable (it is part of the The Scottish Index of Multiple Deprivation)
df = bind_rows(students1,health1,lazy1, criminal1)
plot_ly(data = arrange(df,Share), y = Share, type = "box", group = Indicator) %>% layout (title = "Variation in the share of indicator per locality")
The Edinburgh dataset provides a very multi-faceted view on the population in the city.
We have taken just a glimpse at some of its aspects and have found out that the proportion of students and crime per locality varies strongly, and the share of very healthy and very lazy (not exercising) people is much less variable.