1. Introduction

This is the third and final part of the Project 2.

The goal of this exercise is to transform the data on the locality profiles (census areas) of Edinburgh and answer the following questions about the life in the city:

  1. Which area has the highest proportion of students?
  2. Which area has the highest proportion of healthy people?
  3. Which area is the laziest (share of people not exercising at all)?
  4. Which area is the most criminal one?

2. Read the data

The original dataset is an XLS file with the results of various surveys conducted in Edinburgh, available for download from this page.

For the sake of simplicity, I have saved it into CSV format and uploaded to GitHub.

# Load the required packages
library(readr)
library(knitr)
library(tidyr)
library(dplyr)
library(plotly)
library(DT)
raw = read_csv2("https://raw.githubusercontent.com/datafeelings/data607/master/Project_2/localityprofilesbasedatacsv.csv")

The raw data looks pretty complicated with almost 1000 rows and 110 variables:

dim(raw)
## [1] 999 110

However, removing empty columns we somewhat reduce its size

# Remove last empty columns and the first column with numbers
raw = select(raw, -starts_with("X")) %>% select(-no)

Before transforming the data, we need to subset the information required to answer the questions. We filter the data using the keywords from the research questions.

students = raw %>% filter(grepl("Student",Indicator,ignore.case=T))
health = raw %>% filter(grepl("health",Indicator, ignore.case=T))
lazy = raw %>% filter(grepl("exercise",Indicator, ignore.case=T))
criminal = raw %>% filter(grepl("crime",Indicator, ignore.case = T))

Now we have more manageable data for each of the questions, an example for the question about students:

kable(students[,1:10])
Indicator Source Date Scotland Edinburgh Edinburgh% Edinburghindex High Low Almond
Full-time student - employed Census Mar-11 NA 20.255 5,5% NA 10,4% 2,8% 507
Student Census Mar-11 NA 38.223 10,3% NA 30,8% 4,0% 754
Male: Full-time student - employed Census Mar-11 NA 9.280 5,1% NA 9,3% 2,6% 222
Male: Student Census Mar-11 NA 19.149 10,5% NA 29,8% 4,1% 383
Female: Full-time student - employed Census Mar-11 NA 10.975 5,8% NA 12,0% 3,1% 285
Female: Student Census Mar-11 NA 19.074 10,2% NA 31,8% 3,6% 371

3. Transform the data into tidy format

All all subsets have the same column structure, we can construct a function performing the required transformation.

transformer = function(x, filterValue) {
  
  # Conveniently, all columns with percentage of population have "%" in the name
  x = x %>% select(Indicator, contains("%")) 
  x = x %>% gather("Locality", "Share", 2:ncol(x)) %>% 
    # Remove punctuation and digits from locality
    mutate(Locality = gsub("[[:punct:]]", "", Locality)) %>% 
    mutate(Locality = gsub("[[:digit:]]", "", Locality)) %>% 
    # Convert the share field to a numeric value
    mutate(Share = gsub("%", "", Share)) %>% 
    mutate(Share = gsub(",", ".", Share)) %>% 
    mutate(Share = as.numeric(Share)) %>% 
    unique() # Remove duplicate records

  # Filter the required Indicator value to the required string
  x = filter(x, Indicator == filterValue)
  
}

After inspecting each of the subsets, we have identified the precise strings to filter the values by. Now we can apply the transformer function to each of the subsets to get a tidy dataset.

students1 = transformer(students, "Student")
health1 = transformer(health, "Very good health")
lazy1 = transformer(lazy, "No exercise taken")
criminal1 = transformer(criminal, "Crime deprived 15%")

Now the dataset for students looks tidy:

datatable(students1)

4. Calculate summary statistics

Now after the data has been prepared, it is very easy to answer the questions, as the proportions per locality have already been calculated in the data. Thus we only need to filter the rows.

4.1. Which area has the highest proportion of students?

students_ans = filter(students1, Share == max(Share))
kable(students_ans)
Indicator Locality Share
Student Southside Newington 30.8

The proportion of stundents in the population is very different between the localities: from 4 to 31 per cent.

4.2. Which area has the highest proportion of very healthy people?

health_ans = filter(health1, Share == max(Share))
kable(health_ans)
Indicator Locality Share
Very good health Meadows Morningside 66.3
summary(health1$Share)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.60   54.32   57.85   57.67   60.48   66.30

We can see that compared to the share of students, the variation in the share of healthy people is not big.

4.3. Which area is the laziest (share of people not exercising at all)?

lazy_ans = filter(lazy1, Share == max(Share))
kable(lazy_ans)
Indicator Locality Share
No exercise taken Liberton Gilmerton 46
summary(lazy1$Share)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.41   39.72   41.05   39.31   42.65   46.00

Here, the variation is not big either (except for the outlier minimum value).

4.4. Which area is the most criminal one?

criminal_ans = filter(criminal1, Share == max(Share))
kable(criminal_ans)
Indicator Locality Share
Crime deprived 15% Leith 43.5
summary(criminal1$Share)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.175  12.300  15.780  21.780  43.500

Again we see a strong variation between the localities in terms of the crime deprivation variable (it is part of the The Scottish Index of Multiple Deprivation)

5. Conclusion

df = bind_rows(students1,health1,lazy1, criminal1)

plot_ly(data = arrange(df,Share), y = Share, type = "box", group = Indicator) %>% layout (title = "Variation in the share of indicator per locality")

The Edinburgh dataset provides a very multi-faceted view on the population in the city.
We have taken just a glimpse at some of its aspects and have found out that the proportion of students and crime per locality varies strongly, and the share of very healthy and very lazy (not exercising) people is much less variable.