Introduction

Hi! In this kernel we are going to analize some important topics about Barcelona: population, culture, transport, environment, etc. As a citizen and lover of this city, I would like to introduce it to the rest of Kaggle users. We are going to use data from the Open Data BCN platform:

Open Data BCN, a project that was born in 2010, implementing the portal in 2011, has evolved and is now part of the Barcelona Ciutat Digital strategy, fostering a pluralistic digital economy and developing a new model of urban innovation based on the transformation and digital innovation of the public sector and the implication among companies, administrations, the academic world, organizations, communities and people, with a clear public and citizen leadership.

This portal contains more than 400 data sets classified in 5 main topics: Administration, Urban environment, Population, Territory, Economy and Business which are also divided in sub topics. You can consult the data sets catalogue here. You can visit the second part (Discovering Barcelona Part II) here. Let’s begin!

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'tibble' was built under R version 4.0.4
## Warning: package 'tidyr' was built under R version 4.0.4
## Warning: package 'dplyr' was built under R version 4.0.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)

setwd('C:/Users/DellPC/Desktop/Corner/R_source_code/Barcelona')

Loading Data

#Load libraries 

library(tidyverse)
library(knitr)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(wordcloud)
## Loading required package: RColorBrewer
library(visNetwork)
library(knitr)

# Read the data 

population <- read.csv('population.csv', 
                       sep =',', fileEncoding = 'UTF-8')
immigrant_emigrants_by_sex <- read.csv('immigrants_emigrants_by_sex.csv',
                                      sep =',' , fileEncoding = 'UTF-8')

immigrants_emigrants_by_age <- read.csv('immigrants_emigrants_by_age.csv', 
                                       sep = ',',fileEncoding = 'UTF-8')
immigrants_emigrants_by_destination <- read.csv('immigrants_emigrants_by_destination.csv', 
                                              sep =',', fileEncoding = 'UTF-8')
immigrants_emigrants_by_destination2 <- read.csv('immigrants_emigrants_by_destination2.csv',
                                               sep = ',', fileEncoding = 'UTF-8')
immigrants_by_nationality <- read.csv('immigrants_by_nationality.csv',
                                               sep = ',', fileEncoding = 'UTF-8')
births <- read.csv('births.csv',
                   sep = ',', fileEncoding = 'UTF-8')

deaths <- read.csv('deaths.csv', 
                  sep = ',', fileEncoding = 'UTF-8')

unemployment <- read.csv('unemployment.csv',
                         sep = ',', fileEncoding = 'UTF-8')
baby_names <- read.csv('most_frequent_baby_names.csv',
                       sep = ',', fileEncoding = 'UTF-8')
names <- read.csv('most_frequent_names.csv',
                  sep = ',', fileEncoding = 'UTF-8')
life_expectancy <- read.csv('life_expectancy.csv',
                            sep = ',', fileEncoding = 'UTF-8')

colnames(life_expectancy)[2:6] <- c("2006-2010", "2007-2011", "2008-2012", "2009-2013", "2010-2014")

Let’s get an idea of what we’re working with. We have many datasets so we are not going to inspect everyone.

Show data

Population

population %>% head(10)
##    Year District.Code District.Name Neighborhood.Code
## 1  2017             1  Ciutat Vella                 1
## 2  2017             1  Ciutat Vella                 2
## 3  2017             1  Ciutat Vella                 3
## 4  2017             1  Ciutat Vella                 4
## 5  2017             2      Eixample                 5
## 6  2017             2      Eixample                 6
## 7  2017             2      Eixample                 7
## 8  2017             2      Eixample                 8
## 9  2017             2      Eixample                 9
## 10 2017             2      Eixample                10
##                        Neighborhood.Name Gender Age Number
## 1                               el Raval   Male 0-4    224
## 2                         el Barri Gòtic   Male 0-4     50
## 3                         la Barceloneta   Male 0-4     43
## 4  Sant Pere, Santa Caterina i la Ribera   Male 0-4     95
## 5                          el Fort Pienc   Male 0-4    124
## 6                     la Sagrada Família   Male 0-4    191
## 7                 la Dreta de l'Eixample   Male 0-4    176
## 8        l'Antiga Esquerra de l'Eixample   Male 0-4    158
## 9         la Nova Esquerra de l'Eixample   Male 0-4    220
## 10                           Sant Antoni   Male 0-4    143
str(population)
## 'data.frame':    70080 obs. of  8 variables:
##  $ Year             : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ District.Code    : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ District.Name    : chr  "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" ...
##  $ Neighborhood.Code: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood.Name: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ Gender           : chr  "Male" "Male" "Male" "Male" ...
##  $ Age              : chr  "0-4" "0-4" "0-4" "0-4" ...
##  $ Number           : int  224 50 43 95 124 191 176 158 220 143 ...
skim(population)
Data summary
Name population
Number of rows 70080
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
District.Name 0 1 6 19 0 10 0
Neighborhood.Name 0 1 5 44 0 73 0
Gender 0 1 4 6 0 2 0
Age 0 1 3 5 0 20 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2015.00 1.41 2013 2014 2015 2016 2017 ▇▇▇▇▇
District.Code 0 1 6.25 2.79 1 4 7 8 10 ▃▃▃▇▆
Neighborhood.Code 0 1 37.00 21.07 1 19 37 55 73 ▇▇▇▇▇
Number 0 1 114.84 96.54 0 43 92 162 777 ▇▂▁▁▁

immigrants_emigrants_by_sex

immigrant_emigrants_by_sex %>% head(10)
##    Year District.Code District.Name Neighborhood.Code
## 1  2017             1  Ciutat Vella                 1
## 2  2017             1  Ciutat Vella                 2
## 3  2017             1  Ciutat Vella                 3
## 4  2017             1  Ciutat Vella                 4
## 5  2017             2      Eixample                 5
## 6  2017             2      Eixample                 6
## 7  2017             2      Eixample                 7
## 8  2017             2      Eixample                 8
## 9  2017             2      Eixample                 9
## 10 2017             2      Eixample                10
##                        Neighborhood.Name Gender Immigrants Emigrants
## 1                               el Raval   Male       3063      1195
## 2                         el Barri Gòtic   Male       1499       427
## 3                         la Barceloneta   Male        910       361
## 4  Sant Pere, Santa Caterina i la Ribera   Male       1438       502
## 5                          el Fort Pienc   Male       1082       495
## 6                     la Sagrada Família   Male       1693       796
## 7                 la Dreta de l'Eixample   Male       1543       666
## 8        l'Antiga Esquerra de l'Eixample   Male       1693       702
## 9         la Nova Esquerra de l'Eixample   Male       1927       925
## 10                           Sant Antoni   Male       1253       632
str(immigrant_emigrants_by_sex)
## 'data.frame':    730 obs. of  8 variables:
##  $ Year             : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ District.Code    : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ District.Name    : chr  "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" ...
##  $ Neighborhood.Code: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood.Name: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ Gender           : chr  "Male" "Male" "Male" "Male" ...
##  $ Immigrants       : int  3063 1499 910 1438 1082 1693 1543 1693 1927 1253 ...
##  $ Emigrants        : int  1195 427 361 502 495 796 666 702 925 632 ...
skim(immigrant_emigrants_by_sex)
Data summary
Name immigrant_emigrants_by_se…
Number of rows 730
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
District.Name 0 1 6 19 0 10 0
Neighborhood.Name 0 1 5 44 0 73 0
Gender 0 1 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2015.00 1.42 2013 2014.00 2015.0 2016.00 2017 ▇▇▇▇▇
District.Code 0 1 6.25 2.79 1 4.00 7.0 8.00 10 ▃▃▃▇▆
Neighborhood.Code 0 1 37.00 21.09 1 19.00 37.0 55.00 73 ▇▇▇▇▇
Immigrants 0 1 588.79 484.94 5 251.25 474.5 784.25 3219 ▇▃▁▁▁
Emigrants 0 1 362.18 255.19 3 178.00 329.0 498.75 1541 ▇▆▂▁▁

immigrants_emigrants_by_age

immigrants_emigrants_by_age %>% head(10)
##    Year District.Code District.Name Neighborhood.Code
## 1  2017             1  Ciutat Vella                 1
## 2  2017             1  Ciutat Vella                 2
## 3  2017             1  Ciutat Vella                 3
## 4  2017             1  Ciutat Vella                 4
## 5  2017             2      Eixample                 5
## 6  2017             2      Eixample                 6
## 7  2017             2      Eixample                 7
## 8  2017             2      Eixample                 8
## 9  2017             2      Eixample                 9
## 10 2017             2      Eixample                10
##                        Neighborhood.Name Age Immigrants Emigrants
## 1                               el Raval 0-4        154       108
## 2                         el Barri Gòtic 0-4         58        33
## 3                         la Barceloneta 0-4         38        37
## 4  Sant Pere, Santa Caterina i la Ribera 0-4         56        55
## 5                          el Fort Pienc 0-4         79        60
## 6                     la Sagrada Família 0-4        111        95
## 7                 la Dreta de l'Eixample 0-4        121        78
## 8        l'Antiga Esquerra de l'Eixample 0-4         97        63
## 9         la Nova Esquerra de l'Eixample 0-4        123       130
## 10                           Sant Antoni 0-4         57        80
str(immigrants_emigrants_by_age)
## 'data.frame':    4662 obs. of  8 variables:
##  $ Year             : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ District.Code    : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ District.Name    : chr  "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" ...
##  $ Neighborhood.Code: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood.Name: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ Age              : chr  "0-4" "0-4" "0-4" "0-4" ...
##  $ Immigrants       : int  154 58 38 56 79 111 121 97 123 57 ...
##  $ Emigrants        : int  108 33 37 55 60 95 78 63 130 80 ...
str(immigrants_emigrants_by_age)
## 'data.frame':    4662 obs. of  8 variables:
##  $ Year             : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ District.Code    : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ District.Name    : chr  "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" ...
##  $ Neighborhood.Code: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood.Name: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ Age              : chr  "0-4" "0-4" "0-4" "0-4" ...
##  $ Immigrants       : int  154 58 38 56 79 111 121 97 123 57 ...
##  $ Emigrants        : int  108 33 37 55 60 95 78 63 130 80 ...
skim(immigrants_emigrants_by_age)
Data summary
Name immigrants_emigrants_by_a…
Number of rows 4662
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
District.Name 0 1 6 19 0 11 0
Neighborhood.Name 0 1 5 44 0 74 0
Age 0 1 3 5 0 21 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2016.00 0.82 2015 2015 2016.0 2017 2017 ▇▁▇▁▇
District.Code 0 1 7.50 11.06 1 4 7.0 8 99 ▇▁▁▁▁
Neighborhood.Code 0 1 37.84 22.12 1 19 37.5 56 99 ▇▇▇▆▁
Immigrants 0 1 58.23 108.97 0 5 21.0 64 1393 ▇▁▁▁▁
Emigrants 0 1 33.68 45.93 0 6 18.0 42 451 ▇▁▁▁▁

immigrants_emigrants_by_destination

immigrants_emigrants_by_destination %>% head(10)
##                from               to weight
## 1            Abroad        Barcelona  53085
## 2         Andalusia        Barcelona   2416
## 3            Aragon        Barcelona    811
## 4          Asturias        Barcelona    323
## 5  Balearic Islands        Barcelona   1634
## 6         Barcelona           Abroad   5485
## 7         Barcelona        Andalusia   1596
## 8         Barcelona           Aragon    745
## 9         Barcelona         Asturias    204
## 10        Barcelona Balearic Islands   1428
str(immigrants_emigrants_by_destination)
## 'data.frame':    40 obs. of  3 variables:
##  $ from  : chr  "Abroad" "Andalusia" "Aragon" "Asturias" ...
##  $ to    : chr  "Barcelona" "Barcelona" "Barcelona" "Barcelona" ...
##  $ weight: int  53085 2416 811 323 1634 5485 1596 745 204 1428 ...
skim(immigrants_emigrants_by_destination)
Data summary
Name immigrants_emigrants_by_d…
Number of rows 40
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
from 0 1 5 19 0 21 0
to 0 1 5 19 0 21 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weight 0 1 3749.32 10603.95 12 203.25 613.5 1605.5 53085 ▇▁▁▁▁

immigrants_emigrants_by_nationality

immigrants_by_nationality %>% head(10)
##    Year District.Code District.Name Neighborhood.Code
## 1  2017             1  Ciutat Vella                 1
## 2  2017             1  Ciutat Vella                 2
## 3  2017             1  Ciutat Vella                 3
## 4  2017             1  Ciutat Vella                 4
## 5  2017             2      Eixample                 5
## 6  2017             2      Eixample                 6
## 7  2017             2      Eixample                 7
## 8  2017             2      Eixample                 8
## 9  2017             2      Eixample                 9
## 10 2017             2      Eixample                10
##                        Neighborhood.Name Nationality Number
## 1                               el Raval       Spain   1109
## 2                         el Barri Gòtic       Spain    482
## 3                         la Barceloneta       Spain    414
## 4  Sant Pere, Santa Caterina i la Ribera       Spain    537
## 5                          el Fort Pienc       Spain    663
## 6                     la Sagrada Família       Spain   1181
## 7                 la Dreta de l'Eixample       Spain   1063
## 8        l'Antiga Esquerra de l'Eixample       Spain   1177
## 9         la Nova Esquerra de l'Eixample       Spain   1593
## 10                           Sant Antoni       Spain    883
str(immigrants_by_nationality)
## 'data.frame':    35224 obs. of  7 variables:
##  $ Year             : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ District.Code    : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ District.Name    : chr  "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" ...
##  $ Neighborhood.Code: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood.Name: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ Nationality      : chr  "Spain" "Spain" "Spain" "Spain" ...
##  $ Number           : int  1109 482 414 537 663 1181 1063 1177 1593 883 ...
skim(immigrants_by_nationality)
Data summary
Name immigrants_by_nationality
Number of rows 35224
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
District.Name 0 1 6 19 0 11 0
Neighborhood.Name 0 1 5 44 0 74 0
Nationality 0 1 4 24 0 177 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2016.01 0.81 2015 2015 2016.0 2017 2017 ▇▁▇▁▇
District.Code 0 1 7.50 11.06 1 4 7.0 8 99 ▇▁▁▁▁
Neighborhood.Code 0 1 37.84 22.12 1 19 37.5 56 99 ▇▇▇▆▁
Number 0 1 7.71 50.42 0 0 0.0 2 1603 ▇▁▁▁▁

births

births %>% head(10)
##    Year District.Code District.Name Neighborhood.Code
## 1  2017             1  Ciutat Vella                 1
## 2  2017             1  Ciutat Vella                 2
## 3  2017             1  Ciutat Vella                 3
## 4  2017             1  Ciutat Vella                 4
## 5  2017             2      Eixample                 5
## 6  2017             2      Eixample                 6
## 7  2017             2      Eixample                 7
## 8  2017             2      Eixample                 8
## 9  2017             2      Eixample                 9
## 10 2017             2      Eixample                10
##                        Neighborhood.Name Gender Number
## 1                               el Raval   Boys    283
## 2                         el Barri Gòtic   Boys     56
## 3                         la Barceloneta   Boys     51
## 4  Sant Pere, Santa Caterina i la Ribera   Boys     90
## 5                          el Fort Pienc   Boys    117
## 6                     la Sagrada Família   Boys    207
## 7                 la Dreta de l'Eixample   Boys    185
## 8        l'Antiga Esquerra de l'Eixample   Boys    176
## 9         la Nova Esquerra de l'Eixample   Boys    218
## 10                           Sant Antoni   Boys    172
str(births)
## 'data.frame':    734 obs. of  7 variables:
##  $ Year             : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ District.Code    : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ District.Name    : chr  "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" ...
##  $ Neighborhood.Code: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood.Name: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ Gender           : chr  "Boys" "Boys" "Boys" "Boys" ...
##  $ Number           : int  283 56 51 90 117 207 185 176 218 172 ...
skim(births)
Data summary
Name births
Number of rows 734
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
District.Name 0 1 6 19 0 11 0
Neighborhood.Name 0 1 5 44 0 75 0
Gender 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2015.00 1.42 2013 2014.00 2015.0 2016.00 2017 ▇▇▇▇▇
District.Code 0 1 6.75 7.38 1 4.00 7.0 8.00 99 ▇▁▁▁▁
Neighborhood.Code 0 1 37.34 21.52 1 19.00 37.0 55.75 99 ▇▇▇▆▁
Number 0 1 91.75 61.93 0 43.25 84.5 129.25 283 ▇▇▃▂▁

deaths

deaths %>% head(10)
##    Year District.Code District.Name Neighborhood.Code
## 1  2017             1  Ciutat Vella                 1
## 2  2017             1  Ciutat Vella                 2
## 3  2017             1  Ciutat Vella                 3
## 4  2017             1  Ciutat Vella                 4
## 5  2017             2      Eixample                 5
## 6  2017             2      Eixample                 6
## 7  2017             2      Eixample                 7
## 8  2017             2      Eixample                 8
## 9  2017             2      Eixample                 9
## 10 2017             2      Eixample                10
##                        Neighborhood.Name Age Number
## 1                               el Raval 0-4      1
## 2                         el Barri Gòtic 0-4      1
## 3                         la Barceloneta 0-4      0
## 4  Sant Pere, Santa Caterina i la Ribera 0-4      0
## 5                          el Fort Pienc 0-4      0
## 6                     la Sagrada Família 0-4      1
## 7                 la Dreta de l'Eixample 0-4      0
## 8        l'Antiga Esquerra de l'Eixample 0-4      0
## 9         la Nova Esquerra de l'Eixample 0-4      0
## 10                           Sant Antoni 0-4      0
str(deaths)
## 'data.frame':    4599 obs. of  7 variables:
##  $ Year             : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ District.Code    : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ District.Name    : chr  "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" ...
##  $ Neighborhood.Code: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood.Name: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ Age              : chr  "0-4" "0-4" "0-4" "0-4" ...
##  $ Number           : int  1 1 0 0 0 1 0 0 0 0 ...
skim(deaths)
Data summary
Name deaths
Number of rows 4599
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
District.Name 0 1 6 19 0 10 0
Neighborhood.Name 0 1 5 44 0 73 0
Age 0 1 3 5 0 21 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2016.00 0.82 2015 2015 2016 2017 2017 ▇▁▇▁▇
District.Code 0 1 6.25 2.79 1 4 7 8 10 ▃▃▃▇▆
Neighborhood.Code 0 1 37.00 21.07 1 19 37 55 73 ▇▇▇▇▇
Number 0 1 10.05 18.56 0 0 2 11 139 ▇▁▁▁▁

unemployment

unemployment %>% head(10)
##    Year   Month District.Code District.Name Neighborhood.Code
## 1  2017 January             1  Ciutat Vella                 1
## 2  2017 January             1  Ciutat Vella                 2
## 3  2017 January             1  Ciutat Vella                 3
## 4  2017 January             1  Ciutat Vella                 4
## 5  2017 January             2      Eixample                 5
## 6  2017 January             2      Eixample                 6
## 7  2017 January             2      Eixample                 7
## 8  2017 January             2      Eixample                 8
## 9  2017 January             2      Eixample                 9
## 10 2017 January             2      Eixample                10
##                        Neighborhood.Name Gender     Demand_occupation Number
## 1                               el Raval   Male Registered unemployed   2107
## 2                         el Barri Gòtic   Male Registered unemployed    538
## 3                         la Barceloneta   Male Registered unemployed    537
## 4  Sant Pere, Santa Caterina i la Ribera   Male Registered unemployed    741
## 5                          el Fort Pienc   Male Registered unemployed    630
## 6                     la Sagrada Família   Male Registered unemployed   1032
## 7                 la Dreta de l'Eixample   Male Registered unemployed    655
## 8        l'Antiga Esquerra de l'Eixample   Male Registered unemployed    751
## 9         la Nova Esquerra de l'Eixample   Male Registered unemployed   1128
## 10                           Sant Antoni   Male Registered unemployed    847
str(unemployment)
## 'data.frame':    14208 obs. of  9 variables:
##  $ Year             : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ Month            : chr  "January" "January" "January" "January" ...
##  $ District.Code    : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ District.Name    : chr  "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" "Ciutat Vella" ...
##  $ Neighborhood.Code: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Neighborhood.Name: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ Gender           : chr  "Male" "Male" "Male" "Male" ...
##  $ Demand_occupation: chr  "Registered unemployed" "Registered unemployed" "Registered unemployed" "Registered unemployed" ...
##  $ Number           : int  2107 538 537 741 630 1032 655 751 1128 847 ...
skim(unemployment)
Data summary
Name unemployment
Number of rows 14208
Number of columns 9
_______________________
Column type frequency:
character 5
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Month 0 1 3 9 0 12 0
District.Name 0 1 6 19 0 11 0
Neighborhood.Name 0 1 5 44 0 74 0
Gender 0 1 4 6 0 2 0
Demand_occupation 0 1 19 21 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2015.38 1.32 2013 2014.75 2015.5 2016.25 2017 ▃▃▇▇▇
District.Code 0 1 7.50 11.06 1 4.00 7.0 8.00 99 ▇▁▁▁▁
Neighborhood.Code 0 1 37.84 22.12 1 19.00 37.5 56.00 99 ▇▇▇▆▁
Number 0 1 446.70 425.95 0 121.00 296.0 685.00 3094 ▇▂▁▁▁

baby_names

baby_names %>% head(10)
##    Order    Name Gender Year Frequency
## 1      1   LAURA Female 1996       237
## 2      2   MARIA Female 1996       219
## 3      3   MARTA Female 1996       206
## 4      4  ANDREA Female 1996       199
## 5      5    ANNA Female 1996       195
## 6      6   PAULA Female 1996       186
## 7      7    ALBA Female 1996       142
## 8      8    SARA Female 1996       125
## 9      9    LAIA Female 1996       123
## 10    10 CLAUDIA Female 1996       122
str(baby_names)
## 'data.frame':    1050 obs. of  5 variables:
##  $ Order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name     : chr  "LAURA" "MARIA" "MARTA" "ANDREA" ...
##  $ Gender   : chr  "Female" "Female" "Female" "Female" ...
##  $ Year     : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
##  $ Frequency: int  237 219 206 199 195 186 142 125 123 122 ...
skim(baby_names)
Data summary
Name baby_names
Number of rows 1050
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1 3 9 0 97 0
Gender 0 1 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Order 0 1 13.00 7.21 1 7 13 19 25 ▇▇▇▇▇
Year 0 1 2006.00 6.06 1996 2001 2006 2011 2016 ▇▆▆▆▆
Frequency 0 1 101.66 44.34 44 69 89 122 348 ▇▃▁▁▁

names

names %>% head(10)
##    Order       Name Gender      Decade Frequency
## 1      1      MARIA Female Before 1930      2085
## 2      2     CARMEN Female Before 1930      1463
## 3      3     JOSEFA Female Before 1930      1032
## 4      4 MONTSERRAT Female Before 1930       879
## 5      5       ROSA Female Before 1930       762
## 6      6     TERESA Female Before 1930       734
## 7      7      PILAR Female Before 1930       694
## 8      8    DOLORES Female Before 1930       679
## 9      9   MERCEDES Female Before 1930       603
## 10    10  FRANCISCA Female Before 1930       596
str(names)
## 'data.frame':    1100 obs. of  5 variables:
##  $ Order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name     : chr  "MARIA" "CARMEN" "JOSEFA" "MONTSERRAT" ...
##  $ Gender   : chr  "Female" "Female" "Female" "Female" ...
##  $ Decade   : chr  "Before 1930" "Before 1930" "Before 1930" "Before 1930" ...
##  $ Frequency: int  2085 1463 1032 879 762 734 694 679 603 596 ...
skim(names)
Data summary
Name names
Number of rows 1100
Number of columns 5
_______________________
Column type frequency:
character 3
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1 3 16 0 276 0
Gender 0 1 4 6 0 2 0
Decade 0 1 4 11 0 11 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Order 0 1 25.50 14.44 1 13 25.5 38 50 ▇▇▇▇▇
Frequency 0 1 1263.57 1989.23 39 379 621.5 1145 19173 ▇▁▁▁▁

life_expectancy

life_expectancy %>% head(10)
##                             Neighborhood 2006-2010 2007-2011 2008-2012
## 1                               el Raval      87.5      84.9      84.7
## 2                         el Barri Gòtic      88.0      84.3      84.4
## 3                         la Barceloneta      88.2      85.3      84.4
## 4  Sant Pere, Santa Caterina i la Ribera      88.9      85.9      86.1
## 5                          el Fort Pienc      89.7      87.1      87.2
## 6                     la Sagrada Família      88.9      86.6      86.8
## 7                 la Dreta de l'Eixample      87.0      85.0      85.1
## 8        l'Antiga Esquerra de l'Eixample      88.9      86.4      86.3
## 9         la Nova Esquerra de l'Eixample      90.0      87.4      87.5
## 10                           Sant Antoni      89.0      86.3      86.5
##    2009-2013 2010-2014 Gender
## 1       84.9      85.3 Female
## 2       87.5      84.4 Female
## 3       86.3      84.7 Female
## 4       85.5      85.3 Female
## 5       87.7      86.7 Female
## 6       85.1      87.2 Female
## 7       85.7      85.1 Female
## 8       85.8      86.9 Female
## 9       87.0      88.7 Female
## 10      86.0      87.6 Female
str(life_expectancy)
## 'data.frame':    146 obs. of  7 variables:
##  $ Neighborhood: chr  "el Raval" "el Barri Gòtic" "la Barceloneta" "Sant Pere, Santa Caterina i la Ribera" ...
##  $ 2006-2010   : num  87.5 88 88.2 88.9 89.7 88.9 87 88.9 90 89 ...
##  $ 2007-2011   : num  84.9 84.3 85.3 85.9 87.1 86.6 85 86.4 87.4 86.3 ...
##  $ 2008-2012   : num  84.7 84.4 84.4 86.1 87.2 86.8 85.1 86.3 87.5 86.5 ...
##  $ 2009-2013   : num  84.9 87.5 86.3 85.5 87.7 85.1 85.7 85.8 87 86 ...
##  $ 2010-2014   : num  85.3 84.4 84.7 85.3 86.7 87.2 85.1 86.9 88.7 87.6 ...
##  $ Gender      : chr  "Female" "Female" "Female" "Female" ...
skim(life_expectancy)
Data summary
Name life_expectancy
Number of rows 146
Number of columns 7
_______________________
Column type frequency:
character 2
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Neighborhood 0 1 5 44 0 73 0
Gender 0 1 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
2006-2010 8 0.95 84.93 3.87 74.2 81.70 83.9 88.50 90.9 ▁▂▇▂▇
2007-2011 6 0.96 82.47 3.96 70.7 79.50 81.7 86.12 88.2 ▁▁▇▂▇
2008-2012 8 0.95 82.78 3.83 69.5 79.90 82.0 86.30 88.2 ▁▁▆▃▇
2009-2013 6 0.96 83.10 3.85 69.0 80.10 82.7 86.53 88.5 ▁▁▆▃▇
2010-2014 4 0.97 83.45 3.91 71.3 80.53 83.4 86.97 89.8 ▁▂▇▃▇

Data Analysis

Population

Let’s begin with the population by year

# Population by year

population %>% 
               group_by(Year, Gender) %>%
               summarise(count = sum(Number)) %>%
               mutate(percent = paste0(round((count/sum(count))*100, 2), "%")) %>%
               ggplot(aes(x = Year, y = count)) +
               geom_bar(stat = 'identity', aes(fill = Gender)) +
               geom_text(aes(label = percent, group = Gender), 
                         position = position_stack(vjust = 0.5), color = 'white') +
               scale_y_continuous(labels = comma) +
               labs(x = 'Year', y = 'Population', title ='Population by year (2013-2017)') +
               theme_bw()
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.

The city of Barcelona’s population remains relatively stable as it’s already extremely densely populated. The female population is slightly higher than the male. The population per year is very similar, so in the following visualizations we will only take in account the last year (2017). We can continue with the population by age.

#Ordered levels

population$Age <- ordered(population$Age, levels=c("0-4", "5-9", "10-14", "15-19",
                                                   "20-24", "25-29", "30-34", "35-39",
                                                   "40-44", "45-49", "50-54", "55-59",
                                                   "60-64", "65-69", "70-74", "75-79",
                                                   "80-84", "85-89", "90-94", ">=95"))

# Population by age (2017)

population2017 <- population %>%
               filter(Year == '2017')

ggplot(data = population2017, aes(x = Age, fill = Gender)) +
               geom_bar(data = filter(population2017, Gender == 'Female'), aes( y =Number), stat = 'identity') +
               geom_bar(data = filter(population2017, Gender == 'Male'), aes(y = Number*(-1)), stat = 'identity') +
               scale_y_continuous(breaks = seq(-100000, 100000, 20000),
                                  labels = comma(abs(seq(-100000, 100000, 20000)))) +
               labs(x = 'Age', y = 'Population', title = 'Population by age (2017)') + 
               coord_flip() +
               theme_bw()

The population distribution is centered around 25-44 years. It’s interesting to observe how the male population decreases considerably from the 40-44 years old range, while in the female gender the decrease is less pronounced . It seems clear that men live less years in Barcelona!

# Population by district (2017)

population %>% filter(Year == "2017") %>%
               group_by(District.Name, Gender) %>%
               summarise(count = sum(Number)) %>% 
               mutate(percent = paste0(round((count/sum(count))*100, 2), "%")) %>%
               ggplot(aes(x = reorder(District.Name, count), y = count)) +
               geom_bar(stat = 'identity', aes(fill = Gender)) +
               geom_text(aes(label = percent, group = Gender), position = position_stack(vjust = 0.5)) +
               scale_y_continuous(breaks = seq(0, 300000, 100000), labels = comma) + 
               labs(x = 'District name', y = 'Population', title = 'Population by district (2017)') +
               theme_bw() +
               coord_flip()
## `summarise()` has grouped output by 'District.Name'. You can override using the `.groups` argument.

Example is the most populated district of Barcelona, followed by Sant Martí and Sants-Montjuïc. Conversely, Les Corts is the least populated district. We can see how the female population is higher than the male population in all neighborhoods, except in Ciutat Vella. We can go into more detail and analyze the population by neighborhoods. There are more than 70 neighborhoods, so we’re just going to visualize the most populated.

# Population by neighbordhood - Top 15 neighbordhoods (2017)
population %>%
  filter(Year=="2017") %>%
  filter(Neighborhood.Name %in% c("la Nova Esquerra de l'Eixample", "Sant Andreu",
                                  "la Sagrada Família", "la Vila de Gràcia",
                                  "Sant Gervasi - Galvany", "el Raval", "les Corts",
                                  "la Dreta de l'Eixample", "l'Antiga Esquerra de l'Eixample",
                                  "Sants", "el Poble Sec", "Sant Antoni", 
                                  "el Camp de l'Arpa del Clot", "el Guinardó", 
                                  "el Camp d'en Grassot i Gràcia Nova")) %>%
  group_by(Neighborhood.Name, Gender) %>%
  summarise(count=sum(Number)) %>%
  mutate(percent=paste0(round((count/sum(count))*100, 2), "%")) %>%
  ggplot(aes(x=reorder(Neighborhood.Name, count), y=count)) +
  geom_bar(stat="identity", aes(fill=Gender)) +
  geom_text(aes(label=percent, group=Gender), position=position_stack(vjust=0.5)) +
  scale_y_continuous(breaks=seq(0, 60000, 20000), labels=comma) +
  labs(x="Neighbordhood name", y="Population", title="Population by neighbordhood (2017)") +
  theme_bw() + 
  coord_flip()
## `summarise()` has grouped output by 'Neighborhood.Name'. You can override using the `.groups` argument.

In the Top 15 neighborhoods the female population is higher than the male population, except in el Raval.

Immigration and emigration

In this section we are going to show some visualizations to understand the complex phenomenon of immigration and emigration in the city of Barcelona. Let’s begin with the analysis by year.

immigrant_emigrants_by_sex <- gather(immigrant_emigrants_by_sex, 
                                     `Immigrants/Emigrants`, value, Immigrants:Emigrants, 
                                     na.rm = TRUE)

immigrant_emigrants_by_sex %>% 
               group_by(Gender, `Immigrants/Emigrants`, Year) %>% 
               summarise( count = sum(value)) %>% 
               ggplot(aes(x = `Immigrants/Emigrants`, y = count, fill = Gender)) + 
               geom_bar(stat = 'identity', position = 'stack') + 
               facet_grid(~ Year) +
               theme_bw() + 
               labs(y = 'Population', title = 'Immigration and emigration by year (2013-2017)') +
               scale_y_continuous(labels = comma) + 
               theme(axis.text.x = element_text(angle = 45, hjust = 1), 
                     axis.title.x = element_blank())
## `summarise()` has grouped output by 'Gender', 'Immigrants/Emigrants'. You can override using the `.groups` argument.

Whereas emigration has remained more or less constant over recent years, immigration has increased considerably since 2013. Let’s study the immigration and emigration distribution

immigrants_emigrants_by_age <- gather(immigrants_emigrants_by_age, `Immigrants/Emigrants`, 
                                      value, Immigrants:Emigrants, na.rm=TRUE)

# Ordered levels
immigrants_emigrants_by_age$Age <- ordered(immigrants_emigrants_by_age$Age, 
                                           levels=c("0-4", "5-9", "10-14", "15-19",
                                                    "20-24", "25-29", "30-34", "35-39",
                                                    "40-44", "45-49", "50-54", "55-59",
                                                    "60-64", "65-69", "70-74", "75-79",
                                                    "80-84", "85-89", "90-94", "95-99",
                                                    ">=100"))

# Immigration and emigration by age (2017)
immigrants_emigrants_by_age_2017 <- immigrants_emigrants_by_age %>%
  filter(Year=="2017")

ggplot(data=immigrants_emigrants_by_age_2017, aes(x=Age, fill=`Immigrants/Emigrants`)) +
  geom_bar(data=filter(immigrants_emigrants_by_age_2017, `Immigrants/Emigrants`=="Immigrants"), 
           aes(y=value), stat="identity") + 
  geom_bar(data=filter(immigrants_emigrants_by_age_2017, `Immigrants/Emigrants`=="Emigrants"), 
           aes(y=value*(-1)), stat="identity") +
  scale_y_continuous(breaks=seq(-30000, 30000, 5000), labels=comma(abs(seq(-30000, 30000, 5000)))) + 
  labs(x="Age", y="Population", title="Immigration and emigration by age (2017)") +
  coord_flip() +
  theme_bw() +
  guides(fill=guide_legend(title=NULL)) 

As we can see, there were more than 20.000 immigrants between 20-24 years old in 2017. With respect to emigration, the distribution is centered around 25-39 years old.

In the next visualization we are going to analyze the emigration by place of destination (2017) using a Sankey diagram. A Sankey Diagram displays flows and their quantities, in proportion to one another using the width of arrows or lines to show their magnitudes. Let’s see!

library(networkD3)
## Warning: package 'networkD3' was built under R version 4.0.4
# Nodes 

nodes <- as.data.frame(unique(immigrants_emigrants_by_destination$from))
nodes$id <- 1:nrow(nodes)

nodes <- nodes[, c(2, 1)]
names(nodes) <- c('id', 'label')

# Emigrants 

emigrants_by_destination <- immigrants_emigrants_by_destination %>% 
               filter(from == "Barcelona")

# Edges 

edges <- emigrants_by_destination %>% 
               left_join(nodes, by = c("from" = "label")) %>%
               select(-from) %>%
               dplyr::rename(from = id)

edges <- emigrants_by_destination %>% 
               left_join(nodes, by = c("from" = "label")) %>%
               select(-from) %>%
               dplyr::rename(from = id) 

edges <- edges %>% left_join(nodes, by = c('to' = 'label')) %>%
               select(-to) %>%
               dplyr::rename(to = id)

nodes_d3 <- mutate(nodes, id = id - 1)
edges_d3 <- mutate(edges, from = from - 1, to = to -1)

# sankeyNetwork - Emigrants destination 

sankeyNetwork(Links = edges_d3, Nodes = nodes_d3, Source= 'from', 
              Target = 'to', NodeID = 'label', Value = 'weight', fontSize = 16, unit = 'Letter(s)')

Firstly it’s important to emphasize that the data set used for this visualization doesn’t specify destinations outside of Spain (category “Abroad”). Some comments about the graph:

As we can see, vast majority of inhabitants migrate to other areas of Catalonia (due to the high rental prices in Barcelona, perhaps?).

A high percentage of inhabitants emigrate outside of Spain (looking for better job opportunities, maybe?).

Within Spain, the autonomous communities with more prominence are: Community of Madrid, Andalusia, Balearic Islands and Valencian Community.

We can further detail the visualization and also study the emigration by place of destination and by neighbourhoods of the city of Barcelona (2017).

# Nodes
nodes <- as.data.frame(unique(immigrants_emigrants_by_destination2$from))
nodes$id <- 1:nrow(nodes)
nodes <- nodes[, c(2,1)]
names(nodes) <- c("id", "label")

# Emigrants 
emigrants_by_destination2 <- immigrants_emigrants_by_destination2 %>%
  filter(from %in% c("Ciutat Vella", "Eixample", "Gràcia", "Horta-Guinardó", 
                     "Les Corts", "Nou Barris", "Sant Andreu", "Sant Martí",
                     "Sants-Montjuïc", "Sarrià-Sant Gervasi"))

# Edges
edges <- emigrants_by_destination2 %>% 
  left_join(nodes, by=c("from"="label")) %>%
  select(-from) %>%
  dplyr::rename(from=id)
  
edges <- edges %>% 
  left_join(nodes, by=c("to"="label")) %>%
  select(-to) %>%
  dplyr::rename(to=id)

nodes_d3 <- mutate(nodes, id=id-1)
edges_d3 <- mutate(edges, from=from-1, to=to-1)

# sankeyNetwork - Emigrants destination

sankeyNetwork(Links=edges_d3, Nodes=nodes_d3, Source="from", Target="to", 
              NodeID="label", Value="weight", fontSize=16, unit="Letter(s)")