Introduction

We are interested in the geographic breakdown of “elite” sports clubs in France (along with Corsica, excluding French overseas territories).
Only team sports are considered here (football, handball, rugby, basketball, volleyball, ice hockey).
By “elite” club, I mean clubs that are evolving in first or second division of each relevant sport (even if some sports don’t really have a second division) ; also, “elite” does not necessarly mean professional, as some of these clubs are still amateurs (especially true for women championships).

Preparation

Libraries

library(tidyverse)
library(ggmap)
library(sf)
library(tmap)
library(formattable)

Data

I created a quick csv file, listing the cities where an “elite” club is present for different sports.

# Import data
clubs <- read_csv("clubs_pro_france.csv")

# quick glance at the data
head(sample_n(tbl = clubs, size = nrow(clubs)))
## # A tibble: 6 x 4
##   Ville               Sport    Division Categorie
##   <chr>               <chr>       <int> <chr>    
## 1 Strasbourg          Hockey          2 H        
## 2 Toulouse            Handball        1 H        
## 3 Reims               Basket          1 H        
## 4 Toulon              Handball        1 F        
## 5 Villard de Lans     Hockey          2 H        
## 6 Chalon en Champagne Hockey          2 H

We have 4 variables:

  • Ville: french city where the “elite” club is located
  • Sport: sport (Football, Handball, Basketball, Rugby, Volleyball, Ice Hockey)
  • Division: league (1st division or 2nd division)
  • Categorie: gender (men (H) or women (F))

Transform Data for Mapping

We will need to transform the previous data, in order to be able to map them correctly.

‘Clubs’ data transformations

Geocoding of the cities from clubs dataframe.
We are adding ‘France’ after the cities, to be sure we are geocoding the correct city (with the exception of ‘Monaco’, which is not a French city, but has teams in some championships).

# find longitude and latitude for all cities listed in 'clubs'
clubs_coords <- sapply(X = clubs[, "Ville"], 
                      FUN = function(x) geocode(location = paste(x, "France"),
                                                source = "dsk", messaging = FALSE))

# find longitude and latitude for "Monaco"
monaco_coords <- geocode(location = "Monaco, Monaco", source = "dsk", messaging = FALSE)

Add coordinates to the dataframe.

# add coordinates to the dataframe
clubs_ok <- clubs %>% 
  mutate(long = clubs_coords[[1]], lat = clubs_coords[[2]]) %>% 
  # specific case for "Monaco
  mutate(long = if_else(Ville == "Monaco", monaco_coords[1] %>% pull(), long),
         lat  = if_else(Ville == "Monaco", monaco_coords[2] %>% pull(), lat))

In order to map the data, we have to transform it into a sf object.

# transform into 'sf' object, using a CRS equal to 4326
clubs_sf <- st_as_sf(x = clubs_ok, 
                     coords = c("long","lat"), crs = 4326)

head(clubs_sf)  # geometry column has been created
## Simple feature collection with 6 features and 4 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -1.55336 ymin: 43.29695 xmax: 7.41903 ymax: 47.46667
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 6 x 5
##   Ville     Sport    Division Categorie            geometry
##   <chr>     <chr>       <int> <chr>             <POINT [°]>
## 1 Marseille Football        1 H          (5.38107 43.29695)
## 2 Toulouse  Football        1 H          (1.44367 43.60426)
## 3 Nantes    Football        1 H         (-1.55336 47.21725)
## 4 Monaco    Football        1 H          (7.41903 43.73141)
## 5 Angers    Football        1 H            (-0.55 47.46667)
## 6 Nimes     Football        1 H             (4.35 43.83333)
# quick plot of the new 'sf' object
plot(clubs_sf["Sport"])

# same plot using ggplot
ggplot(clubs_sf) +
  geom_sf()

‘France’ data

In order to visualize the data from clubs into a France map, we have to create this 2nd map.
These data were extracted from the GADM database.

# import background map
fr_map <- readRDS("gadm36_FRA_1_sf.rds")

head(fr_map)
## Simple feature collection with 6 features and 10 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -5.143751 ymin: 41.33375 xmax: 9.560416 ymax: 50.16764
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##    GID_0 NAME_0   GID_1                  NAME_1 VARNAME_1 NL_NAME_1 TYPE_1
## 1    FRA France FRA.1_1    Auvergne-Rhône-Alpes      <NA>      <NA> Région
## 6    FRA France FRA.2_1 Bourgogne-Franche-Comté      <NA>      <NA> Région
## 7    FRA France FRA.3_1                Bretagne      <NA>      <NA> Région
## 8    FRA France FRA.4_1     Centre-Val de Loire      <NA>      <NA> Région
## 9    FRA France FRA.5_1                   Corse   Corsica      <NA> Région
## 10   FRA France FRA.6_1               Grand Est      <NA>      <NA> Région
##    ENGTYPE_1 CC_1 HASC_1                       geometry
## 1     Region <NA>  FR.AR MULTIPOLYGON (((2.22057 44....
## 6     Region <NA>  FR.BF MULTIPOLYGON (((5.894627 46...
## 7     Region <NA>  FR.BT MULTIPOLYGON (((-2.83514 47...
## 8     Region <NA>  FR.CN MULTIPOLYGON (((1.12372 46....
## 9     Region <NA>  FR.CE MULTIPOLYGON (((9.25764 41....
## 10    Region <NA>  FR.AO MULTIPOLYGON (((4.299289 47...
# plot the map
tm_shape(fr_map) +
  tm_polygons(col = "tan") +
  tm_layout(bg.color = "skyblue",
            main.title = "Metropolitan France (including Corsica)",
            main.title.size = 1.2,
            main.title.position = "center",
            fontfamily = "Bookman"
            )

Mapping

Before plotting the map, we will transform the CRS into an unprojected CRS for both data/map.

# actual CRS: => projected CRS
st_crs(clubs_sf)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
st_crs(fr_map)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
# transform CRS to unprojected CRS ; use CRS code = 2154 (projected coordinate system for "France" : https://epsg.io/?q=France%20kind%3APROJCRS)
clubs_sf_proj <- st_transform(x = clubs_sf, crs = 2154)
fr_map_proj <- st_transform(x = fr_map, crs = 2154)

We can now plot our data.
For example, plot all cities where there is (at least) one “elite” club.

map1 <- tm_shape(fr_map_proj) +
  tm_borders() +
  tm_shape(clubs_sf_proj) +
  tm_dots(col = "Sport", size = 0.3, jitter = 0.1) +
  tm_layout(main.title = "Which sport in which city?",
            main.title.position = "center",
            main.title.size = 1.2,
            fontfamily = "Bookman"
            )

map1

This is not really readable, we can plot the number of clubs per city.

# using number of clubs per city to size the dots
clubs_size <- clubs_sf_proj %>% 
  count(Ville)

map2 <- tm_shape(fr_map_proj) +
  tm_borders() +
  tm_shape(clubs_size) +
  tm_dots(col = "n", size = "n", scale = 2, alpha = 0.7, style = "cat", 
          title = "",
          palette = tmaptools::get_brewer_pal("Dark2", n = 8, plot = FALSE), 
          legend.size.show = FALSE
          ) +
  tm_layout(main.title = 'Number of "elite" clubs per city',
            main.title.position = "center",
            main.title.size = 1.2,
            fontfamily = "Bookman"
            )

map2

# arranging both maps side by side
tmap_arrange(map1, map2, nrow = 1)

Plot “elite” cities by gender.

map3 <- tm_shape(fr_map_proj) +
  tm_borders() +
  tm_shape(clubs_sf_proj) +
  tm_dots(col = "Categorie", size = 0.2, jitter = 0.1, palette = c("hotpink1", "dodgerblue1")) +
  tm_facets(by = "Categorie", as.layers = TRUE) +
  tm_layout(legend.show = FALSE,
            panel.labels = c("Women", "Men")
            )

map3

Transform Data for Analysis

Join both spatial data, to compute statistics by region.

# join both data/map
all_data <- st_intersection(clubs_sf_proj, fr_map_proj)

head(all_data)
## Simple feature collection with 6 features and 14 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 706704.2 ymin: 6455601 xmax: 913392.9 ymax: 6569740
## epsg (SRID):    2154
## proj4string:    +proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 +x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 6 x 15
##   Ville Sport Division Categorie GID_0 NAME_0 GID_1 NAME_1 VARNAME_1
##   <chr> <chr>    <int> <chr>     <chr> <chr>  <chr> <chr>  <chr>    
## 1 Sain… Foot…        1 H         FRA   France FRA.… Auver… <NA>     
## 2 Lyon  Foot…        1 H         FRA   France FRA.… Auver… <NA>     
## 3 Gren… Foot…        2 H         FRA   France FRA.… Auver… <NA>     
## 4 Cler… Foot…        2 H         FRA   France FRA.… Auver… <NA>     
## 5 Bour… Bask…        1 H         FRA   France FRA.… Auver… <NA>     
## 6 Vill… Bask…        1 H         FRA   France FRA.… Auver… <NA>     
## # ... with 6 more variables: NL_NAME_1 <chr>, TYPE_1 <chr>,
## #   ENGTYPE_1 <chr>, CC_1 <chr>, HASC_1 <chr>, geometry <POINT [m]>

The data from France map has now been added to data from clubs.

# Remove geometry and compute how many clubs per region
clubs_region <- all_data %>% 
  # Remove geometry from dataframe
  st_set_geometry(NULL) %>% 
  # create factor in order to use 'complete'
  mutate(NAME_1 = as.factor(NAME_1),
         Categorie = as.factor(Categorie)) %>% 
  count(NAME_1, Categorie) %>%
  # add and keep value '0' where we have no count
  complete(NAME_1, Categorie, fill = list(n = 0)) %>% 
  # add total counts per region, for display in future plot
  add_count(NAME_1, wt = n) %>% 
  rename(Total = nn)

clubs_region
## # A tibble: 26 x 4
##    NAME_1                  Categorie     n Total
##    <fct>                   <fct>     <dbl> <dbl>
##  1 Auvergne-Rhône-Alpes    F             4    33
##  2 Auvergne-Rhône-Alpes    H            29    33
##  3 Bourgogne-Franche-Comté F             4    11
##  4 Bourgogne-Franche-Comté H             7    11
##  5 Bretagne                F             4    14
##  6 Bretagne                H            10    14
##  7 Centre-Val de Loire     F             3    12
##  8 Centre-Val de Loire     H             9    12
##  9 Corse                   F             0     3
## 10 Corse                   H             3     3
## # ... with 16 more rows

How many “elite” clubs per region?

ggplot(clubs_region, aes(x = reorder(NAME_1, n), y = n, fill = Categorie)) +
  geom_col(width = 0.7, position = "dodge") +
  scale_y_continuous(breaks = seq(0, 35, 5)) +
  # change legend title and texts
  scale_fill_manual(labels = c("Women", "Men"), values = c("hotpink1", "dodgerblue1")) +
  coord_flip(expand = 0, ylim = c(-3, 30)) +
  # add total counts per region
  geom_text(aes(label = paste0("(", Total, ")")), 
            y = -1.6,
            family = "Bookman",
            fontface = "italic",
            col = "sienna4",
            alpha = 0.8) +
  labs(title = 'More "elite" clubs in Southern France',
       x = "", y = 'Number of "elite" clubs',
       fill = "Gender") +
  theme(plot.title = element_text(family = "Bookman", face = "bold.italic", size = 18, 
                                  hjust = 0.9, vjust = 5), plot.margin = margin(10, 2, 5, 0, "mm"),
        axis.ticks.y = element_blank(),
        axis.text.y = element_text(family = "Bookman", face = "italic", size = 11, colour = "sienna4"),
        axis.text.x = element_text(family = "Bookman", face = "italic", size = 8, colour = "sienna4"),
        axis.title.x = element_text(family = "Bookman", face = "italic", size = 9, colour = "sienna4"),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(linetype = "dashed", colour = "grey90", size = 0.5),
        legend.title = element_text(family = "Bookman", face = "italic", colour = "sienna4"),
        legend.text = element_text(family = "Bookman", face = "italic", colour = "sienna4")
        )

Ratio between number of “elite” clubs and total population per region.
(Population data is from INSEE web site)

# Import data for population per region
fr_pop <- readxl::read_xls("estim-pop-nreg-sexe-aq-1975-2018.xls", sheet = "2018", range = "A5:V18")

head(fr_pop)
## # A tibble: 6 x 22
##   X__1  `0 à 4 ans` `5 à 9 ans` `10 à 14 ans` `15 à 19 ans` `20 à 24 ans`
##   <chr>       <dbl>       <dbl>         <dbl>         <dbl>         <dbl>
## 1 Auve…      460097      506417        505548        497434        457304
## 2 Bour…      143511      164296        170767        168188        141824
## 3 Bret…      170095      200975        209705        209957        174775
## 4 Cent…      137424      158048        161842        156152        127016
## 5 Corse       15501       18379         17550         17735         15893
## 6 Gran…      293508      328163        331942        335903        313173
## # ... with 16 more variables: `25 à 29 ans` <dbl>, `30 à 34 ans` <dbl>,
## #   `35 à 39 ans` <dbl>, `40 à 44 ans` <dbl>, `45 à 49 ans` <dbl>, `50 à
## #   54 ans` <dbl>, `55 à 59 ans` <dbl>, `60 à 64 ans` <dbl>, `65 à 69
## #   ans` <dbl>, `70 à 74 ans` <dbl>, `75 à 79 ans` <dbl>, `80 à 84
## #   ans` <dbl>, `85 à 89 ans` <dbl>, `90 à 94 ans` <dbl>, `95 ans et
## #   plus` <dbl>, Total <dbl>
# only keep variables of interest
fr_pop_ok <- fr_pop %>% 
  select(X__1, Total)

# change the values of region names, in order to match those from 'all_data'
fr_pop_ok[, 1] <- all_data %>% st_set_geometry(NULL) %>% distinct(NAME_1)

fr_pop_ok
## # A tibble: 13 x 2
##    X__1                          Total
##    <chr>                         <dbl>
##  1 Auvergne-Rhône-Alpes        8037059
##  2 Bourgogne-Franche-Comté     2813289
##  3 Bretagne                    3336643
##  4 Centre-Val de Loire         2582522
##  5 Corse                        337796
##  6 Grand Est                   5548090
##  7 Hauts-de-France             6023336
##  8 Île-de-France              12246234
##  9 Normandie                   3342467
## 10 Nouvelle-Aquitaine          5994336
## 11 Occitanie                   5903190
## 12 Pays de la Loire            3787411
## 13 Provence-Alpes-Côte d'Azur  5065723

Join the data.

clubs_pop <- clubs_region %>% 
  count(NAME_1, wt = n) %>% 
  left_join(fr_pop_ok, by = c("NAME_1" = "X__1")) %>% 
  # compute number of clubs per 100000 inhbitants
  mutate(per100000 = round(nn / Total * 100000, 2)) %>% 
  # change the format of 'Population' for nice display in the table
  mutate(Total = format(Total, big.mark = " ", trim = TRUE)) %>% 
  arrange(desc(per100000)) %>% 
  # Rename for table display
  select(Region = NAME_1, Population = Total, `# of clubs` = nn, `Per 100 000 capita` = per100000)

clubs_pop
## # A tibble: 13 x 4
##    Region                     Population `# of clubs` `Per 100 000 capita`
##    <chr>                      <chr>             <dbl>                <dbl>
##  1 Corse                      337 796               3                 0.89
##  2 Centre-Val de Loire        2 582 522            12                 0.46
##  3 Occitanie                  5 903 190            27                 0.46
##  4 Provence-Alpes-Côte d'Azur 5 065 723            22                 0.43
##  5 Bretagne                   3 336 643            14                 0.42
##  6 Auvergne-Rhône-Alpes       8 037 059            33                 0.41
##  7 Grand Est                  5 548 090            22                 0.4 
##  8 Pays de la Loire           3 787 411            15                 0.4 
##  9 Bourgogne-Franche-Comté    2 813 289            11                 0.39
## 10 Nouvelle-Aquitaine         5 994 336            21                 0.35
## 11 Normandie                  3 342 467            11                 0.33
## 12 Hauts-de-France            6 023 336            18                 0.3 
## 13 Île-de-France              12 246 234           26                 0.21

Pretty table (using formattable package).

format_table(x = clubs_pop,
             format = "html",
             align = c("l", "c", "c", "l"), 
             formatters = list(
               `Per 100 000 capita` = color_bar("pink")
             )) %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  kableExtra::column_spec(column = 1, border_right = TRUE) %>% 
  kableExtra::column_spec(column = 4, italic = TRUE)
Region Population # of clubs Per 100 000 capita
Corse 337 796 3 0.89
Centre-Val de Loire 2 582 522 12 0.46
Occitanie 5 903 190 27 0.46
Provence-Alpes-Côte d’Azur 5 065 723 22 0.43
Bretagne 3 336 643 14 0.42
Auvergne-Rhône-Alpes 8 037 059 33 0.41
Grand Est 5 548 090 22 0.40
Pays de la Loire 3 787 411 15 0.40
Bourgogne-Franche-Comté 2 813 289 11 0.39
Nouvelle-Aquitaine 5 994 336 21 0.35
Normandie 3 342 467 11 0.33
Hauts-de-France 6 023 336 18 0.30
Île-de-France 12 246 234 26 0.21