Municipalities Data Analysis

Motivation

This analysis investigates data on the municipalities of Switzerland. The data is taken from the Swiss Federal Statistical Office. It was carried out for the course Data Analysis (Visualization) with R and works through the requirements given in the task.

Part 1 Data Exploration

Libraries

library(tidyverse)  # attaches ggplot2, dplyr, tidyr and readr (read_csv)
library(readxl)
library(ggplot2)
library(dplyr)
library(reshape2)   # melt() for the correlation matrix

1 Read data

gemeindedaten <- read_csv("gemeindedaten.csv")

2 Coding of the data

All columns are correctly coded as numeric or character, except for the column “polit_pda”, which is coded as character because it contains the placeholder “*”. This column is converted to numeric in the next step.

3 Data cleaning for polit_pda

The character “*” in the column “polit_pda” is replaced by NA.

# Set every entry containing a literal "*" to NA, then convert to numeric
gemeindedaten$polit_pda[grepl("*", gemeindedaten$polit_pda, fixed = TRUE)] <- NA
gemeindedaten$polit_pda <- as.numeric(gemeindedaten$polit_pda)
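
The cleaning step can be tried out on a small vector first. The following is a minimal sketch in base R; the values in `polit_pda_raw` are made up for illustration and are not taken from the data set.

```r
# Toy vector standing in for polit_pda (hypothetical values)
polit_pda_raw <- c("1.2", "*", "0.4", NA, "*")

# Mark every entry that contains a literal asterisk as missing
polit_pda_clean <- polit_pda_raw
polit_pda_clean[grepl("*", polit_pda_clean, fixed = TRUE)] <- NA

# Convert to numeric; the placeholders are already NA at this point
polit_pda_num <- as.numeric(polit_pda_clean)

print(polit_pda_num)
print(sum(is.na(polit_pda_num)))
```

Setting the placeholder to NA before calling `as.numeric()` keeps the conversion free of coercion warnings, which makes genuinely malformed values easier to spot.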

4 Number of municipalities in Switzerland in 2020

number_gmd <- n_distinct(gemeindedaten$gmdename)

The number of municipalities in Switzerland in 2020 is 2202.

median_bev_total <- median(gemeindedaten$bev_total)

The median population of the municipalities in Switzerland in 2020 is 1536.

max_gmd <- gemeindedaten %>% filter(bev_total == max(bev_total)) %>% select(gmdename)
min_gmd <- gemeindedaten %>% filter(bev_total == min(bev_total)) %>% select(gmdename)

The most populous municipality in Switzerland in 2020 is Zürich, and the least populous is Corippo.

Part 2 Graphical Data Exploration

5 In which canton are the most municipalities located, and in which canton the fewest?

kantone_gmd <- gemeindedaten %>%
  group_by(kantone) %>%
  summarise(num_man = n_distinct(gmdename)) %>%
  arrange(desc(num_man))

# Max
max_gmd <- kantone_gmd %>% 
  filter(num_man == max(num_man))

# Min
min_gmd <- kantone_gmd %>% 
  filter(num_man == min(num_man))

ggplot(kantone_gmd, aes(x = reorder(kantone, num_man), y = num_man, fill = num_man)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_gradient(low = "blue", high = "red") +  
  labs(title = "Number of municipalities per canton",
       x = "Canton",
       y = "Number of municipalities") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

The most municipalities are located in the canton BE, and the fewest in the cantons BS and GL, which are tied.
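
Filtering on equality with the minimum returns every tied canton rather than just one. The following base R sketch shows that behaviour on a toy table; the canton codes and municipality names are hypothetical.

```r
# Toy data: one row per municipality (hypothetical codes and names)
toy <- data.frame(
  kantone  = c("AA", "AA", "AA", "BB", "CC", "DD", "DD"),
  gmdename = c("a1", "a2", "a3", "b1", "c1", "d1", "d2")
)

# Number of municipalities per canton
counts <- table(toy$kantone)

# Comparing against max/min keeps all cantons that share the extreme value
most   <- names(counts)[as.vector(counts == max(counts))]
fewest <- names(counts)[as.vector(counts == min(counts))]

print(most)
print(fewest)
```

Here two cantons share the minimum, so `fewest` has two entries. A function such as `which.min()` would report only the first of them.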

6 Most populous municipality in each language region

The most populous municipality of each language region is shown in the following plot.

most_pop_gem_language <- gemeindedaten %>%
  group_by(sprachregionen) %>%
  slice_max(order_by = bev_total, n = 1) %>%
  select(sprachregionen, gmdename, bev_total)

ggplot(most_pop_gem_language, aes(x = reorder(gmdename, bev_total), y = bev_total, fill = sprachregionen)) +
  geom_bar(stat = "identity") +
  labs(title = "Most populous municipality in each language region",
       x = "Municipality",
       y = "Population",
       fill = "Language region") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()

7a Development of the population in the municipalities from 2010 to 2018 grouped by language region

Because the column “bev_1018” only gives the percentage change of the population from 2010 to 2018, the 2010 population must first be back-calculated from the current population. The development of the population from 2010 to 2018 is then aggregated by language region and shown in the following plot as a percentage change.

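# bev_1018 is the percentage change relative to 2010, so the 2010 population
# is recovered by dividing the current population by the growth factor
gemeindedaten <- gemeindedaten %>%
  mutate(bev_1018_total = bev_total / (1 + bev_1018 / 100))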
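
A small worked example may help to check the back-calculation. It assumes that “bev_1018” is the percentage change relative to the 2010 population; the numbers are made up. Under that assumption the earlier population is obtained by dividing by the growth factor, while multiplying by (1 - p / 100) is only an approximation that drifts as the change gets larger.

```r
# Hypothetical municipality: 1100 inhabitants today after +10 % growth
pop_now    <- 1100
change_pct <- 10

# Exact inverse of "old * (1 + p / 100) = new"
pop_old_exact <- pop_now / (1 + change_pct / 100)

# Approximation that is sometimes used instead; it is off by 10 here
pop_old_approx <- pop_now * (1 - change_pct / 100)

print(round(pop_old_exact))   # about 1000
print(round(pop_old_approx))  # about 990

# Sanity check: growing the exact value by 10 % gives back today's population
print(all.equal(pop_old_exact * (1 + change_pct / 100), pop_now))
```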

pop_old_lang <- gemeindedaten %>%
  group_by(sprachregionen) %>%
  summarise(pop_old = sum(bev_1018_total), .groups = "drop")

pop_new_lang <- gemeindedaten %>%
  group_by(sprachregionen) %>%
  summarise(pop_new = sum(bev_total), .groups = "drop")

pop_percent_change <- pop_old_lang %>%
  left_join(pop_new_lang, by = "sprachregionen") %>%
  mutate(percent_change = ((pop_new - pop_old) / pop_old) * 100)

ggplot(pop_percent_change, aes(x = reorder(sprachregionen, percent_change), y = percent_change, fill = sprachregionen)) +
  geom_bar(stat = "identity") +
  labs(title = "Development of the population in the municipalities from 2010 to 2018 grouped by language region",
       x = "Language region",
       y = "Change in population (%)") +
  theme_minimal()
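
The regional change is computed from summed populations, which is not the same as averaging the municipal percentages. The following sketch with two hypothetical municipalities shows how far the two figures can diverge.

```r
# Two hypothetical municipalities in the same region
toy <- data.frame(
  pop_old = c(100000, 100),
  pop_new = c(101000, 150)
)

# Change of the region as a whole, based on summed populations
agg_change <- (sum(toy$pop_new) - sum(toy$pop_old)) / sum(toy$pop_old) * 100

# Unweighted mean of the municipal percentage changes
mean_change <- mean((toy$pop_new - toy$pop_old) / toy$pop_old * 100)

print(round(agg_change, 2))
print(mean_change)
```

The small municipality grows by 50 % and dominates the unweighted mean, while it barely moves the aggregated figure. Summing first, as in the analysis, answers the question of how the population of the region as a whole has developed.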

7b The same analysis as 7a, additionally broken down by stadt_land

The following plot shows the development of the population in the municipalities from 2010 to 2018, grouped by language region and settlement structure (stadt_land). In the French-speaking region, the population is increasing primarily in the agglomerations, while in the German- and Italian-speaking regions it is growing primarily in rural areas.

pop_old_stadt_land <- gemeindedaten %>%
  group_by(sprachregionen, stadt_land) %>%
  summarise(pop_old = sum(bev_1018_total), .groups = "drop") 

pop_new_stadt_land <- gemeindedaten %>%
  group_by(sprachregionen, stadt_land) %>%
  summarise(pop_new = sum(bev_total), .groups = "drop")

# Join old and new totals, then compute the percentage change
pop_percent_change_stadt_land <- pop_old_stadt_land %>%
  left_join(pop_new_stadt_land, by = c("sprachregionen", "stadt_land")) %>%
  mutate(percent_change = ((pop_new - pop_old) / pop_old) * 100)

ggplot(pop_percent_change_stadt_land, aes(x = reorder(sprachregionen, percent_change), y = percent_change, fill = stadt_land)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Percentage population development by language region & settlement structure",
    x = "Language region",
    y = "Change in population (%)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

8 Identify correlations between bev_dichte, bev_ausl, alter_0_19, alter_20_64, alter_65+, bevbew_geburt, sozsich_sh

# Count missing values in each column used for the correlation analysis
nan_count <- gemeindedaten %>%
  select(bev_dichte, bev_ausl, alter_0_19, alter_20_64, `alter_65+`, bevbew_geburt, sozsich_sh) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

The columns “bev_dichte”, “bev_ausl”, “alter_0_19”, “alter_20_64”, “alter_65+” and “bevbew_geburt” contain no missing values. Only the column “sozsich_sh” has missing values, which are replaced with the column mean.

gemeindedaten$sozsich_sh <- ifelse(is.na(gemeindedaten$sozsich_sh),
                                   mean(gemeindedaten$sozsich_sh, na.rm = TRUE), 
                                   gemeindedaten$sozsich_sh)
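
The choice of fill value matters when the distribution is skewed. The following sketch compares mean and median imputation on a made-up vector with one large outlier.

```r
# Hypothetical values with one outlier and one missing entry
x <- c(1, 2, 3, 100, NA)

# Fill the gap with the mean of the observed values
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Fill the gap with the median of the observed values
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

print(x_mean[5])    # 26.5
print(x_median[5])  # 2.5
```

The mean is pulled towards the outlier, whereas the median stays close to the bulk of the data. Either choice shrinks the variance of the column slightly, which can affect the correlations computed afterwards.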

correlation <- gemeindedaten %>%
  select(bev_dichte, bev_ausl, alter_0_19, alter_20_64, "alter_65+", bevbew_geburt, sozsich_sh) %>%
  cor()

corr_long <- melt(correlation)

ggplot(corr_long, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, name = "Correlations") +
  theme_minimal() +
  labs(title = "Correlation matrix", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(angle = 45, hjust = 1)) +
  coord_fixed()

There are several interesting correlations in the data. For example, the correlation between bev_ausl and bev_dichte is moderately strong at 0.5097, which indicates that a higher population density tends to be associated with a higher proportion of foreigners. There are also negative correlations, such as between alter_65+ (age group 65+) and alter_0_19 (age group 0-19) at -0.6805, which indicates that fewer young people tend to live in areas with more older people.
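
Individual coefficients such as the ones quoted above can be read directly from the matrix by row and column name. The sketch below uses a toy data frame with hypothetical column names and also shows a base R way to obtain the long format used for a heat map.

```r
# Toy data with perfectly linear relationships (hypothetical columns)
toy <- data.frame(
  density = c(10, 20, 30, 40, 50),
  foreign = c(2, 4, 6, 8, 10),
  seniors = c(30, 25, 20, 15, 10)
)

# Pearson correlation matrix
cm <- cor(toy)

# Look up single coefficients by name
r_pos <- cm["density", "foreign"]
r_neg <- cm["density", "seniors"]
print(c(r_pos, r_neg))

# Long format: one row per pair, with columns Var1, Var2 and Freq
cm_long <- as.data.frame(as.table(cm))
print(head(cm_long, 3))
```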

8b Choose one of the correlations and visualize it separately

ggplot(gemeindedaten, aes(x = bev_dichte, y = bev_ausl)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Correlation between population density and proportion of foreigners",
       x = "Population density",
       y = "Proportion of foreigners") +
  theme_minimal()

As in the correlation matrix, the scatter plot shows a positive correlation between population density and the proportion of foreigners (r = 0.5097). The linear regression line indicates that as population density increases, the proportion of foreigners also tends to increase.
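
The correlation coefficient and the slope of the regression line are related but not identical: for a simple linear regression the slope equals r multiplied by sd(y) / sd(x). The following sketch checks this on made-up data.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation coefficient
r <- cor(x, y)

# Slope of the least-squares line y ~ x
slope <- unname(coef(lm(y ~ x))["x"])

print(r)
print(slope)

# The slope is the correlation rescaled by the ratio of standard deviations
print(all.equal(slope, r * sd(y) / sd(x)))
```

The slope depends on the units of both variables, while r is unit-free and bounded by -1 and 1. The two only coincide when x and y have the same standard deviation.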

9 Visualise a contingency table with the variables stadt_land and language regions

The following contingency table shows the number of municipalities in each combination of settlement structure (stadt_land) and language regions.

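# Count municipalities per combination; values_fill = 0 makes
# combinations that do not occur explicit as zero counts
contingency_table <- gemeindedaten %>%
  count(stadt_land, sprachregionen) %>%
  pivot_wider(names_from = sprachregionen, values_from = n, values_fill = 0)

# Back to long format for plotting; every column except stadt_land is a language region
contingency_table_long <- contingency_table %>%
  pivot_longer(cols = -stadt_land, names_to = "sprachregionen", values_to = "count")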

In the plot, we can observe that most municipalities in the German- and Italian-speaking regions are located in the agglomerations.

ggplot(contingency_table_long, aes(x = reorder(sprachregionen , count), y = count, fill = stadt_land)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Number of municipalities by language region and settlement structure",
       x = "Language region",
       y = "Number of municipalities",
       fill = "Settlement structure") +
  theme_minimal()
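
For a quick look at the counts themselves, base R's `table()` builds the same cross-tabulation directly, including zero cells. The category labels in this sketch are hypothetical and do not reflect the coding in the data set.

```r
# Toy data: one row per municipality (hypothetical labels)
toy <- data.frame(
  stadt_land     = c("urban", "urban", "rural", "rural", "rural", "intermediate"),
  sprachregionen = c("de", "fr", "de", "de", "it", "fr")
)

# Two-way contingency table; combinations that never occur appear as 0
ct <- table(toy$stadt_land, toy$sprachregionen)
print(ct)

# Row and column totals
print(addmargins(ct))
```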

10 Politics in the language regions

lang_party <- gemeindedaten %>%
  group_by(sprachregionen) %>%
  summarise(across(starts_with("polit_"), ~ mean(., na.rm = TRUE)))


lang_party_long <- lang_party %>%
  pivot_longer(
    cols = starts_with("polit_"),
    names_to = "party",
    values_to = "percentage"
  )

lang_party_long <- lang_party_long %>% drop_na()

ggplot(lang_party_long, aes(x = reorder(sprachregionen, percentage), y = percentage, fill = party)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(
    title = "Shares of political parties by language regions",
    x = "Language region",
    y = "Percent (%)",
    fill = "Parties"
  ) +
  theme(axis.text.x = element_text(angle = 0, hjust = 1))

Measured by the unweighted mean of the municipal party shares, the strongest party in the French- and German-speaking regions is the SVP, in the Italian-speaking region the FDP, and in the Rhaeto-Romance-speaking region the CVP.
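
Because the shares are averaged over municipalities, a small village counts as much as a large city. The sketch below contrasts the unweighted mean with a population-weighted mean. The column `polit_x` and all values are hypothetical, and using `bev_total` as the weight is an assumption, since population is only a rough proxy for the number of voters.

```r
# Two hypothetical municipalities with very different sizes
toy <- data.frame(
  gmdename  = c("big", "small"),
  bev_total = c(100000, 500),
  polit_x   = c(20, 60)   # party share in percent
)

# Every municipality counts equally
unweighted <- mean(toy$polit_x)

# Municipalities count in proportion to their population
weighted <- weighted.mean(toy$polit_x, w = toy$bev_total)

print(unweighted)
print(round(weighted, 2))
```

The unweighted figure describes the typical municipality, while the weighted figure is closer to the party's share across the region as a whole.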