Analysis of Chinese Dialect Distribution

Done by Calvin C

- Jul 2021

In this R analysis, we look at the distribution of Singapore’s Chinese dialect groups and examine whether their make-up differs between age groups.

Setup

library(tidyverse)
library(ggthemes)
library(dplyr)
theme_set(theme_light())

Reading the data

dialect <- read.csv("eca_dialect.csv")
head(dialect)

Checking out structure and summary of variables

str(dialect)

## 'data.frame':    540 obs. of  4 variables:
##  $ Dialect: chr  "Hokkien" "Hokkien" "Hokkien" "Hokkien" ...
##  $ Gender : chr  "Total" "Total" "Total" "Total" ...
##  $ Age    : chr  "   0 - 4" "   5 - 9" " 10 - 14" " 15 - 19" ...
##  $ Freq   : int  47883 50005 55344 70365 77662 77273 79021 82387 90491 90897 ...

summary(dialect)

##    Dialect             Gender              Age                 Freq       
##  Length:540         Length:540         Length:540         Min.   :    42  
##  Class :character   Class :character   Class :character   1st Qu.:   945  
##  Mode  :character   Mode  :character   Mode  :character   Median :  4868  
##                                                           Mean   : 10741  
##                                                           3rd Qu.: 13868  
##                                                           Max.   :101400

Data Preparation

As this analysis does not look at any specific gender, we keep only the "Total" category of the Gender column. There are currently 15 age categories, which we collapse into 4 broader age groups.

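# Keep the gender-wide totals, then collapse the five-year age bands into four groups.
# The labels below keep the leading spaces used in the raw Age column.
dialect.1 <- dialect %>%
  filter(Gender == "Total") %>%
  mutate(age_group = case_when(
    Age %in% c("   0 - 4", "   5 - 9", " 10 - 14", " 15 - 19") ~ "Age: 0 - 19",
    Age %in% c(" 20 - 24", " 25 - 29", " 30 - 34", " 35 - 39") ~ "Age: 20 - 39",
    Age %in% c(" 40 - 44", " 45 - 49", " 50 - 54", " 55 - 59") ~ "Age: 40 - 59",
    Age %in% c(" 60 - 64", " 65 - 69", " 70 - 74") ~ "Age: 60 - 74",
    TRUE ~ NA_character_  # any other label is left unassigned
  )) %>%
  filter(!is.na(age_group))  # drop rows that fall outside the four groups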

head(dialect.1)
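
One way to confirm that a recode like this behaved as intended is to tabulate the original labels against the new groups. The sketch below uses a small made-up data frame (`toy`, with hypothetical frequencies) rather than `eca_dialect.csv`, and the label " 75 - 79" is an assumption included only to show what an unmapped band would look like.

```r
library(dplyr)

# Hypothetical toy data, NOT taken from eca_dialect.csv.
# " 75 - 79" is an assumed label used to illustrate an unmapped band.
toy <- data.frame(
  Age  = c("   0 - 4", " 20 - 24", " 40 - 44", " 60 - 64", " 75 - 79"),
  Freq = c(10, 20, 30, 40, 50)
)

toy_grouped <- toy %>%
  mutate(age_group = case_when(
    Age %in% c("   0 - 4", "   5 - 9", " 10 - 14", " 15 - 19") ~ "Age: 0 - 19",
    Age %in% c(" 20 - 24", " 25 - 29", " 30 - 34", " 35 - 39") ~ "Age: 20 - 39",
    Age %in% c(" 40 - 44", " 45 - 49", " 50 - 54", " 55 - 59") ~ "Age: 40 - 59",
    Age %in% c(" 60 - 64", " 65 - 69", " 70 - 74") ~ "Age: 60 - 74",
    TRUE ~ NA_character_
  ))

# One row per original label, showing the group it landed in.
mapping <- toy_grouped %>% count(age_group, Age)
print(mapping)

# Labels that did not match any group (NA in age_group).
unmapped <- toy_grouped %>% filter(is.na(age_group)) %>% pull(Age)
print(unmapped)
```

Running the same `count(age_group, Age)` on the real data would show at a glance whether any age label was missed or misspelled in the recode.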

Create a sub-dataset with Age group, dialect and its frequency

dialect.2<-aggregate(Freq~Dialect + age_group,dialect.1,sum)
head(dialect.2)
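
The formula `Freq ~ Dialect + age_group` tells `aggregate()` to sum `Freq` within every combination of dialect and age group. As a rough illustration, the sketch below compares it with a `group_by()` and `summarise()` pipeline on a made-up data frame (`toy`, with hypothetical dialects "A" and "B" and groups "g1" and "g2"); it is not the author's data.

```r
library(dplyr)

# Hypothetical toy data, NOT taken from eca_dialect.csv.
toy <- data.frame(
  Dialect   = c("A", "A", "B", "B", "A", "B"),
  age_group = c("g1", "g1", "g1", "g1", "g2", "g2"),
  Freq      = c(1, 2, 3, 4, 5, 6)
)

# Base R: sum Freq for each Dialect x age_group combination.
base_sum <- aggregate(Freq ~ Dialect + age_group, toy, sum)

# dplyr: the same totals expressed as a grouped summary.
dplyr_sum <- toy %>%
  group_by(Dialect, age_group) %>%
  summarise(Freq = sum(Freq), .groups = "drop")

# Put both results in the same row order before comparing.
base_sorted  <- base_sum %>% arrange(age_group, Dialect)
dplyr_sorted <- dplyr_sum %>% arrange(age_group, Dialect)

print(base_sorted)
print(all.equal(base_sorted$Freq, dplyr_sorted$Freq))
```

Either form gives the same totals; the formula interface of `aggregate()` also silently drops rows with missing values in the grouping columns, which is worth keeping in mind when comparing counts.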

Compute each dialect's percentage share within each age group

dialect.3<-group_by(dialect.2,age_group)%>%
  mutate(pct = round(Freq*100/sum(Freq)))
head(dialect.3)
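
Because `sum(Freq)` is evaluated inside each age group, every percentage is a share of that group's total. Since the shares are rounded to whole numbers, they need not add up to exactly 100. The sketch below illustrates both points on a made-up data frame (`toy`, hypothetical values only).

```r
library(dplyr)

# Hypothetical toy data, NOT taken from eca_dialect.csv.
toy <- data.frame(
  age_group = rep(c("g1", "g2"), each = 3),
  Dialect   = rep(c("A", "B", "C"), times = 2),
  Freq      = c(1, 1, 1, 50, 30, 20)
)

# Share of each dialect within its own age group, rounded to whole percent.
toy_pct <- toy %>%
  group_by(age_group) %>%
  mutate(pct = round(Freq * 100 / sum(Freq))) %>%
  ungroup()

# Sum of the rounded shares per group: close to 100, but not always exactly 100.
totals <- toy_pct %>%
  group_by(age_group) %>%
  summarise(total_pct = sum(pct), .groups = "drop")

print(toy_pct)
print(totals)
```

In this toy example the three equal shares in "g1" each round to 33, so the group total is 99 rather than 100. Small gaps like this in the plotted labels are a rounding effect, not a data error.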

Visualisation - comparison of dialect population across different age groups

p.1a <- ggplot(dialect.3, aes(x=reorder(Dialect, pct), y = pct, fill = Dialect))
p.1a + geom_col(position = "dodge2")+
  facet_grid(~age_group)+
  labs(x = NULL, 
       y = "Percent %", 
       fill = "Dialect", 
       title = "Distribution of dialect group against age group")+
  geom_text(aes(label = pct), size = 3, position = position_stack(vjust = 0.5))+
  guides(fill= "none")+
  coord_flip()

From the plot, we can see that the distribution of dialects is largely consistent across the different age groups.

The top three dialect groups, Hokkien, Teochew and Cantonese, remain the majority in every age group and together make up at least 70% of the population in each one. However, their combined share falls as the age groups get younger, from 78% in the oldest age group to 71% in the youngest.
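
A combined share like the one quoted above can be computed by summing the percentages of the three groups within each age group. The sketch below shows the idea on a made-up data frame (`toy`, with hypothetical age groups "older" and "younger" and hypothetical percentages); the numbers are for illustration only and are not the figures from the plot.

```r
library(dplyr)

# Hypothetical toy percentages, NOT the values from the analysis.
toy <- data.frame(
  age_group = rep(c("older", "younger"), each = 4),
  Dialect   = rep(c("Hokkien", "Teochew", "Cantonese", "Other Chinese"), times = 2),
  pct       = c(40, 20, 15, 25, 35, 20, 10, 35)
)

top_three <- c("Hokkien", "Teochew", "Cantonese")

# Combined share of the three named dialect groups within each age group.
top_share <- toy %>%
  filter(Dialect %in% top_three) %>%
  group_by(age_group) %>%
  summarise(share = sum(pct), .groups = "drop")

print(top_share)
```

Applied to a table shaped like `dialect.3`, the same pipeline would give one combined figure per age group. Summing already-rounded percentages can differ by a point or so from rounding the combined share directly.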

There is also a rise in the group categorized as Other Chinese, whose proportion increases as the age groups get younger. This group makes up only 1% of the oldest age group but 12% of the youngest, just 1 percentage point below the third-biggest dialect group, Cantonese, at 13%.

These observations suggest that the country’s diversity of dialect groups is increasing, and that a new dialect group may emerge among the country’s top 9 dialects in the years to come.