Done by Calvin C
- Jul 2021
For this R analysis, we will look at the distribution of Singapore’s Chinese dialect groups and see whether their make-up has changed between age groups.
library(tidyverse)
library(ggthemes)
library(dplyr)
theme_set(theme_light())
Reading the data
dialect <- read.csv("eca_dialect.csv")
head(dialect)
Checking out structure and summary of variables
str(dialect)
## 'data.frame': 540 obs. of 4 variables:
## $ Dialect: chr "Hokkien" "Hokkien" "Hokkien" "Hokkien" ...
## $ Gender : chr "Total" "Total" "Total" "Total" ...
## $ Age : chr " 0 - 4" " 5 - 9" " 10 - 14" " 15 - 19" ...
## $ Freq : int 47883 50005 55344 70365 77662 77273 79021 82387 90491 90897 ...
summary(dialect)
##    Dialect             Gender               Age                 Freq
##  Length:540         Length:540         Length:540         Min.   :    42
##  Class :character   Class :character   Class :character   1st Qu.:   945
##  Mode  :character   Mode  :character   Mode  :character   Median :  4868
##                                                           Mean   : 10741
##                                                           3rd Qu.: 13868
##                                                           Max.   :101400
As our analysis is not gender-specific, we will only use the "Total" category under Gender. There are currently 15 age categories, which we will collapse into 4 age groups.
dialect.1 <- dialect %>%
  # keep only the rows for the "Total" gender category
  filter(Gender == "Total") %>%
  # collapse the five-year age bands into four broader age groups
  mutate(age_group = case_when(
    Age %in% c(" 0 - 4", " 5 - 9", " 10 - 14", " 15 - 19") ~ "Age: 0 - 19",
    Age %in% c(" 20 - 24", " 25 - 29", " 30 - 34", " 35 - 39") ~ "Age: 20 - 39",
    Age %in% c(" 40 - 44", " 45 - 49", " 50 - 54", " 55 - 59") ~ "Age: 40 - 59",
    Age %in% c(" 60 - 64", " 65 - 69", " 70 - 74") ~ "Age: 60 - 74",
    TRUE ~ NA_character_  # any other Age value is left as NA
  ))
head(dialect.1)
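As a cross-check on the manual mapping above, the same four groups can also be derived from the lower bound of each age band with base R's cut(). This is only a sketch on a hypothetical vector of labels (ages, lower and age_group are illustrative names, not objects from the analysis). It assumes every label has the form "lower - upper", so an open-ended label, if the data has one, would need separate handling.

```r
# Hypothetical age-band labels in the same "lower - upper" form as the Age column
ages <- c(" 0 - 4", " 15 - 19", " 20 - 24", " 55 - 59", " 70 - 74", " 75 - 79")

# Keep the text before the hyphen and convert it to the band's lower bound
lower <- as.integer(trimws(sub("-.*", "", ages)))

# Left-closed intervals [0, 20), [20, 40), [40, 60), [60, 75); anything else becomes NA
age_group <- as.character(cut(
  lower,
  breaks = c(0, 20, 40, 60, 75),
  labels = c("Age: 0 - 19", "Age: 20 - 39", "Age: 40 - 59", "Age: 60 - 74"),
  right  = FALSE
))

# Show each label next to the group it was assigned to
data.frame(ages, lower, age_group)
```

Comparing this column with the hand-written one (for example with table()) is a quick way to spot a mistyped age label.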
Create a sub-dataset with the total frequency of each dialect within each age group
dialect.2<-aggregate(Freq~Dialect + age_group,dialect.1,sum)
head(dialect.2)
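One behaviour of aggregate() is worth keeping in mind here: with the formula interface, rows that have NA in a grouping variable are dropped by default (na.action = na.omit). The sketch below shows this on a small made-up data frame; toy and res are hypothetical names and the numbers are invented for illustration.

```r
# Made-up data: the last row has no age group assigned
toy <- data.frame(
  Dialect   = c("A", "A", "B", "B", "A"),
  age_group = c("g1", "g1", "g1", "g2", NA),
  Freq      = c(1, 2, 3, 4, 5)
)

# Same call shape as in the analysis: sum Freq by dialect and age group
res <- aggregate(Freq ~ Dialect + age_group, toy, sum)
res

# The NA row (Freq = 5) is not part of any total
sum(res$Freq)
```

So any age band that is not mapped to a group silently disappears from the totals rather than forming a group of its own.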
Calculate the percentage share of each dialect within each age group
dialect.3<-group_by(dialect.2,age_group)%>%
mutate(pct = round(Freq*100/sum(Freq)))
head(dialect.3)
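To make the percentage step concrete, here is a minimal sketch of the same group_by() and mutate() pattern on invented numbers (toy and toy_pct are hypothetical names). It also illustrates a side effect of rounding: the rounded percentages within a group need not add up to exactly 100.

```r
library(dplyr)

# Invented frequencies for three dialects in two age groups
toy <- data.frame(
  Dialect   = c("A", "B", "C", "A", "B", "C"),
  age_group = rep(c("older", "younger"), each = 3),
  Freq      = c(50, 30, 20, 1, 1, 1)
)

# Share of each dialect within its own age group, rounded to a whole percent
toy_pct <- toy %>%
  group_by(age_group) %>%
  mutate(pct = round(Freq * 100 / sum(Freq))) %>%
  ungroup()

# Total of the rounded percentages in each group
toy_pct %>%
  group_by(age_group) %>%
  summarise(total_pct = sum(pct))
```

In the "younger" group each dialect holds a third of the total, so each rounds to 33 and the group sums to 99. Small gaps like this are expected whenever rounded shares are added together.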
p.1a <- ggplot(dialect.3, aes(x = reorder(Dialect, pct), y = pct, fill = Dialect))
p.1a +
  geom_col(position = "dodge2") +
  facet_grid(~age_group) +
  labs(x = NULL,
       y = "Percent %",
       fill = "Dialect",
       title = "Distribution of dialect group against age group") +
  # print each percentage at the middle of its bar
  geom_text(aes(label = pct), size = 3, position = position_stack(vjust = 0.5)) +
  guides(fill = "none") +
  coord_flip()
From the plot, we can see that the distribution of dialects across the different age groups is largely consistent.
The top three dialect groups, Hokkien, Teochew and Cantonese, remain the majority in every age group and make up at least 70% of the population in each. However, their combined share falls as the age groups get younger, from 78% in the oldest age group to only 71% in the youngest.
There is also a rise in the dialect group categorised as Other Chinese, whose share grows as the age groups get younger. This group makes up only 1% of the oldest age group but 12% of the youngest, just 1 percentage point below the third-largest dialect group, Cantonese, at 13%.
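The shares quoted in the two paragraphs above can be recomputed rather than read off the plot. The sketch below shows one way to do so with aggregate() and subset(). It runs on a made-up table (toy, top3, top3_share and other_share are hypothetical names, and the percentages are invented, not the figures from this analysis).

```r
# Made-up percentages with the same columns as dialect.3
toy <- data.frame(
  age_group = rep(c("older", "younger"), each = 5),
  Dialect   = rep(c("Hokkien", "Teochew", "Cantonese", "Hakka", "Other Chinese"), times = 2),
  pct       = c(45, 20, 15, 15, 5,
                40, 18, 14, 13, 15)
)

top3 <- c("Hokkien", "Teochew", "Cantonese")

# Combined share of the top three dialect groups in each age group
top3_share <- aggregate(pct ~ age_group, data = subset(toy, Dialect %in% top3), FUN = sum)
top3_share

# Share of the "Other Chinese" category in each age group
other_share <- subset(toy, Dialect == "Other Chinese", select = c(age_group, pct))
other_share
```

With the real data, toy would be replaced by dialect.3. Because pct is already rounded there, a summed share can differ by about a percentage point from one computed directly on Freq.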
From the above observations, we can conclude that the country’s diversity of dialect groups is increasing, and that a new dialect group may emerge to join the top 9 dialects of the country in the years to come.