Understanding the Median Income of Canadians

Canadian Income

In an ever-evolving economic landscape, insight into the finances of individuals and households is paramount. This project delves into the median income of Canadians across the True North and aims to shed light on the economic realities faced by the average Canadian. It uses the Cancensus API to uncover patterns, disparities, and trends in income distribution, with a specific focus on the top 10 cities in Canada. Through this analysis, we find that higher median income correlates with larger population centres, with outliers in remote and central Canada.

Data Retrieval

Given how large Canada is, I want to see how the income of Canadians varies across regions. I first compare the 10 most populous cities in Canada and how they fare on income. I am using Cancensus to take data from the 2021 Canadian census and to isolate the vectors for median income and the top 10 cities. The vectors let me pull the information I need from the data set.

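library(cancensus) # list_census_vectors(), list_census_regions(), get_census()
library(dplyr)     # %>%, filter(), select(), rename(), mutate()
library(reshape2)  # melt()
library(ggplot2)   # plotting

data_set <- "CA21" # retrieving the 2021 census dataset
label_ <- "CD" # census division (CD) level; the top-10 city lookup below uses "CSD"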

# Pulling vectors that belong to income
income_vectors <- list_census_vectors(data_set) %>% 
  filter(type=="Total",grepl("income",label)) %>% 
  pull("vector")

# Finding the top 10 cities (census subdivisions) by population
regions <- list_census_regions(data_set) %>%
  filter(level %in% "CSD") %>%
  top_n(10, pop) %>%
  as_census_region_list

# list of vector data for income 
list_income <- list_census_vectors(data_set) %>%
  filter(type=="Total",grepl("income",label)) %>% 
  select(label, vector)%>% rename(variable=vector)
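
The filter used above can be sketched on a small toy table. This is only an illustration: the table stands in for the output of list_census_vectors(), the vector IDs and labels are made up, and base R is used so that it runs without an API key.

```r
# Toy stand-in for list_census_vectors(); IDs and labels are hypothetical.
vectors <- data.frame(
  vector = c("v_toy_1", "v_toy_2", "v_toy_3", "v_toy_4"),
  type   = c("Total", "Male", "Total", "Total"),
  label  = c("Median total income", "Median total income",
             "Average age", "Average after-tax income"),
  stringsAsFactors = FALSE
)

# Keep rows of type "Total" whose label contains the lowercase string "income".
is_income <- vectors$type == "Total" & grepl("income", vectors$label)
income_vectors_toy <- vectors$vector[is_income]

print(income_vectors_toy)
```

Note that grepl() is case-sensitive by default, so a label that starts with a capitalized "Income" would not be matched by this pattern.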

Looking at list_income, I will extract the data for the top ten cities using the vectors listed there. Next, I pull the data for the vectors that I extracted from the Cancensus API, restricted to the top cities. Then I select and rearrange the pulled data so that it can be merged with the variable names.

data<- get_census(dataset = data_set,
                  level = "Regions",
                  vectors = c(income_vectors),
                  regions = regions, # only the top 10 cities 
                  geo_format = NA,
                  labels = 'short')

Data processing

Next I process the data, selecting the region names and the vectors from the income list (no geometry is requested, since geo_format = NA). I then melt the table into long format, keyed by region name, so that it can be plotted.

# Arranging City names
Cities <- data %>% arrange(desc(Population)) %>% distinct(`Region Name`) %>% pull("Region Name")

# Selecting data for regions, income and reshaping
plot_data <- data %>% 
  select(`Region Name`, all_of(income_vectors)) %>% # all_of() selects columns named in a character vector
  melt(id="Region Name") %>% # melt() comes from reshape2
  mutate(`Region Name` = factor(`Region Name`, levels = Cities, 
                                ordered=TRUE)) 

plot_data <- merge(plot_data, list_income, by = "variable")
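
To make the reshaping step concrete, here is a minimal base R sketch of the same wide-to-long transformation followed by the label merge. The region names, vector IDs, labels, and values are all made up for illustration, and the manual reshape stands in for melt().

```r
# Toy wide table: one row per region, one column per vector (hypothetical values).
wide <- data.frame(
  `Region Name` = c("City A", "City B"),
  v_toy_1 = c(100, 200),
  v_toy_2 = c(10, 20),
  check.names = FALSE
)

# Toy lookup table, shaped like list_income (variable + label).
labels <- data.frame(
  variable = c("v_toy_1", "v_toy_2"),
  label    = c("Median income", "Average income")
)

# Wide to long: one row per (region, variable) pair.
value_cols <- setdiff(names(wide), "Region Name")
long <- data.frame(
  `Region Name` = rep(wide[["Region Name"]], times = length(value_cols)),
  variable      = rep(value_cols, each = nrow(wide)),
  value         = unlist(wide[value_cols], use.names = FALSE),
  check.names   = FALSE
)

# Attach the human-readable label to each variable.
long <- merge(long, labels, by = "variable")
print(long)
```

The result has one row per region and variable, which is the shape that ggplot2 expects when a variable is mapped to a facet or a fill colour.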

Visualization of income

First, I wanted to analyze only the median income, so I filtered for variables that have "Median" in their label.

The statistics look fairly similar, which could be because we are looking only at the big cities in Canada. Looking at two of the biggest cities in Canada, Toronto and Vancouver, we see that Toronto’s median household income of $84,000 is similar to Vancouver’s $82,000, while Ottawa has a significantly higher median household income at $102,000. This is still looking only at the cities, so I will map the “Median after-tax income in 2020 among recipients” across Canada to see if there is a difference.

To see whether plotting the average rather than the median makes a difference, I filtered for average income and did not find much difference.
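
The plotting code for this step is not shown, so the following is only a sketch of how such a comparison could be drawn. It assumes plot_data has the columns built earlier (Region Name, value, label); the function name plot_income_by_label and the choice of a faceted bar chart are my own assumptions, not necessarily what produced the figures discussed here.

```r
# Hypothetical helper: bar chart of every income variable whose label matches
# `pattern` (for example "Median" or "Average"), one facet per variable.
plot_income_by_label <- function(plot_data, pattern = "Median") {
  # Keep only the variables whose label contains the pattern.
  subset_data <- plot_data[grepl(pattern, plot_data$label), ]

  ggplot2::ggplot(
    subset_data,
    ggplot2::aes(x = .data[["Region Name"]], y = .data[["value"]])
  ) +
    ggplot2::geom_col() +
    ggplot2::facet_wrap(~label) +
    ggplot2::coord_flip() + # horizontal bars keep long city names readable
    ggplot2::labs(x = NULL, y = "Income ($)") +
    ggplot2::theme_minimal()
}

# Usage, once plot_data exists:
# plot_income_by_label(plot_data, "Median")
# plot_income_by_label(plot_data, "Average")
```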

Visualization across Canada

I wanted to see how different variables fare across Canada, so I created a function that pulls data from the 2021 census into a choropleth map, with a legend at the side showing the range of the variable. The vectors can be found in the list_income dataframe.
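
The mapping function itself is not shown, so here is a minimal sketch of what such a function could look like. The function name map_census_vector, the default level of "CD", and the styling are assumptions; it also needs a Cancensus API key and the sf package to run.

```r
# Hypothetical helper: choropleth of one census vector across Canada.
# `vector` is a vector ID taken from list_income, for example "v_CA21_909".
map_census_vector <- function(vector, dataset = "CA21", level = "CD") {
  # Request the data together with geometry as an sf object.
  geo <- cancensus::get_census(
    dataset    = dataset,
    regions    = list(C = "01"), # all of Canada
    vectors    = vector,
    level      = level,
    geo_format = "sf",
    labels     = "short"         # columns are named by vector ID
  )

  ggplot2::ggplot(geo) +
    ggplot2::geom_sf(ggplot2::aes(fill = .data[[vector]]), colour = NA) +
    ggplot2::scale_fill_viridis_c(name = vector) +
    ggplot2::theme_void()
}

# Usage (requires an API key):
# map_census_vector("v_CA21_909")
```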

From this we find that the higher median incomes are found further north rather than in the big cities. However, it is hard to tell which region is which on a static map. Next I use leaflet, which should let me find specifically which areas have a higher median income.

The interactive map shows where higher incomes are situated; hovering over an area shows the region name and its median income.
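
As a rough sketch of how such a hover map can be built with leaflet, the function below takes an sf object (for example one returned by get_census() with geo_format = "sf") and the name of the column to colour by. The function name, the colour palette, and the default name column are assumptions rather than the code behind the map described here.

```r
# Hypothetical helper: interactive choropleth with hover labels.
interactive_income_map <- function(geo, value_col, name_col = "Region Name") {
  values <- geo[[value_col]]

  # Continuous colour scale over the observed range of the variable.
  pal <- leaflet::colorNumeric("YlGnBu", domain = values, na.color = "#cccccc")

  # Text shown when hovering over a region.
  hover <- paste0(geo[[name_col]], ": ", format(values, big.mark = ","))

  map <- leaflet::leaflet(geo)
  map <- leaflet::addTiles(map)
  map <- leaflet::addPolygons(
    map,
    fillColor   = pal(values),
    fillOpacity = 0.7,
    color       = "white",
    weight      = 1,
    label       = hover
  )
  leaflet::addLegend(map, pal = pal, values = values, title = value_col)
}

# Usage, with `geo` an sf object containing the column "v_CA21_909":
# interactive_income_map(geo, "v_CA21_909")
```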

Next I want to automate this with a function that takes a variable from the dataset and produces the map, and apply it to a couple of vectors that I think are useful to compare.

Statistical testing

I often hear that you will earn a higher income living in a densely populated city than in a more rural area. To test this, I take the median total income of one-person households in 2020 across Canada and compare it with the population of each region to better determine whether this is true.

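# Data Retrieval
# CD is not defined in the code shown here; it is assumed to be a region list
# of all census divisions built earlier, e.g.
# CD <- list_census_regions(data_set) %>% filter(level == "CD") %>% as_census_region_list
data_i <- get_census(dataset = data_set,
                  level = "Regions",
                  vectors = c("v_CA21_909"),
                  regions = CD,
                  geo_format = NA,
                  labels = 'short')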

## Reading vectors data from local cache.

# Pulling name of vector 
x <- list_income %>% filter(variable == 'v_CA21_909') %>% pull(label)

# Processing data 
plot_data <- data_i %>% 
  select(`Region Name`, v_CA21_909, Population) %>% # region name, income variable, and population
  rename(x = v_CA21_909) # rename the income column to x for plotting

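# Visualizing data
# Note: inside aes(), x is the income column; in xlab(), x is the label string pulled above.
# currency_format_short is an axis-label helper assumed to be defined elsewhere (not shown here).
ggplot(plot_data, aes(x = x,
                y = Population))+
  geom_point()+ # scatter plot
  theme_minimal()+ # or theme_classic()
  scale_x_continuous(labels=currency_format_short)+
  xlab(x)+
  ylab("Population") +
  #geom_abline(intercept = 0, slope = 1)+ # just a straight line with a slope of one 
  geom_smooth(method = "lm")  # regression line; add se = FALSE to remove the confidence band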

## `geom_smooth()` using formula = 'y ~ x'

Visualizing the data, we find that the points cluster, which suggests the data are skewed rather than evenly distributed. I can look further at how skewed the data are and then transform them, here with a log transform of both variables, so that the relationship can be analyzed more reliably.
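
One way to quantify the skew before and after a log transform is the sample skewness. The sketch below uses simulated, right-skewed values as a stand-in for the population column; the numbers are generated, not census data, and the helper function skewness is defined here rather than taken from a package.

```r
# Simulated right-skewed values standing in for Population (not real data).
set.seed(42)
population_toy <- rlnorm(300, meanlog = 10, sdlog = 1.5)

# Sample skewness: third central moment divided by the variance to the power 1.5.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw <- skewness(population_toy)      # strongly positive for right-skewed data
skew_log <- skewness(log(population_toy)) # much closer to zero after the log

print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero indicates a roughly symmetric distribution, which is why the log-transformed values are better suited to a linear fit and a Pearson correlation.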

# Log-transforming both variables and visualizing the data
plot_data <- plot_data %>% mutate(logp = log(Population), logx = log(x))
ggplot(plot_data, aes(x = logx,
                y = logp))+
  geom_point()+ # scatter plot
  theme_minimal()+ # or theme_classic()
  #scale_x_continuous(labels=currency_format_short)+
  labs(y = "Log Population",
      x = paste0("log ", x),
      title = "log population compared to log median income")+
  geom_smooth(method = "lm")  # regression line; add se = FALSE to remove the confidence band

## `geom_smooth()` using formula = 'y ~ x'

# correlation coefficient 
cor(plot_data$logp, plot_data$logx)

## [1] 0.3178135

# coefficient of determination
cor(plot_data$logp, plot_data$logx)^2

## [1] 0.1010054

I test this using Pearson’s product-moment correlation on the log-transformed values.
Null hypothesis (H0): There is no correlation between the median total income of one-person households in 2020 ($) and the population of a region.
Alternative hypothesis (Ha): There is a correlation between the median total income of one-person households in 2020 ($) and the population of a region.

# Pearson’s product–moment correlation
cor.test(plot_data$logp, plot_data$logx)

## 
##  Pearson's product-moment correlation
## 
## data:  plot_data$logp and plot_data$logx
## t = 5.7179, df = 291, p-value = 2.672e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2109066 0.4172074
## sample estimates:
##       cor 
## 0.3178135

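Given that the p-value (2.672e-08) is less than the significance level (alpha = 0.05), we reject the null hypothesis in favour of the alternative hypothesis that there is a correlation between the median total income of one-person households and the population of a region. The correlation coefficient is r = 0.3178135, which indicates a moderate positive linear correlation between the log-transformed variables. The coefficient of determination (about 0.10) means that roughly 10% of the variance in one log variable is accounted for by the other.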
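
As a check on the cor.test() output above, the t statistic, p-value, and confidence interval can be recomputed from the reported r and degrees of freedom alone. This is a sketch using the standard formulas for Pearson's correlation (t = r * sqrt(df) / sqrt(1 - r^2) and the Fisher z transformation for the interval); the results should agree with the printed output up to rounding.

```r
r  <- 0.3178135 # sample correlation reported above
df <- 291       # degrees of freedom reported above (n - 2)
n  <- df + 2    # number of regions

# t statistic and two-sided p-value for H0: true correlation = 0.
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Coefficient of determination.
r_squared <- r^2

print(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2], r2 = r_squared))
```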