A Statistical Analysis Project - Immigration Status in Canada

Project Description

How have immigration conditions in Canada changed, both recently and historically? News coverage often says that Canada is quite friendly toward immigrants and people from other countries. However, this claim is repeated widely without much quantifiable evidence. A more data-driven approach is to look at unemployment rates in relation to immigration status.

Some immigration-related open data from the Canadian government addresses this question. The data was published very recently, in July of this year. It contains Labour Force Survey (LFS) estimates broken down by immigrant status and age group for various regions, such as Vancouver, Montreal, and Toronto, from 2006 until now.

Questions

Given the large number of variables that I could analyze and display, some of the questions I could answer with this data are:

What are the unemployment rates of people born in Canada versus people who immigrated to Canada? Knowing the unemployment rate of people born in Canada gives a baseline for judging whether there is a bias toward them over immigrants.
How do unemployment rates differ among immigrants who have been in Canada for 5 years or less, more than 5 to 10 years, and more than 10 years?
How do unemployment rates differ among areas of Canada? The areas I chose to analyze are Alberta, Manitoba, Nova Scotia, Montreal (Quebec), Toronto (Ontario), and Vancouver (British Columbia).
How do unemployment rates change from year to year between 2006 and 2017?

The significance of the questions

Addressing these questions can be useful for sociologists who want to know whether there is discrimination against immigrants compared with people born in Canada. It is also helpful for people who want to compare unemployment rates between different areas of Canada.

On a personal level, I have been wondering whether I should work in Canada. If I choose to work here, I would be considered an immigrant, and I had no information about my chances of getting a job given that status. Taking a statistics class and learning R have given me a good toolset to analyze data and potentially answer these questions without depending on the news.

Note about the interactive plot

The plot embedded in this project is strongly influenced by Minard’s graphic of Napoleon’s march in 1812. The graphic was popularized by Edward Tufte, who strongly advocates the idea of multi-dimensional graphics: a single information graphic can display many types of data at once.

In my interactive plot, I rely heavily on ggplot2 to display a multi-dimensional, or layered, graphic. This one plot can answer all four of the questions listed above.

Note about the data analysis process

All of my statistical analysis code in R can be found at the top right of the dashboard. I have found that the dashboard format is easier for readers to navigate.
The summary table shows the tidy data that results from processing and transforming the large dataset from the Canadian government website.

Interactive Multi-layered Time-Series Plot

Summary Table

Year	STATUS	provinces	Value
2006	Born in Canada	Alberta	3.000000
2006	Born in Canada	Manitoba	3.690000
2006	Born in Canada	Montreal, Quebec	6.460000
2006	Born in Canada	Nova Scotia	7.365000
2006	Born in Canada	Toronto, Ontario	5.015000
2006	Born in Canada	Vancouver, British Columbia	3.555000
2006	Immigrants, landed 5 or less years earlier	Alberta	6.947368
2006	Immigrants, landed 5 or less years earlier	Manitoba	7.605000
2006	Immigrants, landed 5 or less years earlier	Montreal, Quebec	19.710000
2006	Immigrants, landed 5 or less years earlier	Nova Scotia	16.750000
2006	Immigrants, landed 5 or less years earlier	Toronto, Ontario	11.765000
2006	Immigrants, landed 5 or less years earlier	Vancouver, British Columbia	8.925000
2006	Immigrants, landed more than 10 years earlier	Alberta	2.675000
2006	Immigrants, landed more than 10 years earlier	Manitoba	2.920000
2006	Immigrants, landed more than 10 years earlier	Montreal, Quebec	10.045000
2006	Immigrants, landed more than 10 years earlier	Nova Scotia	5.163636
2006	Immigrants, landed more than 10 years earlier	Toronto, Ontario	5.400000
2006	Immigrants, landed more than 10 years earlier	Vancouver, British Columbia	3.855000
2006	Immigrants, landed more than 5 to 10 years earlier	Alberta	4.929412
2006	Immigrants, landed more than 5 to 10 years earlier	Manitoba	5.872727

Discussion and Conclusion

From the summary table and the plot presented in the previous two tabs, some important points can be drawn:

Across all the time-series plots, the unemployment rate is highest for immigrants who landed 5 or fewer years earlier, while the rates for people born in Canada are generally among the lowest. At the same time, immigrants who have lived in Canada for more than 10 years have a markedly lower unemployment rate than those who landed 5 or fewer years earlier. This suggests that the longer immigrants have lived in Canada, the less likely they are to be unemployed. Likewise, people born in Canada tend to be among the least affected by unemployment, while immigrants who recently moved to Canada tend to have the highest unemployment rate. Note that this is a predictive statement, not a causal one.
Some areas, such as Toronto and Montreal, have a markedly higher unemployment rate among immigrants who have lived in Canada for 5 years or less. In Montreal, Quebec, that rate was about 20% on average. From 2015 onward it decreases rapidly, but even the lower rate remains far above the rate for people born in Canada. Comparing unemployment rates across immigration statuses shows that unemployment rates vary between areas: Alberta seems to have the smallest discrepancy between immigration statuses, while Montreal, Toronto, and Nova Scotia show the largest. The fact that Toronto is among the areas with the largest discrepancies is quite interesting to me personally, because I have long heard that Toronto is very multicultural.
In all areas, the unemployment rates for every immigration status rise noticeably after the economic downturn in 2008. This is to be expected, since the downturn affected almost all countries and Canada was no exception. One notable case is Nova Scotia, where the unemployment rate fluctuated the most after the downturn and again in 2015. Manitoba, interestingly, was affected the least among the areas.

Some concluding thoughts on the statistical analysis process

Tidying the data took the longest of any part of the analysis (about 60% of the time). This includes shaping the data structures and transforming the data so that only the necessary information is included. Designing the user interface (the dashboard) took about 10%, and the remaining 30% was spent on the statistical analysis. Tidying takes so long because it involves questions such as ‘how do I transform the data so that I can easily draw the graphic?’ or ‘what information can I omit, without affecting the important results, to make the analysis easier?’. Answering these questions makes the statistical analysis much easier and less error-prone.
Layered graphics, as advocated by Edward Tufte, can display many variables at once. For example, in one plot I have shown four variables (unemployment rate, year, immigration status, and area) using colours and facets. Designing this time-series plot helped me understand the importance of the A.B.C. principle (always be charting), because the graph conveys the patterns far more clearly than the summary table.

---
title: "A Statistical Analysis Project - Immigration Status in Canada"
author: "Tam Nguyen"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    social: menu
    source_code: embed
    storyboard: true
    theme: readable
    highlight: pygments
---

```{r setup, include=FALSE}

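# Show code by default; individual chunks override this with echo=FALSE.
knitr::opts_chunk$set(echo = TRUE)

# Load every package before any of its functions is used.
library(tidyverse)  # dplyr, ggplot2, readr
library(lubridate)  # ymd(), year()
library(stringr)
library(shiny)

# Labour Force Survey estimates by immigrant status.
canada <- read_csv("data/02820101-eng.csv")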

immigrationClean <- 
  canada %>%
  select(-(Vector:Coordinate),
         -`Geographical classification`) 

provinces <- c("Alberta",
               "Vancouver, British Columbia",
               "Toronto, Ontario",
               "Manitoba",
               "Montreal, Quebec",
               "Nova Scotia")

status <- c("Born in Canada",
            "Immigrants, landed 5 or less years earlier",
            "Immigrants, landed more than 10 years earlier",
            "Immigrants, landed more than 5 to 10 years earlier")

employ <- c("Unemployment rate")

# process data into a new dataset:
province <- immigrationClean %>% 
  filter(GEO %in% provinces,
         CHARACTERISTICS %in% employ,
         STATUS %in% status) %>% 
  mutate(Date = ymd(paste0(Ref_Date, "/01"))) %>%
  select(Date,
         provinces = GEO,
         Value,
         AGEGROUP,
         CHARACTERISTICS,
         STATUS)

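# Average all remaining estimates (across months and age groups)
# within each year, immigration status, and area.
provinceArea <- province %>% 
  filter(!is.na(Value)) %>% 
  mutate(Year = year(Date)) %>% 
  group_by(Year, STATUS, provinces) %>% 
  summarise(Value = mean(Value), .groups = "drop")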


```

### **Project Description** 

  How have immigration conditions in Canada changed, both recently and historically? News coverage often says that Canada is quite friendly toward immigrants and people from other countries. However, this claim is repeated widely without much quantifiable evidence. A more data-driven approach is to look at unemployment rates in relation to immigration status. 

  Some immigration-related [open data](http://open.canada.ca/data/en/dataset/1c7c3efd-c990-4b96-8ecd-9e5eaee6b0bb) from the Canadian government addresses this question. The data was published very recently, in July of this year. It contains Labour Force Survey (LFS) estimates broken down by immigrant status and age group for various regions, such as Vancouver, Montreal, and Toronto, from 2006 until now. 

**Questions**  
  
Given the large number of variables that I could analyze and display, some of the questions I could answer with this data are:

* What are the unemployment rates of people born in Canada versus people who immigrated to Canada? Knowing the unemployment rate of people born in Canada gives a baseline for judging whether there is a bias toward them over immigrants.

* How do unemployment rates differ among immigrants who have been in Canada for 5 years or less, more than 5 to 10 years, and more than 10 years?

* How do unemployment rates differ among areas of Canada? The areas I chose to analyze are Alberta, Manitoba, Nova Scotia, Montreal (Quebec), Toronto (Ontario), and Vancouver (British Columbia).

* How do unemployment rates change from year to year between 2006 and 2017?

**The significance of the questions**

Addressing these questions can be useful for sociologists who want to know whether there is discrimination against immigrants compared with people born in Canada. It is also helpful for people who want to compare unemployment rates between different areas of Canada.

On a personal level, I have been wondering whether I should work in Canada. If I choose to work here, I would be considered an immigrant, and I had no information about my chances of getting a job given that status. Taking a statistics class and learning R have given me a good toolset to analyze data and potentially answer these questions without depending on the news. 

**Note about the interactive plot**

  The plot embedded in this project is strongly influenced by Minard's graphic of Napoleon's march in 1812. The graphic was popularized by Edward Tufte, who strongly advocates the idea of **multi-dimensional graphics**: a single information graphic can display many types of data at once. 
  
  In my interactive plot, I rely heavily on ggplot2 to display a multi-dimensional, or layered, graphic. This one plot can answer all four of the questions listed above.

**Note about the data analysis process**

* All of my statistical analysis code in R can be found at the top right of the dashboard. I have found that the dashboard format is easier for readers to navigate.
* The summary table shows the tidy data that results from processing and transforming the large dataset from the Canadian government website. 
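
As a rough sketch of what that processing involves, the snippet below applies the same kind of steps (parse the date, drop missing values, average within each year) to a tiny made-up data frame. The `toy` rows and their values are hypothetical and only mimic the column layout used in this project; they are not taken from the government file.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows in the same shape as the filtered data.
# The values are invented for illustration only.
toy <- data.frame(
  Ref_Date  = c("2006/01", "2006/02", "2007/01", "2007/02"),
  provinces = "Alberta",
  STATUS    = "Born in Canada",
  Value     = c(3.0, 3.4, NA, 4.0)
)

toy_summary <- toy %>%
  mutate(Date = ymd(paste0(Ref_Date, "/01")),  # "2006/01" becomes 2006-01-01
         Year = year(Date)) %>%
  filter(!is.na(Value)) %>%                     # drop missing estimates
  group_by(Year, STATUS, provinces) %>%
  summarise(Value = mean(Value), .groups = "drop")

toy_summary
```

Each year collapses to a single row per status and area, which is the shape the summary table and the plot both rely on.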

### **Interactive Multi-layered Time-Series Plot** {data-width=650}

```{r echo=FALSE}
library(plotly)
p <- provinceArea %>% 
  ggplot(aes(x = Year,
             y = Value,
             colour = STATUS)) +
  geom_line() +
  scale_x_continuous(breaks = seq(2006, 2017, 1)) +  # one tick per year on the time axis
  labs(y = "Unemployment Rate") +
  facet_wrap(~provinces) +
  theme_minimal() +
  theme(legend.position = "none") 
   
ggplotly(p)
```

### **Summary Table** {data-width=350}

```{r echo=FALSE}
knitr::kable(provinceArea[1:20, ])
```

### **Discussion and Conclusion**

From the summary table and the plot presented in the previous two tabs, some important points can be drawn:

* Across all the time-series plots, the unemployment rate is highest for immigrants who landed 5 or fewer years earlier, while the rates for people born in Canada are generally among the lowest. At the same time, immigrants who have lived in Canada for more than 10 years have a markedly lower unemployment rate than those who landed 5 or fewer years earlier. This suggests that **the longer immigrants have lived in Canada, the less likely they are to be unemployed**. Likewise, people born in Canada tend to be among the least affected by unemployment, while immigrants who recently moved to Canada tend to have the highest unemployment rate. Note that this is a predictive statement, not a causal one. 

* Some areas, such as Toronto and Montreal, have a markedly higher unemployment rate among immigrants who have lived in Canada for 5 years or less. In Montreal, Quebec, that rate was about 20% on average. From 2015 onward it decreases rapidly, but even the lower rate remains far above the rate for people born in Canada. Comparing unemployment rates across immigration statuses shows that **unemployment rates vary between areas**: Alberta seems to have the smallest discrepancy between immigration statuses, while Montreal, Toronto, and Nova Scotia show the largest. The fact that Toronto is among the areas with the largest discrepancies is quite interesting to me personally, because I have long heard that Toronto is very multicultural.

* In all areas, **the unemployment rates for every immigration status rise noticeably after the economic downturn in 2008**. This is to be expected, since the downturn affected almost all countries and Canada was no exception. One notable case is Nova Scotia, where the unemployment rate fluctuated the most after the downturn and again in 2015. Manitoba, interestingly, was affected the least among the areas. 
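
To make the between-area comparison concrete, here is a small sketch that computes the gap between recent immigrants and people born in Canada for a single year. It uses only the 2006 values that appear in the first rows of the summary table; the names `rates_2006` and `gaps_2006` are introduced here for illustration and are not part of the dashboard code.

```r
library(dplyr)

# 2006 unemployment rates copied from the first rows of the summary table.
rates_2006 <- data.frame(
  provinces = rep(c("Alberta", "Manitoba", "Montreal, Quebec",
                    "Nova Scotia", "Toronto, Ontario",
                    "Vancouver, British Columbia"), times = 2),
  STATUS = rep(c("Born in Canada",
                 "Immigrants, landed 5 or less years earlier"), each = 6),
  Value = c(3.000000, 3.690000, 6.460000, 7.365000, 5.015000, 3.555000,
            6.947368, 7.605000, 19.710000, 16.750000, 11.765000, 8.925000)
)

# Gap in percentage points: recent immigrants minus people born in Canada.
gaps_2006 <- rates_2006 %>%
  group_by(provinces) %>%
  summarise(
    gap = Value[STATUS == "Immigrants, landed 5 or less years earlier"] -
          Value[STATUS == "Born in Canada"]
  ) %>%
  arrange(desc(gap))

gaps_2006
```

For 2006 alone, Montreal shows the widest gap (13.25 points) and Alberta and Manitoba the narrowest (just under 4 points each). A single year is only a spot check; the pattern over the whole period is what the time-series plot shows.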

**Some concluding thoughts on the statistical analysis process**

* Tidying the data took the longest of any part of the analysis (about 60% of the time). This includes shaping the data structures and transforming the data so that only the necessary information is included. Designing the user interface (the dashboard) took about 10%, and the remaining 30% was spent on the statistical analysis. Tidying takes so long because it involves questions such as 'how do I transform the data so that I can easily draw the graphic?' or 'what information can I omit, without affecting the important results, to make the analysis easier?'. Answering these questions makes the statistical analysis much easier and less error-prone. 

* Layered graphics, as advocated by Edward Tufte, can display many variables at once. For example, in one plot I have shown four variables (unemployment rate, year, immigration status, and area) using colours and facets. Designing this time-series plot helped me understand the importance of the A.B.C. principle (always be charting), because the graph conveys the patterns far more clearly than the summary table.