Group Members: Nithin Reddy Padicherla, Chegu Hitesh Sai Sushanth, Tirumala Naga sai Gottumukkala, Sai Pavani Gutha, Susenthar Raj Jegadeesh Chandra Bose

1. List of libraries used in this project

The list of libraries that we utilized to make tables, format material, and style graphs for our exploratory analysis is provided below.

# List of all the libraries used in this project
library(stargazer)
library(dplyr)
library(gmodels)
library(epiDisplay)
library(ggplot2)
library(kableExtra)
library(xtable)
library(janitor)
library(naniar)
library(summarytools)
library(vcd)
library(prettydoc)

2. Introduction

2.1 Data set overview

The dataset for this analysis is the Canadian Internet Use Survey (CIUS), which provides statistics on the adoption, use, and location of internet access for people older than 15 living in Canada’s ten provinces. The data set has 23 dimensions, covering variables such as province, area, age, gender, education level, internet use, and internet accessibility.

These data can be examined with exploratory and predictive analysis techniques. The findings can support evidence-based policy-making, resource management, and development planning in all of the provinces, and can provide internationally comparable statistics on internet use and access trends in Canada.

Some of the benefits include:

Guide government efforts to provide households with more reliable and affordable high-speed Internet.
Develop policies to protect individuals from online privacy and security risks.
Identify barriers that prevent people from accessing the Internet and making the most of new and emerging internet technology.
Contribute to international initiatives, such as the United Nations Sustainable Development Goals and the OECD Going Digital Project, to help track and compare Canada’s digital development.

Additional information about the CIUS data set can be found here Link

2.2 Data Research

According to our preliminary analysis, the implementation of public programs depends on evidence-based policy formulation. The Internet usage data in this case contains a number of characteristics that can be used to interpret and evaluate different aspects of how the general public uses the internet.

Analyzing where the survey respondents live, by region and province, can reveal how usage patterns vary with location.
Analyzing demographic variables such as age and gender can help us understand the factors that influence respondents’ behaviour, which can then be combined with internet usage patterns in different regions to gain useful insights.
Analyzing educational and employment status can reveal the specific purposes for which the internet is used, such as educational research and workplace use.
Finally, analyzing how many years respondents have used the internet and where they have used it from (work, home, school, a public library, or a friend’s place) can reveal usage patterns at each of these places and their purposes. Combining these with the regional, educational, and demographic variables above can provide detailed and specific insights into the usage dynamics of the respondents.

2.3 Analysis approach overview

The Canadian Internet Use Survey (CIUS) dataset used in this analysis is categorical and contains mostly nominal and ordinal variables. For exploratory analysis we therefore use contingency tables, relative frequency tables, bar charts, mosaic plots, density plots, and similar tools. For predictive analysis we are also considering logistic regression and chi-square tests to understand the relationships between the variables, draw conclusions, and provide recommendations.
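As an illustration of the predictive step mentioned above, the sketch below fits a logistic regression on a small simulated data frame. The data, the column names (internet_user, age_group, education), and the assumed effect sizes are hypothetical stand-ins for the recoded CIUS variables. This is not a model fitted in this report.

```r
# Simulated stand-in data (hypothetical; not the CIUS responses)
set.seed(42)
n <- 500
toy <- data.frame(
  age_group = factor(sample(1:6, n, replace = TRUE)),
  education = factor(sample(1:3, n, replace = TRUE))
)

# Assumed relationship for the simulation only:
# older groups less likely, higher education more likely to be internet users
linear_part <- 1.5 - 0.4 * as.integer(toy$age_group) + 0.5 * as.integer(toy$education)
toy$internet_user <- rbinom(n, size = 1, prob = plogis(linear_part))

# Logistic regression for a binary (0/1) outcome with categorical predictors
fit <- glm(internet_user ~ age_group + education, data = toy, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to read them as odds ratios
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

Each odds ratio compares one category with the reference level of its factor (here the first level), which matches how nominal and ordinal survey codes are usually entered into such a model.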

3. Pre-processing the data

Before conducting the analysis, the dataset required pre-processing, so we applied a few pre-processing techniques to clean the dataset and improve the efficiency of our analysis.

3.1 Renaming columns

The variable names in the dataset were unclear and were encoded in accordance with the needs of the survey to be compatible with their processing systems. It is not practical to interpret the data using the encoded variable headers. Therefore, we changed the header names to a more comprehensible and meaningful format.

# Read the data, then rename the encoded survey columns to readable names
locationofUse <- read.csv("~/University of Windsor/locationofUse.csv")
locationofUse <-
  locationofUse %>% rename(
    "Customer ID" = "PUMFID",
    "Province" = "PROVINCE",
    "Region" = "REGION",
    "Community" = "G_URBRUR",
    "Age" = "GCAGEGR6",
    "Gender" = "CSEX",
    "Education" = "G_CEDUC",
    "Student_Status" = "G_CSTUD",
    "Employment" = "G_CLFSST",
    "Household_Type" = "GFAMTYPE",
    "House_Size" = "G_HHSIZE",
    "Household_Education" = "G_HEDUC",
    "Student_Household" = "G_HSTUD",
    "Internet_User" = "EV_Q01",
    "Internet_Usage_Years" = "EV_Q02",
    "Internet_Usage_Home" = "LU_Q01",
    "Internet_Usage_Work" = "LU_Q02",
    "Internet_Usage_School" = "LU_G03",
    "Internet_Usage_Library" = "LU_Q04",
    "Internet_Usage_Others" = "LU_Q05",
    "Internet_Usage_Relatives" = "LU_Q06A",
    "Internet_Usage_Neighbours" = "LU_Q06B",
    "Internet_Others" = "LU_G06",
  )

3.2 Check for missing values

We checked if the dataset already has any missing values that might hinder our analysis and we found no missing values.

# check if there are any missing values
sapply(locationofUse, function(x) sum(is.na(x)))
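
One caveat with this check: is.na() only counts cells that R already stores as NA. Survey codes such as 6 to 9 (recoded in Section 3.3) are ordinary numbers, so they are not reported as missing. A minimal sketch on a made-up data frame (the values are hypothetical, only the column names are taken from our dataset):

```r
# Toy data frame: 8 and 9 are survey codes, not real NA values
toy <- data.frame(
  Internet_Usage_Home = c(1, 2, 9, 1),
  Internet_Usage_Work = c(2, NA, 1, 8)
)

# Count of true NA values per column (equivalent to the sapply() call above)
na_counts <- colSums(is.na(toy))
print(na_counts)

# Count of values coded 6 to 9, which is.na() does not detect
coded_counts <- colSums(sapply(toy, function(x) x %in% 6:9))
print(coded_counts)
```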

3.3 Reassigning data levels

This dataset has 23 dimensions, and each variable has its own set of levels. For easier interpretation and analysis we reassigned a few levels in columns 16 to 23: 2 [NO] becomes 0, and 6, 7, 8, and 9 become NA, so that all of these values are treated as a single "other" category. The existing 1 [YES] is kept as 1.

# Reassigns values 6,7,8,9 to NA and 2 to 0 for the given range of columns.
# Takes the data frame as an argument and returns the recoded copy,
# instead of modifying the global object with '<<-'.
Recode_columns <- function(data, startcol, endCol) {
  cols <- startcol:endCol
  data[cols] <- lapply(data[cols], function(x) {
    x <- ifelse(x == 2, 0, x)          # 2 [NO] becomes 0
    x <- ifelse(x %in% 6:9, NA, x)     # 6,7,8,9 become NA
    x
  })
  data
}
# Function call - passes the data frame and the startcol and endCol positions
# to 'Recode_columns' and stores the result.
locationofUse <- Recode_columns(locationofUse, 16, 23)
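
A quick way to confirm the recoding is to tabulate a recoded column with NA shown as its own category. The sketch below uses a made-up vector in place of a real column and applies the same rule as Recode_columns (2 becomes 0; 6, 7, 8, 9 become NA). The input values are hypothetical.

```r
# Hypothetical raw responses for one yes/no question
raw <- c(1, 2, 2, 6, 1, 9, 7, 1, 8, 2)

# Same rule as Recode_columns: 2 -> 0, codes 6 to 9 -> NA, 1 unchanged
recoded <- ifelse(raw == 2, 0, raw)
recoded <- ifelse(recoded %in% 6:9, NA, recoded)

# useNA = "ifany" lists NA as a separate category in the counts
result <- table(recoded, useNA = "ifany")
print(result)
```

After recoding, only 0, 1, and NA should appear in the table; any other value would point to a code that the rule does not cover.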

3.4 Changing the datatype

After processing, R interprets the variables in this dataset as numeric. This can cause issues with categorical variables, because numeric values may be treated as continuous while the categorical ones here are discrete, which can lead to logic errors when the code runs. So we use the built-in functions as.character and as.factor to change the numeric data type to a character or a factor where needed.

#Change the datatype to character or factor for the mentioned columns 
locationofUse <- locationofUse %>% mutate_at(c('column name(s)'), as.character)
#or
locationofUse <- locationofUse %>% mutate_at(c('column name(s)'), as.factor)
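
As a complement to as.factor, a factor can also carry readable labels for the numeric codes. The sketch below is a base R illustration on a hypothetical vector of province codes; the code-to-abbreviation mapping is the one used for the plot labels in Section 4.5.

```r
# Hypothetical province codes as they appear in the data
province_code <- c(35, 24, 59, 35, 11)

# Map the numeric codes to labelled factor levels
province <- factor(
  province_code,
  levels = c(10, 11, 12, 13, 24, 35, 46, 47, 48, 59),
  labels = c("NL", "PE", "NS", "NB", "QC", "ON", "MB", "SK", "AB", "BC")
)

# Counts per labelled level; levels that do not occur are shown with 0
print(table(province))
```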

4. Exploratory Analysis

As part of exploratory analysis we wanted to understand all the individual dimensions and their underlying patterns and develop an effective analysis to get maximum insights from the available data.

4.1 Analysing Provincial information with a Frequency table

The purpose of this frequency table is to show how often each province appears among the respondents, which we obtain by counting each province in the dataset. From this we want to understand which provinces occur most often. The table can also be used to understand the overall make-up of the data: the total number of observations, the least and most frequent provinces, and each province’s share of the respondents.

# Bind the frequency, cumulative and relative frequency of the provinces
cbind(
  Frequency = table(locationofUse$Province),
  Cumulative_Frequency = cumsum(table(locationofUse$Province)),
  Relative_Frequency = prop.table(table(locationofUse$Province))
  ) %>%
  kable(caption = " Table:1 A Frequency Table on Provinces") %>%
  kable_classic(font_size = 13, full_width = FALSE)

Table:1 A Frequency Table on Provinces
Province	Frequency	Cumulative_Frequency	Relative_Frequency
10	882	882	0.0380533
11	592	1474	0.0255415
12	1240	2714	0.0534990
13	1084	3798	0.0467685
24	4437	8235	0.1914315
35	6518	14753	0.2812149
46	2023	16776	0.0872810
47	1627	18403	0.0701959
48	2242	20645	0.0967297
59	2533	23178	0.1092847

4.1.1 Findings

We found that Ontario [35] was the most frequent province among the survey respondents, with 6518 occurrences and a relative frequency of 0.28.
We found that Prince Edward Island [11] was the least frequent province, with 592 occurrences and a relative frequency of 0.03.
We also found that Ontario [35] is followed by Quebec [24], British Columbia [59], and Alberta [48], with 4437, 2533, and 2242 occurrences and relative frequencies of 0.19, 0.11, and 0.10 respectively.
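
The figures in these findings can be reproduced directly from the counts in Table 1. The sketch below copies the counts by hand into a named vector (rather than reading the survey file) and recomputes the cumulative and relative frequencies in base R.

```r
# Respondent counts per province code, copied from Table 1
counts <- c("10" = 882, "11" = 592, "12" = 1240, "13" = 1084, "24" = 4437,
            "35" = 6518, "46" = 2023, "47" = 1627, "48" = 2242, "59" = 2533)

total <- sum(counts)            # total number of respondents
relative <- counts / total      # share of each province
cumulative <- cumsum(counts)    # running total across provinces

print(round(relative, 4))

# Province codes with the most and fewest respondents
most_frequent <- names(which.max(counts))
least_frequent <- names(which.min(counts))
print(c(most = most_frequent, least = least_frequent))
```

For example, Ontario's relative frequency is 6518 / 23178, which is about 0.28.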

4.1.2 Explanation

One possible reason Ontario [35] has the highest occurrence is that more respondents were sampled from that province, which in turn likely reflects its population [it is the most populated province in Canada]. The same reasoning may explain why Prince Edward Island [11] appears the least number of times, and so on for the other provinces.

4.2 Analysing the distribution of the age of respondents

The density plot helps us understand how the ages of the survey respondents are distributed by estimating the probability density function of the age variable. This can later be combined with region/province and gender to see how the age groups relate to them.

# Filled Density Plot
dplot_variable <- density(locationofUse$Age)
plot(dplot_variable,xlab=" Fig:1 Age Ranges of Respondents", main="Age Distribution of Respondents ")
polygon(dplot_variable, col="#fb8072", border="black")

4.2.1 Findings

We found that respondents aged 65 or older [6] have the highest density, around 0.49, which shows that this age range responded to the survey the most and makes up the largest part of the data.
We found that respondents aged 16 to 24 [1] have the lowest density, around 0.20, which shows that this age range responded the least and contributes the fewest responses.
Finally, we found that respondents aged 45 to 54 [4] and 55 to 64 [5] have about the same density of 0.40, which shows that the two groups are almost equal in number.

4.2.2 Explanation

One reason for the higher density of respondents above the age of 45 may be that the average age of respondents who take internet surveys is around 53.51 years (Price, 2012). So there is a higher probability that people who take surveys are older than 50.
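
Because Age is stored as six category codes, the height of a kernel density curve depends on the smoothing bandwidth, so the density values are best read as relative rather than exact. The sketch below illustrates this on simulated codes (hypothetical sampling probabilities, not the CIUS data) and shows category proportions as a bandwidth-free alternative.

```r
# Simulated age-group codes (1 = 16-24, ..., 6 = 65 and older); hypothetical data
set.seed(1)
age_code <- sample(1:6, size = 1000, replace = TRUE,
                   prob = c(0.08, 0.12, 0.16, 0.20, 0.20, 0.24))

# Kernel density treats the codes as continuous values;
# the peak height changes with the bandwidth (bw)
d_default <- density(age_code)
d_narrow  <- density(age_code, bw = 0.1)
print(c(default = max(d_default$y), narrow = max(d_narrow$y)))

# Proportions per category do not depend on a smoothing choice and sum to 1
age_share <- prop.table(table(age_code))
print(round(age_share, 3))
```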

4.3 Analysing Education levels based on province using Pivot table

The pivot table below summarizes and organizes education levels by province. It helps us understand how education levels are distributed across the provinces: what level of education respondents hold in each province, which provinces have the highest and lowest shares at each level, and what those percentages are.

#Create a pivot table with province and education as variables
locationofUse %>%
  tabyl(Province, Education) %>%
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined") %>%
  kable(caption = "Table:2 A Pivot Table on Provinces and Education") %>%
  kable_classic(font_size = "13")

Table:2 A Pivot Table on Provinces and Education
Province/Education	1	2	3	Total
10	43.3% (382)	44.2% (390)	12.5% (110)	100.0% (882)
11	38.3% (227)	45.8% (271)	15.9% (94)	100.0% (592)
12	39.2% (486)	42.9% (532)	17.9% (222)	100.0% (1240)
13	41.9% (454)	42.6% (462)	15.5% (168)	100.0% (1084)
24	39.4% (1748)	42.8% (1900)	17.8% (789)	100.0% (4437)
35	38.2% (2489)	40.8% (2657)	21.0% (1372)	100.0% (6518)
46	43.4% (878)	39.2% (794)	17.4% (351)	100.0% (2023)
47	43.0% (699)	39.5% (642)	17.6% (286)	100.0% (1627)
48	38.9% (872)	43.2% (969)	17.9% (401)	100.0% (2242)
59	33.4% (847)	44.8% (1136)	21.7% (550)	100.0% (2533)
Total	39.2% (9082)	42.1% (9753)	18.7% (4343)	100.0% (23178)

4.3.1 Findings

We found that across all 10 provinces 39.2% [9082] of respondents have high school level or less education [1]. British Columbia [59] has the lowest share of respondents at this level, 33.4% [847], and Manitoba [46] has the highest share, 43.4% [878].
We found that across all 10 provinces 42.1% [9753] of respondents have college or some post-secondary level education [2]. Prince Edward Island [11] has the highest share at this level, 45.8% [271], and Manitoba [46] has the lowest share, 39.2% [794].
Finally, we found that across all 10 provinces 18.7% [4343] of respondents have a university certificate or degree [3]. Newfoundland and Labrador [10] has the lowest share at this level, 12.5% [110], and British Columbia [59] has the highest share, 21.7% [550].
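
The row percentages behind these findings can be recomputed from the counts in Table 2. The sketch below enters the counts by hand into a matrix and uses base R's prop.table() with margin = 1 (row-wise), as a cross-check on the table rather than a replacement for the tabyl() pipeline above.

```r
# Counts from Table 2: rows are province codes, columns are education levels 1 to 3
edu_counts <- matrix(
  c(382, 390, 110,
    227, 271, 94,
    486, 532, 222,
    454, 462, 168,
    1748, 1900, 789,
    2489, 2657, 1372,
    878, 794, 351,
    699, 642, 286,
    872, 969, 401,
    847, 1136, 550),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    Province = c("10", "11", "12", "13", "24", "35", "46", "47", "48", "59"),
    Education = c("1", "2", "3")
  )
)

# Share of each education level within a province (each row sums to 1)
row_share <- prop.table(edu_counts, margin = 1)
print(round(100 * row_share, 1))

# Province code with the highest and lowest share for each education level
highest <- rownames(row_share)[apply(row_share, 2, which.max)]
lowest  <- rownames(row_share)[apply(row_share, 2, which.min)]
print(rbind(highest, lowest))
```

Note that the percentages compare shares within a province, so a province can have the highest share at a level without having the highest count.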

4.3.2 Explanation

One likely reason that a little over 40% of the respondents across the 10 provinces have college or some post-secondary education is that the education system in Canada requires students to stay in school until the age of 16, and in some provinces, such as Nova Scotia, Manitoba, and New Brunswick, until the age of 18 (Nair, 2022). That may be why a significant portion have college-level education.

4.4 Analysing internet users across the regions

Our objective is to understand which region has the highest number of respondents who have ever used the Internet (e-mail or World Wide Web) from home, work, school, or any other location for personal non-business use. Based on this we can identify the region in which internet users are most concentrated.

#Create a subset for the columns
Internet_Userset <- locationofUse[c(3, 14)]
# Change the datatype of the variables for processing
Internet_Userset$Internet_User <-
  as.character(Internet_Userset$Internet_User)
Internet_Userset$Region <-
  as.character(Internet_Userset$Region)
# Create a plot with Internet users and region variables
Internet_Userset %>%
  filter(Internet_User == "1") %>% # filter on values
  ggplot(aes(Region, ..count..)) + geom_bar(aes(fill = Internet_User),
                                            position = "dodge2" ,
                                            show.legend = FALSE, colour="Black") + ggtitle("Fig:2 Internet Users Across the Regions") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) + labs(x =
                                                                                   "Region", y = "Count") +
  scale_x_discrete(
    labels = c(
      "1" = "Atlantic Regions",
      "2" = "Quebec",
      "3" = "Ontario",
      "4" = "Manitoba/Saskatchewan",
      "5" = "Alberta",
      "6" = "British Columbia"
    )
  ) +
  geom_bar(fill = "#00BFC4") # hex colours need the leading '#'

4.4.1 Findings

We found that the Ontario [3] and Quebec [2] regions have the highest counts of users who have used the internet [e-mail or World Wide Web] from home, work, school, or other locations for personal non-business use, with counts above 5000 and above 3000 respectively.
We also found that the Atlantic [1] and Manitoba/Saskatchewan [4] regions had similar counts of around 2700 users who used the internet for personal non-business purposes.
Finally, we found that the Alberta [5] region had the lowest count, around 1800 internet users who used the internet for personal non-business purposes.

4.4.2 Explanation

One possible reason the Ontario [3] and Quebec [2] regions have the highest counts is that more respondents were sampled from these provinces, which in turn likely reflects their populations [the first and second most populated provinces in Canada].
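
Since raw counts grow with the number of respondents sampled in a region, the share of internet users within each region is a useful companion to the bar chart. The sketch below shows both calculations in base R on a made-up data frame; the rows are hypothetical and only the column names follow our dataset.

```r
# Hypothetical responses: region code and whether the respondent ever used the internet
toy <- data.frame(
  Region        = c(3, 3, 3, 3, 3, 3, 5, 5, 5, 1),
  Internet_User = c(1, 1, 1, 0, 0, 0, 1, 1, 0, 1)
)

# Raw count of internet users per region (what the bar chart displays)
user_count <- tapply(toy$Internet_User == 1, toy$Region, sum)

# Share of internet users within each region, which does not grow with sample size
user_share <- tapply(toy$Internet_User == 1, toy$Region, mean)

print(rbind(count = user_count, share = round(user_share, 2)))
```

In this toy example region 3 has the most users by count, but region 5 has the larger share of users among its respondents.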

4.5 Analysing internet usage at home by province

Our objective is to understand, based on the survey, whether respondents have used the internet from home for personal non-business use. These results can help us understand whether respondents use the internet at home for recreational or personal purposes. Based on this we can identify which province has the most internet users who use the internet from home for personal use. This can be combined with other variables to understand the rise of home internet usage in recent years.

#Create a subset for the columns
Internet_Province <- locationofUse[c(2, 16)]
# Change the datatype of the variables for processing
Internet_Province$Internet_Usage_Home <-
  as.character(Internet_Province$Internet_Usage_Home)
Internet_Province$Province <-
  as.character(Internet_Province$Province)
# Create a plot with Internet usage at home and province variables
Internet_Province %>%
  filter(!is.na(Internet_Usage_Home) & Internet_Usage_Home != "0") %>% # keep non-missing, non-zero responses
  ggplot(aes(Province, ..count..)) +geom_bar(aes(fill = Internet_Usage_Home),
                                            position = "dodge2" ,
                                            show.legend = FALSE, colour="Black") + ggtitle(" Fig:3 Internet Usage At Home by Province") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) + labs(x =
                                                                                   "Province", y = "Internet Users") +
     scale_x_discrete(
    labels = c(
      "10" = "NL",
      "11" = "PE",
      "12" = "NS",
      "13" = "NB",
      "24" = "QC",
      "35" = "ON",
      "46" = "MB",
      "47" = "SK",
      "48" = "AB",
      "59" = "BC"
    )
  ) +
  geom_bar(fill = "#00BFC4") # hex colours need the leading '#'

4.5.1 Findings

We found that Ontario [ON] and Quebec [QC] have the highest numbers of internet users for personal non-business use from home, close to 5000 and 3000 users respectively.
We found that British Columbia [BC] was next, with around 2000 users who used the internet at home.
We found that Newfoundland and Labrador [NL], Prince Edward Island [PE], Nova Scotia [NS], and New Brunswick [NB] were the only provinces with fewer than 1000 users using the internet at home for personal non-business purposes.

4.5.2 Explanation

A similar reason, population, might apply to Ontario and Quebec having the most internet users for personal use. In addition, as regions advance and modernize, so do their means of communication. These include social media, shopping, online entertainment, information seeking, and so on, which can eventually mean more internet usage for personal non-business purposes in those regions.

4.6 Analysing number of years of internet use across provinces

Our objective is to understand, for respondents who have used the internet, how many years they have used it and in which province. Based on this we can identify how many respondents fall into each usage band: less than 1 year, 1 to 2 years, 2 to 5 years, or more than 5 years. This can be analysed further by place of use [home, work, school, etc.].

#Create a subset for the columns
Internet_yearsset <- locationofUse[c(2, 15)]
# Change the datatype of the variables for processing
Internet_yearsset$Province <-
  as.character(Internet_yearsset$Province)
Internet_yearsset$Internet_Usage_Years <-
  as.character(Internet_yearsset$Internet_Usage_Years)
# Change the values 6,7,8 in this subset to NA
Internet_yearsset[Internet_yearsset == "6" |
                    Internet_yearsset == "7" | Internet_yearsset == "8"] <- NA
# Create a plot with internet usage years and provinces
Internet_yearsset %>%
  filter(!is.na(Internet_Usage_Years)) %>% # filter values
  ggplot(aes(Province, ..count..)) + geom_bar(aes(fill = Internet_Usage_Years), position = "stack", colour="Black") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) + labs(title = " Fig:4 Internet Usage Years Across the Provinces", x = "Provinces", y =
         "Count") +
  scale_x_discrete(
    labels = c(
      "10" = "Newfoundland and Labrador",
      "11" = "Prince Edward Island",
      "12" = "Nova Scotia",
      "13" = "New Brunswick ",
      "24" = "Quebec",
      "35" = "Ontario",
      "46" = "Manitoba",
      "47" = "Saskatchewan",
      "48" = "Alberta",
      "59" = "British Columbia"
    )
  ) +
  scale_fill_discrete(name = "Internet Usage Years", labels = c("<1", "1-2", "2-5", ">5")) +
  coord_flip()

4.6.1 Findings

We found that a significant portion of the users in all 10 provinces have been using the internet for more than 5 years. Ontario has the largest such group, with around 4000 respondents using the internet for more than five years.
We found that very few respondents in any province have been using the internet for less than a year.
We also found that Ontario and Quebec have the most users who have been using the internet for 2 years or more.
Finally, we found that Prince Edward Island has the lowest count, fewer than 1000, of users who have been using the internet for more than 5 years.

4.6.2 Explanation

Since Canada is a developed country, it is likely that all the provinces have had access to the technology for a long time. This is evident in the results, where in most provinces the largest group is the one with the longest internet usage. That being said, Prince Edward Island having the fewest users with more than 5 years of internet use might be because of the slower population settlement and the lower population density in the province.

4.7 Analysing the internet usage from work place

Our objective is to find how many employed respondents were using the internet for personal non-business use from their workplace. This helps us identify whether users use the internet for personal purposes at work, which can help us further interpret reasons and usage patterns.

#Create a subset for the columns
Internet_Workset <- locationofUse[c(9,17)]
# Change the datatype of the variables for processing
Internet_Workset$Internet_Usage_Work <-
  as.character(Internet_Workset$Internet_Usage_Work)
Internet_Workset$Employment <-
  as.character(Internet_Workset$Employment)
#Create a plot with internet usage frequency who are employed
Internet_Workset %>%
  filter(!is.na(Internet_Usage_Work) & Employment == "1") %>% # filter on non-missing values
  ggplot(aes(Internet_Usage_Work,
             ..count..)) + geom_bar(aes(fill = Employment), position = "dodge2", show.legend = FALSE, colour="Black")+ theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) + labs(title=" Fig:5 Internet Usage At Work",
                                                                                       x="Internet Usage at Work", y= "Employee count") + 
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes"))

4.7.1 Findings

We found that, across all provinces, more than 6000 employed respondents were using the internet for personal use from their workplace and around 5500 were not.

4.7.2 Explanation

The results show that a large portion of employed people use the internet at work for personal purposes. This might be due to workplaces not having stringent internet usage policies, to people wanting to finish personal tasks during office hours while workloads are low, or to relieving boredom during office hours.

4.8 Analysing internet usage among different age groups

As part of this analysis our objective is to understand how internet use differs across the age groups of respondents. This will help us understand which age groups have used the internet or World Wide Web services more than others. This can be analysed further by how each age category uses the internet at home, work, school, and other places.

#Create a subset for the columns
Internet_ageset <- locationofUse[c(5, 14)]
# Change the datatype of the variables for processing
Internet_ageset$Internet_User <-
  as.character(Internet_ageset$Internet_User)
Internet_ageset$Age <-
  as.character(Internet_ageset$Age)
#Create a plot with internet user and age
Internet_ageset %>%
  filter(!is.na(Internet_User) &
           Internet_User == "1") %>% # filter values
  ggplot(aes(Age, ..count..)) + geom_bar(aes(fill = Internet_User), show.legend = FALSE, colour="Black") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) +
  labs(title = " Fig:6 Internet Users Among Different Age Groups", x = "Age Groups", y =
         "Count") +
  scale_x_discrete(labels = c(
    "1" = "16-24",
    "2" = "25-34",
    "3" = "35-44",
    "4" = "45-54",
    "5" = "55-64",
    "6" = ">65"
  ))

4.8.1 Findings

We found that the age group 45-54 had the most internet users, with a count of almost 4000.
We also found that the age group 65 and older had the fewest internet users, with a count of around 2000.
Finally, we found that the age groups 16-24 and 65 and older had similar numbers of internet users, around 2000 each.

4.8.2 Explanation

The respondents in the age group 45-54 were the ones who used the internet the most, probably because they are of working age and belong to the generation that started the internet revolution, so it is likely that their internet usage grew alongside the generation.

4.9 Analysing education levels based on gender using contingency table

Our objective is to analyse the frequency distribution of the combination of the education and gender variables. This will help us understand how the education levels are distributed within each gender.

# Create a contingency table
CrossTable(locationofUse$Gender, locationofUse$Education)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  23178 
## 
##  
##                      | locationofUse$Education 
## locationofUse$Gender |         1 |         2 |         3 | Row Total | 
## ---------------------|-----------|-----------|-----------|-----------|
##                    1 |      4012 |      4357 |      1992 |     10361 | 
##                      |     0.563 |     0.002 |     1.319 |           | 
##                      |     0.387 |     0.421 |     0.192 |     0.447 | 
##                      |     0.442 |     0.447 |     0.459 |           | 
##                      |     0.173 |     0.188 |     0.086 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##                    2 |      5070 |      5396 |      2351 |     12817 | 
##                      |     0.455 |     0.001 |     1.066 |           | 
##                      |     0.396 |     0.421 |     0.183 |     0.553 | 
##                      |     0.558 |     0.553 |     0.541 |           | 
##                      |     0.219 |     0.233 |     0.101 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##         Column Total |      9082 |      9753 |      4343 |     23178 | 
##                      |     0.392 |     0.421 |     0.187 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
## 
##

4.9.1 Findings

From the above contingency table we found that across all provinces 4012 males [1] and 5070 females [2] have high school or less education [1], totaling 9082 respondents at that level.
We found that 4357 males [1] and 5396 females [2] have college or some post-secondary level education [2], totaling 9753 respondents at that level.
Finally, we found that 1992 males [1] and 2351 females [2] have a university degree or certificate [3], totaling 4343 respondents at that level.
The lowest count was males with university level education and the highest was females with college or some post-secondary level education.
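
Since chi-square tests are part of our planned approach (Section 2.3), the counts in the contingency table above can be tested for independence directly. The sketch below enters the six cell counts by hand and runs base R's chisq.test(); it is an illustrative check on the printed counts, not an additional result of the survey processing.

```r
# Counts from the contingency table above: rows are gender, columns are education level
gender_edu <- matrix(
  c(4012, 4357, 1992,
    5070, 5396, 2351),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("1", "2"), Education = c("1", "2", "3"))
)

# Pearson chi-square test of independence (2 degrees of freedom for a 2 x 3 table)
test <- chisq.test(gender_edu)
print(test)

# Row proportions: education mix within each gender
print(round(prop.table(gender_edu, margin = 1), 3))
```

The test statistic equals the sum of the "Chi-square contribution" cells in the CrossTable output, about 3.41, which at 2 degrees of freedom gives a p-value of roughly 0.18. On these counts the education mix of the two genders is therefore very similar, and the higher female counts mostly reflect the larger number of female respondents.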

4.9.2 Explanation

Educational indicators show that females [2] tend to attain more education than males at all levels (Zechuan Deng, 2021). In our table the female counts are higher at every level, although females also make up 55.3% of the respondents overall. Some studies show that men stop continuing education for family, financial, and other personal reasons, which might be the case here. Research also shows that women tend to choose and persist in education even when the stream is harder to pass, while men tend to drop out and look for other means of making a living (Guo, 2016).

4.10 Analysing gender on age categories using mosaic plot

We want to visually analyse the proportions of the age categories and the genders within them, to understand the ratio of male and female respondents in each age category.

counts_subset <- table(locationofUse$Age, locationofUse$Gender)
#create mosaic plot on age vs gender
mosaicplot(counts_subset, xlab='Age', ylab='Gender',
           main='Fig:7 Age vs Gender', col='#00CCCC', border = "black")

4.10.1 Findings

We found that male [1] and female [2] respondents are in roughly equal proportions in the age category 16-24 [1], which has the fewest respondents of all the age groups in the data set [based on visual interpretation].
We found that the proportion of female [2] respondents is slightly larger than that of males in the age category 25-34 [2] [based on visual interpretation].
We found a similar pattern of almost equal proportions of male [1] and female [2] respondents in the age category 45-54 [4], as in age category [1] [based on visual interpretation].
We also found a slightly higher proportion of female [2] respondents in the age category 55-64 [5], similar to the age category 25-34 [2] [based on visual interpretation].
Finally, we found that the age category 65 and older [6] also has more females [2] than males [1], and that this category has the most respondents of all the age groups in the data set [based on visual interpretation].

4.10.2 Explanation

The findings show an almost equal gender distribution in the survey, with some age categories, such as 25-34 [2] and 65 and older [6], having slightly more female respondents. More respondents fall in the 65 and older [6] category, possibly because that generation tends to show interest in answering surveys and providing feedback, and signs up for them more often than younger generations.

4.11 Analysing internet usage from school status based on gender

We want to analyse how many respondents have used the internet at school for personal, non-business purposes. This will help us understand how each gender uses the internet facilities at school for personal purposes. The analysis can be further broken down by region and province and, with additional information, extended to internet usage patterns and types at school.

#Create a subset for the columns
Internet_schoolset <- locationofUse[c(6,18)]
# Change the datatype of the variables for processing
Internet_schoolset$Gender <-
  as.character(Internet_schoolset$Gender)
Internet_schoolset$Internet_Usage_School <-
  as.character(Internet_schoolset$Internet_Usage_School)
#Create a plot with Internet usage, gender
Internet_schoolset %>%
  filter(!is.na(Internet_Usage_School)) %>% # filter values
  ggplot(aes(Gender, after_stat(count))) + # after_stat() replaces the deprecated ..count.. notation
  geom_bar(aes(fill = Internet_Usage_School), colour = "Black") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) +
  labs(title = " Fig:8 Internet Usage at School for Personal Use", x = "Gender", y = "Count") +
  scale_x_discrete(labels = c(
    "1" = "Male",
    "2" = "Female")) +
 scale_fill_discrete(name = "Internet Usage School", labels = c("No", "Yes"))

4.11.1 Findings

We found that, overall, the number of respondents [both males and females] using the internet at school for personal, non-business purposes is very low.
We found that around 500 male students and around 700 female students use the internet at school for personal purposes.
Finally, we found that far more respondents have not used the internet at school for personal purposes than have used it: around 7500 male students and around 8500 female students have not used school internet for any personal, non-business purposes.
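Because the two genders have different respondent totals, the raw counts are easier to compare as shares. The sketch below uses only the approximate counts quoted in the findings (read off the chart, so not exact); the object names are illustrative:

```r
# Approximate counts quoted in the findings (read from Fig. 8, not exact values)
school_counts <- matrix(
  c(7500, 500,
    8500, 700),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Male", "Female"), Used = c("No", "Yes"))
)

# Total respondents per gender and the share that answered "Yes"
totals    <- rowSums(school_counts)
yes_share <- school_counts[, "Yes"] / totals

print(totals)
print(round(100 * yes_share, 1))
```

With these approximate figures, roughly 6% of male and 8% of female respondents report personal use at school, so the gap between genders is small relative to the overall low usage.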

4.11.2 Explanation

The number of respondents, both male and female, who use the internet at school for personal purposes is very low, possibly because schools have stricter internet access policies or because students do not need to use the internet at school when they have access to technology at home. The students may also be disciplined enough to use school resources only for their intended purposes.

4.12 Analysing internet usage from public library

We want to analyse how respondents in different age groups use the internet at a public library for personal, non-business purposes. This will help us understand how different age groups have used the library for internet access.

#Create a subset for the columns
Internet_libraryset <- locationofUse[c(5, 19)]
# Change the datatype of the variables for processing
Internet_libraryset$Age <-
  as.character(Internet_libraryset$Age)
Internet_libraryset$Internet_Usage_Library <-
  as.character(Internet_libraryset$Internet_Usage_Library)
#Create a plot with internet usage library, and age group
Internet_libraryset %>%
  filter(!is.na(Internet_Usage_Library)) %>% # filter values
  ggplot(aes(Age, after_stat(count))) + # after_stat() replaces the deprecated ..count.. notation
  geom_bar(aes(fill = Internet_Usage_Library), colour = "Black") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) +
  labs(title = " Fig:9 Internet Usage at Library", x = "Age Groups", y = "Count") +
  scale_x_discrete(labels = c(
    "1" = "16-24",
    "2" = "25-34",
    "3" = "35-44",
    "4" = "45-54",
    "5" = "55-64",
    "6" = ">65"
  )) +
  scale_fill_discrete(name = "Internet Used at Library", labels = c("No", "Yes"))

4.12.1 Findings

We found that respondents in the age range 16-24 have used the internet at the library for personal, non-business purposes more than any other age range.
We found that respondents older than 65 have used the internet at the library for personal purposes the least.
We found that the age range 45-54 has the highest number of respondents who have not used the library's internet for personal purposes.
We found that the age ranges 25-34 and 55-64 also have some of the highest numbers of respondents who have not used the library's internet for personal purposes, and their counts are relatively similar.
Finally, we found that the respondent counts across the age ranges are approximately normally distributed with respect to internet usage at the library.
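The counts behind these statements can be tabulated directly instead of being read from the bars. This is a minimal sketch on made-up data; the `toy` data frame, the `Age_Group` column, and the "No"/"Yes" coding of the answers are assumptions for illustration only:

```r
# Hypothetical toy data; the "No"/"Yes" coding of the answers is an assumption
toy <- data.frame(
  Age = c(1, 1, 2, 3, 4, 4, 5, 6, 6, NA),
  Internet_Usage_Library = c("Yes", "No", "No", "No", "No",
                             "No", "No", "No", NA, "Yes")
)

# Map the numeric age codes to the labels used on the plot axis
age_labels <- c("16-24", "25-34", "35-44", "45-54", "55-64", ">65")
toy$Age_Group <- factor(toy$Age, levels = 1:6, labels = age_labels)

# Drop rows with a missing age or answer, then count answers per age group
answered <- toy[!is.na(toy$Age_Group) & !is.na(toy$Internet_Usage_Library), ]
library_counts <- table(answered$Age_Group, answered$Internet_Usage_Library)
print(library_counts)

# Age group with the largest number of "No" answers
most_no <- names(which.max(library_counts[, "No"]))
print(most_no)
```

Labelling the age codes once with `factor()` also keeps tables and plots consistent, since the labels no longer have to be repeated inside `scale_x_discrete()`.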

4.12.2 Explanation

The findings show that respondents aged 16-24 use the library's internet for personal, non-business purposes the most, probably because people in that age range tend to attend schools, colleges, and universities and may therefore use the library for internet access more often than others. The age group 45-54 uses library internet for personal purposes less frequently than other age groups, likely because they are less likely to be students and so may not go to a library for internet access in the first place, or because they use the internet at home or at work.

4.13 Analysing internet usage from friends or neighbours home

We want to analyse the internet usage pattern of the respondents, specifically how respondents in different age categories have accessed the internet from a friend's or neighbour's home. We do this by counting how many respondents have or have not used the internet there and grouping them by age category.

#Create a subset for the columns
Internet_friendsset <- locationofUse[c(5, 22)]
# Change the datatype of the variables for processing
Internet_friendsset$Age <-
  as.character(Internet_friendsset$Age)
Internet_friendsset$Internet_Usage_Neighbours <-
  as.character(Internet_friendsset$Internet_Usage_Neighbours)
#Create a plot with internet usage Friends or Neighbor's home, and age group
Internet_friendsset %>%
  filter(!is.na(Internet_Usage_Neighbours)) %>% # filter values
  ggplot(aes(Age, after_stat(count))) + # after_stat() replaces the deprecated ..count.. notation
  geom_bar(aes(fill = Internet_Usage_Neighbours), colour = "Black") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) +
  labs(title = " Fig:10 Internet Usage at Friends or Neighbor's home", x = "Age Groups", y = "Count") +
  scale_x_discrete(labels = c(
    "1" = "16-24",
    "2" = "25-34",
    "3" = "35-44",
    "4" = "45-54",
    "5" = "55-64",
    "6" = ">65"
  )) +
  scale_fill_discrete(name = "Usage at Friends'/ Neighbor's home", labels = c("No", "Yes"))

4.13.1 Findings

We found that respondents aged 16-24 form the largest group using the internet at a friend's or neighbour's home, with a count of 750 respondents.
We found that respondents aged 25-34 are evenly split between using and not using the internet at a friend's or neighbour's home, with a count of 750 each.
We found that respondents aged 35-44 show a decline in usage from a friend's or neighbour's home, with around 450 answering Yes and around 750 answering No.
We also found that the following age groups [45-54 and 55-64] continue this downward trend in using the internet from a friend's or neighbour's home.
Finally, we found that respondents aged 65 and above used the internet from a friend's or neighbour's home the least: fewer than 100 people said yes and more than 250 people said no.
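The downward trend described above can be expressed as the share of "Yes" answers per age group. The sketch below uses made-up records; the 25-34 and 35-44 shares were chosen to mirror the approximate counts in the findings (750 of 1500 and 450 of 1200), while the 16-24 share is purely hypothetical:

```r
# Hypothetical toy data (not the CIUS records); TRUE means the answer was "Yes"
toy <- data.frame(
  Age_Group = factor(
    c(rep("16-24", 4), rep("25-34", 4), rep("35-44", 8)),
    levels = c("16-24", "25-34", "35-44")
  ),
  Used = c(TRUE, TRUE, TRUE, FALSE,
           TRUE, TRUE, FALSE, FALSE,
           TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values
yes_share <- tapply(toy$Used, toy$Age_Group, mean)
print(round(yes_share, 3))

# Negative differences between consecutive groups indicate a downward trend
declining <- all(diff(as.numeric(yes_share)) < 0)
print(declining)
```

Comparing shares rather than raw counts matters here because the age groups differ in size, so a lower count does not by itself mean lower usage.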

4.13.2 Explanation

The results show that respondents aged 16-24 are the ones who most often use the internet at a friend's or neighbour's home. This might be because the younger generation tends to work and relax in groups: studying together, playing online games with friends, and watching movies at a friend's place. All of these account for internet usage at a friend's or neighbour's home. Older people, by contrast, tend to be more isolated, which may explain why they use the internet the least from a friend's or neighbour's home.

4.14 Analysing number of people in a household across all provinces

We want to analyse the number of people [1, 2, 3, or 4 or more] in a household in each province. The respondents provided information on how many people live with them as part of their household, which can be used to answer this question. This can further help us understand the household dynamics of each province in the future.

#Create a subset for the columns
Householdsset <- locationofUse[c(2, 10)]
# Change the datatype of the variables for processing
Householdsset$Province <-
  as.character(Householdsset$Province)
Householdsset$Houshold_Type <-
  as.character(Householdsset$Houshold_Type)
# Create a plot with no.of people in household and provinces
Householdsset %>%
  filter(!is.na(Houshold_Type)) %>% # filter values
  ggplot(aes(Province, after_stat(count))) + # after_stat() replaces the deprecated ..count.. notation
  geom_bar(aes(fill = Houshold_Type), position = "stack", colour = "Black") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) +
  labs(title = " Fig:11 No.of People in Household Across the Provinces", x = "Provinces", y = "Count") +
  scale_x_discrete(
    labels = c(
      "10" = "Newfoundland and Labrador",
      "11" = "Prince Edward Island",
      "12" = "Nova Scotia",
      "13" = "New Brunswick ",
      "24" = "Quebec",
      "35" = "Ontario",
      "46" = "Manitoba",
      "47" = "Saskatchewan",
      "48" = "Alberta",
      "59" = "British Columbia"
    )
  ) +
  scale_fill_discrete(name = "No.of People in Household", labels = c("1 Person", "2 Persons", "3 Persons", "4 or more persons")) +
  coord_flip()

4.14.1 Findings

We found that Ontario [35] had the most respondents, and households of 1, 2, or 3 people were the most common in the province: more than 6000 respondents reported that no more than 3 people lived in their household, which accounts for around 98% of the responses.
We found that Quebec [24] had ratios similar to Ontario's, with the major chunk of its respondents reporting that no more than 3 people lived in their household.
We also found that British Columbia [59] and Alberta [48] were next in order for the number of respondents stating that no more than 3 people lived in their household, and the count was around 3000.
We also found that Prince Edward Island [11] had the lowest total respondent count, around 700, with only a very small portion of respondents having 4 or more people in their household.
Finally, all the provinces had only a very small portion of respondents who stated that they have 4 or more people in their household.
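A percentage such as the "around 98%" above can be derived per province by collapsing the household-size categories. The following is a minimal sketch on made-up records; the counts are hypothetical and only the province codes and the `Houshold_Type` column name follow the report:

```r
# Hypothetical toy data; province codes follow the plot labels, counts are made up
toy <- data.frame(
  Province      = c(35, 35, 35, 35, 35, 24, 24, 24, 24, 11, 11),
  Houshold_Type = c(1, 2, 2, 3, 4, 1, 2, 3, 3, 2, 4)
)

# Rows are provinces, columns are household-size categories (4 = "4 or more")
household_counts <- table(Province = toy$Province, Size = toy$Houshold_Type)

# Row-wise proportions, then add up the categories for 1, 2 and 3 people
household_share <- prop.table(household_counts, margin = 1)
three_or_fewer  <- rowSums(household_share[, c("1", "2", "3")])
print(round(100 * three_or_fewer, 1))
```

Run on the real `Householdsset`, the same steps would give the share of households with 3 or fewer people in every province, which makes provinces with very different respondent totals directly comparable.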

4.14.2 Explanation

A significant number of respondents in Ontario [35] and Quebec [24] had 3 or fewer people in their homes. This may be because urbanization has made living as a joint family more complicated for work, financial, and personal reasons, so families might not be willing to stay together or to have large families. The results show that all the provinces have only a very small portion of households with 4 or more people, so the findings are in line with this explanation.

5. References

Price, A. C. (2012, April 16). The AAVSO 2011 Demographic and Background Survey. arXiv.org. https://arxiv.org/abs/1204.3582

Nair, M. (2022, July 18). Understanding The Canadian Education System. University of the People. https://www.uopeople.edu/blog/understanding-the-canadian-education-system/

Deng, Z. (2021, September 22). Gender-related differences in desired level of educational attainment among students in Canada. Statistics Canada. https://www150.statcan.gc.ca/n1/pub/36-28-0001/2021009/article/00004-eng.htm

Guo, J. (2016, January 28). The serious reason boys do worse than girls. Washington Post. https://www.washingtonpost.com/news/wonk/wp/2016/01/28/the-serious-reason-boys-do-worse-than-girls/

Group-4 Project - Canadian Internet Usage Survey - Location Data

Group-4

10-11-2022