1 Introduction

The internet has become a significant part of our lives and has made many everyday tasks more convenient. Because it is used so frequently, this analysis explores how usage varies across distinct groups of users. The location-specific dataset from the Canadian Internet Use Survey offers a comprehensive breakdown of the key variables needed to analyze internet usage in Canada. It includes major elements such as region, province, age, employment, and respondent details, along with other specific features. The analysis uses both exploratory and predictive techniques to interpret the data visually from several viewpoints and to derive recommendations from it.

2 Initial Analysis

The dataset records a coded value for each variable across a set of dimensions, and these values vary notably from one respondent group to another. We began by posing candidate questions as part of our basic investigation in order to better understand the dataset. Key questions included which province has the highest internet usage and what the employment status of Canadian household members is. We also examine the distribution of users, the frequency of internet usage, educational background, and the main variables associated with heavy internet use among students in order to gain meaningful insights.

Dimension Check: To understand the dynamics of the dataset, it is divided into descriptive chunks. Recognizing the dataset’s components and attributes helps in conceptualizing how the variables relate to one another. As part of data quality measures, accuracy, completeness, consistency, timeliness, validity, and integrity need to be examined. After the dimensions, the dataset’s summary is reported, showing R’s tabular representation of the number of records and their associated components.

Inference: The data summary reports the key descriptive statistics for each variable: the minimum, first quartile, median, mean, third quartile, and maximum. R’s built-in summary() function is used to generate this result summary. Additionally, there are 23178 rows and 23 columns in the data set. Comparing this total with the per-variable counts allows us to figure out whether the dataset contains any null values.
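As a minimal sketch of the dimension check described above, the snippet below runs dim() and summary() on a small made-up data frame. The object toy and its values are hypothetical and stand in for the survey data.

```r
# Hypothetical toy data; the real survey has 23178 rows and 23 columns
toy <- data.frame(
  PROVINCE = c(35, 46, 10, 35, 13, 46),
  AGE      = c(3, 1, 2, 5, 3, 2)
)

toy_dim <- dim(toy)          # number of rows, then number of columns
toy_dim

age_summary <- summary(toy$AGE)
age_summary                  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```

For a numeric column, summary() reports quartiles and the mean; once a column is converted to a factor, it reports the frequency of each level instead.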

3 Data Pre-processing

The acquired raw data is converted into usable information by removing noise and corrupted entries. Data that is inconsistent, erroneous, or incomplete is cleaned and processed so that it follows a regular pattern. This makes the data easier to visualize and puts it into a structured form.

Below are some of the approaches that we’ll be following:

3.1 Data structuring

In the code snippet below, str() is used to check that the data set has a well-defined structure and that appropriate data types are assigned to the variables, which makes the data easier to store and access.

3.2 Removing Null Values

Missing data can yield skewed estimates, which can impair a study’s statistical significance and lead to inaccurate conclusions. In such scenarios, data imputation or other methodologies can be used to handle null values in a dataset. The location dataset used here contains no null values, which reduced complexity and saved time and code.
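One possible way to check for null values is sketched below on a hypothetical data frame that does contain some NA entries (the survey data does not). The names toy, na_per_column, and complete_rows are illustrative only.

```r
# Hypothetical data frame with two missing entries
toy <- data.frame(
  REGION = c(3, 4, NA, 3),
  SEX    = c(2, 2, 1, NA)
)

na_per_column <- colSums(is.na(toy))   # count of NA values in each column
na_per_column

has_na <- anyNA(toy)                   # TRUE if any value is missing
has_na

# Simplest handling: keep only rows with no missing values
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

Dropping incomplete rows is only one option; imputation may be preferable when many rows would otherwise be lost.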

3.3 Datatype Modification

Although the dataset’s variables are categorical in nature, they are stored in numerical form, so R treats them as continuous. Hence, these variables need to be converted to categorical variables (factors) in R so that the groupings can be statistically analyzed for deeper insights. In general, establishing factors and levels makes it easier to explore the data.
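The conversion can be sketched as follows on a hypothetical vector of codes. The mapping 1 = Male, 2 = Female follows the recoding comments used later in this report; the vector itself is made up.

```r
# Hypothetical numeric codes: 1 = Male, 2 = Female
codes <- c(1, 2, 2, 1, 2)
is.numeric(codes)            # R initially sees plain numbers

# Convert to a factor so R treats the values as categories
sex <- factor(codes, levels = c(1, 2), labels = c("Male", "Female"))
levels(sex)

sex_counts <- table(sex)     # frequencies per category rather than a numeric summary
sex_counts
```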

3.4 Value Modification

Despite the dataset’s extensive response collection, some response categories are very sparsely populated and risk behaving like outliers when the data is visualized. Such variables are handled by consolidating several of their sparse values into a single value.
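A minimal sketch of this consolidation is shown below. The response vector is hypothetical, and the codes 8 and 9 are assumed here to stand for refusal and not stated.

```r
# Hypothetical responses; 6 = valid skip, 7 = don't know, 8 and 9 assumed sparse codes
lu_school <- c(1, 2, 6, 7, 8, 9, 2, 9)

# Fold the sparse codes into the single "No Response" code 6
lu_school[lu_school %in% c(8, 9)] <- 6

recoded_counts <- table(lu_school)
recoded_counts               # code 7 (don't know) is left untouched
```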

Below are the libraries that we’ll be using for our analyses.

#Libraries Initialization
library(ggplot2)
library(dplyr)
library(DT)
library(magrittr)
library(knitr)
library(tidyverse)
library(epiDisplay)
library(ggpubr)
library(ggmosaic)
library(vcd)
library(grid)
theme_set(theme_pubr())

First, let’s preview the first few rows of our data using head() in R to better understand the dynamics of the data set at hand.

After that, we’ll rename the columns of the dataset for better readability.

head(data)

##   PUMFID PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST
## 1      1       35      3        5        3    2       3       2        1
## 2      2       46      4        5        1    2       1       1        2
## 3      3       10      1        5        2    1       2       2        1
## 4      4       35      3        4        5    2       2       2        3
## 5      5       13      1        4        3    1       2       2        1
## 6      6       46      4        4        2    2       1       2        3
##   GFAMTYPE G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 LU_Q01 LU_Q02 LU_G03 LU_Q04
## 1        3        1       3       2      1      2      1      2      2      2
## 2        2        3       2       1      1      4      1      1      1      1
## 3        2        2       3       2      1      4      1      1      1      1
## 4        3        1       2       2      1      3      6      6      6      6
## 5        2        3       2       2      1      2      1      2      2      2
## 6        1        3       1       2      1      4      6      6      6      6
##   LU_Q05 LU_Q06A LU_Q06B LU_G06
## 1      2       6       6      6
## 2      1       2       1      2
## 3      1       2       2      1
## 4      6       6       6      6
## 5      2       6       6      6
## 6      6       6       6      6

#Renaming columns for better readability and understanding
colnames(data) <- c("ID", 
                    "PROVINCE", 
                    "REGION", 
                    "URBAN_RURAL", 
                    "AGE_GROUP", 
                    "SEX", 
                    "ED_LEVEL", 
                    "STUDENT_STATUS", 
                    "EMPLOYMENT_STATUS", 
                    "HOUSE_TYPE", 
                    "HOUSE_SIZE", 
                    "HIGH_EDU", 
                    "STUD_IN_HOUSE", 
                    "INT_USAGE", 
                    "INT_YEARS", 
                    "LU_FROM_HOME", 
                    "LU_FROM_WORK", 
                    "LU_FROM_SCHOOL", 
                    "LU_FROM_LIBRARY", 
                    "LU_OTHERS", 
                    "LU_OTHERS_RELATIVE", 
                    "LU_OTHERS_FRIEND", 
                    "LU_OTHERS_MISC")

In the provided Dataset, we are dealing with Categorical Data encoded in the form of numbers. There are no continuous variables in the provided Dataset. However, since the values are in numbers, R might consider this to be continuous data and assign numeric data types to the variables. We’ll fix this issue in the later part of the below code block by validating the structure of our Dataset and setting the columns as factors.

Next, to cleanse the data, we modify the values of certain columns by grouping several values under one value. For instance, in the LU_FROM_SCHOOL column, the number 6 is assigned to No Response, which covers Valid Skips, Refusal, and Not Stated, instead of each of these sub-classifications having a separate numerical value.

However, the value 7 = Don’t Know is not changed in the process and remains as it is.

We take this step because we believe these separate values would not add anything meaningful to our analysis.

Below code snippet performs the action of data cleansing/data pre-processing:

#Data Pre-processing 

#Modifying values from the data set
#SEX:
#Assigning 0 = Female 1 = Male
data["SEX"][data["SEX"]==2] <-0

#STUDENT_STATUS:
#Assigning 0 = No 1 = Yes
data["STUDENT_STATUS"][data["STUDENT_STATUS"]== 2] <- 0

#STUD_IN_HOUSE:
#Assigning 0 = No 1 = Yes
data["STUD_IN_HOUSE"][data["STUD_IN_HOUSE"]== 2] <- 0

#INT_USAGE:
#Assigning 0 = No 1 = Yes
data["INT_USAGE"][data["INT_USAGE"]== 2] <- 0

#INT_YEARS:
# 6 = No Response (Valid Skips & Refusals) 7 = Don't Know
data["INT_YEARS"][data["INT_YEARS"] == 8] <- 6

#LU_FROM_WORK:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_FROM_WORK"][data["LU_FROM_WORK"] == 9] <- 6

#LU_FROM_SCHOOL:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_FROM_SCHOOL"][data["LU_FROM_SCHOOL"] > 7] <- 6

#LU_FROM_LIBRARY:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_FROM_LIBRARY"][data["LU_FROM_LIBRARY"] == 9] <- 6

#LU_OTHERS:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_OTHERS"][data["LU_OTHERS"] == 9] <- 6

#LU_OTHERS_RELATIVE:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_OTHERS_RELATIVE"][data["LU_OTHERS_RELATIVE"] > 7] <- 6

#LU_OTHERS_FRIEND:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_OTHERS_FRIEND"][data["LU_OTHERS_FRIEND"] >7] <- 6

#LU_OTHERS_MISC:
# 6 = No Response (Valid Skips & Not Stated)
data["LU_OTHERS_MISC"][data["LU_OTHERS_MISC"] == 9] <- 6

str(data) #Checking the structure of our data set

#Handling categorical variables: Setting factors & levels to categories
data$ID <- factor(data$ID)
data$PROVINCE <- factor(data$PROVINCE, levels=c("10","11","12","13","24","35","46","47","48","59"))
data$REGION <- factor(data$REGION, levels = c("1","2","3","4","5","6"))
data$URBAN_RURAL <- factor(data$URBAN_RURAL,levels = c("1","2","3","4","5","6"))
data$AGE_GROUP <- factor(data$AGE_GROUP,levels = c("1","2","3","4","5","6"))
data$SEX <- factor(data$SEX,levels = c("0","1"))
data$ED_LEVEL <- factor(data$ED_LEVEL,levels = c("1","2","3"))
data$STUDENT_STATUS <- factor(data$STUDENT_STATUS,levels = c("0","1"))
data$EMPLOYMENT_STATUS <- factor(data$EMPLOYMENT_STATUS,levels = c("1","2","3"))
data$HOUSE_TYPE <- factor(data$HOUSE_TYPE,levels = c("1","2","3","4"))
data$HOUSE_SIZE <-factor(data$HOUSE_SIZE,levels=c("1","2","3","4"))
data$HIGH_EDU <- factor(data$HIGH_EDU,levels = c("1","2","3"))
data$STUD_IN_HOUSE <- factor(data$STUD_IN_HOUSE,levels = c("0","1"))
data$INT_USAGE <- factor(data$INT_USAGE,levels = c("0","1"))
data$INT_YEARS <- factor(data$INT_YEARS,levels =c("1","2","3","4","6","7"))
data$LU_FROM_HOME<- factor(data$LU_FROM_HOME,levels = c("1","2","6","7"))
data$LU_FROM_WORK <- factor(data$LU_FROM_WORK ,levels = c("1","2","6","7"))
data$LU_FROM_SCHOOL <- factor(data$LU_FROM_SCHOOL,levels = c("1","2","6","7"))
data$LU_FROM_LIBRARY <- factor(data$LU_FROM_LIBRARY ,levels = c("1","2","6","7"))
data$LU_OTHERS <- factor(data$LU_OTHERS ,levels = c("1","2","6","7"))
data$LU_OTHERS_RELATIVE <- factor(data$LU_OTHERS_RELATIVE ,levels = c("1","2","6","7"))
data$LU_OTHERS_FRIEND <- factor(data$LU_OTHERS_FRIEND ,levels = c("1","2","6","7"))
data$LU_OTHERS_MISC <- factor(data$LU_OTHERS_MISC ,levels = c("1","2","6"))

Now, we have completed modifying certain values in various columns and assigning factors to variables to make them categorical.

Let’s have a look at the summary and dimensions of our Dataset.

summary(data)

##        ID           PROVINCE    REGION   URBAN_RURAL AGE_GROUP SEX      
##  1      :    1   35     :6518   1:3798   1:  975     1:1981    0:12817  
##  2      :    1   24     :4437   2:4437   2: 1019     2:3269    1:10361  
##  3      :    1   59     :2533   3:6518   3:  817     3:3749             
##  4      :    1   48     :2242   4:3650   4:12143     4:4555             
##  5      :    1   46     :2023   5:2242   5: 7632     5:4419             
##  6      :    1   47     :1627   6:2533   6:  592     6:5205             
##  (Other):23172   (Other):3798                                           
##  ED_LEVEL STUDENT_STATUS EMPLOYMENT_STATUS HOUSE_TYPE HOUSE_SIZE HIGH_EDU 
##  1:9082   0:21464        1:13559           1: 5639    1:6304     1: 5995  
##  2:9753   1: 1714        2: 1045           2:10395    2:8831     2:11104  
##  3:4343                  3: 8574           3: 6304    3:3474     3: 6079  
##                                            4:  840    4:4569              
##                                                                           
##                                                                           
##                                                                           
##  STUD_IN_HOUSE INT_USAGE INT_YEARS LU_FROM_HOME LU_FROM_WORK LU_FROM_SCHOOL
##  0:19089       0: 5599   1:  607   1:16046      1: 6740      1: 2364       
##  1: 4089       1:17579   2:  891   2:  950      2:10253      2:14601       
##                          3: 2586   6: 6181      6: 6183      6: 6211       
##                          4:13461   7:    1      7:    2      7:    2       
##                          6: 5600                                           
##                          7:   33                                           
##                                                                            
##  LU_FROM_LIBRARY LU_OTHERS LU_OTHERS_RELATIVE LU_OTHERS_FRIEND LU_OTHERS_MISC
##  1: 2118         1: 5453   1: 1974            1: 2355          1: 2641       
##  2:14869         2:11531   2: 3462            2: 3081          2: 2795       
##  6: 6185         6: 6186   6:17738            6:17738          6:17742       
##  7:    6         7:    8   7:    4            7:    4                        
##                                                                              
##                                                                              
##

dim(data)

## [1] 23178    23

Because we are working with categorical data, the summary() function gives us a clear idea of the frequency of occurrence of the different values in our dataset.

With the dim() function we identified that there are 23,178 rows and 23 columns in our data set. The summary output reports no NA counts for any column, so no NA values are present in the provided data set.

4 Outliers

While going through the summary of the dataset, we realized that every variable in the dataset is categorical, and in general the concept of outliers applies to continuous variables. However, categories with a markedly lower frequency of occurrence can also be treated as outliers.

In this dataset we addressed this in the code snippet above, where we combined certain values (Valid Skips, Refusals, etc.) of specific columns that behaved like outliers into one broad category (No Response). However, the Don’t Know category was left untouched, as it may be useful for further analysis.
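To illustrate how low-frequency categories might be spotted, the sketch below uses the INT_YEARS frequencies reported in the summary output above. The 1% cut-off is a hypothetical choice made for illustration, not a rule used in this report.

```r
# INT_YEARS frequencies taken from the summary() output above
int_years <- c("1" = 607, "2" = 891, "3" = 2586, "4" = 13461, "6" = 5600, "7" = 33)

share <- int_years / sum(int_years) * 100   # percentage of respondents per category
round(share, 1)

threshold <- 1                              # hypothetical cut-off, in percent
sparse_levels <- names(share)[share < threshold]
sparse_levels
```

Under this cut-off only category 7 (Don’t Know) is flagged, which matches the decision to keep it as a separate value rather than treat it as noise.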

5 Exploratory Analyses

We have now completed our data pre-processing by checking for missing values, modifying columns where needed, and combining certain column values under one category, so we are ready to move ahead with the data visualization stage.

In the following portion, we identify the trends and weaknesses of the survey dataset using exploratory data analysis. This technique involves examining data sets to highlight their key features, frequently using statistical graphics and other visual analytics. At the end of the analysis, we will be able to identify the key elements that reveal the major challenges in the available data and will be in a position to provide forward-thinking suggestions and troubleshooting solutions. Some of the goals of the analysis are listed below.

To locate the dataset’s weakest point and understand the possible threat that the linked constituents might experience as a result of the deviation.
To develop an action plan that could improve the existing situation by understanding how individuals in different locations perceive the internet. This is done by observing the pattern of responses.
To gather information and feedback about new products and services, optimizations to marketing strategies, upcoming technological developments, and improvements to current features in order to better understand the target market and reach the right people at the right location with the least amount of effort.

5.1 Pie Chart: Male & Female

A pie chart divides a circle into sectors, which makes it a convenient tool for showing how a whole is split into proportions. As a part of the analysis, the chart depicts the proportion of male and female survey participants.

Let’s have a look at a pie chart to better understand the data we are working with. We’ll start by looking at what percent of males and females took part in the survey.

#Visualizations
#1. Pie Chart: Male & Female respondents
count <- table(data$SEX)
pct <- round(count/sum(count)*100)
lbls <- c("Female","Male")
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(count, labels = lbls,col=rainbow(length(lbls)), main="Percentage of Males & Females")

Findings:

According to the chart, the female community has participated in the survey at a percentage of 55, while the male population has covered the remaining 45% of the total.
Percentages are added to the labels to improve readability.

Inference:

Because the slices follow the factor order (0 = Female, 1 = Male) and rainbow(2) returns red followed by cyan, the female population, which covers the larger portion, is drawn in red and the male population in cyan, making it easy to visually distinguish between the two groups.

5.2 Horizontal Bar Plot: Distribution of Participants based on Province

The primary goal of the horizontal bar chart below is to analyze the distribution of survey participants across provinces, in percent. To compute the percentages, we divide each province’s count by the total number of participants and multiply by 100. R uses the barplot() function to create bar charts and can draw both vertical and horizontal bars; here we use a horizontal bar plot. We then add the percentages to the labels, widen the left margin, and adjust the titles so that the horizontal layout is readable.

#2. Horizontal Bar Plot: Distribution of Participants based on Province
province_counts <- table(data$PROVINCE)
namesarg <- c("NL","PE","NS","NB","QC","ON","MB", "SK","AB", "BC")

pct <- round(province_counts/sum(province_counts)*100)
namesarg <- paste(namesarg, pct) # add percents to labels
namesarg <- paste(namesarg,"%",sep="") # add % to labels
par(mar=c(5.1, 13 ,4.1 ,2.1)) # Moving margins
province_plot <- barplot(province_counts,names.arg=namesarg,col=rainbow(length(province_counts)), 
                         las=1, horiz = TRUE,xlim = c(0,7000), xlab="",ylab = "",
                         main="Distribution of Participants based on Province")
title(xlab="Number of Participants",ylab="Province",line=0, cex.lab=1.2) # Adjusting the Title
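As a side note, the percentage labels can also be computed with prop.table(). The sketch below uses the province counts visible in the summary output above; the four Atlantic provinces are grouped into one hypothetical "Atlantic" entry because the summary only reports their combined count.

```r
# Counts from the summary() output; "Atlantic" is the combined (Other) row
province_counts <- c(ON = 6518, QC = 4437, BC = 2533, AB = 2242,
                     MB = 2023, SK = 1627, Atlantic = 3798)

# prop.table() divides each count by the overall total
pct <- round(prop.table(province_counts) * 100)

pct_labels <- paste0(names(pct), " ", pct, "%")
pct_labels
```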

Findings:

Based on the aforementioned visualization, it is clear that:

Ontario ranks first, followed by Quebec, in the share of survey respondents, with 28% and 19%, respectively. Note that the plot counts all participants in each province, not only those who have used the internet (either for e-mail or the web) for personal, non-business purposes.
Manitoba and Alberta contribute almost the same share of respondents, with 9% and 10%, respectively.
With barely 3% of respondents, Prince Edward Island has the lowest share.

Inference:

Most respondents are from Ontario and Quebec, which may be related to their large populations. On the other hand, provinces like Prince Edward Island and Newfoundland and Labrador contribute the fewest respondents. Because the plot shows participant counts, it cannot by itself show where usage rates are highest. More generally, communication tools tend to grow and modernize along with locations. Online shopping, information searching, social networking, entertainment, and similar activities could eventually lead to a rise in regional internet usage for private, non-commercial purposes.

5.3 Bar Plot: Which Age group uses the internet most & least?

The below bar plot visualization differentiates different age categories of people who have utilized the most and least of the Internet. A bar chart basically represents data in rectangular bars with the length of the bar proportional to the value of the variable.

#3.Bar plot: Which Age group uses the internet most & least?
age <- table(data[data$INT_USAGE==1, "AGE_GROUP"])
age_group <- c("16-24","25-34","35-44","45-54","55-64",">64")

age_plot <- barplot(age,names.arg=age_group, las=1,col=rainbow(length(age)),
                    main="Distribution of Usage by Age Group",
                    xlab = "Age Group",
                    ylab="Number of Participants")

text(age_plot,age,labels = age, pos = 1)

Findings:

The 45-54 age group has the most Internet users, with a count of 3821, followed by the 35-44 age group with 3500.
Finally, the 16-24 and over-64 age categories have the same number of Internet users, with a count of 1952 each.

Inference: Respondents in the 45-54 age group account for the most Internet users. Most of them are of working age and belong to a generation that adopted the Internet early, which may explain why they are more likely to use it than the other age categories.

5.4 Barplot: Which gender makes the most out of the internet?

The visualization’s objective is to examine how respondents’ Internet usage is distributed according to their sex. In the coding shown below, we have utilized gender as a common denominator to assess the distribution of each group using a bar plot. We can determine which gender category has utilized the Internet the most from the bar plot. The below barplot will help us answer the question of Which gender makes the most out of the internet?

#4. Barplot: Which gender makes the most out of the internet?
gender_count <- table(data[data$INT_USAGE==1,"SEX"])
gender <- c("Female", "Male")

gender_plot <- barplot(gender_count,names.arg=gender, las=1,col=rainbow(length(gender_count)),
                       main="Distribution of Usage by Gender", ylim = c(0,10500),
                       xlab = "Gender",
                       ylab="Number of Participants")
text(gender_plot,gender_count,labels = gender_count, pos = 3)

Findings: With 9629 female Internet users and 7950 male Internet users, it is clear that women have used the internet more than men.

Inference: The number of Internet users is higher in the female group than in the male group. A likely reason is that women make up a larger share of the survey respondents (12,817 women versus 10,361 men); relative to group size, the usage rates are similar, at about 75% of women and 77% of men. In a bar chart, each bar can be given a different color, and we used two contrasting colors to distinguish the two gender categories in the graphic above. Bar charts display comparisons between data categories; here the comparison is between the male and female respondents.
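To separate group size from usage rate, the counts can be arranged as a two-way table and converted to row percentages. The user counts below are the ones reported in the findings; the non-user counts are derived by subtracting them from the group totals in the summary output (12817 and 10361).

```r
# Rows: sex, columns: internet usage; non-users = group total minus users
usage <- matrix(
  c(3188, 9629,
    2411, 7950),
  nrow = 2, byrow = TRUE,
  dimnames = list(SEX = c("Female", "Male"), INT_USAGE = c("No", "Yes"))
)

rowSums(usage)                               # group sizes

# margin = 1 makes each row sum to 100%
usage_rate <- round(prop.table(usage, margin = 1) * 100, 1)
usage_rate
```

Read this way, the larger number of female users reflects the larger number of female respondents rather than a higher rate of use.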

5.5 Pie chart: Distribution of students who use the internet based on gender

The below pie chart visualization depicts the distribution of students using the Internet with respect to their gender. We have used pie chart visualization since it displays relative proportions of multiple classes of data.

#5. Pie chart: Portfolio of students who use the internet based on gender
Student_gender_count <- table(data[data$STUDENT_STATUS==1 & data$INT_USAGE==1,"SEX"])

Student_gender_pct <- round(Student_gender_count/sum(Student_gender_count)*100)
Student_gender_lbls <- c("Female","Male")
Student_gender_lbls <- paste(Student_gender_lbls, Student_gender_pct) # add percents to labels
Student_gender_lbls <- paste(Student_gender_lbls,"%",sep="") # add % to labels
pie(Student_gender_count, labels = Student_gender_lbls,
    col=c("yellow", "black"), main="Distribution of Students Using the Internet Based on Gender")

Findings: About 60% of the students who use the Internet are female, with male students making up the remaining 40% of the user base.

Inference: The findings indicate that female students account for a significant share of the students who use the Internet. This might be because female students outnumber male students among the respondents.

5.6 Grouped Bar Plot: Gender and Age classified by Internet Usage

Our goal in conducting this analysis is to examine the respondents who utilize the internet by age and gender. By doing so, we will be better able to determine which age group has been utilizing the internet or other web services the most. We will also be able to identify the gender that has utilized the Internet the most. In order to display each data category in a frequency distribution, we used a grouped bar chart for this visualization. We can represent relative percentages or numbers of various categories using a group bar plot, and it also aids in visually presenting a summary of our sizable dataset.

#6. Grouped bar plot for: Gender and age classified by internet usage

gender_age <- data[data$INT_USAGE==1,]

gender_age <- within(gender_age, {
  gender_age.cat <- NA
  gender_age.cat[AGE_GROUP=="1"] <- "16-24"
  gender_age.cat[AGE_GROUP=="2"] <- "25-34"
  gender_age.cat[AGE_GROUP=="3"] <- "35-44"
  gender_age.cat[AGE_GROUP=="4"] <- "45-54"
  gender_age.cat[AGE_GROUP=="5"] <- "55-64"
  gender_age.cat[AGE_GROUP=="6"] <- ">64"
})
 
ggplot(gender_age,                                      # Grouped bar plot using ggplot2
       aes(x = gender_age.cat,
           fill= SEX)) +
  scale_fill_discrete(labels = c("Female", "Male"))+   # SEX levels are 0 = Female, 1 = Male
  labs(x = "Age Group", y = "Number of Participants",title = "Gender and age classified by internet usage")+
  theme_dark()+ theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
  geom_bar(position = "dodge")

Findings:

We discovered that women utilize the Internet more than men do, regardless of age group.
The above visualization makes it evident that people aged 45 to 54 are the group who use the internet the most.
We also discovered that people over 64 and those between the ages of 16 and 24 have the fewest Internet users, with male users accounting for about 1000 and female users for about 800.

Inference: It can be seen from the above visualization that, regardless of age group, women account for more Internet users than men. Additionally, Internet usage is shown to be the same for people aged 16 to 24 and those older than 64, with 1750 female users and around 1500 male users. On the other hand, the age groups 25–34 and 55–64, with counts of 1450 and 1750 respectively, show the same Internet usage for both men and women.
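One detail worth noting for plots like this is the order of the age groups on the axis: a plain character column is ordered alphabetically, which can move ">64" out of sequence. A minimal sketch, with hypothetical codes, of fixing the order by setting factor levels explicitly:

```r
# Labels in the intended display order (codes 1 to 6)
age_labels <- c("16-24", "25-34", "35-44", "45-54", "55-64", ">64")

age_code <- c(6, 1, 4, 4, 2)                 # hypothetical AGE_GROUP codes

# Explicit levels preserve the intended order instead of alphabetical order
age_cat <- factor(age_labels[age_code], levels = age_labels)
levels(age_cat)
table(age_cat)
```

Plotting functions such as ggplot2's geom_bar() follow the factor's level order, so a column built this way keeps the groups in age order.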

5.7 Stacked Bar Plot: Education Level in Each Province

Our fundamental goal for performing the below analysis is to compare the education level of all the provinces and provide insights. The reason behind choosing a stacked bar chart is that it makes the comparison of the data points even easier, and it makes the interpretation effective.

#7.Stacked Bar Plot: Education Level in Each Province

#Setting province Codes
x <- within(data, {
  province.cat <- NA
  province.cat[PROVINCE==10] <- "NL"
  province.cat[PROVINCE==11] <- "PE"
  province.cat[PROVINCE==12] <- "NS"
  province.cat[PROVINCE==13] <- "NB"
  province.cat[PROVINCE==24] <- "QC"
  province.cat[PROVINCE==35] <- "ON"
  province.cat[PROVINCE==46] <- "MB"
  province.cat[PROVINCE==47] <- "SK"
  province.cat[PROVINCE==48] <- "AB"
  province.cat[PROVINCE==59] <- "BC"
})

ggplot(x, aes(fill=ED_LEVEL,x=province.cat)) +
  scale_fill_discrete(labels = c("High school or less", "College or some post-secondary",
                                 "University certificate or degree"), name = "Education Level")+
  labs(x = "Province", y = "Number of Participants",title = "Education Level in each Province")+
  theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
  geom_bar(position="stack")   # default stat = "count" tallies respondents per province and level

Findings:

We discovered that, compared to all other provinces, Ontario has the most respondents at the college or post-secondary level, followed by Quebec, whose count is about three times higher than that of Nova Scotia.
Alberta and British Columbia had the next-largest numbers of respondents at the highest education level, according to our analysis.
Prince Edward Island, on the other hand, has the lowest counts across all education categories.
It may also be observed that the counts for all education categories in New Brunswick and Nova Scotia are quite similar.
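Finally, across the whole survey the smallest category is a university certificate or degree (4,343 respondents), while high school or less (9,082) and college or some post-secondary (9,753) are of similar size.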

Inference: The stacked bar chart above indicates that, across provinces, more respondents fall into the college or post-secondary category than into the other categories, while the proportion of respondents with university degrees is moderate.
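Because provinces differ greatly in the number of respondents, comparing raw counts mostly reflects sample size. A hedged sketch of a complementary view, using a small hypothetical data frame rather than the survey data, is a cross-tabulation converted to within-province proportions:

```r
# Hypothetical respondents: province code and education level (1 to 3)
toy <- data.frame(
  province = c("ON", "ON", "ON", "QC", "QC", "PE"),
  ed_level = c(1, 2, 2, 1, 3, 2)
)

# Counts per province and education level (what a stacked bar chart displays)
ed_by_province <- table(toy$province, toy$ed_level)
ed_by_province

# Within-province shares: each row sums to 1, so provinces become comparable
ed_share <- prop.table(ed_by_province, margin = 1)
round(ed_share, 2)
```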

5.8 Pie Chart: Distribution of Internet Users and Non Internet Users

Let’s generate a pie chart to show the distribution of internet users and non-users.

#8. Distribution of internet users and non users in the survey?
int_usage_count <- table(data["INT_USAGE"])

int_usage_pct <- round(int_usage_count/sum(int_usage_count)*100)
int_usage_lbls <- c("No","Yes")
int_usage_lbls <- paste(int_usage_lbls, int_usage_pct) # add percents to labels
int_usage_lbls <- paste(int_usage_lbls,"%",sep="") # add % to labels
pie(int_usage_count, labels = int_usage_lbls,
    col=c("Orange", "light blue"), main="Distribution of Internet Users and Non Internet Users")

Findings:

The graphic shows that approximately 76% of people use the internet, compared to a total of only 24% who don’t.
To make the labels easier to read, percentages have been added.

Inference: The pie chart shows the share of internet users in light blue and the share of non-users in orange, which makes it simple to see how the two groups differ from one another.

5.9 Bar Plot: Distribution of Users Based on Years of Usage

The bar plot below illustrates the distribution of users based on years of usage.

#9. Distribution of Users Based on Years of Usage

years_usage <- data["INT_YEARS"]

years_usage <- within(years_usage, {
  years.cat <- NA
  years.cat[INT_YEARS==1] <- "< 1 Year"
  years.cat[INT_YEARS==2] <- "1-2 Years"
  years.cat[INT_YEARS==3] <- "2-5 Years"
  years.cat[INT_YEARS==4] <- "> 5 Years"
  years.cat[INT_YEARS==6] <- "Skipped"
  years.cat[INT_YEARS==7] <- "Don't Know"
})

tab1(years_usage$years.cat, sort.group = "decreasing", cum.percent = TRUE, main = "Distribution of Users Based on Years of Usage", xlab = "Years of Usage")

## years_usage$years.cat : 
##            Frequency Percent Cum. percent
## > 5 Years      13461    58.1         58.1
## Skipped         5600    24.2         82.2
## 2-5 Years       2586    11.2         93.4
## 1-2 Years        891     3.8         97.2
## < 1 Year         607     2.6         99.9
## Don't Know        33     0.1        100.0
##   Total        23178   100.0        100.0

Findings:

The frequency axis ranges from 0 to 14,000, and the x axis shows the categories more than 5 years, skipped, 2 to 5 years, 1 to 2 years, less than a year, and don’t know.
Respondents who were unsure how long they had used the internet selected the “Don’t know” option.
A total of 2,586 respondents reported using the internet for between two and five years, while 13,461 participants reported having used it for more than five years.
The skipped category accounts for a sizable share of the responses (5,600, or 24.2%) but carries no information about years of usage. Its count is almost identical to the 5,599 respondents who reported not using the internet, which suggests that most of these are valid skips by non-users.

Inference: When other categories are included, the analysis drastically deviates since the respondent either didn’t know their own practice or didn’t want to report it. However, a large portion of the respondents belonged to the group that has been a notable lead in internet usage for years.
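To see how the distribution looks without the uninformative categories, the shares can be recomputed over the substantive answers only. This is a small sketch that uses the frequencies printed in the `tab1()` output above; the object names are illustrative and not part of the original analysis.

```r
# Frequencies copied from the tab1() output above
years_freq <- c("> 5 Years" = 13461, "Skipped" = 5600, "2-5 Years" = 2586,
                "1-2 Years" = 891, "< 1 Year" = 607, "Don't Know" = 33)

# Keep only the substantive answers
valid <- years_freq[!names(years_freq) %in% c("Skipped", "Don't Know")]

# Recompute the percentage shares over the valid responses
valid_pct <- round(100 * valid / sum(valid), 1)
print(valid_pct)
```

Whether "Skipped" should be excluded depends on why the question was skipped, so this recalculation is one possible view rather than a correction.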

5.10 Mosaic Plot: Relationship Between Family Size and Internet Usage

The association between internet usage and household size is evaluated graphically in the mosaic plot below. A mosaic plot was chosen because it shows how two categorical variables relate to each other, and because it summarizes the data in a way that extends to several factors at once. The blue tiles in the mosaic plot represent large positive residuals, where the observed frequency is higher than expected, whereas the red tiles represent large negative residuals, where the observed frequency is lower than expected.
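The residuals behind the shading can be computed by hand. The sketch below assumes the shading is based on Pearson residuals, (observed - expected) / sqrt(expected), and uses a hypothetical 2 x 2 table rather than survey counts.

```r
# Hypothetical counts (assumption): rows are internet usage, columns are household size
obs <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(usage = c("No", "Yes"),
                              size  = c("1 Person", "2 People")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residuals: positive where a cell holds more observations than expected
pearson <- (obs - expected) / sqrt(expected)
print(round(pearson, 2))
```

Cells with large positive values would correspond to blue tiles and cells with large negative values to red tiles.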

#10.Mosaic Plot: Relationship between family size and internet usage
mosaic <- data[,c("INT_USAGE","HOUSE_SIZE","STUDENT_STATUS")]
mosaic <- within(mosaic, {
  int_usage.cat <- NA
  int_usage.cat[INT_USAGE=="0"] <- "No"
  int_usage.cat[INT_USAGE=="1"] <- "Yes"
})

mosaic <- within(mosaic, {
  house_size.cat <- NA
  house_size.cat[HOUSE_SIZE=="1"] <- "1 Person"
  house_size.cat[HOUSE_SIZE=="2"] <- "2 People"
  house_size.cat[HOUSE_SIZE=="3"] <- "3 People"
  house_size.cat[HOUSE_SIZE=="4"] <- ">= 4 People"
})

mosaic <- within(mosaic, {
  student_status.cat <- NA
  student_status.cat[STUDENT_STATUS=="0"] <- "Not Student"
  student_status.cat[STUDENT_STATUS=="1"] <- "Student"
})

mosaic_count <- table(mosaic[,c("int_usage.cat","house_size.cat","student_status.cat")]) 
mosaic(~int_usage.cat + house_size.cat, data = mosaic_count, shade = TRUE, legend = TRUE,
       labeling = labeling_border(abbreviate_labs = c(3, 1, 10)),
       main = "Relationship Between Internet Usage\nand Number of People in House")

Findings:

Internet usage is clearly higher in two-person households than in any other household size.
Internet usage is quite similar in households with one person and households with four or more people.
On the other hand, three-person households use the internet far less than other household sizes.

Inference: Since the proportion of internet users differs across household sizes, the two variables (internet usage and the number of people living in the home) appear to be associated rather than independent. However, a more in-depth analysis is needed before a firm conclusion can be drawn.
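One way to follow up on the visual impression is a chi-squared test of independence together with an effect size. The sketch below uses hypothetical counts (not the survey data) and computes Cramér's V by hand; it is an illustration of the kind of check meant by "in-depth analysis", not a result.

```r
# Hypothetical counts (assumption): internet usage by household size
tab <- matrix(c(400, 300, 100, 150,
                600, 1200, 500, 900),
              nrow = 2, byrow = TRUE,
              dimnames = list(usage = c("No", "Yes"),
                              size  = c("1 Person", "2 People",
                                        "3 People", ">= 4 People")))

# Chi-squared test of independence between the two categorical variables
test <- chisq.test(tab)
print(test)

# Cramer's V: strength of association on a 0 to 1 scale
cramers_v <- sqrt(unname(test$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
print(round(cramers_v, 3))
```

A small p-value only says the variables are not independent; Cramér's V helps judge whether the association is strong enough to matter.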

5.11 Three Variable Mosaic Plot: Understanding the Relationship Between Student Status, Family Size and Internet Usage

Adding student status to the mosaic plot below helps us better understand the connection between family size and internet usage. A mosaic plot is used because all three variables are categorical.

#11.Mosaic Plot: Understanding the relationship between family size, student status and internet usage

mosaic(~int_usage.cat + house_size.cat + student_status.cat,  
       data = mosaic_count,shade = TRUE,
       labeling = labeling_border(rot_labels = c(0, 45, 90, 90), abbreviate_labs = c(10, 1, 5)),
       legend = TRUE, 
       main = "Family Size and Internet Usage")

Findings:

The mosaic plot above shows that, in three-person households, respondents who do not use the internet tend to be non-students.
Similarly, in households with four or more people, respondents who do not use the internet tend to be non-students.
In addition, internet users in three-person households tend to be students.
On the other hand, internet users in two-person households tend to be non-students, a group that may include employees who work from home.

Inference: Overall, the mosaic plot above helps assess whether the three variables are associated with one another.

5.12 Contingency Bar Plot: Relationship between Employment Status, Sex and Internet Usage

In this contingency bar plot, the levels of the categorical variables are placed side by side, which is the graphical equivalent of a two-way table. The plot also shows the number of observations for each combination of levels of the categorical variables.
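The two-way table that underlies such a plot can be built directly with `table()`. The following is a minimal sketch on a hypothetical six-row data frame; the column names are illustrative.

```r
# Hypothetical respondents (assumption): not taken from the survey
toy <- data.frame(
  emp_status = c("Employed", "Employed", "Unemployed",
                 "Employed", "Unable to Work", "Unemployed"),
  sex        = c("Female", "Male", "Female", "Female", "Male", "Male")
)

# Count the observations for every combination of the two variables
two_way <- table(toy$emp_status, toy$sex)
print(two_way)

# Add row and column totals to the table
print(addmargins(two_way))
```

Each cell of `two_way` corresponds to the height of one bar in a dodged bar plot of the same data.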

#12.Contingency Bar Plot: Relationship between Employment Status, Sex and Internet Usage
contingency_plot <- data[,c("INT_USAGE","EMPLOYMENT_STATUS", "SEX")]
contingency_plot <- within(contingency_plot, {
  emp_status.cat <- NA
  emp_status.cat[EMPLOYMENT_STATUS=="1"] <- "Employed"
  emp_status.cat[EMPLOYMENT_STATUS=="2"] <- "Unemployed"
  emp_status.cat[EMPLOYMENT_STATUS=="3"] <- "Unable to Work"
})

contingency_plot <- within(contingency_plot, {
  int_usage.cat <- NA
  int_usage.cat[INT_USAGE=="0"] <- "Never Used Internet"
  int_usage.cat[INT_USAGE=="1"] <- "Used Internet"
})

contingency_plot <- within(contingency_plot, {
  sex.cat <- NA
  sex.cat[SEX=="0"] <- "Female"
  sex.cat[SEX=="1"] <- "Male"
})

# Dodged bars by sex, with one panel per internet-usage group
ggplot(contingency_plot, aes(x = emp_status.cat)) +
  geom_bar(aes(fill = sex.cat), color = "white", position = position_dodge(0.9)) +
  facet_wrap(~int_usage.cat, as.table = TRUE) +
  labs(x = "Employment Status", y = "No. of Participants",
       title = "Usage of Internet based on Gender", fill = "Gender of Respondent") +
  theme(axis.text = element_text(face = "bold", size = 10, angle = 20, hjust = 0.75)) +
  fill_palette("jco")  # fill_palette() comes from ggpubr

Findings:

The plot compares respondents who have never used the internet with those who have, broken down by employment status (employed, unemployed, or unable to work) and by gender.
Differences between the groups can be read from the variations in bar height.
The groups with the highest and lowest counts can also be identified.
For instance, about 6,000 employed women use the internet, compared with slightly fewer employed men.
On the other hand, the unemployed group has the highest proportion of respondents who have never used the internet. Both panels suggest that the unemployed use the internet either infrequently or not at all.

Inference: This suggests that working professionals use the internet more frequently than unemployed members of the community. If a new strategic plan is to be implemented, it should focus on the categories identified above.

5.13 Multi Panel Balloon Plot: Are employed people more likely to use the internet?

Balloon plots are typically used as an alternative to heat maps. They work well when there are many groups of observations, because the size of each dot indicates the magnitude of the corresponding value. A balloon plot can also display the contingency table formed by two categorical variables, and a multi-panel version repeats that display for each level of a third variable.

#13. Multi Panel Balloon Plot: Are employed people more likely to use the internet?
# Tabulate only the recoded columns so that each balloon is one group combination
emp_count <- as.data.frame(table(contingency_plot[, c("emp_status.cat", "int_usage.cat", "sex.cat")]))
ggballoonplot(emp_count, x = "emp_status.cat", y = "int_usage.cat", size = "Freq",
              fill = "Freq", facet.by = "sex.cat",
              ggtheme = theme_bw()) +
  scale_fill_viridis_c(option = "C")

Findings:

The plot compares the distribution of internet usage among male and female respondents. Size and color both encode frequency: the largest balloon, in yellow, represents a value of about 6,000, while the smallest, in purple, represents a value near 0.
Comparing the two panels, the female panel contains the largest balloon (shown in yellow), while the corresponding balloon in the male panel is orange, which indicates that the male count is lower.
In both panels, the largest balloon belongs to the employed category. The unemployed have the smallest counts of all the categories, with values close to 0.
This holds for both unemployed men and unemployed women.
The “unable to work” group shows a balanced distribution between men and women.
The frequency ranges close to 4000. The frequency of internet usage among the employed community is 2000, which is very little and therefore insignificant.

Inference: Encoding frequency as both size and color makes the results simple to read and compare. From the plot above, we conclude that, regardless of employment status, internet usage among men is broadly similar to that among women. Both genders also follow a nearly identical trend among those who do not use the internet.

5.14 Where do People mostly Use Internet From?

Where people go online is central to understanding how internet usage is distributed. The horizontal bar graph below visualizes the participants’ primary locations of internet use.

#14. Where do People mostly Use Internet From?

location_data <- data[,c("PROVINCE",  
                         "LU_FROM_HOME", 
                         "LU_FROM_WORK",
                         "LU_FROM_SCHOOL",
                         "LU_FROM_LIBRARY", 
                         "LU_OTHERS_RELATIVE",
                         "LU_OTHERS_FRIEND", 
                         "LU_OTHERS_MISC")]

location_data <- within(location_data, {
  place <- NA
  place[LU_FROM_HOME=="1"] <- "Home"
  place[LU_FROM_WORK=="1"] <- "Work"
  place[LU_FROM_SCHOOL=="1"] <- "School"
  place[LU_FROM_LIBRARY=="1"] <- "Library"
  place[LU_OTHERS_RELATIVE=="1"] <- "Relative's"
  place[LU_OTHERS_FRIEND=="1"] <- "Friend's"
  place[LU_OTHERS_MISC=="1"] <- "Misc"
})

location_data <- within(location_data, {
  province.cat <- NA
  province.cat[PROVINCE==10] <- "NL"
  province.cat[PROVINCE==11] <- "PE"
  province.cat[PROVINCE==12] <- "NS"
  province.cat[PROVINCE==13] <- "NB"
  province.cat[PROVINCE==24] <- "QC"
  province.cat[PROVINCE==35] <- "ON"
  province.cat[PROVINCE==46] <- "MB"
  province.cat[PROVINCE==47] <- "SK"
  province.cat[PROVINCE==48] <- "AB"
  province.cat[PROVINCE==59] <- "BC"
})

location_data <- na.omit(location_data)

loc <- as.data.frame(table(location_data[,c("place", "province.cat")]))

tab1(location_data$place, sort.group = "decreasing", cum.percent = TRUE, main = "Location of Internet Use", xlab = "Number of Participants")

## location_data$place : 
##            Frequency Percent Cum. percent
## Home            6417    37.8         37.8
## Work            3267    19.2         57.0
## Misc            2641    15.5         72.6
## Friend's        1792    10.5         83.1
## Relative's      1003     5.9         89.0
## Library          954     5.6         94.6
## School           912     5.4        100.0
##   Total        16986   100.0        100.0

Findings:

From the graph and the table, home is the most common location of internet use, with 6,417 of the 16,986 responses (37.8%).
Notably, internet usage at work (3,267) is roughly half of the usage at home.
Usage from relatives’ homes, libraries, and schools is similar (each close to 1,000), while friends’ homes (1,792) and miscellaneous locations (2,641) rank higher.

Inference: Compared with other venues, the largest share of participants uses the internet at home. This makes sense because people tend to use a technology more often where they have more time available. Libraries and schools together account for about 11% of usage (5.6% and 5.4%), which suggests that educational settings rely on internet access to a noticeable degree.
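One detail of the recoding above is worth illustrating: because each assignment to `place` overwrites the previous one, a respondent who reports several locations is labeled with the last matching location only. The sketch below uses hypothetical respondents and assumes that the `LU_*` flags can overlap; it contrasts the single-label tally with a per-location count.

```r
# Hypothetical respondents (assumption): 1 means the location was reported
toy <- data.frame(
  LU_FROM_HOME   = c(1, 1, 0, 1),
  LU_FROM_WORK   = c(1, 0, 0, 1),
  LU_OTHERS_MISC = c(0, 0, 1, 1)
)

# Sequential overwriting, as in the chunk above: the last matching flag wins
toy <- within(toy, {
  place <- NA
  place[LU_FROM_HOME == 1]   <- "Home"
  place[LU_FROM_WORK == 1]   <- "Work"
  place[LU_OTHERS_MISC == 1] <- "Misc"
})
single_label <- table(toy$place)
print(single_label)

# Alternative: count every reported location independently
flag_cols <- c("LU_FROM_HOME", "LU_FROM_WORK", "LU_OTHERS_MISC")
multi_count <- colSums(toy[, flag_cols] == 1)
print(multi_count)
```

If the flags never overlap in the real data, both tallies agree; if they do overlap, the per-location counts avoid undercounting the earlier categories such as home.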

5.15 Balloon Plot: Distribution of Internet Usage Based on Province & Location

As noted earlier, the balloon plot is useful for showing how categorical variables relate to one another. Unlike the previous plot, the graph below crosses location with province, which makes it easier to see how the two are related.

#15. Balloon Plot: Distribution of Internet Usage Based on Province & Location
ggballoonplot(loc, x = "place", y = "province.cat", size = "Freq",
              fill = "Freq", 
              ggtheme = theme_bw()) +
  scale_fill_viridis_c(option = "C")

Findings:

The goal of this balloon chart is to explore the distribution of internet usage by province and location. Because frequency is encoded in both size and color, differences can be spotted quickly.
The color scale runs from yellow, which represents the highest frequency, through red for mid-range frequencies, to purple for the lowest.
The plot shows that home internet use in Ontario is the largest single category. Across provinces, home is the location from which residents most often use the internet.
Quebec follows Ontario in home internet usage. Of all the provinces, Newfoundland and Labrador has the lowest recorded internet usage.
Across the locations considered, British Columbia and Alberta use the internet to a similar extent.
The lowest internet use is observed for libraries in Newfoundland and Labrador and in New Brunswick.

Inference: As one of Canada’s most populous provinces, Ontario accounts for the largest share of internet usage in the dataset. Quebec exhibits a similar pattern with only minor variations, showing that internet usage is very common in that province. From this, it can be inferred that usage is fairly evenly distributed across Canada’s more populous provinces, while the remaining provinces record lower usage.
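Because raw counts favor populous provinces, it can help to compare the share of each location within a province. The sketch below uses hypothetical counts (not the survey data) and `prop.table()` with `margin = 1` to compute row-wise percentages.

```r
# Hypothetical counts (assumption): rows are provinces, columns are locations
loc_tab <- matrix(c(800, 400, 100,
                     80,  30,  15),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(province = c("ON", "NL"),
                                  place    = c("Home", "Work", "Library")))

# Percentage of each location within its province (each row sums to about 100)
within_province <- round(100 * prop.table(loc_tab, margin = 1), 1)
print(within_province)
```

Applied to the real `table(location_data[, c("province.cat", "place")])`, this view would show whether smaller provinces use the internet differently or simply contribute fewer respondents.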

5.16 Balloon Plot: Who are Accessing the Internet More?

The balloon plot below shows internet usage by students and non-students. We chose to examine this in depth because student status and employment status are both important to how the internet is used today.

#16. Balloon Plot: Who are Accessing the Internet More?

int_access <- data[,c("INT_USAGE","EMPLOYMENT_STATUS", "STUDENT_STATUS")]

int_access <- within(int_access, {
  emp_status.cat <- NA
  emp_status.cat[EMPLOYMENT_STATUS=="1"] <- "Employed"
  emp_status.cat[EMPLOYMENT_STATUS=="2"] <- "Unemployed"
  emp_status.cat[EMPLOYMENT_STATUS=="3"] <- "Unable to Work"
})

int_access <- within(int_access, {
  int_usage.cat <- NA
  int_usage.cat[INT_USAGE=="0"] <- "Never Used Internet"
  int_usage.cat[INT_USAGE=="1"] <- "Used Internet"
})

int_access <- within(int_access, {
  std_status.cat <- NA
  std_status.cat[STUDENT_STATUS=="0"] <- "Not a Student"
  std_status.cat[STUDENT_STATUS=="1"] <- "Student"
})

int_access_count <- as.data.frame(table(int_access[,c("emp_status.cat", "std_status.cat", "int_usage.cat")]))

ggballoonplot(int_access_count, x = "emp_status.cat", y = "int_usage.cat", size = "Freq",
              fill = "Freq", facet.by = "std_status.cat",
              ggtheme = theme_bw()) +
  scale_fill_viridis_c(option = "C")

Findings:

The graphic shows the frequency for each combination of the variables, with differences conveyed by size and color. Yellow denotes the highest frequency, orange a medium frequency, and purple the lowest.
It is clear that the largest group consists of people who are employed, are not students, and use the internet (represented by a yellow dot with a frequency of about 9,000), whereas people who have never used the internet and people who are students, regardless of their employment status, appear with much lower frequencies.
Students with an employment status of “unable to work” make up a sizable portion of both the “Used Internet” and “Never Used Internet” categories.
Finally, irrespective of employment status, students account for far fewer internet users than non-students.

Inference: Internet users among the employed far outnumber those among currently enrolled students. Despite these differences, the variables reveal several patterns that could highlight a range of population characteristics. Additionally, the unemployed population shows very little internet use, regardless of student status.

6 Next steps

From the visualizations constructed during the exploratory data analysis, we were able to observe the dynamics of the dataset, categorize its values, and identify relationships among its variables. We were also able to locate the outliers in the dataset and organize the data.

Building on these visualizations and variables, we will perform predictive analysis on the dataset. As part of the next steps, we will determine whether the number of people in a household has an impact on internet usage and, in particular, whether a rise in family size is associated with a decline in internet usage.

Given that the variables are categorical in nature, predictive methods such as logistic regression will help us attain these goals. Logistic regression is frequently used for prediction and classification problems, and it is simple to implement, train, and interpret.
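As an illustration of the planned approach, the sketch below fits a logistic regression of internet usage on household size with base R's `glm()`. The data are simulated under an assumed relationship, so the coefficients say nothing about the survey; the variable names mirror the recoded columns but are otherwise hypothetical.

```r
set.seed(42)

# Simulated data (assumption): usage probability falls slightly with household size
n <- 500
size_code <- sample(1:4, n, replace = TRUE)
int_usage <- rbinom(n, size = 1, prob = plogis(1.5 - 0.3 * size_code))

toy <- data.frame(
  int_usage  = int_usage,
  house_size = factor(size_code, levels = 1:4,
                      labels = c("1 Person", "2 People", "3 People", ">= 4 People"))
)

# Logistic regression with a categorical predictor ("1 Person" is the baseline)
fit <- glm(int_usage ~ house_size, data = toy, family = binomial)
print(summary(fit)$coefficients)

# Predicted probability of internet usage for each household size
new_data <- data.frame(
  house_size = factor(levels(toy$house_size), levels = levels(toy$house_size))
)
pred <- predict(fit, newdata = new_data, type = "response")
print(round(pred, 3))
```

Each coefficient is a log-odds difference relative to the baseline household size, which is what makes the model straightforward to interpret for categorical predictors.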

7 Takeaways

From our detailed exploratory analysis, we were able to identify the dynamics of the variables in our dataset and how they should be handled.
We found that the dataset is entirely categorical in nature, and we cleaned the data accordingly in this phase.
The mosaic and balloon plots in our analysis helped us identify associations between certain variables that can be used in the next phase of the analysis, i.e., the predictive phase.
With the knowledge gained from our analysis, two questions came to mind:

Will an increase in family size cause a decrease in internet use by a single member of the family?

Are people who live independently more likely to use the internet than people living with families?

These questions may guide the predictions in our next phase of analysis.

Group 5 - Section 2: Location of Internet Usage (Exploratory Analysis)

By Raghuraman Ganapathi Raman, Sramana Ghosh, Thanusha Michelle George & Aashika Prem Anand

November 10, 2022
