The internet, which has become a significant part of our lives, has typically made things convenient and easier for us. Since the internet is used frequently, this analysis explores how much usage varies depending on the different types of users who fall into distinct groups. The location specific dataset from the Canadian Internet Use Survey, offers a comprehensive breakdown of each key variable that is crucial for analyzing internet usage around the world. It includes the major elements such as Region, provinces, age, employment and respondent details and other specific features. This in-depth research analysis is accomplished using both exploratory and predictive technique that helps interpret the data visually from various viewpoints and predict favorable recommendations out of it.
The Dataset gives values for each variable for the determined quantities based on a set of dimensions, and when their characteristics are taken into account for analysis, they show a notable variance. The analysis started by putting forward potential queries as part of our basic investigation in order to better understand the dataset. An important component of the interpretation included examining factors like the province with the highest internet usage, the employment status of Canadian household members, etc. The distribution of users, internet usage frequency, educational background, and the main variables that influence and exploit the students on using the resource excessively are also projected in order to gain meaningful insights.
Dimension Check: To understand the dynamics of the dataset, it is divided into descriptive chunks. Recognizing the dataset’s components and attributes aids in conceptualizing how the data relates to one another. As part of data quality measures, accuracy, completeness, consistency, timeliness, validity, and Integrity need to be examined. Following the dimensions, the dataset’s summary is noted, displaying the R-structured tabular form of representation of the number of records and their associated components.
Inference: According to the statistical information pulled in relation to the data summary, the important constituents, such as the min, median, mode, max, first quadrant, and third quadrant, are covered in detail. As an outcome, the built-in function is used to generate the model fitting functions of the result summary. Additionally, there are 23178 rows and 23 columns in the data set. The total count allows us to figure out if the dataset contains any null values.
The acquired raw data is converted into real information that removes the possibility of periodic noise and data corruption. In order to follow a regular pattern, the data that is inconsistent, contains errors, and frequently is incomplete, is cleaned and processed. This approach makes it simpler to visualize the data and as a result, transforms it into a structured manner.
Below are some of the approaches that we’ll be following:
In the code snippet below, str() has been used to ensure
that the data set has a well-defined structure, appropriate data types
are assigned to the variables, etc. which will make it easy to store and
access data.
Missing data can yield skewed estimates, which can impair a study’s statistical significance and lead to inaccurate conclusions. In such scenarios, data imputation or many other methodologies can be used to handle null values in a dataset. The location dataset used made it simpler by the dataset’s absence of null values, which minimized complexity and saved time and code usage.
Despite being categorical by default, it is clear from the dataset’s variety of variables that they are represented as continuous variables (in numerical form). Hence, the continuous variables need to be handled to make them categorical in R. so that the groupings can be statistically analyzed for deeper insights. In general, establishing levels and factors can make it easier to explore the data.
Despite the dataset’s extensive response collection, some of the responses appear to be very sparsely populated because they run the risk of becoming an outlier when the data is visualized. Such variables would be handled by consolidating their multiple columnar values into a single value.
Below are the libraries that we’ll be using for our analyses.
#Libraries Initialization
library(ggplot2)
library(ggplot2)
library(dplyr)
library(DT)
library(magrittr)
library(knitr)
library(tidyverse)
library(epiDisplay)
library(ggpubr)
library(ggmosaic)
library(vcd)
library(grid)
library(ggpubr)
theme_set(theme_pubr())
Firstly, Let’s have a look at the summary of our data using in R to better understand the dynamics of the data set at hand.
Post that, we’ll rename the columns of the Datset for better readability.
head(data)
## PUMFID PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST
## 1 1 35 3 5 3 2 3 2 1
## 2 2 46 4 5 1 2 1 1 2
## 3 3 10 1 5 2 1 2 2 1
## 4 4 35 3 4 5 2 2 2 3
## 5 5 13 1 4 3 1 2 2 1
## 6 6 46 4 4 2 2 1 2 3
## GFAMTYPE G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 LU_Q01 LU_Q02 LU_G03 LU_Q04
## 1 3 1 3 2 1 2 1 2 2 2
## 2 2 3 2 1 1 4 1 1 1 1
## 3 2 2 3 2 1 4 1 1 1 1
## 4 3 1 2 2 1 3 6 6 6 6
## 5 2 3 2 2 1 2 1 2 2 2
## 6 1 3 1 2 1 4 6 6 6 6
## LU_Q05 LU_Q06A LU_Q06B LU_G06
## 1 2 6 6 6
## 2 1 2 1 2
## 3 1 2 2 1
## 4 6 6 6 6
## 5 2 6 6 6
## 6 6 6 6 6
#Renaming columns for better readability and understanding
colnames(data) <- c("ID",
"PROVINCE",
"REGION",
"URBAN_RURAL",
"AGE_GROUP",
"SEX",
"ED_LEVEL",
"STUDENT_STATUS",
"EMPLOYMENT_STATUS",
"HOUSE_TYPE",
"HOUSE_SIZE",
"HIGH_EDU",
"STUD_IN_HOUSE",
"INT_USAGE",
"INT_YEARS",
"LU_FROM_HOME",
"LU_FROM_WORK",
"LU_FROM_SCHOOL",
"LU_FROM_LIBRARY",
"LU_OTHERS",
"LU_OTHERS_RELATIVE",
"LU_OTHERS_FRIEND",
"LU_OTHERS_MISC")
In the provided Dataset, we are dealing with Categorical Data encoded in the form of numbers. There are no continuous variables in the provided Dataset. However, since the values are in numbers, R might consider this to be continuous data and assign numeric data types to the variables. We’ll fix this issue in the later part of the below code block by validating the structure of our Dataset and setting the columns as factors.
Now, In order to cleanse the data we plan to modify the values for
certain columns by clubbing certain column values under one value
i.e. for instance in the LU_FROM_SCHOOL column, the number
6 will be assigned to No Response which will include
everything such as Valid Skips, Refusal & Not
Stated instead of all these sub classifications having separate
numerical values.
However, the values 7 = Don’t Know is not disturbed in the process and remains as it is.
We intend to do this step as we feel that these separate values will not be able to provide significance to our analysis in anyway.
Below code snippet performs the action of data cleansing/data pre-processing:
#Data Pre-processing
#Modifying values from the data set
#SEX:
#Assigning 0 = Female 1 = Male
data["SEX"][data["SEX"]==2] <-0
#STUDENT_STATUS:
#Assigning 0 = No 1 = Yes
data["STUDENT_STATUS"][data["STUDENT_STATUS"]== 2] <- 0
#STUD_IN_HOUSE:
#Assigning 0 = No 1 = Yes
data["STUD_IN_HOUSE"][data["STUD_IN_HOUSE"]== 2] <- 0
#INT_USAGE:
#Assigning 0 = No 1 = Yes
data["INT_USAGE"][data["INT_USAGE"]== 2] <- 0
#INT_YEARS:
# 6 = No Response (Valid Skips & Refusals) 7 = Don't Know
data["INT_YEARS"][data["INT_YEARS"] == 8] <- 6
#LU_FROM_WORK:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_FROM_WORK"][data["LU_FROM_WORK"] == 9] <- 6
#LU_FROM_SCHOOL:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_FROM_SCHOOL"][data["LU_FROM_SCHOOL"] > 7] <- 6
#LU_FROM_LIBRARY:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_FROM_LIBRARY"][data["LU_FROM_LIBRARY"] == 9] <- 6
#LU_OTHERS:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_OTHERS"][data["LU_OTHERS"] == 9] <- 6
#LU_OTHERS_RELATIVE:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_OTHERS_RELATIVE"][data["LU_OTHERS_RELATIVE"] > 7] <- 6
#LU_OTHERS_FRIEND:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_OTHERS_FRIEND"][data["LU_OTHERS_FRIEND"] >7] <- 6
#LU_OTHERS_MISC:
# 6 = No Response (Valid Skips & Not Stated)
data["LU_OTHERS_MISC"][data["LU_OTHERS_MISC"] == 9] <- 6
str(data) #Checking the structure of our data set
#Handling categorical variables: Setting factors & levels to categories
data$ID <- factor(data$ID)
data$PROVINCE <- factor(data$PROVINCE, levels=c("10","11","12","13","24","35","46","47","48","59"))
data$REGION <- factor(data$REGION, levels = c("1","2","3","4","5","6"))
data$URBAN_RURAL <- factor(data$URBAN_RURAL,levels = c("1","2","3","4","5","6"))
data$AGE_GROUP <- factor(data$AGE_GROUP,levels = c("1","2","3","4","5","6"))
data$SEX <- factor(data$SEX,levels = c("0","1"))
data$ED_LEVEL <- factor(data$ED_LEVEL,levels = c("1","2","3"))
data$STUDENT_STATUS <- factor(data$STUDENT_STATUS,levels = c("0","1"))
data$EMPLOYMENT_STATUS <- factor(data$EMPLOYMENT_STATUS,levels = c("1","2","3"))
data$HOUSE_TYPE <- factor(data$HOUSE_TYPE,levels = c("1","2","3","4"))
data$HOUSE_SIZE <-factor(data$HOUSE_SIZE,levels=c("1","2","3","4"))
data$HIGH_EDU <- factor(data$HIGH_EDU,levels = c("1","2","3"))
data$STUD_IN_HOUSE <- factor(data$STUD_IN_HOUSE,levels = c("0","1"))
data$INT_USAGE <- factor(data$INT_USAGE,levels = c("0","1"))
data$INT_YEARS <- factor(data$INT_YEARS,levels =c("1","2","3","4","6","7"))
data$LU_FROM_HOME<- factor(data$LU_FROM_HOME,levels = c("1","2","6","7"))
data$LU_FROM_WORK <- factor(data$LU_FROM_WORK ,levels = c("1","2","6","7"))
data$LU_FROM_SCHOOL <- factor(data$LU_FROM_SCHOOL,levels = c("1","2","6","7"))
data$LU_FROM_LIBRARY <- factor(data$LU_FROM_LIBRARY ,levels = c("1","2","6","7"))
data$LU_OTHERS <- factor(data$LU_OTHERS ,levels = c("1","2","6","7"))
data$LU_OTHERS_RELATIVE <- factor(data$LU_OTHERS_RELATIVE ,levels = c("1","2","6","7"))
data$LU_OTHERS_FRIEND <- factor(data$LU_OTHERS_FRIEND ,levels = c("1","2","6","7"))
data$LU_OTHERS_MISC <- factor(data$LU_OTHERS_MISC ,levels = c("1","2","6"))
Now, we have completed modifying certain values in various columns and assigning factors to variables to make them categorical.
Let’s have a look at the summary and dimensions of our Dataset.
summary(data)
## ID PROVINCE REGION URBAN_RURAL AGE_GROUP SEX
## 1 : 1 35 :6518 1:3798 1: 975 1:1981 0:12817
## 2 : 1 24 :4437 2:4437 2: 1019 2:3269 1:10361
## 3 : 1 59 :2533 3:6518 3: 817 3:3749
## 4 : 1 48 :2242 4:3650 4:12143 4:4555
## 5 : 1 46 :2023 5:2242 5: 7632 5:4419
## 6 : 1 47 :1627 6:2533 6: 592 6:5205
## (Other):23172 (Other):3798
## ED_LEVEL STUDENT_STATUS EMPLOYMENT_STATUS HOUSE_TYPE HOUSE_SIZE HIGH_EDU
## 1:9082 0:21464 1:13559 1: 5639 1:6304 1: 5995
## 2:9753 1: 1714 2: 1045 2:10395 2:8831 2:11104
## 3:4343 3: 8574 3: 6304 3:3474 3: 6079
## 4: 840 4:4569
##
##
##
## STUD_IN_HOUSE INT_USAGE INT_YEARS LU_FROM_HOME LU_FROM_WORK LU_FROM_SCHOOL
## 0:19089 0: 5599 1: 607 1:16046 1: 6740 1: 2364
## 1: 4089 1:17579 2: 891 2: 950 2:10253 2:14601
## 3: 2586 6: 6181 6: 6183 6: 6211
## 4:13461 7: 1 7: 2 7: 2
## 6: 5600
## 7: 33
##
## LU_FROM_LIBRARY LU_OTHERS LU_OTHERS_RELATIVE LU_OTHERS_FRIEND LU_OTHERS_MISC
## 1: 2118 1: 5453 1: 1974 1: 2355 1: 2641
## 2:14869 2:11531 2: 3462 2: 3081 2: 2795
## 6: 6185 6: 6186 6:17738 6:17738 6:17742
## 7: 6 7: 8 7: 4 7: 4
##
##
##
dim(data)
## [1] 23178 23
The Summary Function provides us a clear idea of the
frequency/occurences of different values in our Dataset as we are
working with categorical data.
With the Dim Function we identified there are
23,178 Rows and 23 Columns in our data set and
no NA values are present in the provided data set.
While going through the summary of the Dataset, we realized that every variable in the dataset is catergorical and in general the concept of outliers are applicable to continuous variables. However, categorical variables with marginally lower frequency of occurence can also be treated as outliers.
In this Dataset we solved this problem in the above code snippet, where we combined certain values (Valid Skips, Refusals, etc.) of specific columns that felt like outliers to one broad category (No Response). However, the don’t know column was not hampered as it may be useful for further analysis.
Since, we have completed our data pre-processing by checking for missing values, modifying columns that are not required for our analysis and combining certain column values under on subsection, Now we are ready to move ahead with our Data Visualization stage.
In the following portion, we will be identifying the trends and weakness of the obtained survey (dataset) using exploratory data analysis. This technique involves examining data sets to highlight their key features, frequently utilizing statistical graphics and other data visual analytics. At the end of the analysis, we will be able to identify the key elements that reveal on the major challenges produced by the data available and will be in a position to provide forward-thinking suggestions and troubleshooting solutions. Some of the possible troubleshoots that could be a part of the analysis is followed.
By segmenting the circular statistical visual into sectors or portions, pie charts are the best tools for highlighting the nature of the data and exhibiting numerical issues. As a part of the analysis, the chart depicts the proportion of male and female survey participants.
Let’s have a look at a pie chart to better understand the data we are working with. We’ll start by looking at what percent of males and females took part in the survey.
#Visualizations
#1. Pie Chart: Male & Female respondents
count <- table(data$SEX)
pct <- round(count/sum(count)*100)
lbls <- c("Female","Male")
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie(count, labels = lbls,col=rainbow(length(lbls)), main="Percentage of Males & Females", )
Findings:
Inference:
As a result, the male population is identified in the pie chart as red, and the female population is recognized as blue which covers the major portion, making it easy to visibly distinguish between the two groups.
The primary goal of the below horizontal bar chart visualization is to analyze the distribution of Internet users with respect to each province in percentage. In the below coding, in order to compute, we have divided the province counts by their respective sums and multiplied by 100 to get the percentage value. R uses the function barplot( ) to create bar charts. R can draw both vertical and horizontal bars in the bar chart. In the below visualization, we have used a horizontal bar plot. We have then added the percentage to labels and have moved the margins and adjusted the titles to get a perfect horizontal visualization.
#2. Horizontal Bar Plot: Distribution of Participants based on Province
province_counts <- table(data$PROVINCE)
namesarg <- c("NL","PE","NS","NB","QC","ON","MB", "SK","AB", "BC")
pct <- round(province_counts/sum(province_counts)*100)
namesarg <- paste(namesarg, pct) # add percents to labels
namesarg <- paste(namesarg,"%",sep="") # ad % to labels
par(mar=c(5.1, 13 ,4.1 ,2.1)) # Moving margins
province_plot <- barplot(province_counts,names.arg=namesarg,col=rainbow(length(province_counts)),
las=1, horiz = T,xlim = c(0,7000), xlab="",ylab = "",
main="Distribution of Participants based on Province")
title(xlab="Number of Participants",ylab="Province",line=0, cex.lab=1.2) # Adjusting the Title
Findings:
Based on the aforementioned visualization, it is clear that:
Inference:
Most of the people who use the Internet for personal purposes are found to be in Ontario and Quebec, which may be related to their dense populations. On the other hand, regions like Prince Edward Island, Newfoundland, and Labrador are found to be the least internet users. Additionally, communication tools tend to grow and modernize along with locations. This could involve online shopping, information searching, social networking, entertainment, etc., which could eventually lead to a rise in regional internet usage for private, non-commercial purposes.
The below bar plot visualization differentiates different age categories of people who have utilized the most and least of the Internet. A bar chart basically represents data in rectangular bars with the length of the bar proportional to the value of the variable.
#3.Bar plot: Which Age group uses the internet most & least?
age <- table(data[data$INT_USAGE==1, "AGE_GROUP"])
age_group <- c("16-24","25-34","35-44","45-54","55-64",">64")
age_plot <- barplot(age,names.arg=age_group, las=1,col=rainbow(length(age)),
main="Distribution of Usage by Age Group",
xlab = "Age Group",
ylab="Number of Participants")
text(age_plot,age,labels = age, pos = 1)
Findings:
It is clear that the age group 45-54 is the group that uses most of the Internet with a count of 3821 users followed by the age range 35-44 with a count of 3500 Internet users.
We have discovered that the age range 35–44 has the second-highest number of Internet users, with 3500, followed by the 45–54 age group, which has 3821 users. Finally, it is evident that the number of Internet users in the 16 to 24 and over 64 age categories has the same number of Internet users having the count of 1952.
Inference: The respondents in the age group 45-54 are the ones who are using most of the Internet and it is noted that they are in the working category and they are the generation who have started using the Internet and they are likely to use more Internet than the other age categories.
The visualization’s objective is to examine how respondents’ Internet usage is distributed according to their sex. In the coding shown below, we have utilized gender as a common denominator to assess the distribution of each group using a bar plot. We can determine which gender category has utilized the Internet the most from the bar plot. The below barplot will help us answer the question of Which gender makes the most out of the internet?
#4. Barplot: Which gender makes the most out of the internet?
gender_count <- table(data[data$INT_USAGE==1,"SEX"])
gender <- c("Female", "Male")
gender_plot <- barplot(gender_count,names.arg=gender, las=1,col=rainbow(length(gender_count)),
main="Distribution of Usage by Gender", ylim = c(0,10500),
xlab = "Gender",
ylab="Number of Participants")
text(gender_plot,gender_count,labels = gender_count, pos = 3)
Findings: With 9629 female Internet users and 7950 male Internet users, it is clear that women have used the internet more than men.
Inference: Internet usage is higher in the female group than in the male category. The fact that women make up a significantly larger proportion of internet users than men could be the factor. In a bar chart, each bar can also be given a different color. To distinguish between two gender categories in the graphic above, we utilized the colors red and blue. The comparisons between different data categories are displayed using bars in a bar chart. Comparisons between the male and female genders are displayed in our visualization.
The below pie chart visualization depicts the distribution of students using the Internet with respect to their gender. We have used pie chart visualization since it displays relative proportions of multiple classes of data.
#5. Pie chart: Portfolio of students who use the internet based on gender
Student_gender_count <- table(data[data$STUDENT_STATUS==1 & data$INT_USAGE==1,"SEX"])
Student_gender_pct <- round(Student_gender_count/sum(Student_gender_count)*100)
Student_gender_lbls <- c("Female","Male")
Student_gender_lbls <- paste(Student_gender_lbls, Student_gender_pct) # add percents to labels
Student_gender_lbls <- paste(Student_gender_lbls,"%",sep="") # ad % to labels
pie(Student_gender_count, labels = Student_gender_lbls,
col=c("yellow", "black"), main="Distribution of Students Using the Internet Based on Gender")
Findings: It is noted that about 60% of female students utilize the Internet, with 40% of male students making up the remaining user base.
Inference: The findings indicate that a significant share of female students utilize the Internet. This might be a result of the fact that female students have a higher population density than male students.
Our goal in conducting this analysis is to examine the respondents who utilize the internet by age and gender. By doing so, we will be better able to determine which age group has been utilizing the internet or other web services the most. We will also be able to identify the gender that has utilized the Internet the most. In order to display each data category in a frequency distribution, we used a grouped bar chart for this visualization. We can represent relative percentages or numbers of various categories using a group bar plot, and it also aids in visually presenting a summary of our sizable dataset.
#6. Grouped bar plot for: Gender and age classified by internet usage
gender_age <- data[data$INT_USAGE==1,]
gender_age <- within(gender_age, {
gender_age.cat <- NA
gender_age.cat[AGE_GROUP=="1"] <- "16-24"
gender_age.cat[AGE_GROUP=="2"] <- "25-34"
gender_age.cat[AGE_GROUP=="3"] <- "35-44"
gender_age.cat[AGE_GROUP=="4"] <- "45-54"
gender_age.cat[AGE_GROUP=="5"] <- "55-64"
gender_age.cat[AGE_GROUP=="6"] <- ">64"
})
ggplot(gender_age, # Grouped bar plot using ggplot2
aes(x = gender_age.cat,
fill= SEX)) +
scale_fill_discrete(labels = c("Male", "Female"))+
labs(x = "Age Group", y = "Number of Participants",title = "Gender and age classified by internet usage", )+
theme_dark()+ theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
geom_bar(position = "dodge")
Findings:
Inference: It can be seen from the above visualization that, regardless of age group, women use the Internet more frequently than males. Additionally, it is shown that Internet usage is the same for people aged 16 to 24 and older than 64, with 1750 female users and around 1500 male users. On the other hand, it can be seen that the age groups of 25–34 and 55–64, with a count of 1450 and 1750 respectively, have the same Internet usage for both males and females.
Our fundamental goal for performing the below analysis is to compare the education level of all the provinces and provide insights. The reason behind choosing a stacked bar chart is that it makes the comparison of the data points even easier, and it makes the interpretation effective.
#7.Stacked Bar Plot: Education Level in Each Province
#Setting province Codes
x <- within(data, {
province.cat <- NA
province.cat[PROVINCE==10] <- "NL"
province.cat[PROVINCE==11] <- "PE"
province.cat[PROVINCE==12] <- "NS"
province.cat[PROVINCE==13] <- "NB"
province.cat[PROVINCE==24] <- "QC"
province.cat[PROVINCE==35] <- "ON"
province.cat[PROVINCE==46] <- "MB"
province.cat[PROVINCE==47] <- "SK"
province.cat[PROVINCE==48] <- "AB"
province.cat[PROVINCE==59] <- "BC"
})
ggplot(x, aes(fill=ED_LEVEL,y=ED_LEVEL,x=province.cat)) +
scale_fill_discrete(labels = c("High school or less", "College or some post-secondary",
"University certificate or degree"), name = "Education Level")+
labs(x = "Province", y = " Education Level",title = "Education Level in each Province")+
theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
geom_bar(position="stack", stat="identity")
Findings:
Inference: The stacked bar chart above makes it evident that in all provinces, a higher proportion of respondents have completed post-secondary education than the other categories, while the proportion of respondents with university degrees is moderate
Let’s generate a pie chart to show the Distribution of Internet Users and Non Internet Users
#8. Distribution of internet users and non users in the survey?
int_usage_count <- table(data["INT_USAGE"])
int_usage_pct <- round(int_usage_count/sum(int_usage_count)*100)
int_usage_lbls <- c("No","Yes")
int_usage_lbls <- paste(int_usage_lbls, int_usage_pct) # add percents to labels
int_usage_lbls <- paste(int_usage_lbls,"%",sep="") # ad % to labels
pie(int_usage_count, labels = int_usage_lbls,
col=c("Orange", "light blue"), main="Distribution of Internet Users and Non Internet Users")
Findings:
Inference: As a result, the pie chart shows the percentage of internet usage as blue, which makes it simple to see how the two groups differ from one another, and the percentage of non-internet usage as orange.
The bar plot below illustrates the distribution of users based on years of usage.
#9. Distribution of Users Based on Years of Usage
years_usage <- data["INT_YEARS"]
years_usage <- within(years_usage, {
years.cat <- NA
years.cat[INT_YEARS==1] <- "< 1 Year"
years.cat[INT_YEARS==2] <- "1-2 Years"
years.cat[INT_YEARS==3] <- "2-5 Years"
years.cat[INT_YEARS==4] <- "> 5 Years"
years.cat[INT_YEARS==6] <- "Skipped"
years.cat[INT_YEARS==7] <- "Don't Know"
})
tab1(years_usage$years.cat, sort.group = "decreasing", cum.percent = TRUE, main = "Distribution of Users Based on Years of Usage", xlab = "Years of Usage")
## years_usage$years.cat :
## Frequency Percent Cum. percent
## > 5 Years 13461 58.1 58.1
## Skipped 5600 24.2 82.2
## 2-5 Years 2586 11.2 93.4
## 1-2 Years 891 3.8 97.2
## < 1 Year 607 2.6 99.9
## Don't Know 33 0.1 100.0
## Total 23178 100.0 100.0
Findings:
Inference: When other categories are included, the analysis drastically deviates since the respondent either didn’t know their own practice or didn’t want to report it. However, a large portion of the respondents belonged to the group that has been a notable lead in internet usage for years.
The association between internet usage and occupant count is graphically evaluated in the mosaic plot below. The key justification for selecting a mosaic plot is that it enables us to identify the relationship between the two variables. Another key benefit is that it summarizes the data and makes it possible to find correlations between multiple factors. The blue tiles in the mosaic plot represent substantial positive residuals with a frequency more than predicted, whereas the red tiles represent significant negative residuals with a frequency less than expected.
#10.Mosaic Plot: Relationship between family size and internet usage
mosaic <- data[,c("INT_USAGE","HOUSE_SIZE","STUDENT_STATUS")]
mosaic <- within(mosaic, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "No"
int_usage.cat[INT_USAGE=="1"] <- "Yes"
})
mosaic <- within(mosaic, {
house_size.cat <- NA
house_size.cat[HOUSE_SIZE=="1"] <- "1 Person"
house_size.cat[HOUSE_SIZE=="2"] <- "2 People"
house_size.cat[HOUSE_SIZE=="3"] <- "3 People"
house_size.cat[HOUSE_SIZE=="4"] <- ">= 4 People"
})
mosaic <- within(mosaic, {
student_status.cat <- NA
student_status.cat[STUDENT_STATUS=="0"] <- "Not Student"
student_status.cat[STUDENT_STATUS=="1"] <- "Student"
})
mosaic_count <- table(mosaic[,c("int_usage.cat","house_size.cat","student_status.cat")])
mosaic(~int_usage.cat + house_size.cat,data = mosaic_count, shade = T,legend = TRUE,
labeling = labeling_border(abbreviate_labs = c(3, 1, 10)),
main = "Relationship Between Internet Usage
and Number of People in House")
Findings:
Inference: Since the proportions of the two variables (Internet usage and the number of persons living in the home) differ, it can be observed that they are reliant on one another. Additionally, it appears that there is a direct correlation between internet usage and the number of individuals living in the family. However, we need to perform an in-depth analysis to strongly conclude.
We can better comprehend the connection between family size and internet usage by looking at the status of the students in the mosaic plot below. Since the below 3 variables are categorical in nature, we have used a mosaic plot.
#11.Mosaic Plot: Understanding the relationship between family size, student status and internet usage
mosaic(~int_usage.cat + house_size.cat + student_status.cat,
data = mosaic_count,shade = TRUE,
labeling = labeling_border(rot_labels = c(0, 45, 90, 90), abbreviate_labs = c(10, 1, 5)),
legend = TRUE,
main = "Family Size and Internet Usage")
Findings:
Inference: Overall, the mosaic plot shown above is used to determine whether the three variables are associated with one another.
In this contingency bar plot, the categorical variables are added together side by side, creating a two-way table. Counting the total number of observations for each combination of levels in the categorical variable is another valuable application of this data.
#12.Contingency Bar Plot: Relationship between Employment Status, Sex and Internet Usage
contingency_plot <- data[,c("INT_USAGE","EMPLOYMENT_STATUS", "SEX")]
contingency_plot <- within(contingency_plot, {
emp_status.cat <- NA
emp_status.cat[EMPLOYMENT_STATUS=="1"] <- "Employed"
emp_status.cat[EMPLOYMENT_STATUS=="2"] <- "Unemployed"
emp_status.cat[EMPLOYMENT_STATUS=="3"] <- "Unable to Work"
})
contingency_plot <- within(contingency_plot, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "Never Used Internet"
int_usage.cat[INT_USAGE=="1"] <- "Used Internet"
})
contingency_plot <- within(contingency_plot, {
sex.cat <- NA
sex.cat[SEX=="0"] <- "Female"
sex.cat[SEX=="1"] <- "Male"
})
ggplot(contingency_plot, aes(x = emp_status.cat))+
labs(x = "Employment Status", y = " No.of Participant",title = "Usage of Internet based on Gender")+
geom_bar(aes(fill = sex.cat),color = "white", position = position_dodge(0.9))+
labs(fill="Gender of Respondent")+
facet_wrap(~contingency_plot$int_usage.cat,as.table = TRUE)+
theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
fill_palette("jco")
Findings:
Inference: This suggests that working professionals use the internet more frequently than the unemployed members of the community. If a new strategic plan is to be implemented, the focus should be limited to the categories stated above.
Multi-panel typically, balloon plots are intended to replace heat maps. When it comes to many groups of observations, where the size of the dots indicates the magnitude of the related component, it is said that the balloon plots act like magic. It is also used to show the contingency table created by the combination of two category variables.
#13. Multi Panel Balloon Plot: Are employed people more likely using the internet?
emp_count <- as.data.frame(table(contingency_plot))
ggballoonplot(emp_count, x = "emp_status.cat", y = "int_usage.cat", size = "Freq",
#ggballoonplot(emp_count, x = "EMPLOYMENT_STATUS", y = "AGE_GROUP", size = "Freq",
fill = "Freq", facet.by = "sex.cat",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: Based on the scale and color of the graph, it is simpler to forecast the results and analyze them. The understanding has become quite simple and obvious because to the factor of frequency. As a consequence of the preceding experiment, we have determined that regardless of employment position, internet usage is essentially identical to that of women. Additionally, both genders follow a nearly identical trend as those who do not utilize the internet.
The location of the internet is crucial to its main area of distribution. The participants and their primary internet usage locations are precisely visualized using the horizontal bar graph.
#14. Where do People mostly Use Internet From?
location_data <- data[,c("PROVINCE",
"LU_FROM_HOME",
"LU_FROM_WORK",
"LU_FROM_SCHOOL",
"LU_FROM_LIBRARY",
"LU_OTHERS_RELATIVE",
"LU_OTHERS_FRIEND",
"LU_OTHERS_MISC")]
location_data <- within(location_data, {
place <- NA
place[LU_FROM_HOME=="1"] <- "Home"
place[LU_FROM_WORK=="1"] <- "Work"
place[LU_FROM_SCHOOL=="1"] <- "School"
place[LU_FROM_LIBRARY=="1"] <- "Library"
place[LU_OTHERS_RELATIVE=="1"] <- "Relative's"
place[LU_OTHERS_FRIEND=="1"] <- "Friend's"
place[LU_OTHERS_MISC=="1"] <- "Misc"
})
location_data <- within(location_data, {
province.cat <- NA
province.cat[PROVINCE==10] <- "NL"
province.cat[PROVINCE==11] <- "PE"
province.cat[PROVINCE==12] <- "NS"
province.cat[PROVINCE==13] <- "NB"
province.cat[PROVINCE==24] <- "QC"
province.cat[PROVINCE==35] <- "ON"
province.cat[PROVINCE==46] <- "MB"
province.cat[PROVINCE==47] <- "SK"
province.cat[PROVINCE==48] <- "AB"
province.cat[PROVINCE==59] <- "BC"
})
location_data <- na.omit(location_data)
loc <- as.data.frame(table(location_data[,c("place", "province.cat")]))
tab1(location_data$place, sort.group = "decreasing", cum.percent = TRUE, main = "Location of Internet Use", xlab = "Number of Participants")
## location_data$place :
## Frequency Percent Cum. percent
## Home 6417 37.8 37.8
## Work 3267 19.2 57.0
## Misc 2641 15.5 72.6
## Friend's 1792 10.5 83.1
## Relative's 1003 5.9 89.0
## Library 954 5.6 94.6
## School 912 5.4 100.0
## Total 16986 100.0 100.0
Findings:
Inference: When compared to usage from other venues, a substantial fraction of both communities uses their internet at home. This makes sense because people use technologies more frequently the more time, they have available to them. A 13% usage share of the internet by libraries and schools indicates how intensively the educational system depends on advanced technologies.
The balloon plot, as has already been noted, is useful for determining how category variables relate to one another. The graph below has numerous dimensions that, in contrast to the previous one, make it easier to determine how locations and provinces are related to one another.
#15. Balloon Plot: Distribution of Internet Usage Based on Province & Location
ggballoonplot(loc, x = "place", y = "province.cat", size = "Freq",
fill = "Freq",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: Being one of Canada’s most populous provinces, Ontario has considerably accounted for the majority of internet usage worldwide for a variety of reasons. Quebec exhibits the similar pattern, with only minor variations, showing that internet usage is extremely common in the province. From this, it can be inferred that the distribution is fairly even across Canada’s developed provinces while other emerging provinces are still falling behind.
The balloon plot is drawn to show usage of internet by both students and non-students. We chose to examine its dynamics in depth since the type of students and their employment status is so important to how the internet is used today.
#16. Balloon Plot: Who are Accessing the Internet More?
int_access <- data[,c("INT_USAGE","EMPLOYMENT_STATUS", "STUDENT_STATUS")]
int_access <- within(int_access, {
emp_status.cat <- NA
emp_status.cat[EMPLOYMENT_STATUS=="1"] <- "Employed"
emp_status.cat[EMPLOYMENT_STATUS=="2"] <- "Unemployed"
emp_status.cat[EMPLOYMENT_STATUS=="3"] <- "Unable to Work"
})
int_access <- within(int_access, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "Never Used Internet"
int_usage.cat[INT_USAGE=="1"] <- "Used Internet"
})
int_access <- within(int_access, {
std_status.cat <- NA
std_status.cat[STUDENT_STATUS=="0"] <- "Not a Student"
std_status.cat[STUDENT_STATUS=="1"] <- "Student"
})
int_access_count <- as.data.frame(table(int_access[,c("emp_status.cat", "std_status.cat", "int_usage.cat")]))
ggballoonplot(int_access_count, x = "emp_status.cat", y = "int_usage.cat", size = "Freq",
fill = "Freq", facet.by = "std_status.cat",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: Internet usage among employees far outpaces that of students who are actually enrolled in classes. Despite the significant differences, the variables showed several characteristics that could highlight a range of population characteristics. Additionally, the unemployed population has very limited access to the internet, despite of their position as students.
From the above visualizations constructed during the exploratory data analysis, we were able to observe the dynamics of the dataset, categorize their values, and identify relationships in our dataset. Furthermore, we were also able to locate the outliers in the dataset, organize it, and comprehend the dynamics of each variable.
From all these visualizations and variables, we will be able to perform predictive analysis making use of the dataset to forecast future trends and occurrences. As part of the next steps in our analysis, we will determine if a rise in family size will result in a decline in Internet usage. We’ll also examine whether the number of inhabitants in a household has an impact on Internet usage.
Given that the variables are categorical in nature, predictive methods such as logistic regression will assist us to attain the aforementioned goals. For problems involving prediction and classification, logistic regression is frequently utilized. Additionally, it is simpler to implement, interpret, and train on, and it is quite effective.
Will an increase in family size cause a decrease in internet use by single member of the family?
Are people who live independently more likely to use the internet over people living with families?
These questions might help us with predictions in our next phase of analysis.
Christophe Dervieux, Emily Riederer, Y. X. (2022). R Markdown
Cookbook.
from https://bookdown.org/yihui/rmarkdown-cookbook/
Wisconsin School of Business (2021). Analytics Using R.
from https://pubs.wsb.wisc.edu/academics/analytics-using-r-2019/index.html
R Studio (n.d.). Markdown Basics.
from https://rmarkdown.rstudio.com/lesson-8.html
Joyce Robbins, Z. B. (n.d.). 15 Chart: Mosaic.
from https://edav.info/
Seaborn (n.d.). Visualizing Categorical Data.
from https://seaborn.pydata.org/tutorial/categorical.html