The internet has become a significant part of our lives and has made many everyday tasks more convenient. Given how frequently it is used, this analysis explores how usage varies across distinct groups of users. The location-specific dataset from the Canadian Internet Use Survey offers a comprehensive breakdown of the key variables needed to analyze internet usage, including region, province, age group, employment status, and other respondent details. This in-depth analysis combines exploratory and predictive techniques to interpret the data visually from various viewpoints and to derive actionable recommendations from it.
The dataset records a value for each variable across a set of dimensions, and these values show notable variance when their characteristics are taken into account. The analysis began by posing potential queries as part of our basic investigation in order to better understand the dataset. An important component of the interpretation involved examining factors such as the province with the highest internet usage and the employment status of Canadian household members. The distribution of users, internet usage frequency, educational background, and the main variables driving heavy internet use among students are also projected in order to gain meaningful insights.
Dimension Check: To understand the dynamics of the dataset, it is divided into descriptive chunks. Recognizing the dataset's components and attributes helps conceptualize how the data relate to one another. As part of data quality measures, accuracy, completeness, consistency, timeliness, validity, and integrity need to be examined. Following the dimensions, the dataset's summary is noted, displaying an R-structured tabular representation of the number of records and their associated components.
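As a minimal sketch of these checks in R (assuming the survey extract has been read into a data frame named data; the file name below is hypothetical), the basic calls are:
#Sketch: basic dimension and structure checks
data <- read.csv("cius_location.csv") #hypothetical file name for the survey extract
dim(data) #number of rows and columns
str(data) #structure and data type of each column
summary(data) #min, quartiles, median, mean and max for each column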
Inference: According to the statistical information pulled from the data summary, the important constituents, such as the min, first quartile, median, mean, third quartile, and max, are covered in detail. The built-in summary() function is used to generate this overview of the data. Additionally, there are 23178 rows and 23 columns in the dataset. The total count also lets us check whether the dataset contains any null values.
The acquired raw data is converted into real information, removing the possibility of periodic noise and data corruption. Data that is inconsistent, contains errors, or is incomplete is cleaned and processed so that it follows a regular pattern. This approach makes the data simpler to visualize and transforms it into a structured form.
Below are some of the approaches that we’ll be following:
In the code below, str() is used to verify that the data set has a well-defined structure and that appropriate data types are assigned to the variables, which makes the data easy to store and access.
Missing data can yield skewed estimates, which can impair a study's statistical significance and lead to inaccurate conclusions. In such scenarios, data imputation or other methodologies can be used to handle null values in a dataset. The location dataset used here simplified matters by containing no null values, which minimized complexity and saved time and code (a quick verification sketch follows this list of approaches).
Although the dataset's variables are categorical by nature, they are stored as continuous (numerical) values. Hence, these variables need to be converted to categorical form in R so that the groupings can be statistically analyzed for deeper insights. In general, establishing levels and factors makes it easier to explore the data.
Despite the dataset's extensive response collection, some responses are very sparsely populated and run the risk of becoming outliers when the data is visualized. Such variables are handled by consolidating their multiple columnar values into a single value.
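Returning to the missing-data point above, a minimal verification sketch (assuming the data frame is named data) uses base R to count NA values before considering any imputation:
#Sketch: verifying that the data set contains no missing values
sum(is.na(data)) #total NA count across the whole data set; 0 confirms completeness
colSums(is.na(data)) #NA count per column, to locate any problematic variables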
Below are the libraries that we’ll be using for our analyses.
#Libraries Initialization
library(ggplot2)
library(dplyr)
library(DT)
library(magrittr)
library(knitr)
library(tidyverse)
library(epiDisplay)
library(ggpubr)
library(ggmosaic)
library(vcd)
library(grid)
theme_set(theme_pubr())
library(car)
library(nnet)
First, let's have a look at the summary of our data in R to better understand the dynamics of the data set at hand. After that, we'll rename the columns of the dataset for better readability.
head(data)
## PUMFID PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST
## 1 1 35 3 5 3 2 3 2 1
## 2 2 46 4 5 1 2 1 1 2
## 3 3 10 1 5 2 1 2 2 1
## 4 4 35 3 4 5 2 2 2 3
## 5 5 13 1 4 3 1 2 2 1
## 6 6 46 4 4 2 2 1 2 3
## GFAMTYPE G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 LU_Q01 LU_Q02 LU_G03 LU_Q04
## 1 3 1 3 2 1 2 1 2 2 2
## 2 2 3 2 1 1 4 1 1 1 1
## 3 2 2 3 2 1 4 1 1 1 1
## 4 3 1 2 2 1 3 6 6 6 6
## 5 2 3 2 2 1 2 1 2 2 2
## 6 1 3 1 2 1 4 6 6 6 6
## LU_Q05 LU_Q06A LU_Q06B LU_G06
## 1 2 6 6 6
## 2 1 2 1 2
## 3 1 2 2 1
## 4 6 6 6 6
## 5 2 6 6 6
## 6 6 6 6 6
#Renaming columns for better readability and understanding
colnames(data) <- c("ID",
"PROVINCE",
"REGION",
"URBAN_RURAL",
"AGE_GROUP",
"SEX",
"ED_LEVEL",
"STUDENT_STATUS",
"EMPLOYMENT_STATUS",
"HOUSE_TYPE",
"HOUSE_SIZE",
"HIGH_EDU",
"STUD_IN_HOUSE",
"INT_USAGE",
"INT_YEARS",
"LU_FROM_HOME",
"LU_FROM_WORK",
"LU_FROM_SCHOOL",
"LU_FROM_LIBRARY",
"LU_OTHERS",
"LU_OTHERS_RELATIVE",
"LU_OTHERS_FRIEND",
"LU_OTHERS_MISC")
In the provided dataset, we are dealing with categorical data encoded as numbers; there are no continuous variables. However, since the values are numeric, R may treat them as continuous data and assign numeric data types to the variables. We'll fix this in the latter part of the code block below by validating the structure of our dataset and setting the columns as factors.
Now, in order to cleanse the data, we plan to modify the values of certain columns by clubbing several column values under one value. For instance, in the LU_FROM_SCHOOL column, the number 6 will be assigned to No Response, which will cover Valid Skips, Refusal & Not Stated instead of each of these sub-classifications having a separate numerical value. However, the value 7 = Don't Know is not disturbed in the process and remains as it is. We take this step because we feel these separate values would not add significance to our analysis in any way.
The code snippet below performs the data cleansing/pre-processing:
#Data Pre-processing
#Modifying values from the data set
#SEX:
#Assigning 0 = Female 1 = Male
data["SEX"][data["SEX"]==2] <-0
#STUDENT_STATUS:
#Assigning 0 = No 1 = Yes
data["STUDENT_STATUS"][data["STUDENT_STATUS"]== 2] <- 0
#STUD_IN_HOUSE:
#Assigning 0 = No 1 = Yes
data["STUD_IN_HOUSE"][data["STUD_IN_HOUSE"]== 2] <- 0
#INT_USAGE:
#Assigning 0 = No 1 = Yes
data["INT_USAGE"][data["INT_USAGE"]== 2] <- 0
#INT_YEARS:
# 6 = No Response (Valid Skips & Refusals) 7 = Don't Know
data["INT_YEARS"][data["INT_YEARS"] == 8] <- 6
#LU_FROM_WORK:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_FROM_WORK"][data["LU_FROM_WORK"] == 9] <- 6
#LU_FROM_SCHOOL:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_FROM_SCHOOL"][data["LU_FROM_SCHOOL"] > 7] <- 6
#LU_FROM_LIBRARY:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_FROM_LIBRARY"][data["LU_FROM_LIBRARY"] == 9] <- 6
#LU_OTHERS:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_OTHERS"][data["LU_OTHERS"] == 9] <- 6
#LU_OTHERS_RELATIVE:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_OTHERS_RELATIVE"][data["LU_OTHERS_RELATIVE"] > 7] <- 6
#LU_OTHERS_FRIEND:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_OTHERS_FRIEND"][data["LU_OTHERS_FRIEND"] >7] <- 6
#LU_OTHERS_MISC:
# 6 = No Response (Valid Skips & Not Stated)
data["LU_OTHERS_MISC"][data["LU_OTHERS_MISC"] == 9] <- 6
str(data) #Checking the structure of our data set
#Handling categorical variables: Setting factors & levels to categories
data$ID <- factor(data$ID)
data$PROVINCE <- factor(data$PROVINCE, levels=c("10","11","12","13","24","35","46","47","48","59"))
data$REGION <- factor(data$REGION, levels = c("1","2","3","4","5","6"))
data$URBAN_RURAL <- factor(data$URBAN_RURAL,levels = c("1","2","3","4","5","6"))
data$AGE_GROUP <- factor(data$AGE_GROUP,levels = c("1","2","3","4","5","6"))
data$SEX <- factor(data$SEX,levels = c("0","1"))
data$ED_LEVEL <- factor(data$ED_LEVEL,levels = c("1","2","3"))
data$STUDENT_STATUS <- factor(data$STUDENT_STATUS,levels = c("0","1"))
data$EMPLOYMENT_STATUS <- factor(data$EMPLOYMENT_STATUS,levels = c("1","2","3"))
data$HOUSE_TYPE <- factor(data$HOUSE_TYPE,levels = c("1","2","3","4"))
data$HOUSE_SIZE <-factor(data$HOUSE_SIZE,levels=c("1","2","3","4"))
data$HIGH_EDU <- factor(data$HIGH_EDU,levels = c("1","2","3"))
data$STUD_IN_HOUSE <- factor(data$STUD_IN_HOUSE,levels = c("0","1"))
data$INT_USAGE <- factor(data$INT_USAGE,levels = c("0","1"))
data$INT_YEARS <- factor(data$INT_YEARS,levels =c("1","2","3","4","6","7"))
data$LU_FROM_HOME<- factor(data$LU_FROM_HOME,levels = c("1","2","6","7"))
data$LU_FROM_WORK <- factor(data$LU_FROM_WORK ,levels = c("1","2","6","7"))
data$LU_FROM_SCHOOL <- factor(data$LU_FROM_SCHOOL,levels = c("1","2","6","7"))
data$LU_FROM_LIBRARY <- factor(data$LU_FROM_LIBRARY ,levels = c("1","2","6","7"))
data$LU_OTHERS <- factor(data$LU_OTHERS ,levels = c("1","2","6","7"))
data$LU_OTHERS_RELATIVE <- factor(data$LU_OTHERS_RELATIVE ,levels = c("1","2","6","7"))
data$LU_OTHERS_FRIEND <- factor(data$LU_OTHERS_FRIEND ,levels = c("1","2","6","7"))
data$LU_OTHERS_MISC <- factor(data$LU_OTHERS_MISC ,levels = c("1","2","6"))
Now, we have completed modifying certain values in various columns and assigning factors to variables to make them categorical.
Let’s have a look at the summary and dimensions of our Dataset.
summary(data)
## ID PROVINCE REGION URBAN_RURAL AGE_GROUP SEX
## 1 : 1 35 :6518 1:3798 1: 975 1:1981 0:12817
## 2 : 1 24 :4437 2:4437 2: 1019 2:3269 1:10361
## 3 : 1 59 :2533 3:6518 3: 817 3:3749
## 4 : 1 48 :2242 4:3650 4:12143 4:4555
## 5 : 1 46 :2023 5:2242 5: 7632 5:4419
## 6 : 1 47 :1627 6:2533 6: 592 6:5205
## (Other):23172 (Other):3798
## ED_LEVEL STUDENT_STATUS EMPLOYMENT_STATUS HOUSE_TYPE HOUSE_SIZE HIGH_EDU
## 1:9082 0:21464 1:13559 1: 5639 1:6304 1: 5995
## 2:9753 1: 1714 2: 1045 2:10395 2:8831 2:11104
## 3:4343 3: 8574 3: 6304 3:3474 3: 6079
## 4: 840 4:4569
##
##
##
## STUD_IN_HOUSE INT_USAGE INT_YEARS LU_FROM_HOME LU_FROM_WORK LU_FROM_SCHOOL
## 0:19089 0: 5599 1: 607 1:16046 1: 6740 1: 2364
## 1: 4089 1:17579 2: 891 2: 950 2:10253 2:14601
## 3: 2586 6: 6181 6: 6183 6: 6211
## 4:13461 7: 1 7: 2 7: 2
## 6: 5600
## 7: 33
##
## LU_FROM_LIBRARY LU_OTHERS LU_OTHERS_RELATIVE LU_OTHERS_FRIEND LU_OTHERS_MISC
## 1: 2118 1: 5453 1: 1974 1: 2355 1: 2641
## 2:14869 2:11531 2: 3462 2: 3081 2: 2795
## 6: 6185 6: 6186 6:17738 6:17738 6:17742
## 7: 6 7: 8 7: 4 7: 4
##
##
##
dim(data)
## [1] 23178 23
The summary function gives us a clear idea of the frequency/occurrences of the different values in our dataset, since we are working with categorical data. With the dim function we confirmed that there are 23,178 rows and 23 columns in our data set, and no NA values are present.
While going through the summary of the dataset, we realized that every variable is categorical, and in general the concept of outliers applies to continuous variables. However, categorical levels with a marginally lower frequency of occurrence can also be treated as outliers. In this dataset we addressed this in the code snippet above, where we combined certain values (Valid Skips, Refusals, etc.) of specific columns that behaved like outliers into one broad category (No Response). However, the Don't Know value was not touched, as it may be useful for further analysis. Since we have completed our data pre-processing by checking for missing values, modifying columns that are not required for our analysis, and combining certain column values under one subsection, we are now ready to move ahead with the data visualization stage.
In the following portion, we will identify the trends and weaknesses of the survey data using exploratory data analysis. This technique involves examining data sets to highlight their key features, frequently using statistical graphics and other visual analytics. At the end of the analysis, we will be able to identify the key elements that reveal the major challenges in the available data and will be in a position to offer forward-thinking suggestions and troubleshooting solutions.
By segmenting a circular statistical visual into sectors or portions, pie charts are good tools for highlighting the composition of the data and exhibiting numerical proportions. As part of the analysis, the chart below depicts the proportion of male and female survey participants.
Let’s have a look at a pie chart to better understand the data we are working with. We’ll start by looking at what percent of males and females took part in the survey.
#Visualizations
#1. Pie Chart: Male & Female respondents
count <- table(data$SEX)
pct <- round(count/sum(count)*100)
lbls <- c("Female","Male")
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(count, labels = lbls, col=rainbow(length(lbls)), main="Percentage of Males & Females")
Findings: Roughly 55% of the respondents are female (12,817) and 45% are male (10,361).
Inference:
In the resulting pie chart, the female population, covering the major portion, is shown in red, and the male population in cyan, making it easy to visibly distinguish between the two groups.
The primary goal of the horizontal bar chart below is to analyze the percentage distribution of participants across the provinces. In the code below, we divide the province counts by their sum and multiply by 100 to obtain percentages. R uses the barplot() function to create bar charts and can draw both vertical and horizontal bars; here we use a horizontal bar plot. We then add the percentages to the labels, widen the margins, and adjust the titles to get a clean horizontal visualization.
#2. Horizontal Bar Plot: Distribution of Participants based on Province
province_counts <- table(data$PROVINCE)
namesarg <- c("NL","PE","NS","NB","QC","ON","MB", "SK","AB", "BC")
pct <- round(province_counts/sum(province_counts)*100)
namesarg <- paste(namesarg, pct) # add percents to labels
namesarg <- paste(namesarg,"%",sep="") # add % to labels
par(mar=c(5.1, 13 ,4.1 ,2.1)) # Moving margins
province_plot <- barplot(province_counts,names.arg=namesarg,col=rainbow(length(province_counts)),
las=1, horiz = T,xlim = c(0,7000), xlab="",ylab = "",
main="Distribution of Participants based on Province")
title(xlab="Number of Participants",ylab="Province",line=0, cex.lab=1.2) # Adjusting the Title
Findings:
Based on the above visualization, it is clear that Ontario contributes the largest share of participants (6,518, about 28%), followed by Quebec (4,437, about 19%), while the small Atlantic provinces contribute the fewest.
Inference:
Most people who use the internet for personal purposes are found in Ontario and Quebec, which may be related to their dense populations. On the other hand, regions like Prince Edward Island and Newfoundland and Labrador have the fewest internet users. Additionally, communication tools tend to grow and modernize along with locations; online shopping, information searching, social networking, entertainment, and the like could eventually drive a rise in regional internet usage for private, non-commercial purposes.
The bar plot below differentiates the age categories that have used the internet the most and the least. A bar chart represents data as rectangular bars, with the length of each bar proportional to the value of the variable.
#3.Bar plot: Which Age group uses the internet most & least?
age <- table(data[data$INT_USAGE==1, "AGE_GROUP"])
age_group <- c("16-24","25-34","35-44","45-54","55-64",">64")
age_plot <- barplot(age,names.arg=age_group, las=1,col=rainbow(length(age)),
main="Distribution of Usage by Age Group",
xlab = "Age Group",
ylab="Number of Participants")
text(age_plot,age,labels = age, pos = 1)
Findings:
The 45-54 age group uses the internet the most, with 3,821 users, followed by the 35-44 group with 3,500 users. The 16-24 and over-64 categories have the same number of internet users, 1,952 each.
Inference: The respondents in the 45-54 age group use the internet the most. They are largely of working age and were among the first generations to adopt the internet, so they are likely to use it more than the other age categories.
The objective of this visualization is to examine how respondents' internet usage is distributed by sex. In the code below, we use gender as a common denominator to assess the distribution of each group with a bar plot, which lets us determine which gender category has used the internet the most. The bar plot below helps us answer the question: which gender makes the most out of the internet?
#4. Barplot: Which gender makes the most out of the internet?
gender_count <- table(data[data$INT_USAGE==1,"SEX"])
gender <- c("Female", "Male")
gender_plot <- barplot(gender_count,names.arg=gender, las=1,col=rainbow(length(gender_count)),
main="Distribution of Usage by Gender", ylim = c(0,10500),
xlab = "Gender",
ylab="Number of Participants")
text(gender_plot,gender_count,labels = gender_count, pos = 3)
Findings: With 9629 female Internet users and 7950 male Internet users, it is clear that women have used the internet more than men.
Inference: Internet usage is higher in the female group than in the male group. One contributing factor may simply be that women make up a larger proportion of the survey respondents. To distinguish the two gender categories in the chart above, each bar is given a different color, and the bars display the comparison between the female and male groups.
The below pie chart visualization depicts the distribution of students using the Internet with respect to their gender. We have used pie chart visualization since it displays relative proportions of multiple classes of data.
#5. Pie chart: Portfolio of students who use the internet based on gender
Student_gender_count <- table(data[data$STUDENT_STATUS==1 & data$INT_USAGE==1,"SEX"])
Student_gender_pct <- round(Student_gender_count/sum(Student_gender_count)*100)
Student_gender_lbls <- c("Female","Male")
Student_gender_lbls <- paste(Student_gender_lbls, Student_gender_pct) # add percents to labels
Student_gender_lbls <- paste(Student_gender_lbls,"%",sep="") # ad % to labels
pie(Student_gender_count, labels = Student_gender_lbls,
col=c("yellow", "black"), main="Distribution of Students Using the Internet Based on Gender")
Findings: About 60% of the internet-using students are female, with male students making up the remaining 40%.
Inference: The findings indicate that a significant share of the student internet users are female. This may simply reflect the larger number of female students among the respondents.
Our goal in conducting this analysis is to examine the respondents who utilize the internet by age and gender. By doing so, we will be better able to determine which age group has been utilizing the internet or other web services the most. We will also be able to identify the gender that has utilized the Internet the most. In order to display each data category in a frequency distribution, we used a grouped bar chart for this visualization. We can represent relative percentages or numbers of various categories using a group bar plot, and it also aids in visually presenting a summary of our sizable dataset.
#6. Grouped bar plot for: Gender and age classified by internet usage
gender_age <- data[data$INT_USAGE==1,]
gender_age <- within(gender_age, {
gender_age.cat <- NA
gender_age.cat[AGE_GROUP=="1"] <- "16-24"
gender_age.cat[AGE_GROUP=="2"] <- "25-34"
gender_age.cat[AGE_GROUP=="3"] <- "35-44"
gender_age.cat[AGE_GROUP=="4"] <- "45-54"
gender_age.cat[AGE_GROUP=="5"] <- "55-64"
gender_age.cat[AGE_GROUP=="6"] <- ">64"
})
ggplot(gender_age, # Grouped bar plot using ggplot2
aes(x = gender_age.cat,
fill= SEX)) +
scale_fill_discrete(labels = c("Female", "Male"))+
labs(x = "Age Group", y = "Number of Participants",title = "Gender and age classified by internet usage")+
theme_dark()+ theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
geom_bar(position = "dodge")
Findings:
Inference: It can be seen from the above visualization that, regardless of age group, women use the internet more frequently than men. Internet usage is similar for the 16-24 and over-64 groups, each with roughly 1,750 female users and around 1,500 male users. Likewise, the 25-34 and 55-64 age groups show similar usage for both sexes, with counts of about 1,450 and 1,750 respectively.
Our fundamental goal for performing the below analysis is to compare the education level of all the provinces and provide insights. The reason behind choosing a stacked bar chart is that it makes the comparison of the data points even easier, and it makes the interpretation effective.
#7.Stacked Bar Plot: Education Level in Each Province
#Setting province Codes
x <- within(data, {
province.cat <- NA
province.cat[PROVINCE==10] <- "NL"
province.cat[PROVINCE==11] <- "PE"
province.cat[PROVINCE==12] <- "NS"
province.cat[PROVINCE==13] <- "NB"
province.cat[PROVINCE==24] <- "QC"
province.cat[PROVINCE==35] <- "ON"
province.cat[PROVINCE==46] <- "MB"
province.cat[PROVINCE==47] <- "SK"
province.cat[PROVINCE==48] <- "AB"
province.cat[PROVINCE==59] <- "BC"
})
ggplot(x, aes(fill=ED_LEVEL,y=ED_LEVEL,x=province.cat)) +
scale_fill_discrete(labels = c("High school or less", "College or some post-secondary",
"University certificate or degree"), name = "Education Level")+
labs(x = "Province", y = " Education Level",title = "Education Level in each Province")+
theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
geom_bar(position="stack", stat="identity")
Findings:
Inference: The stacked bar chart above makes it evident that, in all provinces, a higher proportion of respondents have college or some post-secondary education than the other categories, while the proportion of respondents with university degrees is moderate.
Let's generate a pie chart to show the distribution of internet users and non-users.
#8. Distribution of internet users and non users in the survey?
int_usage_count <- table(data["INT_USAGE"])
int_usage_pct <- round(int_usage_count/sum(int_usage_count)*100)
int_usage_lbls <- c("No","Yes")
int_usage_lbls <- paste(int_usage_lbls, int_usage_pct) # add percents to labels
int_usage_lbls <- paste(int_usage_lbls,"%",sep="") # add % to labels
pie(int_usage_count, labels = int_usage_lbls,
col=c("Orange", "light blue"), main="Distribution of Internet Users and Non Internet Users")
Findings: About 76% of the respondents have used the internet, while about 24% have not (17,579 users vs. 5,599 non-users).
Inference: The pie chart shows the internet users in light blue and the non-users in orange, making it simple to see how the two groups differ from one another.
The bar plot below illustrates the distribution of users based on years of usage.
#9. Distribution of Users Based on Years of Usage
years_usage <- data["INT_YEARS"]
years_usage <- within(years_usage, {
years.cat <- NA
years.cat[INT_YEARS==1] <- "< 1 Year"
years.cat[INT_YEARS==2] <- "1-2 Years"
years.cat[INT_YEARS==3] <- "2-5 Years"
years.cat[INT_YEARS==4] <- "> 5 Years"
years.cat[INT_YEARS==6] <- "Skipped"
years.cat[INT_YEARS==7] <- "Don't Know"
})
tab1(years_usage$years.cat, sort.group = "decreasing", cum.percent = TRUE, main = "Distribution of Users Based on Years of Usage", xlab = "Years of Usage")
## years_usage$years.cat :
## Frequency Percent Cum. percent
## > 5 Years 13461 58.1 58.1
## Skipped 5600 24.2 82.2
## 2-5 Years 2586 11.2 93.4
## 1-2 Years 891 3.8 97.2
## < 1 Year 607 2.6 99.9
## Don't Know 33 0.1 100.0
## Total 23178 100.0 100.0
Findings: A majority of respondents (58.1%) have used the internet for more than 5 years, while only 2.6% have used it for less than a year; 24.2% skipped the question.
Inference: The analysis deviates when the non-response categories are included, since those respondents either didn't know their own practice or didn't want to report it. Still, a large portion of the respondents belong to the group that has been a notable lead in internet usage for years.
The association between internet usage and household occupant count is evaluated graphically in the mosaic plot below. The key justification for selecting a mosaic plot is that it lets us identify the relationship between two categorical variables; it also summarizes the data and makes it possible to find correlations between multiple factors. The blue tiles in the mosaic plot represent substantial positive residuals, where the observed frequency is higher than expected, whereas the red tiles represent significant negative residuals, where the observed frequency is lower than expected.
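The shading in such plots is driven by Pearson residuals, (observed - expected) / sqrt(expected), from a chi-squared test of independence. As an illustrative sketch (not part of the original analysis), these residuals can be inspected directly for our two variables:
#Sketch: Pearson residuals behind the mosaic shading
#residual = (observed - expected) / sqrt(expected) for each cell
usage_by_size <- table(data$INT_USAGE, data$HOUSE_SIZE)
chisq.test(usage_by_size)$residuals #positive values shade blue, negative values shade red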
#10.Mosaic Plot: Relationship between family size and internet usage
mosaic <- data[,c("INT_USAGE","HOUSE_SIZE","STUDENT_STATUS")]
mosaic <- within(mosaic, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "No"
int_usage.cat[INT_USAGE=="1"] <- "Yes"
})
mosaic <- within(mosaic, {
house_size.cat <- NA
house_size.cat[HOUSE_SIZE=="1"] <- "1 Person"
house_size.cat[HOUSE_SIZE=="2"] <- "2 People"
house_size.cat[HOUSE_SIZE=="3"] <- "3 People"
house_size.cat[HOUSE_SIZE=="4"] <- ">= 4 People"
})
mosaic <- within(mosaic, {
student_status.cat <- NA
student_status.cat[STUDENT_STATUS=="0"] <- "Not Student"
student_status.cat[STUDENT_STATUS=="1"] <- "Student"
})
mosaic_count <- table(mosaic[,c("int_usage.cat","house_size.cat","student_status.cat")])
mosaic(~int_usage.cat + house_size.cat,data = mosaic_count, shade = T,legend = TRUE,
labeling = labeling_border(abbreviate_labs = c(3, 1, 10)),
main = "Relationship Between Internet Usage
and Number of People in House")
Findings:
Inference: Since the proportions differ across levels of the two variables (internet usage and the number of persons in the home), they appear to be dependent on one another, and there seems to be a direct relationship between internet usage and household size. However, we would need a more in-depth analysis to draw a firm conclusion.
We can better comprehend the connection between family size and internet usage by also looking at student status in the mosaic plot below. Since all 3 variables are categorical in nature, we again use a mosaic plot.
#11.Mosaic Plot: Understanding the relationship between family size, student status and internet usage
mosaic(~int_usage.cat + house_size.cat + student_status.cat,
data = mosaic_count,shade = TRUE,
labeling = labeling_border(rot_labels = c(0, 45, 90, 90), abbreviate_labs = c(10, 1, 5)),
legend = TRUE,
main = "Family Size and Internet Usage")
Findings:
Inference: Overall, the mosaic plot above is used to judge whether the three variables are associated: the shading marks combinations that occur more or less often than expected under independence, suggesting an association between internet usage, household size, and student status.
In this contingency bar plot, the categorical variables are displayed side by side, effectively visualizing a two-way table. It is also a useful way to count the number of observations for each combination of levels of the categorical variables.
#12.Contingency Bar Plot: Relationship between Employment Status, Sex and Internet Usage
contingency_plot <- data[,c("INT_USAGE","EMPLOYMENT_STATUS", "SEX")]
contingency_plot <- within(contingency_plot, {
emp_status.cat <- NA
emp_status.cat[EMPLOYMENT_STATUS=="1"] <- "Employed"
emp_status.cat[EMPLOYMENT_STATUS=="2"] <- "Unemployed"
emp_status.cat[EMPLOYMENT_STATUS=="3"] <- "Unable to Work"
})
contingency_plot <- within(contingency_plot, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "Never Used Internet"
int_usage.cat[INT_USAGE=="1"] <- "Used Internet"
})
contingency_plot <- within(contingency_plot, {
sex.cat <- NA
sex.cat[SEX=="0"] <- "Female"
sex.cat[SEX=="1"] <- "Male"
})
ggplot(contingency_plot, aes(x = emp_status.cat))+
labs(x = "Employment Status", y = " No.of Participant",title = "Usage of Internet based on Gender")+
geom_bar(aes(fill = sex.cat),color = "white", position = position_dodge(0.9))+
labs(fill="Gender of Respondent")+
facet_wrap(~int_usage.cat,as.table = TRUE)+
theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
fill_palette("jco")
Findings:
Inference: This suggests that working professionals use the internet more frequently than the unemployed members of the community. If a new strategic plan is to be implemented, the focus should therefore be on the categories highlighted above.
Balloon plots are typically intended as a multi-panel replacement for heat maps. They work especially well for many groups of observations, where the size of each dot indicates the magnitude of the corresponding component. They can also display the contingency table created by the combination of two categorical variables.
#13. Multi Panel Balloon Plot: Are employed people more likely using the internet?
emp_count <- as.data.frame(table(contingency_plot))
ggballoonplot(emp_count, x = "emp_status.cat", y = "int_usage.cat", size = "Freq",
fill = "Freq", facet.by = "sex.cat",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: The scale and color of the graph make it simple to read off and analyze the results, and the frequency encoding makes the interpretation straightforward. From this plot we see that, within each employment category, internet usage among men is essentially identical to that of women, and both genders also follow nearly identical patterns among those who do not use the internet.
The locations where people access the internet are crucial to understanding its distribution. The participants and their primary internet usage locations are visualized precisely below.
#14. Where do People mostly Use Internet From?
location_data <- data[,c("PROVINCE",
"LU_FROM_HOME",
"LU_FROM_WORK",
"LU_FROM_SCHOOL",
"LU_FROM_LIBRARY",
"LU_OTHERS_RELATIVE",
"LU_OTHERS_FRIEND",
"LU_OTHERS_MISC")]
location_data <- within(location_data, {
place <- NA
place[LU_FROM_HOME=="1"] <- "Home"
place[LU_FROM_WORK=="1"] <- "Work"
place[LU_FROM_SCHOOL=="1"] <- "School"
place[LU_FROM_LIBRARY=="1"] <- "Library"
place[LU_OTHERS_RELATIVE=="1"] <- "Relative's"
place[LU_OTHERS_FRIEND=="1"] <- "Friend's"
place[LU_OTHERS_MISC=="1"] <- "Misc"
})
location_data <- within(location_data, {
province.cat <- NA
province.cat[PROVINCE==10] <- "NL"
province.cat[PROVINCE==11] <- "PE"
province.cat[PROVINCE==12] <- "NS"
province.cat[PROVINCE==13] <- "NB"
province.cat[PROVINCE==24] <- "QC"
province.cat[PROVINCE==35] <- "ON"
province.cat[PROVINCE==46] <- "MB"
province.cat[PROVINCE==47] <- "SK"
province.cat[PROVINCE==48] <- "AB"
province.cat[PROVINCE==59] <- "BC"
})
location_data <- na.omit(location_data)
loc <- as.data.frame(table(location_data[,c("place", "province.cat")]))
tab1(location_data$place, sort.group = "decreasing", cum.percent = TRUE, main = "Location of Internet Use", xlab = "Number of Participants")
## location_data$place :
## Frequency Percent Cum. percent
## Home 6417 37.8 37.8
## Work 3267 19.2 57.0
## Misc 2641 15.5 72.6
## Friend's 1792 10.5 83.1
## Relative's 1003 5.9 89.0
## Library 954 5.6 94.6
## School 912 5.4 100.0
## Total 16986 100.0 100.0
Findings: Home is the most common location of internet use (37.8%), followed by work (19.2%); libraries (5.6%) and schools (5.4%) together account for about 11%.
Inference: Compared with other venues, a substantial fraction of respondents use the internet at home. This makes sense, since people use technologies more frequently where they have more time available to them. The roughly 11% combined share for libraries and schools indicates how intensively the educational system depends on these technologies.
As already noted, the balloon plot is useful for determining how categorical variables relate to one another. The graph below adds a dimension compared with the previous one, making it easier to see how locations and provinces are related.
#15. Balloon Plot: Distribution of Internet Usage Based on Province & Location
ggballoonplot(loc, x = "place", y = "province.cat", size = "Freq",
fill = "Freq",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: Being one of Canada's most populous provinces, Ontario accounts for the majority of internet usage in the survey for a variety of reasons. Quebec exhibits a similar pattern with only minor variations, showing that internet usage is also very common there. From this, it can be inferred that the distribution is fairly even across Canada's developed provinces, while other emerging provinces still lag behind.
The balloon plot below shows internet usage by both students and non-students. We chose to examine these dynamics in depth since student status and employment status are so important to how the internet is used today.
#16. Balloon Plot: Who are Accessing the Internet More?
int_access <- data[,c("INT_USAGE","EMPLOYMENT_STATUS", "STUDENT_STATUS")]
int_access <- within(int_access, {
emp_status.cat <- NA
emp_status.cat[EMPLOYMENT_STATUS=="1"] <- "Employed"
emp_status.cat[EMPLOYMENT_STATUS=="2"] <- "Unemployed"
emp_status.cat[EMPLOYMENT_STATUS=="3"] <- "Unable to Work"
})
int_access <- within(int_access, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "Never Used Internet"
int_usage.cat[INT_USAGE=="1"] <- "Used Internet"
})
int_access <- within(int_access, {
std_status.cat <- NA
std_status.cat[STUDENT_STATUS=="0"] <- "Not a Student"
std_status.cat[STUDENT_STATUS=="1"] <- "Student"
})
int_access_count <- as.data.frame(table(int_access[,c("emp_status.cat", "std_status.cat", "int_usage.cat")]))
ggballoonplot(int_access_count, x = "emp_status.cat", y = "int_usage.cat", size = "Freq",
fill = "Freq", facet.by = "std_status.cat",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: Internet usage among the employed far outpaces that of respondents who are actually enrolled as students. Despite the significant differences, the variables reveal several characteristics of the population. Additionally, the unemployed population has very limited internet access, regardless of student status.
From the visualizations constructed during the exploratory data analysis, we were able to observe the dynamics of the dataset, categorize its values, and identify relationships. Furthermore, we were also able to locate the outliers in the dataset, organize it, and comprehend the dynamics of each variable.
From all these visualizations and variables, we will be able to perform predictive analysis making use of the dataset to forecast future trends and occurrences. As part of the next steps in our analysis, we will determine if a rise in family size will result in a decline in Internet usage. We’ll also examine whether the number of inhabitants in a household has an impact on Internet usage.
Given that the variables are categorical in nature, predictive methods such as logistic regression will help us attain the aforementioned goals. Logistic regression is frequently used for prediction and classification problems; it is also simple to implement, interpret, and train, and it is quite effective.
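As a hedged sketch of this workflow (on a small synthetic data frame, not our actual survey variables), a binary logistic regression in R models log(p/(1-p)) as a linear function of the predictors via glm():
#Sketch: general pattern of a binary logistic regression on synthetic data
set.seed(42)
toy <- data.frame(y = factor(rbinom(100, 1, 0.5)), x = factor(sample(1:3, 100, replace = TRUE)))
toy_model <- glm(y ~ x, data = toy, family = binomial(link = "logit")) #fit on the logit scale
head(predict(toy_model, toy, type = "response")) #predicted probabilities between 0 and 1
We will apply this same glm() + predict() pattern to our survey variables below.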
Will a respondent more likely use the internet from Home/Work/Both?
What is the likeliness of a respondent using the internet from a public library?
These questions might help us with predictions in our next phase of analysis.
From our exploratory analysis we certainly gained some useful insights. The first thing that we would like to modify in this dataset is to eliminate certain variables that are redundant.
REGION: The Region column doesn't really convey valuable insights; the Province column carries more precise information.
G_URBRUR: The community where the respondent lives is not very specific and only has 3 major cities, so this can be eliminated as it won't be useful for further analysis.
GFAMTYPE: The family type variable is redundant, as the G_HHSIZE column conveys the same information in a more specific manner.
G_HEDUC: The highest level of education in a particular household will not be used for further analysis, so we'll eliminate this variable.
G_HSTUD: The student status column provides more useful data; knowing whether there is another student in the household doesn't seem very important for our analysis.
#Predictive Analyses
#Reduction of Dimensions
reduced_data <- data[,-c(3,4,10,12,13)] #Eliminating columns that are not required for analysis
As a common rule of thumb, we will build each predictive model with a step-by-step process: test the associations between variables, select the significantly associated predictors, split the data into training and testing sets, fit the model, and assess its performance with a confusion matrix.
To understand the association between variables we'll be using 2 tests:
Pearson's Chi-squared test of Independence: Pearson's chi-squared test of independence will help us understand whether any 2 variables are dependent. However, one limitation of the chi-squared test is that it does not give us the strength of the association between the 2 variables.
Cramer's V Test: Cramer's V takes a step beyond the conventional chi-squared test and helps us understand the strength of the association between 2 variables. In other words, Cramer's V gives us more specific details regarding the association of the 2 variables.
We'll be using both of these tests to understand the dynamics of the variables in the predictive models that we construct.
First, let's understand the dynamics of our data using a contingency table.
#1. Predicting if a respondent is a student or not using age, employment status, internet usage
stud_pred <- reduced_data[,c("STUDENT_STATUS","AGE_GROUP","INT_USAGE", "EMPLOYMENT_STATUS", "SEX")]
cont_stud_pred <- table(stud_pred$STUDENT_STATUS, stud_pred$AGE_GROUP)
cont_stud_pred
##
## 1 2 3 4 5 6
## 0 920 2939 3570 4449 4381 5205
## 1 1061 330 179 106 38 0
Now let’s have a look at the association of these variables with the help of the Pearson’s Chi-squared test & the Cramer’s V Test
#Performing Chi-Squared & Cramer's V Test to understand the relationship between variables
chisq.test(stud_pred$STUDENT_STATUS, stud_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: stud_pred$STUDENT_STATUS and stud_pred$AGE_GROUP
## X-squared = 7099.2, df = 5, p-value < 2.2e-16
assocstats(xtabs(~stud_pred$STUDENT_STATUS + stud_pred$AGE_GROUP))#Strong Association
## X^2 df P(> X^2)
## Likelihood Ratio 4468.4 5 0
## Pearson 7099.2 5 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.484
## Cramer's V : 0.553
The p-value from our chi-squared test was < 2.2e-16, which is less than 0.05. Hence, we can reject the null hypothesis that the 2 variables are independent and conclude that they are dependent and related.
Our Cramer's V test shows a value of 0.553. An effect size between 0.2 and 0.6 is considered to convey moderate association. Hence, we can conclude that Student Status and Age Group are moderately associated with each other.
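As a quick sanity check, Cramer's V can be reproduced by hand from the chi-squared statistic using V = sqrt(X^2 / (n * min(r-1, c-1))); with X^2 = 7099.2, n = 23178 respondents, and min(2-1, 6-1) = 1 for our 2 x 6 table:
#Sketch: reproducing Cramer's V from the chi-squared statistic
sqrt(7099.2 / (23178 * 1)) #= 0.553, matching the assocstats() output above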
Now, let's continue the same process for the other related variables.
chisq.test(stud_pred$STUDENT_STATUS, stud_pred$INT_USAGE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: stud_pred$STUDENT_STATUS and stud_pred$INT_USAGE
## X-squared = 521.81, df = 1, p-value < 2.2e-16
assocstats(xtabs(~stud_pred$STUDENT_STATUS + stud_pred$INT_USAGE)) #Decent Association
## X^2 df P(> X^2)
## Likelihood Ratio 788.26 1 0
## Pearson 523.15 1 0
##
## Phi-Coefficient : 0.15
## Contingency Coeff.: 0.149
## Cramer's V : 0.15
chisq.test(stud_pred$STUDENT_STATUS, stud_pred$EMPLOYMENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: stud_pred$STUDENT_STATUS and stud_pred$EMPLOYMENT_STATUS
## X-squared = 39.171, df = 2, p-value = 3.12e-09
assocstats(xtabs(~stud_pred$STUDENT_STATUS + stud_pred$EMPLOYMENT_STATUS)) #Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 37.536 2 7.0675e-09
## Pearson 39.171 2 3.1196e-09
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.041
## Cramer's V : 0.041
chisq.test(stud_pred$STUDENT_STATUS, stud_pred$SEX)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: stud_pred$STUDENT_STATUS and stud_pred$SEX
## X-squared = 13.84, df = 1, p-value = 0.0001991
assocstats(xtabs(~stud_pred$STUDENT_STATUS + stud_pred$SEX)) #Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 14.126 1 0.00017097
## Pearson 14.028 1 0.00018006
##
## Phi-Coefficient : 0.025
## Contingency Coeff.: 0.025
## Cramer's V : 0.025
After analysing the results, we can see that the variables Age Group and Internet Usage are significantly associated with Student Status. Hence, we can use these 2 variables to build our predictive model.
Now, let’s build our model and predict values for our test data:
#Splitting our data into training data and testing data
set.seed(1)
trainingrows <- sample(nrow(stud_pred), nrow(stud_pred) * 0.8)
stud_pred_training <- stud_pred[trainingrows, ]
stud_pred_testing <- stud_pred[-trainingrows, ]
#Constructing a logistic regression model with stud_pred_training
stud_pred_model <- glm(STUDENT_STATUS ~ AGE_GROUP + INT_USAGE, data = stud_pred_training,
family = binomial(link = "logit"))
summary(stud_pred_model)
##
## Call:
## glm(formula = STUDENT_STATUS ~ AGE_GROUP + INT_USAGE, family = binomial(link = "logit"),
## data = stud_pred_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2555 -0.3145 -0.1530 -0.0001 3.3570
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.00760 0.24192 -4.165 3.11e-05 ***
## AGE_GROUP2 -2.36480 0.08313 -28.448 < 2e-16 ***
## AGE_GROUP3 -3.16326 0.10056 -31.457 < 2e-16 ***
## AGE_GROUP4 -3.73828 0.11818 -31.632 < 2e-16 ***
## AGE_GROUP5 -4.62342 0.18365 -25.175 < 2e-16 ***
## AGE_GROUP6 -19.13797 163.88734 -0.117 0.907
## INT_USAGE1 1.18927 0.23910 4.974 6.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9809.9 on 18541 degrees of freedom
## Residual deviance: 6171.2 on 18535 degrees of freedom
## AIC: 6185.2
##
## Number of Fisher Scoring iterations: 18
pred_stud <- predict(stud_pred_model, stud_pred_testing, type = "response")
We’ll now build a confusion matrix and assess the performance of our model with our predicted values
#Generating a confusion matrix to test the accuracy of our model using training vs testing data
logitpredictions = rep(0,length(pred_stud))
logitpredictions[pred_stud > 0.5] <- 1
cm_stud <- table(logitpredictions, stud_pred_testing$STUDENT_STATUS)
cm_stud
##
## logitpredictions 0 1
## 0 4112 137
## 1 187 200
#Assessing the performance of the model we constructed
tp = length(which((logitpredictions == 1) & (stud_pred_testing$STUDENT_STATUS == 1)))
tn = length(which((logitpredictions == 0) & (stud_pred_testing$STUDENT_STATUS == 0)))
fp = length(which((logitpredictions == 1) & (stud_pred_testing$STUDENT_STATUS == 0)))
fn = length(which((logitpredictions == 0) & (stud_pred_testing$STUDENT_STATUS == 1)))
logitaccuracy <- (tp+tn)/(tp+tn+fp+fn)
logitsensitivity <- tp/(tp+fn)
logitspecificity <- tn/(tn+fp)
logitprecision <- tp/(tp+fp)
logitaccuracy #Calculating the accuracy
## [1] 0.9301122
logitsensitivity #Calculating the sensitivity
## [1] 0.5934718
logitspecificity #Calculating the specificity
## [1] 0.9565015
logitprecision #Calculating the precision
## [1] 0.5167959
We see an accuracy value of 0.9301122 (93.01%) and a specificity value of 0.9565015 (95.65%). Hence, our model is reasonably good at predicting future test data, although the lower sensitivity (59.3%) and precision (51.7%) reflect the class imbalance between students and non-students.
Let's follow the same protocol that we used for the previous prediction and start by finding associations between variables.
#2. Predicting if a respondent has used internet before using age, employment status, sex
int_pred <- reduced_data[,c("INT_USAGE","AGE_GROUP", "EMPLOYMENT_STATUS", "SEX")]
#Performing Chi-Squared Test to better understand if there is a relationship between the variables
chisq.test(int_pred$INT_USAGE, int_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: int_pred$INT_USAGE and int_pred$AGE_GROUP
## X-squared = 6361, df = 5, p-value < 2.2e-16
assocstats(xtabs(~int_pred$INT_USAGE + int_pred$AGE_GROUP)) #Strong Association
## X^2 df P(> X^2)
## Likelihood Ratio 6477.2 5 0
## Pearson 6361.0 5 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.464
## Cramer's V : 0.524
chisq.test(int_pred$INT_USAGE, int_pred$EMPLOYMENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: int_pred$INT_USAGE and int_pred$EMPLOYMENT_STATUS
## X-squared = 3682.2, df = 2, p-value < 2.2e-16
assocstats(xtabs(~int_pred$INT_USAGE + int_pred$EMPLOYMENT_STATUS)) #Strong Association
## X^2 df P(> X^2)
## Likelihood Ratio 3618.1 2 0
## Pearson 3682.2 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.37
## Cramer's V : 0.399
chisq.test(int_pred$INT_USAGE, int_pred$SEX)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: int_pred$INT_USAGE and int_pred$SEX
## X-squared = 7.9511, df = 1, p-value = 0.004806
assocstats(xtabs(~int_pred$INT_USAGE + int_pred$SEX)) #Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 8.0524 1 0.0045443
## Pearson 8.0384 1 0.0045797
##
## Phi-Coefficient : 0.019
## Contingency Coeff.: 0.019
## Cramer's V : 0.019
We can see that AGE_GROUP & EMPLOYMENT_STATUS have a good association with internet usage, while SEX has a weak association. Hence, we'll be using AGE_GROUP & EMPLOYMENT_STATUS to build our prediction model.
#Splitting our data into training data and testing data
set.seed(1)
trainingrows <- sample(nrow(int_pred), nrow(int_pred) * 0.8)
int_pred_training <- int_pred[trainingrows, ]
int_pred_testing <- int_pred[-trainingrows, ]
#Constructing a logistic regression model with int_pred_training
int_pred_model <- glm(INT_USAGE ~ AGE_GROUP + EMPLOYMENT_STATUS, data = int_pred_training,
family = binomial(link = "logit"))
summary(int_pred_model)
pred_int <- predict(int_pred_model, int_pred_testing, type = "response")
Let's construct a confusion matrix and assess the performance of our model with our predicted values.
#Generating a confusion matrix to test the accuracy of our model using training vs testing data
logitpredictions = rep(0,length(pred_int))
logitpredictions[pred_int > 0.5] <- 1
cm_int <- table(logitpredictions, int_pred_testing$INT_USAGE)
cm_int
##
## logitpredictions 0 1
## 0 606 342
## 1 497 3191
#Assessing the performance of the model we constructed
tp = length(which((logitpredictions == 1) & (int_pred_testing$INT_USAGE == 1)))
tn = length(which((logitpredictions == 0) & (int_pred_testing$INT_USAGE == 0)))
fp = length(which((logitpredictions == 1) & (int_pred_testing$INT_USAGE == 0)))
fn = length(which((logitpredictions == 0) & (int_pred_testing$INT_USAGE == 1)))
logitaccuracy <- (tp+tn)/(tp+tn+fp+fn) #Calculating the accuracy
logitsensitivity <- tp/(tp+fn) #Calculating the sensitivity
logitspecificity <- tn/(tn+fp) #Calculating the specificity
logitprecision <- tp/(tp+fp) #Calculating the precision
#Displaying values
logitaccuracy
## [1] 0.819025
logitsensitivity
## [1] 0.9031984
logitspecificity
## [1] 0.5494107
logitprecision
## [1] 0.8652386
With an accuracy of 81.9% and a sensitivity of 90.3%, this model predicts internet users well, though its specificity of 54.9% shows it struggles more with identifying non-users. Next, let's find the variables that have a good association with Employment Status.
#Moving required columns to the new data frame
emp_pred <- reduced_data[,c("EMPLOYMENT_STATUS", "AGE_GROUP", "SEX", "STUDENT_STATUS",
"PROVINCE", "ED_LEVEL", "HOUSE_SIZE", "INT_YEARS")]
emp_pred$EMPLOYMENT_STATUS <- relevel(emp_pred$EMPLOYMENT_STATUS, ref = 1) #Re-leveling for multinomial logistic regression
#Performing Chi-Squared Test to better understand if there is a relationship between the variables
chisq.test(emp_pred$EMPLOYMENT_STATUS, emp_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: emp_pred$EMPLOYMENT_STATUS and emp_pred$AGE_GROUP
## X-squared = 9358.6, df = 10, p-value < 2.2e-16
assocstats(xtabs(~emp_pred$EMPLOYMENT_STATUS + emp_pred$AGE_GROUP)) #Good Association
## X^2 df P(> X^2)
## Likelihood Ratio 9976.8 10 0
## Pearson 9358.6 10 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.536
## Cramer's V : 0.449
chisq.test(emp_pred$EMPLOYMENT_STATUS, emp_pred$INT_YEARS)
##
## Pearson's Chi-squared test
##
## data: emp_pred$EMPLOYMENT_STATUS and emp_pred$INT_YEARS
## X-squared = 4057.1, df = 10, p-value < 2.2e-16
assocstats(xtabs(~emp_pred$EMPLOYMENT_STATUS + emp_pred$INT_YEARS)) #Positive Association
## X^2 df P(> X^2)
## Likelihood Ratio 4038.1 10 0
## Pearson 4057.1 10 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.386
## Cramer's V : 0.296
chisq.test(emp_pred$EMPLOYMENT_STATUS, emp_pred$SEX)
##
## Pearson's Chi-squared test
##
## data: emp_pred$EMPLOYMENT_STATUS and emp_pred$SEX
## X-squared = 333.2, df = 2, p-value < 2.2e-16
assocstats(xtabs(~emp_pred$EMPLOYMENT_STATUS + emp_pred$SEX)) #Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 335.49 2 0
## Pearson 333.20 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.119
## Cramer's V : 0.12
chisq.test(emp_pred$EMPLOYMENT_STATUS, emp_pred$STUDENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: emp_pred$EMPLOYMENT_STATUS and emp_pred$STUDENT_STATUS
## X-squared = 39.171, df = 2, p-value = 3.12e-09
assocstats(xtabs(~emp_pred$EMPLOYMENT_STATUS + emp_pred$STUDENT_STATUS)) #Very Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 37.536 2 7.0675e-09
## Pearson 39.171 2 3.1196e-09
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.041
## Cramer's V : 0.041
We can see that AGE_GROUP and INT_YEARS have a good association with EMPLOYMENT_STATUS. Hence, we'll use these variables to build our model.
#Splitting our data into training data and testing data
set.seed(1)
trainingrows <- sample(nrow(emp_pred), nrow(emp_pred) * 0.8)
emp_pred_training <- emp_pred[trainingrows, ]
emp_pred_testing <- emp_pred[-trainingrows, ]
# Constructing a multinomial logistic regression with associated variables - internet years and age group
library(nnet)
emp_pred_model <- multinom(EMPLOYMENT_STATUS ~ AGE_GROUP + INT_YEARS, data = emp_pred_training)
## # weights: 36 (22 variable)
## initial value 20370.469056
## iter 10 value 11453.324906
## iter 20 value 11091.575756
## iter 30 value 11022.219696
## final value 11022.210082
## converged
summary(emp_pred_model)
## Call:
## multinom(formula = EMPLOYMENT_STATUS ~ AGE_GROUP + INT_YEARS,
## data = emp_pred_training)
##
## Coefficients:
## (Intercept) AGE_GROUP2 AGE_GROUP3 AGE_GROUP4 AGE_GROUP5 AGE_GROUP6
## 2 -1.2067066 -0.9507013 -1.045068 -0.9640328 -1.07655374 -1.372630
## 3 0.3388662 -1.3029953 -1.442353 -1.2343652 0.03571077 2.416025
## INT_YEARS2 INT_YEARS3 INT_YEARS4 INT_YEARS6 INT_YEARS7
## 2 -0.05408934 -0.3338697 -0.6348954 0.02102686 0.7178079
## 3 -0.39562697 -0.6516500 -1.1133409 -0.11955670 0.4294735
##
## Std. Errors:
## (Intercept) AGE_GROUP2 AGE_GROUP3 AGE_GROUP4 AGE_GROUP5 AGE_GROUP6 INT_YEARS2
## 2 0.2305719 0.12220943 0.11918070 0.11398033 0.12821403 0.23954224 0.2666660
## 3 0.1258848 0.08346045 0.08135671 0.07488551 0.06879825 0.08190181 0.1454427
## INT_YEARS3 INT_YEARS4 INT_YEARS6 INT_YEARS7
## 2 0.2338009 0.2168789 0.2325354 0.8214163
## 3 0.1238099 0.1143149 0.1194972 0.5167312
##
## Residual Deviance: 22044.42
## AIC: 22088.42
pred_emp <- predict(emp_pred_model, emp_pred_testing)
Now let’s build a confusion matrix and calculate the misclassification % of our model
#Generating a confusion matrix to test the accuracy of our model with regards to original data
cm_emp <- table(pred_emp, emp_pred_testing$EMPLOYMENT_STATUS)
cm_emp
##
## pred_emp 1 2 3
## 1 2487 187 616
## 2 0 0 0
## 3 230 18 1098
#Let's calculate the misclassification % of our model
1-sum(diag(cm_emp))/sum(cm_emp)
## [1] 0.2267041
As we can see, the misclassification percentage of our model is 0.2267041, i.e. about 22.7%. This means there is roughly a 22.7% chance of our model being wrong, which is a decent score. Note, however, that the model never predicts class 2 (Unemployed), as shown by the empty second row of the confusion matrix.
Let’s move required columns to a new data frame for our analysis
#Moving required columns to the new data frame
loc_pred <- reduced_data[,c("AGE_GROUP","EMPLOYMENT_STATUS", "SEX", "LU_FROM_HOME",
"LU_FROM_WORK", "STUDENT_STATUS", "PROVINCE", "ED_LEVEL",
"HOUSE_SIZE", "INT_YEARS")]
Now we will classify LU_FROM_HOME and LU_FROM_WORK into a new column, place, which will take the following values:
1 - The respondent has accessed the internet from home only
2 - The respondent has accessed the internet from work only
3 - The respondent has accessed the internet from both home and work
#Creating new variable place which combines data from both columns Home and Work
loc_pred <- within(loc_pred, {
place <- NA
place[LU_FROM_HOME=="1" & LU_FROM_WORK == "2"] <- "1" #Only Home
place[LU_FROM_HOME=="2" & LU_FROM_WORK == "1"] <- "2" #Only Work
place[LU_FROM_HOME=="1" & LU_FROM_WORK == "1"] <- "3" #Both Home & Work
})
Let's have a look at the categorical association between these variables.
#Converting Place variable to factor
loc_pred$place <- as.factor(loc_pred$place)
loc_pred$place <- relevel(loc_pred$place, ref = 1) #Re-levelling for multinomial logistic regression
#Removing NA values for creating a training data set
loc_pred <- na.omit(loc_pred)
#Performing Chi-Squared Test to better understand if there is a relationship between the variables
chisq.test(loc_pred$place, loc_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: loc_pred$place and loc_pred$AGE_GROUP
## X-squared = 1400.3, df = 10, p-value < 2.2e-16
assocstats(xtabs(~loc_pred$place + loc_pred$AGE_GROUP))
## X^2 df P(> X^2)
## Likelihood Ratio 1561.2 10 0
## Pearson 1400.3 10 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.28
## Cramer's V : 0.206
chisq.test(loc_pred$place, loc_pred$EMPLOYMENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: loc_pred$place and loc_pred$EMPLOYMENT_STATUS
## X-squared = 2661.3, df = 4, p-value < 2.2e-16
assocstats(xtabs(~loc_pred$place + loc_pred$EMPLOYMENT_STATUS))
## X^2 df P(> X^2)
## Likelihood Ratio 3087.3 4 0
## Pearson 2661.3 4 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.373
## Cramer's V : 0.284
#Continuing with just Cramer V test
assocstats(xtabs(~loc_pred$place + loc_pred$SEX))
## X^2 df P(> X^2)
## Likelihood Ratio 41.707 2 8.7802e-10
## Pearson 41.751 2 8.5885e-10
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.05
## Cramer's V : 0.05
assocstats(xtabs(~loc_pred$place + loc_pred$STUDENT_STATUS))
## X^2 df P(> X^2)
## Likelihood Ratio 40.780 2 1.3957e-09
## Pearson 33.089 2 6.5300e-08
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.045
## Cramer's V : 0.045
assocstats(xtabs(~loc_pred$place + loc_pred$PROVINCE))
## X^2 df P(> X^2)
## Likelihood Ratio 43.259 18 0.00073538
## Pearson 43.658 18 0.00064553
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.051
## Cramer's V : 0.036
assocstats(xtabs(~loc_pred$place + loc_pred$ED_LEVEL))
## X^2 df P(> X^2)
## Likelihood Ratio 1102.4 4 0
## Pearson 1089.0 4 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.249
## Cramer's V : 0.182
assocstats(xtabs(~loc_pred$place + loc_pred$HOUSE_SIZE))
## X^2 df P(> X^2)
## Likelihood Ratio 202.51 6 0
## Pearson 218.75 6 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.114
## Cramer's V : 0.081
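As an aside, the repeated assocstats() calls above can be collapsed into a single loop that collects Cramér's V for every candidate predictor at once; a minimal sketch, assuming the vcd package is already loaded:
#Collecting Cramer's V for each candidate predictor in one pass
predictors <- c("AGE_GROUP", "EMPLOYMENT_STATUS", "SEX", "STUDENT_STATUS",
                "PROVINCE", "ED_LEVEL", "HOUSE_SIZE")
cramers_v <- sapply(predictors, function(v)
  assocstats(table(loc_pred$place, loc_pred[[v]]))$cramer)
sort(cramers_v, decreasing = TRUE)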
We can see that AGE_GROUP and
EMPLOYMENT_STATUS show the strongest association with the
variable place (Cramér's V of 0.206 and 0.284, respectively).
Hence, we will drop the other variables and proceed with AGE_GROUP and EMPLOYMENT_STATUS to build our model to predict place.
#Splitting our data into training data and testing data
set.seed(100)
trainingrows <- sample(nrow(loc_pred), nrow(loc_pred) * 0.6)
loc_pred_training <- loc_pred[trainingrows, ]
loc_pred_testing <- loc_pred[-trainingrows, ]
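Since the 60/40 split is random rather than stratified, it is worth a quick check that the class mix is comparable in both halves:
#Comparing the class proportions of place across the two halves
round(prop.table(table(loc_pred_training$place)), 3)
round(prop.table(table(loc_pred_testing$place)), 3)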
# Constructing a Multi-nomial logistic regression with associated variables - Employment status and age group
loc_pred_model <- multinom(place ~ AGE_GROUP + EMPLOYMENT_STATUS, data = loc_pred_training)
## # weights: 27 (16 variable)
## initial value 10870.768596
## iter 10 value 7227.535714
## iter 20 value 6801.402861
## iter 30 value 6657.372997
## final value 6656.943400
## converged
summary(loc_pred_model)
## Call:
## multinom(formula = place ~ AGE_GROUP + EMPLOYMENT_STATUS, data = loc_pred_training)
##
## Coefficients:
## (Intercept) AGE_GROUP2 AGE_GROUP3 AGE_GROUP4 AGE_GROUP5 AGE_GROUP6
## 2 -2.83139167 0.4736733 -0.06660337 0.3168686 0.5869540 -0.2461317
## 3 -0.02730885 0.5693568 0.22389316 -0.1184339 -0.1800612 -0.6969082
## EMPLOYMENT_STATUS2 EMPLOYMENT_STATUS3
## 2 -1.238178 -2.504550
## 3 -1.119933 -2.321146
##
## Std. Errors:
## (Intercept) AGE_GROUP2 AGE_GROUP3 AGE_GROUP4 AGE_GROUP5 AGE_GROUP6
## 2 0.22046391 0.26009264 0.26823570 0.24890830 0.25370298 0.4318078
## 3 0.07129431 0.08622257 0.08447988 0.08440498 0.09044237 0.1408269
## EMPLOYMENT_STATUS2 EMPLOYMENT_STATUS3
## 2 0.3644211 0.3008890
## 3 0.1068090 0.0869557
##
## Residual Deviance: 13313.89
## AIC: 13345.89
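Note that summary() on a multinom fit reports coefficients and standard errors but no p-values; approximate ones can be derived from a two-sided Wald z-test, and exponentiated coefficients read as relative risk ratios against the reference level. A minimal sketch:
#Wald z-tests recovered from the coefficient and standard-error matrices
s <- summary(loc_pred_model)
z <- s$coefficients / s$standard.errors
round(2 * (1 - pnorm(abs(z))), 4)
#Exponentiated coefficients: relative risk ratios vs. the reference category
round(exp(coef(loc_pred_model)), 3)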
pred_loc <- predict(loc_pred_model,loc_pred_testing)
Let’s go ahead and build a confusion matrix and calculate the misclassification rate of this model.
#Generating a confusion matrix to test the accuracy of our model with regards to original data
cm_loc <- table(pred_loc, loc_pred_testing$place)
cm_loc
##
## pred_loc 1 2 3
## 1 3062 122 1304
## 2 0 0 0
## 3 859 57 1194
#Let's calculate the misclassification % of our model
1-sum(diag(cm_loc))/sum(cm_loc)
## [1] 0.354956
The misclassification rate here is 0.3550, i.e., about 35.5%, noticeably worse than the employment model; as before, the model never predicts class 2 (work only), which the overall rate obscures. Let’s now make some adjustments to our data frame for the next analysis.
#5. Logistic Regression: Public Library Prediction - How likely is a respondent to access the internet from a public library?
#Moving required columns to the new data frame
library_pred <- reduced_data[reduced_data$LU_FROM_LIBRARY==1|reduced_data$LU_FROM_LIBRARY == 2,c("LU_FROM_LIBRARY","AGE_GROUP","EMPLOYMENT_STATUS",
"STUDENT_STATUS", "PROVINCE", "ED_LEVEL", "INT_YEARS")]
library_pred$LU_FROM_LIBRARY <- as.integer(library_pred$LU_FROM_LIBRARY)
library_pred["LU_FROM_LIBRARY"][library_pred["LU_FROM_LIBRARY"] == 2] <- 0
library_pred$LU_FROM_LIBRARY <- factor(library_pred$LU_FROM_LIBRARY, levels=c("0","1"))
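A quick tabulation after the recode shows how imbalanced the outcome is, which matters later when judging accuracy:
#Checking the class balance of the recoded outcome
table(library_pred$LU_FROM_LIBRARY)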
Now, let’s examine the associations between the variables in this data set.
#Performing Chi-Squared Test to better understand if there is a relationship between the variables
chisq.test(library_pred$LU_FROM_LIBRARY, library_pred$INT_YEARS)
## Warning in chisq.test(library_pred$LU_FROM_LIBRARY, library_pred$INT_YEARS):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: library_pred$LU_FROM_LIBRARY and library_pred$INT_YEARS
## X-squared = 83.184, df = 4, p-value < 2.2e-16
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$INT_YEARS)) #Significant by chi-squared, but V is NaN (sparse cells)
## X^2 df P(> X^2)
## Likelihood Ratio 92.159 5 0
## Pearson NaN 5 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
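The NaN values above (and the earlier warning about the chi-squared approximation) typically trace back to empty or near-empty cells; inspecting the raw cross-tabulation makes the sparse INT_YEARS levels visible:
#Empty or sparse cells in this table explain the NaN association statistics
table(library_pred$LU_FROM_LIBRARY, library_pred$INT_YEARS)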
chisq.test(library_pred$LU_FROM_LIBRARY, library_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: library_pred$LU_FROM_LIBRARY and library_pred$AGE_GROUP
## X-squared = 528.21, df = 5, p-value < 2.2e-16
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$AGE_GROUP)) # good association
## X^2 df P(> X^2)
## Likelihood Ratio 451.04 5 0
## Pearson 528.21 5 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.174
## Cramer's V : 0.176
chisq.test(library_pred$LU_FROM_LIBRARY, library_pred$EMPLOYMENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: library_pred$LU_FROM_LIBRARY and library_pred$EMPLOYMENT_STATUS
## X-squared = 86.007, df = 2, p-value < 2.2e-16
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$EMPLOYMENT_STATUS)) #good association
## X^2 df P(> X^2)
## Likelihood Ratio 78.266 2 0
## Pearson 86.007 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.071
## Cramer's V : 0.071
#Continuing with just Cramer's V test
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$STUDENT_STATUS)) #Decent
## X^2 df P(> X^2)
## Likelihood Ratio 468.31 1 0
## Pearson 601.33 1 0
##
## Phi-Coefficient : 0.188
## Contingency Coeff.: 0.185
## Cramer's V : 0.188
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$ED_LEVEL)) #Decent
## X^2 df P(> X^2)
## Likelihood Ratio 90.160 2 0
## Pearson 95.565 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.075
## Cramer's V : 0.075
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$PROVINCE)) #Weak association
## X^2 df P(> X^2)
## Likelihood Ratio 102.560 9 0
## Pearson 98.018 9 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.076
## Cramer's V : 0.076
As we can see here, INT_YEARS, AGE_GROUP,
and EMPLOYMENT_STATUS are associated with
LU_FROM_LIBRARY.
Hence, we’ll use these variables to build our model.
#Splitting our data into training data and testing data
set.seed(1)
trainingrows <- sample(nrow(library_pred), nrow(library_pred) * 0.8)
library_pred_training <- library_pred[trainingrows, ]
library_pred_testing <- library_pred[-trainingrows, ]
#Constructing a logistic regression model with library_pred_training
library_pred_model <- glm(LU_FROM_LIBRARY ~ INT_YEARS + AGE_GROUP + EMPLOYMENT_STATUS,
data = library_pred_training, family = binomial(link = "logit"))
summary(library_pred_model)
##
## Call:
## glm(formula = LU_FROM_LIBRARY ~ INT_YEARS + AGE_GROUP + EMPLOYMENT_STATUS,
## family = binomial(link = "logit"), data = library_pred_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9411 -0.5547 -0.4429 -0.3999 2.5938
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.38928 0.18847 -7.372 1.69e-13 ***
## INT_YEARS2 -0.25350 0.23201 -1.093 0.275
## INT_YEARS3 -0.12307 0.19205 -0.641 0.522
## INT_YEARS4 0.27846 0.17766 1.567 0.117
## INT_YEARS7 -11.50572 129.55275 -0.089 0.929
## AGE_GROUP2 -0.67907 0.08130 -8.353 < 2e-16 ***
## AGE_GROUP3 -0.92679 0.08430 -10.994 < 2e-16 ***
## AGE_GROUP4 -1.16174 0.08656 -13.422 < 2e-16 ***
## AGE_GROUP5 -1.37533 0.09365 -14.686 < 2e-16 ***
## AGE_GROUP6 -1.68588 0.12172 -13.851 < 2e-16 ***
## EMPLOYMENT_STATUS2 0.52577 0.10271 5.119 3.07e-07 ***
## EMPLOYMENT_STATUS3 0.47855 0.06795 7.043 1.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10291.8 on 13588 degrees of freedom
## Residual deviance: 9824.9 on 13577 degrees of freedom
## AIC: 9848.9
##
## Number of Fisher Scoring iterations: 12
We can see that the variable INT_YEARS is not a
significant predictor: none of its coefficients reaches the 0.05 level, and the
very large standard error on INT_YEARS7 (129.55) hints at possible separation.
So, let’s drop INT_YEARS from our model and build a new, reduced model.
#Reducing logistic regression model and eliminating INT_YEARS
library_pred_model_reduced <- glm(LU_FROM_LIBRARY ~ AGE_GROUP + EMPLOYMENT_STATUS,
data = library_pred_training, family = binomial(link = "logit"))
summary(library_pred_model_reduced)
##
## Call:
## glm(formula = LU_FROM_LIBRARY ~ AGE_GROUP + EMPLOYMENT_STATUS,
## family = binomial(link = "logit"), data = library_pred_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9189 -0.5447 -0.4732 -0.4062 2.4260
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.14914 0.06283 -18.290 < 2e-16 ***
## AGE_GROUP2 -0.68379 0.08111 -8.431 < 2e-16 ***
## AGE_GROUP3 -0.95804 0.08399 -11.407 < 2e-16 ***
## AGE_GROUP4 -1.20918 0.08601 -14.059 < 2e-16 ***
## AGE_GROUP5 -1.41914 0.09315 -15.234 < 2e-16 ***
## AGE_GROUP6 -1.73948 0.12117 -14.356 < 2e-16 ***
## EMPLOYMENT_STATUS2 0.50524 0.10245 4.932 8.16e-07 ***
## EMPLOYMENT_STATUS3 0.43522 0.06736 6.461 1.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10291.8 on 13588 degrees of freedom
## Residual deviance: 9864.6 on 13581 degrees of freedom
## AIC: 9880.6
##
## Number of Fisher Scoring iterations: 5
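Two quick follow-ups on the reduced model: exponentiating the coefficients turns log-odds into odds ratios relative to the reference categories, and a likelihood-ratio test against the full model confirms whether dropping INT_YEARS was defensible. A sketch:
#Odds ratios relative to the reference categories (AGE_GROUP1, EMPLOYMENT_STATUS1)
round(exp(coef(library_pred_model_reduced)), 3)
#Likelihood-ratio test: reduced vs. full model (both fit on the same training data)
anova(library_pred_model_reduced, library_pred_model, test = "Chisq")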
pred_library <- predict(library_pred_model_reduced, library_pred_testing, type = "response")
Now let’s build a confusion matrix and assess the performance of our newly built model.
#Generating a confusion matrix comparing predictions against the testing data
logitpredictions <- rep(0, length(pred_library))
logitpredictions[pred_library > 0.5] <- 1
cm_library <- table(logitpredictions, library_pred_testing$LU_FROM_LIBRARY)
cm_library
##
## logitpredictions 0 1
## 0 2992 406
#Assessing the performance of the model we constructed
tp <- length(which((logitpredictions == 1) & (library_pred_testing$LU_FROM_LIBRARY == 1)))
tn <- length(which((logitpredictions == 0) & (library_pred_testing$LU_FROM_LIBRARY == 0)))
fp <- length(which((logitpredictions == 1) & (library_pred_testing$LU_FROM_LIBRARY == 0)))
fn <- length(which((logitpredictions == 0) & (library_pred_testing$LU_FROM_LIBRARY == 1)))
logitaccuracy <- (tp+tn)/(tp+tn+fp+fn) #Calculating the accuracy
#Displaying values
logitaccuracy
## [1] 0.880518
The model’s accuracy on the test data is about 88%, but the confusion matrix shows that, at the 0.5 cutoff, it never predicts class 1: every respondent is classified as a non-library user, so the accuracy simply mirrors the class imbalance (2992 of the 3398 test cases are 0). The headline number therefore overstates how useful the model is for identifying library users.
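With tp, tn, fp, and fn already computed, sensitivity and specificity make the imbalance explicit, and lowering the cutoff shows the trade-off; the 0.2 threshold below is purely illustrative, not a tuned value:
#Sensitivity (recall on library users) and specificity (recall on non-users)
tp / (tp + fn) #0 at the 0.5 cutoff, since no positive predictions are made
tn / (tn + fp) #1 at the 0.5 cutoff, for the same reason
#Re-classifying with a lower, illustrative cutoff of 0.2
logitpredictions_low <- as.integer(pred_library > 0.2)
table(logitpredictions_low, library_pred_testing$LU_FROM_LIBRARY)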