The internet has become a significant part of our lives and has made many everyday tasks more convenient. Given how frequently it is used, this analysis explores how usage varies across distinct groups of users. The location-specific dataset from the Canadian Internet Use Survey offers a comprehensive breakdown of the key variables needed to analyze internet usage, including region, province, age group, employment status, and other respondent details. This in-depth analysis combines exploratory and predictive techniques to interpret the data visually from various viewpoints and to derive actionable recommendations from it.
The dataset records a value for each variable across a set of dimensions, and these values show notable variance when their characteristics are taken into account. The analysis began by posing potential queries as part of our basic investigation in order to better understand the dataset. An important component of the interpretation involved examining factors such as the province with the highest internet usage and the employment status of Canadian household members. The distribution of users, internet usage frequency, educational background, and the main variables driving heavy internet use among students are also projected in order to gain meaningful insights.
Dimension Check: To understand the dynamics of the dataset, it is divided into descriptive chunks. Recognizing the dataset's components and attributes helps conceptualize how the data relate to one another. As part of data quality measures, accuracy, completeness, consistency, timeliness, validity, and integrity need to be examined. Following the dimensions, the dataset's summary is noted, displaying an R-structured tabular representation of the number of records and their associated components.
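As a minimal sketch of these checks in R (assuming the survey extract has been read into a data frame named data; the file name below is hypothetical), the basic calls are:
#Sketch: basic dimension and structure checks
data <- read.csv("cius_location.csv") #hypothetical file name for the survey extract
dim(data) #number of rows and columns
str(data) #structure and data type of each column
summary(data) #min, quartiles, median, mean and max for each column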
Inference: According to the statistical information pulled from the data summary, the important constituents, such as the min, first quartile, median, mean, third quartile, and max, are covered in detail. The built-in summary() function is used to generate this overview of the data. Additionally, there are 23178 rows and 23 columns in the dataset. The total count also lets us check whether the dataset contains any null values.
The acquired raw data is converted into real information, removing the possibility of periodic noise and data corruption. Data that is inconsistent, contains errors, or is incomplete is cleaned and processed so that it follows a regular pattern. This approach makes the data simpler to visualize and transforms it into a structured form.
Below are some of the approaches that we’ll be following:
In the code below, str() is used to verify that the data set has a well-defined structure and that appropriate data types are assigned to the variables, which makes the data easy to store and access.
Missing data can yield skewed estimates, which can impair a study's statistical significance and lead to inaccurate conclusions. In such scenarios, data imputation or other methodologies can be used to handle null values in a dataset. The location dataset used here simplified matters by containing no null values, which minimized complexity and saved time and code (a quick verification sketch follows this list of approaches).
Although the dataset's variables are categorical by nature, they are stored as continuous (numerical) values. Hence, these variables need to be converted to categorical form in R so that the groupings can be statistically analyzed for deeper insights. In general, establishing levels and factors makes it easier to explore the data.
Despite the dataset's extensive response collection, some responses are very sparsely populated and run the risk of becoming outliers when the data is visualized. Such variables are handled by consolidating their multiple columnar values into a single value.
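Returning to the missing-data point above, a minimal verification sketch (assuming the data frame is named data) uses base R to count NA values before considering any imputation:
#Sketch: verifying that the data set contains no missing values
sum(is.na(data)) #total NA count across the whole data set; 0 confirms completeness
colSums(is.na(data)) #NA count per column, to locate any problematic variables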
Below are the libraries that we’ll be using for our analyses.
#Libraries Initialization
library(ggplot2)
library(dplyr)
library(DT)
library(magrittr)
library(knitr)
library(tidyverse)
library(epiDisplay)
library(ggpubr)
library(ggmosaic)
library(vcd)
library(grid)
theme_set(theme_pubr())
library(car)
library(nnet)
First, let's have a look at the summary of our data in R to better understand the dynamics of the data set at hand. After that, we'll rename the columns of the dataset for better readability.
head(data)
## PUMFID PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST
## 1 1 35 3 5 3 2 3 2 1
## 2 2 46 4 5 1 2 1 1 2
## 3 3 10 1 5 2 1 2 2 1
## 4 4 35 3 4 5 2 2 2 3
## 5 5 13 1 4 3 1 2 2 1
## 6 6 46 4 4 2 2 1 2 3
## GFAMTYPE G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 LU_Q01 LU_Q02 LU_G03 LU_Q04
## 1 3 1 3 2 1 2 1 2 2 2
## 2 2 3 2 1 1 4 1 1 1 1
## 3 2 2 3 2 1 4 1 1 1 1
## 4 3 1 2 2 1 3 6 6 6 6
## 5 2 3 2 2 1 2 1 2 2 2
## 6 1 3 1 2 1 4 6 6 6 6
## LU_Q05 LU_Q06A LU_Q06B LU_G06
## 1 2 6 6 6
## 2 1 2 1 2
## 3 1 2 2 1
## 4 6 6 6 6
## 5 2 6 6 6
## 6 6 6 6 6
#Renaming columns for better readability and understanding
colnames(data) <- c("ID",
"PROVINCE",
"REGION",
"URBAN_RURAL",
"AGE_GROUP",
"SEX",
"ED_LEVEL",
"STUDENT_STATUS",
"EMPLOYMENT_STATUS",
"HOUSE_TYPE",
"HOUSE_SIZE",
"HIGH_EDU",
"STUD_IN_HOUSE",
"INT_USAGE",
"INT_YEARS",
"LU_FROM_HOME",
"LU_FROM_WORK",
"LU_FROM_SCHOOL",
"LU_FROM_LIBRARY",
"LU_OTHERS",
"LU_OTHERS_RELATIVE",
"LU_OTHERS_FRIEND",
"LU_OTHERS_MISC")
In the provided dataset, we are dealing with categorical data encoded as numbers; there are no continuous variables. However, since the values are numeric, R may treat them as continuous data and assign numeric data types to the variables. We'll fix this in the latter part of the code block below by validating the structure of our dataset and setting the columns as factors.
Now, in order to cleanse the data, we plan to modify the values of certain columns by clubbing several column values under one value. For instance, in the LU_FROM_SCHOOL column, the number 6 will be assigned to No Response, which will cover Valid Skips, Refusal & Not Stated instead of each of these sub-classifications having a separate numerical value. However, the value 7 = Don't Know is not disturbed in the process and remains as it is. We take this step because we feel these separate values would not add significance to our analysis in any way.
The code snippet below performs the data cleansing/pre-processing:
#Data Pre-processing
#Modifying values from the data set
#SEX:
#Assigning 0 = Female 1 = Male
data["SEX"][data["SEX"]==2] <-0
#STUDENT_STATUS:
#Assigning 0 = No 1 = Yes
data["STUDENT_STATUS"][data["STUDENT_STATUS"]== 2] <- 0
#STUD_IN_HOUSE:
#Assigning 0 = No 1 = Yes
data["STUD_IN_HOUSE"][data["STUD_IN_HOUSE"]== 2] <- 0
#INT_USAGE:
#Assigning 0 = No 1 = Yes
data["INT_USAGE"][data["INT_USAGE"]== 2] <- 0
#INT_YEARS:
# 6 = No Response (Valid Skips & Refusals) 7 = Don't Know
data["INT_YEARS"][data["INT_YEARS"] == 8] <- 6
#LU_FROM_WORK:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_FROM_WORK"][data["LU_FROM_WORK"] == 9] <- 6
#LU_FROM_SCHOOL:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_FROM_SCHOOL"][data["LU_FROM_SCHOOL"] > 7] <- 6
#LU_FROM_LIBRARY:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_FROM_LIBRARY"][data["LU_FROM_LIBRARY"] == 9] <- 6
#LU_OTHERS:
# 6 = No Response (Valid Skips & Not Stated) 7 = Don't Know
data["LU_OTHERS"][data["LU_OTHERS"] == 9] <- 6
#LU_OTHERS_RELATIVE:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_OTHERS_RELATIVE"][data["LU_OTHERS_RELATIVE"] > 7] <- 6
#LU_OTHERS_FRIEND:
# 6 = No Response (Valid Skips, Refusal & Not Stated) 7 = Don't Know
data["LU_OTHERS_FRIEND"][data["LU_OTHERS_FRIEND"] >7] <- 6
#LU_OTHERS_MISC:
# 6 = No Response (Valid Skips & Not Stated)
data["LU_OTHERS_MISC"][data["LU_OTHERS_MISC"] == 9] <- 6
str(data) #Checking the structure of our data set
#Handling categorical variables: Setting factors & levels to categories
data$ID <- factor(data$ID)
data$PROVINCE <- factor(data$PROVINCE, levels=c("10","11","12","13","24","35","46","47","48","59"))
data$REGION <- factor(data$REGION, levels = c("1","2","3","4","5","6"))
data$URBAN_RURAL <- factor(data$URBAN_RURAL,levels = c("1","2","3","4","5","6"))
data$AGE_GROUP <- factor(data$AGE_GROUP,levels = c("1","2","3","4","5","6"))
data$SEX <- factor(data$SEX,levels = c("0","1"))
data$ED_LEVEL <- factor(data$ED_LEVEL,levels = c("1","2","3"))
data$STUDENT_STATUS <- factor(data$STUDENT_STATUS,levels = c("0","1"))
data$EMPLOYMENT_STATUS <- factor(data$EMPLOYMENT_STATUS,levels = c("1","2","3"))
data$HOUSE_TYPE <- factor(data$HOUSE_TYPE,levels = c("1","2","3","4"))
data$HOUSE_SIZE <-factor(data$HOUSE_SIZE,levels=c("1","2","3","4"))
data$HIGH_EDU <- factor(data$HIGH_EDU,levels = c("1","2","3"))
data$STUD_IN_HOUSE <- factor(data$STUD_IN_HOUSE,levels = c("0","1"))
data$INT_USAGE <- factor(data$INT_USAGE,levels = c("0","1"))
data$INT_YEARS <- factor(data$INT_YEARS,levels =c("1","2","3","4","6","7"))
data$LU_FROM_HOME<- factor(data$LU_FROM_HOME,levels = c("1","2","6","7"))
data$LU_FROM_WORK <- factor(data$LU_FROM_WORK ,levels = c("1","2","6","7"))
data$LU_FROM_SCHOOL <- factor(data$LU_FROM_SCHOOL,levels = c("1","2","6","7"))
data$LU_FROM_LIBRARY <- factor(data$LU_FROM_LIBRARY ,levels = c("1","2","6","7"))
data$LU_OTHERS <- factor(data$LU_OTHERS ,levels = c("1","2","6","7"))
data$LU_OTHERS_RELATIVE <- factor(data$LU_OTHERS_RELATIVE ,levels = c("1","2","6","7"))
data$LU_OTHERS_FRIEND <- factor(data$LU_OTHERS_FRIEND ,levels = c("1","2","6","7"))
data$LU_OTHERS_MISC <- factor(data$LU_OTHERS_MISC ,levels = c("1","2","6"))
Now, we have completed modifying certain values in various columns and assigning factors to variables to make them categorical.
Let’s have a look at the summary and dimensions of our Dataset.
summary(data)
## ID PROVINCE REGION URBAN_RURAL AGE_GROUP SEX
## 1 : 1 35 :6518 1:3798 1: 975 1:1981 0:12817
## 2 : 1 24 :4437 2:4437 2: 1019 2:3269 1:10361
## 3 : 1 59 :2533 3:6518 3: 817 3:3749
## 4 : 1 48 :2242 4:3650 4:12143 4:4555
## 5 : 1 46 :2023 5:2242 5: 7632 5:4419
## 6 : 1 47 :1627 6:2533 6: 592 6:5205
## (Other):23172 (Other):3798
## ED_LEVEL STUDENT_STATUS EMPLOYMENT_STATUS HOUSE_TYPE HOUSE_SIZE HIGH_EDU
## 1:9082 0:21464 1:13559 1: 5639 1:6304 1: 5995
## 2:9753 1: 1714 2: 1045 2:10395 2:8831 2:11104
## 3:4343 3: 8574 3: 6304 3:3474 3: 6079
## 4: 840 4:4569
##
##
##
## STUD_IN_HOUSE INT_USAGE INT_YEARS LU_FROM_HOME LU_FROM_WORK LU_FROM_SCHOOL
## 0:19089 0: 5599 1: 607 1:16046 1: 6740 1: 2364
## 1: 4089 1:17579 2: 891 2: 950 2:10253 2:14601
## 3: 2586 6: 6181 6: 6183 6: 6211
## 4:13461 7: 1 7: 2 7: 2
## 6: 5600
## 7: 33
##
## LU_FROM_LIBRARY LU_OTHERS LU_OTHERS_RELATIVE LU_OTHERS_FRIEND LU_OTHERS_MISC
## 1: 2118 1: 5453 1: 1974 1: 2355 1: 2641
## 2:14869 2:11531 2: 3462 2: 3081 2: 2795
## 6: 6185 6: 6186 6:17738 6:17738 6:17742
## 7: 6 7: 8 7: 4 7: 4
##
##
##
dim(data)
## [1] 23178 23
The summary function gives us a clear idea of the frequency/occurrences of the different values in our dataset, since we are working with categorical data. With the dim function we confirmed that there are 23,178 rows and 23 columns in our data set, and no NA values are present.
While going through the summary of the dataset, we realized that every variable is categorical, and in general the concept of outliers applies to continuous variables. However, categorical levels with a marginally lower frequency of occurrence can also be treated as outliers. In this dataset we addressed this in the code snippet above, where we combined certain values (Valid Skips, Refusals, etc.) of specific columns that behaved like outliers into one broad category (No Response). However, the Don't Know value was not touched, as it may be useful for further analysis. Since we have completed our data pre-processing by checking for missing values, modifying columns that are not required for our analysis, and combining certain column values under one subsection, we are now ready to move ahead with the data visualization stage.
In the following portion, we will identify the trends and weaknesses of the survey data using exploratory data analysis. This technique involves examining data sets to highlight their key features, frequently using statistical graphics and other visual analytics. At the end of the analysis, we will be able to identify the key elements that reveal the major challenges in the available data and will be in a position to offer forward-thinking suggestions and troubleshooting solutions.
By segmenting a circular statistical visual into sectors or portions, pie charts are good tools for highlighting the composition of the data and exhibiting numerical proportions. As part of the analysis, the chart below depicts the proportion of male and female survey participants.
Let’s have a look at a pie chart to better understand the data we are working with. We’ll start by looking at what percent of males and females took part in the survey.
#Visualizations
#1. Pie Chart: Male & Female respondents
count <- table(data$SEX)
pct <- round(count/sum(count)*100)
lbls <- c("Female","Male")
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(count, labels = lbls, col=rainbow(length(lbls)), main="Percentage of Males & Females")
Findings: Roughly 55% of the respondents are female (12,817) and 45% are male (10,361).
Inference:
In the resulting pie chart, the female population, covering the major portion, is shown in red, and the male population in cyan, making it easy to visibly distinguish between the two groups.
The primary goal of the horizontal bar chart below is to analyze the percentage distribution of participants across the provinces. In the code below, we divide the province counts by their sum and multiply by 100 to obtain percentages. R uses the barplot() function to create bar charts and can draw both vertical and horizontal bars; here we use a horizontal bar plot. We then add the percentages to the labels, widen the margins, and adjust the titles to get a clean horizontal visualization.
#2. Horizontal Bar Plot: Distribution of Participants based on Province
province_counts <- table(data$PROVINCE)
namesarg <- c("NL","PE","NS","NB","QC","ON","MB", "SK","AB", "BC")
pct <- round(province_counts/sum(province_counts)*100)
namesarg <- paste(namesarg, pct) # add percents to labels
namesarg <- paste(namesarg,"%",sep="") # add % to labels
par(mar=c(5.1, 13 ,4.1 ,2.1)) # Moving margins
province_plot <- barplot(province_counts,names.arg=namesarg,col=rainbow(length(province_counts)),
las=1, horiz = T,xlim = c(0,7000), xlab="",ylab = "",
main="Distribution of Participants based on Province")
title(xlab="Number of Participants",ylab="Province",line=0, cex.lab=1.2) # Adjusting the Title
Findings:
Based on the above visualization, it is clear that Ontario contributes the largest share of participants (6,518, about 28%), followed by Quebec (4,437, about 19%), while the small Atlantic provinces contribute the fewest.
Inference:
Most people who use the internet for personal purposes are found in Ontario and Quebec, which may be related to their dense populations. On the other hand, regions like Prince Edward Island and Newfoundland and Labrador have the fewest internet users. Additionally, communication tools tend to grow and modernize along with locations; online shopping, information searching, social networking, entertainment, and the like could eventually drive a rise in regional internet usage for private, non-commercial purposes.
The bar plot below differentiates the age categories that have used the internet the most and the least. A bar chart represents data as rectangular bars, with the length of each bar proportional to the value of the variable.
#3.Bar plot: Which Age group uses the internet most & least?
age <- table(data[data$INT_USAGE==1, "AGE_GROUP"])
age_group <- c("16-24","25-34","35-44","45-54","55-64",">64")
age_plot <- barplot(age,names.arg=age_group, las=1,col=rainbow(length(age)),
main="Distribution of Usage by Age Group",
xlab = "Age Group",
ylab="Number of Participants")
text(age_plot,age,labels = age, pos = 1)
Findings:
The 45-54 age group uses the internet the most, with 3,821 users, followed by the 35-44 group with 3,500 users. The 16-24 and over-64 categories have the same number of internet users, 1,952 each.
Inference: The respondents in the 45-54 age group use the internet the most. They are largely of working age and were among the first generations to adopt the internet, so they are likely to use it more than the other age categories.
The objective of this visualization is to examine how respondents' internet usage is distributed by sex. In the code below, we use gender as a common denominator to assess the distribution of each group with a bar plot, which lets us determine which gender category has used the internet the most. The bar plot below helps us answer the question: which gender makes the most out of the internet?
#4. Barplot: Which gender makes the most out of the internet?
gender_count <- table(data[data$INT_USAGE==1,"SEX"])
gender <- c("Female", "Male")
gender_plot <- barplot(gender_count,names.arg=gender, las=1,col=rainbow(length(gender_count)),
main="Distribution of Usage by Gender", ylim = c(0,10500),
xlab = "Gender",
ylab="Number of Participants")
text(gender_plot,gender_count,labels = gender_count, pos = 3)
Findings: With 9629 female Internet users and 7950 male Internet users, it is clear that women have used the internet more than men.
Inference: Internet usage is higher in the female group than in the male group. One contributing factor may simply be that women make up a larger proportion of the survey respondents. To distinguish the two gender categories in the chart above, each bar is given a different color, and the bars display the comparison between the female and male groups.
The below pie chart visualization depicts the distribution of students using the Internet with respect to their gender. We have used pie chart visualization since it displays relative proportions of multiple classes of data.
#5. Pie chart: Portfolio of students who use the internet based on gender
Student_gender_count <- table(data[data$STUDENT_STATUS==1 & data$INT_USAGE==1,"SEX"])
Student_gender_pct <- round(Student_gender_count/sum(Student_gender_count)*100)
Student_gender_lbls <- c("Female","Male")
Student_gender_lbls <- paste(Student_gender_lbls, Student_gender_pct) # add percents to labels
Student_gender_lbls <- paste(Student_gender_lbls,"%",sep="") # ad % to labels
pie(Student_gender_count, labels = Student_gender_lbls,
col=c("yellow", "black"), main="Distribution of Students Using the Internet Based on Gender")
Findings: About 60% of the internet-using students are female, with male students making up the remaining 40%.
Inference: The findings indicate that a significant share of the student internet users are female. This may simply reflect the larger number of female students among the respondents.
Our goal in conducting this analysis is to examine the respondents who utilize the internet by age and gender. By doing so, we will be better able to determine which age group has been utilizing the internet or other web services the most. We will also be able to identify the gender that has utilized the Internet the most. In order to display each data category in a frequency distribution, we used a grouped bar chart for this visualization. We can represent relative percentages or numbers of various categories using a group bar plot, and it also aids in visually presenting a summary of our sizable dataset.
#6. Grouped bar plot for: Gender and age classified by internet usage
gender_age <- data[data$INT_USAGE==1,]
gender_age <- within(gender_age, {
gender_age.cat <- NA
gender_age.cat[AGE_GROUP=="1"] <- "16-24"
gender_age.cat[AGE_GROUP=="2"] <- "25-34"
gender_age.cat[AGE_GROUP=="3"] <- "35-44"
gender_age.cat[AGE_GROUP=="4"] <- "45-54"
gender_age.cat[AGE_GROUP=="5"] <- "55-64"
gender_age.cat[AGE_GROUP=="6"] <- ">64"
})
ggplot(gender_age, # Grouped bar plot using ggplot2
aes(x = gender_age.cat,
fill= SEX)) +
scale_fill_discrete(labels = c("Female", "Male"))+
labs(x = "Age Group", y = "Number of Participants",title = "Gender and age classified by internet usage")+
theme_dark()+ theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
geom_bar(position = "dodge")
Findings:
Inference: It can be seen from the above visualization that, regardless of age group, women use the internet more frequently than men. Internet usage is similar for the 16-24 and over-64 groups, each with roughly 1,750 female users and around 1,500 male users. Likewise, the 25-34 and 55-64 age groups show similar usage for both sexes, with counts of about 1,450 and 1,750 respectively.
Our fundamental goal for performing the below analysis is to compare the education level of all the provinces and provide insights. The reason behind choosing a stacked bar chart is that it makes the comparison of the data points even easier, and it makes the interpretation effective.
#7.Stacked Bar Plot: Education Level in Each Province
#Setting province Codes
x <- within(data, {
province.cat <- NA
province.cat[PROVINCE==10] <- "NL"
province.cat[PROVINCE==11] <- "PE"
province.cat[PROVINCE==12] <- "NS"
province.cat[PROVINCE==13] <- "NB"
province.cat[PROVINCE==24] <- "QC"
province.cat[PROVINCE==35] <- "ON"
province.cat[PROVINCE==46] <- "MB"
province.cat[PROVINCE==47] <- "SK"
province.cat[PROVINCE==48] <- "AB"
province.cat[PROVINCE==59] <- "BC"
})
ggplot(x, aes(fill=ED_LEVEL,y=ED_LEVEL,x=province.cat)) +
scale_fill_discrete(labels = c("High school or less", "College or some post-secondary",
"University certificate or degree"), name = "Education Level")+
labs(x = "Province", y = " Education Level",title = "Education Level in each Province")+
theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
geom_bar(position="stack", stat="identity")
Findings:
Inference: The stacked bar chart above makes it evident that, in all provinces, a higher proportion of respondents have college or some post-secondary education than the other categories, while the proportion of respondents with university degrees is moderate.
Let's generate a pie chart to show the distribution of internet users and non-users.
#8. Distribution of internet users and non users in the survey?
int_usage_count <- table(data["INT_USAGE"])
int_usage_pct <- round(int_usage_count/sum(int_usage_count)*100)
int_usage_lbls <- c("No","Yes")
int_usage_lbls <- paste(int_usage_lbls, int_usage_pct) # add percents to labels
int_usage_lbls <- paste(int_usage_lbls,"%",sep="") # add % to labels
pie(int_usage_count, labels = int_usage_lbls,
col=c("Orange", "light blue"), main="Distribution of Internet Users and Non Internet Users")
Findings: About 76% of the respondents have used the internet, while about 24% have not (17,579 users vs. 5,599 non-users).
Inference: The pie chart shows the internet users in light blue and the non-users in orange, making it simple to see how the two groups differ from one another.
The bar plot below illustrates the distribution of users based on years of usage.
#9. Distribution of Users Based on Years of Usage
years_usage <- data["INT_YEARS"]
years_usage <- within(years_usage, {
years.cat <- NA
years.cat[INT_YEARS==1] <- "< 1 Year"
years.cat[INT_YEARS==2] <- "1-2 Years"
years.cat[INT_YEARS==3] <- "2-5 Years"
years.cat[INT_YEARS==4] <- "> 5 Years"
years.cat[INT_YEARS==6] <- "Skipped"
years.cat[INT_YEARS==7] <- "Don't Know"
})
tab1(years_usage$years.cat, sort.group = "decreasing", cum.percent = TRUE, main = "Distribution of Users Based on Years of Usage", xlab = "Years of Usage")
## years_usage$years.cat :
## Frequency Percent Cum. percent
## > 5 Years 13461 58.1 58.1
## Skipped 5600 24.2 82.2
## 2-5 Years 2586 11.2 93.4
## 1-2 Years 891 3.8 97.2
## < 1 Year 607 2.6 99.9
## Don't Know 33 0.1 100.0
## Total 23178 100.0 100.0
Findings: A majority of respondents (58.1%) have used the internet for more than 5 years, while only 2.6% have used it for less than a year; 24.2% skipped the question.
Inference: The analysis deviates when the non-response categories are included, since those respondents either didn't know their own practice or didn't want to report it. Still, a large portion of the respondents belong to the group that has been a notable lead in internet usage for years.
The association between internet usage and household occupant count is evaluated graphically in the mosaic plot below. The key justification for selecting a mosaic plot is that it lets us identify the relationship between two categorical variables; it also summarizes the data and makes it possible to find correlations between multiple factors. The blue tiles in the mosaic plot represent substantial positive residuals, where the observed frequency is higher than expected, whereas the red tiles represent significant negative residuals, where the observed frequency is lower than expected.
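The shading in such plots is driven by Pearson residuals, (observed - expected) / sqrt(expected), from a chi-squared test of independence. As an illustrative sketch (not part of the original analysis), these residuals can be inspected directly for our two variables:
#Sketch: Pearson residuals behind the mosaic shading
#residual = (observed - expected) / sqrt(expected) for each cell
usage_by_size <- table(data$INT_USAGE, data$HOUSE_SIZE)
chisq.test(usage_by_size)$residuals #positive values shade blue, negative values shade red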
#10.Mosaic Plot: Relationship between family size and internet usage
mosaic <- data[,c("INT_USAGE","HOUSE_SIZE","STUDENT_STATUS")]
mosaic <- within(mosaic, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "No"
int_usage.cat[INT_USAGE=="1"] <- "Yes"
})
mosaic <- within(mosaic, {
house_size.cat <- NA
house_size.cat[HOUSE_SIZE=="1"] <- "1 Person"
house_size.cat[HOUSE_SIZE=="2"] <- "2 People"
house_size.cat[HOUSE_SIZE=="3"] <- "3 People"
house_size.cat[HOUSE_SIZE=="4"] <- ">= 4 People"
})
mosaic <- within(mosaic, {
student_status.cat <- NA
student_status.cat[STUDENT_STATUS=="0"] <- "Not Student"
student_status.cat[STUDENT_STATUS=="1"] <- "Student"
})
mosaic_count <- table(mosaic[,c("int_usage.cat","house_size.cat","student_status.cat")])
mosaic(~int_usage.cat + house_size.cat,data = mosaic_count, shade = T,legend = TRUE,
labeling = labeling_border(abbreviate_labs = c(3, 1, 10)),
main = "Relationship Between Internet Usage
and Number of People in House")
Findings:
Inference: Since the proportions differ across levels of the two variables (internet usage and the number of persons in the home), they appear to be dependent on one another, and there seems to be a direct relationship between internet usage and household size. However, we would need a more in-depth analysis to draw a firm conclusion.
We can better comprehend the connection between family size and internet usage by also looking at student status in the mosaic plot below. Since all 3 variables are categorical in nature, we again use a mosaic plot.
#11.Mosaic Plot: Understanding the relationship between family size, student status and internet usage
mosaic(~int_usage.cat + house_size.cat + student_status.cat,
data = mosaic_count,shade = TRUE,
labeling = labeling_border(rot_labels = c(0, 45, 90, 90), abbreviate_labs = c(10, 1, 5)),
legend = TRUE,
main = "Family Size and Internet Usage")
Findings:
Inference: Overall, the mosaic plot above is used to judge whether the three variables are associated: the shading marks combinations that occur more or less often than expected under independence, suggesting an association between internet usage, household size, and student status.
In this contingency bar plot, the categorical variables are displayed side by side, effectively visualizing a two-way table. It is also a useful way to count the number of observations for each combination of levels of the categorical variables.
#12.Contingency Bar Plot: Relationship between Employment Status, Sex and Internet Usage
contingency_plot <- data[,c("INT_USAGE","EMPLOYMENT_STATUS", "SEX")]
contingency_plot <- within(contingency_plot, {
emp_status.cat <- NA
emp_status.cat[EMPLOYMENT_STATUS=="1"] <- "Employed"
emp_status.cat[EMPLOYMENT_STATUS=="2"] <- "Unemployed"
emp_status.cat[EMPLOYMENT_STATUS=="3"] <- "Unable to Work"
})
contingency_plot <- within(contingency_plot, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "Never Used Internet"
int_usage.cat[INT_USAGE=="1"] <- "Used Internet"
})
contingency_plot <- within(contingency_plot, {
sex.cat <- NA
sex.cat[SEX=="0"] <- "Female"
sex.cat[SEX=="1"] <- "Male"
})
ggplot(contingency_plot, aes(x = emp_status.cat))+
labs(x = "Employment Status", y = " No.of Participant",title = "Usage of Internet based on Gender")+
geom_bar(aes(fill = sex.cat),color = "white", position = position_dodge(0.9))+
labs(fill="Gender of Respondent")+
facet_wrap(~int_usage.cat,as.table = TRUE)+
theme(axis.text = element_text(face = "bold",size = 10, angle = 20, hjust = 0.75))+
fill_palette("jco")
Findings:
Inference: This suggests that working professionals use the internet more frequently than the unemployed members of the community. If a new strategic plan is to be implemented, the focus should therefore be on the categories highlighted above.
Balloon plots are typically intended as a multi-panel replacement for heat maps. They work especially well for many groups of observations, where the size of each dot indicates the magnitude of the corresponding component. They can also display the contingency table created by the combination of two categorical variables.
#13. Multi Panel Balloon Plot: Are employed people more likely using the internet?
emp_count <- as.data.frame(table(contingency_plot))
ggballoonplot(emp_count, x = "emp_status.cat", y = "int_usage.cat", size = "Freq",
fill = "Freq", facet.by = "sex.cat",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: The scale and color of the graph make it simple to read off and analyze the results, and the frequency encoding makes the interpretation straightforward. From this plot we see that, within each employment category, internet usage among men is essentially identical to that of women, and both genders also follow nearly identical patterns among those who do not use the internet.
The locations where people access the internet are crucial to understanding its distribution. The participants and their primary internet usage locations are visualized precisely below.
#14. Where do People mostly Use Internet From?
location_data <- data[,c("PROVINCE",
"LU_FROM_HOME",
"LU_FROM_WORK",
"LU_FROM_SCHOOL",
"LU_FROM_LIBRARY",
"LU_OTHERS_RELATIVE",
"LU_OTHERS_FRIEND",
"LU_OTHERS_MISC")]
location_data <- within(location_data, {
place <- NA
place[LU_FROM_HOME=="1"] <- "Home"
place[LU_FROM_WORK=="1"] <- "Work"
place[LU_FROM_SCHOOL=="1"] <- "School"
place[LU_FROM_LIBRARY=="1"] <- "Library"
place[LU_OTHERS_RELATIVE=="1"] <- "Relative's"
place[LU_OTHERS_FRIEND=="1"] <- "Friend's"
place[LU_OTHERS_MISC=="1"] <- "Misc"
})
location_data <- within(location_data, {
province.cat <- NA
province.cat[PROVINCE==10] <- "NL"
province.cat[PROVINCE==11] <- "PE"
province.cat[PROVINCE==12] <- "NS"
province.cat[PROVINCE==13] <- "NB"
province.cat[PROVINCE==24] <- "QC"
province.cat[PROVINCE==35] <- "ON"
province.cat[PROVINCE==46] <- "MB"
province.cat[PROVINCE==47] <- "SK"
province.cat[PROVINCE==48] <- "AB"
province.cat[PROVINCE==59] <- "BC"
})
location_data <- na.omit(location_data)
loc <- as.data.frame(table(location_data[,c("place", "province.cat")]))
tab1(location_data$place, sort.group = "decreasing", cum.percent = TRUE, main = "Location of Internet Use", xlab = "Number of Participants")
## location_data$place :
## Frequency Percent Cum. percent
## Home 6417 37.8 37.8
## Work 3267 19.2 57.0
## Misc 2641 15.5 72.6
## Friend's 1792 10.5 83.1
## Relative's 1003 5.9 89.0
## Library 954 5.6 94.6
## School 912 5.4 100.0
## Total 16986 100.0 100.0
Findings: Home is the most common location of internet use (37.8%), followed by work (19.2%); libraries (5.6%) and schools (5.4%) together account for about 11%.
Inference: Compared with other venues, a substantial fraction of respondents use the internet at home. This makes sense, since people use technologies more frequently where they have more time available to them. The roughly 11% combined share for libraries and schools indicates how intensively the educational system depends on these technologies.
As already noted, the balloon plot is useful for determining how categorical variables relate to one another. The graph below adds a dimension compared with the previous one, making it easier to see how locations and provinces are related.
#15. Balloon Plot: Distribution of Internet Usage Based on Province & Location
ggballoonplot(loc, x = "place", y = "province.cat", size = "Freq",
fill = "Freq",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: Being one of Canada's most populous provinces, Ontario accounts for the majority of internet usage in the survey for a variety of reasons. Quebec exhibits a similar pattern with only minor variations, showing that internet usage is also very common there. From this, it can be inferred that the distribution is fairly even across Canada's developed provinces, while other emerging provinces still lag behind.
The balloon plot below shows internet usage by both students and non-students. We chose to examine these dynamics in depth since student status and employment status are so important to how the internet is used today.
#16. Balloon Plot: Who are Accessing the Internet More?
int_access <- data[,c("INT_USAGE","EMPLOYMENT_STATUS", "STUDENT_STATUS")]
int_access <- within(int_access, {
emp_status.cat <- NA
emp_status.cat[EMPLOYMENT_STATUS=="1"] <- "Employed"
emp_status.cat[EMPLOYMENT_STATUS=="2"] <- "Unemployed"
emp_status.cat[EMPLOYMENT_STATUS=="3"] <- "Unable to Work"
})
int_access <- within(int_access, {
int_usage.cat <- NA
int_usage.cat[INT_USAGE=="0"] <- "Never Used Internet"
int_usage.cat[INT_USAGE=="1"] <- "Used Internet"
})
int_access <- within(int_access, {
std_status.cat <- NA
std_status.cat[STUDENT_STATUS=="0"] <- "Not a Student"
std_status.cat[STUDENT_STATUS=="1"] <- "Student"
})
int_access_count <- as.data.frame(table(int_access[,c("emp_status.cat", "std_status.cat", "int_usage.cat")]))
ggballoonplot(int_access_count, x = "emp_status.cat", y = "int_usage.cat", size = "Freq",
fill = "Freq", facet.by = "std_status.cat",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
Findings:
Inference: Internet usage among the employed far outpaces that of respondents who are actually enrolled as students. Despite the significant differences, the variables reveal several characteristics of the population. Additionally, the unemployed population has very limited internet access, regardless of student status.
From the visualizations constructed during the exploratory data analysis, we were able to observe the dynamics of the dataset, categorize its values, and identify relationships. Furthermore, we were also able to locate the outliers in the dataset, organize it, and comprehend the dynamics of each variable.
From all these visualizations and variables, we will be able to perform predictive analysis making use of the dataset to forecast future trends and occurrences. As part of the next steps in our analysis, we will determine if a rise in family size will result in a decline in Internet usage. We’ll also examine whether the number of inhabitants in a household has an impact on Internet usage.
Given that the variables are categorical in nature, predictive methods such as logistic regression will help us attain the aforementioned goals. Logistic regression is frequently used for prediction and classification problems; it is also simple to implement, interpret, and train, and it is quite effective.
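As a hedged sketch of this workflow (on a small synthetic data frame, not our actual survey variables), a binary logistic regression in R models log(p/(1-p)) as a linear function of the predictors via glm():
#Sketch: general pattern of a binary logistic regression on synthetic data
set.seed(42)
toy <- data.frame(y = factor(rbinom(100, 1, 0.5)), x = factor(sample(1:3, 100, replace = TRUE)))
toy_model <- glm(y ~ x, data = toy, family = binomial(link = "logit")) #fit on the logit scale
head(predict(toy_model, toy, type = "response")) #predicted probabilities between 0 and 1
We will apply this same glm() + predict() pattern to our survey variables below.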
Will a respondent more likely use the internet from Home/Work/Both?
What is the likeliness of a respondent using the internet from a public library?
These questions might help us with predictions in our next phase of analysis.
From our exploratory analysis we certainly gained some useful insights. The first thing that we would like to modify in this dataset is to eliminate certain variables that are redundant.
REGION: The Region column doesn't really convey valuable insights; the Province column carries more precise information.
G_URBRUR: The community where the respondent lives is not very specific and only has 3 major cities, so this can be eliminated as it won't be useful for further analysis.
GFAMTYPE: The family type variable is redundant, as the G_HHSIZE column conveys the same information in a more specific manner.
G_HEDUC: The highest level of education in a particular household will not be used for further analysis, so we'll eliminate this variable.
G_HSTUD: The student status column provides more useful data; knowing whether there is another student in the household doesn't seem very important for our analysis.
#Predictive Analyses
#Reduction of Dimensions
reduced_data <- data[,-c(3,4,10,12,13)] #Eliminating columns that are not required for analysis
As a common rule of thumb, we will build each predictive model with a step-by-step process: test the associations between variables, select the significantly associated predictors, split the data into training and testing sets, fit the model, and assess its performance with a confusion matrix.
To understand the association between variables we'll be using 2 tests:
Pearson's Chi-squared test of Independence: Pearson's chi-squared test of independence will help us understand whether any 2 variables are dependent. However, one limitation of the chi-squared test is that it does not give us the strength of the association between the 2 variables.
Cramer's V Test: Cramer's V takes a step beyond the conventional chi-squared test and helps us understand the strength of the association between 2 variables. In other words, Cramer's V gives us more specific details regarding the association of the 2 variables.
We'll be using both of these tests to understand the dynamics of the variables in the predictive models that we construct.
First, let's understand the dynamics of our data using a contingency table.
#1. Predicting if a respondent is a student or not using age, employment status, internet usage
stud_pred <- reduced_data[,c("STUDENT_STATUS","AGE_GROUP","INT_USAGE", "EMPLOYMENT_STATUS", "SEX")]
cont_stud_pred <- table(stud_pred$STUDENT_STATUS, stud_pred$AGE_GROUP)
cont_stud_pred
##
## 1 2 3 4 5 6
## 0 920 2939 3570 4449 4381 5205
## 1 1061 330 179 106 38 0
Now let’s have a look at the association of these variables with the help of the Pearson’s Chi-squared test & the Cramer’s V Test
#Performing Chi-Squared & Cramer's V Test to understand the relationship between variables
chisq.test(stud_pred$STUDENT_STATUS, stud_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: stud_pred$STUDENT_STATUS and stud_pred$AGE_GROUP
## X-squared = 7099.2, df = 5, p-value < 2.2e-16
assocstats(xtabs(~stud_pred$STUDENT_STATUS + stud_pred$AGE_GROUP))#Strong Association
## X^2 df P(> X^2)
## Likelihood Ratio 4468.4 5 0
## Pearson 7099.2 5 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.484
## Cramer's V : 0.553
The p-value from our chi-squared test was < 2.2e-16, which is less than 0.05. Hence, we can reject the null hypothesis that the 2 variables are independent and conclude that they are dependent and related.
Our Cramer's V test shows a value of 0.553. An effect size between 0.2 and 0.6 is considered to convey moderate association. Hence, we can conclude that Student Status and Age Group are moderately associated with each other.
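As a quick sanity check, Cramer's V can be reproduced by hand from the chi-squared statistic using V = sqrt(X^2 / (n * min(r-1, c-1))); with X^2 = 7099.2, n = 23178 respondents, and min(2-1, 6-1) = 1 for our 2 x 6 table:
#Sketch: reproducing Cramer's V from the chi-squared statistic
sqrt(7099.2 / (23178 * 1)) #= 0.553, matching the assocstats() output above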
Now, let's continue the same process for the other related variables.
chisq.test(stud_pred$STUDENT_STATUS, stud_pred$INT_USAGE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: stud_pred$STUDENT_STATUS and stud_pred$INT_USAGE
## X-squared = 521.81, df = 1, p-value < 2.2e-16
assocstats(xtabs(~stud_pred$STUDENT_STATUS + stud_pred$INT_USAGE)) #Decent Association
## X^2 df P(> X^2)
## Likelihood Ratio 788.26 1 0
## Pearson 523.15 1 0
##
## Phi-Coefficient : 0.15
## Contingency Coeff.: 0.149
## Cramer's V : 0.15
chisq.test(stud_pred$STUDENT_STATUS, stud_pred$EMPLOYMENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: stud_pred$STUDENT_STATUS and stud_pred$EMPLOYMENT_STATUS
## X-squared = 39.171, df = 2, p-value = 3.12e-09
assocstats(xtabs(~stud_pred$STUDENT_STATUS + stud_pred$EMPLOYMENT_STATUS)) #Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 37.536 2 7.0675e-09
## Pearson 39.171 2 3.1196e-09
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.041
## Cramer's V : 0.041
chisq.test(stud_pred$STUDENT_STATUS, stud_pred$SEX)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: stud_pred$STUDENT_STATUS and stud_pred$SEX
## X-squared = 13.84, df = 1, p-value = 0.0001991
assocstats(xtabs(~stud_pred$STUDENT_STATUS + stud_pred$SEX)) #Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 14.126 1 0.00017097
## Pearson 14.028 1 0.00018006
##
## Phi-Coefficient : 0.025
## Contingency Coeff.: 0.025
## Cramer's V : 0.025
After analysing the results, we can see that the variables Age Group and Internet Usage are significantly associated with Student Status. Hence, we can use these 2 variables to build our predictive model.
Now, let’s build our model and predict values for our test data:
#Splitting our data into training data and testing data
set.seed(1)
trainingrows <- sample(nrow(stud_pred), nrow(stud_pred) * 0.8)
stud_pred_training <- stud_pred[trainingrows, ]
stud_pred_testing <- stud_pred[-trainingrows, ]
#Constructing a logistic regression model with stud_pred_training
stud_pred_model <- glm(STUDENT_STATUS ~ AGE_GROUP + INT_USAGE, data = stud_pred_training,
family = binomial(link = "logit"))
summary(stud_pred_model)
##
## Call:
## glm(formula = STUDENT_STATUS ~ AGE_GROUP + INT_USAGE, family = binomial(link = "logit"),
## data = stud_pred_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2555 -0.3145 -0.1530 -0.0001 3.3570
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.00760 0.24192 -4.165 3.11e-05 ***
## AGE_GROUP2 -2.36480 0.08313 -28.448 < 2e-16 ***
## AGE_GROUP3 -3.16326 0.10056 -31.457 < 2e-16 ***
## AGE_GROUP4 -3.73828 0.11818 -31.632 < 2e-16 ***
## AGE_GROUP5 -4.62342 0.18365 -25.175 < 2e-16 ***
## AGE_GROUP6 -19.13797 163.88734 -0.117 0.907
## INT_USAGE1 1.18927 0.23910 4.974 6.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9809.9 on 18541 degrees of freedom
## Residual deviance: 6171.2 on 18535 degrees of freedom
## AIC: 6185.2
##
## Number of Fisher Scoring iterations: 18
pred_stud <- predict(stud_pred_model, stud_pred_testing, type = "response")
We’ll now build a confusion matrix and assess the performance of our model with our predicted values
#Generating a confusion matrix to test the accuracy of our model using training vs testing data
logitpredictions = rep(0,length(pred_stud))
logitpredictions[pred_stud > 0.5] <- 1
cm_stud <- table(logitpredictions, stud_pred_testing$STUDENT_STATUS)
cm_stud
##
## logitpredictions 0 1
## 0 4112 137
## 1 187 200
#Assessing the performance of the model we constructed
tp = length(which((logitpredictions == 1) & (stud_pred_testing$STUDENT_STATUS == 1)))
tn = length(which((logitpredictions == 0) & (stud_pred_testing$STUDENT_STATUS == 0)))
fp = length(which((logitpredictions == 1) & (stud_pred_testing$STUDENT_STATUS == 0)))
fn = length(which((logitpredictions == 0) & (stud_pred_testing$STUDENT_STATUS == 1)))
logitaccuracy <- (tp+tn)/(tp+tn+fp+fn)
logitsensitivity <- tp/(tp+fn)
logitspecificity <- tn/(tn+fp)
logitprecision <- tp/(tp+fp)
logitaccuracy #Calculating the accuracy
## [1] 0.9301122
logitsensitivity #Calculating the sensitivity
## [1] 0.5934718
logitspecificity #Calculating the specificity
## [1] 0.9565015
logitprecision #Calculating the precision
## [1] 0.5167959
We see an accuracy value of 0.9301122 (93.01%) and a specificity value of 0.9565015 (95.65%). Hence, our model is reasonably good at predicting future test data, although the lower sensitivity (59.3%) and precision (51.7%) reflect the class imbalance between students and non-students.
Let's follow the same protocol that we used for the previous prediction and start by finding associations between variables.
#2. Predicting if a respondent has used internet before using age, employment status, sex
int_pred <- reduced_data[,c("INT_USAGE","AGE_GROUP", "EMPLOYMENT_STATUS", "SEX")]
#Performing Chi-Squared Test to better understand if there is a relationship between the variables
chisq.test(int_pred$INT_USAGE, int_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: int_pred$INT_USAGE and int_pred$AGE_GROUP
## X-squared = 6361, df = 5, p-value < 2.2e-16
assocstats(xtabs(~int_pred$INT_USAGE + int_pred$AGE_GROUP)) #Strong Association
## X^2 df P(> X^2)
## Likelihood Ratio 6477.2 5 0
## Pearson 6361.0 5 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.464
## Cramer's V : 0.524
chisq.test(int_pred$INT_USAGE, int_pred$EMPLOYMENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: int_pred$INT_USAGE and int_pred$EMPLOYMENT_STATUS
## X-squared = 3682.2, df = 2, p-value < 2.2e-16
assocstats(xtabs(~int_pred$INT_USAGE + int_pred$EMPLOYMENT_STATUS)) #Strong Association
## X^2 df P(> X^2)
## Likelihood Ratio 3618.1 2 0
## Pearson 3682.2 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.37
## Cramer's V : 0.399
chisq.test(int_pred$INT_USAGE, int_pred$SEX)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: int_pred$INT_USAGE and int_pred$SEX
## X-squared = 7.9511, df = 1, p-value = 0.004806
assocstats(xtabs(~int_pred$INT_USAGE + int_pred$SEX)) #Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 8.0524 1 0.0045443
## Pearson 8.0384 1 0.0045797
##
## Phi-Coefficient : 0.019
## Contingency Coeff.: 0.019
## Cramer's V : 0.019
We can see that AGE_GROUP & EMPLOYMENT_STATUS have a good association with internet usage, while SEX has a weak association. Hence, we'll be using AGE_GROUP & EMPLOYMENT_STATUS to build our prediction model.
#Splitting our data into training data and testing data
set.seed(1)
trainingrows <- sample(nrow(int_pred), nrow(int_pred) * 0.8)
int_pred_training <- int_pred[trainingrows, ]
int_pred_testing <- int_pred[-trainingrows, ]
#Constructing a logistic regression model with int_pred_training
int_pred_model <- glm(INT_USAGE ~ AGE_GROUP + EMPLOYMENT_STATUS, data = int_pred_training,
family = binomial(link = "logit"))
summary(int_pred_model)
pred_int <- predict(int_pred_model, int_pred_testing, type = "response")
Let's construct a confusion matrix and assess the performance of our model with our predicted values.
#Generating a confusion matrix to test the accuracy of our model using training vs testing data
logitpredictions = rep(0,length(pred_int))
logitpredictions[pred_int > 0.5] <- 1
cm_int <- table(logitpredictions, int_pred_testing$INT_USAGE)
cm_int
##
## logitpredictions 0 1
## 0 606 342
## 1 497 3191
#Assessing the performance of the model we constructed
tp = length(which((logitpredictions == 1) & (int_pred_testing$INT_USAGE == 1)))
tn = length(which((logitpredictions == 0) & (int_pred_testing$INT_USAGE == 0)))
fp = length(which((logitpredictions == 1) & (int_pred_testing$INT_USAGE == 0)))
fn = length(which((logitpredictions == 0) & (int_pred_testing$INT_USAGE == 1)))
logitaccuracy <- (tp+tn)/(tp+tn+fp+fn) #Calculating the accuracy
logitsensitivity <- tp/(tp+fn) #Calculating the sensitivity
logitspecificity <- tn/(tn+fp) #Calculating the specificity
logitprecision <- tp/(tp+fp) #Calculating the precision
#Displaying values
logitaccuracy
## [1] 0.819025
logitsensitivity
## [1] 0.9031984
logitspecificity
## [1] 0.5494107
logitprecision
## [1] 0.8652386
With an accuracy of 81.9% and a sensitivity of 90.3%, this model predicts internet users well, though its specificity of 54.9% shows it struggles more with identifying non-users. Next, let's find the variables that have a good association with Employment Status.
#Moving required columns to the new data frame
emp_pred <- reduced_data[,c("EMPLOYMENT_STATUS", "AGE_GROUP", "SEX", "STUDENT_STATUS",
"PROVINCE", "ED_LEVEL", "HOUSE_SIZE", "INT_YEARS")]
emp_pred$EMPLOYMENT_STATUS <- relevel(emp_pred$EMPLOYMENT_STATUS, ref = 1) #Re-leveling for multinomial logistic regression
#Performing Chi-Squared Test to better understand if there is a relationship between the variables
chisq.test(emp_pred$EMPLOYMENT_STATUS, emp_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: emp_pred$EMPLOYMENT_STATUS and emp_pred$AGE_GROUP
## X-squared = 9358.6, df = 10, p-value < 2.2e-16
assocstats(xtabs(~emp_pred$EMPLOYMENT_STATUS + emp_pred$AGE_GROUP)) #Good Association
## X^2 df P(> X^2)
## Likelihood Ratio 9976.8 10 0
## Pearson 9358.6 10 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.536
## Cramer's V : 0.449
chisq.test(emp_pred$EMPLOYMENT_STATUS, emp_pred$INT_YEARS)
##
## Pearson's Chi-squared test
##
## data: emp_pred$EMPLOYMENT_STATUS and emp_pred$INT_YEARS
## X-squared = 4057.1, df = 10, p-value < 2.2e-16
assocstats(xtabs(~emp_pred$EMPLOYMENT_STATUS + emp_pred$INT_YEARS)) #Positive Association
## X^2 df P(> X^2)
## Likelihood Ratio 4038.1 10 0
## Pearson 4057.1 10 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.386
## Cramer's V : 0.296
chisq.test(emp_pred$EMPLOYMENT_STATUS, emp_pred$SEX)
##
## Pearson's Chi-squared test
##
## data: emp_pred$EMPLOYMENT_STATUS and emp_pred$SEX
## X-squared = 333.2, df = 2, p-value < 2.2e-16
assocstats(xtabs(~emp_pred$EMPLOYMENT_STATUS + emp_pred$SEX)) #Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 335.49 2 0
## Pearson 333.20 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.119
## Cramer's V : 0.12
chisq.test(emp_pred$EMPLOYMENT_STATUS, emp_pred$STUDENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: emp_pred$EMPLOYMENT_STATUS and emp_pred$STUDENT_STATUS
## X-squared = 39.171, df = 2, p-value = 3.12e-09
assocstats(xtabs(~emp_pred$EMPLOYMENT_STATUS + emp_pred$STUDENT_STATUS)) #Very Weak Association
## X^2 df P(> X^2)
## Likelihood Ratio 37.536 2 7.0675e-09
## Pearson 39.171 2 3.1196e-09
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.041
## Cramer's V : 0.041
We can see that AGE_GROUP and INT_YEARS have a good association with EMPLOYMENT_STATUS. Hence, we'll use these variables to build our model.
#Splitting our data into training data and testing data
set.seed(1)
trainingrows <- sample(nrow(emp_pred), nrow(emp_pred) * 0.8)
emp_pred_training <- emp_pred[trainingrows, ]
emp_pred_testing <- emp_pred[-trainingrows, ]
# Constructing a multinomial logistic regression with associated variables - internet years and age group
library(nnet)
emp_pred_model <- multinom(EMPLOYMENT_STATUS ~ AGE_GROUP + INT_YEARS, data = emp_pred_training)
## # weights: 36 (22 variable)
## initial value 20370.469056
## iter 10 value 11453.324906
## iter 20 value 11091.575756
## iter 30 value 11022.219696
## final value 11022.210082
## converged
summary(emp_pred_model)
## Call:
## multinom(formula = EMPLOYMENT_STATUS ~ AGE_GROUP + INT_YEARS,
## data = emp_pred_training)
##
## Coefficients:
## (Intercept) AGE_GROUP2 AGE_GROUP3 AGE_GROUP4 AGE_GROUP5 AGE_GROUP6
## 2 -1.2067066 -0.9507013 -1.045068 -0.9640328 -1.07655374 -1.372630
## 3 0.3388662 -1.3029953 -1.442353 -1.2343652 0.03571077 2.416025
## INT_YEARS2 INT_YEARS3 INT_YEARS4 INT_YEARS6 INT_YEARS7
## 2 -0.05408934 -0.3338697 -0.6348954 0.02102686 0.7178079
## 3 -0.39562697 -0.6516500 -1.1133409 -0.11955670 0.4294735
##
## Std. Errors:
## (Intercept) AGE_GROUP2 AGE_GROUP3 AGE_GROUP4 AGE_GROUP5 AGE_GROUP6 INT_YEARS2
## 2 0.2305719 0.12220943 0.11918070 0.11398033 0.12821403 0.23954224 0.2666660
## 3 0.1258848 0.08346045 0.08135671 0.07488551 0.06879825 0.08190181 0.1454427
## INT_YEARS3 INT_YEARS4 INT_YEARS6 INT_YEARS7
## 2 0.2338009 0.2168789 0.2325354 0.8214163
## 3 0.1238099 0.1143149 0.1194972 0.5167312
##
## Residual Deviance: 22044.42
## AIC: 22088.42
pred_emp <- predict(emp_pred_model, emp_pred_testing)
Now let’s build a confusion matrix and calculate the misclassification % of our model
#Generating a confusion matrix to test the accuracy of our model with regards to original data
cm_emp <- table(pred_emp, emp_pred_testing$EMPLOYMENT_STATUS)
cm_emp
##
## pred_emp 1 2 3
## 1 2487 187 616
## 2 0 0 0
## 3 230 18 1098
#Let's calculate the misclassification % of our model
1-sum(diag(cm_emp))/sum(cm_emp)
## [1] 0.2267041
As we can see, the misclassification percentage of our model is 0.2267041, i.e. about 22.7%. This means there is roughly a 22.7% chance of our model being wrong, which is a decent score. Note, however, that the model never predicts class 2 (Unemployed), as shown by the empty second row of the confusion matrix.
Let’s move required columns to a new data frame for our analysis
#Moving required columns to the new data frame
loc_pred <- reduced_data[,c("AGE_GROUP","EMPLOYMENT_STATUS", "SEX", "LU_FROM_HOME",
"LU_FROM_WORK", "STUDENT_STATUS", "PROVINCE", "ED_LEVEL",
"HOUSE_SIZE", "INT_YEARS")]
Now we will classify LU_FROM_HOME and LU_FROM_WORK into a new column, place, which will take the following values:
1 - The respondent has accessed the internet from home only
2 - The respondent has accessed the internet from work only
3 - The respondent has accessed the internet from both home and work
#Creating new variable place which combines data from both columns Home and Work
loc_pred <- within(loc_pred, {
place <- NA
place[LU_FROM_HOME=="1" & LU_FROM_WORK == "2"] <- "1" #Only Home
place[LU_FROM_HOME=="2" & LU_FROM_WORK == "1"] <- "2" #Only Work
place[LU_FROM_HOME=="1" & LU_FROM_WORK == "1"] <- "3" #Both Home & Work
})
Let's have a look at the categorical association between these variables.
#Converting Place variable to factor
loc_pred$place <- as.factor(loc_pred$place)
loc_pred$place <- relevel(loc_pred$place, ref = 1) #Re-levelling for multinomial logistic regression
#Removing NA values for creating a training data set
loc_pred <- na.omit(loc_pred)
#Performing Chi-Squared Test to better understand if there is a relationship between the variables
chisq.test(loc_pred$place, loc_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: loc_pred$place and loc_pred$AGE_GROUP
## X-squared = 1400.3, df = 10, p-value < 2.2e-16
assocstats(xtabs(~loc_pred$place + loc_pred$AGE_GROUP))
## X^2 df P(> X^2)
## Likelihood Ratio 1561.2 10 0
## Pearson 1400.3 10 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.28
## Cramer's V : 0.206
chisq.test(loc_pred$place, loc_pred$EMPLOYMENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: loc_pred$place and loc_pred$EMPLOYMENT_STATUS
## X-squared = 2661.3, df = 4, p-value < 2.2e-16
assocstats(xtabs(~loc_pred$place + loc_pred$EMPLOYMENT_STATUS))
## X^2 df P(> X^2)
## Likelihood Ratio 3087.3 4 0
## Pearson 2661.3 4 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.373
## Cramer's V : 0.284
#Continuing with just Cramer V test
assocstats(xtabs(~loc_pred$place + loc_pred$SEX))
## X^2 df P(> X^2)
## Likelihood Ratio 41.707 2 8.7802e-10
## Pearson 41.751 2 8.5885e-10
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.05
## Cramer's V : 0.05
assocstats(xtabs(~loc_pred$place + loc_pred$STUDENT_STATUS))
## X^2 df P(> X^2)
## Likelihood Ratio 40.780 2 1.3957e-09
## Pearson 33.089 2 6.5300e-08
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.045
## Cramer's V : 0.045
assocstats(xtabs(~loc_pred$place + loc_pred$PROVINCE))
## X^2 df P(> X^2)
## Likelihood Ratio 43.259 18 0.00073538
## Pearson 43.658 18 0.00064553
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.051
## Cramer's V : 0.036
assocstats(xtabs(~loc_pred$place + loc_pred$ED_LEVEL))
## X^2 df P(> X^2)
## Likelihood Ratio 1102.4 4 0
## Pearson 1089.0 4 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.249
## Cramer's V : 0.182
assocstats(xtabs(~loc_pred$place + loc_pred$HOUSE_SIZE))
## X^2 df P(> X^2)
## Likelihood Ratio 202.51 6 0
## Pearson 218.75 6 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.114
## Cramer's V : 0.081
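As an aside, the repeated assocstats() calls above can be collapsed into a single loop that collects Cramér's V for every candidate predictor at once; a minimal sketch, assuming the vcd package is already loaded:
#Collecting Cramer's V for each candidate predictor in one pass
predictors <- c("AGE_GROUP", "EMPLOYMENT_STATUS", "SEX", "STUDENT_STATUS",
                "PROVINCE", "ED_LEVEL", "HOUSE_SIZE")
cramers_v <- sapply(predictors, function(v)
  assocstats(table(loc_pred$place, loc_pred[[v]]))$cramer)
sort(cramers_v, decreasing = TRUE)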
We can see that AGE_GROUP and
EMPLOYMENT_STATUS show the strongest association with the
variable place (Cramér's V of 0.206 and 0.284, respectively).
Hence, we will drop the other variables and proceed with AGE_GROUP and EMPLOYMENT_STATUS to build our model to predict place.
#Splitting our data into training data and testing data
set.seed(100)
trainingrows <- sample(nrow(loc_pred), nrow(loc_pred) * 0.6)
loc_pred_training <- loc_pred[trainingrows, ]
loc_pred_testing <- loc_pred[-trainingrows, ]
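Since the 60/40 split is random rather than stratified, it is worth a quick check that the class mix is comparable in both halves:
#Comparing the class proportions of place across the two halves
round(prop.table(table(loc_pred_training$place)), 3)
round(prop.table(table(loc_pred_testing$place)), 3)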
# Constructing a Multi-nomial logistic regression with associated variables - Employment status and age group
loc_pred_model <- multinom(place ~ AGE_GROUP + EMPLOYMENT_STATUS, data = loc_pred_training)
## # weights: 27 (16 variable)
## initial value 10870.768596
## iter 10 value 7227.535714
## iter 20 value 6801.402861
## iter 30 value 6657.372997
## final value 6656.943400
## converged
summary(loc_pred_model)
## Call:
## multinom(formula = place ~ AGE_GROUP + EMPLOYMENT_STATUS, data = loc_pred_training)
##
## Coefficients:
## (Intercept) AGE_GROUP2 AGE_GROUP3 AGE_GROUP4 AGE_GROUP5 AGE_GROUP6
## 2 -2.83139167 0.4736733 -0.06660337 0.3168686 0.5869540 -0.2461317
## 3 -0.02730885 0.5693568 0.22389316 -0.1184339 -0.1800612 -0.6969082
## EMPLOYMENT_STATUS2 EMPLOYMENT_STATUS3
## 2 -1.238178 -2.504550
## 3 -1.119933 -2.321146
##
## Std. Errors:
## (Intercept) AGE_GROUP2 AGE_GROUP3 AGE_GROUP4 AGE_GROUP5 AGE_GROUP6
## 2 0.22046391 0.26009264 0.26823570 0.24890830 0.25370298 0.4318078
## 3 0.07129431 0.08622257 0.08447988 0.08440498 0.09044237 0.1408269
## EMPLOYMENT_STATUS2 EMPLOYMENT_STATUS3
## 2 0.3644211 0.3008890
## 3 0.1068090 0.0869557
##
## Residual Deviance: 13313.89
## AIC: 13345.89
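Note that summary() on a multinom fit reports coefficients and standard errors but no p-values; approximate ones can be derived from a two-sided Wald z-test, and exponentiated coefficients read as relative risk ratios against the reference level. A minimal sketch:
#Wald z-tests recovered from the coefficient and standard-error matrices
s <- summary(loc_pred_model)
z <- s$coefficients / s$standard.errors
round(2 * (1 - pnorm(abs(z))), 4)
#Exponentiated coefficients: relative risk ratios vs. the reference category
round(exp(coef(loc_pred_model)), 3)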
pred_loc <- predict(loc_pred_model,loc_pred_testing)
Let’s go ahead and build a confusion matrix and calculate the misclassification rate of this model.
#Generating a confusion matrix to test the accuracy of our model with regards to original data
cm_loc <- table(pred_loc, loc_pred_testing$place)
cm_loc
##
## pred_loc 1 2 3
## 1 3062 122 1304
## 2 0 0 0
## 3 859 57 1194
#Let's calculate the misclassification % of our model
1-sum(diag(cm_loc))/sum(cm_loc)
## [1] 0.354956
The misclassification rate here is 0.3550, i.e., about 35.5%, noticeably worse than the employment model; as before, the model never predicts class 2 (work only), which the overall rate obscures. Let’s now make some adjustments to our data frame for the next analysis.
#5. Logistic Regression: Public Library Prediction - How likely is a respondent to access the internet from a public library?
#Moving required columns to the new data frame
library_pred <- reduced_data[reduced_data$LU_FROM_LIBRARY==1|reduced_data$LU_FROM_LIBRARY == 2,c("LU_FROM_LIBRARY","AGE_GROUP","EMPLOYMENT_STATUS",
"STUDENT_STATUS", "PROVINCE", "ED_LEVEL", "INT_YEARS")]
library_pred$LU_FROM_LIBRARY <- as.integer(library_pred$LU_FROM_LIBRARY)
library_pred["LU_FROM_LIBRARY"][library_pred["LU_FROM_LIBRARY"] == 2] <- 0
library_pred$LU_FROM_LIBRARY <- factor(library_pred$LU_FROM_LIBRARY, levels=c("0","1"))
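A quick tabulation after the recode shows how imbalanced the outcome is, which matters later when judging accuracy:
#Checking the class balance of the recoded outcome
table(library_pred$LU_FROM_LIBRARY)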
Now, let’s examine the associations between the variables in this data set.
#Performing Chi-Squared Test to better understand if there is a relationship between the variables
chisq.test(library_pred$LU_FROM_LIBRARY, library_pred$INT_YEARS)
## Warning in chisq.test(library_pred$LU_FROM_LIBRARY, library_pred$INT_YEARS):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: library_pred$LU_FROM_LIBRARY and library_pred$INT_YEARS
## X-squared = 83.184, df = 4, p-value < 2.2e-16
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$INT_YEARS)) #Significant by chi-squared, but V is NaN (sparse cells)
## X^2 df P(> X^2)
## Likelihood Ratio 92.159 5 0
## Pearson NaN 5 NaN
##
## Phi-Coefficient : NA
## Contingency Coeff.: NaN
## Cramer's V : NaN
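The NaN values above (and the earlier warning about the chi-squared approximation) typically trace back to empty or near-empty cells; inspecting the raw cross-tabulation makes the sparse INT_YEARS levels visible:
#Empty or sparse cells in this table explain the NaN association statistics
table(library_pred$LU_FROM_LIBRARY, library_pred$INT_YEARS)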
chisq.test(library_pred$LU_FROM_LIBRARY, library_pred$AGE_GROUP)
##
## Pearson's Chi-squared test
##
## data: library_pred$LU_FROM_LIBRARY and library_pred$AGE_GROUP
## X-squared = 528.21, df = 5, p-value < 2.2e-16
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$AGE_GROUP)) # good association
## X^2 df P(> X^2)
## Likelihood Ratio 451.04 5 0
## Pearson 528.21 5 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.174
## Cramer's V : 0.176
chisq.test(library_pred$LU_FROM_LIBRARY, library_pred$EMPLOYMENT_STATUS)
##
## Pearson's Chi-squared test
##
## data: library_pred$LU_FROM_LIBRARY and library_pred$EMPLOYMENT_STATUS
## X-squared = 86.007, df = 2, p-value < 2.2e-16
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$EMPLOYMENT_STATUS)) #good association
## X^2 df P(> X^2)
## Likelihood Ratio 78.266 2 0
## Pearson 86.007 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.071
## Cramer's V : 0.071
#Continuing with just Cramer's V test
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$STUDENT_STATUS)) #Decent
## X^2 df P(> X^2)
## Likelihood Ratio 468.31 1 0
## Pearson 601.33 1 0
##
## Phi-Coefficient : 0.188
## Contingency Coeff.: 0.185
## Cramer's V : 0.188
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$ED_LEVEL)) #Decent
## X^2 df P(> X^2)
## Likelihood Ratio 90.160 2 0
## Pearson 95.565 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.075
## Cramer's V : 0.075
assocstats(xtabs(~library_pred$LU_FROM_LIBRARY + library_pred$PROVINCE)) #Weak association
## X^2 df P(> X^2)
## Likelihood Ratio 102.560 9 0
## Pearson 98.018 9 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.076
## Cramer's V : 0.076
As we can see here, INT_YEARS, AGE_GROUP,
and EMPLOYMENT_STATUS are associated with
LU_FROM_LIBRARY.
Hence, we’ll use these variables to build our model.
#Splitting our data into training data and testing data
set.seed(1)
trainingrows <- sample(nrow(library_pred), nrow(library_pred) * 0.8)
library_pred_training <- library_pred[trainingrows, ]
library_pred_testing <- library_pred[-trainingrows, ]
#Constructing a logistic regression model with library_pred_training
library_pred_model <- glm(LU_FROM_LIBRARY ~ INT_YEARS + AGE_GROUP + EMPLOYMENT_STATUS,
data = library_pred_training, family = binomial(link = "logit"))
summary(library_pred_model)
##
## Call:
## glm(formula = LU_FROM_LIBRARY ~ INT_YEARS + AGE_GROUP + EMPLOYMENT_STATUS,
## family = binomial(link = "logit"), data = library_pred_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9411 -0.5547 -0.4429 -0.3999 2.5938
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.38928 0.18847 -7.372 1.69e-13 ***
## INT_YEARS2 -0.25350 0.23201 -1.093 0.275
## INT_YEARS3 -0.12307 0.19205 -0.641 0.522
## INT_YEARS4 0.27846 0.17766 1.567 0.117
## INT_YEARS7 -11.50572 129.55275 -0.089 0.929
## AGE_GROUP2 -0.67907 0.08130 -8.353 < 2e-16 ***
## AGE_GROUP3 -0.92679 0.08430 -10.994 < 2e-16 ***
## AGE_GROUP4 -1.16174 0.08656 -13.422 < 2e-16 ***
## AGE_GROUP5 -1.37533 0.09365 -14.686 < 2e-16 ***
## AGE_GROUP6 -1.68588 0.12172 -13.851 < 2e-16 ***
## EMPLOYMENT_STATUS2 0.52577 0.10271 5.119 3.07e-07 ***
## EMPLOYMENT_STATUS3 0.47855 0.06795 7.043 1.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10291.8 on 13588 degrees of freedom
## Residual deviance: 9824.9 on 13577 degrees of freedom
## AIC: 9848.9
##
## Number of Fisher Scoring iterations: 12
We can see that the variable INT_YEARS is not a
significant predictor: none of its coefficients reaches the 0.05 level, and the
very large standard error on INT_YEARS7 (129.55) hints at possible separation.
So, let’s drop INT_YEARS from our model and build a new, reduced model.
#Reducing logistic regression model and eliminating INT_YEARS
library_pred_model_reduced <- glm(LU_FROM_LIBRARY ~ AGE_GROUP + EMPLOYMENT_STATUS,
data = library_pred_training, family = binomial(link = "logit"))
summary(library_pred_model_reduced)
##
## Call:
## glm(formula = LU_FROM_LIBRARY ~ AGE_GROUP + EMPLOYMENT_STATUS,
## family = binomial(link = "logit"), data = library_pred_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9189 -0.5447 -0.4732 -0.4062 2.4260
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.14914 0.06283 -18.290 < 2e-16 ***
## AGE_GROUP2 -0.68379 0.08111 -8.431 < 2e-16 ***
## AGE_GROUP3 -0.95804 0.08399 -11.407 < 2e-16 ***
## AGE_GROUP4 -1.20918 0.08601 -14.059 < 2e-16 ***
## AGE_GROUP5 -1.41914 0.09315 -15.234 < 2e-16 ***
## AGE_GROUP6 -1.73948 0.12117 -14.356 < 2e-16 ***
## EMPLOYMENT_STATUS2 0.50524 0.10245 4.932 8.16e-07 ***
## EMPLOYMENT_STATUS3 0.43522 0.06736 6.461 1.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10291.8 on 13588 degrees of freedom
## Residual deviance: 9864.6 on 13581 degrees of freedom
## AIC: 9880.6
##
## Number of Fisher Scoring iterations: 5
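Two quick follow-ups on the reduced model: exponentiating the coefficients turns log-odds into odds ratios relative to the reference categories, and a likelihood-ratio test against the full model confirms whether dropping INT_YEARS was defensible. A sketch:
#Odds ratios relative to the reference categories (AGE_GROUP1, EMPLOYMENT_STATUS1)
round(exp(coef(library_pred_model_reduced)), 3)
#Likelihood-ratio test: reduced vs. full model (both fit on the same training data)
anova(library_pred_model_reduced, library_pred_model, test = "Chisq")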
pred_library <- predict(library_pred_model_reduced, library_pred_testing, type = "response")
Now let’s build a confusion matrix and assess the performance of our newly built model.
#Generating a confusion matrix comparing predictions against the testing data
logitpredictions <- rep(0, length(pred_library))
logitpredictions[pred_library > 0.5] <- 1
cm_library <- table(logitpredictions, library_pred_testing$LU_FROM_LIBRARY)
cm_library
##
## logitpredictions 0 1
## 0 2992 406
#Assessing the performance of the model we constructed
tp <- length(which((logitpredictions == 1) & (library_pred_testing$LU_FROM_LIBRARY == 1)))
tn <- length(which((logitpredictions == 0) & (library_pred_testing$LU_FROM_LIBRARY == 0)))
fp <- length(which((logitpredictions == 1) & (library_pred_testing$LU_FROM_LIBRARY == 0)))
fn <- length(which((logitpredictions == 0) & (library_pred_testing$LU_FROM_LIBRARY == 1)))
logitaccuracy <- (tp+tn)/(tp+tn+fp+fn) #Calculating the accuracy
#Displaying values
logitaccuracy
## [1] 0.880518
The model’s accuracy on the test data is about 88%, but the confusion matrix shows that, at the 0.5 cutoff, it never predicts class 1: every respondent is classified as a non-library user, so the accuracy simply mirrors the class imbalance (2992 of the 3398 test cases are 0). The headline number therefore overstates how useful the model is for identifying library users.
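With tp, tn, fp, and fn already computed, sensitivity and specificity make the imbalance explicit, and lowering the cutoff shows the trade-off; the 0.2 threshold below is purely illustrative, not a tuned value:
#Sensitivity (recall on library users) and specificity (recall on non-users)
tp / (tp + fn) #0 at the 0.5 cutoff, since no positive predictions are made
tn / (tn + fp) #1 at the 0.5 cutoff, for the same reason
#Re-classifying with a lower, illustrative cutoff of 0.2
logitpredictions_low <- as.integer(pred_library > 0.2)
table(logitpredictions_low, library_pred_testing$LU_FROM_LIBRARY)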