Internet Past Use Analysis

UNIVERSITY OF WINDSOR

Data Analytic Method and Algorithm (BSMM8740)

GROUP 11:

Viet Hang Le
Simran Chauhan
Shannon Jeet Singh
Garima Suhag

knitr::opts_chunk$set(echo = TRUE)

getwd()

## [1] "/cloud/project"

setwd("/cloud/project")
PU_data <- read.csv("pastUse.csv")
PU_data <- PU_data[,-1]

List of libraries used in this project

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(viridis)
library(hrbrthemes)

Structure of the report

DiagrammeR::grViz("digraph {
  graph [layout = dot, rankdir = TB]
  
  node [shape = rectangle]        
  rec1 [label = '1. Introduction']
  rec2 [label = '2. Data Preprocessing']
  rec3 [label = '3. Exploratory Analysis']
  rec4 [label = '4. Predictive Analysis']
  rec5 [label = '5. Recommendations/Conclusions']
  
  # edge definitions with the node IDs
  rec1 -> rec2 -> rec3 -> rec4 -> rec5
  }",
  height = 500)

1. Introduction

Purpose of the survey and description of the data set. The EDA here offers details on how people 15 years of age and over who reside in Canada’s 10 provinces adopt and use internet, including how frequently they use them and other survey questions.

We will be dealing with past use variables, such as frequency of usage, length of time since last use, and reasons for not using internet services anymore.

The data set consists of 23,178 records across 21 variables as listed below:

PROVINCE: Province of respondent
REGION: Regions of Canada
G_URBRUR: Characteristic of community where the respondent lives
GCAGEGR6: Age of respondent
CSEX: Sex of respondent
G_CEDUC: Respondent’s highest education level
G_CSTUD: Whether the respondent is student or not
G_CLFSST: Type of household
G_HHSIZE: Number of persons in household
G_HEDUC: Highest level of education ever completed in the household
G_HSTUD : Whether there is student in household
EV_Q01: Have you ever used the Internet (E-mail or World Wide Web) from home, work, school, or any other location for personal non-business use?
EV_Q02: How many years have you used the Internet?
PU_Q01: Did you use the Internet during the past 12 months for personal non-business use
PU_Q02: When was the last time you used the Internet?
PU_Q03: During the period that you last used the Internet, how often did you use it?
PU_Q06A: Was high cost of service/equipment the reasons you no longer use the Internet from any location?
PU_Q06E: Was unnecessariness the reasons you no longer use the Internet from any location?
PU_Q06J: Was being busy the reasons you no longer use the Internet from any location?
PU_Q06K: Was lack of computer access the reasons you no longer use the Internet from any location?

The first 5 rows of the dataset:

head(PU_data, 5)

##   PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST GFAMTYPE
## 1       35      3        5        3    2       3       2        1        3
## 2       46      4        5        1    2       1       1        2        2
## 3       10      1        5        2    1       2       2        1        2
## 4       35      3        4        5    2       2       2        3        3
## 5       13      1        4        3    1       2       2        1        2
##   G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 PU_Q01 PU_Q02 PU_Q03 PU_Q06A PU_Q06E
## 1        1       3       2      1      2      1      6      6       6       6
## 2        3       2       1      1      4      1      6      6       6       6
## 3        2       3       2      1      4      1      6      6       6       6
## 4        1       2       2      1      3      2      2      1       1       2
## 5        3       2       2      1      2      1      6      6       6       6
##   PU_Q06J PU_Q06K PU_G06
## 1       6       6      6
## 2       6       6      6
## 3       6       6      6
## 4       2       2      2
## 5       6       6      6

2. Data Preprocessing

Rename variables

Create function to change variables’ values

The first 5 rows of renamed dataset:

subdata <- head(PU_data, 5)
kable(subdata)

PROVINCE	REGION	isRural	AGE	GENDER	Edu_Level	isStudent	Labour_Status	Household_Type	Household_Size	Household_Edu_Level	Student_in_Household	used_internet	internet_usage_duration	PU_Q01	PU_Q02	PU_Q03	PU_Q06A	PU_Q06E	PU_Q06J	PU_Q06K	PU_G06
Ontario	Ontario	Rural excluding Prince Edward Island	35 to 44	Female	University certificate or degree	No	Employed	One Person	1	University certificate or degree	No	Yes	1 to 2 years	Yes	Valid skip	Valid skip	Valid Skip	Valid Skip	Valid Skip	Valid Skip	Valid Skip
Manitoba	Manitoba/Saskatchewan	Rural excluding Prince Edward Island	16 to 24	Female	High school or less	Yes	Unemployed	Single Family without unmarried children under 16	3	College or some post-secondary	Yes	Yes	5 or more years	Yes	Valid skip	Valid skip	Valid Skip	Valid Skip	Valid Skip	Valid Skip	Valid Skip
Newfoundland and Labrador	Atlantic Region	Rural excluding Prince Edward Island	25 to 34	Male	College or some post-secondary	No	Employed	Single Family without unmarried children under 16	2	University certificate or degree	No	Yes	5 or more years	Yes	Valid skip	Valid skip	Valid Skip	Valid Skip	Valid Skip	Valid Skip	Valid Skip
Ontario	Ontario	Other Urban excluding Prince Edward Island	55 to 64	Female	College or some post-secondary	No	Not in the labour force	One Person	1	College or some post-secondary	No	Yes	2 to 5 years	No	2 to 5 years	Once a day	Yes	No	No	No	No
New Brunswick	Atlantic Region	Other Urban excluding Prince Edward Island	35 to 44	Male	College or some post-secondary	No	Employed	Single Family without unmarried children under 16	3	College or some post-secondary	No	Yes	1 to 2 years	Yes	Valid skip	Valid skip	Valid Skip	Valid Skip	Valid Skip	Valid Skip	Valid Skip

3. Exploratory Analysis

3.1. Distribution of province of respondents

proCount <- table(PU_data$PROVINCE)
proCount <- proCount[order(proCount, decreasing = TRUE)]
proPerc <- proCount/sum(proCount)*100
proPerc2 <- paste0(round(proPerc, 2), "%")
names(proPerc2) <- names(proPerc)

proBar <- barplot(proPerc, main = "Which province do past users live?", ylab = "Province", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)

Observations:

Majority of the respondents are from Ontario and Quebec and least from Prince Edward Island

3.2. Distribution of region of respondents

regionCount <- table(PU_data$REGION)
regionCount <- regionCount[order(regionCount, decreasing = FALSE)]
regionPerc <- regionCount/sum(regionCount)*100
regionPerc2 <- paste0(round(regionPerc, 2), "%")
names(regionPerc2) <- names(regionPerc)
regionBar <- barplot(regionPerc, main = "Which region do past users live?", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)

Observations:

Distribution across region shows majority of respondants from Ontario and least from Alberta

3.3. Distribution of respondents’ age group

ageCount <- table(PU_data$AGE)
agePerc <- ageCount/sum(ageCount)*100
agePerc2 <- paste0(round(agePerc, 1), "%")
names(agePerc2) <- names(agePerc)
cols_age = c('peachpuff', 'mistyrose', 'lightpink', 'lavender', 'lightsteelblue', 'lightblue')
pie(ageCount, labels = agePerc2, main = "Age of Respondents", col = cols_age, clockwise = TRUE)       
legend("topright", c("16-24", "25-34", "35-44", "45-54", "55-64", "65 and older"), cex = 0.8, fill = cols_age)

Observations:

Respondents’ age groups are approximately evenly distributed, except for age group 16-24 and 65 and above where the proportion is 8.5% and 22.5% respectively

3.4. Distribution of respondents’ gender

genCount <- table(PU_data$GENDER)
genPerc <- genCount/sum(genCount)*100
genPerc2 <- paste0(round(genPerc, 1), "%")
names(genPerc2) <- names(genPerc)
cols_gen = c('mistyrose', 'lightblue')
pie(genCount, labels = genPerc2, main = "Gender of Respondents", col = cols_gen, clockwise = TRUE)        
legend("right", c("Female", "Male"), cex = 0.8, fill = cols_gen)

Observations:

Data proportion across gender is 44.7% for male and 55.3% for female

3.5. Distribution of respondents’ highest education level

eduCount <- table(PU_data$Edu_Level)
eduPerc <- eduCount/sum(genCount)*100
eduPerc2 <- paste0(round(eduPerc, 1), "%")
names(eduPerc2) <- names(eduPerc)
cols_edu = c('lightpink', 'lavender', 'lightsteelblue')
pie(eduCount, labels = eduPerc2, main = "Respondent's Highest Education Level", col = cols_edu, clockwise = TRUE)                     
legend("topright", c("College or some post-secondary", "High school or less", "Univeristy certificate of degree"), cex = 0.5, fill = cols_edu)

Observations:

The highest education level of most respondents is either college/post-secondary or high school. Only 18.7% of them earned university certificate or degree.

3.6. Is respondent student?

stuCount <- table(PU_data$isStudent)
stuPerc <- stuCount/sum(stuCount)*100
stuPerc2 <- paste0(round(stuPerc, 1), "%")
names(stuPerc2) <- names(stuPerc)
cols_stu = c('mistyrose', 'lightblue')
pie(stuCount, labels = stuPerc2, main = "Respondent is a Student?", col = cols_stu, clockwise = TRUE)   
legend("right", c("No","Yes"), cex = 0.8, fill = cols_stu)

Observations:

More than 90% of respondents are not student, which makes sense considering that the survey was mainly responded by those in senior age groups as shown in the age plot above. Therefore, it is understandable that most of them are no longer students.

3.7. Distribution of respondents’ labour status

laborCount <- table(PU_data$Labour_Status)
laborPerc <- laborCount/sum(laborCount)*100
laborPerc2 <- paste0(round(laborPerc, 1), "%")
names(laborPerc2) <- names(laborPerc)
cols_labor = c('peachpuff', 'lightpink', 'lightblue')
pie(laborCount, labels = laborPerc2, main = "Respondent's Labour Force Status", col = cols_labor, clockwise = TRUE)
legend("topright", c("Employed","Unemployed", "Not in the labour force"), cex = 0.8, fill = cols_labor)

Observations:

More than half of the respondents are employed (58.5%). The unemployed also make up a considerable portion with 35%.

3.8. What is the proportion of people who use internet across provience, age groups, gender, education level and labour status?

3.8.1. Proportion of respondents who use internet across province

a1 <- PU_data %>% group_by(PROVINCE) %>% summarise(tot = n())
a <- PU_data %>% group_by(PROVINCE, used_internet) %>% 
  summarise(total = n())

## `summarise()` has grouped output by 'PROVINCE'. You can override using the
## `.groups` argument.

a <- merge(a, a1 , by = 'PROVINCE', all.x=TRUE)
a$percentage <- round((a$total/a$tot)*100,2)
ggplot(a, aes(fill= used_internet, y= percentage, x= PROVINCE )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip()+
  ggtitle('Proportion of people who ever used internet across province')

Observations:

British Columbia, Alberta and Ontario have highest internet users i.e. 83.1%, 81.13% and 78.58% respectively
Newfoundland and New Brunswick have the least number of internet users i.e. 64.17% and 68.73% respectively

3.8.2. Proportion of respodents who ever used internet across age groups

b1 <- PU_data %>% group_by(AGE) %>% summarise(tot = n())
b <- PU_data %>% group_by(AGE, used_internet) %>%  
    summarise(total = n())

## `summarise()` has grouped output by 'AGE'. You can override using the `.groups`
## argument.

b <- merge(b, b1 , by = 'AGE', all.x=TRUE)
b$percentage <- round((b$total/b$tot)*100,2)
ggplot(b, aes(fill= used_internet, y= percentage, x= AGE )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who ever used internet across age groups')

Observations:

Younger population uses internet the most
As the population gets older their internet usage is decreasing
98.54% of respodents who belong to age group 16-24 have used internet
Only 37.5% of respondents from age group 65 and above have used internet

3.8.3. Proportion of respodents who ever used internet across gender

d1 <- PU_data %>% group_by(GENDER) %>% summarise(tot = n())
d <- PU_data %>% group_by(GENDER, used_internet) %>%  
    summarise(total = n())

## `summarise()` has grouped output by 'GENDER'. You can override using the
## `.groups` argument.

d <- merge(d, d1 , by = 'GENDER', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= GENDER )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who used internet across gender')

Observations:

Proportion of people using internet arcoss gender is same, just male proportion being just 1 percent more

3.8.4. Proportion of respodents who ever used internet by Education Level

d1 <- PU_data %>% group_by(Edu_Level) %>% summarise(tot = n())
d <- PU_data %>% group_by(Edu_Level, used_internet) %>%  
    summarise(total = n())

## `summarise()` has grouped output by 'Edu_Level'. You can override using the
## `.groups` argument.

d <- merge(d, d1 , by = 'Edu_Level', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= Edu_Level )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who used internet by education level')

Observations:

People who are more educated have more proportion of respondents that have ever used internet
94.77% of respondents that have a university certificate of degree have used internet, however only 58.59% in the case of High school or less level of education
Respondents with College or some post-secondary, 83.48% have used internet

3.8.5. Proportion of respodents who ever used internet by Employment status

d1 <- PU_data %>% group_by(Labour_Status) %>% summarise(tot = n())
d <- PU_data %>% group_by(Labour_Status, used_internet) %>%  
    summarise(total = n())

## `summarise()` has grouped output by 'Labour_Status'. You can override using the
## `.groups` argument.

d <- merge(d, d1 , by = 'Labour_Status', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= Labour_Status )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who used internet by education level')

Observations:

People who are employed use internet the most, i.e. 89%
Also, people who are unemployed, 86.12% have used internet
Proportion of people who have or have not used internet is almost same

3.8.6. Analysing less internet usage in Not in labour force wrt to age groups

d1 <- PU_data %>% group_by(AGE) %>% summarise(tot = n())
d <- PU_data %>% group_by(AGE, Labour_Status) %>%  
    summarise(total = n())

## `summarise()` has grouped output by 'AGE'. You can override using the `.groups`
## argument.

d <- merge(d, d1 , by = 'AGE', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= Labour_Status, y= percentage, x= AGE )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who used internet by education level')

Observations:

From the plot we can observe that majority of Not in labor respondents belong to age group 55 or more, and this explains the reason why this group has more proportion of people that have never used interner as, older people have more proportion of people that have never used internet

EDA of people who have used internet

3.9. How many years have respondents used the Internet?

durationCount <- table(PU_data$internet_usage_duration)
durationCount <- durationCount[order(durationCount, decreasing = FALSE)]
durationPerc <- durationCount/sum(durationCount)*100
durationPerc2 <- paste0(round(durationPerc, 2), "%")
names(durationPerc2) <- names(durationPerc)
durationBar <- barplot(durationPerc, main = "How many years have you used the Internet?", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)

Observations:

Out of people who have used internet, 60% of the respondents have been using it for more than 5 years.

EDA of people who no longer or dont use internet

3.10. What are the reasons users no longer use the Internet from any location?

reasons$Answer <- factor(reasons$Answer, c("Yes", "No", "Don't Know"))
ggplot(reasons, aes(fill=Answer, y=Frequency, x=Answer)) + 
  geom_bar(position="dodge", stat="identity") +
  scale_fill_viridis(discrete = T, option = "E") +
  ggtitle("What are the reasons users no longer use the Internet from any location?") +
  facet_wrap(~Reason) +
  theme_ipsum() +
  theme(legend.position="none") +
  xlab("")

Observations:

The graph shows that among those who didn’t use the Internet during the past 12 months, no need to use appeared to be the most popular reason. However, as most of them answered ‘No’ to 4 reasons being surveyed, there seems to be other factors that make them stopped using the Internet.

Final Observation and Conclution for Predictive Analysis

As the population grows older they tend to stop using internet or not use it at all
People with high level of education use more internet
Employed and Unemployed group has more percentage of people who use internet compared to Not in Labour force group as this group has people th belong to age group 55 and higher and they don’t use internet

4. Predictive Analysis

4.1 Chi square to understand internet usage behavior across variables

As our project deals with past Internet users, we use chi-square test for independence to determine the association between different demographic factors and the answer of whether or not respondents used the Internet during the past 12 months. Specifically, we will consider 6 variables AGE, GENDER, Edu_Level, Labour_Status, Household_Size and PU_Q01 for our model, where the first 5 variables are predictors and the 6th variable is the outcome. The goal is to test the hypothesis whether each of the first 5 variables is independent of the fixed 6th variable at 0.05 level of significance. In other words, we aim to find out whether age, gender, education level, labour status or household size of respondents is indenpendent of their past usage. There will be 5 tests to be done corresponding to 5 combinations with the null hypothesis and alternate hypothesis of each test as follows:

H0: Both variables are independent

H1: Both variables are not independent

df2 <- PU_data[c(4:6,8,10,15)]
df2 <- subset(df2, PU_Q01!="Valid Skip")

x <- 1:5
outcome <- vector('list', length(x))
for (i in x) {
  test <- chisq.test(df2[,i], df2[,6])
  outcome[[i]] <- data.frame("X" = colnames(df2[i]), "Y" = colnames(df2[6]),
                             "chi.square" = round(test$statistic, 2),
                             "df" = test$parameter,
                             "p.value" = test$p.value)
}

outcome

## [[1]]
##             X      Y chi.square df      p.value
## X-squared AGE PU_Q01     205.37  5 2.012823e-42
## 
## [[2]]
##                X      Y chi.square df   p.value
## X-squared GENDER PU_Q01       1.27  1 0.2601836
## 
## [[3]]
##                   X      Y chi.square df      p.value
## X-squared Edu_Level PU_Q01     127.14  2 2.467053e-28
## 
## [[4]]
##                       X      Y chi.square df      p.value
## X-squared Labour_Status PU_Q01      168.2  2 2.991793e-37
## 
## [[5]]
##                        X      Y chi.square df      p.value
## X-squared Household_Size PU_Q01     150.34  3 2.228604e-32

Results:

The first pair to be tested is AGE and PU_Q01. As the p-value 2.012823e-42 is smaller than 0.05, we reject the null hypothesis and conclude that age is not independent of past usage. In other words, age affects whether or not respondents used the Internet in the past 12 months.
The second pair to be tested is GENDER and PU_Q01. As the p-value 0.2601836 is greater than 0.05, we fail to reject the null hypothesis and conclude that gender is independent of past usage. In other words, respondents’ gender does not affect whether or not they used the Internet in the past 12 months.
The third pair to be tested is Edu_Level and PU_Q01. As the p-value 2.467053e-28 is smaller than 0.05, we reject the null hypothesis and conclude that education level is not independent of past usage. In other words, respondents’ highest level of education affects whether or not they used the Internet in the past 12 months.
The fourth pair to be tested is Labour_Status and PU_Q01. As the p-value 2.991793e-37 is smaller than 0.05, we reject the null hypothesis and conclude that labour status is not independent of past usage. In other words, respondents’ employment status affects whether or not they used the Internet in the past 12 months.
The last pair to be tested is Household_Size and PU_Q01. As the p-value 2.228604e-32 is smaller than 0.05, we reject the null hypothesis and conclude that household size is not independent of past usage. In other words, the size of respondents’ family affects whether or not they used the Internet in the past 12 months.

5. Recommendation/Conclusion

To conclude, even the test findings indicate that age, education, and employment status are major factors in internet usage. People that are employed use the internet the most, and internet usage increases with education level.
Additionally, the age group of 55 and older include those who do not use the internet, which makes perfect sense given that this group belongs to the category of not in labour force and uses the internet the least.