UNIVERSITY OF WINDSOR

Data Analytic Method and Algorithm (BSMM8740)

GROUP 11:

  • Viet Hang Le
  • Simran Chauhan
  • Shannon Jeet Singh
  • Garima Suhag
knitr::opts_chunk$set(echo = TRUE)

getwd()
## [1] "/cloud/project"
setwd("/cloud/project")
PU_data <- read.csv("pastUse.csv")
PU_data <- PU_data[,-1]

List of libraries used in this project

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(viridis)
library(hrbrthemes)

Structure of the report

DiagrammeR::grViz("digraph {
  graph [layout = dot, rankdir = TB]
  
  node [shape = rectangle]        
  rec1 [label = '1. Introduction']
  rec2 [label = '2. Data Preprocessing']
  rec3 [label = '3. Exploratory Analysis']
  rec4 [label = '4. Predictive Analysis']
  rec5 [label = '5. Recommendations/Conclusions']
  
  # edge definitions with the node IDs
  rec1 -> rec2 -> rec3 -> rec4 -> rec5
  }",
  height = 500)

1. Introduction

Purpose of the survey and description of the data set. The EDA here offers details on how people 15 years of age and over who reside in Canada’s 10 provinces adopt and use internet, including how frequently they use them and other survey questions.

We will be dealing with past use variables, such as frequency of usage, length of time since last use, and reasons for not using internet services anymore.

The data set consists of 23,178 records across 21 variables as listed below:

The first 5 rows of the dataset:

head(PU_data, 5)
##   PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST GFAMTYPE
## 1       35      3        5        3    2       3       2        1        3
## 2       46      4        5        1    2       1       1        2        2
## 3       10      1        5        2    1       2       2        1        2
## 4       35      3        4        5    2       2       2        3        3
## 5       13      1        4        3    1       2       2        1        2
##   G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 PU_Q01 PU_Q02 PU_Q03 PU_Q06A PU_Q06E
## 1        1       3       2      1      2      1      6      6       6       6
## 2        3       2       1      1      4      1      6      6       6       6
## 3        2       3       2      1      4      1      6      6       6       6
## 4        1       2       2      1      3      2      2      1       1       2
## 5        3       2       2      1      2      1      6      6       6       6
##   PU_Q06J PU_Q06K PU_G06
## 1       6       6      6
## 2       6       6      6
## 3       6       6      6
## 4       2       2      2
## 5       6       6      6

2. Data Preprocessing

Rename variables

Create function to change variables’ values

The first 5 rows of renamed dataset:

subdata <- head(PU_data, 5)
kable(subdata)
PROVINCE REGION isRural AGE GENDER Edu_Level isStudent Labour_Status Household_Type Household_Size Household_Edu_Level Student_in_Household used_internet internet_usage_duration PU_Q01 PU_Q02 PU_Q03 PU_Q06A PU_Q06E PU_Q06J PU_Q06K PU_G06
Ontario Ontario Rural excluding Prince Edward Island 35 to 44 Female University certificate or degree No Employed One Person 1 University certificate or degree No Yes 1 to 2 years Yes Valid skip Valid skip Valid Skip Valid Skip Valid Skip Valid Skip Valid Skip
Manitoba Manitoba/Saskatchewan Rural excluding Prince Edward Island 16 to 24 Female High school or less Yes Unemployed Single Family without unmarried children under 16 3 College or some post-secondary Yes Yes 5 or more years Yes Valid skip Valid skip Valid Skip Valid Skip Valid Skip Valid Skip Valid Skip
Newfoundland and Labrador Atlantic Region Rural excluding Prince Edward Island 25 to 34 Male College or some post-secondary No Employed Single Family without unmarried children under 16 2 University certificate or degree No Yes 5 or more years Yes Valid skip Valid skip Valid Skip Valid Skip Valid Skip Valid Skip Valid Skip
Ontario Ontario Other Urban excluding Prince Edward Island 55 to 64 Female College or some post-secondary No Not in the labour force One Person 1 College or some post-secondary No Yes 2 to 5 years No 2 to 5 years Once a day Yes No No No No
New Brunswick Atlantic Region Other Urban excluding Prince Edward Island 35 to 44 Male College or some post-secondary No Employed Single Family without unmarried children under 16 3 College or some post-secondary No Yes 1 to 2 years Yes Valid skip Valid skip Valid Skip Valid Skip Valid Skip Valid Skip Valid Skip

3. Exploratory Analysis

3.1. Distribution of province of respondents

proCount <- table(PU_data$PROVINCE)
proCount <- proCount[order(proCount, decreasing = TRUE)]
proPerc <- proCount/sum(proCount)*100
proPerc2 <- paste0(round(proPerc, 2), "%")
names(proPerc2) <- names(proPerc)

proBar <- barplot(proPerc, main = "Which province do past users live?", ylab = "Province", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)

Observations:

Majority of the respondents are from Ontario and Quebec and least from Prince Edward Island

3.2. Distribution of region of respondents

regionCount <- table(PU_data$REGION)
regionCount <- regionCount[order(regionCount, decreasing = FALSE)]
regionPerc <- regionCount/sum(regionCount)*100
regionPerc2 <- paste0(round(regionPerc, 2), "%")
names(regionPerc2) <- names(regionPerc)
regionBar <- barplot(regionPerc, main = "Which region do past users live?", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)

Observations:

Distribution across region shows majority of respondants from Ontario and least from Alberta

3.3. Distribution of respondents’ age group

ageCount <- table(PU_data$AGE)
agePerc <- ageCount/sum(ageCount)*100
agePerc2 <- paste0(round(agePerc, 1), "%")
names(agePerc2) <- names(agePerc)
cols_age = c('peachpuff', 'mistyrose', 'lightpink', 'lavender', 'lightsteelblue', 'lightblue')
pie(ageCount, labels = agePerc2, main = "Age of Respondents", col = cols_age, clockwise = TRUE)       
legend("topright", c("16-24", "25-34", "35-44", "45-54", "55-64", "65 and older"), cex = 0.8, fill = cols_age)

Observations:

Respondents’ age groups are approximately evenly distributed, except for age group 16-24 and 65 and above where the proportion is 8.5% and 22.5% respectively

3.4. Distribution of respondents’ gender

genCount <- table(PU_data$GENDER)
genPerc <- genCount/sum(genCount)*100
genPerc2 <- paste0(round(genPerc, 1), "%")
names(genPerc2) <- names(genPerc)
cols_gen = c('mistyrose', 'lightblue')
pie(genCount, labels = genPerc2, main = "Gender of Respondents", col = cols_gen, clockwise = TRUE)        
legend("right", c("Female", "Male"), cex = 0.8, fill = cols_gen)

Observations:

Data proportion across gender is 44.7% for male and 55.3% for female

3.5. Distribution of respondents’ highest education level

eduCount <- table(PU_data$Edu_Level)
eduPerc <- eduCount/sum(genCount)*100
eduPerc2 <- paste0(round(eduPerc, 1), "%")
names(eduPerc2) <- names(eduPerc)
cols_edu = c('lightpink', 'lavender', 'lightsteelblue')
pie(eduCount, labels = eduPerc2, main = "Respondent's Highest Education Level", col = cols_edu, clockwise = TRUE)                     
legend("topright", c("College or some post-secondary", "High school or less", "Univeristy certificate of degree"), cex = 0.5, fill = cols_edu)

Observations:

The highest education level of most respondents is either college/post-secondary or high school. Only 18.7% of them earned university certificate or degree.

3.6. Is respondent student?

stuCount <- table(PU_data$isStudent)
stuPerc <- stuCount/sum(stuCount)*100
stuPerc2 <- paste0(round(stuPerc, 1), "%")
names(stuPerc2) <- names(stuPerc)
cols_stu = c('mistyrose', 'lightblue')
pie(stuCount, labels = stuPerc2, main = "Respondent is a Student?", col = cols_stu, clockwise = TRUE)   
legend("right", c("No","Yes"), cex = 0.8, fill = cols_stu)

Observations:

More than 90% of respondents are not student, which makes sense considering that the survey was mainly responded by those in senior age groups as shown in the age plot above. Therefore, it is understandable that most of them are no longer students.

3.7. Distribution of respondents’ labour status

laborCount <- table(PU_data$Labour_Status)
laborPerc <- laborCount/sum(laborCount)*100
laborPerc2 <- paste0(round(laborPerc, 1), "%")
names(laborPerc2) <- names(laborPerc)
cols_labor = c('peachpuff', 'lightpink', 'lightblue')
pie(laborCount, labels = laborPerc2, main = "Respondent's Labour Force Status", col = cols_labor, clockwise = TRUE)
legend("topright", c("Employed","Unemployed", "Not in the labour force"), cex = 0.8, fill = cols_labor)

Observations:

More than half of the respondents are employed (58.5%). The unemployed also make up a considerable portion with 35%.

3.8. What is the proportion of people who use internet across provience, age groups, gender, education level and labour status?

3.8.1. Proportion of respondents who use internet across province

a1 <- PU_data %>% group_by(PROVINCE) %>% summarise(tot = n())
a <- PU_data %>% group_by(PROVINCE, used_internet) %>% 
  summarise(total = n())
## `summarise()` has grouped output by 'PROVINCE'. You can override using the
## `.groups` argument.
a <- merge(a, a1 , by = 'PROVINCE', all.x=TRUE)
a$percentage <- round((a$total/a$tot)*100,2)
ggplot(a, aes(fill= used_internet, y= percentage, x= PROVINCE )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip()+
  ggtitle('Proportion of people who ever used internet across province')

Observations:

  • British Columbia, Alberta and Ontario have highest internet users i.e. 83.1%, 81.13% and 78.58% respectively
  • Newfoundland and New Brunswick have the least number of internet users i.e. 64.17% and 68.73% respectively

3.8.2. Proportion of respodents who ever used internet across age groups

b1 <- PU_data %>% group_by(AGE) %>% summarise(tot = n())
b <- PU_data %>% group_by(AGE, used_internet) %>%  
    summarise(total = n())
## `summarise()` has grouped output by 'AGE'. You can override using the `.groups`
## argument.
b <- merge(b, b1 , by = 'AGE', all.x=TRUE)
b$percentage <- round((b$total/b$tot)*100,2)
ggplot(b, aes(fill= used_internet, y= percentage, x= AGE )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who ever used internet across age groups')

Observations:

  • Younger population uses internet the most
  • As the population gets older their internet usage is decreasing
  • 98.54% of respodents who belong to age group 16-24 have used internet
  • Only 37.5% of respondents from age group 65 and above have used internet

3.8.3. Proportion of respodents who ever used internet across gender

d1 <- PU_data %>% group_by(GENDER) %>% summarise(tot = n())
d <- PU_data %>% group_by(GENDER, used_internet) %>%  
    summarise(total = n())
## `summarise()` has grouped output by 'GENDER'. You can override using the
## `.groups` argument.
d <- merge(d, d1 , by = 'GENDER', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= GENDER )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who used internet across gender')

Observations:

Proportion of people using internet arcoss gender is same, just male proportion being just 1 percent more

3.8.4. Proportion of respodents who ever used internet by Education Level

d1 <- PU_data %>% group_by(Edu_Level) %>% summarise(tot = n())
d <- PU_data %>% group_by(Edu_Level, used_internet) %>%  
    summarise(total = n())
## `summarise()` has grouped output by 'Edu_Level'. You can override using the
## `.groups` argument.
d <- merge(d, d1 , by = 'Edu_Level', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= Edu_Level )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who used internet by education level')

Observations:

  • People who are more educated have more proportion of respondents that have ever used internet
  • 94.77% of respondents that have a university certificate of degree have used internet, however only 58.59% in the case of High school or less level of education
  • Respondents with College or some post-secondary, 83.48% have used internet

3.8.5. Proportion of respodents who ever used internet by Employment status

d1 <- PU_data %>% group_by(Labour_Status) %>% summarise(tot = n())
d <- PU_data %>% group_by(Labour_Status, used_internet) %>%  
    summarise(total = n())
## `summarise()` has grouped output by 'Labour_Status'. You can override using the
## `.groups` argument.
d <- merge(d, d1 , by = 'Labour_Status', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= Labour_Status )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who used internet by education level')

Observations:

  • People who are employed use internet the most, i.e. 89%
  • Also, people who are unemployed, 86.12% have used internet
  • Proportion of people who have or have not used internet is almost same

3.8.6. Analysing less internet usage in Not in labour force wrt to age groups

d1 <- PU_data %>% group_by(AGE) %>% summarise(tot = n())
d <- PU_data %>% group_by(AGE, Labour_Status) %>%  
    summarise(total = n())
## `summarise()` has grouped output by 'AGE'. You can override using the `.groups`
## argument.
d <- merge(d, d1 , by = 'AGE', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= Labour_Status, y= percentage, x= AGE )) + 
  geom_bar(position='stack', stat='identity') +
  geom_text(aes( label = percentage), size = 3) +
  coord_flip() +
  ggtitle('Proportion of people who used internet by education level')

Observations:

From the plot we can observe that majority of Not in labor respondents belong to age group 55 or more, and this explains the reason why this group has more proportion of people that have never used interner as, older people have more proportion of people that have never used internet

EDA of people who have used internet

3.9. How many years have respondents used the Internet?

durationCount <- table(PU_data$internet_usage_duration)
durationCount <- durationCount[order(durationCount, decreasing = FALSE)]
durationPerc <- durationCount/sum(durationCount)*100
durationPerc2 <- paste0(round(durationPerc, 2), "%")
names(durationPerc2) <- names(durationPerc)
durationBar <- barplot(durationPerc, main = "How many years have you used the Internet?", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)

Observations:

Out of people who have used internet, 60% of the respondents have been using it for more than 5 years.

EDA of people who no longer or dont use internet

3.10. What are the reasons users no longer use the Internet from any location?

reasons$Answer <- factor(reasons$Answer, c("Yes", "No", "Don't Know"))
ggplot(reasons, aes(fill=Answer, y=Frequency, x=Answer)) + 
  geom_bar(position="dodge", stat="identity") +
  scale_fill_viridis(discrete = T, option = "E") +
  ggtitle("What are the reasons users no longer use the Internet from any location?") +
  facet_wrap(~Reason) +
  theme_ipsum() +
  theme(legend.position="none") +
  xlab("")

Observations:

The graph shows that among those who didn’t use the Internet during the past 12 months, no need to use appeared to be the most popular reason. However, as most of them answered ‘No’ to 4 reasons being surveyed, there seems to be other factors that make them stopped using the Internet.

Final Observation and Conclution for Predictive Analysis

  • As the population grows older they tend to stop using internet or not use it at all
  • People with high level of education use more internet
  • Employed and Unemployed group has more percentage of people who use internet compared to Not in Labour force group as this group has people th belong to age group 55 and higher and they don’t use internet

4. Predictive Analysis

4.1 Chi square to understand internet usage behavior across variables

As our project deals with past Internet users, we use chi-square test for independence to determine the association between different demographic factors and the answer of whether or not respondents used the Internet during the past 12 months. Specifically, we will consider 6 variables AGE, GENDER, Edu_Level, Labour_Status, Household_Size and PU_Q01 for our model, where the first 5 variables are predictors and the 6th variable is the outcome. The goal is to test the hypothesis whether each of the first 5 variables is independent of the fixed 6th variable at 0.05 level of significance. In other words, we aim to find out whether age, gender, education level, labour status or household size of respondents is indenpendent of their past usage. There will be 5 tests to be done corresponding to 5 combinations with the null hypothesis and alternate hypothesis of each test as follows:

H0: Both variables are independent

H1: Both variables are not independent

df2 <- PU_data[c(4:6,8,10,15)]
df2 <- subset(df2, PU_Q01!="Valid Skip")

x <- 1:5
outcome <- vector('list', length(x))
for (i in x) {
  test <- chisq.test(df2[,i], df2[,6])
  outcome[[i]] <- data.frame("X" = colnames(df2[i]), "Y" = colnames(df2[6]),
                             "chi.square" = round(test$statistic, 2),
                             "df" = test$parameter,
                             "p.value" = test$p.value)
}

outcome
## [[1]]
##             X      Y chi.square df      p.value
## X-squared AGE PU_Q01     205.37  5 2.012823e-42
## 
## [[2]]
##                X      Y chi.square df   p.value
## X-squared GENDER PU_Q01       1.27  1 0.2601836
## 
## [[3]]
##                   X      Y chi.square df      p.value
## X-squared Edu_Level PU_Q01     127.14  2 2.467053e-28
## 
## [[4]]
##                       X      Y chi.square df      p.value
## X-squared Labour_Status PU_Q01      168.2  2 2.991793e-37
## 
## [[5]]
##                        X      Y chi.square df      p.value
## X-squared Household_Size PU_Q01     150.34  3 2.228604e-32

Results:

  • The first pair to be tested is AGE and PU_Q01. As the p-value 2.012823e-42 is smaller than 0.05, we reject the null hypothesis and conclude that age is not independent of past usage. In other words, age affects whether or not respondents used the Internet in the past 12 months.

  • The second pair to be tested is GENDER and PU_Q01. As the p-value 0.2601836 is greater than 0.05, we fail to reject the null hypothesis and conclude that gender is independent of past usage. In other words, respondents’ gender does not affect whether or not they used the Internet in the past 12 months.

  • The third pair to be tested is Edu_Level and PU_Q01. As the p-value 2.467053e-28 is smaller than 0.05, we reject the null hypothesis and conclude that education level is not independent of past usage. In other words, respondents’ highest level of education affects whether or not they used the Internet in the past 12 months.

  • The fourth pair to be tested is Labour_Status and PU_Q01. As the p-value 2.991793e-37 is smaller than 0.05, we reject the null hypothesis and conclude that labour status is not independent of past usage. In other words, respondents’ employment status affects whether or not they used the Internet in the past 12 months.

  • The last pair to be tested is Household_Size and PU_Q01. As the p-value 2.228604e-32 is smaller than 0.05, we reject the null hypothesis and conclude that household size is not independent of past usage. In other words, the size of respondents’ family affects whether or not they used the Internet in the past 12 months.

5. Recommendation/Conclusion