knitr::opts_chunk$set(echo = TRUE)
getwd()
## [1] "/cloud/project"
setwd("/cloud/project")
PU_data <- read.csv("pastUse.csv")
PU_data <- PU_data[,-1]
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(viridis)
library(hrbrthemes)
DiagrammeR::grViz("digraph {
graph [layout = dot, rankdir = TB]
node [shape = rectangle]
rec1 [label = '1. Introduction']
rec2 [label = '2. Data Preprocessing']
rec3 [label = '3. Exploratory Analysis']
rec4 [label = '4. Predictive Analysis']
rec5 [label = '5. Recommendations/Conclusions']
# edge definitions with the node IDs
rec1 -> rec2 -> rec3 -> rec4 -> rec5
}",
height = 500)
Purpose of the survey and description of the data set. The EDA here offers details on how people 15 years of age and over who reside in Canada’s 10 provinces adopt and use internet, including how frequently they use them and other survey questions.
We will be dealing with past use variables, such as frequency of usage, length of time since last use, and reasons for not using internet services anymore.
The data set consists of 23,178 records across 21 variables as listed below:
The first 5 rows of the dataset:
head(PU_data, 5)
## PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST GFAMTYPE
## 1 35 3 5 3 2 3 2 1 3
## 2 46 4 5 1 2 1 1 2 2
## 3 10 1 5 2 1 2 2 1 2
## 4 35 3 4 5 2 2 2 3 3
## 5 13 1 4 3 1 2 2 1 2
## G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 PU_Q01 PU_Q02 PU_Q03 PU_Q06A PU_Q06E
## 1 1 3 2 1 2 1 6 6 6 6
## 2 3 2 1 1 4 1 6 6 6 6
## 3 2 3 2 1 4 1 6 6 6 6
## 4 1 2 2 1 3 2 2 1 1 2
## 5 3 2 2 1 2 1 6 6 6 6
## PU_Q06J PU_Q06K PU_G06
## 1 6 6 6
## 2 6 6 6
## 3 6 6 6
## 4 2 2 2
## 5 6 6 6
The first 5 rows of renamed dataset:
subdata <- head(PU_data, 5)
kable(subdata)
| PROVINCE | REGION | isRural | AGE | GENDER | Edu_Level | isStudent | Labour_Status | Household_Type | Household_Size | Household_Edu_Level | Student_in_Household | used_internet | internet_usage_duration | PU_Q01 | PU_Q02 | PU_Q03 | PU_Q06A | PU_Q06E | PU_Q06J | PU_Q06K | PU_G06 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ontario | Ontario | Rural excluding Prince Edward Island | 35 to 44 | Female | University certificate or degree | No | Employed | One Person | 1 | University certificate or degree | No | Yes | 1 to 2 years | Yes | Valid skip | Valid skip | Valid Skip | Valid Skip | Valid Skip | Valid Skip | Valid Skip |
| Manitoba | Manitoba/Saskatchewan | Rural excluding Prince Edward Island | 16 to 24 | Female | High school or less | Yes | Unemployed | Single Family without unmarried children under 16 | 3 | College or some post-secondary | Yes | Yes | 5 or more years | Yes | Valid skip | Valid skip | Valid Skip | Valid Skip | Valid Skip | Valid Skip | Valid Skip |
| Newfoundland and Labrador | Atlantic Region | Rural excluding Prince Edward Island | 25 to 34 | Male | College or some post-secondary | No | Employed | Single Family without unmarried children under 16 | 2 | University certificate or degree | No | Yes | 5 or more years | Yes | Valid skip | Valid skip | Valid Skip | Valid Skip | Valid Skip | Valid Skip | Valid Skip |
| Ontario | Ontario | Other Urban excluding Prince Edward Island | 55 to 64 | Female | College or some post-secondary | No | Not in the labour force | One Person | 1 | College or some post-secondary | No | Yes | 2 to 5 years | No | 2 to 5 years | Once a day | Yes | No | No | No | No |
| New Brunswick | Atlantic Region | Other Urban excluding Prince Edward Island | 35 to 44 | Male | College or some post-secondary | No | Employed | Single Family without unmarried children under 16 | 3 | College or some post-secondary | No | Yes | 1 to 2 years | Yes | Valid skip | Valid skip | Valid Skip | Valid Skip | Valid Skip | Valid Skip | Valid Skip |
proCount <- table(PU_data$PROVINCE)
proCount <- proCount[order(proCount, decreasing = TRUE)]
proPerc <- proCount/sum(proCount)*100
proPerc2 <- paste0(round(proPerc, 2), "%")
names(proPerc2) <- names(proPerc)
proBar <- barplot(proPerc, main = "Which province do past users live?", ylab = "Province", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)
Observations:
Majority of the respondents are from Ontario and Quebec and least from Prince Edward Island
regionCount <- table(PU_data$REGION)
regionCount <- regionCount[order(regionCount, decreasing = FALSE)]
regionPerc <- regionCount/sum(regionCount)*100
regionPerc2 <- paste0(round(regionPerc, 2), "%")
names(regionPerc2) <- names(regionPerc)
regionBar <- barplot(regionPerc, main = "Which region do past users live?", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)
Observations:
Distribution across region shows majority of respondants from Ontario and least from Alberta
ageCount <- table(PU_data$AGE)
agePerc <- ageCount/sum(ageCount)*100
agePerc2 <- paste0(round(agePerc, 1), "%")
names(agePerc2) <- names(agePerc)
cols_age = c('peachpuff', 'mistyrose', 'lightpink', 'lavender', 'lightsteelblue', 'lightblue')
pie(ageCount, labels = agePerc2, main = "Age of Respondents", col = cols_age, clockwise = TRUE)
legend("topright", c("16-24", "25-34", "35-44", "45-54", "55-64", "65 and older"), cex = 0.8, fill = cols_age)
Observations:
Respondents’ age groups are approximately evenly distributed, except for age group 16-24 and 65 and above where the proportion is 8.5% and 22.5% respectively
genCount <- table(PU_data$GENDER)
genPerc <- genCount/sum(genCount)*100
genPerc2 <- paste0(round(genPerc, 1), "%")
names(genPerc2) <- names(genPerc)
cols_gen = c('mistyrose', 'lightblue')
pie(genCount, labels = genPerc2, main = "Gender of Respondents", col = cols_gen, clockwise = TRUE)
legend("right", c("Female", "Male"), cex = 0.8, fill = cols_gen)
Observations:
Data proportion across gender is 44.7% for male and 55.3% for female
eduCount <- table(PU_data$Edu_Level)
eduPerc <- eduCount/sum(genCount)*100
eduPerc2 <- paste0(round(eduPerc, 1), "%")
names(eduPerc2) <- names(eduPerc)
cols_edu = c('lightpink', 'lavender', 'lightsteelblue')
pie(eduCount, labels = eduPerc2, main = "Respondent's Highest Education Level", col = cols_edu, clockwise = TRUE)
legend("topright", c("College or some post-secondary", "High school or less", "Univeristy certificate of degree"), cex = 0.5, fill = cols_edu)
Observations:
The highest education level of most respondents is either college/post-secondary or high school. Only 18.7% of them earned university certificate or degree.
stuCount <- table(PU_data$isStudent)
stuPerc <- stuCount/sum(stuCount)*100
stuPerc2 <- paste0(round(stuPerc, 1), "%")
names(stuPerc2) <- names(stuPerc)
cols_stu = c('mistyrose', 'lightblue')
pie(stuCount, labels = stuPerc2, main = "Respondent is a Student?", col = cols_stu, clockwise = TRUE)
legend("right", c("No","Yes"), cex = 0.8, fill = cols_stu)
Observations:
More than 90% of respondents are not student, which makes sense considering that the survey was mainly responded by those in senior age groups as shown in the age plot above. Therefore, it is understandable that most of them are no longer students.
laborCount <- table(PU_data$Labour_Status)
laborPerc <- laborCount/sum(laborCount)*100
laborPerc2 <- paste0(round(laborPerc, 1), "%")
names(laborPerc2) <- names(laborPerc)
cols_labor = c('peachpuff', 'lightpink', 'lightblue')
pie(laborCount, labels = laborPerc2, main = "Respondent's Labour Force Status", col = cols_labor, clockwise = TRUE)
legend("topright", c("Employed","Unemployed", "Not in the labour force"), cex = 0.8, fill = cols_labor)
Observations:
More than half of the respondents are employed (58.5%). The unemployed also make up a considerable portion with 35%.
a1 <- PU_data %>% group_by(PROVINCE) %>% summarise(tot = n())
a <- PU_data %>% group_by(PROVINCE, used_internet) %>%
summarise(total = n())
## `summarise()` has grouped output by 'PROVINCE'. You can override using the
## `.groups` argument.
a <- merge(a, a1 , by = 'PROVINCE', all.x=TRUE)
a$percentage <- round((a$total/a$tot)*100,2)
ggplot(a, aes(fill= used_internet, y= percentage, x= PROVINCE )) +
geom_bar(position='stack', stat='identity') +
geom_text(aes( label = percentage), size = 3) +
coord_flip()+
ggtitle('Proportion of people who ever used internet across province')
Observations:
b1 <- PU_data %>% group_by(AGE) %>% summarise(tot = n())
b <- PU_data %>% group_by(AGE, used_internet) %>%
summarise(total = n())
## `summarise()` has grouped output by 'AGE'. You can override using the `.groups`
## argument.
b <- merge(b, b1 , by = 'AGE', all.x=TRUE)
b$percentage <- round((b$total/b$tot)*100,2)
ggplot(b, aes(fill= used_internet, y= percentage, x= AGE )) +
geom_bar(position='stack', stat='identity') +
geom_text(aes( label = percentage), size = 3) +
coord_flip() +
ggtitle('Proportion of people who ever used internet across age groups')
Observations:
d1 <- PU_data %>% group_by(GENDER) %>% summarise(tot = n())
d <- PU_data %>% group_by(GENDER, used_internet) %>%
summarise(total = n())
## `summarise()` has grouped output by 'GENDER'. You can override using the
## `.groups` argument.
d <- merge(d, d1 , by = 'GENDER', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= GENDER )) +
geom_bar(position='stack', stat='identity') +
geom_text(aes( label = percentage), size = 3) +
coord_flip() +
ggtitle('Proportion of people who used internet across gender')
Observations:
Proportion of people using internet arcoss gender is same, just male proportion being just 1 percent more
d1 <- PU_data %>% group_by(Edu_Level) %>% summarise(tot = n())
d <- PU_data %>% group_by(Edu_Level, used_internet) %>%
summarise(total = n())
## `summarise()` has grouped output by 'Edu_Level'. You can override using the
## `.groups` argument.
d <- merge(d, d1 , by = 'Edu_Level', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= Edu_Level )) +
geom_bar(position='stack', stat='identity') +
geom_text(aes( label = percentage), size = 3) +
coord_flip() +
ggtitle('Proportion of people who used internet by education level')
Observations:
d1 <- PU_data %>% group_by(Labour_Status) %>% summarise(tot = n())
d <- PU_data %>% group_by(Labour_Status, used_internet) %>%
summarise(total = n())
## `summarise()` has grouped output by 'Labour_Status'. You can override using the
## `.groups` argument.
d <- merge(d, d1 , by = 'Labour_Status', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= used_internet, y= percentage, x= Labour_Status )) +
geom_bar(position='stack', stat='identity') +
geom_text(aes( label = percentage), size = 3) +
coord_flip() +
ggtitle('Proportion of people who used internet by education level')
Observations:
d1 <- PU_data %>% group_by(AGE) %>% summarise(tot = n())
d <- PU_data %>% group_by(AGE, Labour_Status) %>%
summarise(total = n())
## `summarise()` has grouped output by 'AGE'. You can override using the `.groups`
## argument.
d <- merge(d, d1 , by = 'AGE', all.x=TRUE)
d$percentage <- round((d$total/d$tot)*100,2)
ggplot(d, aes(fill= Labour_Status, y= percentage, x= AGE )) +
geom_bar(position='stack', stat='identity') +
geom_text(aes( label = percentage), size = 3) +
coord_flip() +
ggtitle('Proportion of people who used internet by education level')
Observations:
From the plot we can observe that majority of Not in labor respondents belong to age group 55 or more, and this explains the reason why this group has more proportion of people that have never used interner as, older people have more proportion of people that have never used internet
durationCount <- table(PU_data$internet_usage_duration)
durationCount <- durationCount[order(durationCount, decreasing = FALSE)]
durationPerc <- durationCount/sum(durationCount)*100
durationPerc2 <- paste0(round(durationPerc, 2), "%")
names(durationPerc2) <- names(durationPerc)
durationBar <- barplot(durationPerc, main = "How many years have you used the Internet?", xlab = "Proportion of respondents", xlim = c(0,100), col = 'lightsteelblue', border = NA, horiz = TRUE, las = 1)
Observations:
Out of people who have used internet, 60% of the respondents have been using it for more than 5 years.
reasons$Answer <- factor(reasons$Answer, c("Yes", "No", "Don't Know"))
ggplot(reasons, aes(fill=Answer, y=Frequency, x=Answer)) +
geom_bar(position="dodge", stat="identity") +
scale_fill_viridis(discrete = T, option = "E") +
ggtitle("What are the reasons users no longer use the Internet from any location?") +
facet_wrap(~Reason) +
theme_ipsum() +
theme(legend.position="none") +
xlab("")
Observations:
The graph shows that among those who didn’t use the Internet during the past 12 months, no need to use appeared to be the most popular reason. However, as most of them answered ‘No’ to 4 reasons being surveyed, there seems to be other factors that make them stopped using the Internet.
As our project deals with past Internet users, we use chi-square test for independence to determine the association between different demographic factors and the answer of whether or not respondents used the Internet during the past 12 months. Specifically, we will consider 6 variables AGE, GENDER, Edu_Level, Labour_Status, Household_Size and PU_Q01 for our model, where the first 5 variables are predictors and the 6th variable is the outcome. The goal is to test the hypothesis whether each of the first 5 variables is independent of the fixed 6th variable at 0.05 level of significance. In other words, we aim to find out whether age, gender, education level, labour status or household size of respondents is indenpendent of their past usage. There will be 5 tests to be done corresponding to 5 combinations with the null hypothesis and alternate hypothesis of each test as follows:
H0: Both variables are independent
H1: Both variables are not independent
df2 <- PU_data[c(4:6,8,10,15)]
df2 <- subset(df2, PU_Q01!="Valid Skip")
x <- 1:5
outcome <- vector('list', length(x))
for (i in x) {
test <- chisq.test(df2[,i], df2[,6])
outcome[[i]] <- data.frame("X" = colnames(df2[i]), "Y" = colnames(df2[6]),
"chi.square" = round(test$statistic, 2),
"df" = test$parameter,
"p.value" = test$p.value)
}
outcome
## [[1]]
## X Y chi.square df p.value
## X-squared AGE PU_Q01 205.37 5 2.012823e-42
##
## [[2]]
## X Y chi.square df p.value
## X-squared GENDER PU_Q01 1.27 1 0.2601836
##
## [[3]]
## X Y chi.square df p.value
## X-squared Edu_Level PU_Q01 127.14 2 2.467053e-28
##
## [[4]]
## X Y chi.square df p.value
## X-squared Labour_Status PU_Q01 168.2 2 2.991793e-37
##
## [[5]]
## X Y chi.square df p.value
## X-squared Household_Size PU_Q01 150.34 3 2.228604e-32
Results:
The first pair to be tested is AGE and PU_Q01. As the p-value 2.012823e-42 is smaller than 0.05, we reject the null hypothesis and conclude that age is not independent of past usage. In other words, age affects whether or not respondents used the Internet in the past 12 months.
The second pair to be tested is GENDER and PU_Q01. As the p-value 0.2601836 is greater than 0.05, we fail to reject the null hypothesis and conclude that gender is independent of past usage. In other words, respondents’ gender does not affect whether or not they used the Internet in the past 12 months.
The third pair to be tested is Edu_Level and PU_Q01. As the p-value 2.467053e-28 is smaller than 0.05, we reject the null hypothesis and conclude that education level is not independent of past usage. In other words, respondents’ highest level of education affects whether or not they used the Internet in the past 12 months.
The fourth pair to be tested is Labour_Status and PU_Q01. As the p-value 2.991793e-37 is smaller than 0.05, we reject the null hypothesis and conclude that labour status is not independent of past usage. In other words, respondents’ employment status affects whether or not they used the Internet in the past 12 months.
The last pair to be tested is Household_Size and PU_Q01. As the p-value 2.228604e-32 is smaller than 0.05, we reject the null hypothesis and conclude that household size is not independent of past usage. In other words, the size of respondents’ family affects whether or not they used the Internet in the past 12 months.
To conclude, even the test findings indicate that age, education, and employment status are major factors in internet usage. People that are employed use the internet the most, and internet usage increases with education level.
Additionally, the age group of 55 and older include those who do not use the internet, which makes perfect sense given that this group belongs to the category of not in labour force and uses the internet the least.