Internet access is the ability of Individual and organization to connect to internet for their day to day activities. In the modern day world emails, online meetings has become a part and parcel of our lives. Therefore, internet service providers are trying to deliver connectivity to a wide range of customers through networking technologies.
With the vast advancement and accessibility of Internet, information is just a click away. Both male and female from every age group relies upon Internet for every aspect of their lives. Analyzing the data based on gender and age group will help us get insights as to which category of gender relies more on Internet.
The dataset has been collected over a span of 12 months in the year 2009 and depicts the adoption and use of digital technologies and the online behaviors of individual aged 15 years or older in the 10 provinces of Canada.
In this project, we are going to analyze the Personal Internet Usage of individuals belonging to various age groups hailing from the different provinces of Canada. Moreover we are concentrating on the regions where the frequency of internet usage is lowest and highest inorder to implement various strategies to enhance the utility of Internet in those provinces.
In 2009 80% of Canadians aged above 16 years have been using Internet for their Individual Use.
OVERVIEW OF DATASET
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
##
## Descriptive statistics
## =============================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------
## PUMFID 23,178 11,589.5 6,691.1 1 23,178
## PROVINCE 23,178 34.8 14.6 10 59
## REGION 23,178 3.2 1.5 1 6
## G_URBRUR 23,178 4.1 1.0 1 6
## GCAGEGR6 23,178 3.9 1.6 1 6
## CSEX 23,178 1.6 0.5 1 2
## G_CEDUC 23,178 1.8 0.7 1 3
## G_CSTUD 23,178 1.9 0.3 1 2
## G_CLFSST 23,178 1.8 1.0 1 3
## GFAMTYPE 23,178 2.1 0.8 1 4
## G_HHSIZE 23,178 2.3 1.1 1 4
## G_HEDUC 23,178 2.0 0.7 1 3
## G_HSTUD 23,178 1.8 0.4 1 2
## EV_Q01 23,178 1.2 0.4 1 2
## EV_Q02 23,178 4.2 1.2 1 8
## IU_Q01A 23,178 3.0 2.1 1 9
## IU_Q01B 23,178 2.9 2.1 1 9
## IU_Q01E 23,178 3.1 2.0 1 9
## IU_G01 23,178 3.1 2.1 1 9
## IU_Q01G 23,178 3.3 1.9 1 9
## IU_Q02A 23,178 2.7 2.3 1 9
## IU_Q02B 23,178 2.9 2.1 1 9
## IU_Q02E 23,178 3.1 2.0 1 9
## IU_G02 23,178 3.2 2.0 1 9
## IU_Q03 23,178 2.8 2.2 1 9
## IU_Q04 23,178 32.0 43.8 1 99
## IU_Q05 23,178 4.5 2.3 1 9
## IU_Q06 23,178 5.8 1.1 1 9
## ---------------------------------------------
head(df,10)
## PUMFID PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST
## 1 1 35 3 5 3 2 3 2 1
## 2 2 46 4 5 1 2 1 1 2
## 3 3 10 1 5 2 1 2 2 1
## 4 4 35 3 4 5 2 2 2 3
## 5 5 13 1 4 3 1 2 2 1
## 6 6 46 4 4 2 2 1 2 3
## 7 7 35 3 2 3 2 3 2 2
## 8 8 59 6 5 1 1 2 1 1
## 9 9 24 2 1 6 2 2 2 3
## 10 10 59 6 4 6 1 1 2 3
## GFAMTYPE G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 IU_Q01A IU_Q01B IU_Q01E
## 1 3 1 3 2 1 2 1 2 2
## 2 2 3 2 1 1 4 2 1 2
## 3 2 2 3 2 1 4 2 1 2
## 4 3 1 2 2 1 3 6 6 6
## 5 2 3 2 2 1 2 2 1 2
## 6 1 3 1 2 1 4 6 6 6
## 7 1 4 3 2 1 4 2 1 2
## 8 2 3 3 1 1 4 2 1 2
## 9 3 1 2 2 2 6 6 6 6
## 10 3 1 1 2 2 6 6 6 6
## IU_G01 IU_Q01G IU_Q02A IU_Q02B IU_Q02E IU_G02 IU_Q03 IU_Q04 IU_Q05 IU_Q06
## 1 2 2 2 1 2 2 2 1 1 6
## 2 2 2 1 2 2 2 1 1 6 6
## 3 2 2 1 2 2 1 1 2 6 6
## 4 6 6 6 6 6 6 6 96 6 6
## 5 2 2 1 1 2 2 3 1 6 6
## 6 6 6 6 6 6 6 6 96 6 6
## 7 2 2 1 2 2 2 1 5 6 6
## 8 2 2 1 1 2 2 1 3 6 6
## 9 6 6 6 6 6 6 6 96 6 6
## 10 6 6 6 6 6 6 6 96 6 6
PREPROCESSED DATA
# Setting the new preprocessed dataset
myDataPath=("/cloud/project/IndividualUse1.csv")
df1 <- read.csv("IndividualUse1.csv", header = T)
head(df1,10)
## X PUMFID PROVINCE REGION isUrban
## 1 1 1 Ontario Ontario 5
## 2 2 2 Manitoba Manitoba/Saskatchewan 5
## 3 3 3 Newfoundland and Labrador Atlantic Region 5
## 4 4 4 Ontario Ontario 4
## 5 5 5 New Brunswick Atlantic Region 4
## 6 6 6 Manitoba Manitoba/Saskatchewan 4
## 7 7 7 Ontario Ontario 2
## 8 8 8 British Columbia British Columbia 5
## 9 9 9 Quebec Quebec 1
## 10 10 10 British Columbia British Columbia 4
## ageGroup gender customerHighestEdu isStudent labourStat householdType
## 1 35 to 44 years Female 3 2 1 3
## 2 16 to 24 years Female 1 1 2 2
## 3 25 to 34 years Male 2 2 1 2
## 4 55 to 64 years Female 2 2 3 3
## 5 35 to 44 years Male 2 2 1 2
## 6 25 to 34 years Female 1 2 3 1
## 7 35 to 44 years Female 3 2 2 1
## 8 16 to 24 years Male 2 1 1 2
## 9 65 and older Female 2 2 3 3
## 10 65 and older Male 1 2 3 3
## householPersonCount householHighEdu householdStu usedInternet
## 1 1 3 2 1
## 2 3 2 1 1
## 3 2 3 2 1
## 4 1 2 2 1
## 5 3 2 2 1
## 6 3 1 2 1
## 7 4 3 2 1
## 8 3 3 1 1
## 9 1 2 2 2
## 10 1 1 2 2
## usedInternetYears iahTelephone iahCable iahWireless iahOtherConn iahNone
## 1 1- 2 years 1 2 2 2 2
## 2 >= 5 years 2 1 2 2 2
## 3 >= 5 years 2 1 2 2 2
## 4 2- 5 years 6 6 6 6 6
## 5 1- 2 years 2 1 2 2 2
## 6 >= 5 years 6 6 6 6 6
## 7 >= 5 years 2 1 2 2 2
## 8 >= 5 years 2 1 2 2 2
## 9 Valid Skip 6 6 6 6 6
## 10 Valid Skip 6 6 6 6 6
## iahPersonal iahPortable iahPda iahOtherDevice useFrequency hoursSpent
## 1 2 1 2 2 2 < 5 hours
## 2 1 2 2 2 1 < 5 hours
## 3 1 2 2 1 1 5 - 9 hours
## 4 6 6 6 6 6 > 40 hours
## 5 1 1 2 2 3 < 5 hours
## 6 6 6 6 6 6 > 40 hours
## 7 1 2 2 2 1 30 - 39 hours
## 8 1 1 2 2 1 10 - 19 hours
## 9 6 6 6 6 6 > 40 hours
## 10 6 6 6 6 6 > 40 hours
## highSpeedConnection highSpeedIntService
## 1 1 6
## 2 6 6
## 3 6 6
## 4 6 6
## 5 6 6
## 6 6 6
## 7 6 6
## 8 6 6
## 9 6 6
## 10 6 6
Frequency Table for Distribution of Hours Spent
# Frequency Table for Distribution of Hours Spent
library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: nnet
##
## Attaching package: 'epiDisplay'
## The following object is masked from 'package:ggplot2':
##
## alpha
tab1(df1$hoursSpent, sort.group = "decreasing", main = "Distribution of Hours Spent",cex.main=1 ,cex = 1.3,cex.names = 1.3 ,cex.lab=1.3,
ylab = "Frequency",cum.percent = TRUE, horiz = TRUE )
## df1$hoursSpent :
## Frequency Percent Cum. percent
## > 40 hours 7386 31.9 31.9
## < 5 hours 7256 31.3 63.2
## 5 - 9 hours 4203 18.1 81.3
## 10 - 19 hours 2540 11.0 92.3
## 20 - 29 hours 944 4.1 96.3
## > 40 hours 448 1.9 98.3
## 30 - 39 hours 401 1.7 100.0
## Total 23178 100.0 100.0
CrossTable for Higher Education and Years of Internet
Used
#CrossTable for Higher Education and Years of Internet Used
library(knitr)
library(gmodels)
##
## Attaching package: 'gmodels'
## The following object is masked from 'package:epiDisplay':
##
## ci
CrossTable(highEdu$highEdu, df1$usedInternetYears, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE, horiz = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 23178
##
##
## | df1$usedInternetYears
## highEdu$highEdu | < 1 year | >= 5 years | 1- 2 years | 2- 5 years | Don't know | Refusal | Valid Skip | Row Total |
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
## Either College or University | 338 | 7086 | 501 | 1518 | 21 | 0 | 2259 | 11723 |
## | 3.128 | 11.325 | 5.625 | 33.733 | 1.113 | 0.506 | 115.888 | |
## | 0.029 | 0.604 | 0.043 | 0.129 | 0.002 | 0.000 | 0.193 | 0.506 |
## | 0.557 | 0.526 | 0.562 | 0.587 | 0.636 | 0.000 | 0.403 | |
## | 0.015 | 0.306 | 0.022 | 0.065 | 0.001 | 0.000 | 0.097 | |
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
## Post-secondary or less | 212 | 1809 | 296 | 640 | 8 | 0 | 3030 | 5995 |
## | 19.267 | 803.604 | 18.640 | 1.246 | 0.034 | 0.259 | 1727.780 | |
## | 0.035 | 0.302 | 0.049 | 0.107 | 0.001 | 0.000 | 0.505 | 0.259 |
## | 0.349 | 0.134 | 0.332 | 0.247 | 0.242 | 0.000 | 0.541 | |
## | 0.009 | 0.078 | 0.013 | 0.028 | 0.000 | 0.000 | 0.131 | |
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
## University certificate or degree | 57 | 4566 | 94 | 428 | 4 | 1 | 310 | 5460 |
## | 51.712 | 613.712 | 63.989 | 53.886 | 1.832 | 2.481 | 771.808 | |
## | 0.010 | 0.836 | 0.017 | 0.078 | 0.001 | 0.000 | 0.057 | 0.236 |
## | 0.094 | 0.339 | 0.105 | 0.166 | 0.121 | 1.000 | 0.055 | |
## | 0.002 | 0.197 | 0.004 | 0.018 | 0.000 | 0.000 | 0.013 | |
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
## Column Total | 607 | 13461 | 891 | 2586 | 33 | 1 | 5599 | 23178 |
## | 0.026 | 0.581 | 0.038 | 0.112 | 0.001 | 0.000 | 0.242 | |
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
##
##
CrossTable for Hours Spent on Internet and Gender
#CrossTable for Hours Spent on Internet and Gender
CrossTable(df1$gender, df1$hoursSpent, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE,horiz = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 23178
##
##
## | df1$hoursSpent
## df1$gender | > 40 hours | > 40 hours | 10 - 19 hours | 20 - 29 hours | 30 - 39 hours | 5 - 9 hours | < 5 hours | Row Total |
## -------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
## Female | 4168 | 207 | 1319 | 457 | 199 | 2298 | 4169 | 12817 |
## | 1.714 | 6.698 | 5.213 | 8.097 | 2.333 | 0.295 | 6.109 | |
## | 0.325 | 0.016 | 0.103 | 0.036 | 0.016 | 0.179 | 0.325 | 0.553 |
## | 0.564 | 0.462 | 0.519 | 0.484 | 0.496 | 0.547 | 0.575 | |
## | 0.180 | 0.009 | 0.057 | 0.020 | 0.009 | 0.099 | 0.180 | |
## -------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
## Male | 3218 | 241 | 1221 | 487 | 202 | 1905 | 3087 | 10361 |
## | 2.121 | 8.286 | 6.449 | 10.017 | 2.886 | 0.365 | 7.558 | |
## | 0.311 | 0.023 | 0.118 | 0.047 | 0.019 | 0.184 | 0.298 | 0.447 |
## | 0.436 | 0.538 | 0.481 | 0.516 | 0.504 | 0.453 | 0.425 | |
## | 0.139 | 0.010 | 0.053 | 0.021 | 0.009 | 0.082 | 0.133 | |
## -------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
## Column Total | 7386 | 448 | 2540 | 944 | 401 | 4203 | 7256 | 23178 |
## | 0.319 | 0.019 | 0.110 | 0.041 | 0.017 | 0.181 | 0.313 | |
## -------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
##
##
Various Age Group Distribution and its Frequency
The below pie chart shows the various age groups considered in this
survey. As per the donut chart, it can be analysed that the frequency of
respondents from the age group 65 and above is higher compared to the
others.
#Various Age Group Distribution and its Frequency
library(plotrix)
library(ggpubr)
df1 %>%
pull(ageGroup) %>%
fct_count() %>%
rename(ageGroup = f, count = n) %>%
ggdonutchart("count", label = "ageGroup", fill = "ageGroup", color = "white",lab.font = c(4,"bold","black"),lab.pos = "in")
Province Wise Information Collected
The below donut chart shows the the numbers of respondents in the
various Provinces considered in this survey. We can see that, the
highest frequency of respondents are from the Ontario region followed by
Quebec, British Columbia and Alberta
#Province Wise Information Collected
library(plotrix)
library(ggpubr)
df1 %>%
pull(PROVINCE) %>%
fct_count() %>%
rename(PROVINCE = f, count = n) %>%
ggdonutchart("count", label = "PROVINCE", fill = "PROVINCE", color = "white",lab.font = c(3,"bold","black"),lab.pos = "in")
Gender wise hours spent by Internet users
The bar graph below shows the hours spent on Internet by the users
categorized on the basis of their gender. The frequency of Internet
usage is more or less similar in both the gender. There is no
significant difference in the hours spent on Internet by both Male and
Female. This plot depicts that gender does not have any significant
effect on Internet usage.
#Gender wise hours spent by Internet users
library(knitr)
library(ggplot2)
library(dplyr)
data <- df1 %>%
mutate(gender = factor(gender, labels = c("Male", "Female")),
hoursSpent = factor(hoursSpent))
ggplot(data, aes(x = hoursSpent, fill = gender))+
labs(x="Hours Spent",y="Frequency")+
theme(plot.title=element_text(size=10))+
geom_bar()
Province wise hours spent by Internet users
The bar graph gives information about the share of internet usage
in the different provinces of Canada which is further sorted by the
total amount of hours spent respectively. The statistics clearly
illustrates that Ontario is the highest internet using populance in the
whole of Canada with an internet usage standing at more than 40 hours
per week. Whereas, the internet usage population further declines to the
lowest in Prince Edward Island and the internet usage dropped to less
than 5 hours for most of its population.
#Province wise hours spent by Internet users
ggplot(df1,
aes(x = df1$PROVINCE ,fill = hoursSpent)) +
labs(x="Province",y="Total number of Hours Spent")+
theme(plot.title=element_text(size=10))+
geom_bar(position = "dodge") + coord_flip()
Province wise number of respondents
The bar plot below depicts the province wise number of respondents
in various Provinces of Canada. It can be clearly seen that Prince
Edward Island and Newfoundland and Labrador are the two provinces with
the lowest number of respondents, whereas Quebec and Ontario have the
highest number of respondents to the survey conducted. This further
points towards the fact that the internet usage in Ontario is maximum
while in Prince Edward Island internet usage is the least.
#Province wise number of respondents
library(RColorBrewer)
library(ggplot2)
ggplot(df1,
aes(x=reorder(PROVINCE, -table(PROVINCE)[PROVINCE]))) +
labs(x="Province",y="Total number of Respondent") + geom_bar(fill="#381339") + coord_flip()
Province wise Internet usage over the years
The below bar graph is representing the frequency of the years of
Internet used in various Provinces. This illustration gives us the
information about the top 5 provinces that has been using Internet for 5
years and above are Ontario,Quebec, British Columbia,Alberta and
Manitoba. Most users in the reqions of Saskatchewan, Nova Scotia , New
Brunswick and Newfoundland & Labrador has been using Internet for 2
to 5 years. Therefore these areas should be taken into account on how to
improve the use of Internet in those provinces.
#Province wise Internet usage over the years
ggplot(df1, aes(x = PROVINCE , fill = usedInternetYears)) +
labs(x="Province",y="Internet used over the Years")+
theme(plot.title=element_text(size=10))+
geom_bar(position = "dodge") + coord_flip()
Total number of responses based on age group
We have
analysed the total number of respondents through this bar graph, which
shows Women within the age group of 65 and above have responded to the
survey more than men of that age group. Moreover in the age group of 45
to 54 years both the men and women responded equally.
#Bargraph - respondents age
ggplot(df1,
aes(x= ageGroup,
fill = gender) )+
labs(x="Age Group",y="Total number of Respondent" ,
title = "Age group wise respondents",fill = "Gender")+
geom_bar(position = "dodge")
Internet Used on the basis of Education Level
The
mosaic plot depicts Internet usage over the years by the students
associated with different education levels. The graph represents the
relationship between the two categories ( Internet Used over the Years
and the level of Education) and illustrates the proportion under each
higher education levels.
#Internet Used on the basis of Education Level
library(ggmosaic)
ggplot(data = highEdu) +
geom_mosaic(aes(x = product(highEdu, usedInternet), fill = usedInternet)) +
xlab("Higher Education") + ylab("Internet Used in Years")
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
## Please report the issue at <]8;;https://github.com/haleyjeppson/ggmosaichttps://github.com/haleyjeppson/ggmosaic]8;;>.
Chi-square test examines whether rows and columns of a
contingency table are statistically significantly associated.
head(highEdu)
## PUMFID PROVINCE REGION isUrban
## 1 1 Ontario Ontario Rural
## 2 2 Manitoba Manitoba/Saskatchewan Rural
## 3 3 Newfoundland and Labrador Atlantic Region Rural
## 4 4 Ontario Ontario Urban
## 5 5 New Brunswick Atlantic Region Urban
## 6 6 Manitoba Manitoba/Saskatchewan Urban
## ageGroup gender customerHighestEdu isStudent labourStat householdType
## 1 35 to 44 years Female 3 2 1 3
## 2 16 to 24 years Female 1 1 2 2
## 3 25 to 34 years Male 2 2 1 2
## 4 55 to 64 years Female 2 2 3 3
## 5 35 to 44 years Male 2 2 1 2
## 6 25 to 34 years Female 1 2 3 1
## householPersonCount householHighEdu householdStu usedInternet
## 1 1 3 2 Yes
## 2 3 2 1 Yes
## 3 2 3 2 Yes
## 4 1 2 2 Yes
## 5 3 2 2 Yes
## 6 3 1 2 Yes
## usedInternetYears iahTelephone iahCable iahWireless iahOtherConn iahNone
## 1 1- 2 years 1 2 2 2 2
## 2 >= 5 years 2 1 2 2 2
## 3 >= 5 years 2 1 2 2 2
## 4 2- 5 years 6 6 6 6 6
## 5 1- 2 years 2 1 2 2 2
## 6 >= 5 years 6 6 6 6 6
## iahPersonal iahPortable iahPda iahOtherDevice
## 1 2 1 2 2
## 2 1 2 2 2
## 3 1 2 2 1
## 4 6 6 6 6
## 5 1 1 2 2
## 6 6 6 6 6
## useFrequency hoursSpent highSpeedConnection
## 1 At least once a week (but not every day) < 5 hours 1
## 2 At least once a day < 5 hours 6
## 3 At least once a day 5 - 9 hours 6
## 4 Valid skip > 40 hours 6
## 5 At least once a month (but not every week) < 5 hours 6
## 6 Valid skip > 40 hours 6
## highSpeedIntService highEdu
## 1 6 University certificate or degree
## 2 6 Either College or University
## 3 6 University certificate or degree
## 4 6 Either College or University
## 5 6 Either College or University
## 6 6 Post-secondary or less
#Creating Subset to remove valid Skip and Not stated responses
D2 = subset(highEdu, useFrequency %in% c("At least once a day","At least once a week (but not every day)","At least once a month (but not every week)","Less than once a month"))
D2.Gender_isInternetUsed <- table(D2$useFrequency,D2$gender)
D2.Gender_isInternetUsed
##
## Female Male
## At least once a day 6370 5477
## At least once a month (but not every week) 312 228
## At least once a week (but not every day) 1854 1362
## Less than once a month 135 95
round(prop.table(D2.Gender_isInternetUsed,1),2)
##
## Female Male
## At least once a day 0.54 0.46
## At least once a month (but not every week) 0.58 0.42
## At least once a week (but not every day) 0.58 0.42
## Less than once a month 0.59 0.41
chisq <- chisq.test(D2.Gender_isInternetUsed)
round(chisq$residuals, 3)
##
## Female Male
## At least once a day -1.466 1.613
## At least once a month (but not every week) 0.946 -1.041
## At least once a week (but not every day) 2.210 -2.432
## Less than once a month 0.805 -0.886
Creating Contingency tables for Gender Vs Internet Used Frequency
The p value less than significance level 0.05, so we can fail to reject the null hypothesis saying there is enough evidence to suggest association between gender and the internet usage
library(corrplot)
## corrplot 0.92 loaded
corrplot(chisq$residuals, is.cor = FALSE)
#Creating Contingency tables for Area Vs Internet Used Frequency
D2.Area_InternetUsed <- table(D2$useFrequency,D2$isUrban)
D2.Area_InternetUsed
##
## Rural Urban
## At least once a day 3292 8555
## At least once a month (but not every week) 190 350
## At least once a week (but not every day) 1084 2132
## Less than once a month 73 157
round(prop.table(D2.Area_InternetUsed,1),2)
##
## Rural Urban
## At least once a day 0.28 0.72
## At least once a month (but not every week) 0.35 0.65
## At least once a week (but not every day) 0.34 0.66
## Less than once a month 0.32 0.68
chisq1 <- chisq.test(D2.Area_InternetUsed)
round(chisq1$residuals, 3)
##
## Rural Urban
## At least once a day -3.040 1.957
## At least once a month (but not every week) 2.527 -1.627
## At least once a week (but not every day) 4.617 -2.972
## Less than once a month 0.684 -0.440
Creating Contingency tables for Area Vs Internet Used Frequency
The p value less than significance level 0.05, so we can fail to reject the null hypothesis saying there is enough evidence to suggest association between Area and the internet usage
library(corrplot)
corrplot(chisq1$residuals, is.cor = FALSE)
Correlation Matrix
install.packages("corrplot")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("corrplot")
library("ggcorrplot")
library("GGally")
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:ggmosaic':
##
## happy
ggcorr(df1,method = c("everything","pearson"),
label = TRUE,label_alpha = TRUE,label_size = 3)
## Warning in ggcorr(df1, method = c("everything", "pearson"), label = TRUE, : data
## in column(s) 'PROVINCE', 'REGION', 'ageGroup', 'gender', 'usedInternetYears',
## 'hoursSpent' are not numeric and were ignored
#### Decision Tree Model
The Decision Tree (DT) method,
which is used in our study, does not require a pre-defined underlying
relationship. In addition, the method allows a great many explanatory
variables to be processed and the most important variables are easy to
identify. Obtained results can serve as to web developers and designers,
since by indicating the differences between male and female internet
users in terms of their behaviour on the internet it can help in
deciding when, where and how to address and appeal to which section of
the user base. It is especially important to know their online
preferences in order to enable the adequate and targeted placement of
information, actions or products and services for the intended target
groups.
D2_df <- highEdu[c("gender","labourStat","highEdu","usedInternet","hoursSpent")]
colnames(highEdu)
## [1] "PUMFID" "PROVINCE" "REGION"
## [4] "isUrban" "ageGroup" "gender"
## [7] "customerHighestEdu" "isStudent" "labourStat"
## [10] "householdType" "householPersonCount" "householHighEdu"
## [13] "householdStu" "usedInternet" "usedInternetYears"
## [16] "iahTelephone" "iahCable" "iahWireless"
## [19] "iahOtherConn" "iahNone" "iahPersonal"
## [22] "iahPortable" "iahPda" "iahOtherDevice"
## [25] "useFrequency" "hoursSpent" "highSpeedConnection"
## [28] "highSpeedIntService" "highEdu"
colnames(D2_df)
## [1] "gender" "labourStat" "highEdu" "usedInternet" "hoursSpent"
#compute Shannon entropy
entropy <- function(target) {
freq <- table(target)/length(target)
#Vectorize
vec <- as.data.frame(freq)[,2]
#drop 0 to avoid NaN resulting from log2
vec<-vec[vec>0]
#Compute entropy
-sum(vec * log2(vec))
}
for(k in names(D2_df)){
print(k)
print(entropy(D2_df[[k]]))
}
## [1] "gender"
## [1] 0.9918854
## [1] "labourStat"
## [1] 1.184813
## [1] "highEdu"
## [1] 1.493344
## [1] "usedInternet"
## [1] 0.7976336
## [1] "hoursSpent"
## [1] 2.245911
#Converting everything into factors
D2_df[] <- lapply(D2_df, factor)
#creating train and test dataset
create_train_test <- function(data, size = 0.8, train = TRUE) {
n_row = nrow(data)
total_row = size * n_row
train_sample <- 1: total_row
if (train == TRUE) {
return (data[train_sample, ])
} else {
return (data[-train_sample, ])
}
}
data_train <- create_train_test(D2_df, 0.8, train = TRUE)
data_test <- create_train_test(D2_df, 0.8, train = FALSE)
dim(data_train)
## [1] 18542 5
#Plotting decision tree with determining the output as "sex" for the train dataset
install.packages("rpart")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(rpart)
library(rpart.plot)
fit <- rpart(data_test$usedInternet~., data = data_test, method = 'class')
rpart.plot(fit,extra = 108,box.palette="GnRd",fallen.leaves = FALSE,cex = 0.5)
Predicting the output whether gender and their employment
status has prominant effect on internet usage or not
predict_unseen1 <-predict(fit, data_test, type = 'class')
table_mat1 <- table(data_test$usedInternet, predict_unseen1)
table_mat1
## predict_unseen1
## No Yes
## No 1121 19
## Yes 329 3167
accuracy_Test1 <- sum(diag(table_mat1)) / sum(table_mat1)
print(paste('Accuracy for test', accuracy_Test1))
## [1] "Accuracy for test 0.924935289042278"
#PCA- in order to determine which variables are most important for explaining the variance in our data set:
myDataPath=("/cloud/project/IndividualUse_Raw.csv")
df <- read.csv("IndividualUse_Raw.csv", header = T)
head(df)
## PUMFID PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST
## 1 1 35 3 5 3 2 3 2 1
## 2 2 46 4 5 1 2 1 1 2
## 3 3 10 1 5 2 1 2 2 1
## 4 4 35 3 4 5 2 2 2 3
## 5 5 13 1 4 3 1 2 2 1
## 6 6 46 4 4 2 2 1 2 3
## GFAMTYPE G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 IU_Q01A IU_Q01B IU_Q01E
## 1 3 1 3 2 1 2 1 2 2
## 2 2 3 2 1 1 4 2 1 2
## 3 2 2 3 2 1 4 2 1 2
## 4 3 1 2 2 1 3 6 6 6
## 5 2 3 2 2 1 2 2 1 2
## 6 1 3 1 2 1 4 6 6 6
## IU_G01 IU_Q01G IU_Q02A IU_Q02B IU_Q02E IU_G02 IU_Q03 IU_Q04 IU_Q05 IU_Q06
## 1 2 2 2 1 2 2 2 1 1 6
## 2 2 2 1 2 2 2 1 1 6 6
## 3 2 2 1 2 2 1 1 2 6 6
## 4 6 6 6 6 6 6 6 96 6 6
## 5 2 2 1 1 2 2 3 1 6 6
## 6 6 6 6 6 6 6 6 96 6 6
df1 <- (df[, 2:28])
#Renaming Columns
df1 <- rename(df1, isUrban = G_URBRUR) #change column name G_URBRUR to isUrban
df1 <- rename(df1, gender = CSEX)
df1 <- rename(df1, iahTelephone = IU_Q01A)
df1 <- rename(df1, iahCable = IU_Q01B)
df1 <- rename(df1, iahWireless = IU_Q01E)
df1 <- rename(df1, iahOtherConn = IU_G01)
df1 <- rename(df1, iahNone = IU_Q01G)
df1 <- rename(df1, iahPersonal = IU_Q02A)
df1 <- rename(df1, iahPortable = IU_Q02B)
df1 <- rename(df1, iahPda = IU_Q02E)
df1 <- rename(df1, iahOtherDevice = IU_G02)
df1 <- rename(df1, ageGroup = GCAGEGR6)
df1 <- rename(df1, customerHighestEdu = G_CEDUC)
df1 <- rename(df1,isStudent= G_CSTUD)
df1 <- rename(df1,labourStat= G_CLFSST)
df1 <- rename(df1,householdType= GFAMTYPE)
df1 <- rename(df1, householPersonCount= G_HHSIZE)
df1 <- rename(df1, householHighEdu = G_HEDUC)
df1 <- rename(df1,householdStu= G_HSTUD)
df1 <- rename(df1,usedInternet= EV_Q01)
df1 <- rename(df1, usedInternetYears= EV_Q02)
df1 <- rename(df1,useFrequency= IU_Q03)
df1 <- rename(df1, hoursSpent = IU_Q04)
df1 <- rename(df1, highSpeedConnection = IU_Q05)
df1 <- rename(df1,highSpeedIntService= IU_Q06)
Eigen Values
The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs corresponds to the directions with the maximum amount of variation in the data set. How many principal components should be retained? An eigenvalue > 1 indicates that PCs account for more variance than accounted by one of the original variables in standardized data. Hence, the components with eigenvalue > 1 are retained. Another option is to use the scree plot. From the plot below, we might want to stop at the fifth principal component. 89% of the information (variances) contained in the data are retained by the first five principal components.
install.packages(c("FactoMineR", "factoextra"))
## Installing packages into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("FactoMineR")
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
res.pca <- PCA(df1, graph = FALSE)
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23178 individuals, described by 27 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
eig.val <- get_eigenvalue(res.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 12.818420098 47.47562999 47.47563
## Dim.2 2.125285088 7.87142625 55.34706
## Dim.3 1.971610123 7.30225971 62.64932
## Dim.4 1.601495484 5.93146476 68.58078
## Dim.5 1.345875243 4.98472312 73.56550
## Dim.6 1.115252828 4.13056603 77.69607
## Dim.7 1.058241537 3.91941310 81.61548
## Dim.8 0.927411468 3.43485729 85.05034
## Dim.9 0.915740709 3.39163226 88.44197
## Dim.10 0.731856851 2.71058093 91.15255
## Dim.11 0.667649463 2.47277579 93.62533
## Dim.12 0.479415477 1.77561288 95.40094
## Dim.13 0.323802209 1.19926744 96.60021
## Dim.14 0.228077393 0.84473108 97.44494
## Dim.15 0.216769497 0.80284999 98.24779
## Dim.16 0.165711197 0.61374517 98.86154
## Dim.17 0.094391466 0.34959802 99.21113
## Dim.18 0.052264967 0.19357395 99.40471
## Dim.19 0.045960284 0.17022328 99.57493
## Dim.20 0.032242618 0.11941711 99.69435
## Dim.21 0.020728821 0.07677341 99.77112
## Dim.22 0.017349166 0.06425617 99.83538
## Dim.23 0.015895029 0.05887048 99.89425
## Dim.24 0.013043007 0.04830743 99.94256
## Dim.25 0.008482208 0.03141559 99.97397
## Dim.26 0.004318778 0.01599547 99.98997
## Dim.27 0.002708990 0.01003330 100.00000
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))
Principal Component Analysis Results for variables
PCA reduces the dimensionality of a multivariate data to two or three
principal components, which could be visualized graphically, with
minimal loss of information.
var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
head(var$cos2, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## PROVINCE 0.008735762 1.079010e-01 0.85610110 0.010483532 0.002409806
## REGION 0.009573626 1.088714e-01 0.85312187 0.010851716 0.002861060
## isUrban 0.008852487 4.586957e-06 0.01269342 0.062933100 0.148343961
## ageGroup 0.245979794 2.371816e-01 0.04003498 0.001495516 0.001452835
Correlation Matrix The correlation matrix can be
reordered according to the correlation matrix coefficients. This is
important to identify the hidden structure and pattern in the
matrix.
install.packages("corrplot")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
Quality Of Representation-Cos2
An accurate depiction of the variable on the principal component is
indicated by a high cos2. In this instance, the variable is situated
rather close to the correlation circle’s edge.
A low cos2 means that the PCs did not accurately represent the variable. In this instance, the variable is very near the circle’s centre.
The sum of the cos2 on all the principal components for a particular variable is equal to one.
The sum of the cos2 on the two main components (PCs) of a variable that can be perfectly described by just two (Dim.1 & Dim.2) is one. The variables will be placed on the circle of correlations in this scenario. In order to accurately represent the data, some of the variables may require more than 2 components. In this instance, the variables are situated inside the correlation circle.
In conclusion, the cos2 values are used to estimate the accuracy of the representation. A variable’s factor map representation improves when a variable gets nearer to the circle of correlations. Close to the plot’s centre variables are less significant for the initial components.
install.packages("ggrepel")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggrepel)
fviz_cos2(res.pca, choice = "var", axes = 1:2)
library(ggrepel)
fviz_pca_var(res.pca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
fviz_contrib(res.pca, choice = "var", axes = 1, top = 15)
fviz_contrib(res.pca, choice = "var", axes = 2, top = 15)
fviz_pca_var(res.pca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
)
K-means Clustering
#Cluster analysis:
install.packages("ClusterR")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("cluster")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ClusterR)
## Loading required package: gtools
library(cluster)
colnames(df1)
## [1] "PROVINCE" "REGION" "isUrban"
## [4] "ageGroup" "gender" "customerHighestEdu"
## [7] "isStudent" "labourStat" "householdType"
## [10] "householPersonCount" "householHighEdu" "householdStu"
## [13] "usedInternet" "usedInternetYears" "iahTelephone"
## [16] "iahCable" "iahWireless" "iahOtherConn"
## [19] "iahNone" "iahPersonal" "iahPortable"
## [22] "iahPda" "iahOtherDevice" "useFrequency"
## [25] "hoursSpent" "highSpeedConnection" "highSpeedIntService"
#Filtering the two columns used for kmeans clustering
kmdata <- df1
kmdata <- kmdata[,c("PROVINCE","hoursSpent")]
#As kmeans required numerical data, used org_df for numerical data
kmdata[, c('PROVINCE', 'usedInternet')] <-
sapply(df1[, c('PROVINCE', 'hoursSpent')], unclass)
head(kmdata)
## PROVINCE hoursSpent usedInternet
## 1 35 1 1
## 2 46 1 1
## 3 10 2 2
## 4 35 96 96
## 5 13 1 1
## 6 46 96 96
set.seed(240)
###find wss and find no of cluster
#Calculate and plot WSS for a series of k values
wss <- numeric(20)
for (k in 1:20) wss[k] <- sum(kmeans(kmdata, centers = k, nstart = 25)$withinss)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
plot(1:20, wss, type = "b", xlab = "Number of Clusters",
ylab = "Within Sum of Squares (WSS)"
,engine = "interactive")
## Warning in plot.window(...): "engine" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "engine" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "engine" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "engine" is not a
## graphical parameter
## Warning in box(...): "engine" is not a graphical parameter
## Warning in title(...): "engine" is not a graphical parameter
kmeans.re <- kmeans(kmdata, centers = 3, nstart = 1)
#kmeans.re
#kmeans.re$cluster
#creating clusters based on the province
confMatrix <- table(kmdata$PROVINCE, kmeans.re$cluster)
confMatrix
##
## 1 2 3
## 10 0 364 518
## 11 0 201 391
## 12 0 420 820
## 13 0 413 671
## 24 0 1590 2847
## 35 0 1898 4620
## 46 1264 759 0
## 47 1079 548 0
## 48 1656 586 0
## 59 1926 607 0
# unique(kmdata)
K-means Clustering with 3 clusters
The unsupervised method of determining the cluster of the various group of internet users has been conducted using the K-means clustering. Based on the hours spent by the internet users, and the Provinces they belong to, we have identified 3 prominent clusters as analyzed from the elbow method.
plot(kmdata[c("PROVINCE", "hoursSpent")],
col = kmeans.re$cluster,
xlab = 'PROVINCE',
ylab = 'Hours Spent on Internet')
kmeans.re$centers
## PROVINCE hoursSpent usedInternet
## 1 50.96692 2.056203 2.056203
## 2 33.23328 96.019767 96.019767
## 3 26.15506 1.983176 1.983176
kmeans.re$centers[, c("PROVINCE", "hoursSpent")]
## PROVINCE hoursSpent
## 1 50.96692 2.056203
## 2 33.23328 96.019767
## 3 26.15506 1.983176
points(kmeans.re$centers[, c("PROVINCE", "hoursSpent")],
col = 1:3, pch = 8, cex = 3)
Based on our analysis, we performed the unsupervised learning
techniques such as decision tree model and k-means clustering method. On
the basis of our decision tree model, we have found that the internet
usage is higher amongst male who are fit to work as well as those who
are unemployed. On contrary, lower tendency in the usage of internet has
been detected among those users attending college or university or lower
education level. Finally the accuracy for this particular decision tree
model is 92.5%. Hence we relied on these factors for our prediction
model.