INTRODUCTION

Internet access is the ability of Individual and organization to connect to internet for their day to day activities. In the modern day world emails, online meetings has become a part and parcel of our lives. Therefore, internet service providers are trying to deliver connectivity to a wide range of customers through networking technologies.

With the vast advancement and accessibility of Internet, information is just a click away. Both male and female from every age group relies upon Internet for every aspect of their lives. Analyzing the data based on gender and age group will help us get insights as to which category of gender relies more on Internet.

The dataset has been collected over a span of 12 months in the year 2009 and depicts the adoption and use of digital technologies and the online behaviors of individual aged 15 years or older in the 10 provinces of Canada.

OBJECTIVE

In this project, we are going to analyze the Personal Internet Usage of individuals belonging to various age groups hailing from the different provinces of Canada. Moreover we are concentrating on the regions where the frequency of internet usage is lowest and highest inorder to implement various strategies to enhance the utility of Internet in those provinces.

In 2009 80% of Canadians aged above 16 years have been using Internet for their Individual Use.


OVERVIEW OF DATASET

## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
## 
## Descriptive statistics
## =============================================
## Statistic   N      Mean   St. Dev. Min  Max  
## ---------------------------------------------
## PUMFID    23,178 11,589.5 6,691.1   1  23,178
## PROVINCE  23,178   34.8     14.6   10    59  
## REGION    23,178   3.2      1.5     1    6   
## G_URBRUR  23,178   4.1      1.0     1    6   
## GCAGEGR6  23,178   3.9      1.6     1    6   
## CSEX      23,178   1.6      0.5     1    2   
## G_CEDUC   23,178   1.8      0.7     1    3   
## G_CSTUD   23,178   1.9      0.3     1    2   
## G_CLFSST  23,178   1.8      1.0     1    3   
## GFAMTYPE  23,178   2.1      0.8     1    4   
## G_HHSIZE  23,178   2.3      1.1     1    4   
## G_HEDUC   23,178   2.0      0.7     1    3   
## G_HSTUD   23,178   1.8      0.4     1    2   
## EV_Q01    23,178   1.2      0.4     1    2   
## EV_Q02    23,178   4.2      1.2     1    8   
## IU_Q01A   23,178   3.0      2.1     1    9   
## IU_Q01B   23,178   2.9      2.1     1    9   
## IU_Q01E   23,178   3.1      2.0     1    9   
## IU_G01    23,178   3.1      2.1     1    9   
## IU_Q01G   23,178   3.3      1.9     1    9   
## IU_Q02A   23,178   2.7      2.3     1    9   
## IU_Q02B   23,178   2.9      2.1     1    9   
## IU_Q02E   23,178   3.1      2.0     1    9   
## IU_G02    23,178   3.2      2.0     1    9   
## IU_Q03    23,178   2.8      2.2     1    9   
## IU_Q04    23,178   32.0     43.8    1    99  
## IU_Q05    23,178   4.5      2.3     1    9   
## IU_Q06    23,178   5.8      1.1     1    9   
## ---------------------------------------------
head(df,10)
##    PUMFID PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST
## 1       1       35      3        5        3    2       3       2        1
## 2       2       46      4        5        1    2       1       1        2
## 3       3       10      1        5        2    1       2       2        1
## 4       4       35      3        4        5    2       2       2        3
## 5       5       13      1        4        3    1       2       2        1
## 6       6       46      4        4        2    2       1       2        3
## 7       7       35      3        2        3    2       3       2        2
## 8       8       59      6        5        1    1       2       1        1
## 9       9       24      2        1        6    2       2       2        3
## 10     10       59      6        4        6    1       1       2        3
##    GFAMTYPE G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 IU_Q01A IU_Q01B IU_Q01E
## 1         3        1       3       2      1      2       1       2       2
## 2         2        3       2       1      1      4       2       1       2
## 3         2        2       3       2      1      4       2       1       2
## 4         3        1       2       2      1      3       6       6       6
## 5         2        3       2       2      1      2       2       1       2
## 6         1        3       1       2      1      4       6       6       6
## 7         1        4       3       2      1      4       2       1       2
## 8         2        3       3       1      1      4       2       1       2
## 9         3        1       2       2      2      6       6       6       6
## 10        3        1       1       2      2      6       6       6       6
##    IU_G01 IU_Q01G IU_Q02A IU_Q02B IU_Q02E IU_G02 IU_Q03 IU_Q04 IU_Q05 IU_Q06
## 1       2       2       2       1       2      2      2      1      1      6
## 2       2       2       1       2       2      2      1      1      6      6
## 3       2       2       1       2       2      1      1      2      6      6
## 4       6       6       6       6       6      6      6     96      6      6
## 5       2       2       1       1       2      2      3      1      6      6
## 6       6       6       6       6       6      6      6     96      6      6
## 7       2       2       1       2       2      2      1      5      6      6
## 8       2       2       1       1       2      2      1      3      6      6
## 9       6       6       6       6       6      6      6     96      6      6
## 10      6       6       6       6       6      6      6     96      6      6


PREPROCESSED DATA

# Setting the new preprocessed dataset
myDataPath=("/cloud/project/IndividualUse1.csv")
df1 <- read.csv("IndividualUse1.csv", header = T) 
head(df1,10)
##     X PUMFID                  PROVINCE                REGION isUrban
## 1   1      1                   Ontario               Ontario       5
## 2   2      2                  Manitoba Manitoba/Saskatchewan       5
## 3   3      3 Newfoundland and Labrador       Atlantic Region       5
## 4   4      4                   Ontario               Ontario       4
## 5   5      5             New Brunswick       Atlantic Region       4
## 6   6      6                  Manitoba Manitoba/Saskatchewan       4
## 7   7      7                   Ontario               Ontario       2
## 8   8      8          British Columbia      British Columbia       5
## 9   9      9                    Quebec                Quebec       1
## 10 10     10          British Columbia      British Columbia       4
##           ageGroup gender customerHighestEdu isStudent labourStat householdType
## 1  35 to 44 years  Female                  3         2          1             3
## 2  16 to 24 years  Female                  1         1          2             2
## 3  25 to 34 years    Male                  2         2          1             2
## 4  55 to 64 years  Female                  2         2          3             3
## 5  35 to 44 years    Male                  2         2          1             2
## 6  25 to 34 years  Female                  1         2          3             1
## 7  35 to 44 years  Female                  3         2          2             1
## 8  16 to 24 years    Male                  2         1          1             2
## 9     65 and older Female                  2         2          3             3
## 10    65 and older   Male                  1         2          3             3
##    householPersonCount householHighEdu householdStu usedInternet
## 1                    1               3            2            1
## 2                    3               2            1            1
## 3                    2               3            2            1
## 4                    1               2            2            1
## 5                    3               2            2            1
## 6                    3               1            2            1
## 7                    4               3            2            1
## 8                    3               3            1            1
## 9                    1               2            2            2
## 10                   1               1            2            2
##    usedInternetYears iahTelephone iahCable iahWireless iahOtherConn iahNone
## 1         1- 2 years            1        2           2            2       2
## 2         >= 5 years            2        1           2            2       2
## 3         >= 5 years            2        1           2            2       2
## 4         2- 5 years            6        6           6            6       6
## 5         1- 2 years            2        1           2            2       2
## 6         >= 5 years            6        6           6            6       6
## 7         >= 5 years            2        1           2            2       2
## 8         >= 5 years            2        1           2            2       2
## 9         Valid Skip            6        6           6            6       6
## 10        Valid Skip            6        6           6            6       6
##    iahPersonal iahPortable iahPda iahOtherDevice useFrequency     hoursSpent
## 1            2           1      2              2            2      < 5 hours
## 2            1           2      2              2            1      < 5 hours
## 3            1           2      2              1            1    5 - 9 hours
## 4            6           6      6              6            6     > 40 hours
## 5            1           1      2              2            3      < 5 hours
## 6            6           6      6              6            6     > 40 hours
## 7            1           2      2              2            1  30 - 39 hours
## 8            1           1      2              2            1  10 - 19 hours
## 9            6           6      6              6            6     > 40 hours
## 10           6           6      6              6            6     > 40 hours
##    highSpeedConnection highSpeedIntService
## 1                    1                   6
## 2                    6                   6
## 3                    6                   6
## 4                    6                   6
## 5                    6                   6
## 6                    6                   6
## 7                    6                   6
## 8                    6                   6
## 9                    6                   6
## 10                   6                   6

Exploratory Data Analysis


Frequency Table for Distribution of Hours Spent

# Frequency Table for Distribution of Hours Spent
library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: nnet
## 
## Attaching package: 'epiDisplay'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
tab1(df1$hoursSpent, sort.group = "decreasing", main = "Distribution of Hours Spent",cex.main=1 ,cex = 1.3,cex.names = 1.3 ,cex.lab=1.3,
     ylab = "Frequency",cum.percent = TRUE, horiz = TRUE )

## df1$hoursSpent : 
##                Frequency Percent Cum. percent
##  > 40 hours         7386    31.9         31.9
## < 5 hours           7256    31.3         63.2
##  5 - 9 hours        4203    18.1         81.3
##  10 - 19 hours      2540    11.0         92.3
##  20 - 29 hours       944     4.1         96.3
##  > 40 hours          448     1.9         98.3
##  30 - 39 hours       401     1.7        100.0
##   Total            23178   100.0        100.0


CrossTable for Higher Education and Years of Internet Used

#CrossTable for Higher Education and Years of Internet Used
library(knitr)
library(gmodels)
## 
## Attaching package: 'gmodels'
## The following object is masked from 'package:epiDisplay':
## 
##     ci
CrossTable(highEdu$highEdu, df1$usedInternetYears, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE, horiz = TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  23178 
## 
##  
##                                   | df1$usedInternetYears 
##                   highEdu$highEdu |   < 1 year | >= 5 years | 1- 2 years | 2- 5 years | Don't know |    Refusal | Valid Skip |  Row Total | 
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
##    Either College or University   |        338 |       7086 |        501 |       1518 |         21 |          0 |       2259 |      11723 | 
##                                   |      3.128 |     11.325 |      5.625 |     33.733 |      1.113 |      0.506 |    115.888 |            | 
##                                   |      0.029 |      0.604 |      0.043 |      0.129 |      0.002 |      0.000 |      0.193 |      0.506 | 
##                                   |      0.557 |      0.526 |      0.562 |      0.587 |      0.636 |      0.000 |      0.403 |            | 
##                                   |      0.015 |      0.306 |      0.022 |      0.065 |      0.001 |      0.000 |      0.097 |            | 
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
##           Post-secondary or less  |        212 |       1809 |        296 |        640 |          8 |          0 |       3030 |       5995 | 
##                                   |     19.267 |    803.604 |     18.640 |      1.246 |      0.034 |      0.259 |   1727.780 |            | 
##                                   |      0.035 |      0.302 |      0.049 |      0.107 |      0.001 |      0.000 |      0.505 |      0.259 | 
##                                   |      0.349 |      0.134 |      0.332 |      0.247 |      0.242 |      0.000 |      0.541 |            | 
##                                   |      0.009 |      0.078 |      0.013 |      0.028 |      0.000 |      0.000 |      0.131 |            | 
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
## University certificate or degree  |         57 |       4566 |         94 |        428 |          4 |          1 |        310 |       5460 | 
##                                   |     51.712 |    613.712 |     63.989 |     53.886 |      1.832 |      2.481 |    771.808 |            | 
##                                   |      0.010 |      0.836 |      0.017 |      0.078 |      0.001 |      0.000 |      0.057 |      0.236 | 
##                                   |      0.094 |      0.339 |      0.105 |      0.166 |      0.121 |      1.000 |      0.055 |            | 
##                                   |      0.002 |      0.197 |      0.004 |      0.018 |      0.000 |      0.000 |      0.013 |            | 
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
##                      Column Total |        607 |      13461 |        891 |       2586 |         33 |          1 |       5599 |      23178 | 
##                                   |      0.026 |      0.581 |      0.038 |      0.112 |      0.001 |      0.000 |      0.242 |            | 
## ----------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|
## 
## 


CrossTable for Hours Spent on Internet and Gender

#CrossTable for Hours Spent on Internet and Gender
CrossTable(df1$gender, df1$hoursSpent, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE,horiz = TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  23178 
## 
##  
##              | df1$hoursSpent 
##   df1$gender |     > 40 hours |    > 40 hours  |  10 - 19 hours |  20 - 29 hours |  30 - 39 hours |    5 - 9 hours |      < 5 hours |      Row Total | 
## -------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
##       Female |           4168 |            207 |           1319 |            457 |            199 |           2298 |           4169 |          12817 | 
##              |          1.714 |          6.698 |          5.213 |          8.097 |          2.333 |          0.295 |          6.109 |                | 
##              |          0.325 |          0.016 |          0.103 |          0.036 |          0.016 |          0.179 |          0.325 |          0.553 | 
##              |          0.564 |          0.462 |          0.519 |          0.484 |          0.496 |          0.547 |          0.575 |                | 
##              |          0.180 |          0.009 |          0.057 |          0.020 |          0.009 |          0.099 |          0.180 |                | 
## -------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
##         Male |           3218 |            241 |           1221 |            487 |            202 |           1905 |           3087 |          10361 | 
##              |          2.121 |          8.286 |          6.449 |         10.017 |          2.886 |          0.365 |          7.558 |                | 
##              |          0.311 |          0.023 |          0.118 |          0.047 |          0.019 |          0.184 |          0.298 |          0.447 | 
##              |          0.436 |          0.538 |          0.481 |          0.516 |          0.504 |          0.453 |          0.425 |                | 
##              |          0.139 |          0.010 |          0.053 |          0.021 |          0.009 |          0.082 |          0.133 |                | 
## -------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
## Column Total |           7386 |            448 |           2540 |            944 |            401 |           4203 |           7256 |          23178 | 
##              |          0.319 |          0.019 |          0.110 |          0.041 |          0.017 |          0.181 |          0.313 |                | 
## -------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
## 
## 


Various Age Group Distribution and its Frequency

The below pie chart shows the various age groups considered in this survey. As per the donut chart, it can be analysed that the frequency of respondents from the age group 65 and above is higher compared to the others.

#Various Age Group Distribution and its Frequency
library(plotrix)
library(ggpubr)
df1 %>% 
  pull(ageGroup) %>% 
  fct_count() %>% 
  rename(ageGroup = f, count = n) %>% 
  ggdonutchart("count", label = "ageGroup", fill = "ageGroup", color = "white",lab.font = c(4,"bold","black"),lab.pos = "in")


Province Wise Information Collected

The below donut chart shows the the numbers of respondents in the various Provinces considered in this survey. We can see that, the highest frequency of respondents are from the Ontario region followed by Quebec, British Columbia and Alberta

#Province Wise Information Collected
library(plotrix)
library(ggpubr)
df1 %>% 
  pull(PROVINCE) %>% 
  fct_count() %>% 
  rename(PROVINCE = f, count = n) %>% 
  ggdonutchart("count", label = "PROVINCE", fill = "PROVINCE", color = "white",lab.font = c(3,"bold","black"),lab.pos = "in")


Gender wise hours spent by Internet users

The bar graph below shows the hours spent on Internet by the users categorized on the basis of their gender. The frequency of Internet usage is more or less similar in both the gender. There is no significant difference in the hours spent on Internet by both Male and Female. This plot depicts that gender does not have any significant effect on Internet usage.

#Gender wise hours spent by Internet users
library(knitr)
library(ggplot2)
library(dplyr)
data <- df1 %>% 
  mutate(gender = factor(gender, labels = c("Male", "Female")),
         hoursSpent = factor(hoursSpent))

ggplot(data, aes(x = hoursSpent, fill = gender))+
  labs(x="Hours Spent",y="Frequency")+
  theme(plot.title=element_text(size=10))+
  geom_bar() 



Province wise hours spent by Internet users

The bar graph gives information about the share of internet usage in the different provinces of Canada which is further sorted by the total amount of hours spent respectively. The statistics clearly illustrates that Ontario is the highest internet using populance in the whole of Canada with an internet usage standing at more than 40 hours per week. Whereas, the internet usage population further declines to the lowest in Prince Edward Island and the internet usage dropped to less than 5 hours for most of its population.

#Province wise hours spent by Internet users
ggplot(df1, 
       aes(x = df1$PROVINCE ,fill = hoursSpent)) +
  labs(x="Province",y="Total number of Hours Spent")+
  theme(plot.title=element_text(size=10))+
  geom_bar(position = "dodge") + coord_flip()



Province wise number of respondents

The bar plot below depicts the province wise number of respondents in various Provinces of Canada. It can be clearly seen that Prince Edward Island and Newfoundland and Labrador are the two provinces with the lowest number of respondents, whereas Quebec and Ontario have the highest number of respondents to the survey conducted. This further points towards the fact that the internet usage in Ontario is maximum while in Prince Edward Island internet usage is the least.

#Province wise number of respondents
library(RColorBrewer)
library(ggplot2)
ggplot(df1, 
       aes(x=reorder(PROVINCE, -table(PROVINCE)[PROVINCE]))) +
  labs(x="Province",y="Total number of Respondent") + geom_bar(fill="#381339") + coord_flip()



Province wise Internet usage over the years

The below bar graph is representing the frequency of the years of Internet used in various Provinces. This illustration gives us the information about the top 5 provinces that has been using Internet for 5 years and above are Ontario,Quebec, British Columbia,Alberta and Manitoba. Most users in the reqions of Saskatchewan, Nova Scotia , New Brunswick and Newfoundland & Labrador has been using Internet for 2 to 5 years. Therefore these areas should be taken into account on how to improve the use of Internet in those provinces.

#Province wise Internet usage over the years
ggplot(df1, aes(x = PROVINCE , fill = usedInternetYears)) +
  labs(x="Province",y="Internet used over the Years")+ 
  theme(plot.title=element_text(size=10))+
  geom_bar(position = "dodge") + coord_flip()


Total number of responses based on age group
We have analysed the total number of respondents through this bar graph, which shows Women within the age group of 65 and above have responded to the survey more than men of that age group. Moreover in the age group of 45 to 54 years both the men and women responded equally.

#Bargraph - respondents age
ggplot(df1, 
       aes(x= ageGroup,
           fill = gender) )+
  labs(x="Age Group",y="Total number of Respondent" , 
       title = "Age group wise respondents",fill = "Gender")+
  geom_bar(position = "dodge")


Internet Used on the basis of Education Level
The mosaic plot depicts Internet usage over the years by the students associated with different education levels. The graph represents the relationship between the two categories ( Internet Used over the Years and the level of Education) and illustrates the proportion under each higher education levels.

#Internet Used on the basis of Education Level
library(ggmosaic)
ggplot(data = highEdu) +
  geom_mosaic(aes(x = product(highEdu, usedInternet), fill = usedInternet)) +
  xlab("Higher Education") + ylab("Internet Used in Years")
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
##   Please report the issue at <]8;;https://github.com/haleyjeppson/ggmosaichttps://github.com/haleyjeppson/ggmosaic]8;;>.


PREDICTIVE ANALYSIS


Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.

head(highEdu)
##   PUMFID                  PROVINCE                REGION isUrban
## 1      1                   Ontario               Ontario   Rural
## 2      2                  Manitoba Manitoba/Saskatchewan   Rural
## 3      3 Newfoundland and Labrador       Atlantic Region   Rural
## 4      4                   Ontario               Ontario   Urban
## 5      5             New Brunswick       Atlantic Region   Urban
## 6      6                  Manitoba Manitoba/Saskatchewan   Urban
##          ageGroup gender customerHighestEdu isStudent labourStat householdType
## 1 35 to 44 years  Female                  3         2          1             3
## 2 16 to 24 years  Female                  1         1          2             2
## 3 25 to 34 years    Male                  2         2          1             2
## 4 55 to 64 years  Female                  2         2          3             3
## 5 35 to 44 years    Male                  2         2          1             2
## 6 25 to 34 years  Female                  1         2          3             1
##   householPersonCount householHighEdu householdStu usedInternet
## 1                   1               3            2          Yes
## 2                   3               2            1          Yes
## 3                   2               3            2          Yes
## 4                   1               2            2          Yes
## 5                   3               2            2          Yes
## 6                   3               1            2          Yes
##   usedInternetYears iahTelephone iahCable iahWireless iahOtherConn iahNone
## 1        1- 2 years            1        2           2            2       2
## 2        >= 5 years            2        1           2            2       2
## 3        >= 5 years            2        1           2            2       2
## 4        2- 5 years            6        6           6            6       6
## 5        1- 2 years            2        1           2            2       2
## 6        >= 5 years            6        6           6            6       6
##   iahPersonal iahPortable iahPda iahOtherDevice
## 1           2           1      2              2
## 2           1           2      2              2
## 3           1           2      2              1
## 4           6           6      6              6
## 5           1           1      2              2
## 6           6           6      6              6
##                                 useFrequency   hoursSpent highSpeedConnection
## 1   At least once a week (but not every day)    < 5 hours                   1
## 2                        At least once a day    < 5 hours                   6
## 3                        At least once a day  5 - 9 hours                   6
## 4                                 Valid skip   > 40 hours                   6
## 5 At least once a month (but not every week)    < 5 hours                   6
## 6                                 Valid skip   > 40 hours                   6
##   highSpeedIntService                           highEdu
## 1                   6 University certificate or degree 
## 2                   6    Either College or University  
## 3                   6 University certificate or degree 
## 4                   6    Either College or University  
## 5                   6    Either College or University  
## 6                   6           Post-secondary or less
#Creating Subset to remove valid Skip and Not stated responses
D2 = subset(highEdu, useFrequency %in% c("At least once a day","At least once a week (but not every day)","At least once a month (but not every week)","Less than once a month"))
D2.Gender_isInternetUsed <- table(D2$useFrequency,D2$gender)
D2.Gender_isInternetUsed
##                                             
##                                              Female Male
##   At least once a day                          6370 5477
##   At least once a month (but not every week)    312  228
##   At least once a week (but not every day)     1854 1362
##   Less than once a month                        135   95
round(prop.table(D2.Gender_isInternetUsed,1),2)
##                                             
##                                              Female Male
##   At least once a day                          0.54 0.46
##   At least once a month (but not every week)   0.58 0.42
##   At least once a week (but not every day)     0.58 0.42
##   Less than once a month                       0.59 0.41
chisq <- chisq.test(D2.Gender_isInternetUsed)
round(chisq$residuals, 3)
##                                             
##                                              Female   Male
##   At least once a day                        -1.466  1.613
##   At least once a month (but not every week)  0.946 -1.041
##   At least once a week (but not every day)    2.210 -2.432
##   Less than once a month                      0.805 -0.886

Creating Contingency tables for Gender Vs Internet Used Frequency

The p value less than significance level 0.05, so we can fail to reject the null hypothesis saying there is enough evidence to suggest association between gender and the internet usage

library(corrplot)
## corrplot 0.92 loaded
corrplot(chisq$residuals, is.cor = FALSE)

#Creating Contingency tables for Area Vs Internet Used Frequency
D2.Area_InternetUsed <- table(D2$useFrequency,D2$isUrban)
D2.Area_InternetUsed
##                                             
##                                              Rural Urban
##   At least once a day                         3292  8555
##   At least once a month (but not every week)   190   350
##   At least once a week (but not every day)    1084  2132
##   Less than once a month                        73   157
round(prop.table(D2.Area_InternetUsed,1),2)
##                                             
##                                              Rural Urban
##   At least once a day                         0.28  0.72
##   At least once a month (but not every week)  0.35  0.65
##   At least once a week (but not every day)    0.34  0.66
##   Less than once a month                      0.32  0.68
chisq1 <- chisq.test(D2.Area_InternetUsed)
round(chisq1$residuals, 3)
##                                             
##                                               Rural  Urban
##   At least once a day                        -3.040  1.957
##   At least once a month (but not every week)  2.527 -1.627
##   At least once a week (but not every day)    4.617 -2.972
##   Less than once a month                      0.684 -0.440

Creating Contingency tables for Area Vs Internet Used Frequency

The p value less than significance level 0.05, so we can fail to reject the null hypothesis saying there is enough evidence to suggest association between Area and the internet usage

library(corrplot)
corrplot(chisq1$residuals, is.cor = FALSE)


Correlation Matrix

install.packages("corrplot")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("corrplot")
library("ggcorrplot")
library("GGally")
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:ggmosaic':
## 
##     happy
ggcorr(df1,method = c("everything","pearson"),
       label = TRUE,label_alpha = TRUE,label_size = 3)
## Warning in ggcorr(df1, method = c("everything", "pearson"), label = TRUE, : data
## in column(s) 'PROVINCE', 'REGION', 'ageGroup', 'gender', 'usedInternetYears',
## 'hoursSpent' are not numeric and were ignored


#### Decision Tree Model
The Decision Tree (DT) method, which is used in our study, does not require a pre-defined underlying relationship. In addition, the method allows a great many explanatory variables to be processed and the most important variables are easy to identify. Obtained results can serve as to web developers and designers, since by indicating the differences between male and female internet users in terms of their behaviour on the internet it can help in deciding when, where and how to address and appeal to which section of the user base. It is especially important to know their online preferences in order to enable the adequate and targeted placement of information, actions or products and services for the intended target groups.

D2_df <- highEdu[c("gender","labourStat","highEdu","usedInternet","hoursSpent")]
 colnames(highEdu)
##  [1] "PUMFID"              "PROVINCE"            "REGION"             
##  [4] "isUrban"             "ageGroup"            "gender"             
##  [7] "customerHighestEdu"  "isStudent"           "labourStat"         
## [10] "householdType"       "householPersonCount" "householHighEdu"    
## [13] "householdStu"        "usedInternet"        "usedInternetYears"  
## [16] "iahTelephone"        "iahCable"            "iahWireless"        
## [19] "iahOtherConn"        "iahNone"             "iahPersonal"        
## [22] "iahPortable"         "iahPda"              "iahOtherDevice"     
## [25] "useFrequency"        "hoursSpent"          "highSpeedConnection"
## [28] "highSpeedIntService" "highEdu"
 colnames(D2_df)
## [1] "gender"       "labourStat"   "highEdu"      "usedInternet" "hoursSpent"
#compute Shannon entropy
entropy <- function(target) {
  freq <- table(target)/length(target)
#Vectorize
vec <- as.data.frame(freq)[,2]
#drop 0 to avoid NaN resulting from log2
vec<-vec[vec>0]
#Compute entropy
-sum(vec * log2(vec))
}
for(k in names(D2_df)){
   print(k)
   print(entropy(D2_df[[k]]))
 }
## [1] "gender"
## [1] 0.9918854
## [1] "labourStat"
## [1] 1.184813
## [1] "highEdu"
## [1] 1.493344
## [1] "usedInternet"
## [1] 0.7976336
## [1] "hoursSpent"
## [1] 2.245911
#Converting everything into factors
D2_df[] <- lapply(D2_df, factor)
#creating  train and test dataset
create_train_test <- function(data, size = 0.8, train = TRUE) {
   n_row = nrow(data)
   total_row = size * n_row
   train_sample <- 1: total_row
   if (train == TRUE) {
     return (data[train_sample, ])
   } else {
     return (data[-train_sample, ])
   }
 }
 data_train <- create_train_test(D2_df, 0.8, train = TRUE)
 data_test <- create_train_test(D2_df, 0.8, train = FALSE)
 dim(data_train)
## [1] 18542     5
#Plotting decision tree with determining the output as "sex" for the train dataset
install.packages("rpart")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
 library(rpart)
  library(rpart.plot)
 fit <- rpart(data_test$usedInternet~., data = data_test, method = 'class')
 rpart.plot(fit,extra = 108,box.palette="GnRd",fallen.leaves = FALSE,cex = 0.5)


Predicting the output whether gender and their employment status has prominant effect on internet usage or not

predict_unseen1 <-predict(fit, data_test, type = 'class')
table_mat1 <- table(data_test$usedInternet, predict_unseen1)
table_mat1
##      predict_unseen1
##         No  Yes
##   No  1121   19
##   Yes  329 3167
accuracy_Test1 <- sum(diag(table_mat1)) / sum(table_mat1)
print(paste('Accuracy for test', accuracy_Test1))
## [1] "Accuracy for test 0.924935289042278"
#PCA- in order to determine which variables are most important for explaining the variance in our data set:
myDataPath=("/cloud/project/IndividualUse_Raw.csv")
df <- read.csv("IndividualUse_Raw.csv", header = T) 
head(df)
##   PUMFID PROVINCE REGION G_URBRUR GCAGEGR6 CSEX G_CEDUC G_CSTUD G_CLFSST
## 1      1       35      3        5        3    2       3       2        1
## 2      2       46      4        5        1    2       1       1        2
## 3      3       10      1        5        2    1       2       2        1
## 4      4       35      3        4        5    2       2       2        3
## 5      5       13      1        4        3    1       2       2        1
## 6      6       46      4        4        2    2       1       2        3
##   GFAMTYPE G_HHSIZE G_HEDUC G_HSTUD EV_Q01 EV_Q02 IU_Q01A IU_Q01B IU_Q01E
## 1        3        1       3       2      1      2       1       2       2
## 2        2        3       2       1      1      4       2       1       2
## 3        2        2       3       2      1      4       2       1       2
## 4        3        1       2       2      1      3       6       6       6
## 5        2        3       2       2      1      2       2       1       2
## 6        1        3       1       2      1      4       6       6       6
##   IU_G01 IU_Q01G IU_Q02A IU_Q02B IU_Q02E IU_G02 IU_Q03 IU_Q04 IU_Q05 IU_Q06
## 1      2       2       2       1       2      2      2      1      1      6
## 2      2       2       1       2       2      2      1      1      6      6
## 3      2       2       1       2       2      1      1      2      6      6
## 4      6       6       6       6       6      6      6     96      6      6
## 5      2       2       1       1       2      2      3      1      6      6
## 6      6       6       6       6       6      6      6     96      6      6
df1 <- (df[, 2:28])

#Renaming Columns
df1 <- rename(df1, isUrban = G_URBRUR) #change column name G_URBRUR to isUrban 
df1 <- rename(df1, gender = CSEX)
df1 <- rename(df1, iahTelephone = IU_Q01A)
df1 <- rename(df1, iahCable = IU_Q01B)
df1 <- rename(df1, iahWireless = IU_Q01E)
df1 <- rename(df1, iahOtherConn = IU_G01)
df1 <- rename(df1, iahNone = IU_Q01G)
df1 <- rename(df1, iahPersonal = IU_Q02A)
df1 <- rename(df1, iahPortable = IU_Q02B)
df1 <- rename(df1, iahPda = IU_Q02E)
df1 <- rename(df1, iahOtherDevice = IU_G02)

df1 <- rename(df1, ageGroup = GCAGEGR6)
df1 <- rename(df1, customerHighestEdu = G_CEDUC)
df1 <- rename(df1,isStudent= G_CSTUD)
df1 <- rename(df1,labourStat= G_CLFSST)
df1 <- rename(df1,householdType= GFAMTYPE)
df1 <- rename(df1, householPersonCount= G_HHSIZE)
df1 <- rename(df1, householHighEdu = G_HEDUC)
df1 <- rename(df1,householdStu= G_HSTUD)
df1 <- rename(df1,usedInternet= EV_Q01)
df1 <- rename(df1, usedInternetYears= EV_Q02)
df1 <- rename(df1,useFrequency= IU_Q03)
df1 <- rename(df1, hoursSpent = IU_Q04)
df1 <- rename(df1, highSpeedConnection = IU_Q05)
df1 <- rename(df1,highSpeedIntService= IU_Q06)


Eigen Values

The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs corresponds to the directions with the maximum amount of variation in the data set. How many principal components should be retained? An eigenvalue > 1 indicates that PCs account for more variance than accounted by one of the original variables in standardized data. Hence, the components with eigenvalue > 1 are retained. Another option is to use the scree plot. From the plot below, we might want to stop at the fifth principal component. 89% of the information (variances) contained in the data are retained by the first five principal components.

install.packages(c("FactoMineR", "factoextra"))
## Installing packages into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("FactoMineR")
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
res.pca <- PCA(df1, graph = FALSE)
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23178 individuals, described by 27 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
eig.val <- get_eigenvalue(res.pca)
eig.val
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  12.818420098      47.47562999                    47.47563
## Dim.2   2.125285088       7.87142625                    55.34706
## Dim.3   1.971610123       7.30225971                    62.64932
## Dim.4   1.601495484       5.93146476                    68.58078
## Dim.5   1.345875243       4.98472312                    73.56550
## Dim.6   1.115252828       4.13056603                    77.69607
## Dim.7   1.058241537       3.91941310                    81.61548
## Dim.8   0.927411468       3.43485729                    85.05034
## Dim.9   0.915740709       3.39163226                    88.44197
## Dim.10  0.731856851       2.71058093                    91.15255
## Dim.11  0.667649463       2.47277579                    93.62533
## Dim.12  0.479415477       1.77561288                    95.40094
## Dim.13  0.323802209       1.19926744                    96.60021
## Dim.14  0.228077393       0.84473108                    97.44494
## Dim.15  0.216769497       0.80284999                    98.24779
## Dim.16  0.165711197       0.61374517                    98.86154
## Dim.17  0.094391466       0.34959802                    99.21113
## Dim.18  0.052264967       0.19357395                    99.40471
## Dim.19  0.045960284       0.17022328                    99.57493
## Dim.20  0.032242618       0.11941711                    99.69435
## Dim.21  0.020728821       0.07677341                    99.77112
## Dim.22  0.017349166       0.06425617                    99.83538
## Dim.23  0.015895029       0.05887048                    99.89425
## Dim.24  0.013043007       0.04830743                    99.94256
## Dim.25  0.008482208       0.03141559                    99.97397
## Dim.26  0.004318778       0.01599547                    99.98997
## Dim.27  0.002708990       0.01003330                   100.00000
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))


Principal Component Analysis Results for variables
PCA reduces the dimensionality of a multivariate data to two or three principal components, which could be visualized graphically, with minimal loss of information.

var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
head(var$cos2, 4)
##                Dim.1        Dim.2      Dim.3       Dim.4       Dim.5
## PROVINCE 0.008735762 1.079010e-01 0.85610110 0.010483532 0.002409806
## REGION   0.009573626 1.088714e-01 0.85312187 0.010851716 0.002861060
## isUrban  0.008852487 4.586957e-06 0.01269342 0.062933100 0.148343961
## ageGroup 0.245979794 2.371816e-01 0.04003498 0.001495516 0.001452835


Correlation Matrix
The correlation matrix can be reordered according to the correlation matrix coefficients. This is important to identify the hidden structure and pattern in the matrix.

install.packages("corrplot")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)


Quality Of Representation-Cos2
An accurate depiction of the variable on the principal component is indicated by a high cos2. In this instance, the variable is situated rather close to the correlation circle’s edge.

A low cos2 means that the PCs did not accurately represent the variable. In this instance, the variable is very near the circle’s centre.

The sum of the cos2 on all the principal components for a particular variable is equal to one.

The sum of the cos2 on the two main components (PCs) of a variable that can be perfectly described by just two (Dim.1 & Dim.2) is one. The variables will be placed on the circle of correlations in this scenario. In order to accurately represent the data, some of the variables may require more than 2 components. In this instance, the variables are situated inside the correlation circle.

In conclusion, the cos2 values are used to estimate the accuracy of the representation. A variable’s factor map representation improves when a variable gets nearer to the circle of correlations. Close to the plot’s centre variables are less significant for the initial components.

install.packages("ggrepel")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggrepel)
fviz_cos2(res.pca, choice = "var", axes = 1:2)

library(ggrepel)
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE
)
## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fviz_contrib(res.pca, choice = "var", axes = 1, top = 15)

fviz_contrib(res.pca, choice = "var", axes = 2, top = 15)

fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
)


K-means Clustering

#Cluster analysis:
install.packages("ClusterR")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("cluster")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ClusterR)
## Loading required package: gtools
library(cluster)
colnames(df1)
##  [1] "PROVINCE"            "REGION"              "isUrban"            
##  [4] "ageGroup"            "gender"              "customerHighestEdu" 
##  [7] "isStudent"           "labourStat"          "householdType"      
## [10] "householPersonCount" "householHighEdu"     "householdStu"       
## [13] "usedInternet"        "usedInternetYears"   "iahTelephone"       
## [16] "iahCable"            "iahWireless"         "iahOtherConn"       
## [19] "iahNone"             "iahPersonal"         "iahPortable"        
## [22] "iahPda"              "iahOtherDevice"      "useFrequency"       
## [25] "hoursSpent"          "highSpeedConnection" "highSpeedIntService"
#Filtering the two columns used for kmeans clustering
kmdata <- df1
kmdata <- kmdata[,c("PROVINCE","hoursSpent")]
#As kmeans required numerical data, used org_df for numerical data
kmdata[, c('PROVINCE', 'usedInternet')] <-
  sapply(df1[, c('PROVINCE', 'hoursSpent')], unclass)
head(kmdata)
##   PROVINCE hoursSpent usedInternet
## 1       35          1            1
## 2       46          1            1
## 3       10          2            2
## 4       35         96           96
## 5       13          1            1
## 6       46         96           96
set.seed(240)
###find wss and find no of cluster
#Calculate and plot WSS for a series of k values
wss <- numeric(20)
for (k in 1:20) wss[k] <- sum(kmeans(kmdata, centers = k, nstart = 25)$withinss)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
plot(1:20, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within Sum of Squares (WSS)"
     ,engine = "interactive")
## Warning in plot.window(...): "engine" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "engine" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "engine" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "engine" is not a
## graphical parameter
## Warning in box(...): "engine" is not a graphical parameter
## Warning in title(...): "engine" is not a graphical parameter

kmeans.re <- kmeans(kmdata, centers = 3, nstart = 1)
#kmeans.re
#kmeans.re$cluster
#creating clusters based on the province
confMatrix <- table(kmdata$PROVINCE, kmeans.re$cluster)
confMatrix
##     
##         1    2    3
##   10    0  364  518
##   11    0  201  391
##   12    0  420  820
##   13    0  413  671
##   24    0 1590 2847
##   35    0 1898 4620
##   46 1264  759    0
##   47 1079  548    0
##   48 1656  586    0
##   59 1926  607    0
# unique(kmdata)


K-means Clustering with 3 clusters

The unsupervised method of determining the cluster of the various group of internet users has been conducted using the K-means clustering. Based on the hours spent by the internet users, and the Provinces they belong to, we have identified 3 prominent clusters as analyzed from the elbow method.

plot(kmdata[c("PROVINCE", "hoursSpent")],
     col = kmeans.re$cluster,
     xlab = 'PROVINCE',
     ylab = 'Hours Spent on Internet')
kmeans.re$centers
##   PROVINCE hoursSpent usedInternet
## 1 50.96692   2.056203     2.056203
## 2 33.23328  96.019767    96.019767
## 3 26.15506   1.983176     1.983176
kmeans.re$centers[, c("PROVINCE", "hoursSpent")]
##   PROVINCE hoursSpent
## 1 50.96692   2.056203
## 2 33.23328  96.019767
## 3 26.15506   1.983176
points(kmeans.re$centers[, c("PROVINCE", "hoursSpent")],
       col = 1:3, pch = 8, cex = 3)


CONCLUSION


Based on our analysis, we performed the unsupervised learning techniques such as decision tree model and k-means clustering method. On the basis of our decision tree model, we have found that the internet usage is higher amongst male who are fit to work as well as those who are unemployed. On contrary, lower tendency in the usage of internet has been detected among those users attending college or university or lower education level. Finally the accuracy for this particular decision tree model is 92.5%. Hence we relied on these factors for our prediction model.