For the past seven years, the Graphics, Visualization, and Usability (GVU) Center of the Georgia Institute of Technology in Atlanta has conducted an international survey of World Wide Web usage as a public service, providing information on the demographics and trends of Internet access.
Goal:
The goal of this assignment is to obtain a profile of the “typical” Internet user by applying clustering techniques.
Benefits:
Such a data mining task can benefit e-commerce marketers, who may tailor their advertisements to a particular segment of the population.
These tasks may also assist software engineers and system designers who are interested in understanding why a particular subset of the population is still uncomfortable using computers.
About the Data Set:
The data set for this exam is the General Demographics data set from the GVU WWW User Survey*.
Students should evaluate the GVU’s Tenth WWW User Survey (October 1998).
This data is divided into several sections, including demographics, privacy and security, web usage, and others. Each data section includes a unique user identifier.
Students should join at least demographics and web usage data in preparation for the clustering task.
Exam Problems:
1. What are the typical groups of web users? Explain the differences and similarities among the groups.
setwd("F:\\DATA SCIENCE\\PENN STATE\\Course work\\SWENG 545 - Data Mining\\FINAL PROJECT")
# read the tab-delimited survey files:
demo = read.delim("demo.txt", stringsAsFactors = TRUE)
usage = read.delim("usage.txt", stringsAsFactors = TRUE)
dim(demo)
## [1] 5022 106
dim(usage)
## [1] 3291 126
# removing last column added by read.delim()
demo = demo[,-c(106)]
usage = usage[,-c(126)]
dim(demo)
## [1] 5022 105
dim(usage)
## [1] 3291 125
# removing NAs:
demo = na.omit(demo)
usage = na.omit(usage)
dim(demo)
## [1] 4767 105
dim(usage)
## [1] 3291 125
#finding common column names in both data frames:
intersect(names(demo),names(usage))
## [1] "survey" "who"
# checking whether all user IDs are unique or whether there are duplicates:
length(demo$who)
## [1] 4767
length(unique(demo$who))
## [1] 4767
length(usage$who)
## [1] 3291
length(unique(usage$who))
## [1] 3291
# build a new data frame containing only the users present in both files:
suppressMessages(library(data.table))
# converting data frame into data table so data.table package functions can be used to merge both files:
demo2 = data.table(demo)
class(demo)
## [1] "data.frame"
class(demo2)
## [1] "data.table" "data.frame"
usage2 = data.table(usage)
class(usage)
## [1] "data.frame"
class(usage2)
## [1] "data.table" "data.frame"
# setting the key for the merge: "who" is the common column identifying the users who took part in this survey.
setkey(demo2, who)
setkey(usage2, who)
# merging both files in "data"
data = merge(demo2,usage2)
dim(demo2)
## [1] 4767 105
dim(usage2)
## [1] 3291 125
dim(data)
## [1] 3084 229
# converting data.table back to data frame.
data = as.data.frame(data)
Unique user IDs do not help in clustering, so we should exclude them from the clustering step. At the same time, we need them to identify each observation later.
We will therefore remove this variable from the data set and add it back as the row name of each row, preserving the link to the user it came from.
rownames(data) = data[,1] # giving rownames
data = data[,-1]
Now we need to find which variables in the data set have only one level. These variables do not help in creating clusters because there is no variation within them, so we remove them in the pre-processing step.
n = ncol(data)
same = numeric()
for(i in 1:n){
if (length(levels(data[,i])) == 1){
same = c(same,i)
}
}
str(data[,same]) # should be all columns with 1 level
## 'data.frame': 3084 obs. of 3 variables:
## $ survey.x : Factor w/ 1 level "general": 1 1 1 1 1 1 1 1 1 1 ...
## $ da_source: Factor w/ 1 level "Go!": 1 1 1 1 1 1 1 1 1 1 ...
## $ survey.y : Factor w/ 1 level "use": 1 1 1 1 1 1 1 1 1 1 ...
data = data[,-same]
# new data set excluding predictors with only 1 level
There are multiple variables in the data set with binary values of “0” and “1”. To make them easier to interpret, we will convert these values into “NO” and “YES” respectively.
n = ncol(data)
for(i in 1:n){
if (class(data[,i]) == "integer"){
data[,i] = ifelse(data[,i]==0,"NO","YES")
data[,i] = as.factor(data[,i])
}
}
While exporting the data frame to a CSV file, I learned that Excel converts values such as “7-10” in some variables into dates such as “July-10”.
To avoid this problem, I append the letter “r” to every value of the form digits-hyphen-digits.
n = ncol(data)
for(i in 1:n){
  col = as.character(data[,i])
  # flag values of the form digits followed by a hyphen (e.g. "7-10") and
  # append " r" so that Excel does not reinterpret them as dates
  hits = grepl("^[0-9]+[-]", col)
  col[hits] = paste(col[hits], " r")
  data[,i] = as.factor(col)
}
#write.csv(data, "data.csv")
Now that we have created a tidy data set from the raw data files, it is time to work on the actual clustering analysis.
The first step is to decide which clustering technique to use.
Since all variables in the data set are categorical, we could use a partition-based method such as k-medoids [example: PAM, “Partitioning Around Medoids”], which partitions the data into k clusters “around medoids” and is a more robust version of k-means.
Or we could use hierarchical clustering, which performs a hierarchical cluster analysis on a set of dissimilarities.
Because we are not sure how many clusters to start with (that is, in fact, objective no. 1 of this analysis) and because all variables are categorical, we will use hierarchical clustering here. Later I will explain another reason why we did not use the partition-based PAM method.
First, in order to use R's clustering functions, we need to calculate a dissimilarity matrix for the data set.
Because our variables are categorical/ordinal, “Gower's dissimilarity distance” is a good choice.
“Gower's distance” is chosen by metric “gower”, or automatically if some columns of x are not numeric. Also known as Gower's coefficient (1971), expressed as a dissimilarity, this implies that a particular standardization will be applied to each variable, and the “distance” between two units is the sum of all the variable-specific distances.
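To make this concrete, here is a small toy illustration; the 3-row data frame and its values are invented purely for demonstration. With two categorical variables, rows that differ on one of the two variables get a Gower dissimilarity of 0.5, and rows that differ on both get 1.
library(cluster)
# hypothetical toy data: two factor variables, three rows
toy = data.frame(color = factor(c("red", "red", "blue")),
                 size  = factor(c("small", "large", "large")))
daisy(toy, metric = "gower")
# rows 1-2 differ only on size  -> 0.5
# rows 1-3 differ on both       -> 1.0
# rows 2-3 differ only on color -> 0.5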
# data5 is the working copy that will hold the hclust cluster assignments.
data5 = data
library(cluster)
# calculating gower dissimilarity distance for categorical variables:
d = daisy(data5, metric = c("gower"))
distance = dist(d)
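As a side note, the partition-based alternative mentioned above (PAM) can work directly on a dissimilarity object like this one. Below is a minimal sketch, assuming 3 clusters (the number the elbow analysis further down ends up suggesting); it is shown only for comparison, since we proceed with hierarchical clustering.
# PAM on the Gower dissimilarities; diss = TRUE tells pam() that d is already a dissimilarity
pam.fit = pam(d, k = 3, diss = TRUE)
table(pam.fit$clustering) # cluster sizes under PAM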
Now, to get a rough idea of how many clusters would give optimal clustering, we can use the k-means method and look at charts of the clusters' within-sum-of-squares (WSS) and between-sum-of-squares (BSS) to decide the number of clusters k.
K-means is not an optimal or robust method for this type of data set because of the categorical variables, but we are using it only to gauge the number of clusters.
set.seed(1)
# the lower the WSS the better: it indicates that data points within a cluster are more similar.
wss = rep(0,11)
for (i in 1:11){
kmeans.model2 = kmeans(d, i)
wss[i] = kmeans.model2$tot.withinss
}
plot(1:10, wss[1:10], type="b",
xlab = "Number of Clusters",
ylab = "within cluster sum of squares")
wss.drop=rep(0,10)
for(i in 1:10){
wss.drop[i] = ((wss[i]-wss[i+1])/wss[i])*100
}
plot(2:11, wss.drop, type="b",
xlab = "Number of Clusters",
ylab = "wss drop in percentage")
At k = 3 it looks like we get good clustering with a low WSS; after that the advantage is not that big (the drop in WSS is 8% or less at k = 4 and beyond).
set.seed(1)
# the higher the BSS the better: it indicates high dissimilarity between clusters.
bss = rep(0,10)
for (i in 1:10){
kmeans.model2 = kmeans(d, i)
bss[i] = kmeans.model2$betweenss
}
plot(1:10, bss, type="b",
xlab = "Number of Clusters",
ylab = "Between cluster sum of squares")
#install.packages("FactoMineR")
library(FactoMineR)
library(ggplot2)
data5 = data
set.seed(1)
#H-clustering:
hc = hclust(distance, method = "complete")
plot(hc)
clusters = cutree(hc,k=3)
data5$cluster = clusters
data5$cluster = as.factor(data5$cluster)
levels(data5$cluster) = c("A","B","C")
table(data5$cluster)
##
## A B C
## 2119 869 96
prop.table(table(data5$cluster))
##
## A B C
## 0.6870947 0.2817769 0.0311284
#write.csv(data5, "data5.csv")
plot(silhouette(clusters, distance))
data3 = data5
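To complement the silhouette plot above, the overall average silhouette width can also be extracted as a single number; a minimal sketch using the same objects:
sil = silhouette(clusters, distance)
summary(sil)$avg.width # overall average silhouette width (closer to 1 is better)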
The hclust() function (with cutree at k = 3) has given us 3 clusters in our data set: clusters A, B, and C.
Cluster A contains 68.7% of the observations, cluster B 28.2%, and cluster C 3.1%.
Now, to confirm the 3 clusters in the data set, I will use the MCA method (Multiple Correspondence Analysis) to calculate eigenvalues so we can plot the data set in two-dimensional space using the first two components and see the clusters.
In R, there are several functions from different packages that allow us to apply Multiple Correspondence Analysis.
My preferred function to do multiple correspondence analysis is the MCA() function that comes in the fabulous package “FactoMineR” by Francois Husson, Julie Josse, Sebastien Le, and Jeremy Mazet.
No matter what function you decide to use for MCA, the typical results should consist of a set of eigenvalues, a table with the row coordinates, and a table with the column coordinates.
Compared to the eigenvalues obtained from a PCA or a CA, the eigenvalues in an MCA can be much smaller. This is important to know because if you consider only the eigenvalues, you might be tempted to conclude that MCA performs poorly, which is absolutely false.
Personally, I think that the real meat and potatoes of MCA lies in its dimension-reduction properties that let us visualize our data, among other things. Besides the eigenvalues, the row coordinates provide information about the structure of the rows in the analyzed table. In turn, the column coordinates provide information about the structure of the analyzed variables and their corresponding categories.
newdata = read.csv("data.csv")
# number of categories per variable
cats = apply(newdata, 2, function(x) nlevels(as.factor(x)))
library(FactoMineR)
set.seed(1234)
# apply MCA
mca1 = MCA(newdata, graph = FALSE)
We can use the ggplot2 package to get a nice plot:
# data frame with variable coordinates
mca1_vars_df = data.frame(mca1$var$coord, Variable = rep(names(cats), cats))
# data frame with observation coordinates
mca1_obs_df = data.frame(mca1$ind$coord)
mca1_obs_df$cluster = data5$cluster
# plot of variable categories
ggplot(data=mca1_vars_df,
aes(x = Dim.1, y = Dim.2, label = rownames(mca1_vars_df))) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_text(aes(colour=Variable)) +
ggtitle("MCA plot of variables")+ theme(legend.position="none")
# MCA plot of observations with clusters:
ggplot(data = mca1_obs_df, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(size = 2, alpha = 0.7, aes(color = cluster, shape = cluster)) +
ggtitle("MCA plot of observations with Clusters") +
scale_color_discrete(name = "Clusters", labels = c("A","B","C"))
# MCA plot of observations with clusters:
ggplot(data = mca1_obs_df, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(size = 2, alpha = 0.7, aes(color = cluster, shape = cluster)) +
geom_density2d(colour = "black") +
ggtitle("MCA plot of observations with Clusters") +
scale_color_discrete(name = "Clusters", labels = c("A","B","C"))
The above plot is a good visualization of all our observations along with the 3 clusters A, B, and C.
This two-dimensional view is based on only the first two eigenvectors.
head(mca1$eig, 2)
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.07207710 0.4365968 0.4365968
## dim 2 0.04458851 0.2700885 0.7066853
As seen in the table above, the first two components account for only about 0.71% of the total variance. As noted earlier, eigenvalue percentages in MCA are typically much smaller than in PCA, so this two-dimensional view is still a reasonable way to see the clusters.
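For additional context, the same eigenvalue table can be turned into a quick scree plot of the leading MCA dimensions; a minimal sketch:
# bar plot of the percentage of variance explained by the first 10 MCA dimensions
barplot(mca1$eig[1:10, "percentage of variance"],
        names.arg = 1:10,
        xlab = "MCA dimension",
        ylab = "Percentage of variance")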
Now we need to understand the individual behaviours and characteristics of these 3 clusters in terms of their similarities and differences.
We will do this by analyzing all variables against the clusters, plotting the three clusters' distribution for each class of a variable, and for some variables plotting the class distribution across clusters as well (see the sketch after this paragraph).
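As an illustration of this approach, the sketch below cross-tabulates one categorical predictor against the cluster labels and plots the cluster distribution per category; the column name “gender” is only a placeholder for whichever predictor from data5 is being profiled.
# cluster distribution (in %) within each category of a chosen predictor;
# "gender" is a placeholder name - substitute any factor column from data5
round(prop.table(table(data5$cluster, data5$gender), margin = 2) * 100, 1)
# the same comparison as a stacked proportion bar chart
library(ggplot2)
ggplot(data5, aes(x = gender, fill = cluster)) +
  geom_bar(position = "fill") +
  ylab("Proportion") +
  ggtitle("Cluster distribution within each category")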
We will mostly analyze the characteristics of clusters A and B; for cluster C we will discuss how it differs from cluster B, or even cluster A, wherever we find something to correlate.
Our data set has a total of 3084 observations and 225 predictors (excluding the 1 column of cluster assignment).
After careful analysis and observation of these predictors, I have come up with a total of 7 Profile Groups that we can study and analyze to identify the characteristics of the three clusters.
Below are these 7 Profile Groups:
1. Web Usage (Amount/Frequency) Profile
2. Demographics Profile
3. Experience/Issues/Opinion on Web Profile
4. Education & Work Profile
5. Web shopping Profile
6. Type of Web Usage Profile
7. Other useful Info Profile
Each Group has sub-groups within it that we will actually analyze; I have called them Elements.
Each Element is made up of one, two, or more predictors, depending on the domain the predictors represent.
I have used a total of 120 predictors to create the 7 Groups and the 46 Elements within them.
A total of 105 predictors were not used in analyzing the characteristics of the clusters, for the reasons noted below.
So, as observed in the “WEB USAGE” section of the analysis, there is a clear distinction between clusters A and B: cluster A is represented by high internet usage users and cluster B by low internet usage users.
Cluster C looks more similar to cluster B when it comes to “frequency of internet use”, but more similar to cluster A (high usage) when it comes to “number of hours used”. This tells me that these users do not get on the internet that often in a week, but in fewer sittings they still use as many hours of internet as cluster A users.
The table below shows the cluster distribution (in %) within each gender:
##
## Female Male
## A 63.8 71.0
## B 33.1 25.9
## C 3.1 3.1
So, as observed in the “DEMOGRAPHICS” section of the analysis, there is a clear distinction between clusters A and B: cluster A is represented by users in the 21-85 age group, with high income and with more married users; cluster B is represented by users in the 11-20 and over-85 age groups, with mid-to-low income and more weight on unmarried users.
Cluster C is a lot like cluster B, with one distinction: its users come more from urban/suburban areas than rural ones.
**Cluster-A users are more comfortable using the web/internet, and their issues with using the internet are mostly technical.**
**Cluster-B & C users are less comfortable using the web/internet, and their issues are mostly non-technical.**
**Cluster-A is represented by users with higher education (Doctoral/Masters/College), who work 20 or more hours, in the fields of Agriculture/Software/Publishing/Mining/Legal, at Upper Mgmt/Consultant/Self-employed/Middle Mgmt/Trained Prof. positions, mostly in the private and public sectors.**
**Cluster-B & C are represented by users with comparatively lower education (Some college/high school), who work 20 or fewer hours, in the fields of Broadcasting/Homemaker/Religious/Hotel and Food/Unemployed, with positions such as Temporary/Researcher/Skilled/Labor/Student, mostly in the not-for-profit/other sectors.**
**Cluster-C has more weight on students and less weight on trained professionals and the self-employed.**
Cluster-A users are frequent online purchasers, while cluster-B & C users purchase online less frequently. When we combine this with the reasons given for not purchasing online, it looks like having no credit card, finding it too complicated, and inexperience with buying things online are the main reasons.
Cluster-A users use the web mostly for work, shopping, personal information, and entertainment, and are frequent users of e-news, product information, reference material, financial information, etc.
Cluster-B & C users use the web mostly for education, time wasting, communication, and other purposes, with frequent use of chat groups, job listings, household work, medical information, movies, reading, real estate, and socializing.
The most telling distinction was that cluster-A users have their work or themselves paying for the web, while cluster-B/C users have school/other/parents/don't know paying for the web, with cluster C having more weight on school/parents.
## Source: local data frame [12 x 2]
##
## Major.Geographical.Location Percentage
## (fctr) (dbl)
## 1 USA 82.72
## 2 Europe 8.37
## 3 Canada 4.31
## 4 Oceania 2.33
## 5 Asia 1.13
## 6 Africa 0.32
## 7 Middle East 0.29
## 8 South America 0.19
## 9 Mexico 0.13
## 10 Antarctica 0.06
## 11 Central America 0.06
## 12 West Indies 0.06
As almost 83% of the observations in our data set are from the USA, we will not be able to make any generalizations to other countries or parts of the world.
Any statements or recommendations in this review are for the USA only.
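A sketch of how a percentage table like the one above could be produced: the column name Major.Geographical.Location is taken from the printed header, but the exact dplyr steps are my assumption, and the same pattern would apply to the language and race tables below.
suppressMessages(library(dplyr))
# counts per location, converted to percentages of all observations
data5 %>%
  count(Major.Geographical.Location) %>%
  mutate(Percentage = round(100 * n / sum(n), 2)) %>%
  arrange(desc(Percentage)) %>%
  select(Major.Geographical.Location, Percentage)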
## Source: local data frame [17 x 2]
##
## Primary.Language Percentage
## (fctr) (dbl)
## 1 English 91.67
## 2 German 1.52
## 3 Other 1.52
## 4 French 1.04
## 5 Dutch 0.94
## 6 Swedish 0.62
## 7 Spanish 0.52
## 8 Chinese 0.32
## 9 Norwegian 0.32
## 10 Italian 0.29
## 11 Danish 0.26
## 12 Not Say 0.23
## 13 Portuguese 0.23
## 14 Hebrew 0.19
## 15 Russian 0.19
## 16 Greek 0.06
## 17 Korean 0.06
As almost 92% of the observations in our data set are from English-speaking users, we will not be able to make any generalizations to people speaking other languages.
Any statements or recommendations in this review are for the “English-speaking population in the USA” only.
## Source: local data frame [10 x 2]
##
## Race Percentage
## (fctr) (dbl)
## 1 White 88.52
## 2 Asian 2.82
## 3 Not Say 2.33
## 4 Multiracial 1.78
## 5 Afr. Amer 1.36
## 6 Other 1.23
## 7 Hispanic 1.13
## 8 Latino 0.52
## 9 Indigenous 0.26
## 10 Latino\\0Other 0.03
As almost 89% of the observations in our data set are from White users, we will not be able to make any generalizations to people of other races.
Any statements or recommendations in this review are for the “English-speaking White population in the USA” only.