For the past seven years, the Graphics, Visualization, and Usability (GVU) Center of the Georgia Institute of Technology in Atlanta has conducted an international survey of World Wide Web usage as a public service, providing information on the demographics and trends of Internet access.
Goal:
The goal of this assignment is to obtain a profile of the “typical” Internet user by applying clustering techniques.
Benefits:
Such a data mining task can benefit e-commerce marketers, who may tailor their advertisements to a particular segment of the population.
These tasks may also assist software engineers and system designers who are interested in understanding why a particular subset of the population is still uncomfortable using computers.
About the Data Set:
The data set for this exam is the General Demographics data set from the GVU WWW User Survey*.
Students should evaluate the GVU’s Tenth WWW User Survey (October 1998).
This data is divided into several sections, including demographics, privacy and security, web usage, and others. Each data section includes a unique user identifier.
Students should join at least demographics and web usage data in preparation for the clustering task.
Exam Problems:
1. What are the typical groups of web users? Explain the differences and similarities among the groups.
setwd("F:\\DATA SCIENCE\\PENN STATE\\Course work\\SWENG 545 - Data Mining\\FINAL PROJECT")
# read the tab-delimited survey files:
demo = read.delim("demo.txt", stringsAsFactors = TRUE)
usage = read.delim("usage.txt", stringsAsFactors = TRUE)
dim(demo)
## [1] 5022 106
dim(usage)
## [1] 3291 126
# removing last column added by read.delim()
demo = demo[,-c(106)]
usage = usage[,-c(126)]
dim(demo)
## [1] 5022 105
dim(usage)
## [1] 3291 125
# removing NAs:
demo = na.omit(demo)
usage = na.omit(usage)
dim(demo)
## [1] 4767 105
dim(usage)
## [1] 3291 125
#finding common column names in both data frames:
intersect(names(demo),names(usage))
## [1] "survey" "who"
# checking whether all user IDs are unique or whether there are duplicates:
length(demo$who)
## [1] 4767
length(unique(demo$who))
## [1] 4767
length(usage$who)
## [1] 3291
length(unique(usage$who))
## [1] 3291
# build a new data frame containing only the users present in both files:
suppressMessages(library(data.table))
# converting data frame into data table so data.table package functions can be used to merge both files:
demo2 = data.table(demo)
class(demo)
## [1] "data.frame"
class(demo2)
## [1] "data.table" "data.frame"
usage2 = data.table(usage)
class(usage)
## [1] "data.frame"
class(usage2)
## [1] "data.table" "data.frame"
# setting the key for the merge: "who" is the common column identifying the users who took part in this survey.
setkey(demo2, who)
setkey(usage2, who)
# merging both files in "data"
data = merge(demo2,usage2)
dim(demo2)
## [1] 4767 105
dim(usage2)
## [1] 3291 125
dim(data)
## [1] 3084 229
# converting data.table back to data frame.
data = as.data.frame(data)
Unique user IDs do not help in clustering, so we should exclude them from the clustering step. At the same time, we need them to identify each observation later.
We will therefore remove this variable from the data set and add it back as the row name of each row, preserving the link to the user it came from.
rownames(data) = data[,1] # giving rownames
data = data[,-1]
Now we need to find which variables in the data set have only one level. These variables do not help in creating clusters because there is no variation within them, so we remove them in the pre-processing step.
n = ncol(data)
same = numeric()
for(i in 1:n){
if (length(levels(data[,i])) == 1){
same = c(same,i)
}
}
str(data[,same]) # should be all columns with 1 level
## 'data.frame': 3084 obs. of 3 variables:
## $ survey.x : Factor w/ 1 level "general": 1 1 1 1 1 1 1 1 1 1 ...
## $ da_source: Factor w/ 1 level "Go!": 1 1 1 1 1 1 1 1 1 1 ...
## $ survey.y : Factor w/ 1 level "use": 1 1 1 1 1 1 1 1 1 1 ...
data = data[,-same]
# new data set excluding predictors with only 1 level
There are multiple variables in the data set with binary values of “0” and “1”. To make them easier to interpret, we will convert these values into “NO” and “YES” respectively.
n = ncol(data)
for(i in 1:n){
if (class(data[,i]) == "integer"){
data[,i] = ifelse(data[,i]==0,"NO","YES")
data[,i] = as.factor(data[,i])
}
}
While exporting the data frame to a CSV file, I learned that Excel converts values such as “7-10” in some variables into dates such as “July-10”.
To avoid this problem, I append the letter “r” to every value of the form digits-hyphen-digits.
n = ncol(data)
for(i in 1:n){
  col = as.character(data[,i])
  # flag values of the form digits followed by a hyphen (e.g. "7-10") and
  # append " r" so that Excel does not reinterpret them as dates
  hits = grepl("^[0-9]+[-]", col)
  col[hits] = paste(col[hits], " r")
  data[,i] = as.factor(col)
}
#write.csv(data, "data.csv")
Now that we have created a tidy data set from the raw data files, it is time to work on the actual clustering analysis.
The first step is to decide which clustering technique to use.
Since all variables in the data set are categorical, we could use a partition-based method such as k-medoids [example: PAM, “Partitioning Around Medoids”], which partitions the data into k clusters “around medoids” and is a more robust version of k-means.
Or we could use hierarchical clustering, which performs a hierarchical cluster analysis on a set of dissimilarities.
Because we are not sure how many clusters to start with (that is, in fact, objective no. 1 of this analysis) and because all variables are categorical, we will use hierarchical clustering here. Later I will explain another reason why we did not use the partition-based PAM method.
First, in order to use R's clustering functions, we need to calculate a dissimilarity matrix for the data set.
Because our variables are categorical/ordinal, “Gower's dissimilarity distance” is a good choice.
“Gower's distance” is chosen by metric “gower”, or automatically if some columns of x are not numeric. Also known as Gower's coefficient (1971), expressed as a dissimilarity, this implies that a particular standardization will be applied to each variable, and the “distance” between two units is the sum of all the variable-specific distances.
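To make this concrete, here is a small toy illustration; the 3-row data frame and its values are invented purely for demonstration. With two categorical variables, rows that differ on one of the two variables get a Gower dissimilarity of 0.5, and rows that differ on both get 1.
library(cluster)
# hypothetical toy data: two factor variables, three rows
toy = data.frame(color = factor(c("red", "red", "blue")),
                 size  = factor(c("small", "large", "large")))
daisy(toy, metric = "gower")
# rows 1-2 differ only on size  -> 0.5
# rows 1-3 differ on both       -> 1.0
# rows 2-3 differ only on color -> 0.5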
# data5 is the working copy that will hold the hclust cluster assignments.
data5 = data
library(cluster)
# calculating gower dissimilarity distance for categorical variables:
d = daisy(data5, metric = c("gower"))
distance = dist(d)
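As a side note, the partition-based alternative mentioned above (PAM) can work directly on a dissimilarity object like this one. Below is a minimal sketch, assuming 3 clusters (the number the elbow analysis further down ends up suggesting); it is shown only for comparison, since we proceed with hierarchical clustering.
# PAM on the Gower dissimilarities; diss = TRUE tells pam() that d is already a dissimilarity
pam.fit = pam(d, k = 3, diss = TRUE)
table(pam.fit$clustering) # cluster sizes under PAM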
Now, to get a rough idea of how many clusters would give optimal clustering, we can use the k-means method and look at charts of the clusters' within-sum-of-squares (WSS) and between-sum-of-squares (BSS) to decide the number of clusters k.
K-means is not an optimal or robust method for this type of data set because of the categorical variables, but we are using it only to gauge the number of clusters.
set.seed(1)
# the lower the WSS the better: it indicates that data points within a cluster are more similar.
wss = rep(0,11)
for (i in 1:11){
kmeans.model2 = kmeans(d, i)
wss[i] = kmeans.model2$tot.withinss
}
plot(1:10, wss[1:10], type="b",
xlab = "Number of Clusters",
ylab = "within cluster sum of squares")
wss.drop=rep(0,10)
for(i in 1:10){
wss.drop[i] = ((wss[i]-wss[i+1])/wss[i])*100
}
plot(2:11, wss.drop, type="b",
xlab = "Number of Clusters",
ylab = "wss drop in percentage")
At k = 3 it looks like we get good clustering with a low WSS; after that the advantage is not that big (the drop in WSS is 8% or less at k = 4 and beyond).
set.seed(1)
# the higher the BSS the better: it indicates high dissimilarity between clusters.
bss = rep(0,10)
for (i in 1:10){
kmeans.model2 = kmeans(d, i)
bss[i] = kmeans.model2$betweenss
}
plot(1:10, bss, type="b",
xlab = "Number of Clusters",
ylab = "Between cluster sum of squares")
#install.packages("FactoMineR")
library(FactoMineR)
library(ggplot2)
data5 = data
set.seed(1)
#H-clustering:
hc = hclust(distance, method = "complete")
plot(hc)
clusters = cutree(hc,k=3)
data5$cluster = clusters
data5$cluster = as.factor(data5$cluster)
levels(data5$cluster) = c("A","B","C")
table(data5$cluster)
##
## A B C
## 2119 869 96
prop.table(table(data5$cluster))
##
## A B C
## 0.6870947 0.2817769 0.0311284
#write.csv(data5, "data5.csv")
plot(silhouette(clusters, distance))
data3 = data5
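To complement the silhouette plot above, the overall average silhouette width can also be extracted as a single number; a minimal sketch using the same objects:
sil = silhouette(clusters, distance)
summary(sil)$avg.width # overall average silhouette width (closer to 1 is better)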
The hclust() function (with cutree at k = 3) has given us 3 clusters in our data set: clusters A, B, and C.
Cluster A contains 68.7% of the observations, cluster B 28.2%, and cluster C 3.1%.
Now, to confirm the 3 clusters in the data set, I will use the MCA method (Multiple Correspondence Analysis) to calculate eigenvalues so we can plot the data set in two-dimensional space using the first two components and see the clusters.
In R, there are several functions from different packages that allow us to apply Multiple Correspondence Analysis.
My preferred function to do multiple correspondence analysis is the MCA() function that comes in the fabulous package “FactoMineR” by Francois Husson, Julie Josse, Sebastien Le, and Jeremy Mazet.
No matter what function you decide to use for MCA, the typical results should consist of a set of eigenvalues, a table with the row coordinates, and a table with the column coordinates.
Compared to the eigenvalues obtained from a PCA or a CA, the eigenvalues in an MCA can be much smaller. This is important to know because if you consider only the eigenvalues, you might be tempted to conclude that MCA performs poorly, which is absolutely false.
Personally, I think that the real meat and potatoes of MCA lies in its dimension-reduction properties that let us visualize our data, among other things. Besides the eigenvalues, the row coordinates provide information about the structure of the rows in the analyzed table. In turn, the column coordinates provide information about the structure of the analyzed variables and their corresponding categories.
newdata = read.csv("data.csv")
# number of categories per variable
cats = apply(newdata, 2, function(x) nlevels(as.factor(x)))
library(FactoMineR)
set.seed(1234)
# apply MCA
mca1 = MCA(newdata, graph = FALSE)
We can use the ggplot2 package to get a nice plot:
# data frame with variable coordinates
mca1_vars_df = data.frame(mca1$var$coord, Variable = rep(names(cats), cats))
# data frame with observation coordinates
mca1_obs_df = data.frame(mca1$ind$coord)
mca1_obs_df$cluster = data5$cluster
# plot of variable categories
ggplot(data=mca1_vars_df,
aes(x = Dim.1, y = Dim.2, label = rownames(mca1_vars_df))) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_text(aes(colour=Variable)) +
ggtitle("MCA plot of variables")+ theme(legend.position="none")
# MCA plot of observations with clusters:
ggplot(data = mca1_obs_df, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(size = 2, alpha = 0.7, aes(color = cluster, shape = cluster)) +
ggtitle("MCA plot of observations with Clusters") +
scale_color_discrete(name = "Clusters", labels = c("A","B","C"))
# MCA plot of observations with clusters:
ggplot(data = mca1_obs_df, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(size = 2, alpha = 0.7, aes(color = cluster, shape = cluster)) +
geom_density2d(colour = "black") +
ggtitle("MCA plot of observations with Clusters") +
scale_color_discrete(name = "Clusters", labels = c("A","B","C"))
The above plot is a good visualization of all our observations along with the 3 clusters A, B, and C.
This two-dimensional view is based on only the first two eigenvectors.
head(mca1$eig, 2)
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.07207710 0.4365968 0.4365968
## dim 2 0.04458851 0.2700885 0.7066853
As seen in the table above, the first two components account for only about 0.71% of the total variance. As noted earlier, eigenvalue percentages in MCA are typically much smaller than in PCA, so this two-dimensional view is still a reasonable way to see the clusters.
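For additional context, the same eigenvalue table can be turned into a quick scree plot of the leading MCA dimensions; a minimal sketch:
# bar plot of the percentage of variance explained by the first 10 MCA dimensions
barplot(mca1$eig[1:10, "percentage of variance"],
        names.arg = 1:10,
        xlab = "MCA dimension",
        ylab = "Percentage of variance")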
Now we need to understand the individual behaviours and characteristics of these 3 clusters in terms of their similarities and differences.
We will do this by analyzing all variables against the clusters, plotting the three clusters' distribution for each class of a variable, and for some variables plotting the class distribution across clusters as well (see the sketch after this paragraph).
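As an illustration of this approach, the sketch below cross-tabulates one categorical predictor against the cluster labels and plots the cluster distribution per category; the column name “gender” is only a placeholder for whichever predictor from data5 is being profiled.
# cluster distribution (in %) within each category of a chosen predictor;
# "gender" is a placeholder name - substitute any factor column from data5
round(prop.table(table(data5$cluster, data5$gender), margin = 2) * 100, 1)
# the same comparison as a stacked proportion bar chart
library(ggplot2)
ggplot(data5, aes(x = gender, fill = cluster)) +
  geom_bar(position = "fill") +
  ylab("Proportion") +
  ggtitle("Cluster distribution within each category")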
We will mostly analyze the characteristics of clusters A and B; for cluster C we will discuss how it differs from cluster B, or even cluster A, wherever we find something to correlate.
Our data set has a total of 3084 observations and 225 predictors (excluding the 1 column of cluster assignment).
After careful analysis and observation of these predictors, I have come up with a total of 7 Profile Groups that we can study and analyze to identify the characteristics of the three clusters.
Below are these 7 Profile Groups:
1. Web Usage (Amount/Frequency) Profile
2. Demographics Profile
3. Experience/Issues/Opinion on Web Profile
4. Education & Work Profile
5. Web shopping Profile
6. Type of Web Usage Profile
7. Other useful Info Profile
Each Group has sub-groups within it that we will actually analyze; I have called them Elements.
Each Element is made up of one, two, or more predictors, depending on the domain the predictors represent.
I have used a total of 120 predictors to create the 7 Groups and the 46 Elements within them.
A total of 105 predictors were not used in analyzing the characteristics of the clusters, for the reasons noted below.
So, as observed in the “WEB USAGE” section of the analysis, there is a clear distinction between clusters A and B: cluster A is represented by high internet usage users and cluster B by low internet usage users.
Cluster C looks more similar to cluster B when it comes to “frequency of internet use”, but more similar to cluster A (high usage) when it comes to “number of hours used”. This tells me that these users do not get on the internet that often in a week, but in fewer sittings they still use as many hours of internet as cluster A users.
The table below shows the cluster distribution (in %) within each gender:
##
## Female Male
## A 63.8 71.0
## B 33.1 25.9
## C 3.1 3.1
So, as observed in the “DEMOGRAPHICS” section of the analysis, there is a clear distinction between clusters A and B: cluster A is represented by users in the 21-85 age group, with high income and with more married users; cluster B is represented by users in the 11-20 and over-85 age groups, with mid-to-low income and more weight on unmarried users.
Cluster C is a lot like cluster B, with one distinction: its users come more from urban/suburban areas than rural ones.
**Cluster-A users are more comfortable using the web/internet, and their issues with using the internet are mostly technical.**
**Cluster-B & C users are less comfortable using the web/internet, and their issues are mostly non-technical.**
**Cluster-A is represented by users with higher education (Doctoral/Masters/College), who work 20 or more hours, in the fields of Agriculture/Software/Publishing/Mining/Legal, at Upper Mgmt/Consultant/Self-employed/Middle Mgmt/Trained Prof. positions, mostly in the private and public sectors.**
**Cluster-B & C are represented by users with comparatively lower education (Some college/high school), who work 20 or fewer hours, in the fields of Broadcasting/Homemaker/Religious/Hotel and Food/Unemployed, with positions such as Temporary/Researcher/Skilled/Labor/Student, mostly in the not-for-profit/other sectors.**
**Cluster-C has more weight on students and less weight on trained professionals and the self-employed.**
Cluster-A users are frequent online purchasers, while cluster-B & C users purchase online less frequently. When we combine this with the reasons given for not purchasing online, it looks like having no credit card, finding it too complicated, and inexperience with buying things online are the main reasons.
Cluster-A users use the web mostly for work, shopping, personal information, and entertainment, and are frequent users of e-news, product information, reference material, financial information, etc.
Cluster-B & C users use the web mostly for education, time wasting, communication, and other purposes, with frequent use of chat groups, job listings, household work, medical information, movies, reading, real estate, and socializing.
The most telling distinction was that cluster-A users have their work or themselves paying for the web, while cluster-B/C users have school/other/parents/don't know paying for the web, with cluster C having more weight on school/parents.
## Source: local data frame [12 x 2]
##
## Major.Geographical.Location Percentage
## (fctr) (dbl)
## 1 USA 82.72
## 2 Europe 8.37
## 3 Canada 4.31
## 4 Oceania 2.33
## 5 Asia 1.13
## 6 Africa 0.32
## 7 Middle East 0.29
## 8 South America 0.19
## 9 Mexico 0.13
## 10 Antarctica 0.06
## 11 Central America 0.06
## 12 West Indies 0.06
As almost 83% of the observations in our data set are from the USA, we will not be able to make any generalizations to other countries or parts of the world.
Any statements or recommendations in this review are for the USA only.
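A sketch of how a percentage table like the one above could be produced: the column name Major.Geographical.Location is taken from the printed header, but the exact dplyr steps are my assumption, and the same pattern would apply to the language and race tables below.
suppressMessages(library(dplyr))
# counts per location, converted to percentages of all observations
data5 %>%
  count(Major.Geographical.Location) %>%
  mutate(Percentage = round(100 * n / sum(n), 2)) %>%
  arrange(desc(Percentage)) %>%
  select(Major.Geographical.Location, Percentage)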
## Source: local data frame [17 x 2]
##
## Primary.Language Percentage
## (fctr) (dbl)
## 1 English 91.67
## 2 German 1.52
## 3 Other 1.52
## 4 French 1.04
## 5 Dutch 0.94
## 6 Swedish 0.62
## 7 Spanish 0.52
## 8 Chinese 0.32
## 9 Norwegian 0.32
## 10 Italian 0.29
## 11 Danish 0.26
## 12 Not Say 0.23
## 13 Portuguese 0.23
## 14 Hebrew 0.19
## 15 Russian 0.19
## 16 Greek 0.06
## 17 Korean 0.06
As almost 92% of the observations in our data set are from English-speaking users, we will not be able to make any generalizations to people speaking other languages.
Any statements or recommendations in this review are for the “English-speaking population in the USA” only.
## Source: local data frame [10 x 2]
##
## Race Percentage
## (fctr) (dbl)
## 1 White 88.52
## 2 Asian 2.82
## 3 Not Say 2.33
## 4 Multiracial 1.78
## 5 Afr. Amer 1.36
## 6 Other 1.23
## 7 Hispanic 1.13
## 8 Latino 0.52
## 9 Indigenous 0.26
## 10 Latino\\0Other 0.03
As almost 89% of the observations in our data set are from White users, we will not be able to make any generalizations to people of other races.
Any statements or recommendations in this review are for the “English-speaking White population in the USA” only.