Using a high-dimensional dataset of your choice, perform a factor analysis and clustering and interpret the results. You may use, for instance, the datasets inside the psych package, such as bfi (25 personality items thought to boil down to a few core personality types) or iqitems (14 scores that are thought to boil down to a few core mental skills), or anything else you can find. You can load the data using, for instance, data(bfi) after loading the psych package; you may need to clean it a bit first with na.omit() to remove the observations with na items, or else impute those missing items. It might also help to use scale() on your dataset before analysis. scale() takes all your variables (columns) and rescales them to have a mean of 0 and a sd of 1, so that you can more easily compare all your factors or clusters to see which are larger or smaller. For the factor analysis, you may use any of the methods covered in the lesson – they should all produce similar results, though princomp and prcomp might be simplest. You don’t have to interpret everything, say, fa() outputs, which is a lot of stuff – easier to use str() to examine the output of your function and find the quantities you want. After running your factor analysis or PCA, be sure to discuss and interpret your output:
Source: https://openpsychometrics.org/_rawdata/ This data was collected (c. 2012) through an interactive online personality test. Participants were informed at the beginning of the test that their responses would be recorded and used for research, and were asked to confirm their consent at the end of the test.
The following items were rated on a five point scale where 1=Disagree, 3=Neutral, 5=Agree (0=missed). All were presented on one page in the order E1, N2, A1, C1, O1, E2, ...
E1 | I am the life of the party |
E2 | I don’t talk a lot |
E3 | I feel comfortable around people |
E4 | I keep in the background |
E5 | I start conversations |
E6 | I have little to say |
E7 | I talk to a lot of different people at parties |
E8 | I don’t like to draw attention to myself |
E9 | I don’t mind being the center of attention |
E10 | I am quiet around strangers |
N1 | I get stressed out easily |
N2 | I am relaxed most of the time |
N3 | I worry about things |
N4 | I seldom feel blue |
N5 | I am easily disturbed |
N6 | I get upset easily |
N7 | I change my mood a lot |
N8 | I have frequent mood swings |
N9 | I get irritated easily |
N10 | I often feel blue |
A1 | I feel little concern for others |
A2 | I am interested in people |
A3 | I insult people |
A4 | I sympathize with others’ feelings |
A5 | I am not interested in other people’s problems |
A6 | I have a soft heart |
A7 | I am not really interested in others |
A8 | I take time out for others |
A9 | I feel others’ emotions |
A10 | I make people feel at ease |
C1 | I am always prepared |
C2 | I leave my belongings around |
C3 | I pay attention to details |
C4 | I make a mess of things |
C5 | I get chores done right away |
C6 | I often forget to put things back in their proper place |
C7 | I like order |
C8 | I shirk my duties |
C9 | I follow a schedule |
C10 | I am exacting in my work |
O1 | I have a rich vocabulary |
O2 | I have difficulty understanding abstract ideas |
O3 | I have a vivid imagination |
O4 | I am not interested in abstract ideas |
O5 | I have excellent ideas |
O6 | I do not have a good imagination |
O7 | I am quick to understand things |
O8 | I use difficult words |
O9 | I spend time reflecting on things |
O10 | I am full of ideas |
code <- tibble::tribble(~Q, ~Text, "E1", "I am the life of the party", "E2", "I don't talk a lot",
"E3", "I feel comfortable around people", "E4", "I keep in the background", "E5",
"I start conversations", "E6", "I have little to say", "E7", "I talk to a lot of different people at parties",
"E8", "I don't like to draw attention to myself", "E9", "I don't mind being the center of attention",
"E10", "I am quiet around strangers", "N1", "I get stressed out easily", "N2",
"I am relaxed most of the time", "N3", "I worry about things", "N4", "I seldom feel blue",
"N5", "I am easily disturbed", "N6", "I get upset easily", "N7", "I change my mood a lot",
"N8", "I have frequent mood swings", "N9", "I get irritated easily", "N10", "I often feel blue",
"A1", "I feel little concern for others", "A2", "I am interested in people",
"A3", "I insult people", "A4", "I sympathize with others' feelings", "A5", "I am not interested in other people's problems",
"A6", "I have a soft heart", "A7", "I am not really interested in others", "A8",
"I take time out for others", "A9", "I feel others' emotions", "A10", "I make people feel at ease",
"C1", "I am always prepared", "C2", "I leave my belongings around", "C3", "I pay attention to details",
"C4", "I make a mess of things", "C5", "I get chores done right away", "C6",
"I often forget to put things back in their proper place", "C7", "I like order",
"C8", "I shirk my duties", "C9", "I follow a schedule", "C10", "I am exacting in my work",
"O1", "I have a rich vocabulary", "O2", "I have difficulty understanding abstract ideas",
"O3", "I have a vivid imagination", "O4", "I am not interested in abstract ideas",
"O5", "I have excellent ideas", "O6", "I do not have a good imagination", "O7",
"I am quick to understand things", "O8", "I use difficult words", "O9", "I spend time reflecting on things",
"O10", "I am full of ideas")
On the next page the following values were collected.
race | Chosen from a drop down menu. 1=Mixed Race, 2=Arctic (Siberian, Eskimo), 3=Caucasian (European), 4=Caucasian (Indian), 5=Caucasian (Middle East), 6=Caucasian (North African, Other), 7=Indigenous Australian, 8=Native American, 9=North East Asian (Mongol, Tibetan, Korean, Japanese, etc), 10=Pacific (Polynesian, Micronesian, etc), 11=South East Asian (Chinese, Thai, Malay, Filipino, etc), 12=West African, Bushmen, Ethiopian, 13=Other (0=missed) |
age | Entered as text (individuals reporting age < 13 were not recorded) |
engnat | Response to “is English your native language?”. 1=yes, 2=no (0=missed) |
gender | Chosen from a drop down menu. 1=Male, 2=Female, 3=Other (0=missed) |
hand | “What hand do you use to write with?”. 1=Right, 2=Left, 3=Both (0=missed) |
On this page users were also asked to confirm that their answers were accurate and could be used for research. Participants who did not were not recorded.
Some values were calculated from technical information.
country | The participant’s technical location. ISO country code. |
source | How the participant came to the test. Based on HTTP Referer. 1=from another page on the test website, 2=from google, 3=from facebook, 4=from any url with “.edu” in its domain name (e.g. xxx.edu, xxx.edu.au), 6=other source, or HTTP Referer not provided. |
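Because the codebook marks skipped items as 0 rather than as a missing value, na.omit() on its own would not drop those responses. The following is a minimal sketch of one way to handle this, using a made-up three-row data frame (`toy` and its values are hypothetical, not taken from the data set): recode 0 to NA first, then keep complete rows.

```r
# Hypothetical toy responses on the 1-5 scale; 0 means the item was missed.
toy <- data.frame(E1 = c(4, 0, 2), N1 = c(3, 5, 1), A1 = c(2, 1, 0))

# Recode the "missed" code to NA so R treats it as missing.
toy_na <- toy
toy_na[toy_na == 0] <- NA

# Keep only respondents who answered every item.
toy_complete <- na.omit(toy_na)
nrow(toy_complete)
```

Whether dropping or imputing is the better choice depends on how many responses are affected; this sketch only shows the recoding step.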
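library(tidyverse)  # read_tsv(), %>%, filter(), inner_join(), gather()
d <- read_tsv("~/Northeastern/Git/ppua5301/Homework 11/BIG5/data.csv", col_names = TRUE)
d <- d[-19065, ]  # remove a row with missing values
demo <- d[, 1:7]  # demographic columns, kept for reference
ds <- d[, -c(1:7)]  # drop the demographic columns to make the data manageable
# Center every item at the scale midpoint (3) and leave the spread unchanged
ds <- scale(ds, center = rep(3, ncol(ds)), scale = rep(1, ncol(ds)))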
ds5 <- psych::fa(ds, nfactors = 5)
dspar <- psych::fa.parallel(ds, quant = 0.95)
## Parallel analysis suggests that the number of factors = 10 and the number of components = 7
Examine the factor eigenvalues or variances (or the sdev or standard deviations as reported by prcomp or princomp, which you then need to square to get the variances). Plot these in a scree plot and use the “elbow” test to guess how many factors one should retain. What proportion of the total variance does your subset of variables explain?
plot(ds5$e.values)
dspar
## Call: psych::fa.parallel(x = ds, quant = 0.95)
## Parallel analysis suggests that the number of factors = 10 and the number of components = 7
##
## Eigen Values of
##    Original factors Simulated data Original components simulated data
## 1              7.28           0.10                8.05           1.10
## 2              3.79           0.09                4.62           1.09
## 3              2.87           0.08                3.75           1.08
## 4              2.63           0.08                3.55           1.08
## 5              1.89           0.07                2.76           1.07
## 6              0.71           0.07                1.57           1.07
## 7              0.39           0.06                1.33           1.06
## 8              0.19           0.06                1.05           1.06
## 9              0.10           0.06                0.97           1.05
## 10             0.06           0.05                0.93           1.05
The data is from the IPIP Big Five personality model, so it comes as no surprise that a visual inspection of the scree plot shows an elbow after the fifth eigenvalue, suggesting that five factors should be retained. The second graph uses the psych package's parallel analysis function, which, interestingly, suggests 10 factors or 7 components when the quantile is set at 0.95.
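The idea behind parallel analysis is to keep only those eigenvalues that are larger than what random data of the same shape would produce. The sketch below is a simplified, components-only version of that idea in base R on synthetic data; it is not the psych implementation, and the data, the 200 simulations, and all variable names are assumptions made for illustration.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Synthetic data: 300 observations, 6 variables driven by 2 latent factors.
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(f1 + rnorm(n, sd = 0.5), f1 + rnorm(n, sd = 0.5), f1 + rnorm(n, sd = 0.5),
           f2 + rnorm(n, sd = 0.5), f2 + rnorm(n, sd = 0.5), f2 + rnorm(n, sd = 0.5))

# Observed eigenvalues of the correlation matrix.
obs <- eigen(cor(x))$values

# Eigenvalues from 200 random normal data sets of the same shape.
sim <- replicate(200, eigen(cor(matrix(rnorm(n * ncol(x)), n)))$values)

# 95th percentile of the simulated eigenvalues at each position.
thresh <- apply(sim, 1, quantile, probs = 0.95)

# Retain the leading components whose eigenvalue exceeds the random threshold.
n_keep <- sum(cumprod(obs > thresh))
n_keep
```

With a very large sample, as in the Big Five data, the simulated eigenvalues sit very close to their expected values, so even small observed eigenvalues can exceed the threshold. That may be part of why the suggested counts above are larger than five.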
ds5v <- cumsum(ds5$e.values)/sum(ds5$e.values)
ds5v[5]
## [1] 0.4545916
plot(ds5v, ylim = c(0, 1))
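The first five eigenvalues account for only ~45.46% of the total variance in the data set. Note that ds5$e.values holds the eigenvalues of the original correlation matrix (they match the “Original components” column above), so this figure describes the first five principal components rather than the fitted factors themselves. Either way, this is surprisingly low considering this is a test of the Big Five factor model!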
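The 45.46% figure can be reproduced by hand from the “Original components” column of the parallel-analysis output, which suggests that ds5$e.values holds the eigenvalues of the correlation matrix. A minimal check, assuming the rounded eigenvalues printed above and 50 items:

```r
# First five component eigenvalues as printed (rounded) in the output above.
ev5 <- c(8.05, 4.62, 3.75, 3.55, 2.76)

# The eigenvalues of a correlation matrix sum to the number of variables,
# which is 50 here (ten items for each of the five scales).
p <- 50

# Share of total variance carried by the first five eigenvalues.
prop5 <- sum(ev5) / p
round(prop5, 3)
```

The result, about 0.455, agrees with ds5v[5] up to the rounding of the printed eigenvalues.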
Examine the loadings of the factors on the variables (sometimes called the “rotation” in the function output) – ie, the projection of the factors on the variables – focusing on just the first one or two factors. Sort the variables by their loadings, and try to interpret what the first one or two factors “mean.” This may require looking more carefully into the dataset to understand exactly what each of the variables were measuring. You can find more about the data in the psych package using ?psych or visiting http://personality-project.org/ . Next perform a cluster analysis of the same data.
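ds5r <- ds5$loadings[, 1]  # loadings of every item on the first factor
ds5o <- names(sort(ds5r, decreasing = TRUE))  # item names, highest loading first
# Five highest (H) and five lowest (T) items, joined to their question text
ds5i <- data.frame(Lvl = rep(c("H", "T"), each = 5), Q = c(head(ds5o, 5), tail(ds5o, 5))) %>%
  inner_join(code, by = "Q")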
ds5i
## Lvl Q Text
## 1 H E7 I talk to a lot of different people at parties
## 2 H E5 I start conversations
## 3 H E1 I am the life of the party
## 4 H E9 I don't mind being the center of attention
## 5 H E3 I feel comfortable around people
## 6 T E6 I have little to say
## 7 T E8 I don't like to draw attention to myself
## 8 T E10 I am quiet around strangers
## 9 T E2 I don't talk a lot
## 10 T E4 I keep in the background
This factor is clearly extraversion: extraverted items load at the high end and introverted items at the low end.
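ds5r2 <- ds5$loadings[, 2]  # loadings of every item on the second factor
ds5o2 <- names(sort(ds5r2, decreasing = TRUE))  # item names, highest loading first
# Five highest (H) and five lowest (T) items, joined to their question text
ds5r2i <- data.frame(Lvl = rep(c("H", "T"), each = 5), Q = c(head(ds5o2, 5), tail(ds5o2, 5))) %>%
  inner_join(code, by = "Q")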
ds5r2i
## Lvl Q Text
## 1 H N6 I get upset easily
## 2 H N8 I have frequent mood swings
## 3 H N9 I get irritated easily
## 4 H N7 I change my mood a lot
## 5 H N1 I get stressed out easily
## 6 T E2 I don't talk a lot
## 7 T O7 I am quick to understand things
## 8 T E3 I feel comfortable around people
## 9 T N4 I seldom feel blue
## 10 T N2 I am relaxed most of the time
The second factor looks like neuroticism. The five highest loadings are all neuroticism items, and the two lowest are the reverse-worded neuroticism items (N2 and N4). The rest of the low end is a mixture of two extraversion-scale items (E2 and E3) and one openness item (O7).
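Since the same "five highest, five lowest" lookup is applied to each factor, it could be wrapped in a small helper. The sketch below is one hypothetical way to do that; `top_bottom` and the toy loadings are made up for illustration and are not part of the analysis above.

```r
# Hypothetical helper: return the n highest and n lowest items of a named
# numeric vector (for example, one column of a loadings matrix).
top_bottom <- function(x, n = 5) {
  ord <- names(sort(x, decreasing = TRUE))
  data.frame(Lvl = rep(c("H", "T"), each = n),
             Q = c(head(ord, n), tail(ord, n)))
}

# Made-up loadings for six items, purely for illustration.
toy_load <- c(E1 = 0.7, E2 = -0.7, E3 = 0.6, N1 = 0.1, N2 = -0.1, E4 = -0.6)
res <- top_bottom(toy_load, n = 2)
res
```

The returned data frame has the same Lvl and Q columns used above, so it could be joined to the question key by Q in the same way.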
First use k-means and examine the centers of the first two or three clusters. How are they similar to and different from the factor loadings of the first couple factors?
kds <- kmeans(ds, centers = 5, nstart = 30)
Of note, when using kmeans with this data set and nstart set to 30, the function emits the warning “did not converge in 10 iterations” three times. The warning means that some of the random starts reached the default iteration limit (iter.max = 10), which can be raised. Thinking that perhaps the number of clusters was inaccurate, I searched for iterative methods of selecting the appropriate number of clusters and found the following Stack Overflow post: “Cluster analysis in R: determine the optimal number of clusters.” The first method from the post, which uses the within-groups sum of squares and a scree plot, is demonstrated below.
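The convergence warning relates to the iteration limit rather than to the number of clusters as such. A minimal sketch on synthetic data (the `toy` matrix and the choice of 100 iterations are assumptions for illustration) shows how iter.max can be raised and how set.seed() keeps the cluster numbering stable between runs:

```r
set.seed(42)  # fix the random starts so cluster labels are reproducible

# Synthetic data: two well-separated groups in two dimensions.
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

# iter.max defaults to 10; raising it gives each start more room to converge.
km <- kmeans(toy, centers = 2, nstart = 30, iter.max = 100)

km$size  # number of points in each cluster
km$iter  # iterations used by the best start
```

On the real data, the same two arguments could be added to the kmeans call above; whether that removes the warning there has not been checked here.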
wss <- (nrow(ds) - 1) * sum(apply(ds, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(ds, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
As is apparent from the graph, there isn’t a particularly well-defined elbow to use in selecting the number of clusters. Given this, I’ve opted to go with 5 categories (to match the 5 factors) and explain the results, though as the graph suggests, the data is evenly spread and difficult to cluster.
# For each k-means cluster, find the six items with the highest and the six
# with the lowest center values, and attach the question text from `code`.
catI <- function(kds) {
  cats <- vector("list", length(kds$size))
  for (i in seq_along(kds$size)) {
    ctr <- sort(kds$centers[i, ], decreasing = TRUE)
    cats[[i]]$h <- head(ctr)  # six highest item means
    cats[[i]]$t <- tail(ctr)  # six lowest item means
  }
  for (i in seq_along(kds$size)) {
    # merge(..., all = TRUE) stacks the H and T rows, sorted by Lvl and Q
    cats[[i]]$i <- merge(cbind(Lvl = rep(paste("H", i, sep = ""), 6), code %>%
      filter(Q %in% names(cats[[i]]$h))), cbind(Lvl = rep(paste("T", i, sep = ""),
      6), code %>% filter(Q %in% names(cats[[i]]$t))), all = TRUE)
  }
  return(cats)
}
catsI <- catI(kds)
listofdf <- vector("list", 5)
for (i in 1:5) {
listofdf[[i]] <- catsI[[i]]$i
}
CatDesc <- do.call(rbind, listofdf)
CatDesc
## Lvl Q Text
## 1 H1 A2 I am interested in people
## 2 H1 A4 I sympathize with others' feelings
## 3 H1 A9 I feel others' emotions
## 4 H1 E5 I start conversations
## 5 H1 N3 I worry about things
## 6 H1 O3 I have a vivid imagination
## 7 T1 A5 I am not interested in other people's problems
## 8 T1 A7 I am not really interested in others
## 9 T1 E2 I don't talk a lot
## 10 T1 E6 I have little to say
## 11 T1 O4 I am not interested in abstract ideas
## 12 T1 O6 I do not have a good imagination
## 13 H2 A4 I sympathize with others' feelings
## 14 H2 A6 I have a soft heart
## 15 H2 A9 I feel others' emotions
## 16 H2 C3 I pay attention to details
## 17 H2 E10 I am quiet around strangers
## 18 H2 O9 I spend time reflecting on things
## 19 T2 A1 I feel little concern for others
## 20 T2 A3 I insult people
## 21 T2 A5 I am not interested in other people's problems
## 22 T2 A7 I am not really interested in others
## 23 T2 C4 I make a mess of things
## 24 T2 O6 I do not have a good imagination
## 25 H3 E10 I am quiet around strangers
## 26 H3 O10 I am full of ideas
## 27 H3 O3 I have a vivid imagination
## 28 H3 O5 I have excellent ideas
## 29 H3 O7 I am quick to understand things
## 30 H3 O9 I spend time reflecting on things
## 31 T3 E1 I am the life of the party
## 32 T3 E7 I talk to a lot of different people at parties
## 33 T3 N6 I get upset easily
## 34 T3 O2 I have difficulty understanding abstract ideas
## 35 T3 O4 I am not interested in abstract ideas
## 36 T3 O6 I do not have a good imagination
## 37 H4 E10 I am quiet around strangers
## 38 H4 N1 I get stressed out easily
## 39 H4 N3 I worry about things
## 40 H4 N9 I get irritated easily
## 41 H4 O3 I have a vivid imagination
## 42 H4 O9 I spend time reflecting on things
## 43 T4 C5 I get chores done right away
## 44 T4 E1 I am the life of the party
## 45 T4 E7 I talk to a lot of different people at parties
## 46 T4 E9 I don't mind being the center of attention
## 47 T4 N4 I seldom feel blue
## 48 T4 O6 I do not have a good imagination
## 49 H5 A2 I am interested in people
## 50 H5 A4 I sympathize with others' feelings
## 51 H5 E3 I feel comfortable around people
## 52 H5 E5 I start conversations
## 53 H5 O10 I am full of ideas
## 54 H5 O7 I am quick to understand things
## 55 T5 A3 I insult people
## 56 T5 A5 I am not interested in other people's problems
## 57 T5 A7 I am not really interested in others
## 58 T5 E6 I have little to say
## 59 T5 N8 I have frequent mood swings
## 60 T5 O6 I do not have a good imagination
The function above takes a kmeans output, compiles the head (h) and tail (t) of each category's centers, and builds a data frame (i) that uses the question key to describe the head and tail of each category. The data frames are combined into CatDesc for ease of reference.
Note that k-means cluster numbers are arbitrary and change between runs unless a seed is set. The category numbers below appear to follow a different run from the one printed above, so the matching rows of the output are given in parentheses.
Category 1 (H5/T5 above) is defined by individuals who start conversations, are comfortable and polite with others, take an interest in others, learn quickly and have a lot to say. We might call this category the “socialites.” This category contrasts with the first factor in that it is not a clearly defined vector between two poles of a personality dimension. This is logical because the test is designed around five factors rather than categories.
Category 2 (H3/T3 above) is people who are very high on openness: imaginative, quick to learn things, and full of ideas. They are fond of abstract ideas, tend towards introversion, and avoid being the center of attention. We might call this category the “innovators.”
Category 3 (H4/T4 above) comprises individuals who have some neurotic traits and tend towards agitation or worry, have vivid imaginations and enjoy reflecting. They tend to procrastinate, are sensitive to moods, and might be the wallflowers at a party. We might call this category the “dreamers.”
It is evident that each category is a cluster of selected traits rather than a vectored dimension. Each category groups individuals whose answers imply traits that are often found together within personality “types,” so to speak.
Next use hierarchical clustering. Print the dendrogram, and use that to guide your choice of the number of clusters. Use cutree to generate a list of which clusters each observation belongs to. Aggregate the data by cluster and then examine those centers (the aggregate means) as you did in (3). Can you interpret all of them meaningfully using the methods from (3) to look at the centers?
The full data set is too large to run dist() without crashing R, because the distance matrix grows with the square of the number of rows. To remedy this, the original data has been filtered to US respondents aged 32.
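A rough estimate shows why the full distance matrix is a problem. The sketch below assumes a hypothetical row count of 19,000, which is roughly the size implied by the row index removed earlier; the real number of rows may be larger.

```r
# Hypothetical row count, roughly the size implied by the row removed earlier.
n <- 19000

# dist() stores the lower triangle: one double (8 bytes) per pair of rows.
n_pairs <- n * (n - 1) / 2
gib <- n_pairs * 8 / 1024^3
round(gib, 2)
```

That is about 1.3 GiB for the distances alone, before any working memory that hclust() needs on top of it.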
d4 <- d %>% filter(age == 32 & country == "US")
d4 <- d4[, -c(1:7)]
d4 <- scale(d4, center = rep(3, length(d4)), scale = rep(1, length(d4)))
d4s <- dist(d4)
d4ha <- hclust(d4s, method = "complete")
plot(d4ha)
abline(h = 17, col = "red")  # horizontal cut at height 17
abline(h = 15, col = "blue")  # horizontal cut at height 15
The plot above shows distinctive drops at the red line, which cuts the tree into the 5 categories used in the k-means model, and at the blue line, which cuts it into a possible 8 categories. For consistency and ease of interpretation, we will use 5 categories for further analysis. For a simplified visualization of these categories, see the dendrogram below.
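The number of clusters produced by a horizontal line can be checked directly, because cutree() accepts a height (h) as well as a cluster count (k). The sketch below uses synthetic data; the `toy` matrix and the cut height of 3 are assumptions chosen for illustration.

```r
set.seed(7)

# Synthetic data: three well-separated groups of ten points each.
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2))
hc <- hclust(dist(toy), method = "complete")

# Cut at a height instead of asking for k clusters, then count the groups.
grp <- cutree(hc, h = 3)
table(grp)
```

Applied to the tree above, length(unique(cutree(d4ha, h = 17))) would give the number of clusters at the red line.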
plot(d4ha)
rect.hclust(d4ha, 5)
ctd4 <- cutree(d4ha, 5)
hcats <- cbind(Cat = ctd4, d4) #Add the category to the original matrix
dfhc <- as.data.frame(hcats) #make it a df
hcmeans <- dfhc %>% group_by(Cat) %>% summarize_all("mean") #compute the means for each column grouped by category
HCatLi <- vector("list", 5) #instantiate a list for the loop
# loop through and find the head and tail of each category, label as to the
# category, add the corresponding questions and store the df in a list.
for (i in 1:5) {
h <- head(hcmeans[i, ] %>% gather() %>% filter(key != "Cat") %>% arrange(desc(value)) %>%
select(key), 5) %>% inner_join(code, by = c(key = "Q"))
t <- tail(hcmeans[i, ] %>% gather() %>% filter(key != "Cat") %>% arrange(desc(value)) %>%
select(key), 5) %>% inner_join(code, by = c(key = "Q"))
HCatLi[[i]] <- merge(cbind(Lvl = rep(paste("H", i, sep = ""), 5), h), cbind(Lvl = rep(paste("T",
i, sep = ""), 5), t), all = T)
}
HCatDesc <- do.call(rbind, HCatLi)
HCatDesc
## Lvl key Text
## 1 H1 C10 I am exacting in my work
## 2 H1 C7 I like order
## 3 H1 C9 I follow a schedule
## 4 H1 N3 I worry about things
## 5 H1 O9 I spend time reflecting on things
## 6 T1 A5 I am not interested in other people's problems
## 7 T1 C8 I shirk my duties
## 8 T1 E1 I am the life of the party
## 9 T1 E7 I talk to a lot of different people at parties
## 10 T1 E9 I don't mind being the center of attention
## 11 H2 A4 I sympathize with others' feelings
## 12 H2 A6 I have a soft heart
## 13 H2 N3 I worry about things
## 14 H2 O3 I have a vivid imagination
## 15 H2 O9 I spend time reflecting on things
## 16 T2 A1 I feel little concern for others
## 17 T2 A7 I am not really interested in others
## 18 T2 C5 I get chores done right away
## 19 T2 N4 I seldom feel blue
## 20 T2 O6 I do not have a good imagination
## 21 H3 E8 I don't like to draw attention to myself
## 22 H3 O1 I have a rich vocabulary
## 23 H3 O10 I am full of ideas
## 24 H3 O3 I have a vivid imagination
## 25 H3 O7 I am quick to understand things
## 26 T3 E1 I am the life of the party
## 27 T3 E7 I talk to a lot of different people at parties
## 28 T3 O2 I have difficulty understanding abstract ideas
## 29 T3 O4 I am not interested in abstract ideas
## 30 T3 O6 I do not have a good imagination
## 31 H4 A2 I am interested in people
## 32 H4 A4 I sympathize with others' feelings
## 33 H4 A9 I feel others' emotions
## 34 H4 C3 I pay attention to details
## 35 H4 C7 I like order
## 36 T4 A1 I feel little concern for others
## 37 T4 A3 I insult people
## 38 T4 A7 I am not really interested in others
## 39 T4 C8 I shirk my duties
## 40 T4 O2 I have difficulty understanding abstract ideas
## 41 H5 A2 I am interested in people
## 42 H5 E3 I feel comfortable around people
## 43 H5 E5 I start conversations
## 44 H5 E7 I talk to a lot of different people at parties
## 45 H5 O2 I have difficulty understanding abstract ideas
## 46 T5 A7 I am not really interested in others
## 47 T5 A9 I feel others' emotions
## 48 T5 C1 I am always prepared
## 49 T5 C6 I often forget to put things back in their proper place
## 50 T5 O8 I use difficult words
Replicating and modifying the method used in 3 gives the data frame above with descriptions of each of the categories.
Category 1 consists of individuals high on conscientiousness traits like affinity for order, precision, routine and self reflection, with a tendency to worry. These individuals empathize with others and are dutiful, as well as introverted. The individuals in this category might be known as “achievers.”
Category 2 consists of individuals with traits similar to the “dreamers” from the k-means analysis, though these individuals might be more concerned with the moods of and interactions with others. The individuals in this category might be known as “caregivers.”
Category 3 is comprised of individuals similar to the “innovators” from the k-means analysis: quick learners and introverts who are interested in imagining, understanding and communicating ideas. The individuals in this category might be known as “philosophers.”
Category 4 is comprised of individuals who rank high on positive traits associated with agreeableness and conscientiousness, and low on the negative traits associated with these categories. These are individuals who are likely concerned with relationship dynamics, and finding order within relationships; these traits are combined with a sense of duty and a penchant for understanding abstract ideas. The individuals in this category might be known as “organizers.”
Category 5 is comprised of individuals who also rank high on positive traits of extraversion and are concerned with others, but have difficulty with abstract ideas. These individuals are self-contained, but not always prepared and may be somewhat forgetful. They also use language that is accessible to a broad audience. The individuals in this category might be known as “public figures.”
Generalizing a methodology for future data explorations of this nature will be helpful for me to understand and crystallize the information covered in this homework. The methodology is below, with a general explanation for this specific dataset following.
Two preliminary pieces of data would be useful to guide a factor or category analysis:
One is the meaning of each of the variables which will likely hint at potentially useful factors or groupings within the data.
The second is whether the data can be visualized in a way that reveals its shape. One idea is to graph the variables in a sphere: each variable gets its own plane through a shared origin and a shared vertical y-axis, with the x-axes forming horizontal radii spaced evenly by angle, so that the planes form cross sections of the sphere. If this multidimensional graph could be viewed from any perspective, it might show whether the data forms distinctive ovals or clusters, indicating whether factor analysis or cluster/hierarchical analysis, respectively, will be more descriptive. Further, if clusters are apparent and sub-clusters are easily discernible within them, a hierarchical cluster analysis is likely to be useful. If such a graph is not possible, as is typical in unsupervised learning, it would be best to try the various analyses iteratively and use a measure of variance explained to discern which model best describes the data.
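A more conventional way to get a first look at the shape of high-dimensional data is to project it onto its first two principal components. This is a substitute for the sphere described above, not the same construction. The sketch below uses synthetic data, and every name in it is an assumption made for illustration.

```r
set.seed(3)

# Synthetic data: two groups in five dimensions (illustration only).
toy <- rbind(matrix(rnorm(250, mean = 0), ncol = 5),
             matrix(rnorm(250, mean = 4), ncol = 5))

# Project onto the first two principal components.
pc <- prcomp(toy, center = TRUE, scale. = TRUE)
scores <- pc$x[, 1:2]

# Share of the total variance that the 2-D view retains.
shown <- sum(pc$sdev[1:2]^2) / sum(pc$sdev^2)

plot(scores, xlab = "PC1", ylab = "PC2")
```

If `shown` is small, the two-dimensional picture hides most of the structure and should be read with caution.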
Once the best-suited analysis is decided upon, scree plots and dendrograms will be useful in discerning the number of factors or categories to use. It will be important to note during this phase that hierarchical analysis should be approached with caution: the distance computation is highly resource intensive, so it will be unsuitable for large datasets unless hardware is available to parallelize the computation and complete it in a reasonable amount of time.
With the appropriate number of factors or categories, the data can then be parsed into the respective factors or categories using the appropriate functions and variables.
Finally, a method for linking the resulting eigenvectors or centers and their explanatory variables with the original variable measures (as in the question dataframe in 3 & 4 above) will be useful in attempting to explain the resulting factors or categories.
With regard to the Big Five data in this particular dataset, it was apparent that factor analysis was the most relevant analysis, as the questions are specifically tuned to elicit where an individual falls along each of the factors in consideration, namely extraversion, agreeableness, conscientiousness, neuroticism, and openness.
As expected, clear clusterings were not as apparent in this dataset as they might be in others, because of its factorable origins. However, both cluster analyses yielded interesting groupings of traits that appeared to fit generalized personality types (though I might have just read into these). It is clear to see how widely applicable these analyses will be to many types of data from almost every major discipline.