The OkCupid Profiles dataset contains 60,552 observations and 29 columns, including typical user demographic information, lifestyle variables, and text responses to 10 essay questions. The original source is "OkCupid Data for Introductory Statistics and Data Science Courses" by Albert Kim and Adriana Escobedo-Land. Data science is gaining prominence in both academia and industry, and by analyzing this data we hope to offer answers, or at least suggestions, to questions people face in everyday life.
In this project, the first topic we analyze is the relationship between body type, age, and single status. Second, by grouping, clustering, and refining the dataset, we examine whether there is a relationship between height, income, and single status. The third topic is a text analysis of essay question 2. We hope to provide users with helpful labels: the words that come up most often when people are asked "What are you good at?" Such a text analysis shows how people typically sell themselves to potential partners.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
okc <- read_csv("https://uofi.box.com/shared/static/oy32nc373w4jqz3kummksnw6wvhfrl7a.csv",
col_types = cols(last_online = col_datetime(format = "%Y-%m-%d-%H-%M"))) #takes about 20 seconds on laptop
colnames(okc) <- tolower(colnames(okc))
original_copy = okc  # keep an untouched copy of the full dataset
okc = sqldf("select body_type, age, height, income, status from okc")  # keep only the columns used in this part
okc = sqldf("select * from okc where status = 'available' or status = 'married' or status = 'seeing someone'")  # keep only these three relationship statuses
# bin ages into six-year groups
copy <- mutate(okc, age_group =
ifelse(okc$age %in% 18:23, "18-23",
ifelse(okc$age %in% 24:29, "24-29",
ifelse(okc$age %in% 30:35, "30-35",
ifelse(okc$age %in% 36:41, "36-41",
ifelse(okc$age %in% 42:47, "42-47",
ifelse(okc$age %in% 48:53, "48-53",
ifelse(okc$age %in% 54:59, "54-59",
ifelse(okc$age %in% 60:65, "60-65",
"above 65")))))))))
bar <- ggplot(data = okc) +
geom_bar(
mapping = aes(x = body_type, fill = status), position = "fill",
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip() + ggtitle("Percentage Distribution of Different Body_type and Single Status")
bar <- ggplot(data = okc) +
geom_bar(
mapping = aes(x = body_type, fill = status),
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip() + ggtitle("Amount of Different Body Shapes and Single Status")
From these two graphs, we can see that most people described themselves as average, fit, or athletic. The raw counts alone make it hard to draw a conclusion about single status, but the percentage plot suggests that the jacked body type has the largest share of married users.
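To back up this visual impression with numbers, a minimal dplyr sketch (assuming the filtered okc data frame defined above) computes the share of each status within every body type:
okc %>%
  count(body_type, status) %>%
  group_by(body_type) %>%
  mutate(prop = n / sum(n)) %>%   # share of each status within a body type
  arrange(body_type, desc(prop))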
library('RColorBrewer')
available <- okc[which(okc$status=='available'),]  # keep only users whose status is 'available'
ggplot(data = available) +
  geom_bar(mapping = aes(x = status, fill = body_type))  # single bar, filled by body type
This graph isolates the users whose status is available. Looking at this group alone, it is surprising to see that fit, average, and athletic users appear more likely to be single, whereas jacked users are much less likely to be single.
bar <- ggplot(data = copy) +
geom_bar(
mapping = aes(x = age_group, fill = status), position = "fill",
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_polar() + ggtitle("Distribution of Different Age_Group and Single Status")
From the fourth graph, we can see a trend of people getting married between ages 18 and 35, but around age 36 there seems to be more pressure from family or the workplace, and more people appear to be seeking new relationships.
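A quick way to quantify this jump (a sketch, assuming the copy data frame with age_group built above) is to compute the share of available users in each age group:
copy %>%
  group_by(age_group) %>%
  summarise(pct_available = mean(status == "available"))  # share of 'available' users per age group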
The dataset contains two continuous variables, income and height. Having previously grouped the data by body type and by age group, we now recode the status column, combining married and seeing someone into not single and renaming available to single, and study how income and height cluster within these two groupings.
copy_2 = sqldf("select * from copy where income <> -1")  # drop users who did not report income
# recode status into two levels: single vs. not single
temp = copy_2$status
temp = replace(temp, temp=="available", "single")
temp = replace(temp, temp=="seeing someone", "not single")
temp = replace(temp, temp=="married", "not single")
copy_2$status = temp
copy_2 = sqldf("select * from copy_2 where income < 250000")  # drop extreme reported incomes
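An equivalent recode using dplyr alone (a sketch of an alternative, not the original approach; copy_2_alt is a hypothetical name):
copy_2_alt <- copy %>%
  filter(income != -1, income < 250000) %>%                              # same income filters as above
  mutate(status = ifelse(status == "available", "single", "not single"))  # collapse to two levels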
ggplot(data = copy_2, mapping = aes(x = status, y = height)) +
geom_boxplot() + ggtitle("Boxplot Height vs Single Status")
ggplot(data = copy_2, mapping = aes(x = status, y = income)) +
geom_boxplot() + ggtitle("Boxplot Income vs Single Status")
First we plotted boxplots of height and income against single status. Single people appear to have a slightly higher average height and a higher income.
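The corresponding summary statistics can be computed directly (a minimal sketch, assuming the copy_2 data frame above):
copy_2 %>%
  group_by(status) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            median_height = median(height, na.rm = TRUE),
            mean_income = mean(income),
            median_income = median(income))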
dataset_1 = sqldf("select body_type, avg(height) as AVG_Height, avg(income) as AVG_Income from copy_2 group by body_type")
dataset_2 = sqldf("select age_group, avg(height) as AVG_Height, avg(income) as AVG_Income from copy_2 group by age_group")
ggplot(data = dataset_1, mapping = aes(x = AVG_Height, y = AVG_Income)) +
geom_point(mapping = aes(color = body_type)) +
geom_smooth()
ggplot(data = dataset_2, mapping = aes(x = AVG_Height, y = AVG_Income)) +
geom_point(mapping = aes(color = age_group)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We plotted average income against average height, grouped by body type and by age group. Some interesting patterns emerge: taller users tend to describe themselves as either jacked or overweight, and the higher-income groups seem to spend more time on fitness, with average, fit, and athletic being the prevalent body types. Although the fitted curves for income versus height have different shapes depending on whether we group by body type or by age group, both groupings show a portion of the curve with a positive correlation between height and income.
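As a quick numeric check of that overall direction (a sketch; the loess fits above are nonlinear, so this only captures the linear trend):
cor(dataset_1$AVG_Height, dataset_1$AVG_Income, use = "complete.obs")  # grouped by body type
cor(dataset_2$AVG_Height, dataset_2$AVG_Income, use = "complete.obs")  # grouped by age group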
# hierarchical clustering of the group averages: standardize height and income,
# compute Euclidean distances, and plot the dendrogram
plot_hclost <- function(small_table, labels, main){
  z <- scale(small_table[,-1])        # drop the grouping column, standardize the numeric averages
  hc_trail <- hclust(dist(z))         # complete linkage on Euclidean distances
  plot(hc_trail, labels = labels, hang = -1)
  title(main)
}
plot_hclost(dataset_1, labels = dataset_1$body_type, main = "Body Type")
plot_hclost(dataset_2, labels = dataset_2$age_group, main = "Age Group")
Now we have the hierarchical clustering results by body type and by age group. Similar body types share similar income and height profiles: for example, skinny and thin, athletic and fit, and a little extra and average. Interestingly, people who chose not to answer their body type have a profile similar to full figured, and people who left the answer blank are most similar to skinny or thin. In the age-group dendrogram, the results are largely as expected: adjacent age groups share similar income and height profiles. One useful observation is that most people seem to retire around age 65, so the over-65 group shares similar height and income characteristics with the younger generations.
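To turn the dendrogram into explicit group labels, the tree can be cut into a fixed number of clusters (a sketch that recomputes the same scaled distances as plot_hclost; k = 3 is an arbitrary choice for illustration):
z1 <- scale(dataset_1[,-1])              # same standardization as in plot_hclost
hc1 <- hclust(dist(z1))
data.frame(body_type = dataset_1$body_type,
           cluster = cutree(hc1, k = 3))  # assign each body type to one of 3 clusters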
First, I took a look at some answers to essay question 2 in the original, uncleaned data.
library(tidyverse)
okc <- read_csv("https://uofi.box.com/shared/static/oy32nc373w4jqz3kummksnw6wvhfrl7a.csv",
col_types = cols(last_online = col_datetime(format = "%Y-%m-%d-%H-%M")))
head(okc)
## # A tibble: 6 x 31
## age body_type diet drinks drugs education essay0 essay1 essay2 essay3
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 22 a little… stri… socia… never working … "abou… "curr… "maki… "the …
## 2 35 average most… often some… working … "i am… dedic… "bein… <NA>
## 3 38 thin anyt… socia… <NA> graduate… "i'm … "i ma… "impr… "my l…
## 4 23 thin vege… socia… <NA> working … i wor… readi… "play… socia…
## 5 29 athletic <NA> socia… never graduate… "hey … work … "crea… i smi…
## 6 29 average most… socia… <NA> graduate… "i'm … "buil… "imag… "i ha…
## # … with 21 more variables: essay4 <chr>, essay5 <chr>, essay6 <chr>,
## # essay7 <chr>, essay8 <chr>, essay9 <chr>, ethnicity <chr>, height <dbl>,
## # income <dbl>, job <chr>, last_online <dttm>, location <chr>,
## # offspring <chr>, orientation <chr>, pets <chr>, religion <chr>, sex <chr>,
## # sign <chr>, smokes <chr>, speaks <chr>, status <chr>
colnames(okc) <- tolower(colnames(okc))
set.seed(448)
si <- sample(1:nrow(okc),10000)
okc$essay2[si][1:15]
## [1] "listening, cooking, and locking myself out of my\napartment/jeep.<br />\n<br />\nalso i can solve a rubix cube :o)"
## [2] "drawing.<br />\nwriting.<br />\nbullshit persuasion. (never for selfish needs, more for\nenabling)<br />\nrandom acts of kindness.<br />\nadventures!<br />\ncatching people when they fall.<br />\ncoming up with impractical goals (i.e. learn to ride a unicycle)"
## [3] NA
## [4] "taking care of people :)!"
## [5] "the cell phone and the email, cat herding and hand holding,\nenvisioning and inspiring."
## [6] "getting things done... cooking... laughing... sleeping... not\nsleeping... procrastinating... meeting great friends.. showing up a\nlittle late but being the last one to leave... making other folks\ncomfortable..."
## [7] "wordplay. puns that are good maybe 30% of the time and the rest are\nso bad that i usually only make myself and a few close friends\nlaugh.<br />\nmaking up songs on the spot.<br />\nseeing things from other people's point of view."
## [8] "hmm?"
## [9] "painting. observational comedy. finding cool shit at goodwill.\nclimbing stairs. whipping up a mean salad. walkie talkie\nimpressions and making kids laugh."
## [10] "making the best of what life brings to you"
## [11] "travelling and learning languages. i have visited many many\ncountries, speak 5.5 languages fluently, and i can pack backpacks,\ncross borders, and chat random people up in public places like i\nwas born to do it :) and i am a singer.<br />\nah, and cooking."
## [12] "being a very good friend. laughing. having compassion.\nunderstanding how people \"work\" and being a great therapist. i take\ngood care of the planet. i am a great kisser. i can hold a plank\nfor over 3 minutes. if we get together you will find that i am\nreally good at making you smile and wanting to see me again."
## [13] "neutralizing awkward situations<br />\ngiving massages (touch)<br />\nfalling on my ass on my bike<br />\ngoing to foreign places<br />\nsinging loudly in public<br />\ndancing...in public and private<br />\ndoing accents...<br />\nwalking fast.<br />\ntaking pictures.<br />\nteaching (i would hope so).<br />\ndiscovering new music<br />\ndiscovering old music<br />\ncooking<br />\nbeing positive about life<br />\nstarting dance parties<br />\nword play<br />\nmaking people laugh"
## [14] "having random deep conversations, telling funny stories, getting\nmyself into and out of interesting situations"
## [15] "laughing... i like to talk to people and be social."
Then I built a corpus from the question 2 responses and cleaned it by converting all words to lowercase and removing stopwords, punctuation, unnecessary whitespace, numbers, and extremely rare words.
library(NLP)
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(tm)
e2 <- data.frame(doc_id=si,text=okc$essay2[si],stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(e2))
# lowercase safely: return NA instead of stopping if tolower() errors on a malformed string
tryTolower <- function(x){
y = NA
try_error = tryCatch(tolower(x), error = function(e) e)
if (!inherits(try_error, 'error'))
y = tolower(x)
return(y)
}
clean.corpus<-function(corpus){
corpus <- tm_map(corpus, content_transformer(tryTolower))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
return(corpus)
}
newcorpus <- clean.corpus(corpus)
tdm<-TermDocumentMatrix(newcorpus, control=list(weighting=weightTf))
tdm.essay2 <- as.matrix(tdm)
rownames(tdm.essay2)[21:120]
## [1] "absentmindedness" "absolute" "absolutely"
## [4] "absorbed" "absorbing" "abstract"
## [7] "abstracting" "abstraction" "abstractions"
## [10] "abstrusely" "absurd" "absurdity"
## [13] "absurdly" "absurdnessbr" "abundance"
## [16] "abuse" "abused" "abusive"
## [19] "abysmal" "academia" "academiabr"
## [22] "academic" "academically" "academics"
## [25] "accent" "accentbr" "accents"
## [28] "accentsbr" "accentsdialects" "accep"
## [31] "acceptabilitybr" "acceptable" "acceptance"
## [34] "accepted" "accepting" "access"
## [37] "accessible" "accessing" "accessorizing"
## [40] "accident" "accidental" "accidentally"
## [43] "accidentbr" "accidents" "acclimating"
## [46] "accommodated" "accommodating" "accommodatingbr"
## [49] "accommodations" "accompaniment" "accompany"
## [52] "accomplice" "accomplish" "accomplished"
## [55] "accomplishing" "accomplishment" "accordian"
## [58] "according" "accordingly" "accordion"
## [61] "accountability" "accountable" "accountant"
## [64] "accountbr" "accounting" "accountsbr"
## [67] "accumulate" "accumulating" "accurate"
## [70] "accurately" "accused" "ace"
## [73] "aced" "acerbic" "aces"
## [76] "acesbr" "ache" "acheive"
## [79] "achieve" "achieved" "achievement"
## [82] "achieving" "acknowledge" "acknowledging"
## [85] "acoustic" "acoustics" "acquired"
## [88] "acquiring" "acrobatic" "acrobatics"
## [91] "acrobats" "acronyms" "across"
## [94] "acroyoga" "acrylic" "acrylicbr"
## [97] "act" "acting" "actingbr"
## [100] "actinglooking"
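One way to implement the rare-word removal mentioned earlier is tm's removeSparseTerms (a sketch; the 0.999 sparsity threshold is an assumed value, not necessarily what was used):
tdm_trimmed <- removeSparseTerms(tdm, sparse = 0.999)  # drop terms absent from more than 99.9% of essays
dim(tdm_trimmed)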
sfq <- data.frame(words = names(sort(rowSums(tdm.essay2), decreasing = TRUE)),
                  freqs = sort(rowSums(tdm.essay2), decreasing = TRUE),
                  row.names = NULL)
sfq[1:100,]
## words freqs
## 1 good 3267
## 2 making 2510
## 3 people 2381
## 4 things 1812
## 5 really 1255
## 6 cooking 989
## 7 listening 982
## 8 like 954
## 9 can 914
## 10 also 894
## 11 classilink 772
## 12 love 719
## 13 laugh 715
## 14 friends 602
## 15 pretty 602
## 16 time 598
## 17 getting 590
## 18 new 584
## 19 finding 539
## 20 just 534
## 21 playing 484
## 22 laughing 471
## 23 know 460
## 24 writing 460
## 25 one 434
## 26 life 433
## 27 dancing 431
## 28 think 427
## 29 fun 414
## 30 make 411
## 31 anything 408
## 32 get 399
## 33 taking 395
## 34 great 394
## 35 others 383
## 36 href 371
## 37 feel 365
## 38 stuff 356
## 39 giving 349
## 40 talking 338
## 41 reading 335
## 42 well 327
## 43 friend 315
## 44 music 314
## 45 say 306
## 46 around 303
## 47 much 301
## 48 learning 291
## 49 way 286
## 50 cook 285
## 51 eating 283
## 52 working 281
## 53 lot 271
## 54 work 267
## 55 always 265
## 56 games 257
## 57 keeping 256
## 58 singing 255
## 59 everything 254
## 60 something 252
## 61 trying 239
## 62 will 239
## 63 fixing 238
## 64 seeing 233
## 65 find 225
## 66 enjoy 224
## 67 random 221
## 68 best 220
## 69 even 217
## 70 food 217
## 71 going 216
## 72 amp 210
## 73 want 207
## 74 mind 206
## 75 many 205
## 76 thinking 205
## 77 art 204
## 78 sports 204
## 79 remembering 203
## 80 bad 202
## 81 told 202
## 82 try 201
## 83 baking 195
## 84 play 194
## 85 person 193
## 86 little 188
## 87 sometimes 185
## 88 solving 182
## 89 better 180
## 90 enjoying 180
## 91 back 177
## 92 day 177
## 93 conversation 176
## 94 helping 175
## 95 care 173
## 96 telling 172
## 97 jokes 171
## 98 put 170
## 99 staying 170
## 100 never 168
Next, I made a bar plot and a polar plot of the most frequent words for question 2, "What are you good at?"
ggplot(sfq[1:18,], mapping = aes(x = reorder(words, freqs), y = freqs, fill = reorder(words, freqs))) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_hue() +
  labs(x = "Words", title = "18 Most Frequent Words", y = "Frequency") +
  theme(panel.background = element_blank(), axis.ticks.x = element_blank(), axis.ticks.y = element_blank())
ggplot(sfq[1:18,], mapping = aes(x = reorder(words, freqs), y = freqs, fill = reorder(words, freqs))) +
  geom_bar(stat = "identity") +
  coord_polar() +
  scale_fill_hue() +
  labs(x = "Words", title = "18 Most Frequent Words", y = "Frequency") +
  theme(panel.background = element_blank(), axis.ticks.x = element_blank(), axis.ticks.y = element_blank())
Finally, I made a word cloud of these words; as the plot shows, larger words indicate higher frequency. The word "laughing" probably means people try to show that they have a sense of humor, while words such as "making", "cooking", and "good" suggest that people want to demonstrate their abilities to others.
library(wordcloud2)
wordcloud2(sfq,size=0.4,shape='pentagon',color="random-dark",backgroundColor="pink")
For the first relationship, we found that people with the athletic, average, and fit body types are more likely to be available. The age-group analysis shows that younger groups are more likely to be in a relationship; moreover, around age 35 there is a noticeable jump in the single percentage, which may reflect a higher chance of divorce.
For the hierarchical cluster analysis, similar body shapes show similar height and income profiles, and people who choose not to answer their body type are most similar to the full-figured group. The age-group clusters suggest that most people retire around 65.
For the text analysis, we found that people usually try to sell themselves to potential partners with these words and with a sense of humor; users could also adopt some of these words as labels when answering "What are you good at?"