In this project I analyze gender differences in speech, using a dataset of book reviews written by men and women. I compute several measures step by step, commenting on each result right after the metric is calculated, and briefly summarize my findings at the end.
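library(tidyverse) # read_delim(), str_*(), dplyr verbs, spread(), ggplot(); tidytext, wordcloud2 and formattable are also used below
text <- read_delim("~/review.csv", ";", escape_double = FALSE, trim_ws = TRUE)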
First, some preparation: we remove all punctuation and convert upper case to lower case. After that we lemmatize the text and delete Russian stop words, along with the most frequently used words, as they tell us nothing meaningful.
text$Review = str_to_lower(text$Review)
text$Review = gsub("[[:punct:]]", "",text$Review)
text$lemma <-system2("mystem", c("-c","-l","-d"), input = text$Review,stdout = T)
text$lemma <- str_replace_all(text$lemma, "\\{([^}]+?)([?]+)?\\}", "\\1")
rustopwords <- data.frame(words= c(stopwords("ru"),"книга","это","который","весь","свой","мочь","читать","герой","автор","история","просто","роман","очень","самый"), stringsAsFactors=FALSE)
text.tok = text %>% unnest_tokens(words, lemma) %>% na.omit() %>% anti_join(rustopwords, by = "words")
text.tok$Sex = as.character(text.tok$Sex)
text.tok$Sex = ifelse(str_detect(text.tok$Sex, "N/A"), NA, text.tok$Sex)
text.tok = text.tok %>% na.omit()
text.tok$Sex = as.factor(text.tok$Sex)
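To make the cleaning steps concrete, here is a minimal base-R sketch on a made-up example. The sample strings, the `toy_stop` list and the assumed shape of the mystem output (each lemma in curly braces, with question marks after guessed lemmas) are illustrative assumptions, not data from the review file.

```r
# Toy reviews (hypothetical ASCII placeholders instead of Russian text)
raw <- c("Great BOOK, really!", "The plot... is SLOW.")

# 1. Lower-case and strip punctuation, as in the pipeline above
clean <- gsub("[[:punct:]]", "", tolower(raw))

# 2. Assumed mystem-style output: lemmas in braces, guessed lemmas end in "?"
lemmatized <- c("{great} {book} {really}", "{the} {plot} {be} {slow??}")

# Same pattern as above: keep the lemma, drop braces and trailing question marks
lemmas <- gsub("\\{([^}]+?)([?]+)?\\}", "\\1", lemmatized, perl = TRUE)

# 3. Split into tokens and drop stop words (toy list)
toy_stop <- c("the", "be", "really")
tokens <- unlist(strsplit(lemmas, "\\s+"))
tokens <- tokens[!tokens %in% toy_stop]
print(tokens)
```

In the real pipeline the tokenizing and stop-word removal are done by `unnest_tokens()` and `anti_join()`; the base-R calls here only mimic them so the sketch runs without extra packages.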
To start with, let's look at simple frequency lists by gender:
freq_f = text.tok %>% filter(Sex == "Ж") %>% group_by(words) %>% summarise(count = n()) %>% arrange(-count) %>% top_n(15, count)
ggplot(freq_f, aes(x = reorder(words,count), y = count))+
geom_bar(stat = "identity", alpha = 0.7, color = "black", fill = "lightblue")+
coord_flip()+
theme_minimal()+
labs(title = "Top 15 words used by Females\n", x = "Word", y = "Word frequency")+
theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5), title = element_text(size = 15))
freq_m = text.tok %>% filter(Sex == "М") %>% group_by(words) %>% summarise(count = n()) %>% arrange(-count) %>% top_n(15, count)
ggplot(freq_m, aes(x = reorder(words,count), y = count))+
geom_bar(stat = "identity", alpha = 0.7, color = "black", fill = "lightblue")+
coord_flip()+
theme_minimal()+
labs(title = "Top 15 words used by Males\n", x = "Word", y = "Word frequency")+
theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5), title = element_text(size = 15))
The bar charts show the most popular words used by men and by women. As can be seen, “время” takes first place in both groups.
However, “любовь” is among the 10 most frequently used words in the female subgroup, whereas it does not even make the top 15 (actually 16, presumably because top_n keeps ties) in the male subgroup!
Women are also prone to use words such as “друг” and, especially, “ребенок”, which probably tells us that women are more likely to focus on relationships, family and motherhood, whereas men use much vaguer words.
My personal guess is that women usually focus on the book's content (e.g. the plot) when writing a review, whereas men are more likely to rate the writing style, the author's professionalism, etc.
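The "top 15 (actually 16)" remark above is consistent with how `top_n()` treats ties: it keeps every row whose count equals the cut-off value. The sketch below shows this on hypothetical counts; `slice_max()` with `with_ties = FALSE` is one way to get exactly n rows, offered here as an alternative rather than as what the original analysis did.

```r
library(dplyr)

# Hypothetical word counts with a tie at the cut-off
toy <- tibble(words = c("a", "b", "c", "d"), count = c(5, 3, 3, 1))

# Asking for the top 2 keeps both tied words, so more than 2 rows come back
top2 <- toy %>% top_n(2, count)
print(nrow(top2))

# slice_max() with with_ties = FALSE truncates to exactly n rows
top2_strict <- toy %>% slice_max(count, n = 2, with_ties = FALSE)
print(nrow(top2_strict))
```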
For males:
wordcloud_m = text.tok %>% filter(Sex == "М") %>%
count(words, sort=TRUE) %>%
top_n(80, n)
wordcloud2(data = wordcloud_m, rotateRatio = 0)
For females:
wordcloud_f = text.tok %>% filter(Sex == "Ж") %>%
count(words, sort=TRUE) %>%
top_n(80, n)
wordcloud2(data = wordcloud_f, rotateRatio = 0)
words_said = text.tok %>% group_by(Sex) %>% summarize(words_said = n()) %>% mutate(pecentage = round((words_said/sum(words_said)*100),1))
mean_words = text.tok %>% group_by(Nickname, Sex) %>% summarise(count = n()) %>% group_by(Sex) %>% summarise(mean = round(mean(count),2), median = median(count), sd = round(sd(count),2), min = min(count), max = max(count)) %>% select(-Sex)
stats = cbind(words_said, mean_words)
formattable(stats,
align =c("c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
Sex | words_said | pecentage | mean | median | sd | min | max |
---|---|---|---|---|---|---|---|
Ж | 217554 | 84.5 | 212.25 | 134 | 246.39 | 3 | 2446 |
М | 39790 | 15.5 | 219.83 | 154 | 278.45 | 2 | 2423 |
We can see that the overwhelming majority of the words (~85%) in our sample are written by women. However, it is interesting to notice that men on average write about 8 words more (mean ~220 words) than women (mean ~212 words). Note that these counts are taken per Nickname and exclude the removed stop words.
Nevertheless, the standard deviation of the word count is higher for men (278.45) than for women (246.39), which indicates that review length varies more among men.
In simple words, in our sample men on average write a little more than women, but the length of a randomly chosen review by a man is a little less predictable.
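In the table above the mean is well above the median in both groups (212.25 vs 134 and 219.83 vs 154), with maxima around 2,400 words. A small sketch with invented counts shows how a single very long contribution can produce this pattern; the numbers are hypothetical and only illustrate why the median is worth reading alongside the mean.

```r
# Hypothetical word counts per reviewer for one group
counts <- c(40, 90, 130, 150, 160, 210, 2400)

# Mean and sd are pulled up by the single very long reviewer
full_stats <- c(mean = mean(counts), median = median(counts), sd = sd(counts))
print(round(full_stats, 2))

# Same statistics without the extreme value
trimmed <- counts[counts < 1000]
trim_stats <- c(mean = mean(trimmed), median = median(trimmed), sd = sd(trimmed))
print(round(trim_stats, 2))
```

With only a 7 to 8 word gap between the group means and standard deviations above 240, the difference in average length is small relative to the spread.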
males <- text.tok %>% filter(Sex == "М")
females <- text.tok %>% filter(Sex == "Ж")
different_table <- bind_rows(males, females) %>%
dplyr::count(words, Sex) %>%
spread(Sex, n, fill = 0) %>%
filter(М>5 | Ж>5)
g2 = function(a, b) {
c = sum(a)
d = sum(b)
E1 = c * ((a + b) / (c + d))
E2 = d * ((a + b) / (c + d))
return(2*((a*log(a/E1+1e-7)) + (b*log(b/E2+1e-7))))
}
logratio <- function(a, b) {
return(log2((a/sum(a)/(b/sum(b)))))
}
g2_table <- different_table %>%
mutate(g2=g2(Ж,М)) %>%
arrange(desc(g2)) %>%
mutate(g2 = round(g2, 2)) %>%
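# NB: 10e+6 equals 1e7, so male_ipm and female_ipm are rates per 10 million tokens rather than per million
group_by(words) %>% mutate(male_ipm = (М * 10e+6)/nrow(males), female_ipm = (Ж * 10e+6)/nrow(females))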
top_g2 = g2_table %>% head(10)
formattable(top_g2,
align =c("l","c","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
words | Ж | М | g2 | male_ipm | female_ipm |
---|---|---|---|---|---|
пинчон | 0 | 18 | 69.26 | 4523.750 | 0.0000 |
аверченко | 0 | 14 | 53.87 | 3518.472 | 0.0000 |
шалалалулааа | 0 | 13 | 50.02 | 3267.153 | 0.0000 |
ходасевич | 0 | 12 | 46.17 | 3015.833 | 0.0000 |
любовь | 730 | 54 | 45.15 | 13571.249 | 33554.8875 |
лу | 4 | 14 | 36.06 | 3518.472 | 183.8624 |
бизнес | 7 | 16 | 35.50 | 4021.111 | 321.7592 |
футбол | 2 | 12 | 35.32 | 3015.833 | 91.9312 |
крусанов | 0 | 9 | 34.63 | 2261.875 | 0.0000 |
моралите | 0 | 9 | 34.63 | 2261.875 | 0.0000 |
From the table we can see that in our sample only men write about (and therefore read?) Thomas Pynchon, Arkady Averchenko and Vladislav Khodasevich. It can also be noticed that men are more likely to read Franz Kafka and Isaac Asimov, mentioning them more frequently than women.
Furthermore, as might be intuitively expected, men are much more likely to talk about football and business. It is astonishing how straightforward the results are! Honestly, I did not expect to find such vivid examples of gender differentiation.
Still, the most outstanding gender-based difference can be seen in the usage of the word “любовь”: 730 occurrences among women versus 54 among men, a relative frequency roughly 2.5 times higher. Women far surpass men in elaborating on topics concerning love and, probably, romantic relationships. Well, this just confirms one of the most popular stereotypes about women.
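The g2() function defined above computes a log-likelihood statistic of the form 2 * (a * ln(a / E1) + b * ln(b / E2)), where E1 and E2 are the counts expected if both groups used the word at the same rate. Here is a minimal sketch of the same calculation on hypothetical counts; the variable names `n_a` and `n_b` and the toy words are my own, and the real totals from different_table are not reproduced here.

```r
# Same calculation as g2() above; a and b are per-word counts in two groups
g2 <- function(a, b) {
  n_a <- sum(a)                      # total tokens in group A
  n_b <- sum(b)                      # total tokens in group B
  E1 <- n_a * (a + b) / (n_a + n_b)  # expected count in A under equal usage
  E2 <- n_b * (a + b) / (n_a + n_b)  # expected count in B under equal usage
  # the 1e-7 term keeps 0 * log(0) from turning into NaN
  2 * (a * log(a / E1 + 1e-7) + b * log(b / E2 + 1e-7))
}

# Hypothetical counts: group A is much larger than group B
toy <- data.frame(
  word = c("w_even", "w_only_b", "w_mostly_a"),
  A    = c(850, 0, 150),
  B    = c(150, 20, 5)
)
toy$g2 <- round(g2(toy$A, toy$B), 2)
print(toy[order(-toy$g2), ])
```

A word used only in the smaller group gets the highest score even with a modest count, which is the pattern seen for the author surnames in the table above. The statistic measures how surprising the split is, not its direction, so the raw counts still have to be read to see which group leads.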
g2_table <- different_table %>%
mutate(logratio = logratio(Ж,М)) %>%
mutate(logratio = round(logratio,2)) %>%
mutate(g2=g2(Ж,М)) %>%
arrange(desc(g2)) %>%
mutate(g2 = round(g2, 2)) %>%
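# NB: 10e+6 equals 1e7, so male_ipm and female_ipm are rates per 10 million tokens rather than per million
group_by(words) %>% mutate(male_ipm = (М * 10e+6)/nrow(males), female_ipm = (Ж * 10e+6)/nrow(females))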
g2_table_1 = g2_table %>%
filter(М > 0 & Ж > 0) %>%
group_by(logratio < 0) %>%
top_n(10, abs(logratio)) %>%
ungroup()
male_words = g2_table_1 %>% filter(`logratio < 0` == TRUE) %>% select(-`logratio < 0`) %>% arrange(logratio)
female_words = g2_table_1 %>% filter(`logratio < 0` == FALSE) %>% select(-`logratio < 0`) %>% arrange(-logratio)
formattable(male_words,
align =c("l","c","c","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
words | Ж | М | logratio | g2 | male_ipm | female_ipm |
---|---|---|---|---|---|---|
ницше | 1 | 7 | -5.35 | 21.22 | 1759.236 | 45.9656 |
футбол | 2 | 12 | -5.13 | 35.32 | 3015.833 | 91.9312 |
коэльо | 1 | 6 | -5.13 | 17.66 | 1507.917 | 45.9656 |
формально | 1 | 6 | -5.13 | 17.66 | 1507.917 | 45.9656 |
методика | 2 | 11 | -5.01 | 31.79 | 2764.514 | 91.9312 |
биографический | 2 | 10 | -4.87 | 28.29 | 2513.194 | 91.9312 |
сорокин | 2 | 10 | -4.87 | 28.29 | 2513.194 | 91.9312 |
v | 2 | 9 | -4.72 | 24.83 | 2261.875 | 91.9312 |
владимир | 2 | 9 | -4.72 | 24.83 | 2261.875 | 91.9312 |
спецслужба | 2 | 9 | -4.72 | 24.83 | 2261.875 | 91.9312 |
субъект | 2 | 9 | -4.72 | 24.83 | 2261.875 | 91.9312 |
Taking the log ratio into account gives slightly different results for gender-based word usage. I have already talked about football, business, Kafka (“Кафка”), “Тошнота” (the title of Sartre's novel "Nausea") and Asimov.
Still, it is interesting to note how strongly “методика” is skewed toward the male subgroup (11 uses versus 2 by women). This somewhat supports my previously stated hypothesis that men are more pragmatically minded and talk about the author's style of writing, etc.
Finally, Leo Tolstoy is a little more popular among men than among women.
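The logratio() function computes log2((Ж / sum(Ж)) / (М / sum(М))), which can be rewritten as log2(Ж / М) plus a corpus-size term that is identical for every word. That term therefore cancels when two words are compared, which gives a quick consistency check on the table. The sketch below uses the counts and rounded log ratios of the "ницше" and "коэльо" rows; the variable names are my own.

```r
# Counts from the table above (f = female count, m = male count)
nietzsche <- c(f = 1, m = 7)   # table logratio: -5.35
coelho    <- c(f = 1, m = 6)   # table logratio: -5.13

# Word-specific part of the log ratio: log2(f / m)
raw_nietzsche <- log2(nietzsche[["f"]] / nietzsche[["m"]])
raw_coelho    <- log2(coelho[["f"]] / coelho[["m"]])

# The corpus-size term is shared by all words, so it drops out of differences
diff_raw   <- raw_nietzsche - raw_coelho
diff_table <- -5.35 - (-5.13)

print(round(c(from_counts = diff_raw, from_table = diff_table), 3))
```

The two differences agree up to the rounding of the table values, so the ranking within each list is driven purely by the ratio of raw counts.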
formattable(female_words,
align =c("l","c","c","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
words | Ж | М | logratio | g2 | male_ipm | female_ipm |
---|---|---|---|---|---|---|
кот | 58 | 1 | 3.31 | 12.02 | 251.3194 | 2666.005 |
зацеплять | 53 | 1 | 3.18 | 10.62 | 251.3194 | 2436.177 |
захотеться | 51 | 1 | 3.12 | 10.07 | 251.3194 | 2344.246 |
горе | 48 | 1 | 3.04 | 9.24 | 251.3194 | 2206.349 |
сквозь | 48 | 1 | 3.04 | 9.24 | 251.3194 | 2206.349 |
жалко | 46 | 1 | 2.98 | 8.69 | 251.3194 | 2114.418 |
умение | 45 | 1 | 2.94 | 8.42 | 251.3194 | 2068.452 |
идеальный | 87 | 2 | 2.90 | 16.03 | 502.6389 | 3999.007 |
смотря | 42 | 1 | 2.84 | 7.61 | 251.3194 | 1930.555 |
поворот | 82 | 2 | 2.81 | 14.69 | 502.6389 | 3769.179 |
задумка | 41 | 1 | 2.81 | 7.34 | 251.3194 | 1884.590 |
уезжать | 41 | 1 | 2.81 | 7.34 | 251.3194 | 1884.590 |
As for women, it is funny to find that they are more likely to write about cats (or about stories that mention cats). Intuitively, the sentiment of these female words is a little more negative than that of the male list.
Women tend to write about “горе”, “жалко”, “уезжать”. It can be inferred that women are on average more likely to read emotionally intense books and, therefore, to leave more emotionally intense reviews. Or women are just more emotional in general, but identifying this is beyond the scope of text analysis :)
g2_table_1 %>%
filter(М > 0 & Ж > 0) %>%
group_by(logratio < 0) %>%
top_n(10, abs(logratio)) %>%
ungroup() %>%
mutate(words = reorder(words, logratio)) %>%
ggplot(aes(words, logratio, fill = logratio > 0)) +
geom_col(show.legend = F, alpha = 0.5, color = "black") +
coord_flip() +
theme_minimal()+
scale_fill_discrete(name = "", labels = c("Males", "Females"))+
labs(title = "Distribution of words by gender\n", x = "Words", y = "log2 ratio (Females/Males)")+
theme(text = element_text(size = 13), plot.title = element_text(hjust = 0.5), title = element_text(size = 13))
Visually, it can be noticed that the male words have greater absolute log ratio values, which means that men are more likely to use words that women barely use. In simple words, based on our sample men use more unique words or mention more unique topics/authors in their reviews.
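One caveat when reading these magnitudes: because the female part of the sample is much larger, the shared corpus-size term in the log ratio is negative, which shifts every word toward the male side. The sketch below estimates that term from one table row (counts 1 and 7, log ratio -5.35) and asks how often a word would have to be used by women, against a single male use, to match the weakest possible male-leaning word under the М>5 | Ж>5 filter. It is a back-of-the-envelope check based on rounded table values, not a re-run of the analysis.

```r
# Shared corpus-size term implied by the table: logratio = log2(F / M) + offset
offset <- -5.35 - log2(1 / 7)        # roughly -2.54

# Male-leaning word at the filter boundary: 1 female use, 6 male uses
male_edge <- log2(1 / 6) + offset

# Female uses needed (with 1 male use) to reach the same absolute log ratio
female_needed <- 2^(abs(male_edge) - offset)

print(round(c(offset = offset, male_edge = male_edge,
              female_needed = female_needed), 2))
```

Under these assumptions a female-leaning word would need about 200 occurrences to reach the magnitude that a male-leaning word gets with 6, while the most skewed female word in the table ("кот") has 58. So at least part of the asymmetry in the plot may come from the unequal corpus sizes rather than from vocabulary alone.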
To sum up, I have uncovered some gender-based differences in the speech of males and females based on their book reviews.
First, men seem to be more methodologically oriented, whereas women are completely absorbed in the book's content. That is, men are more likely to write about the author’s style, writing techniques and so on, while women generally concentrate on the plot of the book.
Second, female users are more emotionally engaged. In their reviews they often elaborate on love, romantic relationships, family and friends, and parenthood. As for specifically male topics, men are much more likely to write about football and business.
Third, counterintuitively, women on average tend to write shorter reviews than men.
Fourth, men mention more unique words and authors than women. Some authors are read exclusively by men in our sample, whereas there is no author read exclusively by women.
Fifth, females are drama queens and like writing about cats! :)