I found the data for this project at https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m. I downloaded the dataset, saved it as a .csv file, and uploaded it to my GitHub repository (Applied-Statistics). In this portion of the project I hope to use a chi-squared goodness-of-fit test to check whether authors are spread evenly across birthplaces. I also plan to test for independence between author gender and the genre they typically write in.
Here is a small portion of the data and the summary statistics:
head(data)
## author_average_rating author_gender author_genres
## 1 4.01 female historical-fiction,
## 2 4.15 male literature-fiction,mystery-thrillers,
## 3 4.00 female romance,
## 4 3.88 male fiction,memoir,
## 5 4.10 female young-adult,fantasy,
## 6 3.77 male horror,
## author_id author_name author_page_url
## 1 74489 Victoria Thompson\n /author/show/74489.Victoria_Thompson
## 2 706255 Stieg Larsson\n /author/show/706255.Stieg_Larsson
## 3 5618190 Mimi Jean Pamfiloff\n /author/show/5618190.Mimi_Jean_Pamfiloff
## 4 37871 José Donoso\n /author/show/37871.Jos_Donoso
## 5 36122 Patricia C. Wrede\n /author/show/36122.Patricia_C_Wrede
## 6 58947 Steve Niles\n /author/show/58947.Steve_Niles
## author_rating_count author_review_count birthplace
## 1 74399 6268 United States\n
## 2 3726435 142704 Sweden\n
## 3 76496 7975 United States\n
## 4 5522 489 Chile\n
## 5 291013 13453 United States\n
## 6 47938 3240 United States\n
## book_average_rating
## 1 4.02
## 2 4.13
## 3 3.99
## 4 4.14
## 5 4.01
## 6 3.80
## book_fullurl
## 1 https://www.goodreads.com/book/show/686717.Murder_on_St_Mark_s_Place
## 2 https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo
## 3 https://www.goodreads.com/book/show/27833684-tailored-for-trouble
## 4 https://www.goodreads.com/book/show/382975.The_Obscene_Bird_of_Night
## 5 https://www.goodreads.com/book/show/64207.Sorcery_Cecelia
## 6 https://www.goodreads.com/book/show/831829.30_Days_of_Night_Vol_1
## book_id book_title
## 1 686717 \n Murder on St. Mark's Place\n
## 2 2429135 \n The Girl with the Dragon Tattoo\n
## 3 27833684 \n Tailored for Trouble\n
## 4 382975 \n The Obscene Bird of Night\n
## 5 64207 \n Sorcery & Cecelia: or The Enchanted Chocolate Pot\n
## 6 831829 \n 30 Days of Night, Vol. 1\n
## genre_1 genre_2 num_ratings num_reviews pages
## 1 Mystery Historical 5260 375 277
## 2 Fiction Mystery 2229163 65227 465
## 3 Romance Contemporary 2151 391 354
## 4 Fiction Magical Realism 1844 173 438
## 5 Fantasy Young Adult 17051 1890 326
## 6 Sequential Art Sequential Art 17122 561 104
## publish_date score
## 1 2000 3230
## 2 Aug-05 3062
## 3 2016 4585
## 4 1970 1533
## 5 April 15th 1988 2105
## 6 January 10th 2004 4372
summary(data)
## author_average_rating author_gender author_genres author_id
## Min. :1.82 Length:22891 Length:22891 Min. : 4
## 1st Qu.:3.81 Class :character Class :character 1st Qu.: 40836
## Median :3.97 Mode :character Mode :character Median : 1415543
## Mean :3.96 Mean : 3233957
## 3rd Qu.:4.12 3rd Qu.: 5775601
## Max. :5.00 Max. :18770448
## author_name author_page_url author_rating_count author_review_count
## Length:22891 Length:22891 Min. : 6 Min. : 0
## Class :character Class :character 1st Qu.: 4324 1st Qu.: 545
## Mode :character Mode :character Median : 24635 Median : 2273
## Mean : 172032 Mean : 9370
## 3rd Qu.: 111337 3rd Qu.: 8262
## Max. :21117318 Max. :516745
## birthplace book_average_rating book_fullurl book_id
## Length:22891 Min. :0.000 Length:22891 Length:22891
## Class :character 1st Qu.:3.770 Class :character Class :character
## Mode :character Median :3.960 Mode :character Mode :character
## Mean :3.951
## 3rd Qu.:4.140
## Max. :5.000
## book_title genre_1 genre_2 num_ratings
## Length:22891 Length:22891 Length:22891 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 820
## Mode :character Mode :character Mode :character Median : 4403
## Mean : 46683
## 3rd Qu.: 20143
## Max. :3820921
## num_reviews pages publish_date score
## Min. : 0 Length:22891 Length:22891 Min. : 55
## 1st Qu.: 106 Class :character Class :character 1st Qu.: 832
## Median : 384 Mode :character Mode :character Median : 1727
## Mean : 2325 Mean : 3893
## 3rd Qu.: 1504 3rd Qu.: 3598
## Max. :147696 Max. :598270
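The `head()` output above shows stray newline characters (`\n`) in several text columns, such as `author_name`, `birthplace`, and `book_title`, and `summary()` reports `pages` as a character column. Here is a minimal sketch of one way to clean this up. It uses a small made-up data frame (`toy` is a hypothetical stand-in, not the real data) and assumes the stray characters are ordinary leading or trailing whitespace:

```r
# Toy stand-in for the real data; the values mimic the head() output above.
toy <- data.frame(
  author_name = c("Stieg Larsson\n", "Steve Niles\n"),
  birthplace  = c("Sweden\n", "United States\n"),
  pages       = c("465", "104"),
  stringsAsFactors = FALSE
)

# Trim leading/trailing whitespace (including newlines) from every character column.
is_text <- vapply(toy, is.character, logical(1))
toy[is_text] <- lapply(toy[is_text], trimws)

# pages was read in as text; convert it to a number (non-numeric entries become NA).
toy$pages <- suppressWarnings(as.numeric(toy$pages))

table(toy$birthplace)
```

Trimming before tabulating matters for the tests later on, because otherwise "Sweden" and "Sweden\n" would be counted as two different birthplaces.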
This is a continuation of my project from previous parts. I would include a link to those parts, but I am currently unable to open them because I have hit my hour limit in RStudio Cloud. It’s fine, we can work with it.
I like making silly little puns to go along with each portion of the project.
\[ H_0: \text{The number of authors from each birthplace is equal.}\\ H_A: \text{The number of authors from each birthplace is not equal.} \]
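# Missing birthplaces become their own category ("0") so they are counted in the table
data$birthplace[is.na(data$birthplace)] <- 0
# Goodness-of-fit test against equal probabilities for every birthplace
test <- chisq.test(table(data$birthplace))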
test
##
## Chi-squared test for given probabilities
##
## data: table(data$birthplace)
## X-squared = 1073563, df = 434, p-value < 2.2e-16
Based on this test (p-value < 2.2e-16), we can reject the null hypothesis that the same number of authors come from each birthplace. Below is code that would show the expected count under the null hypothesis, which is about 53 authors (52.62299) per birthplace. I am commenting it out because it’s a lot of output.
#test$expected
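Under the null hypothesis every birthplace has the same expected count, which is just the number of rows divided by the number of categories. With 22891 rows and 435 birthplace categories (the 434 degrees of freedom plus 1), that works out to about 52.62. The sketch below redoes that arithmetic and then uses made-up counts (`toy_counts` is hypothetical) to show that `chisq.test` reports the same expected value for every category when no probabilities are supplied:

```r
# Expected count per birthplace in the real data: rows / categories.
n_rows       <- 22891
n_categories <- 434 + 1      # degrees of freedom + 1
n_rows / n_categories        # about 52.62

# Toy check: with no `p` argument, chisq.test assumes equal probabilities,
# so every expected count is sum(counts) / length(counts).
toy_counts <- c(US = 60, UK = 25, Sweden = 5)
toy_test   <- chisq.test(toy_counts)
toy_test$expected
```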
Below is a bar chart; the tallest bars mark the birthplaces that the most authors are from.
barplot(table(data$birthplace))
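With 435 birthplace categories, a bar chart of the full table has far too many bars to label. One possible alternative, sketched here with made-up values (`birthplace` below is a hypothetical stand-in for `data$birthplace`, and keeping the top 3 is an arbitrary choice), is to sort the counts and plot only the most common birthplaces:

```r
# Toy birthplace vector standing in for data$birthplace (made-up values).
birthplace <- c(rep("United States", 6), rep("United Kingdom", 3),
                rep("Canada", 2), "Sweden", "Chile")

# Count authors per birthplace, sort from most to fewest, keep the top 3.
counts <- sort(table(birthplace), decreasing = TRUE)
top    <- head(counts, 3)

# Horizontal bars with las = 1 keep the labels readable;
# rev() puts the most common birthplace at the top of the plot.
barplot(rev(top), horiz = TRUE, las = 1,
        xlab = "Number of authors", main = "Most common birthplaces")
```

On the real data a larger cutoff, such as the top 10 or 15, would probably be more informative, and the left margin may need widening with `par(mar = ...)` for long country names.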
\[ H_0: \text{Author gender is independent of genre.}\\ H_A: \text{Author gender is not independent of genre.} \]
OK, here we go! Let’s see if author gender is independent of the genre they write in.
testagain = chisq.test(table(data$author_gender,data$genre_1))
## Warning in chisq.test(table(data$author_gender, data$genre_1)): Chi-squared
## approximation may be incorrect
testagain$expected
##
## Adult Adult Fiction Adventure Amish Animals Anthologies
## female 0.9339915 52.77052 14.00987 6.070945 19.61382 10.27391
## male 1.0660085 60.22948 15.99013 6.929055 22.38618 11.72609
##
## Apocalyptic Art Asian Literature Autobiography Biblical Fiction
## female 2.801975 5.603949 25.21777 45.76558 0.9339915
## male 3.198025 6.396051 28.78223 52.23442 1.0660085
##
## Biography Business Category Romance Childrens Christian
## female 78.92228 43.43061 2.801975 201.7422 31.75571
## male 90.07772 49.56939 3.198025 230.2578 36.24429
##
## Christian Fiction Christianity Classics Comics Contemporary
## female 65.37941 1.400987 305.8822 9.339915 51.83653
## male 74.62059 1.599013 349.1178 10.660085 59.16347
##
## Couture Crime Cultural Dark Dark Fantasy Dc Comics Death
## female 0.4669958 14.47687 30.35472 71.45035 0.4669958 0.4669958 0.4669958
## male 0.5330042 16.52313 34.64528 81.54965 0.5330042 0.5330042 0.5330042
##
## Design Did Not Finish Drama Dungeons and Dragons Eastern Africa
## female 0.4669958 0.4669958 1.867983 0.4669958 0.9339915
## male 0.5330042 0.5330042 2.132017 0.5330042 1.0660085
##
## Economics Education Epic Erotica Esoterica European Literature
## female 4.202962 3.26897 0.4669958 90.59718 0.4669958 1.400987
## male 4.797038 3.73103 0.5330042 103.40282 0.5330042 1.599013
##
## Fairy Tales Family Fan Fiction Fantasy Feminism Fiction
## female 0.4669958 0.9339915 0.4669958 1586.385 21.01481 1014.315
## male 0.5330042 1.0660085 0.5330042 1810.615 23.98519 1157.685
##
## Food and Drink Football Gardening Glbt Health Historical
## female 24.28378 0.4669958 0.4669958 21.01481 6.537941 653.3271
## male 27.71622 0.5330042 0.5330042 23.98519 7.462059 745.6729
##
## History Holiday Horror How To Humor Inspirational Language
## female 196.1382 38.76065 247.9747 0.4669958 51.36953 1.867983 7.004936
## male 223.8618 44.23935 283.0253 0.5330042 58.63047 2.132017 7.995064
##
## Lds Leadership Lgbt Literary Fiction Literature Love
## female 0.4669958 2.334979 47.63357 1.400987 5.136953 0.4669958
## male 0.5330042 2.665021 54.36643 1.599013 5.863047 0.5330042
##
## Magical Realism Manga Marriage Media Tie In Medical
## female 3.26897 1.400987 0.9339915 2.801975 0.4669958
## male 3.73103 1.599013 1.0660085 3.198025 0.5330042
##
## Mental Health Military History Music Mystery Mythology New Adult
## female 2.801975 3.735966 10.27391 382.9365 19.61382 152.2406
## male 3.198025 4.264034 11.72609 437.0635 22.38618 173.7594
##
## Nonfiction Northern Africa Novella Novels Occult Own
## female 460.9248 1.400987 1.867983 86.86121 0.9339915 0.4669958
## male 526.0752 1.599013 2.132017 99.13879 1.0660085 0.5330042
##
## Paranormal Parenting Philosophy Plays Poetry Politics Polyamorous
## female 156.4436 1.400987 55.1055 22.4158 135.4288 28.01975 5.603949
## male 178.5564 1.599013 62.8945 25.5842 154.5712 31.98025 6.396051
##
## Psychology Pulp Realistic Fiction Reference Relationships
## female 31.75571 0.4669958 14.47687 0.4669958 0.9339915
## male 36.24429 0.5330042 16.52313 0.5330042 1.0660085
##
## Religion Retellings Romance Science Science Fiction
## female 57.44048 1.400987 1586.385 34.55769 453.4529
## male 65.55952 1.599013 1810.615 39.44231 517.5471
##
## Science Fiction Fantasy Self Help Sequential Art Sexuality
## female 0.9339915 38.29365 298.8773 0.9339915
## male 1.0660085 43.70635 341.1227 1.0660085
##
## Shapeshifters Short Stories Social Science Sociology
## female 8.872919 77.5213 0.4669958 2.334979
## male 10.127081 88.4787 0.5330042 2.665021
##
## Speculative Fiction Spirituality Sports Sports and Games
## female 0.4669958 22.4158 4.669958 27.08575
## male 0.5330042 25.5842 5.330042 30.91425
##
## Spy Thriller Suspense Teaching Thriller Travel Unfinished
## female 1.400987 7.938928 0.4669958 75.65331 34.09069 0.9339915
## male 1.599013 9.061072 0.5330042 86.34669 38.90931 1.0660085
##
## United States War Warfare Westerns Womens Fiction World War II
## female 0.4669958 9.339915 2.801975 7.004936 44.83159 5.136953
## male 0.5330042 10.660085 3.198025 7.995064 51.16841 5.863047
##
## Writing Young Adult
## female 2.334979 1166.088
## male 2.665021 1330.912
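Many of the expected counts above are below 5, which is what triggers the "Chi-squared approximation may be incorrect" warning. Two common ways to deal with that are sketched below on made-up data (`gender` and `genre` are hypothetical stand-ins for `data$author_gender` and `data$genre_1`, and the cutoff of 10 books is an arbitrary assumption): lumping rare genres into an "Other" category, or asking `chisq.test` for a simulated p-value.

```r
set.seed(1)  # makes the simulated p-value reproducible

# Toy data standing in for data$author_gender and data$genre_1 (made-up values).
gender <- c(rep("female", 40), rep("male", 40))
genre  <- c(rep("Romance", 25), rep("Fantasy", 10), rep("Epic", 3), rep("Pulp", 2),
            rep("Romance", 8), rep("Fantasy", 25), rep("Epic", 2),
            rep("Manga", 3), rep("Pulp", 2))

# Option 1: lump genres with fewer than 10 books into "Other" so that
# the expected counts are no longer tiny.
genre_counts <- table(genre)
rare         <- names(genre_counts)[genre_counts < 10]
genre_lumped <- ifelse(genre %in% rare, "Other", genre)
lumped_test  <- chisq.test(table(gender, genre_lumped))

# Option 2: keep every genre and get the p-value by Monte Carlo simulation
# instead of the chi-squared approximation.
sim_test <- chisq.test(table(gender, genre),
                       simulate.p.value = TRUE, B = 2000)

lumped_test$expected
sim_test$p.value
```

Lumping changes the question slightly (rare genres are no longer distinguished), while the simulated p-value keeps every genre but takes longer to compute on a large table.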
mosaicplot(table(data$genre_1,data$author_gender))
That mosaic plot is HIDEOUS, moving on. The chi-squared result did not print to the screen above because it was only assigned to testagain, but when examining the results in the global environment I can see that I have a low p-value. This leads me to reject the null hypothesis, that author gender is independent of the genre they write in, in favor of the alternative. Keep in mind the warning above: many of the expected counts are below 5, so the chi-squared approximation may not be reliable here. Still, this makes sense! In previous parts of the project (which I would include but cannot because RStudio has locked me out of my projects) I had found some things that gave me the idea that author gender is not independent of genre. Let’s take Women’s Fiction for example: it is safe to assume that more women would write in this genre, and the result of this test for independence is consistent with that, although the test only says that gender and genre are related overall, not which genres drive the relationship. So exciting!
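To see the test result on screen, and to get a rough idea of which cells depart most from independence, the object returned by `chisq.test` can be printed and its standardized residuals inspected. This is a small sketch on a made-up 2x3 table (`toy_tab` is hypothetical, not the real gender-by-genre table):

```r
# Toy 2x3 table standing in for table(data$author_gender, data$genre_1).
toy_tab <- matrix(c(30, 10,
                    12, 28,
                    20, 20),
                  nrow = 2,
                  dimnames = list(gender = c("female", "male"),
                                  genre  = c("Romance", "Fantasy", "Mystery")))
toy_res <- chisq.test(toy_tab)

# Assigning the result to a variable suppresses auto-printing;
# print it explicitly (or type its name) to see the statistic and p-value.
print(toy_res)
toy_res$p.value

# Standardized residuals: large positive values mark cells with more
# observations than independence predicts, large negative values fewer.
round(toy_res$stdres, 2)
```

On the real data, the same `$stdres` table for `testagain` would show whether a genre such as Womens Fiction has more female authors than independence predicts.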