I found the data for this project at https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m. I downloaded the dataset, saved it as a .CSV file, and uploaded it to my GitHub repository (Applied-Statistics). In this portion of the project I plan to run a chi-squared goodness-of-fit test to see whether authors are evenly distributed across birthplaces. I also plan to test for independence between author gender and the genre they typically write in.

Here is a small portion of the data and the summary statistics:

head(data)
##   author_average_rating author_gender                         author_genres
## 1                  4.01        female                   historical-fiction,
## 2                  4.15          male literature-fiction,mystery-thrillers,
## 3                  4.00        female                              romance,
## 4                  3.88          male                       fiction,memoir,
## 5                  4.10        female                  young-adult,fantasy,
## 6                  3.77          male                               horror,
##   author_id           author_name                          author_page_url
## 1     74489   Victoria Thompson\n     /author/show/74489.Victoria_Thompson
## 2    706255       Stieg Larsson\n        /author/show/706255.Stieg_Larsson
## 3   5618190 Mimi Jean Pamfiloff\n /author/show/5618190.Mimi_Jean_Pamfiloff
## 4     37871         José Donoso\n            /author/show/37871.Jos_Donoso
## 5     36122   Patricia C. Wrede\n      /author/show/36122.Patricia_C_Wrede
## 6     58947         Steve Niles\n           /author/show/58947.Steve_Niles
##   author_rating_count author_review_count           birthplace
## 1               74399                6268  United States\n    
## 2             3726435              142704         Sweden\n    
## 3               76496                7975    United States\n  
## 4                5522                 489          Chile\n    
## 5              291013               13453  United States\n    
## 6               47938                3240  United States\n    
##   book_average_rating
## 1                4.02
## 2                4.13
## 3                3.99
## 4                4.14
## 5                4.01
## 6                3.80
##                                                                  book_fullurl
## 1        https://www.goodreads.com/book/show/686717.Murder_on_St_Mark_s_Place
## 2 https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo
## 3           https://www.goodreads.com/book/show/27833684-tailored-for-trouble
## 4        https://www.goodreads.com/book/show/382975.The_Obscene_Bird_of_Night
## 5                   https://www.goodreads.com/book/show/64207.Sorcery_Cecelia
## 6           https://www.goodreads.com/book/show/831829.30_Days_of_Night_Vol_1
##    book_id                                                  book_title
## 1   686717                        \n      Murder on St. Mark's Place\n
## 2  2429135                   \n      The Girl with the Dragon Tattoo\n
## 3 27833684                              \n      Tailored for Trouble\n
## 4   382975                         \n      The Obscene Bird of Night\n
## 5    64207 \n      Sorcery & Cecelia: or The Enchanted Chocolate Pot\n
## 6   831829                          \n      30 Days of Night, Vol. 1\n
##          genre_1         genre_2 num_ratings num_reviews pages
## 1        Mystery      Historical        5260         375   277
## 2        Fiction         Mystery     2229163       65227   465
## 3        Romance    Contemporary        2151         391   354
## 4        Fiction Magical Realism        1844         173   438
## 5        Fantasy     Young Adult       17051        1890   326
## 6 Sequential Art  Sequential Art       17122         561   104
##        publish_date score
## 1              2000  3230
## 2            Aug-05  3062
## 3              2016  4585
## 4              1970  1533
## 5   April 15th 1988  2105
## 6 January 10th 2004  4372
summary(data)
##  author_average_rating author_gender      author_genres        author_id       
##  Min.   :1.82          Length:22891       Length:22891       Min.   :       4  
##  1st Qu.:3.81          Class :character   Class :character   1st Qu.:   40836  
##  Median :3.97          Mode  :character   Mode  :character   Median : 1415543  
##  Mean   :3.96                                                Mean   : 3233957  
##  3rd Qu.:4.12                                                3rd Qu.: 5775601  
##  Max.   :5.00                                                Max.   :18770448  
##  author_name        author_page_url    author_rating_count author_review_count
##  Length:22891       Length:22891       Min.   :       6    Min.   :     0     
##  Class :character   Class :character   1st Qu.:    4324    1st Qu.:   545     
##  Mode  :character   Mode  :character   Median :   24635    Median :  2273     
##                                        Mean   :  172032    Mean   :  9370     
##                                        3rd Qu.:  111337    3rd Qu.:  8262     
##                                        Max.   :21117318    Max.   :516745     
##   birthplace        book_average_rating book_fullurl         book_id         
##  Length:22891       Min.   :0.000       Length:22891       Length:22891      
##  Class :character   1st Qu.:3.770       Class :character   Class :character  
##  Mode  :character   Median :3.960       Mode  :character   Mode  :character  
##                     Mean   :3.951                                            
##                     3rd Qu.:4.140                                            
##                     Max.   :5.000                                            
##   book_title          genre_1            genre_2           num_ratings     
##  Length:22891       Length:22891       Length:22891       Min.   :      0  
##  Class :character   Class :character   Class :character   1st Qu.:    820  
##  Mode  :character   Mode  :character   Mode  :character   Median :   4403  
##                                                           Mean   :  46683  
##                                                           3rd Qu.:  20143  
##                                                           Max.   :3820921  
##   num_reviews        pages           publish_date           score       
##  Min.   :     0   Length:22891       Length:22891       Min.   :    55  
##  1st Qu.:   106   Class :character   Class :character   1st Qu.:   832  
##  Median :   384   Mode  :character   Mode  :character   Median :  1727  
##  Mean   :  2325                                         Mean   :  3893  
##  3rd Qu.:  1504                                         3rd Qu.:  3598  
##  Max.   :147696                                         Max.   :598270
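
The printed rows above show trailing `\n` characters and padding spaces in `author_name`, `birthplace`, and `book_title`, so the same place can end up counted as several different categories. A minimal sketch of a cleanup is below. It uses a small made-up data frame (`toy` is hypothetical, not the real data) so that it runs on its own.

```r
# Hypothetical toy data that mimics the stray newlines and spaces seen in head(data)
toy <- data.frame(
  birthplace = c(" United States\n    ", "Sweden\n    ", "United States\n  ", NA),
  stringsAsFactors = FALSE
)

# Before cleaning: the two "United States" strings count as different categories
n_before <- length(unique(toy$birthplace))

# trimws() strips leading and trailing whitespace, including "\n"
toy$birthplace <- trimws(toy$birthplace)

# Any empty strings left over after trimming are treated as missing
toy$birthplace[which(toy$birthplace == "")] <- NA

# After cleaning: both "United States" rows fall into one category
n_after <- length(unique(toy$birthplace))
table(toy$birthplace, useNA = "ifany")
```

On the real data the same idea would be `data$birthplace <- trimws(data$birthplace)`, run before building any tables.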

This is a continuation of my project from previous parts. I would include a link to those parts, but I am currently unable to open them because I have hit my hour limit in RStudio Cloud. That’s fine, we can work with it.

“Goodness” Gracious

I like making silly little puns to go along with each portion of the project.

\[ H_0: \text{The number of authors from each birthplace is equal.}\\ H_A: \text{The number of authors from each birthplace is not equal}. \]

# Recode missing birthplaces as their own category ("0"), then test for equal counts
data$birthplace[is.na(data$birthplace)] <- "0"
test <- chisq.test(table(data$birthplace))
test
## 
##  Chi-squared test for given probabilities
## 
## data:  table(data$birthplace)
## X-squared = 1073563, df = 434, p-value < 2.2e-16

With a p-value below 2.2e-16, we can reject the null hypothesis that the same number of authors come from each birthplace. The code below would show that, under the null, we expect about 53 authors (52.62299, which is 22891 rows divided by 435 birthplace categories) born in each location. I am commenting it out because the output is long.
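
To make the expected count concrete: under the null every birthplace is equally likely, so each expected count is the number of rows divided by the number of categories. The sketch below checks that arithmetic for the real data (22891 rows, df = 434) and then reproduces the chi-squared statistic by hand on a small made-up vector of counts (`counts` is hypothetical and only there to illustrate the formula).

```r
# Expected count per birthplace in the real data: n rows over k categories
n <- 22891
k <- 434 + 1          # degrees of freedom + 1
expected_real <- n / k

# Hypothetical counts for four birthplaces, only to illustrate the formula
counts <- c(A = 50, B = 30, C = 15, D = 5)
expected <- rep(sum(counts) / length(counts), length(counts))

# Chi-squared statistic: sum of (observed - expected)^2 / expected
stat_by_hand <- sum((counts - expected)^2 / expected)

# chisq.test() on a vector of counts tests equal probabilities by default
fit <- chisq.test(counts)
c(by_hand = stat_by_hand, chisq_test = unname(fit$statistic))
```

Both numbers in the last line should agree, which is a quick way to confirm what `chisq.test()` is computing.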

#test$expected

Below is a bar chart of author counts by birthplace. The tallest bars mark the places where the most authors are from.

barplot(table(data$birthplace))
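
With 435 categories the bar labels overlap, so a chart limited to the most common birthplaces may be easier to read. A minimal sketch, using a made-up `places` vector as a stand-in for `data$birthplace` (the values and the cutoff of 3 are assumptions for illustration):

```r
# Hypothetical stand-in for data$birthplace
places <- c(rep("United States", 6), rep("United Kingdom", 3),
            rep("Canada", 2), "Sweden", "Chile")

# Count each birthplace, sort from most to least common, keep the top 3
top_places <- sort(table(places), decreasing = TRUE)[seq_len(3)]

# las = 2 turns the axis labels perpendicular so long names stay readable
barplot(top_places, las = 2, ylab = "Number of authors")
```

For the real data, a larger cutoff such as the top 10 or 20 would probably be more informative.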

Miss Independence

\[ H_0: \text{Author gender is independent of genre}\\ H_A: \text{Author gender is not independent of genre} \]

OK, here we go! Let’s see if author gender is independent of the genre they write for.

testagain = chisq.test(table(data$author_gender,data$genre_1))
## Warning in chisq.test(table(data$author_gender, data$genre_1)): Chi-squared
## approximation may be incorrect
testagain$expected
##         
##              Adult Adult Fiction Adventure    Amish  Animals Anthologies
##   female 0.9339915      52.77052  14.00987 6.070945 19.61382    10.27391
##   male   1.0660085      60.22948  15.99013 6.929055 22.38618    11.72609
##         
##          Apocalyptic      Art Asian Literature Autobiography Biblical Fiction
##   female    2.801975 5.603949         25.21777      45.76558        0.9339915
##   male      3.198025 6.396051         28.78223      52.23442        1.0660085
##         
##          Biography Business Category Romance Childrens Christian
##   female  78.92228 43.43061         2.801975  201.7422  31.75571
##   male    90.07772 49.56939         3.198025  230.2578  36.24429
##         
##          Christian Fiction Christianity Classics    Comics Contemporary
##   female          65.37941     1.400987 305.8822  9.339915     51.83653
##   male            74.62059     1.599013 349.1178 10.660085     59.16347
##         
##            Couture    Crime Cultural     Dark Dark Fantasy Dc Comics     Death
##   female 0.4669958 14.47687 30.35472 71.45035    0.4669958 0.4669958 0.4669958
##   male   0.5330042 16.52313 34.64528 81.54965    0.5330042 0.5330042 0.5330042
##         
##             Design Did Not Finish    Drama Dungeons and Dragons Eastern Africa
##   female 0.4669958      0.4669958 1.867983            0.4669958      0.9339915
##   male   0.5330042      0.5330042 2.132017            0.5330042      1.0660085
##         
##          Economics Education      Epic   Erotica Esoterica European Literature
##   female  4.202962   3.26897 0.4669958  90.59718 0.4669958            1.400987
##   male    4.797038   3.73103 0.5330042 103.40282 0.5330042            1.599013
##         
##          Fairy Tales    Family Fan Fiction  Fantasy Feminism  Fiction
##   female   0.4669958 0.9339915   0.4669958 1586.385 21.01481 1014.315
##   male     0.5330042 1.0660085   0.5330042 1810.615 23.98519 1157.685
##         
##          Food and Drink  Football Gardening     Glbt   Health Historical
##   female       24.28378 0.4669958 0.4669958 21.01481 6.537941   653.3271
##   male         27.71622 0.5330042 0.5330042 23.98519 7.462059   745.6729
##         
##           History  Holiday   Horror    How To    Humor Inspirational Language
##   female 196.1382 38.76065 247.9747 0.4669958 51.36953      1.867983 7.004936
##   male   223.8618 44.23935 283.0253 0.5330042 58.63047      2.132017 7.995064
##         
##                Lds Leadership     Lgbt Literary Fiction Literature      Love
##   female 0.4669958   2.334979 47.63357         1.400987   5.136953 0.4669958
##   male   0.5330042   2.665021 54.36643         1.599013   5.863047 0.5330042
##         
##          Magical Realism    Manga  Marriage Media Tie In   Medical
##   female         3.26897 1.400987 0.9339915     2.801975 0.4669958
##   male           3.73103 1.599013 1.0660085     3.198025 0.5330042
##         
##          Mental Health Military History    Music  Mystery Mythology New Adult
##   female      2.801975         3.735966 10.27391 382.9365  19.61382  152.2406
##   male        3.198025         4.264034 11.72609 437.0635  22.38618  173.7594
##         
##          Nonfiction Northern Africa  Novella   Novels    Occult       Own
##   female   460.9248        1.400987 1.867983 86.86121 0.9339915 0.4669958
##   male     526.0752        1.599013 2.132017 99.13879 1.0660085 0.5330042
##         
##          Paranormal Parenting Philosophy   Plays   Poetry Politics Polyamorous
##   female   156.4436  1.400987    55.1055 22.4158 135.4288 28.01975    5.603949
##   male     178.5564  1.599013    62.8945 25.5842 154.5712 31.98025    6.396051
##         
##          Psychology      Pulp Realistic Fiction Reference Relationships
##   female   31.75571 0.4669958          14.47687 0.4669958     0.9339915
##   male     36.24429 0.5330042          16.52313 0.5330042     1.0660085
##         
##          Religion Retellings  Romance  Science Science Fiction
##   female 57.44048   1.400987 1586.385 34.55769        453.4529
##   male   65.55952   1.599013 1810.615 39.44231        517.5471
##         
##          Science Fiction Fantasy Self Help Sequential Art Sexuality
##   female               0.9339915  38.29365       298.8773 0.9339915
##   male                 1.0660085  43.70635       341.1227 1.0660085
##         
##          Shapeshifters Short Stories Social Science Sociology
##   female      8.872919       77.5213      0.4669958  2.334979
##   male       10.127081       88.4787      0.5330042  2.665021
##         
##          Speculative Fiction Spirituality   Sports Sports and Games
##   female           0.4669958      22.4158 4.669958         27.08575
##   male             0.5330042      25.5842 5.330042         30.91425
##         
##          Spy Thriller Suspense  Teaching Thriller   Travel Unfinished
##   female     1.400987 7.938928 0.4669958 75.65331 34.09069  0.9339915
##   male       1.599013 9.061072 0.5330042 86.34669 38.90931  1.0660085
##         
##          United States       War  Warfare Westerns Womens Fiction World War II
##   female     0.4669958  9.339915 2.801975 7.004936       44.83159     5.136953
##   male       0.5330042 10.660085 3.198025 7.995064       51.16841     5.863047
##         
##           Writing Young Adult
##   female 2.334979    1166.088
##   male   2.665021    1330.912
mosaicplot(table(data$genre_1,data$author_gender))

That mosaic plot is HIDEOUS, moving on. The chi-squared result did not print above because I stored it in testagain, but when I examine the object in the global environment I can see that it has a low p-value. This leads me to reject the null hypothesis that author gender is independent of the genre they write for, in favor of the alternative. (The warning appears because many genres have expected counts below 1, so the approximation is rough for those cells.) This makes sense! In previous parts of the project (which I would include but cannot, because RStudio has locked me out of my projects) I had found some things that gave me the idea that author gender is not independent of genre. Let’s take Women’s Fiction for example: it is safe to assume that more women would write for this genre, and this test for independence is consistent with that! So exciting!
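
Since the result was stored rather than printed, the p-value can be read straight from the object with `$p.value`. For the approximation warning, `chisq.test()` offers a Monte Carlo p-value through `simulate.p.value = TRUE`, which is one possible way around the small expected counts. The sketch below uses a small made-up gender-by-genre table (`tab` and its counts are hypothetical, not the real data) and also prints the standardized residuals, which show which cells pull the result away from independence.

```r
# Hypothetical 2 x 3 table of author gender by genre; not the real counts
# (matrix() fills column by column: each pair below is one genre)
tab <- matrix(c(40, 10,
                25, 30,
                 5, 45),
              nrow = 2,
              dimnames = list(gender = c("female", "male"),
                              genre  = c("Womens Fiction", "Fantasy", "War")))

# Extracting $p.value (or printing the object) shows the result on screen
fit <- chisq.test(tab)
fit$p.value

# Monte Carlo p-value: does not rely on the chi-squared approximation
set.seed(1)  # the simulation is random, so fix the seed for reproducibility
fit_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
fit_sim$p.value

# Standardized residuals: large positive values mean more authors than expected
round(fit$stdres, 2)
```

On the real data the equivalent calls would be `testagain$p.value` and `testagain$stdres`, which would show directly whether Women’s Fiction has more female authors than independence predicts.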