Goodreads Data Analysis

I have found the data for this project at https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m I then downloaded the dataset and saved it as a .CSV file and uploaded it into my Github repository (Applied-Statistics). With this data I hope to analyze any relationship between the variables.

Here is a small portion of the data and the summary statistics

head(data)

##   author_average_rating author_gender                         author_genres
## 1                  4.01        female                   historical-fiction,
## 2                  4.15          male literature-fiction,mystery-thrillers,
## 3                  4.00        female                              romance,
## 4                  3.88          male                       fiction,memoir,
## 5                  4.10        female                  young-adult,fantasy,
## 6                  3.77          male                               horror,
##   author_id           author_name                          author_page_url
## 1     74489   Victoria Thompson\n     /author/show/74489.Victoria_Thompson
## 2    706255       Stieg Larsson\n        /author/show/706255.Stieg_Larsson
## 3   5618190 Mimi Jean Pamfiloff\n /author/show/5618190.Mimi_Jean_Pamfiloff
## 4     37871         José Donoso\n            /author/show/37871.Jos_Donoso
## 5     36122   Patricia C. Wrede\n      /author/show/36122.Patricia_C_Wrede
## 6     58947         Steve Niles\n           /author/show/58947.Steve_Niles
##   author_rating_count author_review_count           birthplace
## 1               74399                6268  United States\n    
## 2             3726435              142704         Sweden\n    
## 3               76496                7975    United States\n  
## 4                5522                 489          Chile\n    
## 5              291013               13453  United States\n    
## 6               47938                3240  United States\n    
##   book_average_rating
## 1                4.02
## 2                4.13
## 3                3.99
## 4                4.14
## 5                4.01
## 6                3.80
##                                                                  book_fullurl
## 1        https://www.goodreads.com/book/show/686717.Murder_on_St_Mark_s_Place
## 2 https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo
## 3           https://www.goodreads.com/book/show/27833684-tailored-for-trouble
## 4        https://www.goodreads.com/book/show/382975.The_Obscene_Bird_of_Night
## 5                   https://www.goodreads.com/book/show/64207.Sorcery_Cecelia
## 6           https://www.goodreads.com/book/show/831829.30_Days_of_Night_Vol_1
##    book_id                                                  book_title
## 1   686717                        \n      Murder on St. Mark's Place\n
## 2  2429135                   \n      The Girl with the Dragon Tattoo\n
## 3 27833684                              \n      Tailored for Trouble\n
## 4   382975                         \n      The Obscene Bird of Night\n
## 5    64207 \n      Sorcery & Cecelia: or The Enchanted Chocolate Pot\n
## 6   831829                          \n      30 Days of Night, Vol. 1\n
##          genre_1         genre_2 num_ratings num_reviews pages
## 1        Mystery      Historical        5260         375   277
## 2        Fiction         Mystery     2229163       65227   465
## 3        Romance    Contemporary        2151         391   354
## 4        Fiction Magical Realism        1844         173   438
## 5        Fantasy     Young Adult       17051        1890   326
## 6 Sequential Art  Sequential Art       17122         561   104
##        publish_date score
## 1              2000  3230
## 2            Aug-05  3062
## 3              2016  4585
## 4              1970  1533
## 5   April 15th 1988  2105
## 6 January 10th 2004  4372

summary(data)

##  author_average_rating author_gender      author_genres        author_id       
##  Min.   :1.82          Length:22891       Length:22891       Min.   :       4  
##  1st Qu.:3.81          Class :character   Class :character   1st Qu.:   40836  
##  Median :3.97          Mode  :character   Mode  :character   Median : 1415543  
##  Mean   :3.96                                                Mean   : 3233957  
##  3rd Qu.:4.12                                                3rd Qu.: 5775601  
##  Max.   :5.00                                                Max.   :18770448  
##  author_name        author_page_url    author_rating_count author_review_count
##  Length:22891       Length:22891       Min.   :       6    Min.   :     0     
##  Class :character   Class :character   1st Qu.:    4324    1st Qu.:   545     
##  Mode  :character   Mode  :character   Median :   24635    Median :  2273     
##                                        Mean   :  172032    Mean   :  9370     
##                                        3rd Qu.:  111337    3rd Qu.:  8262     
##                                        Max.   :21117318    Max.   :516745     
##   birthplace        book_average_rating book_fullurl         book_id         
##  Length:22891       Min.   :0.000       Length:22891       Length:22891      
##  Class :character   1st Qu.:3.770       Class :character   Class :character  
##  Mode  :character   Median :3.960       Mode  :character   Mode  :character  
##                     Mean   :3.951                                            
##                     3rd Qu.:4.140                                            
##                     Max.   :5.000                                            
##   book_title          genre_1            genre_2           num_ratings     
##  Length:22891       Length:22891       Length:22891       Min.   :      0  
##  Class :character   Class :character   Class :character   1st Qu.:    820  
##  Mode  :character   Mode  :character   Mode  :character   Median :   4403  
##                                                           Mean   :  46683  
##                                                           3rd Qu.:  20143  
##                                                           Max.   :3820921  
##   num_reviews        pages           publish_date           score       
##  Min.   :     0   Length:22891       Length:22891       Min.   :    55  
##  1st Qu.:   106   Class :character   Class :character   1st Qu.:   832  
##  Median :   384   Mode  :character   Mode  :character   Median :  1727  
##  Mean   :  2325                                         Mean   :  3893  
##  3rd Qu.:  1504                                         3rd Qu.:  3598  
##  Max.   :147696                                         Max.   :598270

project part 0

str(data)

## 'data.frame':    22891 obs. of  20 variables:
##  $ author_average_rating: num  4.01 4.15 4 3.88 4.1 3.77 4.16 3.94 3.78 4.08 ...
##  $ author_gender        : chr  "female" "male" "female" "male" ...
##  $ author_genres        : chr  "historical-fiction," "literature-fiction,mystery-thrillers," "romance," "fiction,memoir," ...
##  $ author_id            : int  74489 706255 5618190 37871 36122 58947 4833990 7956 155651 274533 ...
##  $ author_name          : chr  "Victoria Thompson\n" "Stieg Larsson\n" "Mimi Jean Pamfiloff\n" "José Donoso\n" ...
##  $ author_page_url      : chr  "/author/show/74489.Victoria_Thompson" "/author/show/706255.Stieg_Larsson" "/author/show/5618190.Mimi_Jean_Pamfiloff" "/author/show/37871.Jos_Donoso" ...
##  $ author_rating_count  : int  74399 3726435 76496 5522 291013 47938 110522 321197 1019 481114 ...
##  $ author_review_count  : int  6268 142704 7975 489 13453 3240 9451 29747 104 25166 ...
##  $ birthplace           : chr  " United States\n    " "Sweden\n    " " United States\n  " "Chile\n    " ...
##  $ book_average_rating  : num  4.02 4.13 3.99 4.14 4.01 3.8 3.95 3.84 3.77 4.01 ...
##  $ book_fullurl         : chr  "https://www.goodreads.com/book/show/686717.Murder_on_St_Mark_s_Place" "https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo" "https://www.goodreads.com/book/show/27833684-tailored-for-trouble" "https://www.goodreads.com/book/show/382975.The_Obscene_Bird_of_Night" ...
##  $ book_id              : chr  "686717" "2429135" "27833684" "382975" ...
##  $ book_title           : chr  "\n      Murder on St. Mark's Place\n" "\n      The Girl with the Dragon Tattoo\n" "\n      Tailored for Trouble\n" "\n      The Obscene Bird of Night\n" ...
##  $ genre_1              : chr  "Mystery" "Fiction" "Romance" "Fiction" ...
##  $ genre_2              : chr  "Historical" "Mystery" "Contemporary" "Magical Realism" ...
##  $ num_ratings          : int  5260 2229163 2151 1844 17051 17122 11684 45963 594 40093 ...
##  $ num_reviews          : int  375 65227 391 173 1890 561 1107 4268 42 2375 ...
##  $ pages                : chr  "277" "465" "354" "438" ...
##  $ publish_date         : chr  "2000" "Aug-05" "2016" "1970" ...
##  $ score                : int  3230 3062 4585 1533 2105 4372 2396 2054 1311 1994 ...

median(data$author_average_rating)

## [1] 3.97

mean(data$author_average_rating)

## [1] 3.960368

min(data$author_average_rating)

## [1] 1.82

max(data$author_average_rating)

## [1] 5

quantile(data$author_average_rating)

##   0%  25%  50%  75% 100% 
## 1.82 3.81 3.97 4.12 5.00

sd(data$author_average_rating)

## [1] 0.2404211

hist(data$author_average_rating)

## It appears that the data collected for the average rating of an author has a bell shape and is approximately normally distributed with a center near 4. Lets visualize the data another way

boxplot(data$author_average_rating)

## The same variable distribution is shown in the form of a box-plot. this graph shows us that there are potential outliers past the 1st and 3rd quantiles.

qqnorm(data$author_average_rating)
qqline(data$author_average_rating)

(## I would say that the variable author average rating is approximately normally distributed as the data falls near the y=x line in the Q-Q Plot.

gender_table <- table(data$author_gender)

gender_table

## 
## female   male 
##  10690  12201

gender_table/ sum(gender_table)

## 
##    female      male 
## 0.4669958 0.5330042

Above is the frequency and relative frequency of author gender from the sample of Authors in this datatset. Below we take a look at the authors gender compared with the genre of book that they write.

genre_gender <- table(data$author_gender, data$genre_1)

rownames(genre_gender)<- c("Female","Male")
genre_gender

##         
##          Adult Adult Fiction Adventure Amish Animals Anthologies Apocalyptic
##   Female     0            58         7    12      14           7           0
##   Male       2            55        23     1      28          15           6
##         
##           Art Asian Literature Autobiography Biblical Fiction Biography
##   Female    4                8            49                2        33
##   Male      8               46            49                0       136
##         
##          Business Category Romance Childrens Christian Christian Fiction
##   Female        3                3       150        14               108
##   Male         90                3       282        54                32
##         
##          Christianity Classics Comics Contemporary Couture Crime Cultural Dark
##   Female            0      161      2           80       1     6       15   93
##   Male              3      494     18           31       0    25       50   60
##         
##          Dark Fantasy Dc Comics Death Design Did Not Finish Drama
##   Female            1         0     1      0              0     1
##   Male              0         1     0      1              1     3
##         
##          Dungeons and Dragons Eastern Africa Economics Education Epic Erotica
##   Female                    0              0         1         0    0     137
##   Male                      1              2         8         7    1      57
##         
##          Esoterica European Literature Fairy Tales Family Fan Fiction Fantasy
##   Female         0                   0           0      0           0    1754
##   Male           1                   3           1      2           1    1643
##         
##          Feminism Fiction Food and Drink Football Gardening Glbt Health
##   Female       36     717             26        0         1   34      6
##   Male          9    1455             26        1         0   11      8
##         
##          Historical History Holiday Horror How To Humor Inspirational Language
##   Female        815      45      42    110      0    33             1        6
##   Male          584     375      41    421      1    77             3        9
##         
##           Lds Leadership Lgbt Literary Fiction Literature Love Magical Realism
##   Female    0          0   56                1          2    0               6
##   Male      1          5   46                2          9    1               1
##         
##          Manga Marriage Media Tie In Medical Mental Health Military History
##   Female     2        0            1       0             6                0
##   Male       1        2            5       1             0                8
##         
##          Music Mystery Mythology New Adult Nonfiction Northern Africa Novella
##   Female    13     425        21       247        347               0       3
##   Male       9     395        21        79        640               3       1
##         
##          Novels Occult  Own Paranormal Parenting Philosophy Plays Poetry
##   Female     15      0    0        225         2          3     0     70
##   Male      171      2    1        110         1        115    48    220
##         
##          Politics Polyamorous Psychology Pulp Realistic Fiction Reference
##   Female        6           7          7    0                11         0
##   Male         54           5         61    1                20         1
##         
##          Relationships Religion Retellings Romance Science Science Fiction
##   Female             0        7          3    2302       0             300
##   Male               2      116          0    1095      74             671
##         
##          Science Fiction Fantasy Self Help Sequential Art Sexuality
##   Female                       1        22            186         1
##   Male                         1        60            454         1
##         
##          Shapeshifters Short Stories Social Science Sociology
##   Female            15            55              1         4
##   Male               4           111              0         1
##         
##          Speculative Fiction Spirituality Sports Sports and Games Spy Thriller
##   Female                   0           13      1               25            0
##   Male                     1           35      9               33            3
##         
##          Suspense Teaching Thriller Travel Unfinished United States  War
##   Female       12        0       51     18          2             0    9
##   Male          5        1      111     55          0             1   11
##         
##          Warfare Westerns Womens Fiction World War II Writing Young Adult
##   Female       4        3             60            5       3        1514
##   Male         2       12             36            6       2         983

The data is a little messy, with items that are very similar, columns such as “Spy Thriller” and Thriller, etc. we will try and clean this up for next time but for now, this isnt too terrible.

Lets do some hypothesis testing. I think that the average rating for an author will be around 2.5.

\[ H_0:\ \mu_F = \ 2.5\\\\\ H_A:\ \mu_F\neq \ 2.5 \]

t.test(data$author_average_rating, mu= 2.5)

## 
##  One Sample t-test
## 
## data:  data$author_average_rating
## t = 919.01, df = 22890, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
##  3.957254 3.963483
## sample estimates:
## mean of x 
##  3.960368

Hot dang! I was wrong, the observed mean of authors average ratings is more near 4 than 2.5. 2.5 also lies outside of the 95% confidence interval. Therefore we reject the null hypothesis in favor of the alternative.

Do more reviews = higher average book ratings? Lets check it out! R= rating r= reviews \[ H_0:\ \mu_r\ = \ \mu_R\\\ H_A:\ \mu_F\neq\mu_R \]

lessrev= data[which(data$num_reviews> data$num_ratings
),]

morev = data[which(data$num_reviews<=data$num_ratings
),]

t.test(lessrev$book_average_rating , morev$book_average_rating)

## 
##  Welch Two Sample t-test
## 
## data:  lessrev$book_average_rating and morev$book_average_rating
## t = 6.4069, df = 11.012, p-value = 5.008e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3504600 0.7171876
## sample estimates:
## mean of x mean of y 
##  4.485000  3.951176

This T test gives evidence to reject the null hypothesis in favor of the alternative.

lets take a look at the relationship between number of ratings and number of reviews. Number of ratings has been renamed RAT and number of reviews has been renamed REV

REV =c(data$num_reviews)
RAT =c(data$num_ratings)
cor(REV,RAT)

## [1] 0.8289449

model <-lm(REV~RAT)
plot(model)

ylist = c(1200,1400)
point <- data.frame(RAT = ylist)
predict(model,point, interval = 'predict')

##        fit       lwr      upr
## 1 893.1073 -6603.357 8389.572
## 2 899.4026 -6597.062 8395.867

Above is a 95% confidence interval to predict the number of reviews based on the number of ratings.

##residuals(model)
hist(residuals(model))

I have chosen to not include the printed portion of all the residuals in this report because it would wipe out the remainder of the amazon rainforest if printed. The code to print the residuals has been commented out above. Instead I have displayed a histogram to help determine if the residuals follow a normal distribution. It is difficult to tell if this histogram is normally distributed or not. I’m going to say that it is because I believe there are some values to the left of the tallest bar but they are too short to be seen. the equation of the regression line is \[y = 0.03148x+855.3590\] the slope here represents that for every one review the number of ratings goes up approximately .03148.

cor.test(REV,RAT)

## 
##  Pearson's product-moment correlation
## 
## data:  REV and RAT
## t = 224.22, df = 22889, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8248481 0.8329547
## sample estimates:
##       cor 
## 0.8289449

the correlation coefficient, r, is equal to .829 this gives reason to conclude that there is a moderately strong positive correlation between the number of ratings and the number of reviews.

gender<- data$author_gender
fit <- lm(REV~RAT+gender,data=data)
print(fit)

## 
## Call:
## lm(formula = REV ~ RAT + gender, data = data)
## 
## Coefficients:
## (Intercept)          RAT   gendermale  
##    936.5015       0.0315    -153.9378

above I have taken the linear model and fitted it to a categorical variable. here I’ll explore ratings and reviews alongside the author’s gender.

summary(fit)

## 
## Call:
## lm(formula = REV ~ RAT + gender, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50326   -802   -673   -185  65447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.365e+02  3.737e+01  25.060   <2e-16 ***
## RAT          3.150e-02  1.405e-04 224.172   <2e-16 ***
## gendermale  -1.539e+02  5.071e+01  -3.036   0.0024 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3824 on 22888 degrees of freedom
## Multiple R-squared:  0.6873, Adjusted R-squared:  0.6872 
## F-statistic: 2.515e+04 on 2 and 22888 DF,  p-value: < 2.2e-16

y <- data.frame(RAT=100, gender = "female")
predict(fit,y)

##       1 
## 939.651

Bootstrapping

library(boot)

set.seed(16)
samp_mean <- function(x,i){
  mean(x[i])
}

results <- boot(data$author_average_rating, samp_mean,1000)
plot(results)

boot.ci(results, type="perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 3.957,  3.963 )  
## Calculations and Intervals on Original Scale

mean(data$author_average_rating)

## [1] 3.960368

t.test(data$author_average_rating, mu =4)

## 
##  One Sample t-test
## 
## data:  data$author_average_rating
## t = -24.94, df = 22890, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4
## 95 percent confidence interval:
##  3.957254 3.963483
## sample estimates:
## mean of x 
##  3.960368

We can see from the results above that the confidence intervals are very ver similar. The confidence interval for the bootstrap mean has a little less accuracy than the interval for the actual data, but they are almost the same. We can also notice that the center of the confidence interval for the bootstrap is the actual mean of 3.96.

Putting on our non-parametric boots

Now lets do a little bootstrapping on a non-parametric statistic. Lets look at the median of the number of ratings for each book in the data set.

#```{r bootfullMedian} set.seed(16) samp_median <- function(x,i){ median(x[i])}

paraboots <- boot(data$author_average_rating, samp_median, R=1000) plot(paraboots)

Things are not looking normal... This most likely has something to do with these being the average rating for each author and not the true rating that each user would give the author. On goodreads users are only able to rate authors with the whole value, so a 3 star rating exists but a 3.5 doesn't. Instead of continuing with this because it seems a little futile I'm going to switch it up and look at the number of ratings for each author. I'll go through the parametric and also the non-parametric below. 

## Number of ratings Bootstrap
#Here is bootstrapping for the mean. 
#```{r bootFullMean}
#set.seed(16)
#samp_mean <- function(x,i){
#  mean(x[i])
#}

result2 <- boot(data$num_ratings, samp_mean,1000)
plot(results)

Now a couple confidence intervals.

#boot.ci(result2, type="perc")
mean(data$num_ratings)

## [1] 46683.49

Here is the confidence interval for number of ratings from the original dataset.

t.test(data$num_ratings, mu =46683.49)

## 
##  One Sample t-test
## 
## data:  data$num_ratings
## t = -1.8569e-06, df = 22890, p-value = 1
## alternative hypothesis: true mean is not equal to 46683.49
## 95 percent confidence interval:
##  44350.68 49016.30
## sample estimates:
## mean of x 
##  46683.49

The confidence intervals for the bootstrap and for the original data are different, close but different. The true mean falls at about the middle of both of these intervals.

Now lets examine the median for number of ratings. I want to hypothesize that the median number of ratings is equal to the mean. \[ H_0:\ \ M = \ 46683.49\\\\\ H_A:\ \ M\neq \ 46683.49 \]

set.seed(16)
samp_median <- function(x,i){
  median(x[i])}

np <- boot(data$num_ratings, samp_median, R=1000)
plot(np)

This looks much better than it did earlier. Things are approximately normally distributed, lets find a $p$ value. \[ t=\frac{\mu-\overline x}{SE} \]

xbar = np$t0

##Cross-Validation

I will be very candid in saying that cross validation is something that I don’t understand fully. I will try my best to illustrate what is necessary below but I am very uncomfortable with this portion of the project. #```{r} #library(caret) #SamplesID <- createDataPartition(data$author_id, p =.66, list=FALSE) #SamplesID

##{r} #traindata <- data[SamplesID,] #testdata <- data[-SamplesID,] #testdata ```

#{r} #model <- lm(data$num_reviews ~ data$num_ratings, data = testdata) #summary(model) #

#{r} #data[SamplesID,'Test_Train'] <- "Train" #data[-SamplesID,'Test_Train'] <- "Test" #data$Test_Train #

#{r} #library(ggplot2) #ggplot(data=data, mapping = aes(data$num_ratings,data$num_reviews,col=data$Test_Train,))+ # geom_jitter()+ # geom_smooth(method=lm) #

#{r} #pre <- predict(model, data[-SamplesID,]) #pre # #{r} #R2(pre,data[-SamplesID,"num_reviews"],"everything") # I’m not entirely certain why I am not getting a value when I run the line of code above. I’m assuming that the correlation coefficient would be somewhere close to the one from the predicted model which was around .6. This value leads to the conclusion that the correlation is a weak positive correlation, the value is not very close to one, therefore not very strong. If had gotten a value I am assuming that it would be close to the .6 value I had gotten before. # 10- Fold #More cross validation. More frustration. Things are going well!

Goodreads Data Analysis

Nichole A. Beyer

3/14/2021

project part 0

Bootstrapping

Putting on our non-parametric boots