I have found the data for this project at https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m I then downloaded the dataset and saved it as a .CSV file and uploaded it into my Github repository (Applied-Statistics). With this data I hope to analyze any relationship between the variables.
Here is a small portion of the data and the summary statistics
head(data)
## author_average_rating author_gender author_genres
## 1 4.01 female historical-fiction,
## 2 4.15 male literature-fiction,mystery-thrillers,
## 3 4.00 female romance,
## 4 3.88 male fiction,memoir,
## 5 4.10 female young-adult,fantasy,
## 6 3.77 male horror,
## author_id author_name author_page_url
## 1 74489 Victoria Thompson\n /author/show/74489.Victoria_Thompson
## 2 706255 Stieg Larsson\n /author/show/706255.Stieg_Larsson
## 3 5618190 Mimi Jean Pamfiloff\n /author/show/5618190.Mimi_Jean_Pamfiloff
## 4 37871 José Donoso\n /author/show/37871.Jos_Donoso
## 5 36122 Patricia C. Wrede\n /author/show/36122.Patricia_C_Wrede
## 6 58947 Steve Niles\n /author/show/58947.Steve_Niles
## author_rating_count author_review_count birthplace
## 1 74399 6268 United States\n
## 2 3726435 142704 Sweden\n
## 3 76496 7975 United States\n
## 4 5522 489 Chile\n
## 5 291013 13453 United States\n
## 6 47938 3240 United States\n
## book_average_rating
## 1 4.02
## 2 4.13
## 3 3.99
## 4 4.14
## 5 4.01
## 6 3.80
## book_fullurl
## 1 https://www.goodreads.com/book/show/686717.Murder_on_St_Mark_s_Place
## 2 https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo
## 3 https://www.goodreads.com/book/show/27833684-tailored-for-trouble
## 4 https://www.goodreads.com/book/show/382975.The_Obscene_Bird_of_Night
## 5 https://www.goodreads.com/book/show/64207.Sorcery_Cecelia
## 6 https://www.goodreads.com/book/show/831829.30_Days_of_Night_Vol_1
## book_id book_title
## 1 686717 \n Murder on St. Mark's Place\n
## 2 2429135 \n The Girl with the Dragon Tattoo\n
## 3 27833684 \n Tailored for Trouble\n
## 4 382975 \n The Obscene Bird of Night\n
## 5 64207 \n Sorcery & Cecelia: or The Enchanted Chocolate Pot\n
## 6 831829 \n 30 Days of Night, Vol. 1\n
## genre_1 genre_2 num_ratings num_reviews pages
## 1 Mystery Historical 5260 375 277
## 2 Fiction Mystery 2229163 65227 465
## 3 Romance Contemporary 2151 391 354
## 4 Fiction Magical Realism 1844 173 438
## 5 Fantasy Young Adult 17051 1890 326
## 6 Sequential Art Sequential Art 17122 561 104
## publish_date score
## 1 2000 3230
## 2 Aug-05 3062
## 3 2016 4585
## 4 1970 1533
## 5 April 15th 1988 2105
## 6 January 10th 2004 4372
summary(data)
## author_average_rating author_gender author_genres author_id
## Min. :1.82 Length:22891 Length:22891 Min. : 4
## 1st Qu.:3.81 Class :character Class :character 1st Qu.: 40836
## Median :3.97 Mode :character Mode :character Median : 1415543
## Mean :3.96 Mean : 3233957
## 3rd Qu.:4.12 3rd Qu.: 5775601
## Max. :5.00 Max. :18770448
## author_name author_page_url author_rating_count author_review_count
## Length:22891 Length:22891 Min. : 6 Min. : 0
## Class :character Class :character 1st Qu.: 4324 1st Qu.: 545
## Mode :character Mode :character Median : 24635 Median : 2273
## Mean : 172032 Mean : 9370
## 3rd Qu.: 111337 3rd Qu.: 8262
## Max. :21117318 Max. :516745
## birthplace book_average_rating book_fullurl book_id
## Length:22891 Min. :0.000 Length:22891 Length:22891
## Class :character 1st Qu.:3.770 Class :character Class :character
## Mode :character Median :3.960 Mode :character Mode :character
## Mean :3.951
## 3rd Qu.:4.140
## Max. :5.000
## book_title genre_1 genre_2 num_ratings
## Length:22891 Length:22891 Length:22891 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 820
## Mode :character Mode :character Mode :character Median : 4403
## Mean : 46683
## 3rd Qu.: 20143
## Max. :3820921
## num_reviews pages publish_date score
## Min. : 0 Length:22891 Length:22891 Min. : 55
## 1st Qu.: 106 Class :character Class :character 1st Qu.: 832
## Median : 384 Mode :character Mode :character Median : 1727
## Mean : 2325 Mean : 3893
## 3rd Qu.: 1504 3rd Qu.: 3598
## Max. :147696 Max. :598270
str(data)
## 'data.frame': 22891 obs. of 20 variables:
## $ author_average_rating: num 4.01 4.15 4 3.88 4.1 3.77 4.16 3.94 3.78 4.08 ...
## $ author_gender : chr "female" "male" "female" "male" ...
## $ author_genres : chr "historical-fiction," "literature-fiction,mystery-thrillers," "romance," "fiction,memoir," ...
## $ author_id : int 74489 706255 5618190 37871 36122 58947 4833990 7956 155651 274533 ...
## $ author_name : chr "Victoria Thompson\n" "Stieg Larsson\n" "Mimi Jean Pamfiloff\n" "José Donoso\n" ...
## $ author_page_url : chr "/author/show/74489.Victoria_Thompson" "/author/show/706255.Stieg_Larsson" "/author/show/5618190.Mimi_Jean_Pamfiloff" "/author/show/37871.Jos_Donoso" ...
## $ author_rating_count : int 74399 3726435 76496 5522 291013 47938 110522 321197 1019 481114 ...
## $ author_review_count : int 6268 142704 7975 489 13453 3240 9451 29747 104 25166 ...
## $ birthplace : chr " United States\n " "Sweden\n " " United States\n " "Chile\n " ...
## $ book_average_rating : num 4.02 4.13 3.99 4.14 4.01 3.8 3.95 3.84 3.77 4.01 ...
## $ book_fullurl : chr "https://www.goodreads.com/book/show/686717.Murder_on_St_Mark_s_Place" "https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo" "https://www.goodreads.com/book/show/27833684-tailored-for-trouble" "https://www.goodreads.com/book/show/382975.The_Obscene_Bird_of_Night" ...
## $ book_id : chr "686717" "2429135" "27833684" "382975" ...
## $ book_title : chr "\n Murder on St. Mark's Place\n" "\n The Girl with the Dragon Tattoo\n" "\n Tailored for Trouble\n" "\n The Obscene Bird of Night\n" ...
## $ genre_1 : chr "Mystery" "Fiction" "Romance" "Fiction" ...
## $ genre_2 : chr "Historical" "Mystery" "Contemporary" "Magical Realism" ...
## $ num_ratings : int 5260 2229163 2151 1844 17051 17122 11684 45963 594 40093 ...
## $ num_reviews : int 375 65227 391 173 1890 561 1107 4268 42 2375 ...
## $ pages : chr "277" "465" "354" "438" ...
## $ publish_date : chr "2000" "Aug-05" "2016" "1970" ...
## $ score : int 3230 3062 4585 1533 2105 4372 2396 2054 1311 1994 ...
median(data$author_average_rating)
## [1] 3.97
mean(data$author_average_rating)
## [1] 3.960368
min(data$author_average_rating)
## [1] 1.82
max(data$author_average_rating)
## [1] 5
quantile(data$author_average_rating)
## 0% 25% 50% 75% 100%
## 1.82 3.81 3.97 4.12 5.00
sd(data$author_average_rating)
## [1] 0.2404211
hist(data$author_average_rating)
## It appears that the data collected for the average rating of an author has a bell shape and is approximately normally distributed with a center near 4. Lets visualize the data another way
boxplot(data$author_average_rating)
## The same variable distribution is shown in the form of a box-plot. this graph shows us that there are potential outliers past the 1st and 3rd quantiles.
qqnorm(data$author_average_rating)
qqline(data$author_average_rating)
(## I would say that the variable author average rating is approximately normally distributed as the data falls near the y=x line in the Q-Q Plot.
gender_table <- table(data$author_gender)
gender_table
##
## female male
## 10690 12201
gender_table/ sum(gender_table)
##
## female male
## 0.4669958 0.5330042
Above is the frequency and relative frequency of author gender from the sample of Authors in this datatset. Below we take a look at the authors gender compared with the genre of book that they write.
genre_gender <- table(data$author_gender, data$genre_1)
rownames(genre_gender)<- c("Female","Male")
genre_gender
##
## Adult Adult Fiction Adventure Amish Animals Anthologies Apocalyptic
## Female 0 58 7 12 14 7 0
## Male 2 55 23 1 28 15 6
##
## Art Asian Literature Autobiography Biblical Fiction Biography
## Female 4 8 49 2 33
## Male 8 46 49 0 136
##
## Business Category Romance Childrens Christian Christian Fiction
## Female 3 3 150 14 108
## Male 90 3 282 54 32
##
## Christianity Classics Comics Contemporary Couture Crime Cultural Dark
## Female 0 161 2 80 1 6 15 93
## Male 3 494 18 31 0 25 50 60
##
## Dark Fantasy Dc Comics Death Design Did Not Finish Drama
## Female 1 0 1 0 0 1
## Male 0 1 0 1 1 3
##
## Dungeons and Dragons Eastern Africa Economics Education Epic Erotica
## Female 0 0 1 0 0 137
## Male 1 2 8 7 1 57
##
## Esoterica European Literature Fairy Tales Family Fan Fiction Fantasy
## Female 0 0 0 0 0 1754
## Male 1 3 1 2 1 1643
##
## Feminism Fiction Food and Drink Football Gardening Glbt Health
## Female 36 717 26 0 1 34 6
## Male 9 1455 26 1 0 11 8
##
## Historical History Holiday Horror How To Humor Inspirational Language
## Female 815 45 42 110 0 33 1 6
## Male 584 375 41 421 1 77 3 9
##
## Lds Leadership Lgbt Literary Fiction Literature Love Magical Realism
## Female 0 0 56 1 2 0 6
## Male 1 5 46 2 9 1 1
##
## Manga Marriage Media Tie In Medical Mental Health Military History
## Female 2 0 1 0 6 0
## Male 1 2 5 1 0 8
##
## Music Mystery Mythology New Adult Nonfiction Northern Africa Novella
## Female 13 425 21 247 347 0 3
## Male 9 395 21 79 640 3 1
##
## Novels Occult Own Paranormal Parenting Philosophy Plays Poetry
## Female 15 0 0 225 2 3 0 70
## Male 171 2 1 110 1 115 48 220
##
## Politics Polyamorous Psychology Pulp Realistic Fiction Reference
## Female 6 7 7 0 11 0
## Male 54 5 61 1 20 1
##
## Relationships Religion Retellings Romance Science Science Fiction
## Female 0 7 3 2302 0 300
## Male 2 116 0 1095 74 671
##
## Science Fiction Fantasy Self Help Sequential Art Sexuality
## Female 1 22 186 1
## Male 1 60 454 1
##
## Shapeshifters Short Stories Social Science Sociology
## Female 15 55 1 4
## Male 4 111 0 1
##
## Speculative Fiction Spirituality Sports Sports and Games Spy Thriller
## Female 0 13 1 25 0
## Male 1 35 9 33 3
##
## Suspense Teaching Thriller Travel Unfinished United States War
## Female 12 0 51 18 2 0 9
## Male 5 1 111 55 0 1 11
##
## Warfare Westerns Womens Fiction World War II Writing Young Adult
## Female 4 3 60 5 3 1514
## Male 2 12 36 6 2 983
The data is a little messy, with items that are very similar, columns such as “Spy Thriller” and Thriller, etc. we will try and clean this up for next time but for now, this isnt too terrible.
Lets do some hypothesis testing. I think that the average rating for an author will be around 2.5.
\[ H_0:\ \mu_F = \ 2.5\\\\\ H_A:\ \mu_F\neq \ 2.5 \]
t.test(data$author_average_rating, mu= 2.5)
##
## One Sample t-test
##
## data: data$author_average_rating
## t = 919.01, df = 22890, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
## 3.957254 3.963483
## sample estimates:
## mean of x
## 3.960368
Hot dang! I was wrong, the observed mean of authors average ratings is more near 4 than 2.5. 2.5 also lies outside of the 95% confidence interval. Therefore we reject the null hypothesis in favor of the alternative.
Do more reviews = higher average book ratings? Lets check it out! R= rating r= reviews \[ H_0:\ \mu_r\ = \ \mu_R\\\ H_A:\ \mu_F\neq\mu_R \]
lessrev= data[which(data$num_reviews> data$num_ratings
),]
morev = data[which(data$num_reviews<=data$num_ratings
),]
t.test(lessrev$book_average_rating , morev$book_average_rating)
##
## Welch Two Sample t-test
##
## data: lessrev$book_average_rating and morev$book_average_rating
## t = 6.4069, df = 11.012, p-value = 5.008e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3504600 0.7171876
## sample estimates:
## mean of x mean of y
## 4.485000 3.951176
This T test gives evidence to reject the null hypothesis in favor of the alternative.
lets take a look at the relationship between number of ratings and number of reviews. Number of ratings has been renamed RAT and number of reviews has been renamed REV
REV =c(data$num_reviews)
RAT =c(data$num_ratings)
cor(REV,RAT)
## [1] 0.8289449
model <-lm(REV~RAT)
plot(model)
ylist = c(1200,1400)
point <- data.frame(RAT = ylist)
predict(model,point, interval = 'predict')
## fit lwr upr
## 1 893.1073 -6603.357 8389.572
## 2 899.4026 -6597.062 8395.867
Above is a 95% confidence interval to predict the number of reviews based on the number of ratings.
##residuals(model)
hist(residuals(model))
I have chosen to not include the printed portion of all the residuals in this report because it would wipe out the remainder of the amazon rainforest if printed. The code to print the residuals has been commented out above. Instead I have displayed a histogram to help determine if the residuals follow a normal distribution. It is difficult to tell if this histogram is normally distributed or not. I’m going to say that it is because I believe there are some values to the left of the tallest bar but they are too short to be seen. the equation of the regression line is \[y = 0.03148x+855.3590\] the slope here represents that for every one review the number of ratings goes up approximately .03148.
cor.test(REV,RAT)
##
## Pearson's product-moment correlation
##
## data: REV and RAT
## t = 224.22, df = 22889, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8248481 0.8329547
## sample estimates:
## cor
## 0.8289449
the correlation coefficient, r, is equal to .829 this gives reason to conclude that there is a moderately strong positive correlation between the number of ratings and the number of reviews.
gender<- data$author_gender
fit <- lm(REV~RAT+gender,data=data)
print(fit)
##
## Call:
## lm(formula = REV ~ RAT + gender, data = data)
##
## Coefficients:
## (Intercept) RAT gendermale
## 936.5015 0.0315 -153.9378
above I have taken the linear model and fitted it to a categorical variable. here I’ll explore ratings and reviews alongside the author’s gender.
summary(fit)
##
## Call:
## lm(formula = REV ~ RAT + gender, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50326 -802 -673 -185 65447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.365e+02 3.737e+01 25.060 <2e-16 ***
## RAT 3.150e-02 1.405e-04 224.172 <2e-16 ***
## gendermale -1.539e+02 5.071e+01 -3.036 0.0024 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3824 on 22888 degrees of freedom
## Multiple R-squared: 0.6873, Adjusted R-squared: 0.6872
## F-statistic: 2.515e+04 on 2 and 22888 DF, p-value: < 2.2e-16
y <- data.frame(RAT=100, gender = "female")
predict(fit,y)
## 1
## 939.651
library(boot)
set.seed(16)
samp_mean <- function(x,i){
mean(x[i])
}
results <- boot(data$author_average_rating, samp_mean,1000)
plot(results)
boot.ci(results, type="perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 3.957, 3.963 )
## Calculations and Intervals on Original Scale
mean(data$author_average_rating)
## [1] 3.960368
t.test(data$author_average_rating, mu =4)
##
## One Sample t-test
##
## data: data$author_average_rating
## t = -24.94, df = 22890, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4
## 95 percent confidence interval:
## 3.957254 3.963483
## sample estimates:
## mean of x
## 3.960368
We can see from the results above that the confidence intervals are very ver similar. The confidence interval for the bootstrap mean has a little less accuracy than the interval for the actual data, but they are almost the same. We can also notice that the center of the confidence interval for the bootstrap is the actual mean of 3.96.
Now lets do a little bootstrapping on a non-parametric statistic. Lets look at the median of the number of ratings for each book in the data set.
#```{r bootfullMedian} set.seed(16) samp_median <- function(x,i){ median(x[i])}
paraboots <- boot(data$author_average_rating, samp_median, R=1000) plot(paraboots)
Things are not looking normal... This most likely has something to do with these being the average rating for each author and not the true rating that each user would give the author. On goodreads users are only able to rate authors with the whole value, so a 3 star rating exists but a 3.5 doesn't. Instead of continuing with this because it seems a little futile I'm going to switch it up and look at the number of ratings for each author. I'll go through the parametric and also the non-parametric below.
## Number of ratings Bootstrap
#Here is bootstrapping for the mean.
#```{r bootFullMean}
#set.seed(16)
#samp_mean <- function(x,i){
# mean(x[i])
#}
result2 <- boot(data$num_ratings, samp_mean,1000)
plot(results)
Now a couple confidence intervals.
#boot.ci(result2, type="perc")
mean(data$num_ratings)
## [1] 46683.49
Here is the confidence interval for number of ratings from the original dataset.
t.test(data$num_ratings, mu =46683.49)
##
## One Sample t-test
##
## data: data$num_ratings
## t = -1.8569e-06, df = 22890, p-value = 1
## alternative hypothesis: true mean is not equal to 46683.49
## 95 percent confidence interval:
## 44350.68 49016.30
## sample estimates:
## mean of x
## 46683.49
The confidence intervals for the bootstrap and for the original data are different, close but different. The true mean falls at about the middle of both of these intervals.
Now lets examine the median for number of ratings. I want to hypothesize that the median number of ratings is equal to the mean. \[ H_0:\ \ M = \ 46683.49\\\\\ H_A:\ \ M\neq \ 46683.49 \]
set.seed(16)
samp_median <- function(x,i){
median(x[i])}
np <- boot(data$num_ratings, samp_median, R=1000)
plot(np)
This looks much better than it did earlier. Things are approximately normally distributed, lets find a \(p\) value. \[
t=\frac{\mu-\overline x}{SE}
\]
xbar = np$t0
##Cross-Validation
I will be very candid in saying that cross validation is something that I don’t understand fully. I will try my best to illustrate what is necessary below but I am very uncomfortable with this portion of the project. #```{r} #library(caret) #SamplesID <- createDataPartition(data$author_id, p =.66, list=FALSE) #SamplesID
##{r} #traindata <- data[SamplesID,] #testdata <- data[-SamplesID,] #testdata ```
#{r} #model <- lm(data$num_reviews ~ data$num_ratings, data = testdata) #summary(model) #
#{r} #data[SamplesID,'Test_Train'] <- "Train" #data[-SamplesID,'Test_Train'] <- "Test" #data$Test_Train #
#{r} #library(ggplot2) #ggplot(data=data, mapping = aes(data$num_ratings,data$num_reviews,col=data$Test_Train,))+ # geom_jitter()+ # geom_smooth(method=lm) #
#{r} #pre <- predict(model, data[-SamplesID,]) #pre # #{r} #R2(pre,data[-SamplesID,"num_reviews"],"everything") # I’m not entirely certain why I am not getting a value when I run the line of code above. I’m assuming that the correlation coefficient would be somewhere close to the one from the predicted model which was around .6. This value leads to the conclusion that the correlation is a weak positive correlation, the value is not very close to one, therefore not very strong. If had gotten a value I am assuming that it would be close to the .6 value I had gotten before. # 10- Fold #More cross validation. More frustration. Things are going well!