library(dplyr)   # sample_n()
library(car)     # scatterplotMatrix(), scatterplot(), vif()
hmdataOrig <- read.csv("~/IMB/R/HW1/HW1 dataset.csv")
set.seed(1)
hmdata <- sample_n(hmdataOrig, 5000, replace = FALSE)
head(hmdata)
## bookID title
## 1 475 Collapse: How Societies Choose to Fail or Succeed
## 2 24271 Skellig (Skellig #1)
## 3 30982 Scales of the Serpent (Diablo: The Sin War #2)
## 4 32184 Thanksgiving on Thursday (Magic Tree House #27)
## 5 10943 Absolutely Mahvelous
## 6 30985 El Escorpion: La Marca del Diablo: El Escorpion: The Mark of the Devil
## authors average_rating isbn isbn13
## 1 Jared Diamond 3.93 143036556 9.78e+12
## 2 David Almond 3.80 440229081 9.78e+12
## 3 Richard A. Knaak 3.86 743471237 9.78e+12
## 4 Mary Pope Osborne/Salvatore Murdocca 3.85 375806156 9.78e+12
## 5 Billy Crystal/Dick Schaap 3.79 399512462 9.78e+12
## 6 Stephen Desberg/Enrico Marini 3.84 1594970009 9.78e+12
## language_code num_pages ratings_count text_reviews_count publication_date
## 1 eng 608 52522 2780 12/27/2005
## 2 eng 208 17160 1782 9/11/2001
## 3 eng 327 866 31 3/27/2007
## 4 eng 73 6361 209 9/24/2002
## 5 eng 128 24 2 7/18/1986
## 6 spa 56 20 2 10/1/2004
## publisher
## 1 Penguin Books Ltd. (London)
## 2 Laurel Leaf
## 3 Pocket Star
## 4 Random House Books for Young Readers
## 5 Perigee Trade
## 6 Public Square Books
The unit of observation in this data set is a published book, and the sample size is 5,000 books. The numeric (ratio) variables (average_rating, num_pages, ratings_count, text_reviews_count) are based on the ratings and reviews left by readers and on the length of the book, while language_code records the language a book was published in.
This dataset was obtained from the Goodreads API (https://www.goodreads.com/api), which was last updated and publicly available as of December 2020.
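As a quick check of the variable types described above (an added sketch, not part of the original output), the structure and size of the sampled data can be inspected:
# Inspect variable types and the number of sampled observations
str(hmdata)
nrow(hmdata)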
The aim of this regression analysis is to find out whether there is a relationship between the average rating and three independent variables. This will show how the language, the number of pages, and the ratings count relate to the average rating of a book.
Research question: What is the relationship between average rating (dependent variable) and language code, number of pages, and ratings count (explanatory variables)?
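In equation form, the model to be estimated is average_rating = β0 + β1·num_pages + β2·ratings_count + β3·non_eng + ε, where non_eng is an indicator equal to 1 for books not written in English (created below) and 0 otherwise.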
Removing the irrelevant variables.
hmdata <- subset(hmdata, select = -c(isbn, isbn13, publisher, publication_date, text_reviews_count))
Checking for missing values.
colSums(is.na(hmdata))
## bookID title authors average_rating language_code
## 0 0 0 0 0
## num_pages ratings_count
## 0 0
Eliminating books with 0 pages.
hmdata <- subset(hmdata, num_pages > 0)
Collapsing language_code into a two-level factor: English vs. non-English.
hmdata$language_code <- as.factor(ifelse(hmdata$language_code %in% c("eng", "en-US", "en-CA", "en-GB"), "eng", "non eng"))
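A frequency table (an added check, not in the original output) confirms the recoding into two levels:
# Count books in each language group after recoding
table(hmdata$language_code)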
scatterplotMatrix(hmdata[c(-1,-2,-3)],
smooth = FALSE)
There is a problem with the distribution of the dependent variable. Books with 5 or fewer ratings are unreliable: in particular, books with a ratings_count of 0 also have an average rating of 0, which does not reflect actual reader ratings. These observations will be removed.
hmdata <- subset(hmdata, ratings_count > 5)
scatterplotMatrix(hmdata[c(-1,-2,-3)],
smooth = FALSE)
The distribution improved; the model will be re-estimated later, after removing outliers and high-influence units. There is a noticeable positive relationship between average rating and number of pages, as well as between average rating and ratings count. The relationship between the dependent variable and the categorical variable (language code) appears negligible.
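The pairwise correlations behind these observations can also be checked numerically (an added sketch, not part of the original output):
# Pearson correlations between the numeric variables
cor(hmdata[, c("average_rating", "num_pages", "ratings_count")])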
fit1 <- lm(average_rating~num_pages + ratings_count + language_code,
data = hmdata)
summary(fit1)
##
## Call:
## lm(formula = average_rating ~ num_pages + ratings_count + language_code,
## data = hmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52942 -0.16806 0.01316 0.18634 1.02667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.861e+00 7.321e-03 527.374 < 2e-16 ***
## num_pages 2.221e-04 1.799e-05 12.348 < 2e-16 ***
## ratings_count 1.049e-07 3.773e-08 2.781 0.00544 **
## language_codenon eng 7.686e-02 1.910e-02 4.023 5.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2759 on 4775 degrees of freedom
## Multiple R-squared: 0.03642, Adjusted R-squared: 0.03582
## F-statistic: 60.16 on 3 and 4775 DF, p-value: < 2.2e-16
The summary table shows that all explanatory variables are statistically significant; their coefficients will be interpreted after the regression assumptions have been checked.
vif(fit1)
## num_pages ratings_count language_code
## 1.001830 1.002477 1.001431
mean(vif(fit1))
## [1] 1.001913
All VIF values lie between 1 and 5, so all of the variables can be kept in the model. The mean VIF is close to 1, meaning there is no indication of multicollinearity problems.
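As an illustration of what vif() computes, the VIF of num_pages can be reproduced by hand as 1/(1 - R²) from an auxiliary regression of num_pages on the other explanatory variables (an added sketch):
# Auxiliary regression: how well do the other predictors explain num_pages?
aux <- lm(num_pages ~ ratings_count + language_code, data = hmdata)
1 / (1 - summary(aux)$r.squared)   # should match vif(fit1)["num_pages"]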
hmdata$StdResid <- round(rstandard(fit1),3)
hmdata$CooksD <- round(cooks.distance(fit1), 3)
#Test for normality
shapiro.test(hmdata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: hmdata$StdResid
## W = 0.98794, p-value < 2.2e-16
hist(hmdata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "darkseagreen")
hist(hmdata$CooksD,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances",
col = "darkseagreen",
breaks = seq(0,0.4,0.01))
The Shapiro-Wilk test rejects normality of the standardized residuals (p < 0.001). However, since the sample is very large, this violation is of limited importance, although it could still make the p-values somewhat less reliable.
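A normal Q-Q plot (an added visual check, not in the original output) shows where the residual distribution departs from normality:
# Compare standardized residuals with the theoretical normal quantiles
qqnorm(hmdata$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(hmdata$StdResid, col = "darkgreen")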
head(hmdata[order(hmdata$StdResid), c("bookID", "ratings_count", "StdResid")], 40)
## bookID ratings_count StdResid
## 1436 33993 5415 -5.543
## 1314 16021 33 -4.971
## 1387 18829 1449 -4.166
## 1866 24929 13152 -4.052
## 250 16717 3854 -3.992
## 1483 6613 23409 -3.988
## 3841 20367 10 -3.962
## 2850 42808 23 -3.892
## 1283 39441 44 -3.890
## 3121 42831 6 -3.879
## 2770 14917 114 -3.859
## 4446 13149 4320 -3.849
## 267 33032 541 -3.816
## 3747 6644 3602 -3.815
## 2822 32181 65 -3.686
## 4727 15610 9 -3.544
## 1593 19608 164 -3.498
## 3650 159 411 -3.447
## 2188 10713 106 -3.422
## 3541 4221 34 -3.341
## 2047 40597 8 -3.237
## 1960 36461 6 -3.229
## 803 32189 6 -3.205
## 1475 41261 181 -3.156
## 1694 42410 29 -3.154
## 4550 15379 402 -3.138
## 2922 41773 9 -3.118
## 2363 29320 13 -3.097
## 4275 33075 767 -3.091
## 775 35057 8 -3.072
## 2843 18412 1698 -3.048
## 3964 12466 2307 -2.998
## 865 38570 148 -2.983
## 2330 6976 68363 -2.972
## 2312 32941 199 -2.967
## 2476 27054 8 -2.944
## 1949 2126 31 -2.913
## 1834 44439 49 -2.886
## 2198 33574 2157 -2.883
## 1047 23201 1411 -2.882
head(hmdata[order(-hmdata$StdResid), c("bookID", "ratings_count", "StdResid")], 10)
## bookID ratings_count StdResid
## 1841 15705 11 3.721
## 2424 37424 8 3.409
## 4420 16898 8 2.977
## 1717 35561 29 2.900
## 4603 24818 20308 2.855
## 2320 39201 9 2.832
## 4443 34545 102 2.675
## 2379 39661 2406 2.674
## 2611 35406 27 2.648
## 1530 17279 22 2.633
Any observation whose standardised residual exceeds 3 in absolute value will be removed.
hmdata <- subset(hmdata, StdResid < 3 & StdResid > -3)
head(hmdata[order(-hmdata$CooksD),c("bookID", "CooksD")], 10)
## bookID CooksD
## 652 5107 0.067
## 1097 960 0.067
## 4427 18135 0.040
## 4637 15881 0.023
## 3868 2 0.015
## 565 2701 0.014
## 1222 41856 0.009
## 2304 18403 0.008
## 2795 865 0.008
## 3168 25516 0.008
Additionally, Cook's distance shows that there are 4 units with high influence, whose distances are well above the rest; these will be removed.
hmdata <- subset(hmdata, CooksD <= 0.015)
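As a cross-check of this ad hoc cutoff, a common rule of thumb flags observations with a Cook's distance above 4/n, which is a much stricter threshold and would typically flag more units than the 4 removed here (an added sketch):
# Number of remaining observations exceeding the 4/n rule-of-thumb threshold
sum(hmdata$CooksD > 4 / nrow(hmdata))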
fit1 <- lm(average_rating ~ num_pages + ratings_count + language_code,
data = hmdata)
# Recompute the standardized residuals from the re-estimated model so that the
# residuals and the fitted values plotted below come from the same fit
hmdata$StdResid <- round(rstandard(fit1), 3)
hmdata$Std_fitted_values <- scale(fit1$fitted.values)
scatterplot(y = hmdata$StdResid, x = hmdata$Std_fitted_values,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = TRUE,
smooth = FALSE,
col = "darkgreen")
From the scatter plot above, it can be seen that not all regression assumptions are met:
Linearity: the residuals vs. fitted values plot shows no systematic pattern, so the linearity assumption is not violated.
Homoscedasticity: heteroskedasticity can be noticed, since the variance of the residuals is not constant across the range of fitted values. To confirm this, the Breusch-Pagan test is conducted below.
library(olsrr)   # ols_test_breusch_pagan()
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ------------------------------------------
## Response : average_rating
## Variables: fitted values of average_rating
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 24.06133
## Prob > Chi2 = 9.33155e-07
The null hypothesis can be rejected at p < 0.001, indicating that the error variance is not constant.
Since the hypothesis of constant variance was rejected, the model is re-estimated with heteroskedasticity-robust (HC1) standard errors using lm_robust.
library(estimatr)   # lm_robust()
fit2 <- lm_robust(average_rating ~ num_pages + ratings_count + language_code,
se_type = "HC1",
data = hmdata)
summary(fit2)
##
## Call:
## lm_robust(formula = average_rating ~ num_pages + ratings_count +
## language_code, data = hmdata, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper
## (Intercept) 3.869e+00 7.311e-03 529.198 0.000e+00 3.855e+00 3.883e+00
## num_pages 2.157e-04 1.711e-05 12.606 7.311e-36 1.822e-04 2.493e-04
## ratings_count 1.551e-07 3.302e-08 4.697 2.713e-06 9.038e-08 2.199e-07
## language_codenon eng 7.516e-02 1.859e-02 4.044 5.340e-05 3.873e-02 1.116e-01
## DF
## (Intercept) 4738
## num_pages 4738
## ratings_count 4738
## language_codenon eng 4738
##
## Multiple R-squared: 0.03857 , Adjusted R-squared: 0.03796
## F-statistic: 67.15 on 3 and 4738 DF, p-value: < 2.2e-16
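To see how much the robust standard errors differ from the classical ones, the two fits can be compared side by side (an added sketch, assuming fit1 is the re-estimated classical model from above):
# Classical (OLS) vs. heteroskedasticity-robust (HC1) standard errors
cbind(classical = coef(summary(fit1))[, "Std. Error"],
      robust = fit2$std.error)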
scatterplotMatrix(hmdata[c(-1,-2,-3, -8, -9, -10)],
smooth = FALSE)
library(pastecs)   # stat.desc()
round(stat.desc(hmdata[c(-1,-2,-3, -5, -8, -9, -10)]), 1)
## average_rating num_pages ratings_count
## nbr.val 4742.0 4742.0 4742.0
## nbr.null 0.0 0.0 0.0
## nbr.na 0.0 0.0 0.0
## min 3.1 1.0 6.0
## max 4.8 2690.0 2153167.0
## range 1.8 2689.0 2153161.0
## sum 18720.7 1603801.0 77911377.0
## median 4.0 303.0 861.0
## mean 3.9 338.2 16430.1
## SE.mean 0.0 3.2 1215.0
## CI.mean.0.95 0.0 6.3 2382.0
## var 0.1 49486.9 7000368784.2
## std.dev 0.3 222.5 83668.2
## coef.var 0.1 0.7 5.1
The average rating ranges from a minimum of 3.1 to a maximum of 4.8, with a mean of 3.9 and a median of 4, and its coefficient of variation is low. The number of pages has a range of 2,689 (a large difference between the minimum and the maximum); its dispersion around the mean is higher than that of the average rating, but considerably smaller than that of the ratings count. The ratings count has a very high dispersion around the mean compared to the other two independent variables, which can also be seen from its minimum and maximum.
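The coefficient of variation reported by stat.desc is simply the standard deviation divided by the mean; for example (an added quick check):
# Coefficient of variation of ratings_count (compare with coef.var above)
sd(hmdata$ratings_count) / mean(hmdata$ratings_count)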
If the number of pages of a book increases by 1, the average rating of the book increases on average by approximately 0.0002 rating points (about 0.02 points per 100 additional pages), ceteris paribus (p < 0.001).
If the number of a book's ratings increases by 1, the average rating of the book increases on average by approximately 1.55e-07 rating points, ceteris paribus (p < 0.001); the effect only becomes noticeable for books with very large ratings counts.
Given the values of all other variables, books written in a non-English language have on average a 0.075 points higher average rating than books written in English (p < 0.001).
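To illustrate these interpretations, the implied effects can be read directly off the robust fit (an added sketch; the 100-page increment is a made-up example):
# Implied change in average rating for 100 extra pages, other things equal
coef(fit2)["num_pages"] * 100
# Implied difference between a non-English and an English book, other things equal
coef(fit2)["language_codenon eng"]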
Multiple coefficient of determination: the adjusted R-squared is 0.038, indicating that roughly 4% of the variability in average rating is explained by the variability of the three explanatory variables in the model.
The F-statistic tests whether the regression model provides a better fit than a model containing no independent variables:
H0: ρ2 = 0
H1: ρ2 > 0
We reject H0 at p < 0.001 and conclude that at least some of the variability in average rating is explained by the explanatory variables in the model.
sqrt(summary(fit2)$r.squared)
## [1] 0.1963865
The square root of R-squared gives the multiple correlation coefficient, approximately 0.20, so the linear relationship between average rating and the three explanatory variables is weak.