Data description and data manipulations

hmdataOrig <- read.csv("~/IMB/R/HW1/HW1 dataset.csv")
set.seed(1)
hmdata <- sample_n(hmdataOrig, 5000, replace = FALSE)

head(hmdata)
##   bookID                                                                  title
## 1    475                      Collapse: How Societies Choose to Fail or Succeed
## 2  24271                                                  Skellig (Skellig  #1)
## 3  30982                        Scales of the Serpent (Diablo: The Sin War  #2)
## 4  32184                       Thanksgiving on Thursday (Magic Tree House  #27)
## 5  10943                                                   Absolutely Mahvelous
## 6  30985 El Escorpion: La Marca del Diablo: El Escorpion: The Mark of the Devil
##                                authors average_rating       isbn   isbn13
## 1                        Jared Diamond           3.93  143036556 9.78e+12
## 2                         David Almond           3.80  440229081 9.78e+12
## 3                     Richard A. Knaak           3.86  743471237 9.78e+12
## 4 Mary Pope Osborne/Salvatore Murdocca           3.85  375806156 9.78e+12
## 5            Billy Crystal/Dick Schaap           3.79  399512462 9.78e+12
## 6        Stephen Desberg/Enrico Marini           3.84 1594970009 9.78e+12
##   language_code num_pages ratings_count text_reviews_count publication_date
## 1           eng       608         52522               2780       12/27/2005
## 2           eng       208         17160               1782        9/11/2001
## 3           eng       327           866                 31        3/27/2007
## 4           eng        73          6361                209        9/24/2002
## 5           eng       128            24                  2        7/18/1986
## 6           spa        56            20                  2        10/1/2004
##                              publisher
## 1          Penguin Books Ltd. (London)
## 2                          Laurel Leaf
## 3                          Pocket Star
## 4 Random House Books for Young Readers
## 5                        Perigee Trade
## 6                  Public Square Books

The unit of observation in this data set are published books, while the sample size is 5000 number of books. The numeric (ratio) variables are measured based on the reviews and ratings left by the readers, number of pages and the language a book was published in.

Description of variables:

This dataset was obtained from Goodreads API (https://www.goodreads.com/api), which was last updated and available for the public in December 2020.

The aim of this regression analysis is to find out if there is a connection between the average rating and 3 independent variables. This will show how can a different language, number of pages and ratings count relate to average rating of a book.

Research question: What is the relationship between average rating (dependent) and language code, number of pages, and ratings count (explanatory variables)?

Removing the irrelevant variables.

hmdata <- subset(hmdata, select = -c(isbn, isbn13, publisher, publication_date, text_reviews_count))

Checking for missing values

colSums(is.na(hmdata))
##         bookID          title        authors average_rating  language_code 
##              0              0              0              0              0 
##      num_pages  ratings_count 
##              0              0

Eliminating books with 0 pages.

hmdata <- subset(hmdata,num_pages > 0 )

Adding factor variable.

hmdata$language_code <- as.factor(ifelse(hmdata$language_code == "eng" | hmdata$language_code == "en-US" | hmdata$language_code == "en-CA" | hmdata$language_code == "en-GB", "eng", "non eng"))

Graphical representation of the relationships

scatterplotMatrix(hmdata[c(-1,-2,-3)],
            smooth = FALSE)

There is a problem with distribution of the dependent variable, which could be caused by the number of reviews bellow 5 being unreliable and will be removed. The books that have 0 ratings (ratings_count), have the average rating is 0 as well. These outliers will be removed as it does not show the actual ratings.

hmdata <- hmdata[!(hmdata[,7] <= 5),]
scatterplotMatrix(hmdata[c(-1,-2,-3)],
            smooth = FALSE)

The state improved and will be re-estimated again after removing the outliers and units with high impact. There is a noticeable positive relationship between average rating and number of pages as well as average rating and ratings count. There seems to be a negligible relationship between the dependent variable and the categorical one - language code.

Estimating the linear regression model and running diagnostics

fit1 <- lm(average_rating~num_pages + ratings_count + language_code,
           data = hmdata)
summary(fit1)
## 
## Call:
## lm(formula = average_rating ~ num_pages + ratings_count + language_code, 
##     data = hmdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52942 -0.16806  0.01316  0.18634  1.02667 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.861e+00  7.321e-03 527.374  < 2e-16 ***
## num_pages            2.221e-04  1.799e-05  12.348  < 2e-16 ***
## ratings_count        1.049e-07  3.773e-08   2.781  0.00544 ** 
## language_codenon eng 7.686e-02  1.910e-02   4.023 5.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2759 on 4775 degrees of freedom
## Multiple R-squared:  0.03642,    Adjusted R-squared:  0.03582 
## F-statistic: 60.16 on 3 and 4775 DF,  p-value: < 2.2e-16

From the table it can be seen that all variables are significant and will be explained.

Multicollinearity

vif(fit1)
##     num_pages ratings_count language_code 
##      1.001830      1.002477      1.001431
mean(vif(fit1))
## [1] 1.001913

All the values are between 1 and 5 so we can continue using all of the variables at once. The average is close to 1, meaning that there is no indication of problems with multicollinearity.

Normality, Standardised residuals and Cooks distance

hmdata$StdResid <- round(rstandard(fit1),3)
hmdata$CooksD <- round(cooks.distance(fit1), 3)

#Test for normality 
shapiro.test(hmdata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  hmdata$StdResid
## W = 0.98794, p-value < 2.2e-16
hist(hmdata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col = "darkseagreen")

hist(hmdata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances",
     col = "darkseagreen",
     breaks = seq(0,0.4,0.01))

Since the sample is very large the violation of normality is not as important, even though it could cause the p-values to be off.

head(hmdata[order(hmdata$StdResid), c("bookID", "ratings_count", "StdResid")], 40) 
##      bookID ratings_count StdResid
## 1436  33993          5415   -5.543
## 1314  16021            33   -4.971
## 1387  18829          1449   -4.166
## 1866  24929         13152   -4.052
## 250   16717          3854   -3.992
## 1483   6613         23409   -3.988
## 3841  20367            10   -3.962
## 2850  42808            23   -3.892
## 1283  39441            44   -3.890
## 3121  42831             6   -3.879
## 2770  14917           114   -3.859
## 4446  13149          4320   -3.849
## 267   33032           541   -3.816
## 3747   6644          3602   -3.815
## 2822  32181            65   -3.686
## 4727  15610             9   -3.544
## 1593  19608           164   -3.498
## 3650    159           411   -3.447
## 2188  10713           106   -3.422
## 3541   4221            34   -3.341
## 2047  40597             8   -3.237
## 1960  36461             6   -3.229
## 803   32189             6   -3.205
## 1475  41261           181   -3.156
## 1694  42410            29   -3.154
## 4550  15379           402   -3.138
## 2922  41773             9   -3.118
## 2363  29320            13   -3.097
## 4275  33075           767   -3.091
## 775   35057             8   -3.072
## 2843  18412          1698   -3.048
## 3964  12466          2307   -2.998
## 865   38570           148   -2.983
## 2330   6976         68363   -2.972
## 2312  32941           199   -2.967
## 2476  27054             8   -2.944
## 1949   2126            31   -2.913
## 1834  44439            49   -2.886
## 2198  33574          2157   -2.883
## 1047  23201          1411   -2.882
head(hmdata[order(-hmdata$StdResid), c("bookID", "ratings_count", "StdResid")], 10)
##      bookID ratings_count StdResid
## 1841  15705            11    3.721
## 2424  37424             8    3.409
## 4420  16898             8    2.977
## 1717  35561            29    2.900
## 4603  24818         20308    2.855
## 2320  39201             9    2.832
## 4443  34545           102    2.675
## 2379  39661          2406    2.674
## 2611  35406            27    2.648
## 1530  17279            22    2.633

Any standardised residual that is more than 3 in absolute value will be removed.

hmdata <- subset(hmdata, StdResid < 3 & StdResid >-3 )
head(hmdata[order(-hmdata$CooksD),c("bookID", "CooksD")], 10) 
##      bookID CooksD
## 652    5107  0.067
## 1097    960  0.067
## 4427  18135  0.040
## 4637  15881  0.023
## 3868      2  0.015
## 565    2701  0.014
## 1222  41856  0.009
## 2304  18403  0.008
## 2795    865  0.008
## 3168  25516  0.008

Additionally, Cooks distance showed that there are 4 units with high impact, which will be removed since they are too high compared to the rest.

hmdata <- hmdata[!(hmdata[,9]>0.015),]
fit1 <- lm(average_rating~num_pages + ratings_count + language_code,
           data = hmdata)

Linearity and Homoscedasticity

hmdata$Std_fitted_values <- scale(fit1$fitted.values)
scatterplot(y = hmdata$StdResid, x = hmdata$Std_fitted_values,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = TRUE,
            smooth = FALSE,
            col = "darkgreen")

From the scatter plot above, it can be seen that not all regression assumptions are met:

  • Linearity: in the graph showing the residuals vs fitted scatter plot, we can see that linearity is not violated, therefore, it doesn’t violate the assumption

  • Homoscedasticity: heteroskedasticity can be noticed, since the variance of residuals is unequal over a range of the values in the sample. To support this Breusch Pagan Test will be conducted bellow.

ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                    Data                    
##  ------------------------------------------
##  Response : average_rating 
##  Variables: fitted values of average_rating 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    24.06133 
##  Prob > Chi2   =    9.33155e-07

The null hypothesis can be rejected at p>0.001, proving that the variance of the sample data is not constant.

Re-estimating the model with correction

Since we rejected the hypothesis that the variance is constant, we used lm_robust

fit2 <- lm_robust(average_rating ~ num_pages + ratings_count + language_code,
                  se_type = "HC1",
           data = hmdata)
summary(fit2)
## 
## Call:
## lm_robust(formula = average_rating ~ num_pages + ratings_count + 
##     language_code, data = hmdata, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                       Estimate Std. Error t value  Pr(>|t|)  CI Lower  CI Upper
## (Intercept)          3.869e+00  7.311e-03 529.198 0.000e+00 3.855e+00 3.883e+00
## num_pages            2.157e-04  1.711e-05  12.606 7.311e-36 1.822e-04 2.493e-04
## ratings_count        1.551e-07  3.302e-08   4.697 2.713e-06 9.038e-08 2.199e-07
## language_codenon eng 7.516e-02  1.859e-02   4.044 5.340e-05 3.873e-02 1.116e-01
##                        DF
## (Intercept)          4738
## num_pages            4738
## ratings_count        4738
## language_codenon eng 4738
## 
## Multiple R-squared:  0.03857 ,   Adjusted R-squared:  0.03796 
## F-statistic: 67.15 on 3 and 4738 DF,  p-value: < 2.2e-16
scatterplotMatrix(hmdata[c(-1,-2,-3, -8, -9, -10)],
            smooth = FALSE)

round(stat.desc(hmdata[c(-1,-2,-3, -5, -8, -9, -10)]), 1)
##              average_rating num_pages ratings_count
## nbr.val              4742.0    4742.0        4742.0
## nbr.null                0.0       0.0           0.0
## nbr.na                  0.0       0.0           0.0
## min                     3.1       1.0           6.0
## max                     4.8    2690.0     2153167.0
## range                   1.8    2689.0     2153161.0
## sum                 18720.7 1603801.0    77911377.0
## median                  4.0     303.0         861.0
## mean                    3.9     338.2       16430.1
## SE.mean                 0.0       3.2        1215.0
## CI.mean.0.95            0.0       6.3        2382.0
## var                     0.1   49486.9  7000368784.2
## std.dev                 0.3     222.5       83668.2
## coef.var                0.1       0.7           5.1

The minimum of the average rating is 3.1, while the maximum is 4.8. The mean is 3.9 and median is 4, while the coefficient of variance is low. The number of pages has a range of 2689 (large difference between min and max), while the dispersion around the mean is higher than the average rating, but significantly smaller than ratings count. The ratings count has a very high dispersion around the mean compared the the other two independent variables, which can also be seen from the minimum and maximum.

Interpretation of the estimated regression coefficients

  • If the number of pages of a book increases by 1 page then the average rating of the book increases on average by approximately 0.02%, ceteris paribus (p < 0.001).

  • If the count of book’s ratings increases by 1 rating then the average rating of the book increases on average by approximately 1.551e-5%, ceteris paribus (p < 0.001).

  • Given the values of all other variables, books written in non-English language have on average 7.5% higher average rating than books written in English language (p<0.001).

  • Multiple coefficient of determination: Adjusted R-squared has the value of 0.037, indicating 4% of variability of average rating is explained by the variability of 3 explanatory variables in the model.

  • F-statistic provides the information whether the regression model provides a better fit than a model that contains no independent variables:

H0: ρ2 = 0

H1: ρ2 > 0

We reject the H0 at p-value at p < 0.001. We found that at least some of the variability is explained by the above variables.

  • Multiple correlation coefficient: a square root of R
sqrt(summary(fit2)$r.squared) 
## [1] 0.1963865

The linear relationship between average rating and 3 explanatory variables is weak.