I found the data for this project at https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m. I downloaded the dataset, saved it as a .csv file, and uploaded it to my GitHub repository (Applied-Statistics). With this data I hope to analyze the relationships between the variables.

Cross-Validation

I will be very candid: cross-validation is something I don’t fully understand. I will do my best to illustrate what is necessary below, but I am very uncomfortable with this portion of the project.

DISCLAIMER: Much of this code has been commented out because I did not have enough RAM in RStudio to knit the cross-validation into a document otherwise.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
SamplesID <- createDataPartition(data$author_id, p =.66, list=FALSE)
#SamplesID
traindata <- data[SamplesID,]
testdata <- data[-SamplesID,]
#testdata
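# Note: the formula names data$... columns directly, so lm() ignores `data = testdata` and fits every row of data
model <- lm(data$num_reviews ~ data$num_ratings, data = testdata)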
summary(model)
## 
## Call:
## lm(formula = data$num_reviews ~ data$num_ratings, data = testdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50200   -817   -680   -190  65564 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.553e+02  2.611e+01   32.75   <2e-16 ***
## data$num_ratings 3.148e-02  1.404e-04  224.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3825 on 22889 degrees of freedom
## Multiple R-squared:  0.6871, Adjusted R-squared:  0.6871 
## F-statistic: 5.027e+04 on 1 and 22889 DF,  p-value: < 2.2e-16
data[SamplesID,'Test_Train'] <- "Train" 
data[-SamplesID,'Test_Train'] <- "Test"
#data$Test_Train
library(ggplot2)
ggplot(data = data, mapping = aes(num_ratings, num_reviews, col = Test_Train)) +
  geom_jitter()+
  geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

pre <- predict(model, data[-SamplesID,])
## Warning: 'newdata' had 7780 rows but variables found have 22891 rows
#pre
R2(pre,data[-SamplesID,"num_reviews"],"everything")

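The line of code above prints no value, and I believe there are two reasons. First, as far as I can tell, the third argument of caret’s R2() is formula, which expects "corr" or "traditional"; "everything" matches neither, so nothing is returned. Second, the warning from predict() shows that the model was fit on all 22891 rows rather than on a subset: the formula refers to data$num_reviews and data$num_ratings, so the data argument was not used and predict() returned 22891 predictions instead of 7780. These cannot be paired with the 7780 held-out observations. If I had gotten a value, I assume it would be close to the \(r^2\) of .687 from the linear model. Note that this is \(r^2\), not the correlation coefficient. An \(r^2\) of .687 corresponds to a correlation of about .83, which is a fairly strong positive correlation rather than a weak one.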
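The hold-out check that the code above was aiming for can be sketched as follows. This is a minimal illustration on simulated data, not the Goodreads file: the data frame `books`, its size, and the coefficients used to simulate it are all made up, and base R's `sample()` stands in for `createDataPartition()`. The two points it shows are fitting with bare column names on the training rows only, and squaring the correlation between the predictions and the held-out observations.

```r
# Simulated stand-in for the Goodreads data (hypothetical values, not the real file)
set.seed(1)
n <- 1000
books <- data.frame(num_ratings = round(rexp(n, rate = 1 / 20000)))
books$num_reviews <- 800 + 0.03 * books$num_ratings + rnorm(n, sd = 400)

# Keep about 66% of the rows for training, mirroring p = .66 above
train_id  <- sample(seq_len(n), size = round(0.66 * n))
traindata <- books[train_id, ]
testdata  <- books[-train_id, ]

# Bare column names let lm() take the variables from `data = traindata`
fit <- lm(num_reviews ~ num_ratings, data = traindata)

# predict() now returns exactly one prediction per held-out row
pred <- predict(fit, newdata = testdata)

# Squared correlation between predicted and observed values on the test set
holdout_r2 <- cor(pred, testdata$num_reviews)^2
holdout_r2
```

With caret loaded, `R2(pred, testdata$num_reviews)` should report the same quantity, assuming its default `formula = "corr"` is the squared correlation.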

10-Fold

More cross validation. More frustration. Things are going well!

rating=data$num_ratings
review=data$num_reviews
ar = data$author_average_rating

train.control <- trainControl(method="cv",number= 10)
model <- train(num_reviews ~ num_ratings, data = data,
             method = "lm",
             trControl=train.control)
print(model)
## Linear Regression 
## 
## 22891 samples
##     1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 20602, 20602, 20601, 20602, 20602, 20601, ... 
## Resampling results:
## 
##   RMSE    Rsquared   MAE     
##   3821.2  0.6904552  1448.771
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50200   -817   -680   -190  65564 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.553e+02  2.611e+01   32.75   <2e-16 ***
## num_ratings 3.148e-02  1.404e-04  224.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3825 on 22889 degrees of freedom
## Multiple R-squared:  0.6871, Adjusted R-squared:  0.6871 
## F-statistic: 5.027e+04 on 1 and 22889 DF,  p-value: < 2.2e-16

Alright! I was able to get the 10-fold cross-validation to work properly. I cannot compare it with the train/test split from the previous section because I could not get that code to return a value. Instead, I will compare the 10-fold result with the linear model. The \(r^2\) values are very similar: .687 in the linear model and .690 in the 10-fold cross-validation.
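To make the 10-fold procedure more concrete, here is a rough sketch of what it does, written in base R on simulated data. The data frame `books` and its values are invented for illustration, and the loop is a simplified stand-in for `train()` with `trainControl(method = "cv", number = 10)`, not caret's actual implementation. Each fold is held out once, the model is fit on the other nine, and the per-fold metrics are averaged.

```r
# Simulated stand-in for the Goodreads data (hypothetical values, not the real file)
set.seed(2)
n <- 1000
books <- data.frame(num_ratings = round(rexp(n, rate = 1 / 20000)))
books$num_reviews <- 800 + 0.03 * books$num_ratings + rnorm(n, sd = 400)

k <- 10
# Randomly assign each row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = n))

# One row of metrics per fold: fit on k - 1 folds, evaluate on the held-out fold
cv_results <- t(sapply(seq_len(k), function(i) {
  train <- books[fold != i, ]
  test  <- books[fold == i, ]
  fit   <- lm(num_reviews ~ num_ratings, data = train)
  pred  <- predict(fit, newdata = test)
  err   <- test$num_reviews - pred
  c(RMSE     = sqrt(mean(err^2)),
    Rsquared = cor(pred, test$num_reviews)^2,
    MAE      = mean(abs(err)))
}))

# Average over the folds, analogous to the single row shown by print(model)
colMeans(cv_results)
```

This is also a plausible reason the two \(r^2\) values are so close: the figure from summary() is computed on the same rows used for fitting, while the cross-validated figure is an average over held-out folds, and with one predictor and about 22,900 rows there is little overfitting to separate them.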