Author: Amnah Mahmood

Class: Data 101

Semester: Spring 2021

Date: 05/13/2021

Introduction

Abstract

The dataset I chose for my final project is called the Computer Science Books dataset. For this project I wanted to find out whether Price is linearly related to Rating, Reviews, Number of Pages, or Type of book. To answer this question I constructed a statistical model that predicts book price from a combination of these variables. After some exploratory data analysis I built a multiple linear regression model using backward elimination. The most important finding of my work is that, based on the data provided, a linear regression may not be a suitable model for predicting the price of a book unless further transformations are applied to the response variable, the predictor variables, or both.

Topic Statement:

This dataset is available on Kaggle (https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books) and contains a list of 270 books on computer science and programming-related topics. As someone who is planning to get a master’s degree, I became interested in this dataset to find out which Computer Science books are the most popular in terms of their rating, their number of reviews, and how affordable they are. The list of books was constructed using different websites, and the 270 most popular books were selected. The dataset is relevant not only to those involved in the field of Computer Science but also to others who may want to learn more about the subject. It may also be useful to publishers and those in the printing business who are interested in calculating profit margins. Even though plenty of articles and other data are available, it appears that limited prior work has been done to construct and analyze tidy datasets similar to this one. The purpose of this work is therefore to serve as a guide for those who want to visually compare popular books on different factors, and to introduce the process of creating a parsimonious linear model for this dataset.

The dataset provides information such as the number of pages in the book, the type of binding, a brief description, and the price of the book and contains 4 quantitative and 3 categorical variables which are as follows:

Variables

Rating: (num) The user rating for the book. Ranges between 0 and 5
Reviews: (num) The number of reviews found for this book
Book_title: (chr) The name of the book
Description: (chr) A short description of the book
Number_Of_Pages: (num) Number of pages in the book
Type: (chr) The type of binding of the book, i.e. whether it is a hardcover, an ebook, a Kindle book, etc.
Price: (num) The average price of the book in USD, where the average is calculated across the 5 web sources

Analysis

Loading necessary libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggthemes)
library(forcats)

Setting current working directory

setwd("~/Desktop/DATA110")

Reading in the dataset

df <- read.csv("computer_science_books.csv")

Printing summary statistics

summary(df)
##      Rating        Reviews           Book_title        Description       
##  Min.   :3.000   Length:271         Length:271         Length:271        
##  1st Qu.:3.915   Class :character   Class :character   Class :character  
##  Median :4.100   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :4.067                                                           
##  3rd Qu.:4.250                                                           
##  Max.   :5.000                                                           
##  Number_Of_Pages      Type               Price        
##  Min.   :  50.0   Length:271         Min.   :  9.324  
##  1st Qu.: 289.0   Class :character   1st Qu.: 30.751  
##  Median : 384.0   Mode  :character   Median : 46.318  
##  Mean   : 475.1                      Mean   : 54.542  
##  3rd Qu.: 572.5                      3rd Qu.: 67.854  
##  Max.   :3168.0                      Max.   :235.650

Printing structure of the dataset

str(df)
## 'data.frame':    271 obs. of  7 variables:
##  $ Rating         : num  4.17 4.01 3.33 3.97 4.06 3.84 4.09 4.15 3.87 4.62 ...
##  $ Reviews        : chr  "3,829" "1,406" "0" "1,658" ...
##  $ Book_title     : chr  "The Elements of Style" "The Information: A History, a Theory, a Flood" "Responsive Web Design Overview For Beginners" "Ghost in the Wires: My Adventures as the World's Most Wanted Hacker" ...
##  $ Description    : chr  "This style manual offers practical advice on improving writing skills. Throughout, the emphasis is on promoting"| __truncated__ "James Gleick, the author of the best sellers Chaos and Genius, now brings us a work just as astonishing and mas"| __truncated__ "In Responsive Web Design Overview For Beginners, you'll get an overview of what to expect when building a respo"| __truncated__ "If they were a hall of fame or shame for computer hackers, a Kevin Mitnick plaque would be mounted the near the"| __truncated__ ...
##  $ Number_Of_Pages: int  105 527 50 393 305 288 256 368 259 128 ...
##  $ Type           : chr  "Hardcover" "Hardcover" "Kindle Edition" "Hardcover" ...
##  $ Price          : num  9.32 11 11.27 12.87 13.16 ...

Formatting dataset

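Values in the Reviews column contain a comma “,” as a thousands separator, so R reads the column as a character variable. Let’s remove the commas and convert the column to a numeric variable. The line that would assign more readable variable names is left commented out, so the original column names are used throughout the analysis.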

df$Reviews <- as.numeric(gsub(",", "", df$Reviews))
#names(df) <- c("Rating", "Reviews", "Book Title", "Description", "Number of Pages", "Type", "Price")
head(df)
##   Rating Reviews
## 1   4.17    3829
## 2   4.01    1406
## 3   3.33       0
## 4   3.97    1658
## 5   4.06    1325
## 6   3.84     117
##                                                            Book_title
## 1                                               The Elements of Style
## 2                       The Information: A History, a Theory, a Flood
## 3                        Responsive Web Design Overview For Beginners
## 4 Ghost in the Wires: My Adventures as the World's Most Wanted Hacker
## 5                                                    How Google Works
## 6                                                    The Meme Machine
##                                                                                                                                                                                                                                                                                                                                                                                                                             Description
## 1                                                                                                                                                                                       This style manual offers practical advice on improving writing skills. Throughout, the emphasis is on promoting a plain English style. This little book can help you communicate more effectively by showing you how to enliven your sentences.
## 2                             James Gleick, the author of the best sellers Chaos and Genius, now brings us a work just as astonishing and masterly: a revelatory chronicle and meditation that shows how information has become the modern era’s defining quality—the blood, the fuel, the vital principle of our world.\n \nThe story of information begins in a time profoundly unlike our own, when every thought and\n\n\n\n...more
## 3 In Responsive Web Design Overview For Beginners, you'll get an overview of what to expect when building a responsive website.\n\nYou'll learn about all of the following:\nResponsive Web Design Overview\nUsability of Smaller Screens\nWhy Plugins Aren't the Solution\nWhy a Responsive Web Design Theme May Not Be Best for Your Existing Website\nRisks Involved with Responsive Web Design F\n\n\n\n\n\n\n\n\n\n\n\n\n\n...more
## 4                                      If they were a hall of fame or shame for computer hackers, a Kevin Mitnick plaque would be mounted the near the entrance. While other nerds were fumbling with password possibilities, this adept break-artist was penetrating the digital secrets of Sun Microsystems, Digital Equipment Corporation, Nokia, Motorola, Pacific Bell, and other mammoth enterprises. His Ghost in the W\n...more
## 5                                Both Eric Schmidt and Jonathan Rosenberg came to Google as seasoned Silicon Valley business executives, but over the course of a decade they came to see the wisdom in Coach John Wooden's observation that 'it's what you learn after you know it all that counts'. As they helped grow Google from a young start-up to a global icon, they relearned everything they knew about manag\n\n\n\n...more
## 6                 What is a meme? First coined by Richard Dawkins in 'The Selfish Gene', a meme is any idea, behavior, or skill that can be transferred from one person to another by imitation: stories, fashions, inventions, recipes, songs, ways of plowing a field or throwing a baseball or making a sculpture.\n\nThe meme is also one of the most important--and controversial--concepts to emerge s\n\n\n\n\n\n\n\n\n\n...more
##   Number_Of_Pages           Type     Price
## 1             105      Hardcover  9.323529
## 2             527      Hardcover 11.000000
## 3              50 Kindle Edition 11.267647
## 4             393      Hardcover 12.873529
## 5             305 Kindle Edition 13.164706
## 6             288      Paperback 14.188235
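The conversion above can be checked in isolation. The following is a minimal sketch that applies the same gsub() and as.numeric() steps to the first four Reviews values printed by str(df); the names raw_reviews and clean_reviews are illustrative only and are not part of the original analysis.

```r
# First four Reviews values as printed by str(df): text with thousands separators
raw_reviews <- c("3,829", "1,406", "0", "1,658")

# Remove every comma (fixed = TRUE treats "," as a literal character), then convert
clean_reviews <- as.numeric(gsub(",", "", raw_reviews, fixed = TRUE))

# An NA here would signal a value that is still not a plain number
stopifnot(!anyNA(clean_reviews))
clean_reviews
```

Checking for NA right after the conversion is a cheap safeguard, since as.numeric() only issues a warning when a value cannot be parsed.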

Let’s start with some visualizations to develop an understanding of different variables.

How many of each type of books are in the dataset?

book_count <- df %>%
  group_by(Type) %>%
  summarize(count = n()) %>%
  ggplot(mapping = aes(x = reorder(Type, desc(count)), y = count, fill = Type)) +
  geom_col(width = 0.5)+
  labs(title = "Type of Binding by Book Count", subtitle = "How many of each type of books are in the dataset?", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
  ylab("Book Count") +
  theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 25, vjust = .7, hjust = 0.5), axis.title.x  = element_blank()) +
  scale_fill_manual(name = 'Book Binding', 
                    labels = c("Boxed Set", "eBook", "Hardcover", "Kindle Edition", "Paperback", "Unknown Binding"),
                    values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60"))
                   # values = c("#84a59d", "#cb997e" , "#f4acb7" , "#ffc9b9", "#f28482" , "#f6bd60"))
book_count

The plot above shows that the two most common types of books are Paperback and Hardcover, which appear to make up the majority of the books in the dataset. Let’s look at exactly what that percentage is in our next visualization.

Loading necessary libraries

library(ggrepel)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(IRdisplay)

Plotting the percentage of each type of book in the dataset

book_percent <- df %>%
  group_by(Type) %>%
  summarize(count = n(), percentage = n()/nrow(df)) %>%
 plot_ly(labels = ~factor(Type), 
             values = ~percentage, 
             marker = list(colors = c("#84a59d", "#f7ede2", "#f5cac3" , "#cb997e" , "#f28482" , "#f6bd60")),
             hole = 0.4,
             type = 'pie', 
             textposition = 'outside',
             textinfo = 'label+percent') %>%
  layout(title = 'Percent breakdown by Book Binding',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = TRUE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = TRUE))
book_percent

Once again we see that Paperback and Hardcover are the most common types of books, together making up over 90% of the total number of books.
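The combined share can also be computed as a number rather than read off the chart. The sketch below uses base R's table() and prop.table() on the Type values of the six rows printed by head(df), so the result only illustrates the calculation and does not reproduce the figure for the full dataset; the names type, share and combined are assumptions made for this example.

```r
# Type values of the six rows shown by head(df); not the full dataset
type <- c("Hardcover", "Hardcover", "Kindle Edition",
          "Hardcover", "Kindle Edition", "Paperback")

# Proportion of books in each binding category
share <- prop.table(table(type))

# Combined proportion of the two categories of interest
combined <- sum(share[c("Paperback", "Hardcover")])
round(100 * combined, 1)
```

Run against the full data frame, the same two lines with df$Type in place of type would give the percentage quoted in the text.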

Next, let’s look at the most expensive books in the dataset.

What are the 10 Most Expensive Books?

df$Book_title[df$Book_title == "3D Game Engine Architecture: Engineering Real-Time Applications with Wild Magic (The Morgan Kaufmann Series in Interactive 3d Technology)"] = "3D Game Engine Architecture: Engineering Real-Time Applications with Wild Magic"
price <- df %>%
  arrange(desc(Price)) %>%
  slice(1:10) %>%
  ggplot(mapping = aes(x = reorder(Book_title, Price) , y =  Price, fill = Book_title)) +
  geom_col(width=.5) +
  ggtitle('What are the 10 Most Expensive Books?') +
  labs(caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
  theme(text=element_text(size=10,  family='Times New Roman')) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
  scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60",
                               "#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
  coord_flip() 
  
price

The plot above shows that “A Discipline for Software Engineering” is the most expensive book in the dataset, costing just over $235, followed closely by “The Art of Computer Programming, Volumes 1-4a Boxed Set”. Let’s print all the data pertaining to the most expensive book.

Printing Most Expensive Book

df %>%
  filter(Book_title == "A Discipline for Software Engineering")
##   Rating Reviews                            Book_title
## 1   3.84       5 A Discipline for Software Engineering
##                                                                                                                                                                                                                                                                                                                                                                                       Description
## 1 Designed to help individual programmers develop software more effectively and successfully, this book presents a scaled-down version of Humphrey's popular methods for managing the software process. It: presents concepts and methods for a disciplined software engineering process for individual programmers, following the Capability Maturity Model (CMM); scales down industria ...more
##   Number_Of_Pages      Type  Price
## 1             789 Hardcover 235.65

Next, I am interested to see if the most expensive books are also rated the highest.

Let’s look at the highest rated books in the dataset.

What are the 10 Highest Rated Books?

rating <- df %>%
  arrange(desc(Rating)) %>%
  slice(1:10) %>%
  ggplot(mapping = aes(x = reorder(Book_title, Rating) , y =  Rating, fill = Book_title)) +
  geom_col(width=.5) +
  ggtitle('What are the 10 Highest \nRated Books?') +
  labs(caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
  theme(text=element_text(size=10,  family='Times New Roman')) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
  scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60",
                               "#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
  coord_flip() 
  
rating

The plot above shows that “Your First App: Node.js”, which is an ebook, is rated the highest and is the only book with a perfect rating of 5. We can also see that the most expensive books are not the ones rated the highest.

Let’s print all the data pertaining to this book.

Printing Highest Rated Book

df %>%
  filter(Book_title == "Your First App: Node.js")
##   Rating Reviews              Book_title
## 1      5       0 Your First App: Node.js
##                                                                                                                                                                                                                                                                                                                                                                                                      Description
## 1 A tutorial for real-world application development using Node.js, AngularJS, Express and MongoDB\n\nPublished via Leanpub, and can be found here: https://leanpub.com/yfa-nodejs\n\nIn this book, you will learn the basics of designing and developing a node.js application. Unlike other books, focus here is on learning multiple technologies at once and defining processes for success. T\n\n\n\n...more
##   Number_Of_Pages  Type    Price
## 1             317 ebook 25.85588
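Whether the most expensive books are also the highest rated can be checked directly by intersecting the two top-n lists instead of comparing two plots by eye. This is a rough sketch on the six rows printed by head(df), with n = 3 instead of 10 because the sample is so small; the helper top_n_titles() is hypothetical and not part of the original code.

```r
# Six rows taken from the head(df) output; not the full dataset
books <- data.frame(
  Book_title = c("The Elements of Style",
                 "The Information: A History, a Theory, a Flood",
                 "Responsive Web Design Overview For Beginners",
                 "Ghost in the Wires: My Adventures as the World's Most Wanted Hacker",
                 "How Google Works",
                 "The Meme Machine"),
  Rating = c(4.17, 4.01, 3.33, 3.97, 4.06, 3.84),
  Price  = c(9.323529, 11.000000, 11.267647, 12.873529, 13.164706, 14.188235),
  stringsAsFactors = FALSE
)

# Hypothetical helper: titles of the n rows with the largest value in `column`
top_n_titles <- function(data, column, n) {
  data$Book_title[order(data[[column]], decreasing = TRUE)][seq_len(n)]
}

top_price  <- top_n_titles(books, "Price", 3)
top_rating <- top_n_titles(books, "Rating", 3)

# Titles that appear in both lists
both <- intersect(top_price, top_rating)
both
```

An empty or short intersection supports the observation that price and rating do not move together in this data.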

Let’s look at the price for each of these books.

What is the Price of the 10 Highest Rated Books?

rating_price <- df %>%
  arrange(desc(Rating)) %>%
  slice(1:10) %>%
  ggplot(mapping = aes(x = reorder(Book_title, Price), y = Price, fill = Book_title)) +
  geom_col(width = 0.5)+
  ggtitle("Highest Rated Books") +
  labs(subtitle = "by Price", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
  xlab('Book Title') +
  ylab('Price') +
  theme(text=element_text(size=10,  family="Times New Roman")) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
   scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60",
                               "#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
  coord_flip() 

rating_price

Among the ten highest rated books, the most expensive is “The Art of Computer Programming, Volumes 1-4a Boxed Set”, with a price tag of over $200. This is not surprising, because the set consists of 4 books whereas the other books in the dataset are listed individually.

Let’s look at the number of pages for each of these books.

What is the Number of Pages in the 10 Highest Rated Books?

rating_pages <- df %>%
  arrange(desc(Rating)) %>%
  slice(1:10) %>%
  ggplot(mapping = aes(x = reorder(Book_title, Number_Of_Pages), y = Number_Of_Pages, fill = Book_title)) +
  geom_col(width = 0.5)+
  ggtitle('Highest Rated Books') +
  labs(subtitle = "by Number of Pages", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
  xlab('Book Title') +
  ylab('Number of Pages') +
  theme(text=element_text(size=10,  family="Times New Roman")) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
   scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60",
                               "#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
  coord_flip() 

rating_pages

Among the ten highest rated books, the one with the most pages is “The Art of Computer Programming, Volumes 1-4a Boxed Set”, with over 3000 pages. Once again, this is not surprising, because the set consists of 4 books and combines their page counts, whereas the other books in the dataset are listed individually.

Next, I am interested to see if the highest rated books also have the most reviews.

What are the 10 Books with the Most Reviews?

reviews <- df %>%
  arrange(desc(Reviews)) %>%
  slice(1:10) %>%
  ggplot(mapping = aes(x = reorder(Book_title, Reviews), y =  Reviews, fill = Book_title)) +
  geom_col(width = 0.5) +
  ggtitle('What are the top 10 Books with the \nMost Number of Reviews?') +
  labs(caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
  ylab('Number of Reviews') +
  theme(text=element_text(size=10,  family="Times New Roman")) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
  scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60",
                               "#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
                      #c("#f6f4d2", "#cbdfbd" , "#f19c79", "#a44a3f" , "#660708" , 
                               #"#f6f4d2", "#cbdfbd" , "#f19c79", "#a44a3f" , "#660708")) +
  coord_flip() 
reviews

The plot above shows that “Start with Why: How Great Leaders Inspire Everyone to Take Action”, which is a hardcover, is the most reviewed book with close to 6000 reviews. We can also see that the highest rated books are not the ones with the most reviews.

Let’s print all the data pertaining to this book.

Printing Most Reviewed Book

df %>%
  filter(Book_title == "Start with Why: How Great Leaders Inspire Everyone to Take Action")
##   Rating Reviews
## 1   4.09    5938
##                                                          Book_title
## 1 Start with Why: How Great Leaders Inspire Everyone to Take Action
##                                                                                                                                                                                                                                                                                                                                                                                                                      Description
## 1 Why do you do what you do?\n\nWhy are some people and organizations more innovative, more influential, and more profitable than others? Why do some command greater loyalty from customers and employees alike? Even among the successful, why are so few able to repeat their success over and over?\n\nPeople like Martin Luther King Jr., Steve Jobs, and the Wright Brothers might have lit\n\n\n\n\n\n\n\n\n\n\n\n...more
##   Number_Of_Pages      Type    Price
## 1             256 Hardcover 14.23235

Let’s look at the price for each of these books.

What is the Price of the 10 Most Reviewed Books?

reviews_price <- df %>%
  arrange(desc(Reviews)) %>%
  slice(1:10) %>%
  ggplot(mapping = aes(x = reorder(Book_title, Price), y = Price, fill = Book_title)) +
  geom_col(width = 0.5)+
  ggtitle('Most Reviewed Books') +
  labs(subtitle = "by Price", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
  xlab('Book Title') +
  ylab('Price') +
  theme(text=element_text(size=10,  family="Times New Roman")) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
   scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60",
                               "#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
  coord_flip() 

reviews_price

The most expensive of the most reviewed books is called “The Goal: A Process of Ongoing Improvement” costing over $35.

Let’s look at the number of pages for each of these books.

What is the Number of Pages in the 10 Most Reviewed Books?

reviews_pages <- df %>%
  arrange(desc(Reviews)) %>%
  slice(1:10) %>%
  ggplot(mapping = aes(x = reorder(Book_title, Number_Of_Pages), y = Number_Of_Pages, fill = Book_title)) +
  geom_col(width = 0.5)+
  ggtitle('Most Reviewed Books') +
  labs(subtitle = "by Number of Pages", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
  xlab('Book Title') +
  ylab('Number of Pages') +
  theme(text=element_text(size=10,  family="Times New Roman")) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
   scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60",
                               "#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
  coord_flip() 

reviews_pages

Among the ten most reviewed books, the one with the most pages is “The Innovators”, followed closely by “The Information”; both run to over 500 pages.

The next thing I would like to look at is the average rating, the average number of reviews and the average price, grouped by the type of binding, to see if we can gather any new information.

What is the Average (Rating, Number of Reviews and Price) Grouped by the Type of Binding?

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
avg_rating <- df %>%
  group_by(Type) %>%
  summarize(avgRating = mean(Rating)) %>%
  ggplot(mapping = aes(x = Type, y =  avgRating, fill = Type)) +
  geom_col(width = 0.5) +
  labs(subtitle = "Average Rating \nby Book Binding") +
  ylab('Mean of Ratings') +
  theme(text=element_text(size=10,  family="Times New Roman")) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
  scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60")) +
  coord_flip() 


avg_reviews <- df %>%
  group_by(Type) %>%
  summarize(avgReviews = mean(Reviews)) %>%
  ggplot(mapping = aes(x = Type, y =  avgReviews, fill = Type)) +
  geom_col(width = 0.5) +
  labs(subtitle = "Average Number of \nReviews by \nBook Binding") +
  ylab('Mean of Reviews') +
  theme(text=element_text(size=10,  family="Times New Roman")) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
  scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60")) +
  coord_flip() 

avg_price <- df %>%
  group_by(Type) %>%
  summarize(avgPrice = mean(Price)) %>%
  ggplot(mapping = aes(x = Type, y =  avgPrice, fill = Type)) +
  geom_col(width = 0.5) +
  labs(subtitle = "Average Price \nby Book Binding") +
  ylab('Mean of Price') +
  theme(text=element_text(size=10,  family="Times New Roman")) +
  theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
  scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" ,  "#f28482" , "#f6bd60")) + 
  coord_flip() 

grid.arrange(avg_rating, avg_reviews, avg_price, ncol = 3)

From this plot we can see that Boxed Set - Hardcover is the category with the highest average rating, the lowest average number of reviews and the highest average price. Hardcover, on the other hand, has a slightly lower rating but the highest number of reviews as well as a lower price. If price is an issue, it may be worth considering whether the boxed set is worth it or whether it is better to buy each book in the set individually.
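The three grouped averages behind these panels can also be produced as a single table, which makes the categories easier to compare side by side. Below is a minimal sketch using base R's aggregate() on the six rows printed by head(df), so only three binding types appear; the object names books, by_type and group_size are illustrative assumptions.

```r
# Six rows taken from the head(df) output; not the full dataset
books <- data.frame(
  Type    = c("Hardcover", "Hardcover", "Kindle Edition",
              "Hardcover", "Kindle Edition", "Paperback"),
  Rating  = c(4.17, 4.01, 3.33, 3.97, 4.06, 3.84),
  Reviews = c(3829, 1406, 0, 1658, 1325, 117),
  Price   = c(9.323529, 11.000000, 11.267647, 12.873529, 13.164706, 14.188235),
  stringsAsFactors = FALSE
)

# Mean of each numeric column within each binding type
by_type <- aggregate(cbind(Rating, Reviews, Price) ~ Type, data = books, FUN = mean)

# Number of books per type, to judge how much weight each mean deserves
group_size <- table(books$Type)

by_type
group_size
```

Printing the group sizes next to the means is a deliberate choice: an average over one or two books, as may be the case for a rare binding type, is far less stable than an average over many.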

library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

What is the Book Rating by Price when grouped by the Type of Binding?

highcharter1 <- highchart() %>%
   hc_add_series(data = df,
                 type = "scatter",
                   hcaes(x = Price,
                   y = Rating, 
                   group = Type)) %>%
  hc_chart(style = list(fontFamily = "Times New Roman", fontWeight = "bold")) %>%
  hc_xAxis(title = list(text="Price")) %>%
  hc_yAxis(title = list(text="Rating")) %>%
  hc_colors(c("#84a59d", "#cbdfbd" , "#f6bd60", "#a44a3f" ,  "#f28482","#f19c79")) %>%
  hc_title(
    text = "What is the Book Rating by Price when grouped by Type of Binding?",
    margin = 30,
    align = "center",
    style = list(color = "black", useHTML = TRUE)) %>%
  hc_tooltip(shared = TRUE,
             borderColor = "black",
             pointFormat = "Price: {point.Price} <br> Rating: {point.Rating}")
highcharter1

Based on the plot, we notice that Paperback, Hardcover and Kindle Edition books are all rated approximately the same, but Kindle editions are usually the least expensive, followed by paperbacks and then hardcovers. We also observe a paperback that costs as much as some of the most expensive hardcover books in the dataset. Next, we will find out which paperback this is and print its record.

paperback <- df %>%
  filter(Type == "Paperback" & Price > 200)

paperback
##   Rating Reviews                          Book_title
## 1   3.94      22 An Introduction to Database Systems
##                                                                                                                                                                                                                                                                                                                                                                                       Description
## 1 Continuing in the eighth edition, An Introduction to Database Systems provides a comprehensive introduction to the now very large field of database systems by providing a solid grounding in the foundations of database technology while shedding some light on how the field is likely to develop in the future. This new edition has been rewritten and expanded to stay current wi ...more
##   Number_Of_Pages      Type    Price
## 1            1040 Paperback 212.0971

The book is “An Introduction to Database Systems”, priced at over $200, which makes it the most expensive paperback in the dataset.
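The filter above relies on a price threshold read off the scatter plot. An alternative that needs no threshold is to sort by price within each binding type and keep the first row per type. This is a sketch on the six rows printed by head(df), not the author's method; the names ranked and top_per_type are assumptions made for the example.

```r
# Six rows taken from the head(df) output; not the full dataset
books <- data.frame(
  Book_title = c("The Elements of Style",
                 "The Information: A History, a Theory, a Flood",
                 "Responsive Web Design Overview For Beginners",
                 "Ghost in the Wires: My Adventures as the World's Most Wanted Hacker",
                 "How Google Works",
                 "The Meme Machine"),
  Type  = c("Hardcover", "Hardcover", "Kindle Edition",
            "Hardcover", "Kindle Edition", "Paperback"),
  Price = c(9.323529, 11.000000, 11.267647, 12.873529, 13.164706, 14.188235),
  stringsAsFactors = FALSE
)

# Sort by Type, then by Price from highest to lowest within each type
ranked <- books[order(books$Type, -books$Price), ]

# The first row of each type is now its most expensive book
top_per_type <- ranked[!duplicated(ranked$Type), c("Type", "Book_title", "Price")]
top_per_type
```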

What is the Maximum Price and Average Number of Pages by the Type of Book Binding?

df2 <- df %>%
  select(Type, Price, Number_Of_Pages) %>%
  group_by(Type) %>%
  summarize(maxPrice = max(Price), avgPages = mean(Number_Of_Pages))

highcharter2 <- highchart() %>%
  hc_yAxis_multiples(
    list(title = list(text = "Maximum Price of Book")),
    list(title = list(text = "Average Number of Pages"), opposite = TRUE)
    ) %>%
  hc_xAxis(title = list(text="Type of Binding")) %>%
  hc_add_series(data = df2$maxPrice,
                name = "Maximum Price",
                type = "column",
                yAxis = 0) %>%
  hc_add_series(data = df2$avgPages,
                name = "Average Number of Pages",
                type = "line",
                yAxis = 1) %>%
  hc_xAxis(categories = df2$Type,
           tickInterval = 1) %>%
  hc_title(
    text = "What is the Maximum Price and Average Number of Pages by the Type of Book Binding?",
    margin = 30,
    align = "center",
    style = list(color = "black", useHTML = TRUE)
    ) %>% 
    hc_colors(c("#84a59d","#f28482")) %>%
    hc_chart(style = list(fontFamily = "Times New Roman",
                        fontWeight = "bold"))

highcharter2

We can see that Hardcovers are among the most expensive books and also have the highest average number of pages. They are followed closely by the Boxed Set and Paperbacks, while ebooks and Kindle editions are among the least expensive books and have the lowest average number of pages.

Multiple Linear Regression

In this section, we will construct a multiple linear regression model to predict the price of a book. A linear regression model rests on four assumptions, which can be tested using diagnostic plots. The assumptions are as follows:

Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of the residuals is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.
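The diagnostic plots mentioned above come from calling plot() on a fitted lm object. The sketch below uses R's built-in mtcars data purely as a stand-in, so the model and its variables are assumptions and say nothing about the book data; it shows which of the four default plots speaks to which assumption.

```r
# Stand-in model on the built-in mtcars data (not the book dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Arrange the four default diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(fit)
# Residuals vs Fitted   : curvature suggests a violation of linearity
# Normal Q-Q            : points off the line suggest non-normal residuals
# Scale-Location        : a trend suggests non-constant variance
# Residuals vs Leverage : flags observations with large influence on the fit
par(op)

# Optional numeric check of residual normality (Shapiro-Wilk test)
normality <- shapiro.test(residuals(fit))
normality$p.value
```

Independence is not visible in these plots; it depends on how the data were collected and usually has to be argued from the study design.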

Which variables are best correlated according to the Heatmap?

library(corrplot)
## corrplot 0.84 loaded
df_Numeric_Variable <- select_if(df, is.numeric)
colnames(df_Numeric_Variable)
## [1] "Rating"          "Reviews"         "Number_Of_Pages" "Price"
corr <- cor(df_Numeric_Variable)
# show only the lower triangle because a correlation matrix is symmetric about its main diagonal
corrplot(corr,method = "color",type = "lower", tl.cex=0.9, cl.cex = 0.6, tl.col="black")

The correlation coefficient is a value between -1 and 1, inclusive, and tells how strong or weak the linear relationship is. Values close to +/- 1 represent a strong correlation, where the sign matches the direction of the linear slope; values close to +/- 0.5 represent a moderate correlation; and values close to zero indicate little or no linear correlation.

From the heatmap above, it appears that Price and Number of Pages are the most strongly correlated pair of variables in the data.
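A heatmap shows the pattern at a glance, but the same information can be read off as a ranked list. The sketch below assumes a simulated data frame `sim` with the same numeric column names as the dataset; the values are made up for illustration only.

```r
# Simulated stand-in data (hypothetical values; column names mirror the dataset)
set.seed(101)
n <- 200
sim <- data.frame(
  Rating = runif(n, 3, 5),
  Reviews = rpois(n, 150),
  Number_Of_Pages = round(runif(n, 100, 1200))
)
sim$Price <- 20 + 0.06 * sim$Number_Of_Pages + rnorm(n, sd = 15)

corr <- cor(sim)

# Correlation of each predictor with Price, strongest first (by absolute value)
price_corr <- corr[setdiff(rownames(corr), "Price"), "Price"]
price_corr <- price_corr[order(abs(price_corr), decreasing = TRUE)]
price_corr

# cor.test() adds a confidence interval and a p-value for a single pair
pages_test <- cor.test(sim$Number_Of_Pages, sim$Price)
pages_test$conf.int
```

Sorting by absolute value keeps strong negative correlations near the top, which a plain sort would push to the bottom.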

I will create a model to predict the price of a book based on multiple predictors and through an iterative process try to find the model that is parsimonious. With multiple regression there are several strategies for comparing variable inputs into a model. We will use backward elimination and start with all possible predictors (Rating, Reviews, Number of Pages and Type) with our response variable being Price. We are now ready to create our model.
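Backward elimination can also be partly automated in base R. The sketch below uses simulated stand-in data (the `sim` data frame and its values are hypothetical). Note that `step()` compares models by AIC, which is a different criterion from the p-value-based elimination carried out by hand in this report, so the two approaches may not always agree.

```r
# Simulated stand-in data (hypothetical values; column names mirror the dataset)
set.seed(101)
n <- 200
sim <- data.frame(
  Rating = runif(n, 3, 5),
  Reviews = rpois(n, 150),
  Number_Of_Pages = round(runif(n, 100, 1200)),
  Type = factor(sample(c("Hardcover", "Paperback", "ebook"), n, replace = TRUE))
)
sim$Price <- 20 + 0.06 * sim$Number_Of_Pages +
  10 * (sim$Type == "Hardcover") + rnorm(n, sd = 15)

full <- lm(Price ~ Rating + Reviews + Number_Of_Pages + Type, data = sim)

# One F-test per term; a factor such as Type is tested as a whole
# rather than one level at a time
drop1(full, test = "F")

# Automated backward elimination (AIC-based)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```

`drop1()` is the closer match to the manual procedure: it shows which single term could be removed at each stage, leaving the decision to the analyst.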

First Model

fit1 <- lm(Price ~ Rating+Reviews+Number_Of_Pages+Type, data = df)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Rating + Reviews + Number_Of_Pages + Type, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.018 -14.040  -1.195   9.994 149.551 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.859e+02  3.462e+01   5.368 1.75e-07 ***
## Rating              -4.471e+00  5.380e+00  -0.831    0.407    
## Reviews             -1.371e-02  2.875e-03  -4.767 3.10e-06 ***
## Number_Of_Pages      6.148e-02  5.517e-03  11.143  < 2e-16 ***
## Typeebook           -1.384e+02  2.693e+01  -5.138 5.42e-07 ***
## TypeHardcover       -1.310e+02  2.532e+01  -5.175 4.54e-07 ***
## TypeKindle Edition  -1.508e+02  2.656e+01  -5.678 3.61e-08 ***
## TypePaperback       -1.453e+02  2.534e+01  -5.736 2.67e-08 ***
## TypeUnknown Binding -1.444e+02  3.095e+01  -4.666 4.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.05 on 262 degrees of freedom
## Multiple R-squared:  0.5225, Adjusted R-squared:  0.5079 
## F-statistic: 35.84 on 8 and 262 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit1)
## Warning: not plotting observations with leverage one:
##   269

cor(df$Rating, df$Price)
## [1] 0.04669738
cor(df$Reviews, df$Price)
## [1] -0.2548988
cor(df$Number_Of_Pages, df$Price)
## [1] 0.6369506

The p-values for Reviews, Number of Pages, and each level of Type are marked with 3 asterisks, which suggests they are meaningful variables for explaining the linear change in the price of a book, while Rating (p = 0.407) is not. We also need to look at the adjusted R-squared value. It states that about 51% of the variation in Price may be explained by the model. In other words, about 49% of the variation in the data is likely not explained by this model.
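The adjusted R-squared can be reproduced from the other numbers in the summary, which shows how it penalizes the number of predictors. This is a small arithmetic sketch using the standard formula, adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (residual degrees of freedom); the variable names are chosen here for illustration.

```r
# Values read from the first model's summary output
r2 <- 0.5225       # Multiple R-squared
df_resid <- 262    # residual degrees of freedom
k <- 8             # estimated coefficients, excluding the intercept
n <- df_resid + k + 1   # observations used in the fit

adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)   # 0.5079, matching the reported value
```

Because the penalty grows with each added coefficient, adjusted R-squared can fall when a weak predictor is added, whereas the multiple R-squared can only stay the same or rise.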

We usually pay great attention to regression results, such as p-values, R-squared or adjusted R-squared that tell us how well a model represents given data but that’s not the whole picture. We should also look at diagnostic plots to not only check if the linear regression assumptions are met but to improve our model in an exploratory way.

In this case, the Residuals vs Fitted plot does not show equally spread residuals without distinct patterns, which is a good indication that a linear model may not be the best fit for the data. The Normal Q-Q plot shows whether the residuals are normally distributed, and they do seem to follow a straight line. The next plot is the Scale-Location plot, which checks the assumption of equal variance (homoscedasticity). It’s a good sign if the red line is horizontal with the points spread randomly around it, which is definitely not the case here. Finally, in the last plot we watch for outlying values in the upper right or lower right corner, as those are the places where cases can be influential against a regression line. The plot identifies observation #70 as influential.
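Influence can be quantified directly as well as read from the last diagnostic plot. The sketch below uses simulated stand-in data with one planted unusual case; the `sim` data frame and the planted values are hypothetical and do not correspond to observation #70 in the book data.

```r
# Simulated stand-in data (hypothetical; not the Computer Science Books data)
set.seed(101)
n <- 100
sim <- data.frame(pages = runif(n, 100, 1200))
sim$price <- 20 + 0.06 * sim$pages + rnorm(n, sd = 15)

# Plant one unusual case: a very long book priced far below the trend
sim <- rbind(sim, data.frame(pages = 3000, price = 10))

fit <- lm(price ~ pages, data = sim)

cd  <- cooks.distance(fit)   # influence of each case on the fitted coefficients
lev <- hatvalues(fit)        # leverage of each case, between 0 and 1
worst <- unname(which.max(cd))

# Refit without the most influential case and compare the coefficients
fit_drop <- update(fit, data = sim[-worst, ])
rbind(all_cases = coef(fit), without_worst = coef(fit_drop))
```

A leverage of exactly one means the fitted value is forced to equal the observed value for that case. One common cause, which may explain the warning about observation 269, is a factor level that contains only a single observation.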

In other cases, where a linear model appears to be a good fit, I would have removed the outlier and rerun the model, but here it is evident that a linear regression may not be a good fit for the data, so removing an outlier won’t improve the model much. Lastly, looking at the correlation coefficients, we see that Rating and Reviews are only weakly correlated with Price and that the strongest correlation is between Price and Number of Pages, which corroborates the information we got from the heatmap above. At this point, we return to our summary and try removing the least meaningful variable, which is Rating, to see if that improves the model at all.

Based on the summary statistics, we eliminate Rating and rerun the model.

Second Model

fit1 <- lm(Price ~ Reviews+Number_Of_Pages+Type, data = df)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Reviews + Number_Of_Pages + Type, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -81.957 -14.691  -1.637   9.692 150.593 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.664e+02  2.551e+01   6.524 3.50e-10 ***
## Reviews             -1.391e-02  2.863e-03  -4.858 2.04e-06 ***
## Number_Of_Pages      6.078e-02  5.449e-03  11.154  < 2e-16 ***
## Typeebook           -1.378e+02  2.691e+01  -5.122 5.85e-07 ***
## TypeHardcover       -1.293e+02  2.522e+01  -5.126 5.75e-07 ***
## TypeKindle Edition  -1.491e+02  2.646e+01  -5.634 4.52e-08 ***
## TypePaperback       -1.438e+02  2.525e+01  -5.693 3.33e-08 ***
## TypeUnknown Binding -1.426e+02  3.086e+01  -4.622 5.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.04 on 263 degrees of freedom
## Multiple R-squared:  0.5213, Adjusted R-squared:  0.5085 
## F-statistic: 40.91 on 7 and 263 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit1)
## Warning: not plotting observations with leverage one:
##   269

cor(df$Reviews, df$Price)
## [1] -0.2548988
cor(df$Number_Of_Pages, df$Price)
## [1] 0.6369506

Once again, we don’t notice a significant improvement in the diagnostic plots, and they all look much the same as before. Our adjusted R-squared remained essentially unchanged at about 51% (0.5079 versus 0.5085), and the p-values for Reviews, Number of Pages, and each level of Type still have 3 asterisks, which suggests that they are all meaningful variables. So, we select the simpler (more parsimonious) model, which predicts Price from Reviews, Number of Pages, and Type, while keeping in mind that we may want to explore a better model for our data. We found earlier that Price is most strongly correlated with Number of Pages out of all the variables in the data, so it makes sense that this variable stays in the model.
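When one model is nested inside the other, as here, the two fits can also be compared with a partial F-test rather than by eye. The sketch below uses simulated stand-in data (the `sim` data frame and its values are hypothetical) and is an additional check, not a step performed in this report.

```r
# Simulated stand-in data (hypothetical values; column names mirror the dataset)
set.seed(101)
n <- 200
sim <- data.frame(
  Rating = runif(n, 3, 5),
  Reviews = rpois(n, 150),
  Number_Of_Pages = round(runif(n, 100, 1200))
)
sim$Price <- 20 + 0.06 * sim$Number_Of_Pages - 0.05 * sim$Reviews +
  rnorm(n, sd = 15)

with_rating    <- lm(Price ~ Rating + Reviews + Number_Of_Pages, data = sim)
without_rating <- lm(Price ~ Reviews + Number_Of_Pages, data = sim)

# Partial F-test: a large p-value means Rating adds little
# once the other predictors are already in the model
cmp <- anova(without_rating, with_rating)
cmp

# Adjusted R-squared and AIC side by side
fit_stats <- data.frame(
  model  = c("with Rating", "without Rating"),
  adj_r2 = c(summary(with_rating)$adj.r.squared,
             summary(without_rating)$adj.r.squared),
  AIC    = c(AIC(with_rating), AIC(without_rating))
)
fit_stats
```

For a single numeric predictor such as Rating, this F-test gives the same p-value as the t-test shown for that coefficient in the summary of the larger model.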

Conclusion

In the analysis above, we observe that our summary statistics remained more or less the same, but judging from the diagnostic plots we conclude that a linear regression may not be a suitable model for this dataset. As the name suggests, linear regression can only fit a linear relationship between the dependent and independent variables. If we observe that the linear regression assumptions are not met, we can apply a transformation to the response variable, the predictor variables, or both. For example, the first assumption associated with a linear regression is Linearity: the relationship between X and the mean of Y is linear. When it is violated, transforming a variable (for instance, taking its logarithm) may bring the relationship closer to linear. Linear regression also works best when the inputs or independent variables are not strongly correlated with each other. When they are, this “multicollinearity” might give some very unintuitive results. Lasso and Ridge regression work by putting a cost on this kind of complexity, in the form of a penalty on large coefficients, to maintain simplicity. As there are a ton of different transformations to choose from, in the words of Professor Rachel Saidi, a lot of this process is subjective and there is really an “art and science” to knowing which transformation to apply. It does, however, demand in-depth knowledge of your data and of exactly what you are trying to accomplish; otherwise this process will become a lot more cumbersome than it needs to be.
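As one illustration of transforming the response variable, the sketch below fits the same model to a raw and a log-transformed price. It uses simulated stand-in data in which the spread of price grows with its level; the `sim` data frame and the choice of a log transformation are assumptions for illustration, and the transformation was not tested on the book data in this report.

```r
# Simulated stand-in data (hypothetical; not the Computer Science Books data)
set.seed(101)
n <- 200
sim <- data.frame(pages = runif(n, 100, 1200))
# Multiplicative noise, so the spread of price grows with its level
sim$price <- exp(2.5 + 0.002 * sim$pages + rnorm(n, sd = 0.4))

fit_raw <- lm(price ~ pages, data = sim)
fit_log <- lm(log(price) ~ pages, data = sim)

# Rough check of equal variance: rank correlation between the absolute
# residuals and the fitted values (values near zero suggest an even spread)
spread_raw <- cor(abs(residuals(fit_raw)), fitted(fit_raw), method = "spearman")
spread_log <- cor(abs(residuals(fit_log)), fitted(fit_log), method = "spearman")
c(raw = spread_raw, log = spread_log)

# Predictions from the log model are on the log scale; exp() maps them back
pred <- exp(predict(fit_log, newdata = data.frame(pages = 500)))
pred
```

The diagnostic plots from `plot(fit_log)` can then be compared with those of the untransformed model. A back-transformed prediction is better read as a typical (median) price than as a mean price.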

Through my background research I came across two articles in particular that mention some of the more sought-after programming books for beginners. The first features Hadley Wickham, Chief Scientist at RStudio and creator of many packages for the R programming language, who lists the best books to help aspiring data scientists build solid computer science fundamentals (Mathieu, 2019). The second, published on medium.com and titled Get Started With A Collection Of 247 Free Computer Science Books, lists 247 free computer science ebooks covering a reasonable range of programming topics. Through my research I also discovered that plenty of data on computer science publications is readily available; however, tidy datasets that contain an entire collection of books for data science purposes are few and far between.

Sources

“Get Started With A Collection Of 247 Free Computer Science Books.” Medium, Medium, 18 Aug. 2019,

Mathieu, Edouard. “Computer Science for Data Scientists: Hadley Wickham on Five Books.” Five Books, 22 Feb. 2019, fivebooks.com/best-books/computer-science-data-science-hadley-wickham/