The dataset I chose for my final project is called the Computer Science Books dataset. For this project I wanted to find out whether Price is linearly related to Rating, Reviews, Number of Pages, or Type of book. To answer this question I constructed a statistical model that predicts book price from a combination of these variables. After some exploratory data analysis I built a multiple linear regression model using backward elimination. The most relevant finding of my work is that, based on the data provided, linear regression may not be a suitable model for predicting the price of a book unless further transformations are applied to the response variable, the predictor variables, or both.
This dataset is available on Kaggle (https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books) and contains a list of 270 books on computer science and programming-related topics. As someone who is planning to get a master’s degree, I became interested in this dataset to find out which Computer Science books are the most popular in terms of their rating, the number of reviews, or how affordable they are. The list of books was constructed using different websites and the 270 most popular books were selected. The dataset is relevant not only to those involved in the field of Computer Science but also to others who may want to learn more about the subject. It may also be useful to publishers and those in the printing business who may be interested in calculating profit margins. Even though there are plenty of articles and other data available, it appears that limited prior work has been done to construct and analyze tidy datasets similar to this one. So, the purpose of this work is to serve as a guide for those wanting to visually compare popular books based on different factors, and to introduce the process of creating a parsimonious linear model for this dataset.
The dataset provides information such as the number of pages in the book, the type of binding, a brief description, and the price of the book and contains 4 quantitative and 3 categorical variables which are as follows:
Rating: (num) The user rating for the book, ranging between 0 and 5
Reviews: (num) The number of reviews found for the book
Book_title: (chr) The name of the book
Description: (chr) A short description of the book
Number_Of_Pages: (num) Number of pages in the book
Type: (chr) The type of binding of the book, i.e. whether it is a hardcover, an ebook, a Kindle book, etc.
Price: (num) The average price of the book in USD, where the average is calculated across the 5 web sources
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggthemes)
library(forcats)
setwd("~/Desktop/DATA110")
df <- read.csv("computer_science_books.csv")
summary(df)
## Rating Reviews Book_title Description
## Min. :3.000 Length:271 Length:271 Length:271
## 1st Qu.:3.915 Class :character Class :character Class :character
## Median :4.100 Mode :character Mode :character Mode :character
## Mean :4.067
## 3rd Qu.:4.250
## Max. :5.000
## Number_Of_Pages Type Price
## Min. : 50.0 Length:271 Min. : 9.324
## 1st Qu.: 289.0 Class :character 1st Qu.: 30.751
## Median : 384.0 Mode :character Median : 46.318
## Mean : 475.1 Mean : 54.542
## 3rd Qu.: 572.5 3rd Qu.: 67.854
## Max. :3168.0 Max. :235.650
str(df)
## 'data.frame': 271 obs. of 7 variables:
## $ Rating : num 4.17 4.01 3.33 3.97 4.06 3.84 4.09 4.15 3.87 4.62 ...
## $ Reviews : chr "3,829" "1,406" "0" "1,658" ...
## $ Book_title : chr "The Elements of Style" "The Information: A History, a Theory, a Flood" "Responsive Web Design Overview For Beginners" "Ghost in the Wires: My Adventures as the World's Most Wanted Hacker" ...
## $ Description : chr "This style manual offers practical advice on improving writing skills. Throughout, the emphasis is on promoting"| __truncated__ "James Gleick, the author of the best sellers Chaos and Genius, now brings us a work just as astonishing and mas"| __truncated__ "In Responsive Web Design Overview For Beginners, you'll get an overview of what to expect when building a respo"| __truncated__ "If they were a hall of fame or shame for computer hackers, a Kevin Mitnick plaque would be mounted the near the"| __truncated__ ...
## $ Number_Of_Pages: int 105 527 50 393 305 288 256 368 259 128 ...
## $ Type : chr "Hardcover" "Hardcover" "Kindle Edition" "Hardcover" ...
## $ Price : num 9.32 11 11.27 12.87 13.16 ...
Values in the Reviews column contain a comma “,” so R treats the column as a character variable. Let’s remove the commas and convert it to a numeric variable. A line that would assign more readable variable names is included but left commented out, so the original column names are used throughout.
df$Reviews <- as.numeric(gsub(",", "", df$Reviews))
#names(df) <- c("Rating", "Reviews", "Book Title", "Description", "Number of Pages", "Type", "Price")
head(df)
## Rating Reviews
## 1 4.17 3829
## 2 4.01 1406
## 3 3.33 0
## 4 3.97 1658
## 5 4.06 1325
## 6 3.84 117
## Book_title
## 1 The Elements of Style
## 2 The Information: A History, a Theory, a Flood
## 3 Responsive Web Design Overview For Beginners
## 4 Ghost in the Wires: My Adventures as the World's Most Wanted Hacker
## 5 How Google Works
## 6 The Meme Machine
## Description
## 1 This style manual offers practical advice on improving writing skills. Throughout, the emphasis is on promoting a plain English style. This little book can help you communicate more effectively by showing you how to enliven your sentences.
## 2 James Gleick, the author of the best sellers Chaos and Genius, now brings us a work just as astonishing and masterly: a revelatory chronicle and meditation that shows how information has become the modern era’s defining quality—the blood, the fuel, the vital principle of our world.\n \nThe story of information begins in a time profoundly unlike our own, when every thought and\n\n\n\n...more
## 3 In Responsive Web Design Overview For Beginners, you'll get an overview of what to expect when building a responsive website.\n\nYou'll learn about all of the following:\nResponsive Web Design Overview\nUsability of Smaller Screens\nWhy Plugins Aren't the Solution\nWhy a Responsive Web Design Theme May Not Be Best for Your Existing Website\nRisks Involved with Responsive Web Design F\n\n\n\n\n\n\n\n\n\n\n\n\n\n...more
## 4 If they were a hall of fame or shame for computer hackers, a Kevin Mitnick plaque would be mounted the near the entrance. While other nerds were fumbling with password possibilities, this adept break-artist was penetrating the digital secrets of Sun Microsystems, Digital Equipment Corporation, Nokia, Motorola, Pacific Bell, and other mammoth enterprises. His Ghost in the W\n...more
## 5 Both Eric Schmidt and Jonathan Rosenberg came to Google as seasoned Silicon Valley business executives, but over the course of a decade they came to see the wisdom in Coach John Wooden's observation that 'it's what you learn after you know it all that counts'. As they helped grow Google from a young start-up to a global icon, they relearned everything they knew about manag\n\n\n\n...more
## 6 What is a meme? First coined by Richard Dawkins in 'The Selfish Gene', a meme is any idea, behavior, or skill that can be transferred from one person to another by imitation: stories, fashions, inventions, recipes, songs, ways of plowing a field or throwing a baseball or making a sculpture.\n\nThe meme is also one of the most important--and controversial--concepts to emerge s\n\n\n\n\n\n\n\n\n\n...more
## Number_Of_Pages Type Price
## 1 105 Hardcover 9.323529
## 2 527 Hardcover 11.000000
## 3 50 Kindle Edition 11.267647
## 4 393 Hardcover 12.873529
## 5 305 Kindle Edition 13.164706
## 6 288 Paperback 14.188235
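As a quick sanity check on this kind of cleanup, the sketch below applies the same gsub() and as.numeric() steps to a few made-up strings (not rows from the dataset) and counts how many values failed to convert. The names raw_reviews, clean_reviews and n_failed are hypothetical.

```r
# Made-up review counts written in the same "1,234" style as the Reviews column
raw_reviews <- c("3,829", "1,406", "0", "117")

# Remove every comma, then convert the strings to numbers
clean_reviews <- as.numeric(gsub(",", "", raw_reviews, fixed = TRUE))

# A string that cannot be converted becomes NA, so this count should be 0
n_failed <- sum(is.na(clean_reviews))

clean_reviews
n_failed
```

Running the same NA count on df$Reviews after the conversion would show whether any entries were lost along the way.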
Let’s start with some visualizations to develop an understanding of different variables.
book_count <- df %>%
group_by(Type) %>%
summarize(count = n()) %>%
ggplot(mapping = aes(x = reorder(Type, desc(count)), y = count, fill = Type)) +
geom_col(width = 0.5)+
labs(title = "Type of Binding by Book Count", subtitle = "How many of each type of books are in the dataset?", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
ylab("Book Count") +
theme_fivethirtyeight() +
theme(axis.text.x = element_text(angle = 25, vjust = .7, hjust = 0.5), axis.title.x = element_blank()) +
scale_fill_manual(name = 'Book Binding',
labels = c("Boxed Set", "eBook", "Hardcover", "Kindle Edition", "Paperback", "Unknown Binding"),
values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60"))
# values = c("#84a59d", "#cb997e" , "#f4acb7" , "#ffc9b9", "#f28482" , "#f6bd60"))
book_count
The plot above shows that the two most common types of books are Paperback and Hardcover, which appear to make up the majority of the books in the dataset. Let’s look at exactly what that percentage is in our next visualization.
library(ggrepel)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(IRdisplay)
book_percent <- df %>%
group_by(Type) %>%
summarize(count = n(), percentage = n()/nrow(df)) %>%
plot_ly(labels = ~factor(Type),
values = ~percentage,
marker = list(colors = c("#84a59d", "#f7ede2", "#f5cac3" , "#cb997e" , "#f28482" , "#f6bd60")),
hole = 0.4,
type = 'pie',
textposition = 'outside',
textinfo = 'label+percent') %>%
layout(title = 'Percent breakdown by Book Binding',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = TRUE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = TRUE))
book_percent
Once again we see that Paperback and Hardcover are the most common types of books making up over 90% of the total number of books.
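The combined share of two categories can also be read off numerically instead of from the chart. Here is a minimal sketch using base R on a made-up vector of binding types; the values are for illustration only and do not come from the dataset.

```r
# Made-up binding types, only to illustrate the calculation
types <- c("Paperback", "Paperback", "Hardcover", "Kindle Edition", "Paperback",
           "Hardcover", "ebook", "Paperback", "Hardcover", "Paperback")

# Share of each type as a proportion of all books
share <- prop.table(table(types))

# Combined share of the two most common types
combined <- sum(share[c("Paperback", "Hardcover")])

round(100 * share, 1)
combined
```

With the real data, replacing types with df$Type should give the exact percentage behind the statement above.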
Next, let’s look at the most expensive books in the dataset.
df$Book_title[df$Book_title == "3D Game Engine Architecture: Engineering Real-Time Applications with Wild Magic (The Morgan Kaufmann Series in Interactive 3d Technology)"] = "3D Game Engine Architecture: Engineering Real-Time Applications with Wild Magic"
price <- df %>%
arrange(desc(Price)) %>%
slice(1:10) %>%
ggplot(mapping = aes(x = reorder(Book_title, Price) , y = Price, fill = Book_title)) +
geom_col(width=.5) +
ggtitle('What are the 10 Most Expensive Books?') +
labs(caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
theme(text=element_text(size=10, family='Times New Roman')) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60",
"#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
coord_flip()
price
The plot above shows that “A Discipline for Software Engineering” is the most expensive book in the dataset at $235.65, followed closely by “The Art of Computer Programming, Volumes 1-4a Boxed Set”. Let’s print all the data pertaining to the most expensive book.
df %>%
filter(Book_title == "A Discipline for Software Engineering")
## Rating Reviews Book_title
## 1 3.84 5 A Discipline for Software Engineering
## Description
## 1 Designed to help individual programmers develop software more effectively and successfully, this book presents a scaled-down version of Humphrey's popular methods for managing the software process. It: presents concepts and methods for a disciplined software engineering process for individual programmers, following the Capability Maturity Model (CMM); scales down industria ...more
## Number_Of_Pages Type Price
## 1 789 Hardcover 235.65
Next, I am interested to see if the most expensive books are also rated the highest.
Let’s look at the highest rated books in the dataset.
rating <- df %>%
arrange(desc(Rating)) %>%
slice(1:10) %>%
ggplot(mapping = aes(x = reorder(Book_title, Rating) , y = Rating, fill = Book_title)) +
geom_col(width=.5) +
ggtitle('What are the 10 Highest \nRated Books?') +
labs(caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
theme(text=element_text(size=10, family='Times New Roman')) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60",
"#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
coord_flip()
rating
The plot above shows that “Your First App: Node.js”, an ebook, is rated the highest and is the only book with a perfect rating of 5. We can also see that the most expensive books are not the highest rated.
Let’s print all the data pertaining to this book.
df %>%
filter(Book_title == "Your First App: Node.js")
## Rating Reviews Book_title
## 1 5 0 Your First App: Node.js
## Description
## 1 A tutorial for real-world application development using Node.js, AngularJS, Express and MongoDB\n\nPublished via Leanpub, and can be found here: https://leanpub.com/yfa-nodejs\n\nIn this book, you will learn the basics of designing and developing a node.js application. Unlike other books, focus here is on learning multiple technologies at once and defining processes for success. T\n\n\n\n...more
## Number_Of_Pages Type Price
## 1 317 ebook 25.85588
Let’s look at the price for each of these books.
rating_price <- df %>%
arrange(desc(Rating)) %>%
slice(1:10) %>%
ggplot(mapping = aes(x = reorder(Book_title, Price), y = Price, fill = Book_title)) +
geom_col(width = 0.5)+
ggtitle("Highest Rated Books") +
labs(subtitle = "by Price", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
xlab('Book Title') +
ylab('Price') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60",
"#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
coord_flip()
rating_price
Among the ten highest-rated books, the most expensive is “The Art of Computer Programming, Volumes 1-4a Boxed Set” with a price tag of over $200. This is not surprising because the set consists of 4 books, whereas the other books in the dataset are listed individually.
Let’s look at the number of pages for each of these books.
rating_pages <- df %>%
arrange(desc(Rating)) %>%
slice(1:10) %>%
ggplot(mapping = aes(x = reorder(Book_title, Number_Of_Pages), y = Number_Of_Pages, fill = Book_title)) +
geom_col(width = 0.5)+
ggtitle('Highest Rated Books') +
labs(subtitle = "by Number of Pages", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
xlab('Book Title') +
ylab('Number of Pages') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60",
"#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
coord_flip()
rating_pages
Among the ten highest-rated books, the one with the most pages is “The Art of Computer Programming, Volumes 1-4a Boxed Set” with over 3000 pages. Once again, this is not surprising because the set consists of 4 books and combines their page counts, whereas the other books in the dataset are listed individually.
Next, I am interested to see if the highest rated books also have the most reviews.
reviews <- df %>%
arrange(desc(Reviews)) %>%
slice(1:10) %>%
ggplot(mapping = aes(x = reorder(Book_title, Reviews), y = Reviews, fill = Book_title)) +
geom_col(width = 0.5) +
ggtitle('What are the top 10 Books with the \nMost Number of Reviews?') +
labs(caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
ylab('Number of Reviews') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60",
"#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
#c("#f6f4d2", "#cbdfbd" , "#f19c79", "#a44a3f" , "#660708" ,
#"#f6f4d2", "#cbdfbd" , "#f19c79", "#a44a3f" , "#660708")) +
coord_flip()
reviews
The plot above shows that “Start with Why: How Great Leaders Inspire Everyone to Take Action”, a hardcover, is the most reviewed book with close to 6000 reviews. We can also see that the highest rated books are not the ones with the most reviews.
Let’s print all the data pertaining to this book.
df %>%
filter(Book_title == "Start with Why: How Great Leaders Inspire Everyone to Take Action")
## Rating Reviews
## 1 4.09 5938
## Book_title
## 1 Start with Why: How Great Leaders Inspire Everyone to Take Action
## Description
## 1 Why do you do what you do?\n\nWhy are some people and organizations more innovative, more influential, and more profitable than others? Why do some command greater loyalty from customers and employees alike? Even among the successful, why are so few able to repeat their success over and over?\n\nPeople like Martin Luther King Jr., Steve Jobs, and the Wright Brothers might have lit\n\n\n\n\n\n\n\n\n\n\n\n...more
## Number_Of_Pages Type Price
## 1 256 Hardcover 14.23235
Let’s look at the price for each of these books.
reviews_price <- df %>%
arrange(desc(Reviews)) %>%
slice(1:10) %>%
ggplot(mapping = aes(x = reorder(Book_title, Price), y = Price, fill = Book_title)) +
geom_col(width = 0.5)+
ggtitle('Most Reviewed Books') +
labs(subtitle = "by Price", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
xlab('Book Title') +
ylab('Price') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60",
"#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
coord_flip()
reviews_price
The most expensive of the most reviewed books is called “The Goal: A Process of Ongoing Improvement” costing over $35.
Let’s look at the number of pages for each of these books.
reviews_pages <- df %>%
arrange(desc(Reviews)) %>%
slice(1:10) %>%
ggplot(mapping = aes(x = reorder(Book_title, Number_Of_Pages), y = Number_Of_Pages, fill = Book_title)) +
geom_col(width = 0.5)+
ggtitle('Most Reviewed Books') +
labs(subtitle = "by Number of Pages", caption = "https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books") +
xlab('Book Title') +
ylab('Number of Pages') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60",
"#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f")) +
coord_flip()
reviews_pages
Among the ten most reviewed books, the one with the most pages is The Innovators, followed closely by The Information; both have over 500 pages.
Next, I would like to look at the average rating, average number of reviews, and average price grouped by the type of binding to see if we can gather any new information.
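The top-10 charts above all follow the same pattern: sort by one column, keep the first ten rows, and draw a flipped bar chart of another column. One possible way to avoid repeating that code is a small helper function. The sketch below is only an illustration: top_n_chart() and its arguments are hypothetical names, the toy data frame is made up, and the styling is reduced to the essentials.

```r
library(ggplot2)

# Hypothetical helper: bar chart of the top `n` rows of `data` ranked by
# `rank_col`, with bar length given by `value_col` and labels from `label_col`.
top_n_chart <- function(data, rank_col, value_col, label_col, n = 10, title = NULL) {
  # Sort by the ranking column (largest first) and keep the first n rows
  top <- head(data[order(-data[[rank_col]]), ], n)
  ggplot(top, aes(x = reorder(.data[[label_col]], .data[[value_col]]),
                  y = .data[[value_col]])) +
    geom_col(width = 0.5) +
    labs(title = title, x = NULL, y = value_col) +
    coord_flip()
}

# Made-up rows that only mimic the column names of the dataset
toy <- data.frame(
  Book_title = c("Book A", "Book B", "Book C", "Book D"),
  Rating = c(4.5, 3.9, 4.8, 4.1),
  Price = c(30, 55, 20, 75)
)

# Top 3 books by Rating, with bars showing Price
p <- top_n_chart(toy, rank_col = "Rating", value_col = "Price",
                 label_col = "Book_title", n = 3,
                 title = "Highest Rated Books by Price")
p
```

With the real data, a call such as top_n_chart(df, "Reviews", "Number_Of_Pages", "Book_title") would be expected to produce a chart similar to the one above, minus the custom fonts and colors.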
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
avg_rating <- df %>%
group_by(Type) %>%
summarize(avgRating = mean(Rating)) %>%
ggplot(mapping = aes(x = Type, y = avgRating, fill = Type)) +
geom_col(width = 0.5) +
labs(subtitle = "Average Rating \nby Book Binding") +
ylab('Mean of Ratings') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60")) +
coord_flip()
avg_reviews <- df %>%
group_by(Type) %>%
summarize(avgReviews = mean(Reviews)) %>%
ggplot(mapping = aes(x = Type, y = avgReviews, fill = Type)) +
geom_col(width = 0.5) +
labs(subtitle = "Average Number of \nReviews by \nBook Binding") +
ylab('Mean of Reviews') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60")) +
coord_flip()
avg_price <- df %>%
group_by(Type) %>%
summarize(avgPrice = mean(Price)) %>%
ggplot(mapping = aes(x = Type, y = avgPrice, fill = Type)) +
geom_col(width = 0.5) +
labs(subtitle = "Average Price \nby Book Binding") +
ylab('Mean of Price') +
theme(text=element_text(size=10, family="Times New Roman")) +
theme(legend.title = element_blank(), legend.position = "none", axis.title.y = element_blank()) +
scale_fill_manual(values = c("#84a59d", "#cbdfbd" , "#f19c79", "#a44a3f" , "#f28482" , "#f6bd60")) +
coord_flip()
grid.arrange(avg_rating, avg_reviews, avg_price, ncol = 3)
From this plot we can see that Boxed Set - Hardcover is the category with the highest rating, the lowest number of reviews, and the highest average price. Hardcover, on the other hand, has a slightly lower rating but the highest number of reviews as well as a lower price. If price is an issue, it might be worth asking whether the boxed set is worth it or whether it is better to buy each book in the set individually.
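When comparing group averages like these, it helps to know how many books each average is based on, since a category with a single book simply reports that one book’s values. The sketch below computes the count and the three averages in one summarize() call. The toy data frame is made up for illustration and only mimics the dataset’s column names.

```r
library(dplyr)

# Made-up rows; one binding type deliberately contains a single book
toy <- data.frame(
  Type = c("Hardcover", "Hardcover", "Hardcover", "Paperback", "Paperback",
           "Boxed Set - Hardcover"),
  Rating = c(4.0, 4.2, 3.8, 4.1, 3.9, 4.6),
  Reviews = c(120, 300, 60, 500, 40, 3),
  Price = c(60, 80, 70, 40, 50, 220)
)

# Count and averages per binding type in a single table
by_type <- toy %>%
  group_by(Type) %>%
  summarize(n_books = n(),
            avg_rating = mean(Rating),
            avg_reviews = mean(Reviews),
            avg_price = mean(Price))

by_type
```

Applying the same summary to df would show whether any binding type in the real data rests on only a handful of books.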
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
highcharter1 <- highchart() %>%
hc_add_series(data = df,
type = "scatter",
hcaes(x = Price,
y = Rating,
group = Type)) %>%
hc_chart(style = list(fontFamily = "Times New Roman", fontWeight = "bold")) %>%
hc_xAxis(title = list(text="Price")) %>%
hc_yAxis(title = list(text="Rating")) %>%
hc_colors(c("#84a59d", "#cbdfbd" , "#f6bd60", "#a44a3f" , "#f28482","#f19c79")) %>%
hc_title(
text = "What is the Book Rating by Price when grouped by Type of Binding?",
margin = 30,
align = "center",
style = list(color = "black", useHTML = TRUE)) %>%
hc_tooltip(shared = TRUE,
borderColor = "black",
pointFormat = "Price: {point.Price} <br> Rating: {point.Rating}")
highcharter1
Based on the plot we notice a trend that Paperback, Hardcover and Kindle Edition are all rated approximately the same, but Kindle editions are usually the least expensive, followed by paperback and then hardcover. We also observe a paperback which costs as much as some of the most expensive hardcover books in the dataset. Next, we will find out which paperback this is and print out the title of the book.
paperback <- df %>%
filter(Type == "Paperback" & Price > 200)
paperback
## Rating Reviews Book_title
## 1 3.94 22 An Introduction to Database Systems
## Description
## 1 Continuing in the eighth edition, An Introduction to Database Systems provides a comprehensive introduction to the now very large field of database systems by providing a solid grounding in the foundations of database technology while shedding some light on how the field is likely to develop in the future. This new edition has been rewritten and expanded to stay current wi ...more
## Number_Of_Pages Type Price
## 1 1040 Paperback 212.0971
The book is An Introduction to Database Systems, priced at over $200, which makes it the most expensive paperback in the dataset.
df2 <- df %>%
select(Type, Price, Number_Of_Pages) %>%
group_by(Type) %>%
summarize(maxPrice = max(Price), avgPages = mean(Number_Of_Pages))
highcharter2 <- highchart() %>%
hc_yAxis_multiples(
list(title = list(text = "Maximum Price of Book")),
list(title = list(text = "Average Number of Pages"), opposite = TRUE)
) %>%
hc_xAxis(title = list(text="Type of Binding")) %>%
hc_add_series(data = df2$maxPrice,
name = "Maximum Price",
type = "column",
yAxis = 0) %>%
hc_add_series(data = df2$avgPages,
name = "Average Number of Pages",
type = "line",
yAxis = 1) %>%
hc_xAxis(categories = df2$Type,
tickInterval = 1) %>%
hc_title(
text = "What is the Maximum Price and Average Number of Pages by the Type of Book Binding?",
margin = 30,
align = "center",
style = list(color = "black", useHTML = TRUE)
) %>%
hc_colors(c("#84a59d","#f28482")) %>%
hc_chart(style = list(fontFamily = "Times New Roman",
fontWeight = "bold"))
highcharter2
We can see that Hardcovers are among the most expensive books and also have the highest average number of pages. They are followed closely by the Boxed Set and Paperbacks, while ebooks and Kindle editions are among the least expensive books with the lowest average number of pages.
In this section, we will construct a multiple linear regression model to predict the price of a book. There are four assumptions associated with a linear regression model, which can be tested using diagnostic plots. The assumptions are as follows:
Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of the residuals is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.
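Besides diagnostic plots, some of these assumptions can be checked with simple numbers. The sketch below uses simulated data (not the book dataset) where the assumptions hold by construction. It runs a Shapiro-Wilk test on the residuals for normality and, as a rough informal check of equal variance, looks at the rank correlation between the absolute residuals and the fitted values. This is only one of several possible ways to check; all variable names are hypothetical.

```r
set.seed(1)

# Simulated data with a truly linear relationship and constant noise
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1)

fit <- lm(y ~ x)
res <- residuals(fit)

# Normality: a small p-value would suggest the residuals are not normal
sw <- shapiro.test(res)

# Homoscedasticity (informal): a value far from 0 would suggest that the
# spread of the residuals changes with the fitted values
spread_trend <- cor(abs(res), fitted(fit), method = "spearman")

sw$p.value
spread_trend
```

These numbers complement the plots rather than replace them; calling plot(fit) on the same model gives the four standard diagnostic plots.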
fit1 <- lm(Price ~ Rating+Reviews+Number_Of_Pages+Type, data = df)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Rating + Reviews + Number_Of_Pages + Type,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.018 -14.040 -1.195 9.994 149.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.859e+02 3.462e+01 5.368 1.75e-07 ***
## Rating -4.471e+00 5.380e+00 -0.831 0.407
## Reviews -1.371e-02 2.875e-03 -4.767 3.10e-06 ***
## Number_Of_Pages 6.148e-02 5.517e-03 11.143 < 2e-16 ***
## Typeebook -1.384e+02 2.693e+01 -5.138 5.42e-07 ***
## TypeHardcover -1.310e+02 2.532e+01 -5.175 4.54e-07 ***
## TypeKindle Edition -1.508e+02 2.656e+01 -5.678 3.61e-08 ***
## TypePaperback -1.453e+02 2.534e+01 -5.736 2.67e-08 ***
## TypeUnknown Binding -1.444e+02 3.095e+01 -4.666 4.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.05 on 262 degrees of freedom
## Multiple R-squared: 0.5225, Adjusted R-squared: 0.5079
## F-statistic: 35.84 on 8 and 262 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit1)
## Warning: not plotting observations with leverage one:
## 269
cor(df$Rating, df$Price)
## [1] 0.04669738
cor(df$Reviews, df$Price)
## [1] -0.2548988
cor(df$Number_Of_Pages, df$Price)
## [1] 0.6369506
The p-values for Reviews, Number of Pages, and Type are each marked with 3 asterisks, which suggests they are meaningful variables for explaining the linear change in the price of a book, but we also need to look at the adjusted R-squared value. It states that about 51% of the variation in the observations may be explained by the model. In other words, about 49% of the variation in the data is likely not explained by this model.
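The adjusted R-squared can be reproduced by hand from the numbers in the summary output: the multiple R-squared (0.5225), the number of observations (271), and the residual degrees of freedom (262). The short calculation below uses the standard formula; the variable names are my own.

```r
r2 <- 0.5225       # Multiple R-squared from the summary output
n <- 271           # number of observations
df_resid <- 262    # residual degrees of freedom from the summary output

# Adjusted R-squared penalizes R-squared for the number of coefficients used
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)   # 0.5079, matching the summary output

# Share of the variation left unexplained after the adjustment
unexplained <- 1 - adj_r2
round(unexplained, 2)   # 0.49
```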
We usually pay great attention to regression results, such as p-values, R-squared or adjusted R-squared that tell us how well a model represents given data but that’s not the whole picture. We should also look at diagnostic plots to not only check if the linear regression assumptions are met but to improve our model in an exploratory way.
In this case, the residuals vs fitted plot does not show equally spread residuals without distinct patterns, which is a good indication that a linear model may not be the best fit for the data. The Normal Q-Q plot shows whether the residuals are normally distributed, and they do seem to follow a straight line. The next plot is the Scale-Location plot, which checks the assumption of equal variance (homoscedasticity). It’s a good sign if the red line is horizontal with the points spread about randomly, which is definitely not the case here. Finally, in the last plot we watch out for outlying values at the upper right corner or at the lower right corner, as those are the places where cases can be influential against a regression line. The plot identified the influential observation as #70.
In other cases where a linear model appears to be a good fit I would have removed the outlier and rerun the model, but it is evident that a linear regression may not be a good fit for the data, so removing an outlier won’t improve the model much. Lastly, looking at the correlation coefficients we see that Rating and Reviews are only weakly correlated with Price and that the strongest correlation is between Price and Number of Pages, which corroborates the information we got from the heatmap above. At this point, we return to our summary and try removing the least meaningful variable, which is Rating, to see if that improves the model at all.
Based on summary statistics, we eliminate Rating and rerun the model.
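Influential cases like the one flagged in the last diagnostic plot can also be identified numerically with Cook’s distance. The sketch below uses simulated data with one planted outlier (it does not use the book dataset) to show how such a case could be found and how the model could be refitted without it. The variable names are hypothetical.

```r
set.seed(42)

# 50 well-behaved points plus one planted point with extreme x and y
x <- c(runif(50, 0, 10), 30)
y <- c(2 + 0.5 * x[1:50] + rnorm(50, sd = 1), 60)

fit_all <- lm(y ~ x)

# Cook's distance measures how much each observation shifts the fitted model
cd <- cooks.distance(fit_all)
most_influential <- which.max(cd)

# Refit without the most influential observation and compare the slopes
fit_drop <- lm(y ~ x, subset = -most_influential)

most_influential
coef(fit_all)
coef(fit_drop)
```

For the book model, cooks.distance(fit1) would give the same kind of ranking for the real observations.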
fit1 <- lm(Price ~ Reviews+Number_Of_Pages+Type, data = df)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Reviews + Number_Of_Pages + Type, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.957 -14.691 -1.637 9.692 150.593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.664e+02 2.551e+01 6.524 3.50e-10 ***
## Reviews -1.391e-02 2.863e-03 -4.858 2.04e-06 ***
## Number_Of_Pages 6.078e-02 5.449e-03 11.154 < 2e-16 ***
## Typeebook -1.378e+02 2.691e+01 -5.122 5.85e-07 ***
## TypeHardcover -1.293e+02 2.522e+01 -5.126 5.75e-07 ***
## TypeKindle Edition -1.491e+02 2.646e+01 -5.634 4.52e-08 ***
## TypePaperback -1.438e+02 2.525e+01 -5.693 3.33e-08 ***
## TypeUnknown Binding -1.426e+02 3.086e+01 -4.622 5.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.04 on 263 degrees of freedom
## Multiple R-squared: 0.5213, Adjusted R-squared: 0.5085
## F-statistic: 40.91 on 7 and 263 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit1)
## Warning: not plotting observations with leverage one:
## 269
cor(df$Reviews, df$Price)
## [1] -0.2548988
cor(df$Number_Of_Pages, df$Price)
## [1] 0.6369506
Once again, we don’t notice a significant improvement in the diagnostic plots and they all look relatively the same. Our adjusted R-squared remained at about 51%, and the p-values for Reviews, Number of Pages, and Type are still marked with 3 asterisks, which suggests that they are all meaningful variables. So, we select the simplest (parsimonious) model, which predicts Price from Reviews, Number of Pages, and Type, while keeping in mind that we may want to explore a better model for our data. We found earlier that, of all the variables in the data, Price is most strongly correlated with Number of Pages, so it makes sense that this variable stays in the model.
In the analysis above, we observe that our summary statistics remained more or less the same, but judging from the diagnostic plots we come to the conclusion that a linear regression may not be a suitable model for this dataset. As the name suggests, linear regression can only fit a linear relationship between the dependent and independent variables. If we observe that the linear regression assumptions are not met, we can apply a transformation to the response variable, the predictor variables, or both. For example, the first assumption associated with a linear regression is Linearity: the relationship between X and the mean of Y is linear. A separate issue is correlation among the predictors: linear regression works well when the independent variables are not correlated with each other, but when they are, this “multicollinearity” might give some very unintuitive results. Both Lasso and Ridge regression work by putting a cost on the size of the coefficients, to maintain simplicity. As there are a ton of different transformations to choose from, in the words of professor Rachel Saidi, a lot of this process is subjective and there is really an “art and science” to knowing which transformation to apply. It does, however, demand an in-depth knowledge of your data and of exactly what you are trying to accomplish; otherwise this process will become a lot more cumbersome than it needs to be.
Through my background research I came across two articles in particular that mention some of the more sought-after programming books for beginners. The first features Hadley Wickham, Chief Scientist at RStudio and creator of many packages for the R programming language, who lists the best books to help aspiring data scientists build solid computer science fundamentals (Mathieu, 2019). The second, published on medium.com and titled Get Started With A Collection Of 247 Free Computer Science Books, lists 247 free computer science ebooks covering a reasonable range of programming topics. Through my research I also discovered that there is plenty of readily available data on computer science publications; however, tidy datasets that contain an entire collection of books for data science purposes are few and far between.
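The elimination step can also be carried out programmatically. Here is a minimal sketch of one round of backward elimination, using simulated data in which one predictor (noise) is unrelated to the response by construction. The data, the variable names, and the use of drop1() to pick the weakest term are my own assumptions for illustration, not the exact procedure used above.

```r
set.seed(7)

# Simulated data: price depends on pages and reviews but not on noise
n <- 150
pages <- runif(n, 100, 1000)
reviews <- runif(n, 0, 5000)
noise <- rnorm(n)
price <- 10 + 0.06 * pages - 0.01 * reviews + rnorm(n, sd = 5)
sim <- data.frame(price, pages, reviews, noise)

full <- lm(price ~ pages + reviews + noise, data = sim)

# F-test p-value for dropping each term in turn (the <none> row is NA)
tab <- drop1(full, test = "F")
weakest <- rownames(tab)[which.max(tab[["Pr(>F)"]])]

# Refit without the weakest term and compare adjusted R-squared
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))

weakest
c(full = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

Repeating this step until every remaining term is significant gives a simple form of backward elimination.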
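As an illustration of what transforming the response variable can do, the sketch below simulates data where the response grows exponentially with the predictor, then compares a model on the raw response with a model on its logarithm. The data are simulated, so this says nothing about whether a log transformation would help the book prices; that would need to be tested separately.

```r
set.seed(123)

# Simulated data: y grows exponentially with x, with multiplicative noise
x <- seq(1, 10, length.out = 200)
y <- exp(0.5 * x + rnorm(200, sd = 0.3))

# Linear model on the raw response versus on the log-transformed response
fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

# R-squared is measured on a different scale in each model, so treat this
# as a rough indication only and check the diagnostic plots as well
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared

c(raw = r2_raw, log = r2_log)
```

For the book data, the analogous model would be lm(log(Price) ~ Reviews + Number_Of_Pages + Type, data = df), followed by the same diagnostic plots as before.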
“Get Started With A Collection Of 247 Free Computer Science Books.” Medium, Medium, 18 Aug. 2019, medium.com/@getfreebooks/get-started-with-a-collection-of-247-free-computer-science-books-c8fa7d27b8db
Mathieu, Edouard. “Computer Science for Data Scientists: Hadley Wickham on Five Books.” Five Books, 22 Feb. 2019, fivebooks.com/best-books/computer-science-data-science-hadley-wickham/