library(s20x)
bookcost.df = read.table("bookcost.txt", header = TRUE)
bookcost.df$Format = factor(bookcost.df$Format)
plot(Cost ~ Pages, data = bookcost.df, pch = substr(Format, 1, 1))
Exploratory Analysis: There may be a positive relationship between Cost and Pages, but it is hard to tell from the plot. Hard covers might cost more than paper covers, but again this is not clear.
bookcost.lm = lm(Cost ~ Pages * Format, data = bookcost.df)
modcheck(bookcost.lm)
Model Check: The residuals look fine; they might fan out a little, but not enough to be a concern. The majority of the points lie close to the line in the Q-Q plot, with slight deviations at the upper end. Cook's distances are fine.
summary(bookcost.lm)
##
## Call:
## lm(formula = Cost ~ Pages * Format, data = bookcost.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4766 -2.2143 -0.8453 1.0037 19.4456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.500428 0.913658 21.343 < 2e-16 ***
## Pages 0.016468 0.002734 6.023 7.93e-09 ***
## FormatPaper -7.921170 1.386921 -5.711 3.95e-08 ***
## Pages:FormatPaper -0.009543 0.004072 -2.344 0.0201 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.549 on 203 degrees of freedom
## Multiple R-squared: 0.6109, Adjusted R-squared: 0.6051
## F-statistic: 106.2 on 3 and 203 DF, p-value: < 2.2e-16
All terms are statistically significant, including the interaction (P-value = 0.0201), so everything stays in the model.
confint(bookcost.lm)
## 2.5 % 97.5 %
## (Intercept) 17.69895189 21.301904283
## Pages 0.01107683 0.021859512
## FormatPaper -10.65578735 -5.186551842
## Pages:FormatPaper -0.01757139 -0.001514958
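Hard is the baseline level in this fit, so the Pages row gives the slope for hard format books only. To get the slope and its confidence interval for paper format books directly, we need to rotate the model so that Paper is the baseline.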
bookcost.df$Format = factor(bookcost.df$Format, levels = c("Paper", "Hard"))
bookcost.lm2 = lm(Cost ~ Pages * Format, data = bookcost.df)
summary(bookcost.lm2)
##
## Call:
## lm(formula = Cost ~ Pages * Format, data = bookcost.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4766 -2.2143 -0.8453 1.0037 19.4456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.579258 1.043446 11.097 < 2e-16 ***
## Pages 0.006925 0.003017 2.295 0.0227 *
## FormatHard 7.921170 1.386921 5.711 3.95e-08 ***
## Pages:FormatHard 0.009543 0.004072 2.344 0.0201 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.549 on 203 degrees of freedom
## Multiple R-squared: 0.6109, Adjusted R-squared: 0.6051
## F-statistic: 106.2 on 3 and 203 DF, p-value: < 2.2e-16
confint(bookcost.lm2)
## 2.5 % 97.5 %
## (Intercept) 9.5218772013 13.63663978
## Pages 0.0009764142 0.01287359
## FormatHard 5.1865518420 10.65578735
## Pages:FormatHard 0.0015149581 0.01757139
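As a quick arithmetic check, the two fits are the same model written with different baselines, so the Paper-baseline estimates should be recoverable from the Hard-baseline ones. The sketch below uses only the estimates printed above; the variable names are illustrative and not part of the original analysis.

```r
# Estimates copied from summary(bookcost.lm), where Hard is the baseline level.
hard_fit = c(intercept = 19.500428, pages = 0.016468,
             format_paper = -7.921170, pages_by_paper = -0.009543)

# Paper intercept and slope implied by the Hard-baseline fit.
paper_intercept = hard_fit[["intercept"]] + hard_fit[["format_paper"]]
paper_slope = hard_fit[["pages"]] + hard_fit[["pages_by_paper"]]

# These should match the (Intercept) and Pages rows of summary(bookcost.lm2).
print(c(paper_intercept = paper_intercept, paper_slope = paper_slope))
```

Only the point estimates can be recovered this way. The interval for the paper slope also depends on the covariance between the two estimates, which is one reason refitting with a new baseline is convenient.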
Methods and Assumptions
We started by fitting a linear model with cost as the response and number of pages and format as explanatory variables, including their interaction. The interaction term was statistically significant (P-value = 0.02), so it was kept in the model.
We have assumed that the observations are independent of one another. The equality of variance and normality assumptions appear to be satisfied, and there are no unduly influential observations.
The model fitted is \({\tt Cost}_i=\beta_0 + \beta_1\times {\tt Pages}_i + \beta_2\times {\tt FormatHard}_i +\beta_3 \times {\tt Pages}_i \times {\tt FormatHard}_i + \varepsilon_i\), where \({\tt FormatHard}_i = 1\) if book \(i\) is in hard format and 0 if it is in paper format, and \(\varepsilon_i \sim iid~ N(0,\sigma^2)\).
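To make the role of each coefficient concrete, the sketch below evaluates the fitted equation for both formats using the estimates from summary(bookcost.lm2). The helper `fitted_cost` and the page counts are hypothetical, chosen only for illustration.

```r
# Estimates copied from summary(bookcost.lm2), where Paper is the baseline level.
b0 = 11.579258   # intercept for paper format
b1 = 0.006925    # slope per page for paper format
b2 = 7.921170    # shift in intercept for hard format
b3 = 0.009543    # extra slope per page for hard format

# Hypothetical helper: fitted mean cost for a page count and format indicator
# (format_hard = 1 for hard, 0 for paper).
fitted_cost = function(pages, format_hard) {
  b0 + b1 * pages + b2 * format_hard + b3 * pages * format_hard
}

pages = c(100, 300, 500)   # illustrative page counts only
paper = fitted_cost(pages, 0)
hard = fitted_cost(pages, 1)

# The hard minus paper gap grows with page count because of the interaction term.
print(data.frame(pages, paper, hard, gap = hard - paper))
```

Setting `format_hard = 0` leaves only the paper line, while `format_hard = 1` adds `b2` to the intercept and `b3` to the slope.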
Executive Summary
Data were collected to determine the relationship between the number of pages in a book and the cost of producing the book. It was also of interest to determine whether this relationship depended on the book's format.
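Because the effect of pages depends on format, the difference in cost between formats is not constant. At the intercept (zero pages), we estimate that hard format books cost between 5.19 and 10.66 dollars more on average than paper format books, and this gap widens by between 0.02 and 0.18 dollars for every additional 10 pages.
For paper format books, each additional 10 pages is associated with an increase in mean cost of between 0.01 and 0.13 dollars.
For hard format books, each additional 10 pages is associated with an increase in mean cost of between 0.11 and 0.22 dollars.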
Our model explained 61.1% of the variability in book cost.
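The per-10-page intervals can be obtained by scaling the Pages rows of the two confint() tables by 10. The sketch below assumes, as the coefficient names indicate, that the Pages row of confint(bookcost.lm) is the slope for hard format books (Hard baseline) and the Pages row of confint(bookcost.lm2) is the slope for paper format books (Paper baseline). The variable names are illustrative.

```r
# 95% intervals for the Pages slope, copied from the confint() output above.
hard_slope_ci = c(lower = 0.01107683, upper = 0.021859512)     # from confint(bookcost.lm)
paper_slope_ci = c(lower = 0.0009764142, upper = 0.01287359)   # from confint(bookcost.lm2)

# Multiply by 10 to express each interval per 10 extra pages, in dollars.
hard_per_10 = round(10 * hard_slope_ci, 2)
paper_per_10 = round(10 * paper_slope_ci, 2)

print(hard_per_10)
print(paper_per_10)
```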