# Clear the environment and the console
rm(list = ls()); cat("\f")
#----------------------------#
# Install required libraries #
#----------------------------#
packages <- c("ggplot2", "ggpubr", "readxl", "readr", "dplyr", "tidyr", "psych", "stringr",
"lubridate", "knitr", "outliers", "MVN", "TSA", "tseries", "lmtest", "FSAdata",
"forecast", "matrixcalc", "car", "corpcor", "scales", "QuantPsyc",
"urca", "rugarch", "fGarch", "tswge",
"imputeTS", "shiny", "coda", "rjags", "runjags", "ks", "epiDisplay", "fastDummies")
# Install any packages not already installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) { install.packages(packages[!installed_packages]) }
#-------------------------#
# Load libraries #
#-------------------------#
invisible(lapply(packages, library, character.only = TRUE))
#-------------------------#
# Clear the environment and the console
rm(list = ls()); cat("\f")
library(readxl)
library(tidyr)
library(dplyr)
library(lubridate)
library(outliers)
library(stringr)
library(MVN)
library(forecast)
library(matrixcalc)
library(car)
library(corpcor)
library(ggplot2)
library(scales)
library(QuantPsyc)
library(TSA)
An experiment was conducted to investigate the amount of drug retained in the liver of a rat. Nineteen rats were chosen for the experiment and each rat was administered a dose of the drug. After a fixed length of time, the rat was sacrificed, the liver weighed and the percentage of the dose in the liver determined. The questions can be attempted using whatever is most convenient, a computer package or hand calculations. When computer software is used you need to include the output and code you used (or details on the steps you followed) to generate the output.
The variables are: * X1 = body weight of rat * X2 = weight of liver * X3 = dose given * Y = percentage of dose in liver.
# Question 1
liver <- read_csv("liver.csv")
head(liver)
x <- data.frame(rep(1, 19), liver$BdyWt, liver$LvrWt, liver$Dose,
fix.empty.names = FALSE)
x <- as.matrix(x)
xc <- as.matrix(t(x))
is.matrix(x)
is.matrix(xc)
xc %*% x
C <- xc %*% x
C
is.matrix(C)
matrix.power(C, -1)# requires library(matrixcalc)
fit_liver <- lm(Y ~ BdyWt + LvrWt + Dose, data = liver)
summary(fit_liver)
Equation of best fit: Y = 0.2659 – 0.0212BdyWt + 0.0143LvrWt + 4.1781*Dose The fitted equation implies that a 1 unit increase in the body weight of a rat is associated with a decrease of 0.021 in the percentage of drug dose found in the rat’s liver; a 1 unit increase in the weight of the rat’s liver is associated with an increase of 0.014 in the percentage of drug dose found in the rat’s liver; and a 1 unit increase in the drug dose given is associated with an increase of 4.178 in the percentage of drug dose found in a rat’s liver. The p-value of 0.0720 indicates that the model is not significant at the α=0.05 significance level. Similarly, based on the output for the fitted equation, only the ‘BdyWt’ and ‘Dose’ features are significant in predicting the percentage of drug dose found in a rat’s liver at the α=0.05 significance level. These findings make sense given that the adjusted R-squared for the fitted equation is 0.2367, indicating that only 23.7% of the variation in the amount of drug dose found in a rat’s liver is explained by the model. That is, there is a large amount of variation in drug dose found in a rat’s liver that is not explained by the model
summary(fit_liver)
anova(fit_liver)
𝐹0 = 2.8605 As F0 < Fcrit (2.86 < 5.41), the model is not significant at the α=0.05 significance level. The adjusted R-squared value of 0.24 is also very low, indicating that only 24% of the variation in the amount of drug retained in the liver of a rat is explained by the relationship between amount of drug retained and body weight, liver weight and drug dose.
According to the ANOVA table, the p-values of ‘BdyWt’ and ‘LvrWt’ are both greater than the significance level of α=0.05, indicating that they are not significant predictors of the percentage of drug dose found in the liver of a rat. The p-value for ‘Dose’ indicates it is a significant predictor of the percentage of drug dose found in the liver of a rat.
T-tests for coefficients We reject the null hypothesis (𝐻0) if absolute value of t is greater the critical value.
According to the t-tests, ‘LvrWt’ can be removed from the model as it is not a significant predictor of the percentage of drug dose found in the liver of a rat at the α=0.05 significance level. There is also insufficient evidence to conclude that the intercept Is significantly different from zero at the 95% confidence level
reduced_liver <- lm(Y ~ BdyWt + Dose, data = liver)
summary(reduced_liver)
anova(reduced_liver, fit_liver)
After removing the ‘LvrWt’ variable, the adjusted R-squared of the model has improved slightly to 0.25. The p-value of the model with ‘LvrWt’ removed is 0.038, indicating that the model is now significant at the α=0.05 significance level. The p-value in the ANOVA test indicates that we have insufficient evidence to reject the null hypothesis that the addition of ‘LvrWt’ is significant. This supports the conclusions from the t-tests, that ‘LvrWt’ can be left out of the model.
to come up with a smaller parsimonious mo del. (Use Fstay = 4).
full <- lm(Y ~., data = liver)
drop1(full, test="F") # Backward approach - Manual F-test-based backward selection
‘LvrWt’ is the least significant regressor by partial F-test (p=0.42), and its F-value (0.69) is less than the 𝐹𝑠𝑡𝑎𝑦 of 4. Because ‘LvrWt’ is the least significant regressor, it will be dropped as the first step in the backward elimination.
#LvrWt is the least significant by partial F test
drop1(update(full, ~ . -LvrWt), test = "F")
The remaining regressors (‘BdyWt’ and ‘Dose’) are significant at the α=0.05 significance level and have an F-value greater than the 𝐹𝑠𝑡𝑎𝑦 of 4. Therefore, no other regressors will be removed from the model.
fit_liver_drop1 <- update(full, ~ . -LvrWt)
fit_liver_drop1
summary(fit_liver_drop1)
The smaller parsimonious model achieved via backward elimination is: Y = 0.286 – 0.020(BdyWt) + 4.125(Dose) The model achieved via backward elimination is significant at the α=0.05 significance level and has an adjusted R-squared of 0.2515, indicating that 25% of the variance in drug dose found in a rat’s liver is explained by the model. Although the model is significant, a higher adjusted R-squared would be preferable. The model implies that; a 1 unit increase in the body weight of a rat is associated with a decrease of 0.020 in the percentage of drug dose found in a rat’s liver; while a 1 unit increase in the drug dose given to the rat is associated with an increase of 4.125 in the percentage of drug dose found in a rat’s liver.
This question uses the Crest Toothpaste Sales Data file. The questions can be attempted using whatever is most convenient, a computer package or hand calculations. When computer software is used you need to include the output and code you used (or details on the steps you followed) to generate the output. Write a comprehensive report (including the outputs and discussion) based on your findings concerning the following items: * Is the model using all the predictors significant? * From the output, which variables are important? * Are the assumptions governing the residuals satisfied? Which of these assumptions have been violated? * Highlight and check possible problems (such as multicollinearity, …). * Model building comparing forward, backward and stepwise regression procedures.
The purpose of this report is to explore the Crest Toothpaste Sales data, identify the significant features, and develop a regression model based on the dataset that could be used to predict toothpaste sales.
The dataset was sourced from the journal “Business Economics”, in an article written by C. L Allmon titled “Advertising and Sales Relationships for Toothpaste: Another Look”. The dataset contains 14 observations of 4 variables, as well as an identifier for year, with each observation representing the aggregated information of the 4 variables for a given year.
crest <- data.frame(
sales = c(105000, 105000, 121600, 113750, 113750, 128925, 142500, 126000, 162000, 191625, 189000,
210000, 224250, 245000),
budget = c(16300, 15800, 16000, 14200, 15000, 14000, 15400, 18250, 17300, 23000, 19300, 23056,
26000, 28000),
ratio = c(1.25, 1.34, 1.22, 1.00, 1.15, 1.13, 1.05, 1.27, 1.07, 1.17, 1.07, 1.54, 1.59, 1.56),
USpdi = c(547.9, 593.4, 638.9, 695.3, 751.8, 810.3, 914.5, 998.3, 1096.1, 1194.4, 1311.5, 1462.9, 1641.7,
1821.7))
head(crest)
p1 <- ggplot(data = crest, mapping = aes(x = budget, y = sales))
p1 + geom_point() + scale_x_continuous(labels = dollar_format()) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Budget vs. Yearly Sales for Crest Toothpaste, 1967-1980",
x = "Budget ($USD)",
y = "Sales ($USD)")
#
p2 <- ggplot(data = crest, mapping = aes(x = ratio, y = sales))
p2 + geom_point() + scale_y_continuous(labels = dollar_format()) +
labs(title = "Ratio vs. Yearly Sales for Crest Toothpaste, 1967-1980",
x = "Ratio",
y = "Sales ($USD)")
#
p3 <- ggplot(data = crest, mapping = aes(x = USpdi, y = sales))
p3 + geom_point() + scale_x_continuous(labels = dollar_format()) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "US personal Disposable Income vs. Yearly Sales for Crest Toothpaste, 1967-1980",
x = "US personal Disposable Income ($USD)",
y = "Sales ($USD)")
#library(psych)
pairs.panels(crest, scale=FALSE)
The relationship between sales and budget is positive, strong, and linear, with increasing budgeting generally corresponding to higher values for sales. The relationship between sales and ratio is positive, moderate, and linear, with higher ratios generally corresponding to higher values for sales. The relationship between sales and ‘USpdi’ is positive, strong, and linear, with higher US personal disposable incomes generally corresponding to higher values for sales.
fit_crest <- lm(sales ~ ., data = crest)
summary(fit_crest)
The fitted model which includes all predictors has a p-value that is smaller than the significance level (α=0.05), indicating that the model using all predictors is significant at the α=0.05 significance level. The adjusted R-squared of this model (0.9598) indicates that 96% of the variation in sales is explained by the linear relationship between sales and budget, ratio, and US personal disposable income. However, at the α=0.05 significance level there is only one feature that is significant, US personal disposable income. All other features, and the intercept, have p-values greater than 0.05, indicating that they are not significant at the α=0.05 significance level.
Based on the output of the model summary in methodology part 3, ‘USpdi’ is the only significant variable at the α=0.05 significance level, as it is the only variable with a t-statistic greater than the critical t value (2.2281). The t-statistics of all other variables are smaller than the critical t value and are therefore not significant at the α=0.05 significance level.
par(mfrow=c(2,2))
plot(fit_crest)
plot(fit_crest, which = 1:4)
mean(fit_crest$residuals)
#library(car)
ncvTest(fit_crest) # test for normality
shapiro.test(fit_crest$residuals) # test for normality
#library(TSA)
par(mfrow=c(1,1))
acf(fit_crest$residuals) #test for auto-correlated errors
durbinWatsonTest(fit_crest) #test for auto-correlated errors
The points in the plot of Residuals vs. Fitted shows a relatively straight line, with points well scattered and no clear pattern, supporting the assumption of linearity. The average value of the residuals is small enough to support the assumption that the expected value of the error term is zero. The non-constant variance test (NCV test) results in a p-value of 0.7853, which is larger than the critical value of 0.05 meaning we have insufficient evidence to reject the null hypothesis and can conclude that the errors do have constant variance. This is supported by the straight line in the Scale-Location plot. The Normal Q-Q plot shows one point deviating from the diagonal, suggesting a possible violation of the normality assumption. The Shapiro-Wilk test produced a p-value of 0.015 indicating that, at the 95% confidence level, the assumption of normality of errors is violated. It appears that the 8th observation may be the reason the assumption is violated, as it is the only point that deviates from the diagonal on the Normal Q-Q plot. The Durbin-Watson test for autocorrelation resulted in a p-value of 0.92 indicating that, at the 95% significance level, there is insufficient evidence to reject the null hypothesis that the errors are uncorrelated. Therefore, we can conclude that the error terms are uncorrelated at the 95% confidence level. The ACF plot shows all lines are within the confidence bands, supporting the conclusion drawn from the Durbin-Watson test.
fit_sqrt_crest <- lm(sqrt(sales) ~ ., data = crest)
shapiro.test(fit_sqrt_crest$residuals) # test for normality
ncvTest(fit_sqrt_crest) # test for non-constant variance
durbinWatsonTest(fit_sqrt_crest) #test for auto-correlated errors
The assumption of normality of errors is now satisfied, and the assumptions of non-constant variance and auto-correlation are still satisfied as their p-values are still greater than 0.05
summary(fit_sqrt_crest)
The fitted model with all predictors and the transformed dependent variable has a p-value less than the significance level (α=0.05), indicating that the model using all predictors is significant. The adjusted R-squared of this model (0.9542) informs us that 95.4% of the variation in sales is explained by the linear relationship between the square-root of sales and budget, ratio, and US personal disposable income. Compared to the original model (which violated the normality assumption), the intercept is now significant, as is the US personal disposable income feature. However, the budget and ratio features are still not significant at the α=0.05 significance level.
vif(fit_sqrt_crest)
‘budget’ and ‘USpdi’ have VIF values greater than 5, which is considered significant, and indicates that multicollinearity is present. These results also indicate that the associated regression coefficients may be poorly estimated due to the presence of multicollinearity. The ‘budget’ and ‘USpdi’ regressors appear to be the cause of the multicollinearity, as the VIF for ‘ratio’ is less than 5. A partial correlation coefficient matrix will be used to assess multicollinearity between different pairs of predictors.
#library(corpcor) #partial correlation coefficient matrix
crest_preds <- crest[,2:4]
cor2pcor(cov(crest_preds))
Based on the partial correlation coefficient matrix, ‘budget’ is highly correlated with ‘USpdi’, as they have partial correlation coefficients with each other that are close to 1. ‘ratio’ does not appear to be strongly correlated with the other features.
To address the multicollinearity, MeanCentering will be applied to the ‘budget’ and ‘USPdi’ features.
sqrt_sales <- sqrt(crest$sales)
budget_mc <- meanCenter(crest$budget)
ratio <- crest$ratio
USpdi_mc <- meanCenter(crest$USpdi)
fit3_crest <- lm(sqrt_sales ~ budget_mc + ratio + USpdi_mc)
summary(fit3_crest)
vif(fit3_crest)
Unfortunately, MeanCentering the ‘budget’ and ‘USpdi’ features did not remove the multicollinearity from the model. In this case, multicollinearity is addressed by collecting more data, removing one of the correlated variables, or through the application of ridge regression. As MeanCentering failed to address the multicollinearity in the model, and we cannot obtain more data, removing one of the correlated variables will be used to address multicollinearity, by removing the ‘budget’ feature as it has the highest VIF and was not significant in the original model.
fit2_sqrt_crest <- lm(sqrt(sales) ~ ratio + USpdi, data = crest)
vif(fit2_sqrt_crest)
The retained variables, ‘ratio’ and ‘USpdi’, have small VIF values, indicating that multicollinearity is not present in the model now that the ‘budget’ feature has been removed.
# stepwise regression
# Full model contains all variables
full <- lm(sqrt(sales)~., data=crest)
# null model contains no variables
null <- lm(sqrt(sales)~1, data=crest)
# Manual F-test-based backward elimination
drop1(full,test="F")
Based on the output from the F-test, ‘ratio’ is the least significant feature by partial F-test and is not significant at the α=0.05 significance level, so it will be dropped in the first step of the backwards elimination.
drop1(update(full, ~ . -ratio), test = "F")
With ‘ratio’ removed, ‘budget’ is now the least significant feature, according to the partial F-test. As it is not significant at the α=0.05 significance level, ‘budget’ will be dropped in the second step of the backwards elimination.
drop1(update(full, ~ . -ratio-budget), test = "F")
All remaining variables in the model are significant.
fit_bkwd <- lm(sqrt(sales) ~ USpdi, data = crest)
summary(fit_bkwd)
The model achieved using backwards elimination is significant (p < 0.05) and has an adjusted R-squared of 0.9496, indicating that 95.0% of the variation in the square root of sales is explained by the relationship between the square root of sales and US personal disposable income. The model achieved using backward elimination via partial F-test is: sqrt(sales) = 0.143(USpdi) + 242.800 For comparative purposes, we will also produce a model using backward elimination via AIC values.
step(full, data=crest, direction="backward") #Backward elimination using AIC values
Performing backward elimination using AIC values informs us that the “full” model has an AIC value of 74.35 and that the AIC of the model cannot be reduced by removing any of the variables.
fit_bwd_aic <- lm(sqrt(sales) ~ budget + ratio + USpdi, data = crest)
summary(fit_bwd_aic)
The model achieved using backward elimination and AIC values is significant, with an adjusted R-squared of 0.9542. In model evaluation, variables that are collinear can produce large p-values. However, this does not mean that the variables are not informative. Because of this, and the fact that the AIC backward elimination model resulted in a higher adjusted R-squared than the F-test backward elimination model, the preferred backward elimination model is the one achieved via AIC values. sqrt(sales) = 0.004(budget) -48.589(ratio) + 0.113(USpdi) + 253.626
# Manual F-test-based forward selection
add1(null, scope =full, test = "F")
The most significant variable not currently in the model is ‘USpdi’, so this will be added to the model in the first step.
add1(update(null, ~ . +USpdi), scope = full, test = "F")
With ‘USpdi’ added to the model, none of the remaining variables are significant at the α=0.05 level, so there are no more forward selection steps to perform.
fit_fwd <- lm(sqrt(sales) ~ USpdi, data = crest)
summary(fit_fwd)
The model achieved using forward selection via partial F-test is significant (p < 0.05) and has an adjusted R-squared of 0.9496, indicating that 95.0% of the variation in the square root of sales is explained by the relationship between the square root of sales and US personal disposable income. The model achieved using forward selection via partial F-test is: sqrt(sales) = 0.143(USpdi) + 242.800 For comparative purposes, we will also produce a model via forward selection using AIC values.
#forward selection using AIC values
step(null, scope=list(lower=null, upper=full), direction="forward")
Performing forward selection using AIC values informs us that the “null” model has an AIC value of 115.18. By adding ‘USpdi’ to the model, the AIC is reduced to 74.24. The model’s AIC cannot be further reduced by adding any other variables to the model. The resulting model for forward selection using AIC values is: sqrt(sales) = 0.143(USpdi) + 242.778 this is the same as the model produced when performing forward selection via F-tests and will therefore have the same p-value and adjusted R-squared.
step(null, scope = list(upper=full), data=crest, direction="both")
The model achieved using stepwise regressions and AIC values is: sqrt(sales) = 0.143(USpdi) + 242.778 This is the same model achieved using forward selection (partial F-test and AIC values). As per the summary of the fitted model from forward selection, the model is significant, with an adjusted R-squared of 0.9496. Given multicollinearity is present in the model when all variables are included, the model produced via forward selection is the best model.
The forward selection model is sqrt(sales) = 0.143(USpdi) + 242.800
fit_fwd <- lm(sqrt(sales) ~ USpdi, data = crest)
par(mfrow=c(2,2))
plot(fit_fwd)
mean(fit_fwd$residuals)
ncvTest(fit_fwd) # test for normality
shapiro.test(fit_fwd$residuals) # test for normality
par(mfrow=c(1,1))
acf(fit_fwd$residuals) #test for auto-correlated errors
durbinWatsonTest(fit_fwd)
The points in the plot of Residuals vs. Fitted shows a relatively straight line, with points well scattered and no clear pattern, supporting the assumption of linearity. The average value of the residuals is small enough to support the assumption that the expected value of the error term is zero. The non-constant variance test (NCV test) results in a p-value of 0.9815, which is larger than the critical value of 0.05 meaning we have insufficient evidence to reject the null hypothesis and can conclude that the errors do have constant variance. This is supported by the straight line in the Scale-Location plot. The Normal Q-Q plot does show one point deviating from the diagonal, suggesting a possible violation of the normality assumption, however, the Shapiro-Wilk test produced a p-value of 0.497 indicating that, at the 95% significance level, the assumption of normality of errors is violated. The Durbin-Watson test for autocorrelation resulted in a p-value of 0.78 indicating that, at the 95% significance level, there is insufficient evidence to reject the null hypothesis that the errors are uncorrelated. Therefore, we can conclude that the error terms are uncorrelated at the 95% confidence level. The ACF plot shows all lines are within the confidence bands, supporting the conclusion drawn from the Durbin-Watson test.
pred_sales <- predict(fit_fwd, data.frame(USpdi=1800), interval="prediction", level = 0.95)
predict(fit_fwd, data.frame(USpdi=1800), interval="prediction", level = 0.95)
(pred_sales[1])^2
(pred_sales[2])^2
(pred_sales[3])^2
In a year where US personal disposable income was 1,800, the model predicts that Crest toothpaste sales would be $249,589.2. This prediction is considered reasonable, given $245,000 worth of toothpaste sales were observed in 1980, when US personal disposable income was 1,821.7. The 95% confidence interval for this predicted value of toothpaste sales is [217,185.0, 284,245.7]
The objective of this report was to explore the Crest Toothpaste sales dataset, identify the important features, and produce a model that can predict toothpaste sales. Exploration of the Crest Toothpaste sales data revealed that a model using all available features violated the assumption that the distribution of the residuals was approximately normal. To address the violated assumption, the sales feature was transformed by applying a square root transformation. The model using this transformed feature and all available features was shown to satisfy all linear regression assumptions. Variance Inflation Factors (VIF) were used to test for signs of multicollinearity within the model. Multicollinearity was identified, with the ‘budget’ and ‘US personal disposable income’ features exhibiting significant VIF values, and a partial correlation coefficient matrix identified that ‘budget’ was highly correlated with ‘USpdi’. It was recommended that one of the two features be dropped from the model to address the multicollinearity within the model if all variables were to be used. Stepwise regression methods were also used to observe which explanatory variables would produce the best model according to backward elimination, forward selection, and stepwise regression. Models were produced using F-statistics, and AIC values. The model produced using forward selection (either F-statistics or AIC values) was selected as the preferred model based on a high adjusted R-squared value and the fact that the forward selection model did not include both the ‘budget’ and ‘USpdi’ features, the features that lead to multicollinearity in the model containing all features. The model produced using forward selection was sqrt(sales) = 0.143(USpdi) + 242.778. The model was significant at the 95% confidence level, and had an adjusted R-squared of 0.9496, indicating that 95.0% of the variation in the square root of sales was explained by the relationship between the square root of sales and US personal disposable income. This value of adjusted R-squared was deemed acceptable, given how much of the variation in toothpaste sales was explained by the model. Finally, the forward selection model was used to predict toothpaste sales. Based on a US personal disposable income of 1,800, the model predicted toothpaste sales of $249,589. The lower bound of the 95% confidence interval for this prediction was $217,185 and the upper bound was $284,245. This model could be used to predict toothpaste sales for Crest, provided that the value of US Personal disposable income was not greater than 1,821.7, and not less than 547.9. Any predictions made using values outside of these limits could be inaccurate, as they would be beyond the scope of the regression model.