Regression analysis is a way of predicting an outcome variable from one predictor variable (simple regression) or several predictor variables (multiple regression).
outcome = (model) + error
In regression, the model we fit is linear, which means that we summarize a data set with a straight line. As such, the word ‘model’ in the equation above simply gets replaced by ‘things’ that define the line that we fit to the data.
A straight line can be defined by two things:
* the slope (or gradient) of the line (usually denoted by b1); and
* the point at which the line crosses the vertical axis of the graph (known as the intercept of the line, b0).
The general model becomes the equation below, in which Yi is the outcome that we want to predict and Xi is the ith participant’s score on the predictor variable. Here b1 is the gradient of the straight line fitted to the data and b0 is the intercept of that line. These parameters b1 and b0 are known as the regression coefficients.
There is a residual term, εi, which represents the difference between the score predicted by the line for participant i and the score that participant i actually obtained.
Yi = (b0 + b1Xi) + εi
The gradient of the line tells us something about the nature of the relationship being described - positive or negative.
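As a minimal sketch with made-up coefficients (b0 = 50 and b1 = 2 are purely illustrative), the line simply converts each predictor score into a predicted outcome:
b0 <- 50              # hypothetical intercept: predicted Y when X = 0
b1 <- 2               # hypothetical slope: change in Y per one-unit change in X
x <- c(1, 5, 10)      # predictor scores for three hypothetical participants
b0 + b1 * x           # predicted outcomes on the line: 52, 60, 70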
The method of least squares fits a straight line through the given points; this is known as linear or ordinary least squares. The fitted line is called the line of best fit: it is the line for which the sum of the squared distances from the points is minimized.
In other words, least squares defines the solution by minimizing the sum of squared deviations (errors) across all of the data points.
The least squares method is mostly applied in data fitting. The best fit minimizes the sum of squared errors, or residuals, which are the differences between an observed value and the corresponding fitted value given by the model. There are two basic categories of least squares problems - ordinary (linear) least squares and nonlinear least squares - depending on whether the residuals are linear or nonlinear in the parameters.
In a linear regression, the line of best fit is a straight line.
The residuals, or offsets, of each point from the line can be measured in different ways. In common practice it is the vertical offsets that are minimized, for lines as well as polynomial, surface and hyperplane fitting; perpendicular offsets (total least squares) are used less often.
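As a sketch of what ordinary least squares actually computes (the x and y vectors here are made up for illustration), the slope is cov(x, y)/var(x), the intercept follows from the means, and lm() reproduces the same estimates:
x <- c(1, 2, 3, 4, 5)           # illustrative predictor values
y <- c(2.1, 3.9, 6.2, 8.1, 9.8) # illustrative outcome values
b1 <- cov(x, y) / var(x)        # slope minimizing the sum of squared vertical offsets
b0 <- mean(y) - b1 * mean(x)    # intercept: the line passes through (mean(x), mean(y))
c(b0, b1)
coef(lm(y ~ x))                 # lm() returns the same least-squares estimates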
Load the file
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.3
album2 <- read_excel("D:/Analytics/BACP-Dec2017/07_Regression_Analysis/Album Sales 2.xlsx")
Data summary
summary(album2)
## adverts sales airplay attract
## Min. : 9.104 Min. : 10.0 Min. : 0.00 Min. : 1.00
## 1st Qu.: 215.918 1st Qu.:137.5 1st Qu.:19.75 1st Qu.: 6.00
## Median : 531.916 Median :200.0 Median :28.00 Median : 7.00
## Mean : 614.412 Mean :193.2 Mean :27.50 Mean : 6.77
## 3rd Qu.: 911.226 3rd Qu.:250.0 3rd Qu.:36.00 3rd Qu.: 8.00
## Max. :2271.860 Max. :360.0 Max. :63.00 Max. :10.00
The Album Sales 2 data set contains four variables in the data frame: adverts, sales, airplay and attract.
The goal is to build a model that uses these predictors to explain sales as the outcome variable.
First, load the car library to use its handy scatterplot() function.
#install.packages("car",repos = "http://cran.us.r-project.org")
library(car)
## Warning: package 'car' was built under R version 3.4.3
scatterplot(album2$sales, album2$adverts)
The data show a fairly wide spread for these variables, but the fitted line suggests that a linear relationship may exist.
Out of curiosity, let’s see what the correlation is between these two variables.
cor.test(album2$sales, album2$adverts)
##
## Pearson's product-moment correlation
##
## data: album2$sales and album2$adverts
## t = 9.9793, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4781207 0.6639409
## sample estimates:
## cor
## 0.5784877
The correlation t-test shows that we can reject the null hypothesis of no correlation between adverts and album sales, p < 2.2e-16 (highly significant). The correlation is 0.5785 with a 95% confidence interval of 0.4781 to 0.6639.
Performing a linear regression and obtaining the first model
albumSales.2 <- lm(sales ~ adverts, data = album2)
summary(albumSales.2)
##
## Call:
## lm(formula = sales ~ adverts, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.949 -43.796 -0.393 37.040 211.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***
## adverts 9.612e-02 9.632e-03 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
## F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16
The model shows that a 1-unit increase in adverts is associated with a 0.09612-unit increase in album sales, highly significant, with a multiple R^2 = 0.3346 and an adjusted R^2 = 0.3313.
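As a quick check on that interpretation, we can feed a hypothetical advertising budget (100 units, chosen arbitrarily) through the fitted line with predict():
# 134.1 + 0.09612 * 100 gives roughly 143.7 units of sales
predict(albumSales.2, newdata = data.frame(adverts = 100))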
Let’s take a look at the plot output
par(mfrow = c(2, 2))
plot(albumSales.2)
dev.off()
## null device
## 1
There appears to be a bit of heteroscedasticity in the Residuals vs. Fitted graph; not a huge amount, but it’s there.
The QQ plot actually looks somewhat reasonable.
The Scale-Location plot shows a similar indication of heteroscedasticity.
We don’t see anything of concern with respect to leverage.
Now, we think that airplay is also a significant predictor of album sales, so let’s create another model that includes airplay.
albumSales.3 <- lm(sales ~ adverts + airplay, data = album2)
summary(albumSales.3)
##
## Call:
## lm(formula = sales ~ adverts + airplay, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.121 -30.027 3.952 32.072 155.498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.123811 9.330952 4.407 1.72e-05 ***
## adverts 0.086887 0.007246 11.991 < 2e-16 ***
## airplay 3.588789 0.286807 12.513 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.38 on 197 degrees of freedom
## Multiple R-squared: 0.6293, Adjusted R-squared: 0.6255
## F-statistic: 167.2 on 2 and 197 DF, p-value: < 2.2e-16
As we expected, airplay has a highly significant non-zero coefficient. For each unit increase in airplay, album sales increase by 3.59 units. The multiple R^2 has increased to 0.6293, so our model now accounts for almost 63% of the variation in album sales.
par(mfrow=c(2,2))
plot(albumSales.3)
dev.off()
## null device
## 1
Homoscedasticity looks a little better than in the first model.
The QQ plot still looks reasonable.
The other diagnostics look good.
Now let’s add in the final variable, attractiveness of the band.
albumSales.4 <- lm(sales ~ adverts + airplay + attract, data = album2)
summary(albumSales.4)
##
## Call:
## lm(formula = sales ~ adverts + airplay + attract, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## adverts 0.084885 0.006923 12.261 < 2e-16 ***
## airplay 3.367425 0.277771 12.123 < 2e-16 ***
## attract 11.086335 2.437849 4.548 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16
Now we have all three predictor coefficients, and all are statistically significantly different from zero (the intercept is not, p = 0.127).
The R^2 has increased to 0.665.
par(mfrow=c(2,2))
plot(albumSales.4)
dev.off()
## null device
## 1
Diagnostics from the plot() all look OK.
Interpreting the regression output
The value of R^2 is a measure of how much of the variability in the outcome is accounted for by the predictors.
For the first model its value is 0.335, which means that advertising budget accounts for 33.5% of the variation in album sales. However, when the other two predictors are included as well, this value increases to 0.665, or 66.5% of the variance in album sales. Therefore, if advertising accounts for 33.5%, we can tell that attractiveness and radio play account for an additional 33.0%. So the inclusion of the two new predictors has explained quite a large amount of the variation in album sales.
The adjusted R^2 gives us some idea of how well our model generalizes, and ideally we would like its value to be the same as, or very close to, the value of R^2. In this example the difference for the final model is small (in fact the difference between the values is 0.665 - 0.660 = 0.005, about 0.5%). This shrinkage means that if the model were derived from the population rather than a sample, it would account for approximately 0.5% less variance in the outcome.
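The adjustment is easy to reproduce by hand. A sketch using the standard formula, with n = 200 cases and k = 3 predictors from the final model:
r2 <- 0.6647                          # multiple R-squared of albumSales.4
n <- 200                              # number of cases
k <- 3                                # number of predictors
1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R-squared, ~0.6595 as reported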
Each of these beta values has an associated standard error indicating to what extent these values would vary across different samples, and these standard errors are used to determine whether or not the b-value differs significantly from zero.
If the t-test associated with a b-value is significant (if the value in the column labelled Pr(>|t|) is less than .05), then the predictor is making a significant contribution to the model. The smaller the value of Pr(>|t|) (and the larger the value of t), the greater the contribution of that predictor.
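Those columns are tied together in a simple way. A sketch using the adverts row of the final model: t is the estimate divided by its standard error, and Pr(>|t|) is the two-tailed tail area on the residual degrees of freedom:
b <- 0.084885                 # adverts estimate from albumSales.4
se <- 0.006923                # its standard error
t_val <- b / se               # ~12.26, matching the summary output
2 * pt(-abs(t_val), df = 196) # two-tailed p-value on 196 residual df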
We can examine the standardized beta estimates by loading the QuantPsyc library and using lm.beta().
#install.packages("QuantPsyc",repos = "http://cran.us.r-project.org")
#install.packages("boot",repos = "http://cran.us.r-project.org")
library(QuantPsyc)
## Warning: package 'QuantPsyc' was built under R version 3.4.3
## Loading required package: boot
## Warning: package 'boot' was built under R version 3.4.3
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.3
##
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
##
## norm
library(boot)
lm.beta(albumSales.4)
## adverts airplay attract
## 0.5108462 0.5119881 0.1916834
The betas are given in terms of standard deviations. For adverts, with a standardized beta of 0.511, we can say that as advertising budget increases by 1 SD, album sales increase by 0.511 SD.
Likewise, every 1 SD increase in airplay results in a 0.512 SD increase in album sales.
Finally, every 1 SD increase in band attractiveness results in a 0.192 SD increase in album sales.
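Equivalently, a standardized beta is the raw b scaled by the ratio of the predictor’s standard deviation to the outcome’s. A sketch reproducing the adverts value:
b_adverts <- coef(albumSales.4)["adverts"]        # raw coefficient
b_adverts * sd(album2$adverts) / sd(album2$sales) # standardized beta, ~0.511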
Compute the confidence intervals for the model
confint(albumSales.4)
## 2.5 % 97.5 %
## (Intercept) -60.82960967 7.60369295
## adverts 0.07123166 0.09853799
## airplay 2.81962186 3.91522848
## attract 6.27855218 15.89411823
We see that each coefficient falls within its confidence interval, and that the intervals for the three predictors don’t cross 0. The intercept’s interval does cross 0, consistent with its non-significant t-test.
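These intervals come straight from the estimates and standard errors: estimate ± critical t × SE on the residual degrees of freedom. A sketch for the adverts coefficient:
b <- 0.084885                           # adverts estimate
se <- 0.006923                          # its standard error
b + c(-1, 1) * qt(0.975, df = 196) * se # ~0.0712 to 0.0985, matching confint()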
Compare models using anova
anova(albumSales.2, albumSales.3)
## Analysis of Variance Table
##
## Model 1: sales ~ adverts
## Model 2: sales ~ adverts + airplay
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 862264
## 2 197 480428 1 381836 156.57 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding airplay significantly improved the model, F(1, 197) = 156.57, p < .001.
anova(albumSales.3, albumSales.4)
## Analysis of Variance Table
##
## Model 1: sales ~ adverts + airplay
## Model 2: sales ~ adverts + airplay + attract
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 197 480428
## 2 196 434575 1 45853 20.681 9.492e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compared to the previous model, the final model is an improvement, F(1, 196) = 20.681, p < .001.
Calculate Akaike Information Criterion
The Akaike information criterion (AIC) measures the goodness of fit of an estimated statistical model, penalized for the number of parameters, and can be used for model selection: the model with the lowest AIC is preferred.
AIC(albumSales.2)
## [1] 2247.375
AIC(albumSales.3)
## [1] 2132.398
AIC(albumSales.4)
## [1] 2114.337
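For lm objects, R computes AIC from the log-likelihood as AIC = 2k - 2*logLik, where k counts the estimated parameters (four coefficients plus the residual variance for the final model). A sketch verifying the last value:
ll <- logLik(albumSales.4) # log-likelihood of the final model
k <- attr(ll, "df")        # parameter count: 4 coefficients + residual variance = 5
2 * k - 2 * as.numeric(ll) # reproduces AIC(albumSales.4), ~2114.3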
Diagnostic tests
Compute some useful diagnostic statistics and place them in the data frame:
* Outliers: Residuals can be obtained with the resid() function, standardized residuals with the rstandard() function and studentized residuals with the rstudent() function.
* Influential cases: Cook’s distances can be obtained with the cooks.distance() function, DFBeta with the dfbeta() function, DFFit with the dffits() function, hat values (leverage) with the hatvalues() function, and the covariance ratio with the covratio() function.
album2$standardized.residuals <- rstandard(albumSales.4)
album2$studentized.residuals <- rstudent(albumSales.4)
album2$cooks.distance <- cooks.distance(albumSales.4)
album2$dfbeta <- dfbeta(albumSales.4)
album2$dffit <- dffits(albumSales.4)
album2$leverage <- hatvalues(albumSales.4)
album2$covariance.ratios <- covratio(albumSales.4)
#
album2$large.residual <- (album2$standardized.residuals > 2) | (album2$standardized.residuals < -2) # how many residuals do we have outside 2 sd?
#
sum(album2$large.residual) # 12 cases have standardized residuals outside +/-2; with 200 cases we'd expect about 5% (~10), so this is unremarkable.
## [1] 12
#
# The next command will display those rows and the specified columns
album2[album2$large.residual, c("sales","airplay","attract","adverts","standardized.residuals")]
## # A tibble: 12 x 5
## sales airplay attract adverts standardized.residuals
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 330 43.0 10.0 10.3 2.18
## 2 120 28.0 7.00 986 -2.32
## 3 300 40.0 7.00 174 2.13
## 4 40.0 25.0 8.00 103 -2.46
## 5 190 12.0 4.00 406 2.10
## 6 190 33.0 8.00 1542 -2.46
## 7 300 30.0 7.00 579 2.10
## 8 70.0 37.0 7.00 56.9 -2.36
## 9 250 5.00 7.00 1000 2.10
## 10 120 53.0 8.00 9.10 -2.63
## 11 360 42.0 8.00 146 3.09
## 12 110 20.0 9.00 786 -2.09
# There's nothing too huge except for data point 11, which is ~ 3.1
Let’s take a look at some of the other diagnostic variables for these 12 data points.
album2[album2$large.residual, c("cooks.distance", "leverage", "covariance.ratios")]
## # A tibble: 12 x 3
## cooks.distance leverage covariance.ratios
## <dbl> <dbl> <dbl>
## 1 0.0587 0.0472 0.971
## 2 0.0109 0.00801 0.920
## 3 0.0178 0.0154 0.944
## 4 0.0241 0.0157 0.915
## 5 0.0332 0.0292 0.960
## 6 0.0404 0.0261 0.925
## 7 0.00595 0.00535 0.937
## 8 0.0223 0.0157 0.924
## 9 0.0314 0.0278 0.959
## 10 0.0708 0.0393 0.920
## 11 0.0509 0.0208 0.853
## 12 0.0251 0.0225 0.954
However, the Cook’s distance for point 11 is small (well below 1), so this point is not exerting undue influence.
Durbin-Watson test of independence
dwt(albumSales.4)
## lag Autocorrelation D-W Statistic p-value
## 1 0.0026951 1.949819 0.694
## Alternative hypothesis: rho != 0
As a conservative rule, it’s suggested that values less than 1 or greater than 3 should definitely raise alarm bells. The closer the value is to 2, the better; for these data the value is 1.950, which is so close to 2 that the assumption has almost certainly been met. This means we can’t reject the null hypothesis of no autocorrelation.
Multicollinearity
vif(albumSales.4)
## adverts airplay attract
## 1.014593 1.042504 1.038455
1/vif(albumSales.4) # Tolerance is 1/VIF
## adverts airplay attract
## 0.9856172 0.9592287 0.9629695
mean(vif(albumSales.4))
## [1] 1.03185
* If the largest VIF is > 10, there is cause for concern.
* If the average VIF is substantially > 1, the regression may be biased.
* Tolerance < 0.1 indicates a serious issue.
* Tolerance < 0.2 indicates a potential issue.
For our current model the VIF values are all well below 10 and the tolerance statistics all well above 0.2. Also, the average VIF is very close to 1. Based on these measures we can safely conclude that there is no collinearity within our data.
Dev and test samples
The idea: generate n random numbers between 0 and 1, add them to the data as a new column, and split the data into dev and test samples based on that random number.
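A minimal sketch of that manual approach (the column name rand is just for illustration; the caTools call below achieves the same kind of split in one step):
set.seed(123)                     # so the split is reproducible
tmp <- album2                     # work on a copy so album2 is unchanged below
tmp$rand <- runif(nrow(tmp))      # n random numbers between 0 and 1
dev <- subset(tmp, rand <= 0.75)  # ~75% of rows for the dev sample
test <- subset(tmp, rand > 0.75)  # remaining ~25% for the test sample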
install.packages("caTools",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Ranvir Kumar/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'caTools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ranvir Kumar\AppData\Local\Temp\RtmpOyLQeS\downloaded_packages
library(caTools)
## Warning: package 'caTools' was built under R version 3.4.3
set.seed(123) # set seed to ensure you always have same random numbers generated
sample.split() splits the data in the ratio given by SplitRatio, marking the rows selected for the training set as logical TRUE and the remaining rows as logical FALSE.
sample <- sample.split(album2$sales, SplitRatio = 0.75) # split on the outcome vector, not the whole data frame
Create a training data set, train1, from the rows marked TRUE, and a test set, test1, from the rest.
train1 <- subset(album2, sample == TRUE)
test1 <- subset(album2, sample == FALSE)
Develop a model using the training data
albumSales.5 <- lm(sales ~ adverts + airplay + attract, data = train1)
summary(albumSales.5)
##
## Call:
## lm(formula = sales ~ adverts + airplay + attract, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.570 -29.150 0.711 25.490 130.978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.461730 19.064693 -1.231 0.22
## adverts 0.075678 0.007592 9.968 < 2e-16 ***
## airplay 3.572758 0.309224 11.554 < 2e-16 ***
## attract 11.426238 2.713136 4.211 4.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.88 on 146 degrees of freedom
## Multiple R-squared: 0.674, Adjusted R-squared: 0.6673
## F-statistic: 100.6 on 3 and 146 DF, p-value: < 2.2e-16
Predict values in the validation sample using the model
prediction<-predict(albumSales.5,newdata=test1)
head(prediction)
## 1 2 3 4 5 6
## 234.3490 198.6409 301.0778 258.8600 290.3373 321.9954
head(test1$sales)
## [1] 220 210 290 250 230 320
To find confidence intervals for the predictions: each interval gives the range within which the true mean response lies, at the 95% level of confidence, for the corresponding row of test1.
predict(albumSales.5, newdata = test1, interval = 'confidence')
## fit lwr upr
## 1 234.34896 217.65705 251.04086
## 2 198.64095 183.38148 213.90041
## 3 301.07778 283.01436 319.14121
## 4 258.86004 241.68757 276.03251
## 5 290.33732 275.60443 305.07021
## 6 321.99542 300.12663 343.86420
## 7 198.21302 184.85464 211.57141
## 8 164.43128 151.88747 176.97509
## 9 261.76240 248.79931 274.72549
## 10 247.77250 238.61245 256.93254
## 11 225.60781 215.53107 235.68455
## 12 165.02929 152.01632 178.04226
## 13 88.69897 67.82077 109.57717
## 14 240.74895 230.35334 251.14455
## 15 108.05441 91.11180 124.99701
## 16 187.62197 171.97744 203.26649
## 17 193.01970 180.48236 205.55704
## 18 134.32334 120.54059 148.10609
## 19 232.41814 222.55597 242.28031
## 20 113.42900 98.92640 127.93160
## 21 50.36849 30.99923 69.73775
## 22 145.41380 134.41883 156.40876
## 23 234.34130 218.56256 250.12004
## 24 271.89068 260.56066 283.22071
## 25 172.31315 159.86715 184.75915
## 26 154.96624 145.75899 164.17349
## 27 102.54291 84.73699 120.34884
## 28 81.39910 59.34666 103.45154
## 29 105.00597 81.97469 128.03724
## 30 317.78789 297.07604 338.49974
## 31 64.75398 47.10398 82.40398
## 32 257.14678 242.93398 271.35958
## 33 186.55049 173.93066 199.17033
## 34 116.26710 102.39855 130.13564
## 35 217.09412 209.27289 224.91535
## 36 196.56004 185.44181 207.67828
## 37 236.29023 225.74215 246.83831
## 38 173.81315 156.62415 191.00216
## 39 115.94210 98.98097 132.90323
## 40 177.83049 169.49047 186.17052
## 41 257.99333 238.35796 277.62870
## 42 138.33670 127.97141 148.70198
## 43 207.52765 196.33769 218.71761
## 44 171.36029 162.51673 180.20386
## 45 277.81903 261.40427 294.23378
## 46 293.14839 280.04263 306.25415
## 47 189.02885 180.07779 197.97991
## 48 155.94590 145.31898 166.57283
## 49 162.35966 150.19664 174.52267
## 50 210.28952 194.41518 226.16386
Calculate prediction accuracy
A simple correlation between the actual and predicted values can be used as a form of accuracy measure. A higher correlation implies that the actual and predicted values have similar directional movement, i.e. when the actual values increase the predicted values also increase, and vice versa.
actuals_pred <- data.frame(actuals = test1$sales, predicteds = prediction)
correlation_accuracy<-cor(actuals_pred)
correlation_accuracy
## actuals predicteds
## actuals 1.0000000 0.8030145
## predicteds 0.8030145 1.0000000
Alternate way
cor.test(test1$sales,prediction)
##
## Pearson's product-moment correlation
##
## data: test1$sales and prediction
## t = 9.3353, df = 48, p-value = 2.291e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6756966 0.8838144
## sample estimates:
## cor
## 0.8030145
Robust Regression
albumSales.6 <- rlm(sales ~ adverts + airplay + attract, data = train1) # rlm() from MASS (loaded above via QuantPsyc): robust regression using M-estimation
aov(albumSales.6)
## Call:
## aov(formula = albumSales.6)
##
## Terms:
## adverts airplay attract Residuals
## Sum of Squares 240593.7 331871.1 35729.0 294110.2
## Deg. of Freedom 1 1 1 146
##
## Residual standard error: 44.88266
## Estimated effects may be unbalanced
summary(albumSales.6)
##
## Call: rlm(formula = sales ~ adverts + airplay + attract, data = train1)
## Residuals:
## Min 1Q Median 3Q Max
## -115.3782 -28.6665 0.7027 27.9151 133.8203
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -25.2624 19.8827 -1.2706
## adverts 0.0796 0.0079 10.0540
## airplay 3.5541 0.3225 11.0207
## attract 11.3226 2.8296 4.0015
##
## Residual standard error: 42.27 on 146 degrees of freedom
Parsimony - Forward Selection
Forward selection starts from the null (intercept-only) model and, at each step, adds the predictor that most reduces the AIC, stopping when no addition improves it.
install.packages("leaps",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Ranvir Kumar/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'leaps' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ranvir Kumar\AppData\Local\Temp\RtmpOyLQeS\downloaded_packages
library(leaps)
## Warning: package 'leaps' was built under R version 3.4.3
Null<-lm(sales ~ 1, data=album2)
Full<-lm(sales ~ adverts + airplay + attract, data = album2)
step(Null, scope=list(lower=Null,upper=Full),direction = "forward")
## Start: AIC=1757.29
## sales ~ 1
##
## Df Sum of Sq RSS AIC
## + airplay 1 464863 831089 1670.4
## + adverts 1 433688 862264 1677.8
## + attract 1 137822 1158130 1736.8
## <none> 1295952 1757.3
##
## Step: AIC=1670.44
## sales ~ airplay
##
## Df Sum of Sq RSS AIC
## + adverts 1 350661 480428 1562.8
## + attract 1 63182 767907 1656.6
## <none> 831089 1670.4
##
## Step: AIC=1562.82
## sales ~ airplay + adverts
##
## Df Sum of Sq RSS AIC
## + attract 1 45853 434575 1544.8
## <none> 480428 1562.8
##
## Step: AIC=1544.76
## sales ~ airplay + adverts + attract
##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = album2)
##
## Coefficients:
## (Intercept) airplay adverts attract
## -26.61296 3.36743 0.08488 11.08634
Parsimony - Backward Selection
Backward selection starts from the full model and, at each step, drops the predictor whose removal most reduces the AIC.
step(Full,data=album2,direction = "backward")
## Start: AIC=1544.76
## sales ~ adverts + airplay + attract
##
## Df Sum of Sq RSS AIC
## <none> 434575 1544.8
## - attract 1 45853 480428 1562.8
## - airplay 1 325860 760434 1654.7
## - adverts 1 333332 767907 1656.6
##
## Call:
## lm(formula = sales ~ adverts + airplay + attract, data = album2)
##
## Coefficients:
## (Intercept) adverts airplay attract
## -26.61296 0.08488 3.36743 11.08634
Parsimony - Stepwise: Both Forward and Backward Selection
Stepwise selection combines the two: at each step a predictor may be added or dropped, whichever yields the lower AIC.
step(Null, scope=list(upper=Full),data=album2, direction = "both")
## Start: AIC=1757.29
## sales ~ 1
##
## Df Sum of Sq RSS AIC
## + airplay 1 464863 831089 1670.4
## + adverts 1 433688 862264 1677.8
## + attract 1 137822 1158130 1736.8
## <none> 1295952 1757.3
##
## Step: AIC=1670.44
## sales ~ airplay
##
## Df Sum of Sq RSS AIC
## + adverts 1 350661 480428 1562.8
## + attract 1 63182 767907 1656.6
## <none> 831089 1670.4
## - airplay 1 464863 1295952 1757.3
##
## Step: AIC=1562.82
## sales ~ airplay + adverts
##
## Df Sum of Sq RSS AIC
## + attract 1 45853 434575 1544.8
## <none> 480428 1562.8
## - adverts 1 350661 831089 1670.4
## - airplay 1 381836 862264 1677.8
##
## Step: AIC=1544.76
## sales ~ airplay + adverts + attract
##
## Df Sum of Sq RSS AIC
## <none> 434575 1544.8
## - attract 1 45853 480428 1562.8
## - airplay 1 325860 760434 1654.7
## - adverts 1 333332 767907 1656.6
##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = album2)
##
## Coefficients:
## (Intercept) airplay adverts attract
## -26.61296 3.36743 0.08488 11.08634