Regression analysis is a way of predicting an outcome variable from one predictor variable (simple regression) or several predictor variables (multiple regression).
outcome = (model) + error
In regression, the model we fit is linear, which means that we summarize a data set with a straight line. As such, the word ‘model’ in the equation above simply gets replaced by ‘things’ that define the line that we fit to the data.
A straight line can be defined by two things:
* the slope (or gradient) of the line (usually denoted by b1); and
* the point at which the line crosses the vertical axis of the graph (known as the intercept of the line, b0).
The general model becomes the equation below, in which Yi is the outcome that we want to predict and Xi is the ith participant’s score on the predictor variable. Here b1 is the gradient of the straight line fitted to the data and b0 is the intercept of that line. These parameters b1 and b0 are known as the regression coefficients.
There is a residual term, εi, which represents the difference between the score predicted by the line for participant i and the score that participant i actually obtained.
Yi = (b0 + b1Xi) + εi
The gradient of the line tells us something about the nature of the relationship being described - positive or negative.
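As a minimal sketch with made-up coefficients (b0 = 50 and b1 = 2 are purely illustrative), the line simply converts each predictor score into a predicted outcome:
b0 <- 50              # hypothetical intercept: predicted Y when X = 0
b1 <- 2               # hypothetical slope: change in Y per one-unit change in X
x <- c(1, 5, 10)      # predictor scores for three hypothetical participants
b0 + b1 * x           # predicted outcomes on the line: 52, 60, 70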
The method of least squares fits a straight line through the given points; this is known as linear or ordinary least squares. The fitted line is called the line of best fit: it is the line for which the sum of the squared distances from the points is minimized.
In other words, least squares defines the solution by minimizing the sum of squared deviations (errors) across all of the data points.
The least squares method is mostly applied in data fitting. The best fit minimizes the sum of squared errors, or residuals, which are the differences between an observed value and the corresponding fitted value given by the model. There are two basic categories of least squares problems - ordinary (linear) least squares and nonlinear least squares - depending on whether the residuals are linear or nonlinear in the parameters.
In a linear regression, the line of best fit is a straight line.
The residuals, or offsets, of each point from the line can be measured in different ways. In common practice it is the vertical offsets that are minimized, for lines as well as polynomial, surface and hyperplane fitting; perpendicular offsets (total least squares) are used less often.
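As a sketch of what ordinary least squares actually computes (the x and y vectors here are made up for illustration), the slope is cov(x, y)/var(x), the intercept follows from the means, and lm() reproduces the same estimates:
x <- c(1, 2, 3, 4, 5)           # illustrative predictor values
y <- c(2.1, 3.9, 6.2, 8.1, 9.8) # illustrative outcome values
b1 <- cov(x, y) / var(x)        # slope minimizing the sum of squared vertical offsets
b0 <- mean(y) - b1 * mean(x)    # intercept: the line passes through (mean(x), mean(y))
c(b0, b1)
coef(lm(y ~ x))                 # lm() returns the same least-squares estimates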
Load the file
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.3
album2 <- read_excel("D:/Analytics/BACP-Dec2017/07_Regression_Analysis/Album Sales 2.xlsx")
Data summary
summary(album2)
## adverts sales airplay attract
## Min. : 9.104 Min. : 10.0 Min. : 0.00 Min. : 1.00
## 1st Qu.: 215.918 1st Qu.:137.5 1st Qu.:19.75 1st Qu.: 6.00
## Median : 531.916 Median :200.0 Median :28.00 Median : 7.00
## Mean : 614.412 Mean :193.2 Mean :27.50 Mean : 6.77
## 3rd Qu.: 911.226 3rd Qu.:250.0 3rd Qu.:36.00 3rd Qu.: 8.00
## Max. :2271.860 Max. :360.0 Max. :63.00 Max. :10.00
The Album Sales 2 data set contains four variables in the data frame: adverts, sales, airplay and attract.
The goal is to build a model that uses these predictors to explain sales as the outcome variable.
First, load the car library to use its handy scatterplot() function.
#install.packages("car",repos = "http://cran.us.r-project.org")
library(car)
## Warning: package 'car' was built under R version 3.4.3
scatterplot(album2$sales, album2$adverts)
The data show a fairly wide spread for these variables, but the fitted line suggests that a linear relationship may exist.
Out of curiosity, let’s see what the correlation is between these two variables.
cor.test(album2$sales, album2$adverts)
##
## Pearson's product-moment correlation
##
## data: album2$sales and album2$adverts
## t = 9.9793, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4781207 0.6639409
## sample estimates:
## cor
## 0.5784877
The correlation t-test shows that we can reject the null hypothesis of no correlation between adverts and album sales, p < 2.2e-16 (highly significant). The correlation is 0.5785 with a 95% confidence interval of 0.4781 to 0.6639.
Performing a linear regression and obtaining the first model
albumSales.2 <- lm(sales ~ adverts, data = album2)
summary(albumSales.2)
##
## Call:
## lm(formula = sales ~ adverts, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.949 -43.796 -0.393 37.040 211.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***
## adverts 9.612e-02 9.632e-03 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
## F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16
The model shows that a 1-unit increase in adverts is associated with a 0.09612-unit increase in album sales, highly significant, with a multiple R^2 = 0.3346 and an adjusted R^2 = 0.3313.
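As a quick check on that interpretation, we can feed a hypothetical advertising budget (100 units, chosen arbitrarily) through the fitted line with predict():
# 134.1 + 0.09612 * 100 gives roughly 143.7 units of sales
predict(albumSales.2, newdata = data.frame(adverts = 100))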
Let’s take a look at the plot output
par(mfrow = c(2, 2))
plot(albumSales.2)
dev.off()
## null device
## 1
There appears to be a bit of heteroscedasticity in the Residuals vs. Fitted graph; not a huge amount, but it’s there.
The QQ plot actually looks somewhat reasonable.
The Scale-Location plot shows a similar indication of heteroscedasticity.
We don’t see anything of concern with respect to leverage.
Now, we think that airplay is also a significant predictor of album sales, so let’s create another model that includes airplay.
albumSales.3 <- lm(sales ~ adverts + airplay, data = album2)
summary(albumSales.3)
##
## Call:
## lm(formula = sales ~ adverts + airplay, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.121 -30.027 3.952 32.072 155.498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.123811 9.330952 4.407 1.72e-05 ***
## adverts 0.086887 0.007246 11.991 < 2e-16 ***
## airplay 3.588789 0.286807 12.513 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.38 on 197 degrees of freedom
## Multiple R-squared: 0.6293, Adjusted R-squared: 0.6255
## F-statistic: 167.2 on 2 and 197 DF, p-value: < 2.2e-16
As we expected, airplay has a highly significant non-zero coefficient. For each unit increase in airplay, album sales increase by 3.59 units. The multiple R^2 has increased to 0.6293, so our model now accounts for almost 63% of the variation in album sales.
par(mfrow=c(2,2))
plot(albumSales.3)
dev.off()
## null device
## 1
Homoscedasticity looks a little better than in the first model.
The QQ plot still looks reasonable.
The other diagnostics look good.
Now let’s add in the final variable, attractiveness of the band.
albumSales.4 <- lm(sales ~ adverts + airplay + attract, data = album2)
summary(albumSales.4)
##
## Call:
## lm(formula = sales ~ adverts + airplay + attract, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## adverts 0.084885 0.006923 12.261 < 2e-16 ***
## airplay 3.367425 0.277771 12.123 < 2e-16 ***
## attract 11.086335 2.437849 4.548 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16
Now we have all three predictor coefficients, and all are statistically significantly different from zero (the intercept is not, p = 0.127).
The R^2 has increased to 0.665.
par(mfrow=c(2,2))
plot(albumSales.4)
dev.off()
## null device
## 1
Diagnostics from the plot() all look OK.
Interpreting the regression output
The value of R^2 is a measure of how much of the variability in the outcome is accounted for by the predictors.
For the first model its value is 0.335, which means that advertising budget accounts for 33.5% of the variation in album sales. However, when the other two predictors are included as well, this value increases to 0.665, or 66.5% of the variance in album sales. Therefore, if advertising accounts for 33.5%, we can tell that attractiveness and radio play account for an additional 33.0%. So the inclusion of the two new predictors has explained quite a large amount of the variation in album sales.
The adjusted R^2 gives us some idea of how well our model generalizes, and ideally we would like its value to be the same as, or very close to, the value of R^2. In this example the difference for the final model is small (in fact the difference between the values is 0.665 - 0.660 = 0.005, about 0.5%). This shrinkage means that if the model were derived from the population rather than a sample, it would account for approximately 0.5% less variance in the outcome.
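The adjustment is easy to reproduce by hand. A sketch using the standard formula, with n = 200 cases and k = 3 predictors from the final model:
r2 <- 0.6647                          # multiple R-squared of albumSales.4
n <- 200                              # number of cases
k <- 3                                # number of predictors
1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R-squared, ~0.6595 as reported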
Each of these beta values has an associated standard error indicating to what extent these values would vary across different samples, and these standard errors are used to determine whether or not the b-value differs significantly from zero.
If the t-test associated with a b-value is significant (if the value in the column labelled Pr(>|t|) is less than .05), then the predictor is making a significant contribution to the model. The smaller the value of Pr(>|t|) (and the larger the value of t), the greater the contribution of that predictor.
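Those columns are tied together in a simple way. A sketch using the adverts row of the final model: t is the estimate divided by its standard error, and Pr(>|t|) is the two-tailed tail area on the residual degrees of freedom:
b <- 0.084885                 # adverts estimate from albumSales.4
se <- 0.006923                # its standard error
t_val <- b / se               # ~12.26, matching the summary output
2 * pt(-abs(t_val), df = 196) # two-tailed p-value on 196 residual df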
We can examine the standardized beta estimates by loading the QuantPsyc library and using lm.beta().
#install.packages("QuantPsyc",repos = "http://cran.us.r-project.org")
#install.packages("boot",repos = "http://cran.us.r-project.org")
library(QuantPsyc)
## Warning: package 'QuantPsyc' was built under R version 3.4.3
## Loading required package: boot
## Warning: package 'boot' was built under R version 3.4.3
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.3
##
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
##
## norm
library(boot)
lm.beta(albumSales.4)
## adverts airplay attract
## 0.5108462 0.5119881 0.1916834
The betas are given in terms of standard deviations. For adverts, with a standardized beta of 0.511, we can say that as advertising budget increases by 1 SD, album sales increase by 0.511 SD.
Likewise, every 1 SD increase in airplay results in a 0.512 SD increase in album sales.
Finally, every 1 SD increase in band attractiveness results in a 0.192 SD increase in album sales.
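Equivalently, a standardized beta is the raw b scaled by the ratio of the predictor’s standard deviation to the outcome’s. A sketch reproducing the adverts value:
b_adverts <- coef(albumSales.4)["adverts"]        # raw coefficient
b_adverts * sd(album2$adverts) / sd(album2$sales) # standardized beta, ~0.511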
Compute the confidence intervals for the model
confint(albumSales.4)
## 2.5 % 97.5 %
## (Intercept) -60.82960967 7.60369295
## adverts 0.07123166 0.09853799
## airplay 2.81962186 3.91522848
## attract 6.27855218 15.89411823
We see that each coefficient falls within its confidence interval, and that the intervals for the three predictors don’t cross 0. The intercept’s interval does cross 0, consistent with its non-significant t-test.
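These intervals come straight from the estimates and standard errors: estimate ± critical t × SE on the residual degrees of freedom. A sketch for the adverts coefficient:
b <- 0.084885                           # adverts estimate
se <- 0.006923                          # its standard error
b + c(-1, 1) * qt(0.975, df = 196) * se # ~0.0712 to 0.0985, matching confint()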
Compare models using anova
anova(albumSales.2, albumSales.3)
## Analysis of Variance Table
##
## Model 1: sales ~ adverts
## Model 2: sales ~ adverts + airplay
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 862264
## 2 197 480428 1 381836 156.57 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding airplay significantly improved the model, F(1, 197) = 156.57, p < .001.
anova(albumSales.3, albumSales.4)
## Analysis of Variance Table
##
## Model 1: sales ~ adverts + airplay
## Model 2: sales ~ adverts + airplay + attract
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 197 480428
## 2 196 434575 1 45853 20.681 9.492e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compared to the previous model, the final model is an improvement, F(1, 196) = 20.681, p < .001.
Calculate Akaike Information Criterion
The Akaike information criterion (AIC) measures the goodness of fit of an estimated statistical model, penalized for the number of parameters, and can be used for model selection: the model with the lowest AIC is preferred.
AIC(albumSales.2)
## [1] 2247.375
AIC(albumSales.3)
## [1] 2132.398
AIC(albumSales.4)
## [1] 2114.337
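For lm objects, R computes AIC from the log-likelihood as AIC = 2k - 2*logLik, where k counts the estimated parameters (four coefficients plus the residual variance for the final model). A sketch verifying the last value:
ll <- logLik(albumSales.4) # log-likelihood of the final model
k <- attr(ll, "df")        # parameter count: 4 coefficients + residual variance = 5
2 * k - 2 * as.numeric(ll) # reproduces AIC(albumSales.4), ~2114.3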
Diagnostic tests
Compute some useful diagnostic statistics and place them in the data frame:
* Outliers: Residuals can be obtained with the resid() function, standardized residuals with the rstandard() function and studentized residuals with the rstudent() function.
* Influential cases: Cook’s distances can be obtained with the cooks.distance() function, DFBeta with the dfbeta() function, DFFit with the dffits() function, hat values (leverage) with the hatvalues() function, and the covariance ratio with the covratio() function.
album2$standardized.residuals <- rstandard(albumSales.4)
album2$studentized.residuals <- rstudent(albumSales.4)
album2$cooks.distance <- cooks.distance(albumSales.4)
album2$dfbeta <- dfbeta(albumSales.4)
album2$dffit <- dffits(albumSales.4)
album2$leverage <- hatvalues(albumSales.4)
album2$covariance.ratios <- covratio(albumSales.4)
#
album2$large.residual <- (album2$standardized.residuals > 2) | (album2$standardized.residuals < -2) # how many residuals do we have outside 2 sd?
#
sum(album2$large.residual) # 12 cases have standardized residuals outside +/-2; with 200 cases we'd expect about 5% (~10), so this is unremarkable.
## [1] 12
#
# The next command will display those rows and the specified columns
album2[album2$large.residual, c("sales","airplay","attract","adverts","standardized.residuals")]
## # A tibble: 12 x 5
## sales airplay attract adverts standardized.residuals
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 330 43.0 10.0 10.3 2.18
## 2 120 28.0 7.00 986 -2.32
## 3 300 40.0 7.00 174 2.13
## 4 40.0 25.0 8.00 103 -2.46
## 5 190 12.0 4.00 406 2.10
## 6 190 33.0 8.00 1542 -2.46
## 7 300 30.0 7.00 579 2.10
## 8 70.0 37.0 7.00 56.9 -2.36
## 9 250 5.00 7.00 1000 2.10
## 10 120 53.0 8.00 9.10 -2.63
## 11 360 42.0 8.00 146 3.09
## 12 110 20.0 9.00 786 -2.09
# There's nothing too huge except for data point 11, which is ~ 3.1
Let’s take a look at some of the other diagnostic variables for these 12 data points.
album2[album2$large.residual, c("cooks.distance", "leverage", "covariance.ratios")]
## # A tibble: 12 x 3
## cooks.distance leverage covariance.ratios
## <dbl> <dbl> <dbl>
## 1 0.0587 0.0472 0.971
## 2 0.0109 0.00801 0.920
## 3 0.0178 0.0154 0.944
## 4 0.0241 0.0157 0.915
## 5 0.0332 0.0292 0.960
## 6 0.0404 0.0261 0.925
## 7 0.00595 0.00535 0.937
## 8 0.0223 0.0157 0.924
## 9 0.0314 0.0278 0.959
## 10 0.0708 0.0393 0.920
## 11 0.0509 0.0208 0.853
## 12 0.0251 0.0225 0.954
However, the Cook’s distance for point 11 is small (well below 1), so this point is not exerting undue influence.
Durbin-Watson test of independence
dwt(albumSales.4)
## lag Autocorrelation D-W Statistic p-value
## 1 0.0026951 1.949819 0.694
## Alternative hypothesis: rho != 0
As a conservative rule, it’s suggested that values less than 1 or greater than 3 should definitely raise alarm bells. The closer the value is to 2, the better; for these data the value is 1.950, which is so close to 2 that the assumption has almost certainly been met. This means we can’t reject the null hypothesis of no autocorrelation.
Multicollinearity
vif(albumSales.4)
## adverts airplay attract
## 1.014593 1.042504 1.038455
1/vif(albumSales.4) # Tolerance is 1/VIF
## adverts airplay attract
## 0.9856172 0.9592287 0.9629695
mean(vif(albumSales.4))
## [1] 1.03185
* If the largest VIF is > 10, there is cause for concern.
* If the average VIF is substantially > 1, the regression may be biased.
* Tolerance < 0.1 indicates a serious issue.
* Tolerance < 0.2 indicates a potential issue.
For our current model the VIF values are all well below 10 and the tolerance statistics all well above 0.2. Also, the average VIF is very close to 1. Based on these measures we can safely conclude that there is no collinearity within our data.
Dev and test samples
The idea: generate n random numbers between 0 and 1, add them to the data as a new column, and split the data into dev and test samples based on that random number.
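A minimal sketch of that manual approach (the column name rand is just for illustration; the caTools call below achieves the same kind of split in one step):
set.seed(123)                     # so the split is reproducible
tmp <- album2                     # work on a copy so album2 is unchanged below
tmp$rand <- runif(nrow(tmp))      # n random numbers between 0 and 1
dev <- subset(tmp, rand <= 0.75)  # ~75% of rows for the dev sample
test <- subset(tmp, rand > 0.75)  # remaining ~25% for the test sample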
install.packages("caTools",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Ranvir Kumar/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'caTools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ranvir Kumar\AppData\Local\Temp\RtmpOyLQeS\downloaded_packages
library(caTools)
## Warning: package 'caTools' was built under R version 3.4.3
set.seed(123) # set seed to ensure you always have same random numbers generated
sample.split() splits the data in the ratio given by SplitRatio, marking the rows selected for the training set as logical TRUE and the remaining rows as logical FALSE.
sample <- sample.split(album2$sales, SplitRatio = 0.75) # split on the outcome vector, not the whole data frame
Create a training data set, train1, from the rows marked TRUE, and a test set, test1, from the rest.
train1 <- subset(album2, sample == TRUE)
test1 <- subset(album2, sample == FALSE)
Develop a model using the training data
albumSales.5 <- lm(sales ~ adverts + airplay + attract, data = train1)
summary(albumSales.5)
##
## Call:
## lm(formula = sales ~ adverts + airplay + attract, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.570 -29.150 0.711 25.490 130.978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.461730 19.064693 -1.231 0.22
## adverts 0.075678 0.007592 9.968 < 2e-16 ***
## airplay 3.572758 0.309224 11.554 < 2e-16 ***
## attract 11.426238 2.713136 4.211 4.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.88 on 146 degrees of freedom
## Multiple R-squared: 0.674, Adjusted R-squared: 0.6673
## F-statistic: 100.6 on 3 and 146 DF, p-value: < 2.2e-16
Predict values in the validation sample using the model
prediction<-predict(albumSales.5,newdata=test1)
head(prediction)
## 1 2 3 4 5 6
## 234.3490 198.6409 301.0778 258.8600 290.3373 321.9954
head(test1$sales)
## [1] 220 210 290 250 230 320
To find confidence intervals for the predictions: each interval gives the range within which the true mean response lies, at the 95% level of confidence, for the corresponding row of test1.
predict(albumSales.5, newdata = test1, interval = 'confidence')
## fit lwr upr
## 1 234.34896 217.65705 251.04086
## 2 198.64095 183.38148 213.90041
## 3 301.07778 283.01436 319.14121
## 4 258.86004 241.68757 276.03251
## 5 290.33732 275.60443 305.07021
## 6 321.99542 300.12663 343.86420
## 7 198.21302 184.85464 211.57141
## 8 164.43128 151.88747 176.97509
## 9 261.76240 248.79931 274.72549
## 10 247.77250 238.61245 256.93254
## 11 225.60781 215.53107 235.68455
## 12 165.02929 152.01632 178.04226
## 13 88.69897 67.82077 109.57717
## 14 240.74895 230.35334 251.14455
## 15 108.05441 91.11180 124.99701
## 16 187.62197 171.97744 203.26649
## 17 193.01970 180.48236 205.55704
## 18 134.32334 120.54059 148.10609
## 19 232.41814 222.55597 242.28031
## 20 113.42900 98.92640 127.93160
## 21 50.36849 30.99923 69.73775
## 22 145.41380 134.41883 156.40876
## 23 234.34130 218.56256 250.12004
## 24 271.89068 260.56066 283.22071
## 25 172.31315 159.86715 184.75915
## 26 154.96624 145.75899 164.17349
## 27 102.54291 84.73699 120.34884
## 28 81.39910 59.34666 103.45154
## 29 105.00597 81.97469 128.03724
## 30 317.78789 297.07604 338.49974
## 31 64.75398 47.10398 82.40398
## 32 257.14678 242.93398 271.35958
## 33 186.55049 173.93066 199.17033
## 34 116.26710 102.39855 130.13564
## 35 217.09412 209.27289 224.91535
## 36 196.56004 185.44181 207.67828
## 37 236.29023 225.74215 246.83831
## 38 173.81315 156.62415 191.00216
## 39 115.94210 98.98097 132.90323
## 40 177.83049 169.49047 186.17052
## 41 257.99333 238.35796 277.62870
## 42 138.33670 127.97141 148.70198
## 43 207.52765 196.33769 218.71761
## 44 171.36029 162.51673 180.20386
## 45 277.81903 261.40427 294.23378
## 46 293.14839 280.04263 306.25415
## 47 189.02885 180.07779 197.97991
## 48 155.94590 145.31898 166.57283
## 49 162.35966 150.19664 174.52267
## 50 210.28952 194.41518 226.16386
Calculate prediction accuracy
A simple correlation between the actual and predicted values can be used as a form of accuracy measure. A higher correlation implies that the actual and predicted values have similar directional movement, i.e. when the actual values increase the predicted values also increase, and vice versa.
actuals_pred <- data.frame(actuals = test1$sales, predicteds = prediction)
correlation_accuracy<-cor(actuals_pred)
correlation_accuracy
## actuals predicteds
## actuals 1.0000000 0.8030145
## predicteds 0.8030145 1.0000000
Alternate way
cor.test(test1$sales,prediction)
##
## Pearson's product-moment correlation
##
## data: test1$sales and prediction
## t = 9.3353, df = 48, p-value = 2.291e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6756966 0.8838144
## sample estimates:
## cor
## 0.8030145
Robust Regression
albumSales.6 <- rlm(sales ~ adverts + airplay + attract, data = train1) # rlm() from MASS (loaded above via QuantPsyc): robust regression using M-estimation
aov(albumSales.6)
## Call:
## aov(formula = albumSales.6)
##
## Terms:
## adverts airplay attract Residuals
## Sum of Squares 240593.7 331871.1 35729.0 294110.2
## Deg. of Freedom 1 1 1 146
##
## Residual standard error: 44.88266
## Estimated effects may be unbalanced
summary(albumSales.6)
##
## Call: rlm(formula = sales ~ adverts + airplay + attract, data = train1)
## Residuals:
## Min 1Q Median 3Q Max
## -115.3782 -28.6665 0.7027 27.9151 133.8203
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -25.2624 19.8827 -1.2706
## adverts 0.0796 0.0079 10.0540
## airplay 3.5541 0.3225 11.0207
## attract 11.3226 2.8296 4.0015
##
## Residual standard error: 42.27 on 146 degrees of freedom
Parsimony - Forward Selection
Forward selection starts from the null (intercept-only) model and, at each step, adds the predictor that most reduces the AIC, stopping when no addition improves it.
install.packages("leaps",repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Ranvir Kumar/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'leaps' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ranvir Kumar\AppData\Local\Temp\RtmpOyLQeS\downloaded_packages
library(leaps)
## Warning: package 'leaps' was built under R version 3.4.3
Null<-lm(sales ~ 1, data=album2)
Full<-lm(sales ~ adverts + airplay + attract, data = album2)
step(Null, scope=list(lower=Null,upper=Full),direction = "forward")
## Start: AIC=1757.29
## sales ~ 1
##
## Df Sum of Sq RSS AIC
## + airplay 1 464863 831089 1670.4
## + adverts 1 433688 862264 1677.8
## + attract 1 137822 1158130 1736.8
## <none> 1295952 1757.3
##
## Step: AIC=1670.44
## sales ~ airplay
##
## Df Sum of Sq RSS AIC
## + adverts 1 350661 480428 1562.8
## + attract 1 63182 767907 1656.6
## <none> 831089 1670.4
##
## Step: AIC=1562.82
## sales ~ airplay + adverts
##
## Df Sum of Sq RSS AIC
## + attract 1 45853 434575 1544.8
## <none> 480428 1562.8
##
## Step: AIC=1544.76
## sales ~ airplay + adverts + attract
##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = album2)
##
## Coefficients:
## (Intercept) airplay adverts attract
## -26.61296 3.36743 0.08488 11.08634
Parsimony - Backward Selection
Backward selection starts from the full model and, at each step, drops the predictor whose removal most reduces the AIC.
step(Full,data=album2,direction = "backward")
## Start: AIC=1544.76
## sales ~ adverts + airplay + attract
##
## Df Sum of Sq RSS AIC
## <none> 434575 1544.8
## - attract 1 45853 480428 1562.8
## - airplay 1 325860 760434 1654.7
## - adverts 1 333332 767907 1656.6
##
## Call:
## lm(formula = sales ~ adverts + airplay + attract, data = album2)
##
## Coefficients:
## (Intercept) adverts airplay attract
## -26.61296 0.08488 3.36743 11.08634
Parsimony - Stepwise: Both Forward and Backward Selection
Stepwise selection combines the two: at each step a predictor may be added or dropped, whichever yields the lower AIC.
step(Null, scope=list(upper=Full),data=album2, direction = "both")
## Start: AIC=1757.29
## sales ~ 1
##
## Df Sum of Sq RSS AIC
## + airplay 1 464863 831089 1670.4
## + adverts 1 433688 862264 1677.8
## + attract 1 137822 1158130 1736.8
## <none> 1295952 1757.3
##
## Step: AIC=1670.44
## sales ~ airplay
##
## Df Sum of Sq RSS AIC
## + adverts 1 350661 480428 1562.8
## + attract 1 63182 767907 1656.6
## <none> 831089 1670.4
## - airplay 1 464863 1295952 1757.3
##
## Step: AIC=1562.82
## sales ~ airplay + adverts
##
## Df Sum of Sq RSS AIC
## + attract 1 45853 434575 1544.8
## <none> 480428 1562.8
## - adverts 1 350661 831089 1670.4
## - airplay 1 381836 862264 1677.8
##
## Step: AIC=1544.76
## sales ~ airplay + adverts + attract
##
## Df Sum of Sq RSS AIC
## <none> 434575 1544.8
## - attract 1 45853 480428 1562.8
## - airplay 1 325860 760434 1654.7
## - adverts 1 333332 767907 1656.6
##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = album2)
##
## Coefficients:
## (Intercept) airplay adverts attract
## -26.61296 3.36743 0.08488 11.08634