1. Check Data

##         MarketID       MarketSize       LocationID       AgeOfStore 
##                0                0                0                0 
##        Promotion             Week SalesInThousands 
##                0                0                0

Fortunately, the dataset has no missing values (NA) in any column.
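Per-column counts like the ones above can be produced by counting NA entries column by column. The following is a minimal sketch of that check, not necessarily the call used here; the data frame `toy` is a small made-up stand-in, not the real dataset.

```r
# Hypothetical stand-in data; the real dataset is not reproduced here.
toy <- data.frame(
  MarketID         = c(1, 1, 2, 2),
  MarketSize       = c("Medium", "Medium", "Small", NA),
  SalesInThousands = c(33.7, 35.7, NA, 39.2),
  stringsAsFactors = FALSE
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains an NA.
all_complete <- sum(na_counts) == 0
```

On the real data every count is zero, so a flag like `all_complete` would be `TRUE`.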

1. Data Summary

## Classes 'tbl_df', 'tbl' and 'data.frame':    548 obs. of  7 variables:
##  $ MarketID        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ MarketSize      : chr  "Medium" "Medium" "Medium" "Medium" ...
##  $ LocationID      : num  1 1 1 1 2 2 2 2 3 3 ...
##  $ AgeOfStore      : num  4 4 4 4 5 5 5 5 12 12 ...
##  $ Promotion       : num  3 3 3 3 2 2 2 2 1 1 ...
##  $ Week            : num  1 2 3 4 1 2 3 4 1 2 ...
##  $ SalesInThousands: num  33.7 35.7 29 39.2 27.8 ...

2. Exploring Data

The plot suggests that the best-selling stores have a LocationID within about 40.

## Warning: attributes are not identical across measure variables;
## they will be dropped

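3. Split Data

The data are split into a training set (380 rows, first summary) and a test set (168 rows, second summary).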

##   MarketSize    LocationID      AgeOfStore       Promotion    
##  Large :125   Min.   :  1.0   Min.   : 1.000   Min.   :1.000  
##  Medium:212   1st Qu.:216.0   1st Qu.: 3.000   1st Qu.:1.000  
##  Small : 43   Median :502.0   Median : 7.000   Median :2.000  
##               Mean   :481.9   Mean   : 8.234   Mean   :1.982  
##               3rd Qu.:709.2   3rd Qu.:12.000   3rd Qu.:3.000  
##               Max.   :920.0   Max.   :28.000   Max.   :3.000  
##       Week       SalesInThousands
##  Min.   :1.000   Min.   :19.26   
##  1st Qu.:2.000   1st Qu.:42.90   
##  Median :2.000   Median :51.16   
##  Mean   :2.518   Mean   :54.26   
##  3rd Qu.:4.000   3rd Qu.:61.43   
##  Max.   :4.000   Max.   :99.65
##   MarketSize    LocationID      AgeOfStore       Promotion    
##  Large : 43   Min.   :  1.0   Min.   : 1.000   Min.   :1.000  
##  Medium:108   1st Qu.:213.8   1st Qu.: 4.000   1st Qu.:2.000  
##  Small : 17   Median :507.0   Median : 8.000   Median :2.000  
##               Mean   :474.5   Mean   : 9.113   Mean   :2.137  
##               3rd Qu.:705.0   3rd Qu.:12.000   3rd Qu.:3.000  
##               Max.   :920.0   Max.   :28.000   Max.   :3.000  
##       Week       SalesInThousands
##  Min.   :1.000   Min.   :17.34   
##  1st Qu.:1.000   1st Qu.:42.05   
##  Median :3.000   Median :48.41   
##  Mean   :2.458   Mean   :51.67   
##  3rd Qu.:3.000   3rd Qu.:56.80   
##  Max.   :4.000   Max.   :94.89
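The splitting code itself is not shown, so the following is only a sketch of one common way to get a roughly 70/30 split with base R. The simulated data frame `sim`, the seed, and the name `testglm` are assumptions made for illustration; `trainglm` is the name used in the model code below. The exact proportions and method behind the summaries above may differ.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

# Simulated stand-in with the same columns as the real data (values are made up).
n <- 548
sim <- data.frame(
  MarketID         = sample(1:10, n, replace = TRUE),
  MarketSize       = sample(c("Large", "Medium", "Small"), n, replace = TRUE),
  LocationID       = sample(1:920, n, replace = TRUE),
  AgeOfStore       = sample(1:28, n, replace = TRUE),
  Promotion        = sample(1:3, n, replace = TRUE),
  Week             = sample(1:4, n, replace = TRUE),
  SalesInThousands = round(runif(n, 17, 100), 2)
)

# Treat market size as a categorical predictor and drop the market identifier,
# matching the columns that appear in the summaries above.
sim$MarketSize <- factor(sim$MarketSize)
sim$MarketID   <- NULL

# Draw roughly 70% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
trainglm  <- sim[train_idx, ]
testglm   <- sim[-train_idx, ]  # hypothetical name for the held-out rows

summary(trainglm)
summary(testglm)
```

Sampling row indices once and using the negative index for the test set guarantees that the two sets do not overlap.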

4. Creating the linear model

fit1 <- lm(SalesInThousands~ . , data = trainglm)
summary(fit1)
## 
## Call:
## lm(formula = SalesInThousands ~ ., data = trainglm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.603  -8.328   1.447   8.178  24.468 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       80.116707   2.595067  30.873  < 2e-16 ***
## MarketSizeMedium -26.743810   1.329168 -20.121  < 2e-16 ***
## MarketSizeSmall  -17.960401   2.165744  -8.293 2.03e-15 ***
## LocationID        -0.015995   0.002163  -7.395 9.44e-13 ***
## AgeOfStore         0.138087   0.092235   1.497    0.135    
## Promotion         -1.135161   0.727534  -1.560    0.120    
## Week              -0.033217   0.534863  -0.062    0.951    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.59 on 373 degrees of freedom
## Multiple R-squared:  0.5456, Adjusted R-squared:  0.5383 
## F-statistic: 74.65 on 6 and 373 DF,  p-value: < 2.2e-16
# The significant independent variables are MarketSize (relative to the Large baseline) and LocationID.

plot(fit1, which = 1)

A perfectly fitted model would have its red line horizontal around zero, meaning that the residuals are randomly distributed over the fitted values and the model captures the characteristics of the data. So let's model the relationship between the different variables by including the interaction effects in a new model.
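In R's formula notation, one way to request interaction effects between two predictors is to write `(A + B)^2`, which expands to both main effects plus their pairwise interaction. The sketch below inspects that expansion for MarketSize and LocationID without fitting anything; the object names `f`, `g`, `term_labels`, and `same_terms` are introduced only for illustration.

```r
# Expand the formula into its individual terms without fitting a model.
f <- SalesInThousands ~ (MarketSize + LocationID)^2
term_labels <- attr(terms(f), "term.labels")
print(term_labels)

# The same model written with the `*` shorthand.
g <- SalesInThousands ~ MarketSize * LocationID
same_terms <- identical(term_labels, attr(terms(g), "term.labels"))
```

Both spellings describe the same model: a separate intercept and a separate LocationID slope for each market size.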

5. Model Accuracy

fit2 <- lm(SalesInThousands~ (MarketSize + LocationID)^2
           , data = trainglm)
summary(fit2)
## 
## Call:
## lm(formula = SalesInThousands ~ (MarketSize + LocationID)^2, 
##     data = trainglm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.3933  -4.7856   0.1191   5.3898  19.8020 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  95.354879   1.242925  76.718   <2e-16 ***
## MarketSizeMedium            -60.527153   1.678064 -36.070   <2e-16 ***
## MarketSizeSmall             -31.407677   3.246429  -9.675   <2e-16 ***
## LocationID                   -0.045810   0.001900 -24.105   <2e-16 ***
## MarketSizeMedium:LocationID   0.065229   0.002796  23.330   <2e-16 ***
## MarketSizeSmall:LocationID    0.017411   0.011982   1.453    0.147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.426 on 374 degrees of freedom
## Multiple R-squared:  0.8129, Adjusted R-squared:  0.8104 
## F-statistic:   325 on 5 and 374 DF,  p-value: < 2.2e-16
plot(fit2, which = 1)

In this case, we went from 53.8% of the variance explained by fit1 (adjusted R-squared 0.5383) to 81.0% explained by the model fit2 (adjusted R-squared 0.8104).
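The R-squared values above are computed on the training data. A complementary check is to compare prediction error on held-out rows. The sketch below does this on simulated data, since the real data and the test-set object are not shown here; the names `sim`, `testglm`, `fit_main`, `fit_int`, and `rmse`, the seed, and the simulated coefficients are all assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for this sketch

# Simulated stand-in data (made up): sales depend on market size, location,
# and their interaction.
n <- 400
sim <- data.frame(
  MarketSize = factor(sample(c("Large", "Medium", "Small"), n, replace = TRUE)),
  LocationID = sample(1:920, n, replace = TRUE)
)
size  <- as.character(sim$MarketSize)
base  <- unname(c(Large = 95, Medium = 35, Small = 64)[size])
slope <- unname(c(Large = -0.05, Medium = 0.02, Small = -0.03)[size])
sim$SalesInThousands <- base + slope * sim$LocationID + rnorm(n, sd = 7)

# Hold out roughly 30% of the rows.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
trainglm  <- sim[train_idx, ]
testglm   <- sim[-train_idx, ]  # hypothetical name for the held-out set

# Main-effects model versus the model with the interaction term.
fit_main <- lm(SalesInThousands ~ MarketSize + LocationID, data = trainglm)
fit_int  <- lm(SalesInThousands ~ (MarketSize + LocationID)^2, data = trainglm)

# Root mean squared error on rows the model has not seen.
rmse <- function(model, newdata) {
  pred <- predict(model, newdata = newdata)
  sqrt(mean((newdata$SalesInThousands - pred)^2))
}

rmse_main <- rmse(fit_main, testglm)
rmse_int  <- rmse(fit_int, testglm)
cat("Test RMSE, main effects:    ", rmse_main, "\n")
cat("Test RMSE, with interaction:", rmse_int, "\n")
```

If the interaction is real, the second model should also show the lower error on the held-out rows, not just the higher R-squared on the training rows.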

6. Conclusion:

Our model fit2 is a good regression fit: it explains about 81% of the variance in the data (adjusted R-squared 0.8104). Sales are mainly affected by market size and store location. The highest sales occur in large markets, and the best-performing locations probably have a LocationID of around 40.
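To make the conclusion concrete, the interaction model can be read as one straight line per market size. The sketch below rebuilds those lines from the coefficients printed in the summary(fit2) output; the helper `predict_sales` and the object names are hypothetical and introduced only for this illustration.

```r
# Coefficients copied from the summary(fit2) output above.
b <- c(
  "(Intercept)"                 = 95.354879,
  "MarketSizeMedium"            = -60.527153,
  "MarketSizeSmall"             = -31.407677,
  "LocationID"                  = -0.045810,
  "MarketSizeMedium:LocationID" = 0.065229,
  "MarketSizeSmall:LocationID"  = 0.017411
)

# Large is the reference level, so its line uses the intercept and slope directly;
# the other levels add their offsets to both.
lines_by_size <- data.frame(
  MarketSize = c("Large", "Medium", "Small"),
  intercept  = c(b[["(Intercept)"]],
                 b[["(Intercept)"]] + b[["MarketSizeMedium"]],
                 b[["(Intercept)"]] + b[["MarketSizeSmall"]]),
  slope      = c(b[["LocationID"]],
                 b[["LocationID"]] + b[["MarketSizeMedium:LocationID"]],
                 b[["LocationID"]] + b[["MarketSizeSmall:LocationID"]])
)
print(lines_by_size)

# Fitted sales (in thousands) for a given market size and LocationID.
predict_sales <- function(size, location_id) {
  row <- lines_by_size[lines_by_size$MarketSize == size, ]
  row$intercept + row$slope * location_id
}
pred_large_40 <- predict_sales("Large", 40)
```

Under these coefficients the Large-market line starts highest and declines as LocationID grows, while the Medium-market line rises slightly. The Small-market slope offset is not statistically significant in the output above (p = 0.147).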