R Markdown

Part A

df <- read.csv("~/Desktop/SalesPerformance(7).csv")
## Establish a linear regression with outlets and commission as independent variables.
reg <- lm(Profit ~ Outlets + Commis,data=df)
summary(reg)
## 
## Call:
## lm(formula = Profit ~ Outlets + Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -815.56 -169.86   13.72  154.83  622.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -298.425    267.366  -1.116 0.269907    
## Outlets        6.962      1.392   5.001 8.04e-06 ***
## Commis       329.694     87.125   3.784 0.000428 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.1 on 48 degrees of freedom
## Multiple R-squared:  0.3889, Adjusted R-squared:  0.3634 
## F-statistic: 15.27 on 2 and 48 DF,  p-value: 7.359e-06
## Establish a regression model that adds an interaction term between outlets and commission.
reg1 <- lm(Profit ~ Outlets * Commis, data=df)
summary(reg1)
## 
## Call:
## lm(formula = Profit ~ Outlets * Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -843.43 -158.91  -16.01  152.77  665.69 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      40.218    490.446   0.082   0.9350  
## Outlets           5.143      2.611   1.970   0.0547 .
## Commis         -130.686    564.977  -0.231   0.8181  
## Outlets:Commis    2.549      3.090   0.825   0.4137  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 287 on 47 degrees of freedom
## Multiple R-squared:  0.3976, Adjusted R-squared:  0.3592 
## F-statistic: 10.34 on 3 and 47 DF,  p-value: 2.424e-05
## Comparing the coefficient p-values and R-squared values of the two models shows that the managers are not right: the interaction term has little effect (p = 0.4137), and adjusted R-squared falls slightly, from 0.3634 to 0.3592.
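
The comparison above can also be made with a partial F-test on the two nested models. The following is a minimal sketch on simulated data; the data frame `sim`, its columns, and the model names `m_main` and `m_int` are hypothetical stand-ins, not the sales data.

```r
# Simulated stand-in data (hypothetical; not the SalesPerformance file)
set.seed(1)
sim <- data.frame(x1 = runif(50, 0, 10), x2 = runif(50, 0, 5))
sim$y <- 1 + 2 * sim$x1 + 3 * sim$x2 + rnorm(50)

# Nested models: main effects only vs. main effects plus interaction
m_main <- lm(y ~ x1 + x2, data = sim)
m_int  <- lm(y ~ x1 * x2, data = sim)

# Partial F-test: does adding the interaction term significantly reduce the RSS?
cmp <- anova(m_main, m_int)
print(cmp)
```

With the objects in this report, the equivalent call would be `anova(reg, reg1)`. Because only one term is added, its p-value should equal the p-value of the `Outlets:Commis` t-test (0.4137).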

## Establish a regression model with all variables as independent variables and profit as the dependent variable.
reg3 <- lm(Profit ~ Area + Popn + Outlets + Commis, data = df)
summary(reg3)
## 
## Call:
## lm(formula = Profit ~ Area + Popn + Outlets + Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -773.91  -93.91   25.22  108.02  505.06 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 694.6890   367.1055   1.892  0.06475 .  
## Area        -22.9564     8.2917  -2.769  0.00809 ** 
## Popn        101.7687    61.4965   1.655  0.10476    
## Outlets       0.8301     1.5524   0.535  0.59540    
## Commis      316.7449    68.4808   4.625 3.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 220.6 on 46 degrees of freedom
## Multiple R-squared:  0.6518, Adjusted R-squared:  0.6215 
## F-statistic: 21.52 on 4 and 46 DF,  p-value: 4.648e-10
## Establish a regression model with all variables as independent variables and profit divided by outlets as the dependent variable.
reg4 <- lm(Profit/Outlets ~ Area + Popn + Outlets + Commis, data = df)
summary(reg4)
## 
## Call:
## lm(formula = Profit/Outlets ~ Area + Popn + Outlets + Commis, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0776 -0.5590  0.1337  0.6044  2.2461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.710263   1.866715   5.737 7.15e-07 ***
## Area        -0.166435   0.042163  -3.947 0.000269 ***
## Popn         0.571581   0.312707   1.828 0.074062 .  
## Outlets     -0.030872   0.007894  -3.911 0.000301 ***
## Commis       1.796929   0.348221   5.160 5.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.122 on 46 degrees of freedom
## Multiple R-squared:  0.6643, Adjusted R-squared:  0.6351 
## F-statistic: 22.76 on 4 and 46 DF,  p-value: 2.03e-10
## Comparing the coefficient p-values and R-squared values shows that Profit/Outlets is the better dependent variable: its model has a higher R-squared (0.6643 vs. 0.6518) and smaller p-values for every coefficient.

Part B

library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
library(MASS)
library(GGally)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.2
fstep <- lm(Profit ~ Area + Popn + Commis + Outlets, data = df)

X <- subset(df,select=-c(Profit))   # Take a look at relationships among all independent variables
ggpairs(X)

## We found a multicollinearity issue: population and area, and outlets and population, are strongly related. Therefore, we need an approach for selecting the best model.
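
The multicollinearity noted above can be quantified with variance inflation factors (VIFs). The sketch below computes them by hand in base R on simulated predictors; the data frame `preds` and its columns are hypothetical, and `b` is deliberately constructed to track `a` closely.

```r
# Simulated predictors (hypothetical; not the sales data)
set.seed(2)
a  <- rnorm(60)
b  <- a + rnorm(60, sd = 0.3)  # strongly correlated with a by construction
c3 <- rnorm(60)                # unrelated to a and b
preds <- data.frame(a = a, b = b, c3 = c3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others
vif_manual <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

With `car` loaded as in this report, `vif(fstep)` should report the same quantity for each predictor in the sales model. A common rule of thumb treats values above 5 or 10 as a sign of problematic collinearity.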
step <- stepAIC(fstep, direction = "both")  # Use stepwise selection to choose the best model and drop variables that do not improve AIC.
## Start:  AIC=555.17
## Profit ~ Area + Popn + Commis + Outlets
## 
##           Df Sum of Sq     RSS    AIC
## - Outlets  1     13917 2252655 553.49
## <none>                 2238739 555.17
## - Popn     1    133283 2372022 556.12
## - Area     1    373045 2611784 561.03
## - Commis   1   1041184 3279923 572.65
## 
## Step:  AIC=553.49
## Profit ~ Area + Popn + Commis
## 
##           Df Sum of Sq     RSS    AIC
## <none>                 2252655 553.49
## + Outlets  1     13917 2238739 555.17
## - Popn     1    217890 2470545 556.19
## - Area     1    386658 2639314 559.56
## - Commis   1   1031720 3284375 570.72
step$anova # display results
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Profit ~ Area + Popn + Commis + Outlets
## 
## Final Model:
## Profit ~ Area + Popn + Commis
## 
## 
##        Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                              46    2238739 555.1695
## 2 - Outlets  1 13916.59        47    2252655 553.4855
## Fit the model that stepwise selection suggested.
stepModel <- lm(Profit ~ Area + Popn + Commis, data = df)
summary(stepModel)
## 
## Call:
## lm(formula = Profit ~ Area + Popn + Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -751.69  -90.93   21.43  101.61  499.67 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  792.701    315.650   2.511  0.01552 *  
## Area         -23.301      8.204  -2.840  0.00664 ** 
## Popn         116.444     54.613   2.132  0.03825 *  
## Commis       310.156     66.849   4.640 2.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218.9 on 47 degrees of freedom
## Multiple R-squared:  0.6496, Adjusted R-squared:  0.6272 
## F-statistic: 29.04 on 3 and 47 DF,  p-value: 8.984e-11
## Using the stepwise approach, we drop the variable Outlets from the initial model, so our final model comprises only Area, Popn and Commis.
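
The AIC values in the stepwise output can be reproduced by hand. For a linear model, `stepAIC` relies on `extractAIC`, which computes n * log(RSS / n) + 2p, where p is the number of estimated coefficients. A short check using the RSS values printed above, with n = 51 inferred from the 46 residual degrees of freedom plus 5 coefficients of the full model:

```r
n <- 51  # observations: 46 residual df + 5 coefficients in the full model

# AIC as computed by extractAIC() for lm: n * log(RSS / n) + 2 * (number of coefficients)
aic_full    <- n * log(2238739 / n) + 2 * 5  # intercept + Area + Popn + Commis + Outlets
aic_reduced <- n * log(2252655 / n) + 2 * 4  # Outlets dropped

print(round(c(full = aic_full, reduced = aic_reduced), 2))
```

These should agree with the 555.17 and 553.49 in the stepwise output, up to rounding of the printed RSS. Dropping Outlets raises the RSS only slightly while saving one parameter, which is why the smaller model has the lower AIC.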

Part C

plot(stepModel) #Evaluate the regression diagnostics

## We need to look at the residual plots to check the assumptions of the regression model, such as normally distributed residuals and constant error variance. These plots show that observations 19, 32 and 47 are outliers, which have a large influence on the model and also violate our assumptions. Therefore, we remove those outliers.
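
The visual check described above can be complemented with numeric diagnostics. The following sketch uses a hypothetical toy data set (`toy`, with one planted outlier at row 15), not the sales data, and flags points by standardized residual; the cutoff of 3 is an assumed rule of thumb.

```r
# Hypothetical toy data (not the sales data): a clean linear trend with one planted outlier
toy <- data.frame(x = 1:30)
toy$y <- 2 + 3 * toy$x + sin(toy$x)
toy$y[15] <- toy$y[15] + 50

fit_all <- lm(y ~ x, data = toy)

# Flag observations whose standardized residual exceeds 3 in absolute value
flagged <- which(abs(rstandard(fit_all)) > 3)
print(flagged)

# Cook's distance measures how much each observation moves the fitted coefficients
print(round(cooks.distance(fit_all)[flagged], 3))

# Refit without the flagged rows and compare R-squared
fit_clean <- lm(y ~ x, data = toy[-flagged, ])
print(c(all = summary(fit_all)$r.squared, clean = summary(fit_clean)$r.squared))
```

Deleting flagged observations changes the fitted model, so it is worth reporting results both with and without them.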
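new_model <- df[-c(19, 32, 47), ]  # data with outlier rows removed (a data frame, despite the name)
newdf <- lm(Profit ~ Area + Popn + Commis, data = new_model)  # model refitted on the cleaned data (an lm object, despite the name)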
summary(newdf)
## 
## Call:
## lm(formula = Profit ~ Area + Popn + Commis, data = new_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -361.88 -103.93   -5.84   95.46  309.22 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1193.532    238.001   5.015 9.18e-06 ***
## Area         -33.482      6.139  -5.454 2.13e-06 ***
## Popn          33.337     42.263   0.789    0.434    
## Commis       367.940     49.135   7.488 2.21e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 158.3 on 44 degrees of freedom
## Multiple R-squared:  0.7961, Adjusted R-squared:  0.7822 
## F-statistic: 57.26 on 3 and 44 DF,  p-value: 3.102e-15
plot(newdf)

Part D

After my analysis, the final model includes area, population and commission. Area has a negative relationship with profit, while population and commission have positive relationships with profit, although population is no longer significant once the outliers are removed (p = 0.434). The managers are not right about the combined effect of commission and outlets: the interaction term has little effect, as the p-values and R-squared values show. Profit per outlet is a better dependent variable than profit, because its model has a larger R-squared and smaller p-values for the coefficients of the independent variables. The correlations reveal a multicollinearity issue: population and area, and population and outlets, are strongly related. Using stepwise regression, we select three important variables and remove outlets. With the final model in hand, we check its assumptions by inspecting the residual plots and find that they are violated, so we remove the outliers (observations 19, 32 and 47). Finally, we refit the final model on the cleaned dataset.