Project Title: “Analysis of sales of carseats”
NAME: “Akshay Kumar Jha
EMAIL: “ajakshayjha@gmail.com”"
COLLEGE : DMS,IIT Delhi
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
We will be trying to predict the sales of carseats. In this data set, a single observation represents a location where carseats are sold.
The data set looks like the following:
Sales - Unit sales (in thousands) at each location
CompPrice - Price charged by competitor at each location
Income - Community income level (in thousands of dollars)
Advertising - Local advertising budget for company at each location (in thousands of dollars)
Population - Population size in region (in thousands)
Price - Price company charges for car seats at each site
ShelveLoc - A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
Age - Average age of the local population
Education - Education level at each location
Urban - A factor with levels No and Yes to indicate whether the store is in an urban or rural location
US - A factor with levels No and Yes to indicate whether the store is in the US or not
car.df <- read.csv(paste("Carseats.csv", sep=""))
attach(car.df)
dim(car.df)
## [1] 400 12
summary(car.df)
## Sno Sales CompPrice Income
## Min. : 1.0 Min. : 0.000 Min. : 77 Min. : 21.00
## 1st Qu.:100.8 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75
## Median :200.5 Median : 7.490 Median :125 Median : 69.00
## Mean :200.5 Mean : 7.496 Mean :125 Mean : 68.66
## 3rd Qu.:300.2 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00
## Max. :400.0 Max. :16.270 Max. :175 Max. :120.00
## Advertising Population Price ShelveLoc
## Min. : 0.000 Min. : 10.0 Min. : 24.0 Bad : 96
## 1st Qu.: 0.000 1st Qu.:139.0 1st Qu.:100.0 Good : 85
## Median : 5.000 Median :272.0 Median :117.0 Medium:219
## Mean : 6.635 Mean :264.8 Mean :115.8
## 3rd Qu.:12.000 3rd Qu.:398.5 3rd Qu.:131.0
## Max. :29.000 Max. :509.0 Max. :191.0
## Age Education Urban US
## Min. :25.00 Min. :10.0 No :118 No :142
## 1st Qu.:39.75 1st Qu.:12.0 Yes:282 Yes:258
## Median :54.50 Median :14.0
## Mean :53.32 Mean :13.9
## 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :80.00 Max. :18.0
library(psych)
describe(car.df)
## vars n mean sd median trimmed mad min max range
## Sno 1 400 200.50 115.61 200.50 200.50 148.26 1 400.00 399.00
## Sales 2 400 7.50 2.82 7.49 7.43 2.87 0 16.27 16.27
## CompPrice 3 400 124.97 15.33 125.00 125.04 14.83 77 175.00 98.00
## Income 4 400 68.66 27.99 69.00 68.26 35.58 21 120.00 99.00
## Advertising 5 400 6.63 6.65 5.00 5.89 7.41 0 29.00 29.00
## Population 6 400 264.84 147.38 272.00 265.56 191.26 10 509.00 499.00
## Price 7 400 115.80 23.68 117.00 115.92 22.24 24 191.00 167.00
## ShelveLoc* 8 400 2.31 0.83 3.00 2.38 0.00 1 3.00 2.00
## Age 9 400 53.32 16.20 54.50 53.48 20.02 25 80.00 55.00
## Education 10 400 13.90 2.62 14.00 13.88 2.97 10 18.00 8.00
## Urban* 11 400 1.70 0.46 2.00 1.76 0.00 1 2.00 1.00
## US* 12 400 1.64 0.48 2.00 1.68 0.00 1 2.00 1.00
## skew kurtosis se
## Sno 0.00 -1.21 5.78
## Sales 0.18 -0.11 0.14
## CompPrice -0.04 0.01 0.77
## Income 0.05 -1.10 1.40
## Advertising 0.63 -0.57 0.33
## Population -0.05 -1.21 7.37
## Price -0.12 0.41 1.18
## ShelveLoc* -0.62 -1.28 0.04
## Age -0.08 -1.14 0.81
## Education 0.04 -1.31 0.13
## Urban* -0.90 -1.20 0.02
## US* -0.60 -1.64 0.02
aggregate(car.df$Sales, list(ShelfLocation = car.df$ShelveLoc), mean)
## ShelfLocation x
## 1 Bad 5.522917
## 2 Good 10.214000
## 3 Medium 7.306575
aggregate(car.df$Sales, list(US = car.df$US), mean)
## US x
## 1 No 6.823028
## 2 Yes 7.866899
aggregate(car.df$Sales, list(Urban = car.df$Urban), mean)
## Urban x
## 1 No 7.563559
## 2 Yes 7.468191
aggregate(car.df$Sales, list(ShelfLocation = car.df$ShelveLoc), sd)
## ShelfLocation x
## 1 Bad 2.356349
## 2 Good 2.501243
## 3 Medium 2.266373
aggregate(car.df$Sales, list(US = car.df$US), sd)
## US x
## 1 No 2.602585
## 2 Yes 2.877131
aggregate(car.df$Sales, list(Urban = car.df$Urban), sd)
## Urban x
## 1 No 2.805846
## 2 Yes 2.836219
boxplot(Sales, main="Number of Car Seats Sold ('000)")
hist(Sales , main="Histogram of Number of CarSeats Sold", xlab ="Sales ('000)",col="Green Yellow")
boxplot(Advertising, main="Advertising Budget ('000 USD)")
hist(Advertising , main="Histogram of Advertising", xlab ="Advertising Budget ('000 USD)",col="Green Yellow")
boxplot(Sales~ShelveLoc, data=car.df, main="Sales broken down by ShelfLoc",
xlab="Shelf Location", ylab="Sales ('000 units sold)")
boxplot(Sales~US, data=car.df, main="Sales broken down by Store in US",
xlab="US", ylab="Sales ('000 units sold)")
boxplot(Sales~Urban, data=car.df, main="Sales broken down by Store in UrbanArea",
xlab="Urban Location", ylab="Sales ('000 units sold)")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(Sales ~ Advertising, data=car.df,
spread=FALSE, smoother.args=list(lty=2), pch=19,
main="Scatterplot of Sales of Car Seats vs.Advertising ",
xlab="Advertising",
ylab="Sales")
scatterplot(Sales ~ Price, data=car.df,
spread=FALSE, smoother.args=list(lty=2), pch=19,
main="Scatterplot of Sales of Car Seats vs.Price ",
xlab="Price",
ylab="Sales")
x <- car.df[,c("Sales","Advertising", "Age", "Income", "CompPrice")]
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
rcorr(as.matrix(x), type="pearson")
## Sales Advertising Age Income CompPrice
## Sales 1.00 0.27 -0.23 0.15 0.06
## Advertising 0.27 1.00 0.00 0.06 -0.02
## Age -0.23 0.00 1.00 0.00 -0.10
## Income 0.15 0.06 0.00 1.00 -0.08
## CompPrice 0.06 -0.02 -0.10 -0.08 1.00
##
## n= 400
##
##
## P
## Sales Advertising Age Income CompPrice
## Sales 0.0000 0.0000 0.0023 0.2009
## Advertising 0.0000 0.9276 0.2391 0.6294
## Age 0.0000 0.9276 0.9258 0.0451
## Income 0.0023 0.2391 0.9258 0.1073
## CompPrice 0.2009 0.6294 0.0451 0.1073
library(corrgram)
corrgram(car.df, order=FALSE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of variables ")
corrgram(car.df[,c("Sales","Income","Advertising")], order=FALSE,
lower.panel=panel.shade,
upper.panel=panel.pie,
diag.panel=panel.minmax,
text.panel=panel.txt,
main="Corrgram")
# scatter plot matrix for the following variables" {"Sales","Income","Advertising","ShelveLoc"}
library(car)
scatterplotMatrix(car.df[,c("Sales","Income","Advertising","ShelveLoc")],
spread=FALSE, smoother.args=list(lty=2),
main="Scatter Plot Matrix")
t.test(car.df$Sales,car.df$Advertising)
##
## Welch Two Sample t-test
##
## data: car.df$Sales and car.df$Advertising
## t = 2.3842, df = 538.37, p-value = 0.01746
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1516768 1.5709732
## sample estimates:
## mean of x mean of y
## 7.496325 6.635000
#Null hypothesis:There is no relation between Sales and Advertising
#Result:Null hypothesis rejected
t.test(car.df$Sales,car.df$Income)
##
## Welch Two Sample t-test
##
## data: car.df$Sales and car.df$Income
## t = -43.487, df = 407.13, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -63.92590 -58.39645
## sample estimates:
## mean of x mean of y
## 7.496325 68.657500
#Null hypothesis:There is no relation between Sales and Income
#Result:Null hypothesis rejected
t.test(car.df$Sales,car.df$CompPrice)
##
## Welch Two Sample t-test
##
## data: car.df$Sales and car.df$CompPrice
## t = -150.69, df = 426.04, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -119.0111 -115.9463
## sample estimates:
## mean of x mean of y
## 7.496325 124.975000
#Null hypothesis:There is no relation between Sales and Competitor Price.
#Result:Null hypothesis rejected
m1 <- lm(Sales ~ Advertising, data = car.df)
summary(m1)
##
## Call:
## lm(formula = Sales ~ Advertising, data = car.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3770 -1.9634 -0.1037 1.7222 8.3208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7370 0.1925 35.007 < 2e-16 ***
## Advertising 0.1144 0.0205 5.583 4.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.723 on 398 degrees of freedom
## Multiple R-squared: 0.07263, Adjusted R-squared: 0.0703
## F-statistic: 31.17 on 1 and 398 DF, p-value: 4.378e-08
m2 <- lm( Sales ~ ShelveLoc, data = car.df)
summary(m2)
##
## Call:
## lm(formula = Sales ~ ShelveLoc, data = car.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3066 -1.6282 -0.0416 1.5666 6.1471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5229 0.2388 23.131 < 2e-16 ***
## ShelveLocGood 4.6911 0.3484 13.464 < 2e-16 ***
## ShelveLocMedium 1.7837 0.2864 6.229 1.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.339 on 397 degrees of freedom
## Multiple R-squared: 0.3172, Adjusted R-squared: 0.3138
## F-statistic: 92.23 on 2 and 397 DF, p-value: < 2.2e-16
m3 <- lm( Sales ~ Advertising + ShelveLoc, data = car.df)
summary(m3)
##
## Call:
## lm(formula = Sales ~ Advertising + ShelveLoc, data = car.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6480 -1.6198 -0.0476 1.5308 6.4098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.89662 0.25207 19.426 < 2e-16 ***
## Advertising 0.10071 0.01692 5.951 5.88e-09 ***
## ShelveLocGood 4.57686 0.33479 13.671 < 2e-16 ***
## ShelveLocMedium 1.75142 0.27475 6.375 5.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.244 on 396 degrees of freedom
## Multiple R-squared: 0.3733, Adjusted R-squared: 0.3685
## F-statistic: 78.62 on 3 and 396 DF, p-value: < 2.2e-16
model4<-Sales ~ Advertising + ShelveLoc + CompPrice + Population + Income + Age + Education + Urban + US
m4 <- lm( model4, data = car.df)
summary(m4)
##
## Call:
## lm(formula = model4, data = car.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1992 -1.4647 -0.0918 1.4021 5.8304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3551482 1.2473921 4.293 2.23e-05 ***
## Advertising 0.1119963 0.0229872 4.872 1.61e-06 ***
## ShelveLocGood 4.6660979 0.3163467 14.750 < 2e-16 ***
## ShelveLocMedium 1.8981893 0.2606779 7.282 1.84e-12 ***
## CompPrice 0.0072577 0.0069983 1.037 0.300
## Population -0.0002846 0.0007653 -0.372 0.710
## Income 0.0168430 0.0038140 4.416 1.30e-05 ***
## Age -0.0400548 0.0065684 -6.098 2.59e-09 ***
## Education -0.0246887 0.0407679 -0.606 0.545
## UrbanYes 0.0676860 0.2335351 0.290 0.772
## USYes -0.2937822 0.3097058 -0.949 0.343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.106 on 389 degrees of freedom
## Multiple R-squared: 0.4576, Adjusted R-squared: 0.4437
## F-statistic: 32.82 on 10 and 389 DF, p-value: < 2.2e-16
confint(m4)
## 2.5 % 97.5 %
## (Intercept) 2.902674333 7.807622131
## Advertising 0.066801671 0.157190914
## ShelveLocGood 4.044134630 5.288061256
## ShelveLocMedium 1.385675485 2.410703058
## CompPrice -0.006501431 0.021016825
## Population -0.001789312 0.001220039
## Income 0.009344400 0.024341512
## Age -0.052968890 -0.027140692
## Education -0.104841742 0.055464288
## UrbanYes -0.391462992 0.526834908
## USYes -0.902688988 0.315124605
par(mfrow = c(2, 2))
plot(m4)
library(leaps)
leap<-regsubsets(model4,data=car.df,nbest=1)
summary(leap)
## Subset selection object
## Call: regsubsets.formula(model4, data = car.df, nbest = 1)
## 10 Variables (and intercept)
## Forced in Forced out
## Advertising FALSE FALSE
## ShelveLocGood FALSE FALSE
## ShelveLocMedium FALSE FALSE
## CompPrice FALSE FALSE
## Population FALSE FALSE
## Income FALSE FALSE
## Age FALSE FALSE
## Education FALSE FALSE
## UrbanYes FALSE FALSE
## USYes FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## Advertising ShelveLocGood ShelveLocMedium CompPrice Population
## 1 ( 1 ) " " "*" " " " " " "
## 2 ( 1 ) " " "*" "*" " " " "
## 3 ( 1 ) "*" "*" "*" " " " "
## 4 ( 1 ) "*" "*" "*" " " " "
## 5 ( 1 ) "*" "*" "*" " " " "
## 6 ( 1 ) "*" "*" "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*" " "
## 8 ( 1 ) "*" "*" "*" "*" " "
## Income Age Education UrbanYes USYes
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) "*" "*" " " " " " "
## 6 ( 1 ) "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " " " "*"
## 8 ( 1 ) "*" "*" "*" " " "*"
plot(leap,scale="adjr2")
Model 1:As F-statistic p-value is less than 0.05, the overall model is significant. The model is able to explain 7.03% of the variance in the data.
The linear model is as follows:
Sales = B0 + B1 * Advertising
Model 2:As F-statistic p-value is less than 0.05, the overall model is significant. The model is able to explain 31.3% of the variance in the data.
The linear model is as follows:
Sales = B0 + B1 * ShelfLocation
Model 3:As F-statistic p-value is less than 0.05, the overall model is significant. The model is able to explain 36.8% of the variance in the data.
The linear model is as follows:
Sales = B0 + B1 * Ad+ B2 * ShelfLoc
Model 4:As F-statistic p-value is less than 0.05, the overall model is significant. The model is able to explain 44.37% of the variance in the data.
The linear model is as follows:
Sales = B0 + B1Ad + B2ShelveLoc + B3CompPrice + B4Population + B5Income + B6Age + B7Education + B8Urban + B9*US
Both models are overall statistically significant. But Model 4 is able to explain 44.37% of the variance in the data compared to only 36.8% explained by Model 3.
Hence, Model 4 is a better fit for the given data.Statistically significant variables are:Advertising ,ShelveLocation,Income & Age.