Multiple Variables Regression
Ice Cream VS. INFLUENCING FACTORS
Echo Wei
MOnday, March 19, 2015
In this project, multiple linear regression model will be developed to test the variable influecing the consumption of ice cream.
1. Database data select:
selecting a dataset from “100+ Interesting Data Sets for Statistics”. http://lib.stat.cmu.edu/DASL/Datafiles/IceCream.html
Variable
independent variable: there are three variables.
Price: price of ice cream per pint in dollars;
Income: weekly family income in dollars;
Temp: mean temperature in degress F.
dependent variable: IC: ice cream consumption in pints per capita.
All these data is gathered form 30 four-week periods, from 03/18/51 to 07/11/53.
Objective:
Finding the influencing factors on consumption of ice cream.
Null hypothesis:
The consumption of ice cream cannot be explained by the three variables, price, income and temperature.
2. Read the dataset
data
data.table<-read.csv("C:\\Users\\Echo\\Desktop\\2015\\Regression\\Project 2\\ice cream.csv", header=T)
head(data.table)
## IC price income temp
## 1 0.386 0.270 78 41
## 2 0.374 0.282 79 56
## 3 0.393 0.277 81 63
## 4 0.425 0.280 80 68
## 5 0.406 0.272 76 69
## 6 0.344 0.262 78 65
summary(data.table)
## IC price income temp
## Min. :0.2560 Min. :0.2600 Min. :76.00 Min. :24.00
## 1st Qu.:0.3113 1st Qu.:0.2685 1st Qu.:79.25 1st Qu.:32.25
## Median :0.3515 Median :0.2770 Median :83.50 Median :49.50
## Mean :0.3594 Mean :0.2753 Mean :84.60 Mean :49.10
## 3rd Qu.:0.3912 3rd Qu.:0.2815 3rd Qu.:89.25 3rd Qu.:63.75
## Max. :0.5480 Max. :0.2920 Max. :96.00 Max. :72.00
3. Plot
sample plot
plot(data.table)
attach(data.table)
scatter plot
plot(price,IC,main="IC vs price",xlab="price(per pint in dollars)",ylab="IC(pints per capita)")
plot(income,IC,pch=21,main="IC vs income",xlab="income(dollars)",ylab="IC(pints per capita)")
plot(temp,IC,pch=21,main="IC vs temperature",xlab="temperatur(degress F)",ylab="IC(pints per capita)")
box plot
boxplot(price,main = "price",xlab="price(per pint in dollars)",ylab="IC(pints per capita)")
boxplot(income,main = "income",xlab="income(dollars)",ylab="IC(pints per capita)")
boxplot(temp,main = "temperature",xlab="temperatur(degress F)",ylab="IC(pints per capita)")
As shown in the graph, no outliers exist. All data can be used.
3. variable entry techniques
(1) Entry-wise
All variables entered simultaneously.
ice.lme<-lm(IC~price+income+temp)
summary(ice.lme)
##
## Call:
## lm(formula = IC ~ price + income + temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.065302 -0.011873 0.002737 0.015953 0.078986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1973151 0.2702162 0.730 0.47179
## price -1.0444140 0.8343573 -1.252 0.22180
## income 0.0033078 0.0011714 2.824 0.00899 **
## temp 0.0034584 0.0004455 7.762 3.1e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03683 on 26 degrees of freedom
## Multiple R-squared: 0.719, Adjusted R-squared: 0.6866
## F-statistic: 22.17 on 3 and 26 DF, p-value: 2.451e-07
The value of R-squared is 0.719, illustrating that the three variables explain 71.9% of the change of comsunption of ice cream. The p value of income and temp is signuficant, then we can reject the null hypothesis. However, the p value of price is 0.22180, which is larger than 0.5. Then we cannot reject the null hypotheis for indenpent variable price.
(2) Hierarchical
According to common sense, the higher the tmperature is, people are more willing to buying ice cream. For example, more ice cream sold in summer than in winter. Also, according to the principle of consumption, the demand curve moves right, leading to the intersection with higher consumption. Thus people with higher income are more capable of consumption because of higher consumption constraint. Same principle allpied to the price of ice cream. Demand increases because of lowering price.
Therefore, we will include the variable temp first. Then adding variable income. Last, adding variable price.
adding independent variable temp
ice.lm2<-lm(IC~temp)
summary(ice.lm2)
##
## Call:
## lm(formula = IC ~ temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.069411 -0.024478 -0.007371 0.029126 0.120516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2068621 0.0247002 8.375 4.13e-09 ***
## temp 0.0031074 0.0004779 6.502 4.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04226 on 28 degrees of freedom
## Multiple R-squared: 0.6016, Adjusted R-squared: 0.5874
## F-statistic: 42.28 on 1 and 28 DF, p-value: 4.789e-07
The p value if significant and the R-squared is 0.6016. Comparing to 71.9% of the contribution of three variables, temperature explains a lot of dependent variable, whcih plays the most important role in the independent variable.
adding independent variable temp+income
ice.lm3<-lm(IC~temp+income)
summary(ice.lm3)
##
## Call:
## lm(formula = IC ~ temp + income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.065420 -0.022458 0.004026 0.015987 0.091905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.113195 0.108280 -1.045 0.30511
## temp 0.003543 0.000445 7.963 1.47e-08 ***
## income 0.003530 0.001170 3.017 0.00551 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03722 on 27 degrees of freedom
## Multiple R-squared: 0.7021, Adjusted R-squared: 0.68
## F-statistic: 31.81 on 2 and 27 DF, p-value: 7.957e-08
The R-squared increases to 0.7021, which is close to the 0.719 in the entry wise method. Then we can suppose the variable price does not have much influence on the change of consumption.
adding independent variable temp,income and price
ice.lm4<-lm(IC~temp+income+price)
summary(ice.lm4)
##
## Call:
## lm(formula = IC ~ temp + income + price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.065302 -0.011873 0.002737 0.015953 0.078986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1973151 0.2702162 0.730 0.47179
## temp 0.0034584 0.0004455 7.762 3.1e-08 ***
## income 0.0033078 0.0011714 2.824 0.00899 **
## price -1.0444140 0.8343573 -1.252 0.22180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03683 on 26 degrees of freedom
## Multiple R-squared: 0.719, Adjusted R-squared: 0.6866
## F-statistic: 22.17 on 3 and 26 DF, p-value: 2.451e-07
Therefore, we can conclude that price does not have much influence on the consumption of ice cream. Just keep the variable temperature and income in the final model.
anova(ice.lm2,ice.lm3,ice.lm4)
## Analysis of Variance Table
##
## Model 1: IC ~ temp
## Model 2: IC ~ temp + income
## Model 3: IC ~ temp + income + price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 0.050009
## 2 27 0.037399 1 0.0126108 9.2955 0.005225 **
## 3 26 0.035273 1 0.0021257 1.5669 0.221803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(3) Stepwise
Step method is applied in R language to establish the most fitted model, which is based on the value of AIC. By picking the smallest AIC, dropping useless variables and adding useful ones.
in R language, the structure of step is as below:
step(object, scope,scale=0,direction=c(“both”,“backward”,“forward”,trace=1,keep=NULL,steps=1000,k=2,…)
object is the model we get by inputting all independent variables. Scope determined the space for stepwise search. Direction Determins the searching direction. The defaulted value is both.
lm.step<-step(ice.lme)
## Start: AIC=-194.38
## IC ~ price + income + temp
##
## Df Sum of Sq RSS AIC
## - price 1 0.002126 0.037399 -194.62
## <none> 0.035273 -194.38
## - income 1 0.010817 0.046090 -188.35
## - temp 1 0.081741 0.117013 -160.40
##
## Step: AIC=-194.62
## IC ~ income + temp
##
## Df Sum of Sq RSS AIC
## <none> 0.037399 -194.62
## - income 1 0.012611 0.050009 -187.90
## - temp 1 0.087836 0.125235 -160.36
When all independent variables are included, the AIC value is -194.38. Dropping variable price, the AIC value is -194.62. Dropping cariable income, AIC equals to -188.35. Dropping variable temp, AIC is -160.40. Choosing the one with smallest AIC, we decide to drop variable price. The final model only include variable temp and income.
conclusion
Based on analysis of three methods, the variables temp and income are included as independent varible, whic have significant influence on consumption of ince cream.
Final Model Summary
model.lm<-lm(IC~income+temp)
summary(model.lm)
##
## Call:
## lm(formula = IC ~ income + temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.065420 -0.022458 0.004026 0.015987 0.091905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.113195 0.108280 -1.045 0.30511
## income 0.003530 0.001170 3.017 0.00551 **
## temp 0.003543 0.000445 7.963 1.47e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03722 on 27 degrees of freedom
## Multiple R-squared: 0.7021, Adjusted R-squared: 0.68
## F-statistic: 31.81 on 2 and 27 DF, p-value: 7.957e-08
Both independent variable are significant. We can reject the null hypethesis and the model exist. R-squared is 0.7021. 70% of the dependent variable is explained.
4. Regression
Individual variable
As the final model only includes variable income and temperature, we only plot 95% confidential interval of this two variables.
plot(price,IC,main="IC vs price",xlab="price(per pint in dollars)",ylab="IC(pints per capita)")
ip.lm<-lm(IC~price)
abline(ip.lm$coef,lwd=2)
summary(ip.lm)
##
## Call:
## lm(formula = IC ~ price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11152 -0.03707 -0.00991 0.03666 0.15724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9230 0.3964 2.329 0.0273 *
## price -2.0472 1.4393 -1.422 0.1660
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06466 on 28 degrees of freedom
## Multiple R-squared: 0.06739, Adjusted R-squared: 0.03408
## F-statistic: 2.023 on 1 and 28 DF, p-value: 0.166
plot(income,IC,pch=21,main="IC vs income",xlab="income(dollars)",ylab="IC(pints per capita)")
ii.lm<-lm(IC~income)
abline(ii.lm$coef,lwd=2)
model_conf <- predict(ii.lm, interval="confidence")
lines(income, model_conf[,2], lty=2, lwd=2, col="blue")
lines(income, model_conf[,3], lty=2, lwd=2, col="blue")
summary(ii.lm)
##
## Call:
## lm(formula = IC ~ income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.100606 -0.050393 -0.009145 0.034013 0.185840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.316715 0.168665 1.878 0.0709 .
## income 0.000505 0.001988 0.254 0.8014
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06688 on 28 degrees of freedom
## Multiple R-squared: 0.002298, Adjusted R-squared: -0.03333
## F-statistic: 0.06449 on 1 and 28 DF, p-value: 0.8014
plot(temp,IC,pch=21,main="IC vs temperature",xlab="temperatur(degress F)",ylab="IC(pints per capita)")
it.lm<-lm(IC~temp)
abline(it.lm$coef,lwd=2)
model_conf <- predict(it.lm, interval="confidence")
lines(temp, model_conf[,2], lty=2, lwd=2, col="blue")
lines(temp, model_conf[,3], lty=2, lwd=2, col="blue")
summary(it.lm)
##
## Call:
## lm(formula = IC ~ temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.069411 -0.024478 -0.007371 0.029126 0.120516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2068621 0.0247002 8.375 4.13e-09 ***
## temp 0.0031074 0.0004779 6.502 4.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04226 on 28 degrees of freedom
## Multiple R-squared: 0.6016, Adjusted R-squared: 0.5874
## F-statistic: 42.28 on 1 and 28 DF, p-value: 4.789e-07
We can observe that although price is not included in the model, its R-square value is larger than that of income. However, income is included in final model. Also, when testing individually, the p value of income is not significant.
Let us take a look at the correlation between the correlation between the three independent variable to find the reason behind the phenomenon.
correlation
cor(data.table)
## IC price income temp
## IC 1.00000000 -0.2595940 0.04793538 0.7756246
## price -0.25959400 1.0000000 -0.10747896 -0.1082061
## income 0.04793538 -0.1074790 1.00000000 -0.3247093
## temp 0.77562456 -0.1082061 -0.32470927 1.0000000
It seems that price has a close ralationship to both income and temp with correlation close to -1. But the realtionship of temp and income is abour -0.3, which is rather small. Then adding the variable price will lead to variable autocorrelation.
Final Model 3D Plot
library(scatterplot3d)
mdl6 <- scatterplot3d(temp,income,IC,pch=21,main="IC vs. temp and income", bg = 'blue', xlab="temperatur(degress F)", ylab="income(dollars)", zlab="IC(pints per capita)", axis=TRUE)
mdl6$plane3d(model.lm, col = 'dodgerblue4')
5.Residuals Check
Normality Check
qqnorm(residuals(model.lm))
qqline(residuals(model.lm))
The plot fluctates around the line, except the upper and lower ends.
y.yes<-residuals(model.lm)
print(y.yes)
## 1 2 3 4 5
## 0.0785665517 0.0098867900 -0.0029766887 0.0148369471 0.0064143099
## 6 7 8 9 10
## -0.0484727994 -0.0654202433 -0.0442234534 0.0005166429 0.0052725915
## 11 12 13 14 15
## 0.0105088644 0.0190049752 0.0252149702 -0.0035409784 0.0027792599
## 16 17 18 19 20
## -0.0185068559 0.0456637219 0.0257240565 -0.0347404157 -0.0609976062
## 21 22 23 24 25
## -0.0237745381 -0.0286616475 -0.0480263680 0.0187504980 -0.0123833101
## 26 27 28 29 30
## 0.0163699911 0.0120838753 0.0060471715 0.0021783323 0.0919053555
y.rst<-rstandard(model.lm)
print(y.rst)
## 1 2 3 4 5 6
## 2.22373452 0.27424476 -0.08256212 0.41655346 0.18385391 -1.36299387
## 7 8 9 10 11 12
## -1.80602401 -1.23071685 0.01536842 0.15728342 0.30049672 0.54052983
## 13 14 15 16 17 18
## 0.70300297 -0.09767930 0.07612916 -0.51256422 1.29583062 0.73449148
## 19 20 21 22 23 24
## -0.97153183 -1.68684785 -0.65086100 -0.78826950 -1.37812191 0.53625498
## 25 26 27 28 29 30
## -0.36095839 0.47821297 0.34444405 0.17845876 0.06270818 2.69366619
y.fit<-predict(model.lm)
op<-par(mfrow=c(1,2))
plot(y.yes~y.fit)
plot(y.rst~y.fit)
par(op)
The residuals fluctated around value of 0. The model is fitted.
Histogram
hist(y.yes, main="Model Residual Histogram",xlab = "Fitted value of model")
More variations in the range of 0 to 0.025. The distribution of residuals tend to be left skew. The mean value of residual is around 0.
boxplot
boxplot(y.yes)
Two outliers exist. The mean of residual is about 0.