Multiple Variables Regression

Ice Cream VS. INFLUENCING FACTORS

Echo Wei

MOnday, March 19, 2015

In this project, multiple linear regression model will be developed to test the variable influecing the consumption of ice cream.

1. Database data select:

selecting a dataset from “100+ Interesting Data Sets for Statistics”. http://lib.stat.cmu.edu/DASL/Datafiles/IceCream.html

Variable

independent variable: there are three variables.

Price: price of ice cream per pint in dollars;

Income: weekly family income in dollars;

Temp: mean temperature in degress F.

dependent variable: IC: ice cream consumption in pints per capita.

All these data is gathered form 30 four-week periods, from 03/18/51 to 07/11/53.

Objective:

Finding the influencing factors on consumption of ice cream.

Null hypothesis:

The consumption of ice cream cannot be explained by the three variables, price, income and temperature.

2. Read the dataset

data

data.table<-read.csv("C:\\Users\\Echo\\Desktop\\2015\\Regression\\Project 2\\ice cream.csv", header=T)
head(data.table)

##      IC price income temp
## 1 0.386 0.270     78   41
## 2 0.374 0.282     79   56
## 3 0.393 0.277     81   63
## 4 0.425 0.280     80   68
## 5 0.406 0.272     76   69
## 6 0.344 0.262     78   65

summary(data.table)

##        IC             price            income           temp      
##  Min.   :0.2560   Min.   :0.2600   Min.   :76.00   Min.   :24.00  
##  1st Qu.:0.3113   1st Qu.:0.2685   1st Qu.:79.25   1st Qu.:32.25  
##  Median :0.3515   Median :0.2770   Median :83.50   Median :49.50  
##  Mean   :0.3594   Mean   :0.2753   Mean   :84.60   Mean   :49.10  
##  3rd Qu.:0.3912   3rd Qu.:0.2815   3rd Qu.:89.25   3rd Qu.:63.75  
##  Max.   :0.5480   Max.   :0.2920   Max.   :96.00   Max.   :72.00

3. Plot

sample plot

plot(data.table)

attach(data.table)

scatter plot

plot(price,IC,main="IC vs price",xlab="price(per pint in dollars)",ylab="IC(pints per capita)")

plot(income,IC,pch=21,main="IC vs income",xlab="income(dollars)",ylab="IC(pints per capita)")

plot(temp,IC,pch=21,main="IC vs temperature",xlab="temperatur(degress F)",ylab="IC(pints per capita)")

box plot

boxplot(price,main = "price",xlab="price(per pint in dollars)",ylab="IC(pints per capita)")

boxplot(income,main = "income",xlab="income(dollars)",ylab="IC(pints per capita)")

boxplot(temp,main = "temperature",xlab="temperatur(degress F)",ylab="IC(pints per capita)")

As shown in the graph, no outliers exist. All data can be used.

3. variable entry techniques

(1) Entry-wise

All variables entered simultaneously.

ice.lme<-lm(IC~price+income+temp)
summary(ice.lme)

## 
## Call:
## lm(formula = IC ~ price + income + temp)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.065302 -0.011873  0.002737  0.015953  0.078986 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1973151  0.2702162   0.730  0.47179    
## price       -1.0444140  0.8343573  -1.252  0.22180    
## income       0.0033078  0.0011714   2.824  0.00899 ** 
## temp         0.0034584  0.0004455   7.762  3.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03683 on 26 degrees of freedom
## Multiple R-squared:  0.719,  Adjusted R-squared:  0.6866 
## F-statistic: 22.17 on 3 and 26 DF,  p-value: 2.451e-07

The value of R-squared is 0.719, illustrating that the three variables explain 71.9% of the change of comsunption of ice cream. The p value of income and temp is signuficant, then we can reject the null hypothesis. However, the p value of price is 0.22180, which is larger than 0.5. Then we cannot reject the null hypotheis for indenpent variable price.

(2) Hierarchical

According to common sense, the higher the tmperature is, people are more willing to buying ice cream. For example, more ice cream sold in summer than in winter. Also, according to the principle of consumption, the demand curve moves right, leading to the intersection with higher consumption. Thus people with higher income are more capable of consumption because of higher consumption constraint. Same principle allpied to the price of ice cream. Demand increases because of lowering price.

Therefore, we will include the variable temp first. Then adding variable income. Last, adding variable price.

adding independent variable temp

ice.lm2<-lm(IC~temp)
summary(ice.lm2)

## 
## Call:
## lm(formula = IC ~ temp)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.069411 -0.024478 -0.007371  0.029126  0.120516 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2068621  0.0247002   8.375 4.13e-09 ***
## temp        0.0031074  0.0004779   6.502 4.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04226 on 28 degrees of freedom
## Multiple R-squared:  0.6016, Adjusted R-squared:  0.5874 
## F-statistic: 42.28 on 1 and 28 DF,  p-value: 4.789e-07

The p value if significant and the R-squared is 0.6016. Comparing to 71.9% of the contribution of three variables, temperature explains a lot of dependent variable, whcih plays the most important role in the independent variable.

adding independent variable temp+income

ice.lm3<-lm(IC~temp+income)
summary(ice.lm3)

## 
## Call:
## lm(formula = IC ~ temp + income)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.065420 -0.022458  0.004026  0.015987  0.091905 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.113195   0.108280  -1.045  0.30511    
## temp         0.003543   0.000445   7.963 1.47e-08 ***
## income       0.003530   0.001170   3.017  0.00551 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03722 on 27 degrees of freedom
## Multiple R-squared:  0.7021, Adjusted R-squared:   0.68 
## F-statistic: 31.81 on 2 and 27 DF,  p-value: 7.957e-08

The R-squared increases to 0.7021, which is close to the 0.719 in the entry wise method. Then we can suppose the variable price does not have much influence on the change of consumption.

adding independent variable temp,income and price

ice.lm4<-lm(IC~temp+income+price)
summary(ice.lm4)

## 
## Call:
## lm(formula = IC ~ temp + income + price)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.065302 -0.011873  0.002737  0.015953  0.078986 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1973151  0.2702162   0.730  0.47179    
## temp         0.0034584  0.0004455   7.762  3.1e-08 ***
## income       0.0033078  0.0011714   2.824  0.00899 ** 
## price       -1.0444140  0.8343573  -1.252  0.22180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03683 on 26 degrees of freedom
## Multiple R-squared:  0.719,  Adjusted R-squared:  0.6866 
## F-statistic: 22.17 on 3 and 26 DF,  p-value: 2.451e-07

Therefore, we can conclude that price does not have much influence on the consumption of ice cream. Just keep the variable temperature and income in the final model.

anova(ice.lm2,ice.lm3,ice.lm4)

## Analysis of Variance Table
## 
## Model 1: IC ~ temp
## Model 2: IC ~ temp + income
## Model 3: IC ~ temp + income + price
##   Res.Df      RSS Df Sum of Sq      F   Pr(>F)   
## 1     28 0.050009                                
## 2     27 0.037399  1 0.0126108 9.2955 0.005225 **
## 3     26 0.035273  1 0.0021257 1.5669 0.221803   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(3) Stepwise

Step method is applied in R language to establish the most fitted model, which is based on the value of AIC. By picking the smallest AIC, dropping useless variables and adding useful ones.

in R language, the structure of step is as below:

step(object, scope,scale=0,direction=c(“both”,“backward”,“forward”,trace=1,keep=NULL,steps=1000,k=2,…)

object is the model we get by inputting all independent variables. Scope determined the space for stepwise search. Direction Determins the searching direction. The defaulted value is both.

lm.step<-step(ice.lme)

## Start:  AIC=-194.38
## IC ~ price + income + temp
## 
##          Df Sum of Sq      RSS     AIC
## - price   1  0.002126 0.037399 -194.62
## <none>                0.035273 -194.38
## - income  1  0.010817 0.046090 -188.35
## - temp    1  0.081741 0.117013 -160.40
## 
## Step:  AIC=-194.62
## IC ~ income + temp
## 
##          Df Sum of Sq      RSS     AIC
## <none>                0.037399 -194.62
## - income  1  0.012611 0.050009 -187.90
## - temp    1  0.087836 0.125235 -160.36

When all independent variables are included, the AIC value is -194.38. Dropping variable price, the AIC value is -194.62. Dropping cariable income, AIC equals to -188.35. Dropping variable temp, AIC is -160.40. Choosing the one with smallest AIC, we decide to drop variable price. The final model only include variable temp and income.

conclusion

Based on analysis of three methods, the variables temp and income are included as independent varible, whic have significant influence on consumption of ince cream.

Final Model Summary

model.lm<-lm(IC~income+temp)
summary(model.lm)

## 
## Call:
## lm(formula = IC ~ income + temp)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.065420 -0.022458  0.004026  0.015987  0.091905 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.113195   0.108280  -1.045  0.30511    
## income       0.003530   0.001170   3.017  0.00551 ** 
## temp         0.003543   0.000445   7.963 1.47e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03722 on 27 degrees of freedom
## Multiple R-squared:  0.7021, Adjusted R-squared:   0.68 
## F-statistic: 31.81 on 2 and 27 DF,  p-value: 7.957e-08

Both independent variable are significant. We can reject the null hypethesis and the model exist. R-squared is 0.7021. 70% of the dependent variable is explained.

4. Regression

Individual variable

As the final model only includes variable income and temperature, we only plot 95% confidential interval of this two variables.

plot(price,IC,main="IC vs price",xlab="price(per pint in dollars)",ylab="IC(pints per capita)")
ip.lm<-lm(IC~price)
abline(ip.lm$coef,lwd=2)

summary(ip.lm)

## 
## Call:
## lm(formula = IC ~ price)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11152 -0.03707 -0.00991  0.03666  0.15724 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.9230     0.3964   2.329   0.0273 *
## price        -2.0472     1.4393  -1.422   0.1660  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06466 on 28 degrees of freedom
## Multiple R-squared:  0.06739,    Adjusted R-squared:  0.03408 
## F-statistic: 2.023 on 1 and 28 DF,  p-value: 0.166

plot(income,IC,pch=21,main="IC vs income",xlab="income(dollars)",ylab="IC(pints per capita)")
ii.lm<-lm(IC~income)
abline(ii.lm$coef,lwd=2)
model_conf <- predict(ii.lm, interval="confidence")
lines(income, model_conf[,2], lty=2, lwd=2, col="blue")
lines(income, model_conf[,3], lty=2, lwd=2, col="blue")

summary(ii.lm)

## 
## Call:
## lm(formula = IC ~ income)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.100606 -0.050393 -0.009145  0.034013  0.185840 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.316715   0.168665   1.878   0.0709 .
## income      0.000505   0.001988   0.254   0.8014  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06688 on 28 degrees of freedom
## Multiple R-squared:  0.002298,   Adjusted R-squared:  -0.03333 
## F-statistic: 0.06449 on 1 and 28 DF,  p-value: 0.8014

plot(temp,IC,pch=21,main="IC vs temperature",xlab="temperatur(degress F)",ylab="IC(pints per capita)")
it.lm<-lm(IC~temp)
abline(it.lm$coef,lwd=2)
model_conf <- predict(it.lm, interval="confidence")
lines(temp, model_conf[,2], lty=2, lwd=2, col="blue")
lines(temp, model_conf[,3], lty=2, lwd=2, col="blue")

summary(it.lm)

## 
## Call:
## lm(formula = IC ~ temp)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.069411 -0.024478 -0.007371  0.029126  0.120516 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2068621  0.0247002   8.375 4.13e-09 ***
## temp        0.0031074  0.0004779   6.502 4.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04226 on 28 degrees of freedom
## Multiple R-squared:  0.6016, Adjusted R-squared:  0.5874 
## F-statistic: 42.28 on 1 and 28 DF,  p-value: 4.789e-07

We can observe that although price is not included in the model, its R-square value is larger than that of income. However, income is included in final model. Also, when testing individually, the p value of income is not significant.

Let us take a look at the correlation between the correlation between the three independent variable to find the reason behind the phenomenon.

correlation

cor(data.table)

##                 IC      price      income       temp
## IC      1.00000000 -0.2595940  0.04793538  0.7756246
## price  -0.25959400  1.0000000 -0.10747896 -0.1082061
## income  0.04793538 -0.1074790  1.00000000 -0.3247093
## temp    0.77562456 -0.1082061 -0.32470927  1.0000000

It seems that price has a close ralationship to both income and temp with correlation close to -1. But the realtionship of temp and income is abour -0.3, which is rather small. Then adding the variable price will lead to variable autocorrelation.

Final Model 3D Plot

library(scatterplot3d)
mdl6 <- scatterplot3d(temp,income,IC,pch=21,main="IC vs. temp and income", bg = 'blue', xlab="temperatur(degress F)", ylab="income(dollars)", zlab="IC(pints per capita)", axis=TRUE)
mdl6$plane3d(model.lm, col = 'dodgerblue4')

5.Residuals Check

Normality Check

qqnorm(residuals(model.lm))
qqline(residuals(model.lm))

The plot fluctates around the line, except the upper and lower ends.

y.yes<-residuals(model.lm)
print(y.yes)

##             1             2             3             4             5 
##  0.0785665517  0.0098867900 -0.0029766887  0.0148369471  0.0064143099 
##             6             7             8             9            10 
## -0.0484727994 -0.0654202433 -0.0442234534  0.0005166429  0.0052725915 
##            11            12            13            14            15 
##  0.0105088644  0.0190049752  0.0252149702 -0.0035409784  0.0027792599 
##            16            17            18            19            20 
## -0.0185068559  0.0456637219  0.0257240565 -0.0347404157 -0.0609976062 
##            21            22            23            24            25 
## -0.0237745381 -0.0286616475 -0.0480263680  0.0187504980 -0.0123833101 
##            26            27            28            29            30 
##  0.0163699911  0.0120838753  0.0060471715  0.0021783323  0.0919053555

y.rst<-rstandard(model.lm)
print(y.rst)

##           1           2           3           4           5           6 
##  2.22373452  0.27424476 -0.08256212  0.41655346  0.18385391 -1.36299387 
##           7           8           9          10          11          12 
## -1.80602401 -1.23071685  0.01536842  0.15728342  0.30049672  0.54052983 
##          13          14          15          16          17          18 
##  0.70300297 -0.09767930  0.07612916 -0.51256422  1.29583062  0.73449148 
##          19          20          21          22          23          24 
## -0.97153183 -1.68684785 -0.65086100 -0.78826950 -1.37812191  0.53625498 
##          25          26          27          28          29          30 
## -0.36095839  0.47821297  0.34444405  0.17845876  0.06270818  2.69366619

y.fit<-predict(model.lm)
op<-par(mfrow=c(1,2))
plot(y.yes~y.fit)
plot(y.rst~y.fit)

par(op)

The residuals fluctated around value of 0. The model is fitted.

Histogram

hist(y.yes, main="Model Residual Histogram",xlab = "Fitted value of model")

More variations in the range of 0 to 0.025. The distribution of residuals tend to be left skew. The mean value of residual is around 0.

boxplot

boxplot(y.yes)

Two outliers exist. The mean of residual is about 0.