Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
rm(list=ls())
library(ggplot2)
clothing.dt<-read.csv('https://raw.githubusercontent.com/VioletaStoyanova/Data605/master/Clothing.csv', header=TRUE)
head(clothing.dt)
##     tsale     sales margin nown  nfull  npart   naux hoursw  hourspw
## 1  750000  4411.765     41    1 1.0000 1.0000 1.5357     76 16.75596
## 2 1926395  4280.878     39    2 2.0000 3.0000 1.5357    192 22.49376
## 3 1250000  4166.667     40    1 2.0000 2.2222 1.4091    114 17.19120
## 4  694227  2670.104     40    1 1.0000 1.2833 1.3673    100 21.50260
## 5  750000 15000.000     44    2 1.9556 1.2833 1.3673    104 15.74279
## 6  400000  4444.444     41    2 1.9556 1.2833 1.3673     72 10.89885
##        inv1     inv2 ssize start
## 1  17166.67 27177.04   170  1984
## 2  17166.67 27177.04   450  1972
## 3 292857.20 71570.55   300  1952
## 4  22207.04 15000.00   260  1966
## 5  22207.04 10000.00    50  1996
## 6  22207.04 22859.85    90  1947
summary(clothing.dt)
##      tsale             sales           margin           nown
##  Min.   :  50000   Min.   :  300   Min.   :16.00   Min.   : 1.000
##  1st Qu.: 495340   1st Qu.: 3904   1st Qu.:37.00   1st Qu.: 1.000
##  Median : 694227   Median : 5279   Median :39.00   Median : 1.000
##  Mean   : 833584   Mean   : 6335   Mean   :38.77   Mean   : 1.284
##  3rd Qu.: 976817   3rd Qu.: 7740   3rd Qu.:41.00   3rd Qu.: 1.295
##  Max.   :5000000   Max.   :27000   Max.   :66.00   Max.   :10.000
##      nfull           npart            naux           hoursw
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 32.0
##  1st Qu.:1.923   1st Qu.:1.283   1st Qu.:1.333   1st Qu.: 80.0
##  Median :1.956   Median :1.283   Median :1.367   Median :104.0
##  Mean   :2.069   Mean   :1.566   Mean   :1.390   Mean   :121.1
##  3rd Qu.:2.066   3rd Qu.:2.000   3rd Qu.:1.367   3rd Qu.:145.2
##  Max.   :8.000   Max.   :9.000   Max.   :4.000   Max.   :582.0
##     hourspw            inv1              inv2            ssize
##  Min.   : 5.708   Min.   :   1000   Min.   :   350   Min.   :  16.0
##  1st Qu.:13.541   1st Qu.:  20000   1st Qu.: 10000   1st Qu.:  80.0
##  Median :17.745   Median :  22207   Median : 22860   Median : 120.0
##  Mean   :18.955   Mean   :  58257   Mean   : 27829   Mean   : 151.1
##  3rd Qu.:24.303   3rd Qu.:  62269   3rd Qu.: 22860   3rd Qu.: 190.0
##  Max.   :43.326   Max.   :1500000   Max.   :400000   Max.   :1214.0
##      start
##  Min.   :1945
##  1st Qu.:1959
##  Median :1978
##  Mean   :1978
##  3rd Qu.:1996
##  Max.   :2015
pairs(clothing.dt,gap=0.5)
clothing.lm <- lm(sales ~ tsale+ margin+ nown+ nfull + npart+ naux+ hoursw + hourspw +inv1 +inv2+ssize+start, data =clothing.dt)
summary(clothing.lm)
##
## Call:
## lm(formula = sales ~ tsale + margin + nown + nfull + npart +
## naux + hoursw + hourspw + inv1 + inv2 + ssize + start, data = clothing.dt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7684.6 -1149.0  -560.0   571.4 14962.2
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.395e+04  1.076e+04  -1.297   0.1956
## tsale        5.272e-03  3.212e-04  16.412   <2e-16 ***
## margin       5.118e+01  2.294e+01   2.231   0.0263 *
## nown         1.872e+02  2.882e+02   0.650   0.5164
## nfull       -2.431e+02  2.451e+02  -0.992   0.3219
## npart       -2.540e+02  2.185e+02  -1.162   0.2459
## naux        -2.249e+02  3.522e+02  -0.639   0.5234
## hoursw       1.587e+01  8.266e+00   1.919   0.0557 .
## hourspw     -6.933e+01  5.661e+01  -1.225   0.2215
## inv1         9.753e-04  1.194e-03   0.817   0.4145
## inv2        -2.890e-03  3.059e-03  -0.945   0.3454
## ssize       -2.673e+01  1.312e+00 -20.378   <2e-16 ***
## start        9.270e+00  5.390e+00   1.720   0.0863 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2207 on 387 degrees of freedom
## Multiple R-squared: 0.6622, Adjusted R-squared: 0.6517
## F-statistic: 63.22 on 12 and 387 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
hist(clothing.lm$residuals, main = "Histogram of Residuals", xlab= "")
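# Residuals vs. fitted: fitted values go on the x-axis, residuals on the y-axis
plot(fitted(clothing.lm), clothing.lm$residuals, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted")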
qqnorm(clothing.lm$residuals)
qqline(clothing.lm$residuals)
Keeping only the predictors that are significant at the 5% level (with coefficients taken from the full model above), the equation for this model is: sales^ = -1.395e+04 + 5.272e-03*tsale + 5.118e+01*margin - 2.673e+01*ssize
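R-squared/Adjusted R^2: values of 0.6622 and 0.6517 respectively, which means that the model explains about 66% of the variance in sales (about 65% after adjusting for the number of predictors). F-statistic: value of 63.22 on 12 and 387 degrees of freedom with a small p-value < 2.2e-16, so the model as a whole is statistically significant.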
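As a quick check on these figures, the adjusted R-squared and the F-statistic can be recomputed from the reported R-squared. This is a minimal sketch, assuming 12 predictors and 387 residual degrees of freedom as read off the summary output (which implies 400 observations); the variable names are illustrative only.

```r
# Values read from the summary(clothing.lm) output
r2 <- 0.6622       # multiple R-squared
p  <- 12           # number of predictors (numerator DF of the F-test)
df <- 387          # residual degrees of freedom
n  <- df + p + 1   # implied number of observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df

# Overall F-statistic: explained variance per predictor divided by
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / df)

round(adj_r2, 4)
round(f_stat, 2)
```

Both values should agree with the summary output up to the rounding of the reported R-squared.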
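The residuals follow the reference line in the Q-Q plot only loosely, so we cannot conclude that they are normally distributed. The residual summary points the same way: the median is -560.0 and the maximum (14962.2) is about twice the magnitude of the minimum (-7684.6), which suggests right skew.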
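A visual reading of the Q-Q plot can be backed by a formal test. The sketch below is illustrative only: it applies the Shapiro-Wilk test from base R to the residuals of a model fitted on simulated stand-in data with deliberately right-skewed errors. The names `check_normality`, `sim`, and `sim.lm` are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: Shapiro-Wilk test on the residuals of a fitted lm
check_normality <- function(model) {
  shapiro.test(resid(model))
}

# Simulated stand-in data, NOT the clothing data: a linear signal plus
# right-skewed (exponential) errors shifted to have mean zero
set.seed(605)
n   <- 400
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 2 + 3 * sim$x + (rexp(n, rate = 1) - 1)

sim.lm <- lm(y ~ x, data = sim)
sw     <- check_normality(sim.lm)

# A small p-value is evidence against normally distributed residuals
sw$p.value < 0.05

# For the model in this document one would run:
# check_normality(clothing.lm)
```

With a few hundred observations the test can flag even mild departures from normality, so it is best read together with the histogram and the Q-Q plot rather than on its own.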
I don’t think that the multiple regression model is appropriate in this case, because the residuals do not appear to be normally distributed.
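The prompt asks for a quadratic term, a dichotomous term, and a dichotomous vs. quantitative interaction term, none of which appear in the model fitted above. One way they could be added is sketched here. Everything new is an assumption made for illustration: the helper `fit_extended`, the dummy `newer` (1 if the store opened after a cutoff year, here the median `start` of 1978 from the summary), the choice of `ssize` for the squared term, and the `margin:newer` interaction. The demonstration runs on simulated stand-in data with the same column names, not on the clothing data.

```r
# Hypothetical helper: adds a 0/1 dummy and fits a model containing
# a quadratic, a dichotomous, and an interaction term
fit_extended <- function(df, cutoff = 1978) {
  df$newer <- as.integer(df$start > cutoff)   # dichotomous term
  lm(sales ~ tsale + margin + ssize +
       I(ssize^2) +      # quadratic term
       newer +           # dichotomous term
       margin:newer,     # dichotomous x quantitative interaction
     data = df)
}

# Simulated stand-in data (NOT the clothing data), same column names
set.seed(605)
n   <- 200
sim <- data.frame(
  tsale  = runif(n, 5e4, 5e6),
  margin = runif(n, 16, 66),
  ssize  = runif(n, 16, 1214),
  start  = sample(1945:2015, n, replace = TRUE)
)
sim$sales <- 2000 + 0.005 * sim$tsale + 50 * sim$margin -
  25 * sim$ssize + 0.01 * sim$ssize^2 + rnorm(n, sd = 500)

ext.lm <- fit_extended(sim)
names(coef(ext.lm))

# For the data in this document one would run:
# summary(fit_extended(clothing.dt))
```

In such a model the coefficient on `newer` shifts the intercept for stores opened after the cutoff, and the coefficient on `margin:newer` is the difference in the `margin` slope between newer and older stores.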