I created my own cars dataset and fit a multiple linear regression model to it. The data has the following variables: Price, Mileage, Type, Cylinder, Liter, Doors, and Leather. Using all of the other variables, we will build a model to predict the price of the car.
#cars data
cars <- read.csv('https://raw.githubusercontent.com/Riteshlohiya/Data605_Discussion12/master/cars.csv')
summary(cars)
## Price Mileage Type Cylinder
## Min. :22245 Min. : 583 Convertible:50 Min. :4.000
## 1st Qu.:29338 1st Qu.:14050 Sedan :76 1st Qu.:4.000
## Median :33370 Median :21237 Median :4.000
## Mean :35667 Mean :20257 Mean :5.619
## 3rd Qu.:38275 3rd Qu.:25776 3rd Qu.:8.000
## Max. :70755 Max. :50387 Max. :8.000
## Liter Doors Leather
## Min. :2.000 Min. :2.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.0000
## Median :2.300 Median :4.000 Median :1.0000
## Mean :3.211 Mean :3.206 Mean :0.7698
## 3rd Qu.:4.600 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :6.000 Max. :4.000 Max. :1.0000
# Encoding the categorical variable
cars$Type <- ifelse(cars$Type == 'Sedan', 0, 1) # Sedan = 0 and Convertible = 1
cars
## Price Mileage Type Cylinder Liter Doors Leather
## 1 37510.25 21593 0 8 4.6 4 1
## 2 37215.17 22211 0 8 4.6 4 1
## 3 36332.89 25153 0 8 4.6 4 1
## 4 36245.16 26250 0 8 4.6 4 1
## 5 32954.14 36074 0 8 4.6 4 1
## 6 32537.19 41829 0 8 4.6 4 1
## 7 35715.77 6447 0 8 4.6 4 1
## 8 35651.68 10555 0 8 4.6 4 1
## 9 35129.34 11975 0 8 4.6 4 1
## 10 35165.76 13449 0 8 4.6 4 1
## 11 32501.25 17508 0 8 4.6 4 1
## 12 33220.03 18661 0 8 4.6 4 1
## 13 32509.48 20910 0 8 4.6 4 1
## 14 31132.21 23124 0 8 4.6 4 1
## 15 31181.72 26222 0 8 4.6 4 1
## 16 31059.18 27544 0 8 4.6 4 1
## 17 42741.52 2846 0 6 3.6 4 1
## 18 40966.61 7476 0 6 3.6 4 1
## 19 38795.38 13973 0 6 3.6 4 1
## 20 38297.46 16754 0 6 3.6 4 1
## 21 37192.90 19100 0 6 3.6 4 1
## 22 36210.12 21778 0 6 3.6 4 1
## 23 36633.63 22042 0 6 3.6 4 1
## 24 35895.50 23056 0 6 3.6 4 1
## 25 34974.38 25796 0 6 3.6 4 1
## 26 32038.34 35326 0 6 3.6 4 1
## 27 48310.33 788 0 8 4.6 4 1
## 28 48365.98 2616 0 8 4.6 4 1
## 29 45061.95 13829 0 8 4.6 4 1
## 30 44205.88 15104 0 8 4.6 4 1
## 31 42377.96 18581 0 8 4.6 4 1
## 32 41671.58 20575 0 8 4.6 4 1
## 33 41516.43 23861 0 8 4.6 4 1
## 34 41053.48 25717 0 8 4.6 4 1
## 35 38208.50 31303 0 8 4.6 4 1
## 36 39072.39 31587 0 8 4.6 4 1
## 37 70755.47 583 1 8 4.6 2 1
## 38 68566.19 6420 1 8 4.6 2 1
## 39 69133.73 7892 1 8 4.6 2 1
## 40 66374.31 12021 1 8 4.6 2 1
## 41 65281.48 15600 1 8 4.6 2 1
## 42 63913.12 18200 1 8 4.6 2 1
## 43 60567.55 23193 1 8 4.6 2 1
## 44 57154.44 29260 1 8 4.6 2 1
## 45 55639.09 31805 1 8 4.6 2 1
## 46 52001.99 42691 1 8 4.6 2 1
## 47 46732.61 3625 1 8 6.0 2 1
## 48 47065.21 5239 1 8 6.0 2 1
## 49 44749.69 12115 1 8 6.0 2 1
## 50 42773.03 14546 1 8 6.0 2 1
## 51 41371.38 20000 1 8 6.0 2 1
## 52 39547.59 23826 1 8 6.0 2 1
## 53 39691.73 25169 1 8 6.0 2 1
## 54 38824.87 25960 1 8 6.0 2 1
## 55 36970.90 30502 1 8 6.0 2 1
## 56 37288.94 32039 1 8 6.0 2 1
## 57 35622.14 10340 1 4 2.0 2 0
## 58 34819.30 12251 1 4 2.0 2 0
## 59 34355.00 17711 1 4 2.0 2 0
## 60 32737.08 19112 1 4 2.0 2 1
## 61 33540.54 20925 1 4 2.0 2 1
## 62 31970.54 21208 1 4 2.0 2 0
## 63 33287.41 21661 1 4 2.0 2 1
## 64 32075.98 23553 1 4 2.0 2 0
## 65 31969.07 24559 1 4 2.0 2 0
## 66 27666.23 35157 1 4 2.0 2 0
## 67 29246.24 3907 0 4 2.0 4 1
## 68 26337.83 16068 0 4 2.0 4 0
## 69 26775.03 16688 0 4 2.0 4 0
## 70 25299.97 19569 0 4 2.0 4 0
## 71 24896.60 21266 0 4 2.0 4 0
## 72 25996.81 21433 0 4 2.0 4 1
## 73 24801.62 26345 0 4 2.0 4 1
## 74 24063.01 27674 0 4 2.0 4 1
## 75 23249.84 27686 0 4 2.0 4 0
## 76 22244.88 50387 0 4 2.0 4 1
## 77 37088.56 3828 1 4 2.0 2 1
## 78 33381.82 17381 1 4 2.0 2 1
## 79 33358.77 17590 1 4 2.0 2 1
## 80 33586.91 18930 1 4 2.0 2 1
## 81 30731.94 22479 1 4 2.0 2 0
## 82 30315.17 23635 1 4 2.0 2 0
## 83 30166.85 25049 1 4 2.0 2 0
## 84 30251.02 27558 1 4 2.0 2 1
## 85 29142.71 31655 1 4 2.0 2 1
## 86 29612.15 32477 1 4 2.0 2 1
## 87 26841.08 10003 0 4 2.0 4 0
## 88 27825.95 10014 0 4 2.0 4 1
## 89 27284.75 14281 0 4 2.0 4 1
## 90 27060.14 17319 0 4 2.0 4 1
## 91 25618.28 20208 0 4 2.0 4 1
## 92 25790.51 21160 0 4 2.0 4 1
## 93 25148.38 22272 0 4 2.0 4 1
## 94 24852.50 22814 0 4 2.0 4 1
## 95 24173.53 27015 0 4 2.0 4 0
## 96 23733.40 27600 0 4 2.0 4 0
## 97 38324.81 12090 1 4 2.0 2 1
## 98 38167.17 13162 1 4 2.0 2 1
## 99 37383.50 16088 1 4 2.0 2 1
## 100 36338.75 18195 1 4 2.0 2 0
## 101 35580.33 21167 1 4 2.0 2 0
## 102 35304.49 21293 1 4 2.0 2 1
## 103 34393.00 24031 1 4 2.0 2 1
## 104 33984.43 25420 1 4 2.0 2 0
## 105 33248.34 27051 1 4 2.0 2 1
## 106 28777.96 48991 1 4 2.0 2 1
## 107 32197.34 3867 0 4 2.0 4 0
## 108 32053.10 5144 0 4 2.0 4 0
## 109 30274.71 10800 0 4 2.0 4 0
## 110 30353.59 11273 0 4 2.0 4 0
## 111 30122.43 14568 0 4 2.0 4 1
## 112 26789.83 22189 0 4 2.0 4 0
## 113 28291.76 22328 0 4 2.0 4 0
## 114 27109.41 22598 0 4 2.0 4 1
## 115 27256.49 26400 0 4 2.0 4 0
## 116 25267.37 34175 0 4 2.0 4 0
## 117 35033.22 1676 0 4 2.3 4 1
## 118 32746.13 7924 0 4 2.3 4 1
## 119 33183.33 9795 0 4 2.3 4 1
## 120 31002.73 15087 0 4 2.3 4 1
## 121 30075.99 22052 0 4 2.3 4 1
## 122 29844.20 23143 0 4 2.3 4 1
## 123 28432.82 25247 0 4 2.3 4 1
## 124 28054.98 26276 0 4 2.3 4 1
## 125 28502.96 28598 0 4 2.3 4 1
## 126 24912.08 38717 0 4 2.3 4 1
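As a side note, the manual 0/1 recode above is not strictly required. Here is a minimal sketch (assuming the same CSV; cars_f and fit_f are just illustrative names) in which Type is kept as a factor and lm() builds the dummy variable itself, with Sedan as the reference level:
# Sketch: keep Type as a factor and let lm() create the dummy automatically
cars_f <- read.csv('https://raw.githubusercontent.com/Riteshlohiya/Data605_Discussion12/master/cars.csv')
cars_f$Type <- relevel(factor(cars_f$Type), ref = 'Sedan')  # Sedan becomes the baseline, matching Sedan = 0 above
fit_f <- lm(Price ~ Mileage + Type + Cylinder + Liter + Doors + Leather, data = cars_f)
coef(fit_f)  # 'TypeConvertible' plays the role of the 0/1 indicator used above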
#Distribution
hist(cars$Price, main = "Histogram of price of the cars")
hist(cars$Mileage, main = "Histogram of Mileage of the cars")
#Correlation matrix
pairs(cars)
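The pairs plot gives a visual overview; since every column is numeric after the recode, a numeric correlation matrix (a minimal sketch using base R) complements it:
# Pairwise correlations between Price and the predictors (rounded for readability)
round(cor(cars), 2)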
# Quadratic variable
cars$q <- cars$Cylinder^2
# Dichotomous vs. quantitative interaction
cars$dq <- cars$Type * cars$Cylinder
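These derived columns could also be written directly in the model formula. A sketch of an equivalent specification (ml_alt is just an illustrative name), using I() for the square and : for the interaction:
# Equivalent formula-based specification of the quadratic and interaction terms
ml_alt <- lm(Price ~ Mileage + Type + Cylinder + Liter + Doors + Leather +
               I(Cylinder^2) + Type:Cylinder, data = cars)
coef(ml_alt)  # should reproduce the coefficients of the model fitted below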
#Fitting the multiple regression model
ml <- lm(Price ~ Mileage + Type + Cylinder + Liter + Doors + Leather + q + dq, data = cars)
summary(ml)
##
## Call:
## lm(formula = Price ~ Mileage + Type + Cylinder + Liter + Doors +
## Leather + q + dq, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6042.6 -2068.0 -60.9 1875.2 5785.8
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.387e+04 8.883e+03 -4.939 2.62e-06 ***
## Mileage -3.002e-01 2.870e-02 -10.460 < 2e-16 ***
## Type -1.326e+04 1.919e+03 -6.908 2.66e-10 ***
## Cylinder 3.373e+04 3.418e+03 9.869 < 2e-16 ***
## Liter -1.372e+04 9.491e+02 -14.453 < 2e-16 ***
## Doors NA NA NA NA
## Leather 2.122e+03 7.464e+02 2.843 0.00526 **
## q -1.896e+03 2.697e+02 -7.027 1.46e-10 ***
## dq 4.638e+03 3.467e+02 13.377 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3059 on 118 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.9065
## F-statistic: 174.2 on 7 and 118 DF, p-value: < 2.2e-16
The summary reports the Doors coefficient as NA ("1 not defined because of singularities"): in this data every sedan has 4 doors and every convertible has 2, so Doors is perfectly collinear with Type and adds no information. I therefore drop Doors and refit the model.
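A quick way to confirm this collinearity (a sketch using base R) is to cross-tabulate the two columns and ask the fitted model which term is aliased:
# Every Type value maps to exactly one Doors value, so Doors carries no extra information
table(Type = cars$Type, Doors = cars$Doors)
alias(ml)  # reports Doors as aliased (perfectly collinear) in the fitted model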
#Refitting the model
ml2 <- lm(Price ~ Mileage + Type + Cylinder + Liter + Leather + q + dq, data = cars)
summary(ml2)
##
## Call:
## lm(formula = Price ~ Mileage + Type + Cylinder + Liter + Leather +
## q + dq, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6042.6 -2068.0 -60.9 1875.2 5785.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.387e+04 8.883e+03 -4.939 2.62e-06 ***
## Mileage -3.002e-01 2.870e-02 -10.460 < 2e-16 ***
## Type -1.326e+04 1.919e+03 -6.908 2.66e-10 ***
## Cylinder 3.373e+04 3.418e+03 9.869 < 2e-16 ***
## Liter -1.372e+04 9.491e+02 -14.453 < 2e-16 ***
## Leather 2.122e+03 7.464e+02 2.843 0.00526 **
## q -1.896e+03 2.697e+02 -7.027 1.46e-10 ***
## dq 4.638e+03 3.467e+02 13.377 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3059 on 118 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.9065
## F-statistic: 174.2 on 7 and 118 DF, p-value: < 2.2e-16
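For completeness, confidence intervals for the refitted coefficients can be pulled out with confint(); a minimal sketch:
# 95% confidence intervals for the coefficients of ml2
round(confint(ml2, level = 0.95), 2)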
#Histogram
hist(ml2$residuals, main = "Regression Residuals")
# Residuals
plot(ml2$residuals, ylab='Residuals')
abline(a=0, b=0)
# Q-Q plot
qqnorm(ml2$residuals)
qqline(ml2$residuals)
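The same kind of checks can also be produced with R's built-in diagnostic panel; a quick sketch:
# Built-in diagnostics: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(ml2)
par(mfrow = c(1, 1))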
The multiple R-squared is 0.9118, which is good: the predictors explain about 91.18% of the variability in Price. The residual plot shows roughly constant variance with no obvious pattern, and the Q-Q plot looks approximately normal apart from some deviation in the tails. I think this multiple linear regression model (ml2) is appropriate.
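As a quick usage example (the car below is hypothetical, chosen only for illustration), the fitted model can predict the price of a new car; the derived columns q and dq have to be supplied so that the new data matches the terms in ml2:
# Hypothetical new car: 4-cylinder, 2.3 L sedan with 15,000 miles and leather seats
new_car <- data.frame(Mileage = 15000, Type = 0, Cylinder = 4, Liter = 2.3,
                      Leather = 1, q = 4^2, dq = 0 * 4)
predict(ml2, newdata = new_car, interval = 'prediction')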