We will fit a linear regression model to a cellphone dataset. We want to understand how each specification relates to smartphone price, and then use the model to predict prices from historical data. Some of the libraries we need include:
We use head() to view the first rows of the data
Data structure check
#> Rows: 161
#> Columns: 14
#> $ Product_id <int> 203, 880, 40, 99, 880, 947, 774, 947, 99, 1103, 289, 605,…
#> $ Price <int> 2357, 1749, 1916, 1315, 1749, 2137, 1238, 2137, 1315, 258…
#> $ Sale <int> 10, 10, 10, 11, 11, 12, 13, 13, 14, 15, 16, 16, 16, 16, 1…
#> $ weight <dbl> 135.0, 125.0, 110.0, 118.5, 125.0, 150.0, 134.1, 150.0, 1…
#> $ resoloution <dbl> 5.2, 4.0, 4.7, 4.0, 4.0, 5.5, 4.0, 5.5, 4.0, 5.1, 5.3, 5.…
#> $ ppi <int> 424, 233, 312, 233, 233, 401, 233, 401, 233, 432, 277, 20…
#> $ cpu.core <int> 8, 2, 4, 2, 2, 4, 2, 4, 2, 4, 8, 8, 4, 4, 4, 4, 4, 4, 4, …
#> $ cpu.freq <dbl> 1.350, 1.300, 1.200, 1.300, 1.300, 2.300, 1.200, 2.300, 1…
#> $ internal.mem <dbl> 16, 4, 8, 4, 4, 16, 8, 16, 4, 16, 32, 4, 16, 32, 16, 8, 1…
#> $ ram <dbl> 3.000, 1.000, 1.500, 0.512, 1.000, 2.000, 1.000, 2.000, 0…
#> $ RearCam <dbl> 13.00, 3.15, 13.00, 3.15, 3.15, 16.00, 2.00, 16.00, 3.15,…
#> $ Front_Cam <dbl> 8.0, 0.0, 5.0, 0.0, 0.0, 8.0, 0.0, 8.0, 0.0, 2.0, 8.0, 0.…
#> $ battery <int> 2610, 1700, 2000, 1400, 1700, 2500, 1560, 2500, 1400, 280…
#> $ thickness <dbl> 7.4, 9.9, 7.6, 11.0, 9.9, 9.5, 11.7, 9.5, 11.0, 8.1, 7.7,…
From the glimpse() output above, we can see that the data has 161 rows and 14 columns. Here is the explanation of the variables:
Check for missing values
#> Product_id Price Sale weight resoloution ppi
#> 0 0 0 0 0 0
#> cpu.core cpu.freq internal.mem ram RearCam Front_Cam
#> 0 0 0 0 0 0
#> battery thickness
#> 0 0
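The per-column counts above have the shape that colSums(is.na(...)) returns. A minimal sketch on a toy data frame follows; the data frame toy and its values are made up for illustration and are not taken from the cellphone data.

```r
# Toy data frame with two missing values (illustrative, not the real dataset)
toy <- data.frame(
  Price   = c(2357, 1749, NA, 1315),
  ram     = c(3, 1, 1.5, NA),
  battery = c(2610, 1700, 2000, 1400)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains a missing value
no_missing <- all(na_counts == 0)
print(no_missing)
```

When every count is zero, as in the output above, no imputation or row removal is needed before modelling.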
Delete columns that are not needed
#> Price Sale weight resoloution
#> Min. : 614 Min. : 10.0 Min. : 66.0 Min. : 1.40
#> 1st Qu.:1734 1st Qu.: 37.0 1st Qu.:134.1 1st Qu.: 4.80
#> Median :2258 Median : 106.0 Median :153.0 Median : 5.15
#> Mean :2216 Mean : 621.5 Mean :170.4 Mean : 5.21
#> 3rd Qu.:2744 3rd Qu.: 382.0 3rd Qu.:170.0 3rd Qu.: 5.50
#> Max. :4361 Max. :9807.0 Max. :753.0 Max. :12.20
#> ppi cpu.core cpu.freq internal.mem
#> Min. :121.0 Min. :0.000 Min. :0.000 Min. : 0.0
#> 1st Qu.:233.0 1st Qu.:4.000 1st Qu.:1.200 1st Qu.: 8.0
#> Median :294.0 Median :4.000 Median :1.400 Median : 16.0
#> Mean :335.1 Mean :4.857 Mean :1.503 Mean : 24.5
#> 3rd Qu.:428.0 3rd Qu.:8.000 3rd Qu.:1.875 3rd Qu.: 32.0
#> Max. :806.0 Max. :8.000 Max. :2.700 Max. :128.0
#> ram RearCam Front_Cam battery
#> Min. :0.000 Min. : 0.00 Min. : 0.000 Min. : 800
#> 1st Qu.:1.000 1st Qu.: 5.00 1st Qu.: 0.000 1st Qu.:2040
#> Median :2.000 Median :12.00 Median : 5.000 Median :2800
#> Mean :2.205 Mean :10.38 Mean : 4.503 Mean :2842
#> 3rd Qu.:3.000 3rd Qu.:16.00 3rd Qu.: 8.000 3rd Qu.:3240
#> Max. :6.000 Max. :23.00 Max. :20.000 Max. :9500
#> thickness
#> Min. : 5.100
#> 1st Qu.: 7.600
#> Median : 8.400
#> Mean : 8.922
#> 3rd Qu.: 9.800
#> Max. :18.500
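The summary above no longer lists Product_id, which suggests that it was the column dropped. One way to do this in base R is sketched below. The name smartphone for the raw data frame is an assumption (the original object name is not shown); smartphone_clean is the name used in the code that follows.

```r
# Hypothetical stand-in for the raw data (first rows of three columns)
smartphone <- data.frame(
  Product_id = c(203, 880, 40),
  Price      = c(2357, 1749, 1916),
  Sale       = c(10, 10, 10)
)

# Keep every column except the identifier, which is a label rather than a feature
keep_cols <- setdiff(names(smartphone), "Product_id")
smartphone_clean <- smartphone[, keep_cols, drop = FALSE]

names(smartphone_clean)
```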
We will find the relationship between variables
According to the correlations above, Price is positively and significantly correlated with ppi, cpu.core, cpu.freq, internal.mem, ram, RearCam, Front_Cam and battery, and negatively correlated with thickness.
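As a rough illustration of how such correlations can be read off numerically, the sketch below uses base R cor() and cor.test() on simulated data. The data frame toy, its coefficients and its noise level are assumptions chosen only so that one predictor correlates positively with Price and one negatively.

```r
set.seed(1)
n <- 100
ram       <- runif(n, 0.5, 6)
thickness <- runif(n, 5, 18)

# Simulated price: rises with ram, falls with thickness (assumed relationship)
Price <- 1500 + 300 * ram - 60 * thickness + rnorm(n, sd = 100)
toy <- data.frame(Price, ram, thickness)

# Pearson correlation of every column with Price
price_cor <- cor(toy)[, "Price"]
print(round(price_cor, 2))

# cor.test() adds a p-value for a single pair of variables
ram_p_value <- cor.test(toy$ram, toy$Price)$p.value
print(ram_p_value)
```

A correlation only describes a pairwise relationship; the regression models below estimate each effect while holding the other predictors fixed.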
We will split the data into two parts: train data (75%) and test data (25%).
RNGkind(sample.kind = "Rounding")
set.seed(123)
# index sampling
index <- sample(nrow(smartphone_clean), nrow(smartphone_clean)*0.75)
# splitting
smartphone_train <- smartphone_clean[index,]
smartphone_test <- smartphone_clean[-index,]
Now we will predict the price of smartphones
#>
#> Call:
#> lm(formula = Price ~ ., data = smartphone_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -299.34 -123.24 -13.95 139.56 434.82
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1852.94159 250.36511 7.401 0.0000000000323 ***
#> Sale -0.02134 0.01378 -1.548 0.124457
#> weight -0.46665 0.87036 -0.536 0.592960
#> resoloution -90.50404 51.27843 -1.765 0.080425 .
#> ppi 0.80534 0.25342 3.178 0.001940 **
#> cpu.core 52.56972 11.62342 4.523 0.0000158523443 ***
#> cpu.freq 123.59895 51.98748 2.377 0.019207 *
#> internal.mem 5.13515 1.48949 3.448 0.000810 ***
#> ram 122.64318 30.84089 3.977 0.000127 ***
#> RearCam 6.08353 4.92314 1.236 0.219276
#> Front_Cam 6.00783 6.05598 0.992 0.323412
#> battery 0.14903 0.03795 3.927 0.000153 ***
#> thickness -78.27573 14.22768 -5.502 0.0000002590968 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 175.7 on 107 degrees of freedom
#> Multiple R-squared: 0.9558, Adjusted R-squared: 0.9509
#> F-statistic: 193 on 12 and 107 DF, p-value: < 0.00000000000000022
In the model above, the significant predictors (p-value < 0.05) are ppi, cpu.core, cpu.freq, internal.mem, ram, battery and thickness, and the adjusted R-squared is 0.9509
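The call shown in the summary, lm(Price ~ ., ...), regresses Price on every other column; the stepwise code further down refers to this fit as smartphone_all. The sketch below shows the same pattern on simulated data and reads the adjusted R-squared and the significant predictors from the summary object instead of by eye. The names train_toy, model_all and noise_var are hypothetical.

```r
set.seed(42)
n <- 120
train_toy <- data.frame(
  ram       = runif(n, 0.5, 6),
  thickness = runif(n, 5, 18),
  noise_var = rnorm(n)            # predictor unrelated to Price
)
train_toy$Price <- 1500 + 300 * train_toy$ram - 60 * train_toy$thickness +
  rnorm(n, sd = 150)

# "Price ~ ." means: use all remaining columns as predictors
model_all <- lm(Price ~ ., data = train_toy)
model_summary <- summary(model_all)

# Adjusted R-squared, as printed at the bottom of summary()
adj_r2 <- model_summary$adj.r.squared

# Names of predictors with a p-value below 0.05 (intercept row excluded)
p_values <- model_summary$coefficients[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]

print(adj_r2)
print(significant)
```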
We now build a new model that keeps only the predictors that were significant in the full model
smartphone_pccirbt <- lm(Price ~ ppi + cpu.core + cpu.freq + internal.mem + ram + battery + thickness,
                         data = smartphone_train)
summary(smartphone_pccirbt)
#>
#> Call:
#> lm(formula = Price ~ ppi + cpu.core + cpu.freq + internal.mem +
#> ram + battery + thickness, data = smartphone_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -425.76 -105.84 -21.88 106.10 491.45
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1453.42200 161.57353 8.995 0.00000000000000684 ***
#> ppi 1.09150 0.23540 4.637 0.00000963830989740 ***
#> cpu.core 62.51550 11.26353 5.550 0.00000019365792438 ***
#> cpu.freq 70.95099 48.21823 1.471 0.1440
#> internal.mem 5.47281 1.35644 4.035 0.0001 ***
#> ram 151.47198 30.60958 4.949 0.00000265714548821 ***
#> battery 0.03346 0.01896 1.765 0.0803 .
#> thickness -64.91401 11.68938 -5.553 0.00000019107861978 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 189.3 on 112 degrees of freedom
#> Multiple R-squared: 0.9463, Adjusted R-squared: 0.943
#> F-statistic: 282.1 on 7 and 112 DF, p-value: < 0.00000000000000022
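The model above has an adjusted R-squared of 0.943. Note that cpu.freq and battery are no longer significant at the 0.05 level in this model.
We also try stepwise regression (backward elimination), starting from the full model smartphone_all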
smartphone_backward <- step(object = smartphone_all,
                            direction = 'backward',
                            trace = FALSE)
summary(smartphone_backward)
#>
#> Call:
#> lm(formula = Price ~ Sale + resoloution + ppi + cpu.core + cpu.freq +
#> internal.mem + ram + RearCam + battery + thickness, data = smartphone_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -290.63 -131.39 -9.12 143.28 408.44
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1948.79852 197.89971 9.847 < 0.0000000000000002 ***
#> Sale -0.01994 0.01314 -1.517 0.132101
#> resoloution -111.49918 28.71221 -3.883 0.000177 ***
#> ppi 0.75264 0.24917 3.021 0.003143 **
#> cpu.core 57.14005 10.96984 5.209 0.000000905472 ***
#> cpu.freq 126.37021 48.08851 2.628 0.009831 **
#> internal.mem 5.37305 1.45025 3.705 0.000334 ***
#> ram 130.79616 29.20828 4.478 0.000018631544 ***
#> RearCam 7.68338 4.63579 1.657 0.100313
#> battery 0.13233 0.03100 4.269 0.000042123244 ***
#> thickness -82.91474 12.12664 -6.837 0.000000000487 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 175.3 on 109 degrees of freedom
#> Multiple R-squared: 0.9552, Adjusted R-squared: 0.9511
#> F-statistic: 232.5 on 10 and 109 DF, p-value: < 0.00000000000000022
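Backward selection with step() starts from the full model and repeatedly drops the term whose removal lowers the AIC the most, stopping when no removal improves it. The sketch below runs the same call on simulated data so the behaviour can be inspected in isolation. The data frame toy and the columns junk1 and junk2 are hypothetical.

```r
set.seed(7)
n <- 120
toy <- data.frame(
  ram       = runif(n, 0.5, 6),
  thickness = runif(n, 5, 18),
  junk1     = rnorm(n),   # unrelated to Price
  junk2     = rnorm(n)    # unrelated to Price
)
toy$Price <- 1500 + 300 * toy$ram - 60 * toy$thickness + rnorm(n, sd = 150)

full_model <- lm(Price ~ ., data = toy)

# Backward elimination by AIC; trace = FALSE hides the step-by-step log
backward_model <- step(full_model, direction = "backward", trace = FALSE)

# Terms kept by the procedure
kept <- attr(terms(backward_model), "term.labels")
print(kept)

# The selected model's AIC is never higher than the full model's
print(c(full = AIC(full_model), backward = AIC(backward_model)))
```

Because the criterion is AIC rather than p-values, step() can keep terms that are not significant at the 0.05 level, as happens with Sale and RearCam in the output above.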
Of the three models above, smartphone_backward has the highest adjusted R-squared, 0.9511.
We therefore use smartphone_backward for prediction.
smartphone_backward_pred <- predict(smartphone_backward, newdata = smartphone_test %>% select(-Price))
# RMSE of train dataset
RMSE(smartphone_backward$fitted.values, smartphone_train$Price)
#> [1] 167.0512
# RMSE of test dataset
#> [1] 170.5656
# range of Price
#> [1] 614 4361
The RMSE of the smartphone_backward model is about 167 on the train data and 171 on the test data. This is small relative to the range of the target variable Price (614 to 4361), so the prediction error is acceptable.
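RMSE is the square root of the mean squared difference between predicted and actual values. The sketch below defines it in base R and puts the reported values in context. The helper rmse is a hypothetical stand-in for illustration, not the RMSE() function called in the code above.

```r
# Root mean squared error: square root of the mean squared prediction error
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Tiny check on made-up numbers: errors are 3 and -4, so RMSE = sqrt(12.5)
rmse(c(103, 196), c(100, 200))

# Values reported in this article
rmse_train  <- 167.0512
rmse_test   <- 170.5656
price_range <- c(614, 4361)

# RMSE as a share of the Price range
rel_train <- rmse_train / diff(price_range)
rel_test  <- rmse_test / diff(price_range)

# Relative gap between test and train RMSE
gap <- (rmse_test - rmse_train) / rmse_train

round(c(rel_train = rel_train, rel_test = rel_test, gap = gap), 3)
```

Both errors are under 5% of the Price range, and the test error is only about 2% larger than the train error, which gives no sign of overfitting.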
Linearity means that the target variable has a straight-line (linear) relationship with its predictors.
The smartphone_backward model meets the linearity assumption
The errors produced by the model are expected to spread randomly with constant variance. When visualized, the errors should show no pattern
# scatter plot
plot(x = smartphone_backward$fitted.values, y = smartphone_backward$residuals)
abline(h = 0, col = "red")
Breusch-Pagan hypothesis test:
#>
#> studentized Breusch-Pagan test
#>
#> data: smartphone_backward
#> BP = 12.441, df = 10, p-value = 0.2566
The p-value (0.2566) is greater than 0.05, so we fail to reject H0: the model's errors have constant variance (homoscedasticity)
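The studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the same predictors, multiply the R-squared of that auxiliary regression by the sample size, and compare the result with a chi-squared distribution whose degrees of freedom equal the number of predictors (10 in the output above). The sketch below does this in base R on simulated data. The helper bp_studentized is hypothetical and written for illustration; it is not the function that produced the output above.

```r
# Studentized (Koenker) Breusch-Pagan test, written out in base R
bp_studentized <- function(model) {
  X   <- model.matrix(model)         # predictors, including the intercept
  e2  <- residuals(model)^2          # squared residuals
  aux <- lm.fit(X, e2)               # auxiliary regression of e^2 on X
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2            # BP = n * R^2
  df   <- ncol(X) - 1                # number of predictors
  p    <- pchisq(stat, df = df, lower.tail = FALSE)
  c(BP = stat, df = df, p.value = p)
}

set.seed(3)
n <- 200
x <- runif(n, 1, 10)

# Constant error variance: homoscedastic by construction
y_homo <- 2 + 3 * x + rnorm(n, sd = 2)
# Error spread grows with x: heteroscedastic by construction
y_hetero <- 2 + 3 * x + rnorm(n, sd = 0.8 * x)

res_homo   <- bp_studentized(lm(y_homo ~ x))
res_hetero <- bp_studentized(lm(y_hetero ~ x))
print(res_homo)
print(res_hetero)
```

A small p-value, as expected for the second model, is evidence against constant variance; a large one, as in the article's result, gives no such evidence.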
Linear regression models are expected to produce normally distributed errors, so that most errors cluster around zero
Shapiro-Wilk hypothesis test:
#>
#> Shapiro-Wilk normality test
#>
#> data: smartphone_backward$residuals
#> W = 0.97348, p-value = 0.01789
The p-value (0.01789) is less than 0.05, so we reject H0: the errors are not normally distributed
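The null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution, so a p-value below the chosen threshold means normality is rejected. The sketch below applies that rule to the reported p-value and then runs shapiro.test() on two simulated samples. The threshold alpha = 0.05 and the sample names are assumptions made for illustration.

```r
alpha <- 0.05

# Decision for the p-value reported above
p_reported <- 0.01789
reject_h0  <- p_reported < alpha   # TRUE: normality is rejected
print(reject_h0)

# The same test on simulated samples
set.seed(11)
normal_sample <- rnorm(120)        # drawn from a normal distribution
skewed_sample <- rexp(120)         # strongly right-skewed

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value
print(c(normal = p_normal, skewed = p_skewed))
```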
Multicollinearity is a condition of strong correlation between predictors.
* VIF value > 10: multicollinearity occurs in the model
* VIF value < 10: there is no multicollinearity in the model
#> Sale resoloution ppi cpu.core cpu.freq internal.mem
#> 1.691452 6.427604 4.555447 2.901669 3.648987 7.423650
#> ram RearCam battery thickness
#> 8.873503 3.329914 6.281444 3.223363
All VIF values above are below 10, so there is no multicollinearity in the model.
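The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it in base R on simulated data. The helper vif_manual is hypothetical, handles numeric predictors only, and is not the function that produced the output above.

```r
# VIF for each numeric predictor: 1 / (1 - R^2 of that predictor on the rest)
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(5)
n <- 150
ram          <- runif(n, 0.5, 6)
internal.mem <- 8 * ram + rnorm(n, sd = 2)   # built to correlate with ram
thickness    <- runif(n, 5, 18)              # independent of the others
toy <- data.frame(ram, internal.mem, thickness)

vifs <- vif_manual(toy)
print(round(vifs, 2))

# Flag predictors above the threshold of 10 used in this article
names(vifs)[vifs > 10]
```

In the article's output the largest values belong to ram (8.87) and internal.mem (7.42), which are below the threshold but close enough to be worth watching.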
The significant predictors in the smartphone_backward model are resoloution, ppi, cpu.core, cpu.freq, internal.mem, ram, battery and thickness. Selected with stepwise regression, this model has the highest adjusted R-squared of the three models, 0.9511 (95.11%). We measure the accuracy of the model with RMSE: on the smartphone_train data the RMSE is 167.0512 and on the smartphone_test data it is 170.5656, a difference of only about 2%. The model satisfies the linearity, homoscedasticity and no-multicollinearity assumptions, but the Shapiro-Wilk test indicates that its residuals are not normally distributed.