Introduction

We will fit a linear regression model to a cellphone dataset. We want to understand how the variables relate to smartphone price, and we will also predict smartphone prices from historical data. The libraries we need are:
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(dplyr)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)

Data Preprocessing

Import Data

We import the prepared dataset.

smartphone <- read.csv("Cellphone.csv")

Data Inspection

We use head() to view the first rows of the data.

head(smartphone)

Data structure check

glimpse(smartphone)
#> Rows: 161
#> Columns: 14
#> $ Product_id   <int> 203, 880, 40, 99, 880, 947, 774, 947, 99, 1103, 289, 605,…
#> $ Price        <int> 2357, 1749, 1916, 1315, 1749, 2137, 1238, 2137, 1315, 258…
#> $ Sale         <int> 10, 10, 10, 11, 11, 12, 13, 13, 14, 15, 16, 16, 16, 16, 1…
#> $ weight       <dbl> 135.0, 125.0, 110.0, 118.5, 125.0, 150.0, 134.1, 150.0, 1…
#> $ resoloution  <dbl> 5.2, 4.0, 4.7, 4.0, 4.0, 5.5, 4.0, 5.5, 4.0, 5.1, 5.3, 5.…
#> $ ppi          <int> 424, 233, 312, 233, 233, 401, 233, 401, 233, 432, 277, 20…
#> $ cpu.core     <int> 8, 2, 4, 2, 2, 4, 2, 4, 2, 4, 8, 8, 4, 4, 4, 4, 4, 4, 4, …
#> $ cpu.freq     <dbl> 1.350, 1.300, 1.200, 1.300, 1.300, 2.300, 1.200, 2.300, 1…
#> $ internal.mem <dbl> 16, 4, 8, 4, 4, 16, 8, 16, 4, 16, 32, 4, 16, 32, 16, 8, 1…
#> $ ram          <dbl> 3.000, 1.000, 1.500, 0.512, 1.000, 2.000, 1.000, 2.000, 0…
#> $ RearCam      <dbl> 13.00, 3.15, 13.00, 3.15, 3.15, 16.00, 2.00, 16.00, 3.15,…
#> $ Front_Cam    <dbl> 8.0, 0.0, 5.0, 0.0, 0.0, 8.0, 0.0, 8.0, 0.0, 2.0, 8.0, 0.…
#> $ battery      <int> 2610, 1700, 2000, 1400, 1700, 2500, 1560, 2500, 1400, 280…
#> $ thickness    <dbl> 7.4, 9.9, 7.6, 11.0, 9.9, 9.5, 11.7, 9.5, 11.0, 8.1, 7.7,…

From the glimpse() output above, we can see that the data has 161 rows and 14 columns. Here is an explanation of the variables:

  • Product_id : ID of each cellphone
  • Price : Price of each cellphone
  • Sale : Sales number
  • weight : Weight of each cellphone
  • resoloution : Resolution of each cellphone
  • ppi : Pixel density of the screen (pixels per inch)
  • cpu.core : Number of CPU cores in each cellphone
  • cpu.freq : CPU Frequency in each cellphone
  • internal.mem : Internal memory of each cellphone
  • ram : RAM of each cellphone
  • RearCam : Rear camera capacity
  • Front_Cam : Front camera capacity
  • battery : Battery capacity
  • thickness : Thickness of each cellphone

Check Missing values

colSums(is.na(smartphone))
#>   Product_id        Price         Sale       weight  resoloution          ppi 
#>            0            0            0            0            0            0 
#>     cpu.core     cpu.freq internal.mem          ram      RearCam    Front_Cam 
#>            0            0            0            0            0            0 
#>      battery    thickness 
#>            0            0

Delete columns that are not needed

smartphone_clean <- smartphone %>% select(-Product_id)

Exploratory Data Analysis

summary(smartphone_clean)
#>      Price           Sale            weight       resoloution   
#>  Min.   : 614   Min.   :  10.0   Min.   : 66.0   Min.   : 1.40  
#>  1st Qu.:1734   1st Qu.:  37.0   1st Qu.:134.1   1st Qu.: 4.80  
#>  Median :2258   Median : 106.0   Median :153.0   Median : 5.15  
#>  Mean   :2216   Mean   : 621.5   Mean   :170.4   Mean   : 5.21  
#>  3rd Qu.:2744   3rd Qu.: 382.0   3rd Qu.:170.0   3rd Qu.: 5.50  
#>  Max.   :4361   Max.   :9807.0   Max.   :753.0   Max.   :12.20  
#>       ppi           cpu.core        cpu.freq      internal.mem  
#>  Min.   :121.0   Min.   :0.000   Min.   :0.000   Min.   :  0.0  
#>  1st Qu.:233.0   1st Qu.:4.000   1st Qu.:1.200   1st Qu.:  8.0  
#>  Median :294.0   Median :4.000   Median :1.400   Median : 16.0  
#>  Mean   :335.1   Mean   :4.857   Mean   :1.503   Mean   : 24.5  
#>  3rd Qu.:428.0   3rd Qu.:8.000   3rd Qu.:1.875   3rd Qu.: 32.0  
#>  Max.   :806.0   Max.   :8.000   Max.   :2.700   Max.   :128.0  
#>       ram           RearCam        Front_Cam         battery    
#>  Min.   :0.000   Min.   : 0.00   Min.   : 0.000   Min.   : 800  
#>  1st Qu.:1.000   1st Qu.: 5.00   1st Qu.: 0.000   1st Qu.:2040  
#>  Median :2.000   Median :12.00   Median : 5.000   Median :2800  
#>  Mean   :2.205   Mean   :10.38   Mean   : 4.503   Mean   :2842  
#>  3rd Qu.:3.000   3rd Qu.:16.00   3rd Qu.: 8.000   3rd Qu.:3240  
#>  Max.   :6.000   Max.   :23.00   Max.   :20.000   Max.   :9500  
#>    thickness     
#>  Min.   : 5.100  
#>  1st Qu.: 7.600  
#>  Median : 8.400  
#>  Mean   : 8.922  
#>  3rd Qu.: 9.800  
#>  Max.   :18.500

We now look at the correlations between the variables.

ggcorr(smartphone_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

From the correlation plot above, the variables that are noticeably positively correlated with Price are ppi, cpu.core, cpu.freq, internal.mem, ram, RearCam, Front_Cam and battery, while thickness is negatively correlated with Price.
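To read the correlations with the target as numbers rather than from the plot, one option is to take the target's column from the correlation matrix and sort it. The sketch below uses the built-in mtcars data with mpg as the target, purely as a stand-in; with the cellphone data, the assumption is that you would swap in smartphone_clean and "Price".

```r
# Stand-in example on the built-in mtcars data (an assumption for illustration;
# replace mtcars and "mpg" with smartphone_clean and "Price").
cor_matrix <- cor(mtcars)

# Correlation of every predictor with the target, sorted from the strongest
# negative to the strongest positive correlation
is_predictor <- colnames(cor_matrix) != "mpg"
target_cor <- sort(cor_matrix[is_predictor, "mpg"])
round(target_cor, 2)
```

A correlation coefficient only describes the pairwise linear relationship; whether a predictor is significant in the model is decided later by the regression summary.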

Modeling

We will split the data into two parts: 75% as training data and 25% as test data.

RNGkind(sample.kind = "Rounding")
set.seed(123)

# index sampling
index <- sample(nrow(smartphone_clean), nrow(smartphone_clean)*0.75)
# splitting
smartphone_train <- smartphone_clean[index,]
smartphone_test <- smartphone_clean[-index,]
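As a sanity check on the split sizes, 161 rows at 75% give 120.75, which comes out as 120 training rows and 41 test rows (the 120 agrees with the degrees of freedom in the model summaries). The sketch below reproduces this on a hypothetical stand-in data frame called toy, where only the row count matters, and makes the rounding explicit with floor().

```r
# Stand-in data frame with the same number of rows as the cellphone data
# (hypothetical values; only the row count matters here).
toy <- data.frame(id = seq_len(161), value = seq_len(161) * 2)

set.seed(123)
n_train <- floor(nrow(toy) * 0.75)   # 161 * 0.75 = 120.75, floored to 120
index <- sample(nrow(toy), n_train)

toy_train <- toy[index, ]
toy_test  <- toy[-index, ]

nrow(toy_train)  # 120
nrow(toy_test)   # 41

# No row should appear in both sets
length(intersect(toy_train$id, toy_test$id))  # 0
```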

Now we fit a model that predicts smartphone price from all the available predictors.

smartphone_all <- lm(Price ~ ., data = smartphone_train)
summary(smartphone_all)
#> 
#> Call:
#> lm(formula = Price ~ ., data = smartphone_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -299.34 -123.24  -13.95  139.56  434.82 
#> 
#> Coefficients:
#>                Estimate Std. Error t value        Pr(>|t|)    
#> (Intercept)  1852.94159  250.36511   7.401 0.0000000000323 ***
#> Sale           -0.02134    0.01378  -1.548        0.124457    
#> weight         -0.46665    0.87036  -0.536        0.592960    
#> resoloution   -90.50404   51.27843  -1.765        0.080425 .  
#> ppi             0.80534    0.25342   3.178        0.001940 ** 
#> cpu.core       52.56972   11.62342   4.523 0.0000158523443 ***
#> cpu.freq      123.59895   51.98748   2.377        0.019207 *  
#> internal.mem    5.13515    1.48949   3.448        0.000810 ***
#> ram           122.64318   30.84089   3.977        0.000127 ***
#> RearCam         6.08353    4.92314   1.236        0.219276    
#> Front_Cam       6.00783    6.05598   0.992        0.323412    
#> battery         0.14903    0.03795   3.927        0.000153 ***
#> thickness     -78.27573   14.22768  -5.502 0.0000002590968 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 175.7 on 107 degrees of freedom
#> Multiple R-squared:  0.9558, Adjusted R-squared:  0.9509 
#> F-statistic:   193 on 12 and 107 DF,  p-value: < 0.00000000000000022

In the model above, the predictors that are significant at the 0.05 level are ppi, cpu.core, cpu.freq, internal.mem, ram, battery and thickness. The adjusted R-squared is 0.9509.

Based on these results, we fit a new model that keeps only the significant predictors.

smartphone_pccirbt <- lm(Price ~ ppi + cpu.core + cpu.freq + internal.mem + ram + battery + thickness, 
                         data = smartphone_train)
summary(smartphone_pccirbt)
#> 
#> Call:
#> lm(formula = Price ~ ppi + cpu.core + cpu.freq + internal.mem + 
#>     ram + battery + thickness, data = smartphone_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -425.76 -105.84  -21.88  106.10  491.45 
#> 
#> Coefficients:
#>                Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)  1453.42200  161.57353   8.995 0.00000000000000684 ***
#> ppi             1.09150    0.23540   4.637 0.00000963830989740 ***
#> cpu.core       62.51550   11.26353   5.550 0.00000019365792438 ***
#> cpu.freq       70.95099   48.21823   1.471              0.1440    
#> internal.mem    5.47281    1.35644   4.035              0.0001 ***
#> ram           151.47198   30.60958   4.949 0.00000265714548821 ***
#> battery         0.03346    0.01896   1.765              0.0803 .  
#> thickness     -64.91401   11.68938  -5.553 0.00000019107861978 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 189.3 on 112 degrees of freedom
#> Multiple R-squared:  0.9463, Adjusted R-squared:  0.943 
#> F-statistic: 282.1 on 7 and 112 DF,  p-value: < 0.00000000000000022

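The model above has an adjusted R-squared of 0.943. Note that cpu.freq and battery are no longer significant at the 0.05 level in this model.

Next, we try stepwise regression with backward elimination.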

smartphone_backward <- step(object = smartphone_all, 
                       direction = 'backward',
                       trace = FALSE)
summary(smartphone_backward)
#> 
#> Call:
#> lm(formula = Price ~ Sale + resoloution + ppi + cpu.core + cpu.freq + 
#>     internal.mem + ram + RearCam + battery + thickness, data = smartphone_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -290.63 -131.39   -9.12  143.28  408.44 
#> 
#> Coefficients:
#>                Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)  1948.79852  197.89971   9.847 < 0.0000000000000002 ***
#> Sale           -0.01994    0.01314  -1.517             0.132101    
#> resoloution  -111.49918   28.71221  -3.883             0.000177 ***
#> ppi             0.75264    0.24917   3.021             0.003143 ** 
#> cpu.core       57.14005   10.96984   5.209       0.000000905472 ***
#> cpu.freq      126.37021   48.08851   2.628             0.009831 ** 
#> internal.mem    5.37305    1.45025   3.705             0.000334 ***
#> ram           130.79616   29.20828   4.478       0.000018631544 ***
#> RearCam         7.68338    4.63579   1.657             0.100313    
#> battery         0.13233    0.03100   4.269       0.000042123244 ***
#> thickness     -82.91474   12.12664  -6.837       0.000000000487 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 175.3 on 109 degrees of freedom
#> Multiple R-squared:  0.9552, Adjusted R-squared:  0.9511 
#> F-statistic: 232.5 on 10 and 109 DF,  p-value: < 0.00000000000000022
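By default, step() compares candidate models by AIC rather than by p-values, which is why the backward model can keep predictors such as Sale and RearCam that are not significant at the 0.05 level. A minimal sketch of the same call on the built-in mtcars data (a stand-in, not the cellphone data) shows the effect on AIC:

```r
# Stand-in example on built-in mtcars data (assumption for illustration)
full_fit <- lm(mpg ~ ., data = mtcars)

# Backward elimination: drop one term at a time while AIC keeps improving
backward_fit <- step(full_fit, direction = "backward", trace = FALSE)

formula(backward_fit)  # predictors that survived the elimination
c(full = AIC(full_fit), backward = AIC(backward_fit))
```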

Of the three models above, the one with the highest Adjusted R-squared is smartphone_backward with the value of: 0.9511
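Adjusted R-squared is used for this comparison because it penalizes the number of predictors, which plain R-squared does not. As a worked check, the sketch below recomputes it from the Multiple R-squared values printed above, assuming n = 120 training rows (residual degrees of freedom plus the number of coefficients). The helper adj_r_squared is a hypothetical name introduced here.

```r
# Adjusted R-squared from R-squared (r2), sample size (n),
# and number of predictors excluding the intercept (p)
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Inputs read from the summaries above
reduced_adj  <- adj_r_squared(0.9463, n = 120, p = 7)   # smartphone_pccirbt
backward_adj <- adj_r_squared(0.9552, n = 120, p = 10)  # smartphone_backward

round(reduced_adj, 3)   # 0.943
round(backward_adj, 4)  # 0.9511
```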

Evaluation

The model we use is the one with the highest adjusted R-squared value, namely smartphone_backward.

smartphone_backward_pred <- predict(smartphone_backward, newdata = smartphone_test %>% select(-Price))

# RMSE of train dataset
RMSE(smartphone_backward$fitted.values, smartphone_train$Price)
#> [1] 167.0512
# RMSE Of test dataset
RMSE(smartphone_backward_pred, smartphone_test$Price)
#> [1] 170.5656
range(smartphone_clean$Price)
#> [1]  614 4361

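The test RMSE (170.57) is close to the training RMSE (167.05), so the smartphone_backward model does not appear to overfit. Both values are also small relative to the range of the target variable (614 to 4361), at roughly 4.6% of that range.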
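For reference, RMSE is the square root of the mean squared difference between predictions and actual values. The sketch below writes it out in base R on a tiny made-up example and then relates the test RMSE reported above to the observed price range. The helper name rmse and the toy vectors are assumptions for illustration.

```r
# RMSE written out in base R
rmse <- function(pred, actual) sqrt(mean((actual - pred)^2))

# Tiny made-up example to show the calculation
actual <- c(1000, 2000, 3000)
pred   <- c(1100, 1900, 3200)
toy_rmse <- rmse(pred, actual)
toy_rmse

# Test RMSE from above relative to the observed price range
price_range <- 4361 - 614
relative_rmse <- 170.5656 / price_range
round(relative_rmse, 3)  # 0.046, i.e. roughly 4.6% of the range
```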

Assumption

Linearity

Linearity means that the target variable and its predictors have a linear (straight-line) relationship.

plot(smartphone_backward, which = 1)

The smartphone_backward model still meets the linearity assumption.

Homoscedasticity of Residuals

The errors produced by the model are expected to spread randomly with constant variance. When visualized, the errors should show no pattern.

# scatter plot
plot(x = smartphone_backward$fitted.values, y = smartphone_backward$residuals)
abline(h = 0, col = "red")

Breusch-Pagan hypothesis test:

  • H0: error spread is constant or homoscedasticity
  • H1: error spread is NOT constant or heteroscedasticity
# bptest on the model
bptest(smartphone_backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  smartphone_backward
#> BP = 12.441, df = 10, p-value = 0.2566

The p-value is greater than 0.05, so we fail to reject H0: the model's errors have constant variance (homoscedasticity).
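To make the test statistic less of a black box: the studentized form of the Breusch-Pagan test, which bptest() reports by default, can be computed as n times the R-squared of an auxiliary regression of the squared residuals on the predictors. The sketch below does this by hand on the built-in mtcars data (a stand-in model, not smartphone_backward).

```r
# Stand-in model on built-in mtcars data (assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(mtcars, sq_resid = residuals(fit)^2)
aux_fit  <- lm(sq_resid ~ wt + hp, data = aux_data)

# Studentized BP statistic: n * R-squared of the auxiliary regression
bp_stat <- nrow(mtcars) * summary(aux_fit)$r.squared
bp_df   <- 2  # number of predictors in the auxiliary regression

# Compare against a chi-squared distribution
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p_value = p_value)
```

A p-value above 0.05 here would be read the same way as above: no evidence against constant error variance.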

Normality of Residuals

Linear regression models are expected to produce normally distributed errors, so that most errors cluster around zero.

# histogram residual
hist(smartphone_backward$residuals)

Shapiro-Wilk hypothesis test:

  • H0: error is normally distributed
  • H1: error is NOT normally distributed
# shapiro test on the residuals
shapiro.test(smartphone_backward$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  smartphone_backward$residuals
#> W = 0.97348, p-value = 0.01789

The p-value (0.01789) is less than 0.05, so we reject H0: the errors are not normally distributed.
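The decision rule for the Shapiro-Wilk test compares the p-value with the significance level (0.05 here): a p-value below it rejects normality. The sketch below applies that rule to simulated, clearly right-skewed values; the data are made up for illustration and are not the model's residuals.

```r
set.seed(123)

# Simulated, right-skewed "residuals" centered near zero (made-up data)
skewed <- rexp(120) - 1

result <- shapiro.test(skewed)
alpha  <- 0.05

# Decision rule: reject H0 (normality) when the p-value is below alpha
if (result$p.value < alpha) {
  decision <- "Reject H0: residuals are not normally distributed"
} else {
  decision <- "Fail to reject H0: no evidence against normality"
}
decision
```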

No Multicollinearity

Multicollinearity is a condition of strong correlation between predictors. We check it with the variance inflation factor (VIF):

  • VIF value > 10: multicollinearity occurs in the model
  • VIF value < 10: there is no multicollinearity in the model

vif(smartphone_backward)
#>         Sale  resoloution          ppi     cpu.core     cpu.freq internal.mem 
#>     1.691452     6.427604     4.555447     2.901669     3.648987     7.423650 
#>          ram      RearCam      battery    thickness 
#>     8.873503     3.329914     6.281444     3.223363

All VIF values above are below 10, so there is no multicollinearity in the model.
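For intuition about what the VIF measures: each predictor is regressed on the remaining predictors, and its VIF is 1 / (1 - R²) of that regression, so a predictor that is well explained by the others gets a large value. A minimal base R sketch on the built-in mtcars data (a stand-in, with an arbitrary choice of three predictors):

```r
# Stand-in predictors from built-in mtcars data (assumption for illustration)
predictors <- c("wt", "hp", "disp")

# VIF for each predictor: regress it on the other predictors,
# then compute 1 / (1 - R^2)
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(fit)$r.squared)
})
round(manual_vif, 2)
```

A VIF of 1 means the predictor is uncorrelated with the others; values grow as the predictors overlap more.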

Conclusion

The significant variables in the smartphone_backward model are resoloution, ppi, cpu.core, cpu.freq, internal.mem, ram, battery and thickness. This model, obtained with stepwise regression, has the highest adjusted R-squared at 95.11%. We measured model accuracy with RMSE: on the smartphone_train data the RMSE is 167.0512 and on the smartphone_test data it is 170.5656, a difference of only about 2%. The model fulfills the assumptions of linearity, homoscedasticity and no multicollinearity, but its residuals do not pass the Shapiro-Wilk normality test.