Description

A new company wants to start its own mobile phone business and compete with big companies such as Apple and Samsung. It does not know how to estimate the price of the phones it creates, and in this competitive mobile phone market you cannot simply assume things. To solve this problem, the company collects sales data on mobile phones from various manufacturers.

This report describes mobile price classification using machine learning algorithms. We investigated four algorithms: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).
The dataset used in this report is Mobile Price Classification, hosted on Kaggle.

The dataset can be downloaded here: https://www.kaggle.com/iabhishekofficial/mobile-price-classification

Report Outline
1. Data Extraction
2. Exploratory Data Analysis
3. Modelling
4. Evaluation
5. Recommendation

1. Data Extraction

The dataset was downloaded from Kaggle and saved in the Data folder. We use the read.csv function to read it into mobile_train_df for the training data and mobile_test_df for the test data.

mobile_test_df <- read.csv("Data/test.csv")
mobile_train_df <- read.csv("Data/train.csv")

To see the number of rows and columns, we used the dim function. The training dataset has 2000 rows and 21 columns.

dim(mobile_train_df)
## [1] 2000   21

2. Exploratory Data Analysis

To find out the column names and types, we used the str() function.

str(mobile_train_df)
## 'data.frame':    2000 obs. of  21 variables:
##  $ battery_power: int  842 1021 563 615 1821 1859 1821 1954 1445 509 ...
##  $ blue         : int  0 1 1 1 1 0 0 0 1 1 ...
##  $ clock_speed  : num  2.2 0.5 0.5 2.5 1.2 0.5 1.7 0.5 0.5 0.6 ...
##  $ dual_sim     : int  0 1 1 0 0 1 0 1 0 1 ...
##  $ fc           : int  1 0 2 0 13 3 4 0 0 2 ...
##  $ four_g       : int  0 1 1 0 1 0 1 0 0 1 ...
##  $ int_memory   : int  7 53 41 10 44 22 10 24 53 9 ...
##  $ m_dep        : num  0.6 0.7 0.9 0.8 0.6 0.7 0.8 0.8 0.7 0.1 ...
##  $ mobile_wt    : int  188 136 145 131 141 164 139 187 174 93 ...
##  $ n_cores      : int  2 3 5 6 2 1 8 4 7 5 ...
##  $ pc           : int  2 6 6 9 14 7 10 0 14 15 ...
##  $ px_height    : int  20 905 1263 1216 1208 1004 381 512 386 1137 ...
##  $ px_width     : int  756 1988 1716 1786 1212 1654 1018 1149 836 1224 ...
##  $ ram          : int  2549 2631 2603 2769 1411 1067 3220 700 1099 513 ...
##  $ sc_h         : int  9 17 11 16 8 17 13 16 17 19 ...
##  $ sc_w         : int  7 3 2 8 2 1 8 3 1 10 ...
##  $ talk_time    : int  19 7 9 11 15 10 18 5 20 12 ...
##  $ three_g      : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ touch_screen : int  0 1 1 0 1 0 0 1 0 0 ...
##  $ wifi         : int  1 0 0 0 0 0 1 1 0 0 ...
##  $ price_range  : int  1 2 2 2 1 1 3 0 0 0 ...
str(mobile_test_df)
## 'data.frame':    1000 obs. of  21 variables:
##  $ id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ battery_power: int  1043 841 1807 1546 1434 1464 1718 833 1111 1520 ...
##  $ blue         : int  1 1 1 0 0 1 0 0 1 0 ...
##  $ clock_speed  : num  1.8 0.5 2.8 0.5 1.4 2.9 2.4 2.4 2.9 0.5 ...
##  $ dual_sim     : int  1 1 0 1 0 1 0 1 1 0 ...
##  $ fc           : int  14 4 1 18 11 5 1 0 9 1 ...
##  $ four_g       : int  0 1 0 1 1 1 0 0 1 0 ...
##  $ int_memory   : int  5 61 27 25 49 50 47 62 25 25 ...
##  $ m_dep        : num  0.1 0.8 0.9 0.5 0.5 0.8 1 0.8 0.6 0.5 ...
##  $ mobile_wt    : int  193 191 186 96 108 198 156 111 101 171 ...
##  $ n_cores      : int  3 5 3 8 6 8 2 1 5 3 ...
##  $ pc           : int  16 12 4 20 18 9 3 2 19 20 ...
##  $ px_height    : int  226 746 1270 295 749 569 1283 1312 556 52 ...
##  $ px_width     : int  1412 857 1366 1752 810 939 1374 1880 876 1009 ...
##  $ ram          : int  3476 3895 2396 3893 1773 3506 3873 1495 3485 651 ...
##  $ sc_h         : int  12 6 17 10 15 10 14 7 11 6 ...
##  $ sc_w         : int  7 0 10 0 8 7 2 2 9 0 ...
##  $ talk_time    : int  2 7 10 7 7 3 10 18 10 5 ...
##  $ three_g      : int  0 1 0 1 1 1 0 0 1 1 ...
##  $ touch_screen : int  1 0 1 1 0 1 0 1 1 0 ...
##  $ wifi         : int  0 0 1 0 1 1 0 1 0 1 ...

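From the result above, we can see that the first column in the test data is id. It is a unique row identifier and is unnecessary for prediction, so it should be removed. The test data also has no price_range column: its 21 variables are id plus the 20 features.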
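
Removing the identifier can be sketched as follows. This is a minimal illustration on hypothetical toy frames (train_toy, test_toy, and test_features are stand-in names, not objects from the report); the same pattern would apply to mobile_test_df.

```r
# Hypothetical toy frames standing in for mobile_train_df and mobile_test_df
train_toy <- data.frame(battery_power = c(842, 1021),
                        ram = c(2549, 2631),
                        price_range = c(1, 2))
test_toy <- data.frame(id = 1:2,
                       battery_power = c(1043, 841),
                       ram = c(3476, 3895))

# Keep every column except the row identifier
test_features <- test_toy[, setdiff(names(test_toy), "id"), drop = FALSE]

# Columns present in the training data but missing from the test features
missing_in_test <- setdiff(names(train_toy), names(test_features))
missing_in_test
```

Comparing the column names afterwards is a cheap check that the test features line up with the training features, with only the target left over.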

To look at the first rows of the data and the correlations between variables, we used the head() and corrgram() functions.

head(mobile_train_df)
##   battery_power blue clock_speed dual_sim fc four_g int_memory m_dep mobile_wt
## 1           842    0         2.2        0  1      0          7   0.6       188
## 2          1021    1         0.5        1  0      1         53   0.7       136
## 3           563    1         0.5        1  2      1         41   0.9       145
## 4           615    1         2.5        0  0      0         10   0.8       131
## 5          1821    1         1.2        0 13      1         44   0.6       141
## 6          1859    0         0.5        1  3      0         22   0.7       164
##   n_cores pc px_height px_width  ram sc_h sc_w talk_time three_g touch_screen
## 1       2  2        20      756 2549    9    7        19       0            0
## 2       3  6       905     1988 2631   17    3         7       1            1
## 3       5  6      1263     1716 2603   11    2         9       1            1
## 4       6  9      1216     1786 2769   16    8        11       1            0
## 5       2 14      1208     1212 1411    8    2        15       1            1
## 6       1  7      1004     1654 1067   17    1        10       1            0
##   wifi price_range
## 1    1           1
## 2    0           2
## 3    0           2
## 4    0           2
## 5    0           1
## 6    0           1
head(mobile_test_df)
##   id battery_power blue clock_speed dual_sim fc four_g int_memory m_dep
## 1  1          1043    1         1.8        1 14      0          5   0.1
## 2  2           841    1         0.5        1  4      1         61   0.8
## 3  3          1807    1         2.8        0  1      0         27   0.9
## 4  4          1546    0         0.5        1 18      1         25   0.5
## 5  5          1434    0         1.4        0 11      1         49   0.5
## 6  6          1464    1         2.9        1  5      1         50   0.8
##   mobile_wt n_cores pc px_height px_width  ram sc_h sc_w talk_time three_g
## 1       193       3 16       226     1412 3476   12    7         2       0
## 2       191       5 12       746      857 3895    6    0         7       1
## 3       186       3  4      1270     1366 2396   17   10        10       0
## 4        96       8 20       295     1752 3893   10    0         7       1
## 5       108       6 18       749      810 1773   15    8         7       1
## 6       198       8  9       569      939 3506   10    7         3       1
##   touch_screen wifi
## 1            1    0
## 2            0    0
## 3            1    1
## 4            1    0
## 5            0    1
## 6            1    1
library(corrgram)
## Warning: package 'corrgram' was built under R version 4.0.4
corrgram(mobile_train_df[6:16], order = TRUE,
         upper.panel = panel.pie)
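
As a numeric complement to the correlogram, the correlation of each feature with price_range can be computed with base R's cor(). The sketch below uses only the first ten training rows shown in the str() output above, for three of the features, so the values are illustrative and not representative of the full dataset; first_rows and cor_with_price are hypothetical names.

```r
# First ten training rows for a few columns, copied from the str() output above
first_rows <- data.frame(
  battery_power = c(842, 1021, 563, 615, 1821, 1859, 1821, 1954, 1445, 509),
  px_height     = c(20, 905, 1263, 1216, 1208, 1004, 381, 512, 386, 1137),
  ram           = c(2549, 2631, 2603, 2769, 1411, 1067, 3220, 700, 1099, 513),
  price_range   = c(1, 2, 2, 2, 1, 1, 3, 0, 0, 0)
)

# Pearson correlation of every column with the (still integer) target
cor_with_price <- cor(first_rows)[, "price_range"]
sort(cor_with_price, decreasing = TRUE)
```

This only works while price_range is still numeric, that is, before it is converted to a factor.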

We need to change the data type of the categorical variables (the target price_range and the binary feature flags) from integer to factor.

mobile_train_df$price_range <- factor(mobile_train_df$price_range, levels = c(0,1,2,3),
                       labels = c("low cost", "medium cost", "high cost", "very high cost"))

mobile_train_df$blue <- factor(mobile_train_df$blue, levels = c(0,1),
                                  labels = c("not", "yes"))
mobile_train_df$dual_sim <- factor(mobile_train_df$dual_sim, levels = c(0,1),
                               labels = c("not", "yes"))
mobile_train_df$four_g <- factor(mobile_train_df$four_g, levels = c(0,1),
                               labels = c("not", "yes"))
mobile_train_df$three_g <- factor(mobile_train_df$three_g, levels = c(0,1),
                               labels = c("not", "yes"))
mobile_train_df$touch_screen <- factor(mobile_train_df$touch_screen, levels = c(0,1),
                               labels = c("not", "yes"))
mobile_train_df$wifi <- factor(mobile_train_df$wifi, levels = c(0,1),
                               labels = c("not", "yes"))

We also need to change the remaining integer columns to numeric so that we can carry out the bivariate data analysis.

# change to numeric 
mobile_train_df$fc <- as.numeric(mobile_train_df$fc)
mobile_train_df$int_memory <- as.numeric(mobile_train_df$int_memory)
mobile_train_df$mobile_wt <- as.numeric(mobile_train_df$mobile_wt)
mobile_train_df$n_cores <- as.numeric(mobile_train_df$n_cores)
mobile_train_df$px_height <- as.numeric(mobile_train_df$px_height)
mobile_train_df$px_width <- as.numeric(mobile_train_df$px_width)
mobile_train_df$ram <- as.numeric(mobile_train_df$ram)
mobile_train_df$sc_h <- as.numeric(mobile_train_df$sc_h)
mobile_train_df$sc_w <- as.numeric(mobile_train_df$sc_w)
mobile_train_df$talk_time <- as.numeric(mobile_train_df$talk_time)
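
The two conversions above repeat the same call for each column. A more compact variant, sketched here on a hypothetical toy frame (toy, binary_cols, and numeric_cols are illustrative names), applies the conversion to a vector of column names with lapply():

```r
# Hypothetical toy frame with two binary flags and two integer measurements
toy <- data.frame(
  blue = c(0L, 1L, 1L),
  wifi = c(1L, 0L, 0L),
  ram  = c(2549L, 2631L, 2603L),
  sc_h = c(9L, 17L, 11L)
)

binary_cols  <- c("blue", "wifi")
numeric_cols <- c("ram", "sc_h")

# Binary flags become factors with the same labels used in the report
toy[binary_cols] <- lapply(toy[binary_cols], factor,
                           levels = c(0, 1), labels = c("not", "yes"))

# Integer measurements become numeric (double)
toy[numeric_cols] <- lapply(toy[numeric_cols], as.numeric)

str(toy)
```

Keeping the column names in one vector makes it easier to apply the identical conversion to the test data later.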

2.1 Univariate Data Analysis

Analysis of a single variable.

library(ggplot2)
ggplot(data=mobile_train_df, aes(x = price_range)) +
  geom_bar()

ggplot(data = mobile_train_df, aes(y=price_range)) +
  geom_boxplot() +
  labs(title = "Mobile Price Classification", y="Price Range")

From the result above, we can see that the dataset is balanced: each of the four classes (low cost, medium cost, high cost, and very high cost) has the same number of observations.
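
The balance can also be checked numerically with table(). The sketch below uses only the first ten price_range values from the str() output, so this small sample is not balanced; class_counts and is_balanced are hypothetical names, and the same calls would apply to mobile_train_df$price_range.

```r
# First ten price_range values from the str() output, labelled as in the report
price_range <- factor(c(1, 2, 2, 2, 1, 1, 3, 0, 0, 0),
                      levels = c(0, 1, 2, 3),
                      labels = c("low cost", "medium cost",
                                 "high cost", "very high cost"))

class_counts <- table(price_range)  # observations per class
prop.table(class_counts)            # share of each class

# TRUE only when every class has the same count
is_balanced <- length(unique(as.vector(class_counts))) == 1
is_balanced
```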

2.2 Bivariate Data Analysis

Analysis of two variables. Here we look for relationships between the features of a mobile phone and its selling price.

ggplot(mobile_train_df, aes(x=price_range, fill = blue)) +
  geom_bar(position = "dodge")

ggplot(mobile_train_df, aes(x=price_range, fill = dual_sim)) +
  geom_bar(position = "dodge")

ggplot(mobile_train_df, aes(x=price_range, fill = four_g)) +
  geom_bar(position = "dodge")

ggplot(mobile_train_df, aes(x=price_range, fill = three_g)) +
  geom_bar(position = "dodge")

ggplot(mobile_train_df, aes(x=price_range, fill = touch_screen)) +
  geom_bar(position = "dodge")

ggplot(mobile_train_df, aes(x=price_range, fill = wifi)) +
  geom_bar(position = "dodge")
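
The dodged bar charts can be backed by a cross-tabulation, which gives the share of phones with a feature inside each price class. This is a sketch on the first ten training rows from the str() output only (toy, counts, and row_shares are hypothetical names), so the proportions are illustrative.

```r
# blue and price_range for the first ten training rows (from the str() output)
toy <- data.frame(
  blue        = factor(c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1),
                       levels = c(0, 1), labels = c("not", "yes")),
  price_range = factor(c(1, 2, 2, 2, 1, 1, 3, 0, 0, 0),
                       levels = c(0, 1, 2, 3),
                       labels = c("low cost", "medium cost",
                                  "high cost", "very high cost"))
)

counts <- table(toy$price_range, toy$blue)   # rows: price class, columns: blue
row_shares <- prop.table(counts, margin = 1) # each row sums to 1
round(row_shares, 2)
```

If the shares are nearly the same in every row, the feature carries little information about the price class on its own.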

Next we plot the observations by RAM and number of cores, to see whether the amount of RAM and the number of cores affect the price.
The color and shape of the observations are based on the selling price class (low cost, medium cost, high cost, and very high cost).

ggplot(data=mobile_train_df, aes(x=ram, y=n_cores,
                        shape=price_range, color=price_range)) +
  geom_point() +
  labs(title = "Mobile Price Classification", x="Ram", y="Cores")

In general, the amount of RAM and the number of cores have a large effect on the price range. Smartphones with more RAM are classified into a higher price range than those with less RAM. However, these two variables alone are not enough to separate the classes.
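
The relationship between RAM and price class can be summarised with a per-class median. The sketch below uses the first ten training rows from the str() output, which is far too few to draw conclusions from; ram_by_class is a hypothetical name, and the same tapply() call would work on the full mobile_train_df.

```r
# ram and price_range for the first ten training rows (from the str() output)
ram         <- c(2549, 2631, 2603, 2769, 1411, 1067, 3220, 700, 1099, 513)
price_range <- c(1, 2, 2, 2, 1, 1, 3, 0, 0, 0)

# Median RAM within each price class (0 = low cost ... 3 = very high cost)
ram_by_class <- tapply(ram, price_range, median)
ram_by_class
# →    0    1    2    3
# →  700 1411 2631 3220
```

In this small sample the median RAM rises with every price class, which matches the pattern visible in the scatter plot.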

4. Modelling

We use four machine learning algorithms.

4.1 Logistic Regression

fit.logit <- glm(price_range~. ,
                 data = mobile_train_df,
                 family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit.logit)
## 
## Call:
## glm(formula = price_range ~ ., family = binomial, data = mobile_train_df)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.427e-03   1.000e-08   2.000e-08   2.000e-08   2.075e-03  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -5771.7296 53409.9887  -0.108    0.914
## battery_power       1.4927    13.5365   0.110    0.912
## blueyes            -7.2891   694.7855  -0.010    0.992
## clock_speed       -17.1823  1013.5374  -0.017    0.986
## dual_simyes        -6.3351   647.0005  -0.010    0.992
## fc                 -6.2349   223.8397  -0.028    0.978
## four_gyes           5.0844   908.2902   0.006    0.996
## int_memory          0.4382    23.6780   0.019    0.985
## m_dep             -29.8635  1595.0836  -0.019    0.985
## mobile_wt          -3.1548    29.6382  -0.106    0.915
## n_cores            17.5208   297.3277   0.059    0.953
## pc                  3.1345   102.6764   0.031    0.976
## px_height           0.8551     8.0433   0.106    0.915
## px_width            0.8479     8.3537   0.102    0.919
## ram                 2.4153    22.0169   0.110    0.913
## sc_h               -4.3319   153.8299  -0.028    0.978
## sc_w               -9.5058   151.5778  -0.063    0.950
## talk_time          -2.2555    66.7331  -0.034    0.973
## three_gyes         15.9741   686.9693   0.023    0.981
## touch_screenyes    -3.7032   579.0539  -0.006    0.995
## wifiyes           -28.6611  1090.9916  -0.026    0.979
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.2493e+03  on 1999  degrees of freedom
## Residual deviance: 5.2049e-05  on 1979  degrees of freedom
## AIC: 42
## 
## Number of Fisher Scoring iterations: 25
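
Note that glm() with family = binomial treats a factor response as two-class: the first level (here low cost) counts as failure and every other level as success. The fit above is therefore a low-cost-versus-rest model rather than a four-class classifier, and the convergence warnings and near-zero residual deviance are consistent with those two groups being perfectly separable. A minimal sketch of a four-class alternative follows, assuming multinomial logistic regression via nnet::multinom is acceptable. It runs on simulated stand-in data, not the Kaggle dataset, and toy, score, and fit.multinom are hypothetical names.

```r
library(nnet)  # recommended package shipped with standard R installations

set.seed(2021)
n <- 400

# Simulated stand-in features (assumption: NOT the real dataset)
toy <- data.frame(
  ram           = runif(n, 256, 4000),
  battery_power = runif(n, 500, 2000)
)

# Simulated price score, cut into four equally sized classes
score <- toy$ram + 0.5 * toy$battery_power + rnorm(n, sd = 300)
toy$price_range <- cut(score,
                       breaks = quantile(score, probs = seq(0, 1, 0.25)),
                       include.lowest = TRUE,
                       labels = c("low cost", "medium cost",
                                  "high cost", "very high cost"))

# Standardise predictors so the optimiser behaves well
toy$ram           <- as.numeric(scale(toy$ram))
toy$battery_power <- as.numeric(scale(toy$battery_power))

# Multinomial logistic regression over all four classes
fit.multinom <- multinom(price_range ~ ., data = toy,
                         trace = FALSE, maxit = 500)

# Predicted class for each row and accuracy on the training data
pred <- predict(fit.multinom, newdata = toy)
train_accuracy <- mean(pred == toy$price_range)
train_accuracy
```

On the real data the call would have the same shape, multinom(price_range ~ ., data = mobile_train_df), but its results are not shown here because they were not part of the original analysis.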

4.2 Decision Tree

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
fit.ctree <- ctree(price_range~. , data = mobile_train_df)
plot(fit.ctree, main = "Conditional Inference Tree")

4.3 Random Forest

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(2021)
fit.forest <- randomForest(formula = price_range ~ ., data = mobile_train_df,
                           na.action = na.roughfix,
                           importance = TRUE)
fit.forest
## 
## Call:
##  randomForest(formula = price_range ~ ., data = mobile_train_df,      importance = TRUE, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 11.45%
## Confusion matrix:
##                low cost medium cost high cost very high cost class.error
## low cost            472          28         0              0       0.056
## medium cost          37         425        38              0       0.150
## high cost             0          55       413             32       0.174
## very high cost        0           0        39            461       0.078
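
The error figures in this output can be reproduced from the confusion matrix itself. The sketch below re-enters the matrix printed above by hand (rows are actual classes, columns are predicted classes); conf_mat, class_error, and oob_error are hypothetical names.

```r
classes <- c("low cost", "medium cost", "high cost", "very high cost")

# Confusion matrix copied from the randomForest output above
conf_mat <- matrix(
  c(472,  28,   0,   0,
     37, 425,  38,   0,
      0,  55, 413,  32,
      0,   0,  39, 461),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)
round(class_error, 3)

# Overall OOB error: share of all observations off the diagonal
oob_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
oob_error
# → 0.1145
```

All misclassifications fall in adjacent price classes, and the two middle classes (medium cost and high cost) have the highest error rates.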

4.4 Support Vector Machine (SVM)

library(e1071)
set.seed(2021)
fit.svm <- svm(price_range~., data=mobile_train_df)
fit.svm
## 
## Call:
## svm(formula = price_range ~ ., data = mobile_train_df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  1415