A new company wants to start its own mobile phone business and compete with big companies like Apple, Samsung, etc. The company does not know how to estimate the price of the phones it makes, and in this competitive mobile phone market you cannot simply assume things. To solve this problem, it collects sales data on mobile phones from various companies.
This report describes mobile price classification using machine learning algorithms. We investigated four algorithms: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).
The dataset used in this report is Mobile Price Classification, hosted on Kaggle.
The dataset can be downloaded here: https://www.kaggle.com/iabhishekofficial/mobile-price-classification
Report Outline
1. Data Extraction
2. Exploratory Data Analysis
3. Modelling
4. Evaluation
5. Recommendation
The dataset is downloaded from Kaggle and saved in the Data folder. We use the read.csv function to read it into mobile_train_df for the training data and mobile_test_df for the test data.
mobile_test_df <- read.csv("Data/test.csv")
mobile_train_df <- read.csv("Data/train.csv")
To see the number of rows and columns, we use the dim function. The training dataset has 2000 rows and 21 columns.
dim(mobile_train_df)
## [1] 2000 21
To find out the column names and types, we use the str() function.
str(mobile_train_df)
## 'data.frame': 2000 obs. of 21 variables:
## $ battery_power: int 842 1021 563 615 1821 1859 1821 1954 1445 509 ...
## $ blue : int 0 1 1 1 1 0 0 0 1 1 ...
## $ clock_speed : num 2.2 0.5 0.5 2.5 1.2 0.5 1.7 0.5 0.5 0.6 ...
## $ dual_sim : int 0 1 1 0 0 1 0 1 0 1 ...
## $ fc : int 1 0 2 0 13 3 4 0 0 2 ...
## $ four_g : int 0 1 1 0 1 0 1 0 0 1 ...
## $ int_memory : int 7 53 41 10 44 22 10 24 53 9 ...
## $ m_dep : num 0.6 0.7 0.9 0.8 0.6 0.7 0.8 0.8 0.7 0.1 ...
## $ mobile_wt : int 188 136 145 131 141 164 139 187 174 93 ...
## $ n_cores : int 2 3 5 6 2 1 8 4 7 5 ...
## $ pc : int 2 6 6 9 14 7 10 0 14 15 ...
## $ px_height : int 20 905 1263 1216 1208 1004 381 512 386 1137 ...
## $ px_width : int 756 1988 1716 1786 1212 1654 1018 1149 836 1224 ...
## $ ram : int 2549 2631 2603 2769 1411 1067 3220 700 1099 513 ...
## $ sc_h : int 9 17 11 16 8 17 13 16 17 19 ...
## $ sc_w : int 7 3 2 8 2 1 8 3 1 10 ...
## $ talk_time : int 19 7 9 11 15 10 18 5 20 12 ...
## $ three_g : int 0 1 1 1 1 1 1 1 1 1 ...
## $ touch_screen : int 0 1 1 0 1 0 0 1 0 0 ...
## $ wifi : int 1 0 0 0 0 0 1 1 0 0 ...
## $ price_range : int 1 2 2 2 1 1 3 0 0 0 ...
str(mobile_test_df)
## 'data.frame': 1000 obs. of 21 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ battery_power: int 1043 841 1807 1546 1434 1464 1718 833 1111 1520 ...
## $ blue : int 1 1 1 0 0 1 0 0 1 0 ...
## $ clock_speed : num 1.8 0.5 2.8 0.5 1.4 2.9 2.4 2.4 2.9 0.5 ...
## $ dual_sim : int 1 1 0 1 0 1 0 1 1 0 ...
## $ fc : int 14 4 1 18 11 5 1 0 9 1 ...
## $ four_g : int 0 1 0 1 1 1 0 0 1 0 ...
## $ int_memory : int 5 61 27 25 49 50 47 62 25 25 ...
## $ m_dep : num 0.1 0.8 0.9 0.5 0.5 0.8 1 0.8 0.6 0.5 ...
## $ mobile_wt : int 193 191 186 96 108 198 156 111 101 171 ...
## $ n_cores : int 3 5 3 8 6 8 2 1 5 3 ...
## $ pc : int 16 12 4 20 18 9 3 2 19 20 ...
## $ px_height : int 226 746 1270 295 749 569 1283 1312 556 52 ...
## $ px_width : int 1412 857 1366 1752 810 939 1374 1880 876 1009 ...
## $ ram : int 3476 3895 2396 3893 1773 3506 3873 1495 3485 651 ...
## $ sc_h : int 12 6 17 10 15 10 14 7 11 6 ...
## $ sc_w : int 7 0 10 0 8 7 2 2 9 0 ...
## $ talk_time : int 2 7 10 7 7 3 10 18 10 5 ...
## $ three_g : int 0 1 0 1 1 1 0 0 1 1 ...
## $ touch_screen : int 1 0 1 1 0 1 0 1 1 0 ...
## $ wifi : int 0 0 1 0 1 1 0 1 0 1 ...
From the results above, we can see that the first column of the test data is id. It is a unique identifier and unnecessary for prediction, so it should be removed. The test data also has no price_range column, since that is the target to be predicted.
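A minimal sketch of that clean-up step is shown below. It uses two small toy data frames (train_toy and test_toy, hypothetical names with made-up rows) in place of mobile_train_df and mobile_test_df, so it runs without the Kaggle files. The same lines should apply to the real data frames.

```r
# Toy stand-ins for the real data frames (hypothetical names, illustrative values)
train_toy <- data.frame(battery_power = c(842, 1021),
                        ram = c(2549, 2631),
                        price_range = c(1, 2))
test_toy <- data.frame(id = 1:2,
                       battery_power = c(1043, 841),
                       ram = c(3476, 3895))

# Drop the identifier column; it says nothing about the phone itself
test_toy$id <- NULL

# Columns in train but not in test (we expect only the target)
missing_in_test <- setdiff(names(train_toy), names(test_toy))
# Columns in test but not in train (we expect none)
extra_in_test <- setdiff(names(test_toy), names(train_toy))

print(missing_in_test)
print(extra_in_test)
```

Checking the column names with setdiff() before modelling is a cheap way to confirm that the two files share the same features.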
To preview the first rows of the data and the correlations between variables, we use the head() and corrgram() functions.
head(mobile_train_df)
## battery_power blue clock_speed dual_sim fc four_g int_memory m_dep mobile_wt
## 1 842 0 2.2 0 1 0 7 0.6 188
## 2 1021 1 0.5 1 0 1 53 0.7 136
## 3 563 1 0.5 1 2 1 41 0.9 145
## 4 615 1 2.5 0 0 0 10 0.8 131
## 5 1821 1 1.2 0 13 1 44 0.6 141
## 6 1859 0 0.5 1 3 0 22 0.7 164
## n_cores pc px_height px_width ram sc_h sc_w talk_time three_g touch_screen
## 1 2 2 20 756 2549 9 7 19 0 0
## 2 3 6 905 1988 2631 17 3 7 1 1
## 3 5 6 1263 1716 2603 11 2 9 1 1
## 4 6 9 1216 1786 2769 16 8 11 1 0
## 5 2 14 1208 1212 1411 8 2 15 1 1
## 6 1 7 1004 1654 1067 17 1 10 1 0
## wifi price_range
## 1 1 1
## 2 0 2
## 3 0 2
## 4 0 2
## 5 0 1
## 6 0 1
head(mobile_test_df)
## id battery_power blue clock_speed dual_sim fc four_g int_memory m_dep
## 1 1 1043 1 1.8 1 14 0 5 0.1
## 2 2 841 1 0.5 1 4 1 61 0.8
## 3 3 1807 1 2.8 0 1 0 27 0.9
## 4 4 1546 0 0.5 1 18 1 25 0.5
## 5 5 1434 0 1.4 0 11 1 49 0.5
## 6 6 1464 1 2.9 1 5 1 50 0.8
## mobile_wt n_cores pc px_height px_width ram sc_h sc_w talk_time three_g
## 1 193 3 16 226 1412 3476 12 7 2 0
## 2 191 5 12 746 857 3895 6 0 7 1
## 3 186 3 4 1270 1366 2396 17 10 10 0
## 4 96 8 20 295 1752 3893 10 0 7 1
## 5 108 6 18 749 810 1773 15 8 7 1
## 6 198 8 9 569 939 3506 10 7 3 1
## touch_screen wifi
## 1 1 0
## 2 0 0
## 3 1 1
## 4 1 0
## 5 0 1
## 6 1 1
library(corrgram)
## Warning: package 'corrgram' was built under R version 4.0.4
corrgram(mobile_train_df[6:16], order = TRUE,
upper.panel = panel.pie)
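The corrgram above covers columns 6 to 16 only, which does not include price_range. As a complement, the sketch below ranks features by their absolute correlation with the target using base R's cor(). It runs on synthetic data (the toy data frame and the way its price_range is generated are assumptions made for illustration), and it treats price_range as a number, so it would have to run before the factor conversion.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the training data (illustrative values only)
toy <- data.frame(
  ram = runif(n, 256, 4000),
  battery_power = runif(n, 500, 2000),
  n_cores = sample(1:8, n, replace = TRUE)
)
# Hypothetical target: four price classes (0..3) driven by RAM alone
toy$price_range <- as.integer(cut(toy$ram, breaks = 4)) - 1L

# Correlation of every feature with the target, strongest first
target_cor <- cor(toy)[, "price_range"]
target_cor <- target_cor[names(target_cor) != "price_range"]
ranked <- sort(abs(target_cor), decreasing = TRUE)

print(round(ranked, 2))
```

Because price_range is ordinal, cor(toy, method = "spearman") may be a more suitable choice than the default Pearson correlation.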
We need to change the data type of price_range and of the binary feature columns from integer to factor.
mobile_train_df$price_range <- factor(mobile_train_df$price_range, levels = c(0,1,2,3),
labels = c("low cost", "medium cost", "high cost", "very high cost"))
mobile_train_df$blue <- factor(mobile_train_df$blue, levels = c(0,1),
labels = c("not", "yes"))
mobile_train_df$dual_sim <- factor(mobile_train_df$dual_sim, levels = c(0,1),
labels = c("not", "yes"))
mobile_train_df$four_g <- factor(mobile_train_df$four_g, levels = c(0,1),
labels = c("not", "yes"))
mobile_train_df$three_g <- factor(mobile_train_df$three_g, levels = c(0,1),
labels = c("not", "yes"))
mobile_train_df$touch_screen <- factor(mobile_train_df$touch_screen, levels = c(0,1),
labels = c("not", "yes"))
mobile_train_df$wifi <- factor(mobile_train_df$wifi, levels = c(0,1),
labels = c("not", "yes"))
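The six binary conversions above repeat the same call. One possible way to shorten them, shown here on a small toy data frame (hypothetical values), is to apply factor() to all binary columns at once with lapply(). This is only a sketch of an alternative and produces the same kind of result as the explicit calls.

```r
# Toy stand-in with two binary columns and one numeric column
toy <- data.frame(blue = c(0, 1, 1),
                  wifi = c(1, 0, 0),
                  ram = c(2549, 2631, 2603))

# Names of the 0/1 columns to convert
binary_cols <- c("blue", "wifi")

# Apply the same factor() call to every binary column
toy[binary_cols] <- lapply(toy[binary_cols], factor,
                           levels = c(0, 1),
                           labels = c("not", "yes"))

str(toy)
```

For the real data, binary_cols would list blue, dual_sim, four_g, three_g, touch_screen, and wifi.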
We also convert several integer columns to numeric so that we can carry out the bivariate analysis.
# change to numeric
mobile_train_df$fc <- as.numeric(mobile_train_df$fc)
mobile_train_df$int_memory <- as.numeric(mobile_train_df$int_memory)
mobile_train_df$mobile_wt <- as.numeric(mobile_train_df$mobile_wt)
mobile_train_df$n_cores <- as.numeric(mobile_train_df$n_cores)
mobile_train_df$px_height <- as.numeric(mobile_train_df$px_height)
mobile_train_df$px_width <- as.numeric(mobile_train_df$px_width)
mobile_train_df$ram <- as.numeric(mobile_train_df$ram)
mobile_train_df$sc_h <- as.numeric(mobile_train_df$sc_h)
mobile_train_df$sc_w <- as.numeric(mobile_train_df$sc_w)
mobile_train_df$talk_time <- as.numeric(mobile_train_df$talk_time)
Analysis of one variable
library(ggplot2)
ggplot(data=mobile_train_df, aes(x = price_range)) +
geom_bar()
ggplot(data = mobile_train_df, aes(y=price_range)) +
geom_boxplot() +
labs(title = "Mobile Price Classification", y="Price Range")
From the results above, we can see that the dataset is balanced: each class (low cost, medium cost, high cost, and very high cost) has the same number of observations.
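The class balance can also be checked numerically instead of reading it off the bar chart. The sketch below uses a toy factor (made-up data with five rows per class) and base R's table() and prop.table().

```r
# Toy target with the same four class labels (illustrative data)
classes <- c("low cost", "medium cost", "high cost", "very high cost")
price_toy <- factor(rep(classes, each = 5), levels = classes)

# Absolute counts and proportions per class
class_counts <- table(price_toy)
class_share <- prop.table(class_counts)

print(class_counts)
print(class_share)
```

On the real data, the call would be table(mobile_train_df$price_range).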
Analysis of two variables: we look for relationships between the features of a mobile phone and its price range.
ggplot(mobile_train_df, aes(x=price_range, fill = blue)) +
geom_bar(position = "dodge")
ggplot(mobile_train_df, aes(x=price_range, fill = dual_sim)) +
geom_bar(position = "dodge")
ggplot(mobile_train_df, aes(x=price_range, fill = four_g)) +
geom_bar(position = "dodge")
ggplot(mobile_train_df, aes(x=price_range, fill = three_g)) +
geom_bar(position = "dodge")
ggplot(mobile_train_df, aes(x=price_range, fill = touch_screen)) +
geom_bar(position = "dodge")
ggplot(mobile_train_df, aes(x=price_range, fill = wifi)) +
geom_bar(position = "dodge")
Observations based on the ram and n_cores variables. We want to know whether the amount of RAM and the number of cores affect the price.
The color and shape of the points are based on the price range (low cost, medium cost, high cost, and very high cost).
ggplot(data=mobile_train_df, aes(x=ram, y=n_cores,
shape=price_range, color=price_range)) +
geom_point() +
labs(title = "Mobile Price Classification", x="Ram", y="Cores")
In general, the amount of RAM and the number of cores appear to affect the price range. Smartphones with more RAM tend to fall in a higher price class than those with less RAM. However, these two variables alone are not enough to separate the classes.
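One way to back up a reading of the scatter plot is to summarise each variable per price class. The sketch below does this with tapply() on synthetic data. The toy data frame and the assumption that RAM rises with the class while the number of cores is random are made up for illustration and are not results from the real dataset.

```r
set.seed(2)
classes <- c("low cost", "medium cost", "high cost", "very high cost")

# Synthetic data: RAM rises with the price class, cores are random (assumption)
toy <- data.frame(price_range = factor(rep(classes, each = 25), levels = classes))
toy$ram <- rnorm(100,
                 mean = c(800, 1700, 2600, 3500)[as.integer(toy$price_range)],
                 sd = 300)
toy$n_cores <- sample(1:8, 100, replace = TRUE)

# Median of each variable within each price class
ram_by_class <- tapply(toy$ram, toy$price_range, median)
cores_by_class <- tapply(toy$n_cores, toy$price_range, median)

print(round(ram_by_class))
print(cores_by_class)
```

A variable whose medians change steadily from class to class is a candidate for separating the classes. A variable with roughly constant medians probably contributes little on its own.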
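We use four machine learning algorithms: logistic regression, a decision tree (conditional inference tree), a random forest, and a support vector machine.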
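# Note: with family = binomial and a factor response, glm() treats the first
# level ("low cost") as failure and all other levels as success, so this
# model separates "low cost" from the rest rather than all four classes.
fit.logit <- glm(price_range ~ .,
data = mobile_train_df,
family = binomial)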
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit.logit)
##
## Call:
## glm(formula = price_range ~ ., family = binomial, data = mobile_train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.427e-03 1.000e-08 2.000e-08 2.000e-08 2.075e-03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5771.7296 53409.9887 -0.108 0.914
## battery_power 1.4927 13.5365 0.110 0.912
## blueyes -7.2891 694.7855 -0.010 0.992
## clock_speed -17.1823 1013.5374 -0.017 0.986
## dual_simyes -6.3351 647.0005 -0.010 0.992
## fc -6.2349 223.8397 -0.028 0.978
## four_gyes 5.0844 908.2902 0.006 0.996
## int_memory 0.4382 23.6780 0.019 0.985
## m_dep -29.8635 1595.0836 -0.019 0.985
## mobile_wt -3.1548 29.6382 -0.106 0.915
## n_cores 17.5208 297.3277 0.059 0.953
## pc 3.1345 102.6764 0.031 0.976
## px_height 0.8551 8.0433 0.106 0.915
## px_width 0.8479 8.3537 0.102 0.919
## ram 2.4153 22.0169 0.110 0.913
## sc_h -4.3319 153.8299 -0.028 0.978
## sc_w -9.5058 151.5778 -0.063 0.950
## talk_time -2.2555 66.7331 -0.034 0.973
## three_gyes 15.9741 686.9693 0.023 0.981
## touch_screenyes -3.7032 579.0539 -0.006 0.995
## wifiyes -28.6611 1090.9916 -0.026 0.979
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.2493e+03 on 1999 degrees of freedom
## Residual deviance: 5.2049e-05 on 1979 degrees of freedom
## AIC: 42
##
## Number of Fisher Scoring iterations: 25
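The warnings and the huge standard errors above are what we would expect when a binomial glm() is given a four-level target: it ends up modelling "low cost" against everything else, and the features separate those two groups almost perfectly. For a four-class target, multinomial logistic regression is a possible alternative. The sketch below uses multinom() from the nnet package on synthetic data. The toy data frame, the way its price_range is generated, and the name fit.multinom are assumptions made for illustration, and this is not the model used in this report.

```r
library(nnet)  # recommended package, shipped with standard R installations

set.seed(2021)
n <- 400
classes <- c("low cost", "medium cost", "high cost", "very high cost")

# Synthetic stand-in for the training data (illustrative values only)
toy <- data.frame(ram = runif(n, 256, 4000),
                  battery_power = runif(n, 500, 2000))
# Hypothetical target: quartiles of a noisy score driven mainly by RAM
score <- toy$ram + 0.5 * toy$battery_power + rnorm(n, sd = 300)
toy$price_range <- cut(score,
                       breaks = quantile(score, probs = seq(0, 1, 0.25)),
                       include.lowest = TRUE,
                       labels = classes)

# Standardise the predictors so the optimiser works on comparable scales
toy_scaled <- toy
toy_scaled[c("ram", "battery_power")] <- scale(toy[c("ram", "battery_power")])

# Multinomial logistic regression over all four classes
fit.multinom <- multinom(price_range ~ ., data = toy_scaled, trace = FALSE)

# In-sample predictions and confusion matrix
pred <- predict(fit.multinom, newdata = toy_scaled)
print(table(Actual = toy_scaled$price_range, Predicted = pred))
train_acc <- mean(pred == toy_scaled$price_range)
print(train_acc)
```

The accuracy printed here is measured on the data the model was fitted to, so it is optimistic and should not be read as an estimate of performance on new phones.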
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
fit.ctree <- ctree(price_range~. , data = mobile_train_df)
plot(fit.ctree, main = "Conditional Inference Tree")
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(2021)
fit.forest <- randomForest(formula = price_range ~ ., data = mobile_train_df,
na.action = na.roughfix,
importance = TRUE)
fit.forest
##
## Call:
## randomForest(formula = price_range ~ ., data = mobile_train_df, importance = TRUE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 11.45%
## Confusion matrix:
## low cost medium cost high cost very high cost class.error
## low cost 472 28 0 0 0.056
## medium cost 37 425 38 0 0.150
## high cost 0 55 413 32 0.174
## very high cost 0 0 39 461 0.078
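The OOB error rate and the class.error column can be reproduced by hand from the confusion matrix above. The sketch below retypes that matrix (rows are the actual classes, columns the predicted ones) and redoes the arithmetic in base R. The variable names are our own.

```r
classes <- c("low cost", "medium cost", "high cost", "very high cost")

# Confusion matrix copied from the randomForest output above
conf <- matrix(c(472,  28,   0,   0,
                  37, 425,  38,   0,
                   0,  55, 413,  32,
                   0,   0,  39, 461),
               nrow = 4, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

# Correct predictions sit on the diagonal
oob_accuracy <- sum(diag(conf)) / sum(conf)   # 1771 / 2000
oob_error <- 1 - oob_accuracy

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

print(round(oob_error, 4))    # 0.1145
print(round(class_error, 3))  # 0.056 0.150 0.174 0.078
```

All misclassifications fall into a neighbouring price class, and the two middle classes (medium cost and high cost) have the highest error rates.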
library(e1071)
set.seed(2021)
fit.svm <- svm(price_range~., data=mobile_train_df)
fit.svm
##
## Call:
## svm(formula = price_range ~ ., data = mobile_train_df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 1415
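Because the test file has no price_range column, comparing the fitted models needs labelled data that was not used for fitting. One common option is to hold out part of the training data. The sketch below builds a stratified 80/20 split in base R on a toy data frame. The names train_part and valid_part and the 80/20 ratio are assumptions, and this is not necessarily how the models in this report are evaluated.

```r
set.seed(2021)
classes <- c("low cost", "medium cost", "high cost", "very high cost")

# Toy stand-in: 50 rows per class (illustrative values only)
toy <- data.frame(price_range = factor(rep(classes, each = 50), levels = classes),
                  ram = runif(200, 256, 4000))

# Sample 80% of the row indices within each class (stratified split)
rows_by_class <- split(seq_len(nrow(toy)), toy$price_range)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))
}))

train_part <- toy[train_idx, ]
valid_part <- toy[-train_idx, ]

print(table(train_part$price_range))
print(table(valid_part$price_range))
```

A model would then be fitted on train_part only, and its predictions on valid_part compared with valid_part$price_range, for example with table() and mean().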