Abstract:

This project investigates mobile phone pricing, with an emphasis on the range of variables that influence consumer decisions. I analyze how phone features such as screen size and camera properties affect the price of a phone. I first carried out an exploratory data analysis to examine the relationships between the explanatory variables and the response variable, and to identify possible multicollinearity and potential outliers in the dataset. I then modeled the relationship between the explanatory variables and the response by fitting a multiple linear regression. I selected a model using the best subset strategy based on the smallest AIC, BIC and Mallows' Cp criteria, investigated the statistical significance of the variables and checked the least squares assumptions; the response variable was then log-transformed to meet those assumptions. The project finds that the variables useful for predicting mobile phone price are Storage, RAM, Camera_1, Camera_2, Camera_3, Camera_4, Battery capacity and Camera count; Screen size is retained in the final model but is not statistically significant at the 0.05 level.

Introduction:

The use of mobile phones has increased rapidly over the years. Initially, mobile phones were used primarily for voice calls and text messaging, but with the advent of smartphones and their multifunctional capabilities, including internet access, apps and cameras, adoption of and reliance on mobile phones have skyrocketed. When buying a phone there are many features to consider, and there is no simple way to determine a phone's price from its characteristics. The modern mobile phone market is a complex ecosystem in which pricing strategies are influenced by a multitude of factors. Predictive models such as regression are therefore useful for predicting mobile phone prices, which can help manufacturers and customers make informed decisions.

This project explores the relationships between fundamental phone features, such as storage capacity, RAM, battery capacity, screen size and camera specifications, and mobile phone prices. Using regression-based data analysis, the study aims to identify the main determinants of price within the mobile phone industry.

The first section of the project covers the abstract, introduction and data characteristics. The second section is the exploratory data analysis, where I explore relationships, trends, patterns and correlations among the variables. The third section is the modeling phase: I build a linear regression model using all the explanatory variables, then use best subset regression to obtain a simpler and better model, and then check whether the regression assumptions are satisfied. If they are not, I transform some of the variables and check the assumptions again; once they are satisfied, I check the significance of the estimates. The final section presents conclusions about my findings and recommendations for further work to make the model more robust and accurate.

Data Characteristics:

The study examines a dataset taken from Kaggle (https://www.kaggle.com/datasets/rkiattisak/mobile-phone-price). The initial data is made up of 9 columns and 408 rows. The dataset was generated by a large language model and not collected from actual data sources.

The dependent variable is Price, the retail price of the phone in US dollars; it is numeric. The independent variables are: Brand, the name of the company that manufactures the phone (categorical; includes Apple, Samsung, OnePlus, Xiaomi, Google, Vivo and so on); Model, the name of the phone model (categorical); Storage (GB), the amount of storage space available on the phone in gigabytes (numeric); RAM (GB), the amount of RAM available on the phone in gigabytes (numeric); Screen Size (inches), the size of the phone's display screen in inches (numeric); Camera (MP), the megapixel count of the phone's rear camera(s) (numeric); and Battery Capacity (mAh), the capacity of the phone's battery in milliampere hours (numeric).

The analysis uses a cleaned version of the data with 10 columns and 408 rows. The Camera (MP) column in the initial data was split into four columns, Camera_1 to Camera_4, where Camera_1 holds the megapixels of the first camera, Camera_2 those of the second, and so on; all NAs in those columns were set to zero. I then created a new column, Camera_Count, which gives the number of cameras on each phone model. I dropped Model and Brand, the categorical variables in the dataset, because I wanted the model to be able to predict prices for new brands and models that will be manufactured in the future.
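The camera-splitting step described above can be sketched in base R as follows. This is only an illustration: the three raw rows are made up, and the "+"-separated format of the Camera column is an assumption about how the raw file stores multiple cameras.

```r
# Hypothetical raw rows; the "+"-separated Camera format is an assumption.
raw <- data.frame(
  Brand  = c("A", "B", "C"),
  Model  = c("m1", "m2", "m3"),
  Camera = c("48 + 8 + 2 + 2", "12 + 12", "13"),
  stringsAsFactors = FALSE
)

# Split each Camera string into its numeric megapixel values.
parts <- lapply(strsplit(raw$Camera, "+", fixed = TRUE),
                function(p) as.numeric(trimws(p)))

# Pad every phone to four camera slots; absent cameras become 0.
pad4 <- function(v) c(v, rep(0, 4 - length(v)))
cams <- t(vapply(parts, pad4, numeric(4)))
colnames(cams) <- paste0("Camera_", 1:4)

# Brand and Model are left out, as in the cleaned data.
clean <- data.frame(cams, Camera_Count = lengths(parts))
clean
```

Padding with zeros here plays the same role as replacing the NAs with zero after the split.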

data <- read.csv(file = file.choose())

Exploratory Data Analysis

Summary Statistics of data
##     Storage         RAM          Screen_Size       Camera_1     
##  Min.   : 32   Min.   : 2.000   Min.   :4.500   Min.   :  8.00  
##  1st Qu.: 64   1st Qu.: 4.000   1st Qu.:6.440   1st Qu.: 13.00  
##  Median :128   Median : 6.000   Median :6.500   Median : 48.00  
##  Mean   :123   Mean   : 5.838   Mean   :6.468   Mean   : 43.32  
##  3rd Qu.:128   3rd Qu.: 8.000   3rd Qu.:6.580   3rd Qu.: 64.00  
##  Max.   :512   Max.   :16.000   Max.   :6.900   Max.   :108.00  
##                                 NA's   :2                       
##     Camera_2         Camera_3         Camera_4       Battery_Capacity
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000   Min.   :1821    
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.:4300    
##  Median : 8.000   Median : 2.000   Median : 0.0000   Median :5000    
##  Mean   : 9.126   Mean   : 3.641   Mean   : 0.9835   Mean   :4676    
##  3rd Qu.:12.000   3rd Qu.: 5.000   3rd Qu.: 2.0000   3rd Qu.:5000    
##  Max.   :64.000   Max.   :48.000   Max.   :12.0000   Max.   :7000    
##                                                                      
##      Price         Camera_Count  
##  Min.   :  99.0   Min.   :1.000  
##  1st Qu.: 199.0   1st Qu.:3.000  
##  Median : 299.0   Median :3.000  
##  Mean   : 402.2   Mean   :3.145  
##  3rd Qu.: 476.5   3rd Qu.:4.000  
##  Max.   :1999.0   Max.   :4.000  
##  NA's   :3

From the summary, the minimum price of a mobile phone is 99 dollars while the maximum is 1999 dollars. 25% of the phones are priced at or below 199 dollars, 50% at or below 299 dollars and 75% at or below 476.5 dollars. The mean price is 402.2 dollars, well above the median. The smallest storage a phone has is 32 gigabytes, while the largest is 512 gigabytes. The megapixel count of the phone's first rear camera has a maximum of 108 and a minimum of 8. Price has 3 missing values and Screen_Size has 2.

Histograms of Some of the variables

Based on the histogram, Price is skewed to the right: most mobile phones are priced below 500 dollars and very few are priced above 1500 dollars.

Most phones have storage between 0 and 150 gigabytes, with counts increasing across that range. Very few phones have storage between 250 and 300 gigabytes. No phone has storage between 150 and 250 gigabytes or between 300 and 500 gigabytes.

The histogram for screen size is skewed to the left, and most phones have a screen size of about 6.5 inches. Screen sizes are clustered between 6 and 7 inches.

The histogram for Camera_2 shows that most phones' second camera has between zero and twenty megapixels, and very few have more than twenty.
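A histogram like those described above can be drawn with base R's `hist()`. The sketch below uses synthetic, right-skewed prices as a stand-in (not the report's data) and adds a quick numeric check for skew.

```r
set.seed(42)
# Synthetic right-skewed stand-in for Price (not the report's data).
price <- round(exp(rnorm(400, mean = 5.7, sd = 0.6)))

h <- hist(price, breaks = 20, main = "Histogram of Price", xlab = "Price (USD)")

# A mean above the median is a quick numeric sign of right skew.
c(mean = mean(price), median = median(price))
```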

Correlation of variables
##                      Storage         RAM Screen_Size  Camera_1   Camera_2
## Storage           1.00000000  0.67970979   0.2345064 0.3622186  0.5229435
## RAM               0.67970979  1.00000000   0.2720626 0.5232928  0.5288045
## Screen_Size       0.23450637  0.27206261   1.0000000 0.3932998  0.1470192
## Camera_1          0.36221863  0.52329279   0.3932998 1.0000000  0.1534035
## Camera_2          0.52294353  0.52880450   0.1470192 0.1534035  1.0000000
## Camera_3          0.56494885  0.47437477   0.2283105 0.1984541  0.4209567
## Camera_4          0.19795091  0.26938665   0.2053093 0.3927535  0.1902992
## Battery_Capacity -0.07527421 -0.02976405   0.6254124 0.2917892 -0.1213288
## Price             0.71060054  0.61823403  -0.0112647 0.1347174  0.6510858
## Camera_Count      0.31200508  0.34447663   0.4603438 0.4789499  0.2277448
##                     Camera_3  Camera_4 Battery_Capacity       Price
## Storage           0.56494885 0.1979509      -0.07527421  0.71060054
## RAM               0.47437477 0.2693866      -0.02976405  0.61823403
## Screen_Size       0.22831052 0.2053093       0.62541242 -0.01126470
## Camera_1          0.19845408 0.3927535       0.29178924  0.13471743
## Camera_2          0.42095670 0.1902992      -0.12132883  0.65108583
## Camera_3          1.00000000 0.2136455      -0.05783133  0.61980845
## Camera_4          0.21364547 1.0000000       0.18304955  0.13412436
## Battery_Capacity -0.05783133 0.1830495       1.00000000 -0.42370871
## Price             0.61980845 0.1341244      -0.42370871  1.00000000
## Camera_Count      0.39811496 0.6330156       0.38832928  0.05900447
##                  Camera_Count
## Storage            0.31200508
## RAM                0.34447663
## Screen_Size        0.46034379
## Camera_1           0.47894995
## Camera_2           0.22774481
## Camera_3           0.39811496
## Camera_4           0.63301559
## Battery_Capacity   0.38832928
## Price              0.05900447
## Camera_Count       1.00000000

The correlation table above shows that some of the variables are correlated. Price is strongly correlated with Storage (r = 0.71), RAM (r = 0.62), Camera_2 (r = 0.65) and Camera_3 (r = 0.62). Some of the independent variables are also correlated with each other, which might cause multicollinearity in my model. For instance, Screen_Size and Battery_Capacity have a strong correlation (r = 0.63), as do Storage and RAM (r = 0.68).
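A correlation matrix of this kind can be computed with `cor()`. The sketch below uses synthetic data and assumes rows with missing values are dropped via `use = "complete.obs"`; the 0.6 cut-off for flagging predictor pairs is a hypothetical choice, not one taken from the report.

```r
set.seed(1)
n <- 100
Storage <- rnorm(n)
d <- data.frame(
  Storage = Storage,
  RAM     = 0.8 * Storage + rnorm(n, sd = 0.5),
  Battery = rnorm(n)
)
d$Price <- 2 * d$Storage + rnorm(n)
d$Price[c(3, 7)] <- NA   # mimic the missing prices

# "complete.obs" drops any row with a missing value before correlating.
r <- cor(d, use = "complete.obs")
round(r, 2)

# List predictor pairs whose |r| exceeds a (hypothetical) 0.6 cut-off.
pred <- setdiff(colnames(r), "Price")
rp <- r[pred, pred]
idx <- which(abs(rp) > 0.6 & upper.tri(rp), arr.ind = TRUE)
data.frame(var1 = pred[idx[, 1]], var2 = pred[idx[, 2]], r = rp[idx])
```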

Scatter Plots

From the above plot, as battery capacity increases the price tends to decrease: there is a moderate negative correlation (r = -0.42) between battery capacity and the price of the mobile phone.


There is no clear linear relationship between screen size and the price of the phone (r = -0.01). The majority of the phones have a screen size between 6.0 and 7.0 inches, and most prices range from 125 dollars to 1000 dollars.


From the plot, as the RAM of the phone increases the price of the phone increases. This suggests that RAM has a strong positive correlation with the price (r = 0.62) and is likely to be a good predictor of the price of a phone.


From the plot, as the megapixel count of Camera_2 increases the price of the phone increases, which suggests that Camera_2 has a strong positive correlation with the price (r = 0.65).


From the plot, as the megapixel count of Camera_3 increases the price of the phone increases, which suggests that Camera_3 has a strong positive correlation with the price (r = 0.62).


From the plot, as the storage increases the price of the phone also increases, which suggests that Storage has a strong positive correlation with the price (r = 0.71).


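This plot suggests a slight positive association between screen size and price, although the correlation table shows that the overall linear correlation between the two is essentially zero (r = -0.01).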
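Scatter plots with a fitted straight line, like those discussed in this section, can be produced with ggplot2. The sketch below is one way to do it, assuming `geom_smooth(method = "lm")` for the trend line; the data are synthetic and only the column names mirror the report.

```r
library(ggplot2)

set.seed(7)
# Synthetic stand-in data; column names mirror the report's.
d <- data.frame(RAM = sample(c(2, 4, 6, 8, 12), 150, replace = TRUE))
d$Price <- 100 + 45 * d$RAM + rnorm(150, sd = 80)

p <- ggplot(d, aes(x = RAM, y = Price)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +   # straight-line fit of y on x
  labs(x = "RAM (GB)", y = "Price (USD)")
print(p)
```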

Model Selection

Preliminary Model

For the preliminary model I regressed Price on all the explanatory variables to see which variables are significant when all predictors are present in the model.

## 
## Call:
## lm(formula = Price ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -414.16  -66.04  -22.32   41.46  627.34 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      480.65362  148.47062   3.237  0.00131 ** 
## Storage            1.41675    0.14596   9.707  < 2e-16 ***
## RAM               19.25352    4.14951   4.640 4.76e-06 ***
## Screen_Size       37.99460   28.20032   1.347  0.17866    
## Camera_1          -0.34728    0.34210  -1.015  0.31067    
## Camera_2           7.40609    0.77748   9.526  < 2e-16 ***
## Camera_3          19.42520    1.88196  10.322  < 2e-16 ***
## Camera_4          21.25665    4.91691   4.323 1.95e-05 ***
## Battery_Capacity  -0.10083    0.01108  -9.102  < 2e-16 ***
## Camera_Count     -89.39370   12.31615  -7.258 2.12e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.6 on 392 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8098, Adjusted R-squared:  0.8055 
## F-statistic: 185.5 on 9 and 392 DF,  p-value: < 2.2e-16

In the model with Price regressed on all the independent variables, Screen_Size (p = 0.179) and Camera_1 (p = 0.311) are not significant predictors of the price at the 0.05 significance level.
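The full-model fit and the significance check can be sketched as follows. The data are synthetic (only the column names echo the report), and two prices are set to missing to show that `lm()` drops incomplete rows by default.

```r
set.seed(3)
n <- 120
d <- data.frame(Storage = rnorm(n), RAM = rnorm(n), Screen_Size = rnorm(n))
d$Price <- 5 + 3 * d$Storage + 2 * d$RAM + rnorm(n)   # Screen_Size has no effect
d$Price[1:2] <- NA                                     # lm() drops these rows

fit <- lm(Price ~ ., data = d)
coefs <- summary(fit)$coefficients

# Predictors whose t-test p-value is above 0.05
p_vals <- coefs[-1, "Pr(>|t|)"]
names(p_vals)[p_vals > 0.05]

nobs(fit)   # number of rows actually used in the fit
```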

Model improvement.

I will run a best subset regression (using the olsrr package) and select the model with the smallest AIC and Mallows' C(p) and a high adjusted R-squared, so that I have a simpler yet better model for predicting the price of a mobile phone.

## 
##                                         Best Subsets Regression                                         
## --------------------------------------------------------------------------------------------------------
## Model Index    Predictors
## --------------------------------------------------------------------------------------------------------
##      1         Storage                                                                                   
##      2         Storage Battery_Capacity                                                                  
##      3         Storage Camera_2 Battery_Capacity                                                         
##      4         Storage Camera_2 Camera_3 Battery_Capacity                                                
##      5         Storage Screen_Size Camera_2 Camera_3 Battery_Capacity                                    
##      6         Storage Screen_Size Camera_2 Camera_3 Battery_Capacity Camera_Count                       
##      7         Storage RAM Screen_Size Camera_2 Camera_3 Battery_Capacity Camera_Count                   
##      8         Storage RAM Screen_Size Camera_2 Camera_3 Camera_4 Battery_Capacity Camera_Count          
##      9         Storage RAM Screen_Size Camera_1 Camera_2 Camera_3 Camera_4 Battery_Capacity Camera_Count 
## --------------------------------------------------------------------------------------------------------
## 
##                                                            Subsets Regression Summary                                                            
## -------------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                           
## Model    R-Square    R-Square    R-Square      C(p)         AIC         SBIC          SBC           MSEP            FPE          HSP        APC  
## -------------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.5059      0.5047      0.4983    712.8906    5454.0031    4304.1413    5466.0073    17084067.1161    42496.6353    105.4546    0.4990 
##   2        0.6330      0.6312      0.6239    428.5104    5335.8117    4185.7195    5351.8173    12719543.0883    31717.6106     78.7095    0.3724 
##   3        0.7060      0.7038      0.6949    266.2052    5248.2703    4098.3446    5268.2774    10216517.6848    25538.4704     63.3787    0.2999 
##   4        0.7513      0.7488      0.7374    166.2068    5182.6935    4033.2473    5206.7020     8664599.8371    21712.0544     53.8860    0.2550 
##   5        0.7728      0.7700      0.7584     78.2728    5089.0305    3946.4656    5117.0057     7261465.8803    18332.4849     45.7288    0.2341 
##   6        0.7884      0.7851      0.7705     48.2536    5062.5580    3920.6198    5094.5296     6782117.2046    17164.1653     42.8183    0.2191 
##   7        0.8007      0.7971      0.7835     24.8534    5040.4430    3899.2707    5076.4111     6403490.1212    16245.4555     40.5305    0.2074 
##   8        0.8093      0.8054      0.7906      9.0305    5024.6146    3884.1978    5064.5792     6141332.7695    15618.2702     38.9701    0.1994 
##   9        0.8098      0.8055      0.7899     10.0000    5025.5592    3885.2416    5069.5202     6140896.3763    15655.0564     39.0667    0.1999 
## -------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria
Best Subset Model
## 
## Call:
## lm(formula = Price ~ Storage + RAM + Screen_Size + Camera_2 + 
##     Camera_3 + Camera_4 + Battery_Capacity + Camera_Count, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -417.70  -68.50  -20.36   41.98  624.63 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      505.98115  146.36489   3.457 0.000606 ***
## Storage            1.40404    0.14542   9.655  < 2e-16 ***
## RAM               17.53114    3.78691   4.629 4.99e-06 ***
## Screen_Size       35.24478   28.07101   1.256 0.210022    
## Camera_2           7.54527    0.76533   9.859  < 2e-16 ***
## Camera_3          19.69658    1.86295  10.573  < 2e-16 ***
## Camera_4          20.54678    4.86712   4.222 3.02e-05 ***
## Battery_Capacity  -0.10211    0.01101  -9.279  < 2e-16 ***
## Camera_Count     -91.47662   12.14450  -7.532 3.46e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.6 on 393 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8054 
## F-statistic: 208.5 on 8 and 393 DF,  p-value: < 2.2e-16

The best subset search favors model 8, which has the smallest AIC (5024.61) and the smallest C(p) (9.03); its adjusted R-squared (0.8054) is essentially equal to that of the full model (0.8055). The selected predictors are Storage, RAM, Screen_Size, Camera_2, Camera_3, Camera_4, Battery_Capacity and Camera_Count.
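To make the selection criteria concrete, here is a minimal base-R sketch of an exhaustive best subset search on synthetic data. It is not the olsrr implementation: it simply fits every subset with `lm()` and computes adjusted R-squared, Mallows' C(p) and AIC by hand. The variable names `x1` to `x4` and `y` are placeholders.

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 3 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

predictors <- setdiff(names(d), "y")
full <- lm(y ~ ., data = d)
sigma2_full <- summary(full)$sigma^2      # error variance estimate from the full model

# Every non-empty subset of the predictors
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

score <- function(s) {
  fit <- lm(reformulate(s, response = "y"), data = d)
  p <- length(s) + 1                      # coefficients, including the intercept
  data.frame(
    predictors = paste(s, collapse = " "),
    size   = length(s),
    adj_r2 = summary(fit)$adj.r.squared,
    cp     = sum(resid(fit)^2) / sigma2_full - n + 2 * p,   # Mallows' C(p)
    aic    = AIC(fit)
  )
}
results <- do.call(rbind, lapply(subsets, score))

best <- results[which.min(results$aic), ]
best
```

By construction, C(p) for the full model equals its number of coefficients, which matches the value of 10.0000 shown for model 9 in the table above.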

Checking Assumptions

The two plots above indicate that the normality assumption fails.

This plot also shows that the linearity and constant variance assumptions fail for the model fitted using the best subset.
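Diagnostic plots of this kind can be produced from the residuals of a fitted model. The sketch below uses a synthetic skewed response; the choice of a residual histogram, a Q-Q plot and a residuals-versus-fitted plot, plus the Shapiro-Wilk test, is an assumption about which checks were used.

```r
set.seed(11)
n <- 200
x <- runif(n, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.5))   # skewed response, loosely like Price
fit <- lm(y ~ x)

res <- resid(fit)
op <- par(mfrow = c(1, 3))
hist(res, breaks = 20, main = "Residuals", xlab = "Residual")
qqnorm(res); qqline(res)                     # checks normality
plot(fitted(fit), res, xlab = "Fitted", ylab = "Residual",
     main = "Residuals vs fitted")           # checks linearity and constant variance
abline(h = 0, lty = 2)
par(op)

# A small p-value points to non-normal residuals.
shapiro.test(res)$p.value
```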

Model improvement by transformation.

Since the least squares assumptions were violated by the model recommended by the best subset search, I applied a logarithmic transformation to the response variable and refitted the model with all nine predictors, to see whether the new model fixes the assumptions that failed previously.

## 
## Call:
## lm(formula = log(Price) ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8868 -0.1795 -0.0204  0.1345  1.2465 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.839e+00  3.465e-01  19.737  < 2e-16 ***
## Storage           2.582e-03  3.407e-04   7.581 2.51e-13 ***
## RAM               6.600e-02  9.685e-03   6.815 3.57e-11 ***
## Screen_Size      -6.054e-02  6.582e-02  -0.920 0.358257    
## Camera_1          2.453e-03  7.985e-04   3.073 0.002269 ** 
## Camera_2          1.313e-02  1.815e-03   7.234 2.49e-12 ***
## Camera_3          2.982e-02  4.392e-03   6.790 4.17e-11 ***
## Camera_4          3.628e-02  1.148e-02   3.162 0.001691 ** 
## Battery_Capacity -2.943e-04  2.585e-05 -11.384  < 2e-16 ***
## Camera_Count     -1.128e-01  2.875e-02  -3.923 0.000103 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2885 on 392 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.794,  Adjusted R-squared:  0.7893 
## F-statistic: 167.9 on 9 and 392 DF,  p-value: < 2.2e-16
Checking Assumptions

After the log transformation of the dependent variable, the histogram of the residuals and the Q-Q plot show that the residuals are approximately normally distributed. The model also comes much closer to satisfying the linearity and constant variance assumptions than the first model fitted from the best subset, so I can use this model, referred to as model_1 below, for predicting the price of a mobile phone.
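The effect of a log transformation on the residuals can be illustrated on synthetic data with multiplicative errors. The `skew` helper below is a hypothetical convenience function, not part of the report; it is only used to compare how symmetric the two sets of residuals are.

```r
set.seed(11)
n <- 200
x <- runif(n, 0, 4)
price <- exp(5 + 0.4 * x + rnorm(n, sd = 0.5))   # synthetic, multiplicative errors

fit_raw <- lm(price ~ x)
fit_log <- lm(log(price) ~ x)

# Sample skewness of the residuals: closer to 0 means more symmetric.
skew <- function(r) mean((r - mean(r))^3) / sd(r)^3
c(raw = skew(resid(fit_raw)), log = skew(resid(fit_log)))

# Predictions from the log model are on the log scale; exponentiate to get dollars.
new_phone <- data.frame(x = 2)
exp(predict(fit_log, newdata = new_phone))
```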

Multicollinearity of the Model
library(car)
vif(model_1)
##          Storage              RAM      Screen_Size         Camera_1 
##         2.289926         2.566394         2.080677         1.842324 
##         Camera_2         Camera_3         Camera_4 Battery_Capacity 
##         1.601669         1.745404         1.764957         2.066832 
##     Camera_Count 
##         2.478943

This shows that the independent variables in my model are not involved in severe multicollinearity: all the variance inflation factors are below 3, well under the commonly used threshold of ten.
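For intuition, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on synthetic data with one deliberately collinear pair; `car::vif()` reports the same quantity for models with numeric predictors.

```r
set.seed(5)
n <- 150
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # deliberately collinear with x1
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```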

Checking for outliers
##   5  52  58 103 109 113 124 209 335 339 344 363 378 405 
##   5  52  58 101 106 110 121 205 331 335 340 359 373 400

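These points are flagged as outliers in my model. In the output, the first row gives the row labels in the original data and the second row gives the positions among the observations used in the fit, after the rows with missing values were dropped.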
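One common way to flag outliers is through studentized residuals. The sketch below is an illustration on synthetic data; the cut-off of 3 is an assumption, since the rule used to produce the list above is not stated. It also shows why labels and positions can differ once a row with a missing value has been dropped.

```r
set.seed(9)
n <- 60
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$y[5]  <- NA                 # a missing response, dropped by lm()
d$y[20] <- d$y[20] + 15       # planted outlier

fit <- lm(y ~ x, data = d)

# Names are row labels in d; values are positions among the rows used in the fit.
outliers <- which(abs(rstudent(fit)) > 3)   # cut-off of 3 is an assumption
outliers
```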

Checking for Leverages
##   2  10  23  33  63  84  91 101 290 298 310 313 334 352 362 364 368 370 377 380 
##   2  10  23  33  63  84  89  99 286 294 306 309 330 348 358 360 364 366 372 375 
## 393 403 
## 388 398

These are the high-leverage points in my model.
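High-leverage points are usually identified from the hat values of the fitted model. The sketch below uses synthetic data and the 2p/n rule of thumb, which is an assumption here because the threshold behind the list above is not stated.

```r
set.seed(13)
n <- 80
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$x1[10] <- 8                          # planted extreme predictor value
d$y <- 1 + d$x1 - d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)
h <- hatvalues(fit)

p <- length(coef(fit))                 # number of coefficients, incl. intercept
high_leverage <- which(h > 2 * p / n)  # 2p/n rule of thumb (an assumption here)
high_leverage
```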

Conclusion:

model_1 is therefore useful for predicting the price of a phone, since it comes closest to satisfying the least squares assumptions among the models fitted. Its residual standard error (0.2885) is measured on the log scale, so it is not directly comparable with the 123.6 dollars of the earlier models. With model_1 and at the 0.05 significance level, the individual variables that are significant for predicting the price of a mobile phone are Storage, RAM, Camera_1, Camera_2, Camera_3, Camera_4, Battery capacity and Camera count; only Screen size is not significant.

Interpretation of some regression coefficients

Because the response in model_1 is log(Price), the coefficients are on the log scale.

1. On average, log(Price) is expected to be 6.839 (a price of about 934 dollars) when all the independent variables are zero.

2. On average, an increase of one gigabyte in phone storage is associated with an increase of 2.582e-03 in log(Price), roughly a 0.26% increase in price, when all other variables are held constant.

3. On average, an increase of one gigabyte in phone RAM is associated with an increase of 6.600e-02 in log(Price), roughly a 6.8% increase in price, when all other variables are held constant.

4. On average, an increase of one inch in phone screen size is associated with a decrease of 6.054e-02 in log(Price), roughly a 5.9% decrease in price, when all other variables are held constant, although this coefficient is not statistically significant.
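Since model_1 is fitted to log(Price), its coefficients translate to percentage changes in price through 100 × (exp(b) − 1). The short sketch below does this back-transformation for the coefficients discussed above, with the values copied from the regression output.

```r
# Coefficients copied from the log(Price) regression output.
b <- c(Intercept = 6.839, Storage = 2.582e-03, RAM = 6.600e-02, Screen_Size = -6.054e-02)

# Back-transformed intercept: price in dollars when every predictor is zero
baseline <- exp(b[["Intercept"]])
baseline

# Percentage change in price per one-unit increase in each predictor
pct_change <- 100 * (exp(b[-1]) - 1)
round(pct_change, 2)
```

For small coefficients such as the one for Storage, the percentage change is close to 100 times the coefficient itself.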

Recommendation

The dataset used was generated by a large language model and not collected from actual data sources, so there are inherent limitations regarding its quality and reliability. The presence of missing values (NAs) posed challenges in data preprocessing, and the outliers and high-leverage points identified above may affect the robustness of our inferential statements.

To enhance the reliability of our inferential statements and further strengthen the predictive capacity of future models, acquiring more diverse, real-world data from reputable sources would be instrumental. Authentic data directly obtained from manufacturers, retailers, or authoritative mobile phone databases would mitigate potential biases and uncertainties inherent in machine-generated datasets.

Exploring additional features that deeply influence mobile phone prices could significantly enhance the model’s accuracy. Factors such as customer reviews, market trends, and regional variations in pricing could be integrated into the dataset for a more comprehensive analysis.