
1. Problem Statement and Background

The aim of the following analysis is to examine the relationships between various aspects of a laptop and its price, with the goal of identifying key factors that can be used to predict a laptop's price and to determine whether a new listing is overpriced. Diving deep into this topic will hopefully provide useful insights for individuals who are shopping for a new laptop and deciding whether an asking price is worth paying.

This dataset comes from Kaggle, a free web-based platform that data scientists and statisticians use to share both ideas and datasets. The link to the Laptop Prices dataset is below:

Laptop Prices

This dataset was pulled from another dataset with fewer variables. The link to the Original Dataset is below:

Original Dataset

I am using the Tuesday, September 10, 2024 update of the Laptop Prices dataset. Link to this version of the dataset is below:

9/10/24 Version

R and R Markdown were used in this project because they are free and open-source, allowing users to customize their experience with various libraries and features that other statistical software, such as SAS, lacks. R provides an easy-to-use and comprehensive toolset of statistical analyses and tests. While the most advanced of these analyses will not be used in this project, further work could provide additional insights into the major factors behind laptop prices.

This analysis seeks to answer the questions:

  • What aspects of a laptop contribute to its asking price?
  • Can we accurately predict a laptop's price based on its aspects?
  • Can we determine if a laptop is overpriced?

2. Description of the Data

Here is what the author on Kaggle had to say about the Laptop Prices dataset: “The original dataset was pretty compact with a lot of details in each column. The columns mostly consisted of long strings of data, which was pretty human-readable and concise but for Machine Learning algorithms to work more efficiently it’s better to separate the different details into their own columns. After doing so, 28 duplicate rows were exposed and removed with this dataset being the final result.”

A detailed description of the variables within the dataset is given below:

Company: Laptop Manufacturer [categorical]

Product: Brand and Model [categorical]

TypeName: Laptop Type (Notebook, Ultrabook, Gaming, etc.) [categorical]

Inches: Screen Size [numerical]

Ram: Total amount of RAM in laptop (GBs) [numerical]

OS: Operating System installed [categorical]

Weight: Laptop Weight in kilograms [numerical]

Price_euros: Price of Laptop in Euros (Target) [numerical]

Screen: screen definition (Standard, Full HD, 4K Ultra HD, Quad HD+) [categorical]

ScreenW: screen width (pixels) [numerical]

ScreenH: screen height (pixels) [numerical]

Touchscreen: whether or not the laptop has a touchscreen [categorical]

IPSpanel: whether or not the laptop has an IPSpanel [categorical]

RetinaDisplay: whether or not the laptop has retina display [categorical]

CPU_company: CPU manufacturer [categorical]

CPU_freq: frequency of laptop CPU (GHz) [numerical]

CPU_model: CPU model name [categorical]

PrimaryStorage: primary storage space (GB) [numerical]

PrimaryStorageType: primary storage type (HDD, SSD, Flash Storage, Hybrid) [categorical]

SecondaryStorage: secondary storage space if any (GB) [numerical]

SecondaryStorageType: secondary storage type (HDD, SSD, Hybrid, None) [categorical]

GPU_company: GPU manufacturer [categorical]

GPU_model: GPU model name [categorical]
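
The remainder of this report assumes the following libraries are loaded (in an R Markdown document this would typically live in a setup chunk); the list is inferred from the functions used below:

library(dplyr)    # pipes (%>%), group_by(), summarise()
library(ggplot2)  # bar charts and pie charts
library(knitr)    # kable() tables
library(caTools)  # sample.split() for the train/test split
library(MASS)     # boxcox()
library(pROC)     # roc() and auc()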

laptop.prices <- read.csv("https://raw.githubusercontent.com/EPKeep32/STA551/refs/heads/main/laptop_prices.csv")

summary(laptop.prices)
   Company            Product            TypeName             Inches     
 Length:1275        Length:1275        Length:1275        Min.   :10.10  
 Class :character   Class :character   Class :character   1st Qu.:14.00  
 Mode  :character   Mode  :character   Mode  :character   Median :15.60  
                                                          Mean   :15.02  
                                                          3rd Qu.:15.60  
                                                          Max.   :18.40  
      Ram              OS                Weight       Price_euros  
 Min.   : 2.000   Length:1275        Min.   :0.690   Min.   : 174  
 1st Qu.: 4.000   Class :character   1st Qu.:1.500   1st Qu.: 609  
 Median : 8.000   Mode  :character   Median :2.040   Median : 989  
 Mean   : 8.441                      Mean   :2.041   Mean   :1135  
 3rd Qu.: 8.000                      3rd Qu.:2.310   3rd Qu.:1496  
 Max.   :64.000                      Max.   :4.700   Max.   :6099  
    Screen             ScreenW        ScreenH     Touchscreen       
 Length:1275        Min.   :1366   Min.   : 768   Length:1275       
 Class :character   1st Qu.:1920   1st Qu.:1080   Class :character  
 Mode  :character   Median :1920   Median :1080   Mode  :character  
                    Mean   :1900   Mean   :1074                     
                    3rd Qu.:1920   3rd Qu.:1080                     
                    Max.   :3840   Max.   :2160                     
   IPSpanel         RetinaDisplay      CPU_company           CPU_freq    
 Length:1275        Length:1275        Length:1275        Min.   :0.900  
 Class :character   Class :character   Class :character   1st Qu.:2.000  
 Mode  :character   Mode  :character   Mode  :character   Median :2.500  
                                                          Mean   :2.303  
                                                          3rd Qu.:2.700  
                                                          Max.   :3.600  
  CPU_model         PrimaryStorage   SecondaryStorage PrimaryStorageType
 Length:1275        Min.   :   8.0   Min.   :   0.0   Length:1275       
 Class :character   1st Qu.: 256.0   1st Qu.:   0.0   Class :character  
 Mode  :character   Median : 256.0   Median :   0.0   Mode  :character  
                    Mean   : 444.5   Mean   : 176.1                     
                    3rd Qu.: 512.0   3rd Qu.:   0.0                     
                    Max.   :2048.0   Max.   :2048.0                     
 SecondaryStorageType GPU_company         GPU_model        
 Length:1275          Length:1275        Length:1275       
 Class :character     Class :character   Class :character  
 Mode  :character     Mode  :character   Mode  :character  
                                                           
                                                           
                                                           

2.1 Handling Missing Values

missing_values <- is.na(laptop.prices)
summary(missing_values)
  Company         Product         TypeName         Inches       
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:1275      FALSE:1275      FALSE:1275      FALSE:1275     
    Ram              OS            Weight        Price_euros    
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:1275      FALSE:1275      FALSE:1275      FALSE:1275     
   Screen         ScreenW         ScreenH        Touchscreen    
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:1275      FALSE:1275      FALSE:1275      FALSE:1275     
  IPSpanel       RetinaDisplay   CPU_company      CPU_freq      
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:1275      FALSE:1275      FALSE:1275      FALSE:1275     
 CPU_model       PrimaryStorage  SecondaryStorage PrimaryStorageType
 Mode :logical   Mode :logical   Mode :logical    Mode :logical     
 FALSE:1275      FALSE:1275      FALSE:1275       FALSE:1275        
 SecondaryStorageType GPU_company     GPU_model      
 Mode :logical        Mode :logical   Mode :logical  
 FALSE:1275           FALSE:1275      FALSE:1275     
missing_rows <- laptop.prices[!complete.cases(laptop.prices), ]
print(missing_rows)
 [1] Company              Product              TypeName            
 [4] Inches               Ram                  OS                  
 [7] Weight               Price_euros          Screen              
[10] ScreenW              ScreenH              Touchscreen         
[13] IPSpanel             RetinaDisplay        CPU_company         
[16] CPU_freq             CPU_model            PrimaryStorage      
[19] SecondaryStorage     PrimaryStorageType   SecondaryStorageType
[22] GPU_company          GPU_model           
<0 rows> (or 0-length row.names)

The summary table above shows that none of the variables have missing values. If any variable had missing values, a TRUE: n row would appear beneath it, with n representing the number of missing values.

Since the dataset was cleaned beforehand and used for testing machine learning algorithms, it is no surprise that none of the variables have any missing values.

2.2 Data Manipulation

laptop.prices$Price_USD = round(laptop.prices$Price_euros * 1.09, 2)

Price_USD was added by converting Price_euros to USD using the October 9th, 9:29 PM exchange rate of 1 Euro to $1.09 USD. This was done to allow for easier comparisons for those in the United States market.

This finalizes the dataset with 1275 observations and 24 variables. Not all 24 variables will necessarily be used in the analysis, but all will be retained in the dataset for any further research.

3 Exploratory Data Analysis

Next I will perform some Exploratory Data Analysis (EDA). EDA is an important step in the overarching data analysis process, as it familiarizes the analyst with the dataset before any modeling begins.

3.1 Distribution of Price_USD

Below is a histogram of Price_USD

hist(laptop.prices$Price_USD,
     breaks = seq(0, 8000, 500),
     main = "Distribution of USD Price",
     xlim = c(0, 8000),
     xlab = "Laptop Price (USD)",
     ylim = c(0, 450),
     ylab = "Count of Laptops",
     col = "lightgreen",
     labels = TRUE)
abline(h = seq(50, 400, 50), col = "gray", lty = "dotted")

The histogram above shows that Price_USD has a very right-skewed distribution. Roughly 75% of the laptops in the dataset are below about $1,630 USD (the third quartile), and nearly all are below $4,000. A few outliers may be present, including one over $6,000.
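
Since the distribution is so right-skewed, a quick way to inspect the bulk of the data is to plot the price on a log scale. This is just an exploratory sketch and does not alter the dataset:

# Exploratory look at price on a log scale; the right skew compresses considerably
hist(log(laptop.prices$Price_USD),
     main = "Distribution of log(USD Price)",
     xlab = "log(Laptop Price in USD)",
     col = "lightgreen")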

3.2 Distribution of Inches

Below is a histogram of Inches

hist(laptop.prices$Inches,
     breaks = seq(10, 20, 1),
     main = "Distribution of Inches",
     xlim = c(10, 20), 
     xlab = "Screen Size (in)",
     ylim = c(0, 800),
     ylab = "Count of Laptops",
     col = "coral",
     labels = TRUE)
abline(h = seq(100, 800, 100), col = "gray", lty = "dotted")

The histogram above shows that Inches has a distinctive, multi-modal distribution. Counts spike in the bins just below 14", 16", and 18", reflecting common screen sizes such as 13.3", 15.6", and 17.3".

Implications for Feature Engineering:

To alleviate this distinctive distribution, consider rounding all Inches values up to the nearest whole number. I changed the bucket size on the histogram below to approximate what this change would look like:

hist(laptop.prices$Inches,
     breaks = seq(10, 20, 2),
     main = "Distribution of Inches",
     xlim = c(10, 20), 
     xlab = "Screen Size (in)",
     ylim = c(0, 800),
     ylab = "Count of Laptops",
     col = "coral",
     labels = TRUE)
abline(h = seq(100, 800, 100), col = "gray", lty = "dotted")

As shown above, this histogram appears far closer to a Normal distribution.
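
As a concrete sketch of the rounding suggestion, the hypothetical Inches.rounded column below rounds each screen size up to the nearest whole inch:

# Hypothetical feature: round each screen size up to the nearest whole inch
laptop.prices$Inches.rounded <- ceiling(laptop.prices$Inches)

# Compare original and rounded values for the first few laptops
head(data.frame(Original = laptop.prices$Inches,
                Rounded = laptop.prices$Inches.rounded))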

3.3 Distribution of Ram

Below is a histogram of Ram

hist(laptop.prices$Ram,
     breaks = seq(0, 70, 5),
     main = "Distribution of Ram",
     xlim = c(0, 70),
     xlab = "Ram (GB)",
     ylim = c(0, 800),
     ylab = "Count of Laptops",
     col = "yellow",
     labels = TRUE)
abline(h = seq(100, 800, 100), col = "gray", lty = "dotted")

The histogram above shows that Ram has a very right-skewed distribution. Nearly all of the laptops in the dataset have 20 GB of RAM or less. Laptops with more than 25 GB of RAM may be considered outliers.
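
Because Ram takes only a handful of discrete values, a frequency table is a useful complement to the histogram:

# Frequency of each RAM size in the dataset
table(laptop.prices$Ram)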

3.4 Distribution of Company

Below is a bar chart of Company

ggplot(laptop.prices, aes(y = reorder(factor(Company), Company, FUN = length))) + 
  geom_bar(fill = "darkblue") + 
  labs(title = "Company Distribution", x = "Count", y = "Company")

As shown in the plot above, the top 5 most common laptop companies in the dataset are:

  1. Dell
  2. Lenovo
  3. HP
  4. Asus
  5. Acer

Implications for Feature Engineering:

The imbalance of laptop companies may introduce bias when creating a predictive model. To alleviate this, consider combining all companies outside the 5 listed above into one "Other" category, as sketched below.
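
A minimal sketch of this recode, using a hypothetical Company.grouped column:

# Collapse all companies outside the top 5 into a single "Other" level
top5 <- c("Dell", "Lenovo", "HP", "Asus", "Acer")
laptop.prices$Company.grouped <- ifelse(laptop.prices$Company %in% top5,
                                        laptop.prices$Company, "Other")
table(laptop.prices$Company.grouped)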

3.5 Distribution of TypeName

Below is a bar chart of TypeName

ggplot(laptop.prices, aes(y = reorder(factor(TypeName), TypeName, FUN = length))) + 
  geom_bar(fill = "darkred") + 
  labs(title = "Laptop Type Distribution", x = "Count", y = "Laptop Type")

As shown above, there are 6 different types of laptops within the dataset, with "Notebook" being the most common.

3.6 Distribution of Primary Storage Type

Below is a pie chart of PrimaryStorageType

pst <- laptop.prices %>%
  group_by(PrimaryStorageType) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  mutate(perc = count / sum(count)) %>%
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))

ggplot(pst, aes(x = "", y = perc, fill = PrimaryStorageType)) + 
  geom_col() + 
  geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")), 
            position = position_stack(vjust = (0.5))) + 
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of Primary Storage Type", y = "", x = "")

As shown in the plot above, over 65% of laptops in the dataset have an SSD primary storage type.

3.7 Distribution of Touchscreen

Below is a pie chart of Touchscreen

touchscreen <- laptop.prices %>%
  group_by(Touchscreen) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  mutate(perc = count / sum(count)) %>%
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))

ggplot(touchscreen, aes(x = "", y = perc, fill = Touchscreen)) +
  geom_col() +
  geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of Touchscreen", y = "", x = "")

As shown in the plot above, only 188 laptops (15%) in the dataset have a touchscreen. Since touchscreens are relatively uncommon, I predict that a laptop with a touchscreen will be more expensive than one without, assuming all other aspects are similar.

3.8 Relationship between Touchscreen and Price (USD)

To take a first look at this claim, below is a plot of average Price_USD by Touchscreen status

touchscreen.price <- laptop.prices %>%
  group_by(Touchscreen) %>%
  summarise(count = n(), avg = mean(Price_USD))

ggplot(touchscreen.price, aes(x = Touchscreen, y = avg, fill = Touchscreen)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste("$", round(avg, 2))), position = position_stack(vjust = 0.5)) +
  labs(title = "Average Price (USD) by Touchscreen Status",
       x = "Touchscreen Status",
       y = "Average price (USD)") +
  theme_minimal()

As shown above, the average price for a laptop without a touchscreen is $1177.14 USD, and the average price for a laptop with a touchscreen is $1583.90.
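
The comparison above is of raw group means only. As a quick, informal check that the gap is larger than chance alone would explain, a two-sample t-test can be run; note that this does not control for any other laptop aspects:

# Informal check: do mean prices differ by touchscreen status?
t.test(Price_USD ~ Touchscreen, data = laptop.prices)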

4 Regression Analysis

Next, I will create some models that will hopefully allow for further understanding of the data. To accomplish this, I will first split the data into two separate datasets: training data and testing data. Doing so allows the models to be tested on data they were not fit on, without the need for additional research or sampling. Roughly 75% of the overall dataset will be placed in the training dataset, and the remaining ~25% in the testing dataset.

4.1 Multiple Linear Regression

Multiple linear regression is a statistical technique that uses multiple variables to predict the outcome of another variable. This technique allows statisticians to build models and equations that can be used across multiple datasets.
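
In general form, a multiple linear regression model with \(p\) predictors can be written as:

\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon\)

where \(Y\) is the response, the \(\beta\)'s are the regression coefficients, and \(\varepsilon\) is a random error term.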

The subsections below outline my process for developing a multiple linear regression model.

4.1.2 Question Statement

One question that this research seeks to answer is: "What aspects of a laptop contribute to its asking price?" There are numerous variables in the dataset that may contribute to a laptop's asking price, making the dataset well suited to answering this question.

4.1.3 Model #1:

The first model I built includes the following variables:

  • ScreenW
  • ScreenH
  • Weight
  • Company

These variables were chosen because they relate not to the physical hardware within the laptop, but rather to the external features that are of more practical, everyday use.

I created dummy variables for Company. Following the feature engineering suggestion from above, if the Company is not within the top 5 most common laptop companies in the dataset, it is implicitly categorized as "Other" (all five dummy variables equal 0).

# Dummy variables for the top 5 companies; all other companies form the
# implicit "Other" baseline
laptop.prices$company.dell <- ifelse(laptop.prices$Company == "Dell", 1, 0)
laptop.prices$company.lenovo <- ifelse(laptop.prices$Company == "Lenovo", 1, 0)
laptop.prices$company.hp <- ifelse(laptop.prices$Company == "HP", 1, 0)
laptop.prices$company.asus <- ifelse(laptop.prices$Company == "Asus", 1, 0)
laptop.prices$company.acer <- ifelse(laptop.prices$Company == "Acer", 1, 0)

# Recode IPSpanel to 0/1 for the logistic regression models in Section 4.2
laptop.prices$IPSpanel <- ifelse(laptop.prices$IPSpanel == "Yes", 1, 0)

# 75/25 train/test split
set.seed(323)
index <- sample.split(Y = laptop.prices$Price_USD, SplitRatio = 0.75)
train.data <- laptop.prices[index, ]
test.data <- laptop.prices[!index, ]

mlr.model.1 <- lm(Price_USD ~ ScreenW + ScreenH + Weight + company.dell +
                    company.lenovo + company.hp + company.asus + company.acer,
                  data = train.data)

4.1.3.a Condition Checks

Below are some diagnostic plots to identify potential violations of the model conditions for Model 1:

par(mfrow = c(2,2), mar = c(2, 3, 2, 2))
plot(mlr.model.1)

Next, I will carry out the Box-Cox procedure to identify a potential power transformation of the response variable Price_USD:

boxcox(Price_USD ~ ScreenW + ScreenH + Weight + company.dell + company.lenovo + company.hp + company.asus + company.acer,
       data = train.data,
       lambda = seq(-1, 1.5, length = 10),
       xlab = expression(lambda))

title(main = "Box-Cox Transformation: 95% CI of lambda")

Since none of the common lambda values fall within the 95% confidence interval for lambda, no transformation will be applied.

4.1.3.b Model Summary

Below is the summary of model 1:

kable(summary(mlr.model.1)$coef)
                    Estimate   Std. Error     t value   Pr(>|t|)
(Intercept)     -679.7977419  110.8882441  -6.1304762  0.0000000
ScreenW            1.0065739    0.3792064   2.6544225  0.0080777
ScreenH           -0.3023726    0.6583448  -0.4592921  0.6461298
Weight           309.4865131   29.0238959  10.6631623  0.0000000
company.dell    -333.8503796   68.5100735  -4.8730116  0.0000013
company.lenovo  -349.9186142   67.4439680  -5.1882863  0.0000003
company.hp      -251.5993848   68.9874948  -3.6470289  0.0002798
company.asus    -327.3180393   77.4511225  -4.2261239  0.0000261
company.acer    -672.9197009   87.3185693  -7.7064902  0.0000000

We can see that the only insignificant variable is ScreenH, meaning that the height of the screen does not contribute significantly to a laptop's price. Interestingly, all top 5 company indicators are significant relative to the "Other" baseline.

4.1.4 Model #2:

The second model I built includes the following variables:

  • ScreenW
  • ScreenH
  • Weight
  • Company
  • Ram
  • CPU_freq
  • PrimaryStorage

This model is similar to Model 1, but this time includes some hardware variables to see whether the internal hardware can better predict a laptop's price.

mlr.model.2 <- lm(Price_USD ~ ScreenW + ScreenH + Weight + company.dell +
                    company.lenovo + company.hp + company.asus + company.acer +
                    Ram + CPU_freq + PrimaryStorage,
                  data = train.data)

4.1.4.a Condition Checks

Below are some diagnostic plots to identify potential violations of the model conditions for Model 2:

par(mfrow = c(2,2), mar = c(2, 3, 2, 2))
plot(mlr.model.2)

Next, I will carry out the Box-Cox procedure to identify a potential power transformation of the response variable Price_USD:

boxcox(Price_USD ~ ScreenW + ScreenH + Weight + company.dell + company.lenovo + company.hp + company.asus + company.acer + Ram + CPU_freq + PrimaryStorage,
       data = train.data,
       lambda = seq(-1, 1.5, length = 10),
       xlab = expression(lambda))

title(main = "Box-Cox Transformation: 95% CI of lambda")

Since none of the common lambda values fall within the 95% confidence interval for lambda, no transformation will be applied.

4.1.4.b Model Summary

Below is the summary of model 2:

kable(summary(mlr.model.2)$coef)
                    Estimate  Std. Error     t value   Pr(>|t|)
(Intercept)     -665.3982409  99.3548496  -6.6971894  0.0000000
ScreenW            0.1454878   0.2827000   0.5146367  0.6069274
ScreenH            0.4710990   0.4892147   0.9629699  0.3358092
Weight            -9.0226220  25.1283214  -0.3590619  0.7196291
company.dell    -110.0830558  51.8291801  -2.1239590  0.0339330
company.lenovo  -132.2551571  50.8444761  -2.6011706  0.0094360
company.hp        15.8917352  52.0622965   0.3052446  0.7602472
company.asus    -183.8699352  57.9179995  -3.1746596  0.0015487
company.acer    -268.8680156  66.3082423  -4.0548204  0.0000543
Ram               82.2163154   3.5052008  23.4555223  0.0000000
CPU_freq         286.4280110  31.4014139   9.1215004  0.0000000
PrimaryStorage    -0.2677410   0.0408542  -6.5535687  0.0000000

We can see that after adding the "hardware" variables, many of the variables from Model 1 became insignificant, suggesting that the hardware variables matter more in determining a laptop's price.

4.1.5 Model Comparison

mlr.model.1.r = summary(mlr.model.1)$r.squared
mlr.model.2.r = summary(mlr.model.2)$r.squared

R.Square = cbind(Model.1.r = mlr.model.1.r, Model.2.r = mlr.model.2.r)

kable(R.Square)
 Model.1.r   Model.2.r
 0.4089328   0.6789631

As shown above, Model 1's r-squared value is 0.4089, and Model 2's r-squared value is 0.6790. This means that 40.89% of the variability in laptop prices can be explained by Model 1, and 67.90% by Model 2.

Since the response variable for both models is on the same scale, I will also use the mean squared error (MSE) of the training residuals to compare the two models.
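
Here the MSE is the average squared residual:

\(MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

where \(y_i\) is the observed price and \(\hat{y}_i\) is the model's fitted value.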

mlr.model.1.mse = mean(mlr.model.1$residuals^2)
mlr.model.2.mse = mean(mlr.model.2$residuals^2)

MSE = cbind(Model.1.MSE = mlr.model.1.mse, Model.2.MSE = mlr.model.2.mse)

kable(MSE)
 Model.1.MSE   Model.2.MSE
      339476      184385.7

As shown above, Model 2 has a lower MSE than Model 1, meaning Model 2 fits the training data more accurately than Model 1.

Since Model 2 has a higher R-squared value and a lower MSE than Model 1, Model 2 is the better of the two and will be used as the final model.

4.1.6 Final Model

The Final Model is as reported below:

\(\widehat{Price\_USD} = -665.40 + 0.15(ScreenW) + 0.47(ScreenH) - 9.02(Weight) - 110.08(Company.Dell) - 132.26(Company.Lenovo) + 15.89(Company.HP) - 183.87(Company.ASUS) - 268.87(Company.Acer) + 82.22(Ram) + 286.43(CPU\_freq) - 0.27(PrimaryStorage)\)
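
To illustrate how the final model is used, the sketch below predicts the price of a single hypothetical listing; the spec values here are invented for illustration:

# Hypothetical new listing: a Full HD Dell with 16 GB RAM, a 2.5 GHz CPU,
# and a 512 GB primary drive
new.laptop <- data.frame(ScreenW = 1920, ScreenH = 1080, Weight = 1.8,
                         company.dell = 1, company.lenovo = 0, company.hp = 0,
                         company.asus = 0, company.acer = 0,
                         Ram = 16, CPU_freq = 2.5, PrimaryStorage = 512)

# Predicted asking price in USD
predict(mlr.model.2, newdata = new.laptop)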

4.1.6.a Performance Testing

Now that the final model is identified, it is time to test this model on the test dataset to determine its accuracy. Below are Model 2's RMSE on the test dataset and its r-squared value from the training data:

test.data$predicted.PriceUSD <- predict(mlr.model.2, newdata = test.data)

mlr.rmse <- sqrt(mean((test.data$Price_USD - test.data$predicted.PriceUSD)^2))

mlr.rsquared <- summary(mlr.model.2)$r.squared

cat("Root Mean Squared Error (RMSE): ", mlr.rmse, "\n")
Root Mean Squared Error (RMSE):  459.8317 
cat("R-squared for the model (training data): ", mlr.rsquared, "\n")
R-squared for the model (training data):  0.6789631 

4.1.7 Discussion

As shown above, a laptop's screen width, screen height, weight, company, RAM, CPU frequency, and primary storage together predict a laptop's asking price, although within Model 2 only the company indicators, Ram, CPU_freq, and PrimaryStorage are individually significant. It is important to note that the regression coefficients for the company variables are relative to an "Other" laptop. For example, the -110.08 means that, on average and holding the other variables fixed, a Dell laptop costs $110.08 less than a laptop not made by Dell, Lenovo, HP, ASUS, or Acer. On the test dataset, Model 2 achieved an RMSE of about $459.83; its r-squared value of 0.6790 comes from the training data.
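
Returning to the question of whether a laptop is overpriced, one simple (and admittedly crude) rule is to flag listings whose asking price exceeds the model's prediction by more than the test RMSE; the hypothetical overpriced flag below sketches this idea:

# Flag test-set laptops priced more than one RMSE above their predicted price
test.data$overpriced <- (test.data$Price_USD - test.data$predicted.PriceUSD) > mlr.rmse
table(test.data$overpriced)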

4.2 Logistic Regression

Logistic regression is a statistical technique that uses multiple variables to predict the outcome of a binary variable. This technique allows statisticians to build models and equations that can be used across multiple datasets.
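
In general form, a logistic regression model predicts the log-odds of the binary outcome:

\(\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\)

where \(p\) is the probability that the outcome equals 1.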

The subsections below outline my process for developing a logistic regression model.

4.2.1 Question Statement

One question that this section of the research seeks to answer is: "What aspects of a laptop are associated with it having an IPS panel?" An IPS (in-plane switching) panel is a liquid crystal display technology that enhances color accuracy and viewing angles. There are numerous hardware-type variables in the dataset that may relate to the presence of an IPS panel, making the dataset fit to answer this question.

4.2.2 Model #1:

The first model I built includes the following variables:

  • Ram
  • Inches
  • CPU_freq
  • ScreenW
  • ScreenH
  • PrimaryStorage
  • SecondaryStorage

These variables were chosen because they mainly affect the display of the laptop and may relate to the presence of an IPS panel.

Below shows a summary of Model 1:

log.model.1 <- glm(IPSpanel ~ Ram + Inches + CPU_freq + ScreenW + ScreenH + PrimaryStorage + SecondaryStorage, data = train.data, family = binomial)

summary(log.model.1)

Call:
glm(formula = IPSpanel ~ Ram + Inches + CPU_freq + ScreenW + 
    ScreenH + PrimaryStorage + SecondaryStorage, family = binomial, 
    data = train.data)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.818e-01  1.021e+00  -0.178  0.85868    
Ram               8.002e-02  1.862e-02   4.298 1.72e-05 ***
Inches           -2.279e-01  6.823e-02  -3.339  0.00084 ***
CPU_freq          2.681e-01  1.774e-01   1.511  0.13068    
ScreenW          -4.212e-04  1.266e-03  -0.333  0.73945    
ScreenH           2.165e-03  2.181e-03   0.993  0.32084    
PrimaryStorage   -5.498e-04  2.790e-04  -1.971  0.04877 *  
SecondaryStorage  8.777e-05  2.154e-04   0.407  0.68365    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1147.3  on 955  degrees of freedom
Residual deviance: 1032.8  on 948  degrees of freedom
AIC: 1048.8

Number of Fisher Scoring iterations: 4

As shown above, many p-values are greater than 0.05, suggesting that some insignificant predictor variables should be dropped from the model.

4.2.3 Model #2:

Next, starting from Model 1, I use backward stepwise selection (based on AIC) to remove variables that do not improve the model:

log.model.2 = step(log.model.1, direction = "backward", trace = 0)

summary(log.model.2)

Call:
glm(formula = IPSpanel ~ Ram + Inches + CPU_freq + ScreenH + 
    PrimaryStorage, family = binomial, data = train.data)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -0.3227982  0.9310651  -0.347 0.728818    
Ram             0.0822647  0.0178504   4.609 4.05e-06 ***
Inches         -0.2184003  0.0607044  -3.598 0.000321 ***
CPU_freq        0.2618654  0.1764667   1.484 0.137826    
ScreenH         0.0014496  0.0003024   4.793 1.64e-06 ***
PrimaryStorage -0.0005968  0.0002572  -2.320 0.020333 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1147.3  on 955  degrees of freedom
Residual deviance: 1033.1  on 950  degrees of freedom
AIC: 1045.1

Number of Fisher Scoring iterations: 4

As shown above, all remaining variables except CPU_freq are significant.

4.2.4 Model Comparison

To compare the two models, I will be using ROC curves and AUC values.

4.2.4.a ROC Curves

An ROC curve is a graph of a model's classification performance, plotting the true positive rate against the false positive rate across all classification thresholds.

Below are the ROC Curves for both models:

train.data$predicted.prob.1 <- predict(log.model.1, type = "response")
train.data$predicted.prob.2 <- predict(log.model.2, type = "response")

log.model.1.roc <- roc(train.data$IPSpanel, train.data$predicted.prob.1)
log.model.2.roc <- roc(train.data$IPSpanel, train.data$predicted.prob.2)

plot(log.model.1.roc, 
     col = "blue", 
     lwd = 2, 
     main = "Overlayed ROC Curves", 
     xlim = c(0,1), 
     ylim = c(0,1))

lines(log.model.2.roc, col = "red", lwd = 2)

legend("bottomright", 
       legend = c("Model 1", "Model 2"), 
       col = c("blue", "red"),
       lwd = 2)

As shown above, the two ROC curves are very similar and overlap almost everywhere. This makes it difficult to distinguish which curve, and thus which model, is better. Instead, I will use AUC to compare the two models.

4.2.4.b AUC Values

AUC values represent the area under an ROC curve. The interpretation of AUC values is as follows:

  • AUC = 0.5: Indicates no discriminative power, similar to random guessing
  • AUC < 0.5: Indicates the model performs worse than random guessing
  • AUC > 0.5: Indicates the model has some predictive power. The closer the AUC is to 1, the better the model’s performance

Below are each model’s AUC value:

log.model.1.auc = auc(log.model.1.roc)
log.model.2.auc = auc(log.model.2.roc)

AUC = cbind(Model.1.AUC = log.model.1.auc, Model.2.AUC = log.model.2.auc)

kable(AUC)
 Model.1.AUC   Model.2.AUC
   0.7311494     0.7319637

As shown above, Model 2 has a slightly higher AUC value. Since Model 2 also has a lower AIC (1045.1 vs. 1048.8) and uses fewer predictors, Model 2 is the final model.

4.2.5 Final Model

The Final Model is as reported below:

\(\text{logit}(P(IPSpanel = 1)) = -0.3228 + 0.0823(Ram) - 0.2184(Inches) + 0.2619(CPU\_freq) + 0.0014(ScreenH) - 0.0006(PrimaryStorage)\)
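
Because the model is on the log-odds scale, converting a prediction to a probability requires the inverse logit. The sketch below does this for a hypothetical laptop (16 GB RAM, 15.6" screen, 2.5 GHz CPU, 1080-pixel screen height, 512 GB primary storage); the spec is invented for illustration:

# Log-odds for the hypothetical spec, using the fitted coefficients above
log.odds <- -0.3228 + 0.0823*16 - 0.2184*15.6 + 0.2619*2.5 +
  0.0014*1080 - 0.0006*512

# Inverse logit: convert log-odds to a probability (roughly 0.37 here)
exp(log.odds) / (1 + exp(log.odds))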

4.2.5.a Performance Testing

Now that the final model is identified, it is time to test this model on the test dataset to determine its accuracy. Below is the confusion matrix for Model 2:

test.data$predicted.prob <- predict(log.model.2, newdata = test.data, type = "response")

threshold <- 0.5
test.data$predicted.IPS <- ifelse(test.data$predicted.prob > threshold, 1, 0)

log.confusion.matrix <- table(Actual = test.data$IPSpanel, Predicted = test.data$predicted.IPS)
print(log.confusion.matrix)
      Predicted
Actual   0   1
     0 215  22
     1  69  13

Below are some additional metrics for Model 2:

log.accuracy <- sum(diag(log.confusion.matrix)) / sum(log.confusion.matrix)
log.precision <- log.confusion.matrix[2,2] / sum(log.confusion.matrix[, 2])

cat("Accuracy: ", log.accuracy, "\n")
Accuracy:  0.7147335 
cat("Precision: ", log.precision, "\n")
Precision:  0.3714286 

Accuracy is the proportion of correctly predicted outcomes, both true positives and true negatives, out of all predictions. Precision is the proportion of predicted positives that are actually positive.

As shown above, Model 2 has a fairly high accuracy but a low precision. This means the model performs reasonably well in terms of the total number of correct predictions, but struggles with the reliability of its positive predictions. Note, too, that simply predicting "no IPS panel" for every laptop in the test set would be correct 237/319 ≈ 74% of the time, so the accuracy should be read with this class imbalance in mind.
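
To make the weakness on positives concrete, below is a quick sketch computing recall (sensitivity), a metric not reported above:

# Recall (sensitivity): proportion of actual IPS-panel laptops correctly flagged
log.recall <- log.confusion.matrix[2, 2] / sum(log.confusion.matrix[2, ])
cat("Recall: ", log.recall, "\n")
# From the matrix above: 13 / (69 + 13), i.e. about 0.159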

4.2.6 Discussion

A laptop's RAM, diagonal screen size, screen height, and primary storage are all significant in predicting whether or not a laptop has an IPS panel. It is important to note that each predictor's coefficient represents the change in the log-odds that a laptop has an IPS panel for a one-unit increase in that predictor. Model 2 is reasonably accurate but not precise, meaning that many of its positive predictions may not be correct.