UNIVERSITY OF CINCINNATI
LINDNER COLLEGE OF BUSINESS
TEAM: Eli Bales, Vivian Comer, Andrew McCurrach, Devin Walker, Kazuhide Watanabe
LAPTOP PRICE PREDICTION: IDENTIFYING THE KEY DRIVERS OF MARKET VALUE
A statistical modeling analysis evaluating how technical specifications and brand factors influence laptop pricing.
1. Introduction
For this case study, our group built multiple linear regression models to understand the driving factors of laptop price across multiple brands. Our dataset, originating from Kaggle, contains information on 1273 different laptops, including each laptop's brand (such as ASUS, Apple, or Xiaomi) and technical specifications such as RAM, SSD capacity, CPU type, and GPU. These are the predictor variables we use to predict our response variable, price.
Using both forward and backward stepwise regression, we developed and compared many models, ultimately narrowing our selection down to three for closer comparison. These models were quite similar, and we found that the driving predictors of laptop price were largely technical specifications: CPU brand, total RAM, SSD size, and screen resolution, as well as operating system (OS) and laptop "type" (Notebook, Gaming, 2 in 1, etc.).
In this case study, we will walk through our initial EDA, how we developed our models, and which model produces the lowest error when predicting price against test data.
2. Data Description
2.1 Data Source and Structure
We first load the necessary libraries for our analysis in R and read in our data from the CSV file.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(knitr)
library(patchwork)
library(GGally)
library(broom)
df <- read_csv('/Users/kazuhidewatanabe/Desktop/BANA_7052/laptop_data_cleaned.csv')
2.2 Data Cleaning
Many of our predictors need to be converted to factors for linear regression to adequately utilize their predictive power. We do so below, converting laptop type, company name, CPU brand, touch screen availability, GPU brand, and OS all to factors.
df$TypeName <- as.factor(df$TypeName)
df$Company <- as.factor(df$Company)
df$Cpu_brand <- as.factor(df$Cpu_brand)
df$TouchScreen <- as.factor(df$TouchScreen)
df$Gpu_brand <- as.factor(df$Gpu_brand)
df$Os <- as.factor(df$Os)
2.3 Data Visualization
For EDA, we hope to see whether we can initially spot any possible correlations between predictors and price, and use both boxplots and scatter plots to visualize our data below.
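As a sketch of this EDA (assuming the plots were built with ggplot2 and arranged with patchwork, both loaded above), the following reproduces two representative plots; the full analysis repeats this pattern for each predictor.
# Boxplot of price by brand, and a scatter plot of price vs. RAM
# (a minimal sketch; one such plot was produced per predictor)
p1 <- ggplot(df, aes(x = Company, y = Price)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Price by Brand")
p2 <- ggplot(df, aes(x = Ram, y = Price)) +
  geom_point(alpha = 0.4) +
  labs(title = "Price vs. RAM")
p1 + p2  # patchwork places the two plots side by side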
2.4 Preliminary Statistical Insights
From many of these plots we can easily see a correlation between the predictor variable and the price of the laptop. For example, the first boxplot plots the brand of the laptop against the price, and from it we can pull two pieces of information: first, that a laptop's brand clearly influences its price; and second, that some brands span a wide range of prices (Lenovo, Dell, HP, ASUS), while others span a very narrow one (Vero, Xiaomi, Samsung). Other predictors, such as GPU brand, operating system, total RAM, and total storage size, also appear to have an effect on the price.
From this visual analysis, it is clear that multiple predictors have possible influence over laptop price, making this a good dataset for a linear regression case study.
3. Method
3.1 Statistical Models and Methods Chosen
Based on the exploratory analysis and the preliminary regression diagnostics, multiple linear regression was selected as the primary modeling technique to quantify the relationship between laptop characteristics and pricing. Categorical variables were coded as factors, and most EDA suggested roughly linear relationships for the main numeric predictors. The dataset was randomly split into a training set (75%) and test set (25%), with all model selection and fitting done on the training data and predictive performance evaluated on the test data.
# Split the data into testing/training samples.
set.seed(2)
training.samples <- sample.int(nrow(df), 0.75*nrow(df))
train <- df[training.samples,]
test <- df[-training.samples,]
n <- nrow(train)
3.2 Model Selection and Identification of the “Best” Model
3.2.1 Model Selection 1: Original Dataset
Because the dataset has many possible predictor variables and combinations, we used stepwise selection to narrow it down to a smaller model that still makes sense statistically. Two reference models were specified:
# Base & Full Models
base <- lm(Price ~ 1, data=train)
full <- lm(Price ~ ., data=train)
# Forward Selection
fwdBIC <- step(base, scope = formula(full), direction="forward", k=log(n))
# Backwards Selection
bwdBIC <- step(full, direction="backward", k=log(n))
| term | estimate | std.error | statistic | p.value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 9.9265 | 0.1209 | 82.1317 | 0.00e+00 | *** |
| Cpu_brandIntel Core i3 | 0.1703 | 0.0499 | 3.4143 | 6.67e-04 | *** |
| Cpu_brandIntel Core i5 | 0.4885 | 0.0446 | 10.9594 | 2.19e-26 | *** |
| Cpu_brandIntel Core i7 | 0.5643 | 0.0465 | 12.1340 | 1.40e-31 | *** |
| Cpu_brandOther Intel Processor | -0.1538 | 0.0509 | -3.0229 | 2.57e-03 | ** |
| Ram | 0.0292 | 0.0026 | 11.0307 | 1.09e-26 | *** |
| TypeNameNotebook | -0.0211 | 0.0713 | -0.2964 | 7.67e-01 | |
| TypeName2 in 1 Convertible | 0.1226 | 0.0764 | 1.6059 | 1.09e-01 | |
| TypeNameGaming | 0.2743 | 0.0772 | 3.5527 | 4.00e-04 | *** |
| TypeNameUltrabook | 0.2163 | 0.0762 | 2.8389 | 4.62e-03 | ** |
| TypeNameWorkstation | 0.7031 | 0.0932 | 7.5412 | 1.10e-13 | *** |
| SSD | 0.0006 | 0.0001 | 8.8807 | 3.32e-18 | *** |
| OsOthers | -0.4538 | 0.0810 | -5.6038 | 2.76e-08 | *** |
| OsWindows | -0.2196 | 0.0767 | -2.8630 | 4.29e-03 | ** |
| Ppi | 0.0020 | 0.0003 | 7.6650 | 4.46e-14 | *** |
| adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|
| 0.8111 | 0.2732 | 293.1938 | 0 |
| term | estimate | std.error | statistic | p.value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 9.9265 | 0.1209 | 82.1317 | 0.00e+00 | *** |
| TypeNameNotebook | -0.0211 | 0.0713 | -0.2964 | 7.67e-01 | |
| TypeName2 in 1 Convertible | 0.1226 | 0.0764 | 1.6059 | 1.09e-01 | |
| TypeNameGaming | 0.2743 | 0.0772 | 3.5527 | 4.00e-04 | *** |
| TypeNameUltrabook | 0.2163 | 0.0762 | 2.8389 | 4.62e-03 | ** |
| TypeNameWorkstation | 0.7031 | 0.0932 | 7.5412 | 1.10e-13 | *** |
| Ram | 0.0292 | 0.0026 | 11.0307 | 1.09e-26 | *** |
| Ppi | 0.0020 | 0.0003 | 7.6650 | 4.46e-14 | *** |
| Cpu_brandIntel Core i3 | 0.1703 | 0.0499 | 3.4143 | 6.67e-04 | *** |
| Cpu_brandIntel Core i5 | 0.4885 | 0.0446 | 10.9594 | 2.19e-26 | *** |
| Cpu_brandIntel Core i7 | 0.5643 | 0.0465 | 12.1340 | 1.40e-31 | *** |
| Cpu_brandOther Intel Processor | -0.1538 | 0.0509 | -3.0229 | 2.57e-03 | ** |
| SSD | 0.0006 | 0.0001 | 8.8807 | 3.32e-18 | *** |
| OsOthers | -0.4538 | 0.0810 | -5.6038 | 2.76e-08 | *** |
| OsWindows | -0.2196 | 0.0767 | -2.8630 | 4.29e-03 | ** |
| adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|
| 0.8111 | 0.2732 | 293.1938 | 0 |
On the training data, we defined a base model with only an intercept (Price ~ 1) and a full model including all predictors (Price ~ .). We then applied stepwise selection using BIC, running both forward selection (from the base model) and backward elimination (from the full model). For the original dataset, both directions converged to the same model:
\[ \text{Price} \sim \text{Cpu\_brand} + \text{Ram} + \text{TypeName} + \text{SSD} + \text{Os} + \text{Ppi} \]
with an adjusted \(R^2\) of 0.8111.
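The coefficient and model-fit tables above can be produced from the fitted step objects with broom and knitr (a sketch under the assumption that the report tables were rendered this way; symnum's cutpoints mirror R's usual significance codes).
# Tidy coefficient table with significance stars for the forward model
# (the backward model is summarized the same way)
tidy(fwdBIC) %>%
  mutate(Significance = as.character(symnum(p.value, corr = FALSE, na = FALSE,
           cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
           symbols = c("***", "**", "*", ".", " ")))) %>%
  kable(digits = 4)
# One-line fit summary (adjusted R^2, residual SE, F statistic)
glance(fwdBIC) %>%
  select(adj.r.squared, sigma, statistic, p.value) %>%
  kable(digits = 4)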
3.2.2 Model Selection 2: Converting Numeric Columns to Factors
We observed that several numeric predictors (e.g., RAM and SSD capacity) did not behave as continuous linear variables. For example, RAM takes on distinct upgrade levels (4GB, 8GB, 16GB, 32GB, etc.) rather than varying smoothly across the number line. Because price increases are likely associated with these upgrade tiers rather than a unit-by-unit linear change, treating RAM as a numeric variable would impose an assumption of linearity that may not reflect market behavior.
To allow the regression model to estimate separate effects for each tier, RAM was converted from numeric to a factor variable. This enables the model to learn level-specific price differences (e.g., the price jump from 8GB → 16GB is not assumed to be the same as 16GB → 32GB).
df2 <- read_csv('/Users/kazuhidewatanabe/Desktop/BANA_7052/laptop_data_cleaned.csv')
df2$TypeName <- as.factor(df2$TypeName)
df2$Company <- as.factor(df2$Company)
df2$Cpu_brand <- as.factor(df2$Cpu_brand)
df2$Ram <- as.factor(df2$Ram)
df2$TouchScreen <- as.factor(df2$TouchScreen)
df2$HDD <- as.factor(df2$HDD)
df2$SSD <- as.factor(df2$SSD)
df2$Gpu_brand <- as.factor(df2$Gpu_brand)
df2$Os <- as.factor(df2$Os)
df2$Ips <- as.factor(df2$Ips)
df2$Ppi <- as.factor(round(df2$Ppi, 0))  # round before factoring so each level is a whole PPI value
# Split the data into testing/training samples.
training.samples2 <- sample.int(nrow(df2), 0.75*nrow(df2))
train2 <- df2[training.samples2,]
test2 <- df2[-training.samples2,]
n2 <- nrow(train2)
# Align test-set SSD levels with the training set so predict() can handle them
test2$SSD <- factor(test2$SSD, levels = levels(train2$SSD))
base2 <- lm(Price ~ 1, data=train2)
full2 <- lm(Price ~ ., data=train2)
fwdBIC2 <- step(base2, scope = formula(full2), direction="forward", k = log(n2))
bwdBIC2 <- step(full2, direction="backward", k = log(n2))
| term | estimate | std.error | statistic | p.value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 9.8769 | 0.1360 | 72.6497 | 0.00e+00 | *** |
| Ram4 | 0.4178 | 0.0810 | 5.1608 | 3.01e-07 | *** |
| Ram6 | 0.4054 | 0.0960 | 4.2244 | 2.63e-05 | *** |
| Ram8 | 0.6531 | 0.0848 | 7.7039 | 3.40e-14 | *** |
| Ram12 | 0.7073 | 0.1026 | 6.8917 | 1.02e-11 | *** |
| Ram16 | 0.9474 | 0.0904 | 10.4778 | 2.39e-24 | *** |
| Ram24 | 1.2121 | 0.2080 | 5.8273 | 7.78e-09 | *** |
| Ram32 | 1.3252 | 0.1245 | 10.6457 | 4.85e-25 | *** |
| Ram64 | 1.2716 | 0.2912 | 4.3664 | 1.41e-05 | *** |
| Cpu_brandIntel Core i3 | 0.1506 | 0.0491 | 3.0666 | 2.23e-03 | ** |
| Cpu_brandIntel Core i5 | 0.4179 | 0.0447 | 9.3488 | 6.51e-20 | *** |
| Cpu_brandIntel Core i7 | 0.4869 | 0.0467 | 10.4150 | 4.32e-24 | *** |
| Cpu_brandOther Intel Processor | -0.0967 | 0.0513 | -1.8844 | 5.98e-02 | . |
| TypeNameNotebook | -0.0538 | 0.0691 | -0.7791 | 4.36e-01 | |
| TypeName2 in 1 Convertible | 0.1292 | 0.0743 | 1.7384 | 8.25e-02 | . |
| TypeNameGaming | 0.1657 | 0.0750 | 2.2087 | 2.74e-02 | * |
| TypeNameUltrabook | 0.1949 | 0.0739 | 2.6358 | 8.54e-03 | ** |
| TypeNameWorkstation | 0.6428 | 0.0906 | 7.0954 | 2.57e-12 | *** |
| SSD16 | -0.1735 | 0.1568 | -1.1065 | 2.69e-01 | |
| SSD32 | -0.2661 | 0.1353 | -1.9661 | 4.96e-02 | * |
| SSD64 | -0.3980 | 0.2660 | -1.4959 | 1.35e-01 | |
| SSD128 | 0.1724 | 0.0301 | 5.7252 | 1.40e-08 | *** |
| SSD180 | 0.4057 | 0.1329 | 3.0524 | 2.34e-03 | ** |
| SSD240 | 1.6263 | 0.2668 | 6.0950 | 1.61e-09 | *** |
| SSD256 | 0.2407 | 0.0252 | 9.5702 | 9.45e-21 | *** |
| SSD512 | 0.3448 | 0.0413 | 8.3566 | 2.36e-16 | *** |
| SSD768 | 0.0919 | 0.2667 | 0.3447 | 7.30e-01 | |
| SSD1000 | 0.6438 | 0.0861 | 7.4743 | 1.80e-13 | *** |
| SSD1024 | 0.3818 | 0.2657 | 1.4369 | 1.51e-01 | |
| OsOthers | -0.4687 | 0.0788 | -5.9497 | 3.81e-09 | *** |
| OsWindows | -0.2398 | 0.0747 | -3.2114 | 1.37e-03 | ** |
| Ips1 | 0.0581 | 0.0206 | 2.8170 | 4.95e-03 | ** |
| adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|
| 0.8255 | 0.2625 | 146.4525 | 0 |
| term | estimate | std.error | statistic | p.value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 9.8769 | 0.1360 | 72.6497 | 0.00e+00 | *** |
| TypeNameNotebook | -0.0538 | 0.0691 | -0.7791 | 4.36e-01 | |
| TypeName2 in 1 Convertible | 0.1292 | 0.0743 | 1.7384 | 8.25e-02 | . |
| TypeNameGaming | 0.1657 | 0.0750 | 2.2087 | 2.74e-02 | * |
| TypeNameUltrabook | 0.1949 | 0.0739 | 2.6358 | 8.54e-03 | ** |
| TypeNameWorkstation | 0.6428 | 0.0906 | 7.0954 | 2.57e-12 | *** |
| Ram4 | 0.4178 | 0.0810 | 5.1608 | 3.01e-07 | *** |
| Ram6 | 0.4054 | 0.0960 | 4.2244 | 2.63e-05 | *** |
| Ram8 | 0.6531 | 0.0848 | 7.7039 | 3.40e-14 | *** |
| Ram12 | 0.7073 | 0.1026 | 6.8917 | 1.02e-11 | *** |
| Ram16 | 0.9474 | 0.0904 | 10.4778 | 2.39e-24 | *** |
| Ram24 | 1.2121 | 0.2080 | 5.8273 | 7.78e-09 | *** |
| Ram32 | 1.3252 | 0.1245 | 10.6457 | 4.85e-25 | *** |
| Ram64 | 1.2716 | 0.2912 | 4.3664 | 1.41e-05 | *** |
| Ips1 | 0.0581 | 0.0206 | 2.8170 | 4.95e-03 | ** |
| Cpu_brandIntel Core i3 | 0.1506 | 0.0491 | 3.0666 | 2.23e-03 | ** |
| Cpu_brandIntel Core i5 | 0.4179 | 0.0447 | 9.3488 | 6.51e-20 | *** |
| Cpu_brandIntel Core i7 | 0.4869 | 0.0467 | 10.4150 | 4.32e-24 | *** |
| Cpu_brandOther Intel Processor | -0.0967 | 0.0513 | -1.8844 | 5.98e-02 | . |
| SSD16 | -0.1735 | 0.1568 | -1.1065 | 2.69e-01 | |
| SSD32 | -0.2661 | 0.1353 | -1.9661 | 4.96e-02 | * |
| SSD64 | -0.3980 | 0.2660 | -1.4959 | 1.35e-01 | |
| SSD128 | 0.1724 | 0.0301 | 5.7252 | 1.40e-08 | *** |
| SSD180 | 0.4057 | 0.1329 | 3.0524 | 2.34e-03 | ** |
| SSD240 | 1.6263 | 0.2668 | 6.0950 | 1.61e-09 | *** |
| SSD256 | 0.2407 | 0.0252 | 9.5702 | 9.45e-21 | *** |
| SSD512 | 0.3448 | 0.0413 | 8.3566 | 2.36e-16 | *** |
| SSD768 | 0.0919 | 0.2667 | 0.3447 | 7.30e-01 | |
| SSD1000 | 0.6438 | 0.0861 | 7.4743 | 1.80e-13 | *** |
| SSD1024 | 0.3818 | 0.2657 | 1.4369 | 1.51e-01 | |
| OsOthers | -0.4687 | 0.0788 | -5.9497 | 3.81e-09 | *** |
| OsWindows | -0.2398 | 0.0747 | -3.2114 | 1.37e-03 | ** |
| adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|
| 0.8255 | 0.2625 | 146.4525 | 0 |
As before, we ran forward selection from the intercept-only model and backward elimination from the full model on the training data, using BIC. For the factor-converted dataset, both directions again converged to the same model:
\[ \text{Price} \sim \text{Cpu\_brand} + \text{Ram} + \text{TypeName} + \text{SSD} + \text{Os} + \text{Ips} \]
with an adjusted \(R^2\) of 0.8255.
3.2.3 Model Selection 3: Log Transformation
During diagnostic checks of the initial linear regression model, the QQ-plot and residual distribution indicated noticeable right-skewness, suggesting a violation of the normality assumption. Because linear regression assumes normally distributed errors, we applied a natural log transformation to the response variable (Price).
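These diagnostics can be reproduced with base R's plot method for fitted lm objects (a sketch, assuming the checks were run on the factor-based model from Section 3.2.2).
# Residuals vs. fitted, QQ-plot, scale-location, and leverage diagnostics
par(mfrow = c(2, 2))
plot(fwdBIC2)
par(mfrow = c(1, 1))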
base.log <- lm(log(Price) ~ 1, data = train2)
full.log <- lm(log(Price) ~ ., data = train2)
fwdBIC.log <- step(base.log, scope = formula(full.log), direction = "forward", k = log(n2))
bwdBIC.log <- step(full.log, direction = "backward", k = log(n2))
| term | estimate | std.error | statistic | p.value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 2.2878 | 0.0125 | 182.5461 | 0.00e+00 | *** |
| Ram4 | 0.0426 | 0.0075 | 5.7147 | 1.48e-08 | *** |
| Ram6 | 0.0422 | 0.0088 | 4.7741 | 2.10e-06 | *** |
| Ram8 | 0.0651 | 0.0078 | 8.3334 | 2.83e-16 | *** |
| Ram12 | 0.0704 | 0.0095 | 7.4377 | 2.34e-13 | *** |
| Ram16 | 0.0912 | 0.0083 | 10.9426 | 2.75e-26 | *** |
| Ram24 | 0.1136 | 0.0192 | 5.9247 | 4.42e-09 | *** |
| Ram32 | 0.1233 | 0.0115 | 10.7481 | 1.81e-25 | *** |
| Ram64 | 0.1172 | 0.0268 | 4.3656 | 1.41e-05 | *** |
| Cpu_brandIntel Core i3 | 0.0148 | 0.0045 | 3.2697 | 1.12e-03 | ** |
| Cpu_brandIntel Core i5 | 0.0404 | 0.0041 | 9.7932 | 1.30e-21 | *** |
| Cpu_brandIntel Core i7 | 0.0464 | 0.0043 | 10.7736 | 1.42e-25 | *** |
| Cpu_brandOther Intel Processor | -0.0105 | 0.0047 | -2.2159 | 2.69e-02 | * |
| TypeNameNotebook | -0.0049 | 0.0064 | -0.7719 | 4.40e-01 | |
| TypeName2 in 1 Convertible | 0.0121 | 0.0069 | 1.7668 | 7.76e-02 | . |
| TypeNameGaming | 0.0152 | 0.0069 | 2.1938 | 2.85e-02 | * |
| TypeNameUltrabook | 0.0179 | 0.0068 | 2.6295 | 8.69e-03 | ** |
| TypeNameWorkstation | 0.0575 | 0.0084 | 6.8834 | 1.08e-11 | *** |
| SSD16 | -0.0196 | 0.0145 | -1.3537 | 1.76e-01 | |
| SSD32 | -0.0259 | 0.0125 | -2.0764 | 3.81e-02 | * |
| SSD64 | -0.0407 | 0.0245 | -1.6614 | 9.70e-02 | . |
| SSD128 | 0.0164 | 0.0028 | 5.8944 | 5.27e-09 | *** |
| SSD180 | 0.0382 | 0.0123 | 3.1194 | 1.87e-03 | ** |
| SSD240 | 0.1485 | 0.0246 | 6.0357 | 2.29e-09 | *** |
| SSD256 | 0.0223 | 0.0023 | 9.6358 | 5.30e-21 | *** |
| SSD512 | 0.0315 | 0.0038 | 8.2935 | 3.87e-16 | *** |
| SSD768 | 0.0093 | 0.0246 | 0.3773 | 7.06e-01 | |
| SSD1000 | 0.0573 | 0.0079 | 7.2180 | 1.10e-12 | *** |
| SSD1024 | 0.0350 | 0.0245 | 1.4283 | 1.54e-01 | |
| OsOthers | -0.0444 | 0.0073 | -6.1086 | 1.48e-09 | *** |
| OsWindows | -0.0227 | 0.0069 | -3.2999 | 1.00e-03 | ** |
| Ips1 | 0.0054 | 0.0019 | 2.8308 | 4.74e-03 | ** |
| adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|
| 0.8296 | 0.0242 | 150.6673 | 0 |
| term | estimate | std.error | statistic | p.value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 2.2898 | 0.0126 | 182.3136 | 0.00e+00 | *** |
| TypeNameNotebook | -0.0051 | 0.0064 | -0.8038 | 4.22e-01 | |
| TypeName2 in 1 Convertible | 0.0136 | 0.0069 | 1.9877 | 4.71e-02 | * |
| TypeNameGaming | 0.0160 | 0.0069 | 2.3046 | 2.14e-02 | * |
| TypeNameUltrabook | 0.0182 | 0.0068 | 2.6583 | 7.99e-03 | ** |
| TypeNameWorkstation | 0.0582 | 0.0084 | 6.9507 | 6.87e-12 | *** |
| Ram4 | 0.0434 | 0.0075 | 5.7915 | 9.56e-09 | *** |
| Ram6 | 0.0433 | 0.0089 | 4.8820 | 1.24e-06 | *** |
| Ram8 | 0.0664 | 0.0078 | 8.4760 | 9.14e-17 | *** |
| Ram12 | 0.0718 | 0.0095 | 7.5717 | 8.93e-14 | *** |
| Ram16 | 0.0919 | 0.0084 | 10.9883 | 1.75e-26 | *** |
| Ram24 | 0.1121 | 0.0192 | 5.8246 | 7.90e-09 | *** |
| Ram32 | 0.1246 | 0.0115 | 10.8240 | 8.69e-26 | *** |
| Ram64 | 0.1206 | 0.0269 | 4.4802 | 8.39e-06 | *** |
| Cpu_brandIntel Core i3 | 0.0152 | 0.0045 | 3.3452 | 8.55e-04 | *** |
| Cpu_brandIntel Core i5 | 0.0407 | 0.0041 | 9.8475 | 7.99e-22 | *** |
| Cpu_brandIntel Core i7 | 0.0469 | 0.0043 | 10.8471 | 6.94e-26 | *** |
| Cpu_brandOther Intel Processor | -0.0098 | 0.0047 | -2.0622 | 3.95e-02 | * |
| SSD16 | -0.0198 | 0.0145 | -1.3642 | 1.73e-01 | |
| SSD32 | -0.0230 | 0.0125 | -1.8392 | 6.62e-02 | . |
| SSD64 | -0.0407 | 0.0246 | -1.6530 | 9.87e-02 | . |
| SSD128 | 0.0170 | 0.0028 | 6.1322 | 1.28e-09 | *** |
| SSD180 | 0.0389 | 0.0123 | 3.1600 | 1.63e-03 | ** |
| SSD240 | 0.1468 | 0.0247 | 5.9465 | 3.88e-09 | *** |
| SSD256 | 0.0229 | 0.0023 | 9.8710 | 6.47e-22 | *** |
| SSD512 | 0.0326 | 0.0038 | 8.5903 | 3.66e-17 | *** |
| SSD768 | 0.0084 | 0.0247 | 0.3406 | 7.33e-01 | |
| SSD1000 | 0.0586 | 0.0080 | 7.3580 | 4.12e-13 | *** |
| SSD1024 | 0.0323 | 0.0246 | 1.3141 | 1.89e-01 | |
| OsOthers | -0.0472 | 0.0072 | -6.5352 | 1.05e-10 | *** |
| OsWindows | -0.0253 | 0.0068 | -3.6908 | 2.37e-04 | *** |
| adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|
| 0.8283 | 0.0243 | 154.2504 | 0 |
After the transformation, the residual distribution became slightly more symmetric and the residual standard error decreased (≈ 0.27 → 0.024), though because the response is now on the log scale the two values are not directly comparable. The QQ-plot shows a very slight, hardly noticeable improvement.
3.3 Remedies & Adjustments
During the preliminary modeling phase, diagnostic plots for the initial model revealed that the residuals exhibited right-skewness, with heavier upper tails in the QQ-plot and increasing spread at higher fitted values. These patterns suggested a violation of the normality and constant variance assumptions.
To address this, the dependent variable Price was transformed using the natural logarithm.
After the transformation, the residuals became more symmetric, as described in Section 3.2.3.
The second adjustment concerned the numeric predictors. Some of them, mainly RAM and SSD storage size, do not increase continuously; rather, they come in specific hardware tiers (e.g., 4GB → 8GB → 16GB → 32GB for RAM, and 128GB → 256GB → 512GB for SSD). Because the effect on price is likely nonlinear across these jumps, a strictly numeric treatment would incorrectly force the model to learn a linear pricing relationship.
Therefore, RAM and SSD were reclassified as categorical variables (factors) to allow the model to learn distinct price jumps between hardware levels rather than assume a single linear slope.
4. Results
We now have three different models we can (somewhat) compare: (1) the base model, where we did not convert numeric variables such as SSD and RAM to factors; (2) the model where we do convert numeric columns to factors; and (3) the model where we took the log transform to address the slight right skew appearing in our graphs. We want to evaluate each model's out-of-sample performance by computing the test MSE for each, as sketched below.
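A sketch of how the test MSEs can be computed from the fitted step objects (our reading: model 3 predicts log(Price), so its predictions are back-transformed with exp() before computing errors on the original scale).
mse <- function(actual, predicted) mean((actual - predicted)^2)
pred1 <- predict(fwdBIC, newdata = test)            # Model 1: numeric predictors
pred2 <- predict(fwdBIC2, newdata = test2)          # Model 2: factor predictors
pred3 <- exp(predict(fwdBIC.log, newdata = test2))  # Model 3: undo the log transform
mse(test$Price, pred1)
mse(test2$Price, pred2)
mse(test2$Price, pred3)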
| Model | Test MSE |
|---|---|
| Model 1 | 0.0737980 |
| Model 2 | 0.0792047 |
| Model 3 | 0.0809789 |
| Statistic | Value |
|---|---|
| t-value | -3.2519 |
| df | 317 |
| p-value | 1.27e-03 |
| CI Lower | -0.002847498 |
| CI Upper | -0.00070074 |
| Mean difference | -0.001774119 |
| Alternative hypothesis | two.sided |
We cannot directly compare model (1) to (2) or (3) because it uses a different train and test split, and technically a different set of predictors (since we "factorized" the numeric columns for models 2 and 3). Still, it is worth looking at the MSEs of each model, as shown in the MSE table above. Model (1) shows the lowest MSE, model (2) is in the middle, and model (3) has the highest. We should, however, run a t-test between the errors of models (2) and (3) to check that their difference is statistically significant. Unable to use an F-test due to the transformation, we perform a t-test on the errors, and as shown in the t-test table above, we get a p-value below 0.05, allowing us to reject the null hypothesis and conclude that model (2) produces less error than model (3). Model (1) having a lower MSE than model (2) could be attributed to numerous factors, such as having a different training and test set. It could also be that converting the numeric columns to factors worked in the opposite direction of what we assumed, namely that the relationship between RAM and SSD levels and price is more complex than a single linear slope. Regardless, the original model, model (1), produces the lowest MSE.
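The error comparison can be sketched as a paired t-test on the squared prediction errors (an assumption about the exact test used, building on the predictions computed above).
# Paired t-test: do models (2) and (3) differ in squared prediction error?
err2 <- (test2$Price - pred2)^2
err3 <- (test2$Price - pred3)^2
t.test(err2, err3, paired = TRUE)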
5. Discussion
We originally believed that the laptop's brand would be a significant factor in its price, since the boxplot comparing brands showed a clear distinction between them. But both the forward and backward stepwise searches found the same model, which did not include brand. Considering this, we can see that laptop brand is often correlated with predictors that were included, like OS, CPU brand, and RAM. For example, Razer laptops had the highest average price; Razer laptops are also gaming laptops, and therefore often have very high RAM and expensive CPUs. Thus, a model that already incorporates those predictors would not find laptop brand that helpful. Additionally, some brands have an incredibly large price range, making it difficult to use brand as a predictor.
Circling back to our conversation about converting numeric columns to factors, it does seem like doing so allowed for a more complex relationship with those predictors. The coefficient for RAM in model (1) was 0.0292. Since we did not "factorize" RAM in that model, this means increasing RAM increases the price linearly: 16GB of RAM would have twice the effect of 8GB. But if we look at model (2), we can see that the coefficients for the different RAM levels are not linear. The predictor for 8GB of RAM had a coefficient of 0.6531, while 16GB and 32GB had coefficients of 0.9474 and 1.3252, respectively. This rate of increase is not linear, suggesting a more complex relationship for numeric columns like RAM and SSD, even though model (2) produced a higher MSE than model (1).
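To make the nonlinearity concrete, we can compare the implied effects from the coefficients reported above (a quick arithmetic check, not new model output).
# Model (1): numeric RAM forces a proportional effect
0.0292 * 8   # 8GB  -> 0.234
0.0292 * 16  # 16GB -> 0.467 (exactly double the 8GB effect)
0.0292 * 32  # 32GB -> 0.934
# Model (2): level-specific effects are clearly not proportional
c(Ram8 = 0.6531, Ram16 = 0.9474, Ram32 = 1.3252)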
5.1 Limitations and Future Work
Before we used this dataset on laptop prices, we were using a dataset of HR information related to employee burnout, or "attrition". With predictors like employee age, department, years under current boss, and vacation time, we felt it would be a very interesting case study to see if we could build a good model to determine whether an employee would burn out. What we did not consider soon enough was that our dependent variable, attrition, was binary: the employee was either burnt out or not. Using linear regression to predict a binary outcome is not the correct path, as binary data violates many of the key assumptions needed for linear regression to work, such as constant error variance and normality of errors. We saw this reflected in our analysis: no matter what transformations or tools we attempted, our model did incredibly poorly at predicting burnout and produced residuals-vs-fitted plots with clear issues. It was then that we decided to reevaluate, and realized we had chosen the wrong dataset for a linear regression case study.
6. References
Data for this case study was pulled from Kaggle.