UNIVERSITY OF CINCINNATI
LINDNER COLLEGE OF BUSINESS

TEAM: Eli Bales, Vivian Comer, Andrew McCurrach, Devin Walker, Kazuhide Watanabe

LAPTOP PRICE PREDICTION: IDENTIFYING THE KEY DRIVERS OF MARKET VALUE


A statistical modeling analysis evaluating how technical specifications and brand factors influence laptop pricing.




1. Introduction


For this case study, our group built multiple linear regression models to identify the driving factors of laptop price across multiple brands. Our dataset, which comes from Kaggle, contains information on 1273 laptops. It includes each laptop’s brand (such as ASUS, Apple, or Xiaomi) and technical specifications such as RAM, SSD size, CPU type, and GPU. These are the predictor variables we use to predict our response variable, price.

Using both forward and backward stepwise regression, we developed and compared many models and ultimately narrowed our selection to three, whose performance we compare against one another. The three models were quite similar, and we found that the main drivers of laptop price were technical specifications: CPU brand, total RAM, SSD size, and screen resolution, as well as operating system (OS) and laptop “type” (Notebook, Gaming, 2 in 1, etc.).

In this case study, we will walk through our initial EDA, how we developed our models, and which model produces the lowest error when predicting price against test data.

2. Data Description


2.1 Data Source and Structure

We first load the libraries needed for our analysis in R and read in our data from the CSV file.

library(tidyverse)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(knitr)
library(patchwork)
library(GGally)
library(broom)
df <- read_csv('/Users/kazuhidewatanabe/Desktop/BANA_7052/laptop_data_cleaned.csv')


2.2 Data Cleaning

Many of our predictors need to be converted to factors for linear regression to adequately utilize their predictive power. We do so below, converting laptop type, company name, CPU brand, touch screen availability, GPU brand, and OS all to factors.

df$TypeName <- as.factor(df$TypeName)
df$Company <- as.factor(df$Company)
df$Cpu_brand <- as.factor(df$Cpu_brand)
df$TouchScreen <- as.factor(df$TouchScreen)
df$Gpu_brand <- as.factor(df$Gpu_brand)
df$Os <- as.factor(df$Os)
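
The six conversions above can also be written as a single step. The following is a minimal sketch in base R that uses a small made-up data frame (`toy`) in place of the real data; only the column names are taken from the code above, and the values are illustrative.

```r
# Toy stand-in for the laptop data (values are made up for illustration).
toy <- data.frame(
  TypeName    = c("Notebook", "Gaming", "Notebook"),
  Company     = c("Dell", "Asus", "HP"),
  Cpu_brand   = c("Intel Core i5", "Intel Core i7", "Intel Core i3"),
  TouchScreen = c(0, 0, 1),
  Gpu_brand   = c("Intel", "Nvidia", "Intel"),
  Os          = c("Windows", "Windows", "Others"),
  Ram         = c(8, 16, 4),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical.
factor_cols <- c("TypeName", "Company", "Cpu_brand",
                 "TouchScreen", "Gpu_brand", "Os")

# Convert all of them at once; numeric columns such as Ram are left untouched.
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Keeping the column names in one vector makes it easy to see, and later change, which predictors are treated as categorical.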


2.3 Data Visualization

For EDA, we use boxplots and scatter plots to check whether any correlations between the predictors and price are visible at the outset.


2.4 Preliminary Statistical Insights

Many of these plots show a clear relationship between a predictor variable and the price of the laptop. For example, the first boxplot plots the brand of the laptop against the price. This boxplot tells us two things. First, a laptop’s brand clearly influences its price. Second, some brands, such as Lenovo, Dell, HP, and ASUS, span a wide range of prices, while others, like Vero, Xiaomi, and Samsung, span a very narrow range. Other predictors, such as GPU brand, operating system, total RAM, and total storage size, also seem to have an effect on the price.

From this visual analysis, it is clear that multiple predictors may influence laptop price, which makes this dataset a good fit for a linear regression case study.
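
The brand comparison read off the boxplot can also be checked numerically. The sketch below is a hypothetical illustration in base R: the brands `A`, `B`, `C` and their prices are made up, and the summary (median and max minus min per brand) is one possible way to quantify "wide" versus "narrow" price ranges, not the method used in this study.

```r
# Made-up prices for three brands (illustrative values only).
toy <- data.frame(
  Company = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  Price   = c(9.8, 10.9, 11.6, 10.4, 10.5, 10.6, 10.0, 11.0, 12.1)
)

# Median and spread (max - min) of price within each brand.
brand_median <- tapply(toy$Price, toy$Company, median)
brand_spread <- tapply(toy$Price, toy$Company, function(p) diff(range(p)))

cbind(median = brand_median, spread = brand_spread)
```

A brand with a large spread relative to the differences between brand medians is one whose label alone says little about price.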


3. Method


3.1 Statistical Models and Methods Chosen

Based on the exploratory analysis and the preliminary regression diagnostics, multiple linear regression was selected as the primary modeling technique to quantify the relationship between laptop characteristics and pricing. Categorical variables were coded as factors, and the EDA suggested roughly linear relationships for the main numeric predictors. The dataset was randomly split into a training set (75%) and a test set (25%), with all model selection and fitting done on the training data and predictive performance evaluated on the test data.

# Split the data into testing/training samples.
set.seed(2)

training.samples <- sample.int(nrow(df), 0.75*nrow(df)) 
train <- df[training.samples,]
test <- df[-training.samples,]
n <- nrow(train)
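
A quick sanity check on a split like this is to confirm that the two pieces share no rows and together cover the whole dataset. The following is a minimal sketch on a made-up data frame with the same number of rows as reported in the Introduction; the `id` column and the explicit `floor()` are assumptions added for the illustration and are not part of the code above.

```r
set.seed(2)

# Made-up data frame standing in for df.
toy <- data.frame(id = 1:1273, Price = rnorm(1273))

# 75% of the rows, rounded down to a whole number.
n_train   <- floor(0.75 * nrow(toy))
train_idx <- sample.int(nrow(toy), n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two sets should not share rows and should add up to the full data.
stopifnot(length(intersect(toy_train$id, toy_test$id)) == 0)
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))

c(train = nrow(toy_train), test = nrow(toy_test))
```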


3.2 Model Selection and Identification of the “Best” Model

Because the dataset has many possible predictor variables and combinations, we used stepwise selection to narrow it down to a smaller model that still makes sense statistically. Two reference models were specified:

# Base & Full 
base <- lm(Price ~ 1, data=train)
full <- lm(Price ~ ., data=train)

# Forward Selection
fwdBIC <- step(base, scope = formula(full), direction="forward", k=log(n))

# Backwards Selection
bwdBIC <- step(full, direction="backward", k=log(n))

Table 1: Coefficient Estimates for Final Forward BIC Model

| Term | Estimate | Std. error | t statistic | p-value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 9.9265 | 0.1209 | 82.1317 | 0.00e+00 | *** |
| Cpu_brandIntel Core i3 | 0.1703 | 0.0499 | 3.4143 | 6.67e-04 | *** |
| Cpu_brandIntel Core i5 | 0.4885 | 0.0446 | 10.9594 | 2.19e-26 | *** |
| Cpu_brandIntel Core i7 | 0.5643 | 0.0465 | 12.1340 | 1.40e-31 | *** |
| Cpu_brandOther Intel Processor | -0.1538 | 0.0509 | -3.0229 | 2.57e-03 | ** |
| Ram | 0.0292 | 0.0026 | 11.0307 | 1.09e-26 | *** |
| TypeNameNotebook | -0.0211 | 0.0713 | -0.2964 | 7.67e-01 | |
| TypeName2 in 1 Convertible | 0.1226 | 0.0764 | 1.6059 | 1.09e-01 | |
| TypeNameGaming | 0.2743 | 0.0772 | 3.5527 | 4.00e-04 | *** |
| TypeNameUltrabook | 0.2163 | 0.0762 | 2.8389 | 4.62e-03 | ** |
| TypeNameWorkstation | 0.7031 | 0.0932 | 7.5412 | 1.10e-13 | *** |
| SSD | 0.0006 | 0.0001 | 8.8807 | 3.32e-18 | *** |
| OsOthers | -0.4538 | 0.0810 | -5.6038 | 2.76e-08 | *** |
| OsWindows | -0.2196 | 0.0767 | -2.8630 | 4.29e-03 | ** |
| Ppi | 0.0020 | 0.0003 | 7.6650 | 4.46e-14 | *** |

Table 2: Model Fit Statistics for Forward BIC Model

| Adjusted \(R^2\) | Residual std. error | F statistic | p-value |
|---|---|---|---|
| 0.8111 | 0.2732 | 293.1938 | < 0.001 |

Table 3: Coefficient Estimates for Backward BIC Model

| Term | Estimate | Std. error | t statistic | p-value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 9.9265 | 0.1209 | 82.1317 | 0.00e+00 | *** |
| TypeNameNotebook | -0.0211 | 0.0713 | -0.2964 | 7.67e-01 | |
| TypeName2 in 1 Convertible | 0.1226 | 0.0764 | 1.6059 | 1.09e-01 | |
| TypeNameGaming | 0.2743 | 0.0772 | 3.5527 | 4.00e-04 | *** |
| TypeNameUltrabook | 0.2163 | 0.0762 | 2.8389 | 4.62e-03 | ** |
| TypeNameWorkstation | 0.7031 | 0.0932 | 7.5412 | 1.10e-13 | *** |
| Ram | 0.0292 | 0.0026 | 11.0307 | 1.09e-26 | *** |
| Ppi | 0.0020 | 0.0003 | 7.6650 | 4.46e-14 | *** |
| Cpu_brandIntel Core i3 | 0.1703 | 0.0499 | 3.4143 | 6.67e-04 | *** |
| Cpu_brandIntel Core i5 | 0.4885 | 0.0446 | 10.9594 | 2.19e-26 | *** |
| Cpu_brandIntel Core i7 | 0.5643 | 0.0465 | 12.1340 | 1.40e-31 | *** |
| Cpu_brandOther Intel Processor | -0.1538 | 0.0509 | -3.0229 | 2.57e-03 | ** |
| SSD | 0.0006 | 0.0001 | 8.8807 | 3.32e-18 | *** |
| OsOthers | -0.4538 | 0.0810 | -5.6038 | 2.76e-08 | *** |
| OsWindows | -0.2196 | 0.0767 | -2.8630 | 4.29e-03 | ** |

Table 4: Model Fit Statistics for Backward BIC Model

| Adjusted \(R^2\) | Residual std. error | F statistic | p-value |
|---|---|---|---|
| 0.8111 | 0.2732 | 293.1938 | < 0.001 |

On the training data, we defined a base model with only an intercept (Price ~ 1) and a full model including all predictors (Price ~ .). We then applied stepwise selection using BIC, running both forward selection (from the base model) and backward elimination (from the full model). For the original dataset, both directions converged to the same model:

\[ \text{Price} \sim \text{Cpu_brand} + \text{Ram} + \text{TypeName} + \text{SSD} + \text{Os} + \text{Ppi} \]

with an adjusted \(R^2\) of 0.8111.
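
To make the role of `k = log(n)` concrete, the sketch below repeats the forward and backward procedure on `mtcars`, a dataset that ships with R and stands in here for the laptop training data. The object names (`base_cars`, `full_cars`, and so on) are hypothetical. By default `step()` penalises each parameter by `k = 2` (AIC); setting `k = log(n)` gives the BIC penalty instead.

```r
# mtcars ships with R; it stands in for the laptop training data here.
n_cars <- nrow(mtcars)

base_cars <- lm(mpg ~ 1, data = mtcars)   # intercept only
full_cars <- lm(mpg ~ ., data = mtcars)   # all predictors

# k = log(n) turns the AIC penalty used by step() into the BIC penalty.
fwd_cars <- step(base_cars, scope = formula(full_cars),
                 direction = "forward", k = log(n_cars), trace = 0)
bwd_cars <- step(full_cars, direction = "backward",
                 k = log(n_cars), trace = 0)

# Compare the selected formulas and their BIC-penalised criterion values.
formula(fwd_cars)
formula(bwd_cars)
extractAIC(fwd_cars, k = log(n_cars))
extractAIC(bwd_cars, k = log(n_cars))
```

Forward and backward searches need not arrive at the same formula in general, so agreement between them, as observed on the laptop data, is worth reporting.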


3.2.2 Model Selection 2: Converting Numeric Columns to Factors

We observed that several numeric predictors (e.g., RAM and SSD capacity) did not behave as continuous linear variables. For example, RAM takes on distinct upgrade levels (4GB, 8GB, 16GB, 32GB, etc.) rather than varying smoothly across the number line. Because price increases are likely associated with these upgrade tiers rather than a unit-by-unit linear change, treating RAM as a numeric variable would impose an assumption of linearity that may not reflect market behavior.

To allow the regression model to estimate separate effects for each tier, RAM was converted from a numeric to a factor variable, as were HDD, SSD, Ips, and Ppi in the code below. This enables the model to learn level-specific price differences (e.g., the price jump from 8GB → 16GB is not assumed to be the same as 16GB → 32GB).

df2 <- read_csv('/Users/kazuhidewatanabe/Desktop/BANA_7052/laptop_data_cleaned.csv')

# Convert the categorical predictors, plus the tiered numeric ones, to factors.
df2$TypeName <- as.factor(df2$TypeName)
df2$Company <- as.factor(df2$Company)
df2$Cpu_brand <- as.factor(df2$Cpu_brand)
df2$Ram <- as.factor(df2$Ram)
df2$TouchScreen <- as.factor(df2$TouchScreen)
df2$HDD <- as.factor(df2$HDD)
df2$SSD <- as.factor(df2$SSD)
df2$Gpu_brand <- as.factor(df2$Gpu_brand)
df2$Os <- as.factor(df2$Os)
df2$Ips <- as.factor(df2$Ips)
# Round Ppi to whole numbers first, so the factor is built from the rounded values.
df2$Ppi <- as.factor(round(df2$Ppi, 0))

# Split the data into testing/training samples, reusing the row indices
# (training.samples) drawn in Section 3.1.
train2 <- df2[training.samples,]
test2 <- df2[-training.samples,]
n2 <- nrow(train2)

test2$SSD <- factor(test2$SSD, levels = levels(train2$SSD))

base2 <- lm(Price ~ 1, data=train2)
full2 <- lm(Price ~ ., data=train2)

fwdBIC2 <- step(base2, scope = formula(full2), direction="forward", k = log(n2))
bwdBIC2 <- step(full2, direction="backward", k = log(n2))

Table 5: Coefficient Estimates for Final Forward BIC Model

| Term | Estimate | Std. error | t statistic | p-value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 9.8769 | 0.1360 | 72.6497 | 0.00e+00 | *** |
| Ram4 | 0.4178 | 0.0810 | 5.1608 | 3.01e-07 | *** |
| Ram6 | 0.4054 | 0.0960 | 4.2244 | 2.63e-05 | *** |
| Ram8 | 0.6531 | 0.0848 | 7.7039 | 3.40e-14 | *** |
| Ram12 | 0.7073 | 0.1026 | 6.8917 | 1.02e-11 | *** |
| Ram16 | 0.9474 | 0.0904 | 10.4778 | 2.39e-24 | *** |
| Ram24 | 1.2121 | 0.2080 | 5.8273 | 7.78e-09 | *** |
| Ram32 | 1.3252 | 0.1245 | 10.6457 | 4.85e-25 | *** |
| Ram64 | 1.2716 | 0.2912 | 4.3664 | 1.41e-05 | *** |
| Cpu_brandIntel Core i3 | 0.1506 | 0.0491 | 3.0666 | 2.23e-03 | ** |
| Cpu_brandIntel Core i5 | 0.4179 | 0.0447 | 9.3488 | 6.51e-20 | *** |
| Cpu_brandIntel Core i7 | 0.4869 | 0.0467 | 10.4150 | 4.32e-24 | *** |
| Cpu_brandOther Intel Processor | -0.0967 | 0.0513 | -1.8844 | 5.98e-02 | . |
| TypeNameNotebook | -0.0538 | 0.0691 | -0.7791 | 4.36e-01 | |
| TypeName2 in 1 Convertible | 0.1292 | 0.0743 | 1.7384 | 8.25e-02 | . |
| TypeNameGaming | 0.1657 | 0.0750 | 2.2087 | 2.74e-02 | * |
| TypeNameUltrabook | 0.1949 | 0.0739 | 2.6358 | 8.54e-03 | ** |
| TypeNameWorkstation | 0.6428 | 0.0906 | 7.0954 | 2.57e-12 | *** |
| SSD16 | -0.1735 | 0.1568 | -1.1065 | 2.69e-01 | |
| SSD32 | -0.2661 | 0.1353 | -1.9661 | 4.96e-02 | * |
| SSD64 | -0.3980 | 0.2660 | -1.4959 | 1.35e-01 | |
| SSD128 | 0.1724 | 0.0301 | 5.7252 | 1.40e-08 | *** |
| SSD180 | 0.4057 | 0.1329 | 3.0524 | 2.34e-03 | ** |
| SSD240 | 1.6263 | 0.2668 | 6.0950 | 1.61e-09 | *** |
| SSD256 | 0.2407 | 0.0252 | 9.5702 | 9.45e-21 | *** |
| SSD512 | 0.3448 | 0.0413 | 8.3566 | 2.36e-16 | *** |
| SSD768 | 0.0919 | 0.2667 | 0.3447 | 7.30e-01 | |
| SSD1000 | 0.6438 | 0.0861 | 7.4743 | 1.80e-13 | *** |
| SSD1024 | 0.3818 | 0.2657 | 1.4369 | 1.51e-01 | |
| OsOthers | -0.4687 | 0.0788 | -5.9497 | 3.81e-09 | *** |
| OsWindows | -0.2398 | 0.0747 | -3.2114 | 1.37e-03 | ** |
| Ips1 | 0.0581 | 0.0206 | 2.8170 | 4.95e-03 | ** |

Table 6: Model Fit Statistics for Forward BIC Model

| Adjusted \(R^2\) | Residual std. error | F statistic | p-value |
|---|---|---|---|
| 0.8255 | 0.2625 | 146.4525 | < 0.001 |

Table 7: Coefficient Estimates for Backward BIC Model

| Term | Estimate | Std. error | t statistic | p-value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 9.8769 | 0.1360 | 72.6497 | 0.00e+00 | *** |
| TypeNameNotebook | -0.0538 | 0.0691 | -0.7791 | 4.36e-01 | |
| TypeName2 in 1 Convertible | 0.1292 | 0.0743 | 1.7384 | 8.25e-02 | . |
| TypeNameGaming | 0.1657 | 0.0750 | 2.2087 | 2.74e-02 | * |
| TypeNameUltrabook | 0.1949 | 0.0739 | 2.6358 | 8.54e-03 | ** |
| TypeNameWorkstation | 0.6428 | 0.0906 | 7.0954 | 2.57e-12 | *** |
| Ram4 | 0.4178 | 0.0810 | 5.1608 | 3.01e-07 | *** |
| Ram6 | 0.4054 | 0.0960 | 4.2244 | 2.63e-05 | *** |
| Ram8 | 0.6531 | 0.0848 | 7.7039 | 3.40e-14 | *** |
| Ram12 | 0.7073 | 0.1026 | 6.8917 | 1.02e-11 | *** |
| Ram16 | 0.9474 | 0.0904 | 10.4778 | 2.39e-24 | *** |
| Ram24 | 1.2121 | 0.2080 | 5.8273 | 7.78e-09 | *** |
| Ram32 | 1.3252 | 0.1245 | 10.6457 | 4.85e-25 | *** |
| Ram64 | 1.2716 | 0.2912 | 4.3664 | 1.41e-05 | *** |
| Ips1 | 0.0581 | 0.0206 | 2.8170 | 4.95e-03 | ** |
| Cpu_brandIntel Core i3 | 0.1506 | 0.0491 | 3.0666 | 2.23e-03 | ** |
| Cpu_brandIntel Core i5 | 0.4179 | 0.0447 | 9.3488 | 6.51e-20 | *** |
| Cpu_brandIntel Core i7 | 0.4869 | 0.0467 | 10.4150 | 4.32e-24 | *** |
| Cpu_brandOther Intel Processor | -0.0967 | 0.0513 | -1.8844 | 5.98e-02 | . |
| SSD16 | -0.1735 | 0.1568 | -1.1065 | 2.69e-01 | |
| SSD32 | -0.2661 | 0.1353 | -1.9661 | 4.96e-02 | * |
| SSD64 | -0.3980 | 0.2660 | -1.4959 | 1.35e-01 | |
| SSD128 | 0.1724 | 0.0301 | 5.7252 | 1.40e-08 | *** |
| SSD180 | 0.4057 | 0.1329 | 3.0524 | 2.34e-03 | ** |
| SSD240 | 1.6263 | 0.2668 | 6.0950 | 1.61e-09 | *** |
| SSD256 | 0.2407 | 0.0252 | 9.5702 | 9.45e-21 | *** |
| SSD512 | 0.3448 | 0.0413 | 8.3566 | 2.36e-16 | *** |
| SSD768 | 0.0919 | 0.2667 | 0.3447 | 7.30e-01 | |
| SSD1000 | 0.6438 | 0.0861 | 7.4743 | 1.80e-13 | *** |
| SSD1024 | 0.3818 | 0.2657 | 1.4369 | 1.51e-01 | |
| OsOthers | -0.4687 | 0.0788 | -5.9497 | 3.81e-09 | *** |
| OsWindows | -0.2398 | 0.0747 | -3.2114 | 1.37e-03 | ** |

Table 8: Model Fit Statistics for Backward BIC Model

| Adjusted \(R^2\) | Residual std. error | F statistic | p-value |
|---|---|---|---|
| 0.8255 | 0.2625 | 146.4525 | < 0.001 |

We repeated the same procedure on the factor-converted training data, defining a base model with only an intercept (Price ~ 1) and a full model including all predictors (Price ~ .), and then running BIC-based forward selection and backward elimination. Both directions again converged to the same model:

\[ \text{Price} \sim \text{Cpu_brand} + \text{Ram} + \text{TypeName} + \text{SSD} + \text{Os} + \text{Ips} \]

with an adjusted \(R^2\) of 0.8255.
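
The `factor(test2$SSD, levels = levels(train2$SSD))` step in the code for this section deserves a short illustration. Once SSD is a factor, a test row can carry a level that never appears in the training rows; `predict()` on an `lm` fit stops with an error for such new levels, whereas re-coding the test factor to the training levels turns those entries into `NA`, so the affected rows receive no prediction. The following is a minimal sketch with made-up SSD sizes; the vector names are hypothetical.

```r
# Made-up SSD sizes; 768 appears only in the test vector.
train_ssd <- factor(c(128, 256, 256, 512))
test_ssd  <- factor(c(256, 768, 128))

# Re-code the test factor using the training levels only.
test_ssd_aligned <- factor(test_ssd, levels = levels(train_ssd))

levels(test_ssd_aligned)   # same levels as the training factor
is.na(test_ssd_aligned)    # TRUE for the level that training never saw
```

Any row dropped this way is excluded from the test error, which is one reason a factor-based model and a numeric model may end up being scored on slightly different sets of rows.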



3.2.3 Model Selection 3: Log Transformation

During diagnostic checks of the initial linear regression model, the QQ-plot and residual distribution indicated noticeable right-skewness, suggesting a violation of the normality assumption. Because linear regression assumes normally distributed errors, we applied a natural log transformation to the response variable (Price).

base.log <- lm(log(Price) ~ 1, data = train2)
full.log <- lm(log(Price) ~ ., data = train2)

# train2 is the training data here, so the BIC penalty uses n2 = nrow(train2).
fwdBIC.log <- step(base.log, scope = formula(full.log), direction="forward", k = log(n2))
bwdBIC.log <- step(full.log, direction = "backward", k = log(n2))

Table 9: Coefficient Estimates for Log Forward BIC Model

| Term | Estimate | Std. error | t statistic | p-value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 2.2878 | 0.0125 | 182.5461 | 0.00e+00 | *** |
| Ram4 | 0.0426 | 0.0075 | 5.7147 | 1.48e-08 | *** |
| Ram6 | 0.0422 | 0.0088 | 4.7741 | 2.10e-06 | *** |
| Ram8 | 0.0651 | 0.0078 | 8.3334 | 2.83e-16 | *** |
| Ram12 | 0.0704 | 0.0095 | 7.4377 | 2.34e-13 | *** |
| Ram16 | 0.0912 | 0.0083 | 10.9426 | 2.75e-26 | *** |
| Ram24 | 0.1136 | 0.0192 | 5.9247 | 4.42e-09 | *** |
| Ram32 | 0.1233 | 0.0115 | 10.7481 | 1.81e-25 | *** |
| Ram64 | 0.1172 | 0.0268 | 4.3656 | 1.41e-05 | *** |
| Cpu_brandIntel Core i3 | 0.0148 | 0.0045 | 3.2697 | 1.12e-03 | ** |
| Cpu_brandIntel Core i5 | 0.0404 | 0.0041 | 9.7932 | 1.30e-21 | *** |
| Cpu_brandIntel Core i7 | 0.0464 | 0.0043 | 10.7736 | 1.42e-25 | *** |
| Cpu_brandOther Intel Processor | -0.0105 | 0.0047 | -2.2159 | 2.69e-02 | * |
| TypeNameNotebook | -0.0049 | 0.0064 | -0.7719 | 4.40e-01 | |
| TypeName2 in 1 Convertible | 0.0121 | 0.0069 | 1.7668 | 7.76e-02 | . |
| TypeNameGaming | 0.0152 | 0.0069 | 2.1938 | 2.85e-02 | * |
| TypeNameUltrabook | 0.0179 | 0.0068 | 2.6295 | 8.69e-03 | ** |
| TypeNameWorkstation | 0.0575 | 0.0084 | 6.8834 | 1.08e-11 | *** |
| SSD16 | -0.0196 | 0.0145 | -1.3537 | 1.76e-01 | |
| SSD32 | -0.0259 | 0.0125 | -2.0764 | 3.81e-02 | * |
| SSD64 | -0.0407 | 0.0245 | -1.6614 | 9.70e-02 | . |
| SSD128 | 0.0164 | 0.0028 | 5.8944 | 5.27e-09 | *** |
| SSD180 | 0.0382 | 0.0123 | 3.1194 | 1.87e-03 | ** |
| SSD240 | 0.1485 | 0.0246 | 6.0357 | 2.29e-09 | *** |
| SSD256 | 0.0223 | 0.0023 | 9.6358 | 5.30e-21 | *** |
| SSD512 | 0.0315 | 0.0038 | 8.2935 | 3.87e-16 | *** |
| SSD768 | 0.0093 | 0.0246 | 0.3773 | 7.06e-01 | |
| SSD1000 | 0.0573 | 0.0079 | 7.2180 | 1.10e-12 | *** |
| SSD1024 | 0.0350 | 0.0245 | 1.4283 | 1.54e-01 | |
| OsOthers | -0.0444 | 0.0073 | -6.1086 | 1.48e-09 | *** |
| OsWindows | -0.0227 | 0.0069 | -3.2999 | 1.00e-03 | ** |
| Ips1 | 0.0054 | 0.0019 | 2.8308 | 4.74e-03 | ** |

Table 10: Model Fit Statistics for Log Forward BIC Model

| Adjusted \(R^2\) | Residual std. error | F statistic | p-value |
|---|---|---|---|
| 0.8296 | 0.0242 | 150.6673 | < 0.001 |

Table 11: Coefficient Estimates for Log Backward BIC Model

| Term | Estimate | Std. error | t statistic | p-value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 2.2898 | 0.0126 | 182.3136 | 0.00e+00 | *** |
| TypeNameNotebook | -0.0051 | 0.0064 | -0.8038 | 4.22e-01 | |
| TypeName2 in 1 Convertible | 0.0136 | 0.0069 | 1.9877 | 4.71e-02 | * |
| TypeNameGaming | 0.0160 | 0.0069 | 2.3046 | 2.14e-02 | * |
| TypeNameUltrabook | 0.0182 | 0.0068 | 2.6583 | 7.99e-03 | ** |
| TypeNameWorkstation | 0.0582 | 0.0084 | 6.9507 | 6.87e-12 | *** |
| Ram4 | 0.0434 | 0.0075 | 5.7915 | 9.56e-09 | *** |
| Ram6 | 0.0433 | 0.0089 | 4.8820 | 1.24e-06 | *** |
| Ram8 | 0.0664 | 0.0078 | 8.4760 | 9.14e-17 | *** |
| Ram12 | 0.0718 | 0.0095 | 7.5717 | 8.93e-14 | *** |
| Ram16 | 0.0919 | 0.0084 | 10.9883 | 1.75e-26 | *** |
| Ram24 | 0.1121 | 0.0192 | 5.8246 | 7.90e-09 | *** |
| Ram32 | 0.1246 | 0.0115 | 10.8240 | 8.69e-26 | *** |
| Ram64 | 0.1206 | 0.0269 | 4.4802 | 8.39e-06 | *** |
| Cpu_brandIntel Core i3 | 0.0152 | 0.0045 | 3.3452 | 8.55e-04 | *** |
| Cpu_brandIntel Core i5 | 0.0407 | 0.0041 | 9.8475 | 7.99e-22 | *** |
| Cpu_brandIntel Core i7 | 0.0469 | 0.0043 | 10.8471 | 6.94e-26 | *** |
| Cpu_brandOther Intel Processor | -0.0098 | 0.0047 | -2.0622 | 3.95e-02 | * |
| SSD16 | -0.0198 | 0.0145 | -1.3642 | 1.73e-01 | |
| SSD32 | -0.0230 | 0.0125 | -1.8392 | 6.62e-02 | . |
| SSD64 | -0.0407 | 0.0246 | -1.6530 | 9.87e-02 | . |
| SSD128 | 0.0170 | 0.0028 | 6.1322 | 1.28e-09 | *** |
| SSD180 | 0.0389 | 0.0123 | 3.1600 | 1.63e-03 | ** |
| SSD240 | 0.1468 | 0.0247 | 5.9465 | 3.88e-09 | *** |
| SSD256 | 0.0229 | 0.0023 | 9.8710 | 6.47e-22 | *** |
| SSD512 | 0.0326 | 0.0038 | 8.5903 | 3.66e-17 | *** |
| SSD768 | 0.0084 | 0.0247 | 0.3406 | 7.33e-01 | |
| SSD1000 | 0.0586 | 0.0080 | 7.3580 | 4.12e-13 | *** |
| SSD1024 | 0.0323 | 0.0246 | 1.3141 | 1.89e-01 | |
| OsOthers | -0.0472 | 0.0072 | -6.5352 | 1.05e-10 | *** |
| OsWindows | -0.0253 | 0.0068 | -3.6908 | 2.37e-04 | *** |

Table 12: Model Fit Statistics for Log Backward BIC Model

| Adjusted \(R^2\) | Residual std. error | F statistic | p-value |
|---|---|---|---|
| 0.8283 | 0.0243 | 154.2504 | < 0.001 |



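After transformation, the residual distribution became slightly more symmetric. The residual standard error also fell (≈ 0.27 → 0.024), but the two values are measured on different scales (Price versus log(Price)), so this drop does not by itself indicate a better fit; the adjusted \(R^2\) changed only slightly (0.8255 → 0.8296 for the forward model). The QQ plot shows a very slight improvement that is hardly noticeable.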


3.3 Remedies & Adjustments

  1. Log-transformation of the dependent variable

During the preliminary modeling phase, diagnostic plots for the initial model revealed that the residuals exhibited right-skewness, with heavier upper tails in the QQ-plot and increasing spread at higher fitted values. These patterns suggested a violation of the normality and constant variance assumptions.

To address this, the dependent variable Price was transformed using the natural logarithm.

After transformation, the residual distribution was slightly more symmetric, but the QQ-plot improved only marginally (Section 3.2.3).

  2. Converting selected numeric predictors into categorical factors

Some numeric predictors, mainly RAM and SSD storage size, do not increase continuously in realistic increments; rather, they exist in specific hardware tiers (e.g., 4GB → 8GB → 16GB → 32GB, and 128GB → 256GB → 512GB, etc.). Because the effect on price is likely nonlinear across these jumps, a strictly numeric treatment would incorrectly force the model to learn a linear pricing relationship.

Therefore, RAM and SSD were reclassified as categorical variables (factors) to allow the model to learn distinct price jumps between hardware levels rather than assume a single linear slope.


4. Results


We now have three models that we can (somewhat) compare: (1) the base model, in which numeric variables such as SSD and RAM were not converted to factors; (2) the model in which those numeric columns were converted to factors; and (3) the model in which we log-transformed the response to address the slight right skew seen in our diagnostic plots. We assess each model’s out-of-sample performance by computing its MSE on the test data.
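
The code used to compute the test errors is not shown in this report, so the following is only a sketch of one way to do it, on simulated data with hypothetical names (`sim`, `fit_raw`, `fit_log`, `mse`). It assumes that predictions from a log-response model are back-transformed with `exp()` so that both models are scored on the same scale.

```r
set.seed(1)

# Made-up data: a positive response that grows with one numeric predictor.
sim <- data.frame(x = runif(200, 1, 10))
sim$y <- exp(0.5 + 0.2 * sim$x + rnorm(200, sd = 0.3))

# Hold out a quarter of the rows for testing.
idx       <- sample.int(nrow(sim), 150)
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

fit_raw <- lm(y ~ x, data = sim_train)        # response on its original scale
fit_log <- lm(log(y) ~ x, data = sim_train)   # log-transformed response

# Predictions on the test rows; exp() undoes the log so both are on the y scale.
pred_raw <- predict(fit_raw, newdata = sim_test)
pred_log <- exp(predict(fit_log, newdata = sim_test))

# Mean squared error on the held-out rows.
mse <- function(actual, predicted) mean((actual - predicted)^2)
c(raw = mse(sim_test$y, pred_raw), log = mse(sim_test$y, pred_log))
```

Scoring both models on the same scale is what makes the two MSE values comparable; an MSE computed on log(y) cannot be set side by side with one computed on y.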

Table 13: Model Error Comparisons

| Model | Test MSE |
|---|---|
| Model 1 | 0.0737980 |
| Model 2 | 0.0792047 |
| Model 3 | 0.0809789 |

Table 14: Paired t-test Results

| Statistic | Value |
|---|---|
| t-value | -3.2519 |
| df | 317 |
| p-value | 1.27e-03 |
| CI Lower | -0.002847498 |
| CI Upper | -0.00070074 |
| Mean difference | -0.001774119 |
| Alternative hypothesis | two.sided |


We cannot compare model (1) with models (2) and (3) on strictly equal terms: models (2) and (3) are trained and tested on the re-encoded data, in which the numeric columns were “factorized”, so they technically use a different set of predictors. Even so, the test MSEs in Table 13 are worth examining. Model (1) has the lowest MSE, model (2) is in the middle, and model (3) has the highest. To check whether the gap between models (2) and (3) is statistically significant, we cannot use an F-test, because the log transformation changes the response, so we run a paired t-test on the errors instead. As Table 14 shows, the p-value (0.00127) is below 0.05, so we reject the null hypothesis of no difference and conclude that model (2) produces less error than model (3). The lower MSE of model (1) relative to model (2) could have several causes, including the differently encoded training and test sets. It could also be that converting the numeric columns to factors did not help in the way we expected: we had assumed that the relationship between price and the RAM and SSD levels was more complex than a single linear slope. Regardless, the original model, model (1), produces the lowest MSE.
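
The mean difference in Table 14 (−0.001774) matches the gap between the model (2) and model (3) MSEs in Table 13 (0.0792047 − 0.0809789), which is what a paired test on per-laptop squared errors would produce. The sketch below shows that kind of test on made-up squared errors; the vectors `sq_err_a` and `sq_err_b` are hypothetical, and the number of pairs (318) is chosen only to reproduce the 317 degrees of freedom in Table 14.

```r
set.seed(1)

# Made-up squared test errors for two models scored on the same 318 rows.
sq_err_a <- rchisq(318, df = 1) * 0.08
sq_err_b <- sq_err_a * runif(318, 0.8, 1.3)

# Paired test: is the mean per-row difference in squared error zero?
res <- t.test(sq_err_a, sq_err_b, paired = TRUE)

res$parameter   # degrees of freedom = number of pairs - 1
res$estimate    # mean difference, equal to MSE(a) - MSE(b)
res$p.value
res$conf.int
```

Pairing matters because both models are scored on the same laptops: a laptop that is hard to price for one model tends to be hard for the other, and the paired test removes that shared variation.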


5. Discussion


We originally believed that a laptop’s brand would be a significant factor in its price, because the boxplot comparing brands showed clear differences between them. However, both the forward and backward stepwise procedures selected the same model, and it did not include brand. One explanation is that brand is correlated with other predictors that were included, such as OS, CPU brand, and RAM. For example, Razer laptops had the highest average price. Razer laptops are also gaming laptops, and therefore often have large amounts of RAM and expensive CPUs. A model that already includes those predictors gains little from adding brand. Additionally, some brands span a very wide price range, which makes brand difficult to use as a predictor.

Returning to the conversion of numeric columns to factors, doing so does appear to capture a more complex relationship with those predictors. The coefficient for RAM in model (1) was 0.0292. Because RAM was not “factorized” in this model, each additional GB of RAM raises the predicted price by the same amount, so 16GB of RAM has twice the effect of 8GB. In model (2), by contrast, the coefficients for the RAM levels are not proportional to the amount of RAM: the 8GB level has a coefficient of 0.6531, while the 16GB and 32GB levels have coefficients of 0.9474 and 1.3252, respectively (Table 5). This rate of increase is not linear, which suggests a more complex relationship for numeric columns like RAM and SSD, even though model (2) produced a higher MSE than model (1).
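
The contrast between the two treatments of RAM can be worked out directly from the reported coefficients. The short calculation below copies the RAM slope from Table 1 and the RAM level coefficients from Table 5; the variable names are chosen for this illustration only.

```r
# Coefficients copied from Table 1 (numeric Ram) and Table 5 (factor Ram).
slope_numeric <- 0.0292
ram_factor    <- c("8" = 0.6531, "16" = 0.9474, "32" = 1.3252)

# Model (1): the change depends only on how many GB are added.
linear_step <- c("8->16"  = (16 - 8) * slope_numeric,
                 "16->32" = (32 - 16) * slope_numeric)

# Model (2): the change is the difference between two tier coefficients.
factor_step <- c("8->16"  = unname(ram_factor["16"] - ram_factor["8"]),
                 "16->32" = unname(ram_factor["32"] - ram_factor["16"]))

rbind(linear = linear_step, factor = factor_step)
```

Under model (1) the 16GB → 32GB step (0.4672) is exactly twice the 8GB → 16GB step (0.2336). Under model (2) the same two steps are 0.2943 and 0.3778, so the second is only about 1.3 times the first.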


5.1 Limitations and Future Work

Before settling on the laptop price dataset, we worked with an HR dataset on employee burnout, or “attrition”. With predictors such as employee age, department, years under current boss, and vacation time, we felt it would make an interesting case study to see whether we could build a good model for predicting whether an employee would burn out. What we did not consider soon enough was that our dependent variable, attrition, was binary: the employee was either burnt out or not. Linear regression is not the right tool for predicting a binary outcome, because binary data violate key assumptions of linear regression, such as constant error variance and normality of errors. We saw this in our analysis: no matter which transformations or tools we tried, the model predicted burnout very poorly and produced residuals-vs-fitted plots with clear problems. At that point we reevaluated our data and realized we had chosen the wrong dataset for a linear regression case study.


6. Reference


Data for this case study was pulled from Kaggle. It can be found here.