The aim of this analysis is to examine the relationships between various attributes of a laptop and its price, in order to identify key factors that can predict a laptop's price and flag whether a new listing is overpriced. The results should be useful to anyone shopping for a new laptop and deciding whether an asking price is worth paying.
This dataset comes from Kaggle, a free web-based platform that data scientists and statisticians use to share both ideas and datasets. The link to the Laptop Prices dataset is below:
This dataset was pulled from another dataset with fewer variables. The link to the Original Dataset is below:
I am using the Tuesday, September 10, 2024 update of the Laptop Prices dataset. Link to this version of the dataset is below:
R and R Markdown were used in this project because they are free and open-source and can be extended with libraries and features that other statistical software, such as SAS, does not offer. R provides an easy-to-use and comprehensive toolset for statistical analyses and tests. While the more advanced analyses are not used in this project, further work could apply them to gain additional insight into the major factors behind laptop prices.
This analysis seeks to answer the questions:
Here is what the author on Kaggle had to say about the Laptop Prices dataset: “The original dataset was pretty compact with a lot of details in each column. The columns mostly consisted of long strings of data, which was pretty human-readable and concise but for Machine Learning algorithms to work more efficiently it’s better to separate the different details into their own columns. After doing so, 28 duplicate rows were exposed and removed with this dataset being the final result.”
A detailed description of the variables within the dataset is given below:
Company: Laptop manufacturer [categorical]
Product: Brand and model [categorical]
TypeName: Laptop type (Notebook, Ultrabook, Gaming, etc.) [categorical]
Inches: Screen size in inches [numerical]
Ram: Total amount of RAM in the laptop (GB) [numerical]
OS: Operating system installed [categorical]
Weight: Laptop weight in kilograms [numerical]
Price_euros: Price of the laptop in euros (target) [numerical]
Screen: Screen definition (Standard, Full HD, 4K Ultra HD, Quad HD+) [categorical]
ScreenW: Screen width (pixels) [numerical]
ScreenH: Screen height (pixels) [numerical]
Touchscreen: Whether or not the laptop has a touchscreen [categorical]
IPSpanel: Whether or not the laptop has an IPS panel [categorical]
RetinaDisplay: Whether or not the laptop has a Retina display [categorical]
CPU_company [categorical]
CPU_freq: Frequency of the laptop CPU (GHz) [numerical]
CPU_model [categorical]
PrimaryStorage: Primary storage space (GB) [numerical]
PrimaryStorageType: Primary storage type (HDD, SSD, Flash Storage, Hybrid) [categorical]
SecondaryStorage: Secondary storage space, if any (GB) [numerical]
SecondaryStorageType: Secondary storage type (HDD, SSD, Hybrid, None) [categorical]
GPU_company [categorical]
GPU_model [categorical]
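# Packages used by the code in this analysis
library(MASS)     # boxcox()
library(dplyr)    # %>%, group_by(), summarise(), mutate()
library(ggplot2)  # bar charts and pie charts
library(caTools)  # sample.split()
library(knitr)    # kable()
library(pROC)     # roc(), auc()
laptop.prices <- read.csv("https://raw.githubusercontent.com/EPKeep32/STA551/refs/heads/main/laptop_prices.csv")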
summary(laptop.prices)
Company Product TypeName Inches
Length:1275 Length:1275 Length:1275 Min. :10.10
Class :character Class :character Class :character 1st Qu.:14.00
Mode :character Mode :character Mode :character Median :15.60
Mean :15.02
3rd Qu.:15.60
Max. :18.40
Ram OS Weight Price_euros
Min. : 2.000 Length:1275 Min. :0.690 Min. : 174
1st Qu.: 4.000 Class :character 1st Qu.:1.500 1st Qu.: 609
Median : 8.000 Mode :character Median :2.040 Median : 989
Mean : 8.441 Mean :2.041 Mean :1135
3rd Qu.: 8.000 3rd Qu.:2.310 3rd Qu.:1496
Max. :64.000 Max. :4.700 Max. :6099
Screen ScreenW ScreenH Touchscreen
Length:1275 Min. :1366 Min. : 768 Length:1275
Class :character 1st Qu.:1920 1st Qu.:1080 Class :character
Mode :character Median :1920 Median :1080 Mode :character
Mean :1900 Mean :1074
3rd Qu.:1920 3rd Qu.:1080
Max. :3840 Max. :2160
IPSpanel RetinaDisplay CPU_company CPU_freq
Length:1275 Length:1275 Length:1275 Min. :0.900
Class :character Class :character Class :character 1st Qu.:2.000
Mode :character Mode :character Mode :character Median :2.500
Mean :2.303
3rd Qu.:2.700
Max. :3.600
CPU_model PrimaryStorage SecondaryStorage PrimaryStorageType
Length:1275 Min. : 8.0 Min. : 0.0 Length:1275
Class :character 1st Qu.: 256.0 1st Qu.: 0.0 Class :character
Mode :character Median : 256.0 Median : 0.0 Mode :character
Mean : 444.5 Mean : 176.1
3rd Qu.: 512.0 3rd Qu.: 0.0
Max. :2048.0 Max. :2048.0
SecondaryStorageType GPU_company GPU_model
Length:1275 Length:1275 Length:1275
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
missing_values <- is.na(laptop.prices)
summary(missing_values)
Company Product TypeName Inches
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:1275 FALSE:1275 FALSE:1275 FALSE:1275
Ram OS Weight Price_euros
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:1275 FALSE:1275 FALSE:1275 FALSE:1275
Screen ScreenW ScreenH Touchscreen
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:1275 FALSE:1275 FALSE:1275 FALSE:1275
IPSpanel RetinaDisplay CPU_company CPU_freq
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:1275 FALSE:1275 FALSE:1275 FALSE:1275
CPU_model PrimaryStorage SecondaryStorage PrimaryStorageType
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:1275 FALSE:1275 FALSE:1275 FALSE:1275
SecondaryStorageType GPU_company GPU_model
Mode :logical Mode :logical Mode :logical
FALSE:1275 FALSE:1275 FALSE:1275
missing_rows <- laptop.prices[!complete.cases(laptop.prices), ]
print(missing_rows)
[1] Company Product TypeName
[4] Inches Ram OS
[7] Weight Price_euros Screen
[10] ScreenW ScreenH Touchscreen
[13] IPSpanel RetinaDisplay CPU_company
[16] CPU_freq CPU_model PrimaryStorage
[19] SecondaryStorage PrimaryStorageType SecondaryStorageType
[22] GPU_company GPU_model
<0 rows> (or 0-length row.names)
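The summaries above show that none of the variables have missing values: every column of summary(missing_values) contains only FALSE, and complete.cases() finds no incomplete rows. If a variable had missing values, its column in summary(missing_values) would include a TRUE: n row, with n representing the number of missing values.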
Since the dataset was cleaned beforehand and prepared for testing machine learning algorithms, it is no surprise that none of the variables have any missing values.
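A more compact way to run the same check is to count missing values per column. The sketch below uses a small made-up data frame (`toy`, not the laptop data) so that it runs on its own; the column values are purely illustrative.

```r
# Toy data frame standing in for laptop.prices (all values are made up)
toy <- data.frame(
  Ram    = c(4, 8, NA, 16),
  Weight = c(1.2, NA, 2.0, 2.5),
  OS     = c("Windows 10", "macOS", "Linux", NA)
)

# Number of missing values in each column
na.counts <- colSums(is.na(toy))
print(na.counts)

# Rows that contain at least one missing value
incomplete.rows <- toy[!complete.cases(toy), ]
print(nrow(incomplete.rows))
```

Applied to a data frame with no missing values, `colSums(is.na(...))` returns all zeros and the set of incomplete rows is empty.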
laptop.prices$Price_USD = round(laptop.prices$Price_euros * 1.09, 2)
Price_USD was added by converting the
Price_euros to USD as of the October 9th, 9:29 PM exchange
rate of 1 Euro to $1.09 USD. This was done to allow for easier
comparisons for those in the United States market.
This finalizes the dataset with 1275 observations and 24 variables. Not all 24 variables will be used in the analysis, but they are kept in the dataset for any further research.
Next, I will perform some Exploratory Data Analysis, or EDA. EDA is an important step in the overall data analysis process because it familiarizes the analyst with the dataset before any modeling is done.
Below is a histogram of Price_USD
hist(laptop.prices$Price_USD,
breaks = seq(0, 8000, 500),
main = "Distribution of USD Price",
xlim = c(0, 8000),
xlab = "Laptop Price (USD)",
ylim = c(0, 450),
ylab = "Count of Laptops",
col = "lightgreen",
labels = TRUE)
abline(h = seq(50, 400, 50), col = "gray", lty = "dotted")
The histogram above shows that Price_USD has a strongly right-skewed distribution. About 75% of the laptops in the dataset are priced below roughly $1,630 USD (the third quartile of Price_euros, 1496, multiplied by 1.09), and nearly 100% of the laptops are below $4000. A few outliers may be present, including one over $6000.
Below is a histogram of Inches
hist(laptop.prices$Inches,
breaks = seq(10, 20, 1),
main = "Distribution of Inches",
xlim = c(10, 20),
xlab = "Screen Size (in)",
ylim = c(0, 800),
ylab = "Count of Laptops",
col = "coral",
labels = TRUE)
abline(h = seq(100, 800, 100), col = "gray", lty = "dotted")
The histogram above shows that Inches has an irregular, multi-peaked distribution. Most laptops fall into a few bins (13 to 14 inches, 15 to 16 inches, and 17 to 18 inches) rather than being spread evenly across screen sizes.
Implications for Feature Engineering:
To smooth out this irregular distribution, consider rounding Inches up into coarser groups, for example to the nearest even number. I changed the bin width in the histogram below to reflect what this change would look like:
hist(laptop.prices$Inches,
breaks = seq(10, 20, 2),
main = "Distribution of Inches",
xlim = c(10, 20),
xlab = "Screen Size (in)",
ylim = c(0, 800),
ylab = "Count of Laptops",
col = "coral",
labels = TRUE)
abline(h = seq(100, 800, 100), col = "gray", lty = "dotted")
As shown above, with the wider bins the histogram has a single peak and looks much closer to a bell shape.
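The grouping suggested above could be implemented in more than one way. The sketch below shows two options on a made-up vector of screen sizes (`inches` is illustrative, not taken from the dataset): rounding up to the next whole inch, and cutting into the same 2-inch bins as the second histogram.

```r
# Illustrative screen sizes (made up for this example)
inches <- c(11.6, 13.3, 14.0, 15.6, 17.3)

# Option 1: round up to the next whole inch
inches.ceiling <- ceiling(inches)

# Option 2: 2-inch bins matching the histogram breaks seq(10, 20, 2);
# intervals are closed on the right, the same default that hist() uses
inches.bin <- cut(inches, breaks = seq(10, 20, 2))

print(data.frame(inches, inches.ceiling, inches.bin))
```

Option 2 yields a factor, which a model would treat as categorical; whether that is preferable to a rounded numeric value depends on the model being built.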
Below is a histogram of Ram
hist(laptop.prices$Ram,
breaks = seq(0, 70, 5),
main = "Distribution of Ram",
xlim = c(0, 70),
xlab = "Ram (GB)",
ylim = c(0, 800),
ylab = "Count of Laptops",
col = "yellow",
labels = TRUE)
abline(h = seq(100, 800, 100), col = "gray", lty = "dotted")
The histogram above shows that Ram has a strongly right-skewed distribution. Almost 100% of the laptops in the dataset have less than 20 GB of Ram. Laptops with more than 25 GB of Ram may be considered outliers.
Below is a bar chart of Company
# Company is mapped to the y-axis, so the counts appear on the x-axis
ggplot(laptop.prices, aes(y = reorder(factor(Company), Company, length))) +
  geom_bar(fill = "darkblue") +
  labs(title = "Company Distribution", x = "Count", y = "Company")
As shown in the plot above, the top 5 most common laptop companies in the dataset are Dell, Lenovo, HP, Asus, and Acer.
Implications for Feature Engineering:
The imbalance among laptop companies may bias a predictive model, since companies with only a handful of laptops give unstable estimates. To reduce this potential bias, consider combining all companies outside the top 5 into one "Other" category.
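One way to build such a category is sketched below on a made-up vector of company names. The `top5` vector lists the five companies for which the analysis later creates indicator variables; the `company` vector and the choice of "Other" as the reference level are assumptions made for illustration.

```r
# Illustrative company names (made up for this example)
company <- c("Dell", "Lenovo", "HP", "Asus", "Acer", "Apple", "MSI", "Dell", "Toshiba")

# The five companies that get their own indicator variables later on
top5 <- c("Dell", "Lenovo", "HP", "Asus", "Acer")

# Everything outside the top 5 is collapsed into "Other"
company.grouped <- ifelse(company %in% top5, company, "Other")

# Make "Other" the reference level, so model coefficients compare each company to it
company.grouped <- relevel(factor(company.grouped), ref = "Other")

print(table(company.grouped))
```

When a factor like this is passed to `lm()`, R creates the indicator columns automatically, with the reference level as the baseline.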
Below is a bar chart of TypeName
# TypeName is mapped to the y-axis, so the counts appear on the x-axis
ggplot(laptop.prices, aes(y = reorder(factor(TypeName), TypeName, length))) +
  geom_bar(fill = "darkred") +
  labs(title = "Laptop Type Distribution", x = "Count", y = "Laptop Type")
As shown above, there are 6 different types of laptops within the dataset, with "Notebook" being the most common.
Below is a pie chart of PrimaryStorageType
pst <- laptop.prices %>%
group_by(PrimaryStorageType) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(perc = count / sum(count)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
ggplot(pst, aes(x = "", y = perc, fill = PrimaryStorageType)) +
geom_col() +
geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
position = position_stack(vjust = (0.5))) +
coord_polar(theta = "y") +
labs(title = "Pie Chart of Primary Storage Type", y = "", x = "")
As shown in the plot above, over 65% of laptops in the dataset have an SSD primary storage type.
Below is a pie chart of Touchscreen
touchscreen <- laptop.prices %>%
group_by(Touchscreen) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(perc = count / sum(count)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
ggplot(touchscreen, aes(x = "", y = perc, fill = Touchscreen)) +
geom_col() +
geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y") +
labs(title = "Pie Chart of Touchscreen", y = "", x = "")
As shown in the plot above, only 188 laptops (15%) in the dataset have a touchscreen. Since not many laptops have a touchscreen, I predict that a laptop with a touchscreen will be more expensive than a laptop without one, assuming all other aspects are similar.
To take a first look at this claim, below is a plot of Touchscreen and Price_USD
touchscreen.price <- laptop.prices %>%
group_by(Touchscreen) %>%
summarise(count = n(), avg = mean(Price_USD))
ggplot(touchscreen.price, aes(x = Touchscreen, y = avg, fill = Touchscreen)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste("$", round(avg, 2))), position = position_stack(vjust = 0.5)) +
labs(title = "Average Price (USD) by Touchscreen Status",
x = "Touchscreen Status",
y = "Average price (USD)") +
theme_minimal()
As shown above, the average price for a laptop without a touchscreen is $1177.14 USD, and the average price for a laptop with a touchscreen is $1583.90.
Next, I will build some models to better understand the data. To do this, I first split the data into two separate datasets: training data and testing data. This allows the models to be evaluated on data they were not fitted to, without the need for additional research or sampling. Roughly 75% of the overall dataset goes into the training dataset, and the remaining ~25% goes into the testing dataset.
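The analysis performs this split with sample.split() further down. As an illustration of the idea only, the sketch below does a 75/25 split in base R on a made-up data frame; `toy`, its columns, and the use of `sample()` are assumptions, not the author's method.

```r
set.seed(323)  # a fixed seed makes the random split reproducible

# Toy data standing in for the laptop data (values are simulated)
toy <- data.frame(id = 1:100, price = round(runif(100, 200, 3000)))

# Draw 75% of the row indices at random for training
n.train   <- floor(0.75 * nrow(toy))
train.idx <- sample(seq_len(nrow(toy)), size = n.train)

toy.train <- toy[train.idx, ]
toy.test  <- toy[-train.idx, ]   # the remaining rows

cat("train rows:", nrow(toy.train), " test rows:", nrow(toy.test), "\n")
```

Every row lands in exactly one of the two sets, so the test rows play no part in fitting.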
Multiple linear regression is a statistical technique that uses multiple variables to predict the outcome of another variable. This technique allows statisticians to build models and equations that can be used across multiple datasets.
The subsections below outline my process through developing a multiple linear regression model.
One question that this research seeks to answer is: what aspects of a laptop contribute to its asking price? The dataset contains numerous variables that may contribute to a laptop's asking price, which makes it well suited to answering this question.
The first model I built includes the following variables:
ScreenW
ScreenH
Weight
Company
These variables were chosen because they do not describe the internal hardware of the laptop, but rather the external features that are of more practical use.
I created dummy variables for Company. Following the Feature Engineering suggestion above, a laptop whose Company is not among the top 5 most common laptop companies in the dataset has all five dummy variables equal to 0, so it is implicitly categorized as "other".
laptop.prices$company.dell <- ifelse(laptop.prices$Company == "Dell", 1, 0)
laptop.prices$company.lenovo <- ifelse(laptop.prices$Company == "Lenovo", 1, 0)
laptop.prices$company.hp <- ifelse(laptop.prices$Company == "HP", 1, 0)
laptop.prices$company.asus <- ifelse(laptop.prices$Company == "Asus", 1, 0)
laptop.prices$company.acer <- ifelse(laptop.prices$Company == "Acer", 1, 0)
laptop.prices$IPSpanel = ifelse(laptop.prices$IPSpanel == "Yes", 1, 0)
set.seed(323)
index <- sample.split(Y = laptop.prices$Price_USD, SplitRatio = 0.75)
train.data <- laptop.prices[index, ]
test.data <- laptop.prices[!index, ]
mlr.model.1 = lm(formula = Price_USD ~ ScreenW + ScreenH + Weight + company.dell + company.lenovo + company.hp + company.asus + company.acer, data = train.data)
The diagnostic plots below help identify potential violations of the model conditions for Model 1:
par(mfrow = c(2,2), mar = c(2, 3, 2, 2))
plot(mlr.model.1)
Next, I will carry out the Box-Cox procedure to identify a potential power transformation of the response variable Price_USD:
boxcox(Price_USD ~ ScreenW + ScreenH + Weight + company.dell + company.lenovo + company.hp + company.asus + company.acer,
data = train.data,
lambda = seq(-1, 1.5, length = 10),
xlab = expression(paste(lambda)))
title(main = "Box-Cox Transformation: 95% CI of lambda")
Since none of the commonly used values of lambda fall within the 95% confidence interval for lambda, no standard power transformation is applied and Price_USD is left on its original scale.
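The interval can also be read off numerically instead of from the plot. The sketch below runs `boxcox()` from MASS on simulated data (`sim` is made up, and its response is generated so that a log transformation is appropriate), then extracts the best lambda and an approximate 95% interval from the profile log-likelihood. The cutoff of `qchisq(0.95, 1) / 2` below the maximum is the usual likelihood-ratio rule, stated here as an assumption about how the plotted interval is drawn.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated data: the response is log-linear in x, so lambda near 0 is expected
sim <- data.frame(x = runif(200, 1, 10))
sim$resp <- exp(0.3 * sim$x + rnorm(200, sd = 0.3))

# Profile log-likelihood over a grid of lambda values, without drawing the plot
bc <- boxcox(resp ~ x, data = sim, lambda = seq(-1, 1.5, length = 100), plotit = FALSE)

# Lambda with the highest log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]

# Approximate 95% CI: lambdas whose log-likelihood is within qchisq(0.95, 1) / 2 of the maximum
in.ci     <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda.ci <- range(bc$x[in.ci])

cat("lambda.hat:", round(lambda.hat, 2), " approx. 95% CI:", round(lambda.ci, 2), "\n")
```

A common value such as 0 (log), 0.5 (square root), or 1 (no transformation) can then be checked against `lambda.ci` directly.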
Below is the summary of model 1:
kable(summary(mlr.model.1)$coef)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -679.7977419 | 110.8882441 | -6.1304762 | 0.0000000 |
| ScreenW | 1.0065739 | 0.3792064 | 2.6544225 | 0.0080777 |
| ScreenH | -0.3023726 | 0.6583448 | -0.4592921 | 0.6461298 |
| Weight | 309.4865131 | 29.0238959 | 10.6631623 | 0.0000000 |
| company.dell | -333.8503796 | 68.5100735 | -4.8730116 | 0.0000013 |
| company.lenovo | -349.9186142 | 67.4439680 | -5.1882863 | 0.0000003 |
| company.hp | -251.5993848 | 68.9874948 | -3.6470289 | 0.0002798 |
| company.asus | -327.3180393 | 77.4511225 | -4.2261239 | 0.0000261 |
| company.acer | -672.9197009 | 87.3185693 | -7.7064902 | 0.0000000 |
We can see that the only insignificant variable is ScreenH (p = 0.646), meaning that, given the other predictors, the height of the screen does not contribute significantly to a laptop's price. Interestingly, all five company indicators are significant, and all have negative coefficients relative to laptops in the "other" category.
The second model I built includes the following variables:
ScreenW
ScreenH
Weight
Company
Ram
CPU_freq
PrimaryStorage
This model is similar to Model 1, but this time includes some hardware-type variables to see whether the hardware can better help predict a laptop's price.
mlr.model.2 = lm(formula = Price_USD ~ ScreenW + ScreenH + Weight + company.dell + company.lenovo + company.hp + company.asus + company.acer + Ram + CPU_freq + PrimaryStorage, data = train.data)
The diagnostic plots below help identify potential violations of the model conditions for Model 2:
par(mfrow = c(2,2), mar = c(2, 3, 2, 2))
plot(mlr.model.2)
Next, I will carry out the Box-Cox procedure to identify a potential power transformation of the response variable Price_USD:
boxcox(Price_USD ~ ScreenW + ScreenH + Weight + company.dell + company.lenovo + company.hp + company.asus + company.acer + Ram + CPU_freq + PrimaryStorage,
data = train.data,
lambda = seq(-1, 1.5, length = 10),
xlab = expression(paste(lambda)))
title(main = "Box-Cox Transformation: 95% CI of lambda")
Since none of the commonly used values of lambda fall within the 95% confidence interval for lambda, no standard power transformation is applied and Price_USD is left on its original scale.
Below is the summary of model 2:
kable(summary(mlr.model.2)$coef)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -665.3982409 | 99.3548496 | -6.6971894 | 0.0000000 |
| ScreenW | 0.1454878 | 0.2827000 | 0.5146367 | 0.6069274 |
| ScreenH | 0.4710990 | 0.4892147 | 0.9629699 | 0.3358092 |
| Weight | -9.0226220 | 25.1283214 | -0.3590619 | 0.7196291 |
| company.dell | -110.0830558 | 51.8291801 | -2.1239590 | 0.0339330 |
| company.lenovo | -132.2551571 | 50.8444761 | -2.6011706 | 0.0094360 |
| company.hp | 15.8917352 | 52.0622965 | 0.3052446 | 0.7602472 |
| company.asus | -183.8699352 | 57.9179995 | -3.1746596 | 0.0015487 |
| company.acer | -268.8680156 | 66.3082423 | -4.0548204 | 0.0000543 |
| Ram | 82.2163154 | 3.5052008 | 23.4555223 | 0.0000000 |
| CPU_freq | 286.4280110 | 31.4014139 | 9.1215004 | 0.0000000 |
| PrimaryStorage | -0.2677410 | 0.0408542 | -6.5535687 | 0.0000000 |
We can see that after adding the "hardware" variables, several of the variables from Model 1 (ScreenW, Weight, and company.hp) became insignificant, while Ram, CPU_freq, and PrimaryStorage are all highly significant. This suggests that the hardware variables matter more in determining a laptop's price.
mlr.model.1.r = summary(mlr.model.1)$r.squared
mlr.model.2.r = summary(mlr.model.2)$r.squared
R.Square = cbind(Model.1.r = mlr.model.1.r, Model.2.r = mlr.model.2.r)
kable(R.Square)
| Model.1.r | Model.2.r |
|---|---|
| 0.4089328 | 0.6789631 |
As shown above, Model 1's r-squared value is 0.4089, and Model 2's r-squared value is 0.679. This means that 40.89% of the variability in laptop prices in the training data can be explained by Model 1, and 67.9% can be explained by Model 2.
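Because R-squared never decreases when predictors are added, and Model 2 has three more predictors than Model 1, adjusted R-squared is a useful companion figure. The sketch below uses simulated data (`sim`, `small`, and `large` are made-up names, not the laptop models) to show where both values live in the `summary()` output.

```r
set.seed(42)
n <- 200

# Simulated data: y depends on x1 and x2 but not on noise
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 3 + 2 * sim$x1 + 1.5 * sim$x2 + rnorm(n)

small <- lm(y ~ x1, data = sim)
large <- lm(y ~ x1 + x2 + noise, data = sim)

# R-squared and adjusted R-squared side by side for both models
comparison <- rbind(
  small = c(r.squared = summary(small)$r.squared,
            adj.r.squared = summary(small)$adj.r.squared),
  large = c(r.squared = summary(large)$r.squared,
            adj.r.squared = summary(large)$adj.r.squared)
)
print(round(comparison, 4))
```

Adjusted R-squared penalizes each extra predictor, so it only rises when a new variable explains more than would be expected by chance.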
Since the response variable for both models is on the same scale, I will compare the two models using their mean squared error (MSE), computed here from the residuals on the training data.
mlr.model.1.mse = mean(mlr.model.1$residuals^2)
mlr.model.2.mse = mean(mlr.model.2$residuals^2)
MSE = cbind(Model.1.MSE = mlr.model.1.mse, Model.2.MSE = mlr.model.2.mse)
kable(MSE)
| Model.1.MSE | Model.2.MSE |
|---|---|
| 339476 | 184385.7 |
As shown above, Model 2 has a lower MSE than Model 1, meaning Model 2 fits the training data more closely than Model 1.
Since Model 2 has a higher R-squared value and a lower MSE than Model 1, Model 2 is the better model between the two and will be used as the final model.
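The MSE values above are computed from each model's residuals on the data it was fitted to. A k-fold cross-validated MSE, which scores each row using a model that did not see it, could be sketched as follows. The data are simulated, and `cv.mse()` is a hypothetical helper written for this example, not a library function.

```r
set.seed(323)
n <- 120

# Simulated data standing in for the training set
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold label for each row

# Hypothetical helper: average held-out MSE over the folds.
# It assumes the response column is named y.
cv.mse <- function(formula, data, fold) {
  fold.mse <- sapply(sort(unique(fold)), function(f) {
    fit  <- lm(formula, data = data[fold != f, ])  # fit on the other folds
    held <- data[fold == f, ]                      # score on the held-out fold
    mean((held$y - predict(fit, newdata = held))^2)
  })
  mean(fold.mse)
}

mse.1 <- cv.mse(y ~ x1, sim, fold)
mse.2 <- cv.mse(y ~ x1 + x2, sim, fold)
cat("CV MSE, model 1:", mse.1, " model 2:", mse.2, "\n")
```

Using the same `fold` vector for both models keeps the comparison fair, since both are scored on identical held-out rows.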
The Final Model is as reported below:
\(\widehat{\text{Price\_USD}} = -665.40 + 0.15(\text{ScreenW}) + 0.47(\text{ScreenH}) - 9.02(\text{Weight}) - 110.08(\text{Company.Dell}) - 132.26(\text{Company.Lenovo}) + 15.89(\text{Company.HP}) - 183.87(\text{Company.ASUS}) - 268.87(\text{Company.Acer}) + 82.22(\text{Ram}) + 286.43(\text{CPU\_freq}) - 0.27(\text{PrimaryStorage})\)
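To see how the equation is used, the sketch below plugs a hypothetical Dell laptop into it by hand. The coefficients are copied from the Model 2 summary table above; the laptop's specifications are made up for illustration.

```r
# Coefficients copied from the Model 2 summary table
coefs <- c(intercept = -665.3982409, ScreenW = 0.1454878, ScreenH = 0.4710990,
           Weight = -9.0226220, company.dell = -110.0830558, Ram = 82.2163154,
           CPU_freq = 286.4280110, PrimaryStorage = -0.2677410)

# Hypothetical Dell laptop (made-up specs). The other four company
# indicators are 0 for a Dell, so their terms drop out and are omitted.
laptop <- c(intercept = 1, ScreenW = 1920, ScreenH = 1080, Weight = 2.0,
            company.dell = 1, Ram = 8, CPU_freq = 2.5, PrimaryStorage = 256)

# The prediction is the sum of coefficient * value over all terms
predicted.price <- sum(coefs * laptop[names(coefs)])
cat("Predicted Price_USD:", round(predicted.price, 2), "\n")
```

With a fitted model object available, `predict()` with a one-row `newdata` data frame performs the same calculation.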
Now that the final model is identified, it is time to test this model on the test dataset to determine its accuracy. Below are Model 2's RMSE on the test dataset and, for reference, its r-squared value on the training dataset:
test.data$predicted.PriceUSD <- predict(mlr.model.2, newdata = test.data)
mlr.rmse <- sqrt(mean((test.data$Price_USD - test.data$predicted.PriceUSD)^2))
mlr.rsquared <- summary(mlr.model.2)$r.squared
cat("Root Mean Squared Error (RMSE): ", mlr.rmse, "\n")
Root Mean Squared Error (RMSE): 459.8317
cat("R-squared for the model (training data): ", mlr.rsquared, "\n")
R-squared for the model (training data): 0.6789631
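As shown in the Model 2 summary, a laptop's RAM, CPU frequency, and primary storage, along with the Dell, Lenovo, Asus, and Acer indicators, are significant in predicting its asking price. Screen width, screen height, weight, and the HP indicator are not significant at the 0.05 level once the hardware variables are included. It is important to note that the regression coefficients for each of the company variables are in comparison to an "other" laptop. For example, the -110.08 means that, holding the other predictors fixed, a Dell laptop costs on average $110.08 less than a laptop not made by Dell, Lenovo, HP, Asus, or Acer. On the test dataset, Model 2 had an RMSE of about $459.83, somewhat higher than its training RMSE of about $429.40 (the square root of the training MSE of 184385.7). The r-squared value of 0.6790 reported above is computed on the training data, so it does not by itself measure performance on the test dataset.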
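The r-squared printed above comes from summary() of the fitted model, which describes the training fit. A test-set r-squared can be computed directly from the predictions, as in the sketch below. The data are simulated, and the names `train`, `test`, and `fit` are placeholders rather than the objects used in the analysis.

```r
set.seed(7)
n <- 200

# Simulated data, split into a training part and a test part
sim <- data.frame(x = rnorm(n))
sim$y <- 5 + 3 * sim$x + rnorm(n)
train <- sim[1:150, ]
test  <- sim[151:200, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Test-set R-squared: 1 - (residual sum of squares) / (total sum of squares)
sse <- sum((test$y - pred)^2)
sst <- sum((test$y - mean(test$y))^2)
test.rsq  <- 1 - sse / sst
test.rmse <- sqrt(mean((test$y - pred)^2))

cat("train R-squared:", summary(fit)$r.squared, "\n")
cat("test R-squared: ", test.rsq, " test RMSE:", test.rmse, "\n")
```

Unlike the training value, a test-set r-squared can be negative when the model predicts worse than the test mean.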
Logistic regression is a statistical technique that uses multiple variables to predict the outcome of a binary variable. This technique allows statisticians to build models and equations that can be used across multiple datasets.
The subsections below outline my process through developing a logistic regression model.
One question that this section of the research seeks to answer is: what aspects of a laptop are associated with it having an IPS panel? An IPS panel is a type of panel used in liquid crystal display technology to enhance color accuracy. The dataset contains numerous hardware-type variables that may be related to a laptop having an IPS panel, which makes it well suited to answering this question.
The first model I built includes the following variables:
Ram
Inches
CPU_freq
ScreenW
ScreenH
PrimaryStorage
SecondaryStorage
These variables were chosen because they are the aspects that mainly affect the display of the laptop, and may be associated with the presence of an IPS panel.
Below shows a summary of Model 1:
log.model.1 <- glm(IPSpanel ~ Ram + Inches + CPU_freq + ScreenW + ScreenH + PrimaryStorage + SecondaryStorage, data = train.data, family = binomial)
summary(log.model.1)
Call:
glm(formula = IPSpanel ~ Ram + Inches + CPU_freq + ScreenW +
ScreenH + PrimaryStorage + SecondaryStorage, family = binomial,
data = train.data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.818e-01 1.021e+00 -0.178 0.85868
Ram 8.002e-02 1.862e-02 4.298 1.72e-05 ***
Inches -2.279e-01 6.823e-02 -3.339 0.00084 ***
CPU_freq 2.681e-01 1.774e-01 1.511 0.13068
ScreenW -4.212e-04 1.266e-03 -0.333 0.73945
ScreenH 2.165e-03 2.181e-03 0.993 0.32084
PrimaryStorage -5.498e-04 2.790e-04 -1.971 0.04877 *
SecondaryStorage 8.777e-05 2.154e-04 0.407 0.68365
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1147.3 on 955 degrees of freedom
Residual deviance: 1032.8 on 948 degrees of freedom
AIC: 1048.8
Number of Fisher Scoring iterations: 4
As shown above, many p-values are bigger than 0.05, meaning some insignificant predictor variables should be dropped from the model.
Next, I begin with Model 1 and apply backward stepwise selection, which removes one variable at a time for as long as doing so lowers the model's AIC.
log.model.2 = step(log.model.1, direction = "backward", trace = 0)
summary(log.model.2)
Call:
glm(formula = IPSpanel ~ Ram + Inches + CPU_freq + ScreenH +
PrimaryStorage, family = binomial, data = train.data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3227982 0.9310651 -0.347 0.728818
Ram 0.0822647 0.0178504 4.609 4.05e-06 ***
Inches -0.2184003 0.0607044 -3.598 0.000321 ***
CPU_freq 0.2618654 0.1764667 1.484 0.137826
ScreenH 0.0014496 0.0003024 4.793 1.64e-06 ***
PrimaryStorage -0.0005968 0.0002572 -2.320 0.020333 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1147.3 on 955 degrees of freedom
Residual deviance: 1033.1 on 950 degrees of freedom
AIC: 1045.1
Number of Fisher Scoring iterations: 4
As shown above, all remaining variables except CPU_freq are significant at the 0.05 level.
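Note that step() selects variables by AIC rather than by p-value, which is why a predictor with a p-value above 0.05 can survive the selection. The sketch below shows the mechanics on simulated data; `sim`, `full`, and `reduced` are made-up names, and the `junk` column is unrelated to the response by construction.

```r
set.seed(11)
n <- 300

# Simulated binary outcome that depends on x1 and x2 but not on junk
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), junk = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 + 0.8 * sim$x2))

full    <- glm(y ~ x1 + x2 + junk, data = sim, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)

# The selected model never has a higher AIC than the starting model
print(formula(reduced))
cat("AIC full:", AIC(full), " AIC selected:", AIC(reduced), "\n")
```

The same comparison appears in the two summaries above, where the AIC drops from 1048.8 for Model 1 to 1045.1 for Model 2.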
To compare the two models, I will be using ROC curves and AUC values.
An ROC curve graphs a model's classification performance by plotting the True Positive Rate against the False Positive Rate across all possible classification thresholds.
Below are the ROC Curves for both models:
train.data$predicted.prob.1 <- predict(log.model.1, type = "response")
train.data$predicted.prob.2 <- predict(log.model.2, type = "response")
log.model.1.roc <- roc(train.data$IPSpanel, train.data$predicted.prob.1)
log.model.2.roc <- roc(train.data$IPSpanel, train.data$predicted.prob.2)
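# pROC draws specificity on the x-axis running from 1 down to 0,
# so xlim is given in that order to keep the curve in its usual orientation
plot(log.model.1.roc,
  col = "blue",
  lwd = 2,
  main = "Overlaid ROC Curves",
  xlim = c(1, 0),
  ylim = c(0, 1))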
lines(log.model.2.roc, col = "red", lwd = 2)
legend("bottomright",
legend = c("Model 1", "Model 2"),
col = c("blue", "red"),
lwd = 2)
As shown above, the ROC curves are very similar and hug each other the whole time. This makes it difficult to distinguish which curve and thus which model is better. Instead, I will use AUC to compare the two models.
AUC values represent the area under an ROC curve. The interpretation of AUC values is as follows:
Below are each model’s AUC value:
log.model.1.auc = auc(log.model.1.roc)
log.model.2.auc = auc(log.model.2.roc)
AUC = cbind(Model.1.AUC = log.model.1.auc, Model.2.AUC = log.model.2.auc)
kable(AUC)
| Model.1.AUC | Model.2.AUC |
|---|---|
| 0.7311494 | 0.7319637 |
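As shown above, Model 2 has a slightly higher AUC value on the training data (0.7320 versus 0.7311). Since Model 2 matches Model 1's AUC with two fewer predictors and has a lower AIC (1045.1 versus 1048.8), Model 2 is the final model. Note that CPU_freq remains in Model 2 even though it is not significant at the 0.05 level.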
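One way to read an AUC value is as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The sketch below computes it two ways in base R on made-up labels and scores (`actual` and `prob` are illustrative, not model output).

```r
# Toy labels and predicted probabilities (made up for this example)
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
prob   <- c(0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90)

n.pos <- sum(actual == 1)
n.neg <- sum(actual == 0)

# Rank-sum (Mann-Whitney) form of the AUC
r <- rank(prob)
auc.rank <- (sum(r[actual == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)

# Direct check: share of (positive, negative) pairs ranked correctly,
# with ties counted as one half
pairs <- outer(prob[actual == 1], prob[actual == 0], "-")
auc.pairs <- mean((pairs > 0) + 0.5 * (pairs == 0))

cat("AUC (rank formula):", auc.rank, " AUC (pair counting):", auc.pairs, "\n")
```

Here 12 of the 16 positive-negative pairs are ordered correctly, so both calculations give 0.75.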
The Final Model is as reported below:
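\(\log\left(\frac{p}{1-p}\right) = -0.3228 + 0.0823(\text{Ram}) - 0.2184(\text{Inches}) + 0.2619(\text{CPU\_freq}) + 0.0014(\text{ScreenH}) - 0.0006(\text{PrimaryStorage})\)
Here \(p\) is the predicted probability that a laptop has an IPS panel.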
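The fitted coefficients act on the log-odds scale, so turning them into a probability takes one extra step. The sketch below does this by hand for a hypothetical laptop, using the coefficients from the Model 2 summary above; the laptop's specifications are made up.

```r
# Coefficients copied from the summary of log.model.2
coefs <- c(intercept = -0.3227982, Ram = 0.0822647, Inches = -0.2184003,
           CPU_freq = 0.2618654, ScreenH = 0.0014496, PrimaryStorage = -0.0005968)

# Hypothetical laptop (made-up specs)
laptop <- c(intercept = 1, Ram = 8, Inches = 15.6, CPU_freq = 2.5,
            ScreenH = 1080, PrimaryStorage = 256)

# Linear predictor = log-odds of having an IPS panel
log.odds <- sum(coefs * laptop[names(coefs)])

# The inverse logit turns log-odds into a probability
prob.ips <- plogis(log.odds)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
ram.odds.ratio <- exp(coefs["Ram"])

cat("log-odds:", round(log.odds, 3), " probability:", round(prob.ips, 3), "\n")
cat("odds ratio per extra GB of Ram:", round(ram.odds.ratio, 3), "\n")
```

For this hypothetical laptop the predicted probability is roughly 0.27, below the 0.5 threshold, so it would be classified as not having an IPS panel.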
Now that the final model is identified, it is time to test this model on the test dataset to determine its accuracy. Below is the confusion matrix for Model 2:
test.data$predicted.prob <- predict(log.model.2, newdata = test.data, type = "response")
threshold <- 0.5
test.data$predicted.IPS <- ifelse(test.data$predicted.prob > threshold, 1, 0)
log.confusion.matrix <- table(Actual = test.data$IPSpanel, Predicted = test.data$predicted.IPS)
print(log.confusion.matrix)
Predicted
Actual 0 1
0 215 22
1 69 13
Below are some additional metrics for Model 2:
log.accuracy <- sum(diag(log.confusion.matrix)) / sum(log.confusion.matrix)
log.precision <- log.confusion.matrix[2,2] / sum(log.confusion.matrix[, 2])
cat("Accuracy: ", log.accuracy, "\n")
Accuracy: 0.7147335
cat("Precision: ", log.precision, "\n")
Precision: 0.3714286
Accuracy is the proportion of correctly predicted outcomes, both true positives and true negatives. Precision is the proportion of how many of the predicted positives are actually positive.
As shown above, Model 2 has a fairly high accuracy (71.5%) but a low precision (37.1%). This means that the model makes a reasonable number of correct predictions overall, but its positive predictions are unreliable. For context, 237 of the 319 test laptops (74.3%) have no IPS panel, so predicting "no IPS panel" for every laptop would achieve a slightly higher accuracy than Model 2.
A laptop's RAM, diagonal screen size, screen height, and primary storage are all significant in predicting whether or not a laptop has an IPS panel. It is important to note that the coefficient for each predictor variable represents the change in the log-odds that a laptop has an IPS panel for a one-unit increase in that predictor. Model 2 is fairly accurate but not precise: 22 of its 35 positive predictions on the test dataset are incorrect.
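Accuracy and precision leave out how many of the actual IPS laptops the model finds. The sketch below rebuilds the confusion matrix from the counts printed above and adds recall, specificity, and the accuracy of always predicting the majority class; the variable names are chosen for this example.

```r
# Confusion matrix rebuilt from the counts printed above
# (rows = actual class, columns = predicted class)
cm <- matrix(c(215, 69, 22, 13), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
precision   <- cm["1", "1"] / sum(cm[, "1"])   # predicted positives that are correct
recall      <- cm["1", "1"] / sum(cm["1", ])   # actual positives that are found
specificity <- cm["0", "0"] / sum(cm["0", ])   # actual negatives that are found

# Accuracy of always predicting the more common class
baseline <- max(rowSums(cm)) / sum(cm)

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              specificity = specificity, baseline = baseline), 4))
```

A recall of 13 out of 82 means the model identifies only about 16% of the laptops that actually have an IPS panel at the 0.5 threshold.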