---
title: "project final"
output:
  pdf_document:
    latex_engine: xelatex
  html_document: default
date: "2024-12-04"
---

Main Objective-

-The primary objective of this analysis is to understand the key factors that influence laptop prices.

-Develop a predictive model that can estimate the price of a laptop based on its specifications and features.

#Benefit-

-This analysis helps manufacturers and retailers understand how laptop features influence prices, so they can set competitive prices according to specifications. It also helps marketing teams identify key selling points.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(corrplot)
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
laptop_data <- read.csv("~/Documents/statistics(1)/annotated-laptop_prices_reverted.csv")

Initial EDA

print("Summary of the dataset:")
## [1] "Summary of the dataset:"
summary(laptop_data)
##    Company            Product            TypeName             Inches     
##  Length:1275        Length:1275        Length:1275        Min.   :10.10  
##  Class :character   Class :character   Class :character   1st Qu.:14.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :15.60  
##                                                           Mean   :15.02  
##                                                           3rd Qu.:15.60  
##                                                           Max.   :18.40  
##       Ram              OS                Weight       Price_euros  
##  Min.   : 2.000   Length:1275        Min.   :0.690   Min.   : 174  
##  1st Qu.: 4.000   Class :character   1st Qu.:1.500   1st Qu.: 609  
##  Median : 8.000   Mode  :character   Median :2.040   Median : 989  
##  Mean   : 8.441                      Mean   :2.041   Mean   :1135  
##  3rd Qu.: 8.000                      3rd Qu.:2.310   3rd Qu.:1496  
##  Max.   :64.000                      Max.   :4.700   Max.   :6099  
##     Screen             ScreenW        ScreenH     TouchscreenIPSpanel
##  Length:1275        Min.   :1366   Min.   : 768   Length:1275        
##  Class :character   1st Qu.:1920   1st Qu.:1080   Class :character   
##  Mode  :character   Median :1920   Median :1080   Mode  :character   
##                     Mean   :1900   Mean   :1074                      
##                     3rd Qu.:1920   3rd Qu.:1080                      
##                     Max.   :3840   Max.   :2160                      
##  RetinaDisplay      CPU_company           CPU_freq      CPU_model        
##  Length:1275        Length:1275        Min.   :0.900   Length:1275       
##  Class :character   Class :character   1st Qu.:2.000   Class :character  
##  Mode  :character   Mode  :character   Median :2.500   Mode  :character  
##                                        Mean   :2.303                     
##                                        3rd Qu.:2.700                     
##                                        Max.   :3.600                     
##  PrimaryStorage   SecondaryStorage PrimaryStorageType SecondaryStorageType
##  Min.   :   8.0   Min.   :   0.0   Length:1275        Length:1275         
##  1st Qu.: 256.0   1st Qu.:   0.0   Class :character   Class :character    
##  Median : 256.0   Median :   0.0   Mode  :character   Mode  :character    
##  Mean   : 444.5   Mean   : 176.1                                          
##  3rd Qu.: 512.0   3rd Qu.:   0.0                                          
##  Max.   :2048.0   Max.   :2048.0                                          
##  GPU_company         GPU_model         Touchscreen       
##  Length:1275        Length:1275        Length:1275       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
# Check for missing values
print("Missing values per column:")
## [1] "Missing values per column:"
colSums(is.na(laptop_data))
##              Company              Product             TypeName 
##                    0                    0                    0 
##               Inches                  Ram                   OS 
##                    0                    0                    0 
##               Weight          Price_euros               Screen 
##                    0                    0                    0 
##              ScreenW              ScreenH  TouchscreenIPSpanel 
##                    0                    0                    0 
##        RetinaDisplay          CPU_company             CPU_freq 
##                    0                    0                    0 
##            CPU_model       PrimaryStorage     SecondaryStorage 
##                    0                    0                    0 
##   PrimaryStorageType SecondaryStorageType          GPU_company 
##                    0                    0                    0 
##            GPU_model          Touchscreen 
##                    0                    0
#  Data structure
print("Structure of the dataset:")
## [1] "Structure of the dataset:"
str(laptop_data)
## 'data.frame':    1275 obs. of  23 variables:
##  $ Company             : chr  "Apple" "Apple" "HP" "Apple" ...
##  $ Product             : chr  "MacBook Pro" "Macbook Air" "250 G6" "MacBook Pro" ...
##  $ TypeName            : chr  "Ultrabook" "Ultrabook" "Notebook" "Ultrabook" ...
##  $ Inches              : num  13.3 13.3 15.6 15.4 13.3 15.6 15.4 13.3 14 14 ...
##  $ Ram                 : int  8 8 8 16 8 4 16 8 16 8 ...
##  $ OS                  : chr  "macOS" "macOS" "No OS" "macOS" ...
##  $ Weight              : num  1.37 1.34 1.86 1.83 1.37 2.1 2.04 1.34 1.3 1.6 ...
##  $ Price_euros         : num  1340 899 575 2537 1804 ...
##  $ Screen              : chr  "Standard" "Standard" "Full HD" "Standard" ...
##  $ ScreenW             : int  2560 1440 1920 2880 2560 1366 2880 1440 1920 1920 ...
##  $ ScreenH             : int  1600 900 1080 1800 1600 768 1800 900 1080 1080 ...
##  $ TouchscreenIPSpanel : chr  "Yes" "No" "No" "Yes" ...
##  $ RetinaDisplay       : chr  "Yes" "No" "No" "Yes" ...
##  $ CPU_company         : chr  "Intel" "Intel" "Intel" "Intel" ...
##  $ CPU_freq            : num  2.3 1.8 2.5 2.7 3.1 3 2.2 1.8 1.8 1.6 ...
##  $ CPU_model           : chr  "Core i5" "Core i5" "Core i5 7200U" "Core i7" ...
##  $ PrimaryStorage      : int  128 128 256 512 256 500 256 256 512 256 ...
##  $ SecondaryStorage    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PrimaryStorageType  : chr  "SSD" "Flash Storage" "SSD" "SSD" ...
##  $ SecondaryStorageType: chr  "No" "No" "No" "No" ...
##  $ GPU_company         : chr  "Intel" "Intel" "Intel" "AMD" ...
##  $ GPU_model           : chr  "Iris Plus Graphics 640" "HD Graphics 6000" "HD Graphics 620" "Radeon Pro 455" ...
##  $ Touchscreen         : chr  "Yes" "No" "No" "Yes" ...

#Univariate analysis

ggplot(laptop_data, aes(x = Price_euros)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  labs(title = "Distribution of Laptop Prices", x = "Price (Euros)", y = "Frequency")

The distribution is right-skewed: most laptops are priced in the lower range, while fewer laptops fall in the higher price brackets. This is consistent with the mean price (€1,135) exceeding the median (€989).

ggplot(laptop_data, aes(x = TypeName)) +
  geom_bar(fill = "orange", alpha = 0.7) +
  labs(title = "Frequency of Laptop Types", x = "Laptop Type", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Notebooks are by far the most common type of laptop (707 of 1,275), followed by Gaming laptops (205) and Ultrabooks (194).

2-in-1 Convertibles (117), Workstations (29), and Netbooks (23) are less common.

#Bivariate analysis

Price vs Screen Size(Inches)

ggplot(laptop_data, aes(x = Inches, y = Price_euros)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Screen Size vs Price", x = "Screen Size (Inches)", y = "Price (Euros)")
## `geom_smooth()` using formula = 'y ~ x'

# Correlation coefficient
cor(laptop_data$Inches, laptop_data$Price_euros, use = "complete.obs")
## [1] 0.06660794

There is a positive but very weak linear relationship between screen size and price (r ≈ 0.07). Laptops with larger screens are only marginally more expensive on average, so screen size alone says little about price.
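
The strength of this correlation can be checked directly from the printed coefficient. The sketch below assumes only the values shown above (r = 0.0666, n = 1275) and computes the share of variance explained together with the t statistic for testing a zero correlation. The variable names are illustrative.

```r
# Values copied from the output above
r <- 0.06660794   # Pearson correlation between Inches and Price_euros
n <- 1275         # number of laptops in the dataset

# Share of price variance explained by screen size alone
r_squared <- r^2

# t statistic and two-sided p-value for H0: correlation = 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(r_squared = r_squared, t = t_stat, p = p_value), 4)
```

The squared correlation (about 0.0044) is the same quantity as the R-squared of the simple regression of price on screen size reported later in this report. With the data loaded, `cor.test(laptop_data$Inches, laptop_data$Price_euros)` should give the same test in one call.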

Price Vs Laptop type

ggplot(laptop_data, aes(x = TypeName, y = Price_euros)) +
  geom_boxplot(fill = "orange", alpha = 0.7) +
  labs(title = "Price by Laptop Type", x = "Laptop Type", y = "Price (Euros)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Grouped summary statistics
laptop_data %>%
  group_by(TypeName) %>%
  summarise(
    Mean_Price = mean(Price_euros, na.rm = TRUE),
    Median_Price = median(Price_euros, na.rm = TRUE),
    Count = n()
  )
## # A tibble: 6 × 4
##   TypeName           Mean_Price Median_Price Count
##   <chr>                   <dbl>        <dbl> <int>
## 1 2 in 1 Convertible      1290.        1199    117
## 2 Gaming                  1731.        1493.   205
## 3 Netbook                  673.         355     23
## 4 Notebook                 789.         695    707
## 5 Ultrabook               1557.        1499    194
## 6 Workstation             2280.        2065.    29

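Workstations have the highest median price (about €2,065), followed by Ultrabooks (€1,499) and Gaming laptops (about €1,493). Gaming laptops have the second-highest mean (€1,731), well above their median, which points to considerable variability and a long upper tail.

Netbooks and Notebooks have the lowest median prices (€355 and €695), indicating these are more budget-friendly options.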

TouchScreen Vs Laptop Type

table(laptop_data$Touchscreen, laptop_data$TypeName)
##      
##       2 in 1 Convertible Gaming Netbook Notebook Ultrabook Workstation
##   No                  56    117      19      594       114          18
##   Yes                 61     88       4      113        80          11
# Stacked bar chart
ggplot(laptop_data, aes(x = TypeName, fill = Touchscreen)) +
  geom_bar(position = "fill") +
  labs(title = "Proportion of Touchscreens by Laptop Type", x = "Laptop Type", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

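2-in-1 Convertible laptops have the highest share of touchscreen models, but only narrowly: 61 of 117 (about 52%).

Gaming laptops (88 of 205, about 43%), Ultrabooks (80 of 194, about 41%), and Workstations (11 of 29, about 38%) also include a substantial share of touchscreen models.

Netbooks (4 of 23, about 17%) and Notebooks (113 of 707, about 16%) are mostly non-touchscreen.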
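
The shares read off the stacked bar chart can be computed exactly from the contingency table. The sketch below rebuilds the table from the counts printed above (so it runs without the CSV) and converts it to column-wise proportions. The object names `touch_counts` and `touch_share` are illustrative.

```r
# Counts copied from the table() output above (filled column by column)
touch_counts <- matrix(
  c(56, 61,    # 2 in 1 Convertible: No, Yes
    117, 88,   # Gaming
    19, 4,     # Netbook
    594, 113,  # Notebook
    114, 80,   # Ultrabook
    18, 11),   # Workstation
  nrow = 2,
  dimnames = list(
    Touchscreen = c("No", "Yes"),
    TypeName = c("2 in 1 Convertible", "Gaming", "Netbook",
                 "Notebook", "Ultrabook", "Workstation")
  )
)

# Column-wise proportions: share of touchscreen models within each type
touch_share <- prop.table(touch_counts, margin = 2)
round(touch_share["Yes", ], 3)
```

With the data frame loaded, `prop.table(table(laptop_data$Touchscreen, laptop_data$TypeName), margin = 2)` produces the same proportions directly.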

Hypothesis 1

Laptops with touchscreens are priced higher than non-touchscreen laptops.

t_test_touchscreen <- t.test(Price_euros ~ Touchscreen, data = laptop_data)
print(t_test_touchscreen)
## 
##  Welch Two Sample t-test
## 
## data:  Price_euros by Touchscreen
## t = -8.8671, df = 598.24, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -477.8107 -304.5334
## sample estimates:
##  mean in group No mean in group Yes 
##          1025.441          1416.613
#box plot
ggplot(laptop_data, aes(x = Touchscreen, y = Price_euros, fill = Touchscreen)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Price Distribution: Touchscreen vs Non-Touchscreen", 
       x = "Touchscreen", y = "Price (Euros)") +
  theme_minimal()

The 95% confidence interval for the difference in mean prices (non-touchscreen minus touchscreen) runs from -477.81 to -304.53 euros, so touchscreen laptops cost roughly 305 to 478 euros more on average.

p-value: p < 2.2e-16, far smaller than α = 0.05, indicating a significant difference in mean prices.

The box plot visually confirms that touchscreen laptops are positioned in a higher price range, with a higher median and greater variability.

The presence of outliers in both groups suggests premium models or specific configurations driving up prices.
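
The confidence interval follows from the other numbers in the t-test output. As a rough check, the sketch below rebuilds it from the printed group means, t statistic, and degrees of freedom, and also computes a one-sided p-value, since the hypothesis is directional. All variable names are illustrative, and the inputs are rounded values copied from the output.

```r
# Values copied from the Welch t-test output above
mean_no  <- 1025.441   # mean price, non-touchscreen
mean_yes <- 1416.613   # mean price, touchscreen
t_stat   <- -8.8671
df       <- 598.24

# Difference in means (No minus Yes) and its implied standard error
diff_means <- mean_no - mean_yes
se_diff    <- diff_means / t_stat

# 95% confidence interval rebuilt from the pieces
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se_diff

# One-sided p-value for H1: touchscreen laptops cost more (No - Yes < 0)
p_one_sided <- pt(t_stat, df)

round(c(diff = diff_means, lower = ci[1], upper = ci[2]), 2)
```

To run the directional test on the data itself, `t.test(Price_euros ~ Touchscreen, data = laptop_data, alternative = "less")` tests whether the non-touchscreen mean is below the touchscreen mean.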

Hypothesis 2

Certain laptop brands are priced higher than others

# One-way ANOVA for brand effect on price
anova_company <- aov(Price_euros ~ Company, data = laptop_data)
summary(anova_company)
##               Df    Sum Sq Mean Sq F value Pr(>F)    
## Company       23 102648747 4462989   10.68 <2e-16 ***
## Residuals   1251 522954141  418029                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Bar plot
avg_prices <- laptop_data %>%
  group_by(Company) %>%
  summarise(Average_Price = mean(Price_euros, na.rm = TRUE)) %>%
  arrange(desc(Average_Price))

ggplot(avg_prices, aes(x = reorder(Company, -Average_Price), y = Average_Price, fill = Company)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  labs(title = "Average Price by Laptop Brand", 
       x = "Brand", y = "Average Price (Euros)") +
  # theme_minimal() goes first; applied last it would reset the rotated axis labels
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The very small p-value indicates that brand has a statistically significant effect on laptop prices.

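Brands like Razer and Mediacom stand out as having higher variability in pricing. The ANOVA itself only shows that at least one brand mean differs from the others; identifying which pairs differ requires post-hoc pairwise comparisons, in which Mediacom often appears as non-significant.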
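
The ANOVA table also allows a quick estimate of how much price variation brand accounts for. The sketch below uses only the sums of squares and degrees of freedom printed above to compute eta-squared and to rebuild the F statistic. The variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_company  <- 102648747
ss_residual <- 522954141
df_company  <- 23
df_residual <- 1251

# Eta-squared: proportion of total price variance attributable to brand
eta_squared <- ss_company / (ss_company + ss_residual)

# F statistic rebuilt from the two mean squares
f_stat <- (ss_company / df_company) / (ss_residual / df_residual)

round(c(eta_squared = eta_squared, F = f_stat), 3)
```

Brand therefore accounts for roughly 16% of the variance in price. For pairwise brand comparisons, `TukeyHSD(anova_company)` is one option, though with 24 brands it produces 276 comparisons and is best read selectively.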

Regression model

Screen size influences laptop prices

model_screen <- lm(Price_euros ~ Inches, data = laptop_data)
summary(model_screen)
## 
## Call:
## lm(formula = Price_euros ~ Inches, data = laptop_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -954.8 -540.3 -146.8  375.8 4889.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   644.43     206.88   3.115  0.00188 **
## Inches         32.65      13.71   2.382  0.01737 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 699.5 on 1273 degrees of freedom
## Multiple R-squared:  0.004437,   Adjusted R-squared:  0.003655 
## F-statistic: 5.673 on 1 and 1273 DF,  p-value: 0.01737
laptop_data <- laptop_data %>%
  mutate(Screen_Size_Group = cut(Inches, 
                                 breaks = c(0, 13, 15, 17, Inf), 
                                 labels = c("<13\"", "13-15\"", "15-17\"", ">17\"")))

# Boxplot for screen size groups
ggplot(laptop_data, aes(x = Screen_Size_Group, y = Price_euros, fill = Screen_Size_Group)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Price Distribution by Screen Size Group", 
       x = "Screen Size Group", y = "Price (Euros)") +
  theme_minimal()

The median price increases with screen size. Laptops in the >17" group have the highest median prices.

The p-value (0.017) indicates that screen size has a statistically significant effect on price at the 5% significance level.

The R-squared value is 0.0044, so only 0.44% of the variability in price is explained by screen size alone. Other factors (e.g., brand, features) evidently play a much larger role.

The positive and significant coefficient indicates that each additional inch of screen size is associated with a price about €32.65 higher on average.

Despite this statistical significance, the low R-squared shows that screen size alone is not sufficient to explain price variation, highlighting the need to include other predictors.
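
To make the weak fit concrete, the sketch below plugs a few common screen sizes into the fitted line, using the coefficients printed above. The helper `predict_price_from_inches` and the chosen sizes are illustrative, not part of the original analysis.

```r
# Coefficients copied from summary(model_screen) above
intercept    <- 644.43
slope_inches <- 32.65
resid_se     <- 699.5   # residual standard error, in euros

# Hypothetical helper: predicted price for a given screen size
predict_price_from_inches <- function(inches) {
  intercept + slope_inches * inches
}

sizes <- c(13.3, 15.6, 17.3)
predicted <- predict_price_from_inches(sizes)
data.frame(inches = sizes, predicted_price = round(predicted, 2))
```

The predictions span only about €130 between the smallest and largest size, while the residual standard error is close to €700, so the line barely separates cheap laptops from expensive ones.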

Linear Regression for RAM vs Price

ram_model <- lm(Price_euros ~ Ram, data = laptop_data)
summary(ram_model)
## 
## Call:
## lm(formula = Price_euros ~ Ram, data = laptop_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2813.72  -297.59   -94.07   244.39  2859.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   276.03      25.54   10.81   <2e-16 ***
## Ram           101.76       2.59   39.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471.3 on 1273 degrees of freedom
## Multiple R-squared:  0.548,  Adjusted R-squared:  0.5477 
## F-statistic:  1544 on 1 and 1273 DF,  p-value: < 2.2e-16
# RAM vs Price
ggplot(laptop_data, aes(x = Ram, y = Price_euros)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Effect of RAM on Price", x = "RAM (GB)", y = "Price (€)")
## `geom_smooth()` using formula = 'y ~ x'

t-value (39.29) and p-value (< 2e-16): strong evidence that RAM is a statistically significant predictor of price.

R-squared (0.548): RAM explains approximately 54.8% of the variability in laptop prices.

F-statistic (1544) with p-value (< 2.2e-16): the overall model is highly significant.

Each additional GB of RAM is associated with a price increase of approximately €101.76, so higher RAM configurations command a noticeable price premium.
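
As a worked example of the fitted line, the sketch below evaluates the model at a few RAM sizes using the coefficients printed above, and recovers the implied correlation from the R-squared. The RAM sizes chosen are illustrative.

```r
# Coefficients copied from summary(ram_model) above
intercept <- 276.03
slope_ram <- 101.76

# Predicted price for common RAM sizes (GB)
ram_gb    <- c(4, 8, 16, 32)
predicted <- intercept + slope_ram * ram_gb

# Correlation implied by the R-squared (positive because the slope is positive)
r_ram <- sqrt(0.548)

data.frame(ram_gb, predicted_price = predicted)
```

One caution on extrapolation: at the maximum RAM in the data (64 GB) the same line predicts about €6,789, which is above the highest observed price (€6,099), so the linear fit is probably less reliable at the extremes.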

Linear Regression for CPU vs Price

# ANOVA for CPU Company
cpu_model <- lm(Price_euros ~ factor(CPU_company), data = laptop_data)
summary(cpu_model)
## 
## Call:
## lm(formula = Price_euros ~ factor(CPU_company), data = laptop_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -989.7 -499.7 -134.7  335.3 4935.3 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  560.99      89.01   6.303 4.02e-10 ***
## factor(CPU_company)Intel     602.74      91.18   6.610 5.62e-11 ***
## factor(CPU_company)Samsung    98.01     695.16   0.141    0.888    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 689.4 on 1272 degrees of freedom
## Multiple R-squared:  0.03356,    Adjusted R-squared:  0.03204 
## F-statistic: 22.09 on 2 and 1272 DF,  p-value: 3.717e-10
Anova(cpu_model)  # Type II ANOVA for categorical variable
## Anova Table (Type II tests)
## 
## Response: Price_euros
##                        Sum Sq   Df F value    Pr(>F)    
## factor(CPU_company)  20997536    2  22.088 3.717e-10 ***
## Residuals           604605352 1272                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# CPU vs Price
ggplot(laptop_data, aes(x = factor(CPU_company), y = Price_euros)) +
  geom_boxplot() +
  labs(title = "Effect of CPU Company on Price", x = "CPU Company", y = "Price (€)")

Intel coefficient (602.74): laptops with Intel CPUs are, on average, €602.74 more expensive than those from the baseline CPU company (the alphabetically first factor level, presumably AMD), and the difference is highly significant.

Samsung coefficient (98.01): laptops with Samsung CPUs are, on average, €98.01 more expensive than the baseline, but this is not statistically significant (p = 0.888). The large standard error (695.16) suggests very few Samsung observations.

Although the overall effect of CPU company is statistically significant (F = 22.09, p = 3.7e-10), it is a weak predictor of laptop price, explaining only 3.4% of the variability.
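
With a single categorical predictor, the coefficients are just differences between group means, so the means can be recovered from the output. The sketch below does this with the printed coefficients, and also backs out the size of the baseline group from the intercept's standard error. Variable names are illustrative, and the group-size step assumes the usual one-factor result that the intercept's standard error equals the residual standard error divided by the square root of the baseline group size.

```r
# Coefficients copied from summary(cpu_model) above
intercept    <- 560.99   # mean price of the baseline CPU company
coef_intel   <- 602.74
coef_samsung <- 98.01

# Group means implied by the dummy-coded coefficients
group_means <- c(
  baseline = intercept,
  Intel    = intercept + coef_intel,
  Samsung  = intercept + coef_samsung
)

# Assumption: SE(intercept) = sigma / sqrt(n_baseline) in a one-factor model,
# so the baseline group size can be estimated from two printed values
resid_se     <- 689.4
se_intercept <- 89.01
n_baseline   <- round((resid_se / se_intercept)^2)

group_means
n_baseline
```

The estimate of about 60 baseline laptops out of 1,275 helps explain why CPU company carries so little explanatory power: nearly all laptops fall in the Intel group.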

#Linear Regression for GPU vs Price

# ANOVA for GPU Company
gpu_model <- lm(Price_euros ~ factor(GPU_company), data = laptop_data)
summary(gpu_model)
## 
## Call:
## lm(formula = Price_euros ~ factor(GPU_company), data = laptop_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1037.7  -483.1  -140.4   367.1  4602.3 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 778.03      49.51  15.714  < 2e-16 ***
## factor(GPU_company)ARM     -119.03     654.97  -0.182    0.856    
## factor(GPU_company)Intel    242.34      55.29   4.383 1.27e-05 ***
## factor(GPU_company)Nvidia   718.72      59.40  12.099  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 653.1 on 1271 degrees of freedom
## Multiple R-squared:  0.1334, Adjusted R-squared:  0.1314 
## F-statistic: 65.23 on 3 and 1271 DF,  p-value: < 2.2e-16
Anova(gpu_model)  # Type II ANOVA for categorical variable
## Anova Table (Type II tests)
## 
## Response: Price_euros
##                        Sum Sq   Df F value    Pr(>F)    
## factor(GPU_company)  83470726    3  65.231 < 2.2e-16 ***
## Residuals           542132161 1271                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# GPU vs Price
ggplot(laptop_data, aes(x = factor(GPU_company), y = Price_euros)) +
  geom_boxplot() +
  labs(title = "Effect of GPU Company on Price", x = "GPU Company", y = "Price (€)")

Nvidia GPUs command the largest price premium (€718.72) relative to the baseline (AMD), followed by Intel GPUs (€242.34). Both effects are highly significant.

ARM GPUs are associated with a lower price than the baseline (€119.03 less), but the effect is not significant (p = 0.856).

Overall impact: GPU company explains 13.3% of the price variability, making it a moderate predictor.
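
The 13.3% figure and the group means can both be recovered from the printed output. The sketch below rebuilds R-squared from the Type II ANOVA sums of squares and converts the coefficients into mean prices per GPU company. The baseline is taken to be AMD, since it is the only GPU company in the data without its own coefficient. Variable names are illustrative.

```r
# Sums of squares copied from the Anova() table above
ss_gpu      <- 83470726
ss_residual <- 542132161

# R-squared rebuilt as explained / total sum of squares
r_squared_gpu <- ss_gpu / (ss_gpu + ss_residual)

# Group means implied by the dummy-coded coefficients (baseline: AMD)
intercept <- 778.03
gpu_means <- intercept + c(AMD = 0, ARM = -119.03, Intel = 242.34, Nvidia = 718.72)

round(r_squared_gpu, 4)
gpu_means
```

Reading the coefficients as group means makes the comparison easier to communicate: roughly €778 for AMD, €1,020 for Intel, and €1,497 for Nvidia.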

#Conclusion

Touchscreen: statistically significant, with a moderate mean difference (€391.17), but the t-test provides no direct measure of variability explained.

Brand: statistically significant, explaining about 16.4% of price variability (the between-brand sum of squares divided by the total), which is far less than RAM.

RAM: explains the most variability (54.8%) in laptop prices and has a strong linear effect (€101.76 per GB).

RAM is the strongest single factor influencing laptop prices, as it explains the most variability in price compared to touchscreen and brand. Brand also has a notable influence, especially when considering differences across premium brands. Touchscreen has a smaller but statistically significant impact, adding a premium for the feature.
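
To put the single-predictor results side by side, the sketch below ranks the predictors by the share of price variance each explains on its own. The values are copied from the R-squared lines and ANOVA tables above; touchscreen is omitted because the t-test reports no comparable measure. The vector name `variance_explained` is illustrative.

```r
# Share of price variance explained by each single-predictor model,
# taken from the R-squared values and ANOVA tables reported above
variance_explained <- c(
  Screen_size = 0.004437,
  CPU_company = 0.03356,
  GPU_company = 0.1334,
  Brand       = 102648747 / (102648747 + 522954141),  # eta-squared from the ANOVA
  RAM         = 0.548
)

ranking <- sort(variance_explained, decreasing = TRUE)
round(100 * ranking, 1)   # percentages, largest first
```

These shares come from separate models, so they overlap and do not add up. A multiple regression such as `lm(Price_euros ~ Ram + Company + Touchscreen, data = laptop_data)` would be a natural next step toward the predictive objective; it is not fitted in this report.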