The dataset titled “Uncleaned Laptop Price Dataset,” which I referenced in Assignment #5, is accessible on Kaggle. Utilizing this dataset, I aim to investigate the research question: “How do RAM, CPU, screen size (inches), and weight influence the price of laptops, and are there significant differences across brands?” This analysis will contribute to a deeper understanding of the factors affecting laptop pricing in the current market.

The first chunk imports the CSV file and then removes the first column, labeled ‘Unnamed: 0’, which is a filler index and does not provide any meaningful data. Rows containing missing values are also dropped, along with any duplicate entries, to ensure the integrity and quality of the dataset.

knitr::opts_chunk$set(echo = TRUE)
options(repos = c(CRAN = "https://cloud.r-project.org/"))  # Set CRAN mirror
# Install only the packages that are missing; reinstalling a package that is
# already loaded in the session typically fails on Windows with "Permission denied"
required_packages <- c("readr", "dplyr", "tidyr")
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) install.packages(missing_packages)
library(readr); library(dplyr);library(tidyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# import csv 
laptopData <- read_csv("laptopData.csv")
## Rows: 1303 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Company, TypeName, Inches, ScreenResolution, Cpu, Ram, Memory, Gpu...
## dbl  (2): Unnamed: 0, Price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(laptopData)

# drop first column, n/a rows, and duplicates 
laptop_data <- laptopData %>%
  select(-`Unnamed: 0`) %>%
  drop_na() %>%
  distinct()

View(laptop_data)

The data needs to be prepared for analysis by converting certain columns to numeric format. First, the code removes rows whose Inches value is a question mark (“?”), converts the remaining values to numeric, and replaces any NA entries with 0. The Ram column is processed directly with parse_number to extract the numeric value. The Weight column is first converted to character, then parsed to numeric, and rows where parsing yields NA are filtered out. The code also extracts CPU speed (in GHz) from the Cpu column using a regular expression and creates a new MemorySize column from the leading number in the Memory column when the entry contains “GB”; entries without “GB” (for example, “1TB HDD”) become NA. During this process, I considered converting the data to long format with pivot_longer for a potential linear regression analysis; however, I determined that keeping the data in wide format would be more suitable for the intended analysis. Finally, the original Cpu and Memory columns are removed.
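As an alternative to filtering out the “?” placeholders after import, they could be declared as missing values when the file is read, so that the same NA-dropping step handles them. The sketch below uses base R's read.csv on a small made-up table; the rows and the “kg” suffix are assumptions for illustration, not values taken from the dataset. readr's read_csv accepts a similar `na` argument.

```r
# Toy CSV text standing in for the real file (hypothetical rows)
raw_csv <- "Company,Inches,Weight
Apple,13.3,1.37kg
HP,?,1.86kg
Dell,15.6,?"

# Treat "?" as NA at import time, alongside the usual empty string and "NA"
toy <- read.csv(text = raw_csv,
                na.strings = c("", "NA", "?"),
                stringsAsFactors = FALSE)

# Inches is read as numeric because "?" is no longer a stray string;
# Weight still needs its unit suffix stripped before conversion
toy$Weight <- as.numeric(sub("kg$", "", toy$Weight))

# Drop every row that has a missing value in any column
toy_clean <- toy[complete.cases(toy), ]
print(toy_clean)
```

With this approach no parsing-failure warning is raised, because the placeholder never reaches the number parser.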


# Converting each column to numeric separately. Doing it all together deleted every observation, because some columns need their non-numeric characters removed and parsed before converting to numeric.
library(dplyr)
library(readr)

# inches 
laptop_info = laptop_data %>%
  filter(Inches != "?") %>%  # remove "?"
  mutate(
    Inches = as.numeric(as.character(Inches)) 
  ) %>%
  mutate(
    Inches = ifelse(is.na(Inches), 0, Inches)  # replace NA with 0
  )

# ram 
laptop_info = laptop_info %>%
  mutate(
    Ram = parse_number(Ram)  
  )

# weight - was giving parsing issues; make sure it is defined as character before turning it to numeric
laptop_info = laptop_info %>%
  mutate(
    Weight = as.character(Weight),  
    Weight = parse_number(Weight)    
  ) %>%
  filter(!is.na(Weight))  # keep valid numeric Weight
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Weight = parse_number(Weight)`.
## Caused by warning:
## ! 1 parsing failure.
## row col expected actual
## 202  -- a number      ?
# remove NA in inches or weight
laptop_info = laptop_info %>%
  filter(!is.na(Inches) & !is.na(Weight))

# cpu
laptop_info <- laptop_info %>%
  mutate(
    CPUSpeed = parse_number(gsub(".*? (\\d+\\.?\\d*)GHz.*", "\\1", Cpu))  # Extracting CPU speed in GHz
  ) 

#memory - giving issues with mutate
unique_memory_values <- unique(laptop_info$Memory)
print(unique_memory_values)
##  [1] "128GB SSD"                     "128GB Flash Storage"          
##  [3] "256GB SSD"                     "512GB SSD"                    
##  [5] "500GB HDD"                     "256GB Flash Storage"          
##  [7] "1TB HDD"                       "128GB SSD +  1TB HDD"         
##  [9] "256GB SSD +  256GB SSD"        "64GB Flash Storage"           
## [11] "32GB Flash Storage"            "256GB SSD +  1TB HDD"         
## [13] "256GB SSD +  2TB HDD"          "32GB SSD"                     
## [15] "2TB HDD"                       "64GB SSD"                     
## [17] "1.0TB Hybrid"                  "512GB SSD +  1TB HDD"         
## [19] "1TB SSD"                       "256GB SSD +  500GB HDD"       
## [21] "128GB SSD +  2TB HDD"          "512GB SSD +  512GB SSD"       
## [23] "16GB SSD"                      "16GB Flash Storage"           
## [25] "512GB SSD +  256GB SSD"        "512GB SSD +  2TB HDD"         
## [27] "64GB Flash Storage +  1TB HDD" "180GB SSD"                    
## [29] "1TB HDD +  1TB HDD"            "32GB HDD"                     
## [31] "1TB SSD +  1TB HDD"            "?"                            
## [33] "512GB Flash Storage"           "128GB HDD"                    
## [35] "240GB SSD"                     "8GB SSD"                      
## [37] "508GB Hybrid"                  "1.0TB HDD"                    
## [39] "512GB SSD +  1.0TB Hybrid"     "256GB SSD +  1.0TB Hybrid"
laptop_info <- laptop_info %>%
  mutate(
    # Extract Memory size with various formats
    MemorySize = case_when(
      grepl("GB", Memory) ~ parse_number(gsub("GB.*", "", Memory)),  # Extract size if it contains "GB"
      TRUE ~ NA_real_  # NA for unexpected formats
    )
  ) 
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `MemorySize = case_when(...)`.
## Caused by warning:
## ! 1 parsing failure.
## row col expected actual
## 748  -- a number      ?
# remove the original columns 
laptop_info <- laptop_info %>%
  select(-Cpu, -Memory)

str(laptop_info)
## tibble [1,242 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Company         : chr [1:1242] "Apple" "Apple" "HP" "Apple" ...
##  $ TypeName        : chr [1:1242] "Ultrabook" "Ultrabook" "Notebook" "Ultrabook" ...
##  $ Inches          : num [1:1242] 13.3 13.3 15.6 15.4 13.3 15.6 15.4 13.3 14 14 ...
##  $ ScreenResolution: chr [1:1242] "IPS Panel Retina Display 2560x1600" "1440x900" "Full HD 1920x1080" "IPS Panel Retina Display 2880x1800" ...
##  $ Ram             : num [1:1242] 8 8 8 16 8 4 16 8 16 8 ...
##  $ Gpu             : chr [1:1242] "Intel Iris Plus Graphics 640" "Intel HD Graphics 6000" "Intel HD Graphics 620" "AMD Radeon Pro 455" ...
##  $ OpSys           : chr [1:1242] "macOS" "macOS" "No OS" "macOS" ...
##  $ Weight          : num [1:1242] 1.37 1.34 1.86 1.83 1.37 2.1 2.04 1.34 1.3 1.6 ...
##   ..- attr(*, "problems")= tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
##   .. ..$ row     : int 202
##   .. ..$ col     : int NA
##   .. ..$ expected: chr "a number"
##   .. ..$ actual  : chr "?"
##  $ Price           : num [1:1242] 71379 47896 30636 135195 96096 ...
##  $ CPUSpeed        : num [1:1242] 2.3 1.8 2.5 2.7 3.1 3 2.2 1.8 1.8 1.6 ...
##  $ MemorySize      : num [1:1242] 128 128 256 512 256 500 256 256 512 256 ...
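The MemorySize column above keeps only the first “GB” figure and is NA for drives listed in terabytes. A possible alternative, not the method used in this report, is sketched below: a hypothetical helper, `parse_storage_gb`, that converts every drive in an entry to gigabytes and sums them. Treating 1 TB as 1000 GB and adding multiple drives together are both assumptions made for the sketch.

```r
# Hypothetical helper: total storage in GB for entries such as
# "256GB SSD +  1TB HDD". Assumes 1 TB = 1000 GB and sums all drives.
parse_storage_gb <- function(memory) {
  vapply(memory, function(entry) {
    # Pull out every "<number>GB" or "<number>TB" token in the entry
    parts <- regmatches(entry,
                        gregexpr("[0-9.]+\\s*(GB|TB)", entry, perl = TRUE))[[1]]
    if (length(parts) == 0) return(NA_real_)  # e.g. "?" has no size token

    size <- as.numeric(sub("\\s*(GB|TB)$", "", parts, perl = TRUE))
    unit <- sub("^[0-9.]+\\s*", "", parts, perl = TRUE)

    # Convert terabytes to gigabytes, then add the drives together
    sum(ifelse(unit == "TB", size * 1000, size))
  }, numeric(1), USE.NAMES = FALSE)
}

# A few of the Memory values printed above
example_memory <- c("128GB SSD", "1TB HDD", "256GB SSD +  1TB HDD",
                    "?", "1.0TB Hybrid")
storage_gb <- parse_storage_gb(example_memory)
print(storage_gb)
```

Whether summing an SSD and an HDD into one number is appropriate depends on the question being asked; separate columns per drive type would be another reasonable design.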
# computer brand percentage
company_count <- laptop_info %>%
  group_by(Company) %>%
  summarize(Count = n()) %>%
  mutate(Percentage = (Count / sum(Count)) * 100) %>%
  arrange(desc(Count))
print(company_count)
## # A tibble: 19 × 3
##    Company   Count Percentage
##    <chr>     <int>      <dbl>
##  1 Lenovo      282     22.7  
##  2 Dell        279     22.5  
##  3 HP          260     20.9  
##  4 Asus        149     12.0  
##  5 Acer        101      8.13 
##  6 MSI          53      4.27 
##  7 Toshiba      47      3.78 
##  8 Apple        21      1.69 
##  9 Samsung       9      0.725
## 10 Mediacom      7      0.564
## 11 Razer         7      0.564
## 12 Microsoft     6      0.483
## 13 Vero          4      0.322
## 14 Xiaomi        4      0.322
## 15 Chuwi         3      0.242
## 16 Google        3      0.242
## 17 LG            3      0.242
## 18 Fujitsu       2      0.161
## 19 Huawei        2      0.161
#cpu percentage 
cpu_count <- laptop_info %>%
  group_by(CPUSpeed) %>%
  summarize(Count = n()) %>%
  mutate(Percentage = (Count / sum(Count)) * 100) %>%
  arrange(desc(Count))
print(cpu_count)
## # A tibble: 25 × 3
##    CPUSpeed Count Percentage
##       <dbl> <int>      <dbl>
##  1      2.5   278      22.4 
##  2      2.8   160      12.9 
##  3      2.7   158      12.7 
##  4      1.6   118       9.50
##  5      2      84       6.76
##  6      2.3    84       6.76
##  7      1.8    76       6.12
##  8      2.6    73       5.88
##  9      1.1    53       4.27
## 10      2.4    50       4.03
## # ℹ 15 more rows
# ram percentage 
ram_count <- laptop_info %>%
  group_by(Ram) %>%
  summarize(Count = n()) %>%
  mutate(Percentage = (Count / sum(Count)) * 100) %>%
  arrange(desc(Count))
print(ram_count)
## # A tibble: 10 × 3
##      Ram Count Percentage
##    <dbl> <int>      <dbl>
##  1     8   593    47.7   
##  2     4   358    28.8   
##  3    16   192    15.5   
##  4     6    34     2.74  
##  5    12    25     2.01  
##  6    32    17     1.37  
##  7     2    16     1.29  
##  8    24     3     0.242 
##  9    64     3     0.242 
## 10     1     1     0.0805
#visuals 
# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)
library(dplyr)

# Top 5 companies
top_companies <- company_count %>%
  slice_head(n = 5)
ggplot(top_companies, aes(x = reorder(Company, -Percentage), y = Percentage)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Top 5 Laptop Companies by Percentage", x = "Company", y = "Percentage (%)") +
  theme_minimal()

# Top 5 CPUs
top_cpus <- cpu_count %>%
  slice_head(n = 5)
ggplot(top_cpus, aes(x = reorder(CPUSpeed, -Percentage), y = Percentage)) +
  geom_bar(stat = "identity", fill = "plum") +
  labs(title = "Top 5 CPUs by Percentage", x = "CPU Speed", y = "Percentage (%)") +
  theme_minimal()

# Top 5 RAM sizes
top_rams <- ram_count %>%
  slice_head(n = 5)
ggplot(top_rams, aes(x = reorder(Ram, -Percentage), y = Percentage)) +
  geom_bar(stat = "identity", fill = "maroon") +
  labs(title = "Top 5 RAM Sizes by Percentage", x = "RAM (GB)", y = "Percentage (%)") +
  theme_minimal()

Looking at how the laptops in the dataset are distributed, clear patterns emerge in brand, CPU speed, and RAM configuration. Lenovo accounts for the largest share at 22.71%, closely followed by Dell at 22.46% and HP at 20.93%. Asus and Acer account for 12.00% and 8.13%, respectively. For CPU speed, the most common value is 2.50 GHz, representing 22.38% of the dataset, followed by 2.80 GHz and 2.70 GHz at 12.88% and 12.72%. For RAM, 8 GB is the most common size at 47.75% of laptops, with 4 GB next at 28.82% and 16 GB at 15.46%. These figures count rows in the dataset, not units sold, so they describe which configurations are most common among the listed laptops rather than measuring consumer preference directly.
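The three percentage tables are built with the same steps, so the calculation could be wrapped in one function. The sketch below is a base R version under that assumption; `share_table` is a hypothetical name and the brand vector is made-up toy data, not the laptop dataset.

```r
# Hypothetical helper: count each distinct value of one column and
# express the counts as percentages of all rows, largest first
share_table <- function(data, column) {
  counts <- table(data[[column]])
  out <- data.frame(Value = names(counts),
                    Count = as.integer(counts),
                    stringsAsFactors = FALSE)
  out$Percentage <- out$Count / sum(out$Count) * 100
  out <- out[order(-out$Count), ]  # ties keep their original order
  rownames(out) <- NULL
  out
}

# Toy data for illustration only
toy_laptops <- data.frame(Company = c("Dell", "HP", "Dell", "Lenovo"),
                          stringsAsFactors = FALSE)
company_share <- share_table(toy_laptops, "Company")
print(company_share)
```

In the report itself the same function could be called once each for Company, CPUSpeed, and Ram.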

Having gained insights into the popular laptop brands, CPU speeds, and RAM configurations, it is now essential to examine the correlations between these variables for a more comprehensive understanding.

# Correlation matrix for numeric variables
cor_matrix <- cor(laptop_info %>% select(Price, Ram, CPUSpeed, Inches, Weight), use = "complete.obs")

#install.packages("ggcorrplot")
# Visualize the correlation matrix
library(ggcorrplot)
ggcorrplot(cor_matrix, method = "circle", type = "lower", lab = TRUE)

Since I am interested in how laptop characteristics such as RAM, CPU speed, screen size (inches), and weight affect price, I focus on the correlations between price and each of these attributes. The most notable finding is the fairly strong correlation between price and RAM, with a coefficient of 0.683. As RAM increases, the price of laptops tends to rise, which is consistent with consumers paying a premium for laptops with more RAM.

The correlation between price and CPU speed is also positive, at 0.427, indicating a moderate relationship. This finding suggests that laptops equipped with faster CPUs generally command higher prices, underscoring the importance of processing power in consumer purchasing decisions.

In contrast, the correlations between price and the other two variables—screen size (inches) and weight—are relatively weak. The minimal correlation with inches (0.043) suggests that screen size does not significantly influence laptop prices in this dataset. Similarly, the correlation with weight (0.177) is low, indicating that variations in weight have little impact on pricing. Overall, these insights highlight the strong influence of RAM and CPU speed on laptop prices while suggesting that screen size and weight are less critical factors in consumer decision-making.
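A correlation coefficient on its own does not say whether a small value is distinguishable from zero. One way to check is base R's cor.test, which reports a p-value and a confidence interval for each pair. The sketch below runs it on simulated data; the data frame, its column values, and the effect sizes are all invented for illustration and are not results from the laptop dataset.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in data: price depends on RAM but not on screen size
n <- 200
toy <- data.frame(
  Ram    = sample(c(4, 8, 16), n, replace = TRUE),
  Inches = round(runif(n, min = 11, max = 17), 1)
)
toy$Price <- 20000 + 4000 * toy$Ram + rnorm(n, sd = 15000)

# Test the correlation of Price with each predictor in turn
predictors <- c("Ram", "Inches")
cor_summary <- do.call(rbind, lapply(predictors, function(p) {
  test <- cor.test(toy$Price, toy[[p]])  # Pearson correlation by default
  data.frame(Predictor = p,
             r         = unname(test$estimate),
             p_value   = test$p.value,
             ci_low    = test$conf.int[1],
             ci_high   = test$conf.int[2])
}))
print(cor_summary)
```

Applied to the real data, the same loop would run over Ram, CPUSpeed, Inches, and Weight.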

After conducting a correlation analysis, we can proceed with running a linear regression.

# fit a linear regression model
linear_model <- lm(Price ~ Ram + CPUSpeed + Inches + Weight, data = laptop_info)
summary(linear_model)
## 
## Call:
## lm(formula = Price ~ Ram + CPUSpeed + Inches + Weight, data = laptop_info)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -234895  -14885   -4671   11845  161184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9317.5     6346.3   1.468 0.142311    
## Ram           4157.9      143.1  29.062  < 2e-16 ***
## CPUSpeed     18075.8     1585.2  11.403  < 2e-16 ***
## Inches       -1531.5      431.2  -3.552 0.000397 ***
## Weight       -1296.5     1083.5  -1.197 0.231717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25950 on 1237 degrees of freedom
## Multiple R-squared:  0.5201, Adjusted R-squared:  0.5186 
## F-statistic: 335.2 on 4 and 1237 DF,  p-value: < 2.2e-16

The model explains approximately 52% of the variance in laptop prices (R-squared = 0.5201). The linear regression shows that RAM and CPU speed are significant positive predictors of laptop price, with coefficients of 4,157.9 and 18,075.8, respectively, both with p-values below 2e-16. This is consistent with consumers paying more for laptops with more RAM and faster CPUs, features that are often sought after for better performance. Conversely, screen size (inches) has a significant negative coefficient of -1,531.5 (p = 0.0004), meaning that, with the other variables held constant, larger screens are associated with lower prices. Weight is not a statistically significant predictor (p = 0.2317), suggesting it does not meaningfully affect price once the other variables are accounted for. Overall, these results highlight the importance of RAM and CPU speed. I am still unsure how meaningful the model is, because the intercept is 9,317.5: when RAM, CPU speed, inches, and weight are all zero, a laptop would cost $9,317.5, which is not logical. No laptop in the data has zero RAM (the smallest value is 1 GB), however, so the intercept is an extrapolation rather than a realistic price, and it is not statistically significant (p = 0.142). As for the slopes, with the other variables held constant, each additional unit of RAM (1 GB) raises the predicted price by $4,157.9, each additional unit of CPU speed (1 GHz) raises it by $18,075.8, and so forth.
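To make the coefficients concrete, the arithmetic behind a single prediction can be worked through by hand. The sketch below uses the coefficients reported in the summary above and a hypothetical laptop (8 GB RAM, 2.5 GHz, 15.6 inches, weight 2.0 in the dataset's units); that configuration is an assumption chosen for illustration.

```r
# Coefficients copied from the regression summary above
coefficients_reported <- c(Intercept = 9317.5,
                           Ram       = 4157.9,
                           CPUSpeed  = 18075.8,
                           Inches    = -1531.5,
                           Weight    = -1296.5)

# Hypothetical laptop; the leading 1 multiplies the intercept
hypothetical_laptop <- c(Intercept = 1,
                         Ram       = 8,
                         CPUSpeed  = 2.5,
                         Inches    = 15.6,
                         Weight    = 2.0)

# Predicted price = intercept + sum of (coefficient * value)
predicted_price <- sum(coefficients_reported * hypothetical_laptop)
print(predicted_price)  # 61285.8

# Doubling RAM from 8 to 16 with everything else fixed changes the
# prediction by 8 * 4157.9
ram_upgrade <- hypothetical_laptop
ram_upgrade["Ram"] <- 16
price_difference <- sum(coefficients_reported * ram_upgrade) - predicted_price
print(price_difference)  # 33263.2
```

With the fitted model object available, predict(linear_model, newdata = ...) would give the same kind of prediction without copying the coefficients by hand.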

Reference: Akinwande, A. 2020. ggcorrplot: Visualization of a Correlation Matrix. CRAN. https://cran.r-project.org/web/packages/ggcorrplot/readme/README.html.

ChatGPT prompts: Received warning: “! 1 parsing failure. row col expected actual 202 -- a number ?” What causes a parsing failure, and what are some possible solutions?