1 Columns that are unclear

CPU_model

#Unclear elements: The CPU_model column doesn’t specify whether “CPU_model” refers to just the processor family (e.g., Intel Core i5) or to a specific model (e.g., Intel Core i5-7200U).

#Reason for encoding: The column may be intended to differentiate between processor versions within the same model line.

#Consequences: If only the family is recorded, we could fail to distinguish between different generations or performance characteristics.
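#A minimal sketch of how this ambiguity could be checked. The vector below holds made-up example values, not rows from the dataset, and the rule "a specific model ends in a four-digit token such as 7200U" is an assumption that may not cover every naming scheme.

```r
# Hypothetical CPU_model values; the real column may be formatted differently.
cpu_model <- c("Core i5", "Core i5 7200U", "Core i7 8550U")

# Assumption: a specific model ends in a token with four digits plus
# optional capital letters, e.g. "7200U".
has_model_number <- grepl("[0-9]{4}[A-Z]*$", cpu_model)

# Family = the name with that trailing model token removed.
cpu_family <- trimws(sub("[ -][0-9]{4}[A-Z]*$", "", cpu_model))

data.frame(cpu_model, cpu_family, has_model_number)
```

#If has_model_number is FALSE for many rows, the column mixes family-level and model-level entries and should be handled with care.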

GPU_model

#Unclear elements: The GPU_model column is not specified clearly. For example, it contains both Iris Pro Graphics and Radeon 530, yet Iris Pro Graphics is made by Intel and Radeon 530 by AMD. The column doesn’t state the manufacturer, the generation, or whether the graphics are integrated or discrete, and the naming and level of detail are inconsistent.

#Reason for encoding: Different models within a GPU family can have drastically different performance characteristics, so knowing the exact GPU model is essential for evaluating a laptop’s graphical capabilities.

#Consequences: Users could mistakenly assume that all entries labeled “HD Graphics” have similar performance, leading to faulty conclusions, especially when comparing laptop prices or performance benchmarks. Without knowing whether a GPU is integrated or discrete, users might incorrectly compare laptops based on graphics power.

PrimaryStorageType, SecondaryStorageType

#Unclear elements: These columns contain storage types like “SSD” and “Flash Storage,” but what does “No” mean in the context of secondary storage? Does it indicate the absence of a secondary drive, or something else?

#Reason for encoding: The dataset might simplify storage details by recording the absence of secondary storage as “No,” but this could cause confusion.

#Consequences: Users might read “No” in different ways, such as no storage at all or a specific type of storage, which could result in incorrect conclusions about laptop storage configurations.
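#One way to make the meaning of “No” explicit is sketched below. The values are toy stand-ins for the SecondaryStorageType column, and treating “No” as "no secondary drive" is an assumption that the documentation should confirm.

```r
# Toy values standing in for the SecondaryStorageType column.
secondary <- c("No", "HDD", "No", "SSD")

# Count how often each label occurs, to see how common "No" is.
table(secondary)

# Assumption: "No" means the laptop has no secondary drive.
has_secondary <- secondary != "No"

# Store the absence as a missing value instead of as a storage type.
secondary_clean <- ifelse(has_secondary, secondary, NA_character_)
```

#Keeping a separate has_secondary flag avoids treating “No” as if it were one more storage technology when grouping or plotting.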

Touchscreen, IPSpanel, RetinaDisplay

#Unclear elements: These columns indicate whether a laptop has a “Touchscreen,” “IPSpanel,” or “RetinaDisplay” using Yes/No values.

#Touchscreen: The documentation does not clarify whether “Yes” means a fully functional multi-touch screen or just basic touch capability.

#IPSpanel: It’s unclear whether this column says anything about IPS panel quality (e.g., color accuracy, viewing angles) or only records that the panel uses IPS technology.

#RetinaDisplay: The documentation doesn’t specify what qualifies as a “Retina Display” (e.g., a certain pixel density threshold or resolution), which matters because different brands may use this term differently.

#Reason for encoding: The columns are encoded this way to indicate whether a feature is present, which simplifies storing and querying the data. Had I not checked the documentation, I might have misinterpreted the Yes/No values or assumed some other binary encoding.

#Consequences: Without clarity, any analysis involving screen features might yield misleading results.
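#A short sketch of how the Yes/No columns could be validated and converted before analysis. The data frame is a made-up three-row example, and the assumption is that the real columns contain only the strings "Yes" and "No".

```r
# Made-up example rows using the three column names from the dataset.
screen <- data.frame(
  Touchscreen   = c("Yes", "No", "No"),
  IPSpanel      = c("Yes", "Yes", "No"),
  RetinaDisplay = c("No", "No", "No"),
  stringsAsFactors = FALSE
)
flag_cols <- c("Touchscreen", "IPSpanel", "RetinaDisplay")

# Count values that are neither "Yes" nor "No" in each column.
unexpected <- sapply(screen[flag_cols], function(x) sum(!x %in% c("Yes", "No")))

# Convert the flags to logical so they cannot be mistaken for categories.
screen[flag_cols] <- lapply(screen[flag_cols], function(x) x == "Yes")

# Number of laptops with each feature.
sapply(screen[flag_cols], sum)
```

#A non-zero entry in unexpected would show that the column is not strictly binary and needs a closer look.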

2

#Even after reading the documentation, the GPU_model column remains unclear. It doesn’t specify the generation or whether the GPU is integrated or discrete. It would be better to split GPU_model into separate columns, such as GPU_gen, an integrated/discrete flag, and, if possible, the manufacturer.
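#The proposed split could look roughly like the sketch below. guess_manufacturer is a hypothetical helper, the keyword rules are assumptions based on common product names, and names that match no rule stay NA. Deriving a generation column is left out because it cannot be read reliably from the name alone.

```r
# A few GPU_model values mentioned in this report.
gpu_model <- c("Iris Pro Graphics", "Radeon 530", "GeForce GTX 1050",
               "HD Graphics 620", "Quadro M1200", "Mali T860 MP4")

# Hypothetical helper: guess the manufacturer from keywords in the name.
guess_manufacturer <- function(x) {
  out <- rep(NA_character_, length(x))
  out[grepl("HD Graphics|Iris", x)] <- "Intel"
  out[grepl("GeForce|Quadro|GTX", x)] <- "Nvidia"
  out[grepl("Radeon|FirePro", x)] <- "AMD"
  out[grepl("Mali", x)] <- "ARM"
  out
}

gpu <- data.frame(
  GPU_model        = gpu_model,
  GPU_manufacturer = guess_manufacturer(gpu_model),
  # Assumption: Intel and ARM parts are integrated, the rest discrete.
  GPU_integrated   = grepl("HD Graphics|Iris|Mali", gpu_model),
  stringsAsFactors = FALSE
)
gpu
```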

3 Visualization

# Load necessary libraries
library(ggplot2)

# Load the dataset
data <- read.csv("~/Documents/statistics(1)/laptop_prices.csv")

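# Classify GPU as Integrated or Discrete.
# The dataset does not label this, so it is a name-based heuristic: Intel
# "HD Graphics"/"UHD Graphics"/"Iris" and ARM "Mali" count as integrated;
# everything else, including unmatched names, falls through to "Discrete".
data$GPU_type <- ifelse(grepl("HD Graphics|Iris|Mali", data$GPU_model), "Integrated", "Discrete")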

# Create the bar chart
ggplot(data, aes(x = GPU_model, y = Price_euros, fill = GPU_type)) +
  geom_bar(stat = "summary", fun = "mean", position = position_dodge(width = 0.9)) +
  labs(title = "Average Laptop Price by GPU Model",
       x = "GPU Model",
       y = "Average Price (Euros)") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 60, hjust = 1, size = 5),  # Adjust angle and size
    plot.title = element_text(hjust = 0.5, size = 14, vjust = 2),  # Center and adjust title
    plot.margin = margin(t = 20, r = 10, b = 10, l = 10)  # Increase top margin for title
  ) + 
  scale_fill_manual(values = c("Integrated" = "lightblue", "Discrete" = "lightgreen")) +
  annotate("text", x = 5, y = max(data$Price_euros, na.rm = TRUE) * 0.9, 
           label = "Lack of detail affects comparison", color = "red", size = 3, vjust = -0.5) +
  annotate("text", x = 8, y = max(data$Price_euros, na.rm = TRUE) * 0.8, 
           label = "Is 'HD Graphics' too generic?", color = "red", size = 3, vjust = -0.5) +
  coord_cartesian(clip = "off")  

#The GPU model labels on the x-axis are somewhat hard to read because there are many of them and they overlap.
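#One possible way to reduce the overlap is to average prices per GPU family instead of per model. The sketch below uses made-up rows and prices, and the family keywords are an assumption taken from the model names listed in this report; names that match none of them are grouped as "Other".

```r
# Made-up rows; the prices are illustrative only.
laptops <- data.frame(
  GPU_model   = c("GeForce GTX 1050", "GeForce 940M", "HD Graphics 620",
                  "HD Graphics 520", "Radeon 530"),
  Price_euros = c(1000, 700, 600, 500, 400),
  stringsAsFactors = FALSE
)

# Assumed family keywords; "UHD Graphics" is listed before "HD Graphics".
family_pattern <- "FirePro|GeForce|UHD Graphics|HD Graphics|Iris|Quadro|Radeon|Mali"

# Extract the first matching keyword, defaulting to "Other".
m <- regexpr(family_pattern, laptops$GPU_model)
laptops$GPU_family <- "Other"
laptops$GPU_family[m > 0] <- regmatches(laptops$GPU_model, m)

# Mean price per family: far fewer bars than one per model.
avg <- aggregate(Price_euros ~ GPU_family, data = laptops, FUN = mean)
avg
```

#The avg data frame could then be plotted with GPU_family on the x-axis, which leaves only a handful of labels to fit.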

#The GPU model families are FirePro, GeForce, Graphic, HD Graphics, Iris Graphics, Quadro, Radeon, and UHD Graphics.

#FirePro - FirePro W4190M, FirePro W6150M, FirePro W5130M

#GeForce - GeForce 150MX, GeForce 920, GeForce 920M, GeForce 930M, GeForce 930MX, GeForce 940M, GeForce 960M, GeForce GT 940MX, GeForce GTX 1050, GeForce GTX 1050 Ti, GeForce GTX 1050M, GeForce GTX 1060, GeForce GTX 1070, GeForce GTX 1070M, GeForce GTX 1080, GeForce GTX 930MX, GeForce GTX 940M, GeForce GTX 940MX, GeForce GTX 950M, GeForce GTX 960M, GeForce GTX 965M, GeForce GTX 970M, GeForce GTX 980, GeForce GTX 980M, GeForce GTX1050 Ti, GeForce GTX1060, GeForce GTX1080, GeForce MX130

#Graphic 620, GTX 980 SLI

#HD Graphics - HD Graphics, HD Graphics 400, HD Graphics 405, HD Graphics 500, HD Graphics 505, HD Graphics 510, HD Graphics 515, HD Graphics 520, HD Graphics 530, HD Graphics 5300, HD Graphics 540, HD Graphics 6000, HD Graphics 615, HD Graphics 620, HD Graphics 630

#Iris Graphics - Iris Graphics 540, Iris Graphics 550, Iris Plus Graphics 640, Iris Graphics 650, Iris Pro Graphics

#Mali T860 MP4

#Quadro - Quadro 3000M, Quadro M1000M, Quadro M1200, Quadro M2000M, Quadro M2200, Quadro M2200M, Quadro M3000M, Quadro M500M, Quadro M520M, Quadro M620, Quadro M620M

#Radeon and its models

#Here light blue marks integrated GPUs and light green marks discrete GPUs.

#Integrated and discrete GPUs are not explicitly labeled in the original data.

#The annotations point out where the lack of specificity in the GPU model (e.g., “HD Graphics”) leads to confusion.

#Here we highlight the inconsistency in the GPU_model column. The main issues are the lack of differentiation between integrated and discrete GPUs and the inconsistent level of detail in GPU model names.

#Risks - Misleading price comparisons: without knowing whether a GPU is integrated or discrete, users could misinterpret the data, for example by assuming that two laptops with “HD Graphics” are comparable in performance, which is not always true.

#Faulty Performance Assumptions - Integrated GPUs like “HD Graphics” can vary in performance across generations, so treating them as equivalent leads to misleading conclusions about laptop capabilities.

#Solution - To reduce these risks, the dataset could be improved by categorizing GPUs more explicitly (e.g., separating integrated vs. discrete and specifying GPU generation).

#Documentation: Clearer definitions of GPU models and their performance implications should be provided in the dataset documentation.