Introduction

Which factors are the strongest predictors of RAM module prices, and how do they contribute to pricing in the consumer memory market? This research project examines the components of RAM module pricing and how different variables affect them. The data set used in this study was taken from GitHub (https://github.com/docyx/pc-part-dataset/blob/main/data/csv/memory.csv) and provides detailed specifications and pricing information for RAM products that are actively available in the retail market. The source data holds 13,553 total observations; after the cleaning described below, 2,906 modules remain for analysis. The 8 primary variables that I examine throughout this project are: price, total capacity in gigabytes, speed in MHz, module count, capacity per module, CAS latency, DDR type (DDR4 or DDR5), and brand name. A RAM module is a circuit board inside a computer that serves as the computer's short-term, high-speed memory. Within the past year, RAM module prices have risen rapidly worldwide, affecting both consumers and producers. Understanding the relationship between RAM specifications and pricing is therefore important for consumers who want to optimize purchasing decisions and for manufacturers who want to understand pricing patterns and maintain a competitive market position. Using multiple linear regression, this study aims to quantify the individual and combined effects of capacity, speed, and other features on retail pricing, providing insight into the structure of the consumer RAM market.

Data Analysis

To prepare the pricing data set for a multiple linear regression, I first cleaned it using dplyr and other tidyverse functions. I used the filter function to remove observations with missing or non-positive prices, ensuring that only valid prices remained. Next, I used mutate to create the variables needed for the analysis (speed_mhz, ddr_type, is_ddr5, brand, module_count, total_capacity_gb, and capacity_per_module). One outlier with an unrealistic total capacity (1,000 GB or more) was then removed with filter, because it distorted the model and made it harder to interpret. Lastly, the select function retained only the variables needed for the model.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(corrplot)

## corrplot 0.95 loaded

library(dplyr)
library(pROC)

## Warning: package 'pROC' was built under R version 4.5.2

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(car)

## Warning: package 'car' was built under R version 4.5.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

RAM_Pricing <- read.csv("C:/Users/nika/Downloads/R/csv/Final_Project.csv")

RAM_clean <- RAM_Pricing |>
  # Keep only rows with a valid, positive price
  filter(!is.na(price), price > 0) |>

  mutate(
    # Convert to numeric right away so the comparisons below are numeric, not string-based
    speed_mhz = as.numeric(str_extract(speed, "\\d{4,5}")),
    # Speed-based heuristic: anything below 4800 MHz is labeled "DDR4",
    # which also captures older generations (e.g. modules at 1066 MHz)
    ddr_type = case_when(
      speed_mhz >= 4800 ~ "DDR5",
      speed_mhz < 4800 ~ "DDR4",
      TRUE ~ NA_character_),
    is_ddr5 = ifelse(ddr_type == "DDR5", 1, 0), # 1 is yes, 0 is no
    brand = word(name, 1)) |> # first word of the product name

  mutate(
    module_count = as.numeric(str_extract(modules, "^\\d+")),
    total_capacity_gb = as.numeric(str_extract(name, "\\d+(?=\\s*GB)")),
    capacity_per_module = total_capacity_gb / module_count) |>

  # Drop the unrealistic capacity outlier (rows with unparsed capacity are dropped too)
  filter(total_capacity_gb < 1000) |>

  mutate(
    price = as.numeric(price),
    cas_latency = as.numeric(cas_latency)) |>

  select(price, total_capacity_gb, speed_mhz, module_count, capacity_per_module, cas_latency, ddr_type, is_ddr5, brand)
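
The extraction patterns in the cleaning step can be checked on a few sample strings. The sketch below is illustrative only: the sample values are hypothetical and assume the raw columns look roughly like "5,6000" for speed and "2,16" for modules, which may not match the actual formatting in the CSV.

```r
library(stringr)

# Hypothetical raw values (assumed format, not copied from the data set)
speed   <- c("5,6000", "4,3200", "2,800")
modules <- c("2,16", "1,8", "2,2")
name    <- c("Corsair Vengeance 32 GB",
             "Kingston FURY Beast 8 GB",
             "Silicon Power Value 4 GB")

# First run of 4 or 5 digits; speeds with fewer digits yield NA
speed_mhz <- as.numeric(str_extract(speed, "\\d{4,5}"))

# Leading digits of the modules field
module_count <- as.numeric(str_extract(modules, "^\\d+"))

# Digits directly followed by optional whitespace and "GB"
total_capacity_gb <- as.numeric(str_extract(name, "\\d+(?=\\s*GB)"))

# First word of the product name; multi-word brands are truncated
brand <- word(name, 1)

data.frame(speed_mhz, module_count, total_capacity_gb, brand)
```

Under these assumptions, a three-digit speed produces NA, which is one possible source of the 59 missing speed_mhz values, and a multi-word brand name is shortened to its first word, which would explain a label such as "Silicon" in the brand counts.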

summary(RAM_clean |> select(price, total_capacity_gb, speed_mhz, cas_latency, module_count))

##      price        total_capacity_gb   speed_mhz     cas_latency   
##  Min.   :   5.0   Min.   :  1.00    Min.   :1066   Min.   : 3.00  
##  1st Qu.:  50.8   1st Qu.: 16.00    1st Qu.:2400   1st Qu.:16.00  
##  Median : 100.3   Median : 32.00    Median :3600   Median :19.00  
##  Mean   : 166.9   Mean   : 35.21    Mean   :4145   Mean   :23.73  
##  3rd Qu.: 180.0   3rd Qu.: 32.00    3rd Qu.:6000   3rd Qu.:34.00  
##  Max.   :3737.5   Max.   :512.00    Max.   :8400   Max.   :52.00  
##                                     NA's   :59                    
##   module_count  
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :2.012  
##  3rd Qu.:2.000  
##  Max.   :8.000  
##

top_brands <- RAM_clean |>
  count(brand, sort = TRUE) |>
  head(10)

print(top_brands)

##        brand   n
## 1   Kingston 648
## 2    G.Skill 606
## 3    Corsair 536
## 4    Crucial 237
## 5  TEAMGROUP 235
## 6    Patriot 208
## 7    Mushkin 113
## 8    V-Color  55
## 9    Silicon  54
## 10     ADATA  47

RAM_clean |>
  ggplot(aes(x = price)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(title = "Distribution of RAM Prices", x = "Price ($)", y = "Frequency") +
  theme_minimal()

RAM_clean |>
  ggplot(aes(x = total_capacity_gb, y = price)) +
  geom_point(alpha = 0.5, color = "steelblue", size = 2) +
  labs(
    title = "RAM Price vs Total Capacity",
    x = "Total Capacity (GB)",
    y = "Price ($)"
  ) +
  theme_minimal()

RAM_clean |>
  ggplot(aes(x = speed_mhz, y = price, color = ddr_type)) +
  geom_point(alpha = 0.6, size = 2) +
  labs(title = "RAM Price vs Speed by DDR Generation", x = "Speed (MHz)", y = "Price ($)", color = "DDR Type") +
  theme_minimal()

## Warning: Removed 59 rows containing missing values or values outside the scale range
## (`geom_point()`).

RAM_clean |>
  ggplot(aes(x = ddr_type, y = price, fill = ddr_type)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "RAM Price by DDR Generation", x = "DDR Type", y = "Price ($)") +
  theme_minimal()

Statistical Analysis

multiple_model <- lm(price ~ total_capacity_gb + speed_mhz + ddr_type + cas_latency + module_count + brand, data = RAM_clean)
summary(multiple_model)

## 
## Call:
## lm(formula = price ~ total_capacity_gb + speed_mhz + ddr_type + 
##     cas_latency + module_count + brand, data = RAM_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -670.34  -37.74   -4.82   29.21 1134.67 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -66.379568 108.914100  -0.609   0.5423    
## total_capacity_gb   4.020694   0.082166  48.934  < 2e-16 ***
## speed_mhz           0.024682   0.002997   8.236 2.69e-16 ***
## ddr_typeDDR5       31.908026  12.632764   2.526   0.0116 *  
## cas_latency        -5.802780   0.556108 -10.435  < 2e-16 ***
## module_count       37.800566   2.499292  15.125  < 2e-16 ***
## brandADATA         53.057541 109.700756   0.484   0.6287    
## brandAMD           62.807435 153.557938   0.409   0.6826    
## brandAntec         40.379624 153.507170   0.263   0.7925    
## brandCorsair       13.640428 108.646614   0.126   0.9001    
## brandCrucial       45.817066 108.868597   0.421   0.6739    
## brandG.Skill       -1.127648 108.623368  -0.010   0.9917    
## brandGeIL          15.163062 125.303300   0.121   0.9037    
## brandGOODRAM       47.987142 133.009504   0.361   0.7183    
## brandHP            37.504678 132.996409   0.282   0.7780    
## brandIBM          135.196550 132.984635   1.017   0.3094    
## brandKingston      70.558300 108.637452   0.649   0.5161    
## brandKlevv        -22.604175 113.351669  -0.199   0.8420    
## brandLenovo        53.688685 153.567240   0.350   0.7267    
## brandLexar         34.992190 112.983943   0.310   0.7568    
## brandMicron        50.335802 121.418799   0.415   0.6785    
## brandMushkin       37.071856 109.109569   0.340   0.7341    
## brandOLOy         -15.236919 116.107151  -0.131   0.8956    
## brandPatriot       23.066170 108.823216   0.212   0.8322    
## brandPNY           26.143159 110.394745   0.237   0.8128    
## brandSamsung       38.145156 110.567604   0.345   0.7301    
## brandSilicon      -11.980887 109.544412  -0.109   0.9129    
## brandSupermicro    24.317582 153.583499   0.158   0.8742    
## brandTEAMGROUP      6.382563 108.776113   0.059   0.9532    
## brandThermaltake  112.986472 110.631197   1.021   0.3072    
## brandTimetec       -7.432078 110.952435  -0.067   0.9466    
## brandTranscend     68.191584 132.984511   0.513   0.6081    
## brandV-Color      567.673052 109.786397   5.171 2.50e-07 ***
## brandVisionTek    143.627386 153.525791   0.936   0.3496    
## brandWintec        41.853237 132.966529   0.315   0.7530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108.5 on 2812 degrees of freedom
##   (59 observations deleted due to missingness)
## Multiple R-squared:  0.8319, Adjusted R-squared:  0.8299 
## F-statistic: 409.4 on 34 and 2812 DF,  p-value: < 2.2e-16

plot(resid(multiple_model), type="b", main="Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))

## Warning: not plotting observations with leverage one:
##   270, 1893, 2132, 2165, 2207, 2450

crPlots(multiple_model)

cor(RAM_clean[, c("price", "total_capacity_gb", "speed_mhz", "module_count", "capacity_per_module", "cas_latency", "is_ddr5")], use = "complete.obs")

##                         price total_capacity_gb speed_mhz module_count
## price               1.0000000         0.8469138 0.3207434   0.71212103
## total_capacity_gb   0.8469138         1.0000000 0.3990663   0.69823391
## speed_mhz           0.3207434         0.3990663 1.0000000   0.18162966
## module_count        0.7121210         0.6982339 0.1816297   1.00000000
## capacity_per_module 0.4029028         0.6690725 0.5038485   0.10670732
## cas_latency         0.2380348         0.3622101 0.9009369   0.08886038
## is_ddr5             0.2783389         0.3679513 0.9105697   0.13001421
##                     capacity_per_module cas_latency   is_ddr5
## price                         0.4029028  0.23803485 0.2783389
## total_capacity_gb             0.6690725  0.36221014 0.3679513
## speed_mhz                     0.5038485  0.90093695 0.9105697
## module_count                  0.1067073  0.08886038 0.1300142
## capacity_per_module           1.0000000  0.52801726 0.4826639
## cas_latency                   0.5280173  1.00000000 0.9257800
## is_ddr5                       0.4826639  0.92578001 1.0000000

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

## [1] 107.8066
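
This RMSE is slightly smaller than the residual standard error of 108.5 reported by summary() because the two divide the same residual sum of squares by different quantities: RMSE divides by the number of observations used, while the residual standard error divides by the residual degrees of freedom. A short check of that relationship, using only figures from the output above (2812 residual degrees of freedom and 35 estimated coefficients, giving 2847 observations used):

```r
rmse        <- 107.8066  # sqrt(mean(residuals^2)) from the output above
df_residual <- 2812      # residual degrees of freedom from summary()
n_coef      <- 35        # intercept plus 34 slope terms
n_used      <- df_residual + n_coef  # observations used in the fit

# Residual standard error = sqrt(RSS / df), and RSS = n * RMSE^2
rse <- rmse * sqrt(n_used / df_residual)
round(rse, 1)
```

The result rounds to 108.5, matching the summary() output, so the two figures describe the same residuals on slightly different scales.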

For this research paper, I selected multiple linear regression to answer my research question, which asks how several independent variables (the module specifications and brand) collectively influence a single dependent variable (RAM module price). Multiple linear regression suits this question because it estimates the effect of each predictor while holding the others constant. The final model is specified as: lm(price ~ total_capacity_gb + speed_mhz + ddr_type + cas_latency + module_count + brand, data = RAM_clean)

The intercept, which is the predicted price of a RAM module when all numeric predictors equal 0 and the categorical predictors are at their reference levels, is -$66.38. This is only a mathematical anchor for the regression and has no realistic interpretation, since a RAM module cannot have zero capacity or sell for a negative price. Interpreting the coefficients in context, with the other predictors held constant: each additional gigabyte of total capacity is associated with a $4.02 higher price, each additional MHz of speed with a $0.02 higher price (about $24.68 per 1,000 MHz), DDR5 modules cost $31.91 more than DDR4 modules, each one-unit increase in CAS latency lowers the price by $5.80, and each additional module raises the price by $37.80. Brand coefficients are measured relative to the reference brand (the one brand omitted from the coefficient table) and vary widely: V-Color carries a premium of $567.67, while brands such as Klevv (-$22.60) and OLOy (-$15.24) have negative estimates. Of the brand effects, only V-Color is statistically significant; the remaining brand estimates have standard errors above $108 and p-values above 0.30, so they cannot be distinguished from zero. The key predictors (total_capacity_gb, speed_mhz, ddr_typeDDR5, cas_latency, module_count, and the V-Color brand) all have p-values below 0.05. The adjusted R-squared is 0.8299, meaning that about 82.99% of the variance in RAM prices is explained by the six predictors in the model.
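
As a worked example of how the coefficients combine, the sketch below computes the fitted price for one hypothetical module: a 32 GB (2 x 16 GB) Corsair DDR5 kit at 6000 MHz with CAS latency 30. The configuration is made up for illustration; the coefficients are copied from the regression output.

```r
# Coefficients copied from summary(multiple_model)
coefs <- c(
  intercept         = -66.379568,
  total_capacity_gb =   4.020694,
  speed_mhz         =   0.024682,
  ddr_typeDDR5      =  31.908026,
  cas_latency       =  -5.802780,
  module_count      =  37.800566,
  brandCorsair      =  13.640428
)

# Hypothetical module, in the same order as coefs
# (1 for the intercept and for each indicator that applies)
module <- c(1, 32, 6000, 1, 30, 2, 1)

predicted_price <- sum(coefs * module)
round(predicted_price, 2)
```

This gives a fitted price of about $157.44. With the model's RMSE near $108, an individual prediction like this one carries substantial uncertainty.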

To validate the model, I assessed five key assumptions for valid inference: linearity, independence, homoscedasticity, normality of residuals, and absence of multicollinearity. Linearity was evaluated with the Residuals vs Fitted plot, which shows a mostly flat cloud centered on zero, so the linear form appears appropriate. To check independence of the observations, I examined the Residuals vs Order plot; it shows no visible pattern, so independence is likely not violated. Homoscedasticity was assessed with the Residuals vs Fitted and Scale-Location plots. The residuals are scattered fairly evenly around zero, but their spread is not constant: it fans out and then narrows in a cone-like shape, and the Scale-Location plot shows the spread changing with the fitted values. This indicates some heteroscedasticity, meaning prediction errors differ across price ranges, though it does not appear severe enough to invalidate the model. To check normality of the residuals, I examined the Normal Q-Q plot; the tails deviate from the reference line (the right tail sits high), so the residuals are not perfectly normal, but the deviation is modest and the assumption is reasonably met. Multicollinearity was assessed with the correlation matrix. Most predictor pairs show weak to moderate correlations (the largest being 0.70 between total capacity and module count), but speed_mhz, cas_latency, and is_ddr5 are strongly correlated with one another (0.90 to 0.93), so the individual coefficients for speed, CAS latency, and DDR generation should be interpreted with caution. Finally, the Residuals vs Leverage plot identified a few observations with higher leverage, and R reported six observations with leverage one that could not be plotted, but none of the plotted points appeared highly influential based on Cook's distance.
The model's RMSE is 107.8066, meaning predictions typically miss by about $107.81. Overall, the model shows mild violations of homoscedasticity and normality and notable correlation among three of its predictors, but these do not appear severe enough to compromise the main results.
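
A correlation matrix only captures pairwise relationships, so a variance inflation factor (VIF) is a common complementary check for multicollinearity. The sketch below is not part of the original analysis: it uses simulated data (all variable names and values are hypothetical) in which two predictors are built to be strongly correlated, and it computes VIF in base R as 1 / (1 - R-squared), where R-squared comes from regressing each predictor on the others.

```r
set.seed(42)
n <- 500

# Simulated predictors: x2 is constructed to track x1 closely; x3 is independent
x1 <- rnorm(n)
x2 <- 0.95 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(predictors), function(v) {
  fit <- lm(reformulate(setdiff(names(predictors), v), response = v),
            data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

round(vif_manual, 2)
```

In this simulation the VIFs for x1 and x2 come out far above that of x3. Since car is already loaded in this report, calling vif(multiple_model) would be one way to run the same kind of check on the fitted model; for a model with a multi-level factor such as brand, it reports generalized VIFs.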

Conclusion

The multiple linear regression model succeeded in identifying and quantifying key predictors of RAM module pricing in the consumer memory market. The model produced an adjusted R-squared of 0.8299, meaning the chosen variables explain about 82.99% of the variance in RAM prices, which indicates strong explanatory power. Total capacity was the strongest predictor, and module count, CAS latency, speed, DDR generation, and the V-Color brand were also statistically significant. More specifically, each additional gigabyte of capacity raised the price by about $4.02, and DDR5 modules cost about $31.91 more than comparable DDR4 modules. Brand estimates varied widely, with V-Color ($567.67), VisionTek ($143.63), and IBM ($135.20) showing the largest positive estimates and Klevv (-$22.60) and OLOy (-$15.24) showing negative ones, although only the V-Color premium was statistically significant. This suggests that brand reputation, warranty support, and perceived quality may influence consumer willingness to pay beyond technical specifications alone. Furthermore, the model diagnostics showed reasonable adherence to the regression assumptions: mild heteroscedasticity, with errors varying across price ranges; approximate normality, with slight deviation in the tails; and no highly influential outliers. The main caveat is multicollinearity, since speed, CAS latency, and DDR generation are strongly correlated with one another.

These findings provide actionable insights for both consumers and manufacturers. Consumers seeking value should prioritize capacity and DDR generation while recognizing that brand premiums may not always reflect proportional performance gains. Manufacturers can use this data to better understand competitive positioning and pricing strategies based on their product specifications and brand equity. The significant DDR5 premium suggests that early adopters continue to pay for cutting-edge technology, while the DDR4 market may offer better value for budget-conscious consumers.

This study focused on retail pricing at a single point in time and does not capture temporal dynamics such as the recent upward pricing trends mentioned in the introduction. Future research could incorporate time-series analysis to examine how RAM prices respond to market forces like supply chain disruptions, cryptocurrency mining demand, or semiconductor shortages. Additionally, investigating consumer reviews and performance benchmarks could reveal whether price premiums translate to proportional performance improvements or customer satisfaction. A deeper analysis of specific brand positioning strategies, regional pricing variations, and the impact of emerging technologies (such as DDR6 development) would further enrich understanding of the RAM market structure. Finally, exploring interaction effects between variables—such as whether capacity has different pricing impacts for DDR4 versus DDR5 modules—could reveal more nuanced pricing patterns that inform both purchasing decisions and market strategies.

References

Data set: https://github.com/docyx/pc-part-dataset/blob/main/data/csv/memory.csv

Final Project

Veronica Chunikhin

12-6-25
