HW3 - Analyzing Factors Influencing CPU Performance

2026-03-06

Dataset breakdown

Below are all the features contained in the cpus dataset

## 'data.frame':    209 obs. of  9 variables:
##  $ name   : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
##  $ syct   : int  125 29 29 29 29 26 23 23 23 23 ...
##  $ mmin   : int  256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
##  $ mmax   : int  6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
##  $ cach   : int  256 32 32 32 32 64 64 64 64 128 ...
##  $ chmin  : int  16 8 8 8 8 8 16 16 16 32 ...
##  $ chmax  : int  128 32 32 32 16 32 32 32 32 64 ...
##  $ perf   : int  198 269 220 172 132 318 367 489 636 1144 ...
##  $ estperf: int  199 253 253 253 132 290 381 381 749 1238 ...

Here are the first six rows of the dataset

##             name syct mmin  mmax cach chmin chmax perf estperf
## 1  ADVISOR 32/60  125  256  6000  256    16   128  198     199
## 2  AMDAHL 470V/7   29 8000 32000   32     8    32  269     253
## 3  AMDAHL 470/7A   29 8000 32000   32     8    32  220     253
## 4 AMDAHL 470V/7B   29 8000 32000   32     8    32  172     253
## 5 AMDAHL 470V/7C   29 8000 16000   32     8    16  132     132
## 6  AMDAHL 470V/8   26 8000 32000   64     8    32  318     290

Theoretical Model

We hypothesize that CPU performance (\(Y\)) is linearly related to cycle time and cache size.

Let the linear estimate of a CPU's performance be defined as:

\[ Y_{perf} = \beta_0 + \beta_1 X_{cycle} + \beta_2 X_{cache} \]

Where:

- \(\beta_0\) is the intercept.
- \(\beta_1\) and \(\beta_2\) are coefficients for cycle time and cache size.

This equation allows us to quantify how much cache improves performance while controlling for cycle time.
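The model above can be fitted directly with base R. The following is a minimal sketch, assuming the `cpus` data is the one shipped with the MASS package and that `syct` and `cach` stand in for \(X_{cycle}\) and \(X_{cache}\); the name `fit` is only illustrative.

```r
# Sketch: fit the two-predictor linear model from the text.
# Assumption: cpus is the dataset from the MASS package.
data(cpus, package = "MASS")

# perf = beta_0 + beta_1 * syct + beta_2 * cach
fit <- lm(perf ~ syct + cach, data = cpus)

# Estimated intercept and slopes (beta_0, beta_1, beta_2)
print(coef(fit))
```

The sign of each slope indicates the direction of the relationship once the other predictor is held fixed.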

Distribution of Performance

The relationship between performance and cycle time

## `geom_smooth()` using formula = 'y ~ x'

The relationship between performance and cache size

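library(MASS)     # provides the cpus dataset
library(dplyr)
library(ggplot2)
# Scatter plot of performance against cache size with a fitted linear trend
df <- cpus %>%
  dplyr::select(perf, cach)   # qualified because MASS also exports select()
ggplot(df, mapping = aes(x = cach, y = perf)) + geom_point(color = 'green') +
  stat_smooth(method = 'lm', se = FALSE, color = 'blue')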

## `geom_smooth()` using formula = 'y ~ x'

Analysis and Conclusion

From the two previous plots we see that cycle time is negatively related to performance (\(Y\)), while cache size is positively related to the performance of a CPU. Therefore, the general linear estimation equation for a CPU’s performance keeps the form of the theoretical model, with \(\beta_1 < 0\) and \(\beta_2 > 0\) expected: \[ Y_{perf} = \beta_0 + \beta_1 X_{cycle} + \beta_2 X_{cache} \]

Where:

- \(\beta_0\) is the intercept.
- \(\beta_1\) and \(\beta_2\) are coefficients for cycle time and cache size.
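The directions read off the two plots can be checked numerically with simple correlations. This is a small sketch, again assuming the MASS version of `cpus`; the variable names `r_cycle` and `r_cache` are illustrative.

```r
# Sketch: check the direction of each relationship with Pearson correlations.
# Assumption: cpus is the dataset from the MASS package.
data(cpus, package = "MASS")

r_cycle <- cor(cpus$perf, cpus$syct)  # performance vs cycle time
r_cache <- cor(cpus$perf, cpus$cach)  # performance vs cache size

print(c(cycle_time = r_cycle, cache_size = r_cache))
```

A negative value for cycle time and a positive value for cache size would agree with the plots.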

Conclusion

From this we see that designing the best-performing CPU means balancing cycle time against cache size: a shorter cycle time and a larger cache both go with higher performance.

Top 5 highest performing CPUs

top5 = cpus %>% 
  arrange(desc(perf)) %>%
  slice(1:5)
top5

##              name syct  mmin  mmax cach chmin chmax perf estperf
## 1  SPERRY 1100/94   30  8000 64000  128    12   176 1150     978
## 2 AMDAHL 580 5880   23 32000 64000  128    32    64 1144    1238
## 3  SPERRY 1100/93   30  8000 64000   96    12   176  915     919
## 4 AMDAHL 580-5860   23 16000 64000   64    16    32  636     749
## 5 NAS AS/9000 DPC   38 16000 32000  128    16    32  510     426

Comment:

We see that all of them have a low cycle time (23 to 38) and a large cache (64 to 128).
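To put "low" and "high" in context, the top five can be compared with the whole dataset. The sketch below is one way to do that in base R, assuming the MASS version of `cpus`; the `comparison` data frame is a hypothetical helper, not part of the original analysis.

```r
# Sketch: compare the top 5 CPUs with the whole dataset.
# Assumption: cpus is the dataset from the MASS package.
data(cpus, package = "MASS")

# Five rows with the highest published performance
top5 <- cpus[order(cpus$perf, decreasing = TRUE), ][1:5, ]

# Median cycle time and cache size: top 5 versus all CPUs
comparison <- data.frame(
  feature        = c("syct", "cach"),
  top5_median    = c(median(top5$syct), median(top5$cach)),
  overall_median = c(median(cpus$syct), median(cpus$cach))
)
print(comparison)
```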

How accurate is the published performance of a CPU?

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Number of over- and underperforming CPUs

## [1] " Numbers of CPUs that over performed"

## [1] 102

## [1] " Numbers of CPUs that under performed"

## [1] 100
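The code that produced these counts is not shown. One plausible way to obtain counts of this kind is sketched below, assuming that "over performed" means published `perf` is greater than `estperf` (the opposite convention is equally possible) and that `cpus` comes from the MASS package.

```r
# Sketch: count CPUs whose published performance differs from the estimate.
# Assumptions: cpus is from MASS; "over" means perf > estperf.
data(cpus, package = "MASS")

over  <- sum(cpus$perf > cpus$estperf)   # published above estimated
under <- sum(cpus$perf < cpus$estperf)   # published below estimated
equal <- sum(cpus$perf == cpus$estperf)  # exact matches

print(c(over = over, under = under, equal = equal))
```

Counting exact matches separately makes it clear why the two reported numbers need not add up to all 209 rows.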

Final Analysis

From the plots on the previous slides, we see that for the majority of CPUs in the dataset the performance published by the manufacturer differs from the estimated performance. Luckily, most CPUs’ discrepancies are relatively small (under 50%). Unfortunately, the number of CPUs that overperform (102) is nearly equal to the number that underperform (100), so the direction of the error is hard to predict. In conclusion, take the numbers from the manufacturer with a grain of salt and double-check them yourself to know the real performance.
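The "under 50%" claim can be checked with a relative discrepancy. The sketch below assumes the discrepancy is measured relative to the published value `perf`; the original analysis may have used a different denominator, and `rel_diff` is an illustrative name.

```r
# Sketch: share of CPUs whose relative discrepancy is below 50%.
# Assumptions: cpus is from MASS; discrepancy is |perf - estperf| / perf.
data(cpus, package = "MASS")

rel_diff <- abs(cpus$perf - cpus$estperf) / cpus$perf

share_small <- mean(rel_diff < 0.5)  # fraction of CPUs under 50%
print(share_small)
```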