Consider the cpus data from R package MASS. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance as a predictor of performance in this problem.
Investigate the relationship between variables in the
cpus dataset, both numerically and visually. Comment on the
relationships you observe.
#import the necessary packages
library(MASS)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
#load the dataset
data(cpus)
#check the dataset documentation
?cpus
## starting httpd help server ... done
#inspecting the dataset
head(cpus)
#getting a summary
summary(cpus)
## name syct mmin mmax
## ADVISOR 32/60 : 1 Min. : 17.0 Min. : 64 Min. : 64
## AMDAHL 470/7A : 1 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000
## AMDAHL 470V/7 : 1 Median : 110.0 Median : 2000 Median : 8000
## AMDAHL 470V/7B: 1 Mean : 203.8 Mean : 2868 Mean :11796
## AMDAHL 470V/7C: 1 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000
## AMDAHL 470V/8 : 1 Max. :1500.0 Max. :32000 Max. :64000
## (Other) :203
## cach chmin chmax perf
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 6.0
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 27.0
## Median : 8.00 Median : 2.000 Median : 8.00 Median : 50.0
## Mean : 25.21 Mean : 4.699 Mean : 18.27 Mean : 105.6
## 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00 3rd Qu.: 113.0
## Max. :256.00 Max. :52.000 Max. :176.00 Max. :1150.0
##
## estperf
## Min. : 15.00
## 1st Qu.: 28.00
## Median : 45.00
## Mean : 99.33
## 3rd Qu.: 101.00
## Max. :1238.00
##
From the summary() we observe the following:
Right-skewed distributions: most variables have a median well below the mean, indicating that a few high-value CPUs skew the distributions.
Potential outliers: syct, mmax, perf, and estperf have maximum values far above their medians and means, suggesting that outliers exist.
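As a quick numeric check of this skew, we can compare the mean and median of each numeric variable directly (a minimal sketch; the informal rule is simply that a mean well above the median suggests right skew):
#compare mean and median for each numeric column
num_vars <- cpus[, sapply(cpus, is.numeric)]
round(sapply(num_vars, function(x) c(mean = mean(x), median = median(x))), 1)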
#get the correlation matrix
corr_matrix <- cor(cpus[, sapply(cpus, is.numeric)])
corr_matrix
## syct mmin mmax cach chmin chmax
## syct 1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin -0.3356422 1.0000000 0.7581573 0.5347291 0.5171892 0.2669074
## mmax -0.3785606 0.7581573 1.0000000 0.5379898 0.5605134 0.5272462
## cach -0.3209998 0.5347291 0.5379898 1.0000000 0.5822455 0.4878458
## chmin -0.3010897 0.5171892 0.5605134 0.5822455 1.0000000 0.5482812
## chmax -0.2505023 0.2669074 0.5272462 0.4878458 0.5482812 1.0000000
## perf -0.3070821 0.7949233 0.8629942 0.6626135 0.6089025 0.6052193
## estperf -0.2883956 0.8192915 0.9012024 0.6486203 0.6105802 0.5921556
## perf estperf
## syct -0.3070821 -0.2883956
## mmin 0.7949233 0.8192915
## mmax 0.8629942 0.9012024
## cach 0.6626135 0.6486203
## chmin 0.6089025 0.6105802
## chmax 0.6052193 0.5921556
## perf 1.0000000 0.9664687
## estperf 0.9664687 1.0000000
corrplot(corr_matrix, method = 'color', tl.cex = 0.8)
We see from the above plot that cycle time in nanoseconds (syct) is negatively correlated with all of the other variables, including perf, although not strongly (r ≈ -0.31). This suggests that machines with longer cycle times tend to have lower performance.
We also see a very strong positive correlation between published performance (perf) and estimated relative performance (estperf) (r ≈ 0.97), and strong correlations between the memory variables (mmin and mmax) and both performance measures (r between about 0.79 and 0.90). This is expected, since the amount of memory strongly affects a computer's performance.
#we can make a few scatter plots to help us explore the data further
plot(
cpus$mmax,
cpus$estperf,
main="Estimated Performance vs. Main Memory",
xlab = "Maximum Main Memory (kB)",
ylab="Estimated Performance",
col="blue",
pch = 16
)
# we make another plot for maximum number of channels versus performance
plot(
cpus$chmax,
cpus$estperf,
main="Estimated Performance vs. Number of Channels",
xlab = "Maximum Number of Channels",
ylab="Estimated Performance",
col="blue",
pch = 16
)
# finally we get to look at one more scatter plot for our data exploration
plot(
cpus$cach,
cpus$estperf,
main ='Estimated Performance Vs. Cache Size',
xlab = "Cache Size in KB",
ylab="Estimated Performance",
col="blue",
pch = 16
)
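To wrap up the exploration, a scatterplot matrix gives all pairwise relationships in one view (a minimal base-R sketch):
#scatterplot matrix of all numeric variables
pairs(cpus[, sapply(cpus, is.numeric)], pch = 16, col = "blue", cex = 0.5)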
Use either methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the cpus dataset. Do not consider name in this modeling approach. Explain the process used to arrive at your final model.
#fitting an initial full model (excluding `name`)
full_model <- lm(estperf ~ syct + mmin + mmax + cach + chmin + chmax + perf, data = cpus)
#getting the summary of the model
summary(full_model)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmin + chmax +
## perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.481 -9.549 2.864 15.258 182.259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.423e+01 4.732e+00 -7.233 9.71e-12 ***
## syct 3.777e-02 9.434e-03 4.003 8.79e-05 ***
## mmin 5.483e-03 1.120e-03 4.894 2.02e-06 ***
## mmax 3.376e-03 3.974e-04 8.495 4.41e-15 ***
## cach 1.245e-01 7.751e-02 1.606 0.10977
## chmin -1.651e-02 4.523e-01 -0.037 0.97091
## chmax 3.457e-01 1.287e-01 2.686 0.00783 **
## perf 5.770e-01 3.718e-02 15.519 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.7 on 201 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.958
## F-statistic: 679.5 on 7 and 201 DF, p-value: < 2.2e-16
From the model summary, we have an \(R^2\) of 0.9595, meaning the model explains 95.95% of the variance in estimated performance.
The F-statistic of 679.5 is large, with a corresponding p-value below \(2.2 \times 10^{-16}\), indicating that at least one predictor contributes significantly to the model.
Looking at the predictors and their associated p-values, most are statistically significant in predicting the estimated-performance response, except cache size (p-value = 0.10977) and minimum number of channels (p-value = 0.97091). Maximum main memory and published performance are the strongest predictors, as they have the smallest p-values.
Based on this, we fit a new model without cache size and minimum number of channels and examine its performance.
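As a cross-check on this manual backward-elimination step, stepwise selection by AIC with stepAIC() from MASS (one of the methods used in the notes) can also be run on the full model; this is only a sketch, and the reduced model reported below was chosen manually.
#backward stepwise selection by AIC, starting from the full model
step_model <- stepAIC(full_model, direction = "backward", trace = FALSE)
summary(step_model)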
#model without cache size and minimum number of channels predictors
lm2 <- lm(estperf ~ syct + mmin + mmax + chmax + perf, data = cpus)
#get the summary
summary(lm2)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + chmax + perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.777 -10.446 2.617 14.216 173.179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.254e+01 4.620e+00 -7.043 2.86e-11 ***
## syct 3.521e-02 9.313e-03 3.780 0.000206 ***
## mmin 5.648e-03 1.099e-03 5.138 6.50e-07 ***
## mmax 3.275e-03 3.929e-04 8.335 1.16e-14 ***
## chmax 3.772e-01 1.212e-01 3.112 0.002127 **
## perf 5.962e-01 3.537e-02 16.855 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.76 on 203 degrees of freedom
## Multiple R-squared: 0.9589, Adjusted R-squared: 0.9579
## F-statistic: 947.3 on 5 and 203 DF,  p-value: < 2.2e-16
Without the two predictors, we have a model with \(R^2 = 0.9589\), meaning it still explains 95.89% of the variability in estimated performance. Dropping cache size and minimum number of channels therefore costs essentially no explanatory power. The F-statistic also rises from 679.5 to 947.3, although this largely reflects the smaller number of predictors rather than a better fit on its own.
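A more direct comparison of the two nested models is a partial F-test, sketched below; a non-significant result would confirm that the dropped predictors can be omitted.
#partial F-test comparing the reduced model to the full model
anova(lm2, full_model)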
Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
# Residuals plots
par(mfrow = c(2, 2))
plot(lm2)
In the Residuals vs. Fitted plot, the red LOESS line is not flat but shows a systematic trend, especially for small fitted values. This suggests that the relationship between the predictors and the response may not be purely linear, so the linearity assumption may be violated. The spread of the residuals is also not constant across fitted values, i.e. there is heteroscedasticity, violating the constant-variance assumption. Finally, this plot flags a few outliers, for example observations 32, 10, and 157; we check later whether they are also high-leverage points.
From the Q-Q plot, we can assess whether the residuals follow a normal distribution. In the middle section, most points follow the 45-degree dashed line, but the residuals deviate substantially in both tails, indicating non-normality in the extremes. As in the previous plot, observations 32, 10, and 157 stand out as outliers.
Lastly, the Residuals vs. Leverage plot confirms that observations 32, 10, and 157 are not only outliers but also have relatively high Cook's distance compared to the other observations. Observation 10 stands out in particular, with both a large residual and a Cook's distance beyond the plotted boundary, implying that this point has a strong influence on the fitted model.
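One common adjustment for the non-constant variance and heavy tails seen here, and the one anticipated by the later interpretation question, is to log-transform the response; a minimal sketch (this is not the model carried forward below):
#refit with a log-transformed response to stabilize the variance
lm_log <- lm(log(estperf) ~ syct + mmin + mmax + chmax + perf, data = cpus)
par(mfrow = c(2, 2))
plot(lm_log) #re-check the residual diagnostics on the log scale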
Here, instead, we keep the model on the original scale and consider removing observation 10 and re-fitting.
cpus_filtered <- cpus[-10, ] # Exclude observation 10
lm_new <- lm(estperf ~ syct + mmin + mmax + chmax + perf, data=cpus_filtered)
summary(lm_new)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + chmax + perf, data = cpus_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95.956 -10.592 1.092 12.568 165.711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.813e+01 4.127e+00 -6.816 1.05e-10 ***
## syct 2.840e-02 8.285e-03 3.428 0.000736 ***
## mmin 3.181e-03 1.025e-03 3.102 0.002196 **
## mmax 3.909e-03 3.574e-04 10.936 < 2e-16 ***
## chmax 4.078e-01 1.073e-01 3.802 0.000190 ***
## perf 5.467e-01 3.196e-02 17.107 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.09 on 202 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9556
## F-statistic: 892.3 on 5 and 202 DF,  p-value: < 2.2e-16
We see a slightly lower F-statistic (947.3 down to 892.3) and a drop in multiple \(R^2\) from 0.9589 to 0.9567 (adjusted \(R^2\) from 0.9579 to 0.9556), which is expected because we removed a highly influential point from the dataset.
par(mfrow = c(2, 2))
plot(lm_new)
How well does the final model fit the data? Comment on some model fit criteria from the model built in c)
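The key fit criteria for the final model appear in the summary above: multiple \(R^2 = 0.9567\), adjusted \(R^2 = 0.9556\), a residual standard error of 28.09 on 202 degrees of freedom, and F = 892.3 with p-value < 2.2e-16, all indicating a very good fit. A short sketch for pulling a few additional criteria:
#additional model-fit criteria for the final model
c(AIC = AIC(lm_new), BIC = BIC(lm_new))
sigma(lm_new) #residual standard error, in units of estperf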
Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time- the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)
Intercept: when all predictor variables are zero, the predicted estimated performance is -28.13. Since a CPU cannot have all of these characteristics equal to zero, the intercept is not practically meaningful.
Cycle Time (syct): holding all other variables constant, a 1-nanosecond increase in cycle time is associated with an increase of about 0.028 units in estimated performance. The positive sign is somewhat counter-intuitive (longer cycle times usually mean slower machines, and syct is negatively correlated with estperf on its own) and likely reflects the fact that published performance is already in the model.
Minimum Main Memory (mmin): holding the other variables constant, each additional MB (1000 kB) of minimum main memory is associated with an increase of about 3.2 units in estimated performance.
Maximum Main Memory (mmax): holding the other variables constant, each additional MB (1000 kB) of maximum main memory is associated with an increase of about 3.9 units in estimated performance.
Maximum Number of Channels (chmax): holding the other variables constant, each additional channel is associated with an increase of about 0.41 units in estimated performance. More channels allow more simultaneous input/output, which plausibly supports higher performance.
Published Performance (perf): a 1-unit increase in published performance is associated with an increase of about 0.55 units in estimated performance, holding the other variables constant. Since perf is the manufacturer-reported benchmark, it is unsurprising that it tracks estimated real-world performance closely.
Note: we did not log-transform the response variable, so all interpretations above are on the original scale of estperf (changes in units of estimated performance, not percentages).
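For reference, had the response been log-transformed as the question anticipates, the raw-scale interpretation would come from exponentiating: a coefficient \(\beta\) corresponds to a \(100(e^{\beta} - 1)\%\) change in estimated performance per one-unit increase in the predictor. A hedged sketch, reusing the log-scale model from the residual-diagnostics section:
#log-scale sketch model (not the final model interpreted above)
lm_log <- lm(log(estperf) ~ syct + mmin + mmax + chmax + perf, data = cpus)
#approximate percent change in estperf per one-unit increase in each predictor
round(100 * (exp(coef(lm_log)) - 1), 2)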
Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
To assess multicollinearity in the final model, we calculate Variance Inflation Factors (VIFs) for the predictors.
# Compute Variance Inflation Factor (VIF)
vif_values <- vif(lm_new) # lm_new is the final model (after removing observation 10)
print(vif_values)
## syct mmin mmax chmax perf
## 1.223190 3.033525 4.188266 2.020315 5.562914
From the results above, the model does not exhibit strong multicollinearity: all VIFs are well below the common cutoff of 10, and only perf (VIF ≈ 5.6) slightly exceeds the more conservative threshold of 5. This is generally considered acceptable, so we take no action.
Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.
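A sketch of the usual influence diagnostics for the final model, using the rules of thumb of Cook's distance above 4/n, absolute studentized residuals above about 3, and the combined influence.measures() summary:
#Cook's distance, flagging points above the 4/n cutoff
cooks_d <- cooks.distance(lm_new)
which(cooks_d > 4 / nrow(cpus_filtered))
#studentized residuals larger than about 3 in absolute value suggest outliers
which(abs(rstudent(lm_new)) > 3)
#combined summary of influence measures (hat values, DFBETAS, Cook's distance)
summary(influence.measures(lm_new))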
Consider the birthwt data from R package
MASS. We will investigate the relationship between low
birthweight and the predictors in the birthwt data using logistic
regression and discriminant analysis.
Investigate the relationship between variables in the
birthwt dataset. Do you see anything surprising? Use both
numeric and visual summaries. Create and comment on visualizations
specifically between the outcome variable and predictor/independent
variables. Also, notice that qualitative/categorical variables should be
visualized in an alternative manner, not just scatterplots/correlations
as in the case of quantitative variables.
# Load required packages
library(MASS)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Load the birthwt dataset
data("birthwt")
#see the dataset documentation
?birthwt
# Convert categorical variables to factors
birthwt <- birthwt %>%
mutate(low = factor(low, labels = c("Normal Weight", "Low Weight")),
race = factor(race, labels = c("White", "Black", "Other")),
smoke = factor(smoke, labels = c("Non-Smoker", "Smoker")),
ht = factor(ht, labels = c("No Hypertension", "Hypertension")),
ui = factor(ui, labels = c("No Uterine Irritability", "Uterine Irritability")))
# View the first few rows
head(birthwt)
#getting the numerical summary
summary(birthwt)
## low age lwt race
## Normal Weight:130 Min. :14.00 Min. : 80.0 White:96
## Low Weight : 59 1st Qu.:19.00 1st Qu.:110.0 Black:26
## Median :23.00 Median :121.0 Other:67
## Mean :23.24 Mean :129.8
## 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :45.00 Max. :250.0
## smoke ptl ht
## Non-Smoker:115 Min. :0.0000 No Hypertension:177
## Smoker : 74 1st Qu.:0.0000 Hypertension : 12
## Median :0.0000
## Mean :0.1958
## 3rd Qu.:0.0000
## Max. :3.0000
## ui ftv bwt
## No Uterine Irritability:161 Min. :0.0000 Min. : 709
## Uterine Irritability : 28 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
From the summary above we see the following:
Low Birth Weight: Out of 189 observations, 59 (31.2%) are classified as low birth weight, while 130 (68.8%) are normal.
Maternal Age (age): The ages range from 14 to 45 years, with a mean of 23.24 years, suggesting that some young mothers are included in the dataset.
Maternal Weight (lwt): This ranges from 80 to 250 pounds, with a mean of 129.8, indicating a reasonable variation in maternal weight.
Race: The dataset is imbalanced, with 96 White, 26 Black, and 67 Other mothers.
Smoking Status (smoke): A substantial number of mothers (74 out of 189, ~39%) reported smoking, which may be an important risk factor for low birth weight.
Hypertension (ht) and Uterine Irritability (ui): Both conditions are relatively uncommon, with 12 cases of hypertension and 28 cases of uterine irritability, but they may still have an impact on birth weight.
Birth Weight (bwt): The birth weights range from 709g to 4990g, with a median of 2977g, suggesting that most newborns are within a healthy weight range, but there are some extremely low weights.
# next we check the correlation amongst the variables for the numeric variables.
numeric_vars <- birthwt %>% select_if(is.numeric)
# Correlation matrix visualization
corrplot(cor(numeric_vars), method = "color", addCoef.col = "black")
From the above correlation plot, we see that there are only weak correlations between birth weight (bwt) and the other numeric predictors. The largest positive correlation is between birth weight and the mother's weight (0.19), suggesting that maternal weight has a small positive relationship with birth weight. Overall, there are no strong correlations, implying that multiple factors may collectively influence birth weight rather than a single dominant predictor.
Next we make a few visualization plots between the predictor variables and the birth weight response variable.
# Boxplots to compare birth weight with predictors
ggplot(birthwt, aes(x = low, y = bwt, fill = low)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Birth Weight Distribution by Low Birth Weight Category") +
labs(x="Birth Weight Category", y="Birth Weight (grams)")
From the box-plot above, we can make the following observations:
Babies classified as Low Weight have significantly lower birth weights, with the median around 2000-2200 grams.
Normal Weight babies have a higher median birth weight, around 3000-3500 grams, with some outliers above 5000 grams.
The spread (interquartile range) for Normal Weight babies is larger, indicating greater variation in birth weight compared to Low Weight babies.
Outliers are observed in the Low Weight category, indicating some extremely low birth weights.
# Bar plot for smoking status
ggplot(birthwt, aes(x = smoke, fill = low)) +
geom_bar(position = "fill") +
theme_minimal() +
ggtitle("Proportion of Low Birth Weight by Smoking Status") +
labs(x="Smoking status", y="Proportion")
From the above plot we can make the following observations:
A higher proportion of babies born to smokers are classified as low birth weight compared to non-smokers.
Non-smokers have a lower proportion of low birth weight babies, suggesting that smoking during pregnancy is associated with an increased risk of low birth weight.
This visualization supports the hypothesis that smoking could negatively impact fetal development, increasing the likelihood of a baby being born underweight.
# Bar plot for race
ggplot(birthwt, aes(x = race, fill = low)) +
geom_bar(position = "fill") +
theme_minimal() +
ggtitle("Proportion of Low Birth Weight by Race")
From this bar graph, we can make the following observations:
The Black category has the highest proportion of low birth weight babies, suggesting that this group may be at a greater risk of delivering underweight infants.
The White category has the lowest proportion of low birth weight babies.
The Other category also has a relatively high proportion of low birth weight babies, but slightly lower than the Black category.
This suggests that race might be a factor influencing birth weight, possibly due to socioeconomic, genetic, or healthcare-related differences.
# Scatter plot of mother's weight (lwt) vs birth weight (bwt)
ggplot(birthwt, aes(x = lwt, y = bwt, color = low)) +
geom_point() +
theme_minimal() +
ggtitle("Birth Weight vs Mother's Weight") +
labs(x="Mother's Weight (pounds)", y="Birth Weight (grams)")
From the scatter plot above, we see the following:
There appears to be a positive correlation between mother’s weight and birth weight, meaning that mothers with higher weights tend to have babies with higher birth weights.
Babies classified as low birth weight (blue dots) are more concentrated at the lower end of the birth weight scale, regardless of the mother’s weight.
However, even among heavier mothers, some babies still have low birth weight, suggesting that factors other than maternal weight may contribute to birth weight.
birthwt to avoid including variables that are
not logically acceptable for inclusion in the model.
logit_model <- glm(low ~ lwt + age + race + smoke + ht + ui + ftv,
data = birthwt, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = low ~ lwt + age + race + smoke + ht + ui + ftv,
## family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.45481 1.18541 0.384 0.70122
## lwt -0.01652 0.00686 -2.409 0.01600 *
## age -0.02051 0.03595 -0.570 0.56844
## raceBlack 1.28976 0.52761 2.445 0.01451 *
## raceOther 0.91907 0.43630 2.106 0.03516 *
## smokeSmoker 1.04159 0.39548 2.634 0.00844 **
## htHypertension 1.88506 0.69481 2.713 0.00667 **
## uiUterine Irritability 0.90415 0.44860 2.015 0.04385 *
## ftv 0.05912 0.17200 0.344 0.73105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 203.83 on 180 degrees of freedom
## AIC: 221.83
##
## Number of Fisher Scoring iterations: 4
The coefficients (Estimate) give the change in the log-odds of low birth weight associated with a one-unit increase in each predictor (or with the indicated level of a categorical predictor), holding the other variables fixed.
The p-values (Pr(>|z|)) indicate whether each variable is significantly associated with the odds of low birth weight.
At the 0.05 level, lwt, race, smoke, ht, and ui are significant predictors, while age and ftv are not.
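Because the estimates are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret; a short sketch:
#odds ratios and profile-likelihood 95% confidence intervals
exp(cbind(OR = coef(logit_model), confint(logit_model)))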
What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?
Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
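A hedged sketch of one common recoding; the name ptl2 and the none vs. one-or-more split are illustrative choices given how few mothers have more than one previous premature labour:
#collapse number of previous premature labours into none vs. one or more
table(birthwt$ptl)
birthwt$ptl2 <- factor(ifelse(birthwt$ptl >= 1, "1+", "0"))
table(birthwt$ptl2, birthwt$low)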
Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.
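Similarly, a sketch for ftv; the name ftv2 and the 0 / 1 / 2+ grouping are illustrative, and the final table summarizes the proportion of low-birthweight babies at each level:
#collapse number of first-trimester physician visits into 0, 1, and 2+
table(birthwt$ftv)
birthwt$ftv2 <- cut(birthwt$ftv, breaks = c(-Inf, 0, 1, Inf), labels = c("0", "1", "2+"))
#row-wise proportions of low birth weight within each ftv2 level
prop.table(table(birthwt$ftv2, birthwt$low), margin = 1)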
Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ftv2 and ptl2 in the modeling). Comment on what you find: are the new versions of these variables important in predicting low birthweight?
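A sketch of the refitted model, assuming the ptl2 and ftv2 recodings from the sketches above:
#refit the logistic model with the collapsed versions of ptl and ftv
logit_model2 <- glm(low ~ lwt + age + race + smoke + ht + ui + ptl2 + ftv2,
data = birthwt, family = binomial)
summary(logit_model2)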
In a manner similar to the approach used in the book, split the birthwt data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to as done in the notes (but don’t use my name to set the seed!)
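A sketch of the split-and-evaluate workflow; the seed string, the 80/20 split, the predictors used for LDA/QDA (lwt, age, ptl, ftv), and the 0.5 probability cutoff for the logistic comparison are all illustrative assumptions rather than prescribed choices.
#reproducible split using TeachingDemos, as in the notes
library(TeachingDemos)
char2seed("your name here") #replace with your own seed string
n <- nrow(birthwt)
test_idx <- sample(seq_len(n), size = round(0.2 * n))
train <- birthwt[-test_idx, ]
test <- birthwt[test_idx, ]
#LDA and QDA using quantitative predictors only
lda_fit <- lda(low ~ lwt + age + ptl + ftv, data = train)
qda_fit <- qda(low ~ lwt + age + ptl + ftv, data = train)
lda_pred <- predict(lda_fit, test)$class
qda_pred <- predict(qda_fit, test)$class
#confusion matrices on the test set
table(Predicted = lda_pred, Actual = test$low)
table(Predicted = qda_pred, Actual = test$low)
#helper for sensitivity, specificity, and accuracy, treating "Low Weight" as the positive class
perf_metrics <- function(pred, actual) {
  tab <- table(pred, actual)
  c(sensitivity = tab["Low Weight", "Low Weight"] / sum(tab[, "Low Weight"]),
    specificity = tab["Normal Weight", "Normal Weight"] / sum(tab[, "Normal Weight"]),
    accuracy = mean(pred == actual))
}
perf_metrics(lda_pred, test$low)
perf_metrics(qda_pred, test$low)
#the same comparison for the logistic model from b), refit on the training set
logit_train <- glm(low ~ lwt + age + race + smoke + ht + ui + ftv,
data = train, family = binomial)
glm_prob <- predict(logit_train, newdata = test, type = "response")
glm_pred <- factor(ifelse(glm_prob > 0.5, "Low Weight", "Normal Weight"),
levels = levels(test$low))
table(Predicted = glm_pred, Actual = test$low)
perf_metrics(glm_pred, test$low)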
Using your final model from f), interpret the estimates for all covariates