library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2

Please hand this result in on Canvas no later than 11:59pm on Wednesday, February 28! Do not work in groups!

  1. Consider the cpus data from R package MASS. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance (perf) as a predictor of estimated performance in this problem.
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
library(car)
## Loading required package: carData
head(cpus)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
summary(cpus)
##              name          syct             mmin            mmax      
##  ADVISOR 32/60 :  1   Min.   :  17.0   Min.   :   64   Min.   :   64  
##  AMDAHL 470/7A :  1   1st Qu.:  50.0   1st Qu.:  768   1st Qu.: 4000  
##  AMDAHL 470V/7 :  1   Median : 110.0   Median : 2000   Median : 8000  
##  AMDAHL 470V/7B:  1   Mean   : 203.8   Mean   : 2868   Mean   :11796  
##  AMDAHL 470V/7C:  1   3rd Qu.: 225.0   3rd Qu.: 4000   3rd Qu.:16000  
##  AMDAHL 470V/8 :  1   Max.   :1500.0   Max.   :32000   Max.   :64000  
##  (Other)       :203                                                   
##       cach            chmin            chmax             perf       
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :   6.0  
##  1st Qu.:  0.00   1st Qu.: 1.000   1st Qu.:  5.00   1st Qu.:  27.0  
##  Median :  8.00   Median : 2.000   Median :  8.00   Median :  50.0  
##  Mean   : 25.21   Mean   : 4.699   Mean   : 18.27   Mean   : 105.6  
##  3rd Qu.: 32.00   3rd Qu.: 6.000   3rd Qu.: 24.00   3rd Qu.: 113.0  
##  Max.   :256.00   Max.   :52.000   Max.   :176.00   Max.   :1150.0  
##                                                                     
##     estperf       
##  Min.   :  15.00  
##  1st Qu.:  28.00  
##  Median :  45.00  
##  Mean   :  99.33  
##  3rd Qu.: 101.00  
##  Max.   :1238.00  
## 
a) Investigate the relationship between variables in the dataset, both numerically and visually. Comment on the relationships you observe.
## Numerical relationships
correlation_matrix <- cor(cpus[, -1])  # exclude 'name', which is a non-numeric column
print(correlation_matrix)
##               syct       mmin       mmax       cach      chmin      chmax
## syct     1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin    -0.3356422  1.0000000  0.7581573  0.5347291  0.5171892  0.2669074
## mmax    -0.3785606  0.7581573  1.0000000  0.5379898  0.5605134  0.5272462
## cach    -0.3209998  0.5347291  0.5379898  1.0000000  0.5822455  0.4878458
## chmin   -0.3010897  0.5171892  0.5605134  0.5822455  1.0000000  0.5482812
## chmax   -0.2505023  0.2669074  0.5272462  0.4878458  0.5482812  1.0000000
## perf    -0.3070821  0.7949233  0.8629942  0.6626135  0.6089025  0.6052193
## estperf -0.2883956  0.8192915  0.9012024  0.6486203  0.6105802  0.5921556
##               perf    estperf
## syct    -0.3070821 -0.2883956
## mmin     0.7949233  0.8192915
## mmax     0.8629942  0.9012024
## cach     0.6626135  0.6486203
## chmin    0.6089025  0.6105802
## chmax    0.6052193  0.5921556
## perf     1.0000000  0.9664687
## estperf  0.9664687  1.0000000
## Visual relationships
pairs(cpus[, -1], main = "Scatterplot Matrix")

The correlation matrix reveals the following associations among the variables:

  1. mmin, mmax, cach, chmin, and chmax exhibit moderate to strong positive correlations with both perf and estperf. This implies that as these variables increase, there is a tendency for both published performance (perf) and estimated performance (estperf) to rise.
  2. syct (cycle time in nanoseconds) demonstrates weak negative correlations with perf and estperf. This suggests that with an increase in cycle time, performance may slightly decrease, although the correlation is not very strong.
  3. The most substantial correlation coefficients are observed between mmin, mmax, and estperf, indicating a strong positive relationship. This underscores the influential role of minimum and maximum main memory (in kilobytes) in estimating performance.
  4. cach (cache memory in kilobytes) also exhibits a strong positive correlation with perf and estperf, indicating that larger cache sizes are linked with higher performance.
  5. A very strong positive correlation is evident between perf and estperf, signifying a high level of consistency between estimated performance and published performance.
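As an aside, a correlation heatmap can make these numeric relationships easier to scan than the raw matrix. A minimal sketch, assuming the corrplot package is installed:

# Heatmap view of the correlation matrix (requires the corrplot package)
library(corrplot)
corrplot(correlation_matrix, method = "color", type = "upper",
         addCoef.col = "black", number.cex = 0.6)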
b) Use methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the dataset. Do not consider perf in this modeling approach. Explain the process used to arrive at your final model.
#building initial model
initial_model <- lm(estperf ~ syct+ mmin + mmax + cach + chmin + chmax - perf, data = cpus)
summary(initial_model)
## 
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmin + chmax - 
##     perf, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.198  -24.064    1.429   23.136  288.718 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.648e+01  6.288e+00 -10.574  < 2e-16 ***
## syct         6.596e-02  1.369e-02   4.818 2.85e-06 ***
## mmin         1.431e-02  1.428e-03  10.021  < 2e-16 ***
## mmax         6.590e-03  5.016e-04  13.138  < 2e-16 ***
## cach         4.945e-01  1.091e-01   4.533 9.94e-06 ***
## chmin       -1.723e-01  6.687e-01  -0.258    0.797    
## chmax        1.201e+00  1.720e-01   6.985 4.04e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.88 on 202 degrees of freedom
## Multiple R-squared:  0.9109, Adjusted R-squared:  0.9082 
## F-statistic: 344.1 on 6 and 202 DF,  p-value: < 2.2e-16
cor_with_estperf <- correlation_matrix["estperf", ]
print(cor_with_estperf)
##       syct       mmin       mmax       cach      chmin      chmax       perf 
## -0.2883956  0.8192915  0.9012024  0.6486203  0.6105802  0.5921556  0.9664687 
##    estperf 
##  1.0000000

Interpretation: We shall do backward elimination, removing the least significant predictor at each step (the one with the highest p-value or the smallest contribution to model performance). Hence, chmin is dropped because it has the highest p-value, and syct is also dropped because of its negative correlation with estperf.
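For reference, backward elimination can also be automated with step(), which drops predictors by AIC rather than by p-value; a minimal sketch whose result may differ from the manual selection above:

# Automated backward elimination by AIC (an alternative to manual selection)
backward_model <- step(initial_model, direction = "backward", trace = FALSE)
summary(backward_model)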

second_model <- lm(estperf~ mmin + mmax - perf + chmax + cach , data= cpus)
summary(second_model)
## 
## Call:
## lm(formula = estperf ~ mmin + mmax - perf + chmax + cach, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.666  -29.613    1.242   27.726  311.090 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.627e+01  4.893e+00  -9.458  < 2e-16 ***
## mmin         1.384e-02  1.470e-03   9.416  < 2e-16 ***
## mmax         6.272e-03  5.223e-04  12.008  < 2e-16 ***
## chmax        1.155e+00  1.708e-01   6.759 1.43e-10 ***
## cach         4.298e-01  1.099e-01   3.910 0.000126 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.29 on 204 degrees of freedom
## Multiple R-squared:  0.9005, Adjusted R-squared:  0.8986 
## F-statistic: 461.7 on 4 and 204 DF,  p-value: < 2.2e-16

Conclusion: Since both the multiple R-squared and the adjusted R-squared of the initial model are higher than those of the second model, we shall consider the initial model the better one.

c) Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
# Create a residual plot for the initial linear regression model
par(mfrow=c(2,2))
plot(initial_model)

Interpretation: As we can see, there is curvature (a systematic pattern) in the residuals around the fitted line, which suggests non-linearity. To improve the model, we can apply a transformation to the target; the type of transformation depends on the value estimated by a power transformation.

pT <- powerTransform(initial_model, family = "bcPower")
pT$lambda
##        Y1 
## 0.4314326
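For reference, the Box-Cox family that powerTransform estimates \(\lambda\) for is

\[
\psi(y;\lambda) = \begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0,\\ \log(y), & \lambda = 0. \end{cases}
\]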

We know that when \(\lambda \approx 0\), it can be shown that this function approaches \(\log(y)\), so the log becomes the best transformation.

When \(\lambda \approx 1/2\), \(\sqrt{y}\) is usually a sufficient transformation.

If \(\lambda \approx 1\), no transformation is needed, and if \(\lambda \approx -1\), you can use \(\frac{1}{y}\).

Here \(\hat{\lambda} \approx 0.43\) lies between 0 and 1/2; we shall go forward and apply the log transformation, which also keeps the coefficients easy to interpret.

cpus <- cpus |>
  mutate(log_estperf = log(estperf)) 

model_final <- lm(log_estperf ~ syct + mmin + mmax + cach + chmin + chmax - name - perf  , data = cpus)

rsquared <- summary(initial_model)$r.squared  # R-squared of the untransformed model, for comparison
print(rsquared)
## [1] 0.9108694
summary(model_final)
## 
## Call:
## lm(formula = log_estperf ~ syct + mmin + mmax + cach + chmin + 
##     chmax - name - perf, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96828 -0.12609  0.02829  0.15471  0.46886 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.293e+00  2.934e-02 112.254  < 2e-16 ***
## syct        -4.510e-04  6.389e-05  -7.059 2.64e-11 ***
## mmin         1.278e-05  6.662e-06   1.919   0.0565 .  
## mmax         5.386e-05  2.341e-06  23.011  < 2e-16 ***
## cach         7.024e-03  5.090e-04  13.801  < 2e-16 ***
## chmin        4.808e-03  3.120e-03   1.541   0.1249    
## chmax       -1.598e-03  8.024e-04  -1.992   0.0478 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2188 on 202 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.9455 
## F-statistic: 602.1 on 6 and 202 DF,  p-value: < 2.2e-16

Plotting the residual plots for the transformed model:

par(mfrow=c(2,2))
plot(model_final)

Now, in the residual plot, the non-linearity is reduced: there is no systematic pattern in the residuals, and the spread of the residuals is consistent across all levels of the fitted values, which indicates that the linearity and constant-variance assumptions are reasonably satisfied.

d) How well does the final model fit the data? Comment on some model fit criteria from the model built in c).
1. R-squared: A higher R-squared indicates a better fit, though note that it can increase even when less meaningful predictors are added. For model_final, the R-squared is 0.947, which is high and indicates a good fit.

2. Adjusted R-squared: This accounts for the number of predictors in the model and is often considered a more reliable measure, especially when comparing models with different numbers of predictors, since it penalizes the inclusion of predictors that contribute little to the model. The adjusted R-squared is 0.9455, which also suggests a good fit.

3. Residual Standard Error (RSE): The RSE is an estimate of the standard deviation of the residuals and indicates how well the model predicts the response; smaller values suggest a better fit. Here the RSE is 0.2188, measured on the log scale of the response.
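Because the outcome is log-transformed, the RSE is easier to read after exponentiating. A small sketch: exp of the RSE gives a typical multiplicative prediction error.

# exp(RSE) ~ exp(0.2188) ~ 1.24: predictions are typically off by about 24%
# on the original estperf scale (multiplicatively)
exp(sigma(model_final))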

e) Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function?)
model_final
## 
## Call:
## lm(formula = log_estperf ~ syct + mmin + mmax + cach + chmin + 
##     chmax - name - perf, data = cpus)
## 
## Coefficients:
## (Intercept)         syct         mmin         mmax         cach        chmin  
##   3.293e+00   -4.510e-04    1.278e-05    5.386e-05    7.024e-03    4.808e-03  
##       chmax  
##  -1.598e-03

# Converting coefficients to meaningful multiplicative effects

# Coefficients from the model (copied from summary(model_final))
coefficients <- c(3.293e+00, -4.510e-04, 1.278e-05, 5.386e-05, 7.024e-03, 4.808e-03, -1.598e-03)

# Variables corresponding to the coefficients
variables <- c("Intercept", "syct", "mmin", "mmax", "cach", "chmin", "chmax")

# Units corresponding to the variables
units <- c("Unknown", "nanoSeconds", "Kilobytes", "Kilobytes", "Kilobytes", "Units", "Units")

# Convert coefficients to original scale using exp()
converted_values <- exp(coefficients)

# Display the results with units
interpretation <- data.frame(variables, converted_values, units)
print(interpretation)
##   variables converted_values       units
## 1 Intercept       26.9235132     Unknown
## 2      syct        0.9995491 nanoSeconds
## 3      mmin        1.0000128   Kilobytes
## 4      mmax        1.0000539   Kilobytes
## 5      cach        1.0070487   Kilobytes
## 6     chmin        1.0048196       Units
## 7     chmax        0.9984033       Units

Interpreting the coefficients:

Intercept (26.9235132): The estimated performance when all other predictors are zero is approximately 26.92. This value is in the original data units, and it represents the baseline performance.

syct (0.9995491): For each additional nanosecond of cycle time, the estimated performance is multiplied by about 0.99955, a decrease of roughly 0.045% per nanosecond.

mmin (1.0000128): For each additional kilobyte of minimum main memory, the estimated performance is multiplied by about 1.0000128, an increase of roughly 0.00128% per kB (about 1.3% per MB, since 1 MB = 1024 kB).

mmax (1.0000539): For each additional kilobyte of maximum main memory, the estimated performance is multiplied by about 1.0000539, an increase of roughly 0.00539% per kB (about 5.7% per MB).

cach (1.0070487): For each additional kilobyte of cache memory, the estimated performance increases by approximately 0.70%.

chmin (1.0048196): For each additional minimum channel, the estimated performance increases by approximately 0.48%.

chmax (0.9984033): For each additional maximum channel, the estimated performance decreases by approximately 0.16%.
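A short sketch of the per-MB conversions quoted above (assuming 1 MB = 1024 kB): multiplying a log-scale coefficient by the unit factor before exponentiating rescales the effect.

# Multiplicative effect of one additional MB of memory on estperf
kb_per_mb <- 1024
exp(1.278e-05 * kb_per_mb)  # mmin: ~1.013, i.e. about +1.3% per MB
exp(5.386e-05 * kb_per_mb)  # mmax: ~1.057, i.e. about +5.7% per MB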

f) Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.

# Calculating variance inflation factors to assess multicollinearity
vif_values <- vif(model_final)
print(vif_values)
##     syct     mmin     mmax     cach    chmin    chmax 
## 1.201644 2.902002 3.274002 1.858349 1.966234 1.891392

Variance inflation factors indicate the degree of multicollinearity: a VIF greater than 10 is a strong indication of multicollinearity, values between 5 and 10 suggest a problematic amount of collinearity, and values below 5 indicate only a small amount, which is not a concern. Since all of the predictors have VIFs below 5, there is no need to take action.

g) Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.
par(mfrow = c(2, 2))

# Plotting residual vs. leverage
plot(model_final, which = 5, main = "Residual vs. Leverage", pch = 16)

# Reset par to default settings
par(mfrow = c(1, 1))

Observations:

Observation with Index 10: High leverage value suggests that this observation has an extreme value for one or more predictor variables, making it influential in determining the coefficients of the regression model. Low standardized residual indicates that this observation fits well with the overall trend of the model.

Observations with Indices 157 and 200: These observations have higher standardized residuals compared to observation 10. Higher standardized residuals suggest that these points deviate more from the predicted values. However, they have lower leverage values, indicating that they might not be as influential in determining the model coefficients compared to observation 10.

Based on this information, we can make the following interpretations:

Observation 10: This observation is both influential (high leverage) and well-fitted (low standardized residual). It has a strong impact on the model due to its extreme values in predictor variables.

Observations 157 and 200: These observations have higher residuals, indicating that they deviate more from the model predictions. However, their leverage values are lower compared to observation 10, suggesting that they might not have as much influence on the model.
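As an additional check, Cook's distance combines residual size and leverage into a single influence index. A minimal sketch using common rules of thumb (4/n for Cook's distance, 2(p+1)/n for leverage):

# Flag potentially influential and high-leverage points in model_final
n <- nrow(cpus)
p <- length(coef(model_final)) - 1      # number of predictors
cd <- cooks.distance(model_final)
lev <- hatvalues(model_final)
which(cd > 4 / n)                       # Cook's distance rule of thumb
which(lev > 2 * (p + 1) / n)            # leverage rule of thumb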

  2. Consider the birthwt data from R package MASS. We will investigate the relationship between low birthweight and the predictors in the data using logistic regression and discriminant analysis.
head(birthwt)
a) Investigate the relationship between variables in the dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.
summary(birthwt)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Now, let’s create visualizations between the outcome variable (low - indicating low birthweight) and predictor variables, considering both quantitative and categorical variables:

## Numerical variables

# Pair plot for numeric variables
pairs(birthwt[, c("age", "lwt", "ftv")])

Using boxplots to compare the distribution of numeric variables across different levels of categorical variables.

# Numeric variable across the categorical outcome
# Boxplot for age by low birthweight
boxplot(age ~ low, data = birthwt, main = "Boxplot of Age by Low Birthweight", xlab = "Low Birthweight", ylab = "Age")

Let us now consider relationships involving the categorical variables:

# Categorical variables
# Bar plot for race by low birthweight
table_race_low <- table(birthwt$race, birthwt$low)
barplot(table_race_low, beside = TRUE, legend = TRUE, names.arg = c("No", "Yes"),
        col = c("red", "green", "blue"),
        main = "Bar Plot of Race by Low Birthweight",
        xlab = "Low Birthweight", ylab = "Count")

# Create a grouped bar plot for smoke and low birthweight
ggplot(birthwt, aes(x = factor(smoke), fill = factor(low))) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Grouped Bar Plot of Smoke and Low Birthweight",
       x = "Smoke (1 = Yes, 0 = No)", y = "Count",
       fill = "Low Birthweight") +
  theme_minimal()

Other barplots for the remaining categorical predictors:

# Barplot of low birth weight by smoking status with colors
# (low is the 0/1 indicator; rows give the legend, columns give the groups)
colors <- c("blue", "green")
barplot(table(birthwt$low, birthwt$smoke), 
        beside = TRUE, legend.text = c("No", "Yes"), 
        names.arg = c("Non-Smoker", "Smoker"), 
        xlab = "Smoking Status", ylab = "Frequency",
        main = "Low Birth Weight by Smoking Status",
        col = colors)

# Barplot of low birth weight by uterine irritability with colors
barplot(table(birthwt$low, birthwt$ui), 
        beside = TRUE, legend.text = c("No", "Yes"), 
        names.arg = c("No", "Yes"), 
        xlab = "Uterine Irritability", ylab = "Frequency",
        main = "Low Birth Weight by Uterine Irritability",
        col = colors)

Interpretation:

1. Low Birthweight Rate: The rate of low birthweight infants in the dataset is approximately 31.22%. This rate is higher than the general population rate in many countries, which is typically around 5-8%. The dataset may be biased towards high-risk pregnancies or populations with higher rates of low birthweight.

2. Smoking: About 39.15% of mothers in the dataset are smokers. Smoking during pregnancy is a known risk factor for low birthweight, which is visible in the grouped bar plot. The high rate of smoking in this dataset might indicate a population with higher overall health risks.

3. Age: The mean age of mothers in the dataset is 23.24 years, ranging from 14 to 45 years. The presence of teenage mothers (age 14-19) and older mothers (age 40-45) in the dataset could indicate potential high-risk pregnancies associated with low birthweight.

4. Number of Physician Visits: The mean number of physician visits (ftv variable) is 0.7937, ranging from 0 to 6 visits. The low mean number of visits may indicate limited access to healthcare or potential barriers to prenatal care.

5. Race Distribution: The race variable is categorical with values ranging from 1 to 3, indicating different race categories. The distribution of race categories in the dataset may be of interest for further analysis, especially considering potential disparities in healthcare access and outcomes among different racial groups.

6. Uterine Irritability: The higher frequency for “No” indicates that, among the observations with no uterine irritability, there is a larger proportion of cases with normal birth weight. The lower frequency for “Yes” suggests that, among the observations with uterine irritability, there is a smaller proportion of cases with normal birth weight and a relatively higher proportion of cases with low birth weight.

b) Fit a logistic regression model using methods discussed in class/the book, similar to as in problem 1). Be careful to understand each variable in birthwt to avoid including variables that are not logically acceptable for inclusion in the model.

Let us create an initial model to determine the significance of the variables. Note that bwt (the actual birthweight in grams) must be excluded, since low is defined directly from it.

initial_model1 <- glm(low ~ smoke + ht + race + lwt + ftv + age + ptl + ui, 
                      data = birthwt, 
                      family = binomial)
summary(initial_model1)
## 
## Call:
## glm(formula = low ~ smoke + ht + race + lwt + ftv + age + ptl + 
##     ui, family = binomial, data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.078975   1.276254  -0.062  0.95066   
## smoke        0.937275   0.398458   2.352  0.01866 * 
## ht           1.830720   0.694135   2.637  0.00835 **
## race         0.453424   0.215294   2.106  0.03520 * 
## lwt         -0.012387   0.006614  -1.873  0.06111 . 
## ftv          0.063461   0.169765   0.374  0.70854   
## age         -0.035845   0.036472  -0.983  0.32569   
## ptl          0.542087   0.346168   1.566  0.11736   
## ui           0.721965   0.463174   1.559  0.11906   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 204.19  on 180  degrees of freedom
## AIC: 222.19
## 
## Number of Fisher Scoring iterations: 4

With respect to the significance legend:

Highly significant (p-value < 0.001); significant (p-value < 0.01); marginally significant (p-value < 0.05); not significant (p-value ≥ 0.05).

Logically Acceptable Variables:

Race (race): Coefficient Estimate: 0.4534 p-value: 0.03520 (*) Interpretation: The variable race is statistically significant at a 0.05 significance level. The positive coefficient suggests an increase in the log-odds of low birth weight for different race categories.

Smoking Status (smoke): Coefficient Estimate: 0.9373 p-value: 0.01866 (*) Interpretation: The variable smoke is statistically significant at a 0.05 significance level. The positive coefficient indicates an increase in the log-odds of low birth weight for mothers who smoke during pregnancy.

Hypertension (ht): Coefficient Estimate: 1.8307 p-value: 0.00835 (**) Interpretation: The variable ht is statistically significant at a 0.01 significance level. The positive coefficient suggests an increase in the log-odds of low birth weight for mothers with hypertension.

Variables with Marginal Significance:

Last Weight (lwt): Coefficient Estimate: -0.0124 p-value: 0.06111 (.) Interpretation: The variable lwt has a p-value of 0.06111, which is marginally above the 0.05 significance level. It might be considered for further investigation or model refinement.

Variables that are Not Statistically Significant:

Number of Physician Visits (ftv): Coefficient Estimate: 0.0635 p-value: 0.70854 (not significant) Interpretation: The variable ftv has a high p-value (0.70854), indicating that it is not statistically significant. It may be less important in predicting low birth weight.

Mother’s Age (age): Coefficient Estimate: -0.0358 p-value: 0.32569 (not significant) Interpretation: The variable age is not statistically significant at the 0.05 significance level.

Prior Preterm Births (ptl): Coefficient Estimate: 0.5421 p-value: 0.11736 (not significant) Interpretation: The variable ptl is not statistically significant at the 0.05 significance level.

Uterine Irritability (ui): Coefficient Estimate: 0.7220 p-value: 0.11906 (not significant) Interpretation: The variable ui is not statistically significant at the 0.05 significance level.

Overall conclusion: we shall only consider race, smoke, and ht as significant variables and build another model.

logistic_model2 <- glm(low ~ smoke + ht + race , 
                      data = birthwt, 
                      family = binomial)

# Display the summary of the model
summary(logistic_model2)
## 
## Call:
## glm(formula = low ~ smoke + ht + race, family = binomial, data = birthwt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4308     0.5232  -4.646 3.38e-06 ***
## smoke         1.1289     0.3723   3.032  0.00243 ** 
## ht            1.2253     0.6218   1.971  0.04877 *  
## race          0.5636     0.2004   2.812  0.00492 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 217.40  on 185  degrees of freedom
## AIC: 225.4
## 
## Number of Fisher Scoring iterations: 4

Observation: The AIC of the updated model (225.4) is slightly higher than that of the initial model (222.19); since lower AIC indicates better fit, the reduced model fits slightly worse, but it retains only the statistically significant predictors.

c) What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?
  1. ptl (Prior Preterm Births): Model Coefficient: 0.5421 p-value: 0.11736 (not statistically significant at 0.05 level)

Assumption: Because ptl enters the model as a numeric predictor, the model implicitly assumes a linear effect on the log-odds: each additional prior preterm birth changes the log-odds of a low-weight baby by the same amount (0.5421).

Realism Consideration: The assumption may or may not be realistic. The lack of statistical significance suggests that, based on the data, there is not enough evidence to conclude that ptl has a significant effect on the log-odds of low birth weight. It might be worth reevaluating the variable’s inclusion in the model or exploring potential interactions.

  2. ftv (Number of Physician Visits): Model Coefficient: 0.0635 p-value: 0.70854 (not statistically significant at 0.05 level)

Assumption: Similarly, the model implicitly assumes a linear effect of ftv on the log-odds: each additional physician visit changes the log-odds of a low-weight baby by the same amount (0.0635).

Realism Consideration: The assumption may not be realistic based on the data. The lack of statistical significance suggests that the variable ftv does not significantly contribute to explaining the variation in the log-odds of low birth weight. It might be worth reconsidering the inclusion of this variable in the model or exploring alternative measures of healthcare access.

Overall Considerations: The lack of statistical significance for these variables implies that, based on the data, there isn’t strong evidence to support their inclusion in the model. In addition, the linearity assumption is questionable for both variables, since the higher categories (ptl of 2-3, ftv of 4-6) contain very few observations; see the sketch below.
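One way to relax the linearity assumption is to enter these variables as factors, so each level gets its own log-odds shift. A minimal sketch; with such sparse upper levels the estimates will be unstable, which motivates the collapsed versions created in d) and e):

# Treat ptl and ftv as unordered factors instead of numeric predictors;
# each level then receives its own coefficient on the log-odds scale
factor_model <- glm(low ~ smoke + ht + race + factor(ptl) + factor(ftv),
                    data = birthwt, family = binomial)
summary(factor_model)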

d) Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
# Displaying unique values in ptl for reference
unique(birthwt$ptl)
## [1] 0 1 2 3
# Creating a new variable ptl2 by collapsing categories in ptl
birthwt$ptl2 <- ifelse(birthwt$ptl > 0, "Prior Preterm Birth", "No Prior Preterm Birth")

# Displaying unique values in ptl2 for verification
unique(birthwt$ptl2)
## [1] "No Prior Preterm Birth" "Prior Preterm Birth"
# Displaying counts of individual values in the variable ptl2
counts <- table(birthwt$ptl2)
print(counts)
## 
## No Prior Preterm Birth    Prior Preterm Birth 
##                    159                     30

Explanation:

I am creating a new variable ptl2 based on the values in ptl. If ptl has a value greater than 0 (indicating prior preterm births), ptl2 is assigned the label “Prior Preterm Birth”; otherwise, it is labeled “No Prior Preterm Birth”. This transformation simplifies the variable into a binary indicator, making it more straightforward for analysis.

e) Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.
Creating a new variable for ftv named ftv2 that is more useful for analysis could involve collapsing categories or transforming the variable in a meaningful way. In the case of small sample sizes, collapsing categories may help improve the stability of the analysis.
# Displaying unique values in ftv for reference
unique(birthwt$ftv)
## [1] 0 3 1 2 4 6
# Creating a new variable ftv2 by collapsing categories in ftv
birthwt$ftv2 <- ifelse(birthwt$ftv > 0, "Visits > 0", "No Visits")

# Displaying unique values in ftv2 for verification
unique(birthwt$ftv2)
## [1] "No Visits"  "Visits > 0"
# Form tables to summarize low birthweight probabilities by levels of ftv2
table_summary <- table(birthwt$ftv2, birthwt$low)
prop_table <- prop.table(table_summary, margin = 1)

# Display tables
print(table_summary)
##             
##               0  1
##   No Visits  64 36
##   Visits > 0 66 23
print(prop_table)
##             
##                     0        1
##   No Visits  0.640000 0.360000
##   Visits > 0 0.741573 0.258427

Explanation: I am creating a new variable ftv2 based on the values in ftv. If ftv has a value greater than 0, indicating at least one physician visit, ftv2 is assigned the label “Visits > 0”; otherwise, it is labeled “No Visits”. I then formed tables to summarize the low birthweight probabilities by levels of the new variable ftv2. The proportions show a lower rate of low birthweight among mothers with at least one visit (25.8%) than among those with no visits (36.0%).

f) Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ptl2 and ftv2 in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight?
# Fitting a logistic regression model with the new variables ftv2 and ptl2
updated_logistic_model <- glm(low ~ race + smoke + ptl2 + ht + ftv2, 
                               family = binomial, 
                               data = birthwt)

# Displaying the summary of the updated model
summary(updated_logistic_model)
## 
## Call:
## glm(formula = low ~ race + smoke + ptl2 + ht + ftv2, family = binomial, 
##     data = birthwt)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.3164     0.6003  -3.859 0.000114 ***
## race                      0.4935     0.2114   2.335 0.019551 *  
## smoke                     0.8798     0.3943   2.231 0.025677 *  
## ptl2Prior Preterm Birth   1.3296     0.4392   3.027 0.002467 ** 
## ht                        1.2528     0.6402   1.957 0.050374 .  
## ftv2Visits > 0           -0.2681     0.3537  -0.758 0.448438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 207.76  on 183  degrees of freedom
## AIC: 219.76
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

Significant Variables: 1. ptl2 (Prior Preterm Birth) is significant (p-value = 0.002467). 2. smoke (Smoking) has a p-value of 0.025677, which is significant. 3. race has a p-value of 0.019551, which is significant. 4. ht (Hypertension) is marginally significant (p-value = 0.050374).

Variables Not Significant: ftv2 is not statistically significant (p-value > 0.05).

Model Fit: The residual deviance is 207.76 on 183 degrees of freedom.

Interpretation and Comments: Importance of New Variables: The new variable ptl2 (Prior Preterm Birth) is significant and has a positive coefficient, indicating an increase in the log-odds of low birthweight for mothers with a history of preterm births. ftv2 (the collapsed number of physician visits) is not significant (p-value = 0.45).

Overall Model Fit: Although the residual deviance (207.76) is slightly higher than that of the full initial model (204.19), the AIC is lower (219.76 vs. 222.19), suggesting a better trade-off between fit and model complexity.
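As a further check, nested logistic models can be compared with a likelihood-ratio test. A minimal sketch comparing the reduced model from b) with this updated model:

# Likelihood-ratio test: does adding ptl2 and ftv2 improve the model from b)?
# (The models are nested, so the chi-squared test applies.)
anova(logistic_model2, updated_logistic_model, test = "Chisq")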

g) In a manner similar to the approach used in the book, split the data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to as done in the notes (but don’t use my name to set the seed!)
# Set the seed for reproducibility
library(TeachingDemos)
## Warning: package 'TeachingDemos' was built under R version 4.3.2
char2seed("neuralink")

# Load necessary library
library(MASS)

# Split the data into training (80%) and test (20%) sets
train_indices <- sample(1:nrow(birthwt), 0.8 * nrow(birthwt))
train_data <- birthwt[train_indices, ]
test_data <- birthwt[-train_indices, ]

# Fit LDA and QDA models to the training set
lda_model <- lda(low ~ race + smoke + ptl2 + ht , data = train_data)
qda_model <- qda(low ~ race + smoke + ptl2 + ht , data = train_data)

# Fit logistic regression models to the training set
logistic_model_f <- glm(low ~ race + smoke + ptl2 + ht + ftv2, 
                        family = binomial, data = train_data)
logistic_model_b <- glm(low ~ smoke + ht + race, 
                        family = binomial, data = train_data)

# Predictions on the test set
lda_pred <- predict(lda_model, test_data)$class
qda_pred <- predict(qda_model, test_data)$class
logistic_pred_f <- predict(logistic_model_f, test_data, type = "response")
logistic_pred_b <- predict(logistic_model_b, test_data, type = "response")

# Convert logistic regression probabilities to binary predictions
logistic_pred_f_binary <- ifelse(logistic_pred_f > 0.5, 1, 0)
logistic_pred_b_binary <- ifelse(logistic_pred_b > 0.5, 1, 0)

# Confusion matrices
confusion_lda <- table(lda_pred, test_data$low)
confusion_qda <- table(qda_pred, test_data$low)
confusion_logistic_f <- table(logistic_pred_f_binary, test_data$low)
confusion_logistic_b <- table(logistic_pred_b_binary, test_data$low)

# Calculate sensitivity, specificity, and accuracy
# Note: with table(pred, truth), rows are predictions and columns are the
# true classes, so sensitivity and specificity condition on the columns
sensitivity <- function(confusion_matrix) {
  # true positives / all actual positives
  confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
}

specificity <- function(confusion_matrix) {
  # true negatives / all actual negatives
  confusion_matrix[1, 1] / sum(confusion_matrix[, 1])
}

accuracy <- function(confusion_matrix) {
  sum(diag(confusion_matrix)) / sum(confusion_matrix)
}

# Calculate metrics for each model
metrics_lda <- c(Sensitivity = sensitivity(confusion_lda),
                 Specificity = specificity(confusion_lda),
                 Accuracy = accuracy(confusion_lda))

metrics_qda <- c(Sensitivity = sensitivity(confusion_qda),
                 Specificity = specificity(confusion_qda),
                 Accuracy = accuracy(confusion_qda))

metrics_logistic_f <- c(Sensitivity = sensitivity(confusion_logistic_f),
                        Specificity = specificity(confusion_logistic_f),
                        Accuracy = accuracy(confusion_logistic_f))

metrics_logistic_b <- c(Sensitivity = sensitivity(confusion_logistic_b),
                        Specificity = specificity(confusion_logistic_b),
                        Accuracy = accuracy(confusion_logistic_b))

# Display the results
print("Linear Discriminant Analysis (LDA):")
## [1] "Linear Discriminant Analysis (LDA):"
print(metrics_lda)
## Sensitivity Specificity    Accuracy 
##   0.1666667   0.7812500   0.6842105
print("\nQuadratic Discriminant Analysis (QDA):")
## [1] "\nQuadratic Discriminant Analysis (QDA):"
print(metrics_qda)
## Sensitivity Specificity    Accuracy 
##   0.5714286   0.8709677   0.8157895
print("\nLogistic Regression (Built in f):")
## [1] "\nLogistic Regression (Built in f):"
print(metrics_logistic_f)
## Sensitivity Specificity    Accuracy 
##   0.3750000   0.8333333   0.7368421
print("\nLogistic Regression (Built in b):")
## [1] "\nLogistic Regression (Built in b):"
print(metrics_logistic_b)
## Sensitivity Specificity    Accuracy 
##   0.2000000   0.7878788   0.7105263

Interpretation (with the seed set via char2seed(“neuralink”)):

1) QDA has the highest accuracy (81.58%) among the methods considered.
2) LDA has the lowest sensitivity and accuracy among the models.
3) Logistic regression (built in f) performs reasonably well with 73.68% accuracy.
4) Logistic regression (built in b) has a similar accuracy to LDA but slightly better sensitivity.

Conclusion: Based on the accuracy metric, QDA seems to be the most effective model among the methods considered for this particular dataset. However, it’s crucial to consider the specific goals of your analysis and other factors such as computational efficiency and interpretability when choosing a model.
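Accuracy at a 0.5 cutoff is threshold-dependent. A hedged sketch of a threshold-free comparison for the two logistic models using ROC curves, assuming the pROC package is installed:

# Compare the logistic models by AUC, which does not depend on the 0.5 cutoff
library(pROC)
roc_f <- roc(test_data$low, logistic_pred_f)  # model from f)
roc_b <- roc(test_data$low, logistic_pred_b)  # model from b)
auc(roc_f)
auc(roc_b)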

h) Using your final model from f), interpret the estimates for all covariates.
#summary of the updated logistic regression model
summary(updated_logistic_model)
## 
## Call:
## glm(formula = low ~ race + smoke + ptl2 + ht + ftv2, family = binomial, 
##     data = birthwt)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.3164     0.6003  -3.859 0.000114 ***
## race                      0.4935     0.2114   2.335 0.019551 *  
## smoke                     0.8798     0.3943   2.231 0.025677 *  
## ptl2Prior Preterm Birth   1.3296     0.4392   3.027 0.002467 ** 
## ht                        1.2528     0.6402   1.957 0.050374 .  
## ftv2Visits > 0           -0.2681     0.3537  -0.758 0.448438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 207.76  on 183  degrees of freedom
## AIC: 219.76
## 
## Number of Fisher Scoring iterations: 4

Interpretation of the coefficient estimates:

1) Intercept: The estimated intercept is -2.3164. This is the log-odds of low birthweight when all other predictor variables are zero.

2) race: The coefficient for race is 0.4935. Holding other variables constant, for each one-unit increase in the race variable, the log-odds of low birthweight increase by 0.4935. The p-value is 0.019551, suggesting that race is statistically significant.

3) smoke: The coefficient for smoke is 0.8798. Holding other variables constant, mothers who smoke have log-odds of low birthweight 0.8798 higher than non-smokers. The p-value is 0.025677, indicating that smoke is statistically significant.

4) ptl2 (Prior Preterm Birth): The coefficient for ptl2Prior Preterm Birth is 1.3296. Holding other variables constant, individuals with a prior preterm birth have log-odds 1.3296 higher than those without one. The p-value is 0.002467, suggesting that the effect is statistically significant.

5) ht (hypertension): The coefficient for ht is 1.2528. Holding other variables constant, individuals with hypertension have log-odds 1.2528 higher than those without hypertension. The p-value is 0.050374, which is marginally significant.

6) ftv2 (Visits > 0): The coefficient for ftv2Visits > 0 is -0.2681. Holding other variables constant, individuals with at least one visit have log-odds 0.2681 lower than those with zero visits. However, the p-value is 0.448438, indicating that this variable is not statistically significant.
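Since these are log-odds effects, exponentiating gives odds ratios, which are often easier to communicate. A small sketch:

# Odds ratios (and profile-likelihood 95% CIs) for the final model
exp(coef(updated_logistic_model))
exp(confint(updated_logistic_model))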

Model Fit:

Deviance: The null deviance is 234.67 and the residual deviance is 207.76. The drop of 26.91 on 5 degrees of freedom suggests that the predictors provide a substantially better fit than the null model; see the sketch below.
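To formalize this, the deviance drop is approximately chi-squared under the null hypothesis that none of the predictors matter, so its tail probability can be computed directly:

# P-value for the overall deviance reduction (df = 188 - 183 = 5)
pchisq(234.67 - 207.76, df = 5, lower.tail = FALSE)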