NikhilBharadwaj_DA

1) Consider the cpus data from R package MASS. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance as a predictor of performance in this problem.

library(MASS)

# Load the dataset
data(cpus)

# Explore the structure of the dataset
str(cpus)

## 'data.frame':    209 obs. of  9 variables:
##  $ name   : Factor w/ 209 levels "ADVISOR 32/60",..: 1 3 2 4 5 6 8 9 10 7 ...
##  $ syct   : int  125 29 29 29 29 26 23 23 23 23 ...
##  $ mmin   : int  256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
##  $ mmax   : int  6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
##  $ cach   : int  256 32 32 32 32 64 64 64 64 128 ...
##  $ chmin  : int  16 8 8 8 8 8 16 16 16 32 ...
##  $ chmax  : int  128 32 32 32 16 32 32 32 32 64 ...
##  $ perf   : int  198 269 220 172 132 318 367 489 636 1144 ...
##  $ estperf: int  199 253 253 253 132 290 381 381 749 1238 ...

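# Perform linear regression on all remaining columns
# Note: `.` pulls in name (a factor with one level per row) and perf,
# so this fit is saturated, as the output below shows
model <- lm(estperf ~ ., data = cpus)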

# Summarize the regression results
summary(model)

## 
## Call:
## lm(formula = estperf ~ ., data = cpus)
## 
## Residuals:
## ALL 209 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (7 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     199        NaN     NaN      NaN
## nameAMDAHL 470/7A                54        NaN     NaN      NaN
## nameAMDAHL 470V/7                54        NaN     NaN      NaN
## nameAMDAHL 470V/7B               54        NaN     NaN      NaN
## nameAMDAHL 470V/7C              -67        NaN     NaN      NaN
## nameAMDAHL 470V/8                91        NaN     NaN      NaN
## nameAMDAHL 580 5880            1039        NaN     NaN      NaN
## nameAMDAHL 580-5840             182        NaN     NaN      NaN
## nameAMDAHL 580-5850             182        NaN     NaN      NaN
## nameAMDAHL 580-5860             550        NaN     NaN      NaN
## nameAPOLLO DN320               -176        NaN     NaN      NaN
## nameAPOLLO DN420               -175        NaN     NaN      NaN
## nameBASF 7/65                  -129        NaN     NaN      NaN
## nameBASF 7/68                   -82        NaN     NaN      NaN
## nameBTI 5000                   -184        NaN     NaN      NaN
## nameBTI 8000                   -135        NaN     NaN      NaN
## nameBURROUGHS B1955            -176        NaN     NaN      NaN
## nameBURROUGHS B2900            -170        NaN     NaN      NaN
## nameBURROUGHS B2925            -177        NaN     NaN      NaN
## nameBURROUGHS B4955             -75        NaN     NaN      NaN
## nameBURROUGHS B5900            -164        NaN     NaN      NaN
## nameBURROUGHS B5920            -160        NaN     NaN      NaN
## nameBURROUGHS B6900            -159        NaN     NaN      NaN
## nameBURROUGHS B6925            -154        NaN     NaN      NaN
## nameC.R.D. 68/10-80            -171        NaN     NaN      NaN
## nameC.R.D. UNIVERSE 2203T      -178        NaN     NaN      NaN
## nameC.R.D. UNIVERSE 68         -171        NaN     NaN      NaN
## nameC.R.D. UNIVERSE 68/05      -177        NaN     NaN      NaN
## nameC.R.D. UNIVERSE 68/137     -171        NaN     NaN      NaN
## nameC.R.D. UNIVERSE 68/37      -172        NaN     NaN      NaN
## nameCAMBEX 1636-1              -169        NaN     NaN      NaN
## nameCAMBEX 1636-10             -158        NaN     NaN      NaN
## nameCAMBEX 1641-1              -125        NaN     NaN      NaN
## nameCAMBEX 1641-11             -125        NaN     NaN      NaN
## nameCAMBEX 1651-1              -125        NaN     NaN      NaN
## nameCDC CYBER 170/750           -97        NaN     NaN      NaN
## nameCDC CYBER 170/760           -97        NaN     NaN      NaN
## nameCDC CYBER 170/815          -125        NaN     NaN      NaN
## nameCDC CYBER 170/825          -125        NaN     NaN      NaN
## nameCDC CYBER 170/835           -61        NaN     NaN      NaN
## nameCDC CYBER 170/845           -63        NaN     NaN      NaN
## nameCDC OMEGA 480-I            -176        NaN     NaN      NaN
## nameCDC OMEGA 480-II           -170        NaN     NaN      NaN
## nameCDC OMEGA 480-III          -155        NaN     NaN      NaN
## nameDEC DECSYS 10 1091         -145        NaN     NaN      NaN
## nameDEC DECSYS 20 2060         -158        NaN     NaN      NaN
## nameDEC MICROVAX-1             -181        NaN     NaN      NaN
## nameDEC VAX 11/730             -171        NaN     NaN      NaN
## nameDEC VAX 11/750             -163        NaN     NaN      NaN
## nameDEC VAX 11/780             -161        NaN     NaN      NaN
## nameDG ECLIPSE C/350           -165        NaN     NaN      NaN
## nameDG ECLIPSE M/600           -180        NaN     NaN      NaN
## nameDG ECLIPSE MV/1000         -127        NaN     NaN      NaN
## nameDG ECLIPSE MV/4000         -163        NaN     NaN      NaN
## nameDG ECLIPSE MV/6000         -169        NaN     NaN      NaN
## nameDG ECLIPSE MV/8000         -143        NaN     NaN      NaN
## nameDG ECLIPSE MV/8000 II      -157        NaN     NaN      NaN
## nameFORMATION F4000/100        -165        NaN     NaN      NaN
## nameFORMATION F4000/200        -165        NaN     NaN      NaN
## nameFORMATION F4000/200AP      -165        NaN     NaN      NaN
## nameFORMATION F4000/300        -165        NaN     NaN      NaN
## nameFORMATION F4000/300AP      -165        NaN     NaN      NaN
## nameFOUR PHASE 2000/260        -180        NaN     NaN      NaN
## nameGOULD CONCEPT 32/8705      -124        NaN     NaN      NaN
## nameGOULD CONCEPT 32/8750       -86        NaN     NaN      NaN
## nameGOULD CONCEPT 32/8780       -42        NaN     NaN      NaN
## nameHARRIS 100                 -176        NaN     NaN      NaN
## nameHARRIS 300                 -174        NaN     NaN      NaN
## nameHARRIS 500                 -147        NaN     NaN      NaN
## nameHARRIS 600                 -172        NaN     NaN      NaN
## nameHARRIS 700                 -149        NaN     NaN      NaN
## nameHARRIS 80                  -181        NaN     NaN      NaN
## nameHARRIS 800                 -146        NaN     NaN      NaN
## nameHONEYWELL DPS 6/35         -176        NaN     NaN      NaN
## nameHONEYWELL DPS 6/92         -169        NaN     NaN      NaN
## nameHONEYWELL DPS 6/96         -126        NaN     NaN      NaN
## nameHONEYWELL DPS 7/35         -179        NaN     NaN      NaN
## nameHONEYWELL DPS 7/45         -174        NaN     NaN      NaN
## nameHONEYWELL DPS 7/55         -171        NaN     NaN      NaN
## nameHONEYWELL DPS 7/65         -170        NaN     NaN      NaN
## nameHONEYWELL DPS 8/20         -167        NaN     NaN      NaN
## nameHONEYWELL DPS 8/44         -167        NaN     NaN      NaN
## nameHONEYWELL DPS 8/49          -24        NaN     NaN      NaN
## nameHONEYWELL DPS 8/50         -142        NaN     NaN      NaN
## nameHONEYWELL DPS 8/52          -18        NaN     NaN      NaN
## nameHONEYWELL DPS 8/62          -18        NaN     NaN      NaN
## nameHP 3000/30                 -181        NaN     NaN      NaN
## nameHP 3000/40                 -179        NaN     NaN      NaN
## nameHP 3000/44                 -171        NaN     NaN      NaN
## nameHP 3000/48                 -166        NaN     NaN      NaN
## nameHP 3000/64                 -152        NaN     NaN      NaN
## nameHP 3000/88                 -145        NaN     NaN      NaN
## nameHP 3000/III                -179        NaN     NaN      NaN
## nameIBM 3033 S                 -117        NaN     NaN      NaN
## nameIBM 3033 U                  -28        NaN     NaN      NaN
## nameIBM 3081                    162        NaN     NaN      NaN
## nameIBM 3081 D                  151        NaN     NaN      NaN
## nameIBM 3083 B                   21        NaN     NaN      NaN
## nameIBM 3083 E                  -86        NaN     NaN      NaN
## nameIBM 370/125-2              -184        NaN     NaN      NaN
## nameIBM 370/148                -178        NaN     NaN      NaN
## nameIBM 370/158-3              -164        NaN     NaN      NaN
## nameIBM 38/3                   -181        NaN     NaN      NaN
## nameIBM 38/4                   -179        NaN     NaN      NaN
## nameIBM 38/5                   -179        NaN     NaN      NaN
## nameIBM 38/7                   -171        NaN     NaN      NaN
## nameIBM 38/8                   -154        NaN     NaN      NaN
## nameIBM 4321                   -181        NaN     NaN      NaN
## nameIBM 4331-1                 -182        NaN     NaN      NaN
## nameIBM 4331-11                -173        NaN     NaN      NaN
## nameIBM 4331-2                 -171        NaN     NaN      NaN
## nameIBM 4341                   -171        NaN     NaN      NaN
## nameIBM 4341-1                 -168        NaN     NaN      NaN
## nameIBM 4341-10                -168        NaN     NaN      NaN
## nameIBM 4341-11                -157        NaN     NaN      NaN
## nameIBM 4341-12                -123        NaN     NaN      NaN
## nameIBM 4341-2                 -123        NaN     NaN      NaN
## nameIBM 4341-9                 -173        NaN     NaN      NaN
## nameIBM 4361-4                 -140        NaN     NaN      NaN
## nameIBM 4361-5                 -134        NaN     NaN      NaN
## nameIBM 4381-1                  -98        NaN     NaN      NaN
## nameIBM 4381-2                  -83        NaN     NaN      NaN
## nameIBM 8130 A                 -181        NaN     NaN      NaN
## nameIBM 8130 B                 -179        NaN     NaN      NaN
## nameIBM 8140                   -179        NaN     NaN      NaN
## nameIPL 4436                   -169        NaN     NaN      NaN
## nameIPL 4443                   -155        NaN     NaN      NaN
## nameIPL 4445                   -155        NaN     NaN      NaN
## nameIPL 4446                   -117        NaN     NaN      NaN
## nameIPL 4460                   -117        NaN     NaN      NaN
## nameIPL 4480                    -71        NaN     NaN      NaN
## nameMAGNUSON M80/30            -162        NaN     NaN      NaN
## nameMAGNUSON M80/31            -153        NaN     NaN      NaN
## nameMAGNUSON M80/32            -153        NaN     NaN      NaN
## nameMAGNUSON M80/42            -119        NaN     NaN      NaN
## nameMAGNUSON M80/43            -111        NaN     NaN      NaN
## nameMAGNUSON M80/44            -111        NaN     NaN      NaN
## nameMICRODATA SEQ.MS/3200      -166        NaN     NaN      NaN
## nameNAS AS/3000                -153        NaN     NaN      NaN
## nameNAS AS/3000 N              -170        NaN     NaN      NaN
## nameNAS AS/5000                -146        NaN     NaN      NaN
## nameNAS AS/5000 E              -146        NaN     NaN      NaN
## nameNAS AS/5000 N              -158        NaN     NaN      NaN
## nameNAS AS/6130                -113        NaN     NaN      NaN
## nameNAS AS/6150                -104        NaN     NaN      NaN
## nameNAS AS/6620                 -92        NaN     NaN      NaN
## nameNAS AS/6630                 -82        NaN     NaN      NaN
## nameNAS AS/6650                 -80        NaN     NaN      NaN
## nameNAS AS/7000                 -79        NaN     NaN      NaN
## nameNAS AS/7000 N              -151        NaN     NaN      NaN
## nameNAS AS/8040                 -73        NaN     NaN      NaN
## nameNAS AS/8050                  67        NaN     NaN      NaN
## nameNAS AS/8060                  71        NaN     NaN      NaN
## nameNAS AS/9000 DPC             227        NaN     NaN      NaN
## nameNAS AS/9000 N               -48        NaN     NaN      NaN
## nameNAS AS/9040                  68        NaN     NaN      NaN
## nameNAS AS/9060                 404        NaN     NaN      NaN
## nameNCR V8535 II               -180        NaN     NaN      NaN
## nameNCR V8545 II               -178        NaN     NaN      NaN
## nameNCR V8555 II               -173        NaN     NaN      NaN
## nameNCR V8565 II               -164        NaN     NaN      NaN
## nameNCR V8565 II E             -158        NaN     NaN      NaN
## nameNCR V8575 II               -152        NaN     NaN      NaN
## nameNCR V8585 II               -137        NaN     NaN      NaN
## nameNCR V8595 II               -121        NaN     NaN      NaN
## nameNCR V8635                   -57        NaN     NaN      NaN
## nameNCR V8635                  -119        NaN     NaN      NaN
## nameNCR V8650                  -119        NaN     NaN      NaN
## nameNCR V8665                    82        NaN     NaN      NaN
## nameNCR V8670                    -9        NaN     NaN      NaN
## nameNIXDORF 8890/30            -178        NaN     NaN      NaN
## nameNIXDORF 8890/50            -174        NaN     NaN      NaN
## nameNIXDORF 8890/70            -132        NaN     NaN      NaN
## namePERKIN-ELMER 3205          -175        NaN     NaN      NaN
## namePERKIN-ELMER 3210          -175        NaN     NaN      NaN
## namePERKIN-ELMER 3230          -135        NaN     NaN      NaN
## namePRIME 50-2250              -174        NaN     NaN      NaN
## namePRIME 50-250 II            -179        NaN     NaN      NaN
## namePRIME 50-550 II            -170        NaN     NaN      NaN
## namePRIME 50-750 II            -156        NaN     NaN      NaN
## namePRIME 50-850 II            -146        NaN     NaN      NaN
## nameSIEMENS 7.521              -180        NaN     NaN      NaN
## nameSIEMENS 7.531              -177        NaN     NaN      NaN
## nameSIEMENS 7.536              -168        NaN     NaN      NaN
## nameSIEMENS 7.541              -158        NaN     NaN      NaN
## nameSIEMENS 7.551              -152        NaN     NaN      NaN
## nameSIEMENS 7.561              -100        NaN     NaN      NaN
## nameSIEMENS 7.865-2            -132        NaN     NaN      NaN
## nameSIEMENS 7.870-2            -118        NaN     NaN      NaN
## nameSIEMENS 7.872-2             -50        NaN     NaN      NaN
## nameSIEMENS 7.875-2             -16        NaN     NaN      NaN
## nameSIEMENS 7.880-2              76        NaN     NaN      NaN
## nameSIEMENS 7.881-2             183        NaN     NaN      NaN
## nameSPERRY 1100/61 H1          -143        NaN     NaN      NaN
## nameSPERRY 1100/81              -17        NaN     NaN      NaN
## nameSPERRY 1100/82               28        NaN     NaN      NaN
## nameSPERRY 1100/83              142        NaN     NaN      NaN
## nameSPERRY 1100/84              161        NaN     NaN      NaN
## nameSPERRY 1100/93              720        NaN     NaN      NaN
## nameSPERRY 1100/94              779        NaN     NaN      NaN
## nameSPERRY 80/3                -175        NaN     NaN      NaN
## nameSPERRY 80/4                -175        NaN     NaN      NaN
## nameSPERRY 80/5                -175        NaN     NaN      NaN
## nameSPERRY 80/6                -175        NaN     NaN      NaN
## nameSPERRY 80/8                -162        NaN     NaN      NaN
## nameSPERRY 90/80 MODEL 3       -149        NaN     NaN      NaN
## nameSTRATUS 32                 -158        NaN     NaN      NaN
## nameWANG VS 90                 -174        NaN     NaN      NaN
## nameWANG VS10                  -152        NaN     NaN      NaN
## syct                             NA         NA      NA       NA
## mmin                             NA         NA      NA       NA
## mmax                             NA         NA      NA       NA
## cach                             NA         NA      NA       NA
## chmin                            NA         NA      NA       NA
## chmax                            NA         NA      NA       NA
## perf                             NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 208 and 0 DF,  p-value: NA
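
The all-NaN table above is what a saturated fit looks like: name has one level per machine, so its dummy variables alone reproduce every response and nothing is left to estimate the numeric predictors. A minimal sketch of how this might be confirmed, and of a refit that drops name and perf as the problem statement asks, follows. The object names fit_saturated and fit_numeric are illustrative and not part of the original analysis.

```r
library(MASS)

# name has as many levels as there are rows, so its dummies absorb all the data
n_levels <- nlevels(cpus$name)
n_rows <- nrow(cpus)

# The original formula leaves no residual degrees of freedom
fit_saturated <- lm(estperf ~ ., data = cpus)
df_saturated <- df.residual(fit_saturated)

# Dropping name (and perf, per the problem statement) leaves six numeric predictors
fit_numeric <- lm(estperf ~ . - name - perf, data = cpus)
df_numeric <- df.residual(fit_numeric)

c(levels = n_levels, rows = n_rows,
  df_saturated = df_saturated, df_numeric = df_numeric)
```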

a) Investigate the relationship between variables in the cpus dataset, both numerically and visually. Comment on the relationships you observe.

summary(cpus)

##              name          syct             mmin            mmax      
##  ADVISOR 32/60 :  1   Min.   :  17.0   Min.   :   64   Min.   :   64  
##  AMDAHL 470/7A :  1   1st Qu.:  50.0   1st Qu.:  768   1st Qu.: 4000  
##  AMDAHL 470V/7 :  1   Median : 110.0   Median : 2000   Median : 8000  
##  AMDAHL 470V/7B:  1   Mean   : 203.8   Mean   : 2868   Mean   :11796  
##  AMDAHL 470V/7C:  1   3rd Qu.: 225.0   3rd Qu.: 4000   3rd Qu.:16000  
##  AMDAHL 470V/8 :  1   Max.   :1500.0   Max.   :32000   Max.   :64000  
##  (Other)       :203                                                   
##       cach            chmin            chmax             perf       
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :   6.0  
##  1st Qu.:  0.00   1st Qu.: 1.000   1st Qu.:  5.00   1st Qu.:  27.0  
##  Median :  8.00   Median : 2.000   Median :  8.00   Median :  50.0  
##  Mean   : 25.21   Mean   : 4.699   Mean   : 18.27   Mean   : 105.6  
##  3rd Qu.: 32.00   3rd Qu.: 6.000   3rd Qu.: 24.00   3rd Qu.: 113.0  
##  Max.   :256.00   Max.   :52.000   Max.   :176.00   Max.   :1150.0  
##                                                                     
##     estperf       
##  Min.   :  15.00  
##  1st Qu.:  28.00  
##  Median :  45.00  
##  Mean   :  99.33  
##  3rd Qu.: 101.00  
##  Max.   :1238.00  
##

library(corrplot)

## corrplot 0.92 loaded

# Select numeric variables from the dataset
numeric_cps <- cpus[sapply(cpus, is.numeric)]

# Compute correlation matrix
cor_matrix <- cor(numeric_cps)

# Create a correlation heatmap
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust")
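
Part (a) asks for numeric as well as visual summaries. One possible way to complement the heatmap is to rank the hardware predictors by their correlation with estperf and to draw a scatterplot matrix. This is only a sketch: the choice of log1p() scaling is an assumption made here because the summary above shows strong right skew (means well above medians) and zeros in cach, chmin, and chmax.

```r
library(MASS)

# Numeric predictors allowed by the problem statement (name and perf excluded)
predictors <- c("syct", "mmin", "mmax", "cach", "chmin", "chmax")

# Correlation of each predictor with estperf, strongest first
cor_with_estperf <- sort(cor(cpus[predictors], cpus$estperf)[, 1],
                         decreasing = TRUE)
print(round(cor_with_estperf, 2))

# Scatterplot matrix; log1p() is used because cach, chmin and chmax contain zeros
pairs(log1p(cpus[c(predictors, "estperf")]), main = "cpus (log1p scale)")
```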

b) Use any of the methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the cpus dataset. Do not consider name in this modeling approach. Explain the process used to arrive at your final model.

# Load necessary library
library(MASS)

# Load the dataset
data(cpus)

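# Fit the full model (every column except name) as the starting point for stepwise selection
# Note: `. - name` still leaves perf among the predictors, although the problem statement excludes it
model_stepwise <- lm(estperf ~ . - name, data = cpus)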

# Perform stepwise regression with AIC as the criterion
final_model <- step(model_stepwise, direction = "both", trace = 0)

# Summarize the final model
summary(final_model)

## 
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmax + perf, 
##     data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.541   -9.553    2.867   15.213  182.172 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.424e+01  4.713e+00  -7.264 8.01e-12 ***
## syct         3.778e-02  9.403e-03   4.018 8.27e-05 ***
## mmin         5.476e-03  1.100e-03   4.980 1.36e-06 ***
## mmax         3.375e-03  3.959e-04   8.526 3.55e-15 ***
## cach         1.238e-01  7.473e-02   1.656  0.09920 .  
## chmax        3.443e-01  1.223e-01   2.814  0.00537 ** 
## perf         5.770e-01  3.708e-02  15.563  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.62 on 202 degrees of freedom
## Multiple R-squared:  0.9595, Adjusted R-squared:  0.9582 
## F-statistic: 796.7 on 6 and 202 DF,  p-value: < 2.2e-16

Loading Data: We load the cpus dataset from the MASS package.

Initial Model: We fit an initial linear regression model (model_stepwise) that uses every column except name as a predictor. Because of the . shorthand, perf stays in the candidate set even though the problem statement excludes published performance.

Stepwise Regression: We pass this model to step(), which tries adding and dropping single terms (direction = “both”), keeps the change that lowers AIC the most, and stops when no single change lowers it further. The trace = 0 argument suppresses the step-by-step output.

Final Model: The resulting final_model keeps syct, mmin, mmax, cach, chmax, and perf; the procedure dropped chmin. cach is retained although its p-value is 0.099, because AIC-based selection does not apply a 0.05 significance cutoff.

Summarize Model: We use summary() to display the coefficients, standard errors, t-values, and p-values. The final model has a residual standard error of 31.62 on 202 degrees of freedom and an adjusted R-squared of 0.9582.
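
The problem statement rules out perf as a predictor, yet the selected model above contains it. A hedged sketch of the same AIC-based stepwise procedure with perf removed from the candidate set is shown here. The names full_no_perf and step_no_perf are hypothetical, and no output is shown because this variant was not run in the original report.

```r
library(MASS)

# Full model on the six hardware predictors only (name and perf excluded)
full_no_perf <- lm(estperf ~ . - name - perf, data = cpus)

# Stepwise selection by AIC, allowing terms to be dropped and re-added
step_no_perf <- step(full_no_perf, direction = "both", trace = 0)

# Terms that survive selection
names(coef(step_no_perf))
```

Comparing summary(step_no_perf) with the model above would show how much of the reported fit depends on perf.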

c) Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.

# Obtain residuals from the final model
# (stored as resid_final to avoid reusing the name of the residuals() function)
resid_final <- residuals(final_model)

# Create a residual plot
plot(x = fitted(final_model), y = resid_final, 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual Plot")

# Add a horizontal line at y = 0 for reference
abline(h = 0, col = "red", lty = 2)

The residuals are centered on the zero line, but their spread is not constant across the fitted values, which suggests heteroscedasticity and therefore a violation of the constant-variance assumption. A few points also have very large residuals: the model summary above shows a range of -117.5 to 182.2, while the middle half of the residuals lies between -9.6 and 15.2. These points warrant investigation as potential outliers. Because of the non-constant variance, the standard errors and p-values of this model should be interpreted with caution. A common adjustment is to log-transform the response, which is tried next.
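
Part (c) asks for "various residual plots". As a sketch, base R's plot method for lm objects can draw the residuals-versus-fitted and scale-location panels side by side; the second panel is the one usually read for non-constant variance. The model is refit here from the formula reported above so that the block is self-contained.

```r
library(MASS)

# Refit the model reported above so the sketch is self-contained
fit <- lm(estperf ~ syct + mmin + mmax + cach + chmax + perf, data = cpus)

# Residuals vs fitted, and scale-location (sqrt of |standardized residuals|)
op <- par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 3)
par(op)

# Rows with the largest absolute standardized residuals, as candidates for inspection
head(sort(abs(rstandard(fit)), decreasing = TRUE), 5)
```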

# Log-transform the response variable
cpus$log_estperf <- log(cpus$estperf)

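# Fit a new linear regression model with the transformed response variable
# Note: `. - name` now also matches the original estperf column (and perf),
# so both appear as predictors in the output below
model_transformed <- lm(log_estperf ~ . - name, data = cpus)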

# Check the residual plot for heteroscedasticity
residuals_transformed <- residuals(model_transformed)

plot(x = fitted(model_transformed), y = residuals_transformed, 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual Plot (Transformed Model)")

abline(h = 0, col = "red", lty = 2)

summary(model_transformed)

## 
## Call:
## lm(formula = log_estperf ~ . - name, data = cpus)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34248 -0.07284  0.01788  0.06913  0.34862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.030e+00  1.961e-02 154.526  < 2e-16 ***
## syct        -1.834e-04  3.618e-05  -5.068 9.13e-07 ***
## mmin         6.660e-05  4.374e-06  15.225  < 2e-16 ***
## mmax         7.997e-05  1.710e-06  46.774  < 2e-16 ***
## cach         8.782e-03  2.879e-04  30.507  < 2e-16 ***
## chmin        4.237e-03  1.669e-03   2.538   0.0119 *  
## chmax        2.741e-03  4.834e-04   5.671 4.92e-08 ***
## perf         9.006e-04  2.034e-04   4.427 1.57e-05 ***
## estperf     -4.724e-03  2.603e-04 -18.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.117 on 200 degrees of freedom
## Multiple R-squared:  0.985,  Adjusted R-squared:  0.9844 
## F-statistic:  1642 on 8 and 200 DF,  p-value: < 2.2e-16

The residual standard error (0.117) is measured on the log scale, so a typical prediction is off by a factor of about exp(0.117) ≈ 1.12, or roughly 12%, rather than by 0.117 performance units. The R-squared (0.985) and adjusted R-squared (0.9844) indicate that the model explains a large proportion of the variability in log estimated performance, and the F-statistic (1642 on 8 and 200 DF) shows that the predictors are jointly significant. These figures are optimistic, however: the formula log_estperf ~ . - name leaves both perf and the untransformed estperf among the predictors, so the response is partly being predicted from itself.
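
The coefficient table above lists estperf and perf as predictors of log_estperf, which happens because the . shorthand matches every remaining column. A minimal sketch of a log-response model restricted to the six hardware predictors is given below. Writing the predictors out explicitly is a design choice made here so that nothing can slip in through the shorthand; fit_log is an illustrative name, and its results are not part of the original report.

```r
library(MASS)

# Spell out the hardware predictors so neither estperf nor perf can enter the model
fit_log <- lm(log(estperf) ~ syct + mmin + mmax + cach + chmin + chmax,
              data = cpus)

# Residuals vs fitted values on the log scale
plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted Values (log scale)", ylab = "Residuals",
     main = "Residual Plot (log response, hardware predictors only)")
abline(h = 0, col = "red", lty = 2)

summary(fit_log)
```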

Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in units that are as meaningful as possible (it may not be possible to use seconds for cycle time, since the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units and for the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function?)

Intercept: The intercept is approximately 3.030 on the log scale. Applying the inverse of the log gives exp(3.030) ≈ 20.7, the estimated performance when every predictor is zero. No real machine has zero cycle time and zero memory, so this is a baseline for the model rather than a meaningful prediction.

Because the response is log-transformed, each coefficient b acts multiplicatively on the original scale: a one-unit increase in a predictor multiplies estimated performance by exp(b), holding all other variables constant. For small b this is close to a 100·b percent change. Units follow the MASS documentation for cpus: syct is in nanoseconds, and mmin, mmax, and cach are in kilobytes.

syct (System Cycle Time): The coefficient for syct is approximately -1.834e-04. Each additional 100 ns of cycle time multiplies estimated performance by exp(-0.01834) ≈ 0.982, a decrease of about 1.8%, holding all other variables constant.

mmin (Minimum Main Memory): The coefficient for mmin is approximately 6.660e-05. Each additional 1000 kB (roughly 1 MB) of minimum memory multiplies estimated performance by exp(0.0666) ≈ 1.069, an increase of about 6.9%, holding all other variables constant.

mmax (Maximum Main Memory): The coefficient for mmax is approximately 7.997e-05. Each additional 1000 kB (roughly 1 MB) of maximum memory multiplies estimated performance by exp(0.07997) ≈ 1.083, an increase of about 8.3%, holding all other variables constant.

cach (Cache Memory): The coefficient for cach is approximately 8.782e-03. Each additional kB of cache increases estimated performance by about 0.88%, holding all other variables constant.

chmin (Minimum Channels): The coefficient for chmin is approximately 4.237e-03. Each additional minimum channel increases estimated performance by about 0.42%, holding all other variables constant.

chmax (Maximum Channels): The coefficient for chmax is approximately 2.741e-03. Each additional maximum channel increases estimated performance by about 0.27%, holding all other variables constant.

perf (Published Performance): The coefficient for perf is approximately 9.006e-04. Each additional unit of published performance increases estimated performance by about 0.09%, holding all other variables constant. The problem statement excludes perf as a predictor, so this term is present only because the model formula did not remove it.

estperf (Estimated Performance): The coefficient for estperf is approximately -4.724e-03. This term has no meaningful interpretation: estperf is the untransformed response, and it entered the model only because the formula log_estperf ~ . - name did not exclude it. Interpreting it as a predictor of its own logarithm is circular, so it should be removed from the model.

All of the percentages above come from exponentiating the coefficient (multiplied by the size of the unit step) and subtracting one. They describe relative changes in estimated performance, not absolute ones.
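
The back-transformation described above can be sketched in a few lines. The coefficients are copied from the summary(model_transformed) output shown earlier; the unit steps (100 ns for syct, 1000 kB for mmin and mmax) are assumptions chosen here for readability, based on the units documented for MASS::cpus.

```r
# Coefficients copied from the summary(model_transformed) output above
coefs <- c(syct = -1.834e-04, mmin = 6.660e-05, mmax = 7.997e-05,
           cach = 8.782e-03, chmin = 4.237e-03, chmax = 2.741e-03)

# Rescale to friendlier units: syct per 100 ns, mmin and mmax per 1000 kB
unit_size <- c(syct = 100, mmin = 1000, mmax = 1000,
               cach = 1, chmin = 1, chmax = 1)

# Percent change in estimated performance per unit step, other predictors held fixed
pct_change <- 100 * (exp(coefs * unit_size) - 1)
round(pct_change, 2)

# Baseline on the original scale when every predictor is zero
baseline <- exp(3.030)
round(baseline, 1)
```

Multiplying the coefficient by the unit step before exponentiating matters: exponentiating first and then scaling would give a different, incorrect number.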

# Load necessary library
library(car)

## Loading required package: carData

# Compute variance inflation factors for the log-response model
vif_values <- vif(model_transformed)

# Print VIF values
print(vif_values)

##      syct      mmin      mmax      cach     chmin     chmax      perf   estperf 
##  1.347415  4.374518  6.108787  2.078872  1.967216  2.399655 16.268406 24.663399

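These values can be read against the common rules of thumb that a VIF above 5 deserves attention and a VIF above 10 signals serious multicollinearity. syct (1.35), cach (2.08), chmin (1.97), and chmax (2.40) show little multicollinearity. mmin (4.37) and mmax (6.11) show a moderate level. perf (16.27) and estperf (24.66) have high VIFs. This is unsurprising, because estperf is the untransformed response and perf measures the same quantity that estperf estimates; neither belongs among the predictors.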
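
For reference, a VIF is 1 / (1 - R²), where R² comes from regressing one predictor on all the others. A minimal base-R sketch of that calculation is given below, restricted to the six hardware predictors. That restriction is an assumption made here (it leaves out perf and estperf), so the numbers it produces will differ from the car::vif() output above.

```r
library(MASS)

X <- cpus[, c("syct", "mmin", "mmax", "cach", "chmin", "chmax")]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

The same values appear on the diagonal of the inverse correlation matrix, solve(cor(X)), which is a quick cross-check.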

# Extract residuals, fitted values and Cook's distances from the log-response model
residuals <- residuals(model_transformed)
fitted_values <- fitted(model_transformed)
cooksd <- cooks.distance(model_transformed)

# Plot standardized residuals vs. fitted values
plot(x = fitted_values, y = rstandard(model_transformed),
     xlab = "Fitted Values", ylab = "Standardized Residuals",
     main = "Standardized Residuals vs. Fitted Values")

# Add a horizontal line at y = 0 for reference
abline(h = 0, col = "red", lty = 2)

# Identify influential observations using Cook's distance
influential_obs <- which(cooksd > 4 / length(fitted_values))
influential_obs

##   1   9  10  31  32  83  91  92  96  97 138 153 154 157 166 167 198 199 200 
##   1   9  10  31  32  83  91  92  96  97 138 153 154 157 166 167 198 199 200

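Yes. Nineteen observations (listed above) have a Cook's distance greater than the rule-of-thumb cutoff 4/n = 4/209 ≈ 0.019, which flags them as potentially influential. The cutoff only marks points for closer inspection; it does not by itself justify removing them.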
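
To see which machines are flagged rather than only their row numbers, the Cook's distances can be joined back to the name column. The sketch below assumes a log-response model on the six hardware predictors (not the model fitted above), so the set of flagged rows may differ from the list shown.

```r
library(MASS)

# Log-response model on the hardware predictors only (an assumption of this sketch)
fit_log <- lm(log(estperf) ~ syct + mmin + mmax + cach + chmin + chmax,
              data = cpus)

cooksd <- cooks.distance(fit_log)
cutoff <- 4 / nrow(cpus)

# Flagged machines, most influential first, with their names for context
flagged <- sort(cooksd[cooksd > cutoff], decreasing = TRUE)
data.frame(name = cpus[names(flagged), "name"], cooks_d = round(flagged, 3))

# Visual check: Cook's distance per observation with the cutoff marked
plot(cooksd, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = cutoff, col = "red", lty = 2)
```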

2) Consider the birthwt data from R package MASS. We will investigate the relationship between low birthweight and the predictors in the birthwt data using logistic regression and discriminant analysis.

# Load necessary library
library(MASS)

# Load the dataset
data(birthwt)

# View the structure of the dataset
str(birthwt)

## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

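# Fit logistic regression model on all remaining columns
# Note: `.` includes bwt, and low is defined from bwt (birth weight below 2.5 kg),
# so bwt separates the two classes perfectly; this causes the warnings below
logistic_model <- glm(low ~ ., data = birthwt, family = binomial)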

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Summarize the model
summary(logistic_model)

## 
## Call:
## glm(formula = low ~ ., family = binomial, data = birthwt)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.161e+03  2.074e+05   0.006    0.996
## age          3.223e-01  1.787e+03   0.000    1.000
## lwt         -1.733e-01  3.202e+02  -0.001    1.000
## race         6.494e-01  3.165e+04   0.000    1.000
## smoke       -1.746e+01  7.668e+04   0.000    1.000
## ptl          1.267e+02  3.406e+05   0.000    1.000
## ht           3.636e+01  1.237e+05   0.000    1.000
## ui          -6.183e+01  7.547e+04  -0.001    0.999
## ftv         -8.925e+00  1.624e+04  -0.001    1.000
## bwt         -4.466e-01  6.468e+01  -0.007    0.994
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.3467e+02  on 188  degrees of freedom
## Residual deviance: 1.0537e-07  on 179  degrees of freedom
## AIC: 20
## 
## Number of Fisher Scoring iterations: 25
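
The two warnings, the enormous standard errors, and the near-zero residual deviance above are typical of perfect separation: bwt is the birth weight itself, and low is derived from it. A hedged sketch of how one might confirm this and refit without bwt is shown below. Converting race to a factor, with the labels given in the MASS documentation, is an addition made here; bw and logit_no_bwt are illustrative names.

```r
library(MASS)

# low is derived from bwt, so the two groups do not overlap in birth weight
max_bwt_low <- max(birthwt$bwt[birthwt$low == 1])
min_bwt_normal <- min(birthwt$bwt[birthwt$low == 0])
c(max_bwt_low = max_bwt_low, min_bwt_normal = min_bwt_normal)

# Treat race as categorical (codes 1, 2, 3 as documented in MASS::birthwt)
bw <- transform(birthwt,
                race = factor(race, labels = c("white", "black", "other")))

# Refit without bwt so the outcome is no longer predicted from itself
logit_no_bwt <- glm(low ~ . - bwt, data = bw, family = binomial)
summary(logit_no_bwt)$coefficients
```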

# Fit discriminant analysis model
discriminant_model <- lda(low ~ ., data = birthwt)

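# Summarize the model
# Note: summary() on an lda object only lists its components;
# print(discriminant_model) shows the priors, group means and coefficients
summary(discriminant_model)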

##         Length Class  Mode     
## prior    2     -none- numeric  
## counts   2     -none- numeric  
## means   18     -none- numeric  
## scaling  9     -none- numeric  
## lev      2     -none- character
## svd      1     -none- numeric  
## N        1     -none- numeric  
## call     3     -none- call     
## terms    3     terms  call     
## xlevels  0     -none- list
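
The listing above describes the structure of the lda object rather than the fitted model. A minimal sketch of a more informative look at a discriminant analysis is given below. It assumes bwt is dropped, since low is derived from it, and that race is treated as a factor; lda_no_bwt and confusion are illustrative names.

```r
library(MASS)

bw <- transform(birthwt, race = factor(race))
lda_no_bwt <- lda(low ~ . - bwt, data = bw)

# Printing the object shows prior probabilities, group means and discriminant coefficients
print(lda_no_bwt)

# In-sample confusion table (optimistic, since the same rows were used for fitting)
pred_class <- predict(lda_no_bwt, newdata = bw)$class
confusion <- table(predicted = pred_class, actual = bw$low)
confusion
```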

Investigate the relationship between variables in the birthwt dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.

# Load necessary libraries
library(MASS)

# Load the dataset
data(birthwt)

summary(birthwt)

##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

# Visualizations for categorical variables
# (every column of birthwt is stored as an integer, so is.factor() selects nothing;
#  the categorical variables are named explicitly instead)

# Manually select pairs of categorical variables for mosaic plots
pairs_for_mosaic <- list(c("race", "smoke"), c("race", "ht"), c("race", "ui"), c("smoke", "ht"), c("smoke", "ui"), c("ht", "ui"))

# Set the layout first, then draw one mosaic plot per pair
# (mosaicplot() draws directly and returns NULL, so there is nothing to print afterwards)
par(mfrow = c(2, 3))  # Adjust the layout if needed
for (pair in pairs_for_mosaic) {
  mosaicplot(table(birthwt[, c(pair, "low")]), 
             main = paste("Low Birthweight vs", pair[1], "and", pair[2]))
}

# Load necessary libraries
library(MASS)
library(ggplot2)

# Load the dataset
data(birthwt)

# Visualizations for categorical variables
par(mfrow = c(1, 2))

# Mosaic plot for race vs low
mosaicplot(table(birthwt$race, birthwt$low), main = "Low Birthweight vs Race")

# Mosaic plot for smoke vs low
mosaicplot(table(birthwt$smoke, birthwt$low), main = "Low Birthweight vs Smoke")

# Visualizations for quantitative variables
par(mfrow = c(2, 2))

# Box plot for age vs low
boxplot(age ~ low, data = birthwt, main = "Low Birthweight vs Age")

# Box plot for lwt vs low
boxplot(lwt ~ low, data = birthwt, main = "Low Birthweight vs LWT")

# Box plot for ptl vs low
boxplot(ptl ~ low, data = birthwt, main = "Low Birthweight vs PTL")

# Box plot for bwt vs low
boxplot(bwt ~ low, data = birthwt, main = "Low Birthweight vs BWT")

Race vs. Low Birthweight: The mosaic plot for race vs. low birthweight shows that race category 1 (white, per the MASS documentation) has the smallest proportion of low birthweights; categories 2 (black) and 3 (other) have larger proportions. This agrees with the positive race coefficient in the logistic regression in b).

Smoke vs. Low Birthweight: The mosaic plot for smoke vs. low birthweight indicates that smoking (smoke = 1) is associated with a higher proportion of low birthweights compared to non-smokers.

Box Plots:

Age vs. Low Birthweight: In the box plot for age vs. low birthweight, we observe a slight tendency for younger mothers to have a higher likelihood of low birthweight, though the two age distributions overlap considerably.

LWT (mother's weight in pounds at last menstrual period) vs. Low Birthweight: The box plot for lwt vs. low birthweight suggests that lower maternal weight is associated with a higher likelihood of low birthweight.

PTL (number of previous premature labours) vs. Low Birthweight: ptl is a count (0 to 3 in this data) and most mothers have a value of 0, so the box plot is compressed toward zero and hard to read. It still suggests that a history of prior preterm labor is associated with an increased likelihood of low birthweight.

BWT (Birthweight) vs. Low Birthweight: The two groups separate completely. This is by construction, because low is defined as a birthweight below 2.5 kg, and it is the reason bwt must not be used as a predictor of low.
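
The mosaic plots can be backed by numeric summaries. The sketch below computes the proportion of low birthweights within each race and smoking group. The race labels follow the MASS documentation (1 = white, 2 = black, 3 = other); the object names are placeholders of mine.

```r
# Local copy so the birthwt object used elsewhere is not modified
bw <- MASS::birthwt

# Label the race codes as documented in MASS: 1 = white, 2 = black, 3 = other
race_f <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# Row proportions: share of normal (0) and low (1) birthweight within each group
prop_by_race  <- prop.table(table(race = race_f, low = bw$low), margin = 1)
prop_by_smoke <- prop.table(table(smoke = bw$smoke, low = bw$low), margin = 1)

round(prop_by_race, 3)
round(prop_by_smoke, 3)
```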

b) Fit a logistic regression model using methods discussed in class/the book, similar to problem 1). Be careful to understand each variable in birthwt to avoid including variables that are not logically acceptable for inclusion in the model.

# Load necessary libraries
library(MASS)

# Load the dataset
data(birthwt)

# View the structure of the dataset
str(birthwt)

## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

# Fit logistic regression model
logistic_model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                      data = birthwt, 
                      family = binomial)

# Display the summary of the model
summary(logistic_model)

## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui + 
##     ftv, family = binomial, data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.078975   1.276254  -0.062  0.95066   
## age         -0.035845   0.036472  -0.983  0.32569   
## lwt         -0.012387   0.006614  -1.873  0.06111 . 
## race         0.453424   0.215294   2.106  0.03520 * 
## smoke        0.937275   0.398458   2.352  0.01866 * 
## ptl          0.542087   0.346168   1.566  0.11736   
## ht           1.830720   0.694135   2.637  0.00835 **
## ui           0.721965   0.463174   1.559  0.11906   
## ftv          0.063461   0.169765   0.374  0.70854   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 204.19  on 180  degrees of freedom
## AIC: 222.19
## 
## Number of Fisher Scoring iterations: 4
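
In the fit above, race enters as a number (1, 2, 3), which forces the step from 1 to 2 to have the same effect on the log odds as the step from 2 to 3. A minimal sketch of the alternative, assuming race is recoded as a factor with the labels from the MASS documentation; fit_race_factor is a placeholder name and its output is not part of the results reported here.

```r
# Local copy with race recoded as an unordered factor
bw <- MASS::birthwt
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# Same predictors as in b); race now gets one coefficient per non-reference level
fit_race_factor <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                       data = bw, family = binomial)

round(coef(summary(fit_race_factor)), 4)
```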

c) What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?

Coefficients for ptl and ftv:

ptl (number of previous premature labours): Estimate = 0.542087, Std. Error = 0.346168, z value = 1.566, p-value = 0.11736

ftv (number of physician visits during the first trimester): Estimate = 0.063461, Std. Error = 0.169765, z value = 0.374, p-value = 0.70854

What I notice: Both variables are counts rather than yes/no indicators. In the summary above, ptl ranges from 0 to 3 and ftv from 0 to 6, and both have a median of 0, so most mothers sit at zero and the larger values are rare. Neither coefficient is statistically significant at the conventional 0.05 level.

Implicit assumptions: Because ptl and ftv enter the model as numeric predictors, the model assumes each has a linear effect on the log odds. Holding other variables constant, every additional prior premature labour adds 0.542087 to the log odds of a low birthweight baby, whether the change is from 0 to 1 or from 2 to 3. Likewise, every additional physician visit adds 0.063461, whether the change is from 0 to 1 or from 5 to 6.

Are these assumptions realistic? Probably not. For ptl, the main contrast is likely between mothers with no prior premature labour and mothers with any, and there are too few observations at the higher counts to estimate a per-labour slope reliably. For ftv, a constant effect per visit is also doubtful: the relationship need not even be monotone, since a mother with many visits may be attending because of complications. Collapsing the sparse categories, as in d) and e), avoids the linearity assumption.
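
How sparse the higher counts are can be checked directly. The sketch below tabulates ptl and ftv and computes the observed proportion of low birthweights at each level; the object names are placeholders of mine. Proportions at levels with only a handful of mothers are very noisy.

```r
# Local copy so the birthwt object used elsewhere is not modified
bw <- MASS::birthwt

# Number of mothers at each count
table(ptl = bw$ptl)
table(ftv = bw$ftv)

# Observed proportion of low birthweight at each level (mean of a 0/1 variable)
low_rate_by_ptl <- aggregate(low ~ ptl, data = bw, FUN = mean)
low_rate_by_ftv <- aggregate(low ~ ftv, data = bw, FUN = mean)

low_rate_by_ptl
low_rate_by_ftv
```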

d) Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.

# Create new variable ptl2
birthwt$ptl2 <- ifelse(birthwt$ptl > 0, 1, 0)

# View the updated dataset
head(birthwt)

##    low age lwt race smoke ptl ht ui ftv  bwt ptl2
## 85   0  19 182    2     0   0  0  1   0 2523    0
## 86   0  33 155    3     0   0  0  0   3 2551    0
## 87   0  20 105    1     1   0  0  0   1 2557    0
## 88   0  21 108    1     1   0  0  1   2 2594    0
## 89   0  18 107    1     1   0  0  1   0 2600    0
## 91   0  21 124    3     0   0  0  0   0 2622    0
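
head() shows only mothers with ptl = 0, so it cannot confirm that the collapse worked. A short check, sketched on a local copy (the names collapse_check and low_by_ptl2 are placeholders): cross-tabulate the original counts against ptl2, then look at the proportion of low birthweights in each ptl2 group.

```r
# Local copy with the same collapse as above
bw <- MASS::birthwt
bw$ptl2 <- ifelse(bw$ptl > 0, 1, 0)

# Every ptl = 0 row should map to ptl2 = 0, and every ptl >= 1 row to ptl2 = 1
collapse_check <- table(ptl = bw$ptl, ptl2 = bw$ptl2)
collapse_check

# Proportion of normal (0) and low (1) birthweight within each ptl2 group
low_by_ptl2 <- prop.table(table(ptl2 = bw$ptl2, low = bw$low), margin = 1)
round(low_by_ptl2, 3)
```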

e) Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.

# Create new variable ftv2
birthwt$ftv2 <- ifelse(birthwt$ftv > 0, 1, 0)

# View the updated dataset
head(birthwt)

##    low age lwt race smoke ptl ht ui ftv  bwt ptl2 ftv2
## 85   0  19 182    2     0   0  0  1   0 2523    0    0
## 86   0  33 155    3     0   0  0  0   3 2551    0    1
## 87   0  20 105    1     1   0  0  0   1 2557    0    1
## 88   0  21 108    1     1   0  0  1   2 2594    0    1
## 89   0  18 107    1     1   0  0  1   0 2600    0    0
## 91   0  21 124    3     0   0  0  0   0 2622    0    0

# Create a table summarizing low birthweight probabilities by levels of ftv2
table_summary <- table(birthwt$ftv2, birthwt$low) 

# Add row and column names for clarity
rownames(table_summary) <- c("No Physician Visit", "At Least One Physician Visit")
colnames(table_summary) <- c("Normal Birthweight", "Low Birthweight")

# Display the table summary
table_summary

##                               
##                                Normal Birthweight Low Birthweight
##   No Physician Visit                           64              36
##   At Least One Physician Visit                 66              23
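
Part e) asks about probabilities rather than counts. From the table above, 36 of 100 mothers with no visit (36%) and 23 of 89 mothers with at least one visit (about 26%) had a low birthweight baby. The sketch below computes these row proportions. It also shows a three-level grouping (0, 1, 2 or more visits), which is an assumption of mine for comparison and not the ftv2 used in this analysis.

```r
# Local copy with the same binary ftv2 as above
bw <- MASS::birthwt
bw$ftv2 <- ifelse(bw$ftv > 0, 1, 0)

# Row proportions: probability of normal (0) and low (1) birthweight by ftv2
prob_by_ftv2 <- prop.table(table(ftv2 = bw$ftv2, low = bw$low), margin = 1)
round(prob_by_ftv2, 3)

# Hypothetical alternative: keep 0 and 1 separate and pool 2 or more visits
ftv3 <- cut(bw$ftv, breaks = c(-Inf, 0, 1, Inf), labels = c("0", "1", "2+"))
round(prop.table(table(ftv3 = ftv3, low = bw$low), margin = 1), 3)
```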

f) Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ftv2 and ptl2 in the modeling). Comment on what you find - are the new versions of these variables important in predicting low birthweight?

# Fit logistic regression model with ptl2 and ftv2
updated_logistic_model <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
                               data = birthwt,
                               family = binomial)

# Display the summary of the updated model
summary(updated_logistic_model)

## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui + 
##     ftv2, family = binomial, data = birthwt)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.132302   1.325225   0.100  0.92048   
## age         -0.041760   0.037844  -1.103  0.26981   
## lwt         -0.011848   0.006755  -1.754  0.07943 . 
## race         0.404931   0.224863   1.801  0.07174 . 
## smoke        0.816451   0.416669   1.959  0.05006 . 
## ptl2         1.249407   0.465305   2.685  0.00725 **
## ht           1.795749   0.702141   2.558  0.01054 * 
## ui           0.657830   0.468951   1.403  0.16069   
## ftv2        -0.103561   0.373413  -0.277  0.78152   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 199.45  on 180  degrees of freedom
## AIC: 217.45
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

ptl2 (Prior Preterm Labor): Estimate: 1.249407, Std. Error: 0.465305, z value: 2.685, p-value: 0.00725 (statistically significant at the conventional 0.05 level). Holding other variables constant, the log odds of a low birthweight baby are 1.249407 higher for mothers with at least one prior preterm labor than for mothers with none.

ftv2 (Physician Visits): Estimate: -0.103561, Std. Error: 0.373413, z value: -0.277, p-value: 0.78152 (not statistically significant at the 0.05 level). Holding other variables constant, the log odds of a low birthweight baby for mothers with at least one physician visit are not significantly different from those for mothers with no visits. The negative sign points toward a lower risk, but the data do not support a conclusion.

Conclusion: ptl2 is an important predictor of low birthweight. As a binary indicator it is statistically significant (p = 0.00725), whereas the count version ptl in b) was not (p = 0.11736). ftv2 is not statistically significant, so an effect of physician visits is not supported in this model. Replacing ptl and ftv with ptl2 and ftv2 also lowers the AIC from 222.19 to 217.45 with the same number of parameters, so the updated model fits better.
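
Coefficients on the log odds scale are easier to read as odds ratios: exp(1.249407) is about 3.5, so the estimated odds of a low birthweight baby are roughly 3.5 times higher with a prior preterm labor. The sketch below refits the model from f) on a local copy and reports odds ratios, then uses drop1() for likelihood ratio tests as a complement to the Wald z tests in the summary. The names fit_f and odds_ratios are placeholders.

```r
# Local copy with the collapsed variables from d) and e)
bw <- MASS::birthwt
bw$ptl2 <- ifelse(bw$ptl > 0, 1, 0)
bw$ftv2 <- ifelse(bw$ftv > 0, 1, 0)

# Same formula as the updated model in f)
fit_f <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2,
             data = bw, family = binomial)

# Odds ratios: exponentiated log odds coefficients
odds_ratios <- exp(coef(fit_f))
round(odds_ratios[c("ptl2", "ftv2")], 2)

# Likelihood ratio test for dropping each term in turn
drop1(fit_f, test = "Chisq")
```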

In a manner similar to the approach used in the book, split the birthwt data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to as done in the notes (but don’t use my name to set the seed!)

# Load necessary libraries
library(MASS)
library(caret)

## Loading required package: lattice

# Set seed (the assignment asks for TeachingDemos::char2seed(); set.seed(123) produced the results below)
set.seed(123)  

# Step 1: Split data into training and test sets
splitIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[splitIndex, ]
test_data <- birthwt[-splitIndex, ]

# Step 2: Fit Logistic Regression model using the training set
logistic_model_train <- glm(low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2, 
                             family = binomial, data = train_data)

# Step 3: Predictions for logistic regression on the test set
predicted_probs <- predict(logistic_model_train, newdata = test_data, type = "response")

# Convert predicted probabilities to class labels
predicted_labels <- ifelse(predicted_probs > 0.5, 1, 0)

# Confusion matrix for logistic regression
conf_matrix_logistic <- table(predicted_labels, test_data$low)

# Sensitivity, specificity, and accuracy for logistic regression
sensitivity_logistic <- conf_matrix_logistic[2, 2] / sum(test_data$low == 1)
specificity_logistic <- conf_matrix_logistic[1, 1] / sum(test_data$low == 0)
accuracy_logistic <- sum(diag(conf_matrix_logistic)) / sum(conf_matrix_logistic)

# Display results
performance <- data.frame(
  Model = "Logistic Regression",
  Sensitivity = sensitivity_logistic,
  Specificity = specificity_logistic,
  Accuracy = accuracy_logistic
)

performance

##                 Model Sensitivity Specificity  Accuracy
## 1 Logistic Regression   0.1818182   0.8461538 0.6486486
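
Indexing the confusion matrix with [2, 2] assumes the model predicts both classes. If every prediction is 0, the table has a single row and that index fails. A minimal sketch of a helper that fixes both factor levels first; class_metrics is a hypothetical name, not a function from caret or MASS.

```r
# Sensitivity, specificity, and accuracy from 0/1 predictions and 0/1 truth.
# Fixing the levels keeps the table 2 x 2 even when one class is never predicted.
class_metrics <- function(pred, truth) {
  pred  <- factor(pred,  levels = c(0, 1))
  truth <- factor(truth, levels = c(0, 1))
  cm <- table(Predicted = pred, Actual = truth)
  c(sensitivity = cm["1", "1"] / sum(cm[, "1"]),  # true positives / actual positives
    specificity = cm["0", "0"] / sum(cm[, "0"]),  # true negatives / actual negatives
    accuracy    = sum(diag(cm)) / sum(cm))
}

# Toy example: one actual positive is missed
m_toy <- class_metrics(pred = c(0, 0, 1, 1), truth = c(0, 1, 1, 1))
m_toy

# Still works when only class 0 is predicted
m_all0 <- class_metrics(pred = c(0, 0, 0, 0), truth = c(0, 1, 0, 1))
m_all0
```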

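Model f)

# Step 1: Split data into training and test sets
# (this draws a new random split, so the test set differs from the one above)
splitIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[splitIndex, ]
test_data <- birthwt[-splitIndex, ]

# Step 2: Predictions for updated logistic regression on the test set
# (updated_logistic_model was fit on all of birthwt in f), so these test rows were also used for fitting)
predicted_probs_updated <- predict(updated_logistic_model, newdata = test_data, type = "response")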

# Convert predicted probabilities to class labels
predicted_labels_updated <- ifelse(predicted_probs_updated > 0.5, 1, 0)

# Confusion matrix for updated logistic regression
conf_matrix_updated <- table(predicted_labels_updated, test_data$low)

# Sensitivity, specificity, and accuracy for updated logistic regression
sensitivity_updated <- conf_matrix_updated[2, 2] / sum(test_data$low == 1)
specificity_updated <- conf_matrix_updated[1, 1] / sum(test_data$low == 0)
accuracy_updated <- sum(diag(conf_matrix_updated)) / sum(conf_matrix_updated)

# Display results for updated logistic regression
performance_updated <- data.frame(
  Model = "Updated Logistic Regression",
  Sensitivity = sensitivity_updated,
  Specificity = specificity_updated,
  Accuracy = accuracy_updated
)

performance_updated

##                         Model Sensitivity Specificity  Accuracy
## 1 Updated Logistic Regression   0.3846154   0.8333333 0.6756757

Model b)

# Step 1: Split data into training and test sets
# (another new random split; the test set differs from the two above)
splitIndex <- createDataPartition(birthwt$low, p = 0.8, list = FALSE)
train_data <- birthwt[splitIndex, ]
test_data <- birthwt[-splitIndex, ]

# Step 2: Predictions for the original logistic regression from b) on the test set
# (logistic_model was fit on all of birthwt, so these test rows were also used for fitting)
predicted_probs_original <- predict(logistic_model, newdata = test_data, type = "response")

# Convert predicted probabilities to class labels
predicted_labels_original <- ifelse(predicted_probs_original > 0.5, 1, 0)

# Confusion matrix for the original logistic regression
conf_matrix_original <- table(predicted_labels_original, test_data$low)

# Sensitivity, specificity, and accuracy for the original logistic regression
sensitivity_original <- conf_matrix_original[2, 2] / sum(test_data$low == 1)
specificity_original <- conf_matrix_original[1, 1] / sum(test_data$low == 0)
accuracy_original <- sum(diag(conf_matrix_original)) / sum(conf_matrix_original)

# Display results for the original logistic regression
performance_original <- data.frame(
  Model = "Original Logistic Regression",
  Sensitivity = sensitivity_original,
  Specificity = specificity_original,
  Accuracy = accuracy_original
)

performance_original

##                          Model Sensitivity Specificity  Accuracy
## 1 Original Logistic Regression   0.2222222   0.8928571 0.7297297
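
The three sets of metrics above are not directly comparable: each block draws a new random split, and logistic_model and updated_logistic_model were fit on all 189 rows, so their test rows were also used for fitting. The sketch below shows one way to compare the models from b) and f) fairly, by refitting both on a single training set and scoring both on the same test set. It uses base R sampling with set.seed(123) as assumptions of mine (the analysis above uses caret, and the assignment asks for TeachingDemos::char2seed()); its numbers will differ from those reported above.

```r
# Local copy with the collapsed variables from d) and e)
bw <- MASS::birthwt
bw$ptl2 <- ifelse(bw$ptl > 0, 1, 0)
bw$ftv2 <- ifelse(bw$ftv > 0, 1, 0)

# One split, reused for every model (about 20% of rows held out)
set.seed(123)  # assumption; replace with the seed required by the assignment
test_idx <- sample(nrow(bw), size = round(0.2 * nrow(bw)))
train <- bw[-test_idx, ]
test  <- bw[test_idx, ]

form_b <- low ~ age + lwt + race + smoke + ptl  + ht + ui + ftv
form_f <- low ~ age + lwt + race + smoke + ptl2 + ht + ui + ftv2

# Fit on the training rows only, then score the held-out rows
evaluate_glm <- function(form) {
  fit  <- glm(form, data = train, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")
  pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))
  cm   <- table(Predicted = pred, Actual = factor(test$low, levels = c(0, 1)))
  c(sensitivity = cm["1", "1"] / sum(cm[, "1"]),
    specificity = cm["0", "0"] / sum(cm[, "0"]),
    accuracy    = sum(diag(cm)) / sum(cm))
}

results <- rbind(model_b = evaluate_glm(form_b),
                 model_f = evaluate_glm(form_f))
round(results, 3)
```

With a test set this small, a difference of one or two correctly classified mothers changes the ranking, so repeating the comparison over several seeds is a reasonable check.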

# Display the summary of the logistic regression model
summary(logistic_model_train)

## 
## Call:
## glm(formula = low ~ age + lwt + race + smoke + ptl2 + ht + ui + 
##     ftv2, family = binomial, data = train_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.852906   1.604201  -0.532   0.5950  
## age         -0.010328   0.041957  -0.246   0.8056  
## lwt         -0.011783   0.008076  -1.459   0.1445  
## race         0.564538   0.267157   2.113   0.0346 *
## smoke        0.965524   0.496599   1.944   0.0519 .
## ptl2         1.168134   0.504784   2.314   0.0207 *
## ht           1.822660   0.852966   2.137   0.0326 *
## ui           0.794484   0.508886   1.561   0.1185  
## ftv2        -0.446293   0.427504  -1.044   0.2965  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 189.59  on 151  degrees of freedom
## Residual deviance: 156.77  on 143  degrees of freedom
## AIC: 174.77
## 
## Number of Fisher Scoring iterations: 4

The logistic regression model for predicting low birthweight (variable ‘low’) based on the covariates age, lwt, race, smoke, ptl2, ht, ui, and ftv2 has been summarized as follows:

Model Coefficients:

Intercept: -0.853 (p-value = 0.5950) The estimated log-odds of the response variable being 1 when all predictors are zero is -0.853. Zero is outside the range of the data for age (minimum 14) and lwt (minimum 80), so the intercept has no practical interpretation, and its p-value is not of interest.

Age: -0.0103 (p-value = 0.8056) For a one-unit increase in age, the log-odds of the response variable being 1 decrease by 0.0103. The p-value indicates that age is not statistically significant.

Lwt: -0.0118 (p-value = 0.1445) For a one-unit increase in lwt, the log-odds of the response variable being 1 decrease by 0.0118. The p-value suggests that lwt is not statistically significant.

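Race: 0.5645 (p-value = 0.0346) race is a category code (1 = white, 2 = black, 3 = other) entered as a number, so the coefficient is the change in log-odds per one-step increase in the code. This forces the step from 1 to 2 to equal the step from 2 to 3, which has no natural justification for an unordered category; coding race as a factor would avoid it. The p-value indicates that race is statistically significant at the 0.05 level.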

Smoke: 0.9655 (p-value = 0.0519) For a smoker compared to a non-smoker, the log-odds of the response variable being 1 increase by 0.9655. The p-value is close to the significance threshold (0.05).

Ptl2: 1.1681 (p-value = 0.0207) For mothers with at least one prior preterm labor compared to those with none, the log-odds of the response variable being 1 increase by 1.1681. The p-value suggests that ptl2 is statistically significant.

Ht: 1.8227 (p-value = 0.0326) For mothers with a history of hypertension compared to those without, the log-odds of the response variable being 1 increase by 1.8227. The p-value indicates that ht is statistically significant.

Ui: 0.7945 (p-value = 0.1185) For mothers with uterine irritability compared to those without, the log-odds of the response variable being 1 increase by 0.7945. The p-value is greater than the significance threshold.

Ftv2: -0.4463 (p-value = 0.2965) For mothers with at least one physician visit compared to those with none, the log-odds of the response variable being 1 decrease by 0.4463. The p-value suggests that ftv2 is not statistically significant.

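Model Fit: Null Deviance: 189.59 (on 151 degrees of freedom); Residual Deviance: 156.77 (on 143 degrees of freedom); AIC: 174.77

Comments: The predictors reduce the deviance from 189.59 to 156.77, a drop of 32.82 on 8 degrees of freedom. This exceeds the 0.05 chi-square critical value of 15.51, so the model improves on the intercept-only fit. Lower AIC values indicate better fit, but only among models fit to the same data: the AIC of 174.77 here comes from the 152 training rows and cannot be compared with the AICs of 222.19 and 217.45 from the full-data fits in b) and f). race, ptl2, and ht are statistically significant at the 0.05 level, and smoke is borderline (p = 0.0519).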
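
The task also asks for LDA and QDA fit to the training set. The sketch below is one hedged way to do that, not a result from this analysis. It assumes only the quantitative predictors age and lwt are used, on the reasoning that LDA and QDA model the predictors as multivariate normal within each class, which binary indicators cannot satisfy. The split uses base R sampling with set.seed(123), again an assumption in place of the seed the assignment requires.

```r
library(MASS)

# Local copy and a single train/test split (about 20% held out)
bw <- MASS::birthwt
set.seed(123)  # assumption; replace with the seed required by the assignment
test_idx <- sample(nrow(bw), size = round(0.2 * nrow(bw)))
train <- bw[-test_idx, ]
test  <- bw[test_idx, ]

# Discriminant models on the quantitative predictors only (assumed choice)
lda_fit <- lda(low ~ age + lwt, data = train)
qda_fit <- qda(low ~ age + lwt, data = train)

# Confusion matrix and metrics on the held-out rows
da_metrics <- function(fit) {
  pred <- predict(fit, newdata = test)$class  # factor with levels "0" and "1"
  cm   <- table(Predicted = pred, Actual = factor(test$low, levels = c(0, 1)))
  c(sensitivity = cm["1", "1"] / sum(cm[, "1"]),
    specificity = cm["0", "0"] / sum(cm[, "0"]),
    accuracy    = sum(diag(cm)) / sum(cm))
}

da_results <- rbind(LDA = da_metrics(lda_fit), QDA = da_metrics(qda_fit))
round(da_results, 3)
```

With only age and lwt, either method may classify nearly every mother as normal birthweight, which shows up as high specificity and low sensitivity.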

NikhilBharadwaj_DA_Mid

2024-02-27