Consider the cpus data from R package MASS. We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance as a predictor of performance in this problem.
Investigate the relationship between variables in the
cpus dataset, both numerically and visually. Comment on the
relationships you observe.
#import the necessary packages
library(MASS)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
#load the dataset
data(cpus)
#check the dataset documentation
?cpus
## starting httpd help server ... done
#inspecting the dataset
head(cpus)
#getting a summary
summary(cpus)
## name syct mmin mmax
## ADVISOR 32/60 : 1 Min. : 17.0 Min. : 64 Min. : 64
## AMDAHL 470/7A : 1 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000
## AMDAHL 470V/7 : 1 Median : 110.0 Median : 2000 Median : 8000
## AMDAHL 470V/7B: 1 Mean : 203.8 Mean : 2868 Mean :11796
## AMDAHL 470V/7C: 1 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000
## AMDAHL 470V/8 : 1 Max. :1500.0 Max. :32000 Max. :64000
## (Other) :203
## cach chmin chmax perf
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 6.0
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 27.0
## Median : 8.00 Median : 2.000 Median : 8.00 Median : 50.0
## Mean : 25.21 Mean : 4.699 Mean : 18.27 Mean : 105.6
## 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00 3rd Qu.: 113.0
## Max. :256.00 Max. :52.000 Max. :176.00 Max. :1150.0
##
## estperf
## Min. : 15.00
## 1st Qu.: 28.00
## Median : 45.00
## Mean : 99.33
## 3rd Qu.: 101.00
## Max. :1238.00
##
From the summary() we observe the following:
Right-skewed distributions: most variables have a median well below the mean, indicating that a few high-value CPUs skew the distributions.
Potential outliers: syct, mmax, perf, and estperf have maximum values far above their medians and means, suggesting that outliers exist.
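As a quick numeric check of this skew, we can compare the mean and median of each numeric variable directly (a minimal sketch; the informal rule is simply that a mean well above the median suggests right skew):
#compare mean and median for each numeric column
num_vars <- cpus[, sapply(cpus, is.numeric)]
round(sapply(num_vars, function(x) c(mean = mean(x), median = median(x))), 1)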
#get the correlation matrix
corr_matrix <- cor(cpus[, sapply(cpus, is.numeric)])
corr_matrix
## syct mmin mmax cach chmin chmax
## syct 1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin -0.3356422 1.0000000 0.7581573 0.5347291 0.5171892 0.2669074
## mmax -0.3785606 0.7581573 1.0000000 0.5379898 0.5605134 0.5272462
## cach -0.3209998 0.5347291 0.5379898 1.0000000 0.5822455 0.4878458
## chmin -0.3010897 0.5171892 0.5605134 0.5822455 1.0000000 0.5482812
## chmax -0.2505023 0.2669074 0.5272462 0.4878458 0.5482812 1.0000000
## perf -0.3070821 0.7949233 0.8629942 0.6626135 0.6089025 0.6052193
## estperf -0.2883956 0.8192915 0.9012024 0.6486203 0.6105802 0.5921556
## perf estperf
## syct -0.3070821 -0.2883956
## mmin 0.7949233 0.8192915
## mmax 0.8629942 0.9012024
## cach 0.6626135 0.6486203
## chmin 0.6089025 0.6105802
## chmax 0.6052193 0.5921556
## perf 1.0000000 0.9664687
## estperf 0.9664687 1.0000000
corrplot(corr_matrix, method = 'color', tl.cex = 0.8)
We see from the above plot that cycle time in nanoseconds (syct) is negatively correlated with all of the other variables, including perf, although not strongly (r ≈ -0.31). This suggests that machines with longer cycle times tend to have lower performance.
We also see a very strong positive correlation between published performance (perf) and estimated relative performance (estperf) (r ≈ 0.97), and strong correlations between the memory variables (mmin and mmax) and both performance measures (r between about 0.79 and 0.90). This is expected, since the amount of memory strongly affects a computer's performance.
#we can make a few scatter plots to help us explore the data further
plot(
cpus$mmax,
cpus$estperf,
main="Estimated Performance vs. Main Memory",
xlab = "Maximum Main Memory (kB)",
ylab="Estimated Performance",
col="blue",
pch = 16
)
# we make another plot for maximum number of channels versus performance
plot(
cpus$chmax,
cpus$estperf,
main="Estimated Performance vs. Number of Channels",
xlab = "Maximum Number of Channels",
ylab="Estimated Performance",
col="blue",
pch = 16
)
# finally we get to look at one more scatter plot for our data exploration
plot(
cpus$cach,
cpus$estperf,
main ='Estimated Performance Vs. Cache Size',
xlab = "Cache Size in KB",
ylab="Estimated Performance",
col="blue",
pch = 16
)
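To wrap up the exploration, a scatterplot matrix gives all pairwise relationships in one view (a minimal base-R sketch):
#scatterplot matrix of all numeric variables
pairs(cpus[, sapply(cpus, is.numeric)], pch = 16, col = "blue", cex = 0.5)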
Use either methods commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the cpus dataset. Do not consider name in this modeling approach. Explain the process used to arrive at your final model.
#fitting an initial full model (excluding `name`)
full_model <- lm(estperf ~ syct + mmin + mmax + cach + chmin + chmax + perf, data = cpus)
#getting the summary of the model
summary(full_model)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmin + chmax +
## perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.481 -9.549 2.864 15.258 182.259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.423e+01 4.732e+00 -7.233 9.71e-12 ***
## syct 3.777e-02 9.434e-03 4.003 8.79e-05 ***
## mmin 5.483e-03 1.120e-03 4.894 2.02e-06 ***
## mmax 3.376e-03 3.974e-04 8.495 4.41e-15 ***
## cach 1.245e-01 7.751e-02 1.606 0.10977
## chmin -1.651e-02 4.523e-01 -0.037 0.97091
## chmax 3.457e-01 1.287e-01 2.686 0.00783 **
## perf 5.770e-01 3.718e-02 15.519 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.7 on 201 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.958
## F-statistic: 679.5 on 7 and 201 DF, p-value: < 2.2e-16
From the model summary, we have an \(R^2\) of 0.9595, meaning the model explains 95.95% of the variance in estimated performance.
The F-statistic of 679.5 is large, with a corresponding p-value below \(2.2 \times 10^{-16}\), indicating that at least one predictor contributes significantly to the model.
Looking at the predictors and their associated p-values, most are statistically significant in predicting the estimated-performance response, except cache size (p-value = 0.10977) and minimum number of channels (p-value = 0.97091). Maximum main memory and published performance are the strongest predictors, as they have the smallest p-values.
Based on this, we fit a new model without cache size and minimum number of channels and examine its performance.
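As a cross-check on this manual backward-elimination step, stepwise selection by AIC with stepAIC() from MASS (one of the methods used in the notes) can also be run on the full model; this is only a sketch, and the reduced model reported below was chosen manually.
#backward stepwise selection by AIC, starting from the full model
step_model <- stepAIC(full_model, direction = "backward", trace = FALSE)
summary(step_model)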
#model without cache size and minimum number of channels predictors
lm2 <- lm(estperf ~ syct + mmin + mmax + chmax + perf, data = cpus)
#get the summary
summary(lm2)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + chmax + perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.777 -10.446 2.617 14.216 173.179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.254e+01 4.620e+00 -7.043 2.86e-11 ***
## syct 3.521e-02 9.313e-03 3.780 0.000206 ***
## mmin 5.648e-03 1.099e-03 5.138 6.50e-07 ***
## mmax 3.275e-03 3.929e-04 8.335 1.16e-14 ***
## chmax 3.772e-01 1.212e-01 3.112 0.002127 **
## perf 5.962e-01 3.537e-02 16.855 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.76 on 203 degrees of freedom
## Multiple R-squared: 0.9589, Adjusted R-squared: 0.9579
## F-statistic: 947.3 on 5 and 203 DF,  p-value: < 2.2e-16
Without the two predictors, we have a model with \(R^2 = 0.9589\), meaning it still explains 95.89% of the variability in estimated performance. Dropping cache size and minimum number of channels therefore costs essentially no explanatory power. The F-statistic also rises from 679.5 to 947.3, although this largely reflects the smaller number of predictors rather than a better fit on its own.
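A more direct comparison of the two nested models is a partial F-test, sketched below; a non-significant result would confirm that the dropped predictors can be omitted.
#partial F-test comparing the reduced model to the full model
anova(lm2, full_model)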
Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.
# Residuals plots
par(mfrow = c(2, 2))
plot(lm2)
In the Residuals vs. Fitted plot, the red LOESS line is not flat but shows a systematic trend, especially for small fitted values. This suggests that the relationship between the predictors and the response may not be purely linear, so the linearity assumption may be violated. The spread of the residuals is also not constant across fitted values, i.e. there is heteroscedasticity, violating the constant-variance assumption. Finally, this plot flags a few outliers, for example observations 32, 10, and 157; we check later whether they are also high-leverage points.
From the Q-Q plot, we can assess whether the residuals follow a normal distribution. In the middle section, most points follow the 45-degree dashed line, but the residuals deviate substantially in both tails, indicating non-normality in the extremes. As in the previous plot, observations 32, 10, and 157 stand out as outliers.
Lastly, the Residuals vs. Leverage plot confirms that observations 32, 10, and 157 are not only outliers but also have relatively high Cook's distance compared to the other observations. Observation 10 stands out in particular, with both a large residual and a Cook's distance beyond the plotted boundary, implying that this point has a strong influence on the fitted model.
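One common adjustment for the non-constant variance and heavy tails seen here, and the one anticipated by the later interpretation question, is to log-transform the response; a minimal sketch (this is not the model carried forward below):
#refit with a log-transformed response to stabilize the variance
lm_log <- lm(log(estperf) ~ syct + mmin + mmax + chmax + perf, data = cpus)
par(mfrow = c(2, 2))
plot(lm_log) #re-check the residual diagnostics on the log scale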
Here, instead, we keep the model on the original scale and consider removing observation 10 and re-fitting.
cpus_filtered <- cpus[-10, ] # Exclude observation 10
lm_new <- lm(estperf ~ syct + mmin + mmax + chmax + perf, data=cpus_filtered)
summary(lm_new)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + chmax + perf, data = cpus_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95.956 -10.592 1.092 12.568 165.711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.813e+01 4.127e+00 -6.816 1.05e-10 ***
## syct 2.840e-02 8.285e-03 3.428 0.000736 ***
## mmin 3.181e-03 1.025e-03 3.102 0.002196 **
## mmax 3.909e-03 3.574e-04 10.936 < 2e-16 ***
## chmax 4.078e-01 1.073e-01 3.802 0.000190 ***
## perf 5.467e-01 3.196e-02 17.107 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.09 on 202 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9556
## F-statistic: 892.3 on 5 and 202 DF,  p-value: < 2.2e-16
We see a slightly lower F-statistic (947.3 down to 892.3) and a drop in multiple \(R^2\) from 0.9589 to 0.9567 (adjusted \(R^2\) from 0.9579 to 0.9556), which is expected because we removed a highly influential point from the dataset.
par(mfrow = c(2, 2))
plot(lm_new)
How well does the final model fit the data? Comment on some model fit criteria from the model built in c)
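The key fit criteria for the final model appear in the summary above: multiple \(R^2 = 0.9567\), adjusted \(R^2 = 0.9556\), a residual standard error of 28.09 on 202 degrees of freedom, and F = 892.3 with p-value < 2.2e-16, all indicating a very good fit. A short sketch for pulling a few additional criteria:
#additional model-fit criteria for the final model
c(AIC = AIC(lm_new), BIC = BIC(lm_new))
sigma(lm_new) #residual standard error, in units of estperf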
Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time- the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)
Intercept: when all predictor variables are zero, the predicted estimated performance is -28.13. Since a CPU cannot have all of these characteristics equal to zero, the intercept is not practically meaningful.
Cycle Time (syct): holding all other variables constant, a 1-nanosecond increase in cycle time is associated with an increase of about 0.028 units in estimated performance. The positive sign is somewhat counter-intuitive (longer cycle times usually mean slower machines, and syct is negatively correlated with estperf on its own) and likely reflects the fact that published performance is already in the model.
Minimum Main Memory (mmin): holding the other variables constant, each additional MB (1000 kB) of minimum main memory is associated with an increase of about 3.2 units in estimated performance.
Maximum Main Memory (mmax): holding the other variables constant, each additional MB (1000 kB) of maximum main memory is associated with an increase of about 3.9 units in estimated performance.
Maximum Number of Channels (chmax): holding the other variables constant, each additional channel is associated with an increase of about 0.41 units in estimated performance. More channels allow more simultaneous input/output, which plausibly supports higher performance.
Published Performance (perf): a 1-unit increase in published performance is associated with an increase of about 0.55 units in estimated performance, holding the other variables constant. Since perf is the manufacturer-reported benchmark, it is unsurprising that it tracks estimated real-world performance closely.
Note: we did not log-transform the response variable, so all interpretations above are on the original scale of estperf (changes in units of estimated performance, not percentages).
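For reference, had the response been log-transformed as the question anticipates, the raw-scale interpretation would come from exponentiating: a coefficient \(\beta\) corresponds to a \(100(e^{\beta} - 1)\%\) change in estimated performance per one-unit increase in the predictor. A hedged sketch, reusing the log-scale model from the residual-diagnostics section:
#log-scale sketch model (not the final model interpreted above)
lm_log <- lm(log(estperf) ~ syct + mmin + mmax + chmax + perf, data = cpus)
#approximate percent change in estperf per one-unit increase in each predictor
round(100 * (exp(coef(lm_log)) - 1), 2)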
Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
To assess multicollinearity in the final model, we calculate Variance Inflation Factors (VIFs) for the predictors.
# Compute Variance Inflation Factor (VIF)
vif_values <- vif(lm_new) # lm_new is the final model (after removing observation 10)
print(vif_values)
## syct mmin mmax chmax perf
## 1.223190 3.033525 4.188266 2.020315 5.562914
From the results above, the model does not exhibit strong multicollinearity: all VIFs are well below the common cutoff of 10, and only perf (VIF ≈ 5.6) slightly exceeds the more conservative threshold of 5. This is generally considered acceptable, so we take no action.
Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.
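A sketch of the usual influence diagnostics for the final model, using the rules of thumb of Cook's distance above 4/n, absolute studentized residuals above about 3, and the combined influence.measures() summary:
#Cook's distance, flagging points above the 4/n cutoff
cooks_d <- cooks.distance(lm_new)
which(cooks_d > 4 / nrow(cpus_filtered))
#studentized residuals larger than about 3 in absolute value suggest outliers
which(abs(rstudent(lm_new)) > 3)
#combined summary of influence measures (hat values, DFBETAS, Cook's distance)
summary(influence.measures(lm_new))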
Consider the birthwt data from R package
MASS. We will investigate the relationship between low
birthweight and the predictors in the birthwt data using logistic
regression and discriminant analysis.
Investigate the relationship between variables in the
birthwt dataset. Do you see anything surprising? Use both
numeric and visual summaries. Create and comment on visualizations
specifically between the outcome variable and predictor/independent
variables. Also, notice that qualitative/categorical variables should be
visualized in an alternative manner, not just scatterplots/correlations
as in the case of quantitative variables.
# Load required packages
library(MASS)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Load the birthwt dataset
data("birthwt")
#see the dataset documentation
?birthwt
# Convert categorical variables to factors
birthwt <- birthwt %>%
mutate(low = factor(low, labels = c("Normal Weight", "Low Weight")),
race = factor(race, labels = c("White", "Black", "Other")),
smoke = factor(smoke, labels = c("Non-Smoker", "Smoker")),
ht = factor(ht, labels = c("No Hypertension", "Hypertension")),
ui = factor(ui, labels = c("No Uterine Irritability", "Uterine Irritability")))
# View the first few rows
head(birthwt)
#getting the numerical summary
summary(birthwt)
## low age lwt race
## Normal Weight:130 Min. :14.00 Min. : 80.0 White:96
## Low Weight : 59 1st Qu.:19.00 1st Qu.:110.0 Black:26
## Median :23.00 Median :121.0 Other:67
## Mean :23.24 Mean :129.8
## 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :45.00 Max. :250.0
## smoke ptl ht
## Non-Smoker:115 Min. :0.0000 No Hypertension:177
## Smoker : 74 1st Qu.:0.0000 Hypertension : 12
## Median :0.0000
## Mean :0.1958
## 3rd Qu.:0.0000
## Max. :3.0000
## ui ftv bwt
## No Uterine Irritability:161 Min. :0.0000 Min. : 709
## Uterine Irritability : 28 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
From the summary above we see the following:
Low Birth Weight: Out of 189 observations, 59 (31.2%) are classified as low birth weight, while 130 (68.8%) are normal.
Maternal Age (age): The ages range from 14 to 45 years, with a mean of 23.24 years, suggesting that some young mothers are included in the dataset.
Maternal Weight (lwt): This ranges from 80 to 250 pounds, with a mean of 129.8, indicating a reasonable variation in maternal weight.
Race: The dataset is imbalanced, with 96 White, 26 Black, and 67 Other mothers.
Smoking Status (smoke): A substantial number of mothers (74 out of 189, ~39%) reported smoking, which may be an important risk factor for low birth weight.
Hypertension (ht) and Uterine Irritability (ui): Both conditions are relatively uncommon, with 12 cases of hypertension and 28 cases of uterine irritability, but they may still have an impact on birth weight.
Birth Weight (bwt): The birth weights range from 709g to 4990g, with a median of 2977g, suggesting that most newborns are within a healthy weight range, but there are some extremely low weights.
# next we check the correlation amongst the variables for the numeric variables.
numeric_vars <- birthwt %>% select_if(is.numeric)
# Correlation matrix visualization
corrplot(cor(numeric_vars), method = "color", addCoef.col = "black")
From the above correlation plot, we see that there are only weak correlations between birth weight (bwt) and the other numeric predictors. The largest positive correlation is between birth weight and the mother's weight (0.19), suggesting that maternal weight has a small positive relationship with birth weight. Overall, there are no strong correlations, implying that multiple factors may collectively influence birth weight rather than a single dominant predictor.
Next we make a few visualization plots between the predictor variables and the birth weight response variable.
# Boxplots to compare birth weight with predictors
ggplot(birthwt, aes(x = low, y = bwt, fill = low)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Birth Weight Distribution by Low Birth Weight Category") +
labs(x="Birth Weight Category", y="Birth Weight (grams)")
From the box-plot above, we can make the following observations:
Babies classified as Low Weight have significantly lower birth weights, with the median around 2000-2200 grams.
Normal Weight babies have a higher median birth weight, around 3000-3500 grams, with some outliers above 5000 grams.
The spread (interquartile range) for Normal Weight babies is larger, indicating greater variation in birth weight compared to Low Weight babies.
Outliers are observed in the Low Weight category, indicating some extremely low birth weights.
# Bar plot for smoking status
ggplot(birthwt, aes(x = smoke, fill = low)) +
geom_bar(position = "fill") +
theme_minimal() +
ggtitle("Proportion of Low Birth Weight by Smoking Status") +
labs(x="Smoking status", y="Proportion")
From the above plot we can make the following observations:
A higher proportion of babies born to smokers are classified as low birth weight compared to non-smokers.
Non-smokers have a lower proportion of low birth weight babies, suggesting that smoking during pregnancy is associated with an increased risk of low birth weight.
This visualization supports the hypothesis that smoking could negatively impact fetal development, increasing the likelihood of a baby being born underweight.
# Bar plot for race
ggplot(birthwt, aes(x = race, fill = low)) +
geom_bar(position = "fill") +
theme_minimal() +
ggtitle("Proportion of Low Birth Weight by Race")
From this bar graph, we can make the following observations:
The Black category has the highest proportion of low birth weight babies, suggesting that this group may be at a greater risk of delivering underweight infants.
The White category has the lowest proportion of low birth weight babies.
The Other category also has a relatively high proportion of low birth weight babies, but slightly lower than the Black category.
This suggests that race might be a factor influencing birth weight, possibly due to socioeconomic, genetic, or healthcare-related differences.
# Scatter plot of mother's weight (lwt) vs birth weight (bwt)
ggplot(birthwt, aes(x = lwt, y = bwt, color = low)) +
geom_point() +
theme_minimal() +
ggtitle("Birth Weight vs Mother's Weight") +
labs(x="Mother's Weight (pounds)", y="Birth Weight (grams)")
From the scatter plot above, we see the following:
There appears to be a positive correlation between mother’s weight and birth weight, meaning that mothers with higher weights tend to have babies with higher birth weights.
Babies classified as low birth weight (blue dots) are more concentrated at the lower end of the birth weight scale, regardless of the mother’s weight.
However, even among heavier mothers, some babies still have low birth weight, suggesting that factors other than maternal weight may contribute to birth weight.
birthwt to avoid including variables that are
not logically acceptable for inclusion in the model.
logit_model <- glm(low ~ lwt + age + race + smoke + ht + ui + ftv,
data = birthwt, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = low ~ lwt + age + race + smoke + ht + ui + ftv,
## family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.45481 1.18541 0.384 0.70122
## lwt -0.01652 0.00686 -2.409 0.01600 *
## age -0.02051 0.03595 -0.570 0.56844
## raceBlack 1.28976 0.52761 2.445 0.01451 *
## raceOther 0.91907 0.43630 2.106 0.03516 *
## smokeSmoker 1.04159 0.39548 2.634 0.00844 **
## htHypertension 1.88506 0.69481 2.713 0.00667 **
## uiUterine Irritability 0.90415 0.44860 2.015 0.04385 *
## ftv 0.05912 0.17200 0.344 0.73105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 203.83 on 180 degrees of freedom
## AIC: 221.83
##
## Number of Fisher Scoring iterations: 4
The coefficients (Estimate) give the change in the log-odds of low birth weight associated with a one-unit increase in each predictor (or with the indicated level of a categorical predictor), holding the other variables fixed.
The p-values (Pr(>|z|)) indicate whether each variable is significantly associated with the odds of low birth weight.
At the 0.05 level, lwt, race, smoke, ht, and ui are significant predictors, while age and ftv are not.
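Because the estimates are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret; a short sketch:
#odds ratios and profile-likelihood 95% confidence intervals
exp(cbind(OR = coef(logit_model), confint(logit_model)))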
What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?
Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
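A hedged sketch of one common recoding; the name ptl2 and the none vs. one-or-more split are illustrative choices given how few mothers have more than one previous premature labour:
#collapse number of previous premature labours into none vs. one or more
table(birthwt$ptl)
birthwt$ptl2 <- factor(ifelse(birthwt$ptl >= 1, "1+", "0"))
table(birthwt$ptl2, birthwt$low)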
Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.
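Similarly, a sketch for ftv; the name ftv2 and the 0 / 1 / 2+ grouping are illustrative, and the final table summarizes the proportion of low-birthweight babies at each level:
#collapse number of first-trimester physician visits into 0, 1, and 2+
table(birthwt$ftv)
birthwt$ftv2 <- cut(birthwt$ftv, breaks = c(-Inf, 0, 1, Inf), labels = c("0", "1", "2+"))
#row-wise proportions of low birth weight within each ftv2 level
prop.table(table(birthwt$ftv2, birthwt$low), margin = 1)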
Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ftv2 and ptl2 in the modeling). Comment on what you find: are the new versions of these variables important in predicting low birthweight?
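A sketch of the refitted model, assuming the ptl2 and ftv2 recodings from the sketches above:
#refit the logistic model with the collapsed versions of ptl and ftv
logit_model2 <- glm(low ~ lwt + age + race + smoke + ht + ui + ptl2 + ftv2,
data = birthwt, family = binomial)
summary(logit_model2)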
In a manner similar to the approach used in the book, split the birthwt data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and the accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to as done in the notes (but don’t use my name to set the seed!)
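A sketch of the split-and-evaluate workflow; the seed string, the 80/20 split, the predictors used for LDA/QDA (lwt, age, ptl, ftv), and the 0.5 probability cutoff for the logistic comparison are all illustrative assumptions rather than prescribed choices.
#reproducible split using TeachingDemos, as in the notes
library(TeachingDemos)
char2seed("your name here") #replace with your own seed string
n <- nrow(birthwt)
test_idx <- sample(seq_len(n), size = round(0.2 * n))
train <- birthwt[-test_idx, ]
test <- birthwt[test_idx, ]
#LDA and QDA using quantitative predictors only
lda_fit <- lda(low ~ lwt + age + ptl + ftv, data = train)
qda_fit <- qda(low ~ lwt + age + ptl + ftv, data = train)
lda_pred <- predict(lda_fit, test)$class
qda_pred <- predict(qda_fit, test)$class
#confusion matrices on the test set
table(Predicted = lda_pred, Actual = test$low)
table(Predicted = qda_pred, Actual = test$low)
#helper for sensitivity, specificity, and accuracy, treating "Low Weight" as the positive class
perf_metrics <- function(pred, actual) {
  tab <- table(pred, actual)
  c(sensitivity = tab["Low Weight", "Low Weight"] / sum(tab[, "Low Weight"]),
    specificity = tab["Normal Weight", "Normal Weight"] / sum(tab[, "Normal Weight"]),
    accuracy = mean(pred == actual))
}
perf_metrics(lda_pred, test$low)
perf_metrics(qda_pred, test$low)
#the same comparison for the logistic model from b), refit on the training set
logit_train <- glm(low ~ lwt + age + race + smoke + ht + ui + ftv,
data = train, family = binomial)
glm_prob <- predict(logit_train, newdata = test, type = "response")
glm_pred <- factor(ifelse(glm_prob > 0.5, "Low Weight", "Normal Weight"),
levels = levels(test$low))
table(Predicted = glm_pred, Actual = test$low)
perf_metrics(glm_pred, test$low)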
Using your final model from f), interpret the estimates for all covariates