library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
library(car)
## Loading required package: carData
head(cpus)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(cpus)
## name syct mmin mmax
## ADVISOR 32/60 : 1 Min. : 17.0 Min. : 64 Min. : 64
## AMDAHL 470/7A : 1 1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000
## AMDAHL 470V/7 : 1 Median : 110.0 Median : 2000 Median : 8000
## AMDAHL 470V/7B: 1 Mean : 203.8 Mean : 2868 Mean :11796
## AMDAHL 470V/7C: 1 3rd Qu.: 225.0 3rd Qu.: 4000 3rd Qu.:16000
## AMDAHL 470V/8 : 1 Max. :1500.0 Max. :32000 Max. :64000
## (Other) :203
## cach chmin chmax perf
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 6.0
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 27.0
## Median : 8.00 Median : 2.000 Median : 8.00 Median : 50.0
## Mean : 25.21 Mean : 4.699 Mean : 18.27 Mean : 105.6
## 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00 3rd Qu.: 113.0
## Max. :256.00 Max. :52.000 Max. :176.00 Max. :1150.0
##
## estperf
## Min. : 15.00
## 1st Qu.: 28.00
## Median : 45.00
## Mean : 99.33
## 3rd Qu.: 101.00
## Max. :1238.00
##
##Numerical relation
correlation_matrix <- cor(cpus[, -1]) # Excluding the 'name' as it is non numerical column
print(correlation_matrix)
## syct mmin mmax cach chmin chmax
## syct 1.0000000 -0.3356422 -0.3785606 -0.3209998 -0.3010897 -0.2505023
## mmin -0.3356422 1.0000000 0.7581573 0.5347291 0.5171892 0.2669074
## mmax -0.3785606 0.7581573 1.0000000 0.5379898 0.5605134 0.5272462
## cach -0.3209998 0.5347291 0.5379898 1.0000000 0.5822455 0.4878458
## chmin -0.3010897 0.5171892 0.5605134 0.5822455 1.0000000 0.5482812
## chmax -0.2505023 0.2669074 0.5272462 0.4878458 0.5482812 1.0000000
## perf -0.3070821 0.7949233 0.8629942 0.6626135 0.6089025 0.6052193
## estperf -0.2883956 0.8192915 0.9012024 0.6486203 0.6105802 0.5921556
## perf estperf
## syct -0.3070821 -0.2883956
## mmin 0.7949233 0.8192915
## mmax 0.8629942 0.9012024
## cach 0.6626135 0.6486203
## chmin 0.6089025 0.6105802
## chmax 0.6052193 0.5921556
## perf 1.0000000 0.9664687
## estperf 0.9664687 1.0000000
##Visual relation
pairs(cpus[, -1], main = "Scatterplot Matrix")
a) The correlation matrix provided reveals associations among the
variables:
#building initial model
initial_model <- lm(estperf ~ syct+ mmin + mmax + cach + chmin + chmax - perf, data = cpus)
summary(initial_model)
##
## Call:
## lm(formula = estperf ~ syct + mmin + mmax + cach + chmin + chmax -
## perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.198 -24.064 1.429 23.136 288.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.648e+01 6.288e+00 -10.574 < 2e-16 ***
## syct 6.596e-02 1.369e-02 4.818 2.85e-06 ***
## mmin 1.431e-02 1.428e-03 10.021 < 2e-16 ***
## mmax 6.590e-03 5.016e-04 13.138 < 2e-16 ***
## cach 4.945e-01 1.091e-01 4.533 9.94e-06 ***
## chmin -1.723e-01 6.687e-01 -0.258 0.797
## chmax 1.201e+00 1.720e-01 6.985 4.04e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.88 on 202 degrees of freedom
## Multiple R-squared: 0.9109, Adjusted R-squared: 0.9082
## F-statistic: 344.1 on 6 and 202 DF, p-value: < 2.2e-16
cor_with_estperf <- correlation_matrix["estperf", ]
print(cor_with_estperf)
## syct mmin mmax cach chmin chmax perf
## -0.2883956 0.8192915 0.9012024 0.6486203 0.6105802 0.5921556 0.9664687
## estperf
## 1.0000000
Interpretation: We shall do the backward elimination and removing the least significant predictor (highest p-value or least contribution to model performance). hence, i am not considering chmin, as it has highest p value and not considering syct as it is having negative correlation with est_perf.
second_model <- lm(estperf~ mmin + mmax - perf + chmax + cach , data= cpus)
summary(second_model)
##
## Call:
## lm(formula = estperf ~ mmin + mmax - perf + chmax + cach, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.666 -29.613 1.242 27.726 311.090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.627e+01 4.893e+00 -9.458 < 2e-16 ***
## mmin 1.384e-02 1.470e-03 9.416 < 2e-16 ***
## mmax 6.272e-03 5.223e-04 12.008 < 2e-16 ***
## chmax 1.155e+00 1.708e-01 6.759 1.43e-10 ***
## cach 4.298e-01 1.099e-01 3.910 0.000126 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.29 on 204 degrees of freedom
## Multiple R-squared: 0.9005, Adjusted R-squared: 0.8986
## F-statistic: 461.7 on 4 and 204 DF, p-value: < 2.2e-16
Conclusion: As,The Multiple Rsquared value and adjusted R squared value of initial model is better than the second model. we shall consider initial model as the better one.
# Create a residual plot for the initial linear regression model
par(mfrow=c(2,2))
plot(initial_model)
interpretation: As we can see, there is a curvature or a pattern around
the fitted line, which suggests that it is following Non-Linearity. to
enhance the model, we can apply certain transformation on target. Type
of transformation depends on the value derived from Power
Transformation.
pT <- powerTransform(initial_model, family = "bcPower")
pT$lambda
## Y1
## 0.4314326
we know that, When \(\lambda\approx 0\), it can be shown that this function approaches \(\log(y)\), so the this becomes the best transformation
When \(\lambda \approx 1/2\), then the \(\sqrt{y}\) is usually a sufficient transformation
If \(\lambda \approx 1\), no transformation is needed, and if \(\lambda \approx -1\), you can use \(\frac{1}{y}\).
so, we shall go forward and apply log transformation
cpus <- cpus |>
mutate(log_estperf = log(estperf))
model_final <- lm(log_estperf ~ syct + mmin + mmax + cach + chmin + chmax - name - perf , data = cpus)
rsquared <- summary(initial_model)$r.squared
print(rsquared)
## [1] 0.9108694
summary(model_final)
##
## Call:
## lm(formula = log_estperf ~ syct + mmin + mmax + cach + chmin +
## chmax - name - perf, data = cpus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96828 -0.12609 0.02829 0.15471 0.46886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.293e+00 2.934e-02 112.254 < 2e-16 ***
## syct -4.510e-04 6.389e-05 -7.059 2.64e-11 ***
## mmin 1.278e-05 6.662e-06 1.919 0.0565 .
## mmax 5.386e-05 2.341e-06 23.011 < 2e-16 ***
## cach 7.024e-03 5.090e-04 13.801 < 2e-16 ***
## chmin 4.808e-03 3.120e-03 1.541 0.1249
## chmax -1.598e-03 8.024e-04 -1.992 0.0478 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2188 on 202 degrees of freedom
## Multiple R-squared: 0.947, Adjusted R-squared: 0.9455
## F-statistic: 602.1 on 6 and 202 DF, p-value: < 2.2e-16
plotting the residual plot for transformed model
par(mfrow=c(2,2))
plot(model_final)
now, in the residual plot, the non linearity is reduced,as there is no
systematic pattern in the residuals and spread of residuals is
consistent across all levels of the fitted values,which could indicate
the linearity.
Adjusted R-squared: Adjusted R-squared: accounts for the number of predictors in the model and is often considered a more reliable measure, especially when comparing models with different numbers of predictors. It penalizes the inclusion of less useful predictors that do not contribute much to the model. the adjusted R squared value is 0.9455 which suggets a better fit
Residual Standard Error (RSE): RSE is an estimate of the standard deviation of the residuals. It provides an indication of how well the model predicts the response variable. Smaller RSE values suggest a better fit. the value i got it is 0.2188
model_final
##
## Call:
## lm(formula = log_estperf ~ syct + mmin + mmax + cach + chmin +
## chmax - name - perf, data = cpus)
##
## Coefficients:
## (Intercept) syct mmin mmax cach chmin
## 3.293e+00 -4.510e-04 1.278e-05 5.386e-05 7.024e-03 4.808e-03
## chmax
## -1.598e-03
#converting to meaningful outputs
# Coefficients from the model
coefficients <- c(3.293e+00, -4.510e-04, 1.278e-05, 5.33e-05, 7.022e-03, 4.8080e-03, -1.598e-03)
# Variables corresponding to the coefficients
variables <- c("Intercept", "syct", "mmin", "mmax", "cach", "chmin", "chmax")
# Units corresponding to the variables (replace these with actual units)
units <- c("Unknown", "nanoSeconds", "Kilobytes", "Kilobytes", "Kilobytes", "Units", "Units")
# Convert coefficients to original scale using exp()
converted_values <- exp(coefficients)
# Display the results with units
interpretation <- data.frame(variables, converted_values, units)
print(interpretation)
## variables converted_values units
## 1 Intercept 26.9235132 Unknown
## 2 syct 0.9995491 nanoSeconds
## 3 mmin 1.0000128 Kilobytes
## 4 mmax 1.0000533 Kilobytes
## 5 cach 1.0070467 Kilobytes
## 6 chmin 1.0048196 Units
## 7 chmax 0.9984033 Units
interpreting the coefficients:
Intercept (26.9235132): The estimated performance when all other predictors are zero is approximately 26.92. This value is in the original data units, and it represents the baseline performance.
syct (0.9995491): For each additional unit increase in syct (system cycle time) measured in nanoSeconds, the estimated performance decreases to approximately 99.95% of the original performance.
mmin (1.0000128): For each additional unit increase in mmin (minimum main memory) measured in Kilobytes, the estimated performance increases to approximately 100.00% + 0.0128%, or very close to the original performance.
mmax (1.0000533): For each additional unit increase in mmax (maximum main memory) measured in Kilobytes, the estimated performance increases to approximately 100.00% + 0.0533%, or very close to the original performance.
cach (1.0070467): For each additional unit increase in cach (cache memory) measured in Kilobytes, the estimated performance increases to approximately 100.70% of the original performance.
chmin (1.0048196): For each additional unit increase in chmin (minimum channels) measured in Units, the estimated performance increases to approximately 100.48% of the original performance.
chmax (0.9984033): For each additional unit increase in chmax (maximum channels) measured in Units, the estimated performance decreases to approximately 99.84% of the original performance.
f)Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
#calculating variance inflation factor to determine multicollinearity
vif_values <- vif(model_final)
print(vif_values)
## syct mmin mmax cach chmin chmax
## 1.201644 2.902002 3.274002 1.858349 1.966234 1.891392
As the variance inflation factors determine the occurence of multicollinearity. i.e. if the VIF value is greater than 10, there is a strong indication of multicollinearity and if it exceeds 5 or 10, it creates a problematic amount of collinearity. and if its less than 5, small amount of multicollinearity persists, which is not a concern. As all the features have less than 5 VIF, there is no need to take an action.
par(mfrow = c(2, 2))
# Plotting residual vs. leverage
plot(model_final, which = 5, main = "Residual vs. Leverage", pch = 16)
# Reset par to default settings
par(mfrow = c(1, 1))
Observations:
Observation with Index 10: High leverage value suggests that this observation has an extreme value for one or more predictor variables, making it influential in determining the coefficients of the regression model. Low standardized residual indicates that this observation fits well with the overall trend of the model.
Observations with Indices 157 and 200: These observations have higher standardized residuals compared to observation 10. Higher standardized residuals suggest that these points deviate more from the predicted values. However, they have lower leverage values, indicating that they might not be as influential in determining the model coefficients compared to observation 10.
Based on this information, we can make the following interpretations: Observation 10: This observation is both influential (high leverage) and well-fitted (low standardized residual). It has a strong impact on the model due to its extreme values in predictor variables.
Observations 157 and 200: These observations have higher residuals, indicating that they deviate more from the model predictions. However, their leverage values are lower compared to observation 10, suggesting that they might not have as much influence on the model.
head(birthwt)
summary(birthwt)
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
Now, let’s create visualizations between the outcome variable (low - indicating low birthweight) and predictor variables, considering both quantitative and categorical variables:
##numerical variables
# Pair plot for numeric variables
pairs(birthwt[, c("age", "lwt", "ftv")])
Using boxplots to compare the distribution of numeric variables across
different levels of categorical variables.
#Categorical variable
# Boxplot for age by low birthweight
boxplot(age ~ low, data = birthwt, main = "Boxplot of Age by Low Birthweight", xlab = "Low Birthweight", ylab = "Age")
lets consider taking relation between categorical variables
#categorical
# Bar plot for race by low birthweight
table_race_low <- table(birthwt$race, birthwt$low)
barplot(table_race_low, beside = TRUE, legend= TRUE,names.arg = c("YES", "NO") ,col = c("red", "green","blue"), main = "Bar Plot of Race by Low Birthweight", xlab = "Race", ylab = "Count")
# Create a grouped bar plot for smoke and low birthweight
ggplot(birthwt, aes(x = factor(smoke), fill = factor(low))) +
geom_bar(position = "dodge", stat = "count") +
labs(title = "Grouped Bar Plot of Smoke and Low Birthweight",
x = "Smoke (1 = Yes, 0 = No)", y = "Count",
fill = "Low Birthweight") +
theme_minimal()
Other barplots for other categories
# Barplot of low birth weight by smoking status with colors
colors <- c("blue","green")
barplot(table(birthwt$smoke, birthwt$bwt == "Low"),
beside = TRUE, legend.text = c("No", "Yes"),
names.arg = c("Non-Smoker", "Smoker"),
xlab = "Smoking Status", ylab = "Frequency",
main = "Low Birth Weight by Smoking Status",
col = colors)
# Barplot of low birth weight by uterine irritability with colors
barplot(table(birthwt$ui, birthwt$bwt == "Low"),
beside = TRUE, legend.text = c("No", "Yes"),
names.arg = c("No", "Yes"),
xlab = "Uterine Irritability", ylab = "Frequency",
main = "Low Birth Weight by Uterine Irritability",
col=colors)
interpretation : 1. Low Birthweight Rate: The rate of low birth weight infants in the dataset is approximately 31.22%. This rate is higher than the general population rate in many countries, which is typically around 5-8%. The dataset may be biased towards high-risk pregnancies or populations with higher rates of low birthweight.
2.Smoking: About 39.15% of mothers in the dataset are smokers. Smoking during pregnancy is a known risk factor for low birthweight which is visible in grouped bar plot. The high rate of smoking in this dataset might indicate a population with higher overall health risks.
3.Age: The mean age of mothers in the dataset is 23.24 years, ranging from 14 to 45 years. The presence of teenage mothers (age 14-19) and older mothers (age 40-45) in the dataset could indicate potential high-risk pregnancies associated with low birth weight.
4.Number of Physician Visits: The mean number of physician visits (ftv variable) is 0.7937, ranging from 0 to 6 visits. The low mean number of visits may indicate limited access to healthcare or potential barriers to prenatal care.
5.Race Distribution: The race variable is categorical with values ranging from 1 to 3, indicating different race categories. The distribution of race categories in the dataset may be of interest for further analysis, especially considering potential disparities in healthcare access and outcomes among different racial groups.
6.uterine irritabilty: The higher frequency for “No” indicates that, among the observations with no uterine irritability, there is a larger proportion of cases with normal birth weight. The lower frequency for “Yes” suggests that, among the observations with uterine irritability, there is a smaller proportion of cases with normal birth weight, and a relatively higher proportion of cases with low birth weight.
let us create a initial model to determine the significancy if variables
initial_model1 <- glm(low ~ smoke + ht + race + lwt + ftv + age + ptl + ui,
data = birthwt,
family = binomial)
summary(initial_model1)
##
## Call:
## glm(formula = low ~ smoke + ht + race + lwt + ftv + age + ptl +
## ui, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.078975 1.276254 -0.062 0.95066
## smoke 0.937275 0.398458 2.352 0.01866 *
## ht 1.830720 0.694135 2.637 0.00835 **
## race 0.453424 0.215294 2.106 0.03520 *
## lwt -0.012387 0.006614 -1.873 0.06111 .
## ftv 0.063461 0.169765 0.374 0.70854
## age -0.035845 0.036472 -0.983 0.32569
## ptl 0.542087 0.346168 1.566 0.11736
## ui 0.721965 0.463174 1.559 0.11906
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 204.19 on 180 degrees of freedom
## AIC: 222.19
##
## Number of Fisher Scoring iterations: 4
With respect to significance legend:
Highly significant (p-value < 0.001) Significant (p-value < 0.01) Marginally significant (p-value < 0.05) Not significant (p-value ≥ 0.05)
Logically Acceptable Variables:
Race (race): Coefficient Estimate: 0.4534 p-value: 0.03520 (*) Interpretation: The variable race is statistically significant at a 0.05 significance level. The positive coefficient suggests an increase in the log-odds of low birth weight for different race categories.
Smoking Status (smoke): Coefficient Estimate: 0.9373 p-value: 0.01866 (*) Interpretation: The variable smoke is statistically significant at a 0.05 significance level. The positive coefficient indicates an increase in the log-odds of low birth weight for mothers who smoke during pregnancy.
Hypertension (ht): Coefficient Estimate: 1.8307 p-value: 0.00835 (**) Interpretation: The variable ht is statistically significant at a 0.01 significance level. The positive coefficient suggests an increase in the log-odds of low birth weight for mothers with hypertension.
Variables with Marginal Significance:
Last Weight (lwt): Coefficient Estimate: -0.0124 p-value: 0.06111 (.) Interpretation: The variable lwt has a p-value of 0.06111, which is marginally above the 0.05 significance level. It might be considered for further investigation or model refinement.
Number of Physician Visits (ftv): Coefficient Estimate: 0.0635 p-value: 0.70854 (not significant) Interpretation: The variable ftv has a high p-value (0.70854), indicating that it is not statistically significant. It may be less important in predicting low birth weight. Variables that are Not Statistically Significant:
Mother’s Age (age): Coefficient Estimate: -0.0358 p-value: 0.32569 (not significant) Interpretation: The variable age is not statistically significant at the 0.05 significance level.
Prior Preterm Births (ptl): Coefficient Estimate: 0.5421 p-value: 0.11736 (not significant) Interpretation: The variable ptl is not statistically significant at the 0.05 significance level.
Uterine Irritability (ui): Coefficient Estimate: 0.7220 p-value: 0.11906 (not significant) Interpretation: The variable ui is not statistically significant at the 0.05 significance level.
Overall conlusion: we shall only consider race, smoke and ht as significant variables and build an other model.
logistic_model2 <- glm(low ~ smoke + ht + race ,
data = birthwt,
family = binomial)
# Display the summary of the model
summary(logistic_model2)
##
## Call:
## glm(formula = low ~ smoke + ht + race, family = binomial, data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4308 0.5232 -4.646 3.38e-06 ***
## smoke 1.1289 0.3723 3.032 0.00243 **
## ht 1.2253 0.6218 1.971 0.04877 *
## race 0.5636 0.2004 2.812 0.00492 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 217.40 on 185 degrees of freedom
## AIC: 225.4
##
## Number of Fisher Scoring iterations: 4
Observation: We can observe that AIC value of updated model(225) is better than initial one(222), suggesting that the model is a better fit
Assumption: The model assumes that an increase in the number of prior preterm births (ptl) is associated with an increase in the log-odds of giving birth to a low-weight baby.
Realism Consideration: The assumption may or may not be realistic. The lack of statistical significance suggests that, based on the data, there is not enough evidence to conclude that ptl has a significant effect on the log-odds of low birth weight. It might be worth reevaluating the variable’s inclusion in the model or exploring potential interactions.
Assumption: The model assumes that an increase in the number of physician visits (ftv) is associated with an increase in the log-odds of giving birth to a low-weight baby.
Realism Consideration: The assumption may not be realistic based on the data. The lack of statistical significance suggests that the variable ftv does not significantly contribute to explaining the variation in the log-odds of low birth weight. It might be worth reconsidering the inclusion of this variable in the model or exploring alternative measures of healthcare access.
Overall Considerations: The lack of statistical significance for these variables implies that, based on the data, there isn’t strong evidence to support their inclusion in the model.
# Displaying unique values in ptl for reference
unique(birthwt$ptl)
## [1] 0 1 2 3
# Creating a new variable ptl2 by collapsing categories in ptl
birthwt$ptl2 <- ifelse(birthwt$ptl > 0, "Prior Preterm Birth", "No Prior Preterm Birth")
# Displaying unique values in ptl2 for verification
unique(birthwt$ptl2)
## [1] "No Prior Preterm Birth" "Prior Preterm Birth"
# Displaying counts of individual values in the variable ptl2
counts <- table(birthwt$ptl2)
print(counts)
##
## No Prior Preterm Birth Prior Preterm Birth
## 159 30
Explanation:
i am creating a new variable ptl2 based on the values in ptl. If ptl has a value greater than 0 (indicating prior preterm births), ptl2 is assigned the label “Prior Preterm Birth”. Otherwise, it is labeled “No Prior Preterm Birth”. This transformation simplifies the variable into a binary indicator, making it more straightforward for analysis.
# Displaying unique values in ftv for reference
unique(birthwt$ftv)
## [1] 0 3 1 2 4 6
# Creating a new variable ftv2 by collapsing categories in ftv
birthwt$ftv2 <- ifelse(birthwt$ftv > 0, "Visits > 0", "No Visits")
# Displaying unique values in ftv2 for verification
unique(birthwt$ftv2)
## [1] "No Visits" "Visits > 0"
# Form tables to summarize low birthweight probabilities by levels of ftv2
table_summary <- table(birthwt$ftv2, birthwt$low)
prop_table <- prop.table(table_summary, margin = 1)
# Display tables
print(table_summary)
##
## 0 1
## No Visits 64 36
## Visits > 0 66 23
print(prop_table)
##
## 0 1
## No Visits 0.640000 0.360000
## Visits > 0 0.741573 0.258427
Explanation: i am creating a new variable ftv2 based on the values in ftv. If ftv has a value greater than 0, indicating visits greater than 0, ftv2 is assigned the label “Visits > 0”. Otherwise, it is labeled “No Visits”. i then formed tables to summarize the low birthweight probabilities by levels of the new variable ftv2
# Fitting a logistic regression model with the new variables ftv2 and ptl2
updated_logistic_model <- glm(low ~ race + smoke + ptl2 + ht + ftv2,
family = binomial,
data = birthwt)
# Displaying the summary of the updated model
summary(updated_logistic_model)
##
## Call:
## glm(formula = low ~ race + smoke + ptl2 + ht + ftv2, family = binomial,
## data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3164 0.6003 -3.859 0.000114 ***
## race 0.4935 0.2114 2.335 0.019551 *
## smoke 0.8798 0.3943 2.231 0.025677 *
## ptl2Prior Preterm Birth 1.3296 0.4392 3.027 0.002467 **
## ht 1.2528 0.6402 1.957 0.050374 .
## ftv2Visits > 0 -0.2681 0.3537 -0.758 0.448438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 207.76 on 183 degrees of freedom
## AIC: 219.76
##
## Number of Fisher Scoring iterations: 4
Interpretation:
Significant Variables: 1.ptl2 (Prior Preterm Birth) is significant (p-value = 0.002467). ht (Hypertension) is significant (p-value = 0.050374). 2.smoke (Smoking) has a p-value of 0.02567, which is significant. 3.race has a pvalue of 0.019551, which is significant.
Variables Not Significant: 1.ftv2 are not statistically significant (p-values > 0.05). Model Fit:
The residual deviance is 207.76 on 180 degrees of freedom.
Interpretation and Comments: Importance of New Variables: The new variable ptl2 (Prior Preterm Birth) is significant and has a positive coefficient, indicating an increase in the log-odds of low birthweight for mothers with a history of preterm births. ftv2 (Number of Physician Visits) is not significant (p-value = 0.44).
Overall Model Fit: The model has a lower residual deviance compared to the initial model, suggesting an improved fit.
# Set the seed for reproducibility
library(TeachingDemos)
## Warning: package 'TeachingDemos' was built under R version 4.3.2
char2seed("neuralink")
# Load necessary library
library(MASS)
# Split the data into training (80%) and test (20%) sets
train_indices <- sample(1:nrow(birthwt), 0.8 * nrow(birthwt))
train_data <- birthwt[train_indices, ]
test_data <- birthwt[-train_indices, ]
# Fit LDA and QDA models to the training set
lda_model <- lda(low ~ race + smoke + ptl2 + ht , data = train_data)
qda_model <- qda(low ~ race + smoke + ptl2 + ht , data = train_data)
# Fit logistic regression models to the training set
logistic_model_f <- glm(low ~ race + smoke + ptl2 + ht + ftv2,
family = binomial, data = train_data)
logistic_model_b <- glm(low ~ smoke + ht + race,
family = binomial, data = train_data)
# Predictions on the test set
lda_pred <- predict(lda_model, test_data)$class
qda_pred <- predict(qda_model, test_data)$class
logistic_pred_f <- predict(logistic_model_f, test_data, type = "response")
logistic_pred_b <- predict(logistic_model_b, test_data, type = "response")
# Convert logistic regression probabilities to binary predictions
logistic_pred_f_binary <- ifelse(logistic_pred_f > 0.5, 1, 0)
logistic_pred_b_binary <- ifelse(logistic_pred_b > 0.5, 1, 0)
# Confusion matrices
confusion_lda <- table(lda_pred, test_data$low)
confusion_qda <- table(qda_pred, test_data$low)
confusion_logistic_f <- table(logistic_pred_f_binary, test_data$low)
confusion_logistic_b <- table(logistic_pred_b_binary, test_data$low)
# Calculate sensitivity, specificity, and accuracy
sensitivity <- function(confusion_matrix) {
confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
}
specificity <- function(confusion_matrix) {
confusion_matrix[1, 1] / sum(confusion_matrix[1, ])
}
accuracy <- function(confusion_matrix) {
sum(diag(confusion_matrix)) / sum(confusion_matrix)
}
# Calculate metrics for each model
metrics_lda <- c(Sensitivity = sensitivity(confusion_lda),
Specificity = specificity(confusion_lda),
Accuracy = accuracy(confusion_lda))
metrics_qda <- c(Sensitivity = sensitivity(confusion_qda),
Specificity = specificity(confusion_qda),
Accuracy = accuracy(confusion_qda))
metrics_logistic_f <- c(Sensitivity = sensitivity(confusion_logistic_f),
Specificity = specificity(confusion_logistic_f),
Accuracy = accuracy(confusion_logistic_f))
metrics_logistic_b <- c(Sensitivity = sensitivity(confusion_logistic_b),
Specificity = specificity(confusion_logistic_b),
Accuracy = accuracy(confusion_logistic_b))
# Display the results
print("Linear Discriminant Analysis (LDA):")
## [1] "Linear Discriminant Analysis (LDA):"
print(metrics_lda)
## Sensitivity Specificity Accuracy
## 0.1666667 0.7812500 0.6842105
print("\nQuadratic Discriminant Analysis (QDA):")
## [1] "\nQuadratic Discriminant Analysis (QDA):"
print(metrics_qda)
## Sensitivity Specificity Accuracy
## 0.5714286 0.8709677 0.8157895
print("\nLogistic Regression (Built in f):")
## [1] "\nLogistic Regression (Built in f):"
print(metrics_logistic_f)
## Sensitivity Specificity Accuracy
## 0.3750000 0.8333333 0.7368421
print("\nLogistic Regression (Built in b):")
## [1] "\nLogistic Regression (Built in b):"
print(metrics_logistic_b)
## Sensitivity Specificity Accuracy
## 0.2000000 0.7878788 0.7105263
Interpretation: With the Set.seed(“Neuralink”) Interpretation:
1)QDA appears to have the highest accuracy (81.58%) among the methods considered. 2)LDA has the lowest sensitivity and accuracy among the models. 3)Logistic Regression (Built in f) performs reasonably well with 73.68% accuracy. 4)Logistic Regression (Built in b) has a similar accuracy to LDA but slightly better sensitivity.
Conclusion: Based on the accuracy metric, QDA seems to be the most effective model among the methods considered for this particular dataset. However, it’s crucial to consider the specific goals of your analysis and other factors such as computational efficiency and interpretability when choosing a model.
#summary of the updated logistic regression model
summary(updated_logistic_model)
##
## Call:
## glm(formula = low ~ race + smoke + ptl2 + ht + ftv2, family = binomial,
## data = birthwt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3164 0.6003 -3.859 0.000114 ***
## race 0.4935 0.2114 2.335 0.019551 *
## smoke 0.8798 0.3943 2.231 0.025677 *
## ptl2Prior Preterm Birth 1.3296 0.4392 3.027 0.002467 **
## ht 1.2528 0.6402 1.957 0.050374 .
## ftv2Visits > 0 -0.2681 0.3537 -0.758 0.448438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 207.76 on 183 degrees of freedom
## AIC: 219.76
##
## Number of Fisher Scoring iterations: 4
interpretation the estimatation:
Interpretation of Coefficients:
1)Intercept: The estimated intercept is -2.3164. This is the log-odds of the response variable being “low” when all other predictor variables are zero. 2)race: The coefficient for “race” is 0.4935. Holding other variables constant, for each one-unit increase in the “race” variable, the log-odds of the response being “low” increase by 0.4935. The p-value is 0.019551, suggesting that “race” is statistically significant. 3)smoke: The coefficient for “smoke” is 0.8798. Holding other variables constant, for each one-unit increase in the “smoke” variable, the log-odds of the response being “low” increase by 0.8798. The p-value is 0.025677, indicating that “smoke” is statistically significant. 4)ptl2 (Prior Preterm Birth): The coefficient for “ptl2Prior Preterm Birth” is 1.3296. Holding other variables constant, individuals with a prior preterm birth have log-odds 1.3296 higher than those without a prior preterm birth. The p-value is 0.002467, suggesting that the effect is statistically significant. 5)ht (hypertension): The coefficient for “ht” is 1.2528. Holding other variables constant, individuals with hypertension have log-odds 1.2528 higher than those without hypertension. The p-value is 0.050374, which is marginally significant. 6)ftv2 (Visits > 0): The coefficient for “ftv2Visits > 0” is -0.2681. Holding other variables constant, individuals with more than zero visits have log-odds 0.2681 lower than those with zero visits. However, the p-value is 0.448438, indicating that this variable is not statistically significant.
Model Fit:
Deviance: The null deviance is 234.67, and the residual deviance is 207.76. The difference between them is significant, suggesting that the model provides a better fit