Create a folder for this project and name it DS7310_Assignment4_YourFullName.
Create a project file in the folder.
Change the author name in the YAML header.
Save this file as lastname_firstname_Assignment4.qmd, using your own last and first name.
Add a table of contents.
Add your code to the R code chunks.
Compress (zip) the DS7310_Assignment4_YourFullName folder.
After uploading the compressed folder, download it and make sure it is the correct one.
Question 1 (50 points)
Import Data_Health_Ass4, remove incomplete rows, and create an interactive table of the result.
#Type your code here
library(psych)
library(readxl)
library(ggplot2)
library(MASS)
library(rstatix)
library(tidyverse)
library(tidymodels)
library(lm.beta)
library(olsrr)
library(car)
library(DT)
library(dplyr)
library(rsample)
library(glmtoolbox)
library(lmtest)
library(DescTools)
library(plotly)
library(knitr)
library(kableExtra)
library(broom)
setwd("C:/Users/Owner/Downloads/DS7130_Assignment4_Martinez_Desyne/")
Healthdata <- read_xlsx("Data_Health_Ass4.xlsx")
ls(Healthdata)
# 1. Keep rows where Syphilis is greater than 0, and return all columns.
Healthdata <- Healthdata[Healthdata$Syphilis > 0, ]
# INTERACTIVE TABLE (Requirement: At least one)
datatable(head(Healthdata, 100), options = list(pageLength = 5), caption = 'Table 1: Interactive view of Health Data')
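The task above asks for incomplete rows to be removed. A minimal sketch of that step with base R's complete.cases follows; the data frame health_toy and its values are hypothetical stand-ins for the imported spreadsheet, and the interactive table is built only if the DT package is installed.

```r
# Toy stand-in for the imported spreadsheet (hypothetical values).
health_toy <- data.frame(
  Stroke        = c(3.1, NA, 2.8, 4.0),
  BloodPressure = c(30.2, 28.5, NA, 33.1),
  Depression    = c(18.0, 20.1, 19.4, 21.3)
)

# Keep only rows that have no missing value in any column.
health_complete <- health_toy[complete.cases(health_toy), ]
nrow(health_complete)

# Optional: build an interactive table if DT is installed.
# Printing health_table renders the widget in a Quarto document.
if (requireNamespace("DT", quietly = TRUE)) {
  health_table <- DT::datatable(health_complete, options = list(pageLength = 5))
}
```

Applied to the real data, the same indexing pattern would replace health_toy with the imported data frame.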
Create an interactive histogram for the Stroke variable.
Get a subset of the data with the columns “Stroke”, “BloodPressure”, “Depression”, and “HighCholesterol”.
Build stroke prediction models from the different combinations of 2 or 3 variables selected from BloodPressure, Depression, and HighCholesterol. Number the models from 1 to 4: Model1, Model2, Model3, and Model4.
Make sure to convert the four variables to numeric with as.numeric, for example Data_Health_Ass4_subset$Stroke <- as.numeric(Data_Health_Ass4_subset$Stroke)
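For the interactive histogram requested above, one possible sketch uses plotly's plot_ly with type = "histogram". The data frame stroke_toy holds simulated values and is an assumption for illustration only; the plot is built only if plotly is installed.

```r
# Hypothetical Stroke values standing in for the real Stroke column.
set.seed(123)
stroke_toy <- data.frame(Stroke = rnorm(200, mean = 3, sd = 1))

# Build the interactive histogram only if plotly is installed.
if (requireNamespace("plotly", quietly = TRUE)) {
  stroke_hist <- plotly::plot_ly(stroke_toy, x = ~Stroke, type = "histogram")
  stroke_hist <- plotly::layout(
    stroke_hist,
    title = "Distribution of Stroke",
    xaxis = list(title = "Stroke"),
    yaxis = list(title = "Count")
  )
  # Printing stroke_hist renders the widget.
}
```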
#Type your code here
Healthdatasub <- Healthdata %>%
  dplyr::select(Stroke, BloodPressure, Depression, HighCholesterol)
ls(Healthdatasub)
Healthdatasub$Stroke <- as.numeric(Healthdatasub$Stroke)
Healthdatasub$BloodPressure <- as.numeric(Healthdatasub$BloodPressure)
Healthdatasub$Depression <- as.numeric(Healthdatasub$Depression)
Healthdatasub$HighCholesterol <- as.numeric(Healthdatasub$HighCholesterol)
model_health1 <- lm(Stroke ~ BloodPressure, data = Healthdatasub)
model_health2 <- lm(Stroke ~ BloodPressure + Depression, data = Healthdatasub)
model_health3 <- lm(Stroke ~ BloodPressure + HighCholesterol, data = Healthdatasub)
model_health4 <- lm(Stroke ~ BloodPressure + Depression + HighCholesterol, data = Healthdatasub)
summary(model_health1)
Call:
lm(formula = Stroke ~ BloodPressure, data = Healthdatasub)
Residuals:
Min 1Q Median 3Q Max
-1.3772 -0.2917 -0.0217 0.2431 3.2682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.192938 0.066313 -17.99 <2e-16 ***
BloodPressure 0.143536 0.001753 81.88 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.455 on 1982 degrees of freedom
Multiple R-squared: 0.7718, Adjusted R-squared: 0.7717
F-statistic: 6704 on 1 and 1982 DF, p-value: < 2.2e-16
summary(model_health2)
Call:
lm(formula = Stroke ~ BloodPressure + Depression, data = Healthdatasub)
Residuals:
Min 1Q Median 3Q Max
-1.3772 -0.2917 -0.0217 0.2431 3.2682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.193e+00 9.260e-02 -12.881 <2e-16 ***
BloodPressure 1.435e-01 1.792e-03 80.104 <2e-16 ***
Depression -3.863e-06 3.293e-03 -0.001 0.999
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4551 on 1981 degrees of freedom
Multiple R-squared: 0.7718, Adjusted R-squared: 0.7716
F-statistic: 3350 on 2 and 1981 DF, p-value: < 2.2e-16
summary(model_health3)
Call:
lm(formula = Stroke ~ BloodPressure + HighCholesterol, data = Healthdatasub)
Residuals:
Min 1Q Median 3Q Max
-1.3905 -0.2905 -0.0249 0.2456 3.2522
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.057440 0.123872 -8.537 <2e-16 ***
BloodPressure 0.146194 0.002699 54.159 <2e-16 ***
HighCholesterol -0.006436 0.004970 -1.295 0.195
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4549 on 1981 degrees of freedom
Multiple R-squared: 0.772, Adjusted R-squared: 0.7718
F-statistic: 3354 on 2 and 1981 DF, p-value: < 2.2e-16
summary(model_health4)
Call:
lm(formula = Stroke ~ BloodPressure + Depression + HighCholesterol,
data = Healthdatasub)
Residuals:
Min 1Q Median 3Q Max
-1.3888 -0.2909 -0.0250 0.2449 3.2519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0630752 0.1362474 -7.803 9.74e-15 ***
BloodPressure 0.1461734 0.0027082 53.974 < 2e-16 ***
Depression 0.0003284 0.0033025 0.099 0.921
HighCholesterol -0.0064744 0.0049862 -1.298 0.194
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.455 on 1980 degrees of freedom
Multiple R-squared: 0.772, Adjusted R-squared: 0.7717
F-statistic: 2235 on 3 and 1980 DF, p-value: < 2.2e-16
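The combinations of two or three predictors can also be enumerated programmatically rather than typed out by hand. The sketch below uses combn and reformulate; with three candidate predictors it yields BloodPressure + Depression, BloodPressure + HighCholesterol, Depression + HighCholesterol, and all three together. The data frame toy is simulated and purely illustrative.

```r
predictors <- c("BloodPressure", "Depression", "HighCholesterol")

# All subsets of size 2 and size 3, each as a character vector.
combos <- c(
  combn(predictors, 2, simplify = FALSE),
  combn(predictors, 3, simplify = FALSE)
)

# Turn each subset into a formula with Stroke as the response.
formulas <- lapply(combos, reformulate, response = "Stroke")
names(formulas) <- paste0("Model", seq_along(formulas))

# Hypothetical data with the same column names, for illustration only.
set.seed(123)
n <- 100
toy <- data.frame(
  BloodPressure   = rnorm(n, 30, 4),
  Depression      = rnorm(n, 20, 3),
  HighCholesterol = rnorm(n, 32, 3)
)
toy$Stroke <- 0.14 * toy$BloodPressure - 1.2 + rnorm(n, sd = 0.45)

# Fit one linear model per formula.
models <- lapply(formulas, lm, data = toy)

# Right-hand side of each model, e.g. "BloodPressure + Depression".
model_labels <- sapply(formulas, function(f) paste(all.vars(f)[-1], collapse = " + "))
model_labels
```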
Report the evaluation results of the four combinations in a table with three columns: model variables (e.g., BloodPressure + Depression), model number from 1 to 4, and adjusted R-squared.
Show the report_lm_results with knitr::kable.
First create the empty table using the following code. Add your code to the following code.
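One way such a report table might be built is sketched below. The column names of report_lm_results, the helper vector rhs, and the simulated data frame toy are all assumptions made for illustration; the adjusted R-squared is read from summary(fit)$adj.r.squared, and the table is rendered with knitr::kable when knitr is installed.

```r
# Hypothetical data with the assignment's column names.
set.seed(123)
n <- 100
toy <- data.frame(
  BloodPressure   = rnorm(n, 30, 4),
  Depression      = rnorm(n, 20, 3),
  HighCholesterol = rnorm(n, 32, 3)
)
toy$Stroke <- 0.14 * toy$BloodPressure - 1.2 + rnorm(n, sd = 0.45)

# Empty table with the three requested columns (column names are assumed).
report_lm_results <- data.frame(
  Model_variables = character(),
  Model_number    = integer(),
  Adj_R_squared   = numeric()
)

# Right-hand sides of the four candidate models.
rhs <- c(
  "BloodPressure + Depression",
  "BloodPressure + HighCholesterol",
  "Depression + HighCholesterol",
  "BloodPressure + Depression + HighCholesterol"
)

# Fit each model and append one row per model.
for (i in seq_along(rhs)) {
  fit <- lm(as.formula(paste("Stroke ~", rhs[i])), data = toy)
  report_lm_results <- rbind(
    report_lm_results,
    data.frame(
      Model_variables = rhs[i],
      Model_number    = i,
      Adj_R_squared   = summary(fit)$adj.r.squared
    )
  )
}

# Render with kable when knitr is available; otherwise print the data frame.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(report_lm_results, digits = 4))
} else {
  print(report_lm_results)
}
```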
Cluster the Data_Health_Ass4_subset data using k-means.
Select the best estimated number of clusters, excluding the values used for min.nc and max.nc.
set.seed(123)
#Type your code here
# Load clustering package and sample dataset package
library(cluster)
library(GDAdata)
# Standardize the data so variables with larger scales do not dominate
Health_data_scale <- scale(Healthdatasub)
# Load package for estimating the number of clusters
library(NbClust)
# Estimate the best number of clusters from 2 to 10 using k-means criteria
number_cluster_estimate <- NbClust(
  Health_data_scale,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 10,
  method = "kmeans"
)
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 10 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 5 proposed 4 as the best number of clusters
* 2 proposed 6 as the best number of clusters
* 3 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
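Because the task excludes the boundary values min.nc and max.nc, the majority-rule answer of 2 cannot be used directly. A small sketch of the selection, with the vote counts copied from the summary above and the variable names chosen here for illustration:

```r
# Vote counts copied from the NbClust summary above.
votes <- c("2" = 10, "3" = 3, "4" = 5, "6" = 2, "10" = 3)
min_nc <- 2
max_nc <- 10

# Drop the boundary values, then take the k with the most remaining votes.
k_values <- as.integer(names(votes))
inner_votes <- votes[!(k_values %in% c(min_nc, max_nc))]
best_k <- as.integer(names(inner_votes)[which.max(inner_votes)])
best_k
```

With 2 and 10 removed, 4 clusters has the most votes (5), ahead of 3 clusters (3 votes) and 6 clusters (2 votes).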
Run the model with the best estimated number of clusters.
#Type your code here
# Show the voting results for the best number of clusters
number_cluster_estimate$Best.nc
Show the average summary of variables in each cluster.
#Type your code here
# Run PAM clustering with 4 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_Health_data_scale_cluster <- pam(Health_data_scale, k = 4)
# Show medoids for the clusters
kmeans_Health_data_scale_cluster$medoids
# Plot the clustering result in two reduced dimensions
plot(kmeans_Health_data_scale_cluster)
# Add the assigned cluster to the original Health data
Health_data_cluster <- Healthdatasub %>%
  mutate(cluster = kmeans_Health_data_scale_cluster$clustering)
# Show the dataset with assigned clusters
Health_data_cluster
# Compute the average of each variable by cluster
Health_cluster_summary <- Health_data_cluster %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
# Show cluster summaries to compare
Health_cluster_summary
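The code above uses PAM (k-medoids). For comparison, a sketch of the same workflow with k-means proper via stats::kmeans is given below. The data frame toy is simulated and only mirrors the column names; centers = 4 follows the cluster count used above, and nstart = 25 is an arbitrary choice to reduce sensitivity to the random starting centers.

```r
# Hypothetical data with the same four column names.
set.seed(123)
n <- 200
toy <- data.frame(
  Stroke          = rnorm(n, 3, 1),
  BloodPressure   = rnorm(n, 30, 4),
  Depression      = rnorm(n, 20, 3),
  HighCholesterol = rnorm(n, 32, 3)
)

# Standardize so that no variable dominates the distance calculation.
toy_scaled <- scale(toy)

# k-means with 4 clusters and several random restarts.
km <- kmeans(toy_scaled, centers = 4, nstart = 25)

# Attach the cluster labels and average each variable within each cluster.
toy$cluster <- km$cluster
cluster_means <- aggregate(. ~ cluster, data = toy, FUN = mean)
cluster_means
```

The averages are computed on the original, unscaled values so they stay interpretable in the variables' own units.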