For this part of project 2, I will be working with the Diabetes dataset uploaded by Folorunsho Atanda on BlackBoard (https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&course_id=_2303106_1&nav=discussion_board_entry&conf_id=_2781274_1&forum_id=_3717486_1&message_id=_69324188_1)
To begin, I will read the dataset into R from my Github repo, add an ID column, clean the data such that the predictors of diabetes are grouped under one column, and then perform simple, descriptive analyses to determine which factors are most greatly associated with diabetes outcomes
diabetes <- read.csv("https://raw.githubusercontent.com/Mattr5541/DATA-607/main/Project%202/diabetes.csv")
##Now, I want to add an ID variable, so we can more easily keep track of observations
diabetes <- diabetes %>% mutate(ID = row_number())
##Now, I would like to reorder the columns so that ID appears first
diabetes <- diabetes %>% select(ID, everything())
diabetes_long <- diabetes %>% pivot_longer(cols = c(Pregnancies:BMI, Age), names_to = "Predictors", values_to = "Values")
diabetes_long <- diabetes_long %>% select(ID, Predictors, Values, DiabetesPedigreeFunction, Outcome)
##Next, I want to split this dataframe into two: one for negative (0) outcomes and one for positive (1) outcomes. This will make it easier to determine the mean values for each predictor based on the specific outcome
diabetes_pos <- diabetes_long %>% filter(Outcome == 1)
diabetes_neg <- diabetes_long %>% filter(Outcome == 0)
First, I will determine the average pedigree level for individuals with diabetes versus those without diabetes, to determine if this is an appropriate marker for predicting diabetes
print((paste0("Number of people without diabetes: ", nrow(diabetes_neg))))
## [1] "Number of people without diabetes: 3500"
print((paste0("Number of people with diabetes: ", nrow(diabetes_pos))))
## [1] "Number of people with diabetes: 1876"
As we can see, the total number of individuals without diabetes is larger than the number of individuals with diabetes. From the perspective of generalizing to the population, this makes sense
mean_pedigree_neg <- diabetes_neg %>% summarize(mean_pedigree = mean(DiabetesPedigreeFunction))
print(mean_pedigree_neg)
## # A tibble: 1 × 1
## mean_pedigree
## <dbl>
## 1 0.430
mean_pedigree_pos <- diabetes_pos %>% summarize(mean_pedigree = mean(DiabetesPedigreeFunction))
print(mean_pedigree_pos)
## # A tibble: 1 × 1
## mean_pedigree
## <dbl>
## 1 0.550
##Median Values
median_pedigree_neg <- diabetes_neg %>% summarize(median_pedigree = median(DiabetesPedigreeFunction))
print(median_pedigree_neg)
## # A tibble: 1 × 1
## median_pedigree
## <dbl>
## 1 0.336
median_pedigree_pos <- diabetes_pos %>% summarize(median_pedigree = median(DiabetesPedigreeFunction))
print(median_pedigree_pos)
## # A tibble: 1 × 1
## median_pedigree
## <dbl>
## 1 0.449
ggplot(diabetes_long, aes(x = DiabetesPedigreeFunction, y = after_stat(count))) + geom_histogram(binwidth = .05) + facet_wrap(~Outcome)
All of this shows that the diabetes pedigree function, on average, is marginally higher for individuals diagnosed with diabetes than those who were not diagnosed with diabetes. The graphs show that the distributions for both categories are heavily positively skewed (which is sensible for this type of indicator), and the values tend to cluster closer to 1.0 when people are positive for diabetes than when they test negative for diabetes. Because of this heavy skew, I opted to determine the median values, as well. As we can see, the difference median differences between pedigrees diagnosed with diabetes versus not diagnosed with diabetes is substantially different. However, it may not serve as the most reliable indicator for predicting diabetes, as both the mean and median values for those diagnosed with diabetes are only around .5. It is important to note that this measure, of course, is only a marker for risk factors that lead to diabetes, however, and not of diabetes development, itself.
Now, I will find the averages for each predictor:
unique(diabetes_long$Predictors)
## [1] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI" "Age"
diabetes_means_neg <- diabetes_neg %>% group_by(Predictors) %>% summarize(Group_means = mean(Values))
diabetes_means_pos <- diabetes_pos %>% group_by(Predictors) %>% summarize(Group_means = mean(Values))
print(diabetes_means_neg)
## # A tibble: 7 × 2
## Predictors Group_means
## <chr> <dbl>
## 1 Age 31.2
## 2 BMI 30.3
## 3 BloodPressure 68.2
## 4 Glucose 110.
## 5 Insulin 68.8
## 6 Pregnancies 3.30
## 7 SkinThickness 19.7
print(diabetes_means_pos)
## # A tibble: 7 × 2
## Predictors Group_means
## <chr> <dbl>
## 1 Age 37.1
## 2 BMI 35.1
## 3 BloodPressure 70.8
## 4 Glucose 141.
## 5 Insulin 100.
## 6 Pregnancies 4.87
## 7 SkinThickness 22.2
We can see that–as expected–all categories are elevated in the positive group when compared to the negative group. Let’s investigate this further by using graphical comparisons
##I want to first combine these mean comparisons into a single dataframe
diabetes_means_neg <- data_frame(diabetes_means_neg)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
diabetes_means_pos <- data_frame(diabetes_means_pos)
diabetes_means_neg <- diabetes_means_neg %>% mutate(Outcome = 0)
diabetes_means_pos <- diabetes_means_pos %>% mutate(Outcome = 1)
diabetes_mean_append <- rbind(diabetes_means_neg, diabetes_means_pos)
diabetes_mean_append <- diabetes_mean_append %>% mutate(Outcome = recode(Outcome, '0' = 'negative', '1' = 'positive'))
ggplot(diabetes_mean_append, aes(x = Predictors, y = Group_means, fill = Outcome)) + geom_bar(stat= "identity", position = "dodge") + labs(title = "Mean Rates of Diabetes Predictors", y = "Means")
Thus, as one would assume, insulin and glucose appear to be the best predictors of having diabetes, such that highly elevated insulin and glucose levels tend to correspond with the presence of diabetes.
Now I want to see the differences in the distribution of blood pressure between the two groups.
bloodPressure_pos <- diabetes_pos %>%
filter(Predictors == "BloodPressure") %>%
mutate(Status = "Positive")
bloodPressure_neg <- diabetes_neg %>%
filter(Predictors == "BloodPressure")
# create an empty ggplot object
bp = ggplot() +
ggtitle("Distribution of Blood Pressure in Diabetes Positive and Negative Groups") +
xlab("Diabetes Status") +
ylab("Blood Pressure Values")
# Add boxplot for Positive group
bp = bp +
geom_boxplot(data = bloodPressure_pos, aes(x = "Positive", y = Values), colour='red')
# Add boxplot for Negative group
bp = bp +
geom_boxplot(data = bloodPressure_neg, aes(x = "Negative", y = Values), colour='green')
# Display the plot
print(bp)
The distributions of blood pressure between the diabetes negative and positive group is similar. However, the mean blood pressure of the diabetes group was slightly above the mean for the negative group. This may indicate that diabetes patients have higher blood pressure (or the other way around).