title: “PSL4040: Assignment 1” author: “Michael Sava” date: “2025-09-17” output: html_document —
Assignment
We will be working with the following diabetes dataset, which I stored in the Week 1 project folder. Since it is already in this folder, I can load the file directly without needing to locate it elsewhere on my computer.
db <- read.csv("diabetes.csv")
This data set contains the following columns of data Pregnancies - Number of times pregnant
Glucose - Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure - Diastolic blood pressure (mm Hg)
SkinThickness - Triceps skin fold thickness (mm)
Insulin - 2-Hour serum insulin (mu U/ml)
BMI - Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction - Diabetes pedigree function
Age - Age (years)
Outcome - Class variable (0 or 1) 268 of 768 are 1, the others are 0 1 is diabetes positive and 0 is healthy
Outcome is a category, diabetes and healthy, the 1 and 0 is not very helpful for this. Lets change it. First we will make the outcome a factor (category). Next we can use the levels function to change the names of the values, where 0 will be Healthy and 1 will be Sick
db$Outcome<-factor(db$Outcome)
levels(db$Outcome)
## [1] "0" "1"
levels(db$Outcome)<-c("Healthy", "Sick")
head(db)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 Sick
## 2 0.351 31 Healthy
## 3 0.672 32 Sick
## 4 0.167 21 Healthy
## 5 2.288 33 Sick
## 6 0.201 30 Healthy
The head(db) function displays the first six rows of the dataset along with all columns. In this view, the arbitrary numerical values have been replaced with character labels (“Healthy” and “Sick”), making the table easier to read and interpret.
QUESTIONS
# Descriptive questions with minimal coding
Numerical: All Numbers Logical: TRUE or FALSE Character: Must have quotes but any words, sentences, symbols, paragraphs, and numbers within the quotes are considered characters
A vector includes several elements either horizontally or vertically oriented and must contain the same data type. For a data frame, it is a series of vectors oriented vertically and each vector can contain a different data type.
Definition: Object: A container that stores data. It can hold any data type: numbers, logical values, characters, vectors, matrices, data frames, or even functions. Naming and Coding: They must start with a letter, can include letters, numbers, underscores, or periods, and have no spaces or special characters (like #, @, etc.). An object is defined and assigned a value using <- or =. Example m<-“Michael”
Definition: Function: A block of code that performs a specific task. Functions act on objects and can return values. Naming and Coding: Functions are named actions like mean(), t.test(), ggplot() Always use parentheses () to call a function Functions can take arguments (inputs) that modify their behavior.
# Coding activity
First I will ensure my libraries are activated.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(formattable)
library(dplyr)
Now that my libraries are activated, I can use their encoded functions.
I then took the diabetes dataset (“db”, as defined earlier) and used the pipe operator (%>%) with the summarise() function. Within summarise(), I specified arguments to calculate and label the mean (mean()) and standard deviation (sd()) for the variables age, number of pregnancies, and glucose level. The results were organized into a labeled table. Finally, I applied the formattable function to round the values so they displayed only 4 digits.
db%>%
summarise(mean_age=mean(Age),mean_pregnancies=mean(Pregnancies),mean_Glucose=mean(Glucose),sd_age=sd(Age),sd_pregnancies=sd(Pregnancies),sd_Glucose=sd(Glucose))%>%formattable(digits=4)
mean_age | mean_pregnancies | mean_Glucose | sd_age | sd_pregnancies | sd_Glucose |
---|---|---|---|---|---|
33.24 | 3.845 | 120.9 | 11.76 | 3.37 | 31.97 |
The code did what I asked for and yielded the means and standard deviations of the entire data set. That being said, this data is not very useful on its own and we must further wrangle it to obtain some meaningful information.
I adapted the code from Question 1 by adding the function group_by(Outcome). This instructs R to take the db dataset, split it into two groups based on the outcome (Healthy or Sick), and then perform the same calculations for each group separately. The results should be displayed in a labeled table with values rounded to 4 digits.
db%>%
group_by(Outcome)%>%
summarise(mean_age=mean(Age),mean_pregnancies=mean(Pregnancies),mean_Glucose=mean(Glucose),sd_age=sd(Age),sd_pregnancies=sd(Pregnancies),sd_Glucose=sd(Glucose))%>%formattable(digits=4)
Outcome | mean_age | mean_pregnancies | mean_Glucose | sd_age | sd_pregnancies | sd_Glucose |
---|---|---|---|---|---|---|
Healthy | 31.19 | 3.298 | 110.0 | 11.67 | 3.017 | 26.14 |
Sick | 37.07 | 4.866 | 141.3 | 10.97 | 3.741 | 31.94 |
This time, the code separated the data into two groups based on the available outcomes (Healthy and Sick). It then calculated the means and standard deviations for age, number of pregnancies, and glucose level within each group. These results are more informative, as they allow for direct comparison between groups. For instance, the Sick group shows higher mean values for age, number of pregnancies, and glucose; insights that would not be visible if we had only calculated summary statistics for the entire dataset.
To assess whether the distributions of the variables number of pregnancies and glucose level were normal, I applied the shapiro.test() function. This test evaluates normality by returning a p-value: if the p-value is very small (commonly, p < 0.05), we reject the assumption of normality. In such cases, the data are considered non-normal, and a non-parametric test such as the U-test or the Wilcox test should be used instead. I also added headings above each shapiro.test() using the cat() function to clearly indicate which variable (number of pregnancies and glucose levels) each test corresponds to.
cat("Shapiro.test for # of Pregnancies")
## Shapiro.test for # of Pregnancies
db%>%
select(Pregnancies)%>%
pull()%>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.90428, p-value < 2.2e-16
cat("Shapiro.test for Glucose Levels")
## Shapiro.test for Glucose Levels
db%>%
select(Glucose)%>%
pull()%>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.9701, p-value = 1.986e-11
The code indicated that the p-value for both the number of pregnancies (p-value<2.2e-16) and the glucose level (1.986e-11) variables are significant and therefore, not normally distributed.
However, because shapiro.test() is highly sensitive, even minor deviations from normality can produce a significant result. To account for this, I will also generate qqplots of the distributions of the variables number of pregnancies and glucose levels to visually verify normality before proceeding. I created two objects, test1 and test2. Each object contained the diabetes dataset (db) with the specific variable of interest extracted, test1 for number of pregnancies and test2 for glucose levels. I then passed these objects to the qqnorm() and qqline() functions to generate the corresponding qqplots. I added main title labels, “Q-Q Plot for Number of Pregnancies” and “Q-Q Plot for Glucose Levels”, by including the main=“” argument.
test1<-db%>%
select(Pregnancies)%>%
pull()
qqnorm(test1,main="Q-Q Plot for Number of Pregnancies")
qqline(test1)
test2<-db%>%
select(Glucose)%>%
pull()
qqnorm(test2,main="Q-Q Plot for Glucose Levels")
qqline(test2)
The qqplot checks if our data looks like a normal distribution. If it
does, the points will form a straight line. In our plots, the points
curve away from the line, especially at the ends, which tells us the
data doesn’t follow a normal distribution. This visual evidence supports
the result from the shapiro.test().
I will use the wilcox.test() function to analyze the distributions of the variables # of Pregnancies and Glucose Levels in relation to the outcome (Healthy or Sick), rather than a t-test, since the t-test assumes normality of the data. For the wilcox.test() function, the arguments included will be Pregnancies and Glucose described by (~) the Outcome (Healthy or Sick) and I will ensure its using data from the db dataset. I will put these separate tests into the objects wilcoxTest1 and wilcoxTest2. To organize the results, I will generate a table and store it in an object called summary_stat_table1, which I will create using the data.frame() function. The arguments for the data.frame() function will include a column title “variable” and the combine function (c()) that puts the variables # of Pregnancies and Glucose Levels in the first column on different rows. I will then create another title for a column called “pvalue” that will use the combine function to include the specific pvalues column (using the $ operator) from the objects wilcoxTest1 and wilcoxTest2, finally I will also create a final column labled “Statistic” and the combine function to include the statistic column of both wilcoxTest1 and wilcoxTest2, similar to what I did for the “pvalue” colum.
wilcoxTest1<-wilcox.test(Pregnancies~Outcome,db)
wilcoxTest2<-wilcox.test(Glucose~Outcome,db)
summary_stat_table1<-data.frame(variable=c("# of Pregnancies","Glucose Levels"),"pvalue"=c(wilcoxTest1$p.value,wilcoxTest2$p.value),"Statistic"=c(wilcoxTest1$statistic,wilcoxTest2$statistic))
summary_stat_table1
## variable pvalue Statistic
## 1 # of Pregnancies 3.745146e-08 50985.0
## 2 Glucose Levels 1.200727e-39 28390.5
Both p-values yielded from the Wilcox Test are well below the commonly used significance threshold of 0.05, allowing us to reject the null hypothesis in both cases. The null hypothesis stated that the distributions (medians) of the number of pregnancies or glucose levels are the same between healthy and sick individuals. Our results indicate a statistically significant difference in the number of pregnancies between the two groups (p = 3.75e-8) and an even stronger statistically significant difference in glucose levels (p = 1.20e-39). The effect was particularly pronounced for glucose, suggesting that glucose levels are much more strongly associated with health outcomes compared to the number of pregnancies.
First I created an object dbunmod that encodes the summary(db) of the unmodified dataset so I can see that the min value recorded for Glucose Levels was 0 across the entire dataset. As this is not physiologically possible, I created anothe object called dbmod, ecoded the db database that was piped and used the subset() function to take generate a subset of the dataset that only contains values with Glucose levels greater than 0.
dbunmod<-summary(db)
dbmod<-db%>%
subset(Glucose>0)
summary(dbmod)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.852 Mean :121.7 Mean : 69.12 Mean :20.48
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.00 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.00 1st Qu.:27.30 1st Qu.:0.2435 1st Qu.:24.00
## Median : 36.00 Median :32.00 Median :0.3740 Median :29.00
## Mean : 80.29 Mean :31.99 Mean :0.4725 Mean :33.27
## 3rd Qu.:128.50 3rd Qu.:36.55 3rd Qu.:0.6265 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Healthy:497
## Sick :266
##
##
##
##
I verified that the code worked by running the summary(dbmod) function, which showed that the minimum value for Glucose Levels was no longer 0 but 44. This confirmed that data points with 0 Glucose Levels had been successfully removed. If I were to clean the dataset more thoroughly, I would also remove entries showing values of zero for blood pressure, skin thickness, insulin (unless they were Type 1 diabetics whose metrics were taken during a 24–48 hour episode of life-threatening diabetic ketoacidosis), and BMI. These zero values are most likely artifacts of incomplete or erroneous data, as such measurements cannot realistically occur in living individuals.
Using the modified (cleaned) dataset (dbmod), I will pipe the data to group it by Outcome (Healthy or Sick), then pipe it again to create a table with the summarise function. This table will include the mean # of Pregnancies and mean Glucose Levels, along with column titles and the standard deviations for both variables. I will then pipe this into the formattable function to limit the results to 4 digits.
dbmod%>%group_by(Outcome)%>%
summarise(mean_pregnancies=mean(Pregnancies),mean_Glucose=mean(Glucose),sd_pregnancies=sd(Pregnancies),sd_Glucose=sd(Glucose))%>%formattable(digits=4)
Outcome | mean_pregnancies | mean_Glucose | sd_pregnancies | sd_Glucose |
---|---|---|---|---|
Healthy | 3.312 | 110.6 | 3.021 | 24.78 |
Sick | 4.861 | 142.3 | 3.755 | 29.60 |
The process was similar to what I did in Question 2, except that this time I did not include the Age variable and instead worked with the modified dataset (dbmod).
To confirm that the modified (clean) dbmod dataset remains non-normally distributed, and that a non-parametric test such as wilcox.test() is still appropriate for evaluating significance, I will repeat the shapiro.test() analysis, this time using the modified (clean) dataset. I will then create qqplots to visually confirm the results of the shapiro.test(). I will repurpose code from Question 3.
cat("Shapiro.test for # of Pregnancies for dbmod")
## Shapiro.test for # of Pregnancies for dbmod
dbmod%>%
select(Pregnancies)%>%
pull()%>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.90454, p-value < 2.2e-16
cat("Shapiro.test for Glucose Levels for dbmod")
## Shapiro.test for Glucose Levels for dbmod
dbmod%>%
select(Glucose)%>%
pull()%>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.96964, p-value = 1.72e-11
test3<-dbmod%>%
select(Pregnancies)%>%
pull()
qqnorm(test3,main="Q-Q Plot for Number of Pregnancies for dbmod")
qqline(test3)
test4<-dbmod%>%
select(Glucose)%>%
pull()
qqnorm(test4,main="Q-Q Plot for Glucose Levels for dbmod")
qqline(test4)
I applied the same reasoning as in Question 3 to evaluate whether the
data is normally distributed and arrived at the same conclusion: the
data is non-normally distributed and therefore requires a non-parametric
test to assess its significance level.
Now I will perform the wilcox.test() to assess the significance levels of the variables Number of Pregnancies and Glucose Levels. I will repurpose code I wrote for Question 4. I will use the modified (clean) dataset (dbmod) and assign the results of the wilcox.test() functions to new objects named wilcoxTestMod1 and wilcoxTestMod2.
wilcoxTestMod1<-wilcox.test(Pregnancies~Outcome,dbmod)
wilcoxTestMod2<-wilcox.test(Glucose~Outcome,dbmod)
summary_stat_table2<-data.frame("pvalue"=c(wilcoxTestMod1$p.value,wilcoxTestMod2$p.value),"Statistic"=c(wilcoxTestMod1$statistic,wilcoxTestMod2$statistic))
summary_stat_table2
## pvalue Statistic
## 1 7.638243e-08 50610.5
## 2 1.305039e-40 27393.5
When comparing summary_stat_table1 and summary_stat_table2, the p-values obtained from the modified (clean) dbmod dataset indicate lower significance in the variation between Outcomes (Healthy vs. Sick) for # of Pregnancies (p = 7.64e-08 vs. p = 3.75e-08) and higher significance for Glucose Levels (p = 1.31e-40 vs. p = 1.20e-39). I did not include the Variables column so the results could be displayed more clearly in the next step, but the first row corresponds to # of Pregnancies and the second row corresponds to Glucose Levels. This demonstrates that cleaning the dataset to remove bad data is particularly important when p-values approach the 0.05 significance threshold.
Finally, I will compile the means and standard deviations of Pregnancies and Glucose, grouped by Outcome, into a table named final_mean_sd_comparison_table. To do this, I will create objects called db_table and dbmod_table, which will encode the code written for the unmodified and modified datasets from Questions 2 and 6, respectively. I will then combine these objects into a data frame and store the results in final_mean_sd_comparison_table. In parallel, I will present the statistical significance test results for both the unmodified db dataset and the modified (clean) dbmod dataset in a separate table named final_statistics_comparison_table. For this, I will use the objects summary_stat_table1 and summary_stat_table2 created in Questions 4 and 6, respectively, and similarly compile them into the object final_statistics_comparison_table.
db_table<-db%>%
group_by(Outcome)%>%
summarise(mean_pregnancies=mean(Pregnancies),mean_Glucose=mean(Glucose),sd_pregnancies=sd(Pregnancies),sd_Glucose=sd(Glucose))%>%formattable(digits=4)
dbmod_table<-dbmod%>%group_by(Outcome)%>%
summarise(mean_pregnancies=mean(Pregnancies),mean_Glucose=mean(Glucose),sd_pregnancies=sd(Pregnancies),sd_Glucose=sd(Glucose))
final_mean_sd_comparison_table<-data.frame(db_table,dbmod_table)
final_mean_sd_comparison_table
## Outcome mean_pregnancies mean_Glucose sd_pregnancies sd_Glucose Outcome.1
## 1 Healthy 3.298000 109.9800 3.017185 26.14120 Healthy
## 2 Sick 4.865672 141.2575 3.741239 31.93962 Sick
## mean_pregnancies.1 mean_Glucose.1 sd_pregnancies.1 sd_Glucose.1
## 1 3.311871 110.6439 3.020982 24.77691
## 2 4.860902 142.3195 3.754672 29.59920
final_statistics_comparison_table<-data.frame(summary_stat_table1, summary_stat_table2)
final_statistics_comparison_table
## variable pvalue Statistic pvalue.1 Statistic.1
## 1 # of Pregnancies 3.745146e-08 50985.0 7.638243e-08 50610.5
## 2 Glucose Levels 1.200727e-39 28390.5 1.305039e-40 27393.5
While I was not able to arrange the mean and standard deviation table into four rows (Healthy and Sick for the unmodified data, and Healthy and Sick for the modified data), I did manage to display the results side by side, including the statistical comparison as well. Moving forward, I will work on improving the clarity and structure of my tables.