TAKE HOME ASSIGNMENT WEEK 1

DUE: Sept 18, 2025 at 11:59PM

---
title: "PSL4040: Assignment 1"
author: "Michael Sava"
date: "2025-09-17"
output: html_document
---

Assignment

We will be working with the following diabetes dataset, which I stored in the Week 1 project folder. Since it is already in this folder, I can load the file directly without needing to locate it elsewhere on my computer.

db <- read.csv("diabetes.csv")

This dataset contains the following columns:

- Pregnancies - Number of times pregnant
- Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure - Diastolic blood pressure (mm Hg)
- SkinThickness - Triceps skin fold thickness (mm)
- Insulin - 2-Hour serum insulin (mu U/ml)
- BMI - Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction - Diabetes pedigree function
- Age - Age (years)
- Outcome - Class variable (0 or 1); 268 of the 768 observations are 1 and the rest are 0, where 1 is diabetes positive and 0 is healthy

Outcome is really a category (diabetic vs. healthy), so the raw 1 and 0 codes are not very informative. Let's change that. First, we will make Outcome a factor (category). Next, we can use the levels() function to rename the values, so that 0 becomes "Healthy" and 1 becomes "Sick".

db$Outcome<-factor(db$Outcome)
levels(db$Outcome)
## [1] "0" "1"
levels(db$Outcome)<-c("Healthy", "Sick")
head(db)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50    Sick
## 2                    0.351  31 Healthy
## 3                    0.672  32    Sick
## 4                    0.167  21 Healthy
## 5                    2.288  33    Sick
## 6                    0.201  30 Healthy

The head(db) function displays the first six rows of the dataset along with all columns. In this view, the arbitrary numerical values have been replaced with character labels (“Healthy” and “Sick”), making the table easier to read and interpret.

QUESTIONS

# Descriptive questions with minimal coding

  1. What are the main data types?

- Numerical: all numbers
- Logical: TRUE or FALSE
- Character: must be wrapped in quotes; any words, sentences, symbols, paragraphs, or numbers inside the quotes are treated as characters
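A minimal illustration of these three types (the example values are my own):

# Checking the class of one example of each data type
class(42.5)        # "numeric"
class(TRUE)        # "logical"
class("Michael")   # "character"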

  2. What are the differences between a vector and a data.frame?

A vector is a one-dimensional collection of elements, and all of its elements must be the same data type. A data frame is a series of equal-length vectors arranged as columns, and each column (vector) can contain a different data type.
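A small sketch of the difference (the example values and object names are my own):

# A vector holds elements of one type; a data frame combines equal-length
# columns that can each be a different type.
ages   <- c(50, 31, 32)                  # numeric vector
status <- c("Sick", "Healthy", "Sick")   # character vector
patients <- data.frame(Age = ages, Outcome = status)
str(patients)                            # shows one numeric and one character column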

  3. Define an object and a function. How are they named and coded?

Object: a container that stores data. It can hold any data type: numbers, logical values, characters, vectors, matrices, data frames, or even functions. Naming and coding: object names must start with a letter and can include letters, numbers, underscores, or periods, but no spaces or special characters (like #, @, etc.). An object is defined and assigned a value using <- or =, for example m <- "Michael".

Function: a block of code that performs a specific task. Functions act on objects and can return values. Naming and coding: functions are named actions like mean(), t.test(), or ggplot(). A function is always called with parentheses (), and it can take arguments (inputs) that modify its behavior.
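A short sketch tying the two together (the greet() function is my own example, not part of the assignment):

# Assigning an object and defining/calling a simple function that uses it
m <- "Michael"                # object holding a character value
greet <- function(name) {     # function with one argument
  paste("Hello,", name)
}
greet(m)                      # returns "Hello, Michael"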

# Coding activity

Coding and analysis questions

  1. For the whole dataset, what are the mean and SD for the variables age, number of pregnancies, and glucose? Place this into a table.

First I will ensure my libraries are activated.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(formattable)
library(dplyr)

Now that my libraries are activated, I can use their encoded functions.

I then took the diabetes dataset ("db", as defined earlier) and used the pipe operator (%>%) with the summarise() function. Within summarise(), I specified arguments to calculate and label the mean (mean()) and standard deviation (sd()) for the variables age, number of pregnancies, and glucose level. The results were organized into a labeled table. Finally, I applied the formattable() function to display the values to 4 significant digits.

db %>%
  summarise(mean_age = mean(Age),
            mean_pregnancies = mean(Pregnancies),
            mean_Glucose = mean(Glucose),
            sd_age = sd(Age),
            sd_pregnancies = sd(Pregnancies),
            sd_Glucose = sd(Glucose)) %>%
  formattable(digits = 4)

| mean_age | mean_pregnancies | mean_Glucose | sd_age | sd_pregnancies | sd_Glucose |
|---------:|-----------------:|-------------:|-------:|---------------:|-----------:|
|    33.24 |            3.845 |        120.9 |  11.76 |           3.37 |      31.97 |

The code did what I asked and yielded the means and standard deviations for the entire dataset. That said, these overall summaries are not very useful on their own; we must wrangle the data further to obtain more meaningful information.

  2. Split the data by the variable Outcome (diabetic = 1; nondiabetic = 0) and calculate the mean and SD of age, number of pregnancies, and glucose. Place this into a table.

I adapted the code from Question 1 by adding the function group_by(Outcome). This instructs R to take the db dataset, split it into two groups based on the Outcome (Healthy or Sick), and then perform the same calculations for each group separately. The results should be displayed in a labeled table with values shown to 4 significant digits.

db %>%
  group_by(Outcome) %>%
  summarise(mean_age = mean(Age),
            mean_pregnancies = mean(Pregnancies),
            mean_Glucose = mean(Glucose),
            sd_age = sd(Age),
            sd_pregnancies = sd(Pregnancies),
            sd_Glucose = sd(Glucose)) %>%
  formattable(digits = 4)

| Outcome | mean_age | mean_pregnancies | mean_Glucose | sd_age | sd_pregnancies | sd_Glucose |
|---------|---------:|-----------------:|-------------:|-------:|---------------:|-----------:|
| Healthy |    31.19 |            3.298 |        110.0 |  11.67 |          3.017 |      26.14 |
| Sick    |    37.07 |            4.866 |        141.3 |  10.97 |          3.741 |      31.94 |

This time, the code separated the data into two groups based on the possible outcomes (Healthy and Sick), then calculated the means and standard deviations for age, number of pregnancies, and glucose level within each group. These results are more informative because they allow for direct comparison between groups. For instance, the Sick group shows higher mean values for age, number of pregnancies, and glucose, insights that would not have been visible if we had only calculated summary statistics for the entire dataset.

  3. Determine if the distributions of the variables pregnancy and glucose are normal or non-normal (i.e., whether parametric or non-parametric tests are appropriate) for the outcomes.

To assess whether the distributions of the number of pregnancies and glucose level were normal, I applied the shapiro.test() function. This test evaluates normality by returning a p-value: if the p-value is very small (commonly, p < 0.05), we reject the assumption of normality. In such cases, the data are considered non-normal, and a non-parametric test such as the Wilcoxon rank-sum (Mann-Whitney U) test should be used instead. I also added a heading above each shapiro.test() call using the cat() function to clearly indicate which variable (number of pregnancies or glucose levels) each test corresponds to.

cat("Shapiro.test for # of Pregnancies")
## Shapiro.test for # of Pregnancies
db%>%
  select(Pregnancies)%>%
  pull()%>%
  shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.90428, p-value < 2.2e-16
cat("Shapiro.test for Glucose Levels")
## Shapiro.test for Glucose Levels
db%>%
  select(Glucose)%>%
  pull()%>%
  shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.9701, p-value = 1.986e-11

The tests indicated that the p-values for both the number of pregnancies (p < 2.2e-16) and the glucose level (p = 1.986e-11) are significant; therefore, neither variable is normally distributed.

However, because shapiro.test() is highly sensitive, even minor deviations from normality can produce a significant result. To account for this, I will also generate Q-Q plots of the distributions of number of pregnancies and glucose levels to visually verify normality before proceeding. I created two objects, test1 and test2, each containing the variable of interest pulled from the diabetes dataset (db): test1 for number of pregnancies and test2 for glucose levels. I then passed these objects to the qqnorm() and qqline() functions to generate the corresponding Q-Q plots, adding the main titles "Q-Q Plot for Number of Pregnancies" and "Q-Q Plot for Glucose Levels" via the main= argument.

test1<-db%>%
  select(Pregnancies)%>%
  pull()

qqnorm(test1,main="Q-Q Plot for Number of Pregnancies")
qqline(test1)

test2<-db%>%
  select(Glucose)%>%
  pull()

qqnorm(test2,main="Q-Q Plot for Glucose Levels")
qqline(test2)

A Q-Q plot checks whether the data resemble a normal distribution: if they do, the points fall along a straight line. In our plots, the points curve away from the line, especially at the tails, which tells us the data do not follow a normal distribution. This visual evidence supports the result of the shapiro.test().
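As an optional extra visual check (my own addition, not required by the assignment), histograms of the same pulled vectors give another quick impression of the skew in both variables:

# Histograms of the vectors created above (test1 and test2)
hist(test1, main = "Histogram of Number of Pregnancies", xlab = "Pregnancies")
hist(test2, main = "Histogram of Glucose", xlab = "Glucose")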

  4. Apply an appropriate statistical test based on the results of Q3 to determine if glucose or pregnancy are different between the two outcomes. Summarize your findings in a table.

I will use the wilcox.test() function, rather than a t-test, to compare the distributions of # of Pregnancies and Glucose Levels between the outcomes (Healthy or Sick), since the t-test assumes normality. For each call to wilcox.test(), the formula argument will be the variable (Pregnancies or Glucose) described by (~) Outcome, with the data taken from the db dataset. I will store the two tests in the objects wilcoxTest1 and wilcoxTest2. To organize the results, I will build a table with the data.frame() function and store it in an object called summary_stat_table1. The first column, "variable", uses the combine function c() to place "# of Pregnancies" and "Glucose Levels" on separate rows. The second column, "pvalue", uses c() to pull the p-values (via the $ operator) from wilcoxTest1 and wilcoxTest2, and the final column, "Statistic", does the same for the test statistics of both objects.

wilcoxTest1<-wilcox.test(Pregnancies~Outcome,db)
wilcoxTest2<-wilcox.test(Glucose~Outcome,db)

summary_stat_table1 <- data.frame(variable = c("# of Pregnancies", "Glucose Levels"),
                                  pvalue = c(wilcoxTest1$p.value, wilcoxTest2$p.value),
                                  Statistic = c(wilcoxTest1$statistic, wilcoxTest2$statistic))

summary_stat_table1
##           variable       pvalue Statistic
## 1 # of Pregnancies 3.745146e-08   50985.0
## 2   Glucose Levels 1.200727e-39   28390.5

Both p-values from the Wilcoxon tests are well below the commonly used significance threshold of 0.05, allowing us to reject the null hypothesis in both cases. The null hypothesis stated that the distributions (medians) of the number of pregnancies or glucose levels are the same between healthy and sick individuals. Our results indicate a statistically significant difference in the number of pregnancies between the two groups (p = 3.75e-08) and an even stronger difference in glucose levels (p = 1.20e-39). The effect was particularly pronounced for glucose, suggesting that glucose levels are much more strongly associated with health outcomes than the number of pregnancies.
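To put a rough number on that comparison, here is a minimal sketch (my own addition, not part of the assignment) that converts each W statistic into a rank-biserial correlation, r = 1 - 2W/(n1*n2); a larger absolute r indicates a larger effect. This assumes W is the Mann-Whitney U statistic for the first factor level ("Healthy"), which is how wilcox.test() reports it.

# Effect sizes from the Wilcoxon/Mann-Whitney W statistics computed above
n_healthy <- sum(db$Outcome == "Healthy")
n_sick    <- sum(db$Outcome == "Sick")

rank_biserial <- function(W, n1, n2) 1 - 2 * W / (n1 * n2)

rank_biserial(wilcoxTest1$statistic, n_healthy, n_sick)  # effect size for Pregnancies
rank_biserial(wilcoxTest2$statistic, n_healthy, n_sick)  # effect size for Glucose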

  5. Data cleaning is an important step in any analysis. This should typically be done first. You may have noticed that there are some odd values in some of the variables. Try the code summary(diabetes). For example, some patients have glucose values of zero. This is not biologically possible! Apply wrangling to remove the patients (observations) with glucose values that are not physiologically possible.

First, I created an object dbunmod that stores summary(db) for the unmodified dataset, which shows that the minimum recorded value for Glucose across the entire dataset was 0. As this is not physiologically possible, I created another object called dbmod by piping the db dataset into the subset() function to generate a subset containing only observations with glucose levels greater than 0.

dbunmod<-summary(db)

dbmod<-db%>%
  subset(Glucose>0)

summary(dbmod)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.852   Mean   :121.7   Mean   : 69.12   Mean   :20.48  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.00   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.00   1st Qu.:27.30   1st Qu.:0.2435           1st Qu.:24.00  
##  Median : 36.00   Median :32.00   Median :0.3740           Median :29.00  
##  Mean   : 80.29   Mean   :31.99   Mean   :0.4725           Mean   :33.27  
##  3rd Qu.:128.50   3rd Qu.:36.55   3rd Qu.:0.6265           3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome   
##  Healthy:497  
##  Sick   :266  
##               
##               
##               
## 

I verified that the code worked by running the summary(dbmod) function, which showed that the minimum value for Glucose Levels was no longer 0 but 44. This confirmed that data points with 0 Glucose Levels had been successfully removed. If I were to clean the dataset more thoroughly, I would also remove entries showing values of zero for blood pressure, skin thickness, insulin (unless they were Type 1 diabetics whose metrics were taken during a 24–48 hour episode of life-threatening diabetic ketoacidosis), and BMI. These zero values are most likely artifacts of incomplete or erroneous data, as such measurements cannot realistically occur in living individuals.
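As a sketch of what that fuller cleaning could look like (assuming we simply drop any row with a zero in those columns; the object name dbclean is my own and this step is not applied elsewhere in the assignment):

# Hypothetical fuller cleaning: drop observations with physiologically
# impossible zeros in the remaining measurement columns.
dbclean <- db %>%
  filter(Glucose > 0,
         BloodPressure > 0,
         SkinThickness > 0,
         Insulin > 0,
         BMI > 0)

summary(dbclean)   # the minimums of these columns should now be above zero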

  6. Recalculate the means and SD of glucose and pregnancy based on outcome and apply an appropriate statistical test of the differences. Summarize this in a table incorporating the results before and after cleaning the glucose values.

Using the modified (cleaned) dataset (dbmod), I will pipe the data to group it by Outcome (Healthy or Sick), then pipe it again to create a table with the summarise function. This table will include the mean # of Pregnancies and mean Glucose Levels, along with column titles and the standard deviations for both variables. I will then pipe this into the formattable function to limit the results to 4 digits.

dbmod %>%
  group_by(Outcome) %>%
  summarise(mean_pregnancies = mean(Pregnancies),
            mean_Glucose = mean(Glucose),
            sd_pregnancies = sd(Pregnancies),
            sd_Glucose = sd(Glucose)) %>%
  formattable(digits = 4)

| Outcome | mean_pregnancies | mean_Glucose | sd_pregnancies | sd_Glucose |
|---------|-----------------:|-------------:|---------------:|-----------:|
| Healthy |            3.312 |        110.6 |          3.021 |      24.78 |
| Sick    |            4.861 |        142.3 |          3.755 |      29.60 |

The process was similar to what I did in Question 2, except that this time I did not include the Age variable and instead worked with the modified dataset (dbmod).

To confirm that the modified (clean) dbmod dataset remains non-normally distributed, and that a non-parametric test such as wilcox.test() is still appropriate for evaluating significance, I will repeat the shapiro.test() analysis using the modified (clean) dataset. I will then create Q-Q plots to visually confirm the results of the shapiro.test(), repurposing the code from Question 3.

cat("Shapiro.test for # of Pregnancies for dbmod")
## Shapiro.test for # of Pregnancies for dbmod
dbmod%>%
  select(Pregnancies)%>%
  pull()%>%
  shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.90454, p-value < 2.2e-16
cat("Shapiro.test for Glucose Levels for dbmod")
## Shapiro.test for Glucose Levels for dbmod
dbmod%>%
  select(Glucose)%>%
  pull()%>%
  shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.96964, p-value = 1.72e-11
test3<-dbmod%>%
  select(Pregnancies)%>%
  pull()

qqnorm(test3,main="Q-Q Plot for Number of Pregnancies for dbmod")
qqline(test3)

test4<-dbmod%>%
  select(Glucose)%>%
  pull()

qqnorm(test4,main="Q-Q Plot for Glucose Levels for dbmod")
qqline(test4)

I applied the same reasoning as in Question 3 to evaluate whether the data is normally distributed and arrived at the same conclusion: the data is non-normally distributed and therefore requires a non-parametric test to assess its significance level.

Now I will perform the wilcox.test() to assess the significance levels of the variables Number of Pregnancies and Glucose Levels. I will repurpose code I wrote for Question 4. I will use the modified (clean) dataset (dbmod) and assign the results of the wilcox.test() functions to new objects named wilcoxTestMod1 and wilcoxTestMod2.

wilcoxTestMod1<-wilcox.test(Pregnancies~Outcome,dbmod)
wilcoxTestMod2<-wilcox.test(Glucose~Outcome,dbmod)

summary_stat_table2 <- data.frame(pvalue = c(wilcoxTestMod1$p.value, wilcoxTestMod2$p.value),
                                  Statistic = c(wilcoxTestMod1$statistic, wilcoxTestMod2$statistic))

summary_stat_table2
##         pvalue Statistic
## 1 7.638243e-08   50610.5
## 2 1.305039e-40   27393.5

When comparing summary_stat_table1 and summary_stat_table2, the p-value obtained from the modified (clean) dbmod dataset is slightly larger (less significant) for # of Pregnancies (p = 7.64e-08 vs. p = 3.75e-08) and smaller (more significant) for Glucose Levels (p = 1.31e-40 vs. p = 1.20e-39). I did not include the variable column so the results could be displayed more clearly in the next step, but the first row corresponds to # of Pregnancies and the second row to Glucose Levels. This suggests that cleaning the dataset to remove bad data is particularly important when p-values approach the 0.05 significance threshold.

Finally, I will compile the means and standard deviations of Pregnancies and Glucose, grouped by Outcome, into a table named final_mean_sd_comparison_table. To do this, I will create objects called db_table and dbmod_table, which will encode the code written for the unmodified and modified datasets from Questions 2 and 6, respectively. I will then combine these objects into a data frame and store the results in final_mean_sd_comparison_table. In parallel, I will present the statistical significance test results for both the unmodified db dataset and the modified (clean) dbmod dataset in a separate table named final_statistics_comparison_table. For this, I will use the objects summary_stat_table1 and summary_stat_table2 created in Questions 4 and 6, respectively, and similarly compile them into the object final_statistics_comparison_table.

db_table <- db %>%
  group_by(Outcome) %>%
  summarise(mean_pregnancies = mean(Pregnancies),
            mean_Glucose = mean(Glucose),
            sd_pregnancies = sd(Pregnancies),
            sd_Glucose = sd(Glucose)) %>%
  formattable(digits = 4)

dbmod_table <- dbmod %>%
  group_by(Outcome) %>%
  summarise(mean_pregnancies = mean(Pregnancies),
            mean_Glucose = mean(Glucose),
            sd_pregnancies = sd(Pregnancies),
            sd_Glucose = sd(Glucose))


final_mean_sd_comparison_table<-data.frame(db_table,dbmod_table)
  
final_mean_sd_comparison_table
##   Outcome mean_pregnancies mean_Glucose sd_pregnancies sd_Glucose Outcome.1
## 1 Healthy         3.298000     109.9800       3.017185   26.14120   Healthy
## 2    Sick         4.865672     141.2575       3.741239   31.93962      Sick
##   mean_pregnancies.1 mean_Glucose.1 sd_pregnancies.1 sd_Glucose.1
## 1           3.311871       110.6439         3.020982     24.77691
## 2           4.860902       142.3195         3.754672     29.59920
final_statistics_comparison_table<-data.frame(summary_stat_table1, summary_stat_table2)

final_statistics_comparison_table
##           variable       pvalue Statistic     pvalue.1 Statistic.1
## 1 # of Pregnancies 3.745146e-08   50985.0 7.638243e-08     50610.5
## 2   Glucose Levels 1.200727e-39   28390.5 1.305039e-40     27393.5

While I was not able to arrange the mean and standard deviation table into four rows (Healthy and Sick for the unmodified data, and Healthy and Sick for the modified data), I did manage to display the results side by side, including the statistical comparison as well. Moving forward, I will work on improving the clarity and structure of my tables.
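As one possible way to get the four-row layout described above, here is a minimal sketch (the object and label names are my own) that recomputes both summaries, tags each with its source dataset, and stacks them with bind_rows():

# Sketch: stack the unmodified and cleaned summaries into a single four-row table
four_row_table <- bind_rows(
  db %>%
    group_by(Outcome) %>%
    summarise(mean_pregnancies = mean(Pregnancies), mean_Glucose = mean(Glucose),
              sd_pregnancies = sd(Pregnancies), sd_Glucose = sd(Glucose)) %>%
    mutate(dataset = "unmodified (db)"),
  dbmod %>%
    group_by(Outcome) %>%
    summarise(mean_pregnancies = mean(Pregnancies), mean_Glucose = mean(Glucose),
              sd_pregnancies = sd(Pregnancies), sd_Glucose = sd(Glucose)) %>%
    mutate(dataset = "cleaned (dbmod)")
) %>%
  relocate(dataset) %>%        # put the dataset label in the first column
  formattable(digits = 4)

four_row_table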