Concept questions

1. What is an appropriate application of the box plot and the density plot, and how are they similar/different?

A box plot is most appropriate when the goal is to summarize and compare distributions of continuous data across different categories. It provides a compact visual summary of key statistics such as the median, quartiles, and potential outliers. This makes it useful for identifying differences in central tendency and spread between groups. In contrast, a density plot is better suited for visualizing the shape of a distribution, including patterns such as skewness, modality, and clustering. It is particularly helpful when trying to detect whether the data form distinct groupings or exhibit smooth transitions. While both plots are used for continuous data and can compare across categories, they differ in focus: box plots emphasize summary statistics, whereas density plots reveal the underlying distributional structure. Together, they can provide complementary insights into the nature of the data.
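As a rough illustration of this difference, the sketch below draws both plot types for a simulated bimodal sample. The data are made up (two normal clusters chosen for this example, not the diabetes data), and base R graphics are used only to keep the sketch free of package dependencies.

```r
# Simulated bimodal sample: two clusters centred at -3 and 3 (made-up data)
set.seed(42)
x <- c(rnorm(200, mean = -3, sd = 1), rnorm(200, mean = 3, sd = 1))

# Five-number summary (min, lower hinge, median, upper hinge, max),
# which is what the box plot draws
five_num <- fivenum(x)
print(five_num)

# Kernel density estimate, which is what the density plot draws
dens <- density(x)

# Draw the two plots side by side
old_par <- par(mfrow = c(1, 2))
boxplot(x, main = "Box plot")
plot(dens, main = "Density plot")
par(old_par)
```

The box plot shows one wide box spanning both clusters, while the density plot shows the two separate peaks, which is the kind of structure a box plot cannot reveal.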

2. If two vectors of data have the same mean, could they still be different?

Yes, two vectors of data can have the same mean and still be very different. The mean only captures the average value and does not reflect the distribution of the data. For example, one vector might be tightly clustered around the mean, while another could be widely dispersed or skewed. Differences in variance, presence of outliers, and overall shape of the distribution can lead to very different interpretations, even if the mean is identical. In such cases, the median may serve as a more robust measure of central tendency, especially when the data is not symmetrically distributed. Visualization tools such as histograms, box plots, or density plots are essential for uncovering these differences and ensuring that the analysis accounts for more than just the mean.
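A small worked example may make this concrete. The vectors below are made up for illustration; they share a mean of 50 but differ greatly in spread, and a third vector shows how one outlier moves the mean while the median stays put.

```r
# Two made-up vectors with the same mean but very different spread
tight <- c(48, 49, 50, 51, 52)
wide  <- c(0, 25, 50, 75, 100)

mean(tight)  # 50
mean(wide)   # 50

sd(tight)    # about 1.58
sd(wide)     # about 39.5

# A single outlier shifts the mean but leaves the median unchanged
skewed <- c(48, 49, 50, 51, 152)
mean(skewed)    # 70
median(skewed)  # 50
```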

3. Name a graph type that is appropriate to use for:

A. categorical by categorical data

For categorical-by-categorical data, bar graphs are typically the most appropriate choice. They are used to count occurrences within each category and can be displayed as grouped or stacked bars to show relationships between two categorical variables. If the number of categories is large or hierarchical, heat maps may also be useful for clustering and identifying patterns.

B. continuous by categorical with different groups

For continuous-by-categorical data with different groups, box plots are ideal for comparing distributions across categories, as they highlight differences in medians, ranges, and outliers. Density plots can also be used to compare the shape of distributions between groups, especially when visualizing patterns or modality. In some cases, violin plots—which combine features of box plots and density plots—can provide even more detailed insights into the data.

Coding activity

Let's return to the diabetes data set.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
db <- read.csv("diabetes.csv")

head(db)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

This data set contains the following columns of data:

Pregnancies - Number of times pregnant

Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure - Diastolic blood pressure (mm Hg)

SkinThickness - Triceps skin fold thickness (mm)

Insulin - 2-Hour serum insulin (mu U/ml)

BMI - Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction - Diabetes pedigree function

Age - Age (years)

Outcome - Class variable (0 or 1); 268 of 768 are 1, the others are 0. 1 is diabetes positive and 0 is healthy.

Outcome is a category (diabetic or healthy), so the codes 1 and 0 are not very informative. Let's change that. First we will make Outcome a factor (category). Next we can use the levels() function to rename the values, so that 0 becomes Healthy and 1 becomes Sick.
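The same recoding can also be written as a single factor() call. The sketch below uses a made-up vector of codes (the name outcome_codes is hypothetical) rather than the real column, and is offered as an alternative rather than as the method used in this assignment.

```r
# Toy vector of 0/1 codes (made up for illustration)
outcome_codes <- c(1, 0, 1, 0, 1, 0)

# Convert to a factor and attach labels in one call.
# Stating the levels explicitly guards against labels being
# matched to the codes in the wrong order.
outcome <- factor(outcome_codes,
                  levels = c(0, 1),
                  labels = c("Healthy", "Sick"))

table(outcome)
```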

db$Outcome<-factor(db$Outcome)
levels(db$Outcome)
## [1] "0" "1"
levels(db$Outcome)<-c("Healthy", "Sick")
head(db)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50    Sick
## 2                    0.351  31 Healthy
## 3                    0.672  32    Sick
## 4                    0.167  21 Healthy
## 5                    2.288  33    Sick
## 6                    0.201  30 Healthy
summary(db)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome   
##  Healthy:500  
##  Sick   :268  
##               
##               
##               
## 

From summary(db), it is clear that some cleaning is needed before analysis. Glucose, BloodPressure, SkinThickness, Insulin, and BMI all have a minimum of zero, which is physiologically impossible and most likely reflects missing or incorrectly entered values. To address this, I will create a new object, db_clean, by applying the filter() function to retain only rows where all of these variables are greater than zero.

db_clean <- db %>%
  filter(Glucose > 0,
    BloodPressure > 0,
    SkinThickness > 0,
    Insulin > 0,
    BMI > 0)

summary(db_clean)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 56.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.:21.00  
##  Median : 2.000   Median :119.0   Median : 70.00   Median :29.00  
##  Mean   : 3.301   Mean   :122.6   Mean   : 70.66   Mean   :29.15  
##  3rd Qu.: 5.000   3rd Qu.:143.0   3rd Qu.: 78.00   3rd Qu.:37.00  
##  Max.   :17.000   Max.   :198.0   Max.   :110.00   Max.   :63.00  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0850           Min.   :21.00  
##  1st Qu.: 76.75   1st Qu.:28.40   1st Qu.:0.2697           1st Qu.:23.00  
##  Median :125.50   Median :33.20   Median :0.4495           Median :27.00  
##  Mean   :156.06   Mean   :33.09   Mean   :0.5230           Mean   :30.86  
##  3rd Qu.:190.00   3rd Qu.:37.10   3rd Qu.:0.6870           3rd Qu.:36.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome   
##  Healthy:262  
##  Sick   :130  
##               
##               
##               
## 

The output of summary(db_clean) confirms that the cleaning worked: the minimum of each filtered variable is now above zero, and 392 of the original 768 rows remain (262 Healthy, 130 Sick).
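Filtering on all five variables at once removes every row that has a zero in any of them. One alternative, sketched here on a toy data frame rather than the real file, is to recode the zeros as NA so that each analysis only drops rows missing the variables it actually uses. The toy values and the name measured are assumptions made for this illustration.

```r
# Toy data frame standing in for the diabetes data (values are made up)
toy <- data.frame(
  Glucose = c(148, 0, 183),
  BMI     = c(33.6, 26.6, 0),
  Age     = c(50, 31, 32)
)

# Columns in which a zero is treated as a missing measurement
measured <- c("Glucose", "BMI")

# Replace zeros with NA in those columns only
toy_na <- toy
toy_na[measured] <- lapply(toy_na[measured],
                           function(x) replace(x, x == 0, NA))

# Count missing values per column
colSums(is.na(toy_na))

# Keep only rows that are complete for the measured columns
complete <- toy_na[complete.cases(toy_na[measured]), ]
```

Functions such as median() then need na.rm = TRUE, but rows are no longer discarded because of a variable that a given plot or test does not use.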

1. Graph Glucose and BMI considering the outcome (sick/healthy) status. Comment on the observed relationships.

In this question, I want to explore the relationship between BMI and Glucose levels while taking into account whether individuals are classified as Sick (diabetic) or Healthy (non-diabetic). Since both BMI and Glucose are continuous variables, a scatter plot is an appropriate visualization, so I will pipe db_clean into ggplot() and add geom_point(). By mapping the x-axis to BMI, the y-axis to Glucose, and colour to Outcome, I can examine whether there is a relationship between BMI and Glucose and whether there are visible differences or clustering patterns between the Sick and Healthy groups.

db_clean%>%
  ggplot(mapping=aes(x=BMI,y=Glucose,colour=Outcome))+geom_point()

From the scatter plot, we can see that individuals classified as Sick generally appear at higher Glucose levels compared to Healthy individuals, even across a range of BMI values. While BMI varies in both groups, Glucose levels tend to be more elevated in the Sick group, suggesting that Glucose is more strongly associated with diabetic status than BMI alone. This fits with the physiological understanding that glucose dysregulation is a defining characteristic of diabetes, whereas BMI may contribute as a risk factor but does not by itself distinguish between the two groups.

2. Make a new categorical variable by cutting the pregnancies variable into no pregnancy (0), a few (1-3), several (4-6) and many (>7). Use an appropriate graph type to visualize these pregnancy categories with Outcome (diabetic or not).

To better understand how the number of pregnancies relates to diabetes outcome, I will create a new categorical variable using the mutate() function. Specifically, I will group the discrete Pregnancies variable into meaningful categories: no pregnancy (0), a few (1–3), several (4–6), and many (7+). This can be achieved with the cut() function by introducing breaks at the appropriate values. Creating these categories makes the data easier to interpret and allows for direct comparison between groups.
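To check that the breaks put each count in the intended category, the sketch below applies cut() to a few hand-picked pregnancy counts (made-up values, not rows from the data set). With right = TRUE each interval includes its upper bound, so the breaks c(-Inf, 0, 3, 6, Inf) give the intervals (-Inf, 0], (0, 3], (3, 6] and (6, Inf].

```r
# Hand-picked counts covering every boundary (made-up values)
pregnancies <- c(0, 1, 3, 4, 6, 7, 17)

cats <- cut(pregnancies,
            breaks = c(-Inf, 0, 3, 6, Inf),
            labels = c("no pregnancy", "a few (1-3)",
                       "several (4-6)", "many (7+)"),
            right = TRUE)

# Show each count next to the category it was assigned
data.frame(pregnancies, cats)
```

Here 0 falls in "no pregnancy", 3 in "a few (1-3)", 6 in "several (4-6)", and 7 in "many (7+)".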

Since both variables of interest (pregnancy category and diabetic outcome) are categorical, a bar chart is an appropriate choice. I will use ggplot() and map the x-axis to the new pregnancy categories. The y-axis does not need to be explicitly defined, as it will default to showing frequency counts. I will use the fill aesthetic to represent diabetic outcome with different colors. To improve readability, I will set position = "dodge" so that healthy and diabetic outcomes appear side by side rather than stacked. For clarity, I will label the x-axis as Pregnancy Category and the legend for the fill colors as Diabetic Outcome. Finally, I will apply theme_minimal() for a clean and simple appearance.

db_clean %>%
  mutate(pregnancy_cat = cut(Pregnancies,
                             breaks = c(-Inf, 0, 3, 6, Inf),
                             labels = c("no pregnancy", "a few (1-3)", "several (4-6)", "many (7+)"),
                             right = TRUE)) %>%
  ggplot(mapping = aes(x = pregnancy_cat, fill = Outcome)) +
  geom_bar(position = "dodge") +
  labs(x = "Pregnancy Category", fill = "Diabetic Outcome") +
  theme_minimal()

The resulting plot shows how diabetes prevalence changes across pregnancy categories. Women with no pregnancies or only a few pregnancies have lower proportions of diabetes compared to those with several or many pregnancies, where the proportion of diabetic cases increases. This suggests that a higher number of pregnancies may be associated with a greater likelihood of diabetes in this dataset.
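A dodged bar chart displays counts, so proportions have to be judged by eye from the relative bar heights. One way to read them directly is a row-wise proportion table; the sketch below uses a small made-up data frame (not the real data) to show the idea. In ggplot2, geom_bar(position = "fill") gives the corresponding plot.

```r
# Made-up example: 4 women per category
toy <- data.frame(
  pregnancy_cat = rep(c("no pregnancy", "many (7+)"), each = 4),
  Outcome = c("Healthy", "Healthy", "Healthy", "Sick",
              "Healthy", "Sick", "Sick", "Sick")
)

# Cross-tabulate counts, then convert each row to proportions
counts <- table(pregnancy_cat = toy$pregnancy_cat, Outcome = toy$Outcome)
props  <- prop.table(counts, margin = 1)

props
```

In this toy table the proportion of Sick is 0.25 for "no pregnancy" and 0.75 for "many (7+)"; each row sums to 1.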

3. Graphically compare BMI vs the pregnancy categorical variable and a second graph of Glucose vs the pregnancy category. Consider your choice of graph. There can be several “correct” answers; try some different plots and pick one to include. Describe whether you think there is a difference in the means of any of the groups. Also explore whether there is a difference.

To explore how BMI and Glucose vary across pregnancy categories, I will first use the categorical variable pregnancy_cat created earlier. Since BMI and Glucose are continuous variables while pregnancy category is categorical, boxplots and violin plots are appropriate choices for comparing their distributions across groups. I will use the ggplot() function with the db_clean dataset, mapping the x-axis to pregnancy_cat and the y-axis to either BMI or Glucose, depending on the graph. The fill aesthetic will be applied to distinguish categories by color.

Boxplots provide a clear visualization of medians, quartiles, and potential outliers, while violin plots highlight the overall distribution shape. To combine the strengths of both, I will layer boxplots on top of violin plots using geom_boxplot() with a narrower width and contrasting fill so that they stand out clearly against the violin background. Titles and axis labels will be set with the labs() function, and I will apply theme_minimal() for a clean appearance. Since the color legend is not necessary in this case, I will remove it using theme(legend.position = "none").

To formally test whether differences between groups are statistically significant, I will consider both parametric and non-parametric approaches. Although an ANOVA could be used under normality assumptions, Assignment 1 demonstrated through Shapiro–Wilk tests and Q–Q plots that the data are not normally distributed. Therefore, I will use the non-parametric Kruskal–Wallis test, which is well suited for comparing medians across more than two groups without assuming normality. This approach will allow me to assess whether BMI or Glucose values differ significantly between pregnancy categories.
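For readers without Assignment 1 at hand, the sketch below shows what such a normality check looks like. It runs on simulated right-skewed values (made up for this example, not the diabetes data), so the numbers it prints say nothing about BMI or Glucose.

```r
# Simulated right-skewed sample (made-up data)
set.seed(1)
skewed <- rexp(100, rate = 1 / 150)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(skewed)
sw$p.value

# Q-Q plot: points bending away from the line indicate non-normality
qqnorm(skewed)
qqline(skewed)
```

When the check rejects normality, as it does for this skewed sample, a rank-based test such as kruskal.test() is the usual fallback.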

# Code for pregnancy categories
db_clean<-db_clean %>%
  mutate(pregnancy_cat = cut(Pregnancies,
                             breaks = c(-Inf, 0, 3, 6, Inf),
                             labels = c("no pregnancy", "a few (1-3)", "several (4-6)", "many (7+)"),
                             right = TRUE))


# Boxplots
ggplot(db_clean, aes(x = pregnancy_cat, y = BMI, fill = pregnancy_cat)) +
  geom_boxplot() +
  labs(title = "BMI across Pregnancy Categories",
       x = "Pregnancy Category", y = "BMI") +
  theme_minimal() +
  theme(legend.position = "none")

ggplot(db_clean, aes(x = pregnancy_cat, y = Glucose, fill = pregnancy_cat)) +
  geom_boxplot() +
  labs(title = "Glucose across Pregnancy Categories",
       x = "Pregnancy Category", y = "Glucose") +
  theme_minimal() +
  theme(legend.position = "none")

# Violin plots (with embedded boxplots)
ggplot(db_clean, aes(x = pregnancy_cat, y = BMI, fill = pregnancy_cat)) +
  geom_violin() +
  geom_boxplot(width = 0.25, fill = "white") +
  labs(title = "BMI Distribution by Pregnancy Category",
       x = "Pregnancy Category", y = "BMI") +
  theme_minimal() +
  theme(legend.position = "none")

ggplot(db_clean, aes(x = pregnancy_cat, y = Glucose, fill = pregnancy_cat)) +
  geom_violin() +
  geom_boxplot(width = 0.25, fill = "white") +
  labs(title = "Glucose Distribution by Pregnancy Category",
       x = "Pregnancy Category", y = "Glucose") +
  theme_minimal() +
  theme(legend.position = "none")

# Kruskal-Wallis tests
kruskal.test(BMI ~ pregnancy_cat, data = db_clean)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  BMI by pregnancy_cat
## Kruskal-Wallis chi-squared = 27.598, df = 3, p-value = 4.41e-06
kruskal.test(Glucose ~ pregnancy_cat, data = db_clean)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Glucose by pregnancy_cat
## Kruskal-Wallis chi-squared = 26.818, df = 3, p-value = 6.427e-06

The boxplots and violin plots illustrate how BMI and Glucose values vary between pregnancy categories. While there is visible overlap, certain trends appear. For instance, Glucose levels tend to increase in groups with higher numbers of pregnancies, suggesting a possible relationship between repeated pregnancies and impaired glucose regulation. BMI shows less pronounced differences, though variation between groups is still present.

The Kruskal–Wallis tests assess whether these visual differences are statistically significant. Both were significant (BMI: chi-squared = 27.598, df = 3, p = 4.41e-06; Glucose: chi-squared = 26.818, df = 3, p = 6.427e-06), indicating that at least one pregnancy category differs from the others in BMI and in Glucose. The test does not identify which categories differ. The result aligns with known physiological links between pregnancy and long-term metabolic changes, where multiple pregnancies may increase insulin resistance and diabetes risk.
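A common follow-up to a significant Kruskal–Wallis result is a set of pairwise Wilcoxon tests with a correction for multiple comparisons. The sketch below illustrates this with pairwise.wilcox.test() from base R on simulated values; the group means and sample sizes are made up for the example, and the Holm correction is one reasonable choice rather than the only one.

```r
# Simulated Glucose-like values for four groups (made-up data)
set.seed(123)
cat_levels <- c("no pregnancy", "a few (1-3)", "several (4-6)", "many (7+)")
groups  <- factor(rep(cat_levels, each = 30), levels = cat_levels)
glucose <- c(rnorm(30, 110, 20), rnorm(30, 115, 20),
             rnorm(30, 125, 20), rnorm(30, 135, 20))

# Omnibus test: does any group differ?
kruskal.test(glucose ~ groups)

# Post-hoc: which pairs differ, with Holm-adjusted p-values
posthoc <- pairwise.wilcox.test(glucose, groups, p.adjust.method = "holm")
posthoc$p.value
```

The result is a lower-triangular matrix of adjusted p-values, one per pair of categories.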

4. Perform an appropriate statistical test to determine if the means of BMI or Glucose are different between any two pregnancy categories. Discuss your findings in a physiological context.

To compare BMI and Glucose between women with no pregnancies and those with many pregnancies, I will use the Wilcoxon rank-sum test, implemented in R with the function wilcox.test(). I will run separate tests for BMI and Glucose, using the formula interface (response ~ pregnancy_cat) and filtering the db_clean dataset to include only the two relevant categories (“no pregnancy” and “many (7+)”).

The Wilcoxon test is appropriate here because:

The two groups (“no pregnancy” vs. “many pregnancies”) are independent categorical groups.

BMI and Glucose are continuous variables that, as shown in Assignment 1, are not normally distributed, making a non-parametric test more suitable than a t-test.

wilcox.test() compares the distributions of the two groups without assuming normality, allowing assessment of whether the central tendency differs significantly between them.

# Compare Glucose between no pregnancy and many pregnancies

wilcox.test(Glucose ~ pregnancy_cat,
  data = filter(db_clean, pregnancy_cat %in% c("no pregnancy", "many (7+)")))
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Glucose by pregnancy_cat
## W = 1401, p-value = 0.01594
## alternative hypothesis: true location shift is not equal to 0
# Compare BMI between no pregnancy and many pregnancies

wilcox.test(BMI ~ pregnancy_cat,
  data = filter(db_clean, pregnancy_cat %in% c("no pregnancy", "many (7+)")))
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  BMI by pregnancy_cat
## W = 2426.5, p-value = 0.005216
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon tests yielded significant p-values (< 0.05) for both Glucose and BMI, indicating that these measures differ between women with no pregnancies and those with many pregnancies. Physiologically, higher Glucose levels in women with multiple pregnancies may reflect cumulative stress on glucose metabolism, potentially increasing the risk of gestational or type 2 diabetes. Interestingly, BMI differences showed that women with many pregnancies had slightly lower mean and median BMI compared to those with no pregnancies, which is contrary to the expectation that repeated pregnancies contribute to long-term weight gain. This observation may be specific to this dataset and should be interpreted cautiously, as it may not reflect a general physiological trend.
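The Wilcoxon test reports whether two groups differ but not in which direction, so statements about one group being higher or lower rest on group summaries. A minimal sketch of computing them with tapply() follows, using made-up BMI values rather than the real data.

```r
# Made-up BMI values for two pregnancy categories
toy <- data.frame(
  pregnancy_cat = rep(c("no pregnancy", "many (7+)"), each = 4),
  BMI = c(30, 34, 36, 40, 28, 31, 33, 36)
)

# Mean and median BMI within each category
means   <- tapply(toy$BMI, toy$pregnancy_cat, mean)
medians <- tapply(toy$BMI, toy$pregnancy_cat, median)

means
medians
```

In this toy example both the mean and the median are 35 for "no pregnancy" and 32 for "many (7+)".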

5. Reassess the relationship between BMI or Glucose and pregnancy considering the outcome (diabetic or not). Perform graphical exploration and appropriate statistical analysis of a comparison you think will be significantly different. Discuss these findings.

To explore how Glucose levels vary across pregnancy categories while considering diabetic outcome, I will use ggplot() to create a boxplot with pregnancy category on the x-axis and Glucose on the y-axis, filling the boxes by Outcome. This allows us to visually assess whether diabetic status interacts with pregnancy category.

For statistical analysis, I will compare Glucose between diabetic and healthy groups using a Wilcoxon rank-sum test (wilcox.test()). This non-parametric test is appropriate because Glucose is continuous and not normally distributed in our dataset.

# Graphical exploration: Glucose vs Pregnancy Category, colored by Outcome
ggplot(db_clean, aes(x = pregnancy_cat, y = Glucose, fill = Outcome)) +
  geom_boxplot(position = position_dodge(0.8)) +
  labs(title = "Glucose levels by Pregnancy Category and Outcome",
    x = "Pregnancy Category",
    y = "Glucose") +
  theme_minimal()

# Statistical comparison: Glucose between Healthy and Sick
wilcox.test(Glucose ~ Outcome,
  data = db_clean)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Glucose by Outcome
## W = 6615.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

From the boxplot, we can visually compare Glucose distributions across pregnancy categories for healthy and diabetic individuals. Side-by-side boxes make it easy to see that Glucose tends to be higher in diabetic women within each pregnancy category.

The Wilcoxon test formally compares all diabetic against all healthy individuals, pooled across pregnancy categories. The p-value (< 2.2e-16) is far below 0.05, indicating that the Glucose distributions of the two groups differ. Together with the boxplot, this points to higher Glucose in diabetic women, consistent with the physiological expectation that diabetes impairs glucose regulation.
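Because the pooled test ignores pregnancy category, one possible extension is to repeat the comparison within each category and adjust the p-values. The sketch below does this on simulated data; the group means, sample sizes, and the two categories used are made up for the example.

```r
# Simulated data: two pregnancy categories, two outcomes (made-up values)
set.seed(1)
toy <- data.frame(
  pregnancy_cat = factor(rep(c("no pregnancy", "a few (1-3)"), each = 40)),
  Outcome = factor(rep(rep(c("Healthy", "Sick"), each = 20), times = 2)),
  Glucose = c(rnorm(20, 110, 15), rnorm(20, 140, 15),
              rnorm(20, 112, 15), rnorm(20, 145, 15))
)

# Run one Wilcoxon test per pregnancy category
p_raw <- sapply(split(toy, toy$pregnancy_cat), function(d) {
  wilcox.test(Glucose ~ Outcome, data = d)$p.value
})

# Adjust for running several tests (Holm method)
p_adj <- p.adjust(p_raw, method = "holm")
p_adj
```

Each element of p_adj is named after its pregnancy category, which makes it easy to see where the Healthy and Sick groups differ.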

6. Make a combined plot of a) Insulin vs Outcome and b) Insulin vs pregnancy.

To explore how Insulin levels vary across both diabetic outcome and pregnancy category, I will create two separate boxplots and combine them into a single figure.

Plot 1: Insulin vs Outcome — this plot allows us to assess whether insulin levels differ between diabetic (Sick) and healthy individuals.

Plot 2: Insulin vs Pregnancy Category — this plot shows whether insulin levels change across different pregnancy groups, which may reflect cumulative metabolic or physiological effects associated with repeated pregnancies.

I will create two ggplot objects, p1 and p2, using the db_clean dataset. For p1, I will map the x-axis to Outcome and the y-axis to Insulin. For p2, the x-axis will be pregnancy_cat and the y-axis will remain Insulin. Both plots will use boxplots to summarize the distribution of Insulin, highlighting medians, quartiles, and potential outliers. I will label the axes and titles clearly, use a consistent minimal theme (theme_minimal()), and remove the fill legend to improve clarity.

Using the cowplot package, I can combine p1 and p2 side by side, making it easier to visually compare patterns across outcome and pregnancy groups. Boxplots are appropriate because Insulin is a continuous variable, and these plots provide a concise summary of the data distribution.

library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:lubridate':
## 
##     stamp
# Plot A: Insulin vs Outcome
p1 <- ggplot(db_clean, aes(x = Outcome, y = Insulin, fill = Outcome)) +
  geom_boxplot() +
  labs(
    title = "Insulin vs Outcome",
    x = "Diabetic Outcome",
    y = "Insulin"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Plot B: Insulin vs Pregnancy Category
p2 <- ggplot(db_clean, aes(x = pregnancy_cat, y = Insulin, fill = pregnancy_cat)) +
  geom_boxplot() +
  labs(
    title = "Insulin vs Pregnancy Category",
    x = "Pregnancy Category",
    y = "Insulin"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Combine both plots into one figure
plot_grid(p1, p2, labels = c("A", "B"))

The combined figure allows comparison of Insulin levels across diabetic status and pregnancy categories. In panel A, diabetic individuals (Sick) appear to show higher median Insulin than healthy individuals, which would be consistent with compensatory hyperinsulinemia in insulin-resistant states. In panel B, Insulin levels seem to increase slightly with higher numbers of pregnancies, potentially indicating cumulative metabolic stress associated with repeated pregnancies. Displaying these plots side by side makes these patterns easier to compare and facilitates interpretation of physiological trends.