Download this R Markdown file, save it on your computer, and perform all the below tasks by inserting your answer in text or by inserting R chunks below. After you are done, upload this file with your solutions on Moodle.

Exercise 1: Compute frequencies in the Pima diabetes dataset

Load the Pima diabetes dataset:

load(file = url("https://www.dropbox.com/s/hxn1c7yw9r8aw8v/Pima_diabetes.RData?dl=1"))
df <- Pima_diabetes  # Rename for clarity
str(df)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Which variables are measured on a nominal level?

# Unique values per variable
unique_counts <- sapply(df, function(x) length(unique(na.omit(x))))
unique_counts
##              Pregnancies                  Glucose            BloodPressure 
##                       17                      136                       47 
##            SkinThickness                  Insulin                      BMI 
##                       51                      186                      248 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                      517                       52                        2
# Class of each variable
sapply(df, class)
##              Pregnancies                  Glucose            BloodPressure 
##                "integer"                "integer"                "integer" 
##            SkinThickness                  Insulin                      BMI 
##                "integer"                "integer"                "numeric" 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                "numeric"                "integer"                "integer"

Now compute frequency tables, barplots, and mosaic plots of all nominal variables in the dataset. Frequency Table

df$Outcome <- factor(df$Outcome, 
                     levels = c(0, 1), 
                     labels = c("No Diabetes", "Diabetes"))

freq_table <- table(df$Outcome)
freq_table
## 
## No Diabetes    Diabetes 
##         500         268

Barplot

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
ggplot(df, aes(x = Outcome, fill = Outcome)) +
  geom_bar(color = "black", alpha = 0.8) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("red", "blue")) +
  labs(title = "Distribution of Diabetes Outcome (Nominal Variable)",
       x = "Diabetes Status", y = "Frequency") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mosaic Plot

df_temp <- df
df_temp$Outcome <- as.numeric(df$Outcome) - 1

dummy <- rep(1, nrow(df))

# Use factor version of Outcome (with labels)
spineplot(df$Outcome ~ dummy,
          main = "Spineplot: Diabetes Outcome Distribution",
          xlab = "Diabetes Status",
          ylab = "Proportion",
          col = c("#1b9e77", "#d95f02"),
          border = "white",
          lwd = 2)
text(0.3, 0.7, paste0("No Diabetes: ", sum(df$Outcome == "No Diabetes")), col = "white", font = 2)
text(0.7, 0.3, paste0("Diabetes: ", sum(df$Outcome == "Diabetes")), col = "white", font = 2)

Next, create a variable which describes whether a woman had more or less than 4 pregnancies. Then, use this variable to create a 2x2 table with diabetes outcome. Do you see an indication of whether the number of pregnancies is associated with diabetes prevalence? Do you think your investigation is a good way to investigate this?

Exploring Pregnancies distribution

ggplot(df, aes(x = Pregnancies)) +
  geom_histogram(binwidth = 1, fill = "#7570b3", color = "black", alpha = 0.8) +
  geom_vline(xintercept = 4, linetype = "dashed", color = "red", size = 1) +
  labs(title = "Distribution of Number of Pregnancies",
       x = "Number of Pregnancies", y = "Count") +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

creating new variable

df$ParityGroup <- cut(df$Pregnancies, 
                      breaks = c(-1, 4, Inf), 
                      labels = c("Low-to-Moderate (≤4)", "High Parity (>4)"),
                      right = TRUE)
df$ParityGroup <- factor(df$ParityGroup)

now calculating risk ratios

tab <- table(df$ParityGroup, df$Outcome)
colnames(tab) <- c("No Diabetes", "Diabetes")
tab
##                       
##                        No Diabetes Diabetes
##   Low-to-Moderate (≤4)         356      136
##   High Parity (>4)             144      132
# Row percentages (prevalence)
prev <- prop.table(tab, margin = 1) * 100
round(prev, 1)
##                       
##                        No Diabetes Diabetes
##   Low-to-Moderate (≤4)        72.4     27.6
##   High Parity (>4)            52.2     47.8
# Risk Ratio
rr <- prev[2, 2] / prev[1, 2]
rr
## [1] 1.730179
fisher.test(tab)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 2.835e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.742026 3.302776
## sample estimates:
## odds ratio 
##   2.396616

Indeed, there is compelling evidence that the prevalence of diabetes is 71% greater among women who have had more than four pregnancies. This inquiry is worthwhile because: makes use of a clinically significant cutoff reports the actionable risk ratio. uses a precise test

Exercise 3: Plots using ggplot2

Load the NoShow dataset:

library(dplyr)
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
load(file = url("https://www.dropbox.com/s/4oqg79cn1qfnhsh/NoShowdata.RData?dl=1"))
head(NoShowdata)
## # A tibble: 6 × 14
##   PatientId AppointmentID Gender ScheduledDay        AppointmentDay        Age
##       <dbl>         <dbl> <chr>  <dttm>              <dttm>              <dbl>
## 1   2.99e13       5642903 F      2016-04-29 18:38:08 2016-04-29 00:00:00    62
## 2   5.59e14       5642503 M      2016-04-29 16:08:27 2016-04-29 00:00:00    56
## 3   4.26e12       5642549 F      2016-04-29 16:19:04 2016-04-29 00:00:00    62
## 4   8.68e11       5642828 F      2016-04-29 17:29:31 2016-04-29 00:00:00     8
## 5   8.84e12       5642494 F      2016-04-29 16:07:23 2016-04-29 00:00:00    56
## 6   9.60e13       5626772 F      2016-04-27 08:36:51 2016-04-29 00:00:00    76
## # ℹ 8 more variables: Neighbourhood <chr>, Scholarship <dbl>,
## #   Hipertension <dbl>, Diabetes <dbl>, Alcoholism <dbl>, Handcap <dbl>,
## #   SMS_received <dbl>, `No-show` <chr>
df <- NoShowdata
# ---- a) check column names -------------------------------------------------
cat("Columns:\n")
## Columns:
print(colnames(df))
##  [1] "PatientId"      "AppointmentID"  "Gender"         "ScheduledDay"  
##  [5] "AppointmentDay" "Age"            "Neighbourhood"  "Scholarship"   
##  [9] "Hipertension"   "Diabetes"       "Alcoholism"     "Handcap"       
## [13] "SMS_received"   "No-show"
# ---- b) recode Gender ------------------------------------------------------
df$Gender <- factor(df$Gender,
                    levels = c("M","F"),
                    labels = c("Male","Female"))

# ---- c) recode No-show -----------------------------------------------------
df$"No-show" <- factor(df $"No-show",
                       levels = c("No","Yes"),
                       labels = c("Show","Miss"))

# ---- d) recode SMS_received ------------------------------------------------
df$SMS_received <- factor(df$SMS_received,
                          levels = c(0,1),
                          labels = c("No SMS","SMS Sent"))

# ---- e) make Neighbourhood a factor ---------------------------------------
df$Neighbourhood <- as.factor(df$Neighbourhood)

# ---- f) ensure Age is numeric ---------------------------------------------
df$Age <- as.numeric(df$Age)

# ---- g) (optional) create a risk-profile variable -------------------------

Use ggplot2 to generate the following plots:

top10 <- names(sort(table(df$Neighbourhood), decreasing = TRUE))[1:10]
df10  <- df[df$Neighbourhood %in% top10, ]

ggplot(df10, aes(x = reorder(Neighbourhood, Age, median), y = Age)) +
  geom_boxplot(fill = "purple", alpha = 0.8, outlier.alpha = 0.5) +
  coord_flip() +
  labs(title = "Age Distribution by Top 10 Neighbourhoods", x = "", y = "Age (years)") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 9))

ggplot(df, aes(x = Age)) +
  geom_histogram(binwidth = 3, fill = "#33a02c", color = "white") +
  labs(title = "Overall Age Distribution", x = "Age (years)", y = "Count") +
  theme_minimal()

ggplot(df, aes(x = Age, fill = `No-show`)) +
  geom_histogram(binwidth = 3, alpha = 0.8, position = "dodge") +
  facet_wrap(~ `No-show`) +
  scale_fill_manual(values = c("Show" = "#1f78b4", "Miss" = "#e31a1c")) +
  labs(title = "Age by Attendance (Separate Panels)") +
  theme_minimal()

ggplot(df, aes(x = Age, fill = `No-show`)) +
  geom_histogram(binwidth = 3, alpha = 0.8, position = "dodge") +
  facet_grid(Gender ~ `No-show`) +
  scale_fill_manual(values = c("Show" = "#1f78b4", "Miss" = "#e31a1c")) +
  labs(title = "Age by Gender and Attendance", x = "Age (years)", y = "Count") +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))

What insights can you get from these plots? For which goal would you create these plots?

Answer: ## Boxplot of Age by Neighbourhood:

• The median age varies significantly amongst neighborhoods (30–48 years). • The oldest patients are served with JARDIM CAMBURI and ITAPUA (median ≈ 45–48 years). • The youngest are served by SANTA LCIA and JARDIM DA PENHA (median ≈ 30–33 years). • Some clinics treat very old people and infants at the same time, according to outliers.

Overall Histogram of Age

Strongly bimodal distribution: broad adult peak at 30–60 years and large peak at 0–5 years (paediatric care); nearly no patients above 75 years old; this is a primary care/family clinic, not a geriatrics center.

Overlay Histogram (Age by Attendance)

• Patients who don’t show up are noticeably younger; the red (“Miss”) density peaks around age 25. • The peak attendance of show-up patients is approximately 38 years old, with a 13-year gap. • Parents are dependable, and infants (0–10 years old) hardly ever miss.

Faceted Histogram (Age × Gender × Attendance)

• Young adult males (18–35 years old) are the most frequent no-shows in “Miss” bars. • Between the ages of 15 and 45, women miss appointments more frequently. • Males between the ages of 20 and 30 have a >30% no-show rate in multiple bins, indicating a distinct high-risk cluster.

For which goal would you create these plots?

All four plots are the great source of Predictive model development and accurate analysis of data