Homework 3 Solution

Download this R Markdown file, save it on your computer, and perform all the below tasks by inserting your answer in text or by inserting R chunks below. After you are done, upload this file with your solutions on Moodle.

Exercise 1: Compute frequencies in the Pima diabetes dataset

Load the Pima diabetes dataset:

load(file = url("https://www.dropbox.com/s/hxn1c7yw9r8aw8v/Pima_diabetes.RData?dl=1"))
df <- Pima_diabetes  # Rename for clarity
str(df)

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Which variables are measured on a nominal level?

# Unique values per variable
unique_counts <- sapply(df, function(x) length(unique(na.omit(x))))
unique_counts

##              Pregnancies                  Glucose            BloodPressure 
##                       17                      136                       47 
##            SkinThickness                  Insulin                      BMI 
##                       51                      186                      248 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                      517                       52                        2

# Class of each variable
sapply(df, class)

##              Pregnancies                  Glucose            BloodPressure 
##                "integer"                "integer"                "integer" 
##            SkinThickness                  Insulin                      BMI 
##                "integer"                "integer"                "numeric" 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                "numeric"                "integer"                "integer"

Now compute frequency tables, barplots, and mosaic plots of all nominal variables in the dataset. Frequency Table

df$Outcome <- factor(df$Outcome, 
                     levels = c(0, 1), 
                     labels = c("No Diabetes", "Diabetes"))

freq_table <- table(df$Outcome)
freq_table

## 
## No Diabetes    Diabetes 
##         500         268

Barplot

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.2

ggplot(df, aes(x = Outcome, fill = Outcome)) +
  geom_bar(color = "black", alpha = 0.8) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("red", "blue")) +
  labs(title = "Distribution of Diabetes Outcome (Nominal Variable)",
       x = "Diabetes Status", y = "Frequency") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Mosaic Plot

df_temp <- df
df_temp$Outcome <- as.numeric(df$Outcome) - 1

dummy <- rep(1, nrow(df))

# Use factor version of Outcome (with labels)
spineplot(df$Outcome ~ dummy,
          main = "Spineplot: Diabetes Outcome Distribution",
          xlab = "Diabetes Status",
          ylab = "Proportion",
          col = c("#1b9e77", "#d95f02"),
          border = "white",
          lwd = 2)
text(0.3, 0.7, paste0("No Diabetes: ", sum(df$Outcome == "No Diabetes")), col = "white", font = 2)
text(0.7, 0.3, paste0("Diabetes: ", sum(df$Outcome == "Diabetes")), col = "white", font = 2)

Next, create a variable which describes whether a woman had more or less than 4 pregnancies. Then, use this variable to create a 2x2 table with diabetes outcome. Do you see an indication of whether the number of pregnancies is associated with diabetes prevalence? Do you think your investigation is a good way to investigate this?

Exploring Pregnancies distribution

ggplot(df, aes(x = Pregnancies)) +
  geom_histogram(binwidth = 1, fill = "#7570b3", color = "black", alpha = 0.8) +
  geom_vline(xintercept = 4, linetype = "dashed", color = "red", size = 1) +
  labs(title = "Distribution of Number of Pregnancies",
       x = "Number of Pregnancies", y = "Count") +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

creating new variable

df$ParityGroup <- cut(df$Pregnancies, 
                      breaks = c(-1, 4, Inf), 
                      labels = c("Low-to-Moderate (≤4)", "High Parity (>4)"),
                      right = TRUE)
df$ParityGroup <- factor(df$ParityGroup)

now calculating risk ratios

tab <- table(df$ParityGroup, df$Outcome)
colnames(tab) <- c("No Diabetes", "Diabetes")
tab

##                       
##                        No Diabetes Diabetes
##   Low-to-Moderate (≤4)         356      136
##   High Parity (>4)             144      132

# Row percentages (prevalence)
prev <- prop.table(tab, margin = 1) * 100
round(prev, 1)

##                       
##                        No Diabetes Diabetes
##   Low-to-Moderate (≤4)        72.4     27.6
##   High Parity (>4)            52.2     47.8

# Risk Ratio
rr <- prev[2, 2] / prev[1, 2]
rr

## [1] 1.730179

fisher.test(tab)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 2.835e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.742026 3.302776
## sample estimates:
## odds ratio 
##   2.396616

Indeed, there is compelling evidence that the prevalence of diabetes is 71% greater among women who have had more than four pregnancies. This inquiry is worthwhile because: makes use of a clinically significant cutoff reports the actionable risk ratio. uses a precise test

Exercise 2: Generate a table with descriptive statistics (optional, but recommended)

Use any dataset (a dataset that you have worked with in the past, or that you are currently working with, a dataset that is available on Blackboard, in R or that you have downloaded from the internet), and generate a table with descriptive statistics of the main variables of interest.

I am using a dataset downloaded from the kaggle for this question “Car Price Analysis Dataset”

suppressPackageStartupMessages({
library(dplyr)
library(kableExtra)
library(ggplot2)

  })
car_data <- read.csv("car_price_prediction_.csv")
# Data cleaning
car_data_clean <- car_data %>%
  mutate(
    Year = as.numeric(Year),
    Price = as.numeric(Price),
    Mileage = as.numeric(Mileage),
    Condition = factor(Condition),
    "Fuel Type" = factor("Fuel Type"),
    Transmission = factor(Transmission)
  ) %>%
  filter(!is.na(Price), !is.na(Year), !is.na(Mileage))

cat("Clean dataset dimensions:", dim(car_data_clean))

## Clean dataset dimensions: 2500 11

Descriptive Statistics of Main Variables

main_vars_summary <- car_data_clean %>%
  summarise(
    Variable = c("Price ($)", "Year", "Mileage", "Engine Size"),
    N = c(
      sum(!is.na(Price)),
      sum(!is.na(Year)),
      sum(!is.na(Mileage)),
      sum(!is.na(Engine.Size))
    ),
    Mean = c(
      mean(Price, na.rm = TRUE),
      mean(Year, na.rm = TRUE),
      mean(Mileage, na.rm = TRUE),
      mean(Engine.Size, na.rm = TRUE)
    ),
    SD = c(
      sd(Price, na.rm = TRUE),
      sd(Year, na.rm = TRUE),
      sd(Mileage, na.rm = TRUE),
      sd(Engine.Size, na.rm = TRUE)
    ),
    Median = c(
      median(Price, na.rm = TRUE),
      median(Year, na.rm = TRUE),
      median(Mileage, na.rm = TRUE),
      median(Engine.Size, na.rm = TRUE)
    ),
    Min = c(
      min(Price, na.rm = TRUE),
      min(Year, na.rm = TRUE),
      min(Mileage, na.rm = TRUE),
      min(Engine.Size, na.rm = TRUE)
    ),
    Max = c(
      max(Price, na.rm = TRUE),
      max(Year, na.rm = TRUE),
      max(Mileage, na.rm = TRUE),
      max(Engine.Size, na.rm = TRUE)
    )
  ) %>%
  mutate(across(where(is.numeric), round, 2))

kable(main_vars_summary, 
      caption = "Descriptive Statistics of Main Variables",
      col.names = c("Variable", "N", "Mean", "Std. Dev.", "Median", "Min", "Max")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Descriptive Statistics of Main Variables
Variable	N	Mean	Std. Dev.	Median	Min	Max
Price ($)	2500	52638.02	27295.83	53485.24	5011.27	99982.59
Year	2500	2011.63	6.99	2012.00	2000.00	2023.00
Mileage	2500	149749.84	87919.95	149085.00	15.00	299967.00
Engine Size	2500	3.47	1.43	3.40	1.00	6.00

few more analytical steps

Price Analysis by Categories

price_by_category <- car_data_clean %>%
  group_by("Fuel Type", Transmission) %>%
  summarise(
    Count = n(),
    Mean_Price = mean(Price),
    Median_Price = median(Price),
    SD_Price = sd(Price),
    .groups = 'drop'
  ) %>%
  mutate(across(where(is.numeric), round, 2))

kable(price_by_category, 
      caption = "Price Statistics by Fuel Type and Transmission") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Price Statistics by Fuel Type and Transmission
“Fuel Type”	Transmission	Count	Mean_Price	Median_Price	SD_Price
Fuel Type	Automatic	1192	52691.68	53401.52	27039.96
Fuel Type	Manual	1308	52589.12	53506.26	27537.19

Top 10 Brands Analysis

top_brands <- car_data_clean %>%
  count(Brand, sort = TRUE) %>%
  head(10) %>%
  pull(Brand)

brand_analysis <- car_data_clean %>%
  filter(Brand %in% top_brands) %>%
  group_by(Brand) %>%
  summarise(
    Count = n(),
    Avg_Price = mean(Price),
    Avg_Year = mean(Year),
    Avg_Mileage = mean(Mileage),
    .groups = 'drop'
  ) %>%
  mutate(across(where(is.numeric), round, 2))

kable(brand_analysis, caption = "Top 10 Brands Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Top 10 Brands Analysis
Brand	Count	Avg_Price	Avg_Year	Avg_Mileage
Audi	368	51953.42	2011.19	149981.4
BMW	358	54157.11	2011.60	150967.1
Ford	347	51593.25	2011.55	154083.7
Honda	352	52050.28	2012.04	150478.8
Mercedes	353	53191.09	2011.92	144038.1
Tesla	348	53475.55	2011.17	150946.0
Toyota	374	52078.73	2011.91	147927.9

Exercise 3: Plots using ggplot2

Load the NoShow dataset:

library(dplyr)
library(ggplot2)
library(plotly)

## Warning: package 'plotly' was built under R version 4.5.2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

load(file = url("https://www.dropbox.com/s/4oqg79cn1qfnhsh/NoShowdata.RData?dl=1"))
head(NoShowdata)

## # A tibble: 6 × 14
##   PatientId AppointmentID Gender ScheduledDay        AppointmentDay        Age
##       <dbl>         <dbl> <chr>  <dttm>              <dttm>              <dbl>
## 1   2.99e13       5642903 F      2016-04-29 18:38:08 2016-04-29 00:00:00    62
## 2   5.59e14       5642503 M      2016-04-29 16:08:27 2016-04-29 00:00:00    56
## 3   4.26e12       5642549 F      2016-04-29 16:19:04 2016-04-29 00:00:00    62
## 4   8.68e11       5642828 F      2016-04-29 17:29:31 2016-04-29 00:00:00     8
## 5   8.84e12       5642494 F      2016-04-29 16:07:23 2016-04-29 00:00:00    56
## 6   9.60e13       5626772 F      2016-04-27 08:36:51 2016-04-29 00:00:00    76
## # ℹ 8 more variables: Neighbourhood <chr>, Scholarship <dbl>,
## #   Hipertension <dbl>, Diabetes <dbl>, Alcoholism <dbl>, Handcap <dbl>,
## #   SMS_received <dbl>, `No-show` <chr>

df <- NoShowdata

# ---- a) check column names -------------------------------------------------
cat("Columns:\n")

## Columns:

print(colnames(df))

##  [1] "PatientId"      "AppointmentID"  "Gender"         "ScheduledDay"  
##  [5] "AppointmentDay" "Age"            "Neighbourhood"  "Scholarship"   
##  [9] "Hipertension"   "Diabetes"       "Alcoholism"     "Handcap"       
## [13] "SMS_received"   "No-show"

# ---- b) recode Gender ------------------------------------------------------
df$Gender <- factor(df$Gender,
                    levels = c("M","F"),
                    labels = c("Male","Female"))

# ---- c) recode No-show -----------------------------------------------------
df$"No-show" <- factor(df $"No-show",
                       levels = c("No","Yes"),
                       labels = c("Show","Miss"))

# ---- d) recode SMS_received ------------------------------------------------
df$SMS_received <- factor(df$SMS_received,
                          levels = c(0,1),
                          labels = c("No SMS","SMS Sent"))

# ---- e) make Neighbourhood a factor ---------------------------------------
df$Neighbourhood <- as.factor(df$Neighbourhood)

# ---- f) ensure Age is numeric ---------------------------------------------
df$Age <- as.numeric(df$Age)

# ---- g) (optional) create a risk-profile variable -------------------------

Use ggplot2 to generate the following plots:

Create a boxplots of Age (stratified) by neighborhood.

top10 <- names(sort(table(df$Neighbourhood), decreasing = TRUE))[1:10]
df10  <- df[df$Neighbourhood %in% top10, ]

ggplot(df10, aes(x = reorder(Neighbourhood, Age, median), y = Age)) +
  geom_boxplot(fill = "purple", alpha = 0.8, outlier.alpha = 0.5) +
  coord_flip() +
  labs(title = "Age Distribution by Top 10 Neighbourhoods", x = "", y = "Age (years)") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 9))

Create a histogram of Age.

ggplot(df, aes(x = Age)) +
  geom_histogram(binwidth = 3, fill = "#33a02c", color = "white") +
  labs(title = "Overall Age Distribution", x = "Age (years)", y = "Count") +
  theme_minimal()

Create a histogram of Age, stratified by whether the person showed up - in one panel using the or in multiple panels.

ggplot(df, aes(x = Age, fill = `No-show`)) +
  geom_histogram(binwidth = 3, alpha = 0.8, position = "dodge") +
  facet_wrap(~ `No-show`) +
  scale_fill_manual(values = c("Show" = "#1f78b4", "Miss" = "#e31a1c")) +
  labs(title = "Age by Attendance (Separate Panels)") +
  theme_minimal()

Stratify this plot further by gender.

ggplot(df, aes(x = Age, fill = `No-show`)) +
  geom_histogram(binwidth = 3, alpha = 0.8, position = "dodge") +
  facet_grid(Gender ~ `No-show`) +
  scale_fill_manual(values = c("Show" = "#1f78b4", "Miss" = "#e31a1c")) +
  labs(title = "Age by Gender and Attendance", x = "Age (years)", y = "Count") +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))

What insights can you get from these plots? For which goal would you create these plots?

Answer: ## Boxplot of Age by Neighbourhood:

• The median age varies significantly amongst neighborhoods (30–48 years). • The oldest patients are served with JARDIM CAMBURI and ITAPUA (median ≈ 45–48 years). • The youngest are served by SANTA LCIA and JARDIM DA PENHA (median ≈ 30–33 years). • Some clinics treat very old people and infants at the same time, according to outliers.

Overall Histogram of Age

Strongly bimodal distribution: broad adult peak at 30–60 years and large peak at 0–5 years (paediatric care); nearly no patients above 75 years old; this is a primary care/family clinic, not a geriatrics center.

Overlay Histogram (Age by Attendance)

• Patients who don’t show up are noticeably younger; the red (“Miss”) density peaks around age 25. • The peak attendance of show-up patients is approximately 38 years old, with a 13-year gap. • Parents are dependable, and infants (0–10 years old) hardly ever miss.

Faceted Histogram (Age × Gender × Attendance)

• Young adult males (18–35 years old) are the most frequent no-shows in “Miss” bars. • Between the ages of 15 and 45, women miss appointments more frequently. • Males between the ages of 20 and 30 have a >30% no-show rate in multiple bins, indicating a distinct high-risk cluster.

For which goal would you create these plots?

All four plots are the great source of Predictive model development and accurate analysis of data