Download this R Markdown file, save it on your computer, and perform all the below tasks by inserting your answer in text or by inserting R chunks below. After you are done, upload this file with your solutions on Moodle.
Load the Pima diabetes dataset:
load(file = url("https://www.dropbox.com/s/hxn1c7yw9r8aw8v/Pima_diabetes.RData?dl=1"))
df <- Pima_diabetes # Rename for clarity
str(df)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Which variables are measured on a nominal level?
# Unique values per variable
unique_counts <- sapply(df, function(x) length(unique(na.omit(x))))
unique_counts
## Pregnancies Glucose BloodPressure
## 17 136 47
## SkinThickness Insulin BMI
## 51 186 248
## DiabetesPedigreeFunction Age Outcome
## 517 52 2
# Class of each variable
sapply(df, class)
## Pregnancies Glucose BloodPressure
## "integer" "integer" "integer"
## SkinThickness Insulin BMI
## "integer" "integer" "numeric"
## DiabetesPedigreeFunction Age Outcome
## "numeric" "integer" "integer"
Now compute frequency tables, barplots, and mosaic plots of all nominal variables in the dataset. Frequency Table
df$Outcome <- factor(df$Outcome,
levels = c(0, 1),
labels = c("No Diabetes", "Diabetes"))
freq_table <- table(df$Outcome)
freq_table
##
## No Diabetes Diabetes
## 500 268
Barplot
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
ggplot(df, aes(x = Outcome, fill = Outcome)) +
geom_bar(color = "black", alpha = 0.8) +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5, size = 4) +
scale_fill_manual(values = c("red", "blue")) +
labs(title = "Distribution of Diabetes Outcome (Nominal Variable)",
x = "Diabetes Status", y = "Frequency") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
Mosaic Plot
df_temp <- df
df_temp$Outcome <- as.numeric(df$Outcome) - 1
dummy <- rep(1, nrow(df))
# Use factor version of Outcome (with labels)
spineplot(df$Outcome ~ dummy,
main = "Spineplot: Diabetes Outcome Distribution",
xlab = "Diabetes Status",
ylab = "Proportion",
col = c("#1b9e77", "#d95f02"),
border = "white",
lwd = 2)
text(0.3, 0.7, paste0("No Diabetes: ", sum(df$Outcome == "No Diabetes")), col = "white", font = 2)
text(0.7, 0.3, paste0("Diabetes: ", sum(df$Outcome == "Diabetes")), col = "white", font = 2)
Next, create a variable which describes whether a woman had more or less than 4 pregnancies. Then, use this variable to create a 2x2 table with diabetes outcome. Do you see an indication of whether the number of pregnancies is associated with diabetes prevalence? Do you think your investigation is a good way to investigate this?
Exploring Pregnancies distribution
ggplot(df, aes(x = Pregnancies)) +
geom_histogram(binwidth = 1, fill = "#7570b3", color = "black", alpha = 0.8) +
geom_vline(xintercept = 4, linetype = "dashed", color = "red", size = 1) +
labs(title = "Distribution of Number of Pregnancies",
x = "Number of Pregnancies", y = "Count") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
creating new variable
df$ParityGroup <- cut(df$Pregnancies,
breaks = c(-1, 4, Inf),
labels = c("Low-to-Moderate (≤4)", "High Parity (>4)"),
right = TRUE)
df$ParityGroup <- factor(df$ParityGroup)
now calculating risk ratios
tab <- table(df$ParityGroup, df$Outcome)
colnames(tab) <- c("No Diabetes", "Diabetes")
tab
##
## No Diabetes Diabetes
## Low-to-Moderate (≤4) 356 136
## High Parity (>4) 144 132
# Row percentages (prevalence)
prev <- prop.table(tab, margin = 1) * 100
round(prev, 1)
##
## No Diabetes Diabetes
## Low-to-Moderate (≤4) 72.4 27.6
## High Parity (>4) 52.2 47.8
# Risk Ratio
rr <- prev[2, 2] / prev[1, 2]
rr
## [1] 1.730179
fisher.test(tab)
##
## Fisher's Exact Test for Count Data
##
## data: tab
## p-value = 2.835e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.742026 3.302776
## sample estimates:
## odds ratio
## 2.396616
Indeed, there is compelling evidence that the prevalence of diabetes is 71% greater among women who have had more than four pregnancies. This inquiry is worthwhile because: makes use of a clinically significant cutoff reports the actionable risk ratio. uses a precise test
Use any dataset (a dataset that you have worked with in the past, or that you are currently working with, a dataset that is available on Blackboard, in R or that you have downloaded from the internet), and generate a table with descriptive statistics of the main variables of interest.
I am using a dataset downloaded from the kaggle for this question “Car Price Analysis Dataset”
suppressPackageStartupMessages({
library(dplyr)
library(kableExtra)
library(ggplot2)
})
car_data <- read.csv("car_price_prediction_.csv")
# Data cleaning
car_data_clean <- car_data %>%
mutate(
Year = as.numeric(Year),
Price = as.numeric(Price),
Mileage = as.numeric(Mileage),
Condition = factor(Condition),
"Fuel Type" = factor("Fuel Type"),
Transmission = factor(Transmission)
) %>%
filter(!is.na(Price), !is.na(Year), !is.na(Mileage))
cat("Clean dataset dimensions:", dim(car_data_clean))
## Clean dataset dimensions: 2500 11
Descriptive Statistics of Main Variables
main_vars_summary <- car_data_clean %>%
summarise(
Variable = c("Price ($)", "Year", "Mileage", "Engine Size"),
N = c(
sum(!is.na(Price)),
sum(!is.na(Year)),
sum(!is.na(Mileage)),
sum(!is.na(Engine.Size))
),
Mean = c(
mean(Price, na.rm = TRUE),
mean(Year, na.rm = TRUE),
mean(Mileage, na.rm = TRUE),
mean(Engine.Size, na.rm = TRUE)
),
SD = c(
sd(Price, na.rm = TRUE),
sd(Year, na.rm = TRUE),
sd(Mileage, na.rm = TRUE),
sd(Engine.Size, na.rm = TRUE)
),
Median = c(
median(Price, na.rm = TRUE),
median(Year, na.rm = TRUE),
median(Mileage, na.rm = TRUE),
median(Engine.Size, na.rm = TRUE)
),
Min = c(
min(Price, na.rm = TRUE),
min(Year, na.rm = TRUE),
min(Mileage, na.rm = TRUE),
min(Engine.Size, na.rm = TRUE)
),
Max = c(
max(Price, na.rm = TRUE),
max(Year, na.rm = TRUE),
max(Mileage, na.rm = TRUE),
max(Engine.Size, na.rm = TRUE)
)
) %>%
mutate(across(where(is.numeric), round, 2))
kable(main_vars_summary,
caption = "Descriptive Statistics of Main Variables",
col.names = c("Variable", "N", "Mean", "Std. Dev.", "Median", "Min", "Max")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Variable | N | Mean | Std. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Price ($) | 2500 | 52638.02 | 27295.83 | 53485.24 | 5011.27 | 99982.59 |
| Year | 2500 | 2011.63 | 6.99 | 2012.00 | 2000.00 | 2023.00 |
| Mileage | 2500 | 149749.84 | 87919.95 | 149085.00 | 15.00 | 299967.00 |
| Engine Size | 2500 | 3.47 | 1.43 | 3.40 | 1.00 | 6.00 |
few more analytical steps
Price Analysis by Categories
price_by_category <- car_data_clean %>%
group_by("Fuel Type", Transmission) %>%
summarise(
Count = n(),
Mean_Price = mean(Price),
Median_Price = median(Price),
SD_Price = sd(Price),
.groups = 'drop'
) %>%
mutate(across(where(is.numeric), round, 2))
kable(price_by_category,
caption = "Price Statistics by Fuel Type and Transmission") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| “Fuel Type” | Transmission | Count | Mean_Price | Median_Price | SD_Price |
|---|---|---|---|---|---|
| Fuel Type | Automatic | 1192 | 52691.68 | 53401.52 | 27039.96 |
| Fuel Type | Manual | 1308 | 52589.12 | 53506.26 | 27537.19 |
Top 10 Brands Analysis
top_brands <- car_data_clean %>%
count(Brand, sort = TRUE) %>%
head(10) %>%
pull(Brand)
brand_analysis <- car_data_clean %>%
filter(Brand %in% top_brands) %>%
group_by(Brand) %>%
summarise(
Count = n(),
Avg_Price = mean(Price),
Avg_Year = mean(Year),
Avg_Mileage = mean(Mileage),
.groups = 'drop'
) %>%
mutate(across(where(is.numeric), round, 2))
kable(brand_analysis, caption = "Top 10 Brands Analysis") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Brand | Count | Avg_Price | Avg_Year | Avg_Mileage |
|---|---|---|---|---|
| Audi | 368 | 51953.42 | 2011.19 | 149981.4 |
| BMW | 358 | 54157.11 | 2011.60 | 150967.1 |
| Ford | 347 | 51593.25 | 2011.55 | 154083.7 |
| Honda | 352 | 52050.28 | 2012.04 | 150478.8 |
| Mercedes | 353 | 53191.09 | 2011.92 | 144038.1 |
| Tesla | 348 | 53475.55 | 2011.17 | 150946.0 |
| Toyota | 374 | 52078.73 | 2011.91 | 147927.9 |
Load the NoShow dataset:
library(dplyr)
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
load(file = url("https://www.dropbox.com/s/4oqg79cn1qfnhsh/NoShowdata.RData?dl=1"))
head(NoShowdata)
## # A tibble: 6 × 14
## PatientId AppointmentID Gender ScheduledDay AppointmentDay Age
## <dbl> <dbl> <chr> <dttm> <dttm> <dbl>
## 1 2.99e13 5642903 F 2016-04-29 18:38:08 2016-04-29 00:00:00 62
## 2 5.59e14 5642503 M 2016-04-29 16:08:27 2016-04-29 00:00:00 56
## 3 4.26e12 5642549 F 2016-04-29 16:19:04 2016-04-29 00:00:00 62
## 4 8.68e11 5642828 F 2016-04-29 17:29:31 2016-04-29 00:00:00 8
## 5 8.84e12 5642494 F 2016-04-29 16:07:23 2016-04-29 00:00:00 56
## 6 9.60e13 5626772 F 2016-04-27 08:36:51 2016-04-29 00:00:00 76
## # ℹ 8 more variables: Neighbourhood <chr>, Scholarship <dbl>,
## # Hipertension <dbl>, Diabetes <dbl>, Alcoholism <dbl>, Handcap <dbl>,
## # SMS_received <dbl>, `No-show` <chr>
df <- NoShowdata
# ---- a) check column names -------------------------------------------------
cat("Columns:\n")
## Columns:
print(colnames(df))
## [1] "PatientId" "AppointmentID" "Gender" "ScheduledDay"
## [5] "AppointmentDay" "Age" "Neighbourhood" "Scholarship"
## [9] "Hipertension" "Diabetes" "Alcoholism" "Handcap"
## [13] "SMS_received" "No-show"
# ---- b) recode Gender ------------------------------------------------------
df$Gender <- factor(df$Gender,
levels = c("M","F"),
labels = c("Male","Female"))
# ---- c) recode No-show -----------------------------------------------------
df$"No-show" <- factor(df $"No-show",
levels = c("No","Yes"),
labels = c("Show","Miss"))
# ---- d) recode SMS_received ------------------------------------------------
df$SMS_received <- factor(df$SMS_received,
levels = c(0,1),
labels = c("No SMS","SMS Sent"))
# ---- e) make Neighbourhood a factor ---------------------------------------
df$Neighbourhood <- as.factor(df$Neighbourhood)
# ---- f) ensure Age is numeric ---------------------------------------------
df$Age <- as.numeric(df$Age)
# ---- g) (optional) create a risk-profile variable -------------------------
Use ggplot2 to generate the following plots:
top10 <- names(sort(table(df$Neighbourhood), decreasing = TRUE))[1:10]
df10 <- df[df$Neighbourhood %in% top10, ]
ggplot(df10, aes(x = reorder(Neighbourhood, Age, median), y = Age)) +
geom_boxplot(fill = "purple", alpha = 0.8, outlier.alpha = 0.5) +
coord_flip() +
labs(title = "Age Distribution by Top 10 Neighbourhoods", x = "", y = "Age (years)") +
theme_minimal(base_size = 11) +
theme(axis.text.y = element_text(size = 9))
ggplot(df, aes(x = Age)) +
geom_histogram(binwidth = 3, fill = "#33a02c", color = "white") +
labs(title = "Overall Age Distribution", x = "Age (years)", y = "Count") +
theme_minimal()
ggplot(df, aes(x = Age, fill = `No-show`)) +
geom_histogram(binwidth = 3, alpha = 0.8, position = "dodge") +
facet_wrap(~ `No-show`) +
scale_fill_manual(values = c("Show" = "#1f78b4", "Miss" = "#e31a1c")) +
labs(title = "Age by Attendance (Separate Panels)") +
theme_minimal()
ggplot(df, aes(x = Age, fill = `No-show`)) +
geom_histogram(binwidth = 3, alpha = 0.8, position = "dodge") +
facet_grid(Gender ~ `No-show`) +
scale_fill_manual(values = c("Show" = "#1f78b4", "Miss" = "#e31a1c")) +
labs(title = "Age by Gender and Attendance", x = "Age (years)", y = "Count") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))
What insights can you get from these plots? For which goal would you create these plots?
Answer: ## Boxplot of Age by Neighbourhood:
• The median age varies significantly amongst neighborhoods (30–48 years). • The oldest patients are served with JARDIM CAMBURI and ITAPUA (median ≈ 45–48 years). • The youngest are served by SANTA LCIA and JARDIM DA PENHA (median ≈ 30–33 years). • Some clinics treat very old people and infants at the same time, according to outliers.
Strongly bimodal distribution: broad adult peak at 30–60 years and large peak at 0–5 years (paediatric care); nearly no patients above 75 years old; this is a primary care/family clinic, not a geriatrics center.
• Patients who don’t show up are noticeably younger; the red (“Miss”) density peaks around age 25. • The peak attendance of show-up patients is approximately 38 years old, with a 13-year gap. • Parents are dependable, and infants (0–10 years old) hardly ever miss.
• Young adult males (18–35 years old) are the most frequent no-shows in “Miss” bars. • Between the ages of 15 and 45, women miss appointments more frequently. • Males between the ages of 20 and 30 have a >30% no-show rate in multiple bins, indicating a distinct high-risk cluster.
All four plots are the great source of Predictive model development and accurate analysis of data