This report loads the Pima Indians Diabetes dataset, creates a focused and readable subset, and applies basic cleaning steps (e.g., converting physiologically impossible zeros to missing values). The goal is a tidy data frame ready for analysis and reproducible grading.
Source: Kaggle “Pima Indians Diabetes Database” (mirrors the classic UCI dataset).
The data file is located at data/diabetes.csv within this project.
# Load required packages
library(readr)
library(dplyr)
library(tidyr)
# Read from the local data folder inside this project
# Peek at the first few lines in the raw CSV to see where the true header is
readr::read_lines("data/diabetes.csv", n_max = 5)
## [1] "diabetes"
## [2] "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome"
## [3] "6,148,72,35,0,33.6,0.627,50,1"
## [4] "1,85,66,29,0,26.6,0.351,31,0"
## [5] "8,183,64,0,0,23.3,0.672,32,1"
# Load the data, skipping the first line which only says "diabetes"
df <- readr::read_csv("data/diabetes.csv", skip = 1, show_col_types = FALSE)
# Preview to confirm correct column names
glimpse(df)
## Rows: 768
## Columns: 9
## $ Pregnancies <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
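The hard-coded `skip = 1` works for this file because the peek above shows exactly one title line before the header. As a hedged alternative, the sketch below locates the header line programmatically. It assumes the header always begins with `Pregnancies,`. The temporary file and the names `first_lines`, `header_line`, and `df_demo` are illustrative stand-ins, not part of the original report.

```r
library(readr)

# Hypothetical stand-in for data/diabetes.csv: one title line, then the header.
tmp <- tempfile(fileext = ".csv")
write_lines(
  c(
    "diabetes",
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome",
    "6,148,72,35,0,33.6,0.627,50,1",
    "1,85,66,29,0,26.6,0.351,31,0"
  ),
  tmp
)

# Find the first line that starts with the expected first column name.
first_lines <- read_lines(tmp, n_max = 10)
header_line <- which(startsWith(first_lines, "Pregnancies,"))[1]

# Fail early if no header was found in the lines we inspected.
stopifnot(!is.na(header_line))

# Skip every line that precedes the header.
df_demo <- read_csv(tmp, skip = header_line - 1, show_col_types = FALSE)
names(df_demo)
```

With the real file, `tmp` would be replaced by `"data/diabetes.csv"`. The benefit is that the read fails loudly if the file layout changes, rather than silently using a data row as the header.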
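# Set the column names explicitly. This is redundant here because read_csv()
# already picked up the header, but it guards against a misread header line.
names(df) <- c(
  "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
  "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"
)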
glimpse(df)
## Rows: 768
## Columns: 9
## $ Pregnancies <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
# Recode zeros as NA in fields where zero is physiologically impossible,
# and turn the 0/1 outcome into a labeled factor
df_clean <- df %>%
  mutate(
    Glucose = na_if(Glucose, 0),
    BloodPressure = na_if(BloodPressure, 0),
    SkinThickness = na_if(SkinThickness, 0),
    Insulin = na_if(Insulin, 0),
    BMI = na_if(BMI, 0),
    Outcome = factor(Outcome, levels = c(0, 1), labels = c("NoDiabetes", "Diabetes"))
  )
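The five `na_if()` calls repeat the same rule, so the step can also be written once with `across()`. This is a minimal sketch of that variant, not a change to the cleaning above. The `toy` tibble holds made-up values for illustration, and `zero_as_missing` is a hypothetical helper name.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the real dataset.
toy <- tibble(
  Pregnancies   = c(6, 0, 8),
  Glucose       = c(148, 0, 183),
  BloodPressure = c(72, 66, 0),
  SkinThickness = c(35, 29, 0),
  Insulin       = c(0, 0, 94),
  BMI           = c(33.6, 0, 23.3)
)

# Columns where a zero cannot be a real measurement.
zero_as_missing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Apply the same zero-to-NA rule to every listed column.
toy_clean <- toy %>%
  mutate(across(all_of(zero_as_missing), ~ na_if(.x, 0)))

toy_clean
```

`Pregnancies` is deliberately left out of the list, because zero pregnancies is a valid value and should stay as 0.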
# Focused subset for quick checks
df_subset <- df_clean %>%
  select(Pregnancies, Glucose, BloodPressure, BMI, Age, Outcome)
glimpse(df_subset)
## Rows: 768
## Columns: 6
## $ Pregnancies <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7,…
## $ Glucose <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168,…
## $ BloodPressure <dbl> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74, 80, 60, …
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, NA…
## $ Age <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, …
## $ Outcome <fct> Diabetes, NoDiabetes, Diabetes, NoDiabetes, Diabetes, No…
# Missingness in key clinical fields
df_clean %>%
  summarize(
    n = n(),
    miss_glucose = sum(is.na(Glucose)),
    miss_bp = sum(is.na(BloodPressure)),
    miss_skin = sum(is.na(SkinThickness)),
    miss_insulin = sum(is.na(Insulin)),
    miss_bmi = sum(is.na(BMI))
  )
## # A tibble: 1 × 6
##       n miss_glucose miss_bp miss_skin miss_insulin miss_bmi
##   <int>        <int>   <int>     <int>        <int>    <int>
## 1   768            5      35       227          374       11
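Raw counts are easier to compare as percentages of the 768 rows. The sketch below takes the counts reported in the output above, reshapes them with `tidyr::pivot_longer()`, and computes the share of missing values per field. The names `miss_counts` and `miss_pct` are illustrative. In the real workflow the one-row summary would be piped in directly instead of retyped.

```r
library(dplyr)
library(tidyr)

# Counts copied from the missingness summary in this report.
miss_counts <- tibble(
  n = 768L,
  miss_glucose = 5L,
  miss_bp = 35L,
  miss_skin = 227L,
  miss_insulin = 374L,
  miss_bmi = 11L
)

# One row per field, with the percentage of rows that are missing.
miss_pct <- miss_counts %>%
  pivot_longer(starts_with("miss_"), names_to = "field", values_to = "n_missing") %>%
  mutate(pct_missing = round(100 * n_missing / n, 1)) %>%
  arrange(desc(pct_missing))

miss_pct
```

Insulin is missing for roughly 49% of rows and skin thickness for roughly 30%, while glucose, blood pressure, and BMI are each missing for under 5%. That gap is why the focused subset can reasonably drop the two sparse fields.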
# Class balance
df_clean %>% count(Outcome)
## # A tibble: 2 × 2
##   Outcome        n
##   <fct>      <int>
## 1 NoDiabetes   500
## 2 Diabetes     268
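Class balance is often easier to read as proportions. This short sketch starts from the two counts shown above. The name `class_counts` is illustrative, and with the real data the same `mutate()` could follow `df_clean %>% count(Outcome)` directly.

```r
library(dplyr)

# Counts copied from the class balance output in this report.
class_counts <- tibble(
  Outcome = c("NoDiabetes", "Diabetes"),
  n = c(500L, 268L)
)

# Share of each class among all rows.
class_props <- class_counts %>%
  mutate(prop = n / sum(n))

class_props
```

About 65% of rows are `NoDiabetes` and about 35% are `Diabetes`, so a model that always predicts the majority class would already reach about 65% accuracy. That is a useful baseline to keep in mind for later modeling.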
# Simple numeric summaries (ignoring NAs)
df_subset %>%
  summarize(
    mean_glucose = mean(Glucose, na.rm = TRUE),
    mean_bmi = mean(BMI, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##   mean_glucose mean_bmi mean_age
##          <dbl>    <dbl>    <dbl>
## 1         122.     32.5     33.2
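Overall means hide any difference between the two outcome groups. A possible extension, sketched here on made-up values rather than the real data, groups by `Outcome` and also reports how many non-missing values each mean is based on. The `toy` tibble and the column `n_glucose` are assumptions for illustration.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the real dataset.
toy <- tibble(
  Outcome = factor(
    c("NoDiabetes", "Diabetes", "NoDiabetes", "Diabetes"),
    levels = c("NoDiabetes", "Diabetes")
  ),
  Glucose = c(85, 148, NA, 183),
  BMI = c(26.6, 33.6, 28.1, NA)
)

# Per-group means, plus the number of non-missing glucose values behind them.
by_outcome <- toy %>%
  group_by(Outcome) %>%
  summarize(
    n = n(),
    n_glucose = sum(!is.na(Glucose)),
    mean_glucose = mean(Glucose, na.rm = TRUE),
    mean_bmi = mean(BMI, na.rm = TRUE),
    .groups = "drop"
  )

by_outcome
```

With the real data, `toy` would be replaced by `df_subset`. Reporting the non-missing count next to each mean makes it clear when `na.rm = TRUE` has dropped rows.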
(Write 3–5 sentences here in your voice.)
- Confirm that the dataset is clean and readable (headers set; zeros handled; outcome labeled).
- Give one or two takeaways from your quick checks (e.g., any missingness or class balance notes).
- Next steps: impute missing values; create BMI/age bands; try a simple logistic regression; train/test split.
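The next steps listed above could be sketched roughly as follows. This is only one possible outline under stated assumptions: the data are simulated (not the real dataset), the seed, the 80/20 split, the age band cut points, and median imputation are arbitrary illustrative choices, and the names `sim`, `prep`, `fit`, and `accuracy` are hypothetical.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the simulated data and split are reproducible

# Simulated stand-in for df_subset; the relationship below is made up.
n_rows <- 200
sim <- tibble(
  Glucose = rnorm(n_rows, mean = 120, sd = 30),
  BMI = rnorm(n_rows, mean = 32, sd = 7),
  Age = sample(21:81, n_rows, replace = TRUE)
) %>%
  mutate(
    Outcome = factor(
      rbinom(n_rows, 1, plogis((Glucose - 120) / 30)),
      levels = c(0, 1), labels = c("NoDiabetes", "Diabetes")
    ),
    BMI = replace(BMI, sample(n_rows, 10), NA)  # mimic missing BMI values
  )

# Train/test split (80/20), done before imputation to avoid leaking test data.
train_idx <- sample(n_rows, size = floor(0.8 * n_rows))
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Impute BMI with the training median and add age bands to both sets.
bmi_median <- median(train$BMI, na.rm = TRUE)
prep <- function(d) {
  d %>%
    mutate(
      BMI = coalesce(BMI, bmi_median),
      age_band = cut(Age, breaks = c(20, 30, 40, 50, Inf),
                     labels = c("21-30", "31-40", "41-50", "51+"))
    )
}
train <- prep(train)
test <- prep(test)

# Simple logistic regression; glm() models the second factor level ("Diabetes").
fit <- glm(Outcome ~ Glucose + BMI + Age, data = train, family = binomial)

# Accuracy on the held-out rows at a 0.5 probability threshold.
pred <- predict(fit, newdata = test, type = "response")
accuracy <- mean((pred > 0.5) == (test$Outcome == "Diabetes"))
accuracy
```

Computing the imputation value from the training rows only keeps information from the test set out of the model. The `age_band` column is created but not used in the model here. It could replace `Age` in the formula if a banded effect is preferred.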