Packages

Loading required packages:
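
The package-loading chunk is not shown in the source. The list below is an assumption inferred from the functions called later in this document (`read_csv`, `ggplot`, `gg_miss_var`, `write_xlsx`); the original chunk may have loaded a different set, such as `tidyverse`.

```r
# Assumed package list, inferred from the functions used in this document
library(readr)    # read_csv()
library(ggplot2)  # ggplot(), geom_histogram(), geom_boxplot()
library(naniar)   # gg_miss_var()
library(writexl)  # write_xlsx()
```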


Import the Data

data <- read_csv("./Data/Kaggle - Pima Indians Diabetes Database.csv")

Here is what the columns in the data mean:

| Column | Details |
| --- | --- |
| Pregnancies | Number of times pregnant |
| Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-hour serum insulin (mu U/ml) |
| BMI | Body mass index (weight in kg/(height in m)^2) |
| DiabetesPedigreeFunction | Diabetes pedigree function |
| Age | Age (years) |
| Outcome | Class variable (0 = No, 1 = Yes) |

Question 1 (10 marks)

Outcome is a categorical variable: 1 means Yes and 0 means No, so convert it to a factor. Then calculate the mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, maximum, and interquartile range of all the numeric columns in the dataset. Do you see any anomalies? Write your comment.

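# Your code here
# Convert Outcome to a factor. Note: this creates a separate vector; the
# Outcome column inside `data` stays numeric, so it still appears in the
# numeric summary below.
outcome <- factor(data[["Outcome"]])

numeric_cols <- data[sapply(data, is.numeric)]  # Keep only the numeric columns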
stats <- data.frame(
  Mean = sapply(numeric_cols, mean, na.rm = TRUE),
  Std = sapply(numeric_cols, sd, na.rm = TRUE),
  Min = sapply(numeric_cols, min, na.rm = TRUE),
  Q1 = sapply(numeric_cols, function(x) quantile(x, 0.25, na.rm = TRUE)),
  Median = sapply(numeric_cols, median, na.rm = TRUE),
  Q3 = sapply(numeric_cols, function(x) quantile(x, 0.75, na.rm = TRUE)),
  Max = sapply(numeric_cols, max, na.rm = TRUE),
  IQR = sapply(numeric_cols, IQR, na.rm = TRUE)
)

print(stats)
##                                 Mean         Std    Min       Q1   Median
## Pregnancies                3.8450521   3.3695781  0.000  1.00000   3.0000
## Glucose                  120.8945312  31.9726182  0.000 99.00000 117.0000
## BloodPressure             69.1054688  19.3558072  0.000 62.00000  72.0000
## SkinThickness             20.5364583  15.9522176  0.000  0.00000  23.0000
## Insulin                   79.7994792 115.2440024  0.000  0.00000  30.5000
## BMI                       31.9925781   7.8841603  0.000 27.30000  32.0000
## DiabetesPedigreeFunction   0.4718763   0.3313286  0.078  0.24375   0.3725
## Age                       33.2408854  11.7602315 21.000 24.00000  29.0000
## Outcome                    0.3489583   0.4769514  0.000  0.00000   0.0000
##                                 Q3    Max      IQR
## Pregnancies                6.00000  17.00   5.0000
## Glucose                  140.25000 199.00  41.2500
## BloodPressure             80.00000 122.00  18.0000
## SkinThickness             32.00000  99.00  32.0000
## Insulin                  127.25000 846.00 127.2500
## BMI                       36.60000  67.10   9.3000
## DiabetesPedigreeFunction   0.62625   2.42   0.3825
## Age                       41.00000  81.00  17.0000
## Outcome                    1.00000   1.00   1.0000

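Your comment: In the stats data frame, the Glucose, BloodPressure, SkinThickness, Insulin, and BMI columns all have a minimum of 0. A value of 0 is not physiologically plausible for these measurements, so the zeros most likely encode missing data. The maximum of Insulin (846) is far above its 3rd quartile (127.25), which suggests one or more outliers. The 1st quartile of both Insulin and SkinThickness is 0, meaning at least 25% of the rows in each of these columns are zeros and therefore probably missing. Outcome also appears in the summary because it is still stored as a number in the data; its mean (0.349) is simply the share of rows with Outcome = 1.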
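
The zero-minimum observation can be made more concrete by counting zeros per column. The sketch below uses a small made-up data frame (`toy` is hypothetical, not taken from the dataset) so that it runs on its own; on the real data the same idea would be `colSums(numeric_cols == 0)`.

```r
# Hypothetical toy data, only to illustrate the zero-counting idea
toy <- data.frame(
  Glucose = c(148, 0, 183, 89),
  Insulin = c(0, 0, 94, 168),
  Age     = c(50, 31, 32, 21)
)

# Comparing a data frame with 0 gives a logical matrix;
# colSums() counts the TRUE values in each column
zero_counts <- colSums(toy == 0)

# colMeans() on the same matrix gives the share of zeros per column
zero_share <- colMeans(toy == 0)

print(zero_counts)
print(zero_share)
```

A column such as Age with no zeros would not need cleaning, while a column where a large share of rows is zero is a strong candidate for recoding to NA.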

How many people have Outcome = Yes?

# Your code here
table(outcome)
## outcome
##   0   1 
## 500 268
# 268 people have Outcome = 1 (Yes); 500 have Outcome = 0 (No)

Question 2 (40 marks)

Create histograms for all the numeric variables in the data using ggplot2.

# Your code here
for (i in names(numeric_cols)) {
  print(
    # aes_string() is deprecated; the .data pronoun selects a column by its name
    ggplot(numeric_cols, aes(x = .data[[i]])) +
      geom_histogram(color = "black", bins = 10, alpha = 0.6) +
      labs(title = paste("Histogram of", i), x = i, y = "Frequency") +
      theme_minimal()
  )
}

Question 3 (20 marks)

Create boxplots of Glucose, Insulin, BMI, and Age by Outcome using ggplot2.

# Your code here
for (i in c("Glucose", "Insulin", "BMI", "Age")) {
  print(
    # Outcome is stored as a number in numeric_cols, so wrap it in factor()
    # to get one box per class instead of a single box
    ggplot(numeric_cols, aes(x = factor(Outcome), y = .data[[i]])) +
      geom_boxplot() +
      labs(title = paste("Boxplot of", i, "by Outcome"), x = "Outcome", y = i) +
      theme_minimal()
  )
}

Question 4 (10 marks)

Replace 0 with NA in the variables where a value of 0 does not make sense.

# Your code here
# Glucose, Insulin, BMI, BloodPressure, & SkinThickness can't have zero value

numeric_cols$Glucose[numeric_cols$Glucose == 0] <- NA
numeric_cols$Insulin[numeric_cols$Insulin == 0] <- NA
numeric_cols$BMI[numeric_cols$BMI == 0] <- NA
numeric_cols$BloodPressure[numeric_cols$BloodPressure == 0] <- NA
numeric_cols$SkinThickness[numeric_cols$SkinThickness == 0] <- NA
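
The five assignments above repeat one pattern, so they can also be written as a single step over a vector of column names. This is a minimal base R sketch on a made-up data frame (`toy` and `zero_invalid` are hypothetical names introduced here); it is an alternative formulation, not the code used for the results in this document.

```r
# Columns where a value of 0 is not physiologically plausible
zero_invalid <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Hypothetical toy data; Pregnancies is included to show that valid zeros are kept
toy <- data.frame(
  Glucose     = c(0, 148, 85),
  BMI         = c(33.6, 0, 26.6),
  Pregnancies = c(0, 1, 6)
)

# Only touch the columns that are actually present in the data
cols <- intersect(zero_invalid, names(toy))

# which() skips existing NAs, so the replacement is safe to re-run
toy[cols] <- lapply(toy[cols], function(x) replace(x, which(x == 0), NA))

print(toy)
```

Keeping the column names in one vector makes it easier to review which variables are treated as "zero means missing", and leaves columns such as Pregnancies, where 0 is a valid value, unchanged.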

Question 5 (10 marks)

Use the naniar package to inspect the number of missing values in the data after replacing 0s with NAs.

# Your code here
miss_var_summary(numeric_cols)  # number and percentage of NAs per column

gg_miss_var(numeric_cols, show_pct = TRUE) +
  labs(y = "% missing")
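
As a cross-check that does not depend on naniar, the same counts can be computed in base R. The sketch below again uses a made-up data frame (`toy` and `na_summary` are hypothetical names) and is only meant to show the calculation behind the plot.

```r
# Hypothetical toy data with some NAs already in place
toy <- data.frame(
  Glucose = c(148, NA, 183, 89),
  Insulin = c(NA, NA, 94, 168),
  Age     = c(50, 31, 32, 21)
)

# is.na() gives a logical matrix; column sums and means give counts and percentages
na_summary <- data.frame(
  n_miss   = colSums(is.na(toy)),
  pct_miss = 100 * colMeans(is.na(toy))
)

# Sort so the columns with the most missing values come first
na_summary <- na_summary[order(-na_summary$n_miss), ]

print(na_summary)
```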

Question 6 (10 marks)

Export the final cleaned data to an Excel file for later use.

# Your code here
write_xlsx(numeric_cols, "cleaned_data.xlsx")