Loading required packages:
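library(readr)    # read_csv()
library(ggplot2)  # histograms and boxplots
library(naniar)   # gg_miss_var()
library(writexl)  # write_xlsx()
data <- read_csv("./Data/Kaggle - Pima Indians Diabetes Database.csv")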
Here is what the columns in the data mean:
| Column | Details |
|---|---|
| Pregnancies | Number of times pregnant |
| Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-Hour serum insulin (mu U/ml) |
| BMI | Body mass index (weight in kg/(height in m)^2) |
| DiabetesPedigreeFunction | Diabetes pedigree function |
| Age | Age (years) |
| Outcome | Class variable (0=No or 1=Yes) |
Outcome is a categorical variable: 1 means Yes and 0 means No.
Convert it into a factor data type. Then calculate the mean,
standard deviation, minimum, 1st quartile, median, 3rd quartile,
maximum, and inter-quartile range of all the numeric columns in the
dataset. Do you see any anomalies? Write your comment.
# Your code here
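# Keep a factor copy of Outcome in a separate vector; the Outcome column in
# `data` stays numeric, so it still appears in the summary below
outcome <- factor(data[["Outcome"]])
numeric_cols <- data[sapply(data, is.numeric)] # Keep only the numeric columns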
stats <- data.frame(
Mean = sapply(numeric_cols, mean, na.rm = TRUE),
Std = sapply(numeric_cols, sd, na.rm = TRUE),
Min = sapply(numeric_cols, min, na.rm = TRUE),
Q1 = sapply(numeric_cols, function(x) quantile(x, 0.25, na.rm = TRUE)),
Median = sapply(numeric_cols, median, na.rm = TRUE),
Q3 = sapply(numeric_cols, function(x) quantile(x, 0.75, na.rm = TRUE)),
Max = sapply(numeric_cols, max, na.rm = TRUE),
IQR = sapply(numeric_cols, IQR, na.rm = TRUE)
)
print(stats)
## Mean Std Min Q1 Median
## Pregnancies 3.8450521 3.3695781 0.000 1.00000 3.0000
## Glucose 120.8945312 31.9726182 0.000 99.00000 117.0000
## BloodPressure 69.1054688 19.3558072 0.000 62.00000 72.0000
## SkinThickness 20.5364583 15.9522176 0.000 0.00000 23.0000
## Insulin 79.7994792 115.2440024 0.000 0.00000 30.5000
## BMI 31.9925781 7.8841603 0.000 27.30000 32.0000
## DiabetesPedigreeFunction 0.4718763 0.3313286 0.078 0.24375 0.3725
## Age 33.2408854 11.7602315 21.000 24.00000 29.0000
## Outcome 0.3489583 0.4769514 0.000 0.00000 0.0000
## Q3 Max IQR
## Pregnancies 6.00000 17.00 5.0000
## Glucose 140.25000 199.00 41.2500
## BloodPressure 80.00000 122.00 18.0000
## SkinThickness 32.00000 99.00 32.0000
## Insulin 127.25000 846.00 127.2500
## BMI 36.60000 67.10 9.3000
## DiabetesPedigreeFunction 0.62625 2.42 0.3825
## Age 41.00000 81.00 17.0000
## Outcome 1.00000 1.00 1.0000
Your comment: In the stats data frame, the
Glucose, BloodPressure,
SkinThickness, Insulin, and BMI
columns have a minimum value of 0, which is physiologically unrealistic
and most likely encodes missing data. Additionally, the maximum of the
Insulin column (846) is far above its 3rd quartile (127.25), which
points to outliers. The 1st quartile (Q1) of both the
Insulin and SkinThickness columns is 0,
so at least 25% of the values in each of these columns are zeros,
i.e. probably missing.
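The outlier remark for Insulin can be checked with a quick calculation. This is only a sketch: it assumes Tukey's 1.5 × IQR rule, which the assignment does not prescribe, and it reuses the Q1, Q3, and maximum values printed in the summary table above (computed with the zeros still in place).

```r
# Values copied from the Insulin row of the summary table above
q1 <- 0
q3 <- 127.25
max_insulin <- 846

iqr <- q3 - q1
# Upper fence under Tukey's rule (an assumed convention, not part of the task)
upper_fence <- q3 + 1.5 * iqr

print(upper_fence)
print(max_insulin > upper_fence)
```

Under this rule the fence sits at 318.125, so the maximum of 846 lies well outside it.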
How many people have Outcome = Yes?
# Your code here
table(outcome)
## outcome
## 0 1
## 500 268
# 268 people have Outcome = Yes (1); 500 have Outcome = No (0)
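To put the counts in context, the shares of each class can be derived from them. A minimal sketch, with the counts typed in by hand from the table output above (the names `counts` and `shares` are just illustrative):

```r
# Counts copied from the table() output above
counts <- c(No = 500, Yes = 268)

# Share of each class, in percent
shares <- 100 * counts / sum(counts)

print(round(shares, 1))
```

Roughly a third of the people in the data have Outcome = Yes, which agrees with the mean of Outcome (about 0.349) in the summary table.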
Create histograms for all the numeric variables in the data using ggplot2.
# Your code here
for (i in names(numeric_cols)) {
  print(
    # aes_string() is deprecated; .data[[i]] looks the column up by name
    ggplot(numeric_cols, aes(x = .data[[i]])) +
      geom_histogram(color = "black", bins = 10, alpha = 0.6) +
      labs(title = paste("Histogram of", i), x = i, y = "Frequency") +
      theme_minimal()
  )
}
Create boxplot of Glucose, Insulin, BMI, and Age by Outcome using ggplot2.
# Your code here
for (i in c("Glucose", "Insulin", "BMI", "Age")) {
  print(
    # Outcome is numeric in numeric_cols, so wrap it in factor() to get
    # one box per class instead of a single box
    ggplot(numeric_cols, aes(x = factor(Outcome), y = .data[[i]])) +
      geom_boxplot() +
      labs(title = paste("Boxplot of", i, "by Outcome"), x = "Outcome", y = i) +
      theme_minimal()
  )
}
Replace 0 with NA in the variables where a value of 0 does not make sense.
# Your code here
# Glucose, Insulin, BMI, BloodPressure, & SkinThickness can't have zero value
numeric_cols$Glucose[numeric_cols$Glucose == 0] <- NA
numeric_cols$Insulin[numeric_cols$Insulin == 0] <- NA
numeric_cols$BMI[numeric_cols$BMI == 0] <- NA
numeric_cols$BloodPressure[numeric_cols$BloodPressure == 0] <- NA
numeric_cols$SkinThickness[numeric_cols$SkinThickness == 0] <- NA
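The same replacement can be written once for a whole set of columns instead of one line per column. The sketch below shows the idea on a small toy data frame; the values and the names `toy` and `zero_as_na` are made up for illustration and are not taken from the dataset.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  Glucose     = c(148, 0, 183, 89),
  BMI         = c(33.6, 26.6, 0, 28.1),
  Pregnancies = c(6, 1, 0, 0)  # 0 is a valid value here, so it is left alone
)

# Columns where a value of 0 does not make sense
zero_as_na <- c("Glucose", "BMI")

# Replace 0 with NA only in those columns
toy[zero_as_na] <- lapply(toy[zero_as_na], function(x) {
  x[which(x == 0)] <- NA
  x
})

print(toy)
```

Listing the columns explicitly keeps legitimate zeros, such as zero pregnancies, untouched.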
Use the naniar package to inspect the number of missing values in the data after replacing 0s with NAs.
# Your code here
gg_miss_var(numeric_cols, show_pct = TRUE) +
  labs(y = "Percentage of missing values")
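The numbers behind such a plot can also be tabulated directly. As a cross-check, here is a base R sketch on a toy data frame (hypothetical values; `toy` and `miss_summary` are illustrative names) that counts the NAs per column and expresses them as a percentage of rows.

```r
# Toy data frame with some NAs (hypothetical values, for illustration only)
toy <- data.frame(
  Glucose = c(148, NA, 183, 89),
  Insulin = c(NA, NA, 94, NA)
)

# Number and percentage of missing values per column
miss_summary <- data.frame(
  n_miss   = colSums(is.na(toy)),
  pct_miss = 100 * colMeans(is.na(toy))
)

# Sort so the column with the most missing values comes first
miss_summary <- miss_summary[order(-miss_summary$n_miss), ]

print(miss_summary)
```

Applied to the cleaned data, a table like this should show the same ordering of variables as the naniar plot.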
Export the final cleaned data in an Excel file for later use.
# Your code here
write_xlsx(numeric_cols, "cleaned_data.xlsx")