Introduction

Diabetes is a group of metabolic disorders in which blood glucose levels rise sharply because of defective insulin secretion and/or insulin resistance (1). Many risk factors are associated with diabetes, including obesity and old age (2). Diabetes in turn substantially increases the risk of cardiovascular disease (3). In 2014, 8.5% of adults aged 18 years and older had diabetes. In 2019, diabetes was the direct cause of 1.5 million deaths, and 48% of all deaths due to diabetes occurred before the age of 70 years. Another 460 000 kidney disease deaths were caused by diabetes, and raised blood glucose causes around 20% of cardiovascular deaths (4). Our dataset contains pathophysiological measurements from 768 individuals. Clinical parameters such as blood glucose, blood pressure, skin fold thickness, insulin, body mass index, and age were measured. We will use this dataset to explore the statistical properties of each variable.

The second topic is graduate admissions. International student exchange programs have gained popularity as a means to increase enrollments, support international academic partnerships, and improve student preparedness for globalized work environments (5). However, the relationships between English language proficiency, cultural intelligence, teamwork, self-efficacy, academic success, and other factors within these programs are not clear. Graduate Admissions (GA) can be thought of as a matching problem between students and universities, where each side strives for the best match it can get. In data science terms, this can be modeled as a university recommendation problem.

This project aims:

To explore the US diabetes dataset using the R programming language.
To predict the chance of admission and analyze the relationship between variables using machine learning approaches.

Load the relevant packages

Load the relevant packages: “readr”, “tidyverse”, “plotly”, “ggcorrplot”, “GGally”, “stats”, “factoextra”, “corrplot”, “lattice”, “e1071”, and “caret”.

The readr package provides a fast and friendly way to read rectangular data (like ‘csv’, ‘tsv’, and ‘fwf’).
The tidyverse is an “umbrella” package that houses a suite of R packages for data wrangling and data visualization.
The plotly package creates interactive web-based graphs via the open source JavaScript graphing library plotly.js.
The ggcorrplot package makes it easy to visualize a correlation matrix using ggplot2.
The ggplot2 package is a plotting system based on the grammar of graphics. GGally extends ggplot2 by adding several functions to reduce the complexity of combining geometric objects with transformed data.
The stats package provides functions for basic statistical analysis.
The factoextra package provides some easy-to-use functions to extract and visualize the output of multivariate data analyses, including ‘PCA’ (Principal Component Analysis), ‘CA’ (Correspondence Analysis), ‘MCA’ (Multiple Correspondence Analysis), ‘FAMD’ (Factor Analysis of Mixed Data), ‘MFA’ (Multiple Factor Analysis) and ‘HMFA’ (Hierarchical Multiple Factor Analysis) functions from different R packages.
The corrplot package provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.
The lattice package is a trellis graphics system for R.
The e1071 package provides the svm() function used later to fit the support vector machine.
The caret package (short for Classification And REgression Training) provides functions for training and evaluating classification and regression models.

library(readr)
library(tidyverse)
library(plotly)
library(ggcorrplot)
library(GGally)
library(stats)
library(factoextra)
library(corrplot)
library(lattice)
library(e1071)
library(caret)
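
If any of the library() calls above fails, the package is probably not installed. The following is a minimal sketch, not part of the original analysis, of one way to check for missing packages before loading them. The names `pkgs`, `is_installed`, and `missing_pkgs` are illustrative, and the install.packages() call is left commented out because it needs network access.

```r
# Packages loaded above (stats ships with base R, so it is omitted here).
pkgs <- c("readr", "tidyverse", "plotly", "ggcorrplot", "GGally",
          "factoextra", "corrplot", "lattice", "e1071", "caret")

# requireNamespace() returns FALSE instead of stopping when a package is absent.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All packages are available.")
}
```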

The following command discourages R from printing numbers in scientific notation, so results appear in the console in fixed notation.

options(scipen = 999)
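
As a small illustration of what the scipen option does, the sketch below formats the same two numbers under R's default penalty (0) and under 999. The variable names are made up for this example, and the previous setting is restored at the end.

```r
old <- options(scipen = 0)        # R's default penalty; keep the old value to restore later
sci_big   <- format(1e10)         # "1e+10"
sci_small <- format(0.00001)      # "1e-05"

options(scipen = 999)             # strongly prefer fixed notation
fixed_big   <- format(1e10)       # "10000000000"
fixed_small <- format(0.00001)    # "0.00001"

print(c(sci_big, fixed_big, sci_small, fixed_small))

options(old)                      # restore whatever setting was active before
```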

Part-1

About the diabetes dataset

Context

The O’odham (Arizona), O’ob also Pima Bajo (Mexico) (Pennington CW and Loaiza BX, 1979), or Pima in general, are descendants of the ancient Hohokam, who have inhabited the Sonoran desert and Sierra Madre regions for centuries (Sturtevant WC, 1983). In 1937, Joslin documented twenty-one persons with diabetes there and concluded that the prevalence of diabetes among the Pimas was similar to that of the general U.S. population (Joslin EP, 1940). By the 1950s, however, the prevalence of diabetes had increased ten-fold (Cohen BM, 1954). The onset of this epidemic stimulated a longitudinal study initiated at the Sacaton Service Unit of the Indian Health Services. The Pima Indian Diabetes Dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, contains information on 768 women from a population near Phoenix, Arizona, USA. Of these, 268 tested positive for diabetes and 500 tested negative.

Data Sources:

Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu), Research Center, RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20707, (301) 953-6231
Date received: 9 May 1990

For more details visit the National Institute of Diabetes and Digestive and Kidney Diseases website.

Content

Load the diabetes dataset into our workspace.

The dataset is read into a data frame named data, which has 768 rows and nine columns (768 x 9).

data <- read_csv("diabetes.csv")

We can also check the dimensions of this data frame as well as the names of the variables, type of variables and the first few observations by inserting the name of the data set into the glimpse() function, as seen below:

glimpse(data)

## Rows: 768
## Columns: 9
## $ Pregnancies              <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose                  <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure            <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness            <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin                  <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age                      <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome                  <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

All variables have type “dbl”, the abbreviation for double-precision floating-point numbers, which is how R stores numeric values by default.
However, ‘Outcome’ takes only the values 0 and 1, so it is better treated as categorical.
The dataset has one target (response, or dependent) variable, named ‘Outcome’.

We have 768 observations of 9 variables: eight numerical predictors and one categorical target. The meaning of each variable is as follows:

Table 1: Variables and descriptions

Variable	Description
Outcome	Class variable (0 or 1)
Pregnancies	Number of times pregnant
Glucose	Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure	Diastolic blood pressure (mm Hg)
SkinThickness	Triceps skin fold thickness (mm)
Insulin	2-Hour serum insulin (mu U/ml)
BMI	Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction	Diabetes pedigree function
Age	Age (years)

Since there are no other categorical variables in the dataset apart from the target variable, I am not removing any variable from the dataset. However, variables can be removed or selected with the following code:

data$x <- NULL # remove a single variable; x is a placeholder column name

data[, c(1,3:5,9)] # keep only the desired columns, selected by position

Data Exploration

Before analyzing the dataset, let’s check the class of the target variable.

# Check the class of outcome

class(data$Outcome) # Showing numeric, need to make it categorical

## [1] "numeric"

data$Outcome<-factor(data$Outcome,
               levels = c(0, 1), labels = c("Negative", "Positive"))

class(data$Outcome) # now it is factor or categorical

## [1] "factor"
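
To make the conversion concrete, here is a small sketch on a toy vector (the five values are invented for illustration and are not taken from the dataset). It shows how factor() maps the codes 0 and 1 to labels and how R stores the result internally.

```r
outcome_raw <- c(1, 0, 1, 0, 0)   # toy values, not from the diabetes data

outcome_fct <- factor(outcome_raw,
                      levels = c(0, 1),
                      labels = c("Negative", "Positive"))

levels(outcome_fct)       # the two category labels, in level order
table(outcome_fct)        # counts per category
as.integer(outcome_fct)   # internal codes: 1 = Negative, 2 = Positive
```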

Basic plot

a. Scatter plot

Let’s check the visual relationship between age and pregnancies by diabetes outcome.

plot(data$Age, data$Pregnancies, col=data$Outcome,
     main="Visualization of age and pregnancies by diabetes outcome",
     xlab = "Age (years)", ylab="Pregnancies"
     )

Visual inspection suggests a positive association between age and number of pregnancies, with no obvious difference in this pattern between the Positive and Negative diabetes outcomes.

b. Histogram plot

hist(data$Insulin, main="Histogram of Insulin", 
     xlab = "Insulin", col="brown")

A positively skewed histogram for a 2-hour serum insulin test indicates that the majority of the data points are clustered on the lower end of insulin levels, with a few higher values that pull the tail of the distribution towards the right side.
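
One quick numerical check for positive skew is that the mean exceeds the median, because the long right tail pulls the mean upward. The sketch below demonstrates this on simulated data; the exponential distribution, its rate, and the seed are assumptions chosen only for illustration and are not the real insulin values.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- rexp(1000, rate = 1 / 80)     # simulated right-skewed values

mean(x)
median(x)
mean(x) > median(x)                # TRUE for a right-skewed sample like this one

hist(x, main = "Simulated right-skewed data", xlab = "Value", col = "brown")
```

The same comparison can be run on the real column with mean(data$Insulin) and median(data$Insulin).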

c. Boxplot

boxplot(data$Glucose, main="Boxplot of glucose concentration",
        col="lightblue")

This box plot suggests that the majority of individuals have plasma glucose concentrations between 100 and 140 with a median value around 120 mg/dL at 2 hours into the oral glucose tolerance test, with one outlier at 0 mg/dL, which needs further investigation due to its considerable deviation from the typical range of values.
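
By default, boxplot() draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. The sketch below applies that rule through boxplot.stats() to a short vector of made-up glucose-like values (hypothetical, chosen so that a single 0 is flagged); it is meant only to show where the outlier in a box plot comes from.

```r
glucose_toy <- c(0, 85, 99, 100, 110, 117, 120, 130, 140, 150, 160)  # illustrative values

bp <- boxplot.stats(glucose_toy)   # default coef = 1.5 gives the 1.5 x IQR rule

bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # values drawn individually as outliers
```

Running boxplot.stats(data$Glucose)$out on the real column lists the values that the plot marks as outliers.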

ggplot

a. Scatter plot

ggplot(data, aes(BMI, SkinThickness, color=Outcome))+
  geom_point(size=2)+
  scale_color_manual(values = c("Negative"= "blue", "Positive" = "red"))+
  labs(title = "Visualization of Body mass index and Skin fold thickness (mm) by diabetes outcome", x = "Body mass index",
       y = "Skin fold thickness (mm)",
       caption = "Source: Iskulghar")+
theme(
legend.position = "top",
plot.caption = element_text(hjust = 0),
text = element_text(colour = 'black', size = 14),
axis.text.x = element_text(color = "orange", size = 12),
axis.text.y = element_text(color = "brown", size = 12)
)

The positive pattern in the scatter plot suggests that individuals with a higher BMI tend to have a greater triceps skin fold thickness, which is plausible because a higher BMI usually indicates a higher proportion of body fat. By visual inspection, this association does not appear to differ between the Positive and Negative diabetes outcomes.

b. Box plot

Boxplot of pregnancies by diabetes outcome

ggplot(data, aes(x=Outcome, y=Pregnancies, fill=Outcome))+
  geom_boxplot()+
  labs(title = "Boxplot of pregnancies by diabetes outcome", x = "Diabetes",
       y = "Number of times pregnant",
       caption = "Source: Iskulghar")+
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0),
    text = element_text(colour = 'black', size = 14),
    axis.text.x = element_text(color = "orange", size = 12),
    axis.text.y = element_text(color = "brown", size = 12)
  )

Boxplot of glucose concentration by diabetes outcome

ggplot(data, aes(x=Outcome, y=Glucose, fill=Outcome))+
  geom_boxplot()+
  labs(title = "Boxplot of glucose concentration by diabetes outcome", x = "Diabetes",
       y = "Glucose concentration",
       caption = "Source: Iskulghar")+
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0),
    text = element_text(colour = 'black', size = 14),
    axis.text.x = element_text(color = "orange", size = 12),
    axis.text.y = element_text(color = "brown", size = 12)
  )

Boxplot of blood pressure by diabetes outcome

ggplot(data, aes(x=Outcome, y=BloodPressure, fill=Outcome))+
  geom_boxplot()+
  labs(title = "Boxplot of blood pressure by diabetes outcome", x = "Diabetes",
       y = "Diastolic blood pressure (mm Hg)",
       caption = "Source: Iskulghar")+
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0),
    text = element_text(colour = 'black', size = 14),
    axis.text.x = element_text(color = "orange", size = 12),
    axis.text.y = element_text(color = "brown", size = 12)
  )

Boxplot of skin fold thickness by diabetes outcome

ggplot(data, aes(x=Outcome, y=SkinThickness, fill=Outcome))+
  geom_boxplot()+
  labs(title = "Boxplot of skin fold thickness by diabetes outcome", x = "Diabetes",
       y = "Triceps skin fold thickness (mm)",
       caption = "Source: Iskulghar")+
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0),
    text = element_text(colour = 'black', size = 14),
    axis.text.x = element_text(color = "orange", size = 12),
    axis.text.y = element_text(color = "brown", size = 12)
  )

Boxplot of insulin by diabetes outcome

ggplot(data, aes(x=Outcome, y=Insulin, fill=Outcome))+
  geom_boxplot()+
  labs(title = "Boxplot of insulin by diabetes outcome", x = "Diabetes",
       y = "2-Hour serum insulin (mu U/ml)",
       caption = "Source: Iskulghar")+
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0),
    text = element_text(colour = 'black', size = 14),
    axis.text.x = element_text(color = "orange", size = 12),
    axis.text.y = element_text(color = "brown", size = 12)
  )

Boxplot of body mass index by diabetes outcome

ggplot(data, aes(x=Outcome, y=BMI, fill=Outcome))+
  geom_boxplot()+
  labs(title = "Boxplot of body mass index by diabetes outcome", x = "Diabetes",
       y = "Body mass index",
       caption = "Source: Iskulghar")+
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0),
    text = element_text(colour = 'black', size = 14),
    axis.text.x = element_text(color = "orange", size = 12),
    axis.text.y = element_text(color = "brown", size = 12)
  )

Boxplot of diabetes pedigree function by diabetes outcome

ggplot(data, aes(x=Outcome, y=DiabetesPedigreeFunction, fill=Outcome))+
  geom_boxplot()+
  labs(title = "Boxplot of diabetes pedigree function by diabetes outcome", x = "Diabetes",
       y = "Diabetes pedigree function",
       caption = "Source: Iskulghar")+
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0),
    text = element_text(colour = 'black', size = 14),
    axis.text.x = element_text(color = "orange", size = 12),
    axis.text.y = element_text(color = "brown", size = 12)
  )

Boxplot of age by diabetes outcome

ggplot(data, aes(x=Outcome, y=Age, fill=Outcome))+
  geom_boxplot()+
  labs(title = "Boxplot of age by diabetes outcome", x = "Diabetes",
       y = "Age (years)",
       caption = "Source: Iskulghar")+
  theme(
    legend.position = "top",
    plot.caption = element_text(hjust = 0),
    text = element_text(colour = 'black', size = 14),
    axis.text.x = element_text(color = "orange", size = 12),
    axis.text.y = element_text(color = "brown", size = 12)
  )

The boxplots reveal distinct patterns across the features. Comparing the two diabetes outcomes (Negative and Positive), the clearest differences in median values appear for glucose concentration and age, which suggests that these two features are likely to be useful for distinguishing between the outcomes. Outliers are present in all of the features analyzed.

Part-2

Interactive Plots

a. Interactive violin plot of all features

Number of times pregnant by Outcome

Pregnancies<-data %>%
  plot_ly(x = ~Outcome, y=~Pregnancies, type = 'violin')%>%
  layout(yaxis= list(title='Number of times pregnant'))
Pregnancies

Plasma glucose concentration at 2 hours in an oral glucose tolerance test by Outcome

Glucose<-data %>%
  plot_ly(x = ~Outcome, y=~Glucose, type = 'violin')%>%
  layout(yaxis= list(title='Glucose concentration'))
Glucose

Diastolic blood pressure (mm Hg) by Outcome

BP<-data %>%
  plot_ly(x = ~Outcome, y=~BloodPressure, type = 'violin')%>%
  layout(yaxis= list(title='Blood Pressure'))
BP

Triceps skin fold thickness (mm) by Outcome

SkinThickness<-data %>%
  plot_ly(x = ~Outcome, y=~SkinThickness, type = 'violin')%>%
  layout(yaxis= list(title='Skin Thickness'))
SkinThickness

2-Hour serum insulin (mu U/ml) by Outcome

Insulin<-data %>%
  plot_ly(x = ~Outcome, y=~Insulin, type = 'violin')%>%
  layout(yaxis= list(title='Insulin (mu U/ml)'))
Insulin

Body mass index (kg/m^2) by Outcome

BMI<-data %>%
  plot_ly(x = ~Outcome, y=~BMI, type = 'violin')%>%
  layout(yaxis= list(title='Body mass index'))
BMI

Diabetes pedigree function by Outcome

DiabetesFunction<-data %>%
  plot_ly(x = ~Outcome, y=~DiabetesPedigreeFunction, type = 'violin')%>%
  layout(yaxis= list(title='Diabetes pedigree function'))
DiabetesFunction

Age (years) by Outcome

Age<-data %>%
  plot_ly(x = ~Outcome, y=~Age, type = 'violin')%>%
  layout(yaxis= list(title='Age (years)'))
Age

b. Interactive boxplot of all features

Number of times pregnant by Outcome

Pregnancies<-data %>%
  plot_ly(x = ~Outcome, y=~Pregnancies, type = 'box')%>%
  layout(yaxis= list(title='Number of times pregnant'))
Pregnancies

Plasma glucose concentration at 2 hours in an oral glucose tolerance test by Outcome

Glucose<-data %>%
  plot_ly(x = ~Outcome, y=~Glucose, type = 'box')%>%
  layout(yaxis= list(title='Glucose concentration'))
Glucose

Diastolic blood pressure (mm Hg) by Outcome

BP<-data %>%
  plot_ly(x = ~Outcome, y=~BloodPressure, type = 'box')%>%
  layout(yaxis= list(title='Blood Pressure'))
BP

Triceps skin fold thickness (mm) by Outcome

SkinThickness<-data %>%
  plot_ly(x = ~Outcome, y=~SkinThickness, type = 'box')%>%
  layout(yaxis= list(title='Skin Thickness'))
SkinThickness

2-Hour serum insulin (mu U/ml) by Outcome

Insulin<-data %>%
  plot_ly(x = ~Outcome, y=~Insulin, type = 'box')%>%
  layout(yaxis= list(title='Insulin (mu U/ml)'))
Insulin

Body mass index (kg/m^2) by Outcome

BMI<-data %>%
  plot_ly(x = ~Outcome, y=~BMI, type = 'box')%>%
  layout(yaxis= list(title='Body mass index'))
BMI

Diabetes pedigree function by Outcome

DiabetesFunction<-data %>%
  plot_ly(x = ~Outcome, y=~DiabetesPedigreeFunction, type = 'box')%>%
  layout(yaxis= list(title='Diabetes pedigree function'))
DiabetesFunction

Age (years) by Outcome

Age<-data %>%
  plot_ly(x = ~Outcome, y=~Age, type = 'box')%>%
  layout(yaxis= list(title='Age (years)'))
Age

The interpretation of the interactive plots is the same as discussed in Part-1.

c. Correlation matrix

cor_matrix = cor(data[ ,1:8])
cor_matrix

##                          Pregnancies    Glucose BloodPressure SkinThickness
## Pregnancies               1.00000000 0.12945867    0.14128198   -0.08167177
## Glucose                   0.12945867 1.00000000    0.15258959    0.05732789
## BloodPressure             0.14128198 0.15258959    1.00000000    0.20737054
## SkinThickness            -0.08167177 0.05732789    0.20737054    1.00000000
## Insulin                  -0.07353461 0.33135711    0.08893338    0.43678257
## BMI                       0.01768309 0.22107107    0.28180529    0.39257320
## DiabetesPedigreeFunction -0.03352267 0.13733730    0.04126495    0.18392757
## Age                       0.54434123 0.26351432    0.23952795   -0.11397026
##                              Insulin        BMI DiabetesPedigreeFunction
## Pregnancies              -0.07353461 0.01768309              -0.03352267
## Glucose                   0.33135711 0.22107107               0.13733730
## BloodPressure             0.08893338 0.28180529               0.04126495
## SkinThickness             0.43678257 0.39257320               0.18392757
## Insulin                   1.00000000 0.19785906               0.18507093
## BMI                       0.19785906 1.00000000               0.14064695
## DiabetesPedigreeFunction  0.18507093 0.14064695               1.00000000
## Age                      -0.04216295 0.03624187               0.03356131
##                                  Age
## Pregnancies               0.54434123
## Glucose                   0.26351432
## BloodPressure             0.23952795
## SkinThickness            -0.11397026
## Insulin                  -0.04216295
## BMI                       0.03624187
## DiabetesPedigreeFunction  0.03356131
## Age                       1.00000000

The strongest correlation in the matrix is between age and the number of pregnancies (r = 0.54), indicating that the number of pregnancies tends to increase with age. Skin thickness is also moderately correlated with insulin (r = 0.44) and BMI (r = 0.39). Age is essentially uncorrelated with the diabetes pedigree function (r = 0.03), and the few negative correlations are all weak (|r| < 0.12), so they do not indicate reliable relationships between those variables.
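
Each entry of the matrix is a Pearson correlation coefficient, that is, the covariance of two variables divided by the product of their standard deviations. The sketch below checks this definition against cor() on simulated data; the vectors `age` and `preg` are generated here purely for illustration and are not the real columns.

```r
set.seed(1)                                           # arbitrary seed
age  <- round(runif(50, 21, 60))                      # simulated ages (hypothetical)
preg <- pmax(0, round((age - 20) / 5 + rnorm(50)))    # simulated counts that rise with age

# Pearson's r = covariance / (sd of x * sd of y)
r_manual  <- cov(age, preg) / (sd(age) * sd(preg))
r_builtin <- cor(age, preg)

all.equal(r_manual, r_builtin)   # the two computations agree
```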

d. Correlation matrix plot (lower triangle) with values

ggcorrplot(cor_matrix, 
           type = "lower",
           colors = c("blue", "white", "red"),
           lab = TRUE)

e. Pair plot

ggpairs(data, aes(colour = Outcome))

f. Principal component analysis (PCA)

Diabetes_pca = prcomp(data[ , -9], scale = TRUE, center = TRUE)
Diabetes_pca

## Standard deviations (1, .., p=8):
## [1] 1.4471973 1.3157546 1.0147068 0.9356971 0.8731234 0.8262133 0.6479322
## [8] 0.6359733
## 
## Rotation (n x k) = (8 x 8):
##                                 PC1        PC2         PC3         PC4
## Pregnancies              -0.1284321  0.5937858 -0.01308692  0.08069115
## Glucose                  -0.3930826  0.1740291  0.46792282 -0.40432871
## BloodPressure            -0.3600026  0.1838921 -0.53549442  0.05598649
## SkinThickness            -0.4398243 -0.3319653 -0.23767380  0.03797608
## Insulin                  -0.4350262 -0.2507811  0.33670893 -0.34994376
## BMI                      -0.4519413 -0.1009598 -0.36186463  0.05364595
## DiabetesPedigreeFunction -0.2706114 -0.1220690  0.43318905  0.83368010
## Age                      -0.1980271  0.6205885  0.07524755  0.07120060
##                                 PC5          PC6         PC7          PC8
## Pregnancies              -0.4756057  0.193598168 -0.58879003 -0.117840984
## Glucose                   0.4663280  0.094161756 -0.06015291 -0.450355256
## BloodPressure             0.3279531 -0.634115895 -0.19211793  0.011295538
## SkinThickness            -0.4878621  0.009589438  0.28221253 -0.566283799
## Insulin                  -0.3469348 -0.270650609 -0.13200992  0.548621381
## BMI                       0.2532038  0.685372179 -0.03536644  0.341517637
## DiabetesPedigreeFunction  0.1198105 -0.085784088 -0.08609107  0.008258731
## Age                      -0.1092900 -0.033357170  0.71208542  0.211661979

summary(Diabetes_pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.4472 1.3158 1.0147 0.9357 0.87312 0.82621 0.64793
## Proportion of Variance 0.2618 0.2164 0.1287 0.1094 0.09529 0.08533 0.05248
## Cumulative Proportion  0.2618 0.4782 0.6069 0.7163 0.81164 0.89697 0.94944
##                            PC8
## Standard deviation     0.63597
## Proportion of Variance 0.05056
## Cumulative Proportion  1.00000

Interpreting the proportion of variance involves assessing which components carry the most information and how much of the original data’s variability each one captures. Components 1 and 2 together explain about 48% of the variance (26.2% and 21.6%, respectively). Subsequent components (3, 4, and so on) explain progressively smaller proportions, and six components are needed to reach about 90%.
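
The proportions reported by summary() follow directly from the standard deviations printed by prcomp(): each component's variance is its standard deviation squared, divided by the total. The sketch below reproduces the table from the printed values; the variable names are illustrative.

```r
# Standard deviations copied from the prcomp() output above
sdev <- c(1.4471973, 1.3157546, 1.0147068, 0.9356971,
          0.8731234, 0.8262133, 0.6479322, 0.6359733)

variance <- sdev^2                     # variance carried by each component
prop_var <- variance / sum(variance)   # proportion of variance per component
cum_prop <- cumsum(prop_var)           # cumulative proportion

round(prop_var, 4)
round(cum_prop, 4)
sum(variance)   # close to 8, the number of standardized variables
```

With the fitted object available, the same quantities can be obtained from Diabetes_pca$sdev instead of the hard-coded vector.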

i. Bar plot of PCs (scree plot)

fviz_eig(Diabetes_pca, addlabels = TRUE)

The scree plot for the diabetes dataset shows a steep decline in explained variance until the third component, followed by a shallower decline. This suggests that the first three components capture the largest share of the variance in the data (about 61% cumulatively).

ii. Contribution plot of PCs (Circular plot)

fviz_pca_var(Diabetes_pca,
             col.var = "contrib")

iii. Contribution plot as Heatmap

var = get_pca_var(Diabetes_pca)

corrplot(var$cos2)

iv. Cluster plot after PCA

fviz_pca_ind(Diabetes_pca,
             geom.ind = "point",
             col.ind = data$Outcome,
             addEllipses = TRUE)

g. Support Vector Machine (SVM)

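# Split the data 80/20 into training and test sets.
# Note: the split is stratified on Insulin here; for classification it is more common to stratify on data$Outcome.
train_ix = createDataPartition(data$Insulin, p = 0.8, list = FALSE)
train_data = data[train_ix, ]
test_data = data[-train_ix, ]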

train_data

## # A tibble: 616 × 9
##    Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
##          <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>
##  1           6     148            72            35       0  33.6
##  2           1      85            66            29       0  26.6
##  3           8     183            64             0       0  23.3
##  4           1      89            66            23      94  28.1
##  5           0     137            40            35     168  43.1
##  6           3      78            50            32      88  31  
##  7          10     115             0             0       0  35.3
##  8           2     197            70            45     543  30.5
##  9           8     125            96             0       0   0  
## 10           4     110            92             0       0  37.6
## # ℹ 606 more rows
## # ℹ 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <fct>

test_data

## # A tibble: 152 × 9
##    Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
##          <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>
##  1           5     116            74             0       0  25.6
##  2          10     168            74             0       0  38  
##  3           1     115            70            30      96  34.6
##  4           3     126            88            41     235  39.3
##  5           3     158            76            36     245  31.6
##  6           3      88            58            11      54  24.8
##  7           7     133            84             0       0  40.2
##  8           0     180            66            39       0  42  
##  9           2      71            70            27       0  28  
## 10           1     101            50            15      36  24.2
## # ℹ 142 more rows
## # ℹ 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <fct>

svm_model = svm(Outcome ~ Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age,
                data = train_data, kernel = "linear")

test_data[120, ]

## # A tibble: 1 × 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
##         <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>
## 1           1     128            82            17     183  27.5
## # ℹ 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <fct>

predictions = predict(svm_model, newdata = test_data)
conf_max = confusionMatrix(predictions, test_data$Outcome)
conf_max

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Positive
##   Negative       87       18
##   Positive       15       32
##                                           
##                Accuracy : 0.7829          
##                  95% CI : (0.7088, 0.8456)
##     No Information Rate : 0.6711          
##     P-Value [Acc > NIR] : 0.00165         
##                                           
##                   Kappa : 0.5006          
##                                           
##  Mcnemar's Test P-Value : 0.72772         
##                                           
##             Sensitivity : 0.8529          
##             Specificity : 0.6400          
##          Pos Pred Value : 0.8286          
##          Neg Pred Value : 0.6809          
##              Prevalence : 0.6711          
##          Detection Rate : 0.5724          
##    Detection Prevalence : 0.6908          
##       Balanced Accuracy : 0.7465          
##                                           
##        'Positive' Class : Negative        
##

cm = as.data.frame(conf_max$table)

ggplot(cm, aes(Prediction, Reference, fill = Freq)) + 
  geom_tile() +
  geom_text(aes(label = Freq)) + 
  scale_fill_gradient(low="white", high="skyblue")

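In the confusion matrix above, the model correctly classified 87 Negative and 32 Positive cases, for an accuracy of about 78%. It misclassified 18 Positive cases as Negative and 15 Negative cases as Positive. Note that caret treats ‘Negative’ as the ‘positive’ class here, so the reported sensitivity (0.85) is the share of Negative cases correctly identified, and the specificity (0.64) is the share of Positive cases correctly identified. Reducing these errors, particularly the missed Positive cases, would be the focus for improving the model’s performance.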
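
The headline statistics in the confusionMatrix() output can be recomputed by hand from the four cell counts. The sketch below does this in base R with the counts copied from the output above; the object name `cm_counts` is illustrative, and the class treated as "positive" is Negative, as in the caret output.

```r
# Counts copied from the confusion matrix output above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm_counts <- matrix(c(87, 15, 18, 32), nrow = 2,
                    dimnames = list(Prediction = c("Negative", "Positive"),
                                    Reference  = c("Negative", "Positive")))

accuracy    <- sum(diag(cm_counts)) / sum(cm_counts)                        # 119 / 152
sensitivity <- cm_counts["Negative", "Negative"] / sum(cm_counts[, "Negative"])  # 87 / 102
specificity <- cm_counts["Positive", "Positive"] / sum(cm_counts[, "Positive"])  # 32 / 50

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```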

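Results may vary between runs because the train/test split is drawn at random. Calling set.seed() with a fixed value before createDataPartition() makes the split reproducible.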
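
The effect of fixing the random seed can be shown with a minimal sketch. It uses base R's sample() rather than caret's createDataPartition(), so it is a stand-in for the split above and not the same procedure; the seed value 123 is arbitrary.

```r
n <- 768                          # number of rows in the diabetes data

set.seed(123)                     # the seed value is arbitrary
split_a <- sample(n, size = 616)  # row indices for a training set

set.seed(123)                     # same seed, so the same random draw
split_b <- sample(n, size = 616)

identical(split_a, split_b)       # TRUE
```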

US Admission data analysis

About the US Admission dataset

Context

This dataset was built to help students shortlist universities based on their profiles. The predicted output gives them a fair idea of their chances of admission to a particular university.

Content

Load the US admission dataset into our workspace.

The dataset is read into a data frame named Admission, which has 400 rows and nine columns (400 x 9).

Admission <- read_csv("US Admission.csv")

glimpse(Admission)

## Rows: 400
## Columns: 9
## $ `Serial No.`        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ `GRE Score`         <dbl> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, …
## $ `TOEFL Score`       <dbl> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, …
## $ `University Rating` <dbl> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3,…
## $ SOP                 <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, …
## $ LOR                 <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, …
## $ CGPA                <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.…
## $ Research            <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ `Chance of Admit`   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.…

Data Analysis

a. Remove the “Serial No” column from the dataset

Admission$`Serial No.`<-NULL

Table 2: Variables and descriptions

Variable	Description
GRE Score	GRE Scores (out of 340)
TOEFL Score	TOEFL Scores (out of 120)
University Rating	University Rating (out of 5)
SOP	Statement of Purpose (out of 5)
LOR	Letter of Recommendation Strength (out of 5)
CGPA	Undergraduate GPA (out of 10)
Research	Research Experience (either 0 or 1)
Chance of Admit	Chance of Admit (ranging from 0 to 1)

Before analyzing the dataset, let’s check the class of the Research variable.

# Check the class of Research

class(Admission$Research) # Showing numeric, need to make it categorical

## [1] "numeric"

Admission$Research<-factor(Admission$Research,
               levels = c(0, 1), labels = c("No", "Yes"))

class(Admission$Research) # now it is factor or categorical

## [1] "factor"

b. Pair plot

ggpairs(Admission, aes(colour = Research))

c. Linear regression

Regression plot for chance of admission and GRE Score

ggplot(Admission, aes(x = `Chance of Admit`, y=`GRE Score`)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)+
  labs(title = "Regression plot for chance of admission and GRE Score", x = "Chance of admission",
       y = "GRE Score")

Regression plot for chance of admission and TOEFL Score

ggplot(Admission, aes(x = `Chance of Admit`, y=`TOEFL Score`)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)+
  labs(title = "Regression plot for chance of admission and TOEFL Score", x = "Chance of admission",
       y = "TOEFL Score")

Regression plot for chance of admission and University Rating

ggplot(Admission, aes(x = `Chance of Admit`, y=`University Rating`)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)+
  labs(title = "Regression plot for chance of admission and University Rating", x = "Chance of admission",
       y = "University Rating")

Regression plot for chance of admission and Statement of Purpose

ggplot(Admission, aes(x = `Chance of Admit`, y=SOP)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)+
  labs(title = "Regression plot for chance of admission and Statement of Purpose", x = "Chance of admission",
       y = "Statement of Purpose")

Regression plot for chance of admission and Letter of Recommendation Strength

ggplot(Admission, aes(x = `Chance of Admit`, y=LOR)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)+
  labs(title = "Regression plot for chance of admission and Letter of Recommendation Strength", x = "Chance of admission",
       y = "Letter of Recommendation Strength")

Regression plot for chance of admission and Undergraduate GPA

ggplot(Admission, aes(x = `Chance of Admit`, y=CGPA)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)+
  labs(title = "Regression plot for chance of admission and Undergraduate GPA", x = "Chance of admission",
       y = "Undergraduate GPA")

Regression plot for chance of admission and Research Experience

ggplot(Admission, aes(x = `Chance of Admit`, y=Research)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)+
  labs(title = "Regression plot for chance of admission and Research Experience", x = "Chance of admission",
       y = "Research Experience")

The variable on the y-axis of the above plot (Research Experience) is binary: it takes only the values "No" and "Yes". A linear regression model is therefore not appropriate for it, because linear regression assumes a continuous response variable. A logistic regression model can handle a binary outcome by estimating the probability of each value. Logistic regression is a generalized linear model that uses the logistic function to link the predictor variables to the probability of the response.
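As a minimal sketch of the logistic alternative described above, the block below fits `glm()` with `family = binomial`. It uses simulated data rather than the real `Admission` table: the data frame `sim`, its columns `chance_of_admit` and `research`, and the coefficients used to generate them are all assumptions made for illustration.

```r
# Simulated stand-in for the Admission data (hypothetical values, not the real dataset)
set.seed(42)
n <- 400
chance <- runif(n, min = 0.35, max = 0.97)

# Assumption: research experience becomes more likely as the chance of admission rises
research <- rbinom(n, size = 1, prob = plogis(-4 + 6 * chance))

sim <- data.frame(
  chance_of_admit = chance,
  research = factor(research, levels = c(0, 1), labels = c("No", "Yes"))
)

# Logistic regression: models the log-odds of research == "Yes"
logit_fit <- glm(research ~ chance_of_admit, data = sim, family = binomial)
print(summary(logit_fit)$coefficients)

# Predicted probabilities always lie between 0 and 1, unlike predictions from lm()
new_data <- data.frame(chance_of_admit = c(0.5, 0.7, 0.9))
pred <- predict(logit_fit, newdata = new_data, type = "response")
print(pred)
```

With a factor response, `glm()` treats the first level ("No") as failure and models the probability of the other level ("Yes"), so a positive slope means research experience is more likely at higher values of the predictor.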

The linear regression plots show a consistent trend: every factor examined is positively associated with the chance of university admission. As the value of each factor increases, the chance of admission also tends to increase.
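The direction of each association can also be checked numerically with Pearson correlations. The sketch below is only illustrative: it runs on a simulated data frame `sim` whose column names and generating values are assumptions, not on the real `Admission` table.

```r
# Simulated stand-in for three of the predictors (hypothetical values)
set.seed(7)
n <- 400
cgpa <- rnorm(n, mean = 8.6, sd = 0.6)

sim <- data.frame(
  cgpa        = cgpa,
  gre_score   = 317 + 15 * (cgpa - 8.6) + rnorm(n, sd = 6),
  toefl_score = 107 + 8 * (cgpa - 8.6) + rnorm(n, sd = 4)
)
sim$chance_of_admit <- 0.72 + 0.2 * (cgpa - 8.6) + rnorm(n, sd = 0.06)

# Correlation of every predictor with the chance of admission
predictors <- setdiff(names(sim), "chance_of_admit")
correlations <- sapply(sim[predictors], cor, y = sim$chance_of_admit)

# A positive value means the predictor and the chance of admission rise together
print(sort(correlations, decreasing = TRUE))
```

On the real data the same `sapply()` call could be applied to the numeric columns of `Admission`; the factor column `Research` would have to be excluded or converted back to 0/1 first.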

d. Polynomial regression

Nonlinear regression plot for chance of admission and GRE Score

ggplot(Admission, aes(x = `Chance of Admit`, y=`GRE Score`)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2), level = 0.95)+
  labs(title = "Nonlinear regression plot for chance of admission and GRE Score", x = "Chance of admission",
       y = "GRE Score")

Nonlinear regression plot for chance of admission and TOEFL Score

ggplot(Admission, aes(x = `Chance of Admit`, y=`TOEFL Score`)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2), level = 0.95)+
  labs(title = "Nonlinear regression plot for chance of admission and TOEFL Score", x = "Chance of admission",
       y = "TOEFL Score")

Nonlinear regression plot for chance of admission and University Rating

ggplot(Admission, aes(x = `Chance of Admit`, y=`University Rating`)) + 
  geom_point() +
  geom_smooth(method = "lm",formula = y~poly(x, 2), level = 0.95)+
  labs(title = "Nonlinear regression plot for chance of admission and University Rating", x = "Chance of admission",
       y = "University Rating")

Nonlinear regression plot for chance of admission and Statement of Purpose

ggplot(Admission, aes(x = `Chance of Admit`, y=SOP)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2), level = 0.95)+
  labs(title = "Nonlinear regression plot for chance of admission and Statement of Purpose", x = "Chance of admission",
       y = "Statement of Purpose")

Nonlinear regression plot for chance of admission and Letter of Recommendation Strength

ggplot(Admission, aes(x = `Chance of Admit`, y=LOR)) + 
  geom_point() +
  geom_smooth(method = "lm",formula = y~poly(x, 2), level = 0.95)+
  labs(title = "Nonlinear regression plot for chance of admission and Letter of Recommendation Strength", x = "Chance of admission",
       y = "Letter of Recommendation Strength")

Nonlinear regression plot for chance of admission and Undergraduate GPA

ggplot(Admission, aes(x = `Chance of Admit`, y=CGPA)) + 
  geom_point() +
  geom_smooth(method = "lm",formula = y~poly(x, 2), level = 0.95)+
  labs(title = "Nonlinear regression plot for chance of admission and Undergraduate GPA", x = "Chance of admission",
       y = "Undergraduate GPA")

Nonlinear regression plot for chance of admission and Research Experience

ggplot(Admission, aes(x = `Chance of Admit`, y=Research)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2), level = 0.95)+
  labs(title = "Nonlinear regression plot for chance of admission and Research Experience", x = "Chance of admission",
       y = "Research Experience")
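The curves above come from `lm()` with a second-degree polynomial term, so the model is still linear in its coefficients even though the fitted line is curved. One way to judge whether the squared term is worth keeping is to compare the straight-line and quadratic fits directly. The sketch below does this on simulated data; the data frame `sim`, its column names, and the amount of curvature are assumptions chosen for illustration.

```r
# Simulated stand-in with mild curvature (hypothetical values, not the real dataset)
set.seed(1)
n <- 400
chance <- runif(n, min = 0.35, max = 0.97)
gre <- 290 + 20 * chance + 30 * chance^2 + rnorm(n, sd = 5)
sim <- data.frame(chance_of_admit = chance, gre_score = gre)

# Same axes as the plots above: chance of admission on x, score on y
linear_fit    <- lm(gre_score ~ chance_of_admit, data = sim)
quadratic_fit <- lm(gre_score ~ poly(chance_of_admit, 2), data = sim)

# F-test for nested models: a small p-value suggests the squared term improves the fit
comparison <- anova(linear_fit, quadratic_fit)
print(comparison)

# Adjusted R-squared penalises the extra parameter
adj_r2 <- c(linear    = summary(linear_fit)$adj.r.squared,
            quadratic = summary(quadratic_fit)$adj.r.squared)
print(adj_r2)
```

If the F-test is not significant and the adjusted R-squared barely changes, the simpler straight-line fit is usually the better description.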

e. Multivariate regression model

model<-lm(`Chance of Admit`~`GRE Score`+`TOEFL Score`+`University Rating`+SOP+LOR+CGPA+relevel(Research, ref = "No"),
          data=Admission)

summary(model)

## 
## Call:
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + 
##     `University Rating` + SOP + LOR + CGPA + relevel(Research, 
##     ref = "No"), data = Admission)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26259 -0.02103  0.01005  0.03628  0.15928 
## 
## Coefficients:
##                                    Estimate Std. Error t value
## (Intercept)                      -1.2594325  0.1247307 -10.097
## `GRE Score`                       0.0017374  0.0005979   2.906
## `TOEFL Score`                     0.0029196  0.0010895   2.680
## `University Rating`               0.0057167  0.0047704   1.198
## SOP                              -0.0033052  0.0055616  -0.594
## LOR                               0.0223531  0.0055415   4.034
## CGPA                              0.1189395  0.0122194   9.734
## relevel(Research, ref = "No")Yes  0.0245251  0.0079598   3.081
##                                              Pr(>|t|)    
## (Intercept)                      < 0.0000000000000002 ***
## `GRE Score`                                   0.00387 ** 
## `TOEFL Score`                                 0.00768 ** 
## `University Rating`                           0.23150    
## SOP                                           0.55267    
## LOR                                          0.000066 ***
## CGPA                             < 0.0000000000000002 ***
## relevel(Research, ref = "No")Yes              0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared:  0.8035, Adjusted R-squared:    0.8 
## F-statistic: 228.9 on 7 and 392 DF,  p-value: < 0.00000000000000022

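The regression model with all seven predictors explains about 80% of the variance in the chance of admission (adjusted R-squared = 0.80). GRE score, TOEFL score, letter of recommendation strength, CGPA, and research experience are all significant positive predictors (p < 0.01), and CGPA has the largest t value (9.73). University rating and statement of purpose are not significant once the other variables are in the model (p = 0.23 and p = 0.55). These findings highlight the importance of academic performance, language proficiency, strong endorsements, and research engagement in securing admission to US universities.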
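To make the coefficients concrete, the sketch below applies the estimates from the `summary(model)` output to a single applicant. The applicant's values are hypothetical and chosen only for illustration; the short names in `coefs` are labels introduced here, not column names from the dataset.

```r
# Coefficient estimates copied from the summary(model) output
coefs <- c(
  intercept         = -1.2594325,
  gre_score         =  0.0017374,
  toefl_score       =  0.0029196,
  university_rating =  0.0057167,
  sop               = -0.0033052,
  lor               =  0.0223531,
  cgpa              =  0.1189395,
  research_yes      =  0.0245251
)

# Hypothetical applicant (illustrative values, not a row from the dataset)
applicant <- c(
  intercept = 1, gre_score = 320, toefl_score = 110, university_rating = 3,
  sop = 3.5, lor = 3.5, cgpa = 8.5, research_yes = 1
)

# Linear predictor: intercept plus each coefficient times the applicant's value
predicted_chance <- sum(coefs * applicant[names(coefs)])
print(round(predicted_chance, 3))
```

For this applicant the predicted chance of admission is about 0.74. Holding everything else fixed, one extra CGPA point adds roughly 0.12 to the prediction, while research experience adds roughly 0.02.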

Summary and Conclusion

The study examined the association between various factors and diabetes outcomes in a sample of individuals. Age and number of pregnancies were positively correlated, and diabetes status did not affect this relationship. BMI and triceps skin fold thickness also showed a positive relationship, suggesting that obesity may be a risk factor for diabetes. Insulin levels at the 2-hour mark were skewed, with some individuals having much higher values than others. The oral glucose tolerance test results contained one extreme outlier that fell outside the normal range and requires further investigation. Overall, the study provided insight into the factors that influence diabetes outcomes and highlighted the need for more research on the outlier and on the mechanisms behind the observed associations.

The findings from the United States University Admission dataset provide a comprehensive view of the factors positively associated with admission to US universities. They emphasize the importance of a well-rounded application, encompassing not only academic achievements but also recommendations, research experience, and language proficiency.

References

American Diabetes Association. Diagnosis and Classification of Diabetes Mellitus. Diabetes Care. 2009;32: S62. doi:10.2337/DC09-S062
WHO. Diabetes. [cited 5 Jan 2024]. Available: https://www.who.int/news-room/fact-sheets/detail/diabetes
Leon BM, Maddox TM. Diabetes and cardiovascular disease: Epidemiology, biological mechanisms, treatment recommendations and future research. World Journal of Diabetes. 2015;6: 1246. doi:10.4239/WJD.V6.I13.1246
Yan Z, Cai M, Han X, Chen Q, Lu H. The Interaction Between Age and Risk Factors for Diabetes and Prediabetes: A Community-Based Cross-Sectional Study. Diabetes, Metabolic Syndrome and Obesity. 2023;16: 85. doi:10.2147/DMSO.S390857
Wang H, Schultz JL, Huang Z. English language proficiency, prior knowledge, and student success in an international Chinese accounting program. Heliyon. 2023;9: 2405–8440. doi:10.1016/J.HELIYON.2023.E18596

Data Science Project

Explore diabetes-related factors and predictive modeling of admittance into a masters graduate program at the United States University: Machine learning approaches

Nasif Hossain

2024-01-09
