Data can be gathered and displayed in many different ways, and it becomes far more powerful when displayed properly. Histograms, boxplots, and scatterplots are among the most commonly used data visualization methods, and they help you quickly see how variables are distributed and related to each other.
Histograms:
A histogram is a type of bar graph that shows how often each value, or range of values, occurs, also called the frequency. When visualizing a single numerical variable, a histogram is our go-to tool; it can be created in R with the hist() function and is used to check the distribution of continuous data.
Boxplots: A boxplot is an alternative to a histogram for visualizing a single numerical variable. It pictures the distribution of continuous data by splitting the values into four parts (quartiles) around the median, and it is also useful for comparing the distribution of data across groups or data sets.
Scatterplots: A scatterplot is used to visualize the relationship between two continuous variables. Researchers frequently use it to compare pairs of values and see whether they are related.
hist(mtcars$mpg, xlab = "Miles/gallon",
main = "Histogram of MPG (mtcars)",
breaks = 12, col = "lightseagreen",
border = "darkorange")
This histogram shows the distribution of miles per gallon for the cars in mtcars, with values ranging from roughly 10 to 34. The most frequent values occur around 15 miles per gallon, and the overall shape is roughly bell-shaped, close to a normal distribution.
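The hist() call above does not itself draw a curve; if we want to compare the bars against a fitted normal density, one possible sketch (reusing mtcars$mpg) is:
# Re-draw the histogram on the density scale and overlay a normal curve
# fitted to the sample mean and standard deviation of mpg
hist(mtcars$mpg, xlab = "Miles/gallon",
     main = "Histogram of MPG (mtcars)",
     breaks = 12, col = "lightseagreen",
     border = "darkorange", freq = FALSE)
curve(dnorm(x, mean = mean(mtcars$mpg), sd = sd(mtcars$mpg)),
      add = TRUE, lwd = 2)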
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
This boxplot shows the distribution of miles per gallon for cars grouped by number of cylinders (4, 6 and 8). Cars with 4 cylinders give roughly 20 to 30 miles per gallon, cars with 6 cylinders roughly 18 to 22, and cars with 8 cylinders roughly 13 to 19.
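The per-group ranges quoted above can be checked numerically, for example with a five-number summary of mpg for each cylinder count:
# Five-number summary of miles per gallon for each cylinder group
tapply(mtcars$mpg, mtcars$cyl, summary)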
plot(mpg ~ disp, data = mtcars,
xlab = "Displacement",
ylab = "Miles Per Gallon",
main = "MPG vs Displacement",
pch = 20,
cex = 2,
col = "red")
This scatterplot of displacement versus miles per gallon shows a roughly linear, negative association between the two variables: as engine displacement increases, miles per gallon decreases.
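The strength of this negative association can be quantified with the sample correlation between the two variables; a value close to -1 indicates a strong negative linear relationship:
# Pearson correlation between engine displacement and miles per gallon
cor(mtcars$disp, mtcars$mpg)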
#Import and describe data
data <- read.csv("datasetAA2.csv")
dim(data)
## [1] 768 12
str(data)
## 'data.frame': 768 obs. of 12 variables:
## $ ï..Pregnant: int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BP : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DPFunction : num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ HasDiabetes: int 1 0 1 0 1 0 1 0 1 1 ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : int 1 0 NA NA NA NA NA NA NA NA ...
## $ X.2 : chr "yes" "no" "" "" ...
The dataset contains medical information used to predict whether a person has diabetes. It has 12 attributes and 768 rows. The attributes are Pregnant, Glucose, BP, SThickness, Insulin, BMI, DPFunction, Age, HasDiabetes, X, X.1 and X.2. Of these, 8 attributes (Pregnant, Glucose, BP, SThickness, Insulin, BMI, DPFunction and Age) are predictors for the target HasDiabetes. All predictor variables are numerical and the target (HasDiabetes) is binary (0 or 1). The attributes X, X.1 and X.2 are deemed irrelevant to the prediction.
The problem is a classification task: determine whether a person has diabetes (HasDiabetes = 1) or not (HasDiabetes = 0).
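Before modelling, it is also worth checking how balanced the two classes of the target are; a quick sketch on the raw data (not part of the original cleaning steps):
# Count and proportion of each HasDiabetes class (0 = no diabetes, 1 = diabetes)
table(data$HasDiabetes)
prop.table(table(data$HasDiabetes))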
The data is cleaned by checking for missing values and removing attributes irrelevant to the problem, using the dplyr library.
# remove attributes irrelevant to the problem
new_data <- dplyr::select(data, -c("X", "X.1", "X.2"))
dim(new_data)
## [1] 768 9
str(new_data)
## 'data.frame': 768 obs. of 9 variables:
## $ ï..Pregnant: int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BP : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DPFunction : num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ HasDiabetes: int 1 0 1 0 1 0 1 0 1 1 ...
#Check for NA values
sum(is.na(new_data))
## [1] 0
There are no missing values in the data.
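As an aside, the first column is imported as ï..Pregnant because the CSV file begins with a UTF-8 byte-order mark. We leave the name untouched here so the remaining output matches, but a one-line fix (assuming the intended name is simply Pregnant) would be:
# Rename the column mangled by the byte-order mark
names(new_data)[1] <- "Pregnant"
# (Alternatively, re-reading the file with read.csv("datasetAA2.csv", fileEncoding = "UTF-8-BOM")
#  strips the byte-order mark at import time.)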
The usual practice in machine learning is to split the dataset into a training set and a test set: the model is built on the training set and evaluated on the test set, which it has not been exposed to before. To ensure that both samples are truly representative of the dataset, we check the proportions of the data split (a quick class-balance check is sketched after the split code below).
# Split Data into training and testing set
set.seed(333)
trainingRowIndex <- sample(1:nrow(new_data), 0.75*nrow(new_data)) # row indices for training data
train_data <- new_data[trainingRowIndex, ] # model training data
dim(train_data) # Training data characteristics
## [1] 576 9
test_data <- new_data[-trainingRowIndex, ] # test data
dim(test_data) # Testing data characteristics
## [1] 192 9
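To see whether the two samples are representative, we can compare the proportion of diabetic cases in the training and test sets; similar proportions suggest the random split preserves the class balance of the full dataset:
# Class balance of the target in each split
prop.table(table(train_data$HasDiabetes))
prop.table(table(test_data$HasDiabetes))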
We will build a logistic regression model to predict the binary outcome HasDiabetes from the other attributes.
#Logistic regression model - classification
#Train a logistic regression model:
lr_model <- glm(HasDiabetes ~ ., data = train_data, family = "binomial")
summary(lr_model)
##
## Call:
## glm(formula = HasDiabetes ~ ., family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6107 -0.7174 -0.4184 0.7325 3.0273
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.6019083 0.8434427 -10.199 < 2e-16 ***
## ï..Pregnant 0.1479372 0.0370616 3.992 6.56e-05 ***
## Glucose 0.0368599 0.0043295 8.514 < 2e-16 ***
## BP -0.0100965 0.0060887 -1.658 0.09727 .
## SThickness 0.0003084 0.0080918 0.038 0.96959
## Insulin -0.0008167 0.0010584 -0.772 0.44030
## BMI 0.0857861 0.0172899 4.962 6.99e-07 ***
## DPFunction 1.0553938 0.3486057 3.027 0.00247 **
## Age 0.0055587 0.0109860 0.506 0.61287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 745.11 on 575 degrees of freedom
## Residual deviance: 540.77 on 567 degrees of freedom
## AIC: 558.77
##
## Number of Fisher Scoring iterations: 5
car::vif(lr_model)
## ï..Pregnant Glucose BP SThickness Insulin BMI
## 1.448134 1.223484 1.184018 1.548213 1.494478 1.208247
## DPFunction Age
## 1.032941 1.522515
We assess multicollinearity by computing the variance inflation factor (VIF), which measures how much the variance of a regression coefficient is inflated by collinearity with the other predictors in the model. The smallest possible value of the VIF is 1 (no multicollinearity), and values above 5 or 10 indicate a problematic amount of collinearity. All VIF values for our model stay well below 5.
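For intuition, the classical definition of the VIF for a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. A sketch for Glucose (note that car::vif() on a glm is computed from the coefficient covariance matrix, so its values can differ slightly from this linear-model version):
# Auxiliary regression of Glucose on the remaining predictors,
# then VIF = 1 / (1 - R^2) of that regression
aux <- lm(Glucose ~ . - HasDiabetes, data = train_data)
1 / (1 - summary(aux)$r.squared)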
Predictions are then made on the test data using the logistic regression model that was fitted on the training data. The probability cutoff is set at 0.5 to convert the predicted probabilities into predicted classes.
#Test model
#Make predictions on testing data, using trained model:
test_data$pred <- predict(lr_model, newdata = test_data, type = 'response')
head(test_data)
## ï..Pregnant Glucose BP SThickness Insulin BMI DPFunction Age HasDiabetes
## 10 8 125 96 0 0 0.0 0.232 54 1
## 11 4 110 92 0 0 37.6 0.191 30 0
## 13 10 139 80 0 0 27.1 1.441 57 0
## 16 7 100 0 0 0 30.0 0.484 32 1
## 19 1 103 30 38 83 43.3 0.183 33 0
## 20 1 115 70 30 96 34.6 0.529 32 1
## pred
## 10 0.03786354
## 11 0.21575962
## 13 0.79506666
## 16 0.35021739
## 19 0.28388895
## 20 0.21642695
ProbabilityCutoff <- 0.5
test_data$PredictedHasDiabetes <- ifelse(test_data$pred > ProbabilityCutoff, 1, 0)
# Make confusion matrix:
ConfusionMatrix <- with(test_data,table(test_data$HasDiabetes,PredictedHasDiabetes))
print(ConfusionMatrix)
## PredictedHasDiabetes
## 0 1
## 0 114 11
## 1 33 34
#Calculate accuracy:
CorrectPredictions <- ConfusionMatrix[1,1] + ConfusionMatrix[2,2]
TotalPatients <- nrow(test_data)
Accuracy <- CorrectPredictions/TotalPatients
AccuracyPercentage <- round((Accuracy*100), digits = 2)
print(paste("Accuracy of the Model =",AccuracyPercentage,"%"))
## [1] "Accuracy of the Model = 77.08 %"
The logistic regression (classification) model performed reasonably well, achieving 77.08% accuracy (114 + 34 = 148 correct predictions out of 192 test patients), as calculated from the confusion matrix of actual versus predicted HasDiabetes values.