R Markdown

Question A : Uses of a histogram, a boxplot, and a scatterplot.

A.1 Describe the uses of a histogram, a boxplot, and a scatterplot.

Data can be gathered and displayed in many different ways, and it becomes far more informative when displayed properly. Histograms, boxplots and scatterplots are among the most commonly used data-visualization tools, helping you quickly see how variables are distributed and how they relate to each other.

Histograms:

A histogram is a type of bar graph that shows how often values occur, i.e. their frequency. When visualizing a single numerical variable, a histogram is our go-to tool; it can be created in R using the hist() function and is useful for checking the distribution of continuous data.

Boxplots: We can also use a boxplot as an alternative to a histogram for visualizing a single numerical variable. It pictures the distribution of continuous data by summarising it with the median and the quartiles, which divide the values into four parts. It is also useful for comparing the distribution of a variable across groups or data sets.

Scatterplots: A scatterplot is used to visualize the relationship between two continuous variables. It is frequently used by researchers to compare pairs of values and see whether they are related.

A.2 Write R codes to produce a histogram for the cars dataset and analyse.

hist(mtcars$mpg, xlab = "Miles/gallon",
     main = "Histogram of MPG (mtcars)",
     breaks = 12, col = "lightseagreen",
     border = "darkorange")

This is a histogram of the distribution of miles per gallon for the cars in mtcars, with values ranging from roughly 10 to 34 MPG. The tallest bar (the peak of the distribution) occurs around 15 MPG, and the shape is unimodal and roughly bell-shaped with a slight right skew rather than perfectly normal.
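These readings can be cross-checked against the numerical summary of the variable (a small sketch, not required by the question):

# Sketch: numerical summary of mpg to back up the histogram reading
summary(mtcars$mpg)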

A.3 Write R codes to produce a boxplot for the cars dataset and analyse.

boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")

This boxplot depicts the distribution of Miles Per Gallon grouped by the number of cylinders in a car (4, 6 and 8), with mileage ranging from roughly 10 to 34 MPG overall. Cars with 4 cylinders give roughly 21 to 34 MPG, cars with 6 cylinders give roughly 18 to 21 MPG, and cars with 8 cylinders give roughly 10 to 19 MPG.
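To double-check these per-group ranges numerically, one could summarise mpg within each cylinder group (a small sketch, not required by the question):

# Sketch: five-number summary of mpg for each cylinder count
tapply(mtcars$mpg, mtcars$cyl, summary)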

A.4 Write R codes to produce a scatterplot for the cars dataset and analyse.

plot(mpg ~ disp, data = mtcars,
     xlab = "Displacement",
     ylab = "Miles Per Gallon",
     main = "MPG vs Displacement",
     pch = 20,
     cex = 2,
     col = "red")

This scatterplot of Displacement vs Miles Per Gallon shows a roughly linear, negative association between the two variables: as engine displacement (the engine's swept volume, in cubic inches) increases, miles per gallon decreases.
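To quantify this negative association, one could additionally compute the correlation coefficient between the two variables (a small sketch; a value close to -1 would indicate a strong negative linear relationship):

# Sketch: Pearson correlation between displacement and mpg (expected to be strongly negative)
cor(mtcars$disp, mtcars$mpg)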

Question B : Study the dataset datasetAA2.csv.

#Import and describe data

data <- read.csv("datasetAA2.csv")

dim(data)
## [1] 768  12
str(data)
## 'data.frame':    768 obs. of  12 variables:
##  $ ï..Pregnant: int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose    : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BP         : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SThickness : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin    : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI        : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DPFunction : num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age        : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ HasDiabetes: int  1 0 1 0 1 0 1 0 1 1 ...
##  $ X          : logi  NA NA NA NA NA NA ...
##  $ X.1        : int  1 0 NA NA NA NA NA NA NA NA ...
##  $ X.2        : chr  "yes" "no" "" "" ...

B.1 Explain (describe) what you understand from the dataset.

The dataset contains medical information that can be used to predict whether a person has diabetes or not. It contains 12 attributes and 768 rows. The attributes are Pregnant, Glucose, BP, SThickness, Insulin, BMI, DPFunction, Age, HasDiabetes, X, X.1 and X.2. Of these 12 attributes, 8 (Pregnant, Glucose, BP, SThickness, Insulin, BMI, DPFunction and Age) are predictors for the target HasDiabetes. All predictor variables are numerical and the target (HasDiabetes) is binary (0 or 1). The attributes X, X.1 and X.2 are deemed irrelevant to the prediction and will be removed.

B.2 What is the nature (type) of the problem?

The nature (type) of the problem is classification: to determine whether a person has diabetes (HasDiabetes = 1) or not (HasDiabetes = 0).
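As a quick sanity check of this framing (a small sketch, not part of the required steps), the two classes of the binary target can be counted directly:

# Sketch: count how many rows fall into each class of the binary target
table(data$HasDiabetes)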

B.3 Write R codes to show the steps from reading the dataset to modeling and analyzing a suitable model to solve the problem. Briefly describe each step taken.

Data cleaning

The data is cleaned by checking for missing values and removing the attributes that are irrelevant to the problem, using the dplyr library.

# remove attributes irrelevant to the problem
new_data <- dplyr::select(data, -c("X", "X.1", "X.2"))

dim(new_data)
## [1] 768   9
str(new_data)
## 'data.frame':    768 obs. of  9 variables:
##  $ ï..Pregnant: int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose    : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BP         : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SThickness : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin    : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI        : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DPFunction : num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age        : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ HasDiabetes: int  1 0 1 0 1 0 1 0 1 1 ...
#Check for NA values
sum(is.na(new_data))
## [1] 0

There are no missing (NA) values recorded in the data, although some zero values (e.g. BP = 0 or BMI = 0 in the rows shown above) may in practice represent missing measurements.
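One additional cleanup that could optionally be applied: the first column name appears as ï..Pregnant because the CSV was read with a UTF-8 byte-order mark. A minimal sketch of fixing this is shown below (it is not applied to the results that follow, which keep the original name; "Pregnant" is assumed to be the intended column name):

# Sketch: handle the UTF-8 BOM so the first column is named "Pregnant"
# either re-read the file with an explicit encoding ...
data <- read.csv("datasetAA2.csv", fileEncoding = "UTF-8-BOM")
# ... or simply rename the affected column in the cleaned data frame
names(new_data)[1] <- "Pregnant"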

Splitting the dataset into the Training set and Test set

The usual practice in machine learning is to split the dataset into a training set and a test set. The model is built on the training set and evaluated on the test set, which the model has not been exposed to before. To help ensure that both samples are representative of the dataset, we also check the proportions of the split (see the sketch after the split below).

# Split Data into training and testing set
library(caTools)   # loaded for data-splitting utilities; the split below uses base sample()

set.seed(333)

trainingRowIndex <- sample(1:nrow(new_data), 0.75*nrow(new_data))  # row indices for training data
train_data <- new_data[trainingRowIndex, ]  # model training data
dim(train_data)                                 # Training data characteristics
## [1] 576   9
test_data  <- new_data[-trainingRowIndex, ]   # test data
dim(test_data)                                    # Testing data characteristics
## [1] 192   9
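As a small sketch of the proportion check mentioned above, the class balance of the target in the training and test sets can be compared; similar proportions suggest the split is representative:

# Sketch: compare the proportion of diabetic vs non-diabetic cases
# in the training and test sets
prop.table(table(train_data$HasDiabetes))
prop.table(table(test_data$HasDiabetes))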

Logistic regression model for Classification

We will build a logistic regression model to predict the binary outcome HasDiabetes from the other attributes.

#Logistic regression model - classification
#Train the logistic regression model:

lr_model <- glm(HasDiabetes ~ ., data = train_data, family = "binomial")

summary(lr_model)
## 
## Call:
## glm(formula = HasDiabetes ~ ., family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6107  -0.7174  -0.4184   0.7325   3.0273  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.6019083  0.8434427 -10.199  < 2e-16 ***
## ï..Pregnant  0.1479372  0.0370616   3.992 6.56e-05 ***
## Glucose      0.0368599  0.0043295   8.514  < 2e-16 ***
## BP          -0.0100965  0.0060887  -1.658  0.09727 .  
## SThickness   0.0003084  0.0080918   0.038  0.96959    
## Insulin     -0.0008167  0.0010584  -0.772  0.44030    
## BMI          0.0857861  0.0172899   4.962 6.99e-07 ***
## DPFunction   1.0553938  0.3486057   3.027  0.00247 ** 
## Age          0.0055587  0.0109860   0.506  0.61287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 745.11  on 575  degrees of freedom
## Residual deviance: 540.77  on 567  degrees of freedom
## AIC: 558.77
## 
## Number of Fisher Scoring iterations: 5
car::vif(lr_model)
## ï..Pregnant     Glucose          BP  SThickness     Insulin         BMI 
##    1.448134    1.223484    1.184018    1.548213    1.494478    1.208247 
##  DPFunction         Age 
##    1.032941    1.522515

We assessed multicollinearity by computing the variance inflation factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. The smallest possible value of VIF is 1 (no multicollinearity), and a VIF exceeding 5 or 10 indicates a problematic amount of collinearity. All VIF values in our model stay well below 5.

Predictions are then made on the test data using the logistic regression model that was trained on the training data. The probability cutoff is set at 0.5 for predicting the target variable.

#Test model
#Make predictions on testing data, using trained model:

test_data$pred <- predict(lr_model, newdata = test_data, type = 'response')
head(test_data)
##    ï..Pregnant Glucose BP SThickness Insulin  BMI DPFunction Age HasDiabetes
## 10           8     125 96          0       0  0.0      0.232  54           1
## 11           4     110 92          0       0 37.6      0.191  30           0
## 13          10     139 80          0       0 27.1      1.441  57           0
## 16           7     100  0          0       0 30.0      0.484  32           1
## 19           1     103 30         38      83 43.3      0.183  33           0
## 20           1     115 70         30      96 34.6      0.529  32           1
##          pred
## 10 0.03786354
## 11 0.21575962
## 13 0.79506666
## 16 0.35021739
## 19 0.28388895
## 20 0.21642695
ProbabilityCutoff <- 0.5 

test_data$PredictedHasDiabetes <- ifelse(test_data$pred > ProbabilityCutoff, 1, 0)

B.4 Explain how the performance of the model that is built is measured.

# Make confusion matrix:

ConfusionMatrix <- with(test_data,table(test_data$HasDiabetes,PredictedHasDiabetes))
print(ConfusionMatrix)
##    PredictedHasDiabetes
##       0   1
##   0 114  11
##   1  33  34
#Calculate accuracy:

CorrectPredictions <- ConfusionMatrix[1,1] + ConfusionMatrix[2,2]
TotalPatients <- nrow(test_data)

Accuracy <- CorrectPredictions/TotalPatients

AccuracyPercentage <- round((Accuracy*100), digits = 2)

print(paste("Accuracy of the Model =",AccuracyPercentage,"%"))
## [1] "Accuracy of the Model = 77.08 %"

The logistic regression (classification) model performed reasonably well, showing 77.08% accuracy on the test set, calculated from the confusion matrix between the actual and predicted HasDiabetes attribute.