This is the IS 6030 Homework #5. It includes doing exploratory data analysis on an open data using SQL, report generation through R and doing data vizualisation through Tableau. SQL queries are presented under the ‘Data Statistics’ section. Analysis of plots through R and Tableau is presented under the Data Modelling section.
The Pima Indians Diabetes dataset has been used and was extracted from the Kaggle website. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.
The Diabetes dataset contains the following information:
A preview of the file has been presented below:
library(readxl)
Diabetes_data <- read_excel("C:/Manisha_Arora/UC-BANA/Sem1/Data Mgt/Week-8/Diabetes_data.xlsx")
head(Diabetes_data)
## # A tibble: 6 × 10
## `Record ID` Pregnancies Glucose BloodPressure SkinThickness Insulin
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 R1 6 148 72 35 0
## 2 R2 1 85 66 29 0
## 3 R3 8 183 64 0 0
## 4 R4 1 89 66 23 94
## 5 R5 0 137 40 35 168
## 6 R6 5 116 74 0 0
## # ... with 4 more variables: BMI <dbl>, DiabetesPedigreeFunction <dbl>,
## # Age <dbl>, Outcome <dbl>
A summary of the dataset has been presented below:
summary(Diabetes_data)
## Record ID Pregnancies Glucose BloodPressure
## Length:768 Min. : 0.000 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00
## Mode :character Median : 3.000 Median :117.0 Median : 72.00
## Mean : 3.845 Mean :120.9 Mean : 69.11
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00
## Max. :17.000 Max. :199.0 Max. :122.00
## SkinThickness Insulin BMI DiabetesPedigreeFunction
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. :0.0780
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437
## Median :23.00 Median : 30.5 Median :32.00 Median :0.3725
## Mean :20.54 Mean : 79.8 Mean :31.99 Mean :0.4719
## 3rd Qu.:32.00 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262
## Max. :99.00 Max. :846.0 Max. :67.10 Max. :2.4200
## Age Outcome
## Min. :21.00 Min. :0.000
## 1st Qu.:24.00 1st Qu.:0.000
## Median :29.00 Median :0.000
## Mean :33.24 Mean :0.349
## 3rd Qu.:41.00 3rd Qu.:1.000
## Max. :81.00 Max. :1.000
The dataset is not normalized because all the data is contained in a single table. This dataset can be normalized by breaking the dataset into two different tables as follows:
Find below some data observations:
library(RODBC)
library(sqldf)
Local <- odbcConnect("SQL_ODBC", uid = "", pwd = "")
# What is the record count?
sqldf("Select count([Record ID]) from Diabetes_data")
count([Record ID])
1 768
# What is the count for each type of outcome?
sqldf("Select Outcome, count([Record ID]) as CountData from Diabetes_data
group by Outcome")
Outcome CountData
1 0 500
2 1 268
# What is the average age for each type of outcome?
sqldf("Select round(AVG(Age),0) as AvgAge from Diabetes_data
where Outcome=1")
AvgAge
1 37
sqldf("Select round(AVG(Age),0) as AvgAge from Diabetes_data
where Outcome=0")
AvgAge
1 31
# What is the minimum and maximum age of Diabetic and non-diabetic patients?
sqldf("Select Outcome, min(Age) as MinAge, max(Age) as MaxAge from Diabetes_data
group by Outcome")
Outcome MinAge MaxAge
1 0 21 81
2 1 21 70
# What is the Average number of pregnancies of Diabetic patients?
sqldf("Select Outcome, round(AVG(Pregnancies),0) as AvgPreg from Diabetes_data
group by Outcome")
Outcome AvgPreg
1 0 3
2 1 5
# For each type of Outcome, list the Average BMI when age is greater than average age
sqldf("Select round(Avg(BMI),2) as AvgBMI, Age from
(Select * from Diabetes_data a,
(Select round(AVG(Age),0) as AvgAge from Diabetes_data
where Outcome=1) b) c
where Age > AvgAge
group by Age
order by Age")
AvgBMI Age
1 35.57 38
2 31.98 39
3 33.54 40
4 35.26 41
5 34.98 42
6 36.89 43
7 34.16 44
8 34.96 45
9 34.52 46
10 34.57 47
11 29.98 48
12 32.02 49
13 31.22 50
14 33.98 51
15 33.48 52
16 30.50 53
17 30.80 54
18 27.02 55
19 31.70 56
20 29.70 57
21 32.43 58
22 26.97 59
23 28.74 60
24 30.00 61
25 28.95 62
26 30.77 63
27 25.00 64
28 31.60 65
29 30.38 66
30 28.77 67
31 35.60 68
32 13.40 69
33 32.50 70
34 19.60 72
35 25.90 81
sqldf("Select round(Avg(BMI),2) as AvgBMI, Age from
(Select * from Diabetes_data a,
(Select round(AVG(Age),0) as AvgAge from Diabetes_data
where Outcome=0) b) c
where Age > AvgAge
group by Age
order by Age")
AvgBMI Age
1 32.32 32
2 32.34 33
3 31.16 34
4 33.78 35
5 31.72 36
6 32.08 37
7 35.57 38
8 31.98 39
9 33.54 40
10 35.26 41
11 34.98 42
12 36.89 43
13 34.16 44
14 34.96 45
15 34.52 46
16 34.57 47
17 29.98 48
18 32.02 49
19 31.22 50
20 33.98 51
21 33.48 52
22 30.50 53
23 30.80 54
24 27.02 55
25 31.70 56
26 29.70 57
27 32.43 58
28 26.97 59
29 28.74 60
30 30.00 61
31 28.95 62
32 30.77 63
33 25.00 64
34 31.60 65
35 30.38 66
36 28.77 67
37 35.60 68
38 13.40 69
39 32.50 70
40 19.60 72
41 25.90 81
odbcCloseAll()
library(dplyr)
library(ggplot2)
subset(Diabetes_data, Diabetes_data$BMI > 0 & Diabetes_data$Glucose > 0) %>%
ggplot(mapping=aes(x=Glucose, y=BMI, color=Outcome)) +
geom_point() +
geom_smooth(method=lm) +
ggtitle ("Scatter plot of Glocuse vs BMI for different values of Outcome")
We observe from the graph above that people with higher Glucose and BMI levels are more prone to be diabetic.
library(dplyr)
library(ggplot2)
Diabetes_data %>%
ggplot(mapping=aes(x=Age, y=Insulin)) +
geom_smooth() +
facet_wrap(~Outcome) +
ggtitle ("Plot of Age vs Insulin")
From the above graph, we can observe that the insulin level of diabetic patients is relatively higher than insulin level of non-diabetic patients.
We plot Glucose and Age of diabetic patients in Tableau to gain further insights.
Tableau - Age vs Outcome analysis
Tableau - Outcome vs Glucose analysis, filtered by Age
We observe that maximum number of diabetic patients known are around the age of 70. Further, the Glucose level of diabetic patients is higher than that of non-diabetic patients.
From the analysis and the plots, we can conclude that:
The challenging part of this project was the use of facet function in R, to analyze the impact of Age and Insulin on the diabetes prediction. The aim was to separately analyze the correlation between Age and Insulin, for different values of Outcome. This was overcome by using the “facet_wap” function in R, the details of which I found online.