Data Management Project Report

Synopsis

This is the IS 6030 Homework #5. It includes doing exploratory data analysis on an open data using SQL, report generation through R and doing data vizualisation through Tableau. SQL queries are presented under the ‘Data Statistics’ section. Analysis of plots through R and Tableau is presented under the Data Modelling section.

Data Description

The Pima Indians Diabetes dataset has been used and was extracted from the Kaggle website. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.

Data Overview

The Diabetes dataset contains the following information:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)

A preview of the file has been presented below:

library(readxl)
Diabetes_data <- read_excel("C:/Manisha_Arora/UC-BANA/Sem1/Data Mgt/Week-8/Diabetes_data.xlsx")
head(Diabetes_data)

## # A tibble: 6 × 10
##   `Record ID` Pregnancies Glucose BloodPressure SkinThickness Insulin
##         <chr>       <dbl>   <dbl>         <dbl>         <dbl>   <dbl>
## 1          R1           6     148            72            35       0
## 2          R2           1      85            66            29       0
## 3          R3           8     183            64             0       0
## 4          R4           1      89            66            23      94
## 5          R5           0     137            40            35     168
## 6          R6           5     116            74             0       0
## # ... with 4 more variables: BMI <dbl>, DiabetesPedigreeFunction <dbl>,
## #   Age <dbl>, Outcome <dbl>

A summary of the dataset has been presented below:

summary(Diabetes_data)

##   Record ID          Pregnancies        Glucose      BloodPressure   
##  Length:768         Min.   : 0.000   Min.   :  0.0   Min.   :  0.00  
##  Class :character   1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00  
##  Mode  :character   Median : 3.000   Median :117.0   Median : 72.00  
##                     Mean   : 3.845   Mean   :120.9   Mean   : 69.11  
##                     3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00  
##                     Max.   :17.000   Max.   :199.0   Max.   :122.00  
##  SkinThickness      Insulin           BMI        DiabetesPedigreeFunction
##  Min.   : 0.00   Min.   :  0.0   Min.   : 0.00   Min.   :0.0780          
##  1st Qu.: 0.00   1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437          
##  Median :23.00   Median : 30.5   Median :32.00   Median :0.3725          
##  Mean   :20.54   Mean   : 79.8   Mean   :31.99   Mean   :0.4719          
##  3rd Qu.:32.00   3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262          
##  Max.   :99.00   Max.   :846.0   Max.   :67.10   Max.   :2.4200          
##       Age           Outcome     
##  Min.   :21.00   Min.   :0.000  
##  1st Qu.:24.00   1st Qu.:0.000  
##  Median :29.00   Median :0.000  
##  Mean   :33.24   Mean   :0.349  
##  3rd Qu.:41.00   3rd Qu.:1.000  
##  Max.   :81.00   Max.   :1.000

Normalizing Data

The dataset is not normalized because all the data is contained in a single table. This dataset can be normalized by breaking the dataset into two different tables as follows:

A table containing Record ID and the Diabetes outcome
A table containing Record ID with other numerical parameters such as Pregnancies, insulin level, glucose level, skin thickness etc

Data Issues

Find below some data observations:

The data seems to be pretty clean, with no missing values.
There are no duplicates in the dataset.
However, there are a few parameters, the details for which are not known. For instance, the Diabetes Pedigree Function is something unheard of before.

Data Statistics

library(RODBC)
library(sqldf)
Local <- odbcConnect("SQL_ODBC", uid = "", pwd = "")

# What is the record count?
sqldf("Select count([Record ID]) from Diabetes_data")

  count([Record ID])
1                768

# What is the count for each type of outcome?
sqldf("Select Outcome, count([Record ID]) as CountData from Diabetes_data
group by Outcome")

  Outcome CountData
1       0       500
2       1       268

# What is the average age for each type of outcome?
sqldf("Select round(AVG(Age),0) as AvgAge from Diabetes_data
where Outcome=1")

  AvgAge
1     37

sqldf("Select round(AVG(Age),0) as AvgAge from Diabetes_data
where Outcome=0")

  AvgAge
1     31

# What is the minimum and maximum age of Diabetic and non-diabetic patients?
sqldf("Select Outcome, min(Age) as MinAge, max(Age) as MaxAge from Diabetes_data
group by Outcome")

  Outcome MinAge MaxAge
1       0     21     81
2       1     21     70

# What is the Average number of pregnancies of Diabetic patients?
sqldf("Select Outcome, round(AVG(Pregnancies),0) as AvgPreg from Diabetes_data
group by Outcome")

  Outcome AvgPreg
1       0       3
2       1       5

# For each type of Outcome, list the Average BMI when age is greater than average age
sqldf("Select round(Avg(BMI),2) as AvgBMI, Age from
(Select * from Diabetes_data a,
(Select round(AVG(Age),0) as AvgAge from Diabetes_data
where Outcome=1) b) c
where Age > AvgAge
group by Age
order by Age")

   AvgBMI Age
1   35.57  38
2   31.98  39
3   33.54  40
4   35.26  41
5   34.98  42
6   36.89  43
7   34.16  44
8   34.96  45
9   34.52  46
10  34.57  47
11  29.98  48
12  32.02  49
13  31.22  50
14  33.98  51
15  33.48  52
16  30.50  53
17  30.80  54
18  27.02  55
19  31.70  56
20  29.70  57
21  32.43  58
22  26.97  59
23  28.74  60
24  30.00  61
25  28.95  62
26  30.77  63
27  25.00  64
28  31.60  65
29  30.38  66
30  28.77  67
31  35.60  68
32  13.40  69
33  32.50  70
34  19.60  72
35  25.90  81

sqldf("Select round(Avg(BMI),2) as AvgBMI, Age from
(Select * from Diabetes_data a,
(Select round(AVG(Age),0) as AvgAge from Diabetes_data
where Outcome=0) b) c
where Age > AvgAge
group by Age
order by Age")

   AvgBMI Age
1   32.32  32
2   32.34  33
3   31.16  34
4   33.78  35
5   31.72  36
6   32.08  37
7   35.57  38
8   31.98  39
9   33.54  40
10  35.26  41
11  34.98  42
12  36.89  43
13  34.16  44
14  34.96  45
15  34.52  46
16  34.57  47
17  29.98  48
18  32.02  49
19  31.22  50
20  33.98  51
21  33.48  52
22  30.50  53
23  30.80  54
24  27.02  55
25  31.70  56
26  29.70  57
27  32.43  58
28  26.97  59
29  28.74  60
30  30.00  61
31  28.95  62
32  30.77  63
33  25.00  64
34  31.60  65
35  30.38  66
36  28.77  67
37  35.60  68
38  13.40  69
39  32.50  70
40  19.60  72
41  25.90  81

odbcCloseAll()

Data Modelling

library(dplyr)
library(ggplot2)

subset(Diabetes_data, Diabetes_data$BMI > 0 & Diabetes_data$Glucose > 0) %>%
  ggplot(mapping=aes(x=Glucose, y=BMI, color=Outcome)) +
  geom_point() +  
  geom_smooth(method=lm) + 
  ggtitle ("Scatter plot of Glocuse vs BMI for different values of Outcome")

We observe from the graph above that people with higher Glucose and BMI levels are more prone to be diabetic.

library(dplyr)
library(ggplot2)

Diabetes_data %>%
  ggplot(mapping=aes(x=Age, y=Insulin)) +
  geom_smooth() + 
  facet_wrap(~Outcome) +
  ggtitle ("Plot of Age vs Insulin")

From the above graph, we can observe that the insulin level of diabetic patients is relatively higher than insulin level of non-diabetic patients.

We plot Glucose and Age of diabetic patients in Tableau to gain further insights.

Tableau - Age vs Outcome analysis

Tableau - Outcome vs Glucose analysis, filtered by Age

We observe that maximum number of diabetic patients known are around the age of 70. Further, the Glucose level of diabetic patients is higher than that of non-diabetic patients.

Sumary

From the analysis and the plots, we can conclude that:

Out of a total of 768 records, 500 were not diabetic, whereas the rest 268 were recorded as diabetic patients.
The diabetic prediction depends on a number of parameters such as Insulin and Glucose levels, Age and BMI
The average age of diabetic patients is usually around 37, although significant differences have been observed. The average age of non-diabetic patients is 31.
The minimum age of a diabetic patient is 21 and the maximum age is 70
BMI and Glucose levels hold a strong correlation with diabetes outcome

Challenge

The challenging part of this project was the use of facet function in R, to analyze the impact of Age and Insulin on the diabetes prediction. The aim was to separately analyze the correlation between Age and Insulin, for different values of Outcome. This was overcome by using the “facet_wap” function in R, the details of which I found online.