Exploratory Data Analysis

NOTE: Before to procede please run the following commands install.packages(faraway), install.packages(ggplot2).

 # Most of the data we are going to work with are taken from faraway package

library(faraway) # call the package to use its built in  functions and data
library(ggplot2) # an amazing package to create graphs
data(pima)
head(pima) # just print the first 6 rows of the dataset 
##   pregnant glucose diastolic triceps insulin  bmi diabetes age test
## 1        6     148        72      35       0 33.6    0.627  50    1
## 2        1      85        66      29       0 26.6    0.351  31    0
## 3        8     183        64       0       0 23.3    0.672  32    1
## 4        1      89        66      23      94 28.1    0.167  21    0
## 5        0     137        40      35     168 43.1    2.288  33    1
## 6        5     116        74       0       0 25.6    0.201  30    0

To ensure a comprehensive understanding of the data, it is crucial to generate numerical summaries such as means, quantiles, standard deviations (SDs), maximum and minimum values. This process is essential in determining the integrity of the data and identifying any potential outliers or inconsistencies. As a statistician or data scientist, exploring the data should be the first step in problem-solving.

The dataset under consideration is derived from a study conducted by The National Institute of Diabetes and Digestive and Kidney Disease, which involved \(768\) adult female Pima Indians residing near Phoenix. The variables within the dataset include:

To initiate the exploration of this dataset, we can use the function summary().

summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           bmi           diabetes           age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##       test      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

There is something that doesn’t make sense to you ?

Yes, there is something that doesn’t make sense. The statement “No blood pressure is not good for health, it is virtually impossible for a patient to have no blood pressure LOL…” is incorrect. Blood pressure can be zero or near-zero for certain medical conditions, such as shock or cardiac arrest, and can also vary depending on the position of the body. However, it is highly unlikely for a healthy individual to have a blood pressure of zero.

Regarding the dataset description, it is possible that the value of zero has been used to indicate missing data, as it is a common practice to represent missing values as zeros in some datasets. It is also possible that the researchers did not obtain the blood pressures for some patients due to certain limitations or errors in data collection. It is important to check the data documentation and consult with experts in the field to gain a better understanding of the data and potential issues.

At this point it makes sense to denote the zero values as NA.

pima$diastolic[pima$diastolic==0] <- NA
pima$glucose[pima$glucose==0] <- NA
pima$triceps[pima$triceps==0] <- NA
pima$insulin[pima$insulin==0] <- NA
pima$bmi[pima$bmi==0] <- NA

The variable test is a categorical variable, also called factor. Therefore, we need to be sure that R treats qualitative variables as factors. Sometimes (even professional statistician) forget this and compute stupid statistics such as ‘average zip code’.

Formatting variables is not only important for summary statistics, but also when we move on to modeling. Don’t neglect the formatting phase, please :).

str(pima$test)
##  int [1:768] 1 0 1 0 1 0 1 0 1 1 ...

In this case, prime results to be an intergere… this does not makes sense so it is better to format it as factor.

pima$test <- factor(pima$test)
summary(pima$test)
##   0   1 
## 500 268

Now that is coded correctly, summary(test) makes more sense to all of us. We see that \(500\) cases were negative and \(268\) were positive. A way to make this more clear is to use descriptive labels.

levels(pima$test) <- c('negative', 'positive')
summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     insulin            bmi           diabetes           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##  NA's   :374      NA's   :11                                      
##        test    
##  negative:500  
##  positive:268  
##                
##                
##                
##                
## 

Now we are ready to explore a little bit further with some plots.

hist(pima$diastolic,
     xlab='Diastolic', 
     main='')

plot(density(pima$diastolic, na.rm=TRUE),
     main='Distolic Kernel plot ')

If you are unconfortable with the Kernel methods, this is an extraordinary resource to explore this topic further ClickHere.

The kernel plot effectively avoids the blockiness that can be distracting in a histogram. However, it is important to ensure appropriate bin specifications for the histogram and bandwidths for the kernel density plot. To understand the effect of bandwidths we can play with it.

plot(density(pima$diastolic, na.rm=TRUE, bw=1), 
     main='This is a too wiggly plot ')

plot(density(pima$diastolic, na.rm=TRUE, bw=3), 
     main='This looks better... more smooth ')

plot(density(pima$diastolic, na.rm=TRUE, bw=10), 
     main='This is perfect ')

The higher the bandwith, the smoother the density estimate will be. An in-depth discussion regarding kernel methods is out of the scope of this course. if you feel like you don’t know enough about Kernels, plese visit this extraordinary resource.

plot(diabetes ~ diastolic, 
     data=pima)

plot(diabetes ~ test, pima)

An alternative to the base plots in R is ggplot2 package. The essential elements of a plot made using this package are:

More information about how to visualize data properly are discussed in MA304-7

ggplot(data= pima,
       aes(x=diastolic))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 35 rows containing non-finite values (`stat_bin()`).

ggplot(data= pima,
       aes(x=diastolic))+
  geom_density()+
  ggtitle('Diastolic kernel density')
## Warning: Removed 35 rows containing non-finite values (`stat_density()`).

ggplot(data=pima, 
       aes(x=diastolic, y=diabetes, shape=test))+
  geom_point()
## Warning: Removed 35 rows containing missing values (`geom_point()`).

This is an example of a bivariate scatterplot (two dimensions) to which a third has been added. Can you think of ways to add dimension to this scatterplot?

To add dimensions to a scatterplot in R, various techniques can be employed. Some of the most common approaches are:

ggplot(data=pima, 
       aes(x=diastolic, y=diabetes))+
  geom_point(size=1)+
  facet_grid(~ test)
## Warning: Removed 35 rows containing missing values (`geom_point()`).

Exercises

Please attempt the following exercises. Recall that to attempt this questions you need to install faraway package first.

  1. The dataset teengamb concerns a astudy of teenage gambling in Britain ( ?teengamb , for further details about the data). Make a numerical and graphical summary of the data, commenting on any features you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.

  2. The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Make a numerical and graphical summary of the data

References