Exploratory Data Analysis

NOTE: Before to procede please run the following commands install.packages(faraway), install.packages(ggplot2).

 # Most of the data we are going to work with are taken from faraway package

library(faraway) # call the package to use its built in  functions and data
library(ggplot2) # an amazing package to create graphs

data(pima)
head(pima) # just print the first 6 rows of the dataset

##   pregnant glucose diastolic triceps insulin  bmi diabetes age test
## 1        6     148        72      35       0 33.6    0.627  50    1
## 2        1      85        66      29       0 26.6    0.351  31    0
## 3        8     183        64       0       0 23.3    0.672  32    1
## 4        1      89        66      23      94 28.1    0.167  21    0
## 5        0     137        40      35     168 43.1    2.288  33    1
## 6        5     116        74       0       0 25.6    0.201  30    0

To ensure a comprehensive understanding of the data, it is crucial to generate numerical summaries such as means, quantiles, standard deviations (SDs), maximum and minimum values. This process is essential in determining the integrity of the data and identifying any potential outliers or inconsistencies. As a statistician or data scientist, exploring the data should be the first step in problem-solving.

The dataset under consideration is derived from a study conducted by The National Institute of Diabetes and Digestive and Kidney Disease, which involved \(768\) adult female Pima Indians residing near Phoenix. The variables within the dataset include:

pregnant (number of times pregnant)
glucose (concentration of plasma glucose at 2 hours in an oral glucose tolerance test)
diastolic (diastolic blood pressure in mmHg)
triceps (triceps skin fold thickness in mm)
insulin (2-hour serum insulin in muU/ml)
bmi (body max index, where weight is measured in kg and height in \(m^2\))
diabetes (diabetes pedigree function)
age (age in years)
test (test about signs of diabetes, coded zero if negative and one if positive).

To initiate the exploration of this dataset, we can use the function summary().

summary(pima)

##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           bmi           diabetes           age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##       test      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

There is something that doesn’t make sense to you ?

Yes, there is something that doesn’t make sense. The statement “No blood pressure is not good for health, it is virtually impossible for a patient to have no blood pressure LOL…” is incorrect. Blood pressure can be zero or near-zero for certain medical conditions, such as shock or cardiac arrest, and can also vary depending on the position of the body. However, it is highly unlikely for a healthy individual to have a blood pressure of zero.

Regarding the dataset description, it is possible that the value of zero has been used to indicate missing data, as it is a common practice to represent missing values as zeros in some datasets. It is also possible that the researchers did not obtain the blood pressures for some patients due to certain limitations or errors in data collection. It is important to check the data documentation and consult with experts in the field to gain a better understanding of the data and potential issues.

At this point it makes sense to denote the zero values as NA.

pima$diastolic[pima$diastolic==0] <- NA
pima$glucose[pima$glucose==0] <- NA
pima$triceps[pima$triceps==0] <- NA
pima$insulin[pima$insulin==0] <- NA
pima$bmi[pima$bmi==0] <- NA

The variable test is a categorical variable, also called factor. Therefore, we need to be sure that R treats qualitative variables as factors. Sometimes (even professional statistician) forget this and compute stupid statistics such as ‘average zip code’.

Formatting variables is not only important for summary statistics, but also when we move on to modeling. Don’t neglect the formatting phase, please :).

str(pima$test)

##  int [1:768] 1 0 1 0 1 0 1 0 1 1 ...

In this case, prime results to be an intergere… this does not makes sense so it is better to format it as factor.

pima$test <- factor(pima$test)
summary(pima$test)

##   0   1 
## 500 268

Now that is coded correctly, summary(test) makes more sense to all of us. We see that \(500\) cases were negative and \(268\) were positive. A way to make this more clear is to use descriptive labels.

levels(pima$test) <- c('negative', 'positive')
summary(pima)

##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     insulin            bmi           diabetes           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##  NA's   :374      NA's   :11                                      
##        test    
##  negative:500  
##  positive:268  
##                
##                
##                
##                
##

Now we are ready to explore a little bit further with some plots.

hist(pima$diastolic,
     xlab='Diastolic', 
     main='')

plot(density(pima$diastolic, na.rm=TRUE),
     main='Distolic Kernel plot ')

If you are unconfortable with the Kernel methods, this is an extraordinary resource to explore this topic further ClickHere.

The kernel plot effectively avoids the blockiness that can be distracting in a histogram. However, it is important to ensure appropriate bin specifications for the histogram and bandwidths for the kernel density plot. To understand the effect of bandwidths we can play with it.

plot(density(pima$diastolic, na.rm=TRUE, bw=1), 
     main='This is a too wiggly plot ')

plot(density(pima$diastolic, na.rm=TRUE, bw=3), 
     main='This looks better... more smooth ')

plot(density(pima$diastolic, na.rm=TRUE, bw=10), 
     main='This is perfect ')

The higher the bandwith, the smoother the density estimate will be. An in-depth discussion regarding kernel methods is out of the scope of this course. if you feel like you don’t know enough about Kernels, plese visit this extraordinary resource.

plot(diabetes ~ diastolic, 
     data=pima)

plot(diabetes ~ test, pima)

An alternative to the base plots in R is ggplot2 package. The essential elements of a plot made using this package are:

Data: The data that is being visualized is passed to ggplot2 as a data frame.
Aesthetic mapping: The mapping of variables in the data to visual properties of the plot, such as the x and y axis or color and shape of points.
Geometric objects: The geometric objects that define the type of plot, such as points, lines, bars, or histograms.

More information about how to visualize data properly are discussed in MA304-7

ggplot(data= pima,
       aes(x=diastolic))+
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 35 rows containing non-finite values (`stat_bin()`).

ggplot(data= pima,
       aes(x=diastolic))+
  geom_density()+
  ggtitle('Diastolic kernel density')

## Warning: Removed 35 rows containing non-finite values (`stat_density()`).

ggplot(data=pima, 
       aes(x=diastolic, y=diabetes, shape=test))+
  geom_point()

## Warning: Removed 35 rows containing missing values (`geom_point()`).

This is an example of a bivariate scatterplot (two dimensions) to which a third has been added. Can you think of ways to add dimension to this scatterplot?

To add dimensions to a scatterplot in R, various techniques can be employed. Some of the most common approaches are:

Adding color: You can add color to the points in the scatterplot to represent a third variable. For example, you can assign different colors to the points based on a categorical variable.
Adding size: You can adjust the size of the points to indicate a numerical variable. For instance, you can make the points larger or smaller based on a variable’s value.
Adding shape: You can change the shape of the points to indicate a categorical variable. For example, you can use different shapes, such as circles, triangles, or squares, to represent different categories.
Adding facets: You can create a grid of scatterplots, each representing a subset of the data based on one or more variables. This approach is useful for visualizing complex relationships in the data.

ggplot(data=pima, 
       aes(x=diastolic, y=diabetes))+
  geom_point(size=1)+
  facet_grid(~ test)

## Warning: Removed 35 rows containing missing values (`geom_point()`).

Exercises

Please attempt the following exercises. Recall that to attempt this questions you need to install faraway package first.

The dataset teengamb concerns a astudy of teenage gambling in Britain ( ?teengamb , for further details about the data). Make a numerical and graphical summary of the data, commenting on any features you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.
The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Make a numerical and graphical summary of the data

References

Faraway, J. (2015). Linear Models with R Second Edition CHAPMAN & HALL/CRC Texts in Statistical Science.

Lab 1 Exploratory Data Analysis

Dr. Danilo Petti, Lecturer in Statistics and Data Science

2023-10-03

Exploratory Data Analysis

Exercises

References