Exploratory Data Analysis
NOTE: Before proceeding, please run the following commands: install.packages("faraway"), install.packages("ggplot2").
# Most of the data we are going to work with are taken from the faraway package
library(faraway) # load the package to use its built-in functions and data
library(ggplot2) # an amazing package to create graphs
data(pima)       # load the Pima data set
head(pima)       # show the first six rows
##   pregnant glucose diastolic triceps insulin  bmi diabetes age test
## 1        6     148        72      35       0 33.6    0.627  50    1
## 2        1      85        66      29       0 26.6    0.351  31    0
## 3        8     183        64       0       0 23.3    0.672  32    1
## 4        1      89        66      23      94 28.1    0.167  21    0
## 5        0     137        40      35     168 43.1    2.288  33    1
## 6        5     116        74       0       0 25.6    0.201  30    0
To understand the data properly, it is crucial to start with numerical summaries such as means, quantiles, standard deviations (SDs), and maximum and minimum values. These summaries help to establish the integrity of the data and to reveal potential outliers or inconsistencies. For a statistician or data scientist, exploring the data should be the first step in problem-solving.
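The summaries listed above can be computed one at a time with base R functions. The following is a minimal sketch, assuming the pima data frame from faraway is available; the choice of the glucose column is only for illustration.

```r
library(faraway)  # provides the pima data frame

data(pima)
x <- pima$glucose  # any numeric column works; glucose is an arbitrary choice

# Centre, spread, and extremes of a single variable
glucose_stats <- c(
  mean = mean(x),
  sd   = sd(x),
  min  = min(x),
  max  = max(x)
)

# Lower quartile, median, and upper quartile
glucose_quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

print(glucose_stats)
print(glucose_quartiles)
```

If a column contains NA values, these functions return NA unless na.rm = TRUE is passed.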
The dataset under consideration comes from a study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases, which involved \(768\) adult female Pima Indians living near Phoenix. The variables in the dataset are:
- pregnant (number of times pregnant)
- glucose (concentration of plasma glucose at 2 hours in an oral glucose tolerance test)
- diastolic (diastolic blood pressure in mmHg)
- triceps (triceps skin fold thickness in mm)
- insulin (2-hour serum insulin in µU/ml)
- bmi (body mass index: weight in kg divided by the square of height in m, i.e. \(kg/m^2\))
- diabetes (diabetes pedigree function)
- age (age in years)
- test (test for signs of diabetes, coded zero if negative and one if positive).
To begin exploring this dataset, we can use the function summary().
summary(pima)
##     pregnant         glucose        diastolic         triceps
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##     insulin           bmi           diabetes           age
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00
##       test
##  Min.   :0.000
##  1st Qu.:0.000
##  Median :0.000
##  Mean   :0.349
##  3rd Qu.:1.000
##  Max.   :1.000
Is there anything here that doesn’t make sense to you?
Yes. Several variables have a minimum of zero, including diastolic blood pressure. A blood pressure of zero or near zero occurs only in extreme conditions such as shock or cardiac arrest, so it is highly unlikely that a woman taking part in this study had a true reading of zero. The same reasoning applies to the zero values of glucose, triceps, insulin and bmi.
The most likely explanation is that zero has been used to indicate missing data, which is common practice in some datasets. The researchers may not have obtained these measurements for some patients because of limitations or errors in data collection. It is important to check the data documentation and to consult experts in the field to understand the data and its potential problems.
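To see how widespread the problem is, we can count the exact zeros in each affected column. This is a minimal sketch, assuming pima has been freshly loaded from faraway; the vector name suspect is introduced here only for illustration.

```r
library(faraway)

data(pima)

# Columns in which a value of zero is physiologically implausible
suspect <- c("glucose", "diastolic", "triceps", "insulin", "bmi")

# pima[suspect] == 0 is a logical matrix; colSums() counts the TRUEs per column
zero_counts <- colSums(pima[suspect] == 0)
print(zero_counts)
```

Columns such as pregnant are deliberately left out, since zero is a perfectly plausible value there.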
At this point it makes sense to denote the zero values as NA.
pima$diastolic[pima$diastolic==0] <- NA
pima$glucose[pima$glucose==0] <- NA
pima$triceps[pima$triceps==0] <- NA
pima$insulin[pima$insulin==0] <- NA
pima$bmi[pima$bmi==0] <- NA
The variable test is a categorical variable, also called a factor. Therefore, we need to make sure that R treats qualitative variables as factors. Even professional statisticians sometimes forget this and compute stupid statistics such as an ‘average zip code’.
Formatting variables is not only important for summary statistics, but also when we move on to modeling. Don’t neglect the formatting phase, please :).
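Before converting anything, it helps to check how R currently stores the column. A minimal sketch, assuming pima is loaded from faraway as at the start of this section:

```r
library(faraway)

data(pima)
str(pima$test)    # compact description: type, length, and first few values
class(pima$test)  # storage class as a single string
```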
## int [1:768] 1 0 1 0 1 0 1 0 1 1 ...
In this case, pima$test turns out to be an integer. This does not make sense, so it is better to format it as a factor.
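The conversion can be done with factor(). This is a sketch under the assumption that the column still holds the original 0/1 integer codes:

```r
library(faraway)

data(pima)
pima$test <- factor(pima$test)  # the integer codes 0 and 1 become factor levels
summary(pima$test)              # for a factor, summary() returns counts per level
```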
## 0 1
## 500 268
Now that it is coded correctly, summary(pima$test) makes more sense to all of us. We see that \(500\) cases were negative and \(268\) were positive. A way to make this clearer is to use descriptive labels.
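One way to attach descriptive labels is to pass levels and labels to factor() explicitly, which makes the mapping from codes to names visible. The sketch below starts from a fresh copy of pima and repeats the zero-to-NA recoding in compact form so that it runs on its own; the helper vector zero_is_missing is a name introduced here for illustration.

```r
library(faraway)

data(pima)

# Recode implausible zeros as NA in the five affected columns
zero_is_missing <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima[zero_is_missing] <- lapply(
  pima[zero_is_missing],
  function(x) replace(x, x == 0, NA)
)

# Code 0 is labelled "negative" and code 1 "positive"
pima$test <- factor(pima$test, levels = c(0, 1),
                    labels = c("negative", "positive"))

summary(pima)
```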
##     pregnant         glucose        diastolic         triceps
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##                   NA's   :5       NA's   :35       NA's   :227
##     insulin            bmi           diabetes           age
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00
##  NA's   :374      NA's   :11
##        test
##  negative:500
##  positive:268
Now we are ready to explore a little bit further with some plots.
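A histogram and a kernel density estimate are two standard base R views of a single variable. The sketch below uses diastolic; that choice is an assumption, made because its 35 missing values match the warnings shown later in this section.

```r
library(faraway)

data(pima)
pima$diastolic[pima$diastolic == 0] <- NA  # zeros stand for missing readings

# Histogram: shows the shape of the distribution, but depends on the bins
hist(pima$diastolic, xlab = "Diastolic", main = "")

# Kernel density estimate: a smoothed alternative to the histogram
dens <- density(pima$diastolic, na.rm = TRUE)
plot(dens, main = "")
```

Note that density() stops with an error on missing values unless na.rm = TRUE is given, whereas hist() drops them silently.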
If you are uncomfortable with kernel methods, this is an extraordinary resource for exploring the topic further: ClickHere.
The kernel plot avoids the blockiness that can be distracting in a histogram. However, both plots depend on a tuning choice: the bin specification for the histogram and the bandwidth for the kernel density plot. To understand the effect of the bandwidth, we can experiment with it.
The higher the bandwidth, the smoother the density estimate will be. An in-depth discussion of kernel methods is outside the scope of this course. If you feel you don’t know enough about kernels, please visit this extraordinary resource.
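One way to experiment with the bandwidth is the adjust argument of density(), which multiplies the default bandwidth by the given factor. The factors 0.25 and 4 below are arbitrary values chosen for illustration, and diastolic is again an assumed example variable.

```r
library(faraway)

data(pima)
x <- pima$diastolic
x <- x[!is.na(x) & x != 0]  # drop readings that are missing or coded as zero

# adjust < 1 gives a wigglier estimate, adjust > 1 a smoother one
d_rough   <- density(x, adjust = 0.25)
d_default <- density(x)
d_smooth  <- density(x, adjust = 4)

# Overlay the three estimates; ylim is widened so the roughest curve fits
plot(d_default, main = "", xlab = "Diastolic",
     ylim = range(0, d_rough$y, d_default$y))
lines(d_rough, lty = 2)
lines(d_smooth, lty = 3)
legend("topright", lty = 1:3,
       legend = c("default", "adjust = 0.25", "adjust = 4"))
```

The bandwidth actually used by each estimate is stored in its bw component, for example d_smooth$bw.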
An alternative to the base plots in R is the ggplot2 package. The essential elements of a plot made using this package are:
- Data: The data that is being visualized is passed to ggplot2 as a data frame.
- Aesthetic mapping: The mapping of variables in the data to visual properties of the plot, such as the x and y axis or color and shape of points.
- Geometric objects: The geometric objects that define the type of plot, such as points, lines, bars, or histograms.
More information about how to visualize data properly is given in MA304-7.
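The three elements can be seen in a short ggplot2 sketch: the data frame is pima, aes() supplies the mapping, and each geom_ function sets the type of plot. The variables plotted, and the use of shape for test in the scatterplot, are assumptions made for illustration.

```r
library(faraway)
library(ggplot2)

data(pima)
pima$diastolic[pima$diastolic == 0] <- NA
pima$test <- factor(pima$test, levels = c(0, 1),
                    labels = c("negative", "positive"))

# Data and aesthetic mapping are shared; the geom decides the plot type
base <- ggplot(pima, aes(x = diastolic))
p_hist <- base + geom_histogram()  # default bins, so ggplot2 prints a message
p_dens <- base + geom_density()

# Scatterplot with a third variable mapped to the point shape
p_scatter <- ggplot(pima, aes(x = diastolic, y = diabetes, shape = test)) +
  geom_point()

print(p_hist)
print(p_dens)
print(p_scatter)
```

Because diastolic contains NA values, ggplot2 warns about removed rows when each plot is drawn.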
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 35 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 35 rows containing non-finite values (`stat_density()`).
## Warning: Removed 35 rows containing missing values (`geom_point()`).
This is an example of a bivariate scatterplot (two dimensions) to which a third dimension has been added. Can you think of other ways to add dimensions to this scatterplot?
To add dimensions to a scatterplot in R, various techniques can be employed. Some of the most common approaches are:
- Adding color: You can add color to the points in the scatterplot to represent a third variable. For example, you can assign different colors to the points based on a categorical variable.
- Adding size: You can adjust the size of the points to indicate a numerical variable. For instance, you can make the points larger or smaller based on a variable’s value.
- Adding shape: You can change the shape of the points to indicate a categorical variable. For example, you can use different shapes, such as circles, triangles, or squares, to represent different categories.
- Adding facets: You can create a grid of scatterplots, each representing a subset of the data based on one or more variables. This approach is useful for visualizing complex relationships in the data.
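Each of the four approaches above corresponds to one aesthetic or one faceting call in ggplot2. The following sketch applies them to the diastolic and diabetes scatterplot; using test for color, shape and facets and age for size is an illustrative assumption.

```r
library(faraway)
library(ggplot2)

data(pima)
pima$diastolic[pima$diastolic == 0] <- NA
pima$test <- factor(pima$test, levels = c(0, 1),
                    labels = c("negative", "positive"))

p <- ggplot(pima, aes(x = diastolic, y = diabetes))

p_colour <- p + geom_point(aes(colour = test))            # categorical variable
p_size   <- p + geom_point(aes(size = age), alpha = 0.4)  # numerical variable
p_shape  <- p + geom_point(aes(shape = test))             # categorical variable
p_facet  <- p + geom_point() + facet_grid(~ test)         # one panel per level

print(p_colour)
print(p_size)
print(p_shape)
print(p_facet)
```

Several of these can be combined in one plot, although more than one or two extra dimensions quickly becomes hard to read.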
## Warning: Removed 35 rows containing missing values (`geom_point()`).
Exercises
Please attempt the following exercises. Recall that to attempt these questions you need to install the faraway package first.
The dataset teengamb concerns a study of teenage gambling in Britain (see ?teengamb for further details about the data). Make a numerical and graphical summary of the data, commenting on any features you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.
The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Make a numerical and graphical summary of the data.
References
- Faraway, J. (2015). Linear Models with R, Second Edition. Chapman & Hall/CRC Texts in Statistical Science.