Introduction to data analysis using R, R Studio and R Markdown
Dee Chiluiza, PhD
Northeastern University
Boston, Massachusetts

Short manual series:
Density Plots and probability calculations

1 Introduction

As you observed in previous sections, continuous data has the characteristic of being “unlimited”: between the minimum and maximum values of any particular continuous variable, any value can happen.

Example: We have a data set containing the weight in kilograms of n=300 people. The smallest person in the data set weights 56.5 Kg and the biggest person 105.6 Kg. The mean of the data is 81, standard deviation 9.3.

Let’s observe this data using a histogram:
• With x-axis limits from 40 to 120, to have plenty of space to visualize both ends.
• With bins of size 2 (every 2 Kg): breaks=seq(40,120,2).
• With data labels on x-axis divided every 2 Kg, to match bins: xaxp=c(40,120,40).
• Displaying the position of the minimum and maximum values.

For breaks, which sets the number of bins on our histograms, we use the sequence function seq(). Since we chose to use xlim=c(40,120), we should set the bins inside the same limits, from 40 to 120 in groups of two breaks=seq(40,120,2). You can change the size of the groups, it sets the width of the bins.

The codexaxp=c(40,120,40) is used to customize the presentation of the numerical labels on the x-axis, in this case, we keep the same limits as we used on xlim=c(40,120) and add how many labels we need, in this case I am choosing 40, which mean that the range of data will be symmetrically divided into 40 sections.

In addition, I am using abline() to enter vertical lines for the minimum and maximum values, and text() to create the labels for those lines. Please observe carefully the use of those codes after the code that creates the histogram.

hist(weightD1,
     xlim=c(40,120), 
     breaks=seq(40,120,2),
     main="", xlab="Weight (in Kg)",
     las=2,
     col=brewer.pal(8, "Pastel2"),
     xaxp=c(40,120,40))

# Use abline() to plot vertical lines.
abline(v=weightD1_min)
abline(v=weightD1_max)

# Use text() to create labels inside the graph
text(x=weightD1_min-1, y=20,
     paste("min=",round(weightD1_min,1)),
     srt=90, cex=0.7)
text(x=weightD1_max+1, y=20,
     paste("max=",round(weightD1_max,1)),
     srt=90, cex=0.7)

2 Zoom into data

Let’s zoom into the data, to the area between 70 and 75 Kg. Observe the use of xlim to focus on that region, the use breaks to observe inside the redion, and the use of ylim to adjust the Y-axis to the new size of the bars.

Since we are performing this from the original data, we should keep the same information for breaks=seq(40,120,2), but in this case, we can make groups of 1 to visualize individual oberved values (actual values collected).

hist(weightD1, 
     xlim=c(70,75),
     ylim=c(0,4),
     breaks=seq(40,120,0.1),
     xaxp=c(40,120,160),
     main="", xlab="Weight (in Kg)",
     las=2,
     col=brewer.pal(8, "Pastel2"))

3 Density plot

Histograms allow us to plot the actual values observed in the data set, that is what you observed in the previous graphs. Keep in mind that in probability, any value can happen inside the distribution.

For example, in our weight data set, you could have observed/obtained values like 70.34 Kg or 78.76 Kg, but it was possible that a person weighting 75.002 Kg or 77.65014 Kg arrived to the office for measurements; any value could happen, at any decimal level.

In normal distribution you cannot ask for the probability of a specific value, that is impossible, you can only ask for areas under the probability distribution curve, in this case, you could chose to ask for the probability that a person weights between 70 and 75 Kg.

P(70 < X < 75)

Let’s present the data set using a density plot.
When you take a continuous data and create a density distribution, you obtain two set of values, one in the x-axis indicating the values of your data, and one on the y-axis indicating their probability distributions. Observe below:

density(weightD1, adjust =1)

## 
## Call:
##  density.default(x = weightD1, adjust = 1)
## 
## Data: weightD1 (300 obs.);   Bandwidth 'bw' = 2.612
## 
##        x                y            
##  Min.   : 48.71   Min.   :6.910e-06  
##  1st Qu.: 64.88   1st Qu.:2.320e-03  
##  Median : 81.06   Median :9.253e-03  
##  Mean   : 81.06   Mean   :1.544e-02  
##  3rd Qu.: 97.23   3rd Qu.:2.968e-02  
##  Max.   :113.41   Max.   :4.219e-02

This information is used to create density plots. For now we will stick to the basic adjustment of 1 or close to 1. Practice this code using variable Price from our data set carSales.xlsx, and observe how the value bandwidth ‘bw’ changes when you change adjust. Imagine the bandwidth as the setting for bins on your histogram, the wider the bins, the less information you obtain about your data distribution. Do not abuse adjustment, it will dramatically change the shape of your density distribution.

Now let’s create three objects to store the information from density of weight values, each object has a different value for adjustment. We will use these objects to plot our first density graphs.

# Create three density objects

weightDensity_1 = density(weightD1, adjust =1)
weightDensity_2 = density(weightD1, adjust =2)
weightDensity_12 = density(weightD1, adjust =1.2)

# Use par() code to adjust graph presentation
# Use ?par in your console to obtain more information

par(mfrow=c(3,1),            # plot graphs in 3 rows 1 colum
    mai=c(0.6, 0.2,0.6,0.2), # adjust margins inside graphs
    mar=c(4,5,1,1))          # adjust margins outside graphs  


# Plot graphs using the density objects
plot(weightDensity_1, 
     main="", 
     las=1, 
     ylim=c(0, 0.05), 
     xlim=c(50,120),
     yaxp = c(0, 0.05, 5))

plot(weightDensity_2, 
     main="", 
     las=1, 
     ylim=c(0, 0.05), 
     xlim=c(50,120),
     yaxp = c(0, 0.05, 5))

plot(weightDensity_12, 
     main="", 
     las=1, 
     ylim=c(0, 0.05),
     xlim=c(50,120),
     yaxp = c(0, 0.05, 5))

Notice how adjust=2 smooths the curve too much, loosing its true distribution. This adjustment depends on the data, sometimes you need more, sometimes you do not need any adjustment. Be aware of this situation.

4 Normalization of data by Z Scores

In the density plots above, the actual values of the data are presented on the x-axis. For probability calculations, we need the data to be presented using Z scores, which you learnt in previous classes.

Z scores indicate how far any particular value located from the mean in terms of standard deviations.
Let’s analyze some values using this example: the mean of a particular data set is 50 and the standard deviation is 10, then the value 60 is located 1 sd deviation above the mean, and its z score is = 1.

40 is located 1 sd below the mean, then its zscore = -1.
70 is located 2 sd above the mean, then its zscore = +2.
35 is located 1.5 sd below the mean, then its zscore = -1.5.
92 is located 4.2 sd above the mean, then its zscore = +4.2.

What is the z score of 67.358? This can be easily calculated using the Z score formula: (value-mean)/sd.

mean = 50, sd = 10
Z = (67.358-mean)/sd = 1.7358

We will normalize our weight data using z scores. Since our data is contained in a single vector weightD1, we can simply enter that vector inside the z score formula and create a new vector with the normalized values. See codes below. If the data were contained inside a table, then we would use code mutate().

Observe the codes and table below, only the first 20 observations are visualized. Notice that values below the mean = 81.04 are negative.

# Create new vector to normalize values inside weightD1

normWeight = (weightD1-weightD1_mean)/weightD1_sd

# Preent observed values and normalized values using matrix()
# Present only the first 20 observations using head(20)

matrix(c(weightD1,normWeight), nrow = 300, byrow = F)%>%
  head(20)%>%
  kable(col.names = c("Weight", "Normalized"),
        digits = 2)

Weight	Normalized
83.03	0.21
76.12	-0.53
79.52	-0.16
84.54	0.38
78.56	-0.27
65.13	-1.71
84.83	0.41
87.33	0.68
78.96	-0.22
61.46	-2.10
87.73	0.72
81.76	0.08
74.62	-0.69
73.99	-0.76
70.92	-1.09
99.05	1.94
79.08	-0.21
102.05	2.26
88.29	0.78
78.64	-0.26

5 Create a plot with normalized values

Below I will present both graphs: one using actual values and one using normalized values. Notice how the shape of their distributions are the same but the values on the x-axis are different (actual values versus normalized values, or Z scores).

Notice that the x-limits in these graphs have not being altered with xlim=c(,), in order to maintain their default presentations. By this, we are able to observe that the position of the mean value is at the same level of z score = 0, which is the position of the mean for any continuous distribution.

# Create two density objects

dWeight_actual = density(weightD1, adjust =1)
dWeight_normalized = density(normWeight, adjust =1)

# Use par() code to adjust graph presentation
# Use ?par in your console to obtain more information

par(mfrow=c(2,1),            # plot graphs in 2 rows 1 colum
    mai=c(0.6, 0.2,0.6,0.2), # adjust margins inside graphs
    mar=c(2,5,1,1))          # adjust margins outside graphs  


# Plot graphs using the density objects
plot(dWeight_actual, 
     main="", xlab="",
     las=1, 
     ylim=c(0, 0.06))
abline(v=weightD1_mean)

plot(dWeight_normalized,
     main="", xlab="",
     las=1,
     xaxp=c(-4,4,8),
     ylim=c(0,0.6))
abline(v=0)

6 Calculate probabilities

Observe the density plots above and ask the following questions:
what is the probability that a value is higher than 93.25 Kg?
what is the probability that a value is lower than 93.25 Kg?

With mean = 81.04 and sd = 9.31, the Z score of 93.25 is = 1.31.

First, let’s plot the value in the first desisty graph, and its z score in the second graph. Observe codes below.

# Use par() code to adjust graph presentation

par(mfrow=c(2,1),
    mai=c(0.6, 0.2,0.6,0.2), 
    mar=c(2,5,1,1))          


# Plot graphs using the density objects
plot(dWeight_actual, 
     main="", xlab="",
     las=1, 
     ylim=c(0, 0.06))
abline(v=93.25)
text(x=94.25, y=0.04,
     paste(93.25), cex=0.8, srt=90)

# Z score of 93.25

Z9325 = (93.25-weightD1_mean)/weightD1_sd

plot(dWeight_normalized,
     main="", xlab="",
     las=1,
     xaxp=c(-4,4,8),
     ylim=c(0,0.6))
abline(v=Z9325)
text(x=Z9325+0.15, y=0.4,
     paste(round(Z9325,2)), cex=0.8, srt=90)

To calculate probabilities we need to calculate areas under the curve.

For P(x<93.25), we need to calculate the area below Z score 1.31 (which is standard deviation 1.31). To calculate this area in normal distributions we use the code pnorm() with the value of z score.

For P(x>93.25), we need to subtract pnorm() from 1.

Area below = pnorm(zcore) = pnorm(93.25) = 0.905 or 90.5%.
Area above = 1-pnorm(zcore) = 1-pnorm(93.25) = 0.095 or 9.5%.

7 Probabilities on T distributions

If the data were based on a t distribution, then we need to use the number of observations as degrees of freedom, and apply the pt() code instead. If the data contained 25 observations, then df = 25-1 = 24.

Area below = pt(zcore, df) = pt(93.25, 24) = 0.899 or 89.9%.
Area above = 1-pt(zcore, df) = 1-pt(93.25, 24) = 0.101 or 10.1%.

8 Task. Calculate the size of a small area

Use your newly acquired knowledge to answer the question: what is the probability that a person weights between 92.0 and 96.0 Kg? P(92<X<96).
Plot the same density graph you see below, and calculate the area between the two lines.

Just to give you a clue, the answer is 6.5%.

Did you obtained the same density plot and the same answer with your R codes?

Happy Learning!

Disclaimer: This short series manual project is a work in progress. Until otherwise clearly stated, this material is considered to be draft version.

Dee Chiluiza, PhD
June 2022
Last update: 15 July, 2022
Boston, Massachusetts, USA
Bruno Dog