Below I created an empirical cumulative distribution function (ECDF, also called the EDF) in order to manually estimate the cumulative probability of failure for these electronic components from the available data.
To do so, I programmed my custom function to accept the parameters t and n. The parameter t represents the potential fail time whose cumulative probability we wish to investigate and n represents the sample size of our data. In this instance, our sample size is eight, but I chose to set n as an editable parameter in case additional observations are recorded in the future.
This ECDF counts the number of recorded fail times from our sample that are less than or equal to t, and then divides by our sample size.
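In formula terms, the estimator is F_hat(t) = (number of x_i <= t) / n, i.e. the proportion of the n observed fail times that are at or below t. For example, F_hat(70) = 3/8 = 0.375, since three of the eight recorded times (23, 45 and 67 hours) are at most 70 hours.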
# Packages used throughout: dplyr for filter() and the %>% pipe, ggplot2 for the histograms later on
library(dplyr)
library(ggplot2)

Q1_Fails = rbind.data.frame(23, 45, 67, 89, 112, 156, 189, 245)
colnames(Q1_Fails) = "Time"
#### My own ECDF function:
# Uses dplyr's filter() to compute (number of x_i <= t) / n
My_ECDF = function(t, n) {
X_sub_i = filter(Q1_Fails, Time <= t) %>%
nrow()
return(X_sub_i/n)
}

Alongside that, I used base R's built-in ecdf() function to plot the empirical cumulative distribution of the data. For enhanced visual understanding, I added dotted vertical guide lines running from the x axis up to the start of each probability step.
#### Built-in ECDF function:
# Reuse the Q1_Fails data frame from above (so My_ECDF() still works later);
# ecdf() just needs the numeric vector of fail times
Fnt = ecdf(Q1_Fails$Time)
invisible(
summary(Fnt)
)
plot(Fnt,
main = "EDF of Electronics' Failure Times",
xlab = "Fail Time (Hours)",
ylab = "Cumulated Probability",
yaxt = "n",
lwd = 3)
axis(2, at = seq(0,1, by = 0.1))
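For reference, the eight per-point segments() calls below could equivalently be produced with a single vectorized call (a sketch, using the fitted Fnt from above): knots() returns the sorted jump points of the ECDF, and evaluating Fnt at those points gives the corresponding step heights.
# Equivalent alternative to the eight individual calls that follow
segments(x0 = knots(Fnt), y0 = 0, x1 = knots(Fnt), y1 = Fnt(knots(Fnt)),
col = "black", lwd = 4, lty = 3)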
# Data point = 23
segments(
x0 = 23,
y0 = 0,
x1 = 23,
y1 = 0.125,
col = "black",
lwd = 4,
lty = 3
)
# Data point = 45
segments(
x0 = 45,
y0 = 0,
x1 = 45,
y1 = 0.25,
col = "black",
lwd = 4,
lty = 3
)
# Data point = 67
segments(
x0 = 67,
y0 = 0,
x1 = 67,
y1 = 0.375,
col = "black",
lwd = 4,
lty = 3
)
# Data point = 89
segments(
x0 = 89,
y0 = 0,
x1 = 89,
y1 = 0.5,
col = "black",
lwd = 4,
lty = 3
)
# Data point = 112
segments(
x0 = 112,
y0 = 0,
x1 = 112,
y1 = 0.625,
col = "black",
lwd = 4,
lty = 3
)
# Data point = 156
segments(
x0 = 156,
y0 = 0,
x1 = 156,
y1 = 0.75,
col = "black",
lwd = 4,
lty = 3
)
# Data point = 189
segments(
x0 = 189,
y0 = 0,
x1 = 189,
y1 = 0.875,
col = "black",
lwd = 4,
lty = 3
)
# Data point = 245
segments(
x0 = 245,
y0 = 0,
x1 = 245,
y1 = 1,
col = "black",
lwd = 4,
lty = 3
)

If a colleague were to claim that our available data suggests the probability of an electronic component of this kind failing before 100 hours is about 50%, they would be correct.
Using my custom-built ECDF function from Part a, I evaluated it at t = 100 and obtained a value of 0.5, which makes sense because four of the eight recorded fail times (23, 45, 67 and 89 hours) are at or below 100 hours. Additionally, looking at the graph above, a fail time of 100 hours falls on the step of the ECDF where the cumulative probability equals 0.5.
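As a quick numerical check (a sketch, assuming the objects defined above are still in the workspace), the custom function, a direct proportion, and the base R ECDF all agree at t = 100:
# Each of these evaluates to 0.5: four of the eight fail times are at or below 100 hours
My_ECDF(t = 100, n = nrow(Q1_Fails))   # custom ECDF from Part a
mean(Q1_Fails$Time <= 100)             # direct proportion of fail times <= 100
Fnt(100)                               # base R ECDF evaluated at 100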
Q2_Fails = rbind(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
colnames(Q2_Fails) = "Time"To visualize the dataset of these mechanical systems’ fail times, I created two histograms below. We can see through both the frequency and density histograms, that our available data is approximately normally distributed.
It is also helpful to note, from some quick summary statistics on these fail times, that the average fail time was around 18.87 with a standard deviation of about 4.24.
invisible(
summary(Q2_Fails[1:10])
)
invisible(
sd(Q2_Fails[1:10])
)
Hist1 = ggplot(data = as.data.frame(Q2_Fails)) +
geom_histogram(aes(x = Time, y = after_stat(count)), bins = 3, alpha = 0.4, color = "black", fill = "red") +
labs(
title = "Distribution of Mechanical System Fail Times (Frequency)",
x = "Fail Time",
y = "Frequency"
)
Hist2 = ggplot(data = as.data.frame(Q2_Fails)) +
geom_histogram(aes(x = Time, y = after_stat(density)), bins = 3, alpha = 0.4, color = "black", fill = "red") +
labs(
title = "Distribution of Mechanical System Fail Times (Density)",
x = "Fail Time",
y = "Density"
)
Hist1
Hist2

Once again using R's ability to create custom functions, I manually programmed a function to compute the estimated probability density at a given fail time, based on the 10 sampled fail times and under the assumption that a Gaussian kernel is applied with a bandwidth parameter of 2.
The parameters for this function are t, h and n. They represent, respectively, the potential fail time whose estimated density we wish to investigate, the bandwidth (or smoothing parameter), and the sample size of the data (which is 10 in this instance but which I once again chose to leave as an editable parameter).
After creating this function, I randomly selected three potential failure times to use as testing values: 14.5, 16.5 and 20.5, which yielded estimated density values of about 0.065, 0.076 and 0.070 respectively.
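For reference, the estimator implemented below is the standard kernel density estimate f_h(t) = (1/(n*h)) * sum over i of K((t - x_i)/h), where here K is the Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi). Intuitively, each of the n observed fail times contributes a small Gaussian bump centred at itself, and the bumps are averaged to form the density estimate.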
#### My own KDE function:
# First make the K(u) function: the standard Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi)
G_KU = function(u) {
Exp = exp(-(u^2)/2)
Answer = Exp / sqrt(2*pi)
return(Answer)
}
# Now make the fh(t) function:
# parameter t is the point at which we want to evaluate the estimated density
G_Kernel = function(t, h, n) {
First = 1/(n*h) # Leftmost part of the formula
total = 0 # this total will be constantly updated as we go through the different rounds of summation within the series
# using a for loop for the summation series
for (Time in Q2_Fails) {
u = (t - Time)/h
K_Comp = G_KU(u)
total = total + K_Comp
}
# end of the for loop
Answer = total* First
return(Answer)
}
# sample(12.5:25, 3)
# = 20.5 16.5 14.5
### Now running the test with randomly selected test data:
# h = 2
# n = sample size of our dataset Q2_Fails
invisible(
G_Kernel(14.5, 2, nrow(Q2_Fails))
)
# = 0.06493919
invisible(
G_Kernel(16.5, 2, nrow(Q2_Fails))
# = 0.0755916
)
invisible(
G_Kernel(20.5, 2, nrow(Q2_Fails))
# = 0.07027822
)
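As an additional cross-check (a sketch, not part of the original workflow; G_Check is just a throwaway helper name), the same values can be reproduced with base R's dnorm(), since averaging dnorm(t, mean = x_i, sd = h) over the sample is algebraically identical to the formula above:
# Averaging normal densities centred at each observation, with sd = h, reproduces G_Kernel()
G_Check = function(t, h) mean(dnorm(t, mean = Q2_Fails, sd = h))
G_Check(14.5, 2)   # ~0.0649, matching G_Kernel(14.5, 2, 10)
G_Check(16.5, 2)   # ~0.0756
G_Check(20.5, 2)   # ~0.0703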
#### Built in R function:
# bw = h in R's density function
fhtG = density(Q2_Fails, bw = 2, kernel = "gaussian")
plot(fhtG,
main = "Gaussian KDE (h = 2)",
xlab = "Fail Times",
lwd = 5,
)
polygon(fhtG, col = "#F4A6A6", border = NA)
# t = 14.5
segments(
x0 = 14.5,
y0 = 0,
x1 = 14.5,
y1 = 0.06493919,
col = "black",
lwd = 4,
)
# t = 16.5
segments(
x0 = 16.5,
y0 = 0,
x1 = 16.5,
y1 = 0.0755916 ,
col = "black",
lwd = 4,
)
# t = 20.5
segments(
x0 = 20.5,
y0 = 0,
x1 = 20.5,
y1 = 0.07027822 ,
col = "black",
lwd = 4,
)

Looking at the density estimate above, which was built with R's built-in density() function, we can see that the results of my custom function were validated; the guide lines drawn at the three test points reach right up to the fitted curve.
To compare density estimates between different kernel estimators, I then mimicked the previous process from part b, but this time applied an Epanechnikov kernel instead of a Gaussian one.
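For reference, the only change relative to part b is the kernel itself: the Epanechnikov kernel is K(u) = (3/4) * (1 - u^2) for |u| <= 1 and 0 otherwise, so observations farther than h away from the evaluation point t contribute nothing to the sum.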
#### My own KDE function:
E_KU = function(u) {
if (abs(u) <= 1) {
Answer = (1 - u^2)*(3/4)
return(Answer)
}
else {
return("Invalid input")
}
}
# Simple test to make sure KU function works
# E_KU(-.16) = 0.7308
# Now mimic process from part b to create complete KDE function for Epanechnikov kernels
# Coding process is practically identical, just need to change K_Comp to equal E_KU(u) instead of G_KU(u)
E_Kernel = function(t, h, n) {
First = 1/(n*h)
total = 0
# for loop start
for (Time in Q2_Fails) {
u = (t - Time)/h
# This additional check is needed: without it, any observation with |u| > 1 hits the else branch of E_KU, which returns a string instead of a number and breaks the running sum.
if (abs(u) <= 1) {
K_Comp = E_KU(u)
total = total + K_Comp
}
}
# for loop ending
Answer = total* First
return(Answer)
}
# Randomly select testing values
#sort(sample(12.5:25, 3)) = 18.5 22.5 23.5
### Now testing function:
# h = 2
# n = sample size of our data
invisible(
E_Kernel(t = 18.5, h = 2, n = nrow(Q2_Fails))
# = 0.0763125
)
invisible(
E_Kernel(t = 22.5, h = 2, n = nrow(Q2_Fails))
# = 0.05990625
)
invisible(
E_Kernel(t = 23.5, h = 2, n = nrow(Q2_Fails))
# = 0.06365625
)
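As a cross-check of these three values (a sketch, not part of the original workflow; E_Check is just a throwaway helper name), the same estimates can be reproduced with a short vectorized function that zeroes out observations lying outside the kernel's support:
# mean(K(u))/h equals (1/(n*h)) * sum(K(u)), the Epanechnikov KDE at t
E_Check = function(t, h) {
u = (t - Q2_Fails) / h
mean(ifelse(abs(u) <= 1, (3/4) * (1 - u^2), 0)) / h
}
E_Check(18.5, 2)   # ~0.0763, matching E_Kernel(18.5, 2, 10)
E_Check(22.5, 2)   # ~0.0599
E_Check(23.5, 2)   # ~0.0637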
#### Built in R function:
fhtE = density(Q2_Fails, bw = 2, kernel = "epanechnikov")
plot(fhtE,
main = "Epanechnikov KDE (h = 2)",
xlab = "Fail Times",
lwd = 5)
polygon(fhtE, col = "#F4A6A6", border = NA)
axis(2, at = seq(0,0.08, by = 0.01))
# t = 18.5
segments(
x0 = 18.5,
y0 = 0,
x1 = 18.5,
y1 = 0.0763125,
col = "black",
lwd = 4,
)
# t = 22.5
segments(
x0 = 22.5,
y0 = 0,
x1 = 22.5,
y1 = 0.05990625 ,
col = "black",
lwd = 4,
)
# t = 23.5
segments(
x0 = 23.5,
y0 = 0,
x1 = 23.5,
y1 = 0.06365625 ,
col = "black",
lwd = 4,
)

While my custom function's results were not perfectly aligned with base R's plot of the Epanechnikov density curve, they are reasonably close. A likely reason for the remaining gap is that density() rescales its kernels so that bw is the standard deviation of the smoothing kernel, whereas my E_Kernel() function treats h as the Epanechnikov kernel's half-width, so the two are not applying exactly the same amount of smoothing.
Lastly, I wanted to examine how a change in bandwidth (the parameter h) affects the density estimates for both Gaussian and Epanechnikov kernels. Below are visuals of each kernel type, with respective bandwidths of 1.5, 2 and 2.5.
par(mfrow = c(1,3))
fhtG1.5 = density(Q2_Fails, bw = 1.5, kernel = "gaussian")
plot(fhtG1.5,
main = "Gaussian KDE (h = 1.5)",
xlab = "Fail Times",
lwd = 5,
)
polygon(fhtG1.5, col = "#F4A6A6", border = NA)
fhtG = density(Q2_Fails, bw = 2, kernel = "gaussian")
plot(fhtG,
main = "Gaussian KDE (h = 2)",
xlab = "Fail Times",
lwd = 5,
)
polygon(fhtG, col = "#F4A6A6", border = NA)
fhtG2.5 = density(Q2_Fails, bw = 2.5, kernel = "gaussian")
plot(fhtG2.5,
main = "Gaussian KDE (h = 2.5)",
xlab = "Fail Times",
lwd = 5,
)
polygon(fhtG2.5, col = "#F4A6A6", border = NA)

par(mfrow = c(1,3))
fhtE1.5 = density(Q2_Fails, bw = 1.5, kernel = "epanechnikov")
plot(fhtE1.5,
main = "Epanechnikov KDE (h = 1.5)",
xlab = "Fail Times",
lwd = 5)
polygon(fhtE1.5, col = "#F4A6A6", border = NA)
fhtE = density(Q2_Fails, bw = 2, kernel = "epanechnikov")
plot(fhtE,
main = "Epanechnikov KDE (h = 2)",
xlab = "Fail Times",
lwd = 5)
polygon(fhtE, col = "#F4A6A6", border = NA)
fhtE2.5 = density(Q2_Fails, bw = 2.5, kernel = "epanechnikov")
plot(fhtE2.5,
main = "Epanechnikov KDE (h = 2.5)",
xlab = "Fail Times",
lwd = 5)
polygon(fhtE2.5, col = "#F4A6A6", border = "black")We can see from the plots above from both kernels that an increasing bandwidth corresponds with a smoother and more normalized density curve. Alternatively, a lower bandwidth leads to a more “jumpy” visual. In reality, picking a proper h value is a balancing act, as too great a value leads to an overly smooth and biased estimation and too small a value leads to an overly jagged and highly variant estimation.