x <- c(23, 45, 67, 89, 112, 156, 189, 245)
# Manual implementation of the ECDF
manual <- function(t, x) {
  n <- length(x)
  sapply(t, function(tt) sum(x <= tt) / n)
}
# Built-in ECDF in R
builtin <- ecdf(x)
# Compare the two step functions
t_vals <- seq(0, 260, by = 1)
m_vals <- manual(t_vals, x)
b_vals <- builtin(t_vals)
plot(t_vals, m_vals, type = "s", col = "blue", lwd = 2,
     xlab = "t (hours)", ylab = "F_n(t)",
     main = "Empirical CDF: Manual vs Built-in")
lines(t_vals, b_vals, type = "s", col = "red", lwd = 2, lty = 2)
legend("bottomright",
       legend = c("Manual ECDF", "Built-in ecdf()"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1, 2))
The ECDF represents the proportion of observed components that have failed by a given time. Using the eight observed failure times, we estimate the probability of failure within 100 hours by counting how many failures occurred at or before that time. Four of the eight components failed by 100 hours, so the ECDF value at 100 is 4/8 = 0.5. Based on the sample data, the estimated probability that a component fails before 100 hours is therefore 0.5, which supports the colleague's claim, so I agree.
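As a quick numerical check (a minimal sketch reusing the manual() function and the builtin ecdf object defined above):

# Both implementations evaluated at t = 100 should return 4/8 = 0.5
manual(100, x)
builtin(100)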
times <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
breaks <- seq(min(times), max(times), length.out = 4)
h <- hist(times,
          breaks = breaks,
          probability = TRUE,
          col = "lightblue",
          border = "black",
          main = "Histogram of Failure Times",
          xlab = "Failure Time (hours)",
          ylab = "Density")
h$density
## [1] 0.06870229 0.09160305 0.06870229
The histogram shows a unimodal and roughly symmetric distribution. The middle bin has the highest bar (largest density), while the bins on either side are lower and about equal in height. This indicates that most failure times cluster around the center of the range, with fewer failures occurring at the lower and higher ends. There is no strong skewness, and the overall shape is mound-like.
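The shape description can be checked directly against the hist() object (a small sketch reusing h from above):

# Bin counts and midpoints: the middle bin holds the most observations
h$counts  # 3 4 3
h$mids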
# Manual KDE with a Gaussian kernel
kde_manual <- function(t, times, h) {
  sapply(t, function(tt) {
    mean(dnorm((tt - times) / h)) / h
  })
}
# Points where density will be evaluated
t_vals <- seq(min(times) - 3, max(times) + 3, length.out = 1000)
# Manual KDE values
manual_vals <- kde_manual(t_vals, times, h = 2)
# Built-in KDE from R
builtin_kde <- density(times, bw = 2, kernel = "gaussian")
# Plot comparison
plot(t_vals, manual_vals, type = "l", col = "blue", lwd = 2,
     xlab = "Failure Time", ylab = "Density",
     main = "Manual KDE vs Built-in density()")
lines(builtin_kde$x, builtin_kde$y, col = "red", lwd = 2, lty = 2)
legend("topright",
       legend = c("Manual KDE", "Built-in density()"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1, 2))
The manually computed kernel density estimate closely matches the density curve produced by R's built-in density() function when both use a Gaussian kernel with bandwidth h = 2 (in density(), bw is the standard deviation of the Gaussian smoothing kernel, so bw = 2 corresponds exactly to h = 2 in the manual formula). The two curves overlap almost perfectly on the plot, indicating that the manual implementation correctly follows the kernel density estimation formula. Any small differences are due only to numerical approximation, since density() evaluates the estimate on a fixed grid, confirming that the custom function provides an accurate estimate of the underlying density.
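To quantify the visual agreement, here is a small check (a sketch reusing the objects defined above; approx() simply interpolates the built-in curve onto our evaluation grid):

# Interpolate the built-in density onto t_vals and measure the largest gap
b_interp <- approx(builtin_kde$x, builtin_kde$y, xout = t_vals)$y
max(abs(manual_vals - b_interp))  # expected to be very small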
# Manual Epanechnikov KDE
kde_epanechnikov <- function(t, times, h = 2) {
  sapply(t, function(tt) {
    u <- (tt - times) / h
    K <- ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
    mean(K) / h
  })
}
# Manual Epanechnikov KDE values
epa_vals <- kde_epanechnikov(t_vals, times, h = 2)
# Plot comparison
plot(t_vals, epa_vals, type = "l", col = "blue", lwd = 2,
     xlab = "Failure Time", ylab = "Density",
     main = "Epanechnikov KDE (Manual) vs Gaussian KDE (Built-in)")
lines(builtin_kde$x, builtin_kde$y, col = "red", lwd = 2, lty = 2)
legend("topright",
       legend = c("Epanechnikov KDE", "Gaussian KDE"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1, 2))
The choice of kernel affects how each data point contributes to the overall shape of the kernel density estimate. A Gaussian kernel yields a very smooth curve because it assigns some weight to every observation across the entire range, with influence gradually decreasing as distance increases. In contrast, the Epanechnikov kernel has finite support: each observation only affects the estimate within one bandwidth of its location, so the estimate can appear slightly less smooth and more locally shaped by nearby data points. Even so, both kernels identify the same region where failures are most concentrated. For this specific dataset, the Epanechnikov estimate is rougher and bumpier because of the small sample size and the bandwidth of 2.
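As a side note (an assumption-checking sketch, not part of the original comparison): density() also supports kernel = "epanechnikov", but its bw argument is the kernel's standard deviation rather than its half-width. The Epanechnikov kernel on [-1, 1] has standard deviation 1/sqrt(5), so the manual h = 2 corresponds to bw = 2 / sqrt(5) for a like-for-like overlay:

# Built-in Epanechnikov KDE, with bw rescaled so it matches the manual h = 2
epa_builtin <- density(times, bw = 2 / sqrt(5), kernel = "epanechnikov")
lines(epa_builtin$x, epa_builtin$y, col = "darkgreen", lwd = 2, lty = 3)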
The bandwidth has a much stronger impact on the density estimate than the kernel choice. With a smaller bandwidth such as h=1.5, both density curves become more sensitive to individual observations, leading to a rougher, more irregular appearance. With a larger bandwidth such as h=2.5, the curves become smoother and more spread out, with broader peaks and fewer visible fluctuations. Thus, while kernel type influences the fine details of the curve, the bandwidth primarily determines how smooth or wiggly the overall estimate appears for this dataset.
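To illustrate the bandwidth effect described above, here is a short sketch reusing kde_manual() with the three bandwidths mentioned in the text:

# Overlay Gaussian KDEs at three bandwidths to compare smoothness
plot(t_vals, kde_manual(t_vals, times, h = 1.5), type = "l", col = "blue",
     lwd = 2, xlab = "Failure Time", ylab = "Density",
     main = "Effect of Bandwidth on the KDE")
lines(t_vals, kde_manual(t_vals, times, h = 2), col = "red", lwd = 2)
lines(t_vals, kde_manual(t_vals, times, h = 2.5), col = "darkgreen", lwd = 2)
legend("topright", legend = c("h = 1.5", "h = 2", "h = 2.5"),
       col = c("blue", "red", "darkgreen"), lwd = 2)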