x <- c(23, 45, 67, 89, 112, 156, 189, 245)
# Manual implementation of the ECDF
manual <- function(t, x) {
  n <- length(x)
  sapply(t, function(tt) sum(x <= tt) / n)
}
# Built-in ECDF in R
builtin <- ecdf(x)
# Compare the two step functions
t_vals <- seq(0, 260, by = 1)
m_vals <- manual(t_vals, x)
b_vals <- builtin(t_vals)
plot(t_vals, m_vals, type = "s", col = "blue", lwd = 2,
     xlab = "t (hours)", ylab = "F_n(t)",
     main = "Empirical CDF: Manual vs Built-in")
lines(t_vals, b_vals, type = "s", col = "red", lwd = 2, lty = 2)
legend("bottomright",
       legend = c("Manual ECDF", "Built-in ecdf()"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1, 2))
The ECDF represents the proportion of observed components that have failed by a given time. Using the eight observed failure times, we estimate the probability of failure within 100 hours by counting how many failures occurred at or before that time. Four of the eight components failed by 100 hours, so the ECDF value at 100 is 4/8 = 0.5. Based on the sample data, the estimated probability that a component fails before 100 hours is therefore 0.5, which supports the colleague's claim, so I agree.
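As a quick numerical check (a minimal sketch reusing the manual() function and the builtin ecdf object defined above):

# Both implementations evaluated at t = 100 should return 4/8 = 0.5
manual(100, x)
builtin(100)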
times <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
breaks <- seq(min(times), max(times), length.out = 4)
h <- hist(times,
          breaks = breaks,
          probability = TRUE,
          col = "lightblue",
          border = "black",
          main = "Histogram of Failure Times",
          xlab = "Failure Time (hours)",
          ylab = "Density")
h$density
## [1] 0.06870229 0.09160305 0.06870229
The histogram shows a unimodal and roughly symmetric distribution. The middle bin has the highest bar (largest density), while the bins on either side are lower and about equal in height. This indicates that most failure times cluster around the center of the range, with fewer failures occurring at the lower and higher ends. There is no strong skewness, and the overall shape is mound-like.
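The shape description can be checked directly against the hist() object (a small sketch reusing h from above):

# Bin counts and midpoints: the middle bin holds the most observations
h$counts  # 3 4 3
h$mids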
# Manual KDE with a Gaussian kernel
kde_manual <- function(t, times, h) {
  sapply(t, function(tt) {
    mean(dnorm((tt - times) / h)) / h
  })
}
# Points where density will be evaluated
t_vals <- seq(min(times) - 3, max(times) + 3, length.out = 1000)
# Manual KDE values
manual_vals <- kde_manual(t_vals, times, h = 2)
# Built-in KDE from R
builtin_kde <- density(times, bw = 2, kernel = "gaussian")
# Plot comparison
plot(t_vals, manual_vals, type = "l", col = "blue", lwd = 2,
     xlab = "Failure Time", ylab = "Density",
     main = "Manual KDE vs Built-in density()")
lines(builtin_kde$x, builtin_kde$y, col = "red", lwd = 2, lty = 2)
legend("topright",
       legend = c("Manual KDE", "Built-in density()"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1, 2))
The manually computed kernel density estimate closely matches the density curve produced by R's built-in density() function when both use a Gaussian kernel with bandwidth h = 2 (in density(), bw is the standard deviation of the Gaussian smoothing kernel, so bw = 2 corresponds exactly to h = 2 in the manual formula). The two curves overlap almost perfectly on the plot, indicating that the manual implementation correctly follows the kernel density estimation formula. Any small differences are due only to numerical approximation, since density() evaluates the estimate on a fixed grid, confirming that the custom function provides an accurate estimate of the underlying density.
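To quantify the visual agreement, here is a small check (a sketch reusing the objects defined above; approx() simply interpolates the built-in curve onto our evaluation grid):

# Interpolate the built-in density onto t_vals and measure the largest gap
b_interp <- approx(builtin_kde$x, builtin_kde$y, xout = t_vals)$y
max(abs(manual_vals - b_interp))  # expected to be very small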
# Manual Epanechnikov KDE
kde_epanechnikov <- function(t, times, h = 2) {
  sapply(t, function(tt) {
    u <- (tt - times) / h
    K <- ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
    mean(K) / h
  })
}
# Manual Epanechnikov KDE values
epa_vals <- kde_epanechnikov(t_vals, times, h = 2)
# Plot comparison
plot(t_vals, epa_vals, type = "l", col = "blue", lwd = 2,
     xlab = "Failure Time", ylab = "Density",
     main = "Epanechnikov KDE (Manual) vs Gaussian KDE (Built-in)")
lines(builtin_kde$x, builtin_kde$y, col = "red", lwd = 2, lty = 2)
legend("topright",
       legend = c("Epanechnikov KDE", "Gaussian KDE"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1, 2))
The choice of kernel affects how each data point contributes to the overall shape of the kernel density estimate. A Gaussian kernel yields a very smooth curve because it assigns some weight to every observation across the entire range, with influence gradually decreasing as distance increases. In contrast, the Epanechnikov kernel has finite support: each observation only affects the estimate within one bandwidth of its location, so the estimate can appear slightly less smooth and more locally shaped by nearby data points. Even so, both kernels identify the same region where failures are most concentrated. For this specific dataset, the Epanechnikov estimate is rougher and bumpier because of the small sample size and the bandwidth of 2.
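As a side note (an assumption-checking sketch, not part of the original comparison): density() also supports kernel = "epanechnikov", but its bw argument is the kernel's standard deviation rather than its half-width. The Epanechnikov kernel on [-1, 1] has standard deviation 1/sqrt(5), so the manual h = 2 corresponds to bw = 2 / sqrt(5) for a like-for-like overlay:

# Built-in Epanechnikov KDE, with bw rescaled so it matches the manual h = 2
epa_builtin <- density(times, bw = 2 / sqrt(5), kernel = "epanechnikov")
lines(epa_builtin$x, epa_builtin$y, col = "darkgreen", lwd = 2, lty = 3)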
The bandwidth has a much stronger impact on the density estimate than the kernel choice. With a smaller bandwidth such as h=1.5, both density curves become more sensitive to individual observations, leading to a rougher, more irregular appearance. With a larger bandwidth such as h=2.5, the curves become smoother and more spread out, with broader peaks and fewer visible fluctuations. Thus, while kernel type influences the fine details of the curve, the bandwidth primarily determines how smooth or wiggly the overall estimate appears for this dataset.
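To illustrate the bandwidth effect described above, here is a short sketch reusing kde_manual() with the three bandwidths mentioned in the text:

# Overlay Gaussian KDEs at three bandwidths to compare smoothness
plot(t_vals, kde_manual(t_vals, times, h = 1.5), type = "l", col = "blue",
     lwd = 2, xlab = "Failure Time", ylab = "Density",
     main = "Effect of Bandwidth on the KDE")
lines(t_vals, kde_manual(t_vals, times, h = 2), col = "red", lwd = 2)
lines(t_vals, kde_manual(t_vals, times, h = 2.5), col = "darkgreen", lwd = 2)
legend("topright", legend = c("h = 1.5", "h = 2", "h = 2.5"),
       col = c("blue", "red", "darkgreen"), lwd = 2)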