Assignment Objectives

  • Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

  • Translate mathematical formulas into R functions and apply them to solve related problems.

  • Create effective visualizations to demonstrate your understanding of key concepts in the following questions.


Question 1: Cumulative Distribution Function (CDF) Estimation

The following failure times (in hours) were observed for 8 electronic components:

23, 45, 67, 89, 112, 156, 189, 245
  1. Write an R function implementing the ECDF \(\hat{F}_n(t)\) according to its mathematical definition. Validate your implementation using R’s ecdf() function on the given data, with comparison based on their step functions.

Work for Part A:

times <- c(23, 45, 67, 89, 112, 156, 189, 245)
uniq.time <- sort(unique(times)) #Used to sort data values and remove duplicates
my.ECDF <- function(indat, outx){ #Used to define function
  freq.table <- table(indat) #Used to create frequency table
  uniq <- as.numeric(names(freq.table)) #Gives sorted unique values
  rep.time <- as.vector(freq.table) #Turns frequencies into numeric vector
  cum.rel.feq <- cumsum(rep.time)/sum(rep.time) #Gets the cumulative relative frequency
  cum.prob <- numeric(length(outx)) #Starts at 0, the ECDF value below the smallest observation
  for (i in seq_along(outx)){
    intvl.id <- which(uniq <= outx[i]) #Used to identify the index meeting the condition
    if (length(intvl.id) > 0) cum.prob[i] <- cum.rel.feq[max(intvl.id)] #Used to get cumulative probability
  }
  cum.prob #Used to get vector of ECDF values
}

plot(uniq.time, my.ECDF(indat=times, outx=uniq.time), #Assigns uniq.time to x-value, my.ECDF to y-values
     type ="s", #Indicates it should be a step function
     main = "ECDF using Mathematical Definition",
     xlab = "Failure Times",
     ylab = "Cumulative Probability") 

r.ECDF <- ecdf(times) #Uses ecdf function on times

plot(r.ECDF, verticals = TRUE, pch=46, #indicates there should be vertical jumps
     main = "ECDF using R",
     xlab = "Failure Times",
     ylab = "Cumulative Probability")

Comparing the ECDF built from the mathematical definition with the one from R's ecdf() function, the two step functions appear identical: both jump by 1/8 at each of the 8 observed failure times.
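The visual comparison can be backed by a numerical check. The sketch below is one possible way to do this, assuming the definition \(\hat{F}_n(t) = \frac{1}{n}\sum_{i=1}^n I(t_i \le t)\); the helper name ecdf.def and the evaluation grid are illustrative choices, not part of the assignment.

```r
times <- c(23, 45, 67, 89, 112, 156, 189, 245)

# Hypothetical helper: ECDF straight from the definition,
# i.e. the proportion of observations less than or equal to each t
ecdf.def <- function(indat, outx) {
  sapply(outx, function(t) mean(indat <= t))
}

# Evaluate on a grid that extends below the minimum and above the maximum
t.grid <- seq(0, 300, by = 1)
manual.vals <- ecdf.def(times, t.grid)
builtin.vals <- ecdf(times)(t.grid)

# TRUE when the two step functions agree at every grid point
all.equal(manual.vals, builtin.vals)
```

Checking on a grid that starts below the smallest failure time also exercises the region where the ECDF is 0, which the plots above do not show.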

  2. A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).

Work for Part B:

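I would agree, as an estimate. The ECDF at 100 hours is \(\hat{F}_8(100) = 4/8 = 0.5\), since 4 of the 8 components (at 23, 45, 67, and 89 hours) failed before 100 hours. This matches the graphs above, where the step function sits at 0.5 between 89 and 112 hours. With only 8 observations, however, 0.5 is a rough estimate of the true failure probability rather than an exact value.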
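The value read off the graph can also be computed directly. This short sketch assumes "failure before 100 hours" means \(P(T < 100)\); because no component failed at exactly 100 hours, the strict and non-strict versions give the same number here.

```r
times <- c(23, 45, 67, 89, 112, 156, 189, 245)

# ECDF at t = 100: proportion of failure times <= 100
F.hat <- ecdf(times)
F.hat(100)          # 4 of 8 observations, i.e. 0.5

# Strict version: proportion of failure times < 100
mean(times < 100)   # also 0.5, since no failure occurs at exactly 100
```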

Question 2: Density Function Estimation

Consider the following failure times from a mechanical system:

12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4
  1. Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram’s distribution.

Work for Part A:

times2 <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

#Creates histogram with 3 equal-width bins spanning min to max of times2
#freq = FALSE puts estimated density (not counts) on the y-axis
hist(times2, breaks = seq(min(times2), max(times2), length.out = 4), freq = FALSE, main = "Histogram of Failure Times", xlab = "Failure Times")

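The bin counts are 3, 4, and 3. With a bin width of (25.4 - 12.3)/3 ≈ 4.37, the estimated density in each bin (count divided by n times the bin width) is approximately 0.069, 0.092, and 0.069. The distribution centers around 19, with the largest share of observations (4 of 10) in the center bin. The two outer bins are the same height, so the histogram looks symmetric and unimodal.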
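The density in each bin is the bin count divided by \(n\) times the bin width. As a sketch of how to extract those values (the object names brks and h.out are illustrative), hist() can be called with plot = FALSE so that it returns the counts and densities without drawing anything:

```r
times2 <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
brks <- seq(min(times2), max(times2), length.out = 4)

# plot = FALSE returns the histogram object instead of drawing it
h.out <- hist(times2, breaks = brks, plot = FALSE)

h.out$counts     # observations per bin
h.out$density    # estimated density per bin

# Same densities by hand: count / (n * bin width)
h.out$counts / (length(times2) * diff(brks))

# Densities times bin widths should sum to 1
sum(h.out$density * diff(brks))
```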

  2. Write an R function that computes kernel density estimates using a Gaussian kernel with \(h=2\). Validate your implementation against R’s built-in density() function.

\[ \hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}. \]

Work for Part B:

gauss.kde <- function(t, data, h) { #defines function
  n <- length(data) #gets size of n
  K <- function(u) (1 / sqrt(2 * pi)) * exp(-0.5 * u^2) #Used to compute Gaussian kernel
  sapply(t, function(x) (1 / (n * h)) * sum(K((x - data) / h))) #Used to apply kernel to scaled data
}

plot(density(times2, kernel = "gaussian", bw = 2), #Used to plot kde using R's function
main = "My Function vs. Built-In", lwd = 2) 
lines(seq(10, 30, 0.1), gauss.kde(seq(10, 30, 0.1), #Used to plot kde using my function
times2, 2), lwd = 2, col = "orange")

The function I made closely follows R's density() function: the orange curve lies almost exactly on top of the built-in estimate.
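The overlay can be supplemented with a numerical comparison. The following is a minimal sketch, assuming it is acceptable to evaluate the hand-written estimator at the grid points that density() itself returns (d$x). Small differences are expected because density() uses a binned approximation rather than the exact sum. The estimator is restated here, with dnorm() standing in for the Gaussian kernel, so the block runs on its own.

```r
times2 <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

# Gaussian KDE from the formula; dnorm() is the standard normal density K(u)
gauss.kde <- function(t, data, h) {
  n <- length(data)
  sapply(t, function(x) sum(dnorm((x - data) / h)) / (n * h))
}

# Built-in estimate; for the Gaussian kernel, bw plays the role of h
d <- density(times2, kernel = "gaussian", bw = 2)

# Largest absolute gap between the two estimates on density()'s own grid
max.diff <- max(abs(d$y - gauss.kde(d$x, times2, 2)))
max.diff
```

A value of max.diff that is tiny relative to the height of the curve supports the visual impression that the two estimates coincide.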

  3. Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with \(h=2\). Validate your implementation by comparing results with R’s built-in density() function for Gaussian kernel estimation.

\[ \hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1. \]

Work for Part C:

epan.kde <- function(t, data, h) { #defines function
  n <- length(data) #gets size of n
  K <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)  #Used to compute Epanechnikov kernel
  sapply(t, function(x) (1 / (n * h)) * sum(K((x - data) / h))) #Used to apply kernel to scaled data
}
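#Used to plot kde using R's function
#density() treats bw as the kernel's standard deviation; the Epanechnikov kernel
#on [-1, 1] has standard deviation 1/sqrt(5), so bw = 2/sqrt(5) corresponds to h = 2
plot(density(times2, kernel = "epanechnikov", bw = 2 / sqrt(5)),
main = "My Function vs. Built-In", lwd = 2)
#Used to plot kde using my function
lines(seq(10, 30, 0.1), epan.kde(seq(10, 30, 0.1),
times2, 2), lwd = 2, col = "orange")

Because density() treats bw as the standard deviation of the kernel rather than the half-width \(h\) in the formula above, I pass bw = 2/sqrt(5) to the built-in function. With that conversion, the function I made follows R's density() function closely. With bw = 2 passed directly, the built-in curve is noticeably smoother than mine, so the two do not match as well as in the Gaussian example.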
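How density() interprets bw matters for this comparison: its documentation states that kernels are scaled so that bw is the standard deviation of the smoothing kernel. The Epanechnikov kernel on \([-1, 1]\) has variance \(\int_{-1}^{1} u^2 \cdot \frac{3}{4}(1-u^2)\,du = \frac{1}{5}\), so a formula bandwidth of \(h\) corresponds to bw = h/sqrt(5). The sketch below checks this numerically; the object names are illustrative, and the estimator is restated so the block runs on its own.

```r
times2 <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

# Epanechnikov KDE from the formula, with h as the kernel half-width
epan.kde <- function(t, data, h) {
  n <- length(data)
  K <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
  sapply(t, function(x) sum(K((x - data) / h)) / (n * h))
}

h <- 2
d.same <- density(times2, kernel = "epanechnikov", bw = h)            # bw set equal to h
d.conv <- density(times2, kernel = "epanechnikov", bw = h / sqrt(5))  # bw converted to an SD

# Largest absolute gap to the hand-written estimate, on each built-in grid
diff.same <- max(abs(d.same$y - epan.kde(d.same$x, times2, h)))
diff.conv <- max(abs(d.conv$y - epan.kde(d.conv$x, times2, h)))
c(bw.equals.h = diff.same, bw.converted = diff.conv)
```

If the converted bandwidth gives a much smaller gap, the disagreement comes from the bandwidth convention rather than from an error in the kernel formula.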

  4. How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select \(h=1.5\) versus \(h=2.5\)?

Work for Part D:

#Used to plot kde's using my function for various h values
t.grid <- seq(10, 30, 0.1) #Evaluation grid, so the x-axis shows failure time instead of vector index
plot(t.grid, gauss.kde(t.grid, times2, 1.5), type = "l", main = "Gaussian with h = 1.5", xlab = "Failure Times", ylab = "Density")

plot(t.grid, gauss.kde(t.grid, times2, 2.5), type = "l", main = "Gaussian with h = 2.5", xlab = "Failure Times", ylab = "Density")

plot(t.grid, epan.kde(t.grid, times2, 1.5), type = "l", main = "Epanechnikov with h = 1.5", xlab = "Failure Times", ylab = "Density")

plot(t.grid, epan.kde(t.grid, times2, 2.5), type = "l", main = "Epanechnikov with h = 2.5", xlab = "Failure Times", ylab = "Density")

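When h = 1.5, both density estimates become less smooth and track the individual observations more closely. When h = 2.5, both estimates become smoother and flatter. For the same h, the choice of kernel also matters: the Gaussian kernel has unbounded support, so its estimate is smoother and has tails that extend beyond the data, while the Epanechnikov kernel is zero for |u| > 1, so its estimate is more angular and drops to exactly zero more than h away from the nearest observation.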
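One rough way to put a number on "less smooth" is to count the local maxima of each estimate on a fine grid. This is only an illustrative sketch: count.modes is a hypothetical helper, the grid is an arbitrary choice, and a mode count is just one of several possible roughness measures. Both estimators are restated so the block runs on its own.

```r
times2 <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

gauss.kde <- function(t, data, h) {
  n <- length(data)
  sapply(t, function(x) sum(dnorm((x - data) / h)) / (n * h))
}
epan.kde <- function(t, data, h) {
  n <- length(data)
  K <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
  sapply(t, function(x) sum(K((x - data) / h)) / (n * h))
}

# Hypothetical helper: number of local maxima of y along the grid
count.modes <- function(y) {
  s <- sign(diff(y))
  s <- s[s != 0]        # ignore flat stretches (e.g. zeros outside the support)
  sum(diff(s) == -2)    # a rise followed by a fall marks a local maximum
}

t.grid <- seq(5, 35, by = 0.01)
modes <- sapply(c(h1.5 = 1.5, h2.5 = 2.5), function(h) {
  c(gaussian = count.modes(gauss.kde(t.grid, times2, h)),
    epanechnikov = count.modes(epan.kde(t.grid, times2, h)))
})
modes  # rows: kernel, columns: bandwidth
```

A larger count at the smaller bandwidth would be consistent with the bumpier curves seen in the plots.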

