Assignment Objectives

  • Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

  • Translate mathematical formulas into R functions and apply them to solve related problems.

  • Create effective visualizations to demonstrate your understanding of key concepts in the following questions.


Question 1: Cumulative Distribution Function (CDF) Estimation

The following failure times (in hours) were observed for 8 electronic components:

23, 45, 67, 89, 112, 156, 189, 245
  1. Write an R function implementing the ECDF \(\hat{F}_n(t)\) according to its mathematical definition. Validate your implementation using R’s ecdf() function on the given data, with comparison based on their step functions.
times <- c(23, 45, 67, 89, 112, 156, 189, 245)
uniq.time <- sort(unique(times))  

my.ECDF <- function(indat, outx){
  
  freq.table <- table(indat)                         
  uniq <- as.numeric(names(freq.table))          
  rep.time <- as.vector(freq.table)                   
  cum.rel.feq <- cumsum(rep.time)/sum(rep.time)      
  cum.prob <- NULL
  for (i in 1:length(outx)){
    intvl.id <- which(uniq <= outx[i])      
    cum.prob[i] <- cum.rel.feq[max(intvl.id)] 
  }
  cum.prob             
}

my.ECDF_def <- function(indat, t){
  n <- length(indat)
  sum(indat <= t)/n
}
Fn <- ecdf(times)
grid <- seq(min(times)-10, max(times)+10, by = 1)

plot(Fn, verticals = TRUE, do.points = TRUE,
     main = "ECDF comparison: built-in ecdf() vs my.ECDF()",
     xlab = "Failure time (hours)", ylab = "Empirical CDF")


lines(grid, my.ECDF(indat = times, outx = grid), type = "s", lty = 2, lwd = 2)

legend("bottomright",
       legend = c("ecdf()", "my.ECDF()"),
       lty = c(1,2), lwd = c(1,2), bty = "n")

  1. A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).
my.ECDF_def(times, 100)
[1] 0.5
Fn(100)
[1] 0.5

Solution: By using the empirical distribution function, the estimated probability of failure before 100 hours is equal to 0.5. Therefore, I agree with my colleague that the probability of failure before 100 hours is 0.5 based on these small sample data.


Question 2: Density Function Estimation

Consider the following failure times from a mechanical system:

12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4
  1. Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram’s distribution.
fail <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
hisfail<-hist(fail, breaks = 3, freq = FALSE,
      main = "Histogram with 3 equally spaced bins",
      xlab = "Failure time")

hisfail$density
[1] 0.04 0.08 0.06 0.02

Solution:The histogram uses 3 equal-width bins, and the estimated densities are the bin heights shown in the table above.

Solution:Using three equally spaced bins, the histogram suggests an unimodal distribution and approximately symmetric to mildly right-skewed.

  1. Write an R function that computes kernel density estimates using a Gaussian kernel with \(h=2\). Validate your implementation against R’s built-in density() function.

\[ \hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}. \]

my.kerf.gauss <- function(in.data, h, out.x){
  n <- length(in.data)
  den <- numeric(length(out.x))
  for (i in 1:length(out.x)){
    den[i] <- sum(dnorm(out.x[i], mean = in.data, sd = h))/n
  }
  den
}

xx <- seq(min(fail)-3, max(fail)+3, length = 300)

my.den <- my.kerf.gauss(in.data = fail, h = 2, out.x = xx)

kde <- density(fail, bw = 2, kernel = "gaussian")
kde.y <- approx(kde$x, kde$y, xout = xx)$y

plot(xx, my.den, type = "l", lwd = 2,col= "blue",
     main = "Gaussian KDE (h=2): my.kerf.gauss vs density()",
     xlab = "Failure time", ylab = "Density")
lines(xx, kde.y, lty = 2, lwd = 2, col="yellow")
legend("topright", c("my.kerf.gauss()", "density()"),
       lty = c(1,2), lwd = 2, bty = "n")

  1. Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with \(h=2\). Validate your implementation by comparing results with R’s built-in density() function for Gaussian kernel estimation.

\[ \hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1. \]

my.kerf.epan <- function(in.data, h, out.x){
  n <- length(in.data)
  den <- numeric(length(out.x))
  for (i in 1:length(out.x)){
    u <- (out.x[i] - in.data)/h
    K <- ifelse(abs(u) <= 1, 0.75*(1 - u^2), 0)
    den[i] <- sum(K)/(n*h)
  }
  den
}

den.g2 <- my.kerf.gauss(fail, h = 2, out.x = xx)
den.e2 <- my.kerf.epan(fail,  h = 2, out.x = xx)

plot(xx, den.g2, type="l", lwd=2,col="gold",
     main="Kernel comparison (h=2)",
     xlab="Failure time", ylab="Density")
lines(xx, den.e2, lty=2, lwd=2)
legend("topright", c("Gaussian", "Epanechnikov"),
       lty=c(1,2), lwd=2, bty="n")

  1. How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select \(h=1.5\) versus \(h=2.5\)?
den.g15 <- my.kerf.gauss(fail, h = 1.5, out.x = xx)
den.g25 <- my.kerf.gauss(fail, h = 2.5, out.x = xx)

plot(xx, den.g15, type="l", lwd=2,col="navy",
     main="Gaussian KDE: h=1.5 vs h=2.5",
     xlab="Failure time", ylab="Density")
lines(xx, den.g25, lty=2, lwd=2)
legend("topright", c("h=1.5", "h=2.5"),
       lty=c(1,2), lwd=2, bty="n")

den.e15 <- my.kerf.epan(fail, h = 1.5, out.x = xx)
den.e25 <- my.kerf.epan(fail, h = 2.5, out.x = xx)

plot(xx, den.e15, type="l", lwd=2, col="green",
     main="Epanechnikov KDE: h=1.5 vs h=2.5",
     xlab="Failure time", ylab="Density")
lines(xx, den.e25, lty=2, lwd=2)
legend("topright", c("h=1.5", "h=2.5"),
       lty=c(1,2), lwd=2, bty="n")

Solution:Effect of Kernel Choice (Gaussian vs. Epanechnikov), With the same bandwidth, the Gaussian kernel gives a smoother curve with longer tails, while the Epanechnikov kernel focuses more on nearby observations because it has a limited range. In practice, the difference between kernels is usually small compared to the effect of the bandwidth.

Solution:The bandwidth controls how smooth the density estimate is. A smaller bandwidth (ℎ=1.5) produces a less smooth curve with more local fluctuations, while a larger bandwidth (ℎ=2.5) gives a smoother curve but may hide some details of the data.

---
title: "Assignment 1: Estimating CDF and PDF"
author: "Xiaoying Ma "
date: " Due: 02/03/2026"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```


```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }


```
 
 \
 
## **Assignment Objectives** 

* Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

* Translate mathematical formulas into R functions and apply them to solve related problems.

* Create effective visualizations to demonstrate your understanding of key concepts in the following questions.



\

## **Question 1: Cumulative Distribution Function (CDF) Estimation**

The following failure times (in hours) were observed for 8 electronic components:

<center> 23, 45, 67, 89, 112, 156, 189, 245  </center>

a) Write an R function implementing the ECDF $\hat{F}_n(t)$ according to its mathematical definition. Validate your implementation using R's ecdf() function on the given data, with comparison based on their step functions.

```{r fig.align='center', fig.width=5, fig.height=4}
times <- c(23, 45, 67, 89, 112, 156, 189, 245)
uniq.time <- sort(unique(times))  

my.ECDF <- function(indat, outx){
  
  freq.table <- table(indat)                         
  uniq <- as.numeric(names(freq.table))          
  rep.time <- as.vector(freq.table)                   
  cum.rel.feq <- cumsum(rep.time)/sum(rep.time)      
  cum.prob <- NULL
  for (i in 1:length(outx)){
    intvl.id <- which(uniq <= outx[i])      
    cum.prob[i] <- cum.rel.feq[max(intvl.id)] 
  }
  cum.prob             
}

my.ECDF_def <- function(indat, t){
  n <- length(indat)
  sum(indat <= t)/n
}

```

```{r}
Fn <- ecdf(times)
grid <- seq(min(times)-10, max(times)+10, by = 1)

plot(Fn, verticals = TRUE, do.points = TRUE,
     main = "ECDF comparison: built-in ecdf() vs my.ECDF()",
     xlab = "Failure time (hours)", ylab = "Empirical CDF")


lines(grid, my.ECDF(indat = times, outx = grid), type = "s", lty = 2, lwd = 2)

legend("bottomright",
       legend = c("ecdf()", "my.ECDF()"),
       lty = c(1,2), lwd = c(1,2), bty = "n")

```

b) A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).

```{r}
my.ECDF_def(times, 100)
Fn(100)
```
**Solution**: By using the empirical distribution function, the estimated probability of failure before 100 hours is equal to 0.5. Therefore, I agree with my colleague that the probability of failure before 100 hours is 0.5 based on these small sample data.

\

## **Question 2: Density Function Estimation**

Consider the following failure times from a mechanical system:

<center> 12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4 </center>

a) Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram's distribution.

```{r fig.align='center'}
fail <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
```

```{r fig.align='center'}
hisfail<-hist(fail, breaks = 3, freq = FALSE,
      main = "Histogram with 3 equally spaced bins",
      xlab = "Failure time")

hisfail$density

```


**Solution**:The histogram uses 3 equal-width bins, and the estimated densities are the bin heights shown in the table above.

**Solution**:Using three equally spaced bins, the histogram suggests an unimodal distribution and approximately symmetric to mildly right-skewed.



b) Write an R function that computes kernel density estimates using a Gaussian kernel with $h=2$. Validate your implementation against R's built-in density() function.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}.
$$
```{r fig.align='center'}
my.kerf.gauss <- function(in.data, h, out.x){
  n <- length(in.data)
  den <- numeric(length(out.x))
  for (i in 1:length(out.x)){
    den[i] <- sum(dnorm(out.x[i], mean = in.data, sd = h))/n
  }
  den
}

xx <- seq(min(fail)-3, max(fail)+3, length = 300)

my.den <- my.kerf.gauss(in.data = fail, h = 2, out.x = xx)

kde <- density(fail, bw = 2, kernel = "gaussian")
kde.y <- approx(kde$x, kde$y, xout = xx)$y

plot(xx, my.den, type = "l", lwd = 2,col= "blue",
     main = "Gaussian KDE (h=2): my.kerf.gauss vs density()",
     xlab = "Failure time", ylab = "Density")
lines(xx, kde.y, lty = 2, lwd = 2, col="yellow")
legend("topright", c("my.kerf.gauss()", "density()"),
       lty = c(1,2), lwd = 2, bty = "n")
```


c) Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with $h=2$. Validate your implementation by comparing results with R's built-in density() function for Gaussian kernel estimation.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1.
$$
```{r fig.align='center'}
my.kerf.epan <- function(in.data, h, out.x){
  n <- length(in.data)
  den <- numeric(length(out.x))
  for (i in 1:length(out.x)){
    u <- (out.x[i] - in.data)/h
    K <- ifelse(abs(u) <= 1, 0.75*(1 - u^2), 0)
    den[i] <- sum(K)/(n*h)
  }
  den
}

den.g2 <- my.kerf.gauss(fail, h = 2, out.x = xx)
den.e2 <- my.kerf.epan(fail,  h = 2, out.x = xx)

plot(xx, den.g2, type="l", lwd=2,col="gold",
     main="Kernel comparison (h=2)",
     xlab="Failure time", ylab="Density")
lines(xx, den.e2, lty=2, lwd=2)
legend("topright", c("Gaussian", "Epanechnikov"),
       lty=c(1,2), lwd=2, bty="n")
```

d) How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select $h=1.5$ versus $h=2.5$?

```{r fig.align='center'}
den.g15 <- my.kerf.gauss(fail, h = 1.5, out.x = xx)
den.g25 <- my.kerf.gauss(fail, h = 2.5, out.x = xx)

plot(xx, den.g15, type="l", lwd=2,col="navy",
     main="Gaussian KDE: h=1.5 vs h=2.5",
     xlab="Failure time", ylab="Density")
lines(xx, den.g25, lty=2, lwd=2)
legend("topright", c("h=1.5", "h=2.5"),
       lty=c(1,2), lwd=2, bty="n")


den.e15 <- my.kerf.epan(fail, h = 1.5, out.x = xx)
den.e25 <- my.kerf.epan(fail, h = 2.5, out.x = xx)

plot(xx, den.e15, type="l", lwd=2, col="green",
     main="Epanechnikov KDE: h=1.5 vs h=2.5",
     xlab="Failure time", ylab="Density")
lines(xx, den.e25, lty=2, lwd=2)
legend("topright", c("h=1.5", "h=2.5"),
       lty=c(1,2), lwd=2, bty="n")

```


**Solution**:Effect of Kernel Choice (Gaussian vs. Epanechnikov), With the same bandwidth, the Gaussian kernel gives a smoother curve with longer tails, while the Epanechnikov kernel focuses more on nearby observations because it has a limited range. In practice, the difference between kernels is usually small compared to the effect of the bandwidth.

**Solution**:The bandwidth controls how smooth the density estimate is. A smaller bandwidth (ℎ=1.5) produces a less smooth curve with more local fluctuations, while a larger bandwidth (ℎ=2.5) gives a smoother curve but may hide some details of the data.