Assignment Objectives
Develop a clear technical understanding of nonparametric
cumulative distribution function (CDF) estimation and various kernel
density estimators.
Translate mathematical formulas into R functions and apply them
to solve related problems.
Create effective visualizations to demonstrate your understanding
of key concepts in the following questions.
Question 1: Cumulative Distribution Function (CDF)
Estimation
The following failure times (in hours) were observed for 8 electronic
components:
23, 45, 67, 89, 112, 156, 189, 245
- Write an R function implementing the ECDF \(\hat{F}_n(t)\) according to its
mathematical definition. Validate your implementation using R’s ecdf()
function on the given data, with comparison based on their step
functions.
x <- c(23, 45, 67, 89, 112, 156, 189, 245)
# My ECDF
my_ecdf <- function(data) {
n <- length(data)
function(t) {
sum(data <= t) / n
}
}
my_ecdf <- my_ecdf(x)
I will validate my implementation of the ECDF using R’s ecf()
function. First, I will do so by using several different values as
input, and ensuring that I get the same results from both ways.
# R's ECDF
R_ecdf <- ecdf(x)
# Comparing ECDF values between my function and R's function
t_vals <- seq(0, 260, by = 50)
cbind(
t = t_vals,
my_ecdf = sapply(t_vals, my_ecdf),
builtin = R_ecdf(t_vals)
)
t my_ecdf builtin
[1,] 0 0.000 0.000
[2,] 50 0.250 0.250
[3,] 100 0.500 0.500
[4,] 150 0.625 0.625
[5,] 200 0.875 0.875
[6,] 250 1.000 1.000
The values match up perfectly between my ECDF and the built-in R ECDF
indicating that my function has been created properly.
plot(R_ecdf, verticals = TRUE, do.points = FALSE, col = "blue", lwd = 2,
main = "Comparison of MY ECDF and R's ECDF",
xlab = "Failure Times",
ylab = "ECDF")
t_grid <- seq(0, 260, length.out = 1000)
lines(t_grid, sapply(t_grid, my_ecdf), type = "s", col = "orange",
lwd = 2,lty = 2)
legend("bottomright", legend = c("R ecdf()", "My ECDF"),
col = c("blue", "orange"), lty = c(1, 2), lwd = 2, bty = "n")

As seen in the graph above, my ECDF and the built-in R ECDF match up
perfectly, thus validating my function.
- A colleague claims that the probability of failure before 100 hours
is 0.5 based on these data. Do you agree? Explain your reasoning using
the empirical cumulative distribution function (ECDF).
Yes, I do agree with this colleague’s claim. In this data set, there
are four total values that are less than 100, these values are 23, 45,
67, and 89. There are eight total values all together in the data set.
The empirical cumulative distribution function says that the probability
can be found by taking the total number of values that are less than the
desired value, in this case 100, and dividing that by the total number
of values in the whole data set.
\[
F_n(x) = \frac{[\ \text{Number of } X_i \le x]}{n} = \frac{4}{8} = 0.5
\]
The ECDF gives that the probability of failure before 100 hours is
0.5 which matches up with this colleague’s claim.
Question 2: Density Function Estimation
Consider the following failure times from a mechanical system:
12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4
- Create a histogram of the data using 3 equally spaced bins. What is
the estimated density in each bin? Describe the shape of the histogram’s
distribution.
y <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
breaks <- seq(12.3, 25.4, length.out = 4)
hist(y, breaks = breaks, probability = TRUE,
main = "Histogram",
xlab = "Failure Time")

The histogram is unimodal and symmetric, and appears to be
approximately normally distributed. In the first bin, the density
appears to be around 0.07. In the second bin, the density appears to be
around 0.09. In the third bin, the density appears to be around
0.07.
- Write an R function that computes kernel density estimates using a
Gaussian kernel with \(h=2\). Validate
your implementation against R’s built-in density() function.
\[
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right),
\ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}.
\]
I will begin with creating my estimation for a Gaussian kernel with
h=2.
Gaussian <- function(t, data, h) {
n <- length(data)
sapply(t, function(x) {
mean(dnorm((x - data) / h)) / h
})
}
Next, I will validate against R’s built-in density function.
t_grid <- seq(min(y) - 2, max(y) + 2, length.out = 200)
plot(t_grid, Gaussian(t_grid, y, h = 2),
type = "l", col = "orange", lwd = 2,
main = "Gaussian (h = 2)",
xlab = "Failure Time", ylab = "Density")
lines(density(y, kernel = "gaussian", bw = 2),
col = "blue", lty = 2)
legend("topright",
legend = c("My Gaussian", "R's Built-in function"),
col = c("orange", "blue"), lty = c(1, 2))

As seen above, my estimation using a Gaussian kernel with h=2 matches
up perfectly with R’s built-in density() function. This successfully
validates the function that I created.
- Write a custom R function that computes kernel density estimates
using the Epanechnikov kernel with \(h=2\). Validate your implementation by
comparing results with R’s built-in density() function for Gaussian
kernel estimation.
\[
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right),
\ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \
|u| \le 1.
\]
I will begin with creating my estimation for a Epanechnikov kernel
with h=2.
Epanechnikov <- function(t, data, h) {
n <- length(data)
sapply(t, function(x) {
u <- (x - data) / h
mean(0.75 * (1 - u^2) * (abs(u) <= 1)) / h
})
}
Next, I will validate against R’s built-in density function.
plot(t_grid, Epanechnikov (t_grid, y, h = 2),
type = "l", col = "orange", lwd = 2,
main = "Kernel Density Estimates",
xlab = "Failure Time", ylab = "Density")
lines(density(y, bw = 2), col = "blue", lty = 2)
legend("topright",
legend = c("Epanechnikov KDE", "Gaussian KDE"),
col = c("orange", "blue"), lty = c(1, 2))

My Epanechnikov kernel matches up with the same patterns shown by
that of R’s built-in function’s estimation. This successfully validates
my Epanechnikov kernel function. The Epanechnikov kernel shows more
variability and sharper peaks than the smooth pattern of the Gaussian
kernel. However, the overall trends are the same between the two.
- How does the choice of kernel (Gaussian vs. Epanechnikov) affect the
density estimate? For both kernel estimators applied to this dataset,
what happens when we select \(h=1.5\)
versus \(h=2.5\)?
Both kernels provide similar information regarding this data set,
however, their appearences show some differences between one another.
The Epanechnikov kernel has a sharper peak than the Gaussian kernel,
which was seen in the graphically comparisons of both kernels. The
Gaussian kernel has a more broad range, while the Epanechnikov kernel is
more compact, with a sharper peak as opposed to a gradual curve like the
Gaussian kernel.
A lower bandwidth like h = 1.5 would have more variability, and a
less smooth estimate. On the other hand, a higher bandwidth like h = 2.5
would have a smoother estimate, but less of a noticeable peak in its
visualization. The bandwidth used in the previous parts b and c of h = 2
is a good middle ground between these two bandwidths.
---
title: "Assignment 1: Estimating CDF and PDF"
author: "Josie Gallop"
date: " Due: 02-03-2026"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```
 
 \
 
## **Assignment Objectives** 

* Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

* Translate mathematical formulas into R functions and apply them to solve related problems.

* Create effective visualizations to demonstrate your understanding of key concepts in the following questions.



\

## **Question 1: Cumulative Distribution Function (CDF) Estimation**

The following failure times (in hours) were observed for 8 electronic components:

<center> 23, 45, 67, 89, 112, 156, 189, 245  </center>

a) Write an R function implementing the ECDF $\hat{F}_n(t)$ according to its mathematical definition. Validate your implementation using R's ecdf() function on the given data, with comparison based on their step functions.

```{r}
x <- c(23, 45, 67, 89, 112, 156, 189, 245)

# My ECDF 
my_ecdf <- function(data) {
  n <- length(data)
  function(t) {
    sum(data <= t) / n
  }
}
my_ecdf <- my_ecdf(x)
```

I will validate my implementation of the ECDF using R's ecf() function. First, I will do so by using several different values as input, and ensuring that I get the same results from both ways.

```{r}
# R's ECDF 
R_ecdf <- ecdf(x)

# Comparing ECDF values between my function and R's function
t_vals <- seq(0, 260, by = 50)
cbind(
  t = t_vals,
  my_ecdf = sapply(t_vals, my_ecdf),
  builtin = R_ecdf(t_vals)
)

```
The values match up perfectly between my ECDF and the built-in R ECDF indicating that my function has been created properly. 


```{r}
plot(R_ecdf, verticals = TRUE, do.points = FALSE, col = "blue", lwd = 2,
    main = "Comparison of MY ECDF and R's ECDF", 
    xlab = "Failure Times",
    ylab = "ECDF")

t_grid <- seq(0, 260, length.out = 1000)

lines(t_grid, sapply(t_grid, my_ecdf), type = "s", col = "orange", 
      lwd = 2,lty = 2)

legend("bottomright", legend = c("R ecdf()", "My ECDF"),
        col = c("blue", "orange"), lty = c(1, 2), lwd = 2, bty = "n")
```

As seen in the graph above, my ECDF and the built-in R ECDF match up perfectly, thus validating my function.


b) A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).

Yes, I do agree with this colleague's claim. In this data set, there are four total values that are less than 100, these values are 23, 45, 67, and 89. There are eight total values all together in the data set. The empirical cumulative distribution function says that the probability can be found by taking the total number of values that are less than the desired value, in this case 100, and dividing that by the total number of values in the whole data set. 

$$
F_n(x) = \frac{[\ \text{Number of  } X_i \le x]}{n} = \frac{4}{8} = 0.5
$$

The ECDF gives that the probability of failure before 100 hours is 0.5 which matches up with this colleague's claim. 


\

## **Question 2: Density Function Estimation**

Consider the following failure times from a mechanical system:

<center> 12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4 </center>

a) Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram's distribution.


```{r}
y <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
breaks <- seq(12.3, 25.4, length.out = 4)
hist(y, breaks = breaks, probability = TRUE,
     main = "Histogram",
     xlab = "Failure Time")
```

The histogram is unimodal and symmetric, and appears to be approximately normally distributed. In the first bin, the density appears to be around 0.07. In the second bin, the density appears to be around 0.09. In the third bin, the density appears to be around 0.07.


b) Write an R function that computes kernel density estimates using a Gaussian kernel with $h=2$. Validate your implementation against R's built-in density() function.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}.
$$

I will begin with creating my estimation for a Gaussian kernel with h=2.

```{r}
Gaussian <- function(t, data, h) {
  n <- length(data)
  sapply(t, function(x) {
    mean(dnorm((x - data) / h)) / h
  })
}
```

Next, I will validate against R's built-in density function.

```{r}
t_grid <- seq(min(y) - 2, max(y) + 2, length.out = 200)

plot(t_grid, Gaussian(t_grid, y, h = 2),
     type = "l", col = "orange", lwd = 2,
     main = "Gaussian (h = 2)",
     xlab = "Failure Time", ylab = "Density")

lines(density(y, kernel = "gaussian", bw = 2),
      col = "blue", lty = 2)

legend("topright",
       legend = c("My Gaussian", "R's Built-in function"),
       col = c("orange", "blue"), lty = c(1, 2))
```

As seen above, my estimation using a Gaussian kernel with h=2 matches up perfectly with R’s built-in density() function. This successfully validates the function that I created. 


c) Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with $h=2$. Validate your implementation by comparing results with R's built-in density() function for Gaussian kernel estimation.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1.
$$


I will begin with creating my estimation for a Epanechnikov kernel with h=2.

```{r}
Epanechnikov  <- function(t, data, h) {
  n <- length(data)
  sapply(t, function(x) {
    u <- (x - data) / h
    mean(0.75 * (1 - u^2) * (abs(u) <= 1)) / h
  })
}
```

Next, I will validate against R's built-in density function.

```{r}
plot(t_grid, Epanechnikov (t_grid, y, h = 2),
     type = "l", col = "orange", lwd = 2,
     main = "Kernel Density Estimates",
     xlab = "Failure Time", ylab = "Density")

lines(density(y, bw = 2), col = "blue", lty = 2)

legend("topright",
       legend = c("Epanechnikov KDE", "Gaussian KDE"),
       col = c("orange", "blue"), lty = c(1, 2))

```

My Epanechnikov kernel matches up with the same patterns shown by that of R's built-in function's estimation. This successfully validates my Epanechnikov kernel function. The Epanechnikov kernel shows more variability and sharper peaks than the smooth pattern of the Gaussian kernel. However, the overall trends are the same between the two. 


d) How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select $h=1.5$ versus $h=2.5$?

Both kernels provide similar information regarding this data set, however, their appearences show some differences between one another. The Epanechnikov kernel has a sharper peak than the Gaussian kernel, which was seen in the graphically comparisons of both kernels. The Gaussian kernel has a more broad range, while the Epanechnikov kernel is more compact, with a sharper peak as opposed to a gradual curve like the Gaussian kernel. 

A lower bandwidth like h = 1.5 would have more variability, and a less smooth estimate. On the other hand, a higher bandwidth like h = 2.5 would have a smoother estimate, but less of a noticeable peak in its visualization. The bandwidth used in the previous parts b and c of h = 2 is a good middle ground between these two bandwidths.

