Assignment Objectives

  • Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

  • Translate mathematical formulas into R functions and apply them to solve related problems.

  • Create effective visualizations to demonstrate your understanding of key concepts in the following questions.


Question 1: Cumulative Distribution Function (CDF) Estimation

The following failure times (in hours) were observed for 8 electronic components:

23, 45, 67, 89, 112, 156, 189, 245
  1. Write an R function implementing the ECDF \(\hat{F}_n(t)\) according to its mathematical definition. Validate your implementation using R’s ecdf() function on the given data, with comparison based on their step functions.
x <- c(23, 45, 67, 89, 112, 156, 189, 245)

my_ecdf_f <- function(x){
  n <- length(x)
  function(t){
    vapply(t, function(tt) sum(x <= tt) / n, numeric(1))
  }
}

F_r <- ecdf(x)
F_mine <- my_ecdf_f(x)

t_grid <- seq(min(x) - 10, max(x) + 10, length.out = 1000)

plot_df <- data.frame(
  t = t_grid,
  R_ecdf = F_r(t_grid),
  My_ecdf = F_mine(t_grid)
)

plot_df_long <- plot_df %>%
  pivot_longer(
    cols = c(R_ecdf, My_ecdf),
    names_to = "Type",
    values_to = "CDF"
  )

Fn.plt <- ggplot(plot_df_long, aes(x = t, y = CDF, color = Type)) +
  geom_step(linewidth = 1) +
  scale_color_manual(values = c("R_ecdf" = "blue", "My_ecdf" = "red")) +
  labs(
    title = "Empirical CDF: My Implementation vs R ecdf()",
    subtitle = paste("Sample size n =", length(x)),
    x = "t",
    y = "F̂(t)"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt")
  )

ggplotly(Fn.plt)

The ECDF computed using my custom ECDF implementation closely matches the estimate produced by R’s ecdf() function, indicating that the implementation is correct.

  1. A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).

I agree with the claim because the ECDF evaluated at 100 hours is 0.5, meaning that about 50% of the observed failures occurred before 100 hours. Therefore, the ECDF supports this claim.

Question 2: Density Function Estimation

Consider the following failure times from a mechanical system:

12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4
  1. Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram’s distribution.
x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
h <- hist(x, breaks = seq(min(x), max(x), length.out = 4), freq = FALSE)

The estimated density of each bin in order is 0.0687023, 0.0916031, 0.0687023. The distribution is roughly symmetrical and unimodal.

  1. Write an R function that computes kernel density estimates using a Gaussian kernel with \(h=2\). Validate your implementation against R’s built-in density() function.

\[ \hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}. \]

my_kde <- function(x,h){
  n <- length(x)
  function(t){
    vapply(t, function(xi) 1/(n*h) * sum(dnorm((xi - x)/h)), numeric(1))
  }
}

F_mine <- my_kde(x, h = 2)
F_r <- density(x, bw = 2)

t_grid <- F_r$x

plot_df <- data.frame(
  t = t_grid,
  R_density = F_r$y,
  My_density = F_mine(t_grid)
)

plot_df_long <- plot_df %>%
  pivot_longer(
    cols = c(R_density, My_density),
    names_to = "Type",
    values_to = "Density"
  )

kde_plt <- ggplot(plot_df_long, aes(x = t, y = Density, color = Type)) +
  geom_line(linewidth = 1) +
  labs(
    title = "KDE: My Implementation vs R density()",
    subtitle = paste("Gaussian kernel, h =", 2, ", n =", length(x)),
    x = "t",
    y = "Density"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(kde_plt)

The KDE computed using my custom Gaussian kernel implementation closely matches the estimate produced by R’s density() function, indicating that the implementation is correct.

  1. Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with \(h=2\). Validate your implementation by comparing results with R’s built-in density() function for Gaussian kernel estimation.

\[ \hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1. \]

my_kde <- function(x,h){
  n <- length(x)
  function(t){
    vapply(t, function(tt) {
      u <- (tt - x)/h
      K <- ifelse(abs(u) <= 1, (3/4)*(1 - u^2), 0)
      1/(n*h) * sum(K)}, numeric(1))
  }
}

F_mine <- my_kde(x, h = 2)
F_r <- density(x, bw = 2, kernel = "epanechnikov")

t_grid <- F_r$x

plot_df <- data.frame(
  t = t_grid,
  R_density = F_r$y,
  My_density = F_mine(t_grid)
)

plot_df_long <- plot_df %>%
  pivot_longer(
    cols = c(R_density, My_density),
    names_to = "Type",
    values_to = "Density"
  )

kde_plt <- ggplot(plot_df_long, aes(x = t, y = Density, color = Type)) +
  geom_line(linewidth = 1) +
  labs(
    title = "KDE: My Implementation vs R density()",
    subtitle = paste("Gaussian kernel, h =", 2, ", n =", length(x)),
    x = "t",
    y = "Density"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(kde_plt)

The KDE computed using my custom Epanechnikov kernel implementation closely matches the estimate produced by R’s density() function, indicating that the implementation is correct. However, the two curves don’t completely overlap this time, I’m assuming the density() function must go above and beyond just the mathematical equation.

  1. How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select \(h=1.5\) versus \(h=2.5\)?

The Guassian curve is much more smooth than the Epanechnikov curve. Epanechnikov densities are allowed to be zero at the tails, this doesn’t seem to be true for Gaussian densities. For both kernels, the density curves become more rigid when h is decreased to 1.5 and more smooth when its increased to 2.5.

---
title: "Assignment 1: Estimating CDF and PDF"
author: "Charlie Morgan"
date: " Due: 02/02/26"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```
 
 \
 
## **Assignment Objectives** 

* Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

* Translate mathematical formulas into R functions and apply them to solve related problems.

* Create effective visualizations to demonstrate your understanding of key concepts in the following questions.



\

## **Question 1: Cumulative Distribution Function (CDF) Estimation**

The following failure times (in hours) were observed for 8 electronic components:

<center> 23, 45, 67, 89, 112, 156, 189, 245  </center>

a) Write an R function implementing the ECDF $\hat{F}_n(t)$ according to its mathematical definition. Validate your implementation using R's ecdf() function on the given data, with comparison based on their step functions.

```{r}
x <- c(23, 45, 67, 89, 112, 156, 189, 245)

my_ecdf_f <- function(x){
  n <- length(x)
  function(t){
    vapply(t, function(tt) sum(x <= tt) / n, numeric(1))
  }
}

F_r <- ecdf(x)
F_mine <- my_ecdf_f(x)

t_grid <- seq(min(x) - 10, max(x) + 10, length.out = 1000)

plot_df <- data.frame(
  t = t_grid,
  R_ecdf = F_r(t_grid),
  My_ecdf = F_mine(t_grid)
)

plot_df_long <- plot_df %>%
  pivot_longer(
    cols = c(R_ecdf, My_ecdf),
    names_to = "Type",
    values_to = "CDF"
  )

Fn.plt <- ggplot(plot_df_long, aes(x = t, y = CDF, color = Type)) +
  geom_step(linewidth = 1) +
  scale_color_manual(values = c("R_ecdf" = "blue", "My_ecdf" = "red")) +
  labs(
    title = "Empirical CDF: My Implementation vs R ecdf()",
    subtitle = paste("Sample size n =", length(x)),
    x = "t",
    y = "F̂(t)"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt")
  )

ggplotly(Fn.plt)
```

The ECDF computed using my custom ECDF implementation closely matches the estimate produced by R’s ecdf() function, indicating that the implementation is correct.

b) A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).
```{r, include=FALSE}
F_r(100)
```
I agree with the claim because the ECDF evaluated at 100 hours is `r F_r(100)`, meaning that about `r F_r(100)*100`% of the observed failures occurred before 100 hours. Therefore, the ECDF supports this claim.
\

## **Question 2: Density Function Estimation**

Consider the following failure times from a mechanical system:

<center> 12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4 </center>

a) Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram's distribution.
```{r}
x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
h <- hist(x, breaks = seq(min(x), max(x), length.out = 4), freq = FALSE)
```

The estimated density of each bin in order is `r h$density`. The distribution is roughly symmetrical and unimodal.

b) Write an R function that computes kernel density estimates using a Gaussian kernel with $h=2$. Validate your implementation against R's built-in density() function.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}.
$$
```{r}
my_kde <- function(x,h){
  n <- length(x)
  function(t){
    vapply(t, function(xi) 1/(n*h) * sum(dnorm((xi - x)/h)), numeric(1))
  }
}

F_mine <- my_kde(x, h = 2)
F_r <- density(x, bw = 2)

t_grid <- F_r$x

plot_df <- data.frame(
  t = t_grid,
  R_density = F_r$y,
  My_density = F_mine(t_grid)
)

plot_df_long <- plot_df %>%
  pivot_longer(
    cols = c(R_density, My_density),
    names_to = "Type",
    values_to = "Density"
  )

kde_plt <- ggplot(plot_df_long, aes(x = t, y = Density, color = Type)) +
  geom_line(linewidth = 1) +
  labs(
    title = "KDE: My Implementation vs R density()",
    subtitle = paste("Gaussian kernel, h =", 2, ", n =", length(x)),
    x = "t",
    y = "Density"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(kde_plt)
```

The KDE computed using my custom Gaussian kernel implementation closely matches the estimate produced by R’s density() function, indicating that the implementation is correct.

c) Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with $h=2$. Validate your implementation by comparing results with R's built-in density() function for Gaussian kernel estimation.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1.
$$

```{r}
my_kde <- function(x,h){
  n <- length(x)
  function(t){
    vapply(t, function(tt) {
      u <- (tt - x)/h
      K <- ifelse(abs(u) <= 1, (3/4)*(1 - u^2), 0)
      1/(n*h) * sum(K)}, numeric(1))
  }
}

F_mine <- my_kde(x, h = 2)
F_r <- density(x, bw = 2, kernel = "epanechnikov")

t_grid <- F_r$x

plot_df <- data.frame(
  t = t_grid,
  R_density = F_r$y,
  My_density = F_mine(t_grid)
)

plot_df_long <- plot_df %>%
  pivot_longer(
    cols = c(R_density, My_density),
    names_to = "Type",
    values_to = "Density"
  )

kde_plt <- ggplot(plot_df_long, aes(x = t, y = Density, color = Type)) +
  geom_line(linewidth = 1) +
  labs(
    title = "KDE: My Implementation vs R density()",
    subtitle = paste("Gaussian kernel, h =", 2, ", n =", length(x)),
    x = "t",
    y = "Density"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(kde_plt)
```

The KDE computed using my custom Epanechnikov kernel implementation closely matches the estimate produced by R’s density() function, indicating that the implementation is correct. However, the two curves don't completely overlap this time, I'm assuming the density() function must go above and beyond just the mathematical equation.

d) How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select $h=1.5$ versus $h=2.5$?

The Guassian curve is much more smooth than the Epanechnikov curve. Epanechnikov densities are allowed to be zero at the tails, this doesn't seem to be true for Gaussian densities. For both kernels, the density curves become more rigid when h is decreased to 1.5 and more smooth when its increased to 2.5.




