Assignment Objectives

  • Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

  • Translate mathematical formulas into R functions and apply them to solve related problems.

  • Create effective visualizations to demonstrate your understanding of key concepts in the following questions.


Question 1: Cumulative Distribution Function (CDF) Estimation

The following failure times (in hours) were observed for 8 electronic components:

23, 45, 67, 89, 112, 156, 189, 245
  1. Write an R function implementing the ECDF \(\hat{F}_n(t)\) according to its mathematical definition. Validate your implementation using R’s ecdf() function on the given data, with comparison based on their step functions.

Work for Part A:

times <- c(23, 45, 67, 89, 112, 156, 189, 245)
uniq.time <- sort(unique(times)) #Used to sort data values and remove dublicates
my.ECDF <- function(indat, outx){ #Used to define function
  freq.table <- table(indat) #Used to create frequency table
  uniq <- as.numeric(names(freq.table)) #Gives unique values
  rep.time <- as.vector(freq.table) #Turns frequencies into numeric vector
  cum.rel.feq <- cumsum(rep.time)/sum(rep.time) #Gets the cumulative relative frequency
  cum.prob <- NULL
  for (i in 1:length(outx)){ 
    intvl.id <- which(uniq <= outx[i]) #Used to identify the index meeting the condition
    cum.prob[i] <- cum.rel.feq[max(intvl.id)] #Used to get cumulative probability
  }
  cum.prob #Used to get vector of ECDF values
}

plot(uniq.time, my.ECDF(indat=times, outx=uniq.time), #Assigns uniq.time to x-value, my.ECDF to y-values
     type ="s", #Indicates it should be a step function
     main = "ECDF using Mathematical Definition",
     xlab = "Failure Times",
     ylab = "Cumulative Probability") 

r.ECDF <- ecdf(times) #Uses ecdf function on times

plot(r.ECDF, verticals = TRUE, pch=46, #indicates there should be vertical jumps
     main = "ECDF using R",
     xlab = "Failure Times",
     ylab = "Cumulative Probability")

Comparing the ECDF using the Mathematical Definition and the ECDF using the ecdf() r function, the resulting step functions appear to be the same.

  1. A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).

Work for Part B:

I would say that based on the ECDF functions presented in the graphs above, it makes sense that probability of failure before 100 hours is 0.5, since the cumulative probability at 100 hours is approximately 0.5.

Question 2: Density Function Estimation

Consider the following failure times from a mechanical system:

12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4
  1. Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram’s distribution.

Work for Part A:

times2 <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

#Creates histogram with 3 bins ranging from min of times2 and max of times2
hist(times2, breaks = seq(min(times2), max(times2), length.out = 4), main = "Histogram of Failure Times")

In the histogram above, the distribution seems to center around 19, with the majority of observations being in the center bin. The two bins to the side of the center bin are the same height, indicating that this seems to be a symmetric distribution.

  1. Write an R function that computes kernel density estimates using a Gaussian kernel with \(h=2\). Validate your implementation against R’s built-in density() function.

\[ \hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}. \]

Work for Part B:

gauss.kde <- function(t, data, h) { #defines function
  n <- length(data) #gets size of n
  K <- function(u) (1 / sqrt(2 * pi)) * exp(-0.5 * u^2) #Used to compute Gaussian kernel
  sapply(t, function(x) (1 / (n * h)) * sum(K((x - data) / h))) #Used to apply kernel to scaled data
}

plot(density(times2, kernel = "gaussian", bw = 2), #Used to plot kde using R's function
main = "My Function vs. Built-In", lwd = 2) 
lines(seq(10, 30, 0.1), gauss.kde(seq(10, 30, 0.1), #Used to plot kde using my function
times2, 2), lwd = 2, col = "orange")

The function I made follows the R’s density function very closely.

  1. Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with \(h=2\). Validate your implementation by comparing results with R’s built-in density() function for Gaussian kernel estimation.

\[ \hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1. \] # Work for Part C:

epan.kde <- function(t, data, h) { #defines function
  n <- length(data) #gets size of n
  K <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)  #Used to compute Epanechnikov kernel
  sapply(t, function(x) (1 / (n * h)) * sum(K((x - data) / h))) #Used to apply kernel to scaled data
}
plot(density(times2, kernel = "epanechnikov", bw = 2),
#Used to plot kde using R's function
main = "My Function vs. Built-In", lwd = 2)
lines(seq(10, 30, 0.1), epan.kde(seq(10, 30, 0.1),
#Used to plot kde using my function
times2, 2), lwd = 2, col = "orange")

The function I made does not seem to follow the R function as closely as in the previous example.

  1. How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select \(h=1.5\) versus \(h=2.5\)?

Work for Part D:

#Used to plot kde's using my function for various h values
plot(gauss.kde(seq(10, 30, 0.1), times2, 1.5), type = "l", main = "Gaussian with h = 1.5", ylab = "Density") 

plot(gauss.kde(seq(10, 30, 0.1), times2, 2.5), type = "l", main = "Gaussian with h = 2.5", ylab = "Density")

plot(epan.kde(seq(10, 30, 0.1), times2, 1.5), type = "l", main = "Epanechnikov with h = 1.5", ylab = "Density")

plot(epan.kde(seq(10, 30, 0.1), times2, 2.5), type = "l", main = "Epanechnikov with h = 2.5", ylab = "Density")

When h = 1.5, both of the density estimates become less smooth. On the other hand when h = 2.5, the density estimates become more smooth.

---
title: "Assignment 1: Estimating CDF and PDF"
author: "Grace Lippert"
date: " Due: 2/3/2026"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```
 
 \
 
## **Assignment Objectives** 

* Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

* Translate mathematical formulas into R functions and apply them to solve related problems.

* Create effective visualizations to demonstrate your understanding of key concepts in the following questions.



\

## **Question 1: Cumulative Distribution Function (CDF) Estimation**

The following failure times (in hours) were observed for 8 electronic components:

<center> 23, 45, 67, 89, 112, 156, 189, 245  </center>

a) Write an R function implementing the ECDF $\hat{F}_n(t)$ according to its mathematical definition. Validate your implementation using R's ecdf() function on the given data, with comparison based on their step functions.

# Work for Part A:

```{r}
times <- c(23, 45, 67, 89, 112, 156, 189, 245)
uniq.time <- sort(unique(times)) #Used to sort data values and remove dublicates
my.ECDF <- function(indat, outx){ #Used to define function
  freq.table <- table(indat) #Used to create frequency table
  uniq <- as.numeric(names(freq.table)) #Gives unique values
  rep.time <- as.vector(freq.table) #Turns frequencies into numeric vector
  cum.rel.feq <- cumsum(rep.time)/sum(rep.time) #Gets the cumulative relative frequency
  cum.prob <- NULL
  for (i in 1:length(outx)){ 
    intvl.id <- which(uniq <= outx[i]) #Used to identify the index meeting the condition
    cum.prob[i] <- cum.rel.feq[max(intvl.id)] #Used to get cumulative probability
  }
  cum.prob #Used to get vector of ECDF values
}

plot(uniq.time, my.ECDF(indat=times, outx=uniq.time), #Assigns uniq.time to x-value, my.ECDF to y-values
     type ="s", #Indicates it should be a step function
     main = "ECDF using Mathematical Definition",
     xlab = "Failure Times",
     ylab = "Cumulative Probability") 

r.ECDF <- ecdf(times) #Uses ecdf function on times

plot(r.ECDF, verticals = TRUE, pch=46, #indicates there should be vertical jumps
     main = "ECDF using R",
     xlab = "Failure Times",
     ylab = "Cumulative Probability")

```
Comparing the ECDF using the Mathematical Definition and the ECDF using the ecdf() r function, the resulting step functions appear to be the same.  

b) A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).

# Work for Part B:

I would say that based on the ECDF functions presented in the graphs above, it makes sense that probability of failure before 100 hours is 0.5, since the cumulative probability at 100 hours is approximately 0.5.  

## **Question 2: Density Function Estimation**

Consider the following failure times from a mechanical system:

<center> 12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4 </center>

a) Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram's distribution.

# Work for Part A:

```{r}
times2 <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

#Creates histogram with 3 bins ranging from min of times2 and max of times2
hist(times2, breaks = seq(min(times2), max(times2), length.out = 4), main = "Histogram of Failure Times")
```

In the histogram above, the distribution seems to center around 19, with the majority of observations being in the center bin.  The two bins to the side of the center bin are the same height, indicating that this seems to be a symmetric distribution.  


b) Write an R function that computes kernel density estimates using a Gaussian kernel with $h=2$. Validate your implementation against R's built-in density() function.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}.
$$

# Work for Part B:

```{r}
gauss.kde <- function(t, data, h) { #defines function
  n <- length(data) #gets size of n
  K <- function(u) (1 / sqrt(2 * pi)) * exp(-0.5 * u^2) #Used to compute Gaussian kernel
  sapply(t, function(x) (1 / (n * h)) * sum(K((x - data) / h))) #Used to apply kernel to scaled data
}

plot(density(times2, kernel = "gaussian", bw = 2), #Used to plot kde using R's function
main = "My Function vs. Built-In", lwd = 2) 
lines(seq(10, 30, 0.1), gauss.kde(seq(10, 30, 0.1), #Used to plot kde using my function
times2, 2), lwd = 2, col = "orange")

```

The function I made follows the R's density function very closely.  

c) Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with $h=2$. Validate your implementation by comparing results with R's built-in density() function for Gaussian kernel estimation.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1.
$$
# Work for Part C:

```{r}
epan.kde <- function(t, data, h) { #defines function
  n <- length(data) #gets size of n
  K <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)  #Used to compute Epanechnikov kernel
  sapply(t, function(x) (1 / (n * h)) * sum(K((x - data) / h))) #Used to apply kernel to scaled data
}
plot(density(times2, kernel = "epanechnikov", bw = 2),
#Used to plot kde using R's function
main = "My Function vs. Built-In", lwd = 2)
lines(seq(10, 30, 0.1), epan.kde(seq(10, 30, 0.1),
#Used to plot kde using my function
times2, 2), lwd = 2, col = "orange")


```

The function I made does not seem to follow the R function as closely as in the previous example.  

d) How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select $h=1.5$ versus $h=2.5$?

# Work for Part D:

```{r}

#Used to plot kde's using my function for various h values
plot(gauss.kde(seq(10, 30, 0.1), times2, 1.5), type = "l", main = "Gaussian with h = 1.5", ylab = "Density") 
plot(gauss.kde(seq(10, 30, 0.1), times2, 2.5), type = "l", main = "Gaussian with h = 2.5", ylab = "Density")

plot(epan.kde(seq(10, 30, 0.1), times2, 1.5), type = "l", main = "Epanechnikov with h = 1.5", ylab = "Density")

plot(epan.kde(seq(10, 30, 0.1), times2, 2.5), type = "l", main = "Epanechnikov with h = 2.5", ylab = "Density")

```

When h = 1.5, both of the density estimates become less smooth.  On the other hand when h = 2.5, the density estimates become more smooth.

