(1) First write a function that will produce a sample of a random variable that is distributed as follows: \[f(x)=x, \; 0\leq x \leq 1 \\ f(x)=2-x, \; 1 < x \leq 2\]
First, let’s plot this particular PDF:
f1 <- function(x, lower = 0, mid = 1, upper = 2) {
  # Piecewise triangular PDF: rises on [0, 1], falls on (1, 2], zero elsewhere
  ifelse(x >= lower & x <= mid, x,
         ifelse(x > mid & x <= upper, 2 - x, 0))
}
tmp <- seq(0, 2, 0.001)
plot(tmp, f1(tmp), type = "l")
The PDF puts the most density near 1 and the least near the outer bounds of the distribution, i.e. 0 and 2.
We will use the inverse CDF method to sample from this distribution: we integrate the PDF piecewise to get the CDF, then invert it so we can sample using a uniform distribution on (0,1).
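This works because of the inverse transform identity: if \(U\) is uniform on \((0,1)\) and \(F\) is a continuous, strictly increasing CDF, then \(X = F^{-1}(U)\) has CDF \(F\): \[P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)\]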
Integrating the left-hand piece of the PDF and solving for \(x\) in terms of \(u\), where \(u\) is defined on \((0,1)\):
\(\text{For } 0 \leq x \leq 1:\) \[u = F(x) = \int_{0}^{x} y \;dy = x^2/2 \\ \rightarrow x = \sqrt{2u}, \; 0 \leq u \leq 1/2\]
To carry the CDF into the right-hand piece, we evaluate it at the boundary; remember that the integral of the PDF over the whole support must equal 1.
\[F(1) = 1/2 \]
Now we solve for the right-hand piece of the PDF, carrying over the mass accumulated on the left. Inverting the resulting quadratic gives two roots, \(x = 2 \pm \sqrt{2-2u}\); we keep the root that satisfies \(1 < x \leq 2\).
\(\text{For } 1 < x \leq 2:\) \[u = F(x) = 1/2 + \int_{1}^{x} (2-y) \;dy = 1/2 + [2y - y^2/2]\Big|_1^x \\ = 1/2 + (2x - x^2/2) - (2-1/2) = -x^2/2+2x-1 \\ \rightarrow x = 2-\sqrt{2-2u}, \; 1/2 < u \leq 1\]
Since \(F(X)\) is uniformly distributed on (0,1), and our inverse is written in terms of that uniform variable, we can sample using R’s native uniform random number generator. This inverse CDF function, F_i1, is coded below.
F_i1 <- function(x) {
  # Inverse CDF of f1: left branch for u <= 1/2, right branch otherwise
  ifelse(x <= 1/2, sqrt(2 * x), 2 - sqrt(2 - 2 * x))
}
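As a quick sanity check (a sketch using base R’s integrate), integrating the PDF from 0 up to F_i1(u) should recover u:
# Each value returned should match the corresponding entry of u (up to numerical error)
u <- c(0.1, 0.25, 0.5, 0.75, 0.9)
sapply(u, function(p) integrate(f1, 0, F_i1(p))$value)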
Let’s look at the CDF. There should be an inflection point visible at x = 1, changing from concave up to concave down, because the PDF (the derivative of the CDF) switches from increasing to decreasing there.
s <- seq(0, 1, 0.01)
plot(F_i1(s), s)   # plotting (F_i1(u), u) pairs traces out the CDF
It follows the trend we expected.
(2) Now, write a function that will produce a sample of a random variable that is distributed as follows:
\[f(x) = 1-x, \; 0 \leq x \leq 1 \\ f(x) = x-1, \; 1 < x \leq 2\]
Plotting the PDF:
f2 <- function(x, lower = 0, mid = 1, upper = 2) {
  # V-shaped PDF: falls on [0, 1], rises on (1, 2], zero elsewhere
  ifelse(x >= lower & x <= mid, 1 - x,
         ifelse(x > mid & x <= upper, x - 1, 0))
}
plot(tmp, f2(tmp), type = "l")
This PDF puts the most density near the outer bounds of the distribution, i.e. 0 and 2, and the least near the center at 1.
Following the same method as before and creating the inverse CDF:
\(\text{For } 0 \leq x \leq 1:\)
\[u = F(x) = \int_{0}^{x} (1-y) \;dy = [y - y^2/2]\Big|_0^x = x-x^2/2 \\ \rightarrow x = 1-\sqrt{1-2u}, \; 0 \leq u \leq 1/2 \\ F(1) = 1/2\]
\(\text{For } 1 < x \leq 2:\)
\[u = F(x) = 1/2 + \int_{1}^{x} (y-1) \;dy = 1/2 + [y^2/2-y]\Big|_1^x = x^2/2-x+1 \\ \rightarrow x = \sqrt{2u-1} + 1, \; 1/2 < u \leq 1\]
F_i2 <- function(x) {
  # Inverse CDF of f2: left branch for u <= 1/2, right branch otherwise
  ifelse(x <= 0.5, 1 - sqrt(1 - 2 * x), sqrt(2 * x - 1) + 1)
}
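The two branches of each inverse should meet at the breakpoint u = 1/2 (both should return 1 there), and the same integrate spot-check (a sketch) applies to f2:
F_i1(0.5)   # should be exactly 1
F_i2(0.5)   # should be exactly 1
sapply(c(0.2, 0.8), function(p) integrate(f2, 0, F_i2(p))$value)   # should return 0.2 and 0.8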
Thinking about how the CDF should look, we should expect an inflection point at x = 1 again; this time, however, it should switch from concave down to concave up, because the PDF now changes from decreasing to increasing there.
plot(F_i2(s), s)
This behavior matches what we would imagine.
(3) Draw 1000 samples (call your function 1000 times each) from each of the above two distributions and plot the resulting histograms. You should have one histogram for each PDF. See that it matches your understanding of these PDFs.
n <- 1000
x <- runif(n)                           # one set of uniform draws, reused for both PDFs
hist(F_i1(x), freq = FALSE)             # samples from f1 via the inverse CDF
s <- seq(0, 2, 0.01)
lines(s, f1(s), lty = 2, col = "red")   # overlay the true PDF
hist(F_i2(x), freq = FALSE)             # samples from f2 via the inverse CDF
lines(s, f2(s), lty = 2, col = "red")
The histograms bear out our earlier reading of the PDFs.
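For a check that is more quantitative than eyeballing, a Kolmogorov-Smirnov test compares the samples against the analytic CDF (a sketch; F1_cdf is a helper we assemble here from the piecewise integrals derived above):
# Analytic CDF of f1, pieced together from the integrals above
F1_cdf <- function(q) ifelse(q <= 1, q^2 / 2, -q^2 / 2 + 2 * q - 1)
ks.test(F_i1(runif(1000)), F1_cdf)   # a large p-value means no detectable mismatch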
Let us check against histograms produced by sampling from these PDFs with R’s distr package.
For f1:
library(distr)   # provides AbscontDistribution and the r() accessor
rdist <- r(AbscontDistribution(d = f1, low1 = 0, up1 = 2))
x <- rdist(1000)
s <- seq(0, 2, 0.01)
hist(x, freq = FALSE)
lines(s, f1(s), lty = 2, col = "red")
For f2:
rdist <- r(AbscontDistribution(d = f2, low1 = 0, up1 = 2))
x <- rdist(1000)
s <- seq(0, 2, 0.01)
hist(x, freq = FALSE)
lines(s, f2(s), lty = 2, col = "red")
Our inverse-CDF methodology seems to match distr’s simulation of the same PDFs.
(4) Now, write a program that will take a sample set size n as a parameter and the PDF as the second parameter, and perform 1000 iterations where it samples from the PDF, each time taking n samples and computes the mean of these n samples. It then plots a histogram of these 1000 means that it computes.
Ideally, we could sample a value \(x\) from an arbitrary distribution \(f(x)\) by solving for the upper bound of integration, \(b\), in \(\int_a^b f(x)\:dx = U\), giving \(b = F^{-1}(F(a)+U)\), where \(U\) is uniformly distributed on (0,1). Solving analytically for this \(b\) for an arbitrary PDF is computationally difficult: the support differs from one PDF to the next (possibly ranging over \((-\infty, \infty)\)); the proper root of any polynomial involved has to be selected; and all of this must be done accurately and with acceptable run-time.
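To make the brute-force idea concrete, here is a minimal numerical version (a sketch; sample_inv_cdf is our own illustrative helper, not from any package) that inverts the CDF with integrate and uniroot:
# Numerical inverse-CDF sampling: assumes pdf is a proper density on [lo, hi]
sample_inv_cdf <- function(n, pdf, lo, hi) {
  cdf <- function(q) integrate(pdf, lo, q)$value
  # For each uniform draw u, root-find the quantile q with cdf(q) = u
  sapply(runif(n), function(u)
    uniroot(function(q) cdf(q) - u, lower = lo, upper = hi)$root)
}
# e.g. hist(sample_inv_cdf(1000, f1, 0, 2), freq = FALSE)
It works, but it re-integrates the PDF at every root-finding step, which is exactly the run-time concern raised above.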
R’s powerful distr package handles some of these complexities well. The lower and upper bounds of the generic PDF still need to be supplied for the PDF to be sampled correctly: it appears that if there are gaps in the PDF, or if the PDF hits zero while being scanned from the left, distr stops defining the distribution and only samples up to the gap. We will implement the simulation using distr, passing the bounds of the PDF so that it samples the full support.
# l and h define the low and high bounds, respectively, of the PDF
pdf_mean <- function(n, foo, l = 0, h = 2) {
  simRun <- 1000
  mean_v <- numeric(simRun)   # preallocate the vector of means
  PDF <- match.fun(foo)
  rdist <- r(AbscontDistribution(d = PDF, low1 = l, up1 = h))
  for (i in 1:simRun) {
    x <- rdist(n)             # draw n samples from the PDF
    mean_v[i] <- mean(x)      # record their mean
  }
  hist(mean_v)
}
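If we only cared about f1 and f2, whose inverses we already derived, we could skip distr entirely (a sketch; pdf_mean_inv is our own hypothetical variant):
# Same simulation, driven by an analytic inverse CDF instead of distr
pdf_mean_inv <- function(n, inv, simRun = 1000) {
  means <- replicate(simRun, mean(inv(runif(n))))
  hist(means)
}
# e.g. pdf_mean_inv(100, F_i1)
Both versions should produce similar histograms; the distr version has the advantage of working for PDFs whose inverse we never derive.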
Using the example PDFs f1 and f2, we will simulate with \(n\) draws for each of the 1000 iterations, starting off with \(n = 100\).
pdf_mean(100, f1, 0, 2)
The 1000 means of 100 draws each form an unsurprisingly normal-looking distribution centered around 1.
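We can make this quantitative (a sketch, using our analytic inverse rather than distr): direct integration gives \(E[X] = 1\) and \(E[X^2] = 7/6\) for f1, so \(\mathrm{Var}(X) = 1/6\), and the CLT predicts the mean of \(n\) draws is approximately \(N(1, 1/(6n))\). Overlaying that density:
# Overlay the CLT-predicted normal on a fresh histogram of f1 means
n_draws <- 100
means1 <- replicate(1000, mean(F_i1(runif(n_draws))))
hist(means1, freq = FALSE)
curve(dnorm(x, mean = 1, sd = sqrt(1 / (6 * n_draws))), add = TRUE, lty = 2, col = "blue")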
Using the same simulation, we will sample the means of the odd f2, whose V-shaped PDF looks something like an inverted normal distribution.
pdf_mean(100, f2, 0, 2)
The means of f2 really show the power of the Central Limit Theorem: even starting from a distribution that could be called the opposite of a normal, the sample means still form a normal distribution.
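Another way to see this (a sketch) is a normal Q-Q plot of the f2 means: if they are approximately normal, the points fall along a straight line.
# Q-Q plot of 1000 means of n = 100 draws from f2
means2 <- replicate(1000, mean(F_i2(runif(100))))
qqnorm(means2)
qqline(means2)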
(5) Verify that as you set n to something like 10 or 20, each of the two PDFs produces a normally distributed mean of samples, empirically verifying the Central Limit Theorem. Play around with various values of n and you’ll see that even for reasonably small sample sizes such as 10, the Central Limit Theorem holds.
Using a sample size of 10 for f1 and f2:
pdf_mean(10, f1, 0, 2)
pdf_mean(10, f2, 0, 2)
Even with only 10 draws from each PDF, the Central Limit Theorem dominates the simulation, with the means again clustering around 1.
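As a final check (a sketch), the CLT also predicts how tight that clustering should be: for f1, a mean of 10 draws should have standard deviation about \(\sqrt{(1/6)/10} \approx 0.129\).
# Empirical sd of 1000 means of 10 draws from f1 vs. the CLT prediction of ~0.129
sd(replicate(1000, mean(F_i1(runif(10)))))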