Important notes

* The first assignment is an R homework assignment to get you started with basic coding in R. 
Writing some of these functions will come in handy when we eventually use R for more complicated
analyses later in the course. Marks will be given for correct output, and also for how well
the code is written and whether the problem has been approached correctly. In several places,
you are also asked to demonstrate your understanding of the results you have obtained. Please do 
so carefully; this will help distinguish your submission from those of your classmates. 

* Submitting R Markdown files (along with a knitted html/pdf/word document) is mandatory; they should 
explain the desired output of each step.

=======================================================

(45 marks) Create a matrix of size \(n \times m\) and fill it with random samples drawn from each of the following three distributions:

A. U[a = 5, b = 100]

B. N(\(\mu = 5\), \(\sigma^2=100\))

C. t(df = 2)

Fix \(m\) at 100, and use \(n=5, 50, 500\).

What can you say about the distribution of the column means in each of the 9 cases? (Hint: drawing conclusions from histograms is good enough.)
For each of the three cases A, B, and C, which choice of \(n\) (among the three listed) seems sufficient for one of the most celebrated results in statistics, the Central Limit Theorem, to hold?
Please make all your output readable; for example, your figures should have a proper chart title.

#write your code here; see below for a part of the code   
#matrix A with n = 5
n <- 5
m <- 100
set.seed(123)
matrix_A <- matrix(runif(n*m, min=5, max=100), nrow=n, ncol=m)
colMeans_A <- colMeans(matrix_A)
hist(colMeans_A, main = "For n = 5 samples coming from Uni[5,100]")

#matrix A with n = 50
n <- 50
set.seed(123)
matrix_A50 <- matrix(runif(n*m, min=5, max=100), nrow=n, ncol=m)
colMeans_A50 <- colMeans(matrix_A50)
hist(colMeans_A50, main = "For n = 50 samples coming from Uni[5,100]")

#matrix A with n = 500
n <- 500
set.seed(123)
matrix_A500 <- matrix(runif(n*m, min=5, max=100), nrow=n, ncol=m)
colMeans_A500 <- colMeans(matrix_A500)
hist(colMeans_A500, main = "For n = 500 samples coming from Uni[5,100]")

#matrix B with n = 5
n <- 5
m <- 100
set.seed(123)
#sigma^2 = 100, so sd = sqrt(100) = 10
matrix_B <- matrix(rnorm(n*m, mean=5, sd=sqrt(100)), nrow=n, ncol=m)
colMeans_B <- colMeans(matrix_B)
hist(colMeans_B, main = "For n = 5 samples coming from N(5,100)")

#matrix B with n = 50
n <- 50
set.seed(123)
matrix_B50 <- matrix(rnorm(n*m, mean=5, sd=sqrt(100)), nrow=n, ncol=m)
colMeans_B50 <- colMeans(matrix_B50)
hist(colMeans_B50, main = "For n = 50 samples coming from N(5,100)")

#matrix B with n = 500
n <- 500
set.seed(123)
matrix_B500 <- matrix(rnorm(n*m, mean=5, sd=sqrt(100)), nrow=n, ncol=m)
colMeans_B500 <- colMeans(matrix_B500)
hist(colMeans_B500, main = "For n = 500 samples coming from N(5,100)")

#matrix C with n = 5
#rt() draws random t variates; dt() only evaluates the density on a grid
n <- 5
m <- 100
set.seed(123)
matrix_C <- matrix(rt(n*m, df=2), nrow=n, ncol=m)
colMeans_C <- colMeans(matrix_C)
hist(colMeans_C, main = "For n = 5 samples coming from t(df = 2)")

#matrix C with n = 50
n <- 50
set.seed(123)
matrix_C50 <- matrix(rt(n*m, df=2), nrow=n, ncol=m)
colMeans_C50 <- colMeans(matrix_C50)
hist(colMeans_C50, main = "For n = 50 samples coming from t(df = 2)")

#matrix C with n = 500
n <- 500
set.seed(123)
matrix_C500 <- matrix(rt(n*m, df=2), nrow=n, ncol=m)
colMeans_C500 <- colMeans(matrix_C500)
hist(colMeans_C500, main = "For n = 500 samples coming from t(df = 2)")

Print your output here.

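In general, the column means become more tightly concentrated around the true mean and closer to normally distributed as the sample size n increases, as expected under the Central Limit Theorem. Among the three listed choices, n = 500 gives the clearest approximation for cases A and B. Two caveats apply: in case B the column means are exactly normal for every n because the data are already normal, and in case C the t(df = 2) distribution has infinite variance, so the classical CLT does not guarantee normal column means even at n = 500.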
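The nine cases can also be generated in one pass. The sketch below is a minimal illustration rather than the required solution: `sim_col_means`, `samplers`, and `results` are hypothetical names, the normal case uses `sd = sqrt(100)` because the question specifies the variance, and the t case uses `rt()` to draw random values. It compares the observed spread of the column means with the theoretical value \(\sigma/\sqrt{n}\) for cases A and B; no such value exists for case C because t(df = 2) has infinite variance.

```r
# Hypothetical helper: draw an n x m matrix from `rdist` and return its column means.
sim_col_means <- function(rdist, n, m = 100) {
  colMeans(matrix(rdist(n * m), nrow = n, ncol = m))
}

# Samplers for the three cases; each takes the number of draws.
samplers <- list(
  A = function(k) runif(k, min = 5, max = 100),
  B = function(k) rnorm(k, mean = 5, sd = sqrt(100)),  # variance 100 -> sd 10
  C = function(k) rt(k, df = 2)
)

# Theoretical standard deviation of a single draw (NA for C: infinite variance).
sigma <- c(A = (100 - 5) / sqrt(12), B = 10, C = NA)

set.seed(123)
results <- expand.grid(case = names(samplers), n = c(5, 50, 500),
                       stringsAsFactors = FALSE)
results$sd_of_means <- NA_real_
results$theory <- NA_real_

for (i in seq_len(nrow(results))) {
  means <- sim_col_means(samplers[[results$case[i]]], results$n[i])
  results$sd_of_means[i] <- sd(means)
  results$theory[i] <- sigma[[results$case[i]]] / sqrt(results$n[i])
}

print(results)
```

Passing each vector of column means to `hist()` with an informative `main` title would reproduce the nine histograms asked for in the question.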

(25 marks) Recall that the head() function displays the first 6 observations, and the tail() function displays the final 6 observations of your data.

Make a function, say, pick(data, p, k) that picks \(k\) points around the value of \(p\). Here \(p\) is a number between 0 and 1 and it represents a percentile, so if I give \(p = 0.1\) and \(k=3\) as input, it means pick 3 points around the 10th percentile of the number of rows in your data.
Show how you will use the function you created in a. to give the same output as the head() function.
Show how you will use the function you created in a. to extract the middle 6 rows of your data.

Use iris data to demonstrate your results

#write your code here  

#Part a
pick <- function(data, p, k) {
  
  #row index corresponding to the p-th percentile
  p_index <- round(p * nrow(data))
  
  #start of a k-row window centred on p_index, clamped to the data
  first_value <- p_index - floor((k - 1) / 2)
  first_value <- min(max(1, first_value), nrow(data) - k + 1)
  last_value <- first_value + k - 1
  
  #return exactly k rows (assumes k <= nrow(data))
  return(data[first_value:last_value, , drop = FALSE])
}

#Part b
pick(iris, p = 0, k = 6)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

#Part c
pick(iris, p = .5, k = 6)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 73          6.3         2.5          4.9         1.5 versicolor
## 74          6.1         2.8          4.7         1.2 versicolor
## 75          6.4         2.9          4.3         1.3 versicolor
## 76          6.6         3.0          4.4         1.4 versicolor
## 77          6.8         2.8          4.8         1.4 versicolor
## 78          6.7         3.0          5.0         1.7 versicolor

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
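The comparison with head() can also be checked programmatically instead of by eye. The sketch below is self-contained and uses a hypothetical helper, `pick_k`, which assumes \(1 \le k \le\) the number of rows and always returns exactly \(k\) rows; it is one possible reading of "points around the percentile", not the only one.

```r
# Hypothetical helper: return exactly k rows of `data`, centred on the
# p-th percentile of the row index (assumes 1 <= k <= nrow(data)).
pick_k <- function(data, p, k) {
  n <- nrow(data)
  stopifnot(p >= 0, p <= 1, k >= 1, k <= n)
  centre <- round(p * n)
  first <- centre - floor((k - 1) / 2)
  first <- min(max(first, 1), n - k + 1)  # clamp the window to the data
  data[first:(first + k - 1), , drop = FALSE]
}

# p = 0 should select the same rows as head(), p = 1 the same rows as tail().
identical(rownames(pick_k(iris, p = 0, k = 6)), rownames(head(iris)))
identical(rownames(pick_k(iris, p = 1, k = 6)), rownames(tail(iris)))

# Row labels of the middle 6 rows of the 150-row iris data.
rownames(pick_k(iris, p = 0.5, k = 6))
```

With 150 rows, the middle six are rows 73 to 78, leaving 72 rows on either side.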

Print your output here.

(30 marks) In this problem, we will write a couple of easy functions and then try to vectorize a few of them.
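1. Write a function that computes \(\psi(x)\) given a value of \(x\), where the function is defined as:
\[\psi(x) = \begin{cases} x^2 & \text{if } |x| \le 1 \\ 2|x| - 1 & \text{if } |x| > 1 \end{cases}\]
Verify that your function returns 1 when \(x = 1\) and 7 when \(x = 4\).
1. Write another (modified) function that computes \(\psi(x, a)\) given values of \(x\) and \(a\), where the function is defined as:
\[\psi(x, a) = \begin{cases} x^2 & \text{if } |x| \le a \\ 2a|x| - a^2 & \text{if } |x| > a \end{cases}\]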
Say you name your function myFun_3b. Report the outputs of
```
 + myFun_3b(3,2) 

 + myFun_3b(3,4) 

 + myFun_3b(1,3)

 + myFun_3b(a = 1, x= 3). Explain why you get a different answer from myFun_3b(1,3). 
```
1. Vectorize your myFun_3b() function, so that the first input can actually be a vector of numbers, and what is returned is a vector whose elements give the function evaluated at each of these numbers. Hint: you might try using ifelse(), if you haven’t already, to vectorize nicely. Check that myFun_3b(x=1:6, a=3) returns the vector of numbers (1, 4, 9, 15, 21, 27).

#write your code here  

#Part A
PhiFun <- function(x) {
  if(abs(x)<=1){
    y <- x^2
  } else{
    y <- (2*abs(x)) -1
  }
  return(y)
}

#Part B
PhiFun_3b <- function(x,a) {
  if(abs(x)<=a){
    y <- x^2
  } else{
    y <- (2*a*abs(x)) - (a^2)
  }
  return(y)
}

PhiFun_3b(3,2)

## [1] 8

PhiFun_3b(3,4)

## [1] 9

PhiFun_3b(1,3)

## [1] 1

PhiFun_3b(a = 1, x = 3)

## [1] 5

#This answer is different from PhiFun_3b(1,3) because the arguments are matched by name here, so x = 3 and a = 1. Without names, R matches arguments by position, so PhiFun_3b(1,3) assigns x = 1 and a = 3.
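The difference between positional and named matching can be seen in isolation with a small sketch. `show_args` is a hypothetical function that only reports how its arguments were bound; it is not part of the assignment.

```r
# Hypothetical two-argument function that reports how its arguments were bound.
show_args <- function(x, a) {
  c(x = x, a = a)
}

show_args(1, 3)          # positional: x = 1, a = 3
show_args(a = 1, x = 3)  # named: x = 3, a = 1
show_args(3, a = 1)      # mixed: a is matched by name, the remaining value goes to x
```

Naming the arguments makes a call independent of the order in which the function declares its parameters.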

#Part C
PhiFun_3c <- function(x,a) {
  #initialize variables; y has one slot per element of x
  x <- as.vector(x)
  y <- numeric(length(x))
  
  #piecewise function, applied to each branch by logical indexing
  inside <- abs(x) <= a
  y[inside] <- x[inside]^2
  y[!inside] <- (2*a*abs(x[!inside])) - (a^2)
  
  #return y
  return(y)
}

PhiFun_3c(x = 1:6, a = 3)

## [1]  1  4  9 15 21 27
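The hint in the question suggests ifelse(). A minimal sketch of that route is shown below; `psi_vec` is a hypothetical name, and the body uses the same piecewise rule as PhiFun_3b, applied element by element.

```r
# Hypothetical name; same piecewise rule as PhiFun_3b, vectorized over x.
psi_vec <- function(x, a) {
  ifelse(abs(x) <= a, x^2, 2 * a * abs(x) - a^2)
}

psi_vec(x = 1:6, a = 3)          # 1 4 9 15 21 27
psi_vec(x = c(-4, 0, 4), a = 1)  # 7 0 7
```

ifelse() evaluates both branches for the whole vector and then selects per element, which is fine here because both expressions are defined for every real x.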

Print your output here.

(ungraded, no need to submit the solution) Do the descriptive analysis on the iris data. The descriptive analysis tells us more about the data. You could, for example, plot two variables at a time (to see a scatter plot between them), you could make histograms of each variable, do a summary, make boxplots, etc. Write your commands and show the output below. Also, write a short para (6-8 lines) explaining your understanding of the iris data.

summary(iris)

The iris data records the length and width of the sepal and petal of iris flowers, together with the species of each flower. The summary reports the minimum, maximum, mean, and the first, second, and third quartiles of each of the four measurements.

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(binwidth = 0.5, fill = "green") + theme_minimal()

The sepal length has an average of around 6, with most of the data clustered around that value. The maximum is close to 8 and sits in the far right tail of the histogram.
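A numeric view can complement the plots. The sketch below uses only base R; `species_means` and `measurement_cor` are hypothetical names, and the choice of per-species means and pairwise correlations is just one reasonable starting point for a descriptive analysis.

```r
# Mean of each numeric measurement within each species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# Pairwise correlations between the four measurements (columns 1 to 4).
measurement_cor <- round(cor(iris[, 1:4]), 2)
print(measurement_cor)
```

Comparing the per-species means shows whether the overall histogram mixes groups with different typical sizes.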

HW 1: Math 445/531: Fall 2024

Karsem Chiamprasert

Due on 08/30/2024
