1 R as a calculator

1.1 Question A)

We have been given some values of \(X_i,Y_i\) for \(i=1,2.\) Using the formulas given to us we can calculate the slope and intercept accordingly.

X1 <- 3
X2 <- -4
Y1 <- 2
Y2 <- 100
b <- (Y1-Y2)/(X1-X2)  #Slope
a <- (Y2*X1 - Y1*X2)/(X1 - X2)  #Intercept
print(paste("Intercept is", a))

## [1] "Intercept is 44"

print(paste("Slope is",b))

## [1] "Slope is -14"

1.2 Question B)

We can re-use the code from before, just change the numbers.

X1 <- 0
X2 <- -11
Y1 <- -2
Y2 <- -100
b <- (Y1-Y2)/(X1-X2)  #Slope
a <- (Y2*X1 - Y1*X2)/(X1 - X2)  #Intercept
print(paste("Intercept is", a))

## [1] "Intercept is -2"

print(paste("Slope is", round(b,6)))

## [1] "Slope is 8.909091"
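Since Questions A and B differ only in the input numbers, the calculation can also be wrapped in a small function. The following is a sketch only; the name `line_through` is my own choice for illustration and is not part of the assignment.

```r
# Hypothetical helper (the name is an assumption, not from the assignment):
# intercept and slope of the line through (x1, y1) and (x2, y2).
line_through <- function(x1, y1, x2, y2) {
  stopifnot(x1 != x2)                     # a vertical line has no finite slope
  b <- (y1 - y2) / (x1 - x2)              # slope
  a <- (y2 * x1 - y1 * x2) / (x1 - x2)    # intercept
  c(intercept = a, slope = b)
}

res_a <- line_through(3, 2, -4, 100)      # numbers from Question A
res_b <- line_through(0, -2, -11, -100)   # numbers from Question B
print(res_a)
print(res_b)
```

Calling the function with new points avoids overwriting X1, X2, Y1 and Y2 by hand for each question.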

1.3 Question C)

set.seed(123)           #For reproducibility
draws <- rnorm(500)     #Simulate 500 standard normal draws

#Figure
hist(draws, main="Histogram of Simulated Data", xlab="x", col="red")

#Summary statistics, including the median
summary(draws)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.66092 -0.57463  0.02072  0.03459  0.68521  3.24104

print(paste("Median is", round(median(draws), 4)))

## [1] "Median is 0.0207"

1.4 Question D)

set.seed(123)         #For reproducibility
X <- rnorm(20)        #Simulate 20 standard normal draws
barX <- mean(X)        #Average of X
tildeX <- X - barX     #X Tilde

#Verify that average of tildeX is zero by eyeing result (looks like zero)
print(paste("tilde X is on average", mean(tildeX)))

## [1] "tilde X is on average 2.77013655070046e-18"

1.5 Question E)

set.seed(123)         #For reproducibility
X <- rnorm(20)        #Simulate 20 standard normal draws
barX <- mean(X)        #Average of X
sX <- sd(X)
tildeX <- (X - barX)/sX     #X Tilde

#Verify that average of tildeX is zero
print(paste("tilde X is on average", mean(tildeX)))

## [1] "tilde X is on average 1.50107790780618e-17"

#Verify that sd of tildeX is 1
print(paste("Sd of tilde X is", sd(tildeX)))

## [1] "Sd of tilde X is 1"
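Rather than eyeing the printed numbers, the checks in Questions D and E can be made explicit. The sketch below uses `all.equal()`, which compares numbers up to a small numerical tolerance and so treats a value like 1.5e-17 as zero. It reuses the seed and sample size from above.

```r
set.seed(123)                      # same seed as in the questions above
X <- rnorm(20)
tildeX <- (X - mean(X)) / sd(X)    # standardized X

# all.equal() returns TRUE when the two values agree up to a small tolerance
mean_is_zero <- isTRUE(all.equal(mean(tildeX), 0))
sd_is_one    <- isTRUE(all.equal(sd(tildeX), 1))
print(c(mean_is_zero = mean_is_zero, sd_is_one = sd_is_one))
```

An exact comparison such as `mean(tildeX) == 0` would typically fail here, because the mean is zero only up to floating-point rounding.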

1.6 Question F)

To check that the result holds across repeated draws, we can either rerun the code manually or use a loop. I use a loop, which also makes it easy to change the number of repetitions and so strengthens our belief that the result holds for every sequence of numbers \(X_1,X_2,...,X_n.\)

n <- 5    #Number of repetitions

#Run loop
for (i in 1:n){
  #Simulate standard normal
  X <- rnorm(20)
  
  #Compute mean and sd
  barX <- mean(X)
  sX <- sd(X)
  
  #Compute tildeX
  tildeX <- (X-barX)/sX
  
  #Print result for each round
  print(paste("Repetition", i, ":"))
  print(paste("Mean of tildeX:", mean(tildeX)))
  print(paste("Sd of tildeX:", sd(tildeX)))
}

## [1] "Repetition 1 :"
## [1] "Mean of tildeX: -2.62891921773423e-17"
## [1] "Sd of tildeX: 1"
## [1] "Repetition 2 :"
## [1] "Mean of tildeX: 1.90819582357449e-17"
## [1] "Sd of tildeX: 1"
## [1] "Repetition 3 :"
## [1] "Mean of tildeX: -1.31752586813617e-17"
## [1] "Sd of tildeX: 1"
## [1] "Repetition 4 :"
## [1] "Mean of tildeX: -2.33103467084383e-18"
## [1] "Sd of tildeX: 1"
## [1] "Repetition 5 :"
## [1] "Mean of tildeX: 9.03682510766668e-18"
## [1] "Sd of tildeX: 1"

As we can see, the mean is zero up to floating-point rounding error (on the order of \(10^{-17}\)) in every repetition, and the standard deviation is printed as one every time.

If you would like to try this with, say, \(n\) repetitions, I suggest creating two vectors before the loop and storing the mean and sd of tildeX in position \(i\) in each repetition. This gives you two vectors with \(n\) entries. To confirm the result you can, for instance, average each vector to see what the mean and sd are on average. Doing this gives a number numerically equal to zero for the mean and one for the sd.
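A minimal sketch of that suggestion follows. The seed, the number of repetitions, and the vector names `means` and `sds` are my own choices for illustration.

```r
set.seed(123)              # assumption: seed chosen only for reproducibility
n <- 1000                  # number of repetitions
means <- numeric(n)        # will hold mean(tildeX) for each repetition
sds   <- numeric(n)        # will hold sd(tildeX) for each repetition

for (i in seq_len(n)) {
  X <- rnorm(20)                     # simulate standard normal
  tildeX <- (X - mean(X)) / sd(X)    # standardize
  means[i] <- mean(tildeX)
  sds[i]   <- sd(tildeX)
}

mean(means)   # average of the means: zero up to rounding error
mean(sds)     # average of the sds: one up to rounding error
```

Allocating the vectors with `numeric(n)` before the loop avoids growing them one element at a time.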

1.7 Question G)

To figure out what the code is doing, let us add comments to the code and in that way get an understanding of what is going on.

n <- 15               #Highest exponent (the series has n+1 terms)
x <- 0.5              #Value of x variable
series <- x^(0:n)     #Creates a vector with entries 0.5^0, 0.5^1, ..., 0.5^15

#Gives values 1, 0.5, 0.25, 0.125,... = 0.5^0, 0.5^1, 0.5^2, 0.5^3,...
series

##  [1] 1.000000e+00 5.000000e-01 2.500000e-01 1.250000e-01 6.250000e-02
##  [6] 3.125000e-02 1.562500e-02 7.812500e-03 3.906250e-03 1.953125e-03
## [11] 9.765625e-04 4.882812e-04 2.441406e-04 1.220703e-04 6.103516e-05
## [16] 3.051758e-05

sum(series)           #LHS of equation

## [1] 1.999969

(1-x^(n+1))/(1-x)     #RHS of equation

## [1] 1.999969
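The two sides agree for \(x=0.5\), and the same comparison can be repeated for other values. The sketch below wraps each side in a function and compares them for a few illustrative values of \(x\), all different from one since the right-hand side divides by \(1-x\). The function names and the test values are assumptions of mine.

```r
# Left-hand side: the sum x^0 + x^1 + ... + x^n
geom_lhs <- function(x, n) sum(x^(0:n))
# Right-hand side: the closed form, valid for x != 1
geom_rhs <- function(x, n) (1 - x^(n + 1)) / (1 - x)

xs <- c(-0.9, 0.1, 0.5, 2)     # illustrative values, none equal to 1
n  <- 15
lhs <- sapply(xs, geom_lhs, n = n)
rhs <- sapply(xs, geom_rhs, n = n)
all.equal(lhs, rhs)            # compares all four pairs up to tolerance
```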

2 ChatGPT Knows R Really, Really, Well

Since this is user specific, I leave it for you to do on your own :).

3 An Elementary Problem in Big Data Analytics

3.1 Question A)

The mean can be viewed as a weighted average \[\bar{X}_n=\sum_{i=1}^n w_iX_i,\] where \(w_i\) is the share of the population associated with that particular \(i=1,...,n.\) Hence, \[\bar{X}_{200}=\frac{100}{200}\cdot 180+\frac{100}{200}\cdot 178=179.\]

3.2 Question B)

Use the weighted average formula again: \[\bar{X}_{200}=\frac{100}{200}y+\frac{100}{200}z=\frac{y+z}{2}.\]

3.3 Question C)

Following the formula again: \[\bar{X}_n=\frac{30}{100}y+\frac{70}{100}z.\]

3.4 Question D)

Following the formula: \[\bar{X}_n=\frac{a}{n}y+\frac{n-a}{n}z.\]
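The formulas in Questions A to D are easy to check numerically. The sketch below first uses base R's `weighted.mean()` on the numbers from Question A and then writes the formula from Question D as a function; the name `two_group_mean` is an assumption made for this illustration.

```r
# Question A: two groups of 100 with means 180 and 178
weighted.mean(c(180, 178), w = c(100, 100))

# Question D as a function: a observations with mean y, n - a with mean z
two_group_mean <- function(a, n, y, z) (a / n) * y + ((n - a) / n) * z

mean_a <- two_group_mean(a = 100, n = 200, y = 180, z = 178)
mean_a
```

`weighted.mean()` rescales the weights so that they sum to one, so group sizes can be passed directly.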

3.5 Question E)

Following the hint, it is easy to see that \[\bar{X}_{1:n}=\frac{1}{n-1+1}\sum_{i=1}^nX_i=\frac{1}{n}\sum_{i=1}^nX_i=\bar{X}_n,\] where the last equality follows from the fact that the LHS expression is the definition of the mean.
Set \(u=v\), then \[\bar{X}_{u:u}=\frac{1}{u-u+1}\sum_{i=u}^uX_i=X_u.\] The last equality holds because a sum from one index to that same index has a single term, \(X_u\), and the factor in front of it is \(1/1.\)
It is just a generalization. It is actually straightforward to split a sum into the sum of its parts. For instance, take \(X_i=i\) and consider the sum \[\sum_{i=1}^{10}X_i=1+2+3+\dots+10.\] This is the same as writing:

\[\begin{align} \sum_{i=1}^{10}X_i&=\sum_{i=1}^3X_i+\sum_{i=4}^6X_i+\sum_{i=7}^{10}X_i \\ &=(1+2+3) + (4+5+6) + (7+8+9+10). \end{align}\]

It should not come as a surprise that this partitioning of the sum can be done in even smaller intervals.

*) From the RHS: \[\frac{a}{n}\frac{1}{a}\sum_{i=1}^aX_i=\frac{1}{n}\sum_{i=1}^aX_i.\]

**) From the RHS: \[\frac{b-a}{n}\frac{1}{b-a}\sum_{i=a+1}^bX_i=\frac{1}{n}\sum_{i=a+1}^bX_i.\]

***) From the RHS \[\frac{n-b}{n}\frac{1}{n-b}\sum_{i=b+1}^nX_i=\frac{1}{n}\sum_{i=b+1}^nX_i.\]

The derivation can be done in two ways. One is to start from the definition of \(\bar{X}_n\), partition the total sum into three parts, and then expand the weights until they match the expression given to us. I am not going to do this here, but I urge you to try. Instead, I am going to explain intuitively what we do. To do this, consider the weighted average: \[\bar{X}_n=\sum_{i=1}^nw_iX_i.\] Partition the sum into three parts: the first from \(1\) to \(a\), the second from \(a+1\) to \(b\), and the last from \(b+1\) to \(n.\) Then think about it in exactly the same way as in exercises A-D). The term \(w_i\) is the share of the population in that particular group. The population is \(n,\) so the share of the first group is \(a/n\); the share of the second group is \((b-a)/n\); and the share of the last group is \((n-b)/n.\) With that in mind, it is straightforward to get the expression we are asked to derive. That is,

\[\begin{align} \bar{X}_n&=\sum_{i=1}^nw_iX_i=\sum_{i=1}^aw_iX_i+\sum_{i=a+1}^bw_iX_i+\sum_{i=b+1}^nw_iX_i \\ &= \frac{a}{n}\bar{X}_{1:a}+\frac{b-a}{n}\bar{X}_{a+1:b}+\frac{n-b}{n}\bar{X}_{b+1:n}. \end{align}\]
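The decomposition can also be checked numerically. The sketch below uses simulated data; the seed, the sample size, and the split points \(a=10\) and \(b=30\) are arbitrary choices of mine, not values from the assignment.

```r
set.seed(123)              # assumption: arbitrary seed, for illustration only
n <- 50; a <- 10; b <- 30  # assumption: arbitrary sample size and split points
X <- rnorm(n)

# Means of the three partitions 1:a, (a+1):b and (b+1):n
part_means <- c(mean(X[1:a]), mean(X[(a + 1):b]), mean(X[(b + 1):n]))
# Shares of the population in each partition
weights <- c(a, b - a, n - b) / n

combined <- sum(weights * part_means)   # weighted average of partition means
all.equal(combined, mean(X))            # agrees with the overall mean
```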

3.6 Question F)

Following the hints, doing some simplifications, and applying the definition of the mean, we can obtain the expression. That is,

\[\begin{align} S_X^2&=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X}_n)^2 \\ &= \frac{1}{n-1}\sum_{i=1}^n\left[X_i^2-2X_i\bar{X}_n+\bar{X}_n^2 \right] \\ &= \frac{1}{n-1}\left[\sum_{i=1}^nX_i^2-2\bar{X}_n\sum_{i=1}^nX_i+\sum_{i=1}^n\bar{X}_n^2 \right]\\ &= \frac{1}{n-1}\left[\sum_{i=1}^nX_i^2-2n\bar{X}_n^2+n\bar{X}_n^2 \right]\\ &=\frac{1}{n-1}\sum_{i=1}^nX_i^2-\frac{n}{n-1}\bar{X}_n^2. \end{align}\]
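As a quick numerical check of the final expression, the sketch below compares it with R's built-in `var()`, which also divides by \(n-1\). The seed and sample size are arbitrary choices for illustration.

```r
set.seed(123)     # assumption: arbitrary seed, for illustration only
X <- rnorm(20)
n <- length(X)

# Last line of the derivation: sum of squares term minus the mean term
s2_shortcut <- sum(X^2) / (n - 1) - n / (n - 1) * mean(X)^2

all.equal(s2_shortcut, var(X))   # agrees with the built-in sample variance
```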

3.7 Question G)

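We can split as before, but the variances of the partitions cannot simply be added. Weighting them by group shares is not enough either, because each partition variance is centered at its own mean rather than at \(\bar{X}_n.\) What does split cleanly is the sum of squares in the expression from question F): \[S_X^2=\frac{1}{n-1}\left[\sum_{i=1}^aX_i^2+\sum_{i=a+1}^bX_i^2+\sum_{i=b+1}^nX_i^2\right]-\frac{n}{n-1}\bar{X}_n^2,\] where \(\bar{X}_n\) is obtained from the partition means as in question E).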
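A numerical check helps settle how the pieces combine. In the sketch below, each partition reports only its count, its sum, and its sum of squares, and the totals are plugged into the expression from question F). The seed, sample size, and split points are arbitrary choices of mine.

```r
set.seed(123)              # assumption: arbitrary seed, for illustration only
n <- 50; a <- 10; b <- 30  # assumption: arbitrary sample size and split points
X <- rnorm(n)
parts <- list(X[1:a], X[(a + 1):b], X[(b + 1):n])

# Each partition reports three numbers: count, sum, and sum of squares
counts <- sapply(parts, length)
sums   <- sapply(parts, sum)
sumsq  <- sapply(parts, function(p) sum(p^2))

n_tot <- sum(counts)
xbar  <- sum(sums) / n_tot                                  # overall mean
s2    <- sum(sumsq) / (n_tot - 1) - n_tot / (n_tot - 1) * xbar^2

naive <- sum(sapply(parts, var))     # simply adding the partition variances
all.equal(s2, var(X))                # matches the full-sample variance
isTRUE(all.equal(naive, var(X)))     # the plain sum does not match here
```

The point of the design is that sums and sums of squares add across partitions, so no partition ever needs to see the other partitions' data.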

3.8 Question H)

Given that the weights should add to one, and they are equally large and constant, we can just set \(w=1/n.\) Then we have \[\bar{Z}_n=\sum_{i=1}^n\frac{1}{n}Z_i=\frac{1}{n}\sum_{i=1}^nZ_i,\] which is precisely the definition of the classical mean.

3.9 Question H)

Here it is just about setting the weights correctly, exactly as we argued in question E) (v.): set the weights to \(w_1=a/n\), \(w_2=(b-a)/n\), and \(w_3=(n-b)/n\). To confirm that this works, we just need to see that \(w_1+w_2+w_3=1.\) That is, \[\frac{a}{n}+\frac{b-a}{n}+\frac{n-b}{n}=\frac{a+b-a+n-b}{n}=\frac{n}{n}=1,\] so we are fine! Thus, we can write \(\bar{X}_n\) as a weighted average of the three partition means.

3.10 Question I)

If the sequence \(X_1,...,X_n\) is split into \(M\) parts, each part will have its own average, say \(\bar{X}_1,\bar{X}_2,...,\bar{X}_M,\) and its own weight based on the number of elements in each part, say \(w_1,w_2,...,w_M.\) The general formula is therefore just the weighted average of all means: \[\bar{X}_n=\sum_{j=1}^Mw_j\bar{X}_j,\] where \(w_j=n_j/n\) is the number of elements in the \(j\)-th part, \(n_j,\) as a proportion of the total number of elements. Thus, the general formula is just a weighted average of the part-specific means.
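The general formula can be sketched in R for any number of parts. The part sizes below (10, 20, 30 and 40, so \(M=4\)) and the seed are arbitrary choices for illustration.

```r
set.seed(123)                    # assumption: arbitrary seed
sizes <- c(10, 20, 30, 40)       # assumption: sizes of the M = 4 parts
n <- sum(sizes)
X <- rnorm(n)

# Label each observation with the part it belongs to: 1,...,1,2,...,2,...
part <- rep(seq_along(sizes), times = sizes)

part_means <- as.numeric(tapply(X, part, mean))  # one mean per part
w <- sizes / n                                   # share of elements per part

combined <- sum(w * part_means)  # weighted average of the part means
all.equal(combined, mean(X))     # agrees with the overall mean
```

Changing `sizes` to a vector of any length gives the same agreement, which is the content of the general formula.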

Project 1

Data Analytics /w Programming (GRA 6036)

Sebastian Shaqiri Johansson

2024-01-19