Lecture

L-1: Linear programming

1.1 Linear programming

In mathematics, linear programming is a method of optimising operations subject to constraints. The main objective of linear programming is to maximise or minimise a numerical objective. It involves linear functions subject to constraints in the form of linear equations or inequalities. Linear programming is considered an important technique for finding the optimum use of resources. The term “linear programming” consists of two words, linear and programming. The word “linear” describes relationships between variables of degree one, and the word “programming” refers to the process of selecting the best solution from various alternatives.

1.2 Characteristics of the linear programming problem

Constraints: The limitations regarding the resources should be expressed in mathematical form.
Objective function: The objective of the problem should be specified quantitatively.
Linearity: The relationship between two or more variables in the function must be linear; that is, each variable has degree one.
Finiteness: There should be a finite number of inputs and outputs. If the function involves infinitely many factors, an optimal solution is not feasible.
Non-negativity: The value of each variable should be positive or zero; it should not be negative.
Decision variables: The decision variables determine the output and give the ultimate solution of the problem. For any problem, the first step is to identify the decision variables.

1.3 Methods to Solve Linear Programming Problems

A linear programming problem can be solved using different methods, such as the graphical method, the simplex method, or software tools such as R, OpenSolver, etc.

1.4 Problem 1

Solve the following problem: \[Maximize, z = 5x_1 + 7x_2\] Subject to, \[x_1 \leq 16\] \[2x_1 + 3x_2 \leq 19\] \[x_1 + x_2 \leq 8\] \[x_1, x_2 \geq 0\]

#install.packages("lpSolve")

library(lpSolve)

obj <- c(5, 7)                          # objective-function coefficients
cons <- matrix(c(1, 0,
                 2, 3,
                 1, 1), nrow = 3, byrow = TRUE)   # constraint coefficients
dir <- c("<=", "<=", "<=")              # constraint directions
rhs <- c(16, 19, 8)                     # right-hand-side values

lp(direction = "max", objective.in = obj, const.mat = cons, const.dir = dir, const.rhs = rhs)
## Success: the objective function is 46
r<-lp(direction = "max",objective.in = obj,const.mat = cons,const.dir = dir,const.rhs = rhs)


r$objval
## [1] 46
r$solution
## [1] 5 3
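As a quick check (an addition to the notes), the optimum can be substituted back into the constraints: every entry of cons %*% r$solution should be at most the corresponding entry of rhs = c(16, 19, 8).

# left-hand side of each constraint at the optimum (5, 3)
cons %*% r$solution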

1.5 Problem 2

Solve the following problem: \[Maximize, Z = 2x_1 + x_2\] Subject to, \[x_2 \leq 10\] \[2x_1 + 5x_2 \leq 60\] \[x_1 + x_2 \leq 18\] \[3x_1 + x_2 \leq 44\] and \[x_1 \geq 0, x_2 \geq 0\]

obj<-c(2,1)
cons<-matrix(c(0,1,2,5,1,1,3,1),nrow = 4,byrow = T)
dir<-c("<=","<=","<=","<=")
rhs<-c(10,60,18,44)

r<-lp(direction = "max",objective.in = obj,const.mat = cons,const.dir = dir,const.rhs = rhs)


r$objval
## [1] 31
r$solution
## [1] 13  5

1.6 Problem 3

Solve the following problem: \[Minimize, Z = 15x_1 + 20x_2\] subject to \[x_1 + 2x_2 \geq 10\] \[2x_1 − 3x_2 \leq 6\] \[x_1 + x_2 \geq 6\] and \[x_1 \geq 0, x_2 \geq 0\]

obj<-c(15,20)
cons<-matrix(c(1,2,2,-3,1,1),nrow = 3,byrow = T)
dir<-c(">=","<=",">=")
rhs<-c(10,6,6)

r<-lp(direction = "min",objective.in = obj,const.mat = cons,const.dir = dir,const.rhs = rhs)


r$objval
## [1] 110
r$solution
## [1] 2 4

1.7 Problem 4

Ralph Edmund loves steaks and potatoes. Therefore, he has decided to go on a steady diet of only these two foods (plus some liquids and vitamin supplements) for all his meals. Ralph realizes that this isn't the healthiest diet, so he wants to make sure that he eats the right quantities of the two foods to satisfy some key nutritional requirements. He has obtained the following nutritional and cost information:

Grams of ingredient per serving

Ingredient Steak Potatoes Daily requirement (grams)
Carbohydrates 5 15 \(\geq 50\)
Protein 20 5 \(\geq 40\)
Fat 15 2 \(\leq 60\)
Cost per serving \(\$4\) \(\$2\)

Ralph wishes to determine the number of daily servings (may be fractional) of steak and potatoes that will meet these requirements at a minimum cost.

  1. Formulate a linear programming model for this problem.
  2. Use the graphical method to solve this model (a rough sketch in R follows the solution below).
  3. Use R to solve this model.

obj<-c(4,2)
cons<-matrix(c(5,15,20,5,15,2),nrow = 3,byrow = T)
dir<-c(">=",">=","<=")
rhs<-c(50,40,60)

r<-lp(direction = "min",objective.in = obj,const.mat = cons,const.dir = dir,const.rhs = rhs)


r$objval
## [1] 10.90909
r$solution
## [1] 1.272727 2.909091
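For part 2, a rough base-R sketch of the graphical method is given below (an addition to the notes): it draws the three constraint boundary lines and marks the lp() optimum found above, which lies at the corner where the carbohydrate and protein lines intersect.

# constraint lines for Problem 4; the feasible region lies above the
# carbohydrate and protein lines and below the fat line (first quadrant)
x1 <- seq(0, 8, length.out = 200)
plot(x1, (50 - 5 * x1) / 15, type = "l", col = 2, ylim = c(0, 10),
     xlab = "x1 (steak servings)", ylab = "x2 (potato servings)")
lines(x1, (40 - 20 * x1) / 5, col = 3)           # protein: 20*x1 + 5*x2 = 40
lines(x1, (60 - 15 * x1) / 2, col = 4)           # fat: 15*x1 + 2*x2 = 60
points(r$solution[1], r$solution[2], pch = 19)   # optimum from lp() above
legend("topright", c("carbohydrate", "protein", "fat"), col = 2:4, lty = 1)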

1.8 Problem 5

The Apex Television Company has to decide on the number of 27-inch and 20-inch sets to be produced at one of its factories. Market research indicates that at most 40 of the 27-inch sets and 10 of the 20-inch sets can be sold per month. The maximum number of work-hours available is 500 per month. A 27-inch set requires 20 work-hours and a 20-inch set requires 10 work-hours. Each 27-inch set sold produces a profit of $120 and each 20-inch set produces a profit of $80. A wholesaler has agreed to purchase all the television sets produced if the numbers do not exceed the maxima indicated by the market research.

  1. Formulate a linear programming model for this problem.
  2. Use the graphical method to solve this model (a rough sketch in R follows the solution below).
  3. Use R to solve this model.

obj<-c(120,80)
cons<-matrix(c(1,0,0,1,20,10),nrow = 3,byrow = T)
dir<-c("<=","<=","<=")
rhs<-c(40,10,500)

r<-lp(direction = "max",objective.in = obj,const.mat = cons,const.dir = dir,const.rhs = rhs)


r$objval
## [1] 3200
r$solution
## [1] 20 10
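For part 2, a rough graphical sketch (an addition to the notes) showing the three constraint lines and the lp() optimum found above:

# constraint lines for Problem 5; the feasible region is below all three
x1 <- seq(0, 45, length.out = 200)
plot(x1, (500 - 20 * x1) / 10, type = "l", col = 2, ylim = c(0, 50),
     xlab = "x1 (27-inch sets)", ylab = "x2 (20-inch sets)")
abline(v = 40, col = 3)                          # x1 <= 40
abline(h = 10, col = 4)                          # x2 <= 10
points(r$solution[1], r$solution[2], pch = 19)   # optimum (20, 10) from lp()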

L-2: Construction of life table

1 Life table

A life table is a statistical tool that summarizes the mortality experience of a population and yields information about longevity and life expectancy. Although it is generally used for studying mortality, the life table format can be used to summarize any duration variable, such as duration of marriage, duration of contraceptive use, etc.

2 Columns of a life table

A typical life table contains several columns, each with a unique interpretation. The columns are:

  1. x : Age at exact (single) year (For complete life table) or at interval of years (For abridged life table)

  2. qx : Probability of death between exact age (x, x + 1) (For complete life table) or (x, x + n) (For abridged life table denoted by nqx).

  3. lx : Number of persons surviving to exact age x.

  4. dx : Number of deaths between age (x, x + 1) (For complete life table) or (x, x + n) (For abridged life table denoted by ndx).

  5. Lx : Number of person years lived by the cohort between age (x, x + 1) (For complete life table) or (x, x + n) (For abridged life table denoted by nLx) i.e, Time (in year/day/month/week/any other unit) contribution by each member of the cohort between age (x, x + 1) or (x, x + n). The formula of Lx (for complete life table) is: \[L_x = l_x − \frac{d_x}{2}\] or \[L_x = \frac{(l_x + l_{x+1})}{2}\]

For an abridged life table,

\[ {}_nL_x=\frac{n(l_x+l_{x+n})}{2} \]

  6. Tx : Number of person-years lived beyond age x. The formula of Tx is: \[T_x = L_x + L_{x+1} + \cdots + L_{\text{end of table}} = L_x + T_{x+1}\]

  7. ex : Expectation of life at age x, i.e. how many more years a person may expect to live given that his/her current age is x. The formula of ex is: \[e_x =\frac {T_x}{l_x}\]

3 Notations in life table

Name Of The Function For Complete Life table For Abridged Life Table
Age x x → (x + n)
Death Prob. \(q_x\) \(_nq_x\)
No. of deaths \(d_x\) \(_nd_x\)
Survived persons \(ℓ_x\) \(ℓ_x\)
Survived person year \(L_x\) \(_nL_x\)
Survival Time beyond x \(T_x\) \(T_x\)
Expectation Of Life At Age x \(e_x\) \(e_x\)

4 Fergany’s method of constructing a life table

In this section we construct an ordinary life table from data on age-specific death rates, using a simple method suggested by Fergany (1971). In this method the age-specific death rate (nmx) is converted into the proportion dying in the age interval (nqx) using a simple formula: \[{}_nq_x = 1 − e^{−n \cdot {}_nm_x}\] Once nqx is calculated from the age-specific death rates, the remaining columns of the life table are easily obtained using the following relationships: \[l_0 = 100000\] \[{}_nd_x = l_x \cdot {}_nq_x\] \[l_{x+n} = l_x − {}_nd_x\] \[{}_nL_x =\frac {{}_nd_x}{{}_nm_x}\] \[T_x = \sum_x^{\text{end of table}} {}_nL_x\] \[e_x^0 = \frac {T_x}{l_x}\]
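A minimal sketch of Fergany's method in R, assuming hypothetical age-specific death rates for 5-year age groups (the actual rates for the exercise are in life-table-practice.csv):

x <- seq(0, 80, by = 5)                      # start of each age interval
n <- c(diff(x), 5)                           # interval widths
# hypothetical nmx values, for illustration only
nmx <- c(0.010, 0.001, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008,
         0.011, 0.016, 0.024, 0.036, 0.054, 0.080, 0.120, 0.180)
nqx <- 1 - exp(-n * nmx)                     # proportion dying in interval
nqx[length(nqx)] <- 1                        # open-ended last interval
lx <- numeric(length(x)); lx[1] <- 100000    # radix of the cohort
for (i in 2:length(x)) lx[i] <- lx[i - 1] * (1 - nqx[i - 1])
ndx <- lx * nqx                              # deaths in interval
nLx <- ndx / nmx                             # person-years lived in interval
Tx <- rev(cumsum(rev(nLx)))                  # person-years lived beyond x
ex <- Tx / lx                                # expectation of life at age x
data.frame(x, nmx, nqx = round(nqx, 4), lx = round(lx),
           ndx = round(ndx), ex = round(ex, 1))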

Practice problem: please find the file life-table-practice.csv in Google Classroom.

Further reading: https://www.measureevaluation.org/resources/training/online-courses-and-resources/noncertificate-courses-and-mini-tutorials/multiple-decrement-life-tables/lesson-3.html

L-3: Basic Multivariate Statistics

Question 1

Consider the following two vectors: x = [1, 3, 2], y = [−2, 1, −1]

  1. Calculate the length of vector x. \[Help: L_x = \sqrt{x'x} =\sqrt{x_1^2+ x_2^2+ \cdots + x_p^2}\]
  2. Calculate the length of vector y. \[Help: L_y = \sqrt{y'y} =\sqrt{y_1^2 + y_2^2 + \cdots + y_p^2} \]
  3. Calculate the angle between vectors x & y. \[Help: θ = \cos^{−1}\left(\frac{x'y} {L_xL_y}\right)\]
x<-c(1,3,2)
y<-c(-2,1,-1)

#length of x:
m<-t(x)%*%x
l_1<-sqrt(m)


sqrt(t(x)%*%x)
##          [,1]
## [1,] 3.741657
#length of y:
n<-t(y)%*%y
l_2<-sqrt(n)

sqrt(t(y)%*%y)
##         [,1]
## [1,] 2.44949
angle<-acos(t(x)%*%y/l_1/l_2)

# install.packages("REdaS")
library(REdaS)
## Warning: package 'REdaS' was built under R version 4.3.2
## Loading required package: grid
rad2deg(angle)
##          [,1]
## [1,] 96.26395
# deg2rad() converts degrees to radians (REdaS is already loaded above)
deg2rad(0)
## [1] 0
deg2rad(30)
## [1] 0.5235988
deg2rad(50)
## [1] 0.8726646

Question 2

Consider the following matrix & provide answer to the given questions: \[A =\left(\begin{array} {cc}1 & -5 \\ -5 & 1 \end{array}\right)\]

  1. Find the eigenvalues & eigenvectors of matrix A.
  2. What is the relationship between the eigenvalues & the determinant of a matrix?
  3. When is a matrix called a positive definite matrix & a positive semi-definite matrix? (A quick numeric check is sketched after the code below.)
  4. What are the properties of the eigenvalues of a positive definite matrix?
  5. Why should the covariance matrix of a multivariate normal distribution be a ‘positive definite matrix’?
a<-matrix(data = c(1,-5,-5 ,1),nrow = 2,byrow = T)
eig<-eigen(a)

#(i)
eig$values
## [1]  6 -4
eig$vectors
##            [,1]       [,2]
## [1,] -0.7071068 -0.7071068
## [2,]  0.7071068 -0.7071068
#(ii)
det(a)
## [1] -24
prod(eig$values)
## [1] -24
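For parts (iii) and (iv), a quick numeric check can be sketched (an addition, not from the original notes): a symmetric matrix is positive definite when all of its eigenvalues are strictly positive, and positive semi-definite when they are all non-negative.

#(iii)-(iv) positive definiteness can be checked through the eigenvalues
is_pd <- function(m) all(eigen(m, only.values = TRUE)$values > 0)
is_pd(a)               # FALSE: a has eigenvalues 6 and -4
is_pd(diag(c(2, 3)))   # TRUE: both eigenvalues are positive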

1 Spectral Decomposition

Let A be a square n × n matrix with n linearly independent eigenvectors q_i (where i = 1, . . . , n). Then A can be factorized as \[A = QΛQ^{−1}\] where Q is the square n × n matrix whose i-th column is the eigenvector q_i of A, and Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues, \[Λ_{ii} = λ_i\]. The terminology “spectral decomposition” derives from the fact that the set of eigenvalues of a matrix is also called the “spectrum” of the matrix.

1.1 Computation of power series of matrices using spectral decomposition \[A^2 =(QΛQ^{−1})(QΛQ^{−1})= QΛ(Q^{−1}Q)ΛQ^{−1} = QΛ^2Q^{−1}\] \[A^n = QΛ^nQ^{−1}\]

1.2 Decomposition of real symmetric matrices For every real symmetric matrix A there exists an orthogonal matrix Q and a diagonal matrix Λ such that \[A = QΛQ^⊤\], where Q is an orthogonal matrix whose columns are (real and orthonormal) eigen vectors of A, and Λ is a diagonal matrix whose entries are the eigenvalues of A.

Question 3

Consider the following matrix & provide the appropriate answer to the given questions: \[A =\left(\begin{array}{ccc} 13 & -4 & 2\\ -4 & 13 & -2\\ 2 & -2 & 10 \end{array}\right) \]

  1. The spectral decomposition of matrix A is given by \[A = QΛQ^⊤\] where Q is an orthogonal matrix whose columns are eigenvectors of A, and Λ is a diagonal matrix whose entries are the eigenvalues of A. Prove the described statement for matrix A.
b<-matrix(data = c(13, -4, 2,-4, 13, -2,2, -2, 10),nrow = 3,byrow = T)
eigen<-eigen(b)


Q<-eigen$vectors
delta<-diag(x = eigen$values)
Q
##            [,1]       [,2]      [,3]
## [1,]  0.6666667 -0.7453560 0.0000000
## [2,] -0.6666667 -0.5962848 0.4472136
## [3,]  0.3333333  0.2981424 0.8944272
delta
##      [,1] [,2] [,3]
## [1,]   18    0    0
## [2,]    0    9    0
## [3,]    0    0    9
Q %*% delta %*% t(Q)
##      [,1] [,2] [,3]
## [1,]   13   -4    2
## [2,]   -4   13   -2
## [3,]    2   -2   10
  2. Using spectral decomposition, find \[A^{−1}, A^3, \text{ and } A^{10}\].
Q<-eigen$vectors
delta<-diag(x = eigen$values^-1)
Q
##            [,1]       [,2]      [,3]
## [1,]  0.6666667 -0.7453560 0.0000000
## [2,] -0.6666667 -0.5962848 0.4472136
## [3,]  0.3333333  0.2981424 0.8944272
delta
##            [,1]      [,2]      [,3]
## [1,] 0.05555556 0.0000000 0.0000000
## [2,] 0.00000000 0.1111111 0.0000000
## [3,] 0.00000000 0.0000000 0.1111111
Q %*% delta %*% t(Q)
##             [,1]       [,2]        [,3]
## [1,]  0.08641975 0.02469136 -0.01234568
## [2,]  0.02469136 0.08641975  0.01234568
## [3,] -0.01234568 0.01234568  0.10493827
Q<-eigen$vectors
delta<-diag(x = (eigen$values)^3)
Q
##            [,1]       [,2]      [,3]
## [1,]  0.6666667 -0.7453560 0.0000000
## [2,] -0.6666667 -0.5962848 0.4472136
## [3,]  0.3333333  0.2981424 0.8944272
delta
##      [,1] [,2] [,3]
## [1,] 5832    0    0
## [2,]    0  729    0
## [3,]    0    0  729
Q %*% delta %*% t(Q)
##       [,1]  [,2]  [,3]
## [1,]  2997 -2268  1134
## [2,] -2268  2997 -1134
## [3,]  1134 -1134  1296
Q<-eigen$vectors
delta<-diag(x = (eigen$values)^10)
Q
##            [,1]       [,2]      [,3]
## [1,]  0.6666667 -0.7453560 0.0000000
## [2,] -0.6666667 -0.5962848 0.4472136
## [3,]  0.3333333  0.2981424 0.8944272
delta
##              [,1]       [,2]       [,3]
## [1,] 3.570467e+12          0          0
## [2,] 0.000000e+00 3486784401          0
## [3,] 0.000000e+00          0 3486784401
Q %*% delta %*% t(Q)
##               [,1]          [,2]          [,3]
## [1,]  1.588811e+12 -1.585325e+12  792662320494
## [2,] -1.585325e+12  1.588811e+12 -792662320494
## [3,]  7.926623e+11 -7.926623e+11  399817944648

L-4: Inference

Birth interval data

You are given a dataset in SPSS format, BDHS-birth-interval-data.sav. The data contain information on birth intervals for 5607 children of age under 5 years, with associated background characteristics. The variables available in this dataset are given below:

Variable Name Variable label
caseid case identification
v013 age in 5-year groups
v024 region
v025 type of place of residence
v106 highest educational level
v130 religion
v190 wealth index
v447a women’s age in years
bord birth order number
b4 sex of child
b5 child is alive
b8 current age of child
b11 preceding birth interval (months)
b12 succeeding birth interval (months)

1

1.a

  1. Suppose you would like to know whether the population distribution of preceding birth interval (b11) follows a Normal distribution, as normality of the data is required for most statistical analyses. How would you assess the normality assumption of this variable graphically? Obtain the results and make your comments.
library(haven)
BDHS_birth_interval_data <-
  read_sav("D:/3rd Year/AST-332 batch-27/BDHS-birth-interval-data.sav")
#View(BDHS_birth_interval_data)
d <- BDHS_birth_interval_data

hist(d$b11)

qqnorm(d$b11)
qqline(d$b11)

1.b

  1. Obtain summary statistics of b11, such as the Mean, Median, SD, and 25th and 75th percentiles, and interpret all the values.
summary(d$b11)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   31.00   47.00   53.84   70.00  246.00
#quantile(x = d$b11,probs = c(.25,.75))
sd(d$b11)
## [1] 30.29951

1.c

  1. Calculate the average preceding birth interval by mother's education and make comments.
tapply(X = d$b11,INDEX = d$v106,FUN = mean)
##        0        1        2        3 
## 51.93180 53.23751 55.08089 59.07197

1.d

  1. Calculate the average preceding birth interval by wealth index and make comments.
tapply(X = d$b11,d$v190,mean)
##        1        2        3        4        5 
## 47.57227 51.00567 55.96158 57.47862 60.39755

1.e

  1. Obtain the summary statistics of preceding birth interval by mother's education and make comments.
#aggregate(x = d$b11, list(d$v106),mean)
tapply(X = d$b11,d$v106,mean)
##        0        1        2        3 
## 51.93180 53.23751 55.08089 59.07197
tapply(X = d$b11,d$v106,summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   29.00   44.00   51.93   67.00  214.00 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   31.00   47.00   53.24   68.00  246.00 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   33.00   50.00   55.08   72.00  191.00 
## 
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   35.75   54.50   59.07   74.25  172.00
tapply(X = d$b11,d$v190,summary)
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   29.00   41.00   47.57   59.00  225.00 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   29.00   45.00   51.01   67.00  214.00 
## 
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   32.00   49.00   55.96   72.00  246.00 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   33.00   52.00   57.48   74.00  186.00 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   35.25   55.00   60.40   78.00  209.00

2

2.a

  1. Suppose you would like to know whether the average preceding birth interval among Bangladeshi women of reproductive age is more than 48 months. Does the data you extracted from the Bangladesh Demographic and Health Survey suggest any evidence of a birth interval higher than 48 months?
t.test(x = d$b11,mu = 48,alternative = "greater")
## 
##  One Sample t-test
## 
## data:  d$b11
## t = 13.978, df = 5256, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 48
## 95 percent confidence interval:
##  53.15386      Inf
## sample estimates:
## mean of x 
##  53.84135

2.b

  1. Suppose you would like to know whether the average preceding birth interval among the Bangladeshi women of reproductive age living in Sylhet division is less than 48 months. Does the data suggest any evidence of birth interval less than 48 months for women living in Sylhet division?
data_sylet<-d[d$v024==7,]
#data_sylet

t.test(x =data_sylet$b11,mu = 48,alternative = "less")
## 
##  One Sample t-test
## 
## data:  data_sylet$b11
## t = -4.7807, df = 970, p-value = 1.009e-06
## alternative hypothesis: true mean is less than 48
## 95 percent confidence interval:
##      -Inf 45.42214
## sample estimates:
## mean of x 
##  44.06797

2.c

  1. Suppose you would like to know whether the average preceding birth interval among higher-educated Bangladeshi women between age 19 and 35 years is more than 60 months. Does the data suggest any evidence of a birth interval of more than 60 months among the higher-educated women?
#d$v106
data_edu<-d[d$v106 ==3 &
              d$v447a >=19 & d$v447a <=35,]
#View(data_edu)

t.test(x =data_edu$b11,mu = 60,alternative = "greater")
## 
##  One Sample t-test
## 
## data:  data_edu$b11
## t = -2.503, df = 221, p-value = 0.9935
## alternative hypothesis: true mean is greater than 60
## 95 percent confidence interval:
##  52.29857      Inf
## sample estimates:
## mean of x 
##  55.36036

2.d

  1. Obtain the frequency distribution of children by survival status, i.e. dead or alive.
table(d$b5)
## 
##    0    1 
## 5033  224

3

3.a

  1. Suppose you would like to know whether the women of reproductive age from urban area have significantly higher birth interval than those from rural area. Does the data you extracted from Bangladesh demographic and health survey suggest any evidence of higher birth interval for urban women than rural women?
data_urban<-d[d$v025 ==1,]
data_rural<-d[d$v025 ==2,]

t.test(x = data_urban$b11,
       y = data_rural$b11,
       paired = F,
       alternative = "greater",
       var.equal = F
        )
## 
##  Welch Two Sample t-test
## 
## data:  data_urban$b11 and data_rural$b11
## t = 6.8243, df = 2459.5, p-value = 5.544e-12
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  5.017497      Inf
## sample estimates:
## mean of x mean of y 
##  58.59797  51.98625

3.b

  1. Suppose you would like to know whether there is any significant difference in average preceding birth interval between women of reproductive age from Chittagong and Dhaka division. Does the data you extracted from the Bangladesh Demographic and Health Survey suggest any evidence of unequal birth interval between women from these two divisions?
data_dhaka<-d[d$v024 ==3,]
data_chittagong<-d[d$v024 ==2,]

t.test(x = data_dhaka$b11,y=data_chittagong$b11,paired = F,alternative = "two.sided",var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  data_dhaka$b11 and data_chittagong$b11
## t = 4.5309, df = 1644.1, p-value = 6.296e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.432866 8.673743
## sample estimates:
## mean of x mean of y 
##  54.15284  48.09954

3.c

  1. Suppose you would like to know whether the women of reproductive age from poorest family have significantly lower birth interval than those from richest family. Does the data you extracted from Bangladesh demographic and health survey suggest any evidence of lower birth interval for women of poorest family than those from richest family?
data_poorest<-d[d$v190 ==1,]
data_richest<-d[d$v190 ==5,]

t.test(x = data_poorest$b11,
       y = data_richest$b11,
       paired = F,
       alternative = "less",
       var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  data_poorest$b11 and data_richest$b11
## t = -9.8727, df = 1607.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -10.68727
## sample estimates:
## mean of x mean of y 
##  47.57227  60.39755

3.d

  1. Suppose you would like to know whether there is a significant difference between preceding birth interval and succeeding birth interval. Does the data you extracted from the Bangladesh Demographic and Health Survey suggest any evidence of unequal preceding and succeeding birth intervals for women?
t.test(x = d$b11,y = d$b12,paired = T,alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  d$b11 and d$b12
## t = 12.56, df = 683, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  11.21813 15.37544
## sample estimates:
## mean difference 
##        13.29678

3.e

  1. It is assumed that an educated mother can take care of her children properly compared to a non-educated mother. Using the given data, obtain the percentage of child mortality among mothers with different education levels, such as no education, primary, secondary and higher, and make comments. Does the data you extracted from BDHS suggest any evidence of association between mother's education and child mortality?
table(d$b5,d$v106)
##    
##        0    1    2    3
##   0 1288 1708 1777  260
##   1   61   94   65    4
#table(d$v106, d$b5)
prop.table(table(d$b5,d$v106),margin = 2)
##    
##              0          1          2          3
##   0 0.95478132 0.94783574 0.96471227 0.98484848
##   1 0.04521868 0.05216426 0.03528773 0.01515152
#round(pi,5)
round(100*prop.table(table(d$b5,d$v106),margin = 2),0)
##    
##      0  1  2  3
##   0 95 95 96 98
##   1  5  5  4  2
#Does the data you extracted from BDHS suggest any evidence of association between mother’s education and child mortality?

chisq.test(d$b5,d$v106)
## 
##  Pearson's Chi-squared test
## 
## data:  d$b5 and d$v106
## X-squared = 11.558, df = 3, p-value = 0.009063
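As a quick follow-up (an addition to the notes), the expected cell counts under independence can be inspected to check that the chi-squared approximation is reasonable (each expected count should be at least about 5):

chisq.test(d$b5, d$v106)$expected   # expected counts under independence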

L-7: Power of a test

4

Calculate the power of the following test, considering level of significance α = 0.05, σ² = 64, n = 16, and true mean μ = 102. Additionally, create a power curve while varying the difference between the means within a range of up to 10 in both directions. \[H_0 : μ = 100\] \[H_a : μ \neq 100\]

powerfun <- function(alpha = .05, sigma, n, mu, pmean){
  # acceptance-region limits for the sample mean under H0 (mean = mu)
  a <- qnorm(alpha / 2) * sigma / sqrt(n) + mu
  b <- -qnorm(alpha / 2) * sigma / sqrt(n) + mu
  # standardize the limits under the true mean (pmean)
  l <- (a - pmean) / (sigma / sqrt(n))
  u <- (b - pmean) / (sigma / sqrt(n))
  beta <- pnorm(u) - pnorm(l)   # P(type II error) = P(accept H0 | pmean)
  power <- 1 - beta
  return(power)
}

powerfun(alpha = .05,sigma = 8,n = 16,mu = 100,pmean = 102)
## [1] 0.170075
#############################################
# power curve when the true mean (pmean) varies from 90 to 110

plot(x=90:110,y = powerfun(alpha = .05,sigma = 8,n = 16,mu = 100,pmean = 90:110)
,type = "l")

#############################################
# power when the sample size varies from 15 to 100
plot(x=15:100,y = powerfun(alpha = .05,sigma = 8,n = 15:100,mu = 100,pmean = 105)
     ,type = "l")

#############################################
# power when the population standard deviation (sigma) varies from 50 to 80
plot(x=50:80,y = powerfun(alpha = .05,sigma = 50:80,n = 16,mu = 100,pmean = 102)
     ,type = "l")

#############################################
# power when the level of significance varies from 0.01 to 0.1
alpha=seq(.01,.1,by=.01)
plot(x=alpha,y = powerfun(alpha = alpha,sigma = 8,n = 16,mu = 100,pmean = 102)
     ,type = "l")

Alternative approach

# lower-tail component of the two-sided power at pmean = 102, computed by hand
pnorm(q = (-1.96*2 + 100 - 102)/2)
## [1] 0.001538195
# two-sided power with mu0 = 100 and the critical value 1.96 hard-coded
power <- function(sigma, n, mu_a) {
  l=(-1.96*sigma/sqrt(n) +100-mu_a)/(sigma/sqrt(n))
  u=(1.96*sigma/sqrt(n) +100-mu_a)/(sigma/sqrt(n))
  beta=pnorm(u)-pnorm(l)
  1-beta
}
power(sigma = 8,n = 16,mu_a = 102)
## [1] 0.1700658
mu_a<-90:110
store<-NULL
for(i in 1:length(mu_a)){
  store[i]<- power(sigma = 8,n = 16,mu_a = mu_a[i])
}
store
##  [1] 0.99881711 0.99445738 0.97932484 0.93821985 0.85083040 0.70540558
##  [7] 0.51599091 0.32302820 0.17006580 0.07909189 0.04999579 0.07909189
## [13] 0.17006580 0.32302820 0.51599091 0.70540558 0.85083040 0.93821985
## [19] 0.97932484 0.99445738 0.99881711
plot(mu_a,store,type = "l")

power <- function(alpha = .05, sigma, n, mu_a) {
  # one-sided (greater-than) test: reject H0 when the sample mean exceeds
  # the upper critical value 100 + z_(1-alpha) * sigma / sqrt(n)
  u <- (qnorm(1 - alpha) * sigma / sqrt(n) + 100 - mu_a) / (sigma / sqrt(n))
  beta <- pnorm(u)   # probability of accepting H0 under the true mean mu_a
  1 - beta
}

power(sigma = 8, n = 16, mu_a = 102)
## [1] 0.259511
mu_a <- 90:110
store <- NULL
for (i in 1:length(mu_a)) {
  store[i] <- power(sigma = 8, n = 16, mu_a = mu_a[i])
}
plot(mu_a, store, type = "l")

5

Do the same as Question 4, for a less-than type alternative hypothesis. \[H_0 : μ = 100\] \[H_a : μ < 100\]

powerfun1<-function(alpha=.05,sigma,n,mu,pmean){
  a=qnorm(alpha)*sigma/sqrt(n)+mu
  l=(a-pmean)/(sigma/sqrt(n))
  beta=1-pnorm(l)
  
  power<-1-beta
  return(power)
}

powerfun1(alpha = .05,sigma = 8,n = 16,mu = 100,pmean = 102)
## [1] 0.004086313

6

Do the same as Question 4, for a greater-than type alternative hypothesis. \[H_0 : μ = 100\] \[H_a : μ > 100\]

powerfun<-function(alpha=.05,sigma,n,mu,pmean){
  
 
  b=-qnorm(alpha)*sigma/sqrt(n)+mu
  u=(b-pmean)/(sigma/sqrt(n))
  
  beta=pnorm(u)
  power<-1-beta
  return(power)
}

powerfun(alpha = .05,sigma = 8,n = 16,mu = 100,pmean = 102)
## [1] 0.259511
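As a cross-check (a sketch, assuming the pwr package is available; it is not used elsewhere in these notes), the same one-sided power can be obtained from pwr::pwr.norm.test, which parameterizes the problem through the effect size d = (μ_a − μ_0)/σ.

# install.packages("pwr")
library(pwr)
# d = (102 - 100)/8 = 0.25; this should reproduce the 0.2595 found above
pwr.norm.test(d = (102 - 100)/8, n = 16, sig.level = 0.05,
              alternative = "greater")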

L-8: Multivariate Tests

One sample Hotelling T square test

https://online.stat.psu.edu/stat505/lesson/7/7.1/7.1.3

Q.1

A company wants to assess whether there is a significant difference in the purchasing behavior of its customers across multiple product categories. They collect data from a random sample of 80 customers and record their average spending in three different product categories: electronics (X1), clothing (X2), and home appliances (X3). The data are contained in the expenditure.csv file. The company wants to determine if the average spending is significantly different from the hypothesized population mean vector [300, 100, 250]. Perform the appropriate test to answer this.

hote_one <- function(x, mu = rep(0, times = ncol(x))){
  n <- nrow(x)
  p <- ncol(x)
  x_bar <- apply(X = x, MARGIN = 2, FUN = mean)            # sample mean vector
  sigma_inv <- solve(var(x))                               # inverse of the sample covariance
  t2 <- n * t(x_bar - mu) %*% sigma_inv %*% (x_bar - mu)   # Hotelling's T^2
  f <- (n - p) / (p * (n - 1)) * t2                        # F transformation of T^2
  p_val <- pf(q = f, df1 = p, df2 = n - p, lower.tail = F)
  return(list(t2 = t2, F = f, p_value = p_val))
}


library(mvtnorm)
## Warning: package 'mvtnorm' was built under R version 4.3.2
n <- 100 # observations
p <- 18 # parameter dimension
mu <- rep(0,p) # no signal: mu=0
x <- rmvnorm(n = n, mean = mu)
dim(x)
## [1] 100  18
hote_one(x)
## $t2
##          [,1]
## [1,] 32.26588
## 
## $F
##          [,1]
## [1,] 1.484738
## 
## $p_value
##           [,1]
## [1,] 0.1170151
#install.packages("DEoptimR")
#install.packages("rrcov")
library(rrcov)
## Warning: package 'rrcov' was built under R version 4.3.2
## Loading required package: robustbase
## Warning: package 'robustbase' was built under R version 4.3.2
## Scalable Robust Estimators with High Breakdown Point (version 1.7-4)
library(DEoptimR)
#DescTools::HotellingsT2Test(x = x)
rrcov::T2.test(x)
## 
##  One-sample Hotelling test
## 
## data:  x
## T2 = 32.2659, F = 1.4847, df1 = 18, df2 = 82, p-value = 0.117
## alternative hypothesis: true mean vector is not equal to (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)' 
## 
## sample estimates:
##                     [,1]        [,2]      [,3]        [,4]        [,5]
## mean x-vector 0.01369284 -0.01057615 0.1205584 -0.01158277 0.004766602
##                    [,6]      [,7]       [,8]     [,9]       [,10]       [,11]
## mean x-vector 0.1472291 0.1494278 -0.1201253 -0.23926 -0.02874162 -0.01802641
##                     [,12]     [,13]     [,14]      [,15]      [,16]       [,17]
## mean x-vector -0.03766983 0.1811258 0.1902189 -0.2508392 0.03109334 -0.01059385
##                     [,18]
## mean x-vector -0.02673035
# Ques 1.
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.2
expenditure <- read_excel("D:/3rd Year/AST-332 batch-27/expenditure.xlsx")
#View(expenditure)

hote_one(expenditure[,-1], mu=c(300,100,250))
## $t2
##          [,1]
## [1,] 3.216412
## 
## $F
##          [,1]
## [1,] 1.044995
## 
## $p_value
##           [,1]
## [1,] 0.3775634
rrcov::T2.test(expenditure[,-1], mu=c(300,100,250))
## 
##  One-sample Hotelling test
## 
## data:  expenditure[, -1]
## T2 = 3.2164, F = 1.0450, df1 = 3, df2 = 77, p-value = 0.3776
## alternative hypothesis: true mean vector is not equal to (300, 100, 250)' 
## 
## sample estimates:
##               X1 (Electronics) X2 (Clothing) X3 (Home Appliances)
## mean x-vector           290.15         98.15                  252

Two sample Hotelling T square test

https://online.stat.psu.edu/stat505/lesson/7/7.1/7.1.12

Q.2

Test whether the mean differs between the following two groups:

Group 01 Group 02
x11 x12 x21 x22
6 27 25 15
6 23 28 13
18 64 36 22
8 44 35 29
11 30 15 31
34 75 44 64
28 26 42 30
71 124 54 64
43 54 34 56
33 30 29 20
20 14 39 21

Help: Consider the following hypotheses are to be tested: \[\begin{array}{c} H_0:\left[\begin{array}{l} \mu_{11} \\ \mu_{12} \end{array}\right]-\left[\begin{array}{l} \mu_{21} \\ \mu_{22} \end{array}\right]=\left[\begin{array}{l} \delta_1 \\ \delta_2 \end{array}\right]=\left[\begin{array}{l} 0 \\ 0 \end{array}\right] \\ H_1:\left[\begin{array}{l} \mu_{11} \\ \mu_{12} \end{array}\right]-\left[\begin{array}{l} \mu_{21} \\ \mu_{22} \end{array}\right] \neq\left[\begin{array}{l} \delta_1 \\ \delta_2 \end{array}\right] \neq\left[\begin{array}{l} 0 \\ 0 \end{array}\right] \end{array}\]

The corresponding test statistics is given by \[F = \frac{(n_1 + n_2 − p − 1) × T^2}{(n_1 + n_2 − 2) × p}∼ F_{p,n_1+n_2−p−1}\] where \[(1) T^2 = (\bar x −\bar y − δ)′ × S_p^{-1} × (\bar x −\bar y − δ)\] \[(2) S_p = \left( \frac {1}{n_1} + \frac {1}{n_2} \right)× S_{pooled}\] \[(3) S_{pooled} = \frac {(n_1−1)S_x+(n_2−1)S_y} {n_1+n_2−2}\]

hote_Two <- function(data1, data2, delta) {
  data1 <- as.matrix(data1)
  data2 <- as.matrix(data2)
  p <- dim(data1)[2]
  n1 <- dim(data1)[1]
  n2 <- dim(data2)[1]
  x1_bar = as.matrix(apply(X = data1, MARGIN = 2, FUN = mean))
  x2_bar = as.matrix(apply(X = data2, MARGIN = 2, FUN = mean))
  s1 = as.matrix(cov(data1))
  s2 = as.matrix(cov(data2))
  
  s_pool = as.matrix((s1 * (n1 - 1) + s2 * (n2 - 1)) / (n1 + n2 - 2))
  m <- x1_bar - x2_bar - as.matrix(delta)
  
  t2 <- t(m) %*% solve(s_pool * ((1 / n1) + (1 / n2))) %*% m
  f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
  p_val = pf(
    q = f,
    df1 = p,
    df2 = n1 + n2 - p - 1,
    lower.tail = F
  )
  
  return(c(
    T2 = t2,
    F_stat = f,
    p_value = p_val
  ))
}

#1st Population
x11 <- c(6, 6, 18, 8, 11, 34, 28, 71, 43, 33, 20)
x12 <- c(27, 23, 64, 44, 30, 75, 26, 124, 54, 30, 14)
#2nd Population
x21 <- c(25, 28, 36, 35, 15, 44, 42, 54, 34, 29, 39)
x22 <- c(15, 13, 22, 29, 31, 64, 30, 64, 56, 20, 21)
data1 <- data.frame(x11, x12)
data2 <- data.frame(x21, x22)
delta <- rep(0, 2)

hote_Two(data1 = data1,data2 = data2,delta = delta)
##           T2       F_stat      p_value 
## 12.664804984  6.015782367  0.009462901
rrcov::T2.test(x = data1,y = data2)
## 
##  Two-sample Hotelling test
## 
## data:  data1 and data2
## T2 = 12.6648, F = 6.0158, df1 = 2, df2 = 19, p-value = 0.009463
## alternative hypothesis: true difference in mean vectors is not equal to (0,0)
## sample estimates:
##                    x11      x12
## mean x-vector 25.27273 46.45455
## mean y-vector 34.63636 33.18182
# When the data is in proper (long) format
dat <- data.frame(group=factor(rep(1:2, each=11)),
                  x1= c(x11, x21),
                  x2= c(x12, x22))
rrcov::T2.test(cbind(x1,x2)~group, data=dat)
## 
##  Two-sample Hotelling test
## 
## data:  cbind(x1, x2) by group
## T2 = 12.6648, F = 6.0158, df1 = 2, df2 = 19, p-value = 0.009463
## alternative hypothesis: true difference in mean vectors is not equal to (0,0)
## sample estimates:
##                     x1       x2
## mean x-vector 25.27273 46.45455
## mean y-vector 34.63636 33.18182

Ques 3

Test whether there exists a mean difference between two species of flower, ‘setosa’ & ‘versicolor’, for the built-in data set ‘iris’.

data("iris")
data1 <- iris[iris$Species == "setosa", ]
data1
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
## 27          5.0         3.4          1.6         0.4  setosa
## 28          5.2         3.5          1.5         0.2  setosa
## 29          5.2         3.4          1.4         0.2  setosa
## 30          4.7         3.2          1.6         0.2  setosa
## 31          4.8         3.1          1.6         0.2  setosa
## 32          5.4         3.4          1.5         0.4  setosa
## 33          5.2         4.1          1.5         0.1  setosa
## 34          5.5         4.2          1.4         0.2  setosa
## 35          4.9         3.1          1.5         0.2  setosa
## 36          5.0         3.2          1.2         0.2  setosa
## 37          5.5         3.5          1.3         0.2  setosa
## 38          4.9         3.6          1.4         0.1  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 40          5.1         3.4          1.5         0.2  setosa
## 41          5.0         3.5          1.3         0.3  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 44          5.0         3.5          1.6         0.6  setosa
## 45          5.1         3.8          1.9         0.4  setosa
## 46          4.8         3.0          1.4         0.3  setosa
## 47          5.1         3.8          1.6         0.2  setosa
## 48          4.6         3.2          1.4         0.2  setosa
## 49          5.3         3.7          1.5         0.2  setosa
## 50          5.0         3.3          1.4         0.2  setosa
data1 <- data1[,-5]
data2 <- iris[iris$Species == "versicolor", ]
data2 <- data2[,-5]
delta <- rep(0, 4)

hote_Two(data1, data2, delta)
##           T2       F_stat      p_value 
## 2.580839e+03 6.254583e+02 2.664857e-67
rrcov::T2.test(x = data1, y=data2)
## 
##  Two-sample Hotelling test
## 
## data:  data1 and data2
## T2 = 2580.84, F = 625.46, df1 = 4, df2 = 95, p-value < 2.2e-16
## alternative hypothesis: true difference in mean vectors is not equal to (0,0,0,0)
## sample estimates:
##               Sepal.Length Sepal.Width Petal.Length Petal.Width
## mean x-vector        5.006       3.428        1.462       0.246
## mean y-vector        5.936       2.770        4.260       1.326
Alternative approach
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'purrr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data("iris")
data1 <- iris[iris$Species == "setosa", ]
data1
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
## 27          5.0         3.4          1.6         0.4  setosa
## 28          5.2         3.5          1.5         0.2  setosa
## 29          5.2         3.4          1.4         0.2  setosa
## 30          4.7         3.2          1.6         0.2  setosa
## 31          4.8         3.1          1.6         0.2  setosa
## 32          5.4         3.4          1.5         0.4  setosa
## 33          5.2         4.1          1.5         0.1  setosa
## 34          5.5         4.2          1.4         0.2  setosa
## 35          4.9         3.1          1.5         0.2  setosa
## 36          5.0         3.2          1.2         0.2  setosa
## 37          5.5         3.5          1.3         0.2  setosa
## 38          4.9         3.6          1.4         0.1  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 40          5.1         3.4          1.5         0.2  setosa
## 41          5.0         3.5          1.3         0.3  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 44          5.0         3.5          1.6         0.6  setosa
## 45          5.1         3.8          1.9         0.4  setosa
## 46          4.8         3.0          1.4         0.3  setosa
## 47          5.1         3.8          1.6         0.2  setosa
## 48          4.6         3.2          1.4         0.2  setosa
## 49          5.3         3.7          1.5         0.2  setosa
## 50          5.0         3.3          1.4         0.2  setosa
data1 <- data1 %>% select(Sepal.Length,Petal.Length)
data2 <- iris[iris$Species == "versicolor", ]
data2 <- data2[,c(1,3)]
delta <- rep(0, 2)

rrcov::T2.test(x = data1, y=data2)
## 
##  Two-sample Hotelling test
## 
## data:  data1 and data2
## T2 = 1918.32, F = 949.37, df1 = 2, df2 = 97, p-value < 2.2e-16
## alternative hypothesis: true difference in mean vectors is not equal to (0,0)
## sample estimates:
##               Sepal.Length Petal.Length
## mean x-vector        5.006        1.462
## mean y-vector        5.936        4.260

L-9,10 (Multivariate Tests)

1 MANOVA

https://online.stat.psu.edu/stat505/lesson/8

ANOVA: testing the mean difference among more than two groups (i.e., treatments) for a single variable. MANOVA: testing the mean difference among more than two groups (i.e., treatments) for at least two variables. Consider the following data set of 3 different groups with 2 variables. Test whether a mean difference exists among these 3 different groups.

K1 K2 K3
y1 y2 y1 y2 y1 y2
2 3 4 8 7 6
3 4 5 6 8 7
5 4 5 7 10 8
2 5 — — 9 5
— — — — 7 6
\(\bar y_{11}=3\) \(\bar y_{21}=4\) \(\bar y_{12}=4.67\) \(\bar y_{22}=7\) \(\bar y_{13}=8.2\) \(\bar y_{23}=6.4\)

\[H_0 : μ_1 = μ_2 = μ_3\] \[H_1 : \text{At least one pair is unequal}\]

Assumptions in MANOVA:

  1. The data follow a MVN distribution;
  2. The within-group variation is the same in every group, i.e. there is equal (co)variance within each group.

Test statistic: \[Λ = \frac{|W|} {|B + W|}\] (a manual computation of Λ is sketched after the MANOVA output below)

MANOVA

group <- as.factor(rep(1:3,c(4,3,5)))
y1 <- c(2,3,5,2, 4,5,5, 7,8,10,9,7)
y2 <- c(3,4,4,5, 8,6,7, 6,7,8,5,6)
manova.data <- data.frame(y1,y2,group)
manova.data
##    y1 y2 group
## 1   2  3     1
## 2   3  4     1
## 3   5  4     1
## 4   2  5     1
## 5   4  8     2
## 6   5  6     2
## 7   5  7     2
## 8   7  6     3
## 9   8  7     3
## 10 10  8     3
## 11  9  5     3
## 12  7  6     3
analysis <- manova(cbind(y1,y2)~group, manova.data)
summary(test = "Wilks", analysis)
##           Df    Wilks approx F num Df den Df    Pr(>F)    
## group      2 0.077761   10.344      4     16 0.0002475 ***
## Residuals  9                                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
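The Λ printed above can also be verified by hand from its definition (a small sketch using SSCP matrices; here W is built from the model residuals, and the total SSCP about the grand mean equals B + W):

W <- crossprod(residuals(analysis))                     # within-group SSCP (W)
total <- crossprod(scale(cbind(y1, y2), scale = FALSE)) # total SSCP (B + W)
det(W) / det(total)                                     # Wilks' lambda, ~0.0778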

1.1 Ques 1

Let's take the built-in iris data set. Suppose we want to know if there is any important difference, in sepal and petal length, between the different species. How can we test it?

#iris$Species
summary(aov(iris$Sepal.Length~iris$Species))
##               Df Sum Sq Mean Sq F value Pr(>F)    
## iris$Species   2  63.21  31.606   119.3 <2e-16 ***
## Residuals    147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(tidyverse)
iris%>%
  group_by(Species)%>%
  summarise(a=mean(Sepal.Length))
## # A tibble: 3 × 2
##   Species        a
##   <fct>      <dbl>
## 1 setosa      5.01
## 2 versicolor  5.94
## 3 virginica   6.59
summarise(group_by(iris,Species),a=mean(Sepal.Length))
## # A tibble: 3 × 2
##   Species        a
##   <fct>      <dbl>
## 1 setosa      5.01
## 2 versicolor  5.94
## 3 virginica   6.59
aggregate(iris,by = list(iris$Species),FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     setosa        5.006       3.428        1.462       0.246      NA
## 2 versicolor        5.936       2.770        4.260       1.326      NA
## 3  virginica        6.588       2.974        5.552       2.026      NA

Ques 1

analysis1 <- manova(cbind(Sepal.Length, Petal.Length)~Species, iris )
summary(test = "Wilks", analysis1)
##            Df    Wilks approx F num Df den Df    Pr(>F)    
## Species     2 0.039878   292.56      4    292 < 2.2e-16 ***
## Residuals 147                                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multivariate Linear Regression Models

2.1

When is the idea of “Multivariate Linear Regression” used?

  1. Fit a multivariate straight-line regression model using the following data set:

z1 = 0, 1, 2, 3, 4
y1 = 1, 4, 3, 8, 9
y2 = −1, −1, 2, 3, 2

z1 <- 0:4
y1 <- c(1,4,3,8,9)
y2 <- c(-1,-1,2,3,2)
reg.data <- data.frame(z1,y1,y2)
reg.data
##   z1 y1 y2
## 1  0  1 -1
## 2  1  4 -1
## 3  2  3  2
## 4  3  8  3
## 5  4  9  2
analysis <- lm(cbind(y1,y2)~z1, data = reg.data)
summary(analysis)
## Response y1 :
## 
## Call:
## lm(formula = y1 ~ z1, data = reg.data)
## 
## Residuals:
##  1  2  3  4  5 
##  0  1 -2  1  0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     1.0954   0.913   0.4286  
## z1            2.0000     0.4472   4.472   0.0208 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.414 on 3 degrees of freedom
## Multiple R-squared:  0.8696, Adjusted R-squared:  0.8261 
## F-statistic:    20 on 1 and 3 DF,  p-value: 0.02084
## 
## 
## Response y2 :
## 
## Call:
## lm(formula = y2 ~ z1, data = reg.data)
## 
## Residuals:
##         1         2         3         4         5 
## -1.11e-16 -1.00e+00  1.00e+00  1.00e+00 -1.00e+00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -1.0000     0.8944  -1.118   0.3450  
## z1            1.0000     0.3651   2.739   0.0714 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.155 on 3 degrees of freedom
## Multiple R-squared:  0.7143, Adjusted R-squared:  0.619 
## F-statistic:   7.5 on 1 and 3 DF,  p-value: 0.07142

2.2

  1. Regress only y1 on z1.
fit1 <- lm(y1~z1, data = reg.data)
summary(fit1)
## 
## Call:
## lm(formula = y1 ~ z1, data = reg.data)
## 
## Residuals:
##  1  2  3  4  5 
##  0  1 -2  1  0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.0000     1.0954   0.913   0.4286  
## z1            2.0000     0.4472   4.472   0.0208 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.414 on 3 degrees of freedom
## Multiple R-squared:  0.8696, Adjusted R-squared:  0.8261 
## F-statistic:    20 on 1 and 3 DF,  p-value: 0.02084

2.3

  1. Regress only y2 on z1.
fit2 <- lm(y2~z1, data = reg.data)
summary(fit2)
## 
## Call:
## lm(formula = y2 ~ z1, data = reg.data)
## 
## Residuals:
##         1         2         3         4         5 
## -1.11e-16 -1.00e+00  1.00e+00  1.00e+00 -1.00e+00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -1.0000     0.8944  -1.118   0.3450  
## z1            1.0000     0.3651   2.739   0.0714 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.155 on 3 degrees of freedom
## Multiple R-squared:  0.7143, Adjusted R-squared:  0.619 
## F-statistic:   7.5 on 1 and 3 DF,  p-value: 0.07142

2.4

  1. What’s the difference between Univariate Linear Regression & Multivariate Linear Regression? Comment based on your findings.
vcov(analysis)
##                y1:(Intercept)       y1:z1 y2:(Intercept)       y2:z1
## y1:(Intercept)      1.2000000 -0.40000000     -0.4000000  0.13333333
## y1:z1              -0.4000000  0.20000000      0.1333333 -0.06666667
## y2:(Intercept)     -0.4000000  0.13333333      0.8000000 -0.26666667
## y2:z1               0.1333333 -0.06666667     -0.2666667  0.13333333
vcov(fit1)
##             (Intercept)   z1
## (Intercept)         1.2 -0.4
## z1                 -0.4  0.2
vcov(fit2)
##             (Intercept)         z1
## (Intercept)   0.8000000 -0.2666667
## z1           -0.2666667  0.1333333

From the multivariate regression we obtain cov(B1, B2), which may be required to test hypotheses involving both B1 and B2.

cov(B1, B2) is not obtainable from the separate univariate regressions.
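For example (a sketch, using a large-sample z statistic), the cross-equation covariance from vcov(analysis) allows a test of whether the slope on z1 is equal for y1 and y2:

b <- coef(analysis)                   # coefficient matrix (rows: terms, cols: responses)
V <- vcov(analysis)                   # includes the cross-equation covariances
d <- b["z1", "y1"] - b["z1", "y2"]    # difference between the two slopes
se <- sqrt(V["y1:z1", "y1:z1"] + V["y2:z1", "y2:z1"] - 2 * V["y1:z1", "y2:z1"])
d / se                                # approximate z statistic for H0: equal slopes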

L-11: Population Pyramid

Population pyramid

A population pyramid (age structure diagram) or “age-sex pyramid” is a graphical illustration of the distribution of a population (typically that of a country or region of the world) by age group and sex; it typically takes the shape of a pyramid when the population is growing. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population.

The Three Basic Shapes of Population Pyramids

Expansive Expansive population pyramids are used to describe populations that are young and growing. They are often characterized by their typical ‘pyramid’ shape, which has a broad base and narrow top. Expansive population pyramids show a larger percentage of the population in the younger age cohorts, usually with each age cohort smaller in size than the one below it. These types of populations are typically representative of developing nations, whose populations often have high fertility rates and lower than average life expectancies.

Constrictive Constrictive population pyramids are used to describe populations that are elderly and shrinking. Constrictive pyramids can often look like beehives and typically have an inverted shape with the graph tapering in at the bottom. Constrictive pyramids have smaller percentages of people in the younger age cohorts and are typically characteristic of countries with higher levels of social and economic development, where access to quality education and health care is available to a large portion of the population.

Stationary Stationary, or near stationary, population pyramids are used to describe populations that are not growing. They are characterized by their rectangular shape, displaying somewhat equal percentages across age cohorts that taper off toward the top. These pyramids are often characteristic of developed nations, where birth rates are low and overall quality of life is high.

https://www.visualcapitalist.com/population-pyramids-compared/

Question 1

Sheet1 of lec11-population-pyramid.xlsx shows the single-year population of the USA by sex.

  1. Construct population pyramid for the data.
  2. Convert the single year to 5-year age groups (0-4, 5-9, etc) and construct a population pyramid in absolute numbers.
  3. Construct a population pyramid for the 5-year age groups based on the percentage of the total population.

Question 2

Sheet2 of lec11-population-pyramid.xlsx shows the population structure of the USA in 2014. Construct a population pyramid and comment on the shape.
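A minimal sketch of a pyramid in base R, assuming a small hypothetical data set (the real age-sex counts are in lec11-population-pyramid.xlsx); males are drawn to the left as negative values, so the left half of the axis reads as positive counts:

age    <- c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29")
male   <- c(10.2, 10.0, 9.8, 9.9, 10.6, 11.0)   # hypothetical counts (millions)
female <- c(9.8, 9.6, 9.4, 9.5, 10.3, 10.7)
barplot(-male, horiz = TRUE, names.arg = age, las = 1, col = "steelblue",
        xlim = c(-12, 12), xlab = "Population (millions)")
barplot(female, horiz = TRUE, add = TRUE, col = "salmon", axes = FALSE)
legend("topright", fill = c("steelblue", "salmon"), legend = c("Male", "Female"))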

L-12: Multivariate Normal Density

# load library MASS
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# set seed and create data vectors
set.seed(98989)
sample_size <- 100
sample_meanvector <- c(10, 5)
sample_covariance_matrix <- matrix(c(10, 5, 5, 9),
ncol = 2)
# create bivariate normal distribution
sample_distribution <- mvrnorm(n = sample_size,
mu = sample_meanvector,
Sigma = sample_covariance_matrix)
# print top of distribution
head(sample_distribution)
##           [,1]       [,2]
## [1,]  8.458875  6.1000172
## [2,]  9.914924  5.4988621
## [3,]  8.376009  4.8306038
## [4,]  8.720649  0.1349441
## [5,]  7.858077  8.4643691
## [6,] 10.283415 -0.3931910
plot(sample_distribution, xlab = expression(X[1]), ylab = expression(X[2]))

Question 1

The pdf of a Bivariate Normal (BVN) distribution is given by \[f_{X_1,X_2} (X_1 = x_1,X_2 = x_2) = \frac{1}{2 \pi|Σ|^{1/2}}\exp\left\{\frac {-1}{2(1- \rho_{12}^2)} \left[\left( \frac {x_1- \mu_1}{\sqrt{\sigma_{11}}}\right)^2 +\left( \frac {x_2- \mu_2}{\sqrt{\sigma_{22}}}\right)^2-2\rho_{12} \left( \frac {x_1- \mu_1}{\sqrt{\sigma_{11}}}\right) \left( \frac {x_2- \mu_2}{\sqrt{\sigma_{22}}}\right) \right ]\right\}\] \[=\frac{1}{2 \pi \sqrt{\sigma_{11} \sigma_{22}(1- \rho_{12}^2) } }\exp\left\{\frac {-1}{2(1- \rho_{12}^2)} \left[\left( \frac {x_1- \mu_1}{\sqrt{\sigma_{11}}}\right)^2 +\left( \frac {x_2- \mu_2}{\sqrt{\sigma_{22}}}\right)^2-2\rho_{12} \left( \frac {x_1- \mu_1}{\sqrt{\sigma_{11}}}\right) \left( \frac {x_2- \mu_2}{\sqrt{\sigma_{22}}}\right) \right ]\right\}\]

Consider that X1 & X2 follow the given Bivariate Normal distribution: \[X_1,X_2 ∼ N_2 (μ_1 = 0, μ_2 = 0, σ_{11} = 1, σ_{22} = 1, ρ_{12} = 0.75)\]

1(i)

Write a function which takes X1 & X2 as input and produces the BVN density; evaluate it at X1 = X2 = 1. Then plot the BVN density surface.

1(ii)

Find the marginal distribution for both X1 & X2.

Bivariate <- function(x1, x2){
 mu1<- 0
 mu2 <- 0
 sig11 <- 1
 sig22 <- 1
 rho <- 0.75
 term1 <- 1/(2*pi*sqrt(sig11*sig22*(1 - rho**2)))
 term2 <- 1/(2*(1 - rho**2))
 term3 <- ((x1 - mu1)/sqrt(sig11))**2
 term4 <- ((x2 - mu2)/sqrt(sig22))**2
 term5 <- 2*rho*((x1 - mu1)/sqrt(sig11))*((x2 - mu2)/sqrt(sig22))
 z <- term1*exp(-term2*(term3+term4-term5))
 return(z)
 }

#i
Bivariate(x1 = 1, x2 = 1)
## [1] 0.1358823
mvtnorm::dmvnorm(x = c(1,1), 
                 mean = c(0,0), 
                 sigma = matrix(c(1, 0.75, 0.75, 1), ncol = 2) )
## [1] 0.1358823
# grids for x1 and x2, chosen to cover most of the density
x1 <- seq(-3, 3, length.out = 100)
x2 <- seq(-3, 3, length.out = 100)
# outer() evaluates Bivariate() at every combination of x1 and x2 values
z <- outer(x1, x2, Bivariate)

library(mvtnorm)
#z <- outer(x1,x2,mvtnorm::dmvnorm)
#head(z)[1:5,1:5]

persp(x1,x2,z, col = "red", theta = 45, phi = 0.20)
title(main = "Bivariate Normal Distribution")

#ii
par(mfrow = c(2, 1))
# row sums of z give an (unnormalized) numerical marginal of X1; multiplying
# by the x2 grid step would approximate the integral over x2
fx1 <- rowSums(z)
plot(x1, fx1, type = "l", col = 2, lwd = 3, ann = F)
title(main = "Marginal Distribution of X1", xlab = "x", ylab = "f(x)")

fx2 <- colSums(z)
plot(x2,fx2, type = "l", col = 4, lwd = 3, ann = F)
title(main = "Marginal Distribution of X2", xlab = "x", ylab = "f(x)")

mvtnorm::dmvnorm
## function (x, mean = rep(0, p), sigma = diag(p), log = FALSE, 
##     checkSymmetry = TRUE) 
## {
##     if (is.vector(x)) 
##         x <- matrix(x, ncol = length(x))
##     p <- ncol(x)
##     if (!missing(mean)) {
##         if (!is.null(dim(mean))) 
##             dim(mean) <- NULL
##         if (length(mean) != p) 
##             stop("x and mean have non-conforming size")
##     }
##     if (!missing(sigma)) {
##         if (p != ncol(sigma)) 
##             stop("x and sigma have non-conforming size")
##         if (checkSymmetry && !isSymmetric(sigma, tol = sqrt(.Machine$double.eps), 
##             check.attributes = FALSE)) 
##             stop("sigma must be a symmetric matrix")
##     }
##     dec <- tryCatch(base::chol(sigma), error = function(e) e)
##     if (inherits(dec, "error")) {
##         x.is.mu <- colSums(t(x) != mean) == 0
##         logretval <- rep.int(-Inf, nrow(x))
##         logretval[x.is.mu] <- Inf
##     }
##     else {
##         tmp <- backsolve(dec, t(x) - mean, transpose = TRUE)
##         rss <- colSums(tmp^2)
##         logretval <- -sum(log(diag(dec))) - 0.5 * p * log(2 * 
##             pi) - 0.5 * rss
##     }
##     names(logretval) <- rownames(x)
##     if (log) 
##         logretval
##     else exp(logretval)
## }
## <bytecode: 0x0000026c75763618>
## <environment: namespace:mvtnorm>
#install.packages("rrcov")
library(rrcov)
rrcov::T2.test
## function (x, ...) 
## UseMethod("T2.test")
## <bytecode: 0x0000026c71788550>
## <environment: namespace:rrcov>
outer(1:5,1:5,FUN = "*")
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    2    4    6    8   10
## [3,]    3    6    9   12   15
## [4,]    4    8   12   16   20
## [5,]    5   10   15   20   25
outer(1:5,1:5,FUN = "+")
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    3    4    5    6
## [2,]    3    4    5    6    7
## [3,]    4    5    6    7    8
## [4,]    5    6    7    8    9
## [5,]    6    7    8    9   10
outer(1:5,1:5,FUN = "paste")
##      [,1]  [,2]  [,3]  [,4]  [,5] 
## [1,] "1 1" "1 2" "1 3" "1 4" "1 5"
## [2,] "2 1" "2 2" "2 3" "2 4" "2 5"
## [3,] "3 1" "3 2" "3 3" "3 4" "3 5"
## [4,] "4 1" "4 2" "4 3" "4 4" "4 5"
## [5,] "5 1" "5 2" "5 3" "5 4" "5 5"

2 Sampling from the multivariate normal distribution

# load library MASS
library(MASS)
# set seed and create data vectors
set.seed(98989)
sample_size <- 1000
sample_meanvector <- c(10, 5, 7, 9, 20)
sample_covariance_matrix <- matrix(c(5, 4, 3, 2, 1,
                                     4, 5, 4, 3, 2,
                                     3, 4, 5, 4, 3,
                                     2, 3, 4, 5, 4,
                                     1, 2, 3, 4, 5), ncol = 5)
# create multivariate normal distribution
sample_distribution <- mvrnorm(n = sample_size,
                               mu = sample_meanvector,
                               Sigma = sample_covariance_matrix)
# print top of distribution
head(sample_distribution)
##           [,1]     [,2]     [,3]      [,4]     [,5]
## [1,]  9.059425 4.412918 6.918059  9.690694 19.78744
## [2,]  9.267290 5.569379 7.575157  8.664539 20.46564
## [3,]  9.986655 3.189981 5.898267  8.745072 20.25941
## [4,]  4.859023 3.346945 5.312794  6.979162 19.99343
## [5,] 10.964113 4.544509 5.589856 10.234550 21.97238
## [6,]  7.663888 2.803528 4.822008  7.335070 20.30340
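A quick sanity check, reusing the simulated sample_distribution: with n = 1000 draws, the empirical mean vector and covariance matrix should sit close to the parameters supplied to mvrnorm().

# empirical moments of the simulated sample
round(colMeans(sample_distribution), 2)
round(cov(sample_distribution), 2)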

3 Multivariate Normality Tests in R

H0: The data follow a multivariate normal distribution

H1: The data do not follow a multivariate normal distribution

In each test below, we reject H0 when the p-value falls below the chosen significance level (e.g., 0.05); none of the examples here does, so multivariate normality is not rejected.

https://www.statology.org/multivariate-normality-test-r/

3.1 Mardia’s Test in R

#install.packages("QuantPsyc")
library("QuantPsyc")
## Warning: package 'QuantPsyc' was built under R version 4.3.2
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:robustbase':
## 
##     salinity
## 
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
## 
##     norm
# create a dataset of three independent standard normal variables
set.seed(0)

data <- data.frame(x1 = rnorm(50),
                   x2 = rnorm(50),
                   x3 = rnorm(50))
#perform Multivariate normality test
QuantPsyc::mult.norm(data)$mult.test
##           Beta-hat      kappa     p-val
## Skewness  1.630474 13.5872843 0.1926626
## Kurtosis 13.895364 -0.7130395 0.4758213

3.2 Energy Test in R

#install.packages("energy")
library(energy)
## Warning: package 'energy' was built under R version 4.3.2
# create dataset
sample_size <- 1000
sample_meanvector <- c(10, 5, 7, 9, 20)
sample_covariance_matrix <- matrix(c(5, 4, 3, 2, 1,
                                     4, 5, 4, 3, 2,
                                     3, 4, 5, 4, 3,
                                     2, 3, 4, 5, 4,
                                     1, 2, 3, 4, 5), ncol = 5)
# create multivariate normal distribution (rmvnorm comes from mvtnorm, loaded earlier)
data <- rmvnorm(n = sample_size,
                mean = sample_meanvector,
                sigma = sample_covariance_matrix)
#perform Multivariate normality test
energy::mvnorm.etest(data, R=100)
## 
##  Energy test of multivariate normality: estimated parameters
## 
## data:  x, sample size 1000, dimension 5, replicates 100
## E-statistic = 1.1783, p-value = 0.26

3.3 Shapiro-Wilk Test for Multivariate Normality in R

data <- iris[1:50, 1:4]   # the 50 setosa observations
# mshapiro.test() expects variables in rows, hence the transpose t(data)
mvnormtest::mshapiro.test(t(data))
## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.95878, p-value = 0.07906

L-13: Basic Multivariate Statistics

1 Question

Data collected on 22 U.S. public utility companies for the year 1975 have the variables:

• X1: Fixed-charge coverage ratio (income/debt)
• X2: Rate of return on capital
• X3: Cost per KW capacity in place
• X4: Annual load factor
• X5: Peak kWh demand growth from 1974 to 1975
• X6: Sales (kWh use per year)
• X7: Percent nuclear
• X8: Total fuel costs (cents per kWh)

Using the utility data, address the following questions.

1(a)

Compute the mean vector, covariance matrix, and correlation matrix (i.e., find \(\bar{x}\), \(S_n\), and \(R\)).

my_data<-read.csv("D:/3rd Year/AST-332 batch-27/AST 332 Lec 13 utility.csv")
my_data
##      X1   X2  X3   X4   X5    X6   X7    X8       X9
## 1  1.06  9.2 151 54.4  1.6  9077  0.0 0.628  Arizona
## 2  0.89 10.3 202 57.9  2.2  5088 25.3 1.555   Boston
## 3  1.43 15.4 113 53.0  3.4  9212  0.0 1.058  Central
## 4  1.02 11.2 168 56.0  0.3  6423 34.3 0.700   Common
## 5  1.49  8.8 192 51.2  1.0  3300 15.6 2.044 Consolid
## 6  1.32 13.5 111 60.0 -2.2 11127 22.5 1.241  Florida
## 7  1.22 12.2 175 67.6  2.2  7642  0.0 1.652 Hawaiian
## 8  1.10  9.2 245 57.0  3.3 13082  0.0 0.309    Idaho
## 9  1.34 13.0 168 60.4  7.2  8406  0.0 0.862 Kentucky
## 10 1.12 12.4 197 53.0  2.7  6455 39.2 0.623  Madison
## 11 0.75  7.5 173 51.5  6.5 17441  0.0 0.768   Nevada
## 12 1.13 10.9 178 62.0  3.7  6154  0.0 1.897 NewEngla
## 13 1.15 12.7 199 53.7  6.4  7179 50.2 0.527 Northern
## 14 1.09 12.0  96 49.8  1.4  9673  0.0 0.588 Oklahoma
## 15 0.96  7.6 164 62.2 -0.1  6468  0.9 1.400  Pacific
## 16 1.16  9.9 252 56.0  9.2 15991  0.0 0.620    Puget
## 17 0.76  6.4 136 61.9  9.0  5714  8.3 1.920 SanDiego
## 18 1.05 12.6 150 56.7  2.7 10140  0.0 1.108 Southern
## 19 1.16 11.7 104 54.0 -2.1 13507  0.0 0.636    Texas
## 20 1.20 11.8 148 59.9  3.5  7287 41.1 0.702 Wisconsi
## 21 1.04  8.6 204 61.0  3.5  6650  0.0 2.116   United
## 22 1.07  9.3 174 54.3  5.9 10093 26.6 1.306 Virginia
data <- my_data[-9]   # drop X9, the company-name column
data
##      X1   X2  X3   X4   X5    X6   X7    X8
## 1  1.06  9.2 151 54.4  1.6  9077  0.0 0.628
## 2  0.89 10.3 202 57.9  2.2  5088 25.3 1.555
## 3  1.43 15.4 113 53.0  3.4  9212  0.0 1.058
## 4  1.02 11.2 168 56.0  0.3  6423 34.3 0.700
## 5  1.49  8.8 192 51.2  1.0  3300 15.6 2.044
## 6  1.32 13.5 111 60.0 -2.2 11127 22.5 1.241
## 7  1.22 12.2 175 67.6  2.2  7642  0.0 1.652
## 8  1.10  9.2 245 57.0  3.3 13082  0.0 0.309
## 9  1.34 13.0 168 60.4  7.2  8406  0.0 0.862
## 10 1.12 12.4 197 53.0  2.7  6455 39.2 0.623
## 11 0.75  7.5 173 51.5  6.5 17441  0.0 0.768
## 12 1.13 10.9 178 62.0  3.7  6154  0.0 1.897
## 13 1.15 12.7 199 53.7  6.4  7179 50.2 0.527
## 14 1.09 12.0  96 49.8  1.4  9673  0.0 0.588
## 15 0.96  7.6 164 62.2 -0.1  6468  0.9 1.400
## 16 1.16  9.9 252 56.0  9.2 15991  0.0 0.620
## 17 0.76  6.4 136 61.9  9.0  5714  8.3 1.920
## 18 1.05 12.6 150 56.7  2.7 10140  0.0 1.108
## 19 1.16 11.7 104 54.0 -2.1 13507  0.0 0.636
## 20 1.20 11.8 148 59.9  3.5  7287 41.1 0.702
## 21 1.04  8.6 204 61.0  3.5  6650  0.0 2.116
## 22 1.07  9.3 174 54.3  5.9 10093 26.6 1.306
# 1(a)

x_bar<-apply(data, MARGIN = 2,FUN = mean)
x_bar
##          X1          X2          X3          X4          X5          X6 
##    1.114091   10.736364  168.181818   56.977273    3.240909 8914.045455 
##          X7          X8 
##   12.000000    1.102727
#alternative
colMeans(data)
##          X1          X2          X3          X4          X5          X6 
##    1.114091   10.736364  168.181818   56.977273    3.240909 8914.045455 
##          X7          X8 
##   12.000000    1.102727
# covariance matrix (note: cov() uses the n - 1 divisor; see the note below)
covariance<-cov(data)
covariance
##               X1          X2           X3            X4            X5
## X1   0.034044372   0.2661299   -0.7812554 -6.752165e-02   -0.14908009
## X2   0.266129870   5.0357576  -32.1259740 -8.643723e-01   -1.82012987
## X3  -0.781255411 -32.1259740 1696.7272727  1.843290e+01   55.92077922
## X4  -0.067521645  -0.8643723   18.4329004  1.990184e+01    0.46573593
## X5  -0.149080087  -1.8201299   55.9207792  4.657359e-01    9.72348485
## X6 -99.346385281 -76.6160173 4092.5151515 -4.560037e+03 1952.87424242
## X7   0.138809524   7.9676190   79.3095238 -1.229762e+01   -1.00142857
## X8  -0.001372165  -0.4088848    0.1195758  1.204446e+00   -0.01236926
##               X6            X7            X8
## X1 -9.934639e+01  1.388095e-01 -1.372165e-03
## X2 -7.661602e+01  7.967619e+00 -4.088848e-01
## X3  4.092515e+03  7.930952e+01  1.195758e-01
## X4 -4.560037e+03 -1.229762e+01  1.204446e+00
## X5  1.952874e+03 -1.001429e+00 -1.236926e-02
## X6  1.260239e+07 -2.227602e+04 -1.106557e+03
## X7 -2.227602e+04  2.819686e+02 -1.728324e+00
## X8 -1.106557e+03 -1.728324e+00  3.092451e-01
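One caveat on the divisor: R's cov() divides by n - 1. If \(S_n\) is defined with divisor n, as in some multivariate texts, rescale the result; a minimal sketch:

# covariance matrix with divisor n instead of n - 1
n <- nrow(data)
S_n <- cov(data) * (n - 1)/n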
cor(data)
##             X1           X2           X3          X4           X5           X6
## X1  1.00000000  0.642744766 -0.102793192 -0.08203019 -0.259111089 -0.151671159
## X2  0.64274477  1.000000000 -0.347550467 -0.08634194 -0.260111168 -0.009617468
## X3 -0.10279319 -0.347550467  1.000000000  0.10030926  0.435367718  0.027987098
## X4 -0.08203019 -0.086341943  0.100309264  1.00000000  0.033479746 -0.287935594
## X5 -0.25911109 -0.260111168  0.435367718  0.03347975  1.000000000  0.176415568
## X6 -0.15167116 -0.009617468  0.027987098 -0.28793559  0.176415568  1.000000000
## X7  0.04480188  0.211444212  0.114661857 -0.16416254 -0.019125318 -0.373689523
## X8 -0.01337310 -0.327655318  0.005220183  0.48550006 -0.007133152 -0.560526327
##             X7           X8
## X1  0.04480188 -0.013373101
## X2  0.21144421 -0.327655318
## X3  0.11466186  0.005220183
## X4 -0.16416254  0.485500063
## X5 -0.01912532 -0.007133152
## X6 -0.37368952 -0.560526327
## X7  1.00000000 -0.185085916
## X8 -0.18508592  1.000000000
cov2cor(covariance)
##             X1           X2           X3          X4           X5           X6
## X1  1.00000000  0.642744766 -0.102793192 -0.08203019 -0.259111089 -0.151671159
## X2  0.64274477  1.000000000 -0.347550467 -0.08634194 -0.260111168 -0.009617468
## X3 -0.10279319 -0.347550467  1.000000000  0.10030926  0.435367718  0.027987098
## X4 -0.08203019 -0.086341943  0.100309264  1.00000000  0.033479746 -0.287935594
## X5 -0.25911109 -0.260111168  0.435367718  0.03347975  1.000000000  0.176415568
## X6 -0.15167116 -0.009617468  0.027987098 -0.28793559  0.176415568  1.000000000
## X7  0.04480188  0.211444212  0.114661857 -0.16416254 -0.019125318 -0.373689523
## X8 -0.01337310 -0.327655318  0.005220183  0.48550006 -0.007133152 -0.560526327
##             X7           X8
## X1  0.04480188 -0.013373101
## X2  0.21144421 -0.327655318
## X3  0.11466186  0.005220183
## X4 -0.16416254  0.485500063
## X5 -0.01912532 -0.007133152
## X6 -0.37368952 -0.560526327
## X7  1.00000000 -0.185085916
## X8 -0.18508592  1.000000000
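cov2cor() reproduces cor(data) because the correlation matrix is just the covariance matrix rescaled by the standard deviations; in symbols, with \(D = \mathrm{diag}(s_{11}, \dots, s_{pp})\),

\[R = D^{-1/2}\, S\, D^{-1/2}\]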
# aside: %in% tests set membership; it is used below to pick out rows by company name
1 %in% c(1:10)
## [1] TRUE
1 %in% c(5:10)
## [1] FALSE

1(b)

Considering the variables X1 and X2, compute the statistical distance between Boston Edison Co. and Texas Utilities Co. (Assume the coordinates X1 and X2 are independent)

# 1(b)
# statistical distance: first extract the two companies' (X1, X2) values and the
# variances of X1 and X2; the distance itself is computed in the sketch below
my_data[my_data$X9 %in% c("Boston","Texas"),c("X1","X2")]
##      X1   X2
## 2  0.89 10.3
## 19 1.16 11.7
var(my_data$X1)
## [1] 0.03404437
var(my_data$X2)
## [1] 5.035758
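Completing the calculation: for independent coordinates, the statistical distance is \(d = \sqrt{\frac{(x_{11}-x_{21})^2}{s_{11}} + \frac{(x_{12}-x_{22})^2}{s_{22}}}\). A minimal sketch using the quantities above:

# statistical distance between Boston Edison Co. and Texas Utilities Co.
pts <- as.matrix(my_data[my_data$X9 %in% c("Boston", "Texas"), c("X1", "X2")])
diffs <- pts[1, ] - pts[2, ]
sqrt(sum(diffs^2 / c(var(my_data$X1), var(my_data$X2))))

which evaluates to roughly 1.59.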

1(c)

Plot the scatter diagram and marginal dot diagram for each variable. Comment on the appearance of the diagrams.

plot(my_data$X1, my_data$X2)

#install.packages("unrepx")
unrepx::dot.plot(data$X1,
                 spacing = 1,
                 pch = 19,
                 col = 2)
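The question asks for a marginal dot diagram of each variable; the same call can be repeated for the remaining columns, e.g. for X2:

unrepx::dot.plot(data$X2,
                 spacing = 1,
                 pch = 19,
                 col = 4)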

1(d)

Plot the 3D scatter diagram for first three variables.

scatterplot3d::scatterplot3d(data$X1,data$X2,data$X3)

scatterplot3d::scatterplot3d(data[c("X1","X2","X3")],type= "h",angle = 60,main = "3D plot",xlab = "value of x1",ylab = "value of x2",zlab = "value of x3")

1(e)

Represent the public utility companies as stars. Visually group the companies into four or five clusters.

1(f)

Represent the public utility companies as Chernoff faces with assignments of variables to facial characteristics.

## (e) and (f)

label <- my_data$X9

stars(my_data[,-9], labels = label, col.stars = 1:22)

#install.packages("aplpack")
library(aplpack)
aplpack::faces(my_data[1:22,-9], labels = label)

## effect of variables:
##  modified item       Var 
##  "height of face   " "X1"
##  "width of face    " "X2"
##  "structure of face" "X3"
##  "height of mouth  " "X4"
##  "width of mouth   " "X5"
##  "smiling          " "X6"
##  "height of eyes   " "X7"
##  "width of eyes    " "X8"
##  "height of hair   " "X1"
##  "width of hair   "  "X2"
##  "style of hair   "  "X3"
##  "height of nose  "  "X4"
##  "width of nose   "  "X5"
##  "width of ear    "  "X6"
##  "height of ear   "  "X7"

1st Incourse

2022

1.a

Let Vitamin A = x and Vitamin C = y. The objective is \[Minimize, z = 8x + 10y\] Subject to, \[2x + y \geq 50\] \[x + 2y \geq 70\] \[x, y \geq 0\]

1.b

#install.packages("lpSolve")
library(lpSolve)
obj <- c(8, 10)
cons <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = T)
dir <- c(">=", ">=")
rhs <- c(50, 70)
# avoid naming the result 'lp', which would shadow the lpSolve function
res <- lp(direction = "min", objective.in = obj, const.mat = cons, const.dir = dir, const.rhs = rhs)
res$objval
## [1] 380
res$solution
## [1] 10 30
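As a quick check on the reported optimum (cost 380 at x = 10, y = 30), both constraints bind there; this one-liner reuses cons and res from above:

# constraint values at the optimum: should equal the right-hand sides 50 and 70
cons %*% res$solution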

3

3a

a<-matrix(c(1,5,-2,5,4,5,-2,5,1),nrow = 3,byrow = T)
#a
eigen_value <- eigen(a)$values
eigen_value
## [1]  9  3 -6
eigen_vector <- eigen(a)$vectors
eigen_vector
##           [,1]          [,2]       [,3]
## [1,] 0.4082483  7.071068e-01  0.5773503
## [2,] 0.8164966  1.110223e-16 -0.5773503
## [3,] 0.4082483 -7.071068e-01  0.5773503

3b

# A inverse via the spectral decomposition: A^{-1} = P diag(1/lambda) P'
# (valid here because a is symmetric, so the eigenvector matrix P is orthogonal)
v <- diag(eigen_value^(-1))
a_inverse <- eigen_vector %*% v %*% t(eigen_vector)
a_inverse
##             [,1]       [,2]        [,3]
## [1,]  0.12962963 0.09259259 -0.20370370
## [2,]  0.09259259 0.01851852  0.09259259
## [3,] -0.20370370 0.09259259  0.12962963
# A^10 via the same decomposition: A^10 = P diag(lambda^10) P'
v10 <- diag(eigen_value^10)
a_10 <- eigen_vector %*% v10 %*% t(eigen_vector)
a_10
##            [,1]       [,2]       [,3]
## [1,]  601315650 1142106075  601256601
## [2,] 1142106075 2344678326 1142106075
## [3,]  601256601 1142106075  601315650
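Both spectral-decomposition results can be verified directly; a small check reusing a, a_inverse, and the eigen pairs from above:

# a_inverse %*% a should be (numerically) the 3x3 identity matrix
round(a_inverse %*% a, 10)
# and the decomposition itself should rebuild a
round(eigen_vector %*% diag(eigen_value) %*% t(eigen_vector), 10)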

2nd Incourse

2022

1

x_bar<-matrix(c(6.84,5.06,4.18,3.41),ncol=1)
sigma<-matrix(c(10.1,2.0,1.1,0.2,
                2.0,6.5,2.1,0.5,
                1.1,2.1,4.9,1.1,
                0.2,0.5,1.1,1.4),nrow = 4,byrow = T)
n<-7
p<-4
null<-matrix(c(0,0,0,0))

# Hotelling's T^2 statistic: T^2 = n (x_bar - mu0)' S^{-1} (x_bar - mu0)
t2 <- n*t(x_bar - null) %*% solve(sigma) %*% (x_bar - null)
t2
##          [,1]
## [1,] 94.40542
# convert to F: F = (n-p) T^2 / (p(n-1)) ~ F(p, n-p) under H0
f <- (n - p)/(p*(n - 1))*t2
p_val <- pf(f, p, n - p, lower.tail = F)
f
##          [,1]
## [1,] 11.80068
p_val
##            [,1]
## [1,] 0.03521062
# cross-check by simulation: draw n observations with this mean and covariance
x <- rmvnorm(n = n, mean = x_bar, sigma = sigma)
rrcov::T2.test(x)
## 
##  One-sample Hotelling test
## 
## data:  x
## T2 = 514.707, F = 64.338, df1 = 4, df2 = 3, p-value = 0.003071
## alternative hypothesis: true mean vector is not equal to (0, 0, 0, 0)' 
## 
## sample estimates:
##                   [,1]     [,2]     [,3]    [,4]
## mean x-vector 8.114213 3.735761 4.001301 3.22251
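For the hand computation above, the decision at the 5% level compares F with the upper critical value of the F(p, n - p) distribution; a one-line check reusing f, p, and n:

# reject H0 when F exceeds the critical value qf(0.95, p, n - p)
f > qf(0.95, df1 = p, df2 = n - p)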

4

pow <- function(alpha, sigma, n, null, p_mean){
  # acceptance limits of the two-sided z-test under H0: mu = null
  a <- qnorm(alpha/2)*sigma/sqrt(n) + null
  b <- -qnorm(alpha/2)*sigma/sqrt(n) + null
  # standardize the limits under the true mean p_mean
  l <- (a - p_mean)/(sigma/sqrt(n))
  u <- (b - p_mean)/(sigma/sqrt(n))
  beta <- pnorm(u) - pnorm(l)  # P(fail to reject H0 | true mean = p_mean)
  return(1 - beta)             # power
}

pow(alpha = .05,sigma = 20,n=50,null = 300,p_mean = 290)
## [1] 0.9424375
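The function also traces a full power curve, since every step vectorizes over p_mean; a short illustration with the same alpha, sigma, n, and null as above:

# power of the two-sided z-test as the true mean moves away from 300
mu <- seq(280, 320, by = 0.5)
plot(mu, pow(alpha = .05, sigma = 20, n = 50, null = 300, p_mean = mu),
     type = "l", xlab = "true mean", ylab = "power")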

Final