Pb1

Univariate displays & sampling distributions

s=c(10,100,1000)
n=c(10,100,1000)
par(mfrow=c(3,3))
for (i in s){
  for (k in n){
    data <- replicate(i, rnorm(k, 10, 1))
    #popsample<-sample(data,size=k,replace=TRUE)
    title<-paste("Histogram for N=",k,"S=",i)
    plt<-hist(data,breaks=100,yaxt="n",lab=NULL,main=title)
    abline(v=10,col='black',lwd=3)
    plt
  }
}

References:

[1] https://www.r-bloggers.com/2023/07/the-replicate-function-in-r/

[2] https://www.datamentor.io/r-programming/histogram

[3] https://www.tenderisthebyte.com/blog/2019/04/25/rotating-axis-labels-in-r/

[4] https://www.datacamp.com/tutorial/make-histogram-basic-r

[5] https://www.geeksforgeeks.org/how-to-use-par-function-in-r/

Monte Carlo Integration

f<-function(x){
  exp(-x)*sin(x)
}
values<-runif(1000000, min = 2, max = 5)
outcomes<-f(values)
meanout<-mean(outcomes)
integral<-3*meanout
print(integral)

[1] 0.03555804

#check
integrate(f,2,5)

0.03564528 with absolute error < 8.3e-16

References:

[1]https://www.dataquest.io/blog/write-functions-in-r/

[2]https://stackoverflow.com/questions/17773080/generate-a-set-of-random-unique-integers-from-an-interval

[3]https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/Uniform

[4]https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/integrate

Systematic and stochastic components

The systematic component is the linear combination in the model: $1+0.5x_{i1}-2.2x_{i2}+x_{i3}$

The stochastic component is $Y_{i}\sim {\sf N} (u=1+0.5xi1-2.2xi2+xi3, σ^2=1.5)$

References:

[1] https://www.deepchecks.com/question/what-are-the-three-components-of-a-generalized-linear-model/

[2] Ward,et.al Maximum Likelihood for social science P17-18

library(readr)
data1=read_csv('/Users/sheepqueen422/Desktop/winter quarter/MLE/pb1/xmat.csv',show_col_types = FALSE)
n <- dim(data1)[1]
n

[1] 1000

References:

[1] https://www.r-bloggers.com/2020/06/determine-the-size-of-a-data-frame/

ii.

set.seed(10825)
se=rnorm(1000,mean=0,sd=1.5)   
f2<-function(x1,x2,x3){
  1+0.5*x1-2.2*x2+x3+se
}
data1$y<-f2(data1$X1,data1$X2,data1$X3)

lm1<-lm(y~X1+X2+X3,data=data1)
summary(lm1)


Call:
lm(formula = y ~ X1 + X2 + X3, data = data1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3270 -1.0038  0.0152  1.0004  5.5911 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.08145    0.06227   17.37   <2e-16 ***
X1           0.47580    0.02358   20.18   <2e-16 ***
X2          -2.27901    0.09617  -23.70   <2e-16 ***
X3           0.93925    0.04681   20.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.5 on 996 degrees of freedom
Multiple R-squared:  0.5835,    Adjusted R-squared:  0.5823 
F-statistic: 465.2 on 3 and 996 DF,  p-value: < 2.2e-16

References:

[1]https://www.rdocumentation.org/packages/compositions/versions/2.0-8/topics/rnorm

[2] https://www.geeksforgeeks.org/how-to-add-column-to-dataframe-in-r/

[3] https://www.datacamp.com/tutorial/linear-regression-R

OLS in matrix form

a.
```
library(foreign)
data2<-read.dta("/Users/sheepqueen422/Desktop/winter quarter/MLE/pb1/coxappend.dta")
data2$log_ml<-log(data2$ml)
data2$interaction<-data2$eneth*data2$log_ml
y<-data2[['enps']]
X<-cbind(1,data.matrix(data2[,c("eneth","log_ml","interaction")]))
f3<-function(X,y){
  solve(t(X) %*% X) %*% t(X) %*% y
}
coefficients<-f3(X,y)

select<-rbind(c(1,1),c(2,2),c(3,3),c(4,4))
se<-cov(X)[select]


estimated_coeffienct<-coefficients[,1]
df<-data.frame(estimated_coeffienct,se)
df
```
```
            estimated_coeffienct        se
                       2.6713673 0.0000000
eneth                 -0.3619712 0.4652314
log_ml                -0.1911175 1.8802449
interaction            0.4833255 5.6432871
```
References:

[1]https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/attach

[2] https://libraryguides.mcgill.ca/c.php?g=699776&p=4968543

[3] Ward,et.al Maximum Likelihood for social science P14-16

[4] https://stackoverflow.com/questions/22865094/what-does-mean-in-r

[6] https://sparkbyexamples.com/r-programming/how-to-select-columns-in-r/#:~:text=Use%20df%5B%5D%20notation%20to,col2%E2%80%9D)%5D%20for%20multiple%20columns.

[7] https://zief0002.github.io/matrix-algebra/statistical-appplication-estimating-regression-coefficients.html#estimating-the-regression-coefficients

[8] https://campus.datacamp.com/courses/free-introduction-to-r/chapter-3-matrices-3?ex=8

[9] https://stackoverflow.com/questions/11993810/selecting-specific-elements-from-a-matrix-all-at-once

b.

The standard errors are small before 1, showing a good model explanatory power. However, the model shows a higher and higher standard error after 1.
```
ols<-lm(enps~eneth+log(ml)+eneth*log(ml),data=data2)
res<-rstandard(ols)
qqnorm(res)
qqline(res)
```
References:

[1] https://www.r-tutor.com/elementary-statistics/simple-linear-regression/normal-probability-plot-residuals

[2]https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/qqnorm

c.
```
linear_model<-lm(enps~eneth+log_ml+interaction,data=data2)
linear_model
```
```
Call:
lm(formula = enps ~ eneth + log_ml + interaction, data = data2)

Coefficients:
(Intercept)        eneth       log_ml  interaction  
     2.6714      -0.3620      -0.1911       0.4833  
```
[1] https://www.econometrics-with-r.org/4.2-estimating-the-coefficients-of-the-linear-regression-model.html

d.
```
par(mfrow=c(2,2))
plot(ols)
```
1. First, from the Residuals vs Fitted plot, there’s no distinctive pattern, except for 20,6,10, which means that the data meets the regression assumptions well. Second, from the Q-Q Residuals plot, most of the residuals are distributed normally, although a few observations of 20,6,10 look a little off, which may pose a potential problem. From the Scale-Location plot, most of the residuals are spread equally along the ranges of predictors except for the observations of 20,6,10 again. From Residuals vs Leverage plot, two cases 8 and 10 are on the dashed lines, meaning high ‘Cook’s distance’ scores, showing potential problematic cases with the row numbers of the cases in the dataset. The case of 10 is identified across four plots, thus, we need to take a close look at it in further research.
References:

[1] https://library.virginia.edu/data/articles/diagnostic-plots

Appendix:

I only used Chatgpt-4o in this assignment in 4a question because I got a different answer from using lm() functions. And I stuck for a long time couldn’t find the reason. Finally, I turned to Chatgpt, and found that i missed the constant intercept. It’s very helpful and saved me a lot of time.

Exchange page: https://poe.com/s/Os1IgOvkoqjSII5GwKZj