s=c(10,100,1000)
n=c(10,100,1000)
par(mfrow=c(3,3))
for (i in s){
for (k in n){
data <- replicate(i, rnorm(k, 10, 1))
#popsample<-sample(data,size=k,replace=TRUE)
title<-paste("Histogram for N=",k,"S=",i)
plt<-hist(data,breaks=100,yaxt="n",lab=NULL,main=title)
abline(v=10,col='black',lwd=3)
plt
}
}Pb1
- Univariate displays & sampling distributions
a.
References:
[1] https://www.r-bloggers.com/2023/07/the-replicate-function-in-r/
[2] https://www.datamentor.io/r-programming/histogram
[3] https://www.tenderisthebyte.com/blog/2019/04/25/rotating-axis-labels-in-r/
[4] https://www.datacamp.com/tutorial/make-histogram-basic-r
[5] https://www.geeksforgeeks.org/how-to-use-par-function-in-r/
- Monte Carlo Integration
f<-function(x){
exp(-x)*sin(x)
}
values<-runif(1000000, min = 2, max = 5)
outcomes<-f(values)
meanout<-mean(outcomes)
integral<-3*meanout
print(integral)[1] 0.03555804
#check
integrate(f,2,5)0.03564528 with absolute error < 8.3e-16
References:
[1]https://www.dataquest.io/blog/write-functions-in-r/
[3]https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/Uniform
[4]https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/integrate
- Systematic and stochastic components
a.
The systematic component is the linear combination in the model: \(1+0.5x_{i1}-2.2x_{i2}+x_{i3}\)
The stochastic component is \(Y_{i}\sim {\sf N} (u=1+0.5xi1-2.2xi2+xi3, σ^2=1.5)\)
References:
[1] https://www.deepchecks.com/question/what-are-the-three-components-of-a-generalized-linear-model/
[2] Ward,et.al Maximum Likelihood for social science P17-18
b.
i.
library(readr)
data1=read_csv('/Users/sheepqueen422/Desktop/winter quarter/MLE/pb1/xmat.csv',show_col_types = FALSE)
n <- dim(data1)[1]
n[1] 1000
References:
[1] https://www.r-bloggers.com/2020/06/determine-the-size-of-a-data-frame/
ii.
set.seed(10825)
se=rnorm(1000,mean=0,sd=1.5)
f2<-function(x1,x2,x3){
1+0.5*x1-2.2*x2+x3+se
}
data1$y<-f2(data1$X1,data1$X2,data1$X3)
lm1<-lm(y~X1+X2+X3,data=data1)
summary(lm1)
Call:
lm(formula = y ~ X1 + X2 + X3, data = data1)
Residuals:
Min 1Q Median 3Q Max
-4.3270 -1.0038 0.0152 1.0004 5.5911
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.08145 0.06227 17.37 <2e-16 ***
X1 0.47580 0.02358 20.18 <2e-16 ***
X2 -2.27901 0.09617 -23.70 <2e-16 ***
X3 0.93925 0.04681 20.06 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.5 on 996 degrees of freedom
Multiple R-squared: 0.5835, Adjusted R-squared: 0.5823
F-statistic: 465.2 on 3 and 996 DF, p-value: < 2.2e-16
References:
[1]https://www.rdocumentation.org/packages/compositions/versions/2.0-8/topics/rnorm
[2] https://www.geeksforgeeks.org/how-to-add-column-to-dataframe-in-r/
[3] https://www.datacamp.com/tutorial/linear-regression-R
OLS in matrix form
a.
library(foreign) data2<-read.dta("/Users/sheepqueen422/Desktop/winter quarter/MLE/pb1/coxappend.dta") data2$log_ml<-log(data2$ml) data2$interaction<-data2$eneth*data2$log_ml y<-data2[['enps']] X<-cbind(1,data.matrix(data2[,c("eneth","log_ml","interaction")])) f3<-function(X,y){ solve(t(X) %*% X) %*% t(X) %*% y } coefficients<-f3(X,y) select<-rbind(c(1,1),c(2,2),c(3,3),c(4,4)) se<-cov(X)[select] estimated_coeffienct<-coefficients[,1] df<-data.frame(estimated_coeffienct,se) dfestimated_coeffienct se 2.6713673 0.0000000 eneth -0.3619712 0.4652314 log_ml -0.1911175 1.8802449 interaction 0.4833255 5.6432871References:
[1]https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/attach
[2] https://libraryguides.mcgill.ca/c.php?g=699776&p=4968543
[3] Ward,et.al Maximum Likelihood for social science P14-16
[4] https://stackoverflow.com/questions/22865094/what-does-mean-in-r
[8] https://campus.datacamp.com/courses/free-introduction-to-r/chapter-3-matrices-3?ex=8
[9] https://stackoverflow.com/questions/11993810/selecting-specific-elements-from-a-matrix-all-at-once
b.
The standard errors are small before 1, showing a good model explanatory power. However, the model shows a higher and higher standard error after 1.
ols<-lm(enps~eneth+log(ml)+eneth*log(ml),data=data2) res<-rstandard(ols) qqnorm(res) qqline(res)References:
[2]https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/qqnorm
c.
linear_model<-lm(enps~eneth+log_ml+interaction,data=data2) linear_modelCall: lm(formula = enps ~ eneth + log_ml + interaction, data = data2) Coefficients: (Intercept) eneth log_ml interaction 2.6714 -0.3620 -0.1911 0.4833d.
par(mfrow=c(2,2)) plot(ols)- First, from the Residuals vs Fitted plot, there’s no distinctive pattern, except for 20,6,10, which means that the data meets the regression assumptions well. Second, from the Q-Q Residuals plot, most of the residuals are distributed normally, although a few observations of 20,6,10 look a little off, which may pose a potential problem. From the Scale-Location plot, most of the residuals are spread equally along the ranges of predictors except for the observations of 20,6,10 again. From Residuals vs Leverage plot, two cases 8 and 10 are on the dashed lines, meaning high ‘Cook’s distance’ scores, showing potential problematic cases with the row numbers of the cases in the dataset. The case of 10 is identified across four plots, thus, we need to take a close look at it in further research.
References:
[1] https://library.virginia.edu/data/articles/diagnostic-plots
Appendix:
I only used Chatgpt-4o in this assignment in 4a question because I got a different answer from using lm() functions. And I stuck for a long time couldn’t find the reason. Finally, I turned to Chatgpt, and found that i missed the constant intercept. It’s very helpful and saved me a lot of time.
Exchange page: https://poe.com/s/Os1IgOvkoqjSII5GwKZj