x <-5:15n <-c(10, 100, 1000)s <-c(10, 100, 1000)par(mfrow =c(3, 3))for (i in1:length(n)) {for (j in1:length(s)) {hist((replicate(n[i], mean(sample(x, size = n[j], replace =TRUE)))), main =paste(n[i], "samples of", s[j], "numbers"), xlim =c(9, 11), xlab ="Mean of Samples") }}
Part (b):
This matrix of histograms highlights the law of large numbers which states that as the number of random samples increases, the average of the results gets closer to the true mean. The key attribute of this matrix which underpins this lesson is that the graphs in the bottom row are all representing normal distributions, with means close to 10 for numbers 5 to 15. It is also helpful to note that the top row representing 10 samples of size n are quite far from a normal distribution, while the middle row becomes slightly more normal, especially for 100 samples of 100 and 100 samples of 1000 numbers. Ultimately, the results of this matrix highlight that the larger number of random samples you use, the closer you will be to a true mean.
Question 2: Monte Carlo integration
# watched the following for a tutorial on Monte Carlo integration: https://youtu.be/8xo4Lx9fiRc?si=URPHpbm3Y7KLSMTgf <-function(x)exp(-x) *sin(x)# simulationarea_estimates <-vector(length =10000) #the vector of area estimates for various number of points used 1-10000for (i in1:10000) { points <-runif(n = i, min =2, max =5) #the random points within the bounds area_estimates[i] <-3*mean(f(points)) #the area estimate using the random points}plot(area_estimates) # plot converges towards 0.035
mean(area_estimates) # mean represents this
[1] 0.03563196
integrate(f, 2, 5) # exact value = .03564528 with absolute error < 8.3e-16
This component is essentially the error term which represents sampling or measuring errors which may cause a deviance. It states that the mean is 0 and the standard deviation is 1.5.
This component describes the model’s parameters (betas).
Part (b):
X <-read_csv('/Users/sarahamidi/Downloads/xmat.csv') # download the csv
Rows: 1000 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): X1, X2, X3
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(X) # dimensions of the X matrix are 1000 rows x 3 cols
[1] 1000 3
n =nrow(X) # 1000 rows
Part (c):
set.seed(10825)# Systematic and stochastic components implementing the columns from matrix X and the linear model givensystematic <-1+0.5* X$X1 -2.2* X$X2 + X$X3stochastic <-rnorm(n, mean =0, sd =sqrt(1.5)) # we want to generate n (1000) y valuesy <- systematic + stochastic # combining X with the model to generate y valuesX_y <-data.frame(y = y) X$y <- y # add the simulated y values into the data frame for the regressionmod <-lm(y ~ X1 + X2 + X3 +1, data = X) # run the regressionsummary(mod) # print the regression table
Call:
lm(formula = y ~ X1 + X2 + X3 + 1, data = X)
Residuals:
Min 1Q Median 3Q Max
-3.5330 -0.8196 0.0124 0.8168 4.5651
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.06651 0.05084 20.98 <2e-16 ***
X1 0.48024 0.01925 24.95 <2e-16 ***
X2 -2.26451 0.07852 -28.84 <2e-16 ***
X3 0.95040 0.03822 24.86 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.225 on 996 degrees of freedom
Multiple R-squared: 0.6792, Adjusted R-squared: 0.6782
F-statistic: 702.9 on 3 and 996 DF, p-value: < 2.2e-16
Question 4: OLS in matrix form
Part (a):
# watched the following for a tutorial on using matrices to do linear regression: https://youtu.be/FFX9WyxEa7o?si=cWjNr80NIYYPS6xRdat2 <-read_dta('/Users/sarahamidi/Downloads/coxappend.dta')ols_reg <-function(dep, ind){ y <-as.matrix(dep) # make sure that it is a matrix (should be 1 column (vector)) x <-as.matrix(ind) # make sure it is a matrixcbind(y, x) # combining dependent variable (outcome) with the design matrix (matrix of explanatory variables) xtx <-t(x) %*% x xtx_inv <-solve(xtx) beta <-solve(t(x) %*% x) %*%t(x) %*% y # formula for coefficient vector -- beta values residuals =sum((y - x%*%beta)^2)/(nrow(x) -ncol(x)) # mean squared error vcv_mat <- residuals *solve(xtx) # variance - covariance matrix std_err <-sqrt(diag(vcv_mat)) # sq root the diagonal to pull out standard errors result <-as.data.frame(cbind(beta, std_err)) # table of resultsnames(result)[1:2] <-c("Parameter estimates", "Std. Error")print(result)return(beta) # to get the y_hat values later}outcome <- dat2$enps # vector of dependent variable -- effective number of legislative parties (enps) explanatory <-cbind(constant =1, dat2$eneth, log(dat2$ml), dat2$eneth*log(dat2$ml)) # matrix of explanatory variables -- effective ethnic groups (eneth), log of median district magnitude electoral district magnitude(ml), and eneth * log (ml)beta_hat <-ols_reg(outcome, explanatory) # call the function using the data from coxappend.dta
y_hat <- explanatory %*% beta_hat # multiply the parameter estimates by the x matrix to get yhatresiduals <- outcome - y_hat # get the residuals from the difference between y and yhatqqnorm(residuals) # normal quantile comparison plot
In the upper part of the plot, it is evident that there is a slight deviance from the linear trend which would represent a Normal distribution. Because of this, we can infer that there is a slight right skew, due to the variance and larger values which would not appear if the residuals were Normal.
Part (c):
# using the lm() function to run the same OLS regressionmod1 <-lm(enps ~ eneth +log(dat2$ml) + eneth:log(dat2$ml), data = dat2)summary(mod1) # the results match
Call:
lm(formula = enps ~ eneth + log(dat2$ml) + eneth:log(dat2$ml),
data = dat2)
Residuals:
Min 1Q Median 3Q Max
-1.8627 -0.6818 -0.2346 0.2605 3.8235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6714 0.6072 4.399 5.69e-05 ***
eneth -0.3620 0.3486 -1.038 0.304
log(dat2$ml) -0.1911 0.2967 -0.644 0.522
eneth:log(dat2$ml) 0.4833 0.1805 2.678 0.010 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.181 on 50 degrees of freedom
Multiple R-squared: 0.3629, Adjusted R-squared: 0.3247
F-statistic: 9.493 on 3 and 50 DF, p-value: 4.541e-05
Part (d):
plot(mod1, which =1) # residual v fitted
plot(mod1, which =3) # scale-location
plot(mod1, which =5) # residual v leverage
Part (e):
Based on these results, it is evident that the OLS regression may not be the most appropriate model for these data. The Residual-v.-fitted plot reveals a distinctive downward trend in the residuals, indicating that there is a non-linear relationship present. You can also see a few residuals that stand out way above the 0 line, which suggests the existence of some outliers. Next, the Scale-location plot reveals that the assumption of homoscedasticity is not satisfied as there is a clear upward trend in the spread of residuals. Finally, looking at the Residual-v.-leverage plot, we can see that there is an outlying value at the lower right corner of the plot and another one bordering the “Cook’s distance” dashed line on the upper middle area of the plot. Because these two cases are on or outside of the dashed line, they may be influential to the regression results. Ultimately, based on these patterns and insights from the diagnostic plots, it may be better to try different models for this data, to improve results and address non-linearity, the influence of outliers, and homoscedasticity.
Appendix
I certify that I did not use any LLM or generative AI tool in this assignment. Other sources that were used to aid in this assignment have been attributed as comments for the respective questions.