Univariate displays & sampling distributions
Let \(n\) denote the size of a simple random sample taken from a large population and let \(s\) be the number of times samples of size \(n\) are drawn.
a. Write an R program that will replicate Figure 1.2 from the Ward & Ahlquist text. Create a matrix of nine such histograms which illustrate the effect of changing \(s \in \{10, 100, 1000\}\) and \(n \in \{10, 100, 1000\}\). Put the horizontal axis on the same scale for all plots. HINT: try constructing loops using for, replicate, and sample.
# Loading Libraries
library(tidyverse)
library(viridis)     # needed for scale_fill_viridis() below
library(hrbrthemes)  # needed for theme_ipsum() below
# load your dataset
data <- read.csv("~/Desktop/Winter Q Classes/CSS 205/problem sets/Problem Set 1 Data/xmat.csv", header = TRUE, sep = ",")

# use the X1 column as the source for sampling
values <- data$X1

# parameters for sampling
n_values <- c(10, 100, 1000)
s_values <- c(10, 100, 1000)

# making a data frame of sample means for every (n, s) combination
results <- expand.grid(n = n_values, s = s_values) %>%
  mutate(sample_means = map2(n, s, ~ replicate(.y, mean(sample(values, .x, replace = TRUE))))) %>%
  unnest(cols = c(sample_means))

# ggplot
pl <- results %>%
  ggplot(aes(x = sample_means, fill = factor(n))) +
  geom_histogram(alpha = 0.6, binwidth = 0.2, color = "black") +
  scale_fill_viridis(discrete = TRUE, name = "Sample Size (n)") +
  theme_ipsum() +
  theme(
    legend.position = "right",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 10)
  ) +
  xlab("Sample Mean") +
  ylab("Frequency") +
  facet_wrap(~ paste("n =", n, "s =", s), scales = "free_y")

# adding color
pl <- results %>%
  ggplot(aes(x = sample_means, fill = factor(n))) +
  geom_histogram(alpha = 0.6, binwidth = 0.2, color = "black") +
  scale_fill_manual(values = c("yellow", "pink", "orange"), name = "Sample Size (n)") +
  theme_ipsum() +
  theme(
    legend.position = "right",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 10)
  ) +
  xlab("Sample Mean") +
  ylab("Frequency") +
  facet_wrap(~ paste("n =", n, "s =", s), scales = "free_y")

# printing the plot
print(pl)
b. What does this matrix of histograms say about the efficiency of sampling when we consider the number of groups versus the number of units within a group? What is the key assumption or attribute of the situation here that underpins this lesson?
As the sample size grows, the sample means concentrate around the true population mean: the larger the sample size, the less variable and the more dependable the estimate. Increasing the number of groups does not by itself reduce the variability of the individual sample means unless the sample within each group is large. More groups smooth out some of the random variation, but the real improvement comes from increasing the size of each group.

Because sampling is done with replacement, each sample is independent. This matters because of the Law of Large Numbers: as a sample grows, its mean becomes a more accurate estimate of the population mean. The Central Limit Theorem is also at work here: as \(n\) increases, the distribution of sample means looks increasingly normal, regardless of the shape of the original data.

Drawing more samples does help smooth out some of the randomness, but it does not increase efficiency if each sample is still small. Precision comes from larger samples, not from more groups.
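To make this concrete, here is a small sketch (not part of the original write-up) that reuses the values vector and sampling scheme from part (a) to compare how the spread of the simulated sample means responds to \(n\) versus \(s\):

# compare the spread of the simulated sample means as n and s change
set.seed(1)
spread <- expand.grid(n = c(10, 100, 1000), s = c(10, 100, 1000)) %>%
  mutate(sd_of_means = map2_dbl(
    n, s,
    ~ sd(replicate(.y, mean(sample(values, .x, replace = TRUE))))
  ))
spread
# the sd of the sample means shrinks sharply as n grows;
# holding n fixed, increasing s mainly gives a more stable picture of that sd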
Monte Carlo integration
The expected value of a function, \(f(x)\), over the interval \([a,b]\) is defined as \(\mathbb{E}[f(x)] = \frac{1}{b-a}\int_a^b f(x)\,dx\). Use this fact to calculate the approximate value of \(\int_2^5 e^{-x}\sin(x)\,dx\) using simulation methods. Check your answer using the integrate function in R.
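The simulation below rests on rearranging this identity: for \(X \sim \mathrm{Uniform}(a,b)\), \(\mathbb{E}[f(X)] = \frac{1}{b-a}\int_a^b f(x)\,dx\), so \(\int_a^b f(x)\,dx \approx (b-a)\cdot\frac{1}{N}\sum_{i=1}^{N} f(x_i)\), where the \(x_i\) are \(N\) uniform draws on \([a,b]\).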
# defining the function
fx <- function(x) exp(-x) * sin(x)

# Monte Carlo integration of fx between 2 and 5
n <- 50000; a <- 2; b <- 5
x <- runif(n, a, b)
mcfx <- mean(fx(x)) * (b - a)

# check with integrate(f, lower, upper)
intfx <- integrate(fx, a, b, subdivisions = 1000)

cat("Monte Carlo Approximation:", mcfx, "\n")
cat("integrate() result:", intfx$value, "\n")
'data.frame': 1000 obs. of 3 variables:
$ X1: num -4.82 2.576 0.333 -1.153 2.056 ...
$ X2: int 1 0 1 1 0 0 1 1 1 1 ...
$ X3: num 1.5414 1.2589 -0.0693 0.0776 -1.1992 ...
The dimensions of this matrix are 1000 x 3
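The code that produced the structure summary above is not shown; a minimal sketch of how the matrix might be read in and inspected, assuming it is the same xmat.csv file loaded in part 1:

# read in the downloaded matrix and inspect it
xmat <- read.csv("~/Desktop/Winter Q Classes/CSS 205/problem sets/Problem Set 1 Data/xmat.csv", header = TRUE)
str(xmat)
dim(xmat)  # 1000 rows, 3 columns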
c. Set the seed to 10825. Then combine the matrix you just downloaded with the linear model you described in part (a). Then display the results in a regression table.
# set seed
set.seed(10825)

# look at your data
str(xmat)
'data.frame': 1000 obs. of 3 variables:
$ X1: num -4.82 2.576 0.333 -1.153 2.056 ...
$ X2: int 1 0 1 1 0 0 1 1 1 1 ...
$ X3: num 1.5414 1.2589 -0.0693 0.0776 -1.1992 ...
# defining each of the variables from the data set
x_1 <- xmat$X1
x_2 <- xmat$X2
x_3 <- xmat$X3

coefficient_0 <- 1
coefficient_1 <- 0.5
coefficient_2 <- -2.2
coefficient_3 <- 1

# define error term, IMPORTANT
n <- nrow(xmat)
e <- rnorm(n, mean = 0, sd = 1)

y <- coefficient_0 + coefficient_1*x_1 + coefficient_2*x_2 + coefficient_3*x_3 + e

linear <- lm(y ~ x_1 + x_2 + x_3)

# display results
summary(linear)
Call:
lm(formula = y ~ x_1 + x_2 + x_3)
Residuals:
Min 1Q Median 3Q Max
-2.8847 -0.6692 0.0101 0.6669 3.7274
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.05430 0.04151 25.40 <2e-16 ***
x_1 0.48387 0.01572 30.79 <2e-16 ***
x_2 -2.25267 0.06411 -35.13 <2e-16 ***
x_3 0.95950 0.03121 30.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1 on 996 degrees of freedom
Multiple R-squared: 0.7616, Adjusted R-squared: 0.7609
F-statistic: 1061 on 3 and 996 DF, p-value: < 2.2e-16
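As a quick sanity check (not part of the original output), the estimates can be set alongside the true coefficients used to simulate y:

# compare the true coefficients used to generate y with the OLS estimates
true_betas <- c(coefficient_0, coefficient_1, coefficient_2, coefficient_3)
cbind(true = true_betas, estimated = coef(linear))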
OLS in Matrix form
a. Write an R function to perform an OLS regression using matrices and vectors. Your function should take two inputs: a vector of the dependent variable and a matrix of explanatory variables. Do not use any pre-programmed functions. You will regress the effective number of legislative parties (enps) on the number of effective ethnic groups (eneth), the log of median electoral district magnitude (ml), and the multiplicative interaction between the two. Report a table of the regression parameter estimates and their standard errors. Don't forget the constant!
b. Make a Normal quantile comparison plot of the residuals from your regression and describe what it says in fewer than three sentences.
library(haven)
library(ggplot2)

# Load the data
data <- read_dta("coxappend.dta")

# OLS regression function with residuals included
ols_regression <- function(y, X) {
  X <- cbind(1, X)
  # Calculate coefficients
  beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
  # Calculate residuals
  residuals <- y - X %*% beta_hat
  # Estimate the variance
  sigma_hat_squared <- sum(residuals^2) / (length(y) - ncol(X))
  # Calculate standard errors
  se_beta_hat <- sqrt(diag(sigma_hat_squared * solve(t(X) %*% X)))
  # Return
  return(list(coefficients = beta_hat, standard_errors = se_beta_hat, residuals = residuals))
}

# Define y and X
y <- data$enps
X <- cbind(data$eneth, log(data$ml), data$eneth * log(data$ml))

# OLS
results <- ols_regression(y, X)

# Extract coefficients, standard errors, and residuals
coefficients <- results$coefficients
standard_errors <- results$standard_errors
residuals <- results$residuals

# Create regression table
regression_table <- data.frame(
  Coefficient = coefficients,
  Std_Error = standard_errors
)

# Display regression table
print(regression_table)
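Part (b) asks for a Normal quantile comparison plot of these residuals. The plotting code is not shown above; a minimal base-R sketch that matches the red reference line described below:

# Normal quantile comparison plot of the residuals from the matrix OLS fit
qqnorm(residuals, main = "Normal Q-Q Plot of OLS Residuals")
qqline(residuals, col = "red")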
Most of the points lie along the red reference line, indicating that the bulk of the residuals are approximately normally distributed. The points at both tails deviate from the line, suggesting potential outliers in the extremes, so the distribution is approximately but not perfectly normal.
c. Perform regression using lm()
model1 <- lm(enps ~ 1 + eneth + log(ml) + eneth*log(ml), data = data)
summary(model1)
Call:
lm(formula = enps ~ 1 + eneth + log(ml) + eneth * log(ml), data = data)
Residuals:
Min 1Q Median 3Q Max
-1.8627 -0.6818 -0.2346 0.2605 3.8235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6714 0.6072 4.399 5.69e-05 ***
eneth -0.3620 0.3486 -1.038 0.304
log(ml) -0.1911 0.2967 -0.644 0.522
eneth:log(ml) 0.4833 0.1805 2.678 0.010 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.181 on 50 degrees of freedom
Multiple R-squared: 0.3629, Adjusted R-squared: 0.3247
F-statistic: 9.493 on 3 and 50 DF, p-value: 4.541e-05
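As a sanity check (a sketch, assuming the objects from part (a) are still in memory), the matrix-based estimates can be compared with lm()'s coefficients:

# the hand-rolled matrix OLS estimates should match lm()'s coefficients
cbind(matrix_ols = as.vector(coefficients), lm_fit = coef(model1))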
d. Report the Residual-v.-fitted, Scale-location, and Residual-v.-leverage diagnostic plots for this regression.
plot(model1)
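plot(model1) cycles through four diagnostic plots, including the Normal Q-Q plot; to display only the three panels requested in part (d), the which argument can be used (a sketch):

# restrict the diagnostics to residuals-vs-fitted, scale-location, and residuals-vs-leverage
plot(model1, which = c(1, 3, 5))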
e. Based on these results, is this OLS regression an appropriate model for these data? Provide a paragraph stating why or why not.
Looking at my model summary from the regression table I produced using the lm() function, here are some interpretations:
With an F-statistic of 9.49 (p < 0.001), the model is statistically significant overall.
However, among the individual coefficients only the interaction term eneth:log(ml) is statistically significant (p = 0.010); eneth and log(ml) are not significant on their own.
For the Normal Q-Q plot:
The majority of the middle residuals lie along the indicated red line.
However, a number of residuals in the upper tail pull away from the line, and there are a few outliers.
For the Residuals vs. Fitted plot:
There is no clear curvature; the points scatter fairly evenly around the red line.
Based on these results, is OLS regression an appropriate model for these data?
From my interpretation, I do not think the OLS regression I ran is a fully appropriate model for the coxappend data. The Normal Q-Q plot shows a clear departure from the red line in the upper right, indicating skewed residuals. Although the F-statistic and the interaction term are statistically significant, the diagnostic plots suggest we should be cautious about treating this as a good fit.
Use of LLMs:
I used the ChatGPT 4.0 mini LLM/AI tool in this assignment. Below I explain where I used the tool and whether or not I found it helpful.
#1: When loading the necessary libraries for #1, I kept getting errors in my code for library(hrbrthemes). I repeatedly reinstalled the package, and my computer was taking a few hours to do so. I worked with ChatGPT to figure out the root of the issue, and after some trial and error the tool helped me debug it. Here is the full exchange I had with the chat-bot:
#4 for part A and part B: I had the code for both the OLS regression table and the normal Q-Q plot, but when I went to render my code, it said it was quitting from the lines containing the Q-Q plot code. It had previously worked, so I was confused about why it stopped. I pasted my code into ChatGPT 4.0 mini to help me debug the issue. The tool explained that I had not defined my residuals correctly and returned my previous code with a small modification. I found it helpful, as it clearly identified my issue and helped me debug immediately. Here is the full exchange I had with the chat-bot: