Problem Set 1

Author

Caroline Porche

Problem Set #1

  1. Univariate displays & sampling distributions Let \(n\) denote the size of a simple random sample taken from a large population and let \(s\) be the number of times samples of size \(n\) are drawn.

a. Write an R program that will replicate Figure 1.2 from the Ward & Ahlquist text. Create a matrix of nine such histograms which illustrate the effect of changing \(s \in \{10, 100, 1000\}\) and \(n \in \{10, 100, 1000\}\). Put the horizontal axis on the same scale for all plots. HINT: try constructing loops using for, replicate, and sample.

# Loading Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(viridis)
Loading required package: viridisLite
library(forcats)
library(dplyr)
library(conflicted)
library(hrbrthemes)
# load your dataset
data <- read.csv("~/Desktop/Winter Q Classes/CSS 205/problem sets/Problem Set 1 Data/xmat.csv", header = TRUE, sep = ",")

# use the X1 column as the source for sampling
values <- data$X1

# parameters for sampling
n_values <- c(10, 100, 1000)  
s_values <- c(10, 100, 1000)  


# making a data frame
results <- expand.grid(n = n_values, s = s_values) %>%
  mutate(sample_means = map2(n, s, ~ replicate(.y, mean(sample(values, .x, replace = TRUE))))) %>%
  unnest(cols = c(sample_means))

# first version of the faceted histogram, using the viridis fill scale
pl <- results %>%
  ggplot(aes(x = sample_means, fill = factor(n))) +
  geom_histogram(alpha = 0.6, binwidth = 0.2, color = "black") +
  scale_fill_viridis(discrete = TRUE, name = "Sample Size (n)") +
  theme_ipsum() +
  theme(
    legend.position = "right",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 10)
  ) +
  xlab("Sample Mean") +
  ylab("Frequency") +
  facet_wrap(~ paste("n =", n, "s =", s), scales = "free_y")

# final version: the same plot with manual fill colors (this is the version printed below)
pl <- results %>%
  ggplot(aes(x = sample_means, fill = factor(n))) +
  geom_histogram(alpha = 0.6, binwidth = 0.2, color = "black") +  
  scale_fill_manual(
    values = c("yellow", "pink", "orange"), 
    name = "Sample Size (n)"
  ) +
  theme_ipsum() +
  theme(
    legend.position = "right",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 10)
  ) +
  xlab("Sample Mean") +
  ylab("Frequency") +
  facet_wrap(~ paste("n =", n, "s =", s), scales = "free_y")


# printing the plot
print(pl)

b. What does this matrix of histograms say about the efficiency of sampling when we consider the number of groups versus the number of units within a group? What is the key assumption or attribute of the situation here that underpins this lesson?

  • As the sample size n grows, the sample means concentrate more tightly around the true population mean: the larger each sample, the less variable and more reliable the estimate. Increasing the number of samples s does little to reduce the variability of the individual sample means unless n is also large. More draws smooth out some random variation, but the real gain in precision comes from increasing the size of each sample (see the short simulation sketch after this list).

  • Sampling here is done with replacement, so each draw is independent. This matters because of the Law of Large Numbers: as a sample grows, its mean converges to the population mean, so the estimate becomes more accurate. The Central Limit Theorem is also visible at work: as n increases, the distribution of the sample means looks increasingly normal, regardless of the shape of the original data.

  • Taking more samples does help smooth out some randomness, but it does not improve efficiency much if each sample remains small. Precision comes from larger samples, not merely from more of them.
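As a quick illustration of this point, here is a minimal sketch (reusing the values vector loaded above) that prints the standard deviation of the sample means for each combination of n and s; the spread should shrink as n grows but change little with s.

# compare the spread of the sample means across combinations of n and s
set.seed(1)
for (n in c(10, 100, 1000)) {
  for (s in c(10, 100, 1000)) {
    sample_means <- replicate(s, mean(sample(values, n, replace = TRUE)))
    cat("n =", n, " s =", s, " sd of sample means =", round(sd(sample_means), 3), "\n")
  }
}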

  2. Monte Carlo integration The expected value of a function, \(f(x)\), over the interval \([a,b]\) is defined as \(\mathbb{E}[f(x)] = \frac{1}{b-a}\int_a^b f(x)dx\). Use this fact to calculate the approximate value of \(\int_2^5 e^{-x}\sin(x)\,dx\) using simulation methods. Check your answer using the integrate function in R.
#defining the function
fx = function(x) exp(-x)* sin(x)
# Monte Carlo integration of fx between 2 and 5
n = 50000 ; a = 2 ; b = 5

x = runif(n,a,b)
mcfx = mean(fx(x)) * (b-a)

#integrate(f, lower,upper)
intfx = integrate(fx,a,b,subdivisions = 1000)

cat("Monte Carlo Approximation:", mcfx, "\n")
Monte Carlo Approximation: 0.03544157 
cat("Exact Integration Result:", intfx$value, "\n")
Exact Integration Result: 0.03564528 
cat("Monte Carlo Error:", abs(mcfx - intfx$value), "\n")
Monte Carlo Error: 0.0002037107 
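A rough sense of the simulation error can also be computed from the same draws; this is a minimal sketch reusing fx, x, a, b, and n from above.

# approximate Monte Carlo standard error of the estimate: (b - a) * sd(f(X)) / sqrt(n)
mc_se <- (b - a) * sd(fx(x)) / sqrt(n)
cat("Approximate Monte Carlo standard error:", mc_se, "\n")

The absolute error reported above should typically fall within a few multiples of this standard error.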
  3. Systematic and stochastic components. Suppose someone reports to you the following linear model, which we will then use to generate data:

    \[\begin{eqnarray*} y_i &=& 1 + 0.5x_{i1} - 2.2x_{i2} + x_{i3} + \epsilon_i\\ \epsilon_i &\sim& N(\mu=0, \sigma^2 = 1.5) \end{eqnarray*}\]

    a. Rephrase this model in terms of its systematic and stochastic components.

    \[\begin{eqnarray*} systematic: \mu_i &=& 1 + 0.5x_{i1} - 2.2x_{i2} + x_{i3} \\ stochastic: Y_i &\sim& N(\mu_i, \sigma^2 = 1.5) \end{eqnarray*}\]

    b. Download xmat.csv and load into R. What are the dimensions of the matrix you just downloaded? Call the number of rows n.

    #loading the data and checking if it's valid
    xmat <- read.csv("~/Desktop/Winter Q Classes/CSS 205/problem sets/Problem Set 1 Data/xmat.csv")
    head(xmat)
              X1 X2          X3
    1 -4.8200977  1  1.54137265
    2  2.5755430  0  1.25892647
    3  0.3326820  1 -0.06933333
    4 -1.1534374  1  0.07761559
    5  2.0563184  0 -1.19921600
    6  0.1335086  0  0.25054654
    # Checking the structure of the data
    str(xmat)
    'data.frame':   1000 obs. of  3 variables:
     $ X1: num  -4.82 2.576 0.333 -1.153 2.056 ...
     $ X2: int  1 0 1 1 0 0 1 1 1 1 ...
     $ X3: num  1.5414 1.2589 -0.0693 0.0776 -1.1992 ...

The dimensions of this matrix are 1000 x 3, so n = 1000.

c. Set the seed to 10825. Then combine the matrix you just downloaded with the linear model you described in part (a). Then display the results in a regression table.

# set seed
set.seed(10825)

# defining each of the variables from the data set 
x_1 <- xmat$X1
x_2 <- xmat$X2
x_3 <- xmat$X3

coefficient_0 <- 1
coefficient_1 <- 0.5
coefficient_2 <- -2.2
coefficient_3 <- 1

# define the error term; the model specifies sigma^2 = 1.5, so sd = sqrt(1.5)
n <- nrow(xmat) 
e <- rnorm(n, mean = 0, sd = sqrt(1.5)) 

y <- coefficient_0 + coefficient_1*x_1 + coefficient_2*x_2 + coefficient_3*x_3 + e

linear <- lm(y ~ x_1 + x_2 + x_3)

# display results
summary(linear)

Call:
lm(formula = y ~ x_1 + x_2 + x_3)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8847 -0.6692  0.0101  0.6669  3.7274 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.05430    0.04151   25.40   <2e-16 ***
x_1          0.48387    0.01572   30.79   <2e-16 ***
x_2         -2.25267    0.06411  -35.13   <2e-16 ***
x_3          0.95950    0.03121   30.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 996 degrees of freedom
Multiple R-squared:  0.7616,    Adjusted R-squared:  0.7609 
F-statistic:  1061 on 3 and 996 DF,  p-value: < 2.2e-16
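As a quick sanity check (a sketch reusing the objects defined above), the fitted coefficients can be compared directly with the parameters used to generate y:

# compare the OLS estimates with the true data-generating parameters
true_beta <- c(coefficient_0, coefficient_1, coefficient_2, coefficient_3)
cbind(true = true_beta, estimated = coef(linear))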
  4. OLS in matrix form

    a. Write an R function to perform an OLS regression using matrices and vectors. Your function should take two inputs: a vector of the dependent variable and a matrix of explanatory variables. Do not use any pre-programmed functions. You will regress the effective number of legislative parties (enps) on the number of effective ethnic groups (eneth), the log of median electoral district magnitude (ml), and the multiplicative interaction between the two. Report a table of the regression parameter estimates and their standard errors. Don't forget the constant!

    b. Make a Normal quantile comparison plot of the residuals from your regression and describe what it says in fewer than three sentences.
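    The function below implements the standard closed-form OLS estimator, \(\hat{\beta} = (X^\top X)^{-1}X^\top y\), with standard errors taken from \(\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X^\top X)^{-1}\), where \(\hat{\sigma}^2 = e^\top e/(n-k)\), \(e\) is the residual vector, and \(k\) counts the columns of \(X\) including the constant.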

    library(haven)
    library(ggplot2)
    
    # Load the data
    data = read_dta("coxappend.dta")
    
    # OLS regression function with residuals included
    ols_regression <- function(y, X) {
    
      X <- cbind(1, X)  
    
      # Calculate coefficients
      beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
    
      # Calculate residuals
      residuals <- y - X %*% beta_hat
    
      # Estimate the variance
      sigma_hat_squared <- sum(residuals^2) / (length(y) - ncol(X))
    
      # Calculate standard errors
      se_beta_hat <- sqrt(diag(sigma_hat_squared * solve(t(X) %*% X)))
    
      # Return
      return(list(coefficients = beta_hat, standard_errors = se_beta_hat, residuals = residuals))
    }
    
    # Define y and X
    y <- data$enps
    X <- cbind(data$eneth, log(data$ml), data$eneth * log(data$ml))
    
    # OLS
    results <- ols_regression(y, X)
    
    # Extract coefficients, standard errors, and residuals
    coefficients <- results$coefficients
    standard_errors <- results$standard_errors
    residuals <- results$residuals  
    
    # Create regression table
    regression_table <- data.frame(
      Coefficient = coefficients,
      Std_Error = standard_errors
    )
    
    # Display regression table
    print(regression_table)
      Coefficient Std_Error
    1   2.6713673 0.6072149
    2  -0.3619712 0.3486305
    3  -0.1911175 0.2967357
    4   0.4833255 0.1805094
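    # note: rows 1-4 of the table above are, in order, the constant, eneth,
    # log(ml), and the eneth x log(ml) interaction (the order in which the
    # columns were bound into X)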
    # base R normal Q-Q plot of the residuals
    qqnorm(residuals)
    qqline(residuals, col = "red")  

    # ggplot2
    ggplot(data = data.frame(residuals = residuals), aes(sample = residuals)) + 
      geom_qq_line() + 
      geom_qq() + 
      labs(title = "Normal Quantile Comparison of Residuals")

Description/Interpretation of the Q-Q Plot:

  • Most of the points fall along the red reference line, indicating that the residuals are approximately normally distributed in the middle of the distribution.

  • At both tails the points deviate from the line, suggesting potential outliers in the extremes.

  • Taken together, the plot indicates an approximately normal distribution of residuals, though not a perfect one.

c. Perform the regression using lm()

model1 <- lm(enps ~ 1 + eneth + log(ml) + eneth*log(ml), data = data)
summary(model1)

Call:
lm(formula = enps ~ 1 + eneth + log(ml) + eneth * log(ml), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8627 -0.6818 -0.2346  0.2605  3.8235 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.6714     0.6072   4.399 5.69e-05 ***
eneth          -0.3620     0.3486  -1.038    0.304    
log(ml)        -0.1911     0.2967  -0.644    0.522    
eneth:log(ml)   0.4833     0.1805   2.678    0.010 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.181 on 50 degrees of freedom
Multiple R-squared:  0.3629,    Adjusted R-squared:  0.3247 
F-statistic: 9.493 on 3 and 50 DF,  p-value: 4.541e-05
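The coefficient estimates and standard errors from lm() match the matrix-based results in part (a) (for example, the intercept 2.6714 with SE 0.6072 and the interaction 0.4833 with SE 0.1805), which confirms the hand-coded OLS function.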

d. Report the Residual-v.-fitted, Scale-location, and Residual-v.-leverage diagnostic plots for this regression.

plot(model1, which = c(1, 3, 5))  # 1 = residuals vs fitted, 3 = scale-location, 5 = residuals vs leverage

e. Based on these results, is this OLS regression an appropriate model for these data? Provide a paragraph stating why or why not.

Looking at my model summary from the regression table I produced using the lm() function, here are some interpretations:

  • The F-statistic of 9.49 (p < 0.001) shows the model is statistically significant overall
  • However, among the individual coefficients only the interaction term eneth:log(ml) is statistically significant (p = 0.010); eneth and log(ml) on their own are not

For the Normal Q-Q plot:

  • Most of the middle residuals lie along the red reference line.

  • However, there are still several residuals that deviate in the upper tail of the plot, along with a few outliers.

For the Residuals v. Fitted plot:

  • The residuals show no clear curvature and scatter fairly evenly around the horizontal reference line.

Based on these results, is the OLS regression an appropriate model for these data?

  • From my interpretation, I do not think the OLS regression I ran is a fully appropriate model for the coxappend data. The Normal Q-Q plot shows clear deviation from the reference line in the upper tail. Although the F-statistic indicates the model is statistically significant overall, the diagnostic plots suggest we should be cautious about whether this is a good fit.

Use of LLM’s:

I used the ChatGPT 4.0 mini LLM/AI tool in this assignment. Below I explain where I used the tool and whether or not I found it helpful.

  • #1: When loading the necessary libraries for #1, I kept getting errors in my code for library(hrbrthemes). I repeatedly reinstalled the package, which took my computer a few hours. I worked with ChatGPT to figure out the root of the issue, and after some trial and error the tool helped me debug it. Here is the full exchange I had with the chat-bot:

    https://chatgpt.com/share/67855838-129c-8010-b003-3320cb7ad335

  • #4 for part A and part B: I had the code for both the OLS regression table and the normal Q-Q plot; however, when I went to render my code, it quit at the lines containing the Q-Q plot code. It had previously worked, so I was confused about why it stopped. I pasted my code into ChatGPT 4.0 mini to help debug the issue. The tool explained that I had not defined my residuals correctly and returned my earlier code with a small change. I found it helpful, as it clearly identified my issue and helped me debug immediately. Here is the full exchange I had with the chat-bot:

    https://chatgpt.com/share/67854fe6-9fe4-8010-b19b-2c3ecfc493d1