1.) Univariate displays & sampling distributions

Let $n$ denote the size of a simple random sample taken from a large population and let $s$ be the number of times samples of size $n$ are drawn.

A.) Write an R program that will replicate Figure 1.2 from the Ward & Ahlquist text. Create a matrix of nine such histograms which illustrate the effect of changing $s \in \{10, 100, 1000\}$ and $n \in \{10, 100, 1000\}$. Put the horizontal axis on the same scale for all plots. HINT: try constructing loops using for, replicate, and sample.

# Load packages: tidyverse for read_csv/dplyr/purrr/tidyr/ggplot2, hrbrthemes for theme_ipsum()
library(tidyverse)
library(hrbrthemes)

# Load your dataset
data <- read_csv("/Users/belyndaherrera/Downloads/week_1_data_css_stats.csv")
## Rows: 1000 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Use the X1 column as the source for sampling
values <- data$X1

# Parameters for sampling
n_values <- c(10, 100, 1000)  # Sample sizes
s_values <- c(10, 100, 1000)  # Number of samples

# Create a data frame to store results
results <- expand.grid(n = n_values, s = s_values) %>%
  mutate(sample_means = map2(n, s, ~ replicate(.y, mean(sample(values, .x, replace = TRUE))))) %>%
  unnest(cols = c(sample_means))

# Create a faceted ggplot
pl <- results %>%
  ggplot(aes(x = sample_means, fill = factor(n))) +
  geom_histogram(alpha = 0.6, binwidth = 0.2, color = "black") +  # Black border for clarity
  scale_fill_manual(
    values = c("red", "blue", "green"),  # Specify colors for the fill
    name = "Sample Size (n)"
  ) +
  theme_ipsum() +
  theme(
    legend.position = "right",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 10)
  ) +
  xlab("Sample Mean") +
  ylab("Frequency") +
  facet_wrap(~ paste("n =", n, "s =", s), scales = "free_y")
# Print the plot
print(pl)
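
For reference, here is a base-R sketch closer to the hinted for/replicate/sample pattern (reusing values, n_values, and s_values from above; the layout and binning choices are illustrative, and xlim fixes a common horizontal scale across all nine panels):

op <- par(mfrow = c(3, 3))
for (s in s_values) {
  for (n in n_values) {
    # Draw s samples of size n and record each sample mean
    means <- replicate(s, mean(sample(values, n, replace = TRUE)))
    hist(means, xlim = range(values), main = paste0("n = ", n, ", s = ", s),
         xlab = "Sample mean")
  }
}
par(op)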

B.) What does this matrix of histograms say about the efficiency of sampling when we consider the number of groups versus the number of units within a group? What is the key assumption or attribute of the situation here that underpins this lesson?

The matrix of histograms shows that increasing the number of units within each group (n) is what narrows the sampling distribution of the mean: the spread shrinks roughly as $1/\sqrt{n}$. Increasing the number of groups (s) does not narrow the distribution; it only fills in the histogram so that its shape is rendered more clearly. In terms of efficiency, additional units per group buy precision for each estimate, while additional groups buy resolution on the sampling distribution itself. The key attribute underpinning this lesson is that every group is a simple random sample from the same large population, so the draws are independent and identically distributed rather than clustered.
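
A quick numerical check of this intuition (a sketch reusing values and n_values from part A): for a fixed number of draws, the spread of the sample means shrinks roughly as sd(values)/sqrt(n).

set.seed(1)
for (n in n_values) {
  # 1000 sample means at each sample size n
  means <- replicate(1000, mean(sample(values, n, replace = TRUE)))
  cat(sprintf("n = %4d: sd(sample means) = %.4f, sd(values)/sqrt(n) = %.4f\n",
              n, sd(means), sd(values) / sqrt(n)))
}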

2.) Monte Carlo integration

The expected value of a function $f(x)$ over the interval $[a,b]$ is defined as $E[f(x)] = \frac{1}{b-a}\int_a^b f(x)\,dx$. Use this fact to calculate the approximate value of $\int_2^5 e^{-x}\sin(x)\,dx$ using simulation methods. Check your answer using the integrate function in R.

# Define Function
fx = function(x) exp(-x)* sin(x)

# Integrate MC
n = 50000 ; a = 2 ; b = 5
# Generate Random Samples
x = runif(n,a,b)
# Approximate
mcfx = mean(fx(x)) * (b-a)

intfx = integrate(fx,a,b,subdivisions = 1000)

cat("Monte Carlo Approximation:", mcfx, "\n")
## Monte Carlo Approximation: 0.03598635
cat("Exact Integration Result:", intfx$value, "\n")
## Exact Integration Result: 0.03564528
cat("Monte Carlo Error:", abs(mcfx - intfx$value), "\n")
## Monte Carlo Error: 0.000341077
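
The precision of the estimate can be gauged from the same draws: the standard error of the Monte Carlo estimate is $(b-a)\,\mathrm{sd}(f(X))/\sqrt{n}$. A minimal sketch reusing the objects above:

# Monte Carlo standard error of the estimate
se_mc <- (b - a) * sd(fx(x)) / sqrt(n)
cat("Estimated MC standard error:", se_mc, "\n")  # roughly 5e-4 with n = 50000

The observed error of about 3.4e-4 falls within one standard error, as expected.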

3.) Systematic and stochastic components

Suppose someone reports to you the following linear model, which we will then use to generate data: $y_i = 1 + 0.5x_{i1} - 2.2x_{i2} + x_{i3} + \epsilon_i$, where $\epsilon_i \sim N(\mu = 0, \sigma^2 = 1.5)$.

A.) Rephrase this model in terms of its systematic and stochastic components.

The stochastic component is $y_i \sim N(\mu_i, \sigma^2 = 1.5)$; the systematic component is $\mu_i = 1 + 0.5x_{i1} - 2.2x_{i2} + x_{i3}$. The covariates determine the mean of $y_i$ deterministically, while the Normal draw around that mean supplies the randomness.

B.) Download xmat.csv from the course Canvas site and load it into your R session.

# The X matrix was already loaded above (as `data`); reuse it here
xmat <- data

# View the structure of the dataset
str(xmat)
## spc_tbl_ [1,000 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ X1: num [1:1000] -4.82 2.576 0.333 -1.153 2.056 ...
##  $ X2: num [1:1000] 1 0 1 1 0 0 1 1 1 1 ...
##  $ X3: num [1:1000] 1.5414 1.2589 -0.0693 0.0776 -1.1992 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   X2 = col_double(),
##   ..   X3 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr> 
# Dimensions and first few rows of the dataset
print(dim(xmat))
## [1] 1000    3
print(head(xmat))
## # A tibble: 6 × 3
##       X1    X2      X3
##    <dbl> <dbl>   <dbl>
## 1 -4.82      1  1.54  
## 2  2.58      0  1.26  
## 3  0.333     1 -0.0693
## 4 -1.15      1  0.0776
## 5  2.06      0 -1.20  
## 6  0.134     0  0.251
C.) What are the dimensions of the X matrix you just downloaded? Call this number n.

The X matrix has 1000 rows and 3 columns, so n = 1000.

D.) Type set.seed(10825). Then combine the X matrix you just downloaded with the linear model you described in part (a) to generate n simulated y values. Be careful with the constant! Then regress these simulated y values on X. Display the results in a regression table.
set.seed(10825)
Xi <- data.matrix(xmat)

# sigma^2 = 1.5, so the standard deviation is sqrt(1.5)
e <- rnorm(1000, mean = 0, sd = sqrt(1.5))

# y_i = 1 + 0.5*x_i1 - 2.2*x_i2 + x_i3 + e_i (note the constant)
yi <- 1 + 0.5*Xi[,1] - 2.2*Xi[,2] + Xi[,3] + e

regress <- lm(yi ~ Xi[,1] + Xi[,2] + Xi[,3])
summary(regress)
## 
## Call:
## lm(formula = yi ~ Xi[, 1] + Xi[, 2] + Xi[, 3])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5330 -0.8196  0.0124  0.8168  4.5651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.06651    0.05084   20.98   <2e-16 ***
## Xi[, 1]      0.48024    0.01925   24.95   <2e-16 ***
## Xi[, 2]     -2.26451    0.07852  -28.84   <2e-16 ***
## Xi[, 3]      0.95040    0.03822   24.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.225 on 996 degrees of freedom
## Multiple R-squared:  0.6792, Adjusted R-squared:  0.6782 
## F-statistic: 702.9 on 3 and 996 DF,  p-value: < 2.2e-16
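
The estimates (1.067, 0.480, -2.265, 0.950) recover the true parameters (1, 0.5, -2.2, 1) closely. As a quick optional check, reusing the regress fit above, one can ask whether each 95% confidence interval covers the truth:

# True parameters used to simulate y, in the same order as the fitted coefficients
truth <- c(1, 0.5, -2.2, 1)
ci <- confint(regress)
cbind(ci, truth, covered = as.numeric(ci[, 1] <= truth & truth <= ci[, 2]))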

4.) OLS in matrix form

For this problem you will need to load the file coxappend.dta from the Canvas site. These are replication data from Neto & Cox, AJPS 1997. Note that when you import Stata datasets you can type attributes(foo) to get the typical information you might find in a Stata codebook.

library(foreign)
cox <- read.dta("coxappend.dta")
attributes(cox)
## $datalabel
## [1] ""
## 
## $time.stamp
## [1] " 9 Apr 2002 14:18"
## 
## $names
##  [1] "var12"    "drop"     "year"     "enpv"     "enps"     "eneth"   
##  [7] "ml"       "upper"    "enpres"   "proximit" "lnml"     "lmleneth"
## [13] "smdp"     "smdpeth"  "multi"    "enpvlml"  "enpvUpp"  "multiV"  
## [19] "enpvQ"    "enpvmult" "enpvsmdp" "proxpres" "drop2"   
## 
## $formats
##  [1] "%23s"  "%8.0g" "%8.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g"
## [10] "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g"
## [19] "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g"
## 
## $types
##  [1] 150  98 105 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102
## [20] 102 102 102 102
## 
## $val.labels
##  [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## 
## $var.labels
##  [1] ""                                ""                               
##  [3] "Year"                            "Effective Electoral Parties"    
##  [5] "Effective Legislative Parties"   "Eff Ethnic Groups"              
##  [7] ""                                "Upper Tier"                     
##  [9] "Eff Pres Cand"                   "Proximity"                      
## [11] "Log Med Mag"                     "Log M * Ethn"                   
## [13] ""                                ""                               
## [15] "equals 1 if enps above median"   ""                               
## [17] ""                                "equal 1 if enpv is above median"
## [19] ""                                ""                               
## [21] ""                                ""                               
## [23] ""                               
## 
## $row.names
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
## [46] "46" "47" "48" "49" "50" "51" "52" "53" "54"
## 
## $version
## [1] 7
## 
## $class
## [1] "data.frame"
cox$log_ml <- log(cox$ml)

A.) Write an R function to perform an OLS regression using matrices and vectors. Your function should take two inputs: a vector of the dependent variable and a matrix of explanatory variables. Do not use any pre-programmed functions. You will regress the effective number of legislative parties (enps) on the effective number of ethnic groups (eneth), the log of median electoral district magnitude (ml), and the multiplicative interaction between the two. Report a table of the regression parameter estimates and their standard errors. Don't forget the constant!

# Function to perform OLS regression
ols_regression <- function(y, X) {
  # Add intercept to the predictors
  X <- cbind(1, X)
  
  # Estimate coefficients (beta_hat)
  beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
  
  # Calculate residuals
  residuals <- y - X %*% beta_hat
  
  # Estimate variance of residuals (sigma^2)
  n <- nrow(X)
  p <- ncol(X)
  sigma2 <- sum(residuals^2) / (n - p)
  
  # Calculate standard errors for coefficients
  se_beta <- sqrt(diag(sigma2 * solve(t(X) %*% X)))
  
  # Return results as a list
  list(coefficients = beta_hat, standard_errors = se_beta, residuals = residuals)
}
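
As a quick sanity check (a hypothetical example, not part of the assignment data), the function can be compared against lm() on simulated inputs; the two sets of coefficients should agree:

set.seed(1)
X_test <- matrix(rnorm(200), nrow = 100, ncol = 2)
y_test <- drop(1 + X_test %*% c(2, -1) + rnorm(100))
fit <- ols_regression(y_test, X_test)
cbind(manual = drop(fit$coefficients), lm = coef(lm(y_test ~ X_test)))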

# Define response variable (y) and predictors (X)
y <- cox$enps
X <- data.frame(
  eneth = cox$eneth,
  log_ml = log(cox$ml),
  interaction = cox$eneth * log(cox$ml)
)

# Perform OLS regression
results <- ols_regression(y, as.matrix(X))

# Display results in a clear table
coefficients_table <- data.frame(
  Coefficient = results$coefficients,
  Std_Error = results$standard_errors,
  row.names = c("Intercept", "eneth", "log_ml", "interaction")
)
print("ols_results")
## [1] "ols_results"
print(coefficients_table)
##             Coefficient Std_Error
## Intercept     2.6713673 0.6072149
## eneth        -0.3619712 0.3486305
## log_ml       -0.1911175 0.2967357
## interaction   0.4833255 0.1805094

B.) Make a Normal quantile comparison plot of the residuals from your regression and describe what it says in fewer than three sentences.

# Residuals were already computed inside ols_regression()
residuals <- results$residuals

# Q-Q plot of the residuals against Normal quantiles
qqnorm(residuals, main = "Normal Q-Q Plot of Residuals",
       ylab = "Residuals", xlab = "Theoretical Quantiles")
qqline(residuals, col = "blue")  # Add a reference line

The residuals follow the reference line through the middle of the distribution but rise well above it in the upper tail, indicating right-skewed, non-Normal residuals. A few cases have many more legislative parties than the model predicts, which foreshadows the outliers flagged in the diagnostics below.

C.) Perform the same regression using the lm() command.

mod <- lm(enps ~ eneth + log_ml + eneth * log_ml, data = cox)
summary(mod)
## 
## Call:
## lm(formula = enps ~ eneth + log_ml + eneth * log_ml, data = cox)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8627 -0.6818 -0.2346  0.2605  3.8235 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.6714     0.6072   4.399 5.69e-05 ***
## eneth         -0.3620     0.3486  -1.038    0.304    
## log_ml        -0.1911     0.2967  -0.644    0.522    
## eneth:log_ml   0.4833     0.1805   2.678    0.010 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.181 on 50 degrees of freedom
## Multiple R-squared:  0.3629, Adjusted R-squared:  0.3247 
## F-statistic: 9.493 on 3 and 50 DF,  p-value: 4.541e-05

D.) Report the Residual-v.-fitted, Scale-location, and Residual-v.-leverage diagnostic plots for this regression.

# which = c(1, 3, 5) selects Residuals vs Fitted, Scale-Location, and Residuals vs Leverage
par(mfrow = c(1, 3))
plot(mod, which = c(1, 3, 5))

E.) Based on these results, is this OLS regression an appropriate model for these data? Provide a paragraph stating why or why not.

In the Residuals vs Fitted plot, the slight downward curve suggests potential non-linearity. The heavy upper tail in the Q-Q plot from part B points to right-skewed, non-Normal residuals, and the upward trend in the Scale-Location plot indicates heteroscedasticity: the residual spread grows with the fitted values. A few observations with large residuals (e.g., observation 10) could also pose problems. If the model continues to show heteroscedasticity or non-linearity after these outliers are addressed or removed, OLS regression may not be the best fit for these data.
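
If the heteroscedasticity is the main concern, one standard remedy (beyond the scope of the original answer) is to keep the OLS point estimates but report heteroscedasticity-consistent standard errors. A minimal sketch, assuming the sandwich and lmtest packages are installed:

library(sandwich)  # heteroscedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for re-testing coefficients

# Re-test the OLS coefficients with HC1 ("robust") standard errors
coeftest(mod, vcov = vcovHC(mod, type = "HC1"))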

Appendix

I used ChatGPT in this assignment. I used the tool to help debug questions 4B and 4C, as well as to break down question 1. https://chatgpt.com/share/6787f91e-a768-8000-8e6c-f1cadd5c1053