POLI 271 Problem Set 1

Author

Sara Hamidi

#All libraries used in this problem set
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("gtsummary")
library('haven')

Question 1: Univariate displays & sampling distributions

Part (a):

x <- 5:15
n <- c(10, 100, 1000)
s <- c(10, 100, 1000)

par(mfrow = c(3, 3))
for (i in 1:length(n)) {
  for (j in 1:length(s)) {
    hist((replicate(n[i], mean(sample(x, size = n[j], replace = TRUE)))), main = paste(n[i], "samples of", s[j], "numbers"), xlim = c(9, 11), xlab = "Mean of Samples")
  }
}

Part (b):

This matrix of histograms highlights the law of large numbers which states that as the number of random samples increases, the average of the results gets closer to the true mean. The key attribute of this matrix which underpins this lesson is that the graphs in the bottom row are all representing normal distributions, with means close to 10 for numbers 5 to 15. It is also helpful to note that the top row representing 10 samples of size n are quite far from a normal distribution, while the middle row becomes slightly more normal, especially for 100 samples of 100 and 100 samples of 1000 numbers. Ultimately, the results of this matrix highlight that the larger number of random samples you use, the closer you will be to a true mean.

Question 2: Monte Carlo integration

# watched the following for a tutorial on Monte Carlo integration: https://youtu.be/8xo4Lx9fiRc?si=URPHpbm3Y7KLSMTg

f <- function(x)exp(-x) * sin(x)

# simulation

area_estimates <- vector(length = 10000) #the vector of area estimates for various number of points used 1-10000
for (i in 1:10000) {
  points <- runif(n = i, min = 2, max = 5) #the random points within the bounds
  area_estimates[i] <- 3 * mean(f(points)) #the area estimate using the random points
}

plot(area_estimates) # plot converges towards 0.035

mean(area_estimates) # mean represents this

[1] 0.03563196

integrate(f, 2, 5) # exact value = .03564528 with absolute error < 8.3e-16

0.03564528 with absolute error < 8.3e-16

Question 3: Systematic and Stochastic Components

Part (a):

Stochastic component:
\[Y_i \sim N(\mu_i, \sigma^2=1.5)\]

This component is essentially the error term which represents sampling or measuring errors which may cause a deviance. It states that the mean is 0 and the standard deviation is 1.5.

Systematic component: \[\mu_i = 1 + 0.5x_{i1} - 2.2x_{i2} + x_{i3}\]

This component describes the model’s parameters (betas).

Part (b):

X <- read_csv('/Users/sarahamidi/Downloads/xmat.csv') # download the csv

Rows: 1000 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): X1, X2, X3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dim(X) # dimensions of the X matrix are 1000 rows x 3 cols

[1] 1000    3

n = nrow(X) # 1000 rows

Part (c):

set.seed(10825)

# Systematic and stochastic components implementing the columns from matrix X and the linear model given
systematic <- 1 + 0.5 * X$X1 - 2.2 * X$X2 + X$X3
stochastic <- rnorm(n, mean = 0, sd = sqrt(1.5)) # we want to generate n (1000) y values


y <- systematic + stochastic # combining X with the model to generate y values

X_y <- data.frame(y = y) 

X$y <- y # add the simulated y values into the data frame for the regression


mod <- lm(y ~ X1 + X2 + X3 + 1, data = X) # run the regression
summary(mod) # print the regression table


Call:
lm(formula = y ~ X1 + X2 + X3 + 1, data = X)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5330 -0.8196  0.0124  0.8168  4.5651 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.06651    0.05084   20.98   <2e-16 ***
X1           0.48024    0.01925   24.95   <2e-16 ***
X2          -2.26451    0.07852  -28.84   <2e-16 ***
X3           0.95040    0.03822   24.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.225 on 996 degrees of freedom
Multiple R-squared:  0.6792,    Adjusted R-squared:  0.6782 
F-statistic: 702.9 on 3 and 996 DF,  p-value: < 2.2e-16

Question 4: OLS in matrix form

Part (a):

# watched the following for a tutorial on using matrices to do linear regression: https://youtu.be/FFX9WyxEa7o?si=cWjNr80NIYYPS6xR

dat2 <- read_dta('/Users/sarahamidi/Downloads/coxappend.dta')
ols_reg <- function(dep, ind){
  y <- as.matrix(dep) # make sure that it is a matrix (should be 1 column (vector))
  x <- as.matrix(ind) # make sure it is a matrix
  cbind(y, x) # combining dependent variable (outcome) with the design matrix (matrix of explanatory variables)
  xtx <- t(x) %*% x 
  xtx_inv <- solve(xtx)
  beta <- solve(t(x) %*% x) %*% t(x) %*% y # formula for coefficient vector -- beta values
  
  
  residuals = sum((y - x%*%beta)^2)/(nrow(x) - ncol(x)) # mean squared error
  vcv_mat <- residuals * solve(xtx) # variance - covariance matrix
  std_err <- sqrt(diag(vcv_mat)) # sq root the diagonal to pull out standard errors

  result <- as.data.frame(cbind(beta, std_err)) # table of results
  names(result)[1:2] <- c("Parameter estimates", "Std. Error")
  print(result)
  return(beta) # to get the y_hat values later
  

}

outcome <- dat2$enps # vector of dependent variable -- effective number of legislative parties (enps) 
explanatory <- cbind(constant = 1, dat2$eneth, log(dat2$ml), dat2$eneth*log(dat2$ml)) # matrix of explanatory variables -- effective ethnic groups (eneth), log of median district magnitude electoral district magnitude(ml), and eneth * log (ml)

beta_hat <- ols_reg(outcome, explanatory) # call the function using the data from coxappend.dta

         Parameter estimates Std. Error
constant           2.6713673  0.6072149
X                 -0.3619712  0.3486305
X.1               -0.1911175  0.2967357
X.2                0.4833255  0.1805094

Part (b):

y_hat <- explanatory %*% beta_hat # multiply the parameter estimates by the x matrix to get yhat
residuals <- outcome - y_hat # get the residuals from the difference between y and yhat
qqnorm(residuals) # normal quantile comparison plot

In the upper part of the plot, it is evident that there is a slight deviance from the linear trend which would represent a Normal distribution. Because of this, we can infer that there is a slight right skew, due to the variance and larger values which would not appear if the residuals were Normal.

Part (c):

# using the lm() function to run the same OLS regression
mod1 <- lm(enps ~ eneth + log(dat2$ml) + eneth:log(dat2$ml), data = dat2)
summary(mod1) # the results match


Call:
lm(formula = enps ~ eneth + log(dat2$ml) + eneth:log(dat2$ml), 
    data = dat2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8627 -0.6818 -0.2346  0.2605  3.8235 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.6714     0.6072   4.399 5.69e-05 ***
eneth               -0.3620     0.3486  -1.038    0.304    
log(dat2$ml)        -0.1911     0.2967  -0.644    0.522    
eneth:log(dat2$ml)   0.4833     0.1805   2.678    0.010 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.181 on 50 degrees of freedom
Multiple R-squared:  0.3629,    Adjusted R-squared:  0.3247 
F-statistic: 9.493 on 3 and 50 DF,  p-value: 4.541e-05

Part (d):

plot(mod1, which = 1) # residual v fitted

plot(mod1, which = 3) # scale-location

plot(mod1, which = 5) # residual v leverage

Part (e):

Based on these results, it is evident that the OLS regression may not be the most appropriate model for these data. The Residual-v.-fitted plot reveals a distinctive downward trend in the residuals, indicating that there is a non-linear relationship present. You can also see a few residuals that stand out way above the 0 line, which suggests the existence of some outliers. Next, the Scale-location plot reveals that the assumption of homoscedasticity is not satisfied as there is a clear upward trend in the spread of residuals. Finally, looking at the Residual-v.-leverage plot, we can see that there is an outlying value at the lower right corner of the plot and another one bordering the “Cook’s distance” dashed line on the upper middle area of the plot. Because these two cases are on or outside of the dashed line, they may be influential to the regression results. Ultimately, based on these patterns and insights from the diagnostic plots, it may be better to try different models for this data, to improve results and address non-linearity, the influence of outliers, and homoscedasticity.

Appendix

I certify that I did not use any LLM or generative AI tool in this assignment. Other sources that were used to aid in this assignment have been attributed as comments for the respective questions.