BettyWang_discussion3

Author

Betty Wang

Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

1 + 1
[1] 2

You can add options to executable code like this

[1] 4

The echo: false option disables the printing of code (only output is displayed).

#title: "discussion"
#author: "Betty Wang"
#date: "2024-09-18"
#output: pdf_document


knitr::opts_chunk$set(echo = TRUE)

## R Markdown

#This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

#When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:




library(MASS)
data("Boston")





#The Boston dataset contains 506 observations with 14 variables. It can be used for regression tasks to predict the median value of owner-occupied homes (medv) based on various predictors.

\[ medv_i = \beta_0 + \beta_1 crim_i + \epsilon_i \]

beta1 = cov(Boston$medv,Boston$crim)/var(Boston$crim) 
beta1
[1] -0.4151903
reg1 <- lm(data=Boston,
           formula = medv ~ crim)


library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
stargazer(reg1, type = "text" ) 

===============================================
                        Dependent variable:    
                    ---------------------------
                               medv            
-----------------------------------------------
crim                         -0.415***         
                              (0.044)          
                                               
Constant                     24.033***         
                              (0.409)          
                                               
-----------------------------------------------
Observations                    506            
R2                             0.151           
Adjusted R2                    0.149           
Residual Std. Error      8.484 (df = 504)      
F Statistic           89.486*** (df = 1; 504)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
#Variance measures the spread or dispersion of a set of data points around their mean. Specifically, it quantifies how far the individual data points in a dataset are from the mean of the dataset on average.

#Covariance measures the direction of the linear relationship between two variables. It quantifies how much two random variables vary together. If the variables tend to increase and decrease together, the covariance is positive. If one variable tends to increase when the other decreases, the covariance is negative.

#Dividing the covariance of y and x by the variance of x isolates the effect of x on y and removes the volatility of x from the Cov(x,y). This ratio gives the slope in simple linear regression that quantifies how much y changes, on average, for each one-unit increase in x.



library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::select() masks MASS::select()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)


str(Boston)
'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
library(tidyverse)
glimpse(Boston)
Rows: 506
Columns: 14
$ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
$ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
$ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
$ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
$ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
$ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
$ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
$ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
$ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
$ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
$ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
$ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
# Fit a multivariable linear regression model
model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat, data = Boston)

# Summarize the model
summary(model)

Call:
lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + 
    dis + rad + tax + ptratio + black + lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + 
    dis + rad + tax + ptratio + black + lstat, data = Boston)

Call:
lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + 
    dis + rad + tax + ptratio + black + lstat, data = Boston)

Coefficients:
(Intercept)         crim           zn        indus         chas          nox  
  3.646e+01   -1.080e-01    4.642e-02    2.056e-02    2.687e+00   -1.777e+01  
         rm          age          dis          rad          tax      ptratio  
  3.810e+00    6.922e-04   -1.476e+00    3.060e-01   -1.233e-02   -9.527e-01  
      black        lstat  
  9.312e-03   -5.248e-01  

\[ medv_i = \beta_0 + \beta_1 crim_i +\beta_2 zn_i + \beta_3 indus_i + \beta_4 chas_i + \beta_5 nox_i + \beta_6 rm_i + \beta_7 age_i + \beta_8 dis_i + \beta_9 rad_i + \beta_10 tax_i + \beta_11 ptratio_i + \beta_12 black_i + \beta_13 lstat_i + \epsilon_i \]

y <- as.vector(Boston$medv)

x <- as.matrix(subset(Boston,select=-medv))

int <- rep(x = 1, 
           times = length(y))

x <- cbind(int, x)

remove(int)





betas <- solve(t(x) %*% x) %*% t(x) %*% y

betas <- round(x = betas, digits = 2)

betas
          [,1]
int      36.46
crim     -0.11
zn        0.05
indus     0.02
chas      2.69
nox     -17.77
rm        3.81
age       0.00
dis      -1.48
rad       0.31
tax      -0.01
ptratio  -0.95
black     0.01
lstat    -0.52
betas_check <- solve(crossprod(x), crossprod(x,y))

betas == round(x = betas_check, digits = 2)
        [,1]
int     TRUE
crim    TRUE
zn      TRUE
indus   TRUE
chas    TRUE
nox     TRUE
rm      TRUE
age     TRUE
dis     TRUE
rad     TRUE
tax     TRUE
ptratio TRUE
black   TRUE
lstat   TRUE
lm.mod <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat, data = Boston)

lm.betas <- round(x = lm.mod$coefficients, digits = 2)

?data.frame
results <- data.frame(our.results = betas, lm.results = lm.betas)

print(results)
        our.results lm.results
int           36.46      36.46
crim          -0.11      -0.11
zn             0.05       0.05
indus          0.02       0.02
chas           2.69       2.69
nox          -17.77     -17.77
rm             3.81       3.81
age            0.00       0.00
dis           -1.48      -1.48
rad            0.31       0.31
tax           -0.01      -0.01
ptratio       -0.95      -0.95
black          0.01       0.01
lstat         -0.52      -0.52
#Assume that Y = Xβ + ϵ, x is the matrix of variables

#1. compute β = (X'X\^-1)X'Y
#2. e = Y - hat Y
#3. calculate the variance of e
#4. calculate the variance of β = Var(e)\*(X'X)\^-1
#5. calculate the standard deviation of β

#The standard error (SE) of the coefficient β in linear regression is an estimate of the standard deviation of the sampling distribution of β-hat (the estimated regression coefficient).