Clear Data

rm(list = ls())      # Remove all objects from the workspace
gc()                 # Run garbage collection to free unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 526478 28.2    1169512 62.5         NA   669420 35.8
## Vcells 971181  7.5    8388608 64.0      16384  1851931 14.2
cat("\f")            # Clear the console
graphics.off()       # Clear all graphs

Part 1

Summary on Data

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# structure and summary stats
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

Multivariate Regression

\(Ozone_i = \beta_0 + \beta_1 \cdot Solar.R_i + \beta_2 \cdot Wind_i + \epsilon_i\)

\(Ozone_i\) = Dependent Variable

  • Ozone: The mean ozone concentration in parts per billion (ppb), measured from 1300 to 1500 hours

\(Solar.R_i\) and \(Wind_i\) = Independent Variables

  • Solar.R: The daily solar radiation in Langleys (a measure of solar energy)

  • Wind: Average daily wind speed (mph)

\(\beta_0\) = Intercept

\(\beta_1\), \(\beta_2\) = Slope coefficients for Solar.R and Wind, respectively

\(\epsilon_i\) = Error term

# multivariate regression model
multivariate_model <- lm(Ozone ~ Solar.R + Wind, 
                         data = airquality)

# summary of the regression model
summary(multivariate_model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.651 -18.164  -5.959  18.514  85.237 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 77.24604    9.06751   8.519 1.05e-13 ***
## Solar.R      0.10035    0.02628   3.819 0.000224 ***
## Wind        -5.40180    0.67324  -8.024 1.34e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.92 on 108 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.4495, Adjusted R-squared:  0.4393 
## F-statistic: 44.09 on 2 and 108 DF,  p-value: 1.003e-14
# present regression results
stargazer(multivariate_model, 
          type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                Ozone           
## -----------------------------------------------
## Solar.R                      0.100***          
##                               (0.026)          
##                                                
## Wind                         -5.402***         
##                               (0.673)          
##                                                
## Constant                     77.246***         
##                               (9.068)          
##                                                
## -----------------------------------------------
## Observations                    111            
## R2                             0.449           
## Adjusted R2                    0.439           
## Residual Std. Error      24.917 (df = 108)     
## F Statistic           44.092*** (df = 2; 108)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
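
The "42 observations deleted due to missingness" note and the 111 observations reported above can be checked directly. This is a minimal sketch, assuming the built-in airquality data; the names model_vars, n_complete, and n_dropped are introduced here only for illustration.

```r
# Variables that enter the regression (illustrative helper name)
model_vars <- c("Ozone", "Solar.R", "Wind")

# Rows with no missing value in any of the model variables
n_complete <- sum(complete.cases(airquality[, model_vars]))

# Rows that lm() drops under R's default na.action (na.omit)
n_dropped <- nrow(airquality) - n_complete

n_complete
n_dropped
```

The two counts should match the "Observations" row of the stargazer table and the missingness note in the summary output.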

Part 2

Based on the regression results, more sunlight is associated with higher Ozone levels: each additional langley of solar radiation is associated with an increase of about 0.10 ppb in ozone concentration, holding wind speed constant. On the other hand, higher wind speeds are associated with lower Ozone levels: each 1 mph increase in average wind speed is associated with a decrease of about 5.40 ppb in ozone concentration, holding solar radiation constant. At a high level, both directions make sense, as sunlight adds heat while wind removes it, and the two affect Ozone levels in opposite ways.

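The p-values for both coefficients are below 0.01, so both relationships are statistically significant at the 1% level: estimates this far from zero would be very unlikely to arise from random chance alone if the true coefficients were zero.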
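
Each slope coefficient is the expected change in Ozone (ppb) for a one-unit increase in that regressor, holding the other regressor fixed. The sketch below illustrates this by comparing two predictions that differ only by 1 mph of wind. The values Solar.R = 200 and Wind = 10 are arbitrary illustration points, and the names b, base, bumped, and wind_effect are not part of the original analysis.

```r
# Refit the model so the example is self-contained
multivariate_model <- lm(Ozone ~ Solar.R + Wind, data = airquality)

# Estimated coefficients: intercept, Solar.R slope, Wind slope
b <- coef(multivariate_model)
b

# Two hypothetical days that differ only by 1 mph of wind
base   <- data.frame(Solar.R = 200, Wind = 10)
bumped <- data.frame(Solar.R = 200, Wind = 11)

# The difference in predicted Ozone equals the Wind coefficient
wind_effect <- unname(predict(multivariate_model, bumped) -
                      predict(multivariate_model, base))
wind_effect

# 95% confidence intervals for all coefficients
confint(multivariate_model)
```

Because the model is linear, the same difference would be obtained at any other starting values of Solar.R and Wind.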

Part 3

# Residual plot analysis
par(mfrow=c(1, 1))

plot(multivariate_model)

  1. Residuals vs Fitted: The residuals show non-linearity, following a slight parabola rather than scattering evenly around zero. This suggests the linear specification may be missing some curvature in the relationship between ozone and the regressors.

  2. Q-Q Residuals: Most of the residuals follow the diagonal reference line (within about 2 standard deviations). A few points deviate in the tails, but overall the residuals look approximately normal.

  3. Scale-Location: The trend line is roughly horizontal, which is consistent with constant variance, but it is nevertheless influenced by outliers (some at the top right).

  4. Residuals vs Leverage: This plot helps identify influential outliers. Based on the graph, no points fall outside the Cook’s distance lines, so there are no observations that would need to be removed.
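
The visual checks above can be complemented with the underlying numbers. This is a minimal sketch using base R diagnostics. The 0.5 cutoff mirrors the first Cook's distance contour that plot() draws for lm objects, and the |standardized residual| > 2 rule is a common rule of thumb rather than a formal test. The names cooks, std_res, and leverage are illustrative.

```r
# Refit the model so the example is self-contained
multivariate_model <- lm(Ozone ~ Solar.R + Wind, data = airquality)

# Cook's distance for each observation used in the fit
cooks <- cooks.distance(multivariate_model)
max(cooks)

# Observations beyond the 0.5 contour (empty if none are influential)
which(cooks > 0.5)

# Standardized residuals: how many lie more than 2 SDs from zero
std_res <- rstandard(multivariate_model)
sum(abs(std_res) > 2)

# Leverage (hat values) of the five most unusual regressor combinations
leverage <- hatvalues(multivariate_model)
head(sort(leverage, decreasing = TRUE), 5)
```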

Part 4

1) Linearity: The model is linear in its parameters, so the dependent variable is a linear function of the independent variables plus an error term.

2) Random sampling: The data must have been randomly sampled from the population.

3) Non Collinearity: There is no perfect linear relationship between the regressors.

cor(airquality$Solar.R, 
    airquality$Wind, 
    use = "complete.obs")
## [1] -0.05679167

4) Homoscedasticity: The variance of the error terms (residuals) is constant across all values of the regressors.

5) Exogeneity: The independent variables are not correlated with the error term, i.e. the error term has zero mean conditional on the regressors.
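
Assumptions 3 and 4 can also be checked numerically. This is a rough sketch, not a definitive procedure: it computes a variance inflation factor by hand and a Breusch-Pagan style statistic (the studentized n·R² form) from an auxiliary regression, using only base R. The names used, vif_solar, aux, lm_stat, and p_value are introduced here for illustration.

```r
# Refit the model so the example is self-contained
multivariate_model <- lm(Ozone ~ Solar.R + Wind, data = airquality)

# The complete rows actually used in the fit
used <- model.frame(multivariate_model)

# Collinearity: VIF = 1 / (1 - R^2) from regressing one regressor on the other
r2_solar  <- summary(lm(Solar.R ~ Wind, data = used))$r.squared
vif_solar <- 1 / (1 - r2_solar)
vif_solar   # values near 1 indicate little collinearity

# Homoscedasticity: regress squared residuals on the regressors
used$sq_resid <- resid(multivariate_model)^2
aux <- lm(sq_resid ~ Solar.R + Wind, data = used)

# LM statistic n * R^2, compared with a chi-square with 2 degrees of freedom
lm_stat <- nrow(used) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)
p_value     # a small p-value would point to heteroscedasticity
```

With only two regressors the VIF is the same for Solar.R and Wind, so computing it once is enough.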

Part 5 (1)

BLUE stands for Best Linear Unbiased Estimator. When the assumptions above hold, the OLS estimator is BLUE: among all linear unbiased estimators, it has the smallest variance.

Part 5 (2)

Taking the logarithm of a variable can help normalize skewed data by compressing large values and spreading out smaller ones. This reduces skewness and makes the distribution more symmetric.

It can also improve linearity, making relationships more interpretable and easier to model with a linear regression. In many cases it also helps stabilize the variance across different levels of the predictors, addressing heteroscedasticity issues.
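
As an illustration of the skewness point, the sketch below compares Ozone before and after a log transform and refits the model with log(Ozone) as the dependent variable. The skew() helper is a simple moment-based measure written for this example (it is not a base R function), and the log model is shown only as an assumption about how the idea could be applied to this dataset.

```r
# Observed ozone values (all are positive, so the log is defined)
ozone <- na.omit(airquality$Ozone)

# Simple moment-based skewness (illustrative helper, not from a package)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(ozone)        # positive: long right tail
skew(log(ozone))   # lower than the value above after the log transform

# Same regressors, log-transformed dependent variable
log_model <- lm(log(Ozone) ~ Solar.R + Wind, data = airquality)
summary(log_model)$coefficients
```

In the log model, each slope is interpreted approximately as a proportional change in Ozone per one-unit change in the regressor, rather than a change in ppb.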