Discussion 6

Author

Allison Shrivastava

First, I’ll look at the airquality data set. In this example, ozone will be my dependent variable, and temperature will be my independent variable.

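The intercept is -147.646 and the slope is 2.439. In other words, for every 1 degree Fahrenheit increase in temperature, the ozone concentration increases by about 2.439 parts per billion on average. The intercept is the fitted ozone value at 0 degrees, which lies far below any temperature in the data, so it has no practical interpretation on its own.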

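# load packages: the pipe (%>%), drop_na(), and ggplot() come from the tidyverse
library(tidyverse)

# keep only rows with no missing values in any column
air<-as.data.frame(airquality)%>%
  drop_na()

# let's fit the model
est<-lm(Ozone~Temp, data=air)
# print the output
summary(est)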

Call:
lm(formula = Ozone ~ Temp, data = air)

Residuals:
   Min     1Q Median     3Q    Max 
-40.92 -17.46  -0.87  10.44 118.08 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -147.646     18.755   -7.87      0.0000000000028 ***
Temp           2.439      0.239   10.19 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.9 on 109 degrees of freedom
Multiple R-squared:  0.488, Adjusted R-squared:  0.483 
F-statistic:  104 on 1 and 109 DF,  p-value: <0.0000000000000002
#now manually calculate the intercept and slope
x<-air$Temp
y<-air$Ozone

beta1<-cov(x,y)/var(x)
beta0<-mean(y)-beta1*mean(x)

#print
beta0
[1] -147.6
beta1
[1] 2.439
## now let's plot with the equation
ggplot(air, aes(x=Temp, y=Ozone))+
  geom_point(color="skyblue", size=3)+
  geom_smooth(method = "lm", color="darkblue", se=FALSE, lwd=2)+
  ggtitle(expression(Ozone[i]==beta[0]+beta[1]*Temp[i]+epsilon[i]))+
  xlab("Temperature (°F)")+
  ylab("Ozone (ppb)")+
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
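
As a quick check on the ozone example, the sketch below confirms that the covariance formula reproduces the `lm()` coefficients and then uses the fitted line for a prediction. It is only an illustration: the object names (`air_check`, `fit_check`, `pred_80`) and the 80 degree input are my own choices, and `na.omit()` stands in for `drop_na()` so that only base R is needed.

```r
# Sketch: confirm the covariance formula reproduces lm(), then predict at 80 degrees F.
# na.omit() stands in for drop_na() so that only base R is needed.
air_check <- na.omit(airquality)

fit_check <- lm(Ozone ~ Temp, data = air_check)

# Closed-form OLS estimates for a model with a single regressor
slope_manual <- cov(air_check$Temp, air_check$Ozone) / var(air_check$Temp)
intercept_manual <- mean(air_check$Ozone) - slope_manual * mean(air_check$Temp)

# Both comparisons should return TRUE (up to floating-point tolerance)
all.equal(unname(coef(fit_check)["Temp"]), slope_manual)
all.equal(unname(coef(fit_check)["(Intercept)"]), intercept_manual)

# Fitted ozone on a hypothetical 80 degree day (80 is an illustrative value)
pred_80 <- unname(predict(fit_check, newdata = data.frame(Temp = 80)))
pred_80
```

With the coefficients reported above, the prediction works out to roughly -147.646 + 2.439 * 80, or about 47.5 ppb.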

Next, I’ll look at the mtcars dataset. In this example, miles per gallon will be my dependent variable, and weight of the cars will be my independent variable.

The intercept is 37.29, and the slope is -5.34. In other words, for every 1,000 lb increase in a car’s weight, fuel efficiency decreases by about 5.34 mpg on average.

cars<-as.data.frame(mtcars)%>%
  drop_na()

# let's fit the model
est2<-lm(mpg~wt, data=cars)
# print the output
summary(est2)

Call:
lm(formula = mpg ~ wt, data = cars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   37.285      1.878   19.86 < 0.0000000000000002 ***
wt            -5.344      0.559   -9.56        0.00000000013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753, Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 0.000000000129
#now manually calculate the intercept and slope
x2<-cars$wt
y2<-cars$mpg

beta1.2<-cov(x2,y2)/var(x2)
beta0.2<-mean(y2)-beta1.2*mean(x2)

#print
beta0.2
[1] 37.29
beta1.2
[1] -5.344
## now let's plot with the equation
ggplot(cars, aes(x=wt, y=mpg))+
  geom_point(color="pink", size=3)+
  geom_smooth(method = "lm", color="purple", se=FALSE, lwd=2)+
  ggtitle(expression(mpg[i]==beta[0]+beta[1]*wt[i]+epsilon[i]))+
  xlab("weight (1000 lbs)")+
  ylab("miles per gallon")+
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
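
The standard error in the summary output also tells us how precisely the weight slope is estimated. The sketch below, which is an addition of mine rather than part of the original exercise, builds a 95% confidence interval for the slope two ways: with `confint()` and by hand from the estimate, its standard error, and the t critical value. The names `fit_cars`, `ci`, and `ci_manual` are illustrative.

```r
# Sketch: 95% confidence interval for the weight slope in the mtcars fit.
fit_cars <- lm(mpg ~ wt, data = mtcars)

# confint() uses the t distribution with the residual degrees of freedom (30 here)
ci <- confint(fit_cars, level = 0.95)
ci

# The same interval for the slope, built by hand from the estimate and its standard error
est_wt <- coef(summary(fit_cars))["wt", "Estimate"]
se_wt <- coef(summary(fit_cars))["wt", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit_cars))
ci_manual <- c(est_wt - t_crit * se_wt, est_wt + t_crit * se_wt)
ci_manual
```

Because the whole interval lies below zero, the negative relationship between weight and mpg is consistent with the very small p-value in the regression table.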

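The conditions under which OLS is BLUE (the best linear unbiased estimator) are the Gauss-Markov assumptions. First, the model is linear in its parameters (coefficients). Second, the data are a random sample that is representative of the overall population. Third, there is no perfect collinearity among the regressors; that is, no regressor is an exact linear combination of the others, so it is possible to estimate a unique value for every coefficient in the model. Fourth is the zero conditional mean assumption: the error term has an expected value of zero at every value of the regressors, which rules out omitted variable bias from factors that are correlated with the regressors. Lastly is homoscedasticity, wherein the variance of the error term is constant across all values of the regressors, so the spread of the data around the regression line is about the same everywhere. The first four conditions make OLS unbiased; adding homoscedasticity makes it the lowest-variance estimator among linear unbiased estimators.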
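
Of these conditions, homoscedasticity is the easiest to probe with the data at hand. The sketch below is one informal way to do that for the ozone model using only base R: a residuals-versus-fitted plot, followed by the studentized form of the Breusch-Pagan statistic (n times the R-squared from regressing the squared residuals on the regressor). The object names (`air_diag`, `fit_diag`, `aux`, `bp_stat`, `bp_pvalue`) are my own, and this is a rough check under those assumptions rather than a full diagnostic workflow.

```r
# Sketch: informal homoscedasticity checks for the ozone model (base R only).
air_diag <- na.omit(airquality)
fit_diag <- lm(Ozone ~ Temp, data = air_diag)

# Visual check: under homoscedasticity the vertical spread of the residuals
# should look similar across the range of fitted values
plot(fitted(fit_diag), resid(fit_diag),
     xlab = "Fitted ozone (ppb)", ylab = "Residual (ppb)")
abline(h = 0, lty = 2)

# Auxiliary regression of the squared residuals on the regressor
air_diag$resid_sq <- resid(fit_diag)^2
aux <- lm(resid_sq ~ Temp, data = air_diag)

# Studentized Breusch-Pagan statistic: n * R-squared, compared to a
# chi-squared distribution with 1 degree of freedom (one regressor)
bp_stat <- nrow(air_diag) * summary(aux)$r.squared
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
bp_stat
bp_pvalue
```

A small p-value would suggest that the error variance changes with temperature. Note that the residuals from a model with an intercept average to zero by construction, so their mean cannot be used to test the zero conditional mean assumption.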