First, I’ll look at the airquality data set. In this example, ozone will be my dependent variable, and temperature will be my independent variable.
The intercept is -147.646 and the slope is 2.439. In other words, for every 1 degree Fahrenheit increase in temperature, the ozone concentration increases by about 2.439 parts per billion on average.
library(tidyverse)  # provides %>%, drop_na(), and ggplot()

air <- as.data.frame(airquality) %>% drop_na()
# let's fit the model
est <- lm(Ozone ~ Temp, data = air)
# print the output
summary(est)
Call:
lm(formula = Ozone ~ Temp, data = air)
Residuals:
    Min      1Q  Median      3Q     Max
 -40.92  -17.46   -0.87   10.44  118.08
Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept)  -147.646     18.755   -7.87      0.0000000000028 ***
Temp            2.439      0.239   10.19 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.9 on 109 degrees of freedom
Multiple R-squared: 0.488, Adjusted R-squared: 0.483
F-statistic: 104 on 1 and 109 DF, p-value: <0.0000000000000002
# now manually calculate the intercept and slope
x <- air$Temp
y <- air$Ozone
beta1 <- cov(x, y) / var(x)
beta0 <- mean(y) - beta1 * mean(x)
# print
beta0
[1] -147.6
beta1
[1] 2.439
## now let's plot with the equation
ggplot(air, aes(x = Temp, y = Ozone)) +
  geom_point(color = "skyblue", size = 3) +
  geom_smooth(method = "lm", color = "darkblue", se = FALSE, lwd = 2) +
  ggtitle(expression(Ozone[i] == beta[0] + beta[1] * Temp[i] + epsilon[i])) +
  xlab("Temperature (f)") +
  ylab("Ozone (ppb)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
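The intercept formula used above, beta0 = mean(y) - beta1 * mean(x), implies that the fitted line passes through the point of means. A minimal sketch of that check follows. It uses base R's na.omit() in place of drop_na(), an assumption made only to keep the example free of package dependencies, and the names fit and pred_at_mean are illustrative. It also prints the observed temperature range, since the intercept describes predicted ozone at a temperature of 0, which lies outside the data.

```r
# Rebuild the complete-case data with base R only (stands in for drop_na()).
air <- na.omit(airquality)
fit <- lm(Ozone ~ Temp, data = air)

# Predict ozone at the average temperature.
pred_at_mean <- predict(fit, newdata = data.frame(Temp = mean(air$Temp)))

# Because beta0 = mean(y) - beta1 * mean(x), this equals the average ozone.
print(all.equal(unname(pred_at_mean), mean(air$Ozone)))

# The intercept is the prediction at Temp = 0; compare with the observed range.
print(range(air$Temp))
```

This is why a negative intercept is not alarming here: it is an extrapolation rather than a statement about ozone levels that were actually observed.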
Next, I’ll look at the mtcars dataset. In this example, miles per gallon will be my dependent variable, and weight of the cars will be my independent variable.
The intercept is 37.29 and the slope is -5.34. In other words, for every 1,000 lb increase in a car’s weight, fuel efficiency decreases by about 5.34 mpg on average.
cars <- as.data.frame(mtcars) %>% drop_na()
# let's fit the model
est2 <- lm(mpg ~ wt, data = cars)
# print the output
summary(est2)
Call:
lm(formula = mpg ~ wt, data = cars)
Residuals:
    Min      1Q  Median      3Q     Max
 -4.543  -2.365  -0.125   1.410   6.873
Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept)    37.285      1.878   19.86 < 0.0000000000000002 ***
wt             -5.344      0.559   -9.56        0.00000000013 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared: 0.753, Adjusted R-squared: 0.745
F-statistic: 91.4 on 1 and 30 DF, p-value: 0.000000000129
# now manually calculate the intercept and slope
x2 <- cars$wt
y2 <- cars$mpg
beta1.2 <- cov(x2, y2) / var(x2)
beta0.2 <- mean(y2) - beta1.2 * mean(x2)
# print
beta0.2
[1] 37.29
beta1.2
[1] -5.344
## now let's plot with the equation
ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point(color = "pink", size = 3) +
  geom_smooth(method = "lm", color = "purple", se = FALSE, lwd = 2) +
  ggtitle(expression(mpg[i] == beta[0] + beta[1] * wt[i] + epsilon[i])) +
  xlab("weight (1000lbs)") +
  ylab("miles per gallon") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
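The slope interpretation for this model can be checked directly from its predictions. The sketch below refits the model on the built-in mtcars data and compares two hypothetical cars that differ by one unit of wt (1,000 lbs); the names fit2 and preds and the weights 3 and 4 are chosen purely for illustration.

```r
# Fit the same model on the built-in mtcars data (it has no missing values).
fit2 <- lm(mpg ~ wt, data = mtcars)

# Predicted mpg for two hypothetical cars weighing 3,000 and 4,000 lbs.
preds <- predict(fit2, newdata = data.frame(wt = c(3, 4)))

# The gap between the two predictions equals the estimated slope on wt.
print(diff(preds))
print(coef(fit2)["wt"])
```

Any pair of weights one unit apart gives the same gap, because the model is linear in wt.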
The conditions under which OLS is BLUE (best linear unbiased estimator) are as follows. First, the model is linear in parameters (coefficients). Second, the data are a random sample that is representative of the overall population. Third, there is no perfect collinearity, so it is possible to estimate a value for every coefficient in the model. Fourth, the errors have zero conditional mean: the error term is unrelated to the regressors, so no relevant variable that is correlated with them has been left out, which rules out omitted variable bias. Lastly, the errors are homoscedastic, meaning that their variance is constant across values of the regressors. This ensures that, on average, the model predicts each data point with about the same precision.
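The homoscedasticity condition can be probed informally from the fitted residuals. The sketch below, using the mtcars model, regresses the squared residuals on weight and forms an n * R-squared statistic, which is the studentized (Koenker) form of the Breusch-Pagan test. The choice of this test and the names aux_data, aux, lm_stat, and p_value are assumptions for illustration and are not part of the analysis above.

```r
# Refit the mpg ~ wt model on the built-in mtcars data.
fit2 <- lm(mpg ~ wt, data = mtcars)

# Auxiliary regression: do the squared residuals vary with the regressor?
aux_data <- data.frame(sq_resid = resid(fit2)^2, wt = mtcars$wt)
aux <- lm(sq_resid ~ wt, data = aux_data)

# n * R-squared is compared with a chi-squared distribution whose degrees
# of freedom equal the number of regressors in the auxiliary model (here 1).
lm_stat <- nrow(aux_data) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value would suggest that the error variance changes with weight. Note that the zero conditional mean condition cannot be checked the same way, because OLS residuals average to zero and are uncorrelated with the regressors by construction.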