1.  Pick any two quantitative variables from a data set that interests you.

You can continue to use the dataset from your last discussion, or pick up a new dataset.  

?CO2
CO2_df <- CO2
summary(CO2_df)
##      Plant             Type         Treatment       conc          uptake     
##  Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95   Min.   : 7.70  
##  Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175   1st Qu.:17.90  
##  Qn3    : 7                                    Median : 350   Median :28.30  
##  Qc1    : 7                                    Mean   : 435   Mean   :27.21  
##  Qc3    : 7                                    3rd Qu.: 675   3rd Qu.:37.12  
##  Qc2    : 7                                    Max.   :1000   Max.   :45.50  
##  (Other):42

A.  Tell us which variable is dependent and which is independent. Type out your estimating equation, i.e., I am expecting to see subscripts i on your y, x, and error term, professionally done. Make sure to describe these two variables (e.g., y measures number of cars, x is income in 1000s of dollars).

cor(CO2_df$uptake,CO2_df$conc)
## [1] 0.4851774
CO2_df$uptake ~ CO2_df$conc
## CO2_df$uptake ~ CO2_df$conc

I will be looking at the CO2 dataset, which is “from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.” The variables of interest are conc and uptake. conc is the ambient CO2 concentration (\(\frac{mL}{L}\)) and is the independent variable, X. uptake is the rate of CO2 uptake by the plants (\(\frac{\mu mol}{m^2}\) sec) and is the dependent variable, Y.

\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]

In the equation above, \(Y_i\) = uptake for observation i, \(\beta_0\) = intercept (constant), \(\beta_1\) = slope, \(X_i\) = conc for observation i, and \(\epsilon_i\) = random error.

B.  Estimate the linear regression in R using the lm() command. 

lm(CO2_df$conc ~ CO2_df$uptake, data= CO2_df)
## 
## Call:
## lm(formula = CO2_df$conc ~ CO2_df$uptake, data = CO2_df)
## 
## Coefficients:
##   (Intercept)  CO2_df$uptake  
##         73.71          13.28
plot(CO2_df$uptake, CO2_df$conc)

library(ggplot2)

ggplot(CO2, aes(x = uptake, 
                 y = conc)) +
  geom_point(size = 2, 
             shape = 18, 
             col = "purple") +
  stat_smooth(method = lm, 
              linetype = "dashed",
              col = "red") + 
  xlab("CO2 uptake (umol/m^2 sec)") + 
  ylab("CO2 ambient concentration (mL/L)")
## `geom_smooth()` using formula = 'y ~ x'
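
The lm() call above puts conc on the left-hand side, while the equation in part A has uptake as \(Y_i\). As a minimal sketch, assuming the part A orientation is the one intended, the model can also be fit with bare column names and the data argument. The name fit_a is introduced here for illustration only.

```r
# Sketch: fit uptake (Y) on conc (X), the orientation written in part A.
# fit_a is an illustrative name, not from the original post.
CO2_df <- CO2                        # built-in dataset from the datasets package
fit_a <- lm(uptake ~ conc, data = CO2_df)

coef(fit_a)      # named vector: (Intercept) and conc (the slope)
summary(fit_a)   # standard errors, t-tests, and R-squared
```

Using bare names with data = keeps the coefficient labels short (conc rather than CO2_df$conc) and lets predict() work on new data frames.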

C.  Interpret the slope and intercept parameters.

In this fit, conc is on the left-hand side and uptake is on the right. The slope tells us that a one-unit increase in the CO2 uptake rate (i.e., a one-unit increase in X) is associated with an average increase of about 13.28 mL/L in ambient CO2 concentration. The intercept tells us that at an uptake rate of zero, the predicted ambient CO2 concentration is 73.71 mL/L. Zero uptake lies below the smallest observed value (7.70), so the intercept is an extrapolation beyond the sample.

D.  Replicate the slope and intercept parameter using the covariance/variance formulas like we did in class
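
For reference, these are the covariance/variance formulas for simple regression, written with the same symbols as the estimating equation in part A. Hats mark estimates and bars mark sample means.

\[ \hat{\beta}_1 = \frac{Cov(X, Y)}{Var(X)}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} \]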

#slope
cov(CO2_df$uptake, CO2_df$conc)/var(CO2_df$uptake)
## [1] 13.27633
#intercept
#B0 = mean(Y) - B1 (slope) * mean(X)
mean(CO2_df$conc) - (cov(CO2_df$uptake, CO2_df$conc)/var(CO2_df$uptake)) * mean(CO2_df$uptake)
## [1] 73.71

The covariance/variance formula gives the same slope as lm(), 13.28, and the intercept formula gives the same intercept, 73.71.
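
The same comparison can be done programmatically rather than by eye. This is a minimal sketch that keeps the orientation used in part B (conc on the left, uptake on the right). The names fit_b, x, y, b0, and b1 are introduced here for illustration.

```r
# Sketch: compare the hand-computed coefficients with lm() for the same fit.
CO2_df <- CO2
fit_b <- lm(conc ~ uptake, data = CO2_df)   # same orientation as part B

x <- CO2_df$uptake
y <- CO2_df$conc
b1 <- cov(x, y) / var(x)        # slope = Cov(X, Y) / Var(X)
b0 <- mean(y) - b1 * mean(x)    # intercept = mean(Y) - slope * mean(X)

# all.equal() compares within a small numerical tolerance
all.equal(unname(coef(fit_b)), c(b0, b1))
```

all.equal() is used instead of == because the two routes to the coefficients can differ in the last few floating-point digits.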

OPTIONAL ATTEMPT:

Look at the assumptions of OLS. Please skim through chapter 8 of the Open Statistics textbook, and pay attention to the Gauss-Markov assumptions part (Full Ideal Conditions of OLS). Do a few Google searches and, in less than 20 lines, try to summarize your findings.

1) Linearity - the relationship between x and y should be roughly linear
2) Nearly normal residuals - the residuals should be approximately normally distributed
3) Constant variability - the variance of the errors should be roughly constant across values of x
4) Independent observations - each observation should be independent of the others, so that one observation's error tells us nothing about another's
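
The four conditions above can be checked informally with residual plots. This is a minimal sketch, assuming the uptake-on-conc orientation from part A. The names fit_c, res, and fitted_vals are introduced here for illustration, and the plots are a quick screen, not a formal test.

```r
# Sketch: quick visual checks of the four conditions for one fitted model.
CO2_df <- CO2
fit_c <- lm(uptake ~ conc, data = CO2_df)   # illustrative fit; orientation is an assumption

res <- resid(fit_c)
fitted_vals <- fitted(fit_c)

# 1) Linearity and 3) constant variability: residuals vs. fitted values.
#    A curve suggests nonlinearity; a funnel shape suggests changing variance.
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# 2) Nearly normal residuals: points should track the reference line.
qqnorm(res)
qqline(res)

# 4) Independence: count rows per plant to see whether plants are measured repeatedly.
table(CO2_df$Plant)
```

The summary() output earlier shows 7 rows per plant, so the same plant is measured at several concentrations. That repeated-measures structure is worth keeping in mind when judging the independence condition.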