Part I. Dataset 1 - ToothGrowth

Loading Dataset

df1 <- ToothGrowth

Brief Overview of the Data

?ToothGrowth
summary(df1)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

The ToothGrowth dataset is a cross-sectional dataset that explores the effect of vitamin C on tooth growth in guinea pigs. It has 60 observations and three variables: len (tooth length), supp (supplement type, either OJ for orange juice or VC for ascorbic acid) and dose (dose in milligrams/day, taking the values 0.5, 1 and 2).
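
As a quick check of the description above, the following sketch (base R only) confirms the dimensions and column types and counts the observations in each supplement-by-dose cell. The name cell_counts is introduced here for illustration only.

```r
# ToothGrowth ships with base R (datasets package)
df1 <- ToothGrowth

# number of rows and columns
dim(df1)

# column types: len and dose are numeric, supp is a factor
str(df1)

# observations per supplement type and dose level
cell_counts <- table(df1$supp, df1$dose)
cell_counts
```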

#check for missing data in each column 
sapply(df1, function(x) sum(is.na(x)))
##  len supp dose 
##    0    0    0

The dataset has no missing values in any column, so no cleaning or imputation is needed.

Data Visualization

library(ggplot2)

#scatterplot of tooth length against dose, colored by supplement type
ggplot(df1, aes(x = dose, y = len, color = supp)) +
  geom_point() +
  ggtitle("Scatterplot of Length by Dose") +
  xlab("Dose (mg/day)") + ylab("Tooth length")

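#dose is converted to a factor so that each dose level gets its own pair of boxes
ggplot(df1, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot() +
  ggtitle("Grouped Boxplot of Length by Dose") +
  xlab("Dose (mg/day)") + ylab("Tooth length")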
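
The group differences shown in the plots can also be tabulated. This is a minimal sketch using base R's aggregate(); the name group_means is introduced here for illustration only.

```r
df1 <- ToothGrowth

# mean tooth length for every supplement type and dose combination
group_means <- aggregate(len ~ supp + dose, data = df1, FUN = mean)
group_means
```

Each row of group_means corresponds to one box in the grouped boxplot, which makes it easier to compare OJ and VC at a given dose.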

Dataset 2 - PepperPrice

Loading Dataset from AER

data("PepperPrice", package = "AER")
df2 <- as.data.frame(PepperPrice)

The PepperPrice dataset from AER is a monthly time series with two variables, black and white. It records the average monthly European spot prices for black and white pepper (fair average quality) in US dollars per ton from October 1973 to April 1996.
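
The structure and time span of the data can be inspected directly. This sketch assumes the AER package is installed; n_months is a name introduced here for illustration only.

```r
data("PepperPrice", package = "AER")

# the object is a time series, not a plain data frame
is.ts(PepperPrice)

# observations per year, first period and last period
frequency(PepperPrice)
start(PepperPrice)
end(PepperPrice)

# number of months between October 1973 and April 1996, inclusive
n_months <- length(seq(as.Date("1973-10-01"), as.Date("1996-04-01"), by = "month"))
nrow(PepperPrice) == n_months
```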

Brief Overview of the Data

head(df2)
##      black   white
## 1  884.050 1419.78
## 2  919.329 1503.55
## 3  930.350 1536.62
## 4 1102.310 1629.22
## 5 1150.810 1737.24
## 6 1093.490 1629.22
#check for missing data in each column 
sapply(df2, function(x) sum(is.na(x)))
## black white 
##     0     0

Data Visualization

#adding date column to time series
df2$Date <- seq(from = as.Date("1973-10-01"), to = as.Date("1996-04-01"), by = 'month')
ggplot(df2, aes(x=Date, y=black)) +
  geom_line() +
  ggtitle("Average Monthly European Spot prices for Black Pepper in USD per ton") +
  xlab("Date") + ylab("Price")

ggplot(df2, aes(x=Date, y=white)) +
  geom_line() +
  ggtitle("Average Monthly European Spot prices for White Pepper in USD per ton") +
  xlab("Date") + ylab("Price")

#plot the original ts object: df2 also holds the Date column, which would be drawn as a third series
ts.plot(PepperPrice,
        gpars = list(xlab = "Year",
                     ylab = "Price",
                     main = "Average Black and White Pepper Spot Prices in USD per ton",
                     lwd = 2,
                     lty = 1:2,
                     col = c("black", "red")))
legend("topleft", legend=c("Black", "White"), lwd = 2, lty = 1:2, col = c("black","red"), cex=.9)
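
Both series can also be drawn on one ggplot by first stacking the prices into long format. This is a sketch rather than part of the original analysis: it assumes AER and ggplot2 are installed, and the names prices, dates, long and p are introduced here for illustration only.

```r
library(ggplot2)
data("PepperPrice", package = "AER")

prices <- as.data.frame(PepperPrice)

# one date per monthly observation, starting in October 1973
dates <- seq(as.Date("1973-10-01"), by = "month", length.out = nrow(prices))

# long format: one row per date and pepper type
long <- data.frame(
  Date  = rep(dates, times = 2),
  type  = rep(c("black", "white"), each = nrow(prices)),
  price = c(prices$black, prices$white)
)

p <- ggplot(long, aes(x = Date, y = price, color = type)) +
  geom_line() +
  ggtitle("Average Black and White Pepper Spot Prices in USD per ton") +
  xlab("Date") + ylab("Price")
print(p)
```

Mapping color to type lets ggplot build the legend itself, so no manual legend placement is needed.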

Part II.

What is covariance?

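Covariance measures how two random variables vary together. It is positive when above-average values of one variable (x) tend to occur with above-average values of the other (y), negative when they tend to move in opposite directions, and close to zero when there is no linear relationship. Its magnitude depends on the units of both variables.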

What is variance?

Variance represents how spread out the data are: it is the average squared distance of the data points from their mean. The variance of x is the covariance of x with itself.
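
Both definitions can be checked by hand against R's built-in functions. This sketch uses the built-in cars data; the names cov_manual and var_manual are introduced here for illustration only. Note that cov() and var() use the sample versions, which divide by n - 1.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# sample covariance: average product of deviations from the means
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# sample variance: average squared deviation from the mean
var_manual <- sum((x - mean(x))^2) / (n - 1)

# both comparisons should return TRUE
all.equal(cov_manual, cov(x, y))
all.equal(var_manual, var(x))

# the variance of x is the covariance of x with itself
all.equal(var(x), cov(x, x))
```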

Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?

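The slope coefficient reflects the expected change in the dependent variable (y) when the independent variable (x) increases by one unit. Ordinary least squares chooses the slope that minimizes the sum of squared residuals, and solving that minimization gives slope = cov(x, y) / var(x). Intuitively, the covariance captures how x and y move together, but it is expressed in units of x times units of y; dividing by var(x), which is in squared units of x, rescales it into units of y per unit of x. Covariance and correlation are similar in that both measure association, but the correlation coefficient is standardized and covariance is not. The correlation coefficient divides the covariance by the product of the square roots of var(x) and var(y), so it always lies between -1 and 1, where 1 or -1 indicates a perfect linear relationship. The slope and the correlation therefore share the same sign, and slope = correlation * sd(y) / sd(x).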
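
The result can be derived in a few lines. The sketch below assumes the usual simple regression setup with intercept b_0 and slope b_1 (notation introduced here, not taken from the text above) and the sample versions of covariance and variance.

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial S}{\partial b_0} = 0 \;&\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial S}{\partial b_1} = 0 \;&\Rightarrow\; \sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) = 0 \\
\text{substituting } b_0 \;&\Rightarrow\; \sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \right] = 0 \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
     = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
     = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}
\end{align*}
```

The last line uses the facts that \sum (y_i - \bar{y}) = 0 and \sum (x_i - \bar{x}) = 0, so replacing x_i by x_i - \bar{x} in each sum changes nothing.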

Demonstration

df3 <- cars
#build regression
lm(dist~speed, df3)
## 
## Call:
## lm(formula = dist ~ speed, data = df3)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
#slope from the formula cov(x, y) / var(x)
slope <- round(cov(df3$dist, df3$speed)/var(df3$speed), 3)
print(paste("By using the formula, the slope coefficient is", slope, "which matches the value from the regression."))
## [1] "By using the formula, the slope coefficient is 3.932 which matches the value from the regression."
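
The same slope can be reached through the correlation coefficient, and the intercept follows from the two means. This is a minimal sketch on the cars data; the names slope_cov, slope_cor, fit and intercept_manual are introduced here for illustration only.

```r
x <- cars$speed
y <- cars$dist

# slope from covariance and variance
slope_cov <- cov(x, y) / var(x)

# slope from the correlation, rescaled by the two standard deviations
slope_cor <- cor(x, y) * sd(y) / sd(x)

# coefficients from the fitted regression
fit <- lm(dist ~ speed, data = cars)

# intercept implied by the slope: the fitted line passes through the means
intercept_manual <- mean(y) - slope_cov * mean(x)

# all three comparisons should return TRUE
all.equal(slope_cov, slope_cor)
all.equal(slope_cov, unname(coef(fit)["speed"]))
all.equal(intercept_manual, unname(coef(fit)["(Intercept)"]))
```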