df1 <- ToothGrowth
summary(df1)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
?ToothGrowth
The ToothGrowth dataset is cross-sectional data on the effect of vitamin C on tooth growth in guinea pigs. It has 60 observations and three variables: tooth length (len), supplement type (supp: OJ for orange juice, VC for ascorbic acid) and dose in milligrams per day (dose).
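As a quick check of the description above, the following sketch uses only base R (ToothGrowth ships with the built-in datasets package) to confirm the dimensions and to show how the observations are spread across supplement type and dose.

```r
# ToothGrowth is part of R's built-in datasets package
df1 <- ToothGrowth

# Rows (observations) and columns (len, supp, dose)
dim(df1)

# Number of observations in each supplement-by-dose cell
table(df1$supp, df1$dose)
```

The design is balanced: each of the six supplement-by-dose combinations holds 10 observations, which adds up to the 60 rows.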
#check for missing data in each column
sapply(df1, function(x) sum(is.na(x)))
## len supp dose
## 0 0 0
The dataset is ready to use because no column has missing values.
library(ggplot2)
ggplot(df1, aes(x = dose, y = len, color = supp)) +
  geom_point() +
  ggtitle("Scatterplot of Length by Dose") +
  xlab("Dose (mg/day)") + ylab("Tooth length")
#dose is numeric, so convert it to a factor to get one box per dose level
ggplot(df1, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot() +
  ggtitle("Grouped Boxplot of Length by Dose") +
  xlab("Dose (mg/day)") + ylab("Tooth length")
data("PepperPrice", package = "AER")
df2 <- as.data.frame(PepperPrice)
The PepperPrice dataset from the AER package is a monthly time series with two variables: the average European spot prices for black and white pepper (fair average quality), in US dollars per ton, from October 1973 to April 1996.
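A small sanity check on the span stated above, using base R only: it counts the months from October 1973 through April 1996. The name `dates` is introduced here for illustration. A date column built this way only lines up with the series if the series has the same number of rows.

```r
# Monthly dates covering October 1973 through April 1996
dates <- seq(from = as.Date("1973-10-01"), to = as.Date("1996-04-01"), by = "month")

# Number of months in the span
length(dates)

# First and last month as a spot check
range(dates)
```

The span covers 271 months, so comparing this length against the number of rows in the data is a cheap guard before attaching the dates.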
head(df2)
## black white
## 1 884.050 1419.78
## 2 919.329 1503.55
## 3 930.350 1536.62
## 4 1102.310 1629.22
## 5 1150.810 1737.24
## 6 1093.490 1629.22
#check for missing data in each column
sapply(df2, function(x) sum(is.na(x)))
## black white
## 0 0
#adding date column to time series
df2$Date <- seq(from = as.Date("1973-10-01"), to = as.Date("1996-04-01"), by = "month")
ggplot(df2, aes(x = Date, y = black)) +
  geom_line() +
  ggtitle("Average Monthly European Spot Prices for Black Pepper in USD per ton") +
  xlab("Date") + ylab("Price (USD per ton)")
ggplot(df2, aes(x = Date, y = white)) +
  geom_line() +
  ggtitle("Average Monthly European Spot Prices for White Pepper in USD per ton") +
  xlab("Date") + ylab("Price (USD per ton)")
#plot the original ts object so the Date column added to df2 is not drawn as a series
ts.plot(PepperPrice,
  gpars = list(xlab = "Year",
    ylab = "Price (USD per ton)",
    main = "Average Black and White Pepper Spot Prices in USD per ton",
    lwd = 2,
    lty = 1:2,
    col = c("black", "red")))
legend("topleft", legend = c("Black", "White"), lwd = 2, lty = 1:2, col = c("black", "red"), cex = .9)
What is covariance?
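Covariance measures how two random variables vary together. It is positive when above-average values of one variable (x) tend to occur with above-average values of the other (y), negative when the two tend to move in opposite directions, and near zero when there is no linear relationship. Its magnitude depends on the units of x and y.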
What is variance?
Variance measures how spread out the data are: it is the average squared distance of each data point from the mean. It is also the covariance of a variable with itself.
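The two definitions above can be checked by hand. The sketch below uses the built-in cars data and the sample (n - 1) versions of both quantities, which is what R's var() and cov() compute; the names `var_by_hand` and `cov_by_hand` are introduced here for illustration.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Sample covariance: sum of products of paired deviations, divided by n - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Both comparisons should return TRUE
all.equal(var_by_hand, var(x))
all.equal(cov_by_hand, cov(x, y))
```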
Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?
The slope coefficient reflects the expected change in the dependent variable (y) when the independent variable (x) increases by one unit. Covariance captures how x and y move together, but in units of x times units of y. Dividing by the variance of x, which is in units of x squared, rescales it to units of y per unit of x, and this ratio is exactly the slope that minimizes the sum of squared residuals in a simple linear regression. The correlation coefficient is closely related: both it and the covariance measure association, but the correlation is standardized and the covariance is not. Its formula divides the covariance by the square root of var(x) times var(y), which gives a unitless scale on which a coefficient of 1 or -1 indicates a perfect linear relationship.
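The argument can also be written out as a short derivation. This is the standard least-squares result for a single predictor; the symbols $\hat\beta_0$, $\hat\beta_1$, $\bar x$, $\bar y$, $r_{xy}$, $s_x$ and $s_y$ are notation introduced here, not taken from the text above.

```latex
\begin{aligned}
(\hat\beta_0, \hat\beta_1)
  &= \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 \\
\text{first-order condition in } \beta_0: \quad
  & \hat\beta_0 = \bar y - \hat\beta_1 \bar x \\
\text{first-order condition in } \beta_1: \quad
  & \sum_{i=1}^{n} (x_i - \bar x)\left( y_i - \bar y - \hat\beta_1 (x_i - \bar x) \right) = 0 \\
\Rightarrow \quad \hat\beta_1
  &= \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}
   = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}
   = r_{xy}\,\frac{s_y}{s_x}
\end{aligned}
```

The $n - 1$ denominators of the sample covariance and sample variance cancel in the ratio, so the result is the same whether the sums or the sample statistics are used.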
Demonstration
df3 <- cars
#build regression
lm(dist~speed, df3)
##
## Call:
## lm(formula = dist ~ speed, data = df3)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
#slope from the covariance/variance formula
slope <- round(cov(df3$dist, df3$speed)/var(df3$speed), 3)
print(paste("By using the formula, the slope coefficient is", slope, "which matches the value from the regression."))
## [1] "By using the formula, the slope coefficient is 3.932 which matches the value from the regression."
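The match can also be checked programmatically rather than by comparing rounded printouts. The following is a minimal sketch on the same cars data; the names `fit`, `slope_formula` and `slope_from_cor` are introduced here for illustration. The second calculation uses the equivalent form of the slope as the correlation rescaled by the ratio of standard deviations.

```r
fit <- lm(dist ~ speed, data = cars)

# Slope from the covariance/variance formula, without rounding
slope_formula <- cov(cars$dist, cars$speed) / var(cars$speed)

# Equivalent form: correlation times sd(y) / sd(x)
slope_from_cor <- cor(cars$dist, cars$speed) * sd(cars$dist) / sd(cars$speed)

# Compare each against the fitted coefficient; both should return TRUE
all.equal(slope_formula, coef(fit)[["speed"]])
all.equal(slope_from_cor, coef(fit)[["speed"]])
```

all.equal() compares within a small numerical tolerance, which is safer than testing floating-point values with ==.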