Instructions for the assignment
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing a chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
To answer each question, run the code chunk provided for it. The notebook is set up so that each chunk asks you to enter specific instructions.
For this assignment, you will need the ggplot2 package, so make sure you load it. You should have installed it for R Assignments 1 and 2.
#install.packages("ggplot2")
library(ggplot2)
P1: Dealing with the dataset called “women”:
a. Find the correlation between women’s height measured in inches and weight measured in pounds (2 points)
b. Find the correlation between women’s height measured in centimeters (i.e., multiply their data, which is in inches, by 2.5) and weight measured in pounds (2 points)
c. Find the correlation between women’s weight measured in pounds and women’s height measured in inches (2 points)
d. Discuss the differences between the correlation coefficients in parts a, b, and c (4 points)
###### part a ######
data(women)
correlation_height_in_weight <- cor(women$height, women$weight)
print(paste("Correlation between height in inches and weight:", correlation_height_in_weight))
## [1] "Correlation between height in inches and weight: 0.995494767784216"
###### part b ######
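# 1 inch = 2.54 cm exactly; the prompt's 2.5 is a rounding, and any positive factor gives the same correlation
height_cm <- women$height * 2.54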
correlation_height_cm_weight <- cor(height_cm, women$weight)
print(paste("Correlation between height in centimeters and weight:", correlation_height_cm_weight))
## [1] "Correlation between height in centimeters and weight: 0.995494767784216"
###### part c ######
correlation_weight_height_in <- cor(women$weight, women$height)
print(paste("Correlation between weight in pounds and height in inches:", correlation_weight_height_in))
## [1] "Correlation between weight in pounds and height in inches: 0.995494767784216"
###### part d ######
print("The correlations in parts a, b, and c are identical. Changing the unit of measurement from inches to centimeters rescales the height values but does not alter the relationship between height and weight, and swapping the order of the two variables in part c does not matter because correlation is symmetric.")
## [1] "The correlations in parts a, b, and c are identical. Changing the unit of measurement from inches to centimeters rescales the height values but does not alter the relationship between height and weight, and swapping the order of the two variables in part c does not matter because correlation is symmetric."
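The invariance described in part d can be checked directly. The sketch below is illustrative only: the shift of 10 and the pound-to-kilogram factor are arbitrary values chosen for the demonstration, not part of the assignment.

```r
data(women)
r_base <- cor(women$height, women$weight)

# Positive linear rescaling of either variable (a * x + b with a > 0) leaves r unchanged
r_cm      <- cor(women$height * 2.54, women$weight)
r_shifted <- cor(women$height * 2.54 + 10, women$weight * 0.4536)  # arbitrary shift; lb -> kg

# Swapping the arguments leaves r unchanged because correlation is symmetric
r_swapped <- cor(women$weight, women$height)

# Multiplying one variable by a negative constant flips the sign but keeps the magnitude
r_negated <- cor(-women$height, women$weight)

print(all.equal(r_base, r_cm))
print(all.equal(r_base, r_shifted))
print(all.equal(r_base, r_swapped))
print(all.equal(r_base, -r_negated))
```

Each comparison uses all.equal rather than == so that tiny floating-point differences do not register as a mismatch.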
P2: Focusing on the dataset called “mtcars”:
a. Produce a scatterplot with weight (wt) of car on x axis and mpg of car on y axis (2 points)
b. Calculate the correlation coefficient between weight and mpg (2 points)
c. Produce a scatterplot with weight (wt) of car on x axis and mpg of car on y axis AND add a linear regression line with the geom_smooth(method = lm, formula = 'y ~ x', se = FALSE) command. Hint: this will add a line of best fit to the data similar to a correlation coefficient. (2 points)
d. Comment on whether the correlation coefficient is a good measure of the relationship between weight and mpg (4 points)
###### part a ######
data(mtcars)
# Scatterplot of weight (wt) vs mpg
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (wt)",
     ylab = "Miles per Gallon (mpg)",
     main = "Scatterplot of Weight vs. MPG")
###### part b ######
correlation_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
print(paste("Correlation between weight and mpg:", correlation_wt_mpg))
## [1] "Correlation between weight and mpg: -0.867659376517228"
###### part c ######
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
  labs(x = "Weight (wt)", y = "Miles per Gallon (mpg)", title = "Weight vs MPG with Regression Line")
###### part d ######
print("The correlation coefficient between weight (wt) and miles per gallon (mpg) is negative, about -0.87. This reflects a strong inverse relationship: heavier cars tend to have lower mpg and lighter cars tend to have higher mpg. The coefficient is a reasonable summary of the linear trend shown by the regression line, but it measures only linear association, so it would not capture any non-linear pattern in the relationship.")
## [1] "The correlation coefficient between weight (wt) and miles per gallon (mpg) is negative, about -0.87. This reflects a strong inverse relationship: heavier cars tend to have lower mpg and lighter cars tend to have higher mpg. The coefficient is a reasonable summary of the linear trend shown by the regression line, but it measures only linear association, so it would not capture any non-linear pattern in the relationship."
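One way to probe whether a straight line is adequate is to compare the linear fit with a slightly more flexible one. The sketch below is a side check, not part of the assignment: the quadratic term and the Spearman coefficient are just two convenient choices of alternative, assumed here for illustration.

```r
data(mtcars)

# For simple linear regression, R-squared equals the squared correlation coefficient
r <- cor(mtcars$wt, mtcars$mpg)
fit_linear <- lm(mpg ~ wt, data = mtcars)
r_squared_linear <- summary(fit_linear)$r.squared

# Add a quadratic term to see whether curvature explains extra variation
fit_quadratic <- lm(mpg ~ wt + I(wt^2), data = mtcars)
r_squared_quadratic <- summary(fit_quadratic)$r.squared

cat("r^2 from cor():      ", r^2, "\n")
cat("R^2, linear model:   ", r_squared_linear, "\n")
cat("R^2, quadratic model:", r_squared_quadratic, "\n")

# Spearman's rank correlation measures monotonic rather than strictly linear association
rho <- cor(mtcars$wt, mtcars$mpg, method = "spearman")
cat("Spearman rho:", rho, "\n")
```

If the quadratic R^2 comes out noticeably higher than the linear one, that would suggest some curvature that the correlation coefficient alone does not describe.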
P3. The sample mean and standard deviation for the fill weights of 100 boxes are X̅ = 10 and s = 0.1. Construct a 90% confidence interval for the population mean of fill weights. (10 points)
*Hint: you must use the qnorm function to find the z value!
X_bar <- 10 # Sample mean
s <- 0.1 # Sample standard deviation
n <- 100 # Sample size
# Calculate the Z value for a 90% confidence level
z <- qnorm(0.95) # 90% confidence interval, so we use 0.95 to get the critical z value
margin_of_error <- z * (s / sqrt(n))
lowerbound <- X_bar - margin_of_error
upperbound <- X_bar + margin_of_error
cat("90% Confidence Interval for the population mean of fill weights:",
"(", lowerbound, ",", upperbound, ")")
## 90% Confidence Interval for the population mean of fill weights: ( 9.983551 , 10.01645 )
print(paste("For a 90% Confidence Interval for the population mean the lower and upper bounds are:",
"(", lowerbound, ",", upperbound, ")"))
## [1] "For a 90% Confidence Interval for the population mean the lower and upper bounds are: ( 9.98355146373049 , 10.0164485362695 )"
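For comparison, the same interval can be computed with the t distribution, which is a common choice when the population standard deviation is estimated by s. This is a side check rather than what the assignment asks for (the hint specifies qnorm); with n = 100 the two intervals are expected to be very close.

```r
X_bar <- 10   # sample mean
s <- 0.1      # sample standard deviation
n <- 100      # sample size
standard_error <- s / sqrt(n)

z_crit <- qnorm(0.95)            # critical value used in the assignment
t_crit <- qt(0.95, df = n - 1)   # t critical value with n - 1 degrees of freedom

z_interval <- X_bar + c(-1, 1) * z_crit * standard_error
t_interval <- X_bar + c(-1, 1) * t_crit * standard_error

cat("z interval:", z_interval, "\n")
cat("t interval:", t_interval, "\n")
```

The t interval is slightly wider because the t critical value exceeds the z critical value for any finite sample size.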
P4. Write a function that takes a vector and returns a confidence interval of the user’s choice for the mean/average value of that vector. In other words, write a function to produce a confidence interval for the mean for large samples (n>30). Assume you will always get a dataset with n>30, therefore you will use the z score. You can return the interval as a vector of length two: the lower bound and the upper bound. As a test to ensure your function works, calculate the 95% interval for the testdata shown in the code chunk below. (10 points)
#write the function
conf.int <- function(data.vector, conf.coeff) {
  n <- length(data.vector)     # Sample size
  X_bar <- mean(data.vector)   # Sample mean
  s <- sd(data.vector)         # Sample standard deviation
  # Calculate the z-score for the specified confidence level
  z <- qnorm((1 + conf.coeff) / 2)
  # Calculate the margin of error
  margin_of_error <- z * (s / sqrt(n))
  lowerbound <- X_bar - margin_of_error
  upperbound <- X_bar + margin_of_error
  return(c(lowerbound, upperbound))
}
#test the function
testdata <- seq(from = 1,to = 50,by = 1)
ci <- conf.int(testdata, 0.95)
print(paste("95% Confidence Interval for the mean:", "(", ci[1], ",", ci[2], ")"))
## [1] "95% Confidence Interval for the mean: ( 21.4594307346674 , 29.5405692653326 )"
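A further sanity check is to simulate many samples from a population with a known mean and count how often the interval contains it. The sketch below is illustrative only: z_interval_for_mean is a hypothetical stand-in that repeats the logic of conf.int so the block runs on its own, and the normal population (mean 5, sd 2), the sample size of 50, the 2000 repetitions, and the seed are all arbitrary assumptions.

```r
# Hypothetical stand-in that repeats the logic of conf.int so this block runs on its own
z_interval_for_mean <- function(data.vector, conf.coeff) {
  margin <- qnorm((1 + conf.coeff) / 2) * sd(data.vector) / sqrt(length(data.vector))
  mean(data.vector) + c(-1, 1) * margin
}

set.seed(1)        # arbitrary seed, for reproducibility only
true_mean <- 5     # assumed population mean
n_reps <- 2000     # number of simulated samples

# For each simulated sample, record whether the 95% interval contains the true mean
covered <- replicate(n_reps, {
  sample_values <- rnorm(50, mean = true_mean, sd = 2)
  ci <- z_interval_for_mean(sample_values, 0.95)
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covered)
cat("Empirical coverage of the 95% interval:", coverage, "\n")
```

The proportion should land near 0.95, and it will often sit slightly below because the z critical value is a little smaller than the corresponding t value at n = 50.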