For Lab Assignment 4, you are going to conduct a correlation and a simple linear regression. We will also build on previous R experience to display the results graphically.
There are several datasets that are included with R. These datasets are easy to practice with, so we will use them here. I am going to show two examples. In each one, I will run a correlation and a simple linear regression and create a scatterplot. You do not need to run this code.
First, I am going to use the mtcars dataset to look for an association between miles per gallon (mpg) and weight (wt) of cars.
# I want to take a look at the dataset
summary(mtcars)
      mpg             cyl             disp             hp             drat
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
       wt             qsec             vs                am               gear
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000
      carb
 Min.   :1.000
 1st Qu.:2.000
 Median :2.000
 Mean   :2.812
 3rd Qu.:4.000
 Max.   :8.000
I can see that the dataset has several quantitative variables. I can also see some basic descriptive statistics for these variables.
# conducting a correlation
cor.test(mtcars$wt, mtcars$mpg)
Pearson's product-moment correlation
data: mtcars$wt and mtcars$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338264 -0.7440872
sample estimates:
cor
-0.8676594
The default option is a Pearson correlation, which is what I wanted. The output reports the 95% confidence interval and the correlation coefficient, r = -.868. This is a strong negative correlation. I can reject the null hypothesis because (a) the confidence interval does not include zero and (b) the p-value is below .05.
R uses scientific notation for very small numbers. In this case, p = 1.294e-10 = 1.294 × 10⁻¹⁰ = 0.0000000001294.
The results make intuitive sense, as we would expect heavier cars to get fewer miles per gallon.
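The values discussed above can also be pulled out of the test result directly instead of being read off the printout. This is only a minimal sketch; the object name wt_mpg_test is my own choice for illustration and is not part of the lab code.

```r
# Store the test result so its pieces can be reused (object name is illustrative)
wt_mpg_test <- cor.test(mtcars$wt, mtcars$mpg)

wt_mpg_test$estimate   # the correlation coefficient r
wt_mpg_test$conf.int   # the 95% confidence interval
wt_mpg_test$p.value    # the p-value, shown in scientific notation

# Show the p-value as an ordinary decimal instead of scientific notation
format(wt_mpg_test$p.value, scientific = FALSE)
```

Storing the result first means the test is run only once, and each number can be reported or rounded separately.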
# conducting a simple linear regression
summary(lm(mtcars$mpg ~ mtcars$wt))
Call:
lm(formula = mtcars$mpg ~ mtcars$wt)
Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
mtcars$wt    -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The results of the simple linear regression agree with the correlation. More specifically, we can see that the p-value for the slope is the same as the p-value for the correlation. R-squared is reported at the bottom as Multiple R-squared. In this case, r² = .753, which is the square of the correlation coefficient (r = -.868).
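The agreement between the two analyses can be checked in R rather than by comparing printouts. The sketch below is one way to do that; the object names (fit, fit_summary, r, p_slope, p_cor) are illustrative choices, not part of the lab code.

```r
# Fit the model once and keep the summary (object names are illustrative)
fit <- lm(mtcars$mpg ~ mtcars$wt)
fit_summary <- summary(fit)

# Pearson correlation between the same two variables
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared from the regression compared with the squared correlation
all.equal(fit_summary$r.squared, r^2)

# p-value for the slope (second row of the coefficient table)
# compared with the p-value from the correlation test
p_slope <- fit_summary$coefficients[2, "Pr(>|t|)"]
p_cor <- cor.test(mtcars$wt, mtcars$mpg)$p.value
all.equal(p_slope, p_cor)
```

Both comparisons should return TRUE. all.equal() is used instead of == because it allows for tiny rounding differences between the two calculations.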
I will make a scatterplot to visualize the relationship.
# creating a simple scatterplot
plot(x=mtcars$wt, y=mtcars$mpg)
Of course, R makes it pretty easy to customize our plots. Let’s make a prettier graph.
# creating a prettier scatterplot
plot(x=mtcars$wt, y=mtcars$mpg, xlab="Weight", ylab="Miles per Gallon", main="My Graph")
abline(lm(mtcars$mpg~mtcars$wt), col="red") # adds a regression line
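The regression line above is drawn by refitting the model inside abline(). An alternative, sketched here as an option rather than a required approach, is to fit the model once, store it, and reuse it. The object name mpg_fit is an assumption made for this example.

```r
# Fit once using a formula with a data argument (object name is illustrative)
mpg_fit <- lm(mpg ~ wt, data = mtcars)

# The same formula can be given to plot(), so the columns are named only once
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight", ylab = "Miles per Gallon", main = "My Graph")

# abline() accepts the stored model and draws its fitted line
abline(mpg_fit, col = "red")
```

This produces the same line as before. The stored model can also be passed to summary() without fitting it a second time.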
Next, I am going to run the same code as above but for a different problem. I will use the iris dataset to look for an association between the length (Petal.Length) and width (Petal.Width) of a flower petal.
# I want to take a look at the dataset
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
I can see that the dataset has several quantitative variables and a categorical variable (Species). I can also see some basic descriptive statistics for these variables.
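The split between quantitative and categorical variables can also be confirmed with code instead of by reading the summary. This is a small sketch; the object name is_quantitative is my own for illustration.

```r
# TRUE for quantitative columns, FALSE otherwise (object name is illustrative)
is_quantitative <- sapply(iris, is.numeric)
is_quantitative

# Species is stored as a factor, which is how R represents a categorical variable
class(iris$Species)
levels(iris$Species)
```

A correlation only makes sense between two of the columns flagged as quantitative.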
# conducting a correlation
cor.test(iris$Petal.Length, iris$Petal.Width)
Pearson's product-moment correlation
data: iris$Petal.Length and iris$Petal.Width
t = 43.387, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9490525 0.9729853
sample estimates:
cor
0.9628654
I can see that r = .963. This is a very strong positive correlation. I can reject the null hypothesis because (a) the confidence interval does not include zero and (b) the p-value is below .05. These results also make intuitive sense, as we would expect longer petals to be wider.
# conducting a simple linear regression
summary(lm(iris$Petal.Width ~ iris$Petal.Length)) # response variable ~ explanatory variable
Call:
lm(formula = iris$Petal.Width ~ iris$Petal.Length)
Residuals:
     Min       1Q   Median       3Q      Max
-0.56515 -0.12358 -0.01898  0.13288  0.64272
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.363076   0.039762  -9.131  4.7e-16 ***
iris$Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2065 on 148 degrees of freedom
Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
As is always the case with a simple linear regression, the results agree with the correlation. More specifically, we can see that the p-value for the slope is the same as the p-value for the correlation. R-squared is reported at the bottom as Multiple R-squared. In this case, r² = .927.
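The slope in the output above (about 0.416) says that each additional unit of petal length is associated with about 0.416 more units of petal width, on average. The sketch below shows what that means for a single prediction. The object names and the petal length of 4 are assumptions chosen only for illustration.

```r
# Refit with a data argument so predict() can take new data (names are illustrative)
petal_fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(petal_fit)   # intercept and slope, matching the Estimate column above

# Predicted width for a hypothetical petal of length 4 (value chosen for illustration)
new_petal <- data.frame(Petal.Length = 4)
predicted_width <- predict(petal_fit, newdata = new_petal)
predicted_width

# The same prediction by hand: intercept + slope * length
unname(coef(petal_fit)[1] + coef(petal_fit)[2] * 4)
```

The last two results should match, since predict() applies the fitted equation to the new value.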
Now I will create a nice plot to go with my results.
# creating a scatterplot
plot(x=iris$Petal.Length, y=iris$Petal.Width, xlab="Length", ylab="Width", main="Iris")
abline(lm(iris$Petal.Width~iris$Petal.Length), col="black") # adds a regression line
For your assignment, you will conduct a correlation and a simple linear regression and create a scatterplot to visualize your results. You will be using the women dataset. You can view the dataset and see its columns by running summary(women). Please submit your assignment on Tartan by midnight on Wednesday, November 21.
Directions:
Conduct a correlation analysis to look for an association between women’s height and weight.
What decision can we make regarding the null hypothesis? Explain how you came to this conclusion.
Conduct a simple linear regression. Explain the results of the regression. Be sure to discuss the significance of the slope, the coefficient of correlation, and our general conclusions in plain English.
Create a scatterplot. Your plot should be customized and contain a regression line.
Paste your R code. I do not need to see the results of the tests, only the code that you used to generate the results.