Linear Regression Instruction

Author

Tori Cohen

Directory

Introduction
Data
Conclusion

Introduction

A linear regression model is simple for visualizing a predicted continuous decline or increase in values from a data set. The line in the middle of the values is titled the ‘line of best fit’. The line of best fit helps to identify trends and make predictions based on other values.

The data set used in this example will be “Plant Growth Rate.” The data set provides values based on soil moisture content and plant growth. We will be observing if there is a linear regression based on the plant growth rate and if there is a relation to soil moisture.

Data

getwd() #Set working directory

[1] "C:/Users/toric/Downloads/IntroR"

rm(list=ls()) #Clear environment 

#Load packages needed
library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggfortify)

#Load in the data set
dat=read.csv("plant.growth.rate.csv")

#Confirm data looks correct
head(dat)

  soil.moisture.content plant.growth.rate
1             0.4696876          21.31695
2             0.5413106          27.03072
3             1.6979915          38.98937
4             0.8255799          30.19529
5             0.8568173          37.06547
6             1.6082405          43.24982

#Create a plot to visualize the data
ggplot(dat, aes(x=soil.moisture.content, y=plant.growth.rate))+
  geom_point()+
  labs(
    title="Plant Growth Rate on Soil Moisture")+
  ylab("Plant Growth Rate")+
  xlab("Soil Moisture Content")+
  theme_classic()

#This is fitting a linear model where we hypothesize that plant growth 
#is a function of soil moisture.
plantmod=lm(plant.growth.rate~soil.moisture.content, data=dat)

#Assume linearity
#Extract the residuals 
res=residuals(plantmod)

#Extract fitted values 
fit=fitted(plantmod)

#Plot the risuals vs. fitted graph, confirm the data looks great
plot(fit,res, 
     xlab="Fitted Values", 
     ylab="Residuals", 
     main="Residuals vs. Fitted")

#Assume normality of residuals 
qqnorm(res)

#Continue to test the residuals
res

         1          2          3          4          5          6          7 
-4.0198066  0.7808059 -2.0076992  0.3210690  6.7929909  3.3970364  0.1311503 
         8          9         10         11         12         13         14 
-0.4581096  8.9406131  3.3933698  0.8979810 -1.6402935  7.4585934  7.2955623 
        15         16         17         18         19         20         21 
 3.6084259 -3.8243724  2.3209847 -0.6756935 -3.3841089  2.7686235  0.5398590 
        22         23         24         25         26         27         28 
-6.9659510 -1.8477504 -8.9089455 -4.9493965 -3.8811675 -2.5073217  0.8555967 
        29         30         31         32         33         34         35 
 2.2950828 -4.8776931  0.7980031  1.8054703  0.3954686  1.7232572  3.3405153 
        36         37         38         39         40         41         42 
-4.6746801 -0.7630024 -7.4497824 -4.3350043  5.6439678 -3.3074214  4.5818591 
        43         44         45         46         47         48         49 
-1.1608097 -0.2075692 -2.0535254 -3.2638760  4.0751486 -2.3894415  3.4309003 
        50 
 1.9610870

shapiro.test(res)


    Shapiro-Wilk normality test

data:  res
W = 0.99126, p-value = 0.971

#This gives us the intercept and slope aka the 'b' and 'm' in y=mx+b
plantmod


Call:
lm(formula = plant.growth.rate ~ soil.moisture.content, data = dat)

Coefficients:
          (Intercept)  soil.moisture.content  
                19.35                  12.75

#Use the summary when writing a conclusion on the data
summary(plantmod)


Call:
lm(formula = plant.growth.rate ~ soil.moisture.content, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9089 -3.0747  0.2261  2.6567  8.9406 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             19.348      1.283   15.08   <2e-16 ***
soil.moisture.content   12.750      1.021   12.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.019 on 48 degrees of freedom
Multiple R-squared:  0.7648,    Adjusted R-squared:  0.7599 
F-statistic: 156.1 on 1 and 48 DF,  p-value: < 2.2e-16

ggplot(dat, aes(x=soil.moisture.content, y=plant.growth.rate))+
  geom_point()+
  labs(
    title="Plant Growth Rate on Soil Moisture")+
  ylab("Plant Growth Rate")+
  xlab("Soil Moisture Content")+
  geom_smooth(method="lm")+
  theme_classic()

`geom_smooth()` using formula = 'y ~ x'

Conclusion

Using the packages ggplot2, dplyr, and ggfortify, it is possible to create a linear regression model and test for linearity using a shapiro test. Creating the first ggplot, there is a possible relationship between plant growth rate and soil moisture content and that relationship looks to be positive. Creating a scatter plot graph, the lack of pattern confirms the data has a correspondence. The shapiro test showed normal residuals. Looking at the plot we made for assessing linearity, there is no wedge, so it is good to proceed.

Based on the summary of the plantmod data, soil moisutre had a positive effect on plant growth. For each unity increase in soil moisture, plant growth increased by 12.7 mm/week (p<0.01). The last plot created shows the shaded region representing the 95% confidence interval for the mean predicted plant growth at a given soil moisture level. In conclusion, we can confirm there is a relationship between plant growth and soil moisture.