STEP 1 VIEW THE DATA SET

(1)Quick view the data

head(iris, n=10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

(2)Check structure of iris

str(iris)          
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

(3)Summary of basic statistic info of the data

summary(iris)      
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

STEP 2 DESCRIPTIVE ANALYSIS- VISUALIZATION

(1)Lets test if there is any relation between Sepal.Length and Sepal.Width

sepal_ggplot <- ggplot(data=iris) + 
  geom_point(aes(x=Sepal.Length, y=Sepal.Width)) + 
  theme_bw() 

sepal_ggplot

There is no obvious relation between Sepal.Length and Sepal.Width

(2) Lets test if there is any relation between Petal.Length and Petal.Width

petal_ggplot <- ggplot(data=iris) + 
  geom_point(aes(x=Petal.Length, y=Petal.Width)) +  
  theme_bw()                                  

petal_ggplot

There is a positive relation between Petal.Length and Petal.Width

There are also some clusters in this regression line. Maybe its caused by Species

petal_ggplot2 <- ggplot(data=iris) + 
  geom_point(aes(x=Petal.Length, y=Petal.Width, color=Species)) +  
  theme_bw()
petal_ggplot2

(3) Visualize the data using box plot

petal_ggbp <- qplot(x=Petal.Length,      
      y=Petal.Width, 
      data=iris, 
      geom="boxplot",    # graph type is boxplot
      color=Species)

petal_ggbp

STEP 3 HANDLE MISSING DATA

#is.na(iris)
table(is.na(iris))
## 
## FALSE 
##   750

There is no missing data

STEP 4 REGRESSION ANALYSIS

model <- lm(formula= Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
            data=iris)
model
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Coefficients:
##  (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
##       1.8560        0.6508        0.7091       -0.5565

The equation of linear regression line:

Sepal.Length = 1.85600+ 0.65084 * Sepal.Width + 0.70913 * Petal.Length - 0.55648 * Petal.Width

summary(model)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82816 -0.21989  0.01875  0.19709  0.84570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
## Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
## Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
## Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared:  0.8586, Adjusted R-squared:  0.8557 
## F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16

According to R-squared(0.8586) and Adj R-squared(0.8557), this model pretty much fits the data

p.s.R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean. In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I’ll talk about both in this post and my next post.

STEP 5 PREDICTION

new.iris <- data.frame(Sepal.Width=3.456, Petal.Length=1.535, Petal.Width=0.341)
new.iris
##   Sepal.Width Petal.Length Petal.Width
## 1       3.456        1.535       0.341
predict(model, new.iris)
##        1 
## 5.004048

Given Sepal.Width、Petal.Length、Petal.Width, we can predict new iris’s Sepal.Length

STEP 6 ANALYSIS OF VARIANCE(One-Way ANOVA)

From the visualization, we can see that the mean of Petal.Width or Petal.Length of different species are different. We want to prove it with ANOVA

  • H0: μ(Setosa) = μ(Versicolor) = μ(Virginica)
  • H1: Not all species’s average Petal.Width/Petal.Length are the same

Petal.Width

width.lm <- lm(Petal.Width~Species, data=iris)
anova(width.lm)
## Analysis of Variance Table
## 
## Response: Petal.Width
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Species     2 80.413  40.207  960.01 < 2.2e-16 ***
## Residuals 147  6.157   0.042                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the p-value, which is much smaller than 0.05, we reject H0

Petal.Length

Length.lm <- lm(Petal.Length~Species, data=iris)
anova(Length.lm)
## Analysis of Variance Table
## 
## Response: Petal.Length
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Species     2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals 147  27.22   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the p-value, which is much smaller than 0.05, we reject H0