HOMEWORK 3

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

red_wine <- read.csv("~/Downloads/winequality-red.csv", header = TRUE, sep = ";")
white_wine <- read.csv("~/Downloads/winequality-white.csv", header = TRUE, sep = ";")
wine_data <- rbind(red_wine, white_wine)

red_wine$color <- "red"
white_wine$color <- "white"
combined_data <- rbind(red_wine, white_wine)
head(combined_data)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality color
## 1       5   red
## 2       5   red
## 3       5   red
## 4       6   red
## 5       5   red
## 6       5   red

wine1 <- combined_data[,-c(1,2,3,5, 6,8)]

set.seed(1)

wine2 <- wine1 %>% 
  sample_n(400)

wine2$colorF <-  factor(wine2$color,
                        labels = c("red", "white"),
                        levels = c("red", "white"))

col_order <- c("quality", "residual.sugar", "total.sulfur.dioxide",
               "pH", "sulphates", "alcohol", "color", "colorF")
my_data <- wine2[, col_order]

head(my_data)

##   quality residual.sugar total.sulfur.dioxide   pH sulphates alcohol color
## 1       7           2.20                   28 3.27      0.75    12.6   red
## 2       6          11.90                  140 3.02      0.31    10.3 white
## 3       7           2.80                  153 3.15      0.38    12.7 white
## 4       6          14.55                  156 3.34      0.78     9.1 white
## 5       6           2.00                   22 3.21      0.68     9.9   red
## 6       6          14.90                  118 3.04      0.54     9.0 white
##   colorF
## 1    red
## 2  white
## 3  white
## 4  white
## 5    red
## 6  white

##Unit of observation: One wine (vino vherde from Portugal).

##Sample size: 400

Description of variables:

Independent variables:

total sulfur dioxide(mg/dm3)-numerical

residual sugar(g/dm3)-numerical

pH level-numerical

alcohol(vol.%)-numerical

sulphates(g/dm3)-numerical

color of wine-categorical

Dependent variable:

quality of wine (score between 0 and 10)

Source of the data: https://archive.ics.uci.edu/ml/datasets/wine+quality

Regression definition:

With this regression i would like to find out how sugar, sulfur dioxide, pH, sulphates, color and alcohol affect quality of wine.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

scatterplotMatrix(my_data[,c(-7,-8)], smooth = FALSE)

From the graph above we can observe that residual sugar and total sulfur dioxide do not impact the quality or impact it very slightly. It seems that the alcohol impacts quality the most out of all observed variables.

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(my_data[ , c( -7, -8)]))

##                      quality residual.sugar total.sulfur.dioxide    pH
## quality                 1.00          -0.04                -0.07  0.01
## residual.sugar         -0.04           1.00                 0.50 -0.30
## total.sulfur.dioxide   -0.07           0.50                 1.00 -0.19
## pH                      0.01          -0.30                -0.19  1.00
## sulphates               0.05          -0.19                -0.31  0.20
## alcohol                 0.45          -0.34                -0.24  0.13
##                      sulphates alcohol
## quality                   0.05    0.45
## residual.sugar           -0.19   -0.34
## total.sulfur.dioxide     -0.31   -0.24
## pH                        0.20    0.13
## sulphates                 1.00   -0.04
## alcohol                  -0.04    1.00
## 
## n= 400 
## 
## 
## P
##                      quality residual.sugar total.sulfur.dioxide pH    
## quality                      0.4274         0.1650               0.8309
## residual.sugar       0.4274                 0.0000               0.0000
## total.sulfur.dioxide 0.1650  0.0000                              0.0002
## pH                   0.8309  0.0000         0.0002                     
## sulphates            0.3129  0.0002         0.0000               0.0000
## alcohol              0.0000  0.0000         0.0000               0.0097
##                      sulphates alcohol
## quality              0.3129    0.0000 
## residual.sugar       0.0002    0.0000 
## total.sulfur.dioxide 0.0000    0.0000 
## pH                   0.0000    0.0097 
## sulphates                      0.4628 
## alcohol              0.4628

All are correlated, out of all the tested variables alcohol is the most correlated to quality and pH is the least correlated to quality.

fit1 <- lm(quality ~ residual.sugar + total.sulfur.dioxide+ pH + alcohol + colorF,
           data = my_data)

library(car)
car::vif(fit1)

##       residual.sugar total.sulfur.dioxide                   pH 
##             1.497158             2.584908             1.190245 
##              alcohol               colorF 
##             1.238585             2.377766

mean(vif(fit1))

## [1] 1.777732

MULTICOLINEARITY

Based on the vif test, we can observe, that the multicolinearity is not a problem in this model, because all VIF values are below 5 (more than 5 would indicate multicolinearity). The average VIF value is 1.78, which is yet another indicator that the multicolinearity is not a problem in the model.

my_data$StdResid <- round(rstandard(fit1), 3) 
my_data$CooksD <- round(cooks.distance(fit1), 3) 



hist(my_data$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(my_data$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$StdResid
## W = 0.96921, p-value = 1.832e-07

STANDARDIZED RESIDUALS:

I will remove values that fall out of the [-3,3] interval, since they can be considered outliers.

Shapiro-Wilk normality test:

H0:Errors are normally distributed.

H1: Errors are not normally distributed.

Since p value is below 0.05 we can reject the null hypothesis, however, the dataframe for this test is big and we will therefore continue with the testing and assume this is not a problem (more than 100 observations in this dataset).

hist(my_data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

COOKS DISTANCE:

I will remove values that are not connected to each other and are therefore significantly higher than other units, since they are considered outliers.

head(my_data[order(my_data$StdResid),],3 )

##    quality residual.sugar total.sulfur.dioxide   pH sulphates alcohol color
## 95       3           15.1                  162 3.24      0.52    10.5 white
## 97       3            8.5                  208 2.90      0.38    11.0 white
## 93       4            5.6                   31 3.32      0.51    10.5 white
##    colorF StdResid CooksD
## 95  white   -4.071  0.048
## 97  white   -3.956  0.066
## 93  white   -2.775  0.038

head(my_data[order(-my_data$StdResid),], 6)

##     quality residual.sugar total.sulfur.dioxide   pH sulphates alcohol color
## 185       8           14.2                212.5 3.14      0.46     8.9 white
## 211       8            2.2                 29.0 2.88      0.82     9.8   red
## 88        8           13.9                155.0 2.94      0.41     8.8 white
## 148       8           13.9                155.0 2.94      0.41     8.8 white
## 128       8            2.0                 96.0 3.23      0.64    10.4 white
## 78        8            1.7                130.0 3.43      0.50    10.7 white
##     colorF StdResid CooksD
## 185  white    3.453  0.034
## 211    red    3.404  0.056
## 88   white    3.377  0.032
## 148  white    3.377  0.032
## 128  white    2.896  0.012
## 78   white    2.851  0.015

head(my_data[order(-my_data$CooksD),], 10)

##     quality residual.sugar total.sulfur.dioxide   pH sulphates alcohol color
## 97        3            8.5                208.0 2.90      0.38    11.0 white
## 211       8            2.2                 29.0 2.88      0.82     9.8   red
## 95        3           15.1                162.0 3.24      0.52    10.5 white
## 93        4            5.6                 31.0 3.32      0.51    10.5 white
## 185       8           14.2                212.5 3.14      0.46     8.9 white
## 88        8           13.9                155.0 2.94      0.41     8.8 white
## 148       8           13.9                155.0 2.94      0.41     8.8 white
## 32        4            1.7                272.0 3.53      0.53     9.6 white
## 69        7           15.5                135.0 2.90      0.43     8.7 white
## 78        8            1.7                130.0 3.43      0.50    10.7 white
##     colorF StdResid CooksD
## 97   white   -3.956  0.066
## 211    red    3.404  0.056
## 95   white   -4.071  0.048
## 93   white   -2.775  0.038
## 185  white    3.453  0.034
## 88   white    3.377  0.032
## 148  white    3.377  0.032
## 32   white   -1.703  0.027
## 69   white    1.994  0.016
## 78   white    2.851  0.015

my_data4 <- my_data[c(-95, -97, -93, -185, -211, -88, -148, -32), ] 
fit2 <- lm(quality ~ residual.sugar + total.sulfur.dioxide+ pH + alcohol + colorF,
           data = my_data4)

These units will be removed due to the above mentioned reasons. Furthermore, I will again check the distribution of the corrected dataset.

my_data4$StdResid <- round(rstandard(fit2), 3) 
my_data4$CooksD <- round(cooks.distance(fit2), 3) 

hist(my_data4$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(my_data4$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  my_data4$StdResid
## W = 0.98703, p-value = 0.001426

hist(my_data4$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

my_data4$StdFittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = my_data4$StdResid, x = my_data4$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Based on the graph above, it does not seem neither as heteroskedastic nor non-linear. For heteroskedasticity I am not completely sure, therefore I will perform the following test.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                 
##  -----------------------------------
##  Response : quality 
##  Variables: fitted values of quality 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.3285202 
##  Prob > Chi2   =    0.5665317

BREUSCH PAGAN TEST:

H0: Variance is constant (homoskedasticity).

H1: Variance is not constant (heteroskedasticity).

Since p value is 0.567, we cannot reject the null hypothesis, therefore we do not have a problem with heteroskedasticity and the assumption of homoskedasticity is consequently not violated.

summary(fit2)

## 
## Call:
## lm(formula = quality ~ residual.sugar + total.sulfur.dioxide + 
##     pH + alcohol + colorF, data = my_data4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01026 -0.48007 -0.08449  0.42994  2.17086 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.2038951  0.7742673   1.555  0.12079    
## residual.sugar        0.0224229  0.0090943   2.466  0.01411 *  
## total.sulfur.dioxide -0.0015825  0.0009758  -1.622  0.10568    
## pH                    0.2637868  0.2203086   1.197  0.23190    
## alcohol               0.3431163  0.0305941  11.215  < 2e-16 ***
## colorFwhite           0.3118751  0.1190738   2.619  0.00916 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6711 on 386 degrees of freedom
## Multiple R-squared:  0.3009, Adjusted R-squared:  0.2919 
## F-statistic: 33.23 on 5 and 386 DF,  p-value: < 2.2e-16

sqrt(summary(fit2)$r.squared)

## [1] 0.5485577

DESCRPTION:

Coefficients: For sulfur dioxide as well as pH, p-values are above the 0.05 benchmark, meaning that we cannot significantly explain them.

P values are however below 0.05 for color, alcohol and sugar variables, therefore we can explain the correlation for these three variables included in the test.

If the alcohol increases for 1% (in volume), the average quality of wine increases for 0.34 grade, ceteris paribus.

If the sugar increases for 1 g/dm3, the average quality of wine increases for only 0.02 grade, ceteris paribus.

If the color of wine is white, the average quality of wine is higher for 0.3 grade, ceteris paribus.

R2: 31% of quality variability is explained by these variables.

F test: Ho: ro2 = 0 H1: ro2 > 0

We reject the null hypothesis at p<0.001, meaning that we found the linear relationship between at least one independent and dependent variable (quality).