Four Data Linear Regression

Filip Dragicevic

2022-03-23

> setwd("C:/Users/filip/OneDrive/Desktop/MSCI 3230")
> fourdata <- 
+   read.table("C:/Users/filip/OneDrive/Desktop/MSCI 3230/fourData.csv", 
+   header=TRUE, stringsAsFactors=TRUE, sep=",", na.strings="NA", dec=".", 
+   strip.white=TRUE)
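
Before summarizing, it can help to confirm the file loaded as expected; a minimal check, assuming the read above succeeded:

> # Inspect the structure and first few rows of the loaded data
> str(fourdata)
> head(fourdata)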

Summary of Data

> summary(fourdata)
      x123            y1               y2              y3              x4    
 Min.   : 4.0   Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 8  
 1st Qu.: 6.5   1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 8  
 Median : 9.0   Median : 7.580   Median :8.140   Median : 7.11   Median : 8  
 Mean   : 9.0   Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 9  
 3rd Qu.:11.5   3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8  
 Max.   :14.0   Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :19  
       y4        
 Min.   : 5.250  
 1st Qu.: 6.170  
 Median : 7.040  
 Mean   : 7.501  
 3rd Qu.: 8.190  
 Max.   :12.500  

- By running summary() on the fourData.csv dataset, we can examine the minimum, maximum, mean, median, and quartile values for each column.

- The objective is to analyze why we get nearly identical linear regression results from four different x/y pairs (a quick numerical check is sketched after this list).

- This will be done using exploratory analysis and visualization.
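
One hedged way to quantify the puzzle, using the column names from the summary above, is to compare the correlations and variances of the four pairs directly:

> # Correlation for each x/y pair -- expected to be nearly identical
> with(fourdata, c(cor(x123, y1), cor(x123, y2), cor(x123, y3), cor(x4, y4)))
> # Variance of each y column
> sapply(fourdata[c("y1", "y2", "y3", "y4")], var)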

Linear Regression

> RegModel.1 <- lm(x123~y1, data=fourdata)
> summary(RegModel.1)

Call:
lm(formula = x123 ~ y1, data = fourdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6522 -1.5117 -0.2657  1.2341  3.8946 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.9975     2.4344  -0.410  0.69156   
y1            1.3328     0.3142   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.019 on 9 degrees of freedom
Multiple R-squared:  0.6665,    Adjusted R-squared:  0.6295 
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
> RegModel.2 <- lm(x123~y2, data=fourdata)
> summary(RegModel.2)

Call:
lm(formula = x123 ~ y2, data = fourdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8516 -1.4315 -0.3440  0.8467  4.2017 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.9948     2.4354  -0.408  0.69246   
y2            1.3325     0.3144   4.239  0.00218 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.02 on 9 degrees of freedom
Multiple R-squared:  0.6662,    Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
> RegModel.3 <- lm(x123~y3, data=fourdata)
> summary(RegModel.3)

Call:
lm(formula = x123 ~ y3, data = fourdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9869 -1.3733 -0.0266  1.3200  3.2133 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -1.0003     2.4362  -0.411  0.69097   
y3            1.3334     0.3145   4.239  0.00218 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.019 on 9 degrees of freedom
Multiple R-squared:  0.6663,    Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
> RegModel.4 <- lm(x4~y4, data=fourdata)
> summary(RegModel.4)

Call:
lm(formula = x4 ~ y4, data = fourdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7859 -1.4122 -0.1853  1.4551  3.3329 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -1.0036     2.4349  -0.412  0.68985   
y4            1.3337     0.3143   4.243  0.00216 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.018 on 9 degrees of freedom
Multiple R-squared:  0.6667,    Adjusted R-squared:  0.6297 
F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
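
A compact way to confirm that the four fits coincide is to tabulate their coefficients side by side; a sketch over the models fitted above (sapply labels the rows using the first model):

> # Intercepts and slopes of all four models in one matrix
> sapply(list(RegModel.1, RegModel.2, RegModel.3, RegModel.4), coef)
> # R-squared values of all four models
> sapply(list(RegModel.1, RegModel.2, RegModel.3, RegModel.4),
+   function(m) summary(m)$r.squared)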

- All four linear regression models produce nearly identical outputs, including the intercepts, slopes, and adjusted R-squared values.

- Each model appears significant and useful, based on its adjusted R-squared and a p-value below 0.05.

- However, this alone does not explicitly tell us that a linear regression model is appropriate for each pair.

- We conduct exploratory analysis to see why this may be the case; a quick residual check is sketched below.
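
One concrete residual check, sketched here with base R (curvature in this plot for model 2 is a typical warning sign):

> # Residuals against fitted values for RegModel.2
> plot(fitted(RegModel.2), residuals(RegModel.2))
> abline(h=0, lty=2)  # a clear curve around this line indicates non-linearity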

Exploratory Analysis

Scatter Plots

> scatterplot(y1~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=fourdata)

- Positive correlation, with the points trending toward the top-right corner.

- A straight line of best fit is plausible, so linear regression is appropriate here.

> scatterplot(y2~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=fourdata)

- Positive association, with the points rising toward the top-right corner in a parabolic (curved) shape.

- The relationship is not linear, so a simple linear regression model is not appropriate.

> scatterplot(y3~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=fourdata)

- Positive correlation, with the points trending toward the top-right corner.

- One outlier has high leverage, so the fitted regression line would be skewed toward it.

> scatterplot(y4~x4, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=fourdata)

- The x variable is constant (apart from a single outlier), so a linear relationship between x and y cannot be estimated reliably; the combined panel sketched below shows all four patterns together.
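
The same four relationships can be drawn in a single 2x2 panel for side-by-side comparison; a sketch using base graphics rather than the scatterplot() function used above:

> # 2x2 grid of the four scatter plots
> oldpar <- par(mfrow=c(2, 2))
> plot(y1 ~ x123, data=fourdata, main="y1 vs x123")
> plot(y2 ~ x123, data=fourdata, main="y2 vs x123")
> plot(y3 ~ x123, data=fourdata, main="y3 vs x123")
> plot(y4 ~ x4, data=fourdata, main="y4 vs x4")
> par(oldpar)  # restore the previous plot layout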

Box Plots

> Boxplot( ~ x123, data=fourdata, id=list(method="y"))

> Boxplot( ~ y1, data=fourdata, id=list(method="y"))

- The y1 boxplot has even whiskers and no outliers, suggesting a roughly symmetric distribution that is suitable for linear regression.

> Boxplot( ~ y2, data=fourdata, id=list(method="y"))

[1] "8"

- The y2 boxplot has uneven whiskers and one outlier (observation 8, printed above).

- The median line also sits above the middle of the box, indicating a left-skewed distribution.

> Boxplot( ~ y3, data=fourdata, id=list(method="y"))

[1] "3"
> Boxplot( ~ y4, data=fourdata, id=list(method="y"))

[1] "8"

- The y3 and y4 boxplots also each have a major outlier (observations 3 and 8, printed above), which makes those variables questionable for simple linear regression.
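
A hedged way to flag these outliers numerically is base R's boxplot.stats(), whose $out component lists the points beyond the whiskers:

> # Values beyond 1.5 * IQR of the quartiles, per y column
> lapply(fourdata[c("y1", "y2", "y3", "y4")], function(v) boxplot.stats(v)$out)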