Four Data Linear Regression
Filip Dragicevic
2022-03-23
> setwd("C:/Users/filip/OneDrive/Desktop/MSCI 3230")
> fourdata <-
+ read.table("C:/Users/filip/OneDrive/Desktop/MSCI 3230/fourData.csv",
+ header=TRUE, stringsAsFactors=TRUE, sep=",", na.strings="NA", dec=".",
+ strip.white=TRUE)
Summary of Data
> summary(fourdata)
      x123            y1               y2              y3             x4            y4
 Min.   : 4.0   Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 8   Min.   : 5.250
 1st Qu.: 6.5   1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 8   1st Qu.: 6.170
 Median : 9.0   Median : 7.580   Median :8.140   Median : 7.11   Median : 8   Median : 7.040
 Mean   : 9.0   Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 9   Mean   : 7.501
 3rd Qu.:11.5   3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8   3rd Qu.: 8.190
 Max.   :14.0   Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :19   Max.   :12.500
- The summary of the fourData.csv dataset shows the minimum, maximum, mean, median, and quartile values for each column.
- The objective is to analyze why four different pairs of variables produce nearly identical linear regression outputs; a quick correlation check is sketched below.
- This will be done using exploratory analysis and visualization.
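As a quick numerical check (a minimal sketch, assuming fourdata has been loaded as above), the correlation of each x/y pair can be compared directly; the value of roughly 0.82 matches the square root of the R-squared of about 0.667 reported by the models below.

# Compare the correlation of each x/y pair (base R only).
with(fourdata, round(c(cor(x123, y1), cor(x123, y2),
                       cor(x123, y3), cor(x4, y4)), 3))
# All four values should come out nearly identical, approximately 0.816.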
Linear Regression
> RegModel.1 <- lm(x123~y1, data=fourdata)
> summary(RegModel.1)
Call:
lm(formula = x123 ~ y1, data = fourdata)
Residuals:
Min 1Q Median 3Q Max
-2.6522 -1.5117 -0.2657 1.2341 3.8946
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.9975 2.4344 -0.410 0.69156
y1 1.3328 0.3142 4.241 0.00217 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.019 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
> RegModel.2 <- lm(x123~y2, data=fourdata)
> summary(RegModel.2)
Call:
lm(formula = x123 ~ y2, data = fourdata)
Residuals:
Min 1Q Median 3Q Max
-1.8516 -1.4315 -0.3440 0.8467 4.2017
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.9948 2.4354 -0.408 0.69246
y2 1.3325 0.3144 4.239 0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.02 on 9 degrees of freedom
Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
> RegModel.3 <- lm(x123~y3, data=fourdata)
> summary(RegModel.3)
Call:
lm(formula = x123 ~ y3, data = fourdata)
Residuals:
Min 1Q Median 3Q Max
-2.9869 -1.3733 -0.0266 1.3200 3.2133
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0003 2.4362 -0.411 0.69097
y3 1.3334 0.3145 4.239 0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.019 on 9 degrees of freedom
Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
> RegModel.4 <- lm(x4~y4, data=fourdata)
> summary(RegModel.4)
Call:
lm(formula = x4 ~ y4, data = fourdata)
Residuals:
Min 1Q Median 3Q Max
-2.7859 -1.4122 -0.1853 1.4551 3.3329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0036 2.4349 -0.412 0.68985
y4 1.3337 0.3143 4.243 0.00216 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.018 on 9 degrees of freedom
Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
- All four linear regression models produce nearly identical outputs, including intercepts, slopes, standard errors, and adjusted R-squared values; a side-by-side coefficient comparison is sketched below.
- Each model appears significant and useful, with p-values below 0.05 and adjusted R-squared values around 0.63.
- However, this does not explicitly tell us that a linear regression model is appropriate for each pair.
- We conduct exploratory analysis to see why this may be the case.
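Before plotting, the fitted coefficients can be collected side by side (a minimal sketch using the four models fitted above) to confirm how close the fits are:

# Collect intercepts and slopes from the four fitted models.
models <- list(RegModel.1, RegModel.2, RegModel.3, RegModel.4)
sapply(models, coef)
# Each column should show an intercept near -1.0 and a slope near 1.33.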
Exploratory Analysis
Scatter Plots
> scatterplot(y1~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=fourdata)

- Positive, roughly linear correlation, with the points trending toward the top right corner.
- A positive line of best fit describes the data well, so linear regression is appropriate here.
> scatterplot(y2~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=fourdata)

- Positive correlation, with the points trending toward the top right corner in a parabolic (curved) shape.
- The relationship is not linear, so a simple linear regression model would not be appropriate; a quadratic fit is sketched below.
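To check this, a quadratic term can be added (a minimal sketch; QuadModel.2 is a hypothetical name, not part of the session above):

# Fit y2 with a squared term to capture the curved pattern.
QuadModel.2 <- lm(y2 ~ x123 + I(x123^2), data=fourdata)
summary(QuadModel.2)$r.squared
# The R-squared should be very close to 1, confirming the parabolic shape.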
> scatterplot(y3~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=fourdata)

- Positive, linear correlation for most points, trending toward the top right corner.
- One influential outlier pulls the fitted line toward it, so the linear regression model would be distorted; a refit without that point is sketched below.
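To gauge the outlier's influence (a minimal sketch; the row is located by its extreme y3 value rather than hard-coded):

# Refit with the extreme y3 observation removed.
out <- which.max(fourdata$y3)
summary(lm(y3 ~ x123, data=fourdata[-out, ]))$r.squared
# Without that point, the remaining points are almost perfectly linear.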
> scatterplot(y4~x4, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=fourdata)

- The X variable does not change at all (other than the one outlier at x4 = 19), so no meaningful relationship between X and Y can be estimated; a check is sketched below.
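This can be verified directly (a minimal sketch): dropping the single extreme observation leaves x4 with zero variance, so no slope can be estimated.

# Remove the one extreme x4 value and check what variation remains.
keep <- fourdata$x4 != max(fourdata$x4)
sd(fourdata$x4[keep])                      # 0: x4 no longer varies
cor(fourdata$x4[keep], fourdata$y4[keep])  # NA, with a zero-variance warning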
Box Plots
> Boxplot( ~ x123, data=fourdata, id=list(method="y"))

> Boxplot( ~ y1, data=fourdata, id=list(method="y"))

- The y1 boxplot has even whiskers and no outliers, suggesting an approximately symmetric (roughly normal) distribution that is suitable for linear regression; a normality check is sketched below.
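A formal test can back up the visual impression (a minimal sketch; with only 11 observations, the result is indicative at best):

# Shapiro-Wilk normality test for y1.
shapiro.test(fourdata$y1)
# A p-value above 0.05 is consistent with approximate normality.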
> Boxplot( ~ y2, data=fourdata, id=list(method="y"))

[1] "8"
- The y2 boxplot has uneven whiskers and one outlier (observation 8).
- The median line sits above the middle of the box, indicating a left (negative) skew.
> Boxplot( ~ y3, data=fourdata, id=list(method="y"))

[1] "3"
> Boxplot( ~ y4, data=fourdata, id=list(method="y"))

[1] "8"
- The y3 and y4 boxplots each flag a major outlier (observations 3 and 8, respectively), making simple linear regression unsuitable for those pairs; residual diagnostics are sketched below.
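As a closing check (a minimal sketch using the four models fitted above), residual-versus-fitted plots make the differences visible even though the summary tables look nearly the same:

# Plot residuals against fitted values for all four models.
oldpar <- par(mfrow=c(2, 2))
for (m in list(RegModel.1, RegModel.2, RegModel.3, RegModel.4)) {
  plot(fitted(m), resid(m), xlab="Fitted values", ylab="Residuals")
  abline(h=0, lty=2)
}
par(oldpar)
# Curvature, the single large residual, and the leverage point each show up here.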