TOPIC: LINEAR REGRESSION ANALYSIS TO DETERMINE THE FACTORS THAT AFFECT THE PRICE OF CARS.
INTRODUCTION: Over the years, cars have become a necessity for many people, especially in places such as Nigeria, where they make getting around easier. However, car prices have been high, which raises the question: what factors influence the price of a car? This analysis uses linear regression to identify those factors, using a car dataset obtained from Kaggle.com.
PROBLEM STATEMENT: Using a car dataset, I want to identify the factors that affect the price of cars using linear regression.
DATA SOURCE: Kaggle.com
DATA ANALYSIS AND RESULTS: This analysis uses simple linear regression with horsepower as the independent variable, and multiple linear regression with additional variables as independent variables.
#reading the data
library(tidyverse)
## -- Attaching packages --------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data <- read.csv('auto.csv', header = TRUE)
head(data)
#Using the data: is there a linear relationship between horsepower and the price of a car? To what extent does horsepower affect the price?
#Dependent variable: price
#Independent variable: horsepower
scatterPlot <- ggplot(data, aes(x=horsepower, y=price)) + geom_point()+geom_smooth()
scatterPlot
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
#From the scatter plot, there is a positive relationship between horsepower and the price of a car.
#Testing the usefulness of the regression by examining the p-value of the F-statistic
#H0: Regression is not useful (the slope of horsepower is zero).
#H1: Regression is useful (the slope of horsepower is not zero).
regression <- lm(price~horsepower, data=data)
summary(regression)
##
## Call:
## lm(formula = price ~ horsepower, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11929.9 -2381.4 -728.8 2340.6 13746.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6743.63 1559.18 -4.325 6.33e-05 ***
## horsepower 208.68 13.37 15.607 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4938 on 56 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.8131, Adjusted R-squared: 0.8097
## F-statistic: 243.6 on 1 and 56 DF, p-value: < 2.2e-16
#CONCLUSION: Since the p-value (< 2.2e-16) is less than alpha, we reject the null hypothesis and conclude that the regression is useful.
#The slope (208.68) shows that for every one-unit increase in horsepower, the predicted price increases by 208.68 on average.
#The multiple R-squared (0.8131) shows that the independent variable (horsepower) explains 81.31% of the variation in price; the adjusted R-squared is 0.8097.
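#The quantities read from the printout (slope, R-squared, adjusted R-squared, F-test p-value) can also be pulled out of the fitted object directly. The following is a minimal sketch, not the analysis itself: it assumes the built-in mtcars dataset as a stand-in for auto.csv, with mpg in the role of price and hp in the role of horsepower.

```r
# Stand-in data: the built-in mtcars dataset (NOT the auto.csv used above).
# mpg plays the role of the response and hp the role of horsepower.
fit <- lm(mpg ~ hp, data = mtcars)
s <- summary(fit)

slope  <- coef(fit)[["hp"]]   # change in the response per unit of hp
r2     <- s$r.squared         # "Multiple R-squared" in the printout
adj_r2 <- s$adj.r.squared     # "Adjusted R-squared" in the printout

# Overall F-test p-value, recomputed from the stored F statistic
f <- s$fstatistic             # named vector: value, numdf, dendf
f_pvalue <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

cat(sprintf("slope = %.4f, R2 = %.4f, adjusted R2 = %.4f, F p-value = %.3g\n",
            slope, r2, adj_r2, f_pvalue))
```

#With a single predictor, the F-test p-value equals the p-value of the slope's t-test, which is why both lines of the printout lead to the same decision.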
#predict price of car using the first model
new <- data.frame(horsepower=c(100,200))
predict(regression, newdata = new)
## 1 2
## 14124.14 34991.91
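#The two predictions can be checked by hand from the fitted line, price = intercept + slope * horsepower. This sketch uses the rounded coefficients printed in the summary above, so it is expected to differ slightly from the predict() output, which uses the unrounded estimates.

```r
# Coefficients copied (rounded) from the summary(regression) printout above
b0 <- -6743.63   # intercept
b1 <- 208.68     # slope for horsepower

hp <- c(100, 200)
manual <- b0 + b1 * hp   # fitted line evaluated at 100 and 200 horsepower
print(manual)
```

#If a range rather than a point estimate is wanted, predict() for lm objects also accepts interval = "prediction".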
#Checking whether the regression assumptions are violated
par(mfrow=c(2,2))
plot(regression)
#CONSTANT VARIANCE: not violated, because the residuals-versus-fitted plot shows no clear pattern in the points.
#NORMALITY OF RESIDUALS: From the Q-Q plot, the assumption of normality of residuals has been violated, because many of the points do not lie close to the reference line.
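#Reading the residuals-versus-fitted plot is a visual judgment. One hedged way to back it up is a Breusch-Pagan-style check in its studentized (Koenker) form, computed by hand in base R: regress the squared residuals on the predictor and compare n * R-squared with a chi-squared distribution. The sketch below assumes the built-in mtcars dataset as a stand-in for auto.csv; it is not a result for the car price data.

```r
# Stand-in data: built-in mtcars (NOT auto.csv); mpg ~ hp mirrors price ~ horsepower.
fit <- lm(mpg ~ hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictor
e2 <- resid(fit)^2
aux <- lm(e2 ~ hp, data = mtcars)

# Studentized (Koenker) statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with 1 df (one predictor).
n <- nobs(fit)
bp_stat <- n * summary(aux)$r.squared
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat(sprintf("BP statistic = %.3f, p-value = %.3f\n", bp_stat, bp_pvalue))
```

#A small p-value would suggest that the residual variance changes with the predictor; a large one is consistent with constant variance.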
#
#Check other factors that affect the price of cars using multiple linear regression
#H0: Regression is not useful
#H1: Regression is useful
#Conclusion (from the summary output below): Since the p-value (< 2.2e-16) is less than 0.05, we reject the null hypothesis and conclude that the regression is useful.
#The adjusted R-squared shows that the independent variables in this model explain 91.64% of the variance in the price of cars, after adjusting for the number of predictors.
model1 <- lm(price~horsepower+wheel.base+engine.type+company, data=data)
summary(model1)
##
## Call:
## lm(formula = price ~ horsepower + wheel.base + engine.type +
## company, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5587 -1193 -37 1044 9260
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51819.4 16541.0 -3.133 0.00343 **
## horsepower 129.3 26.8 4.824 2.57e-05 ***
## wheel.base 569.2 181.5 3.135 0.00341 **
## engine.typel 7497.3 5690.8 1.317 0.19601
## engine.typeohc 2465.7 3239.2 0.761 0.45148
## engine.typeohcf 9651.0 4042.4 2.387 0.02234 *
## engine.typeohcv -1288.8 2648.3 -0.487 0.62945
## engine.typerotor 1494.0 4796.4 0.311 0.75722
## companyaudi -5974.3 4766.8 -1.253 0.21817
## companybmw -1024.0 5190.3 -0.197 0.84470
## companychevrolet -7045.0 4259.9 -1.654 0.10687
## companydodge -6463.2 4190.4 -1.542 0.13173
## companyhonda -7310.1 4274.8 -1.710 0.09587 .
## companyisuzu -7615.2 4835.6 -1.575 0.12404
## companyjaguar -1834.1 4590.8 -0.400 0.69187
## companymazda -5125.7 4091.5 -1.253 0.21837
## companymercedes-benz 1986.6 5769.7 0.344 0.73261
## companymitsubishi -8109.6 3985.6 -2.035 0.04929 *
## companynissan -6766.2 3706.6 -1.825 0.07624 .
## companyporsche NA NA NA NA
## companytoyota -7011.1 3730.0 -1.880 0.06827 .
## companyvolkswagen -6929.5 4226.4 -1.640 0.10980
## companyvolvo -11567.7 5430.1 -2.130 0.04005 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3273 on 36 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.9472, Adjusted R-squared: 0.9164
## F-statistic: 30.76 on 21 and 36 DF, p-value: < 2.2e-16
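#Many individual coefficients in a larger model can be non-significant even when the model as a whole is useful. One way to judge whether a group of added terms improves on a simpler model is a partial F-test with anova(). This is a minimal sketch on the built-in mtcars dataset (an assumed stand-in for auto.csv), not a result for the car price data.

```r
# Stand-in data: built-in mtcars (NOT auto.csv).
small <- lm(mpg ~ hp, data = mtcars)
big   <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)

# Partial F-test: H0 says the extra terms in `big` all have zero coefficients.
cmp <- anova(small, big)
print(cmp)

# Adjusted R-squared penalises extra predictors, so it is the fairer comparison.
adj <- c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared)
print(adj)
```

#Both models must be fitted on the same rows for this comparison, which matters when some predictors have missing values.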
#Checking whether the assumptions are violated in the multiple linear regression
par(mfrow=c(2,2))
plot(model1)
## Warning: not plotting observations with leverage one:
## 14, 22, 29
#CONSTANT VARIANCE: not violated, because the residuals-versus-fitted plot shows no clear pattern in the points.
#NORMALITY OF RESIDUALS: From the Q-Q plot, the assumption of normality of residuals has been violated, because many of the points do not lie close to the reference line.
#Testing normality of residuals with the Shapiro-Wilk test
#H0: Residuals are normally distributed.
#H1: Residuals are not normally distributed.
shapiro.test(regression$residuals)
##
## Shapiro-Wilk normality test
##
## data: regression$residuals
## W = 0.94265, p-value = 0.00848
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.90037, p-value = 0.0001746
#Since the p-value for the simple linear regression (0.00848) is less than alpha, we reject the null hypothesis and conclude that its residuals are not normally distributed.
#Since the p-value for the multiple linear regression (0.0001746) is less than alpha, we reject the null hypothesis and conclude that its residuals are not normally distributed.
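#The decision rule used above (compare the Shapiro-Wilk p-value with alpha) can be written out explicitly. The sketch below assumes alpha = 0.05 and uses the built-in mtcars dataset as a stand-in for auto.csv, so its numbers are not those of the car price models.

```r
# Stand-in data: built-in mtcars (NOT auto.csv).
fit <- lm(mpg ~ hp, data = mtcars)

alpha <- 0.05                        # assumed significance level
sw <- shapiro.test(residuals(fit))   # H0: residuals are normally distributed

decision <- if (sw$p.value < alpha) {
  "reject H0: residuals look non-normal"
} else {
  "fail to reject H0"
}
cat(sprintf("W = %.4f, p-value = %.4f -> %s\n", sw$statistic, sw$p.value, decision))
```

#Failing to reject H0 does not prove normality; it only means the test found no strong evidence against it.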
#Fit a smaller model (horsepower, wheel.base and average.mileage) and check whether its residuals are normally distributed
model2 <- lm(price~horsepower+wheel.base+average.mileage, data=data)
summary(model2)
##
## Call:
## lm(formula = price ~ horsepower + wheel.base + average.mileage,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11214.7 -1625.1 -12.5 1965.2 10223.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -55060.02 11567.39 -4.760 1.49e-05 ***
## horsepower 190.58 21.12 9.023 2.28e-12 ***
## wheel.base 479.28 97.59 4.911 8.78e-06 ***
## average.mileage 116.24 133.92 0.868 0.389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4177 on 54 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.871, Adjusted R-squared: 0.8639
## F-statistic: 121.6 on 3 and 54 DF, p-value: < 2.2e-16
shapiro.test(model2$residuals)
##
## Shapiro-Wilk normality test
##
## data: model2$residuals
## W = 0.94249, p-value = 0.00834
#Since the p-value for this model (0.00834) is less than alpha, we reject the null hypothesis and conclude that its residuals are not normally distributed either.
#In this model, horsepower and wheel.base are significant, while average.mileage (p-value 0.389) is not.