TOPIC: LINEAR REGRESSION ANALYSIS TO DETERMINE THE FACTORS THAT AFFECT THE PRICE OF CARS.

INTRODUCTION: Over the years, cars have become a necessity for many people, especially in a country such as Nigeria, where they make getting around easier. However, the price of cars has remained on the high side, which raises two questions: what causes this, and which factors influence the price of a car? This analysis uses linear regression to identify those factors, using a dataset obtained from Kaggle.com.

PROBLEM STATEMENT: Using a car dataset, I want to identify the factors that affect the price of cars using linear regression.

DATA SOURCE: Kaggle.com

DATA ANALYSIS AND RESULTS: This analysis uses simple linear regression with horsepower as the independent variable, followed by multiple linear regression with additional independent variables.

#reading the data
library(tidyverse)
## -- Attaching packages --------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data <- read.csv('auto.csv', header = TRUE)
head(data)
#Using the data: is there a linear relationship between horsepower and the price of a car? To what extent does horsepower affect price?

#Dependent variable: price
#Independent variable: horsepower

scatterPlot <- ggplot(data, aes(x=horsepower, y=price)) + geom_point()+geom_smooth()
scatterPlot
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

#from the scatter plot, there is a positive relationship between horsepower and price of car.
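#The scatter plot above uses the default loess smoother, which draws a flexible curve rather than a straight line. Because the question here is about a linear relationship, the sketch below overlays a least-squares line instead and reports the Pearson correlation. It assumes the data frame has numeric horsepower and price columns; the helper names linear_scatter and hp_price_cor are hypothetical and not part of the original analysis.

```r
library(ggplot2)

# Hypothetical helper: scatter plot with a straight least-squares line
# in place of the default loess curve.
linear_scatter <- function(df) {
  ggplot(df, aes(x = horsepower, y = price)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)
}

# Hypothetical helper: Pearson correlation between horsepower and price,
# computed on rows where both values are present.
hp_price_cor <- function(df) {
  cor(df$horsepower, df$price, use = "complete.obs")
}

# Usage with the data frame read above:
# linear_scatter(data)
# hp_price_cor(data)
```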
#Testing for usefulness of regression by examining the F-statistic P-value
#H0: Regression is not useful.
#H1: Regression is useful.
regression <- lm(price~horsepower, data=data)
summary(regression)
## 
## Call:
## lm(formula = price ~ horsepower, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11929.9  -2381.4   -728.8   2340.6  13746.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6743.63    1559.18  -4.325 6.33e-05 ***
## horsepower    208.68      13.37  15.607  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4938 on 56 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.8131, Adjusted R-squared:  0.8097 
## F-statistic: 243.6 on 1 and 56 DF,  p-value: < 2.2e-16
#CONCLUSION: Since the F-statistic p-value (< 2.2e-16) is less than alpha, we reject the null hypothesis and conclude that the regression is useful.

#The slope (208.68) shows that for every one-unit increase in horsepower, the predicted price increases by about 208.68 on average.

#The R-squared value (0.8131) shows that the independent variable (horsepower) explains 81.31% of the variation in price; the adjusted R-squared is 0.8097.
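#The slope and R-squared figures quoted above can also be read directly from the fitted model rather than copied from the printed summary. The sketch below collects the coefficients, a 95% confidence interval for each, and both R-squared values. The helper name fit_summary is hypothetical, and the 95% level is an assumption, since the report does not state one.

```r
# Hypothetical helper: gather the quantities interpreted in the text
# from any fitted lm object.
fit_summary <- function(model) {
  s <- summary(model)
  list(
    coefficients  = coef(model),                    # intercept and slope(s)
    conf_int      = confint(model, level = 0.95),   # 95% CI per coefficient
    r_squared     = s$r.squared,                    # multiple R-squared
    adj_r_squared = s$adj.r.squared                 # adjusted R-squared
  )
}

# Usage with the simple model fitted above:
# fit_summary(regression)
```

#A confidence interval for the slope shows how precisely the horsepower effect is estimated, which the point estimate alone does not.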
#Predict the price of a car at 100 and 200 horsepower using the first model

new <- data.frame(horsepower=c(100,200))
predict.lm(regression, newdata =new)
##        1        2 
## 14124.14 34991.91
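#As a check on the predictions above, the fitted equation gives -6743.63 + 208.68 * 100 = 14124.37 for 100 horsepower, which agrees with 14124.14 up to rounding of the printed coefficients. A point prediction says nothing about its uncertainty, so the sketch below also requests a prediction interval. The helper name predict_with_interval is hypothetical, and the 95% level is an assumption.

```r
# Hypothetical helper: point predictions plus 95% prediction intervals
# for a model whose only predictor is horsepower.
predict_with_interval <- function(model, horsepower_values) {
  new <- data.frame(horsepower = horsepower_values)
  # Returns a matrix with columns fit, lwr, upr (one row per new value).
  predict(model, newdata = new, interval = "prediction", level = 0.95)
}

# Usage with the simple model fitted above:
# predict_with_interval(regression, c(100, 200))
```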
#Checking whether the regression assumptions are violated


par(mfrow=c(2,2))
plot(regression)

#CONSTANT VARIANCE: This assumption is not violated because the points in the residuals versus fitted plot show no pattern.

#NORMALITY OF RESIDUALS: From the Q-Q plot, the assumption of normality of residuals has been violated because many of the points do not lie close to the line.

#Check other factors that affect the price of cars using multiple linear regression
#H0: Regression is not useful.
#H1: Regression is useful.
#Conclusion: Since the F-statistic p-value (< 2.2e-16) is less than 0.005, we reject the null hypothesis and conclude that the regression is useful.

#The adjusted R-squared shows that the independent variables in this model explain 91.64% of the variance in the price of a car.

model1 <- lm(price~horsepower+wheel.base+engine.type+company, data=data)
summary(model1)
## 
## Call:
## lm(formula = price ~ horsepower + wheel.base + engine.type + 
##     company, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5587  -1193    -37   1044   9260 
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -51819.4    16541.0  -3.133  0.00343 ** 
## horsepower              129.3       26.8   4.824 2.57e-05 ***
## wheel.base              569.2      181.5   3.135  0.00341 ** 
## engine.typel           7497.3     5690.8   1.317  0.19601    
## engine.typeohc         2465.7     3239.2   0.761  0.45148    
## engine.typeohcf        9651.0     4042.4   2.387  0.02234 *  
## engine.typeohcv       -1288.8     2648.3  -0.487  0.62945    
## engine.typerotor       1494.0     4796.4   0.311  0.75722    
## companyaudi           -5974.3     4766.8  -1.253  0.21817    
## companybmw            -1024.0     5190.3  -0.197  0.84470    
## companychevrolet      -7045.0     4259.9  -1.654  0.10687    
## companydodge          -6463.2     4190.4  -1.542  0.13173    
## companyhonda          -7310.1     4274.8  -1.710  0.09587 .  
## companyisuzu          -7615.2     4835.6  -1.575  0.12404    
## companyjaguar         -1834.1     4590.8  -0.400  0.69187    
## companymazda          -5125.7     4091.5  -1.253  0.21837    
## companymercedes-benz   1986.6     5769.7   0.344  0.73261    
## companymitsubishi     -8109.6     3985.6  -2.035  0.04929 *  
## companynissan         -6766.2     3706.6  -1.825  0.07624 .  
## companyporsche             NA         NA      NA       NA    
## companytoyota         -7011.1     3730.0  -1.880  0.06827 .  
## companyvolkswagen     -6929.5     4226.4  -1.640  0.10980    
## companyvolvo         -11567.7     5430.1  -2.130  0.04005 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3273 on 36 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.9472, Adjusted R-squared:  0.9164 
## F-statistic: 30.76 on 21 and 36 DF,  p-value: < 2.2e-16
#Checking for assumptions violated in multiple linear regression

par(mfrow=c(2,2))
plot(model1)
## Warning: not plotting observations with leverage one:
##   14, 22, 29
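#The summary above reports one coefficient as NA "because of singularities", and the plot warning lists three observations with leverage one. A plausible cause, though not confirmed by the output shown, is that some companies appear only once in the data, so their dummy variable fits that single car exactly. The sketch below lists factor levels that occur only once. The helper name single_obs_levels is hypothetical, and it counts all rows, including any later dropped for missing values.

```r
# Hypothetical helper: return the levels of a categorical column
# that occur exactly once in the data frame.
single_obs_levels <- function(df, column) {
  counts <- table(df[[column]])
  names(counts)[counts == 1]
}

# Usage with the data and model fitted above:
# single_obs_levels(data, "company")
# alias(model1)  # shows which coefficient is a linear combination of others
```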

#CONSTANT VARIANCE: This assumption is not violated because the points in the residuals versus fitted plot show no pattern.

#NORMALITY OF RESIDUALS: From the Q-Q plot, the assumption of normality of residuals has been violated because many of the points do not lie close to the line.
#Testing normality of residuals with the Shapiro-Wilk test
#H0: Residuals are normally distributed.
#data$residuals <- regression$residuals
library(ggplot2)
shapiro.test(regression$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  regression$residuals
## W = 0.94265, p-value = 0.00848
shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.90037, p-value = 0.0001746
#Since the p-value for the simple linear regression (0.00848) is less than alpha, we reject the null hypothesis and conclude that the residuals are not normally distributed.

#Since the p-value for the multiple linear regression (0.0001746) is less than alpha, we reject the null hypothesis and conclude that the residuals are not normally distributed.
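#The Shapiro-Wilk results above are easier to compare side by side. The sketch below runs the test on the residuals of each model in a named list and returns one row per model. The helper name residual_normality is hypothetical.

```r
# Hypothetical helper: Shapiro-Wilk statistic and p-value for the
# residuals of each fitted model in a named list.
residual_normality <- function(models) {
  t(sapply(models, function(m) {
    test <- shapiro.test(residuals(m))
    c(W = unname(test$statistic), p_value = test$p.value)
  }))
}

# Usage with the two models fitted above:
# residual_normality(list(simple = regression, multiple = model1))
```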
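#Refit the model with numeric predictors only (horsepower, wheel.base, average.mileage) and check whether its residuals are normally distributed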
model2 <- lm(price~horsepower+wheel.base+average.mileage, data=data)
summary(model2)
## 
## Call:
## lm(formula = price ~ horsepower + wheel.base + average.mileage, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11214.7  -1625.1    -12.5   1965.2  10223.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -55060.02   11567.39  -4.760 1.49e-05 ***
## horsepower         190.58      21.12   9.023 2.28e-12 ***
## wheel.base         479.28      97.59   4.911 8.78e-06 ***
## average.mileage    116.24     133.92   0.868    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4177 on 54 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.871,  Adjusted R-squared:  0.8639 
## F-statistic: 121.6 on 3 and 54 DF,  p-value: < 2.2e-16
shapiro.test(model2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model2$residuals
## W = 0.94249, p-value = 0.00834
#Since the p-value for this model (0.00834) is less than alpha, we reject the null hypothesis and conclude that the residuals are not normally distributed.