Introduction

The dataset we chose for our final project contains 48 observations, one for each of 48 US states, in which the consumption of petrol in each state was measured over a period of one year. The factors recorded for each state are the petrol tax, the average income per capita, the number of miles of paved highway, and the proportion of the population with driver’s licenses. We will use a multiple linear regression model to analyze the dataset. Our response variable is the consumption of petrol for each state, and the remaining columns - the petrol tax, the average income per capita, the miles of paved highway, and the proportion of the population with driver’s licenses - are our covariates.

Through this analysis, we aim to determine which of these factors significantly explain the variation in petrol consumption.

Packages Used

library(MASS)       # For stepwise regression (stepAIC)
library(leaps)      # For best subset selection
library(car)        # For checking multicollinearity
library(corrplot)   # For correlation plots
library(psych)      # For pair plots
library(rmarkdown)  # For displaying datasets in paged table format

Data Cleaning & EDA

First, we load the dataset into our R project and check whether there are any null values that need to be removed, or whether any other data cleaning operations need to be performed. For this we use the following R code:

# Read Dataset
CData <- read.csv("petrol_consumption_data.csv",header = TRUE)

As we can see, we have 6 variables: the first column is an index, the last column is our response (dependent) variable, and the remaining columns are our possible covariates/regressors.

We remove the unwanted index column using the following R code:

# Data Cleaning
CData <- CData[,2:6]

For the sake of convenience, we rename the variables as follows: petrol tax (cents per gallon) as ptax, average income (dollars) as income, paved highways (miles) as highways, proportion of population with driver’s licenses as licenses, and consumption of petrol (millions of gallons) as consumption.

# Renaming the columns
colnames(CData) <- c("ptax","income","highways","licenses","consumption")
names(CData)
## [1] "ptax"        "income"      "highways"    "licenses"    "consumption"

The output shows the dataset without the index column, and the variable names have been updated.

Next, we check for null values and non-numeric data using the following R code.

lapply(CData,class)  # To check the class of the variables
## $ptax
## [1] "numeric"
## 
## $income
## [1] "integer"
## 
## $highways
## [1] "integer"
## 
## $licenses
## [1] "numeric"
## 
## $consumption
## [1] "integer"
summary(CData)  # Summary of the dataset; also reveals any null values
##       ptax            income        highways        licenses     
##  Min.   : 5.000   Min.   :3063   Min.   :  431   Min.   :0.4510  
##  1st Qu.: 7.000   1st Qu.:3739   1st Qu.: 3110   1st Qu.:0.5298  
##  Median : 7.500   Median :4298   Median : 4736   Median :0.5645  
##  Mean   : 7.668   Mean   :4242   Mean   : 5565   Mean   :0.5703  
##  3rd Qu.: 8.125   3rd Qu.:4579   3rd Qu.: 7156   3rd Qu.:0.5952  
##  Max.   :10.000   Max.   :5342   Max.   :17782   Max.   :0.7240  
##   consumption   
##  Min.   :344.0  
##  1st Qu.:509.5  
##  Median :568.5  
##  Mean   :576.8  
##  3rd Qu.:632.8  
##  Max.   :968.0
paged_table(CData)   # Cleaned Data

Observation: - We used lapply() to check the data type of every column, which returns the output shown above. All of our variables hold numeric or integer values. In addition, summary() shows no null values in our dataset. We can therefore proceed to fit a linear regression model to this dataset.
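As an additional explicit check (a minimal sketch, not part of the original output), we can count missing values per column directly:

# Count NA values in each column; all zeros confirm there are no missing values
colSums(is.na(CData))
# Total number of missing values across the whole dataset
sum(is.na(CData))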

Modelling & Checking

We use lm() to fit the multiple linear regression model.

# Data Modelling
CModel <- lm(consumption ~ ptax+income+highways+licenses,data = CData)

Model Checking:
Here we check the significance and influence of each covariate on the response variable, as well as the significance of the model as a whole. We use t-tests for the individual covariates and the F-statistic for the overall model.

# Summary
summary(CModel)
## 
## Call:
## lm(formula = consumption ~ ptax + income + highways + licenses, 
##     data = CData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -122.03  -45.57  -10.66   31.53  234.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.773e+02  1.855e+02   2.033 0.048207 *  
## ptax        -3.479e+01  1.297e+01  -2.682 0.010332 *  
## income      -6.659e-02  1.722e-02  -3.867 0.000368 ***
## highways    -2.426e-03  3.389e-03  -0.716 0.477999    
## licenses     1.336e+03  1.923e+02   6.950 1.52e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.31 on 43 degrees of freedom
## Multiple R-squared:  0.6787, Adjusted R-squared:  0.6488 
## F-statistic: 22.71 on 4 and 43 DF,  p-value: 3.907e-10

Observation: -

F-statistic

The p-value for the F-statistic is well below 0.05, so the regression as a whole is significant in explaining the variation in the response variable.

T-tests for each covariate

The p-values for income and licenses are well below 0.05 and their |t| values are greater than 2, so these variables - the average income per capita and the proportion of the population with driver’s licenses - are the most significant in explaining the variation in the response.
For ptax, the p-value (about 0.01) is below 0.05 and |t| is above 2, so it is still significant, although less strongly so than income and licenses.
Since the p-value for highways is greater than 0.05 and its |t| < 2, the number of miles of paved highway has no significant effect on petrol consumption. These quantities can also be extracted from the fitted model programmatically, as sketched below.
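A minimal sketch of how these quantities can be read off the fitted model object in base R (nothing here beyond what summary() already reports):

# Overall F-test: recompute the p-value from the stored F-statistic
fstat <- summary(CModel)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
# Per-covariate t-statistics and p-values
summary(CModel)$coefficients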
We use best-subset selection (via the leaps library) and stepwise regression to confirm which variables are significant enough to be included in our model.

subModel <- regsubsets(consumption ~ ptax+income+highways+licenses,data = CData,nbest = 2)  # Find Best Subset
par(mfrow = c(1,1))
plot(subModel)        # BIC Model

Observation: -

The plot above supports our hypothesis: the number of miles of paved highway is excluded from the best subsets, indicating that it does not meaningfully affect our response variable.
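The same conclusion can be read off numerically from the regsubsets fit; a minimal sketch using the summary() method of the subModel object above:

# Identify the subset of covariates with the lowest BIC
subSummary <- summary(subModel)
subSummary$bic                                # BIC of each candidate subset
subSummary$which[which.min(subSummary$bic), ] # covariates included in the best subset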

# Model checking using stepwise regression
add1(lm(consumption~1,data = CData), consumption~ptax+income+highways+licenses, test = "F")
## Single term additions
## 
## Model:
## consumption ~ 1
##          Df Sum of Sq    RSS    AIC F value   Pr(>F)    
## <none>                588366 453.87                     
## ptax      1    119823 468543 444.94 11.7638 0.001285 ** 
## income    1     35277 553090 452.90  2.9340 0.093468 .  
## highways  1       213 588153 455.85  0.0167 0.897785    
## licenses  1    287448 300918 423.68 43.9408 3.29e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(consumption~licenses,data=CData), consumption~ptax+income+highways+licenses, test="F")
## Single term additions
## 
## Model:
## consumption ~ licenses
##          Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>                300918 423.68                      
## ptax      1     40084 260834 418.82  6.9155 0.0116540 *  
## income    1     75874 225044 411.74 15.1718 0.0003228 ***
## highways  1      2410 298509 425.30  0.3633 0.5497184    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(consumption~income+licenses,data=CData), consumption~ptax+income+highways+licenses, test="F")
## Single term additions
## 
## Model:
## consumption ~ income + licenses
##          Df Sum of Sq    RSS    AIC F value   Pr(>F)   
## <none>                225044 411.74                    
## ptax      1     33742 191302 405.94  7.7607 0.007848 **
## highways  1      4362 220682 412.80  0.8698 0.356105   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(consumption~ptax+income+licenses,data=CData), consumption~ptax+income+highways+licenses, test="F")
## Single term additions
## 
## Model:
## consumption ~ ptax + income + licenses
##          Df Sum of Sq    RSS    AIC F value Pr(>F)
## <none>                191302 405.94               
## highways  1    2252.5 189050 407.37  0.5123  0.478
drop1(lm(consumption~ptax+income+licenses,data=CData), test="F")
## Single term deletions
## 
## Model:
## consumption ~ ptax + income + licenses
##          Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>                191302 405.94                      
## ptax      1     33742 225044 411.74  7.7607 0.0078482 ** 
## income    1     69532 260834 418.82 15.9924 0.0002397 ***
## licenses  1    243586 434889 443.36 56.0254 2.239e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(consumption~ptax+income,data=CData), test="F")
## Single term deletions
## 
## Model:
## consumption ~ ptax + income
##        Df Sum of Sq    RSS    AIC F value  Pr(>F)   
## <none>              434889 443.36                   
## ptax    1    118201 553090 452.90 12.2308 0.00107 **
## income  1     33655 468543 444.94  3.4824 0.06855 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation: -

The stepwise regression above confirms that the significant variables are ptax, income and licenses. Hence, we will refit our regression model using these 3 covariates: the petrol tax in the state (ptax), the average income per capita (income), and the proportion of the population with driver’s licenses (licenses).
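For reference, the manual add1()/drop1() sequence above can also be automated. A minimal sketch using stepAIC() from the MASS package loaded earlier (stepAIC() selects by AIC, so this is a cross-check rather than an exact reproduction of the F-tests above):

# Automated stepwise selection starting from the full four-covariate model
fullModel <- lm(consumption ~ ptax + income + highways + licenses, data = CData)
stepAIC(fullModel, direction = "both", trace = FALSE)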

Remodelling

Here we refit the model using the 3 significant covariates.

# Data Remodeling
CModel<-lm(consumption ~ ptax+income+licenses,data=CData)
# Summary
summary(CModel)
## 
## Call:
## lm(formula = consumption ~ ptax + income + licenses, data = CData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.10  -51.22  -12.89   24.49  238.77 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  307.32790  156.83067   1.960  0.05639 .  
## ptax         -29.48381   10.58358  -2.786  0.00785 ** 
## income        -0.06802    0.01701  -3.999  0.00024 ***
## licenses    1374.76841  183.66954   7.485 2.24e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.94 on 44 degrees of freedom
## Multiple R-squared:  0.6749, Adjusted R-squared:  0.6527 
## F-statistic: 30.44 on 3 and 44 DF,  p-value: 8.235e-11
# Model Rechecking
subModel <- regsubsets(consumption ~ ptax+income+licenses,data=CData,nbest=2)
plot(subModel)

As we see in both the summary() output and the plot(subModel) output, our model now contains only covariates that significantly determine the response variable, i.e. the consumption of petrol. The p-value of the model’s F-statistic is well below 0.05 and |t| for each covariate is greater than 2, so the linear regression model is significant in explaining the variation in the response variable.
The adjusted R-squared is about 65.3%. As a further cross-check, we can compare this reduced model against the full four-covariate model with a partial F-test, as sketched below.
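A minimal sketch of that partial F-test (comparing the reduced model CModel with the full model including highways):

# Partial F-test: does adding highways significantly improve the reduced model?
anova(CModel, lm(consumption ~ ptax + income + highways + licenses, data = CData))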

Visualizations

We use pairs() and plot() to visualize the data and the regression model diagnostics.

# To check for LINE assumptions
pairs(CData)

par(mfrow = c(2,2))
plot(CModel)

Observation: -
We can see slight non-linearity in the model, although the variance appears to be roughly constant. Next, we use vif() and cor() to check for multicollinearity.

# Check for multicollinearity
vif(CModel)
##     ptax   income licenses 
## 1.094575 1.029153 1.122082
CData2 <- CData[,1:4]
M <- cor(CData2)
corrplot(M, method = "number")

As observed, the VIF values are close to 1 for all variables, so we conclude that there is little correlation between the covariates. The cor() output supports our conclusion that multicollinearity is not a problem in our model.

Both the Q-Q plot and the fitted-vs-residuals plot show only slight non-linearity and roughly constant variance, with no obvious skewness. We do observe a few outliers in the Q-Q plot; however, their leverage is very low, so removing them would barely change the fitted regression line.
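To back up the claim about low leverage, a minimal sketch of the influence diagnostics (the 4/n cut-off for Cook’s distance is a common rule of thumb and is our assumption, not part of the original analysis):

# Leverage and Cook's distance for the reduced model
lev <- hatvalues(CModel)       # leverage of each observation
cd  <- cooks.distance(CModel)  # influence of each observation
sort(lev, decreasing = TRUE)[1:3]   # the three highest-leverage points
which(cd > 4 / nrow(CData))         # points flagged by the 4/n rule of thumb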

Transformations

We checked whether transformations improve the regression model. For this we applied log, square-root and inverse transformations, first to individual predictor variables and then to both the response and the predictors. The results were as follows.

# Log Transformation on income
CData$logincome = log(CData$income)
CModel1 <- lm(consumption ~ ptax+logincome+licenses,data= CData)
summary(CModel1)
## 
## Call:
## lm(formula = consumption ~ ptax + logincome + licenses, data = CData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107.49  -50.58  -11.76   25.86  239.97 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2392.1      594.6   4.023 0.000222 ***
## ptax           -29.8       10.6  -2.811 0.007355 ** 
## logincome     -284.8       71.8  -3.967 0.000265 ***
## licenses      1384.6      184.5   7.503 2.11e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.08 on 44 degrees of freedom
## Multiple R-squared:  0.6735, Adjusted R-squared:  0.6512 
## F-statistic: 30.25 on 3 and 44 DF,  p-value: 9.043e-11
# Inverse transformation on licenses 
CData$invlicenses = 1/(CData$licenses)
CModel2 <- lm(consumption ~ ptax+income+invlicenses,data= CData)
summary(CModel2)
## 
## Call:
## lm(formula = consumption ~ ptax + income + invlicenses, data = CData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -116.44  -47.91  -12.91   30.49  254.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1853.40060  148.90119  12.447 5.17e-16 ***
## ptax         -30.59331   11.16064  -2.741 0.008815 ** 
## income        -0.06393    0.01787  -3.578 0.000858 ***
## invlicenses -435.72586   64.30799  -6.776 2.44e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.55 on 44 degrees of freedom
## Multiple R-squared:  0.6383, Adjusted R-squared:  0.6136 
## F-statistic: 25.88 on 3 and 44 DF,  p-value: 8.377e-10
# Squareroot Transformation on ptax
CData$sqrtptax = sqrt(CData$ptax)
CModel3 <- lm(consumption ~ sqrtptax+income+licenses,data= CData)
summary(CModel3)
## 
## Call:
## lm(formula = consumption ~ sqrtptax + income + licenses, data = CData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.32  -50.68  -12.86   24.05  239.29 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  535.42847  219.50430   2.439 0.018819 *  
## sqrtptax    -163.76446   58.22115  -2.813 0.007312 ** 
## income        -0.06821    0.01698  -4.017 0.000227 ***
## licenses    1373.39663  183.41629   7.488 2.22e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.84 on 44 degrees of freedom
## Multiple R-squared:  0.6758, Adjusted R-squared:  0.6537 
## F-statistic: 30.57 on 3 and 44 DF,  p-value: 7.728e-11
# Log transformation on both predictors and response
CData$logconsumption = log(CData$consumption)
CData$logincome = log(CData$income)
CData$logptax = log(CData$ptax)
CData$loglicenses = log(CData$licenses)
CModel4 <- lm(logconsumption ~ logptax+logincome+loglicenses,data= CData)
summary(CModel4)
## 
## Call:
## lm(formula = logconsumption ~ logptax + logincome + loglicenses, 
##     data = CData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22192 -0.05915 -0.01848  0.05159  0.29613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.5851     0.9720  12.948  < 2e-16 ***
## logptax      -0.3792     0.1250  -3.033  0.00405 ** 
## logincome    -0.5652     0.1122  -5.040 8.45e-06 ***
## loglicenses   1.3412     0.1676   8.001 4.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1035 on 44 degrees of freedom
## Multiple R-squared:  0.7131, Adjusted R-squared:  0.6935 
## F-statistic: 36.45 on 3 and 44 DF,  p-value: 5.389e-12
# Inverse transformation on both predictors and response
CData$invconsumption = 1/(CData$consumption)
CData$invincome = 1/(CData$income)
CData$invptax = 1/(CData$ptax)
CData$invlicenses = 1/(CData$licenses)
CModel5 <- lm(invconsumption ~ invptax+invincome+invlicenses,data= CData)
summary(CModel5)
## 
## Call:
## lm(formula = invconsumption ~ invptax + invincome + invlicenses, 
##     data = CData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.690e-04 -1.081e-04  2.587e-05  9.861e-05  4.548e-04 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0010777  0.0004314   2.498  0.01629 *  
## invptax     -0.0047096  0.0015954  -2.952  0.00505 ** 
## invincome   -4.6234804  0.8106161  -5.704 9.18e-07 ***
## invlicenses  0.0013851  0.0001682   8.233 1.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0001814 on 44 degrees of freedom
## Multiple R-squared:  0.7257, Adjusted R-squared:  0.707 
## F-statistic: 38.81 on 3 and 44 DF,  p-value: 2.016e-12
# plot model 5
par(mfrow = c(2,2))
plot(CModel5)

Observation: -
The log transformation on income results in a slight decrease in adjusted R-squared and almost no change in the Q-Q plot and fitted-vs-residuals plot. The same was true for the log, square-root and inverse transformations of the other individual predictor variables.
Next, we tried log and inverse transformations on both the predictors and the response. For these, the adjusted R-squared increased to about 69.4% for the log transformation and 70.7% for the inverse transformation, although the Q-Q plot and fitted-vs-residuals plots still did not show any substantial change in linearity or variance. Of all the models, CModel5 therefore explains the largest share of the variation, and we select it as our final multiple regression model for the petrol consumption data. The adjusted R-squared values of all candidate models can be compared directly, as sketched below.
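A minimal sketch of that comparison, collecting the adjusted R-squared of every model fitted above:

# Adjusted R-squared of the untransformed and transformed models
sapply(list(original = CModel, log_income = CModel1, inv_licenses = CModel2,
            sqrt_ptax = CModel3, log_all = CModel4, inverse_all = CModel5),
       function(m) summary(m)$adj.r.squared)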

Confidence Interval and Prediction Interval for the Linear Regression Model

# PRESS residuals
PRESS_res=CModel5$residuals/(1 - lm.influence(CModel5)$hat)
# Confidence interval and prediction interval
confint(CModel5, level = 0.95)
##                     2.5 %       97.5 %
## (Intercept)  0.0002082788  0.001947161
## invptax     -0.0079250276 -0.001494218
## invincome   -6.2571697585 -2.989791039
## invlicenses  0.0010460711  0.001724221
predvalues<-data.frame(predict.lm(CModel5, interval="prediction"))
paged_table(predvalues)
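The PRESS residuals computed above are not summarised in the original output; a minimal sketch of the PRESS statistic and the corresponding predicted R-squared (standard definitions, computed on the inverse scale used by CModel5):

# PRESS statistic and predicted R-squared for CModel5 (on the inverse scale)
PRESS <- sum(PRESS_res^2)
SST   <- sum((CData$invconsumption - mean(CData$invconsumption))^2)
c(PRESS = PRESS, pred_R2 = 1 - PRESS / SST)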

Observation: -

confint() gives 95% confidence intervals for the regression coefficients: ranges that would contain the true coefficient values in 95% of repeated samples. The prediction intervals from predict.lm() give, for each observation, the range within which a new response value is expected to fall with 95% probability. The intervals are fairly narrow, so there is not much wiggle room around the fitted regression line. An example prediction for a hypothetical new state is sketched below.
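A minimal sketch of such a prediction for a single hypothetical state (the covariate values below are made up, roughly median-like, and are not from the dataset); note that CModel5 is fitted on the inverse scale, so the fit and bounds must be back-transformed:

# Hypothetical state: ptax = 7.5 cents, income = $4300, licenses proportion = 0.57
newstate <- data.frame(invptax = 1/7.5, invincome = 1/4300, invlicenses = 1/0.57)
inv_pred <- predict(CModel5, newdata = newstate, interval = "prediction")
1 / inv_pred   # back to the consumption scale; lower and upper bounds swap order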

Conclusion

Based on the hypothesis tests and stepwise regression, we concluded that the number of miles of paved highway does not determine the consumption of petrol in a state. Once this insignificant covariate was removed, we observed that the significance of petrol tax increased, meaning that petrol consumption varies with the average income, the proportion of the population with driver’s licenses and the petrol tax. Using the diagnostic plots, we observed no substantial non-linearity, non-constant variance or skewness. However, the inverse transformation on both the predictors and the response increased the adjusted R-squared, so this model explains the largest share of the variation in the data and is selected as our final regression model:
1/consumption = 0.0010777 + (-0.0047096)(1/ptax) + (-4.6234804)(1/income) + (0.0013851)(1/licenses)
We conclude that the consumption of petrol decreases with the petrol tax in a state and the average income per capita, and increases with the proportion of the population holding driver’s licenses.

Resource: -

http://people.sc.fsu.edu/~jburkardt/datasets/regression/x15.txt