Reading the input data and investigating its structure

df <- read.csv("automobile.csv")
str(df)
## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100.0","102.0",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model.year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car.name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...

From the structure above, we can see that a few attribute types need to be changed. Data preprocessing is required before moving on to the descriptive analysis.

# cylinders is categorical, so convert it to a factor; horsepower was read as a
# factor because of "?" placeholders, which become NA on numeric conversion
df$cylinders <- factor(df$cylinders)
df$horsepower <- as.numeric(trimws(df$horsepower))
## Warning: NAs introduced by coercion
df$model.year <- factor(df$model.year)
df$origin <- factor(df$origin, labels = c("USA", "Europe", "Japan") )
str(df)
## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model.year  : Factor w/ 13 levels "70","71","72",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ origin      : Factor w/ 3 levels "USA","Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ car.name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...

Now all the variables are of the correct type.

summary(df)
##       mpg        cylinders  displacement     horsepower        weight    
##  Min.   : 9.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.50   4:204     1st Qu.:104.2   1st Qu.: 75.0   1st Qu.:2224  
##  Median :23.00   5:  3     Median :148.5   Median : 93.5   Median :2804  
##  Mean   :23.51   6: 84     Mean   :193.4   Mean   :104.5   Mean   :2970  
##  3rd Qu.:29.00   8:103     3rd Qu.:262.0   3rd Qu.:126.0   3rd Qu.:3608  
##  Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                            NA's   :6                     
##   acceleration     model.year     origin              car.name  
##  Min.   : 8.00   73     : 40   USA   :249   ford pinto    :  6  
##  1st Qu.:13.82   78     : 36   Europe: 70   amc matador   :  5  
##  Median :15.50   76     : 34   Japan : 79   ford maverick :  5  
##  Mean   :15.57   82     : 31                toyota corolla:  5  
##  3rd Qu.:17.18   75     : 30                amc gremlin   :  4  
##  Max.   :24.80   70     : 29                amc hornet    :  4  
##                  (Other):198                (Other)       :369

We have NA values, which need to be filled before proceeding with the analysis.
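
A quick per-column count confirms where the missing values sit (a minimal sketch in base R):

# Count NA values in each column
colSums(is.na(df))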

Filling all NA values:-

As we saw, there are 6 NA values in horsepower; I plan to fill them with the mean of that column.

# Impute missing horsepower values with the rounded mean of the available values
mean_hp <- round(mean(df$horsepower, na.rm = TRUE), 0)
df$horsepower[is.na(df$horsepower)] <- mean_hp
remove(mean_hp)
summary(df)
##       mpg        cylinders  displacement     horsepower        weight    
##  Min.   : 9.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.50   4:204     1st Qu.:104.2   1st Qu.: 76.0   1st Qu.:2224  
##  Median :23.00   5:  3     Median :148.5   Median : 95.0   Median :2804  
##  Mean   :23.51   6: 84     Mean   :193.4   Mean   :104.5   Mean   :2970  
##  3rd Qu.:29.00   8:103     3rd Qu.:262.0   3rd Qu.:125.0   3rd Qu.:3608  
##  Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                          
##   acceleration     model.year     origin              car.name  
##  Min.   : 8.00   73     : 40   USA   :249   ford pinto    :  6  
##  1st Qu.:13.82   78     : 36   Europe: 70   amc matador   :  5  
##  Median :15.50   76     : 34   Japan : 79   ford maverick :  5  
##  Mean   :15.57   82     : 31                toyota corolla:  5  
##  3rd Qu.:17.18   75     : 30                amc gremlin   :  4  
##  Max.   :24.80   70     : 29                amc hornet    :  4  
##                  (Other):198                (Other)       :369

Now all the data is complete and preprocessing is done. We are ready to start the descriptive statistics. I assume car.name is not required for the analysis (the name will not affect mpg).

df <- subset(df , select = -c(car.name))
colnames(df)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "model.year"   "origin"

Analysing continuous attributes :-

par(mfrow = c(4, 3))

# Horizontal boxplot of a single numeric column, selected by name
plot_box <- function(attr){
  boxplot(df[[attr]], horizontal = TRUE)
}

# Histogram of a single numeric column, with mean, median and standard
# deviation shown in the subtitle
plot_hist <- function(attr){
  values <- df[[attr]]
  sub_text <- paste('Mean :', round(mean(values), 2),
                    '\nMedian:', round(median(values), 2),
                    '\nStd. Dev:', round(sd(values), 2))
  hist(values, main = attr, labels = TRUE, xlab = '', sub = sub_text)
}


plot_box('mpg')
plot_box('displacement')
plot_box('horsepower')

plot_hist('mpg')
plot_hist('displacement')
plot_hist('horsepower')


plot_box('weight')
plot_box('acceleration')
plot.new()
plot_hist('weight')
plot_hist('acceleration')

In almost all attributes, there are very few outliers.
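
To put a number on that, the points flagged by the usual 1.5 x IQR rule (the same rule boxplot() uses) can be counted; a small sketch, where num_cols is just a helper vector defined here:

# Count how many points each boxplot would flag as outliers
num_cols <- c("mpg", "displacement", "horsepower", "weight", "acceleration")
sapply(df[num_cols], function(v) length(boxplot.stats(v)$out))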

Analysing Categorical attributes :-

#"cylinders"  "model.year"   "origin"     

xtabs (~df$cylinders)
## df$cylinders
##   3   4   5   6   8 
##   4 204   3  84 103
xtabs (~df$model.year)
## df$model.year
## 70 71 72 73 74 75 76 77 78 79 80 81 82 
## 29 28 28 40 27 30 34 28 36 29 29 29 31
xtabs (~df$origin)
## df$origin
##    USA Europe  Japan 
##    249     70     79

We can see that, in the given dataset, 4-cylinder cars are the most common, the model years are almost equally represented, and most cars are from the USA.
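
The same counts expressed as proportions make this easier to read (a quick sketch):

# Share of cars per origin and per cylinder count
round(prop.table(xtabs(~ df$origin)), 2)
round(prop.table(xtabs(~ df$cylinders)), 2)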

# Cross-tabulate two categorical columns, draw a shaded mosaic plot, and
# print the table with row and column totals added
count_cross_table <- function(x, y){
  colname_x <- df[[x]]
  colname_y <- df[[y]]

  data <- xtabs(~ colname_x + colname_y)
  data_total <- cbind(data, 'Total' = rowSums(data))
  data_total <- rbind(data_total, 'Total' = colSums(data_total))
  mosaicplot(data, shade = TRUE)  # draws the plot; no need to print its NULL return value
  print(data_total)
}

The breakdown of cylinder counts across the different markets is shown below.

count_cross_table('cylinders' , 'origin')

##       USA Europe Japan Total
## 3       0      0     4     4
## 4      72     63    69   204
## 5       0      3     0     3
## 6      74      4     6    84
## 8     103      0     0   103
## Total 249     70    79   398
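
The shading in the mosaic plot reflects Pearson residuals; a formal test of the association between cylinders and origin could look like the sketch below (the zero counts for 3- and 5-cylinder cars mean the chi-squared approximation would come with a warning):

# Hypothetical follow-up: test independence of cylinder count and origin
chisq.test(xtabs(~ df$cylinders + df$origin))
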
count_cross_table('cylinders' , 'model.year')

##       70 71 72 73 74 75 76 77 78 79 80 81 82 Total
## 3      0  0  1  1  0  0  0  1  0  0  1  0  0     4
## 4      7 13 14 11 15 12 15 14 17 12 25 21 28   204
## 5      0  0  0  0  0  0  0  0  1  1  1  0  0     3
## 6      4  8  0  8  7 12 10  5 12  6  2  7  3    84
## 8     18  7 13 20  5  6  9  8  6 10  0  1  0   103
## Total 29 28 28 40 27 30 34 28 36 29 29 29 31   398
count_cross_table('origin' , 'model.year')

##        70 71 72 73 74 75 76 77 78 79 80 81 82 Total
## USA    22 20 18 29 15 20 22 18 22 23  7 13 20   249
## Europe  5  4  5  7  6  6  8  4  6  4  9  4  2    70
## Japan   2  4  5  4  6  4  4  6  8  2 13 12  9    79
## Total  29 28 28 40 27 30 34 28 36 29 29 29 31   398
boxplot(df$mpg ~ df$origin , xlab = "ORIGIN" , ylab = "MPG")

We can conclude that Japanese cars give the highest MPG and USA cars give the lowest.

boxplot(df$mpg ~ df$cylinders , xlab = "NO. OF CYLINDERs" , ylab = "MPG")

We can conclude that 4-cylinder engines give the best MPG and 8-cylinder engines the worst.
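
The group means behind those two boxplots can be computed directly (a small sketch):

# Mean mpg by origin and by cylinder count
round(tapply(df$mpg, df$origin, mean), 2)
round(tapply(df$mpg, df$cylinders, mean), 2)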

# Mean mpg for every combination of two categorical columns
mean_mpg_cross_table <- function(x, y){
  colname_x <- df[[x]]
  colname_y <- df[[y]]

  data <- round(tapply(df$mpg, list(colname_x, colname_y), FUN = mean), 2)
  print(data)
}
mean_mpg_cross_table("cylinders" , "origin")
##     USA Europe Japan
## 3    NA     NA 20.55
## 4 27.84  28.41 31.60
## 5    NA  27.37    NA
## 6 19.66  20.10 23.88
## 8 14.96     NA    NA
mean_mpg_cross_table("cylinders" , "model.year")
##      70    71    72    73    74    75    76    77    78    79    80    81    82
## 3    NA    NA 19.00 18.00    NA    NA    NA 21.50    NA    NA 23.70    NA    NA
## 4 25.29 27.46 23.43 22.73 27.80 25.25 26.77 29.11 29.58 31.52 34.61 32.81 32.07
## 5    NA    NA    NA    NA    NA    NA    NA    NA 20.30 25.40 36.40    NA    NA
## 6 20.50 18.00    NA 19.00 17.86 17.58 20.00 19.50 19.07 22.95 25.90 23.43 28.33
## 8 14.11 13.43 13.62 13.20 14.20 15.67 14.67 16.00 19.05 18.63    NA 26.60    NA
mean_mpg_cross_table("origin" , "model.year")
##           70    71    72    73    74    75    76    77    78    79    80    81
## USA    15.27 18.10 16.28 15.03 18.33 17.55 19.43 20.72 21.77 23.48 25.91 27.53
## Europe 25.20 28.75 22.00 24.00 27.00 24.50 24.25 29.25 24.95 30.45 37.29 31.57
## Japan  25.50 29.50 24.20 20.00 29.33 27.50 28.00 27.42 29.69 32.95 35.40 32.96
##           82
## USA    29.45
## Europe 40.00
## Japan  34.89
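
These tables suggest mpg improves steadily with model year; a quick way to see that trend is to plot the yearly means (a sketch; yearly_mpg is a name introduced here):

# Mean mpg per model year, plotted as a simple trend line
yearly_mpg <- tapply(df$mpg, df$model.year, mean)
plot(as.integer(names(yearly_mpg)), yearly_mpg, type = "b",
     xlab = "Model year", ylab = "Mean mpg")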

Analysing correlation between continuous variables :-

temp <- subset(df , select = c(mpg , displacement , horsepower , weight , acceleration))
pairs(temp)

cor(temp)
##                     mpg displacement horsepower     weight acceleration
## mpg           1.0000000   -0.8042028 -0.7715428 -0.8317409    0.4202889
## displacement -0.8042028    1.0000000  0.8937600  0.9328241   -0.5436841
## horsepower   -0.7715428    0.8937600  1.0000000  0.8606759   -0.6843761
## weight       -0.8317409    0.9328241  0.8606759  1.0000000   -0.4174573
## acceleration  0.4202889   -0.5436841 -0.6843761 -0.4174573    1.0000000
cor(temp , method = "spearman")
##                     mpg displacement horsepower     weight acceleration
## mpg           1.0000000   -0.8556920 -0.8431804 -0.8749474    0.4386775
## displacement -0.8556920    1.0000000  0.8666701  0.9459856   -0.4965119
## horsepower   -0.8431804    0.8666701  1.0000000  0.8686589   -0.6475569
## weight       -0.8749474    0.9459856  0.8686589  1.0000000   -0.4045504
## acceleration  0.4386775   -0.4965119 -0.6475569 -0.4045504    1.0000000
cor(temp , method = "kendall")
##                     mpg displacement horsepower     weight acceleration
## mpg           1.0000000   -0.6798473 -0.6689616 -0.6940062    0.3010959
## displacement -0.6798473    1.0000000  0.7081657  0.8005077   -0.3521098
## horsepower   -0.6689616    0.7081657  1.0000000  0.6934499   -0.4789829
## weight       -0.6940062    0.8005077  0.6934499  1.0000000   -0.2686194
## acceleration  0.3010959   -0.3521098 -0.4789829 -0.2686194    1.0000000
remove(temp)

From the correlation analysis above, we can observe that mpg is strongly (negatively) correlated with displacement, horsepower & weight, and only weakly correlated with acceleration.
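
Whether the weaker mpg vs. acceleration relationship is still statistically significant can be checked with a correlation test (a quick sketch):

# Test whether the mpg / acceleration correlation differs from zero
cor.test(df$mpg, df$acceleration)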

So, for a linear model to predict mpg, the main predictor variables are displacement, horsepower & weight (with mpg as the dependent variable).

Fitting Multi-linear models :-

linear_model <- lm(mpg ~ . , data = df )
summary(linear_model)
## 
## Call:
## lm(formula = mpg ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0814 -1.6822  0.0262  1.5064 11.4178 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.8808663  2.3462437  13.162  < 2e-16 ***
## cylinders4    6.8394092  1.5452976   4.426 1.26e-05 ***
## cylinders5    6.5854305  2.3508804   2.801 0.005354 ** 
## cylinders6    4.4198866  1.7146350   2.578 0.010326 *  
## cylinders8    6.5807174  1.9781061   3.327 0.000965 ***
## displacement  0.0124843  0.0067827   1.841 0.066469 .  
## horsepower   -0.0359782  0.0124043  -2.900 0.003946 ** 
## weight       -0.0054794  0.0006102  -8.980  < 2e-16 ***
## acceleration  0.0197250  0.0845004   0.233 0.815554    
## model.year71  1.0175016  0.8064446   1.262 0.207836    
## model.year72 -0.3656267  0.8068806  -0.453 0.650713    
## model.year73 -0.4691928  0.7243550  -0.648 0.517550    
## model.year74  1.4168300  0.8466544   1.673 0.095074 .  
## model.year75  1.0276371  0.8371845   1.227 0.220408    
## model.year76  1.6271521  0.8029627   2.026 0.043427 *  
## model.year77  3.1360387  0.8218486   3.816 0.000159 ***
## model.year78  3.0901815  0.7814901   3.954 9.18e-05 ***
## model.year79  5.0337818  0.8262184   6.093 2.76e-09 ***
## model.year80  9.1606734  0.8555946  10.707  < 2e-16 ***
## model.year81  6.7259402  0.8527299   7.888 3.41e-14 ***
## model.year82  7.8808959  0.8455609   9.320  < 2e-16 ***
## originEurope  1.9276524  0.5108057   3.774 0.000187 ***
## originJapan   2.3503610  0.4914855   4.782 2.50e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.866 on 375 degrees of freedom
## Multiple R-squared:  0.873,  Adjusted R-squared:  0.8656 
## F-statistic: 117.2 on 22 and 375 DF,  p-value: < 2.2e-16
linear_model <- lm(mpg ~ displacement + horsepower + weight , data = df )
summary(linear_model)
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4990  -2.8528  -0.3605   2.2262  16.1952 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.9142370  1.1979300  37.493  < 2e-16 ***
## displacement -0.0067407  0.0065582  -1.028  0.30466    
## horsepower   -0.0374981  0.0126691  -2.960  0.00326 ** 
## weight       -0.0054466  0.0007114  -7.656 1.49e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.265 on 394 degrees of freedom
## Multiple R-squared:  0.7045, Adjusted R-squared:  0.7023 
## F-statistic: 313.2 on 3 and 394 DF,  p-value: < 2.2e-16
remove(linear_model)
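
To compare the full model against the reduced one beyond eyeballing R-squared, the two fits could be compared by AIC, as in the sketch below (full_model and reduced_model are names introduced here; lower AIC is better):

# Refit both models and compare them by AIC
full_model    <- lm(mpg ~ ., data = df)
reduced_model <- lm(mpg ~ displacement + horsepower + weight, data = df)
AIC(full_model, reduced_model)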

Fitting Anova model :-

aov_model <- aov(mpg ~ . , data = df)
summary(aov_model)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## cylinders      4  15455    3864 470.465  < 2e-16 ***
## displacement   1   1165    1165 141.907  < 2e-16 ***
## horsepower     1    534     534  65.047 9.96e-15 ***
## weight         1    771     771  93.910  < 2e-16 ***
## acceleration   1      4       4   0.456      0.5    
## model.year    12   3034     253  30.790  < 2e-16 ***
## origin         2    209     105  12.735 4.45e-06 ***
## Residuals    375   3080       8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
remove(aov_model)
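
As a possible follow-up to the ANOVA, pairwise differences in mean mpg between origins could be examined with Tukey's HSD (a sketch on a one-way model, not part of the fit above):

# Pairwise comparison of mean mpg across origins
TukeyHSD(aov(mpg ~ origin, data = df))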

In this way, we can fit various multi-linear and ANOVA models to predict the MPG of a car based on the available attributes.

Thank you very much.