Reading the input data and investigating its structure
df <- read.csv("automobile.csv")
str(df)
## 'data.frame': 398 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : Factor w/ 94 levels "?","100.0","102.0",..: 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model.year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ car.name : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
From the structure above, we can conclude that a few attribute types need to be changed. Data preprocessing is required before we move on to the descriptive analysis.
df$cylinders <- factor(df$cylinders)
df$horsepower <- as.numeric(trimws(df$horsepower))
## Warning: NAs introduced by coercion
df$model.year <- factor(df$model.year)
df$origin <- factor(df$origin, labels = c("USA", "Europe", "Japan") )
str(df)
## 'data.frame': 398 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model.year : Factor w/ 13 levels "70","71","72",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ origin : Factor w/ 3 levels "USA","Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ car.name : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
Now all the variables have the correct type.
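The coercion warning above comes from the "?" placeholders in horsepower. As a possible alternative, read.csv can be told to treat "?" as missing while reading the file, through its na.strings argument, so the column is parsed as numeric directly. The inline CSV text below is made-up sample data, not the contents of automobile.csv.

```r
# Made-up sample rows standing in for automobile.csv (assumption for illustration)
csv_text <- "mpg,horsepower\n18,130.0\n25,?\n15,165.0\n"

# na.strings = "?" makes read.csv treat the placeholder as NA during parsing
sample_df <- read.csv(text = csv_text, na.strings = "?")

str(sample_df)
sum(is.na(sample_df$horsepower))  # number of placeholder entries
```

With this approach no later as.numeric conversion is needed, and the number of missing values can be checked right after import.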
summary(df)
##       mpg        cylinders  displacement     horsepower        weight
##  Min.   : 9.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613
##  1st Qu.:17.50   4:204     1st Qu.:104.2   1st Qu.: 75.0   1st Qu.:2224
##  Median :23.00   5:  3     Median :148.5   Median : 93.5   Median :2804
##  Mean   :23.51   6: 84     Mean   :193.4   Mean   :104.5   Mean   :2970
##  3rd Qu.:29.00   8:103     3rd Qu.:262.0   3rd Qu.:126.0   3rd Qu.:3608
##  Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140
##                                            NA's   :6
##   acceleration     model.year     origin              car.name
##  Min.   : 8.00   73     : 40   USA   :249   ford pinto    :  6
##  1st Qu.:13.82   78     : 36   Europe: 70   amc matador   :  5
##  Median :15.50   76     : 34   Japan : 79   ford maverick :  5
##  Mean   :15.57   82     : 31                toyota corolla:  5
##  3rd Qu.:17.18   75     : 30                amc gremlin   :  4
##  Max.   :24.80   70     : 29                amc hornet    :  4
##                  (Other):198                (Other)       :369
We have NA values, which need to be filled before the analysis.
There are 6 NA values in horsepower; I will fill them with the mean of that column.
# Mean of the available (non-missing) horsepower values, rounded to a whole number
hp_mean <- round(mean(df$horsepower, na.rm = TRUE), 0)
# Replace the missing values with that mean
df$horsepower[is.na(df$horsepower)] <- hp_mean
remove(hp_mean)
summary(df)
##       mpg        cylinders  displacement     horsepower        weight
##  Min.   : 9.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613
##  1st Qu.:17.50   4:204     1st Qu.:104.2   1st Qu.: 76.0   1st Qu.:2224
##  Median :23.00   5:  3     Median :148.5   Median : 95.0   Median :2804
##  Mean   :23.51   6: 84     Mean   :193.4   Mean   :104.5   Mean   :2970
##  3rd Qu.:29.00   8:103     3rd Qu.:262.0   3rd Qu.:125.0   3rd Qu.:3608
##  Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140
##
##   acceleration     model.year     origin              car.name
##  Min.   : 8.00   73     : 40   USA   :249   ford pinto    :  6
##  1st Qu.:13.82   78     : 36   Europe: 70   amc matador   :  5
##  Median :15.50   76     : 34   Japan : 79   ford maverick :  5
##  Mean   :15.57   82     : 31                toyota corolla:  5
##  3rd Qu.:17.18   75     : 30                amc gremlin   :  4
##  Max.   :24.80   70     : 29                amc hornet    :  4
##                  (Other):198                (Other)       :369
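The mean imputation used above can be sketched as a small reusable helper. The name impute_mean and the toy vector are my own assumptions for illustration; they are not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): replace NA values
# in a numeric vector with the rounded mean of the observed values.
impute_mean <- function(x, digits = 0) {
  fill <- round(mean(x, na.rm = TRUE), digits)
  x[is.na(x)] <- fill
  x
}

hp <- c(130, NA, 150, 140, NA)  # toy values, not taken from the dataset
hp_filled <- impute_mean(hp)
print(hp_filled)
```

Mean imputation keeps the column mean almost unchanged but pulls the distribution towards its centre, which is why the quartiles and median of horsepower shift slightly between the two summaries.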
Now all values are available and data preprocessing is complete. We are ready to start the descriptive statistics. I assume car.name is not required for our analysis (the name will not affect mpg).
df <- subset(df , select = -c(car.name))
colnames(df)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "model.year" "origin"
par(mfrow=c(4,3))
# Horizontal boxplot of one numeric column of df, selected by name
plot_box <- function(attr) {
  boxplot(df[[attr]], horizontal = TRUE)
}
# Histogram of one numeric column of df, annotated with mean, median and standard deviation
plot_hist <- function(attr) {
  values <- df[[attr]]
  sub_text <- paste('Mean :', round(mean(values), 2),
                    '\nMedian:', round(median(values), 2),
                    '\nStd. Dev:', round(sd(values), 2))
  hist(values, main = attr, labels = TRUE, xlab = '', sub = sub_text)
}
plot_box('mpg')
plot_box('displacement')
plot_box('horsepower')
plot_hist('mpg')
plot_hist('displacement')
plot_hist('horsepower')
plot_box('weight')
plot_box('acceleration')
plot.new()
plot_hist('weight')
plot_hist('acceleration')
Almost all attributes have very few outliers.
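The visual impression from the boxplots can be backed by a count. A minimal sketch, assuming the same whisker rule that boxplot uses by default: boxplot.stats returns the points drawn as outliers in its out component. The built-in mtcars dataset is only a stand-in here, and count_outliers is a hypothetical helper name.

```r
# Count the points that a default boxplot would draw beyond its whiskers
# (more than 1.5 times the box length away from the box).
count_outliers <- function(x) length(boxplot.stats(x)$out)

# mtcars is a stand-in dataset; its columns are not those of automobile.csv
numeric_cols <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
outlier_counts <- sapply(numeric_cols, count_outliers)
print(outlier_counts)
```

Applied to df, the same call on mpg, displacement, horsepower, weight and acceleration would give one number per attribute instead of a judgement by eye.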
Analysing categorical attributes:
#"cylinders" "model.year" "origin"
xtabs (~df$cylinders)
## df$cylinders
## 3 4 5 6 8
## 4 204 3 84 103
xtabs (~df$model.year)
## df$model.year
## 70 71 72 73 74 75 76 77 78 79 80 81 82
## 29 28 28 40 27 30 34 28 36 29 29 29 31
xtabs (~df$origin)
## df$origin
## USA Europe Japan
## 249 70 79
In the given dataset, 4-cylinder cars are the most common (204 of 398). The model years are fairly evenly represented, with 27 to 40 cars per year. Most cars are from the USA (249 of 398).
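The counts above are easier to compare as shares. A small sketch using prop.table; the counts are typed in from the xtabs output above so that the example runs on its own.

```r
# Counts copied from the xtabs() output above
cyl_counts <- c("3" = 4, "4" = 204, "5" = 3, "6" = 84, "8" = 103)
origin_counts <- c(USA = 249, Europe = 70, Japan = 79)

# prop.table() divides each count by the total
cyl_share <- prop.table(cyl_counts)
origin_share <- prop.table(origin_counts)

# Shares in percent, rounded to one decimal place
print(round(100 * cyl_share, 1))
print(round(100 * origin_share, 1))
```

On the real data, prop.table(xtabs(~ df$cylinders)) would give the same shares without retyping the counts.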
# Cross-tabulate two categorical columns of df (given by name), draw a mosaic plot
# and print the counts with row and column totals
count_cross_table <- function(x, y) {
  colname_x <- df[[x]]
  colname_y <- df[[y]]
  data <- xtabs(~ colname_x + colname_y)
  data_total <- cbind(data, 'Total' = rowSums(data))
  data_total <- rbind(data_total, 'Total' = colSums(data_total))
  # mosaicplot() returns NULL, so this print() is what emits the NULL line in the output
  print(mosaicplot(data, shade = TRUE))
  print(data_total)
}
The number of cars by cylinder count in each market of origin is shown below.
count_cross_table('cylinders' , 'origin')
## NULL
## USA Europe Japan Total
## 3 0 0 4 4
## 4 72 63 69 204
## 5 0 3 0 3
## 6 74 4 6 84
## 8 103 0 0 103
## Total 249 70 79 398
count_cross_table('cylinders' , 'model.year')
## NULL
## 70 71 72 73 74 75 76 77 78 79 80 81 82 Total
## 3 0 0 1 1 0 0 0 1 0 0 1 0 0 4
## 4 7 13 14 11 15 12 15 14 17 12 25 21 28 204
## 5 0 0 0 0 0 0 0 0 1 1 1 0 0 3
## 6 4 8 0 8 7 12 10 5 12 6 2 7 3 84
## 8 18 7 13 20 5 6 9 8 6 10 0 1 0 103
## Total 29 28 28 40 27 30 34 28 36 29 29 29 31 398
count_cross_table('origin' , 'model.year')
## NULL
## 70 71 72 73 74 75 76 77 78 79 80 81 82 Total
## USA 22 20 18 29 15 20 22 18 22 23 7 13 20 249
## Europe 5 4 5 7 6 6 8 4 6 4 9 4 2 70
## Japan 2 4 5 4 6 4 4 6 8 2 13 12 9 79
## Total 29 28 28 40 27 30 34 28 36 29 29 29 31 398
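The row and column totals that count_cross_table builds with cbind and rbind can also be obtained with the base R function addmargins. A minimal sketch, with the cylinders by origin counts typed in from the output above so that it runs without the dataset:

```r
# Counts copied from the cylinders x origin table above
counts <- as.table(matrix(
  c(  0,  0,  4,
     72, 63, 69,
      0,  3,  0,
     74,  4,  6,
    103,  0,  0),
  nrow = 5, byrow = TRUE,
  dimnames = list(cylinders = c("3", "4", "5", "6", "8"),
                  origin = c("USA", "Europe", "Japan"))
))

# addmargins() appends a "Sum" row and a "Sum" column
with_totals <- addmargins(counts)
print(with_totals)
```

The margins are labelled Sum instead of Total, which is the main visible difference from the hand-built version.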
boxplot(df$mpg ~ df$origin , xlab = "ORIGIN" , ylab = "MPG")
We can conclude that Japanese cars tend to give the highest MPG and USA cars the lowest.
boxplot(df$mpg ~ df$cylinders , xlab = "NO. OF CYLINDERS" , ylab = "MPG")
We can conclude that 4-cylinder engines tend to give the highest MPG and 8-cylinder engines the lowest.
# Mean mpg for every combination of two categorical columns of df (given by name)
mean_mpg_cross_table <- function(x, y) {
  data <- round(tapply(df$mpg, list(df[[x]], df[[y]]), FUN = mean), 2)
  print(data)
}
mean_mpg_cross_table("cylinders" , "origin")
## USA Europe Japan
## 3 NA NA 20.55
## 4 27.84 28.41 31.60
## 5 NA 27.37 NA
## 6 19.66 20.10 23.88
## 8 14.96 NA NA
mean_mpg_cross_table("cylinders" , "model.year")
## 70 71 72 73 74 75 76 77 78 79 80 81 82
## 3 NA NA 19.00 18.00 NA NA NA 21.50 NA NA 23.70 NA NA
## 4 25.29 27.46 23.43 22.73 27.80 25.25 26.77 29.11 29.58 31.52 34.61 32.81 32.07
## 5 NA NA NA NA NA NA NA NA 20.30 25.40 36.40 NA NA
## 6 20.50 18.00 NA 19.00 17.86 17.58 20.00 19.50 19.07 22.95 25.90 23.43 28.33
## 8 14.11 13.43 13.62 13.20 14.20 15.67 14.67 16.00 19.05 18.63 NA 26.60 NA
mean_mpg_cross_table("origin" , "model.year")
## 70 71 72 73 74 75 76 77 78 79 80 81
## USA 15.27 18.10 16.28 15.03 18.33 17.55 19.43 20.72 21.77 23.48 25.91 27.53
## Europe 25.20 28.75 22.00 24.00 27.00 24.50 24.25 29.25 24.95 30.45 37.29 31.57
## Japan 25.50 29.50 24.20 20.00 29.33 27.50 28.00 27.42 29.69 32.95 35.40 32.96
## 82
## USA 29.45
## Europe 40.00
## Japan 34.89
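The NA entries in the mean tables above are not missing mpg values. They mark combinations that have no cars at all (for example 8-cylinder cars from Europe, which have a count of 0 in the count table). A small sketch with made-up data shows how tapply produces them:

```r
# Toy data (assumption): group "b" never occurs together with region "y"
toy <- data.frame(
  value  = c(10, 20, 30),
  group  = factor(c("a", "a", "b")),
  region = factor(c("x", "y", "x"))
)

# tapply() returns NA for a cell that contains no observations
cell_means <- tapply(toy$value, list(toy$group, toy$region), FUN = mean)
print(cell_means)
```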
temp <- subset(df , select = c(mpg , displacement , horsepower , weight , acceleration))
pairs(temp)
cor(temp)
## mpg displacement horsepower weight acceleration
## mpg 1.0000000 -0.8042028 -0.7715428 -0.8317409 0.4202889
## displacement -0.8042028 1.0000000 0.8937600 0.9328241 -0.5436841
## horsepower -0.7715428 0.8937600 1.0000000 0.8606759 -0.6843761
## weight -0.8317409 0.9328241 0.8606759 1.0000000 -0.4174573
## acceleration 0.4202889 -0.5436841 -0.6843761 -0.4174573 1.0000000
cor(temp , method = "spearman")
## mpg displacement horsepower weight acceleration
## mpg 1.0000000 -0.8556920 -0.8431804 -0.8749474 0.4386775
## displacement -0.8556920 1.0000000 0.8666701 0.9459856 -0.4965119
## horsepower -0.8431804 0.8666701 1.0000000 0.8686589 -0.6475569
## weight -0.8749474 0.9459856 0.8686589 1.0000000 -0.4045504
## acceleration 0.4386775 -0.4965119 -0.6475569 -0.4045504 1.0000000
cor(temp , method = "kendall")
## mpg displacement horsepower weight acceleration
## mpg 1.0000000 -0.6798473 -0.6689616 -0.6940062 0.3010959
## displacement -0.6798473 1.0000000 0.7081657 0.8005077 -0.3521098
## horsepower -0.6689616 0.7081657 1.0000000 0.6934499 -0.4789829
## weight -0.6940062 0.8005077 0.6934499 1.0000000 -0.2686194
## acceleration 0.3010959 -0.3521098 -0.4789829 -0.2686194 1.0000000
remove(temp)
From the correlation analysis above, mpg is strongly negatively correlated with displacement, horsepower and weight (Pearson coefficients between -0.77 and -0.83). Its correlation with acceleration is positive but much weaker (about 0.42).
So, for a linear model to predict mpg, the candidate independent (predictor) variables are displacement, horsepower and weight; mpg is the dependent variable.
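A correlation coefficient alone does not say how certain the association is. As a hedged sketch, cor.test adds a significance test and a confidence interval for a single pair of variables. The built-in mtcars dataset is used as a stand-in, with wt playing the role of weight.

```r
# mtcars is a stand-in dataset; wt plays the role of weight here
res <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

res$estimate   # Pearson correlation coefficient
res$p.value    # p-value for the null hypothesis of zero correlation
res$conf.int   # 95% confidence interval for the coefficient
```

On the real data, the same call with df$mpg and df$acceleration would show whether the weaker correlation of about 0.42 is still distinguishable from zero.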
Fitting multiple linear regression models:
linear_model <- lm(mpg ~ . , data = df )
summary(linear_model)
##
## Call:
## lm(formula = mpg ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0814 -1.6822 0.0262 1.5064 11.4178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.8808663 2.3462437 13.162 < 2e-16 ***
## cylinders4 6.8394092 1.5452976 4.426 1.26e-05 ***
## cylinders5 6.5854305 2.3508804 2.801 0.005354 **
## cylinders6 4.4198866 1.7146350 2.578 0.010326 *
## cylinders8 6.5807174 1.9781061 3.327 0.000965 ***
## displacement 0.0124843 0.0067827 1.841 0.066469 .
## horsepower -0.0359782 0.0124043 -2.900 0.003946 **
## weight -0.0054794 0.0006102 -8.980 < 2e-16 ***
## acceleration 0.0197250 0.0845004 0.233 0.815554
## model.year71 1.0175016 0.8064446 1.262 0.207836
## model.year72 -0.3656267 0.8068806 -0.453 0.650713
## model.year73 -0.4691928 0.7243550 -0.648 0.517550
## model.year74 1.4168300 0.8466544 1.673 0.095074 .
## model.year75 1.0276371 0.8371845 1.227 0.220408
## model.year76 1.6271521 0.8029627 2.026 0.043427 *
## model.year77 3.1360387 0.8218486 3.816 0.000159 ***
## model.year78 3.0901815 0.7814901 3.954 9.18e-05 ***
## model.year79 5.0337818 0.8262184 6.093 2.76e-09 ***
## model.year80 9.1606734 0.8555946 10.707 < 2e-16 ***
## model.year81 6.7259402 0.8527299 7.888 3.41e-14 ***
## model.year82 7.8808959 0.8455609 9.320 < 2e-16 ***
## originEurope 1.9276524 0.5108057 3.774 0.000187 ***
## originJapan 2.3503610 0.4914855 4.782 2.50e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.866 on 375 degrees of freedom
## Multiple R-squared: 0.873, Adjusted R-squared: 0.8656
## F-statistic: 117.2 on 22 and 375 DF, p-value: < 2.2e-16
linear_model <- lm(mpg ~ displacement + horsepower + weight , data = df )
summary(linear_model)
##
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4990 -2.8528 -0.3605 2.2262 16.1952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.9142370 1.1979300 37.493 < 2e-16 ***
## displacement -0.0067407 0.0065582 -1.028 0.30466
## horsepower -0.0374981 0.0126691 -2.960 0.00326 **
## weight -0.0054466 0.0007114 -7.656 1.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.265 on 394 degrees of freedom
## Multiple R-squared: 0.7045, Adjusted R-squared: 0.7023
## F-statistic: 313.2 on 3 and 394 DF, p-value: < 2.2e-16
remove(linear_model)
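The two models above differ clearly in adjusted R-squared (0.8656 against 0.7023). When one model is nested in the other, anova can test whether the extra terms improve the fit. A minimal sketch on the stand-in mtcars dataset; the choice of extra terms is an assumption for illustration only.

```r
# mtcars is a stand-in dataset; disp, hp and wt play the roles of
# displacement, horsepower and weight
reduced <- lm(mpg ~ disp + hp + wt, data = mtcars)
full    <- lm(mpg ~ disp + hp + wt + factor(cyl) + qsec, data = mtcars)

# F-test of the reduced model against the full model
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared of both models for comparison
c(reduced = summary(reduced)$adj.r.squared,
  full    = summary(full)$adj.r.squared)
```

For this comparison both fits have to be kept under different names, because the code above reuses linear_model for the second fit.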
aov_model <- aov(mpg ~ . , data = df)
summary(aov_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## cylinders 4 15455 3864 470.465 < 2e-16 ***
## displacement 1 1165 1165 141.907 < 2e-16 ***
## horsepower 1 534 534 65.047 9.96e-15 ***
## weight 1 771 771 93.910 < 2e-16 ***
## acceleration 1 4 4 0.456 0.5
## model.year 12 3034 253 30.790 < 2e-16 ***
## origin 2 209 105 12.735 4.45e-06 ***
## Residuals 375 3080 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
remove(aov_model)
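One point to keep in mind when reading the table above: aov reports sequential sums of squares, so each row measures what a term adds after the terms listed before it. With correlated predictors, the order of the terms in the formula changes the numbers. A small sketch on the stand-in mtcars dataset:

```r
# mtcars is a stand-in dataset; wt and hp are correlated predictors of mpg
# Sum of squares for wt when it enters the model first
ss_wt_first  <- anova(lm(mpg ~ wt + hp, data = mtcars))["wt", "Sum Sq"]
# Sum of squares for wt when it enters after hp
ss_wt_second <- anova(lm(mpg ~ hp + wt, data = mtcars))["wt", "Sum Sq"]

print(c(wt_first = ss_wt_first, wt_second = ss_wt_second))
```

In the table above, cylinders is listed first, so its large sum of squares includes variation that displacement, horsepower and weight could also explain.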
In this way, we can fit various multiple linear regression and ANOVA models to predict the MPG of a car based on the available attributes.
Thank you very much.