This work is part of my effort to become a well-versed data analyst. At this point, and for the immediate future, I will undoubtedly be a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. But with more practice, the quality and depth of my work will improve (that is the whole point!).
Update-1: In retrospect, I believe it would have been fine to coerce only the integer-class variables to numeric and leave the rest as given; for example, not to change the ‘horsepower’ predictor to a numeric type. That said, my choice to reassign variable classes as I saw fit shapes much of the work that follows.
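For reference, coercing only the integer-class columns to numeric while leaving every other column untouched can be done in one pass; a minimal sketch (an alternative, not the approach actually used below):
# Coerce only integer columns to numeric; leave all other classes as given
Auto[] <- lapply(Auto, function(x) if (is.integer(x)) as.numeric(x) else x)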
(a) Which of the predictors are quantitative, and which are qualitative?
library("ISLR")
##
## Attaching package: 'ISLR'
##
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : Factor w/ 13 levels "70","71","72",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Based on the output, the qualitative variables are ‘year’, ‘origin’, and ‘name’. For the later parts of this problem, it is preferable to have all the quantitative predictors as variables of type numeric (instead of some being of type int). In addition, for convenience, we want ‘year’ and ‘origin’ as factor variables.
Auto$cylinders <- as.numeric(Auto$cylinders)
Auto$horsepower <- as.numeric(Auto$horsepower)
Auto$year <- as.factor(Auto$year)
Auto$origin <- as.factor(Auto$origin)
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : Factor w/ 13 levels "70","71","72",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
(b) What is the range of each quantitative predictor?
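The chunk that builds ‘quant_data’ is not shown; assuming it collects the six numeric columns of ‘Auto’ (after the class changes above), a sketch of how the ranges below could be obtained:
# Sketch (chunk not shown): keep the numeric predictors and compute column ranges
quant_data <- Auto[, sapply(Auto, is.numeric)]
apply(quant_data, 2, range)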
## mpg cylinders displacement horsepower weight acceleration
## [1,] 9.0 3 68 1 1613 8.0
## [2,] 46.6 8 455 94 5140 24.8
(c) What is the mean and standard deviation of each quantitative predictor?
colMeans(quant_data)
## mpg cylinders displacement horsepower weight
## 23.515869 5.458438 193.532746 51.516373 2970.261965
## acceleration
## 15.555668
apply(quant_data,2, sd)
## mpg cylinders displacement horsepower weight
## 7.825804 1.701577 104.379583 29.862697 847.904119
## acceleration
## 2.749995
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
dummy <- quant_data[c(10:85), ]  # note: this keeps rows 10-85; removing them would use quant_data[-(10:85), ]
apply(dummy, 2, range)
## mpg cylinders displacement horsepower weight acceleration
## [1,] 9 3 70 1 1613 8.0
## [2,] 35 8 455 93 5140 23.5
colMeans(dummy)
## mpg cylinders displacement horsepower weight
## 19.618421 5.828947 220.914474 53.710526 3123.578947
## acceleration
## 14.848684
apply(dummy,2, sd)
## mpg cylinders displacement horsepower weight
## 6.123108 1.857512 119.291167 29.032770 981.196104
## acceleration
## 2.940544
rm("dummy")
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
pairs(quant_data)
#___________________________________________________________
# Functions found online (on R-bloggers, written by Stephen Turner)
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X, use = "pairwise.complete.obs")
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}
## Use this to dump the cor.prob output to a 4-column data frame
## with row/column indices, correlation, and p-value.
## See StackOverflow question: http://goo.gl/fCUcQ
flattenSquareMatrix <- function(m) {
  if (!is.matrix(m) || (nrow(m) != ncol(m))) stop("Must be a square matrix.")
  if (!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i   = rownames(m)[row(m)[ut]],
             j   = rownames(m)[col(m)[ut]],
             cor = t(m)[ut],
             p   = m[ut])
}
#____________________________________________________________
#____________________________________________________________
cor.Matrix <- flattenSquareMatrix(cor.prob(quant_data))
cor.Matrix
## i j cor p
## 1 mpg cylinders -0.7762599 0.000000e+00
## 2 mpg displacement -0.8044430 0.000000e+00
## 3 cylinders displacement 0.9509199 0.000000e+00
## 4 mpg horsepower 0.4228227 0.000000e+00
## 5 cylinders horsepower -0.5466585 0.000000e+00
## 6 displacement horsepower -0.4820705 0.000000e+00
## 7 mpg weight -0.8317389 0.000000e+00
## 8 cylinders weight 0.8970169 0.000000e+00
## 9 displacement weight 0.9331044 0.000000e+00
## 10 horsepower weight -0.4821507 0.000000e+00
## 11 mpg acceleration 0.4222974 0.000000e+00
## 12 cylinders acceleration -0.5040606 0.000000e+00
## 13 displacement acceleration -0.5441618 0.000000e+00
## 14 horsepower acceleration 0.2662877 7.181163e-08
## 15 weight acceleration -0.4195023 0.000000e+00
The correlation matrix above summarizes the relationships between the quantitative predictors. It is a surprise to see that the number of cylinders is negatively related to acceleration and to horsepower, and that mpg is positively related to horsepower (and to acceleration).
Some skepticism is warranted for these correlation estimates; in particular, the ‘horsepower’ values likely reflect the class coercion mentioned in Update-1 rather than the original measurements, so they should not be taken at face value.
plot( Auto$mpg ~ Auto$year , xlab ="Year(1900s)", ylab="MPG")
plot( Auto$mpg ~ Auto$origin , xlab ="Origin", ylab="MPG")
The first plot above shows that there is, on average, a positive relation between the year of production and the MPG of the car. This makes sense given advances in technology and engineering over time, especially in recent years when technology has advanced at a great rate.
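A quick numerical check of this trend is to look at the average MPG within each model year; a minimal sketch (output not shown):
# Mean mpg by model year ('year' is a factor here)
tapply(Auto$mpg, Auto$year, mean)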
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
At first glance, the scatterplot matrix of the quantitative variables gives us an idea of the relationship between MPG and the other quantitative variables:
* mpg ~ cylinders: negative relation
* mpg ~ displacement: negative relation
* mpg ~ horsepower: negative relation (a bit dubious)
* mpg ~ weight: negative relation
* mpg ~ acceleration: positive relation (seems unusual)
In addition, as previously noted, we observed a positive trend between MPG and the qualitative variables ‘year’ and ‘origin’.
A multiple linear regression model with both quantitative and qualitative variables:
model <- lm(mpg ~ cylinders + year + weight + horsepower + displacement + acceleration + origin,
            data = Auto)
t(coefficients(model))
## (Intercept) cylinders year71 year72 year73 year74 year75
## [1,] 33.22536 0.04431539 1.652039 0.3774949 -0.1147728 2.27686 1.640456
## year76 year77 year78 year79 year80 year81 year82
## [1,] 2.396079 3.812921 3.66051 5.994203 10.12784 7.732362 9.175357
## weight horsepower displacement acceleration origin2 origin3
## [1,] -0.006761071 0.009423129 0.01365385 0.1559696 2.608659 2.333541
anova(model)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cylinders 1 14613.9 14613.9 1569.0486 < 2e-16 ***
## year 12 3667.3 305.6 32.8123 < 2e-16 ***
## weight 1 2128.7 2128.7 228.5547 < 2e-16 ***
## horsepower 1 18.5 18.5 1.9910 0.1591
## displacement 1 0.9 0.9 0.1010 0.7508
## acceleration 1 16.6 16.6 1.7808 0.1829
## origin 2 294.9 147.5 15.8314 2.5e-07 ***
## Residuals 377 3511.3 9.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA table, the predictors ‘horsepower’, ‘displacement’, and ‘acceleration’ are not statistically significant for estimating MPG.
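A natural follow-up is to refit the model without those three predictors and compare the nested models with an F-test; a sketch (output not shown):
# Reduced model without horsepower, displacement, and acceleration
reduced <- lm(mpg ~ cylinders + year + weight + origin, data = Auto)
anova(reduced, model)  # F-test on the dropped terms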
Ahmed TADDE