This work is part of my effort to become a well-versed data analyst. At this point, and for the immediate future, I will undoubtedly be a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. But with more practice, the quality and depth of my work will improve (that is the whole point!).
Update-1: In retrospect, I believe it would have been fine to coerce only the integer-class variables to numeric and leave the rest as given; for example, not to change the ‘horsepower’ predictor to a numeric type. That said, my choice to reassign variable classes as I saw fit shapes much of the work that follows.
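For reference, coercing only the integer-class columns to numeric while leaving every other column untouched can be done in one pass; a minimal sketch (an alternative, not the approach actually used below):
# Coerce only integer columns to numeric; leave all other classes as given
Auto[] <- lapply(Auto, function(x) if (is.integer(x)) as.numeric(x) else x)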
(a) Which of the predictors are quantitative, and which are qualitative?
library("ISLR")
##
## Attaching package: 'ISLR'
##
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : Factor w/ 13 levels "70","71","72",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Based on the output, the qualitative variables are ‘year’, ‘origin’, and ‘name’. For the later parts of this problem, it is preferable to have all the quantitative predictors as variables of type numeric (instead of some being of type int). In addition, for convenience, we want ‘year’ and ‘origin’ as factor variables.
Auto$cylinders <- as.numeric(Auto$cylinders)
Auto$horsepower <- as.numeric(Auto$horsepower)
Auto$year <- as.factor(Auto$year)
Auto$origin <- as.factor(Auto$origin)
str(Auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : Factor w/ 13 levels "70","71","72",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
(b) What is the range of each quantitative predictor?
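The chunk that builds ‘quant_data’ is not shown; assuming it collects the six numeric columns of ‘Auto’ (after the class changes above), a sketch of how the ranges below could be obtained:
# Sketch (chunk not shown): keep the numeric predictors and compute column ranges
quant_data <- Auto[, sapply(Auto, is.numeric)]
apply(quant_data, 2, range)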
## mpg cylinders displacement horsepower weight acceleration
## [1,] 9.0 3 68 1 1613 8.0
## [2,] 46.6 8 455 94 5140 24.8
(c) What is the mean and standard deviation of each quantitative predictor?
colMeans(quant_data)
## mpg cylinders displacement horsepower weight
## 23.515869 5.458438 193.532746 51.516373 2970.261965
## acceleration
## 15.555668
apply(quant_data,2, sd)
## mpg cylinders displacement horsepower weight
## 7.825804 1.701577 104.379583 29.862697 847.904119
## acceleration
## 2.749995
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
dummy <- quant_data[c(10:85), ]  # note: this keeps rows 10-85; removing them would use quant_data[-(10:85), ]
apply(dummy, 2, range)
## mpg cylinders displacement horsepower weight acceleration
## [1,] 9 3 70 1 1613 8.0
## [2,] 35 8 455 93 5140 23.5
colMeans(dummy)
## mpg cylinders displacement horsepower weight
## 19.618421 5.828947 220.914474 53.710526 3123.578947
## acceleration
## 14.848684
apply(dummy,2, sd)
## mpg cylinders displacement horsepower weight
## 6.123108 1.857512 119.291167 29.032770 981.196104
## acceleration
## 2.940544
rm("dummy")
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
pairs(quant_data)
#___________________________________________________________
# Functions found online (on R-bloggers, written by Stephen Turner)
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X, use = "pairwise.complete.obs")
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}
## Use this to dump the cor.prob output to a 4-column data frame
## with row/column indices, correlation, and p-value.
## See StackOverflow question: http://goo.gl/fCUcQ
flattenSquareMatrix <- function(m) {
  if (!is.matrix(m) || (nrow(m) != ncol(m))) stop("Must be a square matrix.")
  if (!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i   = rownames(m)[row(m)[ut]],
             j   = rownames(m)[col(m)[ut]],
             cor = t(m)[ut],
             p   = m[ut])
}
#____________________________________________________________
#____________________________________________________________
cor.Matrix <- flattenSquareMatrix(cor.prob(quant_data))
cor.Matrix
## i j cor p
## 1 mpg cylinders -0.7762599 0.000000e+00
## 2 mpg displacement -0.8044430 0.000000e+00
## 3 cylinders displacement 0.9509199 0.000000e+00
## 4 mpg horsepower 0.4228227 0.000000e+00
## 5 cylinders horsepower -0.5466585 0.000000e+00
## 6 displacement horsepower -0.4820705 0.000000e+00
## 7 mpg weight -0.8317389 0.000000e+00
## 8 cylinders weight 0.8970169 0.000000e+00
## 9 displacement weight 0.9331044 0.000000e+00
## 10 horsepower weight -0.4821507 0.000000e+00
## 11 mpg acceleration 0.4222974 0.000000e+00
## 12 cylinders acceleration -0.5040606 0.000000e+00
## 13 displacement acceleration -0.5441618 0.000000e+00
## 14 horsepower acceleration 0.2662877 7.181163e-08
## 15 weight acceleration -0.4195023 0.000000e+00
The correlation matrix above summarizes the relationships between the quantitative predictors. It is a surprise to see that the number of cylinders is negatively related to acceleration and to horsepower, and that mpg is positively related to horsepower (and to acceleration).
Some skepticism is warranted for these correlation estimates; in particular, the ‘horsepower’ values likely reflect the class coercion mentioned in Update-1 rather than the original measurements, so they should not be taken at face value.
plot( Auto$mpg ~ Auto$year , xlab ="Year(1900s)", ylab="MPG")
plot( Auto$mpg ~ Auto$origin , xlab ="Origin", ylab="MPG")
The first plot above shows that there is, on average, a positive relation between the year of production and the MPG of the car. This makes sense given advances in technology and engineering over time, especially in recent years when technology has advanced at a great rate.
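A quick numerical check of this trend is to look at the average MPG within each model year; a minimal sketch (output not shown):
# Mean mpg by model year ('year' is a factor here)
tapply(Auto$mpg, Auto$year, mean)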
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
At first glance, the scatterplot matrix of the quantitative variables gives us an idea of the relationship between MPG and the other quantitative variables:
* mpg ~ cylinders: negative relation
* mpg ~ displacement: negative relation
* mpg ~ horsepower: negative relation (a bit dubious)
* mpg ~ weight: negative relation
* mpg ~ acceleration: positive relation (seems unusual)
In addition, as previously noted, we observed a positive trend between MPG and the qualitative variables ‘year’ and ‘origin’.
A multiple linear regression model with both quantitative and qualitative variables:
model <- lm(mpg ~ cylinders + year + weight + horsepower + displacement + acceleration + origin,
            data = Auto)
t(coefficients(model))
## (Intercept) cylinders year71 year72 year73 year74 year75
## [1,] 33.22536 0.04431539 1.652039 0.3774949 -0.1147728 2.27686 1.640456
## year76 year77 year78 year79 year80 year81 year82
## [1,] 2.396079 3.812921 3.66051 5.994203 10.12784 7.732362 9.175357
## weight horsepower displacement acceleration origin2 origin3
## [1,] -0.006761071 0.009423129 0.01365385 0.1559696 2.608659 2.333541
anova(model)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cylinders 1 14613.9 14613.9 1569.0486 < 2e-16 ***
## year 12 3667.3 305.6 32.8123 < 2e-16 ***
## weight 1 2128.7 2128.7 228.5547 < 2e-16 ***
## horsepower 1 18.5 18.5 1.9910 0.1591
## displacement 1 0.9 0.9 0.1010 0.7508
## acceleration 1 16.6 16.6 1.7808 0.1829
## origin 2 294.9 147.5 15.8314 2.5e-07 ***
## Residuals 377 3511.3 9.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA table, the predictors ‘horsepower’, ‘displacement’, and ‘acceleration’ are not statistically significant for estimating MPG.
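A natural follow-up is to refit the model without those three predictors and compare the nested models with an F-test; a sketch (output not shown):
# Reduced model without horsepower, displacement, and acceleration
reduced <- lm(mpg ~ cylinders + year + weight + origin, data = Auto)
anova(reduced, model)  # F-test on the dropped terms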
Ahmed TADDE