Part 1:

1.

Let’s start by exploring the “women” data set, which shows the average heights and weights for American women. Here are the first 6 lines of the data set.

head(women,6)
##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129

2.

Let’s view the dimensions of the “women” data set.

dim(women)
## [1] 15  2

3.

Let’s view the structure of the data set using the “str()” function.

str(women)
## 'data.frame':    15 obs. of  2 variables:
##  $ height: num  58 59 60 61 62 63 64 65 66 67 ...
##  $ weight: num  115 117 120 123 126 129 132 135 139 142 ...

4.

Let us quickly review what we have seen in the “women” data set.

  • What is the name of the data set?

The name of the data set is ‘women’.

  • What do the dimensions of the data set tell us?

The dimensions tell us that the data set has 15 rows (observations) and 2 columns (variables), for a total of 30 values.
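As a small sketch of how the two numbers returned by dim() can be read, the row and column counts can also be pulled out separately. The variable names below (n_obs, n_vars, n_vals) are only illustrative and are not part of the original exercise.

```r
# Rows are observations, columns are variables.
n_obs  <- nrow(women)      # number of rows (observations)
n_vars <- ncol(women)      # number of columns (variables)
n_vals <- n_obs * n_vars   # total number of individual values

c(rows = n_obs, columns = n_vars, values = n_vals)
```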

  • How many variables comprise the data set? What are their data types?

There are 2 variables, height and weight. Both are numeric (shown as “num” in the str() output).
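One possible way to list the data type of every column at once is to apply class() to each column. This is just a sketch; col_types is an illustrative name.

```r
# Apply class() to every column of the data frame.
col_types <- sapply(women, class)
col_types
```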

  • What are the ages and nationality of women in the data set?

The data set itself does not reveal this information. According to the help page (?women), the data describe American women aged 30-39.

5.

Let us review what we see from running the “summary()” function on the women data set.

summary(women)
##      height         weight     
##  Min.   :58.0   Min.   :115.0  
##  1st Qu.:61.5   1st Qu.:124.5  
##  Median :65.0   Median :135.0  
##  Mean   :65.0   Mean   :136.7  
##  3rd Qu.:68.5   3rd Qu.:148.0  
##  Max.   :72.0   Max.   :164.0
  • What are the lowest heights and weights?

The Min values reveal that the lowest height is 58.0 in. and the lowest weight is 115.0 lbs.

  • What are the highest heights and weights?

The Max values reveal that the highest height is 72.0 in. and the highest weight is 164.0 lbs.

  • What are the average heights and weights?

For this data we will use the mean to assess the average, since there are no severe outliers. The average height is 65.0 in. and the average weight is 136.7 lbs.
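A quick way to check whether the mean is a fair summary is to place it next to the median: when the two are close, extreme values are not pulling the mean away. The sketch below does this for both columns; the names col_means and col_medians are illustrative.

```r
# Compare mean and median for each column of the women data set.
col_means   <- colMeans(women)
col_medians <- sapply(women, median)

rbind(mean = col_means, median = col_medians)
```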


Extra

Let’s create a model in which weight is determined by height. I will use the “lm()” function to fit it and the “summary()” function to view it.

model = lm(weight ~ height, data = women)
summary(model)
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

Questions:

What should we pay attention to in the output?

  • The estimate for height is 3.45. This means that weight is expected to increase by 3.45 lbs for every one-inch increase in height.

  • Remember that “increase” refers to subjects of different heights (cross-sectional); it does not predict how a subject's weight changes as she grows.
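The meaning of the slope can be illustrated by predicting the weight at two heights that differ by one inch and taking the difference. This is a minimal sketch; the heights 64 and 65 are arbitrary choices, and the names fit, new_heights and pred are illustrative.

```r
# Refit the model so this block runs on its own.
fit <- lm(weight ~ height, data = women)

# Predict weight for two heights that are one inch apart.
new_heights <- data.frame(height = c(64, 65))
pred <- predict(fit, newdata = new_heights)

# The difference between the two predictions equals the height coefficient.
slope_from_pred <- unname(diff(pred))
slope_from_coef <- coef(fit)[["height"]]
c(from_predictions = slope_from_pred, from_coefficients = slope_from_coef)
```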

How old are the women in the study? Would you expect them to grow?

  • The women in the study are aged 30-39, so we do not expect them to grow. If anything, we would expect their height to stay the same or decrease slightly with age.

Is the model statistically significant? Check F-statistic for the p-value.

  • The F-statistic is large (1433 on 1 and 13 DF) and its p-value is very small (1.091e-14), so the model is statistically significant.
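The p-value on the last line of the summary is the upper tail probability of the F distribution at the observed F-statistic. A sketch of how it could be recomputed from the stored summary values follows; fit, fstat and p_val are illustrative names.

```r
# Refit the model so this block runs on its own.
fit <- lm(weight ~ height, data = women)

# summary() stores the F-statistic as a named vector: value, numdf, dendf.
fstat <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution at the observed value.
p_val <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
            lower.tail = FALSE)
p_val
```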

Is the model useful? Find the R-squared value, which indicates the proportion of variance explained by height.

  • The model is very useful. The R-squared value is 0.991, which means that height explains 99.1% of the variance in weight.
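To make the “proportion of variance explained” reading concrete, R-squared can be rebuilt from the residual and total sums of squares. This is a sketch under the usual definition R² = 1 - SS_res / SS_tot; the variable names are illustrative.

```r
# Refit the model so this block runs on its own.
fit <- lm(weight ~ height, data = women)

# Variation left unexplained by the model.
ss_res <- sum(residuals(fit)^2)

# Total variation of weight around its mean.
ss_tot <- sum((women$weight - mean(women$weight))^2)

r_squared <- 1 - ss_res / ss_tot
r_squared
```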

Does the model fit the data well?

  • It does. The residual standard error is low at 1.525 lbs on 13 degrees of freedom, which is about 1% of the mean weight of 136.7 lbs.
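The residual standard error reported by summary() is the square root of the residual sum of squares divided by the residual degrees of freedom. A minimal sketch of that calculation, with illustrative names fit and rse:

```r
# Refit the model so this block runs on its own.
fit <- lm(weight ~ height, data = women)

# Residual standard error = sqrt(sum of squared residuals / residual df).
rse <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
rse
```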

Is a linear model reasonable for this data? Check Q1 and Q3 for residuals.

Q1 (-1.1333) and Q3 (0.7417) are similar in magnitude and the median (-0.3833) is close to zero, so the residuals are roughly symmetric around zero and a linear model is reasonable for this data.
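The quartiles used in this check can be extracted directly, and a plot of residuals against fitted values gives a visual version of the same question. The sketch below is one possible way to do both; the plot is an addition for illustration and was not part of the original exercise.

```r
# Refit the model so this block runs on its own.
fit <- lm(weight ~ height, data = women)
res <- residuals(fit)

# Five-number summary of the residuals (Min, Q1, Median, Q3, Max).
res_q <- quantile(res)
res_q

# Residuals vs. fitted values; points scattered evenly around the
# dashed zero line would support a linear model.
plot(fitted(fit), res,
     xlab = "Fitted weight (lbs)", ylab = "Residual (lbs)")
abline(h = 0, lty = 2)
```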


Part 2: Exercise 1 - Auto

1.

Let’s start by exploring the “Auto” data set, which shows gas mileage, horsepower, and other information for 392 vehicles. Here are the first 6 lines of the data set.

library(ISLR)
library(MASS)
data("Auto")
head(Auto, 6)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

2.

Let’s view the dimensions of the “Auto” data set.

dim(Auto)
## [1] 392   9

3.

Let’s view the structure of the data set using the “str()” function.

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

4.

Let us quickly review what we have seen in the “Auto” data set.

  • What is the name of the data set?

The name of the data set is ‘Auto’.

  • What do the dimensions of the data set tell us?

The dimensions tell us that the data set has 392 rows (vehicles) and 9 columns (variables), for a total of 3,528 values (392 * 9).

  • How many variables comprise the data set? What are their data types?

There are 9 variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, and name. The first eight are numeric and name is a factor with 304 levels.


5.

Let us review what we see from running the “summary()” function on the “Auto” data set.

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

What are the lowest mpg and horsepower?

The Min values reveal that the lowest mpg is 9.0 mpg and the lowest horsepower is 46.0 engine horsepower.

What are the highest mpg and horsepower?

The max values reveal that the highest mpg is 46.60 mpg and the highest horsepower is 230.0 engine horsepower.

What are the average mpg and horsepower?

The average mpg is 23.45 mpg and the average horsepower is 104.5.


Questions:

Let us review what we see from running the “summary()” function on a model in which the mpg is determined by horsepower.

model = lm(mpg ~ horsepower, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

What should we pay attention to in the output?

  • The estimate for horsepower is -0.157845. This means that mpg is expected to decrease by 0.157845 for every one-unit increase in horsepower.
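Because a one-unit change in horsepower is small, it can help to look at a larger step. The sketch below compares predicted mpg at 100 and 110 horsepower (arbitrary values chosen for illustration); the difference should be ten times the slope. It assumes the ISLR package is installed, and fit_mpg and pred are illustrative names.

```r
# Load the Auto data without attaching the whole package.
data("Auto", package = "ISLR")

fit_mpg <- lm(mpg ~ horsepower, data = Auto)

# Predicted mpg for two cars that differ by 10 horsepower.
pred <- predict(fit_mpg, newdata = data.frame(horsepower = c(100, 110)))

# Change in predicted mpg for a 10-unit increase in horsepower.
change_per_10hp <- unname(diff(pred))
change_per_10hp
```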

Is the model statistically significant? Check F-statistic for the p-value.

  • The p-value of the F-statistic is very small (< 2.2e-16), so the model is statistically significant.

Is the model useful? Find the R-squared value, which indicates the proportion of variance explained by horsepower.

  • The model is only moderately useful. The R-squared value is 0.6059, which means that horsepower explains about 61% of the variance in mpg.

Does the model fit the data well?

  • Only moderately well. The residual standard error is 4.906 mpg on 390 degrees of freedom, which is about 21% of the mean mpg of 23.45.
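Whether a residual standard error is “low” depends on the scale of the response, so one possible yardstick is to divide it by the mean of the response. This is a rough sketch of that comparison, not a formal test; it assumes the ISLR package is installed, and rel_error is an illustrative name.

```r
# Load the Auto data without attaching the whole package.
data("Auto", package = "ISLR")

fit_mpg <- lm(mpg ~ horsepower, data = Auto)

# Residual standard error as stored by summary().
rse <- summary(fit_mpg)$sigma

# Typical prediction error as a fraction of the average mpg.
rel_error <- rse / mean(Auto$mpg)
c(rse = rse, relative = rel_error)
```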

Is a linear model reasonable for this data? Check Q1 and Q3 for residuals.

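Q1 (-3.2592) and Q3 (2.7630) differ somewhat in magnitude, and the extremes (-13.5710 and 16.9240) are also unequal, which suggests that the residuals are not symmetric. So the linear model is not fully reasonable for this data.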


Part 3: Exercise 2 - Auto

Before I begin:

First I will produce a scatterplot matrix that compares all the pairs of variables of the “Auto” data set.

pairs(Auto)
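Since name is a factor with 304 levels, its panels in the matrix carry little information. A possible variant, sketched below, restricts the plot to the numeric columns and also reports the correlation between horsepower and weight, the pair used in the rest of this part. It assumes the ISLR package is installed; num_cols and cor_hw are illustrative names.

```r
# Load the Auto data without attaching the whole package.
data("Auto", package = "ISLR")

# Keep only the numeric columns (drops the factor column "name").
num_cols <- sapply(Auto, is.numeric)
pairs(Auto[, num_cols])

# Correlation between the two variables modelled below.
cor_hw <- cor(Auto$horsepower, Auto$weight)
cor_hw
```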

I will work with a model in which horsepower depends on weight.

1.

Let us review what we see from running the “summary()” function on the “Auto” data set. We will assess the values for horsepower and weight this time.

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

What are the lowest weight and horsepower?

The Min values reveal that the lowest weight is 1613 lbs and the lowest horsepower is 46.0 engine horsepower.

What are the highest weight and horsepower?

The max values reveal that the highest weight is 5140 lbs and the highest horsepower is 230.0 engine horsepower.

What are the average weight and horsepower?

The average weight is 2978 lbs and the average horsepower is 104.5.


Questions:

Let us review what we see from running the “summary()” function on a model in which the horsepower is determined by weight.

model = lm(horsepower ~ weight, data = Auto)
summary(model)
## 
## Call:
## lm(formula = horsepower ~ weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.272 -12.285  -0.557   9.063 116.283 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.183485   3.570431  -3.412 0.000712 ***
## weight        0.039177   0.001153  33.972  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.37 on 390 degrees of freedom
## Multiple R-squared:  0.7474, Adjusted R-squared:  0.7468 
## F-statistic:  1154 on 1 and 390 DF,  p-value: < 2.2e-16

What should we pay attention to in the output?

  • The estimate for weight is 0.039177. This means that horsepower is expected to increase by 0.039177 for every one-pound increase in weight.

Is the model statistically significant? Check F-statistic for the p-value.

  • The p-value of the F-statistic is very small (< 2.2e-16), so the model is statistically significant.

Is the model useful? Find the R-squared value, which indicates the proportion of variance explained by weight.

  • The model is somewhat useful. The R-squared value is 0.7474, which means that weight explains nearly 75% of the variance in horsepower.

Does the model fit the data well?

  • It does somewhat. The residual standard error is moderate at 19.37 horsepower on 390 degrees of freedom, which is about 19% of the mean horsepower of 104.5.

Is a linear model reasonable for this data? Check Q1 and Q3 for residuals.

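Q1 (-12.285) and Q3 (9.063) are not the same in magnitude, and the largest residual (116.283) is more than twice the size of the smallest (-50.272). The residuals are skewed, so the linear model is not fully reasonable for this data.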