library(ISLR)  # provides the Auto data used later; women ships with base R (datasets package)
#women_data=data("women")

The first six rows of the Women Dataset are as follows

head(women)
##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129

The dimensions of the Women Dataset are as follows

dim(women)
## [1] 15  2

The structure of the Women Dataset is as follows

str(women)
## 'data.frame':    15 obs. of  2 variables:
##  $ height: num  58 59 60 61 62 63 64 65 66 67 ...
##  $ weight: num  115 117 120 123 126 129 132 135 139 142 ...

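The answers are as follows:

  • The name of the dataset is the women dataset
  • The dimensions tell us that there are 15 rows and 2 variables (height and weight).
  • Both variables are numeric (num), as str() shows.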

The Summary of the Women Data is as follows:

summary(women)
##      height         weight     
##  Min.   :58.0   Min.   :115.0  
##  1st Qu.:61.5   1st Qu.:124.5  
##  Median :65.0   Median :135.0  
##  Mean   :65.0   Mean   :136.7  
##  3rd Qu.:68.5   3rd Qu.:148.0  
##  Max.   :72.0   Max.   :164.0

The plot of the Women Dataset is as follows

plot(women)

fit = lm(weight ~ height, data = women)  # lm(y ~ x, data = nameOfDataset): response on the left
summary(fit)
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
par(mfrow = c(2, 2))
plot(fit)

Interpretation of Coefficient and Analysis of quantity of fit vs quality of fit:

  • For every one-inch increase in height, weight is expected to increase by 3.45 pounds.
  • The model is statistically significant because the p-value of the F-statistic (1.091e-14) is less than 0.05.
  • The model is useful (good quantity of fit) because R^2 (the proportion of variance explained by the regression line) is 0.991, i.e. 99.1%.
  • Checking the linear regression assumptions (a model can have a strong fit, i.e. a high R^2, and still be inappropriate, which is assessed by checking the assumptions):
    • The response variable y is a linear function of the explanatory variable x, plus an error term
    • The error term has constant variance
    • The error term is normally distributed
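Each of the three assumptions maps onto one of the standard diagnostic plots that plot() draws for an lm object. A minimal sketch, assuming the model is refit here so the block runs on its own; the which argument selects individual panels:

```r
# Refit the model so this block is self-contained
fit <- lm(weight ~ height, data = women)

op <- par(mfrow = c(1, 3))  # three panels side by side
plot(fit, which = 1)  # Residuals vs Fitted: linearity (look for a curve)
plot(fit, which = 3)  # Scale-Location: constant variance (look for a trend)
plot(fit, which = 2)  # Normal Q-Q: normality (points should follow the line)
par(op)               # restore the previous plotting layout
```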
Answers based on the above assumptions regarding quality of fit for the Women Dataset:
  • The model does not fit the data well: the assumption that y is a linear function of x is violated according to the residuals vs fitted values plot. A curve can be observed in that plot, which points to non-linearity in the relationship between weight and height. In addition, the residual standard error is 1.53 pounds, which is roughly the typical size of the error in predicting weight from height. The first- and third-quartile residuals (-1.13 and 0.74) also differ in magnitude, so the residuals are not symmetric about zero.
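The figures quoted above can be pulled from the fitted model directly, and the curvature seen in the residual plot can be probed by comparing the straight-line fit with one that adds a squared height term. This is only a sketch: the quadratic model and the name fit_quad are assumptions introduced here for illustration and are not part of the original analysis.

```r
fit <- lm(weight ~ height, data = women)
s <- summary(fit)

slope <- unname(coef(fit)["height"])  # expected change in weight per inch
r2    <- s$r.squared                  # proportion of variance explained
rse   <- s$sigma                      # residual standard error, in pounds
print(round(c(slope = slope, r2 = r2, rse = rse), 3))

# Hypothetical follow-up: does a squared term improve on the straight line?
fit_quad <- lm(weight ~ height + I(height^2), data = women)
cmp <- anova(fit, fit_quad)  # F-test comparing the two nested models
print(cmp)
```

A small p-value in the anova() table would support the reading that the relationship is not purely linear.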
library(MASS)  # not required for Auto, which comes from ISLR (loaded at the top)
#Auto_data=data("Auto")
#Auto_data

The first six rows of the Auto Dataset are as follows

head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

The dimensions of the Auto Dataset are as follows

dim(Auto)
## [1] 392   9

The structure of the Auto Dataset is as follows

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

The answers are as follows

  • The name of the dataset is the Auto dataset
  • The dimensions tell us that there are 392 rows (cars) and 9 variables/features.
  • The dataset comprises 9 variables. As str() shows, 8 of them are numeric (num) and one, name, is a factor with 304 levels that holds the name of each car.
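The column types reported by str() can also be tallied programmatically. A short sketch, assuming the ISLR package is installed; col_class and n_numeric are names made up for this example:

```r
library(ISLR)  # provides the Auto data frame

# First class of every column (e.g. "numeric", "factor")
col_class <- sapply(Auto, function(col) class(col)[1])
print(col_class)
print(table(col_class))  # how many columns there are of each type

# Count of columns that R treats as numeric
n_numeric <- sum(sapply(Auto, is.numeric))
print(n_numeric)
```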

The Summary of the Auto Data is as follows:

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365
#plot(Auto)

The plot is as follows

plot(Auto$horsepower, Auto$mpg,xlab="HorsePower", ylab="MPG", main="Plot of MPG vs HorsePower",col="blue") 
fit_Auto = lm(mpg ~ horsepower, data = Auto)  # lm(y ~ x, data = nameOfDataset): response on the left
abline(fit_Auto,lwd =3 , col ="red")

The summary of the linear regression model is as follows

summary(fit_Auto)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit_Auto)

Interpretation of Coefficient and Analysis of quantity of fit vs quality of fit:

  • For every one-unit increase in horsepower, mpg is expected to decrease by about 0.16 miles per gallon, i.e. fuel efficiency falls as horsepower rises.
  • The model is statistically significant because the p-value of the F-statistic (< 2.2e-16) is less than 0.05.
  • The model appears to be useful (good quantity of fit) because R^2 (the proportion of variance explained by the regression line) is 0.6059, i.e. 60.59%.
  • Checking the linear regression assumptions (a model can have a strong fit, i.e. a high R^2, and still be inappropriate, which is assessed by checking the assumptions):
    • The response variable y is a linear function of the explanatory variable x, plus an error term
    • The error term has constant variance
    • The error term is normally distributed
Answers based on the above assumptions regarding quality of fit for the Auto Dataset:
  • The model does not fit the data well: the assumption that y is a linear function of x is violated according to the residuals vs fitted values plot. A curve can be observed in that plot, which points to non-linearity in the relationship between mpg and horsepower. In addition, the residual standard error is 4.906 miles per gallon, which is roughly the typical size of the error in predicting miles per gallon from horsepower. The first- and third-quartile residuals (-3.26 and 2.76) also differ in magnitude, so the residuals are not symmetric about zero.
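The slope interpretation can be checked by predicting mpg for two cars that differ only in horsepower. A minimal sketch, assuming the ISLR package is installed; the two horsepower values (100 and 110) are hypothetical cars chosen purely for illustration:

```r
library(ISLR)  # provides the Auto data frame

fit_Auto <- lm(mpg ~ horsepower, data = Auto)

# Two hypothetical cars that differ by 10 horsepower
new_cars <- data.frame(horsepower = c(100, 110))
pred <- predict(fit_Auto, newdata = new_cars)
print(pred)

# Predicted change in mpg for the extra 10 horsepower (10 times the slope)
change_per_10hp <- unname(pred[2] - pred[1])
print(change_per_10hp)

# 95% confidence intervals for the intercept and slope
print(confint(fit_Auto))
```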

The scatter plot matrix for the auto dataset is as follows:

pairs(Auto)

#library(MASS)

The plot for mpg and weight is as follows

fit_Auto_2 = lm(mpg ~ weight, data = Auto)  # lm(y ~ x, data = nameOfDataset): response on the left

plot(Auto$weight, Auto$mpg, xlab="Weight", ylab="MPG", main="Plot of MPG vs Weight", col="blue")

#plot(Auto$, Auto$mpg, xlab="Acceleration", ylab="MPG", main="Plot of MPG vs Acceleration", col="blue") 
abline(fit_Auto_2, lwd=3, col="red")

summary(fit_Auto_2)
## 
## Call:
## lm(formula = mpg ~ weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9736  -2.7556  -0.3358   2.1379  16.5194 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.216524   0.798673   57.87   <2e-16 ***
## weight      -0.007647   0.000258  -29.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.333 on 390 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6918 
## F-statistic: 878.8 on 1 and 390 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit_Auto_2)

Interpretation of Coefficient and Analysis of quantity of fit vs quality of fit:

  • For every 10-pound increase in the weight of the car, mpg is expected to decrease by about 0.08 miles per gallon (10 × 0.007647). That looks small per 10 pounds, but car weights in this dataset range from 1613 to 5140 pounds, so the same slope amounts to about 7.6 miles per gallon per 1,000 pounds.
  • The model is statistically significant because the p-value of the F-statistic (< 2.2e-16) is less than 0.05.
  • The model appears to be useful (good quantity of fit) because R^2 (the proportion of variance explained by the regression line) is 0.6926, i.e. 69.26%.
  • Checking the linear regression assumptions (a model can have a strong fit, i.e. a high R^2, and still be inappropriate, which is assessed by checking the assumptions):
    • The response variable y is a linear function of the explanatory variable x, plus an error term
    • The error term has constant variance
    • The error term is normally distributed

Answers based on the above assumptions regarding quality of fit for the Auto Dataset:

  • The model does not fit the data well: the assumption that y is a linear function of x is violated according to the residuals vs fitted values plot. A slight curve can be observed in that plot, which points to non-linearity in the relationship between mpg and weight. In addition, the residual standard error is 4.33 miles per gallon, which is roughly the typical size of the error in predicting miles per gallon from the weight of the car. The first- and third-quartile residuals (-2.76 and 2.14) are not far apart in magnitude, but they are not equal either, so the residuals are only roughly symmetric about zero.
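Because weight is measured in pounds, the raw slope is tiny; rescaling it makes the effect easier to judge. A small sketch, assuming the ISLR package is installed; the per-10 and per-1,000-pound scalings are illustrative choices, not part of the original analysis:

```r
library(ISLR)  # provides the Auto data frame

fit_Auto_2 <- lm(mpg ~ weight, data = Auto)

slope_per_lb <- coef(fit_Auto_2)[["weight"]]  # change in mpg per pound
slope_per_10lb   <- 10 * slope_per_lb         # change in mpg per 10 pounds
slope_per_1000lb <- 1000 * slope_per_lb       # change in mpg per 1,000 pounds
print(c(per_10 = slope_per_10lb, per_1000 = slope_per_1000lb))

# Spread of car weights, to put the rescaled slope in context
weight_range <- range(Auto$weight)
print(weight_range)
```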