library(ISLR)
#women_data=data("women")
The first six rows of the Women Dataset are as follows
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
The dimensions of the Women Dataset are as follows
dim(women)
## [1] 15 2
The structure of the Women Dataset is as follows
str(women)
## 'data.frame': 15 obs. of 2 variables:
## $ height: num 58 59 60 61 62 63 64 65 66 67 ...
## $ weight: num 115 117 120 123 126 129 132 135 139 142 ...
The answers are as follows:
- The name of the dataset is women.
- The dimensions tell us that there are 15 rows (observations) and 2 columns holding the heights and weights of women.
- 2 variables (height and weight) comprise the dataset. Both are of numeric type (num), as shown by str().
- Age and nationality cannot be determined from the women dataset, as it has only 2 variables, namely height and weight.
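The type and size answers above can be checked directly. This is a minimal sketch using only base R (`women` ships with the built-in datasets package, so no extra library is assumed); the variable names `col_classes`, `col_types`, `n_obs` and `n_vars` are illustrative only.

```r
# Class and storage type of every column in the built-in women data
col_classes <- sapply(women, class)
col_types <- sapply(women, typeof)
print(col_classes)
print(col_types)

# Number of rows (observations) and columns (variables)
n_obs <- nrow(women)
n_vars <- ncol(women)
print(c(rows = n_obs, columns = n_vars))
```

`class()` reports how R treats the column, while `typeof()` reports how it is stored, which is what distinguishes a numeric (double) column from an integer one.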
The Summary of the Women Data is as follows:
- The lowest height and weight are 58 and 115 respectively.
- The highest height and weight are 72 and 164 respectively.
- The mean height and weight are 65 and 136.7 respectively.
summary(women)
## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0
The plot of the Women Dataset is as follows
plot(women)

fit = lm(weight ~ height, data = women)  # formula is response ~ predictor, i.e. y ~ x, data = nameOfDataset
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
par(mfrow = c(2, 2))
plot(fit)

Interpretation of the coefficient and analysis of quantity of fit vs quality of fit:
- For every one-inch increase in height, weight is expected to increase by 3.45 pounds.
- The model is statistically significant because the p-value of the F-statistic (1.091e-14) is less than 0.05.
- The model is useful because R^2 (the proportion of variance explained by the regression line) is 0.991, i.e. 99.1%.
- Checking the linear regression assumptions (a model can have a strong fit (good R^2) and still be inappropriate, which is assessed by checking the assumptions):
- The response variable y is a linear function of the explanatory variable x, plus an error term.
- The error term has constant variance.
- The error term is normally distributed.
Answers based on the above assumptions regarding quality of fit for the Women Dataset:
- The model does not fit the data well, because the assumption that y is a linear function of x is violated according to the residuals vs fitted values plot. A curve can be observed in that plot, which indicates non-linearity in the relationship between weight and height. In addition, the residual standard error is 1.53 pounds, which is roughly the typical size of the error in predicting weight from height. The Q1 and Q3 residuals (-1.13 and 0.74) are also not close to each other in magnitude.
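The figures quoted above can be reproduced from the residuals, and the curvature seen in the residuals vs fitted plot can be probed numerically. The sketch below is only one way to do this: the quadratic model `fit_quad` is an added check that is not part of the original analysis, and the names `rss`, `tss`, `r_squared`, `rse` and `quad_p` are illustrative.

```r
# Refit the simple linear model on the built-in women data
fit <- lm(weight ~ height, data = women)

# R^2 and residual standard error computed by hand from the residuals
res <- residuals(fit)
rss <- sum(res^2)                                   # residual sum of squares
tss <- sum((women$weight - mean(women$weight))^2)   # total sum of squares
r_squared <- 1 - rss / tss
rse <- sqrt(rss / df.residual(fit))                 # 13 residual degrees of freedom
print(c(r_squared = r_squared, rse = rse))

# Assumed extra check: add a squared height term and look at its p-value.
# A small p-value is consistent with the curve seen in the residual plot.
fit_quad <- lm(weight ~ height + I(height^2), data = women)
quad_p <- summary(fit_quad)$coefficients["I(height^2)", "Pr(>|t|)"]
print(quad_p)
```

A high R^2 and a significant squared term can coexist: the straight line captures most of the variance, yet the remaining error is systematic rather than random.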
library(ISLR)  # the Auto dataset is provided by ISLR, not MASS
#Auto_data=data("Auto")
#Auto_data
The first six rows of the Auto Dataset are as follows
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
The dimensions of the Auto Dataset are as follows
dim(Auto)
## [1] 392 9
The structure of the Auto Dataset is as follows
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
The answers are as follows
- The name of the dataset is Auto.
- The dimensions tell us that there are 392 rows (cars) and 9 variables/features describing them.
- 9 variables comprise the dataset. 8 of them are of numeric type (num) and one, name, is a factor holding the name of each car.
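The column types listed above can be counted programmatically. This is a minimal sketch that assumes the ISLR package is installed, as in the rest of this report; `is_num`, `is_fac`, `n_numeric` and `factor_cols` are illustrative names.

```r
library(ISLR)  # assumed to be installed; provides the Auto data frame

# Flag which columns are numeric and which are factors
is_num <- vapply(Auto, is.numeric, logical(1))
is_fac <- vapply(Auto, is.factor, logical(1))

n_numeric <- sum(is_num)             # count of numeric columns
factor_cols <- names(Auto)[is_fac]   # names of the factor columns
print(n_numeric)
print(factor_cols)
```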
The Summary of the Auto Data is as follows:
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
#plot(Auto)
The plot is as follows
plot(Auto$horsepower, Auto$mpg, xlab = "HorsePower", ylab = "MPG", main = "Plot of MPG vs HorsePower", col = "blue")
fit_Auto = lm(mpg ~ horsepower, data = Auto)  # formula is response ~ predictor, i.e. y ~ x, data = nameOfDataset
abline(fit_Auto, lwd = 3, col = "red")  # overlay the fitted regression line

The summary of the linear regression model is as follows
summary(fit_Auto)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit_Auto)

Interpretation of the coefficient and analysis of quantity of fit vs quality of fit:
- For every one-unit increase in horsepower, mpg is expected to decrease by about 0.16 miles per gallon, i.e. the car becomes less fuel-efficient.
- The model is statistically significant because the p-value of the F-statistic (< 2.2e-16) is less than 0.05.
- The model appears to be useful (good quantity of fit) because R^2 (the proportion of variance explained by the regression line) is 0.6059, i.e. 60.59%.
- Checking the linear regression assumptions (a model can have a strong fit (good R^2) and still be inappropriate, which is assessed by checking the assumptions):
- The response variable y is a linear function of the explanatory variable x, plus an error term.
- The error term has constant variance.
- The error term is normally distributed.
Answers based on the above assumptions regarding quality of fit for the Auto Dataset:
- The model does not fit the data well, because the assumption that y is a linear function of x is violated according to the residuals vs fitted values plot. A curve can be observed in that plot, which indicates non-linearity in the relationship between mpg and horsepower. In addition, the residual standard error is 4.906 miles per gallon, which is roughly the typical size of the error in predicting miles per gallon from horsepower. The Q1 and Q3 residuals (-3.26 and 2.76) are also not close to each other in magnitude.
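To make the horsepower coefficient concrete, the fitted line can be used to predict mpg for two hypothetical cars that differ by 10 horsepower. This is a sketch assuming the ISLR package is installed; the data frame `new_cars` and the values 100 and 110 are made up for illustration.

```r
library(ISLR)  # assumed to be installed; provides the Auto data frame

# Same simple linear model as in the report
fit_Auto <- lm(mpg ~ horsepower, data = Auto)
slope <- coef(fit_Auto)[["horsepower"]]

# Two hypothetical cars, 10 horsepower apart
new_cars <- data.frame(horsepower = c(100, 110))
pred <- predict(fit_Auto, newdata = new_cars)

# The difference in predicted mpg equals 10 times the slope
diff_10hp <- unname(pred[2] - pred[1])
print(pred)
print(diff_10hp)
```

Because the model is a straight line, the predicted change per 10 horsepower is the same everywhere, which is exactly the property the curved residual plot calls into question.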
The scatter plot matrix for the auto dataset is as follows:
pairs(Auto)

#library(MASS)
The plot for mpg and weight is as follows
fit_Auto_2 = lm(mpg ~ weight, data = Auto)  # formula is response ~ predictor, i.e. y ~ x, data = nameOfDataset
plot(Auto$weight, Auto$mpg,xlab="weight", ylab="mpg", main="Plot of weight vs mpg",col="blue")
#plot(Auto$, Auto$mpg, xlab="Acceleration", ylab="MPG", main="Plot of MPG vs Acceleration", col="blue")
abline(fit_Auto_2, lwd=3, col="red")

summary(fit_Auto_2)
##
## Call:
## lm(formula = mpg ~ weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9736 -2.7556 -0.3358 2.1379 16.5194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.216524 0.798673 57.87 <2e-16 ***
## weight -0.007647 0.000258 -29.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.333 on 390 degrees of freedom
## Multiple R-squared: 0.6926, Adjusted R-squared: 0.6918
## F-statistic: 878.8 on 1 and 390 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit_Auto_2)

Interpretation of the coefficient and analysis of quantity of fit vs quality of fit:
- For every 10-pound increase in the weight of the car, mpg is expected to decrease by about 0.076 miles per gallon. This appears small per 10 pounds, although it corresponds to about 7.6 miles per gallon per 1000 pounds.
- The model is statistically significant because the p-value of the F-statistic (< 2.2e-16) is less than 0.05.
- The model appears to be useful (good quantity of fit) because R^2 (the proportion of variance explained by the regression line) is 0.6926, i.e. 69.26%.
- Checking the linear regression assumptions (a model can have a strong fit (good R^2) and still be inappropriate, which is assessed by checking the assumptions):
- The response variable y is a linear function of the explanatory variable x, plus an error term.
- The error term has constant variance.
- The error term is normally distributed.
Answers based on the above assumptions regarding quality of fit for the Auto Dataset:
- The model does not fit the data well, because the assumption that y is a linear function of x is violated according to the residuals vs fitted values plot. A slight curve can be observed in that plot, which indicates non-linearity in the relationship between mpg and weight. In addition, the residual standard error is 4.33 miles per gallon, which is roughly the typical size of the error in predicting miles per gallon from the weight of the car. The Q1 and Q3 residuals (-2.76 and 2.14), although not far apart, are not especially close to each other in magnitude.
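Since the weight coefficient is tiny per pound, it can help to rescale it and to put the two Auto models side by side. The sketch below assumes the ISLR package is installed; `effect` and `r2` are illustrative names, and the 10 lb and 1000 lb steps are chosen only for readability.

```r
library(ISLR)  # assumed to be installed; provides the Auto data frame

# The two simple linear models fitted in the report
fit_Auto <- lm(mpg ~ horsepower, data = Auto)
fit_Auto_2 <- lm(mpg ~ weight, data = Auto)

# Expected change in mpg for a 10 lb and a 1000 lb increase in weight
slope_per_lb <- coef(fit_Auto_2)[["weight"]]
effect <- slope_per_lb * c("10 lb" = 10, "1000 lb" = 1000)
print(effect)

# R^2 of each model, for comparing quantity of fit
r2 <- c(horsepower = summary(fit_Auto)$r.squared,
        weight = summary(fit_Auto_2)$r.squared)
print(r2)
```

Rescaling does not change the model; it only expresses the same slope in units where the effect on fuel efficiency is easier to judge.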