Let’s start by exploring the “women” data set, which shows the average heights and weights for American women. Here are the first 6 lines of the data set.
head(women,6)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
Let’s view the dimensions of the “women” data set.
dim(women)
## [1] 15 2
Let’s view the structure of the data set using the “str()” function.
str(women)
## 'data.frame': 15 obs. of 2 variables:
## $ height: num 58 59 60 61 62 63 64 65 66 67 ...
## $ weight: num 115 117 120 123 126 129 132 135 139 142 ...
Let us quickly review what we have seen in the “women” data set.
The name of the data set is ‘women’.
The dimensions tell us that the set holds only 30 values: 15 rows (observations) in each of 2 columns (variables).
There are 2 variables - height and weight.
The data set itself does not reveal this, but its help page (?women) states that the data describe American women aged 30-39.
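The count of 30 values can be checked directly from the dimensions. This is a minimal sketch using only base R; the variable names `n_rows`, `n_cols`, and `n_values` are introduced here for illustration and are not part of the original analysis.

```r
# women ships with base R (package "datasets"), so no library() call is needed
n_rows <- nrow(women)        # number of observations (rows)
n_cols <- ncol(women)        # number of variables (columns)
n_values <- n_rows * n_cols  # total number of cells in the data frame

n_values       # 15 * 2 = 30
names(women)   # the two variable names: "height" and "weight"
```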
Let us review what we see from running the “summary()” function on the women data set.
summary(women)
## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0
The Min values reveal that the lowest height is 58.0 in. and the lowest weight is 115.0 lbs.
The Max values reveal that the greatest height is 72.0 in. and the highest weight is 164.0 lbs.
Since there are no severe outliers in the data, we will use the mean to assess the average. The average height is 65.0 in. and the average weight is 136.7 lbs.
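The means and the absence of outliers can be checked with a short sketch. It assumes the common boxplot convention that flags points lying more than 1.5 times the interquartile range beyond the hinges; that rule is an assumption made here, since the text does not say how outliers were judged.

```r
# Column means of the built-in women data set
col_means <- colMeans(women)
col_means

# boxplot.stats() returns the points beyond the whiskers in its $out element
height_outliers <- boxplot.stats(women$height)$out
weight_outliers <- boxplot.stats(women$weight)$out

length(height_outliers)  # number of flagged heights
length(weight_outliers)  # number of flagged weights
```

Under that rule neither variable has any flagged points, which is consistent with using the mean as the measure of the average.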
Let’s create a model in which weight is determined by height. I will use the “lm()” function to fit it and the “summary()” function to view it.
model = lm(weight ~ height, data = women)
summary(model)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
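The coefficient table above defines the fitted line, roughly weight = -87.52 + 3.45 × height. Here is a minimal sketch of reading the coefficients off the model and using them for a prediction; the object name `fit` and the example heights of 64 and 65 inches are chosen here for illustration.

```r
# Refit the same model under an illustrative name
fit <- lm(weight ~ height, data = women)

# coef() returns a named vector: "(Intercept)" and "height"
b <- coef(fit)
b

# Predicted weights for two women who differ by one inch in height
new_heights <- data.frame(height = c(64, 65))
pred <- predict(fit, newdata = new_heights)
pred

# The difference between the two predictions equals the slope
diff(pred)
```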
What should we pay attention to in the output?
The estimate for height is 3.45. This means that weight is expected to increase by 3.45 pounds for every one-inch increase in height.
Remember that “increase” refers to subjects of different heights (cross sectional); it does not predict weight as the subject grows.
How old are the women in the study? Would you expect them to grow?
Is the model statistically significant? Check F-statistic for the p-value.
Is the model useful? Find the R-squared value, which indicates the proportion of the variance in weight explained by height.
Does the model fit the data well?
Is a linear model reasonable for this data? Check Q1 and Q3 of the residuals.
Q1 (-1.13) and Q3 (0.74) are about the same in magnitude, so the linear model is reasonable for this data.
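Each of the checks above can be pulled out of the summary object rather than read off the printout. This is a sketch, assuming the model is refit under the name `fit` (introduced here); `pf()` recomputes the p-value that “summary()” prints next to the F-statistic.

```r
fit <- lm(weight ~ height, data = women)
s <- summary(fit)

# Proportion of the variance in weight explained by height
r_squared <- s$r.squared

# s$fstatistic is a named vector: value, numdf, dendf
f <- s$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# First and third quartiles of the residuals (the 1Q and 3Q columns)
res_quartiles <- quantile(residuals(fit), probs = c(0.25, 0.75))

r_squared
p_value
res_quartiles
```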
Let’s now explore the “Auto” data set, which shows gas mileage, horsepower, and other information for 392 vehicles. Here are the first 6 lines of the data set.
library(ISLR)
library(MASS)
data("Auto")
head(Auto, 6)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Let’s view the dimensions of the “Auto” data set.
dim(Auto)
## [1] 392 9
Let’s view the structure of the data set using the “str()” function.
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Let us quickly review what we have seen in the “Auto” data set.
The name of the data set is ‘Auto’.
The dimensions tell us that the set holds 3,528 values (392 * 9): 392 rows (observations) in each of 9 columns (variables).
There are 9 variables - mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, and name.
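The value count and the full list of variable names can be confirmed from the data frame itself. The sketch below assumes the ISLR package is installed and skips the check otherwise; the names `auto`, `auto_values`, and `auto_names` are introduced here for illustration.

```r
# Only run the check when the ISLR package is available
if (requireNamespace("ISLR", quietly = TRUE)) {
  auto <- ISLR::Auto

  auto_values <- prod(dim(auto))  # rows * columns
  auto_names <- names(auto)       # all column names, including "name"

  print(auto_values)
  print(auto_names)
}
```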
Let us review what we see from running the “summary()” function on the “Auto” data set.
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
What are the lowest mpg and horsepower?
The Min values reveal that the lowest mpg is 9.0 mpg and the lowest horsepower is 46.0 engine horsepower.
What are the highest mpg and horsepower?
The max values reveal that the highest mpg is 46.60 mpg and the highest horsepower is 230.0 engine horsepower.
What are the average mpg and horsepower?
The average mpg is 23.45 mpg and the average horsepower is 104.5.
Let us review what we see from running the “summary()” function on a model in which the mpg is determined by horsepower.
model = lm(mpg ~ horsepower, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
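To see what the coefficients imply, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the printed output above rather than refitting the model, so it runs without the ISLR package; the function name `predict_mpg` is hypothetical, and the three horsepower values are the minimum, mean, and maximum reported by “summary(Auto)”.

```r
# Coefficients copied from the summary output (not refit here)
intercept <- 39.935861
slope <- -0.157845

# Fitted line: mpg = intercept + slope * horsepower
predict_mpg <- function(horsepower) intercept + slope * horsepower

hp <- c(min = 46, mean = 104.5, max = 230)
predicted <- predict_mpg(hp)
round(predicted, 2)  # min 32.67, mean 23.44, max 3.63
```

The prediction at the mean horsepower (about 23.44) is close to the mean mpg of 23.45, as expected for a least-squares line, which passes through the point of means; the small gap comes from the rounded mean horsepower.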
What should we pay attention to in the output?
Is the model statistically significant? Check F-statistic for the p-value.
Is the model useful? Find the R-squared value, which indicates the proportion of the variance in mpg explained by horsepower.
Does the model fit the data well?
Is a linear model reasonable for this data? Check Q1 and Q3 of the residuals.
Q1 (-3.26) and Q3 (2.76) are not the same in magnitude, and the largest residual (16.92) lies farther from zero than the smallest (-13.57). So the linear model is not reasonable for this data.
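Comparing Q1 and Q3 is a quick check. A plot of residuals against fitted values is a common complementary one, because a curved trend in that plot suggests that a straight line misses part of the pattern. The sketch below assumes the ISLR package is installed and does nothing otherwise; the name `fit_mpg` and the choice of a lowess smoother are illustrative.

```r
if (requireNamespace("ISLR", quietly = TRUE)) {
  fit_mpg <- lm(mpg ~ horsepower, data = ISLR::Auto)

  # Residuals against fitted values; a patternless band around zero
  # would support the linear model
  plot(fitted(fit_mpg), residuals(fit_mpg),
       xlab = "Fitted mpg", ylab = "Residual")
  abline(h = 0, lty = 2)

  # Smoothed trend of the residuals to make any curvature visible
  lines(lowess(fitted(fit_mpg), residuals(fit_mpg)), col = "red")
}
```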
First I will produce a scatterplot matrix that compares all the pairs of variables of the “Auto” data set.
pairs(Auto)
I will work with a model that has horsepower depend on weight.
Let us review what we see from running the “summary()” function on the “Auto” data set. We will assess the values for horsepower and weight this time.
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
What are the lowest weight and horsepower?
The Min values reveal that the lowest weight is 1613 lbs and the lowest horsepower is 46.0 engine horsepower.
What are the highest weight and horsepower?
The max values reveal that the highest weight is 5140 lbs and the highest horsepower is 230.0 engine horsepower.
What are the average weight and horsepower?
The average weight is 2978 lbs and the average horsepower is 104.5.
Let us review what we see from running the “summary()” function on a model in which the horsepower is determined by weight.
model = lm(horsepower ~ weight, data = Auto)
summary(model)
##
## Call:
## lm(formula = horsepower ~ weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.272 -12.285 -0.557 9.063 116.283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.183485 3.570431 -3.412 0.000712 ***
## weight 0.039177 0.001153 33.972 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.37 on 390 degrees of freedom
## Multiple R-squared: 0.7474, Adjusted R-squared: 0.7468
## F-statistic: 1154 on 1 and 390 DF, p-value: < 2.2e-16
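The same hand calculation works for this model. The sketch copies the intercept and slope from the printed output above instead of refitting, so it does not need the ISLR package; `predict_hp` is a hypothetical helper, and the weights are the minimum, mean, and maximum reported by “summary(Auto)”.

```r
# Coefficients copied from the summary output (not refit here)
intercept <- -12.183485
slope <- 0.039177

# Fitted line: horsepower = intercept + slope * weight
predict_hp <- function(weight) intercept + slope * weight

wt <- c(min = 1613, mean = 2978, max = 5140)
predicted_hp <- predict_hp(wt)
round(predicted_hp, 1)  # min 51.0, mean 104.5, max 189.2

# Expected change in horsepower for a 100 lb difference in weight
hp_per_100_lbs <- slope * 100
hp_per_100_lbs          # 3.9177
```

The prediction at the mean weight reproduces the mean horsepower of 104.5, and the slope says that vehicles 100 lbs heavier are expected to have about 3.9 more horsepower.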
What should we pay attention to in the output?
Is the model statistically significant? Check F-statistic for the p-value.
Is the model useful? Find the R-squared value, which indicates the proportion of the variance in horsepower explained by weight.
Does the model fit the data well?
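Is a linear model reasonable for this data? Check Q1 and Q3 of the residuals.
Q1 (-12.29) and Q3 (9.06) are not the same in magnitude, and the largest residual (116.28) is more than twice as far from zero as the smallest (-50.27). So the linear model is not reasonable for this data.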