Chapter 2: #9 and #10
# Packages used below. The package choices are assumptions: the 13-column Boston output matches ISLR2, and melt() is taken to come from reshape2.
library(ISLR2)
library(dplyr)
library(ggplot2)
library(reshape2)
Check for missing values
sapply(Auto, function(x) sum(is.na(x)))
## mpg cylinders displacement horsepower weight acceleration
## 0 0 0 0 0 0
## year origin name
## 0 0 0
Which of the predictors are quantitative and which are qualitative?
Quantitative:
mpg
cylinders
displacement
horsepower
weight
acceleration
year
Qualitative:
origin
name
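One way to support this classification is to look at each column's storage class and its number of distinct values. The following is a minimal sketch in base R; `column_overview` is a hypothetical helper name, and the toy data frame only stands in for Auto (its values are illustrative).

```r
# Hypothetical helper: storage class and number of distinct values per column.
column_overview <- function(df) {
  data.frame(
    variable   = names(df),
    class      = vapply(df, function(x) class(x)[1], character(1)),
    n_distinct = vapply(df, function(x) length(unique(x)), integer(1)),
    row.names  = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for Auto (illustrative values, not taken from the data set).
toy <- data.frame(
  mpg    = c(18, 15, 24.5, 31),
  origin = c(1, 1, 3, 2),
  name   = c("car a", "car b", "car c", "car d"),
  stringsAsFactors = FALSE
)
overview <- column_overview(toy)
print(overview)
# With the real data loaded: column_overview(Auto)
```

A numeric column with only a few distinct values, such as origin, is a hint that the numbers encode categories rather than quantities.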
What is the range of each quantitative predictor? You can answer this using the range() function.
quant_preds <- Auto %>%
  select(mpg, cylinders, displacement, horsepower, weight, acceleration, year)
sapply(quant_preds, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
What is the mean and standard deviation of each quantitative predictor?
sapply(quant_preds, function(x) round(mean(x),2))
## mpg cylinders displacement horsepower weight acceleration
## 23.45 5.47 194.41 104.47 2977.58 15.54
## year
## 75.98
sapply(quant_preds, function(x) round(sd(x),2))
## mpg cylinders displacement horsepower weight acceleration
## 7.81 1.71 104.64 38.49 849.40 2.76
## year
## 3.68
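The range, mean, and standard deviation can also be collected into a single table, which makes them easier to compare across variables. This is a sketch only; `numeric_summary` is a hypothetical helper name and the toy data frame is illustrative.

```r
# Hypothetical helper: min, max, mean and sd for every column of a numeric data frame.
numeric_summary <- function(df, digits = 2) {
  stats <- sapply(df, function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  })
  # One row per variable, one column per statistic.
  round(t(stats), digits)
}

toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 10, 20, 20))
res <- numeric_summary(toy)
print(res)
# With the real data: numeric_summary(quant_preds)
```

The same helper can be applied unchanged to any subset of the rows.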
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
auto_subset <- Auto[-c(10:85), ]
sub_quant_preds <- auto_subset %>%
  select(mpg, cylinders, displacement, horsepower, weight, acceleration, year)
sapply(sub_quant_preds, function(x) round(range(x), 2))
sapply(sub_quant_preds, function(x) round(mean(x), 2))
sapply(sub_quant_preds, function(x) round(sd(x), 2))
Using the full data set, investigate the predictors graphically, using scatter plots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
auto_melt <- melt(quant_preds, "mpg")
ggplot(auto_melt, aes(x = value, y = mpg)) +
  geom_point() +
  facet_wrap(. ~ variable, scales = "free_x")
Here each quantitative predictor is plotted against mpg. Weight, displacement, and horsepower all have very similar relationships to mpg, on varying scales: mpg decreases as each of them increases, along a curve that resembles half of a parabola. Because these three relationships look so similar, I would expect the three variables to be correlated with one another. Mpg generally increases with acceleration, but this relationship is much noisier and might not be as significant if tested. The relationship between year and mpg is rather intuitive: mpg has increased over time, which is expected as technology has improved.
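The expectation that weight, displacement, and horsepower are correlated with one another can be checked numerically with `cor()`. The sketch below uses a toy data frame with illustrative values in place of quant_preds, and Spearman rank correlation is an assumption made here because the plotted relationships look monotone but curved.

```r
# Toy stand-in for quant_preds (illustrative values only).
toy <- data.frame(
  mpg          = c(30, 26, 22, 18, 14),
  weight       = c(2000, 2400, 2900, 3500, 4300),
  horsepower   = c(65, 80, 95, 130, 175),
  acceleration = c(16.0, 17.5, 15.0, 14.0, 12.0)
)

# Pairwise rank correlations between all columns.
cor_mat <- round(cor(toy, method = "spearman"), 2)
print(cor_mat)
# With the real data: round(cor(quant_preds, method = "spearman"), 2)
```

Off-diagonal entries close to 1 or -1 among the predictors would confirm that they carry overlapping information.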
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
Yes. Displacement, horsepower, weight, and year all have very clear relationships with mpg, which suggests that they would be useful for predicting it. Acceleration, on the other hand, may not be as useful because mpg varies widely at any given value of acceleration.
To begin, load in the Boston data set. How many rows are in this data set? How many columns? What do the rows and columns represent?
?Boston
Rows:
506 total
each row represents data from 1 suburb of Boston
Columns:
13 total
each column represents one variable that was recorded for all suburbs
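The row and column counts can be read directly from the data frame rather than from the help page. A minimal sketch, using a toy data frame with illustrative values in place of Boston:

```r
# Toy stand-in for Boston (illustrative values only).
toy <- data.frame(crim = c(0.01, 0.03, 0.27), medv = c(24.0, 21.6, 34.7))

n_rows <- nrow(toy)  # number of observations
n_cols <- ncol(toy)  # number of variables
cat(n_rows, "rows and", n_cols, "columns\n")
# With the real data: dim(Boston) returns both counts at once.
```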
Make some pairwise scatter plots of the predictors (columns) in this data set. Describe your findings.
pairs(Boston[,1:4])
pairs(Boston[,5:8])
In the first plot, the top-right panel shows that the suburbs that bound the Charles River all have a crime rate of less than 10.
In the second plot, the bottom-left panel shows the relationship between nox (nitrogen oxides concentration) and dis (weighted mean of distances to five Boston employment centers). As the distance gets smaller, the concentration of nitrogen oxides tends to increase. Similarly, as age (proportion of owner-occupied units built prior to 1940) increases, the weighted mean distance decreases. This suggests that newer homes or housing units are being built farther away from these employment centers.
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
boston_melt <- melt(Boston, "crim")
ggplot(boston_melt, aes(x = value, y = crim)) +
  geom_point() +
  facet_wrap(. ~ variable, scales = "free_x")
Predictors associated with per capita crime rate:
chas - tracts bounding the river have consistently lower crime rates
age - suburbs with a higher proportion of units built before 1940 tend to have higher crime rates
dis - suburbs closer to the 5 Boston employment centers tend to have higher crime rates
lstat - as the percent of a suburb's population that falls into the “lower status” group increases, the crime rate tends to increase
medv - as the median value of owner-occupied homes in a suburb decreases, the crime rate tends to increase
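The direction of each association listed above can be cross-checked with a correlation against crim. The following is a sketch under assumptions: `rank_associations` is a hypothetical helper name, Spearman correlation is chosen here because crim is heavily skewed, and the toy data frame holds illustrative values rather than rows of Boston.

```r
# Hypothetical helper: Spearman correlation of every other column with `target`,
# sorted from the strongest to the weakest absolute association.
rank_associations <- function(df, target) {
  others <- setdiff(names(df), target)
  rho <- vapply(others, function(v) {
    cor(df[[v]], df[[target]], method = "spearman")
  }, numeric(1))
  rho[order(abs(rho), decreasing = TRUE)]
}

# Toy stand-in for Boston (illustrative values only).
toy <- data.frame(
  crim  = c(0.1, 0.5, 2.0, 9.0, 40.0),
  medv  = c(35, 30, 22, 15, 7),
  lstat = c(4, 9, 8, 20, 30),
  chas  = c(0, 1, 0, 1, 0)
)
assoc <- rank_associations(toy, "crim")
print(round(assoc, 2))
# With the real data: rank_associations(Boston, "crim")
```

A negative value means the crime rate tends to fall as the predictor rises, and a positive value means the two tend to rise together.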
Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
ggplot(Boston, aes(x = crim)) +
  geom_boxplot() +
  labs(x = "per capita crime rate by town",
       title = "Crime Rate Boxplot")
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
ggplot(Boston, aes(x = tax)) +
  geom_boxplot() +
  labs(x = "full-value property-tax rate per $10,000",
       title = "Tax Rate Boxplot")
summary(Boston$tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
ggplot(Boston, aes(x = ptratio)) +
  geom_boxplot() +
  labs(x = "pupil-teacher ratio by town",
       title = "Pupil-Teacher Ratio Boxplot")
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
Based on the box plots for the three predictors, there are many upper outliers for the per capita crime rate, no outliers for the property tax rate, and 2 lower outliers for the pupil-teacher ratio. The large number of crime-rate outliers comes from the strong right skew of its distribution: the values range from 0.006 to 88.98, yet the median is only 0.26. Skew this extreme makes it worth double-checking that the recorded values are correct. The other two variables are much less skewed, if at all, as their means and medians sit closer to the center of each variable's range.
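The outlier counts read off the box plots can be reproduced numerically with the usual 1.5 × IQR rule. This is a sketch: `count_outliers` is a hypothetical helper name, the toy vector is illustrative, and the quartiles from `quantile()` may differ slightly from the hinges a particular box plot implementation uses.

```r
# Hypothetical helper: count values more than 1.5 * IQR beyond the quartiles.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  c(lower = sum(x < q[1] - fence), upper = sum(x > q[2] + fence))
}

# Toy vector with one obvious upper outlier (illustrative values only).
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
out <- count_outliers(toy)
print(out)
# With the real data:
# sapply(Boston[, c("crim", "tax", "ptratio")], count_outliers)
```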
How many of the census tracts in this data set bound the Charles river?
Boston %>%
  filter(chas == 1) %>%
  summarise(count_bounding = n())
## count_bounding
## 1 35
What is the median pupil-teacher ratio among the towns in this data set?
Boston %>%
  summarise(median_ptratio = median(ptratio))
## median_ptratio
## 1 19.05
Which census tract of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
Boston[Boston$medv == min(Boston$medv),]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 22.98 5
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
The other variables for these two observations fall into the following categories:
min: zn and chas
low (below the 1st quartile): dis and rm
high (at or above the 3rd quartile): crim, indus, nox, tax, ptratio, and lstat
max: age and rad
As you can see, 6 variables for these two observations are at or above the 3rd quartile of the data set, and 2 more are at their maximum values. It is also interesting that, across these two observations, not one variable has a value that is “expected” or “normal”, i.e. close to the mean.
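Instead of comparing each value against the summary table by eye, the position of the selected tracts within each variable's distribution can be computed with the empirical CDF. This is a sketch; `percentile_rank` is a hypothetical helper name and the toy data frame holds illustrative values, not rows of Boston.

```r
# Hypothetical helper: for the rows flagged TRUE in `rows`, the share of all
# observations with a value less than or equal to theirs, per column.
percentile_rank <- function(df, rows) {
  idx <- which(rows)
  vapply(df, function(x) ecdf(x)(x[idx]), numeric(length(idx)))
}

# Toy stand-in for Boston with two tracts tied at the lowest medv.
toy <- data.frame(
  medv = c(5, 5, 20, 30, 50),
  crim = c(38, 68, 0.1, 3, 0.5),
  age  = c(100, 100, 40, 80, 60)
)
lowest <- toy$medv == min(toy$medv)
pr <- percentile_rank(toy, lowest)
print(pr)
# With the real data: percentile_rank(Boston, Boston$medv == min(Boston$medv))
```

Values near 0 or 1 mark variables where the selected tracts sit at the extremes of the data set.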
In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
Boston %>%
  summarise(count_above_7 = sum(rm > 7),
            count_above_8 = sum(rm > 8))
## count_above_7 count_above_8
## 1 64 13
As you can see here there are 64 tracts that average more than 7 rooms per dwelling and 13 that average more than 8.
eight_plus <- Boston %>%
  filter(rm > 8)
eight_plus_melt <- melt(eight_plus, "rm")
ggplot(eight_plus_melt, aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_x")
As shown in the box plots above, most of the tracts that average more than 8 rooms per dwelling are quite similar. They generally have a low crime rate, have a majority of homes that were built before 1940, are fairly close to the Boston employment centers, have a low lstat compared to the full data set, and have a high median value of owner-occupied homes. This last one is expected, as the value of homes tends to increase as the number of rooms increases.
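The comparison with the full data set can be made explicit by putting the subset medians next to the overall medians. This is a sketch; `compare_medians` is a hypothetical helper name and the toy data frame holds illustrative values only.

```r
# Hypothetical helper: median of each column in a subset next to the full-data median.
compare_medians <- function(df, keep) {
  rbind(
    subset = sapply(df[keep, , drop = FALSE], median),
    full   = sapply(df, median)
  )
}

# Toy stand-in for Boston (illustrative values only).
toy <- data.frame(
  rm    = c(5.5, 6.0, 6.5, 8.2, 8.7),
  medv  = c(15, 20, 22, 45, 50),
  lstat = c(20, 12, 10, 4, 3)
)
cmp <- compare_medians(toy, toy$rm > 8)
print(cmp)
# With the real data: compare_medians(Boston, Boston$rm > 8)
```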