Problems

Chapter 2: #9 and #10

Problem #9

Check for missing values

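# Assumed setup: packages used by the code in this document
library(ISLR2)     # Auto and Boston data sets (assumption: the 13-column Boston matches ISLR2)
library(dplyr)     # %>%, select(), filter(), summarise()
library(ggplot2)   # ggplot(), geom_point(), geom_boxplot()
library(reshape2)  # melt()
sapply(Auto, function(x) sum(is.na(x)))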
##          mpg    cylinders displacement   horsepower       weight acceleration 
##            0            0            0            0            0            0 
##         year       origin         name 
##            0            0            0

a)

Which of the predictors are quantitative and which are qualitative?

Quantitative:

  • mpg

  • cylinders

  • displacement

  • horsepower

  • weight

  • acceleration

  • year

Qualitative:

  • origin

  • name
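One way to sanity-check this split is to look at how each column is stored and how many distinct values it takes. The following is a minimal base R sketch; the helper name `column_profile` and the toy data frame are hypothetical and only stand in for `Auto`.

```r
# Hypothetical helper: class and number of distinct values for each column.
column_profile <- function(df) {
  data.frame(
    variable   = names(df),
    class      = vapply(df, function(x) class(x)[1], character(1)),
    n_distinct = vapply(df, function(x) length(unique(x)), integer(1)),
    row.names  = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for Auto (made-up values, for illustration only).
toy <- data.frame(
  mpg    = c(18, 15, 26.5, 31),
  origin = c(1, 1, 3, 2),
  name   = c("car a", "car b", "car c", "car d"),
  stringsAsFactors = FALSE
)

profile <- column_profile(toy)
print(profile)
```

A numeric column with only a handful of distinct values, such as origin, is usually a coded category rather than a measurement, which is why it is listed as qualitative here. On the real data the call would be `column_profile(Auto)`.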

b)

What is the range of each quantitative predictor? You can answer this using the range() function.

quant_preds <- Auto %>%
  select(mpg, cylinders, displacement, horsepower, weight, acceleration, year)
sapply(quant_preds,range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

c)

What is the mean and standard deviation of each quantitative predictor?

sapply(quant_preds, function(x) round(mean(x),2))
##          mpg    cylinders displacement   horsepower       weight acceleration 
##        23.45         5.47       194.41       104.47      2977.58        15.54 
##         year 
##        75.98
sapply(quant_preds, function(x) round(sd(x),2))
##          mpg    cylinders displacement   horsepower       weight acceleration 
##         7.81         1.71       104.64        38.49       849.40         2.76 
##         year 
##         3.68

d)

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

auto_subset <- Auto[-c(10:85), ]
sub_quant_preds <- auto_subset %>%
  select(mpg, cylinders, displacement, horsepower, weight, acceleration, year)

# Summarise the subset (sub_quant_preds), not the full quant_preds data frame
sapply(sub_quant_preds, function(x) round(range(x), 2))
##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82
sapply(sub_quant_preds, function(x) round(mean(x), 2))
##          mpg    cylinders displacement   horsepower       weight acceleration 
##        24.40         5.37       187.24       100.72      2935.97        15.73 
##         year 
##        77.15
sapply(sub_quant_preds, function(x) round(sd(x), 2))
##          mpg    cylinders displacement   horsepower       weight acceleration 
##         7.87         1.65        99.68        35.71       811.30         2.69 
##         year 
##         3.11
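Parts b) through d) compute the same statistics on different data frames, so wrapping them in one function may reduce the chance of summarising the wrong object. The sketch below is only an illustration: the helper name `describe` and the toy data are hypothetical.

```r
# Hypothetical helper: min, max, mean and sd for every column of a numeric data frame.
describe <- function(df, digits = 2) {
  stats <- sapply(df, function(x) c(min = min(x), max = max(x),
                                    mean = mean(x), sd = sd(x)))
  round(t(stats), digits)  # one row per variable
}

# Toy data (made-up values); rows 2 through 4 are dropped the same way
# part d) drops observations 10 through 85.
toy <- data.frame(mpg = c(10, 20, 30, 40, 50),
                  weight = c(5, 4, 3, 2, 1) * 1000)
toy_subset <- toy[-c(2:4), ]

full_stats   <- describe(toy)
subset_stats <- describe(toy_subset)
print(full_stats)
print(subset_stats)
```

On the real data the two calls would be `describe(quant_preds)` and `describe(Auto[-c(10:85), names(quant_preds)])`, which makes it easy to see which statistics change once the rows are removed.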

e)

Using the full data set, investigate the predictors graphically, using scatter plots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

auto_melt <- melt(quant_preds, "mpg")
ggplot(auto_melt, aes(x = value, mpg)) +
  geom_point() +
  facet_wrap(.~variable, scales = "free_x")

This plot compares each quantitative predictor with mpg. Weight, displacement, and horsepower all have very similar relationships with mpg, on varying scales: mpg decreases as each of them increases, along a curve that resembles half of a parabola. Because these three relationships look so similar, I would expect the three variables to be correlated with one another. mpg generally increases with acceleration, but this relationship is much noisier and might not be as significant if tested. The relationship between year and mpg is the most intuitive: mpg increases over time, which is expected as technology has improved.

f)

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Yes. Displacement, horsepower, weight, and year all show clear relationships with mpg, which suggests that they would be useful for predicting it. Acceleration, on the other hand, may not be as useful because mpg varies widely at a given value of acceleration.
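The visual impressions above can be backed with correlation coefficients. The following is a small sketch under stated assumptions: `cor_with` is a hypothetical helper and the toy values are made up. Pearson correlation only captures linear association, so for the curved relationships described above `method = "spearman"` may be the better choice.

```r
# Hypothetical helper: correlation of every other column with a response column.
cor_with <- function(df, response, method = "pearson") {
  predictors <- setdiff(names(df), response)
  sapply(df[predictors], function(x) cor(x, df[[response]], method = method))
}

# Toy stand-in for quant_preds (made-up values, for illustration only).
toy <- data.frame(
  mpg    = c(30, 25, 20, 15, 10),
  weight = c(2000, 2500, 3100, 3600, 4200),
  year   = c(82, 80, 76, 73, 70)
)

r <- cor_with(toy, "mpg")
print(round(r, 2))
```

On the real data, `cor_with(quant_preds, "mpg")` would give one coefficient per predictor, and `cor(quant_preds)` the full matrix, which also shows how strongly weight, displacement, and horsepower are related to each other.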

Problem #10

a)

To begin, load in the Boston data set. How many rows are in this data set? How many columns? What do the rows and columns represent?

?Boston

Rows:

  • 506 total

  • each row represents data from 1 suburb of Boston

Columns:

  • 13 total

  • each column represents one variable that was recorded for all suburbs

b)

Make some pairwise scatter plots of the predictors (columns) in this data set. Describe your findings.

pairs(Boston[,1:4])

pairs(Boston[,5:8])

In the first plot matrix, the top-right panel (crim against chas) shows that the suburbs that bound the Charles River all have a per capita crime rate below 10.

In the second plot matrix, the bottom-left panel shows the relationship between nox (nitrogen oxides concentration) and dis (weighted mean of distances to five Boston employment centers). As the distance gets smaller, the concentration of nitrogen oxides tends to increase. Similarly, as age (the proportion of owner-occupied units built prior to 1940) increases, the weighted mean distance decreases. This tells us that newer homes or housing units are being built farther away from these employment centers.

c)

Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

boston_melt <- melt(Boston, "crim")
ggplot(boston_melt, aes(x = value, crim)) +
  geom_point() +
  facet_wrap(.~variable, scales = "free_x")

Predictors associated with per capita crime rate:

  • chas - tracts bounding the river have consistently lower crime rates

  • age - suburbs with a higher proportion of units built before 1940 tend to have higher crime rates

  • dis - suburbs closer to the 5 Boston employment centers tend to have higher crime rates

  • lstat - as the percent of the population that falls into the “lower status” group for a suburb increases, the crime rate tends to increase

  • medv - as the median value of owner-occupied homes in a suburb decreases, the crime rate tends to increase
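For a binary predictor such as chas, a scatter plot shows only two columns of points, so a grouped summary can make the comparison more concrete. This is a minimal base R sketch; the toy data frame is made up and only stands in for `Boston`.

```r
# Toy stand-in for Boston (made-up values, for illustration only).
toy <- data.frame(
  chas = c(0, 0, 0, 1, 1),
  crim = c(0.1, 5.0, 20.0, 0.2, 0.4)
)

# Median and maximum crime rate within each chas group.
crim_median <- tapply(toy$crim, toy$chas, median)
crim_max    <- tapply(toy$crim, toy$chas, max)

crim_by_chas <- rbind(median = crim_median, max = crim_max)
print(crim_by_chas)
```

On the real data the same calls would use `Boston$crim` and `Boston$chas`. Comparing medians as well as maxima helps separate "the extreme values are all in one group" from "the typical value differs between groups".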

d)

Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

ggplot(Boston, aes(x = crim)) +
  geom_boxplot() +
  labs(x = "per capita crime rate by town",
       title = "Crime Rate Boxplot")

summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620
ggplot(Boston, aes(x = tax)) +
  geom_boxplot() +
  labs(x = "full-value property-tax rate per $10,000",
       title = "Tax Rate Boxplot")

summary(Boston$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0
ggplot(Boston, aes(x = ptratio)) +
  geom_boxplot() +
  labs(x = "pupil-teacher ratio by town",
       title = "Pupil-Teacher Ratio Boxplot")

summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

Based on the box plots for the three predictors, there are many upper outliers for the per capita crime rate, no outliers for the property tax rate, and 2 lower outliers for the pupil-teacher ratio. The outliers in the crime rate reflect how strongly right-skewed its distribution is: the values range from 0.006 to 88.98, yet the median is only 0.26 and the mean is 3.61. This skew is drastic enough that it would be worth double-checking that the data are correct. The tax rate (187 to 711) and the pupil-teacher ratio (12.6 to 22) are much less skewed, if at all, as their means and medians sit closer to the middle of their ranges.
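The outlier counts read off the box plots can also be computed directly. The sketch below assumes the usual 1.5 × IQR rule for box plot whiskers; `count_outliers` is a hypothetical helper and the toy vector is made up.

```r
# Hypothetical helper: count points beyond the 1.5 * IQR fences on each side.
count_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = sum(x < q[1] - coef * iqr),
    upper = sum(x > q[2] + coef * iqr))
}

# Toy right-skewed vector (made-up values): one extreme value at the top.
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

out <- count_outliers(toy)
print(out)
```

On the real data, `sapply(Boston[c("crim", "tax", "ptratio")], count_outliers)` would give the lower and upper counts for all three variables at once.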

e)

How many of the census tracts in this data set bound the Charles river?

Boston %>%
  filter(chas == 1) %>%
  summarise(count_bounding = n())
##   count_bounding
## 1             35

f)

What is the median pupil-teacher ratio among the towns in this data set?

Boston %>%
  summarise(median_ptratio = median(ptratio))
##   median_ptratio
## 1          19.05

g)

Which census tract of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

Boston[Boston$medv == min(Boston$medv),]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

The other variables for these two observations fall into the following categories:

min: zn and chas

low (below the 1st quartile): dis and rm

high (at or above the 3rd quartile): crim, indus, nox, tax, ptratio, and lstat

max: age and rad

As you can see, 6 variables for these two observations are at or above the 3rd quartile, that is, in the top 25% of all values in the data set, and 2 more are at the maximum. It is also interesting that not one variable for these two observations has a value that is “expected” or “normal”, i.e. close to the mean.
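Instead of comparing each value against the `summary()` table by eye, the position of a row within each column can be computed as an empirical percentile. This is a minimal sketch; `percentile_of_row` is a hypothetical helper and the toy data are made up.

```r
# Hypothetical helper: for each column, the share of rows whose value is
# less than or equal to the value in the chosen row.
percentile_of_row <- function(df, row) {
  sapply(names(df), function(v) mean(df[[v]] <= df[[v]][row]))
}

# Toy stand-in for Boston (made-up values, for illustration only).
toy <- data.frame(
  medv = c(5, 20, 22, 25, 50),
  crim = c(40, 0.1, 0.2, 0.3, 0.05)
)

i <- which.min(toy$medv)   # row with the lowest median home value
p <- percentile_of_row(toy, i)
print(p)
```

Values near 0 or 1 mark variables where the tract sits at an extreme. Because the real data have two tracts tied at the minimum, `which(Boston$medv == min(Boston$medv))` would return both row indices, and the helper could be applied to each in turn.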

h)

In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

Boston %>%
  summarise(count_above_7 = sum(rm > 7),
            count_above_8 = sum(rm > 8))
##   count_above_7 count_above_8
## 1            64            13

As you can see here there are 64 tracts that average more than 7 rooms per dwelling and 13 that average more than 8.

eight_plus <- Boston %>%
  filter(rm >8) 
eight_plus_melt <- melt(eight_plus, "rm")
ggplot(eight_plus_melt, aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free_x")

As shown in the box plots above, most of the tracts that average more than 8 rooms per dwelling are quite similar. They generally have a low crime rate, have a majority of homes built before 1940, are fairly close to the Boston employment centers, have a low lstat compared to the full data set, and have a high median value of owner-occupied homes. This last one is expected, as the value of homes tends to increase as the number of rooms increases.
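Statements such as "low compared to the full data set" need a baseline that box plots of the subset alone do not show. One hedged way to provide it is to put the medians of the subset next to those of all rows; `compare_medians` is a hypothetical helper and the toy data are made up.

```r
# Hypothetical helper: column medians for all rows and for a logical subset.
compare_medians <- function(full, keep) {
  rbind(
    full   = sapply(full, median),
    subset = sapply(full[keep, , drop = FALSE], median)
  )
}

# Toy stand-in for Boston (made-up values, for illustration only).
toy <- data.frame(
  rm   = c(5.5, 6.0, 6.5, 8.2, 8.5),
  medv = c(15, 20, 22, 45, 50)
)

cmp <- compare_medians(toy, toy$rm > 8)
print(cmp)
```

On the real data the call would be `compare_medians(Boston, Boston$rm > 8)`, giving one column per variable with the full-data and subset medians stacked for direct comparison.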