Subsetting the Automobiles Data Set

Loading the Data Set

Let’s start by loading the data set. This set comes from UCI’s archive, where it is labeled (simply enough) ‘Automobile Data Set’. I uploaded the set to GitHub as a CSV file, grabbed the raw link, and used R to read it into a data frame. Since the file has no header row (column titles), the names had to be entered ‘manually’ as a separate step. Finally, I used the head function to check that it all worked as expected.

library(RCurl)

## Loading required package: bitops

x <- 'https://raw.githubusercontent.com/chrisgmartin/DATA607/master/imports-85.csv'
y <- read.csv(url(x), header = FALSE)
colnames(y) <- c("symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price")
head(y)

##   symboling normalized-losses        make fuel-type aspiration
## 1         3                 ? alfa-romero       gas        std
## 2         3                 ? alfa-romero       gas        std
## 3         1                 ? alfa-romero       gas        std
## 4         2               164        audi       gas        std
## 5         2               164        audi       gas        std
## 6         2                 ?        audi       gas        std
##   num-of-doors  body-style drive-wheels engine-location wheel-base length
## 1          two convertible          rwd           front       88.6  168.8
## 2          two convertible          rwd           front       88.6  168.8
## 3          two   hatchback          rwd           front       94.5  171.2
## 4         four       sedan          fwd           front       99.8  176.6
## 5         four       sedan          4wd           front       99.4  176.6
## 6          two       sedan          fwd           front       99.8  177.3
##   width height curb-weight engine-type num-of-cylinders engine-size
## 1  64.1   48.8        2548        dohc             four         130
## 2  64.1   48.8        2548        dohc             four         130
## 3  65.5   52.4        2823        ohcv              six         152
## 4  66.2   54.3        2337         ohc             four         109
## 5  66.4   54.3        2824         ohc             five         136
## 6  66.3   53.1        2507         ohc             five         136
##   fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg
## 1        mpfi 3.47   2.68               9.0        111     5000       21
## 2        mpfi 3.47   2.68               9.0        111     5000       21
## 3        mpfi 2.68   3.47               9.0        154     5000       19
## 4        mpfi 3.19   3.40              10.0        102     5500       24
## 5        mpfi 3.19   3.40               8.0        115     5500       18
## 6        mpfi 3.19   3.40               8.5        110     5500       19
##   highway-mpg price
## 1          27 13495
## 2          27 16500
## 3          26 16500
## 4          30 13950
## 5          22 17450
## 6          25 15250
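
The ‘?’ entries in normalized-losses above are this file’s placeholder for missing values. As a minimal sketch (not part of the original analysis), here is one way to have R treat them as NA at load time using the na.strings argument of read.csv. The three rows are copied from the output above and trimmed to four columns for brevity; the names sample_csv, raw and clean are made up for this illustration.

```r
# Three rows copied from the head() output above, trimmed to four columns
# (symboling, normalized-losses, make, price). The real file has 26 columns.
sample_csv <- "3,?,alfa-romero,13495
2,164,audi,13950
2,?,audi,15250"

# Without na.strings, the "?" forces the whole column to be read as text
# (character or factor, depending on the R version).
raw <- read.csv(text = sample_csv, header = FALSE)
is.numeric(raw$V2)

# With na.strings = "?", the placeholder becomes NA and the column is numeric.
clean <- read.csv(text = sample_csv, header = FALSE, na.strings = "?")
colnames(clean) <- c("symboling", "normalized-losses", "make", "price")
is.numeric(clean$`normalized-losses`)

# Numeric summaries then work as long as the NAs are skipped explicitly.
mean(clean$`normalized-losses`, na.rm = TRUE)
```

The same argument could be added to the read.csv call on the full file, which should matter for columns such as horsepower and price if they also contain ‘?’ entries.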

Subsetting the Data Set

Next we’ll pick out the columns that we want to analyse (aka randomly choose some columns) and create a subset. The columns that pertain to my analysis are make (because that’s a primary detail of the cars), fuel-type (because cars needed fuel back then), num-of-doors (this is likely a simple item we can use to subset the data by), length (totally random), width (because we already selected length), num-of-cylinders (similar to the doors), horsepower (because horses are cool), and price (same reason as make).

y2 <- y[, c("make","fuel-type","num-of-doors","length","width","num-of-cylinders","horsepower","price")]
head(y2)

##          make fuel-type num-of-doors length width num-of-cylinders
## 1 alfa-romero       gas          two  168.8  64.1             four
## 2 alfa-romero       gas          two  168.8  64.1             four
## 3 alfa-romero       gas          two  171.2  65.5              six
## 4        audi       gas         four  176.6  66.2             four
## 5        audi       gas         four  176.6  66.4             five
## 6        audi       gas          two  177.3  66.3             five
##   horsepower price
## 1        111 13495
## 2        111 16500
## 3        154 16500
## 4        102 13950
## 5        115 17450
## 6        110 15250

Now that we have the new data frame y2 set up, we can subset. I want to see the average length and width of two-door diesel cars.

#averages before subset
mean(y2$length)

## [1] 174.0493

mean(y2$width)

## [1] 65.9078

#subset type 1
y3 <- y2[y2$`num-of-doors` == 'two' & y2$`fuel-type` == 'diesel', ]

#subset type 2, not saving as a new data frame to show results of the subset
subset(y2, `num-of-doors` == 'two' & `fuel-type` == 'diesel')

##              make fuel-type num-of-doors length width num-of-cylinders
## 70  mercedes-benz    diesel          two  187.5  70.3             five
## 91         nissan    diesel          two  165.3  63.8             four
## 183    volkswagen    diesel          two  171.7  65.5             four
##     horsepower price
## 70         123 28176
## 91          55  7099
## 183         52  7775
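
The two subset types return the same rows here, but they differ when the condition evaluates to NA: bracket indexing keeps such a row as a line of NAs, while subset() drops it. The sketch below illustrates this on five rows copied from the outputs above (rows 1, 4, 70, 91 and 183). The missing door count introduced halfway through is hypothetical and only there to show the difference; the data frame name cars and the other variable names are also made up for this example.

```r
# Five rows copied from the outputs above; check.names = FALSE keeps the
# hyphenated column names intact.
cars <- data.frame(
  make           = c("alfa-romero", "audi", "mercedes-benz", "nissan", "volkswagen"),
  `fuel-type`    = c("gas", "gas", "diesel", "diesel", "diesel"),
  `num-of-doors` = c("two", "four", "two", "two", "two"),
  length         = c(168.8, 176.6, 187.5, 165.3, 171.7),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Subset type 1 (logical row index) and type 2 (subset) pick the same cars.
by_bracket <- cars[cars$`num-of-doors` == "two" & cars$`fuel-type` == "diesel", ]
by_subset  <- subset(cars, `num-of-doors` == "two" & `fuel-type` == "diesel")
by_bracket$make
by_subset$make

# HYPOTHETICAL: pretend the door count of the third car is unknown.
cars$`num-of-doors`[3] <- NA

# The condition is now NA for that row. Bracket indexing returns it as an
# all-NA row, whereas subset() treats NA as "no match" and drops it.
na_bracket <- cars[cars$`num-of-doors` == "two" & cars$`fuel-type` == "diesel", ]
na_subset  <- subset(cars, `num-of-doors` == "two" & `fuel-type` == "diesel")
nrow(na_bracket)
nrow(na_subset)
```

Wrapping the bracket condition in which() is one common way to make the first form drop NA matches as well.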

#averages after subset
mean(y3$length)

## [1] 174.8333

mean(y3$width)

## [1] 66.53333
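
Because the subset has only three cars, the means can be checked by hand. The sketch below rebuilds the three rows from the subset output above and compares their means with the full-data means printed earlier; the names diesel_two_door, overall and subset_means are made up for this illustration.

```r
# The three two-door diesel cars, copied from the subset output above.
diesel_two_door <- data.frame(
  make   = c("mercedes-benz", "nissan", "volkswagen"),
  length = c(187.5, 165.3, 171.7),
  width  = c(70.3, 63.8, 65.5)
)

# Column means of the subset; these match mean(y3$length) and mean(y3$width).
subset_means <- colMeans(diesel_two_door[, c("length", "width")])
subset_means

# Full-data means, typed in from the "averages before subset" output.
overall <- c(length = 174.0493, width = 65.9078)

# Difference between the subset means and the overall means.
subset_means - overall

# The three cars themselves are spread over a wide range of lengths.
range(diesel_two_door$length)
```

The means differ by less than one unit, but the individual lengths run from 165.3 to 187.5, so with three cars the close agreement of the averages is weak evidence on its own.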

As you can see, the average length and width of the two-door diesel cars (174.8 and 66.5) are eerily similar to those of the full data set (174.0 and 65.9). It could be that all cars fall in this range regardless of door number or fuel type. There’s one way to find out:

boxplot(y2$length)

boxplot(y2$width)

Answer: More or less
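
A grouped boxplot makes this kind of comparison more direct than a single box for all cars. The sketch below uses the formula interface of boxplot on nine rows copied from the outputs above (the six gas cars from head and the three two-door diesel cars), so it only illustrates the technique and says nothing about the full data set. The names cars and b are made up for this example.

```r
# Nine rows copied from the outputs above: six gas cars (rows 1 to 6)
# and three diesel cars (rows 70, 91, 183).
cars <- data.frame(
  `fuel-type` = c(rep("gas", 6), rep("diesel", 3)),
  length = c(168.8, 168.8, 171.2, 176.6, 176.6, 177.3,
             187.5, 165.3, 171.7),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One box per fuel type; backticks are needed because of the hyphen.
# boxplot() draws the plot and invisibly returns the statistics it used.
b <- boxplot(length ~ `fuel-type`, data = cars, ylab = "length")

# Group labels and the number of cars behind each box.
b$names
b$n
```

On the full data the same call would be boxplot(length ~ `fuel-type`, data = y2), assuming y2 is built as shown earlier.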