Let’s start by loading the data set. It comes from UCI’s archive, labeled (simply enough) ‘Automobile Data Set’. I uploaded the set to GitHub as a CSV file, grabbed the raw link, and used R to read it into a data frame. Since the data has no header row (column titles), the column names had to be entered ‘manually’ in a separate step. Finally, I used the head function to check that it all worked as expected.
library(RCurl)
## Loading required package: bitops
x <- 'https://raw.githubusercontent.com/chrisgmartin/DATA607/master/imports-85.csv'
y <- read.csv(url(x), header = FALSE)
colnames(y) <- c("symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price")
head(y)
## symboling normalized-losses make fuel-type aspiration
## 1 3 ? alfa-romero gas std
## 2 3 ? alfa-romero gas std
## 3 1 ? alfa-romero gas std
## 4 2 164 audi gas std
## 5 2 164 audi gas std
## 6 2 ? audi gas std
## num-of-doors body-style drive-wheels engine-location wheel-base length
## 1 two convertible rwd front 88.6 168.8
## 2 two convertible rwd front 88.6 168.8
## 3 two hatchback rwd front 94.5 171.2
## 4 four sedan fwd front 99.8 176.6
## 5 four sedan 4wd front 99.4 176.6
## 6 two sedan fwd front 99.8 177.3
## width height curb-weight engine-type num-of-cylinders engine-size
## 1 64.1 48.8 2548 dohc four 130
## 2 64.1 48.8 2548 dohc four 130
## 3 65.5 52.4 2823 ohcv six 152
## 4 66.2 54.3 2337 ohc four 109
## 5 66.4 54.3 2824 ohc five 136
## 6 66.3 53.1 2507 ohc five 136
## fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg
## 1 mpfi 3.47 2.68 9.0 111 5000 21
## 2 mpfi 3.47 2.68 9.0 111 5000 21
## 3 mpfi 2.68 3.47 9.0 154 5000 19
## 4 mpfi 3.19 3.40 10.0 102 5500 24
## 5 mpfi 3.19 3.40 8.0 115 5500 18
## 6 mpfi 3.19 3.40 8.5 110 5500 19
## highway-mpg price
## 1 27 13495
## 2 27 16500
## 3 26 16500
## 4 30 13950
## 5 22 17450
## 6 25 15250
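The `?` entries in the output above are how this data set marks missing values, so any column that contains one is read as text rather than as numbers. Here is a minimal sketch of one way to handle that, assuming you would rather have `NA`: pass `na.strings = "?"` to read.csv and supply the column names through `col.names` in the same call. The two inline rows are rows 1 and 4 rebuilt from the head output, so the example runs without a network connection. The names `cols`, `rows` and `cars` are placeholders of mine, not part of the original code.

```r
# Column names, the same ones used in the colnames() call above
cols <- c("symboling", "normalized-losses", "make", "fuel-type", "aspiration",
          "num-of-doors", "body-style", "drive-wheels", "engine-location",
          "wheel-base", "length", "width", "height", "curb-weight",
          "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
          "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
          "city-mpg", "highway-mpg", "price")

# Rows 1 and 4 of the file, rebuilt from the head() output (stand-in for the URL)
rows <- c(
  "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495",
  "2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950"
)

# na.strings turns "?" into NA; check.names = FALSE keeps the hyphenated names
cars <- read.csv(text = rows, header = FALSE, col.names = cols,
                 na.strings = "?", check.names = FALSE)

# Which rows are missing a normalized-losses value
is.na(cars[["normalized-losses"]])
```

To run this against the real file, replace `text = rows` with `url(x)`. Columns such as horsepower and price should then come through as numbers with `NA` gaps, so functions like mean will need `na.rm = TRUE`.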
Next we’ll pick out the columns that we want to analyse (aka randomly choose some columns) and create a subset. The columns I chose for my analysis were make (because that’s a primary detail of the cars), fuel-type (because cars needed fuel back then), num-of-doors (this is likely a simple item we can use to subset the data by), length (totally random), width (because we already selected length), num-of-cylinders (similar to the doors), horsepower (because horses are cool), and price (same reason as make).
y2 <- y[, c("make","fuel-type","num-of-doors","length","width","num-of-cylinders","horsepower","price")]
head(y2)
## make fuel-type num-of-doors length width num-of-cylinders
## 1 alfa-romero gas two 168.8 64.1 four
## 2 alfa-romero gas two 168.8 64.1 four
## 3 alfa-romero gas two 171.2 65.5 six
## 4 audi gas four 176.6 66.2 four
## 5 audi gas four 176.6 66.4 five
## 6 audi gas two 177.3 66.3 five
## horsepower price
## 1 111 13495
## 2 111 16500
## 3 154 16500
## 4 102 13950
## 5 115 17450
## 6 110 15250
Now that we have the new data frame y2 set up, we can subset. I want to see the average length and width of two-door diesel cars.
#averages before subset
mean(y2$length)
## [1] 174.0493
mean(y2$width)
## [1] 65.9078
#subset type 1
y3 <- y2[y2$`num-of-doors` == 'two' & y2$`fuel-type` == 'diesel', ]
#subset type 2, not saving as a new data frame to show results of the subset
subset(y2, `num-of-doors` == 'two' & `fuel-type` == 'diesel')
## make fuel-type num-of-doors length width num-of-cylinders
## 70 mercedes-benz diesel two 187.5 70.3 five
## 91 nissan diesel two 165.3 63.8 four
## 183 volkswagen diesel two 171.7 65.5 four
## horsepower price
## 70 123 28176
## 91 55 7099
## 183 52 7775
#averages after subset
mean(y3$length)
## [1] 174.8333
mean(y3$width)
## [1] 66.53333
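As you can see, the average length and width of the two-door diesel cars (about 174.8 and 66.5) are eerily similar to the averages for the original data set (about 174.0 and 65.9). Keep in mind that the subset holds only three cars, so the match could be a coincidence. It could also be that all cars fall in this range regardless of door number or fuel type. There’s one way to find out: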
boxplot(y2$length)
boxplot(y2$width)
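The two boxplots above show the spread across all cars, but not whether fuel type or door count shifts it. A minimal sketch of a direct group comparison follows, using base R's aggregate and the formula form of boxplot on a nine-row sample rebuilt from the outputs above (the six head(y2) rows plus the three two-door diesel rows). The short column names `fuel` and `doors`, and the names `d`, `group_means` and `bp`, are stand-ins of mine. Nine rows are far too few to conclude anything; the point is only the shape of the calls.

```r
# Nine rows copied from the outputs above: six from head(y2), three diesel
d <- data.frame(
  fuel   = c("gas", "gas", "gas", "gas", "gas", "gas", "diesel", "diesel", "diesel"),
  doors  = c("two", "two", "two", "four", "four", "two", "two", "two", "two"),
  length = c(168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 187.5, 165.3, 171.7),
  width  = c(64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 70.3, 63.8, 65.5)
)

# Mean length and width for every fuel/doors combination present in d
group_means <- aggregate(d[, c("length", "width")],
                         by = list(fuel = d$fuel, doors = d$doors),
                         FUN = mean)
group_means

# Side-by-side boxplots of length, one box per fuel type
bp <- boxplot(length ~ fuel, data = d, ylab = "length")
```

On the full data frame the same idea would use `y2$length`, `y2$width` and the hyphenated columns `y2$"fuel-type"` and `y2$"num-of-doors"` in the `by` list.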