Task: choose one dataset, then study the data and its accompanying description (the "data dictionary"). Take the data and create an R data frame containing a subset of the columns (and, if you like, the rows) of the dataset. The deliverable is the R code that performs these transformations.
Dataset, from https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data: Wine recognition data. Updated Sept 21, 1998 by C. Blake: added attribute information. Sources: Forina, M. et al., PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
dataUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine <- read.csv(dataUrl, strip.white = TRUE, header = FALSE)
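Since the file has no header row, column names can also be supplied at read time through the col.names argument of read.csv. The sketch below is a minimal offline illustration, not part of the original analysis: it parses three sample records from an inline string instead of the URL, and the short names in wineCols are hypothetical labels chosen for this example, ordered as in the data dictionary (class label first, then the 13 attributes).

```r
# Three sample records in the same comma-separated layout as wine.data.
sampleText <- "1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185"

# Hypothetical short column names (class label first, then the 13 attributes).
wineCols <- c("Class", "Alcohol", "MalicAcid", "Ash", "AshAlcalinity",
              "Magnesium", "TotalPhenols", "Flavanoids", "NonflavanoidPhenols",
              "Proanthocyanins", "ColorIntensity", "Hue", "OD280_OD315", "Proline")

# header = FALSE because there is no header row; col.names labels the columns.
wineSample <- read.csv(text = sampleText, header = FALSE,
                       strip.white = TRUE, col.names = wineCols)

# Inspect the resulting column names and types.
str(wineSample)
```

Naming the columns while reading avoids a separate renaming step, and read.csv stops with an error if the number of names does not fit the number of columns in the file.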
We can look at the first few rows of the data with head(). Because the file has no header row, R names the columns V1 to V14:
head(wine)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
## 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
## 6 1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
summary(wine)
## V1 V2 V3 V4
## Min. :1.000 Min. :11.03 Min. :0.740 Min. :1.360
## 1st Qu.:1.000 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210
## Median :2.000 Median :13.05 Median :1.865 Median :2.360
## Mean :1.938 Mean :13.00 Mean :2.336 Mean :2.367
## 3rd Qu.:3.000 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558
## Max. :3.000 Max. :14.83 Max. :5.800 Max. :3.230
## V5 V6 V7 V8
## Min. :10.60 Min. : 70.00 Min. :0.980 Min. :0.340
## 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205
## Median :19.50 Median : 98.00 Median :2.355 Median :2.135
## Mean :19.49 Mean : 99.74 Mean :2.295 Mean :2.029
## 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875
## Max. :30.00 Max. :162.00 Max. :3.880 Max. :5.080
## V9 V10 V11 V12
## Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800
## 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825
## Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650
## Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574
## 3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200
## Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100
## V13 V14
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.938 1st Qu.: 500.5
## Median :2.780 Median : 673.5
## Mean :2.612 Mean : 746.9
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
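In the summary above, V1 only ranges from 1 to 3, which suggests a categorical label rather than a measurement (the data dictionary describes the first attribute as the class identifier). For such a column, counts per value are more informative than quartiles. A minimal sketch on a hypothetical toy data frame, with made-up values standing in for the wine data:

```r
# Hypothetical toy data: V1 plays the role of the class label, V2 a measurement.
toy <- data.frame(V1 = c(1, 1, 2, 3, 3, 3),
                  V2 = c(14.2, 13.2, 12.4, 12.9, 13.2, 14.1))

# Count how many rows fall into each distinct value of V1.
classCounts <- table(toy$V1)
print(classCounts)

# Stored as a factor, the label is summarised by counts instead of quantiles.
toy$V1 <- factor(toy$V1)
summary(toy$V1)
```

The same two calls can be applied to wine$V1 once the data frame has been loaded.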
We can also embed plots, for example a scatterplot matrix of all columns:
plot(wine)
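With 14 columns, the full scatterplot matrix is dense and hard to read. One option is to plot only a few columns and color the points by class. The following is a sketch under stated assumptions: toy is a hypothetical data frame with made-up values, and the column names Class, Alcohol and Hue are illustrative choices for this example.

```r
# Hypothetical toy data standing in for three columns of the wine data.
toy <- data.frame(Class = c(1, 1, 2, 2, 3, 3),
                  Alcohol = c(14.2, 13.2, 12.4, 12.3, 12.9, 13.2),
                  Hue = c(1.04, 1.05, 1.05, 0.98, 0.60, 0.57))

# Columns to show in the scatterplot matrix.
plotCols <- c("Alcohol", "Hue")

# Scatterplot matrix of the selected columns, one color per class value.
pairs(toy[, plotCols], col = toy$Class, pch = 19)
```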
# The first column is the class label (1 to 3); the 13 chemical attributes follow it.
dfnames <- c("Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline")
names(wine) <- dfnames # Assign names to the data frame columns.
head(wine)
## Class Alcohol Malic acid Ash Alcalinity of ash Magnesium Total phenols
## 1 1 14.23 1.71 2.43 15.6 127 2.80
## 2 1 13.20 1.78 2.14 11.2 100 2.65
## 3 1 13.16 2.36 2.67 18.6 101 2.80
## 4 1 14.37 1.95 2.50 16.8 113 3.85
## 5 1 13.24 2.59 2.87 21.0 118 2.80
## 6 1 14.20 1.76 2.45 15.2 112 3.27
## Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
## 1 3.06 0.28 2.29 5.64 1.04
## 2 2.76 0.26 1.28 4.38 1.05
## 3 3.24 0.30 2.81 5.68 1.03
## 4 3.49 0.24 2.18 7.80 0.86
## 5 2.69 0.39 1.82 4.32 1.04
## 6 3.39 0.34 1.97 6.75 1.05
## OD280/OD315 of diluted wines Proline
## 1 3.92 1065
## 2 3.40 1050
## 3 3.17 1185
## 4 3.45 1480
## 5 2.93 735
## 6 2.85 1450
summary(wine$Alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.03 12.36 13.05 13.00 13.68 14.83
summary(wine$Flavanoids)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.340 1.205 2.135 2.029 2.875 5.080
subwine <- subset(wine, Class == 3) # Create a subset of the dataset: wines of class 3 only.
summary(subwine)
## Class Alcohol Malic acid Ash
## Min. :3 Min. :12.20 Min. :1.240 Min. :2.100
## 1st Qu.:3 1st Qu.:12.80 1st Qu.:2.587 1st Qu.:2.300
## Median :3 Median :13.16 Median :3.265 Median :2.380
## Mean :3 Mean :13.15 Mean :3.334 Mean :2.437
## 3rd Qu.:3 3rd Qu.:13.51 3rd Qu.:3.958 3rd Qu.:2.603
## Max. :3 Max. :14.34 Max. :5.650 Max. :2.860
## Alcalinity of ash Magnesium Total phenols Flavanoids
## Min. :17.50 Min. : 80.00 Min. :0.980 Min. :0.3400
## 1st Qu.:20.00 1st Qu.: 89.75 1st Qu.:1.407 1st Qu.:0.5800
## Median :21.00 Median : 97.00 Median :1.635 Median :0.6850
## Mean :21.42 Mean : 99.31 Mean :1.679 Mean :0.7815
## 3rd Qu.:23.00 3rd Qu.:106.00 3rd Qu.:1.808 3rd Qu.:0.9200
## Max. :27.00 Max. :123.00 Max. :2.800 Max. :1.5700
## Nonflavanoid phenols Proanthocyanins Color intensity
## Min. :0.1700 Min. :0.550 Min. : 3.850
## 1st Qu.:0.3975 1st Qu.:0.855 1st Qu.: 5.438
## Median :0.4700 Median :1.105 Median : 7.550
## Mean :0.4475 Mean :1.154 Mean : 7.396
## 3rd Qu.:0.5300 3rd Qu.:1.350 3rd Qu.: 9.225
## Max. :0.6300 Max. :2.700 Max. :13.000
## Hue OD280/OD315 of diluted wines Proline
## Min. :0.4800 Min. :1.270 Min. :415.0
## 1st Qu.:0.5875 1st Qu.:1.510 1st Qu.:545.0
## Median :0.6650 Median :1.660 Median :627.5
## Mean :0.6827 Mean :1.684 Mean :629.9
## 3rd Qu.:0.7525 3rd Qu.:1.820 3rd Qu.:695.0
## Max. :0.9600 Max. :2.470 Max. :880.0
dim(subwine) #dim of the dataset subset
## [1] 48 14
dim(wine) #dim of the initial dataset
## [1] 178 14
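The task statement also asks for a subset of the columns, while the subset above keeps all 14 (the second value reported by dim(subwine)). Rows and columns can be restricted in one step with the select argument of subset(). A minimal sketch on a hypothetical toy data frame; the column names and values are illustrative, assuming the first column holds the class label as the data dictionary describes.

```r
# Hypothetical toy data frame with a class label and three measurements.
toy <- data.frame(Class = c(1, 2, 3, 3),
                  Alcohol = c(14.2, 12.4, 12.9, 13.2),
                  Ash = c(2.43, 1.36, 2.32, 2.64),
                  Hue = c(1.04, 1.05, 0.85, 0.60))

# Keep only the rows of class 3 and only the Class, Alcohol and Hue columns.
toySub <- subset(toy, Class == 3, select = c(Class, Alcohol, Hue))

# Equivalent base-R indexing: rows by logical condition, columns by name.
toySub2 <- toy[toy$Class == 3, c("Class", "Alcohol", "Hue")]

# Rows and columns remaining after the subset.
dim(toySub)
```

subset() is convenient for interactive work. The bracket form is the safer choice inside functions, because it does not rely on non-standard evaluation of the column names.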