Task: choose one dataset, then study the data and its accompanying description (the "data dictionary"). Take the data and create an R data frame containing a subset of the columns (and, if you like, the rows) of the dataset. The deliverable is the R code that performs these transformations.

Dataset: Wine recognition data, from https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data. Updated Sept 21, 1998 by C. Blake: added attribute information. Source: Forina, M. et al., PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Retrieve and load data into a data.frame

dataUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine <- read.csv(dataUrl, strip.white = TRUE, header = FALSE)
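
Because wine.data has no header row, read.csv() with header = FALSE assigns the default names V1 to V14. A quick check of the parsed shape and column types can catch a wrong separator or a stray header early. The sketch below is only illustrative: to stay runnable offline it parses three rows copied from the head(wine) output in this document, and `sampleText` and `wineSample` are names introduced here. The same checks should apply to the full `wine` data frame.

```r
# Three rows copied from the head(wine) output in this document,
# so the sketch runs without downloading the file.
sampleText <- "1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185"

wineSample <- read.csv(text = sampleText, strip.white = TRUE, header = FALSE)

dim(wineSample)                        # rows and columns actually parsed
names(wineSample)                      # default names V1 ... V14
all(sapply(wineSample, is.numeric))    # every column should be numeric
```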

We can have a look at the first few entries (rows) of our data with the command

head(wine)
##   V1    V2   V3   V4   V5  V6   V7   V8   V9  V10  V11  V12  V13  V14
## 1  1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2  1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3  1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4  1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
## 5  1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93  735
## 6  1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450

A summary of the dataset

summary(wine)
##        V1              V2              V3              V4       
##  Min.   :1.000   Min.   :11.03   Min.   :0.740   Min.   :1.360  
##  1st Qu.:1.000   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210  
##  Median :2.000   Median :13.05   Median :1.865   Median :2.360  
##  Mean   :1.938   Mean   :13.00   Mean   :2.336   Mean   :2.367  
##  3rd Qu.:3.000   3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558  
##  Max.   :3.000   Max.   :14.83   Max.   :5.800   Max.   :3.230  
##        V5              V6               V7              V8       
##  Min.   :10.60   Min.   : 70.00   Min.   :0.980   Min.   :0.340  
##  1st Qu.:17.20   1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205  
##  Median :19.50   Median : 98.00   Median :2.355   Median :2.135  
##  Mean   :19.49   Mean   : 99.74   Mean   :2.295   Mean   :2.029  
##  3rd Qu.:21.50   3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875  
##  Max.   :30.00   Max.   :162.00   Max.   :3.880   Max.   :5.080  
##        V9              V10             V11              V12        
##  Min.   :0.1300   Min.   :0.410   Min.   : 1.280   Min.   :0.4800  
##  1st Qu.:0.2700   1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825  
##  Median :0.3400   Median :1.555   Median : 4.690   Median :0.9650  
##  Mean   :0.3619   Mean   :1.591   Mean   : 5.058   Mean   :0.9574  
##  3rd Qu.:0.4375   3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200  
##  Max.   :0.6600   Max.   :3.580   Max.   :13.000   Max.   :1.7100  
##       V13             V14        
##  Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.938   1st Qu.: 500.5  
##  Median :2.780   Median : 673.5  
##  Mean   :2.612   Mean   : 746.9  
##  3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :4.000   Max.   :1680.0
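
One thing the summary makes visible: V1 takes only the values 1 to 3. According to the dataset's attribute information it is the class identifier (the cultivar), so its mean of 1.938 is not a meaningful quantity. One possible way to handle this, sketched here on the V1 values of the six rows printed by head(wine), is to treat the column as a factor and count rows per class. The names `cls` and `classFactor` are illustrative only.

```r
# V1 values of the six rows printed by head(wine) in this document.
cls <- c(1, 1, 1, 1, 1, 1)

# Declare all three class levels so that absent classes still appear with count 0.
classFactor <- factor(cls, levels = 1:3)

table(classFactor)      # counts per class instead of a numeric mean
summary(classFactor)    # for a factor, summary() also reports counts
```

On the full data the same idea would read `factor(wine$V1, levels = 1:3)`.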

We can also embed plots. For example, calling plot() on a data frame draws a scatterplot matrix of every pair of columns:

plot(wine)
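
With 14 columns, the full scatterplot matrix has 14 by 14 panels and can be hard to read. A lighter alternative is to plot only a few columns and colour the points by class. The sketch below assumes V1 is the class label and uses the six rows printed by head(wine); the choice of columns in `plotCols` is arbitrary and only for illustration.

```r
# Six rows copied from the head(wine) output in this document.
sampleText <- "1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
1,14.20,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450"
wineSample <- read.csv(text = sampleText, header = FALSE)

# Columns to show; an arbitrary illustrative selection.
plotCols <- c("V2", "V3", "V8")

# pairs() is what plot() dispatches to for a data frame.
# col = V1 maps each class (1, 2, 3) to a colour of the current palette.
pairs(wineSample[, plotCols], col = wineSample$V1, pch = 19)
```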

Next we assign descriptive column names taken from the data dictionary. The file has 14 columns: the first is the class label (cultivar 1 to 3), followed by the 13 measurements.

dfnames <- c("Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline")
names(wine) <- dfnames # Assign names to all 14 columns of the data frame.
head(wine)
##   Class Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
## 1     1   14.23       1.71 2.43              15.6       127          2.80
## 2     1   13.20       1.78 2.14              11.2       100          2.65
## 3     1   13.16       2.36 2.67              18.6       101          2.80
## 4     1   14.37       1.95 2.50              16.8       113          3.85
## 5     1   13.24       2.59 2.87              21.0       118          2.80
## 6     1   14.20       1.76 2.45              15.2       112          3.27
##   Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
## 1       3.06                 0.28            2.29            5.64 1.04
## 2       2.76                 0.26            1.28            4.38 1.05
## 3       3.24                 0.30            2.81            5.68 1.03
## 4       3.49                 0.24            2.18            7.80 0.86
## 5       2.69                 0.39            1.82            4.32 1.04
## 6       3.39                 0.34            1.97            6.75 1.05
##   OD280/OD315 of diluted wines Proline
## 1                         3.92    1065
## 2                         3.40    1050
## 3                         3.17    1185
## 4                         3.45    1480
## 5                         2.93     735
## 6                         2.85    1450
summary(wine$Alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.03   12.36   13.05   13.00   13.68   14.83
summary(wine$Flavanoids)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.340   1.205   2.135   2.029   2.875   5.080
subwine <- subset(wine, Class > 2) # Create a subset of the dataset: only the wines of class 3.
summary(subwine)
##      Class      Alcohol        Malic acid         Ash
##  Min.   :3   Min.   :12.20   Min.   :1.240   Min.   :2.100
##  1st Qu.:3   1st Qu.:12.80   1st Qu.:2.587   1st Qu.:2.300
##  Median :3   Median :13.16   Median :3.265   Median :2.380
##  Mean   :3   Mean   :13.15   Mean   :3.334   Mean   :2.437
##  3rd Qu.:3   3rd Qu.:13.51   3rd Qu.:3.958   3rd Qu.:2.603
##  Max.   :3   Max.   :14.34   Max.   :5.650   Max.   :2.860
##  Alcalinity of ash   Magnesium      Total phenols     Flavanoids
##  Min.   :17.50     Min.   : 80.00   Min.   :0.980   Min.   :0.3400
##  1st Qu.:20.00     1st Qu.: 89.75   1st Qu.:1.407   1st Qu.:0.5800
##  Median :21.00     Median : 97.00   Median :1.635   Median :0.6850
##  Mean   :21.42     Mean   : 99.31   Mean   :1.679   Mean   :0.7815
##  3rd Qu.:23.00     3rd Qu.:106.00   3rd Qu.:1.808   3rd Qu.:0.9200
##  Max.   :27.00     Max.   :123.00   Max.   :2.800   Max.   :1.5700
##  Nonflavanoid phenols Proanthocyanins Color intensity       Hue
##  Min.   :0.1700       Min.   :0.550   Min.   : 3.850   Min.   :0.4800
##  1st Qu.:0.3975       1st Qu.:0.855   1st Qu.: 5.438   1st Qu.:0.5875
##  Median :0.4700       Median :1.105   Median : 7.550   Median :0.6650
##  Mean   :0.4475       Mean   :1.154   Mean   : 7.396   Mean   :0.6827
##  3rd Qu.:0.5300       3rd Qu.:1.350   3rd Qu.: 9.225   3rd Qu.:0.7525
##  Max.   :0.6300       Max.   :2.700   Max.   :13.000   Max.   :0.9600
##  OD280/OD315 of diluted wines    Proline
##  Min.   :1.270                Min.   :415.0
##  1st Qu.:1.510                1st Qu.:545.0
##  Median :1.660                Median :627.5
##  Mean   :1.684                Mean   :629.9
##  3rd Qu.:1.820                3rd Qu.:695.0
##  Max.   :2.470                Max.   :880.0
dim(subwine) #dim of the dataset subset
## [1] 48 14
dim(wine) #dim of the initial dataset
## [1] 178  14
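
The task also asks for a subset of the columns, which subset() supports through its select argument. The sketch below is a minimal illustration, assuming the 14 column names follow the dataset's attribute list with the class label first. It uses the six rows printed by the first head(wine) call so that it runs offline; `sampleText`, `wineSample`, `colNames` and `smallWine` are names introduced here, and the threshold of 14 is arbitrary.

```r
# Six rows copied from the first head(wine) output in this document.
sampleText <- "1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
1,14.20,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450"
wineSample <- read.csv(text = sampleText, header = FALSE)

# Assumed names: class label first, then the 13 measurements.
colNames <- c("Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
              "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols",
              "Proanthocyanins", "Color intensity", "Hue",
              "OD280/OD315 of diluted wines", "Proline")
names(wineSample) <- colNames

# Keep rows whose Alcohol value exceeds 14, and only three of the columns.
# Names containing spaces must be wrapped in backticks.
smallWine <- subset(wineSample, Alcohol > 14,
                    select = c(Class, Alcohol, `Malic acid`))
smallWine
dim(smallWine)
```

Filtering rows and selecting columns in a single subset() call keeps the transformation in one place; the same result could be obtained with bracket indexing if preferred.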