Reading Data

For this assignment, I chose the wine data set from the UCI Machine Learning Repository and uploaded it to my GitHub account. The data is hosted there, and the following code reads it into R:

library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/Logan213/DATA607_Week2/master/wine.data.txt")
wine <- read.csv(text = x)
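One thing to keep in mind: by default, read.csv treats the first line of its input as column names. The sketch below illustrates the difference using two inline lines in the same comma-separated layout as the wine file; the names sample_text, with_default, and with_all_rows are made up for this illustration, and it is not a replacement for the call above.

```r
# Two lines in the same comma-separated layout as the wine file (no header row).
sample_text <- "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050"

# Default (header = TRUE): the first line is used as column names,
# so only one data row remains.
with_default <- read.csv(text = sample_text)

# header = FALSE keeps every line as data and names the columns V1, V2, ...
with_all_rows <- read.csv(text = sample_text, header = FALSE)

nrow(with_default)
nrow(with_all_rows)
names(with_all_rows)[1:3]
```

If every line of a file is an observation, header = FALSE is probably the safer choice, with column names assigned afterwards.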

Creating a Data Frame

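By using the data.frame function, we can store the data as a data frame in R, which makes it a little easier to manipulate (read.csv already returns a data frame, so this step mainly gives the object a clearer name):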

wine_df <- data.frame(wine)
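As a quick check, is.data.frame reports whether an object is already a data frame. The sketch below uses a small inline sample instead of the downloaded file; sample_wine and sample_df are hypothetical names used only here.

```r
# Hypothetical two-row stand-in for the object returned by read.csv.
sample_wine <- read.csv(text = "1,13.2,1.78\n1,13.16,2.36", header = FALSE)

# read.csv already produces a data frame.
is.data.frame(sample_wine)

# Wrapping it in data.frame() keeps the same rows and columns.
sample_df <- data.frame(sample_wine)
dim(sample_df)
```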

Renaming Columns

Now that we have our wine_df object, we can take a look at the column names and the first few rows of the data by using the head function:

head(wine_df)
##   X1 X14.23 X1.71 X2.43 X15.6 X127 X2.8 X3.06 X.28 X2.29 X5.64 X1.04 X3.92
## 1  1  13.20  1.78  2.14  11.2  100 2.65  2.76 0.26  1.28  4.38  1.05  3.40
## 2  1  13.16  2.36  2.67  18.6  101 2.80  3.24 0.30  2.81  5.68  1.03  3.17
## 3  1  14.37  1.95  2.50  16.8  113 3.85  3.49 0.24  2.18  7.80  0.86  3.45
## 4  1  13.24  2.59  2.87  21.0  118 2.80  2.69 0.39  1.82  4.32  1.04  2.93
## 5  1  14.20  1.76  2.45  15.2  112 3.27  3.39 0.34  1.97  6.75  1.05  2.85
## 6  1  14.39  1.87  2.45  14.6   96 2.50  2.52 0.30  1.98  5.25  1.02  3.58
##   X1065
## 1  1050
## 2  1185
## 3  1480
## 4   735
## 5  1450
## 6  1290

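The column headers are not very descriptive: the file has no header row, so read.csv used the first line of data as the column names (which also means that observation is missing from the rows). The data set's source has a description of each variable, so we can rename the columns by creating a named vector that maps each old name to a new one: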

wine_cols <- c("X1"="Class ID", "X14.23"="Alcohol", "X1.71"="Malic Acid", "X2.43"="Ash", "X15.6"="Alcalinity", "X127"="Magnesium", "X2.8"="Total Phenols", "X3.06"="Flavanoids", "X.28"="Nonflavanoid Phenols", "X2.29"="Proanthocyanins", "X5.64"="Color Intensity", "X1.04"="Hue", "X3.92"="OD280/OD315", "X1065"="Proline")

Then we apply this mapping to the data frame using the rename function from the plyr package:

library(plyr)
wine_df <- rename(wine_df, wine_cols)
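For comparison, the same old-name to new-name mapping can be sketched in base R without plyr. This is only an illustration on a small made-up data frame; sample_df, name_map, and positions are hypothetical names.

```r
# Small stand-in data frame with two of the original column names.
sample_df <- data.frame(X1 = c(1, 1), X14.23 = c(13.20, 13.16))

# Named vector: names are the old column names, values are the new ones.
name_map <- c("X1" = "Class ID", "X14.23" = "Alcohol")

# Find where each old name sits, then overwrite those positions.
positions <- match(names(name_map), names(sample_df))
names(sample_df)[positions] <- unname(name_map)

names(sample_df)
```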

We can then see our wine_df object with new column headers:

head(wine_df)
##   Class ID Alcohol Malic Acid  Ash Alcalinity Magnesium Total Phenols
## 1        1   13.20       1.78 2.14       11.2       100          2.65
## 2        1   13.16       2.36 2.67       18.6       101          2.80
## 3        1   14.37       1.95 2.50       16.8       113          3.85
## 4        1   13.24       2.59 2.87       21.0       118          2.80
## 5        1   14.20       1.76 2.45       15.2       112          3.27
## 6        1   14.39       1.87 2.45       14.6        96          2.50
##   Flavanoids Nonflavanoid Phenols Proanthocyanins Color Intensity  Hue
## 1       2.76                 0.26            1.28            4.38 1.05
## 2       3.24                 0.30            2.81            5.68 1.03
## 3       3.49                 0.24            2.18            7.80 0.86
## 4       2.69                 0.39            1.82            4.32 1.04
## 5       3.39                 0.34            1.97            6.75 1.05
## 6       2.52                 0.30            1.98            5.25 1.02
##   OD280/OD315 Proline
## 1        3.40    1050
## 2        3.17    1185
## 3        3.45    1480
## 4        2.93     735
## 5        2.85    1450
## 6        3.58    1290
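Several of the new names contain spaces, so they are not syntactic R names and need backticks or quotes when referenced. A minimal sketch on a made-up two-row data frame (sample_df and high_acid are hypothetical names):

```r
# check.names = FALSE keeps the space in "Malic Acid" instead of converting it.
sample_df <- data.frame(`Malic Acid` = c(1.78, 2.36),
                        Alcohol = c(13.20, 13.16),
                        check.names = FALSE)

# Backticks with $, or a quoted string with [[ ]].
sample_df$`Malic Acid`
sample_df[["Malic Acid"]]

# Backticks are also needed inside subset().
high_acid <- subset(sample_df, `Malic Acid` > 2)
high_acid
```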

Subsetting Columns & Rows

Subsetting Columns

I’m not interested in some of the variables, so I can select just the ones I want using the following:

wine_sub <- wine_df[, c(2:7, 11:12)]
head(wine_sub)
##   Alcohol Malic Acid  Ash Alcalinity Magnesium Total Phenols
## 1   13.20       1.78 2.14       11.2       100          2.65
## 2   13.16       2.36 2.67       18.6       101          2.80
## 3   14.37       1.95 2.50       16.8       113          3.85
## 4   13.24       2.59 2.87       21.0       118          2.80
## 5   14.20       1.76 2.45       15.2       112          3.27
## 6   14.39       1.87 2.45       14.6        96          2.50
##   Color Intensity  Hue
## 1            4.38 1.05
## 2            5.68 1.03
## 3            7.80 0.86
## 4            4.32 1.04
## 5            6.75 1.05
## 6            5.25 1.02

This creates a new data frame called wine_sub made up of all rows (note there is nothing before the comma in the brackets in the code above) and only columns 2 through 7, 11, and 12.
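Columns can also be selected by name, which may be easier to read than positions and does not depend on column order. A small sketch on a made-up data frame (sample_df, by_position, and by_name are hypothetical names):

```r
# Four-column stand-in for the renamed wine data.
sample_df <- data.frame(`Class ID` = c(1, 1),
                        Alcohol = c(13.20, 13.16),
                        `Malic Acid` = c(1.78, 2.36),
                        Hue = c(1.05, 1.03),
                        check.names = FALSE)

# Selecting by position and by name picks the same two columns here.
by_position <- sample_df[, c(2, 4)]
by_name <- sample_df[, c("Alcohol", "Hue")]

names(by_name)
```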

Subsetting Rows

Lastly, we could have specified rows in the brackets above, but I want to keep only the rows that meet a certain condition:

wine_sub.2 <- subset(wine_sub, Alcohol > 12)
head(wine_sub.2)
##   Alcohol Malic Acid  Ash Alcalinity Magnesium Total Phenols
## 1   13.20       1.78 2.14       11.2       100          2.65
## 2   13.16       2.36 2.67       18.6       101          2.80
## 3   14.37       1.95 2.50       16.8       113          3.85
## 4   13.24       2.59 2.87       21.0       118          2.80
## 5   14.20       1.76 2.45       15.2       112          3.27
## 6   14.39       1.87 2.45       14.6        96          2.50
##   Color Intensity  Hue
## 1            4.38 1.05
## 2            5.68 1.03
## 3            7.80 0.86
## 4            4.32 1.04
## 5            6.75 1.05
## 6            5.25 1.02

This gives us a new data frame with the columns indicated above, but only the rows where the Alcohol value is greater than 12.
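The same kind of filter can be written with logical indexing in brackets, and conditions can be combined with &. The sketch below uses a made-up three-row data frame (the 11.80 value is invented so that one row fails the condition); all object names are hypothetical.

```r
# Made-up sample: the second row has Alcohol below 12.
sample_df <- data.frame(Alcohol = c(13.20, 11.80, 14.37),
                        Hue = c(1.05, 1.03, 0.86))

# subset() and bracket indexing keep the same rows when there are no NAs.
with_subset <- subset(sample_df, Alcohol > 12)
with_brackets <- sample_df[sample_df$Alcohol > 12, ]

# Two conditions joined with & must both hold for a row to be kept.
both_conditions <- subset(sample_df, Alcohol > 12 & Hue > 1)

nrow(with_subset)
nrow(both_conditions)
```

One difference to be aware of: when the condition evaluates to NA, subset() drops the row, while bracket indexing returns a row of NA values.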