For this assignment, I have chosen the wine data set from the UCI Machine Learning Repository and uploaded it to my GitHub account. I will host the data from there and use the following code to read it into R:
library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/Logan213/DATA607_Week2/master/wine.data.txt")
wine <- read.csv(text = x)
Although read.csv already returns a data frame, passing the result to the data.frame function gives us a separate wine_df object to manipulate:
wine_df <- data.frame(wine)
Now that we have our wine_df object, we can take a look at the column names and the first few rows of the data by using head:
head(wine_df)
## X1 X14.23 X1.71 X2.43 X15.6 X127 X2.8 X3.06 X.28 X2.29 X5.64 X1.04 X3.92
## 1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40
## 2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17
## 3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45
## 4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93
## 5 1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85
## 6 1 14.39 1.87 2.45 14.6 96 2.50 2.52 0.30 1.98 5.25 1.02 3.58
## X1065
## 1 1050
## 2 1185
## 3 1480
## 4 735
## 5 1450
## 6 1290
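The X-prefixed column names above look like data values, which suggests that read.csv treated the first record of the file as a header (its default is header = TRUE). A minimal sketch of the difference, assuming the file has no header row; the two sample lines are reconstructed from the output above, and sample_lines, with_header, and without_header are illustrative names:

```r
# Two records in the same comma-separated layout as the wine file
sample_lines <- c(
  "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065",
  "1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050"
)
sample_text <- paste(sample_lines, collapse = "\n")

# Default header = TRUE: the first record is consumed as column names
with_header <- read.csv(text = sample_text)

# header = FALSE: both records are kept and columns are named V1, V2, ...
without_header <- read.csv(text = sample_text, header = FALSE)

nrow(with_header)
nrow(without_header)
names(without_header)
```

The rest of this walkthrough keeps the original read.csv call, so the output shown stays consistent with the code.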
The column headers are not very descriptive, but the data set's source page has a description of each variable in the wine data set. We can rename the columns by creating a named vector that maps each old name to its new name:
wine_cols <- c("X1"="Class ID", "X14.23"="Alcohol", "X1.71"="Malic Acid", "X2.43"="Ash", "X15.6"="Alcalinity", "X127"="Magnesium", "X2.8"="Total Phenols", "X3.06"="Flavanoids", "X.28"="Nonflavanoid Phenols", "X2.29"="Proanthocyanins", "X5.64"="Color Intensity", "X1.04"="Hue", "X3.92"="OD280/OD315", "X1065"="Proline")
Then we apply this mapping to the data frame using the rename function from the plyr package:
library(plyr)
wine_df <- rename(wine_df, wine_cols)
We can then see our wine_df object with the new column headers:
head(wine_df)
## Class ID Alcohol Malic Acid Ash Alcalinity Magnesium Total Phenols
## 1 1 13.20 1.78 2.14 11.2 100 2.65
## 2 1 13.16 2.36 2.67 18.6 101 2.80
## 3 1 14.37 1.95 2.50 16.8 113 3.85
## 4 1 13.24 2.59 2.87 21.0 118 2.80
## 5 1 14.20 1.76 2.45 15.2 112 3.27
## 6 1 14.39 1.87 2.45 14.6 96 2.50
## Flavanoids Nonflavanoid Phenols Proanthocyanins Color Intensity Hue
## 1 2.76 0.26 1.28 4.38 1.05
## 2 3.24 0.30 2.81 5.68 1.03
## 3 3.49 0.24 2.18 7.80 0.86
## 4 2.69 0.39 1.82 4.32 1.04
## 5 3.39 0.34 1.97 6.75 1.05
## 6 2.52 0.30 1.98 5.25 1.02
## OD280/OD315 Proline
## 1 3.40 1050
## 2 3.17 1185
## 3 3.45 1480
## 4 2.93 735
## 5 2.85 1450
## 6 3.58 1290
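Several of the new names contain spaces or a slash, so they are not syntactic R names. Such columns can still be used, but they need backticks with $ or a quoted string with [[ ]]. A small sketch, using a hypothetical two-row data frame (demo_df) built from values shown above:

```r
# check.names = FALSE keeps names with spaces exactly as written
demo_df <- data.frame(
  Alcohol = c(13.20, 13.16),
  `Malic Acid` = c(1.78, 2.36),
  check.names = FALSE
)

# Backticks around a name that contains a space
demo_df$`Malic Acid`

# Equivalent access with a quoted string
demo_df[["Malic Acid"]]
```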
I am only interested in some of the variables, so we can select just those columns using the following:
wine_sub <- wine_df[, c(2:7, 11:12)]
head(wine_sub)
## Alcohol Malic Acid Ash Alcalinity Magnesium Total Phenols
## 1 13.20 1.78 2.14 11.2 100 2.65
## 2 13.16 2.36 2.67 18.6 101 2.80
## 3 14.37 1.95 2.50 16.8 113 3.85
## 4 13.24 2.59 2.87 21.0 118 2.80
## 5 14.20 1.76 2.45 15.2 112 3.27
## 6 14.39 1.87 2.45 14.6 96 2.50
## Color Intensity Hue
## 1 4.38 1.05
## 2 5.68 1.03
## 3 7.80 0.86
## 4 4.32 1.04
## 5 6.75 1.05
## 6 5.25 1.02
This creates a new data frame called wine_sub made up of all rows (note that nothing precedes the comma inside the brackets in the code above) and only columns 2 through 7, 11, and 12.
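Selecting by position works, but it breaks silently if the column order ever changes. As an alternative sketch, columns can also be selected by name; demo_df and demo_sub below are hypothetical objects that use a few of the values shown above:

```r
demo_df <- data.frame(
  `Class ID` = c(1, 1),
  Alcohol = c(13.20, 13.16),
  Hue = c(1.05, 1.03),
  Proline = c(1050, 1185),
  check.names = FALSE
)

# Empty row slot keeps every row; the character vector picks columns by name
demo_sub <- demo_df[, c("Alcohol", "Hue")]
names(demo_sub)
```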
Lastly, we could have specified row numbers in the brackets above, but here I want to keep only the rows that meet a condition, which the subset function handles:
wine_sub.2 <- subset(wine_sub, Alcohol > 12)
head(wine_sub.2)
## Alcohol Malic Acid Ash Alcalinity Magnesium Total Phenols
## 1 13.20 1.78 2.14 11.2 100 2.65
## 2 13.16 2.36 2.67 18.6 101 2.80
## 3 14.37 1.95 2.50 16.8 113 3.85
## 4 13.24 2.59 2.87 21.0 118 2.80
## 5 14.20 1.76 2.45 15.2 112 3.27
## 6 14.39 1.87 2.45 14.6 96 2.50
## Color Intensity Hue
## 1 4.38 1.05
## 2 5.68 1.03
## 3 7.80 0.86
## 4 4.32 1.04
## 5 6.75 1.05
## 6 5.25 1.02
This gives us a new data frame with the columns indicated above, but only rows where the Alcohol value is over 12.
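Because the first six rows all have an Alcohol value above 12, the head output looks the same before and after filtering. A small sketch of how the filter behaves, using the Alcohol and Hue values shown above with a threshold of 14 chosen purely for illustration (demo_df and the result names are hypothetical):

```r
demo_df <- data.frame(
  Alcohol = c(13.20, 13.16, 14.37, 13.24, 14.20, 14.39),
  Hue = c(1.05, 1.03, 0.86, 1.04, 1.05, 1.02)
)

# subset() evaluates the condition using the data frame's own columns
high_alcohol <- subset(demo_df, Alcohol > 14)

# Bracket form: a logical vector in the row slot selects matching rows
high_alcohol_brackets <- demo_df[demo_df$Alcohol > 14, ]

# Conditions can be combined with & (and) or | (or)
high_alcohol_and_hue <- subset(demo_df, Alcohol > 14 & Hue > 1)

nrow(high_alcohol)
nrow(high_alcohol_and_hue)
```

The two forms agree here, but they differ when the condition evaluates to NA: subset drops those rows, while the bracket form returns them as rows of NA.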