Doing Excel Things in R #1

You can do many things in R that you might usually do in Excel.

Make sure to view each new object you create to follow the changes you’ve made.

First, let’s load the packages we will use and some data.

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(reshape2)
diamonds <- data.frame(diamonds)
head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

View the data frame and note the column names.

Renaming Columns

names(diamonds)[8] <- 'length'
names(diamonds)[9] <- 'width'
names(diamonds)[10] <- 'depth'
names(diamonds)[5] <- 'depthperc'
head(diamonds)

##   carat       cut color clarity depthperc table price length width depth
## 1  0.23     Ideal     E     SI2      61.5    55   326   3.95  3.98  2.43
## 2  0.21   Premium     E     SI1      59.8    61   326   3.89  3.84  2.31
## 3  0.23      Good     E     VS1      56.9    65   327   4.05  4.07  2.31
## 4  0.29   Premium     I     VS2      62.4    58   334   4.20  4.23  2.63
## 5  0.31      Good     J     SI2      63.3    58   335   4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2      62.8    57   336   3.94  3.96  2.48

You can see the specified columns have new names.

Create a New Column that Includes a Function

We are going to create a new column called cubic that includes a function based on the values of other columns.

diamonds <- mutate(diamonds, cubic = length * width * depth)
head(diamonds)

##   carat       cut color clarity depthperc table price length width depth
## 1  0.23     Ideal     E     SI2      61.5    55   326   3.95  3.98  2.43
## 2  0.21   Premium     E     SI1      59.8    61   326   3.89  3.84  2.31
## 3  0.23      Good     E     VS1      56.9    65   327   4.05  4.07  2.31
## 4  0.29   Premium     I     VS2      62.4    58   334   4.20  4.23  2.63
## 5  0.31      Good     J     SI2      63.3    58   335   4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2      62.8    57   336   3.94  3.96  2.48
##      cubic
## 1 38.20203
## 2 34.50586
## 3 38.07688
## 4 46.72458
## 5 51.91725
## 6 38.69395

Check the data to confirm the new column.

Calculating Averages for Each Column

Just as in Excel, we can create averages for data in each column.

colMeans(diamonds[, c(1, 5:11)])

##        carat    depthperc        table        price       length 
##    0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572 
##        width        depth        cubic 
##    5.7345260    3.5387338  129.8494033

Summarizing Data

Let’s summarize the carat column. First, create a new column to clean data by rounding.

diamonds$carat2 <- round(diamonds$carat / 0.25) * 0.25
head(diamonds$carat2)

## [1] 0.25 0.25 0.25 0.25 0.25 0.25

Now let’s summarize the diamonds data frame.

Aggregate the data, take the mean of each numeric field, and group by the non-numeric fields.

Summary <- aggregate(cbind(depthperc, table, price, length, width, depth, cubic) ~cut + color + clarity + carat2, data = diamonds, mean)
head(Summary)

##       cut color clarity carat2 depthperc table  price length width depth
## 1    Good     D      I1   0.25     64.00    54  414.5  4.345 4.345 2.780
## 2   Ideal     D      I1   0.25     61.95    56  471.5  4.495 4.495 2.785
## 3 Premium     E      I1   0.25     60.90    58  394.5  4.400 4.400 2.680
## 4   Ideal     E      I1   0.25     60.50    57  492.5  4.420 4.385 2.665
## 5    Fair     F      I1   0.25     55.80    62 2160.0  4.720 4.600 2.600
## 6 Premium     F      I1   0.25     62.90    59  394.0  4.330 4.290 2.710
##      cubic
## 1 52.48306
## 2 56.26932
## 3 51.88373
## 4 51.66028
## 5 56.45120
## 6 50.34015

Create a Pivot Table

For each color, calculate the average price by clarity.

pivot_table <- dcast(diamonds[, c('color', 'clarity', 'price')], color ~ clarity, mean)

## Using price as value column: use value.var to override.

pivot_table

##   color       I1      SI2      SI1      VS2      VS1     VVS2     VVS1
## 1     D 3863.024 3931.101 2976.146 2587.226 3030.159 3351.128 2947.913
## 2     E 3488.422 4173.826 3161.838 2750.942 2856.294 2499.674 2219.820
## 3     F 3342.182 4472.625 3714.226 3756.795 3796.718 3475.513 2804.277
## 4     G 3545.693 5021.684 3774.787 4416.256 4131.362 3845.283 2866.821
## 5     H 4453.414 6099.895 5032.415 4722.414 3780.689 2649.067 1845.658
## 6     I 4302.185 7002.649 5355.020 5690.506 4633.184 2968.233 2034.862
## 7     J 5254.060 6520.958 5186.048 5311.059 4884.461 5142.397 4034.176
##         IF
## 1 8307.370
## 2 3668.506
## 3 2750.836
## 4 2558.034
## 5 2287.870
## 6 1994.937
## 7 3363.882

Do a VLookup

We will merge the diamonds data frame with the Summary data frame. Both have colummns called price, so rename Summary’s price column.

names(Summary)[7] <- 'avgprice'

Now merge the data sets.

diamonds <- merge(diamonds, Summary[, c(1:4, 7)], by.x = c('cut', 'color', 'clarity', 'carat2'), by.y = c('cut', 'color', 'clarity', 'carat2'))
head(diamonds)

##    cut color clarity carat2 carat depthperc table price length width depth
## 1 Fair     D      I1   1.00  0.91      66.2    57  2491   6.00  5.94  3.95
## 2 Fair     D      I1   1.50  1.50      64.7    62  5460   7.19  7.04  4.60
## 3 Fair     D      I1   1.75  1.70      64.7    56  5617   7.46  7.37  4.80
## 4 Fair     D      I1   3.50  3.40      66.8    52 15964   9.42  9.34  6.27
## 5 Fair     D      IF   0.25  0.37      61.2    57  1440   4.68  4.73  2.88
## 6 Fair     D      IF   0.25  0.30      60.5    57  1208   4.47  4.35  2.67
##       cubic avgprice
## 1 140.77800     2491
## 2 232.84096     5460
## 3 263.90496     5617
## 4 551.65216    15964
## 5  63.75283     1324
## 6  51.91681     1324

Check the new diamonds data frame to see the results.

Create Something Similar to an IF Statement

This will create a new column called size to categorize by weight.

diamonds$size[diamonds$carat < 0.5] <- 'Small'
diamonds$size[diamonds$carat >= 0.5 & diamonds$carat < 1] <- 'Medium'
diamonds$size[diamonds$carat >= 1] <- 'Large'
head(diamonds)

##    cut color clarity carat2 carat depthperc table price length width depth
## 1 Fair     D      I1   1.00  0.91      66.2    57  2491   6.00  5.94  3.95
## 2 Fair     D      I1   1.50  1.50      64.7    62  5460   7.19  7.04  4.60
## 3 Fair     D      I1   1.75  1.70      64.7    56  5617   7.46  7.37  4.80
## 4 Fair     D      I1   3.50  3.40      66.8    52 15964   9.42  9.34  6.27
## 5 Fair     D      IF   0.25  0.37      61.2    57  1440   4.68  4.73  2.88
## 6 Fair     D      IF   0.25  0.30      60.5    57  1208   4.47  4.35  2.67
##       cubic avgprice   size
## 1 140.77800     2491 Medium
## 2 232.84096     5460  Large
## 3 263.90496     5617  Large
## 4 551.65216    15964  Large
## 5  63.75283     1324  Small
## 6  51.91681     1324  Small

Again, check the data frame and examine the results to make sure they are as expected.

Enough data. Let’s consider some basic graphics.

Create a Bar Chart

Bar charts and histograms are almost cliches in Excel. They are just as easy to create in R.

First, a bar chart.

barplot(table(diamonds$size), main = 'Distribution by Size', xlab = 'Size', ylab = 'Number', col = 'red', ylim = c(0, 20000))

Now a histogram. For a histogram, we have to use numeric values, so it would not work with categories such as “Large”, “Medium”, and “Small”.

hist(diamonds$carat, main = 'Distribution by Carat', xlab = 'Carat', ylab = 'Frequency', col = 'red', xlim = c(0, 4), ylim = c(0, 20000), breaks = 7)

Create a Scatterplot

Scatterplots are known in Excel as X-Y charts. They are easy to produce in R. For this one, we need to use the ggplot2 library. This creates a prettier graph than the basic R graphics packages.

This first chart is classified by the diamonds’ color.

ggplot(diamonds, aes(carat, price)) + geom_point(aes(group = color, color = color)) + labs(x = 'Carat', y = 'Price', title = 'Price by Carat and Color')

Here is the same chart classified by the diamonds’ clarity.

ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() +
        labs(x = 'Weight', y = 'Price', title = 'Price by Carat and Clarity')

Pretty neat, isn’t it?

We can do the same thing using the qplot command.

qplot(carat, price, data = diamonds, color = clarity, xlab = 'Weight', ylab = 'Price', main = 'Price by Carat and Clarity')