You can do many things in R that you might usually do in Excel.
Make sure to view each new object you create to follow the changes you’ve made.
First, let’s load the packages we will use and some data.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
diamonds <- data.frame(diamonds)
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
View the data frame and note the column names.
names(diamonds)[8] <- 'length'
names(diamonds)[9] <- 'width'
names(diamonds)[10] <- 'depth'
names(diamonds)[5] <- 'depthperc'
head(diamonds)
## carat cut color clarity depthperc table price length width depth
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
You can see the specified columns have new names.
We are going to create a new column called cubic that includes a function based on the values of other columns.
diamonds <- mutate(diamonds, cubic = length * width * depth)
head(diamonds)
## carat cut color clarity depthperc table price length width depth
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## cubic
## 1 38.20203
## 2 34.50586
## 3 38.07688
## 4 46.72458
## 5 51.91725
## 6 38.69395
Check the data to confirm the new column.
Just as in Excel, we can create averages for data in each column.
colMeans(diamonds[, c(1, 5:11)])
## carat depthperc table price length
## 0.7979397 61.7494049 57.4571839 3932.7997219 5.7311572
## width depth cubic
## 5.7345260 3.5387338 129.8494033
Let’s summarize the carat column. First, create a new column to clean data by rounding.
diamonds$carat2 <- round(diamonds$carat / 0.25) * 0.25
head(diamonds$carat2)
## [1] 0.25 0.25 0.25 0.25 0.25 0.25
Now let’s summarize the diamonds data frame.
Aggregate the data, take the mean of each numeric field, and group by the non-numeric fields.
Summary <- aggregate(cbind(depthperc, table, price, length, width, depth, cubic) ~cut + color + clarity + carat2, data = diamonds, mean)
head(Summary)
## cut color clarity carat2 depthperc table price length width depth
## 1 Good D I1 0.25 64.00 54 414.5 4.345 4.345 2.780
## 2 Ideal D I1 0.25 61.95 56 471.5 4.495 4.495 2.785
## 3 Premium E I1 0.25 60.90 58 394.5 4.400 4.400 2.680
## 4 Ideal E I1 0.25 60.50 57 492.5 4.420 4.385 2.665
## 5 Fair F I1 0.25 55.80 62 2160.0 4.720 4.600 2.600
## 6 Premium F I1 0.25 62.90 59 394.0 4.330 4.290 2.710
## cubic
## 1 52.48306
## 2 56.26932
## 3 51.88373
## 4 51.66028
## 5 56.45120
## 6 50.34015
For each color, calculate the average price by clarity.
pivot_table <- dcast(diamonds[, c('color', 'clarity', 'price')], color ~ clarity, mean)
## Using price as value column: use value.var to override.
pivot_table
## color I1 SI2 SI1 VS2 VS1 VVS2 VVS1
## 1 D 3863.024 3931.101 2976.146 2587.226 3030.159 3351.128 2947.913
## 2 E 3488.422 4173.826 3161.838 2750.942 2856.294 2499.674 2219.820
## 3 F 3342.182 4472.625 3714.226 3756.795 3796.718 3475.513 2804.277
## 4 G 3545.693 5021.684 3774.787 4416.256 4131.362 3845.283 2866.821
## 5 H 4453.414 6099.895 5032.415 4722.414 3780.689 2649.067 1845.658
## 6 I 4302.185 7002.649 5355.020 5690.506 4633.184 2968.233 2034.862
## 7 J 5254.060 6520.958 5186.048 5311.059 4884.461 5142.397 4034.176
## IF
## 1 8307.370
## 2 3668.506
## 3 2750.836
## 4 2558.034
## 5 2287.870
## 6 1994.937
## 7 3363.882
We will merge the diamonds data frame with the Summary data frame. Both have colummns called price, so rename Summary’s price column.
names(Summary)[7] <- 'avgprice'
Now merge the data sets.
diamonds <- merge(diamonds, Summary[, c(1:4, 7)], by.x = c('cut', 'color', 'clarity', 'carat2'), by.y = c('cut', 'color', 'clarity', 'carat2'))
head(diamonds)
## cut color clarity carat2 carat depthperc table price length width depth
## 1 Fair D I1 1.00 0.91 66.2 57 2491 6.00 5.94 3.95
## 2 Fair D I1 1.50 1.50 64.7 62 5460 7.19 7.04 4.60
## 3 Fair D I1 1.75 1.70 64.7 56 5617 7.46 7.37 4.80
## 4 Fair D I1 3.50 3.40 66.8 52 15964 9.42 9.34 6.27
## 5 Fair D IF 0.25 0.37 61.2 57 1440 4.68 4.73 2.88
## 6 Fair D IF 0.25 0.30 60.5 57 1208 4.47 4.35 2.67
## cubic avgprice
## 1 140.77800 2491
## 2 232.84096 5460
## 3 263.90496 5617
## 4 551.65216 15964
## 5 63.75283 1324
## 6 51.91681 1324
Check the new diamonds data frame to see the results.
This will create a new column called size to categorize by weight.
diamonds$size[diamonds$carat < 0.5] <- 'Small'
diamonds$size[diamonds$carat >= 0.5 & diamonds$carat < 1] <- 'Medium'
diamonds$size[diamonds$carat >= 1] <- 'Large'
head(diamonds)
## cut color clarity carat2 carat depthperc table price length width depth
## 1 Fair D I1 1.00 0.91 66.2 57 2491 6.00 5.94 3.95
## 2 Fair D I1 1.50 1.50 64.7 62 5460 7.19 7.04 4.60
## 3 Fair D I1 1.75 1.70 64.7 56 5617 7.46 7.37 4.80
## 4 Fair D I1 3.50 3.40 66.8 52 15964 9.42 9.34 6.27
## 5 Fair D IF 0.25 0.37 61.2 57 1440 4.68 4.73 2.88
## 6 Fair D IF 0.25 0.30 60.5 57 1208 4.47 4.35 2.67
## cubic avgprice size
## 1 140.77800 2491 Medium
## 2 232.84096 5460 Large
## 3 263.90496 5617 Large
## 4 551.65216 15964 Large
## 5 63.75283 1324 Small
## 6 51.91681 1324 Small
Again, check the data frame and examine the results to make sure they are as expected.
Enough data. Let’s consider some basic graphics.
Bar charts and histograms are almost cliches in Excel. They are just as easy to create in R.
First, a bar chart.
barplot(table(diamonds$size), main = 'Distribution by Size', xlab = 'Size', ylab = 'Number', col = 'red', ylim = c(0, 20000))
Now a histogram. For a histogram, we have to use numeric values, so it would not work with categories such as “Large”, “Medium”, and “Small”.
hist(diamonds$carat, main = 'Distribution by Carat', xlab = 'Carat', ylab = 'Frequency', col = 'red', xlim = c(0, 4), ylim = c(0, 20000), breaks = 7)
Scatterplots are known in Excel as X-Y charts. They are easy to produce in R. For this one, we need to use the ggplot2 library. This creates a prettier graph than the basic R graphics packages.
This first chart is classified by the diamonds’ color.
ggplot(diamonds, aes(carat, price)) + geom_point(aes(group = color, color = color)) + labs(x = 'Carat', y = 'Price', title = 'Price by Carat and Color')
Here is the same chart classified by the diamonds’ clarity.
ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() +
labs(x = 'Weight', y = 'Price', title = 'Price by Carat and Clarity')
Pretty neat, isn’t it?
We can do the same thing using the qplot command.
qplot(carat, price, data = diamonds, color = clarity, xlab = 'Weight', ylab = 'Price', main = 'Price by Carat and Clarity')