R is used by many companies, including Google, Microsoft, Oracle and SAP. The R Consortium is a group of companies that actively supports further development of the language. More about R can be found here.
Download and install R, RStudio and, if you are on Windows, Rtools.
R is the statistical programming language, RStudio is where you write your programs, and Rtools is a set of build tools for Windows that is needed to install packages from source.
There are over 7000 packages for R on CRAN. Go to the CRAN Task Views for a theme-oriented view of the available packages.
Standard (Base) R is not very easy or fast to work with. Luckily, there are many packages that make life easier.
Here is how you install them (make sure you have an internet connection):
mylibraries = c(
"data.table", # fast and compact data processing
"stringr", # work with character strings
"ggplot2", # comprehensive plotting environment
"zoo", # work with time series
"lubridate", # work with dates
"DescTools", # descriptive analytics
"readxl" # read Excel files
)
# install.packages(mylibraries, dependencies = TRUE)
To run the install command, remove the “#” (comment) sign in front of it. Alternatively, you can install packages in RStudio via a menu: /Tools/Install packages/ (…type the names of the packages…).
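If you rerun your script often, it can be convenient to install only the packages that are still missing. The following is a minimal sketch of that idea, not part of the steps above; the names `wanted` and `missing_pkgs` are made up for illustration, and the install line stays commented out.

```r
# Packages we would like to have (shortened list, for illustration only)
wanted <- c("data.table", "stringr", "ggplot2")

# installed.packages() returns a matrix whose row names are the installed packages
installed <- rownames(installed.packages())

# Keep only the names that are not installed yet
missing_pkgs <- setdiff(wanted, installed)
print(missing_pkgs)

# Uncomment to actually install (needs an internet connection)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs, dependencies = TRUE)
```

If `missing_pkgs` prints `character(0)`, everything in `wanted` is already installed.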
Now that the packages are installed, you have to load them before you can work with them. Here is how you do that:
library(data.table)
library(stringr)
library(ggplot2)
library(zoo)
library(lubridate)
library(DescTools)
Open RStudio. This will also start R.
Open a new script file by clicking the “+” sign under “File” in the upper left corner.
You now have a complete R development environment.
R comes with some datasets. We are going to use mtcars.
In the script window (upper left), type the following:
data(mtcars) # loads mtcars dataset
View(mtcars) # view the data in a separate window
dim(mtcars) # dimensions: rows x columns
str(mtcars) # structure: name and type of variable
names(mtcars) # column (variable) names of the dataset
head(mtcars) # first few records of the dataset
summary(mtcars) # statistical summary
?mtcars # help about a dataset or command
getwd() # which directory are you in?
dir() # what are the files in that directory?
ls() # list the loaded objects in your current R session
# # comment: everything after this sign is ignored by R
1:5 # prints the numbers 1 to 5
sum(1:5) # sums the numbers 1 to 5
c(1,2,3) # c = "concatenate" = make a vector with the numbers 1, 2 and 3
x = c(1,2,3) # assign a name
You can run all these commands at once by clicking “Source” in the upper right corner of the script window. It is also possible to run the commands one by one: put your cursor anywhere on the line you want to run and press Ctrl+Enter.
Note that the View window is very powerful. You can open the View in its own window, filter, search and sort.
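data.table is a fast and compact “mini language” for dealing with data (“data wrangling”) that resembles SQL.
It works like this: every query has the form DT[i, j, by], where i selects rows, j selects or computes columns, and by defines the groups.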
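As a rough illustration of the query form and its SQL counterpart, here is a minimal sketch. The object name `dt` and the column name `avg_hp` are chosen for this example only; `dt` is a copy, so the built-in mtcars data frame stays untouched.

```r
library(data.table)

# Work on a copy; the row names end up in a column called "rn"
dt <- as.data.table(mtcars, keep.rownames = TRUE)

# General form: dt[i, j, by]
#   i  = which rows       (SQL: WHERE)
#   j  = what to compute  (SQL: SELECT)
#   by = grouped by what  (SQL: GROUP BY)
res <- dt[mpg > 20, .(avg_hp = mean(hp)), by = cyl]
print(res)
```

In SQL terms this reads roughly as: SELECT cyl, AVG(hp) AS avg_hp FROM mtcars WHERE mpg > 20 GROUP BY cyl.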
If a (text, csv) file is read with data.table’s fread command, it is already a data.table object. If not, you have to set it manually.
There are 2 cases:
* The object contains named rows. This is the case for the mtcars object. You can see this by the fact that the column with the car names has no header.
* The object contains no named rows.
setDT(mtcars, keep.rownames = TRUE) # set to data.table, and keep rownames, these will be stored in a column called "rn"
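The other situations mentioned above can be sketched as follows: a file read with fread, and a data frame without named rows. This is an illustration under assumptions: the temporary csv file and the names `from_file` and `cars_dt` are invented for the example, and `cars` is another dataset that ships with R.

```r
library(data.table)

# A file read with fread() is a data.table right away
tmp <- tempfile(fileext = ".csv")   # temporary file, just for this demo
fwrite(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)), tmp)
from_file <- fread(tmp)
print(is.data.table(from_file))

# A data frame without named rows: plain setDT() is enough
cars_dt <- copy(cars)               # copy of the built-in "cars" data frame
setDT(cars_dt)
print(is.data.table(cars_dt))

unlink(tmp)                         # clean up the temporary file
```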
The operations that you typically do with rows are filtering and selecting.
mtcars[1,] # get the first row
mtcars[1:3,] # get the first 3 rows
mtcars[rn == "Duster 360",] # get a specific row, method 1
mtcars[rn %in% "Duster 360"] # get a specific row, method 2
mtcars[rn %like% "Merc"] # get rows whose name matches a pattern (regular expression), in this case "Merc"
mtcars[!1,] # exclude the first row
mtcars[!1:30,] # exclude the first 30 rows
mtcars[rn != "Duster 360"] # exclude a specific row
mtcars[!rn %like% "Merc"] # exclude rows whose name matches a pattern
mtcars[mpg %between% c(19, 20)] # find rows with a value between an interval, in this case: mpg between 19 and 20
mtcars[mpg > 30,] # find rows with a value larger than ...
mtcars[mpg > 30 & hp > 100,] # combine search criteria
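A few details of these filters are easy to miss, so here is a small sketch that makes them explicit. The object names (`dt`, `merc`, `mid_mpg`, `thrifty_or_strong`) are made up for the example, and `dt` is a copy of mtcars so the commands above are not affected.

```r
library(data.table)
dt <- as.data.table(mtcars, keep.rownames = TRUE)

# %like% takes a regular expression, so "^Merc" means "starts with Merc"
merc <- dt[rn %like% "^Merc"]

# %between% includes both end points by default
mid_mpg <- dt[mpg %between% c(19, 20)]

# Conditions can be combined with & (and), | (or) and ! (not)
thrifty_or_strong <- dt[mpg > 30 | hp > 250]

print(nrow(merc))
print(mid_mpg$rn)
print(thrifty_or_strong[, .(rn, mpg, hp)])
```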
# mtcars[, .(mpg, cyl)] # select columns, method 1
# mtcars[, c("mpg", "cyl"), with = FALSE] # select columns, method 2
# mtcars[, c(2, 3), with = FALSE] # select columns, method 3 (discouraged: less transparent)
# mtcars[, c("mpg", "cyl") := NULL] # remove columns, method 1 (changes mtcars itself)
# mtcars[, !c("mpg", "cyl"), with = FALSE] # remove columns, method 2 (returns a new table)
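When the column names are stored in a variable, a slightly different notation is needed. The sketch below assumes a reasonably recent data.table version (the ".." prefix is not available in very old releases); `dt`, `cols`, `picked` and `dropped` are names invented for this example.

```r
library(data.table)
dt <- as.data.table(mtcars, keep.rownames = TRUE)

# Column names stored in a variable: the ".." prefix tells data.table
# to look up `cols` outside the table instead of treating it as a column
cols <- c("rn", "mpg", "cyl")
picked <- dt[, ..cols]

# Dropping columns this way returns a new table; dt itself keeps all columns
dropped <- dt[, !c("mpg", "cyl")]

print(names(picked))
print(ncol(dt))
print(ncol(dropped))
```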
mtcars[, hp_wt := hp/wt] # new variable: calculate horsepower per weight (in this case: 1000 lbs)
mtcars[, `:=` ( # calculate several new variables at once
hp_wt = hp/wt,
mpg_wt = mpg/wt
)]
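Note that := changes the table in place instead of returning a modified copy. The following sketch illustrates what that means in practice; `dt`, `backup` and `alias` are hypothetical names for this example.

```r
library(data.table)
dt <- as.data.table(mtcars, keep.rownames = TRUE)

backup <- copy(dt)       # a real, independent copy
alias  <- dt             # NOT a copy: just a second name for the same table

dt[, hp_wt := hp / wt]   # adds the column in place, no assignment needed

print("hp_wt" %in% names(dt))      # the table itself has the new column
print("hp_wt" %in% names(alias))   # so does the alias
print("hp_wt" %in% names(backup))  # the copy does not
```

If you want to keep the original data around, make a copy() before you start adding or removing columns.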
mtcars[rn %like% "Merc", .N] # ".N" is an abbreviation for "count" in data.table
mtcars[, mean(mpg), by = cyl] # calculate average mpg per cylinder
mtcars[, .(avg = mean(mpg)), by = cyl] # ditto, give name to calculated variable
mtcars[rn %like% "Merc", .N, by = cyl] # group a selection by something, in this case count "Merc" cars by cylinder
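Grouping becomes more useful with several summaries at once or with more than one grouping column. A minimal sketch, with `dt`, `per_group`, `by_cyl` and the summary column names chosen for illustration:

```r
library(data.table)
dt <- as.data.table(mtcars, keep.rownames = TRUE)

# Several summaries per group, grouped by two columns.
# keyby works like by, but also sorts the result by the grouping columns.
per_group <- dt[, .(n = .N, avg_mpg = mean(mpg), max_hp = max(hp)),
                keyby = .(cyl, gear)]
print(per_group)

# Chaining: pass one result to the next [ ] to sort by the computed average
by_cyl <- dt[, .(avg_mpg = mean(mpg)), by = cyl][order(-avg_mpg)]
print(by_cyl)
```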
# all together now:
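mtcars[rn %like% "Merc" & gear >= 3 & cyl == 4, .(rn, cyl, gear), by = mpg] # filter rows, select columns, group by mpg (mpg is added by "by", so it is not repeated in the selection)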