R is a powerful language specifically designed for data analysis and graphics. Unlike a standard script document, this R Markdown file allows us to run “Code Chunks” and type descriptions freely.
To work in R, open the .Rmd file with
RStudio. You will see a Knit button at
the top of the left pane. Once you click Knit, a
document will be generated that includes both the content and the
output of any embedded R code chunks. If you set
output in the header above to ‘word_document’, it will generate a Word
file; if you set it to ‘pdf_document’, it will generate a PDF file.
Think of R as a high-powered calculator. We use the
<- or = operator (called the
assignment operator) to store values in variables.
# This is a comment - R doesn't run this line.
# Simple math
2 + 2
## [1] 4
# Assigning values to variables
x <- 10
y = 5
# Performing operations with variables
total <- x + y; total # commands on one line can be separated by ;
## [1] 15
print(total)
## [1] 15
For future practice, you can copy a code chunk, write your code in the gray area, and type your comments in the white area.
R handles different kinds of data values (we call them
data types). The most common are:
Numeric: decimals or integers (e.g., 10.5)
Character: text strings (e.g., “Hello”)
Logical: Boolean values (TRUE or FALSE)
Course_name <- "STAT 352"
is_fun <- TRUE
class(is_fun) # Checks the type of data
## [1] "logical"
is_fun+1 #in calculation, TRUE is 1 and FALSE is 0
## [1] 2
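The chunk above only checks a logical value. As a small sketch, the same class() check can be run on one value of each type listed earlier; the variable names price, greeting, passed, and answers are made up for illustration.

```r
# One example value of each common type (names are illustrative only)
price    <- 10.5      # numeric
greeting <- "Hello"   # character
passed   <- TRUE      # logical

class(price)      # "numeric"
class(greeting)   # "character"
class(passed)     # "logical"

# Because TRUE counts as 1 and FALSE as 0, sum() counts the TRUE values
answers <- c(TRUE, FALSE, TRUE, TRUE)
sum(answers)      # 3
```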
Data isn’t usually just one number; it’s a collection.
Vectors: A sequence of items of the same type, created with the c() function.
Data Frames: A table-like structure (like an Excel sheet).
# Creating a vector of scores
scores <- c(85, 92, 78, 90)
# Creating a simple Data Frame
students <- data.frame(
name = c("Alice", "Bob", "Charlie", "Diana"),
grade = c(95, 82, 78, 91)
)
# View the data frame
print(students)
##      name grade
## 1   Alice    95
## 2     Bob    82
## 3 Charlie    78
## 4   Diana    91
# Creating consecutive integers
x<-1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
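Once a vector or data frame exists, we usually want to pull pieces out of it. The following is a minimal sketch using the same scores and students objects as above (recreated here so the chunk runs on its own); it shows a few common base R operations, not a complete list.

```r
scores <- c(85, 92, 78, 90)

scores[2]        # second element: 92
scores + 5       # arithmetic is applied to every element
length(scores)   # number of elements: 4

students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "Diana"),
  grade = c(95, 82, 78, 91)
)

students$grade                    # one column, returned as a vector
students[students$grade > 90, ]   # only the rows with a grade above 90
nrow(students)                    # number of rows: 4
```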
R has many built-in functions for various distributions. Open the Help tab in the right panel and type ‘distributions’ in the search bar to see all the distribution families.
Binomial distributions:
dbinom(x=1,size=3,prob=0.8) #pmf for Binomial(3,0.8) at x=1 P(X=1)
## [1] 0.096
pbinom(1, 3, 0.8) #cdf for binomial(3,0.8), i.e. P(X<=1)
## [1] 0.104
dbinom(x=1,size=3,prob=0.8)+dbinom(x=0,size=3,prob=0.8) #P(X=1)+P(X=0) should be the same as above
## [1] 0.104
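The dbinom() value can be cross-checked against the binomial pmf formula, P(X = k) = choose(n, k) p^k (1 - p)^(n - k). This is a minimal sketch; the names n, p, k, and manual are chosen here for illustration.

```r
n <- 3     # number of trials
p <- 0.8   # success probability
k <- 1     # number of successes

# pmf computed by hand from the formula
manual <- choose(n, k) * p^k * (1 - p)^(n - k)
manual                           # 0.096

# Same probability from the built-in function
dbinom(k, size = n, prob = p)

# all.equal() compares numbers up to a small rounding tolerance
all.equal(manual, dbinom(k, size = n, prob = p))
```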
Poisson distributions:
dpois(3,lambda=5) #pmf for poisson: X follows a poisson distribution with mean of 5 per unit, this is P(X=3)
## [1] 0.1403739
ppois(3,lambda=5) #cdf for poisson: P(X<=3)
## [1] 0.2650259
dpois(0:3,lambda=5);sum(dpois(0:3,lambda=5)) # P(X=0)+P(X=1)+P(X=2)+P(X=3) should be the same as above
## [1] 0.006737947 0.033689735 0.084224337 0.140373896
## [1] 0.2650259
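Since ppois() gives P(X <= x), other probabilities follow by subtraction. The sketch below assumes the same Poisson distribution with mean 5 and shows an upper-tail probability and an interval probability; the variable names are illustrative.

```r
lambda <- 5

# P(X > 3) = 1 - P(X <= 3)
upper_a <- 1 - ppois(3, lambda)
# Same probability using the lower.tail argument
upper_b <- ppois(3, lambda, lower.tail = FALSE)
upper_a   # 0.7349741

# P(4 <= X <= 6) = P(X <= 6) - P(X <= 3)
interval_cdf <- ppois(6, lambda) - ppois(3, lambda)
# Same probability by summing the pmf over 4, 5, 6
interval_pmf <- sum(dpois(4:6, lambda))
all.equal(interval_cdf, interval_pmf)
```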
We will cover more once we finish Chapters 4 and 5.
R is famous for statistical analysis, including descriptive summaries and statistical inference. We will cover more of these in the future.
# A simple scatter plot
plot(students$grade, main="Student Grades", ylab="Score", xlab="Student Index", col="blue", pch=19)
# Summarize the sample mean and median
mean(students$grade)
## [1] 86.5
median(students$grade)
## [1] 86.5
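Beyond the mean and median, a few other base R functions give quick descriptive summaries. This is a small sketch on the same students data (recreated so the chunk is self-contained); which summaries are most useful depends on the question being asked.

```r
students <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "Diana"),
  grade = c(95, 82, 78, 91)
)

summary(students$grade)   # min, quartiles, median, mean, max in one call
range(students$grade)     # smallest and largest value: 78 95
sd(students$grade)        # sample standard deviation
var(students$grade)       # sample variance (sd squared)
```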
An article in Nature Genetics “Treatment-specific Changes in Gene Expression Discriminate In Vivo Drug Response in Human Leukemia Cells” (2003, Vol. 34(1), pp. 85–90) studied gene expression as a function of treatments for leukemia. One group received a high dose of the drug, while the control group received no treatment. Expression data (measures of gene activity) from one gene are shown in the table.
Use RStudio to import the GeneExpression text file and name
the data set Data1. You can find the Import
Dataset button at the top of the right panel; choose the correct
data file type and check whether the file contains column names. If it
does, make sure you select Yes for
Heading. You can change the data set name before importing the file.
Use RStudio to import the Excel file, name the data set
Data2, and save both Data1 and
Data2 in one “P1.RData” file. Remove the # to
run the remaining commands.
Data1 <- read.table("GeneExpression.txt",header=TRUE)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : number of items read is not a multiple of the number of columns
library(readxl)
Data2 <-read_excel("GeneExpression.xlsx")
save(Data1,Data2,file="P1.RData")
Set your working directory to the folder where your RData file
is located. You can use getwd() to see the current folder
and setwd() to change it.
Alternatively, put the Rmd file and the data sets in the same
folder and open the Rmd file directly with RStudio (make sure
RStudio is not already open), so that this folder becomes the working directory.
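As a minimal sketch of the getwd() and setwd() workflow: the folder used here is R's temporary directory, a stand-in (an assumption for illustration) for whatever folder holds your own data files.

```r
old_dir <- getwd()     # remember where we started
old_dir

target <- tempdir()    # stand-in folder; replace with your own data folder
setwd(target)          # change the working directory
getwd()                # confirm the change

# List any RData files visible from the current working directory
list.files(pattern = "\\.RData$")

setwd(old_dir)         # go back to the original folder
```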
load("P1.RData")
head(Data1)
##   HighDose Control1 Control2 Control3
## 1     16.1    297.1     25.1    131.1
## 2    134.9    491.8    820.1    166.5
## 3     52.7   1332.9     82.5   2258.4
## 4     14.4   1172.0    713.9    497.5
## 5    124.3   1482.7    785.6    263.4
## 6     99.0    335.4    114.0    252.3
The function head() lists the first 6 rows of your
data.
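head() also takes an n argument, and a few related functions help with a first look at a data set. The sketch below rebuilds only the six rows printed above into a data frame named first_rows (a name made up here) so that it runs without the data file.

```r
# The six rows shown in the head(Data1) output, typed in by hand
first_rows <- data.frame(
  HighDose = c(16.1, 134.9, 52.7, 14.4, 124.3, 99.0),
  Control1 = c(297.1, 491.8, 1332.9, 1172.0, 1482.7, 335.4),
  Control2 = c(25.1, 820.1, 82.5, 713.9, 785.6, 114.0),
  Control3 = c(131.1, 166.5, 2258.4, 497.5, 263.4, 252.3)
)

head(first_rows, n = 3)   # only the first 3 rows
tail(first_rows, n = 2)   # the last 2 rows
dim(first_rows)           # number of rows and columns: 6 4
str(first_rows)           # column names and types at a glance
```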
When we want to use a column of the data set (which is a
variable of this data set), we refer to the data set name and
the column name together in the format
data-name$col-name:
Data1$HighDose
## [1] 16.1 134.9 52.7 14.4 124.3 99.0 24.3 16.3 15.2 47.7 12.9 72.7
## [13] 126.7 46.4 60.3 23.5 43.6 79.4 38.0 58.2 26.5
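The $ notation is one of several equivalent ways to extract a column in base R. A small sketch, using a two-column data frame named expr (a made-up name) built from the first six rows printed earlier:

```r
expr <- data.frame(
  HighDose = c(16.1, 134.9, 52.7, 14.4, 124.3, 99.0),
  Control1 = c(297.1, 491.8, 1332.9, 1172.0, 1482.7, 335.4)
)

by_dollar   <- expr$HighDose        # data-name$col-name
by_brackets <- expr[["HighDose"]]   # same column, name given as a string
by_position <- expr[, 1]            # same column, selected by position

identical(by_dollar, by_brackets)   # TRUE

# The [[ ]] form also works when the column name is stored in a variable
col_name <- "Control1"
expr[[col_name]]

names(expr)                         # list all column names
```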
If we will keep using the same data set, we can attach it,
after which we only need to call the column names. As practice,
attach Data1 and calculate the average gene
expression in subjects who received the high dose of the treatment:
attach(Data1)
mean(HighDose)
## [1] 53.95714
If we want to work on another data set, we should detach the
current one and attach the new one. Now detach
Data1, attach Data2, and calculate the
average gene expression in subjects who were in the Control 1 group.
detach(Data1)
attach(Data2)
mean(Control1)
## [1] 394.6476
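If you prefer not to attach and detach data sets, base R's with() evaluates an expression inside a single data frame and leaves nothing attached afterwards. The sketch below uses only the HighDose values printed earlier, stored in a data frame named high (a made-up name); it is an alternative to attach(), not a replacement for the exercise above.

```r
# HighDose values copied from the Data1$HighDose output above
high <- data.frame(
  HighDose = c(16.1, 134.9, 52.7, 14.4, 124.3, 99.0, 24.3, 16.3, 15.2,
               47.7, 12.9, 72.7, 126.7, 46.4, 60.3, 23.5, 43.6, 79.4,
               38.0, 58.2, 26.5)
)

# Column names are looked up inside 'high' for this one expression only
with(high, mean(HighDose))   # 53.95714

# Equivalent result using the $ notation
mean(high$HighDose)
```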