1. Welcome to R!

R is a powerful language designed specifically for data analysis and graphics. Unlike a plain R script, this R Markdown file lets us run “Code Chunks” and type descriptions freely between them.

To work in R, open the .Rmd file with RStudio. You will see a Knit button at the top of the left pane. Once you click Knit, a document will be generated that includes both the text and the output of any R code chunks embedded in the document. If you set output to ‘word_document’ in the header at the top of the file, it will generate a Word file; if you set it to ‘pdf_document’, it will generate a PDF file.

The Basics: Arithmetic and Variables

Think of R as a high-powered calculator. We use the <- or = operator (called the assignment operator) to store values in variables.

# This is a comment - R doesn't run this line.
# Simple math
2 + 2
## [1] 4
# Assigning values to variables
x <- 10
y = 5

# Performing operations with variables
total <- x + y; total  # we can separate commands with ; on one line
## [1] 15
print(total)
## [1] 15
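
Since R works like a calculator, it may help to see a few more operators in one place. The following is a minimal sketch; the variable names a and b and their values are chosen only for illustration.

```r
# Illustrative values (not from the course data)
a <- 17
b <- 5

a - b      # subtraction: 12
a * b      # multiplication: 85
a / b      # division: 3.4
a ^ 2      # exponentiation: 289
a %% b     # remainder after division: 2
a %/% b    # integer division: 3
sqrt(16)   # a built-in function for the square root: 4
```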

For future practice, you can simply copy a code chunk, write your code in the gray area, and type your comments in the white areas.

2. Variable Types

R handles different kinds of data values, and every variable has a type. The most common types are:

Numeric: Decimals or integers (e.g., 10.5)
Character: Text strings (e.g., “Hello”)
Logical: Boolean values (TRUE or FALSE)

Course_name <- "STAT 352"
is_fun <- TRUE

class(is_fun)  # Checks the type of data
## [1] "logical"
is_fun + 1  # in arithmetic, TRUE is treated as 1 and FALSE as 0
## [1] 2
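
As a further sketch of the three types above, the example below checks each type with class() and shows why treating TRUE as 1 is useful for counting. The names price, greeting, and passed are made up for illustration.

```r
price <- 10.5                    # numeric
greeting <- "Hello"              # character
passed <- c(TRUE, FALSE, TRUE)   # logical vector

class(price)      # "numeric"
class(greeting)   # "character"
class(passed)     # "logical"

sum(passed)       # number of TRUE values: 2
mean(passed)      # proportion of TRUE values

# Text that looks like a number must be converted before doing arithmetic
as.numeric("3.14") + 1
```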

3. Vectors and Data Frames

Data isn’t usually just one number; it’s a collection.

Vectors: A sequence of items of the same type, created with the c() function.
Data Frames: A table-like structure (like an Excel sheet).

# Creating a vector of scores
scores <- c(85, 92, 78, 90)
# Creating a simple Data Frame
students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  grade = c(95, 82, 78, 91)
)
# View the data frame
print(students)
##      name grade
## 1   Alice    95
## 2     Bob    82
## 3 Charlie    78
## 4   Diana    91
# Creating consecutive integers
x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
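
Once a vector or data frame exists, we usually want to pull pieces out of it. The sketch below reuses the scores and students objects defined above and shows some common ways to index them; it is an illustration, not an exhaustive list.

```r
scores <- c(85, 92, 78, 90)

scores[2]             # second element: 92
scores[scores > 80]   # elements greater than 80: 85 92 90
scores + 5            # arithmetic is applied to every element
length(scores)        # number of elements: 4

students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  grade = c(95, 82, 78, 91)
)

students$grade                     # one column, returned as a vector
students[students$grade > 90, ]    # rows where grade is above 90
nrow(students)                     # number of rows: 4
ncol(students)                     # number of columns: 2
```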

4. Distribution Probabilities

R has many built-in functions for various distributions. Open the Help tab in the right panel and type ‘distributions’ in its search bar to see all the distribution families.

Binomial distributions:

dbinom(x=1,size=3,prob=0.8) #pmf for Binomial(3,0.8) at x=1 P(X=1)
## [1] 0.096
pbinom(1, 3, 0.8) #cdf for binomial(3,0.8), i.e. P(X<=1)
## [1] 0.104
dbinom(x=1,size=3,prob=0.8)+dbinom(x=0,size=3,prob=0.8) #P(X=1)+P(X=0) should be the same as above
## [1] 0.104
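
The cdf gives P(X<=x), so an upper-tail probability such as P(X>=2) needs the complement. As a sketch for the same Binomial(3,0.8) example, the three calculations below should agree; the variable names are chosen for illustration.

```r
# P(X >= 2) = 1 - P(X <= 1)
p_complement <- 1 - pbinom(1, size = 3, prob = 0.8)

# lower.tail = FALSE returns P(X > 1), which is the same event
p_upper <- pbinom(1, size = 3, prob = 0.8, lower.tail = FALSE)

# Adding the pmf at x = 2 and x = 3
p_sum <- sum(dbinom(2:3, size = 3, prob = 0.8))

c(p_complement, p_upper, p_sum)   # all three equal 0.896
```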

Poisson distributions:

dpois(3,lambda=5) #pmf for poisson: X follows a poisson distribution with mean of 5 per unit, this is P(X=3)
## [1] 0.1403739
ppois(3,lambda=5) #cdf for poisson: P(X<=3)
## [1] 0.2650259
dpois(0:3,lambda=5);sum(dpois(0:3,lambda=5)) # P(X=0)+P(X=1)+P(X=2)+P(X=3) should be the same as above
## [1] 0.006737947 0.033689735 0.084224337 0.140373896
## [1] 0.2650259
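
The same functions can be combined for probabilities over a range. The sketch below, for the Poisson example with mean 5, computes P(2<=X<=4) as a difference of two cdf values and checks it against the sum of the pmf; the variable names are illustrative.

```r
lambda <- 5

# P(2 <= X <= 4) = P(X <= 4) - P(X <= 1)
p_interval <- ppois(4, lambda) - ppois(1, lambda)

# Check: P(X=2) + P(X=3) + P(X=4)
p_check <- sum(dpois(2:4, lambda))

p_interval
p_check

# Upper tail: P(X > 3) = 1 - P(X <= 3)
p_more_than_3 <- ppois(3, lambda, lower.tail = FALSE)
p_more_than_3
```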

We will cover more once we finish Ch 4 and 5.

5. Graphical and Numerical Summaries (Ch 6)

R is widely used for statistical analysis, including descriptive summaries and statistical inference. We will cover more of these in the future.

# A simple scatter plot
plot(students$grade, main="Student Grades", ylab="Score", xlab="Student Index", col="blue", pch=19)

# Summarize the sample mean and median
mean(students$grade)
## [1] 86.5
median(students$grade)
## [1] 86.5
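
Beyond the mean and median, a few other built-in functions give quick numerical and graphical summaries. The sketch below uses the same four grades, stored in a stand-alone vector named grade for illustration.

```r
grade <- c(95, 82, 78, 91)

summary(grade)          # minimum, quartiles, median, mean, maximum
sd(grade)               # sample standard deviation
var(grade)              # sample variance (divides by n - 1)
range(grade)            # smallest and largest values: 78 95
quantile(grade, 0.25)   # first quartile

# A histogram of the grades
hist(grade, main = "Distribution of Grades", xlab = "Score")
```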

6. Importing External Data Files

An article in Nature Genetics, “Treatment-specific Changes in Gene Expression Discriminate In Vivo Drug Response in Human Leukemia Cells” (2003, Vol. 34(1), pp. 85–90), studied gene expression as a function of treatments for leukemia. One group received a high dose of the drug, while the control group received no treatment. Expression data (measures of gene activity) from one gene are shown in the table.

Importing text files:

Use RStudio to import the GeneExpression text file and name the data set Data1. You can find the Import Dataset button at the top of the right panel. Choose the correct data file type and check whether the file contains column names. If it does, make sure you select Yes for Heading. You can also change the data set name before importing the file.

Importing excel files:

Use RStudio to import the Excel file, name the data set Data2, and save both Data1 and Data2 in one “P1.RData” file. Remove the # to run the remaining commands.

Data1 <- read.table("GeneExpression.txt",header=TRUE)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : number of items read is not a multiple of the number of columns
library(readxl)
Data2 <-read_excel("GeneExpression.xlsx")
save(Data1,Data2,file="P1.RData")
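
To see what header=TRUE and the save/load pair do without needing the course files, here is a self-contained sketch. It writes a small text file to a temporary location, using the first three rows of two columns shown later in this section, and reads it back. The names tmp_txt, tmp_rdata, and gene_demo are hypothetical and used only for this illustration.

```r
# Write a small stand-in text file; the first line holds the column names
tmp_txt <- tempfile(fileext = ".txt")
writeLines(c("HighDose Control1",
             "16.1 297.1",
             "134.9 491.8",
             "52.7 1332.9"), tmp_txt)

# header = TRUE tells R to use the first line as column names
gene_demo <- read.table(tmp_txt, header = TRUE)
str(gene_demo)

# Save the object to an .RData file, remove it, and load it back
tmp_rdata <- tempfile(fileext = ".RData")
save(gene_demo, file = tmp_rdata)
rm(gene_demo)
load(tmp_rdata)
head(gene_demo)
```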

Statistical Analysis:

Set your working directory to the folder where your RData file is located. You can use getwd() to see the current folder and setwd() to change to the target folder. Alternatively, put the Rmd file and the data sets in the same folder and open the Rmd file directly with RStudio (make sure RStudio is not already open, so that it starts in that folder).
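
As a sketch of how getwd() and setwd() fit together, the example below switches to another folder and back. It uses tempdir() as a stand-in for your own course folder, which is an assumption made so that the code runs on any computer; replace it with your actual folder path.

```r
old_dir <- getwd()    # remember the current working directory
old_dir

target <- tempdir()   # stand-in for your course folder (assumption)
setwd(target)         # change the working directory
getwd()               # confirm the change

list.files()              # see which files are in this folder
file.exists("P1.RData")   # TRUE only if the file is in this folder

setwd(old_dir)        # go back to where we started
```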

load("P1.RData")
head(Data1)
##   HighDose Control1 Control2 Control3
## 1     16.1    297.1     25.1    131.1
## 2    134.9    491.8    820.1    166.5
## 3     52.7   1332.9     82.5   2258.4
## 4     14.4   1172.0    713.9    497.5
## 5    124.3   1482.7    785.6    263.4
## 6     99.0    335.4    114.0    252.3

The function head() lists the first 6 rows of your data.

When we want to use a column of the data set (that is, one variable in the data), we refer to the data set name and the column name together, in the format data-name$col-name:

Data1$HighDose
##  [1]  16.1 134.9  52.7  14.4 124.3  99.0  24.3  16.3  15.2  47.7  12.9  72.7
## [13] 126.7  46.4  60.3  23.5  43.6  79.4  38.0  58.2  26.5

If we keep using the same data set, we can attach it, after which we only need to call the column names. As practice, attach Data1 and calculate the average gene expression in subjects who received a high dose of the treatment:

attach(Data1)
mean(HighDose)
## [1] 53.95714

If we want to work on another data set, we should detach the current one and attach the new one. Now detach Data1, attach Data2, and calculate the average gene expression in subjects who were in the Control 1 group.

detach(Data1)
attach(Data2)
mean(Control1)
## [1] 394.6476
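
Attached data sets can clash with other objects that share a name (for example, a variable x created earlier), so some users prefer to avoid attach(). As a hedged alternative sketch, with() evaluates an expression inside a data set without attaching it. The small data frame below holds only the first six rows of two columns shown by head(Data1) above, so its mean differs from the full-data result; the name gene_head is made up for illustration.

```r
# First six rows of HighDose and Control1, copied from the head(Data1) output
gene_head <- data.frame(
  HighDose = c(16.1, 134.9, 52.7, 14.4, 124.3, 99.0),
  Control1 = c(297.1, 491.8, 1332.9, 1172.0, 1482.7, 335.4)
)

mean(gene_head$HighDose)          # using the $ format
with(gene_head, mean(HighDose))   # same result, no attach() needed
colMeans(gene_head)               # the mean of every column at once
```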