Description:

This exam assesses your understanding of key computational methods in applied statistics, focusing on data manipulation, basic statistical summaries, and data visualization using R. You will work with built-in R datasets to summarize data and create plots with base R plotting functions. The tasks test your ability to apply R programming skills to real-world datasets, including both numerical and categorical data.

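Each problem covers a different aspect of data analysis, including:

    • Basic R operations (arithmetic, vectors, matrices, and data frames)
    • Exploring and summarizing data
    • Creating visualizations
    • Categorical data analysis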

Make sure to follow the instructions carefully, and use R functions effectively to solve each problem. Good luck!


Problem 1 (25 pts): Basic R

Task:

  1. Assume x <- 23 and y <- 12. Perform the following arithmetic operations on these two numbers:

    • Addition, subtraction, multiplication, division, and remainder (modulo).
  2. Create two vectors x <- c(8, 4, 2, 3) and y <- c(1, 9, 6, 5). Perform:

    • Element-wise addition, subtraction, multiplication, division, and modulo.
  3. Create two 5 × 3 matrices (5 rows, 3 columns):

    • mat1 with numbers 1 through 15
    • mat2 with numbers 16 through 30
  4. Use rbind(mat1, mat2) to create a new matrix mat3 and check the dimension of mat3.

  5. Create a data frame named grade with columns:

    • Course = (“Math”, “Stats”, “CS”)
    • Credits = (3, 4, 3)
    • Grade = (“A”, “B”, “F”)
    • Pass = (T, T, F)

    Use str() to display the structure of the data frame, then convert the Grade column to a factor.

Solutions:

# Solve a)
x <- 23
y <- 12

# Addition
x + y
## [1] 35
# Subtraction
x - y
## [1] 11
# Multiplication
x * y
## [1] 276
# Division
x / y
## [1] 1.916667
# Remainder (modulo)
x %% y
## [1] 11
# Solve b)
x <- c(8, 4, 2, 3)
y <- c(1, 9, 6, 5)

# Addition (element-wise)
x + y
## [1]  9 13  8  8
# Subtraction (element-wise)
x - y
## [1]  7 -5 -4 -2
# Multiplication (element-wise)
x * y
## [1]  8 36 12 15
# Division (element-wise)
x / y
## [1] 8.0000000 0.4444444 0.3333333 0.6000000
# Remainder (modulo, element-wise)
x %% y
## [1] 0 4 2 3
# Solve c)
mat1 <- matrix(1:15, nrow = 5, ncol = 3)
mat1 
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15
mat2 <- matrix(16:30, nrow = 5, ncol = 3)
mat2 
##      [,1] [,2] [,3]
## [1,]   16   21   26
## [2,]   17   22   27
## [3,]   18   23   28
## [4,]   19   24   29
## [5,]   20   25   30
# Solve d)
mat3 <- rbind(mat1, mat2)
mat3 
##       [,1] [,2] [,3]
##  [1,]    1    6   11
##  [2,]    2    7   12
##  [3,]    3    8   13
##  [4,]    4    9   14
##  [5,]    5   10   15
##  [6,]   16   21   26
##  [7,]   17   22   27
##  [8,]   18   23   28
##  [9,]   19   24   29
## [10,]   20   25   30
dim(mat3)
## [1] 10  3
# Solve e)
grade <- data.frame(
  Course = c("Math", "Stats", "CS"),
  Credits = c(3, 4, 3),
  Grade = c("A", "B", "F"),
  Pass = c(TRUE, TRUE, FALSE)
)
grade
str(grade)
## 'data.frame':    3 obs. of  4 variables:
##  $ Course : chr  "Math" "Stats" "CS"
##  $ Credits: num  3 4 3
##  $ Grade  : chr  "A" "B" "F"
##  $ Pass   : logi  TRUE TRUE FALSE
# Change Grade column to factor
grade$Grade <- as.factor(grade$Grade)

# Print structure again to confirm
str(grade)
## 'data.frame':    3 obs. of  4 variables:
##  $ Course : chr  "Math" "Stats" "CS"
##  $ Credits: num  3 4 3
##  $ Grade  : Factor w/ 3 levels "A","B","F": 1 2 3
##  $ Pass   : logi  TRUE TRUE FALSE
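
As an optional cross-check on the solutions above, the sketch below verifies that integer division and modulo are consistent for the vectors in part b), shows how byrow = TRUE changes the fill order used in part c), and builds a factor with an explicit level order. The names quotient, remainder, identity_holds, mat_byrow, and grade_levels are illustrative assumptions and are not part of the exam.

```r
# Modulo identity: x equals (integer quotient * y) + remainder, element-wise
x <- c(8, 4, 2, 3)
y <- c(1, 9, 6, 5)
quotient  <- x %/% y   # integer division
remainder <- x %% y    # remainder
identity_holds <- all(quotient * y + remainder == x)
identity_holds

# matrix() fills column by column by default; byrow = TRUE fills row by row
mat_byrow <- matrix(1:15, nrow = 5, ncol = 3, byrow = TRUE)
mat_byrow

# as.factor() sorts levels alphabetically; factor() lets you set the order yourself
grade_levels <- factor(c("A", "B", "F"), levels = c("F", "B", "A"))
levels(grade_levels)
```

Setting the levels explicitly can matter later, for example when a plot or table should list grades from lowest to highest rather than alphabetically.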

Problem 2 (25 pts): Exploring and Summarizing Data with the ‘iris’ Dataset

The iris dataset contains measurements of four variables (sepal length, sepal width, petal length, and petal width) for three species of iris flowers. (Try ?iris for more info.)

Task:

  1. Load the iris dataset and examine its structure. Identify how many rows (observations) and columns (variables) are in the dataset.

  2. Compute the median of each of the numeric variables (Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width), and record your results.

  3. Display the possible categories (levels) that the Species column takes.

  4. Create a proportion table that displays the relative frequency of observations for each species in the dataset.

  5. Create a histogram of Sepal.Width values, and then add a density curve using lines(density(...)).

Solutions:

# Solve a)

# load iris dataset 
data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Calculate the number of rows and columns in the dataset
dim(iris)
## [1] 150   5
# Solve b)
# Compute the median of each of the numeric variables
median_sepal_length <- median(iris$Sepal.Length)
median_sepal_length
## [1] 5.8
median_sepal_width <- median(iris$Sepal.Width)
median_sepal_width
## [1] 3
median_petal_length <- median(iris$Petal.Length)
median_petal_length
## [1] 4.35
median_petal_width <- median(iris$Petal.Width)
median_petal_width
## [1] 1.3
# Create a data frame to display the results
median_result <- data.frame(
  Variable = colnames(iris[, 1:4]),
  Median = c(median_sepal_length, median_sepal_width, median_petal_length, median_petal_width)
)
# Print the results
median_result
# Solve c)
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
# Solve d)
prop.table(table(iris$Species))
## 
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333
# Solve e) 
hist(iris$Sepal.Width,
     prob = TRUE,   # plot densities so the density curve shares the y-axis scale
     main = "Histogram of Sepal Width with Density Curve",
     xlab = "Sepal Width",
     col = "lightblue",
     border = "black"
)
# Add density curve
lines(density(iris$Sepal.Width), col = "red", lwd = 2)
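
Part b) can also be written more compactly. The following is a minimal sketch, assuming the four numeric variables are the first four columns of iris (as str(iris) shows); the names numeric_cols, medians, and median_table are illustrative and not part of the exam.

```r
# Select the four numeric columns (iris ships with base R's datasets package)
numeric_cols <- iris[, 1:4]

# Apply median() to every column at once; the result is a named numeric vector
medians <- sapply(numeric_cols, median)

# Collect the results in a data frame for display
median_table <- data.frame(
  Variable = names(medians),
  Median = unname(medians)
)
median_table
```

Using sapply() avoids repeating the same call four times and makes it harder to apply the wrong summary function to one of the columns by accident.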

Problem 3 (25 pts): Creating Visualizations Using the ‘ToothGrowth’ Dataset

The ToothGrowth dataset records the length of teeth in guinea pigs receiving different doses of vitamin C, either through orange juice (OJ) or ascorbic acid (VC).

Task:

  1. Examine the dataset and describe the variables it contains, including their types.

  2. Create a histogram of tooth lengths (len). Add the title Distribution of Tooth Lengths, label the x-axis Tooth Length, and label the y-axis Frequency.

  3. Create a bar plot from the frequency table of the treatment groups (supp). Use a different fill color for each treatment group, add the title Tooth Length by Supplement Type, label the x-axis Supplement Type, and label the y-axis Tooth Length.

  4. Use the formula form len ~ supp to create a boxplot of tooth lengths split by supplement type (OJ vs VC). Add the title Tooth Length by Supplement Type, label the x-axis Supplement Type, and label the y-axis Tooth Length.

  5. Create a scatter plot of tooth length (len), using light blue points with lines connecting them. Add the title Tooth Length of Guinea Pigs, label the x-axis Index, and label the y-axis Tooth Length.

Solutions:

# Solve a)

# Load the dataset 
data("ToothGrowth")
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Description of the dataset:
# len: numeric, continuous response variable (tooth length).
# supp: factor with two levels, "OJ" and "VC" (type of supplement).
# dose: numeric; can also be treated as categorical since it takes only three fixed values (0.5, 1.0, 2.0).

# Solve b)

# Create histogram of tooth lengths
hist(ToothGrowth$len,
     main = "Distribution of Tooth Lengths",  
     xlab = "Tooth Length",                   
     ylab = "Frequency",                      
     col = "lightgreen",                     
     border = "black")                        

# Solve c)

supp_table <- table(ToothGrowth$supp)

# Bar plot of frequency per supplement
# (bar heights are sample counts; the title and axis labels follow the task wording)
barplot(supp_table,
        main = "Tooth Length by Supplement Type",  
        xlab = "Supplement Type",               
        ylab = "Tooth Length",                   
        col = c("orange", "skyblue"),            
        border = "black")                         

# Solve d)

# Boxplot of tooth lengths by supplement type
boxplot(len ~ supp, 
        data = ToothGrowth,                    
        main = "Tooth Length by Supplement Type", 
        xlab = "Supplement Type",                  
        ylab = "Tooth Length",                     
        col = c("orange", "skyblue"),             
        border = "black")                          

# Solve e)

# Scatter plot of tooth length with lines connecting points
plot(ToothGrowth$len,
     type = "b",                    
     col = "lightblue",             
     pch = 16,                      
     main = "Tooth Length of Guinea Pigs",  
     xlab = "Index",                
     ylab = "Tooth Length")         
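
The bar plot in part c) shows how many observations fall in each supplement group. If the goal were instead to compare tooth lengths between groups in a bar plot, one possible approach is to plot the group means. This is a sketch of an alternative, not the required answer; the name mean_len and the adjusted title and y-axis label are assumptions.

```r
# Mean tooth length per supplement type (ToothGrowth is in base R's datasets package)
mean_len <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
mean_len

# Bar heights are now mean tooth lengths rather than sample counts
barplot(mean_len,
        main = "Mean Tooth Length by Supplement Type",
        xlab = "Supplement Type",
        ylab = "Mean Tooth Length",
        col = c("orange", "skyblue"),
        border = "black")
```

The boxplot in part d) conveys the same comparison with more detail, since it also shows the spread within each group.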

Problem 4 (25 pts): Categorical Data Analysis with the ‘PlantGrowth’ Dataset

The PlantGrowth dataset records the weight of plants grown under three different treatment conditions: control (ctrl), treatment 1 (trt1), and treatment 2 (trt2). (Try ?PlantGrowth for more info.)

Task:

  1. Load the PlantGrowth dataset and display the structure of the data. How many plant samples are included in the dataset, and what are the columns?

  2. Calculate the maximum plant weight for the entire dataset.

  3. Generate a proportion table that shows the relative frequency (proportion) of samples for each treatment group.

  4. Create a bar plot from the proportion table. Use a different fill color for each treatment group, add the title "Proportion of Samples per Treatment Group", label the x-axis "Treatment Group", and label the y-axis "Proportion of Samples".

  5. Create a pie chart of the number of plant samples in each treatment group using the pie() function. Add the descriptive title "Number of Plant Samples by Treatment Group" to the plot and assign custom names to the slices: "Control", "Treatment 1", and "Treatment 2".

Solutions:

# Solve a)

data("PlantGrowth")
str(PlantGrowth)
## 'data.frame':    30 obs. of  2 variables:
##  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
##  $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
# 30 samples with two columns titled "weight" and "group"

# Solve b)

# Maximum plant weight in the dataset
max_weight <- max(PlantGrowth$weight)
max_weight
## [1] 6.31
# Solve c)

# Proportion table of treatment groups in PlantGrowth
prop.table(table(PlantGrowth$group))
## 
##      ctrl      trt1      trt2 
## 0.3333333 0.3333333 0.3333333
# Solve d)

# Create proportion table
prop_table <- prop.table(table(PlantGrowth$group))

# Create Barplot
barplot(prop_table,
        main = "Proportion of Samples per Treatment Group",  
        xlab = "Treatment Group",                             
        ylab = "Proportion of Samples",                      
        col = c("orange", "skyblue", "lightgreen"),           
        border = "black")                                    

# Solve e)

# Create a frequency table of treatment groups
group_counts <- table(PlantGrowth$group)

# Assign custom names to the slices
names(group_counts) <- c("Control", "Treatment 1", "Treatment 2")

# Create pie chart
pie(group_counts,
    main = "Number of Plant Samples by Treatment Group",
    col = c("orange", "skyblue", "lightgreen"))
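
As a variation on part e), the slice names can be passed through the labels argument of pie() instead of overwriting the names of the table, which also makes it easy to append each group's share. This is a minimal sketch; the names group_props and slice_labels and the percentage formatting are assumptions, not part of the exam.

```r
# Counts and proportions per treatment group (PlantGrowth is in base R's datasets package)
group_counts <- table(PlantGrowth$group)
group_props <- as.numeric(prop.table(group_counts))

# Build slice labels such as "Control (33.3%)" without modifying group_counts
slice_labels <- paste0(c("Control", "Treatment 1", "Treatment 2"),
                       " (", round(100 * group_props, 1), "%)")

# Pie chart with custom labels supplied via the labels argument
pie(group_counts,
    labels = slice_labels,
    main = "Number of Plant Samples by Treatment Group",
    col = c("orange", "skyblue", "lightgreen"))
```

Leaving the table names untouched keeps group_counts aligned with the original factor levels (ctrl, trt1, trt2) for any later steps.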