This exam assesses your understanding of key computational methods in applied statistics, focusing on data manipulation, basic statistical summaries, and data visualization in R. You will work with built-in R datasets to summarize data and create plots with base R plotting functions. The tasks test your ability to apply R programming skills to real-world datasets and to handle both numerical and categorical data.
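Each problem covers a different aspect of data analysis: basic R operations and data structures, numerical summaries of the iris dataset, and base R plots of the ToothGrowth and PlantGrowth datasets.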
Make sure to follow the instructions carefully, and use R functions effectively to solve each problem. Good luck!
Task:
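a) Assume x <- 23 and y <- 12. Perform the following arithmetic operations with these two numbers: addition, subtraction, multiplication, division, and remainder (modulo).
b) Create two vectors x <- c(8, 4, 2, 3) and y <- c(1, 9, 6, 5). Perform the same operations element-wise: addition, subtraction, multiplication, division, and remainder (modulo).
c) Create two 5 × 3 matrices (5 rows, 3 columns): mat1 with the numbers 1 through 15 and mat2 with the numbers 16 through 30.
d) Use rbind(mat1, mat2) to create a new matrix mat3 and check the dimension of mat3.
e) Create a data frame named grade with the columns Course, Credits, Grade, and Pass. Use str() to print the structure of the data frame, then change the data type of Grade to factor.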
Solutions:
# Solve a)
x <- 23
y <- 12
# Addition
x + y
## [1] 35
# Subtraction
x - y
## [1] 11
# Multiplication
x * y
## [1] 276
# Division
x / y
## [1] 1.916667
# Remainder (modulo)
x %% y
## [1] 11
# Solve b)
x <- c(8, 4, 2, 3)
y <- c(1, 9, 6, 5)
# Addition (element-wise)
x + y
## [1] 9 13 8 8
#Subtraction (element-wise)
x - y
## [1] 7 -5 -4 -2
#Multiplication (element-wise)
x * y
## [1] 8 36 12 15
#Division (element-wise)
x / y
## [1] 8.0000000 0.4444444 0.3333333 0.6000000
#Remainder (modulo, element-wise)
x %% y
## [1] 0 4 2 3
# Solve c)
mat1 <- matrix(1:15, nrow = 5, ncol = 3)
mat1
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15
mat2 <- matrix(16:30, nrow = 5, ncol = 3)
mat2
##      [,1] [,2] [,3]
## [1,]   16   21   26
## [2,]   17   22   27
## [3,]   18   23   28
## [4,]   19   24   29
## [5,]   20   25   30
# Solve d)
mat3 <- rbind(mat1, mat2)
mat3
##       [,1] [,2] [,3]
##  [1,]    1    6   11
##  [2,]    2    7   12
##  [3,]    3    8   13
##  [4,]    4    9   14
##  [5,]    5   10   15
##  [6,]   16   21   26
##  [7,]   17   22   27
##  [8,]   18   23   28
##  [9,]   19   24   29
## [10,]   20   25   30
dim(mat3)
## [1] 10 3
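As a quick check on parts c) and d), the sketch below confirms that matrix() fills column by column by default and that rbind() adds the row counts while keeping the column count. The byrow = TRUE variant and the name mat1_byrow are illustrative additions, not part of the exam solution.

```r
mat1 <- matrix(1:15, nrow = 5, ncol = 3)
mat2 <- matrix(16:30, nrow = 5, ncol = 3)

# Default fill is column-wise, so the first column holds 1 through 5
mat1[, 1]

# Hypothetical variant: fill row by row instead
mat1_byrow <- matrix(1:15, nrow = 5, ncol = 3, byrow = TRUE)
mat1_byrow[1, ]

# rbind() stacks rows: 5 + 5 rows, still 3 columns
mat3 <- rbind(mat1, mat2)
dim(mat3)
```

If the two matrices had different numbers of columns, rbind() would stop with an error, so checking ncol() first is a reasonable habit.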
# Solve e)
grade <- data.frame(
  Course = c("Math", "Stats", "CS"),
  Credits = c(3, 4, 3),
  Grade = c("A", "B", "F"),
  # Spell out TRUE/FALSE: T and F are ordinary variables that can be overwritten
  Pass = c(TRUE, TRUE, FALSE)
)
grade
str(grade)
## 'data.frame': 3 obs. of 4 variables:
## $ Course : chr "Math" "Stats" "CS"
## $ Credits: num 3 4 3
## $ Grade : chr "A" "B" "F"
## $ Pass : logi TRUE TRUE FALSE
# Change Grade column to factor
grade$Grade <- as.factor(grade$Grade)
# Print structure again to confirm
str(grade)
## 'data.frame': 3 obs. of 4 variables:
## $ Course : chr "Math" "Stats" "CS"
## $ Credits: num 3 4 3
## $ Grade : Factor w/ 3 levels "A","B","F": 1 2 3
## $ Pass : logi TRUE TRUE FALSE
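Besides str(), the conversion in part e) can be confirmed directly. The following is a minimal sketch that rebuilds the same data frame and inspects the factor; the name grade_codes is an illustrative addition.

```r
grade <- data.frame(
  Course = c("Math", "Stats", "CS"),
  Credits = c(3, 4, 3),
  Grade = c("A", "B", "F"),
  Pass = c(TRUE, TRUE, FALSE)
)
grade$Grade <- as.factor(grade$Grade)

# TRUE once the column has been converted
is.factor(grade$Grade)

# as.factor() orders the levels of a character vector alphabetically
levels(grade$Grade)

# Integer codes stored behind the labels (illustrative name)
grade_codes <- as.integer(grade$Grade)
grade_codes
```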
The iris dataset contains measurements of four variables (sepal length, sepal width, petal length, and petal width) for three species of iris flowers. (Try ?iris for more info.)
Task:
a) Load the iris dataset and examine its structure. Identify how many rows (observations) and columns (variables) are in the dataset.
b) Compute the median of each of the numeric variables (Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width), and record your results.
c) Display the possible categories (levels) that the Species column takes.
d) Create a proportion table that displays the relative frequency of observations for each species in the dataset.
e) Create a histogram of Sepal.Width values, and then add a density curve using lines(density(...)).
Solutions:
# Solve a)
# load iris dataset
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Calculate the number of rows and columns in the dataset
dim(iris)
## [1] 150 5
# Solve b)
# Compute the median of each of the numeric variables
median_sepal_length <- median(iris$Sepal.Length)
median_sepal_length
## [1] 5.8
median_sepal_width <- median(iris$Sepal.Width)
median_sepal_width
## [1] 3
median_petal_length <- median(iris$Petal.Length)
median_petal_length
## [1] 4.35
median_petal_width <- median(iris$Petal.Width)
median_petal_width
## [1] 1.3
# Create a data frame to display the results
median_result <- data.frame(
  Variable = colnames(iris[, 1:4]),
  Median = c(median_sepal_length, median_sepal_width, median_petal_length, median_petal_width)
)
# Call the object name
median_result
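The four medians can also be computed in a single call. This is a minimal sketch, assuming the numeric variables are the first four columns of iris (as the str() output above shows); the name iris_medians is an illustrative addition.

```r
data(iris)

# Apply median() to each of the four numeric columns
iris_medians <- sapply(iris[, 1:4], median)
iris_medians
```

The result is a named numeric vector, so a single value can be pulled out by name, for example iris_medians["Petal.Length"].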
# Solve c)
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
# Solve d)
prop.table(table(iris$Species))
##
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
# Solve e)
hist(iris$Sepal.Width,
     probability = TRUE,  # density scale, so the density curve fits on the same axis
     main = "Histogram of Sepal Width with Density Curve",
     xlab = "Sepal Width",
     col = "lightblue",
     border = "black"
)
# Add density curve
lines(density(iris$Sepal.Width), col = "red", lwd = 2)
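The density curve only lines up with the bars when the histogram is drawn on the density scale rather than as counts. The sketch below illustrates why: on the density scale the bar areas sum to 1, just as the area under a density curve does. The names h and bar_areas are illustrative additions.

```r
data(iris)

# Compute the histogram without drawing it
h <- hist(iris$Sepal.Width, plot = FALSE)

# Area of each bar on the density scale: height times bin width
bar_areas <- h$density * diff(h$breaks)
sum(bar_areas)

# On the count scale the bar heights sum to the number of observations instead
sum(h$counts)
```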
The ToothGrowth dataset records the length of teeth in guinea pigs receiving different doses of vitamin C, either through orange juice (OJ) or ascorbic acid (VC).
Task:
a) Examine the dataset and describe the variables it contains, including their types.
b) Create a histogram of tooth lengths (len). Add the title Distribution of Tooth Lengths, label the x-axis Tooth Length, and label the y-axis Frequency.
c) Create a bar plot from the frequency table of the treatment groups. Use a different fill color for each treatment group, add the title Tooth Length by Supplement Type, label the x-axis Supplement Type, and label the y-axis Tooth Length.
d) Use the formula form len ~ supp to create a boxplot of tooth lengths split by supplement type (OJ vs VC). Add the title Tooth Length by Supplement Type, label the x-axis Supplement Type, and label the y-axis Tooth Length.
e) Create a scatter plot of tooth length (len), using light blue points with lines connecting them. Add the title Tooth Length of Guinea Pigs, label the x-axis Index, and label the y-axis Tooth Length.
Solutions:
# Solve a)
# Load the dataset
data("ToothGrowth")
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Description of the dataset:
# len (numeric, continuous response variable).
# supp (categorical factor with two levels "OJ" and "VC", type of supplement).
# dose (numeric, can also be categorical since it takes only 3 fixed values 0.5, 1.0, 2.0)
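Since dose takes only three fixed values, it can be treated as a categorical variable. A minimal sketch of that conversion follows; it leaves ToothGrowth itself unchanged, and the names dose_factor and design_table are illustrative additions.

```r
data("ToothGrowth")

# Convert dose to a factor without modifying the original data frame
dose_factor <- factor(ToothGrowth$dose)
levels(dose_factor)

# Cross-tabulate supplement type against dose level
design_table <- table(ToothGrowth$supp, dose_factor)
design_table
```

The cross-tabulation shows how the 60 observations are spread over the supplement and dose combinations.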
# Solve b)
# Create histogram of tooth lengths
hist(ToothGrowth$len,
     main = "Distribution of Tooth Lengths",
     xlab = "Tooth Length",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black")
# Solve c)
supp_table <- table(ToothGrowth$supp)
# Bar plot of frequency per supplement
barplot(supp_table,
        main = "Tooth Length by Supplement Type",
        xlab = "Supplement Type",
        ylab = "Tooth Length",
        col = c("orange", "skyblue"),
        border = "black")
# Solve d)
# Boxplot of tooth lengths by supplement type
boxplot(len ~ supp,
        data = ToothGrowth,
        main = "Tooth Length by Supplement Type",
        xlab = "Supplement Type",
        ylab = "Tooth Length",
        col = c("orange", "skyblue"),
        border = "black")
# Solve e)
# Scatter plot of tooth length with lines connecting points
plot(ToothGrowth$len,
     type = "b",
     col = "lightblue",
     pch = 16,
     main = "Tooth Length of Guinea Pigs",
     xlab = "Index",
     ylab = "Tooth Length")
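The bar plot in part c) is built from a frequency table, so its bar heights are the number of observations per supplement, not tooth lengths. If a numerical comparison of tooth length between the two groups is wanted alongside the boxplot, one possible sketch uses tapply(); the name mean_len is an illustrative addition and this step is not required by the task.

```r
data("ToothGrowth")

# Number of observations per supplement (what the bar plot in part c shows)
table(ToothGrowth$supp)

# Mean tooth length per supplement (illustrative extra summary)
mean_len <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
mean_len
```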
The PlantGrowth dataset records the weight of plants grown under three different treatment conditions: control (ctrl), treatment 1 (trt1), and treatment 2 (trt2). (Try ?PlantGrowth for more info.)
Task:
a) Load the PlantGrowth dataset and display the structure of the data. How many plant samples are included in the dataset, and what are the columns?
b) Calculate the maximum plant weight for the entire dataset.
c) Generate a proportion table that shows the relative frequency (proportion) of samples for each treatment group.
d) Create a bar plot from the proportion table. Use a different fill color for each treatment group, add the title "Proportion of Samples per Treatment Group", label the x-axis "Treatment Group", and label the y-axis "Proportion of Samples".
e) Create a pie chart of the number of plant samples in each treatment group using the pie() function. Add the descriptive title "Number of Plant Samples by Treatment Group" to the plot and assign custom names to the slices: "Control", "Treatment 1", and "Treatment 2".
Solutions:
# Solve a)
data("PlantGrowth")
str(PlantGrowth)
## 'data.frame': 30 obs. of 2 variables:
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
# 30 samples with two columns titled "weight" and "group"
# Solve b)
# Maximum plant weight in the dataset
max_weight <- max(PlantGrowth$weight)
max_weight
## [1] 6.31
# Solve c)
# Proportion table of treatment groups in PlantGrowth
prop.table(table(PlantGrowth$group))
##
## ctrl trt1 trt2
## 0.3333333 0.3333333 0.3333333
# Solve d)
# Create proportion table
prop_table <- prop.table(table(PlantGrowth$group))
# Create Barplot
barplot(prop_table,
        main = "Proportion of Samples per Treatment Group",
        xlab = "Treatment Group",
        ylab = "Proportion of Samples",
        col = c("orange", "skyblue", "lightgreen"),
        border = "black")
# Solve e)
# Create a frequency table of treatment groups
group_counts <- table(PlantGrowth$group)
# Assign custom names to the slices
names(group_counts) <- c("Control", "Treatment 1", "Treatment 2")
# Create pie chart
pie(group_counts,
    main = "Number of Plant Samples by Treatment Group",
    col = c("orange", "skyblue", "lightgreen"))
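Custom slice names can also be supplied through the labels argument of pie(), which leaves the names of the frequency table untouched. The sketch below is one possible variant that additionally appends the count to each label; the names slice_names and slice_labels are illustrative additions.

```r
data("PlantGrowth")

# Frequency table keeps its original group names (ctrl, trt1, trt2)
group_counts <- table(PlantGrowth$group)

# Build the slice labels separately (illustrative names)
slice_names <- c("Control", "Treatment 1", "Treatment 2")
slice_labels <- paste0(slice_names, " (", as.vector(group_counts), ")")

pie(group_counts,
    labels = slice_labels,
    main = "Number of Plant Samples by Treatment Group",
    col = c("orange", "skyblue", "lightgreen"))
```

Because the groups are equal in size, the slices are identical and the printed counts carry most of the information.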