Replace “Your Name” with your actual name.

Objective:

In this lab, you will apply data transformation techniques, including mean-centering, calculating Z-scores, and performing non-linear transformations on various datasets. Please complete the exercises by filling in the code chunks and answering the interpretation questions. Once completed, knit this document to HTML and submit it as instructed.

Exercise 1: Mean-Centering

Dataset: Simulated data on the number of hours spent studying per week:

Tasks:
1. Calculate the mean of the study hours.
2. Mean-center the dataset by subtracting the mean from each value.
3. Plot the original and mean-centered study hours on the same graph.
4. Interpretation: Explain what the mean-centered values tell you about the amount of time each student spent studying compared to the average.

study_hours <- c(15, 22, 18, 25, 20, 28, 24, 19, 23, 26)
# Calculate the mean of the study hours
mean(study_hours)
## [1] 22
mean_study_hours <- mean(study_hours)
# Mean-center the study hours
study_hours - mean_study_hours
##  [1] -7  0 -4  3 -2  6  2 -3  1  4
centered_study_hours <- study_hours - mean_study_hours

study_hours
##  [1] 15 22 18 25 20 28 24 19 23 26
# Plot the original study hours
# use plot() 
# use abline(h = mean(study_hours))
plot(study_hours, col = "blue", pch = 2)
abline(h = mean_study_hours)

# Plot the mean-centered study hours
# use plot() 
# use abline(h = 0)
plot(centered_study_hours, col = "red", pch = 3)
abline(h = 0)

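Interpretation: Each mean-centered value is a student's distance from the class average of 22 hours. Positive values mean the student studied more than average and negative values mean less; for example, the student who studied 28 hours is 6 hours above the mean, while the student who studied 15 hours is 7 hours below it. Centering makes these comparisons immediate without changing the spread of the data.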
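Task 3 asks for both series on the same graph. The sketch below is one possible way to do that in base R, assuming a shared y-axis is acceptable: it draws the original values first and overlays the centered values with points(). The axis labels, title, and legend position are illustrative choices, not part of the lab template.

```r
study_hours <- c(15, 22, 18, 25, 20, 28, 24, 19, 23, 26)
mean_study_hours <- mean(study_hours)
centered_study_hours <- study_hours - mean_study_hours

# y-axis range wide enough to hold both series
y_range <- range(c(study_hours, centered_study_hours))

plot(study_hours, col = "blue", pch = 2, ylim = y_range,
     xlab = "Student", ylab = "Hours per week",
     main = "Original vs. mean-centered study hours")
points(centered_study_hours, col = "red", pch = 3)
abline(h = mean_study_hours, col = "blue", lty = 2)  # mean of the original values
abline(h = 0, col = "red", lty = 2)                  # mean of the centered values
legend("right", legend = c("Original", "Mean-centered"),
       col = c("blue", "red"), pch = c(2, 3))

# Centering shifts every value by the same amount, so the centered values
# sum to zero and the standard deviation is unchanged
sum(centered_study_hours)
all.equal(sd(study_hours), sd(centered_study_hours))
```

On the shared axis the two point clouds have the same shape; the centered one is simply shifted down so that its reference line sits at 0 instead of at the mean.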

Exercise 2: Calculating Z-Scores

Dataset: Simulated data on students’ reaction times (in milliseconds):

Tasks:
1. Calculate the mean and standard deviation of the reaction times.
2. Compute the Z-scores for each reaction time.
3. Plot the Z-scores on a line graph.
4. Interpretation: Discuss what a Z-score greater than 0 or less than 0 indicates about a reaction time relative to the average.

reaction_times <- c(350, 420, 310, 390, 370, 450, 380, 340, 400, 360)
# Calculate the mean and standard deviation of the reaction times
mean(reaction_times)
## [1] 377
sd(reaction_times)
## [1] 40.56545
# Compute the Z-scores
z_reaction_times <- scale(reaction_times)
reaction_times
##  [1] 350 420 310 390 370 450 380 340 400 360
z_reaction_times
##              [,1]
##  [1,] -0.66559107
##  [2,]  1.06001542
##  [3,] -1.65165193
##  [4,]  0.32046978
##  [5,] -0.17256065
##  [6,]  1.79956105
##  [7,]  0.07395456
##  [8,] -0.91210629
##  [9,]  0.56698499
## [10,] -0.41907586
## attr(,"scaled:center")
## [1] 377
## attr(,"scaled:scale")
## [1] 40.56545
# Plot the Z-scores
# abline(h= 0)
plot(z_reaction_times, type = "b", col = "purple")  # type = "b" connects the points with lines
abline(h = 0)

Interpretation: For reaction time, a positive z-score indicates a slower-than-average response and a negative z-score indicates a faster-than-average response. For example, given z-scores of 1.5 and -0.8, the participant with the z-score of -0.8 responded faster, with a reaction time 0.8 standard deviations below the mean. The individual with the z-score of 1.5 responded slower, with a reaction time 1.5 standard deviations above the mean.
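To make the definition concrete, the following sketch computes the z-scores by hand and checks them against scale(). The names z_manual and z_scaled are introduced here for illustration only.

```r
reaction_times <- c(350, 420, 310, 390, 370, 450, 380, 340, 400, 360)

# Z-score by hand: distance from the mean in units of the sample SD
z_manual <- (reaction_times - mean(reaction_times)) / sd(reaction_times)

# scale() returns a one-column matrix; as.numeric() drops it to a plain vector
z_scaled <- as.numeric(scale(reaction_times))

# The two approaches should agree
all.equal(z_manual, z_scaled)

# Label each participant as slower (z > 0) or faster (z < 0) than average
ifelse(z_manual > 0, "slower", "faster")
```

Converting the scale() result with as.numeric() also removes the "scaled:center" and "scaled:scale" attributes, which can make the printed output easier to read.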

Exercise 3: Non-Linear Transformations

Dataset: Simulated data on annual sales figures (in thousands of dollars):

Tasks:
1. Apply a logarithmic transformation to the sales data.
2. Apply a square root transformation to the sales data.
3. Plot histograms of the original and transformed sales data.
4. Interpretation: Compare the distributions of the original and transformed data. Explain how each transformation affects the spread and shape of the data.

sales <- c(200, 450, 700, 1200, 300, 800, 1100, 900, 400, 1500)
# Apply a logarithmic transformation
log_sales <- log(sales)
sales
##  [1]  200  450  700 1200  300  800 1100  900  400 1500
log_sales
##  [1] 5.298317 6.109248 6.551080 7.090077 5.703782 6.684612 7.003065 6.802395
##  [9] 5.991465 7.313220
# Apply a square root transformation
sqrt_sales <- sqrt(sales)
sales
##  [1]  200  450  700 1200  300  800 1100  900  400 1500
sqrt_sales
##  [1] 14.14214 21.21320 26.45751 34.64102 17.32051 28.28427 33.16625 30.00000
##  [9] 20.00000 38.72983
# Plot histograms of the original and transformed sales data
#use hist()
hist(sales, col = "green")

hist(log_sales, col = "pink", main = "Log Sales")

hist(sqrt_sales, col = "orange", main = "Square Root Sales")

qqnorm(sales)
qqline(sales)

# qqline() adds a line to an existing plot, so qqnorm() must be called first
qqnorm(log_sales)
qqline(log_sales)

qqnorm(sqrt_sales)
qqline(sqrt_sales)

Interpretation: The original data had a mild positive (right) skew.

  • Logarithmic Transformation: The log transformation is the more severe of the two. It compresses large values strongly, so it removed the right skew but over-corrected, leaving the data skewed to the left.

  • Square Root Transformation: The square root transformation is less extreme than the log transformation, and it did a good job of normalizing the data.
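The visual comparison can be backed up with a number. Base R has no built-in skewness function, so the sketch below defines a hypothetical helper, skewness_simple(), using the moment-based formula (third central moment divided by the second central moment raised to the power 3/2). Other definitions of sample skewness exist and give slightly different values.

```r
sales <- c(200, 450, 700, 1200, 300, 800, 1100, 900, 400, 1500)

# Hypothetical helper (not part of base R): moment-based skewness.
# Positive values indicate a right skew, negative values a left skew.
skewness_simple <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

skew <- c(
  original = skewness_simple(sales),
  log      = skewness_simple(log(sales)),
  sqrt     = skewness_simple(sqrt(sales))
)
round(skew, 2)
```

A value closer to 0 suggests a more symmetric distribution, so the three numbers can be read alongside the histograms and Q-Q plots.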

Exercise 4: Combining Transformations

Dataset: Simulated data on daily step counts:

Tasks:
1. Mean-center the step counts.
2. Calculate the Z-scores for the step counts.
3. Plot the original, mean-centered, and Z-scores on separate graphs.
4. Interpretation: Explain how the combination of mean-centering and Z-scores helps in understanding the step count data compared to looking at the original data alone.

step_counts <- c(8000, 10500, 9200, 11500, 10000, 12500, 11000, 9500, 10200, 12000)
# Mean-center the step counts
center_steps <- step_counts - mean(step_counts)
step_counts
##  [1]  8000 10500  9200 11500 10000 12500 11000  9500 10200 12000
center_steps
##  [1] -2440    60 -1240  1060  -440  2060   560  -940  -240  1560
# Calculate the Z-scores for the step counts
z_steps <- scale(step_counts)
step_counts
##  [1]  8000 10500  9200 11500 10000 12500 11000  9500 10200 12000
z_steps
##              [,1]
##  [1,] -1.78888109
##  [2,]  0.04398888
##  [3,] -0.90910351
##  [4,]  0.77713687
##  [5,] -0.32258511
##  [6,]  1.51028486
##  [7,]  0.41056287
##  [8,] -0.68915911
##  [9,] -0.17595552
## [10,]  1.14371086
## attr(,"scaled:center")
## [1] 10440
## attr(,"scaled:scale")
## [1] 1363.981
# Plot the original
plot(step_counts)

# Plot the mean-centered
plot(center_steps)
abline(h = 0)

# Plot the Z-scores
plot(z_steps)
abline(h = 0)

Interpretation: Mean centering is useful because we can quickly see who has higher vs. lower daily steps compared to the average. An advantage of this is that the values remain in the original units (steps). Z-scores are useful because they can help us identify outliers, and because they are unit-free we can also use them to compare step counts with variables measured on other scales.
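The link between the two transformations can be shown directly: dividing the mean-centered values by the standard deviation gives the z-scores. The sketch below also flags possible outliers using a cutoff of |z| > 2, which is an assumed rule of thumb and not something the lab specifies; outlier_cutoff is a name introduced here for illustration.

```r
step_counts <- c(8000, 10500, 9200, 11500, 10000, 12500, 11000, 9500, 10200, 12000)

center_steps <- step_counts - mean(step_counts)  # still measured in steps
z_steps <- center_steps / sd(step_counts)        # unit-free

# Should match the output of scale()
all.equal(z_steps, as.numeric(scale(step_counts)))

# Assumed rule of thumb: treat |z| > 2 as a possible outlier
outlier_cutoff <- 2
step_counts[abs(z_steps) > outlier_cutoff]
```

With this cutoff no day in the sample is flagged, since the largest absolute z-score in the output above is about 1.79.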

Submission Instructions:

Be sure to knit your document to HTML, checking that all content displays correctly before submission. Submit the RPubs link to Canvas Assignments.