BMED3603 Biostatistics Assignment 2: Practical Work (R and RStudio)

Instructions:

This assignment is due on 2024-03-12 (Tue) 23:59, and is out of 100 marks (10% of final grade).
Data is available on Moodle as “nmj_formation.csv”.
All statistical tests should be assessed at a significance level of 0.05.
All your code, output, and answers to questions should be inputted into this single R markdown (.Rmd) file.
All your input should be made within the code/input cells. If you wish to open new input cells, you may do so by 1) clicking “Insert” at the top of the code editor -> “Executable cell” -> “R” for R code input cells, or 2) clicking “Insert” at the top of the code editor -> “Code Block” -> “OK” for general text input.
Please rename your R markdown file as “yourUID-yourName-A2.Rmd”, and knit your Rmd file to the HTML format by clicking the “Knit” button at the top of the code editor.
You should upload the HTML file for submission to Moodle.

Scenario:

You are a research assistant in a laboratory specializing in molecular neuroscience research. Your senior, with an interest in neuromuscular junctions (NMJs), has recently acquired some data from cell culture and imaging work. After performing the quantification, they have passed the results (“nmj_formation.csv”) to you for statistical analysis. Answer the questions below using this dataset.

Background

A neuromuscular junction (NMJ) is a synapse formed between a skeletal muscle and a nerve, and is essential for all muscle contractions (voluntary and involuntary).
Muscle contractions are triggered by the binding of acetylcholine (ACh), a chemical messenger, to acetylcholine receptors (AChRs) at NMJs. Healthy NMJs should contain a high density of AChRs.
Your senior would like to further investigate the pathology of Myasthenia Gravis (MG) and Duchenne Muscular Dystrophy (DMD), which are both incurable muscle diseases that impair muscle strength and usability. In particular, they wish to determine if the diseases impact AChR clustering at NMJs.

Experimental approach

Muscle cells were collected from 3 groups of randomly sampled patients (Healthy, MG, DMD), and cultured together with healthy nerves to produce NMJ cultures.
After the cultures were grown for a few days, fluorescence staining was performed to visualize the AChRs formed at NMJs. Images of the AChR clusters were captured by fluorescence imaging.

Dataset information

“nmj_formation.csv” has 68 independent observations and 8 variables:

Condition: 3 groups of randomly sampled patients (Healthy, MG, DMD).
Track_area: Area (in square micrometer, µm²) of nerve-muscle contact, corresponds to area of NMJ.
AChR_area: Area (in square micrometer, µm²) of AChR clustering.
AChR_norm_int: AChR fluorescence intensity per unit area of nerve-muscle contact. Calculated as the raw fluorescence intensity of AChRs measured (AChR_raw_int) divided by the NMJ area (Track_area).
AChR_raw_int: The raw fluorescence intensity of AChRs measured.
Thres_min: Minimum threshold value to filter out noise (background) for AChR area and intensity quantification.
Thres_max: Maximum threshold value to filter out noise (artifacts) for AChR area and intensity quantification.
Morphology: 3 classifications of muscle cell morphology (in order of complexity: round, branched, spindle).

Please input your student information here:
Name: Chu Chi GAi
UID: 3035928926

Initialization (5 marks):

This step serves to set up a proper R environment. Write scripts to:

Install (if needed) and load the packages required (dplyr, ggplot2). (3 marks)
Import the dataset “nmj_formation.csv” and assign it to a variable of your choice. (2 marks)

Reminder: You do not need to set the working directory in an R markdown file.

# Write your codes for the "initialization" section here
library(dplyr)    # loading the dplyr package

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)  # loading the ggplot2 package

`nmj_formation.(1)` <- read.csv("~/BMED 3603 R studio/BMED 3603(1)/nmj_formation (1).csv")
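The task asks for the packages to be installed only if needed, while the script above only loads them. A minimal sketch of one way to do this is shown below. The helper names `missing_packages` and `install_and_load` are hypothetical, and the sketch assumes a CRAN mirror is configured when installation is actually required.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install whatever is missing, then load every package.
install_and_load <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage for this assignment (left commented so the sketch has no side effects):
# install_and_load(c("dplyr", "ggplot2"))
```

Checking with `requireNamespace()` first avoids reinstalling packages every time the document is knitted.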

Question 1 (10 marks):

We want to first perform some exploratory data analysis. Write scripts to:

(a) Display the first 20 rows of the dataset. (1 mark)

# Write your codes for Question 1a here
head(`nmj_formation.(1)`, 20)

##    Condition Track_area AChR_area AChR_norm_int AChR_raw_int Thres_min
## 1    Healthy       2400      2659      7182.616     17238279      4214
## 2    Healthy        704      1020      6151.220      4330459      4136
## 3    Healthy       4436      2630      5368.273     23813661      5552
## 4    Healthy        632      1646      6491.331      4102521      4245
## 5    Healthy       1815      7705      5645.419     10246436      3791
## 6    Healthy        581       892      8943.172      5195983      3421
## 7    Healthy       1803      2077      6801.707     12263478      3698
## 8    Healthy        605      1171      6513.418      3940618      5085
## 9    Healthy       8474      4186      7755.468     65719836      3976
## 10   Healthy       1587       777      6386.356     10135147      4623
## 11   Healthy       2064      1270      5792.195     11955090      5362
## 12   Healthy       2384      2525      6448.973     15374352      6723
## 13        MG       3717      3841      4378.818     16276066      5232
## 14        MG        475       271      4740.349      2251666      5921
## 15        MG       7965      4769      4627.730     36859869      5508
## 16        MG       1944      1161      4499.120      8746290      7022
## 17        MG       1082       553      4635.850      5015990      7435
## 18        MG       1116      2487      5658.432      6314810      3993
## 19        MG       4870      2909      5391.479     26256505      5783
## 20        MG       6186      4755      4553.474     28167790      3030
##    Thres_max Morphology
## 1      65535    spindle
## 2      65535      round
## 3      65535    spindle
## 4      65535    spindle
## 5      65535      round
## 6      65535    spindle
## 7      65535      round
## 8      65535    spindle
## 9      65535   branched
## 10     65535   branched
## 11     65535   branched
## 12     65535   branched
## 13     65535      round
## 14     65535    spindle
## 15     65535    spindle
## 16     65535   branched
## 17     65535    spindle
## 18     65535      round
## 19     65535    spindle
## 20     65535    spindle

(b) Determine if there are missing values in the whole dataset. Your script should output the total count of missing values. (2 marks)

# Write your codes for Question 1b here
sum(is.na(`nmj_formation.(1)`))

## [1] 0

(c) Reveal the data types for each column. (1 mark)

# Write your codes for Question 1c here
str(`nmj_formation.(1)`)

## 'data.frame':    68 obs. of  8 variables:
##  $ Condition    : chr  "Healthy" "Healthy" "Healthy" "Healthy" ...
##  $ Track_area   : int  2400 704 4436 632 1815 581 1803 605 8474 1587 ...
##  $ AChR_area    : int  2659 1020 2630 1646 7705 892 2077 1171 4186 777 ...
##  $ AChR_norm_int: num  7183 6151 5368 6491 5645 ...
##  $ AChR_raw_int : int  17238279 4330459 23813661 4102521 10246436 5195983 12263478 3940618 65719836 10135147 ...
##  $ Thres_min    : int  4214 4136 5552 4245 3791 3421 3698 5085 3976 4623 ...
##  $ Thres_max    : int  65535 65535 65535 65535 65535 65535 65535 65535 65535 65535 ...
##  $ Morphology   : chr  "spindle" "round" "spindle" "spindle" ...

(d) With reference to (c), what are the data types for AChR_norm_int, Thres_min, and Morphology? (3 marks)

# Write your answers for Question 1d here
AChR_norm_int: numeric , 
Thres_min: integer ,
Morphology: character

(e) Reveal the descriptive statistics (mean, median, 1st and 3rd quartiles, min, max) for each column. (1 mark)

# Write your codes for Question 1e here

summary(`nmj_formation.(1)`)

##   Condition           Track_area      AChR_area     AChR_norm_int 
##  Length:68          Min.   :  475   Min.   :  209   Min.   :2202  
##  Class :character   1st Qu.: 1120   1st Qu.: 1137   1st Qu.:4223  
##  Mode  :character   Median : 2166   Median : 2087   Median :5137  
##                     Mean   : 3372   Mean   : 2454   Mean   :5191  
##                     3rd Qu.: 5006   3rd Qu.: 3526   3rd Qu.:6497  
##                     Max.   :13205   Max.   :10299   Max.   :8943  
##   AChR_raw_int        Thres_min      Thres_max      Morphology       
##  Min.   : 1679573   Min.   :2033   Min.   :65535   Length:68         
##  1st Qu.: 6037586   1st Qu.:3370   1st Qu.:65535   Class :character  
##  Median :10608206   Median :4079   Median :65535   Mode  :character  
##  Mean   :16788745   Mean   :4285   Mean   :65535                     
##  3rd Qu.:24101421   3rd Qu.:5245   3rd Qu.:65535                     
##  Max.   :72082018   Max.   :7435   Max.   :65535

(f) Based on your answers for 1(a)-(e), do you think the dataset is ready for analysis? Explain. (2 marks)

# Write your answers for Question 1f here
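I think the dataset is not yet ready for analysis. There are no missing values, but Condition and Morphology are stored as character columns, so they should first be converted to factors with their levels in a meaningful order (Healthy, MG, DMD and round, branched, spindle) before the groups are compared.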

Question 2 (10 marks):

Your senior asked you to perform some data manipulation. Write scripts to:

(a) Convert the Condition and Morphology columns into factor variables. (2 marks)

# Write your codes for Question 2a here
`nmj_formation.(1)`$Condition <- as.factor(`nmj_formation.(1)`$Condition)
`nmj_formation.(1)`$Morphology <- as.factor(`nmj_formation.(1)`$Morphology)

(b) What are the default orders for the Condition and Morphology columns after (a)? (2 marks)

# Write your answers for Question 2b here
Condition: "DMD"     "Healthy" "MG"     
Morphology: "branched" "round"    "spindle"

(c) Reorder the factor levels so that the Condition factor levels are ordered “Healthy”, “MG”, and “DMD”, and the Morphology factor levels are ordered “round”, “branched”, and “spindle”. You should also show proof that the levels are now ordered in the desired sequence. (6 marks)

# Write your codes for Question 2c here 
`nmj_formation.(1)`$Condition  <- ordered(`nmj_formation.(1)`$Condition, levels = c("Healthy", "MG", "DMD"))
`nmj_formation.(1)`$Morphology  <- ordered(`nmj_formation.(1)`$Morphology, levels = c("round","branched", "spindle"))


levels(`nmj_formation.(1)`$Condition )

## [1] "Healthy" "MG"      "DMD"

levels(`nmj_formation.(1)`$Morphology )

## [1] "round"    "branched" "spindle"
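Note that `ordered()` creates an ordered factor, which tells R that the levels are ranked. That fits Morphology (ordered by complexity), but Condition is a nominal grouping, where `factor()` with an explicit `levels` argument sets the display order without implying a ranking. The sketch below illustrates the difference on a few made-up values, not on the assignment data.

```r
# Made-up values for illustration only.
condition <- c("Healthy", "MG", "DMD", "MG", "Healthy")

# Plain factor: levels appear in the requested order, but are not ranked.
cond_factor <- factor(condition, levels = c("Healthy", "MG", "DMD"))

# Ordered factor: same level order, and R additionally treats the levels as ranked.
cond_ordered <- ordered(condition, levels = c("Healthy", "MG", "DMD"))

levels(cond_factor)       # level order of the plain factor
is.ordered(cond_factor)   # FALSE
is.ordered(cond_ordered)  # TRUE
```

Both versions give the same `levels()` output, so the proof of ordering shown above is unchanged by the choice.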

Question 3 (20 marks):

In order to analyse the effect of MG or DMD on the fluorescence intensity of AChR, normality should be assessed before we run the appropriate statistical test(s).

(a) Which column of AChR fluorescence intensity data would you use to assess the effect of MG or DMD on the fluorescence intensity of AChR, AChR_norm_int or AChR_raw_int? Justify. (2 marks)

# Write your answers for Question 3a 
AChR_norm_int, because it expresses the AChR fluorescence intensity per unit area of nerve-muscle contact. This removes the effect of differing NMJ areas (Track_area), so that intensities from NMJs of different sizes can be compared fairly.

(b) Plot a density plot for your selected column of AChR fluorescence intensity data in (a). The density plot should include a title, appropriate labels for the x and y-axes. All data within the column, regardless of condition, should be presented within one density curve. (5 marks)

Hint: Check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support. Your codes should feature geom_density().

# Write your codes for Question 3b here
`nmj_formation.(1)` %>% ggplot(aes(AChR_norm_int)) +
      geom_density(fill = 'blue') + 
      labs (title = "Distribution of AChR fluorescence intensity ", 
            x = "AChR_norm_int", 
            y = "Density")

(c) Plot a histogram plot for your selected column of AChR fluorescence intensity data in (a). The histogram plot should include a title, appropriate labels for the x and y-axes, and the histograms should be coloured by condition. (5 marks)

Hint: Check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support. Your codes should feature geom_histogram().

# Write your codes for Question 3c here 
`nmj_formation.(1)` %>% ggplot(aes(AChR_norm_int, fill=Condition)) +
      geom_histogram() + 
      labs (title = "Distribution of AChR fluorescence intensity ", 
            x = "AChR fluorescence intensity", 
            y = "Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(d) Plot a Q-Q plot for your selected column of AChR fluorescence intensity data in (a). The Q-Q plot should include a title, appropriate labels for the x and y-axes, and a Q-Q reference line. (5 marks)

Hint: Check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support. Your codes should feature stat_qq() and stat_qq_line.

`nmj_formation.(1)` %>% ggplot(aes(sample =AChR_norm_int)) +
      stat_qq() + 
      stat_qq_line() +
      labs (title = "Distribution of AChR fluorescence intensity", 
            x = "Theoretical", 
            y = "Sample")

(e) Assess, with an appropriate statistical test and p-value evidence, whether your selected column of AChR fluorescence intensity data in (a) follows the normal distribution. (3 marks)

# Write your codes for Question 3e here 
shapiro.test(`nmj_formation.(1)`$AChR_norm_int)

## 
##  Shapiro-Wilk normality test
## 
## data:  `nmj_formation.(1)`$AChR_norm_int
## W = 0.97076, p-value = 0.1103

# Write your answers for Question 3e here 
For the Shapiro-Wilk normality test:

H0: The data follows a normal distribution.
Ha: The data does not follow a normal distribution.

As the p-value (0.1103) is larger than 0.05, the null hypothesis cannot be rejected.
There is no evidence against normality, so the data is treated as normally distributed.
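The test above pools all three conditions into one sample. Because ANOVA assumes normality within each group, it can also be informative to run the test per condition. The sketch below shows one way to do that with `tapply()`. It uses a simulated stand-in data frame (`toy`, with made-up means and spread), so its p-values say nothing about the real data.

```r
# Simulated stand-in data; the group means and spread are made up.
set.seed(1)
toy <- data.frame(
  Condition = rep(c("Healthy", "MG", "DMD"), each = 20),
  AChR_norm_int = c(rnorm(20, 6500, 800),
                    rnorm(20, 4200, 800),
                    rnorm(20, 3500, 800))
)

# Shapiro-Wilk p-value for each condition separately.
p_by_group <- tapply(toy$AChR_norm_int, toy$Condition,
                     function(v) shapiro.test(v)$p.value)
p_by_group
```

With the assignment data, the same call would use the imported data frame in place of `toy`.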

Question 4 (15 marks):

Your senior asked you to first assess the effect of MG or DMD on the normalized AChR fluorescence intensity. Assuming the column data for AChR_norm_int is normally distributed, write scripts to:

(a) Calculate the variances of the column data for AChR_norm_int for each condition. (3 marks)

Hint: Use var() and think about how to index dataframes.

# Write your codes for Question 4a here 
var(`nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "MG"])

## [1] 1192963

var(`nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "DMD"])

## [1] 749032.6

var(`nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "Healthy"])

## [1] 620593.5

(b) Assess, with an appropriate statistical test and p-value evidence, whether the data is of equal variances. (2 marks)

# Write your codes for Question 4b here 

#var.test(`nmj_formation.(1)`$AChR_norm_int)

var.test(`nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "MG"], `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "DMD"])

## 
##  F test to compare two variances
## 
## data:  `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "MG"] and `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "DMD"]
## F = 1.5927, num df = 19, denom df = 18, p-value = 0.3288
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.618171 4.054475
## sample estimates:
## ratio of variances 
##           1.592671

var.test(`nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "MG"], `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "Healthy"])

## 
##  F test to compare two variances
## 
## data:  `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "MG"] and `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "Healthy"]
## F = 1.9223, num df = 19, denom df = 28, p-value = 0.1132
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8541493 4.6340519
## sample estimates:
## ratio of variances 
##           1.922293

var.test(`nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "DMD"], `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "Healthy"])

## 
##  F test to compare two variances
## 
## data:  `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "DMD"] and `nmj_formation.(1)`$AChR_norm_int[`nmj_formation.(1)`$Condition == "Healthy"]
## F = 1.207, num df = 18, denom df = 28, p-value = 0.6389
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.531601 2.970739
## sample estimates:
## ratio of variances 
##           1.206962

# Write your answers for Question 4b here 

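H0: The two groups compared have equal variances.
Ha: The two groups compared do not have equal variances.
All three pairwise F tests have p-values larger than 0.05 (MG vs DMD: 0.3288, MG vs Healthy: 0.1132, DMD vs Healthy: 0.6389), so the null hypothesis cannot be rejected in any comparison. There is no evidence that the variances differ between the conditions.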
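An alternative to three pairwise F tests is a single test of equal variances across all groups, such as Bartlett's test (`bartlett.test()` in the stats package). The sketch below runs it on a simulated stand-in data frame (`toy`), so the printed p-value is not a result for the real data.

```r
# Simulated stand-in data; values are made up for illustration.
set.seed(2)
toy <- data.frame(
  Condition = factor(rep(c("Healthy", "MG", "DMD"), each = 20)),
  AChR_norm_int = rnorm(60, mean = 5000, sd = 900)
)

# One test for homogeneity of variances across all three conditions.
bt <- bartlett.test(AChR_norm_int ~ Condition, data = toy)
bt$p.value
```

A single test avoids the multiple-comparison issue that arises when several pairwise tests are each assessed at 0.05.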

(c) Assuming the column data for AChR_norm_int is of equal variances, assess, with appropriate statistical tests and p-value evidence:

Whether the different conditions have an effect on the normalized AChR fluorescence intensity. (3 marks)
Which condition(s) have significantly affected the normalized AChR fluorescence intensity, and how the condition(s) have affected the normalized AChR fluorescence intensity. (3 marks)

# Write your codes for Question 4c here 

AChR_norm_int_Condition.aov <- aov(AChR_norm_int ~ Condition, data = `nmj_formation.(1)`)

summary(AChR_norm_int_Condition.aov)

##             Df    Sum Sq  Mean Sq F value Pr(>F)    
## Condition    2 131060416 65530208   79.58 <2e-16 ***
## Residuals   65  53525494   823469                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(AChR_norm_int_Condition.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = AChR_norm_int ~ Condition, data = `nmj_formation.(1)`)
## 
## $Condition
##                   diff       lwr         upr     p adj
## MG-Healthy  -2395.5717 -3028.211 -1762.93208 0.0000000
## DMD-Healthy -3125.1453 -3767.563 -2482.72806 0.0000000
## DMD-MG       -729.5736 -1426.863   -32.28448 0.0382685

# Write your answers for Question 4c here 

A one-way ANOVA is used.
H0: The mean normalized AChR fluorescence intensity is the same in all three conditions.
Ha: The mean normalized AChR fluorescence intensity differs in at least one condition.

As Pr(>F) < 2e-16 (< 0.05), the null hypothesis is rejected.
The condition has a significant effect on the normalized AChR fluorescence intensity.

Tukey's HSD test is then used as the post-hoc test.
All three pairwise comparisons have adjusted p-values below 0.05. Compared with Healthy, the mean normalized AChR fluorescence intensity is lower by about 2396 in MG and by about 3125 in DMD (p adj < 0.001 for both), so both diseases reduce the normalized AChR fluorescence intensity. DMD is also lower than MG by about 730 (p adj = 0.0383).

(d) Identify the outlier(s), if any, in the column data for AChR_norm_int. You only need to state the row number that corresponds to the outlier data. (2 marks)

Hint: Utilize the outputs you have obtained from (c).

# Write your codes for Question 4d here 

plot(AChR_norm_int_Condition.aov, 1)

# Write your answers for Question 4d here

The row numbers that correspond to the outlier data are 6, 46, and 62.

(e) Assuming the column data for AChR_norm_int is not normally distributed, assess, with an appropriate statistical test and p-value evidence, whether the different conditions have an effect on the normalized AChR fluorescence intensity. (2 marks)

# Write your codes for Question 4e here  
kruskal.test(AChR_norm_int ~ Condition, data = `nmj_formation.(1)`)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  AChR_norm_int by Condition
## Kruskal-Wallis chi-squared = 49.813, df = 2, p-value = 1.525e-11

# Write your answers for Question 4e here

The Kruskal-Wallis test is used.
H0: The distribution of normalized AChR fluorescence intensity is the same in all three conditions.
Ha: The distribution of normalized AChR fluorescence intensity differs in at least one condition.

As the p-value (1.525e-11) is much less than 0.05, the null hypothesis is rejected. The different conditions have an effect on the normalized AChR fluorescence intensity.
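The Kruskal-Wallis test only indicates that at least one condition differs. A common non-parametric follow-up is a set of pairwise Wilcoxon rank-sum tests with a multiplicity correction, available as `pairwise.wilcox.test()` in the stats package. The sketch below uses a simulated stand-in data frame (`toy`) and the Holm correction, both chosen for illustration only.

```r
# Simulated stand-in data; the group means and spread are made up.
set.seed(3)
toy <- data.frame(
  Condition = factor(rep(c("Healthy", "MG", "DMD"), each = 20),
                     levels = c("Healthy", "MG", "DMD")),
  AChR_norm_int = c(rnorm(20, 6500, 800),
                    rnorm(20, 4200, 800),
                    rnorm(20, 3500, 800))
)

# Pairwise rank-sum tests with Holm-adjusted p-values.
pw <- pairwise.wilcox.test(toy$AChR_norm_int, toy$Condition,
                           p.adjust.method = "holm")
pw$p.value  # lower-triangular matrix of adjusted p-values
```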

Question 5 (12 marks):

As previous literature have also suggested a role of MG or DMD on muscle development and integrity, your senior has further asked you to assess the effect of MG or DMD on the muscle cell morphology. Write scripts to:

(a) Plot a bar graph for the Morphology data. The bar graph should include a title, appropriate labels for the x and y-axes, and should display the distribution of muscle cell morphology within each condition. (3 marks)

Hint: Check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support. Your codes should feature geom_bar().

# Write your codes for Question 5a here  
`nmj_formation.(1)` %>% ggplot(aes(x = Condition, fill = Morphology)) +
      geom_bar() + 
      labs(title = "Distribution of muscle cell morphology within each condition",
           x = "Condition",
           y = "Count")

(b) Briefly describe two observations from the bar graph generated in (a). (2 marks)

# Write your answers for Question 5b here 
1. Spindle is the most common morphology in the MG condition.
2. Branched is the most common morphology in the Healthy condition.

(c) Assess, with an appropriate statistical test and p-value evidence, whether MG or DMD has an effect on muscle cell morphology. (4 marks)

Hint: Think about what statistical test you should use to assess the relationship between two categorical variables.

# Write your codes for Question 5c here  

chi_sq <- chisq.test(table(`nmj_formation.(1)`$Morphology, `nmj_formation.(1)`$Condition))

## Warning in chisq.test(table(`nmj_formation.(1)`$Morphology,
## `nmj_formation.(1)`$Condition)): Chi-squared approximation may be incorrect

chi_sq

## 
##  Pearson's Chi-squared test
## 
## data:  table(`nmj_formation.(1)`$Morphology, `nmj_formation.(1)`$Condition)
## X-squared = 7.3697, df = 4, p-value = 0.1176

# Write your answers for Question 5c here 


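H0: Muscle cell morphology is independent of condition.
Ha: Muscle cell morphology is associated with condition.
As the p-value (0.1176) is larger than 0.05, the null hypothesis cannot be rejected. There is no evidence that MG or DMD has an effect on muscle cell morphology.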

(d) Your senior suggested running a Chi-square test to evaluate the effect of different conditions on muscle cell morphology. Display the observed and expected values. Is the chi-square test an appropriate test to run? Explain. (3 marks)

Hint: Think about the requirements that should be fulfilled in order for a Chi-square test to be reliably ran.

# Write your codes for Question 5d here  
chi_sq <- chisq.test(table(`nmj_formation.(1)`$Morphology, `nmj_formation.(1)`$Condition))

## Warning in chisq.test(table(`nmj_formation.(1)`$Morphology,
## `nmj_formation.(1)`$Condition)): Chi-squared approximation may be incorrect

chi_sq$expected

##           
##             Healthy       MG      DMD
##   round     7.25000 5.000000 4.750000
##   branched 10.66176 7.352941 6.985294
##   spindle  11.08824 7.647059 7.264706

# Write your answers for Question 5d here 
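A warning occurred when the chi-square test was run because not all expected counts are at least 5: the expected count for round morphology in the DMD condition is 4.75. The requirement that every expected cell count is at least 5 is therefore not met, so the chi-square approximation may be unreliable and the chi-square test is not the most appropriate test here. An exact test, such as Fisher's exact test, would be a safer choice.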
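The question also asks for the observed values, which are stored next to the expected values in the object returned by `chisq.test()`. The sketch below shows how both tables can be displayed and how Fisher's exact test (`fisher.test()`) can be run on the same contingency table. The counts in `toy_tab` are made up for illustration and are not the assignment data.

```r
# Made-up 3 x 3 contingency table for illustration only.
toy_tab <- matrix(c(4, 2, 3,
                    6, 4, 3,
                    5, 4, 3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Morphology = c("round", "branched", "spindle"),
                                  Condition = c("Healthy", "MG", "DMD")))

# The chi-square result object holds both observed and expected counts.
chi <- suppressWarnings(chisq.test(toy_tab))
chi$observed
chi$expected

# Check the rule of thumb: is any expected count below 5?
any(chi$expected < 5)

# Fisher's exact test does not rely on the large-sample approximation.
ft <- fisher.test(toy_tab)
ft$p.value
```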

Question 6 (28 marks):

Hint: This question tests your understanding of correlation and regression.

During the fluorescence imaging sessions, your senior observed that larger areas of nerve-muscle contact tend to have larger areas of AChR clustering. They became interested in the possible relationships between nerve-muscle contact area (Track_area) with 1) AChR clustering area (AChR_area), or 2) normalized AChR fluorescence intensity (AChR_norm_int), and have asked you to perform the appropriate relevant analyses. Write scripts to:

(ai) Assess, using a graphical approach of your choice, whether the data in the Track_area column follows normal distribution. The plot should include a title and appropriate labels for the x and y-axes. (2 marks)

Hint: Check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support.

# Write your codes for Question 6ai here   
`nmj_formation.(1)` %>% ggplot(aes(sample =Track_area)) +
      stat_qq() + 
      stat_qq_line() +
      labs (title = "Distribution of  nerve-muscle contact area", 
            x = "Theoretical", 
            y = "Sample")

# Write your answers for Question 6ai here  
I used a Q-Q plot to check normality. In the plot above, the data is skewed, so it does not follow a normal distribution.

(aii) Assess, using a graphical approach of your choice, whether the data in the AChR_area column follows normal distribution. The plot should include a title and appropriate labels for the x and y-axes. (2 marks)

Hint: Check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support.

# Write your codes for Question 6aii here   
`nmj_formation.(1)` %>% ggplot(aes(sample =AChR_area)) +
      stat_qq() + 
      stat_qq_line() +
      labs (title = "Distribution of  AChR clustering area", 
            x = "Theoretical", 
            y = "Sample")

# Write your answers for Question 6aii here  
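I used a Q-Q plot to check normality. Although many points lie close to the reference line, the data is right-skewed: in the summary from Question 1(e), the mean (2454) is larger than the median (2087) and the maximum (10299) is far above the 3rd quartile (3526). AChR_area therefore should not be treated as normally distributed.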

(b) After discussing your preliminary analyses in (a) with your senior, they suggested that you to attempt two data transformation approaches (1. Logarithmic, 2. Square root) for the AChR_area data.

Perform the data transformation as requested by your senior, and assess the normality for each set of the transformed data using one graphical approach and one statistical approach (i.e. one graphical and one statistical for logarithmic transformation + one graphical and one statistical approach for square root transformation). The graphical plots should include a title and appropriate labels for the x and y-axes.

Which transformation approach is better? Explain in brief with reference to your graphical and statistical outputs. (10 marks)

Hint: Think about how you can perform logarithmic or square root calculations for any numerical/integer value (e.g. 10) in RStudio, then simply apply the same approach to the AChR_area data column. For the graphical approach, check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support.

# Write your codes for Question 6b here  
`nmj_formation.(1)` %>% ggplot(aes(AChR_area)) +
      geom_density(fill = "blue", alpha = 0.4) + 
      labs (title = "Distribution of AChR clustering area", 
            x = "AChR clustering area", 
            y = "Density")

# log transformation with log()
`nmj_formation.(1)` %>% ggplot(aes(log(AChR_area))) +
      geom_density(fill = "blue", alpha = 0.4) + 
      labs (title = "Distribution of log of AChR clustering area", 
            x = "log of AChR clustering area", 
            y = "Density")

# sqrt transformation with sqrt()
`nmj_formation.(1)`%>% ggplot(aes(sqrt(AChR_area))) +
      geom_density(fill = "blue", alpha = 0.4) + 
      labs (title = "Distribution of sqrt of AChR clustering area", 
            x = "sqrt of AChR clustering area", 
            y = "Density")

`nmj_formation.(1)` %>% ggplot(aes(AChR_area^(1/3))) +
      geom_density(fill = "blue", alpha = 0.4) + 
      labs (title = "Distribution of cube_root of AChR clustering area", 
            x = "cube_root of AChR clustering area", 
            y = "Density")

shapiro.test((`nmj_formation.(1)`$AChR_area))

## 
##  Shapiro-Wilk normality test
## 
## data:  (`nmj_formation.(1)`$AChR_area)
## W = 0.87394, p-value = 5.483e-06

shapiro.test(log(`nmj_formation.(1)`$AChR_area))

## 
##  Shapiro-Wilk normality test
## 
## data:  log(`nmj_formation.(1)`$AChR_area)
## W = 0.96657, p-value = 0.06484

shapiro.test(sqrt(`nmj_formation.(1)`$AChR_area))

## 
##  Shapiro-Wilk normality test
## 
## data:  sqrt(`nmj_formation.(1)`$AChR_area)
## W = 0.97594, p-value = 0.2115

shapiro.test((`nmj_formation.(1)`$AChR_area)^(1/3))

## 
##  Shapiro-Wilk normality test
## 
## data:  (`nmj_formation.(1)`$AChR_area)^(1/3)
## W = 0.98745, p-value = 0.7295

# Write your answers for Question 6b here  
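Of the two transformations requested, the square root is better: its Shapiro-Wilk result (W = 0.97594, p-value = 0.2115) is closer to normality than that of the logarithmic transformation (W = 0.96657, p-value = 0.06484). Both p-values are larger than 0.05, so normality is not rejected for either, whereas it is rejected for the untransformed data (p-value = 5.483e-06).
I also tried a cube root transformation, which gives a W value even closer to 1 (W = 0.98745, p-value = 0.7295), so I will use the cube-root-transformed AChR clustering area in the following parts.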
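The four Shapiro-Wilk calls above can also be collected into one table so the transformations are easier to compare side by side. The sketch below shows the idea on a simulated right-skewed vector (`area`, a stand-in drawn from a log-normal distribution), so its numbers are not those of the real AChR_area column.

```r
# Simulated right-skewed stand-in for AChR_area; parameters are made up.
set.seed(4)
area <- rlnorm(68, meanlog = 7.5, sdlog = 0.8)

# Candidate transformations, each stored as a function.
transforms <- list(
  raw       = identity,
  log       = log,
  sqrt      = sqrt,
  cube_root = function(v) v^(1/3)
)

# Apply each transformation and keep the Shapiro-Wilk W and p-value.
results <- t(sapply(transforms, function(f) {
  st <- shapiro.test(f(area))
  c(W = unname(st$statistic), p_value = st$p.value)
}))
results
```

Each row of `results` corresponds to one transformation, with W in the first column and the p-value in the second.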

(c) Plot a scatter plot to visualize the relationship between Track_area (logarithmic-transformed) and AChR_area (your preferred transformation approach from (b)). The scatter plot should include a title, appropriate labels for the x and y-axes, and the points should be coloured by condition. Track_area should be on the x-axis.

Without considering the effect of different conditions, how would you describe the relationship between the two variables? (4 marks)

Hint: Check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support. Your codes should feature geom_point().

# Write your codes for Question 6c here   
`nmj_formation.(1)` %>%
  ggplot(aes(log(Track_area), AChR_area^(1/3), color = Condition)) +
  geom_point() +
  labs(title = "Relationship between log(Track_area) and cube root of AChR_area",
       x = "log(Track_area)",
       y = "Cube root of AChR_area")

# Write your answers for Question 6c here  
Without considering condition, the two variables show a positive, approximately linear relationship: larger values of log(Track_area) tend to go with larger values of the cube root of AChR_area.

(d) Plot a scatter plot to visualize the relationship between Track_area (logarithmic-transformed) and AChR_norm_int (no data transformation needed). The scatter plot should include a title, appropriate labels for the x and y-axes, and the points should be coloured by condition. Track_area should be on the x-axis.

Without considering the effect of different conditions, how would you describe the relationship between the two variables? (4 marks)

Hint: Check http://www.sthda.com/english/wiki/ggplot2-essentials for ggplot2 support. Your codes should feature geom_point().

# Write your codes for Question 6d here  
`nmj_formation.(1)` %>%
  ggplot(aes(log(Track_area), AChR_norm_int, color = Condition)) +
  geom_point() +
  labs(title = "Relationship between log(Track_area) and AChR_norm_int",
       x = "log(Track_area)",
       y = "AChR_norm_int")

# Write your answers for Question 6d here  
Without considering condition, the two variables do not show a clear linear relationship.

(e) Your senior asked you to provide quantitative evidence for the relationships you proposed in (c)-(d). Demonstrate, with the appropriate statistical analyses, that your proposed relationships are true. (4 marks)

Hint: Think about what statistical test you should use to assess the relationship between two continuous variables.

# Write your codes for Question 6e here 


nmj_formation_AChR_area_cor2 <- cor.test(log(`nmj_formation.(1)`$Track_area),`nmj_formation.(1)`$AChR_norm_int, method = "pearson")


nmj_formation_AChR_area_cor1 <- cor.test((`nmj_formation.(1)`$AChR_area)^(1/3),log( `nmj_formation.(1)`$Track_area), method = "pearson")


nmj_formation_AChR_area_cor1

## 
##  Pearson's product-moment correlation
## 
## data:  (`nmj_formation.(1)`$AChR_area)^(1/3) and log(`nmj_formation.(1)`$Track_area)
## t = 9.9188, df = 66, p-value = 1.046e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6562461 0.8544464
## sample estimates:
##       cor 
## 0.7736255

nmj_formation_AChR_area_cor2

## 
##  Pearson's product-moment correlation
## 
## data:  log(`nmj_formation.(1)`$Track_area) and `nmj_formation.(1)`$AChR_norm_int
## t = -1.5528, df = 66, p-value = 0.1253
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.40790385  0.05306323
## sample estimates:
##       cor 
## -0.187737

# Write your answers for Question 6e here 

Null hypothesis: the true correlation is equal to 0.
Alternative hypothesis: the true correlation is not equal to 0.

For the relationship between Track_area (logarithmic-transformed) and AChR_area (cube-root-transformed):

From Pearson's correlation test, r = 0.774 (95% confidence interval 0.656 to 0.854) and the p-value is 1.046e-14 < 0.05. The null hypothesis is therefore rejected.
There is a significant, strong positive linear correlation between the two variables.

For the relationship between Track_area (logarithmic-transformed) and AChR_norm_int:
From Pearson's correlation test, r = -0.188 (95% confidence interval -0.408 to 0.053) and the p-value is 0.1253 > 0.05. The null hypothesis cannot be rejected.
There is no evidence of a linear correlation between the two variables.

(f) Your senior would also like to predict future values of AChR_area (your preferred transformation approach from (b)) based on Track_area (logarithmic-transformed). Perform the appropriate analysis and state the equation that can be used for prediction. (2 marks)

Hint: First refer to (c) to get a preliminary understanding of the relationship between the two transformed variables. Is it linear? logarithmic? exponential?

Then think about what analysis you should perform to obtain the coefficients that build an equation. The equation should describe the said relationship (linear/logarithmic/exponential).

# Write your codes for Question 6f here 
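# Cube-root-transformed AChR_area (preferred transformation from 6b)
AChR_area_cube <- (`nmj_formation.(1)`$AChR_area)^(1/3)

# Note: as written, log(Track_area) is the response and AChR_area_cube is the predictor
lm_AChR_area <- lm(log(Track_area) ~ AChR_area_cube,
                   data = `nmj_formation.(1)`)
summary(lm_AChR_area)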

## 
## Call:
## lm(formula = log(Track_area) ~ AChR_area_cube, data = `nmj_formation.(1)`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7115 -0.2881  0.1506  0.3484  0.8902 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.16308    0.27115  19.041  < 2e-16 ***
## AChR_area_cube  0.20516    0.02068   9.919 1.05e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5542 on 66 degrees of freedom
## Multiple R-squared:  0.5985, Adjusted R-squared:  0.5924 
## F-statistic: 98.38 on 1 and 66 DF,  p-value: 1.046e-14

# Write your answers for Question 6f here 
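From the output above, the fitted linear equation is log(Track_area) = 5.16308 + 0.20516 × AChR_area^(1/3). In this model log(Track_area) is the response and the cube root of AChR_area is the predictor, so the equation predicts log(Track_area) from AChR_area. To predict the cube root of AChR_area from log(Track_area), as the question asks, the model has to be refitted with the two variables swapped, and the intercept and slope taken from that output.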
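For prediction of transformed AChR_area from log(Track_area), the cube root of AChR_area goes on the left-hand side of the model formula and log(Track_area) on the right. The sketch below shows that orientation and a prediction for a new Track_area value. It runs on a simulated stand-in data frame (`toy`) with made-up coefficients, and the column names `log_track` and `cbrt_achr` are introduced here for illustration, so none of its numbers apply to the real data.

```r
# Simulated stand-in data; the generating coefficients are made up.
set.seed(5)
toy <- data.frame(Track_area = rlnorm(68, meanlog = 7.7, sdlog = 0.9))
toy$AChR_area <- (2 + 1.5 * log(toy$Track_area) + rnorm(68, 0, 1.5))^3

# Transformed variables stored as columns so predict() can find them.
toy$log_track <- log(toy$Track_area)
toy$cbrt_achr <- toy$AChR_area^(1/3)

# Response = cube root of AChR_area, predictor = log(Track_area).
fit <- lm(cbrt_achr ~ log_track, data = toy)
coef(fit)  # intercept and slope of: cbrt_achr = intercept + slope * log_track

# Predict for a new NMJ with Track_area = 2000, then cube to return to the area scale.
new_obs <- data.frame(log_track = log(2000))
pred_cbrt <- predict(fit, newdata = new_obs)
pred_area <- pred_cbrt^3
pred_area
```

Cubing the prediction is a simple back-transformation to square micrometers. It gives an approximate value on the original scale rather than an exact mean.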

2024-02-27
