Table of Contents:

  1. Installing R & R Studio
  2. Installation of Packages
  3. Loading Packages
  4. Copying Data from Excel into R
  5. Exporting from R
  6. Viewing imported data
  7. Melting your data
  8. Descriptive statistics for Long version of data
  9. Proportions test
  10. Odds Ratio
  11. Some things you may have to do prior to generating a graph
  12. Creating a pie chart, bar graph, line graph or box plot
  13. Correlation
  14. Regression
  15. Testing for Homogeneity of variance
  16. Conducting a t-test in R
  17. One-way ANOVA - INDEPENDENT MEASURES
  18. One-Way REPEATED MEASURES (RM) ANOVA
  19. How to use R for multiple-select questions
  20. How to count number of occurences in a column
  21. Plotting map of US with data
  22. Additional Resources

Resources

Installing R & R Studio

PLEASE NOTE: There are times when you may have to re-download R (not necessarily R Studio). You will realize this is needed if you download a package and when you run the library you get a message that seems to indicate that the package is not for that version. All you have to do in this case is 1. close out of R Studio; 2. download the newer version of R and 3. install it.

For links to download R and R studio please go to the following:

  1. Step 1 - R MUST be installed first - Download R for Windows or Mac

  2. Step 2 - R Studio is installed after R has finished installing - Download R Studio for Windows or Mac. Please note: If clicking on the link does not work, copy and paste the following link into your browser: https://posit.co/downloads/

For Windows:

  1. Double click R-x.x.x-win.exe

    1. Click “OK” (for language)
    2. Click “Next” (for Information)
    3. Click “Next” (for Select Destination Location)
    4. Click “Next” (for Select Components)
    5. Click “Next” (for Startup options)
    6. Click “Next” (for Select Start Menu Folder)
    7. On the next screen:
      1. Make sure “Create a desktop shortcut” is unchecked
      2. Make sure “Create a Quick Launch shortcut” is unchecked
      3. Click “Next”
    8. Click “Finish”
  2. Double click RStudio-x.x.xxxx.exe

    1. Click “Next”
    2. Click “Next” (for Choose Install Location)
    3. Click “Install” (for Choose Start Menu Folder)
    4. Click “Finish”

[Step 3 is not necessary if you are installing the packages by running the commands below under “Installation of Packages”]

  1. Packages go into:
    1. C:\Users\YourName\Documents\R win-library\x.x
    2. Accept to overwrite current packages

For Mac:

  1. Double click R-x.x.x.pkg
    1. click “Continue” (for Introduction)
    2. click “Continue” (for Read Me)
    3. click “Continue” (for License)
    4. click “Agree”
    5. click “Install” (for Installation Type)
    6. if asked, provide password to allow installation of software
    7. click “Install Software”
    8. When installation is complete click “Close”
  2. Double click RStudio-x.x.xxxx.dmg
    1. in window that opens up, drag RStudio icon into Applications folder
    2. When installation is done, you can click on the RStudio icon in the Launchpad
    3. If asked “”RStudio” is an application downloaded from the Internet. Are you sure you want to open it?”
    4. If asked to install a “git” command - click “install” and then click on “agree”

[Step 3 is not necessary if you are installing the packages by running the commands below under “Installation of Packages”]

  1. Packages go into:
    1. Macintosh HD/Library/Frameworks/R.framework/Versions/version#/Resources/library
    2. Accept to overwrite current packages

Back to top

Resources for Installing the Analysis Toolpak in Excel:

If conducting more detailed analysis in Excel, it is necessary to install and load the Analysis Toolpak. For instructions on loading the Analysis ToolPak in Excel see:


General Notes for R


Installation of Packages

install.packages("ggplot2", dependencies = T) # used for generating graphs
install.packages("multcomp", dependencies = T) # used for post-hoc analysis
install.packages("pastecs", dependencies = T) # used for generating descriptive statistics
install.packages("reshape", dependencies = T) # used for generating a long version of the data from a wide version of the data
install.packages("reshape2", dependencies = T) # used to generate long form of the file; required in some cases
install.packages("nlme", dependencies = T) # used for generating ANOVAs
install.packages("car", dependencies = T)
install.packages("pwr", dependencies = T)
install.packages("dplyr", dependencies = T)
install.packages("devtools", dependencies = T)
install.packages("rms", dependencies=T) # for regression
install.packages("psych", dependencies=T)
install.packages("ggpubr", dependencies=T) # required for correlation plots

Back to top

Additional packages (not required for class)

install.packages("rmarkdown", dependencies=T) # for RMD files that allow embedding of graphs etc; not required for class
install.packages("jmv") # not required for class
install.packages("umx", dependencies=T) # For structural equation modelling
install.packages("semPlot", dependencies=T) # required for Structural Equation Modeling but not required for class
install.packages("lavaan", dependencies=T) # required for Structural Equation Modeling but not required for class; see https://lavaan.ugent.be/tutorial/index.html
install.packages("sem", dependencies=T) # required for Structural Equation Modeling but not required for class
install.packages("maps", dependencies=T) # required if making map of US - only lower 48 states
install.packages("usmap", dependencies=T) # required if making map of US - all 50 states
# install.packages("RVAideMemoire", dependencies = T) # required for the Fisher Test
install.packages("plotly", dependencies=T) # required for 3D plots
install.packages("compareGroups") # required for generation of table comparing groups
install.packages("lsmeans", dependencies=T) # for comparison of slopes
install.packages("effectsize", dependencies = T)

Loading packages

library(ggplot2);library(multcomp);library(pastecs);library(reshape); library(reshape2); library(nlme); library(car); library(pwr); library(dplyr); library(devtools); library(rms);library(psych); library(ggpubr)

Back to top

Additional library commands (not required for class)

library(umx); library(jmv); library(sem); library(lavaan); library(semPlot); library(qgraph); library(foreign); library(usmap); library(maps);  library(plotly); library(SEMfn); library(lsmeans); library(effectsize); library(roxygen2)

library(RVAideMemoire)

Copying Data from Excel into R

For PC

  1. Copy the command below (Ctrl+C)
  2. Paste command into R Studio Console (Ctrl+V)
  3. Change the name “object” to something that makes sense to you and is related to the data.
  4. In Excel, select the data you want to analyze; Include the header;
  5. Copy the data (Ctrl+C)
  6. Go back to R Studio and run (press “Enter”) the command you pasted earlier (in step 2) into the console
object = read.table(file="clipboard", sep="\t", header=TRUE)
  • If you have missing data points in your file that you still want imported into the dataframe, use the following command:
object = read.csv(file="clipboard", sep="\t", header=TRUE)

Back to top

For Mac

  1. Copy the command below (Cmd+C)
  2. Paste command into R Studio Console (Cmd+V)
  3. Change the name “object” to something that makes sense to you and is related to the data.
  4. In Excel, select the data you want to analyze; Include the header;
  5. Copy the data (Cmd+C)
  6. Go back to R Studio and run (press “Enter”) the command you pasted earlier (in step 2) into the console
object = read.table(pipe("pbpaste"), sep="\t", header = TRUE)
object = read.csv(pipe("pbpaste"), sep="\t", header = TRUE)

Back to top


Importing Data into R from Excel (xlsx) files

Important Notes:

  • If you are going to use this method, you need to make sure that your excel file has a short name with no spaces
  • If you wish to name the object containing the data something different from the original Excel name, type in the new name in the “Name:” box under “Import Options”**
  • If you have more than one sheet of data in the excel file, you will need to specify which sheet the data is coming from in the “Sheet:” box under “Import Options” section in the “Import Excel Data window”

Procedure:

  • Go to “File>Import Data Set>From Excel…”
  • R opens window called “Import Excel Data”
  • Click on “Browse…”
  • Select the Excel file you want to import
  • R will display the data in the “Data Preview” window; [see note above if your excel file contains more than one data sheet or you wish to name your object something different from the excel file name]
  • Click “Import” and the data is now imported and will be displayed

Importing Data into R from CSV files

object = read.csv(file.choose(), header=T)

Importing Data into R from SPSS files

library(foreign)
object = read.spss(file.choose(), to.data.frame=TRUE)

Exporting from R

Exporting output to a text file

sink("outputname.txt")

# RUN THE COMMANDS YOU WISH YO BE EXPORTED TO THE TEXT FILE

sink()
getwd() # gives working directory

# to import into excel, select the file and use the "space delimited" option, and then check the space checkbox.

Exporting dataframe to a csv file

write.csv(dataframe,"name.csv")
getwd() # gives working directory

Back to top


Viewing imported data

View(object) # allows you to see the data in another tab

Back to top


Melting your data

Performing a simple melting of data

objectL = melt(object, id="idname")
  • melt converts file from wide to long version
  • objectL: NEW object that will contain the long version of the file
  • object: name of the of the object containing the wide version of the data
  • id=“idname”: this is only used SOMETIMES; only when you have an ID colum (a column that identifies the subject in some way and does not change/is constant)
  • In case of a file with an ID column, replace the idname with the appropriate name of the column
  • If there is NO ID COLUMN; you can simply delete the part of the command from the comma onwards i.e., , id=“idname”’ but keep the close parenthesis.

Back to top

Melting data with multiple id variables and measure variables

object_L=melt(object, id.vars = c("idvar_1", "idvar_2", "idvar_n"), measure.vars = c("mvar_1", "mvar_2", "mvar_n"))

Back to top


Selecting specific subset to analyze

var_a = object$`outcome`[object$predictor1colname=="predictor1" & object$predictor2colname=="predictor2"]

Generating descriptive statistics in R

Weighted mean

wt <- c(w1,  w2,  w3,  w4, etc)/(wT) # the weight for the first value = x multiplied w1/wT (e.g. if weight for w1 is 25% w1 would be 25, wT would be 100)
x <- c(x1, x2, x3, x4, etc) # values to be weighted
xm <- weighted.mean(x, wt)

Mean

mean(c(x1, x2, x3, x4...))

Descriptive statistics for WIDE version of data to 2 decimal places

round(stat.desc(object, norm=T), digits=2)
  • gives the descriptive statistics for all variables to TWO decimal place (you can adjust the number of decimal places by simply changing the number after digits=)
  • object: replace this with name of object containing the WIDE version of the data
  • ONLY used with the WIDE version of the data

Descriptive statistics for individual columns of data to 2 decimal places in WIDE version of data

round(stat.desc(object$columnname, norm=T), digits =2)
  • gives the descriptive statistics for the WIDE version of the file but ONLY for a SPECIFIC COLUMN
  • object: needs to be replaced with the name of the wide version of the data
  • columnname: needs to be replaced with the name of the column for which you wish to generate the statistics.

Descriptive statistics for LONG version of data

by(objectL$outcome, objectL$predictor, stat.desc, norm=T)
  • gives the descriptive statistics for the LONG version of the data

  • objectL: replace with name of the object containing the LONG version of data

  • outcome: replace with name of outcome column in objectL

  • predictor: replace with name of predictor column in objectL

  • nbr.val: number of values/subjects in a particular group

  • nbr.null: number of ZERO values

  • nbr.na: number of MISSING values

  • min/max: minimum and maximum valuers in a group respectively

  • range: gives you the range

  • sum: gives the sum of all values in a group

Measures of centrality:

  • median: gives the median of values in a group
  • mean: gives the average of values in a group

Measures of variability:

  • SE.mean: standard error of the mean
  • CI.mean.95: 95% confidence interval
  • var: variance
  • std.dev: standard deviation
  • coef.var: coefficient of variation

Measures of normality:

  • skewness: measures of level of skewness of your data, a negative sign implies a negative skew;
  • skew.2SE: indicates whether the skewness is SIGNIFICANT. Skewness is significant if skew.2SE > +1 or <-1
  • kurtosis: measure of level of kurtosis of the data. A negative sign implies a negative kurtosis.
  • kurt.2SE: indicates whether the kurtosis is SIGNIFICANT. kurtosis is significant if kurt.2SE > +1 or <-1
  • normtest.w: test for normality
  • normtest.p: if normtest.p<0.05, the the data is NOT normally distributed.
  • When checking for normality - FIRST look at the normtest.p. If this value is greater than 0.05, then you have nothing to worry about. If this value is less than 0.05 then you need to look at the skew.2SE and kurt.2SE in order to determine which aspect is causing your data to not be normal. If your data is NOT normally distributed, other statistical tests apply.

Example of output generated with previous command

Back to top

Alternative method

describeBy(outcome~predictor, data = objectL, digits=2, mat=TRUE)

Example of output generated with previous command

  • this output is nice and concise but does not contain everything included in the previous method (e.g. no indication of significance in relation to the deviations from normality (skewness/kurtosis))

Back to top

Another Method of presenting Statistics

  • this method generates presentable tables as an output
  • requires “vtable” package
st(ObjectL, vars = c("column1"), add.median=TRUE, numformat=NA) 
  • gives stats of one column with 25th, 50th and 75th percentiles
  • column1 -> replace with the name of the column for which you wish to calculate the descriptive statistics. Output will have N, Mean and Std.Dev., Min, and Max also.
st(ObjectL, vars = c("column1"), group = 'Group', add.median = TRUE, numformat=NA)
  • gives the statistics of column1 across the group “Group”;
  • replace “column1” with the name of the column for which you wish to generate descriptive statistics;
  • replace “Group” with the name of the column that has the grouping variable e.g. “Sex”.

Descriptive statistics in cases of multiple levels

describeBy(outcome~ var1 + var2, data = object,digits=2, mat=TRUE)

# e.g. if var1 = sex (Male or Female) and var2 = class (Freshman, Sophomore, Junior, Senior)
# digits: describes number of decimal places
# mat: provide a matrix output rather than a list 

Excel Commands for some descriptive Statistics - method 1


Excel Commands for some descriptive Statistics - method 2

Statistic value
Mean 12
Standard Error 0.447213595
Median 12
Mode 13
Standard Deviation 1
Sample Variance 1
Kurtosis -3
Skewness 0
Range 2
Minimum 11
Maximum 13
Sum 60
Count 5

Proportions test

Proportions test: Use a proportions test to compare/determine if the frequency/probability of an event is different between the various groups being tested (e.g. asking if the proportion of males reporting depression is different from females reporting depression in a sample)

Comparing two proportions - Method 1

prop.test(x = c(c1, c2), n = c(T1, T2), alternative = "two.sided")
  • c1 and c2 reflect the number of subjects in each of the categories (e.g., number of women responding Yes and the number of men responding Y)
  • T1 and T2 reflect the number of TOTAL subjects in each category/group (e.g., the total number of women in the sample, the total number of men in the sample)

Comparing two proportions - Method 2

object = c(v1,v2) 
  • number of subjects in each of the categories (e.g. number of Catholics in example)
Tobject = c(n1,n2) 
  • number of TOTAL subjects in each category/group (e.g. Total number of representative in each party)
prop.test(object, Tobject) 
  • runs the proportions test

Back to top

Proportions Test Output in R

2-sample test for equality of proportions with
    continuity correction

X-squared = 5.7318, df = 1, p-value = 0.01666     # X-squared (chi-squared) = the
                                                  statistic; df = degrees of freedom (n-1)
                                                  - we have 2 groups; p-value = tells us if
                                                  it is significant; IT IS SIGNIFICANT if p
                                                  < 0.05
alternative hypothesis: two.sided
95 percent confidence interval:
  0.01821926 0.18098709                          # If the confidence interval range has the same SIGN - there should be a significant difference
sample estimates:
  prop 1    prop 2 
0.3535714 0.2539683                               # these are the proportions of each of the tested groups

Back to top

Writing a report from a proportions test output

A two-sample test for equality of proportion was conducted to assess whether the proportion of Catholics in the Republican and Democratic party were significantly different. The analysis revealed a significantly higher proportion of Catholics in the Democratic Party(35.4%) relative to the Republican Party(25.4%), X2(1,N=532)= 5.73,p<0.05.

Back to top

Comparing more than two proportions - Method 1

prop.test(x = c(c1, c2, c3...), n = c(T1, T2, T3...), alternative = "two.sided")
pairwise.prop.test(x = c(c1, c2, c3...), n = c(T1, T2, T3...), alternative = "two.sided")
  • c1, c2, c3… reflect the number of subjects in each of the categories (e.g., number of women responding Yes and the number of men responding Y)
  • T1, T2, T3… reflect the number of TOTAL subjects in each category/group (e.g., the total number of women in the sample, the total number of men in the sample)
  • prop.test runs the proportions test - provides the first part of the output.
  • pairwise.prop.test runs the pairwise test to indicate which proportions are different from each other - this provides the second part of the output

Comparing more than two proportions - Method 2

object = c(a,b,c,d,e,...) 
  • number of subjects in each of the categories (e.g. the number of doctors of a particular faith that said they would NOT perform an abortion; each letter would be replaced by the number of doctors of each specific specific faith that responded “NO”)
Tobject = c(Ta,Tb,Tc,Td,Te,...) 
  • number of Total subjects in each category/group (e.g. the TOTAL number of doctors of a particular faith i.e those that responded YES (they would perform an abortion) + those that responded NO; Replace Ta etc. with numbers representing the individual totals)
prop.test(object,Tobject) 
  • runs proportions test - provides the first part of the output.
pairwise.prop.test(object,Tobject) 
  • runs pairwise test to indicate which proportions are different from each other - this provides the second part of the output.

Back to top

Proportions Test Output in R (more than two proportions)

# This is the first part of the output that indicates whether there is a difference in the proportions. If there is a difference, however, it will not indicate between which proportions the difference exists.

4-sample test for equality of proportions without
    continuity correction

data:  object out of Tobject
X-squared = 35.169, df = 3, p-value = 1.122e-07
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.2857143 0.5000000 0.2407407 0.7500000 

# This is the second part of the output the indicates WHERE the differences are. The numbers in the top row and in the first column reflect the 4 proportions being tested comparing e.g., 1 vs 2 (not significant); 1 vs 4 (significant at p<0.001) etc., etc.

Pairwise comparisons using Pairwise comparison of proportions 

data:  object out of Tobject 

  1       2       3      
2 0.32422 -       -      
3 0.82154 0.18128 -      
4 0.00013 0.18128 9.5e-07

P value adjustment method: holm 

Back to top

Writing a report from a proportions test output (more than two proportions)

A 4-sample test for equality of proportions was conducted to assess whether the proportion of the 4 samples provided were significantly different. The analysis revealed an overall significant difference X2(3,N= insert number of participants) = 35.17,p<0.001. Pairwise comparisons indicated a significant difference between samples 1 and 4 and samples 3 and 4 (both p<0.001). All other comparisons were not significant (p>0.05).

Back to top

Proportions Test for situations where there are very small numbers (5 or less)

In situations where there are 5 or less values, the proportions test will display an error. In that case, the Fisher test is the more appropriate test to use.

Comparing two proportions - in cases of small numbers

fisher.test(matrix(c(n1, T1-n1, n2, T2-n2), ncol=2)) 
  • n1 represents the first number to be compared, T1 is the total number of subjects in pertaining to variable n1 etc.

Comparing more than two proportions - in cases of small numbers

  • requires RVAideMemoire package
fisher.multcomp(matrix(c(n1, T1-n1, n2, T2-n2,.....), ncol=N)) 
  • n1 represents the first number to be compared, T1 is the total number of subjects in pertaining to variable n1 etc.
  • ncol=N: replace N with the number of variables being compared.

Odds Ratio

Odds Ratio: measures how likely an event is to occur when exposed to something; a measure of the association between exposure and outcome

Create Matrix

# data taken from Yang, C.-Y., Shih, Y.-H., & Lung, C.-C. (2024). The association between COVID-19 vaccine/infection and new-onset asthma in children - based on the global TriNetX database. Infection. https://doi.org/10.1007/s15010-024-02329-3 **Table S2 in Supplement 1**

# Create matrix
vax <- c('Vax', 'No Vax')
outcome <- c('Death', 'No death')
data <- matrix(c(354, 31734, 309, 159048), nrow=2, ncol=2, byrow=TRUE)
dimnames(data) <- list('Vaccinated'=vax, 'Outcome'=outcome)

#view matrix
data

Back to top

Calculate Odds Ratio

library(epitools)

#calculate odds ratio
oddsratio(data)

# From the output shown, we would interpret this to mean that the odds that child receiving the vaccine dies is ~5.74 times the odds of one who did not receive the vaccine.

Writing a report from an odds ratio test output

There was a significant difference (p<0.001) in the odds of death associated with receiving the COVID vaccines in children aged 5-18 years of age relative to those who did not (OR 5.74, 95% CI [4.93, 6.69]).

Back to top


Some things you may have to do prior to generating a graph

Function and command for generating Standard Error of the Mean (SEM)

  • This function needs to be copied and run WITHOUT MODIFICATION before generating the summary statistics using the summarySE command.
  • This code is taken from: Summarizing Data
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summariezed
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95, .drop=TRUE) {
    library(plyr)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # This does the summary. For each group's data frame, return a vector with
    # N, mean, and sd
    datac <- ddply(data, groupvars, .drop=.drop,
      .fun = function(xx, col) {
        c(N    = length2(xx[[col]], na.rm=na.rm),
          mean = mean   (xx[[col]], na.rm=na.rm),
          sd   = sd     (xx[[col]], na.rm=na.rm)
        )
      },
      measurevar
    )

    # Rename the "mean" column    
    datac <- rename(datac, c("mean" = measurevar))

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}
sumstats <- summarySE(objectL, measurevar="outcomevar", groupvars=c("groupvar_1","groupvar_2","groupvar_n"), na.rm=TRUE)
  • sumstats: name of object that will have the summary stats. The name of this should be changed if running for more than one data set.
  • summarySE: command that runs the summary statistics
  • objectL: name of object containing the long version of the data to be analyzed/summarized
  • “outcomevar”: name of the variable to be analyzed/summarized
  • “groupvar_1” etc: name of the columns that contain the groups for the variable to be analyzed e.g. Sex, Class, Age…etc.
  • na.rm = TRUE: removes the empty cells (cells that have NA); Not required if all data points are present.

Back to top

Ordering levels in a specific variable

  • When plotting a graph, R will order the variables in alphabetical order. Very often this is not the best way of plotting the graph. In such cases, R needs to be instructed on the desired order of variables.
  • An example is the order of classes. For this example, the name of the variable in the data (i.e. the column name) is Class R will order them: Freshman, Graduate, Junior, Senior, Sophomore
  • This command will create an object that has the reordered levels.
  • In such a case, the name of the new object created would be the name of the predictor that would need to be used in Part I of plotting the graph.
  • For this example Class is the original name; Class_r is the rearranged levels
Class_r = factor (object$Class, levels=c("Freshman", "Sophomore", "Junior", "Senior", "Graduate"))
  • notice how when you are indicating the name of the factor that is to be reordered you need to give R the name of the object containing the data (in this case “object” and the name of the column to be reordered separated by a $ symbol)
  • to adapt to any situation - “Class_r” should be renamed appropriately; “object$Class” should be replaced with the appropriate object name and column name to be reorganized; all the levels within the quotes should be replaced accordingly.

Back to top

Changing variable name in a specific column

  • If the variables in a particular column need to be changed e.g. from numbers to words do the following:
  • install the package stringr using the following command:
install.packages("stringr", dependencies = T)
  • ensure that you load the package stringr using the following command:
library(stringr)
  • Next, run the following command in order to preserve the original file:
object2=object
  • Next, run the following command on the new modified object (i.e. object2) (NOTE: this will change the values in your columns permanently.)
object2$columnname = str_replace_all(object2$columnname,"oldvalue","newvalue")
  • You will need to run this command for every value that needs to be replaced
  • When plotting the graph you will need to remember to use the NEW OBJECT i.e. object2 as the source of the data

Back to top

Changing the column name/header in the object/dataframe containing the data

names(objectL)[n] <- "correctname"

This is an alternative command to the one above

  • To rename a single column by name
colnames(objectL)[colnames(objectL) == "old_col2"] <- "new_col2"

Back to top

Creating a pie chart, bar graph, line graph or box plot

Generating a pie chart

object<- c(n1, n2)
labels<- c("Poor (58%)","Normal (42%)")
pie(x=object,labels = labels, radius = 1, main = "Title (n=N)", col = c("#69b3a2", "grey"))

Back to top

Example of a graph generated with previous command

Back to top

Generating a bar graph

Generating a bar graph - Part I

barobject=ggplot(objectL, aes(predictor,outcome))
  • barobject: name of the entity that will contain the plot (call it what you wish, just be clear)
  • ggplot: command that enables the plotting of the graph
  • objectL: is the name of the long version of your data
  • aes: tells R to generate the aesthetics
  • predictor: REPLACE with the name of the predictor column in the long version of the file
  • outcome: REPLACE with the name of the outcome column in the long version of the file
  • NOTE regarding predictor: If the levels have been rearraged in a new object, the word predictor should be replaced with the name of that specific object (in the example above, Class_r instead of Class)

Generating a bar graph - Part II

barobject+stat_summary(fun=mean, geom="bar",fill="white",color="Black")+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit)) + guides(fill = guide_legend(title = "name"))
  • barobject: name of the plot determined in the previous command
  • stat_summary(fun.y=mean, geom=“bar”,fill=“white”,color=“Black”): plots the mean, as a bar graph, with a white fill and a black outline - DOES NOT NEED TO BE CHANGED unless you wish to change the colors
  • stat_summary(fun.data=mean_cl_normal, geom=“errorbar”,position=position_dodge(width=0.9), width=0.2): plots the error bars (measure of variability) so that they do not overlap - YOU DO NOT NEED TO CHANGE ANYTHING;
    • NOTE: mean_se can replace mean_cl_normal if you wish to plot the standard error of the mean.
  • labs(title=“Title line1 \n Title line2”, x=“predictor label”, y=“outcome label (unit)”):Labels the graph; title gives the name of the main heading of the graph; x labels the x-axis; y labels the y-axis; The “\n” places any words after it on the next line; DOES NEED CHANGING - you need to change everything that is in the quotation marks.
  • theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1)): sets the vertical distance/alignment b/w the specific heading and the axis - DOES NOT NEED CHANGING
  • theme(plot.title=element_text(hjust=0.5)): centers the title horizontally
  • +coord_cartesian(ylim=c(lowerlimit, upperlimit)): you do NOT always need to use this part of the command; It is only needed when the y-axis limits need to be adjusted for better viewing of the graph; If it is not being used, remove the whole command in between the plus signs leaving only one + sign. If used, replace the words lowerlimit and upperlimit with the appropriate numbers.
  • guides(fill = guide_legend(title = “name”)): where applicable, provides a new name for the legend (the default is the header of the column).

Back to top

Example of a graph generated with previous command

Back to top

Generating a bar graph with different colored bars

  • more information on this is available in the R Cookbook, under Colors (ggplot2)

Generating a bar graph - Part I

barobject=ggplot(objectL, aes(predictor,outcome,fill=predictor))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
  • replace barobject, objectL, predictor, outcome with appropriate labels
  • replace colorcode with appropriate color code obtained from the Hex field at the rapidtables website - in the case of the color selected in the example shown below, the code would be #3399FF
  • add as many colors as you have variables

Generating a bar graph - Part II

barobject+stat_summary(fun=mean, geom="bar")+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
  • stat_summary(fun.y=mean, geom=“bar”): Notice how the section “,fill=”white”,color=“Black”” has been removed because colors are accounted for in the first command
  • Everything else is per the first example of “Generating a Graph”

Back to top

Example of a graph generated with previous command

Back to top

Creating a bar graph with one independent variable and using the standard error of the mean for error bars

Generating a bar graph - Part I

barobject=ggplot(sumstats, aes(predictor,outcome,fill=predictor))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))

Generating a bar graph - Part II

barobject+stat_summary(fun=mean, geom="bar")+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
  • geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2): This command now provides the information for the errorbars. However you need to replace the word “outcome” with the appropriate name for the outcome variable.
  • Everything else is per the first example of Generating a bar graph

Back to top

Example of a graph generated with previous command

Back to top

Creating a bar graph with multiple independent variables and using the standard error of the mean for error bars

Generating a bar graph - Part I

barobject=ggplot(sumstats, aes(predictor,outcome,fill=predictor_2))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
  • NOTE: sumstats is used to provide the data to be plotted
  • predictor: the main predictor that is going to be on the x-axis
  • predictor_2: the predictor that provides the sub-categories within the main predictor
  • Additionally, see Generating a bar graph with different colored bars for details of what needs to be changed

Generating a bar graph - Part II

barobject+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
  • geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2): This command now provides the information for the errorbars. However you need to replace the word “outcome” with the appropriate name for the outcome variable.
  • Everything else is per the first example of Generating a bar graph

Back to top

Example of a graph generated with previous command

Back to top

Creating bar graphs with multiple independent variables, plotted as three separate graphs (one per variable) in one grid

  • Original data (contained in “objectL” contains the following columns “Subject”, “var1”, “var2”, “var3”, “data”

Generating a grid of bar graphs - Example 1

# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
  
# Generate graph
bar_objectL=ggplot(objectL_mod, aes(var1,data,fill=var1))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_objectL.Sub=bar_objectL+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5))+guides(fill = guide_legend(title = "var1"))

  # bar_objectL.Sub: contains the command to generate the graph with the various sub-categories present var3
  # +guides(fill = guide_legend(title = "var1")): relabels the heading of the legend

# Create grid
bar_objectL.Sub+facet_grid(cols=vars(var3))

Back to top

Example of a graph generated with previous command

Back to top

Generating a grid of bar graphs - Example 1a

  • Original data (contained in “objectL” contains the following columns “Subject”, “var1”, “var2”, “var3”, “data”
  • This graph is a modified version of that shown in Example 1 - See Generating a grid of bar graphs - Example 1
  • This graph uses the SEM instead of the Confidence interval to generate the error bars and also shows the significances
# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
# Generate graph
bar_object=ggplot(sumstats_object, aes(x=predictor,y=outcome))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_object.Sub=bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor))+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + guides(fill = guide_legend(title = "predictor")) +coord_cartesian(ylim=c(lowerlimit,upperlimit))+ geom_signif(data=data.frame(predictor2=c("newvar3.1\nname","newvar3.2\nname","newvar3.3\nname")),aes(y_position=c(21.5, 24.5, -2), xmin=c(1.0, 1.0, 1.0), xmax=c(2, 2, 2),annotations = c("**","*","")), tip_length = 0, manual=T)

  # The statistic that was NOT significant has been placed beyond the y-scale limit. The coord_cartesian command ensures that the axes limits are as desired and that the non-significant bar does not show up.
  # outcome: under geom_errorbar - this refers to the heading of the outcome column in the descriptive statistics output.
  # geom_signif: numbers shown in parenthesis need to be adjusted accordingly to accommodate for height (in relation to the y_position) and/or position relative to bars (in relation to the xmin and xmax). Additionally, the annotation needs to be appropriately corrected to reflect the appropriate significances.

# Create grid
bar_BIS.Sub+facet_grid(~BIS_Cat, scales="fixed")

# theme(axis.text.x=element_blank()) - # Removes variable label names in the x-axis if they are not desired.
# theme(axis.ticks.x=element_blank()) - # removes ticks (may be used with above command)

# https://stackoverflow.com/questions/56219468/using-ggsignif-with-grouped-bar-graphs-and-facet-wrap-not-working

Back to top

Example of a graph generated with previous command

Back to top

Generating a grid of bar graphs - Example 2

  • Original data (contained in “objectL” contains the following columns “Subject”, “var1”, “var2”, “var3”, “data”
# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname"))  # do not add spaces before and after \n or the facet title will not be centered
  
# Generate graph
bar_objectL=ggplot(objectL_mod, aes(var1,data,fill=var2))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_objectL.Sub=bar_objectL+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5))+guides(fill = guide_legend(title = "var2"))

  # bar_objectL.Sub: contains the command to generate the graph with the various sub-categories present var3
  # +guides(fill = guide_legend(title = "var2")): relabels the heading of the legend

# Create grid
bar_objectL.Sub+facet_grid(cols=vars(var3))

Back to top

Example of a graph generated with previous command

Back to top

Generating a bar graphs with significance shown - Example 3

# Plot Graph

predictor.r = factor (sumstats_object$predictor.r, levels=c("level1", "level2", "level_n"))
  # predictor.r : name of variable that has had levels reorganized for graphing purposes
  # predictor.r: predictor that requires levels to be organized for graphic purposes
  # level1, etc.: level names listed in the order that you wish them graphed

bar_object=ggplot(sumstats_object, aes(x= predictor.r,y=outcome))+ scale_fill_manual(values=c("#CC0000", "#004C99")) 

bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor))+geom_errorbar(aes(ymin=Outcome-se, ymax=Outcome+se),position=position_dodge2(0.9, padding=0.6))+labs(title="Title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + guides(fill = guide_legend(title = "Legend Title")) + geom_signif(y_position=c(30.5, 16.5, 15.5), xmin=c(0.7, 4.7, 5.7), xmax=c(1.3, 5.3, 6.3),annotation = c("***","*","**"), tip_length = 0)
  
  # outcome: under geom_errorbar - this refers to the heading of the outcome column in the descriptive statistics output.
  # geom_signif: numbers shown in parenthesis need to be adjusted accordingly to accommodate for height (in relation to the y_position) and/or position relative to bars (in relation to the xmin and xmax). Additionally, the annotation needs to be appropriately corrected to reflect the appropriate significances.

# Resources that may be of assistance

# https://statisticsglobe.com/ggsignif-package-r
# https://stackoverflow.com/questions/51443889/how-can-i-get-geom-errorbar-to-dodge-correctly-on-a-bar-chart-in-ggplot2
# https://stackoverflow.com/questions/59008974/why-is-stat-identity-necessary-in-geom-bar-in-ggplot#59009108
# http://sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization
# https://stackoverflow.com/questions/61022992/manually-plotting-significance-relations-between-sub-groups-on-ggplot2-barplot

Back to top

Example of a graph generated with previous command

Back to top

Generating a bar graphs with significance shown and reordered facets - Example 4

# Plot Graph

# Generate graph

bar_object=ggplot(sumstatsobject.r,aes(x=predictor1,y=outcome))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_object.Sub=bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor1)) + geom_errorbar(aes(ymin=Data-se, ymax=Data+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title Line1\nTitleLine2", x="x-label", y="y-label") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) + guides(fill = guide_legend(title = "Legend Title")) + coord_cartesian(ylim=c(0,15)) + geom_signif(data=data.frame(ExecFn=c("factor 1", "factor 3", "factor 5", "factor 11",  "factor 12")), aes(y_position = c(13.5,12.5,12.5,14,13), xmin=c(1,1,1,1,1), xmax=c(2,2,2,2,2), annotations = c("**","**","**","#","*")), tip_length = 0, manual=T)

# NOTE that only the factors that show significance are listed and the information in the aes also only relates to these.


# Create grid with reorganized factors
bar_object.Sub+facet_grid(~factor(MainPredictor, levels=c("factor 1" ,"factor 2","factor 3", "factor 4", "factor 5", "factor 6", "factor 7", "factor 8", "factor 9", "factor 10",  "factor 11", "factor 12")), scales="fixed")  

# MainPredictor - replace this with the name of the column that represents the x-label predictor not sub-predictors (An example of the MainPredictor is the name of the predictor that us made up of the 12 factors shown in the graph; an example of a sub-predictor is predictor1 which in this case is Yes or No)

Back to top

Example of a graph generated with previous command

Back to top

Creating a line chart

  • variable y is being plotted across variable x for two groups
lp<- ggplot(SCl, aes(x=xvarname, y=yvarname, group=groupname, color=groupname)) + # x 
  stat_summary(fun=mean, geom="line") + # plots the line
  stat_summary(fun=mean, geom="point", size=3)+ # adds points to the plot
  stat_summary(fun.data=mean_se, geom="errorbar",position=position_dodge(width=0.0), width=0.2)+ #using standard error of the mean for the error bar.
     scale_color_manual(values=c('darkblue','red'))+
  labs(title="Title", x="x-axis label", y="y-axis label")+
  theme(axis.title.x=element_text(vjust=-0.4))+
  theme(axis.title.y=element_text(vjust=0.3))+
  theme(title=element_text(vjust=+1.1))+
  theme(plot.title=element_text(hjust=0.5))

lp # displays the lineplot

Back to top

Example of a graph generated with previous command

Back to top

Creating a simple boxplot

boxplot(Outcome~Predictor,data=objectL, main="Titleline1 \n Titleline 2",xlab="x-axis label", ylab="y-axis label")

or 

# boxplot with data jittered and showing & two different colors

ggplot(objectL, aes(x=Predictor, y=Outcome, fill=Predictor)) + stat_boxplot(geom='errorbar', position=position_dodge(width=0.5),width=0.2) + geom_boxplot(width=0.4) + geom_jitter(position=position_jitter(0.15)) + scale_fill_manual(values=c("#E0E0E0", "#606060"))

Back to top


Correlation

Correlation: a correlation tests whether there is a RELATIONSHIP between two observed variables within the same sample (e.g. asking if the number of cups of coffee consumed is related to amount of alertness in a sample)

Plotting a correlation with a regression line

ggscatter(object, x = "xvariable", y = "yvariable",add = "reg.line", conf.int = TRUE,cor.coef = TRUE, cor.method = "pearson",xlab = "xlabel (units)", ylab = "ylabel(units)", title="title for the graph")+theme(plot.title=element_text(hjust=0.5))
  • ggscatter: plots the correlation
  • object: name of the wide version of data file
  • xvariable: replace with name of column of variable to be plotted in the x-axis
  • yvariable: replace with name of column of variable to be plotted in the y-axis
  • reg.line: adds the regression line
  • conf.int = TRUE: plots the confidence interval (shaded area)
  • cor.coef = TRUE: shows the correlation coefficient
  • cor.method = “pearson”: calculates and shows the p value
  • xlab: replace words in quotes with appropriate wording for the x-label
  • ylab: replace words in quotes with appropriate wording for the y-label
  • title: replace words in quotes with a proper title
  • theme(plot.title=element_text(hjust=0.5)): centers the title

Back to top

Running a correlation

cor.test(object$predictor, object$outcome)
  • gives the correlation coefficient
  • gives the p-value (which tells you of the correlation is significant)
  • Important notes to remember from the results
    • df = degrees of freedom (n-2 (for correlation))
    • p value: p value of less than 0.05 is significant
    • the value under “cor” = correlation coefficient

Back to top

Example of a Correlation Output

    Pearson's product-moment correlation

data:  ingrd$Inc and ingrd$Grd
t = 2.968, df = 12, p-value = 0.01174 ## used for report
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1833572 0.8780885
sample estimates:
      cor 
0.6506389 ## used for report

Back to top

Writing a report from a correlation

A Pearson’s correlation indicated that a student’s grade was significantly related to the family’s income, r(12)=-.65, p<.05.

Back to top

Creating a Correlation Matrix

corPlot(Object,cex = 0.8, stars=TRUE, upper=FALSE, main = "TitleLine1 \n TitleLine2",gr = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))) 
  • Object contains only the variables to be included in the correlation matrix
  • Replace name of Object with the appropriate name
  • Replace Title information with the appropriate information

Example of a correlation plot/matrix generated with previous command

Back to top


Regression

Regression: a regression DESCRIBES the RELATIONSHIP between a predictor (an independent variable) and an outcome (dependent) between two observed variables within the same sample. Regressions can take various forms - linear (e.g. calibration of pH meter), sigmoidal (e.g. in enzyme kinetics/receptor binding). Distinction between correlation & regression: While a correlation can tell you that there is a relationship between [H3O+] in a solution and the pH value, regression provides you with the line that best predicts that relationship.

Types of Regression

  • Logistic regression is the appropriate to conduct when the dependent variable is dichotomous (binary)
  • Stepwise regression is appropriate when you have many variables and you’re interested in identifying a useful subset of the predictors.

Back to top

Simple Regression

ObjectReg=lm(outcome~predictor, data=object)
summary(ObjectReg)
  • Generates the regression equation and it is stored in “ObjectReg”
  • The lm regression command
    • uses WIDE form of the data
    • Designate one variable as outcome and the other as predictor - Do so in such a way that makes sense. e.g. It is more likely that family income would predict a student’s grade than a student’s grade predicting family income
  • The summary command: gives the output of the regression equation, from which you extract all of the statistical information that goes into a regression report.

Back to top

Example Regression Output

Back to top

Example of a Regression Equation derived from the above output

Back to top

Writing a report from a regression output

The results indicated that Fear of Sin significantly predicted Depression, Beta = 0.222, t(95)=2.336, p < 0.05. Fear of Sin also explained a significant proportion of variance in Depression, R2 = .044, F(1, 95) = 5.455, p < 0.05.

Back to top

Multiple Regression (backward elimination) (using AIC)

  • requires package: leaps
fitObject = lm(y~x1+x2+x3...,data=na.omit(objectL))
  • lm: is the command that runs a linear model analysis
  • na.omit will omit those subjects with missing data points & therefore cannot be included in analysis.
stepObject = stepAIC(fitObject, direction="backward", trace=FALSE) # direction can be "both" or "forward"
stepObject$anova # display results with AIC
summary(stepObject) # display results with t values

Multiple Regression (backward elimination) (using F-value)

fullObject = ols(y~x1+x2+x3..., data = objectL)
fastbw(fullObject, rule = "p", sls = 0.1)

Logistic Regression

mylogit = glm(Outcome ~ Predictor1 + Predictor2 + Predictor3..., data = ObjectL, family = "binomial")
summary(mylogit) # gives Beta and p values

exp(cbind(OR = coef(mylogit), confint(mylogit))) # gives Odds Ratio(OR)

Cronbach’s alpha

Method 1

# subjects with missing data should be removed before hand using ObjectNA=na.omit(object)

matobject=data.matrix(object)
reliability(cov(matobject))

Example output from Method 1

$alpha
     alpha 
 0.9403455 
 
 $st.alpha
 std.alpha 
 0.9417464 
 
 $rel.matrix
                 Alpha Std.Alpha r(item, total)
 D2Stop      0.9350532 0.9368506      0.7365112
 Cont2A      0.9335049 0.9354005      0.7837842
 AvsOthers   0.9354148 0.9366306      0.7304763
 DepSleep    0.9383129 0.9398743      0.6302478
 ThinkNOL    0.9353017 0.9364763      0.7397708
 LkFrwrd     0.9348757 0.9361827      0.7433303
 ThnkLessT   0.9424475 0.9436797      0.5152945
 TrdLessT    0.9362815 0.9380178      0.6999771
 RushWrk     0.9355317 0.9366959      0.7302428
 NegOblig    0.9352094 0.9363647      0.7406319
 AccDown     0.9334180 0.9354451      0.7853325
 Escape      0.9346224 0.9362977      0.7539146
 Frustration 0.9336228 0.9352146      0.7798224
 
 attr(,"class")
 [1] "reliability"

Method 2

alphaobject=alpha(object)

alphaobject # shows output

Example output from Method 2

Reliability analysis   
Call: alpha(x = CIUS)

  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
      0.94      0.94    0.95      0.55  16 0.0051  1.8 0.97     0.55

 lower alpha upper     95% confidence boundaries
0.93 0.94 0.95 

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
D2Stop           0.94      0.94    0.95      0.55  15   0.0056 0.0139  0.56
Cont2A           0.93      0.94    0.95      0.55  14   0.0057 0.0137  0.55
AvsOthers        0.94      0.94    0.95      0.55  15   0.0055 0.0126  0.55
DepSleep         0.94      0.94    0.95      0.57  16   0.0053 0.0135  0.57
ThinkNOL         0.94      0.94    0.95      0.55  15   0.0055 0.0139  0.55
LkFrwrd          0.93      0.94    0.95      0.55  15   0.0056 0.0124  0.55
ThnkLessT        0.94      0.94    0.95      0.58  17   0.0049 0.0077  0.56
TrdLessT         0.94      0.94    0.95      0.56  15   0.0054 0.0136  0.56
RushWrk          0.94      0.94    0.95      0.55  15   0.0055 0.0125  0.55
NegOblig         0.94      0.94    0.95      0.55  15   0.0055 0.0129  0.55
AccDown          0.93      0.94    0.94      0.55  14   0.0057 0.0130  0.55
Escape           0.93      0.94    0.95      0.55  15   0.0056 0.0130  0.56
Frustration      0.93      0.94    0.95      0.55  14   0.0057 0.0132  0.55

 Item statistics 
              n raw.r std.r r.cor r.drop mean  sd
D2Stop      299  0.78  0.78  0.76   0.74 2.29 1.3
Cont2A      299  0.82  0.82  0.81   0.78 2.19 1.3
AvsOthers   299  0.77  0.78  0.77   0.73 1.20 1.2
DepSleep    299  0.69  0.69  0.65   0.63 1.25 1.3
ThinkNOL    299  0.78  0.79  0.77   0.74 1.64 1.1
LkFrwrd     299  0.79  0.80  0.78   0.74 1.43 1.2
ThnkLessT   299  0.59  0.58  0.53   0.52 2.84 1.4
TrdLessT    299  0.75  0.74  0.72   0.70 2.12 1.4
RushWrk     299  0.77  0.78  0.77   0.73 0.93 1.1
NegOblig    299  0.78  0.79  0.77   0.74 1.12 1.1
AccDown     299  0.83  0.82  0.82   0.79 2.27 1.4
Escape      299  0.80  0.79  0.79   0.75 2.06 1.5
Frustration 299  0.82  0.82  0.81   0.78 1.41 1.3

Non missing response frequency for each item
               0    1    2    3    4 miss
D2Stop      0.14 0.17 0.18 0.30 0.21    0
Cont2A      0.15 0.15 0.21 0.31 0.17    0
AvsOthers   0.37 0.24 0.23 0.13 0.03    0
DepSleep    0.40 0.19 0.22 0.14 0.05    0
ThinkNOL    0.16 0.30 0.31 0.17 0.05    0
LkFrwrd     0.30 0.25 0.22 0.18 0.05    0
ThnkLessT   0.12 0.07 0.10 0.26 0.45    0
TrdLessT    0.17 0.17 0.22 0.25 0.19    0
RushWrk     0.49 0.20 0.20 0.07 0.03    0
NegOblig    0.41 0.20 0.25 0.12 0.01    0
AccDown     0.17 0.11 0.19 0.31 0.21    0
Escape      0.24 0.13 0.15 0.28 0.19    0
Frustration 0.37 0.17 0.23 0.16 0.08    0

For additional information on interpretation please see: Reliability Analysis


Testing for Homogeneity of variance

leveneTest(objectL$outcome, objectL$predictor, center = median)

Back to top

Conducting a t-test in R

t-test: a t-test tests whether the MEANS/AVERAGES of TWO samples are significantly different from each other (e.g. asking if the average depression scores of males is different from that of females in a sample)

Method 1 - for LONG version of data

PLEASE NOTE this command is currently (10/28/24) not working properly and giving the following error “cannot use ‘paired’ in formula method”. This page will be updated once a solution is reported. The issue appears to be in the “paired=…” part of the command.

t.test(outcome~predictor, data=objectL, paired=FALSE/TRUE, var.equal=FALSE/TRUE)

OR

Method 2 - for WIDE version of data

t.test(object$var1,object$var2, paired=FALSE/TRUE, var.equal=FALSE/TRUE)
  • For METHOD 1:
    • t.test: command to calculate a t-test
    • outcome: name of the outcome column in the long version of the data file
    • predictor: name of the predictor column in the long version of the data file
    • ~ : predicted by
    • data=objectL: specifies the name of the long version of the data file. REPLACE ONLY the word objectL with the appropriate name of the long version of the file
  • For METHOD 2:
    • object, var1 and var2 - “object” needs to be replaced with appropriate name of object containing WIDE version of the data; “var1” and “var2” need to be replaced with the appropriate names of the columns in the WIDE version of the data.
  • For BOTH:
    • paired = TRUE: USED for REPEATED MEASURES, also known as paired
    • paired = FALSE: USED for INDEPENDENT MEASURES
    • var.equal = TRUE: used when the Levene Test p > 0.05 (remember p = the number under Pr(>F)); NOTE: Also use this in the case of a repeated measures t-test when a Levene test is not required.
    • var.equal = FALSE: used when the Levene Test p < 0.05
    • If var.equal = FALSE: the degrees of freedom (df) will NOT be a whole number (because the specific t-test used uses a more conservative way of calculating the df.)
    • reporting: t(df)=tvalue,pvalue, r^2=

Back to top


Calculating the effect size (r squared)

tvalue^2/(tvalue^2+df)

Back to top

Calculating the effect size (Cohen’s d) (alternative option)

# requires package effsize

cohen.d(data= object, group1, group2)

Back to top


t-test output in R

t = -2.2258, df = 16, p-value = 0.04075 
t = -2.5964, df = 8, p-value = 0.03179
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -18.252038  -1.081295
sample estimates:
  mean of the differences 
-9.666667 
mean in group Boys mean in group Girls 
25.66667            35.33333

Back to top

Example of a t-test output and an effect size (r2)calculation with the important data used in the report highlighted.

Back to top

Writing a report for a t-test

An independent measures t-test was utilized to assess whether there is a significant difference in vitamin D levels between boys and girls. The analysis indicated a significant difference in vitamin D levels, t(16)=-2.226, p<0.05; rsquared= , specifically girls (M= ; SEM= ) reported a significantly higher level of vitamin D then boys (M= ; SEM= ).

OR

An independent measures t-test was utilized to assess whether there is a significant difference in vitamin D levels between boys(M= ; SEM= ) and girls(M= ; SEM= ). The analysis indicated a significant difference in vitamin D levels, t(16)=-2.226, p<0.05; rsquared= ,specifically girls reported a significantly higher level of vitamin D then boys.

Back to top


Conducting a t-test in excel (method 1)


Conducting a t-test in excel (method 2)

Statistic Set 6 Set 7
Mean 11.75 11.5
Variance 46.91666667 33.66666667
Observations 4 4
Pearson Correlation 0.910007244
Hypothesized Mean Difference 0
df 3
t Stat 0.174077656
P(T<=t) one-tail 0.436444286
t Critical one-tail 2.353363435
P(T<=t) two-tail 0.872888572
t Critical two-tail 3.182446305

One-way ANOVA - INDEPENDENT MEASURES

ANOVA: an ANOVA tests whether the MEANS/AVERAGES of MORE THAN TWO samples are significantly different from each other (e.g. asking if there is a significant difference in the average depression scores between Freshman, Sophomores, Juniors and Seniors). ANOVAs can be complicated given that comparisons can be carried out within groups and between groups.

aovobject=aov(outcome~predictors, data=objectL)
summary(aovobject)

Back to top

Example of an Independent Measures Output

            Df Sum Sq Mean Sq F value   Pr(>F)    
variable     2  28.22  14.111   11.91 0.000256 ***  
Residuals   24  28.44   1.185                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Back to top


Post-hoc analysis using Tukey Test

phobject=glht(aovobject, linfct=mcp(predictor="Tukey"))
summary(phobject)

Back to top

games_howell_test(data = objectL, formula = outcome ~ grouping_variable)

Back to top

Example of a post-hoc output for Independent Measures ANOVA

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = value ~ variable, data = migraineL)

Linear Hypotheses:
                   Estimate Std. Error t value Pr(>|t|)    
DrugB - DrugA == 0   2.1111     0.5132   4.114 0.001063 ** 
DrugC - DrugA == 0   2.2222     0.5132   4.330 0.000605 ***
DrugC - DrugB == 0   0.1111     0.5132   0.217 0.974514    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

Back to top

Equation for Eta Squared (effect size) for Independent Measures ANOVA (information taken from the first step of the ANOVA).

Back to top

Writing a report from an Independent Measures ANOVA output

Pain levels were reported on a scale of 1 to 10 (1=low, 10=high) following the administration of various pain relievers by individuals who suffer from migraines. A One Way ANOVA indicated a significant difference (F(2, 24) = 11.91; p<0.001; n2= 0.498) in reported pain levels across the drugs administered. Post-hoc analysis using Tukey test indicated that individuals reported significantly lower pain levels following the administration of Drug A (M=3.67, SEM=0.29) relative to both Drug B (M=5.78; SEM=0.49; p<0.01) and Drug C (M=5.89; SEM=0.26; p<0.001). There was no significant difference between the reported pain levels following the administration of Drugs B and C (p>0.05).

Back to top


One-Way REPEATED MEASURES (RM) ANOVA

aovobject=aov(outcome~predictor+Error(participant/predictor), data=objectL)
summary(aovobject)

Back to top

Example of a Repeated Measures Output

An Example of a Repeated measures ANOVA output with important numbers used in report highlighted

Back to top


Post-hoc Analysis using Pairwise comparisons

with(objectL, pairwise.t.test(outcome,predictor, p.adjust.method="bonferroni", paired=T))

Back to top

Example of a post-hoc output for Repeated Measures ANOVA

An Example of a post-hoc analysis output for a Repeated measures ANOVA with important numbers used in report highlighted. These values are the p-values comparing the variable in the vertical column on the left to the horizontal row above the numbers.

Back to top

Equation for Eta Squared (effect size) for Repeated Measures ANOVA. The information is found under the section Error: Within in the first step of the ANOVA

Back to top

Writing a report from a Repeated Measures ANOVA output

Repeated measures ANOVA revealed a significant overall effect across time (F(2,24) = 20.65, p< 0.001; n2=0.63), following the administration of drug. Post-hoc Bonferroni analysis indicated that anxiety scores were significantly lower than pre-drug (M=8.10, SEM=0.23) administration on both week 1 (M=6.80, SEM = 0.25; p<0.05) and week 2 (M=3.00, SEM = 0.26; p<0.001). Moreover the anxiety scores on week 2 were also significantly lower than those of week 1 (p<0.001).

Back to top


How to use R for multiple-select questions

  • Replace OBJECT with name of the object containing the imported data (long version)
  • First column needs to be the subject number
  • [This should be done in Excel] All values in the rows need to be replaced with a 1 (implies there is a value or count) or 0 (blank)
  • Replace “variable1” with the name of the column heading of that specific variable (e.g. if counting number of people in household and/or Intramural sports, variable1 etc would be replaced with the names of these columns)
  • This code is modified from https://stackoverflow.com/questions/11622660/how-to-use-r-for-multiple-select-questions
set.seed(1)
dat <- data.frame(OBJECT,"variable1","variable2", "variable3", "..n")
                  
multi.freq.table = function(OBJECT, sep="", dropzero=FALSE, clean=TRUE) {
  # Takes boolean multiple-response data and tabulates it according to the possible combinations of each variable.
  
  counts = data.frame(table(OBJECT))
  N = ncol(counts)
  counts$Combn = apply(counts[-N] == 1, 1, 
                       function(x) paste(names(counts[-N])[x],
                                         collapse=sep))
  if (isTRUE(dropzero)) {
    counts = counts[counts$Freq != 0, ]
  } else if (!isTRUE(dropzero)) {
    counts = counts
  }
  if (isTRUE(clean)) {
    counts = data.frame(Combn = counts$Combn, Freq = counts$Freq)
  } 
  counts
}

multi.freq.table(dat[-1], sep="-")

How to count number of occurences in a column

  • the counts are sorted largest to smallest
count(OBJECT, vars = Columnname, sort=TRUE)

or

  • the output is sorted from lowest to highest
sort(table(OBJECT$Columnname))

Plotting map of US with data

METHOD 1: Plot map of US reflecting various levels - Only plots lower 48

Load packages

library(ggplot2);library(maps);library(dplyr) 

Load the state map information

MainStates <- map_data("state")

Copy data

  • StateNumber: object containing data for number of students from every state
  • this object contains the data with the names of the states and numbers of students from every state
  • states that have no representation i.e. 0 students should be blank
  • data needs to have a column named “region” (lower case)
StateNumber = read.table(file="clipboard", sep="\t", header=TRUE, as.is=TRUE)

Plots all states with ggplot2, using black borders and light blue fill

ggplot()+geom_polygon(data=MainStates,aes(x=long, y=lat, group=group),color="black", fill="lightblue")

Merges Mainstates Map with StateNumber

MergedStates <- inner_join(MainStates, StateNumber, by = "region")
p <- ggplot()
p <- p + geom_polygon( data=MergedStates, aes(x=long, y=lat, group=group, fill = n), color="white", size = 0.2) 
p

Changing color scheme

p <- p + scale_fill_continuous(name="Number of Students",low = "lightblue", high = "darkblue",limits = c(0,140),breaks=c(20,40,60,80,100,120,140), na.value = "grey50")+labs(title="title of graph")
p

Back to top

METHOD 2: Plot map of US reflecting various levels - Plots all 50 states

Load packages

library(usmap) 

Copy data

  • StateNumber: object containing data for number of students from every state
  • this object contains the data with the names of the states and numbers of students from every state
  • states that have no representation i.e. 0 students should be blank
  • data needs to have a column named “region”
  • The excel data must contain the following columns. The data below can be copied and then pasted into excel using the “paste special/text only” option.
  • The fips column stands for the Federal Information Processing Standard (FIPS) which provides a set of standard numeric codes for referring to U.S. states.
  • Please note that, in the FIPS column, the 0 in front of the single digit numbers is not a requirement - this was just how I set it up.
fips region n
01 Alabama
02 Alaska
04 Arizona
05 Arkansas
06 California
08 Colorado
09 Connecticut
10 Delaware
12 Florida
13 Georgia
15 Hawaii
16 Idaho
17 Illinois
18 Indiana
19 Iowa
20 Kansas
21 Kentucky
22 Louisiana
23 Maine
24 Maryland
25 Massachusetts
26 Michigan
27 Minnesota
28 Mississippi
29 Missouri
30 Montana
31 Nebraska
32 Nevada
33 New Hampshire
34 New Jersey
35 New Mexico
36 New York
37 North Carolina
38 North Dakota
39 Ohio
40 Oklahoma
41 Oregon
42 Pennsylvania
44 Rhode Island
45 South Carolina
46 South Dakota
47 Tennessee
48 Texas
49 Utah
50 Vermont
51 Virginia
53 Washington
54 West Virginia
55 Wisconsin
56 Wyoming
StateNumber = read.table(file="clipboard", sep="\t", header=TRUE, as.is=TRUE)

Plots blank map of all states, using black borders and grey fill

plot_usmap(regions = "states") + labs(title = "US States",subtitle = "This is a blank map of the state of the United States.") + theme(panel.background = element_rect(color = "black", fill = "grey"))

Plots all states with a color scale reflecting numbers of students from the specific states and places image in a black rectangle with a light color grey fill

plot_usmap(data = StateNumber, values = "n", color = "white") + scale_fill_continuous(name="Number of Students",low = "lightblue", high = "darkblue",limits = c(0,140),breaks=c(20,40,60,80,100,120,140), na.value = "grey50")+labs(title="title of graph", subtitle = "Subtitle")+ theme(plot.title=element_text(hjust=0.5)) + theme(plot.subtitle=element_text(hjust=0.5))+ theme(legend.position = "right") + theme(panel.background = element_rect(color = "black", fill ="grey85"))

Back to top

Plotting a 3D Scatter Plot

  • you may need to go into Tools/Global Options/General/Advanced and under OS Integration (quit required) change the Rendering Engine option to Desktop OpenGL

Back to top

Additional Resources

R Markdown

# In R Markdown Document:
<div class="alert alert-x"> 
  # Replace "x" with "success" for green; "info" for blue; "warning" for tan; "danger" for red.
</div>
# Creating a pdf of the rpubs website https://rpubs.com/ssammut/ResearchStats
<div class="alert alert-x"> 
  # requires webshot package
  library(webshot)
webshot(
    url     = "https://rpubs.com/ssammut/ResearchStats"
  , file    =  "Output1.pdf" # replace "Output1.pdf" with appropriate name
  , vwidth  = 992
  , vheight = 744
  )

getwd () # will show in which folder the file is stored.
</div>

Back to top

Jamovi - an alternate statistical program

  • This is not the program used in the class but you can feel free to experiment with it.
  • Jamovi is a program based on R that has a more user-friendly interface
  • Download Jamovi
  • Advantage - can be used on all platforms Macs, PCs and Chromebooks
  • A user guide can be found here
  • A tutorial is also available here

Back to top

Creating your own package

Resources:

Requires:

Creating the package

  1. Create folder Name_R_package
  2. In the folder create a folder called R
  3. In the folder called R, create a text file called Name_function.txt that contains a copy the function code
  4. Change the extension of the text file to .R (confirm change)
  5. In the Name_R_package folder create a text file called DESCRIPTION
  6. In the DESCRIPTION file add the following information:
  Package: Name of package
  Type: Package
  Title: Title describing package
  Version: 0.0.1.0
  1. Remove the .txt extension from the DESCRIPTION file (confirm change)
  2. Open a new window in R so that you do not lose previous work
  3. Go to File/New Project.
    1. Click on New Directory
    2. Click on R Package
    3. Name the package
    4. Click “Add”
    5. Select the R file created above
    6. Click “Browse” and change the folder to select the Name_R_package you created earlier.
    7. Click on “Create Project”
  4. Click on “Build” in the menu and select “Build Source Package”

Using the package

  1. Go to Tools/Install Packages
  2. Change “Install From” to “Package Archive File (.zip; .tar.gz)”. This will open the window for you to select the file.
  3. Go to the folder Name_R_package and select the .tar.gz file listed
  4. Click “Install”
  5. Run library using the name you gave the package in 9c.: library(packagename)
  6. NOTE: Whenever you install a new version of R you will need to re-run the installation of the package just described.

Back to top



---
title: "Basic Commands to Get Started with R"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
  word_document: default
  toc: yes
---
<div class="alert alert-danger">
# Table of Contents:
1. [Installing R & R Studio]
2. [Installation of Packages]
3. [Loading Packages]
4. [Copying Data from Excel into R]
5. [Exporting from R]
    + [Exporting output to a text file]
    + [Exporting dataframe to a csv file]
6. [Viewing imported data]
7. [Melting your data]
    + [Performing a simple melting of data]
    + [Melting data with multiple id variables and measure variables]
8. [Descriptive statistics for **Long** version of data]
9. [**Proportions test**]
    + [Comparing two proportions - Method 1]
    + [Comparing two proportions - Method 2]
    + [Proportions Test Output in R]
    + [Writing a report from a proportions test output]
    + [Comparing more than two proportions - Method 1]
    + [Comparing more than two proportions - Method 2]
    + [Proportions Test Output in R (more than two proportions)]
    + [Writing a report from a proportions test output (more than two proportions)]
10. [**Odds Ratio**]
    + [Create Matrix]
    + [Calculate Odds Ratio]
    + [Writing a report from an odds ratio test output]
11. [Some things you ***may*** have to do **prior to generating a graph**]
    + [Function and command for generating Standard Error of the Mean (SEM)]
    + [Ordering levels in a specific variable]
    + [Changing variable name in a specific column]
    + [Changing the column name/header in the object/dataframe containing the data]
12. [**Creating a pie chart, bar graph, line graph or box plot**]
    + [Generating a pie chart]
    + [Generating a bar graph]
    + [Generating a bar graph with **different colored bars**]
    + [Creating a bar graph with **one independent variable** and using the standard error of the mean for error bars]
    + [Creating a bar graph with **multiple independent variables** and using the standard error of the mean for error bars]
    + [Creating bar graphs with **multiple independent variables**, plotted as three separate graphs (one per variable) in one grid]
    + [Creating a line chart]
    + [Creating a simple boxplot]
13. [**Correlation**]
    + [Plotting a correlation with a regression line]
    + [Running a correlation]
    + [Example of a Correlation Output]
    + [Writing a report from a correlation]
    + [Creating a Correlation Matrix]
14. [**Regression**]
    + [Types of Regression]
    + [Simple Regression]
    + [Example Regression Output]
    + [Example of a Regression Equation derived from the above output]
    + [Writing a report from a regression output]
15. [Testing for Homogeneity of variance]
16. [**Conducting a t-test in R**]
    + [Method 1 - for **LONG** version of data]
    + [Method 2 - for **WIDE** version of data]
    + [Calculating the effect size (r squared)]
    + [t-test output in R]
    + [Example of a t-test output and an effect size (r^2^)calculation with the important data used in the report highlighted.]
    + [Writing a report for a t-test]
17. [**One-way ANOVA - INDEPENDENT MEASURES**]
    + [Example of an Independent Measures Output]
    + [Post-hoc analysis using Tukey Test]
    + [Example of a post-hoc output for Independent Measures ANOVA]
    + [Equation for Eta Squared (effect size) for Independent Measures ANOVA (information taken from the first step of the ANOVA).]
    + [Writing a report from an Independent Measures ANOVA output]
18. [**One-Way REPEATED MEASURES (RM) ANOVA**]
    + [Example of a Repeated Measures Output]
    + [Post-hoc Analysis using Pairwise comparisons]
    + [Example of a post-hoc output for Repeated Measures ANOVA]
    + [Equation for Eta Squared (effect size) for Repeated Measures ANOVA. The information is found under the section **Error: Within** in the first step of the ANOVA]
    + [Writing a report from a Repeated Measures ANOVA output]
19. [How to use R for multiple-select questions]
20. [How to count number of occurences in a column]
21. [Plotting map of US with data]
    + [METHOD 1: Plot map of US reflecting various levels - Only plots **lower 48**]
    + [METHOD 2: Plot map of US reflecting various levels - Plots **all 50 states**]
22. [Additional Resources]
    + [R Markdown]
    + [Jamovi - an alternate statistical program]
    + [Creating your own package]
</div>

# Resources

  * [R Commands Document](http://rpubs.com/ssammut/ResearchStats){target="_blank"}
  * There is a significant amount of information pertaining to the use of R on the internet. Listed below are only some examples of the resources. If you are having trouble installing R and R Studio, even after reading and following the instructions I include below, you may want to go to some of these websites. 

<div class="alert alert-warning">  
# Installing R & R Studio

<div class="alert alert-danger">
**PLEASE NOTE**: There are times when you may have to re-download R (not necessarily R Studio). You will realize this is needed if you download a package and when you run the library you get a message that seems to indicate that the package is not for that version. All you have to do in this case is 1. close out of R Studio; 2. download the newer version of R and 3. install it.
</div>

For links to download R and R studio please go to the following:

  1. Step 1 - R MUST be installed first - [Download **R** for Windows or Mac ](https://cran.r-project.org/)

  2. Step 2 - R Studio is installed after R has finished installing - [Download **R Studio** for Windows or Mac ](https://posit.co/downloads/). **Please note:** If clicking on the link does not work, copy and paste the following link into your browser: https://posit.co/downloads/

</div>

<div class="alert alert-warning">  
## For **Windows**:

1. Double click **R-x.x.x-win.exe**

    a. Click “OK” (for language)
    b. Click “Next” (for Information)
    c. Click “Next” (for Select Destination Location)
    d. Click “Next” (for Select Components)
    e. Click “Next” (for Startup options)
    f. Click “Next” (for Select Start Menu Folder)
    g. On the next screen:
        i.	Make sure “Create a desktop shortcut” is unchecked
        ii.	Make sure “Create a Quick Launch shortcut” is unchecked
        iii.	Click “Next”
    h. Click “Finish”

2. Double click **RStudio-x.x.xxxx.exe**

    a. Click “Next”
    b. Click “Next” 	(for Choose Install Location)
    c. Click “Install” 	(for Choose Start Menu Folder)
    d. Click “Finish”

[***Step 3 is not necessary if you are installing the packages by running the commands below under "Installation of Packages"***]

3. Packages go into: 
    a. C:\\Users\\YourName\\Documents\\R win-library\\x.x
    b. Accept to overwrite current packages 

## For **Mac**:

1. Double click **R-x.x.x.pkg**
    a.	click "Continue" (for Introduction)
    b.	click "Continue" (for Read Me)
    c.	click "Continue"	(for License)
    d.	click "Agree"
    e.	click "Install"	(for Installation Type)
    f.	if asked, provide password to allow installation of software
    g.	click "Install Software"
    h.	When installation is complete click "Close"


2. Double click **RStudio-x.x.xxxx.dmg**
    a. in window that opens up, drag RStudio icon into Applications folder
    b. When installation is done, you can click on the RStudio icon in the Launchpad
    c. If asked "”RStudio” is an application downloaded from the Internet. Are you sure you want to open it?"
    d. If asked to install a "git" command - click "install" and then click on "agree" 

[***Step 3 is not necessary if you are installing the packages by running the commands below under "Installation of Packages"***]

3. Packages go into:
    a.	Macintosh HD/Library/Frameworks/R.framework/Versions/*version#*/Resources/library 
    b.	Accept to overwrite current packages
</div>
 
<a href="#top">Back to top</a>
 
## Resources for Installing the Analysis Toolpak in Excel:

If conducting more detailed analysis in ***Excel***, it is necessary to install and load the Analysis Toolpak. For instructions on loading the Analysis ToolPak in Excel see:

  * For [**Windows**](https://support.office.com/en-ie/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3-9317-6b40ba1a66b4#OfficeVersion=Windows){target="_blank"}
  * For [**Macs**](https://support.office.com/en-ie/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3-9317-6b40ba1a66b4#OfficeVersion=MacOS){target="_blank"} 
  * Use the [Analysis ToolPak](https://support.office.com/en-us/article/use-the-analysis-toolpak-to-perform-complex-data-analysis-6c67ccf0-f4a9-487c-8dec-bdb5a2cefab6){target="_blank"} to perform complex data analysis

***
# General Notes for R
  * pound (#) sign indicates a comment - not read by the program
  * console is the window in which you will run the commands
  * If you click in the console, and Ctrl L - clears the console
  * If ever you're stuck with a plus (+)sign and cannot get the cursor (>) back, press **Ctrl+C** (for both Macs and PCs)
  * object: an object contains a graph, the data etc. - whatever you put it in it.
  * <- is equivalent to = 
  * object$columnname: indicates the name of the column within the object
  * install packages: run **ONCE**; **NEED** to be connected to the internet
  * Library command: run **EVERYTIME** you open R-Studio
  * R **IS CASE SENSITIVE**
  * **names**(object) : gives the names of all the variables in a specific object ["names" is bolded b/c it is a command and does not change when used; only the word "object" is replaced by the appropriate name of the object containing the data]
  * **levels**(object$nameofcolumn) - gives the levels within the column of interest; ["levels" is bolded b/c it is a command and does not change when used; only the words "object" and "nameofcolumn" are replaced by the appropriate name of the object containing the data and the name of the column of interest respectively]
  * If ever you need to change a column into a factor because R is not seeing the column as a factor use this command: ***objectL\$nameofcolumn = as.factor(objectL\$nameofcolumn)***

***
<div class="alert alert-danger">
# Installation of Packages
  * Installation of packages is done **ONLY ONCE**; 
  * For the installation you MUST be ***connected to the internet***
```{r}
install.packages("ggplot2", dependencies = T) # used for generating graphs
install.packages("multcomp", dependencies = T) # used for post-hoc analysis
install.packages("pastecs", dependencies = T) # used for generating descriptive statistics
install.packages("reshape", dependencies = T) # used for generating a long version of the data from a wide version of the data
install.packages("reshape2", dependencies = T) # used to generate long form of the file; required in some cases
install.packages("nlme", dependencies = T) # used for generating ANOVAs
install.packages("car", dependencies = T)
install.packages("pwr", dependencies = T)
install.packages("dplyr", dependencies = T)
install.packages("devtools", dependencies = T)
install.packages("rms", dependencies=T) # for regression
install.packages("psych", dependencies=T)
install.packages("ggpubr", dependencies=T) # required for correlation plots
```
</div>
<a href="#top">Back to top</a>

## Additional packages (not required for class)
```{r}
install.packages("rmarkdown", dependencies=T) # for RMD files that allow embedding of graphs etc; not required for class
install.packages("jmv") # not required for class
install.packages("umx", dependencies=T) # For structural equation modelling
install.packages("semPlot", dependencies=T) # required for Structural Equation Modeling but not required for class
install.packages("lavaan", dependencies=T) # required for Structural Equation Modeling but not required for class; see https://lavaan.ugent.be/tutorial/index.html
install.packages("sem", dependencies=T) # required for Structural Equation Modeling but not required for class
install.packages("maps", dependencies=T) # required if making map of US - only lower 48 states
install.packages("usmap", dependencies=T) # required if making map of US - all 50 states
# install.packages("RVAideMemoire", dependencies = T) # required for the Fisher Test
install.packages("plotly", dependencies=T) # required for 3D plots
install.packages("compareGroups") # required for generation of table comparing groups
install.packages("lsmeans", dependencies=T) # for comparison of slopes
install.packages("effectsize", dependencies = T)
```

***
<div class="alert alert-danger">
# Loading packages
  * Packages need to be loaded **EVERY TIME** R studio is started.
  * Copy this command and paste into the Console and then press "Enter". 
```{r}
library(ggplot2);library(multcomp);library(pastecs);library(reshape); library(reshape2); library(nlme); library(car); library(pwr); library(dplyr); library(devtools); library(rms);library(psych); library(ggpubr)
```
</div>
<a href="#top">Back to top</a>

## Additional library commands (not required for class)
```{r}
library(umx); library(jmv); library(sem); library(lavaan); library(semPlot); library(qgraph); library(foreign); library(usmap); library(maps);  library(plotly); library(SEMfn); library(lsmeans); library(effectsize); library(roxygen2)

library(RVAideMemoire)
```

***
<div class="alert alert-success">
# Copying Data from Excel into R

## For PC
  1. Copy the command below (Ctrl+C)
  2. Paste command into R Studio Console (Ctrl+V)
  3. Change the name "object" to something that makes sense to you and is related to the data.
  4. In **Excel**, select the data you want to analyze; Include the header; 
  5. Copy the data (Ctrl+C)
  6. Go back to R Studio and run (**press "Enter"**) the command you pasted earlier (in step 2) into the console 
```{r}
object = read.table(file="clipboard", sep="\t", header=TRUE)
```

  * If you have missing data points in your file that you still want imported into the dataframe, use the following command:
```{r}
object = read.csv(file="clipboard", sep="\t", header=TRUE)
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
## For Mac
  1. Copy the command below (Cmd+C)
  2. Paste command into R Studio Console (Cmd+V)
  3. Change the name "object" to something that makes sense to you and is related to the data.
  4. In **Excel**, select the data you want to analyze; Include the header; 
  5. Copy the data (Cmd+C)
  6. Go back to R Studio and run (**press "Enter"**) the command you pasted earlier (in step 2) into the console
```{r}
object = read.table(pipe("pbpaste"), sep="\t", header = TRUE)
```

  * If you have missing data points in your file that you still want imported into the dataframe, use the following command:
  
```{r}
object = read.csv(pipe("pbpaste"), sep="\t", header = TRUE)
``` 
</div>
<a href="#top">Back to top</a>

***
# Importing Data into R from Excel (xlsx) files
 
## Important Notes: 
  * **If you are going to use this method, you need to make sure that your excel file has a short name with no spaces**
  * If you wish to **name the object** containing the data something different from the original Excel name, type in the new name in the **"Name:"** box under **"Import Options"****
  * If you have **more than one sheet of data** in the excel file, you will need to specify which sheet the data is coming from in the **"Sheet:"** box under **"Import Options"** section in the **"Import Excel Data window"**

## Procedure: 
  * Go to **"File>Import Data Set>From Excel..."**
  * R opens window called **"Import Excel Data"**
  * Click on **"Browse..."** 
  * Select the Excel file you want to import
  * R will display the data in the **"Data Preview"** window; [see note above if your excel file contains more than one data sheet or you wish to name your object something different from the excel file name]
  * Click **"Import"** and the data is now imported and will be displayed

  
# Importing Data into R from CSV files
```{r}
object = read.csv(file.choose(), header=T)
```
  * object: name of the object - call it as you wish; should be related to the data you are importing; should simple; NO spaces; should short.
  * read.csv: tells the program you are going to import a CSV file
  * file.choose (): tells the program to open a window for you to select the file of interest
  * header = T: this indicates that you CSV file has a header
  * the ONLY word you need to change in this command is the name of the object
  
# Importing Data into R from SPSS files
  * to import *.sav files from SPSS you first need to run the following library command

```{r}
library(foreign)
```

```{r}
object = read.spss(file.choose(), to.data.frame=TRUE)
```
  * read.spss: tells the program you are going to import an SPSS file
  * file.choose (): tells the program to open a window for you to select the file of interest
  * the ONLY word you need to change in this command is the name of the object

***
<div class="alert alert-success">

# Exporting from R
## Exporting output to a text file
```{r}
sink("outputname.txt")

# RUN THE COMMANDS YOU WISH YO BE EXPORTED TO THE TEXT FILE

sink()
getwd() # gives working directory

# to import into excel, select the file and use the "space delimited" option, and then check the space checkbox.
```

## Exporting dataframe to a csv file
```{r}
write.csv(dataframe,"name.csv")
getwd() # gives working directory
```
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-info">
# Viewing imported data
```{r}
View(object) # allows you to see the data in another tab
```
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-warning">
# Melting your data
## Performing a simple melting of data
```{r}
objectL = melt(object, id="idname")
```
  * melt converts file from wide to long version
  * objectL: NEW object that will contain the long version of the file
  * object: name of the of the object containing the wide version of the data
  * id="idname": this is only used SOMETIMES; only when you have an ID colum (a column that identifies the subject in some way and does not change/is constant)
  * In case of a file with an ID column, replace the idname with the appropriate name of the column
  * If there is NO ID COLUMN; you can simply **delete** the part of the command from the comma onwards i.e., *, id="idname"'* but keep the close parenthesis.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
# Melting data with multiple id variables and measure variables
```{r}
object_L=melt(object, id.vars = c("idvar_1", "idvar_2", "idvar_n"), measure.vars = c("mvar_1", "mvar_2", "mvar_n"))
```
  * idvar_1 etc. = replace these with the appropriate names of the columns containing the id variables
  * mvar_1 etc. = replace these with the measure variable column names
</div>
<a href="#top">Back to top</a>

***
# Selecting specific subset to analyze

```{r}
var_a = object$`outcome`[object$predictor1colname=="predictor1" & object$predictor2colname=="predictor2"]
```

  * var_a: name of variable that will contain subset
  * object$predictor1colname: name of column containing "predictor1" in object
  * ==: "truly equal to"
  
***
# Generating descriptive statistics in R
## Weighted mean
```{r}
wt <- c(w1,  w2,  w3,  w4, etc)/(wT) # the weight for the first value = x multiplied w1/wT (e.g. if weight for w1 is 25% w1 would be 25, wT would be 100)
x <- c(x1, x2, x3, x4, etc) # values to be weighted
xm <- weighted.mean(x, wt)
```

## Mean
```{r}
mean(c(x1, x2, x3, x4...))
```

## Descriptive statistics for **WIDE** version of data to 2 decimal places
```{r}
round(stat.desc(object, norm=T), digits=2)
```
  * gives the descriptive statistics for all variables to TWO decimal place (you can adjust the number of decimal places by simply changing the number after digits=)
  * object: replace this with name of object containing the WIDE version of the data
  * ONLY used with the WIDE version of the data

## Descriptive statistics for individual columns of data to 2 decimal places in **WIDE** version of data
```{r}
round(stat.desc(object$columnname, norm=T), digits =2)
```
  * gives the descriptive statistics for the WIDE version of the file but ONLY for a SPECIFIC COLUMN
  * object: needs to be replaced with the name of the wide version of the data
  * columnname: needs to be replaced with the name of the column for which you wish to generate the statistics.

<div class="alert alert-danger">
## Descriptive statistics for **LONG** version of data
```{r}
by(objectL$outcome, objectL$predictor, stat.desc, norm=T)
```
  * gives the descriptive statistics for the LONG version of the data
  * objectL: replace with name of the object containing the LONG version of data
  * outcome: replace with name of outcome column in objectL
  * predictor: replace with name of predictor column in objectL

  * nbr.val: number of values/subjects in a particular group
  * nbr.null: number of ZERO values
  * nbr.na: number of MISSING values
  * min/max: minimum and maximum valuers in a group respectively
  * range: gives you the range
  * sum: gives the sum of all values in a group

### Measures of centrality:
  + median: gives the median of values in a group
  + mean: gives the average of values in a group

### Measures of variability:
  + SE.mean: standard error of the mean
  + CI.mean.95: 95% confidence interval
  + var: variance
  + std.dev: standard deviation
  + coef.var: coefficient of variation

### Measures of normality:
  + skewness: measures of level of skewness of your data, a negative sign implies a negative skew;
  + skew.2SE: indicates whether the skewness is SIGNIFICANT. Skewness is significant if skew.2SE > +1 or <-1
  + kurtosis: measure of level of kurtosis of the data. A negative sign implies a negative kurtosis.
  + kurt.2SE: indicates whether the kurtosis is SIGNIFICANT. kurtosis is significant if kurt.2SE > +1 or <-1
  + normtest.w: test for normality
  + normtest.p: if normtest.p<0.05, the the data is NOT normally distributed.
  + When checking for normality - **FIRST** look at the normtest.p. If this value is greater than 0.05, then you have nothing to worry about. If this value is less than 0.05 then you need to look at the skew.2SE and kurt.2SE in order to determine which aspect is causing your data to not be normal. If your data is NOT normally distributed, other statistical tests apply.

### Example of output generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/ByOutput.png){width=65%}
</center>
</div>
<a href="#top">Back to top</a>

### Alternative method

```{r}
describeBy(outcome~predictor, data = objectL, digits=2, mat=TRUE)
```
### Example of output generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/describeBy.png){width=70%}
</center>

  + this output is nice and concise but does not contain everything included in the previous method (e.g. no indication of significance in relation to the deviations from normality (skewness/kurtosis))

<a href="#top">Back to top</a>

<center>

### Another Method of presenting Statistics

<div class="alert alert-success">
  * this method generates presentable tables as an output
  * requires "vtable" package
```{r}
st(ObjectL, vars = c("column1"), add.median=TRUE, numformat=NA) 
```
  * gives stats of one column with 25th, 50th and 75th percentiles
  * column1 -> replace with the name of the column for which you wish to calculate the descriptive statistics. Output will have N, Mean and Std.Dev., Min, and Max also.
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Rplot-one column.png){width=70%}
</center>

```{r}
st(ObjectL, vars = c("column1"), group = 'Group', add.median = TRUE, numformat=NA)
```
  * gives the statistics of column1 across the group "Group";
  * replace "column1" with the name of the column for which you wish to generate descriptive statistics; 
  * replace "Group" with the name of the column that has the grouping variable e.g. "Sex".

<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Rplot-by group.png){width=70%}
</center>
</div>
***
# Descriptive statistics in cases of multiple levels 
```{r}
describeBy(outcome~ var1 + var2, data = object,digits=2, mat=TRUE)

# e.g. if var1 = sex (Male or Female) and var2 = class (Freshman, Sophomore, Junior, Senior)
# digits: describes number of decimal places
# mat: provide a matrix output rather than a list 
```

***
# Excel Commands for some descriptive Statistics - method 1
  * all commands start with an equal sign
  * once you select the appropriate command (after the equal sign), in parenthesis you will need to inbput the cells on which you want to execute that command e.g. 
    + =min(B2:B10) - this gives the minimum for numbers in cells B2 to B10
    + =max(B2:B10) - gives the max for numbers in cells B2 to B10
    + =Average(B2:B10) - gives the average
    + =Median(B2:B10) - gives the median
    + =var(B2:B10) - gives the variance
    + =stdev(B2:B10) - gives the standard deviation
    + =B16/SQRT(B17) - gives the standard error of the mean WHEN B16 contains the standard deviation and B17 contains the N (count)
    + $ sign in front of the letter (column) or number (row) of the formula fixes the row or column 

***
# Excel Commands for some descriptive Statistics - method 2
  * To give a more detailed output, click Data Analysis in the Analysis group on the Data tab. 
  * If the Data Analysis command is not available, you need to load the Analysis ToolPak add-in program.(see:[Office Support](https://support.office.com/en-ie/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3-9317-6b40ba1a66b4){target="_blank"})
  * Select "Descriptive Statistics"
  * Input the appropriate variable ranges under "Input"
  * If your variables have labels (which they should) check "Labels in First Row"
  * Click on Output Range and in the associated box indicate the cell you want the output to be generated in (make sure this cell is away from other important data) OR simply select "New Worksheet Ply"
  * Sample output is shown below:


Statistic         | value
----------        |------
Mean              | 12
Standard Error	  | 0.447213595
Median	          | 12
Mode	            | 13
Standard Deviation| 1
Sample Variance	  | 1
Kurtosis	        | -3
Skewness          | 0
Range	            | 2
Minimum           | 11
Maximum           | 13
Sum	              | 60
Count	            | 5

# **Proportions test**
<div class="alert alert-info">
**Proportions test**: Use a proportions test to compare/determine if the frequency/probability of an event is different between the various groups being tested (e.g. asking if the proportion of males reporting depression is different from females reporting depression in a sample)
</div>

<div class="alert alert-success">
## Comparing two proportions - Method 1
```{r}
prop.test(x = c(c1, c2), n = c(T1, T2), alternative = "two.sided")
```
  * c1 and c2 reflect the number of subjects in each of the categories (e.g., number of women responding Yes and the number of men responding Y)
  * T1 and T2 reflect the number of TOTAL subjects in each category/group (e.g., the total number of women in the sample, the total number of men in the sample)
  
## Comparing two proportions - Method 2

```{r}
object = c(v1,v2) 
```
  * number of subjects in each of the categories (e.g. number of Catholics in example)
```{r}
Tobject = c(n1,n2) 
```
  * number of TOTAL subjects in each category/group (e.g. Total number of representative in each party)
```{r}
prop.test(object, Tobject) 
```
  * runs the proportions test

</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
## Proportions Test Output in R
```{}
2-sample test for equality of proportions with
	continuity correction

X-squared = 5.7318, df = 1, p-value = 0.01666     # X-squared (chi-squared) = the
                                                  statistic; df = degrees of freedom (n-1)
                                                  - we have 2 groups; p-value = tells us if
                                                  it is significant; IT IS SIGNIFICANT if p
                                                  < 0.05
alternative hypothesis: two.sided
95 percent confidence interval:
  0.01821926 0.18098709                          # If the confidence interval range has the same SIGN - there should be a significant difference
sample estimates:
  prop 1    prop 2 
0.3535714 0.2539683                               # these are the proportions of each of the tested groups
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## Writing a report from a proportions test output
A two-sample test for equality of proportion was conducted to assess whether the proportion of Catholics in the Republican and Democratic party were significantly different. The analysis revealed a significantly higher proportion of Catholics in the Democratic Party(35.4%) relative to the Republican Party(25.4%), X^2^(1,N=532)= 5.73,p<0.05.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
## Comparing more than two proportions - Method 1
```{r}
prop.test(x = c(c1, c2, c3...), n = c(T1, T2, T3...), alternative = "two.sided")
pairwise.prop.test(x = c(c1, c2, c3...), n = c(T1, T2, T3...), alternative = "two.sided")
```
  * c1, c2, c3... reflect the number of subjects in each of the categories (e.g., number of women responding Yes and the number of men responding Y)
  * T1, T2, T3... reflect the number of TOTAL subjects in each category/group (e.g., the total number of women in the sample, the total number of men in the sample)
  * prop.test runs the proportions test - provides the first part of the output.
  * pairwise.prop.test runs the pairwise test to indicate which proportions are different from each other - this provides the second part of the output
  
## Comparing more than two proportions - Method 2
```{r}
object = c(a,b,c,d,e,...) 
```
  * number of subjects in each of the categories (e.g. the number of doctors of a particular faith that said they would NOT perform an abortion; each letter would be replaced by the number of doctors of each specific specific faith that responded "NO")
```{r}
Tobject = c(Ta,Tb,Tc,Td,Te,...) 
```
  * number of Total subjects in each category/group (e.g. the TOTAL number of doctors of a particular faith i.e those that responded YES (they would perform an abortion) + those that responded NO; **Replace Ta** etc. with numbers representing the individual totals)
```{r}
prop.test(object,Tobject) 
```
  * runs proportions test - provides the first part of the output.
```{r}
pairwise.prop.test(object,Tobject) 
```
  * runs pairwise test to indicate which proportions are different from each other - this provides the second part of the output.

</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
## Proportions Test Output in R (more than two proportions)
```{}
# This is the first part of the output that indicates whether there is a difference in the proportions. If there is a difference, however, it will not indicate between which proportions the difference exists.

4-sample test for equality of proportions without
	continuity correction

data:  object out of Tobject
X-squared = 35.169, df = 3, p-value = 1.122e-07
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.2857143 0.5000000 0.2407407 0.7500000 

# This is the second part of the output the indicates WHERE the differences are. The numbers in the top row and in the first column reflect the 4 proportions being tested comparing e.g., 1 vs 2 (not significant); 1 vs 4 (significant at p<0.001) etc., etc.

Pairwise comparisons using Pairwise comparison of proportions 

data:  object out of Tobject 

  1       2       3      
2 0.32422 -       -      
3 0.82154 0.18128 -      
4 0.00013 0.18128 9.5e-07

P value adjustment method: holm 
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## Writing a report from a proportions test output (more than two proportions)
A 4-sample test for equality of proportions was conducted to assess whether the proportion of the 4 samples provided were significantly different. The analysis revealed an overall significant difference X^2^(3,N= **insert number of participants**) = 35.17,p<0.001. Pairwise comparisons indicated a significant difference between samples 1 and 4 and samples 3 and 4 (both p<0.001). All other comparisons were not significant (p>0.05).
</div>
<a href="#top">Back to top</a>

# Proportions Test for situations where there are very small numbers (5 or less)

In situations where there are 5 or less values, the proportions test will display an error. In that case, the Fisher test is the more appropriate test to use.

## Comparing two proportions - in cases of small numbers
```{r}
fisher.test(matrix(c(n1, T1-n1, n2, T2-n2), ncol=2)) 
```
  * n1 represents the first number to be compared, T1 is the total number of subjects in pertaining to variable n1 etc.

## Comparing more than two proportions - in cases of small numbers
  * requires *RVAideMemoire* package
```{r}
fisher.multcomp(matrix(c(n1, T1-n1, n2, T2-n2,.....), ncol=N)) 
```
  * n1 represents the first number to be compared, T1 is the total number of subjects in pertaining to variable n1 etc. 
  * ncol=N: replace ***N*** with the number of variables being compared.
  
# **Odds Ratio**
<div class="alert alert-info">
**Odds Ratio**: measures how likely an event is to occur when exposed to something; a measure of the association between exposure and outcome
</div>

<div class="alert alert-success"> 
# Create Matrix
```{r}
# data taken from Yang, C.-Y., Shih, Y.-H., & Lung, C.-C. (2024). The association between COVID-19 vaccine/infection and new-onset asthma in children - based on the global TriNetX database. Infection. https://doi.org/10.1007/s15010-024-02329-3 **Table S2 in Supplement 1**

# Create matrix
vax <- c('Vax', 'No Vax')
outcome <- c('Death', 'No death')
data <- matrix(c(354, 31734, 309, 159048), nrow=2, ncol=2, byrow=TRUE)
dimnames(data) <- list('Vaccinated'=vax, 'Outcome'=outcome)

#view matrix
data

```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
# Calculate Odds Ratio
```{r}
library(epitools)

#calculate odds ratio
oddsratio(data)

# From the output shown, we would interpret this to mean that the odds that child receiving the vaccine dies is ~5.74 times the odds of one who did not receive the vaccine.
```
</div>

<div class="alert alert-info">
## Writing a report from an odds ratio test output
There was a significant difference (p<0.001) in the odds of death associated with receiving the COVID vaccines in children aged 5-18 years of age relative to those who did not (OR 5.74, 95% CI [4.93, 6.69]).
</div>
<a href="#top">Back to top</a>


***
# Some things you ***may*** have to do **prior to generating a graph**
<div class="alert alert-warning">

## Function and command for generating Standard Error of the Mean (SEM)
  * This function needs to be copied and run **WITHOUT MODIFICATION** before generating the summary statistics using the summarySE command.
  * This code is taken from: [Summarizing Data](http://www.cookbook-r.com/Manipulating_data/Summarizing_data/)

```{r}
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summariezed
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95, .drop=TRUE) {
    library(plyr)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # This does the summary. For each group's data frame, return a vector with
    # N, mean, and sd
    datac <- ddply(data, groupvars, .drop=.drop,
      .fun = function(xx, col) {
        c(N    = length2(xx[[col]], na.rm=na.rm),
          mean = mean   (xx[[col]], na.rm=na.rm),
          sd   = sd     (xx[[col]], na.rm=na.rm)
        )
      },
      measurevar
    )

    # Rename the "mean" column    
    datac <- rename(datac, c("mean" = measurevar))

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}
```
</div>

<div class="alert alert-warning">
```{r}
sumstats <- summarySE(objectL, measurevar="outcomevar", groupvars=c("groupvar_1","groupvar_2","groupvar_n"), na.rm=TRUE)
```
  * sumstats: name of object that will have the summary stats. The name of this should be changed if running for more than one data set.
  * summarySE: command that runs the summary statistics
  * objectL: name of object containing the long version of the data to be analyzed/summarized
  * "outcomevar": name of the variable to be analyzed/summarized
  * "groupvar_1" etc: name of the columns that contain the groups for the variable to be analyzed e.g. Sex, Class, Age...etc.
  * na.rm = TRUE: removes the empty cells (cells that have NA); Not required if all data points are present.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Ordering levels in a specific variable
  * When plotting a graph, R will order the variables in alphabetical order. Very often this is not the best way of plotting the graph. In such cases, R needs to be instructed on the desired order of variables.
  * An example is the order of classes. For this example, the name of the variable in the data (i.e. the column name) is **Class** R will order them: Freshman, Graduate, Junior, Senior, Sophomore
  * This command will create an object that has the reordered levels.
  * In such a case, the name of the new object created would be the name of the predictor that would need to be used in Part I of plotting the graph.
  * For this example ***Class*** is the original name; ***Class_r*** is the rearranged levels
  
```{r}
Class_r = factor (object$Class, levels=c("Freshman", "Sophomore", "Junior", "Senior", "Graduate"))
```
  * notice how when you are indicating the name of the factor that is to be reordered you need to give R the name of the object containing the data (in this case "object" and the name of the column to be reordered separated by a $ symbol)
  * to adapt to any situation - "Class_r" should be renamed appropriately; "object$Class" should be replaced with the appropriate object name and column name to be reorganized; all the levels within the quotes should be replaced accordingly.  
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Changing variable name in a specific column
  * If the variables in a particular column need to be changed e.g. from numbers to words do the following:
  * install the package ***stringr*** using the following command:

```{r}
install.packages("stringr", dependencies = T)
```

  * ensure that you load the package ***stringr*** using the following command:
  
```{r}
library(stringr)
```

  * Next, run the following command in order to preserve the original file:
  
```{r}
object2=object
```
  
  * Next, run the following command on the new modified object (i.e. object2) (**NOTE**: this will change the values in your columns permanently.)
  
```{r}
object2$columnname = str_replace_all(object2$columnname,"oldvalue","newvalue")
```

  * You will need to run this command for **every** value that needs to be replaced
  * When plotting the graph you will need to remember to use the **NEW OBJECT** i.e. ***object2*** as the source of the data
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
# Changing the column name/header in the object/dataframe containing the data
  * When you generate the long version of a file, the predictor column will have the name ***variable***, while the outcome column will be labeled ***value***. 
  * To change the name of a specific column use the following commands
```{r}
names(objectL)[n] <- "correctname"
```
  * **remember that this will be the name that you will need to use in generating the graph, ** ***whenever you are addressing the predictor/outcome***
  * replace the ***n*** with the number of the column whose header you wish to change (e.g. if the data has an id column and you want to change the variable column, this will be column 2)
  * replace ***correctname*** with the name you want to give the column/header

### This is an alternative command to the one above

  * To rename a single column by name
```{r}
colnames(objectL)[colnames(objectL) == "old_col2"] <- "new_col2"
```
</div>
<a href="#top">Back to top</a>


# **Creating a pie chart, bar graph, line graph or box plot**

<div class="alert alert-warning">
# Generating a pie chart
```{r}
object<- c(n1, n2)
labels<- c("Poor (58%)","Normal (42%)")
pie(x=object,labels = labels, radius = 1, main = "Title (n=N)", col = c("#69b3a2", "grey"))
```
  * object: name of object containing number OR percentage of data to be plotted
  * labels: name of object containing the names of the labels of the parts/sections of the pie chart
  * pie: command that plots the pie chart
  * x: indicates what is to be plotted (contained in the dataframe "object" [replace accordingly])
  * labels: indicates the labels for the sections (contained in the dataframe "labels" [replace accordingly])
  * radius - specifies radius of circle
  * col - specifies colors to be used
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/PieChart.png){width=40%}
</center>
</div>
<a href="#top">Back to top</a>

# Generating a bar graph
<div class="alert alert-warning">

## Generating a bar graph - Part I 
```{r}
barobject=ggplot(objectL, aes(predictor,outcome))
```
  * barobject: name of the entity that will contain the plot (call it what you wish, just be clear)
  * ggplot: command that enables the plotting of the graph
  * objectL: is the name of the long version of your data
  * aes: tells R to generate the aesthetics
  * predictor: **REPLACE** with the name of the predictor column in the long version of the file
  * outcome: **REPLACE** with the name of the outcome column in the long version of the file
  * NOTE regarding predictor: If the levels have been rearraged in a new object, the word predictor should be replaced with the name of that specific object (in the example above, ***Class_r*** instead of ***Class***)

## Generating a bar graph - Part II 
```{r}
barobject+stat_summary(fun=mean, geom="bar",fill="white",color="Black")+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit)) + guides(fill = guide_legend(title = "name"))
```
  * *barobject*: name of the plot determined in the previous command
  * *stat_summary(fun.y=mean, geom="bar",fill="white",color="Black")*: plots the mean, as a bar graph, with a white fill and a black outline - **DOES NOT NEED TO BE CHANGED** unless you wish to change the colors
  * *stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)*: plots the error bars (measure of variability) so that they do not overlap - **YOU DO NOT NEED TO CHANGE ANYTHING**; 
    * **NOTE**:   *mean_se* can replace *mean_cl_normal* if you wish to plot the standard error of the mean.
  * *labs(title="**Title line1 \\n Title line2**", x="**predictor label**", y="**outcome label (unit)**")*:Labels the graph; title gives the name of the main heading of the graph; x labels the x-axis; y labels the y-axis; The "**\\n**"  places any words after it on the next line; **DOES NEED CHANGING** - you need to change everything that is in the quotation marks.
  * *theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))*: sets the vertical distance/alignment b/w the specific heading and the axis - **DOES NOT NEED CHANGING**
  * *theme(plot.title=element_text(hjust=0.5))*: centers the title horizontally 
  * *+coord_cartesian(ylim=c(lowerlimit, upperlimit))*: you do **NOT** always need to use this part of the command; It is only needed when the y-axis limits need to be adjusted for better viewing of the graph; If it is not being used, remove the whole command in between the plus signs leaving only one + sign. If used, replace the words lowerlimit and upperlimit with the appropriate numbers.
  * *guides(fill = guide_legend(title = "name"))*: where applicable, provides a new name for the legend (the default is the header of the column).

</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Graph-no color.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

# Generating a bar graph with **different colored bars**
<div class="alert alert-warning">
  * more information on this is available in the R Cookbook, under [Colors (ggplot2)](http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/){target="_blank"}
  
## Generating a bar graph - Part I 
```{r}
barobject=ggplot(objectL, aes(predictor,outcome,fill=predictor))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
```
  * replace barobject, objectL, predictor, outcome with appropriate labels
  * replace colorcode with appropriate color code obtained from the Hex field at the [rapidtables website](https://www.rapidtables.com/web/color/RGB_Color.html ){target="_blank"} - in the case of the color selected in the example shown below, the code would be #3399FF

<div class="alert alert-danger">
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Colorpalate.png){width=50%}
</center>
</div>
  * add as many colors as you have variables

## Generating a bar graph - Part II 
```{r}
barobject+stat_summary(fun=mean, geom="bar")+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
```
  * *stat_summary(fun.y=mean, geom="bar")*: Notice how the section ",fill="white",color="Black"" has been removed because colors are accounted for in the first command
  * Everything else is per the first example of "Generating a Graph"
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Graph-with color.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>


<div class="alert alert-warning">
## Creating a bar graph with **one independent variable** and using the standard error of the mean for error bars
  * You first need to have run the function and associated summarySE command prior to continuing forward (See [Function and command for generating Standard Error of the Mean (SEM)])
  * The example below uses the graph with colored bars information. However, this can be adapted and applied to all forms of graphs.
  
## Generating a bar graph - Part I
```{r}
barobject=ggplot(sumstats, aes(predictor,outcome,fill=predictor))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
```
  * **NOTE:** sumstats is used to provide the data to be plotted
  * Additionally, see [Generating a bar graph with **different colored bars**] for details of what needs to be changed

## Generating a bar graph - Part II 
```{r}
barobject+stat_summary(fun=mean, geom="bar")+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
```
  * *geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)*: This command now provides the information for the errorbars. However you need to replace the word "outcome" with the appropriate name for the outcome variable.
  * Everything else is per the first example of [Generating a bar graph]
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Graph-oneindvar.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating a bar graph with **multiple independent variables** and using the standard error of the mean for error bars
  * You first need to have run the function and associated summarySE command prior to continuing forward (See [Function and command for generating Standard Error of the Mean (SEM)])
  * The example below uses the graph with colored bars information. However, this can be adapted and applied to all forms of graphs.
  
## Generating a bar graph - Part I
```{r}
barobject=ggplot(sumstats, aes(predictor,outcome,fill=predictor_2))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
```
  * **NOTE:** sumstats is used to provide the data to be plotted
  * predictor: the main predictor that is going to be on the x-axis
  * predictor_2: the predictor that provides the sub-categories within the main predictor
  * Additionally, see [Generating a bar graph with **different colored bars**] for details of what needs to be changed

## Generating a bar graph - Part II 
```{r}
barobject+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
```
  * *geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)*: This command now provides the information for the errorbars. However you need to replace the word "outcome" with the appropriate name for the outcome variable.
  * Everything else is per the first example of [Generating a bar graph]
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Graph-twoindvars.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating bar graphs with **multiple independent variables**, plotted as three separate graphs (one per variable) in one grid
  * **Original data (contained in "objectL" contains the following columns "Subject", "var1", "var2", "var3", "data"**
  
## Generating a grid of bar graphs - Example 1
```{r}
# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
  
# Generate graph
bar_objectL=ggplot(objectL_mod, aes(var1,data,fill=var1))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_objectL.Sub=bar_objectL+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5))+guides(fill = guide_legend(title = "var1"))

  # bar_objectL.Sub: contains the command to generate the graph with the various sub-categories present var3
  # +guides(fill = guide_legend(title = "var1")): relabels the heading of the legend

# Create grid
bar_objectL.Sub+facet_grid(cols=vars(var3))
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/facetgrid graph.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">

## Generating a grid of bar graphs - Example 1a
  * **Original data (contained in "objectL" contains the following columns "Subject", "var1", "var2", "var3", "data"**
  * This graph is a modified version of that shown in Example 1 - See [Generating a grid of bar graphs - Example 1]
  * This graph uses the SEM instead of the Confidence interval to generate the error bars and also shows the significances
```{r}
# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
```
* Generate Descriptive Statistics that provide standard error of the mean 
  * For generating descriptive statistics with standard error of the mean please see [Function and command for generating Standard Error of the Mean (SEM)]
  * sumstats_object: this is the name of the object that contains the output descriptive statistics 

```{r}  
# Generate graph
bar_object=ggplot(sumstats_object, aes(x=predictor,y=outcome))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_object.Sub=bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor))+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + guides(fill = guide_legend(title = "predictor")) +coord_cartesian(ylim=c(lowerlimit,upperlimit))+ geom_signif(data=data.frame(predictor2=c("newvar3.1\nname","newvar3.2\nname","newvar3.3\nname")),aes(y_position=c(21.5, 24.5, -2), xmin=c(1.0, 1.0, 1.0), xmax=c(2, 2, 2),annotations = c("**","*","")), tip_length = 0, manual=T)

  # The statistic that was NOT significant has been placed beyond the y-scale limit. The coord_cartesian command ensures that the axes limits are as desired and that the non-significant bar does not show up.
  # outcome: under geom_errorbar - this refers to the heading of the outcome column in the descriptive statistics output.
  # geom_signif: numbers shown in parenthesis need to be adjusted accordingly to accommodate for height (in relation to the y_position) and/or position relative to bars (in relation to the xmin and xmax). Additionally, the annotation needs to be appropriately corrected to reflect the appropriate significances.

# Create grid
bar_BIS.Sub+facet_grid(~BIS_Cat, scales="fixed")

# theme(axis.text.x=element_blank()) - # Removes variable label names in the x-axis if they are not desired.
# theme(axis.ticks.x=element_blank()) - # removes ticks (may be used with above command)

# https://stackoverflow.com/questions/56219468/using-ggsignif-with-grouped-bar-graphs-and-facet-wrap-not-working
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/graph with signficance & facets.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Generating a grid of bar graphs - Example 2
  * **Original data (contained in "objectL" contains the following columns "Subject", "var1", "var2", "var3", "data"**
```{r}
# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname"))  # do not add spaces before and after \n or the facet title will not be centered
  
# Generate graph
bar_objectL=ggplot(objectL_mod, aes(var1,data,fill=var2))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_objectL.Sub=bar_objectL+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5))+guides(fill = guide_legend(title = "var2"))

  # bar_objectL.Sub: contains the command to generate the graph with the various sub-categories present var3
  # +guides(fill = guide_legend(title = "var2")): relabels the heading of the legend

# Create grid
bar_objectL.Sub+facet_grid(cols=vars(var3))
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/facetgrid graph_1.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Generating a bar graphs with significance shown - Example 3

* Generate Descriptive Statistics that provide standard error of the mean 
  * For generating descriptive statistics with standard error of the mean please see [Function and command for generating Standard Error of the Mean (SEM)]
  * sumstats_object: this is the name of the object that contains the output descriptive statistics 

```{r}
# Plot Graph

predictor.r = factor (sumstats_object$predictor.r, levels=c("level1", "level2", "level_n"))
  # predictor.r : name of variable that has had levels reorganized for graphing purposes
  # predictor.r: predictor that requires levels to be organized for graphic purposes
  # level1, etc.: level names listed in the order that you wish them graphed

bar_object=ggplot(sumstats_object, aes(x= predictor.r,y=outcome))+ scale_fill_manual(values=c("#CC0000", "#004C99")) 

bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor))+geom_errorbar(aes(ymin=Outcome-se, ymax=Outcome+se),position=position_dodge2(0.9, padding=0.6))+labs(title="Title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + guides(fill = guide_legend(title = "Legend Title")) + geom_signif(y_position=c(30.5, 16.5, 15.5), xmin=c(0.7, 4.7, 5.7), xmax=c(1.3, 5.3, 6.3),annotation = c("***","*","**"), tip_length = 0)
  
  # outcome: under geom_errorbar - this refers to the heading of the outcome column in the descriptive statistics output.
  # geom_signif: numbers shown in parenthesis need to be adjusted accordingly to accommodate for height (in relation to the y_position) and/or position relative to bars (in relation to the xmin and xmax). Additionally, the annotation needs to be appropriately corrected to reflect the appropriate significances.

# Resources that may be of assistance

# https://statisticsglobe.com/ggsignif-package-r
# https://stackoverflow.com/questions/51443889/how-can-i-get-geom-errorbar-to-dodge-correctly-on-a-bar-chart-in-ggplot2
# https://stackoverflow.com/questions/59008974/why-is-stat-identity-necessary-in-geom-bar-in-ggplot#59009108
# http://sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization
# https://stackoverflow.com/questions/61022992/manually-plotting-significance-relations-between-sub-groups-on-ggplot2-barplot
```

</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/graph with signficance.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>


<div class="alert alert-warning">
## Generating a bar graphs with significance shown and reordered facets - Example 4

* Generate Descriptive Statistics that provide standard error of the mean 
  * For generating descriptive statistics with standard error of the mean please see [Function and command for generating Standard Error of the Mean (SEM)]
  * sumstatsobject: this is the name of the object that contains the output descriptive statistics 

```{r}
# Plot Graph

# Generate graph

bar_object=ggplot(sumstatsobject.r,aes(x=predictor1,y=outcome))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_object.Sub=bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor1)) + geom_errorbar(aes(ymin=Data-se, ymax=Data+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title Line1\nTitleLine2", x="x-label", y="y-label") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) + guides(fill = guide_legend(title = "Legend Title")) + coord_cartesian(ylim=c(0,15)) + geom_signif(data=data.frame(ExecFn=c("factor 1", "factor 3", "factor 5", "factor 11",  "factor 12")), aes(y_position = c(13.5,12.5,12.5,14,13), xmin=c(1,1,1,1,1), xmax=c(2,2,2,2,2), annotations = c("**","**","**","#","*")), tip_length = 0, manual=T)

# NOTE that only the factors that show significance are listed and the information in the aes also only relates to these.


# Create grid with reorganized factors
bar_object.Sub+facet_grid(~factor(MainPredictor, levels=c("factor 1" ,"factor 2","factor 3", "factor 4", "factor 5", "factor 6", "factor 7", "factor 8", "factor 9", "factor 10",  "factor 11", "factor 12")), scales="fixed")  

# MainPredictor - replace this with the name of the column that represents the x-label predictor not sub-predictors (An example of the MainPredictor is the name of the predictor that us made up of the 12 factors shown in the graph; an example of a sub-predictor is predictor1 which in this case is Yes or No)

```

</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/facetgrid graph_reordered factors.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating a line chart
  * variable y is being plotted across variable x for two groups
```{r}
lp<- ggplot(SCl, aes(x=xvarname, y=yvarname, group=groupname, color=groupname)) + # x 
  stat_summary(fun=mean, geom="line") + # plots the line
  stat_summary(fun=mean, geom="point", size=3)+ # adds points to the plot
  stat_summary(fun.data=mean_se, geom="errorbar",position=position_dodge(width=0.0), width=0.2)+ #using standard error of the mean for the error bar.
     scale_color_manual(values=c('darkblue','red'))+
  labs(title="Title", x="x-axis label", y="y-axis label")+
  theme(axis.title.x=element_text(vjust=-0.4))+
  theme(axis.title.y=element_text(vjust=0.3))+
  theme(title=element_text(vjust=+1.1))+
  theme(plot.title=element_text(hjust=0.5))

lp # displays the lineplot
```
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/lineplot.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating a simple boxplot
```{}
boxplot(Outcome~Predictor,data=objectL, main="Titleline1 \n Titleline 2",xlab="x-axis label", ylab="y-axis label")

or 

# boxplot with data jittered and showing & two different colors

ggplot(objectL, aes(x=Predictor, y=Outcome, fill=Predictor)) + stat_boxplot(geom='errorbar', position=position_dodge(width=0.5),width=0.2) + geom_boxplot(width=0.4) + geom_jitter(position=position_jitter(0.15)) + scale_fill_manual(values=c("#E0E0E0", "#606060"))
```
</div>
<a href="#top">Back to top</a>

***
# **Correlation**

<div class="alert alert-info">
**Correlation**: a correlation tests whether there is a RELATIONSHIP between two observed variables within the same sample (e.g. asking if the number of cups of coffee consumed is related to amount of alertness in a sample)
</div>

<div class="alert alert-warning">
## Plotting a correlation with a regression line
```{r}
ggscatter(object, x = "xvariable", y = "yvariable",add = "reg.line", conf.int = TRUE,cor.coef = TRUE, cor.method = "pearson",xlab = "xlabel (units)", ylab = "ylabel(units)", title="title for the graph")+theme(plot.title=element_text(hjust=0.5))
```
  * ggscatter: plots the correlation
  * object: name of the wide version of data file
  * xvariable: replace with name of column of variable to be plotted in the x-axis
  * yvariable: replace with name of column of variable to be plotted in the y-axis
  * reg.line: adds the regression line
  * conf.int = TRUE: plots the confidence interval (shaded area)
  * cor.coef = TRUE: shows the correlation coefficient
  * cor.method = "pearson": calculates and shows the p value
  * xlab: replace words in quotes with appropriate wording for the x-label
  * ylab: replace words in quotes with appropriate wording for the y-label
  * title: replace words in quotes with a proper title
  * theme(plot.title=element_text(hjust=0.5)): centers the title
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
## Running a correlation
```{r}
cor.test(object$predictor, object$outcome)
```
  * gives the correlation coefficient
  * gives the p-value (which tells you of the correlation is significant)
  * Important notes to remember from the results
    + df = degrees of freedom (n-2 (for correlation))
    + p value: p value of less than 0.05 is significant
    + the value under "cor" = correlation coefficient
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a Correlation Output
```{}
	Pearson's product-moment correlation

data:  ingrd$Inc and ingrd$Grd
t = 2.968, df = 12, p-value = 0.01174 ## used for report
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1833572 0.8780885
sample estimates:
      cor 
0.6506389 ## used for report
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info"> 
### Writing a report from a correlation
A Pearson's correlation indicated that a student's grade was significantly related to the family's income, r(12)=-.65, p<.05.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Creating a Correlation Matrix

```{r}
corPlot(Object,cex = 0.8, stars=TRUE, upper=FALSE, main = "TitleLine1 \n TitleLine2",gr = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))) 
```
  * Object contains **only** the variables to be included in the correlation matrix
  * Replace name of Object with the appropriate name
  * Replace Title information with the appropriate information

### Example of a correlation plot/matrix generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Correlation Matrix.jpg){width=75%}
</center>
</div>
<a href="#top">Back to top</a>

***
# **Regression**

<div class="alert alert-info">
**Regression**: a regression DESCRIBES the RELATIONSHIP between a predictor (an independent variable) and an outcome (dependent) between two observed variables within the same sample. Regressions can take various forms - linear (e.g. calibration of pH meter), sigmoidal (e.g. in enzyme kinetics/receptor binding). Distinction between correlation & regression: While a **correlation** can tell you that there is a relationship between [H~3~O^+^] in a solution and the pH value, **regression** provides you with the line that best predicts that relationship.
</div>

<div class="alert alert-success">
## Types of Regression
  * **Logistic regression** is the appropriate to conduct when the dependent variable is **dichotomous** (binary)
  * **Stepwise regression** is appropriate when you have many variables and you're interested in identifying a useful subset of the predictors.

</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
## Simple Regression
```{r}
ObjectReg=lm(outcome~predictor, data=object)
summary(ObjectReg)
```
  * Generates the regression equation and it is stored in "ObjectReg" 
  * The ***lm*** regression command
    + uses WIDE form of the data 
    + Designate one variable as outcome and the other as predictor - Do so in such a way that makes sense. e.g. It is more likely that family income would predict a student's grade than a student's grade predicting family income
  * The ***summary*** command: gives the output of the regression equation, from which you extract all of the statistical information that goes into a regression report.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example Regression Output
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/RegressionOutput.jpg){width=100%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a Regression Equation derived from the above output
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/RegeqEx.jpg){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
### Writing a report from a regression output
The results indicated that Fear of Sin significantly predicted Depression, Beta = 0.222, t(95)=2.336, p < 0.05. Fear of Sin also explained a significant proportion of variance in Depression, R^2^ = .044, F(1, 95) = 5.455, p < 0.05.
</div>
<a href="#top">Back to top</a>

## Multiple Regression (backward elimination) (using AIC)
  * requires package: leaps

```{r}
fitObject = lm(y~x1+x2+x3...,data=na.omit(objectL))
```
  * lm: is the command that runs a linear model analysis
  * na.omit will omit those subjects with missing data points & therefore cannot be included in analysis.

```{r}
stepObject = stepAIC(fitObject, direction="backward", trace=FALSE) # direction can be "both" or "forward"
stepObject$anova # display results with AIC
summary(stepObject) # display results with t values
```

<div class="alert alert-success">
## Multiple Regression (backward elimination) (using F-value)
  * requires package: rms
  * from: http://rstudio-pubs-static.s3.amazonaws.com/2899_a9129debf6bd47d2a0501de9c0dc583d.html
  * output similar to Sigmaplot
```{r}
fullObject = ols(y~x1+x2+x3..., data = objectL)
fastbw(fullObject, rule = "p", sls = 0.1)
```
</div>

## Logistic Regression

  * used when outcome is dichotomous/binary
  * also see: https://stats.idre.ucla.edu/r/dae/logit-regression/ for additional information
```{r}  
mylogit = glm(Outcome ~ Predictor1 + Predictor2 + Predictor3..., data = ObjectL, family = "binomial")
summary(mylogit) # gives Beta and p values

exp(cbind(OR = coef(mylogit), confint(mylogit))) # gives Odds Ratio(OR)
```

***
# Cronbach's alpha
  * Surveys are generally divided into sections /groups of questions 
  * Prior to concluding that questions related to each other, we need to make sure that questions within a specific group of questions are consistent with each other (i.e. highly related)
  * to measure if a group of questions is  consistent or not we use cronbach's alpha
  * we will also be able to know if consistency improves or not if we delete a question from the group of questions (this is only shown in Method 2).

## Method 1
  # subjects with missing data should be removed before hand using ObjectNA=na.omit(object)
```{r}
matobject=data.matrix(object)
reliability(cov(matobject))
```

## Example output from Method 1
```{}
$alpha
     alpha 
 0.9403455 
 
 $st.alpha
 std.alpha 
 0.9417464 
 
 $rel.matrix
                 Alpha Std.Alpha r(item, total)
 D2Stop      0.9350532 0.9368506      0.7365112
 Cont2A      0.9335049 0.9354005      0.7837842
 AvsOthers   0.9354148 0.9366306      0.7304763
 DepSleep    0.9383129 0.9398743      0.6302478
 ThinkNOL    0.9353017 0.9364763      0.7397708
 LkFrwrd     0.9348757 0.9361827      0.7433303
 ThnkLessT   0.9424475 0.9436797      0.5152945
 TrdLessT    0.9362815 0.9380178      0.6999771
 RushWrk     0.9355317 0.9366959      0.7302428
 NegOblig    0.9352094 0.9363647      0.7406319
 AccDown     0.9334180 0.9354451      0.7853325
 Escape      0.9346224 0.9362977      0.7539146
 Frustration 0.9336228 0.9352146      0.7798224
 
 attr(,"class")
 [1] "reliability"
```

## Method 2
```{r}
alphaobject=alpha(object)

alphaobject # shows output
```

## Example output from Method 2
```{}
Reliability analysis   
Call: alpha(x = CIUS)

  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
      0.94      0.94    0.95      0.55  16 0.0051  1.8 0.97     0.55

 lower alpha upper     95% confidence boundaries
0.93 0.94 0.95 

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
D2Stop           0.94      0.94    0.95      0.55  15   0.0056 0.0139  0.56
Cont2A           0.93      0.94    0.95      0.55  14   0.0057 0.0137  0.55
AvsOthers        0.94      0.94    0.95      0.55  15   0.0055 0.0126  0.55
DepSleep         0.94      0.94    0.95      0.57  16   0.0053 0.0135  0.57
ThinkNOL         0.94      0.94    0.95      0.55  15   0.0055 0.0139  0.55
LkFrwrd          0.93      0.94    0.95      0.55  15   0.0056 0.0124  0.55
ThnkLessT        0.94      0.94    0.95      0.58  17   0.0049 0.0077  0.56
TrdLessT         0.94      0.94    0.95      0.56  15   0.0054 0.0136  0.56
RushWrk          0.94      0.94    0.95      0.55  15   0.0055 0.0125  0.55
NegOblig         0.94      0.94    0.95      0.55  15   0.0055 0.0129  0.55
AccDown          0.93      0.94    0.94      0.55  14   0.0057 0.0130  0.55
Escape           0.93      0.94    0.95      0.55  15   0.0056 0.0130  0.56
Frustration      0.93      0.94    0.95      0.55  14   0.0057 0.0132  0.55

 Item statistics 
              n raw.r std.r r.cor r.drop mean  sd
D2Stop      299  0.78  0.78  0.76   0.74 2.29 1.3
Cont2A      299  0.82  0.82  0.81   0.78 2.19 1.3
AvsOthers   299  0.77  0.78  0.77   0.73 1.20 1.2
DepSleep    299  0.69  0.69  0.65   0.63 1.25 1.3
ThinkNOL    299  0.78  0.79  0.77   0.74 1.64 1.1
LkFrwrd     299  0.79  0.80  0.78   0.74 1.43 1.2
ThnkLessT   299  0.59  0.58  0.53   0.52 2.84 1.4
TrdLessT    299  0.75  0.74  0.72   0.70 2.12 1.4
RushWrk     299  0.77  0.78  0.77   0.73 0.93 1.1
NegOblig    299  0.78  0.79  0.77   0.74 1.12 1.1
AccDown     299  0.83  0.82  0.82   0.79 2.27 1.4
Escape      299  0.80  0.79  0.79   0.75 2.06 1.5
Frustration 299  0.82  0.82  0.81   0.78 1.41 1.3

Non missing response frequency for each item
               0    1    2    3    4 miss
D2Stop      0.14 0.17 0.18 0.30 0.21    0
Cont2A      0.15 0.15 0.21 0.31 0.17    0
AvsOthers   0.37 0.24 0.23 0.13 0.03    0
DepSleep    0.40 0.19 0.22 0.14 0.05    0
ThinkNOL    0.16 0.30 0.31 0.17 0.05    0
LkFrwrd     0.30 0.25 0.22 0.18 0.05    0
ThnkLessT   0.12 0.07 0.10 0.26 0.45    0
TrdLessT    0.17 0.17 0.22 0.25 0.19    0
RushWrk     0.49 0.20 0.20 0.07 0.03    0
NegOblig    0.41 0.20 0.25 0.12 0.01    0
AccDown     0.17 0.11 0.19 0.31 0.21    0
Escape      0.24 0.13 0.15 0.28 0.19    0
Frustration 0.37 0.17 0.23 0.16 0.08    0
```

For additional information on interpretation please see: [**Reliability Analysis**](https://rpubs.com/hauselin/reliabilityanalysis)

***
<div class="alert alert-success">
# Testing for Homogeneity of variance
  * The Levene Test tests whether the VARIANCES are significantly different from each other
```{r}
leveneTest(objectL$outcome, objectL$predictor, center = median)
```
  * center: is the name of a function to compute the center of each group. If you use "mean" instead of median - this gives the original Levene's test; The default is "median" which provides a more robust test.
  * if variances are significantly different (p<0.05), in the t-test include the command: var.equal = FALSE --> calculates a more robust t-test;
  * if variances are NOT significantly different (p>0.05), in the t-test include the command: var.equal = TRUE --> calculates regular t-test
  * p value is the number found under the Pr(>F); if p>0.05 variances are homogenous (the statistically same)
  * Levene test is **NOT** necessary in REPEATED Measures tests; In such cases, in a ***t-test*** you should use **var.equal = TRUE**

</div>
<a href="#top">Back to top</a>

# **Conducting a t-test in R**
  
<div class="alert alert-info">
**t-test**: a t-test tests whether the MEANS/AVERAGES of **TWO** samples are significantly different from each other  (e.g. asking if the average depression scores of males is different from that of females in a sample)
</div>

<div class="alert alert-success">
## Method 1 - for **LONG** version of data

**PLEASE NOTE** this command is currently (10/28/24) not working properly and giving the following error "*cannot use 'paired' in formula method*". This page will be updated once a solution is reported. The issue appears to be in the "paired=..." part of the command.
    
```{r}
t.test(outcome~predictor, data=objectL, paired=FALSE/TRUE, var.equal=FALSE/TRUE)
```

<h3 style ="text-align:center;">**OR**</h3>

</center>

## Method 2 - for **WIDE** version of data
```{r}
t.test(object$var1,object$var2, paired=FALSE/TRUE, var.equal=FALSE/TRUE)
```

  * For **METHOD 1**:
    * *t.test*: command to calculate a t-test
    * *outcome*: name of the outcome column in the long version of the data file
    * *predictor*: name of the predictor column in the long version of the data file
    * *~* : predicted by
    * *data=objectL*: specifies the name of the long version of the data file. REPLACE ONLY the word objectL with the appropriate name of the long version of the file
  * For **METHOD 2**: 
    * *object, var1 and var2* - "object" needs to be replaced with appropriate name of object containing WIDE version of the data; "var1" and "var2" need to be replaced with the appropriate names of the columns in the WIDE version of the data. 
  * For **BOTH**:
    * *paired = TRUE*: **USED for REPEATED MEASURES**, also known as paired
    * *paired = FALSE*: **USED for INDEPENDENT MEASURES**
    * *var.equal = TRUE*: used when the Levene Test p > 0.05 (remember p = the number under Pr(>F)); **NOTE:** Also use this in the case of a **repeated measures t-test** when a Levene test is **not** required.
    * *var.equal = FALSE*: used when the Levene Test p < 0.05
    * If var.equal = FALSE: the degrees of freedom (df) will NOT be a whole number (because the specific t-test used uses a more conservative way of calculating the df.)
    * reporting: t(df)=tvalue,pvalue, r^2=
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-success">
# Calculating the effect size (r squared)
```{r}
tvalue^2/(tvalue^2+df)
```
  * all information for the above equation is obtained from the output of the t-test
  * for the tvalue ignore negative signs
  * reporting: t(df)=tvalue, p..., r^2=.
</div>
<a href="#top">Back to top</a>


# Calculating the effect size (Cohen's d) (alternative option)
```{r}
# requires package effsize

cohen.d(data= object, group1, group2)
```
  * If the groups are not of equal size, add the na.omit command to group with the least numbers e.g. if group2 has less, use "na.omit(group2)" instead of simply "group2"
  * object - replace with the name of the object containing the **WIDE** form of the data
  * group1 or group2 - replace with the names of the columns of each of the groups.
  
<a href="#top">Back to top</a>

***
<div class="alert alert-danger">
# t-test output in R
  * Paired t-test / Two Sample t-test # this title of the output indicates whether the test you just ran is paired/repeated OR independent measures/Two Sample t-test.
  * The line with the t value, df and p value is the most important line; t: gives you the t value; df gives you the degrees of freedom and p-values gives you the p value. ALL OF THESE NEED to be reported; the report will include this data in the following format t(df)=tvalue, p

  * output when the sex differences in vitamin D data is analyzed **CORRECTLY** using an independent measures test:
```{}
t = -2.2258, df = 16, p-value = 0.04075 
```
  
  * output when the sex differences in vitamin D data is analyzed **WRONGLY** using a repeated measures test:
```{}
t = -2.5964, df = 8, p-value = 0.03179
```


  * If these numbers have the SAME sign it means that it is likely that there is a significant difference:
```{}
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -18.252038  -1.081295
```

  * In a repeated measures test, R only gives you the output for MEAN OF THE **DIFFERENCES**:
```{}
sample estimates:
  mean of the differences 
-9.666667 
```

  * In an independent measures test, R gives you the output for the mean for **EACH GROUP**:
```{}
mean in group Boys mean in group Girls 
25.66667            35.33333
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a t-test output and an effect size (r^2^)calculation with the important data used in the report highlighted.

<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Independent t-test output.jpg){width=60%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## Writing a report for a t-test
An independent measures t-test was utilized to assess whether there is a significant difference in vitamin D levels between boys and girls. The analysis indicated a significant difference in vitamin D levels, t(16)=-2.226, p<0.05; rsquared=   , specifically girls (M= ; SEM= ) reported a significantly higher level of vitamin D then boys (M= ; SEM= ).

**OR**

An independent measures t-test was utilized to assess whether there is a significant difference in vitamin D levels between boys(M= ; SEM= ) and girls(M= ; SEM= ). The analysis indicated a significant difference in vitamin D levels, t(16)=-2.226, p<0.05; rsquared=   ,specifically girls reported a significantly higher level of vitamin D then boys.
</div>
<a href="#top">Back to top</a>

***
# Conducting a t-test in excel (method 1)
  * =ttest(array1,array2, tails, type)
    + Array 1: first column containing the data
    + Array 2: second column containing data
    + tails: 1 or 2 tailed
    + type: 1= Paired; 2= Two-sample equal variance; 3= two-sample unequal variance.

***
# Conducting a t-test in excel (method 2)
  * To give a more detailed output, click Data Analysis in the Analysis group on the Data tab. 
  * If the Data Analysis command is not available, you need to load the Analysis ToolPak add-in program.(see:[Office Support](https://support.office.com/en-ie/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3-9317-6b40ba1a66b4){target="_blank"})
  * Select the type of t-test you need to run
  * Input the appropriate variable ranges under "Input"
  * If your variables have labels (which they should) check "Labels"
  * Click on Output Range and in the associated box indicate the cell you want the output to be generated in (make sure this cell is away from other important data)
  * Sample output is shown below:

Statistic                   | Set 6	     |Set 7
---------                   |------------|-----------
Mean	                      |11.75	     |11.5
Variance	                  |46.91666667 |33.66666667
Observations	              |4	         |4
Pearson Correlation	        |0.910007244	
Hypothesized Mean Difference|0	
df	                        |3	
t Stat	                    |0.174077656	
P(T<=t) one-tail	          |0.436444286	
t Critical one-tail	        |2.353363435	
P(T<=t) two-tail	          |0.872888572	
t Critical two-tail	        |3.182446305	

***
<div class="alert alert-success">
# **One-way ANOVA - INDEPENDENT MEASURES**

<div class="alert alert-info">
**ANOVA**: an ANOVA tests whether the MEANS/AVERAGES of **MORE THAN TWO** samples are significantly different from each other  (e.g. asking if there is a significant difference in the average depression scores between Freshman, Sophomores, Juniors and Seniors). ANOVAs can be complicated given that comparisons can be carried out within groups and between groups.
</div>

  * Run Levene Test: For the DF group = k-1 (k=number of groups/treatments) = DF for the Between Treatment
  * The other degrees of Freedom is DF for the WITHIN treatment = N-k (N=sum total of subjects; k=number of treatments)
```{r}
aovobject=aov(outcome~predictors, data=objectL)
```
  * this calculates the ANOVA
  * Use the long version of the file
  * aovobject: name of the object that will contain the ANOVA calculation

```{r}
summary(aovobject)
```
  * this gives the output for the ANOVA (**See example output below**)
  * name of predictor column: between Treatment (b/T)
  * Residuals: within Treatment (w/T)
  * DF: degrees of freedom (needed for reporting)
  * Sum Sq: Sum of Squares (Note: needed for the calculation of the effect size)
  * Mean Sq: Mean Square = variances
  * F value: test statistic (needed for reporting)
  * Pr(>F): p value (needed for reporting)
  * your statistics in the report should be in form: F(DF(b/T),DF(w/T))= F value, p value, effect size)
  * EFFECT SIZE: Sum Sq is used for effect size: Sum Sq(b/T)/(Sum Sq(total))
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of an Independent Measures Output 
```{}
            Df Sum Sq Mean Sq F value   Pr(>F)    
variable     2  28.22  14.111   11.91 0.000256 ***  
Residuals   24  28.44   1.185                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-success">
# Post-hoc analysis using Tukey Test
  * **A post-hoc analysis is run whenever the first part of the ANOVA shows a p<0.05**
```{r}
phobject=glht(aovobject, linfct=mcp(predictor="Tukey"))
```
  * phobject: name of the object that will contain the post-hoc Tukey test
  * aovobject: the name of the object that contains the ANOVA (see previous ANOVA command)
  * predictor: needs to be replaced with name of the PREDICTOR column in your data file.

```{r}
summary(phobject)
```
  * gives the summary of the post-hoc analysis (**see example output below**)
  * Pr(>|t|): is the p value for the specific groups being compared.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
  * If the variances are **not** equal the Games-Howell test is conducted (c.f. Welch test in the t-test)
  * requires *statix* package
```{r}
games_howell_test(data = objectL, formula = outcome ~ grouping_variable)
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a post-hoc output for Independent Measures ANOVA
```{}
	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = value ~ variable, data = migraineL)

Linear Hypotheses:
                   Estimate Std. Error t value Pr(>|t|)    
DrugB - DrugA == 0   2.1111     0.5132   4.114 0.001063 ** 
DrugC - DrugA == 0   2.2222     0.5132   4.330 0.000605 ***
DrugC - DrugB == 0   0.1111     0.5132   0.217 0.974514    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Equation for Eta Squared (effect size) for Independent Measures ANOVA (information taken from the first step of the ANOVA). 

<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Eta Squared.jpg){width=60%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
### Writing a report from an Independent Measures ANOVA output
Pain levels were reported on a scale of 1 to 10 (1=low, 10=high) following the administration of various pain relievers by individuals who suffer from migraines. A One Way ANOVA indicated a significant difference (F(2, 24) = 11.91; p<0.001; n^2^= 0.498) in reported pain levels across the drugs administered. Post-hoc analysis using Tukey test indicated that individuals reported significantly lower pain levels following the administration of Drug A (M=3.67, SEM=0.29) relative to both Drug B (M=5.78; SEM=0.49; p<0.01) and Drug C (M=5.89; SEM=0.26; p<0.001). There was no significant difference between the reported pain levels following the administration of Drugs B and C (p>0.05).
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-success">
# **One-Way REPEATED MEASURES (RM) ANOVA**
```{r}
aovobject=aov(outcome~predictor+Error(participant/predictor), data=objectL)
```
  * participant: replace with the label of the SUBJECT column in the long version of the file
  * predictor: replace with the name of the PREDICTOR column in the long verstion of the file.

```{r}
summary(aovobject)
```
  * predictor: between Treatment value (b/T)
  * Residuals: error
  * Df: degrees of freedom (variable/predictor = k-1 where k=number of groups. Residuals=(df(w/T)-df(b/S))); w/T = Within Treatment = N-k, where N=Total number of "samples"/data points and k=number of groups; b/S = Between subjects = n-1, where n = number within a group; 
  * Sum Sq: sum of squares (needed for the calculation of the effect size)
  * Mean Sq: variances
  * F value: test statistic
  * Pr(>F): p value
  * your report of these statistics should be in the form of F(df(b/T),df(error))= Fvalue, pvalue, eta squared (effect size)
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a Repeated Measures Output
<center>
An Example of a Repeated measures ANOVA output with important numbers used in report highlighted
![](C:/Users/ssamm/Desktop/Sammut/Research/R Markdown/RptdMeasuresANOVA-First part.JPG){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-success">
# Post-hoc Analysis using Pairwise comparisons
  * **A post-hoc analysis is run whenever the first part of the ANOVA shows a p<0.05**
  * **The tukey test does NOT work for this type of ANOVA command.** A Pairwise comparison is used instead.
```{r}
with(objectL, pairwise.t.test(outcome,predictor, p.adjust.method="bonferroni", paired=T))
```
  * pairwise comparisons: calculate the significant differences between the pairs of groups
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a post-hoc output for Repeated Measures ANOVA
<center>
An Example of a post-hoc analysis output for a Repeated measures ANOVA with important numbers used in report highlighted. These values are the **p-values** comparing the variable in the vertical column on the left to the horizontal row above the numbers. 

![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/RMANOVAposthoc.JPG){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Equation for Eta Squared (effect size) for Repeated Measures ANOVA. The information is found under the section **Error: Within** in the first step of the ANOVA

<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Eta squared.jpg){width=60%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
### Writing a report from a Repeated Measures ANOVA output
Repeated measures ANOVA revealed a significant overall effect across time (F(2,24) = 20.65, p< 0.001; n^2^=0.63), following the administration of drug. Post-hoc Bonferroni analysis indicated that anxiety scores were significantly lower than pre-drug (M=8.10, SEM=0.23) administration on both week 1 (M=6.80, SEM = 0.25; p<0.05) and week 2 (M=3.00, SEM = 0.26; p<0.001). Moreover the anxiety scores on week 2 were also significantly lower than those of week 1 (p<0.001).
</div>
<a href="#top">Back to top</a>

* Additional information in relation to ANOVA and some code for more challenging ANOVAs can be found [here](https://www.uvm.edu/~dhowell/StatPages/R/RepeatedMeasuresAnovaR.html){target="_blank"}

<div>
***
# How to use R for multiple-select questions
  * Replace OBJECT with name of the object containing the imported data (long version)
  * First column needs to be the subject number
  * [This should be done in Excel] All values in the rows need to be replaced with a 1 (implies there is a value or count) or 0 (blank)
  * Replace "variable1" with the name of the column heading of that specific variable (e.g. if counting number of people in household and/or Intramural sports, variable1 etc would be replaced with the names of these columns)
  * This code is modified from https://stackoverflow.com/questions/11622660/how-to-use-r-for-multiple-select-questions

```{r}
set.seed(1)
dat <- data.frame(OBJECT,"variable1","variable2", "variable3", "..n")
                  
multi.freq.table = function(OBJECT, sep="", dropzero=FALSE, clean=TRUE) {
  # Takes boolean multiple-response data and tabulates it according to the possible combinations of each variable.
  
  counts = data.frame(table(OBJECT))
  N = ncol(counts)
  counts$Combn = apply(counts[-N] == 1, 1, 
                       function(x) paste(names(counts[-N])[x],
                                         collapse=sep))
  if (isTRUE(dropzero)) {
    counts = counts[counts$Freq != 0, ]
  } else if (!isTRUE(dropzero)) {
    counts = counts
  }
  if (isTRUE(clean)) {
    counts = data.frame(Combn = counts$Combn, Freq = counts$Freq)
  } 
  counts
}

multi.freq.table(dat[-1], sep="-")
```
***
# How to count number of occurences in a column 

  * the counts are sorted largest to smallest
```{r}
count(OBJECT, vars = Columnname, sort=TRUE)
```

or

  * the output is sorted from lowest to highest
```{r}
sort(table(OBJECT$Columnname))
```

***
<div class="alert alert-info">
# Plotting map of US with data
## METHOD 1: Plot map of US reflecting various levels - Only plots **lower 48**
* the information below is modified from the following website: https://remiller1450.github.io/s230s19/Intro_maps.html
  
### Load packages
```{r}
library(ggplot2);library(maps);library(dplyr) 
```

### Load the state map information
```{r}
MainStates <- map_data("state")
```

### Copy data
* StateNumber: object containing data for number of students from every state
* this object contains the data with the names of the states and numbers of students from every state
* states that have no representation i.e. 0 students should be blank
* data needs to have a column named "region" (***lower case***)

```{r}
StateNumber = read.table(file="clipboard", sep="\t", header=TRUE, as.is=TRUE)
```

### Plots all states with ggplot2, using black borders and light blue fill
```{r}
ggplot()+geom_polygon(data=MainStates,aes(x=long, y=lat, group=group),color="black", fill="lightblue")
```

### Merges Mainstates Map with StateNumber 
```{r}
MergedStates <- inner_join(MainStates, StateNumber, by = "region")
```

```{r}
p <- ggplot()
p <- p + geom_polygon( data=MergedStates, aes(x=long, y=lat, group=group, fill = n), color="white", size = 0.2) 
p
```

## Changing color scheme
```{r}
p <- p + scale_fill_continuous(name="Number of Students",low = "lightblue", high = "darkblue",limits = c(0,140),breaks=c(20,40,60,80,100,120,140), na.value = "grey50")+labs(title="title of graph")
p
```
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Map.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## METHOD 2: Plot map of US reflecting various levels - Plots **all 50 states**
* the information below is modified from the following website: https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html
  
### Load packages
```{r}
library(usmap) 
```

### Copy data
* StateNumber: object containing data for number of students from every state
* this object contains the data with the names of the states and numbers of students from every state
* states that have no representation i.e. 0 students should be blank
* data needs to have a column named "region"
* The excel data must contain the following columns. The data below can be copied and then pasted into excel using the "paste special/text only" option.
* The **fips** column stands for the Federal Information Processing Standard (FIPS) which provides a set of standard numeric codes for referring to U.S. states.
* Please note that, in the FIPS column, the 0 in front of the single digit numbers is not a requirement - this was just how I set it up.

fips|region|   n   |  
----|------|-------|
01	|Alabama
02	|Alaska
04	|Arizona
05	|Arkansas
06	|California
08	|Colorado
09	|Connecticut
10	|Delaware
12	|Florida
13	|Georgia
15	|Hawaii
16	|Idaho
17	|Illinois
18	|Indiana
19	|Iowa
20	|Kansas
21	|Kentucky
22	|Louisiana
23	|Maine
24	|Maryland
25	|Massachusetts
26	|Michigan
27	|Minnesota
28	|Mississippi
29	|Missouri
30	|Montana
31	|Nebraska
32	|Nevada
33	|New Hampshire
34	|New Jersey
35	|New Mexico
36	|New York
37	|North Carolina
38	|North Dakota
39	|Ohio
40	|Oklahoma
41	|Oregon
42	|Pennsylvania
44	|Rhode Island
45	|South Carolina
46	|South Dakota
47	|Tennessee
48	|Texas
49	|Utah
50	|Vermont
51	|Virginia
53	|Washington
54	|West Virginia
55	|Wisconsin
56	|Wyoming

```{r}
StateNumber = read.table(file="clipboard", sep="\t", header=TRUE, as.is=TRUE)
```

### Plots blank map of all states, using black borders and grey fill
```{r}
plot_usmap(regions = "states") + labs(title = "US States",subtitle = "This is a blank map of the state of the United States.") + theme(panel.background = element_rect(color = "black", fill = "grey"))
```

### Plots all states with a color scale reflecting numbers of students from the specific states  and places image in a black rectangle with a light color grey fill
```{r}
plot_usmap(data = StateNumber, values = "n", color = "white") + scale_fill_continuous(name="Number of Students",low = "lightblue", high = "darkblue",limits = c(0,140),breaks=c(20,40,60,80,100,120,140), na.value = "grey50")+labs(title="title of graph", subtitle = "Subtitle")+ theme(plot.title=element_text(hjust=0.5)) + theme(plot.subtitle=element_text(hjust=0.5))+ theme(legend.position = "right") + theme(panel.background = element_rect(color = "black", fill ="grey85"))
```
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Map2.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## Plotting a 3D Scatter Plot
  * you may need to go into **Tools/Global Options/General/Advanced** and under **OS Integration (quit required)** change the *Rendering Engine* option to **Desktop OpenGL**



</center>
</div>
<a href="#top">Back to top</a>

# Additional Resources

<div class="alert alert-warning">  
## R Markdown

  * used to "turn your analyses into high quality documents, reports, presentations and dashboards." This page is prepared using R Markdown.

  * [R Markdown Help and Information](https://bookdown.org/yihui/rmarkdown/){target="_blank"}

  * [R Markdown cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf){target="_blank"}
  
  * [Colored Boxes in R Markdown](https://www.w3schools.com/bootstrap/bootstrap_alerts.asp){target="_blank"}

```{r}
# In R Markdown Document:
<div class="alert alert-x"> 
  # Replace "x" with "success" for green; "info" for blue; "warning" for tan; "danger" for red.
</div>
```

```{r}
# Creating a pdf of the rpubs website https://rpubs.com/ssammut/ResearchStats
<div class="alert alert-x"> 
  # requires webshot package
  library(webshot)
webshot(
    url     = "https://rpubs.com/ssammut/ResearchStats"
  , file    =  "Output1.pdf" # replace "Output1.pdf" with appropriate name
  , vwidth  = 992
  , vheight = 744
  )

getwd () # will show in which folder the file is stored.
</div>
```


</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Jamovi - an alternate statistical program

  * This is not the program used in the class but you can feel free to experiment with it.
  * Jamovi is a program based on R that has a more user-friendly interface
  * [Download Jamovi](https://www.jamovi.org/download.html){target="_blank"}
  * Advantage - can be used on all platforms Macs, PCs and Chromebooks
  * A user guide can be found [here](https://www.jamovi.org/user-manual.html){target="_blank"}
  * A tutorial is also available [here](https://youtu.be/mZomeS0tLxY){target="_blank"}
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating your own package

  * This section outlines the procedure for creating your own package of a currently existing function so that you do not have to continuously copy and paste the function when needing it (e.g. re: (summarySE)-[Function and command for generating Standard Error of the Mean (SEM)])
  
  **Resources:**
  
  * https://ourcodingclub.github.io/tutorials/writing-r-package/#
  * https://web.mit.edu/insong/www/pdf/rpackage_instructions.pdf

  **Requires:**
  
  * RTools: https://cran.rstudio.com/bin/windows/Rtools/ - download and install
  * install.packages("devtools")
  * install.packages("roxygen2")

  **Creating the package**

1. Create folder Name_R_package
2. In the folder create a folder called R
3. In the folder called R, create a text file called Name_function.txt that contains a copy the function code
4. Change the extension of the text file to .R (confirm change)
5. In the Name_R_package folder create a text file called DESCRIPTION
6. In the DESCRIPTION file add the following information: 

```
  Package: Name of package
  Type: Package
  Title: Title describing package
  Version: 0.0.1.0
```

7. Remove the .txt extension from the DESCRIPTION file (confirm change)
8. Open a new window in R so that you do not lose previous work
9. Go to File/New Project. 
	a. Click on New Directory
	b. Click on R Package
	c. Name the package
	d. Click "Add" 
	e. Select the R file created above
	f. Click "Browse" and change the folder to select the Name_R_package you created earlier.
	g. Click on "Create Project"
10. Click on "Build" in the menu and select "Build Source Package"

   **Using the package**

1. Go to Tools/Install Packages
2. Change "Install From" to "Package Archive File (.zip; .tar.gz)". This will open the window for you to select the file.
3. Go to the folder Name_R_package and select the .tar.gz file listed
4. Click "Install"
5. Run library using the name you gave the package in 9c.: library(packagename)
6. **NOTE**: Whenever you install a new version of R you will need to re-run the installation of the package just described.
  
</div>
<a href="#top">Back to top</a>

***
***