Resources
- R
Commands Document
- There is a significant amount of information pertaining to the use
of R on the internet. Listed below are only some examples of the
resources. If you are having trouble installing R and R Studio, even
after reading and following the instructions I include below, you may
want to go to some of these websites.
Installing R & R Studio
PLEASE NOTE: There are times when you may have to
re-download R (not necessarily R Studio). You will realize this is
needed if you download a package and when you run the library you get a
message that seems to indicate that the package is not for that version.
All you have to do in this case is 1. close out of R Studio; 2. download
the newer version of R and 3. install it.
For links to download R and R studio please go to the following:
Step 1 - R MUST be installed first - Download R for
Windows or Mac
Step 2 - R Studio is installed after R has finished installing -
Download R Studio
for Windows or Mac. Please note: If clicking on the
link does not work, copy and paste the following link into your browser:
https://posit.co/downloads/
For Windows:
Double click R-x.x.x-win.exe
- Click “OK” (for language)
- Click “Next” (for Information)
- Click “Next” (for Select Destination Location)
- Click “Next” (for Select Components)
- Click “Next” (for Startup options)
- Click “Next” (for Select Start Menu Folder)
- On the next screen:
- Make sure “Create a desktop shortcut” is unchecked
- Make sure “Create a Quick Launch shortcut” is unchecked
- Click “Next”
- Click “Finish”
Double click RStudio-x.x.xxxx.exe
- Click “Next”
- Click “Next” (for Choose Install Location)
- Click “Install” (for Choose Start Menu Folder)
- Click “Finish”
[Step 3 is not necessary if you are installing the
packages by running the commands below under “Installation of
Packages”]
- Packages go into:
- C:\Users\YourName\Documents\R win-library\x.x
- Accept to overwrite current packages
For Mac:
- Double click R-x.x.x.pkg
- click “Continue” (for Introduction)
- click “Continue” (for Read Me)
- click “Continue” (for License)
- click “Agree”
- click “Install” (for Installation Type)
- if asked, provide password to allow installation of software
- click “Install Software”
- When installation is complete click “Close”
- Double click RStudio-x.x.xxxx.dmg
- in window that opens up, drag RStudio icon into Applications
folder
- When installation is done, you can click on the RStudio icon in the
Launchpad
- If asked “”RStudio” is an application downloaded from the Internet.
Are you sure you want to open it?”
- If asked to install a “git” command - click “install” and then click
on “agree”
[Step 3 is not necessary if you are installing the
packages by running the commands below under “Installation of
Packages”]
- Packages go into:
- Macintosh
HD/Library/Frameworks/R.framework/Versions/version#/Resources/library
- Accept to overwrite current packages
Back to top
General Notes for R
- pound (#) sign indicates a comment - not read by the program
- console is the window in which you will run the commands
- If you click in the console, and Ctrl L - clears the console
- If ever you’re stuck with a plus (+)sign and cannot get the cursor
(>) back, press Ctrl+C (for both Macs and PCs)
- object: an object contains a graph, the data etc. - whatever you put
it in it.
- <- is equivalent to =
- object$columnname: indicates the name of the column within the
object
- install packages: run ONCE; NEED
to be connected to the internet
- Library command: run EVERYTIME you open
R-Studio
- R IS CASE SENSITIVE
- names(object) : gives the names of all the
variables in a specific object [“names” is bolded b/c it is a command
and does not change when used; only the word “object” is replaced by the
appropriate name of the object containing the data]
- levels(object$nameofcolumn) - gives the levels
within the column of interest; [“levels” is bolded b/c it is a command
and does not change when used; only the words “object” and
“nameofcolumn” are replaced by the appropriate name of the object
containing the data and the name of the column of interest
respectively]
- If ever you need to change a column into a factor because R is not
seeing the column as a factor use this command:
objectL$nameofcolumn =
as.factor(objectL$nameofcolumn)
Installation of Packages
- Installation of packages is done ONLY ONCE;
- For the installation you MUST be connected to the
internet
install.packages("ggplot2", dependencies = T) # used for generating graphs
install.packages("multcomp", dependencies = T) # used for post-hoc analysis
install.packages("pastecs", dependencies = T) # used for generating descriptive statistics
install.packages("reshape", dependencies = T) # used for generating a long version of the data from a wide version of the data
install.packages("reshape2", dependencies = T) # used to generate long form of the file; required in some cases
install.packages("nlme", dependencies = T) # used for generating ANOVAs
install.packages("car", dependencies = T)
install.packages("pwr", dependencies = T)
install.packages("dplyr", dependencies = T)
install.packages("devtools", dependencies = T)
install.packages("rms", dependencies=T) # for regression
install.packages("psych", dependencies=T)
install.packages("ggpubr", dependencies=T) # required for correlation plots
Back to top
Additional packages (not required for class)
install.packages("rmarkdown", dependencies=T) # for RMD files that allow embedding of graphs etc; not required for class
install.packages("jmv") # not required for class
install.packages("umx", dependencies=T) # For structural equation modelling
install.packages("semPlot", dependencies=T) # required for Structural Equation Modeling but not required for class
install.packages("lavaan", dependencies=T) # required for Structural Equation Modeling but not required for class; see https://lavaan.ugent.be/tutorial/index.html
install.packages("sem", dependencies=T) # required for Structural Equation Modeling but not required for class
install.packages("maps", dependencies=T) # required if making map of US - only lower 48 states
install.packages("usmap", dependencies=T) # required if making map of US - all 50 states
# install.packages("RVAideMemoire", dependencies = T) # required for the Fisher Test
install.packages("plotly", dependencies=T) # required for 3D plots
install.packages("compareGroups") # required for generation of table comparing groups
install.packages("lsmeans", dependencies=T) # for comparison of slopes
install.packages("effectsize", dependencies = T)
Loading packages
- Packages need to be loaded EVERY TIME R studio is
started.
- Copy this command and paste into the Console and then press
“Enter”.
library(ggplot2);library(multcomp);library(pastecs);library(reshape); library(reshape2); library(nlme); library(car); library(pwr); library(dplyr); library(devtools); library(rms);library(psych); library(ggpubr)
Back to top
Additional library commands (not required for class)
library(umx); library(jmv); library(sem); library(lavaan); library(semPlot); library(qgraph); library(foreign); library(usmap); library(maps); library(plotly); library(SEMfn); library(lsmeans); library(effectsize); library(roxygen2)
library(RVAideMemoire)
Copying Data from Excel into R
For PC
- Copy the command below (Ctrl+C)
- Paste command into R Studio Console (Ctrl+V)
- Change the name “object” to something that makes sense to you and is
related to the data.
- In Excel, select the data you want to analyze;
Include the header;
- Copy the data (Ctrl+C)
- Go back to R Studio and run (press “Enter”) the
command you pasted earlier (in step 2) into the console
object = read.table(file="clipboard", sep="\t", header=TRUE)
- If you have missing data points in your file that you still want
imported into the dataframe, use the following command:
object = read.csv(file="clipboard", sep="\t", header=TRUE)
Back to top
For Mac
- Copy the command below (Cmd+C)
- Paste command into R Studio Console (Cmd+V)
- Change the name “object” to something that makes sense to you and is
related to the data.
- In Excel, select the data you want to analyze;
Include the header;
- Copy the data (Cmd+C)
- Go back to R Studio and run (press “Enter”) the
command you pasted earlier (in step 2) into the console
object = read.table(pipe("pbpaste"), sep="\t", header = TRUE)
- If you have missing data points in your file that you still want
imported into the dataframe, use the following command:
object = read.csv(pipe("pbpaste"), sep="\t", header = TRUE)
Back to top
Importing Data into R from Excel (xlsx) files
Important Notes:
- If you are going to use this method, you need to make sure
that your excel file has a short name with no spaces
- If you wish to name the object containing the data
something different from the original Excel name, type in the new name
in the “Name:” box under “Import
Options”**
- If you have more than one sheet of data in the
excel file, you will need to specify which sheet the data is coming from
in the “Sheet:” box under “Import
Options” section in the “Import Excel Data
window”
Procedure:
- Go to “File>Import Data Set>From Excel…”
- R opens window called “Import Excel Data”
- Click on “Browse…”
- Select the Excel file you want to import
- R will display the data in the “Data Preview”
window; [see note above if your excel file contains more than one data
sheet or you wish to name your object something different from the excel
file name]
- Click “Import” and the data is now imported and
will be displayed
Importing Data into R from CSV files
object = read.csv(file.choose(), header=T)
- object: name of the object - call it as you wish; should be related
to the data you are importing; should simple; NO spaces; should
short.
- read.csv: tells the program you are going to import a CSV file
- file.choose (): tells the program to open a window for you to select
the file of interest
- header = T: this indicates that you CSV file has a header
- the ONLY word you need to change in this command is the name of the
object
Importing Data into R from SPSS files
- to import *.sav files from SPSS you first need to run the following
library command
library(foreign)
object = read.spss(file.choose(), to.data.frame=TRUE)
- read.spss: tells the program you are going to import an SPSS
file
- file.choose (): tells the program to open a window for you to select
the file of interest
- the ONLY word you need to change in this command is the name of the
object
Exporting from R
Exporting output to a text file
sink("outputname.txt")
# RUN THE COMMANDS YOU WISH YO BE EXPORTED TO THE TEXT FILE
sink()
getwd() # gives working directory
# to import into excel, select the file and use the "space delimited" option, and then check the space checkbox.
Exporting dataframe to a csv file
write.csv(dataframe,"name.csv")
getwd() # gives working directory
Back to top
Viewing imported data
View(object) # allows you to see the data in another tab
Back to top
Back to top
Melting data with multiple id variables and measure variables
object_L=melt(object, id.vars = c("idvar_1", "idvar_2", "idvar_n"), measure.vars = c("mvar_1", "mvar_2", "mvar_n"))
- idvar_1 etc. = replace these with the appropriate names of the
columns containing the id variables
- mvar_1 etc. = replace these with the measure variable column
names
Back to top
Selecting specific subset to analyze
var_a = object$`outcome`[object$predictor1colname=="predictor1" & object$predictor2colname=="predictor2"]
- var_a: name of variable that will contain subset
- object$predictor1colname: name of column containing “predictor1” in
object
- ==: “truly equal to”
Generating descriptive statistics in R
Weighted mean
wt <- c(w1, w2, w3, w4, etc)/(wT) # the weight for the first value = x multiplied w1/wT (e.g. if weight for w1 is 25% w1 would be 25, wT would be 100)
x <- c(x1, x2, x3, x4, etc) # values to be weighted
xm <- weighted.mean(x, wt)
Mean
mean(c(x1, x2, x3, x4...))
Descriptive statistics for WIDE version of data to
2 decimal places
round(stat.desc(object, norm=T), digits=2)
- gives the descriptive statistics for all variables to TWO decimal
place (you can adjust the number of decimal places by simply changing
the number after digits=)
- object: replace this with name of object containing the WIDE version
of the data
- ONLY used with the WIDE version of the data
Descriptive statistics for individual columns of data to 2 decimal
places in WIDE version of data
round(stat.desc(object$columnname, norm=T), digits =2)
- gives the descriptive statistics for the WIDE version of the file
but ONLY for a SPECIFIC COLUMN
- object: needs to be replaced with the name of the wide version of
the data
- columnname: needs to be replaced with the name of the column for
which you wish to generate the statistics.
Descriptive statistics for LONG version of
data
by(objectL$outcome, objectL$predictor, stat.desc, norm=T)
gives the descriptive statistics for the LONG version of the
data
objectL: replace with name of the object containing the LONG
version of data
outcome: replace with name of outcome column in objectL
predictor: replace with name of predictor column in
objectL
nbr.val: number of values/subjects in a particular group
nbr.null: number of ZERO values
nbr.na: number of MISSING values
min/max: minimum and maximum valuers in a group
respectively
range: gives you the range
sum: gives the sum of all values in a group
Measures of centrality:
- median: gives the median of values in a group
- mean: gives the average of values in a group
Measures of variability:
- SE.mean: standard error of the mean
- CI.mean.95: 95% confidence interval
- var: variance
- std.dev: standard deviation
- coef.var: coefficient of variation
Measures of normality:
- skewness: measures of level of skewness of your data, a negative
sign implies a negative skew;
- skew.2SE: indicates whether the skewness is SIGNIFICANT. Skewness is
significant if skew.2SE > +1 or <-1
- kurtosis: measure of level of kurtosis of the data. A negative sign
implies a negative kurtosis.
- kurt.2SE: indicates whether the kurtosis is SIGNIFICANT. kurtosis is
significant if kurt.2SE > +1 or <-1
- normtest.w: test for normality
- normtest.p: if normtest.p<0.05, the the data is NOT normally
distributed.
- When checking for normality - FIRST look at the
normtest.p. If this value is greater than 0.05, then you have nothing to
worry about. If this value is less than 0.05 then you need to look at
the skew.2SE and kurt.2SE in order to determine which aspect is causing
your data to not be normal. If your data is NOT normally distributed,
other statistical tests apply.
Example of output generated with previous command
Back to top
Alternative method
describeBy(outcome~predictor, data = objectL, digits=2, mat=TRUE)
Example of output generated with previous command
- this output is nice and concise but does not contain everything
included in the previous method (e.g. no indication of significance in
relation to the deviations from normality (skewness/kurtosis))
Back to top
Another Method of presenting Statistics
- this method generates presentable tables as an output
- requires “vtable” package
st(ObjectL, vars = c("column1"), add.median=TRUE, numformat=NA)
- gives stats of one column with 25th, 50th and 75th percentiles
- column1 -> replace with the name of the column for which you wish
to calculate the descriptive statistics. Output will have N, Mean and
Std.Dev., Min, and Max also.
st(ObjectL, vars = c("column1"), group = 'Group', add.median = TRUE, numformat=NA)
- gives the statistics of column1 across the group “Group”;
- replace “column1” with the name of the column for which you wish to
generate descriptive statistics;
- replace “Group” with the name of the column that has the grouping
variable e.g. “Sex”.
Descriptive statistics in cases of multiple levels
describeBy(outcome~ var1 + var2, data = object,digits=2, mat=TRUE)
# e.g. if var1 = sex (Male or Female) and var2 = class (Freshman, Sophomore, Junior, Senior)
# digits: describes number of decimal places
# mat: provide a matrix output rather than a list
Excel Commands for some descriptive Statistics - method 1
- all commands start with an equal sign
- once you select the appropriate command (after the equal sign), in
parenthesis you will need to inbput the cells on which you want to
execute that command e.g.
- =min(B2:B10) - this gives the minimum for numbers in cells B2 to
B10
- =max(B2:B10) - gives the max for numbers in cells B2 to B10
- =Average(B2:B10) - gives the average
- =Median(B2:B10) - gives the median
- =var(B2:B10) - gives the variance
- =stdev(B2:B10) - gives the standard deviation
- =B16/SQRT(B17) - gives the standard error of the mean WHEN B16
contains the standard deviation and B17 contains the N (count)
- $ sign in front of the letter (column) or number (row) of the
formula fixes the row or column
Excel Commands for some descriptive Statistics - method 2
- To give a more detailed output, click Data Analysis in the Analysis
group on the Data tab.
- If the Data Analysis command is not available, you need to load the
Analysis ToolPak add-in program.(see:Office Support)
- Select “Descriptive Statistics”
- Input the appropriate variable ranges under “Input”
- If your variables have labels (which they should) check “Labels in
First Row”
- Click on Output Range and in the associated box indicate the cell
you want the output to be generated in (make sure this cell is away from
other important data) OR simply select “New Worksheet Ply”
- Sample output is shown below:
| Mean |
12 |
| Standard Error |
0.447213595 |
| Median |
12 |
| Mode |
13 |
| Standard Deviation |
1 |
| Sample Variance |
1 |
| Kurtosis |
-3 |
| Skewness |
0 |
| Range |
2 |
| Minimum |
11 |
| Maximum |
13 |
| Sum |
60 |
| Count |
5 |
Proportions test
Proportions test: Use a proportions test to
compare/determine if the frequency/probability of an event is different
between the various groups being tested (e.g. asking if the proportion
of males reporting depression is different from females reporting
depression in a sample)
Comparing two proportions - Method 1
prop.test(x = c(c1, c2), n = c(T1, T2), alternative = "two.sided")
- c1 and c2 reflect the number of subjects in each of the categories
(e.g., number of women responding Yes and the number of men responding
Y)
- T1 and T2 reflect the number of TOTAL subjects in each
category/group (e.g., the total number of women in the sample, the total
number of men in the sample)
Comparing two proportions - Method 2
object = c(v1,v2)
- number of subjects in each of the categories (e.g. number of
Catholics in example)
Tobject = c(n1,n2)
- number of TOTAL subjects in each category/group (e.g. Total number
of representative in each party)
prop.test(object, Tobject)
- runs the proportions test
Back to top
Proportions Test Output in R
2-sample test for equality of proportions with
continuity correction
X-squared = 5.7318, df = 1, p-value = 0.01666 # X-squared (chi-squared) = the
statistic; df = degrees of freedom (n-1)
- we have 2 groups; p-value = tells us if
it is significant; IT IS SIGNIFICANT if p
< 0.05
alternative hypothesis: two.sided
95 percent confidence interval:
0.01821926 0.18098709 # If the confidence interval range has the same SIGN - there should be a significant difference
sample estimates:
prop 1 prop 2
0.3535714 0.2539683 # these are the proportions of each of the tested groups
Back to top
Writing a report from a proportions test output
A two-sample test for equality of proportion was conducted to assess
whether the proportion of Catholics in the Republican and Democratic
party were significantly different. The analysis revealed a
significantly higher proportion of Catholics in the Democratic
Party(35.4%) relative to the Republican Party(25.4%),
X2(1,N=532)= 5.73,p<0.05.
Back to top
Comparing more than two proportions - Method 1
prop.test(x = c(c1, c2, c3...), n = c(T1, T2, T3...), alternative = "two.sided")
pairwise.prop.test(x = c(c1, c2, c3...), n = c(T1, T2, T3...), alternative = "two.sided")
- c1, c2, c3… reflect the number of subjects in each of the categories
(e.g., number of women responding Yes and the number of men responding
Y)
- T1, T2, T3… reflect the number of TOTAL subjects in each
category/group (e.g., the total number of women in the sample, the total
number of men in the sample)
- prop.test runs the proportions test - provides the first part of the
output.
- pairwise.prop.test runs the pairwise test to indicate which
proportions are different from each other - this provides the second
part of the output
Comparing more than two proportions - Method 2
object = c(a,b,c,d,e,...)
- number of subjects in each of the categories (e.g. the number of
doctors of a particular faith that said they would NOT perform an
abortion; each letter would be replaced by the number of doctors of each
specific specific faith that responded “NO”)
Tobject = c(Ta,Tb,Tc,Td,Te,...)
- number of Total subjects in each category/group (e.g. the TOTAL
number of doctors of a particular faith i.e those that responded YES
(they would perform an abortion) + those that responded NO;
Replace Ta etc. with numbers representing the
individual totals)
prop.test(object,Tobject)
- runs proportions test - provides the first part of the output.
pairwise.prop.test(object,Tobject)
- runs pairwise test to indicate which proportions are different from
each other - this provides the second part of the output.
Back to top
Proportions Test Output in R (more than two proportions)
# This is the first part of the output that indicates whether there is a difference in the proportions. If there is a difference, however, it will not indicate between which proportions the difference exists.
4-sample test for equality of proportions without
continuity correction
data: object out of Tobject
X-squared = 35.169, df = 3, p-value = 1.122e-07
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.2857143 0.5000000 0.2407407 0.7500000
# This is the second part of the output the indicates WHERE the differences are. The numbers in the top row and in the first column reflect the 4 proportions being tested comparing e.g., 1 vs 2 (not significant); 1 vs 4 (significant at p<0.001) etc., etc.
Pairwise comparisons using Pairwise comparison of proportions
data: object out of Tobject
1 2 3
2 0.32422 - -
3 0.82154 0.18128 -
4 0.00013 0.18128 9.5e-07
P value adjustment method: holm
Back to top
Writing a report from a proportions test output (more than two
proportions)
A 4-sample test for equality of proportions was conducted to assess
whether the proportion of the 4 samples provided were significantly
different. The analysis revealed an overall significant difference
X2(3,N= insert number of participants) =
35.17,p<0.001. Pairwise comparisons indicated a significant
difference between samples 1 and 4 and samples 3 and 4 (both
p<0.001). All other comparisons were not significant (p>0.05).
Back to top
Proportions Test for situations where there are very small numbers
(5 or less)
In situations where there are 5 or less values, the proportions test
will display an error. In that case, the Fisher test is the more
appropriate test to use.
Comparing two proportions - in cases of small numbers
fisher.test(matrix(c(n1, T1-n1, n2, T2-n2), ncol=2))
- n1 represents the first number to be compared, T1 is the total
number of subjects in pertaining to variable n1 etc.
Comparing more than two proportions - in cases of small numbers
- requires RVAideMemoire package
fisher.multcomp(matrix(c(n1, T1-n1, n2, T2-n2,.....), ncol=N))
- n1 represents the first number to be compared, T1 is the total
number of subjects in pertaining to variable n1 etc.
- ncol=N: replace N with the number of
variables being compared.
Odds Ratio
Odds Ratio: measures how likely an event is to occur
when exposed to something; a measure of the association between exposure
and outcome
Create Matrix
# data taken from Yang, C.-Y., Shih, Y.-H., & Lung, C.-C. (2024). The association between COVID-19 vaccine/infection and new-onset asthma in children - based on the global TriNetX database. Infection. https://doi.org/10.1007/s15010-024-02329-3 **Table S2 in Supplement 1**
# Create matrix
vax <- c('Vax', 'No Vax')
outcome <- c('Death', 'No death')
data <- matrix(c(354, 31734, 309, 159048), nrow=2, ncol=2, byrow=TRUE)
dimnames(data) <- list('Vaccinated'=vax, 'Outcome'=outcome)
#view matrix
data
Back to top
Calculate Odds Ratio
library(epitools)
#calculate odds ratio
oddsratio(data)
# From the output shown, we would interpret this to mean that the odds that child receiving the vaccine dies is ~5.74 times the odds of one who did not receive the vaccine.
Writing a report from an odds ratio test output
There was a significant difference (p<0.001) in the odds of death
associated with receiving the COVID vaccines in children aged 5-18 years
of age relative to those who did not (OR 5.74, 95% CI [4.93, 6.69]).
Back to top
Some things you may have to do
prior to generating a graph
Function and command for generating Standard Error of the Mean
(SEM)
- This function needs to be copied and run WITHOUT
MODIFICATION before generating the summary statistics using the
summarySE command.
- This code is taken from: Summarizing
Data
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summariezed
## groupvars: a vector containing names of columns that contain grouping variables
## na.rm: a boolean that indicates whether to ignore NA's
## conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
library(plyr)
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
sumstats <- summarySE(objectL, measurevar="outcomevar", groupvars=c("groupvar_1","groupvar_2","groupvar_n"), na.rm=TRUE)
- sumstats: name of object that will have the summary stats. The name
of this should be changed if running for more than one data set.
- summarySE: command that runs the summary statistics
- objectL: name of object containing the long version of the data to
be analyzed/summarized
- “outcomevar”: name of the variable to be analyzed/summarized
- “groupvar_1” etc: name of the columns that contain the groups for
the variable to be analyzed e.g. Sex, Class, Age…etc.
- na.rm = TRUE: removes the empty cells (cells that have NA); Not
required if all data points are present.
Back to top
Ordering levels in a specific variable
- When plotting a graph, R will order the variables in alphabetical
order. Very often this is not the best way of plotting the graph. In
such cases, R needs to be instructed on the desired order of
variables.
- An example is the order of classes. For this example, the name of
the variable in the data (i.e. the column name) is
Class R will order them: Freshman, Graduate, Junior,
Senior, Sophomore
- This command will create an object that has the reordered
levels.
- In such a case, the name of the new object created would be the name
of the predictor that would need to be used in Part I of plotting the
graph.
- For this example Class is the original
name; Class_r is the rearranged levels
Class_r = factor (object$Class, levels=c("Freshman", "Sophomore", "Junior", "Senior", "Graduate"))
- notice how when you are indicating the name of the factor that is to
be reordered you need to give R the name of the object containing the
data (in this case “object” and the name of the column to be reordered
separated by a $ symbol)
- to adapt to any situation - “Class_r” should be renamed
appropriately; “object$Class” should be replaced with the appropriate
object name and column name to be reorganized; all the levels within the
quotes should be replaced accordingly.
Back to top
Changing variable name in a specific column
- If the variables in a particular column need to be changed e.g. from
numbers to words do the following:
- install the package stringr using the
following command:
install.packages("stringr", dependencies = T)
- ensure that you load the package stringr
using the following command:
library(stringr)
- Next, run the following command in order to preserve the original
file:
object2=object
- Next, run the following command on the new modified object
(i.e. object2) (NOTE: this will change the values in
your columns permanently.)
object2$columnname = str_replace_all(object2$columnname,"oldvalue","newvalue")
- You will need to run this command for every value
that needs to be replaced
- When plotting the graph you will need to remember to use the
NEW OBJECT i.e. object2 as
the source of the data
Back to top
Changing the column name/header in the object/dataframe containing
the data
- When you generate the long version of a file, the predictor column
will have the name variable, while the outcome
column will be labeled value.
- To change the name of a specific column use the following
commands
names(objectL)[n] <- "correctname"
- remember that this will be the name that you will need to
use in generating the graph, whenever you are
addressing the predictor/outcome
- replace the n with the number of the
column whose header you wish to change (e.g. if the data has an id
column and you want to change the variable column, this will be column
2)
- replace correctname with the name you want
to give the column/header
This is an alternative command to the one above
- To rename a single column by name
colnames(objectL)[colnames(objectL) == "old_col2"] <- "new_col2"
Back to top
Creating a pie chart, bar graph, line graph or box
plot
Generating a pie chart
object<- c(n1, n2)
labels<- c("Poor (58%)","Normal (42%)")
pie(x=object,labels = labels, radius = 1, main = "Title (n=N)", col = c("#69b3a2", "grey"))
- object: name of object containing number OR percentage of data to be
plotted
- labels: name of object containing the names of the labels of the
parts/sections of the pie chart
- pie: command that plots the pie chart
- x: indicates what is to be plotted (contained in the dataframe
“object” [replace accordingly])
- labels: indicates the labels for the sections (contained in the
dataframe “labels” [replace accordingly])
- radius - specifies radius of circle
- col - specifies colors to be used
Back to top
Example of a graph generated with previous command
Back to top
Generating a bar graph
Generating a bar graph - Part I
barobject=ggplot(objectL, aes(predictor,outcome))
- barobject: name of the entity that will contain the plot (call it
what you wish, just be clear)
- ggplot: command that enables the plotting of the graph
- objectL: is the name of the long version of your data
- aes: tells R to generate the aesthetics
- predictor: REPLACE with the name of the predictor
column in the long version of the file
- outcome: REPLACE with the name of the outcome
column in the long version of the file
- NOTE regarding predictor: If the levels have been rearraged in a new
object, the word predictor should be replaced with the name of that
specific object (in the example above, Class_r
instead of Class)
Generating a bar graph - Part II
barobject+stat_summary(fun=mean, geom="bar",fill="white",color="Black")+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit)) + guides(fill = guide_legend(title = "name"))
- barobject: name of the plot determined in the previous
command
- stat_summary(fun.y=mean,
geom=“bar”,fill=“white”,color=“Black”): plots the mean, as a bar
graph, with a white fill and a black outline - DOES NOT NEED TO
BE CHANGED unless you wish to change the colors
- stat_summary(fun.data=mean_cl_normal,
geom=“errorbar”,position=position_dodge(width=0.9), width=0.2):
plots the error bars (measure of variability) so that they do not
overlap - YOU DO NOT NEED TO CHANGE ANYTHING;
- NOTE: mean_se can replace
mean_cl_normal if you wish to plot the standard error of the
mean.
- labs(title=“Title line1 \n Title line2”,
x=“predictor label”, y=“outcome label
(unit)”):Labels the graph; title gives the name of the
main heading of the graph; x labels the x-axis; y labels the y-axis; The
“\n” places any words after it on the next line;
DOES NEED CHANGING - you need to change everything that
is in the quotation marks.
- theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1)):
sets the vertical distance/alignment b/w the specific heading and the
axis - DOES NOT NEED CHANGING
- theme(plot.title=element_text(hjust=0.5)): centers the
title horizontally
- +coord_cartesian(ylim=c(lowerlimit, upperlimit)): you do
NOT always need to use this part of the command; It is
only needed when the y-axis limits need to be adjusted for better
viewing of the graph; If it is not being used, remove the whole command
in between the plus signs leaving only one + sign. If used, replace the
words lowerlimit and upperlimit with the appropriate numbers.
- guides(fill = guide_legend(title = “name”)): where
applicable, provides a new name for the legend (the default is the
header of the column).
Back to top
Example of a graph generated with previous command
Back to top
Generating a bar graph with different colored
bars
Generating a bar graph - Part I
barobject=ggplot(objectL, aes(predictor,outcome,fill=predictor))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
- replace barobject, objectL, predictor, outcome with appropriate
labels
- replace colorcode with appropriate color code obtained from the Hex
field at the rapidtables website - in the case of the color
selected in the example shown below, the code would be #3399FF
- add as many colors as you have variables
Generating a bar graph - Part II
barobject+stat_summary(fun=mean, geom="bar")+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
- stat_summary(fun.y=mean, geom=“bar”): Notice how the
section “,fill=”white”,color=“Black”” has been removed because colors
are accounted for in the first command
- Everything else is per the first example of “Generating a
Graph”
Back to top
Example of a graph generated with previous command
Back to top
Creating a bar graph with one independent variable
and using the standard error of the mean for error bars
Generating a bar graph - Part I
barobject=ggplot(sumstats, aes(predictor,outcome,fill=predictor))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
Generating a bar graph - Part II
barobject+stat_summary(fun=mean, geom="bar")+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
- geom_errorbar(aes(ymin=outcome-se,
ymax=outcome+se),position=position_dodge(width=0.9), width=0.2):
This command now provides the information for the errorbars. However you
need to replace the word “outcome” with the appropriate name for the
outcome variable.
- Everything else is per the first example of Generating a bar graph
Back to top
Example of a graph generated with previous command
Back to top
Creating a bar graph with multiple independent
variables and using the standard error of the mean for error
bars
Generating a bar graph - Part I
barobject=ggplot(sumstats, aes(predictor,outcome,fill=predictor_2))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
- NOTE: sumstats is used to provide the data to be
plotted
- predictor: the main predictor that is going to be on the x-axis
- predictor_2: the predictor that provides the sub-categories within
the main predictor
- Additionally, see Generating a
bar graph with different colored bars for details
of what needs to be changed
Generating a bar graph - Part II
barobject+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
- geom_errorbar(aes(ymin=outcome-se,
ymax=outcome+se),position=position_dodge(width=0.9), width=0.2):
This command now provides the information for the errorbars. However you
need to replace the word “outcome” with the appropriate name for the
outcome variable.
- Everything else is per the first example of Generating a bar graph
Back to top
Example of a graph generated with previous command
Back to top
Creating bar graphs with multiple independent
variables, plotted as three separate graphs (one per variable)
in one grid
- Original data (contained in “objectL” contains the following
columns “Subject”, “var1”, “var2”, “var3”, “data”
Generating a grid of bar graphs - Example 1
# Make a modified copy of the original data
objectL_mod <- objectL %>%
# Rename variables within var3
# only if necessary/desired
# var3.1 etc. represent different variables within var3
# /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
# Generate graph
bar_objectL=ggplot(objectL_mod, aes(var1,data,fill=var1))+scale_fill_manual(values=c("#CC0000", "#004C99"))
bar_objectL.Sub=bar_objectL+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5))+guides(fill = guide_legend(title = "var1"))
# bar_objectL.Sub: contains the command to generate the graph with the various sub-categories present var3
# +guides(fill = guide_legend(title = "var1")): relabels the heading of the legend
# Create grid
bar_objectL.Sub+facet_grid(cols=vars(var3))
Back to top
Example of a graph generated with previous command
Back to top
Generating a grid of bar graphs - Example 1a
- Original data (contained in “objectL” contains the following
columns “Subject”, “var1”, “var2”, “var3”, “data”
- This graph is a modified version of that shown in Example 1 - See Generating a grid of
bar graphs - Example 1
- This graph uses the SEM instead of the Confidence interval to
generate the error bars and also shows the significances
# Make a modified copy of the original data
objectL_mod <- objectL %>%
# Rename variables within var3
# only if necessary/desired
# var3.1 etc. represent different variables within var3
# /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
- Generate Descriptive Statistics that provide standard error of the
mean
# Generate graph
bar_object=ggplot(sumstats_object, aes(x=predictor,y=outcome))+scale_fill_manual(values=c("#CC0000", "#004C99"))
bar_object.Sub=bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor))+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + guides(fill = guide_legend(title = "predictor")) +coord_cartesian(ylim=c(lowerlimit,upperlimit))+ geom_signif(data=data.frame(predictor2=c("newvar3.1\nname","newvar3.2\nname","newvar3.3\nname")),aes(y_position=c(21.5, 24.5, -2), xmin=c(1.0, 1.0, 1.0), xmax=c(2, 2, 2),annotations = c("**","*","")), tip_length = 0, manual=T)
# The statistic that was NOT significant has been placed beyond the y-scale limit. The coord_cartesian command ensures that the axes limits are as desired and that the non-significant bar does not show up.
# outcome: under geom_errorbar - this refers to the heading of the outcome column in the descriptive statistics output.
# geom_signif: numbers shown in parenthesis need to be adjusted accordingly to accommodate for height (in relation to the y_position) and/or position relative to bars (in relation to the xmin and xmax). Additionally, the annotation needs to be appropriately corrected to reflect the appropriate significances.
# Create grid
bar_BIS.Sub+facet_grid(~BIS_Cat, scales="fixed")
# theme(axis.text.x=element_blank()) - # Removes variable label names in the x-axis if they are not desired.
# theme(axis.ticks.x=element_blank()) - # removes ticks (may be used with above command)
# https://stackoverflow.com/questions/56219468/using-ggsignif-with-grouped-bar-graphs-and-facet-wrap-not-working
Back to top
Example of a graph generated with previous command
Back to top
Generating a grid of bar graphs - Example 2
- Original data (contained in “objectL” contains the following
columns “Subject”, “var1”, “var2”, “var3”, “data”
# Make a modified copy of the original data
objectL_mod <- objectL %>%
# Rename variables within var3
# only if necessary/desired
# var3.1 etc. represent different variables within var3
# /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
# Generate graph
bar_objectL=ggplot(objectL_mod, aes(var1,data,fill=var2))+scale_fill_manual(values=c("#CC0000", "#004C99"))
bar_objectL.Sub=bar_objectL+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5))+guides(fill = guide_legend(title = "var2"))
# bar_objectL.Sub: contains the command to generate the graph with the various sub-categories present var3
# +guides(fill = guide_legend(title = "var2")): relabels the heading of the legend
# Create grid
bar_objectL.Sub+facet_grid(cols=vars(var3))
Back to top
Example of a graph generated with previous command
Back to top
Generating a bar graphs with significance shown - Example 3
- Generate Descriptive Statistics that provide standard error of the
mean
# Plot Graph
predictor.r = factor (sumstats_object$predictor.r, levels=c("level1", "level2", "level_n"))
# predictor.r : name of variable that has had levels reorganized for graphing purposes
# predictor.r: predictor that requires levels to be organized for graphic purposes
# level1, etc.: level names listed in the order that you wish them graphed
bar_object=ggplot(sumstats_object, aes(x= predictor.r,y=outcome))+ scale_fill_manual(values=c("#CC0000", "#004C99"))
bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor))+geom_errorbar(aes(ymin=Outcome-se, ymax=Outcome+se),position=position_dodge2(0.9, padding=0.6))+labs(title="Title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + guides(fill = guide_legend(title = "Legend Title")) + geom_signif(y_position=c(30.5, 16.5, 15.5), xmin=c(0.7, 4.7, 5.7), xmax=c(1.3, 5.3, 6.3),annotation = c("***","*","**"), tip_length = 0)
# outcome: under geom_errorbar - this refers to the heading of the outcome column in the descriptive statistics output.
# geom_signif: numbers shown in parenthesis need to be adjusted accordingly to accommodate for height (in relation to the y_position) and/or position relative to bars (in relation to the xmin and xmax). Additionally, the annotation needs to be appropriately corrected to reflect the appropriate significances.
# Resources that may be of assistance
# https://statisticsglobe.com/ggsignif-package-r
# https://stackoverflow.com/questions/51443889/how-can-i-get-geom-errorbar-to-dodge-correctly-on-a-bar-chart-in-ggplot2
# https://stackoverflow.com/questions/59008974/why-is-stat-identity-necessary-in-geom-bar-in-ggplot#59009108
# http://sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization
# https://stackoverflow.com/questions/61022992/manually-plotting-significance-relations-between-sub-groups-on-ggplot2-barplot
Back to top
Example of a graph generated with previous command
Back to top
Generating a bar graphs with significance shown and reordered facets
- Example 4
- Generate Descriptive Statistics that provide standard error of the
mean
# Plot Graph
# Generate graph
bar_object=ggplot(sumstatsobject.r,aes(x=predictor1,y=outcome))+scale_fill_manual(values=c("#CC0000", "#004C99"))
bar_object.Sub=bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor1)) + geom_errorbar(aes(ymin=Data-se, ymax=Data+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title Line1\nTitleLine2", x="x-label", y="y-label") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) + guides(fill = guide_legend(title = "Legend Title")) + coord_cartesian(ylim=c(0,15)) + geom_signif(data=data.frame(ExecFn=c("factor 1", "factor 3", "factor 5", "factor 11", "factor 12")), aes(y_position = c(13.5,12.5,12.5,14,13), xmin=c(1,1,1,1,1), xmax=c(2,2,2,2,2), annotations = c("**","**","**","#","*")), tip_length = 0, manual=T)
# NOTE that only the factors that show significance are listed and the information in the aes also only relates to these.
# Create grid with reorganized factors
bar_object.Sub+facet_grid(~factor(MainPredictor, levels=c("factor 1" ,"factor 2","factor 3", "factor 4", "factor 5", "factor 6", "factor 7", "factor 8", "factor 9", "factor 10", "factor 11", "factor 12")), scales="fixed")
# MainPredictor - replace this with the name of the column that represents the x-label predictor not sub-predictors (An example of the MainPredictor is the name of the predictor that us made up of the 12 factors shown in the graph; an example of a sub-predictor is predictor1 which in this case is Yes or No)
Back to top
Example of a graph generated with previous command
Back to top
Creating a line chart
- variable y is being plotted across variable x for two groups
lp<- ggplot(SCl, aes(x=xvarname, y=yvarname, group=groupname, color=groupname)) + # x
stat_summary(fun=mean, geom="line") + # plots the line
stat_summary(fun=mean, geom="point", size=3)+ # adds points to the plot
stat_summary(fun.data=mean_se, geom="errorbar",position=position_dodge(width=0.0), width=0.2)+ #using standard error of the mean for the error bar.
scale_color_manual(values=c('darkblue','red'))+
labs(title="Title", x="x-axis label", y="y-axis label")+
theme(axis.title.x=element_text(vjust=-0.4))+
theme(axis.title.y=element_text(vjust=0.3))+
theme(title=element_text(vjust=+1.1))+
theme(plot.title=element_text(hjust=0.5))
lp # displays the lineplot
Back to top
Example of a graph generated with previous command
Back to top
Creating a simple boxplot
boxplot(Outcome~Predictor,data=objectL, main="Titleline1 \n Titleline 2",xlab="x-axis label", ylab="y-axis label")
or
# boxplot with data jittered and showing & two different colors
ggplot(objectL, aes(x=Predictor, y=Outcome, fill=Predictor)) + stat_boxplot(geom='errorbar', position=position_dodge(width=0.5),width=0.2) + geom_boxplot(width=0.4) + geom_jitter(position=position_jitter(0.15)) + scale_fill_manual(values=c("#E0E0E0", "#606060"))
Back to top
Correlation
Correlation: a correlation tests whether there is a
RELATIONSHIP between two observed variables within the same sample
(e.g. asking if the number of cups of coffee consumed is related to
amount of alertness in a sample)
Plotting a correlation with a regression line
ggscatter(object, x = "xvariable", y = "yvariable",add = "reg.line", conf.int = TRUE,cor.coef = TRUE, cor.method = "pearson",xlab = "xlabel (units)", ylab = "ylabel(units)", title="title for the graph")+theme(plot.title=element_text(hjust=0.5))
- ggscatter: plots the correlation
- object: name of the wide version of data file
- xvariable: replace with name of column of variable to be plotted in
the x-axis
- yvariable: replace with name of column of variable to be plotted in
the y-axis
- reg.line: adds the regression line
- conf.int = TRUE: plots the confidence interval (shaded area)
- cor.coef = TRUE: shows the correlation coefficient
- cor.method = “pearson”: calculates and shows the p value
- xlab: replace words in quotes with appropriate wording for the
x-label
- ylab: replace words in quotes with appropriate wording for the
y-label
- title: replace words in quotes with a proper title
- theme(plot.title=element_text(hjust=0.5)): centers the title
Back to top
Running a correlation
cor.test(object$predictor, object$outcome)
- gives the correlation coefficient
- gives the p-value (which tells you of the correlation is
significant)
- Important notes to remember from the results
- df = degrees of freedom (n-2 (for correlation))
- p value: p value of less than 0.05 is significant
- the value under “cor” = correlation coefficient
Back to top
Example of a Correlation Output
Pearson's product-moment correlation
data: ingrd$Inc and ingrd$Grd
t = 2.968, df = 12, p-value = 0.01174 ## used for report
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1833572 0.8780885
sample estimates:
cor
0.6506389 ## used for report
Back to top
Writing a report from a correlation
A Pearson’s correlation indicated that a student’s grade was
significantly related to the family’s income, r(12)=-.65, p<.05.
Back to top
Creating a Correlation Matrix
corPlot(Object,cex = 0.8, stars=TRUE, upper=FALSE, main = "TitleLine1 \n TitleLine2",gr = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA")))
- Object contains only the variables to be included
in the correlation matrix
- Replace name of Object with the appropriate name
- Replace Title information with the appropriate information
Example of a correlation plot/matrix generated with previous
command
Back to top
Regression
Regression: a regression DESCRIBES the RELATIONSHIP
between a predictor (an independent variable) and an outcome (dependent)
between two observed variables within the same sample. Regressions can
take various forms - linear (e.g. calibration of pH meter), sigmoidal
(e.g. in enzyme kinetics/receptor binding). Distinction between
correlation & regression: While a correlation can
tell you that there is a relationship between
[H3O+] in a solution and the pH value,
regression provides you with the line that best
predicts that relationship.
Types of Regression
- Logistic regression is the appropriate to conduct
when the dependent variable is dichotomous
(binary)
- Stepwise regression is appropriate when you have
many variables and you’re interested in identifying a useful subset of
the predictors.
Back to top
Simple Regression
ObjectReg=lm(outcome~predictor, data=object)
summary(ObjectReg)
- Generates the regression equation and it is stored in
“ObjectReg”
- The lm regression command
- uses WIDE form of the data
- Designate one variable as outcome and the other as predictor - Do so
in such a way that makes sense. e.g. It is more likely that family
income would predict a student’s grade than a student’s grade predicting
family income
- The summary command: gives the output of
the regression equation, from which you extract all of the statistical
information that goes into a regression report.
Back to top
Example Regression Output
Back to top
Example of a Regression Equation derived from the above output
Back to top
Writing a report from a regression output
The results indicated that Fear of Sin significantly predicted
Depression, Beta = 0.222, t(95)=2.336, p < 0.05. Fear of Sin also
explained a significant proportion of variance in Depression,
R2 = .044, F(1, 95) = 5.455, p < 0.05.
Back to top
Multiple Regression (backward elimination) (using AIC)
fitObject = lm(y~x1+x2+x3...,data=na.omit(objectL))
- lm: is the command that runs a linear model analysis
- na.omit will omit those subjects with missing data points &
therefore cannot be included in analysis.
stepObject = stepAIC(fitObject, direction="backward", trace=FALSE) # direction can be "both" or "forward"
stepObject$anova # display results with AIC
summary(stepObject) # display results with t values
Multiple Regression (backward elimination) (using F-value)
fullObject = ols(y~x1+x2+x3..., data = objectL)
fastbw(fullObject, rule = "p", sls = 0.1)
Logistic Regression
mylogit = glm(Outcome ~ Predictor1 + Predictor2 + Predictor3..., data = ObjectL, family = "binomial")
summary(mylogit) # gives Beta and p values
exp(cbind(OR = coef(mylogit), confint(mylogit))) # gives Odds Ratio(OR)
Cronbach’s alpha
- Surveys are generally divided into sections /groups of
questions
- Prior to concluding that questions related to each other, we need to
make sure that questions within a specific group of questions are
consistent with each other (i.e. highly related)
- to measure if a group of questions is consistent or not we use
cronbach’s alpha
- we will also be able to know if consistency improves or not if we
delete a question from the group of questions (this is only shown in
Method 2).
Method 1
# subjects with missing data should be removed before hand using
ObjectNA=na.omit(object)
matobject=data.matrix(object)
reliability(cov(matobject))
Example output from Method 1
$alpha
alpha
0.9403455
$st.alpha
std.alpha
0.9417464
$rel.matrix
Alpha Std.Alpha r(item, total)
D2Stop 0.9350532 0.9368506 0.7365112
Cont2A 0.9335049 0.9354005 0.7837842
AvsOthers 0.9354148 0.9366306 0.7304763
DepSleep 0.9383129 0.9398743 0.6302478
ThinkNOL 0.9353017 0.9364763 0.7397708
LkFrwrd 0.9348757 0.9361827 0.7433303
ThnkLessT 0.9424475 0.9436797 0.5152945
TrdLessT 0.9362815 0.9380178 0.6999771
RushWrk 0.9355317 0.9366959 0.7302428
NegOblig 0.9352094 0.9363647 0.7406319
AccDown 0.9334180 0.9354451 0.7853325
Escape 0.9346224 0.9362977 0.7539146
Frustration 0.9336228 0.9352146 0.7798224
attr(,"class")
[1] "reliability"
Method 2
alphaobject=alpha(object)
alphaobject # shows output
Example output from Method 2
Reliability analysis
Call: alpha(x = CIUS)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.94 0.94 0.95 0.55 16 0.0051 1.8 0.97 0.55
lower alpha upper 95% confidence boundaries
0.93 0.94 0.95
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
D2Stop 0.94 0.94 0.95 0.55 15 0.0056 0.0139 0.56
Cont2A 0.93 0.94 0.95 0.55 14 0.0057 0.0137 0.55
AvsOthers 0.94 0.94 0.95 0.55 15 0.0055 0.0126 0.55
DepSleep 0.94 0.94 0.95 0.57 16 0.0053 0.0135 0.57
ThinkNOL 0.94 0.94 0.95 0.55 15 0.0055 0.0139 0.55
LkFrwrd 0.93 0.94 0.95 0.55 15 0.0056 0.0124 0.55
ThnkLessT 0.94 0.94 0.95 0.58 17 0.0049 0.0077 0.56
TrdLessT 0.94 0.94 0.95 0.56 15 0.0054 0.0136 0.56
RushWrk 0.94 0.94 0.95 0.55 15 0.0055 0.0125 0.55
NegOblig 0.94 0.94 0.95 0.55 15 0.0055 0.0129 0.55
AccDown 0.93 0.94 0.94 0.55 14 0.0057 0.0130 0.55
Escape 0.93 0.94 0.95 0.55 15 0.0056 0.0130 0.56
Frustration 0.93 0.94 0.95 0.55 14 0.0057 0.0132 0.55
Item statistics
n raw.r std.r r.cor r.drop mean sd
D2Stop 299 0.78 0.78 0.76 0.74 2.29 1.3
Cont2A 299 0.82 0.82 0.81 0.78 2.19 1.3
AvsOthers 299 0.77 0.78 0.77 0.73 1.20 1.2
DepSleep 299 0.69 0.69 0.65 0.63 1.25 1.3
ThinkNOL 299 0.78 0.79 0.77 0.74 1.64 1.1
LkFrwrd 299 0.79 0.80 0.78 0.74 1.43 1.2
ThnkLessT 299 0.59 0.58 0.53 0.52 2.84 1.4
TrdLessT 299 0.75 0.74 0.72 0.70 2.12 1.4
RushWrk 299 0.77 0.78 0.77 0.73 0.93 1.1
NegOblig 299 0.78 0.79 0.77 0.74 1.12 1.1
AccDown 299 0.83 0.82 0.82 0.79 2.27 1.4
Escape 299 0.80 0.79 0.79 0.75 2.06 1.5
Frustration 299 0.82 0.82 0.81 0.78 1.41 1.3
Non missing response frequency for each item
0 1 2 3 4 miss
D2Stop 0.14 0.17 0.18 0.30 0.21 0
Cont2A 0.15 0.15 0.21 0.31 0.17 0
AvsOthers 0.37 0.24 0.23 0.13 0.03 0
DepSleep 0.40 0.19 0.22 0.14 0.05 0
ThinkNOL 0.16 0.30 0.31 0.17 0.05 0
LkFrwrd 0.30 0.25 0.22 0.18 0.05 0
ThnkLessT 0.12 0.07 0.10 0.26 0.45 0
TrdLessT 0.17 0.17 0.22 0.25 0.19 0
RushWrk 0.49 0.20 0.20 0.07 0.03 0
NegOblig 0.41 0.20 0.25 0.12 0.01 0
AccDown 0.17 0.11 0.19 0.31 0.21 0
Escape 0.24 0.13 0.15 0.28 0.19 0
Frustration 0.37 0.17 0.23 0.16 0.08 0
For additional information on interpretation please see: Reliability
Analysis
Testing for Homogeneity of variance
- The Levene Test tests whether the VARIANCES are significantly
different from each other
leveneTest(objectL$outcome, objectL$predictor, center = median)
- center: is the name of a function to compute the center of each
group. If you use “mean” instead of median - this gives the original
Levene’s test; The default is “median” which provides a more robust
test.
- if variances are significantly different (p<0.05), in the t-test
include the command: var.equal = FALSE –> calculates a more robust
t-test;
- if variances are NOT significantly different (p>0.05), in the
t-test include the command: var.equal = TRUE –> calculates regular
t-test
- p value is the number found under the Pr(>F); if p>0.05
variances are homogenous (the statistically same)
- Levene test is NOT necessary in REPEATED Measures
tests; In such cases, in a t-test you should
use var.equal = TRUE
Back to top
Conducting a t-test in R
t-test: a t-test tests whether the MEANS/AVERAGES of
TWO samples are significantly different from each other
(e.g. asking if the average depression scores of males is different from
that of females in a sample)
Method 1 - for LONG version of data
PLEASE NOTE this command is currently (10/28/24) not
working properly and giving the following error “cannot use ‘paired’
in formula method”. This page will be updated once a solution is
reported. The issue appears to be in the “paired=…” part of the
command.
t.test(outcome~predictor, data=objectL, paired=FALSE/TRUE, var.equal=FALSE/TRUE)
OR
Method 2 - for WIDE version of data
t.test(object$var1,object$var2, paired=FALSE/TRUE, var.equal=FALSE/TRUE)
- For METHOD 1:
- t.test: command to calculate a t-test
- outcome: name of the outcome column in the long version of
the data file
- predictor: name of the predictor column in the long version
of the data file
- ~ : predicted by
- data=objectL: specifies the name of the long version of the
data file. REPLACE ONLY the word objectL with the appropriate name of
the long version of the file
- For METHOD 2:
- object, var1 and var2 - “object” needs to be replaced with
appropriate name of object containing WIDE version of the data; “var1”
and “var2” need to be replaced with the appropriate names of the columns
in the WIDE version of the data.
- For BOTH:
- paired = TRUE: USED for REPEATED MEASURES,
also known as paired
- paired = FALSE: USED for INDEPENDENT
MEASURES
- var.equal = TRUE: used when the Levene Test p > 0.05
(remember p = the number under Pr(>F)); NOTE: Also
use this in the case of a repeated measures t-test when
a Levene test is not required.
- var.equal = FALSE: used when the Levene Test p <
0.05
- If var.equal = FALSE: the degrees of freedom (df) will NOT be a
whole number (because the specific t-test used uses a more conservative
way of calculating the df.)
- reporting: t(df)=tvalue,pvalue, r^2=
Back to top
Calculating the effect size (r squared)
tvalue^2/(tvalue^2+df)
- all information for the above equation is obtained from the output
of the t-test
- for the tvalue ignore negative signs
- reporting: t(df)=tvalue, p…, r^2=.
Back to top
Calculating the effect size (Cohen’s d) (alternative option)
# requires package effsize
cohen.d(data= object, group1, group2)
- If the groups are not of equal size, add the na.omit command to
group with the least numbers e.g. if group2 has less, use
“na.omit(group2)” instead of simply “group2”
- object - replace with the name of the object containing the
WIDE form of the data
- group1 or group2 - replace with the names of the columns of each of
the groups.
Back to top
t-test output in R
Paired t-test / Two Sample t-test # this title of the output
indicates whether the test you just ran is paired/repeated OR
independent measures/Two Sample t-test.
The line with the t value, df and p value is the most important
line; t: gives you the t value; df gives you the degrees of freedom and
p-values gives you the p value. ALL OF THESE NEED to be reported; the
report will include this data in the following format t(df)=tvalue,
p
output when the sex differences in vitamin D data is analyzed
CORRECTLY using an independent measures test:
t = -2.2258, df = 16, p-value = 0.04075
- output when the sex differences in vitamin D data is analyzed
WRONGLY using a repeated measures test:
t = -2.5964, df = 8, p-value = 0.03179
- If these numbers have the SAME sign it means that it is likely that
there is a significant difference:
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-18.252038 -1.081295
- In a repeated measures test, R only gives you the output for MEAN OF
THE DIFFERENCES:
sample estimates:
mean of the differences
-9.666667
- In an independent measures test, R gives you the output for the mean
for EACH GROUP:
mean in group Boys mean in group Girls
25.66667 35.33333
Back to top
Example of a t-test output and an effect size
(r2)calculation with the important data used in the report
highlighted.
Back to top
Writing a report for a t-test
An independent measures t-test was utilized to assess whether there
is a significant difference in vitamin D levels between boys and girls.
The analysis indicated a significant difference in vitamin D levels,
t(16)=-2.226, p<0.05; rsquared= , specifically girls (M= ; SEM= )
reported a significantly higher level of vitamin D then boys (M= ; SEM=
).
OR
An independent measures t-test was utilized to assess whether there
is a significant difference in vitamin D levels between boys(M= ; SEM= )
and girls(M= ; SEM= ). The analysis indicated a significant difference
in vitamin D levels, t(16)=-2.226, p<0.05; rsquared= ,specifically
girls reported a significantly higher level of vitamin D then boys.
Back to top
Conducting a t-test in excel (method 1)
- =ttest(array1,array2, tails, type)
- Array 1: first column containing the data
- Array 2: second column containing data
- tails: 1 or 2 tailed
- type: 1= Paired; 2= Two-sample equal variance; 3= two-sample unequal
variance.
Conducting a t-test in excel (method 2)
- To give a more detailed output, click Data Analysis in the Analysis
group on the Data tab.
- If the Data Analysis command is not available, you need to load the
Analysis ToolPak add-in program.(see:Office Support)
- Select the type of t-test you need to run
- Input the appropriate variable ranges under “Input”
- If your variables have labels (which they should) check
“Labels”
- Click on Output Range and in the associated box indicate the cell
you want the output to be generated in (make sure this cell is away from
other important data)
- Sample output is shown below:
| Mean |
11.75 |
11.5 |
| Variance |
46.91666667 |
33.66666667 |
| Observations |
4 |
4 |
| Pearson Correlation |
0.910007244 |
|
| Hypothesized Mean Difference |
0 |
|
| df |
3 |
|
| t Stat |
0.174077656 |
|
| P(T<=t) one-tail |
0.436444286 |
|
| t Critical one-tail |
2.353363435 |
|
| P(T<=t) two-tail |
0.872888572 |
|
| t Critical two-tail |
3.182446305 |
|
One-way ANOVA - INDEPENDENT MEASURES
ANOVA: an ANOVA tests whether the MEANS/AVERAGES of
MORE THAN TWO samples are significantly different from
each other (e.g. asking if there is a significant difference in the
average depression scores between Freshman, Sophomores, Juniors and
Seniors). ANOVAs can be complicated given that comparisons can be
carried out within groups and between groups.
- Run Levene Test: For the DF group = k-1 (k=number of
groups/treatments) = DF for the Between Treatment
- The other degrees of Freedom is DF for the WITHIN treatment = N-k
(N=sum total of subjects; k=number of treatments)
aovobject=aov(outcome~predictors, data=objectL)
- this calculates the ANOVA
- Use the long version of the file
- aovobject: name of the object that will contain the ANOVA
calculation
summary(aovobject)
- this gives the output for the ANOVA (See example output
below)
- name of predictor column: between Treatment (b/T)
- Residuals: within Treatment (w/T)
- DF: degrees of freedom (needed for reporting)
- Sum Sq: Sum of Squares (Note: needed for the calculation of the
effect size)
- Mean Sq: Mean Square = variances
- F value: test statistic (needed for reporting)
- Pr(>F): p value (needed for reporting)
- your statistics in the report should be in form: F(DF(b/T),DF(w/T))=
F value, p value, effect size)
- EFFECT SIZE: Sum Sq is used for effect size: Sum Sq(b/T)/(Sum
Sq(total))
Back to top
Example of an Independent Measures Output
Df Sum Sq Mean Sq F value Pr(>F)
variable 2 28.22 14.111 11.91 0.000256 ***
Residuals 24 28.44 1.185
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Back to top
Post-hoc analysis using Tukey Test
- A post-hoc analysis is run whenever the first part of the
ANOVA shows a p<0.05
phobject=glht(aovobject, linfct=mcp(predictor="Tukey"))
- phobject: name of the object that will contain the post-hoc Tukey
test
- aovobject: the name of the object that contains the ANOVA (see
previous ANOVA command)
- predictor: needs to be replaced with name of the PREDICTOR column in
your data file.
summary(phobject)
- gives the summary of the post-hoc analysis (see example
output below)
- Pr(>|t|): is the p value for the specific groups being
compared.
Back to top
- If the variances are not equal the Games-Howell
test is conducted (c.f. Welch test in the t-test)
- requires statix package
games_howell_test(data = objectL, formula = outcome ~ grouping_variable)
Back to top
Example of a post-hoc output for Independent Measures ANOVA
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: aov(formula = value ~ variable, data = migraineL)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
DrugB - DrugA == 0 2.1111 0.5132 4.114 0.001063 **
DrugC - DrugA == 0 2.2222 0.5132 4.330 0.000605 ***
DrugC - DrugB == 0 0.1111 0.5132 0.217 0.974514
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
Back to top
Back to top
Writing a report from an Independent Measures ANOVA output
Pain levels were reported on a scale of 1 to 10 (1=low, 10=high)
following the administration of various pain relievers by individuals
who suffer from migraines. A One Way ANOVA indicated a significant
difference (F(2, 24) = 11.91; p<0.001; n2= 0.498) in
reported pain levels across the drugs administered. Post-hoc analysis
using Tukey test indicated that individuals reported significantly lower
pain levels following the administration of Drug A (M=3.67, SEM=0.29)
relative to both Drug B (M=5.78; SEM=0.49; p<0.01) and Drug C
(M=5.89; SEM=0.26; p<0.001). There was no significant difference
between the reported pain levels following the administration of Drugs B
and C (p>0.05).
Back to top
One-Way REPEATED MEASURES (RM) ANOVA
aovobject=aov(outcome~predictor+Error(participant/predictor), data=objectL)
- participant: replace with the label of the SUBJECT column in the
long version of the file
- predictor: replace with the name of the PREDICTOR column in the long
verstion of the file.
summary(aovobject)
- predictor: between Treatment value (b/T)
- Residuals: error
- Df: degrees of freedom (variable/predictor = k-1 where k=number of
groups. Residuals=(df(w/T)-df(b/S))); w/T = Within Treatment = N-k,
where N=Total number of “samples”/data points and k=number of groups;
b/S = Between subjects = n-1, where n = number within a group;
- Sum Sq: sum of squares (needed for the calculation of the effect
size)
- Mean Sq: variances
- F value: test statistic
- Pr(>F): p value
- your report of these statistics should be in the form of
F(df(b/T),df(error))= Fvalue, pvalue, eta squared (effect size)
Back to top
Example of a Repeated Measures Output
An Example of a Repeated measures ANOVA output with important numbers
used in report highlighted
Back to top
Post-hoc Analysis using Pairwise comparisons
- A post-hoc analysis is run whenever the first part of the
ANOVA shows a p<0.05
- The tukey test does NOT work for this type of ANOVA
command. A Pairwise comparison is used instead.
with(objectL, pairwise.t.test(outcome,predictor, p.adjust.method="bonferroni", paired=T))
- pairwise comparisons: calculate the significant differences between
the pairs of groups
Back to top
Example of a post-hoc output for Repeated Measures ANOVA
An Example of a post-hoc analysis output for a Repeated measures
ANOVA with important numbers used in report highlighted. These values
are the p-values comparing the variable in the vertical
column on the left to the horizontal row above the numbers.
Back to top
Back to top
Writing a report from a Repeated Measures ANOVA output
Repeated measures ANOVA revealed a significant overall effect across
time (F(2,24) = 20.65, p< 0.001; n2=0.63), following the
administration of drug. Post-hoc Bonferroni analysis indicated that
anxiety scores were significantly lower than pre-drug (M=8.10, SEM=0.23)
administration on both week 1 (M=6.80, SEM = 0.25; p<0.05) and week 2
(M=3.00, SEM = 0.26; p<0.001). Moreover the anxiety scores on week 2
were also significantly lower than those of week 1 (p<0.001).
Back to top
- Additional information in relation to ANOVA and some code for more
challenging ANOVAs can be found here
How to use R for multiple-select questions
- Replace OBJECT with name of the object containing the imported data
(long version)
- First column needs to be the subject number
- [This should be done in Excel] All values in the rows need to be
replaced with a 1 (implies there is a value or count) or 0 (blank)
- Replace “variable1” with the name of the column heading of that
specific variable (e.g. if counting number of people in household and/or
Intramural sports, variable1 etc would be replaced with the names of
these columns)
- This code is modified from https://stackoverflow.com/questions/11622660/how-to-use-r-for-multiple-select-questions
set.seed(1)
dat <- data.frame(OBJECT,"variable1","variable2", "variable3", "..n")
multi.freq.table = function(OBJECT, sep="", dropzero=FALSE, clean=TRUE) {
# Takes boolean multiple-response data and tabulates it according to the possible combinations of each variable.
counts = data.frame(table(OBJECT))
N = ncol(counts)
counts$Combn = apply(counts[-N] == 1, 1,
function(x) paste(names(counts[-N])[x],
collapse=sep))
if (isTRUE(dropzero)) {
counts = counts[counts$Freq != 0, ]
} else if (!isTRUE(dropzero)) {
counts = counts
}
if (isTRUE(clean)) {
counts = data.frame(Combn = counts$Combn, Freq = counts$Freq)
}
counts
}
multi.freq.table(dat[-1], sep="-")
How to count number of occurences in a column
- the counts are sorted largest to smallest
count(OBJECT, vars = Columnname, sort=TRUE)
or
- the output is sorted from lowest to highest
sort(table(OBJECT$Columnname))
Plotting map of US with data
METHOD 1: Plot map of US reflecting various levels - Only plots
lower 48
Load packages
library(ggplot2);library(maps);library(dplyr)
Copy data
- StateNumber: object containing data for number of students from
every state
- this object contains the data with the names of the states and
numbers of students from every state
- states that have no representation i.e. 0 students should be
blank
- data needs to have a column named “region” (lower
case)
StateNumber = read.table(file="clipboard", sep="\t", header=TRUE, as.is=TRUE)
Plots all states with ggplot2, using black borders and light blue
fill
ggplot()+geom_polygon(data=MainStates,aes(x=long, y=lat, group=group),color="black", fill="lightblue")
Merges Mainstates Map with StateNumber
MergedStates <- inner_join(MainStates, StateNumber, by = "region")
p <- ggplot()
p <- p + geom_polygon( data=MergedStates, aes(x=long, y=lat, group=group, fill = n), color="white", size = 0.2)
p
Changing color scheme
p <- p + scale_fill_continuous(name="Number of Students",low = "lightblue", high = "darkblue",limits = c(0,140),breaks=c(20,40,60,80,100,120,140), na.value = "grey50")+labs(title="title of graph")
p
Back to top
METHOD 2: Plot map of US reflecting various levels - Plots
all 50 states
Load packages
library(usmap)
Copy data
- StateNumber: object containing data for number of students from
every state
- this object contains the data with the names of the states and
numbers of students from every state
- states that have no representation i.e. 0 students should be
blank
- data needs to have a column named “region”
- The excel data must contain the following columns. The data below
can be copied and then pasted into excel using the “paste special/text
only” option.
- The fips column stands for the Federal Information
Processing Standard (FIPS) which provides a set of standard numeric
codes for referring to U.S. states.
- Please note that, in the FIPS column, the 0 in front of the single
digit numbers is not a requirement - this was just how I set it up.
| 01 |
Alabama |
|
| 02 |
Alaska |
|
| 04 |
Arizona |
|
| 05 |
Arkansas |
|
| 06 |
California |
|
| 08 |
Colorado |
|
| 09 |
Connecticut |
|
| 10 |
Delaware |
|
| 12 |
Florida |
|
| 13 |
Georgia |
|
| 15 |
Hawaii |
|
| 16 |
Idaho |
|
| 17 |
Illinois |
|
| 18 |
Indiana |
|
| 19 |
Iowa |
|
| 20 |
Kansas |
|
| 21 |
Kentucky |
|
| 22 |
Louisiana |
|
| 23 |
Maine |
|
| 24 |
Maryland |
|
| 25 |
Massachusetts |
|
| 26 |
Michigan |
|
| 27 |
Minnesota |
|
| 28 |
Mississippi |
|
| 29 |
Missouri |
|
| 30 |
Montana |
|
| 31 |
Nebraska |
|
| 32 |
Nevada |
|
| 33 |
New Hampshire |
|
| 34 |
New Jersey |
|
| 35 |
New Mexico |
|
| 36 |
New York |
|
| 37 |
North Carolina |
|
| 38 |
North Dakota |
|
| 39 |
Ohio |
|
| 40 |
Oklahoma |
|
| 41 |
Oregon |
|
| 42 |
Pennsylvania |
|
| 44 |
Rhode Island |
|
| 45 |
South Carolina |
|
| 46 |
South Dakota |
|
| 47 |
Tennessee |
|
| 48 |
Texas |
|
| 49 |
Utah |
|
| 50 |
Vermont |
|
| 51 |
Virginia |
|
| 53 |
Washington |
|
| 54 |
West Virginia |
|
| 55 |
Wisconsin |
|
| 56 |
Wyoming |
|
StateNumber = read.table(file="clipboard", sep="\t", header=TRUE, as.is=TRUE)
Plots blank map of all states, using black borders and grey
fill
plot_usmap(regions = "states") + labs(title = "US States",subtitle = "This is a blank map of the state of the United States.") + theme(panel.background = element_rect(color = "black", fill = "grey"))
Plots all states with a color scale reflecting numbers of students
from the specific states and places image in a black rectangle with a
light color grey fill
plot_usmap(data = StateNumber, values = "n", color = "white") + scale_fill_continuous(name="Number of Students",low = "lightblue", high = "darkblue",limits = c(0,140),breaks=c(20,40,60,80,100,120,140), na.value = "grey50")+labs(title="title of graph", subtitle = "Subtitle")+ theme(plot.title=element_text(hjust=0.5)) + theme(plot.subtitle=element_text(hjust=0.5))+ theme(legend.position = "right") + theme(panel.background = element_rect(color = "black", fill ="grey85"))
Back to top
Plotting a 3D Scatter Plot
- you may need to go into Tools/Global
Options/General/Advanced and under OS Integration (quit
required) change the Rendering Engine option to
Desktop OpenGL
Back to top
Additional Resources
R Markdown
# In R Markdown Document:
<div class="alert alert-x">
# Replace "x" with "success" for green; "info" for blue; "warning" for tan; "danger" for red.
</div>
# Creating a pdf of the rpubs website https://rpubs.com/ssammut/ResearchStats
<div class="alert alert-x">
# requires webshot package
library(webshot)
webshot(
url = "https://rpubs.com/ssammut/ResearchStats"
, file = "Output1.pdf" # replace "Output1.pdf" with appropriate name
, vwidth = 992
, vheight = 744
)
getwd () # will show in which folder the file is stored.
</div>
Back to top
Jamovi - an alternate statistical program
- This is not the program used in the class but you can feel free to
experiment with it.
- Jamovi is a program based on R that has a more user-friendly
interface
- Download Jamovi
- Advantage - can be used on all platforms Macs, PCs and
Chromebooks
- A user guide can be found here
- A tutorial is also available here
Back to top
Creating your own package
Resources:
Requires:
Creating the package
- Create folder Name_R_package
- In the folder create a folder called R
- In the folder called R, create a text file called Name_function.txt
that contains a copy the function code
- Change the extension of the text file to .R (confirm change)
- In the Name_R_package folder create a text file called
DESCRIPTION
- In the DESCRIPTION file add the following information:
Package: Name of package
Type: Package
Title: Title describing package
Version: 0.0.1.0
- Remove the .txt extension from the DESCRIPTION file (confirm
change)
- Open a new window in R so that you do not lose previous work
- Go to File/New Project.
- Click on New Directory
- Click on R Package
- Name the package
- Click “Add”
- Select the R file created above
- Click “Browse” and change the folder to select the Name_R_package
you created earlier.
- Click on “Create Project”
- Click on “Build” in the menu and select “Build Source Package”
Using the package
- Go to Tools/Install Packages
- Change “Install From” to “Package Archive File (.zip; .tar.gz)”.
This will open the window for you to select the file.
- Go to the folder Name_R_package and select the .tar.gz file
listed
- Click “Install”
- Run library using the name you gave the package in 9c.:
library(packagename)
- NOTE: Whenever you install a new version of R you
will need to re-run the installation of the package just described.
Back to top
---
title: "Basic Commands to Get Started with R"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
  word_document: default
  toc: yes
---
<div class="alert alert-danger">
# Table of Contents:
1. [Installing R & R Studio]
2. [Installation of Packages]
3. [Loading Packages]
4. [Copying Data from Excel into R]
5. [Exporting from R]
    + [Exporting output to a text file]
    + [Exporting dataframe to a csv file]
6. [Viewing imported data]
7. [Melting your data]
    + [Performing a simple melting of data]
    + [Melting data with multiple id variables and measure variables]
8. [Descriptive statistics for **Long** version of data]
9. [**Proportions test**]
    + [Comparing two proportions - Method 1]
    + [Comparing two proportions - Method 2]
    + [Proportions Test Output in R]
    + [Writing a report from a proportions test output]
    + [Comparing more than two proportions - Method 1]
    + [Comparing more than two proportions - Method 2]
    + [Proportions Test Output in R (more than two proportions)]
    + [Writing a report from a proportions test output (more than two proportions)]
10. [**Odds Ratio**]
    + [Create Matrix]
    + [Calculate Odds Ratio]
    + [Writing a report from an odds ratio test output]
11. [Some things you ***may*** have to do **prior to generating a graph**]
    + [Function and command for generating Standard Error of the Mean (SEM)]
    + [Ordering levels in a specific variable]
    + [Changing variable name in a specific column]
    + [Changing the column name/header in the object/dataframe containing the data]
12. [**Creating a pie chart, bar graph, line graph or box plot**]
    + [Generating a pie chart]
    + [Generating a bar graph]
    + [Generating a bar graph with **different colored bars**]
    + [Creating a bar graph with **one independent variable** and using the standard error of the mean for error bars]
    + [Creating a bar graph with **multiple independent variables** and using the standard error of the mean for error bars]
    + [Creating bar graphs with **multiple independent variables**, plotted as three separate graphs (one per variable) in one grid]
    + [Creating a line chart]
    + [Creating a simple boxplot]
13. [**Correlation**]
    + [Plotting a correlation with a regression line]
    + [Running a correlation]
    + [Example of a Correlation Output]
    + [Writing a report from a correlation]
    + [Creating a Correlation Matrix]
14. [**Regression**]
    + [Types of Regression]
    + [Simple Regression]
    + [Example Regression Output]
    + [Example of a Regression Equation derived from the above output]
    + [Writing a report from a regression output]
15. [Testing for Homogeneity of variance]
16. [**Conducting a t-test in R**]
    + [Method 1 - for **LONG** version of data]
    + [Method 2 - for **WIDE** version of data]
    + [Calculating the effect size (r squared)]
    + [t-test output in R]
    + [Example of a t-test output and an effect size (r^2^)calculation with the important data used in the report highlighted.]
    + [Writing a report for a t-test]
17. [**One-way ANOVA - INDEPENDENT MEASURES**]
    + [Example of an Independent Measures Output]
    + [Post-hoc analysis using Tukey Test]
    + [Example of a post-hoc output for Independent Measures ANOVA]
    + [Equation for Eta Squared (effect size) for Independent Measures ANOVA (information taken from the first step of the ANOVA).]
    + [Writing a report from an Independent Measures ANOVA output]
18. [**One-Way REPEATED MEASURES (RM) ANOVA**]
    + [Example of a Repeated Measures Output]
    + [Post-hoc Analysis using Pairwise comparisons]
    + [Example of a post-hoc output for Repeated Measures ANOVA]
    + [Equation for Eta Squared (effect size) for Repeated Measures ANOVA. The information is found under the section **Error: Within** in the first step of the ANOVA]
    + [Writing a report from a Repeated Measures ANOVA output]
19. [How to use R for multiple-select questions]
20. [How to count number of occurences in a column]
21. [Plotting map of US with data]
    + [METHOD 1: Plot map of US reflecting various levels - Only plots **lower 48**]
    + [METHOD 2: Plot map of US reflecting various levels - Plots **all 50 states**]
22. [Additional Resources]
    + [R Markdown]
    + [Jamovi - an alternate statistical program]
    + [Creating your own package]
</div>

# Resources

  * [R Commands Document](http://rpubs.com/ssammut/ResearchStats){target="_blank"}
  * There is a significant amount of information pertaining to the use of R on the internet. Listed below are only some examples of the resources. If you are having trouble installing R and R Studio, even after reading and following the instructions I include below, you may want to go to some of these websites. 

<div class="alert alert-warning">  
# Installing R & R Studio

<div class="alert alert-danger">
**PLEASE NOTE**: There are times when you may have to re-download R (not necessarily R Studio). You will realize this is needed if you download a package and when you run the library you get a message that seems to indicate that the package is not for that version. All you have to do in this case is 1. close out of R Studio; 2. download the newer version of R and 3. install it.
</div>

For links to download R and R studio please go to the following:

  1. Step 1 - R MUST be installed first - [Download **R** for Windows or Mac ](https://cran.r-project.org/)

  2. Step 2 - R Studio is installed after R has finished installing - [Download **R Studio** for Windows or Mac ](https://posit.co/downloads/). **Please note:** If clicking on the link does not work, copy and paste the following link into your browser: https://posit.co/downloads/

</div>

<div class="alert alert-warning">  
## For **Windows**:

1. Double click **R-x.x.x-win.exe**

    a. Click “OK” (for language)
    b. Click “Next” (for Information)
    c. Click “Next” (for Select Destination Location)
    d. Click “Next” (for Select Components)
    e. Click “Next” (for Startup options)
    f. Click “Next” (for Select Start Menu Folder)
    g. On the next screen:
        i.	Make sure “Create a desktop shortcut” is unchecked
        ii.	Make sure “Create a Quick Launch shortcut” is unchecked
        iii.	Click “Next”
    h. Click “Finish”

2. Double click **RStudio-x.x.xxxx.exe**

    a. Click “Next”
    b. Click “Next” 	(for Choose Install Location)
    c. Click “Install” 	(for Choose Start Menu Folder)
    d. Click “Finish”

[***Step 3 is not necessary if you are installing the packages by running the commands below under "Installation of Packages"***]

3. Packages go into: 
    a. C:\\Users\\YourName\\Documents\\R win-library\\x.x
    b. Accept to overwrite current packages 

## For **Mac**:

1. Double click **R-x.x.x.pkg**
    a.	click "Continue" (for Introduction)
    b.	click "Continue" (for Read Me)
    c.	click "Continue"	(for License)
    d.	click "Agree"
    e.	click "Install"	(for Installation Type)
    f.	if asked, provide password to allow installation of software
    g.	click "Install Software"
    h.	When installation is complete click "Close"


2. Double click **RStudio-x.x.xxxx.dmg**
    a. in window that opens up, drag RStudio icon into Applications folder
    b. When installation is done, you can click on the RStudio icon in the Launchpad
    c. If asked "”RStudio” is an application downloaded from the Internet. Are you sure you want to open it?"
    d. If asked to install a "git" command - click "install" and then click on "agree" 

[***Step 3 is not necessary if you are installing the packages by running the commands below under "Installation of Packages"***]

3. Packages go into:
    a.	Macintosh HD/Library/Frameworks/R.framework/Versions/*version#*/Resources/library 
    b.	Accept to overwrite current packages
</div>
 
<a href="#top">Back to top</a>
 
## Resources for Installing the Analysis Toolpak in Excel:

If conducting more detailed analysis in ***Excel***, it is necessary to install and load the Analysis Toolpak. For instructions on loading the Analysis ToolPak in Excel see:

  * For [**Windows**](https://support.office.com/en-ie/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3-9317-6b40ba1a66b4#OfficeVersion=Windows){target="_blank"}
  * For [**Macs**](https://support.office.com/en-ie/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3-9317-6b40ba1a66b4#OfficeVersion=MacOS){target="_blank"} 
  * Use the [Analysis ToolPak](https://support.office.com/en-us/article/use-the-analysis-toolpak-to-perform-complex-data-analysis-6c67ccf0-f4a9-487c-8dec-bdb5a2cefab6){target="_blank"} to perform complex data analysis

***
# General Notes for R
  * pound (#) sign indicates a comment - not read by the program
  * console is the window in which you will run the commands
  * If you click in the console, and Ctrl L - clears the console
  * If ever you're stuck with a plus (+)sign and cannot get the cursor (>) back, press **Ctrl+C** (for both Macs and PCs)
  * object: an object contains a graph, the data etc. - whatever you put it in it.
  * <- is equivalent to = 
  * object$columnname: indicates the name of the column within the object
  * install packages: run **ONCE**; **NEED** to be connected to the internet
  * Library command: run **EVERYTIME** you open R-Studio
  * R **IS CASE SENSITIVE**
  * **names**(object) : gives the names of all the variables in a specific object ["names" is bolded b/c it is a command and does not change when used; only the word "object" is replaced by the appropriate name of the object containing the data]
  * **levels**(object$nameofcolumn) - gives the levels within the column of interest; ["levels" is bolded b/c it is a command and does not change when used; only the words "object" and "nameofcolumn" are replaced by the appropriate name of the object containing the data and the name of the column of interest respectively]
  * If ever you need to change a column into a factor because R is not seeing the column as a factor use this command: ***objectL\$nameofcolumn = as.factor(objectL\$nameofcolumn)***

***
<div class="alert alert-danger">
# Installation of Packages
  * Installation of packages is done **ONLY ONCE**; 
  * For the installation you MUST be ***connected to the internet***
```{r}
install.packages("ggplot2", dependencies = T) # used for generating graphs
install.packages("multcomp", dependencies = T) # used for post-hoc analysis
install.packages("pastecs", dependencies = T) # used for generating descriptive statistics
install.packages("reshape", dependencies = T) # used for generating a long version of the data from a wide version of the data
install.packages("reshape2", dependencies = T) # used to generate long form of the file; required in some cases
install.packages("nlme", dependencies = T) # used for generating ANOVAs
install.packages("car", dependencies = T)
install.packages("pwr", dependencies = T)
install.packages("dplyr", dependencies = T)
install.packages("devtools", dependencies = T)
install.packages("rms", dependencies=T) # for regression
install.packages("psych", dependencies=T)
install.packages("ggpubr", dependencies=T) # required for correlation plots
```
</div>
<a href="#top">Back to top</a>

## Additional packages (not required for class)
```{r}
install.packages("rmarkdown", dependencies=T) # for RMD files that allow embedding of graphs etc; not required for class
install.packages("jmv") # not required for class
install.packages("umx", dependencies=T) # For structural equation modelling
install.packages("semPlot", dependencies=T) # required for Structural Equation Modeling but not required for class
install.packages("lavaan", dependencies=T) # required for Structural Equation Modeling but not required for class; see https://lavaan.ugent.be/tutorial/index.html
install.packages("sem", dependencies=T) # required for Structural Equation Modeling but not required for class
install.packages("maps", dependencies=T) # required if making map of US - only lower 48 states
install.packages("usmap", dependencies=T) # required if making map of US - all 50 states
# install.packages("RVAideMemoire", dependencies = T) # required for the Fisher Test
install.packages("plotly", dependencies=T) # required for 3D plots
install.packages("compareGroups") # required for generation of table comparing groups
install.packages("lsmeans", dependencies=T) # for comparison of slopes
install.packages("effectsize", dependencies = T)
```

***
<div class="alert alert-danger">
# Loading packages
  * Packages need to be loaded **EVERY TIME** R studio is started.
  * Copy this command and paste into the Console and then press "Enter". 
```{r}
library(ggplot2);library(multcomp);library(pastecs);library(reshape); library(reshape2); library(nlme); library(car); library(pwr); library(dplyr); library(devtools); library(rms);library(psych); library(ggpubr)
```
</div>
<a href="#top">Back to top</a>

## Additional library commands (not required for class)
```{r}
library(umx); library(jmv); library(sem); library(lavaan); library(semPlot); library(qgraph); library(foreign); library(usmap); library(maps);  library(plotly); library(SEMfn); library(lsmeans); library(effectsize); library(roxygen2)

library(RVAideMemoire)
```

***
<div class="alert alert-success">
# Copying Data from Excel into R

## For PC
  1. Copy the command below (Ctrl+C)
  2. Paste command into R Studio Console (Ctrl+V)
  3. Change the name "object" to something that makes sense to you and is related to the data.
  4. In **Excel**, select the data you want to analyze; Include the header; 
  5. Copy the data (Ctrl+C)
  6. Go back to R Studio and run (**press "Enter"**) the command you pasted earlier (in step 2) into the console 
```{r}
object = read.table(file="clipboard", sep="\t", header=TRUE)
```

  * If you have missing data points in your file that you still want imported into the dataframe, use the following command:
```{r}
object = read.csv(file="clipboard", sep="\t", header=TRUE)
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
## For Mac
  1. Copy the command below (Cmd+C)
  2. Paste command into R Studio Console (Cmd+V)
  3. Change the name "object" to something that makes sense to you and is related to the data.
  4. In **Excel**, select the data you want to analyze; Include the header; 
  5. Copy the data (Cmd+C)
  6. Go back to R Studio and run (**press "Enter"**) the command you pasted earlier (in step 2) into the console
```{r}
object = read.table(pipe("pbpaste"), sep="\t", header = TRUE)
```

  * If you have missing data points in your file that you still want imported into the dataframe, use the following command:
  
```{r}
object = read.csv(pipe("pbpaste"), sep="\t", header = TRUE)
``` 
</div>
<a href="#top">Back to top</a>

***
# Importing Data into R from Excel (xlsx) files
 
## Important Notes: 
  * **If you are going to use this method, you need to make sure that your excel file has a short name with no spaces**
  * If you wish to **name the object** containing the data something different from the original Excel name, type in the new name in the **"Name:"** box under **"Import Options"****
  * If you have **more than one sheet of data** in the excel file, you will need to specify which sheet the data is coming from in the **"Sheet:"** box under **"Import Options"** section in the **"Import Excel Data window"**

## Procedure: 
  * Go to **"File>Import Data Set>From Excel..."**
  * R opens window called **"Import Excel Data"**
  * Click on **"Browse..."** 
  * Select the Excel file you want to import
  * R will display the data in the **"Data Preview"** window; [see note above if your excel file contains more than one data sheet or you wish to name your object something different from the excel file name]
  * Click **"Import"** and the data is now imported and will be displayed

  
# Importing Data into R from CSV files
```{r}
object = read.csv(file.choose(), header=T)
```
  * object: name of the object - call it as you wish; should be related to the data you are importing; should simple; NO spaces; should short.
  * read.csv: tells the program you are going to import a CSV file
  * file.choose (): tells the program to open a window for you to select the file of interest
  * header = T: this indicates that you CSV file has a header
  * the ONLY word you need to change in this command is the name of the object
  
# Importing Data into R from SPSS files
  * to import *.sav files from SPSS you first need to run the following library command

```{r}
library(foreign)
```

```{r}
object = read.spss(file.choose(), to.data.frame=TRUE)
```
  * read.spss: tells the program you are going to import an SPSS file
  * file.choose (): tells the program to open a window for you to select the file of interest
  * the ONLY word you need to change in this command is the name of the object

***
<div class="alert alert-success">

# Exporting from R
## Exporting output to a text file
```{r}
sink("outputname.txt")

# RUN THE COMMANDS YOU WISH YO BE EXPORTED TO THE TEXT FILE

sink()
getwd() # gives working directory

# to import into excel, select the file and use the "space delimited" option, and then check the space checkbox.
```

## Exporting dataframe to a csv file
```{r}
write.csv(dataframe,"name.csv")
getwd() # gives working directory
```
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-info">
# Viewing imported data
```{r}
View(object) # allows you to see the data in another tab
```
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-warning">
# Melting your data
## Performing a simple melting of data
```{r}
objectL = melt(object, id="idname")
```
  * melt converts file from wide to long version
  * objectL: NEW object that will contain the long version of the file
  * object: name of the of the object containing the wide version of the data
  * id="idname": this is only used SOMETIMES; only when you have an ID colum (a column that identifies the subject in some way and does not change/is constant)
  * In case of a file with an ID column, replace the idname with the appropriate name of the column
  * If there is NO ID COLUMN; you can simply **delete** the part of the command from the comma onwards i.e., *, id="idname"'* but keep the close parenthesis.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
# Melting data with multiple id variables and measure variables
```{r}
object_L=melt(object, id.vars = c("idvar_1", "idvar_2", "idvar_n"), measure.vars = c("mvar_1", "mvar_2", "mvar_n"))
```
  * idvar_1 etc. = replace these with the appropriate names of the columns containing the id variables
  * mvar_1 etc. = replace these with the measure variable column names
</div>
<a href="#top">Back to top</a>

***
# Selecting specific subset to analyze

```{r}
var_a = object$`outcome`[object$predictor1colname=="predictor1" & object$predictor2colname=="predictor2"]
```

  * var_a: name of variable that will contain subset
  * object$predictor1colname: name of column containing "predictor1" in object
  * ==: "truly equal to"
  
***
# Generating descriptive statistics in R
## Weighted mean
```{r}
wt <- c(w1,  w2,  w3,  w4, etc)/(wT) # the weight for the first value = x multiplied w1/wT (e.g. if weight for w1 is 25% w1 would be 25, wT would be 100)
x <- c(x1, x2, x3, x4, etc) # values to be weighted
xm <- weighted.mean(x, wt)
```

## Mean
```{r}
mean(c(x1, x2, x3, x4...))
```

## Descriptive statistics for **WIDE** version of data to 2 decimal places
```{r}
round(stat.desc(object, norm=T), digits=2)
```
  * gives the descriptive statistics for all variables to TWO decimal place (you can adjust the number of decimal places by simply changing the number after digits=)
  * object: replace this with name of object containing the WIDE version of the data
  * ONLY used with the WIDE version of the data

## Descriptive statistics for individual columns of data to 2 decimal places in **WIDE** version of data
```{r}
round(stat.desc(object$columnname, norm=T), digits =2)
```
  * gives the descriptive statistics for the WIDE version of the file but ONLY for a SPECIFIC COLUMN
  * object: needs to be replaced with the name of the wide version of the data
  * columnname: needs to be replaced with the name of the column for which you wish to generate the statistics.

<div class="alert alert-danger">
## Descriptive statistics for **LONG** version of data
```{r}
by(objectL$outcome, objectL$predictor, stat.desc, norm=T)
```
  * gives the descriptive statistics for the LONG version of the data
  * objectL: replace with name of the object containing the LONG version of data
  * outcome: replace with name of outcome column in objectL
  * predictor: replace with name of predictor column in objectL

  * nbr.val: number of values/subjects in a particular group
  * nbr.null: number of ZERO values
  * nbr.na: number of MISSING values
  * min/max: minimum and maximum valuers in a group respectively
  * range: gives you the range
  * sum: gives the sum of all values in a group

### Measures of centrality:
  + median: gives the median of values in a group
  + mean: gives the average of values in a group

### Measures of variability:
  + SE.mean: standard error of the mean
  + CI.mean.95: 95% confidence interval
  + var: variance
  + std.dev: standard deviation
  + coef.var: coefficient of variation

### Measures of normality:
  + skewness: measures of level of skewness of your data, a negative sign implies a negative skew;
  + skew.2SE: indicates whether the skewness is SIGNIFICANT. Skewness is significant if skew.2SE > +1 or <-1
  + kurtosis: measure of level of kurtosis of the data. A negative sign implies a negative kurtosis.
  + kurt.2SE: indicates whether the kurtosis is SIGNIFICANT. kurtosis is significant if kurt.2SE > +1 or <-1
  + normtest.w: test for normality
  + normtest.p: if normtest.p<0.05, the the data is NOT normally distributed.
  + When checking for normality - **FIRST** look at the normtest.p. If this value is greater than 0.05, then you have nothing to worry about. If this value is less than 0.05 then you need to look at the skew.2SE and kurt.2SE in order to determine which aspect is causing your data to not be normal. If your data is NOT normally distributed, other statistical tests apply.

### Example of output generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/ByOutput.png){width=65%}
</center>
</div>
<a href="#top">Back to top</a>

### Alternative method

```{r}
describeBy(outcome~predictor, data = objectL, digits=2, mat=TRUE)
```
### Example of output generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/describeBy.png){width=70%}
</center>

  + this output is nice and concise but does not contain everything included in the previous method (e.g. no indication of significance in relation to the deviations from normality (skewness/kurtosis))

<a href="#top">Back to top</a>

<center>

### Another Method of presenting Statistics

<div class="alert alert-success">
  * this method generates presentable tables as an output
  * requires "vtable" package
```{r}
st(ObjectL, vars = c("column1"), add.median=TRUE, numformat=NA) 
```
  * gives stats of one column with 25th, 50th and 75th percentiles
  * column1 -> replace with the name of the column for which you wish to calculate the descriptive statistics. Output will have N, Mean and Std.Dev., Min, and Max also.
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Rplot-one column.png){width=70%}
</center>

```{r}
st(ObjectL, vars = c("column1"), group = 'Group', add.median = TRUE, numformat=NA)
```
  * gives the statistics of column1 across the group "Group";
  * replace "column1" with the name of the column for which you wish to generate descriptive statistics; 
  * replace "Group" with the name of the column that has the grouping variable e.g. "Sex".

<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Rplot-by group.png){width=70%}
</center>
</div>
***
# Descriptive statistics in cases of multiple levels 
```{r}
describeBy(outcome~ var1 + var2, data = object,digits=2, mat=TRUE)

# e.g. if var1 = sex (Male or Female) and var2 = class (Freshman, Sophomore, Junior, Senior)
# digits: describes number of decimal places
# mat: provide a matrix output rather than a list 
```

***
# Excel Commands for some descriptive Statistics - method 1
  * all commands start with an equal sign
  * once you select the appropriate command (after the equal sign), in parenthesis you will need to inbput the cells on which you want to execute that command e.g. 
    + =min(B2:B10) - this gives the minimum for numbers in cells B2 to B10
    + =max(B2:B10) - gives the max for numbers in cells B2 to B10
    + =Average(B2:B10) - gives the average
    + =Median(B2:B10) - gives the median
    + =var(B2:B10) - gives the variance
    + =stdev(B2:B10) - gives the standard deviation
    + =B16/SQRT(B17) - gives the standard error of the mean WHEN B16 contains the standard deviation and B17 contains the N (count)
    + $ sign in front of the letter (column) or number (row) of the formula fixes the row or column 

***
# Excel Commands for some descriptive Statistics - method 2
  * To give a more detailed output, click Data Analysis in the Analysis group on the Data tab. 
  * If the Data Analysis command is not available, you need to load the Analysis ToolPak add-in program.(see:[Office Support](https://support.office.com/en-ie/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3-9317-6b40ba1a66b4){target="_blank"})
  * Select "Descriptive Statistics"
  * Input the appropriate variable ranges under "Input"
  * If your variables have labels (which they should) check "Labels in First Row"
  * Click on Output Range and in the associated box indicate the cell you want the output to be generated in (make sure this cell is away from other important data) OR simply select "New Worksheet Ply"
  * Sample output is shown below:


Statistic         | value
----------        |------
Mean              | 12
Standard Error	  | 0.447213595
Median	          | 12
Mode	            | 13
Standard Deviation| 1
Sample Variance	  | 1
Kurtosis	        | -3
Skewness          | 0
Range	            | 2
Minimum           | 11
Maximum           | 13
Sum	              | 60
Count	            | 5

# **Proportions test**
<div class="alert alert-info">
**Proportions test**: Use a proportions test to compare/determine if the frequency/probability of an event is different between the various groups being tested (e.g. asking if the proportion of males reporting depression is different from females reporting depression in a sample)
</div>

<div class="alert alert-success">
## Comparing two proportions - Method 1
```{r}
prop.test(x = c(c1, c2), n = c(T1, T2), alternative = "two.sided")
```
  * c1 and c2 reflect the number of subjects in each of the categories (e.g., number of women responding Yes and the number of men responding Y)
  * T1 and T2 reflect the number of TOTAL subjects in each category/group (e.g., the total number of women in the sample, the total number of men in the sample)
  
## Comparing two proportions - Method 2

```{r}
object = c(v1,v2) 
```
  * number of subjects in each of the categories (e.g. number of Catholics in example)
```{r}
Tobject = c(n1,n2) 
```
  * number of TOTAL subjects in each category/group (e.g. Total number of representative in each party)
```{r}
prop.test(object, Tobject) 
```
  * runs the proportions test

</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
## Proportions Test Output in R
```{}
2-sample test for equality of proportions with
	continuity correction

X-squared = 5.7318, df = 1, p-value = 0.01666     # X-squared (chi-squared) = the
                                                  statistic; df = degrees of freedom (n-1)
                                                  - we have 2 groups; p-value = tells us if
                                                  it is significant; IT IS SIGNIFICANT if p
                                                  < 0.05
alternative hypothesis: two.sided
95 percent confidence interval:
  0.01821926 0.18098709                          # If the confidence interval range has the same SIGN - there should be a significant difference
sample estimates:
  prop 1    prop 2 
0.3535714 0.2539683                               # these are the proportions of each of the tested groups
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## Writing a report from a proportions test output
A two-sample test for equality of proportion was conducted to assess whether the proportion of Catholics in the Republican and Democratic party were significantly different. The analysis revealed a significantly higher proportion of Catholics in the Democratic Party(35.4%) relative to the Republican Party(25.4%), X^2^(1,N=532)= 5.73,p<0.05.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
## Comparing more than two proportions - Method 1
```{r}
prop.test(x = c(c1, c2, c3...), n = c(T1, T2, T3...), alternative = "two.sided")
pairwise.prop.test(x = c(c1, c2, c3...), n = c(T1, T2, T3...), alternative = "two.sided")
```
  * c1, c2, c3... reflect the number of subjects in each of the categories (e.g., number of women responding Yes and the number of men responding Y)
  * T1, T2, T3... reflect the number of TOTAL subjects in each category/group (e.g., the total number of women in the sample, the total number of men in the sample)
  * prop.test runs the proportions test - provides the first part of the output.
  * pairwise.prop.test runs the pairwise test to indicate which proportions are different from each other - this provides the second part of the output
  
## Comparing more than two proportions - Method 2
```{r}
object = c(a,b,c,d,e,...) 
```
  * number of subjects in each of the categories (e.g. the number of doctors of a particular faith that said they would NOT perform an abortion; each letter would be replaced by the number of doctors of each specific specific faith that responded "NO")
```{r}
Tobject = c(Ta,Tb,Tc,Td,Te,...) 
```
  * number of Total subjects in each category/group (e.g. the TOTAL number of doctors of a particular faith i.e those that responded YES (they would perform an abortion) + those that responded NO; **Replace Ta** etc. with numbers representing the individual totals)
```{r}
prop.test(object,Tobject) 
```
  * runs proportions test - provides the first part of the output.
```{r}
pairwise.prop.test(object,Tobject) 
```
  * runs pairwise test to indicate which proportions are different from each other - this provides the second part of the output.

</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
## Proportions Test Output in R (more than two proportions)
```{}
# This is the first part of the output that indicates whether there is a difference in the proportions. If there is a difference, however, it will not indicate between which proportions the difference exists.

4-sample test for equality of proportions without
	continuity correction

data:  object out of Tobject
X-squared = 35.169, df = 3, p-value = 1.122e-07
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.2857143 0.5000000 0.2407407 0.7500000 

# This is the second part of the output the indicates WHERE the differences are. The numbers in the top row and in the first column reflect the 4 proportions being tested comparing e.g., 1 vs 2 (not significant); 1 vs 4 (significant at p<0.001) etc., etc.

Pairwise comparisons using Pairwise comparison of proportions 

data:  object out of Tobject 

  1       2       3      
2 0.32422 -       -      
3 0.82154 0.18128 -      
4 0.00013 0.18128 9.5e-07

P value adjustment method: holm 
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## Writing a report from a proportions test output (more than two proportions)
A 4-sample test for equality of proportions was conducted to assess whether the proportion of the 4 samples provided were significantly different. The analysis revealed an overall significant difference X^2^(3,N= **insert number of participants**) = 35.17,p<0.001. Pairwise comparisons indicated a significant difference between samples 1 and 4 and samples 3 and 4 (both p<0.001). All other comparisons were not significant (p>0.05).
</div>
<a href="#top">Back to top</a>

# Proportions Test for situations where there are very small numbers (5 or less)

In situations where there are 5 or less values, the proportions test will display an error. In that case, the Fisher test is the more appropriate test to use.

## Comparing two proportions - in cases of small numbers
```{r}
fisher.test(matrix(c(n1, T1-n1, n2, T2-n2), ncol=2)) 
```
  * n1 represents the first number to be compared, T1 is the total number of subjects in pertaining to variable n1 etc.

## Comparing more than two proportions - in cases of small numbers
  * requires *RVAideMemoire* package
```{r}
fisher.multcomp(matrix(c(n1, T1-n1, n2, T2-n2,.....), ncol=N)) 
```
  * n1 represents the first number to be compared, T1 is the total number of subjects in pertaining to variable n1 etc. 
  * ncol=N: replace ***N*** with the number of variables being compared.
  
# **Odds Ratio**
<div class="alert alert-info">
**Odds Ratio**: measures how likely an event is to occur when exposed to something; a measure of the association between exposure and outcome
</div>

<div class="alert alert-success"> 
# Create Matrix
```{r}
# data taken from Yang, C.-Y., Shih, Y.-H., & Lung, C.-C. (2024). The association between COVID-19 vaccine/infection and new-onset asthma in children - based on the global TriNetX database. Infection. https://doi.org/10.1007/s15010-024-02329-3 **Table S2 in Supplement 1**

# Create matrix
vax <- c('Vax', 'No Vax')
outcome <- c('Death', 'No death')
data <- matrix(c(354, 31734, 309, 159048), nrow=2, ncol=2, byrow=TRUE)
dimnames(data) <- list('Vaccinated'=vax, 'Outcome'=outcome)

#view matrix
data

```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
# Calculate Odds Ratio
```{r}
library(epitools)

#calculate odds ratio
oddsratio(data)

# From the output shown, we would interpret this to mean that the odds that child receiving the vaccine dies is ~5.74 times the odds of one who did not receive the vaccine.
```
</div>

<div class="alert alert-info">
## Writing a report from an odds ratio test output
There was a significant difference (p<0.001) in the odds of death associated with receiving the COVID vaccines in children aged 5-18 years of age relative to those who did not (OR 5.74, 95% CI [4.93, 6.69]).
</div>
<a href="#top">Back to top</a>


***
# Some things you ***may*** have to do **prior to generating a graph**
<div class="alert alert-warning">

## Function and command for generating Standard Error of the Mean (SEM)
  * This function needs to be copied and run **WITHOUT MODIFICATION** before generating the summary statistics using the summarySE command.
  * This code is taken from: [Summarizing Data](http://www.cookbook-r.com/Manipulating_data/Summarizing_data/)

```{r}
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summariezed
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95, .drop=TRUE) {
    library(plyr)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # This does the summary. For each group's data frame, return a vector with
    # N, mean, and sd
    datac <- ddply(data, groupvars, .drop=.drop,
      .fun = function(xx, col) {
        c(N    = length2(xx[[col]], na.rm=na.rm),
          mean = mean   (xx[[col]], na.rm=na.rm),
          sd   = sd     (xx[[col]], na.rm=na.rm)
        )
      },
      measurevar
    )

    # Rename the "mean" column    
    datac <- rename(datac, c("mean" = measurevar))

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}
```
</div>

<div class="alert alert-warning">
```{r}
sumstats <- summarySE(objectL, measurevar="outcomevar", groupvars=c("groupvar_1","groupvar_2","groupvar_n"), na.rm=TRUE)
```
  * sumstats: name of object that will have the summary stats. The name of this should be changed if running for more than one data set.
  * summarySE: command that runs the summary statistics
  * objectL: name of object containing the long version of the data to be analyzed/summarized
  * "outcomevar": name of the variable to be analyzed/summarized
  * "groupvar_1" etc: name of the columns that contain the groups for the variable to be analyzed e.g. Sex, Class, Age...etc.
  * na.rm = TRUE: removes the empty cells (cells that have NA); Not required if all data points are present.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Ordering levels in a specific variable
  * When plotting a graph, R will order the variables in alphabetical order. Very often this is not the best way of plotting the graph. In such cases, R needs to be instructed on the desired order of variables.
  * An example is the order of classes. For this example, the name of the variable in the data (i.e. the column name) is **Class** R will order them: Freshman, Graduate, Junior, Senior, Sophomore
  * This command will create an object that has the reordered levels.
  * In such a case, the name of the new object created would be the name of the predictor that would need to be used in Part I of plotting the graph.
  * For this example ***Class*** is the original name; ***Class_r*** is the rearranged levels
  
```{r}
Class_r = factor (object$Class, levels=c("Freshman", "Sophomore", "Junior", "Senior", "Graduate"))
```
  * notice how when you are indicating the name of the factor that is to be reordered you need to give R the name of the object containing the data (in this case "object" and the name of the column to be reordered separated by a $ symbol)
  * to adapt to any situation - "Class_r" should be renamed appropriately; "object$Class" should be replaced with the appropriate object name and column name to be reorganized; all the levels within the quotes should be replaced accordingly.  
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Changing variable name in a specific column
  * If the variables in a particular column need to be changed e.g. from numbers to words do the following:
  * install the package ***stringr*** using the following command:

```{r}
install.packages("stringr", dependencies = T)
```

  * ensure that you load the package ***stringr*** using the following command:
  
```{r}
library(stringr)
```

  * Next, run the following command in order to preserve the original file:
  
```{r}
object2=object
```
  
  * Next, run the following command on the new modified object (i.e. object2) (**NOTE**: this will change the values in your columns permanently.)
  
```{r}
object2$columnname = str_replace_all(object2$columnname,"oldvalue","newvalue")
```

  * You will need to run this command for **every** value that needs to be replaced
  * When plotting the graph you will need to remember to use the **NEW OBJECT** i.e. ***object2*** as the source of the data
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
# Changing the column name/header in the object/dataframe containing the data
  * When you generate the long version of a file, the predictor column will have the name ***variable***, while the outcome column will be labeled ***value***. 
  * To change the name of a specific column use the following commands
```{r}
names(objectL)[n] <- "correctname"
```
  * **remember that this will be the name that you will need to use in generating the graph, ** ***whenever you are addressing the predictor/outcome***
  * replace the ***n*** with the number of the column whose header you wish to change (e.g. if the data has an id column and you want to change the variable column, this will be column 2)
  * replace ***correctname*** with the name you want to give the column/header

### This is an alternative command to the one above

  * To rename a single column by name
```{r}
colnames(objectL)[colnames(objectL) == "old_col2"] <- "new_col2"
```
</div>
<a href="#top">Back to top</a>


# **Creating a pie chart, bar graph, line graph or box plot**

<div class="alert alert-warning">
# Generating a pie chart
```{r}
object<- c(n1, n2)
labels<- c("Poor (58%)","Normal (42%)")
pie(x=object,labels = labels, radius = 1, main = "Title (n=N)", col = c("#69b3a2", "grey"))
```
  * object: name of object containing number OR percentage of data to be plotted
  * labels: name of object containing the names of the labels of the parts/sections of the pie chart
  * pie: command that plots the pie chart
  * x: indicates what is to be plotted (contained in the dataframe "object" [replace accordingly])
  * labels: indicates the labels for the sections (contained in the dataframe "labels" [replace accordingly])
  * radius - specifies radius of circle
  * col - specifies colors to be used
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/PieChart.png){width=40%}
</center>
</div>
<a href="#top">Back to top</a>

# Generating a bar graph
<div class="alert alert-warning">

## Generating a bar graph - Part I 
```{r}
barobject=ggplot(objectL, aes(predictor,outcome))
```
  * barobject: name of the entity that will contain the plot (call it what you wish, just be clear)
  * ggplot: command that enables the plotting of the graph
  * objectL: is the name of the long version of your data
  * aes: tells R to generate the aesthetics
  * predictor: **REPLACE** with the name of the predictor column in the long version of the file
  * outcome: **REPLACE** with the name of the outcome column in the long version of the file
  * NOTE regarding predictor: If the levels have been rearraged in a new object, the word predictor should be replaced with the name of that specific object (in the example above, ***Class_r*** instead of ***Class***)

## Generating a bar graph - Part II 
```{r}
barobject+stat_summary(fun=mean, geom="bar",fill="white",color="Black")+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit)) + guides(fill = guide_legend(title = "name"))
```
  * *barobject*: name of the plot determined in the previous command
  * *stat_summary(fun.y=mean, geom="bar",fill="white",color="Black")*: plots the mean, as a bar graph, with a white fill and a black outline - **DOES NOT NEED TO BE CHANGED** unless you wish to change the colors
  * *stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)*: plots the error bars (measure of variability) so that they do not overlap - **YOU DO NOT NEED TO CHANGE ANYTHING**; 
    * **NOTE**:   *mean_se* can replace *mean_cl_normal* if you wish to plot the standard error of the mean.
  * *labs(title="**Title line1 \\n Title line2**", x="**predictor label**", y="**outcome label (unit)**")*:Labels the graph; title gives the name of the main heading of the graph; x labels the x-axis; y labels the y-axis; The "**\\n**"  places any words after it on the next line; **DOES NEED CHANGING** - you need to change everything that is in the quotation marks.
  * *theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))*: sets the vertical distance/alignment b/w the specific heading and the axis - **DOES NOT NEED CHANGING**
  * *theme(plot.title=element_text(hjust=0.5))*: centers the title horizontally 
  * *+coord_cartesian(ylim=c(lowerlimit, upperlimit))*: you do **NOT** always need to use this part of the command; It is only needed when the y-axis limits need to be adjusted for better viewing of the graph; If it is not being used, remove the whole command in between the plus signs leaving only one + sign. If used, replace the words lowerlimit and upperlimit with the appropriate numbers.
  * *guides(fill = guide_legend(title = "name"))*: where applicable, provides a new name for the legend (the default is the header of the column).

</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Graph-no color.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

# Generating a bar graph with **different colored bars**
<div class="alert alert-warning">
  * more information on this is available in the R Cookbook, under [Colors (ggplot2)](http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/){target="_blank"}
  
## Generating a bar graph - Part I 
```{r}
barobject=ggplot(objectL, aes(predictor,outcome,fill=predictor))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
```
  * replace barobject, objectL, predictor, outcome with appropriate labels
  * replace colorcode with appropriate color code obtained from the Hex field at the [rapidtables website](https://www.rapidtables.com/web/color/RGB_Color.html ){target="_blank"} - in the case of the color selected in the example shown below, the code would be #3399FF

<div class="alert alert-danger">
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Colorpalate.png){width=50%}
</center>
</div>
  * add as many colors as you have variables

## Generating a bar graph - Part II 
```{r}
barobject+stat_summary(fun=mean, geom="bar")+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
```
  * *stat_summary(fun.y=mean, geom="bar")*: Notice how the section ",fill="white",color="Black"" has been removed because colors are accounted for in the first command
  * Everything else is per the first example of "Generating a Graph"
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Graph-with color.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>


<div class="alert alert-warning">
## Creating a bar graph with **one independent variable** and using the standard error of the mean for error bars
  * You first need to have run the function and associated summarySE command prior to continuing forward (See [Function and command for generating Standard Error of the Mean (SEM)])
  * The example below uses the graph with colored bars information. However, this can be adapted and applied to all forms of graphs.
  
## Generating a bar graph - Part I
```{r}
barobject=ggplot(sumstats, aes(predictor,outcome,fill=predictor))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
```
  * **NOTE:** sumstats is used to provide the data to be plotted
  * Additionally, see [Generating a bar graph with **different colored bars**] for details of what needs to be changed

## Generating a bar graph - Part II 
```{r}
barobject+stat_summary(fun=mean, geom="bar")+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
```
  * *geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)*: This command now provides the information for the errorbars. However you need to replace the word "outcome" with the appropriate name for the outcome variable.
  * Everything else is per the first example of [Generating a bar graph]
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Graph-oneindvar.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating a bar graph with **multiple independent variables** and using the standard error of the mean for error bars
  * You first need to have run the function and associated summarySE command prior to continuing forward (See [Function and command for generating Standard Error of the Mean (SEM)])
  * The example below uses the graph with colored bars information. However, this can be adapted and applied to all forms of graphs.
  
## Generating a bar graph - Part I
```{r}
barobject=ggplot(sumstats, aes(predictor,outcome,fill=predictor_2))+scale_fill_manual(values=c("#colorcode", "#colorcode",....))
```
  * **NOTE:** sumstats is used to provide the data to be plotted
  * predictor: the main predictor that is going to be on the x-axis
  * predictor_2: the predictor that provides the sub-categories within the main predictor
  * Additionally, see [Generating a bar graph with **different colored bars**] for details of what needs to be changed

## Generating a bar graph - Part II 
```{r}
barobject+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title line1 \n Title line2", x="predictor label", y="outcome label (unit)")+theme(axis.title.x=element_text(vjust=-0.4))+theme(axis.title.y=element_text(vjust=0.3))+theme(title=element_text(vjust=+1.1))+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(ylim=c(lowerlimit, upperlimit))+guides(fill = guide_legend(title = "name"))
```
  * *geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)*: This command now provides the information for the errorbars. However you need to replace the word "outcome" with the appropriate name for the outcome variable.
  * Everything else is per the first example of [Generating a bar graph]
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Graph-twoindvars.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating bar graphs with **multiple independent variables**, plotted as three separate graphs (one per variable) in one grid
  * **Original data (contained in "objectL" contains the following columns "Subject", "var1", "var2", "var3", "data"**
  
## Generating a grid of bar graphs - Example 1
```{r}
# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
  
# Generate graph
bar_objectL=ggplot(objectL_mod, aes(var1,data,fill=var1))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_objectL.Sub=bar_objectL+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5))+guides(fill = guide_legend(title = "var1"))

  # bar_objectL.Sub: contains the command to generate the graph with the various sub-categories present var3
  # +guides(fill = guide_legend(title = "var1")): relabels the heading of the legend

# Create grid
bar_objectL.Sub+facet_grid(cols=vars(var3))
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/facetgrid graph.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">

## Generating a grid of bar graphs - Example 1a
  * **Original data (contained in "objectL" contains the following columns "Subject", "var1", "var2", "var3", "data"**
  * This graph is a modified version of that shown in Example 1 - See [Generating a grid of bar graphs - Example 1]
  * This graph uses the SEM instead of the Confidence interval to generate the error bars and also shows the significances
```{r}
# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname")) # do not add spaces before and after \n or the facet title will not be centered
```
* Generate Descriptive Statistics that provide standard error of the mean 
  * For generating descriptive statistics with standard error of the mean please see [Function and command for generating Standard Error of the Mean (SEM)]
  * sumstats_object: this is the name of the object that contains the output descriptive statistics 

```{r}  
# Generate graph
bar_object=ggplot(sumstats_object, aes(x=predictor,y=outcome))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_object.Sub=bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor))+geom_errorbar(aes(ymin=outcome-se, ymax=outcome+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + guides(fill = guide_legend(title = "predictor")) +coord_cartesian(ylim=c(lowerlimit,upperlimit))+ geom_signif(data=data.frame(predictor2=c("newvar3.1\nname","newvar3.2\nname","newvar3.3\nname")),aes(y_position=c(21.5, 24.5, -2), xmin=c(1.0, 1.0, 1.0), xmax=c(2, 2, 2),annotations = c("**","*","")), tip_length = 0, manual=T)

  # The statistic that was NOT significant has been placed beyond the y-scale limit. The coord_cartesian command ensures that the axes limits are as desired and that the non-significant bar does not show up.
  # outcome: under geom_errorbar - this refers to the heading of the outcome column in the descriptive statistics output.
  # geom_signif: numbers shown in parenthesis need to be adjusted accordingly to accommodate for height (in relation to the y_position) and/or position relative to bars (in relation to the xmin and xmax). Additionally, the annotation needs to be appropriately corrected to reflect the appropriate significances.

# Create grid
bar_BIS.Sub+facet_grid(~BIS_Cat, scales="fixed")

# theme(axis.text.x=element_blank()) - # Removes variable label names in the x-axis if they are not desired.
# theme(axis.ticks.x=element_blank()) - # removes ticks (may be used with above command)

# https://stackoverflow.com/questions/56219468/using-ggsignif-with-grouped-bar-graphs-and-facet-wrap-not-working
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/graph with signficance & facets.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Generating a grid of bar graphs - Example 2
  * **Original data (contained in "objectL" contains the following columns "Subject", "var1", "var2", "var3", "data"**
```{r}
# Make a modified copy of the original data
objectL_mod <- objectL %>%

# Rename variables within var3
  # only if necessary/desired
  # var3.1 etc. represent different variables within var3
  # /n can be used within the names if the new name is long and it is desired to split the title into two lines
mutate(var3 = recode(var3, "oldvar3.1\nname" = "newvar3.1\nname", "oldvar3.2\nname" = "newvar3.2\nname", "oldvar3.3\nname" = "newvar3.3\nname"))  # do not add spaces before and after \n or the facet title will not be centered
  
# Generate graph
bar_objectL=ggplot(objectL_mod, aes(var1,data,fill=var2))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_objectL.Sub=bar_objectL+stat_summary(fun=mean, geom="bar",position=position_dodge(width=0.9), linewidth=3)+stat_summary(fun.data=mean_cl_normal, geom="errorbar",position=position_dodge(width=0.9), width=0.2)+labs(title="title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5))+guides(fill = guide_legend(title = "var2"))

  # bar_objectL.Sub: contains the command to generate the graph with the various sub-categories present var3
  # +guides(fill = guide_legend(title = "var2")): relabels the heading of the legend

# Create grid
bar_objectL.Sub+facet_grid(cols=vars(var3))
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/facetgrid graph_1.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Generating a bar graphs with significance shown - Example 3

* Generate Descriptive Statistics that provide standard error of the mean 
  * For generating descriptive statistics with standard error of the mean please see [Function and command for generating Standard Error of the Mean (SEM)]
  * sumstats_object: this is the name of the object that contains the output descriptive statistics 

```{r}
# Plot Graph

predictor.r = factor (sumstats_object$predictor.r, levels=c("level1", "level2", "level_n"))
  # predictor.r : name of variable that has had levels reorganized for graphing purposes
  # predictor.r: predictor that requires levels to be organized for graphic purposes
  # level1, etc.: level names listed in the order that you wish them graphed

bar_object=ggplot(sumstats_object, aes(x= predictor.r,y=outcome))+ scale_fill_manual(values=c("#CC0000", "#004C99")) 

bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor))+geom_errorbar(aes(ymin=Outcome-se, ymax=Outcome+se),position=position_dodge2(0.9, padding=0.6))+labs(title="Title", x="x axis title", y="y axis title") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + guides(fill = guide_legend(title = "Legend Title")) + geom_signif(y_position=c(30.5, 16.5, 15.5), xmin=c(0.7, 4.7, 5.7), xmax=c(1.3, 5.3, 6.3),annotation = c("***","*","**"), tip_length = 0)
  
  # outcome: under geom_errorbar - this refers to the heading of the outcome column in the descriptive statistics output.
  # geom_signif: numbers shown in parenthesis need to be adjusted accordingly to accommodate for height (in relation to the y_position) and/or position relative to bars (in relation to the xmin and xmax). Additionally, the annotation needs to be appropriately corrected to reflect the appropriate significances.

# Resources that may be of assistance

# https://statisticsglobe.com/ggsignif-package-r
# https://stackoverflow.com/questions/51443889/how-can-i-get-geom-errorbar-to-dodge-correctly-on-a-bar-chart-in-ggplot2
# https://stackoverflow.com/questions/59008974/why-is-stat-identity-necessary-in-geom-bar-in-ggplot#59009108
# http://sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization
# https://stackoverflow.com/questions/61022992/manually-plotting-significance-relations-between-sub-groups-on-ggplot2-barplot
```

</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/graph with signficance.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>


<div class="alert alert-warning">
## Generating a bar graphs with significance shown and reordered facets - Example 4

* Generate Descriptive Statistics that provide standard error of the mean 
  * For generating descriptive statistics with standard error of the mean please see [Function and command for generating Standard Error of the Mean (SEM)]
  * sumstatsobject: this is the name of the object that contains the output descriptive statistics 

```{r}
# Plot Graph

# Generate graph

bar_object=ggplot(sumstatsobject.r,aes(x=predictor1,y=outcome))+scale_fill_manual(values=c("#CC0000", "#004C99"))

bar_object.Sub=bar_object+geom_bar(stat="identity",position=position_dodge(width=0.9), linewidth=3, aes(fill=predictor1)) + geom_errorbar(aes(ymin=Data-se, ymax=Data+se),position=position_dodge(width=0.9), width=0.2)+labs(title="Title Line1\nTitleLine2", x="x-label", y="y-label") + theme(axis.title.x=element_text(vjust=-0.4)) + theme(axis.title.y=element_text(vjust=0.3)) + theme(title=element_text(vjust=+1.1)) + theme(plot.title=element_text(hjust=0.5)) + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) + guides(fill = guide_legend(title = "Legend Title")) + coord_cartesian(ylim=c(0,15)) + geom_signif(data=data.frame(ExecFn=c("factor 1", "factor 3", "factor 5", "factor 11",  "factor 12")), aes(y_position = c(13.5,12.5,12.5,14,13), xmin=c(1,1,1,1,1), xmax=c(2,2,2,2,2), annotations = c("**","**","**","#","*")), tip_length = 0, manual=T)

# NOTE that only the factors that show significance are listed and the information in the aes also only relates to these.


# Create grid with reorganized factors
bar_object.Sub+facet_grid(~factor(MainPredictor, levels=c("factor 1" ,"factor 2","factor 3", "factor 4", "factor 5", "factor 6", "factor 7", "factor 8", "factor 9", "factor 10",  "factor 11", "factor 12")), scales="fixed")  

# MainPredictor - replace this with the name of the column that represents the x-label predictor not sub-predictors (An example of the MainPredictor is the name of the predictor that us made up of the 12 factors shown in the graph; an example of a sub-predictor is predictor1 which in this case is Yes or No)

```

</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/facetgrid graph_reordered factors.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating a line chart
  * variable y is being plotted across variable x for two groups
```{r}
lp<- ggplot(SCl, aes(x=xvarname, y=yvarname, group=groupname, color=groupname)) + # x 
  stat_summary(fun=mean, geom="line") + # plots the line
  stat_summary(fun=mean, geom="point", size=3)+ # adds points to the plot
  stat_summary(fun.data=mean_se, geom="errorbar",position=position_dodge(width=0.0), width=0.2)+ #using standard error of the mean for the error bar.
     scale_color_manual(values=c('darkblue','red'))+
  labs(title="Title", x="x-axis label", y="y-axis label")+
  theme(axis.title.x=element_text(vjust=-0.4))+
  theme(axis.title.y=element_text(vjust=0.3))+
  theme(title=element_text(vjust=+1.1))+
  theme(plot.title=element_text(hjust=0.5))

lp # displays the lineplot
```
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a graph generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/lineplot.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating a simple boxplot
```{}
boxplot(Outcome~Predictor,data=objectL, main="Titleline1 \n Titleline 2",xlab="x-axis label", ylab="y-axis label")

or 

# boxplot with data jittered and showing & two different colors

ggplot(objectL, aes(x=Predictor, y=Outcome, fill=Predictor)) + stat_boxplot(geom='errorbar', position=position_dodge(width=0.5),width=0.2) + geom_boxplot(width=0.4) + geom_jitter(position=position_jitter(0.15)) + scale_fill_manual(values=c("#E0E0E0", "#606060"))
```
</div>
<a href="#top">Back to top</a>

***
# **Correlation**

<div class="alert alert-info">
**Correlation**: a correlation tests whether there is a RELATIONSHIP between two observed variables within the same sample (e.g. asking if the number of cups of coffee consumed is related to amount of alertness in a sample)
</div>

<div class="alert alert-warning">
## Plotting a correlation with a regression line
```{r}
ggscatter(object, x = "xvariable", y = "yvariable",add = "reg.line", conf.int = TRUE,cor.coef = TRUE, cor.method = "pearson",xlab = "xlabel (units)", ylab = "ylabel(units)", title="title for the graph")+theme(plot.title=element_text(hjust=0.5))
```
  * ggscatter: plots the correlation
  * object: name of the wide version of data file
  * xvariable: replace with name of column of variable to be plotted in the x-axis
  * yvariable: replace with name of column of variable to be plotted in the y-axis
  * reg.line: adds the regression line
  * conf.int = TRUE: plots the confidence interval (shaded area)
  * cor.coef = TRUE: shows the correlation coefficient
  * cor.method = "pearson": calculates and shows the p value
  * xlab: replace words in quotes with appropriate wording for the x-label
  * ylab: replace words in quotes with appropriate wording for the y-label
  * title: replace words in quotes with a proper title
  * theme(plot.title=element_text(hjust=0.5)): centers the title
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
## Running a correlation
```{r}
cor.test(object$predictor, object$outcome)
```
  * gives the correlation coefficient
  * gives the p-value (which tells you of the correlation is significant)
  * Important notes to remember from the results
    + df = degrees of freedom (n-2 (for correlation))
    + p value: p value of less than 0.05 is significant
    + the value under "cor" = correlation coefficient
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a Correlation Output
```{}
	Pearson's product-moment correlation

data:  ingrd$Inc and ingrd$Grd
t = 2.968, df = 12, p-value = 0.01174 ## used for report
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1833572 0.8780885
sample estimates:
      cor 
0.6506389 ## used for report
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info"> 
### Writing a report from a correlation
A Pearson's correlation indicated that a student's grade was significantly related to the family's income, r(12)=-.65, p<.05.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Creating a Correlation Matrix

```{r}
corPlot(Object,cex = 0.8, stars=TRUE, upper=FALSE, main = "TitleLine1 \n TitleLine2",gr = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))) 
```
  * Object contains **only** the variables to be included in the correlation matrix
  * Replace name of Object with the appropriate name
  * Replace Title information with the appropriate information

### Example of a correlation plot/matrix generated with previous command
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Correlation Matrix.jpg){width=75%}
</center>
</div>
<a href="#top">Back to top</a>

***
# **Regression**

<div class="alert alert-info">
**Regression**: a regression DESCRIBES the RELATIONSHIP between a predictor (an independent variable) and an outcome (dependent) between two observed variables within the same sample. Regressions can take various forms - linear (e.g. calibration of pH meter), sigmoidal (e.g. in enzyme kinetics/receptor binding). Distinction between correlation & regression: While a **correlation** can tell you that there is a relationship between [H~3~O^+^] in a solution and the pH value, **regression** provides you with the line that best predicts that relationship.
</div>

<div class="alert alert-success">
## Types of Regression
  * **Logistic regression** is the appropriate to conduct when the dependent variable is **dichotomous** (binary)
  * **Stepwise regression** is appropriate when you have many variables and you're interested in identifying a useful subset of the predictors.

</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
## Simple Regression
```{r}
ObjectReg=lm(outcome~predictor, data=object)
summary(ObjectReg)
```
  * Generates the regression equation and it is stored in "ObjectReg" 
  * The ***lm*** regression command
    + uses WIDE form of the data 
    + Designate one variable as outcome and the other as predictor - Do so in such a way that makes sense. e.g. It is more likely that family income would predict a student's grade than a student's grade predicting family income
  * The ***summary*** command: gives the output of the regression equation, from which you extract all of the statistical information that goes into a regression report.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example Regression Output
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/RegressionOutput.jpg){width=100%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a Regression Equation derived from the above output
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/RegeqEx.jpg){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
### Writing a report from a regression output
The results indicated that Fear of Sin significantly predicted Depression, Beta = 0.222, t(95)=2.336, p < 0.05. Fear of Sin also explained a significant proportion of variance in Depression, R^2^ = .044, F(1, 95) = 5.455, p < 0.05.
</div>
<a href="#top">Back to top</a>

## Multiple Regression (backward elimination) (using AIC)
  * requires package: leaps

```{r}
fitObject = lm(y~x1+x2+x3...,data=na.omit(objectL))
```
  * lm: is the command that runs a linear model analysis
  * na.omit will omit those subjects with missing data points & therefore cannot be included in analysis.

```{r}
stepObject = stepAIC(fitObject, direction="backward", trace=FALSE) # direction can be "both" or "forward"
stepObject$anova # display results with AIC
summary(stepObject) # display results with t values
```

<div class="alert alert-success">
## Multiple Regression (backward elimination) (using F-value)
  * requires package: rms
  * from: http://rstudio-pubs-static.s3.amazonaws.com/2899_a9129debf6bd47d2a0501de9c0dc583d.html
  * output similar to Sigmaplot
```{r}
fullObject = ols(y~x1+x2+x3..., data = objectL)
fastbw(fullObject, rule = "p", sls = 0.1)
```
</div>

## Logistic Regression

  * used when outcome is dichotomous/binary
  * also see: https://stats.idre.ucla.edu/r/dae/logit-regression/ for additional information
```{r}  
mylogit = glm(Outcome ~ Predictor1 + Predictor2 + Predictor3..., data = ObjectL, family = "binomial")
summary(mylogit) # gives Beta and p values

exp(cbind(OR = coef(mylogit), confint(mylogit))) # gives Odds Ratio(OR)
```

***
# Cronbach's alpha
  * Surveys are generally divided into sections /groups of questions 
  * Prior to concluding that questions related to each other, we need to make sure that questions within a specific group of questions are consistent with each other (i.e. highly related)
  * to measure if a group of questions is  consistent or not we use cronbach's alpha
  * we will also be able to know if consistency improves or not if we delete a question from the group of questions (this is only shown in Method 2).

## Method 1
  # subjects with missing data should be removed before hand using ObjectNA=na.omit(object)
```{r}
matobject=data.matrix(object)
reliability(cov(matobject))
```

## Example output from Method 1
```{}
$alpha
     alpha 
 0.9403455 
 
 $st.alpha
 std.alpha 
 0.9417464 
 
 $rel.matrix
                 Alpha Std.Alpha r(item, total)
 D2Stop      0.9350532 0.9368506      0.7365112
 Cont2A      0.9335049 0.9354005      0.7837842
 AvsOthers   0.9354148 0.9366306      0.7304763
 DepSleep    0.9383129 0.9398743      0.6302478
 ThinkNOL    0.9353017 0.9364763      0.7397708
 LkFrwrd     0.9348757 0.9361827      0.7433303
 ThnkLessT   0.9424475 0.9436797      0.5152945
 TrdLessT    0.9362815 0.9380178      0.6999771
 RushWrk     0.9355317 0.9366959      0.7302428
 NegOblig    0.9352094 0.9363647      0.7406319
 AccDown     0.9334180 0.9354451      0.7853325
 Escape      0.9346224 0.9362977      0.7539146
 Frustration 0.9336228 0.9352146      0.7798224
 
 attr(,"class")
 [1] "reliability"
```

## Method 2
```{r}
alphaobject=alpha(object)

alphaobject # shows output
```

## Example output from Method 2
```{}
Reliability analysis   
Call: alpha(x = CIUS)

  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
      0.94      0.94    0.95      0.55  16 0.0051  1.8 0.97     0.55

 lower alpha upper     95% confidence boundaries
0.93 0.94 0.95 

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
D2Stop           0.94      0.94    0.95      0.55  15   0.0056 0.0139  0.56
Cont2A           0.93      0.94    0.95      0.55  14   0.0057 0.0137  0.55
AvsOthers        0.94      0.94    0.95      0.55  15   0.0055 0.0126  0.55
DepSleep         0.94      0.94    0.95      0.57  16   0.0053 0.0135  0.57
ThinkNOL         0.94      0.94    0.95      0.55  15   0.0055 0.0139  0.55
LkFrwrd          0.93      0.94    0.95      0.55  15   0.0056 0.0124  0.55
ThnkLessT        0.94      0.94    0.95      0.58  17   0.0049 0.0077  0.56
TrdLessT         0.94      0.94    0.95      0.56  15   0.0054 0.0136  0.56
RushWrk          0.94      0.94    0.95      0.55  15   0.0055 0.0125  0.55
NegOblig         0.94      0.94    0.95      0.55  15   0.0055 0.0129  0.55
AccDown          0.93      0.94    0.94      0.55  14   0.0057 0.0130  0.55
Escape           0.93      0.94    0.95      0.55  15   0.0056 0.0130  0.56
Frustration      0.93      0.94    0.95      0.55  14   0.0057 0.0132  0.55

 Item statistics 
              n raw.r std.r r.cor r.drop mean  sd
D2Stop      299  0.78  0.78  0.76   0.74 2.29 1.3
Cont2A      299  0.82  0.82  0.81   0.78 2.19 1.3
AvsOthers   299  0.77  0.78  0.77   0.73 1.20 1.2
DepSleep    299  0.69  0.69  0.65   0.63 1.25 1.3
ThinkNOL    299  0.78  0.79  0.77   0.74 1.64 1.1
LkFrwrd     299  0.79  0.80  0.78   0.74 1.43 1.2
ThnkLessT   299  0.59  0.58  0.53   0.52 2.84 1.4
TrdLessT    299  0.75  0.74  0.72   0.70 2.12 1.4
RushWrk     299  0.77  0.78  0.77   0.73 0.93 1.1
NegOblig    299  0.78  0.79  0.77   0.74 1.12 1.1
AccDown     299  0.83  0.82  0.82   0.79 2.27 1.4
Escape      299  0.80  0.79  0.79   0.75 2.06 1.5
Frustration 299  0.82  0.82  0.81   0.78 1.41 1.3

Non missing response frequency for each item
               0    1    2    3    4 miss
D2Stop      0.14 0.17 0.18 0.30 0.21    0
Cont2A      0.15 0.15 0.21 0.31 0.17    0
AvsOthers   0.37 0.24 0.23 0.13 0.03    0
DepSleep    0.40 0.19 0.22 0.14 0.05    0
ThinkNOL    0.16 0.30 0.31 0.17 0.05    0
LkFrwrd     0.30 0.25 0.22 0.18 0.05    0
ThnkLessT   0.12 0.07 0.10 0.26 0.45    0
TrdLessT    0.17 0.17 0.22 0.25 0.19    0
RushWrk     0.49 0.20 0.20 0.07 0.03    0
NegOblig    0.41 0.20 0.25 0.12 0.01    0
AccDown     0.17 0.11 0.19 0.31 0.21    0
Escape      0.24 0.13 0.15 0.28 0.19    0
Frustration 0.37 0.17 0.23 0.16 0.08    0
```

For additional information on interpretation please see: [**Reliability Analysis**](https://rpubs.com/hauselin/reliabilityanalysis)

***
<div class="alert alert-success">
# Testing for Homogeneity of variance
  * The Levene Test tests whether the VARIANCES are significantly different from each other
```{r}
leveneTest(objectL$outcome, objectL$predictor, center = median)
```
  * center: is the name of a function to compute the center of each group. If you use "mean" instead of median - this gives the original Levene's test; The default is "median" which provides a more robust test.
  * if variances are significantly different (p<0.05), in the t-test include the command: var.equal = FALSE --> calculates a more robust t-test;
  * if variances are NOT significantly different (p>0.05), in the t-test include the command: var.equal = TRUE --> calculates regular t-test
  * p value is the number found under the Pr(>F); if p>0.05 variances are homogenous (the statistically same)
  * Levene test is **NOT** necessary in REPEATED Measures tests; In such cases, in a ***t-test*** you should use **var.equal = TRUE**

</div>
<a href="#top">Back to top</a>

# **Conducting a t-test in R**
  
<div class="alert alert-info">
**t-test**: a t-test tests whether the MEANS/AVERAGES of **TWO** samples are significantly different from each other  (e.g. asking if the average depression scores of males is different from that of females in a sample)
</div>

<div class="alert alert-success">
## Method 1 - for **LONG** version of data

**PLEASE NOTE** this command is currently (10/28/24) not working properly and giving the following error "*cannot use 'paired' in formula method*". This page will be updated once a solution is reported. The issue appears to be in the "paired=..." part of the command.
    
```{r}
t.test(outcome~predictor, data=objectL, paired=FALSE/TRUE, var.equal=FALSE/TRUE)
```

<h3 style ="text-align:center;">**OR**</h3>

</center>

## Method 2 - for **WIDE** version of data
```{r}
t.test(object$var1,object$var2, paired=FALSE/TRUE, var.equal=FALSE/TRUE)
```

  * For **METHOD 1**:
    * *t.test*: command to calculate a t-test
    * *outcome*: name of the outcome column in the long version of the data file
    * *predictor*: name of the predictor column in the long version of the data file
    * *~* : predicted by
    * *data=objectL*: specifies the name of the long version of the data file. REPLACE ONLY the word objectL with the appropriate name of the long version of the file
  * For **METHOD 2**: 
    * *object, var1 and var2* - "object" needs to be replaced with appropriate name of object containing WIDE version of the data; "var1" and "var2" need to be replaced with the appropriate names of the columns in the WIDE version of the data. 
  * For **BOTH**:
    * *paired = TRUE*: **USED for REPEATED MEASURES**, also known as paired
    * *paired = FALSE*: **USED for INDEPENDENT MEASURES**
    * *var.equal = TRUE*: used when the Levene Test p > 0.05 (remember p = the number under Pr(>F)); **NOTE:** Also use this in the case of a **repeated measures t-test** when a Levene test is **not** required.
    * *var.equal = FALSE*: used when the Levene Test p < 0.05
    * If var.equal = FALSE: the degrees of freedom (df) will NOT be a whole number (because the specific t-test used uses a more conservative way of calculating the df.)
    * reporting: t(df)=tvalue,pvalue, r^2=
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-success">
# Calculating the effect size (r squared)
```{r}
tvalue^2/(tvalue^2+df)
```
  * all information for the above equation is obtained from the output of the t-test
  * for the tvalue ignore negative signs
  * reporting: t(df)=tvalue, p..., r^2=.
</div>
<a href="#top">Back to top</a>


# Calculating the effect size (Cohen's d) (alternative option)
```{r}
# requires package effsize

cohen.d(data= object, group1, group2)
```
  * If the groups are not of equal size, add the na.omit command to group with the least numbers e.g. if group2 has less, use "na.omit(group2)" instead of simply "group2"
  * object - replace with the name of the object containing the **WIDE** form of the data
  * group1 or group2 - replace with the names of the columns of each of the groups.
  
<a href="#top">Back to top</a>

***
<div class="alert alert-danger">
# t-test output in R
  * Paired t-test / Two Sample t-test # this title of the output indicates whether the test you just ran is paired/repeated OR independent measures/Two Sample t-test.
  * The line with the t value, df and p value is the most important line; t: gives you the t value; df gives you the degrees of freedom and p-values gives you the p value. ALL OF THESE NEED to be reported; the report will include this data in the following format t(df)=tvalue, p

  * output when the sex differences in vitamin D data is analyzed **CORRECTLY** using an independent measures test:
```{}
t = -2.2258, df = 16, p-value = 0.04075 
```
  
  * output when the sex differences in vitamin D data is analyzed **WRONGLY** using a repeated measures test:
```{}
t = -2.5964, df = 8, p-value = 0.03179
```


  * If these numbers have the SAME sign it means that it is likely that there is a significant difference:
```{}
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -18.252038  -1.081295
```

  * In a repeated measures test, R only gives you the output for MEAN OF THE **DIFFERENCES**:
```{}
sample estimates:
  mean of the differences 
-9.666667 
```

  * In an independent measures test, R gives you the output for the mean for **EACH GROUP**:
```{}
mean in group Boys mean in group Girls 
25.66667            35.33333
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a t-test output and an effect size (r^2^)calculation with the important data used in the report highlighted.

<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Independent t-test output.jpg){width=60%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## Writing a report for a t-test
An independent measures t-test was utilized to assess whether there is a significant difference in vitamin D levels between boys and girls. The analysis indicated a significant difference in vitamin D levels, t(16)=-2.226, p<0.05; rsquared=   , specifically girls (M= ; SEM= ) reported a significantly higher level of vitamin D then boys (M= ; SEM= ).

**OR**

An independent measures t-test was utilized to assess whether there is a significant difference in vitamin D levels between boys(M= ; SEM= ) and girls(M= ; SEM= ). The analysis indicated a significant difference in vitamin D levels, t(16)=-2.226, p<0.05; rsquared=   ,specifically girls reported a significantly higher level of vitamin D then boys.
</div>
<a href="#top">Back to top</a>

***
# Conducting a t-test in excel (method 1)
  * =ttest(array1,array2, tails, type)
    + Array 1: first column containing the data
    + Array 2: second column containing data
    + tails: 1 or 2 tailed
    + type: 1= Paired; 2= Two-sample equal variance; 3= two-sample unequal variance.

***
# Conducting a t-test in excel (method 2)
  * To give a more detailed output, click Data Analysis in the Analysis group on the Data tab. 
  * If the Data Analysis command is not available, you need to load the Analysis ToolPak add-in program.(see:[Office Support](https://support.office.com/en-ie/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3-9317-6b40ba1a66b4){target="_blank"})
  * Select the type of t-test you need to run
  * Input the appropriate variable ranges under "Input"
  * If your variables have labels (which they should) check "Labels"
  * Click on Output Range and in the associated box indicate the cell you want the output to be generated in (make sure this cell is away from other important data)
  * Sample output is shown below:

Statistic                   | Set 6	     |Set 7
---------                   |------------|-----------
Mean	                      |11.75	     |11.5
Variance	                  |46.91666667 |33.66666667
Observations	              |4	         |4
Pearson Correlation	        |0.910007244	
Hypothesized Mean Difference|0	
df	                        |3	
t Stat	                    |0.174077656	
P(T<=t) one-tail	          |0.436444286	
t Critical one-tail	        |2.353363435	
P(T<=t) two-tail	          |0.872888572	
t Critical two-tail	        |3.182446305	

***
<div class="alert alert-success">
# **One-way ANOVA - INDEPENDENT MEASURES**

<div class="alert alert-info">
**ANOVA**: an ANOVA tests whether the MEANS/AVERAGES of **MORE THAN TWO** samples are significantly different from each other  (e.g. asking if there is a significant difference in the average depression scores between Freshman, Sophomores, Juniors and Seniors). ANOVAs can be complicated given that comparisons can be carried out within groups and between groups.
</div>

  * Run Levene Test: For the DF group = k-1 (k=number of groups/treatments) = DF for the Between Treatment
  * The other degrees of Freedom is DF for the WITHIN treatment = N-k (N=sum total of subjects; k=number of treatments)
```{r}
aovobject=aov(outcome~predictors, data=objectL)
```
  * this calculates the ANOVA
  * Use the long version of the file
  * aovobject: name of the object that will contain the ANOVA calculation

```{r}
summary(aovobject)
```
  * this gives the output for the ANOVA (**See example output below**)
  * name of predictor column: between Treatment (b/T)
  * Residuals: within Treatment (w/T)
  * DF: degrees of freedom (needed for reporting)
  * Sum Sq: Sum of Squares (Note: needed for the calculation of the effect size)
  * Mean Sq: Mean Square = variances
  * F value: test statistic (needed for reporting)
  * Pr(>F): p value (needed for reporting)
  * your statistics in the report should be in form: F(DF(b/T),DF(w/T))= F value, p value, effect size)
  * EFFECT SIZE: Sum Sq is used for effect size: Sum Sq(b/T)/(Sum Sq(total))
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of an Independent Measures Output 
```{}
            Df Sum Sq Mean Sq F value   Pr(>F)    
variable     2  28.22  14.111   11.91 0.000256 ***  
Residuals   24  28.44   1.185                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-success">
# Post-hoc analysis using Tukey Test
  * **A post-hoc analysis is run whenever the first part of the ANOVA shows a p<0.05**
```{r}
phobject=glht(aovobject, linfct=mcp(predictor="Tukey"))
```
  * phobject: name of the object that will contain the post-hoc Tukey test
  * aovobject: the name of the object that contains the ANOVA (see previous ANOVA command)
  * predictor: needs to be replaced with name of the PREDICTOR column in your data file.

```{r}
summary(phobject)
```
  * gives the summary of the post-hoc analysis (**see example output below**)
  * Pr(>|t|): is the p value for the specific groups being compared.
</div>
<a href="#top">Back to top</a>

<div class="alert alert-success">
  * If the variances are **not** equal the Games-Howell test is conducted (c.f. Welch test in the t-test)
  * requires *statix* package
```{r}
games_howell_test(data = objectL, formula = outcome ~ grouping_variable)
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a post-hoc output for Independent Measures ANOVA
```{}
	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = value ~ variable, data = migraineL)

Linear Hypotheses:
                   Estimate Std. Error t value Pr(>|t|)    
DrugB - DrugA == 0   2.1111     0.5132   4.114 0.001063 ** 
DrugC - DrugA == 0   2.2222     0.5132   4.330 0.000605 ***
DrugC - DrugB == 0   0.1111     0.5132   0.217 0.974514    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
```
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Equation for Eta Squared (effect size) for Independent Measures ANOVA (information taken from the first step of the ANOVA). 

<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Eta Squared.jpg){width=60%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
### Writing a report from an Independent Measures ANOVA output
Pain levels were reported on a scale of 1 to 10 (1=low, 10=high) following the administration of various pain relievers by individuals who suffer from migraines. A One Way ANOVA indicated a significant difference (F(2, 24) = 11.91; p<0.001; n^2^= 0.498) in reported pain levels across the drugs administered. Post-hoc analysis using Tukey test indicated that individuals reported significantly lower pain levels following the administration of Drug A (M=3.67, SEM=0.29) relative to both Drug B (M=5.78; SEM=0.49; p<0.01) and Drug C (M=5.89; SEM=0.26; p<0.001). There was no significant difference between the reported pain levels following the administration of Drugs B and C (p>0.05).
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-success">
# **One-Way REPEATED MEASURES (RM) ANOVA**
```{r}
aovobject=aov(outcome~predictor+Error(participant/predictor), data=objectL)
```
  * participant: replace with the label of the SUBJECT column in the long version of the file
  * predictor: replace with the name of the PREDICTOR column in the long verstion of the file.

```{r}
summary(aovobject)
```
  * predictor: between Treatment value (b/T)
  * Residuals: error
  * Df: degrees of freedom (variable/predictor = k-1 where k=number of groups. Residuals=(df(w/T)-df(b/S))); w/T = Within Treatment = N-k, where N=Total number of "samples"/data points and k=number of groups; b/S = Between subjects = n-1, where n = number within a group; 
  * Sum Sq: sum of squares (needed for the calculation of the effect size)
  * Mean Sq: variances
  * F value: test statistic
  * Pr(>F): p value
  * your report of these statistics should be in the form of F(df(b/T),df(error))= Fvalue, pvalue, eta squared (effect size)
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a Repeated Measures Output
<center>
An Example of a Repeated measures ANOVA output with important numbers used in report highlighted
![](C:/Users/ssamm/Desktop/Sammut/Research/R Markdown/RptdMeasuresANOVA-First part.JPG){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

***
<div class="alert alert-success">
# Post-hoc Analysis using Pairwise comparisons
  * **A post-hoc analysis is run whenever the first part of the ANOVA shows a p<0.05**
  * **The tukey test does NOT work for this type of ANOVA command.** A Pairwise comparison is used instead.
```{r}
with(objectL, pairwise.t.test(outcome,predictor, p.adjust.method="bonferroni", paired=T))
```
  * pairwise comparisons: calculate the significant differences between the pairs of groups
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Example of a post-hoc output for Repeated Measures ANOVA
<center>
An Example of a post-hoc analysis output for a Repeated measures ANOVA with important numbers used in report highlighted. These values are the **p-values** comparing the variable in the vertical column on the left to the horizontal row above the numbers. 

![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/RMANOVAposthoc.JPG){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-danger">
### Equation for Eta Squared (effect size) for Repeated Measures ANOVA. The information is found under the section **Error: Within** in the first step of the ANOVA

<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Eta squared.jpg){width=60%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
### Writing a report from a Repeated Measures ANOVA output
Repeated measures ANOVA revealed a significant overall effect across time (F(2,24) = 20.65, p< 0.001; n^2^=0.63), following the administration of drug. Post-hoc Bonferroni analysis indicated that anxiety scores were significantly lower than pre-drug (M=8.10, SEM=0.23) administration on both week 1 (M=6.80, SEM = 0.25; p<0.05) and week 2 (M=3.00, SEM = 0.26; p<0.001). Moreover the anxiety scores on week 2 were also significantly lower than those of week 1 (p<0.001).
</div>
<a href="#top">Back to top</a>

* Additional information in relation to ANOVA and some code for more challenging ANOVAs can be found [here](https://www.uvm.edu/~dhowell/StatPages/R/RepeatedMeasuresAnovaR.html){target="_blank"}

<div>
***
# How to use R for multiple-select questions
  * Replace OBJECT with name of the object containing the imported data (long version)
  * First column needs to be the subject number
  * [This should be done in Excel] All values in the rows need to be replaced with a 1 (implies there is a value or count) or 0 (blank)
  * Replace "variable1" with the name of the column heading of that specific variable (e.g. if counting number of people in household and/or Intramural sports, variable1 etc would be replaced with the names of these columns)
  * This code is modified from https://stackoverflow.com/questions/11622660/how-to-use-r-for-multiple-select-questions

```{r}
set.seed(1)
dat <- data.frame(OBJECT,"variable1","variable2", "variable3", "..n")
                  
multi.freq.table = function(OBJECT, sep="", dropzero=FALSE, clean=TRUE) {
  # Takes boolean multiple-response data and tabulates it according to the possible combinations of each variable.
  
  counts = data.frame(table(OBJECT))
  N = ncol(counts)
  counts$Combn = apply(counts[-N] == 1, 1, 
                       function(x) paste(names(counts[-N])[x],
                                         collapse=sep))
  if (isTRUE(dropzero)) {
    counts = counts[counts$Freq != 0, ]
  } else if (!isTRUE(dropzero)) {
    counts = counts
  }
  if (isTRUE(clean)) {
    counts = data.frame(Combn = counts$Combn, Freq = counts$Freq)
  } 
  counts
}

multi.freq.table(dat[-1], sep="-")
```
***
# How to count number of occurences in a column 

  * the counts are sorted largest to smallest
```{r}
count(OBJECT, vars = Columnname, sort=TRUE)
```

or

  * the output is sorted from lowest to highest
```{r}
sort(table(OBJECT$Columnname))
```

***
<div class="alert alert-info">
# Plotting map of US with data
## METHOD 1: Plot map of US reflecting various levels - Only plots **lower 48**
* the information below is modified from the following website: https://remiller1450.github.io/s230s19/Intro_maps.html
  
### Load packages
```{r}
library(ggplot2);library(maps);library(dplyr) 
```

### Load the state map information
```{r}
MainStates <- map_data("state")
```

### Copy data
* StateNumber: object containing data for number of students from every state
* this object contains the data with the names of the states and numbers of students from every state
* states that have no representation i.e. 0 students should be blank
* data needs to have a column named "region" (***lower case***)

```{r}
StateNumber = read.table(file="clipboard", sep="\t", header=TRUE, as.is=TRUE)
```

### Plots all states with ggplot2, using black borders and light blue fill
```{r}
ggplot()+geom_polygon(data=MainStates,aes(x=long, y=lat, group=group),color="black", fill="lightblue")
```

### Merges Mainstates Map with StateNumber 
```{r}
MergedStates <- inner_join(MainStates, StateNumber, by = "region")
```

```{r}
p <- ggplot()
p <- p + geom_polygon( data=MergedStates, aes(x=long, y=lat, group=group, fill = n), color="white", size = 0.2) 
p
```

## Changing color scheme
```{r}
p <- p + scale_fill_continuous(name="Number of Students",low = "lightblue", high = "darkblue",limits = c(0,140),breaks=c(20,40,60,80,100,120,140), na.value = "grey50")+labs(title="title of graph")
p
```
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Map.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## METHOD 2: Plot map of US reflecting various levels - Plots **all 50 states**
* the information below is modified from the following website: https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html
  
### Load packages
```{r}
library(usmap) 
```

### Copy data
* StateNumber: object containing data for number of students from every state
* this object contains the data with the names of the states and numbers of students from every state
* states that have no representation i.e. 0 students should be blank
* data needs to have a column named "region"
* The excel data must contain the following columns. The data below can be copied and then pasted into excel using the "paste special/text only" option.
* The **fips** column stands for the Federal Information Processing Standard (FIPS) which provides a set of standard numeric codes for referring to U.S. states.
* Please note that, in the FIPS column, the 0 in front of the single digit numbers is not a requirement - this was just how I set it up.

fips|region|   n   |  
----|------|-------|
01	|Alabama
02	|Alaska
04	|Arizona
05	|Arkansas
06	|California
08	|Colorado
09	|Connecticut
10	|Delaware
12	|Florida
13	|Georgia
15	|Hawaii
16	|Idaho
17	|Illinois
18	|Indiana
19	|Iowa
20	|Kansas
21	|Kentucky
22	|Louisiana
23	|Maine
24	|Maryland
25	|Massachusetts
26	|Michigan
27	|Minnesota
28	|Mississippi
29	|Missouri
30	|Montana
31	|Nebraska
32	|Nevada
33	|New Hampshire
34	|New Jersey
35	|New Mexico
36	|New York
37	|North Carolina
38	|North Dakota
39	|Ohio
40	|Oklahoma
41	|Oregon
42	|Pennsylvania
44	|Rhode Island
45	|South Carolina
46	|South Dakota
47	|Tennessee
48	|Texas
49	|Utah
50	|Vermont
51	|Virginia
53	|Washington
54	|West Virginia
55	|Wisconsin
56	|Wyoming

```{r}
StateNumber = read.table(file="clipboard", sep="\t", header=TRUE, as.is=TRUE)
```

### Plots blank map of all states, using black borders and grey fill
```{r}
plot_usmap(regions = "states") + labs(title = "US States",subtitle = "This is a blank map of the state of the United States.") + theme(panel.background = element_rect(color = "black", fill = "grey"))
```

### Plots all states with a color scale reflecting numbers of students from the specific states  and places image in a black rectangle with a light color grey fill
```{r}
plot_usmap(data = StateNumber, values = "n", color = "white") + scale_fill_continuous(name="Number of Students",low = "lightblue", high = "darkblue",limits = c(0,140),breaks=c(20,40,60,80,100,120,140), na.value = "grey50")+labs(title="title of graph", subtitle = "Subtitle")+ theme(plot.title=element_text(hjust=0.5)) + theme(plot.subtitle=element_text(hjust=0.5))+ theme(legend.position = "right") + theme(panel.background = element_rect(color = "black", fill ="grey85"))
```
<center>
![](C:/Users\ssamm\Desktop\Sammut\Research\R Markdown/Map2.png){width=50%}
</center>
</div>
<a href="#top">Back to top</a>

<div class="alert alert-info">
## Plotting a 3D Scatter Plot
  * you may need to go into **Tools/Global Options/General/Advanced** and under **OS Integration (quit required)** change the *Rendering Engine* option to **Desktop OpenGL**



</center>
</div>
<a href="#top">Back to top</a>

# Additional Resources

<div class="alert alert-warning">  
## R Markdown

  * used to "turn your analyses into high quality documents, reports, presentations and dashboards." This page is prepared using R Markdown.

  * [R Markdown Help and Information](https://bookdown.org/yihui/rmarkdown/){target="_blank"}

  * [R Markdown cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf){target="_blank"}
  
  * [Colored Boxes in R Markdown](https://www.w3schools.com/bootstrap/bootstrap_alerts.asp){target="_blank"}

```{r}
# In R Markdown Document:
<div class="alert alert-x"> 
  # Replace "x" with "success" for green; "info" for blue; "warning" for tan; "danger" for red.
</div>
```

```{r}
# Creating a pdf of the rpubs website https://rpubs.com/ssammut/ResearchStats
<div class="alert alert-x"> 
  # requires webshot package
  library(webshot)
webshot(
    url     = "https://rpubs.com/ssammut/ResearchStats"
  , file    =  "Output1.pdf" # replace "Output1.pdf" with appropriate name
  , vwidth  = 992
  , vheight = 744
  )

getwd () # will show in which folder the file is stored.
</div>
```


</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Jamovi - an alternate statistical program

  * This is not the program used in the class but you can feel free to experiment with it.
  * Jamovi is a program based on R that has a more user-friendly interface
  * [Download Jamovi](https://www.jamovi.org/download.html){target="_blank"}
  * Advantage - can be used on all platforms Macs, PCs and Chromebooks
  * A user guide can be found [here](https://www.jamovi.org/user-manual.html){target="_blank"}
  * A tutorial is also available [here](https://youtu.be/mZomeS0tLxY){target="_blank"}
</div>
<a href="#top">Back to top</a>

<div class="alert alert-warning">
## Creating your own package

  * This section outlines the procedure for creating your own package of a currently existing function so that you do not have to continuously copy and paste the function when needing it (e.g. re: (summarySE)-[Function and command for generating Standard Error of the Mean (SEM)])
  
  **Resources:**
  
  * https://ourcodingclub.github.io/tutorials/writing-r-package/#
  * https://web.mit.edu/insong/www/pdf/rpackage_instructions.pdf

  **Requires:**
  
  * RTools: https://cran.rstudio.com/bin/windows/Rtools/ - download and install
  * install.packages("devtools")
  * install.packages("roxygen2")

  **Creating the package**

1. Create folder Name_R_package
2. In the folder create a folder called R
3. In the folder called R, create a text file called Name_function.txt that contains a copy the function code
4. Change the extension of the text file to .R (confirm change)
5. In the Name_R_package folder create a text file called DESCRIPTION
6. In the DESCRIPTION file add the following information: 

```
  Package: Name of package
  Type: Package
  Title: Title describing package
  Version: 0.0.1.0
```

7. Remove the .txt extension from the DESCRIPTION file (confirm change)
8. Open a new window in R so that you do not lose previous work
9. Go to File/New Project. 
	a. Click on New Directory
	b. Click on R Package
	c. Name the package
	d. Click "Add" 
	e. Select the R file created above
	f. Click "Browse" and change the folder to select the Name_R_package you created earlier.
	g. Click on "Create Project"
10. Click on "Build" in the menu and select "Build Source Package"

   **Using the package**

1. Go to Tools/Install Packages
2. Change "Install From" to "Package Archive File (.zip; .tar.gz)". This will open the window for you to select the file.
3. Go to the folder Name_R_package and select the .tar.gz file listed
4. Click "Install"
5. Run library using the name you gave the package in 9c.: library(packagename)
6. **NOTE**: Whenever you install a new version of R you will need to re-run the installation of the package just described.
  
</div>
<a href="#top">Back to top</a>

***
***