Introduction to RStudio

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

Introduction. R is a programming language for statistical computing. It can be used as a calculator and as statistical software for public health activities such as importing, exploring, cleaning and visualising data, calculating descriptive statistics (measures of central tendency and dispersion), carrying out inferential statistics (parametric and non-parametric tests), and data analytics. This session provides a basic introduction to these uses.

Knowing the Integrated Development Environment (IDE). RStudio has multiple panels, which you need to understand and keep an eye on while working. The upper left panel is for writing scripts (you can write scripts and save them for later use). The lower left panel is the console, where commands are executed. The upper right panel lists the objects in your environment (data frames, imported Excel sheets, etc.) and the history of your commands. The lower right panel shows the files in your working directory (the folder that files are read from and written to by default), plots, details of packages, and help.

R as a calculator. You can use R as a calculator for addition, subtraction, division, multiplication, squaring, raising to a power, etc. The usual order of operations (BODMAS) applies.

```

1+2

## [1] 3

2-1

## [1] 1

1/2

## [1] 0.5

2*4

## [1] 8

5^2

## [1] 25

12**2

## [1] 144

500%%60

## [1] 20

500%/%60

## [1] 8

sqrt(144)

## [1] 12

log10(100)

## [1] 2

log2 (100)

## [1] 6.643856

factorial (5)

## [1] 120
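
#an additional sketch (not part of the original session) of how the BODMAS order plays out: powers are evaluated before multiplication and division, which come before addition and subtraction.. brackets override this
2+3*4

## [1] 14

(2+3)*4

## [1] 20

2*3^2

## [1] 18

#the minus sign is applied after the power, so this is -(2^2)
-2^2

## [1] -4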

R is a vectorised language. This means that we can create vectors, and that most functions and operators applied to a vector are applied to every element of the vector.

Creating a vector in R. Vectors in R can be created with the c() function and stored under a name with the assignment operator <-. A few examples follow:

#creating string variable 
names <- c("mary", "joy", "rehim")
names

## [1] "mary"  "joy"   "rehim"

#creating numerical variable
marks <- c(1,2,3)
marks

## [1] 1 2 3

#the vector is treated as one unit.. any command applies to all elements of the vector.. this is important to note because once you import an excel file/ csv or any other format, a function will be applied to the entire column rather than to a single value.
marks +5

## [1] 6 7 8
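
#a further sketch of the same idea, assuming made-up weights (kg) and heights (m) for three people: arithmetic between two vectors of equal length is done element by element
weights_kg <- c(60, 72, 55)
heights_m <- c(1.6, 1.8, 1.5)
round(weights_kg/heights_m^2, 1)

## [1] 23.4 22.2 24.4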

#integer vector -- only whole numbers.. specified by placing L after the number.. mode() still reports "numeric" for integers, class() or typeof() would report "integer"
group<- c(1L, 2L, 3L)
mode(group)

## [1] "numeric"

#logical variables
marks1 <- marks>50
mode(marks1)

## [1] "logical"

#TRUE counts as 1 and FALSE as 0.. no mark is above 50 here, so the mean is 0
mean(marks1)

## [1] 0
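
#a sketch with made-up scores (not from the original session) where some values pass.. mean() then gives the proportion that passed, and the logical vector can be used inside [ ] to keep only the matching elements
scores <- c(45, 67, 82, 30)
passed <- scores>50
passed

## [1] FALSE  TRUE  TRUE FALSE

mean(passed)

## [1] 0.5

scores[passed]

## [1] 67 82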

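#factor variables.. a character vector has mode "character".. once it is converted with factor(), its mode is numeric (integer codes with labels)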
sex<-c("F", "M")
mode(sex)

## [1] "character"

sex1 <- factor(sex)
levels(sex1)

## [1] "F" "M"

levels(sex1)<- c("female", "male")
levels(sex1)

## [1] "female" "male"
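
#a sketch (made-up values) showing why the mode of a factor is numeric: R stores a factor as integer codes, one per level, and the levels are only labels
sex2 <- factor(c("M", "F", "F", "M", "F"), levels = c("F", "M"), labels = c("female", "male"))
mode(sex2)

## [1] "numeric"

as.integer(sex2)

## [1] 2 1 1 2 1

as.character(sex2)

## [1] "male"   "female" "female" "male"   "female"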

Creating a data frame. You can create a data frame from the vectors.

```

#creating a dataframe (df)
examples<- data.frame(names, marks)
examples

##   names marks
## 1  mary     1
## 2   joy     2
## 3 rehim     3

#a data frame is a list of vectors (columns) that all have the same length

#class of object.. tells you what kind of object it is
class(examples)

## [1] "data.frame"

class(names)

## [1] "character"

#creating number variable
sl_No <- 1:10

#repeat
rep(2,10)

##  [1] 2 2 2 2 2 2 2 2 2 2

#by combining create variable
c(rep (1,10), rep (2,10))

##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

c(rep("A", 15), rep("B", 20))

##  [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B"
## [18] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
## [35] "B"

c(seq(1,5,1), seq(5,1, -1))

##  [1] 1 2 3 4 5 5 4 3 2 1

c(rep(1,3), rep(2, 3), rep(3,3))

## [1] 1 1 1 2 2 2 3 3 3

#generating sequences.. one to 100 with gap of 2..by default (from,to,by)
seq(1,100,2)

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99

#else.. you can change the sequence by mentioning the details
seq(from=1, by=2, to = 100)

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99

#backward 
seq(100,0,-2)

##  [1] 100  98  96  94  92  90  88  86  84  82  80  78  76  74  72  70  68
## [18]  66  64  62  60  58  56  54  52  50  48  46  44  42  40  38  36  34
## [35]  32  30  28  26  24  22  20  18  16  14  12  10   8   6   4   2   0
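
The combinations above can often be written more compactly. The following is a minimal sketch using the each, times and length.out arguments of rep() and seq(), which the session above does not cover; the values are chosen only to mirror the earlier examples.

```r
# repeat each element three times: same values as c(rep(1,3), rep(2,3), rep(3,3))
rep(1:3, each = 3)

# repeat "A" 15 times and "B" 20 times in one call
groups <- rep(c("A", "B"), times = c(15, 20))
table(groups)

# ask for a number of equally spaced values instead of a step size
seq(0, 1, length.out = 5)

# seq_len(n) gives the whole numbers 1, 2, ..., n
seq_len(10)
```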

#creating tables
bmi<- c(">25", "<=25")
number<- c(54, 34)
rbind(bmi,number)

##        [,1]  [,2]  
## bmi    ">25" "<=25"
## number "54"  "34"
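
Note that rbind() has turned the counts into text ("54", "34"), because a matrix can hold only one type of value. If you want to calculate with the counts, one option is to keep them in a data frame instead. This is only a sketch; the names bmi_group, count, bmi_table and percent are chosen here for illustration.

```r
bmi_group <- c(">25", "<=25")
count <- c(54, 34)

# each column of a data frame keeps its own type
bmi_table <- data.frame(bmi_group, count)
class(bmi_table$count)

# the counts stay numeric, so percentages can be calculated
bmi_table$percent <- round(100 * bmi_table$count / sum(bmi_table$count), 1)
bmi_table
```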

#Let's try an example..
name<- c("Geetha", "Maharunnisha", "Kiran", "Sicily", "Raj", "Bala", "Saurav", "Mohammed", "Seema", "Sudhi", "Megha", "Sundar", "Lovely", "Susan", "Bindu", "Roy", "Jane", "Chand", "Krishnan")
Age <- c(14, 14.5, 14,14,14.5, 14,14,14.5, 14, 14.5, 14.5, 14.5, 14, 14, 14.5, 14.5, 14, 14.5, 14)
Sex <- c("F", "F", "M", "F", "M", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "M", "F", "F", "M")
hw <- c(40, 37, 33, 37, 39, 40, 39, 37, 33, 38, 34, 42, 40, 43, 35, 39, 43, 38, 42)
assignment <- c(34, 35, 38, 40, 41,40, 35, 42,40, 40, 33, 40, 41, 40, 33, 34,43, 33, 41 )
final<- c(59, 63, 57, 64, 61,63, 58, 61, 64, 60, 54, 65, 64, 59, 53, 65, 57, 578, 65)
df<- data.frame(name, Age, Sex, hw, assignment, final)
df

##            name  Age Sex hw assignment final
## 1        Geetha 14.0   F 40         34    59
## 2  Maharunnisha 14.5   F 37         35    63
## 3         Kiran 14.0   M 33         38    57
## 4        Sicily 14.0   F 37         40    64
## 5           Raj 14.5   M 39         41    61
## 6          Bala 14.0   M 40         40    63
## 7        Saurav 14.0   M 39         35    58
## 8      Mohammed 14.5   M 37         42    61
## 9         Seema 14.0   F 33         40    64
## 10        Sudhi 14.5   M 38         40    60
## 11        Megha 14.5   F 34         33    54
## 12       Sundar 14.5   M 42         40    65
## 13       Lovely 14.0   F 40         41    64
## 14        Susan 14.0   F 43         40    59
## 15        Bindu 14.5   F 35         33    53
## 16          Roy 14.5   M 39         34    65
## 17         Jane 14.0   F 43         43    57
## 18        Chand 14.5   F 38         33   578
## 19     Krishnan 14.0   M 42         41    65

#indexing a dataframe.. you can pick out a specific value in a data frame by specifying [row, column]..
#this is useful for correction in case you identify a wrong entry in the dataset.. here the final mark in row 18 was entered as 578
df[18,6] <- 57
#the same works for the first example
examples[1,2] <- 6
examples

##   names marks
## 1  mary     6
## 2   joy     2
## 3 rehim     3
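
Indexing also works with column names and with conditions, which helps when you do not know the row number of a wrong entry in advance. The following is a minimal sketch on a small made-up data frame (toy, with one deliberately implausible mark); the cut-off of 100 and the corrected value of 63 are assumptions for illustration.

```r
# a small made-up data frame, with one obviously wrong entry
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  sex = c("F", "M", "F", "M"),
  final = c(59, 630, 57, 64)
)

# one column, by name
toy$final

# all rows that meet a condition (leave the column position empty)
toy[toy$sex == "F", ]

# row numbers of implausible marks, here anything above 100
wrong_rows <- which(toy$final > 100)
wrong_rows

# correct the entry by row number and column name
toy[wrong_rows, "final"] <- 63
toy
```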

Inbuilt datasets. R has inbuilt datasets for learning and practice.

#to see datasets available for practice/ learning
data(package= .packages(all.available = TRUE))

## Warning in data(package = .packages(all.available = TRUE)): datasets have
## been moved from package 'base' to package 'datasets'

## Warning in data(package = .packages(all.available = TRUE)): datasets have
## been moved from package 'stats' to package 'datasets'

#You will see a warning message in R markdown file, however, if you run this command in R console, you will be able to see the available datasets

Understanding the datasets. We will take the inbuilt dataset 'iris' as an example. Once you import your dataset, it's important to look at its characteristics for better understanding. You can use the packages 'tidyverse' and 'skimr' for this.

Using packages. If you are using a package for the first time, install it first with the command install.packages(), for example install.packages("tidyverse"). After that, load it with library() in each session.
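
If you are not sure whether a package is already installed, you can check before installing. A minimal sketch, assuming the two packages used in this session; the names needed, installed and missing_pkgs are made up here, and the install line is commented out so that nothing is downloaded unless you choose to run it.

```r
# packages used in this session
needed <- c("tidyverse", "skimr")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]
missing_pkgs

# uncomment to install whatever is missing (needs an internet connection)
# install.packages(missing_pkgs)
```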

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.1       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("skimr")

## 
## Attaching package: 'skimr'

## The following object is masked from 'package:stats':
## 
##     filter

#using tidyverse to rearrange your data
data(iris)
glimpse(iris)

## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, s...

#to view the data set
view(iris)
#to see the top 6 rows
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
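
Base R also has functions for a first look at a dataset, which work without loading any package. A short sketch on a fresh copy of iris:

```r
data(iris)

dim(iris)      # number of rows and columns
names(iris)    # column names
str(iris)      # type and first values of every column
summary(iris)  # minimum, quartiles, mean and maximum (counts for factors)
tail(iris, 3)  # last 3 rows
```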

#to add a column
iris$id<- 1:150
#skimr package for summary
library(skimr)

# to obtain summary of all variables in skimr
skim(iris)

## Skim summary statistics
##  n obs: 150 
##  n variables: 6 
## 
## -- Variable type:factor --------------------------------------------------------
##  variable missing complete   n n_unique                       top_counts
##   Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0
##  ordered
##    FALSE
## 
## -- Variable type:integer -------------------------------------------------------
##  variable missing complete   n mean    sd p0   p25  p50    p75 p100
##        id       0      150 150 75.5 43.45  1 38.25 75.5 112.75  150
##      hist
##  <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587>
## 
## -- Variable type:numeric -------------------------------------------------------
##      variable missing complete   n mean   sd  p0 p25  p50 p75 p100
##  Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9
##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1 5.8  6.4  7.9
##   Sepal.Width       0      150 150 3.06 0.44 2   2.8 3    3.3  4.4
##      hist
##  <U+2587><U+2581><U+2581><U+2582><U+2585><U+2585><U+2583><U+2581>
##  <U+2587><U+2581><U+2581><U+2585><U+2583><U+2583><U+2582><U+2582>
##  <U+2582><U+2587><U+2585><U+2587><U+2586><U+2585><U+2582><U+2582>
##  <U+2581><U+2582><U+2585><U+2587><U+2583><U+2582><U+2581><U+2581>

#to see the structure of the data (column types and first values) in tidyverse
glimpse(iris)

## Observations: 150
## Variables: 6
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, s...
## $ id           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...

#to merge two columns ..aka long format.. aka tidy format..eg in repeated measurements
iris2<- iris%>% gather(Sepal.Length, Sepal.Width, key = "dimension", value = "measurement")
glimpse(iris2)

## Observations: 300
## Variables: 6
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, s...
## $ id           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ dimension    <chr> "Sepal.Length", "Sepal.Length", "Sepal.Length", "...
## $ measurement  <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...

#to make one measurement column
iris3<- iris%>%gather(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, key = "dimension", value = "measurement")
glimpse(iris3)

## Observations: 600
## Variables: 4
## $ Species     <fct> setosa, setosa, setosa, setosa, setosa, setosa, se...
## $ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ dimension   <chr> "Sepal.Length", "Sepal.Length", "Sepal.Length", "S...
## $ measurement <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, ...

#to spread the data back from long format
iris4 <- iris3%>%spread(dimension, measurement)
glimpse(iris4)

## Observations: 150
## Variables: 6
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, s...
## $ id           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
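
gather() and spread() still work, but versions of the tidyr package from 1.0.0 onwards (later than the 0.8.3 shown when tidyverse was loaded above) also offer pivot_longer() and pivot_wider() for the same job. A sketch of the equivalent steps, assuming such a version is installed; the names iris_long and iris_wide are made up here.

```r
library(tidyr)

data(iris)
iris$id <- 1:150

# long format: one row per flower and dimension
iris_long <- pivot_longer(
  iris,
  cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
  names_to = "dimension",
  values_to = "measurement"
)
dim(iris_long)

# back to wide format: one column per dimension
iris_wide <- pivot_wider(iris_long, names_from = dimension, values_from = measurement)
dim(iris_wide)
```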

#to make a table
table(iris$Sepal.Length)

## 
## 4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 
##   1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6 
## 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9 
##   6   4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1

#to cross tabulate..first as rows and then columns
table(iris$Sepal.Length,iris$Species)

##      
##       setosa versicolor virginica
##   4.3      1          0         0
##   4.4      3          0         0
##   4.5      1          0         0
##   4.6      4          0         0
##   4.7      2          0         0
##   4.8      5          0         0
##   4.9      4          1         1
##   5        8          2         0
##   5.1      8          1         0
##   5.2      3          1         0
##   5.3      1          0         0
##   5.4      5          1         0
##   5.5      2          5         0
##   5.6      0          5         1
##   5.7      2          5         1
##   5.8      1          3         3
##   5.9      0          2         1
##   6        0          4         2
##   6.1      0          4         2
##   6.2      0          2         2
##   6.3      0          3         6
##   6.4      0          2         5
##   6.5      0          1         4
##   6.6      0          2         0
##   6.7      0          3         5
##   6.8      0          1         2
##   6.9      0          1         3
##   7        0          1         0
##   7.1      0          0         1
##   7.2      0          0         3
##   7.3      0          0         1
##   7.4      0          0         1
##   7.6      0          0         1
##   7.7      0          0         4
##   7.9      0          0         1
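
Sepal.Length is a continuous measurement, so tabulating every distinct value gives a long table. One common option is to group the values into intervals with cut() before tabulating. The break points below are arbitrary choices for illustration, as are the names length_group and grouped_table.

```r
data(iris)

# group sepal length into the intervals (4,5], (5,6], (6,7] and (7,8]
length_group <- cut(iris$Sepal.Length, breaks = c(4, 5, 6, 7, 8))

# cross tabulate the groups against species
grouped_table <- table(length_group, iris$Species)
grouped_table

# column percentages: distribution of length groups within each species
round(100 * prop.table(grouped_table, margin = 2), 1)
```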

Graph making in R. You can visualise data by making graphs. ggplot2 is a graphics package whose default appearance is good enough for publication. Its graphs are built on the logic of the grammar of graphics (gg): a graph is created layer by layer, and it can be saved as an object at any stage and extended later. Other graphics packages include grid and lattice.

library(ggplot2)
#histogram
data(Cars93, package = "MASS")

Cars93%>%ggplot(aes(x= MPG.city)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#to enhance features.. use bins/ binwidth/ fill/ colour
Cars93%>%ggplot(aes(x= MPG.city)) + geom_histogram(bins = 10)

Cars93%>%ggplot(aes(x= MPG.city)) + geom_histogram(binwidth = 5)

Cars93%>%ggplot(aes(x= MPG.city)) + geom_histogram(bins = 10, fill="blue")

Cars93%>%ggplot(aes(x= MPG.city)) + geom_histogram(bins = 10, colour = "black", fill = "cyan")

#to make other type of graphs.. change geom function
Cars93%>%ggplot(aes(x= MPG.city)) + geom_density()

Cars93%>%ggplot(aes(x= MPG.city)) + geom_density(colour = "red")

#to combine different types of graphs as layers
Cars93%>%ggplot(aes(x= MPG.city, y = ..density..)) + geom_histogram(binwidth = 5, fill = "red", colour = "cyan") +geom_density(colour = "black")

# to add labels.. use the labs function.. a histogram shows counts on the y axis by default
p1 = Cars93%>%ggplot(aes(x= MPG.city)) + geom_histogram(bins = 10)
p1 + labs(x = "Mileage", y = "Count")
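
Because a plot is built layer by layer, it can be stored as an object at any stage and added to later. A minimal sketch follows; the object names, the labels and the file name mpg_histogram.png are made up for illustration, and the ggsave() line is commented out so that nothing is written to disk unless you want it.

```r
library(ggplot2)
data(Cars93, package = "MASS")

# stage 1: data and aesthetics only, no layer yet
base_plot <- ggplot(Cars93, aes(x = MPG.city))

# stage 2: add a histogram layer
hist_plot <- base_plot + geom_histogram(binwidth = 5, fill = "grey70", colour = "black")

# stage 3: add labels
final_plot <- hist_plot + labs(title = "City mileage", x = "Miles per gallon", y = "Number of cars")
final_plot

# save the finished plot to a file in the working directory
# ggsave("mpg_histogram.png", final_plot, width = 6, height = 4)
```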

#bar chart
glimpse(Cars93)

## Observations: 93
## Variables: 27
## $ Manufacturer       <fct> Acura, Acura, Audi, Audi, BMW, Buick, Buick...
## $ Model              <fct> Integra, Legend, 90, 100, 535i, Century, Le...
## $ Type               <fct> Small, Midsize, Compact, Midsize, Midsize, ...
## $ Min.Price          <dbl> 12.9, 29.2, 25.9, 30.8, 23.7, 14.2, 19.9, 2...
## $ Price              <dbl> 15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 2...
## $ Max.Price          <dbl> 18.8, 38.7, 32.3, 44.6, 36.2, 17.3, 21.7, 2...
## $ MPG.city           <int> 25, 18, 20, 19, 22, 22, 19, 16, 19, 16, 16,...
## $ MPG.highway        <int> 31, 25, 26, 26, 30, 31, 28, 25, 27, 25, 25,...
## $ AirBags            <fct> None, Driver & Passenger, Driver only, Driv...
## $ DriveTrain         <fct> Front, Front, Front, Front, Rear, Front, Fr...
## $ Cylinders          <fct> 4, 6, 6, 6, 4, 4, 6, 6, 6, 8, 8, 4, 4, 6, 4...
## $ EngineSize         <dbl> 1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7, 3.8...
## $ Horsepower         <int> 140, 200, 172, 172, 208, 110, 170, 180, 170...
## $ RPM                <int> 6300, 5500, 5500, 5500, 5700, 5200, 4800, 4...
## $ Rev.per.mile       <int> 2890, 2335, 2280, 2535, 2545, 2565, 1570, 1...
## $ Man.trans.avail    <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, No, No...
## $ Fuel.tank.capacity <dbl> 13.2, 18.0, 16.9, 21.1, 21.1, 16.4, 18.0, 2...
## $ Passengers         <int> 5, 5, 5, 6, 4, 6, 6, 6, 5, 6, 5, 5, 5, 4, 6...
## $ Length             <int> 177, 195, 180, 193, 186, 189, 200, 216, 198...
## $ Wheelbase          <int> 102, 115, 102, 106, 109, 105, 111, 116, 108...
## $ Width              <int> 68, 71, 67, 70, 69, 69, 74, 78, 73, 73, 74,...
## $ Turn.circle        <int> 37, 38, 37, 37, 39, 41, 42, 45, 41, 43, 44,...
## $ Rear.seat.room     <dbl> 26.5, 30.0, 28.0, 31.0, 27.0, 28.0, 30.5, 3...
## $ Luggage.room       <int> 11, 15, 14, 17, 13, 16, 17, 21, 14, 18, 14,...
## $ Weight             <int> 2705, 3560, 3375, 3405, 3640, 2880, 3470, 4...
## $ Origin             <fct> non-USA, non-USA, non-USA, non-USA, non-USA...
## $ Make               <fct> Acura Integra, Acura Legend, Audi 90, Audi ...

#to make bar chart. .first see how the variable is distributed
table(Cars93$Cylinders)

## 
##      3      4      5      6      8 rotary 
##      3     49      2     31      7      1

class(Cars93$Cylinders)

## [1] "factor"

Cars93%>%ggplot(aes(x = Cylinders)) + geom_bar()

#bar chart for cross tabs
table(Cars93$Origin, Cars93$Cylinders)

##          
##            3  4  5  6  8 rotary
##   USA      0 22  0 20  6      0
##   non-USA  3 27  2 11  1      1

Cars93%>% ggplot(aes(x= Cylinders, fill = Origin)) + geom_bar()

Cars93%>% ggplot(aes(x= Cylinders, fill = Origin)) + geom_bar(position = "dodge")

barpl<-Cars93%>% ggplot(aes(x= Cylinders, fill = Origin)) + geom_bar(position = "fill")

# to add title and labels
barpl2<- barpl + labs (title = "barplot", x = "number", y = "frequency")

# to change axis
barpl2 + coord_flip()

#to change theme
barpl2 + theme_dark()

#package ggthemes can be used for multiple themes

# to create bar chart from table.. stat = identity
exercise = c ("freq", "none", "some")
heavy <- c (7,1,3)
never <- c(87, 18, 84)
smoking<- data.frame(exercise, heavy, never)
smoking1<- smoking%>%gather(heavy, never, key = "smoker", value = "number")
smoking1%>%ggplot(aes(x=exercise, y = number, fill = smoker)) +geom_bar(stat = "identity", position = "dodge")
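
When the counts are already in the data, geom_col() is a shorthand for geom_bar(stat = "identity"). The sketch below rebuilds the same smoking data directly in long format (the names smoking_long and col_plot are made up here) and draws the same kind of chart.

```r
library(ggplot2)

# same numbers as above, typed in long format: one row per exercise level and smoker group
smoking_long <- data.frame(
  exercise = rep(c("freq", "none", "some"), times = 2),
  smoker = rep(c("heavy", "never"), each = 3),
  number = c(7, 1, 3, 87, 18, 84)
)

# geom_col() uses the y values as bar heights
col_plot <- ggplot(smoking_long, aes(x = exercise, y = number, fill = smoker)) +
  geom_col(position = "dodge")
col_plot
```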

#boxplot.. 
Cars93%>%ggplot(aes(x=Origin, y = Horsepower)) + geom_boxplot()

Cars93%>%ggplot(aes(x= Horsepower, colour = Origin)) + geom_density()

#to create separate histograms
Cars93%>%ggplot(aes(x= Horsepower)) + geom_histogram() +facet_grid(.~Origin)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Cars93%>%ggplot(aes(x= Horsepower)) + geom_histogram() +facet_grid(Origin~.)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Cars93%>%ggplot(aes(x= Horsepower)) + geom_histogram() +facet_grid(Origin~Cylinders)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#scatterplot.. 
Cars93%>%ggplot(aes(x=Horsepower, y = MPG.city, colour = Origin, shape = Type, size = Cylinders)) +geom_point()

## Warning: Using size for a discrete variable is not advised.

#to fix/increase size of a point.. cex argument (ggplot2 also accepts it under the name size)
Cars93%>%ggplot(aes(x=Horsepower, y = MPG.city, colour = Origin, shape = Type)) +geom_point(cex= 4)

#transparency of a point.. alpha aesthetic
Cars93%>%ggplot(aes(x=Horsepower, y = MPG.city, colour = Origin, shape = Type, size = EngineSize, alpha = Passengers)) +geom_point()

#to add regression line.. geom smooth
scatter<- Cars93%>%ggplot(aes(x=Horsepower, y = MPG.city, colour = Origin)) +geom_point()
scatter + geom_smooth(method = "lm")

#to remove 95%CI
scatter1<- scatter + geom_smooth(method = "lm", se = F)
scatter1

#to get single regression line
scatter1 +geom_smooth(method = "lm", se=F, group = 1, colour = "black")
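
The line drawn by geom_smooth(method = "lm") is a least-squares fit from lm(). If you want the numbers behind the single line through all points, a sketch using lm() directly on the same two variables is shown below; the name fit is made up here.

```r
data(Cars93, package = "MASS")

# straight-line model: city mileage as a function of horsepower
fit <- lm(MPG.city ~ Horsepower, data = Cars93)

# intercept and slope of the line
coef(fit)

# 95% confidence intervals for both
confint(fit)
```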

#to see the types of plotting characters (point shapes) available
plot(0:25, rep(1,26), pch = 0:25, cex = 2); text (0:25, 0.95, as.character(0:25))

#choosing the plotting character using the pch argument.. a fixed pch overrides the shape mapping
Cars93%>%ggplot(aes(x=Horsepower, y = MPG.city, colour = Origin, shape = Type, size = EngineSize, alpha = Passengers)) +geom_point(pch=4)

#piechart
Cars93%>%ggplot(aes(x=1, fill = DriveTrain)) + geom_bar() + coord_polar(theta = 'y')

#another example
data("diamonds")
#bar chart 
diamonds%>%ggplot(aes(x=cut)) +geom_bar()

#include colour also as variable
diamonds%>%ggplot(aes(x=cut, fill = color)) +geom_bar()

#types of bar graphs
diamonds%>%ggplot(aes(x=cut, fill = color)) +geom_bar() +coord_flip()

diamonds%>%ggplot(aes(x=cut, fill = color)) +geom_bar(position = "stack")

diamonds%>%ggplot(aes(x=cut, fill = color)) +geom_bar(position = "dodge")

diamonds%>%ggplot(aes(x=cut, fill = color)) +geom_bar(position = "fill")

Descriptive statistics

#to find mean
mean(iris$Sepal.Length)

## [1] 5.843333

#to find median
median(iris$Sepal.Length)

## [1] 5.8

#to find variance
var(iris$Sepal.Length)

## [1] 0.6856935

#to find quantiles
quantile(iris$Sepal.Length)

##   0%  25%  50%  75% 100% 
##  4.3  5.1  5.8  6.4  7.9

#to find specific quantile
quantile(iris$Sepal.Length, probs = 0.6)

## 60% 
## 6.1
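
A few more descriptive functions follow the same pattern, and tapply() gives a statistic separately for each group. A short sketch on a fresh copy of iris; the name species_means is made up here.

```r
data(iris)

# standard deviation is the square root of the variance
sd(iris$Sepal.Length)

# range, interquartile range and a six-number summary
range(iris$Sepal.Length)
IQR(iris$Sepal.Length)
summary(iris$Sepal.Length)

# mean sepal length within each species
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means
```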

Inferential statistics

```
#to do chi square.. first make a table to avoid multiple brackets
test_table <- table(dataset$variable1, dataset$variable2)
chisq.test(test_table)

#unpaired t test
t.test(dataset$variable1~dataset$variable2)

#example: age is dependent on gender is written as Age~Gender

#paired t test.. give the two sets of paired measurements as separate arguments
t.test(dataset$variable1, dataset$variable2, paired = TRUE)

#ANOVA
oneway.test(dataset$variable1~dataset$variable2)
#to test for group differences in case of significant ANOVA.. takes the values and the grouping variable, not a formula
pairwise.t.test(dataset$variable1, dataset$variable2)

####Non parametric tests####
#NP equivalent of t test .. Wilcoxon test
wilcox.test(dataset$variable1~dataset$variable2)

#Kruskal-Wallis
kruskal.test(dataset$variable1~dataset$variable2)
#pairwise testing if significant results
pairwise.wilcox.test(dataset$variable1, dataset$variable2)

#correlation.. takes two numeric vectors, not a formula
cor(dataset$variable1, dataset$variable2)

#correlation test with hypothesis testing
cor.test(dataset$variable1, dataset$variable2)

#lm for linear model or glm for generalised linear model (R square values missing)
model <- lm(dependent_variable~variable1+variable2+variable3, data = dataset)
summary(model)
anova(model)
#to calculate Coefficients
coef(model)
#95% CI for all coefficients
confint(model)

#test for normality
shapiro.test(dataset$variable_name)
```

Similarly, there are packages for advanced biostatistics such as 'broom' for regression, 'epiDisplay' for logistic regression, and as many as 8000 packages for various functions.
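
The templates in this section use placeholder names (dataset, variable1, and so on). As a sketch of how they look with real data, the following runs a few of them on the inbuilt iris dataset. The choice of variables and the object names are only for illustration and are not part of the original session.

```r
data(iris)

# keep two species so that the grouping variable has exactly two levels
two_species <- droplevels(subset(iris, Species != "virginica"))

# unpaired t test: sepal length by species
t_result <- t.test(Sepal.Length ~ Species, data = two_species)
t_result$p.value

# non-parametric equivalent (exact = FALSE because the data contain ties)
w_result <- wilcox.test(Sepal.Length ~ Species, data = two_species, exact = FALSE)
w_result$p.value

# ANOVA and Kruskal-Wallis across all three species
oneway.test(Sepal.Length ~ Species, data = iris)
kruskal.test(Sepal.Length ~ Species, data = iris)

# correlation with hypothesis test
c_result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
c_result$estimate

# linear model with coefficients and 95% CI
model <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
coef(model)
confint(model)
```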

Dr Gurpreet Singh, Dr Biju Soman

03/07/2019