This tutorial is a dummy test for learning R basics.
To work with R, we first need to load the main libraries we will use, including tidyr, ggplot2, dplyr, and caret, among others.
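As a minimal sketch of this step, the snippet below checks which of the packages named above are installed and attaches the ones that are available. The variable names `pkgs` and `not_installed` are only illustrative, and the `install.packages` call is left commented out on the assumption that you may prefer to install packages yourself.

```r
# Packages mentioned in this tutorial
pkgs <- c("tidyr", "ggplot2", "dplyr", "caret")

# requireNamespace() returns TRUE when a package is installed and can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)  # uncomment to install the missing packages
}

# Attach only the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```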
The R language was designed with people who are not expert programmers in mind. It is based on the S language, which was created for statistical computing. A main advantage of R is how easy it makes working with complex data through dataframes. A dataframe handles each column as an independent vector.
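To illustrate the idea that each column behaves like a vector, here is a small sketch with a made-up dataframe. The name `flowers` and its values are hypothetical and exist only for this example.

```r
# A tiny, made-up dataframe: each argument becomes one column
flowers <- data.frame(
  name   = c("a", "b", "c"),
  height = c(1.5, 1.7, 1.6)
)

# A single column is an ordinary vector
heights <- flowers$height
is.numeric(heights)
length(heights)

# Vector functions therefore work directly on a column
mean(flowers$height)
flowers$height * 100  # element-wise arithmetic, e.g. metres to centimetres
```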
To assign variables, use “<-” instead of “=”.
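The short sketch below shows both operators side by side. At the top level `=` also assigns, but inside a function call it names an argument instead, which is one common reason given for preferring `<-`. The variable names are illustrative.

```r
x <- 5        # conventional assignment
y = 10        # also assigns at the top level, but is less idiomatic
total <- x + y

# Inside a call, "=" names an argument and does not create or change a variable:
m <- median(x = c(10, 20, 30))

x  # still 5, because the call above only passed an argument named x
```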
There are many datasets to practice with. Visit “kaggle.com” to find one that you prefer. I suggest starting with the Iris dataset.
We will use a well-known dataframe named “iris”.
As this dataframe ships with R (in the built-in datasets package), to load it we just need to type:
data<-iris
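Once the data are loaded, a few base R functions give a quick overview of their structure. This is just one possible set of checks, not a required step.

```r
data <- iris

dim(data)             # number of rows and columns
names(data)           # column names
str(data)             # type of each column and a preview of its values
levels(data$Species)  # categories of the outcome
```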
To preview the first rows of the data, type:
head(data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
This data set consists of four continuous predictors (length and width of the petals and sepals) and one categorical outcome (Species).
To get the values of a specific column, use the name of the dataframe followed by “$” and the column name. In editors such as RStudio, typing “$” automatically brings up a list of the column names, for instance:
data$Sepal.Length
data$Sepal.Width
data$Petal.Length
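Because a column extracted with `$` is a plain vector, it can be passed straight to other functions. The sketch below is one way to use it; the names `sepal_length` and `means_by_species` are illustrative.

```r
data <- iris

# "$" and "[[ ]]" return the same vector
sepal_length <- data$Sepal.Length
identical(sepal_length, data[["Sepal.Length"]])

# Statistics for a single column
mean(sepal_length)
range(sepal_length)

# Mean sepal length within each species
means_by_species <- tapply(data$Sepal.Length, data$Species, mean)
means_by_species
```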
The summary function provides basic statistics for each column, such as the mean, median, and quartiles, among others.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
As not all the measurements are on the same scale, it is common to standardize the data so that the variables are comparable. This is done with the scale function.
For better understanding, let’s look at a small visual example using boxplots.
boxplot(data)
data.continuous <- data[, 1:4] # Drop the categorical outcome
data.scaled<-scale(data.continuous)
boxplot(data.scaled)
As you can see, once the data are standardized the variables can be compared directly, since each one is centered by subtracting its mean and then divided by its standard deviation.
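To make the standardization concrete, the following sketch recomputes it by hand for one column and compares the result with the output of `scale`. The names `z_manual` and `z_scale` are illustrative.

```r
x <- iris$Sepal.Length

# Manual standardization: subtract the mean, then divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# The same column taken from the output of scale()
scaled <- scale(iris[, 1:4])
z_scale <- as.numeric(scaled[, "Sepal.Length"])

all.equal(z_manual, z_scale)  # the two versions agree

# After scaling, every column has mean ~0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```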
This subsection includes a brief example of analysis of variance (ANOVA) with the aov function. At this point, we will not consider Type I or Type II errors.
# Compute the analysis of variance for the predictor that you want to test
res.aov <- aov(Sepal.Length~ Species, data = data)
# Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From these results, the p-value for Species is below 0.05 (<2e-16), which means that the mean Sepal Length differs significantly among the three species.
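If you need the F statistic or the p-value as numbers rather than printed text, they can be read from the summary table. This is a small sketch; the names `aov_table`, `f_value`, and `p_value` are illustrative.

```r
res.aov <- aov(Sepal.Length ~ Species, data = iris)

# summary() of an aov fit is a list whose first element is the ANOVA table
aov_table <- summary(res.aov)[[1]]

# The first row corresponds to Species, the second to the residuals
f_value <- aov_table[["F value"]][1]
p_value <- aov_table[["Pr(>F)"]][1]

f_value
p_value < 0.05  # TRUE when the difference between species is significant
```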
There are several techniques to compare and visualize ANOVA results, such as Fisher's least significant difference (LSD) and Tukey's honestly significant difference (HSD), among others.
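As one example of these techniques, base R provides `TukeyHSD` for pairwise comparisons between groups. The sketch below applies it to the same model. It is offered as an alternative illustration and is not part of the phia workflow used in this tutorial.

```r
res.aov <- aov(Sepal.Length ~ Species, data = iris)

# Pairwise differences in mean Sepal.Length with 95% confidence intervals
tukey <- TukeyHSD(res.aov, conf.level = 0.95)
tukey

# The comparison table has one row per pair of species,
# with columns diff, lwr, upr and "p adj"
tukey$Species[, "p adj"]

# Intervals that do not cross zero indicate a significant difference
plot(tukey, las = 1)
```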
library(phia)
## Loading required package: car
## Loading required package: carData
x <- c("Species")
IM <- interactionMeans(res.aov, factors = x, slope = NULL) # As there is only one factor, there are no interactions to visualize
plot(IM,pch=17,lty=4,las=1,cex.axis=1.0) # By varying these parameters, we can get beautiful plots
This plot shows how the mean Sepal Length differs across the levels of Species.
There are several visualization tools for better understanding the information. R’s pairs function draws a scatter plot for every pair of variables, and the points can be colored by the species of the flower.
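A possible call is sketched below. The original code for this plot is not shown, so the arguments here (colors, point symbol, title) are assumptions chosen for illustration.

```r
# One color index (1, 2, 3) per observation, taken from the species factor
species_colors <- as.integer(iris$Species)

# Scatter plot matrix of the four measurements, colored by species
pairs(iris[, 1:4],
      col  = species_colors,
      pch  = 19,
      main = "Iris measurements by species")
```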
Using the pairs.panels function from the psych package, we can get the Pearson correlation coefficients, the histograms, and also scatter plots of the data.
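The call that produced this plot is not shown, so the following is only a sketch of how it might look, assuming the psych package is installed. The correlation matrix is also computed with base R so the coefficients can be inspected without the plot.

```r
# Pearson correlations between the four measurements (base R)
cor_matrix <- cor(iris[, 1:4], method = "pearson")
round(cor_matrix, 2)

# Scatter plots, histograms and correlations in a single figure.
# The argument values below are illustrative choices, not the original settings.
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(iris[, 1:4],
                      method   = "pearson",
                      hist.col = "lightblue",
                      density  = TRUE,
                      ellipses = FALSE)
}
```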