R for dummies

Data

R was designed with statisticians and data analysts in mind rather than expert programmers. It is based on the S language, which was created for statistical computing. The main advantage of R is the facilities it provides for handling complex data through dataframes. A dataframe stores each column as an independent vector, so every column can have its own type.
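As a minimal sketch of the idea that each column is its own vector, consider the small dataframe below. The name `flowers` and all of its values are made up purely for illustration.

```r
# A tiny, hypothetical dataframe (illustrative values only)
flowers <- data.frame(
  name   = c("a", "b", "c"),
  height = c(10.5, 12.0, 9.8),
  stringsAsFactors = FALSE
)

# Each column is an ordinary vector with its own type
class(flowers$name)     # character column
class(flowers$height)   # numeric column

# Vector functions can be applied to a single column
mean(flowers$height)
```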

Variable assignation

To assign a value to a variable, use “<-” instead of “=” (“=” also works at the top level, but “<-” is the conventional choice).
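A short sketch of both forms of assignment follows; the variable names `x`, `y` and `total` are arbitrary.

```r
x <- 5           # conventional assignment
y = 10           # also accepted at the top level, but less idiomatic
total <- x + y   # the result of an expression can be assigned too
print(total)
```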

Popular datasets

There are many datasets to practice with. Visit “kaggle.com” to find one that you would like to use. I suggest starting with the Iris dataset.

We will use a well-known dataframe named “iris”.

As this dataframe ships with R, we only need to type the following to load it:

data <- iris

Visualizing general information

To preview the first rows of the data, type:

head(data)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

This dataset consists of four continuous predictors (length and width of the sepals and petals) and one categorical outcome (Species).
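The structure described above can be checked directly. This is only a quick sketch using base R functions on the built-in iris data:

```r
data <- iris

dim(data)              # number of rows and columns
sapply(data, class)    # type of each column
levels(data$Species)   # categories of the outcome
```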

To get information about a specific column, use the name of the dataframe followed by “$” and the column name. In editors with autocompletion (RStudio, for example), typing “$” brings up a list of the column names. For instance:

data$Sepal.Length
data$Sepal.Width
data$Petal.Length
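Since a column extracted with “$” is a plain vector, it can be stored and passed to other functions. A minimal sketch, where the name `sepal.length` is only illustrative:

```r
data <- iris

sepal.length <- data$Sepal.Length   # extract one column as a numeric vector
length(sepal.length)                # one value per flower
mean(sepal.length)                  # average sepal length

# The same column can also be selected by name with [[ ]]
identical(data[["Sepal.Length"]], sepal.length)
```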

Iris data summary

The summary function reports basic statistics for each column, such as the mean, median and quartiles, among others.

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

Standardizing data

Since not all the measurements are on the same scale, it is common to bring them to a common scale. This is done with the scale function.

For a better understanding, let’s look at a small visual example using boxplots.

Without scaling data

boxplot(data)

Scaling data

data.continuous <- data[, 1:4]  # Drop the categorical outcome
data.scaled <- scale(data.continuous)
boxplot(data.scaled)

As you can see, once the data are standardized the variables can be compared directly, since each column is centered on its mean and divided by its standard deviation.
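To make the standardization concrete, the sketch below reproduces one scaled column by hand and checks that every scaled column ends up with mean 0 and standard deviation 1. It assumes the default arguments of scale (centering and scaling both enabled); the name `manual` is only illustrative.

```r
data.scaled <- scale(iris[, 1:4])

# Standardize one column by hand: subtract the mean, divide by the sd
manual <- (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)

# Compare the manual result with the output of scale()
all.equal(as.numeric(data.scaled[, "Sepal.Length"]), manual)

# After scaling, each column has mean ~0 and standard deviation 1
round(colMeans(data.scaled), 10)
apply(data.scaled, 2, sd)
```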

Statistics

This subsection includes a brief example of analysis of variance (ANOVA) using the aov function. At this point, we will not consider type I or type II errors.

# Compute the analysis of variance for the predictor that you want to test
res.aov <- aov(Sepal.Length ~ Species, data = data)
# Summary of the analysis
summary(res.aov)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From these results, the p-value for Species is below 0.05 (< 2e-16), which means that the mean sepal length differs significantly among the three species.
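If the F value and the p-value are needed as numbers rather than read from the printed table, they can be pulled out of the summary object. This is a sketch that relies on the column names printed in the table above; the names `aov.table`, `f.value` and `p.value` are illustrative.

```r
res.aov <- aov(Sepal.Length ~ Species, data = iris)

# summary() of an aov fit returns a list whose first element is the ANOVA table
aov.table <- summary(res.aov)[[1]]

f.value <- aov.table[["F value"]][1]   # row 1 corresponds to Species
p.value <- aov.table[["Pr(>F)"]][1]

f.value
p.value < 0.05   # TRUE when the effect is significant at the 5% level
```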

Visualizing mean predictor levels

There are several techniques to visualize ANOVA results, such as least significant differences (LSD) and Tukey’s honestly significant difference (HSD), among others.

library(phia)

## Loading required package: car

## Loading required package: carData

x <- c("Species")
# With only one factor, there are no interactions to visualize
IM <- interactionMeans(res.aov, factors = x, slope = NULL)
# Varying these graphical parameters changes the look of the plot
plot(IM, pch = 17, lty = 4, las = 1, cex.axis = 1.0)

This plot shows how the mean sepal length differs across the levels of Species.
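Tukey’s HSD, mentioned earlier as another option, is available in base R through the TukeyHSD function. The following is only a sketch of how it could be applied to the same model, not part of the original analysis:

```r
res.aov <- aov(Sepal.Length ~ Species, data = iris)

# Pairwise comparisons between species with 95% confidence intervals
tukey <- TukeyHSD(res.aov, "Species", conf.level = 0.95)
print(tukey)

# Each row is one pair of species: difference, interval bounds, adjusted p-value
tukey$Species

# Intervals that do not cross zero indicate a significant difference
plot(tukey, las = 1)
```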

Advanced visualizing tools

There are several visualization tools for better understanding the data. R’s pairs function plots every pair of variables against each other, and the points can be distinguished by the species of the flower.

Using the pairs.panels function from the psych package, we can see the Pearson correlation coefficients, the histograms and the scatter plots of the data in a single figure.
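A possible way to produce these plots is sketched below. The choice of colors, point symbols and the `hist.col` value are assumptions made for illustration, and the pairs.panels call only runs if the psych package is installed.

```r
# Scatter-plot matrix of the four measurements, one color per species
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 19)

# Pearson correlation coefficients as a plain matrix
cor.matrix <- cor(iris[, 1:4], method = "pearson")
round(cor.matrix, 2)

# Correlations, histograms and scatter plots in one figure (needs psych)
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(iris[, 1:4], method = "pearson", hist.col = "grey")
}
```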

## Warning: package 'psych' was built under R version 3.5.3