Concatenation with the c function is used to create a vector.
x<-c(1, 2, 3, 4)
x
## [1] 1 2 3 4
The mean function calculates the arithmetic mean (or average).
The equation for the sample mean is
\[\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\]
# the average of x
mean(x)
## [1] 2.5
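To connect the function to the formula above, the mean can also be computed directly from its definition. A minimal sketch using base R's sum and length:
# sample mean computed from the formula: sum of the values divided by n
n <- length(x)
sum(x)/n
# gives 2.5, matching mean(x)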
If there are NA (missing) values in the dataset, we will need to remove them: any function applied to data that contains an NA will itself return NA.
y<-c(NA, 1, 2, 3, 4)
mean(y)
## [1] NA
# NA values can be removed from the calculation with na.rm=TRUE
mean(y, na.rm=TRUE)
## [1] 2.5
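Before removing missing values, it can help to locate them first. A quick sketch using base R's is.na, which flags missing entries:
# which entries of y are missing?
is.na(y)
# how many are missing?
sum(is.na(y))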
The var function calculates the sample variance:
\[s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2\]
The standard deviation is the square root of the variance:
\[s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}\]
### variance
s2<-var(x)
s2
## [1] 1.666667
### square root of variance
sqrt(s2)
## [1] 1.290994
### standard deviation
sd(x)
## [1] 1.290994
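To see that var matches the formula, here is the sample variance computed by hand, a sketch using only base R:
# sum of squared deviations from the mean, divided by n - 1
sum((x-mean(x))^2)/(length(x)-1)
# gives 1.666667, matching var(x)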
The five number summary provides the minimum, first quartile, median (second quartile), third quartile, and the maximum. The summary function reports these values along with the mean.
z<-c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
summary(z)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.25 5.50 5.50 7.75 10.00
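If you only want the quartiles, the quantile function returns the same five numbers (a sketch; its default quantile type agrees with summary for this vector):
# minimum, quartiles, and maximum of z
quantile(z)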
I will be using the mtcars dataset, which is built into R, for these examples.
str
The str function will tell you about the structure of the dataset, including the dimensions of the data as well as the names and types of the variables.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
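Pieces of this information are also available on their own; a sketch using base R's dim, nrow, ncol, and names:
dim(mtcars)    # number of rows and columns
nrow(mtcars)   # number of observations
ncol(mtcars)   # number of variables
names(mtcars)  # variable names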
head
The head function will give you the first six rows by default.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
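head also takes an n argument if you want a different number of rows, for example:
# first three rows instead of the default six
head(mtcars, n=3)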
tail
The tail function will give you the last six rows by default.
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
View
The View function opens the data in a separate viewing window in R.
Note: this function cannot be used in R Markdown without throwing an error.
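If you want to keep a View call in a script that is also knitted, one option is to run it only in an interactive session, a sketch using base R's interactive:
# only open the viewer when running interactively, not when knitting
if (interactive()) View(mtcars)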
We will use the ggplot2 package within the tidyverse to produce graphics.
library(tidyverse)
# load the mpg dataset from ggplot2
data("mpg")
ggplot(data=mpg, aes(x=hwy))+
geom_histogram(binwidth = 5)
ggplot(data=mpg, aes(x=hwy))+
geom_dotplot()
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=mpg, aes(x=hwy))+
geom_density()
ggplot(data=mpg, aes(x=hwy))+
geom_boxplot()
We can also create side-by-side boxplots to compare distributions.
# we need a factor (categorical) variable to split the data into different groups
# drv is the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive
ggplot(data=mpg, aes(x=hwy, y=drv, fill=drv))+
geom_boxplot()
Bar charts display the counts of a categorical variable.
ggplot(data=mpg, aes(x=class))+
geom_bar()
We can also include a second categorical variable to create stacked, dodged (side-by-side), and filled (proportion) bar charts:
ggplot(data=mpg, aes(x=class, fill=drv))+
geom_bar()
ggplot(data=mpg, aes(x=class, fill=drv))+
geom_bar(position="dodge")
ggplot(data=mpg, aes(x=class, fill=drv))+
geom_bar(position="fill")
A scatterplot displays the relationship between two numeric variables.
ggplot(data=mpg, aes(x=cty, y=hwy))+
geom_point()
Correlation is a metric for the strength and direction of the linear relationship between two numeric variables.
Correlation has the following properties: it is unitless, it always lies between \(-1\) and \(1\), its sign gives the direction of the relationship, and its magnitude gives the strength.
The correlation coefficient is denoted by \(r\):
\[r=\frac{1}{n-1}\sum_{i=1}^n(\frac{x_i-\bar{x}}{s_x})(\frac{y_i-\bar{y}}{s_y})\]
cor(mpg$cty, mpg$hwy)
## [1] 0.9559159
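The built-in cor agrees with the formula; here is a sketch that standardizes both variables by hand:
cty <- mpg$cty
hwy <- mpg$hwy
n <- length(cty)
zx <- (cty-mean(cty))/sd(cty)   # standardized city mpg
zy <- (hwy-mean(hwy))/sd(hwy)   # standardized highway mpg
sum(zx*zy)/(n-1)                # matches cor(mpg$cty, mpg$hwy)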
The lm function gives the least squares regression equation (LSRE), the line that minimizes the sum of squared errors.
We want to describe the relationship between two numeric variables with a mathematical model. Assuming that the relationship between the variables is linear, we use the method of least squares to find the line of best fit.
We call this model simple linear regression: \[\hat{y}=\hat{\beta}_0+\hat{\beta}_1x\] Notation: \(\hat{y}\) is the predicted (fitted) value of the response, \(\hat{\beta}_0\) is the estimated intercept, and \(\hat{\beta}_1\) is the estimated slope.
Some books use slightly different notation, writing the coefficients as \(b_0\) and \(b_1\). The slope and intercept can be computed from the correlation and the sample standard deviations:
\[\text{slope}=\hat{\beta}_1 = r \times \frac{s_y}{s_x}\] \[\text{intercept} = \hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]
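These formulas can be checked directly in R; the results should match the coefficients reported by lm below. A sketch:
r <- cor(mpg$cty, mpg$hwy)
b1 <- r*sd(mpg$hwy)/sd(mpg$cty)        # slope
b0 <- mean(mpg$hwy)-b1*mean(mpg$cty)   # intercept
c(intercept=b0, slope=b1)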
mod<-lm(hwy~cty, data=mpg)
summary(mod)
##
## Call:
## lm(formula = hwy ~ cty, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3408 -1.2790 0.0214 1.0338 4.0461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.89204 0.46895 1.902 0.0584 .
## cty 1.33746 0.02697 49.585 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared: 0.9138, Adjusted R-squared: 0.9134
## F-statistic: 2459 on 1 and 232 DF, p-value: < 2.2e-16
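Once the model is fit, it can be used for prediction, and the fitted line can be added to the scatterplot. A sketch (the city value of 20 is just an illustrative input):
# predicted highway mpg for a car with city mpg of 20
predict(mod, newdata=data.frame(cty=20))
# scatterplot with the fitted least squares line overlaid
ggplot(data=mpg, aes(x=cty, y=hwy))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)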