Part I: Basic Functions

1) Concatenation

The c function concatenates its arguments into a vector.

x<-c(1, 2, 3, 4)
x
## [1] 1 2 3 4
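
As a small sketch of what else c can do (the names x2 and mixed are illustrative and not part of the original examples), it can extend an existing vector, and it coerces mixed inputs to a single type:

```r
x <- c(1, 2, 3, 4)

# c() flattens its arguments, so an existing vector can be extended
x2 <- c(x, 5, 6)
x2
length(x2)

# a vector holds one type; mixing numbers and strings yields character
mixed <- c(1, "a")
class(mixed)
```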

2) Mean

The mean function calculates the arithmetic mean (or average).

The equation for the sample mean is

\[\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\]

# the average of x
mean(x)
## [1] 2.5
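
To connect the formula to the function, here is a minimal sketch that computes the sample mean by hand (n and xbar are illustrative names) and checks it against mean:

```r
x <- c(1, 2, 3, 4)

# sample mean written out from the formula: sum of the values divided by n
n <- length(x)
xbar <- sum(x) / n
xbar

# agrees with the built-in function
all.equal(xbar, mean(x))
```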

If the data contain an NA (a missing value), most summary functions, including mean, return NA. The missing values need to be removed before the calculation.

y<-c(NA, 1, 2, 3, 4)

mean(y)
## [1] NA
# NAs can be removed with na.rm=TRUE
mean(y, na.rm=TRUE)
## [1] 2.5
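
Before removing missing values it can help to see how many there are. The following sketch (y_complete is an illustrative name) counts the NAs with is.na and shows that dropping them by hand gives the same result as na.rm=TRUE:

```r
y <- c(NA, 1, 2, 3, 4)

# is.na() flags each missing value; summing the flags counts them
is.na(y)
sum(is.na(y))

# dropping the NAs by hand is equivalent to na.rm=TRUE
y_complete <- y[!is.na(y)]
mean(y_complete)

# other summary functions accept na.rm as well
sd(y, na.rm = TRUE)
```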

3) Variance and Standard Deviation

The var function calculates the sample variance:

\[s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2\]

The standard deviation is the square root of the variance:

\[s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}\]

### variance
s2<-var(x)
s2
## [1] 1.666667
### square root of variance
sqrt(s2)
## [1] 1.290994
### standard deviation
sd(x)
## [1] 1.290994
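
The two formulas above can also be written out directly. This is a sketch for checking understanding rather than a replacement for var and sd; s2_manual and s_manual are illustrative names:

```r
x <- c(1, 2, 3, 4)
n <- length(x)

# squared deviations from the mean, summed and divided by n - 1
s2_manual <- sum((x - mean(x))^2) / (n - 1)
s2_manual

# standard deviation is the square root of the variance
s_manual <- sqrt(s2_manual)
s_manual

# both agree with the built-in functions
all.equal(s2_manual, var(x))
all.equal(s_manual, sd(x))
```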

4) Five Number Summary

The five number summary provides the minimum, first quartile, median (second quartile), third quartile, and maximum. The summary function reports these five values along with the mean.

z<-c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

summary(z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00
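
Base R has two related functions worth knowing about, shown here as a brief sketch. quantile returns the same quartiles as summary by default, while fivenum uses Tukey's hinges, which can differ slightly from those quartiles:

```r
z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# minimum, quartiles, and maximum, using the same default method as summary()
q <- quantile(z)
q

# Tukey's five number summary; the hinges are not always equal to the quartiles
f <- fivenum(z)
f
```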

Part II: How to look at data

These examples use the mtcars dataset, which is included with R.

1) str

The str function will tell you about the structure of the dataset. This will include the dimensions of the data, as well as the names and types of variables.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
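
If only one piece of that information is needed, each part can be requested separately. A short sketch using base R functions:

```r
# dimensions: number of rows, then number of columns
dim(mtcars)

# variable names
names(mtcars)

# type of each variable
sapply(mtcars, class)
```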

3) tail

The tail function returns the last six rows by default.

tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
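
The number of rows can be changed with the n argument, and head is the counterpart that returns rows from the top. A brief sketch (last3 and first6 are illustrative names):

```r
# last three rows instead of the default six
last3 <- tail(mtcars, n = 3)
last3

# head() works the same way from the top of the data
first6 <- head(mtcars)
first6
```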

4) View

The View function will create a separate viewing window of the data in R.

Note: This function cannot be used in R Markdown without throwing an error.
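
One possible workaround, offered as a sketch rather than a recommended pattern, is to call View only when R is running interactively and to print a few rows otherwise:

```r
# View() needs an interactive session, so only call it there;
# otherwise fall back to printing the first rows
if (interactive()) {
  View(mtcars)
} else {
  print(head(mtcars))
}
```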

Part III: Data Visualization

We will use the ggplot2 package within the tidyverse to produce graphics.

library(tidyverse)

# load the mpg dataset, which comes with ggplot2
data("mpg")

1) Histograms

ggplot(data=mpg, aes(x=hwy))+
  geom_histogram(binwidth = 5)
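
The shape of a histogram depends on how the bins are chosen. As a sketch, the bins can be controlled either by width or by count; the values 2 and 10 are arbitrary choices for illustration, and p_narrow and p_bins are illustrative names:

```r
library(ggplot2)

# binwidth sets the width of each bar in the units of hwy (miles per gallon)
p_narrow <- ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# bins sets the number of bars instead of their width
p_bins <- ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(bins = 10)

p_narrow
p_bins
```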

2) Dot Plots

ggplot(data=mpg, aes(x=hwy))+
  geom_dotplot()
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

3) Density Plots

ggplot(data=mpg, aes(x=hwy))+
  geom_density()

4) Boxplots

ggplot(data=mpg, aes(x=hwy))+
  geom_boxplot()

We can also create side-by-side boxplots to compare distributions.

# we need a factor (categorical) variable to split the data into different groups 
# drv is the type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
ggplot(data=mpg, aes(x=hwy, y=drv, fill=drv))+
  geom_boxplot()
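
The numbers behind the side-by-side boxplots can be computed per group. A minimal sketch using tapply from base R (group_medians is an illustrative name):

```r
library(ggplot2)  # provides the mpg dataset

# summary of hwy within each drive train
tapply(mpg$hwy, mpg$drv, summary)

# group medians, the center line of each box
group_medians <- tapply(mpg$hwy, mpg$drv, median)
group_medians
```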

5) Bar Charts

Bar charts display the number of observations in each level of a categorical variable.

ggplot(data=mpg, aes(x=class))+
  geom_bar()

We can also plot two categorical variables at a time to create variants:

A) Stacked Bar Charts

ggplot(data=mpg, aes(x=class, fill=drv))+
  geom_bar()

B) Side-by-side Bar Charts

ggplot(data=mpg, aes(x=class, fill=drv))+
  geom_bar(position="dodge")

C) Filled Bar Charts

ggplot(data=mpg, aes(x=class, fill=drv))+
  geom_bar(position="fill")
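
The counts and proportions that these bar charts draw can be tabulated directly. This sketch uses table and prop.table from base R; counts and props are illustrative names:

```r
library(ggplot2)  # provides the mpg dataset

# counts behind the single-variable bar chart
table(mpg$class)

# counts behind the stacked and side-by-side charts
counts <- table(mpg$class, mpg$drv)
counts

# proportions within each class (each row sums to 1), as in the filled chart
props <- prop.table(counts, margin = 1)
props
```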

6) Scatterplots

ggplot(data=mpg, aes(x=cty, y=hwy))+
  geom_point()

Part IV: Regression

1) Correlation

Correlation is a metric for the strength of the linear relationship between two numeric variables.

Correlation has the following properties:

  • Notation: r
  • Is between -1 (perfect negative) and 1 (perfect positive)
  • Is a symmetric function (i.e., it doesn’t matter what order the variables enter the equation)

The correlation coefficient is computed as

\[r=\frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)\]

where \(s_x\) and \(s_y\) are the sample standard deviations of the two variables.

cor(mpg$cty, mpg$hwy)
## [1] 0.9559159
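
As a check on the formula, the correlation can be computed by hand. This is a sketch; zx, zy, and r_manual are illustrative names:

```r
library(ggplot2)  # provides the mpg dataset

x <- mpg$cty
y <- mpg$hwy
n <- length(x)

# standardize each variable, multiply pairwise, sum, and divide by n - 1
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
r_manual <- sum(zx * zy) / (n - 1)
r_manual

# matches cor(), and the order of the arguments does not matter
all.equal(r_manual, cor(x, y))
all.equal(cor(x, y), cor(y, x))
```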

2) Linear Models

The lm function fits the least squares regression equation (LSRE), the line that minimizes the sum of squared errors (residuals).

We want to describe the relationship between two numeric variables with a mathematical model. Assuming that the relationship between the variables is linear, we use the method of least squares to find the line of best fit.

A) Simple Linear Regression (SLR) Model

We call this model simple linear regression:

\[\hat{y}=\hat{\beta}_0+\hat{\beta}_1x\]

Notation:

  • \(\hat{y}\) : the predicted value
  • \(\hat{\beta}_0\) : the y-intercept coefficient estimate
  • \(\hat{\beta}_1\) : the slope coefficient estimate
  • \(x\) : the explanatory variable

Some books use slightly different notation:

  • \(b_0 = \hat{\beta}_0\) : the y-intercept coefficient estimate
  • \(b_1 = \hat{\beta}_1\) : the slope coefficient estimate

i) Calculating the Coefficient Estimates by Hand

\[\text{slope}=\hat{\beta}_1 = r \times \frac{s_y}{s_x}\]

\[\text{intercept} = \hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]
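
These two formulas can be applied directly to the mpg data, with cty as the explanatory variable and hwy as the response. The sketch below (b0 and b1 are illustrative names) computes both estimates by hand and compares them with the output of lm:

```r
library(ggplot2)  # provides the mpg dataset

x <- mpg$cty  # explanatory variable
y <- mpg$hwy  # response variable

# slope: correlation times the ratio of standard deviations
b1 <- cor(x, y) * sd(y) / sd(x)

# intercept: the fitted line passes through (mean of x, mean of y)
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# compare with the estimates from lm
coef(lm(hwy ~ cty, data = mpg))
```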

ii) Finding the SLR Model in R

mod<-lm(hwy~cty, data=mpg)

summary(mod)
## 
## Call:
## lm(formula = hwy ~ cty, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3408 -1.2790  0.0214  1.0338  4.0461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89204    0.46895   1.902   0.0584 .  
## cty          1.33746    0.02697  49.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared:  0.9138, Adjusted R-squared:  0.9134 
## F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16
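
Once the model is fitted, the coefficients can be extracted and used for prediction. The sketch below assumes a hypothetical car with a city mileage of 20 mpg, a value chosen only for illustration; pred is an illustrative name. It also checks that, in simple linear regression, R-squared equals the squared correlation:

```r
library(ggplot2)  # provides the mpg dataset

mod <- lm(hwy ~ cty, data = mpg)

# the intercept and slope estimates as a named vector
coef(mod)

# predicted highway mpg for a hypothetical car with 20 city mpg
pred <- predict(mod, newdata = data.frame(cty = 20))
pred

# in simple linear regression, R-squared is the squared correlation
all.equal(summary(mod)$r.squared, cor(mpg$cty, mpg$hwy)^2)
```

With the estimates shown above, the prediction works out to roughly 0.892 + 1.337 × 20, or about 27.6 highway mpg.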