Session 3 Tutorial

Ok, let's finally do some stats:

explore datasets
scatterplots, histograms, boxplots …
simple regressions

Exploring the mtcars dataset

From ?mtcars: The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Variables:

mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (lb/1000)
qsec ¼ mile time
vs V/S
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors

Let's suppose we're interested in economical cars. mpg is our main variable of interest.

# let's have brief look at the data.frame
head(mtcars, 3)

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

# number of observations
nrow(mtcars)

## [1] 32

# summary of variables
summary(mtcars)

##       mpg            cyl            disp             hp       
##  Min.   :10.4   Min.   :4.00   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.2   Median :6.00   Median :196.3   Median :123.0  
##  Mean   :20.1   Mean   :6.19   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.9   Max.   :8.00   Max.   :472.0   Max.   :335.0  
##       drat            wt            qsec            vs       
##  Min.   :2.76   Min.   :1.51   Min.   :14.5   Min.   :0.000  
##  1st Qu.:3.08   1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000  
##  Median :3.69   Median :3.33   Median :17.7   Median :0.000  
##  Mean   :3.60   Mean   :3.22   Mean   :17.8   Mean   :0.438  
##  3rd Qu.:3.92   3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000  
##  Max.   :4.93   Max.   :5.42   Max.   :22.9   Max.   :1.000  
##        am             gear           carb     
##  Min.   :0.000   Min.   :3.00   Min.   :1.00  
##  1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00  
##  Median :0.000   Median :4.00   Median :2.00  
##  Mean   :0.406   Mean   :3.69   Mean   :2.81  
##  3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :1.000   Max.   :5.00   Max.   :8.00

# how is mpg distributed?  quartiles
quantile(mtcars$mpg)

##    0%   25%   50%   75%  100% 
## 10.40 15.43 19.20 22.80 33.90

The first quartile contains a quarter of all observations. With 32 cars in the dataset, this means that the mpg of 8 cars must be equal or smaller than 15.425. Let's create a decentile using the probs argument. Remember: If you don't understand a line of R code, run the subparts separately starting from the inside out.

quantile(mtcars$mpg, probs = seq(0, 1, by = 0.1))

##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
## 10.40 14.34 15.20 15.98 17.92 19.20 21.00 21.47 24.08 30.09 33.90

Task 1

List the 25% most economical cars. That is, 75 % of the cars in the mtcars data have a lower mpg value. Hint: Use quantile() and subset(). Why are there 9 cars?

subset(mtcars, mpg >= quantile(mtcars$mpg)[4])

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Visualization

# scatterplot
plot(mtcars$mpg)

plot of chunk unnamed-chunk-4

# histogram
hist(mtcars$mpg)

plot of chunk unnamed-chunk-4

hist(mtcars$mpg, breaks = 10)

plot of chunk unnamed-chunk-4

# boxplot
boxplot(mtcars$mpg)

plot of chunk unnamed-chunk-4

Task 2

Make yourself familiar with the above plots by drawing them on your own and lookup how to give them proper titles.

Vectorization in R

Let's use another example dataset called women which measures weight and height of American women.

# Seems like these women are measured in inches...
summary(women)

##      height         weight   
##  Min.   :58.0   Min.   :115  
##  1st Qu.:61.5   1st Qu.:124  
##  Median :65.0   Median :135  
##  Mean   :65.0   Mean   :137  
##  3rd Qu.:68.5   3rd Qu.:148  
##  Max.   :72.0   Max.   :164

# let's use the cm() function to turn inches into cm
women$height <- cm(women$height)
# and lbs to kg...
women$weight <- women$weight * 0.45359237
women

##    height weight
## 1   147.3  52.16
## 2   149.9  53.07
## 3   152.4  54.43
## 4   154.9  55.79
## 5   157.5  57.15
## 6   160.0  58.51
## 7   162.6  59.87
## 8   165.1  61.23
## 9   167.6  63.05
## 10  170.2  64.41
## 11  172.7  66.22
## 12  175.3  68.04
## 13  177.8  69.85
## 14  180.3  72.12
## 15  182.9  74.39

summary(women)

##      height        weight    
##  Min.   :147   Min.   :52.2  
##  1st Qu.:156   1st Qu.:56.5  
##  Median :165   Median :61.2  
##  Mean   :165   Mean   :62.0  
##  3rd Qu.:174   3rd Qu.:67.1  
##  Max.   :183   Max.   :74.4

The calculations above are an example of how vecotrization can be used to avoid loops. If it wasn't for the vector you would have to use a for loop to iterate over all elements of the vector.

Multiple Variables

The following examples show how to investigate different relationships. We can investigate correlation structures, use categorical variables to group out dataset by.

important note: In R categorical variables are called factors which has nothing to do with factors as in factor analysis. In the term is rather used to distinguish this special kind of non-numeric variable from characters / strings.

# a scatter plot
plot(women)  # wow, textbook linearity...

plot of chunk unnamed-chunk-6


# or use a categorical variable to group by using the formula syntax
# displacement by automatic, 1 = manual
boxplot(disp ~ am, data = mtcars, notch = T)

plot of chunk unnamed-chunk-6

The notches do not overlap. This indicates that group medians are different. Let's do t.test for the group means as well.

t.test(disp ~ am, data = mtcars)

## 
##  Welch Two Sample t-test
## 
## data:  disp by am
## t = 4.198, df = 29.26, p-value = 0.00023
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   75.33 218.37
## sample estimates:
## mean in group 0 mean in group 1 
##           290.4           143.5

As expected the t.test rejects the hypothesis of equal group means. We can show a similar result by using a simple linear regression with a single binary regressor.

fit1 <- lm(disp ~ am, data = mtcars)
summary(fit1)

## 
## Call:
## lm(formula = disp ~ am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -170.3  -65.0  -14.6   62.1  207.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    290.4       23.3   12.46  2.2e-13 ***
## am            -146.8       36.6   -4.02  0.00037 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102 on 30 degrees of freedom
## Multiple R-squared:  0.35,   Adjusted R-squared:  0.328 
## F-statistic: 16.1 on 1 and 30 DF,  p-value: 0.000366

# notice the use of the summary method here ?
class(fit1)

## [1] "lm"

Note that group mean in group 0 is equal to the intercept in the regression. Also 290.38-146.85 = 143.53 which is the group mean of the other group. We will look at regression more comprehensively at a later stage.

Correlation

Correlation can easily be calculated using cor for two or more numeric variables. Remember: Correlation = Covariance / max. possible Covariance.

\[ \rho = \frac{\frac{\sum(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{n}} {\sqrt{\frac{\sum(X_{i}-\bar{X})^{2}}{n} }\sqrt{\frac{\sum(Y_{i}-\bar{Y})^{2}}{n} } } \]

# for two variables
cor(mtcars$mpg, mtcars$disp)

## [1] -0.8476

But Beware: Correlation Does Not Imply Causality Some entertaining spurious correlation examples can be found here: Mozzarella Cheese and Civil Engineering

Correlation Structures


# for the entire dataset
cor_cars <- cor(mtcars)

# visualize correlation in a heatmap
image(cor(mtcars[, 1:7]), col = heat.colors(256), xaxt = "n", yaxt = "n", main = "Correlation Heatmap")

# add some more meaningful axis
axis(1, at = seq(0, 1, length.out = ncol(mtcars[, 1:7])), labels = colnames(mtcars[, 
    1:7]))

# las rotates labels
axis(2, at = seq(0, 1, length.out = ncol(mtcars[, 1:7])), labels = colnames(mtcars[, 
    1:7]), las = 2)

plot of chunk unnamed-chunk-10

Task 3: Learning by Doing: Explore a New Dataset on Your Own

Task 10 is meant to be a litte more advanced and combines many aspects you have learnt today.

download the dataset mlb.RData from ILIAS and load it into your project
How many records are there in the dataset, what are the columns of dataset
Find the tallest player
Create a copy of the data called mlb_eu_measure and transform lbs to kg and inches to cm. Hint: use cm() and the fact that lbs*0.45359237 is one kg.
Find the heaviest player for each position!
Which is the oldest team on average? hint: Remember lapply from earlier today and use sapply in a similar fashion. Make use of ?sapply !
Which position has the youngest players on average?
What would be a good way to visualize differences in group means by position? E.g.: Age by Position. Is there a position on which players last longer?
What about size and weight by position?

!! *10. Write a function that compares the mean height of Starting Pitchers pairwise to another position by running a t-test. Why could such a function by useful? * !!

Correlation in the baseball dataset

load("data/mlb.RData")  # data was taken 
# from http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights
mlb <- na.omit(mlb)
cor(mlb$weight, mlb$height)

## [1] 0.5319

cor(mlb[4:6])

##          height weight      age
## height  1.00000 0.5319 -0.07385
## weight  0.53189 1.0000  0.15828
## age    -0.07385 0.1583  1.00000

# store the square of the correlation
Rsq <- cor(mlb$weight, mlb$height)^2
Rsq

## [1] 0.2829

Run regression to estimate players' weight from their height

summary(lm(weight ~ height, data = mlb))

## 
## Call:
## lm(formula = weight ~ height, data = mlb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.99 -13.15   1.22  11.69  70.01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -155.09      17.70   -8.76   <2e-16 ***
## height          4.84       0.24   20.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.8 on 1031 degrees of freedom
## Multiple R-squared:  0.283,  Adjusted R-squared:  0.282 
## F-statistic:  407 on 1 and 1031 DF,  p-value: <2e-16

Note: For univariate case (only one variable on the right side) the R squared is equal to the square of the Pearson correlation coefficient.

Regressions with multiple variables

summary(lm(weight ~ age + height + position, data = mlb))

## 
## Call:
## lm(formula = weight ~ age + height + position, data = mlb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.44 -10.83  -0.31  10.10  75.29 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -169.816     19.012   -8.93  < 2e-16 ***
## age                          0.879      0.122    7.22  1.0e-12 ***
## height                       4.788      0.252   18.96  < 2e-16 ***
## positionDesignated_Hitter    8.660      4.417    1.96  0.05018 .  
## positionFirst_Baseman        2.748      2.988    0.92  0.35800    
## positionOutfielder          -6.052      2.273   -2.66  0.00788 ** 
## positionRelief_Pitcher      -7.816      2.187   -3.57  0.00037 ***
## positionSecond_Baseman     -12.994      2.947   -4.41  1.1e-05 ***
## positionShortstop          -16.461      3.031   -5.43  7.0e-08 ***
## positionStarting_Pitcher    -7.615      2.293   -3.32  0.00093 ***
## positionThird_Baseman       -4.142      3.159   -1.31  0.19008    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 1022 degrees of freedom
## Multiple R-squared:  0.367,  Adjusted R-squared:  0.361 
## F-statistic: 59.3 on 10 and 1022 DF,  p-value: <2e-16

# what's the reference category of position
levels(mlb$position)  # Catcher

## [1] "Catcher"           "Designated_Hitter" "First_Baseman"    
## [4] "Outfielder"        "Relief_Pitcher"    "Second_Baseman"   
## [7] "Shortstop"         "Starting_Pitcher"  "Third_Baseman"