Ok, let's finally do some stats:
From ?mtcars: The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Variables:
Let's suppose we're interested in economical cars. mpg is our main variable of interest.
# let's have brief look at the data.frame
head(mtcars, 3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# number of observations
nrow(mtcars)
## [1] 32
# summary of variables
summary(mtcars)
## mpg cyl disp hp
## Min. :10.4 Min. :4.00 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.2 Median :6.00 Median :196.3 Median :123.0
## Mean :20.1 Mean :6.19 Mean :230.7 Mean :146.7
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.9 Max. :8.00 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.76 Min. :1.51 Min. :14.5 Min. :0.000
## 1st Qu.:3.08 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000
## Median :3.69 Median :3.33 Median :17.7 Median :0.000
## Mean :3.60 Mean :3.22 Mean :17.8 Mean :0.438
## 3rd Qu.:3.92 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000
## Max. :4.93 Max. :5.42 Max. :22.9 Max. :1.000
## am gear carb
## Min. :0.000 Min. :3.00 Min. :1.00
## 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.00
## Median :0.000 Median :4.00 Median :2.00
## Mean :0.406 Mean :3.69 Mean :2.81
## 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :1.000 Max. :5.00 Max. :8.00
# how is mpg distributed? quartiles
quantile(mtcars$mpg)
## 0% 25% 50% 75% 100%
## 10.40 15.43 19.20 22.80 33.90
The first quartile contains a quarter of all observations. With 32 cars in the dataset, this means that the mpg of 8 cars must be equal or smaller than 15.425. Let's create a decentile using the probs argument. Remember: If you don't understand a line of R code, run the subparts separately starting from the inside out.
quantile(mtcars$mpg, probs = seq(0, 1, by = 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 10.40 14.34 15.20 15.98 17.92 19.20 21.00 21.47 24.08 30.09 33.90
List the 25% most economical cars. That is, 75 % of the cars in the mtcars data have a lower mpg value. Hint: Use quantile() and subset(). Why are there 9 cars?
subset(mtcars, mpg >= quantile(mtcars$mpg)[4])
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# scatterplot
plot(mtcars$mpg)
# histogram
hist(mtcars$mpg)
hist(mtcars$mpg, breaks = 10)
# boxplot
boxplot(mtcars$mpg)
Make yourself familiar with the above plots by drawing them on your own and lookup how to give them proper titles.
Let's use another example dataset called women which measures weight and height of American women.
# Seems like these women are measured in inches...
summary(women)
## height weight
## Min. :58.0 Min. :115
## 1st Qu.:61.5 1st Qu.:124
## Median :65.0 Median :135
## Mean :65.0 Mean :137
## 3rd Qu.:68.5 3rd Qu.:148
## Max. :72.0 Max. :164
# let's use the cm() function to turn inches into cm
women$height <- cm(women$height)
# and lbs to kg...
women$weight <- women$weight * 0.45359237
women
## height weight
## 1 147.3 52.16
## 2 149.9 53.07
## 3 152.4 54.43
## 4 154.9 55.79
## 5 157.5 57.15
## 6 160.0 58.51
## 7 162.6 59.87
## 8 165.1 61.23
## 9 167.6 63.05
## 10 170.2 64.41
## 11 172.7 66.22
## 12 175.3 68.04
## 13 177.8 69.85
## 14 180.3 72.12
## 15 182.9 74.39
summary(women)
## height weight
## Min. :147 Min. :52.2
## 1st Qu.:156 1st Qu.:56.5
## Median :165 Median :61.2
## Mean :165 Mean :62.0
## 3rd Qu.:174 3rd Qu.:67.1
## Max. :183 Max. :74.4
The calculations above are an example of how vecotrization can be used to avoid loops. If it wasn't for the vector you would have to use a for loop to iterate over all elements of the vector.
The following examples show how to investigate different relationships. We can investigate correlation structures, use categorical variables to group out dataset by.
important note: In R categorical variables are called factors which has nothing to do with factors as in factor analysis. In the term is rather used to distinguish this special kind of non-numeric variable from characters / strings.
# a scatter plot
plot(women) # wow, textbook linearity...
# or use a categorical variable to group by using the formula syntax
# displacement by automatic, 1 = manual
boxplot(disp ~ am, data = mtcars, notch = T)
The notches do not overlap. This indicates that group medians are different. Let's do t.test for the group means as well.
t.test(disp ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: disp by am
## t = 4.198, df = 29.26, p-value = 0.00023
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 75.33 218.37
## sample estimates:
## mean in group 0 mean in group 1
## 290.4 143.5
As expected the t.test rejects the hypothesis of equal group means. We can show a similar result by using a simple linear regression with a single binary regressor.
fit1 <- lm(disp ~ am, data = mtcars)
summary(fit1)
##
## Call:
## lm(formula = disp ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.3 -65.0 -14.6 62.1 207.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 290.4 23.3 12.46 2.2e-13 ***
## am -146.8 36.6 -4.02 0.00037 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102 on 30 degrees of freedom
## Multiple R-squared: 0.35, Adjusted R-squared: 0.328
## F-statistic: 16.1 on 1 and 30 DF, p-value: 0.000366
# notice the use of the summary method here ?
class(fit1)
## [1] "lm"
Note that group mean in group 0 is equal to the intercept in the regression. Also 290.38-146.85 = 143.53 which is the group mean of the other group. We will look at regression more comprehensively at a later stage.
Correlation can easily be calculated using cor for two or more numeric variables. Remember: Correlation = Covariance / max. possible Covariance.
\[ \rho = \frac{\frac{\sum(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{n}} {\sqrt{\frac{\sum(X_{i}-\bar{X})^{2}}{n} }\sqrt{\frac{\sum(Y_{i}-\bar{Y})^{2}}{n} } } \]
# for two variables
cor(mtcars$mpg, mtcars$disp)
## [1] -0.8476
But Beware: Correlation Does Not Imply Causality Some entertaining spurious correlation examples can be found here: Mozzarella Cheese and Civil Engineering
# for the entire dataset
cor_cars <- cor(mtcars)
# visualize correlation in a heatmap
image(cor(mtcars[, 1:7]), col = heat.colors(256), xaxt = "n", yaxt = "n", main = "Correlation Heatmap")
# add some more meaningful axis
axis(1, at = seq(0, 1, length.out = ncol(mtcars[, 1:7])), labels = colnames(mtcars[,
1:7]))
# las rotates labels
axis(2, at = seq(0, 1, length.out = ncol(mtcars[, 1:7])), labels = colnames(mtcars[,
1:7]), las = 2)
Task 10 is meant to be a litte more advanced and combines many aspects you have learnt today.
download the dataset mlb.RData from ILIAS and load it into your project
How many records are there in the dataset, what are the columns of dataset
Find the tallest player
Create a copy of the data called mlb_eu_measure and transform lbs to kg and inches to cm. Hint: use cm() and the fact that lbs*0.45359237 is one kg.
Find the heaviest player for each position!
Which is the oldest team on average? hint: Remember lapply from earlier today and use sapply in a similar fashion. Make use of ?sapply !
Which position has the youngest players on average?
What would be a good way to visualize differences in group means by position? E.g.: Age by Position. Is there a position on which players last longer?
What about size and weight by position?
!! *10. Write a function that compares the mean height of Starting Pitchers pairwise to another position by running a t-test. Why could such a function by useful? * !!
load("data/mlb.RData") # data was taken
# from http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights
mlb <- na.omit(mlb)
cor(mlb$weight, mlb$height)
## [1] 0.5319
cor(mlb[4:6])
## height weight age
## height 1.00000 0.5319 -0.07385
## weight 0.53189 1.0000 0.15828
## age -0.07385 0.1583 1.00000
# store the square of the correlation
Rsq <- cor(mlb$weight, mlb$height)^2
Rsq
## [1] 0.2829
Run regression to estimate players' weight from their height
summary(lm(weight ~ height, data = mlb))
##
## Call:
## lm(formula = weight ~ height, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.99 -13.15 1.22 11.69 70.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -155.09 17.70 -8.76 <2e-16 ***
## height 4.84 0.24 20.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.8 on 1031 degrees of freedom
## Multiple R-squared: 0.283, Adjusted R-squared: 0.282
## F-statistic: 407 on 1 and 1031 DF, p-value: <2e-16
Note: For univariate case (only one variable on the right side) the R squared is equal to the square of the Pearson correlation coefficient.
summary(lm(weight ~ age + height + position, data = mlb))
##
## Call:
## lm(formula = weight ~ age + height + position, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.44 -10.83 -0.31 10.10 75.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -169.816 19.012 -8.93 < 2e-16 ***
## age 0.879 0.122 7.22 1.0e-12 ***
## height 4.788 0.252 18.96 < 2e-16 ***
## positionDesignated_Hitter 8.660 4.417 1.96 0.05018 .
## positionFirst_Baseman 2.748 2.988 0.92 0.35800
## positionOutfielder -6.052 2.273 -2.66 0.00788 **
## positionRelief_Pitcher -7.816 2.187 -3.57 0.00037 ***
## positionSecond_Baseman -12.994 2.947 -4.41 1.1e-05 ***
## positionShortstop -16.461 3.031 -5.43 7.0e-08 ***
## positionStarting_Pitcher -7.615 2.293 -3.32 0.00093 ***
## positionThird_Baseman -4.142 3.159 -1.31 0.19008
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 1022 degrees of freedom
## Multiple R-squared: 0.367, Adjusted R-squared: 0.361
## F-statistic: 59.3 on 10 and 1022 DF, p-value: <2e-16
# what's the reference category of position
levels(mlb$position) # Catcher
## [1] "Catcher" "Designated_Hitter" "First_Baseman"
## [4] "Outfielder" "Relief_Pitcher" "Second_Baseman"
## [7] "Shortstop" "Starting_Pitcher" "Third_Baseman"