This document, which contains links to various materials, is essentially the lecture equivalent for Module 1 in our hybrid class. Our face-to-face meeting time will be spent primarily on questions, group discussions and individual troubleshooting.
A MOOC is a massive open online course. We are going to make use of some parts of a MOOC created by the staff of Facebook and available for free on the Udacity system. This will give you an idea of how statistics and data analysis are used in the modern world. It also contains some very detailed instructions on getting the software we will use in this course installed on your computer. Let me walk you through the process of getting started there. Click Here for the walkthrough. Then click Here to get started with Udacity yourself.
Now you should proceed with installing R and RStudio on your own computer. The website where you obtain R itself is here. The website for RStudio is here. You want the open source desktop edition.
To complete your reading assignments you will need to use a snipping tool.
All versions of Windows have the Windows Snipping Tool installed. Here is a collection of YouTube videos on the Windows 7 Snipping Tool. And here is a collection on the Windows 8 Snipping Tool.
For Mac users, the free app Skitch has equivalent (and more) functionality. You can get it from the App Store. Here is a collection of YouTube videos on Skitch. Note that PC users could also use Skitch instead of the Snipping Tool.
You will need to learn some very basic things about the statistical software we will be using in this course. There is a free set of lessons offered by the O’Reilly Code School. You will also learn basic statistical concepts, such as mean, median, standard deviation and histograms. At the end of the first reading assignment, you will need to submit your badge for Chapter 6. Before you begin the lessons, you should click on the three horizontal bars in the upper left corner. Follow the links to create a free account. Having an account allows you to save your work between sessions. If you don’t do this, you can still do the lessons, but your work will be lost when you sign off.
There are three basic questions you need to answer to explore a single quantitative variable.
Let’s create some artificial data to explore these ideas.
x = rnorm(1000)
That created a vector x with 1,000 numbers drawn from a normal distribution with mean = 0 and standard deviation = 1. Note that nothing seems to have happened. It is worth remembering that in R creating something doesn’t automatically display it. But we can do many things to explore x.
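Because assignment does not print anything, a few quick commands are handy for confirming what x contains. The sketch below shows one way to do that. The call to set.seed(), the seed value 123 and the extra vector y are additions for illustration and are not part of the original example. Since the numbers are random, your results will differ from those printed in this document either way.

```r
# Assumption: fixing the seed (123 is arbitrary) makes the random draw repeatable
set.seed(123)
x = rnorm(1000)

# head() displays only the first few values instead of all 1,000
head(x)

# length() confirms how many numbers were created
length(x)

# Wrapping an assignment in parentheses creates the object and displays it
# (y is a hypothetical extra vector used only for this demonstration)
(y = rnorm(5))
```

Typing the bare name x at the prompt would also display the vector, but for 1,000 values head() is usually easier to read.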
Look at its actual measures of location and variation to see if they are close to what we would expect from the way we created x.
mean(x)
## [1] -0.02628583
median(x)
## [1] -0.07384335
sd(x)
## [1] 1.010686
range(x)
## [1] -2.927826 2.949560
max(x) - min(x)
## [1] 5.877386
IQR(x)
## [1] 1.411251
Now let’s change x and see what happens. First add 100 to each number in the vector.
xplus100 = x + 100
Now look at our measures of the new version of x.
mean(xplus100)
## [1] 99.97371
median(xplus100)
## [1] 99.92616
sd(xplus100)
## [1] 1.010686
range(xplus100)
## [1] 97.07217 102.94956
max(xplus100) - min(xplus100)
## [1] 5.877386
IQR(xplus100)
## [1] 1.411251
This should have changed the location measures, increasing them by 100, but it should leave the variation numbers alone. Is this what happened?
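Instead of comparing the printed numbers by eye, you can ask R to make the comparison. This is a minimal sketch, assuming a freshly generated x. It uses all.equal() rather than == because floating-point arithmetic can introduce tiny rounding differences, and diff(range(x)) is simply another way of writing max(x) - min(x).

```r
x = rnorm(1000)
xplus100 = x + 100

# Location measures should move by exactly the amount added
all.equal(mean(xplus100), mean(x) + 100)
all.equal(median(xplus100), median(x) + 100)

# Variation measures should be unaffected by the shift
all.equal(sd(xplus100), sd(x))
all.equal(IQR(xplus100), IQR(x))
all.equal(diff(range(xplus100)), diff(range(x)))
```

Each comparison is expected to return TRUE if the claim about location and variation holds.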
What about graphical displays of x? The standard displays are the histogram and the boxplot.
Now let’s multiply x by 100 and see what happens to our measures of location and variation.
xtimes100 = x * 100
mean(xtimes100)
## [1] -2.628583
median(xtimes100)
## [1] -7.384335
sd(xtimes100)
## [1] 101.0686
range(xtimes100)
## [1] -292.7826 294.9560
max(xtimes100) - min(xtimes100)
## [1] 587.7386
IQR(xtimes100)
## [1] 141.1251
I’ll leave it as an exercise for you to describe the relationships between the measures of location and variation for x and those for xtimes100.
The standard graphical displays for quantitative variables are the histogram and the boxplot. To make the boxplot comparable with the histogram, you may want to ask that it be laid out horizontally rather than vertically (vertical is the default). The command summary() produces the key numerical results displayed in the boxplot.
hist(x)
boxplot(x,horizontal=TRUE)
summary(x)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -2.92800 -0.72300 -0.07384 -0.02629  0.68820  2.95000
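The link between summary() and the boxplot can be made explicit. The sketch below, assuming a freshly generated x, extracts the quartiles with quantile() and then asks boxplot.stats() for the five values the boxplot actually draws. The variable name q is chosen here for illustration.

```r
x = rnorm(1000)

# Quartiles: the 25th, 50th and 75th percentiles
q = quantile(x, probs = c(0.25, 0.50, 0.75))
q

# The IQR is the distance between the third and first quartiles
unname(q[3] - q[1])
IQR(x)

# The five values drawn by the boxplot: lower whisker end, lower hinge,
# median, upper hinge, upper whisker end
boxplot.stats(x)$stats
```

The hinges used by the boxplot can differ slightly from the quartiles reported by summary(), and the whisker ends match the minimum and maximum only when no points are flagged as outliers.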
These graphical displays of x show a very conventional symmetric distribution with a single central peak. It is useful to look at some other examples to see a few possibilities. Let’s generate some uniformly distributed numbers between 5 and 10 and create the graphical displays.
flatones = runif(1000,min=5,max=10)
summary(flatones)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   5.001   6.244   7.527   7.502   8.778   9.999
boxplot(flatones,horizontal=TRUE)
hist(flatones)
Do you see what you expected to see? Is this distribution symmetric? Is there a noticeable peak? Are there outliers?
Let’s try some values drawn from a Chi-squared distribution with 10 degrees of freedom. Don’t worry about what this means. Just concentrate on the shape questions.
cq = rchisq(1000,df=10)
hist(cq)
boxplot(cq,horizontal=TRUE)
Is this distribution symmetric? Is there a noticeable peak? Are there outliers?
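For the outlier question, it helps to know the rule that boxplot() applies by default: a point is drawn individually when it lies more than 1.5 times the IQR beyond the edges of the box. The following is a rough sketch of that rule, assuming a freshly generated cq. It uses quartiles from quantile() in place of the hinges that the boxplot uses, so the two counts can occasionally differ by a point or two. The names q1, q3, lower_fence and upper_fence are made up for this illustration.

```r
cq = rchisq(1000, df = 10)

# First and third quartiles
q1 = quantile(cq, 0.25)
q3 = quantile(cq, 0.75)

# Fences placed 1.5 IQRs beyond the first and third quartiles
lower_fence = q1 - 1.5 * (q3 - q1)
upper_fence = q3 + 1.5 * (q3 - q1)

# Count the values that fall outside the fences
sum(cq < lower_fence | cq > upper_fence)

# The number of points boxplot() itself would draw as outliers
length(boxplot.stats(cq)$out)
```

The same steps can be applied to x or flatones to compare how many points each distribution places beyond the fences.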
We’ll look at the mtcars dataset, which is included with the base distribution of R as a data frame. First we’ll run a few standard commands that are useful for examining a new data frame when we know nothing but its name.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
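A few other base R commands are commonly used alongside str() and summary() when meeting a data frame for the first time. The following is a sketch of some options, not a required routine.

```r
# Number of rows and columns
dim(mtcars)

# Variable names
names(mtcars)

# First six rows, including the row names (the car models)
head(mtcars)

# Extract one variable with $ and summarize it on its own
summary(mtcars$mpg)

# In an interactive session, typing ?mtcars opens the help page
# that describes each variable in the dataset
```

The $ operator shown here is how a single variable is pulled out of a data frame.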
Note that some of the variables here are stored as numbers but are categorical in nature. One example is ‘am,’ which tells us whether the car has an automatic (am = 0) or manual (am = 1) transmission. To create a variable that R will treat as categorical, we need to run a special command.
TranType = as.factor(mtcars$am)
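The levels of TranType are still the codes 0 and 1. If you would rather see descriptive labels in tables and plots, factor() accepts a labels argument. This is a sketch of that option. The name TranTypeLabeled is hypothetical, and the label text follows the coding described above (0 = automatic, 1 = manual).

```r
# Hypothetical variable name; labels follow am = 0 (automatic), am = 1 (manual)
TranTypeLabeled = factor(mtcars$am,
                         levels = c(0, 1),
                         labels = c("automatic", "manual"))

# The levels are now words instead of codes
levels(TranTypeLabeled)

# Counts are reported under the descriptive labels
table(TranTypeLabeled)
```

The counts are the same as those for TranType. Only the names attached to the categories change.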
Now we can run the standard commands to explore a categorical variable.
# Get simple counts of each categorical value
table(TranType)
## TranType
##  0  1
## 19 13
# Get proportions of each categorical value
table(TranType)/length(TranType)
## TranType
##       0       1
## 0.59375 0.40625
# Create a barplot of the values
barplot(table(TranType))
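Base R also provides prop.table(), which converts a table of counts to proportions directly and gives the same result as dividing by length(). The following is a sketch of that alternative. The name props, the axis labels and the title are chosen here for illustration.

```r
TranType = as.factor(mtcars$am)

# Proportions computed directly from the table of counts
props = prop.table(table(TranType))
props

# The proportions should add up to 1
sum(props)

# Barplot of proportions, with axis labels and a title added
barplot(props,
        xlab = "am (0 = automatic, 1 = manual)",
        ylab = "Proportion of cars",
        main = "Transmission type in mtcars")
```

Plotting proportions rather than counts makes it easier to compare groups of different sizes.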