Source file ⇒ 2017-lec1.Rmd

Today

  1. Introduction to Stat 133
  2. DC chap 1: Tidy Data
  3. DC chap 2: Computing with R

Introduction to Stat 133

Stat 133 student testimonial: “Not to sound overeager, but learning R in 133 with you last year was the most valuable educational experience I’ve had at Cal! I am very interested in the lab assistant position, how can I learn more?”

I will communicate with you about assignments through b-courses.

Syllabus

Assignments

Lecture Notes

Student-run Piazza site (you should have gotten an invite)

Data Camp

Lab starts this week! You must go to your assigned lab. Attendance is part of your lab grade. This week is special: in lab you will install R, RStudio, and DataComputing, and do some DataCamp.

My office hours are T, Th 10-11 in Evans 449. I am very good with email (alucas@berkeley.edu)

Get i-clicker by next Tuesday (participation points)

Final exam rescheduled for Monday May 8 at 3pm. If you have another final at the same time, we can find an alternate time.

I don’t care which lecture you attend; they are equivalent.

Tidy Data (chapter 1 of DC)

Examples of Tidy data:

Imagine a data set with three variables: name, trt, and result.

name has three values (John, Mary, and Jane)

trt has two values (a and b)

result has six values (-, 16, 3, 2, 11, 1), where - marks a missing result

When we display the data set so that the columns are our variables and the rows are observations, we call the data set a data table or a data frame. Data tables make it easy to analyze and visualize data because they provide a standard way of structuring a data set. For this reason we call the data in a data table tidy data.

For example:

The data in the table above is tidy because the data is organized following two simple rules:

  1. The rows, called cases, each refer to a specific, unique and similar sort of thing. For example, the treatment and result of a particular patient. There can be two John Smiths, but not with the same treatment and result.

  2. The columns, called variables, each have the same sort of value recorded for each row. For example, trt is categorical (a or b) and result is numerical.
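The two rules can be checked on a small data frame. This is only a sketch: the values come from the example above, but which result belongs to which person and treatment is an assumption, and the missing result (-) is stored as NA.

```r
# Hypothetical reconstruction of the example table.
# The pairing of names, treatments and results is assumed.
patients <- data.frame(
  name   = c("John", "Jane", "Mary", "John", "Jane", "Mary"),
  trt    = c("a", "a", "a", "b", "b", "b"),
  result = c(NA, 16, 3, 2, 11, 1),   # NA stands for the missing "-"
  stringsAsFactors = FALSE
)

# Rule 1: each row is one case (one person under one treatment),
# so no name/trt combination should appear twice.
has_duplicate_cases <- any(duplicated(patients[, c("name", "trt")]))
has_duplicate_cases

# Rule 2: each column holds one sort of value.
column_types <- sapply(patients, class)
column_types
```

Here has_duplicate_cases is FALSE, and column_types reports character for name and trt and numeric for result.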

Notice that tidy data isn’t usually concise. You might see the data represented more concisely as below.

This isn’t a data table (i.e., the data isn’t tidy).

In this example the columns (person, treatmenta, treatmentb) are not all variables. treatmenta and treatmentb are values of the variable trt; they aren’t variables themselves.

Why is tidy data useful? When data is tidy we can do vectorized operations very fast. Suppose we wanted to create a new variable results_squared holding the square of each result. This isn’t easy to do with messy data, because the results are spread over several columns. We will learn in this course how to make messy data tidy.
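A minimal sketch of that idea in base R. The messy table below is hypothetical (the pairing of values with people is assumed), and stacking the columns by hand is just one simple way to tidy it, not necessarily the method used later in the course.

```r
# Hypothetical messy (wide) table: one column per treatment.
messy <- data.frame(
  person     = c("John", "Jane", "Mary"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1),
  stringsAsFactors = FALSE
)

# Tidy version: stack the two treatment columns into a single
# result column and record the treatment in its own variable.
tidy <- data.frame(
  name   = rep(messy$person, times = 2),
  trt    = rep(c("a", "b"), each = nrow(messy)),
  result = c(messy$treatmenta, messy$treatmentb),
  stringsAsFactors = FALSE
)

# With tidy data the new variable is one vectorized operation.
tidy$results_squared <- tidy$result^2
tidy
```

In the messy table the same computation would have to be repeated for treatmenta and treatmentb separately.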

Is the following data set tidy? Why or why not?

i-clicker questions

In class exercises

DC chapter 1 exercises

Computing with R (chapter 2 of DC)

Let’s look at the BabyNames data table

Open RStudio

BabyNames in DataComputing package

You must install and then load the DataComputing package.

# to install (needed only once per computer)
install.packages("devtools")
devtools::install_github("DataComputing/DataComputing")
# to load (needed in every new R session)
library(DataComputing)

To look at data in DataComputing:

data(package="DataComputing")
Data sets in DataComputing
Item                  Title
BabyNames             Names of children as recorded by the US Social Security Administration.
CountryCentroids      Geographic locations of countries
CountryData           Many variables on countries from the CIA factbook, 2014.
CountryGroups         Membership in Country Groups
DirectRecoveryGroups  Descriptions of the Direct Recovery Groups (DRGs) in the Medicare data.
HappinessIndex        World Happiness Report Data
MedicareCharges       Charges to and Payments from Medicare
MedicareProviders     Medicare Providers
MigrationFlows        Human Migration between Countries
Minneapolis2013       Ballots in the 2013 Mayoral election in Minneapolis
NCHS                  Health Statistics Data 1999-2004
NCI60                 Gene expression in cancer.
NCI60cells            Cell Line descriptions in the NCI-60 dataset
OrdwayBirds           Birds captured and released at Ordway, complete and uncleaned
OrdwaySpeciesNames    Corrected Species Names for the Ordway Birds
RegisteredVoters      A sample of the voter registration list for Wake County, North Carolina in Fall 2010.
WorldCities           Cities and their populations
ZipDemography         Demographic information for most US ZIP Codes (Postal Codes)
ZipGeography          Geographic information by US Zip Codes (Postal Codes)

To look at the codebook:

help(BabyNames)
# or
?BabyNames
help(BabyNames)
BabyNames R Documentation

Names of children as recorded by the US Social Security Administration.

Description

The US Social Security Administration provides yearly lists of names given to babies. These data combine the yearly lists.

BabyNames is the raw data from the SSA. The case is a year-name-sex, for example: Jane F 1922. The count is the number of children of that sex given that name in that year. Names assigned to fewer than five children of one sex in any year are not listed, presumably out of privacy concerns.

Usage

data(BabyNames)

Format

BabyNames consists of 1,792,091 entries, each of which has four variables:

name

The given name (character string)

sex

F or M (character string)

count

The number of babies given that name and of that sex. (integer)

year

Year of birth (integer)

Source

The data were compiled from the Social Security Administration web site: http://www.ssa.gov/oact/babynames/names.zip.

See Also

BabyNames

Examples

data(BabyNames)
str(BabyNames)

For commands it is helpful to look at the codebook to learn the arguments. For example, examine the command paste in the codebook.
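As a small sketch of what the codebook for paste tells you: the help page (?paste or help(paste)) describes the sep and collapse arguments, which the lines below try out. The input values are made up for illustration.

```r
# args() lists a function's arguments without opening the help page.
args(paste)

# sep is the separator placed between the pieces (a space by default).
default_sep <- paste("Jane", "F", 1922)
custom_sep  <- paste("Jane", "F", 1922, sep = "-")

# collapse joins the elements of one vector into a single string.
collapsed <- paste(c("a", "b"), collapse = "+")

default_sep
custom_sep
collapsed
```

The three results are "Jane F 1922", "Jane-F-1922", and "a+b".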

In DataCamp you will learn about vectors.

head(BabyNames$name) 
## [1] "Mary"      "Anna"      "Emma"      "Elizabeth" "Minnie"    "Margaret"

Nice properties of tidy data:

R does vectorized operations.

head(BabyNames$count*BabyNames$year)
## [1] 13282200  4895520  3765640  3645320  3282480  2966640

You can pick out the 100th name:

BabyNames$name[100]
## [1] "Emily"
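The same two ideas, vectorized arithmetic and picking out elements by position, can be tried without loading the package. The vectors below are a hypothetical stand-in for the first three rows of BabyNames: the names come from the output above, while the counts and years are inferred from the products shown, assuming year 1880.

```r
# Stand-in vectors (assumed values, see the note above).
baby_name  <- c("Mary", "Anna", "Emma")
baby_count <- c(7065, 2604, 2003)
baby_year  <- c(1880, 1880, 1880)

# Vectorized: the multiplication is done element by element.
products <- baby_count * baby_year
products

# Square brackets pick out elements by position.
second_name    <- baby_name[2]
first_and_last <- baby_name[c(1, 3)]
second_name
first_and_last
```

products matches the first three values printed by head(BabyNames$count*BabyNames$year) above.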

Objects in R

Just about everything in R is an object. Some objects have names and some don’t. For example 2 and "Adam" are objects that don’t have names. BabyNames and sqrt are named objects.

There are rules for names.

  1. Object names are never in quotes and they never begin with a digit.

  2. The name cannot contain any punctuation symbols with two exceptions (you can use . and _). So sqrt() isn’t a name.

  3. The case in the name matters so NCHS and NcHs are different names.
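One way to try these rules out, offered as a sketch rather than an official test: base R's make.names() turns any string into a valid name, so a string that comes back unchanged already follows the rules. The candidate names are made up for illustration.

```r
# Hypothetical candidate names to check against the rules.
candidates <- c("BabyNames", "my_data", "my.data", "2fast", "sqrt()")

# make.names() repairs invalid names; a valid name is left unchanged.
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates
is_valid

# Case matters: these are two different names.
same_name <- "NCHS" == "NcHs"
same_name
```

Only 2fast (begins with a digit) and sqrt() (contains parentheses) fail the check, and same_name is FALSE.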

The value of an object is what you get out when you type it or its name in the console.

You can assign a name to an object with the assignment command name <- BabyNames. This gives the object BabyNames a second name, name.
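A minimal sketch of assignment using a small made-up vector (scores and grades are hypothetical names, not part of the course data):

```r
# Give a name to an object that had none.
scores <- c(90, 85, 77)

# Give the same object a second name.
grades <- scores

# Typing a name at the console prints the object's value.
grades

# Both names refer to the same values.
same_value <- identical(scores, grades)
same_value
```

same_value is TRUE because both names give back the same three numbers.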

Question for you?

What do I get when I type name at the console?

Question for you?

What are object names and object values in the command below?

motors <- read.csv(file="http://tiny.cc/mosaic/engines.csv")
head(motors)
Engine              mass   ncylinder  strokes  displacement  bore  stroke  BHP   RPM
Webra Speedy        0.135  1          2        1.8           13.5  12.5    0.45  22000
Motori Cipolla      0.150  1          2        2.5           15.0  14.0    1.00  26000
Webra Speed 20      0.250  1          2        3.4           16.5  16.0    0.78  22000
Webra 40            0.270  1          2        6.5           21.0  19.0    0.96  15500
Webra 61 Blackhead  0.430  1          2        10.0          24.0  22.0    1.55  14000
Webra 6WR           0.490  1          2        10.0          24.0  22.0    2.76  19000

In class exercises

i-clicker questions

DC chapter 2 exercises