Source file ⇒ 2017-lec1.Rmd
Stat 133 student testimonial: “Not to sound overeager, but learning R in 133 with you last year was the most valuable educational experience I’ve had at Cal! I am very interested in the lab assistant position, how can I learn more?”
I will communicate with you about assignments through b-courses.
Student-run Piazza site (you should have gotten an invite)
Lab starts this week! You must go to your assigned lab. Attendance is part of your lab grade. This week is special: in lab you will install R, RStudio, and DataComputing, and do some DataCamp.
My OH are T,Th 10-11 in Evans 449. I am very good with email (alucas@berkeley.edu)
Get i-clicker by next Tuesday (participation points)
The final exam is rescheduled for Monday May 8 at 3pm. If you have another final at the same time, we can find an alternate time.
I don’t care which lecture you attend; they are equivalent.
Examples of Tidy data:
Imagine a data set with three variables: `name`, `trt`, and `result`.
`name` has three values (John, Mary, and Jane).
`trt` has two values (a and b).
`result` has six values (-, 16, 3, 2, 11, 1).
When we display the data set so that the columns are our variables and the rows are observations, we call it a data table or a data frame. Data tables make it easy to analyze and visualize data because they provide a standard way of structuring a data set. For this reason we call the data in a data table tidy data.
For example:
The data in the table above is tidy because the data is organized following two simple rules:
The rows, called cases, each refer to a specific, unique, and similar sort of thing, for example the treatment and result of a particular patient. There can be two people named John Smith, but not with the same treatment and result.
The columns, called variables, each have the same sort of value recorded for each row. For example, `trt` is categorical (a or b) and `result` is numerical.
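A tidy table like the one described above can be sketched in R as a data frame. The pairing of names, treatments, and results below is an assumption made for illustration (the notes list the values, not which value belongs to which row), and the missing result (-) is coded as `NA`.

```r
# Illustrative pairing only: the notes give the values, not the row assignments.
tidy <- data.frame(
  name   = c("John", "Jane", "Mary", "John", "Jane", "Mary"),
  trt    = c("a", "a", "a", "b", "b", "b"),
  result = c(NA, 16, 3, 2, 11, 1)   # "-" (missing) is stored as NA
)

tidy        # one row per case (a person-treatment pair)
str(tidy)   # one column per variable, each holding one sort of value
```

Each row is one case and each column is one variable, which is what the two rules require.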
Notice that tidy data isn’t usually concise. You might see the data represented more concisely as below.
This isn’t a data table (i.e., the data isn’t tidy).
In this example the columns (`person`, `treatmenta`, `treatmentb`) are not all variables. `treatmenta` and `treatmentb` are values of the variable `trt`; they aren’t variables themselves.
Why is tidy data useful? When data is tidy we can do vectorized operations very fast. Suppose we wanted to create a new variable `results_squared`. With tidy data this is a single operation on one column. This isn’t easy to do with messy data, because the results are spread over several columns. We will learn in this course how to make messy data tidy.
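As a rough sketch of the difference, the code below squares the results in a tidy and in a messy version of the treatment data. Both data frames are reconstructions for illustration: the row pairings are assumed, and the column names of the messy table follow the ones mentioned above.

```r
# Tidy version: one column holds every result (pairings are assumed).
tidy <- data.frame(
  name   = c("John", "Jane", "Mary", "John", "Jane", "Mary"),
  trt    = c("a", "a", "a", "b", "b", "b"),
  result = c(NA, 16, 3, 2, 11, 1)
)
# One vectorized operation creates the new variable for every case.
tidy$results_squared <- tidy$result^2

# Messy version: results are spread over one column per treatment.
messy <- data.frame(
  person     = c("John", "Jane", "Mary"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1)
)
# The same idea now needs one new column per treatment column.
messy$treatmenta_squared <- messy$treatmenta^2
messy$treatmentb_squared <- messy$treatmentb^2

tidy
messy
```

With more treatments the messy version keeps growing sideways, while the tidy version still needs only the one line.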
Is the following data set tidy? Why or why not?
Let’s look at the BabyNames data table
Open RStudio
BabyNames is in the DataComputing package
You must install and load the DataComputing package.
```r
# to install (only needed once)
install.packages("devtools")
devtools::install_github("DataComputing/DataComputing")
# to load (needed in every new R session)
library(DataComputing)
```
To look at data in DataComputing:
```r
data(package = "DataComputing")
```
Item | Title |
---|---|
BabyNames | Names of children as recorded by the US Social Security Administration. |
CountryCentroids | Geographic locations of countries |
CountryData | Many variables on countries from the CIA factbook, 2014. |
CountryGroups | Membership in Country Groups |
DirectRecoveryGroups | Descriptions of the Direct Recovery Groups (DRGs) in the Medicare data. |
HappinessIndex | World Happiness Report Data |
MedicareCharges | Charges to and Payments from Medicare |
MedicareProviders | Medicare Providers |
MigrationFlows | Human Migration between Countries |
Minneapolis2013 | Ballots in the 2013 Mayoral election in Minneapolis |
NCHS | Health Statistics Data 1999-2004 |
NCI60 | Gene expression in cancer. |
NCI60cells | Cell Line descriptions in the NCI-60 dataset |
OrdwayBirds | Birds captured and released at Ordway, complete and uncleaned |
OrdwaySpeciesNames | Corrected Species Names for the Ordway Birds |
RegisteredVoters | A sample of the voter registration list for Wake County, North Carolina in Fall 2010. |
WorldCities | Cities and their populations |
ZipDemography | Demographic information for most US ZIP Codes (Postal Codes) |
ZipGeography | Geographic information by US Zip Codes (Postal Codes) |
To look at the codebook:
```r
help(BabyNames)
# or
?BabyNames
```
BabyNames (R Documentation)
The US Social Security Administration provides yearly lists of names given to babies. These data combine the yearly lists.
`BabyNames` is the raw data from the SSA. The case is a year-name-sex, for example: Jane F 1922. The `count` is the number of children of that sex given that name in that year. Names assigned to fewer than five children of one sex in any year are not listed, presumably out of privacy concerns.
```r
data(BabyNames)
```
`BabyNames` consists of 1,792,091 entries, each of which has four variables:
`name`: The given name (character string)
`sex`: F or M (character string)
`count`: The number of babies given that name and of that sex (integer)
`year`: Year of birth (integer)
The data were compiled from the Social Security Administration web site: http://www.ssa.gov/oact/babynames/names.zip
BabyNames
```r
data(BabyNames)
str(BabyNames)
```
For commands it is helpful to look at the codebook to learn the arguments. For example, examine the command `paste` in the codebook.
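As a small sketch of what the codebook tells you, the calls below try out two arguments of `paste`, `sep` and `collapse`. Typing `?paste` at the console opens the full help page; `args()` is a quicker way to see just the argument list. The variable names are only for illustration.

```r
# Print the argument list of paste without opening the help page.
args(paste)

# sep is the string placed between the pieces (the default is a space).
default_sep <- paste("Stat", 133)
custom_sep  <- paste("Stat", 133, sep = "-")

# collapse joins the elements of one vector into a single string.
collapsed <- paste(c("a", "b", "c"), collapse = "+")

default_sep   # "Stat 133"
custom_sep    # "Stat-133"
collapsed     # "a+b+c"
```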
In DataCamp you will learn about vectors. Each variable (column) of a data table is a vector.
```r
head(BabyNames$name)
## [1] "Mary" "Anna" "Emma" "Elizabeth" "Minnie" "Margaret"
```
Nice properties of tidy data:
R does vectorized operations.
```r
head(BabyNames$count * BabyNames$year)
## [1] 13282200 4895520 3765640 3645320 3282480 2966640
```
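To see what "vectorized" means without loading the package, here is a minimal sketch on two small made-up vectors (the values are hypothetical, not taken from BabyNames). The single expression `count * year` multiplies the vectors element by element and gives the same answer as an explicit loop.

```r
# Hypothetical values, chosen only to keep the arithmetic easy to check.
count <- c(10L, 20L, 30L)
year  <- c(2000L, 2001L, 2002L)

# Vectorized: one expression works on every element at once.
vectorized <- count * year

# The same computation written as an explicit loop.
looped <- integer(length(count))
for (i in seq_along(count)) {
  looped[i] <- count[i] * year[i]
}

vectorized                     # 20000 40020 60060
identical(vectorized, looped)  # TRUE
```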
You can pick out the 100th name:
```r
BabyNames$name[100]
## [1] "Emily"
```
Just about everything in R is an object. Some objects have names and some don’t. For example, `2` and `"Adam"` are objects that don’t have names. `BabyNames` and `sqrt` are named objects.
There are rules for names.
Object names are never in quotes and they never begin with a digit.
The name cannot contain any punctuation symbols, with two exceptions (you can use `.` and `_`). So `sqrt()` isn’t a name.
The case in the name matters so NCHS and NcHs are different names.
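These rules can be checked informally in R. The sketch below uses base R's `make.names()`, which rewrites a string into a valid name; a string that comes back unchanged was already a valid name. The helper `is_valid_name` and the example object names are hypothetical, written only for this illustration.

```r
# Hypothetical helper: TRUE when a string is already a valid object name.
is_valid_name <- function(x) make.names(x) == x

# "my_var" and "my.var" are fine; "2fast" begins with a digit;
# "sqrt()" contains punctuation other than . and _
valid <- is_valid_name(c("my_var", "my.var", "2fast", "sqrt()"))
valid   # TRUE TRUE FALSE FALSE

# Case matters: these are two different names for two different objects.
NCHS_demo <- "upper case"
NcHs_demo <- "mixed case"
same_value <- identical(NCHS_demo, NcHs_demo)
same_value   # FALSE
```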
The value of an object is what you get out when you type it or its name in the console.
You can assign a name to an object with the assignment command `name <- BabyNames`. This gives the object `BabyNames` a new name. (Case matters here too: `Babynames` is not the same name as `BabyNames`.)
What do I get when I type `name` at the console?
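The same kind of assignment can be tried on a small object. In this sketch `scores` is a hypothetical vector standing in for a larger object such as a data table.

```r
# A hypothetical object to experiment with.
scores <- c(90, 85, 77)

# Give the same value a second name.
name <- scores

# Typing a name at the console prints the value of the object.
name

# Both names refer to the same value.
same <- identical(name, scores)
same   # TRUE
```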
What are object names and object values in the command below?
```r
motors <- read.csv(file = "http://tiny.cc/mosaic/engines.csv")
head(motors)
```
Engine | mass | ncylinder | strokes | displacement | bore | stroke | BHP | RPM |
---|---|---|---|---|---|---|---|---|
Webra Speedy | 0.135 | 1 | 2 | 1.8 | 13.5 | 12.5 | 0.45 | 22000 |
Motori Cipolla | 0.150 | 1 | 2 | 2.5 | 15.0 | 14.0 | 1.00 | 26000 |
Webra Speed 20 | 0.250 | 1 | 2 | 3.4 | 16.5 | 16.0 | 0.78 | 22000 |
Webra 40 | 0.270 | 1 | 2 | 6.5 | 21.0 | 19.0 | 0.96 | 15500 |
Webra 61 Blackhead | 0.430 | 1 | 2 | 10.0 | 24.0 | 22.0 | 1.55 | 14000 |
Webra 6WR | 0.490 | 1 | 2 | 10.0 | 24.0 | 22.0 | 2.76 | 19000 |