Source file ⇒ 2017-lec1.Rmd

Today

  1. Introduction to Stat 133
  2. Tidy Data and Computing with R
  3. Data Camp Intro to R course (chap 1 Intro to Basics and chap 2 Vectors)

Introduction to Stat 133

Stat 133 student testimonial: “Not to sound overeager, but learning R in 133 with you last year was the most valuable educational experience I’ve had at Cal! I am very interested in the lab assistant position, how can I learn more?”

I will communicate with you about assignments through b-courses.

Syllabus

Assignments

Lecture Notes

Piazza Site (to be set up soon)

Data Camp

Lab starts this week! You must go to your assigned lab. Attendance is part of your lab grade.

My OH are T, Th 10-11 in Evans 449. I am very good with email (alucas@berkeley.edu).

Get an i-clicker by next Tuesday (participation points).

Final exam rescheduled for Monday May 8 at 3pm.

Tidy Data (chapter 1 of DC)

Examples of Tidy data:

Imagine a data set with three variables: name, trt, and result.

name has three values (John, Mary, and Jane), trt has two values (a and b), and result has six values (-, 16, 3, 2, 11, 1).

When we display the data set so that the columns are our variables and the rows are observations, we call the data set a data table or a data frame. Data tables make it easy to analyse and visualize data because they provide a standard way of structuring a data set. For this reason we call the data in a data table tidy data.

For example:

The data in the table above is tidy because it is organized according to two simple rules:

  1. The rows, called cases, each refer to a specific, unique and similar sort of thing. For example, each case here is the treatment and result of a particular patient.

  2. The columns, called variables, each have the same sort of value recorded for each row. For example, the values of trt are categorical (a or b) and the values of result are numerical.
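As a rough sketch of these two rules in R, the data set described earlier could be stored as a data frame. The notes list the values of each variable but not which result belongs to which person and treatment, so the pairing below is an assumption made purely for illustration. The "-" value is treated as missing and coded as NA, and the object name `tidy` is hypothetical.

```r
# Hypothetical pairing of values: the notes give the values of each
# variable, not which result goes with which person and treatment.
tidy <- data.frame(
  name   = c("John", "Mary", "Jane", "John", "Mary", "Jane"),
  trt    = c("a", "a", "a", "b", "b", "b"),
  result = c(NA, 16, 3, 2, 11, 1),  # NA stands in for the "-" (missing) value
  stringsAsFactors = FALSE
)

# Rule 1: each row is one case (one person receiving one treatment).
nrow(tidy)

# Rule 2: each column holds one kind of value
# (name and trt are character, result is numeric).
sapply(tidy, class)

tidy
```

Every row has the same shape, so adding another patient or another treatment only adds rows and never changes the columns.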

Notice that tidy data isn’t usually concise. You might see the data represented more concisely as below.

This isn’t a data table (i.e., the data isn’t tidy).

In this example the columns (person, treatmenta, treatmentb) are not all variables. treatmenta and treatmentb are values of the variable trt; they aren’t variables themselves.
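To see the difference in R, here is a minimal sketch that builds a concise layout with the columns person, treatmenta and treatmentb, and then rearranges it into tidy form using only base R. The values and their pairing with people are assumed for illustration, and the object names `wide` and `tidy` are hypothetical.

```r
# Hypothetical concise layout: one row per person, one column per treatment.
# The numbers are assumed for illustration; NA marks the missing value.
wide <- data.frame(
  person     = c("John", "Mary", "Jane"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1),
  stringsAsFactors = FALSE
)

# Stack the two treatment columns into a single result column and
# record which treatment each value came from in a trt column.
tidy <- data.frame(
  name   = rep(wide$person, times = 2),
  trt    = rep(c("a", "b"), each = nrow(wide)),
  result = c(wide$treatmenta, wide$treatmentb),
  stringsAsFactors = FALSE
)

tidy
```

The concise version has 3 rows and the tidy version has 6, one for each combination of person and treatment, which is why tidy data usually is not the most compact layout.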

Is the following data set tidy? Why or why not?

i-clicker questions: