Lab-3-Presentation

Robert W. Walker
September 7, 2016

Outline

  • More Summary and Conditional Summary
  • Tables and Cross-tabulation
  • R as calculator
  • Distributions, part I

A Fresher on the Exercises

The two discrimination cases.

  • Discrimination in Expenditures [Quantitative/Continuous]
  • Discrimination in Admissions [Qualitative/Discrete]

Summarizing Data in R and Excel

In Excel, we deploy the Pivot Table. In R, we have varieties of summary:

  • Numerical summaries by groups for one conditioning variable
  • Tables of statistics for more than one.

To Excel

Lab3.xls

The file was exported from R exactly like this.

library(WriteXLS)
WriteXLS(c("Discrimination","DiscriminationUCB","Bonds"), ExcelFileName = "Lab3.xls")

Excel's Pivot Tables

A Pivot Table

  • We drag variables to the Rows and Columns for conditions.
    • here I drag Ethnicity to the Rows
  • Place the output in Values.
    • here I drag Expenditures to Values
    • right-click and change the Value Field Settings to the Average {note what is there}

To R

Load the data: three data sets

load("Lab3.RData")
ls() # List the contents
[1] "Bonds"             "Discrimination"    "DiscriminationUCB"

Summarizing the Data with R

library(RcmdrMisc)
numSummary(Discrimination[,"Expenditures"], groups=Discrimination$Ethnicity,
   statistics=c("mean", "sd", "IQR"))
                        mean        sd      IQR data:n
American Indian    36438.250 25693.912 34085.25      4
Asian              18392.372 19209.225 30892.00    129
Black              20884.593 20549.274 37987.00     59
Hispanic           11065.569 15629.847  7961.25    376
Multi Race          4456.731  7332.135  2059.75     26
Native Hawaiian    42782.333  6576.462  6331.00      3
Other               3316.500  1836.356  1298.50      2
White not Hispanic 24697.549 20604.376 39157.00    401

Adding Complexity to Both

Let's look at the same Expenditures but, this time, by Age.Cohort.
We could also look at the scatterplot using the Age.

  • Are they related?

plot of chunk Plot1

Back to Pivot Tables and R

Drag Age.Cohort to Columns

with(Discrimination, tapply(Expenditures, list(Age.Cohort, Ethnicity), mean,
   na.rm=TRUE))
      American Indian     Asian     Black  Hispanic Multi Race
0-5                NA  1502.250  1083.000  1393.205   1735.286
13-17            3726  3509.100  4521.750  3955.282   3972.286
18-21              NA  9598.171  8847.444  9959.846   9309.000
22-50           28205 39580.517 40573.059 40924.116  38619.000
6-12               NA  2164.556  2476.455  2312.187   2076.111
51+             56911 54622.615 54010.143 55585.000         NA
      Native Hawaiian  Other White not Hispanic
0-5                NA     NA           1366.900
13-17              NA 3316.5           3904.358
18-21              NA     NA          10133.058
22-50           39103     NA          40187.624
6-12               NA     NA           2052.261
51+             50141     NA          52670.424

What do you make of the discrimination story?

Once we look at the data accounting for age, things look a bit different.

What about the Berkeley Story?

  • Load up the DiscriminationUCB data
  • The claim is a difference in Admit by M.F.
  • We have six departments of data: A through F of varying size.

A New Pivot Table

The R Evidence: A Two-Way Table

local({
  .Table <- xtabs(~M.F+Admit, data=DiscriminationUCB)
  cat("\nFrequency table:\n")
  print(.Table)
  cat("\nRow percentages:\n")
  print(rowPercents(.Table))
})

Frequency table:
        Admit
M.F        No  Yes
  Female 1278  557
  Male   1493 1198

Row percentages:
        Admit
M.F        No  Yes Total Count
  Female 69.6 30.4   100  1835
  Male   55.5 44.5   100  2691

A Multi-Way Table


Row percentages:
, , Dept = A

        Admit
M.F        No  Yes Total Count
  Female 17.6 82.4   100   108
  Male   37.9 62.1   100   825

, , Dept = B

        Admit
M.F      No Yes Total Count
  Female 32  68   100    25
  Male   37  63   100   560

, , Dept = C

        Admit
M.F        No  Yes Total Count
  Female 65.9 34.1   100   593
  Male   63.1 36.9   100   325

, , Dept = D

        Admit
M.F        No  Yes Total Count
  Female 65.1 34.9   100   375
  Male   66.9 33.1   100   417

, , Dept = E

        Admit
M.F        No  Yes Total Count
  Female 76.1 23.9   100   393
  Male   72.3 27.7   100   191

, , Dept = F

        Admit
M.F        No Yes Total Count
  Female 93.0 7.0   100   341
  Male   94.1 5.9   100   373

Easier: Turn Admit into A Number

DiscriminationUCB$Admit.Num <- with(DiscriminationUCB, as.numeric(Admit)-1)
with(DiscriminationUCB, tapply(Admit.Num, list(Dept, M.F), mean, 
  na.rm=TRUE))
      Female       Male
A 0.82407407 0.62060606
B 0.68000000 0.63035714
C 0.34064081 0.36923077
D 0.34933333 0.33093525
E 0.23918575 0.27748691
F 0.07038123 0.05898123

Let's look a bit more at that Table interface

  • It automagically calculates conditional probabilities over rows, columns, or the joint probability.
  • It tells us what independence would imply:
    • independence is just P(M)P(Yes) or P(F)P(Yes)
    • There is a chi-square (\( \chi^2 \)) test reported too; this reports [roughly] the probability of independence.
    • This is computed by taking the sum of the squared differences between the expected and actual divided by expected.