Pandas and R

Harold Nelson

11/9/2020

Pandas and R

Pandas was built as an attempt to bring the analytic capabilities of R to Python. These notes look at the similarities and point out some areas where analytic work is easier in R.

We’ll use the cdc dataset as an example.

Setup

First we need to load some libraries. This is the equivalent of importing modules in Python.

library(tidyverse)

Read Data

The R command to read a CSV file and create a dataframe is almost identical to the pandas command. The read_csv() function comes from the readr package, which library(tidyverse) loads.

cdc = read_csv("cdc.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   genhlth = col_character(),
##   exerany = col_double(),
##   hlthplan = col_double(),
##   smoke100 = col_double(),
##   height = col_double(),
##   weight = col_double(),
##   wtdesire = col_double(),
##   age = col_double(),
##   gender = col_character()
## )
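
The warning about 'X1' says that the first column of the file has no header, so read_csv() invented a name for it. A minimal sketch of one way to handle this, assuming the unnamed column is just a row counter that is not needed: the toy file and the names toy and toy_clean are made up for illustration and are not part of the cdc data. The column is dropped by position because the auto-generated name differs between readr versions.

```r
library(readr)
library(dplyr)

# Made-up three-row file whose first header cell is empty, mimicking
# the layout that triggers the "Missing column names" warning.
path <- tempfile(fileext = ".csv")
writeLines(c(",height,weight", "1,70,175", "2,64,125", "3,60,105"), path)

# The first column gets an auto-generated name (X1 or ...1, depending
# on the readr version).
toy <- read_csv(path)

# Drop the first column by position, so the generated name does not matter.
toy_clean <- toy %>% select(-1)

names(toy_clean)
```

The same select(-1) step could be applied to cdc if the X1 column turns out to be unwanted.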

Inspection

Pandas has methods info() and describe() to inspect a new dataframe.

R has functions glimpse() and summary(). If the tidyverse isn’t loaded, use str() instead of glimpse().

glimpse(cdc)
## Rows: 20,000
## Columns: 10
## $ X1       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ genhlth  <chr> "good", "good", "good", "good", "very good", "very good", "v…
## $ exerany  <dbl> 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, …
## $ hlthplan <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, …
## $ smoke100 <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ height   <dbl> 70, 64, 60, 66, 61, 64, 71, 67, 65, 70, 69, 69, 66, 70, 69, …
## $ weight   <dbl> 175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168, …
## $ wtdesire <dbl> 175, 115, 105, 124, 130, 114, 185, 160, 130, 170, 175, 148, …
## $ age      <dbl> 77, 33, 49, 42, 55, 55, 31, 45, 27, 44, 46, 62, 21, 69, 23, …
## $ gender   <chr> "m", "f", "f", "f", "f", "f", "m", "m", "f", "m", "m", "m", …
str(cdc)
## tibble [20,000 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ X1      : num [1:20000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ genhlth : chr [1:20000] "good" "good" "good" "good" ...
##  $ exerany : num [1:20000] 0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num [1:20000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num [1:20000] 0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num [1:20000] 70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : num [1:20000] 175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: num [1:20000] 175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : num [1:20000] 77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : chr [1:20000] "m" "f" "f" "f" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   genhlth = col_character(),
##   ..   exerany = col_double(),
##   ..   hlthplan = col_double(),
##   ..   smoke100 = col_double(),
##   ..   height = col_double(),
##   ..   weight = col_double(),
##   ..   wtdesire = col_double(),
##   ..   age = col_double(),
##   ..   gender = col_character()
##   .. )
summary(cdc)
##        X1          genhlth             exerany          hlthplan     
##  Min.   :    1   Length:20000       Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 5001   Class :character   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :10000   Mode  :character   Median :1.0000   Median :1.0000  
##  Mean   :10000                      Mean   :0.7457   Mean   :0.8738  
##  3rd Qu.:15000                      3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :20000                      Max.   :1.0000   Max.   :1.0000  
##     smoke100          height          weight         wtdesire    
##  Min.   :0.0000   Min.   :48.00   Min.   : 68.0   Min.   : 68.0  
##  1st Qu.:0.0000   1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0  
##  Median :0.0000   Median :67.00   Median :165.0   Median :150.0  
##  Mean   :0.4721   Mean   :67.18   Mean   :169.7   Mean   :155.1  
##  3rd Qu.:1.0000   3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0  
##  Max.   :1.0000   Max.   :93.00   Max.   :500.0   Max.   :680.0  
##       age           gender         
##  Min.   :18.00   Length:20000      
##  1st Qu.:31.00   Class :character  
##  Median :43.00   Mode  :character  
##  Mean   :45.07                     
##  3rd Qu.:57.00                     
##  Max.   :99.00

Comparing Subsets

R has a very convenient function, tapply(), for comparing subsets of a dataframe. Suppose I want to compare the variable weight across the subsets of the data defined by gender. I have chosen summary() for this purpose. Here is the tapply() command.

tapply(cdc$weight,cdc$gender,summary)
## $f
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   128.0   145.0   151.7   170.0   495.0 
## 
## $m
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    78.0   165.0   185.0   189.3   210.0   500.0
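
To see what tapply() is doing, it may help to try it on a few values small enough to check by hand. This is only a sketch: the vectors weight and gender below are made-up toy values, not the cdc data, and mean() stands in for summary() to keep the result short. The three arguments are the values to summarize, the grouping variable, and the function to apply to each group.

```r
# Made-up toy values, not taken from the cdc data
weight <- c(150, 120, 180, 130, 200)
gender <- c("m", "f", "m", "f", "m")

# Split weight by gender, then apply mean() to each piece.
# The result is a named vector with one entry per group.
means <- tapply(weight, gender, mean)

means
```

The entry for "f" is the mean of 120 and 130, and the entry for "m" is the mean of 150, 180 and 200.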

Exercise

Repeat the use of tapply() but base the subsets on the variable genhlth.

Answer

tapply(cdc$weight,cdc$genhlth,summary)
## $excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    85.0   135.0   160.0   162.2   185.0   400.0 
## 
## $fair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    78.0   144.0   170.0   176.2   200.0   495.0 
## 
## $good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   142.0   170.0   173.2   200.0   400.0 
## 
## $poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   142.0   170.0   176.8   200.0   500.0 
## 
## $`very good`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    82.0   140.0   165.0   169.2   190.0   360.0

It appears that people in excellent or very good health are a bit lighter than the others: their mean weights are 162.2 and 169.2, against 173.2 to 176.8 for the other three groups.

Exercise

Repeat the last comparison but use smoke100 to define the subsets.

Answer

tapply(cdc$weight,cdc$smoke100,summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    70.0   140.0   162.0   167.6   190.0   500.0 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      68     144     170     172     195     495

It appears that the non-smokers are a bit lighter than the smokers.

Histograms

I will use the ggplot2 version of the histogram.

cdc %>% 
  ggplot(aes(x = weight)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
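
The message above is ggplot2 pointing out that it fell back on a default of 30 bins. A minimal sketch of choosing the bin width explicitly, assuming a width of 10 is reasonable for the data: the dataframe toy is made up so that the example runs on its own, and the value 10 is an illustrative choice rather than a recommendation for cdc.

```r
library(ggplot2)

# Made-up toy weights, only to make the sketch self-contained
toy <- data.frame(weight = c(105, 114, 125, 132, 150, 150, 170, 175, 180, 194))

# binwidth = 10 asks for bins 10 units wide instead of the default 30 bins
p <- ggplot(toy, aes(x = weight)) +
  geom_histogram(binwidth = 10)

p
```

With binwidth supplied, the stat_bin() message is no longer printed.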

Exercise

Make a histogram for the variable height.

Answer

cdc %>% 
  ggplot(aes(x = height)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histograms for Subsets

It’s easy to get separate histograms for subsets. We just need to add one “layer”, facet_wrap(), to the command above.

cdc %>% 
  ggplot(aes(x = height)) +
  geom_histogram() +
  facet_wrap(~gender,ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise

Repeat the last example but use weight instead of height.

Answer

cdc %>% 
  ggplot(aes(x = weight)) +
  geom_histogram() +
  facet_wrap(~gender,ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise

If you’d prefer a smoothed histogram, just replace geom_histogram() with geom_density(). Modify the last example to do this.

Answer

cdc %>% 
  ggplot(aes(x = weight)) +
  geom_density() +
  facet_wrap(~gender,ncol=1)

You can still see the basic fact that women are generally lighter than men. But notice the secondary peak in the women’s density curve at 200. This is probably an artifact of rounding in self-reported data.

A 2D facet

Let me show you how faceting can incorporate a second classification variable. I’ll use smoke100 as the second classification variable.

cdc %>% 
  ggplot(aes(x = weight)) +
  geom_density() +
  facet_grid(gender~smoke100)

Exercise

Repeat the last example, but swap the placement of gender and smoke100.

Answer

cdc %>% 
  ggplot(aes(x = weight)) +
  geom_density() +
  facet_grid(smoke100~gender)