Harold Nelson
11/9/2020
Pandas was built as an attempt to bring the analytic capabilities of R to Python. These notes look at the similarities and point out some areas where doing analytic work in R is easier.
We’ll use the cdc dataset as an example.
First we need to load some libraries. This is the equivalent of importing modules in Python.
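A minimal sketch of this step, assuming the tidyverse meta-package is what these notes load (glimpse() and ggplot2 are mentioned later, and both come with it):

```r
# Assumption: the tidyverse is already installed,
# e.g. via install.packages("tidyverse").
# library() attaches a package, much as `import pandas as pd` does in Python.
library(tidyverse)
```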
The R command to read a CSV file and create a dataframe is almost identical to the pandas command.
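The warning in the output below suggests the command was readr's read_csv(). Here is a sketch of that call. The file path is not given in these notes, so the example writes a hypothetical three-row stand-in (values copied from the first rows of the cdc data) to a temporary file and reads it back.

```r
library(readr)

# Hypothetical stand-in for the real cdc file: the header starts with an
# empty name, which is what triggers the "Missing column names" warning.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",genhlth,exerany,hlthplan,smoke100,height,weight,wtdesire,age,gender",
  "1,good,0,1,0,70,175,175,77,m",
  "2,good,0,1,1,64,125,115,33,f",
  "3,good,1,1,1,60,105,105,49,f"
), csv_path)

# R:      cdc <- read_csv(path)
# pandas: cdc = pd.read_csv(path)
cdc <- read_csv(csv_path)
cdc
```

Newer readr releases may name the unnamed first column ...1 rather than X1, so the exact message can differ from the one shown here.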
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## genhlth = col_character(),
## exerany = col_double(),
## hlthplan = col_double(),
## smoke100 = col_double(),
## height = col_double(),
## weight = col_double(),
## wtdesire = col_double(),
## age = col_double(),
## gender = col_character()
## )
Pandas has methods info() and describe() to inspect a new dataframe.
R has functions glimpse() and summary(). If the tidyverse isn’t loaded, use str() instead of glimpse().
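A sketch of the three inspection calls. To stay self-contained it uses a hypothetical 12-row sample named cdc_sample, with values copied from the first rows of the cdc data, rather than the full 20,000-row dataframe.

```r
library(dplyr)  # provides glimpse()

# Hypothetical sample: first 12 rows of five cdc columns.
cdc_sample <- data.frame(
  smoke100 = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1),
  height   = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70, 69, 69),
  weight   = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168),
  age      = c(77, 33, 49, 42, 55, 55, 31, 45, 27, 44, 46, 62),
  gender   = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m", "m", "m"),
  stringsAsFactors = FALSE
)

glimpse(cdc_sample)  # column types and first values, like df.info() in pandas
str(cdc_sample)      # base R alternative that needs no extra package
summary(cdc_sample)  # per-column summaries, like df.describe() in pandas
```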
## Rows: 20,000
## Columns: 10
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ genhlth <chr> "good", "good", "good", "good", "very good", "very good", "v…
## $ exerany <dbl> 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, …
## $ hlthplan <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, …
## $ smoke100 <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ height <dbl> 70, 64, 60, 66, 61, 64, 71, 67, 65, 70, 69, 69, 66, 70, 69, …
## $ weight <dbl> 175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168, …
## $ wtdesire <dbl> 175, 115, 105, 124, 130, 114, 185, 160, 130, 170, 175, 148, …
## $ age <dbl> 77, 33, 49, 42, 55, 55, 31, 45, 27, 44, 46, 62, 21, 69, 23, …
## $ gender <chr> "m", "f", "f", "f", "f", "f", "m", "m", "f", "m", "m", "m", …
## tibble [20,000 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ X1 : num [1:20000] 1 2 3 4 5 6 7 8 9 10 ...
## $ genhlth : chr [1:20000] "good" "good" "good" "good" ...
## $ exerany : num [1:20000] 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num [1:20000] 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num [1:20000] 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num [1:20000] 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : num [1:20000] 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: num [1:20000] 175 115 105 124 130 114 185 160 130 170 ...
## $ age : num [1:20000] 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : chr [1:20000] "m" "f" "f" "f" ...
## - attr(*, "spec")=
## .. cols(
## .. X1 = col_double(),
## .. genhlth = col_character(),
## .. exerany = col_double(),
## .. hlthplan = col_double(),
## .. smoke100 = col_double(),
## .. height = col_double(),
## .. weight = col_double(),
## .. wtdesire = col_double(),
## .. age = col_double(),
## .. gender = col_character()
## .. )
##        X1             genhlth           exerany          hlthplan
##  Min.   :    1   Length:20000       Min.   :0.0000   Min.   :0.0000
##  1st Qu.: 5001   Class :character   1st Qu.:0.0000   1st Qu.:1.0000
##  Median :10000   Mode  :character   Median :1.0000   Median :1.0000
##  Mean   :10000                      Mean   :0.7457   Mean   :0.8738
##  3rd Qu.:15000                      3rd Qu.:1.0000   3rd Qu.:1.0000
##  Max.   :20000                      Max.   :1.0000   Max.   :1.0000
##     smoke100         height          weight         wtdesire
##  Min.   :0.0000   Min.   :48.00   Min.   : 68.0   Min.   : 68.0
##  1st Qu.:0.0000   1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0
##  Median :0.0000   Median :67.00   Median :165.0   Median :150.0
##  Mean   :0.4721   Mean   :67.18   Mean   :169.7   Mean   :155.1
##  3rd Qu.:1.0000   3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0
##  Max.   :1.0000   Max.   :93.00   Max.   :500.0   Max.   :680.0
##       age             gender
##  Min.   :18.00   Length:20000
##  1st Qu.:31.00   Class :character
##  Median :43.00   Mode  :character
##  Mean   :45.07
##  3rd Qu.:57.00
##  Max.   :99.00
R has a very convenient function, tapply(), for comparing subsets of a dataframe. Suppose I want to compare the variable weight across the subsets of the data defined by gender. I have chosen summary() as the function to apply to each subset. Here is the tapply() command.
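A sketch of the call, using a hypothetical 12-row sample (the first 12 weight and gender values from the cdc data) instead of the full dataframe. The arguments to tapply() are the values to summarize, the grouping variable, and the function to apply to each group.

```r
# Hypothetical sample: first 12 rows of weight and gender.
cdc_sample <- data.frame(
  weight = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168),
  gender = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m", "m", "m"),
  stringsAsFactors = FALSE
)

# One summary() of weight per gender, returned as a named list.
by_gender <- tapply(cdc_sample$weight, cdc_sample$gender, summary)
by_gender

# A rough pandas counterpart: df.groupby("gender")["weight"].describe()
```

On the full data the same call would be written with cdc in place of cdc_sample.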
## $f
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 128.0 145.0 151.7 170.0 495.0
##
## $m
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.0 165.0 185.0 189.3 210.0 500.0
Repeat the use of tapply() but base the subsets on the variable genhlth.
## $excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 85.0 135.0 160.0 162.2 185.0 400.0
##
## $fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.0 144.0 170.0 176.2 200.0 495.0
##
## $good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 142.0 170.0 173.2 200.0 400.0
##
## $poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 79.0 142.0 170.0 176.8 200.0 500.0
##
## $`very good`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.0 140.0 165.0 169.2 190.0 360.0
It appears that people with excellent or very good health are a bit lighter than the others: their median weights are 160 and 165, against 170 for each of the other three groups.
Repeat the last comparison but use smoke100 to define the subsets.
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 70.0 140.0 162.0 167.6 190.0 500.0
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68 144 170 172 195 495
It appears that the non-smokers are a bit lighter than the smokers (median 162 versus 170).
I will use the ggplot2 version of a histogram, geom_histogram().
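A sketch of a basic ggplot2 histogram. The notes do not say which variable was plotted first, so weight is an assumption here, and the data is a hypothetical 12-row sample of the cdc values.

```r
library(ggplot2)

# Hypothetical sample: first 12 weight values from the cdc data.
cdc_sample <- data.frame(
  weight = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168)
)

# The data and the aesthetic mapping come first, then a geom layer is added.
# Leaving bins unset is what produces the stat_bin() message about bins = 30.
p <- ggplot(cdc_sample, aes(x = weight)) +
  geom_histogram()
p
```

Passing binwidth or bins to geom_histogram() would silence the message.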
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Make a histogram for the variable height.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It’s easy to get subsets. We just need to add one “layer” to the command above to get separate histograms for subsets.
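One way to add such a layer is facet_wrap(). This is a sketch that assumes the subsets are defined by gender and the variable is height. The data is a hypothetical 12-row sample of the cdc values.

```r
library(ggplot2)

# Hypothetical sample: first 12 rows of height and gender.
cdc_sample <- data.frame(
  height = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70, 69, 69),
  gender = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m", "m", "m"),
  stringsAsFactors = FALSE
)

# facet_wrap(~gender) draws one panel per value of gender.
p <- ggplot(cdc_sample, aes(x = height)) +
  geom_histogram() +
  facet_wrap(~gender)
p
```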
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Repeat the last example but use weight instead of height.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If you’d prefer a smoothed histogram, just replace geom_histogram() with geom_density(). Modify the last example to do this.
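A sketch of the swap, assuming the previous plot was a histogram of weight with one panel per gender. The data is a hypothetical 12-row sample, so the curves will look nothing like those from the full dataframe.

```r
library(ggplot2)

# Hypothetical sample: first 12 rows of weight and gender.
cdc_sample <- data.frame(
  weight = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168),
  gender = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m", "m", "m"),
  stringsAsFactors = FALSE
)

# Same structure as the histogram version; only the geom changes.
p <- ggplot(cdc_sample, aes(x = weight)) +
  geom_density() +
  facet_wrap(~gender)
p
```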
You can still see the basic fact that women are generally lighter than men. But you can also notice a secondary peak in the women’s density curve at 200. This is probably an artifact of rounding in self-reported data.
Let me show you how faceting can incorporate a second classification variable. I’ll use smoke100 as the second classification variable.
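With two classification variables, facet_grid() lays the panels out as rows and columns. This sketch assumes a density plot of weight with smoke100 on the rows and gender on the columns. The actual placement used in these notes is not shown, and the data is a hypothetical 12-row sample.

```r
library(ggplot2)

# Hypothetical sample: first 12 rows of weight, gender and smoke100.
cdc_sample <- data.frame(
  weight   = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168),
  gender   = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m", "m", "m"),
  smoke100 = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# The formula reads rows ~ columns: one row per smoke100 value,
# one column per gender.
p <- ggplot(cdc_sample, aes(x = weight)) +
  geom_density() +
  facet_grid(smoke100 ~ gender)
p
```

Swapping the two sides of the formula moves gender to the rows and smoke100 to the columns.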
Repeat the last exercise, but swap the placement of gender and smoke100.