I am going to use the free material of the OpenIntro Statistics course to repeat the basics of statistics but by using R as programming language.
The textbooks comes with many additional resources:
dplyr and ggplot syntax for data manipulation.I plan to take the chapters in several overlapping rounds:
I am not sure, if this various approaches are effective or too much of the same. But I will at least try out all of these different learning possiblities.
I will start with the github labs tutorial.
Attention: I will provide “My Solutions” of the assignments, but please keep in mind: I am learning and therefore my solutions could be inadequat or wrong.
I am using the for this introduction my forked file of the dplyrand ggplot lab version: Intro to R.
> library(dplyr)
> library(ggplot2)
> library(oilabs)
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. [Exactly this is was I am trying to do with my Rmarkdown files.] Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.
For this lab, and many others in the future, we will use the following R packages:
dplyr: for data wranglingggplot2: for data visualizationoilabs: for data and custom functions with the OpenIntro labsIt is vital that every report is reproducible. This can’t be done with two separate approaches (e.g. Word document on the one hand and a programming language like R on the other hand). Instead one should use R Markdown for labs report. See for this argument also the video on YouTube Why use R Markdown for Lab Reports?.
How to use R Markdown document especially for the OpenIntro labs is explained in another video Using R Markdown for Lab Reports in RStudio. There is also a special template provided for the OpenIntro exercises.
The special template provided by the oilabs package is a good opportunity to investigate the details how this template is produced and called from the oilabs package.
I am now going to change to the oilabs template and to continue with the Dr. Arbuthnot’s Baptism Records.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. The BRFSS Web site contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
Load the data set from the OpenIntro Website. Have a look at the data in the browser to get a feeling of the interesting structure of the raw data.
Load the data set into R from the external website. But just once: As it is a big file (2.1 MB) it takes quite some time to load it into memory. Therefore save it as .RData file, which is later much faster to load. Later check if it is already loaded into memory:
> ## run the following lines just once
> # source("http://www.openintro.org/stat/data/cdc.R")
> # dir.create("data")
> # save("cdc", file = "data/cdc.RData", compress = TRUE)
> oldwd <- getwd()
> setwd("~/Documents/_PB-Data/Programming/R/Learning-R/openintro/data")
> if (!exists("cdc")) {load("cdc.RData")}
> setwd(oldwd)
The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.
But I will convert all data frames into to a tibbles. This has some advantages, explained in R for Data Science
y from values of x during the creation)row.names(). The whole point of tidy data is to store variables in a consistent way. So it never stores a variable as special attribute.[ always returns another tibble whereas d data frame sometimes returns a data frame and sometimes it just returns a vector.Therefore I will call the matrix organisation of the data set “tibble” or abbreviated “tbl”.
> library(tidyverse)
> CDC <- tbl_df(cdc)
> class(cdc)
[1] "data.frame"
> class(CDC)
[1] "tbl_df" "tbl" "data.frame"
“cdc” (with small letters) has class “data.frame”, whereas “CDC” (with capital letters) now have “tbl_df, tbl, data.frame” e.g. three classes at the same time to secure full compatiblity with data frames.
After loading the data and converting it into a tibble, one should inspect the data to get some understanding about the structure and content. Common funtions for these tasks are:
<name-of-data-tibble>: Display the first 10 rows and all columns that fit on one screen. It also prints an abbreviated description of the column type.head(<name-of-df>), tail(<name-of-df>): Return the first or last part. Use these commands if it is not a tibble but a data framedim(): Retrieve the dimensionnames(): Get the namesstr(): Display compactly the internal structureglimpse(): is the dplyr-version of str() showing values of each variable the whole sceen width, but does not display the number of levels and names of factor variables. But this feature of str() cannot be displayed completly with either many or long levels names.View(): With RStudio you can see and inspect the data set comfortably. The View() function invokes a spreadsheet-style data viewer.You can see and inspect the data set comfortably in RStudio with the View() command, which invokes a spreadsheet-style data viewer on a matrix-like R object.
> CDC
# A tibble: 20,000 × 9
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
* <fctr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <fctr>
1 good 0 1 0 70 175 175 77 m
2 good 0 1 1 64 125 115 33 f
3 good 1 1 1 60 105 105 49 f
4 good 1 1 0 66 132 124 42 f
5 very good 0 1 0 61 150 130 55 f
6 very good 1 1 0 64 114 114 55 f
7 very good 1 1 0 71 194 185 31 m
8 very good 0 1 0 67 170 160 45 m
9 good 0 1 1 65 150 130 27 f
10 good 1 1 0 70 180 170 44 m
# ... with 19,990 more rows
> head(CDC)
# A tibble: 6 × 9
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
<fctr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <fctr>
1 good 0 1 0 70 175 175 77 m
2 good 0 1 1 64 125 115 33 f
3 good 1 1 1 60 105 105 49 f
4 good 1 1 0 66 132 124 42 f
5 very good 0 1 0 61 150 130 55 f
6 very good 1 1 0 64 114 114 55 f
> tail(CDC)
# A tibble: 6 × 9
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
<fctr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <fctr>
1 good 0 1 1 69 224 224 73 m
2 good 1 1 0 66 215 140 23 f
3 excellent 0 1 0 73 200 185 35 m
4 poor 0 1 0 65 216 150 57 f
5 good 1 1 0 67 165 165 81 f
6 good 1 1 1 69 170 165 83 m
> dim(CDC)
[1] 20000 9
> names(CDC)
[1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
[7] "wtdesire" "age" "gender"
> str(CDC)
Classes 'tbl_df', 'tbl' and 'data.frame': 20000 obs. of 9 variables:
$ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
$ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
$ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
$ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
$ height : num 70 64 60 66 61 64 71 67 65 70 ...
$ weight : int 175 125 105 132 150 114 194 170 150 180 ...
$ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
$ age : int 77 33 49 42 55 55 31 45 27 44 ...
$ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
> glimpse(CDC)
Observations: 20,000
Variables: 9
$ genhlth <fctr> good, good, good, good, very good, very good, very g...
$ exerany <dbl> 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
$ hlthplan <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,...
$ smoke100 <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,...
$ height <dbl> 70, 64, 60, 66, 61, 64, 71, 67, 65, 70, 69, 69, 66, 7...
$ weight <int> 175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186...
$ wtdesire <int> 175, 115, 105, 124, 130, 114, 185, 160, 130, 170, 175...
$ age <int> 77, 33, 49, 42, 55, 55, 31, 45, 27, 44, 46, 62, 21, 6...
$ gender <fctr> m, f, f, f, f, f, m, m, f, m, m, m, m, m, m, m, m, m...
> # View(CDC)
The tibble has 2000 observations (= rows) with 9 variables (columns). Each one of these variables corresponds to a question that was asked in the survey:
genhlth: respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor.exerany: indicates whether the respondent exercised in the past month (1) or did not (0).hlthplan: indicates whether the respondent had some form of health coverage (1) or did not (0).smoke100: indicates whether the respondent had smoked at least 100 cigarettes in her lifetime.height: in inchesweight: in poundswtdesire: desired weightage in yearsgender: “m” or “f”How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
My Solution:
There are 20000 cases (= observations) and 9 variables in the CDC data set.
genhlth: excellent, very good, good, fair or poor = categorialexerany: past month (1) or not (0) = categorialhlthplan: some form of health coverage (1) or not (0) = categorialsmoke100: had smoked at least 100 cigarettes in her lifetime = categorialheight: in inches = continousweight: in pounds = continouswtdesire: desired weight = continuosage in years = discretegender: “m” or “f” = categorial
summary()A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, second quartile, and maximum. For weight this is:
> summary(CDC$weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
68.0 140.0 165.0 169.7 190.0 500.0
dplyr: summarise()dplyr does not have its own special methods for R summary function (see last sentence of vignette Introduction to dplyr. But you can use the R base functions with dplyr and some other extra functions supplied by dplyr. Besides functions in base R like min(), max(), mean(), sum(), sd(), median(), and IQR(). dplyr provides a handful of others:
n(): the number of observations in the current groupn_distinct(x):the number of unique values in x.first(x), last(x) and `nth(x, n) - these work similarly to x[1], x[length(x)], and x[n] but give you more control over the result if the value is missing.> summarise(CDC, Min. = min(weight), "1st Qu." = quantile(weight, probs = 0.25), Median = median(weight), Mean = mean(weight), "3rd Qu." = quantile(weight, probs = 0.75), Max = max(weight))
# A tibble: 1 × 6
Min. `1st Qu.` Median Mean `3rd Qu.` Max
<int> <dbl> <dbl> <dbl> <dbl> <int>
1 68 140 165 169.683 190 500
The dplyr-equivalent of base R summary() is more complex, but has more control about functions used, their names in the table, their sequence etc. So we could add for instance sd(), var(), IQR(). To get these extra values with the base R functions one would have to be write different lines of code.
> summarise(CDC, Min. = min(weight),
+ "1st Qu." = quantile(weight, probs = 0.25),
+ Median = median(weight), Mean = mean(weight),
+ "3rd Qu." = quantile(weight, probs = 0.75),
+ Max = max(weight), Var = var(weight),
+ SD = sd(weight),
+ IQR = IQR(weight)
+ )
# A tibble: 1 × 9
Min. `1st Qu.` Median Mean `3rd Qu.` Max Var SD IQR
<int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 68 140 165 169.683 190 500 1606.484 40.08097 50
But the great advantage for summarise() lies int the combination with the group_by() function of dplyr (see the examples in the Introduction to dplyr vignette.)
table()While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime or instead look at the relative frequency distribution.
> table(CDC$smoke100)
0 1
10559 9441
> table(CDC$smoke100) / 20000
0 1
0.52795 0.47205
Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the Introduction to R; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables.
dplyr: count()But the outcome of the table() command is class “table” which is not so easy to work with as a data frame or a tibble. (See the article on R-bloggers.) It is better to use the count() function of dplyr, because it adds the column names and returns – as all dply-functions — a tibble. Furthermore the relative frequency distribution (the propoartion) can be added as a new column in one command.
> CDC %>%
+ count(smoke100) %>%
+ mutate(prop = round(n / nrow(CDC), digits = 2))
# A tibble: 2 × 3
smoke100 n prop
<dbl> <int> <dbl>
1 0 10559 0.53
2 1 9441 0.47
How to limit the figure of prop to an accuracy of two digits after the decimal point? Use roundas in the example above.
How to use “.” for big figures and “,” for figurs after decimal point (comma) = German localisation for figures?
barplot()Next, we make a bar plot of the entries in the table by putting the table inside the barplot() command.
> barplot(table(CDC$smoke100))
Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:
> smoke <- table(CDC$smoke100)
> barplot(smoke)
Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it in as the input for barplot. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace.
ggplot2: ggplot()A basic bar chart is drawn with geom_bar(). The following chart displays the total number of poeple in the CDC dataset, grouped by smoke100.The chart shows that slightly more people of the sample haven`t even smoked 100 ciragettes during their lifetime.
> ggplot(data = CDC) +
+ geom_bar(mapping = aes(x = smoke100))
On the x-axis, the chart displays
smoke100, a variable from CDC. On the y-axis, it displays count, but count is not a variable in CDC! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. You can learn which stat ageomuses by inspecting the default value for thestatargument. For example,?geom_barshows the default value forstatis “count”, which means thatgeom_bar()usesstat_count().stat_count()is documented on the same page asgeom_bar(), and if you scroll down you can find a section called “Computed variables”. That tells that it computes two new variables:countandprop`.
ou can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():
> ggplot(data = CDC) +
+ stat_count(mapping = aes(x = smoke100))
This works because every
geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. Sometimes you might need to use a stat explicitly: For instance when you want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:
> ggplot(data = CDC) +
+ stat_count(mapping = aes(x = smoke100, y = ..prop..),
+ fill = 'lightblue', color = 'black')
height and age, and compute the interquartile range for each.gender and exerany.My Solution:
Using base R: height
Min. 1st Qu. Median Mean 3rd Qu. Max.
48.00 64.00 67.00 67.18 70.00 93.00
IQR for height: 6
Using base R: age
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 31.00 43.00 45.07 57.00 99.00
IQR for age: 26
m f
9569 10431
m f
0.47845 0.52155
0 1
0.2543 0.7457
excellent very good good fair poor
0.23285 0.34860 0.28375 0.10095 0.03385
Using dplyr: height
# A tibble: 1 × 7
Min. `1st Qu.` Median Mean `3rd Qu.` Max IQR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 48 64 67 67.1829 70 93 6
Using dplyr: age
# A tibble: 1 × 7
Min. `1st Qu.` Median Mean `3rd Qu.` Max IQR
<int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 18 31 43 45.06825 57 99 26
summary()How to write a function to simulate the summary() function using dplyr commands.
For use in inline text I would need to know how to subset just one figure of the contingengy table. Solution: Convert table to a vector and then do the subsetting
There are 9569 males in the sample. 23.29% of the sample reports being in excellent health.
The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.
> table(Gender = CDC$gender, Smoke100 = CDC$smoke100)
Smoke100
Gender 0 1
m 4547 5022
f 6012 4419
Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. [I have the table labelled.]
> table(Gender = CDC$gender, Smoke100 = CDC$smoke100, Exercises = CDC$exerany)
, , Exercises = 0
Smoke100
Gender 0 1
m 914 1235
f 1629 1308
, , Exercises = 1
Smoke100
Gender 0 1
m 3633 3787
f 4383 3111
[Here I have added another varialbe to see how R presents the results.]
To create a mosaic plot of this table, we would enter the following command:
> mosaicplot(table(Gender = CDC$gender, Smoke100 = CDC$smoke100))
What does the mosaic plot reveal about smoking habits and gender?
My Solution:
A higher proportion of females said that they didn’t smoke more than 100 cigarettes in their lifetime.
Interlude (= some commands to inspect data and data structure) and subsetting is something I have already experiences. Therefore I will not go into much detail.
Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command
> mdata <- subset(CDC, CDC$gender == "m")
> mdata1 <- filter(CDC, gender == "m")
will create a new data set called mdata that contains only the men from the CDC data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual.
> head(mdata1, 10)
# A tibble: 10 × 9
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
<fctr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <fctr>
1 good 0 1 0 70 175 175 77 m
2 very good 1 1 0 71 194 185 31 m
3 very good 0 1 0 67 170 160 45 m
4 good 1 1 0 70 180 170 44 m
5 excellent 1 1 1 69 186 175 46 m
6 fair 1 1 1 69 168 148 62 m
7 excellent 1 0 1 66 185 220 21 m
8 excellent 1 1 1 70 170 170 69 m
9 fair 1 0 0 69 170 170 23 m
10 good 1 1 1 73 185 175 79 m
You can use several of these conditions together with & and |. The & is read “and” so that
> m_and_over30 <- subset(CDC, gender == "m" & age > 30)
> m_and_over30_1 <- filter(CDC, gender == "m" & age > 30)
will give you the data for men over the age of 30. The | character is read “or” so that
> m_or_over30 <- subset(CDC, gender == "m" | age > 30)
> m_or_over30_1 <- filter(CDC, gender == "m" & age > 30)
will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
> under23_and_smoke <- subset(CDC, age < 23 & smoke100 == 1)
> under23_and_smoke_1 <- filter(CDC, age < 23 & smoke100 == 1)
With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We’ve already looked at categorical data such as smoke and gender so now let’s turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms.
To get a better feeling of the data I will add three new rows: height_cm, weight_kg, `wtdesire_kg.
> cdc_metric <- CDC %>%
+ mutate(height_cm = as.integer(round(height * 2.54))) %>%
+ mutate(weight_kg = as.integer(round(weight * 0.453592))) %>%
+ mutate(wtdesire_kg = as.integer(round(wtdesire * 0.453592)))
We can construct a box plot for a single variable. You can compare the locations of the components of the box by examining the summary statistics.
> boxplot(cdc_metric$height_cm)
> summary(cdc_metric$height_cm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
122.0 163.0 170.0 170.6 178.0 236.0
With ggplot you need x and y, e.g. (at least) two variables: see StackOverflow). There are two ways to circumvent this problem:
qplot(), “… a shortcut designed to be familiar if you’re used to base plot(). … It’s great for allowing you to produce plots quickly, but I highly recommend learning ggplot() as it makes it easier to create complex graphics”. (From the R Documentation)> qplot(y = height_cm,
+ x = 1,
+ data = cdc_metric,
+ geom = "boxplot")
> ggplot(cdc_metric, aes(x = "", y = height_cm)) +
+ geom_boxplot()
> summary(cdc_metric$height_cm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
122.0 163.0 170.0 170.6 178.0 236.0
Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women:
> boxplot(cdc_metric$height_cm ~ cdc_metric$gender)
> summary(subset(cdc_metric, gender == "m", select = height_cm))
height_cm
Min. :124.0
1st Qu.:173.0
Median :178.0
Mean :178.4
3rd Qu.:183.0
Max. :236.0
> summary(subset(cdc_metric, gender == "f", select = height_cm))
height_cm
Min. :122.0
1st Qu.:157.0
Median :163.0
Mean :163.5
3rd Qu.:168.0
Max. :198.0
> ggplot(cdc_metric, aes(x = gender, y = height_cm)) +
+ geom_boxplot()
> summarise(group_by(cdc_metric, gender),
+ Min. = min(height_cm),
+ "1st Qu." = quantile(height_cm, probs = 0.25),
+ Median = median(height_cm), Mean = round(mean(height_cm), digits = 2),
+ "3rd Qu." = quantile(height_cm, probs = 0.75),
+ Max = max(height_cm),
+ IQR = IQR(height_cm)
+ )
# A tibble: 2 × 8
gender Min. `1st Qu.` Median Mean `3rd Qu.` Max IQR
<fctr> <int> <dbl> <int> <dbl> <dbl> <int> <dbl>
1 m 124 173 178 178.4 183 236 10
2 f 122 157 163 163.5 168 198 11
Tablecommand and mosaicplotThe table command can be used to tabulate any number of variables that you provide.
> table(cdc_metric$gender,cdc_metric$smoke100)
0 1
m 4547 5022
f 6012 4419
> mosaicplot(table(cdc_metric$gender,cdc_metric$smoke100))
mosaicplot in ggplot?What is the equivalent of a mosaicplot in ggplot? It seems there is no ready to work equivalent, one has to write a function. See the question and especially the answer on SO. There is a more complex refined version of the function, but it appears not so nice to me (even with the color brewer options it looks spartanic. Sure there is also some work on labelling to do with the other version, but it can stand the comparison with the base R version. — There is also a related article by Hadley Wickham, and Heike Hofmann, which I haven’t read yet.
> #' @title Mosaic plot using ggplot.
> #'
> #' @description
> #' Creates a mosaic plot where the dimensions of the cells of a
> #' confusion matrix represent their marginal proportions.
> #'
> #' @details
> #' Credit for initial iteration to
> #' [Edwin](http://stackoverflow.com/a/19258045/992999)
> #' This version adds color brewer options and tidies the labelling
>
> ggMMplot <- function(var1, var2, palette="YlOrRd"){
+ require(ggplot2)
+
+ levVar1 <- length(levels(var1))
+ levVar2 <- length(levels(var2))
+
+ jointTable <- prop.table(table(var1, var2))
+ plotData <- as.data.frame(jointTable)
+ plotData$marginVar1 <- prop.table(table(var1))
+ plotData$marginVar2 <- prop.table(table(var2))
+ plotData$var2Height <- plotData$Freq / plotData$marginVar1
+ plotData$var1Center <-
+ c(0, cumsum(plotData$marginVar1)[1:levVar1 - 1]) +
+ plotData$marginVar1 / 2
+
+ # Define label positions on LEFT (y-axis)
+ ylabData <- plotData[plotData$var1 == levels(plotData$var1)[1], ]
+ dd <- (y = c(0, cumsum(ylabData$var2Height)))
+ ylabData$ylabCenter <-
+ sapply(1:(length(dd) - 1), function(x)
+ dd[x] + (dd[x + 1] - dd[x]) / 2)
+
+ # Define label positions on the BOTTOM (x-axis)
+ xlabData <- plotData[plotData$var2 == levels(plotData$var2)[1], ]
+ dd <- (x = c(0, cumsum(xlabData$marginVar1)))
+ xlabData$xlabCenter <-
+ sapply(1:(length(dd) - 1), function(x)
+ dd[x] + (dd[x + 1] - dd[x]) / 2)
+
+ ggplot(plotData, aes(var1Center, var2Height)) +
+ geom_bar(stat = "identity", aes(width = marginVar1, fill = var2), col = "White") +
+ scale_fill_brewer(type = "seq",
+ palette = palette,
+ guide = FALSE) +
+ # xlabels
+ geom_text(data = xlabData,
+ aes(
+ label = as.character(var1),
+ x = xlabCenter,
+ y = -0.05
+ ),
+ vjust = "inward") +
+ # ylabels
+ geom_text(
+ data = ylabData,
+ aes(
+ label = as.character(var2),
+ y = ylabCenter,
+ x = -0.05
+ ),
+ vjust = "top",
+ angle = 90
+ ) +
+ xlab("") + scale_x_discrete(labels = NULL) +
+ ylab("") + scale_y_discrete(labels = NULL) +
+ theme_minimal() +
+ theme(plot.margin = margin(rep(20, 4)))
+ }
>
> ggMMplot1 <- function(var1, var2){
+ require(ggplot2)
+ levVar1 <- length(levels(var1))
+ levVar2 <- length(levels(var2))
+
+ jointTable <- prop.table(table(var1, var2))
+ plotData <- as.data.frame(jointTable)
+ plotData$marginVar1 <- prop.table(table(var1))
+ plotData$var2Height <- plotData$Freq / plotData$marginVar1
+ plotData$var1Center <-
+ c(0, cumsum(plotData$marginVar1)[1:levVar1 - 1]) +
+ plotData$marginVar1 / 2
+
+ ggplot(plotData, aes(var1Center, var2Height)) +
+ geom_bar(stat = "identity", aes(width = marginVar1, fill = var2), col = "Black") +
+ geom_text(aes(label = as.character(var1), x = var1Center, y = 1.05))
+ }
>
> ggMMplot(cdc_metric$gender,cdc_metric$smoke100)
> ggMMplot1(cdc_metric$gender, cdc_metric$smoke100)
> bmi <- round((CDC$weight / CDC$height^2) * 703, digits = 2)
> boxplot(bmi ~ CDC$genhlth)
Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the CDC data set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.
Conversion factor 703 is the summary of all conversions: weight in pound * 0.453592 / (height * 2.54 / 100)^2 = 703.0690061]
What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
My Solution:
It shows the relation of BMI for every level of general health as stated by the respondents. One can see that with worse level of general health the BMI is (slightly) higher and the variance is higher (right skewed to the hight end).
> bmi_metric <- (cdc_metric$weight_kg / (cdc_metric$height_cm / 100) ^ 2)
> boxplot(bmi_metric ~ cdc_metric$exerany)
> p <- ggplot(cdc_metric, aes(factor(cdc_metric$exerany), bmi_metric))
> p + geom_boxplot()
There is a big difference between
boxplot and geom_boxplot: The number variable (double) exerany needs to be converted into factor to work with geom_boxplot
For respondent with no exercise in the last month the BMI is higher than for people who have done workouts in the last month.
> bmi_metric <- (cdc_metric$weight_kg / (cdc_metric$height_cm / 100) ^ 2)
> boxplot(bmi_metric ~ cdc_metric$genhlth)
Finally, let’s make some histograms.
> hist(cdc_metric$age)
> hist(bmi)
> hist(bmi, breaks = 50)
> p <- ggplot(cdc_metric, aes(bmi_metric))
> p + geom_histogram(bins = 50)
> p + geom_freqpoly(bins = 50)
Task 1: Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
> p <- ggplot(data = cdc_metric, mapping = aes(weight_kg, wtdesire_kg), position = "jitter")
> p + geom_point()
> cdc_metric %>%
+ select(-height, -weight, -wtdesire) %>%
+ filter(wtdesire_kg > 250)
# A tibble: 2 × 9
genhlth exerany hlthplan smoke100 age gender height_cm weight_kg
<fctr> <dbl> <dbl> <dbl> <int> <fctr> <int> <int>
1 very good 1 1 1 56 m 185 132
2 good 0 1 0 24 m 175 82
# ... with 1 more variables: wtdesire_kg <int>
Task 2: Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
> (cdc_metric <- mutate(cdc_metric, wdiff_kg = wtdesire_kg - weight_kg))
# A tibble: 20,000 × 13
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
<fctr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <fctr>
1 good 0 1 0 70 175 175 77 m
2 good 0 1 1 64 125 115 33 f
3 good 1 1 1 60 105 105 49 f
4 good 1 1 0 66 132 124 42 f
5 very good 0 1 0 61 150 130 55 f
6 very good 1 1 0 64 114 114 55 f
7 very good 1 1 0 71 194 185 31 m
8 very good 0 1 0 67 170 160 45 m
9 good 0 1 1 65 150 130 27 f
10 good 1 1 0 70 180 170 44 m
# ... with 19,990 more rows, and 4 more variables: height_cm <int>,
# weight_kg <int>, wtdesire_kg <int>, wdiff_kg <int>
> summary(cdc_metric$wdiff_kg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-137.000 -10.000 -4.000 -6.611 0.000 226.000
> ggplot(cdc_metric, mapping = aes(wdiff_kg)) +
+ geom_freqpoly(bins = 150)
Task 3: What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?
Task 4: Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
> ggplot(cdc_metric, mapping = aes(wdiff_kg)) +
+ geom_histogram(bins = 20)
> ggplot(cdc_metric, mapping = aes(wdiff_kg)) +
+ geom_histogram(bins = 20) +
+ scale_x_continuous(name = "Weight real - Weight desired", limits = c(-50, 50)) +
+ scale_y_continuous(name = "Frequency", limits = c(0, 10000))
> summary(cdc_metric$wdiff_kg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-137.000 -10.000 -4.000 -6.611 0.000 226.000
How to provide a better graph, where the x-axis is centered between -10 and + 10? I could find a solution at “Statistical tools for high-throughput data analysis” the STHDA website. – The warning Removed 146 rows containing non-finite values (stat_bin) says that some data falls outside the provided limit.
The plot shows that many people desire to have a little less weight (4-6 kg) as they have. It is a unimodal distribution There is one peak around 0 (real weight is desired weight) but there are also some strange value: A person who wants 137 kg less weight, but also another person who wants 226 kg more weight!
Task 5: Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
> ggplot(cdc_metric, mapping = aes(factor(gender), wdiff_kg)) +
+ geom_boxplot() +
+ labs(title = "View of own weight by gender") +
+ labs(x = "Gender") +
+ scale_y_continuous(name = "Weight real - weight desires", limits = c(-100.0, 100.0))
Women tend to desire a bigger reduction of weight as men.
Task 6: Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
> (round(SD1 <- mean(cdc_metric$weight_kg) - sd(cdc_metric$weight_kg)))
[1] 59
> (round(SD2 <- mean(cdc_metric$weight_kg) + sd(cdc_metric$weight_kg)))
[1] 95
> nrow(filter(cdc_metric, weight_kg >= 59 & weight_kg <= 95)) / nrow(cdc_metric)
[1] 0.731
73% of the weights are within one standard deviation of the mean.