lecture link: https://rpubs.com/zaidyousif/1089175

Course Objectives

Learning Objectives

Install R and RStudio

R is a programming language, and RStudio is an integrated development environment (IDE) that provides a graphical interface for working with R. Both R and RStudio need to be installed for you to participate in the lecture. The R language itself is free, and the free edition of RStudio is sufficient for this course, so you do not need to pay for any software. Please feel free to email me if you have any issues with the installation.

Windows

Installing R

  • Go to the R website: www.r-project.org
  • Under the “getting started” section, click “download R”
  • You should be on a website called “CRAN Mirrors”. Many web hosts around the world volunteer their servers to provide R for download. These are called mirrors. You can choose any mirror, but mirrors with close geographic proximity will likely be fastest. Scroll down and choose a mirror within the “USA” section, or any other region if you are located in a different country.
  • Click on the “Download R for Windows” link near the top of the page.
  • Click on the “install R for the first time” link.
  • Click on the large “Download R-4.3.1 for Windows” link. Save and then run the executable and follow the installation instructions.

Installing RStudio

Mac

Installing R

  • Go to the R website: www.r-project.org
  • Under the “getting started” section, click “download R”
  • You should be on a website called “CRAN Mirrors”. Many web hosts around the world volunteer their servers to provide R for download. These are called mirrors. You can choose any mirror, but mirrors with close geographic proximity will likely be fastest. Scroll down and choose a mirror within the “USA” section, or any other region if you are located in a different country.
  • Click on the “Download R for macOS” link near the top of the page.
  • Under “Latest release” click on the “R-4.3.1-arm64.pkg” link for Apple silicon (M1/M2) Macs or “R-4.3.1-x86_64.pkg” for older Intel Macs.
  • Save the .pkg file, then double-click it and follow the installation instructions.

Installing RStudio

Install and Load R Packages

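A package bundles together code, data, documentation, and tests, and is easy to share with others. We use the install.packages function to install packages in R. A package only needs to be installed once per computer, so the lines below are commented out; remove the leading # to run them the first time.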

# install.packages("tidyverse")
# install.packages("openxlsx")
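
If you rerun a script often, it can help to install only the packages that are missing. The following is a minimal sketch of that idea; the helper name missing_packages is hypothetical (it is not part of R or of either package), and the actual install line is left commented out.

```r
# Hypothetical helper: returns the names in `pkgs` that are not yet installed.
# requireNamespace() returns FALSE (instead of stopping) for a missing package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "openxlsx"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

If both packages are already installed, to_install is an empty character vector and nothing needs to be done.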

To load a package in R, we use the library function. Once a package is loaded, you can use the functions it comes with.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
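
The "Conflicts" lines in the output above mean that two packages define a function with the same name, and the most recently loaded one wins. One way to be explicit about which version you want is the package::function notation. The sketch below uses only base R, so it runs without the tidyverse; the vector x is made up for illustration.

```r
# A made-up numeric vector for illustration.
x <- c(1, 2, 3, 4, 5)

# Explicitly call filter() from the stats package (a 3-point moving average),
# regardless of whether dplyr::filter() has masked it.
smoothed <- stats::filter(x, rep(1 / 3, 3))
print(smoothed)
```

The same notation works for any installed package, for example openxlsx::read.xlsx.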

Loading Data

Most often, the data you are going to analyze is generated externally and provided to you as a spreadsheet. Spreadsheets come in different formats including Microsoft Excel (xlsx), comma-separated values (csv), and tab-separated values (tsv). There is a method to load each spreadsheet type. In this course, we will focus on loading data from xlsx files.
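
Although we will focus on xlsx files, here is a short sketch of how csv and tsv files could be loaded with base R. The small dataframe demo_df and the temporary file paths are made up for illustration so that the example runs on its own.

```r
# A tiny made-up dataframe, written to temporary csv and tsv files.
demo_df <- data.frame(id = 1:3, sex = c("Male", "Female", "Male"))
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(demo_df, csv_path, row.names = FALSE)
write.table(demo_df, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# read.csv() loads comma-separated files; read.delim() loads tab-separated files.
from_csv <- read.csv(csv_path)
from_tsv <- read.delim(tsv_path)

print(from_csv)
print(from_tsv)
```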

When loading a file into R, it is important to specify the location of the file correctly. Any mistake in specifying the location will cause the data to fail to load. Note that file paths are written differently on Windows and macOS. I recommend saving the data file in the same directory as your R script file. Once you save the R script file, you can set the working directory to that location with setwd(), replacing the example path below with your own:

setwd("E:/Biostat and Study Design/204/Lectures/Data")
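
Before loading a file, it can be worth checking that R is looking where you think it is. The sketch below assumes the file is named NHEFS.xlsx and sits in the current working directory; file.path builds the path with forward slashes, which R accepts on both Windows and macOS.

```r
# Where is R currently looking for files?
current_dir <- getwd()
print(current_dir)

# Build the full path to the data file (assumed name: NHEFS.xlsx).
xlsx_path <- file.path(current_dir, "NHEFS.xlsx")

# Check for the file before trying to load it.
if (file.exists(xlsx_path)) {
  message("Found: ", xlsx_path)
} else {
  message("Not found: ", xlsx_path, " (check the working directory)")
}
```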

In this course, we will analyze data from the NHANES I Epidemiologic Follow-up Study (NHEFS), the follow-up to the first National Health and Nutrition Examination Survey (NHANES). In summary, NHEFS is a national longitudinal study designed to investigate the relationships between clinical, nutritional, and behavioral factors. We will only analyze a small subset of the data.

To load the data from the xlsx file into a dataframe, we will use the openxlsx package we installed and loaded previously.

setwd("E:/Biostat and Study Design/204/Lectures/Data")
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx')

To view the dataset in a spreadsheet-style viewer, use the View() command (note the capital V).

# View(NHANES_df)

The following functions are useful to explore dataframes:

head(NHANES_df)
##   seqn qsmk death yrdth modth dadth sbp dbp    sex age           race income
## 1  233    0     0    NA    NA    NA 175  96   Male  42 Black or other     19
## 2  235    0     0    NA    NA    NA 123  80   Male  36          White     18
## 3  244    0     0    NA    NA    NA 115  75 Female  56 Black or other     15
## 4  245    0     1    85     2    14 148  78   Male  68 Black or other     15
## 5  252    0     0    NA    NA    NA 118  77   Male  40          White     18
## 6  257    0     0    NA    NA    NA 141  83 Female  43 Black or other     11
##   marital school education       ht  wt71      wt82    wt82_71 birthplace
## 1       2      7         1 174.1875 79.04  68.94604 -10.093960         47
## 2       2      9         2 159.3750 58.63  61.23497   2.604970         42
## 3       3     11         2 168.5000 56.81  66.22449   9.414486         51
## 4       3      5         1 170.1875 59.42  64.41012   4.990117         37
## 5       2     11         2 181.8750 87.09  92.07925   4.989251         42
## 6       4      9         2 162.1875 99.00 103.41906   4.419060         34
##   smokeintensity smkintensity82_71 smokeyrs asthma bronch tb hf hbp pepticulcer
## 1             30               -10       29      0      0  0  0   1           1
## 2             20               -10       24      0      0  0  0   0           0
## 3             20               -14       26      0      0  0  0   0           0
## 4              3                 4       53      0      0  0  0   1           0
## 5             20                 0       19      0      0  0  0   0           0
## 6             10                10       21      0      0  0  0   0           0
##   colitis hepatitis chroniccough hayfever diabetes polio tumor nervousbreak
## 1       0         0            0        0        1     0     0            0
## 2       0         0            0        0        0     0     0            0
## 3       0         0            0        1        0     0     1            0
## 4       0         0            0        0        0     0     0            0
## 5       0         0            0        0        0     0     0            0
## 6       0         0            0        0        0     0     0            0
##   alcoholpy alcoholfreq alcoholtype alcoholhowmuch pica headache otherpain
## 1         1           1           3              7    0        1         0
## 2         1           0           1              4    0        1         0
## 3         1           3           4             NA    0        1         1
## 4         1           2           3              4    0        0         1
## 5         1           2           1              2    0        1         0
## 6         1           3           2              1    0        1         0
##   weakheart allergies nerves lackpep hbpmed boweltrouble wtloss infection
## 1         0         0      0       0      1            0      0         0
## 2         0         0      0       0      0            0      0         1
## 3         0         0      1       0      0            0      0         0
## 4         1         0      0       0      0            0      0         0
## 5         0         0      0       0      0            1      0         0
## 6         0         0      0       0      0            0      0         0
##   active exercise birthcontrol pregnancies cholesterol hightax82  price71
## 1      0        2            2          NA         197         0 2.183594
## 2      0        0            2          NA         301         0 2.346680
## 3      0        2            0           2         157         0 1.569580
## 4      1        2            2          NA         174         0 1.506592
## 5      1        1            2          NA         216         0 2.346680
## 6      1        1            0           1         212         1 2.209961
##    price82     tax71     tax82 price71_82  tax71_82
## 1 1.739990 1.1022949 0.4619751 0.44378662 0.6403809
## 2 1.797363 1.3649902 0.5718994 0.54931641 0.7929688
## 3 1.513428 0.5512695 0.2309875 0.05619812 0.3202515
## 4 1.451904 0.5249023 0.2199707 0.05479431 0.3049927
## 5 1.797363 1.3649902 0.5718994 0.54931641 0.7929688
## 6 2.025879 1.1547852 0.7479248 0.18408203 0.4069824
tail(NHANES_df)
##       seqn qsmk death yrdth modth dadth sbp dbp    sex age  race income marital
## 1624 25013    0     0    NA    NA    NA 125  77 Female  47 White     16       5
## 1625 25014    0     0    NA    NA    NA 115  66   Male  45 White     16       4
## 1626 25016    0     0    NA    NA    NA 124  80 Female  47 White     18       2
## 1627 25024    0     0    NA    NA    NA  NA  NA Female  51 White     15       2
## 1628 25032    0     0    NA    NA    NA 171  77   Male  68 White     13       2
## 1629 25061    1     0    NA    NA    NA 136  90   Male  29 White     19       2
##      school education       ht  wt71      wt82   wt82_71 birthplace
## 1624      8         1 167.1875 84.94  93.44003  8.500028         54
## 1625      0         1 172.0938 63.05  64.41012  1.360117         54
## 1626      8         1 170.0938 57.72  61.23497  3.514970         54
## 1627     12         3 166.2812 62.71        NA        NA         24
## 1628      6         1 162.5938 52.39  57.15264  4.762639         54
## 1629     10         2 165.5000 90.83 106.59421 15.764207         54
##      smokeintensity smkintensity82_71 smokeyrs asthma bronch tb hf hbp
## 1624             20                 0       31      0      0  0  0   2
## 1625             40                 0       29      0      0  0  0   2
## 1626             20                 0       31      0      0  0  0   2
## 1627             40                40       30      0      0  0  0   2
## 1628             15                 5       46      0      0  0  0   2
## 1629             30               -30       14      0      0  0  0   2
##      pepticulcer colitis hepatitis chroniccough hayfever diabetes polio tumor
## 1624           1       0         0            0        0        2     0     1
## 1625           0       0         0            0        0        2     0     0
## 1626           0       0         0            0        0        2     0     0
## 1627           0       0         0            0        0        2     0     0
## 1628           0       0         0            0        0        2     0     0
## 1629           0       0         0            0        0        2     0     0
##      nervousbreak alcoholpy alcoholfreq alcoholtype alcoholhowmuch pica
## 1624            0         0           4           4             NA    2
## 1625            0         1           2           1              2    2
## 1626            0         1           3           4             NA    2
## 1627            0         0           4           4             NA    2
## 1628            0         1           2           1              2    2
## 1629            0         1           2           1              6    2
##      headache otherpain weakheart allergies nerves lackpep hbpmed boweltrouble
## 1624        1         1         0         0      0       0      2            2
## 1625        1         1         0         0      0       0      2            2
## 1626        1         0         0         0      0       0      2            2
## 1627        1         1         0         0      0       0      2            2
## 1628        1         1         0         0      0       0      2            2
## 1629        0         1         0         0      0       0      2            2
##      wtloss infection active exercise birthcontrol pregnancies cholesterol
## 1624      0         1      0        0            0           5         254
## 1625      0         0      0        0            2          NA          NA
## 1626      0         0      0        0            0           2         270
## 1627      0         0      0        0            0           3         228
## 1628      0         0      1        1            2          NA         223
## 1629      0         0      1        1            2          NA         243
##      hightax82  price71  price82     tax71     tax82 price71_82  tax71_82
## 1624         0 2.167969 1.940186 1.0498047 0.5499268  0.2278748 0.5000000
## 1625         0 2.167969 1.940186 1.0498047 0.5499268  0.2278748 0.5000000
## 1626         0 2.167969 1.940186 1.0498047 0.5499268  0.2278748 0.5000000
## 1627         0 1.800781 1.647705 0.7349854 0.4619751  0.1529846 0.2729492
## 1628         0 2.167969 1.940186 1.0498047 0.5499268  0.2278748 0.5000000
## 1629         0 2.167969 1.940186 1.0498047 0.5499268  0.2278748 0.5000000
dim(NHANES_df)
## [1] 1629   64
nrow(NHANES_df)
## [1] 1629
ncol(NHANES_df)
## [1] 64
str(NHANES_df)
## 'data.frame':    1629 obs. of  64 variables:
##  $ seqn             : num  233 235 244 245 252 257 262 266 419 420 ...
##  $ qsmk             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ death            : num  0 0 0 1 0 0 0 0 1 1 ...
##  $ yrdth            : num  NA NA NA 85 NA NA NA NA 84 86 ...
##  $ modth            : num  NA NA NA 2 NA NA NA NA 10 10 ...
##  $ dadth            : num  NA NA NA 14 NA NA NA NA 13 17 ...
##  $ sbp              : num  175 123 115 148 118 141 132 100 163 184 ...
##  $ dbp              : num  96 80 75 78 77 83 69 53 79 106 ...
##  $ sex              : chr  "Male" "Male" "Female" "Male" ...
##  $ age              : num  42 36 56 68 40 43 56 29 51 43 ...
##  $ race             : chr  "Black or other" "White" "Black or other" "Black or other" ...
##  $ income           : num  19 18 15 15 18 11 19 22 18 16 ...
##  $ marital          : num  2 2 3 3 2 4 2 2 2 2 ...
##  $ school           : num  7 9 11 5 11 9 12 12 10 11 ...
##  $ education        : num  1 2 2 1 2 2 3 3 2 2 ...
##  $ ht               : num  174 159 168 170 182 ...
##  $ wt71             : num  79 58.6 56.8 59.4 87.1 ...
##  $ wt82             : num  68.9 61.2 66.2 64.4 92.1 ...
##  $ wt82_71          : num  -10.09 2.6 9.41 4.99 4.99 ...
##  $ birthplace       : num  47 42 51 37 42 34 NA NA 42 42 ...
##  $ smokeintensity   : num  30 20 20 3 20 10 20 2 25 20 ...
##  $ smkintensity82_71: num  -10 -10 -14 4 0 10 0 1 -10 -20 ...
##  $ smokeyrs         : num  29 24 26 53 19 21 39 9 37 25 ...
##  $ asthma           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bronch           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tb               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hf               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hbp              : num  1 0 0 1 0 0 0 0 0 0 ...
##  $ pepticulcer      : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ colitis          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hepatitis        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ chroniccough     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hayfever         : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ diabetes         : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ polio            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tumor            : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ nervousbreak     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ alcoholpy        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ alcoholfreq      : num  1 0 3 2 2 3 1 0 1 0 ...
##  $ alcoholtype      : num  3 1 4 3 1 2 3 2 3 1 ...
##  $ alcoholhowmuch   : num  7 4 NA 4 2 1 4 1 2 6 ...
##  $ pica             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ headache         : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ otherpain        : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ weakheart        : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ allergies        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nerves           : num  0 0 1 0 0 0 1 1 0 0 ...
##  $ lackpep          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hbpmed           : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ boweltrouble     : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ wtloss           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ infection        : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ active           : num  0 0 0 1 1 1 0 0 2 1 ...
##  $ exercise         : num  2 0 2 2 1 1 1 2 2 2 ...
##  $ birthcontrol     : num  2 2 0 2 2 0 0 0 2 2 ...
##  $ pregnancies      : num  NA NA 2 NA NA 1 1 2 NA NA ...
##  $ cholesterol      : num  197 301 157 174 216 212 205 166 337 279 ...
##  $ hightax82        : num  0 0 0 0 0 1 NA NA 0 0 ...
##  $ price71          : num  2.18 2.35 1.57 1.51 2.35 ...
##  $ price82          : num  1.74 1.8 1.51 1.45 1.8 ...
##  $ tax71            : num  1.102 1.365 0.551 0.525 1.365 ...
##  $ tax82            : num  0.462 0.572 0.231 0.22 0.572 ...
##  $ price71_82       : num  0.4438 0.5493 0.0562 0.0548 0.5493 ...
##  $ tax71_82         : num  0.64 0.793 0.32 0.305 0.793 ...
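
To see what these functions do on something small enough to read at a glance, here is a sketch using a four-row dataframe. The values are copied from the first four rows of the head() output above; the name mini_df is made up for illustration. It also shows two other base R functions, summary and is.na, that can help when exploring a dataframe.

```r
# Four rows and four columns copied from the head() output above.
mini_df <- data.frame(
  seqn  = c(233, 235, 244, 245),
  sex   = c("Male", "Male", "Female", "Male"),
  age   = c(42, 36, 56, 68),
  yrdth = c(NA, NA, NA, 85)
)

dim(mini_df)             # number of rows, then number of columns
names(mini_df)           # column names
str(mini_df)             # type and first values of each column
summary(mini_df$age)     # minimum, quartiles, mean, and maximum of age
colSums(is.na(mini_df))  # count of missing values (NA) in each column
table(mini_df$sex)       # frequency of each category of sex
```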