lecture link: https://rpubs.com/zaidyousif/1089175
R is a programming language, and RStudio is an IDE (integrated development environment) that provides a graphical interface for working with R. Both R and RStudio need to be installed for you to participate in the lecture. The R language itself is free, and the free edition of RStudio is sufficient for this course; you do not need to pay for any software. Please feel free to email me if you have any issues with the installation.
A package bundles together code, data, documentation, and tests, and is easy to share with others. We use the install.packages() function to install packages in R. A package only needs to be installed once, which is why the lines below are commented out.
# install.packages("tidyverse")
# install.packages("openxlsx")
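Because installation is a one-time step, it can be convenient to check which packages are already present before installing anything. The following is a minimal sketch; `missing_packages` is a hypothetical helper name introduced here for illustration, not something provided by R.

```r
# Packages this lecture relies on
needed <- c("tidyverse", "openxlsx")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```

If `to_install` prints `character(0)`, both packages are already installed and nothing further is needed.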
To load a package in R, we use the library() function. Once a package is loaded, you can use the functions it provides.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
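The Conflicts message above means that, after loading the tidyverse, a plain call to filter() refers to the dplyr version rather than the one in the stats package. As a small sketch of how to stay unambiguous, a function can be called with the `package::function` prefix. The example below uses only the stats package, which ships with R; the vector `x` is made up for illustration.

```r
# A small made-up numeric vector
x <- c(1, 2, 3, 4, 5)

# Call the stats version explicitly: a centered 3-point moving average.
# The prefix works whether or not dplyr has been loaded.
moving_avg <- stats::filter(x, rep(1 / 3, 3))

print(as.numeric(moving_avg))
```

The first and last values are NA because a centered 3-point window does not fit at the ends of the vector.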
Most often, the data you are going to analyze is generated externally and provided to you as a spreadsheet. Spreadsheets come in different formats including Microsoft Excel (xlsx), comma-separated values (csv), and tab-separated values (tsv). There is a method to load each spreadsheet type. In this course, we will focus on loading data from xlsx files.
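Although this course focuses on xlsx files, csv and tsv files can be read with functions that ship with base R. The sketch below writes a small made-up data frame to temporary files and reads it back, so it runs without any external data; the object names are assumptions chosen for illustration.

```r
# A small made-up data frame
toy <- data.frame(id = 1:3, sex = c("Male", "Female", "Male"))

# Write it to temporary csv and tsv files
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(toy, csv_path, row.names = FALSE)
write.table(toy, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# read.csv() expects comma-separated values; read.delim() expects tabs
from_csv <- read.csv(csv_path)
from_tsv <- read.delim(tsv_path)

print(from_csv)
print(from_tsv)
```

The tidyverse loaded earlier also offers readr::read_csv() and readr::read_tsv() for the same purpose.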
When loading a file into R, it is important to specify the location of the file correctly. Any mistake in specifying the location will cause the data to fail to load. Note that the way file locations are written differs between Windows and macOS. I recommend saving the data file in the same directory as your R script file. Once you save the R script file, you can set the working directory to that location, either from the RStudio menu (Session > Set Working Directory > To Source File Location) or by calling setwd() with the path to the directory:
setwd("E:/Biostat and Study Design/204/Lectures/Data")
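Since a wrong path is the most common reason a file fails to load, it can help to check the working directory and the file location before reading. This is a minimal sketch; the folder names passed to file.path() are hypothetical and should be replaced with your own.

```r
# Show the directory R is currently working in
print(getwd())

# Build a path in a way that works on both Windows and macOS.
# "Lectures" and "Data" are hypothetical folder names.
data_file <- file.path("Lectures", "Data", "NHEFS.xlsx")
print(data_file)

# TRUE if the file is found at that location, FALSE otherwise
print(file.exists(data_file))
```

If file.exists() returns FALSE, the path or the working directory needs to be corrected before the file can be loaded.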
In this course, we will analyze data from the National Health and Nutrition Examination Survey (NHANES), specifically from its longitudinal follow-up, the NHANES I Epidemiologic Follow-up Study (NHEFS), which is why the data file is named NHEFS.xlsx. In summary, the study was designed to investigate the relationships between clinical, nutritional, and behavioral factors. We will only analyze a small subset of the data.
To load the data from the xlsx file into a data frame, we will use the openxlsx package we installed and loaded previously.
setwd("E:/Biostat and Study Design/204/Lectures/Data")
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx')
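To try read.xlsx() without the course file, the sketch below writes a small made-up data frame to a temporary xlsx file and reads it back. It assumes the openxlsx package is installed and is skipped otherwise; the `sheet = 1` argument is written out only to show where a different sheet could be chosen, since the first sheet is read by default.

```r
if (requireNamespace("openxlsx", quietly = TRUE)) {
  # A small made-up data frame standing in for real data
  toy <- data.frame(seqn = c(233, 235), sex = c("Male", "Female"))

  # Write it to a temporary xlsx file
  xlsx_path <- tempfile(fileext = ".xlsx")
  openxlsx::write.xlsx(toy, xlsx_path)

  # Read the first sheet back into a data frame
  toy_back <- openxlsx::read.xlsx(xlsx_path, sheet = 1)
  print(toy_back)
}
```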
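To view the dataset, use the View() function, which opens the data frame in a spreadsheet-style viewer in RStudio.
# View(NHANES_df)
The following functions are useful for exploring data frames: head() and tail() show the first and last six rows, dim(), nrow(), and ncol() report the number of rows and columns, and str() lists each column with its type and first few values.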
head(NHANES_df)
## seqn qsmk death yrdth modth dadth sbp dbp sex age race income
## 1 233 0 0 NA NA NA 175 96 Male 42 Black or other 19
## 2 235 0 0 NA NA NA 123 80 Male 36 White 18
## 3 244 0 0 NA NA NA 115 75 Female 56 Black or other 15
## 4 245 0 1 85 2 14 148 78 Male 68 Black or other 15
## 5 252 0 0 NA NA NA 118 77 Male 40 White 18
## 6 257 0 0 NA NA NA 141 83 Female 43 Black or other 11
## marital school education ht wt71 wt82 wt82_71 birthplace
## 1 2 7 1 174.1875 79.04 68.94604 -10.093960 47
## 2 2 9 2 159.3750 58.63 61.23497 2.604970 42
## 3 3 11 2 168.5000 56.81 66.22449 9.414486 51
## 4 3 5 1 170.1875 59.42 64.41012 4.990117 37
## 5 2 11 2 181.8750 87.09 92.07925 4.989251 42
## 6 4 9 2 162.1875 99.00 103.41906 4.419060 34
## smokeintensity smkintensity82_71 smokeyrs asthma bronch tb hf hbp pepticulcer
## 1 30 -10 29 0 0 0 0 1 1
## 2 20 -10 24 0 0 0 0 0 0
## 3 20 -14 26 0 0 0 0 0 0
## 4 3 4 53 0 0 0 0 1 0
## 5 20 0 19 0 0 0 0 0 0
## 6 10 10 21 0 0 0 0 0 0
## colitis hepatitis chroniccough hayfever diabetes polio tumor nervousbreak
## 1 0 0 0 0 1 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 1 0 0 1 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## alcoholpy alcoholfreq alcoholtype alcoholhowmuch pica headache otherpain
## 1 1 1 3 7 0 1 0
## 2 1 0 1 4 0 1 0
## 3 1 3 4 NA 0 1 1
## 4 1 2 3 4 0 0 1
## 5 1 2 1 2 0 1 0
## 6 1 3 2 1 0 1 0
## weakheart allergies nerves lackpep hbpmed boweltrouble wtloss infection
## 1 0 0 0 0 1 0 0 0
## 2 0 0 0 0 0 0 0 1
## 3 0 0 1 0 0 0 0 0
## 4 1 0 0 0 0 0 0 0
## 5 0 0 0 0 0 1 0 0
## 6 0 0 0 0 0 0 0 0
## active exercise birthcontrol pregnancies cholesterol hightax82 price71
## 1 0 2 2 NA 197 0 2.183594
## 2 0 0 2 NA 301 0 2.346680
## 3 0 2 0 2 157 0 1.569580
## 4 1 2 2 NA 174 0 1.506592
## 5 1 1 2 NA 216 0 2.346680
## 6 1 1 0 1 212 1 2.209961
## price82 tax71 tax82 price71_82 tax71_82
## 1 1.739990 1.1022949 0.4619751 0.44378662 0.6403809
## 2 1.797363 1.3649902 0.5718994 0.54931641 0.7929688
## 3 1.513428 0.5512695 0.2309875 0.05619812 0.3202515
## 4 1.451904 0.5249023 0.2199707 0.05479431 0.3049927
## 5 1.797363 1.3649902 0.5718994 0.54931641 0.7929688
## 6 2.025879 1.1547852 0.7479248 0.18408203 0.4069824
tail(NHANES_df)
## seqn qsmk death yrdth modth dadth sbp dbp sex age race income marital
## 1624 25013 0 0 NA NA NA 125 77 Female 47 White 16 5
## 1625 25014 0 0 NA NA NA 115 66 Male 45 White 16 4
## 1626 25016 0 0 NA NA NA 124 80 Female 47 White 18 2
## 1627 25024 0 0 NA NA NA NA NA Female 51 White 15 2
## 1628 25032 0 0 NA NA NA 171 77 Male 68 White 13 2
## 1629 25061 1 0 NA NA NA 136 90 Male 29 White 19 2
## school education ht wt71 wt82 wt82_71 birthplace
## 1624 8 1 167.1875 84.94 93.44003 8.500028 54
## 1625 0 1 172.0938 63.05 64.41012 1.360117 54
## 1626 8 1 170.0938 57.72 61.23497 3.514970 54
## 1627 12 3 166.2812 62.71 NA NA 24
## 1628 6 1 162.5938 52.39 57.15264 4.762639 54
## 1629 10 2 165.5000 90.83 106.59421 15.764207 54
## smokeintensity smkintensity82_71 smokeyrs asthma bronch tb hf hbp
## 1624 20 0 31 0 0 0 0 2
## 1625 40 0 29 0 0 0 0 2
## 1626 20 0 31 0 0 0 0 2
## 1627 40 40 30 0 0 0 0 2
## 1628 15 5 46 0 0 0 0 2
## 1629 30 -30 14 0 0 0 0 2
## pepticulcer colitis hepatitis chroniccough hayfever diabetes polio tumor
## 1624 1 0 0 0 0 2 0 1
## 1625 0 0 0 0 0 2 0 0
## 1626 0 0 0 0 0 2 0 0
## 1627 0 0 0 0 0 2 0 0
## 1628 0 0 0 0 0 2 0 0
## 1629 0 0 0 0 0 2 0 0
## nervousbreak alcoholpy alcoholfreq alcoholtype alcoholhowmuch pica
## 1624 0 0 4 4 NA 2
## 1625 0 1 2 1 2 2
## 1626 0 1 3 4 NA 2
## 1627 0 0 4 4 NA 2
## 1628 0 1 2 1 2 2
## 1629 0 1 2 1 6 2
## headache otherpain weakheart allergies nerves lackpep hbpmed boweltrouble
## 1624 1 1 0 0 0 0 2 2
## 1625 1 1 0 0 0 0 2 2
## 1626 1 0 0 0 0 0 2 2
## 1627 1 1 0 0 0 0 2 2
## 1628 1 1 0 0 0 0 2 2
## 1629 0 1 0 0 0 0 2 2
## wtloss infection active exercise birthcontrol pregnancies cholesterol
## 1624 0 1 0 0 0 5 254
## 1625 0 0 0 0 2 NA NA
## 1626 0 0 0 0 0 2 270
## 1627 0 0 0 0 0 3 228
## 1628 0 0 1 1 2 NA 223
## 1629 0 0 1 1 2 NA 243
## hightax82 price71 price82 tax71 tax82 price71_82 tax71_82
## 1624 0 2.167969 1.940186 1.0498047 0.5499268 0.2278748 0.5000000
## 1625 0 2.167969 1.940186 1.0498047 0.5499268 0.2278748 0.5000000
## 1626 0 2.167969 1.940186 1.0498047 0.5499268 0.2278748 0.5000000
## 1627 0 1.800781 1.647705 0.7349854 0.4619751 0.1529846 0.2729492
## 1628 0 2.167969 1.940186 1.0498047 0.5499268 0.2278748 0.5000000
## 1629 0 2.167969 1.940186 1.0498047 0.5499268 0.2278748 0.5000000
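With 64 columns, the default six rows printed by head() and tail() can be hard to read. Both functions accept an `n` argument to control how many rows are returned. A short sketch on a made-up data frame (`toy_df` is an illustrative name):

```r
# A made-up data frame with 10 rows
toy_df <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

# First three rows instead of the default six
first_three <- head(toy_df, n = 3)
print(first_three)

# Last two rows
last_two <- tail(toy_df, n = 2)
print(last_two)
```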
dim(NHANES_df)
## [1] 1629 64
nrow(NHANES_df)
## [1] 1629
ncol(NHANES_df)
## [1] 64
str(NHANES_df)
## 'data.frame': 1629 obs. of 64 variables:
## $ seqn : num 233 235 244 245 252 257 262 266 419 420 ...
## $ qsmk : num 0 0 0 0 0 0 0 0 0 0 ...
## $ death : num 0 0 0 1 0 0 0 0 1 1 ...
## $ yrdth : num NA NA NA 85 NA NA NA NA 84 86 ...
## $ modth : num NA NA NA 2 NA NA NA NA 10 10 ...
## $ dadth : num NA NA NA 14 NA NA NA NA 13 17 ...
## $ sbp : num 175 123 115 148 118 141 132 100 163 184 ...
## $ dbp : num 96 80 75 78 77 83 69 53 79 106 ...
## $ sex : chr "Male" "Male" "Female" "Male" ...
## $ age : num 42 36 56 68 40 43 56 29 51 43 ...
## $ race : chr "Black or other" "White" "Black or other" "Black or other" ...
## $ income : num 19 18 15 15 18 11 19 22 18 16 ...
## $ marital : num 2 2 3 3 2 4 2 2 2 2 ...
## $ school : num 7 9 11 5 11 9 12 12 10 11 ...
## $ education : num 1 2 2 1 2 2 3 3 2 2 ...
## $ ht : num 174 159 168 170 182 ...
## $ wt71 : num 79 58.6 56.8 59.4 87.1 ...
## $ wt82 : num 68.9 61.2 66.2 64.4 92.1 ...
## $ wt82_71 : num -10.09 2.6 9.41 4.99 4.99 ...
## $ birthplace : num 47 42 51 37 42 34 NA NA 42 42 ...
## $ smokeintensity : num 30 20 20 3 20 10 20 2 25 20 ...
## $ smkintensity82_71: num -10 -10 -14 4 0 10 0 1 -10 -20 ...
## $ smokeyrs : num 29 24 26 53 19 21 39 9 37 25 ...
## $ asthma : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bronch : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tb : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hf : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hbp : num 1 0 0 1 0 0 0 0 0 0 ...
## $ pepticulcer : num 1 0 0 0 0 0 0 0 0 0 ...
## $ colitis : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hepatitis : num 0 0 0 0 0 0 0 0 0 1 ...
## $ chroniccough : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hayfever : num 0 0 1 0 0 0 0 0 0 0 ...
## $ diabetes : num 1 0 0 0 0 0 1 0 0 0 ...
## $ polio : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tumor : num 0 0 1 0 0 0 0 0 0 0 ...
## $ nervousbreak : num 0 0 0 0 0 0 0 0 0 0 ...
## $ alcoholpy : num 1 1 1 1 1 1 1 1 1 1 ...
## $ alcoholfreq : num 1 0 3 2 2 3 1 0 1 0 ...
## $ alcoholtype : num 3 1 4 3 1 2 3 2 3 1 ...
## $ alcoholhowmuch : num 7 4 NA 4 2 1 4 1 2 6 ...
## $ pica : num 0 0 0 0 0 0 0 0 0 0 ...
## $ headache : num 1 1 1 0 1 1 1 1 1 1 ...
## $ otherpain : num 0 0 1 1 0 0 0 0 1 0 ...
## $ weakheart : num 0 0 0 1 0 0 0 0 0 0 ...
## $ allergies : num 0 0 0 0 0 0 0 0 0 0 ...
## $ nerves : num 0 0 1 0 0 0 1 1 0 0 ...
## $ lackpep : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hbpmed : num 1 0 0 0 0 0 0 0 0 0 ...
## $ boweltrouble : num 0 0 0 0 1 0 0 0 0 0 ...
## $ wtloss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ infection : num 0 1 0 0 0 0 0 0 0 0 ...
## $ active : num 0 0 0 1 1 1 0 0 2 1 ...
## $ exercise : num 2 0 2 2 1 1 1 2 2 2 ...
## $ birthcontrol : num 2 2 0 2 2 0 0 0 2 2 ...
## $ pregnancies : num NA NA 2 NA NA 1 1 2 NA NA ...
## $ cholesterol : num 197 301 157 174 216 212 205 166 337 279 ...
## $ hightax82 : num 0 0 0 0 0 1 NA NA 0 0 ...
## $ price71 : num 2.18 2.35 1.57 1.51 2.35 ...
## $ price82 : num 1.74 1.8 1.51 1.45 1.8 ...
## $ tax71 : num 1.102 1.365 0.551 0.525 1.365 ...
## $ tax82 : num 0.462 0.572 0.231 0.22 0.572 ...
## $ price71_82 : num 0.4438 0.5493 0.0562 0.0548 0.5493 ...
## $ tax71_82 : num 0.64 0.793 0.32 0.305 0.793 ...
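The str() output shows that several columns contain NA (missing) values and that sex and race are stored as character columns. As a sketch of how to follow up on this, the example below counts missing values per column, lists column types, and tabulates a character column on a small made-up data frame; the same calls could be applied to NHANES_df.

```r
# A made-up data frame mimicking a few NHEFS-style columns
toy_df <- data.frame(
  sbp = c(175, 123, NA, 148),
  sex = c("Male", "Male", "Female", "Male"),
  pregnancies = c(NA, NA, 2, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_df))
print(na_counts)

# Type of each column
col_types <- vapply(toy_df, class, character(1))
print(col_types)

# Frequency of each category in a character column
sex_counts <- table(toy_df$sex)
print(sex_counts)
```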