Just in case you missed ‘That Sugar song’…
Many of the health issues we are studying, I believe, are attributable to excess sugar intake.
Today we will seek to understand and apply:
We have been interested in the relationship between understanding determinants of dental caries, in particular, the link between cigarette smoking and dental decay.
We have obtained some data from a Korean health survey of people age twenty years and over.
What sort of epidemiological study is this??
What sort of biases do we need to be aware of in interpreting this data?
Warning In cross-sectional surveys, we always have to be mindful of reverse causation. Could our outcome affect our exposure, rather than the other way around which we were thinking of?
This is a health survey undertaken on a sample of South Koreans between 2011 and 2016. It is publicly available and anonymised.
In order to obtain the data, individuals would have had to have filled in a questionnaire relating to their gender, age and other factors. They would have had a variety of blood tests that the government thought was related to their overall health.
We will be using a small sample of the original data.
The measures include:
age
: Age in yearsheight(cm)
: Height of the person in centimetersweight(kg)
: Weight of the person in kilogramswaist(cm)
: Waist circumference in centimeterseyesight(left)
: Vision measurement in the left eye (0
good vision; 10 blind)eyesight(right)
: Vision measurement in the right eye (0
good vision; 10 blind)hearing(left)
: Hearing measurement in the left ear,
1
: normal, 2
: abnormal.hearing(right)
: Hearing measurement in the right ear,
1
: normal, 2
: abnormal.systolic
: Systolic blood pressure, (mmHg)relaxation
: diastolic blood pressure (mmHg)fasting blood sugar
: fasting venous glucose
(mg/dL)Cholesterol
: Measure of total cholesterol (mg/dL)triglyceride
: Level of triglycerides (mg/dL)HDL
: HDL (high-density lipoprotein) cholesterol,
sometimes called ‘good’ cholesterol (mg/dL).LDL
: LDL (low-density lipoprotein) cholesterol,
sometimes called ‘bad’ cholesterol (mg/dL)hemoglobin
: Hemoglobin levels in blood (g/dL).Urine protein
: Measure of protein in urine (ordinal
variable, 1
: none to 5
: ++++)serum creatinine
: Meaure of serum creatinine in the
body (kidney function; mg/dL).AST
: AST (aspartate aminotransferase) is an enzyme that
is found mostly in the liver.ALT
: ALT stands for alanine transaminase. It is an
enzyme found mostly in the liver.Gtp
: Liver enzyme usually increased in bile duct
disease.dental caries
: decayed teeth (0
none,
1
caries).tartar
: plaque or bacterial overgrowth in the teeth
(Y
: yes, N
: no).smoking
: Whether the person smokes or not
(0
ex or non-smoker, 1
current smoker).We will be using iNZight lite available here.
A cut down version of the spreadsheet results is available here.
Download the data which is in a .csv
(“comma
separated values”) file to your machine and open in Excel.
You may have to open Excel first and then click File –> Open and navigate to file (usually in Downloads on a Windows or Mac machine).
Have a look at the data in the spreadsheet.
What are the main problems with a new dataset that we have to be aware of and check for?
Warning
Before analysing any data, the main issues that I like to identify are:
These are often overlooked, and I encourage you to remember them. They will streamline your work in your future studies and working life.
What do you think an individual row represents? (hint: the number of rows is also equal to the number of individuals in the study).
What do you think an individual column represents? (hint: the meaning of the column is given by the header at the top of the column).
Are there any immediate problems with the data that you can spot by scrolling through the data?
Highlight an individual column, by clicking on the top grey row.
What do you see down the bottom of the window?
How can this help us detect problems in the data?
First, we want to check the data to make sure that it is ‘kosher’ (no major flaws in the data). This is often easier in Excel.
Check each variable for out-of-range variables by clicking on the top row. Do you find any?
What should we do with these out of range variables?
Generally, we wish to discard the minimum data.
Is there a problem with missing values?
Why might these values be missing?
How could we prevent missing values?
Are there any duplicates? Try looking manually.
Then, try with Excel. Data –> Remove Duplicates
Which columns should we include with our search for duplicates?
What should we do with these duplicates?
What do we need to make sure we do after removing duplicates?
Note: iNZight doesn’t have a facility for checking for duplicates, so this must be carried out in Excel before loading data into iNZight for analysis!
iNZight is very good at visualising data.
Import the data (File tab –> Import data –>
Browse and navigate to the folder with the .csv
file you downloaded).
The file is most likely to be located in your ‘Downloads’ folder in Windows or MacOS, unless you put it somewhere else.
You should see the spreadsheet. You can change the number of rows to view, by clicking on the Show … Entries box.
In this tutorial, we will begin by exploring the prevalence of various characteristics of the Korean survey.
We will use the Visualise tab and interpret the various plots.
Boxplots can be very useful for spotting outliers, and checking for symmetry in continuous variables.
Remember
what the various points of a boxplot mean.
Visualise height(cm)
alone and then by
gender
. What is this telling us?
The various charts can help us check the data for “face validity”.
Some variables will look strange when you visualise them, such as
smoking
and dental caries
. Because they are
coded as 0
and 1
, iNZight
sees
them as continuous variables as numbers. We have to tell
iNZight
that they are are categorical
variables.
We can do this by clicking on the Manipulate variables
menu and click convert to categorical
. Then select
smoking
or dental caries
and press
Enter
. You’ll notice that you’ll now have a new
categorical variable such as smoking.cat
.
Visualise this variable and note the difference with how
iNZight
handles continuous vs. categorical variables.
You’ll see a bar plot for categorical variables and a histogram/dot plot for continuous variables. The numeric summaries are different too.
Check Summary
after visualising the variable and note
the difference between the two variable types.
Warning
When visualising what you think should be categorical data, but
iNZight
is giving you a dot plot, rather than a bar plot,
it is likely that the data is interpreted as numeric. You will have to
convert the data to categorical as outlined above.
We can also estimate the prevalence of various characteristics of the population.
Remember prevalence is just epidemiological speak for proportion or percentage.
What, for example, is the prevalence of
dental caries
in the population?
Does, for example, the prevalence of
dental caries
vary by age
?
What is the prevalence of smoking
?
smoking
differ by gender
?Select smoking
as the first variable
and gender
as the second in the
Visualise
tab.
You should see something like this…
1
). You can see that 54.1% of men
(M
) smoke compared to only 4.14% of women
(F
).So, smoking is more common in men than women. I wonder why?
Summary
results. Note: the columns are
the first variable and the rows are the second variable categories. We
see a contingency table of counts and row percentages below. Generally,
to interpret the association between exposure and
outcome, you want proportions of outcome within groups
of exposure, which is given in the Summary
tab by having
the outcome first and exposure second.
This aligns with how our plot is initially set-up.Note also the width of the bars in the right-hand side of the plot. These are proportional to the size in each exposure group. Note that men are more numerous than women in this study.
Note: When visualising categorical relationships and associations, we want our outcome as the first variable and our exposure as the second variable.
Note: iNZight
does something clever with barplots of two
categorical variables, it depicts the prevalence of the second variable
in different classes of the first by the width of each bar.
Also, iNZight
does some nice work behind the scenes
looking at the type of variable you have (categorical or continuous) and
selecting the appropriate graph.
Check the prevalence of dental caries
(outcome), by
whether or not a subject smokes (smoking
; exposure).
Interpret the plot and the table
Check the prevalence of dental caries
by
gender
.
Which group has the most caries, males (M
) or females
(F
)?
Why might this be?
Hint: consider the relationship between
smoking
and gender
.
For a concise overview of the uses of various different charts in exploring data, see here.
Consider the relationship between tartar
and
dental caries
.
Are the two conditions associated?
Illustrate with an appropriate plot
describe the plot.
Interpret the Summary
tab.