Just in case you missed 'That Sugar song'...

Many of the health issues we are studying, I believe, are attributable to excess sugar intake.

🎯 Aims

Today we will seek to understand and apply:

Checking the quality of data before analysis
Beginning our analysis of data with plots and their interpretation (exploratory data analysis)
Consider some of the factors not addressed by a statistical analysis (bias)
Explore some basic functionality of iNZight lite.

Before we get carried away... check our data

We have been interested in the relationship between understanding determinants of dental caries, in particular, the link between cigarette smoking and dental decay.

We have obtained some data from a Korean health survey of people age twenty years and over.

What sort of epidemiological study is this??

What sort of biases are particularly important with this type of study design?

Warning In cross-sectional surveys, we always have to be mindful of reverse causation. Could our outcome affect our exposure, rather than the other way around which we were thinking of? ⏰ This is because there is no time separation of the exposure and outcome. Of course, other issues affect any observational study design, such as: confounding, measurement error and selection bias.

🇰🇷 Korean health survey

This is a health survey undertaken on a sample of South Koreans between 2011 and 2016. It is publicly available and anonymised.

In order to obtain the data, individuals would have had to have filled in a questionnaire relating to their gender, age and other factors. They would have had a variety of blood tests that the government thought was related to their overall health.

We will be using a small sample of the original data.

The measures include:

age: Age in years
height(cm): Height of the person in centimeters
weight(kg): Weight of the person in kilograms
waist(cm): Waist circumference in centimeters
eyesight(left): Vision measurement in the left eye (0 good vision; 10 blind)
eyesight(right): Vision measurement in the right eye (0 good vision; 10 blind)
hearing(left): Hearing measurement in the left ear, 1: normal, 2: abnormal.
hearing(right): Hearing measurement in the right ear, 1: normal, 2: abnormal.
systolic: Systolic blood pressure, (mmHg)
relaxation: diastolic blood pressure (mmHg)
fasting blood sugar: fasting venous glucose (mg/dL)
Cholesterol: Measure of total cholesterol (mg/dL)
triglyceride: Level of triglycerides (mg/dL)
HDL: HDL (high-density lipoprotein) cholesterol, sometimes called 'good' cholesterol (mg/dL).
LDL: LDL (low-density lipoprotein) cholesterol, sometimes called 'bad' cholesterol (mg/dL)
hemoglobin: Hemoglobin levels in blood (g/dL).
Urine protein: Measure of protein in urine (ordinal variable, 1: none to 5: ++++)
serum creatinine: Meaure of serum creatinine in the body (kidney function; mg/dL).
AST: AST (aspartate aminotransferase) is an enzyme that is found mostly in the liver.
ALT: ALT stands for alanine transaminase. It is an enzyme found mostly in the liver.
Gtp: Liver enzyme usually increased in bile duct disease.
dental caries: 🦷 decayed teeth (0 none, 1 caries).
tartar: plaque or bacterial overgrowth in the teeth (Y: yes, N: no).
smoking: 🚬 Whether the person smokes or not (0 ex or non-smoker, 1 current smoker).

We will be using iNZight lite available here.

A cut down version of the spreadsheet results is available here.

Open the data in Excel

Download the data which is in a .csv ("comma separated values") file to your machine and open in Excel.

You may have to open Excel first and then click File --> Open and navigate to file (usually in Downloads on a Windows or Mac machine).

Have a look at the data in the spreadsheet.

Questions

What are the main problems with a new dataset that we have to be aware of and check for?

Warning

Before analysing any data, the main issues that I like to identify are:

duplicates
missing values
out-of-range variables

These are often overlooked, and I encourage you to remember them. They will streamline your work in your future studies and working life.

What do you think an individual row represents? (hint: the number of rows is also equal to the number of individuals in the study).

What do you think an individual column represents? (hint: the meaning of the column is given by the header at the top of the column).

Are there any immediate problems with the data that you can spot by scrolling through the data?

Highlight an individual column, by clicking on the top grey row.

What do you see down the bottom of the window?

How can this help us detect problems in the data?

Check data validity

First, we want to check the data to make sure that our data is ok to analyse (no major flaws). This is often easier in Excel.

Check each variable for out-of-range variables by clicking on the top row and examining the maximimum and minimum in the bottom ribbon. Do you find any?

What should we do with these out-of-range variables?

Should we discard that value (cell) or the whole response?

Generally, we wish to discard the minimum data.

Is there a problem with missing values?

Why might these values be missing?

How could we prevent missing values?

Are there any duplicates? Try looking manually.

Then, try with Excel, using the toolbar. Data --> Remove Duplicates

Questions

Which columns should we include with our search for duplicates?

If there is a unique identifier...

What should we do with these duplicates?

Note: iNZight doesn't have a facility for checking for duplicates, so this must be carried out in Excel before loading data into iNZight for analysis!

Exploring the data in iNZight lite

iNZight is very good at visualising data.

Import the data (File tab --> Import data --> Browse and navigate to the folder with the .csv file you downloaded).

The file is most likely to be located in your 'Downloads' folder in Windows or MacOS, unless you put it somewhere else.

You should see the spreadsheet. You can change the number of rows to view, by clicking on the Show ... Entries box.

In this tutorial, we will begin by exploring the prevalence of various characteristics of the Korean survey.

We will use the **Visualise** tab and interpret the various plots.

Boxplots can be very useful for spotting outliers, and checking for symmetry in continuous variables.

Boxplots explained Remember what the various points of a boxplot mean.

Which statistic is represented by the width of the box in a boxplot?

The various charts can help us check the data for "face validity".

Visualise height(cm) alone and then by gender. What is this telling us?

Converting between continuous and categorical variables

Some variables will look strange when you visualise them, such as smoking and dental caries. Because they are coded as 0 and 1, iNZight sees them as continuous variables or numbers, when they are really categories.

Instead, we have to tell iNZight that these numbers are really categories.

We can do this by clicking on the Manipulate variables menu and click convert to categorical. Then select smoking or dental caries and press Enter. You'll notice that you'll now have a new categorical variable such as smoking.cat.

Visualise this variable and note the difference with how iNZight handles continuous vs. categorical variables.

You'll see a bar plot for categorical variables and a histogram/dot plot for continuous variables. The numeric summaries are different too.

Check Summary after visualising the variable and note the difference between the two variable types.

Warning

When visualising what you think should be categorical data, but iNZight is giving you a dot plot, rather than a bar plot, it is likely that the data is interpreted as numeric. You will have to convert the data to categorical as outlined above.

Estimating prevalence (categorical variables)

We can also estimate the prevalence of various characteristics of the population.

Remember prevalence is just epidemiological speak for .

What, for example, is the prevalence of dental caries in the population? %.

Does, for example, the prevalence of dental caries vary by age?

What is the prevalence of smoking?

Association between two categorical variables

Does 🚬 `smoking` differ by `gender`?

Select 🚬 smoking as the first variable and gender as the second in the Visualise tab.

You should see something like this...

Categorical association

Interpret the plot. Focus on the right hand side of the plot (current smoker: 1). You can see that %. of men (M) smoke compared to only % of women (F).

So, smoking is common in men than women. I wonder why?

Interpret the Summary results. Note: the columns are the first variable and the rows are the second variable categories. We see a contingency table of counts and row percentages below. Generally, to interpret the association between exposure and outcome, you want proportions of outcome within groups of exposure, which is given in the Summary tab by having the outcome first and exposure second. This aligns with how our plot is initially set-up.

Note also the width of the bars in the right-hand side of the plot. These are proportional to the size in each exposure group. Note that men are more numerous than women in this study.

Note: When visualising categorical relationships and associations, we want our outcome as the first variable and our exposure as the second variable.

Note: iNZight does something clever with barplots of two categorical variables, it depicts the prevalence of the second variable in different classes of the first by the width of each bar.

Also, iNZight does some nice work behind the scenes looking at the type of variable you have (categorical or continuous) and selecting the appropriate graph.

Check the prevalence of dental caries (outcome), by whether or not a subject smokes (smoking; exposure).

Interpret the plot and the table

Check the prevalence of dental caries by gender.

Which group has the most caries, males (M) or females (F)?

🦷 Dental caries is more in men than women.

Why might this be?

Hint: consider the relationship between 🚬 smoking and gender.

Which charts are suitable for which types of data??

For a concise overview of the uses of various different charts in exploring data, see here.

Some homework...

Tartar is the calcified build up of bacteria on teeth that often precedes more severe dental decay or caries.

Consider the relationship between tartar (exposure) and 🦷 dental caries (outcome).

How prevalent is each condition?

Tartar affects % of the population, whereas dental caries affects %.

Are the two conditions associated?

Illustrate with an appropriate plot
describe the plot.
Interpret the Summary tab.

In people with tartar = Y, the prevalence of 🦷 dental caries is % compared with a prevalence of dental caries of % in people with tartar = N. 🤔 This means that subjects with tartar are almost
as likely as those without to also have 🦷 dental caries.

Tutorial 1: POPLHLTH 216

Simon Thornley

17 July, 2025

🎯 Aims

Before we get carried away... check our data

🇰🇷 Korean health survey

Open the data in Excel

Questions

Check data validity

Questions

Exploring the data in iNZight lite

Converting between continuous and categorical variables

Estimating prevalence (categorical variables)

Association between two categorical variables

Does 🚬 `smoking` differ by `gender`?

Which charts are suitable for which types of data??

Some homework...

Tutorial 1: POPLHLTH 216

Simon Thornley

17 July, 2025

🎯 Aims

Before we get carried away... check our data

🇰🇷 Korean health survey

Open the data in Excel

Questions

Check data validity

Questions

Exploring the data in iNZight lite

Converting between continuous and categorical variables

Estimating prevalence (categorical variables)

Association between two categorical variables

Does 🚬 smoking differ by gender?

Which charts are suitable for which types of data??

Some homework...

Does 🚬 `smoking` differ by `gender`?