Tutorial 1 for POPLHLTH 216

Simon Thornley

18 July, 2023

Just in case you missed ‘That Sugar song’…

Photo by Isaac Quesada on Unsplash

Aims

Today we will:

Check the quality of data before analysis
Begin our analysis of data with plots and their interpretation (exploratory data analysis)
Consider some of the factors not addressed by a statistical analysis (bias)
Explore some basic functionality of iNZight Lite

Baby steps, opening our data, checking the validity, and estimating prevalence of dental caries.

We have been interested in the relationship between understanding of sugar content in food and the link to dental caries.

We have just carried out an epidemiological study of the class, using a questionnaire built in Google Forms.

What sort of epidemiological study is this? What sort of biases do we need to be aware of in interpreting these data?

Sugar and dental caries class study

The form is available at the following link.

This is a cross-sectional study that asks you various questions about your demographics, height, sugary drink intake, and health outcomes, such as the need for dental treatment and hospital treatment in the past year.

We have a belief or hypothesis that people who consume less sugar have fewer rotten teeth, and need fewer dental procedures.

We will be revising some of the lecture concepts this week by exploring these data.

We will be using iNZight Lite, available here.

An edited version of the spreadsheet results is available here.

Open the data in Excel

Download the data, which is in a .xlsx (“Excel”) file, to your machine and open it in Excel.

You may have to open Excel first and then click File –> Open and navigate to the file (usually in Downloads on a Windows machine).

Have a look at the data in the spreadsheet. You can see that I have included the last three years’ responses. This will help increase statistical power.

Questions

What are the main problems with a new dataset that we have to be aware of and check for?

The main issues that I like to identify are:

duplicates
missing values
out-of-range values

These are often overlooked, and I encourage you to remember them. They will streamline your work in your future studies and working life.
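The three checks above can also be sketched in a few lines of plain Python, in case you ever need to repeat them on a larger dataset. This is only an illustration: the column names, the example rows, and the plausible height range are all invented, not taken from the class spreadsheet.

```python
from collections import Counter

# Hypothetical responses; column names and values are invented for illustration.
responses = [
    {"id": 1, "age": 19, "height_cm": 172, "caries": "Yes"},
    {"id": 2, "age": 21, "height_cm": None, "caries": "No"},   # missing height
    {"id": 3, "age": 20, "height_cm": 1650, "caries": "No"},   # implausible height
    {"id": 2, "age": 21, "height_cm": None, "caries": "No"},   # exact repeat of id 2
]

# Assumed plausible range for adult height in cm (a judgement call, not a rule).
HEIGHT_RANGE = (120, 220)


def count_duplicates(rows):
    """Count extra copies of rows whose fields are all identical."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


def count_missing(rows, column):
    """Count rows where the given column is empty (None)."""
    return sum(1 for row in rows if row[column] is None)


def out_of_range(rows, column, low, high):
    """List the non-missing values in a column that fall outside [low, high]."""
    return [
        row[column]
        for row in rows
        if row[column] is not None and not (low <= row[column] <= high)
    ]


n_duplicates = count_duplicates(responses)
n_missing = count_missing(responses, "height_cm")
bad_heights = out_of_range(responses, "height_cm", *HEIGHT_RANGE)

print("duplicate rows:", n_duplicates)
print("missing heights:", n_missing)
print("out-of-range heights:", bad_heights)
```

Note that the missing-value count includes the duplicated row, which is one reason the order in which you run the checks can matter.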

What strategy do you think I used to make sure I didn’t get wacky data when designing the questionnaire in Google Forms?

What do you think an individual row represents?

What do you think an individual column represents?

Are there any immediate problems with the data that you can spot by scrolling through the data?

Highlight an individual column, by clicking on the top grey row.

What do you see down the bottom of the window?

How can this help us detect problems in the data?

Check the validity of the data

First, we want to check the data to make sure that it is kosher. This is often easier in Excel.

Check each variable for out-of-range values by clicking on the top row. Do you find any?

What should we do with these out-of-range values?

Should we discard that value (cell) or the whole response?

Generally, we wish to discard as little data as possible.
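As a rough sketch of what "discarding as little as possible" can look like, the Python below blanks only the offending cell and keeps the rest of the response. The function name, column names, and the 120 to 220 cm range are assumptions made up for this example.

```python
def blank_out_of_range(rows, column, low, high):
    """Return copies of the rows with out-of-range values in `column` set to None.

    Only the offending cell is blanked; the rest of each response is kept.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the original data are left untouched
        value = new_row[column]
        if value is not None and not (low <= value <= high):
            new_row[column] = None
        cleaned.append(new_row)
    return cleaned


# Hypothetical responses; the second height looks like a typing error.
responses = [
    {"age": 19, "height_cm": 172, "caries": "Yes"},
    {"age": 20, "height_cm": 1650, "caries": "No"},
]

# The range 120 to 220 cm is an assumed plausibility limit.
cleaned = blank_out_of_range(responses, "height_cm", 120, 220)
print(cleaned)
```

The second respondent's height becomes missing, but their age and caries answers still contribute to any analysis that does not involve height.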

Is there a problem with missing values?

Why might these values be missing?

How could we prevent missing values?

Are there any duplicates? Try looking manually.

Then, try with Excel. Data –> Remove Duplicates

Questions

Which columns should we include with our search for duplicates?

What should we do with these duplicates?

What do we need to make sure we do after removing duplicates?

Note: iNZight doesn’t have a facility for checking for duplicates, so this must be carried out in Excel before loading data into iNZight for analysis!
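For comparison with Excel's Remove Duplicates, here is a minimal Python sketch of the same idea: rows are treated as duplicates when they match on a chosen set of columns. The rows, the column names, and the choice of key columns are all hypothetical; which columns belong in the key is exactly the judgement call raised in the questions above.

```python
def drop_duplicates(rows, key_columns):
    """Keep the first row for each distinct combination of the key columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical responses: the second row repeats the first with a new timestamp.
rows = [
    {"timestamp": "2023-07-18 10:01", "age": 19, "height_cm": 172, "caries": "Yes"},
    {"timestamp": "2023-07-18 10:03", "age": 19, "height_cm": 172, "caries": "Yes"},
    {"timestamp": "2023-07-18 10:05", "age": 22, "height_cm": 160, "caries": "No"},
]

# Assumed key: every answer column, but not the submission timestamp.
key_columns = ["age", "height_cm", "caries"]

deduplicated = drop_duplicates(rows, key_columns)
n_removed = len(rows) - len(deduplicated)
print("rows before:", len(rows), "rows after:", len(deduplicated), "removed:", n_removed)
```

Comparing the row counts before and after is a quick sanity check. Bear in mind that two different people can legitimately give identical answers, so a match on the key columns is a prompt to look closer rather than proof of a duplicate.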

Exploring the data in iNZight Lite

iNZight is very good at visualising data.

A clean version of the data is available here.

Import the data (File tab –> Import data –> Browse and navigate to the folder with the .csv or .xlsx file you downloaded).

The file is most likely to be located in your ‘Downloads’ folder in Windows or macOS, unless you put it somewhere else.

You should see the spreadsheet. You can change the number of rows to view, by clicking on the Show … Entries box.

In this tutorial, we will begin by exploring the prevalence of various characteristics of the class.

We will use the Visualise tab and interpret the various plots.

Boxplots can be very useful for spotting outliers, and checking for symmetry in continuous variables.
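To make the idea of "spotting outliers" concrete, the sketch below applies the common boxplot convention of flagging points more than 1.5 × IQR beyond the quartiles. The heights are invented, and whether iNZight draws its boxplots with exactly this rule is an assumption; quartile definitions also differ slightly between software packages.

```python
import statistics

# Hypothetical heights in cm; the last value is deliberately implausible.
heights = [158, 162, 165, 168, 170, 171, 174, 178, 181, 250]

# Quartiles using the default method of statistics.quantiles.
q1, median, q3 = statistics.quantiles(heights, n=4)
iqr = q3 - q1

# Conventional boxplot fences: 1.5 x IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [h for h in heights if h < lower_fence or h > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3)
print("fences:", lower_fence, "to", upper_fence)
print("flagged as outliers:", outliers)
```

A flagged point is not automatically an error. It is a value worth checking against what is plausible for the variable.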

The various charts can help us check the data for “face validity”.

We can also estimate the prevalence of various characteristics of the population.

Remember prevalence is just epidemiological speak for proportion or percentage.

How?

What, for example, is the prevalence of dental caries in the population?

Does, for example, the prevalence of dental caries vary by age?
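Since prevalence is a proportion, it can be computed by hand as cases divided by the number of people who answered. The sketch below does this overall and within age groups. The data, the age bands, and the "Yes"/"No" coding are invented for illustration and will not match the class spreadsheet.

```python
# Hypothetical (age group, caries) pairs; values are invented for illustration.
responses = [
    ("<20", "Yes"), ("<20", "No"), ("<20", "No"), ("<20", "No"),
    ("20+", "Yes"), ("20+", "Yes"), ("20+", "No"), ("20+", "No"),
]


def prevalence(outcomes, positive="Yes"):
    """Proportion of non-missing outcomes equal to the positive value."""
    answered = [o for o in outcomes if o is not None]
    return sum(1 for o in answered if o == positive) / len(answered)


# Overall prevalence: cases divided by everyone who answered.
overall = prevalence([caries for _, caries in responses])

# Prevalence within each age group.
by_age = {}
for group in sorted({age for age, _ in responses}):
    by_age[group] = prevalence([c for age, c in responses if age == group])

print(f"overall prevalence: {overall:.1%}")
for group, value in by_age.items():
    print(f"age {group}: {value:.1%}")
```

With these made-up numbers the prevalence differs between the two age groups, which is the kind of pattern the plots in iNZight let you see at a glance.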

Also, iNZight does some nice work behind the scenes looking at the type of variable you have (categorical or continuous) and selecting the appropriate graph.

Check the prevalence of dental caries by whether or not a respondent drinks at least one sugary drink a day. How do we interpret the plot? What would we expect to see if sugary drinks had no effect on dental caries? What sort of biases may be distorting this relationship?

Contrast this with what we find when we look at dental caries by all categories of sugary drink consumption.
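The comparison behind this plot can be sketched numerically: tabulate caries by exposure, then compare the prevalence in each group. The counts below are invented, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical (drinks at least one sugary drink a day, caries) pairs.
responses = [
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "No"),
    ("No", "Yes"), ("No", "No"), ("No", "No"), ("No", "No"),
]

# Two-by-two table of counts, keyed by (exposure, outcome).
table = Counter(responses)


def group_prevalence(exposure):
    """Prevalence of caries among respondents with the given exposure."""
    cases = table[(exposure, "Yes")]
    total = cases + table[(exposure, "No")]
    return cases / total


prev_daily = group_prevalence("Yes")
prev_not_daily = group_prevalence("No")
difference = prev_daily - prev_not_daily

print(f"daily drinkers: {prev_daily:.0%}")
print(f"others:         {prev_not_daily:.0%}")
print(f"difference:     {difference:.0%}")
```

If the exposure were unrelated to caries, the two prevalences would be about equal and the difference close to zero, apart from chance variation. A non-zero difference in a cross-sectional sample like this one does not by itself show that sugary drinks cause caries.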

Check the prevalence of dental caries by tooth brushing frequency.

Interpret the plot.

Check the prevalence of dental caries by use of an electric toothbrush.

Interpret the plot.

If you were advising people which toothbrush to use to prevent caries, which strategy looks most promising?

Check the distribution of responses that were obtained to the poptarts question.

If we now compare responses to the poptart question by whether or not individuals had dental caries, what do we notice?

How might we explain this?

Select the poptart and ‘average adult intake of sugar’ question.

What do you notice about these two variables?
How might you explain this?

Which charts are suitable for which types of data?

For a concise overview of the uses of various different charts in exploring data, see here.

Answers to questions relating to sugar

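A teaspoon of sugar weighs about 4 g. That is slightly lighter than the same volume of water (a 5 mL teaspoon of water weighs 5 g).

The AHA (American Heart Association) recommends no more than 9 teaspoons of added sugar per day for a man and no more than 6 for a woman.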

The poptarts had 19g of sugar per serve, and the pack contained 8 serves.

Therefore, the number of teaspoons of sugar in the box is 19 g × 8 serves ÷ 4 g per teaspoon = 152 g ÷ 4 g per teaspoon = 38 teaspoons.
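The same arithmetic, written out step by step in Python so the units are easy to follow. The variable names are made up for this sketch; the numbers are the ones given above.

```python
GRAMS_PER_TEASPOON = 4   # approximate weight of a teaspoon of sugar
sugar_per_serve_g = 19   # grams of sugar in one serve
serves_per_box = 8       # serves in the box

# Total sugar in the box, in grams.
total_sugar_g = sugar_per_serve_g * serves_per_box

# Convert grams to teaspoons.
teaspoons_in_box = total_sugar_g / GRAMS_PER_TEASPOON

print(f"{total_sugar_g} g of sugar is about {teaspoons_in_box:.0f} teaspoons")
```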

Some homework…

Is student height distributed symmetrically?

Illustrate with an appropriate plot

Does height vary by gender?

Illustrate with an appropriate plot

Does height vary by ethnicity?

Illustrate with an appropriate plot