Just in case you missed ‘That Sugar song’…
Today we will seek to understand and apply:
We are interested in the relationship between people's understanding of the sugar content of food and dental caries.
We have just carried out an epidemiological study, using a Google Forms questionnaire to collect the class's responses.
What sort of epidemiological study is this? What sort of biases do we need to be aware of in interpreting these data?
The form is available at the following link.
This is a cross-sectional study that asks you various questions about your demography, height, sugary drink intake, and various health outcomes, such as need for dental treatment and hospital treatment in the past year.
We have a belief or hypothesis that people who consume less sugar have fewer rotten teeth, and need fewer dental procedures.
We will be revising some of the lecture concepts this week by exploring these data.
We will be using iNZight lite available here.
An edited version of the spreadsheet results is available here.
Download the data, which is in an .xlsx (“Excel”) file, to your machine and open it in Excel.
You may have to open Excel first and then click File –> Open and navigate to the file (usually in Downloads on a Windows machine).
Have a look at the data in the spreadsheet. You can see that I have included the last three years’ responses. The larger sample will help increase statistical power.
What are the main problems with a new dataset that we have to be aware of and check for?
The main issues that I like to identify are:
These are often overlooked, and I encourage you to remember them. They will streamline your work in your future studies and working life.
What strategy do you think I used to make sure I didn’t get wacky data when designing the questionnaire in Google Forms?
What do you think an individual row represents?
What do you think an individual column represents?
Are there any immediate problems with the data that you can spot by scrolling through the data?
Highlight an individual column by clicking its header in the top grey row.
What do you see down the bottom of the window?
How can this help us detect problems in the data?
First, we want to check the data to make sure that it is kosher. This is often easier in Excel.
Check each variable for out-of-range values by clicking on the top row. Do you find any?
What should we do with these out-of-range values?
Generally, we wish to discard as little data as possible.
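The same range check can also be scripted outside Excel. The sketch below is only an illustration in Python, not part of the tutorial's workflow. The column name `height_cm`, the plausible range, and the rows themselves are made-up assumptions.

```python
# Hypothetical rows; the real spreadsheet's column names and values will differ.
rows = [
    {"id": 1, "height_cm": 172},
    {"id": 2, "height_cm": 17},    # possibly a typo for 170
    {"id": 3, "height_cm": 165},
    {"id": 4, "height_cm": 1680},  # possibly entered in millimetres
]

# Assumed plausible range for adult height in centimetres.
LOW, HIGH = 120, 220

def out_of_range(rows, column, low, high):
    """Return the rows whose value in `column` falls outside [low, high]."""
    return [r for r in rows if not (low <= r[column] <= high)]

flagged = out_of_range(rows, "height_cm", LOW, HIGH)
for r in flagged:
    print(r)
```

Flagging rather than deleting leaves the decision about each value to you: only the offending cell needs to be treated as missing, and the rest of the row can stay.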
Is there a problem with missing values?
Why might these values be missing?
How could we prevent missing values?
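To get a feel for how much is missing, it can help to count blank cells per column. This is a minimal Python sketch under assumed column names, with `None` standing in for a blank spreadsheet cell. The data are invented for illustration.

```python
# Hypothetical rows; None stands for a blank spreadsheet cell.
rows = [
    {"age": 21, "height_cm": 172, "caries": "Yes"},
    {"age": None, "height_cm": 165, "caries": "No"},
    {"age": 23, "height_cm": None, "caries": None},
]

def missing_counts(rows):
    """Count blank (None or empty-string) cells in each column."""
    counts = {}
    for r in rows:
        for column, value in r.items():
            counts.setdefault(column, 0)
            if value is None or value == "":
                counts[column] += 1
    return counts

counts = missing_counts(rows)
print(counts)
```

A column with many blanks may point to a question that was optional, unclear, or sensitive.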
Are there any duplicates? Try looking manually.
Then try with Excel: Data –> Remove Duplicates.
Which columns should we include in our search for duplicates?
What should we do with these duplicates?
What do we need to make sure we do after removing duplicates?
Note: iNZight doesn’t have a facility for checking for duplicates, so this must be carried out in Excel before loading data into iNZight for analysis!
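For comparison, here is a rough Python sketch of what a duplicate search does and why the choice of columns matters. It is not how Excel implements Remove Duplicates. The `timestamp` column is an assumption (the edited spreadsheet may not contain one) and the rows are invented.

```python
# Hypothetical rows; "timestamp" mimics a submission time a form might record.
rows = [
    {"timestamp": "t1", "age": 21, "height_cm": 172, "caries": "Yes"},
    {"timestamp": "t2", "age": 21, "height_cm": 172, "caries": "Yes"},  # resubmission?
    {"timestamp": "t3", "age": 23, "height_cm": 165, "caries": "No"},
]

def drop_duplicates(rows, key_columns):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    kept = []
    for r in rows:
        key = tuple(r[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

# With the timestamp in the key every row looks unique;
# without it, the repeated answers show up as a candidate duplicate.
with_timestamp = drop_duplicates(rows, ["timestamp", "age", "height_cm", "caries"])
without_timestamp = drop_duplicates(rows, ["age", "height_cm", "caries"])
print(len(rows), len(with_timestamp), len(without_timestamp))
```

Two different people can give identical answers, so a match on the answer columns is only a candidate duplicate, and it is worth recording the row count before and after any removal.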
iNZight is very good at visualising data.
A clean version of the data is available here.
Import the data (File tab –> Import data –> Browse and navigate to the folder with the .csv or .xlsx file you downloaded).
The file is most likely to be located in your ‘Downloads’ folder in Windows or macOS, unless you put it somewhere else.
You should see the spreadsheet. You can change the number of rows to view by clicking on the Show … Entries box.
In this tutorial, we will begin by exploring the prevalence of various characteristics of the class.
We will use the Visualise tab and interpret the various plots.
Boxplots can be very useful for spotting outliers, and checking for symmetry in continuous variables.
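Many boxplots mark as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. Whether iNZight uses exactly this rule is an assumption here, and the heights below are invented. The sketch uses Python's `statistics.quantiles` with its default method, so the quartiles may differ slightly from other software.

```python
from statistics import quantiles

# Hypothetical heights in centimetres.
heights = [150, 160, 162, 165, 168, 170, 172, 175, 178, 210]

# Quartiles (default "exclusive" method); other tools may interpolate differently.
q1, median, q3 = quantiles(heights, n=4)
iqr = q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [h for h in heights if h < lower_fence or h > upper_fence]
print(q1, q3, lower_fence, upper_fence, outliers)
```

A point outside the fences is not necessarily an error. It is a prompt to go back and check that value.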
The various charts can help us check the data for “face validity”.
We can also estimate the prevalence of various characteristics of the population.
Remember that prevalence is just epidemiological speak for the proportion (or percentage) of a population that has a characteristic at a given time.
What, for example, is the prevalence of dental caries in the population?
Does, for example, the prevalence of dental caries vary by age?
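Behind the plots, a prevalence is a simple count divided by a total. The Python sketch below computes it overall and within groups. The column names `age_group` and `caries`, the categories, and the responses are all hypothetical.

```python
# Hypothetical responses; the real column names and categories will differ.
rows = [
    {"age_group": "18-20", "caries": "Yes"},
    {"age_group": "18-20", "caries": "No"},
    {"age_group": "18-20", "caries": "No"},
    {"age_group": "21+", "caries": "Yes"},
    {"age_group": "21+", "caries": "Yes"},
    {"age_group": "21+", "caries": "No"},
    {"age_group": "21+", "caries": "No"},
    {"age_group": "21+", "caries": "No"},
]

def prevalence(rows, outcome="caries", positive="Yes"):
    """Proportion of rows whose outcome equals the positive category."""
    return sum(r[outcome] == positive for r in rows) / len(rows)

def prevalence_by(rows, group, outcome="caries", positive="Yes"):
    """Prevalence of the outcome within each level of the grouping column."""
    groups = {}
    for r in rows:
        groups.setdefault(r[group], []).append(r)
    return {level: prevalence(members, outcome, positive)
            for level, members in groups.items()}

overall = prevalence(rows)
by_age = prevalence_by(rows, "age_group")
print(f"overall: {overall:.1%}")
for level, p in by_age.items():
    print(f"{level}: {p:.1%}")
```

With groups this small, differences between prevalences can easily arise by chance, so the plotted comparison should be read with the group sizes in mind.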
Also, iNZight does some nice work behind the scenes looking at the type of variable you have (categorical or continuous) and selecting the appropriate graph.
Check the prevalence of dental caries by whether or not a respondent drinks at least one sugary drink a day. How do we interpret the plot? What would we expect to see if sugary drinks had no effect on dental caries? What sort of biases may be distorting this relationship?
Contrast this with what we find when we look at dental caries by all categories of sugary drink consumption.
Check the prevalence of dental caries by tooth brushing frequency.
Check the prevalence of dental caries by use of an electric toothbrush.
If you were advising people which toothbrush to use to prevent caries, which strategy looks most promising?
Check the distribution of responses to the Pop-Tarts question.
If we now compare responses to the Pop-Tarts question by whether or not individuals had dental caries, what do we notice?
Select the Pop-Tarts and ‘average adult intake of sugar’ questions.
For a concise overview of the uses of various different charts in exploring data, see here.
A teaspoon of sugar weighs about 4 g, slightly less than a teaspoon of water (about 5 g).
The American Heart Association (AHA) recommends no more than 9 teaspoons of added sugar a day for a man and no more than 6 for a woman.
The Pop-Tarts had 19 g of sugar per serve, and the pack contained 8 serves.
Therefore, the number of teaspoons of sugar in the box is (19 g × 8) / (4 g per teaspoon) = 38 teaspoons.
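The arithmetic can be checked in a few lines of Python. All figures come from the text above. The final step, expressing the box as days of the AHA daily limit, is an added illustration rather than part of the original question.

```python
GRAMS_PER_TEASPOON = 4   # approximate figure used in the text
sugar_per_serve_g = 19
serves_per_box = 8

total_sugar_g = sugar_per_serve_g * serves_per_box   # grams of sugar in the box
teaspoons = total_sugar_g / GRAMS_PER_TEASPOON       # teaspoons of sugar in the box

# AHA daily limits quoted in the text, in teaspoons.
AHA_LIMIT_MEN = 9
AHA_LIMIT_WOMEN = 6

days_men = teaspoons / AHA_LIMIT_MEN
days_women = teaspoons / AHA_LIMIT_WOMEN

print(f"{total_sugar_g} g = {teaspoons:.0f} teaspoons")
print(f"about {days_men:.1f} days of the limit for a man, {days_women:.1f} for a woman")
```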
Is student height distributed symmetrically?
Does height vary by gender?
Does height vary by ethnicity?