Concepts

Title: Basic concepts in statistics

(Types of variables, z-scores, dispersion and measures of spread)

Synopsis: This document summarizes some basic concepts in statistics.

Classification of variables

Classification 1: Kinds of variables that make data

Quantitative (numerical): Some sort of quantity, an amount of something (speed, height, score on a test)
1.1 Continuous
e.g. Time, because it can be broken into infinitely smaller parts.
1.2 Discrete
e.g. The number of cylinders in a car engine, which cannot be subdivided indefinitely.
Qualitative (categorical): A quality or characteristic of something (color, size, ethnicity, etc.)

Classification 2: Levels of measurement

Nominal
e.g. Car names. The categories have no inherent order.
Ordinal
e.g. Small, medium, large. There is an order, but no predefined interval between the categories (the spacing is not equidistant).
Interval
e.g. Temperature in degrees Celsius. The spacing between measurements is fixed, but zero is only a placeholder and does not mean the absence of the quantity.
Ratio
e.g. Data with fixed intervals between values where zero does mean something: the absence of the characteristic. In mtcars, displacement (disp, measured in cubic inches) is a ratio variable because zero would mean an engine with no displacement.

Variable classification

Classification helps us choose our analysis tool. Just as the kind of paint I use when painting a room dictates how I clean my brush, the kinds of variables we have dictate the kind of statistical tool we use in data analysis.
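The classifications above can be mirrored in how R stores a variable. The following is a minimal sketch using the built-in mtcars dataset; the size vector is made up for illustration and is not part of mtcars.

```r
# Built-in dataset referenced in the notes above.
data(mtcars)

# Ratio, continuous: engine displacement in cubic inches.
disp_is_numeric <- is.numeric(mtcars$disp)

# Ratio, discrete: the number of cylinders only takes whole values.
cyl_values <- sort(unique(mtcars$cyl))

# Nominal: transmission type stored as an unordered factor
# (in mtcars, am is coded 0 = automatic, 1 = manual).
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Ordinal: an ordered factor keeps the order small < medium < large.
# These three values are hypothetical, for illustration only.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

print(cyl_values)          # 4 6 8
print(is.ordered(size))    # TRUE
print(size[1] < size[2])   # TRUE, because small < large
```

Storing an ordinal variable as an ordered factor lets R compare categories while still refusing arithmetic on them, which matches the idea that the order is known but the spacing is not.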

z-scores

QUESTION: Can we compare the performance of two students from two separate classes who took an exam on the same material and received different grades?

Same basic ruler (regardless of scale)

If the exams were similar, the answer is yes.

But what if one exam was harder than the other?

In that case, could we still compare both students? The answer is still yes, with the help of z-scores.

You can calculate how far a student is from the class mean by:

score of the student - mean score of the class

Let's see how with an example.

Suppose two students, John and Jayne.

Let's see John's grades:

John's score = 83
John's class mean = 74
Difference = 9
John is 9 points above the mean of his class.

Let's see Jayne's grades:

Jayne's score = 89
Jayne's class mean = 80
Difference = 9
Jayne is also 9 points above the mean of her class.

However, each distribution has a different measure of center and of spread.

In this case let's assume a symmetrical distribution, so we will use the standard deviation as the measure of spread (here, 4 points for John's class and 6 points for Jayne's).

The formula to calculate z-score is as follows:

(score of student - mean score of class)/standard deviation

So, assuming a normal distribution:

John's z-score = (x - mean)/sd = (83 - 74)/4 = 2.25
Jayne's z-score = (x - mean)/sd = (89 - 80)/6 = 1.5

So, although Jayne got a higher score than John, her class had a higher mean (maybe her class got an easier exam) and also a larger standard deviation (more spread around the mean, more variability in individual scores). Measured with the same ruler, John was 2.25 standard deviations above his class average, while Jayne was only 1.5 standard deviations above hers.

These standardized units are called z-scores.
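The calculation above can be reproduced in R. This is a small sketch; z_score is a helper name chosen here for illustration, not a built-in function.

```r
# Distance from the class mean, expressed in standard deviations.
z_score <- function(score, class_mean, class_sd) {
  (score - class_mean) / class_sd
}

john_z  <- z_score(83, 74, 4)
jayne_z <- z_score(89, 80, 6)

print(c(John = john_z, Jayne = jayne_z))  # 2.25 and 1.5
```

For a whole vector of scores from one class, the base R function scale() performs the same standardization using that vector's own mean and standard deviation.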

There is more, however:

Proportion under the curve

The empirical rule states that:

[-1SD, 1SD] 68%
[-2SD, 2SD] 95%
[-3SD, 3SD] 99.7%

Here the center is the mean. The empirical rule holds for normal (or approximately normal) distributions; for skewed distributions, where the median is the preferred measure of center, these percentages do not apply.

In other words, about 68% of the cases in a normal distribution fall between z-scores of -1 and +1 (from 1 standard deviation below the mean to 1 standard deviation above it). The same logic applies to 95% and 99.7%.
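The three percentages of the empirical rule can be checked against the standard normal curve. A minimal sketch, assuming an exactly normal distribution:

```r
# Proportion of a normal distribution within k standard deviations
# of the mean, for k = 1, 2, 3.
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)

print(round(100 * within_k_sd, 1))  # 68.3 95.4 99.7
```

The rule's 68%, 95% and 99.7% are rounded versions of these values.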

We can find the proportion under the curve for any z-score we wish using the R function pnorm.

Let's see an example with John and Jayne.

We are assuming a normal distribution.

Using R, we can calculate the proportion of people in a class who performed better than one particular person with the following formula:

1 - pnorm(score of individual, mean of the class, standard deviation)

Alternatively,

1 - pnorm(z.score)

For example:

Jayne

1 - pnorm(89,80,6) = 0.0668072, about 7%

It means that, under the normal assumption, roughly 7% of her class performed better than Jayne.

John

1 - pnorm(83,74,4) = 0.01222447, about 1%

It means that, under the normal assumption, roughly 1% of his class performed better than John.
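Both proportions can be computed in one short script. This sketch uses pnorm's lower.tail argument, which returns the upper tail directly and is equivalent to 1 - pnorm(...); prop_above is a helper name introduced here for illustration.

```r
# Proportion of a normally distributed class scoring above a given score.
prop_above <- function(score, class_mean, class_sd) {
  pnorm(score, mean = class_mean, sd = class_sd, lower.tail = FALSE)
}

jayne_above <- prop_above(89, 80, 6)  # same as 1 - pnorm(89, 80, 6)
john_above  <- prop_above(83, 74, 4)  # same as 1 - pnorm(83, 74, 4)

# The z-score form gives the same answer on the standard normal curve.
jayne_above_z <- pnorm(1.5, lower.tail = FALSE)

print(round(c(Jayne = jayne_above, John = john_above), 4))  # 0.0668 0.0122
```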

So z-scores work regardless of the scale of the original scores.

If we want to see who performed better on an exam, even if John's exam was out of 100 points and Jayne's exam was out of 75 points, we can still do it by simply converting the results to z-scores first.

For a better understanding of z-scores, see: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 > Week 2: Univariate Descriptive Statistics > Lecture Videos > Using Z-Scores.

Measures of center and spread

Range

RANGE = HIGHEST VALUE - LOWEST VALUE
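In R, note that range() returns the lowest and highest values rather than their difference. A small sketch with made-up numbers:

```r
# Hypothetical scores, for illustration only.
x <- c(4, 8, 15, 16, 23, 42)

lowest_highest <- range(x)          # c(4, 42): lowest and highest value
range_width    <- diff(range(x))    # highest value - lowest value

print(range_width)  # 38
```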

Median

Given a sorted series with an even number of values: A B C D E F G H I J

The median is (E+F)/2

For an odd number of values, e.g. A B C D E F G H I

The median is E
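The two cases can be checked with R's median(), which sorts the values itself. The numbers below are made up so that each position is easy to read (the fifth value is 50, the sixth is 60).

```r
# Ten values (even count): the median is the average of the 5th and 6th.
ten_values  <- seq(10, 100, by = 10)
# Nine values (odd count): the median is the 5th value.
nine_values <- seq(10, 90, by = 10)

median_even <- median(ten_values)
median_odd  <- median(nine_values)

print(c(even = median_even, odd = median_odd))  # 55 and 50
```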

Interquartile range

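The IQR is the difference between the 3rd quartile and the 1st quartile.

Let's suppose a sorted series: A B C D E F G H I J
Then: IQR = H - C (the medians of the upper half F to J and of the lower half A to E)

However, for: A B C D E F G H I
The IQR = (G+H)/2 - (B+C)/2 (the median E is left out of both halves)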
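The quartiles here are found by splitting the sorted values into a lower and an upper half and taking the median of each half. The sketch below implements that rule; iqr_halves is a name chosen for illustration, and it assumes at least two values. R's built-in IQR() uses a different default quartile definition (quantile type 7), so its result can differ on small samples.

```r
# IQR by the median-of-halves rule: when the count is odd,
# the overall median is left out of both halves.
iqr_halves <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- n %/% 2                     # number of values in each half
  lower <- x[seq_len(half)]           # lowest 'half' values
  upper <- x[(n - half + 1):n]        # highest 'half' values
  median(upper) - median(lower)
}

ten_values  <- 1:10
nine_values <- 1:9

iqr_ten  <- iqr_halves(ten_values)    # 8 - 3 = 5
iqr_nine <- iqr_halves(nine_values)   # 7.5 - 2.5 = 5
iqr_r    <- IQR(ten_values)           # 4.5 with R's default definition

print(c(halves_ten = iqr_ten, halves_nine = iqr_nine, builtin_ten = iqr_r))
```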

Mode

It is simply the value that appears most often in a vector. You can have more than one mode; a distribution with two modes is called bimodal.
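Base R has no function for the statistical mode (its mode() function reports the storage type of an object instead). A minimal sketch of one way to find it, with modes as a hypothetical helper name and made-up data:

```r
# Return every value that is tied for the highest count.
modes <- function(x) {
  values <- unique(x)
  counts <- tabulate(match(x, values))  # how often each unique value occurs
  values[counts == max(counts)]
}

one_mode  <- modes(c(1, 2, 2, 3, 4))      # 2
two_modes <- modes(c(1, 2, 2, 3, 3, 4))   # 2 and 3: a bimodal vector

print(one_mode)
print(two_modes)
```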

For a Symmetrical Distribution

Measure of center = Mean
Measure of spread = Standard deviation (on a normal curve, it is the distance from the mean to the point where the curve changes its curvature, the inflection point)

For a Skewed Distribution (either direction)

Measure of center = Median
Measure of spread = IQR (interquartile range). In a skewed distribution the curve changes shape at different distances on either side of the center, and the mean is a poor measure of center. Since the mean is used in the calculation of the standard deviation, the standard deviation is a poor choice too.

Distribution skewed to the right: typically Mode < Median < Mean
Distribution skewed to the left: typically Mean < Median < Mode
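A small right-skewed sample shows how the mean is pulled toward the long tail while the median stays near the bulk of the data. The values below are made up for illustration.

```r
# Hypothetical right-skewed sample: most values are small, one is large.
x <- c(1, 2, 2, 2, 3, 4, 5, 7, 10, 24)

x_mode   <- as.numeric(names(which.max(table(x))))  # most frequent value
x_median <- median(x)
x_mean   <- mean(x)

print(c(mode = x_mode, median = x_median, mean = x_mean))  # 2, 3.5, 6

# Dropping the single extreme value moves the mean much more than the median.
x_trimmed <- x[x < 24]
print(c(median = median(x_trimmed), mean = mean(x_trimmed)))  # 3 and 4
```

This is the practical reason for preferring the median and IQR when a distribution is skewed.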

OBSERVATION:

When comparing two subgroups, one symmetrical and the other skewed, it is advisable to use the median, because it represents both groups properly. Otherwise one of them (the skewed one) would be misrepresented.