Introductory Statistics (CRN: 6896)



Objective

Today we learn about t-tests. From the beginning of the class, we have talked about the mean and standard deviation as measures that allow us to model distributions (a bunch of data). The first tells us about the central tendency of the distribution (i.e., where most scores are likely to fall), and the latter tells us about the dispersion of the distribution (i.e., how spread out the scores are around the mean). With these two alone, we can get a sense of how likely or unlikely a data point is. For example, something that is more than 2 standard deviations above the mean is less likely than something that is 1 standard deviation above the mean, and so on.
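As a rough illustration of that last point, and assuming the scores follow a normal distribution (an assumption made only for this sketch), R's pnorm function gives the probability of landing more than a given number of standard deviations above the mean. The variable names are made up for the example.

```r
# Probability of falling more than k standard deviations above the mean,
# assuming a normal distribution (an assumption made for this sketch).
p_above_1sd <- pnorm(1, lower.tail = FALSE)  # upper tail beyond +1 SD
p_above_2sd <- pnorm(2, lower.tail = FALSE)  # upper tail beyond +2 SD

round(c(one_sd = p_above_1sd, two_sd = p_above_2sd), 4)
```

The second probability is much smaller than the first, which is the sense in which a point 2 standard deviations out is "less likely".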

We also learned that because we estimate things about the population from a sample, we need to qualify our estimates with a margin of error (or confidence interval). How did we do this? We used something called the distribution of sample means. This is nothing but the distribution of the means of a bunch of samples we took from a population. With large enough samples, this distribution of sample means is approximately normal, which allows us to use what we know about normal distributions and apply it to our data. That is, people have already figured out the probabilities of data points occurring at various distances from the mean (in standard deviation units). For example, we know that about 68% of the data points in a normal distribution fall within 1 standard deviation below or above the mean, and almost 95% of data points are within 2 standard deviations of the mean. We can use this knowledge to qualify our estimate of the mean. To be precisely within the range that contains 95% of the data, a data point has to be within 1.96 standard deviations above or below the mean. So we qualify our mean with a statement such as this: Mean, 95% CI[Mean - SE*1.96, Mean + SE*1.96]. This is a range of values that is likely to include the population mean, with a 95% degree of confidence. Put differently, if we repeated the sampling many times, about 5% of the intervals built this way would miss the true population mean.

SE is the standard deviation of the distribution of sample means, which we calculate using the formula: \[SE = \frac{SD}{\sqrt{n}}\] where n is the sample size.
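A minimal sketch of how the standard error and the 95% confidence interval described above could be computed by hand. The data are simulated, and the names scores, se, and ci are only illustrative.

```r
set.seed(123)
scores <- rnorm(100, mean = 50, sd = 10)  # made-up sample, for illustration only

n  <- length(scores)
m  <- mean(scores)
se <- sd(scores) / sqrt(n)   # SE = SD / sqrt(sample size)
z  <- qnorm(0.975)           # the multiplier that leaves 2.5% in each tail, about 1.96

ci <- c(lower = m - z * se, upper = m + z * se)
ci
```

With small samples one would usually replace the normal multiplier with one from the t-distribution, which is where the rest of this lab is heading.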

While the mean and SD give us probabilities of different data points on a distribution, they do not tell us how our data are related to other data. We like our models to be better than that. One way of doing this is by using correlations. A correlation between two variables is an estimate of how much the deviations of the two variables from their means co-occur. So when one variable is close to its mean, the other one is also close to its mean, and when one is far from its mean, the other one is also far from its mean. Whether the two variables tend to fall on the same side of their means determines the sign of the correlation (if one tends to be above when the other is below, the correlation is negative). Correlation is standardized covariance and is calculated using the formula: \[ r = \frac{cov_{x,y}}{SD(X) SD(Y)}\]

And covariance is: \[cov_{x,y} = \frac{\sum\limits_{i=1}^{n}{(x_i-\overline{x}) \cdot (y_i-\overline{y})} }{n-1}\]
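The two formulas above can be checked against R's built-in cov and cor functions. This is a sketch on simulated data; x_demo and y_demo are hypothetical names chosen so they do not clash with the x and y used later in the lab.

```r
set.seed(123)
x_demo <- rnorm(50)
y_demo <- 0.5 * x_demo + rnorm(50)  # built to be positively related to x_demo

n <- length(x_demo)

# Covariance: sum of the products of deviations from the means, divided by n - 1
cov_manual <- sum((x_demo - mean(x_demo)) * (y_demo - mean(y_demo))) / (n - 1)

# Correlation: covariance divided by the product of the two SDs
r_manual <- cov_manual / (sd(x_demo) * sd(y_demo))

all.equal(cov_manual, cov(x_demo, y_demo))
all.equal(r_manual, cor(x_demo, y_demo))
```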

Another way to model our data is to use group variables. Knowing which group a data point belongs to could help us improve our estimate. For example, data on whether a subject is a child or an adult could help us estimate their height better. One way to do this is to use the correlation between age and height and show that older people are taller (age is associated with height). Another way is to group our age data and create an adult vs. child variable. In order to determine if two groups differ meaningfully on the basis of our grouping variable, we use something called a t-test.

A t-test tells us whether the difference between two means is meaningful or not. That is, it tells us how probable a difference this large between the means of two groups would be if there were really no difference between the groups. If that probability is very low, then we can say the grouping is important, and the difference between the groups is significant (it should be incorporated in our model). If the probability is not that low, then a difference of this size is not uncommon, so there is nothing special about it. Adding it to our model is not going to help us much.

There are broadly two types/uses of t-test:

  • Independent t-test: Compares two means based on independent data (e.g., data from different groups of people).

  • Dependent t-test: Compares two means based on related data (e.g., data from the same people measured at different times, or data from ‘matched’ samples).

Independent t-test: To run a t-test, we calculate the difference between the means and standardize it (divide it by its standard error). This gives us the t-value or t-statistic, which has its own distribution, called the t-distribution. It is pretty much like the normal distribution, but it becomes flatter, with heavier tails, as our sample size becomes smaller. At around N = 30, the t-distribution is already almost identical to a normal distribution. Using this distribution, we can find the probability of finding a mean difference of a certain size in a sample of size n.
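To get a feel for how the t-distribution approaches the normal distribution, one can compare the cutoff that leaves 2.5% in the upper tail for several degrees of freedom. This is only a sketch; the particular degrees of freedom are arbitrary choices.

```r
dfs <- c(2, 5, 10, 30, 100, 1000)  # arbitrary degrees of freedom for the comparison

t_crit <- qt(0.975, df = dfs)  # 97.5th percentile of the t-distribution
z_crit <- qnorm(0.975)         # 97.5th percentile of the normal distribution

data.frame(df = dfs, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

The t cutoff is always a little larger than the normal one, and the gap shrinks as the degrees of freedom grow.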

Let’s look at an example:

set.seed(123) # fix the random number generator's state so we get the same random numbers each time
x <- rnorm(1000, mean = 1, sd = 2) # rnorm() creates normally distributed random data with the specified mean and SD
y <- rnorm(1000, mean = 2, sd = 2) # same, but with a mean of 2
t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = -11.761, df = 1997.4, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.2282127 -0.8771368
sample estimates:
mean of x mean of y 
 1.032256  2.084931 

We created 1000 data points with a mean of 1 and an SD of 2, and another 1000 data points with a mean of 2 and an SD of 2. Because the numbers are random, the means are not exactly 1 and 2, but they are very close. So what do the results tell us? A few things:

  1. The mean of x is 1.032256
  2. The mean of y is 2.084931
  3. The mean difference (x minus y) is -1.052675, 95% CI[-1.2282127, -0.8771368]
  4. The t-value is -11.761
  5. Assuming the two population means are the same, the probability of finding a difference of 1.052675 or larger between two samples is very, very small (p < 2.2e-16). So we reject the null hypothesis.

This is how we report this: There was a significant difference, t(1997) = -11.761, p < 2.2e-16, between x (M = 1.032256) and y (M = 2.084931).
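To see where the t-value and degrees of freedom come from, here is a sketch that recomputes them by hand for the same simulated data and compares them with what t.test returns. By default t.test runs the Welch version, which does not pool the two variances. The names ending in _manual are made up for this example.

```r
set.seed(123)
x <- rnorm(1000, mean = 1, sd = 2)
y <- rnorm(1000, mean = 2, sd = 2)

# Squared standard error of each mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Standardized mean difference (Welch t statistic)
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t-distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

res <- t.test(x, y)
c(t_manual = t_manual, t_from_t.test = unname(res$statistic))
c(df_manual = df_manual, df_from_t.test = unname(res$parameter))
```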


Dependent or Paired t-test: For a paired t-test, we follow a similar logic (paired means we have two data points for each subject, or the observations are matched in some other way). The only thing that changes in the command is that we set the paired argument to TRUE (or T).

set.seed(123) # fix the random number generator's state so we get the same random numbers each time
x <- rnorm(1000, mean = 1, sd = 2) # rnorm() creates normally distributed random data with the specified mean and SD
y <- rnorm(1000, mean = 2, sd = 2) # same, but with a mean of 2
t.test(x, y, paired = TRUE)

    Paired t-test

data:  x and y
t = -12.305, df = 999, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.2205539 -0.8847956
sample estimates:
mean of the differences 
              -1.052675 

These results look a bit different. Here, we have the mean of the differences instead of two separate means. But the interpretation is the same: there is a significant difference (M diff = -1.052675) between x and y, t(999) = -12.305, p < 2.2e-16.
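One way to see why the output reports a single "mean of the differences" is that a paired t-test amounts to a one-sample t-test on the pairwise differences. The sketch below checks this on the same simulated data; d and t_paired_manual are illustrative names.

```r
set.seed(123)
x <- rnorm(1000, mean = 1, sd = 2)
y <- rnorm(1000, mean = 2, sd = 2)

d <- x - y  # one difference per pair

# t = mean of the differences divided by its standard error
t_paired_manual <- mean(d) / (sd(d) / sqrt(length(d)))

paired_res    <- t.test(x, y, paired = TRUE)
onesample_res <- t.test(d, mu = 0)

c(manual     = t_paired_manual,
  paired     = unname(paired_res$statistic),
  one_sample = unname(onesample_res$statistic))
```

Note that x and y here are simulated independently, so the pairing is artificial; with real paired data the differences would usually vary less than the raw scores.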

Comparing to a specific mean: One other way to run a t-test is to compare the mean of a distribution to a specific number. It’s essentially the same as the ones we’ve looked at above, except that we get to set the value of the mean for one group. Let’s look at an example:

set.seed(123)
x <- rnorm(1000, mean = 1, sd = 2) # rnorm() creates normally distributed random data with the specified mean and SD
t.test(x, mu = 2)

    One Sample t-test

data:  x
t = -15.43, df = 999, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 2
95 percent confidence interval:
 0.9091771 1.1553344
sample estimates:
mean of x 
 1.032256 

By specifying the argument mu as 2, we are telling R to compare the mean of x to 2. The output is read the same way as that of a dependent t-test. We report the results this way: The mean of x (M = 1.032256) is significantly lower than 2, t(999) = -15.43, p < 2.2e-16.
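The one-sample t-value can also be reproduced by hand: it is the distance between the sample mean and the reference value, measured in standard errors. A minimal sketch, where mu0, t_one, and p_one are names made up for the example:

```r
set.seed(123)
x <- rnorm(1000, mean = 1, sd = 2)
mu0 <- 2  # the value we compare the mean against

t_one <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_one <- 2 * pt(-abs(t_one), df = length(x) - 1)  # two-sided p-value

res_one <- t.test(x, mu = mu0)
c(t_manual = t_one, t_from_t.test = unname(res_one$statistic))
```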

Let’s load the happiness2015.csv dataset and answer some questions using t-tests. This dataset contains:

|variable name | description|
|------------|------------|
|Country | Name of the country.|
|Region | Region the country belongs to.|
|Happiness Rank | Rank of the country based on the Happiness Score.|
|Happiness Score | A metric measured in 2015 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest.”|
|Standard Error | The standard error of the happiness score.|
|Economy (GDP per Capita) | The extent to which GDP contributes to the calculation of the Happiness Score.|
|Family | The extent to which Family contributes to the calculation of the Happiness Score.|
|Health (Life Expectancy) | The extent to which Life expectancy contributed to the calculation of the Happiness Score.|
|Freedom | The extent to which Freedom contributed to the calculation of the Happiness Score.|
|Trust (Government Corruption) | The extent to which Perception of Corruption contributes to Happiness Score.|
|Generosity | The extent to which Generosity contributed to the calculation of the Happiness Score.|
|Dystopia Residual | The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.|

Description: The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.

happiness<-read.csv("./happiness2015.csv")
head(happiness)

Some breakdown:

table(happiness$Region)

      Australia and New Zealand      Central and Eastern Europe                    Eastern Asia 
                              2                              29                               6 
    Latin America and Caribbean Middle East and Northern Africa                   North America 
                             22                              20                               2 
              Southeastern Asia                   Southern Asia              Sub-Saharan Africa 
                              9                               7                              40 
                 Western Europe 
                             21 

Let’s compare Central and Eastern Europe and Western Europe. We need to subset our dataset:

with(happiness[happiness$Region %in% c("Central and Eastern Europe", "Western Europe"),], t.test(Happiness.Score~Region))

    Welch Two Sample t-test

data:  Happiness.Score by Region
t = -6.4974, df = 33.399, p-value = 2.127e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.7813088 -0.9320672
sample estimates:
mean in group Central and Eastern Europe             mean in group Western Europe 
                                5.332931                                 6.689619 

Note two things:

  1. I’m using the command with to run the t-test inside a subset of my dataset. The subsetting itself is done by the square brackets, where I’m asking R to select the rows whose value in the column Region is one of c("Central and Eastern Europe", "Western Europe").

  2. I used the ~ to compare the two groups. This is because we have one column for the group names and one for the happiness score. Which form you use depends on how your dataset is designed. When you have two separate columns, one for each group, you can use t.test(x,y), and when you have one column for the group variable (x) and one for the value you’re interested in comparing (y), you use t.test(y~x).
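The two calling styles described in the second note give the same answer when they are fed the same data. Here is a small sketch using a hypothetical toy data frame (toy, group, and score are made-up names, not columns of the happiness dataset):

```r
# Hypothetical toy data in "long" format: one grouping column, one value column
toy <- data.frame(
  group = rep(c("a", "b"), each = 5),
  score = c(5.1, 4.8, 5.6, 5.0, 5.3, 6.2, 6.0, 6.8, 6.4, 6.1)
)

# Formula style: value ~ group
res_formula <- t.test(score ~ group, data = toy)

# Two-vector style: pull out each group's scores first
res_vectors <- t.test(toy$score[toy$group == "a"],
                      toy$score[toy$group == "b"])

all.equal(unname(res_formula$statistic), unname(res_vectors$statistic))
```

The formula style also takes the data frame through its data argument, which can stand in for the with() call used above.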

So our results suggest that Western Europeans are happier than Central and Eastern Europeans, t(33.399) = -6.4974, p = 2.127e-07.

See if you can test the difference in happiness scores between Middle East and Northern Africa and Central and Eastern Europe.

with(happiness[happiness$Region %in% c("Central and Eastern Europe", "Middle East and Northern Africa"),], t.test(Happiness.Score~Region))

    Welch Two Sample t-test

data:  Happiness.Score by Region
t = -0.27591, df = 26.075, p-value = 0.7848
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6249620  0.4770241
sample estimates:
     mean in group Central and Eastern Europe mean in group Middle East and Northern Africa 
                                     5.332931                                      5.406900 

Interesting. The two regions do not differ significantly in happiness (p = 0.7848). What about Sub-Saharan Africa and the Middle East and Northern Africa?

with(happiness[happiness$Region %in% c("Sub-Saharan Africa", "Middle East and Northern Africa"),], t.test(Happiness.Score~Region))

    Welch Two Sample t-test

data:  Happiness.Score by Region
t = 4.553, df = 24.98, p-value = 0.0001189
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6594045 1.7487955
sample estimates:
mean in group Middle East and Northern Africa              mean in group Sub-Saharan Africa 
                                       5.4069                                        4.2028 

So happiness is even lower in Sub-Saharan Africa than in the Middle East and Northern Africa…

with(happiness[happiness$Region %in% c("Sub-Saharan Africa", "Western Europe"),], t.test(Family~Region))

    Welch Two Sample t-test

data:  Family by Region
t = -8.491, df = 58.961, p-value = 8.169e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5414887 -0.3349451
sample estimates:
mean in group Sub-Saharan Africa     mean in group Western Europe 
                        0.809085                         1.247302 

Please answer the following questions:

  • Overall, which one is more important, economic production or family?
with(happiness, t.test(x = Economy..GDP.per.Capita., y = Family))

    Welch Two Sample t-test

data:  Economy..GDP.per.Capita. and Family
t = -3.744, df = 275.62, p-value = 0.0002205
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.22110293 -0.06871454
sample estimates:
mean of x mean of y 
0.8461372 0.9910459 
  • Which one is more important for happiness in Sub-Saharan Africa? Economic production or freedom?
  • Which one is more important for happiness in Western Europe? Economic production or freedom?
  • In Eastern Europe, is absence of corruption as important as freedom?
  • Is trust correlated with freedom?
---
title: "Lab 7"
output: html_notebook
Author: Mostafa Salari Rad

---

