Chapter 1. Variables

  • Let’s look at the simple graph below.
worms <- read.csv("Data Files/worms.csv", stringsAsFactors = FALSE)
worms
##           Field.Name Area Slope Vegetation Soil.pH  Damp Worm.density
## 1        Nashs.Field  3.6    11  Grassland     4.1 FALSE            4
## 2     Silwood.Bottom  5.1     2     Arable     5.2 FALSE            7
## 3      Nursery.Field  2.8     3  Grassland     4.3 FALSE            2
## 4        Rush.Meadow  2.4     5     Meadow     4.9  TRUE            5
## 5    Gunness.Thicket  3.8     0      Scrub     4.2 FALSE            6
## 6           Oak.Mead  3.1     2  Grassland     3.9 FALSE            2
## 7       Church.Field  3.5     3  Grassland     4.2 FALSE            3
## 8            Ashurst  2.1     0     Arable     4.8 FALSE            4
## 9        The.Orchard  1.9     0    Orchard     5.7 FALSE            9
## 10     Rookery.Slope  1.5     4  Grassland     5.0  TRUE            7
## 11       Garden.Wood  2.9    10      Scrub     5.2 FALSE            8
## 12      North.Gravel  3.3     1  Grassland     4.1 FALSE            1
## 13      South.Gravel  3.7     2  Grassland     4.0 FALSE            2
## 14 Observatory.Ridge  1.8     6  Grassland     3.8 FALSE            0
## 15        Pond.Field  4.1     0     Meadow     5.0  TRUE            6
## 16      Water.Meadow  3.9     0     Meadow     4.9  TRUE            8
## 17         Cheapside  2.2     8      Scrub     4.7  TRUE            4
## 18        Pound.Hill  4.4     2     Arable     4.5 FALSE            5
## 19        Gravel.Pit  2.9     1  Grassland     3.5 FALSE            1
## 20         Farm.Wood  0.8    10      Scrub     5.1  TRUE            3
{
  plot(x = worms$Area, y = worms$Worm.density, xlab = "Area", ylab = "Worm.density")
  abline(lm(Worm.density ~ Area, data = worms))  # add the fitted regression line
}

  • As you can see, “Area” is on the x-axis and “Worm.density” is on the y-axis. What do these two variables represent? In statistics, Area is called the explanatory variable and Worm.density is called the response variable. The explanatory variable is also called the independent variable: it is not affected by the other variable at all. The response variable, by contrast, is affected by the explanatory variable; it is the thing you are working on, the variable whose variation you want to understand.

  • Let’s go back to the graph. Area is not affected by Worm.density, but Worm.density is affected by Area. The researcher wants to know how Worm.density varies with the size of the Area.
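  • As a small sketch (assuming the worms data frame loaded above), we can fit the regression behind the line drawn on the plot and inspect its slope:
fit <- lm(Worm.density ~ Area, data = worms)  # response ~ explanatory
summary(fit)  # the Area coefficient describes how worm density changes per unit of area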

Chapter 2. Why Statistics?

  • To me, learning statistics is one way to make my life better. It is a skill for working through complex problems, understanding the status quo, and predicting the future. Life is a gamble, but I want to choose my future on the basis of some assurance. And here is the problem: how do you trust that assurance? On what basis do you trust your choice? These are hard questions to answer, aren’t they?

  • So, before diving into statistics, we must understand the status quo through several steps.

Chapter 3. Introduction to Basic Statistics

(1) Histogram & Distribution

  • A histogram is one of the simplest graphs used in statistics, but it is very useful and very informative. Studying histograms will help you overcome the tendency to focus too much on summary statistics.

A. Normality with Histogram

# Alzheimer's Disease Mortality by State
alzheimer2015 <- read.csv("alzheimers2015.csv", stringsAsFactors = FALSE)
alz_death_rate <- alzheimer2015$RATE

# Let's draw a histogram
hist(alz_death_rate, main = "Alzheimer's Disease Mortality by State: 2015", xlab = "Rate", ylab = "Frequency")

  • You can download the CSV data from https://www.cdc.gov/nchs/pressroom/sosmap/alzheimers_mortality/alzheimers_disease.htm

  • This data shows a somewhat uniform or right-skewed distribution. If you want to check the data for normality, you can use a Q-Q plot. The function qqnorm() makes it much easier to evaluate whether there is a clear deviation from normality: the closer all the points lie to the line, the closer the distribution of your sample is to the normal distribution. The qqline() function, which draws that reference line, also takes the sample as an argument. Let’s check whether our data are indeed normally distributed.

qqnorm(alz_death_rate); qqline(alz_death_rate)

  • If you want to check the quartiles (0%, 25%, 50%, 75%, 100%), you can use the function quantile(data). Let’s see below.
quantile(alz_death_rate)
##     0%    25%    50%    75%   100% 
## 12.600 23.675 30.050 35.550 46.200
  • If you want to perform a formal test of normality, you can use the function shapiro.test(data).
shapiro.test(alz_death_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  alz_death_rate
## W = 0.9741, p-value = 0.337
  • Here, the p-value is the important thing to check. We will discuss the meaning of the p-value later; for now, its role is to check whether the data deviate from normality, with a cutoff of 0.05. If the p-value is higher than 0.05, the sample does not show a significant deviation from normality, meaning the data might follow a normal distribution. The null hypothesis is not rejected, and there is no need to go further at this point.

  • Since the p-value is 0.337, we cannot conclude that the data deviate from normality. Any apparent departure could simply be due to chance, or we may need more data.

B. Deviating from Normality with Histogram

beaver <- beaver2               # built-in dataset of beaver body temperature measurements
bea_temp <- beaver$temp
hist(bea_temp, main = "Beaver's Temperature", xlab = "Temperature", ylab = "Frequency")

  • This data clearly shows a left-skewed distribution. If you want to check the data for normality, you can again use a Q-Q plot: qqnorm() makes it much easier to see a clear deviation from normality, and the closer all the points lie to the qqline() reference line, the closer the sample is to the normal distribution. Let’s check whether our data are indeed normally distributed.
qqnorm(bea_temp); qqline(bea_temp)

  • If you want to check the quartiles (0%, 25%, 50%, 75%, 100%), you can use the function quantile(data). Let’s see below.
quantile(bea_temp)
##      0%     25%     50%     75%    100% 
## 36.5800 37.1475 37.7350 37.9850 38.3500
  • If you want to perform a formal test of normality, you can use the function shapiro.test(data).
shapiro.test(bea_temp)
## 
##  Shapiro-Wilk normality test
## 
## data:  bea_temp
## W = 0.93336, p-value = 7.764e-05
  • Again, the p-value is the important thing to check (we will discuss its meaning later). If the p-value is lower than 0.05, the sample deviates significantly from normality, meaning the data do not follow a normal distribution. The null hypothesis is rejected, and we do need to go further.

  • Since the p-value is 7.764e-05 (far below 0.05), we conclude that the data do not follow a normal distribution. The pattern is unlikely to have arisen by chance, which implies there is some underlying reason for it, so we should investigate why the data look like this with further statistical analysis.

(2) Scales of Measurement

  • The z-scale is the standard scale of measurement in statistics. To understand the z-scale, we must first compute the z-score. The formula for a z-score is: Z = (X - M) / SD

  • To interpret each term: (a) X is the score on the original scale, i.e. the raw score; (b) M is the mean; (c) SD is the standard deviation. SD has its own formula (the square root of the variance, covered later), but in R we simply use the function sd().

  • For z-scores: a z-score of 0 corresponds to the mean, a negative z-score is below the mean, and a positive z-score is above the mean. A small sketch of the formula follows below.
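  • A minimal sketch of the z-score formula, using made-up example scores (the numbers here are purely hypothetical), showing that the manual calculation matches R’s scale() function:
scores <- c(55, 62, 70, 48, 81)            # hypothetical raw scores
z_manual <- (scores - mean(scores)) / sd(scores)
z_scaled <- as.numeric(scale(scores))      # scale() centres on the mean and divides by sd()
all.equal(z_manual, z_scaled)              # should be TRUE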

  • Let’s analyze with real data.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
weather <- read.csv(file = "Data Files/weather.data.csv")

# Extract the upper temperatures for January 1987
jan_1987_upper <- weather %>% 
                    select(yr, month, upper) %>% 
                    filter(yr == 1987 & month == 1) 

sapply(jan_1987_upper, class)
##        yr     month     upper 
## "integer" "integer" "numeric"
upper_temp <- jan_1987_upper$upper

# convert these temp to Z-scores using the scale function
z_scores_upper_temp <- scale(upper_temp)

# Arrange the histograms side by side
par(mfrow = c(1,2))
hist(upper_temp)
hist(z_scores_upper_temp)

  • As the two graphs above show, the scale() function converts raw data into z-scores. Furthermore, a z-score can be converted into a percentile rank: the percentage of scores at or below a given score. If the data are normal, Z = 0 corresponds to the 50th percentile. Thus z-scores can be used to find percentile ranks with the pnorm() function. Let’s see below.
# Pick the 30th observation's z-score and convert it to a percentile rank

z_score_30 <- z_scores_upper_temp[30]
pnorm(z_score_30) * 100
## [1] 61.85927
  • So the z-score 0.301786 sits at roughly the 62nd percentile of this dataset, meaning the 30th observation is warmer than the mean temperature for January 1987.
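  • As a quick sanity check (a sketch reusing the percentile computed above), qnorm() inverts pnorm() and recovers the z-score from its percentile:
qnorm(0.6185927)   # approximately 0.3018, the z-score of the 30th observation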

(3) Measures of Central Tendency

  • Most datasets cluster around intermediate values; this is called central tendency. How do we measure it? Three ways are commonly used: the mean, the median, and the mode. (a) The mean is the average, which we already know. (b) The median is the middle score in a distribution, e.g. the 50th out of 100 values. (c) The mode is the score that occurs most frequently, e.g. the peak of the histogram.

  • The mean is the best choice when the distribution is normal.

  • The median is the best choice when there are extreme scores in the distribution, e.g. household income in the USA. A toy example follows below.
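  • A minimal sketch with made-up numbers (purely hypothetical incomes, in thousands of dollars) showing how a single extreme value drags the mean upward while the median barely moves:
incomes <- c(30, 32, 35, 38, 40, 300)   # hypothetical incomes; 300 is an extreme value
mean(incomes)                           # pulled far above most of the data
median(incomes)                         # robust to the outlier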

  • Let’s look at an example below.

white_wine <- read.csv(file = "Data Files/white_wine.csv")
attach(white_wine)

# Ratings split by country (kept for reference; the summary below uses group_by() instead)
white_ratings_france <- subset(white_wine, condition == "France")$Ratings
white_ratings_argentina <- subset(white_wine, condition == "Argentina")$Ratings
white_ratings_australia <- subset(white_wine, condition == "Australia")$Ratings
white_ratings_usa <- subset(white_wine, condition == "USA")$Ratings



mean_median_td <- white_wine %>% group_by(condition) %>% 
                rename(Country = condition) %>% 
                summarize(mean = mean(Ratings), 
                          median = median(Ratings))
mean_median_td
## # A tibble: 4 × 3
##     Country     mean median
##      <fctr>    <dbl>  <dbl>
## 1 Argentina 70.90476   71.0
## 2 Australia 86.64500   86.6
## 3    France 85.40000   85.0
## 4       USA 88.12500   88.2
  • Let’s go deeper. The most familiar skewed histogram is the household income distribution. Let’s see below.
usa_household_income_2009 <- read.csv("Data Files/2009_household_income_usa.csv", stringsAsFactors = FALSE)

attach(usa_household_income_2009)

state_names <- c("Alabama" , "Alaska" , "Arizona" , "Arkansas" , "California" , "Colorado" , "Connecticut" , "Delaware" , "Florida" , "Georgia" , "Hawaii" , "Idaho" , "Illinois" , "Indiana" , "Iowa" , "Kansas" , "Kentucky" , "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan" , "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska" , "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York" , "North Carolina" , "North Dakota" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming")

state_names <- toupper(state_names)
class(usa_household_income_2009$Area)
## [1] "character"
usa_house_income_2009_df <- usa_household_income_2009 %>% 
  filter(Area %in% state_names) %>% 
  mutate(income.100 = Income/100)

str(usa_house_income_2009_df)
## 'data.frame':    50 obs. of  3 variables:
##  $ Area      : chr  "ALABAMA" "ALASKA" "ARIZONA" "ARKANSAS" ...
##  $ Income    : int  26548 14822 46556 19824 396005 35015 33489 7387 99198 50177 ...
##  $ income.100: num  265 148 466 198 3960 ...
usa_house_income_2009_df$Income_group <- as.numeric(cut(usa_house_income_2009_df$income.100, 10))

attach(usa_house_income_2009_df)
## The following objects are masked from usa_household_income_2009:
## 
##     Area, Income
mean(income.100) 
## [1] 535.8644
median(income.100)
## [1] 342.52
  • The mean household income across all states is 535.86 (×100 dollars), whereas the median is 342.52 (×100 dollars); the mean is higher than the median.

  • If you look at the graph below, you can clearly see the positively skewed distribution, which makes the difference between the mean and the median easier to understand.

  • Here, the mode is the most frequently occurring value, i.e. the tallest bar of the histogram. More importantly, in this positively skewed distribution the mode lies below both the mean and the median. A small sketch after the plot code below shows one way to estimate it.

ggplot(usa_house_income_2009_df, aes(x = income.100)) + 
  geom_histogram(aes(fill =..count..), 
                 binwidth = 50) + 
  scale_x_continuous(name = "Mean House Income per State",
                     breaks = seq(0, 4000, 400), 
                     limits = c(0, 4000)) +
  scale_y_continuous(name = "Count") + 
  scale_fill_gradient("Count", low = "blue", high = "red") + 
  ggtitle("Count histogram of Mean House Income Per State") + 
  theme_bw() +
        theme(axis.line = element_line(size=1, colour = "black"),
              panel.grid.major = element_line(colour = "#d3d3d3"),
              panel.grid.minor = element_blank(),
              panel.border = element_blank(), panel.background = element_blank(),
              plot.title = element_text(size = 14, family = "", face = "bold"),
              text=element_text(family=""),
              axis.text.x=element_text(colour="black", size = 9),
              axis.text.y=element_text(colour="black", size = 9)) +
        geom_vline(aes(xintercept = mean(income.100), color = "Mean"), size = 1, 
                   linetype = "dashed") + 
        geom_vline(aes(xintercept = median(income.100), color = "Median"), size = 1, 
                   linetype = "dashed") + 
        scale_color_manual(name = "Statistics", values = c(Median = "blue", Mean = "red"))
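  • R has no built-in function for the statistical mode of continuous data, so here is a minimal sketch (assuming the usa_house_income_2009_df data frame and the same 50-unit bins used in the plot above) that takes the midpoint of the tallest histogram bin as an estimate of the mode:
h <- hist(usa_house_income_2009_df$income.100, breaks = seq(0, 4000, 50), plot = FALSE)
h$mids[which.max(h$counts)]   # midpoint of the most frequent bin, an estimate of the mode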

(4) Measures of Variability

  • Variability is probably the most important concept in statistical analysis. The greater the variability in the data, the greater our uncertainty in the values of the parameters estimated from those data.

  • Measures of variability describe the spread of values in a distribution; the two we will use are (a) the variance and (b) the standard deviation.

  • What is the purpose of the variance? The main reason to calculate it is to know how far each value in the dataset lies from the mean. Let’s have some fun with the season-by-season home-run data of the baseball player Lee Seung-yeop.
Lee_homerun_df <- read.csv("Data Files/이승엽 홈런.csv", stringsAsFactors = FALSE)

plot(Lee_homerun_df$연도, Lee_homerun_df$홈런, main = "Lee's Homerun Race per Year")
abline(h = mean(Lee_homerun_df$홈런))  # horizontal line at the mean number of home runs

  • Now it’s time to calculate the variance manually.
mean_homerun <- mean(Lee_homerun_df$홈런)
deviations <- Lee_homerun_df$홈런 - mean_homerun   # distance of each season from the mean

# Calculate the squared deviations
squared_diff <- deviations^2

# Sum of squares divided by the degrees of freedom (n - 1) gives the variance
sum(squared_diff) / (length(Lee_homerun_df$홈런) - 1)
## [1] 219.4805
var(Lee_homerun_df$홈런)
## [1] 219.4805
  • Let’s estimate the mean value and look at the departures from that mean; we will call these the “residuals” or “deviations”.
plot(Lee_homerun_df$연도, Lee_homerun_df$홈런, main = "Lee's Homerun Race per Year")
abline(h = mean(Lee_homerun_df$홈런), col = "green")   # green line at the mean
# Draw a red vertical segment from the mean to each season's value
# (the first season is assumed to be 1995, hence the index i - 1994)
for (i in Lee_homerun_df$연도) {
  lines(c(i, i), c(mean(Lee_homerun_df$홈런), Lee_homerun_df$홈런[i - 1994]), col = "red")
}

  • This is the residuals graph. The longer the red lines, the more variable the data. The green line is the mean, and each value lies at some distance from it: positive if above the mean, negative if below. It is easy to prove that the sum of these signed distances is always zero, so there is no need to add them up, but a quick check follows below.
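  • A one-line sanity check (a sketch reusing the data loaded above): the signed residuals sum to zero, up to floating-point rounding.
sum(Lee_homerun_df$홈런 - mean(Lee_homerun_df$홈런))   # essentially 0 (tiny rounding error at most)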
  • What if we get rid of the negative signs by squaring the deviations, and then sum them all up? Let’s try.
h_vectors <- Lee_homerun_df$홈런
h_vectors - mean(h_vectors)
##  [1] -14.3636364 -18.3636364   4.6363636  10.6363636  26.6363636
##  [6]   8.6363636  11.6363636  19.6363636  28.6363636 -13.3636364
## [11]   2.6363636  13.6363636   2.6363636 -19.3636364 -11.3636364
## [16] -22.3636364 -12.3636364  -6.3636364 -14.3636364   4.6363636
## [21]  -1.3636364  -0.3636364
(h_vectors - mean(h_vectors))^2
##  [1] 206.3140496 337.2231405  21.4958678 113.1322314 709.4958678
##  [6]  74.5867769 135.4049587 385.5867769 820.0413223 178.5867769
## [11]   6.9504132 185.9504132   6.9504132 374.9504132 129.1322314
## [16] 500.1322314 152.8595041  40.4958678 206.3140496  21.4958678
## [21]   1.8595041   0.1322314
sum((h_vectors - mean(h_vectors))^2)
## [1] 4609.091
  • The sum of squares for our data is 4609.091. Remember that the signed residuals always sum to zero, which is why we square them first. Note also that whenever new data are added, the mean changes, every residual changes, and the sum of squares keeps growing, so the raw sum of squares is not a fair measure of variability on its own. The solution is to divide. Divide by what? The degrees of freedom, N - 1, where N is the number of random observations in the sample. Why subtract 1? Because one parameter, the mean, has already been estimated from the data before the variability can be estimated.

  • For example, if the mean of four observations is fixed at 3 and you have already chosen 1, 3, and 4, then the last value must be 4 to keep the mean at 3; only three of the four values are free to vary. A small simulation sketch follows below.
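  • A small simulation sketch (with made-up parameters: mean 10, sd 2, so the true variance is 4) showing why we divide by n - 1 rather than n: dividing by n systematically underestimates the true variance.
set.seed(1)
samples <- replicate(10000, rnorm(5, mean = 10, sd = 2))   # 10,000 samples of size 5
mean(apply(samples, 2, function(x) sum((x - mean(x))^2) / length(x)))        # biased: noticeably below 4
mean(apply(samples, 2, function(x) sum((x - mean(x))^2) / (length(x) - 1)))  # unbiased: close to 4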

  • Formula: variance = sum of squares / degrees of freedom. Let’s see the example below.

sum((h_vectors - mean(h_vectors))^2) / (length(h_vectors)-1)
## [1] 219.4805
var(h_vectors)
## [1] 219.4805

(5) Exercise with Variance

ozone <- read.csv(file = "Data Files/gardens.csv")
attach(ozone)
ozone
##    gardenA gardenB gardenC
## 1        3       5       3
## 2        4       5       3
## 3        4       6       2
## 4        3       7       1
## 5        2       4      10
## 6        3       4       4
## 7        1       3       3
## 8        3       5      11
## 9        5       6       3
## 10       2       5      10
# Let's calculate the mean, residuals, squared differences, and variance (dividing by n - 1 = 9)
# Garden A
mean.A <- mean(gardenA); gardenA - mean(gardenA); (gardenA - mean(gardenA))^2; squared_diff.A <- sum((gardenA - mean(gardenA))^2); variance.A <- sum((gardenA - mean(gardenA))^2)/9
##  [1]  0  1  1  0 -1  0 -2  0  2 -1
##  [1] 0 1 1 0 1 0 4 0 4 1
mean.B <- mean(gardenB); gardenB - mean(gardenB); (gardenB - mean(gardenB))^2; squared_diff.B <- sum((gardenB - mean(gardenB))^2); variance.B <- sum((gardenB - mean(gardenB))^2)/9
##  [1]  0  0  1  2 -1 -1 -2  0  1  0
##  [1] 0 0 1 4 1 1 4 0 1 0
mean.C <- mean(gardenC); gardenC - mean(gardenC); (gardenC - mean(gardenC))^2; squared_diff.C <- sum((gardenC - mean(gardenC))^2); variance.C <- sum((gardenC - mean(gardenC))^2)/9
##  [1] -2 -2 -3 -4  5 -1 -2  6 -2  5
##  [1]  4  4  9 16 25  1  4 36  4 25
gardenA.B.C <- data.frame(
                            group = c("garden_A", "garden_B", "garden_C"),
                            mean = c(mean.A, mean.B, mean.C), 
                            squared_diff = c(squared_diff.A, squared_diff.B, squared_diff.C), 
                            variance = c(variance.A, variance.B, variance.C)
                          )
gardenA.B.C
##      group mean squared_diff  variance
## 1 garden_A    3           12  1.333333
## 2 garden_B    5           12  1.333333
## 3 garden_C    5          128 14.222222
  • What do you see here? Compare garden_A and garden_B: the means differ but the variances are the same. Now compare garden_B and garden_C: the means are the same but the variances differ. We test this with an F test, which is simply the larger variance divided by the smaller variance; a manual sketch follows, and then var.test() does the work for us.
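  • A minimal sketch of the variance-ratio test done by hand (using the objects created above). Note that var.test() below reports var(gardenB)/var(gardenC) because of its argument order, so its F value is the reciprocal of the ratio computed here, but the two-sided p-value is the same.
F.ratio <- var(gardenC) / var(gardenB)      # larger variance over the smaller one, about 10.67
2 * (1 - pf(F.ratio, df1 = 9, df2 = 9))     # two-sided p-value; matches var.test() below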
var.test(gardenB, gardenC)
## 
##  F test to compare two variances
## 
## data:  gardenB and gardenC
## F = 0.09375, num df = 9, denom df = 9, p-value = 0.001624
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.02328617 0.37743695
## sample estimates:
## ratio of variances 
##            0.09375
  • As you can see, the probability is much less than 5%, so we conclude that the two variances are significantly different. One question, though: garden B and garden C have the same mean, so are they identical? Of course not. Suppose the damage threshold for lettuce is 8 pphm of ozone. Garden B never exceeds it, so there is no damage at all, whereas garden C exceeds it on 3 days out of 10, so the lettuce there will be damaged. That is a big difference. Thus, when the variances are different, it is premature to compare the means of the two groups; a quick check of the threshold counts follows below.
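  • A quick sketch confirming the threshold counts from the table of ozone readings above (the 8 pphm damage threshold is the value assumed in the text):
sum(gardenB > 8)   # 0 days above the threshold
sum(gardenC > 8)   # 3 days above the threshold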

  • So, once you know the difference between two means and their variances, the next step is a statistical test. We will see that in the next chapter.

(6) Variance and Sample

  • In the previous section, the importance of both the mean and the variance was explained. To recap: if the variances are different, do not simply compare the means. Now we will look at the relationship between the value of the variance and the size of the sample (also called the replication).
plot(c(0,32), c(0,15), type = "n", xlab = "Sample size", ylab = "Variance")  # empty plotting frame
# For each odd sample size from 3 to 31, plot 30 independent estimates of the variance
for (n in seq(3, 31, 2)) {
  for (i in 1:30) {
    x <- rnorm(n, mean = 10, sd = 2)   # random sample with true variance 4
    points(n, var(x))
  }
}

  • What we do here is draw random numbers with the function rnorm(), assuming a mean of 10 and an sd of 2, for sample sizes between n = 3 and n = 31, and plot 30 independent estimates of the variance at each size. Now what do you see? You do not need to interpret every point. The main message is that variance estimated from a small sample is highly unreliable: the spread of the points only settles down as the sample size approaches about 30. If you stopped at, say, n = 15 and wrote up your conclusions without looking further, those conclusions could easily be wrong.

  • Variance will be used in two main ways: (a) establishing measures of unreliability such as confidence intervals, and (b) testing hypotheses.

(A) Measures of Unreliability

  • If the variance increases, the unreliability also increases. If that is not obvious, look again at the graph and compare the scatter of points for sample sizes 3–15 with that for sizes 16–31: which range is more stable?

  • Conversely, as the sample size gets bigger, the unreliability decreases. If that is not clear, compare the two ranges (sample sizes up to 15 versus the whole run up to 31) again.
  • Now, because variance is based on a sum of squared differences, all measures of unreliability are formulated inside a square-root term. These unreliability measures are called standard errors (SE); for a mean, SE = sqrt(variance / n). If SE(1) is larger than SE(2), the mean of group 1 is estimated less reliably than the mean of group 2. Let’s see the table below.

ozone <- read.csv(file = "Data Files/gardens.csv")
attach(ozone)
## The following objects are masked from ozone (pos = 3):
## 
##     gardenA, gardenB, gardenC
mean.A <- mean(gardenA); gardenA - mean(gardenA); (gardenA - mean(gardenA))^2; squared_diff.A <- sum((gardenA - mean(gardenA))^2); variance.A <- sum((gardenA - mean(gardenA))^2)/9; SE.A <- sqrt(var(gardenA)/10)
##  [1]  0  1  1  0 -1  0 -2  0  2 -1
##  [1] 0 1 1 0 1 0 4 0 4 1
mean.B <- mean(gardenB); gardenB - mean(gardenB); (gardenB - mean(gardenB))^2; squared_diff.B <- sum((gardenB - mean(gardenB))^2); variance.B <- sum((gardenB - mean(gardenB))^2)/9; SE.B <- sqrt(var(gardenB)/10)
##  [1]  0  0  1  2 -1 -1 -2  0  1  0
##  [1] 0 0 1 4 1 1 4 0 1 0
mean.C <- mean(gardenC); gardenC - mean(gardenC); (gardenC - mean(gardenC))^2; squared_diff.C <- sum((gardenC - mean(gardenC))^2); variance.C <- sum((gardenC - mean(gardenC))^2)/9; SE.C <- sqrt(var(gardenC)/10)
##  [1] -2 -2 -3 -4  5 -1 -2  6 -2  5
##  [1]  4  4  9 16 25  1  4 36  4 25
gardenA.B.C <- data.frame(
                            group = c("garden_A", "garden_B", "garden_C"),
                            mean = c(mean.A, mean.B, mean.C), 
                            squared_diff = c(squared_diff.A, squared_diff.B, squared_diff.C), 
                            variance = c(variance.A, variance.B, variance.C), 
                            Standard_Error = c(SE.A, SE.B, SE.C)
                          )
gardenA.B.C
##      group mean squared_diff  variance Standard_Error
## 1 garden_A    3           12  1.333333      0.3651484
## 2 garden_B    5           12  1.333333      0.3651484
## 3 garden_C    5          128 14.222222      1.1925696
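  • As a follow-up sketch (reusing the ozone data frame read in above), the same table can be built in one step by looping over the columns with sapply(); the standard error of each mean is sqrt(variance / n):
data.frame(
  group    = names(ozone),
  mean     = sapply(ozone, mean),
  variance = sapply(ozone, var),
  SE       = sapply(ozone, function(x) sqrt(var(x) / length(x)))
)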