worms <- read.csv("Data Files/worms.csv", stringsAsFactors = FALSE)
worms
## Field.Name Area Slope Vegetation Soil.pH Damp Worm.density
## 1 Nashs.Field 3.6 11 Grassland 4.1 FALSE 4
## 2 Silwood.Bottom 5.1 2 Arable 5.2 FALSE 7
## 3 Nursery.Field 2.8 3 Grassland 4.3 FALSE 2
## 4 Rush.Meadow 2.4 5 Meadow 4.9 TRUE 5
## 5 Gunness.Thicket 3.8 0 Scrub 4.2 FALSE 6
## 6 Oak.Mead 3.1 2 Grassland 3.9 FALSE 2
## 7 Church.Field 3.5 3 Grassland 4.2 FALSE 3
## 8 Ashurst 2.1 0 Arable 4.8 FALSE 4
## 9 The.Orchard 1.9 0 Orchard 5.7 FALSE 9
## 10 Rookery.Slope 1.5 4 Grassland 5.0 TRUE 7
## 11 Garden.Wood 2.9 10 Scrub 5.2 FALSE 8
## 12 North.Gravel 3.3 1 Grassland 4.1 FALSE 1
## 13 South.Gravel 3.7 2 Grassland 4.0 FALSE 2
## 14 Observatory.Ridge 1.8 6 Grassland 3.8 FALSE 0
## 15 Pond.Field 4.1 0 Meadow 5.0 TRUE 6
## 16 Water.Meadow 3.9 0 Meadow 4.9 TRUE 8
## 17 Cheapside 2.2 8 Scrub 4.7 TRUE 4
## 18 Pound.Hill 4.4 2 Arable 4.5 FALSE 5
## 19 Gravel.Pit 2.9 1 Grassland 3.5 FALSE 1
## 20 Farm.Wood 0.8 10 Scrub 5.1 TRUE 3
{
plot(x = worms$Area, y = worms$Worm.density, xlab = "Area", ylab = "Worm.Density")
abline(lm(Worm.density ~ Area, data = worms))
}
As you can see, "Area" is plotted on the x-axis and "Worm.density" on the y-axis. So what do these two roles mean? In statistics, Area is called the explanatory variable and Worm.density the response variable. The explanatory variable is also called the independent variable: it is not affected by the other variable, whereas the response variable is. The response variable is the thing you are actually studying; it is the variable whose variation you want to understand.
Let's go back to the graph. Area is not affected by Worm.density, but Worm.density may well be affected by Area. The researcher wants to know how Worm.density varies with the size of the Area.
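If you also want the fitted relationship in numbers rather than just as a line on the plot, a quick follow-up (not shown in the original output) is to print the model summary:
# Refit the same regression and inspect its coefficients and R-squared
fit <- lm(Worm.density ~ Area, data = worms)
summary(fit)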
To me, learning statistics is one way to make life better. It is a skill for working through complex problems, understanding the status quo, and predicting the future. Life is a gamble, but I would rather choose my future with some assurance. One problem remains, though: how do you trust that assurance? On what basis do you trust your choice? These are hard questions to answer, aren't they?
So, before diving into statistics, we need to get a picture of the status quo in several steps.
# Alzheimer's Disease Mortality by State
alzheimer2015 <- read.csv("alzheimers2015.csv", stringsAsFactors = FALSE)
alz_death_rate <- alzheimer2015$RATE
# Let's draw a histogram
hist(alz_death_rate, main = "Alzheimer's Disease Mortality by State: 2015", xlab = "Rate", ylab = "frequency")
You can download the CSV data from "https://www.cdc.gov/nchs/pressroom/sosmap/alzheimers_mortality/alzheimers_disease.htm".
This data shows a somewhat uniform or right-skewed distribution. If you want to check the data for normality, you can use a Q-Q plot. The qqnorm() function makes it much easier to judge whether there is a clear deviation from normality: the closer all points lie to the line, the closer the distribution of your sample is to the normal distribution. The qqline() function, which draws that reference line, also takes the sample as its argument. Let's check whether our data really follows a normal distribution.
qqnorm(alz_death_rate); qqline(alz_death_rate)
quantile(alz_death_rate)
## 0% 25% 50% 75% 100%
## 12.600 23.675 30.050 35.550 46.200
shapiro.test(alz_death_rate)
##
## Shapiro-Wilk normality test
##
## data: alz_death_rate
## W = 0.9741, p-value = 0.337
Here, the p-value is the thing to check. We will discuss the meaning of the p-value later; for now, its role is to test whether the data deviate from normality, with a cutoff of 0.05. If the p-value is higher than 0.05, the sample does not deviate significantly from normality, which means the data might follow a normal distribution. The null hypothesis is not rejected, and there is no need to go further at this moment.
Since the p-value is 0.337, we cannot conclude that the data depart from normality. Any apparent deviation could simply be due to chance, or we might need more data to detect it.
beaver <- beaver2   # beaver2 is a built-in R dataset of beaver body temperature
bea_temp <- beaver$temp
hist(bea_temp, main = "Beaver's Temperature", xlab = "Temperature", ylab = "frequency")
qqnorm(bea_temp); qqline(bea_temp)
quantile(bea_temp)
## 0% 25% 50% 75% 100%
## 36.5800 37.1475 37.7350 37.9850 38.3500
shapiro.test(bea_temp)
##
## Shapiro-Wilk normality test
##
## data: bea_temp
## W = 0.93336, p-value = 7.764e-05
Again, the p-value is the thing to check, with the same 0.05 cutoff. If the p-value is lower than 0.05, the sample deviates significantly from normality, which means the data do not appear to follow a normal distribution. The null hypothesis is rejected, and we do need to go further.
Since the p-value is 7.764e-05 (far below 0.05), we conclude that the data do not follow a normal distribution. This pattern is unlikely to have arisen by chance, which suggests there is some underlying reason for it, so we need statistical analysis to investigate why the data look this way.
The z-scale is the standard scale of measurement in statistics. To use the z-scale, we must compute z-scores. The formula for a z-score is: Z = (X - M) / SD
Here, (a) X is the score on the original scale (the raw score), (b) M is the mean, and (c) SD is the standard deviation. SD has its own formula, but we can simply use the sd() function.
For z-scores, the mean of the distribution corresponds to Z = 0, a negative z-score is below the mean, and a positive z-score is above the mean.
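Before turning to real data, here is a minimal sketch with made-up scores showing that the formula and R's scale() function agree:
x <- c(55, 60, 70, 80, 85)    # hypothetical raw scores
m <- mean(x)
s <- sd(x)
(x - m) / s                   # z-scores by hand, Z = (X - M) / SD
as.vector(scale(x))           # the same z-scores from scale()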
Let’s analyze with real data.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
weather <- read.csv(file = "Data Files/weather.data.csv")
# Extract January 1987 from the "upper" temperature column
jan_1987_upper <- weather %>%
select(yr, month, upper) %>%
filter(yr == 1987 & month == 1)
sapply(jan_1987_upper, class)
## yr month upper
## "integer" "integer" "numeric"
upper_temp <- jan_1987_upper$upper
# convert these temp to Z-scores using the scale function
z_scores_upper_temp <- scale(upper_temp)
# Arrange the hist side by side
par(mfrow = c(1,2))
hist(upper_temp)
hist(z_scores_upper_temp)
# Convert one observation's z-score (the 30th) into a percentile
z_score_30 <- z_scores_upper_temp[30]
pnorm(z_score_30) * 100
## [1] 61.85927
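As a side note on what pnorm() is doing here: it returns the cumulative probability below a given z-score on the standard normal curve, so multiplying by 100 turns a z-score into a percentile. A few reference values:
pnorm(0)     # 0.5, so a z-score of 0 sits at the 50th percentile
pnorm(1)     # about 0.841, roughly the 84th percentile
pnorm(-1)    # about 0.159, roughly the 16th percentile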
Most datasets cluster around intermediate values; this is the so-called central tendency. How do we measure central tendency? Three measures are commonly used: the mean, the median, and the mode. (a) The mean is the average, which we already know. (b) The median is the middle score in a distribution, e.g. the 50th out of 100 values. (c) The mode is the score that occurs most frequently, e.g. the peak of a histogram.
The mean is most useful when the distribution is roughly normal.
The median is most useful when there are extreme scores in the distribution, e.g. household income in the USA.
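Base R has no built-in function for the statistical mode (the mode() function reports the storage type instead), so here is a minimal sketch, with made-up values, of finding the mode using table():
scores <- c(2, 3, 3, 5, 3, 7, 5)           # hypothetical values
tab <- table(scores)
as.numeric(names(tab)[which.max(tab)])     # the most frequent value, i.e. the mode (3 here)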
Let's look at some real data below.
white_wine <- read.csv(file = "Data Files/white_wine.csv")
attach(white_wine)
white_ratings_france <- subset(white_wine, condition == "France")$Ratings
white_ratings_argentina <- subset(white_wine, condition == "Argentina")$Ratings
white_ratings_australia <- subset(white_wine, condition == "Australia")$Ratings
white_ratings_usa <- subset(white_wine, condition == "USA")$Ratings
mean_median_td <- white_wine %>% group_by(condition) %>%
rename(Country = condition) %>%
summarize(mean = mean(Ratings),
median = median(Ratings))
mean_median_td
## # A tibble: 4 × 3
## Country mean median
## <fctr> <dbl> <dbl>
## 1 Argentina 70.90476 71.0
## 2 Australia 86.64500 86.6
## 3 France 85.40000 85.0
## 4 USA 88.12500 88.2
usa_household_income_2009 <- read.csv("Data Files/2009_household_income_usa.csv", stringsAsFactors = FALSE)
attach(usa_household_income_2009)
state_names <- c("Alabama" , "Alaska" , "Arizona" , "Arkansas" , "California" , "Colorado" , "Connecticut" , "Delaware" , "Florida" , "Georgia" , "Hawaii" , "Idaho" , "Illinois" , "Indiana" , "Iowa" , "Kansas" , "Kentucky" , "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan" , "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska" , "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York" , "North Carolina" , "North Dakota" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming")
state_names <- toupper(state_names)
class(usa_household_income_2009$Area)
## [1] "character"
usa_house_income_2009_df <- usa_household_income_2009 %>%
filter(Area %in% state_names) %>%
mutate(income.100 = Income/100)
str(usa_house_income_2009_df)
## 'data.frame': 50 obs. of 3 variables:
## $ Area : chr "ALABAMA" "ALASKA" "ARIZONA" "ARKANSAS" ...
## $ Income : int 26548 14822 46556 19824 396005 35015 33489 7387 99198 50177 ...
## $ income.100: num 265 148 466 198 3960 ...
usa_house_income_2009_df$Income_group <- as.numeric(cut(usa_house_income_2009_df$income.100, 10))
attach(usa_house_income_2009_df)
## The following objects are masked from usa_household_income_2009:
##
## Area, Income
mean(income.100)
## [1] 535.8644
median(income.100)
## [1] 342.52
The mean household income across all states is 535.86 (× 100 dollars), but the median is 342.52 (× 100 dollars); the mean is higher than the median.
If you look at the graph below, you can clearly see the positively skewed distribution, which helps explain the difference between the mean and the median.
Here, the mode is the most frequently occurring value (the tallest bar) in this graph. More importantly, the mode lies below both the mean and the median.
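One rough way to see this with the data at hand is to use the Income_group column created with cut() above: the bin containing the most states approximates the mode.
bin_counts <- table(usa_house_income_2009_df$Income_group)
bin_counts                     # number of states falling in each of the 10 income bins
names(which.max(bin_counts))   # the modal bin; with this positive skew it should be a
                               # low-income bin, below both the mean and the median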
ggplot(usa_house_income_2009_df, aes(x = income.100)) +
geom_histogram(aes(fill =..count..),
binwidth = 50) +
scale_x_continuous(name = "Mean House Income per State",
breaks = seq(0, 4000, 400),
limits = c(0, 4000)) +
scale_y_continuous(name = "Count") +
scale_fill_gradient("Count", low = "blue", high = "red") +
ggtitle("Count histogram of Mean House Income Per State") +
theme_bw() +
theme(axis.line = element_line(size=1, colour = "black"),
panel.grid.major = element_line(colour = "#d3d3d3"),
panel.grid.minor = element_blank(),
panel.border = element_blank(), panel.background = element_blank(),
plot.title = element_text(size = 14, family = "", face = "bold"),
text=element_text(family=""),
axis.text.x=element_text(colour="black", size = 9),
axis.text.y=element_text(colour="black", size = 9)) +
geom_vline(aes(xintercept = mean(income.100), color = "Mean"), size = 1,
linetype = "dashed") +
geom_vline(aes(xintercept = median(income.100), color = "Median"), size = 1,
linetype = "dashed") +
scale_color_manual(name = "Statistics", values = c(Median = "blue", Mean = "red"))
Variability is probably the most important concept in statistical analysis. The greater the variability in the dataset, the greater our uncertainty in the values of the parameters estimated from the data.
Measures of variability can be summarized in three points: (a) they describe the spread of values in a distribution, (b) the variance, and (c) the standard deviation.
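Since the standard deviation is just the square root of the variance, a quick check with made-up numbers shows how the two relate:
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical values
var(x)                           # the variance
sqrt(var(x))                     # its square root ...
sd(x)                            # ... which is exactly what sd() returns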
Now, let's look at the plot below.
Lee_homerun_df <- read.csv("Data Files/이승엽 홈런.csv", stringsAsFactors = FALSE)
plot(Lee_homerun_df$연도, Lee_homerun_df$홈런, main = "Lee's Homerun Race per Year")
abline(h = mean(Lee_homerun_df$홈런))
mean_homerun <- mean(Lee_homerun_df$홈런)
diff <- Lee_homerun_df$홈런 - mean_homerun
# Calculate the squared deviations
squared_diff <- diff^2
sum(squared_diff) / (length(Lee_homerun_df$홈런)-1)
## [1] 219.4805
var(Lee_homerun_df$홈런)
## [1] 219.4805
plot(Lee_homerun_df$연도, Lee_homerun_df$홈런, main = "Lee's Homerun Race per Year")
abline(h = mean(Lee_homerun_df$홈런), col = "green")
# Draw a red residual line from each year's value to the mean (indexing assumes the years start in 1995)
for (i in Lee_homerun_df$연도) {
lines(c(i,i), c(mean(Lee_homerun_df$홈런), Lee_homerun_df$홈런[i-1994]), col = "red")
}
h_vectors <- Lee_homerun_df$홈런
h_vectors - mean(h_vectors)
## [1] -14.3636364 -18.3636364 4.6363636 10.6363636 26.6363636
## [6] 8.6363636 11.6363636 19.6363636 28.6363636 -13.3636364
## [11] 2.6363636 13.6363636 2.6363636 -19.3636364 -11.3636364
## [16] -22.3636364 -12.3636364 -6.3636364 -14.3636364 4.6363636
## [21] -1.3636364 -0.3636364
(h_vectors - mean(h_vectors))^2
## [1] 206.3140496 337.2231405 21.4958678 113.1322314 709.4958678
## [6] 74.5867769 135.4049587 385.5867769 820.0413223 178.5867769
## [11] 6.9504132 185.9504132 6.9504132 374.9504132 129.1322314
## [16] 500.1322314 152.8595041 40.4958678 206.3140496 21.4958678
## [21] 1.8595041 0.1322314
sum((h_vectors - mean(h_vectors))^2)
## [1] 4609.091
The sum of squares for our data is 4609.091. Remember this. Note also that the residuals themselves always sum to zero. If new data are added, the mean changes, the residuals change, and the sum of squares keeps growing, so the raw sum of squares is not a useful measure on its own. The solution is to divide. Divide by what? By the degrees of freedom, which is N - 1, where N is the size of the data (the sample size). Why the '-1'? Simply put, N is the number of random observations in the sample, and the '-1' accounts for the one parameter (the mean) that we had to estimate before we could measure variability.
For example, suppose the mean of four observations must be 3, so their total must be 12. If you freely choose 1, 3, and 4, the last value is forced to be 4; you have no freedom left in choosing it. That is why only N - 1 values are free to vary.
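A small numeric check of this idea makes it concrete:
target_mean <- 3
free_values <- c(1, 3, 4)                      # three values chosen freely
forced <- target_mean * 4 - sum(free_values)   # the fourth value is forced to be 4
forced
mean(c(free_values, forced))                   # back to the target mean of 3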
So the formula is: variance = sum of squares / degrees of freedom. Let's see the example below.
sum((h_vectors - mean(h_vectors))^2) / (length(h_vectors)-1)
## [1] 219.4805
var(h_vectors)
## [1] 219.4805
ozone <- read.csv(file = "Data Files/gardens.csv")
attach(ozone)
ozone
## gardenA gardenB gardenC
## 1 3 5 3
## 2 4 5 3
## 3 4 6 2
## 4 3 7 1
## 5 2 4 10
## 6 3 4 4
## 7 1 3 3
## 8 3 5 11
## 9 5 6 3
## 10 2 5 10
# Let's calculate the mean, residuals, and squared differences
# GardenA
mean.A <- mean(gardenA); gardenA - mean(gardenA); (gardenA - mean(gardenA))^2; squared_diff.A <- sum((gardenA - mean(gardenA))^2); variance.A <- sum((gardenA - mean(gardenA))^2)/9
## [1] 0 1 1 0 -1 0 -2 0 2 -1
## [1] 0 1 1 0 1 0 4 0 4 1
mean.B <- mean(gardenB); gardenB - mean(gardenB); (gardenB - mean(gardenB))^2; squared_diff.B <- sum((gardenB - mean(gardenB))^2); variance.B <- sum((gardenB - mean(gardenB))^2)/9
## [1] 0 0 1 2 -1 -1 -2 0 1 0
## [1] 0 0 1 4 1 1 4 0 1 0
mean.C <- mean(gardenC); gardenC - mean(gardenC); (gardenC - mean(gardenC))^2; squared_diff.C <- sum((gardenC - mean(gardenC))^2); variance.C <- sum((gardenC - mean(gardenC))^2)/9
## [1] -2 -2 -3 -4 5 -1 -2 6 -2 5
## [1] 4 4 9 16 25 1 4 36 4 25
gardenA.B.C <- data.frame(
group = c("garden_A", "garden_B", "garden_C"),
mean = c(mean.A, mean.B, mean.C),
squared_diff = c(squared_diff.A, squared_diff.B, squared_diff.C),
variance = c(variance.A, variance.B, variance.C)
)
gardenA.B.C
## group mean squared_diff variance
## 1 garden_A 3 12 1.333333
## 2 garden_B 5 12 1.333333
## 3 garden_C 5 128 14.222222
var.test(gardenB, gardenC)
##
## F test to compare two variances
##
## data: gardenB and gardenC
## F = 0.09375, num df = 9, denom df = 9, p-value = 0.001624
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.02328617 0.37743695
## sample estimates:
## ratio of variances
## 0.09375
As you can see, the probability is much less than 5%, so we conclude that the two variances are significantly different. One question, though: gardenB and gardenC have the same mean, so are they identical? Of course not. Suppose the damage threshold for lettuce is 8 pphm of ozone. In garden B there is no damage at all, whereas in garden C the lettuce would be damaged on 3 days out of 10. That is a big difference. So when the variances differ, it is premature to compare the means of the two groups directly.
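A quick check of that threshold argument directly on the attached data (taking the 8 pphm cutoff assumed above):
sum(gardenB > 8)   # 0: no days above the damage threshold in garden B
sum(gardenC > 8)   # 3: three of the ten days exceed the threshold in garden C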
So, once you understand the difference between two means and their variances, the next step is a statistical test. We will see that in the next chapter.
plot(c(0,32), c(0,15), type = "n", xlab = "Sample size", ylab = "Variance") # Graph
for(n in seq(3, 31, 2)) {
for (i in 1:30) {
x <- rnorm(n, mean = 10, sd = 2)
points(n, var(x))
}
}
What we do here is draw random numbers with rnorm(), assuming a mean of 10 and a standard deviation of 2, for sample sizes between n = 3 and n = 31, and plot 30 independent estimates of the variance at each sample size. Now what do you see? You do not need to interpret every single point. The main point is that small samples give wildly scattered variance estimates, and the estimates only settle down as the sample size grows towards 30 or so. Imagine stopping at a much smaller sample and writing up your report without seeing the rest of the picture: anyone could guess that the report might be inaccurate or lead to the wrong conclusion.
The variance will be used in two main ways: (a) to establish confidence intervals, and (b) to test hypotheses.
As variance increases, unreliability increases too. If that is not obvious, compare the left part of the graph (sample sizes up to about 15) with the right part (16 to 30): which group of estimates is more stable?
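To put rough numbers on that stability difference, here is a small separate sketch (not part of the plot above) comparing the spread of variance estimates at the smallest and largest sample sizes used:
set.seed(1)                                            # for reproducibility
var_small <- replicate(30, var(rnorm(3,  mean = 10, sd = 2)))
var_large <- replicate(30, var(rnorm(31, mean = 10, sd = 2)))
range(var_small)   # very wide: variance estimates from n = 3 are unreliable
range(var_large)   # much narrower: larger samples give far more stable estimates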
Now, the variance is based on sums of squared differences, so measures of unreliability are expressed inside a square-root term to bring them back to the original scale. These unreliability measures are called standard errors (SE); for a mean, SE = sqrt(variance / n). If SE(1) is bigger than SE(2), the estimate behind SE(1) is less reliable, because its variance is larger relative to its sample size. Let's see the table below.
ozone <- read.csv(file = "Data Files/gardens.csv")
attach(ozone)
## The following objects are masked from ozone (pos = 3):
##
## gardenA, gardenB, gardenC
mean.A <- mean(gardenA); gardenA - mean(gardenA); (gardenA - mean(gardenA))^2; squared_diff.A <- sum((gardenA - mean(gardenA))^2); variance.A <- sum((gardenA - mean(gardenA))^2)/9; SE.A <- sqrt(var(gardenA)/10)
## [1] 0 1 1 0 -1 0 -2 0 2 -1
## [1] 0 1 1 0 1 0 4 0 4 1
mean.B <- mean(gardenB); gardenB - mean(gardenB); (gardenB - mean(gardenB))^2; squared_diff.B <- sum((gardenB - mean(gardenB))^2); variance.B <- sum((gardenB - mean(gardenB))^2)/9; SE.B <- sqrt(var(gardenB)/10)
## [1] 0 0 1 2 -1 -1 -2 0 1 0
## [1] 0 0 1 4 1 1 4 0 1 0
mean.C <- mean(gardenC); gardenC - mean(gardenC); (gardenC - mean(gardenC))^2; squared_diff.C <- sum((gardenC - mean(gardenC))^2); variance.C <- sum((gardenC - mean(gardenC))^2)/9; SE.C <- sqrt(var(gardenC)/10)
## [1] -2 -2 -3 -4 5 -1 -2 6 -2 5
## [1] 4 4 9 16 25 1 4 36 4 25
gardenA.B.C <- data.frame(
group = c("garden_A", "garden_B", "garden_C"),
mean = c(mean.A, mean.B, mean.C),
squared_diff = c(squared_diff.A, squared_diff.B, squared_diff.C),
variance = c(variance.A, variance.B, variance.C),
Standard_Error = c(SE.A, SE.B, SE.C)
)
gardenA.B.C
## group mean squared_diff variance Standard_Error
## 1 garden_A 3 12 1.333333 0.3651484
## 2 garden_B 5 12 1.333333 0.3651484
## 3 garden_C 5 128 14.222222 1.1925696