Data Analysis: focusing on the basics. Covering aspects of dealing with data, and why less is more in statistics.
Research methods: covering the theoretical and philosophical aspects of doing science. Making sense of science and working on writing and reading skills.
    library(dplyr)
    library(ggplot2)  # provides the diamonds dataset

    diamonds %>%                                    # use the diamonds dataset
      group_by(color, clarity) %>%                  # group data by the color and clarity variables
      mutate(price200 = mean(price)) %>%            # create a new variable (average price by group)
      ungroup() %>%                                 # data no longer grouped by color and clarity
      mutate(random10 = 10 + price) %>%             # new variable: original price + $10
      select(cut, color, clarity, price, price200, random10) %>%  # retain only these listed columns
      arrange(color) %>%                            # order the data by color
      group_by(cut) %>%                             # group data by cut
      mutate(dis = n_distinct(price),               # count the number of unique price values per cut
             rowID = row_number()) %>%              # number each row consecutively within each cut
      ungroup()                                     # final ungrouping of data
    data(iris)

    # Create a boxplot of Sepal.Length by Species
    boxplot(Sepal.Length ~ Species, data = iris,
            main = "Boxplot of Sepal Length by Species",
            xlab = "Species",
            ylab = "Sepal Length",
            col = c("lightblue", "lightgreen", "lightpink"))

    data(iris)
    library(ggplot2)
    summary_data <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

    # Create the line graph
    ggplot(summary_data, aes(x = Species, y = Sepal.Length, group = 1)) +
      geom_line() +
      geom_point() +
      labs(title = "Average Sepal Length by Species",
           x = "Species",
           y = "Average Sepal Length") +
      theme_minimal()

    data(iris)
    library(ggplot2)

    # Create a scatter graph
    ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
      geom_point(size = 3) +  # Adjust size of points
      labs(title = "Scatter Plot of Sepal Length vs. Sepal Width",
           x = "Sepal Length",
           y = "Sepal Width") +
      theme_minimal()

    library(ggplot2)
    iris_summary <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
    ggplot(iris_summary, aes(x = Species, y = Sepal.Length, fill = Species)) +
      geom_bar(stat = "identity", position = "dodge") +
      labs(title = "Average Sepal Length by Species",
           x = "Species",
           y = "Average Sepal Length") +
      theme_minimal()
# Types of variables
Categorical
Categorical data refers to data that can be divided into distinct categories or groups. These categories may represent qualitative characteristics or attributes rather than numerical values.
Types of Categorical Data
Nominal Data: Definition: Categories without a specific order. Examples: Gender (male, female), eye color (blue, green, brown), types of cuisine (Italian, Mexican, Chinese).
Ordinal Data: Definition: Categories with a meaningful order, but the intervals between categories are not uniform. Examples: Satisfaction ratings (very satisfied, satisfied, neutral, dissatisfied, very dissatisfied), education levels (high school, bachelor’s, master’s, doctorate).
Numerical
Numerical data, also known as quantitative data, consists of values that represent quantities and can be measured or counted. This type of data can be further divided into two main categories: discrete and continuous.
Types of Numerical Data
Discrete Data: Definition: Consists of distinct, separate values that can be counted. Discrete data typically represents counts of items or occurrences. Examples: Number of students in a classroom, number of cars in a parking lot, or the number of goals scored in a game.
Continuous Data: Definition: Can take any value within a given range and is typically measured rather than counted. Continuous data can be subdivided infinitely. Examples: Height, weight, temperature, or time.
The four families of statistical tests
Parametric Tests: These tests assume that the data follow a specific distribution (typically a normal distribution). They are generally more powerful than non-parametric tests when their assumptions are met. Examples: t-tests (one-sample, independent, paired) and ANOVA (analysis of variance).
Non-Parametric Tests: These tests do not assume a specific distribution and are often used when the data do not meet the assumptions required for parametric tests. They are useful for ordinal data or when sample sizes are small. Examples: Mann-Whitney U test (for comparing two independent groups), Wilcoxon signed-rank test (for comparing two related groups), Kruskal-Wallis test (for comparing more than two groups), chi-square test (for categorical data).
Bayesian Tests: These tests incorporate prior beliefs or information into the analysis, allowing for a more flexible approach to inference. Bayesian methods provide a framework for updating beliefs in light of new evidence. Examples: Bayesian t-test, Bayesian ANOVA, Bayesian regression.
Resampling Methods: These methods involve repeatedly drawing samples from the data (with or without replacement) to estimate the sampling distribution of a statistic. They are particularly useful for hypothesis testing and confidence interval estimation. Examples: bootstrap (to estimate the distribution of a statistic), permutation tests (to assess the significance of a test statistic).
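As a small illustration of the resampling idea, the sketch below bootstraps the mean of `Sepal.Length` from the built-in `iris` data. The seed, the number of resamples (2000) and the name `boot_means` are arbitrary choices made for this example.

```r
set.seed(1)                       # arbitrary seed, for reproducibility
x <- iris$Sepal.Length

# Bootstrap: resample with replacement and recompute the mean each time
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

sd(boot_means)                          # bootstrap estimate of the standard error
quantile(boot_means, c(0.025, 0.975))   # percentile 95% confidence interval
```

The spread of `boot_means` approximates the sampling distribution of the mean without assuming a particular distribution for the data.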
Frequency tests
Chi-square
G-tests
Contingency tables
Log-linear models
Powerful for testing associations between categorical variables
Mean tests
T-tests (two levels)
Anovas (3+ levels)
Non-parametric equivalents
Nested and two-way…
Post-hoc tests (Tukey HSD, Student, etc.)
Correlations and models
Correlations – many variations
Linear models – many variations
Logistic models
Logistic models
Predictive of odds
Similar in logic to frequency tests
Similar in calculation to linear models
Moments of dispersion
Variance
Standard deviation
Standard error
Range
Quantiles
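Each of these measures has a direct counterpart in base R. The sketch below computes them for `Sepal.Length` in the built-in `iris` data; the choice of variable is only for illustration.

```r
x <- iris$Sepal.Length

var(x)                              # variance
sd(x)                               # standard deviation
sd(x) / sqrt(length(x))             # standard error of the mean
range(x)                            # minimum and maximum
diff(range(x))                      # range as a single number
quantile(x, c(0.25, 0.5, 0.75))     # quartiles
```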
# formative exercise 6
What is the one-proportion z-test?
The one-proportion z-test is used to compare an observed proportion to a theoretical one when there are only two categories. This article describes the basics of the one-proportion z-test and provides practical examples using R software.
For example, we have a population of mice containing half males and half females (p = 0.5 = 50%). Some of these mice (n = 160) have developed a spontaneous cancer, including 95 males and 65 females.
Typical research questions are:
whether the observed proportion of males ($p_o$) is equal to the expected proportion ($p_e$)?
whether the observed proportion of males ($p_o$) is less than the expected proportion ($p_e$)?
whether the observed proportion of males ($p_o$) is greater than the expected proportion ($p_e$)?
Formula of the test statistic
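The test statistic (also known as the z-statistic) can be calculated as follows:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$$

where:
$z$ = test statistic
$n$ = sample size
$p_0$ = null hypothesized value (the expected proportion $p_e$ in the research questions)
$\hat{p}$ = observed proportion (the observed proportion $p_o$ in the research questions)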
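As a hedged illustration, the statistic can be computed by hand in R for the mice example (95 males among 160 mice, null proportion 0.5). The variable names are illustrative choices for this sketch.

```r
# One-proportion z-test computed by hand (mice example from the text)
x  <- 95     # observed number of males with cancer
n  <- 160    # sample size
p0 <- 0.5    # null hypothesized proportion

p_hat <- x / n                                   # observed proportion
z     <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # test statistic
p_val <- 2 * pnorm(-abs(z))                      # two-sided p-value

round(c(p_hat = p_hat, z = z, p_value = p_val), 4)
```

Squaring `z` gives 5.625, which is the X-squared value that `prop.test(x = 95, n = 160, p = 0.5, correct = FALSE)` reports for the same data.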
R functions: binom.test() & prop.test()
The R functions binom.test() and prop.test() can be used to perform a one-proportion test:
binom.test(): computes an exact binomial test. Recommended when the sample size is small.
prop.test(): can be used when the sample size is large (N > 30). It uses a normal approximation to the binomial.
The syntax of the two functions is very similar. The simplified format is as follows:

    binom.test(x, n, p = 0.5, alternative = "two.sided")
    prop.test(x, n, p = NULL, alternative = "two.sided", correct = TRUE)

    res <- prop.test(x = 95, n = 160, p = 0.5, correct = FALSE)
    # Printing the results
    res

        1-sample proportions test without continuity correction

    data:  95 out of 160, null probability 0.5
    X-squared = 5.625, df = 1, p-value = 0.01771
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.5163169 0.6667870
    sample estimates:
          p
    0.59375
The function returns:
the value of Pearson’s chi-squared test statistic.
a p-value
a 95% confidence interval
an estimated probability of success (the proportion of males with cancer)
Interpretation of the result
The p-value of the test is 0.01771, which is less than the significance level alpha = 0.05. We can conclude that the proportion of males with cancer is significantly different from 0.5 (p-value = 0.01771).
The two-proportions z-test is used to compare two observed proportions. This article describes the basics of the two-proportions z-test and provides practical examples using R software.
We want to know whether the proportions of smokers are the same in the two groups of individuals.
For example, we have two groups of individuals:
Group A with lung cancer: n = 500
Group B, healthy individuals: n = 500
The number of smokers in each group is as follows:
Group A with lung cancer: n = 500, 490 smokers, $p_A = 490/500 = 98\%$
Group B, healthy individuals: n = 500, 400 smokers, $p_B = 400/500 = 80\%$
In this setting:
The overall proportion of smokers is $p = \frac{490 + 400}{500 + 500} = 89\%$
The overall proportion of non-smokers is $q = 1 - p = 11\%$
Typical research questions are:
whether the observed proportion of smokers in group A ($p_A$) is equal to the observed proportion of smokers in group B ($p_B$)?
whether the observed proportion of smokers in group A ($p_A$) is less than the observed proportion of smokers in group B ($p_B$)?
whether the observed proportion of smokers in group A ($p_A$) is greater than the observed proportion of smokers in group B ($p_B$)?
The test statistic (also known as the z-statistic) can be calculated as follows:

$$z = \frac{p_A - p_B}{\sqrt{pq/n_A + pq/n_B}}$$

where:
$p_A$ is the proportion observed in group A with size $n_A$
$p_B$ is the proportion observed in group B with size $n_B$
$p$ and $q$ are the overall proportions
if $|z| < 1.96$, then the difference is not significant at 5%
if $|z| \geq 1.96$, then the difference is significant at 5%
The significance level (p-value) corresponding to the z-statistic can be read in the z-table. We’ll see how to compute it in R.
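The calculation can be sketched by hand for the smokers example (490 of 500 in group A, 400 of 500 in group B). The variable names below are illustrative choices for this sketch.

```r
# Two-proportion z-test computed by hand (smokers example from the text)
x <- c(A = 490, B = 400)   # smokers in each group
n <- c(A = 500, B = 500)   # group sizes

p_grp <- x / n             # p_A = 0.98, p_B = 0.80
p     <- sum(x) / sum(n)   # pooled proportion of smokers (0.89)
q     <- 1 - p             # pooled proportion of non-smokers (0.11)

z     <- unname((p_grp["A"] - p_grp["B"]) / sqrt(p * q / n["A"] + p * q / n["B"]))
p_val <- 2 * pnorm(-abs(z))   # two-sided p-value

z
p_val
```

This `z` corresponds to the test without continuity correction. Because `prop.test()` applies Yates’ continuity correction by default, its X-squared value is slightly smaller than `z^2` unless `correct = FALSE` is set.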
The R function prop.test() can be used as follows:

    prop.test(x, n, p = NULL, alternative = "two.sided", correct = TRUE)

x: a vector of counts of successes
n: a vector of counts of trials
alternative: a character string specifying the alternative hypothesis
correct: a logical indicating whether Yates’ continuity correction should be applied where possible

    res <- prop.test(x = c(490, 400), n = c(500, 500))
    # Printing the results
    res

        2-sample test for equality of proportions with continuity correction

    data:  c(490, 400) out of c(500, 500)
    X-squared = 80.909, df = 1, p-value < 2.2e-16
    alternative hypothesis: two.sided
    95 percent confidence interval:
     0.1408536 0.2191464
    sample estimates:
    prop 1 prop 2
      0.98   0.80
The function returns:
the value of Pearson’s chi-squared test statistic
a p-value
a 95% confidence interval
an estimated probability of success (the proportion of smokers in the two groups)
If you want to test whether the observed proportion of smokers in group A ($p_A$) is less than the observed proportion of smokers in group B ($p_B$), type this:

    prop.test(x = c(490, 400), n = c(500, 500), alternative = "less")

Or, if you want to test whether the observed proportion of smokers in group A ($p_A$) is greater than the observed proportion of smokers in group B ($p_B$), type this:

    prop.test(x = c(490, 400), n = c(500, 500), alternative = "greater")
Interpretation of the result
The p-value of the test is $2.363 \times 10^{-19}$, which is less than the significance level alpha = 0.05. We can conclude that the proportion of smokers is significantly different in the two groups (p-value = $2.363 \times 10^{-19}$).
What is the chi-square goodness of fit test?
The chi-square goodness of fit test is used to compare the observed distribution to an expected distribution in a situation where we have two or more categories in discrete data. In other words, it compares multiple observed proportions to expected probabilities.
R function: chisq.test()
The R function chisq.test() can be used as follows:

    chisq.test(x, p)

x: a numeric vector
p: a vector of probabilities of the same length as x.

    tulip <- c(81, 50, 27)
    res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
    res

        Chi-squared test for given probabilities

    data:  tulip
    X-squared = 27.886, df = 2, p-value = 8.803e-07
The function returns the value of the chi-square test statistic (“X-squared”) and a p-value.
The p-value of the test is $8.803 \times 10^{-7}$, which is less than the significance level alpha = 0.05. We can conclude that the observed color counts differ significantly from the expected equal distribution (p-value = $8.803 \times 10^{-7}$).

    # Access the expected values
    res$expected

    [1] 52.66667 52.66667 52.66667
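The reported statistic can be reproduced from the observed and expected counts using the standard Pearson formula, the sum of (observed - expected)^2 / expected over all categories. The sketch below does this for the tulip counts with equal expected probabilities; the names `x2`, `df` and `p_val` are illustrative.

```r
# Chi-square goodness-of-fit statistic computed by hand
tulip <- c(81, 50, 27)        # observed counts
p     <- c(1/3, 1/3, 1/3)     # expected probabilities under the null

expected <- sum(tulip) * p                        # expected counts
x2       <- sum((tulip - expected)^2 / expected)  # chi-square statistic
df       <- length(tulip) - 1                     # degrees of freedom
p_val    <- pchisq(x2, df = df, lower.tail = FALSE)

c(X_squared = x2, df = df, p_value = p_val)
```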
    tulip <- c(81, 50, 27)
    res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
    res

        Chi-squared test for given probabilities

    data:  tulip
    X-squared = 0.20253, df = 2, p-value = 0.9037
The p-value of the test is 0.9037, which is greater than the significance level alpha = 0.05. We can conclude that the observed proportions are not significantly different from the expected proportions.
The result of chisq.test() function is a list containing the following components:
statistic: the value of the chi-squared test statistic.
parameter: the degrees of freedom
p.value: the p-value of the test
observed: the observed count
expected: the expected count
Chi-Square Test of Independence in R
The chi-square test of independence is used to analyze the frequency table (i.e. contingency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables. This article describes the basics of the chi-square test and provides practical examples using R software.

    # Import the data
    file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
    housetasks <- read.delim(file_path, row.names = 1)
    # head(housetasks)

The data is a contingency table containing 13 house tasks and their distribution in the couple:
rows are the different tasks
values are the frequencies of the tasks done:
by the wife only
alternatively
by the husband only
or jointly
Graphical display of contingency tables
A contingency table can be visualized using the function balloonplot() [in gplots package]. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.

    library("gplots")

    # 1. Convert the data to a table
    dt <- as.table(as.matrix(housetasks))

    # 2. Graph
    balloonplot(t(dt), main = "housetasks", xlab = "", ylab = "",
                label = FALSE, show.margins = FALSE)

It’s also possible to visualize a contingency table as a mosaic plot. This is done using the function mosaicplot() from the built-in R package graphics.
Blue color indicates that the observed value is higher than the expected value if the data were random
Red color specifies that the observed value is lower than the expected value if the data were random
From this mosaic plot, it can be seen that the housetasks Laundry, Main_meal, Dinner and breakfeast (blue color) are mainly done by the wife in our example.
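A minimal sketch of such a mosaic plot is given below. It assumes the built-in `HairEyeColor` data as a stand-in for `housetasks`, so that it runs without downloading anything; the title is an arbitrary choice.

```r
# Mosaic plot of a two-way contingency table with shaded residuals.
# Assumption: the built-in HairEyeColor data stands in for housetasks here.
tab <- margin.table(HairEyeColor, c(1, 2))   # hair colour x eye colour counts

mosaicplot(tab,
           shade = TRUE,   # colour cells by Pearson residuals
           las   = 2,      # axis labels perpendicular to the axes
           main  = "Hair colour by eye colour")
```

With the table object `dt` created earlier from `housetasks`, the equivalent call would pass `dt` in place of `tab`.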
Chi-square test basics
Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.
Null hypothesis (H0): the row and the column variables of the contingency table are independent.
Alternative hypothesis (H1): row and column variables are dependent
For each cell of the table, we have to calculate the expected value under the null hypothesis.
For a given cell, the expected value is calculated as follows:

$$e = \frac{\text{row.sum} \times \text{col.sum}}{\text{grand.total}}$$

The Chi-square statistic is then calculated as follows:

$$\chi^2 = \sum \frac{(o - e)^2}{e}$$

where $o$ is the observed value and $e$ is the expected value.
This calculated Chi-square statistic is compared to the critical value (obtained from statistical tables) with $df = (r - 1)(c - 1)$ degrees of freedom and p = 0.05.
$r$ is the number of rows in the contingency table
$c$ is the number of columns in the contingency table
If the calculated Chi-square statistic is greater than the critical value, then we must conclude that the row and the column variables are not independent of each other. This implies that they are significantly associated.
Compute chi-square test in R
The Chi-square statistic can be easily computed using the function chisq.test().
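The call can be sketched as follows. As an assumption for this example, the built-in `HairEyeColor` counts are used instead of `housetasks`, so the block runs offline; the result is stored under the name `chisq`.

```r
# Chi-square test of independence on a two-way table.
# Assumption: built-in HairEyeColor counts are used instead of housetasks.
tab   <- margin.table(HairEyeColor, c(1, 2))
chisq <- chisq.test(tab)
chisq

# Expected counts: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
all.equal(unname(chisq$expected), unname(expected))
```

The last line checks that the expected counts stored in the result match the row-total times column-total over grand-total rule.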
If you want to know which cells contribute most to the total Chi-square score, calculate the Pearson residual ($r$) for each cell:

$$r = \frac{o - e}{\sqrt{e}}$$

The contribution of a given cell to the total Chi-square score is calculated as follows (multiply by 100 to express it in %):

$$\text{contrib} = \frac{r^2}{\chi^2}$$
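The residuals and contributions can be sketched in R as follows. The built-in `HairEyeColor` table is again an assumption standing in for `housetasks`, and `r` and `contrib` are illustrative names.

```r
tab   <- margin.table(HairEyeColor, c(1, 2))
chisq <- chisq.test(tab)

# Pearson residuals: (observed - expected) / sqrt(expected)
r <- (chisq$observed - chisq$expected) / sqrt(chisq$expected)

# Percentage contribution of each cell to the total chi-square statistic
contrib <- 100 * r^2 / unname(chisq$statistic)
round(contrib, 2)
```

The same residuals are also available directly as `chisq$residuals`, and the contributions sum to 100.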
Access to the values returned by chisq.test() function
The result of chisq.test() function is a list containing the following components:
statistic: the value of the chi-squared test statistic.
parameter: the degrees of freedom
p.value: the p-value of the test
observed: the observed count
expected: the expected count
The format of the R code to use for getting these values is as follows:

    # printing the p-value
    chisq$p.value
    # printing the expected counts
    chisq$expected
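
| class      | 4  | 5  | 6  | 8  |
|------------|----|----|----|----|
| 2seater    | NA | NA | NA | 16 |
| compact    | 33 | 21 | 18 | NA |
| midsize    | 23 | NA | 19 | 16 |
| minivan    | 18 | NA | 17 | NA |
| pickup     | 17 | NA | 16 | 14 |
| subcompact | 35 | 20 | 18 | 15 |
| suv        | 20 | NA | 17 | 14 |
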
We can find proportions by creating a new, calculated variable dividing row frequency by table frequency.

    mpg %>%
      group_by(class) %>%
      summarize(n = n()) %>%
      mutate(prop = n / sum(n)) %>%   # our new proportion variable
      kable()

| class      | n  | prop      |
|------------|----|-----------|
| 2seater    | 5  | 0.0213675 |
| compact    | 47 | 0.2008547 |
| midsize    | 41 | 0.1752137 |
| minivan    | 11 | 0.0470085 |
| pickup     | 33 | 0.1410256 |
| subcompact | 35 | 0.1495726 |
| suv        | 62 | 0.2649573 |

We can create a contingency table of proportion values by applying the same spread command as before. Vary the group_by() and spread() arguments to produce proportions of different variables.

    mpg %>%
      group_by(class, cyl) %>%
      summarize(n = n()) %>%
      mutate(prop = n / sum(n)) %>%
      subset(select = c("class", "cyl", "prop")) %>%   # drop the frequency value
      spread(class, prop) %>%
      kable()
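
| cyl | 2seater | compact   | midsize   | minivan   | pickup    | subcompact | suv       |
|-----|---------|-----------|-----------|-----------|-----------|------------|-----------|
| 4   | NA      | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000  | 0.1290323 |
| 5   | NA      | 0.0425532 | NA        | NA        | NA        | 0.0571429  | NA        |
| 6   | NA      | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000  | 0.2580645 |
| 8   | 1       | NA        | 0.0487805 | NA        | 0.6060606 | 0.1428571  | 0.6129032 |
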
table() is a quick way to pull together row/column frequencies and proportions for categorical variables
Using the basic table() command, we can get a contingency table of vehicle class by number of cylinders.
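A minimal sketch of this, assuming the ggplot2 package is installed (it ships the `mpg` dataset); the name `tab` is an illustrative choice:

```r
# Contingency table of vehicle class by number of cylinders.
# Assumes the ggplot2 package is installed; it ships the mpg dataset.
mpg <- ggplot2::mpg

tab <- table(mpg$class, mpg$cyl)   # frequencies
tab

prop.table(tab)               # proportions of the table total
prop.table(tab, margin = 1)   # proportions within each class (rows sum to 1)
```

Passing `margin = 1` to `prop.table()` reproduces the within-class proportions computed with `group_by()` and `mutate()` above, with classes as rows instead of columns.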
Abstract of 150 to 250 words which should be divided into the following sections: Purpose, Methods, Results, Conclusion.
4 to 6 keywords which can be used for indexing purposes.
Title: The link between our farmlands and giant tortoises: different characteristics of farmlands influencing tortoise behavior, using visual observations.
Authors and Affiliation: Kyana N. Pike, Stephen Blake, Iain J. Gordon and Lin Schwarzkopf
Abstract: Land sharing between nature and humans can be effective, but it does not always work. The purpose of this study is to show that different characteristics of farmland can influence tortoise behavior. Behavioral observations were carried out in two seasons, the wet season (March to May) and the dry season (November to December), giving a total of 242 behavioral observations. Each tortoise was located and then observed for 30 minutes from a distance of 5-15 m using binoculars. The time spent eating, resting and walking was recorded. Tortoises on farmland were more likely to walk in areas with no vegetation and more likely to rest at lower carapace temperatures. However, tortoises in abandoned areas showed different patterns, possibly because of the lack of livestock and tourism. Tortoises need more time to digest their food at cooler temperatures than at warmer temperatures. Overall, this study shows how giant tortoises behave differently across farm types and how this indicates which activities tortoises choose in order to save energy. More investigation is needed into the different habitat qualities of farms and how these affect tortoises.
Which of the following correlation coefficients expresses the strongest association?
a) 0.55 b) 0.09 c) -0.77 d) 0.1 e) 1.05
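The answer is c) -0.77. The strength of an association depends on the absolute value of the coefficient, and -0.77 is the largest in absolute value among the valid options; it represents a strong negative correlation. Option e) 1.05 is not a valid correlation coefficient, because a correlation must lie between -1 and 1.
We have five representative samples of people aged 15, 20, 30, 45 and 60 years who completed a questionnaire of political conservatism. In these 5 samples in the given order were the average scores of political conservatism as follows: 60, 85, 80, 70, . Correlation between age and political conservatism is: a) 1.0 b) -1.0 c) linear d) nonlinear
We need to look at the relationship between the two variables based on the given data. The scores rise from 60 to 85 and then fall again as age increases, so the points do not follow a straight line. The answer is d) nonlinear.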
How is Pearson’s coefficient influenced by:
a) Limited variability → can reduce the correlation, especially if the variability in one or both variables is low.
b) Differences in distribution → can distort the correlation, especially if distributions are non-normal.
c) Outliers → can drastically change the correlation, either inflating or deflating it.
d) Extreme groups → can lead to a biased correlation, as it may not be representative of the entire range of data.
What information can we gather from a scatterplot?
Direction (positive/negative/no correlation)
Strength (strong/weak/moderate correlation)
Linearity (linear vs. nonlinear)
Outliers
Clusters or subgroups within the data
Trends and patterns (if applicable, especially over time)
Homogeneity of variance (homoscedasticity vs. heteroscedasticity)
If you use Pearson’s correlation coefficient to compute the correlation between variables where the relationship between X and Y is not linear, how does this influence the coefficient?
If the relationship between X and Y is not linear and you use Pearson’s correlation coefficient, it can lead to:
Weak or misleading correlation values (even if there is a strong nonlinear relationship),
Misrepresentation of the direction of the relationship (positive or negative),
Inaccurate conclusions about the strength and nature of the relationship between the variables.
For non-linear relationships, it’s better to use alternative methods such as Spearman’s rank correlation (for monotonic relationships) or nonlinear regression models for a more accurate analysis.
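The effect can be seen in a small sketch with made-up illustrative data: a strictly increasing but curved relationship, and a symmetric U-shaped one.

```r
# Pearson vs. Spearman on a monotonic but nonlinear relationship
x <- 1:20
y <- exp(x / 2)                    # strictly increasing, strongly curved

cor(x, y, method = "pearson")      # noticeably below 1
cor(x, y, method = "spearman")     # 1: the ranks agree perfectly

# A symmetric U-shaped relationship: Pearson's r is (numerically) 0
u <- -10:10
cor(u, u^2, method = "pearson")
```

In the first case Pearson’s coefficient understates a perfect monotonic relationship; in the second it reports no association even though `u^2` is fully determined by `u`.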
What are marginal frequencies?
Marginal frequencies are the sums of the rows and columns in a contingency table.
They represent the total count of observations for each category of one variable, ignoring the other variable.
Marginal frequencies are helpful for understanding the overall distribution of each variable and are often used in various statistical analyses like probability calculations and Chi-square tests.
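In R, marginal frequencies can be obtained from a two-way table with `rowSums()`, `colSums()` or `addmargins()`. The sketch below uses the built-in `HairEyeColor` data, chosen only for illustration.

```r
tab <- margin.table(HairEyeColor, c(1, 2))   # hair colour x eye colour

rowSums(tab)      # marginal frequencies of hair colour
colSums(tab)      # marginal frequencies of eye colour
addmargins(tab)   # table with both margins and the grand total appended
```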
If correlation between X and Y is 0.5, how does the correlation change if we transform X to T-scores?
A linear transformation of X with a positive slope (such as converting to T-scores, T = 50 + 10z) does not change the correlation between X and Y. Thus, if the original correlation between X and Y is 0.5, the correlation between T(X) and Y will also be 0.5.
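This can be checked numerically. The sketch below uses two columns of the built-in `mtcars` data as X and Y, an arbitrary choice for illustration.

```r
# Correlation is unchanged by converting X to T-scores (T = 50 + 10 * z)
x <- mtcars$wt
y <- mtcars$mpg

z_x <- (x - mean(x)) / sd(x)   # z-scores
t_x <- 50 + 10 * z_x           # T-scores: mean 50, SD 10

cor(x, y)
cor(t_x, y)                    # same value as above
```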