Data Analysis: focusing on the basics. Covers working with data and the idea that less is more in statistics.
Research methods: covering the theoretical and philosophical aspects of doing science. Making sense of science and working on writing and reading skills.
Numerical: Discrete: numeric values that can only take certain values (for example, whole-number counts).
Continuous: numeric values that are measured and can be any number within a particular range.
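As a quick illustration of the two kinds of numerical variable, here is a small sketch using the built-in mtcars dataset (my choice of example, not part of the course material): cyl behaves as a discrete variable and mpg as a continuous one.

```r
# Discrete: number of cylinders can only take a few whole-number values.
cyl_values <- sort(unique(mtcars$cyl))
print(cyl_values)
print(table(mtcars$cyl))   # counting how often each value occurs makes sense

# Continuous: miles per gallon is measured and can fall anywhere in a range.
mpg_range <- range(mtcars$mpg)
print(mpg_range)
print(summary(mtcars$mpg))
```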
Inductive vs deductive reasoning
They are opposite approaches to reasoning that differ in where they start and what they use to reach a conclusion.
Inductive: observation -> pattern -> hypothesis -> theory
Deductive: theory -> hypothesis -> observation -> confirmation
Types of good and bad questions
Bad questions:
1. Is there any difference between A and B?
2. Is A bigger than B?
3. Can X influence Y?
Good questions:
1. What explains the differences between A and B?
2. What makes A bigger than B?
3. How can X influence Y?
    library(dplyr)    # %>%, group_by(), mutate(), select(), arrange()
    library(ggplot2)  # provides the diamonds dataset

    diamonds          # print the dataset

    diamonds %>%                            # utilizes the diamonds dataset
      group_by(color, clarity) %>%          # groups data by the color and clarity variables
      mutate(price200 = mean(price)) %>%    # creates a new variable (average price by group)
      ungroup() %>%                         # data no longer grouped by color and clarity
      mutate(random10 = 10 + price) %>%     # new variable: original price + $10
      select(cut, color, clarity, price, price200, random10) %>%  # retain only these listed columns
      arrange(color) %>%                    # order the data by color
      group_by(cut) %>%                     # group data by cut
      mutate(dis = n_distinct(price),       # counts the number of unique price values per cut
             rowID = row_number()) %>%      # numbers each row consecutively within each cut
      ungroup()                             # final ungrouping of data
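To check my understanding of what a grouped mutate() does, here is a base R sketch on a tiny made-up data frame (the toy values are invented and only one grouping variable is used, so this is not the diamonds pipeline itself). ave() plays the role of group_by() followed by mutate().

```r
# Toy data, made up for illustration (not the diamonds dataset).
toy <- data.frame(
  color = c("D", "D", "E", "E", "E"),
  price = c(100, 200, 300, 400, 500)
)

# ave() returns the group mean repeated on every row of that group,
# much like group_by(color) %>% mutate(price200 = mean(price)).
toy$price200 <- ave(toy$price, toy$color, FUN = mean)

# Row number within each group, like mutate(rowID = row_number()).
toy$rowID <- ave(seq_along(toy$price), toy$color, FUN = seq_along)

print(toy)
```

The point to notice is that the number of rows does not change: every row keeps its own price and gains the mean of its group.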
To advance science, a good working hypothesis is everything: a wrong hypothesis will misguide you, whereas a good one keeps you excited and up to date. Forming one may need training.
A statement
it is affirmative
it is not a question
must lead to expectations if confirmed
self explanatory
types of hypotheses
scientific hypotheses
candidate statements to explain an observed phenomenon
meant to generate logical predictions
working guidelines
Null Hypothesis: This states that there is no effect or relationship. For instance, “There is no difference in plant growth between plants grown in sunlight and those grown in the dark.”
Alternative Hypothesis: This proposes an effect or relationship. For example, “Plants grown in sunlight will grow taller than those grown in the dark.”
statistical hypotheses
Statistical hypotheses are specific statements about a population parameter or a process that can be tested using statistical methods.
logical predictions
confirmed by stats
can be drawn in a graph
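A minimal sketch of how the null and alternative hypotheses about plant growth could be turned into a statistical test. The heights below are simulated (invented numbers, not real measurements), and the column names light and height are my own.

```r
set.seed(1)  # make the simulated numbers reproducible

# Simulated plant heights in cm (invented values, not real measurements).
plants <- data.frame(
  light  = rep(c("dark", "sun"), each = 20),
  height = c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 15, sd = 2))
)

# H0: mean height is the same in both groups.
# H1 (one-sided): plants in the sun are taller, i.e. mean(dark) < mean(sun).
result <- t.test(height ~ light, data = plants, alternative = "less")
print(result)
```

Because the groups are ordered alphabetically (dark, then sun), alternative = "less" asks whether the dark mean is smaller than the sun mean.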
Writing a hypothesis
You need to tell a story: do not use subheadings and never refer to statistical hypotheses. You can use the if/then method.
Identify the Variables: Determine your independent variable (the one you change) and dependent variable (the one you measure).
Make a Prediction: State what you expect to happen based on your understanding of the topic.
Be Specific: Include details that clarify your prediction.
A hypothesis typically follows an “If… then…” format. For example, “If temperature controls the rate of a chemical reaction, then running the reaction at a higher temperature will make it proceed faster.”
week 5
Frequency tests
chi-square
G-Tests
Contingency tables
log-linear models
powerful for testing associations between categorical variables.
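A small sketch of a frequency test, assuming the built-in HairEyeColor table is an acceptable stand-in for course data: build a contingency table of two categorical variables and run a chi-square test of independence.

```r
# Collapse the 3-way HairEyeColor table over Sex to get a Hair x Eye table.
hair_eye <- margin.table(HairEyeColor, c(1, 2))
print(hair_eye)

# H0: hair colour and eye colour are independent.
chi_res <- chisq.test(hair_eye)
print(chi_res)

# Expected counts under independence; the test compares these with hair_eye.
print(round(chi_res$expected, 1))
```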
means tests
T-Tests (two levels)
Anovas (three plus levels)
non-parametric equivalents
nested and two way
post-hoc tests
widely used for testing differences in means.
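The means tests listed above can be sketched with the built-in iris data (my choice of example): a t-test for two levels, an ANOVA with a post-hoc test for three levels, and a non-parametric equivalent.

```r
# Two levels: Welch t-test on two of the three iris species.
two_species <- droplevels(subset(iris, Species != "virginica"))
t_res <- t.test(Sepal.Length ~ Species, data = two_species)
print(t_res)

# Three levels: one-way ANOVA, then Tukey's post-hoc comparisons.
aov_fit <- aov(Sepal.Length ~ Species, data = iris)
print(summary(aov_fit))
print(TukeyHSD(aov_fit))

# Non-parametric equivalent of the one-way ANOVA.
kw_res <- kruskal.test(Sepal.Length ~ Species, data = iris)
print(kw_res)
```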
Correlations and models
correlations
many variations
linear models
many variations
highly predictive and powerful but depend on many conditions.
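A minimal sketch of a correlation and a linear model, using the built-in cars dataset as an assumed example. The last step is one common way to look at the conditions a linear model depends on, not the only one.

```r
# Pearson correlation between speed and stopping distance.
cor_res <- cor.test(cars$speed, cars$dist)
print(cor_res)

# Simple linear model: stopping distance predicted from speed.
fit <- lm(dist ~ speed, data = cars)
print(summary(fit))

# One of the conditions: residuals should look roughly normal.
shapiro_res <- shapiro.test(residuals(fit))
print(shapiro_res)
```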
Logistic models
logistic models
predictive of odds
similar in logic to frequency tests
similar in calculations to linear models
highly predictive and powerful but can be complex to interpret
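A short sketch of a logistic model, assuming the built-in mtcars data as an example: the outcome am is 0 or 1, and the coefficients are log-odds, which is part of why interpretation takes more care.

```r
# am is 0 (automatic) or 1 (manual); wt is weight in 1000 lbs.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
print(summary(logit_fit))

# Coefficients are on the log-odds scale; exponentiate for odds ratios.
odds_ratios <- exp(coef(logit_fit))
print(odds_ratios)

# Predicted probability of a manual gearbox for a car with wt = 3.
p_manual <- predict(logit_fit, newdata = data.frame(wt = 3), type = "response")
print(p_manual)
```

The call looks almost the same as lm(), which matches the note that the calculations are similar to linear models.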
formative exercise
Boxplots
Boxplots are useful for visualizing the distribution of a dataset, highlighting the median, quartiles, and potential outliers. They can be associated with several statistical tests and analyses, including:
Kruskal-Wallis Test: A non-parametric test used to compare three or more independent groups. Boxplots can visually represent the distributions of these groups.
Mann-Whitney U Test: Another non-parametric test that compares two independent groups. Boxplots can show the median and spread of the data for both groups.
ANOVA: While boxplots are not directly associated with ANOVA, they can be used to visualize the distribution of data across multiple groups, helping to interpret ANOVA results.
T-tests: Similar to ANOVA, boxplots can display the distributions of two groups being compared with a t-test.
Outlier detection: Boxplots inherently display potential outliers, making them useful for visualizing and identifying outliers in the context of any statistical analysis
    data(iris)

    # Create a boxplot of Sepal.Length by Species
    boxplot(Sepal.Length ~ Species, data = iris,
            main = "Boxplot of Sepal Length by Species",
            xlab = "Species",
            ylab = "Sepal Length",
            col = c("lightblue", "lightgreen", "lightpink"))
Sepal.Length ~ Species: This formula indicates that you want to plot sepal lengths (the dependent variable) grouped by species (the independent variable).
data = iris: Specifies that the data comes from the iris dataset.
main, xlab, ylab, and col: Customize the title, axis labels, and colors of the boxes.
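To see the numbers behind a boxplot and its outliers, base R's boxplot.stats() can be used. This is a sketch on Sepal.Width (my choice of variable, because it has points beyond the whiskers).

```r
# The five numbers a boxplot draws: lower whisker, lower hinge (about Q1),
# median, upper hinge (about Q3), upper whisker.
width_stats <- boxplot.stats(iris$Sepal.Width)
print(width_stats$stats)

# Points beyond the whiskers, i.e. more than 1.5 * IQR away from the box.
outliers <- width_stats$out
print(outliers)

# The same rule applied within each species.
by_species <- tapply(iris$Sepal.Width, iris$Species,
                     function(x) boxplot.stats(x)$out)
print(by_species)
```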
Line graphs
T-test: To compare means between two groups over time or conditions.
ANOVA (Analysis of Variance): To compare means across multiple groups or time points.
Chi-Square Test: When categorical data is involved, to see if there’s a significant association between variables over time.
Mann-Kendall Trend Test: A non-parametric test for identifying trends in time series data.
    data(iris)
    library(ggplot2)

    summary_data <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

    # Create the line graph
    ggplot(summary_data, aes(x = Species, y = Sepal.Length, group = 1)) +
      geom_line() +
      geom_point() +
      labs(title = "Average Sepal Length by Species",
           x = "Species",
           y = "Average Sepal Length") +
      theme_minimal()
aggregate(): This function computes the mean sepal length for each species.
ggplot2: A popular package for creating graphics in R.
geom_line(): Adds the lines connecting the points.
geom_point(): Adds points to represent the mean values.
labs(): Adds labels for the title and axes.
theme_minimal(): Applies a minimal theme to the plot.
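For the Mann-Kendall trend test mentioned above, dedicated packages exist, but a base R sketch is possible because the test is built on the same pair counts as Kendall's tau between the series and time. The built-in Nile series is my own choice of example.

```r
# Nile: annual flow of the river Nile, 1871-1970 (built into R).
flow <- as.numeric(Nile)
year <- as.numeric(time(Nile))

# Kendall's tau between time and the series. exact = FALSE uses the normal
# approximation, which also copes with tied values.
trend <- cor.test(year, flow, method = "kendall", exact = FALSE)
print(trend)
```

A negative tau points to a downward trend over time, a positive tau to an upward one.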
Scatter graphs
T-test: To compare means between two groups over time or conditions.
ANOVA (Analysis of Variance): To compare means across multiple groups or time points.
Regression Analysis: To assess relationships between variables, often fitting a line to the data to identify trends.
Chi-Square Test: When categorical data is involved, to see if there’s a significant association between variables over time.
Mann-Kendall Trend Test: A non-parametric test for identifying trends in time series data.
These tests help interpret the data represented in scatter graphs, offering insights into relationships and trends.
    data(iris)
    library(ggplot2)

    # Create a scatter graph
    ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
      geom_point(size = 3) +   # Adjust size of points
      labs(title = "Scatter Plot of Sepal Length vs. Sepal Width",
           x = "Sepal Length",
           y = "Sepal Width") +
      theme_minimal()
aes(x = Sepal.Length, y = Sepal.Width, color = Species): This specifies that Sepal.Length will be on the x-axis, Sepal.Width on the y-axis, and points will be colored by species.
geom_point(): This function creates the scatter plot. You can adjust the size of the points using the size argument.
labs(): Adds titles and labels for the axes.
theme_minimal(): Applies a clean, minimal theme for better aesthetics.
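Colouring the points by species matters for the analysis too. This sketch compares the correlation across all flowers with the correlation inside each species, then fits a regression that includes Species (the model formula is my own choice).

```r
# Correlation ignoring species.
overall <- cor(iris$Sepal.Length, iris$Sepal.Width)
print(overall)

# Correlation within each species.
within <- sapply(split(iris, iris$Species),
                 function(d) cor(d$Sepal.Length, d$Sepal.Width))
print(within)

# Regression of width on length, allowing a different intercept per species.
fit_species <- lm(Sepal.Width ~ Sepal.Length + Species, data = iris)
print(coef(fit_species))
```

The overall correlation is slightly negative while each within-species correlation is positive, so ignoring the grouping would give a misleading picture.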
Bar charts
    library(ggplot2)

    iris_summary <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

    ggplot(iris_summary, aes(x = Species, y = Sepal.Length, fill = Species)) +
      geom_bar(stat = "identity", position = "dodge") +
      labs(title = "Average Sepal Length by Species",
           x = "Species",
           y = "Average Sepal Length") +
      theme_minimal()
Proportion Tests: To compare the proportions of different categories within groups.
Kruskal-Wallis Test: A non-parametric alternative to ANOVA when the assumptions of normality are not met.
Chi-Square Test: Used to determine if there is a significant association between two categorical variables.
ANOVA (Analysis of Variance): When comparing means across multiple groups, ANOVA can help assess whether there are any statistically significant differences.
T-test: If comparing the means of two groups, a t-test can determine if the differences are significant.
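A minimal sketch of a proportion test, assuming the built-in mtcars data as an example: it compares the proportion of V-shaped engines (vs = 0) between automatic and manual cars.

```r
# Rows: gearbox (0 = automatic, 1 = manual); columns: engine shape (vs).
counts <- table(gearbox = mtcars$am, engine = mtcars$vs)
print(counts)

# Row proportions: the share of each engine shape within each gearbox type.
print(prop.table(counts, margin = 1))

# prop.test() treats the first column as "successes" and the second as "failures".
prop_res <- prop.test(counts)
print(prop_res)
```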
mosquitos
brainstorming
questions to possibly ask?
some cool questions
1. What specific differences in wing span exist between male and female mosquitoes across various species?
2. Does wing length affect the vulnerability of male and female mosquitoes to predators?
3. How does nutrition during the larval stage affect wing length differently in male and female mosquitoes?
4. How might variations in wing length between male and female mosquitoes affect their roles as disease vectors?
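    library(ggplot2)

    mosquitos <- read.table("mosquitos.txt", header = TRUE, sep = "\t")

    # stat = "summary" with fun = "mean" draws one bar per sex at the mean wing
    # length; stat = "identity" would stack every individual measurement instead.
    ggplot(mosquitos, aes(x = sex, y = wing)) +
      geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
      labs(title = "Mean wing length by sex", x = "sex", y = "wing")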