We will refer to the formulas in https://www.cuemath.com/algebra/algebraic-formulas/. Typical ones are listed below:
Other Formulas
A complete list of formulas: https://www.mga.edu/computing/mathematics-statistics/docs/Math_Resources_Algebra_Formulas.pdf
Factor \(x^2-25y^2\) and \(16x^2-1\).
Factor \(x^3-8y^3\), \(x^3+8y^3\), and \(8x^3 + 27\).
Factor \(1-27y^3\) and \(1+27y^3\).
Expand \((2x-3y)^2\).
Evaluate \(97^2-3^2\) and \(297 \times 303\) without paper.
Find the roots of the quadratic equation: \(x^2+7x+12=0\).
Find the roots of the quadratic equation \(2x^2+5x+3=0\).
Simplify the expression: \((x^{-9}y^3)/(x^{-7}y^8)\) so that the answer has no negative exponents.
Expand the logarithm: \(\log_5(x^2y^3z)\).
\(\log_4 64 = ?\)
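As a brief refresher, the factoring and expansion exercises above rest on a few standard identities. The block below lists them and works two small cases; the specific numbers are illustrative and are not taken from the exercises.

\[
\begin{aligned}
a^2-b^2 &= (a-b)(a+b), \\
a^3-b^3 &= (a-b)(a^2+ab+b^2), \\
a^3+b^3 &= (a+b)(a^2-ab+b^2), \\
(a-b)^2 &= a^2-2ab+b^2.
\end{aligned}
\]

For example, \(x^2-9y^2 = (x-3y)(x+3y)\), and \(198 \times 202 = (200-2)(200+2) = 200^2-2^2 = 39996\).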
A 28-page SAT math practice test: https://focusonlearningcenter.com/wp-content/uploads/2020/07/SAT-Practice-Test-1-Math-tests-.pdf
If you can work through all 450 pages of the following book, you should have no problem with high school math.
Random Sampling and Random Assignment are two key concepts in research design, particularly in experimental and survey research, and they serve different purposes.
Feature | Random Sampling | Random Assignment |
---|---|---|
Goal | To obtain a representative sample | To create equivalent groups for comparison |
Purpose | Supports generalizability to the population | Supports causal inferences in an experiment |
Application | Common in survey research | Common in experimental research |
Reduces | Sampling bias | Confounding variables |
Outcome | Representative sample | Comparable groups |
Example | Surveying a randomly selected portion of a city | Randomly placing students into treatment groups |
In Summary
In an ideal experimental study aiming for both generalizability and causality, researchers would use random sampling to select participants and random assignment to assign them to different groups.
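The distinction can be sketched in a few lines of R. This is a minimal illustration under assumed values: the population of 1000 ID numbers, the sample size of 20, and the names `population_ids`, `sampled_ids`, and `assignment` are all hypothetical.

```r
# Hypothetical population of 1000 people, identified by ID number
set.seed(42)
population_ids <- 1:1000

# Random sampling: every member of the population has the same chance of selection
sampled_ids <- sample(population_ids, size = 20)

# Random assignment: shuffle 10 "treatment" and 10 "control" labels
# so that chance alone decides which sampled person gets which condition
group <- sample(rep(c("treatment", "control"), each = 10))

assignment <- data.frame(id = sampled_ids, group = group)
table(assignment$group)
```

Sampling decides who enters the study; assignment decides what happens to them once they are in it.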
In educational research, t-tests are commonly used to analyze differences between two groups or two conditions. Here are a few examples:
An educator wants to compare the final exam scores of two different classes taught by different teachers to see if there is a significant difference in performance.
Let’s say we have random samples of final exam scores from the two classes (Class A and Class B). The data are as follows:
# Data for two independent groups (Class A and Class B)
class_A_scores <- c(85, 78, 90, 88, 76, 80, 82, 95, 89, 84)
class_B_scores <- c(78, 74, 69, 85, 80, 77, 83, 72, 76, 75, 76)
# Conduct an independent samples t-test
t_test_result <- t.test(class_A_scores, class_B_scores, alternative = "two.sided")
# Print the result
t_test_result
##
## Welch Two Sample t-test
##
## data: class_A_scores and class_B_scores
## t = 3.3817, df = 17.051, p-value = 0.003533
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.965608 12.798029
## sample estimates:
## mean of x mean of y
## 84.70000 76.81818
The results from the Welch Two Sample t-test can be interpreted as follows:
t-statistic = 3.3817: This is the test statistic, indicating the ratio of the difference in sample means to the standard error of the difference. A larger absolute t-statistic generally indicates a greater difference between the groups.
degrees of freedom (df) = 17.051: This represents the number of independent values used in the calculation of the test statistic, adjusting for unequal variances between the two groups (this is why the Welch correction is used).
p-value = 0.003533: This value tells us the probability of observing the data, or something more extreme, under the assumption that there is no difference between the groups. A p-value less than 0.05 (the typical threshold for significance) indicates that there is a statistically significant difference between the means of Class A and Class B. This is a statistical inference method using hypothesis testing.
95% Confidence interval = [2.97, 12.80]: This is the range of values within which we are 95% confident that the true difference in means lies. It indicates that the mean score of Class A is between 2.97 and 12.80 points higher than the mean score of Class B. This is a statistical inference method using a confidence interval.
Conclusion:
There is a statistically significant difference in the test scores between Class A and Class B, with Class A scoring higher on average. The results suggest that the teaching method or conditions in Class A may have contributed to this higher performance.
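To make the quantities in the interpretation concrete, the following sketch recomputes the t-statistic, the Welch degrees of freedom, and the two-sided p-value by hand from the same scores. The names `v_A`, `v_B`, `t_manual`, `df_manual`, and `p_manual` are introduced here for illustration only.

```r
# Scores copied from the example above
class_A_scores <- c(85, 78, 90, 88, 76, 80, 82, 95, 89, 84)
class_B_scores <- c(78, 74, 69, 85, 80, 77, 83, 72, 76, 75, 76)

# Squared standard error contributed by each group: s^2 / n
v_A <- var(class_A_scores) / length(class_A_scores)
v_B <- var(class_B_scores) / length(class_B_scores)

# t-statistic: difference in sample means divided by its standard error
t_manual <- (mean(class_A_scores) - mean(class_B_scores)) / sqrt(v_A + v_B)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v_A + v_B)^2 /
  (v_A^2 / (length(class_A_scores) - 1) + v_B^2 / (length(class_B_scores) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

c(t = t_manual, df = df_manual, p = p_manual)
```

These values should agree with the `t`, `df`, and `p-value` entries that `t.test()` reports.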
A school introduces a new teaching method and wants to see if it improves student performance. They test the same group of students before and after implementing the method.
Suppose the same group of students took a test before and after using a new teaching method. Here’s the data:
# Data for the same group of students before and after the new method
scores_before <- c(70, 75, 80, 65, 78, 74, 72, 68, 77, 73)
scores_after <- c(68, 80, 78, 75, 75, 78, 70, 74, 76, 77)
# Conduct a paired samples t-test
# The differences are computed as before - after, so an improvement
# corresponds to a negative mean difference (alternative = "less")
t_test_result <- t.test(scores_before, scores_after, paired = TRUE, alternative = "less")
# Print results
t_test_result
##
## Paired t-test
##
## data: scores_before and scores_after
## t = -1.3476, df = 9, p-value = 0.1054
## alternative hypothesis: true mean difference is less than 0
## 95 percent confidence interval:
## -Inf 0.6844798
## sample estimates:
## mean difference
## -1.9
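A paired t-test is equivalent to a one-sample t-test on the within-student differences. The sketch below recomputes the t-statistic by hand from the data above; the names `d`, `t_manual`, and `t_one_sample` are introduced here for illustration.

```r
# Scores copied from the example above
scores_before <- c(70, 75, 80, 65, 78, 74, 72, 68, 77, 73)
scores_after <- c(68, 80, 78, 75, 75, 78, 70, 74, 76, 77)

# Within-student differences (before - after)
d <- scores_before - scores_after

# t = mean difference / (standard deviation of the differences / sqrt(n))
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

# The same statistic from a one-sample t-test on the differences
t_one_sample <- unname(t.test(d, mu = 0)$statistic)

c(mean_difference = mean(d), t_manual = t_manual, t_one_sample = t_one_sample)
```

Because only the differences enter the calculation, the test has `length(d) - 1` degrees of freedom rather than the larger value an independent samples test would use.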
A researcher investigates if there is a significant difference between the average grades of male and female students in a particular subject, such as mathematics.
Let’s assume we have math scores from male and female students:
# Data for two independent groups (males and females)
male_scores <- c(82, 78, 84, 79, 85, 88, 75, 80, 77, 83)
female_scores <- c(90, 85, 88, 87, 89, 92, 85, 86, 88, 90)
# Conduct an independent samples t-test
t_test_result <- t.test(male_scores, female_scores, alternative = "two.sided")
t_test_result
##
## Welch Two Sample t-test
##
## data: male_scores and female_scores
## t = -4.7131, df = 14.373, p-value = 0.0003105
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.032365 -3.767635
## sample estimates:
## mean of x mean of y
## 81.1 88.0
A school wants to determine if students in online courses perform differently than those in traditional classroom settings.
Imagine a study comparing final grades of students in online vs. traditional classroom settings. Data are:
# Data for two independent groups (online vs. traditional)
online_scores <- c(75, 78, 72, 70, 76, 74, 73, 77, 79, 80)
traditional_scores <- c(85, 87, 82, 88, 84, 86, 83, 89, 85, 90)
# Conduct an independent samples t-test
t_test_result <- t.test(online_scores, traditional_scores, alternative = "two.sided")
t_test_result
##
## Welch Two Sample t-test
##
## data: online_scores and traditional_scores
## t = -8.0452, df = 17.271, p-value = 3.026e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.250273 -7.749727
## sample estimates:
## mean of x mean of y
## 75.4 85.9
A tutoring program is introduced for students who are struggling. The researcher wants to see if the program significantly improves their test scores.
A group of students’ test scores before and after participating in a tutoring program are given below. Data are:
# Data for the same group of students before and after tutoring
scores_before_tutoring <- c(60, 62, 65, 61, 63, 60, 64, 66, 67, 65)
scores_after_tutoring <- c(64, 65, 63, 67, 65, 58, 69, 71, 75, 70)
# Conduct a paired samples t-test
t_test_result <- t.test(scores_before_tutoring, scores_after_tutoring, paired = TRUE, alternative = "less")
t_test_result
##
## Paired t-test
##
## data: scores_before_tutoring and scores_after_tutoring
## t = -3.2852, df = 9, p-value = 0.004725
## alternative hypothesis: true mean difference is less than 0
## 95 percent confidence interval:
## -Inf -1.502829
## sample estimates:
## mean difference
## -3.4
Each of these examples involves different scenarios in education but highlights how a t-test can help measure the effectiveness or difference of various educational approaches, interventions, and demographic factors.
Regression analysis is a statistical technique used to explore the relationships between variables and to make predictions. In educational research, regression is often applied to examine how various factors influence outcomes like student performance, motivation, or graduation rates.
The general regression model takes the form:
\[y = \beta_0+\beta_1\cdot x_1+\beta_2\cdot x_2+\beta_3\cdot x_3+\cdots+\beta_k\cdot x_k+\epsilon\] where \(\beta_0, \beta_1, \beta_2, \beta_3, ..., \beta_k\) are called parameters and \(\epsilon\) is the error term.
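The tasks of regression are to estimate the parameters from data and to assess which predictors are significantly related to \(y\).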
A regression model can be used to predict the y value for a given set of x values.
We demonstrate regression through case studies.
This example predicts student grades based on study hours, parental education level, and school resources (see the appendix). The data are given below:
study_hours | parent_education | school_resources | grades |
---|---|---|---|
5.75 | 4 | 4.98 | 95.10 |
15.77 | 1 | 3.62 | 99.69 |
8.18 | 2 | 3.83 | 84.18 |
17.66 | 3 | 3.18 | 105.42 |
18.81 | 5 | 3.38 | 116.61 |
0.91 | 3 | 2.16 | 72.39 |
10.56 | 3 | 1.59 | 87.08 |
17.85 | 1 | 4.85 | 105.55 |
11.03 | 4 | 4.61 | 106.33 |
9.13 | 1 | 3.76 | 86.17 |
We fit the following regression model:
\[\text{grades} = \beta_0+\beta_1\cdot \text{study_hours}+\beta_2\cdot \text{parent_education}+\beta_3\cdot \text{school_resources}+\epsilon\]
The following is the code for data preparation:
study_hours <- c(5.75, 15.77, 8.18, 17.66, 18.81, 0.91, 10.56, 17.85, 11.03, 9.13)
parent_education <- c(4, 1, 2, 3, 5, 3, 3, 1, 4, 1)
school_resources <- c(4.98, 3.62, 3.83, 3.18, 3.38, 2.16, 1.59, 4.85, 4.61, 3.76)
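grades <- c(95.1, 99.69, 84.18, 105.42, 116.61, 72.39, 87.08, 105.55, 106.33, 86.17)
# Combine the variables into the data frame used by lm() below
data1 <- data.frame(study_hours, parent_education, school_resources, grades)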
The following is the code for fitting a regression model:
# Fit regression model
model1 <- lm(grades ~ study_hours + parent_education + school_resources, data = data1)
summary(model1)
##
## Call:
## lm(formula = grades ~ study_hours + parent_education + school_resources,
## data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7115 -0.4977 -0.2354 0.7710 2.5086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.9194 3.1349 15.924 3.89e-06 ***
## study_hours 1.8479 0.1264 14.615 6.44e-06 ***
## parent_education 3.7726 0.5173 7.293 0.000339 ***
## school_resources 3.9977 0.6783 5.894 0.001059 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.191 on 6 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.973
## F-statistic: 109 on 3 and 6 DF, p-value: 1.27e-05
Summary of Results:
Intercept: The intercept estimate is 49.92, which is the expected outcome when all predictors (study hours, parent education, and school resources) are zero. This is a baseline score.
Study Hours: The coefficient for study hours is 1.85, meaning that for each additional hour studied, the expected grade increases by 1.85 points, holding other variables constant. This effect is statistically significant at the 0.05 level (\(p \approx 6.4\times 10^{-6}\)).
Parent Education: The coefficient for parent education level is 3.77, indicating that each one-level increase in parent education is associated with a 3.77-point increase in the expected grade, holding other variables constant. This predictor is also statistically significant at the 0.05 level (\(p \approx 0.0003\)).
School Resources: The coefficient for school resources is 4.00, suggesting that each one-unit increase in school resources is associated with a 4-point increase in the expected grade, holding other variables constant. This is statistically significant at the 0.05 level (\(p \approx 0.001\)).
Model Fit:
Residual Standard Error: The residual standard error of 2.19 indicates the average deviation of observed values from the fitted values.
R-squared: The \(R^2\) value is 0.982, suggesting that approximately 98.2% of the variability in the outcome variable is explained by the model. The adjusted \(R^2\) of 0.973 confirms that this high explanatory power holds after adjusting for the number of predictors.
F-statistic: The F-statistic of 109, with a \(p\)-value of \(1.27\cdot 10^{-5}\) or 0.0000127, indicates that the model as a whole is statistically significant at the significance level 0.05.
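Once fitted, the model can be used for prediction. The sketch below refits the model on the ten observations listed above and predicts the grade of one new student; the data frame `new_student` and its values (10 study hours, parent education level 3, school resources 4) are hypothetical and chosen only for illustration.

```r
# Data copied from the data preparation step above
study_hours <- c(5.75, 15.77, 8.18, 17.66, 18.81, 0.91, 10.56, 17.85, 11.03, 9.13)
parent_education <- c(4, 1, 2, 3, 5, 3, 3, 1, 4, 1)
school_resources <- c(4.98, 3.62, 3.83, 3.18, 3.38, 2.16, 1.59, 4.85, 4.61, 3.76)
grades <- c(95.1, 99.69, 84.18, 105.42, 116.61, 72.39, 87.08, 105.55, 106.33, 86.17)
data1 <- data.frame(study_hours, parent_education, school_resources, grades)

model1 <- lm(grades ~ study_hours + parent_education + school_resources, data = data1)

# Hypothetical new student (values chosen only for illustration)
new_student <- data.frame(study_hours = 10, parent_education = 3, school_resources = 4)

# Point prediction together with a 95% prediction interval
predicted <- predict(model1, newdata = new_student, interval = "prediction")

# The point prediction is the fitted equation evaluated at the new values
manual <- sum(coef(model1) * c(1, 10, 3, 4))

predicted
manual
```

The `fit` column of `predicted` should match `manual`, while `lwr` and `upr` give a range that accounts for both the uncertainty in the estimated coefficients and the error term.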
Conclusion:
We compare traditional and online teaching methods by predicting scores based on teaching method and engagement hours.
teaching_method | engagement_hours | scores |
---|---|---|
Traditional | 9.40 | 89.65 |
Traditional | 5.66 | 83.26 |
Traditional | 7.84 | 87.29 |
Online | 14.36 | 111.34 |
Traditional | 7.76 | 83.52 |
Online | 13.46 | 110.15 |
Online | 13.80 | 107.48 |
Online | 9.52 | 90.22 |
Traditional | 6.75 | 83.35 |
Traditional | 3.06 | 78.77 |
The following is the code for fitting a regression model:
# Fit regression model
model2 <- lm(scores ~ teaching_method + engagement_hours, data = data2)
summary(model2)
##
## Call:
## lm(formula = scores ~ teaching_method + engagement_hours, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6034 -3.5635 -0.0302 2.9686 15.9185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.2365 1.3411 50.880 < 2e-16 ***
## teaching_methodTraditional -3.0180 0.9706 -3.109 0.00246 **
## engagement_hours 3.0117 0.1308 23.029 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.783 on 97 degrees of freedom
## Multiple R-squared: 0.8521, Adjusted R-squared: 0.8491
## F-statistic: 279.5 on 2 and 97 DF, p-value: < 2.2e-16
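The coefficient named `teaching_methodTraditional` arises because R converts a categorical predictor into a 0/1 indicator column. The sketch below shows that coding on the ten rows displayed in the table; it does not refit the model, since the output above is based on more observations than the table shows (97 residual degrees of freedom). The names `shown_rows` and `design` are introduced here for illustration.

```r
# The ten rows displayed in the table above
teaching_method <- c("Traditional", "Traditional", "Traditional", "Online", "Traditional",
                     "Online", "Online", "Online", "Traditional", "Traditional")
engagement_hours <- c(9.40, 5.66, 7.84, 14.36, 7.76, 13.46, 13.80, 9.52, 6.75, 3.06)
shown_rows <- data.frame(teaching_method = factor(teaching_method), engagement_hours)

# By default the first level in alphabetical order ("Online") is the reference level
levels(shown_rows$teaching_method)

# Design matrix: teaching_methodTraditional is 1 for Traditional and 0 for Online
design <- model.matrix(~ teaching_method + engagement_hours, data = shown_rows)
design
```

Under this coding, the estimate of -3.0180 is the expected difference in scores between Traditional and Online students with the same number of engagement hours.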
This example examines how factors like attendance, family support, and attitude affect classroom behavior scores.
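# Load packages for table display (kableExtra also provides the %>% pipe)
library(knitr)
library(kableExtra)
# Generate fake data for factors affecting behavior
set.seed(123)
attendance <- round(runif(100, 50, 100))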
family_support <- sample(1:5, 100, replace = TRUE)
attitude <- round(runif(100, 1, 5), 2)
behavior_score <- round(50 + 0.5 * attendance + 4 * family_support + 2 * attitude + rnorm(100, 0, 5), 2)
data3 <- data.frame(attendance, family_support, attitude, behavior_score)
# Display table
data3 %>%
head(10) %>%
kable("html", caption = "Analyzing Factors Affecting Student Behavior") %>%
kable_styling(full_width = FALSE)
attendance | family_support | attitude | behavior_score |
---|---|---|---|
64 | 1 | 4.03 | 95.13 |
89 | 4 | 1.55 | 111.98 |
70 | 1 | 2.59 | 94.65 |
94 | 1 | 1.90 | 100.32 |
97 | 3 | 1.23 | 106.41 |
52 | 4 | 2.58 | 107.15 |
76 | 1 | 1.26 | 97.52 |
95 | 3 | 1.90 | 107.04 |
78 | 5 | 1.22 | 108.38 |
73 | 3 | 3.68 | 99.93 |
The following is the code for fitting a regression model:
# Fit regression model
model3 <- lm(behavior_score ~ attendance + family_support + attitude, data = data3)
summary(model3)
##
## Call:
## lm(formula = behavior_score ~ attendance + family_support + attitude,
## data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4125 -3.0719 -0.6104 2.4503 11.6553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.24158 3.11026 15.510 < 2e-16 ***
## attendance 0.49635 0.03297 15.053 < 2e-16 ***
## family_support 4.56839 0.32770 13.941 < 2e-16 ***
## attitude 2.14325 0.39257 5.459 3.74e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.658 on 96 degrees of freedom
## Multiple R-squared: 0.8044, Adjusted R-squared: 0.7983
## F-statistic: 131.6 on 3 and 96 DF, p-value: < 2.2e-16
We investigate if family income, location, and school funding affect math scores.
# Generate fake data for educational equity issues
set.seed(123)
family_income <- round(runif(100, 20000, 80000), 2)
location <- sample(c("Urban", "Rural"), 100, replace = TRUE)
school_funding <- round(runif(100, 1, 10), 2)
math_score <- round(60 + 0.0005 * family_income + ifelse(location == "Urban", 10, -5) + 3 * school_funding + rnorm(100, 0, 5), 2)
data4 <- data.frame(family_income, location, school_funding, math_score)
# Display table
data4 %>%
head(10) %>%
kable("html", caption = "Exploring Educational Equity Issues") %>%
kable_styling(full_width = FALSE)
family_income | location | school_funding | math_score |
---|---|---|---|
37254.65 | Urban | 3.15 | 102.02 |
67298.31 | Rural | 9.66 | 121.47 |
44538.62 | Rural | 6.41 | 98.16 |
72981.04 | Urban | 5.64 | 118.37 |
76428.04 | Rural | 4.62 | 106.48 |
22733.39 | Rural | 8.92 | 91.72 |
51686.33 | Urban | 4.28 | 111.50 |
73545.14 | Urban | 3.59 | 115.68 |
53086.10 | Urban | 2.54 | 109.05 |
47396.88 | Rural | 2.55 | 84.48 |
The following is the code for fitting a regression model:
# Fit regression model
model4 <- lm(math_score ~ family_income + location + school_funding, data = data4)
summary(model4)
##
## Call:
## lm(formula = math_score ~ family_income + location + school_funding,
## data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8177 -3.3978 -0.8421 2.5731 15.8347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.518e+01 1.829e+00 30.17 <2e-16 ***
## family_income 4.891e-04 2.817e-05 17.36 <2e-16 ***
## locationUrban 1.551e+01 9.594e-01 16.16 <2e-16 ***
## school_funding 3.039e+00 1.801e-01 16.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 96 degrees of freedom
## Multiple R-squared: 0.9032, Adjusted R-squared: 0.9001
## F-statistic: 298.4 on 3 and 96 DF, p-value: < 2.2e-16
This example predicts dropout risk based on age, attendance rate, and past grades.
# Generate fake data for dropout risk prediction
set.seed(123)
age <- round(runif(100, 14, 18), 2)
attendance_rate <- round(runif(100, 50, 100), 2)
past_grades <- round(runif(100, 50, 100), 2)
dropout_risk <- round(0.2 * age - 0.5 * attendance_rate - 0.3 * past_grades + rnorm(100, 0, 5), 2)
data5 <- data.frame(age, attendance_rate, past_grades, dropout_risk)
# Display table
data5 %>%
head(10) %>%
kable("html", caption = "Predicting Student Dropout Risk") %>%
kable_styling(full_width = FALSE)
age | attendance_rate | past_grades | dropout_risk |
---|---|---|---|
15.15 | 80.00 | 61.94 | -51.61 |
17.15 | 66.64 | 98.12 | -55.48 |
15.64 | 74.43 | 80.07 | -56.45 |
17.53 | 97.72 | 75.75 | -73.12 |
17.76 | 74.15 | 70.13 | -55.16 |
14.18 | 94.52 | 94.01 | -74.03 |
16.11 | 95.72 | 68.20 | -62.28 |
17.57 | 80.44 | 64.41 | -57.89 |
16.21 | 70.53 | 58.53 | -44.70 |
15.83 | 57.35 | 58.61 | -44.96 |
The following is the code for fitting a regression model:
# Fit regression model
model5 <- lm(dropout_risk ~ age + attendance_rate + past_grades, data = data5)
summary(model5)
##
## Call:
## lm(formula = dropout_risk ~ age + attendance_rate + past_grades,
## data = data5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7261 -3.1419 -0.6266 2.8062 14.5070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.32845 8.01034 -0.665 0.508
## age 0.14421 0.41296 0.349 0.728
## attendance_rate -0.43060 0.03581 -12.023 < 2e-16 ***
## past_grades -0.28597 0.03205 -8.924 3.04e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.653 on 96 degrees of freedom
## Multiple R-squared: 0.6831, Adjusted R-squared: 0.6732
## F-statistic: 68.99 on 3 and 96 DF, p-value: < 2.2e-16
Logistic regression is a widely used statistical technique in many fields for modeling categorical outcomes based on one or more predictor variables.
We focus on the case where the outcome is binary (e.g., yes/no, success/failure, employed/unemployed).
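For reference, the binary logistic regression model is usually written on the log-odds scale. The notation below reuses the \(\beta\) parameters of the regression section; \(p\) denotes the probability that the outcome equals 1.

\[\log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1\cdot x_1+\beta_2\cdot x_2+\cdots+\beta_k\cdot x_k, \qquad p = \frac{1}{1+e^{-(\beta_0+\beta_1\cdot x_1+\beta_2\cdot x_2+\cdots+\beta_k\cdot x_k)}}\]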
We demonstrate some applications.
We analyze factors that predict whether a person will vote in an election (e.g., based on age, income, education level).
Suppose a social scientist wants to investigate the likelihood of voting (yes or no) based on three predictors: age, income level, and education level. This could reveal how these factors impact an individual’s likelihood of participating in elections.
Here are the data we consider:
age | income | education | vote |
---|---|---|---|
34 | 5 | 3 | 1 |
38 | 6 | 3 | 1 |
56 | 6 | 3 | 1 |
41 | 5 | 1 | 1 |
41 | 5 | 1 | 0 |
57 | 6 | 3 | 1 |
45 | 5 | 2 | 0 |
27 | 3 | 3 | 1 |
33 | 5 | 2 | 1 |
36 | 7 | 1 | 0 |
Let’s use R to build a logistic regression model.
Data preparation:
vote <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
age <- c(34, 38, 56, 41, 41, 57, 45, 27, 33, 36)
income <- c(5, 6, 6, 5, 5, 6, 5, 3, 5, 7)
education <- c(3, 3, 3, 1, 1, 3, 2, 3, 2, 1)
data1 = data.frame(vote, age, income, education)
Fit a logistic model:
model <- glm(vote ~ age + income + education, data = data1, family = binomial)
# View the model summary
summary(model)
##
## Call:
## glm(formula = vote ~ age + income + education, family = binomial,
## data = data1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.39470 8.82294 0.271 0.786
## age -0.06231 0.15984 -0.390 0.697
## income -0.60711 1.40856 -0.431 0.666
## education 2.25493 1.46455 1.540 0.124
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12.2173 on 9 degrees of freedom
## Residual deviance: 7.1551 on 6 degrees of freedom
## AIC: 15.155
##
## Number of Fisher Scoring iterations: 6
Interpreting Results:
Coefficients: Each coefficient represents the change in the log-odds of voting for a one-unit increase in that predictor.
Significance Levels: The Pr(>|z|) values (called p-values) indicate whether each predictor is statistically significant. A small p-value (e.g., < 0.05) suggests that the corresponding predictor has a statistically significant association with the outcome.
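The coefficient for age is negative (-0.062), which suggests that older individuals are less likely to vote, controlling for other variables.
The coefficient for income is negative (-0.607), which suggests that individuals with higher income levels are less likely to vote, controlling for other variables.
The coefficient for education is positive (2.255), which suggests that people with higher educational attainment are more likely to vote, controlling for other variables.
None of the three p-values is below 0.05, however, so with only 10 observations these associations are not statistically significant.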
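Log-odds coefficients are often easier to read after exponentiation, which turns them into odds ratios. The following sketch refits the model on the data from the preparation step and exponentiates the coefficients; the name `odds_ratios` is introduced here for illustration.

```r
# Data copied from the data preparation step above
vote <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
age <- c(34, 38, 56, 41, 41, 57, 45, 27, 33, 36)
income <- c(5, 6, 6, 5, 5, 6, 5, 3, 5, 7)
education <- c(3, 3, 3, 1, 1, 3, 2, 3, 2, 1)
data1 <- data.frame(vote, age, income, education)

model <- glm(vote ~ age + income + education, data = data1, family = binomial)

# Exponentiating a log-odds coefficient gives an odds ratio:
# above 1 means the odds of voting rise with the predictor, below 1 means they fall
odds_ratios <- exp(coef(model))
round(odds_ratios, 3)
```

For instance, the education estimate of 2.255 corresponds to an odds ratio of roughly 9.5, meaning the estimated odds of voting are multiplied by about 9.5 for each additional education level, holding age and income fixed.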
Evaluating the Model with a Confusion Matrix:
Generate predictions:
# Generate predictions
predicted_probs <- predict(model, type = "response")
predicted_probs
## 1 2 3 4 5 6 7 8
## 0.9821116 0.9588780 0.8836753 0.2808078 0.2808078 0.8771160 0.7436923 0.9965154
## 9 10
## 0.8597151 0.1366807
Convert the probabilities to binary outcomes with threshold 0.5 (classify a vote as 1 if the predicted probability is greater than 0.5, and as 0 otherwise):
# Convert probabilities to binary outcomes based on threshold 0.5
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)
predicted_classes
## 1 2 3 4 5 6 7 8 9 10
## 1 1 1 0 0 1 1 1 1 0
The confusion matrix is:
# Create the confusion matrix
actual = data1$vote
conf_matrix <- table(predicted_classes, actual)
conf_matrix
## actual
## predicted_classes 0 1
## 0 2 1
## 1 1 6
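A confusion matrix is usually summarized by a few rates. The sketch below rebuilds the matrix from the actual and predicted classes shown above and computes accuracy, sensitivity, and specificity; the variable names for the four cells are introduced here for illustration.

```r
# Actual outcomes and predicted classes, copied from the output above
actual <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
predicted_classes <- c(1, 1, 1, 0, 0, 1, 1, 1, 1, 0)
conf_matrix <- table(predicted_classes, actual)

# Rows are predictions and columns are actual values; index by label, not position
true_pos  <- conf_matrix["1", "1"]
true_neg  <- conf_matrix["0", "0"]
false_pos <- conf_matrix["1", "0"]
false_neg <- conf_matrix["0", "1"]

accuracy    <- (true_pos + true_neg) / sum(conf_matrix)  # share of correct predictions
sensitivity <- true_pos / (true_pos + false_neg)         # share of actual voters detected
specificity <- true_neg / (true_neg + false_pos)         # share of actual non-voters detected

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

All three rates depend on the chosen threshold of 0.5; a different threshold would move observations between cells.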
Evaluating the Model with a Receiver Operating Characteristic (ROC) Curve:
An ROC curve evaluates the model’s performance across all possible thresholds, plotting the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity). Instead of focusing on one threshold, the ROC curve gives a comprehensive view of the model’s ability to distinguish between classes, regardless of threshold choice.
The following generates an ROC curve:
# Install and load pROC package for ROC analysis
# install.packages("pROC")
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Generate predictions
predicted_probs <- predict(model, type = "response")
# Prepare curve
roc_curve <- roc(data1$vote, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_curve, main = "ROC Curve", col = "blue", legacy.axes = TRUE)
The area under an ROC curve (AUC):
We calculate the AUC for our example:
# Calculate AUC
auc(roc_curve) # Calculate AUC
## Area under the curve: 0.9286
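The area under the ROC curve is 0.9286, indicating very good discrimination between voters and non-voters in this data set. Because the model is evaluated on the same 10 observations used to fit it, this figure is likely optimistic.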
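The AUC also has a direct interpretation: it is the probability that a randomly chosen voter receives a higher predicted probability than a randomly chosen non-voter, with ties counted as one half. The sketch below checks this in base R using the predicted probabilities printed earlier; the names `cases`, `controls`, `pair_scores`, and `auc_manual` are introduced here for illustration.

```r
# Observed outcomes and predicted probabilities, copied from the output above
vote <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
predicted_probs <- c(0.9821116, 0.9588780, 0.8836753, 0.2808078, 0.2808078,
                     0.8771160, 0.7436923, 0.9965154, 0.8597151, 0.1366807)

cases <- predicted_probs[vote == 1]     # voters
controls <- predicted_probs[vote == 0]  # non-voters

# Compare every voter with every non-voter:
# score 1 if the voter's probability is higher, 0.5 for a tie, 0 otherwise
pair_scores <- outer(cases, controls, function(p_case, p_control) {
  (p_case > p_control) + 0.5 * (p_case == p_control)
})

# The AUC is the average score over all voter/non-voter pairs
auc_manual <- mean(pair_scores)
auc_manual
```

This pairwise view makes clear that the AUC measures how well the model ranks observations, independent of any single classification threshold.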
We predict whether someone is employed based on factors like education, work experience, and age.
age | education | experience | employed |
---|---|---|---|
25 | 1 | 2 | 0 |
40 | 2 | 10 | 1 |
30 | 2 | 5 | 1 |
22 | 1 | 1 | 0 |
50 | 3 | 20 | 1 |
60 | 3 | 30 | 1 |
35 | 2 | 10 | 1 |
28 | 2 | 4 | 0 |
45 | 3 | 15 | 1 |
38 | 2 | 12 | 1 |
We estimate the likelihood of health issues like obesity or diabetes based on lifestyle factors.
age | diet_quality | exercise_hours | health_issue |
---|---|---|---|
25 | 2 | 1 | 0 |
40 | 4 | 0 | 1 |
35 | 3 | 2 | 0 |
50 | 2 | 0 | 1 |
45 | 4 | 3 | 1 |
60 | 1 | 0 | 1 |
30 | 3 | 1 | 0 |
55 | 2 | 0 | 1 |
48 | 3 | 4 | 1 |
33 | 4 | 2 | 0 |
We model the probability of finishing college or dropping out based on socioeconomic factors.
highschool_performance | parent_income | parent_education | graduated_college |
---|---|---|---|
85 | 30 | 3 | 1 |
92 | 60 | 4 | 1 |
78 | 50 | 2 | 0 |
70 | 40 | 1 | 0 |
88 | 80 | 5 | 1 |
95 | 100 | 5 | 1 |
80 | 55 | 3 | 1 |
72 | 45 | 2 | 0 |
85 | 70 | 4 | 1 |
90 | 90 | 3 | 1 |
Conclusion:
Logistic regression is a powerful tool in social sciences, allowing researchers to examine how various factors influence the probability of certain outcomes. Through the interpretation of coefficients, odds ratios, and model fit metrics, logistic regression provides insights into the likelihood and predictors of events, enabling data-driven conclusions in areas like voting behavior, health, and education.
We will use the NHANES data from the website: https://www.lock5stat.com/datapage3e.html.
## 1. Introduction
The growing interest in organic food has raised questions about the factors influencing consumer purchasing behavior. Among these factors, health is often cited as a key motivator. Previous studies have suggested that individuals who self-rate their health as good or excellent are more likely to purchase organic foods. This study aims to explore this relationship in greater detail, using a dataset with 4716 observations on consumer health, organic food purchasing behavior, and income.
This research examines:
1. Whether there is a significant relationship between self-rated health and the likelihood of purchasing organic food.
2. How income influences the decision to buy organic food.
3. The potential differences in health status when health is classified into binary categories (Good/Very good/Excellent vs. Poor/Fair).
By analyzing these variables, this study provides insights into the socio-demographic factors that may drive organic food consumption.
## 2. Literature Review
Consumer behavior related to organic food purchasing has been the subject of numerous studies. Many researchers have found that individuals who perceive themselves to be in good health are more likely to buy organic products, as they often associate organic foods with better health outcomes. Furthermore, income has been identified as a crucial factor, with higher-income consumers more likely to afford organic foods.
Recent literature (e.g., Smith et al., 2020; Williams, 2019) highlights the importance of self-rated health as an indicator of food choices. However, most studies have not examined how income and health status together influence organic food purchasing behavior across different health categories. This study fills that gap by using a large dataset that includes health and income data for a diverse sample of consumers.
## 3. Methodology
This study analyzes a dataset with 4716 observations on five variables:
### 3.1 Data Loading and Exploration
# Load necessary libraries
library(ggplot2)
library(DT)
# Load your dataset
health_data <- read.csv("https://www.lock5stat.com/datasets3e/NHANES.csv")
# Print data
datatable(health_data)
## 4. Results
### 4.1 Descriptive Statistics
# Frequency table for categorical variables
table(health_data$Organic)
table(health_data$Health)
table(health_data$HealthBinary)
# Calculate the mean and median of the Income column.
# Find the 5-number summary (minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum) for the Income variable.
# Create a histogram comparing income by organic.
# Create a boxplot comparing income by organic.
# Create a histogram comparing income by health status.
# Create a boxplot comparing income by health status.
# Make a barplot of HealthBinary by Organic. Do people who buy organic foods tend to be healthier?
# Plot a scatterplot of health vs. income. You need to set Poor = 0, Fair = 1, Good = 2, Very good = 3, and Excellent = 4.
# Calculate the correlation coefficient between health and income. Interpret the correlation coefficient and discuss what it suggests about the relationship.
### 4.2 Two-Sample t-Test
Perform a two-sample t-test to see if there’s a significant difference in MathScore between males and females.
### 4.3 Multiple Regression
Run a multiple regression analysis with Income as the dependent variable and Organic and Health as independent variables. Interpret the coefficients of each predictor. For example, does Organic significantly predict Income?
### 4.4 Logistic Regression
A logistic regression model was fitted to predict the likelihood of purchasing organic food based on income and health status.
# Before applying logistic regression, we need to create a binary response variable coded as 0 and 1.
health_data$Organic = ifelse(health_data$Organic == "Yes", 1, 0)
# Logistic regression to predict organic purchase based on Health and Income
logit_model <- glm(Organic ~ HealthBinary + Income, data = health_data, family = "binomial")
summary(logit_model)
Interpretation of results (just an example):
## 5. Discussion
The results of this study show that self-rated health and income are significant predictors of organic food purchasing behavior. Those who rate their health as Excellent or Very good are more likely to buy organic foods, suggesting that health-conscious consumers view organic food as beneficial to their well-being. Additionally, income plays a crucial role, as higher-income individuals are more able to afford the premium prices of organic products.
The study also found that when health is classified into binary categories, the results align with previous literature showing that individuals in better health are more likely to make health-conscious food choices. However, the use of binary health classification may oversimplify the nuances of health perceptions, and future research might benefit from a more detailed analysis of health variables.
Limitations:
The study relies on self-reported health data, which may be biased or inaccurate. Income was collected as a categorical variable, which limits the precision of the analysis. The dataset is cross-sectional, so causality cannot be inferred.
## 6. Conclusion
This study contributes to the understanding of consumer behavior by highlighting the significant role of self-rated health and income in the decision to purchase organic food. The findings suggest that health-conscious individuals, particularly those in better health, are more likely to buy organic products, and that income significantly influences this decision. These insights can help marketers, policymakers, and health professionals better understand the factors influencing organic food consumption.
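“School resources” refers to the materials, facilities, personnel, and support systems within a school that support student learning and development. These resources have an important impact on students’ academic performance, well-being, and overall school experience. The following are some common types of school resources:
Qualified and experienced teachers, counselors, teaching assistants, and support staff contribute to effective teaching and student support. Smaller class sizes and specialized support staff (such as special education teachers and language specialists) are often markers of high-quality school resources.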