Statistical Foundations for Data Science

Foundational Mathematics in Data Science

Useful Algebra Formulas

We will refer to the formulas in https://www.cuemath.com/algebra/algebraic-formulas/. Typical ones are listed below:

1. Difference of squares: $a^2 - b^2 = (a-b)(a+b)$
1. Square of sum: $(a+b)^2 = a^2 + 2ab + b^2$
1. Square of difference: $(a-b)^2 = a^2 - 2ab + b^2$
1. Difference of cubes: $a^3-b^3 = (a-b)(a^2 + ab + b^2)$
1. Sum of cubes: $a^3+b^3 = (a+b)(a^2 - ab + b^2)$
1. Product of powers: $a^m \cdot a^n = a^{m + n}$
1. Quotient of powers: $\frac{a^m}{a^n} = a^{m - n}$
1. Power of power: $(a^m)^n = a^{m\cdot n}$
1. Power of products: $(a\cdot b)^m = a^m\cdot b^m$
1. Power of zero: $a^0 = 1$ (for $a \neq 0$)
1. Negative power: $a^{-m} = \frac{1}{a^m}$ (for $a \neq 0$)
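
The identities above can be spot-checked numerically. The following is a minimal sketch in R; the values of `a`, `b`, `m`, and `n` are arbitrary choices made for illustration. A numerical check is not a proof, but it is a quick way to catch a misremembered formula.

```r
# Arbitrary test values (chosen for illustration only)
a <- 7; b <- 3; m <- 4; n <- 2

# Each entry compares the left-hand side of an identity with its right-hand side
checks <- c(
  difference_of_squares = isTRUE(all.equal(a^2 - b^2, (a - b) * (a + b))),
  square_of_sum         = isTRUE(all.equal((a + b)^2, a^2 + 2 * a * b + b^2)),
  difference_of_cubes   = isTRUE(all.equal(a^3 - b^3, (a - b) * (a^2 + a * b + b^2))),
  sum_of_cubes          = isTRUE(all.equal(a^3 + b^3, (a + b) * (a^2 - a * b + b^2))),
  product_of_powers     = isTRUE(all.equal(a^m * a^n, a^(m + n))),
  negative_power        = isTRUE(all.equal(a^(-m), 1 / a^m))
)
print(checks)
```

Changing the test values and rerunning the block is an easy way to build confidence in each formula.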

Other Formulas

A complete list of formulas: https://www.mga.edu/computing/mathematics-statistics/docs/Math_Resources_Algebra_Formulas.pdf

Exercise 1

Factor $x^2-25y^2$ and $16x^2-1$.
Factor $x^3-8y^3$, $x^3+8y^3$, and $8x^3 + 27$.
Factor $1-27y^3$ and $1+27y^3$.
Expand $(2x-3y)^2$.
Evaluate $97^2-3^2$ and $297 \times 303$ without paper.
Find the roots of the quadratic equation: $x^2+7x+12=0$.
Find the roots of the quadratic equation $2x^2+5x+3=0$.
Simplify the expression: $(x^{-9}y^3)/(x^{-7}y^8)$ so that the answer has no negative exponents.
Expand the logarithm: $\log_5(x^2y^3z)$.
Evaluate $\log_4 64$.

Exercise 2: SAT Math

A 28-page SAT math practice test: https://focusonlearningcenter.com/wp-content/uploads/2020/07/SAT-Practice-Test-1-Math-tests-.pdf

New SAT Preparation

If you can work through all 450 pages of the following book, you should have no problem with high school math.

https://www.gpsd.us/cms/lib/NJ01000249/Centricity/Domain/135/Acing%20the%20New%20SAT%20Math%20PDF%20Book.pdf

Random Sampling vs Random Assignment

Random Sampling and Random Assignment are two key concepts in research design, particularly in experimental and survey research, and they serve different purposes.

Random Sampling

Definition: Random sampling is a method of selecting individuals from a population so that each person has an equal chance of being chosen. It is used to create a representative sample from a larger population.
Purpose: The goal of random sampling is to generalize findings from the sample to the entire population. By using random sampling, researchers can reduce sampling bias and improve the representativeness of the sample.
Example: Suppose a researcher wants to understand public opinion on climate change. Using random sampling, they select 1,000 individuals from the entire population of a city. This method ensures that each person in the population has an equal chance of being selected.

Random Assignment

Definition: Random assignment is the process of assigning participants to different groups (e.g., treatment and control groups) in a study, with each participant having an equal chance of being placed in any group.
Purpose: The main purpose of random assignment is to create comparable groups so that differences in outcomes can be attributed to the treatment rather than pre-existing differences. This is key to establishing causality in experimental research.
Example: In a study testing a new educational program, students are randomly assigned to either the program group (treatment) or the regular class (control). Random assignment ensures that differences in outcomes can be attributed to the program itself.

Key Differences

Feature	Random Sampling	Random Assignment
Goal	To obtain a representative sample	To create equivalent groups for comparison
Purpose	Supports generalizability to the population	Supports causal inferences in an experiment
Application	Common in survey research	Common in experimental research
Reduces	Sampling bias	Confounding variables
Outcome	Representative sample	Comparable groups
Example	Surveying a randomly selected portion of a city	Randomly placing students into treatment groups

In Summary

Random Sampling helps generalize findings from a sample to a population.
Random Assignment helps establish cause-and-effect by ensuring that groups in an experiment are comparable.

In an ideal experimental study aiming for both generalizability and causality, researchers would use random sampling to select participants and random assignment to assign them to different groups.
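
The two steps can be sketched in R with the base function `sample()`. The population size, sample size, group labels, and seed below are hypothetical values chosen for illustration only.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 5,000 residents, identified by ID only
population_ids <- 1:5000

# Random sampling: draw 100 residents, each with an equal chance of selection
sample_ids <- sample(population_ids, size = 100)

# Random assignment: shuffle 50 "treatment" and 50 "control" labels
group <- sample(rep(c("treatment", "control"), each = 50))

study <- data.frame(id = sample_ids, group = group)
table(study$group)
head(study)
```

The first call to `sample()` decides who is in the study (supporting generalization), while the second decides which condition each participant receives (supporting causal comparison).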

The t-Test

In educational research, t-tests are commonly used to analyze differences between two groups or two conditions. Here are a few examples:

Example 1. Comparing Test Scores Between Two Classes

An educator wants to compare the final exam scores of two different classes taught by different teachers to see if there is a significant difference in performance.

Test whether there is a significant difference in the mean scores of the two classes.

Let’s say we have random samples of final exam scores from the two classes (Class A and Class B). The data are as follows:

Class A: 85, 78, 90, 88, 76, 80, 82, 95, 89, 84
Class B: 78, 74, 69, 85, 80, 77, 83, 72, 76, 75, 76

# Data for two independent groups (Class A and Class B)
class_A_scores <- c(85, 78, 90, 88, 76, 80, 82, 95, 89, 84)
class_B_scores <- c(78, 74, 69, 85, 80, 77, 83, 72, 76, 75, 76)

# Conduct an independent samples t-test
t_test_result <- t.test(class_A_scores, class_B_scores, alternative = "two.sided")

# Print the result
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  class_A_scores and class_B_scores
## t = 3.3817, df = 17.051, p-value = 0.003533
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.965608 12.798029
## sample estimates:
## mean of x mean of y 
##  84.70000  76.81818

The results from the Welch Two Sample t-test can be interpreted as follows:

t-statistic = 3.3817: This is the test statistic, indicating the ratio of the difference in sample means to the standard error of the difference. A larger absolute t-statistic generally indicates a greater difference between the groups.
degrees of freedom (df) = 17.051: This represents the number of independent values used in the calculation of the test statistic, adjusting for unequal variances between the two groups (this is why the Welch correction is used).
p-value = 0.003533: This value tells us the probability of observing the data, or something more extreme, under the assumption that there is no difference between the groups. A p-value less than 0.05 (the typical threshold for significance) indicates that there is a statistically significant difference between the means of Class A and Class B. This is a statistical inference method using hypothesis testing.
95% Confidence interval = [2.97, 12.80]: This is the range of values within which we are 95% confident that the true difference in means (Class A minus Class B) lies. It indicates that the mean score of Class A is between 2.97 and 12.80 points higher than that of Class B. Because the interval does not contain 0, it agrees with the hypothesis test. This is a statistical inference method using a confidence interval.

Conclusion:

There is a statistically significant difference in the test scores between Class A and Class B, with Class A scoring higher on average. The results suggest that the teaching method or conditions in Class A may have contributed to this higher performance.
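
To make the reported t-statistic and degrees of freedom less of a black box, the following sketch recomputes them by hand from the Class A and Class B scores. It assumes the standard Welch formulas (standard error from the two sample variances, and the Welch-Satterthwaite approximation for the degrees of freedom); the variable names other than the two score vectors are introduced here for illustration.

```r
class_A_scores <- c(85, 78, 90, 88, 76, 80, 82, 95, 89, 84)
class_B_scores <- c(78, 74, 69, 85, 80, 77, 83, 72, 76, 75, 76)

n_A <- length(class_A_scores)
n_B <- length(class_B_scores)

# Squared standard error contributed by each group
se2_A <- var(class_A_scores) / n_A
se2_B <- var(class_B_scores) / n_B

# Welch t statistic: difference in means divided by its standard error
t_stat <- (mean(class_A_scores) - mean(class_B_scores)) / sqrt(se2_A + se2_B)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_A + se2_B)^2 / (se2_A^2 / (n_A - 1) + se2_B^2 / (n_B - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

The three values should agree with the `t.test` output shown above up to rounding.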

Example 2. Evaluating the Effect of a New Teaching Method

A school introduces a new teaching method and wants to see if it improves student performance. They test the same group of students before and after implementing the method.

Test whether there is an improvement in scores after implementing the new teaching method.

Suppose the same group of students took a test before and after using a new teaching method. Here’s the data:

scores_before: 70, 75, 80, 65, 78, 74, 72, 68, 77, 73
scores_after: 68, 80, 78, 75, 75, 78, 70, 74, 76, 77

# Data for the same group of students before and after the new method
scores_before <- c(70, 75, 80, 65, 78, 74, 72, 68, 77, 73)
scores_after <- c(68, 80, 78, 75, 75, 78, 70, 74, 76, 77)

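# Conduct a paired samples t-test
# Differences are computed as before - after, so an improvement
# corresponds to a negative mean difference (alternative = "less")
t_test_result <- t.test(scores_before, scores_after, paired = TRUE, alternative = "less")

# Print results
t_test_result

## 
##  Paired t-test
## 
## data:  scores_before and scores_after
## t = -1.3476, df = 9, p-value = 0.1054
## alternative hypothesis: true mean difference is less than 0
## 95 percent confidence interval:
##       -Inf 0.6844798
## sample estimates:
## mean difference 
##            -1.9

The p-value of 0.1054 is greater than 0.05, so although scores rose by 1.9 points on average, these data do not provide statistically significant evidence that the new teaching method improved scores.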

Example 3. Gender Differences in Academic Performance

A researcher investigates if there is a significant difference between the average grades of male and female students in a particular subject, such as mathematics.

Test whether there is a significant difference in the average grades of male and female students.

Let’s assume we have math scores from male and female students:

male: 82, 78, 84, 79, 85, 88, 75, 80, 77, 83
female: 90, 85, 88, 87, 89, 92, 85, 86, 88, 90

# Data for two independent groups (males and females)
male_scores <- c(82, 78, 84, 79, 85, 88, 75, 80, 77, 83)
female_scores <- c(90, 85, 88, 87, 89, 92, 85, 86, 88, 90)

# Conduct an independent samples t-test
t_test_result <- t.test(male_scores, female_scores, alternative = "two.sided")
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  male_scores and female_scores
## t = -4.7131, df = 14.373, p-value = 0.0003105
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.032365  -3.767635
## sample estimates:
## mean of x mean of y 
##      81.1      88.0

Example 4. Impact of Online vs. Traditional Learning

A school wants to determine if students in online courses perform differently than those in traditional classroom settings.

Test whether there is a difference in performance between online and traditional students.

Imagine a study comparing final grades of students in online vs. traditional classroom settings. Data are:

online: 75, 78, 72, 70, 76, 74, 73, 77, 79, 80
traditional: 85, 87, 82, 88, 84, 86, 83, 89, 85, 90

# Data for two independent groups (online vs. traditional)
online_scores <- c(75, 78, 72, 70, 76, 74, 73, 77, 79, 80)
traditional_scores <- c(85, 87, 82, 88, 84, 86, 83, 89, 85, 90)

# Conduct an independent samples t-test
t_test_result <- t.test(online_scores, traditional_scores, alternative = "two.sided")
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  online_scores and traditional_scores
## t = -8.0452, df = 17.271, p-value = 3.026e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.250273  -7.749727
## sample estimates:
## mean of x mean of y 
##      75.4      85.9

Example 5. Assessing Tutoring Program Effectiveness

A tutoring program is introduced for students who are struggling. The researcher wants to see if the program significantly improves their test scores.

Test whether the tutoring program results in a significant improvement in scores.

A group of students’ test scores before and after participating in a tutoring program are given below. Data are:

scores_before_tutoring: 60, 62, 65, 61, 63, 60, 64, 66, 67, 65
scores_after_tutoring: 64, 65, 63, 67, 65, 58, 69, 71, 75, 70

# Data for the same group of students before and after tutoring
scores_before_tutoring <- c(60, 62, 65, 61, 63, 60, 64, 66, 67, 65)
scores_after_tutoring <- c(64, 65, 63, 67, 65, 58, 69, 71, 75, 70)

# Conduct a paired samples t-test
t_test_result <- t.test(scores_before_tutoring, scores_after_tutoring, paired = TRUE, alternative = "less")
t_test_result

## 
##  Paired t-test
## 
## data:  scores_before_tutoring and scores_after_tutoring
## t = -3.2852, df = 9, p-value = 0.004725
## alternative hypothesis: true mean difference is less than 0
## 95 percent confidence interval:
##       -Inf -1.502829
## sample estimates:
## mean difference 
##            -3.4
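
A paired t-test is equivalent to a one-sample t-test on the within-student differences. The sketch below illustrates this with the tutoring data; the names `differences`, `paired_result`, and `one_sample_result` are introduced here for illustration.

```r
scores_before_tutoring <- c(60, 62, 65, 61, 63, 60, 64, 66, 67, 65)
scores_after_tutoring <- c(64, 65, 63, 67, 65, 58, 69, 71, 75, 70)

# Per-student change (before minus after, matching the order used above)
differences <- scores_before_tutoring - scores_after_tutoring

# Paired test, as in the example above
paired_result <- t.test(scores_before_tutoring, scores_after_tutoring,
                        paired = TRUE, alternative = "less")

# One-sample test of whether the mean difference is below zero
one_sample_result <- t.test(differences, mu = 0, alternative = "less")

c(paired = unname(paired_result$statistic),
  one_sample = unname(one_sample_result$statistic))
```

Both calls should report the same t-statistic and p-value, which is why the order of the two arguments (and hence the sign of the differences) determines whether `"less"` or `"greater"` is the right alternative.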

Each of these examples involves different scenarios in education but highlights how a t-test can help measure the effectiveness or difference of various educational approaches, interventions, and demographic factors.

Regression in Education

Regression analysis is a statistical technique used to explore the relationships between variables and to make predictions. In educational research, regression is often applied to examine how various factors influence outcomes like student performance, motivation, or graduation rates.

The general regression model takes the form:

\[y = \beta_0+\beta_1\cdot x_1+\beta_2\cdot x_2+\beta_3\cdot x_3+\cdots+\beta_k\cdot x_k+\epsilon\] where $\beta_0, \beta_1, \beta_2, \beta_3, ..., \beta_k$ are called parameters and $\epsilon$ is the error term.

The tasks of regression are to

estimate the parameters,
test whether each x variable is significant,
check whether the regression model is adequate.

A regression model can be used to predict the y value for a given set of x values.
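
As a sketch of what estimating the parameters means in practice: the `lm` function used in the case studies fits the model by ordinary least squares, which chooses the estimates that minimize the sum of squared residuals over the $n$ observations. The observation index $i$ and the hat notation for estimates are introduced here for illustration.

```latex
\[
(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k)
  = \arg\min_{\beta_0, \ldots, \beta_k}
    \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik}\bigr)^2,
\qquad
\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k .
\]
```

The second expression is the prediction rule: plug a given set of x values into the fitted equation to obtain the predicted y value.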

We demonstrate regression through case studies.

Example 1. Predicting Student Grades

This example predicts student grades based on study hours, parental education level, and school resources (see the appendix). The data are given below:

Predicting Student Grades
study_hours	parent_education	school_resources	grades
5.75	4	4.98	95.10
15.77	1	3.62	99.69
8.18	2	3.83	84.18
17.66	3	3.18	105.42
18.81	5	3.38	116.61
0.91	3	2.16	72.39
10.56	3	1.59	87.08
17.85	1	4.85	105.55
11.03	4	4.61	106.33
9.13	1	3.76	86.17

We fit the following regression model:

\[\text{grades} = \beta_0+\beta_1\cdot \text{study_hours}+\beta_2\cdot \text{parent_education}+\beta_3\cdot \text{school_resources}+\epsilon\]

The following is the code for data preparation:

study_hours <- c(5.75, 15.77, 8.18, 17.66, 18.81, 0.91, 10.56, 17.85, 11.03, 9.13)
parent_education <- c(4, 1, 2, 3, 5, 3, 3, 1, 4, 1)
school_resources <- c(4.98, 3.62, 3.83, 3.18, 3.38, 2.16, 1.59, 4.85, 4.61, 3.76)
grades <- c(95.1, 99.69, 84.18, 105.42, 116.61, 72.39, 87.08, 105.55, 106.33, 86.17)

# Combine the vectors into the data frame used by lm() below
data1 <- data.frame(study_hours, parent_education, school_resources, grades)

The following is the code for fitting a regression model:

# Fit regression model
model1 <- lm(grades ~ study_hours + parent_education + school_resources, data = data1)
summary(model1)

## 
## Call:
## lm(formula = grades ~ study_hours + parent_education + school_resources, 
##     data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7115 -0.4977 -0.2354  0.7710  2.5086 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       49.9194     3.1349  15.924 3.89e-06 ***
## study_hours        1.8479     0.1264  14.615 6.44e-06 ***
## parent_education   3.7726     0.5173   7.293 0.000339 ***
## school_resources   3.9977     0.6783   5.894 0.001059 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.191 on 6 degrees of freedom
## Multiple R-squared:  0.982,  Adjusted R-squared:  0.973 
## F-statistic:   109 on 3 and 6 DF,  p-value: 1.27e-05

Summary of Results:

Intercept: The intercept estimate is 49.92, which is the expected grade when all predictors (study hours, parent education, and school resources) are zero. It serves as a baseline score.
Study Hours: The coefficient for study hours is 1.85, meaning that each additional hour of study is associated with an average increase of 1.85 points in grades, holding the other variables constant. This effect is statistically significant at the 0.05 level ($p \approx 6.4 \times 10^{-6}$).
Parent Education: The coefficient for parent education level is 3.77, indicating that each one-level increase in parent education is associated with a 3.77-point increase in grades, holding the other variables constant. This predictor is also statistically significant at the 0.05 level ($p \approx 0.0003$).
School Resources: The coefficient for school resources is 4.00, suggesting that each one-unit increase in school resources is associated with a 4-point increase in grades, holding the other variables constant. This is statistically significant at the 0.05 level ($p \approx 0.001$).

Model Fit:

Residual Standard Error: The residual standard error of 2.19 indicates the average deviation of observed values from the fitted values.
R-squared: The $R^2$ value is 0.982, suggesting that approximately 98.2% of the variability in the outcome variable is explained by the model. The adjusted $R^2$ of 0.973 confirms that this high explanatory power holds after adjusting for the number of predictors.
F-statistic: The F-statistic of 109, with a $p$-value of $1.27\cdot 10^{-5}$ or 0.0000127, indicates that the model as a whole is statistically significant at the significance level 0.05.
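
As a worked check of the adjusted $R^2$, the standard adjustment formula can be applied with $n = 10$ observations and $k = 3$ predictors (the symbols $n$ and $k$ are introduced here for illustration):

```latex
\[
R^2_{\text{adj}}
  = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
  = 1 - (1 - 0.982)\cdot\frac{10 - 1}{10 - 3 - 1}
  = 1 - 0.018 \cdot 1.5
  = 0.973 .
\]
```

The denominator $n - k - 1 = 6$ is the same residual degrees of freedom reported alongside the residual standard error.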

Conclusion:

This regression model shows that study hours, parent education, and school resources are each significantly and positively associated with student grades. With only 10 observations and no random assignment, these associations should not be read as proof of causal effects.
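
A fitted model can also be used for prediction, as mentioned earlier. The following sketch refits the model on the ten observations and predicts the grade of a hypothetical new student; the names `grades_data`, `grades_model`, and `new_student`, and the new student's values, are assumptions made for illustration.

```r
study_hours <- c(5.75, 15.77, 8.18, 17.66, 18.81, 0.91, 10.56, 17.85, 11.03, 9.13)
parent_education <- c(4, 1, 2, 3, 5, 3, 3, 1, 4, 1)
school_resources <- c(4.98, 3.62, 3.83, 3.18, 3.38, 2.16, 1.59, 4.85, 4.61, 3.76)
grades <- c(95.1, 99.69, 84.18, 105.42, 116.61, 72.39, 87.08, 105.55, 106.33, 86.17)
grades_data <- data.frame(study_hours, parent_education, school_resources, grades)

grades_model <- lm(grades ~ study_hours + parent_education + school_resources,
                   data = grades_data)

# Hypothetical new student (values chosen for illustration only)
new_student <- data.frame(study_hours = 10, parent_education = 3,
                          school_resources = 3.5)

# Point prediction together with a 95% prediction interval
predicted_grade <- predict(grades_model, newdata = new_student,
                           interval = "prediction")
predicted_grade
```

The `fit` column is the point prediction obtained by plugging the new values into the fitted equation; the `lwr` and `upr` columns bound the range in which an individual student's grade is expected to fall.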

Example 2. Evaluating the Effectiveness of Teaching Methods

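We compare traditional and online teaching methods by predicting scores from teaching method and engagement hours. The table shows only the first 10 rows of the data; the model output below is based on 100 observations (97 residual degrees of freedom).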

Evaluating the Effectiveness of Teaching Methods
teaching_method	engagement_hours	scores
Traditional	9.40	89.65
Traditional	5.66	83.26
Traditional	7.84	87.29
Online	14.36	111.34
Traditional	7.76	83.52
Online	13.46	110.15
Online	13.80	107.48
Online	9.52	90.22
Traditional	6.75	83.35
Traditional	3.06	78.77

The following is the code for fitting a regression model:

# Fit regression model
model2 <- lm(scores ~ teaching_method + engagement_hours, data = data2)
summary(model2)

## 
## Call:
## lm(formula = scores ~ teaching_method + engagement_hours, data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6034 -3.5635 -0.0302  2.9686 15.9185 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 68.2365     1.3411  50.880  < 2e-16 ***
## teaching_methodTraditional  -3.0180     0.9706  -3.109  0.00246 ** 
## engagement_hours             3.0117     0.1308  23.029  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.783 on 97 degrees of freedom
## Multiple R-squared:  0.8521, Adjusted R-squared:  0.8491 
## F-statistic: 279.5 on 2 and 97 DF,  p-value: < 2.2e-16
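
The coefficient name `teaching_methodTraditional` comes from the way R encodes a categorical predictor as a 0/1 indicator (dummy) variable. The sketch below illustrates the encoding using only the ten `teaching_method` values shown in the table; the name `design` is introduced here for illustration.

```r
# The ten teaching_method values displayed in the table above
teaching_method <- factor(c("Traditional", "Traditional", "Traditional", "Online",
                            "Traditional", "Online", "Online", "Online",
                            "Traditional", "Traditional"))

# Levels are sorted alphabetically, so "Online" is the baseline category
levels(teaching_method)

# Design matrix: an intercept plus a 0/1 indicator for "Traditional"
design <- model.matrix(~ teaching_method)
head(design)
```

Under this coding, the estimate of -3.02 in the output above means that, at the same number of engagement hours, the Traditional group is predicted to score about 3 points lower than the Online (baseline) group.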

Example 3. Analyzing Factors Affecting Student Behavior

This example examines how factors such as attendance, family support, and attitude relate to classroom behavior scores.

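# Load the packages used to display the tables (dplyr provides %>%)
library(dplyr)
library(knitr)
library(kableExtra)

# Generate fake data for factors affecting behavior
set.seed(123)
attendance <- round(runif(100, 50, 100))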
family_support <- sample(1:5, 100, replace = TRUE)
attitude <- round(runif(100, 1, 5), 2)
behavior_score <- round(50 + 0.5 * attendance + 4 * family_support + 2 * attitude + rnorm(100, 0, 5), 2)
data3 <- data.frame(attendance, family_support, attitude, behavior_score)

# Display table
data3 %>%
  head(10) %>%
  kable("html", caption = "Analyzing Factors Affecting Student Behavior") %>%
  kable_styling(full_width = FALSE)

Analyzing Factors Affecting Student Behavior
attendance	family_support	attitude	behavior_score
64	1	4.03	95.13
89	4	1.55	111.98
70	1	2.59	94.65
94	1	1.90	100.32
97	3	1.23	106.41
52	4	2.58	107.15
76	1	1.26	97.52
95	3	1.90	107.04
78	5	1.22	108.38
73	3	3.68	99.93

The following is the code for fitting a regression model:

# Fit regression model
model3 <- lm(behavior_score ~ attendance + family_support + attitude, data = data3)
summary(model3)

## 
## Call:
## lm(formula = behavior_score ~ attendance + family_support + attitude, 
##     data = data3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4125 -3.0719 -0.6104  2.4503 11.6553 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    48.24158    3.11026  15.510  < 2e-16 ***
## attendance      0.49635    0.03297  15.053  < 2e-16 ***
## family_support  4.56839    0.32770  13.941  < 2e-16 ***
## attitude        2.14325    0.39257   5.459 3.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.658 on 96 degrees of freedom
## Multiple R-squared:  0.8044, Adjusted R-squared:  0.7983 
## F-statistic: 131.6 on 3 and 96 DF,  p-value: < 2.2e-16

Example 4. Exploring Educational Equity Issues

We investigate if family income, location, and school funding affect math scores.

# Generate fake data for educational equity issues
set.seed(123)
family_income <- round(runif(100, 20000, 80000), 2)
location <- sample(c("Urban", "Rural"), 100, replace = TRUE)
school_funding <- round(runif(100, 1, 10), 2)
math_score <- round(60 + 0.0005 * family_income + ifelse(location == "Urban", 10, -5) + 3 * school_funding + rnorm(100, 0, 5), 2)
data4 <- data.frame(family_income, location, school_funding, math_score)

# Display table
data4 %>%
  head(10) %>%
  kable("html", caption = "Exploring Educational Equity Issues") %>%
  kable_styling(full_width = FALSE)

Exploring Educational Equity Issues
family_income	location	school_funding	math_score
37254.65	Urban	3.15	102.02
67298.31	Rural	9.66	121.47
44538.62	Rural	6.41	98.16
72981.04	Urban	5.64	118.37
76428.04	Rural	4.62	106.48
22733.39	Rural	8.92	91.72
51686.33	Urban	4.28	111.50
73545.14	Urban	3.59	115.68
53086.10	Urban	2.54	109.05
47396.88	Rural	2.55	84.48

The following is the code for fitting a regression model:

# Fit regression model
model4 <- lm(math_score ~ family_income + location + school_funding, data = data4)
summary(model4)

## 
## Call:
## lm(formula = math_score ~ family_income + location + school_funding, 
##     data = data4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8177 -3.3978 -0.8421  2.5731 15.8347 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.518e+01  1.829e+00   30.17   <2e-16 ***
## family_income  4.891e-04  2.817e-05   17.36   <2e-16 ***
## locationUrban  1.551e+01  9.594e-01   16.16   <2e-16 ***
## school_funding 3.039e+00  1.801e-01   16.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 96 degrees of freedom
## Multiple R-squared:  0.9032, Adjusted R-squared:  0.9001 
## F-statistic: 298.4 on 3 and 96 DF,  p-value: < 2.2e-16

Example 5. Predicting Student Dropout Risk

This example predicts dropout risk based on age, attendance rate, and past grades.

# Generate fake data for dropout risk prediction
set.seed(123)
age <- round(runif(100, 14, 18), 2)
attendance_rate <- round(runif(100, 50, 100), 2)
past_grades <- round(runif(100, 50, 100), 2)
dropout_risk <- round(0.2 * age - 0.5 * attendance_rate - 0.3 * past_grades + rnorm(100, 0, 5), 2)
data5 <- data.frame(age, attendance_rate, past_grades, dropout_risk)

# Display table
data5 %>%
  head(10) %>%
  kable("html", caption = "Predicting Student Dropout Risk") %>%
  kable_styling(full_width = FALSE)

Predicting Student Dropout Risk
age	attendance_rate	past_grades	dropout_risk
15.15	80.00	61.94	-51.61
17.15	66.64	98.12	-55.48
15.64	74.43	80.07	-56.45
17.53	97.72	75.75	-73.12
17.76	74.15	70.13	-55.16
14.18	94.52	94.01	-74.03
16.11	95.72	68.20	-62.28
17.57	80.44	64.41	-57.89
16.21	70.53	58.53	-44.70
15.83	57.35	58.61	-44.96

The following is the code for fitting a regression model:

# Fit regression model
model5 <- lm(dropout_risk ~ age + attendance_rate + past_grades, data = data5)
summary(model5)

## 
## Call:
## lm(formula = dropout_risk ~ age + attendance_rate + past_grades, 
##     data = data5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7261 -3.1419 -0.6266  2.8062 14.5070 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -5.32845    8.01034  -0.665    0.508    
## age              0.14421    0.41296   0.349    0.728    
## attendance_rate -0.43060    0.03581 -12.023  < 2e-16 ***
## past_grades     -0.28597    0.03205  -8.924 3.04e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.653 on 96 degrees of freedom
## Multiple R-squared:  0.6831, Adjusted R-squared:  0.6732 
## F-statistic: 68.99 on 3 and 96 DF,  p-value: < 2.2e-16

Logistic Regression

Logistic regression is a widely used statistical technique in many fields for modeling categorical outcomes based on one or more predictor variables.

We focus on the case that the outcome is binary (e.g., yes/no, success/failure, employed/unemployed).

The binary outcome is usually coded as 0/1.
Unlike ordinary regression, which predicts continuous outcomes, logistic regression predicts probabilities of the binary outcome.
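
In symbols, the standard logistic regression model links the probability $p = P(y = 1)$ to the predictors through the log-odds (logit). The notation below reuses the $\beta$ and $x$ symbols from the regression section; $p$ is introduced here for illustration.

```latex
\[
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}} .
\]
```

The right-hand form guarantees that the predicted probability always lies between 0 and 1, which a straight-line model cannot do.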

We demonstrate some applications.

Voting Behavior

We analyze factors that predict whether a person will vote in an election (e.g., based on age, income, education level).

Suppose a social scientist wants to investigate the likelihood of voting (yes or no) based on three predictors: age, income level, and education level. This could reveal how these factors impact an individual’s likelihood of participating in elections.

Here’s a list of the variables we consider:

Vote (dependent variable): Binary outcome (1 = voted, 0 = did not vote)
Age: Continuous variable (in years)
Income: Continuous variable (in units of 10,000)
Education: Ordinal variable (1 = no high school, 2 = high school, 3 = college degree or more)

Predicting Voting Behavior
age	income	education	vote
34	5	3	1
38	6	3	1
56	6	3	1
41	5	1	1
41	5	1	0
57	6	3	1
45	5	2	0
27	3	3	1
33	5	2	1
36	7	1	0

Let’s use R to build a logistic regression model.

Data preparation:

vote <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
age <- c(34, 38, 56, 41, 41, 57, 45, 27, 33, 36)
income <- c(5, 6, 6, 5, 5, 6, 5, 3, 5, 7)
education <- c(3, 3, 3, 1, 1, 3, 2, 3, 2, 1)

data1 = data.frame(vote, age, income, education)

Fit a logistic model:

model <- glm(vote ~ age + income + education, data = data1, family = binomial)

# View the model summary
summary(model)

## 
## Call:
## glm(formula = vote ~ age + income + education, family = binomial, 
##     data = data1)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2.39470    8.82294   0.271    0.786
## age         -0.06231    0.15984  -0.390    0.697
## income      -0.60711    1.40856  -0.431    0.666
## education    2.25493    1.46455   1.540    0.124
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12.2173  on 9  degrees of freedom
## Residual deviance:  7.1551  on 6  degrees of freedom
## AIC: 15.155
## 
## Number of Fisher Scoring iterations: 6

Interpreting Results:

Coefficients: Each coefficient represents the change in the log-odds of voting for a one-unit increase in that predictor, holding the other predictors constant.
- A positive coefficient suggests that an increase in the predictor is associated with an increase in the likelihood of voting.
- A negative coefficient suggests a decrease in the likelihood of voting as the predictor increases.
Significance Levels: The Pr(>|z|) values (called p-values) indicate whether each predictor is statistically significant. A small p-value (e.g., < 0.05) suggests that the corresponding predictor has a statistically significant association with the outcome. In this example all three p-values exceed 0.05, so with only 10 observations none of the predictors is statistically significant; the signs below describe the fitted model only.
The coefficient for age is negative (-0.062): in the fitted model, older individuals are less likely to vote, controlling for the other variables.
The coefficient for income is negative (-0.607): individuals with higher income levels are less likely to vote, controlling for the other variables.
The coefficient for education is positive (2.255): people with higher educational attainment are more likely to vote, controlling for the other variables.
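
Log-odds are hard to read directly, so coefficients are often exponentiated into odds ratios. The sketch below refits the voting model on the same ten observations and also reproduces one predicted probability by hand; the names `voting_data`, `voting_model`, `odds_ratios`, `x4`, and `prob_by_hand` are introduced here for illustration.

```r
vote <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
age <- c(34, 38, 56, 41, 41, 57, 45, 27, 33, 36)
income <- c(5, 6, 6, 5, 5, 6, 5, 3, 5, 7)
education <- c(3, 3, 3, 1, 1, 3, 2, 3, 2, 1)
voting_data <- data.frame(vote, age, income, education)

voting_model <- glm(vote ~ age + income + education,
                    data = voting_data, family = binomial)

# Odds ratios: exponentiated log-odds coefficients
odds_ratios <- exp(coef(voting_model))
odds_ratios

# Predicted probability for respondent 4, computed by hand as the
# inverse logit of the linear predictor
x4 <- c(1, 41, 5, 1)  # intercept, age, income, education
prob_by_hand <- plogis(sum(coef(voting_model) * x4))
prob_by_hand
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient; the hand-computed probability should match the value that `predict(voting_model, type = "response")` returns for the fourth respondent.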

Evaluating the Model with a Confusion Matrix:

A confusion matrix is a useful tool for evaluating the performance of a classification model.
Steps to Create a Confusion Matrix in R:
- Generate Predictions: Use the model to predict probabilities for each observation.
- Set a Threshold: Convert the probabilities into binary classifications based on a threshold (usually 0.5).
- Create the Confusion Matrix: Use the table function or the caret package to compare predicted and actual values.

Generate predictions:

# Generate predictions
predicted_probs <- predict(model, type = "response")
predicted_probs

##         1         2         3         4         5         6         7         8 
## 0.9821116 0.9588780 0.8836753 0.2808078 0.2808078 0.8771160 0.7436923 0.9965154 
##         9        10 
## 0.8597151 0.1366807

Convert the probabilities to binary outcomes with threshold 0.5 (classify a vote as 1 if the predicted probability is greater than 0.5, and as 0 otherwise):

# Convert probabilities to binary outcomes based on threshold 0.5
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)
predicted_classes

##  1  2  3  4  5  6  7  8  9 10 
##  1  1  1  0  0  1  1  1  1  0

The confusion matrix is:

# Create the confusion matrix
actual = data1$vote
conf_matrix <- table(predicted_classes, actual)
conf_matrix

##                  actual
## predicted_classes 0 1
##                 0 2 1
##                 1 1 6
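
Common summary measures can be read off the confusion matrix. The following sketch rebuilds the matrix from the predicted classes and actual votes shown above and computes accuracy, sensitivity, and specificity; the names of the four cell counts are introduced here for illustration.

```r
# Predicted classes and actual votes, copied from the output above
predicted_classes <- c(1, 1, 1, 0, 0, 1, 1, 1, 1, 0)
actual <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)
conf_matrix <- table(predicted_classes, actual)

# Cells are indexed by name: conf_matrix[predicted, actual]
true_neg  <- conf_matrix["0", "0"]
false_neg <- conf_matrix["0", "1"]
false_pos <- conf_matrix["1", "0"]
true_pos  <- conf_matrix["1", "1"]

accuracy    <- (true_pos + true_neg) / sum(conf_matrix)  # share classified correctly
sensitivity <- true_pos / (true_pos + false_neg)         # voters correctly identified
specificity <- true_neg / (true_neg + false_pos)         # non-voters correctly identified

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Here 8 of the 10 respondents are classified correctly (accuracy 0.8), 6 of the 7 voters are identified (sensitivity 6/7), and 2 of the 3 non-voters are identified (specificity 2/3).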

Evaluating the Model with a Receiver Operating Characteristic (ROC) Curve:

An ROC curve evaluates the model’s performance across all possible thresholds, plotting the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity). Instead of focusing on one threshold, the ROC curve gives a comprehensive view of the model’s ability to distinguish between classes, regardless of threshold choice.

The following generates an ROC curve:

# Install and load pROC package for ROC analysis
# install.packages("pROC")
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Generate predictions
predicted_probs <- predict(model, type = "response")

# Prepare curve
roc_curve <- roc(data1$vote, predicted_probs)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

# Plot ROC curve
plot(roc_curve, main = "ROC Curve", col = "blue", legacy.axes = TRUE)

The area under an ROC curve (AUC):

An AUC of 1 indicates that the model separates the two classes perfectly.
An AUC of 0.5 indicates that the model performs no better than random guessing.

We calculate the AUC for our example:

# Calculate AUC
auc(roc_curve)

## Area under the curve: 0.9286

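The area under the ROC curve is 0.9286, which indicates that the model separates voters from non-voters well on these data. Because the AUC is computed on the same 10 observations used to fit the model, it is likely an optimistic estimate of performance on new data.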
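
The AUC also has a direct interpretation: it is the probability that a randomly chosen voter receives a higher predicted probability than a randomly chosen non-voter, with ties counted as one half. The sketch below computes it by hand from the predicted probabilities printed earlier, without the pROC package; the names `voters`, `non_voters`, `pair_scores`, and `auc_by_hand` are introduced here for illustration.

```r
# Predicted probabilities and actual votes, copied from the output above
predicted_probs <- c(0.9821116, 0.9588780, 0.8836753, 0.2808078, 0.2808078,
                     0.8771160, 0.7436923, 0.9965154, 0.8597151, 0.1366807)
vote <- c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0)

voters     <- predicted_probs[vote == 1]
non_voters <- predicted_probs[vote == 0]

# Score each (voter, non-voter) pair: 1 if the voter ranks higher, 0.5 for a tie
pair_scores <- outer(voters, non_voters,
                     function(v, nv) (v > nv) + 0.5 * (v == nv))

auc_by_hand <- mean(pair_scores)
auc_by_hand
```

There are 7 × 3 = 21 pairs, and the pair scores sum to 19.5, giving 19.5 / 21 ≈ 0.9286, in agreement with the value reported by `auc()`.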

Employment Status

We predict whether someone is employed based on factors like education, work experience, and age.

Predicting Employment Status
age	education	experience	employed
25	1	2	0
40	2	10	1
30	2	5	1
22	1	1	0
50	3	20	1
60	3	30	1
35	2	10	1
28	2	4	0
45	3	15	1
38	2	12	1

Health Outcomes

We estimate the likelihood of health issues like obesity or diabetes based on lifestyle factors.

Predicting Health Status
age	diet_quality	exercise_hours	health_issue
25	2	1	0
40	4	0	1
35	3	2	0
50	2	0	1
45	4	3	1
60	1	0	1
30	3	1	0
55	2	0	1
48	3	4	1
33	4	2	0

Educational Attainment

We model the probability of finishing college or dropping out based on socioeconomic factors.

Predicting College Dropout
highschool_performance	parent_income	parent_education	graduated_college
85	30	3	1
92	60	4	1
78	50	2	0
70	40	1	0
88	80	5	1
95	100	5	1
80	55	3	1
72	45	2	0
85	70	4	1
90	90	3	1

Conclusion

Logistic regression is a powerful tool in the social sciences, allowing researchers to examine how various factors influence the probability of certain outcomes. Through the interpretation of coefficients, odds ratios, and model fit metrics, logistic regression provides insights into the likelihood and predictors of events, enabling data-driven conclusions in areas like voting behavior, health, and education.

Exercises

Exercise 1

We will use the NHANES data from the website: https://www.lock5stat.com/datapage3e.html.

The documentation of all the data on this website is https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf.
You will run R code in a cloud-based environment. To do so, visit the website https://posit.cloud/. Take two minutes to sign up using your preferred email.
Log in to https://posit.cloud/ and start a new project by first clicking “New Project” and then clicking “New RStudio Project”. Click “Untitled Project”, which is close to the upper-left corner of your computer screen. Change it to “Stat 101” or something you prefer.
Click “File” which is again close to the upper-left corner of your computer screen, choose “New File” and then “R Markdown”.
Now, you should see a dialogue box. Fill in any meaningful title (such as Analysis of Consumer Health and Organic Food Purchasing Behavior), author, and click “OK”.
Now, you will see a template automatically generated for you. The first 6 lines are called YAML (Yet Another Markup Language). Lines 8-10 are the setting (no need to change it).
Delete all lines below line 11. Instead, type the following:

## 1. Introduction

The growing interest in organic food has raised questions about the factors influencing consumer purchasing behavior. Among these factors, health is often cited as a key motivator. Previous studies have suggested that individuals who self-rate their health as good or excellent are more likely to purchase organic foods. This study aims to explore this relationship in greater detail, using a dataset with 4716 observations on consumer health, organic food purchasing behavior, and income.

This research examines:

1. Whether there is a significant relationship between self-rated health and the likelihood of purchasing organic food.
2. How income influences the decision to buy organic food.
3. The potential differences in health status when health is classified into binary categories (Good/Very good/Excellent vs. Poor/Fair).

By analyzing these variables, this study provides insights into the socio-demographic factors that may drive organic food consumption.

## 2. Literature Review

Consumer behavior related to organic food purchasing has been the subject of numerous studies. Many researchers have found that individuals who perceive themselves to be in good health are more likely to buy organic products, as they often associate organic foods with better health outcomes. Furthermore, income has been identified as a crucial factor, with higher-income consumers more likely to afford organic foods.

Recent literature (e.g., Smith et al., 2020; Williams, 2019) highlights the importance of self-rated health as an indicator of food choices. However, most studies have not examined how income and health status together influence organic food purchasing behavior across different health categories. This study fills that gap by using a large dataset that includes health and income data for a diverse sample of consumers.

## 3. Methodology

This study analyzes a dataset with 4716 observations on five variables:

Case: Unique Case ID number.
Organic: Whether the individual has purchased food labeled as organic in the past 30 days (Yes/No).
Health: Self-rating of health (Excellent, Very good, Good, Fair, Poor).
HealthBinary: Binary health classification (Poor/Fair vs. Good/Very good/Excellent).
Income: Monthly income in dollars.

### 3.1 Data Loading and Exploration

# Load necessary libraries
library(ggplot2)
library(DT)

# Load your dataset 
health_data <- read.csv("https://www.lock5stat.com/datasets3e/NHANES.csv")

# Print data
datatable(health_data)

## 4. Results

### 4.1 Descriptive Statistics

# Frequency table for categorical variables
table(health_data$Organic)
table(health_data$Health)
table(health_data$HealthBinary)


# Calculate the mean and median of the Income column.


# Find the 5-number summary (minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum) for the Income variable.


# Create a histogram comparing income by organic.


# Create a boxplot comparing income by organic.


# Create a histogram comparing income by health status.


# Create a boxplot comparing income by health status.



# Make a barplot of HealthBinary by Organic. Do people who buy organic foods tend to be healthier?


# Plot a scatterplot of health vs. income. You need to set Poor = 0, Fair = 1, Good = 2, Very good = 3, and Excellent = 4.

# Calculate the correlation coefficient between health and income. Interpret the correlation coefficient and discuss what it suggests about the relationship.
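
One possible way to approach several of the tasks above is sketched here on a small made-up data frame, not on the NHANES data. The rows in `toy` are hypothetical; only the column names come from Section 3. The recoding assumes the ordinal ordering Poor < Fair < Good < Very good < Excellent, and Spearman's rank correlation is chosen here because the health score is ordinal.

```r
# Hypothetical toy rows (NOT the NHANES data); column names follow Section 3
toy <- data.frame(
  Organic = c("Yes", "No", "No", "Yes", "No", "Yes"),
  Health  = c("Excellent", "Fair", "Poor", "Very good", "Good", "Good"),
  Income  = c(5200, 1800, 1500, 4100, 2600, 3900)
)

# Mean, median, and five-number summary of Income
mean_income   <- mean(toy$Income)
median_income <- median(toy$Income)
print(quantile(toy$Income))  # min, Q1, median, Q3, max

# Boxplot comparing income by organic purchase
boxplot(Income ~ Organic, data = toy, ylab = "Monthly income")

# Recode Health to an ordinal score: Poor = 0, ..., Excellent = 4
health_levels   <- c("Poor", "Fair", "Good", "Very good", "Excellent")
toy$HealthScore <- match(toy$Health, health_levels) - 1

# Rank correlation between the health score and income
health_income_cor <- cor(toy$HealthScore, toy$Income, method = "spearman")
print(health_income_cor)
```

To run the same steps on the real data, replace `toy` with `health_data` after checking that the column names and category labels match those in the file.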

### 4.2 Two-Sample t-Test

Perform a two-sample t-test to see if there’s a significant difference in mean Income between people who purchased organic food and those who did not.
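
A minimal sketch of a two-sample t-test in R, comparing Income across the two Organic groups listed in Section 3. The data frame `toy` is hypothetical and stands in for `health_data`. By default `t.test()` performs Welch's test, which does not assume equal variances in the two groups.

```r
# Hypothetical toy rows (NOT the NHANES data)
toy <- data.frame(
  Organic = c("Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes"),
  Income  = c(5200, 1800, 1500, 4100, 2600, 3900, 2200, 4800)
)

# Welch two-sample t-test of Income by Organic group
income_test <- t.test(Income ~ Organic, data = toy)
print(income_test)

# Group means and p-value can be extracted from the result
print(income_test$estimate)
print(income_test$p.value)
```

A small p-value would suggest that mean income differs between the two groups; the confidence interval in the printed output shows the plausible size of that difference.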

### 4.3 Multiple Regression

Run a multiple regression analysis with Income as the dependent variable and Organic and Health as independent variables. Interpret the coefficients of each predictor. For example, does Organic significantly predict Income?
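
The regression described above can be sketched with `lm()`. The rows below are hypothetical and only illustrate the mechanics; the factor level order for Health is an assumption chosen so that Poor is the reference category.

```r
# Hypothetical toy rows (NOT the NHANES data)
toy <- data.frame(
  Organic = c("Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Health  = c("Excellent", "Fair", "Poor", "Very good", "Good",
              "Good", "Fair", "Excellent", "Very good", "Poor"),
  Income  = c(5200, 1800, 1500, 4100, 2600, 3900, 2400, 4600, 3500, 2000)
)

# Treat Health as a factor with Poor as the reference level
toy$Health <- factor(toy$Health,
                     levels = c("Poor", "Fair", "Good", "Very good", "Excellent"))

# Multiple regression of Income on Organic and Health
income_fit <- lm(Income ~ Organic + Health, data = toy)
print(summary(income_fit))
```

The coefficient labelled `OrganicYes` estimates the difference in mean income between organic buyers and non-buyers at the same health level; each `Health...` coefficient compares that level with Poor.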

### 4.4 Logistic Regression

A logistic regression model was fitted to predict the likelihood of purchasing organic food based on income and health status.

# Before applying logistic regression, we need to create a binary response variable coded as 0 and 1.
health_data$Organic <- ifelse(health_data$Organic == "Yes", 1, 0)

# Logistic regression to predict organic purchase based on Health and Income
logit_model <- glm(Organic ~ HealthBinary + Income, data = health_data, family = "binomial")
summary(logit_model)
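
The output of `summary()` reports coefficients on the log-odds scale. A minimal sketch of converting them to odds ratios with Wald confidence intervals follows. It uses simulated data so that it runs on its own; the category labels, the simulation parameters, and the object names are all assumptions, not properties of the NHANES file.

```r
set.seed(1)
n <- 200

# Simulated stand-in for health_data (hypothetical values and labels)
toy <- data.frame(
  HealthBinary = sample(c("Poor/Fair", "Good/Very good/Excellent"), n, replace = TRUE),
  Income       = round(runif(n, 1000, 8000))
)
# Simulated purchase indicator whose log-odds rise with income
toy$Organic <- rbinom(n, 1, plogis(-2 + 0.0004 * toy$Income))

toy_model <- glm(Organic ~ HealthBinary + Income, data = toy, family = "binomial")

# Odds ratios with 95% Wald confidence intervals
or_table <- cbind(OR = exp(coef(toy_model)), exp(confint.default(toy_model)))
print(or_table)
```

The same two lines that build `or_table` can be applied to `logit_model`. Because Income is measured in dollars, its per-dollar odds ratio will be very close to 1; rescaling income to thousands of dollars makes the odds ratio easier to read.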

Interpretation of results (just an example):

Health: Individuals who rated their health as Good/Very good/Excellent had an odds ratio of 1.8 (95% CI: 1.5–2.2) for purchasing organic food, compared to those who rated their health as Poor/Fair.
Income: For every $1000 increase in monthly income, the odds of purchasing organic food were multiplied by 1.2, that is, they increased by 20% (OR = 1.2, p < 0.05).
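
An odds ratio acts on odds, not on probabilities, which is a common source of confusion. The short calculation below shows what an odds ratio of 1.2 implies for a person whose baseline probability of buying organic food is 0.30; that baseline value is a hypothetical number chosen for illustration.

```r
odds_ratio    <- 1.2   # example odds ratio per $1000 of income (from the text)
baseline_prob <- 0.30  # hypothetical baseline probability

# Convert probability to odds, apply the odds ratio, convert back
baseline_odds <- baseline_prob / (1 - baseline_prob)
new_odds      <- baseline_odds * odds_ratio
new_prob      <- new_odds / (1 + new_odds)
print(round(new_prob, 3))

# Corresponding logistic regression coefficient per single dollar
coef_per_dollar <- log(odds_ratio) / 1000
print(coef_per_dollar)
```

Here the probability rises from 0.30 to about 0.34, an increase of roughly 4 percentage points rather than 20%.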

## 5. Discussion

The results of this study show that self-rated health and income are significant predictors of organic food purchasing behavior. Those who rate their health as Excellent or Very good are more likely to buy organic foods, suggesting that health-conscious consumers view organic food as beneficial to their well-being. Additionally, income plays a crucial role, as higher-income individuals are more able to afford the premium prices of organic products.

The study also found that when health is classified into binary categories, the results align with previous literature showing that individuals in better health are more likely to make health-conscious food choices. However, the use of binary health classification may oversimplify the nuances of health perceptions, and future research might benefit from a more detailed analysis of health variables.

Limitations

The study relies on self-reported health data, which may be biased or inaccurate. Income was collected as a categorical variable, which limits the precision of the analysis. The dataset is cross-sectional, so causality cannot be inferred.

## 6. Conclusion

This study contributes to the understanding of consumer behavior by highlighting the significant role of self-rated health and income in the decision to purchase organic food. The findings suggest that health-conscious individuals, particularly those in better health, are more likely to buy organic products, and that income significantly influences this decision. These insights can help marketers, policymakers, and health professionals better understand the factors influencing organic food consumption.

Appendix

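“School resources” refers to the materials, facilities, personnel, and support systems within a school that support student learning and development. These resources have an important effect on students' academic achievement, well-being, and overall school experience. Some common types of school resources are listed below:

Teachers and support staff:

Qualified and experienced teachers, counselors, teaching assistants, and support staff contribute to effective teaching and student support. Smaller class sizes and specialized support staff (such as special education teachers and language specialists) are often markers of high-quality school resources.

Physical infrastructure:

Facilities such as classrooms, libraries, science laboratories, and sports fields.
Adequate seating, good lighting, temperature control, and safety features in classrooms can improve the learning environment.
Dedicated spaces for art, music, and other extracurricular activities are also valuable resources.

Learning materials:

The supply of teaching materials such as textbooks, workbooks, and digital learning platforms.
Libraries with rich collections of books, journals, and research databases.
Technology that supports modern teaching methods, such as computers, tablets, projectors, and related software.

Extracurricular and enrichment programs:

Clubs, sports teams, arts programs, and other extracurricular activities offered by the school contribute to students' all-round development.
Programs for gifted students, tutoring support, or enrichment activities.

Administrative and academic support:

Administrative resources for managing school operations effectively.
Systems for tracking student progress, attendance, and engagement, usually supported by staff and software.

Funding and financial resources:

Funds are used to maintain facilities, hire qualified staff, and provide learning materials.
Well-funded schools can usually offer more resources, such as after-school programs, field trips, and specialized courses.

Community and family engagement resources:

Partnerships with local organizations, volunteer programs, and family support can enrich the educational experience.
Resources that support family involvement in education, such as family workshops or language assistance for non-English-speaking families.

These resources help create an environment in which students can make progress both academically and personally. Because of differences in funding, location, and community support, resources often vary considerably from school to school.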

Statistical Foundations for Data Science

SZ

2024-11-08

