2026-02-09

Dataset from the Foreign Service Institute (FSI)

  • 1.5 billion people are trying to learn a new language according to FSI
  • Languages are ranked 1 (easy) to 5 (difficult)
  • This analysis helps learners estimate how long it will take to become proficient

# Build the dataset: one row per language
lang_data <- data.frame(
  Language = c("Spanish", "French", "German", "Indonesian", "Russian", "Thai", "Arabic", "Korean", "Japanese"),
  Category = c(1, 1, 2, 3, 4, 4, 4, 4, 4),
  Hours_Required = c(600, 600, 750, 900, 1100, 1100, 2200, 2200, 2200),
  Complexity_Score = c(10, 15, 30, 45, 65, 75, 90, 95, 100)
)

The Statistical Hypothesis

  • We test whether the time needed to learn a new language increases as the language becomes more distant from English.
  • Null hypothesis (\(H_0: \beta_1 = 0\)): complexity has no effect on the hours required.
  • Alternative hypothesis (\(H_a: \beta_1 \neq 0\)): complexity does affect the hours required.

\[Y_{hours} = \beta_0 + \beta_1 X_{complexity} + \epsilon\]
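
One way to carry out this test in R is sketched below. It assumes the usual t-test on the slope reported by `lm()` is the intended procedure, and it assumes a significance level of 0.05, which the slides do not state. The data frame is rebuilt so the snippet runs on its own.

```r
# Rebuild the two columns used in the model so this snippet is self-contained
lang_data <- data.frame(
  Hours_Required   = c(600, 600, 750, 900, 1100, 1100, 2200, 2200, 2200),
  Complexity_Score = c(10, 15, 30, 45, 65, 75, 90, 95, 100)
)

# Fit Hours_Required = beta_0 + beta_1 * Complexity_Score + error
fit <- lm(Hours_Required ~ Complexity_Score, data = lang_data)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- summary(fit)$coefficients

# The slope row holds the test of H0: beta_1 = 0
slope_test <- coef_table["Complexity_Score", ]
print(slope_test)

# Assumed significance level (not given in the original slides)
alpha <- 0.05
reject_null <- unname(slope_test["Pr(>|t|)"] < alpha)
print(reject_null)
```

If `reject_null` is `TRUE`, the data are inconsistent with a zero slope at the assumed level.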

Least Squares Estimator

To find the best-fitting line for our language data, we estimate the slope (\(\hat{\beta}_1\)) as the value that minimizes the sum of squared residuals (SSR):

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

Among all straight lines, the one with this slope has the smallest sum of squared vertical distances to the data points, and it always passes through the point of means \((\bar{X}, \bar{Y})\).
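
As a rough check, the slope formula can be evaluated by hand and compared with what `lm()` returns. The sketch below uses the standard identity \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\) for the intercept; the variable names `x`, `y`, `beta1_hat`, `beta0_hat`, and `fit_check` are chosen here for illustration.

```r
# Complexity scores (x) and required hours (y) from the dataset
x <- c(10, 15, 30, 45, 65, 75, 90, 95, 100)
y <- c(600, 600, 750, 900, 1100, 1100, 2200, 2200, 2200)

# Slope: sum of cross-deviations divided by sum of squared x-deviations
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: chosen so the line passes through (mean(x), mean(y))
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Fit the same model with lm() for comparison
fit_check <- lm(y ~ x)

print(c(intercept = beta0_hat, slope = beta1_hat))
print(coef(fit_check))
```

The two printed pairs should agree up to floating-point rounding.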

How Many Hours to Be Proficient?

  • Korean, Japanese, and Arabic are categorized as difficult and need the most hours to reach proficiency (2,200 each).
  • Thai and Russian are also categorized as difficult but need fewer hours (1,100 each).
  • French and Spanish are categorized as easy and need 600 hours each.

Correlation

  • We compute the Pearson correlation between the complexity score (our measure of linguistic distance) and the hours required to reach proficiency.

# Calculate the Pearson correlation
cor_value <- cor(lang_data$Complexity_Score, lang_data$Hours_Required)

# Build the linear model
fit <- lm(Hours_Required ~ Complexity_Score, data = lang_data)

# Print Correlation
print(paste("Correlation:", round(cor_value, 4)))
## [1] "Correlation: 0.9211"
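
In a simple linear regression with one predictor, the squared Pearson correlation equals the model's R-squared, so a correlation of 0.9211 corresponds to roughly 85% of the variation in hours being accounted for by the complexity score. A minimal sketch of that check, with the data frame rebuilt so it runs on its own (`r_squared` is a name introduced here):

```r
# Rebuild the two columns used above so this snippet is self-contained
lang_data <- data.frame(
  Hours_Required   = c(600, 600, 750, 900, 1100, 1100, 2200, 2200, 2200),
  Complexity_Score = c(10, 15, 30, 45, 65, 75, 90, 95, 100)
)

# Pearson correlation and the fitted linear model
cor_value <- cor(lang_data$Complexity_Score, lang_data$Hours_Required)
fit <- lm(Hours_Required ~ Complexity_Score, data = lang_data)

# R-squared as reported by the model summary
r_squared <- summary(fit)$r.squared

# The squared correlation and R-squared should match
print(round(c(r = cor_value, r_squared_from_r = cor_value^2,
              r_squared_from_lm = r_squared), 4))
```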

Regression Line

  • The graph shows a strong positive relationship between a language's distance from English and the hours needed to learn it.
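
A plot along these lines can be drawn with base R graphics. This is a sketch rather than the code behind the original figure: the axis labels, colors, and the complexity score of 50 used for the prediction are all assumptions chosen for illustration.

```r
# Rebuild the columns needed for the plot so this snippet is self-contained
lang_data <- data.frame(
  Language = c("Spanish", "French", "German", "Indonesian", "Russian",
               "Thai", "Arabic", "Korean", "Japanese"),
  Hours_Required   = c(600, 600, 750, 900, 1100, 1100, 2200, 2200, 2200),
  Complexity_Score = c(10, 15, 30, 45, 65, 75, 90, 95, 100)
)

fit <- lm(Hours_Required ~ Complexity_Score, data = lang_data)

# Scatter plot of hours against complexity, one point per language
plot(lang_data$Complexity_Score, lang_data$Hours_Required,
     xlab = "Complexity score (distance from English)",
     ylab = "Hours required",
     main = "Hours to proficiency vs. complexity",
     pch = 19)

# Label each point with its language name
text(lang_data$Complexity_Score, lang_data$Hours_Required,
     labels = lang_data$Language, pos = 3, cex = 0.7)

# Overlay the fitted regression line
abline(fit, col = "red", lwd = 2)

# Predicted hours for a hypothetical language with a complexity score of 50
new_lang <- data.frame(Complexity_Score = 50)
predicted_hours <- unname(predict(fit, newdata = new_lang))
print(round(predicted_hours))
```

The same `predict()` call works for any score inside the observed range of 10 to 100; predictions outside that range extrapolate beyond the data.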

3D Graph

  • This interactive 3D graph shows complexity, hours, and language category together.

Conclusion

  • As complexity increases, more time is needed to become proficient.
  • Languages that are not written in the Latin alphabet tend to be more difficult to learn, for example Korean, Japanese, and Arabic.
  • We reject the null hypothesis and conclude that the complexity of a language does affect the amount of time needed to learn it.