This problem set focuses on a data set relating to linguistic typology: the comparison of characteristics across languages.
One way in which languages vary is in whether they have a case system to mark nouns.
In English, the case system survives only in the pronouns; there is no case marking on nouns. So we say “The dog chased the cat” and “The cat chased the dog” without changing the word “dog” or “cat” depending on whether it is in subject or object position. But we do change pronouns depending on grammatical case: “The dog chased me,” not “The dog chased I.”
In other languages, like Finnish, there is a rich case system even on nouns. So in Finnish we say “Koira ajoi kissaa” for “dog chased cat” and “Kissa ajoi koiraa” for “cat chased dog.” (Note the different word endings! The extra “a” is the case system at work.)
Interestingly, Old English (spoken from roughly the year 400 until 1100) did have a case system on nouns, but it was lost. This is a common pattern in world languages.
Why does case disappear? One hypothesis is that L2 speakers (speakers learning a language as a second language) have a harder time remembering how to use case systems and so, when a language has a lot of L2 speakers, the case system is more likely to get lost.
We are going to analyze that hypothesis, using data from Bentz & Winter (2013).
First, we read in the data set. The column Language lists the language. Case.Cat describes, in words, how many cases the language has, and Case.Rank is a numerical version of the same information: a language with no cases has Case.Rank == 0, a language with 2 cases has rank 1, and a language with many cases (10 or more) has Case.Rank 7.
Stock tells you the Language Family (the broad grouping of that language), and Region tells you the Region of the world.
L1.Estimate is the estimated number of first language speakers of the language. L2.Estimate is the estimated number of second language speakers. Percent.L2 is the proportion of speakers that are L2 speakers, out of the total number of L1 + L2 speakers.
Because these numbers are exponentially distributed, we take the log to make them easier to work with and call these variables LogL1 and LogL2. (To avoid 0’s, we add 1 to the values before taking the log.)
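As a tiny illustration of why the +1 matters (a sketch using no data from the set): some speaker estimates are 0, and the log of 0 is undefined, so we shift every count up by 1 before logging.
log(0)      # -Inf in R: we cannot log raw counts of zero directly
log(0 + 1)  # 0: the +1 shift maps a count of zero to a log value of zero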
# Load the tidyverse (for read_csv, dplyr, and ggplot2), then read in the data
library(tidyverse)
d = read_csv("Case_L2.csv")
## Parsed with column specification:
## cols(
## Language = col_character(),
## Case.Cat = col_character(),
## Stock = col_character(),
## Region = col_character(),
## L1.Estimate = col_double(),
## L2.Estimate = col_double(),
## Percent.L2 = col_double(),
## Case.Rank = col_double(),
## Case.Presence = col_double()
## )
d$LogL1 = log(d$L1.Estimate + 1)
d$LogL2 = log(d$L2.Estimate + 1)
head(d)
## # A tibble: 6 x 11
## Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Vietnam… No morp… Aust… South… 64317000 16000000 0.199 0
## 2 Khmer,C… No morp… Aust… South… 13603400 1000000 0.0685 0
## 3 Tagalog… No morp… Aust… Ocean… 20426600 51000000 0.714 0
## 4 Ilokano No morp… Aust… Ocean… 7348300 2300000 0.238 0
## 5 Fijian No morp… Aust… Ocean… 392500 320000 0.449 0
## 6 Chamorro No morp… Aust… Ocean… 76350 30000 0.282 0
## # … with 3 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>
Report the following (by printing tables):
the language with the most overall speakers (L1 + L2)
the language with the fewest overall speakers (L1 + L2)
the language with the highest Percent.L2
the language with the lowest Percent.L2
d$all.speakers = d$L1.Estimate + d$L2.Estimate
# the language with the most overall speakers (L1 + L2)
print(filter(d, all.speakers == max(all.speakers)))
## # A tibble: 1 x 12
## Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chinese… No morp… Sino… South… 875744667 192383000 0.180 0
## # … with 4 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## # all.speakers <dbl>
# the language with the fewest overall speakers (L1 + L2)
print(filter(d, all.speakers == min(all.speakers)))
## # A tibble: 1 x 12
## Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Araona 5 cases Pano… NE So… 81 9 0.1 4
## # … with 4 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## # all.speakers <dbl>
# the language with the highest Percent.L2
print(filter(d, Percent.L2 == max(Percent.L2)))
## # A tibble: 2 x 12
## Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Makah No morp… Waka… Alask… 0 2224 1 0
## 2 Tiwi No morp… Tiwi N Aus… 0 1830 1 0
## # … with 4 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## # all.speakers <dbl>
# the language with the lowest Percent.L2
print(filter(d, Percent.L2 == min(Percent.L2)))
## # A tibble: 1 x 12
## Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Japanese 8-9 cas… Japa… N Coa… 126026700 1500000 0.0118 6
## # … with 4 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## # all.speakers <dbl>
Find the mean and median of the L1.Estimate column. Which is higher: the mean or median? Why?
mean(d$L1.Estimate)
## [1] 44020622
median(d$L1.Estimate)
## [1] 9452540
The mean is 44020622 and the median is 9452540. The mean is higher than the median because the data are roughly exponentially distributed (heavily right-skewed): a few extremely large values pull the mean up substantially, while the median is barely affected by how extreme those values are.
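As a toy illustration with made-up numbers (not from this data set), one huge value drags the mean far above the median:
x = c(1, 2, 3, 4, 1000)
mean(x)    # 202: pulled way up by the single extreme value
median(x)  # 3: unchanged no matter how extreme that value gets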
Let’s explore the relationship between the number of L1 speakers and the number of L2 speakers.
First, make a scatter plot of LogL1 vs. LogL2. We recommend you do this in ggplot(), since that will make it easier to do Q5.
ggplot(d, aes(x=LogL1, y=LogL2)) +
geom_point()
Using LogL1 as a predictor, run a linear regression predicting the number of LogL2 speakers based on the number of LogL1 speakers.
Print the regression output, and below, write one sentence describing what it tells us.
l = lm(data=d, LogL2 ~ LogL1)
summary(l)
##
## Call:
## lm(formula = LogL2 ~ LogL1, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0962 -1.2542 0.0612 1.1700 4.5088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.19867 0.82784 3.864 0.000263 ***
## LogL1 0.72619 0.05377 13.507 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.835 on 64 degrees of freedom
## Multiple R-squared: 0.7403, Adjusted R-squared: 0.7362
## F-statistic: 182.4 on 1 and 64 DF, p-value: < 2.2e-16
The regression output tells us that LogL1 is a significant positive predictor of LogL2 (slope = 0.73, p < 2e-16), and that the linear model explains about 74% of the variance in LogL2.
Using the linear regression from Q4 and geom_abline(), add a line of best fit to the plot in Q3. It is fine to just manually input the intercept and slope based on what you saw in the regression output.
ggplot(d, aes(x=LogL1, y=LogL2)) +
geom_point() +
geom_abline(intercept=3.19867, slope = 0.72619, color='red')
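Equivalently, if you would rather not copy the numbers by hand, you can pull the intercept and slope straight out of the fitted model object with coef():
# Same plot, reading the coefficients directly off the model
ggplot(d, aes(x=LogL1, y=LogL2)) +
geom_point() +
geom_abline(intercept = coef(l)[1], slope = coef(l)[2], color='red')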
Use the linear regression from Q4 to make predictions for the number of LogL2 speakers for each point in the data set. Do this by using the equation of the line (mx + b) with the intercept and slope values that you found.
Identify 5 or 6 languages for which the prediction is farthest (in absolute value) from the true value. Which languages are these? If you have any ideas for why these languages have predictions that are so off, include them here (but that’s not required).
# Make predictions
d$Prediction = 3.19867 + 0.72619 * d$LogL1
d$Resid = d$LogL2 - d$Prediction
summary(d$Resid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.096200 -1.254153 0.061221 0.000057 1.170091 4.508842
hist(d$Resid)
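As a sanity check, these hand-computed values should agree with R's built-in predict() and resid(), up to the rounding of the copied coefficients:
# Loose tolerance because the intercept and slope above were rounded to 5 decimals
all.equal(unname(predict(l)), d$Prediction, tolerance = 1e-3)
all.equal(unname(resid(l)), d$Resid, tolerance = 1e-3)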
lang.resid <-
group_by(d, Language) %>%
summarize(abs.Resid = abs(Resid))
arrange(lang.resid, desc(abs.Resid))
## # A tibble: 66 x 2
## Language abs.Resid
## <chr> <dbl>
## 1 Makah 4.51
## 2 Tiwi 4.31
## 3 Araona 4.10
## 4 Swahili,Tanzania 3.82
## 5 Indonesian 3.31
## 6 Icelandic 3.11
## 7 Georgian 2.99
## 8 Hungarian 2.90
## 9 Abkhaz 2.82
## 10 Pitjantjatjara 2.61
## # … with 56 more rows
Languages for which the prediction is farthest from the true value are Makah, Tiwi, Araona, Swahili, Indonesian, and Icelandic.
Now, let’s say you want to make a prediction for a language which has 100,000 L1 speakers (which, in log space, is about 11.51). Use the regression to make a prediction for that data point.
p = 3.19867 + 0.72619 * log(100000); p
## [1] 11.55924
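Since the model predicts LogL2 = log(L2 + 1), we can undo the transform to state this prediction as a raw speaker count:
exp(p) - 1 # back-transform: roughly 105,000 predicted L2 speakers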
In 1-2 sentences, describe the relationship between the number of L1 speakers and the number of L2 speakers.
There is a positive linear relationship between the logged number of L1 speakers and the logged number of L2 speakers: languages with more L1 speakers tend to have more L2 speakers. Since both variables are logged, the slope of about 0.73 means that, roughly, a 1% increase in L1 speakers is associated with a 0.73% increase in L2 speakers.
What we actually want to do is understand how the number and proportion of L2 speakers affects the variable Case.Rank.
First, let’s use the mean as a model.
Using the mean as a model (see Chapter 3 in the textbook), find the sum of the squared error for the variable Case.Rank.
d$mean.Prediction = mean(d$Case.Rank)
sse.mean = sum((d$Case.Rank - d$mean.Prediction)^2); sse.mean
## [1] 397.0303
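As a cross-check, this SSE is just the sample variance of Case.Rank times (n - 1), since R's var() divides by n - 1:
var(d$Case.Rank) * (nrow(d) - 1) # equals sse.mean, 397.0303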
Now, compute the mean of Case.Rank separately for each Language Family (which is called Stock in the data set). Then, compute the sum of the squared error and compare it to the value you found for Q9.
fam <-
group_by(d, Stock) %>%
summarise(mean.Case.Rank = mean(Case.Rank))
fam
## # A tibble: 26 x 2
## Stock mean.Case.Rank
## <chr> <dbl>
## 1 Adamawa-Ubangi 0
## 2 Austroasiatic 0
## 3 Austronesian 0
## 4 Benue-Congo 0
## 5 Chadic 0
## 6 Cushitic 3
## 7 Dravidian 5
## 8 Indo-European 2
## 9 Japanese 6
## 10 Kartvelian 5
## # … with 16 more rows
# Join each language's family mean back onto the data, then sum the squared errors
d.fam = left_join(d, fam, by = "Stock")
sum((d.fam$Case.Rank - d.fam$mean.Case.Rank)^2)
The family model predicts each language’s Case.Rank with its family mean, so its SSE can never exceed the grand-mean SSE of 397.0303 from Q9, and here it comes out much lower. Knowing a language’s family substantially improves our prediction of its Case.Rank.
Using the model in Q10 (which uses language family to make a prediction), make a prediction for Case.Rank for the language Bambara, a language in the Mande language family. What is the prediction error? Below, discuss why it is potentially problematic to use this as a prediction. (HINT: look at how many languages there are in the Mande family in this data set.)
# Based on the model in Q10, Bambara should have a Case.Rank of 0
print(filter(fam, Stock == 'Mande'))
## # A tibble: 1 x 2
## Stock mean.Case.Rank
## <chr> <dbl>
## 1 Mande 0
# Actual value
print(filter(d, Stock == 'Mande'))
## # A tibble: 1 x 15
## Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bambara No morp… Mande Afric… 2772340 10000000 0.783 0
## # … with 7 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## # all.speakers <dbl>, Prediction <dbl>, Resid <dbl>, mean.Prediction <dbl>
Based on the model in Q10, the mean Case.Rank for the Mande language family is 0, so we predict that Bambara has a Case.Rank of 0. Bambara’s actual Case.Rank is also 0, so the prediction error is zero. But the prediction is problematic: only one language in the Mande family (Bambara itself) is included in the data set, so the “family mean” for Mande is really just Bambara’s own value, and any prediction for a Mande language rests on this single data point.
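One way to see how thin the data is for some families is to count languages per family; any family represented by a single language gets a “mean” that is really just that one language’s value:
# Number of languages contributed by each family, largest first
count(d, Stock, sort = TRUE)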
Now, imagine I ask you to predict the Case.Rank for the language Mende, a language in the Mande family that is not in our data set. What would you predict using the mean as a model? What would you predict using the “language family” model in Q10? Discuss which prediction you think is better.
Using the mean as a model, I would predict that Mende has a Case.Rank of about 2 (the grand mean is 2.12). Using the language-family model in Q10, I would predict a Case.Rank of 0. Neither model is great. The grand mean pools numerous languages from many unrelated families, and using data from other families to predict the Case.Rank of Mende could be misleading. The family model, on the other hand, computes its “mean” for Mande from a single language, so it is not a reliable estimate either.
Make a plot where you put Percent.L2 on the x-axis and Case.Rank on the y-axis.
ggplot(d, aes(x=Percent.L2, y=Case.Rank)) + geom_point()
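Because Case.Rank takes only a handful of discrete values, many points sit exactly on top of one another. An optional variant with a little vertical jitter makes the overplotting easier to see:
# Optional: jitter vertically, since Case.Rank is discrete and points overlap
ggplot(d, aes(x=Percent.L2, y=Case.Rank)) + geom_jitter(width = 0, height = 0.15)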
Run a linear regression where you predict Case.Rank from Percent.L2. Get the intercept and slope, and plot a line using geom_abline().
l2 = lm(data=d, Case.Rank ~ Percent.L2)
print(summary(l2))
##
## Call:
## lm(formula = Case.Rank ~ Percent.L2, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4019 -1.7923 0.1361 1.6616 4.1427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7425 0.4051 9.240 2.16e-13 ***
## Percent.L2 -4.9748 0.9599 -5.183 2.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.09 on 64 degrees of freedom
## Multiple R-squared: 0.2956, Adjusted R-squared: 0.2846
## F-statistic: 26.86 on 1 and 64 DF, p-value: 2.382e-06
ggplot(d, aes(x=Percent.L2, y=Case.Rank)) +
geom_point() +
geom_abline(intercept = 3.7425, slope = -4.9748, color='navy')
The intercept is 3.7425 and the slope is -4.9748.
Get predictions from the linear model in Q13. Compute the sum of the squared error for predicting Case.Rank using this model. Now do the same for a model where you just use the mean of Case.Rank as the model (mean as a model!).
d$Prediction.l2 = predict(l2)
# Sum of the squared error for using the linear model
sse.l2.lm = sum((d$Prediction.l2 - d$Case.Rank)^2); sse.l2.lm
## [1] 279.6572
# Sum of the squared error for using the mean as a model
mean(d$Case.Rank) # Predicted value
## [1] 2.121212
sse.mean = sum((mean(d$Case.Rank) - d$Case.Rank)^2); sse.mean
## [1] 397.0303
Manually compute R-squared using the values in Q14. Compare it to the R-squared that you get when running summary() on the lm() regression model.
# Manually computed R-squared
R2 = (1 - sse.l2.lm/sse.mean);
print(R2)
## [1] 0.2956275
# R-squared value in summary()
R2.s = summary(l2)$r.squared
print(R2.s)
## [1] 0.2956275
The two R-squared values (manually computed and extracted from summary()) are identical, as expected: R-squared is defined as 1 - SSE(model) / SSE(mean model), which is exactly what we computed.
Describe the relationship, in 2-3 sentences, between Percent.L2 and Case.Rank.
There is a negative linear relationship between Percent.L2 and Case.Rank: as the proportion of L2 speakers increases, the predicted Case.Rank decreases. When Percent.L2 is 0, the predicted Case.Rank is approximately 3.74 (even though, strictly speaking, Case.Rank is a discrete variable and can only take integer values).
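As a worked example of the fitted line: for a hypothetical language whose speakers are 50% L2, the model predicts a Case.Rank of about 1.26:
3.7425 - 4.9748 * 0.5 # predicted Case.Rank at Percent.L2 = 0.5, i.e. 1.2551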
Run a regression predicting Case.Rank from the log number of L1 speakers (LogL1). Is the predictor significant? What is the R-squared?
l.logl1 = lm(data=d, Case.Rank ~ LogL1)
summary(l.logl1)
##
## Call:
## lm(formula = Case.Rank ~ LogL1, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.402 -2.158 -1.266 2.715 5.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.40034 1.11963 1.251 0.216
## LogL1 0.04866 0.07272 0.669 0.506
##
## Residual standard error: 2.482 on 64 degrees of freedom
## Multiple R-squared: 0.00695, Adjusted R-squared: -0.008567
## F-statistic: 0.4479 on 1 and 64 DF, p-value: 0.5057
# R-squared value
summary(l.logl1)$r.squared
## [1] 0.006949628
# Correlation coefficient: in simple regression, sqrt(R-squared) equals |r|
sqrt(summary(l.logl1)$r.squared)
## [1] 0.08336443
cor(d$LogL1, d$Case.Rank)
## [1] 0.08336443
The R-squared value is 0.0069, meaning the model explains less than 1% of the variance in Case.Rank; equivalently, the correlation is only 0.083. The predictor is not significant: the p-value for the LogL1 slope is 0.506, far above the conventional 0.05 threshold, so we fail to reject the null hypothesis that the slope is zero. LogL1 tells us essentially nothing about Case.Rank.
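Rather than reading the p-value off the printed table, you can also extract it from the summary object directly (the coefficient matrix has a Pr(>|t|) column):
# p-value for the LogL1 slope, matching the printed 0.506
summary(l.logl1)$coefficients["LogL1", "Pr(>|t|)"]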
Run a regression predicting Case.Rank from the log number of L2 speakers (LogL2). Is the predictor significant? What is the R-squared?
l.logl2 = lm(data=d, Case.Rank ~ LogL2)
summary(l.logl2)
##
## Call:
## lm(formula = Case.Rank ~ LogL2, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.320 -1.907 -1.232 2.281 4.626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.15750 1.21688 3.417 0.00111 **
## LogL2 -0.14591 0.08451 -1.727 0.08907 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.435 on 64 degrees of freedom
## Multiple R-squared: 0.0445, Adjusted R-squared: 0.02957
## F-statistic: 2.981 on 1 and 64 DF, p-value: 0.08907
# R-squared value
summary(l.logl2)$r.squared
## [1] 0.04450452
# Correlation coefficient
cor(d$LogL2, d$Case.Rank)
## [1] -0.2109609
The correlation coefficient is -0.211 and the R-squared value is 0.0445, so the model explains about 4.5% of the variance in Case.Rank. The predictor is not significant at the conventional 0.05 level: the p-value for the LogL2 slope is 0.089, which is marginal at best (significant only at the 0.10 level). There is weak, suggestive evidence that languages with more L2 speakers have lower Case.Rank.
Among the three variables, Percent.L2 best predicts Case.Rank. Its slope is highly significant (p = 2.4e-06), its correlation with Case.Rank is by far the strongest (-0.54), and the model explains about 30% of the variance (R-squared = 0.296). In comparison, the other two models explain far less variance (R-squared of 0.0069 for LogL1 and 0.0445 for LogL2), and neither predictor is significant at the 0.05 level: LogL1 is nowhere close (p = 0.506) and LogL2 is marginal at best (p = 0.089).
# Predictor = Percent.L2
summary(l2)$r.squared
## [1] 0.2956275
cor(d$Percent.L2, d$Case.Rank)
## [1] -0.5437164
# Predictor = LogL1
summary(l.logl1)$r.squared
## [1] 0.006949628
cor(d$LogL1, d$Case.Rank)
## [1] 0.08336443
# Predictor = LogL2
summary(l.logl2)$r.squared
## [1] 0.04450452
cor(d$LogL2, d$Case.Rank)
## [1] -0.2109609
About 3 to 4 hours.