Reading in the Data

This problem set focuses on a data set relating to linguistic typology, the comparison of characteristics across languages.

One way in which languages vary is in whether they have a case system to mark nouns.

In English, the case system survives only in the pronouns, and there is no case marking on nouns. So we say “The dog chased the cat” and “The cat chased the dog” without changing the word “dog” or “cat” depending on whether it is in subject or object position. But we do change pronouns depending on grammatical case: we say “The dog chased me,” not “The dog chased I.”

In other languages, like Finnish, there is a rich case system even on nouns. So in Finnish we say: “Koira ajoi kissaa” for “dog chased cat” and “Kissa ajoi koiraa” for “cat chased dog.” (Note the different word endings! The extra “a” is the case system at work.)

Interestingly, Old English (spoken from roughly the year 400 until 1100) did have a case system on nouns, but it was lost. This is a common pattern in world languages.

Why does case disappear? One hypothesis is that L2 speakers (speakers learning a language as a second language) have a harder time remembering how to use case systems and so, when a language has a lot of L2 speakers, the case system is more likely to get lost.

We are going to test this hypothesis using data from Bentz & Winter (2013).

Getting started with the data set

First, we read in the data set. The column Language lists the language. Case.Cat is a categorical description of how many cases the language marks on its nouns, and Case.Rank is a numerical version of this: a language with no cases has Case.Rank == 0, a language with 2 cases has Case.Rank == 1, and a language with many cases (10 or more) has Case.Rank == 7.

Stock tells you the language family (the broad grouping the language belongs to), and Region tells you the region of the world where it is spoken.

L1.Estimate is the estimated number of first language (L1) speakers of the language, and L2.Estimate is the estimated number of second language (L2) speakers. Percent.L2 is the proportion of speakers who are L2 speakers, out of the total number of L1 + L2 speakers.

Because these numbers are exponentially distributed, we take the log to make them easier to work with and call these variables LogL1 and LogL2. (To avoid 0’s, we add 1 to the values before taking the log.)

library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2 used throughout
d = read_csv("Case_L2.csv")
## Parsed with column specification:
## cols(
##   Language = col_character(),
##   Case.Cat = col_character(),
##   Stock = col_character(),
##   Region = col_character(),
##   L1.Estimate = col_double(),
##   L2.Estimate = col_double(),
##   Percent.L2 = col_double(),
##   Case.Rank = col_double(),
##   Case.Presence = col_double()
## )
d$LogL1 = log(d$L1.Estimate + 1)
d$LogL2 = log(d$L2.Estimate + 1)
head(d)
## # A tibble: 6 x 11
##   Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
##   <chr>    <chr>    <chr> <chr>        <dbl>       <dbl>      <dbl>     <dbl>
## 1 Vietnam… No morp… Aust… South…    64317000    16000000     0.199          0
## 2 Khmer,C… No morp… Aust… South…    13603400     1000000     0.0685         0
## 3 Tagalog… No morp… Aust… Ocean…    20426600    51000000     0.714          0
## 4 Ilokano  No morp… Aust… Ocean…     7348300     2300000     0.238          0
## 5 Fijian   No morp… Aust… Ocean…      392500      320000     0.449          0
## 6 Chamorro No morp… Aust… Ocean…       76350       30000     0.282          0
## # … with 3 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>
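
As a quick sanity check on the column definitions (a sketch, assuming Percent.L2 was computed exactly as described above):

# Percent.L2 should equal L2 / (L1 + L2); all.equal() tolerates
# floating-point rounding in the stored column
all.equal(d$Percent.L2,
          d$L2.Estimate / (d$L1.Estimate + d$L2.Estimate))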

1. Some basic stats

Report the following (by printing tables):

d$all.speakers = d$L1.Estimate + d$L2.Estimate

# the language with the most overall speakers (L1 + L2)
print(filter(d, all.speakers == max(all.speakers)))
## # A tibble: 1 x 12
##   Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
##   <chr>    <chr>    <chr> <chr>        <dbl>       <dbl>      <dbl>     <dbl>
## 1 Chinese… No morp… Sino… South…   875744667   192383000      0.180         0
## # … with 4 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## #   all.speakers <dbl>
# the language with the fewest overall speakers (L1 + L2)
print(filter(d, all.speakers == min(all.speakers)))
## # A tibble: 1 x 12
##   Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
##   <chr>    <chr>    <chr> <chr>        <dbl>       <dbl>      <dbl>     <dbl>
## 1 Araona   5 cases  Pano… NE So…          81           9        0.1         4
## # … with 4 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## #   all.speakers <dbl>
# the language with the highest Percent.L2
print(filter(d, Percent.L2 == max(Percent.L2)))
## # A tibble: 2 x 12
##   Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
##   <chr>    <chr>    <chr> <chr>        <dbl>       <dbl>      <dbl>     <dbl>
## 1 Makah    No morp… Waka… Alask…           0        2224          1         0
## 2 Tiwi     No morp… Tiwi  N Aus…           0        1830          1         0
## # … with 4 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## #   all.speakers <dbl>
# the language with the lowest Percent.L2
print(filter(d, Percent.L2 == min(Percent.L2)))
## # A tibble: 1 x 12
##   Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
##   <chr>    <chr>    <chr> <chr>        <dbl>       <dbl>      <dbl>     <dbl>
## 1 Japanese 8-9 cas… Japa… N Coa…   126026700     1500000     0.0118         6
## # … with 4 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## #   all.speakers <dbl>
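
As an aside, dplyr (version 1.0 or later) has slice_max() and slice_min(), which express these queries more compactly; a sketch of the equivalent calls:

# One call per query; with_ties = TRUE (the default) keeps tied rows,
# which is why the Percent.L2 maximum would return both Makah and Tiwi
slice_max(d, all.speakers, n = 1)
slice_min(d, all.speakers, n = 1)
slice_max(d, Percent.L2, n = 1)
slice_min(d, Percent.L2, n = 1)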

2. Means and medians

Find the mean and median of the L1.Estimate column. Which is higher: the mean or median? Why?

mean(d$L1.Estimate)
## [1] 44020622
median(d$L1.Estimate)
## [1] 9452540

The mean is 44020622 and the median is 9452540. The mean is higher than the median because the data are exponentially distributed: a handful of extremely large values pull the mean up a great deal, while the median, which depends only on the middle of the ordered data, is barely affected by them.
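
One way to check this explanation: on the log scale the extreme values are compressed, so the mean and median of LogL1 should be much closer together.

# If the skew explanation is right, these two numbers should nearly agree
mean(d$LogL1)
median(d$LogL1)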

3. Relationship between L1.Estimate and L2.Estimate

Let’s explore the relationship between the number of L1 speakers and the number of L2 speakers.

First, make a scatter plot of LogL1 vs. LogL2. We recommend you do this in ggplot(), since that will make it easier to do Q5.

ggplot(d, aes(x=LogL1, y=LogL2)) +
  geom_point()

4. Regressing L2 on L1

Using LogL1 as a predictor, run a linear regression predicting LogL2 from LogL1.

Print the regression output, and below, write one sentence describing what it tells us.

l = lm(data=d, LogL2 ~ LogL1)
summary(l)
## 
## Call:
## lm(formula = LogL2 ~ LogL1, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0962 -1.2542  0.0612  1.1700  4.5088 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.19867    0.82784   3.864 0.000263 ***
## LogL1        0.72619    0.05377  13.507  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.835 on 64 degrees of freedom
## Multiple R-squared:  0.7403, Adjusted R-squared:  0.7362 
## F-statistic: 182.4 on 1 and 64 DF,  p-value: < 2.2e-16

The regression output tells us that LogL1 is a highly significant positive predictor of LogL2 (slope of about 0.73, p < 2e-16), and that the model explains about 74% of the variance in LogL2 (R-squared = 0.7403).

5. Making a prediction

Using the linear regression from Q4 and geom_abline(), add a line of best fit to the plot in Q3. It is fine to just manually input the intercept and slope based on what you saw in the regression output.

ggplot(d, aes(x=LogL1, y=LogL2)) +
  geom_point() +
  geom_abline(intercept=3.19867, slope = 0.72619, color='red')
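
To avoid retyping coefficients by hand, an alternative worth knowing: pull them from the model object with coef(); a sketch.

# coef(l)[1] is the intercept and coef(l)[2] the slope of the fitted model,
# so this draws exactly the same line without hard-coded numbers
ggplot(d, aes(x=LogL1, y=LogL2)) +
  geom_point() +
  geom_abline(intercept = coef(l)[1], slope = coef(l)[2], color='red')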

6. Compute the residuals

Use the linear regression from Q4 to make predictions for the number of LogL2 speakers for each point in the data set. Do this by using the equation of the line (mx + b) with the intercept and slope values that you found.

Identify 5 or 6 languages for which the prediction is farthest (in absolute value) from the true value. Which languages are these? If you have any ideas for why these languages have predictions that are so off, include them here (but that’s not required).

# Make predictions
d$Prediction = 3.19867 + 0.72619 * d$LogL1

d$Resid = d$LogL2 - d$Prediction
summary(d$Resid)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -4.096200 -1.254153  0.061221  0.000057  1.170091  4.508842
hist(d$Resid)

lang.resid <- d %>%
  mutate(abs.Resid = abs(Resid)) %>%
  select(Language, abs.Resid)
arrange(lang.resid, desc(abs.Resid))
## # A tibble: 66 x 2
##    Language         abs.Resid
##    <chr>                <dbl>
##  1 Makah                 4.51
##  2 Tiwi                  4.31
##  3 Araona                4.10
##  4 Swahili,Tanzania      3.82
##  5 Indonesian            3.31
##  6 Icelandic             3.11
##  7 Georgian              2.99
##  8 Hungarian             2.90
##  9 Abkhaz                2.82
## 10 Pitjantjatjara        2.61
## # … with 56 more rows

The languages for which the prediction is farthest from the true value are Makah, Tiwi, Araona, Swahili, Indonesian, and Icelandic. Makah and Tiwi, for instance, have no recorded L1 speakers at all (LogL1 = 0) but thousands of L2 speakers, so the regression badly underpredicts their LogL2.
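
Note that predict() and resid() return these same quantities directly from the fitted model; a quick cross-check of the hand-computed columns (any discrepancy comes only from rounding the typed-in coefficients):

# Differences should be tiny: they reflect only the rounding of the
# hand-typed intercept and slope
summary(d$Prediction - predict(l))
summary(d$Resid - resid(l))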

7. Make a prediction

Now, let’s say you want to make a prediction for a language which has 100,000 L1 speakers (which, in log space, is 11.51). Use the regression to make a prediction for that data point.

p = 3.19867 + 0.72619 * log(100000); p
## [1] 11.55924
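
The prediction lives in log space; to express it as a speaker count we undo the transform. (Since we logged count + 1, we subtract 1 after exponentiating; a rough conversion.)

# Back-transform the log-scale prediction to a count of L2 speakers;
# exp(11.55924) - 1 is roughly 105,000
exp(p) - 1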

8. Summarizing the relationship between L1 and L2

In 1-2 sentences, describe the relationship between the number of L1 speakers and the number of L2 speakers.

There is a strong positive linear relationship between the logged number of L1 speakers and the logged number of L2 speakers: languages with more L1 speakers tend to have more L2 speakers. (Since both variables are on the log scale, the slope of about 0.73 means that a 1% increase in L1 speakers is associated with roughly a 0.73% increase in L2 speakers.)

9. Now predict Case.Rank! First, use the mean as a model.

What we actually want to do is understand how the number and proportion of L2 speakers affect the variable Case.Rank.

First, let’s use the mean as a model.

Using the mean as a model (see Chapter 3 in the textbook), find the sum of the squared error for the variable Case.Rank.

d$mean.Prediction = mean(d$Case.Rank)
sse.mean = sum((d$Case.Rank - d$mean.Prediction)^2); sse.mean
## [1] 397.0303
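
As a cross-check, the sum of squared deviations from the mean is by definition (n - 1) times the sample variance, so the same value can be recovered directly from var():

# (n - 1) * var(x) equals sum((x - mean(x))^2), so this should
# reproduce 397.0303
var(d$Case.Rank) * (nrow(d) - 1)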

10. Can we do better by using language family?

Now, compute the mean of Case.Rank separately for each Language Family (which is called Stock in the data set). Then, compute the sum of the squared error and compare it to the value you found for Q9.

fam <-
  group_by(d, Stock) %>%
  summarise(mean.Case.Rank = mean(Case.Rank))
fam
## # A tibble: 26 x 2
##    Stock          mean.Case.Rank
##    <chr>                   <dbl>
##  1 Adamawa-Ubangi              0
##  2 Austroasiatic               0
##  3 Austronesian                0
##  4 Benue-Congo                 0
##  5 Chadic                      0
##  6 Cushitic                    3
##  7 Dravidian                   5
##  8 Indo-European               2
##  9 Japanese                    6
## 10 Kartvelian                  5
## # … with 16 more rows
# Predict each language's Case.Rank by the mean of its own family:
# join the family means back onto d, then sum the squared errors
d.fam = left_join(d, fam, by = "Stock")
sum((d.fam$Case.Rank - d.fam$mean.Case.Rank)^2)

The SSE from the family-based model is substantially lower than the mean-as-model SSE of 397.0303. This is expected: within each family, the family mean is the squared-error-minimizing constant prediction, so the grouped model can never do worse than the grand mean, and here the families clearly differ in Case.Rank.

11a. Prediction error for Mande

Using the model in Q10 (which uses language family to make a prediction), make a prediction for Case.Rank for the language Bambara, a language in the Mande language family. What is the prediction error? Below, discuss why it is potentially problematic to use this as a prediction. (HINT: look at how many languages there are in the Mande family in this data set.)

# Based on the model in Q10, Bambara should have Case.Rank 0
print(filter(fam, Stock == 'Mande'))
## # A tibble: 1 x 2
##   Stock mean.Case.Rank
##   <chr>          <dbl>
## 1 Mande              0
# Actual value
print(filter(d, Stock == 'Mande'))
## # A tibble: 1 x 15
##   Language Case.Cat Stock Region L1.Estimate L2.Estimate Percent.L2 Case.Rank
##   <chr>    <chr>    <chr> <chr>        <dbl>       <dbl>      <dbl>     <dbl>
## 1 Bambara  No morp… Mande Afric…     2772340    10000000      0.783         0
## # … with 7 more variables: Case.Presence <dbl>, LogL1 <dbl>, LogL2 <dbl>,
## #   all.speakers <dbl>, Prediction <dbl>, Resid <dbl>, mean.Prediction <dbl>

Based on the model in Q10, the mean Case.Rank for the Mande language family is 0, so we predict that Bambara has a Case.Rank of 0, and the prediction error is zero. This prediction is problematic because only one language in the Mande family is included in our data set: the “family mean” is just Bambara’s own value, so any prediction for a Mande language rests entirely on this single data point.
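
The hint can be checked directly by counting the Mande languages in the data set:

# How many Mande languages do we have? The single-row filter output
# above shows the answer is 1
sum(d$Stock == "Mande")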

11b. Predicting out of sample

Now, imagine I ask you to predict the Case.Rank for the language Mende, a language in the Mande family that is not in our data set. What would you predict using the mean as a model? What would you predict using the “language family” model in Q10? Discuss which prediction you think is better.

Using the mean as a model, I would predict a Case.Rank of about 2.12 for Mende. Using the language-family model from Q10, I would predict a Case.Rank of 0. Neither model is great: the grand mean pools languages from many unrelated families, which may tell us little about Mende, while the Mande family mean is based on a single language (Bambara). If forced to choose, the family-based prediction at least uses the most closely related evidence available, though with only one Mande language it is a very shaky estimate.

12. Using Percent.L2 to predict Case.Rank

Make a plot where you put Percent.L2 on the x-axis and Case.Rank on the y-axis.

ggplot(d, aes(x=Percent.L2, y=Case.Rank)) + geom_point()
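
Because Case.Rank takes only a handful of integer values, many points overplot. A variant with a little vertical jitter (purely cosmetic, not required by the question) can make the point cloud easier to read:

# Small random vertical offsets separate points stacked at the same
# Case.Rank; width = 0 leaves Percent.L2 untouched
ggplot(d, aes(x=Percent.L2, y=Case.Rank)) +
  geom_jitter(width = 0, height = 0.15)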

13. Regressing to predict Case.Rank

Run a linear regression where you predict Case.Rank from Percent.L2. Get the intercept and slope, and plot a line using geom_abline().

l2 = lm(data=d, Case.Rank ~ Percent.L2)
print(summary(l2))
## 
## Call:
## lm(formula = Case.Rank ~ Percent.L2, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4019 -1.7923  0.1361  1.6616  4.1427 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7425     0.4051   9.240 2.16e-13 ***
## Percent.L2   -4.9748     0.9599  -5.183 2.38e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.09 on 64 degrees of freedom
## Multiple R-squared:  0.2956, Adjusted R-squared:  0.2846 
## F-statistic: 26.86 on 1 and 64 DF,  p-value: 2.382e-06
ggplot(d, aes(x=Percent.L2, y=Case.Rank)) + 
  geom_point() +
  geom_abline(intercept = 3.7425, slope = -4.9748, color='navy')

The intercept is 3.7425 and the slope is -4.9748.

14. Assessing the fit

Get predictions from the linear model in Q13. Compute the sum of the squared error for predicting Case.Rank using this model. Now do the same for a model where you just use the mean of Case.Rank as the model (mean as a model!).

d$Prediction.l2 = predict(l2)

# Sum of the squared error for using the linear model
sse.l2.lm = sum((d$Prediction.l2 - d$Case.Rank)^2); sse.l2.lm
## [1] 279.6572
# Sum of the squared error for using the mean as a model
mean(d$Case.Rank) # Predicted value
## [1] 2.121212
sse.mean = sum((mean(d$Case.Rank) - d$Case.Rank)^2); sse.mean
## [1] 397.0303

15. Computing R2

Manually compute R-squared using the values in #14. Compare it to the R-squared that you get when running summary() on the regression lm() model.

# Manually computed R-squared
R2 = (1 - sse.l2.lm/sse.mean);
print(R2)
## [1] 0.2956275
# R-squared value in summary()
R2.s = summary(l2)$r.squared
print(R2.s)
## [1] 0.2956275

The two R-squared values (manually computed and extracted from summary()) are the same.
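
A third route to the same number: for a one-predictor regression, R-squared equals the squared correlation between the two variables.

# cor(x, y)^2 equals R-squared in simple linear regression;
# (-0.5437164)^2 reproduces 0.2956275
cor(d$Percent.L2, d$Case.Rank)^2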

16. Describe

Describe the relationship, in 2-3 sentences, between Percent.L2 and Case.Rank.

There is a negative linear relationship between Percent.L2 and Case.Rank: as Percent.L2 increases, the predicted Case.Rank decreases. When Percent.L2 is 0, the predicted Case.Rank is approximately 3.74 (even though Case.Rank itself is a discrete variable that only takes integer values, the regression line produces continuous predictions).

17. Looking at LogL1

Run a regression predicting Case.Rank from the log number of L1 speakers (LogL1). Is the predictor significant? What is the R-squared?

l.logl1 = lm(data=d, Case.Rank ~ LogL1)
summary(l.logl1)
## 
## Call:
## lm(formula = Case.Rank ~ LogL1, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.402 -2.158 -1.266  2.715  5.223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.40034    1.11963   1.251    0.216
## LogL1        0.04866    0.07272   0.669    0.506
## 
## Residual standard error: 2.482 on 64 degrees of freedom
## Multiple R-squared:  0.00695,    Adjusted R-squared:  -0.008567 
## F-statistic: 0.4479 on 1 and 64 DF,  p-value: 0.5057
# R-squared value
summary(l.logl1)$r.squared
## [1] 0.006949628
# For a single predictor, the magnitude of the correlation equals
# the square root of R-squared
sqrt(summary(l.logl1)$r.squared)
## [1] 0.08336443
cor(d$LogL1, d$Case.Rank)
## [1] 0.08336443

The R-squared value is 0.006949628 and the correlation coefficient is 0.08336443. The predictor is not significant: the p-value for LogL1 is 0.506, far above the conventional 0.05 threshold, so we fail to reject the null hypothesis that the slope is zero. Moreover, the model explains less than 1% of the variance in Case.Rank, so LogL1 is a poor predictor.

18. Looking at LogL2

Run a regression predicting Case.Rank from the log number of L2 speakers (LogL2). Is the predictor significant? What is the R-squared?

l.logl2 = lm(data=d, Case.Rank ~ LogL2)
summary(l.logl2)
## 
## Call:
## lm(formula = Case.Rank ~ LogL2, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.320 -1.907 -1.232  2.281  4.626 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  4.15750    1.21688   3.417  0.00111 **
## LogL2       -0.14591    0.08451  -1.727  0.08907 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 64 degrees of freedom
## Multiple R-squared:  0.0445, Adjusted R-squared:  0.02957 
## F-statistic: 2.981 on 1 and 64 DF,  p-value: 0.08907
# R-squared value
summary(l.logl2)$r.squared
## [1] 0.04450452
# Correlation coefficient (note the negative sign)
cor(d$LogL2, d$Case.Rank)
## [1] -0.2109609

The correlation coefficient is -0.2109609 and the R-squared value is 0.04450452, so LogL2 explains only about 4.5% of the variance in Case.Rank. The predictor is not significant at the conventional 0.05 level: its p-value is 0.08907, so the negative relationship is at best marginal.

19. Using the answers to 15, 17, and 18, discuss which variable (among LogL1, LogL2, Percent.L2) best predicts Case.Rank.

Among the three variables, Percent.L2 best predicts Case.Rank. It is a highly significant predictor (p = 2.382e-06), it has the strongest correlation with Case.Rank (r = -0.5437164), and its model explains about 30% of the variance (R-squared = 0.2956275). The other two models do far worse: LogL1 explains almost none of the variance (R-squared = 0.0069) and is not significant (p = 0.506), and LogL2 explains only about 4.5% (R-squared = 0.0445) and is marginal at best (p = 0.089).

# Predictor = Percent.L2
summary(l2)$r.squared
## [1] 0.2956275
cor(d$Percent.L2, d$Case.Rank)
## [1] -0.5437164
# Predictor = LogL1
summary(l.logl1)$r.squared
## [1] 0.006949628
cor(d$LogL1, d$Case.Rank)
## [1] 0.08336443
# Predictor = LogL2
summary(l.logl2)$r.squared
## [1] 0.04450452
cor(d$LogL2, d$Case.Rank)
## [1] -0.2109609

20. How long did you spend on this problem set? Any comments?

About 3 to 4 hours.