# Etymological clustering of Yiddish words

Yiddish is an interesting language not only because it is low-resource, but also because its lexicon is etymologically diverse, with words of Germanic, Slavic, and Semitic origin.

While a speaker can recognize the etymological origin of a word simply by looking at it, it would be interesting to do this automatically.

This motivates several interesting research questions:

1. Is it possible to classify Yiddish words accurately and generalizably, given some labeled training data?
2. What sort of feature function should we use? Is it possible to *learn* such a feature function using, say, a neural network?
    a. Do we seed the feature function with character counts, tf-idf weights, or random projections?
3. How well can we do with less training data? Is it possible to do this in an unsupervised way?
4. Is it possible to *jointly* learn the clustering and the feature embedding?
5. How meaningful are the discovered clusters?
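
To make the feature-function question concrete, here is a minimal sketch of one possible starting point: character-bigram tf-idf vectors compared by cosine similarity. It is an illustration only, not the method used in this repository. The function names (`char_ngrams`, `fit_idf`, `tfidf_vector`, `cosine`), the choice of bigrams, the boundary markers `^` and `$`, and the three romanized toy words are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(words, n=2):
    """Compute smoothed inverse document frequencies, treating each word as a document."""
    doc_freq = Counter()
    for word in words:
        doc_freq.update(set(char_ngrams(word, n)))
    total = len(words)
    return {gram: math.log((1 + total) / (1 + df)) + 1.0
            for gram, df in doc_freq.items()}


def tfidf_vector(word, idf, n=2):
    """Map a word to an L2-normalised sparse tf-idf vector (n-gram -> weight)."""
    # n-grams never seen when fitting the idf table are ignored.
    counts = Counter(g for g in char_ngrams(word, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


if __name__ == "__main__":
    # Toy romanized vocabulary, for illustration only:
    # "hoyz" and "moyz" are of Germanic origin, "shabes" of Semitic origin.
    vocab = ["hoyz", "moyz", "shabes"]
    idf = fit_idf(vocab)
    vectors = {w: tfidf_vector(w, idf) for w in vocab}
    for other in ("moyz", "shabes"):
        print("hoyz", other, cosine(vectors["hoyz"], vectors[other]))
```

On this toy vocabulary the two Germanic-origin words share several bigrams while the Semitic-origin word shares none, so surface features like these may already carry some etymological signal. Whether that holds on real data, and in Hebrew script rather than romanization, is exactly what the questions above ask.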