Etymological clustering of Yiddish words
Yiddish is an interesting language not only because it is low-resource, but also because it has an etymologically diverse lexicon, with words coming from Germanic, Slavic, and Semitic backgrounds.
While a speaker can recognize the etymological origin of a word simply by looking at it, we would like to do this automatically.
This motivates several research questions:
- Is it possible to classify Yiddish words accurately and generalizably, given some labeled training data?
- What sort of feature function should we use? Is it possible to learn such a feature function using, say, a neural network?
  a. Do we seed the feature function with character counts, tf-idf weights, or just use random projections?
- How well can we do with less training data? Is it possible to do this in an unsupervised way?
- Is it possible to jointly learn the clustering and the feature embedding?
- How meaningful are the discovered clusters?
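
To make the feature-function question concrete, here is a minimal sketch of one candidate mentioned above: tf-idf-weighted character n-gram counts, compared by cosine similarity. It is only an illustration under stated assumptions, not a chosen method. The function names (`char_ngrams`, `fit_idf`, `tfidf_vector`, `cosine`), the bigram size, the `^`/`$` boundary markers, the smoothed idf formula, and the four-word toy list are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, with ^ and $ marking its edges."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(words, n=2):
    """Compute a smoothed inverse document frequency for each n-gram.

    Each word is treated as one "document" (an assumption of this sketch).
    """
    doc_freq = Counter()
    for word in words:
        doc_freq.update(set(char_ngrams(word, n)))
    total = len(words)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(word, idf, n=2):
    """Map a word to a sparse, L2-normalized tf-idf vector over known n-grams."""
    counts = Counter(char_ngrams(word, n))
    vec = {g: c * idf[g] for g, c in counts.items() if g in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else vec


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(weight * v.get(g, 0.0) for g, weight in u.items())


# Toy word list, for illustration only; a real experiment needs a labeled lexicon.
words = ["ברויט", "הויז", "שבת", "אמת"]
idf = fit_idf(words)
vectors = {w: tfidf_vector(w, idf) for w in words}

for w in words[1:]:
    print(words[0], w, round(cosine(vectors[words[0]], vectors[w]), 3))
```

Vectors like these could be fed to a supervised classifier or to a clustering algorithm. Whether such surface features separate the etymological groups well is exactly the open question raised in the list above.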