From 5ad69bf8e1d1d5a359296613c8969a81ad743b7d Mon Sep 17 00:00:00 2001
From: Jonne Saleva
Date: Wed, 26 Feb 2020 18:36:44 -0500
Subject: .

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3f309e3
--- /dev/null
+++ b/README.md
@@ -0,0 +1,14 @@
+# Etymological clustering of Yiddish words
+
+Yiddish is an interesting language not only because it is low-resource, but also because it has an etymologically diverse lexicon, with words coming from Germanic, Slavic, and Semitic backgrounds.
+
+While it is possible for a speaker to recognize the etymological origin of a word by simply looking at it, it would be interesting to do this automatically.
+
+This motivates several interesting research questions:
+
+1. Is it possible to classify Yiddish words accurately and generalizably, given some labeled training data?
+2. What sort of feature function should we use? Is it possible to *learn* such a feature function using, say, a neural network?
+   a. Do we seed the feature function with character counts, tf-idf weights, or just use random projections?
+3. How well can we do with less training data? Is it possible to do this in an unsupervised way?
+4. Is it possible to *jointly* learn the clustering and the feature embedding?
+5. How meaningful are the discovered clusters?
-- 
cgit 1.4.1-2-gfad0