author | Jonne Saleva <jonne@jonnesaleva.com> | 2020-02-26 18:36:44 -0500 |
---|---|---|
committer | Jonne Saleva <jonne@jonnesaleva.com> | 2020-02-26 18:36:44 -0500 |
commit | 5ad69bf8e1d1d5a359296613c8969a81ad743b7d (patch) | |
tree | 5b31823ca9178f2f4032f83b7e3c97a67fdbcfc3 | |
download | yi-word-clustering-5ad69bf8e1d1d5a359296613c8969a81ad743b7d.tar.gz |
-rw-r--r-- | README.md | 14 |
1 files changed, 14 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3f309e3
--- /dev/null
+++ b/README.md
@@ -0,0 +1,14 @@
+# Etymological clustering of Yiddish words
+
+Yiddish is an interesting language not only because it is low-resource, but also because it has an etymologically diverse lexicon, with words coming from Germanic, Slavic, and Semitic backgrounds.
+
+While it is possible for a speaker to recognize the etymological origin of a word by simply looking at it, it would be interesting to do this automatically.
+
+This motivates several interesting research questions:
+
+1. Is it possible to classify Yiddish words accurately and generalizably, given some labeled training data?
+2. What sort of feature function should we use? Is it possible to *learn* such a training data using, say, a neural network?
+    a. Do we seed the feature function with character counts, tf-idf weights, or just use random projections?
+3. How well can we do with less training data? Is it possible to do this in an unsupervised way?
+4. Is it possible to *jointly* learn the clustering and the feature embedding?
+5. How meaningful are the discovered clusters?
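
Question 2a in the README mentions character counts as one possible seed for the feature function. A minimal sketch of that idea, using only the Python standard library, might look like the following. The function names `char_ngram_counts` and `cosine_similarity` are hypothetical and do not come from this repository, and the choice of bigrams with `^`/`$` boundary markers is an assumption made for illustration.

```python
from collections import Counter
import math


def char_ngram_counts(word: str, n: int = 2) -> Counter:
    """Count the character n-grams of a word.

    The word is padded with '^' and '$' so that word-initial and
    word-final n-grams are distinguishable (an illustrative choice).
    """
    padded = f"^{word}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder strings, not real Yiddish data.
    first = char_ngram_counts("abab")
    second = char_ngram_counts("abba")
    print(cosine_similarity(first, second))
```

Count vectors like these could serve as input to a classifier or a clustering algorithm; whether they separate etymological classes well is exactly what the research questions above leave open.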