author Jonne Saleva <jonne@jonnesaleva.com> 2020-02-26 18:36:44 -0500
committer Jonne Saleva <jonne@jonnesaleva.com> 2020-02-26 18:36:44 -0500
commit 5ad69bf8e1d1d5a359296613c8969a81ad743b7d
tree 5b31823ca9178f2f4032f83b7e3c97a67fdbcfc3
download yi-word-clustering-5ad69bf8e1d1d5a359296613c8969a81ad743b7d.tar.gz
-rw-r--r-- README.md 14
1 files changed, 14 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3f309e3
--- /dev/null
+++ b/README.md
@@ -0,0 +1,14 @@
+# Etymological clustering of Yiddish words
+
+Yiddish is an interesting language not only because it is low-resource, but also because it has an etymologically diverse lexicon, with words coming from Germanic, Slavic, and Semitic backgrounds. 
+
+While a speaker can often recognize the etymological origin of a word simply by looking at it, it would be interesting to do this automatically.
+
+This motivates several interesting research questions:
+
+1. Is it possible to classify Yiddish words accurately and generalizably, given some labeled training data?
+2. What sort of feature function should we use? Is it possible to *learn* such a feature function using, say, a neural network?
+    a. Do we seed the feature function with character counts, tf-idf weights, or just use random projections?
+3. How well can we do with less training data? Is it possible to do this in an unsupervised way?
+4. Is it possible to *jointly* learn the clustering and the feature embedding?
+5. How meaningful are the discovered clusters?
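
The README does not say how the "character counts" option in question 2a would be implemented. As one possible reading, the sketch below maps each word to a vector of character n-gram counts using only the standard library. The function names `char_ngram_counts` and `vectorize`, the `#` boundary marker, and the default `n=2` are all assumptions made for illustration and do not come from the repository.

```python
from collections import Counter


def char_ngram_counts(word: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in a word.

    The word is padded with '#' (a hypothetical boundary marker) so that
    word-initial and word-final characters get their own n-grams.
    """
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def vectorize(words, n: int = 2):
    """Turn a list of words into count vectors over a shared n-gram vocabulary.

    Returns the sorted vocabulary and one dense count vector per word.
    """
    counts = [char_ngram_counts(w, n) for w in words]
    vocab = sorted(set().union(*counts))
    vectors = [[c[gram] for gram in vocab] for c in counts]
    return vocab, vectors


# Toy usage on two placeholder strings (not real data).
vocab, vectors = vectorize(["ab", "b"], n=2)
```

The same functions work unchanged on words written in Hebrew script, since Python strings are sequences of Unicode code points. The resulting vectors could then seed a classifier or a clustering step, or be reweighted with tf-idf; which of these works best is exactly what the questions above leave open.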