Figuring out Rules for German Noun Genders with Simple Machine Learning and Statistics
When learning German, one of the most confusing features of the language is the noun gender system. In German, every noun has one of three genders (masculine/feminine/neuter), but unlike many other languages, these genders are seemingly not assigned based on any logical rule. Despite this, native German speakers as well as experienced German students are able to intuitively “guess” noun genders correctly. This led me to the logical conclusion that some underlying rules must exist. Furthermore, if humans can have an intuitive model of these rules, perhaps we can create a computer-based model, and figure out what these rules actually are!
This post is about my initial exploration of modeling German noun genders with some simple machine learning and statistics. I attempt to find some rules that might aid German learners (including myself) in figuring out noun genders.
I'll try to strike a middle ground between technical details and accessibility. If you'd like to skip all the technical stuff, you can jump to the juicy German gender rules!
What might the rule be?
The first step was to guess at what such a rule might be. When learning German, you're taught about certain suffixes which allow you to guess a noun's gender. For example, nouns ending in -keit, -heit, -ei, -ie, -ion are almost always feminine, while nouns ending in -ant are usually masculine, and those ending in -chen, -lein, -li are always neuter.
To generalise this, I decided to investigate the correlation between the last syllable of a word and that word's gender.
I realised I needed the following:
- A list of lots of commonly occurring German nouns
- A way to find the gender of all those nouns
- A way to split those nouns into syllables
For the noun list, I found a corpus from the Institute for the German Language in Mannheim called “DeReWo – Korpusbasierte Grund-/Wortformenlisten”. This provides a large list of words, complete with information on how common each word is, as well as the part of speech for every word. It was then trivial to single out common nouns and sort by frequency.
The above corpus does not, however, include the genders of those nouns. To get genders, I used the dict.leo.org database. This website is a fantastic reference for German, and they are kind enough to provide a download of their database under certain restrictions. After some quick work parsing the database, which is simply a text file, I looked up the genders of all words from the noun list.
For separating words into syllables, I used the excellent pyphen library.
At the end, I was left with tuples of the form (word, gender, syllables), for example ('Regel', 'f', 're-gel'), as well as a list of nouns ordered by frequency.
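To make the data shape concrete, here is a minimal sketch of how such tuples could be represented and how the last syllable can be pulled out of the hyphenated form. The `Noun` type and the `last_syllable` helper are illustrative names, not taken from the original code, and the sample records are written by hand rather than produced by pyphen.

```python
from typing import NamedTuple


class Noun(NamedTuple):
    """One record of the form (word, gender, syllables)."""
    word: str
    gender: str     # 'm', 'f' or 'n'
    syllables: str  # hyphenated form, e.g. 're-gel'


def last_syllable(noun: Noun) -> str:
    """Return the final syllable of the hyphenated form, lower-cased."""
    return noun.syllables.rsplit("-", 1)[-1].lower()


# Hand-written sample records (assumption: the real hyphenation comes from pyphen).
nouns = [
    Noun("Regel", "f", "re-gel"),
    Noun("Dokument", "n", "do-ku-ment"),
    Noun("Haus", "n", "haus"),
]

suffixes = [last_syllable(n) for n in nouns]
print(suffixes)  # ['gel', 'ment', 'haus']
```

A word with a single syllable simply yields itself, so every noun gets exactly one feature value.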
Can a computer guess noun genders? Confirming our theory
With all the data in place, I wanted to first figure out if a strong correlation between the last syllable of a noun and its gender exists at all. To do this, I used a decision tree classifier, as implemented in scikit-learn.
Why a decision tree? Well, some classifiers' internal models are easier to understand than others. Since I was less interested in classifying nouns, and more interested in understanding how such a classification is naturally done by humans, I needed a classifier I could look into, to understand how it's doing its job. A decision tree is well-suited for this. As for the data itself, I gave samples a single feature — the last syllable of the word.
To test the accuracy of this classification, I used a simple k-fold cross validation. So, was the last syllable of a noun enough to guess its gender? *drum roll*
Pretty much! Across all folds, the model guessed genders correctly around 88% of the time. It looks like we're on the right track here!
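For readers who want to see the shape of such an experiment without installing anything, here is a small standard-library sketch of k-fold cross validation. Note that it does not use scikit-learn's decision tree: as a stand-in, it uses a lookup table that predicts the most common gender seen for each last syllable, which is roughly what a tree can learn from a single categorical feature. The function names and the tiny sample list are assumptions made for illustration, and the number it prints has nothing to do with the 88% figure above.

```python
from collections import Counter, defaultdict


def train(samples):
    """Learn the most common gender per last syllable, plus a global fallback."""
    by_suffix = defaultdict(Counter)
    overall = Counter()
    for suffix, gender in samples:
        by_suffix[suffix][gender] += 1
        overall[gender] += 1
    table = {s: c.most_common(1)[0][0] for s, c in by_suffix.items()}
    fallback = overall.most_common(1)[0][0]
    return table, fallback


def k_fold_accuracy(samples, k):
    """Average accuracy over k folds; fold i holds out samples i, i+k, i+2k, ..."""
    scores = []
    for i in range(k):
        held_out = samples[i::k]
        training = [s for j, s in enumerate(samples) if j % k != i]
        table, fallback = train(training)
        # Unseen syllables fall back to the overall most common gender.
        hits = sum(table.get(suffix, fallback) == gender
                   for suffix, gender in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


# Toy (last syllable, gender) pairs, made up for illustration only.
samples = [
    ("heit", "f"), ("ger", "m"), ("ment", "n"), ("heit", "f"), ("ger", "m"),
    ("ment", "n"), ("heit", "f"), ("ger", "m"), ("ment", "m"), ("ment", "n"),
]

accuracy = k_fold_accuracy(samples, k=5)
print(f"{accuracy:.2f}")  # 0.90
```

The one miss comes from the fold that holds out the masculine -ment noun: the training data for that fold only contains neuter -ment nouns, so the lookup guesses neuter.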
Since I'm interested in better understanding how German works, it was now time to analyse the decision tree, to figure out how our program was making these decisions. The decision tree's model confirmed my suspicions, but I quickly gave up on inspecting it. Now that I knew the problem was feasible, it was much simpler to do some quick statistics over the suffixes, so that's how I went on.
Choosing the most useful suffixes
Of course, our classifier has created quite a complex model of all of these suffixes. As humans, while we may have just as complex a model in our heads, we can only intentionally learn and remember a few of these rules.
I therefore decided to look for the suffixes that made the most difference in the classification. Simply put:
- For each suffix, check how reliable it is. A suffix of -tät indicates with 100% confidence that the noun is feminine. However, -men only indicates the noun is neuter around 80% of the time, which means that 20% of the time the word can actually have a different gender.
- For each suffix, also check how many words appear with that suffix. 100% of nouns ending in -rer are masculine, but there are only 1,622 of them. On the other hand, the suffix -on can tell us the gender of 10,626 words, with 95.98% accuracy of being feminine.
- Weigh these two indicators together such that we get the best compromise between reliability and frequency, to get a “usefulness” score.
- Sort all suffixes by this score.
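The steps above can be sketched in a few lines of standard-library Python. The post does not spell out how reliability and frequency are weighed against each other, so the `usefulness` formula below (reliability times the logarithm of the word count) is purely an assumption chosen for illustration; the function names and the toy counts are also made up.

```python
import math
from collections import Counter, defaultdict


def suffix_stats(samples):
    """Map each suffix to (majority gender, reliability, number of words)."""
    by_suffix = defaultdict(Counter)
    for suffix, gender in samples:
        by_suffix[suffix][gender] += 1
    stats = {}
    for suffix, genders in by_suffix.items():
        gender, hits = genders.most_common(1)[0]
        total = sum(genders.values())
        stats[suffix] = (gender, hits / total, total)
    return stats


def usefulness(reliability, count):
    """Assumed weighting, not the post's actual formula."""
    return reliability * math.log(1 + count)


def rank_suffixes(samples):
    """Return suffixes sorted from most to least useful."""
    stats = suffix_stats(samples)
    return sorted(
        stats,
        key=lambda s: usefulness(stats[s][1], stats[s][2]),
        reverse=True,
    )


# Toy (suffix, gender) pairs with made-up counts.
samples = (
    [("ment", "n")] * 9 + [("ment", "m")]
    + [("rer", "m")] * 2
    + [("de", "f")] * 6 + [("de", "n")] * 4
)

ranking = rank_suffixes(samples)
print(ranking)  # ['ment', 'de', 'rer']
```

With this weighting, a perfectly reliable but rare suffix such as -rer in the toy data ranks below a less reliable but much more frequent one. A different formula would shift that balance, which is exactly the compromise the third step describes.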
I chose the top 50 most useful suffixes, and made a table out of them. At least when it comes to suffixes, these are the rules that will help German learners guess genders correctly most often.
Without further ado, the results:
|Suffix|Example|Gender|Reliability|Number of words with suffix|
|---|---|---|---|---|
|-de|die Wunde (not das Gebäude)|Feminine|80.74%|5,238|
|-ger|der Staubsauger (not das Lager)|Masculine|91.96%|3,918|
|-der|der Salamander (not das Leder)|Masculine|79.22%|2,372|
|-ment|das Dokument (not der Zement)|Neuter|94.23%|1,837|
You can access the table as a Google Sheet as well.
Okay, but how useful is this?
Great, we have a list of really useful suffixes! Does this mean we can guess genders as successfully as the computer can?
Actually, no, not at all. Like I said before, these few rules are much simpler than the model our classifier created above. This means that accuracy will also be much lower. To find out how much lower, I wrote some code to measure how often you would guess right, if you tried to guess genders using only the table above. I then ran it for all 30,215 words in the corpus. The results?
Using only the table above, you can know the noun genders of 31.17% of the most frequent 30,215 German words, with 94.27% accuracy. The other 68.83% of words are not covered by the suffixes above, so you would still have to “guess” them.
Okay, 31.17% coverage might not solve your noun gender troubles for good. Still, German learners are (correctly) taught to learn every noun along with its gender. All of these genders are hard to keep track of, and if we can ease this process for even a quarter of nouns, I think that's already a great learning aid.
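The two figures above, how many nouns the table covers and how often it is right on those it covers, can be measured along these lines. This is a sketch under assumptions: it matches rules against a noun's last syllable, the `evaluate_rules` name is made up, the rules are just the four rows shown in the table, and the noun list is toy data.

```python
def evaluate_rules(rules, nouns):
    """Return (coverage, accuracy on covered nouns).

    rules: dict mapping a suffix to its predicted gender.
    nouns: list of (last syllable, true gender) pairs.
    """
    covered = [(s, g) for s, g in nouns if s in rules]
    if not covered:
        return 0.0, 0.0
    correct = sum(rules[s] == g for s, g in covered)
    return len(covered) / len(nouns), correct / len(covered)


# The four suffix rules shown in the table above.
rules = {"de": "f", "ger": "m", "der": "m", "ment": "n"}

# Toy (last syllable, gender) pairs, made up for illustration.
nouns = [
    ("de", "f"), ("de", "n"), ("ger", "m"), ("ment", "n"),
    ("haus", "n"), ("tisch", "m"), ("buch", "n"), ("baum", "m"),
]

coverage, accuracy = evaluate_rules(rules, nouns)
print(f"coverage {coverage:.0%}, accuracy {accuracy:.0%}")
# coverage 50%, accuracy 75%
```

Keeping the two numbers separate matters: accuracy is only computed over the nouns a rule applies to, so a short table can be very accurate and still leave most words uncovered.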
There are, of course, other factors to noun gender, but as far as suffixes go, I'm pretty happy with what I've learned from this analysis.
If you'd like to have a poke around the code, it's available on GitHub. However, I haven't been able to include the dictionary data along with it for copyright reasons. In particular, the dict.leo.org data must be requested separately.
Do you have any ideas as to how this approach could be improved? Do you have any experience with analysing German grammar? I'd love to hear about it, so feel free to write to me. I hope you've found some of the information here useful, and for those of you also learning German, good luck going forward!