Figuring out Rules for German Noun Genders with Simple Machine Learning and Statistics

When learning German, one of the most confusing features of the language is the noun gender system. In German, every noun has one of three genders (masculine/feminine/neuter), but unlike many other languages, these genders are seemingly not assigned based on any logical rule. Despite this, native German speakers as well as experienced German students are able to intuitively “guess” noun genders correctly. This led me to the logical conclusion that some underlying rules must exist. Furthermore, if humans can have an intuitive model of these rules, perhaps we can create a computer-based model, and figure out what these rules actually are!

This post is about my initial exploration of modeling German noun genders with some simple machine learning and statistics. I attempt to find some rules that might aid German learners (including myself) in figuring out noun genders.

I’ll try to strike a middle ground between technical details and accessibility. If you’d like to skip all the technical stuff, you can jump to the juicy German gender rules!

What might the rule be?

The first step was to guess at what such a rule might be. When learning German, you’re taught about certain suffixes that let you guess a noun’s gender. For example, nouns ending in -keit, -heit, -ei, -ie, -ion are always feminine, while nouns ending in -ant are usually masculine, and those ending in -chen, -lein, -li are always neuter.

To generalise this, I decided to investigate the correlation between the last syllable of a word and that word’s gender.

Gathering data

['Gegenwörter', 'pl', 'Ge-gen-wör-ter'],
['Gegenzahl', 'f', 'Ge-gen-zahl'],
['Gegenzahn', 'm', 'Ge-gen-zahn'],
['Gegenzauber', 'm', 'Ge-gen-zau-ber'],
['Gegenzeichen', 'n', 'Ge-gen-zei-chen'],
['Gegenzeichnung', 'f', 'Ge-gen-zeich-nung'],
['Gegenzelle', 'f', 'Ge-gen-zel-le'],
['Gegenzeuge', 'm', 'Ge-gen-zeu-ge'],
['Gegenzinnenbalken', 'm', 'Ge-gen-zin-nen-bal-ken'],
['Gegenzugkraft', 'f', 'Ge-gen-zug-kraft'],
['Gegenzugrollo', 'n', 'Ge-gen-zu-grol-lo'],
['Gegenzug', 'm', 'Ge-gen-zug'],
['Gegenäußerung', 'f', 'Ge-ge-n-äu-ße-rung'],
['Gegenöffentlichkeit', 'f', 'Ge-gen-öf-fent-lich-keit'],
['gegenüberliegende Seiten', 'pl', 'Sei-ten'],
['gegenüberliegende Seite', 'f', 'Sei-te'],
['gegenüberliegendes Gebäude', 'n', 'Ge-bäu-de'],
['Gegenübernahmeangebot', 'n', 'Ge-gen-über-nah-me-an-ge-bot'],
['Gegenüberstellung', 'f', 'Ge-gen-über-stel-lung'],
['Gegenübertragung', 'f', 'Ge-gen-über-tra-gung'],
['Gegenüberwachung', 'f', 'Ge-gen-über-wa-chung'],
['Gegenüber', 'n', 'Ge-gen-über'],
['gegerbtes Leder', 'n', 'Le-der'],

A short snippet of our noun list

I realised I needed the following:

A list of lots of commonly occurring German nouns
A way to find the gender of all those nouns
A way to split those nouns into syllables

For the noun list, I found a corpus from the Institute for the German Language in Mannheim called “DeReWo – Korpusbasierte Grund-/Wortformenlisten”. This provides a large list of words, complete with information on how common each word is, as well as the part of speech for every word. It was then trivial to single out common nouns and sort by frequency.

The above corpus does not, however, include the genders of those nouns. To get genders, I used the dict.leo.org database. This website is a fantastic reference for German, and they are kind enough to provide a download of their database under certain restrictions. After some quick work parsing the database, which is simply a text file, I looked up the genders of all words from the noun list.

For separating words into syllables, I used the excellent pyphen library.

At the end, I was left with tuples of the form (word, gender, syllables), for example ('Regel', 'f', 're-gel'), as well as a list of nouns ordered by frequency.
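Since the rest of the analysis only looks at the last syllable, here is a minimal sketch of how that feature could be pulled out of such tuples. The helper name `last_syllable` and the lowercasing are my own illustrative choices, not necessarily what the original code does.

```python
# Hypothetical helper: the name and the lowercasing are illustrative assumptions.
def last_syllable(syllables: str) -> str:
    """Return the final syllable of a hyphen-separated string, lowercased."""
    return syllables.rsplit("-", 1)[-1].lower()


# Sample tuples in the (word, gender, syllables) form, taken from the snippet above.
nouns = [
    ("Gegenzahl", "f", "Ge-gen-zahl"),
    ("Gegenzeichnung", "f", "Ge-gen-zeich-nung"),
    ("Gegenüber", "n", "Ge-gen-über"),
    ("gegerbtes Leder", "n", "Le-der"),
]

# One (last syllable, gender) pair per noun: the only input the later steps need.
features = [(last_syllable(syllables), gender) for _, gender, syllables in nouns]
print(features)
```

Words with a single syllable simply come back whole, since there is no hyphen to split on.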

Can a computer guess noun genders? Confirming our theory

=== Running cross-validation
>>> 454853 nouns (samples)
    201563 f
    158722 m
    94566 n
    1 sg
    1 sg.
>>> Running 10 splits
> Fold 0, accuracy = 88.95%
> Fold 1, accuracy = 87.13%
> Fold 2, accuracy = 89.29%
> Fold 3, accuracy = 87.74%
> Fold 4, accuracy = 87.97%
> Fold 5, accuracy = 88.61%
> Fold 6, accuracy = 88.41%
> Fold 7, accuracy = 88.05%
> Fold 8, accuracy = 88.63%
> Fold 9, accuracy = 89.7%
> Average accuracy: 88.45%

>>>>> Finished run: (1400, 0.7, 50, 10) 88.45%

A sample of our testing output

With all the data in place, I wanted to first figure out if a strong correlation between the last syllable of a noun and its gender exists at all. To do this, I used a decision tree classifier, as implemented in scikit-learn.

Why a decision tree? Well, some classifiers’ internal models are easier to understand than others. Since I was less interested in classifying nouns, and more interested in understanding how such a classification is naturally done by humans, I needed a classifier I could look into, to understand how it’s doing its job. A decision tree is well-suited for this. As for the data itself, I gave samples a single feature — the last syllable of the word.

To test the accuracy of this classification, I used a simple k-fold cross validation. So, was the last syllable of a noun enough to guess its gender? (drum roll)

Pretty much! In every fold, the model guessed the gender correctly between 87% and 90% of the time, for an average of 88.45%. It looks like we’re on the right track here!
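To make the evaluation procedure concrete, here is a rough sketch of k-fold cross-validation in plain Python. Note that this is not the scikit-learn decision tree used for the numbers above: as a stand-in, it predicts the most common gender seen for each last syllable, which is roughly what a tree would learn from this one feature. All function names and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def fit(samples):
    """Learn the majority gender per last syllable, plus a global fallback."""
    by_suffix = defaultdict(Counter)
    overall = Counter()
    for suffix, gender in samples:
        by_suffix[suffix][gender] += 1
        overall[gender] += 1
    table = {s: c.most_common(1)[0][0] for s, c in by_suffix.items()}
    fallback = overall.most_common(1)[0][0]
    return table, fallback


def predict(model, suffix):
    """Look up the learned gender; unseen syllables get the overall majority."""
    table, fallback = model
    return table.get(suffix, fallback)


def k_fold_accuracy(samples, k=10, seed=0):
    """Shuffle once, split into k folds, and return one accuracy per fold."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = fit(train)
        hits = sum(predict(model, suffix) == gender for suffix, gender in test)
        scores.append(hits / len(test))
    return scores


# Tiny made-up data set, for illustration only: (last syllable, gender).
samples = [("rung", "f"), ("keit", "f"), ("ler", "m"), ("chen", "n")] * 25
scores = k_fold_accuracy(samples, k=10)
print(f"Average accuracy: {sum(scores) / len(scores):.2%}")
```

The important property is that each noun is tested exactly once, by a model that never saw it during training.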

Since I’m interested in better understanding how German works, the next step was to analyse the decision tree and figure out how the program was making these decisions. The tree had confirmed my suspicions, but I quickly gave up on inspecting it: now that I knew the problem was feasible, it was much simpler to run some quick statistics over the suffixes, so that’s what I did.

Choosing the most useful suffixes

Suffix	Example	Reliability	Nr. words
-on	die Funktion	95.98% feminine	10,626
-tät	die Universität	100% feminine	2,513
-rer	der Maurer	100% masculine	1,622
-men	das Unternehmen	79.92% neuter	1,589

A few of the suffixes mentioned

Of course, our classifier has created quite a complex model of all of these suffixes. As humans, while we may have just as complex a model in our heads, we can only intentionally learn and remember a few of these rules.

I therefore decided to look for the suffixes that made the most difference in the classification. Simply put:

For each suffix, check how reliable it is. A suffix of -tät indicates with 100% confidence that the noun is feminine. However, -men only indicates that the noun is neuter around 80% of the time, which means that 20% of the time the word has a different gender.
For each suffix, also check how many words appear with that suffix. 100% of nouns ending in -rer are masculine, but there are only 1,622 of them. On the other hand, the suffix -on can tell us the gender of 10,626 words, with 95.98% accuracy of being feminine.
Weigh these two indicators together such that we get the best compromise between reliability and frequency, to get a “usefulness” score.
Sort all suffixes by this score.
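The steps above can be sketched roughly as follows. Reliability and word count follow directly from the description, but the weighting is a guess on my part: the `usefulness` function and its `exponent` parameter are hypothetical and only illustrate one way to trade reliability against frequency. The toy data is made up.

```python
from collections import Counter, defaultdict


def suffix_stats(samples):
    """Return {suffix: (majority gender, reliability, word count)}."""
    counts = defaultdict(Counter)
    for suffix, gender in samples:
        counts[suffix][gender] += 1
    stats = {}
    for suffix, genders in counts.items():
        gender, hits = genders.most_common(1)[0]
        total = sum(genders.values())
        stats[suffix] = (gender, hits / total, total)
    return stats


def usefulness(reliability, count, exponent=4):
    # ASSUMPTION: this formula is hypothetical, not the one used for the table.
    # Raising reliability to a power penalises unreliable suffixes more strongly.
    return count * reliability ** exponent


# Made-up (last syllable, gender) pairs, for illustration only.
samples = (
    [("tät", "f")] * 5
    + [("men", "n")] * 8 + [("men", "m")] * 2
    + [("on", "f")] * 19 + [("on", "n")]
)

stats = suffix_stats(samples)
ranked = sorted(
    stats,
    key=lambda s: usefulness(stats[s][1], stats[s][2]),
    reverse=True,
)
for suffix in ranked:
    gender, reliability, count = stats[suffix]
    print(f"-{suffix}\t{gender}\t{reliability:.2%}\t{count}")
```

With a larger exponent, perfectly reliable but rare suffixes move up the ranking; with a smaller one, frequent but sloppier suffixes win.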

Results

I chose the 50 most useful suffixes, and made a table out of them. At least when it comes to suffixes, these are the rules that will help German learners guess genders correctly most often.

Without further ado, the results:

Suffix	Example	Gender	Reliability	Nr. words with suffix
-on	die Funktion	Feminine	95.98%	10,626
-te	die Ernte	Feminine	96.39%	10,292
-se	die Achse	Feminine	91.96%	9,671
-rung	die Störung	Feminine	99.94%	8,853
-le	die Schule	Feminine	95.08%	7,962
-tung	die Leistung	Feminine	99.92%	6,352
-ge	die Menge	Feminine	85.4%	7,430
-keit	die Höflichkeit	Feminine	99.92%	5,210
-lung	die Vorstellung	Feminine	99.94%	5,085
-ne	die Birne	Feminine	96.93%	5,180
-ler	der Maler	Masculine	99.54%	4,343
-chen	das Mädchen	Neuter	91.5%	4,716
-de	die Wunde (not das Gebäude)	Feminine	80.74%	5,238
-be	die Scheibe	Feminine	86.62%	4,640
-rin	die Holländerin	Feminine	96.04%	4,167
-ger	der Staubsauger (not das Lager)	Masculine	91.96%	3,918
-gung	die Bewilligung	Feminine	99.92%	3,533
-nung	die Ahnung	Feminine	99.9%	3,132
-dung	die Entzündung	Feminine	99.71%	3,100
-re	die Himbeere	Feminine	94.24%	3,211
-heit	die Schönheit	Feminine	99.83%	2,917
-pe	die Klappe	Feminine	98.11%	2,851
-ner	der Türöffner	Masculine	98.8%	2,751
-che	die Wäsche	Feminine	97.18%	2,588
-tät	die Mobilität	Feminine	100.0%	2,513
-cke	die Decke	Feminine	99.67%	2,436
-mus	der Kapitalismus	Masculine	98.89%	2,428
-gel	der Engel	Masculine	80.75%	2,857
-ze	die Kerze	Feminine	95.83%	2,397
-fer	der Pfeffer	Masculine	82.32%	2,703
-schaft	die Freundschaft	Feminine	97.38%	2,252
-me	die Blume	Feminine	88.46%	2,469
-tur	die Agentur	Feminine	99.54%	2,194
-ling	der Säugling	Masculine	95.29%	2,080
-tem	das System	Neuter	99.01%	1,919
-der	der Salamander (not das Leder)	Masculine	79.22%	2,372
-ment	das Dokument (not der Zement)	Neuter	94.23%	1,837
-tik	die Gymnastik	Feminine	99.0%	1,695
-um	das Eigentum	Neuter	99.58%	1,649
-sung	die Überweisung	Feminine	100.0%	1,636
-rer	der Maurer	Masculine	100.0%	1,622
-ren	das Verfahren	Neuter	96.03%	1,612
-zung	die Verletzung	Feminine	100.0%	1,531
-nie	die Linie	Feminine	98.55%	1,452
-stand	der Zustand	Masculine	99.86%	1,400
-fe	die Reife	Feminine	84.95%	1,628
-ber	der Zauber	Masculine	83.49%	1,599
-ke	die Wolke	Feminine	83.12%	1,540
-men	das Unternehmen	Neuter	79.92%	1,589
-nis	das Verständnis	Neuter	82.74%	1,466

A table of the 50 most useful suffix rules for German noun genders

You can access the table as a Google Sheet as well.

Okay, but how useful is this?

Great, we have a list of really useful suffixes! Does this mean we can guess genders as successfully as the computer can?

Actually, no, not at all. Like I said before, these few rules are much simpler than the model our classifier created above. This means that accuracy will also be much lower. To find out how much lower, I wrote some code to measure how often you would guess right, if you tried to guess genders using only the table above. I then ran it for all 30,215 words in the corpus. The results?

Using only the table above, you can know the noun genders of 31.17% of the most frequent 30,215 German words, with 94.27% accuracy. The other 68.83% of words are not covered by the suffixes above, so you would still have to “guess” them.
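For the curious, a measurement like this can be sketched in a few lines. This is an illustration under my own assumptions, not the exact code I ran: the function name `evaluate_rules` is made up, the rules are matched against the last syllable of each noun, and the rule set and word list are tiny samples.

```python
def evaluate_rules(rules, nouns):
    """Return (coverage, accuracy) for a {last syllable: gender} rule table.

    coverage: share of nouns whose last syllable appears in the table.
    accuracy: share of those covered nouns whose gender the table gets right.
    """
    covered = correct = 0
    for _word, gender, syllables in nouns:
        suffix = syllables.rsplit("-", 1)[-1].lower()
        guess = rules.get(suffix)
        if guess is None:
            continue  # no rule applies, so this noun is not covered
        covered += 1
        correct += guess == gender
    coverage = covered / len(nouns)
    accuracy = correct / covered if covered else 0.0
    return coverage, accuracy


# A handful of rules from the table, and a made-up test list.
rules = {"rung": "f", "keit": "f", "chen": "n", "ler": "m"}
nouns = [
    ("Störung", "f", "Stö-rung"),
    ("Höflichkeit", "f", "Höf-lich-keit"),
    ("Mädchen", "n", "Mäd-chen"),
    ("Maler", "m", "Ma-ler"),
    ("Kuchen", "m", "Ku-chen"),  # covered by -chen, but the rule guesses wrong
    ("Regel", "f", "Re-gel"),
    ("Zug", "m", "Zug"),
    ("Haus", "n", "Haus"),
]

coverage, accuracy = evaluate_rules(rules, nouns)
print(f"coverage = {coverage:.2%}, accuracy = {accuracy:.2%}")
```

Multiplying the two numbers gives the share of all nouns you would get right from the table alone. With the figures above, that is 0.3117 × 0.9427, or roughly 29.4% of the word list.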

Conclusion

Okay, covering 31.17% of nouns might not solve your noun gender troubles for good. Still, German learners are (correctly) taught to learn every noun along with its gender. All of these genders are hard to keep track of, and if we can ease this process for even a quarter of nouns, I think that’s already a great learning aid.

There are, of course, other factors to noun gender, but as far as suffixes go, I’m pretty happy with what I’ve learned from this analysis.

If you’d like to have a poke around the code, it’s available on SourceHut. However, I haven’t been able to include the dictionary data along with it for copyright reasons. In particular, one must specifically request the dict.leo.org data.

Do you have any ideas as to how this approach could be improved? Do you have any experience with analysing German grammar? I’d love to hear about it, so feel free to write to me. I hope you’ve found some of the information here useful, and for those of you also learning German, good luck going forward!