gen - language text generator

Gen Help

I’ve long advocated either hand-crafting every word, or using the Sound Change Applier to derive families. But inspiration does flag, and sometimes you want to use a vocabulary generator.

The usual problem with these is that they make all the possibilities equiprobable, which is highly unnaturalistic. So I’ve created a generator called gen, which applies a cheap power law: the first choice is chosen most often, and so on down, smoothly, to the last choice, which gets chosen the least.

Example

Go try it! With the default settings, you’ll get a pseudo-text, like this:

A gatri tu te ee kope. Eudrotri pli ki itupe ki ii. Obudrotia peke pi tea pi pi? Atoi pi ka iekribe eupi ape? Kle dru iplo ki gipotu i. Pi ke brikaibe ble do brou. Ta glikipro e teakatre piu u. Be kipe pa pa pi tepipliita. Tikiii topu epatripu i o po? Pe uuta dru opi gii ki. Ti pepate bi bi a e? I gia kitidu eproplu ple kitle. Kii pitre ko e iipoga a. E o popate ku kritra pi. Tu pe titepee dro kee ekiplu. Ti ti ki te gra a. Ia tle biitapo oi pri epoi? Ti opikli be po betle e. Igribliia tipi tloka ple ko plubla. Ge pita tidleki to pri ti. Ategoki e a plu topipi kiipe. Priklu kro ai tepeplea pu e. A tapa kite pubo ti du. Bipro begitebi kaaete gi tipo ko. E kipretopua pika glotro di bu. Pepe tebo iikepoplo i tru gi. Gike da e ipia tripi ia. Bi bikli pate dlite e dligu? I pididi kra pabaka e o. Ipoidipi a ti i ba geka!

Run it again for an entirely different text. This output format is designed to simulate what your language might look like.

The controls

Try it with different settings. Here’s what they do.

Output type tells whether you want pseudo-text, or a table of a hundred words. Pseudo-text is better for seeing what your language looks like, given the phonology and syllable types you’ve defined. Once you’re happy with the look and feel of the language, the word list is better for actually generating vocabulary.

The format All possible syllables will output a list of, well, all possible syllables. Note that this option ignores the Dropoff and Monosyllables controls: it is not random at all, and it shows only single syllables.
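
As a rough illustration of what "all possible syllables" means, here is a minimal sketch that expands every syllable type against every category, with no randomness involved. It is not gen's actual source: the function names parseCategories and allSyllables are made up for this example, and the order in which gen itself lists the syllables may differ.

```javascript
// Hypothetical helpers for illustration; gen's real internals are not shown in this help file.

// Turn lines like "C=ptk" into { C: ["p", "t", "k"] }.
function parseCategories(text) {
  const cats = {};
  for (const line of text.split("\n")) {
    const eq = line.indexOf("=");
    if (eq > 0) {
      cats[line.slice(0, eq).trim()] = Array.from(line.slice(eq + 1).trim());
    }
  }
  return cats;
}

// Build every syllable that each type can produce (a Cartesian product per type).
function allSyllables(types, cats) {
  const out = [];
  for (const type of types) {
    let partial = [""];
    for (const sym of Array.from(type)) {
      // Symbols without a category definition are passed through unchanged.
      const choices = cats[sym] || [sym];
      partial = partial.flatMap(prefix => choices.map(ch => prefix + ch));
    }
    out.push(...partial);
  }
  return out;
}

const cats = parseCategories("C=ptk\nV=ai");
const syllables = allSyllables(["CV", "V"], cats);
console.log(syllables.join(" ")); // pa pi ta ti ka ki a i
```

The count grows quickly: each type contributes the product of its category sizes, so a few large categories and a long type can produce a very long list.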

Show syllables will display a dot between syllables in the output. To gen, a syllable is whatever you put in "Syllable types"!

Dropoff determines how fast the power law declines. If you have C=ptkbdg, then when outputting a C, normally p will come up the most, t a little less often, and so on, with g the least frequent. If you select fast dropoff, the probabilities will stack even more in favor of p (i.e. the first choice). If you select slow, the probabilities will distribute more evenly.

To turn off the power law entirely select Equiprobable; then gen will select the choices with equal frequency. (Again, this is a bad choice for a naturalistic language. But maybe you’re doing an auxlang or something.)

The Dropoff control doesn't affect the selection of syllable types. However, you can choose a more even distribution by checking Slow syllable dropoff.

Monosyllables tells gen how much of the output should be monosyllabic. You could set this to Always for an isolating language, for instance. (Even isolating languages have compounds, so if you want to generate words or text, use Mostly.)

Generate generates a new text.

Clear erases the output. (This isn’t necessary but it’s provided for neatness’ sake.)

Help me! brings up this help file.

IPA gives you a display of IPA symbols which you can cut and paste into any field.

Defaults cycles through some default parameters to help you get started or inspired.

The categories

These are your phonological classes, defined by enumeration. The format is exactly the same as used by the SCA.

For instance, I might define my fricatives like this:

F=fvszšž

That means that any time gen wants to output an F from the syllables list, it will randomly pick one of f, v, s, z, š, ž.

As you can see, you can use Unicode! The phonemes in a category have to be single characters, but we’ll see how to output digraphs below.

The key thing to grasp is that the order determines the probability. The program runs through the phonemes in a category, with a 30% chance of stopping at each one.* So the F definition above says that we want f to occur a lot and ž not that much.

* That is, 30% for the recommended Medium dropoff. It’s 45% for Fast and 15% for Slow. Also, for computation speed, if it gets to the end of the choices it starts over.
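
The selection procedure just described can be sketched in a few lines. This is a minimal sketch based only on the description above, not gen's actual code; the names DROPOFF, pick and tally are assumptions, and the Equiprobable setting is not covered here.

```javascript
// Stop probabilities quoted in the text; the object and key names are made up.
const DROPOFF = { fast: 0.45, medium: 0.30, slow: 0.15 };

// Walk the choices in order, stopping at each one with probability `stop`.
// If we run off the end, start over at the first choice.
// `stop` must be greater than 0, or the loop never ends.
function pick(choices, stop, rng = Math.random) {
  let i = 0;
  while (rng() >= stop) {
    i = (i + 1) % choices.length;
  }
  return choices[i];
}

// Count how often each choice comes up over many trials.
function tally(choices, stop, trials, rng = Math.random) {
  const counts = {};
  for (const c of choices) counts[c] = 0;
  for (let n = 0; n < trials; n++) {
    counts[pick(choices, stop, rng)]++;
  }
  return counts;
}

// With F=fvszšž at Medium dropoff, f should dominate and ž should be rare.
console.log(tally(Array.from("fvszšž"), DROPOFF.medium, 10000));
```

The rng parameter is there only so the random source can be swapped out for testing; by default it is Math.random.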

The main corollary: Put the sounds you like first! Don’t list them in place of articulation order unless you really like labials. Try varying the order and hitting Generate to see how changing the order changes the output.
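
To see how much the order matters, you can work out the probabilities directly. The sketch below assumes the procedure is exactly as described (an independent chance of stopping at each phoneme, wrapping around at the end); under that assumption, the phoneme at position k (counting from 0) out of n has probability stop × (1 − stop)^k / (1 − (1 − stop)^n). The function name choiceProbabilities is made up for this example.

```javascript
// Probability of each position in a list of n choices, assuming an independent
// `stop` chance at every step and wraparound at the end of the list.
function choiceProbabilities(n, stop) {
  const q = 1 - stop;               // chance of moving on to the next choice
  const norm = 1 - Math.pow(q, n);  // accounts for wrapping around the list
  return Array.from({ length: n }, (_, k) => (stop * Math.pow(q, k)) / norm);
}

const phonemes = Array.from("fvszšž");
const probs = choiceProbabilities(phonemes.length, 0.30); // Medium dropoff

phonemes.forEach((ph, k) => {
  console.log(ph, (100 * probs[k]).toFixed(1) + "%");
});
```

For the F=fvszšž example at Medium dropoff this works out to roughly 34% for f, falling to a little under 6% for ž.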

Don’t overdo the classes— gen doesn’t know any phonology, and will be perfectly happy with a single class C for all consonants. You define a class for two reasons:

To control probabilities. E.g. we usually want stops to occur more than fricatives.
To enforce phonotactics. E.g. if the only initial clusters you allow are stop + liquid, then you need classes for stops and liquids.

The syllable types

The Syllable types field defines your phonotactics... your allowed syllable types. E.g. the sample above is defined with these syllables:

CV
V
CRV

The syllable types also follow a power law, so put the ones you like first. Or to be precise, if you want a particular type to be more common, move it up in the list.

Put just one syllable type per line. (Otherwise gen will treat whatever you put on one line as a single syllable type.)

In general, more complex types should occur further on. However, I find that pure vowel syllables (like V in the example) should be less frequent than ones that begin with a consonant.

The process does not handle parentheses. So if you have a syllable type like (C(R))V(V)(N), you must list the possibilities— in this case, V, VV, VN, VVN, CV, CVV, CVN, CVVN, CRV, CRVV, CRVN, CRVVN. This is a good thing! ...because it allows you to set the relative probabilities of each syllable type. (How do you decide on the order? Trial and error works fine. Change the order and hit Generate again. Repeat till it looks good.)
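
If writing out the possibilities by hand gets tedious, a small script can do the expansion for you. This is a sketch of a helper you could run separately, not a feature of gen; the name expandOptional is made up, and it assumes single-character symbols and properly balanced parentheses.

```javascript
// Expand the optional (parenthesized) parts of a syllable pattern
// into the plain syllable types that gen expects.
function expandOptional(pattern) {
  let pos = 0;

  // Parse symbols until ')' or the end of the pattern,
  // returning every string this stretch can produce.
  function sequence() {
    let results = [""];
    while (pos < pattern.length && pattern[pos] !== ")") {
      let options;
      if (pattern[pos] === "(") {
        pos++;                          // skip '('
        options = ["", ...sequence()];  // the group may be absent or present
        pos++;                          // skip ')'
      } else {
        options = [pattern[pos++]];     // a plain symbol has just one option
      }
      results = results.flatMap(r => options.map(o => r + o));
    }
    return results;
  }

  return [...new Set(sequence())];      // drop any duplicates
}

console.log(expandOptional("(C(R))V(V)(N)").join("\n"));
```

For (C(R))V(V)(N) this produces the same twelve types listed above, though not in the same order. That is fine, since the order is the part you have to decide for yourself anyway.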

The symbols you use here (in the last example C V R N) should be defined in the categories box— they are your phonological classes.

So when gen needs to generate a syllable, it selects randomly from the syllable types. Let’s say it picks CRV. Now it looks up C in the Categories box. Suppose it finds the definition C=ptkbdg. It randomly picks one of those choices. Then it moves on to R, then V. And so on.

If there are any undefined symbols, they will be passed through to the output. E.g. you could add a syllable khV and gen will cheerfully generate khe, khi, etc.
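
Putting those steps together, here is a minimal sketch of generating one syllable. It follows the description above rather than gen's actual source. The names pick, makeSyllable, phonemeStop and typeStop are assumptions, the R and V definitions are invented for the example, and the typeStop value is a placeholder, since this help file gives no number for the syllable type dropoff.

```javascript
// Walk the choices in order, stopping at each with probability `stop`;
// wrap around to the first choice if we run off the end.
function pick(choices, stop, rng) {
  let i = 0;
  while (rng() >= stop) {
    i = (i + 1) % choices.length;
  }
  return choices[i];
}

// C is the definition quoted in the text; R and V are made up for this example.
const categories = {
  C: Array.from("ptkbdg"),
  R: Array.from("rl"),
  V: Array.from("ieaou"),
};
const syllableTypes = ["CV", "V", "CRV"];

// Pick a syllable type, then fill in each symbol from its category.
function makeSyllable(types, cats, opts, rng = Math.random) {
  const type = pick(types, opts.typeStop, rng);
  let out = "";
  for (const sym of Array.from(type)) {
    const choices = cats[sym];
    // Undefined symbols (like the k and h in khV) pass straight through.
    out += choices ? pick(choices, opts.phonemeStop, rng) : sym;
  }
  return out;
}

// phonemeStop is the Medium value from the text; typeStop is a placeholder.
const opts = { phonemeStop: 0.30, typeStop: 0.30 };
const word = Array.from({ length: 3 }, () =>
  makeSyllable(syllableTypes, categories, opts)
).join("");
console.log(word);
```

Keeping the two stop values separate mirrors the controls: Dropoff affects the phonemes, while the syllable types have their own setting.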

Rewrite rules

These allow you to apply global substitutions to the output. The simplest form is to replace a single character:

θ|th

That tells gen to replace every occurrence of θ in the output with th.

Or you can handle combinations. E.g. maybe ti always changes to či. You'd write that as ti|či. The facility is actually even more powerful than that, because the left-hand side is a regular expression. So for instance you could change both br and bl to bj with the formula b[rl]|bj.

Rules are applied in order. Make sure they don’t feed into each other when they shouldn’t! (See the Japanese example for more on this.)
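
Here is a minimal sketch of how such rules could be applied, using the three examples from this section. It is an illustration, not gen's actual code: the names parseRules and applyRules are made up, and splitting each rule at its first | is an assumption of this sketch.

```javascript
// Turn lines like "b[rl]|bj" into { pattern, replacement } pairs.
// Assumption: the rule is split at the first '|' on the line.
function parseRules(text) {
  return text
    .split("\n")
    .filter(line => line.includes("|"))
    .map(line => {
      const bar = line.indexOf("|");
      return {
        pattern: new RegExp(line.slice(0, bar), "g"), // left side is a regular expression
        replacement: line.slice(bar + 1),             // right side is plain text
      };
    });
}

// Apply each rule in turn to the whole text, so a later rule
// sees the output of every earlier rule.
function applyRules(text, rules) {
  return rules.reduce((acc, rule) => acc.replace(rule.pattern, rule.replacement), text);
}

const rules = parseRules("θ|th\nti|či\nb[rl]|bj");
console.log(applyRules("θati bra blo", rules)); // thači bja bjo
```

Because each rule runs over the result of the previous one, the order of the lines in the rules box is part of the definition.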

For fancier changes (such as those that are sensitive to the following phonemes), use the SCA.

You don’t have to have any rewrite rules at all, of course. (The other inputs have to have something in them.)

Saving your work

I’ve implemented gen in JavaScript to make it immediately available to anyone with a browser. If I used C, as with the SCA, it’d have to be provided separately for Windows and Mac and wouldn’t work on mobile devices anyway. Plus, it turns out that non-programmers don’t know how to use the command line window!

Unfortunately I can’t directly read and write files, because JavaScript is restricted from doing so. (For very good reasons! If web pages could write files, they could mess up your computer.)

But you can! Just keep your categories and syllable types in a text file and paste them into gen. And you can easily cut the output and put it wherever you want.

Don’t be cheap!

To avoid the pitfalls of cheap vocabulary generation:

Follow the usual rule of recording new words in the lexicon, so you don’t re-use words.
Don’t just copy the output and use every word in your lexicon. Pick the words you like; you can hit Generate to get a new set.
Multisyllabic words are output mostly to simulate what text would look like. Avoid very long words as roots.
Always use derivational morphology or compounding when you can, rather than just grabbing words from gen ! E.g. for religion, divinity, theology, sacrilege, priesthood, don’t just create each of these as roots, create etymologies.
If you’re getting ugly words— well, you probably have ugly phonotactics! Move the sounds you like up within your classes, and put simpler CV syllable types earlier in the file.

Sample: Pseudo-Japanese

Want some pseudo-Japanese? Sure you do! Paste these inputs into the three input boxes:

Categories:

C=tknsmrh
V=aioeu
U=auoāēū
L=āīōēū

Rewrite rules:

hu|fu
hū|fū
si|shi
sī|shī
sy|sh
ti|chi
tī|chī
ty|ch
tu|tsu
tū|tsū
qk|kk
qp|pp
qt|tt
q[^ptk]|

Syllable types:

CV
CVn
CL
CLn
CyU
CyUn
Vn
Ln
CVq
CLq
yU
yUn
wa
L
V

As you can see, the rewrite rules were essential in simulating the allophonic rules of Japanese. Some complications there:

I was getting weird output like cfu till I realized that the rule ty|ch was feeding into hu|fu. This was solved by moving the latter rules up so they get executed first.
The q phoneme is a slightly kludgy way of getting the long consonants, as in futte. Note that the rewrite rules set the correct long consonants for p t k; then the rule q[^ptk]| simply removes any other q’s. The ^ means “match anything except these letters”, and the absence of anything after the | means that anything matching the regular expression will be deleted.
I didn’t include the voiced consonants... maybe you can try adding them!
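
The feeding problem is easy to reproduce outside gen. The sketch below runs a few of the rules above over hand-picked inputs in both orders. It assumes rules are plain global regular-expression replacements applied one after another; the function name applyRules and the variable names are made up for the example.

```javascript
// Apply "pattern|replacement" rules in order; each rule sees the previous rule's output.
function applyRules(text, ruleLines) {
  let out = text;
  for (const line of ruleLines) {
    const bar = line.indexOf("|");
    out = out.replace(new RegExp(line.slice(0, bar), "g"), line.slice(bar + 1));
  }
  return out;
}

const fixedOrder = ["hu|fu", "ty|ch", "qt|tt"];  // hu|fu runs first, as in the sample
const feedingOrder = ["ty|ch", "hu|fu", "qt|tt"]; // ty|ch creates a new "hu" for hu|fu

console.log(applyRules("tyu", fixedOrder));    // chu
console.log(applyRules("tyu", feedingOrder));  // cfu
console.log(applyRules("huqte", fixedOrder));  // futte
```

In the second order, ty|ch turns tyu into chu, and hu|fu then turns that into cfu. Running hu|fu first means it only ever sees an original h.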