Babble generator
A few months ago, I was trying to come up with a way to generate random text that retained some basic characteristics of the English language - what I call a babble generator. The obvious approach is to randomly string together words from a word list, but that would be boring. My idea was to analyze various pieces of source text, apply some rules, and let the computer generate some text automagically. The reason I did this is because, why not.
Babble Generator #1: The Third-Letter Method
Using Project Gutenberg as a source, I assembled a 13.5 MB plain text source file from various literary works, including the notebooks of Leonardo da Vinci, The War of the Worlds by H.G. Wells, Ulysses by James Joyce, The Art of War by Sun Tzu, and many others.
I then wrote a program to map the frequency distribution of every possible three-character sequence into a three-dimensional, 32 x 32 x 32 matrix. (Why 32? Because 26 letters, a few punctuation marks, and spaces.)
Finally, starting with two random letters as the seed, the program chooses a third letter at random, weighted by frequency: the more often a letter followed the previous two letters in the source text, the more likely it is to be chosen. The process then repeats, always looking at the last two letters to choose the next one.
The result is mostly gibberish, but not nearly as bad as randomly generated text:
Wly vare lebod, nere, withet try..' reare of faid, the ye thad le. Bration of even emble's waske, butionlonswithe but i king ith goneve the froppeall th,ropendmenizen to to whatet it of trand, a sky in my babods prich then hiscausna spold, befoul th so drhat wistry cas dannigiou hou hot ch aten th of theyafted ancefords laws witing saiddevery raelcurently con theavieficylous son usecolone of thed an yets notearne wit sess. Sudy's midere per to depasond th ebre ger ye up reng tand ings now bure ficove onvin evis in annings wifich mover werempse angs ofmikhimen onewals, son readdin iriussers st hat the retrins ous beyroomen theethe tre pack thest the was en on and ne befournsit sit nocknothat of th,for aftem the camonpownstticketenniquick the setan th able. All was lonce kat thol alle by mossep of to tace. So thesphratame by giverands of the th livichmenlow the that peal hin. A pleat and i was.. Abrien of shavensou meashallbeteng in fross of hat he wis whys con th a gre mand morm, beriontimar.
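For illustration, the third-letter method could be sketched in Python roughly as follows. This is not the original program. It uses a dictionary of counters in place of the 32 x 32 x 32 matrix, the 32-character ALPHABET is a guess at the character set, the seed is a two-letter pair that actually occurs in the source rather than two arbitrary letters, and the names build_table and babble are made up for the example.

```python
import random
from collections import Counter, defaultdict

# Assumed character set: 26 letters, space, and five punctuation marks (32 total).
ALPHABET = set("abcdefghijklmnopqrstuvwxyz .,'?!")


def build_table(text):
    """Count how often each character follows each two-character prefix."""
    # Collapse all whitespace to single spaces, then drop unsupported characters.
    normalized = " ".join(text.lower().split())
    cleaned = "".join(ch for ch in normalized if ch in ALPHABET)
    table = defaultdict(Counter)
    for i in range(len(cleaned) - 2):
        table[cleaned[i:i + 2]][cleaned[i + 2]] += 1
    return table


def babble(table, length, rng=random):
    """Generate `length` characters, each weighted by what followed the last two."""
    prefixes = sorted(table)
    out = list(rng.choice(prefixes))  # seed: a prefix seen in the source
    while len(out) < length:
        counts = table.get("".join(out[-2:]))
        if not counts:
            # Dead end (prefix only seen at the very end of the text): reseed.
            out.extend(rng.choice(prefixes))
            continue
        letters, weights = zip(*counts.items())
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out[:length])
```

Calling babble(build_table(source_text), 500) would produce 500 characters of babble from whatever text is loaded into the hypothetical source_text variable.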
Babble Generator #2: The Letter-Sequence Method
This method is similar to the first, except that instead of considering just the last two letters, the process considers the last four to eight letters, with the number chosen at random each time.
The only problem is that the nine-dimensional matrix required to support this approach (32 to the 9th power) would contain over 35 trillion entries, which is impractical to store. So I came up with a different method that, although slower, requires no up-front analysis of the text: search through the source text for the current four- to eight-letter sequence, starting at a random point each time, take the letter that follows the match, and repeat the process.
The result was fascinating! It "looks" more like English text, even though most of the words are not real:
His a moral not only and sensaid pirit, signed and at the valuable estay som dukengaging our situdes withe wise unto all othe conceptiong him on ple earth fredere andrkened unto you arting from ethiopinion, and, and the jews, whole orry numble ough the serson shah of peress a ce of thy of the larger almighthey lown the commander, good parned and the princes satirisins assent.
Near third shall he not rent, it and under and have at down? Or shaling gened in acity out of all reced in their ways that there forning anded my filled was soman of commithat hich thorsecly populationg them that ase the even conto the baldness benef his sweet, howevering killed tragic schemes, and under their ways underthen of killich hang.
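A rough Python sketch of the search-based idea is shown here. It makes several assumptions that the description does not spell out: the search wraps around to the beginning of the text if nothing is found after the random starting point, the first match found is the one used, the context is shortened when no match exists, and the output is seeded with a real snippet of the source. The function names are invented for the example.

```python
import random


def next_char(source, context, rng=random):
    """Search for `context` from a random offset, wrapping around once.

    Returns the character following the match, or None if there is no match
    that still has a character after it.
    """
    limit = len(source) - 1  # a match must leave room for a following character
    start = rng.randrange(len(source))
    pos = source.find(context, start, limit)
    if pos == -1:
        pos = source.find(context, 0, limit)  # wrap around to the beginning
    if pos == -1:
        return None
    return source[pos + len(context)]


def babble(source, length, min_ctx=4, max_ctx=8, rng=random):
    """Generate `length` characters; `source` must be longer than `max_ctx`."""
    start = rng.randrange(len(source) - max_ctx)
    out = source[start:start + max_ctx]  # seed with a real snippet (assumption)
    while len(out) < length:
        n = rng.randint(min_ctx, max_ctx)  # context length varies per step
        ch = None
        while ch is None and n > 0:
            ch = next_char(source, out[-n:], rng)
            n -= 1  # no usable match: retry with a shorter context
        # Last-resort fallback (assumption): restart from the first character.
        out += ch if ch is not None else source[0]
    return out[:length]
```

Picking the first match after a random offset favors continuations that occur often, although it is not an exact frequency weighting, because a match that follows a long gap in the text is more likely to be hit.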
Babble Generator #3: The Linguistic Method
This method does not use any source text at all. Instead, I made a list of consonants that typically appear at the start, middle, and end of a word, as well as a list of vowels, and repeated some of them in order to add weight by frequency:
Starting consonants: j, v, bl, br, cl, cr, fl, fr, gl, gr, sl, sp, st, str, sw, tr, tw, ex, qu, b, c, d, f, g, h, k, l, m, n, p, r, s, t, w, y, z, ch, th, ph
Middle consonants: bl, br, cl, cr, fl, fr, gl, gr, ll, sl, sp, st, str, sw, tr, tw, b, c, d, f, g, h, k, l, m, n, p, r, s, t, w, z, ch, th, ph
Ending consonants: as, bs, cs, ds, es, fs, gs, hs, is, ks, ll, ls, ms, ng, ns, os, ps, rs, ts, ws, xes, zes, b, c, d, f, g, h, k, l, m, n, p, r, s, t, w, z, ch, th, ph
Vowels: a, a, a, a, a, e, e, e, e, e, e, e, e, e, i, i, i, i, o, o, o, o, u, u, y
The rules are simple: The length of each word is chosen at random. A word must start with a vowel or a starting consonant, and end with a vowel or an ending consonant. And in the middle, it always alternates between a vowel and a middle consonant.
The result is gibberish, but at least every word is pronounceable:
ora af mable fof be kufesle phur istrons illespetwu cho tho ews sti itwa quuru ocees cle istrohospate ecrigo hi acru ibre treloche efra dablims ygrulothe upaxes fle epephiceglibs le ezes kigluslystagregrob slefs beslahs swihews obs ge hexes ehs ow agu esticra izi mostu enecs uglastuclakegows ve stry lo vapizaslaleng ans ners iwu hitabre slahi ethaclaru egre ezexes gla thowa aks tell ak aku ahilugell ste aspeclaclachi gri ora sezypof ucs swa ochahistat trepeg aglale swecele cra bo osle ylocath wo gratad ebecro cla cleda eswiga fra ek efregliswabebli for uflof ba oes uns viweregoflice os islaph
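The rules above can be sketched in Python using the four lists as given. Two details are assumptions: word length is measured in units (each unit being one vowel or one consonant entry) rather than letters, and every word has at least two units so that a consonant never stands alone. The names make_word and babble are made up for the example.

```python
import random

START = ("j v bl br cl cr fl fr gl gr sl sp st str sw tr tw ex qu "
         "b c d f g h k l m n p r s t w y z ch th ph").split()
MIDDLE = ("bl br cl cr fl fr gl gr ll sl sp st str sw tr tw "
          "b c d f g h k l m n p r s t w z ch th ph").split()
END = ("as bs cs ds es fs gs hs is ks ll ls ms ng ns os ps rs ts ws xes zes "
       "b c d f g h k l m n p r s t w z ch th ph").split()
# Repeated entries make the more common vowels more likely to be picked.
VOWELS = "a a a a a e e e e e e e e e i i i i o o o o u u y".split()


def make_word(rng=random, max_units=6):
    """Build one word by alternating vowels and consonants."""
    units = rng.randint(2, max_units)   # assumed range of word lengths
    vowel_next = rng.random() < 0.5     # start with a vowel or a consonant
    parts = []
    for i in range(units):
        if vowel_next:
            parts.append(rng.choice(VOWELS))
        elif i == 0:
            parts.append(rng.choice(START))
        elif i == units - 1:
            parts.append(rng.choice(END))
        else:
            parts.append(rng.choice(MIDDLE))
        vowel_next = not vowel_next
    return "".join(parts)


def babble(n_words, rng=random):
    """Join `n_words` generated words with spaces."""
    return " ".join(make_word(rng) for _ in range(n_words))
```

For example, print(babble(50)) prints fifty space-separated made-up words.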
Babble Generator #4: Linguistics Meet Source Text
This method is a combination of the second and third methods.
First, the source text is analyzed to generate a list of every possible starting consonant, middle consonant, ending consonant, starting vowel, middle vowel, and ending vowel. (A consonant string such as CH or STR is counted as a single consonant.) The more often each one appears in the source text, the more often it appears in the list that is generated. The program also maps the distribution of word length, depending on whether the word starts with a vowel or a consonant.
The source text was compared to a word list dictionary so that only English words were considered. The purpose was to exclude proper nouns (Christ, Gutenberg), Roman numerals (XXVIII), foreign words (celebratissimo), archaic words (sitteth), missing spaces (headquartersbecause), misspelled words (undistinguishable), and outright gibberish (bbbbblllllblblblblobschbg) that appeared throughout the source text.
I also set parameters such as the average number of words between commas, periods, carriage returns, etc. These parameters were set manually, but could also have been determined from the source text.
Here is the result:
Aw, theng. Evootusts, beante ax hittlats the ippen rougot din antain on fellemmeepeth the nund. The dell ith athignand stroothirsticoy betoud, ath aweck che he, theiniccouble woutang sabloud thef ilwiot ther thost, the anceset lod dend wid rousse, ints and avianyod she thatteng fre cras pe thelland at gleb. An pullne lus natef ants if ind of I ad estle dain ineos atheic oght, tharde ne woens ossiosted, ore our huts thunlaiquse of ing. Coble, ot odad in ore aptings mao inderche iftwes wheow is, tuctof ope thall yins? Ard pre, sot noembnend soum ad ge fepy rettlo thikedes the, it ossell stokioghteor etebontef acks. Strus arnalp if he rioldo oun, sigro von an ists asan oddou kans us avivind, ak peto. Irou er oftoish overtef froull we of ong irt a nas? Anco pis, on onle thoud. Al theone fell idd daf thosussieth whert octlevess ach out, a. Thette, cutil leseff jollarby seon ats twendoot hete oround bou so and thof ith impe af tosts oblais sinfe fus hintinhopens the alk ilderne hal thround thelsoventy ororth theantmerst ond if, liaro frit. Ass, thetars bree igise trenk ang al he. And! Ealkevo thif?
This is the most complicated approach that I took, and I fully expected it to yield the most realistic text. But it did not. I believe the reason is that even though the letter distribution was determined from source text, the letter generation is still based fundamentally on a random selection of consonants and vowels.
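To make the analysis step of this fourth method more concrete, here is a hedged Python sketch of how the position-specific lists might be built and sampled. It is a guess at the approach, not the original program. It treats y as a consonant, counts a run of adjacent vowels as a single vowel, measures word length in runs rather than letters, and leaves out the punctuation parameters. The names analyze, pick, and make_word are hypothetical.

```python
import random
import re
from collections import Counter

RUNS = re.compile(r"[aeiou]+|[^aeiou]+")  # maximal vowel or consonant runs
KINDS = ("vowel", "consonant")


def analyze(text, dictionary):
    """Count vowel/consonant runs by position, for dictionary words only."""
    counts = {(kind, pos): Counter()
              for kind in KINDS for pos in ("start", "middle", "end")}
    lengths = {kind: Counter() for kind in KINDS}  # keyed by first run's kind
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in dictionary:
            continue  # skips proper nouns, Roman numerals, typos, and so on
        runs = RUNS.findall(word)
        kinds = ["vowel" if run[0] in "aeiou" else "consonant" for run in runs]
        lengths[kinds[0]][len(runs)] += 1
        for i, (run, kind) in enumerate(zip(runs, kinds)):
            pos = "start" if i == 0 else "end" if i == len(runs) - 1 else "middle"
            counts[kind, pos][run] += 1
    return counts, lengths


def pick(counter, rng):
    """Choose one key from a Counter, weighted by its count."""
    items, weights = zip(*counter.items())
    return rng.choices(items, weights=weights)[0]


def make_word(counts, lengths, rng=random):
    """Generate one word from the analyzed tables (needs at least one word)."""
    totals = [sum(lengths[kind].values()) for kind in KINDS]
    kind = rng.choices(KINDS, weights=totals)[0]
    n = pick(lengths[kind], rng)  # number of runs for this starting kind
    parts = []
    for i in range(n):
        pos = "start" if i == 0 else "end" if i == n - 1 else "middle"
        parts.append(pick(counts[kind, pos], rng))
        kind = "consonant" if kind == "vowel" else "vowel"
    return "".join(parts)
```

Because each run is still drawn independently of its neighbors, a sketch like this shows the same limitation described above: the pieces are realistic, but the way they are combined is random.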
That was fun.
Comments
usually we use set greek words called lorem ipsum.
but i did find a site on the web where you can enter the approx amount of words you need and then you pick a genre such as A-Team, classic movies or famous speeches and it spits out text accordingly.
can't remember this site but i swear it exists.
Imagine, A-Team-themed babble!!