How readability formulas work

Readability formulas are mostly developed using the cloze procedure. Readers are given a text in which words have been removed at regular intervals (for example, every fifth word) and have to fill in the gaps; their success rate is then used in a regression analysis. The variable to be explained is the success of the reader, and the criteria that can explain the variance in success are properties of the text such as sentence length, word length, and the share of rare or familiar words. The result is a formula that returns a value. This can be an index value (0 - 100), as in the Flesch formula, a reading grade, or a reading age.
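As an illustration of what such a regression result looks like, the following is a minimal sketch of the Flesch Reading Ease formula for English, computed from counts that are assumed to be known already. The function name and parameters are chosen for this example only and are not part of TextQuest/ReFo.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease for English text, computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    avg_sentence_length = words / sentences      # average words per sentence
    avg_syllables_per_word = syllables / words   # average syllables per word
    # The constants are the published regression coefficients of the formula.
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Example: 100 words in 5 sentences with 150 syllables.
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 1))
```

Higher values indicate easier text. The two text properties in the formula, sentence length and word length, are exactly the kind of criteria described above.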

Readability formulas are developed for a readership (e.g. primary school children, students), a text genre (e.g. fiction, technical manuals), and are language specific. One cannot use a readability formula and interpret its results if one doesn't consider language, text genre, and target audience/reader. TextQuest/ReFo selects the appropriate readability formula(s).

Problems with readability formulas

The criteria that explain the most variance in cloze results are sentence length, measured in average words per sentence, and word length, measured in syllables or characters per word. Rare words or familiar words (as in the Dale-Chall formulas) also explain a large share of the variance. Counting sentences, words, syllables, and letters seems easy, but computer software has to tackle some problems:

  1. Sentences: A sentence ends with a period, question mark, or exclamation mark. The problem is the period, which can also be part of a decimal number or mark an abbreviation. TextQuest/ReFo uses abbreviation lists that depend on the language of the text. However, if an abbreviation in your text is not in the list, you should remove its period so that it is not treated as the end of a sentence. More details are described in the section Preparation of the text.
  2. Words: It is easy to count the letters of a word. Words are mostly separated by blanks, but this is not the case when a hyphen joins two words: there are two words, but no blanks before or after the hyphen. Such a sequence of two words is counted as one word.
  3. Syllables: This seems easy: you pronounce a word and count the syllables. However, there are often big differences between spoken and written language. TextQuest/ReFo uses different algorithms to count syllables, depending on the language. English is a special case: there are so many exceptions that the database of Carnegie Mellon University is used for counting syllables.
  4. Word lists: The readability formulas of Dale-Chall, Spaulding, Bamberger/Vanecek, Tränkle/Bailer and Amstad use word lists. These are integrated in TextQuest/ReFo and were edited according to the instructions in the original papers or books. The 1995 Dale-Chall word list consists of 3000 word forms that most 10-year-olds in the USA know, but it does not contain grammatical forms such as the regular plural of nouns, regular past perfect tense forms and progressive forms of verbs, or British English spellings. All of these forms were added to the original list, so the Dale-Chall word list that TextQuest/ReFo uses contains more than 8,000 word forms, a unique feature.
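
The sentence and word problems in items 1 and 2 can be sketched as follows. This is a simplified illustration and not the algorithm TextQuest/ReFo actually uses: the abbreviation set is a tiny hypothetical sample, and the function names and the split_hyphens parameter are invented for this example.

```python
import re

# Hypothetical sample; a real abbreviation list is language-specific and much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc.", "vs."}


def count_sentences(text: str, abbreviations=ABBREVIATIONS) -> int:
    """Count tokens that end a sentence, skipping known abbreviations."""
    count = 0
    for token in text.split():
        # Ignore surrounding quotes and brackets, e.g. 'end.)' becomes 'end.'
        stripped = token.strip("\"'()[]")
        if not stripped or stripped[-1] not in ".?!":
            # A decimal number such as '3.5' ends in a digit, so its period is ignored.
            continue
        if stripped[-1] == "." and stripped.lower() in abbreviations:
            # Period belongs to a listed abbreviation, not to a sentence end.
            continue
        count += 1
    return count


def count_words(text: str, split_hyphens: bool = False) -> int:
    """Count blank-separated tokens that contain a letter or digit."""
    if split_hyphens:
        # Optionally treat a hyphen between two word characters as a separator.
        text = re.sub(r"(?<=\w)-(?=\w)", " ", text)
    return sum(1 for token in text.split() if any(ch.isalnum() for ch in token))


sample = "Dr. Smith paid 3.5 dollars for a well-known book. Was it worth it? Yes!"
print(count_sentences(sample))                 # sentences found in the sample
print(count_words(sample))                     # 'well-known' counted as one word
print(count_words(sample, split_hyphens=True)) # 'well-known' counted as two words
```

The sketch also shows the limits of a list-based approach: an abbreviation that is missing from the set is counted as a sentence end, and an abbreviation that really does end a sentence is skipped.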