I came across this 2014 post on prooffreader.com by data scientist and developer David Taylor and found it fascinating. I figured my readers would as well.

Below you’ll see a visualization of how English letters are distributed toward the beginning, middle, or end of words. The data set comes from the Brown Corpus in the Natural Language Toolkit rather than from a dictionary. This is great because the results are weighted by how often each word is actually used.


Graphing the distribution of English letters towards the beginning, middle or end of words
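
For anyone curious how a tally like this might be built, here is a minimal sketch in Python. It is not Taylor's actual code or binning scheme. The function name `letter_position_counts`, the three-bin split (beginning, middle, end), and the tiny sample word list are all my own assumptions for illustration. Because every occurrence of a word is counted, common words weigh more, which is the usage weighting described above.

```python
from collections import defaultdict


def letter_position_counts(words, bins=3):
    """Tally how often each letter lands in each positional bin of a word.

    Hypothetical helper for illustration; bins=3 means beginning/middle/end.
    """
    counts = defaultdict(lambda: [0] * bins)
    for word in words:
        # Assumption: skip tokens containing punctuation or digits.
        if not word.isalpha():
            continue
        word = word.lower()
        n = len(word)
        for i, ch in enumerate(word):
            # Midpoint of this letter's slot as a fraction of word length,
            # so the value is always strictly between 0 and 1.
            frac = (i + 0.5) / n
            counts[ch][int(frac * bins)] += 1
    return dict(counts)


# Tiny made-up sample; repeated words count once per occurrence.
sample = ["the", "toe", "the", "cat", "sat"]
counts = letter_position_counts(sample)
print(counts["t"])  # [3, 0, 2]
print(counts["e"])  # [0, 0, 3]
```

To try this on the real data, you could pass in `brown.words()` from `nltk.corpus` after running `nltk.download("brown")`. The exact numbers will depend on how many bins you choose and how you treat punctuation.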


If you’re a data nerd like me, there are a lot more details in the original post that explain these findings. If you want to learn more about the methodology, be sure to check out the extended version of the post on prooffreaderplus. I appreciated Taylor’s final thought:

The most common word in the English language is “the”, which makes up about 6% of most corpuses (sorry, corpora). But according to these graphs, the most representative word is “toe”.

I’m glad the word that ended up representing English the most is somehow “toe”—for whatever reason I find it oddly fitting for our mongrel language.