Words That Can’t Be Strangled was a project that I entered into the 2016 Sea Island Regional Science Fair. The project received first place in the Mathematics, Engineering and Computer Science category and an Intel Award for Excellence in Computer Science.
The purpose of this project was to analyze the texts of Project Gutenberg (a repository of public domain, plain text ebooks) and the English Wikipedia to determine the most common English words of all time. My hypothesis was that “a”, “an” and “the” would be the three most frequently used English words, since articles are very common in English. I wrote a program in Python to process the text in its various source formats and generate a table of words and their frequencies. To run the experiment, I downloaded very large dumps of both collections, extracted the plain text from the Wikipedia database with an extraction program, ran my word frequency analysis program and reviewed the results.
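To give a rough idea of what a word frequency table involves, here is a minimal Python sketch. It is not the code from words.py: the function name `count_words`, the tokenizing pattern and the sample sentences are all assumptions made for illustration, and the real program also has to handle the different source formats.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of lowercase ASCII letters, with optional
# internal apostrophes so that a word like "isn't" stays in one piece.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def count_words(lines, counts=None):
    """Add the words found in an iterable of text lines to a Counter."""
    if counts is None:
        counts = Counter()
    for line in lines:
        # Lowercase first so "The" and "the" are counted as the same word.
        counts.update(WORD_RE.findall(line.lower()))
    return counts

# Hypothetical sample input standing in for lines read from a text dump.
sample = ["The cat sat on the mat.", "A cat is an animal; the mat isn't."]
counts = count_words(sample)

# Print the most frequent words, highest count first.
for word, n in counts.most_common(3):
    print(word, n)
```

Because the function takes any iterable of lines, an open file can be passed in directly, so a large dump can be counted one line at a time instead of being loaded into memory all at once.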
The experiment found that the ten most common English words of all time are “the”, “of”, “and”, “to”, “in”, “a”, “was”, “that”, “he” and “is”. The hypothesis was partly correct: “the” ranked first and “a” ranked sixth, but “an” did not make the top ten, and the articles did not occupy the top three positions.
I have released words.py, my word frequency analysis program, as free software in the hope that it will be useful in other projects or research. You can obtain the program, along with instructions for reproducing the experiment, from this GitHub repository. If you make improvements to the program, I encourage you to fork the repository and send me a pull request with your changes!