Literary History, Seen Through Big Data’s Lens

27 Jan

See on Scoop.it - Computational Music Analysis

Big Data is pushing into the humanities, as evidenced by new, illuminating computer analyses of literary history.

Olivier Lartillot’s insight:

My digest:



Big Data technology is steadily pushing beyond the Internet industry and scientific research into seemingly foreign fields like the social sciences and the humanities. The new tools of discovery provide a fresh look at culture, much as the microscope gave us a closer look at the subtleties of life and the telescope opened the way to faraway galaxies.


“Traditionally, literary history was done by studying a relative handful of texts. What this technology does is let you see the big picture — the context in which a writer worked — on a scale we’ve never seen before.”


Some of those tools are commonly described in terms familiar to an Internet software engineer: algorithms that use machine learning and network analysis techniques. For instance, mathematical models are tailored to identify word patterns and thematic elements in written text. The number and strength of links among novels are then used to estimate influence, much the way Google ranks Web sites.
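
To make the ranking analogy concrete, here is a minimal sketch of a PageRank-style score computed over a small weighted graph of novels. The titles, the link weights and the `influence_scores` function are hypothetical illustrations of the general idea, not the researchers' actual model or data.

```python
def influence_scores(links, damping=0.85, iterations=100):
    """PageRank-style power iteration over weighted directed links.

    links maps each novel to {other_novel: weight}; an edge A -> B with
    weight w means A resembles (and so "credits") B with strength w.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the same small baseline score.
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = links.get(source, {})
            total = sum(targets.values())
            if total == 0:
                # Dangling node: spread its score evenly over all nodes.
                share = damping * scores[source] / n
                for node in nodes:
                    new_scores[node] += share
            else:
                # Pass score along each link in proportion to its weight.
                for target, weight in targets.items():
                    new_scores[target] += damping * scores[source] * weight / total
        scores = new_scores
    return scores


# Hypothetical toy data: later novels point at the earlier ones they resemble.
toy_links = {
    "Novel C": {"Novel A": 3.0, "Novel B": 1.0},
    "Novel D": {"Novel A": 2.0},
    "Novel B": {"Novel A": 1.0},
}
scores = influence_scores(toy_links)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph every other novel links to "Novel A", so it ends up at the top of `ranking`. How the links themselves are measured (shared vocabulary, themes, style) is a separate modelling choice that this sketch leaves open.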


It is this ability to collect, measure and analyze data for meaningful insights that is the promise of Big Data technology. In the humanities and social sciences, the flood of new data comes from many sources including books scanned into digital form, Web sites, blog posts and social network communications.


Data-centric specialties are growing fast, giving rise to a new vocabulary. In political science, this quantitative analysis is called political methodology. In history, there is cliometrics, which applies econometrics to history. In literature, stylometry is the study of an author’s writing style, and these days it leans heavily on computing and statistical analysis. Culturomics is the umbrella term used to describe rigorous quantitative inquiries in the social sciences and humanities.


“Some call it computer science and some call it statistics, but the essence is that these algorithmic methods are increasingly part of every discipline now.”


Cultural data analysts often adapt biological analogies to describe their work. For example: “Computing and Visualizing the 19th-Century Literary Genome.”


Such biological metaphors seem apt, because much of the research is a quantitative examination of words. Just as genes are the fundamental building blocks of biology, words are the raw material of ideas.


“What is critical and distinctive to human evolution is ideas, and how they evolve.”


Some projects mine the virtual book depository known as Google Books to track the use of words over time, compare related words and even graph them. Google cooperated and built software that makes these graphs available to the public. The initial version of Google’s cultural exploration site launched at the end of 2010, based on more than five million books dating from 1500. By now, Google has scanned 20 million books, and the site is used 50 times a minute. For example, type in “women” in comparison to “men,” and you see that for centuries references to men dwarfed those to women. The crossover came in 1985, with women ahead ever since.
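
As a rough illustration of how such a comparison can be computed, the sketch below normalizes yearly word counts by the total number of words printed that year and then looks for the first year in which one word overtakes the other. The function names are my own, and the counts and year labels are invented placeholders, not Google Books data.

```python
def relative_frequency(word_counts, total_counts):
    """Divide raw yearly counts by the total words printed that year."""
    return {year: word_counts[year] / total_counts[year] for year in word_counts}


def first_crossover(series_a, series_b):
    """Return the first year in which series_a overtakes series_b, else None."""
    years = sorted(set(series_a) & set(series_b))
    for prev, curr in zip(years, years[1:]):
        if series_a[prev] <= series_b[prev] and series_a[curr] > series_b[curr]:
            return curr
    return None


# Invented counts for four consecutive "years" labelled 1 to 4.
totals = {1: 1000, 2: 2000, 3: 4000, 4: 4000}
counts_a = {1: 10, 2: 30, 3: 80, 4: 100}
counts_b = {1: 30, 2: 40, 3: 60, 4: 60}

freq_a = relative_frequency(counts_a, totals)
freq_b = relative_frequency(counts_b, totals)
crossover_year = first_crossover(freq_a, freq_b)
```

Normalizing by the yearly total matters because far more books are printed in later years, so raw counts alone would make almost every word look like it is on the rise.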


Researchers tapped the Google Books data to find how quickly the past fades from books. For instance, references to “1880,” which peaked in that year, fell to half by 1912, a lag of 32 years. By contrast, “1973” declined to half its peak by 1983, only 10 years later. “We are forgetting our past faster with each passing year.”
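
The lags quoted here are simple subtractions (1912 minus 1880, and 1983 minus 1973). A minimal sketch of how such a "time to half of peak" could be read off a yearly frequency series follows; the helper `years_to_half_peak` and the toy series are hypothetical, not the researchers' code or data.

```python
def years_to_half_peak(series):
    """Years between the peak of a yearly series and its first drop to half the peak."""
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2.0
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None  # never fell to half within the data


# The lags quoted in the text, from the peak and half-peak years given there.
lag_1880 = 1912 - 1880
lag_1973 = 1983 - 1973

# Invented toy series (not Google Books data): peak at year 10, half by year 13.
toy = {9: 2.0, 10: 8.0, 11: 6.0, 12: 5.0, 13: 4.0, 14: 3.0}
toy_lag = years_to_half_peak(toy)
```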


Other research approached collective memory from a very different perspective, focusing on what makes spoken lines in movies memorable. Sentences that endure in the public mind are evolutionary success stories, an idea captured in the comparison of “the fitness of language and the fitness of organisms.” As a yardstick, the researchers used the “memorable quotes” selected from the popular Internet Movie Database, or IMDb, and the number of times that a particular movie line appears on the Web. Then they compared the memorable lines to the complete scripts of the roughly 1,000 movies in which they appeared. To train their statistical algorithms on common sentence structure, word order and most widely used words, they fed their computers a huge archive of articles from news wires. The memorable lines consisted of surprising words embedded in sentences of ordinary structure. “We can think of memorable quotes as consisting of unusual word choices built on a scaffolding of common part-of-speech patterns.”
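
One simple way to put a number on "surprising words" is to score each word by how improbable it is under a model trained on ordinary text. The sketch below uses a unigram model with add-one smoothing, which is a deliberate simplification and not the method used in the study; the function names and the tiny background corpus are invented for illustration, and the part-of-speech side of the finding is not covered.

```python
import math
from collections import Counter


def train_unigram(background_sentences):
    """Count lowercase word frequencies in a background corpus."""
    counts = Counter()
    for sentence in background_sentences:
        counts.update(sentence.lower().split())
    return counts


def mean_surprisal(sentence, counts):
    """Average -log2 probability per word, with add-one smoothing."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    bits = [-math.log2((counts[w] + 1) / (total + vocab)) for w in words]
    return sum(bits) / len(bits)


# Invented stand-in for a news-wire archive.
background = [
    "the market rose today",
    "the mayor spoke today",
    "the market fell today",
]
model = train_unigram(background)

ordinary = mean_surprisal("the market rose today", model)
unusual = mean_surprisal("the kraken rose today", model)
```

Both test sentences share the same structure, but the one containing a word never seen in the background text gets the higher average surprisal, which is the pattern the quoted finding describes.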


Quantitative tools in the humanities and the social sciences, as in other fields, are most powerful when they are controlled by an intelligent human. Experts with deep knowledge of a subject are needed to ask the right questions and to recognize the shortcomings of statistical models.


“You’ll always need both. But we’re at a moment now when there is much greater acceptance of these methods than in the past. There will come a time when this kind of analysis is just part of the tool kit in the humanities, as in every other discipline.”
