Thursday, December 16, 2010

Frank Turner: “Libraries have changed more in the last 15 years than they have in the last five centuries.”

Libraries as "glorified study halls"?

Steward of the Once and Future Book


How does it feel to become University Librarian at Yale when all around you are predicting the end of the book as we know it? Intellectual historian Frank M. Turner '71PhD, who became interim University Librarian in January, has now formally assumed the leadership of Yale's vast library system—just as the apocalypse of the printed book is being discussed by his counterparts around the country. There are dark suggestions in journals and at conferences, he says, that "in less than 25 years, libraries will be glorified study halls," each with "one vast computer furnishing electronic materials."

But he's not worried: "The book won't disappear, and, in fact, our circulation remains high." Turner—the John Hay Whitney Professor of History, a former Yale provost, and since 2003 the head of the Beinecke (a post he'll keep until his replacement is hired)—readily acknowledges the reach of the digital revolution. "Libraries have changed more in the last 15 years than they have in the last five centuries," he says. Yale's library system "contains exemplars of everything: from the Beinecke, with its enormous breadth and depth of traditional print materials, to the medical library, which, except for its historical component, is virtually all electronic."

The rise of such virtual collections, along with digital devices and high-speed wireless Internet access, is changing a fundamental aspect of the library. "We've always thought of the library as the heart of the university, as a distinct place," Turner says. But librarians need a new perspective on the Sterling system. By enabling researchers to invent their own fresh ways of using the collections, "the library of the future will have to go into the heart of the user."

A Nature article talks about the use of digitized books to uncover language clues.

Cultural goldmine lurks in digitized books

'Culturomics' uncovers fame, fortune and censorship from more than a century of words.
[Image: Analysing decades of books can reveal important cultural trends. FRANCK CAMHI / Alamy]
The digitization of books by Google Books has sparked controversy over issues of copyright and book sales, but for linguists and cultural historians this vast project could offer an unprecedented treasure trove. In a paper published today in Science, researchers at Harvard University in Cambridge, Massachusetts, and the Google Books team in Mountain View, California, herald a new discipline called culturomics, which sifts through this literary bounty for insights into trends in what cultures can and will talk about through the written word.
Among the findings described by the collaboration, led by Jean-Baptiste Michel, a Harvard biologist, are the size of the English language (around a million words in 2000), the typical 'fame trajectories' of well-known people, and the literary signatures of censorship such as that imposed by Germany's Nazi government.
"The possibilities with such a new database, and the ability to analyse it in real time, are really exciting," says linguist Sheila Embleton of York University in Toronto, Canada.
"Quantitative analysis of this kind can reveal patterns of language usage and of the salience of a subject matter to a degree that would be impossible by other means," agrees historian Patricia Hudson of Cardiff University, UK.
"The really great aspect of all this is using huge databases, but they will have to be used in careful ways, especially considering alternative explanations and teasing out the differences in alternatives from the database," adds Royal Skousen, a linguist at Brigham Young University in Provo, Utah. "I do not like the term 'culturomics'," he adds. "It smacks too much of 'freakonomics', and both terms smack of amateur sociology."

Half a trillion words

Using statistical and computational techniques to analyse vast quantities of data in historical and linguistic research is nothing new — the fields known as quantitative history and quantitative linguistics already do this. But it is the sheer volume of the database created by Google Books that sets the new work apart.
So far, Google has digitized more than 15 million books, representing about 12% of all those ever published in all languages. Michel and his colleagues performed their analyses on just a third of this sample, selected for the good quality of the optical character recognition in the digitization and the reliability of information about a book's provenance, such as the date and place of publication.
The resulting data set contained over 500 billion words. This is far more than any single person could read: a fast reader would, without breaks for food and sleep, need 80 years to finish the books for the year 2000 alone.
Not all isolated strings of characters in texts are real words. Some are numbers, abbreviations or typos. In fact, 51% of the character strings in 1900, and 31% in 2000, were 'non-words'. "I really have trouble believing that," admits Embleton. "If it's true, it would really shake some of my foundational thoughts about English."
According to this account, the English language has grown by more than 70% during the past 50 years, and around 8,500 new words are being added each year. Moreover, only about half of the words currently in use are apparently documented in standard dictionaries. "That high amount of lexical 'dark matter' is also very hard to believe, and would also shake some foundations," says Embleton. "I'd love to see the data."
In principle she already can, because the researchers have made their database public at http://www.culturomics.org/. This will allow others to explore the huge number of potential questions it suggests, not just about word use but about cultural history. Michel and colleagues offer two such examples, concerned with fame and censorship.
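As a rough illustration of the kind of computation culturomics involves, here is a minimal sketch, assuming a simplified tab-separated count file (word, year, match count) and a dictionary of total words per year; the dataset released at culturomics.org uses a richer n-gram format, so this is only a toy approximation.

from collections import defaultdict

def load_counts(path):
    # Read a simplified count file: word<TAB>year<TAB>match_count (hypothetical format).
    counts = defaultdict(dict)  # word -> {year: count}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, year, n = line.rstrip("\n").split("\t")
            counts[word][int(year)] = int(n)
    return counts

def frequency_trend(counts, totals_per_year, word, start, end):
    # Relative frequency of `word` per year: the basic culturomics signal.
    return {
        year: counts.get(word, {}).get(year, 0) / totals_per_year[year]
        for year in range(start, end + 1)
        if year in totals_per_year
    }

# A sharp dip in a name's relative frequency within one national corpus
# (the paper's Chagall example for German books, 1936-1944) is the kind of
# censorship signature such a trend can expose.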

They say that actors reach their peak of fame, as recorded in references to names, around the age of 30, while writers take a decade longer but achieve a higher peak. "Science is a poor route to fame," they add. Physicists and biologists who achieve fame do so only late in life, and "even at their peak, mathematicians tend not to be appreciated by the public".

Big Brother's fingerprints

Nation-specific subsets of the data can show how references to ideas, events or people can drop out of sight because of state suppression. For example, the Jewish artist Marc Chagall virtually disappears from German writings in 1936-1944 (while remaining prominent in English-language books), and 'Trotsky' and 'Tiananmen Square' similarly vanish at certain sensitive points in time from Russian and Chinese works respectively. The authors also look at trends in references to feminism, God, diet and evolution.
"The ability, via modern technology, to look at just so much at once really opens horizons," says Embleton. However, Hudson cautions that making effective use of such a resource will require skill and judgement, not just number-crunching.
"How this quantitative evidence is generated and how it is interpreted are the most important factors in forming conclusions," she says. "Quantitative evidence of this kind must always address suitably framed general questions, and employed alongside qualitative evidence and reasoning, or it will not be worth a great deal." 

Friday, December 10, 2010

Okay, since class is now officially over, our professor has requested a final blog post ruminating on the semester.

This semester, I was strangely pleased to note how collaborative the efforts are here within the School of Information Science. I had been used to vehemently defending a point and being graded on my position and the articulation of its defense. Not here. More to follow.

Monday, December 6, 2010

The use of free-text vs controlled vocabularies in searching.

Landscape Analysis:
During the period from roughly the mid-1960s to the mid-1970s, “the findings of various experiments in the testing and evaluating of indexing languages … have demonstrated again and again the strength of the natural language, with minimal or no control, as optimally the best indexing language” (Dubois, 63). However, when these studies were examined more closely, it was determined that they had been done on special collections and thus their results could not be applied to large data sets. In the 1980s, it was determined that the “ideal search capacity should always be both free text and controlled vocabulary” (Dubois, 64).
However, as the data and the databases that housed them grew larger, retrieving valuable information became more difficult. One issue is that language changes over time, particularly within scientific fields. The need to formalize a set of controlled vocabularies to assist searching became increasingly apparent. Across many studies of databases in different areas of research, tests were done to try to resolve the issue of controlled vocabularies vs. free text. These studies employed a variety of methods to search the databases for specific information.
Definitions:
The purpose of a controlled vocabulary is “[t]o ensure as far as possible the consistent representation of the subject matter of documents both in input to and output from the system and [t]o facilitate the conduct of searches in the system especially by bringing together in some way the terms that are most closely related semantically” (Dubois, 64). The purpose of free-text searching, by contrast, is to let the user decide which subjects and terms to search for within the database; the user controls the search by supplying its terminology. However, the traditional library techniques for searching digital databases have been strained by the sheer size of the data, which is why information science is investigating the utility of ontologies in the drive to design better search capabilities. “Ontology is a complex multi-disciplinary field that draws upon the knowledge of information organization, natural language processing, information extraction, artificial intelligence, knowledge representation and acquisition” (Ding and Foo, 123).
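To make the contrast concrete, here is a minimal sketch (my own toy example, not taken from Dubois) of how free-text entry terms might be mapped to preferred descriptors in a hypothetical controlled vocabulary before a search is run, so that synonyms retrieve the same records:

# Hypothetical controlled vocabulary: entry term -> preferred descriptor.
THESAURUS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "high blood pressure": "hypertension",
}

def to_controlled(query_terms):
    # Replace known free-text entry terms with preferred descriptors;
    # unknown terms pass through unchanged as free text.
    return [THESAURUS.get(term.lower(), term) for term in query_terms]

print(to_controlled(["Heart attack", "aspirin"]))
# ['myocardial infarction', 'aspirin']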

Two case studies: PsycINFO and genomic information retrieval:
In the case of PsycINFO, the study analysed a database in which records had been indexed more than once, and the analysis of the findings considered a range of variables that could account for the discrepancies. The conclusion was that using a controlled vocabulary improved indexing consistency enormously: 27.05% with uncontrolled vocabulary vs. 44.32% with controlled vocabulary (Leininger, 4).
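For illustration only, a Hooper-style consistency measure (shared terms divided by the union of terms assigned by two indexers) can be computed as below; I am assuming a measure of this general form, and the exact formula used for PsycINFO is the one reported by Leininger:

def indexer_consistency(terms_a, terms_b):
    # Hooper-style consistency: shared terms / all distinct terms, as a percentage.
    a, b = set(t.lower() for t in terms_a), set(t.lower() for t in terms_b)
    return 100.0 * len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical record indexed twice by different indexers:
print(round(indexer_consistency(
    ["anxiety", "adolescents", "group therapy"],
    ["anxiety", "adolescents", "cognitive therapy"]), 2))
# 50.0 (two shared terms out of four distinct terms)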
In the genomic retrieval study, a strategy of “query expansion” was deployed. “[Q]uery expansion is commonly used to assist consumers in health information seeking by addressing the issue of vocabulary mismatch between lay persons and professionals” (Mu and Lu, 205). In short, a query system was built and the researchers concluded that “[t]he results indicate that string index expansion techniques result in better performance than word index expansion techniques and the difference is statistically significant” (Mu and Lu, 205).
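The sketch below illustrates query expansion in general terms, using a hypothetical lay-to-professional synonym table; it does not reproduce the specific string-index vs. word-index expansion techniques that Mu and Lu compare:

# Hypothetical synonym table bridging lay and professional vocabulary.
SYNONYMS = {
    "sugar diabetes": ["diabetes mellitus", "type 2 diabetes"],
    "heart attack": ["myocardial infarction"],
}

def expand_query(terms):
    # OR each user term with its known synonyms to reduce vocabulary mismatch.
    clauses = []
    for term in terms:
        variants = [term] + SYNONYMS.get(term.lower(), [])
        clauses.append("(" + " OR ".join('"%s"' % v for v in variants) + ")")
    return " AND ".join(clauses)

print(expand_query(["sugar diabetes", "treatment"]))
# ("sugar diabetes" OR "diabetes mellitus" OR "type 2 diabetes") AND ("treatment")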

Solution:
We should use co-word analysis to increase useful information retrieval and partially mitigate the problem of language change over time (i.e., differences between decades). “However, gaining access to such information is often difficult, as a result of inconsistency involved in the processing of information and the way in which queries are expressed by searchers” (Ding, Chowdhury, and Foo, 429).
Ding, Chowdhury, and Foo (429) point to research indicating that users choose from a wide variety of terms in their searches and that most of these terms are used infrequently. This leads them to the conclusion that “[i]f inappropriate, incorrect or an insufficient variety of words are used to form the queries or index the records in the system, the users may not be able to find the objects they desire” (Ding, Chowdhury, and Foo, 430). Therefore, in using co-word analyses, we need to define and count the terms utilized so that we create sufficiently usable outputs.
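As a minimal sketch of what co-word analysis computes (a toy version, not the procedure in Ding, Chowdhury, and Foo), the pair counts below show how often two index terms appear together across a set of records; strongly co-occurring pairs can then be offered to searchers as additional search varieties:

from collections import Counter
from itertools import combinations

def coword_counts(records):
    # Count how often each pair of index terms co-occurs across records.
    pairs = Counter()
    for terms in records:
        for a, b in combinations(sorted(set(t.lower() for t in terms)), 2):
            pairs[(a, b)] += 1
    return pairs

records = [["ontology", "semantic web", "thesaurus"],
           ["ontology", "semantic web"],
           ["thesaurus", "indexing"]]
print(coword_counts(records).most_common(1))
# [(('ontology', 'semantic web'), 2)]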
In another study, an ontology was developed using algorithms and statistics. The major drawback of this approach was that only taxonomic relations were learned. “Detecting the non-taxonomic conceptual relationships, for example, the ‘has Part’ relations between concepts, is becoming critical for building good-quality ontologies” (Ding and Foo, 128). Therefore, if we want robust ontologies, we will need to monitor the inputs so that relational querying can be performed adequately.
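As a purely illustrative sketch of what detecting a non-taxonomic relation might involve (crude lexical patterns of my own, far simpler than the learning approaches Ding and Foo review), the snippet below flags candidate ‘hasPart’ pairs in text:

import re

# Crude patterns for candidate part-whole relations (illustrative only).
CONSISTS_OF = re.compile(r"(\w[\w ]*?) consists of (\w[\w ]*)", re.IGNORECASE)
IS_PART_OF = re.compile(r"(\w[\w ]*?) is part of (\w[\w ]*)", re.IGNORECASE)

def extract_has_part(sentence):
    # Return (whole, part) candidate pairs found in one sentence.
    relations = []
    for m in CONSISTS_OF.finditer(sentence):
        relations.append((m.group(1).strip(), m.group(2).strip()))
    for m in IS_PART_OF.finditer(sentence):
        relations.append((m.group(2).strip(), m.group(1).strip()))
    return relations

print(extract_has_part("A thesaurus consists of preferred terms and entry terms."))
# [('A thesaurus', 'preferred terms and entry terms')]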
Finally, we need to employ string index expansion techniques. The genomic retrieval study cited above showed that they perform better, and that the difference is statistically significant.

Conclusion:
“Co-word analysis can play an important role in assisting Traditional Thesauri to provide more search varieties to the end users.  However, it is acknowledged that co-word analysis cannot supply semantic relations between words”  (Ding, Chowdhury, and Foo, 433).  “Ontology promotes standardization and reusability of information representation through identifying common and shared knowledge.  Ontology adds value to traditional thesauri through deeper semantics in digital objects, conceptually,  relationally, and through machine understandability” (Ding and Foo, 132). 
Additionally, ontologies are being touted as a way to use an emergent technology to solve problems inherent in information management. In particular, ontologies are an integral part of the formation of the Semantic Web. What is the Semantic Web? “The term ‘Semantic Web’ was coined … to describe [a] vision of the next generation web that provides services that are much more automated based on machine-processable semantics of data and heuristics” (Ding and Foo, 124). Many studies have been carried out on the development of ontologies. One main drawback is that “[t]he automatically constructed ontology can be too prolific and deficient at the same time” (Ding and Foo, 127). Therefore, a human component in the development of these ontologies was found to be necessary to either broaden or rein in their scope.
It seems that combining these three tactics, string index expansion, ontologies, and co-word analysis, with an overall strategy of using controlled vocabularies will be the most effective way of dealing with the avalanche of data being collected today. If the querying systems continue to be refined and overseen by human participants, the ontologies can continue to grow more robust, making the Semantic Web more valuable to researchers.
On a final note, I am concerned about the emergence of artificial intelligence that dictates the searchable terminology to humans. I am also very concerned about the inherent compression of the English language that is an inevitable outcome of creating metadata “standardization” of the terminology that can be searched. In fairness, technology should aid human cognition, not the other way around. I also realize that in order to find the data we are looking for, we must create a standard to search with. However, when this is applied to research, only a few people will determine which “catchwords” are acceptable. This is truly frightening.




References
Dextre Clarke, S. G. (2008). The last 50 years of knowledge organization: A journey through my personal archives. Journal of Information Science, 34(4), 427-437.
Ding, Y., Chowdhury, G. G., & Foo, S. (2000). Incorporating the results of co-word analyses to increase search variety for information retrieval. Journal of Information Science, 26(6), 429-451.
Ding, Y., & Foo, S. (2002). Ontology research and development. Part 1 - A review of ontology generation. Journal of Information Science, 28(2), 123-136.
Dubois, C. P. R. (1984). The use of thesauri in online retrieval. Journal of Information Science, 8(2), 63-66.
Leininger, K. (2000). Interindexer consistency in PsycINFO. Journal of Librarianship and Information Science, 32(1), 4-8.
Mu, X., & Lu, K. (2010). Towards effective genomic information retrieval: The impact of query complexity and expansion strategies. Journal of Information Science, 36(2), 194-208. doi:10.1177/0165551509357856
Rowley, J. (1994). The controlled versus natural indexing languages debate revisited: A perspective on information retrieval practice and research. Journal of Information Science, 20(2), 108-118.