Monday, December 6, 2010

The use of free-text vs controlled vocabularies in searching.

Landscape Analysis:
Between the mid-sixties and the mid-seventies, “the findings of various experiments in the testing and evaluating of indexing languages … have demonstrated again and again the strength of the natural language, with minimal or no control, as optimally the best indexing language” (Dubois, 63). However, when these studies were examined more closely, it became clear that they had been conducted on special collections, so their results could not be generalized to large data sets. By the 1980s, the view was that the “ideal search capacity should always be both free text and controlled vocabulary” (Dubois, 64).
As data and the databases that housed them grew larger, however, retrieving valuable information became more difficult. One issue is that language changes over time, particularly in scientific fields. The need to formalize a set of controlled vocabularies to assist searching became increasingly apparent. Many studies of databases in different areas of research tried to settle the question of controlled vocabularies vs. free text. These studies employed a variety of methods to search the databases for specific information.
Definitions:
The purpose of a controlled vocabulary is “[t]o ensure as far as possible the consistent representation of the subject matter of documents both in input to and output from the system and [t]o facilitate the conduct of searches in the system especially by bringing together in some way the terms that are most closely related semantically” (Dubois, 64). The purpose of a free-text vocabulary is to let the user decide which subjects and terms she would like to search for within the database. The user controls the search by supplying its terminology. However, the traditional library techniques used for searching digital databases have been strained by the sheer size of the data. This is why information science is investigating the utility of ontologies in the drive to design better searching capabilities. “Ontology is a complex multi-disciplinary field that draws upon the knowledge of information organization, natural language processing, information extraction, artificial intelligence, knowledge representation and acquisition” (Ding and Foo, 123).

Two case studies: PsycINFO and genomic information retrieval:
In the case of PsycINFO, the study analyzed records in the database that had been processed more than once. The analysis identified a range of variables that contributed to inconsistent indexing. The conclusion was that using a controlled vocabulary improved consistency substantially: 27.05% for uncontrolled vocabulary vs. 44.32% for controlled vocabulary (Leininger, 4).
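To make the idea of a consistency percentage concrete, here is a minimal sketch of how agreement between two indexing passes over the same record could be scored. It assumes Hooper's measure (terms in common divided by all distinct terms assigned), which is a commonly used consistency formula; the post does not say which formula Leininger used, and the sample terms are invented for illustration.

```python
def hooper_consistency(terms_a, terms_b):
    """Share of distinct index terms that both indexing passes agree on.

    Assumption: Hooper's measure, agreed / (agreed + unique to A + unique to B).
    """
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    union = a | b
    if not union:
        # No terms assigned in either pass; report 0.0 rather than divide by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical terms assigned to the same record on two separate passes.
first_pass = ["Anxiety", "Cognitive Therapy", "Adolescents"]
second_pass = ["anxiety", "Psychotherapy", "Adolescents", "Treatment Outcomes"]

score = hooper_consistency(first_pass, second_pass)
print(f"Consistency: {score:.2%}")
```

Two of the five distinct terms match here, so the passes agree on 40% of the terms assigned.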
In the case of the genomic retrieval study, the strategy of “query expansion” was deployed. “[Q]uery expansion is commonly used to assist consumers in health information seeking by addressing the issue of vocabulary mismatch between lay persons and professionals” (Mu and Lu, 205). In short, a query system was utilized and the researchers concluded, “[t]he results indicate that string index expansion techniques result in better performance than word index expansion techniques and the difference is statistically significant” (Mu and Lu, 205).
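As a rough illustration of query expansion in general (not the specific procedure Mu and Lu tested), the sketch below adds professional terms to a lay query by matching whole phrases against a small lookup table. The `SYNONYMS` table and the `expand_query` function are hypothetical; a real system would draw on a maintained thesaurus.

```python
# Hypothetical lay-to-professional term table, for illustration only.
SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
    "cancer": ["neoplasm", "carcinoma"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the normalized query followed by any matching alternative terms."""
    normalized = " ".join(query.lower().split())
    # Pad with spaces so phrases only match on whole-word boundaries.
    # Punctuation is not handled in this sketch.
    padded = f" {normalized} "
    expanded = [normalized]
    for phrase, alternatives in synonyms.items():
        if f" {phrase} " in padded:
            for alt in alternatives:
                if alt not in expanded:
                    expanded.append(alt)
    return expanded


terms = expand_query("Heart attack risk and high blood pressure")
print(terms)
```

The expanded list can then be submitted as alternative search terms alongside the user's original wording.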

Solution:
We should use co-word analysis to increase useful information retrieval and partially offset the problem of language changing over time (for example, differences in terminology between decades). “However, gaining access to such information is often difficult, as a result of inconsistency involved in the processing of information and the way in which queries are expressed by searchers” (Ding, Chowdhury, and Foo, 429).
Ding, Chowdhury, and Foo (429) point to research indicating that users choose from a wide variety of terms and that these terms are used infrequently. This leads them to the conclusion that “[i]f inappropriate, incorrect or an insufficient variety of words are used to form the queries or index the records in the system, the users may not be able to find the objects they desire” (Ding, Chowdhury, and Foo, 430). Therefore, in using co-word analyses, we need to define and number the terms utilized so that we create sufficiently usable outputs.
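A minimal sketch of the co-word idea follows, assuming the simplest form of the technique: count how often two keywords are assigned to the same record, then suggest the most frequent partners of a search term. The function names and the sample records are invented for illustration and are not taken from Ding, Chowdhury, and Foo.

```python
from collections import Counter
from itertools import combinations


def coword_counts(records):
    """Count how often each pair of keywords appears on the same record."""
    counts = Counter()
    for keywords in records:
        # Normalize and sort so each pair is always stored in the same order.
        unique = sorted({k.strip().lower() for k in keywords})
        counts.update(combinations(unique, 2))
    return counts


def related_terms(term, counts, min_count=1):
    """List keywords that co-occur with `term`, most frequent first."""
    term = term.lower()
    related = []
    for (a, b), n in counts.items():
        if n >= min_count and term in (a, b):
            related.append((b if a == term else a, n))
    return sorted(related, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical keyword lists for three records.
records = [
    ["information retrieval", "thesauri", "indexing"],
    ["information retrieval", "thesauri", "ontology"],
    ["ontology", "semantic web"],
]

counts = coword_counts(records)
suggestions = related_terms("information retrieval", counts)
print(suggestions)
```

Raising `min_count` is one simple way to keep rarely paired terms out of the suggestions, which matters when most terms are used infrequently.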
In another study, an ontology was developed using algorithms and statistics. The major drawback of this approach was that only taxonomic relations were learned. “Detecting the non-taxonomic conceptual relationships, for example, the ‘has Part’ relations between concepts, is becoming critical for building good-quality ontologies” (Ding and Foo, 128). Therefore, if we want robust ontologies, we will need to monitor the inputs so that relational querying works adequately.
Finally, we need to employ string index expansion techniques. The genomic retrieval study (Mu and Lu, 205) showed that they perform better and that the difference is statistically significant.
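The difference between taxonomic and non-taxonomic relations can be sketched with a tiny in-memory structure. This is an illustrative toy, not the method from the study: the `Ontology` class, the relation names `is_a` and `has_part`, and the sample concepts are all assumptions.

```python
from collections import defaultdict


class Ontology:
    """Toy store of (subject, relation, object) statements."""

    def __init__(self):
        self._relations = defaultdict(set)

    def add(self, subject, relation, obj):
        self._relations[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        return set(self._relations.get((subject, relation), set()))

    def ancestors(self, concept):
        """Follow taxonomic 'is_a' links upward from a concept."""
        found = set()
        stack = [concept]
        while stack:
            for parent in self.objects(stack.pop(), "is_a"):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found


onto = Ontology()
# Taxonomic relations: broader/narrower links, as in a thesaurus hierarchy.
onto.add("thesaurus", "is_a", "controlled vocabulary")
onto.add("controlled vocabulary", "is_a", "indexing language")
# Non-taxonomic relation: a part-whole link that a pure hierarchy cannot express.
onto.add("thesaurus", "has_part", "descriptor")

print(onto.ancestors("thesaurus"))
print(onto.objects("thesaurus", "has_part"))
```

An approach that learns only `is_a` links would recover the hierarchy but miss the `has_part` statement entirely.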

Conclusion:
“Co-word analysis can play an important role in assisting Traditional Thesauri to provide more search varieties to the end users.  However, it is acknowledged that co-word analysis cannot supply semantic relations between words”  (Ding, Chowdhury, and Foo, 433).  “Ontology promotes standardization and reusability of information representation through identifying common and shared knowledge.  Ontology adds value to traditional thesauri through deeper semantics in digital objects, conceptually,  relationally, and through machine understandability” (Ding and Foo, 132). 
Additionally, ontologies are being touted as a way to use an emergent technology to solve the problems inherent in information management. In particular, ontologies are an integral part of the formation of the semantic web. What is the semantic web? “The term ‘Semantic Web’ was coined … to describe [a] vision of the next generation web that provides services that are much more automated based on machine-processable semantics of data and heuristics” (Ding and Foo, 124). Many studies have been carried out on ontology development. One main drawback was that “[t]he automatically constructed ontology can be too prolific and deficient at the same time” (Ding and Foo, 127). Therefore, a human component in the development of these ontologies was found to be necessary to either broaden or rein in the scope.
It seems that combining these three tactics (string index expansion, ontologies, and co-word analyses) with an overall strategy of using controlled vocabularies will be the most effective way of dealing with the avalanche of data being collected today. If the querying systems continue to be refined and overseen by human participants, the ontologies can continue to grow more robust, making the semantic web more valuable to researchers.
On a final note, I am concerned about the emergence of artificial intelligence that will dictate the searchable terminology to humans. I am also very concerned about the compression of the English language that is an inevitable outcome of creating metadata “standardization” of searchable terminology. In fairness, technology should aid human cognition, not the other way around. I do realize that in order to find the data we are looking for, we must create a standard to search with. However, when this is applied to research, only a few will determine which “catchwords” are acceptable. This is truly frightening.




References
Dextre Clarke, S. G. (2008). The last 50 years of knowledge organization: A journey through my personal archives. Journal of Information Science, 34(4), 427-437.
Ding, Y., Chowdhury, G. G., & Foo, S. (2000). Incorporating the results of co-word analyses to increase search variety for information retrieval. Journal of Information Science, 26(6), 429-451.
Ding, Y., & Foo, S. (2002). Ontology research and development. Part 1 - a review of ontology generation. Journal of Information Science, 28(2), 123-136.
Dubois, C. P. R. (1984). The use of thesauri in online retrieval. Journal of Information Science, 8(2), 63-66.
Leininger, K. (2000). Interindexer consistency in PsycINFO. Journal of Librarianship and Information Science, 32(1), 4-8.
Mu, X., & Lu, K. (2010). Towards effective genomic information retrieval: The impact of query complexity and expansion strategies. Journal of Information Science, 36(2), 194-208. doi:10.1177/0165551509357856
Rowley, J. (1994). The controlled versus natural indexing languages debate revisited: A perspective on information retrieval practice and research. Journal of Information Science, 20(2), 108-118.
