1.1 Text Analysis and Mining:
Text mining is the process of deriving high-quality information from text written in natural languages. It is also referred to as text engineering, text data mining or text analytics (Grimes, 2007). It draws on statistical algorithms, natural language technologies, high-speed computing and the ability to handle large amounts of data in order to develop and apply mathematical methods to unstructured text (Alpaydin, 2004). Large corporations, medical institutions, call centers and many other entities collect and archive information in text format. Except for the last 30-odd years, most of the information recorded since humans learned to write exists only in textual format. According to one estimate, 80% of all information available in the world is still only in text format (Grimes, 2008). Processing and sifting vast volumes of unstructured written text by hand is labor-intensive and prohibitively expensive. Text mining endeavors to automate the reading of textual content in documents, leading to segmentation, information extraction and other types of analysis (Chandrasekhar, 2012).
Data mining is the process of discovering and extracting interesting and useful patterns from large numerical databases. Data mining has become a mature technology and has moved from academia and research into industrial practice. Strong methods developed for large data have found useful application in other related areas like text, audio and video mining (Shetty, 2013). Text is often described as unstructured information and it seems that methods developed for well-structured numerical data may not readily be applied to format-free textual data. This view is not entirely true. Text Engineers have developed methods to transform free-flow text data into measurements that data mining methods may handle (Chen and Lu, 2005).
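As a rough illustration of how free-flowing text can be turned into measurements, the following Python sketch builds a term-frequency table with one row per document and one column per word. The function names and the two sample sentences are assumptions made for this example and are not taken from the cited works.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequency_matrix(documents):
    """Return (vocabulary, rows), where each row holds one document's word counts."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Two made-up documents used only to demonstrate the transformation.
docs = [
    "Text mining extracts patterns from text.",
    "Data mining extracts patterns from numerical data.",
]
vocabulary, rows = term_frequency_matrix(docs)

# Print one line per word with its count in each document.
for word, column in zip(vocabulary, zip(*rows)):
    print(f"{word:10s} {column}")
```

Once the text is in this numeric form, each document is simply a vector of counts, which is the kind of input that conventional data mining methods expect.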
1.2 Objectives of Text Mining:
A broad objective of text mining is to develop and/or apply automated methods to locate, access, extract, integrate and manage large textual knowledge resources. Text mining is a multidisciplinary field at the confluence of disparate scientific areas such as information retrieval, high performance computing, statistical pattern recognition, machine learning, natural language processing, artificial intelligence and visualization (Berry, 2007). Although the differences between human and computer languages are vast, advances in technology are rapidly closing the gap. The field of natural language processing has produced technologies that teach natural language to computers so that they may analyze, understand and even generate text. Technologies that have been developed and can be used in text mining include text summarization, topic tracking, information extraction, clustering, categorization, information visualization, concept linkage and question answering (Fan, Wallace and Rich, 2005). In the following paragraphs each of these technologies, and the role it plays in text mining, is explained briefly. The types of situations in which each technology may be useful are also illustrated, to help readers identify tools of interest to them, to their organizations or to their research.
1.3 Information Generation, Storage and Dissemination:
A clear idea of what information is, and of how it is generated, stored and disseminated, must precede any task of processing it. The term ‘information’ refers to any meaningful message recorded in conventional or non-conventional media. Usually a distinction is made between data, information, knowledge and wisdom, which are seen as a hierarchy (from low to high) of human capability to assimilate and act upon the message. In this study, however, the term information is used interchangeably to mean the hierarchy itself or an element of it. Along with essentials like air, water, food and shelter, information is considered vital for humanity. Information is an important ingredient at every point in time and is a critical national resource. Countries that have more information resources are considered to be at a greater advantage than those less well endowed. It is a vital raw material that supports informed decision making at all levels, from personal to national, contributing to the socio-economic and cultural progress of the country. To be useful, information must have specific quality attributes (Williams, 1965). They are as follows:
- Quantity: Quantity of information can be measured by the number of documents, words, pages, drawings, bits and pictures.
- Content: The information should convey some meaning.
- Structure: The formal representation or organization of information and the logical relationships amongst the statements or elements within it.
- Language: The symbols, alphabet, codes and syntax with which ideas are expressed.
- Quality: Quality refers to accuracy, content, recency and frequency. We all expect information to be reliable and accurate. The trustworthiness of information is increased if it can be verified. Information must be sufficiently up to date for the purpose for which it is to be used. It must be complete and precise, allowing the recipient to select specific detail according to his or her need.
- Life: The total span of time in which value can be derived from information.
The information hierarchy starts with data. Every observable event is noticed and recorded via a set of measurements. This is the basic elemental form. When a context is built around this set, by asking questions like what, where and when, it becomes information. When the information is interpreted and the why and how of the event can be described, it becomes knowledge. The subjective knowledge of an individual is transformed into objective knowledge by that individual’s public expression via speech, writing and so on. Objective knowledge is publicly observable by all and is essentially the wisdom of past generations, collected together in our archives, museums, libraries and so on. Learning how to apply the objective knowledge that we have more effectively is the real role of information science (Ramesh Babu, 2004).
One of the essential requirements for using information effectively and transforming it into knowledge is to store it securely. Evolution in this area has been metamorphic. The early Stone Age inscriptions were the starting point in the evolution of information storage media. Inscriptions on stone were superseded by manuscripts made of palm leaves, cloth and paper, which offered better storage and faster recording. Figure 1.1 depicts this evolution as a timeline of storage media from the Stone Age to the present age. At each stage, parameters like cost per bit, speed, storage density and longevity improved exponentially. At the same time, software techniques were developed for security, authentication and non-repudiation. Today it is possible to store 1 terabyte of data in a space of one cubic centimeter at less than a paisa per megabyte.
Along with these developments in the area of information, the later part of the 20th century witnessed another technological revolution: computers started arriving on the scene. As per Moore’s law, the power of computing has been doubling every two years, and today more power exists on an individual’s desk than existed with a “fortune one” company four decades ago. Soon this power will be brought to an individual’s palm. Concomitantly, the human ability to disseminate data in vast quantities has kept pace. The seamless integration of computer technology and telecommunications has facilitated the analysis, storage and communication of data. E-mail, electronic fax, e-journals, videotext, teletext, document delivery, online browsing and networking have made cheap and rapid transmission of information possible. With this confluence of technologies the whole world has turned into a global village. Thus the use of telecommunications in conjunction with computers has made it possible to process information and to transmit it to any location instantaneously.
On the software front, artificial intelligence (AI) techniques have moved from the research desk to industrial practice. Complex pattern recognition algorithms are able to learn from data as humans do (Weiss et al., 2005). In the past the role of computers was limited to data processing and decision support. In a paradigm shift, present-day AI techniques are able to learn from data and to take informed decisions just as humans do. In other words, machines have entered the knowledge and wisdom levels of the information hierarchy, which hitherto were the monopoly of human beings.
The 21st century is seeing not only the continued growth of the internet but also a multidisciplinary approach to information processing and knowledge management. Information is now recognized as a vital resource needed for success in every human endeavor. Collecting, organizing and disseminating information both economically and efficiently is the clarion call of the day. This study is concerned with the application of machine learning techniques to the processing of textual information. In the next section, information processing of textual data is discussed in detail.
1.4 Information Processing of Textual Data:
(i) Information Extraction (IE) is the ability of computers to analyze unstructured text and cull usable knowledge from it. Information extraction techniques identify key phrases and relationships within text. Extraction is performed by looking for pre-defined sequences in documents, a process known as pattern matching. The method discovers and interprets the relationships between all the identified places, people and timeframes to provide the end user with meaningful information. IE provides a platform for applying the time-tested repository of data mining techniques to textual data (Kanya and Geetha, 2007). As text data occurs in very large volumes, a considerable amount of human effort can be saved by adopting this approach.
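The pattern-matching step can be sketched with regular expressions. The three patterns below (dates, titled person names, and places introduced by "in") and the sample sentence are hypothetical, chosen only to show the mechanism. Practical IE systems rely on far richer rules and linguistic resources.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical pre-defined sequences, one compiled pattern per entity type.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"),
    "person": re.compile(r"\b(?:Dr|Mr|Ms|Prof)\. [A-Z][a-z]+(?: [A-Z][a-z]+)*"),
    "place": re.compile(r"\bin ([A-Z][a-z]+)\b"),
}

def extract_entities(text):
    """Return a dict mapping each entity type to the list of matches found."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

# Invented sentence; the name and place are not drawn from any source.
sample = "Dr. Asha Rao presented the results in Chennai on 12 March 2007."
print(extract_entities(sample))
```

The output is a small structured record (who, where, when) recovered from one unstructured sentence, which is the form in which downstream data mining methods can use it.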
(ii) Topic tracking is a system that keeps profiles of users and, based on the documents in which a user has previously shown interest, predicts other similar documents that may be of interest to that user. Topic tracking can be applied in many areas of industry. It can be used to alert companies every time a competitor appears in the news, making it possible to keep up with competing products or changing dynamics in the marketplace. Similarly, businesses may wish to track news about their own company and products. In the medical industry, doctors and others may use this method when they are looking for new treatments for ailments or wish to keep up to date on the latest advances in the field (Lee and Kim, 2008).
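A minimal sketch of this idea follows, assuming the user profile is simply the combined word counts of previously liked documents and that similarity is measured with the cosine of the angle between count vectors. The threshold value and the sample texts are illustrative assumptions.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count the alphabetic word tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_profile(liked_documents):
    """Merge the word counts of every document the user showed interest in."""
    profile = Counter()
    for doc in liked_documents:
        profile.update(bag_of_words(doc))
    return profile

def track(profile, incoming, threshold=0.3):
    """Return incoming documents whose similarity to the profile meets the threshold."""
    return [doc for doc in incoming
            if cosine(profile, bag_of_words(doc)) >= threshold]

profile = build_profile([
    "new drug treatment for diabetes",
    "clinical trial of diabetes treatment",
])
incoming = [
    "diabetes treatment shows promise in trial",
    "stock market closes higher",
]
print(track(profile, incoming))
```

Only the first incoming document shares vocabulary with the profile, so it alone is flagged for the user.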
(iii) Text Summarization is the process of condensing the information in a document to a much smaller size than the original. It is like generating an abstract. For large texts, summarization provides a summary on the basis of which the user may decide whether to read the entire document. This is useful when the response to a user query returns a large number of documents. The key to summarization is to reduce the length and detail of a document while retaining its main points and overall meaning. One of the most widely used strategies is sentence extraction, a process that extracts key sentences from an article by statistically weighting the importance of each sentence. Further, heuristics can use information such as the position of paragraph titles and other document markers for subtopics to locate key sentences and key points in a document (Kyoomarsi, Khosravi and Eslami, 2009).
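Sentence extraction can be sketched as follows. This version weights each sentence by the average document-wide frequency of its content words, which is only one of several possible statistical weightings. The stop-word list and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "that", "for", "on", "was"}

def content_words(text):
    """Return the lowercase word tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    """Keep the highest-scoring sentences, presented in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(frequency[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)

sample = ("Text mining automates the reading of documents. "
          "The weather was pleasant. "
          "Text mining also supports summarization of documents.")
print(summarize(sample))
```

Averaging rather than summing the word weights keeps long sentences from winning purely on length. Positional heuristics of the kind mentioned above could be added as an extra term in `score`.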
(iv) Text Categorization is the process of identifying the main topic or theme of a document from a pre-defined set of themes. It is a supervised learning technique. A categorizing technique treats the document as a “bag of words” and makes no attempt to understand the actual information contained in the document. Categorization takes the frequency of occurrence of words in the given document and, by matching it against the word frequencies of documents already grouped into each category, identifies the topic area of the document (Madhavi, Tu and Luo, 2007).
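One common way to realize bag-of-words categorization is a multinomial naive Bayes classifier with add-one smoothing, sketched below. It is offered as an example of the general approach, not as the specific method of the cited work, and the four training sentences are invented.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_documents):
    """Count word occurrences per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_documents:
        word_counts[category].update(tokens(text))
        doc_counts[category] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def categorize(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in sorted(doc_counts):
        counts = word_counts[category]
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # category prior
        for word in tokens(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

model = train([
    ("the team won the match", "sports"),
    ("the striker scored a goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as the market closed", "finance"),
])
print(categorize("the striker won the match", model))
print(categorize("the market raised rates", model))
```

The classifier never looks at word order or meaning. It relies only on how often each word appeared in the documents already assigned to each category.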
(v) Clustering is a technique that segregates documents into groups based on the similarity and dissimilarity amongst the documents in a collection. It differs from categorization in that documents are clustered naturally, without human intervention to predefine the cluster labels. It is an unsupervised machine learning technique. A basic clustering algorithm treats term frequencies as coordinates and represents each document as a point in an n-dimensional space. Based on the Euclidean distances between them, these points (documents) are segmented into groups called clusters such that the variance within each cluster is small and the variance between clusters is large (Deepika and Mehta, 2014).
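The description above matches the k-means family of algorithms, so a small k-means sketch is given here. The choice of evenly spaced starting centroids is an assumption made to keep the example deterministic, and the four short documents are invented.

```python
import math
import re
from collections import Counter

def vectorize(documents):
    """Turn each document into a term-frequency vector over a shared vocabulary."""
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in documents]
    vocabulary = sorted(set().union(*counts))
    return [[c[w] for w in vocabulary] for c in counts]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, iterations=10):
    """Assign each point to one of k clusters and return the list of labels."""
    # Deterministic start (an assumption): k evenly spaced points as centroids.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

documents = [
    "goal match team goal",
    "team match striker goal",
    "bank market shares bank",
    "market shares rates bank",
]
labels = k_means(vectorize(documents), k=2)
print(labels)
```

No category names are supplied anywhere. The two groups emerge only from the distances between the term-frequency vectors.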
(vi) Concept Linkage is a process that connects related documents by identifying concepts they have in common, helping human users to find information that they might not have found by searching manually. These linkages promote browsing for information as against searching for it. In the biomedical field, so much research has been published that it is not humanly possible for researchers to read all the material and make linkages to related information. Concept linking methods identify links between diseases and overlapping treatment procedures. For example, a text mining solution may easily find a link between the anesthesia administration procedure for a bypass surgery and that for a knee replacement surgery (Mistree and Muster, 1995).
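A very simple form of concept linkage can be sketched by tagging each document with the concepts it mentions and linking any two documents that share at least one. The concept list and the three document snippets are hypothetical. A real system would draw its concepts from a domain thesaurus or ontology.

```python
from itertools import combinations

# Hypothetical concept vocabulary used only for this illustration.
CONCEPTS = {"anesthesia", "bypass surgery", "knee replacement", "physiotherapy"}

def concepts_in(text):
    """Return the known concepts that appear in the text."""
    lowered = text.lower()
    return {c for c in CONCEPTS if c in lowered}

def link_documents(documents):
    """Return {(title_a, title_b): shared_concepts} for every pair sharing a concept."""
    tagged = {title: concepts_in(text) for title, text in documents.items()}
    links = {}
    for a, b in combinations(sorted(tagged), 2):
        shared = tagged[a] & tagged[b]
        if shared:
            links[(a, b)] = shared
    return links

documents = {
    "cardiac": "General anesthesia protocol used during bypass surgery.",
    "ortho": "Knee replacement under regional anesthesia, followed by physiotherapy.",
    "rehab": "Physiotherapy schedules after knee replacement.",
}
for pair, shared in link_documents(documents).items():
    print(pair, sorted(shared))
```

Here the cardiac and orthopedic documents are linked through anesthesia even though neither mentions the other's procedure, which is the kind of connection a keyword search on either procedure would miss.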
(vii) Information Visualization, also called visual text mining or info-graphics, puts large textual sources into a visual hierarchy and provides drill-down capabilities. It is a pictorial representation of information about the textual characteristics of the documents. Visualization can be broadly divided into three categories: (1) volumetric comparisons, which represent lists of words or phrases in documents sized by some relevance measure such as number of occurrences or frequency; (2) Document Contrast Diagrams (DCDs), which use a bubble-type technique and make effective use of color to contrast topic usage in two bodies of text, highlighting key differences as well as similarities; and (3) Directed Sentence Diagrams, which are designed to show the topic ‘flow’ in a body of work via color and Cartesian length (Ning et al., 2008). Figure 1.2 presents the graph types referred to in this section.
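The sizing step behind a volumetric comparison can be sketched without any plotting library: count the words, then scale each count linearly to a display size. The size range, the `top` cut-off and the sample text are assumptions made for this example.

```python
import re
from collections import Counter

def word_sizes(text, min_size=10, max_size=40, top=5):
    """Map the `top` most frequent words to font sizes scaled linearly by count."""
    counts = Counter(re.findall(r"[a-z]+", text.lower())).most_common(top)
    if not counts:
        return {}
    highest, lowest = counts[0][1], counts[-1][1]
    spread = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        word: min_size + (count - lowest) * (max_size - min_size) // spread
        for word, count in counts
    }

print(word_sizes("mining text mining data mining text"))
```

The resulting sizes could then be handed to any drawing tool to render each word at its computed size.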
(viii) Natural Language Question Answering is the automated response to queries posed in natural language. This method deals with the problem of understanding the question and responding with the best possible answer. Some websites are equipped with a Q&A capability that allows visitors to “ask” a question and be given an answer. Q&A can utilize several text mining techniques. For instance, it could use information extraction to discover entities such as people, places and events, or question categorization to assign questions to known types (what, who, where, when, why, how, etc.) (Parikh and Murthy, 2002).
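Question categorization by interrogative word can be sketched in a few lines. The mapping from each question word to an expected answer type is a hypothetical simplification, and the sample questions are invented.

```python
# Hypothetical mapping from question word to the kind of answer expected.
QUESTION_TYPES = {
    "who": "person",
    "where": "place",
    "when": "time",
    "why": "reason",
    "how": "method",
    "what": "definition",
}

def categorize_question(question):
    """Return the answer type for the first question word found, else 'unknown'."""
    for word in question.lower().strip(" ?").split():
        if word in QUESTION_TYPES:
            return QUESTION_TYPES[word]
    return "unknown"

for q in ["Who founded the library?", "Where is the service desk?", "Please list products"]:
    print(q, "->", categorize_question(q))
```

Knowing the expected answer type lets a Q&A system narrow its search, for example by looking only at extracted person entities when the question begins with "who".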
(ix) Association Rule Mining (ARM) is a technique used to discover relationships amongst variables in a large data set. ARM, more popularly known as Market Basket Analysis, determines attributes that frequently occur together. For example, ARM discovers which items are frequently purchased together in a customer’s basket (e.g. toothpaste and soap). These associations can then be used to build strategies for cross-selling and up-selling. In association rules for text mining, the focus is on the relationships and implications among the topics, or descriptive concepts, that are used to characterize a corpus. The goal is to discover important association rules within a corpus such that the presence of a set of topics in an article implies the presence of another topic (Dion and Rebecca, 2008).
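The two standard measures behind such rules are support (the fraction of articles containing all the topics in the rule) and confidence (support of the rule divided by support of its left-hand side). The sketch below is restricted to rules between single topics, and the thresholds and topic sets are illustrative assumptions.

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for single-topic rules lhs -> rhs."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

# Each set holds the topics assigned to one (made-up) article.
articles = [
    {"elections", "economy"},
    {"elections", "economy", "health"},
    {"economy", "health"},
    {"elections", "economy"},
]
for lhs, rhs, s, c in association_rules(articles):
    print(f"{lhs} -> {rhs}  support={s:.2f} confidence={c:.2f}")
```

In this toy corpus every article about health also covers the economy, so the rule health -> economy has full confidence, while the reverse rule does not meet the confidence threshold.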
1.5 Applications of Text Mining in Industry:
(i) Lightweight Document Matching for Digital Libraries: Nowadays mobile devices are increasingly used for accessing server-based digital libraries. To extend this capability beyond keyword-based search, an application takes a specimen document and retrieves documents similar to it. The specimen document is indexed; using the index, the specimen is first retrieved and then used as an example to retrieve other documents, which are presented to the user ranked by similarity. Whereas keyword search produces many irrelevant documents and has a low F-score, this application produces results with a much higher F-score (Weiss, White and Apte, 1999).
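The ranking step and the F-score used to judge it can be sketched as follows. Jaccard similarity over word sets is used here as a stand-in similarity measure and is not necessarily what the cited application uses. The library contents and specimen are invented. The F-score shown is the usual F1, the harmonic mean of precision and recall.

```python
import re

def word_set(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_by_specimen(specimen, library):
    """Return library titles sorted by decreasing similarity to the specimen text."""
    target = word_set(specimen)
    scored = [(jaccard(target, word_set(text)), title) for title, text in library.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [title for score, title in ordered if score > 0]

def f_score(retrieved, relevant):
    """F1: harmonic mean of precision and recall for a retrieved set."""
    hits = len(set(retrieved) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(set(retrieved))
    recall = hits / len(set(relevant))
    return 2 * precision * recall / (precision + recall)

library = {
    "doc_a": "clustering of text documents",
    "doc_b": "document clustering methods",
    "doc_c": "cooking with seasonal vegetables",
}
ranking = rank_by_specimen("methods for clustering text documents", library)
print(ranking)
print(f_score(ranking, ["doc_a", "doc_b"]))
```

A search that returned all three documents when only one was relevant would have perfect recall but a precision of one third, giving an F-score of 0.5. This is how retrieving many irrelevant documents drags the score down.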
(ii) E-Mail Filtering: E-mail is now the main means of personal and commercial communication. The volume of e-mail an individual receives is formidable, and in the maze of irrelevant mail one can easily miss an important message. An e-mail filter is a text mining tool that segregates messages into folders automatically. It develops rules by training and allows users to specify new rules or modify existing ones. By processing the text of each e-mail and matching it against these rules, the application significantly reduces the burden on the individual user of sifting through all the mail (Cohen, 1996).
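The rule-application side of such a filter can be sketched as below. The sketch covers only user-specified keyword rules and does not attempt the rule-learning step. The folder names, keywords and messages are hypothetical.

```python
import re

# Hypothetical starting rules: (folder, keywords). The first matching rule wins.
DEFAULT_RULES = [
    ("invoices", {"invoice", "payment", "receipt"}),
    ("meetings", {"meeting", "agenda", "schedule"}),
]

def add_rule(rules, folder, keywords):
    """Let the user append a new rule to an existing rule list."""
    rules.append((folder, set(keywords)))

def route(message, rules, default="inbox"):
    """Return the folder of the first rule whose keywords appear in the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    for folder, keywords in rules:
        if words & keywords:
            return folder
    return default

rules = list(DEFAULT_RULES)
add_rule(rules, "travel", ["flight", "hotel"])

for message in ["Your invoice for March is attached",
                "Hotel booking confirmed",
                "Lunch on Friday?"]:
    print(message, "->", route(message, rules))
```

A trained filter would generate entries for the rule list from messages the user has already filed, while still allowing manual additions of the kind made with `add_rule`.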
(iii) Customer Relationship Management (CRM): In the CRM domain, text mining is applied to customer feedback analysis, to routing customer requests and complaints to the appropriate service desk, and to generating automated responses to frequently encountered problems. “Services” research has emerged as a greenfield area for applying advances in computer science and natural language processing. The volume of unstructured text documents produced from a variety of sources in today’s contact centers has exploded. Customer engagement, rapid complaint resolution and mass personalized service delivery are receiving increased attention. Analytics and business intelligence (BI) applications with a customer-centric focus have led to the emergence of areas like customer experience management, customer relationship management and customer service quality. These are becoming critical to competitive growth, and sometimes even to a company’s very survival (Gao, Chang and Han, 2007).
(iv) Market Analytics uses Text Mining (TM) mainly to analyze competitors’ activities and to monitor consumers’ opinions in order to identify potential product innovations and feature enhancements. It is also used to determine a company’s image and brand positioning. By analyzing press reviews, comparison websites, social networks and other relevant sources, businesses try to see how their own services are perceived vis-a-vis those of competitors. Most companies employ tele-marketing and e-mail solicitations to customize messages to customers and prospects. TM techniques make it possible to gather and analyze market sentiment unobtrusively (Grimes, 2005).
(v) HR Analytics is an active area today. TM techniques are used to manage human resources strategically, mainly through applications that analyze staff’s informal opinions, monitor the level of employee satisfaction, machine-read and store resumes, and pre-screen potential new hires. Companies dealing with employment contracts and their renewals use text mining to keep track of contract expiry dates and of specific employment clauses needing attention from time to time (Fan et al., 2005).
Original Research Article:
- Chandrasekhar, C. K. (2010). INFORMATION PROCESSING OF TEXTUAL DATA USING ROBUST MACHINE LEARNING METHODS A CASE STUDY. Retrieved from: http://hdl.handle.net/10603/180567