
Data Mining P1



- Data Mining.
- No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA, or on the web at www.copyright.com.
- Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representation or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose.
- Data mining: multimedia, soft computing, and bioinformatics / Sushmita Mitra and Tinku Acharya.
- 006.3—dc Printed in the United States of America.
- Preface
- 1 Introduction to Data Mining
  - 1.1 Introduction
  - 1.2 Knowledge Discovery and Data Mining
  - 1.3 Data Compression
  - 1.4 Information Retrieval
  - 1.5 Text Mining
  - 1.6 Web Mining
  - 1.7 Image Mining
  - 1.8 Classification
  - 1.9 Clustering
  - 1.10 Rule Mining
  - 1.11 String Matching
  - 1.12 Bioinformatics
  - 1.13 Data Warehousing
  - 1.14 Applications and Challenges
  - 1.15 Conclusions and Discussion
  - References
- 2 Soft Computing
  - 2.1 Introduction
  - 2.2 What is Soft Computing?
    - 2.2.1 Relevance
    - 2.2.2 Fuzzy sets
    - 2.2.3 Neural networks
    - 2.2.4 Neuro-fuzzy computing
    - 2.2.5 Genetic algorithms
    - 2.2.6 Rough sets
    - 2.2.7 Wavelets
  - 2.3 Role of Fuzzy Sets in Data Mining
    - 2.3.1 Clustering
    - 2.3.2 Granular computing
    - 2.3.3 Association rules
    - 2.3.4 Functional dependencies
    - 2.3.5 Data summarization
    - 2.3.6 Image mining
  - 2.4 Role of Neural Networks in Data Mining
    - 2.4.1 Rule extraction
    - 2.4.2 Rule evaluation
    - 2.4.3 Clustering and self-organization
    - 2.4.4 Regression
    - 2.4.5 Information retrieval
  - 2.5 Role of Genetic Algorithms in Data Mining
    - 2.5.1 Regression
    - 2.5.2 Association rules
  - 2.6 Role of Rough Sets in Data Mining
  - 2.7 Role of Wavelets in Data Mining
  - 2.8 Role of Hybridizations in Data Mining
  - 2.9 Conclusions and Discussion
  - References
- 3 Multimedia Data Compression
  - 3.1 Introduction
  - 3.2 Information Theory Concepts
    - 3.2.1 Discrete memoryless model and entropy
    - 3.2.2 Noiseless Source Coding Theorem
  - 3.3 Classification of Compression Algorithms
- [...]
    - [...] distance
    - 4.4.3 Text search with k-differences
  - 4.5 Compressed Pattern Matching
  - 4.6 Conclusions and Discussion
  - References
- 5 Classification in Data Mining
  - 5.1 Introduction
  - 5.2 Decision Tree Classifiers
    - 5.2.1 ID
    - 5.2.2 IBM IntelligentMiner
    - 5.2.3 Serial PaRallelizable INduction of decision Trees
  - [...]
    - 5.6.2 Rule generation and evaluation
    - 5.6.3 Mapping of rules to fuzzy neural network
    - 5.6.4 Results
  - 5.7 Conclusions and Discussion
  - References
- 6 Clustering in Data Mining
  - 6.1 Introduction
  - 6.2 Distance Measures and Symbolic Objects
    - 6.2.1 Numeric objects
    - 6.2.2 Binary objects
    - 6.2.3 Categorical objects
    - 6.2.4 Symbolic objects
  - 6.3 Clustering Categories
    - 6.3.1 Partitional clustering
    - 6.3.2 Hierarchical clustering
    - 6.3.3 Leader clustering
  - 6.4 Scalable Clustering Algorithms
    - 6.4.1 Clustering large applications
    - 6.4.2 Density-based clustering
    - 6.4.3 Hierarchical clustering
    - 6.4.4 Grid-based methods
    - 6.4.5 Other variants
  - 6.5 Soft Computing-Based Approaches
    - 6.5.1 Fuzzy sets
    - 6.5.2 Neural networks
    - 6.5.3 Wavelets
    - 6.5.4 Rough sets
    - 6.5.5 Evolutionary algorithms
  - 6.6 Clustering with Categorical Attributes
- [...]
  - References
- 9 Multimedia Data Mining
  - 9.1 Introduction
  - 9.2 Text Mining
    - 9.2.1 Keyword-based search and mining
    - 9.2.2 Text analysis and retrieval
    - 9.2.3 Mathematical modeling of documents
    - 9.2.4 Similarity-based matching for documents and queries
    - 9.2.5 Latent semantic analysis
    - 9.2.6 Soft computing approaches
  - 9.3 Image Mining
    - 9.3.1 Content-Based Image Retrieval
    - 9.3.2 Color features
    - 9.3.3 Texture features
    - 9.3.4 Shape features
    - 9.3.5 Topology
    - 9.3.6 Multidimensional indexing
    - 9.3.7 Results of a simple CBIR system
  - 9.4 Video Mining
    - 9.4.1 MPEG-7: Multimedia content description interface
    - 9.4.2 Content-based video retrieval system
  - 9.5 Web Mining
    - 9.5.1 Search engines
    - 9.5.2 Soft computing approaches
  - 9.6 Conclusions and Discussion
  - References
- 10 Bioinformatics: An Application
  - 10.1 Introduction
  - 10.2 Preliminaries from Biology
    - Deoxyribonucleic acid
    - Amino acids
    - Proteins
    - Microarray and gene expression
  - 10.3 Information Science Aspects
    - Protein folding
    - Protein structure modeling
- The success of the digital revolution and the growth of the Internet have ensured that huge volumes of high-dimensional multimedia data are available all around us.
- However, most of this data is of little interest to most users.
- Data mining refers to the process of extracting knowledge that is of interest to the user.
- Data mining is an evolving and growing area of research and development in both academia and industry.
- In this age of multimedia data exploration, data mining should no longer be restricted to the mining of knowledge from large volumes of high-dimensional datasets in traditional databases only.
- Efficient management of such high-dimensional, very large databases also influences the performance of data mining systems.
- It is also important to explore multimedia data compression techniques that are especially suitable for data mining applications.
- With the completion of the Human Genome Project, we have access to large databases of biological information.
- The applicability of data mining in this domain cannot be denied, given the lifesaving prospects of effective drug design.
- Genetic algorithms introduce effective parallel searching in the high-dimensional problem space.
- Storage of such huge datasets being more feasible in the compressed domain, we also devote a reasonable portion of the text to data mining in the compressed domain.
- Current trends show that the advances in data mining need not be constrained to stochastic, combinatorial, and/or classical so-called hard optimization-based techniques.
- We dwell, in greater detail, on the state of the art of soft computing approaches, advanced signal processing techniques such as Wavelet Transformation, data compression principles for both lossless and lossy techniques, access of data using matching pursuits in both raw and compressed data domains, fundamentals and principles of classical string matching algorithms, and how all these areas possibly influence data mining and its future growth.
- We cover aspects of advanced image compression, string matching, content-based image retrieval, etc., which can influence future developments in data mining, particularly for multimedia data mining.
- There are 10 chapters in the book.
- The first chapter provides an introduction to the basics of data mining and outlines its major functions and applications.
- This is followed in the second chapter by a discussion on soft computing and its different tools, including fuzzy sets, artificial neural networks, genetic algorithms, wavelet transforms, rough sets, and their hybridizations, along with their roles in data mining.
- We then present some advanced topics and new aspects of data mining related to the processing and retrieval of multimedia data.
- The huge volumes of data required to be retrieved, processed, and stored make compression techniques a promising area to explore, in the context of both images and texts.
- We deal with multimedia data mining in Chapter 9.
- In each case we discuss the related algorithms, showing how these can be a growing area of study in the light of data mining in the near future.
- Some portion of the material in this book also covers our published work, which has been presented and discussed in different seminars, conferences, and workshops.
- The book may be used in a graduate-level course as a part of the subject of data mining, machine learning, information retrieval, and artificial intelligence, or it may be used as a reference book for professionals and researchers.
- Progress in data mining will further pave the way for usage of information technology in every walk of life in the near future.
- Ping-Sing Tsai, who assisted by reviewing a number of chapters of the book and who provided valuable suggestions to further enrich the content.
- Introduction to Data Mining.
- The advancement of data processing and the emergence of newer applications were possible, partially because of the growth of the semiconductor and subsequently the computer industry.
- According to Moore's law, the number of transistors in a single microchip is doubled every 18 months, and the growth of the semiconductor industry has so far followed the prediction.
- Data mining is an attempt to make sense of the information explosion embedded in this huge volume of data [4].
- The current Internet technology and its growing demand necessitate the development of more advanced data mining technologies to interpret the information and knowledge from the data distributed all over the world.
- As an example, a report on the United States Administration's initiative in the "Information Technology for 21st Century".
- Hence development of advanced data mining technology will continue to be an important area of study, and it is accordingly expected that lots of energy will be spent in this area of development in the coming years.
- Some of the examples include the following.
- Digital library: This is an organized collection of digital information stored in large databases in the form of text (encoded or raw) and possibly as a large collection of document imagery [6].
- Health care: In addition to the above medical image data, other non-image datatypes are also generated every day.
- Finance and investment: Finance and investment is a big data domain of interest for data mining.
- The World Wide Web (WWW) [12]: A huge volume of multimedia data of different types is distributed everywhere on the Internet.
- Biometrics: Because of the need for extraordinary security of human lives today, biometric applications will continue to grow for positive identification of persons.
- While some people treat data mining as a synonym for KDD, some others view it as a particular step in this process involving the application of specific algorithms for extracting patterns (models) from data.
- The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, ensure that useful knowledge is derived from the data.
- Data mining tasks can be descriptive (i.e., discovering interesting patterns or relationships describing the data) or predictive (i.e., predicting or classifying the behavior of the model based on available data).
- Data mining works hand in hand with warehouse data.
- KDD focuses on the overall process of knowledge discovery from large volumes of data, including the storage and accessing of such data, scaling of algorithms to massive datasets, interpretation and visualization of results, and the modeling and support of the overall human-machine interaction.
- Efficient storage of the data, and hence its structure, is very important for its.
- Data mining also overlaps with machine learning, statistics, artificial intelligence, databases, and visualization.
- scalability of the number of features and instances,
- In the remaining part of this chapter we consider data mining from the perspective of machine learning, pattern recognition, image processing, and artificial intelligence.
- We begin by providing the basics of knowledge discovery and data mining in Section 1.2.
- This is followed, in Sections 1.8-1.10, by a treatise on some of the major functions of data mining like classification, clustering, and rule mining.
- String matching, another important aspect of data mining with promising applications to Bioinformatics, is described in Section 1.11.
- Section 1.14 highlights the applications of data mining and some existing challenges to future research.
- 1.2 KNOWLEDGE DISCOVERY AND DATA MINING.
- Understanding the application domain: This includes relevant prior knowledge and goals of the application.
- Data preprocessing: This is required to improve the quality of the actual data for mining.
- Such low-quality data needs to be cleaned prior to data mining.
- Application of the principles of data compression can play an important role in data reduction and is a possible area of future development, particularly in the area of knowledge discovery from multimedia datasets.
- Data mining: Data mining constitutes one or more of the following functions, namely, classification, regression, clustering, summarization, image retrieval, discovering association rules and functional dependencies, rule extraction, etc.
- Interpretation: This includes interpreting the discovered patterns, as well as the possible (low-dimensional) visualization of the extracted patterns.
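The preprocessing, mining, and interpretation steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the record layout, the field names `patient` and `diagnosis`, and the helper names `clean`, `mine`, and `interpret` are hypothetical and do not come from the book, and simple frequency counting stands in for a real mining function.

```python
from collections import Counter


def clean(records):
    """Data preprocessing: drop records that have a missing (None) field."""
    return [r for r in records if all(v is not None for v in r.values())]


def mine(records, field):
    """Data mining (summarization here): count how often each value occurs."""
    return Counter(r[field] for r in records)


def interpret(counts, top=1):
    """Interpretation: present the most frequent values in readable form."""
    return [f"{value}: {n} record(s)" for value, n in counts.most_common(top)]


# Hypothetical toy records; the second one is low-quality (missing value).
raw = [
    {"patient": 1, "diagnosis": "flu"},
    {"patient": 2, "diagnosis": None},
    {"patient": 3, "diagnosis": "flu"},
    {"patient": 4, "diagnosis": "cold"},
]

report = interpret(mine(clean(raw), "diagnosis"))
print(report)
```

Each stage takes the previous stage's output, so a different mining function (classification, clustering, rule discovery) could replace `mine` without touching the cleaning or reporting code.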
- The former uses the structure of the pattern and is generally quantitative.
- Often it fails to capture all the complexities of the pattern discovery process.
- In the literature, unexpectedness is often defined in terms of the dissimilarity of a discovered pattern from a predefined vocabulary provided by the user.
- If, on the other hand, in most of the course evaluations the overall INSTRUCT_RATING is higher than COURSE_RATING and it turns out that in most of Professor X's ratings the overall INSTRUCT_RATING is lower than COURSE_RATING, then such a pattern is unexpected and hence interesting.
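The course-evaluation example can be made concrete with a minimal sketch. It assumes a deliberately simple test for unexpectedness, namely a majority trend within one professor's ratings that contradicts the majority trend over all evaluations; the `Evaluation` record, the helper names, and the sample ratings are hypothetical and are not taken from the book.

```python
from typing import NamedTuple


class Evaluation(NamedTuple):
    professor: str
    instruct_rating: float
    course_rating: float


def share(evals, predicate):
    """Fraction of evaluations that satisfy the predicate."""
    evals = list(evals)
    if not evals:
        raise ValueError("no evaluations to examine")
    return sum(1 for e in evals if predicate(e)) / len(evals)


def is_unexpected(evals, professor):
    """True if INSTRUCT_RATING > COURSE_RATING in most evaluations overall,
    but INSTRUCT_RATING < COURSE_RATING in most of this professor's."""
    evals = list(evals)
    own = [e for e in evals if e.professor == professor]
    overall_higher = share(evals, lambda e: e.instruct_rating > e.course_rating)
    own_lower = share(own, lambda e: e.instruct_rating < e.course_rating)
    return overall_higher > 0.5 and own_lower > 0.5


# Hypothetical ratings, invented for illustration only.
EVALS = [
    Evaluation("A", 4.5, 4.0), Evaluation("A", 4.2, 3.9),
    Evaluation("B", 4.8, 4.1), Evaluation("B", 4.0, 3.5),
    Evaluation("C", 3.9, 3.7),
    Evaluation("X", 3.0, 4.0), Evaluation("X", 3.2, 4.1),
    Evaluation("X", 4.5, 4.0),
]

print(is_unexpected(EVALS, "X"), is_unexpected(EVALS, "A"))
```

A subjective measure would additionally compare the pattern against what the user already believes; this sketch only captures the contradiction with the global trend.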
- Data mining is a step in the KDD process consisting of a particular enumeration of patterns over the data, subject to some computational limitations.
- Data mining involves fitting models to or determining patterns from observed data.
- Deciding whether the model reflects useful knowledge or not is a part of the overall KDD process for which subjective human judgment is usually required.
- Typically, a data mining algorithm constitutes some combination of the following three components.
- The model: The function of the model (e.g., classification, clustering) and its representational form (e.g., linear discriminants, decision trees).
- The criterion is usually some form of goodness-of-fit function of the model to the data, perhaps tempered by a smoothing term to avoid over-fitting, or generating a model with too many degrees of freedom to be constrained by the given data.
- A particular data mining algorithm is usually an instantiation of the model-preference-search components.
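One way to picture the model-preference-search decomposition is the toy instantiation below. It is a sketch under assumptions, not the book's algorithm: the model is a one-parameter linear predictor, the preference criterion is squared error plus a smoothing penalty on the parameter, and the search is an exhaustive scan over a candidate grid. All function names, the data points, and the grid are hypothetical.

```python
def model(w, x):
    """Model: a one-parameter linear predictor y = w * x."""
    return w * x


def preference(w, data, smoothing=0.1):
    """Preference criterion: squared-error goodness of fit, tempered by a
    smoothing term that penalizes large parameter values."""
    fit = sum((y - model(w, x)) ** 2 for x, y in data)
    return fit + smoothing * w ** 2


def search(data, candidates, smoothing=0.1):
    """Search: exhaustive scan for the candidate with the lowest criterion."""
    return min(candidates, key=lambda w: preference(w, data, smoothing))


# Hypothetical observations lying roughly on y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
candidates = [i / 10 for i in range(0, 41)]  # 0.0, 0.1, ..., 4.0

best = search(data, candidates)
print(best)
```

Swapping any one component (a decision tree for the model, a likelihood for the criterion, gradient descent for the search) yields a different algorithm with the same overall shape. Raising `smoothing` pulls the selected parameter toward zero, which is the over-fitting control the criterion describes.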
- Some of the common model functions in current data mining practice include [13, 14]:
- Data compression may play a significant role here, particularly for multimedia data, because of the advantage it offers to compactly represent the data with a reduced number of bits, thereby increasing the database storage bandwidth.
- The goal is to model the states of the process generating the sequence, or to extract and report deviation and trends over time.
- The rapid growth of interest in data mining [22] is due to the (i) advancement of the Internet technology and wide interest in multimedia applications in this domain, (ii) falling cost of large storage devices and increasing ease of collecting data over networks, (iii) sharing and distribution of data over the network, along with the addition of new data to existing data repositories, (iv).
- The most commonly cited reason for scaling up is that increasing the size of the training set often increases the accuracy of learned classification models.
- Finally, the goal of the learning (say, classification accuracy) must not be substantially sacrificed by a scaling algorithm.
- designing a fast algorithm: improving computational complexity, optimizing the search and representation, finding approximate solutions to the computationally complex (NP-complete or NP-hard) problems, or taking advantage of the task's inherent parallelism;
- partitioning the data: dividing the data into subsets (based on instances or features), learning from one or more of the selected subsets, and possibly combining the results.
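The data-partitioning strategy can be sketched as follows: split the instances into subsets, learn a model on each subset, and combine the resulting models. The choices here are assumptions made for illustration only: a one-dimensional nearest-class-mean learner stands in for any real classifier, majority voting is one possible way of combining results, and the function names and toy data are hypothetical.

```python
from collections import Counter
from statistics import mean


def partition(instances, k):
    """Divide the instances into k subsets by striding through the list."""
    return [instances[i::k] for i in range(k)]


def learn_nearest_mean(subset):
    """Learn one mean per class label and return a classifier function."""
    by_label = {}
    for x, label in subset:
        by_label.setdefault(label, []).append(x)
    means = {label: mean(xs) for label, xs in by_label.items()}

    def classify(x):
        # Assign the label whose class mean is closest to x.
        return min(means, key=lambda label: abs(x - means[label]))

    return classify


def combine_by_vote(classifiers, x):
    """Combine the per-subset models by majority vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical labeled one-dimensional instances.
data = [
    (0.9, "low"), (1.1, "low"), (1.0, "low"),
    (1.2, "low"), (0.8, "low"), (1.3, "low"),
    (4.9, "high"), (5.1, "high"), (5.0, "high"),
    (5.2, "high"), (4.8, "high"), (5.3, "high"),
]

subsets = partition(data, 3)
classifiers = [learn_nearest_mean(s) for s in subsets]
print(combine_by_vote(classifiers, 1.5), combine_by_vote(classifiers, 4.0))
```

Because each subset is learned independently, the per-subset training calls could run in parallel or on separate machines, which is where the scaling benefit would come from.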
- (a) 80% of the images containing a car as an object also contain a blue sky.
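A statement such as rule (a) corresponds to the confidence of an association rule, that is, the fraction of items containing the antecedent that also contain the consequent. The sketch below computes confidence and support for a small, invented collection of image annotations; the object labels and the helper names are assumptions for illustration, chosen so that the rule "car implies blue sky" comes out at 80%.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also
    contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [t for t in transactions if antecedent <= set(t)]
    if not matching:
        raise ValueError("antecedent never occurs")
    return sum(1 for t in matching if consequent <= set(t)) / len(matching)


# Hypothetical object annotations, one set per image.
images = [
    {"car", "blue sky"},
    {"car", "blue sky", "tree"},
    {"car", "blue sky"},
    {"car", "blue sky", "road"},
    {"car", "garage"},
    {"tree", "blue sky"},
    {"boat", "water"},
    {"house"},
]

print(confidence(images, {"car"}, {"blue sky"}))  # 4 of 5 car images
print(support(images, {"car", "blue sky"}))       # 4 of 8 images
```

Confidence alone can mislead when the consequent is common anyway, which is why rule mining usually reports support alongside it.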
- Hence data compression continues to be a challenging area of research and development in both academia and industry, particularly in the context of large databases.
- Interestingly enough, most of the datatypes for practical applications such as still image, video, voice, and text generally contain a significant amount of superfluous and redundant information in their canonical representation.
- Data redundancy may appear in different forms in the digital representation of different categories of datatypes.
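The effect of redundancy on compressibility can be observed with a short experiment. This sketch uses Python's standard `zlib` module (a general-purpose lossless compressor, not a technique described in the book) and compares highly repetitive text with pseudo-random bytes of the same length. The sample inputs and the helper name `compression_ratio` are assumptions for illustration.

```python
import random
import zlib


def compression_ratio(raw: bytes) -> float:
    """Original size divided by compressed size (higher means more redundancy)."""
    return len(raw) / len(zlib.compress(raw))


# Highly redundant input: the same phrase repeated many times.
redundant = b"data mining " * 200

# Input with little redundancy: pseudo-random bytes of the same length.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(len(redundant)))

compressed = zlib.compress(redundant)

# Lossless compression must reproduce the original exactly.
assert zlib.decompress(compressed) == redundant

print(len(redundant), len(compressed))
print(compression_ratio(redundant), compression_ratio(noisy))
```

The repetitive input shrinks to a small fraction of its size, while the pseudo-random input barely changes, which shows that the achievable saving depends on how much redundancy the representation contains.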
