
Topic: computer data handbook


There are 40+ documents under the topic "computer data handbook"

Data Mining and Knowledge Discovery Handbook, 2 Edition part 20

tailieu.vn

Almuallim H., An Efficient Algorithm for Optimal Pruning of Decision Trees. ... and Singh V., CLOUDS: A Decision Tree Classifier for Large Datasets, Conference on Knowledge Discovery and Data Mining (KDD-98), August 1998. Baker E., and Jain A., In Proceedings of the Third International Joint Conference on Pattern Recognition, pages 45-49, San Diego, CA, 1976. Bratko I., and Bohanec M., Trading...

Data Mining and Knowledge Discovery Handbook, 2 Edition part 21

tailieu.vn

Polytrees. When the topology of a Bayesian network is restricted to a polytree structure, a directed acyclic graph with only one path linking any two nodes in the graph, we can exploit the fact that every node in the network divides the polytree into two disjoint sub-trees. The source of complexity of these algorithms is the identification of...
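
The separation property mentioned in this excerpt can be illustrated with a short sketch. This is not code from the handbook; the function name, the edge-list representation, and the example graph are assumptions made only for illustration. It splits a polytree around a node into the part reached through its parents and the part reached through its children, which are disjoint.

```python
from collections import defaultdict, deque

def split_polytree(edges, node):
    """Split a polytree (list of directed (parent, child) edges) into the part
    reached through `node`'s parents and the part reached through its children.
    In a polytree these two sets are disjoint, which propagation exploits."""
    adj = defaultdict(set)          # undirected adjacency
    parents, children = set(), set()
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
        if v == node:
            parents.add(u)
        if u == node:
            children.add(v)

    def reach(starts):
        seen, queue = set(), deque(starts)
        while queue:
            x = queue.popleft()
            if x in seen or x == node:
                continue
            seen.add(x)
            queue.extend(adj[x] - seen - {node})
        return seen

    return reach(parents), reach(children)

# Example polytree: A -> C, B -> C, C -> D, D -> E
up, down = split_polytree([("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")], "C")
print(up, down)   # {'A', 'B'} {'D', 'E'}
```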

Data Mining and Knowledge Discovery Handbook, 2 Edition part 22

tailieu.vn

By repeating this procedure for each case in the database, we compute fitted values for each variable Y_i, and then define the blanket residuals as the difference between the observed and fitted values, r_ik. Lack of significant patterns in the residuals r_ik and approximate symmetry about 0 will provide evidence in favor of a good fit for the variable Y_i, while anomalies in...
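
A minimal sketch of the diagnostic implied by this excerpt, assuming the observed and fitted values for one variable are available as NumPy arrays; the function names and the skewness check are illustrative choices, not the chapter's own procedure.

```python
import numpy as np

def blanket_residuals(y_obs, y_fit):
    """Residuals for one variable Y_i across all cases k: observed minus fitted."""
    return np.asarray(y_obs, float) - np.asarray(y_fit, float)

def symmetry_diagnostics(residuals):
    """Simple checks for approximate symmetry about 0."""
    r = np.asarray(residuals, float)
    return {
        "mean": float(r.mean()),          # should be close to 0
        "median": float(np.median(r)),    # should be close to 0
        "skewness": float(((r - r.mean()) ** 3).mean() / r.std() ** 3),
    }

rng = np.random.default_rng(0)
y_fit = rng.normal(size=200)
y_obs = y_fit + rng.normal(scale=0.3, size=200)   # a well-fitted variable
print(symmetry_diagnostics(blanket_residuals(y_obs, y_fit)))
```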

Data Mining and Knowledge Discovery Handbook, 2 Edition part 23

tailieu.vn

Description of the variables used in the analysis. Hoh denotes the Head of the Household; numbers of adult males, females, and children refer to the household. ... of the household increases. The dependency of the gender of the household head on the ethnic group shows that Blacks have the smallest probability of having a male head of the household (64%), while...

Data Mining and Knowledge Discovery Handbook, 2 Edition part 24

tailieu.vn

11.2 Some Definitions. The fitted value ŷ_0 at x_0 can be written as a weighted sum of the observed responses, ŷ_0 = Σ_j S_0j y_j. In practice, the weights decline with distance from x_0, sometimes abruptly (as in a step function), so that many of the values in S_0j are often zero. The function linking the response variable y to the predictor x can...
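
A minimal sketch of such a linear smoother. The triangular kernel, bandwidth, and function name are illustrative assumptions; the point is only that the weight vector plays the role of the S_0j row, with weights shrinking (here, to exactly zero outside the bandwidth) as distance from x_0 grows.

```python
import numpy as np

def local_average_fit(x, y, x0, bandwidth=1.0):
    """Fitted value at x0 as a weighted sum of observed responses.
    Weights decline with distance from x0 and are zero outside the bandwidth,
    so most entries of the weight vector (the S_0j row) are zero."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dist = np.abs(x - x0)
    w = np.where(dist <= bandwidth, 1.0 - dist / bandwidth, 0.0)  # triangular kernel
    w = w / w.sum()
    return float(w @ y), w

x = np.linspace(0, 10, 21)
y = np.sin(x)
yhat0, weights = local_average_fit(x, y, x0=3.0, bandwidth=1.5)
print(round(yhat0, 3), int((weights == 0).sum()), "zero weights")
```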

Data Mining and Knowledge Discovery Handbook, 2 Edition part 25

tailieu.vn

Consider now an application of the generalized additive model. For the data described earlier, Figure 11.3 shows the relationship between the number of homicides and the number of executions a year earlier, with state and year held constant. Indicator variables are included for each state to adjust for average differences over time in the number of homicides in each state. For example,...

Data Mining and Knowledge Discovery Handbook, 2 Edition part 26

tailieu.vn

Dasu, T., and T. (2000) Support Vector Machines. Fan, J., and I. Friedman, J., Hastie, T., and R. Freund, Y., and R. (1996) "Experiments with a New Boosting Algorithm," Machine Learning: Proceedings of the Thirteenth International Conference: 148-156. Hand, D., Mannila, H., and P. Smyth (2001) Principles of Data Mining. LeBlanc, M., and R. Tibshirani (1996) "Combining Estimates on...

Data Mining and Knowledge Discovery Handbook, 2 Edition part 27

tailieu.vn

ε (12.24). The regularization constant C > 0 determines a trade-off between model complexity and points lying outside of the tube. The support vectors and the support values of the solution define the following regression function: f(x) = Σ_i (α_i − α_i*) K(x_i, x) + b (12.25). There are degrees of freedom for constructing SVR, such as how to penalize or regularize different parts of the vector,...
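
A minimal sketch of ε-SVR using scikit-learn, an assumption about tooling rather than anything the excerpt prescribes; C and epsilon correspond to the regularization constant and the half-width of the tube discussed above, and the data are synthetic.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# C trades off model complexity against points falling outside the epsilon-tube;
# epsilon is the half-width of the tube within which errors are not penalized.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)

print("support vectors:", model.support_vectors_.shape[0])
print("prediction at x=1.0:", float(model.predict([[1.0]])[0]))
```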

Data Mining and Knowledge Discovery Handbook, 2 Edition part 28

tailieu.vn

A very simple example of such a table is presented as Table 13.1, in which the attributes are Temperature, Headache, Weakness, and Nausea, and the decision is Flu. The set of all cases labeled by the same decision value is called a concept. For Table 13.1, one case set is the concept of all cases affected by flu (for each case from...
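
A minimal sketch of the idea of a concept, assuming a handful of made-up cases with the attribute names given in the excerpt; the entries of Table 13.1 itself are not reproduced, so the values below are illustrative only.

```python
# Each case: identifier -> (attribute values, decision value for Flu).
cases = {
    1: ({"Temperature": "high",   "Headache": "yes", "Weakness": "yes", "Nausea": "no"},  "yes"),
    2: ({"Temperature": "normal", "Headache": "no",  "Weakness": "no",  "Nausea": "no"},  "no"),
    3: ({"Temperature": "high",   "Headache": "no",  "Weakness": "yes", "Nausea": "yes"}, "yes"),
    4: ({"Temperature": "normal", "Headache": "yes", "Weakness": "no",  "Nausea": "no"},  "no"),
}

def concept(cases, decision_value):
    """The concept: the set of all cases labeled by the same decision value."""
    return {cid for cid, (_, flu) in cases.items() if flu == decision_value}

print(concept(cases, "yes"))   # {1, 3}: the concept of cases affected by flu
```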

Data Mining and Knowledge Discovery Handbook, 2 Edition part 29

tailieu.vn

Another rule induction algorithm was developed by R. S. Michalski. Many versions of the algorithm have been developed, under different names (Michalski et al., 1986A). Let us start by quoting some definitions from (Michalski et al., 1986A). A seed is a member of the concept, i.e., a positive case. A selector is an...

Data Mining and Knowledge Discovery Handbook, 2 Edition part 30

tailieu.vn

forming categories of entities and assigning individuals to the proper groups within it. 14.2 Distance Measures. Since clustering is the grouping of similar instances/objects, some sort of measure that can determine whether two objects are similar or dissimilar is required. It is useful to denote the distance between two instances x_i and x_j as d(x_i, x_j)...
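
A minimal sketch of common choices for d(x_i, x_j) on numeric attributes; the excerpt does not fix a particular metric, so the Minkowski family below (Euclidean for p=2, Manhattan for p=1) is illustrative.

```python
import numpy as np

def minkowski_distance(x_i, x_j, p=2):
    """Minkowski distance between two instances; p=2 gives Euclidean distance,
    p=1 gives Manhattan distance."""
    diff = np.abs(np.asarray(x_i, float) - np.asarray(x_j, float))
    return float((diff ** p).sum() ** (1.0 / p))

x_i, x_j = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski_distance(x_i, x_j, p=2))  # 5.0 (Euclidean)
print(minkowski_distance(x_i, x_j, p=1))  # 7.0 (Manhattan)
```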

Data Mining and Knowledge Discovery Handbook, 2 Edition part 31

tailieu.vn

Such methods typically require that the number of clusters be pre-set by the user. Because exhaustive enumeration of all possible partitions is not feasible, certain greedy heuristics are used in the form of iterative optimization. The most well-known criterion is the Sum of Squared Error (SSE), which measures the total squared Euclidean distance of instances to their representative values. The latter option is the...
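
A minimal sketch of the SSE criterion, assuming instances and cluster centroids are NumPy arrays and `labels` assigns each instance to its representative centroid; the data below are illustrative.

```python
import numpy as np

def sum_of_squared_error(X, centroids, labels):
    """SSE: total squared Euclidean distance of instances to their
    representative (centroid) values."""
    X = np.asarray(X, float)
    centroids = np.asarray(centroids, float)
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centroids = np.array([[0.0, 0.5], [5.5, 5.0]])
labels = np.array([0, 0, 1, 1])
print(sum_of_squared_error(X, centroids, labels))  # 1.0
```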

Data Mining and Knowledge Discovery Handbook, 2 Edition part 32

tailieu.vn

This method identifies candidate cluster centroids by using repeated random samples of the original data. Because of the use of random sampling, the time complexity is O(n) for a pattern set of n elements. This algorithm has a time complexity linear in the number of instances. All algorithms presented till this point assume that the entire...
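
A minimal sketch of the general idea only, not the specific algorithm the excerpt refers to: several small random samples are clustered cheaply and their centroids collected as candidates, so the cost per sample does not grow with the full data size. The sample sizes, use of scikit-learn's KMeans, and data are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_centroids(X, k, n_samples=5, sample_size=50, seed=0):
    """Collect candidate centroids by clustering several small random samples;
    each sample is processed in time independent of the full data size."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    candidates = []
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(X[idx])
        candidates.append(km.cluster_centers_)
    return np.vstack(candidates)

X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(200, 2)) for c in (0.0, 5.0)])
print(candidate_centroids(X, k=2).round(2))
```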

Data Mining and Knowledge Discovery Handbook, 2 Edition part 33

tailieu.vn

15.1.1 Formal Problem Definition. ... can be reformulated as an itemset T by a_i ∈ T ⇔ t_i = 1. We want to use the motivation of the introductory example to define an association explicitly. If the probability of having sausages (S) or mustard (M) in the shopping carts of our customers is 10% and 4%,...
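
A minimal sketch following the excerpt's example: if sausages and mustard were bought independently, the expected joint probability would be the product of the two percentages, and comparing the observed joint support with that product (their ratio is the usual lift) is one way to make "association" explicit. Only the two quoted percentages come from the excerpt; the observed support is illustrative.

```python
p_sausages = 0.10          # P(S), from the excerpt
p_mustard = 0.04           # P(M), from the excerpt
p_joint_observed = 0.015   # illustrative observed support of {S, M}

# Under independence we would expect P(S) * P(M) in the carts.
p_joint_expected = p_sausages * p_mustard      # 0.004
lift = p_joint_observed / p_joint_expected     # > 1 suggests a positive association

print(f"expected under independence: {p_joint_expected:.4f}")
print(f"observed: {p_joint_observed:.4f}, lift: {lift:.1f}")
```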

Data Mining and Knowledge Discovery Handbook, 2 Edition part 34

tailieu.vn

P(Y=y) · j(X | Y=y), which is the J-value of the rule and is bounded by 0.53 bit. Other measures are conviction (a "directed", asymmetric lift) (Brin et al., 1997B), certainty factors from MYCIN (Berzal et al., 2001), correlation coefficients from statistics (Tan and Kumar, 2002), Laplace or Gini from rule induction (Clark and Boswell, 1991) or decision tree induction (Breiman, 1996)...
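
A minimal sketch of two of the measures named above, using their standard definitions: conviction as a directed, asymmetric variant of lift, and the J-value in the P(Y=y) · j(X | Y=y) form quoted in the excerpt, where the inner term is a two-point relative entropy in bits. The probabilities below are illustrative.

```python
from math import log2

def conviction(conf_xy, p_y):
    """Conviction of a rule X -> Y:
    conv = P(X)P(not Y) / P(X, not Y) = (1 - P(Y)) / (1 - conf(X -> Y))."""
    return (1.0 - p_y) / (1.0 - conf_xy)

def j_value(p_y, p_x, p_x_given_y):
    """J-value of a rule, P(Y=y) times a two-point relative entropy (in bits)."""
    j = (p_x_given_y * log2(p_x_given_y / p_x)
         + (1 - p_x_given_y) * log2((1 - p_x_given_y) / (1 - p_x)))
    return p_y * j

# Illustrative probabilities only.
print(round(conviction(conf_xy=0.8, p_y=0.4), 2))                # 3.0
print(round(j_value(p_y=0.4, p_x=0.3, p_x_given_y=0.7), 3))
```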

Data Mining and Knowledge Discovery Handbook, 2 Edition part 35

tailieu.vn

Frequent sets play an essential role in many Data Mining tasks that try to find interesting patterns from databases, such as association rules, correlations, sequences, episodes, classifiers, clusters, and many more, of which the mining of association rules, as explained in Chapter 14.7.3 in this volume, is one of the most popular problems. The identification of sets of...

Data Mining and Knowledge Discovery Handbook, 2 Edition part 36

tailieu.vn

since a set that is frequent in the complete database must be relatively frequent in one of the parts. Also, the algorithm is highly dependent on the heterogeneity of the database and can generate too many local frequent sets, resulting in a significant decrease in performance. The presented Sampling algorithm picks a random sample from the database, then finds...
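
A minimal sketch of the property the excerpt leans on: an itemset that meets the minimum relative support over the whole database must meet it in at least one of the parts, otherwise its total count would be too small. The transactions, partitioning, and threshold below are illustrative.

```python
def relative_support(itemset, transactions):
    """Fraction of transactions containing all items of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# Illustrative database split into two parts.
part1 = [{"a", "b"}, {"a", "b", "c"}, {"b"}, {"a"}]
part2 = [{"a", "b"}, {"c"}, {"a", "b", "c"}, {"b", "c"}]
full = part1 + part2
min_sup = 0.5

candidate = {"a", "b"}
print(relative_support(candidate, full) >= min_sup)                            # frequent overall
print(any(relative_support(candidate, p) >= min_sup for p in (part1, part2)))  # True in some part
```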

Data Mining and Knowledge Discovery Handbook, 2 Edition part 37

tailieu.vn

Some others define syntactical restrictions (e.g., the "length" of the pattern is below a threshold), and checking them does not need any access to the data. We emphasized, however, that the model is quite general: besides itemsets or sequences, L can denote, e.g., the language of partitions over a collection of objects or the language of decision trees on...
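
A minimal sketch of a syntactic constraint in the sense of the excerpt: a predicate on the pattern itself (here, its length), evaluated without touching the database. The names and patterns are illustrative.

```python
def length_constraint(max_len):
    """Syntactic constraint: the 'length' of the pattern is below a threshold.
    Checking it needs only the pattern, never the data."""
    return lambda pattern: len(pattern) <= max_len

patterns = [("beer",), ("beer", "chips"), ("beer", "chips", "salsa", "lime")]
ok = length_constraint(3)
print([p for p in patterns if ok(p)])   # drops the 4-item pattern
```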

Data Mining and Knowledge Discovery Handbook, 2 Edition part 38

tailieu.vn

describe a mining algorithm but rather a pruning technique for non-anti-monotonic and non-monotonic constraints. Considering a sub-lattice A of 2^I, the problem is to decide whether this sub-lattice can be pruned. A sub-lattice is characterized by its maximal element M and its minimal element m, i.e., the sub-lattice is the collection of all itemsets S...
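
A minimal sketch that spells out the characterization in the excerpt: the sub-lattice with minimal element m and maximal element M is the family of itemsets S with m ⊆ S ⊆ M, so it can be enumerated by adding subsets of M \ m to m. The function name and example sets are illustrative.

```python
from itertools import combinations

def interval_sublattice(m, M):
    """All itemsets S with m <= S <= M (the sub-lattice characterized by its
    minimal element m and maximal element M)."""
    m, M = frozenset(m), frozenset(M)
    free = sorted(M - m)
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield m | set(extra)

for S in interval_sublattice({"a"}, {"a", "b", "c"}):
    print(sorted(S))
# ['a'], ['a', 'b'], ['a', 'c'], ['a', 'b', 'c']
```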

Data Mining and Knowledge Discovery Handbook, 2 Edition part 39

tailieu.vn

The following paragraphs present two algorithms for incorporating link information into search engines: PageRank (Page et al., 1998) and Kleinberg's Hubs and Authorities (Kleinberg, 1999). The PageRank algorithm takes a set of interconnected pages and calculates a score for each. Similarly, a page that is pointed to by numerous other marginally important pages is probably itself important. A more formal...
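
A minimal sketch of the PageRank idea as power iteration on a tiny link graph. The damping factor of 0.85, the handling of dangling pages, and the example graph are illustrative assumptions, not taken from the excerpt.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's score to the pages it links to;
    a page gains importance from being pointed to by many or important pages."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outgoing in links.items():
            if not outgoing:               # dangling page: spread its score evenly
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
            else:
                for q in outgoing:
                    new[q] += damping * rank[p] / len(outgoing)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print({p: round(s, 3) for p, s in pagerank(links).items()})
```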