DMOP v5.4 (2014-03-04), v5.5 (11-11-2014)
Editors/Major Contributors: Melanie Hilario, Maria Keet, Agnieszka Lawrynowicz, Claudia d'Amato
Other Contributors: Huyen Do, Simon Fischer, Dragan Gamberger, Lina Al-Jadir, Simon Jupp, Alexandros Kalousis, Petra Kralj Novak, Babak Mougouie, Phong Nguyen, Raul Palma, Robert Stevens, Anze Vavpetic, Jun Wang, Derry Wijaya, Adam Woznica
e-lico Data Mining Ontology (DMO) for Data Mining Optimization (DMOP). Melanie Hilario (v0.1 created Mar 2009)
achieves: an Operation O achieves a Task T indicates that the operation achieves the task addressed by the operator or the workflow executed by the operation.
addresses: an algorithm A addresses a task T == A specifies a way of performing T.
assumes: an object property that connects an algorithm to one or several of its assumptions. If these assumptions are not met, the algorithm will not produce the expected results or may not even run at all.
executes: an Operation (or process) P executes an Operator (or ground workflow) O == P is the process of running O to achieve the task for which O was designed.
handlesFeatureType: a DM algorithm A handlesFeatureType T (a DataType) == A can handle only dataset features or variables of type T. Typically these features can be of type categorical, real (continuous) or, more rarely, ordinal.
hasCombinationFunction: a model evaluation function may comprise two components (typically a performance measure and a complexity measure) whose trade-off is regulated by a combination function (often their sum or product) and their relative weights.
hasComplexityComponent: a model evaluation function E hasComplexityComponent F (a model complexity function). A learning algorithm's cost function is often composed of two components: a loss component (that quantifies the estimated error of the algorithm) and a complexity component (that quantifies the complexity of the model produced by the algorithm). These two components are not always present. The property hasComplexityComponent takes a value other than None when the cost function comprises some estimation of model complexity (see the sketch below).
hasComponent: a subproperty of hasPart that represents the component-integral object relationship identified by Winston et al. (1987) and Odell (1994). It defines a configuration of parts within a whole, i.e., the parts bear a particular functional or structural relationship to one another as well as to the whole. M. E. Winston, R. Chaffin, D. Hermann. A taxonomy of part-whole relations. Cognitive Science 11, 417-444 (1987). J. J. Odell. Six different kinds of composition. Journal of Object-Oriented Programming 5, 10-15 (1994).
The computational complexity of an algorithm.
hasDataType: specifies the data type of a given data item. The range of this property is the DataType class, which is a hierarchy that overlaps with but is different from OWL's built-in set of datatypes.
Relates a classification model to its decision boundary (it has at most one boundary), which can be, e.g., an arbitrary linear boundary or a quadratic boundary.
hasDecisionRule: see DecisionRule.
hasDecisionStrategy: see DecisionStrategy.
See: ProbabilityDensityEstimation.
See: DistanceFunction.
hasDistributionParameter: the distribution parameters of a Normal density estimate are an N x D Gauss-MeanVector and a D x D Gauss-CovarianceMatrix.
Relates an algorithm to its evaluation function (such as GiniIndex, LaplaceAccuracy, PearsonsRho).
hasFeature: links a DataSet or DataTable to any of its component features.
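As an illustration of the loss/complexity decomposition behind hasComplexityComponent and hasCombinationFunction, the following minimal sketch (hypothetical names; ridge_cost and complexity_weight are not DMOP terms, and NumPy is assumed) shows a cost function whose loss component and complexity component are combined by a weighted sum.

import numpy as np

# Hypothetical sketch: a cost function with a loss component and a complexity
# component, combined by a weighted sum (the "combination function").
def ridge_cost(w, X, y, complexity_weight=0.1):
    residuals = X @ w - y
    loss = np.mean(residuals ** 2)        # loss component: estimated error
    complexity = np.sum(w ** 2)           # complexity component: squared L2 norm
    return loss + complexity_weight * complexity  # combination function: weighted sum

# Example: evaluate the cost of a weight vector on toy data.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(ridge_cost(np.array([1.0, 2.0]), X, y))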
hasFeatureEvaluator: designates the algorithm used by a feature selection algorithm to measure the quality of the candidate features, either individually or as feature subsets.
hasFeatureTestEvaluator: For decision trees and rules, the criterion used to select the feature to use in the next tree node or rule condition. Aka split criterion.
hasHypothesisComplexityControlStrategy: A classification modeling algorithm A hasHypothesisComplexityControlStrategy S == A follows strategy S to restrict the complexity of the learned hypothesis (model or pattern set) and thus avoid overfitting.
To record specific details about a DM-Hypothesis; e.g., that the hypothesis may be probabilistic or not.
To relate the data or model that is taken as input to a DM-Operation or DM-Process; see also specifiesInputClass.
hasLeafPredictor: In a decision tree, the predictive model (classifier or regressor) that predicts the value of the target variable for all the instances grouped in a specific leaf. This can be the default classifier (majority rule) or regressor (mean value rule) as in CART, or a learned model, e.g. a NaiveBayes classifier in NBTree (see the sketch below).
hasModalValue: refers to the value of a categorical feature that appears most frequently in a given dataset.
See: ModelComplexityMeasure.
To record the level (optimal or sub-optimal) of an optimisation strategy.
To record the goal (minimising or maximising) of an optimisation problem.
hasOptimizationProblem: designates the optimization problem solved by a data mining algorithm in order to achieve its task (in particular induction and feature extraction). See OptimizationProblem.
hasOptimizationStrategy: refers to the optimization strategy adopted by a data mining algorithm to solve its optimization problem. Related concepts: OptimizationStrategy, OptimizationProblem.
Given (the execution of) a DM-Operation or DM-Process, its output is either data, a hypothesis or a hypothesis evaluation measure; see also specifiesOutputClass.
See: Parameter. Something (e.g., an algorithm or model) has a range of (optional) parameters to characterise it, such as number of clusters and variance threshold.
hasRecoveryOfPursuit: a property that relates a heuristic search algorithm to a value set named after Pearl's (1984, p. 65) "recovery of pursuit", which can be described as either tentative or irrevocable. See the concept RecoveryOfPursuit.
hasScopeOfSelection: a dimension along which Pearl (1984) describes search algorithms. At one extreme, certain search algorithms select from all available candidates (in a graph-based search, all open nodes); algorithms at the other extreme evaluate and select only from the most recent candidates (e.g. the successors of the current node).
hasSearchDirection: refers to the direction in which a search procedure is conducted: forward, backward, bidirectional or random.
To record the guidance for the search (currently: blind or informed) of a search strategy.
hasSimilarityFunction: designates the similarity function (kernel) used by a data mining algorithm.
IO objects are either source or sink; see hasSink.
A property of a DM-Algorithm that has as stop criterion some EvaluationFunction.
To record the table format of a data table.
hasTargetFeature: specifies which, among the values of hasFeature, is the outcome or response variable/feature. This is applicable only to labeled datasets, where labels can be continuous or categorical.
hasTargetLearner: designates the induction algorithm for which a given feature processing algorithm is deemed most appropriate, if any.
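The default leaf predictors named above (majority rule for classification, mean value rule for regression) and the modal value of hasModalValue can be illustrated with a minimal sketch; the function names majority_rule and mean_value_rule are hypothetical, not DMOP identifiers.

from collections import Counter

# Hypothetical sketch of the default leaf predictors described above.
def majority_rule(labels):
    # Default classifier at a leaf: predict the most frequent (modal) class.
    return Counter(labels).most_common(1)[0][0]

def mean_value_rule(targets):
    # Default regressor at a leaf: predict the mean of the target values.
    return sum(targets) / len(targets)

# Example: instances grouped in one decision-tree leaf.
print(majority_rule(["spam", "ham", "spam", "spam"]))  # -> "spam"
print(mean_value_rule([2.0, 4.0, 6.0]))                # -> 4.0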
To record the level of uncertainty of an optimisation strategy (currently: deterministic or stochastic).
Decision rules have a particular value domain, such as a mathematical function or a SingleFeatureWeight or a MaxFeatureWeight.
A DM-Operator or its parameter implements an algorithm or an algorithm parameter (respectively).
To record a feature of a feature selection algorithm, which can be filter, wrapper, or embedded.
Target features are features that denote an outcome or response variable/feature in a data table or a labelled data set.
realizes (Domain: OperatorExecution, Range: Algorithm): the execution of an operator realizes the specifications contained in the algorithm implemented by the operator. Note that an algorithm is basically an abstract specification of a process which comes into existence through an operator execution.
To constrain the parametric density estimation to be either of data type categorical or real.
To relate a particular setting of a DM-Operator parameter to a DM-Operator.
To relate the (solution) strategy to the optimisation problem it solves.
Class-level property to relate tasks and algorithms to the IO classes they operate on (here: what is fed into it).
Class-level property to relate tasks and algorithms to the IO classes they operate on (here: what its result is).
Being a (generic, temporary) constituent in a countable collection, for example: member of a society, bacterium in a colony, etc.
Property introduced to take that extra step from DOLCE's qualia (like physical region) to the computerised representations with data types to record such data.
Some parameters can have at most one default value, where applicable.
hasExplicitFeatureSpace: For kernel functions, indicates whether the mapping function $\phi$ is explicitly computable or not.
E.g., rules can have a fixed threshold (for some algorithms).
Parameters or (continuous) features can have as attribute a maximum value; see also hasMinimumValue.
(Continuous) features can have as attribute a mean value; see also hasMaximumValue and hasMinimumValue.
Parameters or (continuous) features can have as attribute a minimum value; see also hasMaximumValue.
To record the actual number of constraints applicable to an optimisation problem.
To record the actual number of support vectors applicable to a data set or a data table.
To record a string that is an identifier of a parameter in DM software.
See: Kernel.
isAdmissible: A search strategy or algorithm is admissible if it terminates with an optimal solution when one exists.
isComplete: A search strategy or algorithm is complete if it terminates with a solution when one exists.
isMultiAlgorithmOperator: An Operator is a multi-algorithm operator if it implements several algorithms simultaneously, allowing the user to choose one algorithm by setting one or several operator hyperparameters.
METAL characteristic: Average absolute correlation between continuous features.
METAL characteristic: Average mutual information between pairs of categorical features.
METAL characteristic: Average feature entropy.
METAL characteristic: Average mutual information.
METAL characteristic: A matrix containing the difference between the matrix of total and the matrix of within-groups sums of squares and cross products.
METAL characteristic: Canonical correlation of the best linear combination of features to distinguish between classes.
METAL characteristic: Absolute class frequencies. Stored in a vector indexed by each class value.
METAL characteristic: Class covariance matrices. Stored in a vector indexed by class, each element containing a matrix of (features x features).
METAL characteristic: Class entropy.
METAL characteristic: Relative class frequencies. Stored in a vector indexed by each class value.
METAL characteristic: A vector of eigenvalues of linear discriminant functions.
Several data complexity characteristics are drawn from: Mitra Basu and Tin Kam Ho. Data Complexity in Pattern Recognition. Springer, 2006.
METAL characteristic: For each continuous feature, its correlation with the other continuous features. Stored in a vector indexed by each continuous feature.
METAL characteristic: For each categorical feature, the feature entropy.
A categorical value should be marked as frequent iff it is 50% more frequent than we would expect under a uniform distribution (e.g. 75% for a binary value); see the sketch below.
METAL characteristic: For each continuous feature, the ratio between the standard deviation and the standard deviation of the alpha-trimmed mean. If the standard deviation is 0, then the ratio is set to 1.
METAL characteristic: For each categorical feature, the mutual information between the feature and the class. Stored in a vector indexed by each categorical feature.
A categorical value should be marked as rare iff it is 50% less frequent than we would expect under a uniform distribution (e.g. 25% for a binary value).
METAL characteristic: For each value k of the feature, the value frequency. Stored in a vector indexed by each feature value.
METAL characteristic: For each value k of each categorical feature j and each class i, the proportion of cases that have value k in feature j and belong to class i. Stored in a vector indexed by each categorical feature and containing flat contingency tables that combine the values of the categorical feature with the class values.
A categorical value should be marked as very frequent iff it is 90% more frequent than we would expect under a uniform distribution (e.g. 95% for a binary value).
A categorical value should be marked as very rare iff it is 90% less frequent than we would expect under a uniform distribution (e.g. 5% for a binary value).
METAL characteristic: Noise-signal ratio.
METAL characteristic: Number of continuous features with outliers.
METAL characteristic: Proportion of continuous features with outliers.
METAL characteristic: Matrix of total sums of squares and cross products of features.
METAL characteristic: Matrix of within-groups sums of squares and cross products of features.
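The frequent/rare/very frequent/very rare value characteristics above all compare a value's observed relative frequency with the frequency expected under a uniform distribution over the feature's distinct values. A minimal sketch, assuming this reading (the function name value_frequency_label is hypothetical, not a METAL identifier):

# Hypothetical sketch of the frequent/rare value characteristics described above.
def value_frequency_label(observed_freq, num_values):
    expected = 1.0 / num_values           # frequency expected under a uniform distribution
    if observed_freq >= 1.9 * expected:   # 90% more frequent than expected
        return "very frequent"
    if observed_freq >= 1.5 * expected:   # 50% more frequent than expected
        return "frequent"
    if observed_freq <= 0.1 * expected:   # 90% less frequent than expected
        return "very rare"
    if observed_freq <= 0.5 * expected:   # 50% less frequent than expected
        return "rare"
    return "typical"

# Example: a binary feature, whose expected frequency per value is 0.5.
print(value_frequency_label(0.75, 2))  # -> "frequent"  (75% for a binary value)
print(value_frequency_label(0.05, 2))  # -> "very rare" (5% for a binary value)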
The way the (hierarchical) clustering algorithms characterize the similarity between a pair of clusters (Jain, Murty and Flynn. Data Clustering: A Review. ACM Computing Surveys, vol. 31, no. 3, September 1999).
AlgorithmAssumption: a hypothesis on the basis of which an algorithm has been developed, and which should be true if the algorithm is to achieve the task it was designed to address.
AssociationDiscoveryAlgorithm: is designed to solve the AssociationDiscoveryTask, and hence it mines data for associations, where an association is a relation between objects or measured quantities that results from interaction or dependence. This relationship is not necessarily causal. In statistics, an association is any such relationship that renders two measured quantities statistically dependent.
AssociationDiscoveryTask: consists in mining data for associations, where an association is a relation between objects or measured quantities that results from interaction or dependence. This relationship is not necessarily causal. In statistics, an association is any such relationship that renders two measured quantities statistically dependent.
AttributeValueTableFormat: the format of an attribute-value table, i.e. a table where each table cell designates a value of a particular attribute (or feature) of a particular object (or example or instance).
In an average link model, the distance between two clusters is the average of all pairwise distances between patterns in the two clusters. Two clusters are merged to form a larger cluster based on minimum distance criteria.
A baseline classifier is an extremely simple classifier used to compare the performance of more sophisticated algorithms. It is usually either a random baseline classifier that labels every instance with a random class, or a simple baseline classifier that labels every instance with the most frequent class.
Bayes net classification algorithms are based on the naive Bayesian network structure.
Bayesian networks (BNs), also known as belief networks (or Bayes nets for short), belong to the family of probabilistic graphical models (GMs). These graphical structures are used to represent knowledge about an uncertain domain. In particular, each node in the graph represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies in the graph are often estimated by using known statistical and computational methods. Hence, BNs combine principles from graph theory, probability theory, computer science, and statistics. [http://www.eng.tau.ac.il/~bengal/BN.pdf]
"Bayesian belief networks are graphical models, which unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes." "Bayesian belief networks specify joint conditional probability distributions." Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006.
"Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem." Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006.
BayesianNetwork: a directed GraphicalModel.
Blind or uninformed search is search that uses no additional information about states beyond that provided in the problem definition (definition of goal state(s) and operators). All such methods can do is generate successors and distinguish a goal from a nongoal state [1]. As a result, the order in which the search progresses (e.g. the order in which nodes are expanded in a search tree/graph) does not depend on the nature of the solution sought [2]. [1] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 2003. [2] J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, 1984.
Also known as "Agglomerative Clustering Algorithm". An agglomerative approach begins with each pattern in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied.
BranchAndBound is informed exhaustive search. It uses knowledge of states to cut off unpromising paths, i.e., paths which are known not to contain the goal state. It is non-heuristic in the sense that it maintains a provable upper and lower bound on the (globally) optimal objective value. [S. Boyd and J. Mattingley. Branch-and-Bound Methods. Notes for EE364b, Stanford, March 11, 2007]
C4.5 (crisp) is a well-known example of a tree model induction algorithm designed to deal with noisy problems and reduce noise's effects on performance.
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, originally proposed by Kass (1980). According to Ripley (1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and Messenger (1973). CHAID will "build" non-binary trees (i.e., trees where more than two branches can attach to a single root or node), based on a relatively simple algorithm that is particularly well suited for the analysis of larger datasets. Also, because the CHAID algorithm will often effectively yield many multi-way frequency tables (e.g., when classifying a categorical response variable with many categories, based on categorical predictors with many classes), it has been particularly popular in marketing research, in the context of market segmentation studies. [http://www.statsoft.com/Textbook/CHAID-Analysis]
A CN2 model is a model generated by the CN2 algorithm for rule induction.
CSVC-Algorithm: An SVC algorithm that uses the hyperparameter C (as opposed to the Nu-SVC-Algorithm, which uses the hyperparameter Nu). The hyperparameter C controls the trade-off between the loss function (hinge loss) and the regularizer (squared $L_2$ norm of the weight vector $\mathbf{w}$) in the SVM cost function (see the sketch below). SVMs usually address binary classification problems (with labels y = {+1, -1}), but they can be extended to multi-class (> 2 classes) classification problems. The C-SVC algorithm handles noise through the use of slack variables. The degree of noise tolerance is adjusted through the parameter C.
The cardinality reduction task deals with enforcing a cut-off on the (arbitrary, predefined) number of attributes (features) used to construct a model or a pattern set.
CategoricalFeature: a feature (attribute or variable) that takes values from a finite set of discrete, non-numeric, unordered labels. Sometimes called a nominal feature.
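The C-SVC cost function described above (hinge loss traded off against the squared L2 norm of the weight vector via C) can be written as a short sketch. This is a hypothetical illustration assuming NumPy; the function name csvc_primal_cost is not a DMOP or libsvm identifier.

import numpy as np

# Hypothetical sketch of the C-SVC primal cost: regularizer + C * total hinge loss.
def csvc_primal_cost(w, b, X, y, C=1.0):
    margins = y * (X @ w + b)                 # labels y in {+1, -1}
    hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss per instance (slack)
    regularizer = 0.5 * np.dot(w, w)          # squared L2 norm of the weight vector
    return regularizer + C * np.sum(hinge)

# Example: toy linearly separable data.
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([1.0, -1.0])
print(csvc_primal_cost(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))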
These features are also called nominal or unordered features, e.g. color.
Feature having categorical values.
A labeled data set having categorical (i.e., discrete) values.
ClassCondMeanMatrix is the K x D matrix of class-conditional means -- the mu parameter of Normal L/Q Discriminant Models.
ClassCondProbMatrix: A k x d matrix where k = number of classes and d = number of words/features; each cell $M_{ci}$, c = 1 ... k, i = 1 ... d, contains the probability of feature i in class c, and each row sums to 1. Used in all NaiveBayes models built from discrete or mixed discrete/continuous datasets.
ClassCondVarianceMatrix is actually a concatenation of D univariate vectors representing the class-conditional variance of each of the D features. It has dimensions K x D, where each cell $M_{kd}$ is the variance of feature d within class k. Based on the feature independence assumption of Naive Bayes.
A data mining (DM) model that serves to predict class value(s).
A ModelEvaluationFunction for classification models.
For the moment, ClassificationModelingAlgorithm implicitly stands for propositional supervised classification algorithms. This is the reason its input range is specified as LabeledDataSet and PropositionalDataSet. In the future we will add relational classification modeling algorithms as well as semi-supervised classification modeling algorithms.
ClassificationModelingAlgorithm: an algorithm that builds (learns) a classifier.
The task of classification deals with the prediction of the value of one discrete (e.g. nominal or ordinal) field (the target) based on the values of the other fields (attributes or features).
ClassificationProblemType refers to the number of classes that can be distinguished by a classifier (and by the classification algorithm that produced it). A binary classification problem consists in distinguishing two classes, a multiclass classification problem more than two classes. More recently, one-class classification consists in distinguishing instances of a selected class from non-instances, regardless of the class membership of these non-instances.
ClassificationRuleInductionAlgorithm: an algorithm that builds a set of classification rules. Each rule consists of matching conditions - certain attribute values that are required to classify an object using the given rule - and a classification action - a decision about which class the object should be assigned to.
TreeInductionAlgorithm: This used to be RecursivePartitioningAlgorithm but has been renamed TreeInductionAlgorithm because RuleInductionAlgorithm overlaps with RecursivePartitioningAlgorithm, which cannot therefore be a primitive class.
TreeInductionAlgorithm: Hand et al. 01, p. 335: Decision trees can be considered discriminant functions if they predict crisp classes at the leaves and discriminative if they predict probabilities.
A data mining (DM) model that serves to discover groups (clusters).
A function used for evaluating the validity of a computed clustering model.
ClusteringModelingAlgorithm: an algorithm that builds (learns) a clustering model.
Gordon and Henderson define the clustering problem in terms of minimizing the within-cluster sum of square distances. They write their criterion function in such a way that the clustering problem can be formulated as a nonlinear programming problem. [http://homepages.inf.ed.ac.uk/rbf/BOOKS/JAIN/Clustering_Jain_Dubes.pdf]
A way of selecting several things out of a larger group, where order does not matter.
For the most part this means performing basic arithmetic (addition, subtraction, multiplication, and division) with functions. There is one new way of combining functions that we'll need to look at as well. Mazur, David R. Combinatorics: A Guided Tour. Mathematical Association of America.
ComplementClassCondProbMatrix: A complement class-conditional probability matrix is a k x d matrix where k = number of classes and d = number of words/features; each cell $M_{ci}$, c = 1 ... k, i = 1 ... d, contains the probability of feature i NOT in class c, and each row sums to 1. Used in ComplementNaiveBayes [Rennie, 2003].
ComplementNaiveBayesModel's ModelParameter: a k-list of vectors $\theta_c$, c = 1 to k (number of classes), where $\theta_{ci}$ is the probability of word/feature i NOT in class c, with i = 1 to p (number of words/boolean features). We can therefore represent this model parameter as a k x p matrix whose rows are classes and columns are words/boolean features. We call this a ComplementCCProbMatrix.
In a complete link model, the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. Two clusters are merged to form a larger cluster based on minimum distance criteria.
A computational complexity function is a function that outputs a computational complexity class for a given algorithm.
In mathematics, a constraint is a condition that a solution to an optimization problem must satisfy.
ContinuousFeature: a feature/attribute/variable that takes its values from a range of real numbers.
Feature having continuous numerical values.
A labeled data set having continuous data values.
Continuous optimization problems are optimization problems of parameters with variables in continuous domains. "In continuous optimization, the variables in the model are nominally allowed to take on a continuous range of values, usually real numbers. This feature distinguishes continuous optimization from discrete or combinatorial optimization, in which the variables may be binary (restricted to the values 0 and 1), integer (for which only integer values are allowed), or more abstract objects drawn from sets with finitely many elements." [Continuous Optimization (Nonlinear and Linear Programming). Stephen J. Wright, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, USA]
A convex optimization problem is a problem where all of the constraints are convex functions, and the objective is a convex function if minimizing, or a concave function if maximizing. Linear functions are convex, so linear programming problems are convex problems. Article about convex optimization: [https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf]
An algorithm characteristic that describes the coordinate system used, e.g. Cartesian coordinate system, polar coordinate system, homogeneous coordinate system.
CoreDMTask: a data mining task that can be achieved by a number of alternative methods organized in a more or less complex subtree under the Algorithm class. Thus a CoreDMTask cannot be accomplished without preliminary search in a space of potential methods/algorithms. This restrictive definition of a core DM task excludes utility tasks such as dataset reading/writing, model application, etc. Contrary to other DM ontologies, the e-lico DMOP does not classify CoreDMTasks according to the standard DM phases: Pre-processing, Modeling and Post-processing. The reason is that many of these tasks can be performed in any phase: for example, one can nest a ModelingTask in the Pre-processing phase; for instance, the MissingValueImputationTask can be cast as a PredictiveModelingTask whereby a classification or regression learner is used to predict the missing values of a selected feature. In the reverse direction, Feature(Set)ProcessingTasks can be integrated into the modeling phase, as in the case of embedded feature selection techniques.
"A method for estimating the accuracy (or error) of an inducer by dividing the data into k mutually exclusive subsets (the "folds") of approximately equal size. The inducer is trained and tested k times. Each time it is trained on the data set minus a fold and tested on that fold. The accuracy estimate is the average accuracy for the k folds." Kohavi R., Provost F. Glossary of terms. Machine Learning 30 (2-3), 271-274, 1998. (See the sketch below.)
DM-Algorithm: An algorithm in general is a well-defined sequence of steps that specifies how to solve a problem or perform a task. It typically accepts an input and produces an output. A DM algorithm is an algorithm that has been designed to perform any of the DM tasks, such as feature selection, missing value imputation, or modeling (or induction). The higher-level classes of the DM-Algorithm hierarchy correspond to DM-Task types. Immediately below are broad algorithm families or what data miners more commonly call paradigms or approaches. The Algorithm hierarchy bottoms out in individual algorithms such as CART, Lasso or ReliefF. A particular case of a DM algorithm is a Modeling (or Learning) algorithm, which is a well-defined procedure that takes data as input and produces output in the form of models or patterns.
DM-Data: In SUMO, Data is defined as 'an item of factual information derived from measurement or research' [http://sigma.ontologyportal.org:4010/sigma/WordNet.jsp?word=data&POS=1]. In IAO, Data is an alternative term for 'data item' =def 'an information content entity that is intended to be a truthful statement about something (modulo, e.g., measurement precision or other systematic errors) and is constructed/acquired by a method which reliably tends to produce (approximately) truthful statements.' [http://purl.obolibrary.org/obo/IAO_0000027] In the context of DMOP, DM-Data is the generic term that encompasses different levels of granularity: data can be a whole dataset (one main table and possibly other tables), or only a table, or only a feature (column of a table), or only an instance (row of a table), or even a single feature-value pair.
A DM hypothesis is either a DM model or a DM pattern set.
"A data mining (DM) model is a structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction. Most inductive algorithms generate models that can then be used as classifiers, as regressors, as patterns for human consumption, and/or as input to subsequent stages of the KDD process." Kohavi R., Provost F. Glossary of terms. Machine Learning 30 (2-3), 271-274, 1998. In DMOP, DM-Model is further restricted to summarize a set of data globally (as opposed to a pattern set, which may summarize a set of data only partially). A DM-Model also requires a decision strategy/rule. Model: a simplified description of a complex entity or process [http://virtual.cvut.cz/ksmsaWeb/browser/title].
DM-Operation: a process in which a DM-Operator is executed. Synonym: DM-OperatorExecution.
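The k-fold cross-validation procedure quoted above can be sketched as follows. This is a minimal, hypothetical illustration: train_fn and test_fn stand in for the inducer's training routine and its accuracy-evaluation routine, and the function name k_fold_cross_validation is not a DMOP identifier.

import random

# Hypothetical sketch of k-fold cross-validation: split the data into k folds,
# train the inducer on k-1 folds, test on the held-out fold, average the k accuracies.
def k_fold_cross_validation(instances, labels, train_fn, test_fn, k=10, seed=0):
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k mutually exclusive subsets
    accuracies = []
    for fold in folds:
        train_idx = [i for i in indices if i not in fold]
        model = train_fn([instances[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        accuracies.append(test_fn(model,
                                  [instances[i] for i in fold],
                                  [labels[i] for i in fold]))
    return sum(accuracies) / k                     # average accuracy over the k folds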
DM-Operator: a programmed, executable implementation of a DM-Algorithm.
DM-PatternSet: A pattern set, as opposed to a model which by definition has global coverage, is a set of local hypotheses, i.e. each applies to a limited region of the sample space.
DM-Software: restriction to the data mining area of the general concept of Software: (computer science) written programs or procedures or rules and associated documentation pertaining to the operation of a computer system and that are stored in read/write memory; "the market for software is expected to expand". [http://sigma.ontologyportal.org:4010/sigma/Browse.jsp?kb=SUMO&lang=en]
DM-Task: A task in general is any piece of work that is undertaken or attempted [SUMO]. A DM-Task is any task that needs to be addressed in the data mining process. DMOP's DM-Task hierarchy models all the major task classes.
A data abstraction algorithm extracts a simple and compact representation of a data set. In the clustering context, the typical data abstraction is the cluster prototype (i.e. representative pattern), defined in terms of a centroid or medoid, etc. It represents a compact description of each cluster.
Extract a simple and compact representation of a data set.
"The process of improving the quality of the data by modifying its form or content, for example by removing or correcting data values that are incorrect. This step usually precedes the machine learning step, although the knowledge discovery process may indicate that further cleaning is desired and may suggest ways to improve the quality of the data. For example, learning that the pattern Wife implies Female from the census sample at UCI has a few exceptions may indicate a quality problem." Kohavi R., Provost F. Glossary of terms. Machine Learning 30 (2-3), 271-274, 1998.
DataFormat: the organization of information according to preset specifications (usually for computer processing) [http://sigma.ontologyportal.org:4010/sigma/Browse.jsp?kb=SUMO&lang=en].
DataProcessingAlgorithm: an algorithm that specifies a procedure for solving a DataProcessingTask (see this concept).
DataProcessingTask: takes in any subclass of Data (a DataSet, DataTable or Feature) and outputs some transformation of its input.
"Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results." Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006.
The data retrieval task is to obtain data from a database; the data is ordered and organized in a logical way.
DataSet: in data mining, the term data set is defined as a set of examples or instances represented according to a common schema.
DataTable: a set of data arranged in rows and columns. [http://virtual.cvut.cz/ksmsaWeb/browser/title]
Data transformation is a DM task where "data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance" [1]. [1] Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006.
"A set of values from which a variable, constant, function, or other expression may take its value. A type is a classification of data that tells the compiler or interpreter how the programmer intends to use it." [http://foldoc.org/data+type]
A decision boundary is a surface that separates two or more decision regions. It represents points where there are ties between two or more categories. [http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/PR_simp/bndrys.htm]
DecisionRule: a subclass of DecisionStrategy that selects a course of action by applying a yes/no rule. The various instances of decision rules are based on the specific test or criterion used. In data mining, a decision rule is most often applied to a sequence of scalar values that represent measures of quality (e.g. estimated feature quality), performance (e.g. misclassification rate), probability, etc. The test takes the form of a triple <Focus, RelOp, Threshold>, where Focus represents the specific criterion on which the decision is based: the observed values themselves, their ranks or their probabilities. RelOp is one of the relational operators {LessThan, Leq, Eq, GreaterThan, Geq}, and Threshold is the cutoff point on the magnitude, rank, percentage or probability. Note that Threshold is not necessarily a constant; it can be a function (e.g., max, min) of the observed values. For example, the Max Rule is a particular case of the Top K Rule where K=1, and the Maximum A Posteriori rule is a special case of the Max rule, where the Focus is a probability distribution, the RelOp is Eq, and the Threshold is the maximum of the observed probabilities.
DecisionRule hasFixedThreshold: takes on a value only if the threshold on the decision criterion is a constant in the algorithm; if this property is empty, the threshold is a user-defined parameter.
DecisionStrategy: a strategy followed to make a decision at some point of a data mining or optimization process, e.g. to select a feature subset after scoring or ranking features, to predict an outcome after building a probabilistic model or an SVM. A decision strategy can take the form of a decision rule or a statistical test.
DecisionTree: a tree-structured predictive model where each path from the root to a leaf can be read as a rule, or a conjunction of conditions (tests on the nodes along the path). The different paths represent an exclusive disjunction (XOR) of conjunctions, i.e., the rules are non-overlapping and each example is covered by exactly one rule.
A dependency model describes "the relationship between variables" [1]. A dependency model is "a model that describes significant dependencies (or associations) between data items or events. [...] Dependencies can be strict or probabilistic." [2] [1] Hand D., Mannila H., Smyth P. Principles of Data Mining, MIT Press, 2001. [2] Shearer C. The CRISP-DM model: the new blueprint for data mining. J Data Warehousing (2000); 5:13-22.
An algorithm that is used to identify dependencies and relations among data items.
A dependency modeling task is to produce a model "describing the relationship between variables" [1]. "Dependency analysis finds a model that describes significant dependencies (or associations) between data items or events. Dependencies can be used to predict the value of a data item, given information on other data items. Although dependencies can be used for predictive modeling, they are mostly used for understanding. Dependencies can be strict or probabilistic." [2] [1] Hand D., Mannila H., Smyth P. Principles of Data Mining, MIT Press, 2001. [2] Shearer C. The CRISP-DM model: the new blueprint for data mining. J Data Warehousing (2000); 5:13-22.
A data mining (DM) model that serves for description.
Descriptive modeling is a mathematical process that describes real-world events and the relationships between factors responsible for them.
An algorithm which is used to produce the main features of data. Those features can be treated as a data summary. Data randomly generated from a descriptive model should have the same characteristics as the real data.
A descriptive modeling task is to produce a model that "describes all of the data (or the process generating the data)" [1]. [1] Hand D., Mannila H., Smyth P. Principles of Data Mining, MIT Press, 2001.
Dimensionality reduction is a DM task where "encoding mechanisms are used to reduce the data set size" [1]. [1] Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006.
E.g. the number of computers, age.
DiscreteFeature: a feature/attribute/variable that takes its values from a finite set of numbers.
Feature having discrete numerical values.
"Discrete optimization or combinatorial optimization means searching for an optimal solution in a finite or countably infinite set of potential solutions. Optimality is defined with respect to some criterion function, which is to be minimized or maximized. The solutions may be combinatorial structures like arrangements, sequences, combinations, choices of objects, subsets, subgraphs, chains, routes in a network, assignments, schedules of jobs, packing schemes, etc." [http://www.mafy.lut.fi/study/DiscreteOpt/CH1.pdf]
Discriminant analysis is used to allocate observations to groups using information from observations whose group memberships are known.
"Techniques that can introduce low-dimensional feature representations with enhanced discriminatory power are of paramount importance in face recognition (FR) systems. It is well known that the distribution of face images, under a perceivable variation in viewpoint, illumination or facial expression, is highly nonlinear and complex. It is, therefore, not surprising that linear techniques, such as those based on principal component analysis (PCA) or linear discriminant analysis (LDA), cannot provide reliable and robust solutions to those FR problems with complex face variations. In this paper, we propose a kernel machine-based discriminant analysis method, which deals with the nonlinearity of the face patterns' distribution. The proposed method also effectively solves the so-called "small sample size" (SSS) problem, which exists in most FR tasks. The new algorithm has been tested, in terms of classification error rate performance, on the multiview UMIST face database. Results indicate that the proposed methodology is able to achieve excellent performance with only a very small set of features being used, and its error rate is approximately 34% and 48% of those of two other commonly used kernel FR approaches, the kernel-PCA (KPCA) and the generalized discriminant analysis (GDA), respectively." Plataniotis, K.N., Dept. of Electr. & Comput. Eng., Toronto Univ., Ont., Canada.
GenerativeModel is pairwise disjoint with DiscriminativeModel and DFModel. But DiscriminativeModel and DiscriminantFunctionModel may or may not be disjoint because some DF models output probabilities. Those who argue for disjointness say that even if DF models output probabilities, what distinguishes them from discriminative models is their ModelStructure. If this is not a posterior probability distribution as in logistic regression, then the model is a DF model. A potential counter-argument is that maybe it is the concept of ModelStructure that needs to be revised. The issue remains open.
A discriminant function model is built by seeking "a function f(x; θ) that maximizes some measure of separation between the classes. Such functions are termed discriminant functions." Hand D., Mannila H., Smyth P. Principles of Data Mining, MIT Press, 2001.
"Discriminative models model the conditional probability distribution P(Y|X) directly; they learn a direct mapping from inputs X to class label probabilities."* Lawrynowicz A., Tresp V. Introducing machine learning. In Jens Lehmann and Johanna Voelker, editors, Perspectives on Ontology Learning, Studies on the Semantic Web. AKA Heidelberg / IOS Press, 2014. *X and Y stand for random variables.
The dissociation discovery task deals with the discovery of negative associations.
A distance (or metric) function is a function that defines a distance between objects.
DistanceOrSimilarityFunction: the union of DistanceFunction and SimilarityFunction.
LearningPolicy: An InductionAlgorithm's learning policy is its basic approach to the problem of learning, which can be divided into two broad categories: 1) Eager: the learning policy consists in condensing a collection of data into a compact hypothesis, i.e. a model or a pattern set, that can later be applied to new data to achieve the purpose for which the hypothesis was built. 2) Lazy: the policy is simply to store the data, deferring the data analysis or inductive process to the moment when the hypothesis is needed. In short, lazy learning is data memorization, as opposed to eager learning, which consists in hypothesis construction.
EmbeddedFSAlgorithm: an algorithm used for feature selection in embedded methods, i.e. methods that perform feature selection as part of the model construction process.
An algorithm designed to achieve a FeatureSelectionTask, defined as a task that takes in a set of p features and outputs a subset p' of the same features, where p' < p. The subset p' is determined by exploiting entropy-based criteria.
A feature weighting algorithm that uses an entropy-based criterion to compute feature weights (see the sketch below).
An evaluation function is a function used to evaluate something, e.g. how well an algorithm is working on particular training data, or what the value is of selecting a particular feature for training a DM algorithm. We can find heuristic evaluation functions or static evaluation functions. In a broader perspective, an evaluation function is used, for instance, by game-playing programs to estimate the value or goodness of a position in the minimax and related algorithms. A static evaluation function looks only at the current position and does not explore possible moves.
The exploratory data analysis task is "to explore the data without any clear ideas of what we are looking for" [1]. [1] Hand D., Mannila H., Smyth P. Principles of Data Mining, MIT Press, 2001.
An external validity model function compares the obtained clustering model with an a priori structure.
Feature: a property of an instance (case, example, record). Synonyms: Attribute, Variable.
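The entropy-based feature weighting criteria mentioned above can be illustrated with information gain, i.e. the reduction in class entropy obtained by conditioning on a categorical feature. A minimal sketch; the function names entropy and information_gain are hypothetical, not DMOP identifiers.

import math
from collections import Counter

# Hypothetical sketch of an entropy-based feature weighting criterion (information gain).
def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # Weight of a categorical feature: class entropy minus conditional class entropy.
    total = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

# Example: a feature that perfectly predicts the class gets weight = class entropy.
print(information_gain(["a", "a", "b", "b"], ["+", "+", "-", "-"]))  # -> 1.0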
FeatureClassMutualInformation: For each categorical feature, the mutual information between the feature and the class. It is stored in a vector indexed by each categorical feature. [From the METAL project]
FeatureDiscretizationAlgorithm: an algorithm designed to achieve a FeatureDiscretizationTask, defined as the task of transforming a real or continuous feature into a discrete or categorical feature.
FeatureExtractionAlgorithm: an algorithm designed to achieve the FeatureExtractionTask, defined as the task of converting a set of observed features (syn. variables, attributes) into a (potentially smaller) set of features deemed useful for the modeling or pattern discovery task at hand. Ref: I. Guyon, S. Gunn, M. Nikravesh and L. A. Zadeh. Feature Extraction: Foundations and Applications, Springer, 2006.
FeatureLogTransformAlgorithm: transforms a feature using the logarithm function.
FeatureNormalizationAlgorithm: rescales all values of a Feature by dividing each of them by the norm of the Feature vector (see the sketch below).
FeatureRankingAlgorithm: an algorithm designed to achieve a FeatureRankingTask, defined as a task that takes in a set of features and outputs a ranked list of these features based on the demands of a given data mining task.
FeatureSelectionAlgorithm: an algorithm designed to achieve a FeatureSelectionTask, defined as a task that takes in a set of p features and outputs a subset p' of the same features, where p' < p. FeatureSelectionAlgorithm is abbreviated as FSAlgorithm or FSA when used as a suffix.
The feature selection task aims to identify and remove irrelevant, weakly relevant or redundant features.
FeatureSquashToIntervalAlgorithm: rescales all values of a Feature in order to squash them within a user-defined interval, e.g., [0,1] or [-1,1] for classification.
FeatureStandardizationAlgorithm: rescales all values of a continuous Feature through diverse operations on its mean, standard deviation, etc., or combinations of such operations.
FeatureTransformationAlgorithm: an algorithm designed to achieve a FeatureTransformationTask, defined as the task of transforming a feature by re-expressing its values in terms that facilitate the induction process. This is typically done on an individual feature without affecting or taking account of the other features in a data set. For instance, an individual feature can be rescaled to a fixed interval, or it can be standardized or normalized, or subjected to some functional transform (e.g., sqrt, log, logit).
FeatureTransformationTask: consists in changing the representation of a feature using only information from that feature (e.g., log transform). It is to be distinguished from FeatureExtractionTask, which usually creates a new feature out of several existing features (e.g., principal components analysis).
Kinds of values of a Feature.
FeatureWeightingAlgorithm: an algorithm designed to achieve a FeatureWeightingTask, defined as the task of assigning to a feature (set) a weight or score that quantifies the quality or usefulness of the feature (set) relative to a given modeling or pattern discovery task. FeatureWeightingAlgorithm is abbreviated as FWAlgorithm or FWA when used as a suffix.
"A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. [...] Each layer is made up of units." Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006.
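The three rescaling operations described above (normalization, squash-to-interval, standardization) can be sketched as follows, assuming NumPy; the function names are hypothetical, not DMOP identifiers.

import numpy as np

# Hypothetical sketches of the three feature rescaling operations described above.
def normalize(feature):
    # FeatureNormalization: divide each value by the norm of the feature vector.
    return feature / np.linalg.norm(feature)

def squash_to_interval(feature, low=0.0, high=1.0):
    # FeatureSquashToInterval: rescale values into a user-defined interval, e.g. [0, 1].
    fmin, fmax = feature.min(), feature.max()
    return low + (feature - fmin) * (high - low) / (fmax - fmin)

def standardize(feature):
    # FeatureStandardization: subtract the mean and divide by the standard deviation.
    return (feature - feature.mean()) / feature.std()

x = np.array([2.0, 4.0, 6.0, 8.0])
print(normalize(x), squash_to_interval(x), standardize(x), sep="\n")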
FilterFSAlgorithm: enables to choose the relevant original features and is an effective dimensionality reduction technique. A relevant feature for a learning task can be defined as one whose removal degrades the learning accuracy. Ref: Pierre-Emmanuel JOUVE, Nicolas NICOLOYANNIS. A Filter Feature Selection Method for Clustering FormalExpression: the union of MathematicalExpression and LogicalExpression. A function can take parameters which are just values you supply to the function so that the function can do something utilising those values. These parameters are just like variables except that the values of these variables are defined when we call the function and are not assigned values within the function itself. Parameters are specified within the pair of parentheses in the function definition, separated by commas. When we call the function, we supply the values in the same way. Note the terminology used - the names given in the function definition are called parameters whereas the values you supply in the function call are called arguments. &quot;A Byte of Python&quot; Swaroop C H 1 1 Gaussian kernels are examples of Radial Basis Function (RBF) kernel. They also have adjustable parameters such as sigma which plays a major role in the performance of the kernel. An overestimation of this parameter causes the kernel behave almost linear and the higher-dimensional projection will start to lose its non-linear power. On the other hand, an underestimation of the parameter makes the kernel function lack regularization hence making the decision boundary highly sensitive to noise in training data. [Justice Kwame Appati, Gideon Kwadzo Gogovi, Gabriel Obed Fosu, On the Selection of Appropriate Kernel Function for SVM in Face Recognition, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 3, March 2014] GenerateAndSelectStrategy: a strategy for controling model complexity that consists in generating models/patterns of varying complexity/generality and selecting the least complex or the most general. "Generative models simulate a data generating process. In an unsupervised learning problem, this would involve a model for P (X ), i.e., a probabilistic model for generating the data*. In supervised learning, such as classification, one might assume that a class Y ∈ {0, 1} is generated with some probability P (Y ) and the class-specific data is gen- erated via the conditional probability P (X |Y ). Models are learned for both P (Y ) and P (X |Y ) and Bayes rule is employed to derive P (Y |X ), i.e., the class label probability for a new input X. *In the discussion on generative models, X and Y stand for random variables." Lawrynowicz A., Tresp V. Introducing machine learning. In Jens Lehmann and Johanna Voelker, editors, Perspectives on Ontology Learning, Studies on the Semantic Web. AKA Heidelberg / IOS Press, 2014. 1 1 1 Graph: a set of nodes connected by arcs. ZAHN, C. T. 1971. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. C-20 (Apr.), 68–86. The goal of this algorithm is to find regions in a graph, i.e. sets of nodes, which are not as dense as major cliques but are compact enough within user specified thresholds. GraphicalModel: a graph-based representation of a probability distribution. Each node in the graph represents a random variable (or group of random variables), and the links express probabilistic relationships between these variables. 
The graph then captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors each depending only on a subset of the variables. [C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. Chapter 8] false A greedy search is type of search that uses a heuristic for making locally optimal choices at each stage. It usually do not operate exhaustively on all the data and mostly, but not always, fail to find the globally optimal solution. 1 1 1 1 false HeuristicBestFirstSearch: The term "heuristic search" is used in two different senses in the literature on Ai and optimization. In the very broad sense, heuristic search has two defining characteristics. First, heuristic search uses information other than that given in the problem definition to guide search. Second, heuristic search methods are not guaranteed to find the optimal solution; this characteristic distinguishes heuristic methods from branch and bound methods, which are informed but not heuristic methods. In the early AI literature on search, however, the term "heuristic search" was used in the framework of search viewed as the problem of going from an initial to a goal state, and typically used the graph model, where states are represented by nodes. In the absence of any information outside that given in the problem definition, the only solution was systematic search; the choice of the next node to expand (and ultimately of the solution path) was based on criteria such as node recency or an evaluation function f(n)=g(n)= the cost of going from the initial node to the current node (note that this function relies only on the problem definition and record of past problem solving steps). In contrast to this uninformed approach,other methods relied on a so-called heuristic evaluation function f(n)=g(n)+h(n), where h(n) is an estimate of the path from the current state to the goal state; the choice of this function requires domain-specific information. In this context, heuristic search in the narrow technical sense refers to methods based on some form of h(n) (with or without g(n)) were called heuristic search. The general approach used in heuristic search (in this specific sense) is best-first search, where the "best" node was that with the lowest evaluation function f(n). For this reason, we use the term HeuristicBestFirstSearch to designate heuristic search in this narrow technical sense and distinguish it from heuristic search in the first, broader sense defined above. A hierarchical clustering algorithm produces a nested series of partitions based on a criterion (usually similarity-based) for merging or splitting clusters. A hierarchical clustering model is a dendrogram representing the nested grouping of objects and similarity levels at which groupings change. The dendrogram can be broken at different levels to yield different clusterings of the data. BiasVarianceProfile: The bias-variance profile of an algorithm is a qualitative indication of the so-called capacity (or complexity) of the models it generates. High-bias algorithms are algorithms that can produce only low capacity (high bias) models; examples are linear discriminants and standard Naive Bayes. High-variance algorithms, on the contrary, can generate extremely complex models, so that a complexity control parameter is typically used to control the bias-variance trade-off. Depending on the value of the complexity parameter, an individual model can have (very) high bias or (ver) high variance. 
But the algorithm itself is classified as high-bias or high-variance based on the maximum variance or capacity that its generated models can attain. HillClimbing: an irrevocable search strategy based on local optimizations. Repeatedly expands a node moving in the direction of increasing value of some objective function. Terminates when it reaches a "peak" where no neighbor has a higher value. Aka greedly local search. [Pearl, 1984, Russell and Norvig, 2003]. HingeLoss: Primal SVM Loss function HypothesisApplicationAlgorithm: an algorithm designed to achieve a HypothesisApplicationTask, defined as the task of applying an induced model of pattern set to new data. A HypothesisApplicationAlgorithm is typically (and ambiguously) called an interpreter. HypothesisApplicationTask: the task of applying an induced model or pattern set to new data. A strategy for controlling model or pattern set complexity. HypothesisEvaluationTask: the task of quantifying the quality of an induced model or pattern set with respect to a specific criterion (e.g., predictive performance, interestingness). HypothesisProcessingAlgorithm: an algorithm designed to achieve a HypothesisProcessingTask, defined as the task of transforming an induced model or pattern set in view of a specific goal such as increasing its readability or improving its performance. HypothesisProcessingTask: the task of transforming an induced model or pattern set in view of a specific goal such as increasing its readability or improving its performance. Hypothesis structure determines "the underlying structure or functional forms that we seek from the data" [1]. [1] Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001. 1 Independent component analysis (ICA) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. [A Hyvarinen,Fast and Robust Fixed-Point Algorithms for Independent Component Analysis, Neural Networks, IEEE Transactions on, 1999] ID3 is a model generated by the ID3 decision tree induction algorithm developed by J. Ross Quinlan (1986). ID3 adopts a greedy strategy in constructing a decision tree in a top-down manner using a recursive divide-and-conquer approach. IO-Class is a meta-class of all classes of input and output objects IO-Object is DM-Data or DM-Hypothesis. Incremental clustering is based on the assumption that it is possible to consider objects one at a time and assign them to existing clusters. Here, a new data item is assigned to a cluster without affecting the existing clusters significantly. Typically, incremental clustering algorithm are noniterative. IncrementalReducedErrorPruning: Contrary to ReducedErrorPruning, which is a model pruning algorithm (prunes the decision tree or ruleset after model construction), IncrementalReducedErrorPruning prunes each individual rule immediately after it has been grown. 1 1 1 1 1 1 1 1 InductionAlgorithm: an algorithm designed to achieve an InductionTask, defined as the task of analysing a data set in view of extracting a hypothesis (either a model or a pattern set) that can later be used for predictive or descriptive purposes. Based on whether it outputs a model or a pattern set, an InductionAlgorithm is either a ModelingAlgorithm or a PatternDiscoveryAlgorithm. The algorithm is based on the original formal framework generalising the conventional boolean approach on the case of (i) finite-valued attributes and (ii) continuous-valued semantics. 
To build rules the patterns in the form of possibilistic prime disjunctions are used, which represent the widest intervals of impossibility in the multidimensional space of all value combinations. In particular, it allows us to reach optimality of the rules in the sense of maximal generality of condition and specificity of conclusion. The filtration mechanism guarantees finding all the most interesting patterns according to some criterion while too specific ones are not generated. The algorithm is iterative in the sense that it processes all records for one pass through the database. The patterns in the form of possibilistic prime disjunctions as well as rules generated by the algorithm have clear and easily interpretable semantics in the form of possibilistic constraints (upper bound) on the maximal number of observations. All generated possibilistic prime disjunctions have equal rights, i.e., the whole semantics does not depend on the order of disjunctions. In particular, the interpretation of each rule is independent of other rules and their order. All attributes have equal rights, particularly, we do not need the target attribute. The knowledge base in the form a set of the most general prime disjunctions is approximately equivalent to the database and therefore can be easily used for prediction purposes. One minus of the algorithm is a large number of generated rules especially for dense distributions with fine surface structure when it tries to reflect all details of the too complex surface (a kind of overfitting). This problem can be solved with the help of a more powerful search and filtration mechanism. For many problem domains it may be more desirable to generate directly probabilistic set-valued patterns rather than possibilistic ones. However, this is a separate highly important and rather difficult problem since we do not have the notions of prime disjunctions, DNF etc. for the probabilistic case. [Savinov A. A., An algorithm for induction of possibilistic set-valued rules by finding prime disjunctions, 4rd Online World Conference on Soft Computing in Industrial Applications -- WSC4, 21-30 Sept. 1999, Published also in: Soft computing in industrial applications, Suzuki, Y., Ovaska, S.J., Furuhashi, T., Roy, R., Dote, Y., eds. Springer-Verlag, London, 2000] InductionTask: the task of analysing a data set in view of extracting a hypothesis (either a model or a pattern set) that can later be used for predictive or descriptive purposes. InformedSearch: search that uses (often problem-specific) knowledge beyond that which is built into the state and operator definitions. This additional knowledge guides search in the sense that it allows us to determine whether one nongoal state is more promising than another. [Pearl, 1984; Russell, 2003]. Instance: as used in DMOP, should be taken to denote DM-Instance. In the general sense, an object is an instance of a set or class if it is included in that set or class [http://virtual.cvut.cz/ksmsaWeb/browser/title]. In data mining, a (DM-)Instance is an instance of a dataset and is therefore synonymous with case or example or observation as used in statistics. InstanceWeightVector: a vector of length N, with N the number of cases in the training set, whose elements are the alpha coefficients of the instances. Instances with alpha value > 0 are the support vectors. An internal validity model function determines if the the obtained clustering model is intrisically appropriate for the data. e.g. 
the duration of an event IntervalFeature: a feature/attribute/variable with values between which a distance can be specified. "In order to make the meaning of uncertainty clear, we assume that a feature value varies within a closed interval, i.e., closed intervals are given as complete knowledge. The above feature value defined on the closed interval is called "interval feature value (IFV)"." Horiuchi, T. "Decision Rule for Pattern Classification by Integrating Interval Feature Values." IEEE Transactions on Pattern Analysis and Machine Intelligence 20.4 (1998): 440. Web. An item reassignment to cluster checks whether any pattern/item/object is reassigned from one cluster to another. ItemSequenceTableFormat: the format of a data table (see DataTable) whose instances are tuples of sequences. The elements of the sequences themselves can be of any (primitive or structured) type (see DataType). ItemSetTableFormat: the format of a DataTable whose instances are sets of items belonging to a predefined category. IterativeImprovementSearch: a subclass of search strategies that differs from path-based search in that what is important is not the path to the goal, but the goal state itself. This goal state need not always be a precisely defined configuration, but a "best" state that optimizes some objective function. Local search starts from a single or several arbitrary states and typically moves only to neighbors of that state. The term "iterative improvement search" was introduced by Andrew Moore for what Norvig & Russell [2003] call "local search". The parameter indicates the number of clusters in which each data entity belongs to the cluster with the nearest mean. K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an initial grouping is obtained. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function (a minimal code sketch follows the NumNearestNeighbors entry below). [J. A. Hartigan and M. A. Wong (1979) "A K-Means Clustering Algorithm", Applied Statistics, Vol. 28, No. 1, p100-108.] Comment: Typical convergence criteria for the k-Means algorithm are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error. NumNearestNeighbors: KNN parameter indicating the number of nearest neighbors to take into account in order to predict the target value of the query or test instance.
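The following is a minimal sketch of the k-means procedure described above; it assumes NumPy is available, and the function and variable names are illustrative only (they are not part of DMOP).

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        """Minimal k-means sketch; X is an (n_samples, n_features) array."""
        rng = np.random.default_rng(seed)
        # Initialise the k centroids by picking k distinct data points at random.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Associate each point with its nearest centroid (squared Euclidean distance).
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Re-calculate each centroid as the barycenter of its cluster.
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            # Typical convergence criterion: the centroids no longer move.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels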
K-nearest neighbor algorithm is an algorithm that at its basic level works as follows: "to classify a new object, with input vector y, we simply examine the k closest training data set points to y and assign the object to the class that has the majority of points among these k. Close is defined here in terms of the p-dimensional input space." [1]. [1] Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001. KNearestNeighborsAlgorithm: Classified under the 'regression approach' (=Bishop's discriminative models) by Hand et al 01, p. 348 -- 'they directly estimate the posterior probabilities of class membership'. See also Hand, 1997: p. 79: suggests very clearly that KDE-based classification: generative <> KNN: discriminative. The construction of the kernel distance involves a transformation from similarities to distances. This tranformation takes the followinggeneral form. Given two “objects” A and B, and a measure of similarity between them given by K(A, B), then the induced distance between A and B can be defined as the difference between the self-similarities K(A,A) + K(B, B) and the cross-similarity K(A, B). [A Gentle∗ Introduction to the Kernel Distance Jeff M. Phillips, July, 6, 2010] Kernel functions provide a way to manipulate data as though it were projected into a higher dimensional space, by operating on it in its original space This leads to efficient algorithms And is a key component of algorithms such as – Support Vector Machines – kernel PCA – kernel CCA – kernel regression In both statistics (kernel density estimation or kernel smoothing) and machine learning (kernel methods) literature, kernel is used as a measure of similarity. In particular, the kernel function k(x,.) defines the distribution of similarities of points around a given point x. k(x,y) denotes the similarity of point x with another given point y. On the example : In statistics &quot;kernel&quot; is most commonly used to refer to kernel density estimation and kernel smoothing. A straightforward explanation of kernels in density estimation can be found -here-. In machine learning &quot;kernel&quot; is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem &quot;by mapping the original non-linear observations into a higher-dimensional space&quot;. Def from: Tom M. Mitchell Machine Learning Department Carnegie Mellon University and http://stats.stackexchange.com/questions/2499/what-is-a-kernel-in-plain-english KernelWeightVector: a vector of the kernel coefficients in a support vector machine. 1 LabeledDataSet: a data set whose instances are pairs (X, Y), where X is a set of predictive/explanatory/independent attribute/feature/variable values and Y is a target/response/independent value (in multitask learning, a set of such values). LabeledDataTable: a data table whose every column has a label describing the content. LeafSizeParemeter define size of leaf in decision tree. This parameter is used by algorithm to define minimal leaf size. 1 &quot;A linear combination of vectors v1, v2, ..., vk in a vector space V is an expression of the form c1 * v1 + c2 * v2 + ... + ck * vk where the ci&apos;s are scalars, that is, it&apos;s a sum of scalar multiples of them. 
More generally, if S is a set of vectors in V, not necessarily finite, then a linear combination of S refers to a linear combination of some finite subset of S." Joyce D., "Linear Combinations, Basis, Span, and Independence", Math 130 Linear Algebra, Clark University. A linear constraint is a mathematical expression where linear terms (i.e., a coefficient multiplied by a decision variable) are added or subtracted and the resulting expression is forced to be greater-than-or-equal, less-than-or-equal, or exactly equal to a right-hand side value. The following are examples of linear constraints on the decision variables: Var1 + Var2 + Var3 + Var4 + Var5 = 10500; 0 <= Var1 + 2*Var2 - Var3 <= 5000; Var1 - 3*Var5 >= 300; Var1 >= 6 or Var2 >= 6 or Var1+Var2 = 4. Definition from: http://www.opttek.com/documentation/v65engine/OptQuest%20Engine%20Documentation/WebHelp/Defining_constraints.htm LinearDiscriminantModel: the value of the NumberOfProbabilities complexity metric = $C\times D_{cont}+D(D_{cont}+1)/2$ if we take the joint probability distribution. The class-conditional densities $P(\mathbf{x}_{i}|y_{j};\Theta_{j})\sim N(\mathbf{\mu}_{j},\mathbf{\Sigma})$, i.e., $\Theta_{j}=(\mathbf{\mu}_{j},\mathbf{\Sigma})$, can be simplified into a series of linear functions $f_{i}(\mathbf{x})=\ln(P(\mathbf{x}_{i}|y_{i})P(y_{i}))=\mathbf{w}_{i}^{T}\mathbf{x}+w_{i0}$, where $\mathbf{w}_{i}=\mathbf{\Sigma}^{-1}\mathbf{\mu}_{i}$ and $w_{i0}=-\frac{1}{2}\mathbf{\mu}_{i}^{T}\mathbf{\Sigma}^{-1}\mathbf{\mu}_{i}+\ln P(y_{i})$. Linear equality constraints are general linear constraints that model relationships among portfolio weights that satisfy a system of equalities. Linear equality constraints take the form AEx = bE where: x is the portfolio (n vector), AE is the linear equality constraint matrix (nE-by-n matrix), bE is the linear equality constraint vector (nE vector), n is the number of assets in the universe and nE is the number of constraints. Portfolio object properties to specify linear equality constraints are: AEquality for AE, bEquality for bE, NumAssets for n. In summary: linear equality constraints are optional linear constraints that impose systems of equalities on portfolio weights; they have the properties AEquality, for the equality constraint matrix, and bEquality, for the equality constraint vector. Source: http://www.mathworks.com/help/finance/working-with-portfolio-constraints_bswwmte.html Linear inequality constraints are general linear constraints that model relationships among portfolio weights that satisfy a system of inequalities. Linear inequality constraints take the form AIx ≤ bI where: x is the portfolio (n vector), AI is the linear inequality constraint matrix (nI-by-n matrix), bI is the linear inequality constraint vector (nI vector), n is the number of assets in the universe and nI is the number of constraints. Portfolio object properties to specify linear inequality constraints are: AInequality for AI, bInequality for bI, NumAssets for n. In summary: linear inequality constraints are optional linear constraints that impose systems of inequalities on portfolio weights; they have the properties AInequality for the inequality constraint matrix, and bInequality for the inequality constraint vector.
** source of information : http://www.mathworks.com/help/finance/working-with-portfolio-constraints_bswwmte.html#bswwlqb-1 Linear regression analysis is the most widely used of all statistical techniques: it is the study of linear, additive relationships between variables, usually under the assumption of independently and identically normally distributed errors. Regression models describe the relationship between a dependent variable, y, and independent variable or variables, X. The dependent variable is also called the response variable. Independent variables are also called explanatory or predictor variables. Continuous predictor variables might be called covariates, whereas categorical predictor variables might be also referred to as factors. List: an ordered, variable-length array of items (as opposed to a tuple which is fixed-length). Sequences are special cases of lists which contain elements of primitive data types. A cross between Beam Search and Local Search. Normally used to maximize an objective function. The algorithm holds &apos;k&apos; number of states at any given time. Initially these k states are randomly generated. The successors of these k states are calculated using the objective function. If any of these successors is a &apos;goal&apos;, that is, the maximum value of the objective function, then the algorithm halts. Otherwise the initial k states and k number of successors are placed in a pool. This pool has a total of 2k states. The pool is numerically sorted and the best (highest) k states are selected as new initial states. This process repeats until a maximum value is reached. LocalPiecewiseModelStructure is a model that allows to combine different local dependence between the response variable and the predictor variables in the various regions of the space of the predictor variables. See D. Hand at al. "Principles of Data Mining" Ch. 6, sect. 6.3.2 [CdA, 09/02/12] The likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability of those observed outcomes given those parameter values (ref. http://en.wikipedia.org/wiki/Likelihood_function) A logical expression is an expression that produces a Boolean value (true/false) when evaluated. It usually contains one or more logical operators as well as variable names, constants, and functions. The general form of a logical expression is: <expression> <operator> <expression>, where <expression> may be simply a variable name, a constant or a complex <expression>. MarkovNetwork: an undirected graphical model. A mathematical expression is a finite combination of symbols that is legal (well-formed) according to rules depending on a language used. The symbols can denote numbers (constants), variables, operations, functions, and other relevant elements of the language syntax. Function:(mathematics) a mathematical relation such that each element of a given set (the domain of the function) is associated with an element of another set (the range of the function) [http://sigma.ontologyportal.org:4010/sigma/Browse.jsp?kb=SUMO&lang=en] Mathematical function is a relation that associates each element of a set of inputs with an element of a set of outputs. Each input is related to exactly one output. The set of inputs is called the domain of the function and the set of outputs is called the range of the function. 1.0 1.0 measurement based on the minimization or maximization of an objective function WARD, J. H. JR. 1963. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 
58, 236–244 MURTAGH, F. 1984. A survey of recent advances in hierarchical clustering algorithms which use cluster centers. Comput. J. 26, 354–359 MissingValueImputationAlgorithm: an algorithm designed to achieve a MissingValueImputationTask, defined as a task of imputing the value for missing attribute of dataset&apos;s record. It can use one of many methods to predict attribute&apos;s value, for example using most common value, mean, median or closest fit approach. MissingValueImputationTask: imputes the value for missing attribute of dataset&apos;s record. This paper presents new aggregation algorithms for obtaining reduced order power networks when coherent generators are aggregated. The generation terminal bus aggregation algorithm in the EPRI DYNRED software tends to stiffen the reduced order network during the aggregation process, thus increasing the frequencies of inter-area modes. The inertial and slow coherency aggregations will decrease the stiffening effect and produce, for the same coherent machine groups, aggregate networks with improved inter-area mode approximations. This paper contains new procedures to construct these aggregate networks and demonstrates the benefits of these new aggregate networks on a 48-machine power system using eigenvalues and nonlinear simulations Galarza, R. &quot;Dept. of Electr. Power Eng., Rensselaer Polytech. Inst., Troy, NY, USA&quot; A model complexity function is a function that assesses the complexity of a data mining model (DM model). ModelComplexityMeasure: this concept revolves around the (free) parameters of a model, but the crucial difference is whether one measures the number or the magnitude of the model parameters. For the moment, the concept of ModelComplexityMeasure is used as the range of the object property hasComplexityComponent of CostFunction, but may be needed elsewhere since it's a crucial concept in DM. A model evaluation function quantifies "how well a model or parameter structure fits a given data set" [1]. [1] Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001. A model evaluation task is the task where the model induced by an induction algorithm is evaluated. ModelParamter: a parameter of a learned model, contrary to hyperparameters which are runtime parameters of a learning algorithm. Ultimately hyperparameters impact the learned model's parameters, e.g., by restricting the search space or by limiting the range of possible values a model parameter may take. 1 1 Algorithm for transforming an induced model in view of a specific goal such as increasing its readability or improving its performance. ModelProcessingTask: a task that takes a Model or set of Models as input and outputs some transformation or combination of the input models. Algorithm for reducing the complexity of a given input model by reducing the size of the model itself A model structure is a "global summary of a data set; it makes statements about any point in the full measurement space" [1]. [1] Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001. ModelStructure subclasses are inherited by the model descriptor generated after execution of a specific operator. The model descriptor should then contain an instantiation of the model structure with its specific model parameters. 1 ModelingAlgorithm: an algorithm designed to build a model, defined as an induced hypothesis that ensures global coverage of the instance space represented by the training data. 
For instance, a predictive model is assumed to output a prediction for any new example drawn from the same population as the training data. ModelingTask: the task of building a model, defined as an induced hypothesis that ensures global coverage of the instance space represented by the training data. For instance, a predictive model is assumed to output a prediction for any new example drawn from the same population as the training data. A typical multilayer perceptron (MLP) network consists of a set of source nodes forming the input layer, one or more hidden layers of computation nodes, and an output layer of nodes. The input signal propagates through the network layer-by-layer. MultivariateFSAlgorithm: algorithm that performs feature selection task by selecting and rating subsets of features. Decision trees that are limited to testing a single variable at a node are potentially much larger than trees that allow testing multiple variables at a node. This limitation reduces the ability to express concepts succinctly, which renders many classes of concepts difficult or impossible to express. This paper presents the PT2 algorithm, which searches for a multivariate split at each node. Because a univariate test is a special case of a multivariate test, the expressive power of such decision trees is strictly increased. The algorithm is incremental, handles ordered and unordered variables, and estimates missing values. PAUL E. UTGOFF &quot;An Incremental Method for Finding Multivariate Splits for Decision Trees&quot; 1 NN-Parameter: parameter of an algorithm that builds a neural network. NaiveBayesAlgorithm might subsequently be subsumed by BayesNetAlgorithm. Naive Bayes classification algorithms are Bayesian classification algorithms that "assume that the effect of an attribute value on a given class is independent of the values of the other attributes" [1]. [1] Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006. 1 NaiveBayesDiscreteModel: The quantifier restriction for hasModelParameter is "exactly 1 ClassPriorVector" and "exactly 1 ClassCondProbMatrix" (and not some) because these distribution parameters for discrete variables will surely be used after discretization. 1 NiaveBayesKernelModel: The model parameters ClassPriorVector and ClassCondProbMatrix are for the joint PD of the discrete features if any. Hence we use only and not some (because there will be zero such model parameters if there are no discrete variables). 1 1 NaiveBayesMultinomialModel: The ModelParameter of NaiveBayesMultinomialModel is a k-list of vectors $\theta_c$, c=1 to k (number of classes), where $\theta_{ci}$ is the probability of word/feature i in class c, with i = 1 to p (number of words/boolean features). We can therefore represent this model parameter as a $k x p$ matrix whose rows are classes and columns are words/boolean features. We call this a MatrixOfCCProbs. 1 1 1 NaiveBayesNormalModel: The model parameters ClassPriorVector and ClassCondProbMatrix are for the JPD of the discrete features if any. Hence we use only and not some (because there will be zero such model parameters if there are no discrete variables). NaiveBayesNormalModel hasDecisionBoundary {ArbitraryLinearBoundary}: Holds only if the class variable is binary and distribution of X is Gaussian. [Mitchell 2010] "A neural network is a set of connected input/output units in which each connection has a weight associated with it. 
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples." Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006. "A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units." Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006. NoiseReductionTask: removes noisy data that may be the result of mistakes during the data collection process or that are irrelevant to the analysis. Noise reduction is used to prevent the noise from hindering the data mining process and creating distorted results. A non-convex optimization problem is a nonlinear programming problem. "Such a problem may have multiple feasible regions and multiple locally optimal points within each region. It can take time exponential in the number of variables and constraints to determine that a non-convex problem is infeasible, that the objective function is unbounded, or that an optimal solution is the "global optimum" across all feasible regions." Article [http://www.solver.com/convex-optimization] A non-parametric test makes no assumptions about parameters of the population distribution, such as the mean or variance. NonlinearPolynomialKernel: a polynomial kernel with degree > 1. Density estimation that makes no assumptions about the form of the underlying densities. [http://www.cedar.buffalo.edu/~srihari/CSE555/Chap4.DensityEstimation.pdf] NuSVC-Algorithm: a subclass of SVC-Algorithm, where the SVC-Algorithm is formulated such that the C hyperparameter is replaced by the Nu hyperparameter. NumberOfWeights: model complexity measure that corresponds to the L0 norm of the model weights. OperatorParameter: a user-tunable runtime parameter that impacts either the learning process (e.g., by limiting the search space) or the learned model (e.g., by controlling model complexity). A hyperparameter often comes with a default value. OptimizationProblem: In mathematics and computer science, an optimization problem is the problem of finding the best solution from all feasible solutions. OptimizationProblem: the minimization or maximization of a function subject to constraints on its variables. Let $x$ be the vector of variables (aka unknowns or parameters), $f$ the objective function, a function of $x$ that we want to minimize/maximize, and $c$ the vector of constraints that the unknowns must satisfy. The optimization problem can then be written as $\min_{x\in\mathbb{R}^{n}}f(x)\ \text{subject to}\ \begin{cases} c_{i}(x)=0, & i\in\mathcal{E}\\ c_{i}(x)\geq 0, & i\in\mathcal{I}\end{cases}$. Here $f$ and each $c_{i}$ are scalar-valued functions of the variables $x$, and $\mathcal{E}$, $\mathcal{I}$ are sets of indices. [J. Nocedal and S. J. Wright, Numerical Optimization, Springer 1999]. An algorithm is order-independent if it generates the same results for any order in which the data is presented. Otherwise, it is order-dependent. E.g., military rank, qualitative evaluation of temperature ("cool" or "hot"), sound intensity ("quiet" or "loud"). OrdinalFeature: a feature (attribute or variable) that takes values from a finite set of discrete, non-numeric, ordered labels. It allows creating a rank order.
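As an illustrative sketch only (the names below are ours, not DMOP's), an ordinal feature such as temperature can be mapped to ranks that preserve its order:

    # Hypothetical example: encode an ordinal feature as ranks that preserve its order.
    ORDER = ["cool", "mild", "hot"]                 # ordered labels of the ordinal feature
    RANK = {label: r for r, label in enumerate(ORDER)}

    values = ["hot", "cool", "mild", "hot"]
    ranks = [RANK[v] for v in values]               # -> [2, 0, 1, 2]
    # Unlike a purely categorical encoding, order comparisons such as
    # RANK["hot"] > RANK["mild"] are meaningful for an ordinal feature.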
Feature having ordered values 1 1 The concept of Parameter is used in a very broad sense in the DMO. It englobes the following senses: Parameter (computer science): Parameter (statistics): a quantity (such as the mean or variance) that characterizes a statistical population and that can be estimated by calculations from sample data [http://sigma.ontologyportal.org:4010/sigma/Browse.jsp?kb=SUMO&lang=en]. TO BE COMPLETED 1 Density estimation that assume that the density function takes on particular form, with parameters to estimate. [http://isites.harvard.edu/fs/docs/icb.topic539621.files/lec16.pdf] Parametric test is type of statistical test which based on assumption about distribution parameters like mean or variance. 1 A partitional clustering algorithm identifies the partition that optimizes (usually locally) a clustering criterion (often similarity-based). A partitional clustering model identifies the partition that optimizes (usually locally) a clustering criterion. Only one partition is produced. PathBasedSearch: the problem statement consists of an initial and a goal state, and the solution is a path from the initial to the goal state. PatternDiscoveryAlgorithm: an algorithm designed to detect patterns in data, where a pattern is any regularity, relation or structure inherent in the data, be it exact, approximate or statistical. As opposed to a model which by definition have global coverage, a patter is a local hypothesis, i.e. it applies to a limited region of the sample space. PatternDiscoveryTask: deals with the automatic detection of pattterns in data, where a pattern is any regularity, relation or structure inherent in the data, be it exact, approximate or statistical. In a more restricted sense, a pattern is a learned predictive hypothesis that has local coverage as opposed to a model (again in the restricted sense) which has global coverage. A hypothesis is said to have global coverage if it is capable of returning a prediction for any instance of the sample space, whereas a local hypothesis typically returns a prediction for a limited region of the sample space. PatternEvaluationFunction is a function that evaluates a single pattern. It may be used to evaluate the interestingness of the discovered pattern from the user's viewpoint, and in such a way govern the selection of patterns which are of the interest for the user. Hence, it also may play an important role in reducing the search space during mining for patterns. Formally, this function maps from a pattern to a (usually) numeric value. The evaluation might consider each pattern by taking its statistical properties, and the context of the set of patterns into account. It may also account for the pattern usefulness, novelty and validity in the context of the particular application. Algorithm for transforming an induced pattern set in view of a specific goal such as increasing its readability or improving its performance. Algorithm for reducing the complexity of a given input pattern set by reducing the its size PatternSetBasedClassificationModel: a model which is based on a pattern set. Together with a decision strategy it forms a predictive model. A pattern structure is a partial summary of a data set; it makes statements "only about restricted regions of the space spanned by the variables" [1]. [1] Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001. 
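As a minimal sketch of a PatternEvaluationFunction in the sense described above, the following illustrative (assumption-laden, not DMOP-prescribed) example maps an association-rule pattern to numeric support and confidence values:

    # Hypothetical sketch: evaluate a pattern "antecedent -> consequent" over a set of transactions.
    def evaluate_pattern(transactions, antecedent, consequent):
        n = len(transactions)
        both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
        ante = sum(1 for t in transactions if antecedent <= t)
        support = both / n if n else 0.0           # how large the covered subset is
        confidence = both / ante if ante else 0.0  # how reliable the rule is on that subset
        return support, confidence

    # Usage: transactions, antecedent and consequent are Python sets of items.
    baskets = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"}]
    print(evaluate_pattern(baskets, {"bread"}, {"butter"}))   # (0.666..., 0.666...)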
PiecewiseLinearBoundary: (almost) synonymous with PiecewiseMultivariateLinearBoundary or PiecewiseObliqueLinearBoundary, but includes PiecewiseAxisParallelBoundary as well. Polynomial kernels are non-stationary kernels suited to problems where all the training data are normalised. The kernel has adjustable parameters: the slope alpha, a constant term c and the polynomial degree d; k(x, y) = (alpha x^T y + c)^d. [Justice Kwame Appati, Gideon Kwadzo Gogovi, Gabriel Obed Fosu, On the Selection of Appropriate Kernel Function for SVM in Face Recognition, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 3, March 2014] A data mining (DM) model that serves for prediction. A ModelEvaluationFunction for predictive models. A predictive modeling algorithm is an algorithm that addresses a predictive modeling task. Predictive modeling tasks "perform inference on the current data in order to make predictions" [1]. "The aim here is to build a model that will permit the value of one variable to be predicted from the known values of other variables" [2]. [1] Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006. [2] Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001. Primitive data types are predefined basic (built-in) types of a programming language. For example, in Java there are 8 primitive types: byte, short, int, long, float, double, char, boolean. (see JAIN, A. K. AND DUBES, R. C. 1988. Algorithms for Clustering Data. Prentice-Hall advanced reference series. Prentice-Hall, Inc., Upper Saddle River, NJ.) The underlying assumption is that the patterns/items/objects to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each and (perhaps) their number. Most of the work assumes that the individual components of the mixture density are Gaussian, and the parameters of the individual Gaussians are to be estimated. Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities. Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988]. BRAILOVSKY, V. L. 1991. A probabilistic approach to clustering. Pattern Recogn. Lett. 12, 4 (Apr. 1991), 193–198. Probability density estimation is a quantitative tool that assists the analyst in obtaining information and gaining understanding about the distribution of the underlying population. Besides revealing trends and seasonal change points in the collected data, probability density estimation also provides information on how the variables relate to each other, statistical summaries such as the mean and variance, and the outliers or aberrant values that might contaminate the data. [Introduction to Probability Density Estimation] A probability estimation algorithm is an algorithm that addresses a probability estimation task, that is, it constructs a model for the overall probability distribution of the data. A data mining (DM) model that serves for estimating probability values. DensityEstimation is the task of determining the distribution of data in the input space. PropositionalDataSet: a dataset that consists of a single data table (see DataTable). PropositionalLogicStructure: The prevalent logical structure in propositional learning algorithms is a disjunction of conjunctions.
When the disjunction is exclusive (XOR), the different "rules" are non-overlapping, so order among disjuncts has no importance), the result is a decision tree. When disjunction is inclusive (OR), the different rules constitue a RuleSet which can be either ordered (a decision list) or unordered. RuleSet: a disjunction of conjunction of logical tests, i.e. in which the disjuncts (rules) are not mutually exclusive so that a test example can be matched by several rules. Conflicts due to overlapping rules are avoided by imposing an order on the rules (an ordered ruleset is called a decision list). Alternatively, rules can be weighted and, in case simultaneously applicable rules have conflicting consequents, a decision is taken by aggregating the rule weights. 1 1 Quadratic programming (QP) is a special type of mathematical optimization problem. It is the problem of optimizing (minimizing or maximizing) a quadratic function of several variables subject to linear constraints on these variables. [https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Quadratic_programming.html] Qualitative feature: a qualitative feature is a type of non-numerical features, its values are defined as descriptive terms. Quantitive feature: a quantitive feature is a type of numerical features, its values are defined as numbers. ROperator is a DM operator (algorithm implementation) from R software. [1] http://www.r-project.org The RBF Mapping can be cast into a form that resembles a neural network. The hidden to output layer part operates like a standard feed-forward MLP network, with the sum of the weighted hidden unit activations giving the output unit activations. The hidden unit activations are given by the basis functions φ j(x,µ j,σ j), which depend on the “weights” {µij,σ j} and input activations {xi} in a non-standard manner. Intuitively, it is not difficult to understand why linear superpositions of localised basis functions are capable of universal approximation. More formally: - Hartman, Keeler &amp; Kowalski (1990, Neural Computation, vol. 2, pp. 210-215) provided a formal proof of this property for networks with Gaussian basis functions in which the widths {σ j } are treated as adjustable parameters. - Park &amp; Sandberg (1991, Neural Computation, vol. 3, pp. 246-257; and 1993, Neural Computation, vol. 5, pp. 305-316) showed that with only mild restrictions on the basis functions, the universal function approximation property still holds. As with the corresponding proofs for MLPs, these are existence proofs which rely on the availability of an arbitrarily large number of hidden units (i.e. basis functions). However, they do provide a theoretical foundation on which practical applications can be based with confidence. The proofs about computational power tell us what an RBF Network can do, but nothing about how to find all its parameters/weights € {M, wkj , µ j , σ j}. Unlike with MLPs, for RBF networks the hidden and output layers play very different roles, and the corresponding “weights” have very different meanings and properties. It is therefore appropriate to use different learning algorithms for them. The input to hidden “weights” (i.e. basis function parameters {µij ,σ j}) can be trained (or set) using any of a number of unsupervised learning techniques. Then, after the input to hidden “weights” are found, they are kept fixed while the hidden to output weights are learned. 
Since this second stage of training just involves a single layer of weights {wjk} and linear output activation functions, the weights can easily be found analytically by solving a set of linear equations. This can be done very quickly, without the need for a set of iterative weight updates as in gradient descent learning. John A. Bullinaria, 2013, "Radial Basis Function Networks: Algorithms". RapidMinerOperator is a DM operator (algorithm implementation) from RapidMiner software [1]. [1] https://rapidminer.com RecoveryOfPursuit: term used by Pearl (1985, pp. 65-66) to designate a heuristic search algorithm's behavior with respect to choices made in the past. A tentative strategy can backtrack over a decision taken in the past and shift attention back to previously suspended alternatives which had appeared to offer less promise than the current path, whereas an irrevocable strategy can only pursue its search from the current node/state. It is an algorithm that operates on a Recurrent Neural Network (RNN). This network contains at least one feed-back connection, so the activations can flow round in a loop. That enables the networks to do temporal processing and learn sequences, e.g., perform sequence recognition/reproduction or temporal association/prediction. A data mining (DM) model that serves for predicting a numerical continuous value. A ModelEvaluationFunction for regression models. A regression modeling algorithm is an algorithm that addresses a regression modeling task. The task of regression deals with the prediction of the value of one continuous field (the target) based on the values of the other fields (attributes or features). Algorithm that addresses a regression modeling task and is grounded on the induction of a regression tree as a model. RegularizedClassificationModelEvalFunction: a predictive model evaluation function that contains both a loss component and a model complexity component linked by some combination function. A regularization parameter governs the trade-off between the two components. RelationalDataSet: a dataset that has at least 2 data tables. A relative validity model function compares two model structures and measures their relative merit. RipperModel: hasHypothesisStructure {RuleSet} UnweightedRuleSet: Ripper's ruleset is not a proper decision list (Frank, 2000). Rule: composed of a head (conclusion) and a body (a conjunction of logical tests). Algorithm for pruning an initially given set of rules (standing for the processing model) with the goal of reducing the model complexity. SVC-Algorithm (SupportVectorClassifierModelingAlgorithm): a learning algorithm that builds a classifier using the SVM (SupportVectorMachines) approach. In DMOP's task-based hierarchy, we distinguish SVC from SVR (SVMs for regression problems). SVC algorithms are typically designed for binary classification problems (with labels y = {+1, -1}) but have also been extended to multi-class or one-class problems. An SVC-Algorithm optimizes the margin between classes in a feature space induced by some kernel function. The margin can be hard or soft. A hard margin does not allow any violation, while a soft margin tolerates violations, i.e., it allows some instances to lie inside the margin, between the margin and the separating hyperplane. The level of error tolerance can be controlled by the hyperparameter C. If C is very large, no error is allowed and the soft margin becomes hard.
Therefore both the hard-margin and the soft-margin SVC-Algorithm can be formulated in terms of a single optimization problem, which optimizes the trade off between the margin and the hinge loss error of the margin violation. This formulation of SVC-Algorithm is called C-SVC-Algorithm (since hyperparameter C is used). The SVC-Algorithm can be also reformulated in another way, with the C hyperparameter replaced by Nu hyperparameter, yielding the so-called Nu-SVC-Algorithm. Hand et al. 01, p. 335, cite SVMs as examples of what they call the discriminative approach, which is in fact what Bishop 06 calls the discriminant function approach. Hand calls regression approach what Bishop calls discriminative. Both use the term generative for the same thing. SVC-Parameter: parameter of an SVC-Algorithm, including the hyperparameter C as well as parameters of the kernel function such as kernel matrix K and label Y. 1 1 SVC-Model: a classification model produced by SVC-Algorithm: $y=wx_{sv} + b$ or $y = alpha_{sv} * K(x, x_{sv}) + b$. ScopeOfSelection: a property of heuristic search algorithms which designates the extent of the set of alternatives considered by the algorithm at each choice point. Some algorithms have a global scope of selection: they consider all available alternatives, i.e. all the nodes/states that have been generated and remained unexpanded at a given step. Others have a local scope in the sense that they limit their choice to the most recent alternatves -- concretely, to the immediate successors of the current node/state. 1 1 A segmentation model is of the form of &quot;interesting and meaningful subgroups or classes that share common characteristics.&quot; Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13—22 SelfConfiguring NNAlgorithm: can be formally defined as: SelfConfiguringNNAlgorithm == NeuralNetworkAlgorithm and hasSelfTunedHyperParameter some NNParameter. However, there is no need for this equivalent class definition because membership in this class is fully determined by the definition of the object property hasSelfTunedHyperParameter with domain SelfConfiguringNNAlgorithm and range NNParameter. SemiSupervisedInductionAlgorithm: an InductionAlgorithm that builds a hypothesis from input data that comprise both a LabeledDataSet and an UnlabeledDataSet. The algorithm uses a set of heuristics to find prime covers, another set of heuristics to find feasible solutions to the dual linear program which are needed to generate cuts, and subgradient optimization to find lower bounds. [Balas E., Ho A., Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, Combinatorial Optimization, Mathematical Programming Studies Volume 12, Springer, 2009] &quot;The set of weight vectors will be called a model. An incoming pattern is mapped to the node whose model is “most similar” (according to a predefined metric) to the pattern, and weight vectors in a neighborhood of such node are updated. Therefore, the network behaves as a competitive neural network that implements a winner take-all function with an associated mechanism, that modifies the local synaptic plasticity of the neurons, allowing learning to be restricted spatially to the local neighborhood of the most active neurons. For each color pixel, we consider a neuronal map consisting of weight vectors. 
Each incoming sample is mapped to the weight vector that is closest according to a suitable distance measure, and the weight vectors in its neighborhood are updated. The whole set of weight vectors acts as a background model, that is used for background subtraction in order to identify moving pixels. In the case of our background modeling application, we have at our disposal a fairly good means of initializing the weight vectors of the network: the first image of our sequence is indeed a good initial approximation of the background, and, therefore, for each pixel, the corresponding weight vectors are initialized with the pixel value. In order to represent each weight vector, we choose the HSV color space, relying on the hue, saturation and value properties of each color. Such color space allows us to specify colors in a way that is close to human experience of colors. Moreover, the intensity of the light is explicit and separated from chromaticity, and this allows change detection invariant to modifications of illumination strength. Let (h,s,v) be the HSV components of the generic pixel (x,y) of the first sequence frame I0, and let C=(c1,c2,...cn2) be the model for pixel (x,y). Each weight vectors ci, i=1,...,n^2, is a 3-D vector initialized as ci=(h,s,v)&quot;. [Maddalena L., Pereosino A., A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications, Image Processing, IEEE Transactions on (Volume:17 , Issue: 7 ), 2008] Set: (mathematics) an abstract collection of numbers or symbols [http://virtual.cvut.cz/ksmsaWeb/browser/title] Set: A finite, unordered collection of objects. A bag or multi-set is a special case of sets which can contain the same object several times. SharedCovarianceMatrix: The CovarianceMatrix is a $DxD$ covariance matrix -- the $\Sigma$ parameter of Normal L/Q Discriminant Models. An NLDModel has only one shared Sigma whereas an NQDModel has K Sigma matrices (one per class). This is because the NormalLinearDiscriminantAnalysisAlgorithm assumes that the covariance of the data is class-independent. SharedCovarianceMatrix: $P(\mathbf{x}|y_{i};\mathbf\Theta_{i})\sim N(\mathbf\mu_{i},\mathbf\Sigma), i.e. \mathbf\Theta_{i}=(\mathbf\mu_{i},\mathbf\Sigma)$ Also known as the Hyperbolic Tangent Kernel and as the Multilayer Perceptron (MLP) kernel. The Sigmoid Kernel comes from the Neural Networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons. k(x, y) = tanh(alpha x^T y + c) [Kavitha K., Arivazhagan S., Suriya B., Histogram Binning and Morphology based Image Classification, International Journal of Current Resarch and Academic Review, Volume 2, Number 6, 2014] Similarity based model structure is model structure grounded on the exploitation of (dis)similarity measures. A similarity function is a function that measures the similarity between two objects. Simulated annealing is a probabilistic method proposed in Kirkpatrick, Gelett and Vecchi (1983) and Cemy (1985) for finding the global minimum of a cost function that may possess several local minima. It works by emulating the physical process whereby a solid is slowly cooled so that when eventually its structure is “frozen,” this happens at a minimum energy configuration [http://www.mit.edu/~dbertsim/papers/Optimization/Simulated%20annealing.pdf] The evaluation function is used to measure the quality of a subset. Such value is then confronted with the best available value obtained, and the latter is update if appropriate. 
More specifically, the evaluation function measures the classification power of a single feature or of a subset of the features. [Data Mining & Knowledge Discovery Based on Rule Induction, Giovanni Felici] In a single link model, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). Two clusters are merged to form a larger cluster based on minimum distance criteria. SplineModel: linear combination of low-degree (linear, quadratic, cubic) polynomials expressing the local dependency between the response variable and the predictor variables in the various regions of the space of the predictor variables. The result is a smooth curve that may change direction many times. See D. Hand et al. "Principles of Data Mining" Ch. 6, sect. 6.3.2. [CdA, 09/02/2012] The square-error for the entire clustering containing K clusters is the sum of the within-cluster variations. The error represents deviations of the patterns from the centroids. [http://homepages.inf.ed.ac.uk/rbf/BOOKS/JAIN/Clustering_Jain_Dubes.pdf] StatisticBasedFWAlgorithm: A statistic-based feature weighting algorithm uses a sample statistic (e.g., mean, correlation coefficient) or a test statistic (e.g., chi-square statistic) to compute feature weights. StochasticSearch: Informally, stochastic search is any search that incorporates randomness. Strategy: In general, the term strategy refers to a plan of action designed to achieve a particular goal. In a computing context, the plan of action specified in a strategy is expressed as an algorithm. In DMOP, we reserve the term Algorithm for specifications that are sufficiently self-contained as to be implemented as distinct operators, and use the term Strategy to refer to a clearly identifiable plan of action that is embedded so tightly into an algorithm that it cannot be implemented as an independent operator. For instance, a learning algorithm typically embeds an optimization strategy and/or a strategy for controlling the complexity of the learned hypothesis; these are integral components that cannot be factored out without changing the way the learning algorithm operates, and are therefore classified as strategies rather than as algorithms. Features represented, e.g., as trees, where the parent node represents a generalization of its child nodes. Feature having strictly structured values. A labeled data set having structured data values (e.g., trees, lists, etc.). StructuredPredictionAlgorithm: This node is a placeholder for the family of structured prediction algorithms, and awaits contributions from experts on the subject. A structured data type is a set of data with data items organized inside a structure. Data in a structured type occupy fixed fields and may have the same or different primitive data types. "The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual." Lavrac N., Kavsek B., Flach P., Todorovski L. 2004. Subgroup Discovery with CN2-SD. J. Mach. Learn. Res. 5 (December 2004), 153-188. Subsampling is training the DM algorithm using only part of the sample data available for training. SumOfSquaredWeights: $\min_{\mathbf{w},b}\langle\mathbf{w}\cdot\mathbf{w}\rangle$; $\|\mathbf{w}\|_{2}^{2}=\langle\mathbf{w}\cdot\mathbf{w}\rangle=\sum_{i=1}^{p}w_{i}^{2}$. SumOfWeights: $\sum_{i=1}^{p}w_{i}$. L1Weights: the L1 norm of the model weights, i.e., the sum of the absolute values of the model weights. SupervisedDiscretizationAlgorithm: a discretization algorithm that takes into account the classes of the objects. [R. Butterworth, D. A. Simovici, G. S. Santos - A greedy algorithm for supervised discretization, Journal of Biomedical Informatics, 2004] A feature extraction algorithm that considers the classes of the objects. SupervisedInductionAlgorithm: An InductionAlgorithm that builds a hypothesis from a LabeledDataSet. TableFormat: the format (see DataFormat) of a data table (see DataTable) as determined by the type of data that can be assigned to its different cells. TargetFeature: in data mining, a feature/attribute/variable whose value depends on other (explanatory, independent or predictive) features. Synonyms: dependent/response feature/attribute/variable. TimeSeriesTableFormat: the format of a time series table, i.e., a DataTable each instance of which is a sequence of observations/measurements ordered in time. An algorithm characteristic that describes tolerance to highly correlated features in a data set. High tolerance to correlated features means that the algorithm is resistant to correlated features, while low tolerance means that the algorithm is vulnerable to correlated features. An algorithm characteristic that describes tolerance to missing values in a data set. High tolerance to missing values means that the algorithm is resistant to missing values, while low tolerance means that the algorithm is vulnerable to missing values. An algorithm characteristic that describes tolerance to noise in a data set. High tolerance to noise means that the algorithm is resistant to noisy data, while low tolerance means that the algorithm is vulnerable to noisy data. Also known as "Divisive Clustering Algorithm". A divisive method begins with all patterns in a single cluster and performs splitting, based on some criterion, until a stopping criterion is met. TopKRule is a DecisionRule which makes a specific decision for the top K values of the set of observed or computed values of a given cost function, and the opposite decision for all the others. The DecisionThreshold K can be any value from 1 to N-1, where N is the total number of available values. When K=1, TopKRule is known as the MaxRule. A set of data from various areas used to show/discover potential relations between the data. A Tree is a directed graph that has no graph loops [http://virtual.cvut.cz/ksmsaWeb/browser/title]. TreeBasedRuleInductionAlgorithm: an algorithm that builds a classifier by extracting decision rules from a classification tree. An algorithm characteristic that describes the tree branching factor, i.e., the number of children of each node in a tree. TreeDepth: model complexity measure that takes into account the depth of a tree model. Algorithm for pruning an initially given tree (standing for the processing model) with the goal of reducing the model complexity. Tuple: a fixed-length, ordered array of objects which can be of different, possibly structured data types. Vectors are special cases of tuples which contain only primitive (numerical or symbolic) data types. Tuple: In mathematics and computer science a tuple represents the notion of an ordered list of elements. UnivariateFSAlgorithm: algorithm that performs the feature selection task by rating each feature individually. The rating is often based on statistical test results.
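The following is a minimal, hypothetical sketch of the kind of univariate feature rating a UnivariateFSAlgorithm might perform; the correlation-based score and all names are our assumptions, not DMOP's.

    import numpy as np

    def univariate_scores(X, y):
        # Rate each feature individually by the absolute Pearson correlation with the target.
        scores = []
        for j in range(X.shape[1]):
            r = np.corrcoef(X[:, j], y)[0, 1]   # sample statistic computed for feature j alone
            scores.append(abs(r))
        return np.array(scores)

    # Usage: X is an (n_samples, n_features) array, y is the target vector;
    # keep the k highest-rated features, e.g.:
    # top_k = np.argsort(univariate_scores(X, y))[::-1][:k]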
UnlabeledDataSet: a data set whose instances are described exclusively by independent/explanatory variables/features/attributes, i.e., they contain no dependent or target features. UnlabeledDataTable: a data table whose columns are undescribed. UnsupervisedDiscretizationAlgorithm: an algorithm where the discretization takes place without any knowledge of the classes to which objects belong [R. Butterworth, D. A. Simovici, G. S. Santos - A greedy algorithm for supervised discretization, Journal of Biomedical Informatics, 2004] UnsupervisedFeatureExtractionAlgorithm: the division into projective and manifold methods is due to: C.J.C. Burges (2004). Geometric methods for feature extraction and dimensional reduction: A guided tour. In L. Rokach and O. Maimon (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Springer. UnsupervisedInductionAlgorithm: an InductionAlgorithm that builds a hypothesis from an UnlabeledDataSet. UtilityAlgorithm: in DMOP, any algorithm that is used in data mining algorithms without itself being specifically aimed at data analysis, e.g., a sorting algorithm. UtilityOperator: In DMOP, an operator which performs a certain routine task that has no impact on the quality of the induced hypothesis and is therefore of no interest in meta-learning. Executions of utility operators in DM processes are ignored by the meta-miner. See comment on CoreDMTask for the semantics of UtilityTask. ValueThresholdRule is a binary DecisionRule where a decision is taken in one direction if the observed or computed value is greater than or equal to the specified threshold. In probabilistic binary classification, for instance, one can decide to predict the positive class if the posterior class probability is greater than a given threshold, e.g., 0.25, 0.5, etc. WekaOperator is a DM operator (algorithm implementation) from the Weka [1] software suite. [1] http://www.cs.waikato.ac.nz/ml/weka/ WrapperFeatureSelectionAlgorithm: an algorithm in which the feature selection process is wrapped around a learning algorithm: feature subsets generated by a feature subset generator are used to train the learning algorithm and the performance of the learned models is estimated on a separate test set. The feature subset that yields the highest model performance is selected. Wrapper methods differ based on the following components: (1) the feature generation mechanism used (random feature sampling, genetic algorithms); (2) the learning algorithm used (several dozen candidates, based on the task); (3) the model evaluation strategy (cross-validation, hold-out). An association rule generation algorithm is an algorithm that specifies the generation of association rules from a pattern set. Database mining is motivated by the decision support problem faced by most large retail organizations. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of the transaction date and the items bought in the transaction. Successful organizations view such databases as important pieces of the marketing infrastructure. They are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies.
SubgroupDiscoveryAlgorithm: takes as input data labeled by a class label (some algorithms require binary class labels). It returns a set of subgroup descriptions, which are most commonly in the form of conjunctive classification rules. Some algorithms are heuristic while others are exhaustive. A few algorithms have been developed that work on more complex data types.
A region at which only abstract qualities can be directly located. It assumes some metrics for abstract (neither physical nor temporal) properties. Most of the subclasses under abstract-region represent ways of carving out discrete value regions for the different DM data, algorithm and hypothesis characteristics. However, certain subclasses still need to be sorted out and refined further.
Formerly known as description. A unitary endurant with no mass (non-physical), generically constantly depending on some agent, on some communication act, and indirectly on some agent participating in that act. Both descriptions (in the now current sense) and concepts are non-physical objects.
AKA 'entity'. Any individual in the DOLCE domain of discourse. The extensional coverage of DOLCE is as large as possible, since it ranges over 'possibilia', i.e., all possible individuals that can be postulated by means of DOLCE axioms. Possibilia include physical objects, substances, processes, qualities, conceptual regions, non-physical objects, collections and even arbitrary sums of objects. The class 'particular' features a covering partition that includes: endurant, perdurant, quality, and abstract. There are also some subclasses defined as unions of subclasses of 'particular' for special purposes: spatio-temporal-particular (any particular except abstracts) and physical-realization (any realization of an information object, defined in the ExtendedDnS ontology).
A quality space is a topologically maximal region. The constraint of maximality cannot be given completely in OWL, but a constraint is given that creates a partition out of all quality spaces (e.g., no two quality spaces can overlap mereologically).
We distinguish between a quality (e.g., the color of a specific rose) and its value (e.g., a particular shade of red). The latter is called a quale, and describes the position of an individual quality within a certain conceptual space (called here quality space), Gardenfors (2000). So when we say that two roses have (exactly) the same color, we mean that their color qualities, which are distinct, have the same position in the color space, that is, they have the same color quale.
Dummy class for optimizing some property universes. It includes all entities that are not reifications of universals ('abstracts'), i.e., those entities that are in space-time.
An occurrence-type is stative or eventive according to whether it holds of the mereological sum of two of its instances, i.e., whether it is cumulative or not. A sitting occurrence is stative since the sum of two sittings is still a sitting occurrence.
A social object that is not assumed to internally represent a description. Since social objects are dependent on physical ones, it is not trivial to interpret the local sense in which a social object 'internally represents' a plan. See 'agentive-social-object' for some discussion.
A catch-all class for entities from the social world. It includes agentive and non-agentive socially-constructed objects: descriptions, concepts, figures, collections, information objects.
It could be equivalent to 'non-physical object', but we leave open the possibility of 'private' non-physical objects.
Maximum of all pairwise distances between objects in the two clusters (Jain, Murty and Flynn, Data Clustering: A Review, ACM Computing Surveys, vol. 31, no. 3, September 1999).
Minimum of the distances between all pairs of objects drawn from the two clusters (Jain, Murty and Flynn, Data Clustering: A Review, ACM Computing Surveys, vol. 31, no. 3, September 1999).
ArbitraryLinearBoundary: a decision boundary that can be axis-parallel or oblique. (Almost) synonymous with oblique linear or multivariate linear boundary, but includes the axis-parallel linear boundary as well.
AxisParallelLinearBoundary. Synonym: UnivariateLinearBoundary.
A baseline classifier is an extremely simple classifier used to compare the performance of more sophisticated algorithms. It is usually either a random baseline classifier that labels every instance with a random class, or a simple baseline classifier that labels every instance with the most frequent class.
BasisFunctionRadius: the minimal radius or width of the radial basis function used for the RBF network's hidden units (e.g., sigma for Gaussian basis functions).
BestFirstSearch: an instance of HeuristicBestFirstSearch that refers specifically to the (non-greedy) best-first algorithm described by Pearl (1984, p. 48). BestFirstSearch maintains a list of all nodes generated so far, and at each step selects the best node/state (in the sense of some criterion, e.g., distance to the goal state) for expansion. As illustrated in BestFirstAlgorithm, best-first search is on the global endpoint of the global-local continuum and on the tentative extreme of the tentative-irrevocable continuum. It is global because at each step it reexamines the set of all unexpanded nodes (and not just the successors of the current node); it is tentative because selecting the best node, which is not necessarily a successor of the current node, is tantamount to revoking the choice of the path from the current node. A minimal sketch follows this block.
BreadthFirstSearch: an uninformed search strategy that expands the shallowest unexpanded node first. The fringe (the set of generated but unexpanded nodes) is a first-in first-out queue. BFS is complete under the assumption that the branching factor is finite.
The crisp C4.5 algorithm is a well-known example of a tree model induction algorithm which is designed to deal with noisy problems and to reduce the effect of noise on performance.
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, originally proposed by Kass (1980). According to Ripley (1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and Messenger (1973). CHAID will "build" non-binary trees (i.e., trees where more than two branches can attach to a single root or node), based on a relatively simple algorithm that is particularly well suited for the analysis of larger datasets. Also, because the CHAID algorithm will often effectively yield many multi-way frequency tables (e.g., when classifying a categorical response variable with many categories, based on categorical predictors with many classes), it has been particularly popular in marketing research, in the context of market segmentation studies. [http://www.statsoft.com/Textbook/CHAID-Analysis]
A CN2 model is a model generated by the CN2 rule induction algorithm.
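To make the global/tentative character of best-first search concrete, here is a minimal, hypothetical Python sketch using a priority queue over all generated-but-unexpanded nodes; the toy graph and the evaluation function are illustrative assumptions, not part of the ontology.

    # Hypothetical best-first search sketch: at every step the *globally* best
    # open node is expanded, not merely a successor of the current node.
    import heapq

    def best_first_search(start, goal, successors, f):
        """successors(n) -> iterable of nodes; f(n) -> evaluation (lower is better)."""
        fringe = [(f(start), start, [start])]       # all generated, unexpanded nodes
        seen = {start}
        while fringe:
            _, node, path = heapq.heappop(fringe)   # global selection over the whole fringe
            if node == goal:
                return path
            for s in successors(node):
                if s not in seen:
                    seen.add(s)
                    heapq.heappush(fringe, (f(s), s, path + [s]))
        return None

    # Toy example: nodes are integers, the goal is 7, f is the distance to the goal.
    graph = {1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [], 5: [], 6: [], 7: []}
    print(best_first_search(1, 7, lambda n: graph[n], lambda n: abs(7 - n)))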
CategoricalFeature: a feature (attribute or variable) that takes values from a finite set of discrete, non-numeric, unordered labels. Sometimes called a nominal feature. Features of this type are also called nominal or unordered features, e.g., color.
A labeled data set having categorical, i.e., discrete, values.
ClassSpecificCovarianceMatrix: $P(\mathbf{x}|y_{i};\mathbf{\Theta}_{i})\sim N(\mathbf{\mu}_{i},\mathbf{\Sigma}_{i})$, i.e., $\mathbf{\Theta}_{i}=(\mathbf{\mu}_{i},\mathbf{\Sigma}_{i})$.
A data mining (DM) model that serves for predicting class value(s).
A data mining (DM) model that serves for discovering groups (clusters).
ComplementNaiveBayesModel's ModelParameter is a k-list of vectors $\theta_c$, c = 1 to k (number of classes), where $\theta_{ci}$ is the probability of word/feature i NOT in class c, with i = 1 to p (number of words/boolean features). We can therefore represent this model parameter as a $k \times p$ matrix whose rows are classes and columns are words/boolean features. We call this a ComplementCCProbMatrix.
ConditionalFeatureIndependenceAssumption is the assumption of class-conditional feature independence, that is, that the features are independent given the class: $x_1 \bot x_2 \bot \dots \bot x_p \equiv P(x_1, \dots, x_p|c)=\prod_i P(x_i|c)$. A small numerical sketch of this assumption follows this block.
ContinuousFeature: a feature/attribute/variable that takes its values from a range of real numbers.
A labeled data set having continuous data values.
D-TrainFinalModel is a dummy operator that represents the sequence of operators that is needed to train the final model in a modeling task. The goal is to train the final model based on the operators and the set of settings that achieved the highest performance during the model selection process. Training the final model to be deployed is an important step in all data mining processes, but since it simply re-executes the training process all over again on the full dataset, it has no impact on meta-learning and is therefore classified as a UtilityOperator (see the definition under that concept). In DMOP-based meta-analysis of meta-learning, what is important is this operator's output, since it is this final model that should be described in the experiment database rather than any of the models generated using resampling (hold-out, cross-validation, bootstrapping) procedures.
DM-Data: in SUMO, Data is defined as 'an item of factual information derived from measurement or research' [http://sigma.ontologyportal.org:4010/sigma/WordNet.jsp?word=data&POS=1]. In IAO, Data is an alternative term for 'data item' =def 'an information content entity that is intended to be a truthful statement about something (modulo, e.g., measurement precision or other systematic errors) and is constructed/acquired by a method which reliably tends to produce (approximately) truthful statements.' [http://purl.obolibrary.org/obo/IAO_0000027] In the context of DMOP, DM-Data is the generic term that encompasses different levels of granularity: data can be a whole dataset (one main table and possibly other tables), or only a table, or only a feature (column of a table), or only an instance (row of a table), or even a single feature-value pair.
DM hypothesis is either a DM model or a DM pattern set.
"Data mining (DM) model is a structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction. Most inductive algorithms generate models that can then be used as classifiers, as regressors, as patterns for human consumption, and/or as input to subsequent stages of the KDD process." Kohavi R., Provost F. Glossary of terms. Machine Learning 30 (2-3), 271-274, 1998. In DMOP, DM-Model is further restricted to summarize a set of data globally (as opposed to a pattern set, which may summarize a set of data only partially). DM-Model also requires a decision strategy/rule. Model: a simplified description of a complex entity or process [http://virtual.cvut.cz/ksmsaWeb/browser/title].
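To illustrate the class-conditional independence assumption above, here is a tiny, hypothetical numerical sketch; the probability values are made up for illustration.

    # Hypothetical illustration of P(x1,...,xp | c) = prod_i P(xi | c)
    # under the class-conditional feature independence assumption.
    cond_probs = {                       # toy per-feature likelihoods P(xi = value | c)
        "spam":     {"buy": 0.8, "meeting": 0.1},
        "not_spam": {"buy": 0.2, "meeting": 0.7},
    }
    priors = {"spam": 0.4, "not_spam": 0.6}   # toy class priors P(c)

    def joint_likelihood(features, c):
        p = 1.0
        for f in features:               # naive factorization over features
            p *= cond_probs[c][f]
        return p

    x = ["buy", "meeting"]
    scores = {c: priors[c] * joint_likelihood(x, c) for c in priors}
    print(scores)                        # unnormalized posteriors P(c) * P(x | c)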
DM-PatternSet: a pattern set, as opposed to a model which by definition has global coverage, is a set of local hypotheses, i.e., each applies to a limited region of the sample space.
DataFrame: a tuple of arbitrary primitive types.
DataSet: in data mining, the term data set is defined as a set of examples or instances represented according to a common schema.
DataTable: a set of data arranged in rows and columns. [http://virtual.cvut.cz/ksmsaWeb/browser/title]
DecisionList: an ordered ruleset with a default rule in the lowest-priority position. A minimal sketch of applying a decision list follows this block.
Represents the nested grouping of objects as well as the similarity levels at which groupings change. It can be broken at different levels to yield different clusterings of the data. (Jain, Murty and Flynn, Data Clustering: A Review, ACM Computing Surveys, vol. 31, no. 3, September 1999)
(Dis)similarity value assessing the point at which the dendrogram, the output of a hierarchical clustering algorithm, has to be cut.
A dependency model describes "the relationship between variables" [1]. A dependency model is "a model that describes significant dependencies (or associations) between data items or events. [...] Dependencies can be strict or probabilistic." [2] [1] Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001. [2] Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22.
DepthFirstSearch: a blind or uninformed search technique that expands the deepest unexpanded node (it recursively expands the first child node of the current node). The fringe (the set of generated but unexpanded nodes) is implemented as a LIFO queue, or stack.
A data mining (DM) model that serves for description.
DiscreteFeature: a feature/attribute/variable that takes its values from a finite set of numbers, e.g., the number of computers, age.
GenerativeModel is pairwise disjoint with DiscriminativeModel and DFModel. But DiscriminativeModel and DiscriminantFunctionModel may or may not be disjoint, because some DF models output probabilities. Those who argue for disjointness say that even if DF models output probabilities, what distinguishes them from discriminative models is their ModelStructure. If this is not a posterior probability distribution, as in logistic regression, then the model is a DF model. A potential counter-argument is that maybe it is the concept of ModelStructure that needs to be revised. The issue remains open.
A discriminant function model is built by seeking "a function f(x; θ) that maximizes some measure of separation between the classes. Such functions are termed discriminant functions." Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, 2001.
"Discriminative models model the conditional probability distribution P(Y|X) directly; they learn a direct mapping from inputs X to class label probabilities."* Lawrynowicz A., Tresp V. Introducing machine learning. In Jens Lehmann and Johanna Voelker, editors, Perspectives on Ontology Learning, Studies on the Semantic Web. AKA Heidelberg / IOS Press, 2014. *X and Y stand for random variables.
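The following is a minimal, hypothetical sketch of applying a DecisionList as defined above (ordered rules, default rule last); the rules themselves are made-up examples.

    # Hypothetical DecisionList application: rules are tried in priority order,
    # and the final (default) rule fires when no earlier rule matches.
    rules = [
        (lambda x: x["outlook"] == "sunny" and x["humidity"] > 75, "dont_play"),
        (lambda x: x["outlook"] == "rain" and x["windy"],          "dont_play"),
        (lambda x: True,                                           "play"),  # default rule
    ]

    def classify(instance):
        for condition, label in rules:
            if condition(instance):
                return label

    print(classify({"outlook": "sunny", "humidity": 80, "windy": False}))    # -> dont_play
    print(classify({"outlook": "overcast", "humidity": 60, "windy": True}))  # -> play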
Two clusters are merged when the distance between their centroids is below a pre-specified threshold.
DualFormHardMarginAndL1SVC: the dual-form cost function that corresponds to the dual form of the HardMarginAndL1SVC optimization problem. The dual form typically leads to a more efficient solution. However, contrary to the primal SVM cost function, the dual form cannot be decomposed into a LossComponent and a ComplexityComponent: $\max_{\mathbf{\alpha}}W(\mathbf{\alpha})=\sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i,j=1}^{n}y_{i}y_{j}\alpha_{i}\alpha_{j}K(\mathbf{x}_{i},\mathbf{x}_{j})$
DualFormHardMarginConstraint: $\sum_{i=1}^{n}y_{i}\alpha_{i}=0,\ \alpha_{i}\geq0,\ i=1,...,n$
$\sum_{i=1}^{n}y_{i}\alpha_{i}=0,\ C\geq\alpha_{i}\geq0,\ i=1,...,n$
DualFormL2SVCConstraint: $\sum_{i=1}^{n}y_{i}\alpha_{i}=0,\ \alpha_{i}\geq0,\ i=1,...,n$
DualFormL2SVCCostFunction: $\max_{\mathbf{\alpha}}W(\mathbf{\alpha})=\sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i,j=1}^{n}y_{i}y_{j}\alpha_{i}\alpha_{j}(K(\mathbf{x}_{i},\mathbf{x}_{j})+\frac{1}{C}\delta_{ij})$
ErrorEpsilon: the lowest value of error that justifies pursuit of training. Below this value, the optimization process is stopped.
FOLDecisionTree: a first-order logic decision tree (e.g., TILDE).
FOLRuleSet: a rule set expressed in first-order logic (e.g., FOIL).
Feature: a property of an instance (case, example, record). Synonyms: Attribute, Variable.
Gaussian kernels are examples of Radial Basis Function (RBF) kernels. They also have adjustable parameters, such as sigma, which plays a major role in the performance of the kernel. An overestimation of this parameter causes the kernel to behave almost linearly, and the higher-dimensional projection starts to lose its non-linear power. On the other hand, an underestimation of the parameter makes the kernel function lack regularization, hence making the decision boundary highly sensitive to noise in the training data. [Justice Kwame Appati, Gideon Kwadzo Gogovi, Gabriel Obed Fosu, On the Selection of Appropriate Kernel Function for SVM in Face Recognition, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 3, March 2014] A small sketch of this kernel is given after this block.
"Generative models simulate a data generating process. In an unsupervised learning problem, this would involve a model for P(X), i.e., a probabilistic model for generating the data*. In supervised learning, such as classification, one might assume that a class Y ∈ {0, 1} is generated with some probability P(Y) and the class-specific data is generated via the conditional probability P(X|Y). Models are learned for both P(Y) and P(X|Y), and Bayes rule is employed to derive P(Y|X), i.e., the class label probability for a new input X. *In the discussion on generative models, X and Y stand for random variables." Lawrynowicz A., Tresp V. Introducing machine learning. In Jens Lehmann and Johanna Voelker, editors, Perspectives on Ontology Learning, Studies on the Semantic Web. AKA Heidelberg / IOS Press, 2014.
HardMarginConstraint: $y_{i}(\langle\mathbf{w}\cdot\Phi(\mathbf{x}_{i})\rangle+b)\geq1,\ i=1,...,n$
ID3 is a model generated by the ID3 decision tree induction algorithm developed by J. Ross Quinlan (1986). ID3 adopts a greedy strategy, constructing a decision tree in a top-down manner using a recursive divide-and-conquer approach.
IO-Object is DM-Data or DM-Hypothesis.
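To make the role of sigma in the Gaussian (RBF) kernel tangible, here is a small, hypothetical NumPy sketch; the sample points and sigma values are illustrative only.

    # Hypothetical Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    import numpy as np

    def rbf_kernel(x, y, sigma):
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

    x, y = [0.0, 0.0], [1.0, 1.0]
    for sigma in (0.1, 1.0, 10.0):
        # Small sigma -> kernel values drop sharply (boundary sensitive to noise);
        # large sigma -> values stay near 1 (kernel behaves almost linearly).
        print(sigma, round(rbf_kernel(x, y, sigma), 6))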
InformationGain (aka mutual information): $IG(S,A)=H(S)-\sum_{v\in Values(A)}\frac{|S_{v}|}{|S|}H(S_{v})$. A small worked sketch follows this block.
InformationGainRatio: InformationGain normalized by the entropy of the predictive feature: $SplitInformation(S,A)=-\sum_{i=1}^{c}\frac{|S_{i}|}{|S|}\log_{2}\frac{|S_{i}|}{|S|}$, $GainRatio(S,A)=\frac{IG(S,A)}{SplitInformation(S,A)}$. For the formula of $IG(S,A)$, see InformationGain.
Instance: as used in DMOP, should be taken to denote DM-Instance. In the general sense, an object is an instance of a set or class if it is included in that set or class [http://virtual.cvut.cz/ksmsaWeb/browser/title]. In data mining, a (DM-)Instance is an instance of a dataset and is therefore synonymous with case, example or observation as used in statistics.
IntervalFeature: a feature/attribute/variable whose values are such that the distance between them can be specified, e.g., the duration of an event.
Irrevocable: an irrevocable strategy is one which never reconsiders past decisions to explore other alternatives available in a given search state/node. In short, an irrevocable strategy never backtracks over past choices.
L1HingeLoss: the L1 norm of the hinge loss, i.e., the sum of the hinge loss measured over all training cases: $\sum_{i=1}^{n}\xi_{i}$
L1HingeLossPlusL2WeightsCostFunction: the regularized sum of the L1 norm of the hinge loss and the L2 norm of the model weights: $\min_{\mathbf{\xi},\mathbf{w},b}\langle\mathbf{w}\cdot\mathbf{w}\rangle+C\sum_{i=1}^{n}\xi_{i}$
L2HingeLoss: the (squared) L2 norm of the hinge loss, i.e., the sum of squared hinge loss measured over all training cases: $\sum_{i=1}^{n}\xi_{i}^{2}$
L2HingeLossPlusL2WeightsCostFunction: the regularized sum of the L2 norm of the hinge loss and the L2 norm of the model weights: $\min_{\mathbf{\xi},\mathbf{w},b}\langle\mathbf{w}\cdot\mathbf{w}\rangle+C\sum_{i=1}^{n}\xi_{i}^{2}$
L2Weights designates the L2 norm of the model weights, i.e., the sum of squared weights: $\|\mathbf{w}\|_{2}^{2}=\langle\mathbf{w}\cdot\mathbf{w}\rangle=\sum_{i=1}^{p}w_{i}^{2}$
LabeledDataSet: a data set whose instances are pairs (X, Y), where X is a set of predictive/explanatory/independent attribute/feature/variable values and Y is a target/response/dependent value (in multitask learning, a set of such values).
LabeledDataTable: a data table whose every column has a label describing its content.
LearningRate: a value between 0 and 1 that determines the magnitude of weight adjustments at each step. A higher value leads to faster learning but also to a risk of instability.
LearningRateDecay: a boolean that indicates whether the learning rate should be decreased during the learning process.
LinearCombinationOfFunctions: a model structure representing a linear combination of functions other than linear (identity) or kernel functions.
LinearDiscriminantModel: the value of the NumberOfProbabilities complexity metric is $C\times D_{cont}+D(D_{cont}+1)/2$ if we take the joint probability distribution. The class-conditional densities $P(\mathbf{x}_{i}|y_{j};\Theta_{j})\sim N(\mathbf{\mu}_{j},\mathbf{\Sigma})$, i.e., $\Theta_{j}=(\mathbf{\mu}_{j},\mathbf{\Sigma})$, can be simplified into a series of linear functions $f_{i}(\mathbf{x})=\ln(P(\mathbf{x}|y_{i})P(y_{i}))=\mathbf{w}_{i}^{T}\mathbf{x}+w_{i0}$, where $\mathbf{w}_{i}=\mathbf{\Sigma}^{-1}\mathbf{\mu}_{i}$ and $w_{i0}=-\frac{1}{2}\mathbf{\mu}_{i}^{T}\mathbf{\Sigma}^{-1}\mathbf{\mu}_{i}+\ln P(y_{i})$.
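As a small worked illustration of the InformationGain formula above, the following hypothetical sketch computes IG for a toy categorical split; the data are made up.

    # Hypothetical information gain computation for a toy categorical split.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, feature_values):
        n = len(labels)
        ig = entropy(labels)
        for v in set(feature_values):
            subset = [y for y, x in zip(labels, feature_values) if x == v]
            ig -= (len(subset) / n) * entropy(subset)   # weighted child entropy
        return ig

    y = ["yes", "yes", "no", "no", "yes", "no"]
    outlook = ["sunny", "sunny", "rain", "rain", "overcast", "sunny"]
    print(round(information_gain(y, outlook), 4))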
Linear regression analysis is the most widely used of all statistical techniques: it is the study of linear, additive relationships between variables, usually under the assumption of independently and identically normally distributed errors. Regression models describe the relationship between a dependent variable, y, and an independent variable or variables, X. The dependent variable is also called the response variable. Independent variables are also called explanatory or predictor variables. Continuous predictor variables might be called covariates, whereas categorical predictor variables might also be referred to as factors. A minimal least-squares sketch follows this block.
LogLikelihoodOfClassPosteriors: $l(\mathbf{\Theta})=\sum_{i=1}^{N}[\sum_{j=1}^{k-1}y_{j}(\mathbf{x}_{i})\ln P(y_{j}|\mathbf{x}_{i};\mathbf{\Theta})+(1-\sum_{j=1}^{k-1}y_{j}(\mathbf{x}_{i}))\ln(1-\sum_{j=1}^{k-1}P(y_{j}|\mathbf{x}_{i};\mathbf{\Theta}))]$
LogLikelihoodOfMultivarNormalClassCondProbsClassSpecificCovariance: $\ell(\Theta)=\sum_{i=1}^{n}\sum_{j=1}^{k}y_{j}(\mathbf{x}_{i})\ln[P(y_{j})P(\mathbf{x}_{i}|y_{j};\Theta_{j})]$ where the class-conditional densities are multivariate normal with class-specific covariance matrices: $P(\mathbf{x}_{i}|y_{j};\Theta_{j})\sim N(\mathbf{\mu}_{j},\mathbf{\Sigma}_{j})$, i.e., $\Theta_{j}=(\mathbf{\mu}_{j},\mathbf{\Sigma}_{j})$. The maximum likelihood model parameters are $\mathbf{\Theta}_{ML}=\arg\max_{\mathbf{\Theta}}\,\ell(\mathbf{\Theta})$.
LogLikelihoodOfMultivarNormalClassCondProbsSharedCovariance: $\ell(\Theta)=\sum_{i=1}^{n}\sum_{j=1}^{k}y_{j}(\mathbf{x}_{i})\ln[P(y_{j})P(\mathbf{x}_{i}|y_{j};\Theta_{j})]$ where the class-conditional densities are multivariate normal with a shared covariance matrix: $P(\mathbf{x}_{i}|y_{j};\Theta_{j})\sim N(\mathbf{\mu}_{j},\mathbf{\Sigma})$, i.e., $\Theta_{j}=(\mathbf{\mu}_{j},\mathbf{\Sigma})$. The maximum likelihood model parameters are $\mathbf{\Theta}_{ML}=\arg\max_{\mathbf{\Theta}}\,\ell(\mathbf{\Theta})$.
LogLikelihoodOfNaiveClassCondProbs: the (negative of the) loss function based on class-conditional probabilities where the assumption of class-conditional feature independence has been integrated, i.e., $P(\mathbf{x}|y)=\prod_{m=1}^{d}P(x_{m}|y;\mathbf{\theta})$: $l(\mathbf{\theta})=\sum_{i=1}^{N}[\sum_{j=1}^{k-1}y_{j}(\mathbf{x}_{i})\ln\frac{P(\mathbf{x}_{i}|y_{j};\mathbf{\theta})P(y_{j})}{P(\mathbf{x}_{i};\mathbf{\theta})}+(1-\sum_{j=1}^{k-1}y_{j}(\mathbf{x}_{i}))\ln(1-\sum_{j=1}^{k-1}\frac{P(\mathbf{x}_{i}|y_{j};\mathbf{\theta})P(y_{j})}{P(\mathbf{x}_{i};\mathbf{\theta})})]$
Matrix: a tuple of vectors.
MaxAPosteriori is a specific instance of MaxRule where the maximum value is selected from a set of posterior class probabilities.
WILSON, D. R. AND MARTINEZ, T. R. 1997. Improved heterogeneous distance functions. J. Artif. Intell. Res. 6, 1-34.
DIDAY, E. AND SIMON, J. C. 1976. Clustering analysis. In Digital Pattern Recognition, K. S. Fu, Ed. Springer-Verlag, Secaucus, NJ, 47-94.
Used when a pattern/object is described by different kinds of features, e.g., continuous and categorical. ICHINO, M. AND YAGUCHI, H. 1994. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. Syst. Man Cybern. 24, 698-708.
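For the linear regression entry above, here is a minimal, hypothetical ordinary-least-squares sketch using NumPy; the toy data and the choice of lstsq are illustrative.

    # Hypothetical ordinary least squares fit: y ~ X @ beta, errors assumed i.i.d. normal.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x = rng.uniform(0, 10, size=n)
    y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)   # toy data: intercept 2.0, slope 0.5

    X = np.column_stack([np.ones(n), x])                 # design matrix with intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares estimate of (intercept, slope)
    print("intercept, slope:", np.round(beta, 3))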
ModelBasedEarlyStopping uses a halt criterion based on some feature of the model being built (e.g., minimal leaf size or minimal size for a split in decision trees, where early stopping is called pre-pruning).
ModelComplexityMeasure: this concept revolves around the (free) parameters of a model, but the crucial difference is whether one measures the number or the magnitude of the model parameters. For the moment, the concept of ModelComplexityMeasure is used as the range of the object property hasComplexityComponent of CostFunction, but it may be needed elsewhere, since it is a crucial concept in DM.
Momentum: a value between 0.0 and 1.0 that indicates the fraction of the previous weight update to add to the current one. This adds inertia to the motion through weight space and smooths out oscillations.
MultinomialClassPriorAssumption: the assumption that the class prior probability follows a multinomial distribution. Under this assumption, the maximum likelihood estimate of class c is its relative frequency in the training set: $\hat{\pi}_{c}^{MLE}=\frac{N_{c}}{N}$. A small estimation sketch follows this block.
MultivariateDecisionTree: a disjunction of conjunctions of multivariate tests, resulting in oblique or non-axis-parallel splits at each node.
MultivariateSeries: a list of vectors.
GOWDA, K. C. AND KRISHNA, G. 1977. Agglomerative clustering using the concept of mutual nearest neighborhood. Pattern Recogn. 10, 105-112. MICHALSKI, R., STEPP, R. E., AND DIDAY, E. 1983. Automated construction of classifications: conceptual clustering versus numerical taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-5, 5 (Sept.), 396-409. In assessing the distance between two points/objects/patterns, takes into account the effect of surrounding or neighboring points/objects/patterns (called the context).
NaiveBayesDiscreteModel: the quantifier restriction for hasModelParameter is "exactly 1 ClassPriorVector" and "exactly 1 ClassCondProbMatrix" (and not "some") because these distribution parameters for discrete variables will surely be used after discretization.
NaiveBayesKernelModel: the model parameters ClassPriorVector and ClassCondProbMatrix are for the joint probability distribution of the discrete features, if any. Hence we use "only" and not "some" (because there will be zero such model parameters if there are no discrete variables).
NaiveBayesMultinomialModel: the ModelParameter of NaiveBayesMultinomialModel is a k-list of vectors $\theta_c$, c = 1 to k (number of classes), where $\theta_{ci}$ is the probability of word/feature i in class c, with i = 1 to p (number of words/boolean features). We can therefore represent this model parameter as a $k \times p$ matrix whose rows are classes and columns are words/boolean features. We call this a MatrixOfCCProbs.
NaiveBayesNormalModel: the model parameters ClassPriorVector and ClassCondProbMatrix are for the JPD of the discrete features, if any. Hence we use "only" and not "some" (because there will be zero such model parameters if there are no discrete variables).
Negentropy (from negative entropy) is a measure of distance to gaussianity (or normality). It is based on the definition of the differential entropy H of a random vector $\mathbf{y}=(y_{1},\dots,y_{n})^{T}$ with density $f(.)$: $H(\mathbf{y})=-\int f(\mathbf{y})\log f(\mathbf{y})\,\mathrm{d}\mathbf{y}$. Negentropy is obtained by normalizing differential entropy, i.e., by subtracting $H(\mathbf{y})$ from the differential entropy of a Gaussian random vector with the same covariance matrix as $\mathbf{y}$. Negentropy is always nonnegative, and it is zero iff $\mathbf{y}$ has a Gaussian distribution. Negentropy is used in Independent Component Analysis to measure the nongaussianity (and therefore the mutual independence) of the transformed variables (components). Ref. Hyvarinen et al. (2001). Independent Component Analysis, Wiley.
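To illustrate the MultinomialClassPriorAssumption and the MatrixOfCCProbs parameter shape described above, here is a small hypothetical estimation sketch; the toy corpus and the use of Laplace smoothing are illustrative assumptions.

    # Hypothetical estimation of multinomial naive Bayes parameters:
    # class priors as relative frequencies, and a k x p matrix of P(word | class).
    import numpy as np

    docs = [("buy cheap pills", "spam"),
            ("meeting agenda attached", "ham"),
            ("cheap cheap offer", "spam"),
            ("project meeting notes", "ham")]
    vocab = sorted({w for text, _ in docs for w in text.split()})
    classes = sorted({c for _, c in docs})

    priors = np.array([sum(c == k for _, c in docs) / len(docs) for k in classes])  # pi_c = N_c / N

    counts = np.zeros((len(classes), len(vocab)))
    for text, c in docs:
        for w in text.split():
            counts[classes.index(c), vocab.index(w)] += 1
    cc_probs = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)  # Laplace-smoothed P(w | c)

    print("classes:", classes)
    print("priors:", priors)
    print("class-conditional prob matrix shape:", cc_probs.shape)  # k x p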
"A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units." Han J., Kamber M. Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, 2006.
NormalClassCondProbabilityAssumption: the assumption that the probability distribution of the data given the class follows a multivariate Gaussian distribution: $p(\mathbf{x}|y)\sim\mathcal{N}(\mathbf{\mu},\Sigma)$.
NumHiddenLayers: the number of hidden layers in a multilayer perceptron.
NumHiddenUnits: a vector of integers giving the number of hidden units in each of the hidden layers of a multilayer perceptron. The length of the vector is the value of the parameter NumHiddenLayers.
NumOutputUnits: the number of output units in a neural network. This depends on the task (classification or regression) and, in the case of classification, on the number of classes.
NumTrainingCycles: the maximal number of training iterations (aka epochs) allowed. Training stops when this limit is reached, regardless of the magnitude of the observed error.
NumberOfWeights: a model complexity measure that corresponds to the L0 norm of the model weights.
OrdinalFeature: a feature (attribute or variable) that takes values from a finite set of discrete, non-numeric, ordered labels. It allows creating a rank order, e.g., military rank, qualitative evaluation of temperature ("cool" or "hot"), sound intensity ("quiet" or "loud").
PatternSetBasedClassificationModel: a model which is based on a pattern set. Together with a decision strategy it forms a predictive model.
PerformanceBasedEarlyStopping: uses a halt criterion based on some observation (e.g., an elbow in the LossFunction curve) that reveals performance stagnation or degradation.
PiecewiseAxisParallelLinearBoundary or PiecewiseUnivariateLinearBoundary: better known as a hyperrectangle.
PiecewiseObliqueLinearBoundary: aka PiecewiseMultivariateLinearBoundary.
Polynomial kernels are non-stationary kernels suitable for problems where all the training data are normalised. This kernel has adjustable parameters: alpha, a constant term c, and the polynomial degree d: $k(\mathbf{x},\mathbf{y})=(\alpha\,\mathbf{x}^{T}\mathbf{y}+c)^{d}$. [Justice Kwame Appati, Gideon Kwadzo Gogovi, Gabriel Obed Fosu, On the Selection of Appropriate Kernel Function for SVM in Face Recognition, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 3, March 2014]
A data mining (DM) model that serves for prediction.
A data mining (DM) model that serves for estimating probability values.
ProcessBasedEarlyStopping: uses a halt criterion relative to the learning process, such as a fixed limit on the number of iterations (e.g., the number of epochs in neural networks). A minimal training-loop sketch contrasting process-based and performance-based stopping follows this block.
PropositionalDataSet: a dataset that consists of a single data table (see DataTable).
A data mining (DM) model that serves for predicting a numerical continuous value.
RelationalDataSet: a dataset that has at least 2 data tables.
SVC-Model: a classification model produced by an SVC-Algorithm: $y=\mathbf{w}\cdot\mathbf{x}+b$ or $y=\sum_{sv}\alpha_{sv}y_{sv}K(\mathbf{x},\mathbf{x}_{sv})+b$.
Sequence: a list of nominals.
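The following hypothetical training-loop fragment contrasts the process-based and performance-based stopping criteria described above; the pseudo-training step, the patience parameter and the thresholds are illustrative assumptions, not DMOP definitions.

    # Hypothetical early-stopping loop: a process-based criterion (epoch limit)
    # combined with a performance-based criterion (no improvement for `patience` epochs).
    def train(num_training_cycles=100, patience=5):
        best_error, epochs_without_improvement = float("inf"), 0
        error = 1.0
        for epoch in range(num_training_cycles):          # process-based: fixed iteration budget
            error = max(0.2, error * 0.9)                  # stand-in for one real training epoch
            if error < best_error - 1e-4:
                best_error, epochs_without_improvement = error, 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:     # performance-based: stagnation detected
                print(f"stopped early at epoch {epoch}, error {error:.4f}")
                return
        print(f"stopped at cycle limit, error {error:.4f}")

    train()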
Series: a list of numbers.
SoftMarginConstraint: $y_{i}(\langle\mathbf{w},\Phi(\mathbf{x}_{i})\rangle+b)\geq1-\xi_{i},\ \xi_{i}\geq0,\ i=1,...,n$
Features represented, e.g., as trees, where the parent node represents a generalization of its child node.
A labeled data set having structured data values (e.g., trees, lists, etc.).
SumOfSquaredWeights: $\min_{\mathbf{w},b}\langle\mathbf{w}\cdot\mathbf{w}\rangle$, where $\|\mathbf{w}\|_{2}^{2}=\langle\mathbf{w}\cdot\mathbf{w}\rangle=\sum_{i=1}^{p}w_{i}^{2}$.
TrainingDataModel or NullModel is an ersatz model assigned to lazy learners and classified under PosteriorProbabilityDistribution in order to group them with discriminative models.
UniformCostSearch (UCS): an uninformed search strategy which consists in expanding the candidate (generated but unexpanded) node with the lowest cost of the path from the initial to the current state. If step costs are all equal, UCS produces the same solution as BreadthFirstSearch.
A UnivariateDecisionTreeOrDecisionList is an ordered disjunction of conjunctions of univariate tests. This holds only for orthogonal decision trees.
UnivariateTest: a boolean function of the form <Feature> <Rel> <Value>, where <Rel> is any of <, <=, =, >=, >.
UnweightedRuleSet: a RuleSet whose member rules are all assumed to carry unit weight.
A cluster is split when its variance is above a pre-specified threshold.
Vector: a tuple of numbers.
WeightNeighborVotes: a KNN boolean parameter indicating whether to weight the votes of the K nearest neighbors based on their similarity to the query or test instance. A minimal sketch follows this block.
WeightedRuleSet: an unordered set of rules, each of which carries an associated weight or score.
The course of events typical of the life of an object (kind).
AKA Agentive-role. A role that can only be played by agents.
A collection with only agents as members.
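To illustrate the WeightNeighborVotes parameter described above, here is a small hypothetical k-NN voting sketch; the distance-based weighting scheme and the toy points are illustrative assumptions.

    # Hypothetical k-NN vote with optional similarity weighting of neighbor votes.
    from collections import defaultdict
    import math

    def knn_predict(train, query, k=3, weight_neighbor_votes=True):
        """train: list of (point, label); query: point as a tuple of numbers."""
        dists = sorted((math.dist(p, query), label) for p, label in train)[:k]
        votes = defaultdict(float)
        for d, label in dists:
            votes[label] += 1.0 / (d + 1e-9) if weight_neighbor_votes else 1.0
        return max(votes, key=votes.get)

    train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
    print(knn_predict(train, (0.2, 0.2), k=3, weight_neighbor_votes=True))   # similarity-weighted vote
    print(knn_predict(train, (0.2, 0.2), k=3, weight_neighbor_votes=False))  # plain majority of the 3 nearest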