What is Data Mining?
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. The relationships and summaries derived through a data mining exercise are often referred to as models or patterns.
How is ML different from DM?
Machine learning is the process of TRAINING an algorithm on an EXISTING dataset in order to have it discover relationships (so as to create a model/pattern/trend), and USING the result to analyze NEW data.
Most data mining is cyclical. Starting with data, mining leads to discovery, which leads to action (“deployment”), which in turn leads to new data - the cycle continues.
- Classification: involves LABELING data
- Clustering: involves GROUPING data, based on similarity
- Association: involves RELATING data
- Regression: involves FITTING data (predicting a continuous numeric outcome)
Decision Tree
1.1 the user provides a set of input (training) data, which consists of features (independent parameters) for each piece of data, AND an outcome (a ‘label’, ie. a class name)
1.2 the algorithm uses the data to build a ‘decision tree’ [with feature-based conditionals (eqvt to ‘if’ or ‘case’ statements) at each non-leaf node], leading to the outcomes (known labels) at the terminals
1.3 the user makes use of the tree by providing it new data (just the feature values) - the algorithm uses the tree to ‘classify’ the new item into one of the known outcomes (classes)
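Steps 1.1–1.3 can be sketched in code. Below is a minimal, hand-built tree (not one learned from training data) just to show the structure: each non-leaf node is a feature-based conditional, each leaf is a known class label, and classifying new data means walking the tree. The feature names and labels ("weight", "texture", "apple", etc.) are made up for illustration.

```python
def classify(tree, item):
    """Walk the tree: at each non-leaf node, test one feature and descend."""
    if isinstance(tree, str):          # leaf node: a class label (outcome)
        return tree
    feature, threshold, left, right = tree
    branch = left if item[feature] <= threshold else right
    return classify(branch, item)

# Tree equivalent to the conditionals:
#   if weight <= 150: (if texture <= 5: "lime" else: "apple")
#   else: "grapefruit"
tree = ("weight", 150,
        ("texture", 5, "lime", "apple"),
        "grapefruit")

print(classify(tree, {"weight": 120, "texture": 7}))   # apple
```

In a real decision-tree learner, step 1.2 builds this nested structure automatically from the training data; only the classification step (1.3) is shown here.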
k-Means Clustering
2.1 This algorithm creates ‘k’ “clusters” (sets, groups, aggregates..) from the input data, using some measure of closeness (items in a cluster are closer to each other than to items in any other cluster). This is an example of an unsupervised algorithm - we don’t need to provide training/sample clusters, the algorithm comes up with them on its own.
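A minimal sketch of the classic k-means loop, assuming 1-D data for brevity (the values and k=2 are made up): repeatedly assign each point to its closest center, then move each center to the mean of its cluster.

```python
import random

def kmeans(points, k, iters=20):
    centers = random.sample(points, k)          # pick k initial centers
    for _ in range(iters):
        # assignment step: each point joins its closest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # update step: each center moves to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

random.seed(0)
centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], k=2)
# the two centers converge near the two obvious groups (~1.0 and ~9.5)
```

Note that the user supplies only the data and ‘k’ - no labels - which is what makes this unsupervised.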
Support Vector Machine (SVM)
3.1 An SVM always partitions (classifies) data into TWO sets, using a ‘slicing’ hyperplane (the multi-dimensional equivalent of a line) instead of a decision tree.
3.2 The hyperplane maximizes the gap (margin) between itself and the nearest data points on either side. This minimizes the chance of mis-classifying new data.
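To make the hyperplane idea concrete: a linear SVM's hyperplane is the set of points where w·x + b = 0, and classifying a new point just means checking which side of it the point falls on (the sign of w·x + b). The weights below are made-up values standing in for what training would produce, not learned ones; training itself (finding the w, b that maximize the margin) is not shown.

```python
import math

w = [2.0, -1.0]   # hypothetical learned normal vector of the hyperplane
b = -1.0          # hypothetical learned offset

def classify(x):
    """Sign of w.x + b picks one of the TWO sides (classes)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score >= 0 else -1

# The total gap (margin) the SVM maximized during training is 2 / ||w||.
margin = 2 / math.sqrt(sum(wi * wi for wi in w))

print(classify([2.0, 1.0]))   # 2*2 - 1 - 1 = 2  -> +1
print(classify([0.0, 1.0]))   # 0 - 1 - 1 = -2   -> -1
```

This also shows why an SVM is inherently a TWO-class splitter: the sign can only go one of two ways.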
k Nearest Neighbors (kNN)
4.1 The kNN algorithm picks the ‘k’ nearest neighbors of our unclassified (new) point, considers those neighbors’ types (classes), and classifies the unlabeled point by majority vote - the new point’s type will be the type of the majority of its ‘k’ neighbors.
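The kNN step in 4.1 is short enough to write out in full: sort the labeled training points by distance to the new point, take the closest ‘k’, and let the majority label win. The sample points and labels ("red"/"blue") are made up.

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    """train: list of (point, label) pairs; returns majority label of k nearest."""
    by_dist = sorted(train, key=lambda pl: math.dist(pl[0], new_point))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]          # majority wins

train = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
         ((8, 8), "blue"), ((9, 8), "blue")]
print(knn_classify(train, (2, 2), k=3))   # red
```

Unlike the decision tree, kNN builds no model up front - it keeps the training data around and does all the work at classification time.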
Naïve Bayes
5.1 The ‘naïve’ Bayes algorithm is a probability-based, supervised classifier: given a datum with features x1,x2,x3..xn (ie. an n-dimensional point), it classifies the datum as one of 1,2,3…k classes.
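A tiny sketch of the naïve Bayes idea: pick the class that maximizes P(class) x the product of P(x_i | class), with both probabilities estimated by counting in the labeled training data (the "naïve" part is treating the features as independent). The weather-style dataset is made up, and Laplace smoothing (which avoids zero counts wiping out a class) is omitted for brevity.

```python
from collections import Counter

train = [  # (features, label) - hypothetical "should we play?" data
    (("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

def nb_classify(train, features):
    labels = Counter(label for _, label in train)
    best, best_p = None, -1.0
    for label, count in labels.items():
        p = count / len(train)                     # prior P(class)
        rows = [f for f, l in train if l == label]
        for i, value in enumerate(features):
            matches = sum(1 for f in rows if f[i] == value)
            p *= matches / len(rows)               # likelihood P(x_i | class)
        if p > best_p:
            best, best_p = label, p
    return best

print(nb_classify(train, ("sunny", "hot")))   # no
```

Because only counts are needed, training is a single pass over the data, which is why naïve Bayes is so cheap compared to the other classifiers above.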