
ClassificationKNN class

k-nearest neighbor classification

Description

A nearest-neighbor classification object, in which both the distance metric (the definition of "nearest") and the number of neighbors can be altered. The object classifies new observations using the predict method. The object contains the data used for training, so it can compute resubstitution predictions.

Construction

mdl = fitcknn(Tbl,ResponseVarName) returns a classification model based on the input variables (also known as predictors, features, or attributes) in the table Tbl and output (response) Tbl.ResponseVarName.

mdl = fitcknn(Tbl,formula) returns a classification model based on the predictor data and class labels in the table Tbl. formula is an explanatory model of the response and a subset of predictor variables in Tbl used for training.

mdl = fitcknn(Tbl,Y) returns a classification model based on the input variables (also known as predictors, features, or attributes) in the table Tbl and output (response) Y.

mdl = fitcknn(X,Y) returns a classification model based on the input variables X and output (response) Y.

mdl = fitcknn(___,Name,Value) fits a model with additional options specified by one or more name-value pair arguments, using any of the previous syntaxes. For example, you can specify the tie-breaking algorithm, distance metric, or observation weights.
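For example, the matrix syntax with name-value pair arguments might be used as follows; the data set, metric, and neighbor count here are illustrative, not a recommendation:

```matlab
% Sketch of the name-value syntax. fisheriris is a sample data set
% shipped with Statistics and Machine Learning Toolbox.
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',3,'Distance','cosine');
```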

Input Arguments


Sample data used to train the model, specified as a table. Each row of Tbl corresponds to one observation, and each column corresponds to one predictor variable. Optionally, Tbl can contain one additional column for the response variable. Multi-column variables and cell arrays other than cell arrays of character vectors are not allowed.

If Tbl contains the response variable, and you want to use all remaining variables in Tbl as predictors, then specify the response variable using ResponseVarName.

If Tbl contains the response variable, and you want to use only a subset of the remaining variables in Tbl as predictors, then specify a formula using formula.

If Tbl does not contain the response variable, then specify a response variable using Y. The length of response variable and the number of rows of Tbl must be equal.

Data Types: table

Response variable name, specified as the name of a variable in Tbl.

You must specify ResponseVarName as a character vector. For example, if the response variable Y is stored as Tbl.Y, then specify it as 'Y'. Otherwise, the software treats all columns of Tbl, including Y, as predictors when training the model.

The response variable must be a categorical or character array, logical or numeric vector, or cell array of character vectors. If Y is a character array, then each row of the array must correspond to one class label.

It is good practice to specify the order of the classes using the ClassNames name-value pair argument.

Data Types: char

Explanatory model of the response and a subset of the predictor variables, specified as a character vector in the form of 'Y~X1+X2+X3'. In this form, Y represents the response variable, and X1, X2, and X3 represent the predictor variables. The variables must be variable names in Tbl (Tbl.Properties.VariableNames).

To specify a subset of variables in Tbl as predictors for training the model, use a formula. If you specify a formula, then the software does not use any variables in Tbl that do not appear in formula.

Data Types: char
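As a sketch of the formula syntax, a model built from the hospital sample data set (converted to a table) might use only two of the available variables as predictors; the choice of response and predictors here is illustrative:

```matlab
% Illustrative use of a formula to select a subset of predictors.
load hospital                       % hospital ships as a dataset array
tbl = dataset2table(hospital);      % convert to a table for fitcknn
mdl = fitcknn(tbl,'Smoker ~ Age + Weight');
```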

Class labels, specified as a categorical or character array, logical or numeric vector, or cell array of character vectors. Each row of Y represents the classification of the corresponding row of X.

The software considers NaN, '' (empty character vector), and <undefined> values in Y to be missing values. Consequently, the software does not train using observations with a missing response.

Data Types: single | double | logical | char | cell

Predictor data, specified as numeric matrix.

Each row corresponds to one observation (also known as an instance or example), and each column corresponds to one predictor variable (also known as a feature).

The length of Y and the number of rows of X must be equal.

To specify the names of the predictors in the order of their appearance in X, use the PredictorNames name-value pair argument.

Data Types: double | single

Properties

BreakTies

Character vector specifying the method predict uses to break ties if multiple classes have the same smallest cost. By default, ties occur when multiple classes have the same number of nearest points among the K nearest neighbors.

  • 'nearest' — Use the class with the nearest neighbor among tied groups.

  • 'random' — Use a random tiebreaker among tied groups.

  • 'smallest' — Use the smallest index among tied groups.

'BreakTies' applies when 'IncludeTies' is false.

Change BreakTies using dot notation: mdl.BreakTies = newBreakTies.
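A minimal sketch of changing the tie-breaking rule on a trained model; the data set and settings are illustrative:

```matlab
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',4);  % an even K makes ties more likely
mdl.BreakTies = 'random';                      % break cost ties at random
```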

CategoricalPredictors

Specification of which predictors are categorical.

  • 'all' — All predictors are categorical.

  • [] — No predictors are categorical.

ClassNames

List of elements in the training data Y with duplicates removed. ClassNames can be a numeric vector, vector of categorical variables, logical vector, character array, or cell array of character vectors. ClassNames has the same data type as the data in the argument Y.

Change ClassNames using dot notation: mdl.ClassNames = newClassNames.

Cost

Square matrix, where Cost(i,j) is the cost of classifying a point into class j if its true class is i (i.e., the rows correspond to the true class and the columns correspond to the predicted class). The order of the rows and columns of Cost corresponds to the order of the classes in ClassNames. The number of rows and columns in Cost is the number of unique classes in the response.

Change a Cost matrix using dot notation: obj.Cost = costMatrix.
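For instance, an asymmetric cost matrix could be assigned after training; the cost values below are illustrative:

```matlab
load fisheriris
mdl = fitcknn(meas,species);
% Rows are true classes and columns are predicted classes, ordered as in
% mdl.ClassNames; here, misclassifying virginica is penalized twice as heavily.
mdl.Cost = [0 1 1; 1 0 1; 2 2 0];
```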

Distance

Character vector or function handle specifying the distance metric. The allowable character vectors depend on the NSMethod parameter, which you set in fitcknn, and which exists as a field in ModelParameters.

NSMethod        Distance Metric Names
exhaustive      Any distance metric of ExhaustiveSearcher
kdtree          'cityblock', 'chebychev', 'euclidean', or 'minkowski'

For definitions, see Distance Metrics.

The distance metrics of ExhaustiveSearcher:

Value           Description
'cityblock'     City block distance.
'chebychev'     Chebychev distance (maximum coordinate difference).
'correlation'   One minus the sample linear correlation between observations (treated as sequences of values).
'cosine'        One minus the cosine of the included angle between observations (treated as vectors).
'euclidean'     Euclidean distance.
'hamming'       Hamming distance, the percentage of coordinates that differ.
'jaccard'       One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ.
'mahalanobis'   Mahalanobis distance, computed using a positive definite covariance matrix C. The default value of C is the sample covariance matrix of X, as computed by nancov(X). To specify a different value for C, use the 'Cov' name-value pair.
'minkowski'     Minkowski distance. The default exponent is 2. To specify a different exponent, use the 'P' name-value pair.
'seuclidean'    Standardized Euclidean distance. Each coordinate difference between X and a query point is scaled, that is, divided by a scale value S. The default value of S is the standard deviation computed from X, S = nanstd(X). To specify another value for S, use the 'Scale' name-value pair.
'spearman'      One minus the sample Spearman's rank correlation between observations (treated as sequences of values).
@distfun        Distance function handle. distfun has the form

function D2 = distfun(ZI,ZJ)
% calculation of distance
...

where

  • ZI is a 1-by-N vector containing one row of X or Y.

  • ZJ is an M2-by-N matrix containing multiple rows of X or Y.

  • D2 is an M2-by-1 vector of distances, and D2(k) is the distance between observations ZI and ZJ(k,:).

Change Distance using dot notation: mdl.Distance = newDistance.

If NSMethod is kdtree, you can use dot notation to change Distance only among the types 'cityblock', 'chebychev', 'euclidean', or 'minkowski'.
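As a sketch, a custom distance function handle can be supplied at training time; the per-coordinate weights below are illustrative:

```matlab
% Weighted Euclidean distance; ZI is 1-by-N, ZJ is M2-by-N,
% and the result is an M2-by-1 vector of distances.
w = [1 1 2 2];                                   % illustrative weights
wdist = @(ZI,ZJ) sqrt(sum(bsxfun(@times, ...
    bsxfun(@minus,ZI,ZJ).^2, w), 2));
load fisheriris
mdl = fitcknn(meas,species,'Distance',wdist);
```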

DistanceWeight

Character vector or function handle specifying the distance weighting function.

DistanceWeight      Meaning
'equal'             No weighting
'inverse'           Weight is 1/distance
'squaredinverse'    Weight is 1/distance^2
@fcn                fcn is a function that accepts a matrix of nonnegative distances and returns a matrix of the same size containing nonnegative distance weights. For example, 'squaredinverse' is equivalent to @(d)d.^(-2).

Change DistanceWeight using dot notation: mdl.DistanceWeight = newDistanceWeight.
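A minimal sketch of switching to distance-weighted voting on a trained model; the data set and neighbor count are illustrative:

```matlab
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',7);
mdl.DistanceWeight = 'inverse';    % closer neighbors count more
% A roughly equivalent custom handle: mdl.DistanceWeight = @(d) 1./d;
```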

DistParameter

Additional parameter for the distance metric.

Distance Metric   Parameter
'mahalanobis'     Positive definite covariance matrix C.
'minkowski'       Minkowski distance exponent, a positive scalar.
'seuclidean'      Vector of positive scale values with length equal to the number of columns of X.

For values of the distance metric other than those in the table, DistParameter must be [].

You can alter DistParameter using dot notation: mdl.DistParameter = newDistParameter. However, if Distance is 'mahalanobis' or 'seuclidean', then you cannot alter DistParameter.

ExpandedPredictorNames

Expanded predictor names, stored as a cell array of character vectors.

If the model uses encoding for categorical variables, then ExpandedPredictorNames includes the names that describe the expanded variables. Otherwise, ExpandedPredictorNames is the same as PredictorNames.

HyperparameterOptimizationResults

Description of the cross-validation optimization of hyperparameters, stored as a BayesianOptimization object or a table of hyperparameters and associated values. Nonempty when the OptimizeHyperparameters name-value pair is nonempty at creation. Value depends on the setting of the HyperparameterOptimizationOptions name-value pair at creation:

  • 'bayesopt' (default) — Object of class BayesianOptimization

  • 'gridsearch' or 'randomsearch' — Table of hyperparameters used, observed objective function values (cross-validation loss), and rank of observations from lowest (best) to highest (worst)

IncludeTies

Logical value indicating whether predict includes all the neighbors whose distance values are equal to the Kth smallest distance. If IncludeTies is true, predict includes all these neighbors. Otherwise, predict uses exactly K neighbors (see 'BreakTies').

Change IncludeTies using dot notation: mdl.IncludeTies = newIncludeTies.

ModelParameters

Parameters used in training mdl.

Mu

Numeric vector of predictor means with length numel(PredictorNames).

If you did not standardize mdl when you trained it using fitcknn, then Mu is empty ([]).

NumNeighbors

Positive integer specifying the number of nearest neighbors in X to find for classifying each point when predicting. Change NumNeighbors using dot notation: mdl.NumNeighbors = newNumNeighbors.

NumObservations

Number of observations used in training mdl. This can be less than the number of rows in the training data, because data rows containing NaN values are not part of the fit.

PredictorNames

Cell array of names for the predictor variables, in the order in which they appear in the training data X. Change PredictorNames using dot notation: mdl.PredictorNames = newPredictorNames.

Prior

Numeric vector of prior probabilities for each class. The order of the elements of Prior corresponds to the order of the classes in ClassNames.

Add or change a Prior vector using dot notation: obj.Prior = priorVector.

ResponseName

Character vector describing the response variable Y. Change ResponseName using dot notation: mdl.ResponseName = newResponseName.

Sigma

Numeric vector of predictor standard deviations with length numel(PredictorNames).

If you did not standardize mdl when you trained it using fitcknn, then Sigma is empty ([]).

W

Numeric vector of nonnegative weights with the same number of rows as Y. Each entry in W specifies the relative importance of the corresponding observation in Y.

X

Numeric matrix of unstandardized predictor values. Each column of X represents one predictor (variable), and each row represents one observation.

Y

A numeric vector, vector of categorical variables, logical vector, character array, or cell array of character vectors, with the same number of rows as X.

Y is of the same type as the passed-in Y data.

Methods

compareHoldout    Compare accuracies of two models using new data
crossval          Cross-validated k-nearest neighbor classifier
edge              Edge of k-nearest neighbor classifier
loss              Loss of k-nearest neighbor classifier
margin            Margin of k-nearest neighbor classifier
predict           Predict labels using k-nearest neighbor classification model
resubEdge         Edge of k-nearest neighbor classifier by resubstitution
resubLoss         Loss of k-nearest neighbor classifier by resubstitution
resubMargin       Margin of k-nearest neighbor classifier by resubstitution
resubPredict      Predict resubstitution response of k-nearest neighbor classifier

Definitions

Prediction

ClassificationKNN predicts the classification of a point Xnew using a procedure equivalent to this:

  1. Find the NumNeighbors points in the training set X that are nearest to Xnew.

  2. Find the NumNeighbors response values Y of those nearest points.

  3. Assign the classification label Ynew that has smallest expected misclassification cost among the values in Y.

For details, see Posterior Probability and Expected Cost in the predict documentation.
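This procedure can be exercised through the predict method; the query measurements below are illustrative:

```matlab
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
label = predict(mdl,[5.9 3.0 5.1 1.8])   % classify one new observation
```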

Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB® documentation.

Examples


Construct a k-nearest neighbor classifier for Fisher's iris data, where k, the number of nearest neighbors in the predictors, is 5.

Load Fisher's iris data.

load fisheriris
X = meas;
Y = species;

X is a numeric matrix that contains four measurements (sepal length and width, and petal length and width) for 150 irises. Y is a cell array of character vectors that contains the corresponding iris species.

Train a 5-nearest neighbors classifier. It is good practice to standardize noncategorical predictor data.

Mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1)
Mdl = 

  ClassificationKNN
             ResponseName: 'Y'
    CategoricalPredictors: []
               ClassNames: {'setosa'  'versicolor'  'virginica'}
           ScoreTransform: 'none'
          NumObservations: 150
                 Distance: 'euclidean'
             NumNeighbors: 5


Mdl is a trained ClassificationKNN classifier, and some of its properties display in the Command Window.

To access the properties of Mdl, use dot notation.

Mdl.ClassNames
Mdl.Prior
ans =

  3×1 cell array

    'setosa'
    'versicolor'
    'virginica'


ans =

    0.3333    0.3333    0.3333

Mdl.Prior contains the class prior probabilities, which are settable using the name-value pair argument 'Prior' in fitcknn. The order of the class prior probabilities corresponds to the order of the classes in Mdl.ClassNames. By default, the prior probabilities are the respective relative frequencies of the classes in the data.

You can also reset the prior probabilities after training. For example, set the prior probabilities to 0.5, 0.2, and 0.3 respectively.

Mdl.Prior = [0.5 0.2 0.3];

You can pass Mdl to, for example, ClassificationKNN.predict to label new measurements, or ClassificationKNN.crossval to cross validate the classifier.
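Continuing this example, a minimal cross-validation sketch might look like the following:

```matlab
cvMdl = crossval(Mdl);   % 10-fold cross-validation by default
kfoldLoss(cvMdl)         % estimated out-of-fold misclassification rate
```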

Related Examples

Alternatives

knnsearch finds the k-nearest neighbors of points. rangesearch finds all the points within a fixed distance. You can use these functions for classification, as shown in Classify Query Data. If you want to perform classification, ClassificationKNN can be more convenient, in that you can construct a classifier in one step and classify new data in subsequent steps. Also, ClassificationKNN has cross-validation options.

Tips

The compact function reduces the size of most classification models by removing the training data properties, and any other properties that are not required to predict the label of new observations. Because kNN classification models require all of the training data to predict labels, you cannot reduce the size of a ClassificationKNN model.

