Evaluating the accuracy of a classifier or predictor


Classification algorithms use one or more quantitative variables to predict the value of another categorical variable; in other words, a classifier predicts class labels for new observations without known class labels. In the previous chapter we learned how to build a \(K\)-nearest neighbors classifier and how to preprocess data; this chapter focuses on how to evaluate the accuracy of a classifier and how to improve it. By the end of the chapter, you will be able to describe what a random seed is and its importance in reproducible data analysis, split data into training, validation, and test data sets, and evaluate classification accuracy in R using a validation data set and appropriate metrics.

The central problem is that we want a classifier that makes accurate predictions on data it hasn't seen yet. The trick is to split the data into a training set and a test set (Figure 6.1), build the classifier using only the training set, and then evaluate it on the test set; Figure 6.2 shows the process for splitting the data and finding the prediction accuracy. If we set aside the true labels for the observations in the test set and our classifier recovers them well, then we have some confidence that it might also accurately predict the class labels of genuinely new observations. For this to work, our test data must not influence any aspect of model training; if the classifier gets to see the test data in advance, it will look more accurate than it really is.

Building a classifier also involves choices: how many neighbors \(K\) should we use, and which predictor variables should we include? You could use anything from a single predictor variable to every variable in your data set, and different choices would produce multiple different classifiers. Is it possible to make these selections in a principled way? We will answer that question later in the chapter.

Accuracy, the proportion of observations for which the classifier made the correct prediction, is the most basic way to summarize performance, but it needs a point of comparison. A useful baseline for a classification problem is the majority classifier, which always guesses the majority class label from the training data, regardless of the predictor variables' values. As an example, recall the breast cancer data (originally described in Nuclear Feature Extraction for Breast Tumor Diagnosis): roughly 63% of the observations are from the benign class (B), and 37% are from the malignant class (M). In this case, we would suspect that the majority classifier will have an accuracy of around 63%. Be careful, though: improving on the majority classifier does not necessarily mean the classifier is good enough, and raw accuracy may not be informative when classes are imbalanced. It also pays to think about how the classifier will be used in practice and about the consequences of a misprediction in particular cases of interest. Once the imaging procedure is performed on a new patient's tumor, the resulting image will be analyzed and our classifier will make a prediction, and mispredictions carry serious consequences, with unnecessary, damaging chemo/radiation therapy or patient death as potential outcomes. Predicting that a malignant tumor is benign could mean the patient does not receive appropriate medical attention; predicting that a benign tumor is malignant is less serious in comparison, as the patient will then likely see a doctor who can provide an expert diagnosis. In a setting like this, even 99% accuracy is probably not good enough.
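As a concrete illustration of the majority-classifier baseline, the short sketch below computes the class proportions. It assumes the breast cancer data has already been read into a data frame named cancer with a Class column (the object name is an assumption for illustration); the largest proportion is exactly the accuracy the majority classifier would achieve.

```r
# A minimal sketch of the majority-classifier baseline. Assumes the breast
# cancer data is already loaded in a data frame `cancer` with a `Class` column.
library(tidyverse)

cancer |>
  group_by(Class) |>
  summarize(n = n()) |>
  mutate(proportion = n / sum(n))

# The largest proportion (about 0.63 for the benign class in this data) is the
# accuracy the majority classifier would achieve.
```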

We use randomness any time we need to make a decision in our analysis that needs to be fair, unbiased, and not influenced by human input: splitting data into training and test sets, picking groupings of data for cross-validation, and more. R's random numbers are not truly random; they are produced by a generator that is completely determined by a seed value. To make an analysis reproducible, we call set.seed and pass it an integer that you pick, exactly once at the beginning of the analysis. As long as you pick the same argument value, your result will be the same; different argument values in set.seed lead to different patterns of randomness, and a different seed will give you a different sequence of random numbers. If you use set.seed many times throughout your analysis, the randomness that R uses will not look as random as it should. For example, the sample function appears to give a random result, but once the seed is set its output is completely determined, so the analysis looks random but is actually totally reproducible. In the remainder of the textbook, we will set the seed once at the beginning of each chapter. In summary: if you want your analysis to be reproducible, i.e., produce the same result each time you run it, make sure to use set.seed exactly once at the beginning of the analysis.

With the seed set, we let R randomly split the data into a training set and a test set; after preliminary exploration, this is the very next thing to do, so that the test data does not influence any aspect of our model training. The training set is typically 50% to 95% of the data, while the test set is the remaining 5% to 50%; here we will use 75% of the data for training and 25% for testing. The initial_split function does this split for us, and it applies two very important steps to ensure that the accuracy estimates from the test data are reasonable. First, it shuffles the data before splitting, which ensures that any ordering present in the data does not affect which observations end up in the training and test sets. Second, it stratifies the data by the class label, to ensure that roughly the same proportion of each class ends up in both the training and testing sets. The training and testing functions then extract the training and testing data sets into two separate data frames. Note that the initial_split function uses randomness, but since we set the seed, the split is reproducible. Finally, as the ID variable is not related to any measured property of the cells, it should not be used as a predictor.
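Below is a minimal sketch of the splitting step with the tidymodels initial_split, training, and testing functions; the data frame name and the seed value are assumptions for illustration.

```r
# A minimal sketch of the train/test split; the data frame name and seed value
# are assumptions for illustration.
library(tidymodels)

set.seed(1)  # set the seed exactly once, at the beginning of the analysis

cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)

# Check that the class proportions were roughly preserved by stratification.
cancer_train |>
  group_by(Class) |>
  summarize(proportion = n() / nrow(cancer_train))
```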

We can now create our \(K\)-nearest neighbors classifier with only the training set. As before, we need to create a model specification, combine it with a recipe that centers and scales the predictors, and fit the resulting workflow; crucially, our test data does not influence any aspect of our model training. Now that we have a \(K\)-nearest neighbors classifier object, we can use it to predict the class labels for our test set. We use the bind_cols function to add the column of predictions to the original test data: the Class column contains the true diagnoses, while the .pred_class column contains the predicted diagnoses from the classifier. Then, to evaluate the accuracy of the classifier, we compare the predictions to the true labels that we set aside from the test set. To do this we use the metrics function, which works for binary (two-class) and multi-class (more than 2 classes) classification problems; focus on the accuracy entry, and you may ignore the other columns in the metrics data frame, as the other entries involve more advanced metrics. We also examine the confusion matrix, which shows how many observations of each true class were predicted into each class, and therefore tells us not only the overall accuracy but also which kinds of mispredictions the classifier made.
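The sketch below puts these steps together, assuming the training and test sets created above; the choice of predictors (Smoothness and Concavity) and of K = 3 are placeholders for illustration, not tuned values.

```r
# A sketch of fitting on the training set and evaluating on the test set.
# The predictors (Smoothness, Concavity) and K = 3 are placeholders here,
# not tuned values.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  fit(data = cancer_train)

# Add the column of predictions to the original test data, then evaluate.
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
  bind_cols(cancer_test)

cancer_test_predictions |> metrics(truth = Class, estimate = .pred_class)
cancer_test_predictions |> conf_mat(truth = Class, estimate = .pred_class)
```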

The vast majority of predictive models in statistics and machine learning have parameters: numbers you have to pick in advance that determine some aspect of how the model behaves. For \(K\)-nearest neighbors, the parameter is the number of neighbors \(K\), and we would like to choose it so that the classifier makes accurate predictions on data not seen during training. But we cannot use our test data set in the process of building the model; if we did, the classifier would effectively see the test data in advance, making it look more accurate than it really is. Imagine how bad it would be to overestimate your classifier's accuracy and settle on a model that has a high estimated accuracy but a low true accuracy.

So instead we tune the classifier using only the training data: we split our training data itself into two subsets, train on one subset, and evaluate on the other, which we call the validation set. A single train/validation split can be misleading because of an (un)lucky validation set, so in cross-validation we split our overall training data into \(C\) evenly sized chunks and repeat the splitting procedure so that each chunk is used exactly once as the validation set while the remaining \(C-1\) chunks are combined for training, resulting in \(C\) different choices for the validation set; we call this \(C\)-fold cross-validation, and here we use 5-fold cross-validation. We then train five different \(K\)-nearest neighbors classifiers, one per split of our overall training data, evaluate each, and average the accuracies (here, about 87%) to get a single assessment of our classifier's accuracy; this has the effect of reducing the influence of any one (un)lucky split. The collect_metrics function is used to aggregate the mean and standard error across folds: you should consider the mean (mean) to be the estimated accuracy, while the standard error (std_err) is a measure of how uncertain we are in that value. A detailed treatment of standard errors is beyond the scope of this chapter; but roughly, if your estimated mean is 0.87 and the standard error is 0.02, you can expect the true accuracy to be somewhere roughly between 85% and 89%. Ideally we would use many folds, but the more folds we choose, the more computation it takes, and hence the more time it takes to run the analysis; with a large number of folds such as \(C = 50\), the analysis often takes a long time to run in practice, and you might even end up with a higher standard error when increasing the number of folds, so we usually stick to 5 or 10.
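Here is a minimal sketch of 5-fold cross-validation with tidymodels, reusing the recipe and the fixed-K model specification from the previous sketch.

```r
# A minimal sketch of 5-fold cross-validation on the training set, reusing the
# recipe and (fixed K) model specification from the previous sketch.
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

knn_resample_fit <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = cancer_vfold)

# mean is the estimated accuracy; std_err measures our uncertainty in it.
knn_resample_fit |> collect_metrics()
```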

In order to improve our classifier, we have one choice of parameter: the number of neighbors \(K\). Since cross-validation gives us accuracy estimates based on multiple splits of the training data, we can evaluate a range of candidate values and then choose a parameter value based on all of the different results: we fit the model for each value in a range of parameter values and aim to maximize the cross-validation accuracy. In tidymodels, we specify the number of neighbors as tune() in the model specification rather than giving it a particular value, create a data frame of candidate values, and pass that data frame to the grid argument of tune_grid. We can then decide which number of neighbors is best by plotting the estimated accuracy versus \(K\) (Figure 6.5). Here, setting the number of neighbors to \(K = 41\) provides the highest estimated accuracy. However, due to the randomness in how the data are split, the estimates are noisy, and many candidate values differ in classifier accuracy by only a small amount; any selection from \(K = 30\) to \(60\) would be reasonably justified. And remember: we still do not touch the test set during the tuning process.

To build a bit more intuition, what happens if we keep increasing the number of neighbors? Figure 6.6 shows the accuracy estimate for many \(K\) values. As \(K\) grows, more and more of the training observations get a say in what the class of a new observation is, and eventually the classifier is not influenced enough by the training data; it is said to underfit the data, behaving more and more like the majority classifier, and the accuracy falls off. In contrast, when we decrease the number of neighbors, each individual training observation has a much stronger say in the prediction for nearby points. Since the data themselves are noisy, this causes a more jagged boundary between the classes; the classifier is overly influenced by the training data, is said to overfit the data, and does not generalize well to new data. This is just as problematic as the large \(K\) case, because the classifier becomes unreliable. Figure 6.7 illustrates the effect of \(K\) in overfitting and underfitting. Once we have chosen \(K\), we retrain the classifier on the entire training set with that value and, finally, evaluate it on the test data set; whether the resulting accuracy is good or not depends on the application.
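A sketch of the tuning step follows; the grid of candidate \(K\) values is an assumption for illustration.

```r
# A sketch of tuning K with tune_grid; the grid of candidate values is an
# assumption for illustration.
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

knn_results <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals) |>
  collect_metrics()

# Plot the estimated accuracy versus the number of neighbors.
accuracies <- knn_results |> filter(.metric == "accuracy")

ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy estimate")
```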

The \(K\)-nearest neighbors algorithm is simple and intuitive: it classifies a new observation by finding the \(K\) training observations nearest to it and returning the majority class vote from those neighbors. But it also has weaknesses: it becomes very slow as the training data gets larger, it may not perform well with a large number of predictors, and it may not perform well when classes are imbalanced.

Another potentially important part of tuning your classifier is to pick a subset of useful variables to include as predictors. It is not the case that using more predictors always yields better predictions; irrelevant variables, those with no meaningful relationship with the Class variable, can actually negatively affect classifier performance. To demonstrate, we added irrelevant variables to the breast cancer data, creating 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors alongside the meaningful ones. (Recall that we use the glimpse function to view data with a large number of columns, as it prints the data such that the columns go down the page instead of across.) For each data set, we tune the KNN classifier and collect the accuracy for the best \(K\). Figure 6.9 shows the effect of including irrelevant predictors: as we add more irrelevant predictor variables, the estimated accuracy of our classifier decreases. This is because the irrelevant variables add a random amount to the distance between each pair of observations; as they do not provide any additional insight, they only obscure the informative predictors. Interestingly, as we add irrelevant variables, the tuned number of neighbors does not increase smoothly, but the general trend is increasing, because a larger \(K\) allows more of an averaging effect to take place, making the boundary between where our classifier predicts B versus M smoother. Figure 6.11 corroborates this evidence; if we fix the number of neighbors to \(K = 3\), the accuracy falls off more quickly as irrelevant predictors are added.

So how do we pick which predictors to use? One could use visualizations and other exploratory analyses to try to help understand which variables are potentially relevant, but this process is both time-consuming and error-prone when there are many variables to consider, and it is often not clear from exploration of the data alone which subset of them will create the best classifier. Therefore we need a more systematic and programmatic way of choosing variables. Because this is part of tuning your classifier, you cannot use your test data for it; only the training data is used. The most straightforward procedure is best subset selection: try all possible subsets of predictors, tune each candidate classifier with cross-validation, and pick the subset of predictors that gives you the highest cross-validation accuracy. Unfortunately, the number of subsets grows very quickly with the number of predictors, and you have to train the model (itself a slow process!) for each one, so best subset selection is limited by computational power to a modest number of predictors to choose from (say, around 10). Variable selection is a well-studied problem in general, and there is an extensive list of methods that have been developed; classical references include The Discarding of Variables in Multivariate Analysis (Beale, Kendall, and Mann), Stepwise Regression: a Backward and Forward Look, and Selection of the Best Subset in Regression Analysis.
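The sketch below shows one way such irrelevant predictors could be generated, as pure noise columns with no relationship to Class; only three are added here, matching the small forward selection example that follows, and the object and column names are illustrative assumptions.

```r
# A sketch of generating irrelevant predictors as pure noise columns with no
# relationship to Class. Only three are added here, matching the small forward
# selection example below; the names are illustrative.
cancer_subset <- cancer_train |>
  select(Class, Smoothness, Concavity, Perimeter) |>
  mutate(Irrelevant1 = rnorm(n()),
         Irrelevant2 = rnorm(n()),
         Irrelevant3 = rnorm(n()))

glimpse(cancer_subset)
```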
A much faster alternative to best subset selection is forward selection. It involves the following steps. Say you have \(m\) total predictors to work with. In the first iteration, you train \(m\) candidate models, each with 1 predictor, and keep the predictor whose model has the highest cross-validation accuracy. In each following iteration, you add each of the remaining predictors, one at a time, to the current best set, tune each of the resulting candidate models, and keep the addition that provides the best accuracy. You continue all the way until you run out of predictors to choose, at which point you will have trained a sequence of models with 1, 2, ..., \(m\) predictors. Finally, you select the model that provides the best trade-off between accuracy and simplicity (i.e., having fewer predictors and a lower chance of overfitting).

We now turn to implementing forward selection in R. If you recall the end of the wrangling chapter, we mentioned that sometimes one needs more flexible forms of iteration than what the tidyverse provides, and this is one of those cases: we will have to code the procedure ourselves with for loops (where you see for (i in 1:length(names)) in the sketch below); see Wickham and Grolemund's R for Data Science for more about iteration. We first extract the column names for the full set of predictor variables from the training data. Then, for each set of predictors to try, we construct a model formula by pasting the response and the selected predictor names together (the collapse argument tells paste what to put between the items in the list), tune the KNN classifier with these predictors, and collect the accuracy for the best \(K\).

To see forward selection in action, we ran it on the breast cancer training data augmented with three irrelevant variables, so that the full model formula is Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3. The forward selection procedure first added the three meaningful variables Perimeter, Concavity, and Smoothness, followed by the irrelevant variables. Interesting! The estimated accuracy increases as the meaningful predictors are added, and then actually starts to decrease as the irrelevant ones are included. So here the right trade-off of accuracy and number of predictors occurs where adding a variable no longer provides a meaningful increase in accuracy; a reasonable way to find that balance is to look for the elbow in the plot of accuracy versus the number of predictors. The elbow in Figure 6.12 appears to occur at the model with the three meaningful predictors. This means that, from the perspective of accuracy alone, a larger model may sit marginally higher on this plot, but it corresponds to a less simple model, so we would pick the three-predictor model.

Practice exercises for this chapter can be found in the accompanying worksheets repository. You can launch an interactive version of the worksheet in your browser by clicking the launch binder button. If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in the worksheets repository.
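Below is a condensed sketch of the forward selection loop. It assumes the cancer_subset data frame, knn_tune_spec model specification, and k_vals grid from the earlier sketches; the object names and exact structure are illustrative rather than a definitive implementation.

```r
# A condensed sketch of forward selection, assuming `cancer_subset`,
# `knn_tune_spec`, and `k_vals` from the sketches above.
names <- colnames(cancer_subset |> select(-Class))   # candidate predictors
selected <- c()                                       # predictors chosen so far
accuracies <- tibble(size = integer(),
                     model_string = character(),
                     accuracy = numeric())

subset_vfold <- vfold_cv(cancer_subset, v = 5, strata = Class)

for (i in 1:length(names)) {
  accs <- numeric(length(names))
  models <- character(length(names))

  for (j in 1:length(names)) {
    # create a model formula from the selected predictors plus one candidate;
    # the collapse argument tells paste what to put between the items
    preds_new <- c(selected, names[j])
    model_string <- paste("Class", "~", paste(preds_new, collapse = "+"))

    # tune the KNN classifier with these predictors,
    # and collect the accuracy for the best K
    model_recipe <- recipe(as.formula(model_string), data = cancer_subset) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())

    accs[j] <- workflow() |>
      add_recipe(model_recipe) |>
      add_model(knn_tune_spec) |>
      tune_grid(resamples = subset_vfold, grid = k_vals) |>
      collect_metrics() |>
      filter(.metric == "accuracy") |>
      summarize(best_acc = max(mean)) |>
      pull()
    models[j] <- model_string
  }

  # keep the single best new predictor found in this iteration
  best <- which.max(accs)
  accuracies <- accuracies |>
    add_row(size = i, model_string = models[best], accuracy = accs[best])
  selected <- c(selected, names[best])
  names <- names[-best]
}

accuracies
```

Plotting the accuracy column of this table against the size column gives the accuracy-versus-number-of-predictors plot used to look for the elbow.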