Contributed by: Ujwala Kokala
- Introduction
- Confusion Matrix
- Evaluation Metrics
- Other metrics
Introduction
Precision and recall are among the most confusing concepts for any machine learning engineer or data scientist. Once you have built your model, the most important question is: how good is it? Evaluating your model is therefore one of the most important tasks in a data science project, because it tells you how good your predictions are. Precision and recall are two such evaluation metrics.
Classification problems ask questions like: is an email spam or not, is a person a terrorist or not, does a person have cancer or not, is a person eligible for a loan or not? All of these examples have a Yes/No target column (binary classification). For such problems, accuracy is a natural metric to check how well the model predicts the data, but it is only reliable when the data is balanced. For imbalanced datasets, precision, recall, and F1 score are better measures of model performance. Before we dive into precision and recall, we should understand the confusion matrix.
Confusion Matrix
A confusion matrix is an NxN matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model.
A confusion matrix for binary classification captures four possible outcomes: true positive, false positive, false negative, and true negative.
- True Positive: The data points that are predicted positive and are actually positive.
- False Positive: The data points that are predicted positive but are actually negative.
- False Negative: The data points that are predicted negative but are actually positive.
- True Negative: The data points that are predicted negative and are actually negative.
A model performs well when the diagonal values of the confusion matrix (TP and TN) are high.
- A False Positive is a Type-I error, a misclassification made by our ML model. For example: predicting that a man is pregnant when he actually is not.
- A False Negative is a Type-II error, another misclassification made by our ML model. For example: predicting that a woman is not pregnant when she actually is.
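As a quick sketch, the four counts can be read off a confusion matrix with scikit-learn; the `y_true` and `y_pred` arrays below are made-up placeholders, not data from this post:

```python
# A minimal sketch of building a confusion matrix for binary labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = positive, 0 = negative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # labels predicted by the model

# With labels=[0, 1], scikit-learn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```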
Evaluation Metrics
Accuracy: Accuracy is simply the ratio of correctly predicted observations to total observations: (TP + TN) / (TP + FP + TN + FN).
Consider a test data set consisting of 100 people, out of which 60 are not pregnant (Negative) and 40 are pregnant (Positive). Out of 40 pregnant women, 30 women are classified correctly and the remaining 10 are misclassified as not pregnant by a machine learning algorithm. On the other hand, out of 60 people who are not pregnant, 55 are classified as not pregnant and the remaining 5 are classified as pregnant.
In this case, TP = 30, FP = 5, TN = 55, FN = 10.
Accuracy is the number of correctly classified data instances over the total number of data instances. In the example above, accuracy = (30 + 55) / 100 = 85%.
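For reference, here is the same accuracy computation as a tiny Python snippet using the counts from the pregnancy example:

```python
# Accuracy for the pregnancy example, computed directly from the counts.
tp, fp, tn, fn = 30, 5, 55, 10

accuracy = (tp + tn) / (tp + fp + tn + fn)   # correct predictions / all predictions
print(accuracy)   # 0.85, i.e. 85%
```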
Accuracy may not be a good measure if the dataset is imbalanced. Consider a scenario where 90 people are healthy (negative) and 10 people are unhealthy (positive), and our machine learning model predicts all 100 people, healthy and unhealthy alike, as healthy. What happens in this scenario? Let’s find the confusion matrix and accuracy.
In this case, TP = 0, FP = 0, TN = 90, FN = 10.
The accuracy is (0 + 90) / 100 = 90%, yet the model labeled all 10 unhealthy people as healthy. Accuracy is therefore not a good metric for an imbalanced dataset.
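The same point can be reproduced with scikit-learn; the arrays below simply encode the 90-healthy / 10-unhealthy scenario described above:

```python
# A model that labels everyone "healthy" (0) on an imbalanced dataset:
# accuracy looks good (0.90) while recall for the unhealthy class is 0.
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 90 + [1] * 10   # 90 healthy (negative), 10 unhealthy (positive)
y_pred = [0] * 100             # the model predicts "healthy" for everyone

print(accuracy_score(y_true, y_pred))  # 0.9
print(recall_score(y_true, y_pred))    # 0.0, every sick person is missed
```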
Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations: Precision = TP / (TP + FP).
Precision is a good measure when the cost of a False Positive is high, for instance in email spam detection. There, a false positive means a non-spam email (actual negative) has been identified as spam (predicted positive); users might lose important emails if the spam detection model’s precision is not high.
For a good classifier, precision should be close to 1, which means false positives should be as low as possible.
From the pregnancy example, precision = 30 / (30 + 5) = 0.857.
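The same number as a one-line sketch in Python:

```python
# Precision for the pregnancy example: TP / (TP + FP)
tp, fp = 30, 5
precision = tp / (tp + fp)
print(round(precision, 3))   # 0.857
```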
Recall: Recall is the ratio of correctly predicted positive observations to the total actual positive observations: Recall = TP / (TP + FN). Recall is also called sensitivity or the true positive rate (TPR).
Recall matters most when the cost of a False Negative is high, for instance in fraud detection or sick-patient detection. If a fraudulent transaction (Actual Positive) is predicted as non-fraudulent (Predicted Negative), the consequences can be very bad for the bank.
Similarly, in sick-patient detection, if a sick patient (Actual Positive) goes through the test and is predicted as not sick (Predicted Negative), the cost of that False Negative is extremely high, especially if the sickness is contagious.
For a good classifier, recall should be close to 1, which means false negatives should be as low as possible. From the pregnancy example, recall = 30 / (30 + 10) = 0.75.
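Again as a small sketch in Python:

```python
# Recall for the pregnancy example: TP / (TP + FN)
tp, fn = 30, 10
recall = tp / (tp + fn)
print(recall)   # 0.75
```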
F1-Score: The F1-score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).
From the pregnancy example, F1 = 2 × (0.857 × 0.75) / (0.857 + 0.75) ≈ 0.80.
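And the corresponding computation in Python, using the exact fractions from the example:

```python
# F1 for the pregnancy example: the harmonic mean of precision and recall.
precision, recall = 30 / 35, 30 / 40
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))   # 0.8
```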
Other metrics
- TPR / true positive rate / sensitivity / recall = TP / total actual positives = TP / (TP + FN)
- TNR / true negative rate / specificity = TN / total actual negatives = TN / (TN + FP)
- FNR / false negative rate = 1 − TPR = FN / total actual positives = FN / (FN + TP)
- FPR / false positive rate = 1 − TNR = FP / total actual negatives = FP / (FP + TN)
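Here is a small sketch computing all four rates from the pregnancy example’s confusion matrix:

```python
# The four rates above, from the pregnancy example (TP=30, FP=5, TN=55, FN=10).
tp, fp, tn, fn = 30, 5, 55, 10

tpr = tp / (tp + fn)   # true positive rate / sensitivity / recall
tnr = tn / (tn + fp)   # true negative rate / specificity
fnr = fn / (fn + tp)   # false negative rate = 1 - TPR
fpr = fp / (fp + tn)   # false positive rate = 1 - TNR
print(tpr, tnr, fnr, fpr)   # 0.75, ~0.917, 0.25, ~0.083
```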
ROC-AUC: Receiver Operating Characteristic curve and Area Under the Curve.
The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems. It plots the TPR against the FPR at various threshold values, essentially showing how well the model separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve.
A figure comparing several ROC curves typically illustrates the range of possible shapes. The red and green curves represent two extreme scenarios: the random line in red is the expected ROC curve when the diagnostic variable has no predictive power, while perfectly separable observations produce the ROC curve made of one vertical and one horizontal line shown in green. The other curves are the results of typical practical data; the more a curve shifts toward the northwest (top-left) corner, the better the predictive power.
The best model has an AUC close to 1, which means it separates the classes well; the green curve has AUC = 1. A poor model has an AUC near 0, which means it has the worst possible separability: in effect it inverts the results, predicting 0s as 1s and 1s as 0s. An AUC of 0.5 means the model has no class-separation capacity; the random classifier, i.e. the red line, has AUC = 0.5.
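Below is a minimal sketch of computing a ROC curve and its AUC with scikit-learn; the synthetic dataset and the LogisticRegression model are placeholders, not part of the original post:

```python
# Plot a ROC curve and compute AUC for a binary classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Synthetic, imbalanced binary classification data (placeholder only).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]         # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)   # TPR vs FPR at each threshold
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "r--", label="random classifier (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```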
This brings us to the end of the blog on Precision Formula. To learn more you can enrol with Artificial Intelligence Courses and upskill today!