Lesson2_2-Machine Learning - An Algorithmic Perspective 2nd edition 2014.pdf

<aside> 💡

The Curse of Dimensionality

</aside>

<aside> 💡

Overfitting

</aside>

<aside> 💡

Training, Testing, and Validation Sets

Training Set: used to fit the model's parameters.

Validation Set: used to tune hyperparameters and compare candidate models during development.

image.png

Testing Set

Need to consider: data leakage

** Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance during training but poor generalization on unseen data. This results in models that appear highly accurate in development but fail in real-world scenarios.

*** Types of leakage?

→ Train-Test Contamination: happens when test data is accidentally included in the training set. For example, if future stock prices are included in the training data while predicting stock trends, the model learns patterns it would not have access to in real-world conditions.

→ Feature Leakage (Target Leakage): Occurs when features used for training include information that wouldn’t be available at prediction time. In a loan approval model, using a feature like "whether the loan was paid off" to predict loan default would make the model perform unrealistically well.

*** How to prevent it? Ensure proper data splitting; avoid using future information; apply cross-validation correctly, fitting any preprocessing inside each fold (see the sketch after this list).

*** Why is it serious? (1) It causes poor generalization on real-world data; (2) it can result in costly mistakes in fields like finance, healthcare, and cybersecurity.
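To make "proper data splitting" concrete, here is a minimal sketch, assuming a scikit-learn Pipeline and 5-fold cross-validation (both my choices, not from the lesson): because the scaler lives inside the pipeline, it is re-fitted on the training portion of each fold, so no statistics from the held-out fold leak into training.

```python
# Leakage-safe preprocessing: the scaler is fitted inside each CV fold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Leaky version (do NOT do this): scaling the full X before splitting lets
# statistics of the held-out data influence the training data.
# X = StandardScaler().fit_transform(X)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```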

Confusion Matrix

A confusion matrix tabulates a classifier's predictions against the true labels, counting true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN); the metrics below are all derived from these four counts.

image.png

image.png

image.png
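To make the counts concrete, here is a minimal sketch with scikit-learn; the toy labels and predictions are made up for illustration.

```python
# Building a binary confusion matrix and reading off TN/FP/FN/TP.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # toy ground-truth labels (assumed)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # toy model predictions (assumed)

# scikit-learn's convention: rows are true classes, columns are predictions:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3

# Common metrics derived from the four counts:
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also called TPR or sensitivity
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```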

The Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the performance of a classification model, particularly in binary classification problems. It shows the trade-off between True Positive Rate (TPR, Sensitivity) and False Positive Rate (FPR) at various classification thresholds.
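In terms of the confusion-matrix counts, the two rates are defined as follows (standard definitions, stated here for reference):

$$
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \mathrm{Specificity}, \quad \text{where } \mathrm{Specificity} = \frac{TN}{TN + FP}
$$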

image.png

image.png

image.png

The x-axis represents FPR (1 - Specificity), and the y-axis represents TPR (Sensitivity).

A perfect classifier would sit at the point (0, 1), meaning zero false positives and 100% true positives. The closer the curve is to the top-left corner, the better the model performs, since it achieves a high TPR while keeping the FPR low.

image.png


The Area Under the ROC Curve (AUC-ROC) summarizes the model's overall performance in a single number: the area under the ROC curve described above. An AUC of 0.5 corresponds to random guessing, while an AUC of 1.0 indicates a classifier that ranks every positive above every negative.
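A minimal sketch of computing the ROC curve and its AUC with scikit-learn; the dataset, classifier, and split parameters are illustrative assumptions.

```python
# Computing ROC points and AUC from predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}")  # 0.5 = random guessing, 1.0 = perfect ranking
```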

</aside>