Resource: "Pattern Recognition and Machine Learning" by Christopher Bishop
Resource: Andrew Ng’s Machine Learning course on Coursera
TYPES OF MACHINE LEARNING
Resource: Lesson2_2, "Machine Learning: An Algorithmic Perspective", 2nd edition (2014), PDF
<aside> 💡
Training Set: used to fit the model's parameters.

Validation Set: used to tune hyperparameters and choose between candidate models.

Testing Set: held out until the end and used once, to estimate performance on unseen data.
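As a minimal illustration of the three-way split above, here is a pure-Python sketch (the function name and split fractions are illustrative, not from the source):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a dataset and partition it into train/validation/test splits."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters when the data is ordered (e.g., by class or by date); for time-series data, however, a chronological split should be used instead.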
Need to consider: data leakage

Data leakage occurs when information from outside the training dataset is used to build the model, leading to overly optimistic performance during training but poor generalization on unseen data. The result is a model that appears highly accurate in development but fails in real-world scenarios.

Types of leakage:

→ Train-Test Contamination: happens when test data is accidentally included in the training set. For example, if future stock prices are included in the training data while predicting stock trends, the model will learn patterns it wouldn't have access to in real-world conditions.

→ Feature Leakage (Target Leakage): occurs when features used for training include information that wouldn't be available at prediction time. In a loan approval model, using a feature like "whether the loan was paid off" to predict loan default would make the model perform unrealistically well.

How to prevent it: ensure proper data splitting; avoid using future information; apply cross-validation correctly (fit any preprocessing on the training fold only).

Why it is serious: (1) it causes poor generalization on real-world data; (2) it can result in costly mistakes in fields like finance, healthcare, and cybersecurity.
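A subtle form of leakage is fitting preprocessing statistics (e.g., standardization) on the full dataset before splitting. A minimal sketch of the leak-free order of operations, with illustrative names:

```python
def standardize_no_leakage(train, test):
    """Compute mean/std on the TRAINING split only, then apply to both splits.
    Fitting the scaler on train + test would leak test-set statistics
    into the model's inputs (a form of train-test contamination)."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    std = var ** 0.5 or 1.0          # guard against zero variance
    scale = lambda xs: [(x - mean) / std for x in xs]
    return scale(train), scale(test)

train_scaled, test_scaled = standardize_no_leakage([1.0, 2.0, 3.0], [4.0])
```

The same split-then-fit discipline applies inside cross-validation: each fold's preprocessing must be fit on that fold's training portion only.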
Confusion Matrix: a table comparing predicted labels against true labels. For binary classification it counts true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), from which metrics such as accuracy, precision, and recall (TPR) are derived.
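A hand-rolled sketch of how the four cells of a binary confusion matrix are counted (function and variable names are illustrative):

```python
def confusion_matrix(y_true, y_pred):
    """Return (TP, FP, FN, TN) counts for binary labels, with 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)

# Derived metrics (this toy example has tp + fp > 0 and tp + fn > 0)
precision = tp / (tp + fp)   # of predicted positives, how many were real
recall = tp / (tp + fn)      # TPR / sensitivity: of real positives, how many found
```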
The Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the performance of a classification model, particularly in binary classification problems. It shows the trade-off between True Positive Rate (TPR, Sensitivity) and False Positive Rate (FPR) at various classification thresholds.
The x-axis represents FPR (1 - Specificity), and the y-axis represents TPR (Sensitivity).

A perfect classifier would have a point at (0, 1), meaning zero false positives and 100% true positives. The closer the curve is to the top-left corner, the better the model performs, as it achieves a high TPR while keeping the FPR low.
The Area Under the ROC Curve (AUC-ROC) quantifies the overall performance of a classification model. It is calculated as the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different thresholds. An AUC of 1.0 corresponds to a perfect classifier, while an AUC of 0.5 corresponds to random guessing.
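The threshold sweep and the area computation described above can be sketched in pure Python (illustrative names; this simplified version assumes both classes are present and does not treat tied scores specially):

```python
def roc_points(y_true, scores):
    """Sweep the decision threshold over the model's scores, from highest to
    lowest, and return the resulting (FPR, TPR) points of the ROC curve."""
    pairs = sorted(zip(scores, y_true), reverse=True)
    pos = sum(y_true)                 # number of real positives
    neg = len(y_true) - pos           # number of real negatives
    tp = fp = 0
    points = [(0.0, 0.0)]             # threshold above every score
    for _score, label in pairs:       # lowering the threshold one sample at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

pts = roc_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.3])
print(round(auc(pts), 3))  # 0.833
```

In the toy example, the highest-scored samples are mostly true positives, so the curve rises steeply before moving right, giving an AUC well above the 0.5 chance level.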
</aside>