Overview of Supervised Learning
Supervised learning is a type of machine learning where the model is trained on labeled data. The goal is to learn a mapping from input features (independent variables) to the target output (dependent variable). The algorithm uses this learned mapping to predict the output for new, unseen data. Supervised learning tasks are broadly divided into two categories:
- Regression: Predicting a continuous output.
- Classification: Predicting a discrete class label.
Regression Algorithms
1. Linear Regression:
- Description: Linear regression is a simple algorithm that models the relationship between the dependent variable and one or more independent variables by fitting a linear equation to observed data.
- Equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
- Use Case: Predicting house prices based on features like size, number of rooms, etc.
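As a minimal sketch of the linear equation above, the following fits a model with scikit-learn on a small, made-up housing dataset. The feature values and the price rule used to generate the targets are assumptions chosen for illustration, so the recovered coefficients can be compared against a known answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: [size in square metres, number of rooms]
X = np.array([[50, 2], [70, 3], [90, 3], [120, 4], [150, 5]], dtype=float)

# Targets generated from an assumed linear rule (no noise):
# price = 20000 + 1500 * size + 10000 * rooms
y = 20000 + 1500 * X[:, 0] + 10000 * X[:, 1]

model = LinearRegression().fit(X, y)
intercept = model.intercept_   # estimate of beta_0
coefs = model.coef_            # estimates of beta_1 and beta_2

# Predict the price of a 100 square metre, 3-room house
pred = model.predict(np.array([[100.0, 3.0]]))[0]

print("intercept:", intercept)
print("coefficients:", coefs)
print("predicted price:", pred)
```

Because the targets here contain no noise, the fitted coefficients should match the generating rule up to floating-point error; real data would only approximate it.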
2. Polynomial Regression:
- Description: Polynomial regression is an extension of linear regression where the relationship between the independent variable and the dependent variable is modeled as an $n$-th degree polynomial.
- Equation: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$
- Use Case: Modeling more complex relationships, like predicting the trajectory of a ball.
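One way to sketch polynomial regression is to expand the input into polynomial terms and then fit an ordinary linear regression on them. The example below assumes a ball whose height follows h(t) = 1 + 10t - 4.9t² (hypothetical launch conditions) and uses scikit-learn's `PolynomialFeatures` for the expansion.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical measurements: time in seconds and ball height in metres,
# generated from the assumed rule h(t) = 1 + 10 t - 4.9 t^2 (no noise).
t = np.linspace(0.0, 2.0, 9).reshape(-1, 1)
height = 1.0 + 10.0 * t.ravel() - 4.9 * t.ravel() ** 2

# Expand t into the columns [t, t^2], then fit a linear model on those columns.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(t, height)

linear_step = poly_model.named_steps["linearregression"]
beta0 = linear_step.intercept_
beta1, beta2 = linear_step.coef_

print("beta0:", beta0, "beta1:", beta1, "beta2:", beta2)
```

The model stays linear in the coefficients; only the features are nonlinear in t, which is why the same least-squares machinery applies.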
Classification Algorithms
Logistic Regression:
- Description: Logistic regression is used for binary classification problems. It models the probability that a given input point belongs to a particular class.
- Equation: $P(y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$
- Use Case: Predicting whether a customer will buy a product (yes/no).
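To make the logistic equation concrete, here is a small sketch that evaluates it directly with NumPy. The coefficients and the single feature (minutes spent on a product page) are hypothetical values picked for illustration, not fitted from data.

```python
import numpy as np

def predict_proba(x, beta0, beta):
    """Return P(y = 1 | x) under a logistic regression model."""
    z = beta0 + np.dot(x, beta)        # linear combination of the features
    return 1.0 / (1.0 + np.exp(-z))    # logistic (sigmoid) function

# Hypothetical coefficients for one feature: minutes spent on a product page
beta0 = -3.0
beta = np.array([0.5])

minutes = np.array([[2.0], [6.0], [12.0]])
probs = predict_proba(minutes, beta0, beta)

# Turn probabilities into yes/no predictions with a 0.5 threshold
will_buy = probs >= 0.5

print("probabilities:", probs)
print("will buy:", will_buy)
```

With these assumed coefficients, the linear term is exactly zero at 6 minutes, so the predicted probability there is 0.5, the decision boundary.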
Decision Trees:
- Description: Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
- Use Case: Customer segmentation, credit scoring.
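The root-to-leaf sorting described above can be seen by printing the rules a fitted tree has learned. This sketch uses scikit-learn's `DecisionTreeClassifier` on a tiny, made-up credit dataset; the feature names, values, and labels are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical credit data: [annual income in thousands, existing debt in thousands]
X = [[30, 5], [40, 20], [45, 10], [55, 15], [60, 5], [80, 20]]
y = [0, 0, 0, 1, 1, 1]  # 1 = approve, 0 = decline

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Human-readable view of the splits from the root down to each leaf
rules = export_text(tree, feature_names=["income", "debt"])
print(rules)

# Classify two new applicants by sending them down the tree
decisions = tree.predict([[48, 10], [52, 10]])
print("decisions:", decisions)
```

In this toy data only income separates the two labels cleanly, so the tree needs a single split; real datasets typically produce deeper trees, and limiting `max_depth` is a common guard against overfitting.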
Support Vector Machines (SVMs):
- Description: SVMs find the hyperplane that maximizes the margin between classes. They are effective in high-dimensional spaces.
- Use Case: Image classification, text categorization.
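A minimal sketch of the maximum-margin idea, assuming a linearly separable toy dataset and scikit-learn's `SVC` with a linear kernel. The points are made up; for a linear SVM the margin width can be read off the learned weight vector as 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D points for two classes
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]         # normal vector of the separating hyperplane
b = clf.intercept_[0]    # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)

# Classify one new point on each side of the hyperplane
labels = clf.predict([[1.5, 1.5], [4.5, 4.5]])
print("labels:", labels)
```

Only the support vectors (the training points closest to the boundary) determine the hyperplane; moving any other point slightly would leave the fitted model unchanged.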
k-Nearest Neighbors (k-NN):
- Description: k-NN is a simple, instance-based learning algorithm that classifies a data point by majority vote among its k closest training points.
- Use Case: Recommender systems, handwriting recognition.
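Because k-NN has no training step beyond storing the data, it is short enough to sketch from scratch. The helper `knn_predict` below is a hypothetical illustration (not a library function) that uses Euclidean distance and a majority vote; the data points and labels are made up.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points with two class labels
X_train = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class "A"
    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0],   # class "B"
])
y_train = ["A", "A", "A", "B", "B", "B"]

label_near_a = knn_predict(X_train, y_train, np.array([2.0, 2.0]))
label_near_b = knn_predict(X_train, y_train, np.array([6.0, 6.5]))
print(label_near_a, label_near_b)
```

An odd k avoids tied votes in two-class problems. In practice, scikit-learn's `KNeighborsClassifier` offers the same idea with faster neighbor search.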
Model Evaluation Metrics
1. Accuracy:
- Description: Accuracy is the ratio of correctly predicted instances to the total instances.
- Formula: $\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$, where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
- Use Case: Good for balanced datasets where each class has roughly the same number of observations.
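The caveat about balanced datasets can be illustrated with a short sketch. The labels below are made up: 5 positives out of 100, scored against a trivial model that always predicts the negative class.

```python
# Hypothetical imbalanced labels: 5 positives, 95 negatives
y_true = [1] * 5 + [0] * 95
# A trivial "model" that always predicts the negative class
y_pred = [0] * 100

# Count the four outcomes of the confusion matrix
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)
positives_found = tp / (tp + fn)  # share of actual positives that were detected

print(f"Accuracy: {accuracy:.2f}")
print(f"Share of positives found: {positives_found:.2f}")
```

The accuracy looks high even though the model never finds a single positive, which is why accuracy alone can mislead when classes are skewed.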
2. Precision:
- Description: Precision is the ratio of correctly predicted positive observations to the total predicted positives.
- Formula: $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$
- Use Case: Useful in scenarios where the cost of false positives is high (e.g., spam detection).
3. Recall (Sensitivity or True Positive Rate):
- Description: Recall is the ratio of correctly predicted positive observations to all observations that are actually positive.
- Formula: $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$
- Use Case: Important in cases where missing a positive instance has a high cost (e.g., disease detection).
4. F1 Score:
- Description: The F1 Score is the harmonic mean of precision and recall, providing a balance between the two.
- Formula: $\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
- Use Case: Suitable when you need to balance precision and recall.
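As a worked example of the precision, recall, and F1 formulas, the sketch below computes all three from a set of confusion-matrix counts. The counts are hypothetical (for instance, a spam filter evaluated on 100 emails).

```python
import math

# Hypothetical confusion-matrix counts for a spam filter on 100 emails
tp = 40  # spam correctly flagged
fp = 10  # legitimate mail wrongly flagged
fn = 20  # spam that slipped through
tn = 30  # legitimate mail correctly passed

precision = tp / (tp + fp)   # of everything flagged, how much was spam
recall = tp / (tp + fn)      # of all spam, how much was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

The harmonic mean sits closer to the smaller of the two values, so a model cannot reach a high F1 by excelling at precision while neglecting recall, or the reverse.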
5. ROC-AUC:
- Description: The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (Recall) against the False Positive Rate at each classification threshold. The AUC (Area Under the Curve) score summarizes the model’s performance across all thresholds in a single number.
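- Use Case: A good measure for evaluating the overall ranking performance of a classification model. It is less sensitive to class imbalance than accuracy, although with heavily skewed classes a precision-recall curve is often more informative.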
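One useful way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks that interpretation against scikit-learn's `roc_auc_score` on four made-up scores.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and model scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Ranking interpretation: compare every positive score with every negative score
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count half
auc_by_ranking = wins / (len(pos) * len(neg))

# Library computation of the area under the ROC curve
auc_sklearn = roc_auc_score(y_true, scores)

# Points of the ROC curve, one per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

print("AUC by ranking:", auc_by_ranking)
print("AUC from sklearn:", auc_sklearn)
print("FPR:", fpr)
print("TPR:", tpr)
```

Here three of the four positive-negative pairs are ordered correctly, so both computations give 0.75.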
Example: Logistic Regression with Model Evaluation
Here’s a Python example demonstrating logistic regression and model evaluation using scikit-learn (imported as sklearn):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.datasets import load_breast_cancer
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Logistic Regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # Probability estimates for class 1 (benign in this dataset)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
# Print evaluation metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")