After training a machine learning model, it is crucial to evaluate its performance to ensure it meets the desired objectives. The choice of evaluation metrics depends on the type of problem—classification, regression, or clustering—and the specific goals of the model. This article outlines the essential metrics used in different machine learning tasks.

Classification Metrics

1. Accuracy Accuracy measures the ratio of correctly predicted instances to the total instances. It is a straightforward metric but can be misleading in imbalanced datasets.
$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
$$
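As a quick illustration, here is a minimal sketch that computes accuracy with scikit-learn on a small set of made-up labels (the arrays are hypothetical and scikit-learn is assumed to be available):

```python
# Minimal sketch: accuracy on toy binary labels (hypothetical data).
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions

# Fraction of predictions that match the true labels: 5 correct out of 6.
print(accuracy_score(y_true, y_pred))  # ~0.833
```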
2. Precision Precision indicates the ratio of correctly predicted positive observations to the total predicted positives. It is particularly useful when the cost of false positives is high.
$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$
3. Recall (Sensitivity or True Positive Rate) Recall measures the ratio of correctly predicted positive observations to all actual positives. It is important when the cost of false negatives is high.
$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$
4. F1 Score The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. It is useful when the classes are imbalanced.
$$
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
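The three metrics above (precision, recall, and F1) can be computed the same way; a minimal sketch on hypothetical binary labels, again assuming scikit-learn:

```python
# Minimal sketch: precision, recall, and F1 on toy binary labels (hypothetical data).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

p  = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
r  = recall_score(y_true, y_pred)     # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
print(p, r, f1)
```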
5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve) ROC-AUC measures the model's ability to distinguish between classes. The ROC curve plots the true positive rate against the false positive rate, and the AUC quantifies the overall ability of the model to discriminate between positive and negative classes.
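A minimal sketch of ROC-AUC, which scores predicted probabilities rather than hard labels (the scores below are made up; scikit-learn is assumed):

```python
# Minimal sketch: ROC-AUC from predicted probabilities (hypothetical data).
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probability of class 1

print(roc_auc_score(y_true, y_score))              # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points that trace the ROC curve
```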

6. Confusion Matrix A confusion matrix is a table that summarizes the performance of a classification model. It displays the true positives, true negatives, false positives, and false negatives, providing a detailed view of the model's predictions.
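A minimal sketch of a confusion matrix on the same kind of toy labels (hypothetical data, assuming scikit-learn):

```python
# Minimal sketch: confusion matrix for toy binary labels (hypothetical data).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[2, 1], [1, 2]] for these labels
```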

Regression Metrics

1. Mean Absolute Error (MAE) MAE measures the average of the absolute differences between the predicted and actual values, providing a straightforward error metric.
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y_i} - y_i \right|
$$

2. Mean Squared Error (MSE) MSE calculates the average of the squared differences between the predicted and actual values. It penalizes larger errors more than smaller ones.
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y_i} - y_i \right)^2
$$

3. Root Mean Squared Error (RMSE) RMSE is the square root of MSE, providing an error metric in the same units as the target variable. It is more sensitive to outliers than MAE.
$$
\text{RMSE} = \sqrt{\text{MSE}}
$$
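The three error metrics above can be computed together; a minimal sketch on hypothetical regression targets, assuming scikit-learn and NumPy:

```python
# Minimal sketch: MAE, MSE, and RMSE on toy regression targets (hypothetical data).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

mae  = mean_absolute_error(y_true, y_pred)  # mean of |prediction - actual|
mse  = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                         # back in the units of the target
print(mae, mse, rmse)
```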

4. R-squared (Coefficient of Determination) R-squared indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. It provides a measure of how well the model fits the data.
$$
\text{Sum of Squared Residuals} = \sum_{i=1}^{n} \left( y_i - \hat{y_i} \right)^2
$$

$$
\text{Total Sum of Squares} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
$$

$$
R^2 = 1 - \frac{\text{Sum of Squared Residuals}}{\text{Total Sum of Squares}}
$$

WHERE:

Sum of Squared Residuals (SSR): Represents the total squared difference between the actual values of the dependent variable and the predicted values from the model. In other words, it measures the variance left unexplained by the model.

Total Sum of Squares (SST): Represents the total variance in the dependent variable itself. It's calculated by finding the squared difference between each data point's value and the mean of all the values in the dependent variable.

Essentially, R² compares the unexplained variance (SSR) to the total variance (SST). A higher R² value indicates the model explains a greater proportion of the total variance.
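A short sketch showing that the library value matches the definition above (hypothetical data, assuming scikit-learn and NumPy):

```python
# Minimal sketch: R-squared via scikit-learn and via the definition (hypothetical data).
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

ssr = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(1 - ssr / sst)                         # R-squared from the definition
print(r2_score(y_true, y_pred))              # same value from scikit-learn
```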

Clustering Metrics

1. Silhouette Score The silhouette score measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, with higher values indicating better clustering.
$$
\text{Silhouette Score} = \frac{b - a}{\max(a, b)}
$$

WHERE:

a: the mean distance from a point to the other points in its own cluster (mean intra-cluster distance),
b: the mean distance from that point to the points in the nearest other cluster (mean nearest-cluster distance)
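A minimal sketch of the silhouette score for a k-means clustering of toy 2-D points (the data are made up; scikit-learn is assumed, and silhouette_score averages the per-point coefficient defined above):

```python
# Minimal sketch: mean silhouette score for k-means clusters on toy 2-D points (hypothetical data).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]  # two obvious groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Closer to 1 means tighter, better-separated clusters.
print(silhouette_score(X, labels))
```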

2. Davies-Bouldin Index The Davies-Bouldin Index assesses the average similarity ratio of each cluster with the cluster most similar to it. Lower values indicate better clustering.

$$
\text{Cluster Similarity Ratio} = \frac{s_i + s_j}{d_{i,j}}
$$

$$
\text{Max Inter Cluster Ratio} = \max_{j \neq i} \left( \text{Cluster Similarity Ratio} \right)
$$

$$
\text{DB Index} = \frac{1}{n} \sum_{i=1}^{n}\text{Max Inter Cluster Ratio}
$$

WHERE:

Max Inter Cluster Ratio: the maximum, taken over all clusters j ≠ i, of the Cluster Similarity Ratio between cluster i and cluster j. Intuitively, this ratio penalizes clusters that are close together but have high within-cluster scatter.
s_i, s_j: the average distance between each point in cluster i (or j) and that cluster's centroid (the within-cluster scatter),
d_{i,j}: the distance between the centroids of clusters i and j,
n: the number of clusters
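A minimal sketch of the Davies-Bouldin index on the same kind of toy clustering (hypothetical data, assuming scikit-learn):

```python
# Minimal sketch: Davies-Bouldin index for k-means clusters on toy 2-D points (hypothetical data).
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Lower is better: compact, well-separated clusters give a small index.
print(davies_bouldin_score(X, labels))
```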

3. Adjusted Rand Index (ARI) The Adjusted Rand Index measures the similarity between the predicted and true cluster assignments, adjusted for chance. It ranges from -1 to 1, with higher values indicating better clustering.
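A minimal sketch of ARI on hypothetical cluster assignments, assuming scikit-learn; note that ARI ignores how the clusters are named:

```python
# Minimal sketch: Adjusted Rand Index between true and predicted assignments (hypothetical labels).
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]  # same grouping, different label names

# ARI is invariant to label permutations, so identical groupings score 1.0.
print(adjusted_rand_score(labels_true, labels_pred))  # 1.0
```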

General Metrics for Any Model

1. Log Loss (Cross-Entropy Loss) Log Loss evaluates probabilistic classifiers by penalizing confident but incorrect predictions; the more accurate and better calibrated the predicted probabilities, the lower the loss. The formula below is the binary case.
$$
\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p_i}) + (1 - y_i) \log(1 - \hat{p_i}) \right]
$$
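A minimal sketch of binary log loss from predicted probabilities (hypothetical data, assuming scikit-learn):

```python
# Minimal sketch: binary log loss from predicted probabilities (hypothetical data).
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.1, 0.8, 0.35]  # predicted probability of class 1

# Confident wrong predictions are penalized heavily; perfect probabilities give 0.
print(log_loss(y_true, y_prob))
```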
2. AIC (Akaike Information Criterion) / BIC (Bayesian Information Criterion) AIC and BIC are used for model comparison, balancing goodness of fit and model complexity. Lower values indicate better models.
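As a rough sketch of the idea rather than any particular library's API, both criteria can be computed from a fitted model's maximized log-likelihood, its parameter count, and the sample size; the numbers below are made up for illustration:

```python
# Minimal sketch: AIC and BIC from a model's log-likelihood (hypothetical numbers).
import math

log_likelihood = -120.5  # maximized log-likelihood of the fitted model (made up)
k = 4                    # number of estimated parameters
n = 200                  # number of observations

aic = 2 * k - 2 * log_likelihood            # penalizes complexity linearly in k
bic = k * math.log(n) - 2 * log_likelihood  # penalizes complexity more as n grows
print(aic, bic)  # lower values indicate a better fit/complexity trade-off
```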

3. Precision-Recall AUC Precision-Recall AUC is useful for imbalanced datasets where the ROC-AUC may be misleading. It provides a summary of the precision-recall trade-off.
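A minimal sketch of this trade-off summary (hypothetical scores, assuming scikit-learn; average precision is a common way to summarize the PR curve):

```python
# Minimal sketch: precision-recall AUC from predicted scores (hypothetical data).
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print(average_precision_score(y_true, y_score))   # average precision summary
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))                     # area under the PR curve itself
```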

These metrics provide a comprehensive view of a machine learning model's performance, helping practitioners fine-tune and select the best model for their specific problem. Proper evaluation ensures that the model generalizes well to new, unseen data, ultimately leading to more robust and reliable predictions.