Notes on a book about Kaggle. I don’t particularly care about competitions, but I’m hoping to find some decent general ML optimization tips, so I’m skipping the first couple of chapters.

## Competition Tasks and Metrics

Interestingly, there’s metadata you can mine in Kaggle. The book shows the most common evaluation metrics from 2015 to 2021, counted as the number of competitions per year that used each metric.

### Metric Table

Metric | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | Total |
---|---|---|---|---|---|---|---|---|
AUC | 4 | 4 | 1 | 3 | 3 | 2 | 0 | 17 |
LogLoss | 2 | 2 | 5 | 2 | 3 | 2 | 0 | 16 |
MAP@{K} | 1 | 3 | 0 | 4 | 1 | 0 | 1 | 10 |
CategorizationAccuracy | 1 | 0 | 4 | 0 | 1 | 2 | 0 | 8 |
MulticlassLoss | 2 | 3 | 2 | 0 | 1 | 0 | 0 | 8 |
RMSLE | 2 | 1 | 3 | 1 | 1 | 0 | 0 | 8 |
QuadraticWeightedKappa | 3 | 0 | 0 | 1 | 2 | 1 | 0 | 7 |
MeanFScoreBeta | 1 | 0 | 1 | 2 | 1 | 2 | 0 | 7 |
MeanBestErrorAtK | 0 | 0 | 2 | 2 | 1 | 1 | 0 | 6 |
MCRMSLE | 0 | 0 | 1 | 0 | 0 | 5 | 0 | 6 |
MCAUC | 1 | 0 | 1 | 0 | 0 | 3 | 0 | 5 |
RMSE | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 5 |
Dice | 0 | 1 | 1 | 0 | 2 | 1 | 0 | 5 |
GoogleGlobalAP | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 5 |
MacroFScore | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 4 |
Score | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 |
CRPS | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 3 |
OpenImagesObjectDetectionAP | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 3 |
MeanFScore | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 3 |
RSNAObjectDetectionAP | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 2 |

### Handling Never Before Seen Metrics

There are tips from a grandmaster showing how to understand certain metrics

- https://www.kaggle.com/carlolepelaars/understanding-the-metric-spearman-s-rho
- https://www.kaggle.com/carlolepelaars/understanding-the-metric-quadratic-weighted-kappa
- https://www.kaggle.com/rohanrao/osic-understanding-laplace-log-likelihood

Additionally, tips are given on his workflow

- Understand the problem statement
- Curate the dataset for the problem statement
- Deep dive into the data
- Build a simple pipeline
- Engineer features, tune hyperparameters, try different models
- Read discussions, pick up domain knowledge, tweak features?
- Ensemble models
- Deploy

#### Regression

Estimation of a continuous value.

**MSE** = 1/n * SSE (SSE is the sum of the squared differences between your predictions and the
‘real’ values)

**R^2** = 1 - SSE / SST (SST is the sum of squares total, proportional to the variance of the response)

R squared compares the squared errors of the model against the squared errors from the simplest model possible, the average of the response.
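A small worked example of these quantities, with made-up numbers (the arrays are purely illustrative). Note that R^2 is one minus the ratio SSE / SST, so a perfect model scores 1 and a model no better than predicting the mean scores 0.

```python
import numpy as np

# Illustrative values only, not taken from the book.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

sse = np.sum((y_true - y_pred) ** 2)          # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
mse = sse / len(y_true)
rmse = np.sqrt(mse)
r2 = 1.0 - sse / sst

print(f"SSE={sse}, SST={sst}, MSE={mse}, RMSE={rmse:.4f}, R^2={r2}")
```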

**RMSE**, square root of MSE - interesting because in MSE large prediction errors are
heavily penalized due to squaring, whereas in RMSE the root diminishes that effect.
Of course, outliers still affect it a lot, but it’s just a thing to note. The book discusses
fitting with an MSE objective on the square root of the target, then squaring the predictions;
`TransformedTargetRegressor` in scikit-learn can help.
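A minimal sketch of the square-root-then-square idea with `TransformedTargetRegressor`, on synthetic data. The data, the choice of `LinearRegression`, and the variable names are my own assumptions; the target has to be non-negative for the square root to make sense.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (1.0 + 0.5 * X[:, 0]) ** 2 + rng.normal(0, 0.1, size=200)

# The regressor is fitted on sqrt(y); predictions are squared back automatically.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.sqrt,
    inverse_func=np.square,
)
model.fit(X, y)
preds = model.predict(X)
```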

**RMSLE** - RMSE computed on the logarithm of the predictions and targets. MCRMSLE
(the mean column-wise version) is another variant. What you care about here is
the scale of your predictions with respect to the scale of the ground truth.
Applying a log transform to the target before fitting can help, reversing it with an
exponential function afterwards.
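A hedged sketch of RMSLE using `log1p` (the usual convention, so that zeros are allowed; a given competition may define it slightly differently). The example values are invented to show that errors of the same ratio get a similar penalty regardless of magnitude.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error on log1p-transformed values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([10.0, 100.0, 1000.0])
y_pred = 2.0 * y_true                      # every prediction is off by a factor of 2

score = rmsle(y_true, y_pred)

# Log-transforming the target before fitting, then reversing with the exponential:
y_log = np.log1p(y_true)                   # fit the model on this
y_back = np.expm1(y_log)                   # apply this to the model's predictions
```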

**MAE** - mean absolute error is, as it says, the mean of the absolute differences
between the predictions and the targets. Using it as an objective gives slower
convergence, since you’re optimizing for the median rather than the mean (L1 vs. L2 norm).

Most of the weird targets are just variations of these standard ones.

#### Classification

##### Binary

**Accuracy** - Simply # correct / total answers, can be misleading if classes are
unbalanced, e.g. if one class is 99% of the data, you can simply predict that
class and be 99% correct.

**Precision** - TP / (TP + FP) - correctness of predicting a positive

**Recall (Sensitivity)** - TP / (TP + FN)

The precision/recall trade-off occurs because of the threshold applied to the prediction. If you increase the threshold, you get more precision but less recall.

Average precision computes the mean precision for all recall values from 0 to 1. Useful for object detection, but also on tabular data. It can prove valuable for fraud detection problems.

This is why most use the **F1 Score** which is the harmonic mean of precision and
recall. In some cases the **F-beta** score is used which is the weighted harmonic
mean between precision / recall.
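To make the threshold trade-off concrete, here is a small sketch with invented labels and scores, computing precision, recall and F1 (F-beta with beta = 1) at three thresholds using scikit-learn.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Invented labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])

results = {}
for threshold in (0.3, 0.5, 0.65):
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        fbeta_score(y_true, y_pred, beta=1.0),  # beta > 1 favours recall, beta < 1 precision
    )

for threshold, (p, r, f) in results.items():
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Raising the threshold from 0.3 to 0.65 moves precision up and recall down on this toy data.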

**Log loss / ROC-AUC** - log loss is also known as cross-entropy in deep learning.
The implication is that the objective is to estimate as correctly as possible the
probability of an example being of the positive class. ROC-AUC is the “area
under the curve” of the Receiver Operating Characteristic (ROC): the true
positive rate plotted against the false positive rate, where the false positive
rate is equivalent to one minus the true negative rate.

**Matthews Correlation Coefficient (MCC)** - see https://www.kaggle.com/ratthachat/demythifying-matthew-correlation-coefficients-mcc for an explainer.

Interesting, as it can be read as the positive and negative precision summed and offset by 1, then multiplied by a ratio: (PositivePrecision + NegativePrecision - 1) * sqrt(((TP + FP) * (TN + FN)) / (PositiveLabels * NegativeLabels)), where PositiveLabels = TP + FN and NegativeLabels = TN + FP.

This rewards high precision on both the positive and the negative class, with predictions that are in proportion to the ground truth. It scores from -1 to 1 and works well even when classes are quite imbalanced.
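A quick numerical check of that decomposition against scikit-learn’s `matthews_corrcoef`, on a toy imbalanced example (the labels are made up; the variable names are mine).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

pos_precision = tp / (tp + fp)
neg_precision = tn / (tn + fn)
# Predicted positives * predicted negatives over actual positives * actual negatives.
ratio = np.sqrt(((tp + fp) * (tn + fn)) / ((tp + fn) * (tn + fp)))

mcc_decomposed = (pos_precision + neg_precision - 1) * ratio
mcc_sklearn = matthews_corrcoef(y_true, y_pred)
```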

##### Multi-Class

Generally you can use the binary metrics and summarize them using some sort of averaging strategy.

**Macro Averaging** - F1 score for each class, then averaged. Each class will count
as much as the others, leading to equal penalization when the model doesn’t
perform well on any one class.

**Micro Averaging** - Sum all the contributions from each class to compute an
aggregated F1 score.

**Weighting** - Same as macro, but then take a weighted average, with the
weights summing to 1. Useful for taking into account the frequency of positive cases
that are relevant to your problem.
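A small sketch of the three averaging strategies with scikit-learn’s `f1_score` on invented, imbalanced labels. `zero_division=0` is set because the rarest class is never predicted here.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented 3-class example; class 2 is rare and never predicted.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(per_class, macro, micro, weighted)
```

Macro is the plain mean of the per-class scores, so the missed rare class drags it down; in the single-label multiclass case micro F1 coincides with accuracy; weighted uses class frequencies as weights.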

##### Object

**Intersection Over Union (IoU) / Jaccard Index** - generally you have two masks to compare
(prediction and ground truth), using 1 for the object and 0 otherwise. It is the area of
overlap between the prediction and the ground truth mask, divided by the area of their union.

**Dice Coefficient** - the area of overlap between prediction and ground truth, doubled
and then divided by the sum of the prediction and ground truth areas.

IoU tends to penalize the overall average more if a single class prediction is wrong.
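Both metrics on a pair of tiny binary masks, as a sketch (the masks and function names are my own). For a single pair of masks the two are related by Dice = 2 * IoU / (1 + IoU), so Dice is never lower than IoU.

```python
import numpy as np

def iou(pred, truth):
    """Intersection over union of two boolean masks."""
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / union

def dice(pred, truth):
    """Twice the overlap divided by the sum of both mask areas."""
    intersection = np.logical_and(pred, truth).sum()
    return 2 * intersection / (pred.sum() + truth.sum())

truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True                    # 4 ground-truth pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True                     # 6 predicted pixels, 4 of them correct

iou_score = iou(pred, truth)
dice_score = dice(pred, truth)
```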

##### Recommendation / Multi-label Classification

**Mean Average Precision at K (MAP@{K})** -

This seems to be a complex metric. It is the mean, over all examples, of the average precision at K, with k as the cutoff. The average precision is the average of P@K computed over all values ranging from 1 to k: using the top prediction, then the top two, the top three, etc. until k.
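A sketch of one common convention for this metric: only ranks where a relevant item appears contribute their P@i, and the sum is normalized by min(number of relevant items, k). Individual competitions may define the details differently, so treat this as an assumption rather than the book’s exact formula; the function names are mine.

```python
def apk(actual, predicted, k=10):
    """Average precision at k for one example (one common convention)."""
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    for i, p in enumerate(predicted[:k]):
        # Count each relevant item once, at the rank where it first appears.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)       # precision at rank i + 1
    return score / min(len(actual), k)

def mapk(actuals, predictions, k=10):
    """Mean of apk over all examples."""
    return sum(apk(a, p, k) for a, p in zip(actuals, predictions)) / len(actuals)

actuals = [{"a", "b", "c"}, {"y"}]
predictions = [["a", "x", "b"], ["x", "y", "z"]]
map_at_3 = mapk(actuals, predictions, k=3)
```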

#### Optimizing Metrics

The book discusses optimizing for metrics, since a lot of out-of-the-box algorithms don’t let you choose your evaluation function (or offer only a limited set), so what the model optimizes might not align with the competition’s metric.

- https://github.com/benhamner/Metrics
- https://www.kaggle.com/bigironsphere/loss-function-library-keras-pytorch/notebook

##### Post-Processing

This means that your predictions are transformed by means of a function into something else that scores better on the evaluation metric.

An example was using a regression model on a classification problem, using boundaries as thresholds to turn the regression output into classes, then optimizing to find a better set of boundaries.

See: https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/76107
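A rough sketch of the boundary-optimization idea on synthetic data. The data, the quadratic weighted kappa objective, and the Nelder-Mead optimizer are my assumptions for illustration, not necessarily what the linked discussion uses. In practice the boundaries should be tuned on out-of-fold predictions rather than on the data being scored.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=500)               # ordinal classes 0..3
y_reg = y_true + rng.normal(0.3, 0.7, size=500)     # biased, noisy regression output

def to_classes(values, boundaries):
    # Sorting keeps the bins monotonic even if the optimizer crosses them.
    return np.digitize(values, np.sort(boundaries))

def neg_qwk(boundaries):
    preds = to_classes(y_reg, boundaries)
    return -cohen_kappa_score(y_true, preds, weights="quadratic")

initial = np.array([0.5, 1.5, 2.5])                 # naive rounding boundaries
result = minimize(neg_qwk, initial, method="Nelder-Mead")

qwk_initial = -neg_qwk(initial)
qwk_optimized = -result.fun
```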

When your predicted probabilities are misaligned with the training distribution of the target, scikit-learn’s `CalibratedClassifierCV` can help: it pipes your predictions into a post-processing function, one of:

- Sigmoid (Platt’s scaling - basically a logistic regression)
- Isotonic Regression (non-parametric regression - overfits if few examples)
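A minimal sketch of both calibration methods using scikit-learn’s `CalibratedClassifierCV`. The synthetic dataset and the choice of `GaussianNB` as the base model are assumptions made only to have something to calibrate.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

calibrated = {}
for method in ("sigmoid", "isotonic"):
    # cv=5: the base model is fitted on 4 folds and calibrated on the remaining one.
    model = CalibratedClassifierCV(GaussianNB(), method=method, cv=5)
    model.fit(X_train, y_train)
    calibrated[method] = model.predict_proba(X_test)[:, 1]
```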

## Good Validation

Adaptive overfitting - basically what happens when you evaluate against the same test set over and over; there should be a final hold-out.

Bias vs. variance - a model that is not complicated/expressive enough to capture the complexity of the problem gives biased predictions, whereas an overcomplicated model gives scattershot predictions because it records more details and noise.

A good definition of overfitting - “The process of learning elements of the training set that have no generalization”

### Strategies

The basic train-test split, with ~80/20 - there’s a chance of extracting a non-representative sample. You can use stratification, which ensures that the proportions of certain features are respected in the sampled data.

Probabilistic approaches are (usually) more useful for competitions, though more computationally expensive. They rely on the law of large numbers: repeatedly sampling reduces the error. Examples are k-fold, subsampling and bootstrapping.

#### k-fold cross-validation

Split the data into k folds; for each fold, train on the other k - 1 and predict against the held-out fold, then average the scores. An important thing to note is that this measures generalizability, and the score shouldn’t be compared purely against simple train/test splits. k = n is leave-one-out, but the resulting models are highly correlated with each other, so the score is more representative of the dataset itself than of how the model would perform on unknown data.

Choosing k: the smaller it is, the more bias; the higher it is, the more correlated the fold models become, and you lose out on interesting properties when predicting on unseen data. Commonly set to 5, 7 or 10.

Stratified k-fold is an option when you need to preserve the distribution of a variable / proportion of small classes (spam / fraud datasets, etc.)

For multi-label problems, Scikit-multilearn has `IterativeStratification`.

Interestingly, you can make good use of stratification in regression problems, helping your regressor to fit during cross validation - creating a discrete proxy for the target

https://www.kaggle.com/lucamassaron/are-you-doing-cross-validation-the-best-way

Sturges’ rule is helpful to determine how many bins to use: `np.floor(1 + np.log2(len(X_train)))`

Another option is running a cluster analysis on the features and then using the predicted clusters as strata (PCA → k-means).
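The binned-target approach can be sketched as below, on synthetic data. Using `pd.qcut` (quantile bins) is my choice to keep every bin well populated; equal-width bins via `pd.cut` are the other obvious option but can leave sparse bins on skewed targets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = rng.lognormal(size=1000)                   # skewed continuous target

n_bins = int(np.floor(1 + np.log2(len(X_train))))   # Sturges' rule
y_binned = pd.qcut(y_train, q=n_bins, labels=False)  # discrete proxy for the target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
val_bin_counts = []
for train_idx, val_idx in skf.split(X_train, y_binned):
    # Fit the regressor on y_train[train_idx]; y_binned is only used for splitting.
    val_bin_counts.append(np.bincount(y_binned[val_idx], minlength=n_bins))
```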

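GroupKFold - for non-i.i.d. data where there is grouping among examples; it keeps each group entirely in either the training or the validation part of a split.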
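A tiny illustration with made-up group labels, checking that no group ever appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
groups = np.repeat(["a", "b", "c", "d"], 3)   # e.g. one group per user or patient

overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # Groups shared between the training and validation side (should be none).
    overlaps.append(set(groups[train_idx]) & set(groups[val_idx]))
```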

For time-based splits, using a fixed lookback seems to be suggested (essentially windowing both the training -and- the validation data), because otherwise the growing training window just shows decreasing bias, which can be confused with model performance.
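One way to get a fixed lookback in scikit-learn is `TimeSeriesSplit` with `max_train_size` (training window) and `test_size` (validation window). This is a sketch under that assumption; the window lengths are arbitrary.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)             # 100 time-ordered observations

# Training window capped at 20 rows, validation window fixed at 10 rows.
splitter = TimeSeriesSplit(n_splits=4, max_train_size=20, test_size=10)
windows = list(splitter.split(X))

for train_idx, val_idx in windows:
    print(f"train {train_idx[0]}-{train_idx[-1]}, validate {val_idx[0]}-{val_idx[-1]}")
```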

Nested cross validation! - basically cross validating inside cross validating: an inner loop picks hyperparameters, an outer loop scores the whole procedure.
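A compact sketch of nested cross-validation in scikit-learn: `GridSearchCV` is the inner loop and `cross_val_score` the outer one. The dataset, model and parameter grid are placeholders of my own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # hyperparameter selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # performance estimate

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold reruns the full grid search on its own training part.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
```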

#### Subsampling & Bootstrap

Subsampling is similar to k-fold, but you basically choose the number of samples yourself; there are no fixed folds.

Devised to estimate the error distribution of an estimate, the bootstrap draws a sample, with replacement, that is the same size as the available data. There’s an interesting discussion of the .632 method, but it seems intractable for machine learning. While not used often, bootstrapping is mentioned as an alternative to cross-validation when, due to outliers or heterogeneous examples, there is a large standard error of the evaluation metric in CV.
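A plain bootstrap of an evaluation metric (accuracy here) as a sketch: resample the scored examples with replacement and look at the spread. This is the simple percentile version on synthetic predictions, not the .632 variant.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Synthetic predictions that agree with the truth roughly 80% of the time.
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)
accuracy = np.mean(y_true == y_pred)

n = len(y_true)
boot_scores = np.empty(1000)
for b in range(1000):
    idx = rng.integers(0, n, size=n)          # same size as the data, with replacement
    boot_scores[b] = np.mean(y_true[idx] == y_pred[idx])

std_error = boot_scores.std(ddof=1)
ci_low, ci_high = np.percentile(boot_scores, [2.5, 97.5])
```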

Worth looking into - https://www.kaggle.com/anokas/finding-boatids - a discussion of how to identify the target when the context is similar across images

### Adversarial Validation

This is an interesting technique meant for competitions, -however- it can also be used for detecting model drift. It essentially involves training a classifier to tell the train and test datasets apart and looking at the ROC AUC to see if they’re similar, with a score around 0.50 showing they are. There are a couple of strategies if they differ, like removing variables and mimicking the test set.
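A sketch of adversarial validation on synthetic data: label rows by origin, get out-of-fold probabilities from a classifier, and check the ROC AUC. The random forest and the artificially shifted feature are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 5))
test_same = rng.normal(0.0, 1.0, size=(500, 5))      # same distribution as train
test_shifted = rng.normal(0.0, 1.0, size=(500, 5))
test_shifted[:, 0] += 2.0                            # drift in the first feature

def adversarial_auc(train_X, test_X):
    X = np.vstack([train_X, test_X])
    is_test = np.r_[np.zeros(len(train_X), dtype=int), np.ones(len(test_X), dtype=int)]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # Out-of-fold probabilities, so the AUC is not inflated by memorization.
    proba = cross_val_predict(clf, X, is_test, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(is_test, proba)

auc_same = adversarial_auc(train, test_same)
auc_shifted = adversarial_auc(train, test_shifted)
```

An AUC near 0.5 means the classifier cannot tell the two sets apart; the shifted set scores well above that, and the classifier’s feature importances would point at the drifting variable.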

### Leakage (Data)

Another competition-specific thing, but worth noting: leakage is basically metadata, ordering, etc. that might give away something about the dataset.

## Tabular

The usefulness of synthetic data is discussed - https://www.kaggle.com/lucamassaron/how-to-use-ctgan-to-generate-more-data

### EDA

Auto-profiling solutions are discussed, AutoViz, Pandas Profiling, Sweetviz

UMAP and t-SNE

- https://distill.pub/2016/misread-tsne/
- https://pair-code.github.io/understanding-umap/
- https://www.kaggle.com/lucamassaron/interesting-eda-tsne-umap/

These links appear helpful, noting that UMAP and t-SNE are generally more revealing than classical methods based on variance restructuring like PCA / SVD.

Reducing memory size of pandas (or just use something else, I suppose):

https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
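The general idea, as I understand it, is downcasting numeric columns to the smallest dtype that holds their values. A minimal sketch with `pd.to_numeric` follows; the `downcast` helper and the example frame are my own, not the linked notebook’s code. Note that downcasting floats to float32 loses precision.

```python
import numpy as np
import pandas as pd

def downcast(df):
    """Return a copy with integer and float columns downcast where possible."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

df = pd.DataFrame({
    "small_int": np.arange(1000, dtype="int64") % 100,   # values 0..99 fit in int8
    "ratio": np.linspace(0, 1, 1000, dtype="float64"),
})
smaller = downcast(df)

before = df.memory_usage(deep=True).sum()
after = smaller.memory_usage(deep=True).sum()
```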