In this competition, the task was to score essays on a scale of 1 to 6, consistent with holistic scoring criteria. My solution uses an ensemble of DeBERTa models and Light Gradient Boosted Machines (LGBM). The competition is still running; once it's over, I'll update my final ranking.
The difference between the ground truth and the predicted score should be as small as possible. For example:
| Prediction quality | Ground truth | Predicted |
|---|---|---|
| Perfect | 5 | 5 |
| Bad | 5 | 3 |
| Worse | 5 | 1 |
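This kind of ordinal agreement, where large disagreements should be penalized more than small ones, is typically measured with quadratic weighted kappa (QWK). The official metric isn't restated here, so this is a minimal illustrative sketch of QWK on a 1 to 6 scale; the function name and implementation are my own:

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=6):
    """Quadratic weighted kappa between two lists of integer ratings.

    Penalizes a prediction quadratically in its distance from the
    ground truth, so a 5 -> 1 error costs far more than a 5 -> 3 error.
    """
    n = max_rating - min_rating + 1
    # Observed confusion matrix O
    O = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        O[t - min_rating][p - min_rating] += 1
    total = float(len(y_true))
    # Marginal histograms of true and predicted ratings
    hist_t = [sum(row) for row in O]
    hist_p = [sum(O[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic weight
            num += w * O[i][j]
            den += w * hist_t[i] * hist_p[j] / total  # expected by chance
    return 1 - num / den
```

Perfect agreement yields 1.0, and near-miss predictions score higher than wildly wrong ones.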
Divide long essays into windows of 1,024 tokens.
Identify the overall topic of each essay and cluster the essays accordingly.
Use stratified 7-fold CV, grouped by topic, for the DeBERTa models.
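The windowing step above can be sketched as follows. The real pipeline would run a DeBERTa tokenizer first; this sketch assumes the essay is already a list of token ids, and the `stride` parameter (for optional overlap between windows) is my own addition:

```python
def window_tokens(token_ids, window=1024, stride=None):
    """Split a long token sequence into fixed-length windows.

    stride defaults to window (non-overlapping chunks); a smaller
    stride produces overlapping windows so no sentence is cut off
    from all of its context.
    """
    stride = stride or window
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # last window already covers the tail
    return windows
```

A 2,500-token essay, for instance, becomes two full 1,024-token windows plus a 452-token remainder.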
Basic statistical descriptors (mean, max, sum) over character, word, and sentence counts.
TF-IDF and cTF-IDF vectors
Uncommon word count, misspelled word count, grammar score, etc.
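The statistical descriptors can be sketched like this. The exact feature set in the repo may differ; the tokenization regexes and feature names here are illustrative:

```python
import re

def essay_stats(text):
    """Basic statistical descriptors over characters, words, and sentences.

    A hypothetical subset of the engineered features: counts plus
    mean/max word and sentence lengths.
    """
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_lens = [len(w) for w in words] or [0]
    sent_lens = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences] or [0]
    return {
        "char_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "word_len_mean": sum(word_lens) / len(word_lens),
        "word_len_max": max(word_lens),
        "sent_len_mean": sum(sent_lens) / len(sent_lens),
        "sent_len_max": max(sent_lens),
    }
```

TF-IDF vectors and spell/grammar scores would be computed separately (e.g. with scikit-learn's `TfidfVectorizer`) and concatenated with these descriptors.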
Microsoft's DeBERTa-V3-Large predicted score probabilities from the essay text alone.
LGBM predicted the final score from the DeBERTa ensemble probabilities and the engineered features.
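The hand-off between the two models can be sketched as assembling a stacked feature matrix: average the per-class probabilities across the DeBERTa fold models, then append the engineered features. The function name and input layout are assumptions for illustration:

```python
def build_stack_features(fold_probs, engineered):
    """Build the input rows for the LGBM meta-model.

    fold_probs: list of per-model outputs, each [n_samples][6]
                class-probability rows (one entry per CV fold model)
    engineered: [n_samples][k] rows of engineered features
    """
    n_models = len(fold_probs)
    n_samples = len(engineered)
    rows = []
    for i in range(n_samples):
        # Average class probabilities across the fold models
        avg = [sum(m[i][c] for m in fold_probs) / n_models for c in range(6)]
        rows.append(avg + list(engineered[i]))
    return rows
```

The resulting rows would then be fed to something like `lightgbm.LGBMClassifier().fit(rows, labels)`; the exact LGBM configuration is in the repo, not reproduced here.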
Experimentation was tracked using Weights and Biases to iterate quickly and improve predictive performance. Additional experimentation details and results can be found in the repository wiki. Cross-validation performance for a select few experiments is shown below:
Combining the LGBM and DeBERTa ensembles gives the best performance.
LGBM has high classification performance across all classes.
Predicted scores have minimal difference from ground truth.
The best system so far scores 0.793 on the public leaderboard.
Well-engineered features coupled with deep neural networks outperform standalone deep neural networks.
Kaggle competitions require a CV strategy that correlates well with the public leaderboard. Early on, my CV strategy was poor: great CV performance but terrible leaderboard performance.
Over the span of two months, I ran many experiments. Tracking experiment parameters (data version, model version, configurations, etc.) in a single platform helped immensely during prototyping.