
Why am I getting such strange prediction scores?

Vaseline August 15, 2024

I'm working on a classification problem and keep getting strange results.

Data preparation: Initially I had 30 million rows (0.75 million with label 1, 29.25 million with label 0); the data is not time-based. I then balanced the classes by undersampling the majority class, leaving 750k rows of each class, and randomly split the result into training and test sets (80/20).
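A minimal sketch of the preparation described above, scaled down by a factor of 1,000 and using only the Python standard library. The function name `undersample_and_split`, the seed, and the toy labels are assumptions for illustration, not the actual pipeline.

```python
import random


def undersample_and_split(labels, test_fraction=0.2, seed=0):
    """Balance classes by undersampling label 0, then split at random.

    Hypothetical helper: returns (train_idx, test_idx) as lists of row indices.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Keep every positive row and an equally sized random subset of negatives.
    neg_kept = rng.sample(neg, len(pos))

    balanced = pos + neg_kept
    rng.shuffle(balanced)

    # The first `test_fraction` of the shuffled rows become the test set.
    n_test = int(len(balanced) * test_fraction)
    return balanced[n_test:], balanced[:n_test]


# Toy labels at 1/1000 of the real scale: 750 positives, 29,250 negatives.
labels = [1] * 750 + [0] * 29_250
train_idx, test_idx = undersample_and_split(labels)
print(len(train_idx), len(test_idx))
```

Because the split happens after balancing, the test rows come from the balanced data, so both sets have roughly a 50/50 class ratio rather than the original one.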

Training: I fitted an LGBMClassifier on all 106 features and, separately, on a subset of 67 features that are not strongly correlated with each other, and tried different hyperparameters. 1.2 million rows were used for training.

Prediction: the 300k test rows are used for the evaluation. Below are 4 plots, some of which really confuse me.

ROC curve. Okay, clearly not great, but not terrible either.

Precision-Recall curve. It looks odd around recall = 0.

F1 score by chosen threshold. Somehow any threshold below 0.35 is fine, but anything above 0.7 is always a terrible choice.
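For reference, an F1-by-threshold curve of this kind can be computed as in the sketch below. It is a standard-library toy with made-up labels and scores, not the actual data; `f1_at_threshold` is a hypothetical helper name, and the rule "predict 1 when score >= threshold" is an assumption about how the threshold is applied.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for label 1 when predicting 1 if and only if score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to count.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up balanced toy data: two rows of each class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

f1_by_threshold = {
    t: f1_at_threshold(y_true, scores, t) for t in (0.0, 0.3, 0.5, 0.9)
}
for t, f1 in f1_by_threshold.items():
    print(f"threshold={t:.1f}  F1={f1:.3f}")
```

One thing to keep in mind when reading such a curve: on a perfectly balanced test set, a threshold low enough to predict label 1 for every row gives precision 0.5 and recall 1, so F1 = 2/3 even without any useful model.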

Kernel Density Plots. Most of my questions relate to this distribution (blue = label 0, red = label 1). Why? Just why?

Why is that? Are there 2 different clusters within label 1? Or am I missing something obvious? Let me know in the comments and I'll provide more info if needed. Thanks in advance 🙂

submitted by /u/andreykol