5 Days of Data / Round2: - TheConversationBreak

Friday June 2, 2021

Revisions on ML Diamond Prices

I discovered that if I want to see my RMSE score on the Kaggle competition (long since closed, but I want to compare with my previous submission), I have to submit my y_pred. However, in order to submit, no rows can be dropped. This means the models that I trained on the dataset with dropped outliers don't give me the results I need when applied to the final test set.

So I went back to page one and re-ran the preprocessing, skipping the drop-outliers step. I pickled the result to a new file, and on page two I trained the models on it. Then I took the final test set and walked it through all the same preprocessing steps (also skipping drop outliers). I applied the RandomForest because it seemed to perform well. But because I applied a log transformation in the preprocessing stage, I needed to undo it with np.exp to get predictions back in price units.
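To make the log round trip concrete, here is a minimal sketch. It assumes the transformation was a plain np.log on the price column; the prices and the "model output" below are made-up values, not numbers from my notebook.

```python
import numpy as np

# Hypothetical prices standing in for the diamond price column.
prices = np.array([326.0, 554.0, 2757.0, 18823.0])

# Preprocessing step (assumed): the model is trained on log(price).
log_prices = np.log(prices)

# Stand-in for model predictions in log space (made-up offsets, not real output).
log_pred = log_prices + np.array([0.05, -0.02, 0.10, -0.08])

# Undo the log transformation so the predictions are in price units again.
pred = np.exp(log_pred)

print(pred)
```

If the preprocessing had used np.log1p instead, the matching inverse would be np.expm1 rather than np.exp.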

I caught that yesterday when I was checking the RMSE, R2, and MSE scores on my train-test set. They were so good that I was afraid the model might be overfitting. But then I realized that the scores were being computed on log-transformed prices, and the log scale compresses the gaps between predicted and observed prices. So when I applied np.exp to both y_pred and y_test, the results made more sense. They had still improved: instead of an R2 of 0.99, it was 0.95, whereas in my original run six months ago I was getting only about 0.90.
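A small sketch of why the log-scale scores looked too good. The rmse and r2 helpers are written by hand here for illustration, and the numbers are invented, so only the direction of the difference matters, not the values.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical observed prices and log-space predictions (made-up values).
y_test_log = np.log(np.array([326.0, 554.0, 2757.0, 18823.0]))
y_pred_log = y_test_log + np.array([0.05, -0.02, 0.10, -0.08])

# Scores on the log scale: errors look tiny because the scale is compressed.
rmse_log = rmse(y_test_log, y_pred_log)
r2_log = r2(y_test_log, y_pred_log)

# Scores in price units: apply np.exp to BOTH arrays before scoring.
y_test_price = np.exp(y_test_log)
y_pred_price = np.exp(y_pred_log)
rmse_price = rmse(y_test_price, y_pred_price)
r2_price = r2(y_test_price, y_pred_price)

print("log scale:  ", rmse_log, r2_log)
print("price scale:", rmse_price, r2_price)
```

The RMSE on the log scale is not in dollars at all, so it cannot be compared with a leaderboard score until both arrays are converted back.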

Then I applied my trained RandomForest model to the final test set. This was the set I had already run through the preprocessing, sans drop outliers. I undid the log transformation and submitted my prediction to Kaggle.
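The submission step can be sketched roughly like this. The column names "id" and "price", the ids, and the prediction values are all assumptions for illustration; the real competition defines its own submission format.

```python
import csv
import io

import numpy as np

# Hypothetical test-set ids and log-space predictions (made-up values).
test_ids = [0, 1, 2, 3]
log_pred = np.array([5.8, 6.3, 7.9, 9.8])

# Convert back to price units before submitting.
price_pred = np.exp(log_pred)

# Every test row needs a prediction, which is why no rows can be dropped.
assert len(price_pred) == len(test_ids)

# Write to an in-memory buffer here; a real run would open a CSV file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "price"])  # assumed column names
for row_id, price in zip(test_ids, price_pred):
    writer.writerow([row_id, round(float(price), 2)])

submission_text = buffer.getvalue()
print(submission_text)
```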

I was happy to see that I had cut my RMSE in half, from 1200 to around 600. It isn't an incredible result, but it is a good improvement. The lesson here is that there was no hyper-parameter tuning involved. In the original run I leaned on hyper-parameters to get better results. This time it was all the work of better feature engineering.

Now is the time to really experiment and dig into the different models and settings.
