
Add option to choose which fold to use as a final predictor #614

Open
wants to merge 7 commits into base: master
Conversation

@sami-ka commented Apr 26, 2023

Hi @pplonski,

Thanks for this great package!
I use it pretty often so I wanted to add my contribution to it.

I needed to compare the average of the models fitted on each fold against the prediction of only the last fold.
This was especially interesting in my case as it was a time series split, and I wanted my final model to be the one trained on the most recent data.

I added a parameter to the AutoML class called chosen_fold, which I ultimately set to -1 in my case to get the model of the last fold.
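
For illustration, a minimal sketch of what that looks like (the validation settings here are just placeholders, not the exact ones from my runs):

from supervised.automl import AutoML

# minimal placeholder setup; chosen_fold=-1 keeps only the model
# trained on the last fold when predicting
automl = AutoML(
    validation_strategy={"validation_type": "custom"},
    chosen_fold=-1,
)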

It's a bit linked to #475.

Feel free to tell me if I should continue working on this evolution!

P.S.: I think the changes in requirements_dev.txt are needed because the latest click versions are no longer compatible with the pinned version of black (see https://stackoverflow.com/questions/71673404/importerror-cannot-import-name-unicodefun-from-click).
Upgrading black would probably also be a good move.

@pplonski (Contributor) commented

Hi @drskd!

Thank you for the contribution. You are the first person to ask for this feature. If more users need it, then I will merge it.

@brainmosaik commented Jul 24, 2023

This sounds nice, but isn't it the same as just making a shorter test split? Can I ask what you use it for? Do you use it only for prediction and not for training?

@sami-ka (Author) commented Jul 24, 2023

@brainmosaik In my case I did a time series split with 4 folds, let's say one for each season over the last 12 months.
It's reassuring to find model hyperparameters that work well on average for every season, but if it's currently summer, I would like my predictions to come from the most recent model, with weights computed on data from the last months.
For now, the behaviour is to take as the final prediction the average prediction over the 4 folds. In my example, that means partly relying on a prediction made by a model fitted on data from the winter period, which is not completely satisfying if you have some sort of seasonality.
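
For reference, a split like that can be sketched with scikit-learn's TimeSeriesSplit (toy data here, not my actual dataset):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# toy example: 12 months of daily rows, 4 expanding-window folds,
# so each validation window roughly covers one season
X = np.arange(365).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print(f"train rows: {len(train_idx)}, validation rows: {len(val_idx)}")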

@brainmosaik commented Jul 24, 2023

That sounds good. So the workflow for this should be:
1. Train with chosen_fold=None to find model hyperparameters that work well on average for every season.
2. Then train a new model with the found hyperparameters and set chosen_fold=-1, to have a good model for the last fold (most recent season), but one with more generalized hyperparameters.

So it should be better on new upcoming days?


Or should it be used like this?
1. Train with chosen_fold=None to find model hyperparameters that work well on average for every season.
2. Reload the model but set chosen_fold=-1 to get the model for the last fold (most recent season).

Because mljar already saves the models for each fold? So we can skip the re-training?


Big thanks for this idea and implementation. Just trying to get my head around it.

Edit:
I looked a bit deeper into the code, and it looks like it's only used for the final prediction, not for training. So we can just reload/create the model with chosen_fold = -1 to get predictions based on the last season.
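
If I read it right, skipping the re-training would look roughly like this (a sketch; the results path is a placeholder and chosen_fold is the parameter from this PR):

from supervised.automl import AutoML

# reuse an already trained AutoML run from disk and only predict
# with the model fitted on the last fold (chosen_fold from this PR)
automl = AutoML(results_path="AutoML_results", chosen_fold=-1)  # placeholder path
predictions = automl.predict(X_new)  # X_new: new data to score; no re-training happens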

@sami-ka (Author) commented Jul 25, 2023

The chosen_fold parameter is used only at the prediction stage because, as you already noticed, if I did not want to train on some part of the data I would just discard it.

Here, as is already the case, AutoML trains each chosen model and set of hyperparameters on the 4 splits, then looks at the chosen validation metric averaged over the 4 validation sets to rank models on the leaderboard, and finally predicts using the model at the top of the leaderboard.
Each model in the leaderboard in this example is in fact 4 different models, each trained on a different dataset. For instance, you could have a decision tree with max_depth=5 as the winner, but if you train it on 4 different datasets you get 4 different predictions, as the decision trees would not necessarily all be the same. AutoML returns the average prediction of these 4 models.

The chosen_fold parameter would impact only the prediction step. Each model in the leaderboard would still be trained on 4 different datasets, but the prediction would come from only the model trained on the last split.

Classic usage:

from supervised.automl import AutoML
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4)  # the 4-fold time series split from the example above

automl = AutoML(
    validation_strategy={"validation_type": "custom"},
    chosen_fold=None,  # default value
)
automl.fit(X, y, cv=tscv)  # no behaviour change
automl.predict_proba(X)  # returns the average predicted proba of the 4 models, each trained with the winning hyperparameters of the leaderboard on the splits defined in cv

Custom usage:

automl = AutoML(
    validation_strategy={"validation_type": "custom"},
    chosen_fold=-1,  # only used at prediction time
)
automl.fit(X, y, cv=tscv)  # no behaviour change
automl.predict_proba(X)  # returns the predicted proba of the model trained with the winning hyperparameters of the leaderboard on the last split only

I could indeed use only the last split if I'm interested in the most recent part of the data, but as the model search is really powerful, having only one split for validation is risky in terms of overfitting. The additional regularization on hyperparameter selection that comes from more splits helps to limit that risk.

@mosaikme commented Dec 27, 2023

I must say thanks again for this commit. I think this should be in the main branch; it could be really useful. I wonder if it has been implemented in the main branch by now? I also have some additional ideas to make it produce a weighted prediction, something like:


import numpy as np

# Create linearly increasing weights, one per fold model,
# so the most recent fold gets the largest weight
WEIGHTS = np.linspace(1.0, len(self.learners), len(self.learners))

y_predicted = None
for ind, learner in enumerate(self.learners):
    # preprocessing goes here
    X_data, _, _ = self.preprocessings[ind].transform(X.copy(), None)

    y_p = learner.predict(X_data)
    y_p = self.preprocessings[ind].inverse_scale_target(y_p)

    # multiply the fold prediction by its weight and accumulate
    y_p_weighted = y_p * WEIGHTS[ind]
    y_predicted = y_p_weighted if y_predicted is None else y_predicted + y_p_weighted

# divide by the sum of the weights to get the weighted average
avg_pred_result = y_predicted / np.sum(WEIGHTS)

Or this, but I think I have a thinking error in this?

# Create weights from 0 to 1, one per fold model
WEIGHTS = np.linspace(0.0, 1.0, len(self.learners))

# make the weights exponential (optional)
# WEIGHTS = np.exp(WEIGHTS * chosen_fold)

# normalize the weights so they sum to 1, so we don't need to divide at the end
# (note: the first fold gets a weight of 0 and is effectively dropped)
WEIGHTS = WEIGHTS / np.sum(WEIGHTS)

y_predicted = None
for ind, learner in enumerate(self.learners):
    # preprocessing goes here
    X_data, _, _ = self.preprocessings[ind].transform(X.copy(), None)

    y_p = learner.predict(X_data)
    y_p = self.preprocessings[ind].inverse_scale_target(y_p)

    # multiply the fold prediction by its normalized weight and accumulate
    y_p_weighted = y_p * WEIGHTS[ind]
    y_predicted = y_p_weighted if y_predicted is None else y_predicted + y_p_weighted
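
One way to avoid giving the first fold a weight of exactly 0 while still favouring recent folds would be exponentially increasing, normalized weights; a sketch (not part of the PR):

import numpy as np

n_folds = 4  # e.g. len(self.learners)
# exponentially increasing weights, most recent fold weighted highest,
# normalized to sum to 1 so no final division is needed
raw = np.exp(np.arange(n_folds))
weights = raw / raw.sum()
print(weights)  # ~[0.032, 0.087, 0.237, 0.644]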
