Ax is not starting as many workers as I'd like; sometimes, get_next_trials returns 0 new trials #2301
Comments
Repository: https://github.com/NormanTUD/OmniOpt/tree/main/ax
Main script: https://github.com/NormanTUD/OmniOpt/blob/main/ax/.omniopt.py
For anyone looking through the environment the problem appears in, my general plan is to allow this:
and to run that optimization on our clusters, using ax/botorch internally for hyperparameter optimization. We have basically unlimited resources for free (university) and want as many workers in parallel as possible, so as to gain as much as possible from the HPC in finding good hyperparameters for every type of problem, or just for researching those areas (depending on what your program does). At the top of the code is a large comment showing some things I tried, though the list is anything but complete. We would really appreciate your help with this. Yours sincerely, NormanTUD
Hi @NormanTUD! Thanks so much for engaging with our tool - happy to help. Could you provide the logs from AxClient for your experiment? These logs usually contain information about trial generation and the generation strategy that will help us debug the issue. Also good catch on
Summary: This is a follow-up to facebook#2301. The user was trying to use batch trials, but we don't currently expose this via AxClient, so we want to add an error to let users know this isn't really having any effect. Reviewed By: saitcakmak. Differential Revision: D56048665
Summary: Pull Request resolved: #2355. This is a follow-up to #2301. The user was trying to use batch trials, but we don't currently expose this via AxClient, so we want to add an error to let users know this isn't really having any effect. Reviewed By: saitcakmak. Differential Revision: D56048665. fbshipit-source-id: 7dff08492e5cf52ab71579d9dcaac24beded4ff9
@NormanTUD -- added a PR for an error to surface when use_batch_trials is set; it'll be live once we cut a new release :) Let me know if you have the logs from AxClient for additional support. Thanks!
Hi, thanks for your reply. I was on vacation and as such didn't code anything, but I am trying to get all the logs now. Thanks for your patience; I will update this post when I have the logs. First, a bit of my own debugging code: Update #1:
These lines are only executed when there are new jobs to be generated (in a for loop for further testing, instead of by changing
So it just returns 0 jobs. These are the numbers of workers over time:
(No timestamps given there, though; it's per generation loop.) It should be around ~20, so 17 is fine for a snapshot while the jobs are starting, but over time it gets much lower. The only message from ax that seems relevant is this:
I've seen the tag "fixready" and installed the latest version (via pip/github). I cannot see any change in behaviour; it looks exactly like before. I am not entirely sure whether this tag implies that the fix is already in master, but if it is, it hasn't changed anything for me. The problem seems to be that generation_node.generator_run_limit() returns 0, even though it shouldn't. Not sure why yet, though. Edit: debugged it a bit more. Having 30 workers in parallel gives me this, and thus it returns 0:
I changed the function to this in modelbridge/generation_node.py:
myprint just adds the filename in front of the output, so I can debug it more easily. I am not sure why trials have failed, nor why some are abandoned, but in total, the
I also tried monkey patching it:
around the
Adding
Is there anything else I may provide? Yours sincerely, NormanTUD
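The monkey-patching attempt itself was lost in the page extraction. As a generic sketch of the pattern, here is one way to wrap a method so every call logs its return value without editing the library's source; `Node` below is a toy stand-in, not Ax's real `GenerationNode`:

```python
import functools

def log_return(cls, method_name, printer=print):
    """Monkey patch cls.method_name so every call logs its return value.

    Behavior is unchanged: the original method still runs and its result
    is passed through untouched.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        printer(f"{cls.__name__}.{method_name} returned {result!r}")
        return result

    setattr(cls, method_name, wrapper)

class Node:  # toy stand-in for the Ax class being debugged
    def generator_run_limit(self):
        return 0
```

Calling `log_return(Node, "generator_run_limit")` once at startup then prints the limit each time it is queried, which makes it easy to see when and why it drops to 0.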
I have made a breakthrough regarding the reason why I don't get as many workers as expected! When a job failed, I needed to do:
and when it succeeded, I needed to do:
This way, ax knows about the jobs being finished (or failed), and it no longer blocks the generation of new points. This was, admittedly, previously unclear to me. Now it finally works pretty much as I'd like :)
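The two elided calls above presumably correspond to AxClient.log_trial_failure(trial_index) and AxClient.complete_trial(trial_index, raw_data=...) in Ax's Service API. Why closing out trials unblocks generation can be illustrated with a toy scheduler (a sketch of the bookkeeping only, not Ax's real internals): a trial that is never marked completed or failed occupies a parallelism slot forever.

```python
class ToyScheduler:
    """Stand-in for Ax's parallelism bookkeeping (not the real API).

    New trials are only generated while fewer than `max_parallelism`
    trials are still running; completing or failing a trial frees a slot.
    """

    def __init__(self, max_parallelism):
        self.max_parallelism = max_parallelism
        self.running = set()
        self.next_index = 0

    def get_next_trials(self, max_trials):
        free = self.max_parallelism - len(self.running)
        new = {}
        for _ in range(min(max_trials, free)):
            new[self.next_index] = {"x": 0.0}  # dummy parameterization
            self.running.add(self.next_index)
            self.next_index += 1
        return new  # empty dict once every slot is occupied

    def complete_trial(self, trial_index):
        self.running.discard(trial_index)

    def log_trial_failure(self, trial_index):
        self.running.discard(trial_index)
```

With `max_parallelism=2`, the first call hands out two trials and every further call returns an empty dict until `complete_trial` or `log_trial_failure` frees a slot, which matches the observed "get_next_trials returns 0 new trials" when finished jobs are never reported back.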
Hi,
I really like ax for optimizing hyperparameters. Based on it, I have written a tool for hyperparameter optimization, but I have stumbled upon a problem.
We use Slurm and submitit for our cluster and it all works fine, except for one thing: the number of parallel "workers" (i.e. the number of jobs running in parallel) hardly ever reaches the maximum specified in my script.
The problem lies in the ax_client.get_next_trials function. I do a loop like this:
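The loop itself did not survive the page extraction. As a rough sketch of its shape under the Service API: the real AxClient.get_next_trials(max_trials=...) returns a (trial_index_to_param, optimization_complete) tuple, and here a fake client with the same return shape stands in for AxClient (the budget, parameter values, and submission list are made up for illustration):

```python
class FakeAxClient:
    """Tiny stand-in mimicking only the return shape of
    AxClient.get_next_trials: (dict of trial_index -> params, done flag)."""

    def __init__(self, budget):
        self.budget = budget
        self.next_index = 0

    def get_next_trials(self, max_trials):
        n = min(max_trials, self.budget - self.next_index)
        trial_index_to_param = {
            self.next_index + i: {"lr": 0.01} for i in range(n)
        }
        self.next_index += n
        return trial_index_to_param, self.next_index >= self.budget

ax_client = FakeAxClient(budget=5)
submitted = []
done = False
while not done:
    trial_index_to_param, done = ax_client.get_next_trials(max_trials=3)
    for trial_index, params in trial_index_to_param.items():
        # with submitit this would be something like executor.submit(run, params)
        submitted.append((trial_index, params))
```

The failure mode described in this issue is the case where trial_index_to_param comes back as an empty dict even though the parallelism budget seemingly isn't exhausted.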
I've tried max_trials=args.max_trials (coming from argparse) as well, but the behaviour is the same. Sometimes the trial_index_to_param is empty; there are just 0 entries in it. I've tried the following:
But still, sometimes the result coming from get_next_trials is empty, with 0 entries. Using use_batch_trials or not makes no difference there, as far as I can tell. This is done in 10-minute slots, and as you can see, in the beginning there are many completed jobs, almost 90 per 10-minute slot. But later on it gets less and less, every time because the length of trial_index_to_param is 0. Is there anything more I can do against this? How can I use the full number of parallel evaluations specified?
Thanks!
Edit: tried adding enforce_sequential_optimization=False to the choose_generation_strategy_kwargs, but that doesn't change anything either.