Feature importances in TabularNLPAutoML #129

Open
fingoldo opened this issue Mar 27, 2022 · 3 comments
Comments

@fingoldo

Hi, is it possible to get feature importances in TabularNLPAutoML for regular (non-text) features, the same way as in TabularAutoML?
Currently automl.get_feature_scores("fast") throws an error:


AttributeError                            Traceback (most recent call last)
<ipython-input-182-f726a358fe6b> in <module>
      1 # Fast feature importances calculation
----> 2 fast_fi = pipe.base_estimator.get_feature_scores("fast")
      3 fast_fi.set_index("Feature")["Importance"].plot.bar(figsize=(30, 10), grid=True)

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\automl\presets\tabular_presets.py in get_feature_scores(self, calc_method, data, features_names, silent)
    577     ):
    578         if calc_method == "fast":
--> 579             for level in self.levels:
    580                 for pipe in level:
    581                     fi = pipe.pre_selection.get_features_score()

AttributeError: 'TabularNLPAutoML' object has no attribute 'levels'
@alexmryzhkov
Contributor

Hi @fingoldo,

Thanks for the issue. Could you also share the code you use to set up the task, roles, and TabularNLPAutoML, along with the full training log?

Alex

@fingoldo
Author

Thanks for the quick reply, Alex! Sure.
Basically, it's this:

import multiprocessing

import psutil
from lightautoml.automl.presets.text_presets import TabularNLPAutoML
from lightautoml.tasks import Task

N_THREADS = multiprocessing.cpu_count()
MEMORY_LIMIT = psutil.virtual_memory().total * 0.9 / 1024 ** 3  # 90% of RAM, in GiB
verbose = 1
task = Task("reg", loss="mse", metric="mae")
timeout = 60 * 60 * 3  # 3 hours, in seconds
automl = TabularNLPAutoML(task=task, timeout=timeout, cpu_limit=N_THREADS, gpu_ids="all", text_params={"lang": "en"})

automl.fit_predict(X, roles={"text": ["title"], "drop": [], "target": TARGET_COLUMN})

The log:

[14:43:54] Stdout logging level is INFO.

2022-03-27 14:43:54,513 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - set_verbosity_level-line:267 - Stdout logging level is INFO.
2022-03-27 14:43:54,535 - INFO3 - MainProcess[19272]-MainThread[19072]-text_presets.py-lightautoml.automl.presets.text_presets - infer_auto_params-line:230 - Model language mode: en

[14:43:54] Task: reg

2022-03-27 14:43:54,556 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:196 - Task: reg

[14:43:54] Start automl preset with listed constraints:

2022-03-27 14:43:54,558 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:198 - Start automl preset with listed constraints:

[14:43:54] - time: 10800.00 seconds

2022-03-27 14:43:54,559 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:199 - - time: 10800.00 seconds

[14:43:54] - CPU: 32 cores

2022-03-27 14:43:54,561 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:200 - - CPU: 32 cores

[14:43:54] - memory: 16 GB

2022-03-27 14:43:54,563 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:201 - - memory: 16 GB

[14:43:54] Train data shape: (9000, 290)

2022-03-27 14:43:54,565 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.reader.base - fit_read-line:274 - Train data shape: (9000, 290)

2022-03-27 14:43:57,354 - INFO3 - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.reader.base - advanced_roles_guess-line:607 - Feats was rejected during automatic roles guess: []

[14:43:57] Layer 1 train process start. Time left 10797.12 secs

2022-03-27 14:43:57,443 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.base - fit_predict-line:213 - Layer 1 train process start. Time left 10797.12 secs

[14:44:02] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...

2022-03-27 14:44:02,316 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:245 - Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...

[14:44:05] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -940.749755859375

2022-03-27 14:44:05,244 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:293 - Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -940.749755859375

[14:44:05] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed

2022-03-27 14:44:05,246 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:296 - Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed

[14:44:05] Time left 10789.31 secs

2022-03-27 14:44:05,257 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.base - fit_predict-line:223 - Time left 10789.31 secs

2022-03-27 14:44:06,717 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'params': 'FastText(vocab=0, vector_size=64, alpha=0.025)', 'datetime': '2022-03-27T14:44:06.717633', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'created'}
2022-03-27 14:44:06,725 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - scan_vocab-line:578 - collecting all words and their counts
2022-03-27 14:44:06,726 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _scan_vocab-line:561 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-03-27 14:44:06,745 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - scan_vocab-line:584 - collected 10828 word types from a corpus of 46369 raw words and 9000 sentences
2022-03-27 14:44:06,746 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - prepare_vocab-line:633 - Creating a fresh vocabulary
2022-03-27 14:44:06,824 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'effective_min_count=1 retains 10828 unique words (100.0%% of original 10828, drops 0)', 'datetime': '2022-03-27T14:44:06.824618', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'prepare_vocab'}
2022-03-27 14:44:06,825 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'effective_min_count=1 leaves 46369 word corpus (100.0%% of original 46369, drops 0)', 'datetime': '2022-03-27T14:44:06.825618', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'prepare_vocab'}
2022-03-27 14:44:06,968 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - prepare_vocab-line:741 - deleting the raw counts dictionary of 10828 items
2022-03-27 14:44:06,969 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - prepare_vocab-line:744 - sample=0.001 downsamples 40 most-common words
2022-03-27 14:44:06,970 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'downsampling leaves estimated 40640.463918984155 word corpus (87.6%% of prior 46369)', 'datetime': '2022-03-27T14:44:06.970622', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'prepare_vocab'}
2022-03-27 14:44:07,295 - INFO - MainProcess[19272]-MainThread[19072]-fasttext.py-gensim.models.fasttext - estimate_memory-line:493 - estimated required memory for 10828 words, 2000000 buckets and 64 dimensions: 525048308 bytes
2022-03-27 14:44:07,296 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - init_weights-line:859 - resetting layer weights
2022-03-27 14:44:09,287 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2022-03-27T14:44:09.287742', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'build_vocab'}
2022-03-27 14:44:09,289 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'training model with 3 workers on 10828 vocabulary and 64 features, using sg=0 hs=0 sample=0.001 negative=5 window=3 shrink_windows=True', 'datetime': '2022-03-27T14:44:09.289723', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'train'}
2022-03-27 14:44:09,376 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 2 more threads
2022-03-27 14:44:09,409 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 1 more threads
2022-03-27 14:44:09,414 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 0 more threads
2022-03-27 14:44:09,414 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_end-line:1629 - EPOCH - 1 : training on 46369 raw words (40640 effective words) took 0.1s, 404546 effective words/s
2022-03-27 14:44:09,500 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 2 more threads
2022-03-27 14:44:09,531 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 1 more threads
2022-03-27 14:44:09,544 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_progress-line:1288 - worker thread finished; awaiting finish of 0 more threads
2022-03-27 14:44:09,545 - INFO - MainProcess[19272]-MainThread[19072]-word2vec.py-gensim.models.word2vec - _log_epoch_end-line:1629 - EPOCH - 2 : training on 46369 raw words (40644 effective words) took 0.1s, 350692 effective words/s
2022-03-27 14:44:09,546 - INFO - MainProcess[19272]-MainThread[19072]-utils.py-gensim.utils - add_lifecycle_event-line:447 - FastText lifecycle event {'msg': 'training on 92738 raw words (81284 effective words) took 0.3s, 317320 effective words/s', 'datetime': '2022-03-27T14:44:09.546730', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.17763-SP0', 'event': 'train'}
100%|████████████████████████████████████████████████████████████████████████████| 9000/9000 [00:07<00:00, 1273.13it/s]
2022-03-27 14:44:18,279 - INFO3 - MainProcess[19272]-MainThread[19072]-text.py-lightautoml.transformers.text - fit-line:788 - Feature concated__title fitted
2022-03-27 14:44:24,936 - INFO3 - MainProcess[19272]-MainThread[19072]-text.py-lightautoml.transformers.text - transform-line:834 - Feature concated__title transformed

[14:44:24] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...

2022-03-27 14:44:24,992 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:245 - Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...

[14:44:36] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -924.1246948242188

2022-03-27 14:44:36,807 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:293 - Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -924.1246948242188

[14:44:36] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed

2022-03-27 14:44:36,809 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.ml_algo.base - fit_predict-line:296 - Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed

[14:44:36] Time left 10757.75 secs

2022-03-27 14:44:36,816 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.base - fit_predict-line:223 - Time left 10757.75 secs

[14:44:36] Layer 1 training completed.

2022-03-27 14:44:36,818 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.base - fit_predict-line:241 - Layer 1 training completed.

[14:44:36] Blending: optimization starts with equal weights and score -924.7379150390625

2022-03-27 14:44:36,827 - INFO - MainProcess[19272]-MainThread[19072]-blend.py-lightautoml.automl.blend - _optimize-line:370 - Blending: optimization starts with equal weights and score -924.7379150390625

[14:44:36] Blending: iteration 0: score = -922.67333984375, weights = [0.25724643 0.74275357]

2022-03-27 14:44:36,850 - INFO - MainProcess[19272]-MainThread[19072]-blend.py-lightautoml.automl.blend - _optimize-line:395 - Blending: iteration 0: score = -922.67333984375, weights = [0.25724643 0.74275357]

[14:44:36] Blending: iteration 1: score = -922.67333984375, weights = [0.25724643 0.74275357]

2022-03-27 14:44:36,873 - INFO - MainProcess[19272]-MainThread[19072]-blend.py-lightautoml.automl.blend - _optimize-line:395 - Blending: iteration 1: score = -922.67333984375, weights = [0.25724643 0.74275357]

[14:44:36] Blending: no score update. Terminated

2022-03-27 14:44:36,875 - INFO - MainProcess[19272]-MainThread[19072]-blend.py-lightautoml.automl.blend - _optimize-line:402 - Blending: no score update. Terminated

[14:44:36] Automl preset training completed in 42.32 seconds

2022-03-27 14:44:36,883 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:214 - Automl preset training completed in 42.32 seconds

[14:44:36] Model description:
Final prediction for new objects (level 0) = 
	 0.25725 * (3 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
	 0.74275 * (3 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) 

2022-03-27 14:44:36,885 - INFO - MainProcess[19272]-MainThread[19072]-base.py-lightautoml.automl.presets.base - fit_predict-line:215 - Model description:
Final prediction for new objects (level 0) = 
	 0.25725 * (3 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
	 0.74275 * (3 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) 

@alexmryzhkov
Contributor

Hi @fingoldo,

I looked into it: the TabularNLPAutoML preset doesn't use a feature selector (it would be pretty slow in this case), which is why the fast feature importances aren't available. Could you please try the accurate method instead of fast?

Alex
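
For anyone hitting the same error, here is a minimal sketch of the fallback suggested above: try the "fast" importances first and switch to "accurate" when they are unavailable. The helper name feature_scores_with_fallback is hypothetical and not part of LightAutoML. It assumes only the get_feature_scores(calc_method, data, features_names, silent) signature visible in the traceback, and that the "accurate" method needs a dataset to score on.

```python
from typing import Any, Optional


def feature_scores_with_fallback(
    automl: Any, data: Optional[Any] = None, silent: bool = True
) -> Any:
    """Return "fast" feature scores if available, otherwise "accurate" ones.

    `automl` is assumed to be a fitted preset exposing
    get_feature_scores(calc_method, data, features_names, silent).
    This helper is illustrative and not part of LightAutoML.
    """
    try:
        scores = automl.get_feature_scores("fast")
        # Defensive assumption: treat a None result as "not available".
        if scores is not None:
            return scores
    except AttributeError:
        # The failure reported in this thread: the preset has no
        # feature selector, so the "fast" path raises AttributeError.
        pass

    if data is None:
        raise ValueError('The "accurate" method needs `data` to score on.')
    return automl.get_feature_scores("accurate", data, silent=silent)
```

With the setup from this thread, the call would look like feature_scores_with_fallback(automl, X, silent=False), where X is the same DataFrame that was passed to fit_predict.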
