ML: a source-code walkthrough of how LGBMClassifier, XGBClassifier, and CatBoostClassifier compute feature_importances_

Contents

LGBMClassifier
XGBClassifier
CatBoostClassifier
LGBMClassifier

LGBMClassifier.feature_importances_ computes importance with the "split" method by default.
LGBMC.feature_importances_

importance_type='split',  # constructor default: importance is computed from split counts

    @property
    def feature_importances_(self):
        """Get feature importances.

        Note
        ----
        Feature importance in the sklearn interface used to be normalized to 1;
        that behavior was deprecated after 2.0.4, and this is now the same as
        Booster.feature_importance(). The ``importance_type`` attribute is
        passed through to that function to configure which type of importance
        values are extracted.
        """
        if self._n_features is None:
            raise LGBMNotFittedError('No feature_importances found. Need to call fit beforehand.')
        return self.booster_.feature_importance(importance_type=self.importance_type)

    @property
    def booster_(self):
        """Get the underlying lightgbm Booster of this model."""
        if self._Booster is None:
            raise LGBMNotFittedError('No booster found. Need to call fit beforehand.')
        return self._Booster

    # Booster.num_feature, called by Booster.feature_importance below
    def num_feature(self):
        """Get number of features.

        Returns
        -------
        num_feature : int
            The number of features.
        """
        out_num_feature = ctypes.c_int(0)
        _safe_call(_LIB.LGBM_BoosterGetNumFeature(
            self.handle,
            ctypes.byref(out_num_feature)))
        return out_num_feature.value


self.booster_.feature_importance(importance_type=self.importance_type)


    def feature_importance(self, importance_type='split', iteration=None):
        """Get feature importances.

        Parameters
        ----------
        importance_type : string, optional (default="split")
            How the importance is calculated.
            If "split", result contains numbers of times the feature is used in a model.
            If "gain", result contains total gains of splits which use the feature.
        iteration : int or None, optional (default=None)
            Limit number of iterations in the feature importance calculation.
            If None, if the best iteration exists, it is used; otherwise, all trees are used.
            If <= 0, all trees are used (no limits).

        Returns
        -------
        result : numpy array
            Array with feature importances.
        """
        if iteration is None:
            iteration = self.best_iteration
        if importance_type == "split":
            importance_type_int = 0
        elif importance_type == "gain":
            importance_type_int = 1
        else:
            importance_type_int = -1
        result = np.zeros(self.num_feature(), dtype=np.float64)
        _safe_call(_LIB.LGBM_BoosterFeatureImportance(
            self.handle,
            ctypes.c_int(iteration),
            ctypes.c_int(importance_type_int),
            result.ctypes.data_as(ctypes.POINTER(ctypes.c_double))))
        # "split" importances are integer counts, so cast back to int
        if importance_type_int == 0:
            return result.astype(int)
        else:
            return result
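To make the split/gain distinction concrete, here is a minimal usage sketch; the synthetic dataset and hyperparameter values are illustrative assumptions, not part of the source above.

# Minimal sketch: comparing "split" and "gain" importances in LightGBM.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

# importance_type='split' (default): how many times each feature is split on
clf_split = LGBMClassifier(n_estimators=50, importance_type='split').fit(X, y)
print(clf_split.feature_importances_)  # integer split counts per feature

# importance_type='gain': total gain of all splits that use each feature
clf_gain = LGBMClassifier(n_estimators=50, importance_type='gain').fit(X, y)
print(clf_gain.feature_importances_)   # float total gains per feature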

XGBClassifier

XGBClassifier.feature_importances_ computes importance with the "weight" method.

XGBC.feature_importances_


importance_type="weight"   # 默认 gain、weight、cover、total_gain、total_cover
    def feature_importances_(self):
         """
         Feature importances property
        .. note:: Feature importance is defined only for tree boosters
            Feature importance is only defined when the decision tree model is chosen as base learner (`booster=gbtree`). It is not defined for other base learner types, such as linear learners .仅当选择决策树模型作为基础学习者(`booster=gbtree`)时,才定义特征重要性。它不适用于其他基本学习者类型,例如线性学习者(`booster=gblinear`).
        Returns
         -------
         feature_importances_ : array of shape ``[n_features]``
        """
         if getattr(self, 'booster', None) is not None and self.booster != 'gbtree':
             raise AttributeError('Feature importance is not defined for Booster type {}'
                                  .format(self.booster))
         b = self.get_booster()
         score = b.get_score(importance_type=self.importance_type)
         all_features = [score.get(f, 0.) for f in b.feature_names]
         all_features = np.array(all_features, dtype=np.float32)
         return all_features / all_features.sum()
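Because the wrapper divides by the sum, the returned array is a distribution over features. A short sketch (data and settings are illustrative assumptions):

# Minimal sketch: XGBClassifier importances are get_score() values normalized to sum to 1.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] > 0.5).astype(int)

clf = XGBClassifier(n_estimators=50, importance_type='weight').fit(X, y)
imp = clf.feature_importances_
print(imp, imp.sum())  # the importances sum to 1.0 (up to float32 rounding)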
  

get_score

    def get_score(self, fmap='', importance_type='weight'):
        """Get feature importance of each feature.

        Importance type can be defined as:

        * 'weight': the number of times a feature is used to split the data across all trees.
        * 'gain': the average gain across all splits the feature is used in.
        * 'cover': the average coverage across all splits the feature is used in.
        * 'total_gain': the total gain across all splits the feature is used in.
        * 'total_cover': the total coverage across all splits the feature is used in.

        .. note:: Feature importance is defined only for tree boosters

            Feature importance is only defined when the decision tree model is
            chosen as base learner (`booster=gbtree`). It is not defined for
            other base learner types, such as linear learners (`booster=gblinear`).

        Parameters
        ----------
        fmap: str (optional)
            The name of feature map file.
        importance_type: str, default 'weight'
            One of the importance types defined above.
        """
        if getattr(self, 'booster', None) is not None and self.booster not in {'gbtree', 'dart'}:
            raise ValueError('Feature importance is not defined for Booster type {}'
                             .format(self.booster))

        allowed_importance_types = ['weight', 'gain', 'cover', 'total_gain', 'total_cover']
        if importance_type not in allowed_importance_types:
            msg = ("importance_type mismatch, got '{}', expected one of " +
                   repr(allowed_importance_types))
            raise ValueError(msg.format(importance_type))

        # if it's weight, then omap stores the number of missing values
        if importance_type == 'weight':
            # do a simpler tree dump to save time
            trees = self.get_dump(fmap, with_stats=False)

            fmap = {}
            for tree in trees:
                for line in tree.split('\n'):
                    # look for the opening square bracket
                    arr = line.split('[')
                    # if no opening bracket (leaf node), ignore this line
                    if len(arr) == 1:
                        continue

                    # extract feature name from string between []
                    fid = arr[1].split(']')[0].split('<')[0]

                    if fid not in fmap:
                        # if the feature hasn't been seen yet
                        fmap[fid] = 1
                    else:
                        fmap[fid] += 1

            return fmap

        else:
            average_over_splits = True
            if importance_type == 'total_gain':
                importance_type = 'gain'
                average_over_splits = False
            elif importance_type == 'total_cover':
                importance_type = 'cover'
                average_over_splits = False

            trees = self.get_dump(fmap, with_stats=True)

            importance_type += '='
            fmap = {}
            gmap = {}
            for tree in trees:
                for line in tree.split('\n'):
                    # look for the opening square bracket
                    arr = line.split('[')
                    # if no opening bracket (leaf node), ignore this line
                    if len(arr) == 1:
                        continue

                    # look for the closing bracket, extract only info within that bracket
                    fid = arr[1].split(']')

                    # extract gain or cover from string after closing bracket
                    g = float(fid[1].split(importance_type)[1].split(',')[0])

                    # extract feature name from string before closing bracket
                    fid = fid[0].split('<')[0]

                    if fid not in fmap:
                        # if the feature hasn't been seen yet
                        fmap[fid] = 1
                        gmap[fid] = g
                    else:
                        fmap[fid] += 1
                        gmap[fid] += g

            # calculate average value (gain/cover) for each feature
            if average_over_splits:
                for fid in gmap:
                    gmap[fid] = gmap[fid] / fmap[fid]

            return gmap
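The dump parsing above can be exercised directly at the Booster level. Note that features never used in any split are simply absent from the returned dict, which is why the sklearn wrapper falls back to score.get(f, 0.). A short sketch (data and training settings are illustrative assumptions):

# Minimal sketch: querying Booster.get_score() with different importance types.
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] > 0.5).astype(int)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=50)

print(booster.get_score(importance_type='weight'))      # split counts per feature
print(booster.get_score(importance_type='gain'))        # average gain per split
print(booster.get_score(importance_type='total_gain'))  # summed gain over all splits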

CatBoostClassifier

CatBoostClassifier.feature_importances_ chooses its computation method based on is_groupwise_metric(loss).
CatC.feature_importances_
    @property
    def feature_importances_(self):
        loss = self._object._get_loss_function_name()
        # groupwise (ranking) losses use loss-based importance;
        # everything else uses prediction-values-change importance
        if loss and is_groupwise_metric(loss):
            return np.array(getattr(self, "_loss_value_change", None))
        else:
            return np.array(getattr(self, "_prediction_values_change", None))
CatBoost simply uses the difference between the metric (loss function) obtained with the model in the normal case (when the feature is included) and with a model built without that feature (approximately as if the feature were removed from all trees in the ensemble). The bigger the difference, the more important the feature.
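A minimal usage sketch, assuming a plain binary classification setup with a non-groupwise loss (Logloss), so the property returns the prediction-values-change importances; the data and settings are illustrative:

# Minimal sketch: CatBoost importances via the property and the explicit API.
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] > 0.5).astype(int)

clf = CatBoostClassifier(iterations=50, verbose=False).fit(X, y)

# For a non-groupwise loss such as Logloss this is PredictionValuesChange:
print(clf.feature_importances_)

# Equivalent explicit call; CatBoost normalizes these values to sum to 100:
print(clf.get_feature_importance(type='PredictionValuesChange'))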