[Kaggle] Practice Competition: Binary Classification of Insurance Cross Selling

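Preface

This article covers the Kaggle monthly competition "Binary Classification of Insurance Cross Selling". The competition is well suited as hands-on practice for machine learning beginners. In earlier installments of this series we covered, from several angles, exploratory data analysis (EDA) and the basic modeling tools, such as hyperparameter tuning with Optuna and statistical tools for feature engineering. This installment focuses on two problems, handling a large dataset and dealing with class imbalance, and shows how to address both quickly and effectively.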

🔬 Problem Description 🔬

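Insurance cross-selling is a sales strategy in which an insurer offers additional insurance products to its existing customers. The goal is to build on the existing relationship and trust by offering customers who already hold car insurance with the company an additional or complementary policy, such as home insurance. This approach can increase customer value, improve customer retention, and raise the insurer's revenue.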

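The dataset has the following 10 features (column names as they appear in the data):

Age: the customer's age.
Gender: the customer's gender (male or female).
Driving_License: whether the customer holds a valid driving license (1 = yes, 0 = no).
Region_Code: a categorical code for the customer's region.
Previously_Insured: whether the customer was previously insured (1 = yes, 0 = no).
Vehicle_Age: the age of the customer's vehicle, given in categories (for example "1-2 years", "<1 year" and ">2 years").
Vehicle_Damage: whether the customer's vehicle has been damaged in the past (a yes/no flag, loaded as a category column below).
Annual_Premium: the premium the customer pays per year for the policy.
Policy_Sales_Channel: a categorical code for the channel through which the policy was sold (for example, different agents or online channels).
Vintage: the number of days the customer has been associated with the company.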

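🎯 Goal 🎯

Our task is to understand the provided data and build a model that predicts whether a customer will respond positively to another of the company's products.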

📚 Loading Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc

warnings.filterwarnings("ignore")

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, ConfusionMatrixDisplay, classification_report
import xgboost as xgb

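Loading the Data

Because the dataset is large, loading is optimized here to speed it up and to reduce memory use.
The function below reduces the memory footprint of a DataFrame.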

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the smallest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floating-point columns
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns become categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df


def import_data(file, **kwargs):
    """create a dataframe and optimize its memory usage"""
    df = pd.read_csv(file, parse_dates=True, keep_date_col=True, **kwargs)
    df = reduce_mem_usage(df)
    return df

# Load the full training and test sets
train = import_data("/kaggle/input/playground-series-s4e7/train.csv", index_col = "id", engine="pyarrow")

Memory usage of dataframe is 1053.30 MB
Memory usage after optimization is: 274.30 MB
Decreased by 74.0%

test = import_data("/kaggle/input/playground-series-s4e7/test.csv", index_col = "id", engine="pyarrow")

Memory usage of dataframe is 643.68 MB
Memory usage after optimization is: 175.55 MB
Decreased by 72.7%

train["Region_Code"] = train["Region_Code"].astype(np.int8)
test["Region_Code"] = test["Region_Code"].astype(np.int8)

train["Policy_Sales_Channel"] = train["Policy_Sales_Channel"].astype(np.int16)
test["Policy_Sales_Channel"] = test["Policy_Sales_Channel"].astype(np.int16)

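A note on the engine="pyarrow" argument of pandas read_csv: it parses the file with the Arrow C++ implementation, and operations on some data types are reported to be up to 31 times faster. For the benchmark details, see https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i
The custom function reduce_mem_usage re-types each column according to the range of values it actually holds, choosing the smallest type that fits, to reduce memory use. As the output above shows, this cut the size by more than 70%.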
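
To illustrate the principle behind reduce_mem_usage on something small, here is a minimal sketch that downcasts a synthetic frame. The frame, its column names, and the helper smallest_int_dtype are made up for illustration; this is not the function used above, only the same idea in isolation.

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(series):
    """Hypothetical helper: smallest signed integer dtype that can hold the series."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= series.min() and series.max() <= info.max:
            return dtype
    return np.int64


# Synthetic data, for illustration only (value ranges loosely mirror the real columns).
rng = np.random.default_rng(0)
n = 100_000
demo = pd.DataFrame({
    "age": rng.integers(20, 86, size=n, dtype=np.int64),
    "vintage": rng.integers(10, 300, size=n, dtype=np.int64),
    "premium": rng.uniform(2_630, 540_165, size=n),  # float64 by default
})
before = demo.memory_usage(index=False).sum()

small = demo.copy()
for col in ("age", "vintage"):
    small[col] = small[col].astype(smallest_int_dtype(small[col]))
# float32 keeps roughly 7 significant digits, enough for premium-like values.
small["premium"] = small["premium"].astype(np.float32)
after = small.memory_usage(index=False).sum()

print(small.dtypes.to_dict())
print(f"{before} bytes -> {after} bytes")
```

Each row shrinks from 24 bytes (three 8-byte values) to 7 bytes (1 + 2 + 4), which is the same kind of saving seen on the real data.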

Exploring the Data

train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11504798 entries, 0 to 11504797
Data columns (total 11 columns):
 #   Column                Dtype
---  ------                -----
 0   Gender                category
 1   Age                   int8
 2   Driving_License       int8
 3   Region_Code           int8
 4   Previously_Insured    int8
 5   Vehicle_Age           category
 6   Vehicle_Damage        category
 7   Annual_Premium        float32
 8   Policy_Sales_Channel  int16
 9   Vintage               int16
 10  Response              int8
dtypes: category(3), float32(1), int16(2), int8(5)
memory usage: 263.3 MB

test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7669866 entries, 11504798 to 19174663
Data columns (total 10 columns):
 #   Column                Dtype
---  ------                -----
 0   Gender                category
 1   Age                   int8
 2   Driving_License       int8
 3   Region_Code           int8
 4   Previously_Insured    int8
 5   Vehicle_Age           category
 6   Vehicle_Damage        category
 7   Annual_Premium        float32
 8   Policy_Sales_Channel  int16
 9   Vintage               int16
dtypes: category(3), float32(1), int16(2), int8(4)
memory usage: 168.2 MB

Setting the Target Label

target = "Response"

initial_features = test.columns.to_list()
print(initial_features)

['Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']

categorical_features = [col for col in initial_features if pd.concat([train[col], test[col]]).nunique() < 10]
print(categorical_features)

['Gender', 'Driving_License', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage']

numerical_features = list(set(initial_features) - set(categorical_features))
print(numerical_features)

['Region_Code', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Age']

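As the output above shows, the training set is 11,504,798 × 11 and the test set is 7,669,866 × 10. This is far more data than in the earlier installments. Without optimization, fitting and training would be very slow and could run out of memory.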
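
The memory figures reported above can be checked with a little arithmetic. The sketch below assumes that each category column stores one 1-byte code per row and that the id index takes 8 bytes per row (both are assumptions about pandas internals, not something the output states); under those assumptions the numbers line up with the 1053.30 MB and 263.3 MB shown earlier.

```python
# Bytes per value for each storage type (assumed: 1-byte category codes, int64 index).
bytes_per_value = {"int8": 1, "int16": 2, "float32": 4, "category": 1, "index": 8}

# Column counts from train.info(): category(3), float32(1), int16(2), int8(5), plus the index.
column_counts = {"int8": 5, "int16": 2, "float32": 1, "category": 3, "index": 1}

n_rows = 11_504_798

# After optimization: sum of per-column widths.
optimized_row_bytes = sum(bytes_per_value[k] * column_counts[k] for k in column_counts)
optimized_mb = n_rows * optimized_row_bytes / 1024**2

# Before optimization: 11 columns plus the index, 8 bytes each (shallow memory_usage).
raw_row_bytes = 12 * 8
raw_mb = n_rows * raw_row_bytes / 1024**2

print(optimized_row_bytes, f"{optimized_mb:.1f} MB")  # 24 263.3 MB
print(raw_row_bytes, f"{raw_mb:.1f} MB")              # 96 1053.3 MB
```

In other words, each row shrinks from about 96 bytes to about 24 bytes.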

We split the features into numerical and categorical ones, and process and visualize each group separately.

Feature Distributions

train[categorical_features] = train[categorical_features].astype("category")
test[categorical_features] = test[categorical_features].astype("category")

Summary Statistics of the Numerical Features

train.describe().T

x	count	mean	std	min	25%	50%	75%	max
Age	11504798.0	38.383563	14.993459	20.0	24.0	36.0	49.0	85.0
Region_Code	11504798.0	26.418690	12.991590	0.0	15.0	28.0	35.0	52.0
Annual_Premium	11504798.0	30461.359375	16454.744141	2630.0	25277.0	31824.0	39451.0	540165.0
Policy_Sales_Channel	11504798.0	112.425442	54.035708	1.0	29.0	151.0	152.0	163.0
Vintage	11504798.0	163.897744	79.979531	10.0	99.0	166.0	232.0	299.0
Response	11504798.0	0.122997	0.328434	0.0	0.0	0.0	0.0	1.0

test.describe().T

x	count	mean	std	min	25%	50%	75%	max
Age	7669866.0	38.391369	14.999507	20.0	24.0	36.0	49.0	85.0
Region_Code	7669866.0	26.426614	12.994326	0.0	15.0	28.0	35.0	52.0
Annual_Premium	7669866.0	30465.523438	16445.865234	2630.0	25280.0	31827.0	39460.0	540165.0
Policy_Sales_Channel	7669866.0	112.364992	54.073585	1.0	29.0	151.0	152.0	163.0
Vintage	7669866.0	163.899577	79.984449	10.0	99.0	166.0	232.0	299.0

Plotting

Categorical Features

plt.figure(figsize=(12, 18))

for i, col in enumerate(categorical_features):
    plt.subplot(3, 2, i+1)
    train[col].value_counts().plot(kind='pie', autopct='%.2f%%', pctdistance=0.8, fontsize=12)
    # White circle in the middle turns the pie into a donut chart
    plt.gca().add_artist(plt.Circle((0, 0), radius=0.6, fc='white'))
    plt.xlabel(' '.join(col.split('_')), weight='bold', size=20)
    plt.ylabel("")

plt.tight_layout()
plt.suptitle("Pie Chart of Categorical Features", size=28, y=1.02)
plt.show()

Figure: pie charts of the categorical features

plt.figure(figsize=(12, 8))

for i, col in enumerate(categorical_features):
    plt.subplot(2, 3, i+1)
    sns.countplot(x=train[col], hue=train[target])
    plt.xlabel(' '.join(col.split('_')))
    plt.ylabel("Frequency")

plt.tight_layout()
plt.suptitle("Histogram of Categorical Features", size=28, y=1.03)
plt.show()

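The plots above show the distributions of the 5 categorical features and how each one breaks down by the target. The dataset is severely imbalanced: the mean of Response in the describe() output above is 0.122997, so only about 12.3% of the training rows are positive.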
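
The degree of imbalance can be quantified in a couple of lines. The sketch below uses a small stand-in series (877 zeros and 123 ones, chosen to mirror the roughly 12.3% positive rate; it is not the real data) and computes the class shares and the negative-to-positive ratio.

```python
import pandas as pd

# Hypothetical stand-in for train["Response"], mirroring the ~12.3% positive rate.
response = pd.Series([0] * 877 + [1] * 123, name="Response")

# Share of each class.
share = response.value_counts(normalize=True).sort_index()

# Negatives per positive.
imbalance_ratio = (response == 0).sum() / (response == 1).sum()

print(share.to_dict())            # {0: 0.877, 1: 0.123}
print(round(imbalance_ratio, 2))  # 7.13
```

A ratio of this kind (negatives divided by positives) is the value the XGBoost documentation suggests as a starting point for scale_pos_weight. The article itself does not use that parameter, so this is only a pointer for readers who want to try class weighting.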

Numerical Features

Because the dataset is so large, we plot the distributions on a random 5% sample.

train_sampled = train.sample(frac=0.05)

plt.figure(figsize=(12, 8))

for i, col in enumerate(numerical_features):
    plt.subplot(2, 3, i+1)
    sns.histplot(data=train_sampled, x=col, hue=target)
    plt.xlabel(' '.join(col.split('_')))
    plt.ylabel("Frequency")

plt.tight_layout()
plt.suptitle("Histogram of Numerical Features (5% Data)", size=28, y=1.03)
plt.show()

plt.figure(figsize=(12, 18))

for i, col in enumerate(numerical_features):
    plt.subplot(3, 2, i+1)
    sns.boxplot(data=train_sampled, x=col, hue=target)
    plt.xlabel(' '.join(col.split('_')), weight="bold", size=20)
    plt.ylabel("")

plt.tight_layout()
plt.suptitle("Box Plot of Numerical Features (5% Data)", size=28, y=1.02)
plt.show()

plt.figure(figsize=(12, 18))

for i, col in enumerate(numerical_features):
    plt.subplot(3, 2, i+1)
    sns.violinplot(data=train_sampled, x=col, hue=target)
    plt.xlabel(' '.join(col.split('_')), weight="bold", size=20)
    plt.ylabel("")

plt.tight_layout()
plt.suptitle("Violin Plot of Numerical Features (5% Data)", size=28, y=1.02)
plt.show()

The plots above show the distributions of the numerical features as histograms, box plots, and violin plots.

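Feature Selection

🤝 Mutual Information Scores 🤝

Mutual information scores tell us how much each feature says about the target variable. We can use them to drop features that contribute little to explaining the target.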

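# Assumption: train_copy is not defined in the original listing. Here it is taken to be
# a copy of train with the category columns replaced by their integer codes.
train_copy = train.copy()
for col in train_copy.select_dtypes("category").columns:
    train_copy[col] = train_copy[col].cat.codes

X_copy = train_copy.sample(frac=0.05)
y_copy = X_copy.pop(target)

mi_scores = mutual_info_classif(X_copy, y_copy, discrete_features=X_copy.dtypes==int, n_neighbors=5, random_state=42)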
mi_scores = pd.Series(mi_scores, index=initial_features)
mi_scores = mi_scores.sort_values(ascending=False)
mi_scores

Vehicle_Damage 0.113363
Previously_Insured 0.082177
Policy_Sales_Channel 0.055440
Vehicle_Age 0.044807
Gender 0.041255
Age 0.034908
Annual_Premium 0.028797
Region_Code 0.017658
Vintage 0.013878
Driving_License 0.000081
dtype: float64

mi_scores.plot(kind='barh', title='Mutual Info Score of Features', figsize=(12, 8), xlabel="Score", ylabel="Feature")
plt.show()

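Feature selection is an important step in machine learning: it filters out useless or redundant features, which can improve a model's accuracy and interpretability. Mutual information is one commonly used feature selection method.
Mutual information measures the dependence between two variables, that is, how much knowing one random variable tells us about the other. For feature selection, we compute the mutual information between each feature and the target to judge whether the feature has predictive power.
In Python, scikit-learn provides the functions mutual_info_classif and mutual_info_regression for this.
See the official scikit-learn documentation for details.
Principle: when selecting features, the dependence measure should match the types of the feature $X$ and the target $Y$. The mutual information computed here applies when both the feature and the target are categorical. The probabilities in the mutual information formula are replaced by the observed frequency of each level: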

$p_i=\frac{n(X=x_i)}{N} \ , p_j=\frac{n(Y=y_j)}{N} \ ,p_{ij}=\frac{n(X=x_i,Y=y_j)}{N}$
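
With these frequency estimates, the mutual information of two discrete variables can be written out as follows. The formula itself is not shown in the original text; this is the standard textbook definition, with the convention that terms with $p_{ij}=0$ contribute zero:

$I(X;Y)=\sum_{i}\sum_{j} p_{ij}\,\log\frac{p_{ij}}{p_i\,p_j}$

When $X$ and $Y$ are independent, $p_{ij}=p_i\,p_j$ for every pair, each logarithm is $\log 1 = 0$, and so $I(X;Y)=0$. The more the joint frequencies deviate from the product of the marginals, the larger the score.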

import pandas as pd
import numpy as np


def mutual_infor(X, y):
    '''
    Mutual Information
    X and y are both categorical variables.
    Input : {
        X : one-dimensional array, list or Series from pandas
        y : one-dimensional array, list or Series from pandas
    }
    '''
    X = np.array(X).reshape(-1)
    y = np.array(y).reshape(-1)
    if len(X) != len(y):
        raise ValueError('Length of X and y are inconsistent !')

    X_level = list(set(X))
    y_level = list(set(y))
    N = X.shape[0]
    I = 0
    for i in X_level:
        for j in y_level:
            # Joint frequency: count rows where both conditions hold, then divide by N
            p_xy = np.sum((X == i) & (y == j)) / N
            p_x = np.sum(X == i) / N
            p_y = np.sum(y == j) / N
            # Skip empty cells: 0 * log(0) is taken as 0
            if p_xy > 0:
                I += p_xy * np.log(p_xy / (p_y * p_x))
    return I

Freeing Memory

del train_copy, train_sampled, X_copy, y_copy
gc.collect()

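Model Training

We use an XGBoost classifier for the prediction task. Because the data is imbalanced, we try the following three approaches.

Train the model on the original training set and evaluate it.
Undersample the training set, then train the model and evaluate it.
Oversample the data, then train the model and evaluate it.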

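Evaluation Metric

The metric we use to evaluate model performance is the area under the ROC curve (ROC-AUC).
The ROC curve (receiver operating characteristic curve) is a plot of a classification model's performance across all classification thresholds. The curve plots two quantities:

True positive rate (TPR) $=\frac{TP}{TP+FN}$
False positive rate (FPR) $=\frac{FP}{FP+TN}$

The ROC curve shows the relationship between TPR and FPR as the classification threshold varies. Lowering the threshold classifies more items as positive, which increases both false positives and true positives. The original post illustrated a typical ROC curve with a figure at this point.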

ROC-AUC, or simply AUC, measures the entire two-dimensional area under the ROC curve from (0,0) to (1,1).
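
To make the definition concrete, the following minimal sketch builds the ROC points by sweeping the threshold over a handful of made-up scores and integrates the curve with the trapezoidal rule. The functions roc_points and auc_trapezoid are illustrative helpers written for this example, not scikit-learn APIs; in practice roc_curve and roc_auc_score do this work.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) obtained by sweeping the threshold from high to low."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Start above the maximum score so the curve begins at (0, 0).
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    tpr = np.array([np.sum((scores >= t) & (y_true == 1)) / n_pos for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (y_true == 0)) / n_neg for t in thresholds])
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


# Toy example: two negatives and two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print(fpr, tpr)
print(auc)  # 0.75
```

The value 0.75 has a direct reading: of the four positive-negative pairs, three have the positive scored higher than the negative. AUC depends only on how the scores rank the rows, not on their absolute values.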

XGB parameters: these values were obtained by tuning with Optuna (see the earlier articles in this series for the tuning procedure).

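# Assumption: the original listing does not show how X, y and the train/validation split
# were created. This split mirrors the settings used for the resampled data further below.
X = train.drop(columns=target)
y = train[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42, stratify=y)

xgb_params = {
    'eval_metric': 'auc',
    'n_estimators': 2000,
    'eta': 0.1,
    'alpha': 0.1269124823585012,
    'subsample': 0.8345882521794742,
    'colsample_bytree': 0.44270196445757065,
    'max_depth': 15,
    'tree_method': 'hist',
    'min_child_weight': 8,
    'gamma': 1.308021832047589e-08,
    'max_bin': 50000,
    'n_jobs': -1,
    'device': 'cuda',
    'enable_categorical': True,
    'early_stopping_rounds': 50,
}

xgb_clf = xgb.XGBClassifier(**xgb_params)

xgb_clf = xgb_clf.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    verbose=500
)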

[0] validation_0-auc:0.84898
[500] validation_0-auc:0.88950
[857] validation_0-auc:0.89000

y_test_pred = xgb_clf.predict(X_test, iteration_range=(0, xgb_clf.best_iteration + 1))

y_test_prob = xgb_clf.get_booster().predict(xgb.DMatrix(X_test, enable_categorical=True),iteration_range=(0, xgb_clf.best_iteration + 1))

print(f"AUC: {roc_auc_score(y_test, y_test_prob):.6f}")

print("\nClassification Report:")
print(classification_report(y_test, y_test_pred), end="\n\n")

ConfusionMatrixDisplay(confusion_matrix(y_test, y_test_pred)).plot()
plt.title("Confusion Matrix", size=28)
plt.show()

The results are as follows:

AUC: 0.890087

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.98      0.94   2017948
           1       0.60      0.20      0.30    283012

    accuracy                           0.89   2300960
   macro avg       0.75      0.59      0.62   2300960
weighted avg       0.86      0.89      0.86   2300960

fpr, tpr, thresholds = roc_curve(y_test, y_test_prob)

plt.plot(fpr, tpr, lw=2, label=f"ROC Curve (Area= {roc_auc_score(y_test, y_test_prob):.6f})")
plt.plot([0, 1], [0, 1], 'k:')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) curve")
plt.legend()
plt.show()

An AUC of 0.890087 on the held-out split is a very good result.

Baseline Predictions

test_pred_base = xgb_clf.get_booster().predict(xgb.DMatrix(test, enable_categorical=True),iteration_range=(0, xgb_clf.best_iteration + 1))

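Handling Imbalanced Data

Undersampling: randomly select samples from the majority class so that the majority and minority classes reach a chosen ratio. This can lose information, because informative majority-class samples may be discarded.
Oversampling: increase the number of minority-class samples by duplicating or generating them, so that the two classes become balanced. This can lead to overfitting, because the same samples are repeated many times and the model focuses too much on the minority class.
SMOTE (Synthetic Minority Over-sampling Technique): for each minority-class sample, pick one of its K nearest minority-class neighbors at random and generate a new synthetic sample by linear interpolation between the two. This increases the number of minority samples while mitigating the overfitting that plain duplication can cause.
Ensemble methods: combining the predictions of several classifiers can improve performance on imbalanced data. For example, several base classifiers such as decision trees or support vector machines can be combined by voting or by weighting to give the final prediction.
Class weights: a classification model can be balanced by assigning class weights. The rarer class receives a higher weight, which increases its importance during training.
Threshold adjusting: for a binary classifier, the decision threshold can be moved to rebalance the predictions. A lower threshold for the rarer class increases the chance that its members are classified correctly.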
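
As an illustration of the last item, threshold adjusting, the sketch below evaluates precision and recall at two thresholds on a handful of made-up labels and probabilities. The helper precision_recall_at and the toy numbers are hypothetical and only demonstrate the mechanism.

```python
import numpy as np


def precision_recall_at(y_true, prob, threshold):
    """Precision and recall of the positive class at a given decision threshold."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(prob) >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: six negatives followed by four positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.25, 0.40, 0.45, 0.70]

p_default, r_default = precision_recall_at(y_true, prob, 0.5)
p_lowered, r_lowered = precision_recall_at(y_true, prob, 0.3)

print(p_default, r_default)  # 0.5 0.25
print(p_lowered, r_lowered)  # 0.5 0.75
```

Lowering the threshold from 0.5 to 0.3 triples the recall on this toy data. The threshold changes the hard class predictions (and therefore the classification report), but it does not change the ROC-AUC, which is computed from the predicted probabilities.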

Undersampling

rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X, y)

X_train_rus, X_test_rus, y_train_rus, y_test_rus = train_test_split(X_rus, y_rus, test_size=0.2,shuffle=True, random_state=42, stratify=y_rus)
xgb_clf_rus = xgb.XGBClassifier(**xgb_params)

xgb_clf_rus = xgb_clf_rus.fit(X_train_rus,y_train_rus,eval_set=[(X_test_rus, y_test_rus)],verbose=500
)

[0] validation_0-auc:0.84843
[358] validation_0-auc:0.88601

test_pred_rus = xgb_clf_rus.get_booster().predict(xgb.DMatrix(test, enable_categorical=True),iteration_range=(0, xgb_clf_rus.best_iteration + 1))

Oversampling

ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(X, y)

X_train_ros, X_test_ros, y_train_ros, y_test_ros = train_test_split(X_ros, y_ros, test_size=0.2,shuffle=True, random_state=42, stratify=y_ros)
xgb_clf_ros = xgb.XGBClassifier(**xgb_params)

xgb_clf_ros = xgb_clf_ros.fit(X_train_ros,y_train_ros,eval_set=[(X_test_ros, y_test_ros)],verbose=500
)

[0] validation_0-auc:0.84948
[500] validation_0-auc:0.90160
[1000] validation_0-auc:0.90728
[1500] validation_0-auc:0.91109
[1999] validation_0-auc:0.91395
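
One thing to keep in mind when reading this score: the listing above resamples X and y first and splits afterwards, so with random oversampling, copies of the same original row can end up in both the training and the validation part. The sketch below shows the alternative order on toy labels: hold out the validation rows first, then oversample only the training indices. The helper oversample_minority is a plain numpy illustration written for this example, not imblearn's implementation.

```python
import numpy as np


def oversample_minority(idx, y, rng):
    """Resample minority-class indices (with replacement) up to the majority count."""
    idx = np.asarray(idx)
    labels = y[idx]
    pos, neg = idx[labels == 1], idx[labels == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    return np.concatenate([idx, extra])


rng = np.random.default_rng(42)

# Toy labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
neg_idx = np.where(y == 0)[0]
pos_idx = np.where(y == 1)[0]

# Stratified 20% hold-out taken BEFORE any resampling.
valid_idx = np.concatenate([neg_idx[:18], pos_idx[:2]])
train_idx = np.concatenate([neg_idx[18:], pos_idx[2:]])

# Oversample only the training part; the validation rows stay untouched.
train_bal = oversample_minority(train_idx, y, rng)

print(np.bincount(y[train_bal]))   # [72 72]
print(np.bincount(y[valid_idx]))   # [18  2]
```

With this order, the validation set keeps the original class ratio and shares no rows with the resampled training set, so its AUC is comparable with the baseline model's.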

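Prediction

Because the model trained on the oversampled data achieved the highest validation AUC, we use it for the final prediction. Note that each model was validated on a split of its own resampled data, so the three validation scores are not directly comparable.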

test_pred_ros = xgb_clf_ros.get_booster().predict(xgb.DMatrix(test, enable_categorical=True),iteration_range=(0, xgb_clf_ros.best_iteration + 1))

sub = pd.DataFrame({'id': test.index,'Response': test_pred_ros
})

sub

x	id	Response
0	11504798	0.004572
1	11504799	0.923104
2	11504800	0.709704
3	11504801	0.000057
4	11504802	0.676349
…	…	…
7669861	19174659	0.615310
7669862	19174660	0.000439
7669863	19174661	0.000619
7669864	19174662	0.886831
7669865	19174663	0.000260

7669866 rows × 2 columns

sub.to_csv("submission_ros.csv", index=False)

All Submissions

pd.DataFrame({'id': test.index,'Response': test_pred_base
}).to_csv('submission_base.csv', index=False)

pd.DataFrame({'id': test.index,'Response': test_pred_rus
}).to_csv('submission_rus.csv', index=False)

pd.DataFrame({'id': test.index,'Response': np.mean([test_pred_base, test_pred_rus, test_pred_ros], axis=0)
}).to_csv('submission_master.csv', index=False)
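
The last submission above blends the three models by taking the element-wise mean of their predicted probabilities. The following minimal sketch shows that operation on made-up numbers, together with a weighted variant. The arrays and the weights are hypothetical and were not tuned on anything.

```python
import numpy as np

# Hypothetical predicted probabilities from three models for four test rows.
pred_base = np.array([0.10, 0.80, 0.40, 0.05])
pred_rus = np.array([0.30, 0.90, 0.60, 0.20])
pred_ros = np.array([0.20, 0.85, 0.50, 0.05])

# Simple average, as in the submission_master.csv line above.
blend = np.mean([pred_base, pred_rus, pred_ros], axis=0)

# Weighted average: one possible variant if some models deserve more trust.
weights = [0.5, 0.2, 0.3]  # illustrative values only
weighted = np.average([pred_base, pred_rus, pred_ros], axis=0, weights=weights)

print(blend)
print(weighted)
```

Models trained on resampled data tend to output higher probabilities for the positive class than the baseline model. Since AUC depends only on the ranking of the rows, averaging still works as a blend, but the blended values should not be read as calibrated probabilities.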

Submitting to the Platform
The original post showed a screenshot of the submission score here as evidence.

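Conclusion

This article dealt with a binary classification problem. Many of the methods involved were covered in detail in earlier articles and are not repeated here. What sets this article apart is that the dataset is both very large and imbalanced, and solving those two problems is its core. To that end, we applied the following optimizations and strategies:

We used the fast CSV reading option engine="pyarrow" available in pandas 2.0 to speed up data loading. Other libraries, such as polars, are also efficient data processing tools that interested readers may want to try.
Given the size of the dataset, we optimized the column data types to keep memory consumption as low as possible.
When visualizing large amounts of data, it is advisable to plot only part of it. Avoid taking the first or last rows of the dataset; use random sampling instead.
We used mutual information for feature selection, a technique that measures the dependence between variables. Alternatives include the Pearson correlation coefficient, the chi-squared test, and the Fisher score.
To address the class imbalance, we trained on the original data, on undersampled data, and on oversampled data, and averaged the three sets of predictions to get the best result.
Overall, the score of 0.89095 in this competition is mainly due to the methods above. To improve it further, one could try neural network models such as an artificial neural network (ANN), since a dataset of this size is well suited to deep learning. Finally, blending the results of several models before submitting may yield a better score.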