Hands-On: Predicting Post-Earthquake Building Repair Recommendations

Original article: mp.weixin.qq.com

Heh, finally another update.

The previous post covered how to do data exploration with Python.

This time, let's put it to the test with a small competition.

This post mainly covers data exploration, feature generation, and some common modeling techniques, such as generating features with XGBoost, blending, neural networks, and so on.

As usual, the notebook file can be found and run directly here: https://github.com/wmpscc/DataMiningNotesAndPractice


Introduction

Post-earthquake building repair recommendation is a practice competition hosted by SofaSofa. You can treat it as a multi-class classification problem or as a small recommendation system: there are four classes, and the evaluation metric is MAP@2. I entered it to practice what the previous posts covered, namely exploratory data analysis (EDA) and data preprocessing, plus the blending and stacking used during modeling. The full code is in the Jupyter notebook file; below is a brief walkthrough.

EDA

Start with the basics:

    trainData.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 652936 entries, 0 to 652935
    Data columns (total 15 columns):
    id                   652936 non-null int64
    district_id          652936 non-null int64
    area_id              652936 non-null int64
    floors_before        652936 non-null int64
    floors_after         652936 non-null int64
    age                  652936 non-null int64
    area                 652936 non-null int64
    height_before        652936 non-null int64
    height_after         652936 non-null int64
    land_condition       652936 non-null object
    foundation_type      652936 non-null object
    roof_type            652936 non-null object
    ground_floor_type    652936 non-null object
    position             652936 non-null object
    y                    652936 non-null int64
    dtypes: int64(10), object(5)
    memory usage: 74.7+ MB

    trainData.describe()

Correlation matrix

    corrMatrix = trainData.corr()
    corrMatrix

    corrMatrix['y']

    id               0.001189
    district_id     -0.079135
    area_id         -0.078146
    floors_before    0.186285
    floors_after    -0.406570
    age              0.044594
    area            -0.152052
    height_before    0.086521
    height_after    -0.442474
    y                1.000000
    Name: y, dtype: float64

Discretization

Discretization is risky: the discretized feature will not necessarily perform better than the original, and the bin boundaries are normally set on expert advice rather than guessed arbitrarily. Here an unsupervised model (the K-means algorithm) is used to cluster the id-type columns into segments. First look at the distribution and guess how many clusters there should be, since K-means requires the number of clusters up front. Visualize the data with seaborn:

    sns.distplot(trainData['district_id'])

From the visualized distribution, 6 clusters looks like a reasonable guess. (There is no single right answer for the number of clusters; try a few values and keep the best, as sketched after the code below.)

    from sklearn.cluster import KMeans

    est = KMeans(n_clusters=6, init="k-means++", n_jobs=-1)
    # a pandas Series has no reshape(); use the underlying numpy array
    est.fit(trainData['district_id'].values.reshape(-1, 1))
    trainData['district_id'] = est.predict(trainData['district_id'].values.reshape(-1, 1))

All the id-type columns are handled this way.
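Since there is no definitive cluster count, one way to "try several and keep the best" is to compare silhouette scores over candidate values of k. A minimal sketch, not from the original notebook:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = trainData['district_id'].values.reshape(-1, 1)
    for k in range(3, 9):
        labels = KMeans(n_clusters=k, init="k-means++", n_jobs=-1).fit_predict(X)
        # silhouette_score is expensive on ~650k rows, so score a random sample
        print(k, silhouette_score(X, labels, sample_size=10000))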

Handling outliers

The age attribute

Visualized, it looks like this:

    sns.distplot(trainData['age'])

The distribution stretches all the way to 1000, which looks suspicious. Draw a scatter plot to check.

    sns.jointplot(data=trainData, x='id', y='age')

A clump of data suddenly appears at 1000; presumably the problem setters planted it deliberately. The treatment here is to simply drop those rows.

    # drop rows above the threshold
    index = trainData['age'] <= 176
    trainData = trainData[index]

The floors attribute

The dataset provides floor counts from before and after the earthquake, so presumably some anomalies exist where a building has more floors after the quake than before. Visualize first.

Before the earthquake: floors_before

    sns.distplot(trainData['floors_before'])

After the earthquake: floors_after

    sns.distplot(trainData['floors_after'])

Comparing floor counts before and after:

    plt.plot(trainData['id'], trainData['floors_before'], trainData['id'], trainData['floors_after'])

The plot confirms that some rows are indeed like that. First count them:

    error_floor = trainData['floors_before'] < trainData['floors_after']
    # number of buildings with more floors after the quake than before
    error_floor.sum()

    1838

There are 1838 of them; drop them outright.

    # drop them directly
    index = trainData['floors_before'] >= trainData['floors_after']
    trainData = trainData[index]

The height attribute

Heights before and after are also provided; the treatment is the same.

    error_height = trainData['height_after'] > trainData['height_before']
    error_height.sum()

    1517

    index = trainData['height_after'] <= trainData['height_before']
    trainData = trainData[index]

Categorical data: one-hot encoding

    trainData = pd.get_dummies(trainData, columns=['position', 'land_condition', 'foundation_type', 'roof_type', 'ground_floor_type'])

Constructing features

Construct features with arithmetic (addition, subtraction, multiplication, division):

    trainData['per_floor_height_before'] = trainData['height_before'] / trainData['floors_before']
    trainData['per_floor_height_after'] = trainData['height_after'] / trainData['floors_after']
    trainData['age_area'] = trainData['age'] / trainData['area']

Label-encoding the categorical data

    # these are presumably Series copies of the original categorical columns,
    # saved before get_dummies replaced them in trainData
    land_condition.replace(['F', 'M', 'S'], [1, 2, 3], inplace=True)
    foundation_type.replace(['M', 'C', 'R', 'B', 'O'], [5, 4, 3, 2, 1], inplace=True)
    roof_type.replace(['L', 'H', 'R'], [3, 2, 1], inplace=True)
    ground_floor_type.replace(['M', 'R', 'B', 'T', 'O'], [5, 4, 3, 2, 1], inplace=True)

Construct new features from the label-encoded data (not the one-hot columns):

    trainData['4_rebuild'] = land_condition + foundation_type + roof_type + ground_floor_type
    trainData['l_f'] = land_condition + foundation_type
    trainData['l_r'] = land_condition + roof_type
    trainData['l_g'] = land_condition + ground_floor_type
    trainData['f_r'] = foundation_type + roof_type
    trainData['f_g'] = foundation_type + ground_floor_type
    trainData['r_g'] = roof_type + ground_floor_type

Feature-importance plot from a LightGBM model

    import lightgbm as lgb

    params = {
        'learning_rate': 0.1,
        'lambda_l1': 0.1,
        'lambda_l2': 0.2,
        'max_depth': 4,
        'objective': 'multiclass',
        'num_class': 4
    }
    lgb_train = lgb.Dataset(train, y)
    # note: the eval set here is the training set itself, so early stopping cannot catch overfitting
    lgb_eval = lgb.Dataset(train, y)
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=50,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)
    lgb.plot_importance(gbm, figsize=(10, 10))

The new correlation matrix

    corr = trainData.corr()
    corr['y'].sort_values()

    per_floor_height_after     -0.517127
    height_after               -0.443536
    floors_after               -0.405705
    ground_floor_type_R        -0.382114
    roof_type_R                -0.331644
    foundation_type_R          -0.314671
    foundation_type_B          -0.205903
    area_id                    -0.175130
    foundation_type_C          -0.172373
    area                       -0.149299
    per_floor_height_before    -0.146806
    district_id                -0.085735
    position_Not attached      -0.049879
    foundation_type_O          -0.030112
    land_condition_F           -0.023559
    ground_floor_type_O        -0.022835
    ground_floor_type_T        -0.016830
    position_Attached-2 side   -0.012019
    ground_floor_type_B         0.002914
    land_condition_M            0.016435
    position_Attached-3 side    0.017995
    land_condition_S            0.018032
    position_Attached-1 side    0.058592
    roof_type_H                 0.082415
    height_before               0.094980
    roof_type_L                 0.097213
    l_g                         0.156026
    l_r                         0.174592
    floors_before               0.192760
    age_area                    0.202228
    age                         0.222218
    r_g                         0.244821
    ground_floor_type_M         0.283176
    l_f                         0.336764
    4_rebuild                   0.365961
    f_r                         0.373940
    f_g                         0.375418
    foundation_type_M           0.414113
    y                           1.000000
    Name: y, dtype: float64

Several of the constructed features show fairly strong correlation with the target.

Modeling

Building the scoring function

The competition's evaluation metric is MAP@2.

In short: for each building, a correct primary repair recommendation scores 1 point, a correct secondary recommendation scores 0.5, and neither being correct scores 0. The mean score over all buildings is the MAP@2.

    def test_score(y1, y2, trueLabels):
        pred_score = (y1 == trueLabels).sum() / len(trueLabels)
        pred_score += (y2 == trueLabels).sum() * 0.5 / len(trueLabels)
        return pred_score
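A quick sanity check of the scorer on made-up values (not from the original notebook):

    import numpy as np

    y_true = np.array([0, 1, 2, 3])
    y1 = np.array([0, 1, 0, 0])  # primary: 2 of 4 correct -> 0.5
    y2 = np.array([1, 0, 2, 2])  # secondary: 1 of 4 correct -> 0.5 * 0.25 = 0.125
    print(test_score(y1, y2, y_true))  # 0.625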

XGBOOST

    import xgboost as xgb

    xgb_model = xgb.XGBClassifier(objective='multi:softmax',
                                  eval_metric=['map@2', 'merror'],
                                  n_estimators=700,
                                  num_class=4,
                                  silent=1,
                                  max_depth=6,
                                  nthread=4,
                                  learning_rate=0.1,
                                  gamma=0.5,
                                  min_child_weight=0.6,
                                  max_delta_step=0.1,
                                  subsample=0.6,
                                  colsample_bytree=0.7,
                                  reg_lambda=0.4,
                                  reg_alpha=0.8,
                                  # note: num_leaves, early_stopping_rounds and num_boost_round
                                  # are LightGBM / native-API style arguments that the sklearn
                                  # wrapper does not use here
                                  num_leaves=250,
                                  early_stopping_rounds=20,
                                  num_boost_round=8000,
                                  scale_pos_weight=1)
    xgb_model.fit(train, y)
    pb = xgb_model.predict_proba(train)
    pb = np.array(pb)
    submit = pd.DataFrame()
    # argsort is ascending, so the last two columns hold the top-2 classes
    submit['y1'] = pb.argsort()[np.arange(len(pb)), -1]
    submit['y2'] = pb.argsort()[np.arange(len(pb)), -2]
    # note: this score is computed on the training set
    print(test_score(submit['y1'].values, submit['y2'].values, y))

    0.774950502878
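The same top-2 extraction recurs for every model below, so it can be factored into a small helper. A minimal sketch (the name top2_from_proba is mine, not from the original notebook):

    import numpy as np
    import pandas as pd

    def top2_from_proba(pb):
        """Return the classes with the highest and second-highest probability per row."""
        pb = np.asarray(pb)
        order = pb.argsort(axis=1)   # ascending, so the largest sit in the last columns
        submit = pd.DataFrame()
        submit['y1'] = order[:, -1]  # primary recommendation
        submit['y2'] = order[:, -2]  # secondary recommendation
        return submit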

LightGBM

    import lightgbm as lgb

    lgb_train = lgb.Dataset(train[:600000], y[:600000])
    lgb_eval = lgb.Dataset(train[600000:], y[600000:], reference=lgb_train)
    params = {
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'num_class': 4,
        'metric': ['multi_error', 'map@2'],
        'num_leaves': 250,
        'min_data_in_leaf': 100,
        'learning_rate': 0.1,
        # 'feature_fraction': 0.3,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'lambda_l1': 0.4,
        'lambda_l2': 0.6,
        'max_depth': 6,
        # 'min_gain_to_split': 0.2,
        'verbose': 5,
        'is_unbalance': True
    }
    print('Start training...')
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=8000,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=500)

    print('Start predicting...')
    pb = gbm.predict(train, num_iteration=gbm.best_iteration)
    pb = np.array(pb)
    submit = pd.DataFrame()
    submit['y1'] = pb.argsort()[np.arange(len(pb)), -1]
    submit['y2'] = pb.argsort()[np.arange(len(pb)), -2]
    print(test_score(submit['y1'].values, submit['y2'].values, y))

    Start predicting...
    0.796050152949

Neural network

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder()
    enc.fit(y.reshape(-1, 1))
    # note: transform() returns a sparse matrix; depending on the Keras version,
    # y_hot.toarray() may be needed before fitting
    y_hot = enc.transform(y.reshape(-1, 1))

    # build the neural network model
    from keras.models import Sequential
    from keras.layers import Dense, Activation, Dropout
    from keras.metrics import top_k_categorical_accuracy
    from keras.callbacks import EarlyStopping

    netfile = './net.model'  # path for saving the trained weights

    # top-2 accuracy, matching the MAP@2 flavour of the metric
    # (defined here, though compile() below uses plain accuracy)
    def acc_top2(y_true, y_pred):
        return top_k_categorical_accuracy(y_true, y_pred, k=2)

    net = Sequential()
    # input_dim/output_dim is Keras 1 syntax; pass the units positionally instead
    net.add(Dense(128, input_dim=38))
    net.add(Activation('relu'))
    net.add(Dense(256))
    net.add(Activation('relu'))
    net.add(Dense(256))
    net.add(Activation('relu'))
    net.add(Dropout(0.3))
    net.add(Dense(512))
    net.add(Activation('relu'))
    net.add(Dense(4))
    net.add(Activation('softmax'))
    net.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    early_stopping = EarlyStopping(monitor='val_loss', patience=50, verbose=2)
    # note: the validation slice is also part of the training data
    net.fit(train, y_hot, epochs=150, batch_size=4096,
            validation_data=(train[600000:], y_hot[600000:]), callbacks=[early_stopping])
    net.save_weights(netfile)  # save the model weights

    # in recent Keras versions, use net.predict(...) instead of predict_proba
    predict_prob = net.predict_proba(train[600000:])
    pb = np.array(predict_prob)
    submit = pd.DataFrame()
    submit['y1'] = pb.argsort()[np.arange(len(pb)), -1]
    submit['y2'] = pb.argsort()[np.arange(len(pb)), -2]
    print(test_score(submit['y1'].values, submit['y2'].values, y[600000:]))

    0.775790784004

Generating new features with XGBoost

    from sklearn.model_selection import train_test_split

    # test_size is the fraction held out as the test set
    X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.2, random_state=0)
    # X_train_1 trains the feature-generating model; X_train_2 is combined with the new features
    X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, test_size=0.7, random_state=0)

    def mergeToOne(X, X2):
        return np.hstack((X, X2))

    from xgboost.sklearn import XGBClassifier

    xgb = XGBClassifier(booster='gbtree',
                        learning_rate=0.1,
                        objective='multi:softmax',
                        num_class=4,
                        gamma=0.05,
                        subsample=0.4,
                        reg_alpha=1e-05,
                        n_estimators=50,
                        # the original passed metric='multi_logloss', which is a
                        # LightGBM parameter name; XGBoost's equivalent is eval_metric
                        eval_metric='mlogloss',
                        colsample_bytree=0.7,
                        silent=1,
                        nthread=4)
    xgb.fit(X_train_1, y_train_1)
    # apply() returns, for each sample, the index of the leaf it lands in per tree
    new_feature = xgb.apply(X_train_2)
    X_train_new2 = mergeToOne(X_train_2, new_feature)
    new_feature_test = xgb.apply(X_test)
    X_test_new = mergeToOne(X_test, new_feature_test)
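The leaf indices returned by apply() are categorical, not ordinal, so before feeding them to a linear model they are usually one-hot encoded first (the classic GBDT+LR trick). A hedged sketch, not part of the original notebook:

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LogisticRegression

    # one-hot encode the leaf indices so a linear model doesn't treat them as ordered numbers
    leaf_enc = OneHotEncoder(handle_unknown='ignore')
    train_leaves = leaf_enc.fit_transform(xgb.apply(X_train_2))
    test_leaves = leaf_enc.transform(xgb.apply(X_test))

    lr = LogisticRegression()
    lr.fit(train_leaves, y_train_2)
    print(lr.score(test_leaves, y_test))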

Blend

    import numpy as np
    # sklearn.cross_validation was removed; StratifiedKFold now lives in model_selection
    from sklearn.model_selection import StratifiedKFold
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from xgboost.sklearn import XGBClassifier
    import lightgbm as lgb

    def blend(X, y, X_submission, n_folds):
        skf = list(StratifiedKFold(n_splits=n_folds).split(X, y))
        clfs = [RandomForestClassifier(n_estimators=150, min_samples_split=90, min_samples_leaf=15, max_depth=8, n_jobs=-1, criterion='gini'),
                RandomForestClassifier(n_estimators=150, min_samples_split=90, min_samples_leaf=15, max_depth=8, n_jobs=-1, criterion='entropy'),
                ExtraTreesClassifier(n_estimators=150, min_samples_split=90, min_samples_leaf=15, max_depth=8, n_jobs=-1, criterion='gini'),
                ExtraTreesClassifier(n_estimators=150, min_samples_split=90, min_samples_leaf=15, max_depth=8, n_jobs=-1, criterion='entropy'),
                GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=8),
                XGBClassifier(learning_rate=0.05, n_estimators=300, max_depth=6, min_child_weight=1, gamma=0.1, subsample=0.8,
                              colsample_bytree=0.8, objective='multi:softmax', nthread=4, reg_alpha=0.001, scale_pos_weight=1),
                lgb.LGBMClassifier(learning_rate=0.1, boosting_type='gbdt', objective='multiclass', n_estimators=300, metric='multi_logloss',
                                   max_depth=7, num_leaves=5, subsample=0.7, colsample_bytree=0.7, min_data_in_leaf=45, feature_fraction=0.7, bagging_fraction=0.7,
                                   bagging_freq=6, lambda_l1=1, lambda_l2=0.001, min_gain_to_split=0.265, verbose=5, is_unbalance=True)]
        dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
        dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))
        for j, clf in enumerate(clfs):
            print(j, clf)
            dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))
            for i, (train, test) in enumerate(skf):
                print("Fold", i)
                X_train = X[train]
                y_train = y[train]
                X_test = X[test]
                y_test = y[test]
                clf.fit(X_train, y_train)
                # note: [:, 1] keeps only the probability of class 1; with four classes,
                # stacking all predict_proba columns would retain more information
                y_submission = clf.predict_proba(X_test)[:, 1]
                dataset_blend_train[test, j] = y_submission
                dataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:, 1]
            dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)
        print("Blending.")
        clf = LogisticRegression()
        clf.fit(dataset_blend_train, y)
        return clf.predict_proba(dataset_blend_test)
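A hedged usage sketch (testData here is a hypothetical submission feature matrix prepared the same way as train; the original notebook's variable names may differ). The returned probability matrix goes through the same top-2 extraction as the other models:

    # blend the base models, then take the two most likely classes per building
    pb = blend(train.values, y, testData.values, 5)
    submit = pd.DataFrame()
    submit['y1'] = pb.argsort()[np.arange(len(pb)), -1]
    submit['y2'] = pb.argsort()[np.arange(len(pb)), -2]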

Stacking

Emmm, I'm not sure this is the right way to write it. (A possible second-level step is sketched after the code below.)

    import lightgbm as lgb
    from xgboost.sklearn import XGBClassifier
    from sklearn.ensemble import RandomForestClassifier

    xgb = XGBClassifier(booster='gbtree',
                        learning_rate=0.1,
                        objective='multi:softmax',
                        num_class=4,
                        gamma=0.05,
                        subsample=0.4,
                        reg_alpha=1e-05,
                        n_estimators=50,
                        eval_metric='mlogloss',  # was metric='multi_logloss', a LightGBM name
                        colsample_bytree=0.7,
                        silent=1,
                        nthread=4)
    gbm = lgb.LGBMClassifier(learning_rate=0.1,
                             boosting_type='gbdt',
                             objective='multiclass',
                             n_estimators=50,
                             metric='multi_logloss',
                             max_depth=7,
                             bagging_fraction=0.7,
                             is_unbalance=True)
    rf = RandomForestClassifier(n_estimators=50,
                                min_samples_split=90,
                                min_samples_leaf=15,
                                max_depth=8,
                                oob_score=True)

    # append each model's leaf-index features to the second split and the test set
    xgb.fit(X_train_1, y_train_1)
    new_feature = xgb.apply(X_train_2)
    X_train_new2 = mergeToOne(X_train_2, new_feature)
    new_feature_test = xgb.apply(X_test)
    X_test_new = mergeToOne(X_test, new_feature_test)

    gbm.fit(X_train_1, y_train_1)
    # LGBMClassifier has no apply(); predict(..., pred_leaf=True) returns leaf indices instead
    new_feature = gbm.predict(X_train_2, pred_leaf=True)
    X_train_new2 = mergeToOne(X_train_new2, new_feature)
    new_feature_test = gbm.predict(X_test, pred_leaf=True)
    X_test_new = mergeToOne(X_test_new, new_feature_test)

    rf.fit(X_train_1, y_train_1)
    new_feature = rf.apply(X_train_2)
    X_train_new2 = mergeToOne(X_train_new2, new_feature)
    new_feature_test = rf.apply(X_test)
    X_test_new = mergeToOne(X_test_new, new_feature_test)
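What is missing above is the second level: a meta-model trained on the merged features. A hedged sketch of one way to finish the stack (the choice of LogisticRegression as the meta-learner is mine, not confirmed by the original):

    from sklearn.linear_model import LogisticRegression

    # second-level learner on the original features plus all leaf-index features
    meta = LogisticRegression()
    meta.fit(X_train_new2, y_train_2)
    print(meta.score(X_test_new, y_test))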

Weighted voting

    def wsubmit(xg, lg, nn):
        xg_y1 = xg['y1'].values
        lg_y1 = lg['y1'].values
        lg_y2 = lg['y2'].values
        nn_y1 = nn['y1'].values  # bug fix: the original read lg['y1'] here
        submitData = pd.DataFrame()
        y1 = []
        y2 = []
        for i in range(len(xg)):
            row_y1 = [xg_y1[i], lg_y1[i], nn_y1[i]]
            # majority vote among the three models for the primary recommendation
            y1.append(max(row_y1, key=row_y1.count))
            # the secondary recommendation falls back to LightGBM's predictions
            if max(row_y1, key=row_y1.count) != lg_y1[i]:
                y2.append(lg_y1[i])
            else:
                y2.append(lg_y2[i])
        submitData['y1'] = y1
        submitData['y2'] = y2
        submitData.to_csv('submit_voting.csv', index=False)

Summary

This was mainly an exercise in applying what I had learned earlier; my actual ranking in the competition was not very high.
