Learning objectives:

- Know the basic process of the credit approval business.
- Know what the A/B/C scorecards are and how they differ.
- Know the risk control modeling process.
- Master the scorecard model and the methods for defining positive and negative samples.
- Know how to construct features and how to evaluate them.
- Four-factor authentication: the bank card holder's name, ID number, bank card number, and mobile phone number.
The Internet financial risk control system is mainly composed of three parts:

- User data: basic user information, user behavior information, user authorization information, and externally sourced information. Data collection involves event tracking (buried points) and crawler technology, and the data used across the industry is largely similar: free operator data, information crawlable from Android phones (installed app names, device information, part of the app content), paid credit data, various information verifications, external blacklists, and so on. Specific cash-loan and consumer-finance scenarios also have their own data, such as Alibaba's and JD.com's e-commerce data, Didi's driver data, and SF Express / ZTO express delivery data.
  - User basic information (contacts, address book, education, ...)
  - User behavior information (in-app behavior, registration, click locations, ...)
  - User authorization information (carrier data, academic records network, device IMEI, ...)
  - External information (P2P credit data, other financial institutions such as Sesame Credit, ...)
- Policy system: after the user's information is collected, it is fed into the policy engine: fraud rules, admission rules (age, region, address book, behavior rules), operator rules (call rules), risk lists (blacklist, dishonest-debtor list, court lists), online-loan records (multi-platform borrowing, users with no credit history).
- Machine learning models: fraud detection model, admission model, credit model, risk pricing, quota management, churn warning, and loss recovery.
The risk control models include the A, B, and C cards. All three can use the same algorithms and generally use the number of days overdue to distinguish positive and negative samples, i.e. to assign the target value y (0 or 1):

- A card: application scorecard, used before the loan.
- B card: behavior scorecard, used during the loan based on in-loan behavior.
- C card: collection scorecard, used after the loan for collections.

Because the uses differ, the definition of y may differ. For the C card, a company has both internal and external collection; the external recovery rate is low and the unit price is expensive, so the y of the C card can be defined by whether the customer is recovered internally.

The modeling process:

- Project preparation: clarify requirements, model design (abstract the business into a classification or regression problem, define the label/target value and the samples), sample design.
- Feature engineering: data processing (select appropriate samples and join all available information as base features), feature construction, feature evaluation.
- Model construction: model training, model evaluation, model tuning.
- Online operation: model delivery, model deployment, model monitoring.
Clarify the requirements:

- Target group: new customers, high-quality old customers, overdue old customers.
- Product: credit limit, interest rate.
- Market strategy: cold start, opening up the market, improving revenue.
- Time horizon: urgent use, long-term deployment.

Example: the business needs to launch a small cash-loan product for new customers and seize a new market, so we build an application scorecard for new customers with high risk and thin data.

Model design: abstract the business into a classification or regression problem. Problems in risk control scenarios can usually be turned into binary classification problems:

- A credit scoring model predicts whether a user will become overdue; overdue users are labeled 1.
- A marketing model predicts whether a user will take out a loan after being marketed to; users who do not take out a loan are labeled 1.
- A churn model predicts whether a user will lose contact; lost users are labeled 1.

In the risk control business, only fraud detection is not a binary classification problem; because the sample size is insufficient, it may be an unsupervised learning model.

Model algorithms: rule models, logistic regression, ensemble learning, fused models. Model input: data sources, time span, definition of the y label.

When building a credit scoring model, the raw data only contains each person's current overdue status; there are no ready-made negative samples, and they have to be constructed artificially. Usually a cut-off point (threshold) is chosen: when the overdue days exceed the threshold, the sample is considered a negative sample, i.e. the money will not be repaid in the future. For example, with 15 days overdue as the labeling threshold, customers overdue for more than 15 days get y = 1. How is y = 0 then defined? Only customers who repay on time, or are overdue only slightly (e.g. no more than 5 days), are labeled 0; customers in between (overdue between 5 and 15 days) are gray samples and are excluded from training, as shown in the sketch below.
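A minimal sketch of this labeling rule, assuming a column of overdue days (here called `overdue_days`, a hypothetical name) and the 5 / 15-day thresholds from the sample design below:

```python
import pandas as pd

# Hypothetical illustration of the 5 / 15-day labeling rule described above;
# `overdue_days` is an assumed column name, not from the original data set.
loans = pd.DataFrame({'overdue_days': [0, 3, 7, 12, 20, 45]})

def make_label(days):
    if days > 15:          # bad sample: overdue more than 15 days
        return 1
    if days <= 5:          # good sample: on time or only slightly overdue
        return 0
    return None            # gray sample: excluded from training

loans['y'] = loans['overdue_days'].apply(make_label)
train = loans[loans['y'].notna()]   # gray samples do not participate in training
```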
Sample design: choose the customer group (new customers, non-overdue old customers, overdue old customers). The observation window runs from January to August and is split into a training set and a test set:

| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug |
|---|---|---|---|---|---|---|---|---|
| Total | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 |
| bad | 3 | 6 | 6 | 8 | 15 | 12 | 14 | 24 |
| bad % | 3% | 3% | 2% | 2% | 3% | 2% | 2% | 3% |

Customer group description: first-order users with rich internal data; high-risk occupations are excluded; income in the range xxxx.

Customer group labels: good: fpd <= 5; bad: fpd > 15; (5, 15] are gray samples that do not participate in training but do participate in testing and evaluation.

Data research: clarify what data is available for the target population, and clarify the data acquisition logic.
Clarify the quality, coverage, and stability of the data.
A common misunderstanding in feature construction is to start building features immediately after getting the data. Before building features you need to clarify: which specific data tables each data source corresponds to (draw an ER diagram), and which sample set the features will be evaluated on — the sample set for the B card cannot contain overdue data, and the sample set for the C card cannot contain on-time repayment data. Also set up a feature framework so that every data dimension is considered: determine the thinking framework and discuss it with others in the group. To clarify the specific data tables behind each data source, confirm where the data comes from (the data engineers / data warehouse engineers).
The data obtained by a data analyst may be the raw tables of the data warehouse or the reconstructed (cleaned) tables of the data warehouse. The raw tables and the reconstructed tables may contain different amounts of data because their update times differ. Try to use the reconstructed tables processed by the data warehouse engineers, so that the logic is unified; for real-time features, make sure the data in the production database and in the data warehouse are consistent (which is difficult). Draw an ER-style diagram and clarify the data relationships: one-to-one, one-to-many, many-to-many.
When writing SQL queries, start from the user list; when joining other tables you cannot simply `select distinct user_id from` the order table.

Clarify the sample set on which the features will be evaluated:

- New applicants have no internal credit data.
- Old customers who have never been overdue have no overdue information in the current period.
- The repayment data of overdue and non-overdue customers is necessarily very different.

How to build features from the raw data: specify a feature framework so that the data usage dimensions are fully considered. Each attribute can be derived along the three RFM dimensions: R (recency), F (frequency), M (monetary). Example, constructing features from GPS latitude/longitude data:

- Province/city: R — province and city of the most recent GPS point and of the GPS point at application time; F — provinces and cities that have appeared in the GPS history; M — the province and city that appears most often, plus that region's GDP, population, bad-debt rate and other statistics.
- Time: R — number of GPS points in the last day / month; F — average number of GPS points in the past n days / months, and counts in the morning / noon / evening, on workdays and on weekends; M — none.
- Address: R — distance of the most recent GPS point from the home and work addresses; F — distance of the most frequent GPS point from the home and work addresses; M — information entropy of the GPS sequence around the home and work addresses.
- Supplementary: whether GPS is authorized; the most recent number of consecutive days without GPS.

Feature construction methods:

- User static features: the user's name, gender, age.
- User time cross-section features: e-commerce GMV at a cross-section time point, bank deposit amount at a cross-section time point, maximum number of days overdue so far.
- User time series features: the user's GPS data over the past month, bank statements over the past six months, overdue records over the past year.

What makes a good feature? A good feature should satisfy these evaluation criteria:

- High coverage: many users have the feature.
- Stability: it can be used for a long time in the future — PSI (population stability index).
- Good discrimination: good and bad users differ a lot in the feature value — IV (information value).

Model evaluation metrics can also be used to evaluate a single feature: single-feature AUC and single-feature KS; the best single-feature AUC and KS give a rough estimate of the model's eventual effect.

Feature evaluation report: for the full sample and the labeled sample, report the feature name, coverage, missing rate, zero-value rate, AUC, KS, and IV.

- Coverage on the full sample: how many users have this feature.
- Missing rate on the unlabeled sample vs. on the labeled sample: compare with the full-sample coverage and check whether the gap is large; choose features where the gap is small.
- Zero-value rate: many features are counts, such as the number of e-commerce purchases, the number of address-book records, or the amount of GPS data; if there are too many zero values the feature is not good.
- Eliminate features whose risk trend defies logic; evaluate them with common sense and business knowledge.

A sketch of such a single-feature report follows.
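A minimal sketch of the single-feature report described above, assuming a hypothetical full sample (`full`, a Series of the feature) and a labeled sample (`labeled`, with a 0/1 column `y`); the helper name and column names are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Sketch of a single-feature evaluation report (hypothetical helper, not from the course code).
def feature_report(full: pd.Series, labeled: pd.DataFrame, col: str) -> dict:
    x, y = labeled[col], labeled['y']
    mask = x.notna()
    xs, ys = x[mask].to_numpy(), y[mask].to_numpy()
    auc = roc_auc_score(ys, xs)
    auc = max(auc, 1 - auc)                      # single-feature AUC, direction-agnostic
    order = np.argsort(xs)                       # KS: max gap between bad/good cumulative curves
    bad_cum = np.cumsum(ys[order]) / ys.sum()
    good_cum = np.cumsum(1 - ys[order]) / (len(ys) - ys.sum())
    ks = float(np.abs(bad_cum - good_cum).max())
    return {
        'feature': col,
        'coverage': full.notna().mean(),         # share of all users that have the feature
        'missing_rate_labeled': x.isna().mean(), # missing rate on the labeled sample
        'zero_rate': (x == 0).mean(),            # zero-value rate (counting-type features)
        'auc': auc,
        'ks': ks,
    }
```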
Design experiments, model training, and model evaluation.
When designing the training experiments, many factors may affect the model's effect, and we need to verify which factors improve performance by designing experiments. Model evaluation (with metrics such as KS and GINI) typically produces two reports.

Report 1: discrimination — the ability to catch bad users in different score segments (total number of users, number of bad users, bad rate, KS), over segments such as [300, 550], [550, 600], [600, 700], [700, 750], [750, 800], [800, 850], [850, 950].

Report 2: cross-time stability of the score distribution, compared across test sets and online periods:

| Score segment | Test set 1 | Test set 2 |
|---|---|---|
| [300, 550] | 10% | 10% |
| [550, 600] | 20% | 20% |
| [600, 700] | 20% | 20% |
| [700, 750] | 25% | 25% |
| [750, 800] | 20% | 20% |
| [800, 850] | 4% | 4% |
| [850, 950] | 1% | 1% |
| Proportion above the decision point | 50% | 50% |
| Total number of users | 3000 | 2000 |
| Average score | 730 | 725 |
| PSI (vs. test set 1) | — | 0.01 |

The same report is also tracked for the online periods (Online 1, Online 2). A PSI computation sketch follows.
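The PSI in Report 2 compares the score distribution of a later sample against the benchmark distribution. A minimal sketch, using the segment proportions from the table above as hypothetical inputs:

```python
import numpy as np

# PSI = sum over segments of (actual% - expected%) * ln(actual% / expected%)
def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-4) -> float:
    expected = np.clip(expected, eps, None)   # avoid log(0) for empty segments
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

test_set_1 = np.array([0.10, 0.20, 0.20, 0.25, 0.20, 0.04, 0.01])
test_set_2 = np.array([0.10, 0.20, 0.20, 0.25, 0.20, 0.04, 0.01])
print(psi(test_set_1, test_set_2))   # identical (rounded) distributions -> PSI near 0
```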
Model delivery, model deployment, and model monitoring.

Model delivery process:

1. Submit the feature and model reports.
2. Offline result quality review (no missing values, no duplicates, correct storage location, file names follow the naming convention).
3. Save the model file, determine the version number, and record the submission time.
4. The boss approves and the business side is notified.
5. Online deployment, case study, continuous monitoring.

Feature report: 1 feature project requirements; 2 feature project task list; 3 feature project schedule; 4 class/ER diagram; 5 sample design; 6 feature framework; 7 weekly development progress and results; 8 weekly discussion feedback and improvement notes; 9 feature project delivery notes; 10 feature project summary.

Model report: 1 model project requirements; 2 model project task list; 3 model project schedule; 4 model design; 5 sample design; 6 model training process and experiment design; 7 weekly development progress and results; 8 weekly discussion feedback and improvement notes; 9 model project delivery notes; 10 model project summary.

Model deployment: ensure consistency between the development and production environments, and deploy with PMML files or a Flask API. A mandatory step: score the same group of customers offline and online and make sure the offline results and the online results are consistent.

Model monitoring: feature monitoring (feature stability) and model monitoring (model stability).

Two common risk control approaches: AI models and rules.

How to use rules for risk control: use a series of judgment rules to split the customer base into groups whose overdue risks differ significantly — for example, whether the number of multi-platform loans exceeds a certain number. If a rule puts a user into a high-risk group, the user is rejected directly; if the user falls into a low-risk group, the next rule is applied (see the rule-chain sketch after this section).

Advantages and disadvantages of rules vs. AI models: rules can be put to use quickly and are easy for business staff to understand, but the judgment is relatively crude — a user is rejected as soon as a single dimension fails. AI models have a long development cycle and are more complex than rules, but they are more flexible; for scenarios with high accuracy requirements an AI model can be used, and it can also assist in building the rule engine — decision trees are very well suited to rule mining.

Case background: an Internet company has multiple business lines, each with its own loan product. Riders on the takeaway platform can apply for a "rider loan", merchants on the e-commerce platform can apply for an "online merchant loan", and drivers of the ride-hailing business can apply for a "driver loan". The company has several similar scenarios sharing the same rule engine and application scorecard, and the borrowers are part-time workers of the company. Recently the overdue rate of the "driver loan" has been found to be relatively high: the 30-day overdue rate of the entire financial business line is 1.5%, while 5% of the "driver loan" product is overdue beyond 30 days. The solution is expected to keep the existing risk control architecture stable, be developed and launched quickly, avoid complex methods, use the existing data, and mine appropriate business rules from the data and the data dictionary.
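A minimal sketch of such a sequential rule chain; the rule thresholds and field names below are hypothetical, chosen only to illustrate the "reject on high risk, otherwise fall through to the next rule / the model" flow:

```python
# Sketch of a sequential rule engine; thresholds and field names are made up.
def apply_rules(user: dict) -> str:
    # Rule 1: too many multi-platform loans -> reject directly.
    if user.get('loan_platform_cnt', 0) > 5:
        return 'reject'
    # Rule 2: hit on a risk list (blacklist / dishonest-debtor list) -> reject.
    if user.get('on_blacklist', False):
        return 'reject'
    # All rules passed -> hand over to the scorecard / AI model.
    return 'pass_to_model'

print(apply_rules({'loan_platform_cnt': 7}))                         # reject
print(apply_rules({'loan_platform_cnt': 2, 'on_blacklist': False}))  # pass_to_model
```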
Load the data.
```python
import pandas as pd
import numpy as np

data = pd.read_excel('data/rule_data.xlsx')
data.head()
```
Show Results: the first rows of the raw data (columns: uid, oil_actv_dt, create_dt, total_oil_cnt, pay_amount_total, class_new, bad_ind, oil_amount, discount_amount, sale_amount, amount, pay_amount, coupon_amount, payment_coupon_amount, channel_code, oil_code, scene, source_app, call_source).

Check the values of class_new:
```python
data.class_new.unique()
```
Show Results:

```
array(['b', 'e', 'c', 'a', 'd', 'f'], dtype=object)
```

The raw data has too few features, so new features are derived on top of the original ones. The features are divided into three groups and handled separately: the base columns, the numeric variables (after grouping by id, aggregate them in a variety of ways to derive new features), and the categorical variables (after grouping by id, count the number of distinct entries to derive new features).
```python
org_list = ['uid', 'create_dt', 'oil_actv_dt', 'class_new', 'bad_ind']
agg_list = ['oil_amount', 'discount_amount', 'sale_amount', 'amount',
            'pay_amount', 'coupon_amount', 'payment_coupon_amount']
count_list = ['channel_code', 'oil_code', 'scene', 'source_app', 'call_source']
```
Create a copy of the data, keep it as the base table, and check what is missing:

```python
df = data[org_list].copy()
df[agg_list] = data[agg_list].copy()
df[count_list] = data[count_list].copy()
df.isna().sum()
```
Show Results:

```
uid                          0
create_dt                 4944
oil_actv_dt                  0
class_new                    0
bad_ind                      0
oil_amount                4944
discount_amount           4944
sale_amount               4944
amount                    4944
pay_amount                4944
coupon_amount             4944
payment_coupon_amount     4946
channel_code                 0
oil_code                     0
scene                        0
source_app                   0
call_source                  0
dtype: int64
```

View the distribution of the numeric variables:
```python
df.describe()
```
Show Results: summary statistics from `df.describe()` (count, mean, std, min, quartiles, max for bad_ind, the amount columns, and the code columns; for example, bad_ind has a mean of about 0.018 over 50,609 rows).

Fill in the missing values: a missing create_dt is completed with oil_actv_dt, and only records where the gap between the application time and the loan disbursement time does not exceed 6 months are kept (considering the timeliness of the data).
```python
def time_isna(x, y):
    # fill a missing create_dt (NaT) with the disbursement date oil_actv_dt
    if str(x) == 'NaT':
        x = y
    return x

df2 = df.sort_values(['uid', 'create_dt'], ascending=False)
df2['create_dt'] = df2.apply(lambda x: time_isna(x.create_dt, x.oil_actv_dt), axis=1)
df2['dtn'] = (df2.oil_actv_dt - df2.create_dt).apply(lambda x: x.days)
df = df2[df2['dtn'] < 180]
df.head()
```
Show Results: the first rows of df after filling and filtering (uid, create_dt, oil_actv_dt, class_new, bad_ind, the amount columns, the code columns, and the new dtn column; the amount columns are NaN for these rows).

Sort the records by user id and keep only the most recent application time, so that each user has exactly one record:
```python
base = df[org_list]
base['dtn'] = df['dtn']
base = base.sort_values(['uid', 'create_dt'], ascending=False)
base = base.drop_duplicates(['uid'], keep='first')
base.shape
```
Show Results: the shape of base (one row per uid).

Feature derivation for the continuous (statistical) variables: the aggregation functions include counting the historical feature values, counting the historical values greater than 0, and taking the sum, mean, maximum, minimum, variance, and range.
```python
gn = pd.DataFrame()
for i in agg_list:
    # count of historical values
    tp = df.groupby('uid').apply(lambda df: len(df[i])).reset_index()
    tp.columns = ['uid', i + '_cnt']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # number of historical values greater than 0
    tp = df.groupby('uid').apply(lambda df: np.where(df[i] > 0, 1, 0).sum()).reset_index()
    tp.columns = ['uid', i + '_num']
    gn = pd.merge(gn, tp, on='uid', how='left')
    # sum
    tp = df.groupby('uid').apply(lambda df: np.nansum(df[i])).reset_index()
    tp.columns = ['uid', i + '_tot']
    gn = pd.merge(gn, tp, on='uid', how='left')
    # mean
    tp = df.groupby('uid').apply(lambda df: np.nanmean(df[i])).reset_index()
    tp.columns = ['uid', i + '_avg']
    gn = pd.merge(gn, tp, on='uid', how='left')
    # maximum
    tp = df.groupby('uid').apply(lambda df: np.nanmax(df[i])).reset_index()
    tp.columns = ['uid', i + '_max']
    gn = pd.merge(gn, tp, on='uid', how='left')
    # minimum
    tp = df.groupby('uid').apply(lambda df: np.nanmin(df[i])).reset_index()
    tp.columns = ['uid', i + '_min']
    gn = pd.merge(gn, tp, on='uid', how='left')
    # variance
    tp = df.groupby('uid').apply(lambda df: np.nanvar(df[i])).reset_index()
    tp.columns = ['uid', i + '_var']
    gn = pd.merge(gn, tp, on='uid', how='left')
    # range (max - min)
    tp = df.groupby('uid').apply(lambda df: np.nanmax(df[i]) - np.nanmin(df[i])).reset_index()
    tp.columns = ['uid', i + '_ran']
    gn = pd.merge(gn, tp, on='uid', how='left')
```
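As a side note, an equivalent but more compact derivation (a sketch, not part of the original notebook) can be written with a single `groupby().agg()` call; the statistics may differ slightly from the loop above in NaN handling and in the variance's degrees of freedom:

```python
import pandas as pd

def num_pos(s):
    """number of historical values greater than 0"""
    return (s > 0).sum()

def val_range(s):
    """range = max - min"""
    return s.max() - s.min()

# 'count' ignores NaN (unlike len above) and 'var' uses ddof=1 (unlike np.nanvar).
agg_funcs = ['count', num_pos, 'sum', 'mean', 'max', 'min', 'var', val_range]
suffixes  = ['cnt', 'num', 'tot', 'avg', 'max', 'min', 'var', 'ran']

gn_alt = df.groupby('uid')[agg_list].agg(agg_funcs)
gn_alt.columns = [f'{col}_{suf}' for col in agg_list for suf in suffixes]
gn_alt = gn_alt.reset_index()
```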
View the derived results:

```python
gn.columns
```
Show Results:

```
Index(['uid', 'oil_amount_cnt', 'oil_amount_num', 'oil_amount_tot',
       'oil_amount_avg', 'oil_amount_max', 'oil_amount_min',
       'oil_amount_var_x', 'oil_amount_var_y', 'discount_amount_cnt',
       'discount_amount_num', 'discount_amount_tot', 'discount_amount_avg',
       'discount_amount_max', 'discount_amount_min', 'discount_amount_var_x',
       'discount_amount_var_y', 'sale_amount_cnt', 'sale_amount_num',
       'sale_amount_tot', 'sale_amount_avg', 'sale_amount_max',
       'sale_amount_min', 'sale_amount_var_x', 'sale_amount_var_y',
       'amount_cnt', 'amount_num', 'amount_tot', 'amount_avg', 'amount_max',
       'amount_min', 'amount_var_x', 'amount_var_y', 'pay_amount_cnt',
       'pay_amount_num', 'pay_amount_tot', 'pay_amount_avg', 'pay_amount_max',
       'pay_amount_min', 'pay_amount_var_x', 'pay_amount_var_y',
       'coupon_amount_cnt', 'coupon_amount_num', 'coupon_amount_tot',
       'coupon_amount_avg', 'coupon_amount_max', 'coupon_amount_min',
       'coupon_amount_var_x', 'coupon_amount_var_y',
       'payment_coupon_amount_cnt', 'payment_coupon_amount_num',
       'payment_coupon_amount_tot', 'payment_coupon_amount_avg',
       'payment_coupon_amount_max', 'payment_coupon_amount_min',
       'payment_coupon_amount_var_x', 'payment_coupon_amount_var_y'],
      dtype='object')
```

For the variables in count_list, count the number of distinct values (dstc):
```python
gc = pd.DataFrame()
for i in count_list:
    tp = df.groupby('uid').apply(lambda df: len(set(df[i]))).reset_index()
    tp.columns = ['uid', i + '_dstc']
    if gc.empty:
        gc = tp
    else:
        gc = pd.merge(gc, tp, on='uid', how='left')
```
Merge the variable groups together:

```python
fn = pd.merge(base, gn, on='uid')
fn = pd.merge(fn, gc, on='uid')
fn.shape
```
Show Results: the shape of fn.

Missing values may appear during the merge, so fill them in:
```python
fn = fn.fillna(0)
fn.head(100)
```
Show Results: the first 100 rows of fn (100 rows × 74 columns: uid, the date columns, class_new, bad_ind, dtn, followed by the derived aggregation columns and the _dstc columns).

Train a decision tree model:
```python
x = fn.drop(['uid', 'oil_actv_dt', 'create_dt', 'bad_ind', 'class_new'], axis=1)
y = fn.bad_ind.copy()

from sklearn import tree

dtree = tree.DecisionTreeRegressor(max_depth=2, min_samples_leaf=500, min_samples_split=5000)
dtree = dtree.fit(x, y)
```
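Before rendering the Graphviz image below, a quick text dump of the fitted tree can verify the split thresholds — a sketch assuming scikit-learn >= 0.21, which provides `sklearn.tree.export_text`:

```python
from sklearn.tree import export_text

# Text view of the fitted tree; feature_names must be a list of plain strings.
print(export_text(dtree, feature_names=list(x.columns)))
```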
To output an image of the decision tree, you need to install the Graphviz software as well as two Python libraries:

```
pip install graphviz
pip install pydotplus
```

```python
import pydotplus
from IPython.display import Image
from six import StringIO
import os
# os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
# with open("dt.dot", "w") as f:
#     tree.export_graphviz(dtree, out_file=f)

dot_data = StringIO()
tree.export_graphviz(dtree, out_file=dot_data,
                     feature_names=x.columns,
                     class_names=['bad_ind'],
                     filled=True, rounded=True,
                     special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
```
Show Results: (the rendered decision tree image; the splits it learns are the amount_tot and amount_cnt thresholds used below).

Use the tree's split points to segment users:
```python
group_1 = fn.loc[(fn.amount_tot > 48077.5) & (fn.amount_cnt > 3.5)].copy()
group_1['level'] = 'past_a'
group_2 = fn.loc[(fn.amount_tot > 48077.5) & (fn.amount_cnt <= 3.5)].copy()
group_2['level'] = 'past_b'
group_3 = fn.loc[fn.amount_tot <= 48077.5].copy()
group_3['level'] = 'past_c'
```
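A quick check (a sketch, not part of the original notebook) of how rejecting the riskier segments changes the overall bad rate, which is where the figures quoted in the next paragraph come from:

```python
import pandas as pd

# Overall bad rate among accepted customers under two hypothetical reject policies;
# the text below reports roughly 0.021 and 0.012 for these two cases.
accepted_without_c = pd.concat([group_1, group_2])
print(accepted_without_c.bad_ind.mean())   # reject past_c only
print(group_1.bad_ind.mean())              # reject past_b and past_c as well
```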
If past_c customers are rejected, the overall negative-sample (bad) proportion can be reduced to 0.021; if past_b is rejected as well, it can be reduced to 0.012. As for the actual strategy to adopt for past_a, past_b, and past_c, linear programming over the interest rate should be used, so as to realize risk-based pricing.

Summary. The basic process of the credit approval business: application, approval, disbursement, repayment, re-application, approval of the repeat loan; rules and models are applied at each step, and overdue collection is likewise driven by rules and models.
ABC scorecards: A = application, B = behavior, C = collection. They target different customer groups, have different available data, and define y differently.

Risk control modeling process: project preparation (clarify requirements, model design, sample design), feature engineering (data processing, feature construction, feature evaluation), model construction (model training, model evaluation, model tuning), online operation (model delivery, model deployment, model monitoring).

Scorecard positive/negative sample definition: the general habit is y = 1 for bad users (defaulters). Choose the y = 1 cut-off (DPD30, DPD15, ... according to the specific business situation); delete the gray users in between; users who are not overdue, or overdue for no more than 5 days, are treated as good users.

How to build and evaluate features: build an ER diagram so you know which tables the data is stored in and how the tables relate, and therefore which data can be used; derive new features from single attributes along the three RFM dimensions; also build user time cross-section features and user time series features. Evaluate features by coverage, stability (PSI), and discrimination (IV, single-feature AUC, single-feature KS).

How the rule engine works: use a series of judgment rules to split the customer base into groups whose overdue risks differ significantly; a machine learning model (such as a decision tree) can assist in mining the rules.