Handling imbalanced classes with the cube method in credit scoring problems through data mining

  • Mauricio Beltrán Pascual, Junta de Castilla y León
  • Juan Antonio Vicente Virseda, Universidad Nacional de Educación a Distancia
Keywords: cube method, credit scoring, data mining, classification cost

Abstract

This article shows how to apply the sampling procedure known as the cube method to credit scoring problems in order to improve the accuracy of the resulting predictive models. The method guarantees well-balanced samples when working with databases whose dependent-variable classes are highly imbalanced. Using two samples of real banking data, we compare the best models obtained with several data mining methods applied to the original databases against those applied to the balanced ones. We conclude that, once the samples are balanced, the classification algorithms predict more accurately and the models reduce the economic cost of misclassification.
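The cube method cited below (Deville and Tillé, 2004) draws a sample whose Horvitz-Thompson estimates reproduce known population totals of auxiliary variables while respecting prescribed inclusion probabilities, which is what lets it rebalance classes without distorting the covariate structure. As an illustration only, not the authors' code (the function name `cube_flight`, the SVD-based null-space search, and the fixed-size example are our own simplifications; in R, where the authors' tooling lives, the `sampling` package's `samplecube` function implements the complete method), the flight phase can be sketched with NumPy:

```python
import numpy as np

def cube_flight(X, pik, rng=None, tol=1e-9):
    """Flight phase of the cube method (Deville & Tille, 2004).

    X   : (N, p) array of auxiliary (balancing) variables.
    pik : (N,) vector of inclusion probabilities in (0, 1).
    Returns an updated vector in which at most p components remain
    non-integer (a landing phase would decide those last units).
    """
    rng = np.random.default_rng(rng)
    pik = np.asarray(pik, dtype=float).copy()
    # Balancing matrix: one column a_k = x_k / pi_k per population unit;
    # it is built once, from the *initial* inclusion probabilities.
    A = np.asarray(X, dtype=float) / pik[:, None]
    while True:
        # Units whose selection is still undecided (probability strictly in (0,1)).
        U = np.where((pik > tol) & (pik < 1 - tol))[0]
        if U.size == 0:
            break
        B = A[U].T                                   # p x |U| submatrix
        _, s, Vt = np.linalg.svd(B)
        rank = int(np.sum(s > tol * max(s.max(), 1.0)))
        if rank >= U.size:
            break                                    # no null vector left: flight over
        u = Vt[-1]                                   # direction with B @ u ~ 0
        # Largest steps in the +u and -u directions keeping probabilities in [0, 1].
        with np.errstate(divide='ignore', invalid='ignore'):
            lam1 = np.min(np.where(u > 0, (1 - pik[U]) / u,
                          np.where(u < 0, -pik[U] / u, np.inf)))
            lam2 = np.min(np.where(u > 0, pik[U] / u,
                          np.where(u < 0, (pik[U] - 1) / u, np.inf)))
        # Martingale step: the expectation of every pik is preserved.
        if rng.random() < lam2 / (lam1 + lam2):
            pik[U] += lam1 * u
        else:
            pik[U] -= lam2 * u
    return pik

# Fixed-size example: balancing on x_k = pi_k forces the realised sample
# size to equal sum(pik) = 5 exactly, so all 10 probabilities get rounded.
pik = np.full(10, 0.5)
result = cube_flight(pik.reshape(-1, 1), pik, rng=0)
sample = np.where(result > 0.5)[0]   # indices of the selected units
```

The flight phase leaves at most p probabilities fractional; a landing phase (for instance, enumerating the remaining units or relaxing the least important constraints) completes the selection. Balancing on the inclusion probabilities themselves, as in the example, is the standard trick for fixing the sample size.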

References

Bamber, D., 1975. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12, 387-415.

Bessis, J., 2002. Risk Management in Banking. Second edition. Chichester: John Wiley and Sons, 496 pp.

Boj, E., Claramunt, M.M., Esteve, A. and Fortiana, J., 2009. Criterio de selección de modelo en credit scoring: Aplicación del análisis discriminante basado en distancias. Anales del Instituto de Actuarios Españoles, 3, 209-230.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., 1984. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Book & Software.

Caouette, J., Altman, E. and Narayanan, P., 1998. Managing Credit Risk: The Next Great Financial Challenge. Wiley Frontiers in Finance, John Wiley & Sons, New York.

Cohen, G., Hilario, M., Sax, H., Hugonnet, S. and Geissbuhler, A., 2006. Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine, pp. 7-18.

Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, pp. 321-357.

Deville, J.C. and Tillé, Y., 2004. Efficient balanced sampling: The cube method. Biometrika, 91, pp. 893-912.

Domingos, P., 1999. MetaCost: A general method for making classifiers cost-sensitive. In: Fifth International Conference on Knowledge Discovery and Data Mining, pp. 155-164.

Elkan, C., 2001. The Foundations of Cost-Sensitive Learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 973-978. Seattle, Washington: Morgan Kaufmann.

Gopinathan, K. and O'Donnell, D., 1998. Just in time risk management. Credit World, 2:10-2.

Hand, D.J. and Henley, W.E., 1997. Statistical Classification Methods in Consumer Credit Scoring: A Review. Journal of the Royal Statistical Society, Series A, 160(3), 523-541.

Han, H., Wang, W. and Mao, B., 2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: D.-S. Huang, X.-P. Zhang and G.-B. Huang (Eds.), ICIC, volume 3644 of LNCS, pp. 878-887.

Hanley, J.A. and McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.

Hulse, J.V., Khoshgoftaar, T.M. and Napolitano, A., 2007. Experimental perspectives on learning from imbalanced data. In: Z. Ghahramani (Ed.), ICML, volume 227 of ACM International Conference Proceeding Series, pp. 935-942.

Japkowicz, N., 2001. Concept-Learning in the Presence of Between-Class and Within-Class Imbalances. In: E. Stroulia and S. Matwin (Eds.), Canadian Conference on AI, volume 2056 of LNCS, pp. 67-77.

Japkowicz, N. and Stephen, S., 2002. The Class Imbalance Problem: A Systematic Study. Intelligent Data Analysis Journal, volume 6, issue 5, pp. 1-32.

Jensen, H.L., 1992. Using neural networks for credit scoring. Managerial Finance, 18, 15-26.

Jorion, P., 2000. Value at Risk. Second edition. McGraw-Hill, New York.

Mester, L.J., 1997. What's the Point of Credit Scoring? Business Review, Sept./Oct., pp. 3-16, Federal Reserve Bank of Philadelphia.

Kubat, M. and Matwin, S., 1997. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In: D.H. Fisher (Ed.), ICML, pp. 179-186.

Kuncheva, L. and Jain, L.C., 1999. Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognition Letters, pp. 1149-1156.

Laurikkala, J., 2002. Instance-based data reduction for improved identification of difficult small classes. Intelligent Data Analysis, pp. 311-322.

Ling, C.X., Yang, Q., Wang, J. and Zhang, S., 2004. Decision trees with minimal costs. In: ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, p. 69.

Lizotte, D., Madani, O. and Greiner, R., 2003. Budgeted Learning of Naïve-Bayes Classifiers. In: Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. Acapulco, Mexico: Morgan Kaufmann.

López, V., Fernández, A. and Herrera, F., 2010. Un primer estudio sobre el uso de aprendizaje sensible al coste con sistemas de clasificación basados en reglas difusas para problemas no balanceados. In: Proceedings of the III Congreso Español de Informática (CEDI 2010), III Simposio sobre Lógica Fuzzy y Soft Computing, LFSC 2010 (EUSFLAT), Valencia, Spain, pp. 459-466.

López, V., Fernández, A., García, S., Palade, V. and Herrera, F., 2013. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113-141.

Provost, F., 2003. Machine learning from imbalanced data sets 101 (Extended Abstract). In: AAAI Workshop on Learning with Imbalanced Data Sets.

Provost, F. and Fawcett, T., 2001. Robust classification for imprecise environments. Machine Learning Journal, 42(3), 203-231.

R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Swets, J.A. and Pickett, R.M., 1982. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press, Inc., New York.

Ting, K.M., 1998. Inducing cost-sensitive trees via instance weighting. In: Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, pp. 139-147.

Turney, P.D., 1995. Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm. Journal of Artificial Intelligence Research, 2, 369-409.

Turney, P.D., 2000. Types of cost in inductive concept learning. In: Proceedings of the Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, Stanford University, California.

Wang, J., Xu, M., Wang, H. and Zhang, J., 2006. Classification of Imbalanced Data by Using the SMOTE Algorithm and Locally Linear Embedding. In: ICSP, volume 3, pp. 16-20.

West, D., 2000. Neural network credit scoring models. Computers & Operations Research, vol. 27, pp. 1131-1152.

Wilson, D.L., 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics. IEEE Computer Society Press, Los Alamitos.

Witten, I.H. and Frank, E., 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.

Zadrozny, B. and Elkan, C., 2001. Learning and Making Decisions When Costs and Probabilities are Both Unknown. In: Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pp. 204-213.

Zhang, J. and Mani, I., 2003. kNN approach to unbalanced data distributions: A case study involving information extraction. In: ICML Workshop on Learning from Imbalanced Datasets II.

Zweig, M.H. and Campbell, G., 1993. Receiver-Operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine. Clin. Chem., 39(4), 561-577. [Corrections in Clin. Chem. (1993), 39, 1589].

Published
2019-09-30