Abstract

Machine learning has progressed to the point of matching human performance on many tasks, including text classification. However, classifiers perform poorly when the training data are imbalanced. Oversampling is one way to overcome the problem of imbalanced data, and many oversampling methods are readily available. While comparative studies of oversampling methods on non-text data have been conducted, studies that compare oversampling methods on text data under a unifying framework are scarce. This study finds that, although oversampling methods generally improve classifier performance, similarity is an important factor influencing classifier performance on imbalanced and resampled data.
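As a hedged illustration of the kind of pipeline the abstract describes, the following minimal Python sketch applies one common oversampling method, SMOTE, to TF-IDF features of an artificially imbalanced toy corpus before training a classifier. The corpus, class ratio, classifier choice (LinearSVC), and the scikit-learn/imbalanced-learn toolchain are illustrative assumptions, not the paper's actual experimental setup:

# Illustrative sketch only: SMOTE oversampling on TF-IDF features of an
# artificially imbalanced toy corpus; not the paper's actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

# Toy corpus: 20 minority-class documents vs. 200 majority-class documents.
minority = [f"rare topic document {i} about telescopes and stars" for i in range(20)]
majority = [f"common topic document {i} about football scores" for i in range(200)]
docs = minority + majority
labels = [1] * len(minority) + [0] * len(majority)

# Hold out a stratified test set first, then fit the vectorizer on the
# training documents only.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

# Oversample only the training split so the test distribution stays untouched.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = LinearSVC().fit(X_resampled, y_resampled)
print("F1 on the held-out test set:", f1_score(y_test, clf.predict(X_test)))

Resampling only the training split is a common precaution so that synthetic minority examples never leak into the evaluation data.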

Keywords

Imbalanced data, oversampling methods, SMOTE, topic classification
