ABSTRACT

Machine learning now matches human performance in a number of fields, including text classification. However, classifiers perform poorly when the training data are imbalanced. Oversampling is one way to address class imbalance, and many oversampling methods are readily available and easy to apply. While comparative studies of oversampling methods on non-text data have been conducted, studies that compare oversampling methods on text data under a unifying framework are scarce. This study finds that although oversampling methods generally improve classifier performance, similarity is an important factor influencing how classifiers perform on imbalanced and resampled data.

KEYWORDS

Imbalanced data, oversampling methods, SMOTE, topic classification
