GRNTI 50.07 Theoretical foundations of computer engineering
BBK 3297 Computer engineering
A method for classifying text with convolutional neural networks is considered. A text preprocessing algorithm is presented; it includes lemmatization, stop-word removal, and processing of text characters, among other steps. The text is then converted word by word into dense vectors. Testing is carried out on the "The 20 Newsgroups" text dataset, a collection of approximately 20,000 newsgroup posts in English divided approximately evenly among 20 categories. The best convolutional neural network used in this work reached an accuracy of approximately 74% on the test set; its topology is given. Voting among neural networks combined by the bagging algorithm reached an accuracy of approximately 81.5%. Based on a review of similar solutions, the results are compared with the following text classification approaches: the support vector machine (SVM, 82.84%), the naive Bayes classifier (81%), the k-nearest neighbors algorithm (75.93%), and the bag-of-words model.
neural networks, bagging, text classification, "The 20 Newsgroups" dataset
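
The abstract reports that voting among networks trained with bagging is more accurate than the best single network. A minimal sketch of that idea follows, under assumptions that are not taken from the paper: each model is trained on a bootstrap replicate of the training set, and the ensemble label is chosen by plurality (hard) voting. The paper may instead combine softmax outputs. The function names and the example predictions are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap replicate)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(predictions_per_model):
    """Combine per-model label predictions by plurality vote.

    predictions_per_model holds one list of labels per model, all of the
    same length. On a tie, the label that appears first among the tied
    votes for that document is chosen.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions of three models on four documents
# (assumption: each model was trained on its own bootstrap replicate).
model_predictions = [
    ["sci.space", "rec.autos", "sci.med", "talk.politics.misc"],
    ["sci.space", "rec.autos", "sci.space", "rec.autos"],
    ["sci.med", "rec.autos", "sci.med", "talk.politics.misc"],
]
ensemble_labels = majority_vote(model_predictions)
print(ensemble_labels)
```

In this toy example every individual model disagrees with the others on at least one document, while the vote settles on the label that most models chose for each document. That is the mechanism by which an ensemble can outperform its members.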