Data-driven Approaches to Author’s Profiling Identification for Russian Texts on Base of Complex Machine Learning Models in Combinations with Siamese Networks

ALEKSANDR SBOEV, IVAN MOLOSHNIKOV, DMITRY GUDOVSKIKH, ROMAN RYBKA

Abstract


This work investigates data-driven approaches to author profiling for Russian texts on the basis of a unified data corpus. The corpus was collected specifically for this purpose by crowdsourcing and currently contains texts from 1161 men and 2043 women. The paper describes how complex models based on convolutional neural networks, gradient boosting methods, LSTM, and Siamese networks are adapted, together with different input data and features (morphological data, vectors of character n-gram frequencies, Linguistic Inquiry and Word Count, and others), to form a vector of derived features for identifying the gender and age of a text's author. A method for improving accuracy by encoding with a Siamese network is presented and analyzed.
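As an illustration of one input feature named above, a vector of character n-gram frequencies might be computed roughly as follows. This is a minimal sketch and not the authors' implementation: the function names `char_ngram_frequencies` and `to_vector`, the choice of n = 3, the lowercasing step, and the normalization to relative frequencies are all assumptions made for the example.

```python
from collections import Counter


def char_ngram_frequencies(text, n=3):
    """Return relative frequencies of overlapping character n-grams.

    Hypothetical helper: lowercasing and relative-frequency
    normalization are assumptions, not details from the paper.
    """
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        # Text shorter than n characters yields no n-grams.
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def to_vector(freqs, vocabulary):
    """Project a frequency dict onto a fixed, ordered n-gram vocabulary."""
    return [freqs.get(gram, 0.0) for gram in vocabulary]
```

A fixed vocabulary (for example, the most frequent n-grams in a training set) would then give every text a vector of the same length, which is what a downstream classifier needs.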

Keywords


Data-driven modeling, Author’s profiling, Age detection, Gender identification, Deep neural networks, Siamese networks.


DOI
10.12783/dtcse/ceic2018/24526
