Representation Learning for Insurance Products

The insurance industry has long recognized the importance of data, and its business success relies heavily on data collection and analysis. With the rapid growth in computing power and the development of machine learning techniques, an increasing number of variables and features are used in predictive analysis across various aspects of insurance, such as ratemaking, loss reserving, and risk management. While most numerical or categorical variables can be fed directly into a machine learning model, unstructured text data remain largely under-utilized.

One popular way to use text data is feature engineering, which often involves manually designing algorithms to extract information from the text, such as word counts and sentiment scores. Although this approach provides an easily interpreted “measurement” of the entire text, discovering new features usually requires domain knowledge and is quite time-consuming. Recently, many researchers have turned to Natural Language Processing (NLP) to facilitate textual analysis. While many deep learning models succeed in improving prediction accuracy on supervised learning tasks, they often provide little tractability and interpretability, both of which are important in decision making.
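To make this concrete, the sketch below illustrates the kind of hand-crafted features described above, a word count and a simple lexicon-based sentiment score, computed from a free-text claim note. It is a toy Python example; the word lists are illustrative placeholders, not an actual claims sentiment lexicon.

    import re

    # Tiny illustrative lexicons (placeholders, not a real sentiment dictionary).
    POSITIVE = {"resolved", "minor", "satisfied"}
    NEGATIVE = {"severe", "dispute", "injury", "fraud"}

    def text_features(text):
        """Extract hand-crafted features from a free-text claim note."""
        tokens = re.findall(r"[a-z']+", text.lower())
        pos = sum(t in POSITIVE for t in tokens)
        neg = sum(t in NEGATIVE for t in tokens)
        return {
            "word_count": len(tokens),
            # Net sentiment, normalized by length so notes of different sizes compare.
            "sentiment": (pos - neg) / max(len(tokens), 1),
        }

    print(text_features("Severe water damage; injury reported, claim in dispute."))

Each feature is a single interpretable number for the whole text, which is exactly the appeal of this approach; the cost is that every new feature must be designed by hand.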

In this project, we will explore representation learning techniques for understanding unstructured text data, aiming to provide a low-dimensional and interpretable representation of texts. In the previous semester, we reviewed the literature on both supervised and unsupervised learning tasks and implemented several novel algorithms based on long short-term memory (LSTM) neural networks and Bidirectional Encoder Representations from Transformers (BERT). In the current semester, we will continue exploring the literature and modify the existing algorithms with an emphasis on the interpretability of the model.
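For illustration, the sketch below shows one common way a fixed-length text representation can be obtained from BERT: mean-pooling the final hidden states into a single vector per document. It assumes the Hugging Face transformers and torch packages; the model name and pooling choice are generic illustrations, not the project's actual algorithms.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(text):
        """Return a 768-dimensional document embedding (mean of token states)."""
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        # Average over the token dimension to get one vector for the whole text.
        return outputs.last_hidden_state.mean(dim=1).squeeze(0)

    vec = embed("Water damage to kitchen ceiling following a burst pipe.")
    print(vec.shape)  # torch.Size([768])

The individual coordinates of such a dense vector are hard to interpret, which is precisely the limitation the project's emphasis on interpretability is meant to address.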

Supervisor: Xiaochen Jing

Graduate Supervisor: Yuxuan Li