Representation Learning for Insurance Products

The insurance industry has long recognized the importance of data, and its business success relies heavily on data collection and analysis. With the rapid growth in computing power and the development of machine learning techniques, more and more variables (features) are used in predictive analysis across many areas of insurance, such as ratemaking, loss reserving, and risk management. While most numerical and categorical variables can be fed directly into a machine learning model, unstructured text data remain largely under-utilized.

One popular way to use text data is feature engineering, which often involves manually designing algorithms to extract information from the text, such as a word count or a sentiment score. Although this approach provides an easily interpreted “measurement” of the entire text, discovering new features usually requires domain knowledge and is quite time-consuming. Recently, many researchers have started using Natural Language Processing (NLP) techniques based on deep learning to facilitate textual analysis. While many of these deep learning models succeed in improving prediction accuracy for the response variables, they often provide little tractability and interpretability, which are also important in decision making.
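To make the feature engineering approach concrete, here is a minimal Python sketch of two hand-crafted text features: a word count and a simple lexicon-based sentiment score. The word lists, the function names, and the example claim note are all hypothetical and chosen only for illustration; a real application would rely on a curated, domain-specific lexicon.

```python
import re

# Hypothetical mini-lexicons, for illustration only; a real project would
# use a curated or domain-specific sentiment lexicon.
POSITIVE_WORDS = {"good", "prompt", "helpful", "satisfied", "resolved"}
NEGATIVE_WORDS = {"damage", "delay", "denied", "complaint", "loss"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def extract_features(text):
    """Return two hand-crafted features for a piece of text."""
    tokens = tokenize(text)
    n_tokens = len(tokens)
    n_pos = sum(token in POSITIVE_WORDS for token in tokens)
    n_neg = sum(token in NEGATIVE_WORDS for token in tokens)
    # Net share of sentiment-bearing words; defined as 0.0 for empty text.
    sentiment = (n_pos - n_neg) / n_tokens if n_tokens else 0.0
    return {"word_count": n_tokens, "sentiment_score": sentiment}


# Hypothetical claim note, not taken from any real data set.
example_note = "Water damage to kitchen; adjuster was prompt and helpful."
features = extract_features(example_note)
print(features)
```

Each feature is a single, easily explained number, which is the interpretability advantage noted above. The cost is that every new feature needs its own hand-written rule and word list, which is where the domain knowledge and time are spent.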

In this project, we will explore representation learning techniques for understanding unstructured text data, aiming to provide a low-dimensional representation of the entire text with interpretable generated features. We will review the related literature on this topic, understand and implement a representation learning model, and experiment on insurance text data.

Supervisors: Xiaochen Jing

Graduate Supervisor: Yuxuan Li