Ceross, Aaron William Karl; Zhu, Tingting
Abstract: As the use of personal data becomes further entrenched in societal interaction, the regulation of such data continues to grow as an important area of law. Data protection authorities, however, have limited resources to address an increasing number of investigations. Appropriate data-driven models, coupled with the automation of decision making, have the potential to help in such circumstances. In this paper, we evaluate machine learning models from the literature (such as support vector machine (SVM), random forest, and multinomial naive Bayes classifiers) for natural language processing in order to predict, from a description of case facts, whether a monetary penalty was levied. We tested these models on a novel dataset collected from the data protection authority of Macau across three languages (i.e., Chinese, English, and Portuguese). Our experimental results show that the machine learning models have promising predictive performance for automating the monetary penalty decision process. In particular, SVM performs consistently across the three languages, achieving an AUROC of 0.725, 0.762, and 0.748 for Chinese, English, and Portuguese, respectively. We further evaluated the interpretability of SVM independently for each of the languages and found that the salient texts identified are shared across the three languages.
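The AUROC figures reported above can be read as the probability that a classifier scores a randomly chosen penalty case higher than a randomly chosen non-penalty case. The following is a minimal sketch of that computation in plain Python, not the authors' evaluation code; the function name `auroc`, the label convention (1 = penalty levied, 0 = no penalty), and the toy labels and scores are all assumptions made for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the paper):
# 1 = monetary penalty levied, 0 = no penalty.
labels = [1, 0, 1, 0, 1, 0]
# Hypothetical classifier scores, e.g. SVM decision values rescaled to [0, 1].
scores = [0.9, 0.4, 0.6, 0.7, 0.8, 0.2]

print(f"AUROC = {auroc(labels, scores):.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported values of roughly 0.73 to 0.76 would mean a penalty case outranks a non-penalty case in about three out of four such pairs.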