Review summarisation is an important task in Natural Language Processing with applications in E-commerce. Despite the rapid development of neural models for review summarisation, the evaluation of summaries lags far behind. Reviews are opinionated texts: in addition to the semantic dimension shared by documents of other genres, they carry a sentiment dimension, which presents unique challenges for summary evaluation.
In this thesis, we focus on three research questions to study the automated evaluation of review summaries. Our first research question investigates the suitability of a popular evaluation metric, ROUGE, for review summaries. While ROUGE correlates well with human judgments on news summaries, its suitability for review summaries has yet to be formally investigated. In particular, we examine how effective ROUGE is at evaluating review summaries given their associated sentiment polarity. Through a series of simulation studies, we find that ROUGE is not effective at evaluating review summaries, especially when the generated summary is of opposite sentiment polarity to the reference summary.
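To illustrate the kind of comparison this involves, the minimal sketch below scores a candidate of opposite sentiment polarity against a reference with the rouge-score package; the example sentences are hypothetical and are not taken from our simulation study.

```python
# A minimal sketch of scoring a polarity-flipped candidate with ROUGE,
# using Google's rouge-score package. The sentences are invented for
# illustration only.
from rouge_score import rouge_scorer

reference = "The battery life is excellent and the screen is sharp."
candidate = "The battery life is terrible and the screen is sharp."  # opposite polarity

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    # High n-gram overlap despite the flipped opinion illustrates why
    # ROUGE can reward summaries with the wrong sentiment polarity.
    print(f"{name}: precision={score.precision:.2f}, "
          f"recall={score.recall:.2f}, f1={score.fmeasure:.2f}")
```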
Our second research question focuses on how to measure the opinion similarity of two opinion sentences, a comparison that underlies many automated evaluation metrics. We study how existing similarity metrics perform on this task and observe that they fall short when evaluating sentence pairs of differing sentiment polarity. To address this problem, we fine-tune a pretrained sentence encoder using weak supervision from review ratings with Siamese and Triplet networks. Our metric, Sentence-BERT for Opinion Similarity (SOS), combines the fine-tuned encoder with cosine similarity to evaluate opinion sentence pairs, and it outperforms other similarity metrics on opinion sentences of opposite sentiment polarity.
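A minimal sketch of the scoring step follows, assuming the sentence-transformers library; the checkpoint name is a generic pretrained stand-in for the fine-tuned SOS encoder, and the sentence pair is invented for illustration.

```python
# A minimal sketch of scoring an opinion sentence pair with a Sentence-BERT
# encoder and cosine similarity, using the sentence-transformers library.
# "all-MiniLM-L6-v2" is a generic pretrained checkpoint standing in for the
# SOS encoder fine-tuned with weak supervision from review ratings.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

sent_a = "The camera takes stunning photos in low light."
sent_b = "Photos from the camera look grainy and washed out."  # opposite polarity

emb_a, emb_b = encoder.encode([sent_a, sent_b], convert_to_tensor=True)
similarity = util.cos_sim(emb_a, emb_b).item()
print(f"opinion similarity: {similarity:.3f}")
```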
For our third research question, we investigate opinion similarity for evaluating review summaries. We propose a new quality dimension for review summaries, opinion consistency, which assesses whether a generated summary contains conflicting opinions; a summary containing conflicting opinions is less useful to the user reading it. We propose the Metric for Opinion Consistency (MOC) to evaluate summaries along this dimension. We explore different methods of aggregating the similarity of sentence pairs and find that the mean over sentence pairs performs best. For the similarity function, we propose weakly supervised contrastive learning to fine-tune a pretrained sentence encoder for effective sentence representations. We show that our metric outperforms other baseline metrics.
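As a rough illustration of the aggregation idea, the sketch below computes a consistency-style score for a generated summary as the mean over all sentence-pair similarities; the helper name, example summary, and checkpoint are assumptions for illustration rather than our released implementation.

```python
# A rough sketch of a MOC-style opinion-consistency score: encode every
# sentence of a generated summary, score all sentence pairs with cosine
# similarity, and aggregate with the mean. The checkpoint is a generic
# stand-in for the contrastively fine-tuned encoder.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util


def opinion_consistency(sentences: list[str], encoder: SentenceTransformer) -> float:
    """Mean pairwise similarity over all sentence pairs in a summary."""
    embeddings = encoder.encode(sentences, convert_to_tensor=True)
    pair_scores = [
        util.cos_sim(embeddings[i], embeddings[j]).item()
        for i, j in combinations(range(len(sentences)), 2)
    ]
    return sum(pair_scores) / len(pair_scores)


encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint
summary = [
    "The keyboard feels solid and responsive.",
    "Typing on the keyboard is a pleasure.",
    "The keyboard is mushy and frustrating to use.",  # conflicting opinion
]
print(f"opinion consistency: {opinion_consistency(summary, encoder):.3f}")
```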
In summary, we find that opinion polarity is an important dimension for review summarisation. We propose novel neural, learning-based models to train automated metrics for evaluating opinion similarity and opinion consistency in review summaries. Our findings can inform the further development of review summarisation models.