In this study, the authors propose the Psy-LLM framework, an AI-based assistive tool that leverages Large Language Models (LLMs) for question answering in psychological consultation settings, easing the demand on mental health professionals. The framework combines pre-trained LLMs with real-world professional Q&A from psychologists and extensively crawled psychological articles. Psy-LLM serves as a front-end tool for healthcare professionals, allowing them to provide immediate responses and mindfulness activities to alleviate patient stress.
ROUGE-L (Longest Common Subsequence) is an evaluation metric that measures the number of overlapping units between the text generated by a language model and a reference text, quantifying how closely the generated output matches the desired output (Lin 2004). Distinct-1 and Distinct-2 assess the diversity of the generated text: Distinct-1 is the number of distinct unigrams (individual words) divided by the total number of generated words, while Distinct-2 is the number of distinct bigrams (pairs of adjacent words) divided by the total number of generated bigrams (Li, Galley, Brockett, et al. 2016). These metrics reflect the degree of diversity in the generated text by quantifying the presence of unique unigrams and bigrams. The formula for calculating Distinct-n is:

\[
\text{Distinct-}n := \frac{\mathrm{Count}(\text{unique } n\text{-gram})}{\mathrm{Count}(\text{word})}
\]

Here, Count(unique n-gram) denotes the number of n-grams that are not repeated in a reply, and Count(word) denotes the total number of n-grams in the reply. A higher value of Distinct-n indicates greater diversity in the generated replies.

These evaluation metrics, including perplexity, ROUGE-L, Distinct-1, and Distinct-2, provide insights into the quality, similarity, and diversity of the text generated by the language model, and they serve as valuable tools for assessing its performance in producing accurate and diverse outputs. However, while perplexity and Distinct-n capture aspects of the model's language-generation performance, they do not necessarily indicate high accuracy. Human evaluation is therefore still necessary to evaluate models more convincingly: human evaluators can provide subjective assessments of the generated text, considering factors such as coherence, relevance, and overall quality, which are important aspects that cannot be fully captured by automated evaluation metrics alone.
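For concreteness, the minimal sketch below shows one way the automatic metrics described above could be computed. The whitespace tokenisation, the β weight in the ROUGE-L F-measure, and the example sentences are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of the automatic metrics above: Distinct-n (unique n-grams
# divided by total n-grams in a reply) and an LCS-based ROUGE-L F-measure
# (Lin 2004). Tokenisation is naive whitespace splitting, for illustration only.

def distinct_n(reply: str, n: int) -> float:
    """Distinct-n = Count(unique n-gram) / Count(word) for a single reply."""
    tokens = reply.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-measure based on the longest common subsequence."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for the LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    generated = "it is okay to feel anxious and it is okay to ask for help"
    reference = "feeling anxious is normal and it is okay to ask for help"
    print("Distinct-1:", distinct_n(generated, 1))
    print("Distinct-2:", distinct_n(generated, 2))
    print("ROUGE-L:", rouge_l(generated, reference))
```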
4.9.2 Human evaluation

For human evaluation, we developed an online marking system to assess the performance of our language model in the context of online psychological consultation. This system streamlines the process and ensures effective assessment by focusing on four key metrics: Helpfulness, Fluency, Relevance, and Logic. Each metric is scored on a scale of 1 to 5, allowing evaluators to provide a quantitative assessment of each aspect. The four metrics are defined as follows:

1. Helpfulness: evaluates whether the generated response is helpful for patients seeking psychological support.
2. Fluency: refers to the degree of coherence and naturalness exhibited in the generated response.
3. Relevance: assesses the extent to which the response's content directly relates to the posed question.
4. Logic: examines the logical consistency and coherence of the meaning conveyed in the generated response.
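As an illustration only (the authors' online marking system is not described at the code level), the sketch below shows how per-evaluator ratings on these four metrics might be collected and averaged for a single generated response; the data layout and example scores are hypothetical.

```python
# Hypothetical aggregation of human-evaluation ratings: each evaluator scores
# a generated response from 1 to 5 on Helpfulness, Fluency, Relevance and
# Logic, and the mean score per metric is reported.

from statistics import mean

METRICS = ("helpfulness", "fluency", "relevance", "logic")


def aggregate(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average each metric over all evaluator ratings for one response."""
    for rating in ratings:
        assert all(1 <= rating[m] <= 5 for m in METRICS), "scores must be on a 1-5 scale"
    return {m: mean(rating[m] for rating in ratings) for m in METRICS}


# Example: three hypothetical evaluators scoring the same generated reply.
scores = [
    {"helpfulness": 4, "fluency": 5, "relevance": 4, "logic": 4},
    {"helpfulness": 3, "fluency": 4, "relevance": 4, "logic": 3},
    {"helpfulness": 4, "fluency": 5, "relevance": 5, "logic": 4},
]
print(aggregate(scores))  # mean score per metric, e.g. helpfulness ≈ 3.67
```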