Health coaching helps patients identify and accomplish lifestyle-related goals, effectively improving the control of chronic diseases and mitigating mental health conditions. However, health coaching is cost-prohibitive due to its highly personalized and labor-intensive nature. In this paper, we propose to build a dialogue system that converses with the patients, helps them create and accomplish specific goals, and can address their emotions with empathy. However, building such a system is challenging since real-world health coaching datasets are limited and empathy is subtle. Thus, we propose a modularized health coaching dialogue system with simplified NLU and NLG frameworks combined with mechanism-conditioned empathetic response generation. Through automatic and human evaluation, we show that our system generates more empathetic, fluent, and coherent responses and outperforms the state-of-the-art in NLU tasks while requiring less annotation. We view our approach as a key step towards building automated and more accessible health coaching systems.
5.2 Evaluation Metrics

Depending upon the task, we use the following evaluation metrics:

1. Slot-filling: Precision, Recall, and F1-score.
2. Dialogue State (Goal Attribute) Tracking: partial/complete match and goal Correctness_k, following Gupta et al. (2020b).
3. Dialogue Generation: BLEU (Papineni et al., 2002), BertScore (Zhang et al., 2020), Perplexity (PPL), and Empathy Score.

Correctness_k: Computes the percentage of correct goals among all the predicted goals of each week. A goal is regarded as correct if at least k of its attributes are predicted correctly.

Perplexity: We measure fluency as the perplexity (PPL) of the generated response under a pre-trained GPT2 model that has not been fine-tuned for this task, following previous work (Ma et al., 2020; Sharma et al., 2021b).
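To make the Correctness_k definition concrete, the following is a minimal sketch rather than the evaluation code used in the paper. It assumes each goal is represented as a dictionary from attribute name to value and that predicted and gold goals are already aligned week by week; the function name `correctness_at_k` and the attribute names in the example are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: one goal = {attribute name: value}.
Goal = Dict[str, str]


def correctness_at_k(predicted: List[Goal], gold: List[Goal], k: int) -> float:
    """Fraction of goals with at least k correctly predicted attributes.

    Assumes predicted[i] and gold[i] describe the same week's goal.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned one-to-one")
    if not gold:
        return 0.0

    correct_goals = 0
    for pred, ref in zip(predicted, gold):
        # Count gold attributes whose predicted value matches exactly.
        matched = sum(1 for attr, value in ref.items() if pred.get(attr) == value)
        if matched >= k:
            correct_goals += 1
    return correct_goals / len(gold)


# Illustrative (made-up) goals for two weeks.
predicted_goals = [
    {"activity": "walk", "amount": "5000 steps", "days": "monday"},
    {"activity": "run", "amount": "2 miles", "days": "friday"},
]
gold_goals = [
    {"activity": "walk", "amount": "5000 steps", "days": "tuesday"},
    {"activity": "walk", "amount": "3 miles", "days": "friday"},
]
```

With these made-up goals, the first week matches two attributes and the second matches one, so raising k makes the metric stricter: the score falls from 1.0 at k = 1 to 0.5 at k = 2 and 0.0 at k = 3.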
Empathy Score: We train a standard BERT-based text regression model on the response posts and their corresponding empathy-level scores from Sharma et al. (2020), achieving an RMSE of 0.57. We use this model to measure the empathy of the generated outputs.

For detailed descriptions of the metrics, training (including model parameters and selection), and supplementary analysis, please see the Appendix.

System: Gupta et al. (2020b); +Phase; +StartPhase; +StartPhase+Aug; Slot Only+Aug
Metrics: Slot R; Slot P; Slot F1; PAcc
Values: 0.801 0.806 0.779 0.899 0.835 0.910 0.926 0.817 0.904 0.808 0.837 0.847 0.879 0.876 0.790 0.867 0.877 0.902 0.890

Table 1: Evaluation on slot-filling with ablations. Aug: using data augmentation; StartPhase: jointly predicting the phase of the sentence only if it is the beginning sentence of the phase.

5.3 Results
5.3.1 Goal Attributes Tracking

Table 1 shows the performance of slot-filling and phase prediction compared to the previous model. Jointly predicting the phase of the sentence only if it is at the beginning of the phase, combined with data augmentation (+StartPhase+Aug), achieved the best performance for slot-filling, outperforming the state-of-the-art by 11.2% in F1. However, modeling without phases suffices for slot-filling while reducing the annotation cost. As such, we adopt the no-joint model for downstream tasks. Our experimental results also show that the carryover classifier achieved an F1-score of 0.88 using only the dialogue context. We investigated whether dialogue act and phase labels can improve the carryover classifier; however, they barely contribute to the model performance.⁴
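For readers unfamiliar with the slot-filling scores discussed above, the sketch below shows one common way to compute precision, recall, and F1 over predicted slots. It is an illustration under stated assumptions, not the paper's scoring script: it assumes exact-match, micro-averaged scoring over (utterance index, slot type, value) triples, and the function name `slot_prf` and the example slots are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical slot representation: (utterance index, slot type, value).
Slot = Tuple[int, str, str]


def slot_prf(predicted: Iterable[Slot], gold: Iterable[Slot]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of slot triples."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative (made-up) slots: 2 of 3 predictions are correct, 2 of 4 gold slots are found.
predicted_slots = [(0, "activity", "walk"), (0, "amount", "5000 steps"), (1, "days", "monday")]
gold_slots = [
    (0, "activity", "walk"),
    (0, "amount", "5000 steps"),
    (1, "days", "friday"),
    (1, "time", "morning"),
]
precision, recall, f1 = slot_prf(predicted_slots, gold_slots)
```

In this example precision is 2/3, recall is 1/2, and F1 is 4/7. Other conventions, such as token-level or macro-averaged scoring, would give different numbers on the same predictions.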
In previous work, we extracted goals at two critical points for each week to evaluate offline goal tracking: one at the end of the goal-setting stage (forward) and the other at the end of the goal-implementation stage (backward). The forward and backward goals can be different since the patient may encounter barriers, and the goal can be revised in the implementation stage. We also proposed a rule-based approach to update the slot values, which simply records the last mention of the value for each slot except for certain conditions. In this paper, we compared our model with previous work and a combination of our slot-filling model with the previous rule-based approach (our SF+Rule). Table 2 shows the performance for goal attributes tracking of our NLU module compared with previous work. For dataset 1 backward goals, our SF+Rule achieved the best performance resulting from more accurate slot-filling. We observe that in dataset 1, the coach tends to summarize the goal to the patients