Transformer architectures have made significant progress in language-based tasks. BERT is one of the most widely used and freely available transformer architectures in this area. In our work, we use BERT for context-based phrase recognition of magic spells in the Harry Potter novel series. Spells are a common element of active magic in fantasy novels and are typically used in a specific context to achieve a supernatural effect. A series of investigations was conducted to determine whether a transformer architecture can recognise such phrases based on their context in the Harry Potter saga. For our studies, a pre-trained BERT model was fine-tuned with different datasets and training methods to identify the searched context. By considering different approaches for sequence classification as well as token classification, we show that the context of spells can be recognised. According to our investigations, the sequence length examined during fine-tuning and validation plays a significant role in context recognition. Based on this, we investigated whether spells have overarching properties that allow a transfer of the neural network models to other fantasy universes as well. The application of our model showed promising results that are worth deepening in subsequent studies.
Table 4: Metrics of trained model 1b

Therefore, before training the next model (1b), the Harry Potter spells are added to the tokeniser dictionary. As a result, the spells are no longer divided into several tokens, and we investigate whether the spells can be recognised better as a whole in order to improve the classification of the context. A total of 173 new tokens are added to the dictionary. The rest of the training setup remains the same.
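The effect of extending the tokeniser dictionary can be illustrated with a minimal greedy longest-match tokeniser in the style of WordPiece. This is a simplified sketch; the vocabulary entries below are hypothetical and not taken from the actual BERT vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Split a single lower-cased word into the longest matching vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # word not coverable by the vocabulary
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical subword vocabulary: the spell word is only coverable in pieces.
vocab = {"wing", "##ard", "##ium", "levi", "##osa"}
print(wordpiece_tokenize("wingardium", vocab))  # ['wing', '##ard', '##ium']

# After adding the spell word as a whole token (as done for model 1b),
# it is no longer split.
vocab.add("wingardium")
print(wordpiece_tokenize("wingardium", vocab))  # ['wingardium']
```

With the Hugging Face transformers library, the analogous step is `tokenizer.add_tokens(spell_list)` followed by `model.resize_token_embeddings(len(tokenizer))` so the model gains embedding rows for the 173 new tokens.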
Updating the tokeniser had a positive effect on the F1 score. The exact confusion matrix of the results can be found in Table 4. Compared to model (1a), the number of falsely classified sequences has decreased. Figure 7 shows the same sequence as Figure 4 for comparison; it can be seen that the words of the incantation "Wingardium Leviosa" are no longer divided into several tokens. It is also apparent that not only the incantation itself, but also some other words from the context, are significant for the positive classification.

Figure 7: True positive classified sequences with advanced tokeniser

Investigating the false positive classified sequences reveals that some words resembling a spell name, such as the word "Patronus" seen in Figure 8, are misinterpreted as positives.
5.1.2 Paragraph-Split The models (2) were trained on datasets created by the Paragraph-Split to evaluate the impact of expanding the context scope from a single sentence to a full paragraph of the corpus. This splitting approach reduces the number of entries in the training dataset, because one paragraph in the Harry Potter novels contains on average 1-3 sentences. Again using ten times as many negative as positive entries, the total size of the training dataset is 2,882 entries for the incantation-only dataset and 7,931 entries for the full dataset. The evaluation dataset contains 6,911 entries.
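The Paragraph-Split and the 10:1 negative sampling described above can be sketched as follows. The mini-corpus, the spell list, and the function names are hypothetical placeholders for the actual preprocessing, not the authors' code:

```python
import random

def paragraph_split(corpus_text):
    """Split the raw corpus into paragraphs (blank-line separated)."""
    return [p.strip() for p in corpus_text.split("\n\n") if p.strip()]

def build_dataset(paragraphs, spells, neg_ratio=10, seed=42):
    """Label paragraphs and keep at most `neg_ratio` negatives per positive."""
    positives = [p for p in paragraphs if any(s.lower() in p.lower() for s in spells)]
    negatives = [p for p in paragraphs if p not in positives]
    random.Random(seed).shuffle(negatives)
    negatives = negatives[: neg_ratio * len(positives)]
    return [(p, 1) for p in positives] + [(p, 0) for p in negatives]

# Hypothetical mini-corpus for illustration.
corpus = 'He raised his wand.\n\n"Wingardium Leviosa!" she cried.\n\nNothing happened.'
data = build_dataset(paragraph_split(corpus), spells=["Wingardium Leviosa"])
# One positive paragraph, two negatives (well under the 10:1 cap).
```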
By increasing the average context size, an improvement over the previous models could be achieved. The confusion matrices in Table 5 yield F1 scores of 0.8945 and 0.9097. However, when analysing the sequences classified as false positives, it is noticeable that the models still pay a lot of attention to words like "charm", "spell", "charming" or "mark", which often occur together with a spell or form part of a spell name but do not define a spell on their own. In addition, universe-specific character names such as "Regulus Arcturus Black", "Voldemort" or "Ignotus" are often recognised as spells, as shown in Figure 9. Despite the use of paragraphs, and therefore a reduced number of dataset entries, the number of sequences with a spell has only decreased slightly; because of their literary significance, a paragraph with a spell often consists of only one sentence.

Figure 8: False positives classified sequences with advanced tokeniser
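For reference, F1 scores such as those reported above follow directly from the confusion-matrix counts. A minimal computation, using hypothetical counts rather than those of Table 5, looks like this:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for illustration only.
print(round(f1_score(tp=90, fp=10, fn=11), 4))  # 0.8955
```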