Created at 10am, Apr 15
HephaestionFantasy
Fine Tuning Named Entity Extraction Models for the Fantasy Domain
Contract ID: 6K0zLdxjEKs1gFuyBkFTrIcGgH8ZO1hscIotAx2gS5c
File Type: PDF
Entry Count: 40
Embed. Model: jina_embeddings_v2_base_en
Index Type: hnsw

Named Entity Recognition (NER) is a sequence classification Natural Language Processing task in which entities are identified in text and classified into predefined categories. It acts as a foundation for most information extraction systems. Dungeons and Dragons (D&D) is an open-ended tabletop fantasy game with its own diverse lore. D&D entities are domain-specific and are thus unrecognizable even by state-of-the-art off-the-shelf NER systems, as those systems are trained on general data for predefined categories such as person (PERS), location (LOC), organization (ORG), and miscellaneous (MISC). For meaningful extraction of information from fantasy text, the entities need to be classified into domain-specific entity categories, and the models need to be fine-tuned on a domain-relevant corpus. This work uses available lore of monsters in the D&D domain to fine-tune Trankit, a widely used NER framework that builds on a pre-trained model. After this training, the system acquires the ability to extract monster names from relevant domain documents under a novel NER tag. This work compares the accuracy of monster name identification against the zero-shot Trankit model and two FLAIR models. The fine-tuned Trankit model achieves an 87.86% F1 score, surpassing all the other considered models.

From the association map, it was observed that monsters appearing in a large number of lore entries were mostly spurious matches: their names occurred as parts of other words rather than as actual mentions of those monsters. Based on this observation, a threshold of 30 was chosen, and monsters appearing in more than 30 lore entries were placed on an ignore list. This ignore-list construction was done separately for setup 1 and setup 2, where the text look-up was performed on the initial list of monsters and the merged list of monsters respectively. The initial list and the merged list were then updated by removing the monsters on the ignore list.
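A minimal sketch of this thresholding step, assuming a plain substring look-up of monster names over the lore texts (the function name, data, and threshold value here are illustrative, not the authors' exact code):

```python
from collections import Counter

def build_ignore_list(monsters, lore_docs, threshold=30):
    """Count how many lore documents mention each monster name as a
    substring and return the names appearing in more than `threshold`
    documents (likely spurious matches inside other words)."""
    counts = Counter()
    for doc in lore_docs:
        text = doc.lower()
        for name in monsters:
            if name.lower() in text:
                counts[name] += 1
    return [name for name, n in counts.items() if n > threshold]

# "imp" matches inside "Simple", illustrating the substring problem
monsters = ["imp", "steam mephit", "mephit"]
lore_docs = ["An imp is a small fiend.",
             "The steam mephit hisses.",
             "Simple text."]
print(build_ignore_list(monsters, lore_docs, threshold=1))
```

With the toy threshold of 1, "imp" lands on the ignore list because its two hits include a false match inside "Simple", mirroring why frequently-matched names were distrusted.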
id: 1d74b209f6b061161c4a81804253d074 - page: 3
As the first step in tokenizing the data into the BIO format required by the models, both monster lists, that is the initial list and the merged list, were sorted by the length of the monster names, yielding a sorted list for each. Longer monster names are searched for first so that they are tagged first; this leaves out a shorter monster name when a span of text contains more than one. For example, if a sentence contains steam mephit, where both steam mephit and mephit are monsters, both cannot be tagged. To keep only the longer of the two, the monster lists are sorted in descending order of name length.
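The sort itself is a one-liner; this sketch assumes character length as the sort key (the paper does not specify character versus word length):

```python
# Sort monster names so longer names are matched (and tagged) before
# shorter names they contain; Python's sort is stable, so names of
# equal length keep their original relative order.
monsters = ["mephit", "steam mephit", "dragon", "adult red dragon"]
sorted_monsters = sorted(monsters, key=len, reverse=True)
print(sorted_monsters)
# → ['adult red dragon', 'steam mephit', 'mephit', 'dragon']
```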
id: 3b37ec07ce58a505a275b21b98b34759 - page: 3
As the next step, the combined lore data was sentence-tokenized, and each sentence was then word-tokenized with the Trankit tokenizer. For each sentence, a look-up was performed for the monsters in the order of the sorted monster list, and for every match the start and end indices of the monster span were stored. A subsequent monster match is ignored if it overlaps the span of an already found monster. Once the spans of the found monsters are stored, the word tokens in each sentence are tagged B-MONS, I-MONS, or O according to whether the token starts a monster span, lies within one, or falls outside all monster spans.
id: a0b3d4fd89f6a381b4ea70093c5738f2 - page: 3
Combining all the monster lore data yielded 4520 sentences. These were divided into training (2/3), development (1/6), and test (1/6) sets without separating sentences from the same lore into different sets, giving 3011, 764, and 745 sentences respectively. In addition to the automatically tagged test set, the test set obtained during setup 2 was manually verified and used as gold-standard test data for comparing the performance of the different models. It was also observed that the Trankit tokenizer does not split certain domain-specific words into separate tokens; for example, wyrmling dragon and a larder are not tokenized into two tokens each. As FLAIR's BIO format does not support tokens containing spaces, such words are split into two lines at the spaces.
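A greedy sketch of a lore-grouped split under the 2/3, 1/6, 1/6 proportions; the function and the grouping-by-dict representation are illustrative assumptions, not the authors' exact procedure:

```python
def split_by_lore(lore_sentences, train_frac=2/3, dev_frac=1/6):
    """Split sentences into train/dev/test while keeping all
    sentences of one lore document in the same set.
    `lore_sentences` maps lore id -> list of tagged sentences."""
    train_set, dev_set, test_set = [], [], []
    total = sum(len(s) for s in lore_sentences.values())
    for sents in lore_sentences.values():
        if len(train_set) < total * train_frac:
            train_set.extend(sents)
        elif len(dev_set) < total * dev_frac:
            dev_set.extend(sents)
        else:
            test_set.extend(sents)
    return train_set, dev_set, test_set

lore = {"imp": [1, 2, 3, 4], "mephit": [5, 6],
        "dragon": [7, 8], "larder": [9]}
train, dev, test = split_by_lore(lore)
print(len(train), len(dev), len(test))
```

Because whole lore documents are assigned greedily, the realized sizes only approximate the target fractions, consistent with the slightly uneven 3011/764/745 counts reported above.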
id: ca996735979fe6ce7ac42e70067068aa - page: 3
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "6K0zLdxjEKs1gFuyBkFTrIcGgH8ZO1hscIotAx2gS5c", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "6K0zLdxjEKs1gFuyBkFTrIcGgH8ZO1hscIotAx2gS5c", "level": 2}'