This paper provides a large-scale computational analysis of poetry reading audio signals to unveil the musicality within professionally read poems. Although the acoustic characteristics of other types of spoken language have been studied extensively, most of the literature is limited to narrative speech or singing voice and discusses how the two differ from each other. In this work, we develop signal processing methods tailored to capture the unique acoustic characteristics of poetry reading based on silence patterns, temporal variations of local pitch, and beat stability. Our large-scale statistical analyses on three corpora, representing narration (LibriSpeech), singing voice (Intonation), and poetry reading (from The Poetry Foundation), show that poetry reading shares some musical characteristics with singing voice, although it may also resemble narrative speech.
We provide the URLs of the poems we used in this study. We reduce their sample rate to 16 kHz.

Narration Dataset: LibriSpeech provides a collection of 1,101 unique books read aloud and stored as audio files with a 16 kHz sample rate, which we use as a representation of narrative speech. This choice is motivated primarily by the content being drawn from full-length books, which inherently encompass continuous and extended narratives in both fiction and non-fiction genres. The natural flow of language, varying tones, cadences, and expressive modulations intrinsic to book readings capture the essence of narrative speech. Furthermore, as these recordings are derived from diverse authors, styles, and periods, they collectively offer a comprehensive portrayal of storytelling and narrative techniques across a broad spectrum. We carefully concatenated the segmented audio signals to be slightly over 90 seconds to increase the length of the signals in our experiments, while preservin
We used only the clean data fold, leaving out the non-clean data.

Singing Voice Dataset: We use the Intonation dataset as a representative of singing voice. Intonation contains 4,702 audio files sourced globally through a karaoke app operated by Smule, Inc. This ensures a broad spectrum of singing styles, techniques, and nuances inherent to diverse cultures and traditions. The user-generated nature of Intonation brings a raw authenticity to the collection, encompassing both trained voices and natural, untrained vocal expressions. In addition, Intonation is suitable for our purposes because the recordings contain no accompanying musical instruments. Once again, we reduce the set to 1,050 English-language songs using WhisperX's language detection, with the same criteria applied to select the poetry audio. We resample the signals to 16 kHz.
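The concatenation step described for the narration data (joining segmented signals until each clip is slightly over 90 seconds) could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `concatenate_segments`, the greedy joining rule, and the choice to drop a too-short remainder are all assumptions, as is the premise that the input segments are already ordered and come from a single source.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000   # Hz, as stated in the text
MIN_SECONDS = 90.0     # clips should end up slightly longer than this


def concatenate_segments(segments: Sequence[List[float]],
                         sample_rate: int = SAMPLE_RATE,
                         min_seconds: float = MIN_SECONDS) -> List[List[float]]:
    """Greedily join consecutive segments until each clip exceeds min_seconds.

    Assumption of this sketch: `segments` is ordered and comes from one
    source (for example one reader), so joining neighbours is meaningful.
    A trailing remainder that never reaches min_seconds is dropped.
    """
    min_samples = int(min_seconds * sample_rate)
    clips: List[List[float]] = []
    current: List[float] = []
    for seg in segments:
        current.extend(seg)
        if len(current) > min_samples:
            # The clip has just crossed the threshold; close it and start anew.
            clips.append(current)
            current = []
    return clips
```

Because a clip is closed as soon as it crosses the threshold, its length exceeds 90 seconds by at most one segment, which matches the "slightly over" wording.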
4. EXPERIMENTS

4.1. Silence Patterns

In all three categories, it is common to observe silent regions between words. However, their lengths and their functions in the oral performance differ. For example, in narration, there tends to be a pause between sentences. In singing voice, on the other hand, long pauses are also common, e.g., during the interlude in Western pop music. We are interested in the intentional pauses in poetry reading, which are often expected after a line or stanza to maintain the rhythm or make poems more song-like. To capture this, we draw histograms of the lengths of the silent periods, which are acquired from the silence removal process in Sec.
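As a rough illustration of how silent-period lengths could be collected and grouped before plotting histograms, the sketch below marks low-energy frames as silent and measures each run of them. It is not the silence removal process the paper refers to: the frame size, the RMS threshold, the function names, and the 0.4 s and 2.0 s boundaries between short, medium, and long silences are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def silence_durations(samples: Sequence[float], sample_rate: int,
                      frame_ms: float = 20.0,
                      threshold: float = 0.01) -> List[float]:
    """Return the length in seconds of each run of consecutive silent frames.

    A frame counts as silent when its RMS amplitude is below `threshold`
    (an assumed value; real recordings need a tuned threshold).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    durations: List[float] = []
    run = 0  # number of consecutive silent frames seen so far
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        if rms < threshold:
            run += 1
        elif run:
            durations.append(run * frame_len / sample_rate)
            run = 0
    if run:  # the signal ended while still silent
        durations.append(run * frame_len / sample_rate)
    return durations


def bucket_counts(durations: Sequence[float],
                  edges: Tuple[float, float] = (0.4, 2.0)) -> Dict[str, int]:
    """Count silences per group; the edge values are assumed, not from the paper."""
    short_max, medium_max = edges
    counts = {"short": 0, "medium": 0, "long": 0}
    for d in durations:
        if d < short_max:
            counts["short"] += 1
        elif d < medium_max:
            counts["medium"] += 1
        else:
            counts["long"] += 1
    return counts
```

Pooling the durations from every file in a corpus and histogramming each group separately would give one density curve per category, which is the kind of comparison this section describes.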
[Figure 1, panels (a), (b), (c): x-axis Silence Duration (seconds), y-axis Density; curves for Narration, Singing, and Poetry]

Fig. 1: Histograms of the (a) short, (b) medium, and (c) long silent segments.

Table 1: Summary of the Datasets

                        LibriSpeech   Poetry Reading   Intonation
Number of Files               1,101            1,058        1,050
Spoken Duration (min)         1,079            1,318        1,889
Word Count                  285,233          307,039      269,567
Total Silence (min)             653            1,142        1,450
Words Per Min.               164.73           124.88        80.76
Std. Silence (sec)             0.25             0.41         1.25
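The text does not state how the Words Per Min. row of Table 1 is computed. One reading that appears consistent with the reported numbers is word count divided by total duration, that is, spoken duration plus total silence. The sketch below checks that reading against the table values; the formula is an assumption inferred from the table, and the small residual differences are presumably due to the durations being rounded to whole minutes.

```python
# Values copied from Table 1; "wpm" is the reported Words Per Min. entry.
TABLE_1 = {
    "LibriSpeech":    {"spoken_min": 1079, "silence_min": 653,  "words": 285_233, "wpm": 164.73},
    "Poetry Reading": {"spoken_min": 1318, "silence_min": 1142, "words": 307_039, "wpm": 124.88},
    "Intonation":     {"spoken_min": 1889, "silence_min": 1450, "words": 269_567, "wpm": 80.76},
}


def words_per_minute(words: int, spoken_min: float, silence_min: float) -> float:
    """Assumed definition: words divided by total (spoken + silent) minutes."""
    return words / (spoken_min + silence_min)


def compare_with_table(table=TABLE_1):
    """Return {dataset: (recomputed, reported)} for a side-by-side check."""
    result = {}
    for name, row in table.items():
        recomputed = words_per_minute(row["words"], row["spoken_min"], row["silence_min"])
        result[name] = (recomputed, row["wpm"])
    return result


if __name__ == "__main__":
    for name, (recomputed, reported) in compare_with_table().items():
        print(f"{name}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```

Under this reading, the lower rate for poetry relative to narration reflects both slower delivery and a larger share of silence, and the singing data are lower still.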