Sensors (Basel, Switzerland), Jun 2024
Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and Transformer. However, error accumulation in these hybrid decoders hinders further improvements in accuracy. Additionally, most existing models are built on the Transformer architecture, which tends to be complex and unfriendly to small datasets. Hence, we propose a Nonlinear Regularization Decoding Method for Speech Recognition. First, we introduce a nonlinear Transformer decoder that breaks away from traditional left-to-right or right-to-left decoding orders and enables associations between any characters, mitigating the limitations of Transformer architectures on small datasets. Second, we propose a novel regularization attention module that optimizes the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model to address the challenge of an overly large parameter count. The experimental results indicate that our model performs well: compared to the baseline, it achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on the Aishell1, Primewords, Free ST Chinese Corpus, and Uyghur Common Voice 16.1 datasets, respectively.
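The abstract does not specify how the regularization attention module operates. As a rough illustration of the general idea only (the function name, the uniform-blending scheme, and the parameter lam are all assumptions, not the authors' method), the sketch below computes scaled dot-product attention and smooths the resulting score matrix toward uniform weights, so that no single (possibly erroneous) earlier position dominates later outputs:

```python
# Illustrative sketch, not the paper's actual module.
import numpy as np

def regularized_attention(Q, K, V, lam=0.1):
    """Attention whose weight matrix is blended toward uniform (lam in [0, 1])."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # raw score matrix
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    uniform = np.full_like(weights, 1.0 / weights.shape[-1])
    reg_weights = (1 - lam) * weights + lam * uniform  # regularized matrix
    return reg_weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = regularized_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Blending toward a uniform distribution is just one plausible way to damp over-confident attention; the paper may use a different regularizer entirely.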
Topics: Humans; Speech Recognition Software; Algorithms; Speech; Nonlinear Dynamics; Pattern Recognition, Automated
PubMed: 38931629
DOI: 10.3390/s24123846
Scientific Reports, Jun 2024
Accommodating talker variability is a complex and multi-layered cognitive process. It involves shifting attention to the vocal characteristics of the talker as well as the linguistic content of their speech. Due to an interdependence between voice and phonological processing, multi-talker environments typically incur additional processing costs compared to single-talker environments. A failure or inability to efficiently distribute attention over multiple acoustic cues in the speech signal may have detrimental language learning consequences. Yet, no studies have examined the effects of multi-talker processing in populations with atypical perceptual, social, and language processing for communication, including autistic people. Employing a classic word-monitoring task, we investigated effects of talker variability in Australian English autistic (n = 24) and non-autistic (n = 28) adults. Listeners responded to target words (e.g., apple, duck, corn) in randomised sequences of words. Half of the sequences were spoken by a single talker and the other half by multiple talkers. Results revealed that autistic participants' sensitivity scores for accurately spotted target words did not differ from those of non-autistic participants, regardless of whether the words were spoken by a single or multiple talkers. As expected, the non-autistic group showed the well-established processing cost associated with talker variability (e.g., slower response times). Remarkably, autistic listeners' response times did not differ across single- or multi-talker conditions, indicating they did not show perceptual processing costs when accommodating talker variability. The present findings have implications for theories of autistic perception and speech and language processing.
Topics: Humans; Male; Female; Adult; Speech Perception; Autistic Disorder; Young Adult; Reaction Time; Speech; Attention; Middle Aged; Language
PubMed: 38926416
DOI: 10.1038/s41598-024-62429-w
Journal of Psycholinguistic Research, Jun 2024
The present paper examines how English native speakers produce scopally ambiguous sentences and how they use gestures and prosody for disambiguation. As a case in point, the participants in the present study produced English negative quantifiers, which appear in two different positions, as in (1) The election of no candidate was a surprise (a: 'for those elected, none of them was a surprise'; b: 'no candidate was elected, and that was a surprise') and (2) No candidate's election was a surprise (a: 'for those elected, none of them was a surprise'; b: # 'no candidate was elected, and that was a surprise'). This allowed us to investigate the gesture production and prosodic patterns of positional effects (i.e., the a-interpretation is available in the two different positions in (1) and (2)) and interpretation effects (i.e., two different interpretations are available in the same position in (1)). We found that participants tended to produce more head shakes in the (a) interpretation despite the different positions, but more head nods/beats in the (b) interpretation. While there is no difference in the prosody of no between the (a) and (b) interpretations of (1), there are pitch and durational differences between the (a) interpretations of (1) and (2). This study points to abstract cross-linguistic similarities in gestural movements with languages such as Catalan and Spanish (Prieto et al. in Lingua 131:136-150, 2013. 10.1016/j.lingua.2013.02.008; Tubau et al. in Linguist Rev 32(1):115-142, 2015. 10.1515/tlr-2014-0016), and shows that meaning is crucial for gesture patterns. We emphasize that gesture patterns disambiguate ambiguous interpretations when prosody cannot do so.
Topics: Humans; Gestures; Adult; Psycholinguistics; Male; Female; Speech; Language; Young Adult
PubMed: 38926243
DOI: 10.1007/s10936-024-10075-8
Science Advances, Jun 2024
Lip language recognition urgently needs wearable, easy-to-use interfaces for interference-free and high-fidelity lip-reading acquisition, together with accompanying data-efficient decoder-modeling methods. Existing solutions suffer from unreliable lip reading, are data hungry, and exhibit poor generalization. Here, we propose a wearable lip language decoding technology that enables interference-free and high-fidelity acquisition of lip movements and data-efficient recognition of fluent lip language, based on wearable motion capture and continuous lip speech movement reconstruction. The method allows us to artificially generate any desired continuous speech dataset from a very limited corpus of word samples from users. By using these artificial datasets to train the decoder, we achieve an average accuracy of 92.0% across individuals (n = 7) for actual continuous and fluent lip speech recognition of 93 English sentences, with no training burden on users, because all training datasets are artificially generated. Our method greatly minimizes users' training/learning load and presents a data-efficient and easy-to-use paradigm for lip language recognition.
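The abstract does not describe the reconstruction pipeline itself. As a hedged sketch of the general idea of building "continuous" training sequences from a small corpus of per-word samples (the function, the cross-fade scheme, and all shapes are illustrative assumptions, not the authors' method), one could concatenate per-word motion-capture clips with short transitions between adjacent words:

```python
# Illustrative sketch, not the paper's actual data-generation method.
import numpy as np

def synthesize_sentence(word_clips, fade=5):
    """Concatenate (frames, channels) clips, cross-fading `fade` frames at each seam."""
    out = word_clips[0]
    for clip in word_clips[1:]:
        w = np.linspace(0, 1, fade)[:, None]               # fade-in ramp
        blended = (1 - w) * out[-fade:] + w * clip[:fade]  # smooth the seam
        out = np.concatenate([out[:-fade], blended, clip[fade:]])
    return out

rng = np.random.default_rng(1)
# Hypothetical corpus: 30-frame, 6-channel motion clips per word.
corpus = {w: rng.standard_normal((30, 6)) for w in ["open", "the", "door"]}
sentence = synthesize_sentence([corpus[w] for w in ["open", "the", "door"]])
print(sentence.shape)  # (80, 6)
```

A real pipeline would presumably model coarticulation between words far more carefully; the point here is only that sentence-level training data can be assembled from word-level samples.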
Topics: Humans; Wearable Electronic Devices; Speech; Language; Lip; Movement; Male; Female; Adult; Lipreading; Motion Capture
PubMed: 38924408
DOI: 10.1126/sciadv.ado9576
Cognitive Science, Jun 2024
Experiments on visually grounded, definite reference production often manipulate simple visual scenes in the form of grids filled with objects, for example, to test how speakers are affected by the number of objects that are visible. Regarding the latter, it was found that speech onset times increase along with domain size, at least when speakers refer to nonsalient target objects that do not pop out of the visual domain. This finding suggests that even in the case of many distractors, speakers perform object-by-object scans of the visual scene. The current study investigates whether this systematic processing strategy can be explained by the simplified nature of the scenes that were used, and if different strategies can be identified for photo-realistic visual scenes. In doing so, we conducted a preregistered experiment that manipulated domain size and saturation; replicated the measures of speech onset times; and recorded eye movements to measure speakers' viewing strategies more directly. Using controlled photo-realistic scenes, we find (1) that speech onset times increase linearly as more distractors are present; (2) that larger domains elicit relatively fewer fixation switches back and forth between the target and its distractors, mainly before speech onset; and (3) that speakers fixate the target relatively less often in larger domains, mainly after speech onset. We conclude that careful object-by-object scans remain the dominant strategy in our photo-realistic scenes, to a limited extent combined with low-level saliency mechanisms. A relevant direction for future research would be to employ less controlled photo-realistic stimuli that do allow for interpretation based on context.
Topics: Humans; Speech; Male; Female; Eye Movements; Adult; Young Adult; Visual Perception; Attention; Photic Stimulation
PubMed: 38924126
DOI: 10.1111/cogs.13473
Cognitive Science, Jun 2024
Words that describe sensory perception give insight into how language mediates human experience, and the acquisition of these words is one way to examine how we learn to categorize and communicate sensation. We examine the differential predictions of the typological prevalence hypothesis and the embodiment hypothesis regarding the acquisition of perception verbs. Studies 1 and 2 examine the acquisition trajectories of perception verbs across 12 languages using parent questionnaire responses, while Study 3 examines their relative frequencies in English corpus data. We find the vision verbs see and look are acquired first, consistent with the typological prevalence hypothesis. However, for children at 12-23 months, verbs of touch, not audition, take precedence in terms of age of acquisition, frequency in child-produced speech, and frequency in child-directed speech, consistent with the embodiment hypothesis. Later, at 24-35 months, frequency rates are observably different, and audition begins to align with what has previously been reported in adult English data. The initial orientation toward verbalizing touch over audition in child-caregiver interaction seems especially related to the control of physically and socially appropriate behaviors. Taken together, the results indicate children's acquisition of perception verbs arises from the complex interplay of embodiment, language-specific input, and child-directed socialization routines.
Topics: Humans; Language Development; Infant; Female; Male; Language; Child, Preschool; Visual Perception; Speech; Touch; Auditory Perception
PubMed: 38923050
DOI: 10.1111/cogs.13469
Advances in Experimental Medicine and..., 2024
Review
Speech can be defined as the human ability to communicate through a sequence of vocal sounds. Consequently, speech requires an emitter (the speaker) capable of generating the acoustic signal and a receiver (the listener) able to successfully decode the sounds produced by the emitter (i.e., the acoustic signal). Time plays a central role at both ends of this interaction. On the one hand, speech production requires precise and rapid coordination, typically within the order of milliseconds, of the upper vocal tract articulators (i.e., tongue, jaw, lips, and velum), their composite movements, and the activation of the vocal folds. On the other hand, the generated acoustic signal unfolds in time, carrying information at different timescales. This information must be parsed and integrated by the receiver for the correct transmission of meaning. This chapter describes the temporal patterns that characterize the speech signal and reviews research that explores the neural mechanisms underlying the generation of these patterns and the role they play in speech comprehension.
Topics: Humans; Speech; Speech Perception; Speech Acoustics; Periodicity
PubMed: 38918356
DOI: 10.1007/978-3-031-60183-5_14
Nicotine & Tobacco Research : Official..., Jun 2024
INTRODUCTION
Pictorial health warning labels (HWLs) can communicate the harms of tobacco product use, yet little research exists for cigars. We sought to identify the most effective types of images to pair with newly developed cigar HWLs.
AIMS AND METHODS
In September 2021, we conducted an online survey experiment with US adults who reported using little cigars, cigarillos, or large cigars in the past 30 days (n = 753). After developing nine statements about health effects of cigar use, we randomized participants to view one of three levels of harm visibility paired with each statement, either: (1) an image depicting internal harm not visible outside the body, (2) an image depicting external harm visible outside of the body, or (3) two images depicting both internal and external harm. After viewing each image, participants answered questions on perceived message effectiveness (PME), negative affect, and visual-verbal redundancy (VVR). We used linear mixed models to examine the effect of harm visibility on each outcome, controlling for warning statement.
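The abstract names the analysis but not the code behind it. As a hedged sketch of what "linear mixed models ... controlling for warning statement" could look like (simulated data with illustrative variable names and effect sizes, not the authors' dataset or software), one might fit a model with a random intercept per participant using statsmodels:

```python
# Illustrative sketch with simulated data, not the authors' analysis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_participants, n_statements = 60, 9
rows = []
for pid in range(n_participants):
    subj_effect = rng.normal(0, 0.3)            # per-participant random intercept
    for stmt in range(n_statements):
        visibility = rng.choice(["internal", "external", "both"])
        bump = {"internal": 0.0, "external": 0.17, "both": 0.21}[visibility]
        pme = 3.0 + bump + subj_effect + rng.normal(0, 0.5)
        rows.append({"pid": pid, "statement": stmt,
                     "visibility": visibility, "pme": pme})
df = pd.DataFrame(rows)

# Random intercept for participant; warning statement entered as a fixed covariate.
model = smf.mixedlm("pme ~ C(visibility, Treatment('internal')) + C(statement)",
                    df, groups=df["pid"])
result = model.fit()
print(result.params.filter(like="visibility"))
```

With "internal" as the reference level, the two reported coefficients correspond to the fixed-effect contrasts for the external-only and combined depictions, analogous to the B values in the results.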
RESULTS
Warnings depicting both internal and external harm, and those depicting external harm only, performed significantly better than internal-only harm depictions across all outcomes, including PME (B = 0.21 and B = 0.17), negative affect (B = 0.26 and B = 0.25), and VVR (B = 0.24 and B = 0.17), respectively (all p < .001). Compared to the combined depiction, the external-only depiction of harm did not significantly change PME or negative affect but did significantly lower VVR (B = -0.07, p = .01).
CONCLUSIONS
Future cigar pictorial HWLs may benefit from including images that depict both internal and external harm, or external harm alone. Future research should examine the effect of harm visibility for other tobacco pictorial HWLs.
IMPLICATIONS
The cigar health warning labels (HWLs) proposed by the US Food and Drug Administration are text-only. We conducted an online survey experiment among people who use cigars to examine the effectiveness of warnings with images depicting different levels of harm visibility. We found that HWLs with images depicting both internal and external cigar harm, or external harm alone, performed better overall than images portraying internal harm only. These findings provide important regulatory evidence about which types of images may increase warning effectiveness and offer a promising route for future cigar HWL development.
PubMed: 38918001
DOI: 10.1093/ntr/ntae113
American Journal on Intellectual and..., Jul 2024
The literature has yet to review the differential effects of Natural Environment Teaching (NET) and Discrete Trial Teaching (DTT) on adaptive skills. A sample of 142 children diagnosed with ASD between the ages of 16 and 35 months received either DTT, NET, or both interventions (NET+DTT). The Bayley Scales of Infant and Toddler Development (BSID) Adaptive Subscale and the Verbal Behavior Milestones Assessment and Placement Program (VB-MAPP) Barriers Assessment were used as baseline and posttest measures. Children in the NET and NET+DTT conditions showed significant improvements compared to the DTT condition, indicating that the addition of NET leads to increased adaptive skills and decreased barrier behaviors. DTT may also play a necessary foundational role for children with more significant delays. These results support the use of a combination of teaching strategies in community-based early intervention and refine protocols for teaching adaptive skills to toddlers with ASD.
Topics: Humans; Autism Spectrum Disorder; Child, Preschool; Male; Infant; Female; Adaptation, Psychological; Early Intervention, Educational; Child Development; Teaching
PubMed: 38917993
DOI: 10.1352/1944-7558-129.4.263
Journal of Sex Research, Jun 2024
Review
Coerced condomless sex is a prevalent form of sexual coercion that is associated with severe negative health consequences. This scoping review addresses the current lack of synthesized qualitative evidence on coerced condomless sex. Our systematic literature search yielded 21 articles that met review eligibility criteria. Themes of coerced condomless sex were organized into three categories (tactics, motives, and sequelae) and presented separately for studies based on whether researchers stipulated pregnancy promotion intent as underlying the behavior. Coerced condomless sex perpetration tactics ranged from verbal pressure to physical assault. Besides pregnancy promotion, perpetration motives included control, dominance, entrapment, enhancing sexual experiences, and avoiding conflict. Following coerced condomless sex, victims reported developing protective strategies. They also reported experiencing various negative emotional, relational, and physical health effects. Interventions that specifically address coerced condomless sex perpetration and provide supportive programs for those who have experienced coerced condomless sex may be beneficial.
PubMed: 38913125
DOI: 10.1080/00224499.2024.2365936