Large Language Models for the Assessment of Metacognitive Skills in Acquired Brain Injury


Introduction

Metacognitive impairments after acquired brain injury (ABI), such as traumatic brain injury (TBI), stroke, and other neurologic insults, are common. They include reduced awareness of one's deficits, impaired self-monitoring of performance, and difficulty self-regulating (i.e., adjusting strategies or behavior) (Al Banna et al., 2016). Such deficits can strongly interfere with rehabilitation: an individual who does not recognize a memory or attention problem will not use compensatory strategies, and one who cannot self-monitor will repeat the same errors. Cognitive rehabilitation recognized decades ago that enhancing metacognition improves functional outcomes. Indeed, metacognitive training approaches (often loosely described as "thinking about your thinking") have demonstrated greater efficacy in improving real-world functioning than drill-based training of individual cognitive functions (Krasny-Pacini et al., 2014). Yet metacognitive treatments traditionally require extensive therapist involvement, feedback, and practice across varied contexts. Large language models (LLMs) such as GPT-4 offer new potential to supplement this process. With their advanced natural language understanding and interactivity, LLMs could act as virtual coaches, therapy aides, or assessment tools supporting metacognitive rehabilitation. This article discusses the potential of using LLMs to assess, monitor, and train metacognitive functions (self-awareness, self-monitoring, self-regulation) in individuals with ABI. We review existing and proposed applications for integrating LLMs into rehabilitation (e.g., individualized therapeutic chatbots, virtual reality-based applications, AI-driven digital coaching, and adaptive assessment frameworks) and discuss the applicable LLM architectures (GPT-based models and fine-tuned clinical variants). The aim is to provide clinical researchers and developers with an overview of opportunities and considerations for using LLM technology in the neurorehabilitation of metacognition.

Metacognitive Deficits in ABI and Rehabilitation Approaches

ABI frequently disrupts executive functions, the integrative processes necessary for goal formulation, planning, task initiation, performance monitoring, and behavior adjustment (Krasny-Pacini et al., 2014). Metacognition, a person's knowledge of their own abilities and their capacity to control those abilities through monitoring, plays an important role here. In practical terms, metacognition after brain injury includes awareness of how the injury has altered one's capabilities and the ability to monitor one's performance and regulate one's behavior across activities (Al Banna et al., 2016). Impaired self-awareness is common: many individuals with TBI or stroke show anosognosia or incomplete appreciation of their deficits, which may manifest as overly positive self-assessment or risky decisions (Sansonetti et al., 2024). Likewise, self-monitoring deficits make it difficult to recognize performance errors in real time, and poor self-regulation leads to ineffective strategy use and difficulty adjusting behavior. Together, these metacognitive deficits undermine everyday functioning, since staying organized, employing strategies, and monitoring one's own actions are essential for independent living (Krasny-Pacini et al., 2014).

Conventional interventions have been designed to address metacognitive impairments in ABI. Generally, they focus on educating patients about their impairments and teaching them to "stop and think" within tasks. For instance, Goal Management Training (GMT) is a well-established metacognitive treatment for executive dysfunction. GMT seeks to instill a mindful, goal-oriented approach to complex activities by teaching individuals to periodically interrupt ongoing activity, review their goals, and refocus if they have drifted off task (Krasny-Pacini et al., 2014). In practice, GMT comprises a systematic set of methods: patients learn self-instructional routines, practice self-monitoring exercises, practice segmenting activities into components and using checklists (to enhance planning and prospective memory), and perform brief mindfulness exercises to support sustained attentional control. GMT also employs discussion and real-world "homework" exercises to educate patients about lapses (e.g., everyday situations are presented as stories that illustrate executive errors). Overall, the aim is to increase the individual's awareness of attentional lapses and to re-establish cognitive control whenever behavior becomes incompatible with the intended goal.

Self-awareness training

Self-awareness training typically involves systematic feedback to the patient about their performance. A patient may be asked to predict how they will perform on a task, after which the actual result is fed back to highlight discrepancies (facilitating insight). Psychoeducation about brain injury and its effects is also important: patients who better understand their condition can better evaluate their own performance. For example, a recent qualitative study found that patients undergoing rehabilitation developed self-awareness through two primary pathways: knowledge (understanding of their injury and of task requirements) and feedback from experience or from others (Sansonetti et al., 2024). Some participants explicitly connected understanding their impairments with anticipating support needs (e.g., recognizing they would need assistance with certain activities at home). Others gained insight by attempting activities and receiving feedback when they struggled, or by hearing others' views on their performance. These findings support the notion that structured feedback and experiential learning play important roles in enhancing self-awareness.

General metacognitive strategy training

General metacognitive strategy training promotes a cycle of goal-setting, self-assessment, and self-review that patients learn to apply across activities. One such method is the CO-OP (Cognitive Orientation to daily Occupational Performance) approach, initially developed for children with motor impairments but since applied to adults with ABI. In a case study by Skidmore et al. (2011), a 31-year-old stroke survivor learned to recognize his performance deficits, set goals, generate and act on plans, and review how well the plans worked. Across several inpatient therapy sessions, he set eight activity-based goals (e.g., work performance and self-care activities) and applied this metacognitive strategy process to achieve gains. His self-assessment ratings on the Canadian Occupational Performance Measure improved markedly (mean change of roughly 6 points), and his engagement in therapy increased (as assessed with rehabilitation participation scales). This suggests that teaching self-assessment and strategy-adaptation skills can translate into meaningful improvements in real-life performance, even in the acute stage of recovery.

Before exploring how LLMs might support or extend such coaching, the next section provides a brief overview of LLM technologies relevant to this domain.

Large Language Models in Neurorehabilitation: Capabilities and Architectures

Recent developments in artificial intelligence, most notably large language models, have made new digital health interventions possible. Large language models (LLMs) are AI models trained on vast bodies of text to generate human-like language. Current LLMs (e.g., GPT-3/4, built on the Transformer architecture) can interpret complex prompts and produce coherent replies, responding interactively in a way that simulates conversation. They capture patterns of language and knowledge well enough to answer questions, provide summaries, or make recommendations with some sensitivity to context. These capabilities have caught the attention of medicine and neurology: for example, generative LLMs have already been used to interpret clinical case descriptions and even suggest diagnoses or treatments from narrative inputs (Maggio et al., 2024). In neurorehabilitation, researchers are beginning to investigate how LLMs such as ChatGPT could "meaningfully integrate as a facilitator" of patient care. Since rehabilitation often involves teaching and coaching individuals (conventionally delivered face-to-face by therapists), LLM-powered agents could potentially contribute by offering interactive education, cognitive coaching, and personalized feedback.

From a technical standpoint, the most appropriate LLMs for such applications are generative conversational models (such as GPT-based models), since they can engage in free-form dialogue and respond dynamically to user input. ChatGPT (based on GPT-3.5/GPT-4) is an existing example that has already been explored in health applications; its capacity to generate text that closely simulates human responses has led one study to suggest it could be employed in rehabilitation therapy (Maggio et al., 2024). Alternatives and complements include open-source Transformer models (such as LLaMA, Alpaca, and BLOOM) that can be fine-tuned on medical or rehabilitation-oriented data to build specialist chatbots. BERT-based models (such as ClinicalBERT and BioBERT), which are typically used for text understanding and information extraction rather than generation, could also play a role: such a model might analyze patient language (for assessment or progress monitoring) while a generative model handles the conversation itself.
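
To make the division of labor concrete, the sketch below shows how a GPT-style conversational model might be configured as a metacognitive coach through a system prompt and a simple turn-taking loop. It is a minimal illustration only: the call_llm() helper stands in for whatever chat-completion API is used, and the prompt wording is an assumption rather than a validated therapeutic script.

```python
# Minimal sketch of a GPT-style model configured as a metacognitive coach.
# call_llm() is a placeholder for whatever chat-completion API is available;
# the system prompt and conversation flow are illustrative assumptions only.

def call_llm(messages: list[dict]) -> str:
    """Placeholder: forward `messages` to a chat-completion endpoint and return the reply."""
    return "(model reply would appear here)"

SYSTEM_PROMPT = (
    "You are a cognitive rehabilitation coach. Encourage the user to predict how a task "
    "will go, reflect on what actually happened, and name one strategy to try next time. "
    "Ask one short question at a time and never give medical advice."
)

def coaching_session(user_turns: list[str]) -> list[str]:
    """Run a short scripted session and collect the model's replies."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    replies = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

if __name__ == "__main__":
    for reply in coaching_session(["I want to cook dinner without forgetting any steps."]):
        print(reply)
```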

Of particular relevance to communicating clinical information, a vanilla GPT-4 model has strong general conversational ability, but domain-specific performance improves when expert knowledge and well-structured prompts are added. A recent project designed a multi-agent GPT-4 system to answer rehabilitation questions, with individual agents handling retrieval of medical guidelines, answer construction, and answer verification (Zhenzhu et al., 2024). This design delivered accurate and interpretable responses to patient rehabilitation questions that outperformed those of general GPT-4. Notably, the guideline-driven GPT-4 agents produced more accurate, comprehensive, and empathetic responses and avoided fabrication: if a question lay beyond the scope of the available guideline sources, the system replied "unclear" rather than inventing an answer. This illustrates the value of pairing an LLM with a knowledge base and multiple reasoning steps to ensure safety and relevance in a clinical setting.
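
The sketch below illustrates this general pattern (retrieval, grounded answering, verification, and refusal when no source is found) rather than the published system itself; the toy guideline store, prompts, and call_llm() placeholder are assumptions for illustration. The key design choice is that the refusal path is deterministic code, not left to the model's discretion.

```python
# Sketch of a guideline-grounded, multi-step LLM pipeline in the spirit of the
# retrieval / answering / verification agents described above. This is an
# illustrative approximation, not the published system; call_llm(), the toy
# guideline store, and all prompts are assumptions.

GUIDELINES = {
    "early mobilisation": "Begin mobilisation as soon as medically stable ...",
    "fatigue management": "Schedule cognitively demanding tasks early in the day ...",
}

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call."""
    return "yes" if prompt.startswith("Does this answer") else "(draft answer grounded in the excerpts)"

def retrieve(question: str) -> list[str]:
    """Toy retrieval: return guideline snippets whose key words appear in the question."""
    return [text for topic, text in GUIDELINES.items() if topic.split()[0] in question.lower()]

def answer(question: str) -> str:
    passages = retrieve(question)
    if not passages:                      # no supporting source: refuse rather than guess
        return "unclear"
    draft = call_llm(
        "Answer the rehabilitation question using ONLY these guideline excerpts:\n"
        f"{passages}\n\nQuestion: {question}"
    )
    verdict = call_llm(
        f"Does this answer stay within the excerpts {passages}? Reply 'yes' or 'no'.\n{draft}"
    )
    return draft if verdict.strip().lower().startswith("yes") else "unclear"

print(answer("How should fatigue be managed after TBI?"))
```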

Beyond GPT-style conversational chatbots, specialized LLM variants are emerging. Two examples are Med-PaLM (from Google) (Singhal et al., 2025) and BioGPT (from Microsoft) (Luo et al., 2022), medically tailored models that might be harnessed to provide reliable health information in rehabilitation counseling. Multimodal variants (e.g., GPT-4 with vision) may eventually combine text with images or video, potentially useful for observing a patient's facial behavior or voice during therapy, or for demonstrating exercises visually. For now, text-based LLMs remain the main focus and are already being deployed experimentally in healthcare chatbots. In short, today's LLMs combine natural language understanding, generative versatility, and extensive pretrained knowledge, characteristics that make them candidate tools to serve as virtual therapists or intelligent assistants in cognitive rehabilitation. The following sections detail concrete ways in which LLMs can support metacognitive processes: first by assessing them, and then by actively training and coaching patients in metacognitive skills.

LLMs for Assessing Self-Awareness and Metacognition

A first application of LLMs in neurorehabilitation is as tools for assessing metacognitive functions. Standard measures of self-awareness and self-monitoring commonly rely on interviews, questionnaires, or comparisons of patient- and caregiver-rated abilities. LLMs might complement these approaches by analyzing unstructured patient speech or by conducting adaptive interview-style assessments. For instance, an LLM might converse with a patient about their cognitive functioning: ask the patient to predict how they will perform on a particular task, then follow up afterward and ask them to explain what actually happened, probing for discrepancies. The conversational AI can dynamically reformulate questions based on previous replies, much as a skilled clinician would, to reveal gaps in the patient's understanding of their deficits. Through natural language understanding, the LLM can pick up signs of over- or underestimation of capabilities (e.g., a patient confidently reporting "no memory problem" despite describing several instances of forgetfulness). The model could then rate or classify the patient's degree of insight, or flag the case for a human clinician to examine. This could make self-awareness assessment more dynamic and tailored than paper questionnaires.
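
As a minimal sketch of how such an assessment could be operationalized, the example below logs predicted versus observed performance across tasks, computes a simple discrepancy index, and asks an LLM to summarize the pattern for clinician review. The 0-10 scoring convention, the prompt, and the call_llm() placeholder are illustrative assumptions, not a validated measure of insight.

```python
# Sketch: combine a simple prediction-vs-performance discrepancy index with an
# LLM-generated plain-language summary for clinician review. The data format,
# prompt, and call_llm() placeholder are illustrative assumptions.

from statistics import mean

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call."""
    return "(narrative summary of the patient's insight pattern)"

# Each entry: task name, patient-predicted score (0-10), observed score (0-10).
trials = [
    ("word-list recall", 9, 4),
    ("bill-paying simulation", 8, 5),
    ("route finding", 6, 6),
]

# Positive values = overestimation of one's own performance.
discrepancies = [predicted - observed for _, predicted, observed in trials]
index = mean(discrepancies)

report = call_llm(
    "Summarise this patient's self-awareness for a clinician. Trials as "
    f"(task, predicted, observed): {trials}. Mean prediction-minus-performance "
    f"discrepancy: {index:.1f} (positive = overestimation). Note any tasks where "
    "insight seems intact."
)
print(f"Discrepancy index: {index:.1f}")
print(report)
```
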
Incorporating conversational LLMs into clinical assessment could also change how patient data are communicated among clinicians, shifting from largely standardized, numeric, binary outputs toward richer, more nuanced, narrative transmission of information. While standard tests give straightforward but potentially oversimplified portraits of cognition, AI-powered assessments could provide richly contextual, adaptive conversations that reveal subtler dimensions of a patient's self-knowledge, mental state, and affect. Narrative-based evaluations might allow clinicians to reach finer-grained shared understandings, enabling personalized treatment and interdisciplinary collaboration. Such a shift would, however, require thoughtful integration with standardized practices, clinician agreement on interpretive frameworks, and training to incorporate AI-provided insights into clinical reasoning effectively.
Another direction would be to use LLMs for scoring or analyzing narrative measures. Patients undergoing rehabilitation may keep daily diaries or journal about their struggles. An LLM can analyze such narratives for sentiment and content, identifying, for example, how often the patient acknowledges errors or assistance from others (markers of emerging awareness), or whether they set specific goals and report on progress (indicators of self-monitoring and self-regulation). Since LLMs are strong at summarizing and extracting themes, they could help clinicians parse large quantities of patient-written or patient-spoken data. To illustrate, if a patient with executive dysfunction completes a weekly journal, an LLM might distill the metacognitively relevant information: "This week the patient noticed memory lapses on three occasions and introduced a new checklist at work," and so on. Such summaries could track changes in self-awareness over time.
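
One way to make such summaries trackable is to ask the model for structured output. The sketch below, with an assumed JSON schema, prompt, and call_llm() placeholder, extracts counts of awareness markers (acknowledged lapses, strategies introduced, goals reported on) plus a one-sentence summary from each weekly journal entry so that changes can be followed over time.

```python
# Sketch: ask an LLM to extract metacognition markers from a patient journal as
# JSON, then keep a weekly trend for clinician review. The schema, prompt, and
# call_llm() placeholder are illustrative assumptions.

import json

def call_llm(prompt: str) -> str:
    """Placeholder: would return the model's JSON reply."""
    return json.dumps({
        "acknowledged_lapses": 3,
        "strategies_introduced": 1,
        "goals_reported_on": 2,
        "summary": "Patient noticed memory lapses three times and started a checklist at work.",
    })

def analyse_journal(journal_text: str) -> dict:
    prompt = (
        "From this rehabilitation journal, count: acknowledged_lapses, "
        "strategies_introduced, goals_reported_on, and give a one-sentence summary. "
        "Reply with JSON only.\n\n" + journal_text
    )
    return json.loads(call_llm(prompt))

weekly_trend = []
for week, text in enumerate(["(week 1 journal text)", "(week 2 journal text)"], start=1):
    markers = analyse_journal(text)
    weekly_trend.append((week, markers["acknowledged_lapses"]))
    print(f"Week {week}: {markers['summary']}")

print("Acknowledged lapses per week:", weekly_trend)
```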

Recent research supports the viability of LLMs in assessing complex cognitive-emotional constructs. A telling example comes from emotional self-awareness: a 2023 study asked ChatGPT to complete the Levels of Emotional Awareness Scale (LEAS), a measure in which respondents describe the emotions of people in hypothetical scenarios (Elyoseph et al., 2023). The AI's answers were scored by psychologists and compared with population norms. Not only did the AI score higher than the general population on emotional awareness, but on a retest one month later it improved its score to nearly the maximum. In other words, ChatGPT could produce highly detailed descriptions of feelings, demonstrating a kind of explicit emotional understanding that exceeded that of most people. The researchers propose this skill could form part of clinical training: "ChatGPT can be part of the cognitive training for clinical groups with impairments in emotional awareness". Analogously, if an LLM can produce nuanced self-reflections, it might help patients learn to do the same, or serve as a model of constructive reflective behavior. In cognitive rehabilitation, one might imagine an LLM completing a comparable metacognitive exercise, describing how it (or a hypothetical individual) might struggle with a task and work through it, and then using that as a worked example to show the patient how to notice and describe their own difficulties.

There is also room for adaptive testing. An LLM-driven system might pose increasingly complex scenarios or dilemmas to probe a patient's online self-monitoring. For example, it could introduce a sample task (via text or voice interface) such as "You are planning a party with many steps," and then introduce obstacles or errors into the plan. The patient is asked to point out what could go wrong or to catch the embedded errors. The LLM can react to the patient's input (e.g., if the patient overlooks an obvious error, the chatbot can offer a subtle hint such as "Do you think not inviting someone might become a problem?" or "Are you sure this covers everything?"). By tracking the patient's replies and the number of prompts needed, the system can gauge how well the patient anticipates problems (prospective awareness) and spots mistakes as they occur (online awareness). This dynamic process resembles the Self-Awareness of Deficits Interview but could be made more engaging and adaptively modifiable.
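
A minimal sketch of this adaptive loop is shown below: a planning scenario with planted errors is presented, graded hints are issued when the patient misses an error, and the number of hints needed serves as a crude index of online awareness. The scenario text, keyword matching, and hint wording are illustrative assumptions; a deployed system would more plausibly let the LLM itself judge whether the patient has spotted each error.

```python
# Sketch of an adaptive error-detection task: planted errors, graded hints, and a
# simple score based on how many hints were needed. Scenario text, keyword matching,
# and hint wording are illustrative assumptions; a real system could let the LLM
# judge the patient's answers instead of matching keywords.

scenario = (
    "You are planning a birthday party: book the room for Saturday, order the cake "
    "for Sunday, and send invitations the day before the party."
)

planted_errors = [
    {"keywords": ("cake", "sunday"), "hints": [
        "Check the dates again: when is the party, and when would the cake arrive?",
        "The cake is ordered for the day AFTER the party.",
    ]},
    {"keywords": ("invitation",), "hints": [
        "Would guests have enough notice?",
        "Invitations sent one day before leave almost no notice.",
    ]},
]

def run_task(patient_answers: list[str]) -> int:
    """Return total hints needed (lower = better online awareness)."""
    print(scenario)
    hints_used = 0
    answers = iter(patient_answers)
    for error in planted_errors:
        for hint_level in range(len(error["hints"]) + 1):
            reply = next(answers, "").lower()
            if all(k in reply for k in error["keywords"]):
                break                       # error spotted
            if hint_level < len(error["hints"]):
                print("Hint:", error["hints"][hint_level])
                hints_used += 1
    return hints_used

# Example run with canned patient replies (in practice these come from the chat).
print("Hints needed:", run_task(["Looks fine to me", "Oh, the cake is set for Sunday",
                                 "The invitations go out too late"]))
```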

In short, LLMs could act as intelligent assessment aids, conducting interviews, reading between the lines, and producing analyses of a patient's metacognitive status. They offer scalability (a chatbot can conduct a semi-structured interview without clinician time and cost) and consistency (standardized questions). Naturally, such tools would require validation; subtle judgments of insight typically require clinical corroboration. Yet early work, such as the emotional awareness study, gives good reason to believe that LLMs can support complex evaluative work in the domain of self-reflection (Elyoseph et al., 2023). If validated, such AI-based assessments could supplement clinician observation, yielding a richer picture of a patient's self-awareness and self-regulation abilities at intake and throughout rehabilitation.

References

Al Banna, M., Redha, N. A., Abdulla, F., Nair, B., & Donnellan, C. (2016). Metacognitive function poststroke: a review of definition and assessment. Journal of Neurology, Neurosurgery & Psychiatry, 87(2), 161–166.

Krasny-Pacini, A., Chevignard, M., & Evans, J. (2014). Goal Management Training for rehabilitation of executive functions: a systematic review of effectiveness in patients with acquired brain injury. Disability and Rehabilitation, 36(2), 105–116.

Sansonetti, D., Fleming, J., Patterson, F., & Lannin, N. A. (2024). Profiling self-awareness in brain injury rehabilitation: A mixed methods study. Neuropsychological Rehabilitation, 34(8), 1186–1211.

Skidmore, E. R., Holm, M. B., Whyte, E. M., Dew, M. A., Dawson, D., & Becker, J. T. (2011). The feasibility of meta-cognitive strategy training in acute inpatient stroke rehabilitation: case report. Neuropsychological Rehabilitation, 21(2), 208–223.

Maggio, M. G., Tartarisco, G., Cardile, D., Bonanno, M., Bruschetta, R., Pignolo, L., … & Cerasa, A. (2024). Exploring ChatGPT’s potential in the clinical stream of neurorehabilitation. Frontiers in Artificial Intelligence, 7, 1407905.

Zhenzhu, L., Jingfeng, Z., Wei, Z., Jianjun, Z., & Yinshui, X. (2024). GPT-agents based on medical guidelines can improve the responsiveness and explainability of outcomes for traumatic brain injury rehabilitation. Scientific Reports, 14(1), 7626.

Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Amin, M., … & Natarajan, V. (2025). Toward expert-level medical question answering with large language models. Nature Medicine, 1–8.

Luo, R., Sun, L., Xia, Y., Qin, T., Zhang, S., Poon, H., & Liu, T. Y. (2022). BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), bbac409.

Elyoseph, Z., Hadar-Shoval, D., Asraf, K., & Lvovsky, M. (2023). ChatGPT outperforms humans in emotional awareness evaluations. Frontiers in Psychology, 14, 1199058.