It is well-established that several factors can influence child development. Understanding and systematically quantifying these factors can contribute to more effectively targeting healthcare efforts. The American Academy of Pediatrics (Sandler et al., 2001) recommends assessing and monitoring at-risk infants and young children throughout early childhood, allowing for early and specific interventions in potential neurodevelopmental alterations. Population studies that assess children with developmental disorders challenge the scoring patterns used to measure skill acquisition due to the quantitative nature of the instruments. These instruments measure skill acquisition across age groups and utilize normative samples from populations with typical development (Rodrigues, 2012).
Developmental delays can have various causes. It is worth noting that Congenital Zika Virus Syndrome (CZVS) encompasses a range of congenital anomalies that can include visual, auditory, and neuropsychomotor alterations in individuals exposed to this infection during pregnancy (Ministry of Health, 2016). The severity of these alterations can vary, posing challenges for healthcare professionals involved in monitoring and assessing child development.
The first step in ensuring attention and access to specific programs is through diagnosis, with early screening, identification, and appropriate referral being essential. Strong evidence underscores the importance of early intervention (Zwaigenbaum & Penner, 2018), emphasizing the use of reliable measures with high levels of sensitivity, specificity, and reliability (Campos et al., 2006; Santos & Ravanini, 2006; Blair & Hall, 2006).
For children with complex developmental issues, procedures ranging from surveillance to screening for risk factors, as well as assessments to determine functional diagnoses, are essential pillars for ensuring compliance with the Ministry of Health recommendations and early intervention when necessary. The clinical reasoning process is initiated based on the assessment to determine the best intervention plan (Gourladin & Sá, 2022).
However, the availability of standardized and culturally validated scales in our language poses a challenge in assessing children with developmental disorders (Visser et al., 2014; Madaschi et al., 2016). Assessing individuals with disabilities that impact all domains of child development requires test accommodations to enable the individual’s full participation in the process (Bayley, 2006). Accommodations refer to changes in the standard test administration procedures to overcome the functional limitations of the participant, thereby increasing the validity of inferences drawn from the scores obtained. It is considered relevant to address the functional limitations individuals may experience when attempting to demonstrate proficiency in an assessment (Kettler, 2012).
Initially published in 1969, the Bayley Scales, in their original American version, are considered the gold standard for meeting all psychometric properties (Diamond, 2000). After nearly 50 years of research, version IV of the instrument is currently available, maintaining its excellent quality and meeting rigorous psychometric properties. The Bayley-III Scales (BSID III) used in this research are the most recent version available in Brazil (Madaschi & Paula, 2011). They are also among the instruments recommended by the Early Stimulation guidelines resulting from Microcephaly (Ministry of Health, 2016) to identify developmental delays, plan interventions, and document progress and evolution (Bayley, 2006). Although standardized, the instrument offers flexibility in application by considering the inherent dynamism of various infant assessment situations (Bayley, 2006), making it suitable for assessing cognitive, linguistic, and motor skills in children affected by congenital anomalies.
This version of the scale was adopted as an assessment instrument in the Research Project “Effects of Congenital Neurological Manifestations Associated with the Zika Virus on Child Development: A Prospective Cohort Study in the Context of Primary Care in Salvador-BA” to assess developmental consequences in children born during the epidemic, with a focus on home-based assessment (Santos et al., 2022).
Given the potential effects of interviewers on the reliability of the obtained responses, the research design included procedures to measure agreement among assessors throughout the study. Confidence in the results is partially a function of the amount of disagreement or error introduced into the study due to inconsistencies among instrument administrators. Reliability is dynamic and depends on the instrument’s function, the population in which it is administered, circumstances, and context. These factors highlight the importance of ongoing training and supervision to achieve adequate inter-rater reliability (Souza et al., 2017). Equivalence reliability allows for identifying the extent to which assessors could observe and measure the phenomenon or variable appropriately and as predicted by the instrument’s validity.
Measuring agreement among collectors refers to stability, internal consistency, and measure equivalence, although reliability is not a fixed property of the instrument (Souza et al., 2017).
Therefore, it is imperative to subject an assessment team to a training and test-retest process to measure the level of equivalence among them, aiming to minimize measurement errors. Equivalence refers to the degree of agreement between two or more observers regarding the scores of an instrument (Heale & Twycross, 2015). Agreement among administrators is expressed by the Kappa coefficient (K), which measures the degree of agreement between proportions derived from dependent samples (Cohen, 1968). The Intraclass Correlation Coefficient (ICC), which assesses the reliability or consistency of multiple measurements made by different administrators, was also used.
Given the significance of developmental assessment for the therapeutic intervention process with children and their families, this article examines the reliability obtained at three assessment points in a longitudinal study using the Bayley-III Infant Development Scales. Furthermore, it describes the continuous training and supervision of the interdisciplinary team, aiming to standardize the application procedures in data collection.
Method
This is a longitudinal, quantitative, observational, and descriptive study that assessed agreement among administrators of the Bayley-III Infant Development Scales in Salvador (BA), Brazil’s fourth most populous capital city.
Participants
The reliability design model chosen for this study consisted of balanced incomplete blocks, as described by Fleiss (1981), in which one examiner interviews while the other observes the examination as a neutral spectator. According to the author, balanced incomplete blocks refer to a specific type of experimental design used in studies involving the assessment of multiple treatments. In this design, each participant or experimental unit is not exposed to all possible combinations of treatments but rather to a subset of them. This approach is beneficial when dealing with a high total number of combinations, making it impractical to test all of them.
The method of balanced incomplete blocks allows for reducing the size of the experiment and conserving resources while maintaining the balance between the tested conditions. To ensure the validity of the results, the allocation of treatments to participant blocks must be random or systematic, depending on the adopted strategy. This controls external variability and improves the precision of conclusions, making the study more robust and reliable.
After theoretical training, pairs consisting of two administrators recorded scores based on the same interview but conducted independent assessments. The simple arrangement method, using combinatorial analysis, was employed to form these pairs. At another time, the roles of the pairs were reversed when assessing another child. Reliability measures for the three assessment points of the longitudinal study were obtained independently. The level of agreement between assessors was measured for each assessment point of the longitudinal study, with three independent measures of cognitive, motor, and linguistic performance in children with and without exposure to CZVS.
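The pairing logic described above can be sketched as follows. This is a minimal illustration only, assuming that "simple arrangement" means ordered pairs of distinct assessors (administrator first, observer second); the assessor labels and the function name `ordered_pairs` are hypothetical and do not come from the study protocol.

```python
from itertools import permutations

def ordered_pairs(assessors):
    """Return every (administrator, observer) arrangement of two distinct assessors.

    A simple arrangement of n elements taken 2 at a time yields n * (n - 1)
    ordered pairs, so each pair appears once in each role.
    """
    return list(permutations(assessors, 2))

# Hypothetical labels for a team of six assessors.
assessors = ["A1", "A2", "A3", "A4", "A5", "A6"]
pairs = ordered_pairs(assessors)
print(len(pairs))  # 6 * 5 = 30
```

Because the pairs are ordered, ("A1", "A2") and ("A2", "A1") are both present, which mirrors the reversal of roles when another child is assessed.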
Team Composition
A team was formed with students from the Interdisciplinary Bachelor Programs in Health, Psychology, Public Health, and Physiotherapy, as well as professional psychologists. The first measurement occurred at the baseline between April 2017 and March 2018, with six interviewers conducting 29 instrument applications following the above-described design to assess reliability. A second measurement was conducted between May 2018 and March 2019 when eight new interviewers joined one existing member, and these nine members conducted 67 applications for the reliability study. The final measurement occurred between February and August 2019, during which four administrators conducted 11 assessments for the reliability sample.
Instrument
The Bayley-III Scales aim to measure the performance of infants and young children, identify competencies and critical points, and contribute to proper therapeutic intervention planning. Five domains are investigated. The cognitive, language (expressive and receptive), and motor (gross and fine) domains are assessed directly with the child, while the socioemotional and adaptive behavior scales are applied in interviews with parents (Bayley, 2006).
According to the instrument’s Technical Manual, the administration time can vary from 50 minutes for children up to 12 months to 90 minutes for those aged 13 months or older.
Its structure provides five types of scores: (i) raw total scores, (ii) scaled scores, (iii) composite scores, (iv) percentile-based rankings, and (v) developmental scores. For each domain, the raw score is defined as the total number of items for which the child receives credit, summed with the number of items before the child’s starting point. The scaled score is derived from the raw score. It ranges from 1-19, with an average of 10 and a standard deviation of 3, while the composite score is calculated based on the scaled score and ranges from 40-160, with an average of 100 and a standard deviation of 15.
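The raw-score rule and the score metrics described above can be illustrated with a short sketch. It is not the scoring procedure of the manual: the conversion from raw to scaled and composite scores relies on normative tables that are not reproduced here, and the function names, item numbers, and example values are hypothetical.

```python
def raw_score(start_item, credited_items):
    """Raw score = items before the start point + items credited from the start point on.

    start_item: number of the item at which administration began (1-based).
    credited_items: item numbers for which the child received credit.
    """
    items_before_start = start_item - 1
    credited = len({item for item in credited_items if item >= start_item})
    return items_before_start + credited

def sd_units(score, mean, sd):
    """Express a score as a distance from the normative mean, in standard deviations."""
    return (score - mean) / sd

# Hypothetical child who started at item 20 and received credit on four items.
print(raw_score(20, [20, 21, 22, 24]))  # 19 + 4 = 23

# A scaled score of 7 and a composite score of 85 both sit one SD below the mean.
print(sd_units(7, mean=10, sd=3))      # -1.0
print(sd_units(85, mean=100, sd=15))   # -1.0
```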
Among the effects caused by CZVS, inadequate development of gross motor skills stands out, affecting the child’s ability to roll, sit, and, in many cases, maintain control of their head. Regarding fine motor skills development, difficulties in performing manual activities are reported. In the sensory system, compromised visual and auditory capabilities are observed, resulting in severe difficulty understanding and producing language (Wheeler, 2018). However, the BSID III can be adjusted in its standard version, which allows for streamlining assessment-related tasks, provided there are no changes in their content and objectives (Visser et al., 2013; Visser et al., 2014).
Selection of the data collection team
At the beginning of the project, a call for applications was launched to offer an introductory course on the Bayley-III Infant Development Scale, followed by the selection of assessors based on their demonstrated performance and level of engagement with the project. Considering the assessment points of the longitudinal study, the age limit of 42 months for including the child, and the fluctuation of interviewers throughout the data collection, new calls were opened, and the training was transformed into an extension course certified by the Universidade Federal da Bahia (UFBA, Federal University of Bahia).
Training of Assessors
In its three stages, the training content covered topics related to typical and atypical child development, as well as the understanding and use of the Bayley-III Scales.
Two psychology professors with experience in child development, epidemiological aspects of development, and quantitative assessment instruments conducted the training and supervision process. This process also included a third psychologist responsible for coordinating the team’s activities during home visits or at health facilities for data collection throughout the study period.
The first practical training activity involved assessing children known to the team, conducted in their own homes, followed by a pilot experience involving children with typical and atypical development. Once the theoretical and practical requirements were met, the assessors started their activities with study participants in the research setting.
Initially, all team members were paired with a more experienced partner in a double role: one member approached the child and recorded the scores, while the second member observed the approach and independently and silently recorded the scores. Subsequently, these assessments were discussed in weekly supervision meetings to assess the performance of team members who were being trained. As the assessments progressed during the second and third measurements in the cohort, regular supervision meetings gave way to scheduled meetings based on the assessors’ needs.
Data Collection
The workflow began with the team leaving the Institute of Collective Health, reaching the assessment location (home or, exceptionally, a health facility due to safety concerns regarding the team), administering the instrument, and returning to the Institute. The duration of each assessment with the instrument was approximately two hours, varying according to the child’s developmental profile, health and well-being fluctuations, and the need for occasional breaks based on individual characteristics. The team observed that home-based assessment made the process more comfortable for the child, allowing the assessor to become familiar with existing limitations, improving communication, and providing the necessary flexibility to implement the required test accommodations.
From the initial assessments of children with confirmed CZVS diagnoses, difficulties in administering the Bayley-III Scales were observed, potentially disadvantaging the child due to the administration of the test per se. It was recognized that accommodations, that is, alterations to standard procedures, would be necessary to overcome individuals’ functional impairments and increase response validity (Kettler, 2012). It was also noted that the manual (Bayley, 2006) established criteria and necessary rigor for adaptations. Although scarce, studies that applied the BSID III with facilitations to assess children with multiple impairments show that the use of facilitations corrected differences in raw test scores, especially in the cognitive scale, increasing the validity and use of this instrument under these conditions (Ruiter et al., 2010). Based on this evidence, a tool was developed to assess the subject’s cognitive abilities with the best possible expression and precision of their development. The scale was standardized through training and supervision during the study.
Data analysis
An equivalence analysis was conducted to identify the degree of agreement between pairs of observers concerning the scores on the Bayley Scale (Heale & Twycross, 2015). The intraclass correlation coefficient was employed to calculate this agreement when the results were continuous variables. This coefficient is widely used to assess agreement between repeated measures, especially when multiple observers are involved. Results are considered excellent when the agreement between values exceeds 0.75.
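For illustration, one common form of the coefficient can be sketched as below. The text does not state which ICC model was used, so the one-way random-effects form, ICC(1,1), is an assumption made here because different assessor pairs rated different children; the function name and the example scores are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) for n subjects each rated by k raters.

    ratings: list of n lists, each holding the k scores given to one child.
    ICC(1,1) = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the
    between-subject and within-subject mean squares.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical scores: three children, each scored by an administrator and an observer.
example = [[1, 2], [2, 3], [3, 4]]
icc = icc_oneway(example)
print(round(icc, 3))   # 0.6
print(icc > 0.75)      # False: below the threshold for excellent agreement
```

Identical scores from both raters for every child give a within-subject mean square of zero and therefore an ICC of 1.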
The Kappa index (K) was chosen to analyze the agreement between categorical variables, as it measures the degree of agreement between proportions derived from dependent samples (Cohen, 1968). The obtained values are classified as follows: ≤ 0, no agreement; 0.01-0.20, slight agreement; 0.21-0.40, fair agreement; 0.41-0.60, moderate agreement; 0.61-0.80, substantial agreement; and 0.81-1.00, excellent agreement (McHugh, 2012; Souza et al., 2017). Statistical analysis was performed using the Statistical Package for the Social Sciences (SPSS version 20) and R (R Statistical Language version 3.6.1).
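The calculation and classification above can be sketched as follows. This is an illustrative, unweighted version of the coefficient; whether a weighted variant was used in the analysis is not stated. The function names and the example item scores are hypothetical, and assigning values that fall between the listed bands (for example 0.205) to the lower band is an assumption.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single category
    return (observed - expected) / (1 - expected)

def classify_kappa(kappa):
    """Map a kappa value to the agreement bands listed in the text."""
    if kappa <= 0:
        return "no agreement"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "excellent"

# Hypothetical item scored 1 (credit) or 0 (no credit) for four children.
administrator = [1, 1, 0, 0]
observer = [1, 1, 0, 1]
k = cohen_kappa(administrator, observer)
print(k, classify_kappa(k))  # 0.5 moderate
```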
Results
The first reliability assessment measurement conducted at the baseline involved six assessors and 29 children assessed by pairs of assessors, yielding the following results. Kappa values ranged from 0.47 to 1, with an average of 0.92. At least 70% of the questions showed values close to 1, indicating excellent agreement between assessors. The average of the 58 scores from the 29 assessments at this point was 84.2 (SD = 24.8) for the cognitive scale, 81.3 (SD = 25.4) for language, and 80.7 (SD = 28.8) for motor skills.
The second measurement involved 67 children and nine assessors for equivalence testing. Kappa values ranged from 0.43 to 1, with an average of 0.89. Approximately 70% of the questions showed Kappa values close to 1, maintaining excellent agreement between assessors. For the 67 assessments at this point, the average of the 134 scores obtained was 73.3 (SD = 22.7) for the cognitive scale, 72.2 (SD = 24.7) for the language scale, and 70.0 (SD = 27.5) for the motor scale.
In the third measurement, 11 children comprised the equivalence sample with the participation of four assessors. Kappa values ranged from 0.21 to 1, with an average of 0.96. Seventy percent of the questions had Kappa values close to 1, demonstrating excellent agreement between assessors. In this final measurement, 11 assessments (22 scores) were conducted with averages of 63.9 (SD = 17), 57.9 (SD = 19), and 58.1 (SD = 26) for the cognitive, language, and motor scales, respectively. Table 1 displays the distribution of Kappa coefficient ratings for the three measurements in the study.
Table 1 Concordance obtained via Kappa Coefficient for three reliability assessment measurements among assessors in the longitudinal study, April 2017 - August 2019, Salvador (BA), Brazil
| Concordance | No. of questions | % | Kappa value |
|---|---|---|---|
| First assessment: baseline | | | |
| Moderate | 1 | 0.31 | 0.41 to 0.60 |
| Substantial | 36 | 11.04 | 0.61 to 0.80 |
| Excellent | 289 | 88.65 | 0.81 to 1.00 |
| Second assessment | | | |
| Moderate | 10 | 3.37 | 0.41 to 0.60 |
| Substantial | 47 | 15.82 | 0.61 to 0.80 |
| Excellent | 238 | 80.13 | 0.81 to 1.00 |
| Third assessment | | | |
| Fair | 2 | 0.70 | 0.21 to 0.40 |
| Substantial | 24 | 8.42 | 0.61 to 0.80 |
| Excellent | 259 | 90.88 | 0.81 to 1.00 |
Source: own authorship
Note: Concordance categories without frequency were omitted.
The analysis of reliability measures for applying the Bayley-III Child Development Scale in this longitudinal study demonstrated satisfactory Kappa coefficients, indicating excellent agreement between assessors, with slightly higher average values in the first and third measurements than in the second (Table 1).
Notably, the items with low agreement were distributed as follows in this longitudinal reliability study. At the baseline, only one item in the cognitive domain (Item 8) showed low agreement. The second measurement recorded six items with low agreement between assessors, including two in Expressive Language (1;43), two in Receptive Language (2;6), one cognitive item (4), and one Fine Motor item (1). In the third measurement, only two items related to Expressive Language (2;6) showed low agreement.
The intraclass correlation coefficient for each of the three performance scores at each assessment point remained above 0.90, indicating a high level of agreement between assessors, also classified as excellent (Table 2).
Table 2 Intraclass Correlation Coefficients (ICC) for reliability assessments according to cognitive, language, and motor subscales conducted at three assessment points in the longitudinal study, April 2017 - August 2019, Salvador (BA), Brazil
| Assessments According to Scales | ICC | ICC 95% |
|---|---|---|
| First assessment: baseline | | |
| Cognitive | 0.925 | 0.865 |
| Language | 0.951 | 0.907 |
| Motor | 0.939 | 0.889 |
| Second assessment | | |
| Cognitive | 0.963 | 0.942 |
| Language | 0.994 | 0.991 |
| Motor | 0.980 | 0.968 |
| Third assessment | | |
| Cognitive | 0.998 | 0.995 |
| Language | 0.997 | 0.992 |
| Motor | 1.000 | 0.999 |
Source: own authorship
Discussion
In a longitudinal study with a turnover of assessor teams, it was possible to maintain data quality, reproducing consistent results over time and space, as demonstrated by the reliability measures obtained. The initial training format was supplemented by regular supervision, which qualified the team by identifying doubts and disagreements, thus establishing a high inter-rater reliability standard in the three assessment measurements (Souza et al., 2017).
Analyzing the overall results of the three assessed points, some variations in reliability levels were observed, indicating the importance of conducting reliability assessments between assessors for each new assessment point, even if the instrument in question had previously demonstrated high reliability.
The increasing number of recent national studies that have used the Bayley-III Scales indicates the importance and utility of this instrument in diagnosing motor, cognitive, and language delays in Brazilian children (Ferreira et al., 2014; Hentges et al., 2014). Investigations examining findings for populations with complex developmental disorders are scientifically relevant, emphasizing the need to test the necessary accommodations to maintain evidence of validity and reliability of specific assessment tools. This study faced an essential challenge in assessing children at high risk of developmental delays due to multiple disabilities and frequent global developmental impairment.
Accommodations are required in such situations, allowing for modifications to the standard test administration procedures to overcome the participant’s functional deficiencies, thus increasing the validity of inferences based on obtained scores. The functional impairment the subject may experience when attempting to demonstrate proficiency in an assessment is considered relevant (Kettler, 2012). An effort was made to enable psychological evaluation in a population with many limitations, which could otherwise render the subjects untestable by other instruments. This contributes to advancements in child development assessment in Brazil, where no similarly validated tool is available (Madaschi et al., 2016).
The BSID-III was administered using standard procedures, with adaptations for the child’s visual or motor impairments, as suggested by the manual (Bayley, 2006). Among the accommodations used in our study, we refer to some examples, such as using light and brightness, increasing the size of manipulative objects, and extending the time for the child to respond.
Additional test accommodations in the application of the BSID-III have been used. For example, ceiling lights were turned off for children with clear vision, and a flashlight was used to provide contrast (Wheeler et al., 2020). Although the item scope covers the developmental spectrum from birth to early childhood, the dependence on visual and motor production can penalize children with CZVS in demonstrating what they can do. On the other hand, raw BSID-III and age-equivalent scores provide a sensitive measure of potential change over time, allowing for monitoring skill gains or losses in response to time, treatment, or seizures (Wheeler et al., 2020).
While at least 70% of the questions analyzed had Kappa values close to 1, questions with low agreement among assessors were distributed across the three assessments. At baseline, low agreement occurred in a question from the cognitive domain; at the other points, it extended to the language and motor domains. One possible hypothesis explaining the low agreement is the potential cognitive domain impairment due to viral infection and the high correlation between these three domains. Motor impairment due to axial and appendicular hypertonia can affect the magnitude of the child’s response or behavior, which may have made it difficult for the assessor to judge performance on the items that deviated from high agreement. In this regard, greater attention is recommended in training for items that require greater assessor sensitivity to recognize the child’s performance.
Regarding items of expressive language with low agreement, tests involving undifferentiated guttural sounds require familiarity for the assessor to interpret and score. Finally, low agreement in this study also affected items with a subjective dimension, such as the social smile response when talking to the child. Despite the various potential sources of disagreement in applying a psychometric test, the scores obtained in this study were highly consistent, reflecting uniformity in approaching children and interpreting their responses. It is considered possible to deal with factors related to the instrument, population, and context, achieving a high level of reliability (Souza et al., 2017). Equivalence-type reliability made it possible to identify the assessors’ aptitude to observe and measure the phenomenon appropriately, as recommended by the Bayley-III manual.
Assessing children in their homes or occasionally at their reference Health Unit facilitated their approach. The assessor’s access to this space favored some familiarity with the child’s limitations, improved communication, and the necessary flexibility to adopt accommodations. Ecological validity emphasizes a new understanding of the relationship between assessment results and the performance of daily tasks. It also considers the development of tests composed of everyday cognitive functions so that inferences can be quickly drawn from the results and the individual’s probable ability to perform those tasks in daily life (Spooner & Pachana, 2006). According to Pasquali (2017), ecological validity refers to how evidence should be sought, aligning methods, materials, and assessment situations with the natural world being examined.
For children with highly probable atypical development due to CZVS diagnosis, the home assessment expanded the possibility of expressing their development. The obstacles encountered from the first applications, resulting from considerable delays in different developmental domains, led us to the Bayley Manual to understand the adaptations and define conduct and procedures for fieldwork with this population. Workshops were conducted, and routines for monitoring and ongoing supervision were structured to ensure adaptation procedures for the instrument without affecting test integrity.
In the end, adapting the BSID-III to accommodate visual and motor difficulties made it possible to accurately assess the developmental function of children with CZVS. There were 16 adaptations, organized according to the type of facilitation used (visual, motor, or general), applied in all assessment areas of this instrument. The observed data suggested the importance of constructing new perspectives in assessing children with atypical development, minimizing the interference of deficits (Araújo et al., 2017).
Final Considerations
It has been demonstrated that it is possible to conduct home assessments of the development of children with multiple disabilities in a population context, using adaptations provided by the Bayley-III Scales, with satisfactory levels of reliability. The team’s training, supervision, and monitoring calibrated the assessors, as shown by the Kappa and ICC reliability measures. Evidence of good performance of neuropsychological instruments in population studies investigating groups still underexplored in public health favors research progress by lending credibility to the results, and it assists future researchers in choosing a tool.
In conclusion, the study contributes to advancing knowledge about children with multiple disabilities from the perspective of Public Health, providing reliability to the psychological assessment process in a population with multiple limitations in child development in the community context. Investigations that examine findings for populations with complex developmental alterations are scientifically relevant, emphasizing the need to test the accommodations required to maintain evidence of the validity and reliability of specific assessment tools.













