Measuring IQ in Autism with Adaptive Testing

June 20, 2017

Autism Spectrum Disorder (ASD) is a developmental disorder characterized by persistent problems in social communication and interaction, along with restricted and repetitive patterns of behavior, interests, or activities.1 First identified in 1943 by child psychiatrist Dr. Leo Kanner, ASD now affects about 1 in 68 children in the United States (1 in 42 boys and 1 in 189 girls). The lifetime cost for each individual has been estimated at $1.4 million, rising to $2.4 million for the 40% who also have intellectual disability.2 Given that the prevalence of ASD has increased 10-fold during the past decade, researchers are working to understand and characterize its features.

With support from the Simons Foundation Autism Research Initiative (SFARI), we set out to build an instrument to measure IQ in individuals with ASD. The project had two design constraints: first, the instrument must accommodate individuals with challenges or deficits in verbal ability; second, it must accommodate individuals with limited attentional resources. With these two constraints in mind, we built the HRS – Matrix Adaptive Test (HRS-MAT).3

Non-verbal matrix reasoning

To address the first constraint, we chose to follow an established IQ testing genre called matrix reasoning. Figure 1 illustrates a 1x5 matrix item that requires the test taker to reason through a series and deduce the missing cell in the puzzle.

Figure 1.   A 1x5 matrix item with five response choices

In the genre of non-verbal matrix reasoning, tasks like Raven's Progressive Matrices have been a staple of the testing industry for nearly 80 years.4 Matrix reasoning also appears in many other standard IQ tests, e.g., the Wechsler Scales and the Stanford-Binet.5,6 However, while these tests have psychometric properties that are well understood and documented, they are of limited use for individuals whose attentional resources are below average. For example, Raven's Progressive Matrices is a 60-item test that takes 45 to 60 minutes to complete. Further, because the items in each of its five 12-item sets become progressively more difficult, nearly every participant spends time and attention on items whose difficulty is mismatched to his or her ability: bright individuals are bored by trivial items, and average individuals are frustrated by seemingly impossible ones. Mismatched items are also problematic because they add noise (error variance) to the score, undermining the validity of the IQ measurement. An adaptive test, by contrast, administers a custom set of items matched to the test taker's ability, so that most of the test contributes meaningfully to the score.
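To make the cost of mismatched items concrete, here is a small sketch using the Rasch (one-parameter logistic) item response model. The choice of model is an assumption for illustration; this post does not say which model the HRS-MAT uses. Under the Rasch model an item's Fisher information is P(1 − P), which peaks when item difficulty equals the test taker's ability and falls off quickly on either side.

```python
import math


def p_correct(theta, b):
    """Rasch (1PL) probability that a person at ability theta solves an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at theta: P * (1 - P), maximal when b == theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


theta = 0.0  # an average test taker, in logits
for b in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"b={b:+.1f}  P(correct)={p_correct(theta, b):.2f}  "
          f"information={item_information(theta, b):.3f}")
```

Running it prints information values of about 0.045, 0.197, 0.250, 0.197, and 0.045. An item three logits too easy or too hard carries less than a fifth of the information of a matched item, so it does little to shrink the standard error of the ability estimate, which is roughly 1/√(total information).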

Adaptive testing

How, then, can we measure IQ efficiently? The answer is to build a custom, or “tailored,” test for each participant that matches his or her ability. Unlike the fixed-format Raven, an adaptive test uses an item administration algorithm that “learns” the participant’s ability from his or her responses to prior items. With each additional response, the algorithm refines its ability estimate and narrows the confidence interval around it. To start, the algorithm assumes average IQ and administers an item of median difficulty. After this item is scored, the participant’s IQ is re-estimated and compared with the stopping criteria to determine whether the measurement is satisfactory. If it is, the test stops; if not, the algorithm selects the next best item from the item pool. A flowchart of this process is illustrated in Figure 2.
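The loop in Figure 2 can be sketched in Python. This is a minimal illustration under stated assumptions, not the HRS-MAT algorithm. Items follow a two-parameter logistic model with a single hypothetical discrimination `A = 1.5`. Ability is estimated by expected a posteriori (EAP) scoring on a grid with a standard normal prior, which encodes the starting assumption of average IQ. The next item is the unused one closest in difficulty to the current estimate. Testing stops when the standard error falls below a hypothetical `se_target` or the item cap is reached. `run_cat`, `eap`, the item bank, and the simulated examinee are all illustrative.

```python
import math
import random

A = 1.5  # hypothetical common item discrimination (2PL); not a published HRS-MAT value
GRID = [i / 10.0 for i in range(-40, 41)]  # ability grid in SD (z) units, -4 to +4


def p_correct(theta, b):
    """2PL probability that a person at ability theta solves an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-A * (theta - b)))


def eap(responses):
    """EAP ability estimate and posterior SD under a standard normal prior.

    responses: list of (difficulty, correct) pairs. The prior centred on 0
    encodes the starting assumption of average ability (IQ 100).
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior
        for b, correct in responses:
            p = p_correct(t, b)
            w *= p if correct else 1.0 - p  # likelihood of each observed response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, se_target=0.32, max_items=30):
    """Administer items until the SE target or the item cap is reached.

    bank: item difficulties; answer(b) -> bool gives the participant's response.
    Returns (theta, se, number_of_items_administered).
    """
    responses, used = [], set()
    theta, se = eap(responses)  # prior only: average ability
    while len(responses) < min(max_items, len(bank)):
        # With equal discrimination, information peaks where b == theta,
        # so the best unused item is the one closest to the current estimate.
        i = min((j for j in range(len(bank)) if j not in used),
                key=lambda j: abs(bank[j] - theta))
        used.add(i)
        responses.append((bank[i], answer(bank[i])))
        theta, se = eap(responses)  # re-estimate after every item
        if se <= se_target:
            break
    return theta, se, len(responses)


if __name__ == "__main__":
    rng = random.Random(1)
    bank = [i / 4.0 for i in range(-12, 13)]  # hypothetical 25-item bank, b from -3 to +3
    true_theta = 1.0                          # simulated examinee, about IQ 115
    theta, se, n = run_cat(bank, lambda b: rng.random() < p_correct(true_theta, b))
    print(f"IQ about {100 + 15 * theta:.0f}, 95% CI +/- {1.96 * 15 * se:.1f}, after {n} items")
```

With a symmetric bank, the first item chosen is the one of median difficulty (b = 0), matching the description above. The target `se_target=0.32` in SD units corresponds to a 95% confidence interval of roughly ±9.4 IQ points (1.96 × 15 × 0.32).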

Figure 2.   Flowchart illustrating the adaptive test process

Figure 2 represents the general principle of a computer adaptive test (CAT). Today, many major examinations, e.g., the Graduate Management Admission Test, use an adaptive format. While adaptive tests usually follow the logic presented in Figure 2, they differ in their item selection algorithms and stopping criteria. Because our target population, participants with ASD, may have less tolerance for frustration and may need greater reward and engagement, our item selection algorithm is configured to favor success, selecting items that the participant has up to a 75% probability of answering correctly. Depending on the application, other test administration parameters, or information from pre-screeners, can be incorporated to make the test more engaging and to improve the IQ estimate.
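Targeting a success rate instead of maximum information changes only the selection step. As a hypothetical reading of “up to 75%,” this sketch picks the unused item whose predicted probability of success is closest to 0.75 under a logistic (Rasch/2PL) model. The function names and item bank are illustrative and are not the HRS-MAT implementation. Solving the model for difficulty gives b = θ − ln(p / (1 − p)) / a, so a 75% target places items about 1.1 logits below the current ability estimate when a = 1.

```python
import math


def p_correct(theta, b, a=1.0):
    """2PL probability of success; a=1.0 reduces to the Rasch model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def target_difficulty(theta, target_p=0.75, a=1.0):
    """Difficulty at which a person at theta succeeds with probability target_p.

    Solves target_p = p_correct(theta, b, a) for b: b = theta - logit(target_p) / a.
    """
    return theta - math.log(target_p / (1.0 - target_p)) / a


def select_item(theta, bank, used, target_p=0.75, a=1.0):
    """Index of the unused item whose predicted success rate is closest to target_p."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return min(candidates, key=lambda i: abs(p_correct(theta, bank[i], a) - target_p))


bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical difficulties (logits)
print(round(target_difficulty(0.0), 3))    # -1.099
print(select_item(0.0, bank, used=set()))  # 2 (b = -1.0, P(correct) about 0.73)
```

The trade-off is precision: under the Rasch model an item answered correctly 75% of the time carries P(1 − P) = 0.1875 units of information, 75% of the 0.25 available from a perfectly matched item, so a success-oriented test may need somewhat more items to reach the same confidence interval.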

How does an adaptive test decide when to stop? Stopping criteria are a set of rules describing the test administration and the precision of the score estimate. Common criteria include test length, in both elapsed time and number of items, and the error of measurement. Measurement error is illustrated in Figure 3 as the 95% confidence interval around the IQ score estimate.
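A composite stopping rule of this kind can be sketched as a single function. The thresholds here (a ±5-point 95% confidence interval, a 5-item minimum, and caps of 30 items and 30 minutes) are hypothetical placeholders, not the HRS-MAT settings. The standard error is assumed to be expressed in SD units and is converted to IQ points with an SD of 15.

```python
from dataclasses import dataclass
from typing import Optional

IQ_SD = 15.0  # IQ scale standard deviation


@dataclass(frozen=True)
class StopRule:
    """Hypothetical thresholds; a real study would set these for its own purpose."""
    max_ci_halfwidth: float = 5.0  # stop once the 95% CI is within +/- 5 IQ points
    min_items: int = 5             # never stop on precision before this many items
    max_items: int = 30            # hard cap on test length (items)
    max_minutes: float = 30.0      # hard cap on test length (time)


def should_stop(se_z: float, n_items: int, elapsed_min: float,
                rule: StopRule = StopRule()) -> Optional[str]:
    """Return the reason for stopping, or None to continue testing.

    se_z is the standard error of the ability estimate in SD (z) units.
    """
    ci_halfwidth = 1.96 * IQ_SD * se_z  # 95% CI half-width in IQ points
    if n_items >= rule.max_items:
        return "item limit"
    if elapsed_min >= rule.max_minutes:
        return "time limit"
    if n_items >= rule.min_items and ci_halfwidth <= rule.max_ci_halfwidth:
        return "precision reached"
    return None
```

For example, `should_stop(0.15, 12, 8.0)` returns `"precision reached"` (the interval is about ±4.4 points), while `should_stop(0.40, 12, 8.0)` returns `None` (about ±11.8 points), so testing continues.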

Figure 3.   IQ, confidence interval, and response time for an adaptive test administration of 30 items

Figure 3 illustrates an adaptive test administration that starts with an expected IQ of 100. In this instance, the algorithm administers progressively more difficult items until the estimate plateaus, here at about item 16. The algorithm then presents ability-matched items, narrowing the confidence interval around the IQ estimate until it is precise enough for the researcher’s purpose. In terms of participant engagement, item response time generally increases as item difficulty increases. Some dips in response time are present, e.g., at items 15 and 19, suggesting typical lapses in attention. Lastly, a length of 30 items appears to be about right: the notable drop in response time on items 29 and 30 suggests the onset of fatigue.

HRS-MAT vs. Raven

Thirty items is half the length of the 60-item Raven. For individuals with limited attention spans, omitting items outside the participant’s ability range saves time. Because we learn the most from an item whose difficulty matches the individual’s ability, the adaptive testing strategy yields more efficient estimates of IQ than conventional fixed-item tests. In fact, the HRS-MAT produces score estimates with a reliability of rα = .90 or greater after about 30 items have been administered. In terms of time, the administration in Figure 3 was completed in 21 minutes, about half the average for the Raven. How do the two tests relate? In a sample of n = 122 adults with a median age of 32 years, the correlation between scores on the two tests was r = .81.
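To translate the reported reliability into score precision, the classical standard error of measurement, SEM = SD × √(1 − r), can be applied with the conventional IQ standard deviation of 15. This is a rough, test-level approximation: in an adaptive test the standard error is computed for each examinee, so individual intervals will vary. The helper names are illustrative.

```python
import math


def sem(reliability, sd=15.0):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def ci95(score, reliability, sd=15.0):
    """Approximate 95% confidence interval for an observed score."""
    half = 1.96 * sem(reliability, sd)
    return score - half, score + half


print(f"{sem(0.90):.2f}")       # 4.74
lo, hi = ci95(100, 0.90)
print(f"{lo:.1f} to {hi:.1f}")  # 90.7 to 109.3
```

At a reliability of .90 the SEM is about 4.7 IQ points, so an observed score of 100 carries a 95% interval of roughly 91 to 109.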

The HRS-MAT is currently available to researchers. The testing system includes an API, support for common web browsers, and a researcher control panel for tracking and searching participant administrations. The HRS-MAT is used internationally and is available in twelve languages: English, Spanish, French, Italian, Dutch, German, Danish, Swedish, Korean, Chinese, Arabic, and Russian. Interested investigators can contact us below.

For those interested in trying the adaptive IQ test, it takes about 20 to 30 minutes and requires a screen the size of an iPad or larger to display the visual puzzles. Your IQ score is based on a national sample, with a mean of 100 and a standard deviation of 15.
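Scores on a scale with a mean of 100 and a standard deviation of 15 can be read as percentiles if the norm sample is assumed to be normally distributed, an assumption made here for illustration. Python's standard-library `statistics.NormalDist` is enough for the conversion; `z_to_iq` and `iq_to_percentile` are hypothetical helper names.

```python
from statistics import NormalDist

IQ_SCALE = NormalDist(mu=100, sigma=15)  # IQ metric: mean 100, SD 15 (normality assumed)


def z_to_iq(z):
    """Convert a standardized (z) ability score to the IQ metric."""
    return 100 + 15 * z


def iq_to_percentile(iq):
    """Percentage of the norm group expected to score at or below iq."""
    return 100 * IQ_SCALE.cdf(iq)


print(z_to_iq(1.0))                     # 115.0
print(round(iq_to_percentile(115), 1))  # 84.1
print(round(iq_to_percentile(70), 1))   # 2.3
```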


Please email me at for your test administration link.



1. Interactive Autism Network Community

2. Autism Speaks

3. Hansen, J. A. (2016). Development and psychometric evaluation of the Hansen Research Services Matrix Adaptive Test: A measure of nonverbal IQ. Journal of Autism and Developmental Disorders. doi:10.1007/s10803-016-2932-0

4. Raven, J., Raven, J. C., & Court, J. H. (2000). Standard Progressive Matrices. Psychology Press.

5. Wechsler, D. (2003). Wechsler Intelligence Scale for Children, Fourth Edition. San Antonio, TX: Psychological Corporation.

6. Roid, G. H. (2003). Stanford-Binet Intelligence Scales, Fifth Edition. Itasca, IL: Riverside.