Classical Test Theory
Classical Test Theory (CTT) is a foundational framework in psychometrics that focuses on assessing and predicting test outcomes based on previously collected data. Primarily utilized for evaluating the reliability of various psychological assessments, CTT operates on the principle of identifying "true" test scores by accounting for measurement errors. Developed by Charles Spearman in 1904, CTT gained prominence in the psychometric community in the late 20th century, becoming a significant tool for interpreting scores within social sciences.
CTT emphasizes the relationship between test performance and population norms, allowing for predictions about how groups or individuals might perform on tests based on historical data. However, while CTT can identify reliable patterns in test scores, it has limitations, such as the inability to account for multiple sources of error simultaneously. This has led to the emergence of competing theories like Item Response Theory (IRT) and Generalizability Theory, which offer different perspectives on measurement and evaluation.
Applications of CTT are widespread, including in the assessment of Attention Deficit Hyperactivity Disorder (ADHD) and language proficiency testing among non-native speakers. Despite its historical significance, the relevance of CTT in contemporary assessment practices may be challenged by evolving methodologies that seek to enhance the accuracy and validity of psychological measurements.
Classical test theory (CTT) is a branch of psychometrics that aims to predict the outcomes of entire tests, or the responses to specific test items, based on previously completed tests and items used for data collection. While CTT still has support in the psychometric community, today it is generally used, alongside other theories of assessment, to test the reliability of various assessment instruments. It remains one of the most influential theories of test scores in the social sciences.
Keywords Attention Deficit Hyperactivity Disorder (ADHD); Classical Test Theory (CTT); Diagnostic and Statistical Manual of Mental Disorders IV (DSM-IV); Generalizability Theory; Item Response Theory (IRT); Normative Response; Psychometrics; Rasch Analysis; Reliability; Validity
Overview
Classical test theory (CTT) is a branch of psychometrics (psychological testing) used to measure, and often to predict, the outcome of various tests, the difficulty of items within a test, and/or the ability of test-takers. It is one of the most influential theories regarding test scores in the social science field. According to Allyson Lent, Research Assistant at the Neuropsychology Lab at Kessler Medical Rehabilitation Research and Education Center, the purpose of CTT is to understand and improve the reliability of psychological tests, usually by one of two means. The first is test-retest reliability: an item is kept when test-takers repeat the same response over several trials. The second is alternate test reliability: an item is kept when test-takers repeat the same response on an alternate version of the same test (Personal communication, October 17, 2007).
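In practice, both forms of reliability reduce to a correlation between two sets of scores. The following is a minimal sketch in Python, with hypothetical score data and a hypothetical helper function, of how test-retest reliability is commonly estimated as the Pearson correlation between two administrations of the same test:

```python
# A minimal sketch of test-retest reliability. The scores below are made up
# for illustration; the standard estimate is the Pearson correlation between
# two administrations of the same test.
import numpy as np

def test_retest_reliability(trial1, trial2):
    """Pearson correlation between scores from two administrations."""
    return np.corrcoef(trial1, trial2)[0, 1]

# Hypothetical scores for eight examinees on two occasions, two weeks apart.
first = np.array([12, 15, 9, 20, 14, 17, 11, 18])
second = np.array([13, 14, 10, 19, 15, 16, 12, 18])
print(round(test_retest_reliability(first, second), 3))  # near 1.0 => reliable
```

The same computation serves for alternate test reliability; only the second set of scores comes from an alternate form rather than a repeated administration.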
Charles Spearman created the theory in 1904, and it was loosely utilized until 1966, when M. R. Novick put its use at the forefront of psychological theory (Novick, 1966). CTT can be described as a true-score theory: it takes into account the previous scores of a test item or a test-taking population to predict a future score for the same item or population. Using previous scores, classical test theorists can predict which test questions will be answered correctly and which populations tend to answer those questions successfully. Such successful responses are then referred to as normative responses.
When considering a population, the entire population must be taken into account. For example, if all of the eleventh graders in the United States took the Advanced Placement Exam (APE) for English and the same overall score emerged trial after trial, that score would be identified as the normative score for the population of eleventh graders in the United States who took the APE for English. This is not to say that Joe, an eleventh grader on my street, will earn that score simply because he is in eleventh grade and lives in the United States; the normative score describes the population as a whole and is meaningless when applied to any individual, including Joe. In this example, the population is all of the eleventh graders in the United States who took the APE for English rather than any specific individual who took the exam. Joe himself could score higher or lower than the normative score. CTT can make reliable identifications based on populations of people or on individuals, depending upon the purpose of the test itself, but its predictive value rests only on its ability to show that a test item or instrument is reliable over time, using either test-retest reliability or alternate test reliability.
Item Response Theory & Generalizability Theory
Classical test theorists are often in conflict with item response theorists, as item response theory (IRT) focuses on the correlation of specific items or specific individuals. IRT testing models are based on "the relationship between ability (or trait) and performance for each individual item" (Reid, Kolakowsky-Hayner, Lewis & Armstrong, 2007, p. 179). In some cases, both CTT and IRT are used to identify the reliability and validity of a test item or question. However, many more recent psychologists use IRT, as validity (testing what is supposed to be tested) can be difficult to establish across populations, even when the sample size is small. Generalizability theory encompasses CTT: it shares CTT's true-score foundations, and it emerged later as an extension of the classical framework.
Research is often based on formulas. When CTT is broken down to its simplest form, it has only one basic "condition." Using X as the observed score (the score actually recorded), T as the true score (the actual score on a test or test question), and e as the error introduced by faulty test design or test-taker performance, the formula is X = T + e. That is, the observed score equals the true score plus an allowance for error. From research study to research study, numbers are plugged into this equation, and statisticians come up with scores for measurement purposes. One of the biggest concerns about CTT is that while there can be several different types of error, from testing environment to tester bias, CTT allows the estimation of only one type of error at a time. Therefore, if Joe's score were misread by a scantron and the administrator of the test left the room when he wasn't supposed to, CTT could not be used to profile Joe's test. Generalizability theory recognizes that variables may be multiple at times, and its more complicated formula simply encompasses those multiple variables, while CTT's cannot.
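The formula's implications can be made concrete with a short simulation. The following is a minimal sketch in Python, with made-up variance parameters: under CTT's standard assumptions (error is random, has mean zero, and is uncorrelated with the true score), observed-score variance decomposes as var(X) = var(T) + var(e), and reliability can be expressed as the ratio var(T)/var(X).

```python
# A minimal simulation of the CTT model X = T + e, with made-up parameters.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                 # simulated examinees
T = rng.normal(loc=50, scale=10, size=n)   # true scores: variance 100
e = rng.normal(loc=0, scale=5, size=n)     # random error: variance 25
X = T + e                                  # observed scores

print(round(X.var(), 1))            # ~125 = 100 + 25, since T and e are independent
print(round(T.var() / X.var(), 2))  # reliability ~0.80: share of variance that is "true"
```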
Applications
ADHD Information Reporting
According to the CDC, in 2003, approximately 4.4 million youth ages 4-17 had been diagnosed by a healthcare professional as having ADHD, and 2.5 million of those youth were receiving medication treatment for the disorder. With statistics like that, it is imperative that a medication be shown to be effective before it is used to treat the disorder. ADHD is diagnosed when inattention, inappropriate or impulsive behavior, and/or hyperactivity has been identified, usually in a child. These behaviors are generally noticed while children are at school or at home, and often the time frame for the ADHD "behavior" is specific (Corkum, Andreou, Schachar, Tannock & Cunningham, 2007). As such, Corkum, et al. (2007) explain that
…treatment-sensitive instruments that are feasible, yield valid and reliable scores, and measure outcome in a "time-locked" and "situation- and symptom-specific" manner [need to be created]. These instruments are needed to evaluate the outcome for which the treatment is targeted at specific settings (e.g., school), specific times of day (e.g., the late afternoon or early evening medication dose), and specific symptoms (e.g., hyperactivity) (p. 169).
Using the TIP & CTT
The Telephone Interview Probe (TIP) was developed for this purpose, and a study was conducted to measure the effects of a medication treatment based on the specifics described above. Both CTT and generalizability theory were used to evaluate the TIP over the length of the study.
In addition to reliability statistics derived from classical test theory, this study also used generalizability theory in the assessment of reliability. The basic assumption of generalizability theory is that there exist multiple potential sources of error in each observed score. In classical test theory, each form of reliability (intraobserver, interobserver, test-retest, etc.) identifies and quantifies only one source of error, whereas generalizability theory provides a means of combining all sources of variability into a single study (Corkum, et al., 2007, p. 171).
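The contrast the quotation draws can be illustrated with a short simulation. The following is a minimal sketch with made-up variance components: where CTT's X = T + e lumps everything but the true score into a single error term, generalizability theory models several error sources (here, hypothetical rater and occasion effects) simultaneously.

```python
# A minimal sketch of a fully crossed persons x raters x occasions design,
# with made-up variance components, contrasting G theory's decomposition
# with CTT's single error term.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_raters, n_occasions = 200, 4, 2

person = rng.normal(0, 3, size=(n_persons, 1, 1))      # true-score differences
rater = rng.normal(0, 1, size=(1, n_raters, 1))        # rater leniency/severity
occasion = rng.normal(0, 1, size=(1, 1, n_occasions))  # day-to-day variation
residual = rng.normal(0, 1, size=(n_persons, n_raters, n_occasions))

scores = person + rater + occasion + residual  # broadcasting builds the full design
# CTT would treat everything except `person` as one undifferentiated error term;
# G theory keeps the rater and occasion sources separate and estimable.
print(round(scores.var(), 1))  # ~12 = 9 + 1 + 1 + 1
```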
Behavior-rating scales are often used to measure the results of various clinical treatments (Schachar & Tannock, 1993). The TIP embeds a rating scale within a semistructured interview. In the interview, impressions of a child's behavior during specific time periods (during the school day and after the school day) are recorded. The reporters of the child's behavior are the child's schoolteacher and the child's parents. The interview measures the core symptoms of ADHD as well as oppositional behavior and problematic situations.
The sample for the Corkum (2007) study was ninety-one children in a large urban community in Canada who met the Diagnostic and Statistical Manual of Mental Disorders (DSM-III-R) criteria for pervasive ADHD (p. 171). The children were randomly divided into a placebo group and a treatment group (receiving methylphenidate, a short-acting ADHD medication) and monitored over a four-month period.
Audiotaped interviews were conducted between psychology graduate assistants (as the interviewers) and parents and teachers (as informants). "Because the TIP was designed to be a semistructured interview rather than a questionnaire, the interviewer could discuss reasons for the informant's ratings and help the informants make their ratings" (Corkum, et al., 2007, p. 173).
The TIP allows separate ratings of each core symptom of ADHD (inattention, impulsiveness, and hyperactivity), oppositional behavior, and problem situations for both the morning and afternoon/evening of a particular day. The parent and teacher versions use similar formats but are adapted to reflect their particular settings (i.e., home and school). Respondents rated a child's behavior on a six-point scale to pinpoint the severity of behavior during routine activities. For example, parents were asked about a child's behavior before and after school (getting out of bed, getting dressed, adjusting to being back home and getting ready for bed), while teachers were asked about in-school activities such as getting materials ready for class and working individually throughout the day (Corkum, et al., 2007).
Corkum, et al. (2007) found a "statistically significant difference between the two groups at 4 months on all of the scales, with less challenging behavior reported for the children in the methylphenidate group" from the teacher interviews (p. 183). Because methylphenidate has a half-life of four hours, its effects were not noticeable when the children were with their parents before or after school (p. 183).
With this type of data, the TIP was identified as a valuable tool for determining the effectiveness of this specific treatment. From a practical standpoint, moreover, teachers could rate every student's behavior throughout the day on activities like working independently, task performance, and getting along with peers in order to maintain a well-functioning classroom environment, even for students who don't have an ADHD diagnosis. These behaviors affect each student in each classroom and contribute to the ability (or inability) to perform effectively. A teacher who can identify problematic situations can also prepare for them, building activity transitions and coursework lessons on those preparations.
Second Language Testing
ESL students (students for whom English is a second language) are the fastest-growing community of school-age children, so it is now common to have a non-native speaker of English in the classroom. However, there is only one exam given to ESL students, the Test of English as a Foreign Language (TOEFL), and it is given only as an entrance exam for students applying to college. The TOEFL is a standardized, multiple-choice exam. Dudley (2006) offers that multiple true-false (MTF) exams, with additional research, can be a reliable and valid alternative to multiple-choice tests, which can be confusing to students whose understanding of the English language varies from individual to individual (p. 199).
While Cronbach (1939) and Haladyna (1999) offer positive feedback regarding the use of true-false exams, their use has been limited where non-native speakers of English are concerned because little research has been conducted pairing the format with this population (Dudley, 2006; Brown & Hudson, 2002). Dudley (2006) attempted to "lay the foundation of this research" using CTT and Rasch analysis to "investigate the viability of the MTF format in the field of second language testing" (p. 200).
Dudley (2006) took two forms of the University of Michigan English Placement Test, which were multiple-choice in nature, and converted them to a multiple true-false format. The converted tests (after being pretested for clarity by volunteer students and a Japanese professor, and altered accordingly) were given to 143 non-native speakers of English shortly after the beginning of their second semester. Testing conditions were those of a standardized test administration: students were not allowed to write in the test booklets, only #2 pencils could be used, and so on. The test sought to assess knowledge of vocabulary and reading comprehension (p. 205).
While the study identifies several limitations, Dudley notes that even taking those limitations into consideration, "the findings of this study … are sufficiently supportive to recommend that teachers begin experimenting with this MTF format" (Dudley, 2006, p. 224). He also notes that the conclusions of the study have
… provided sound empirical evidence that central factors such as test length, item interdependence, reliability and concurrent validity are viable with MTF items that assess vocabulary and reading comprehension in the realm of norm-referenced testing (p. 224).
Even though Dudley's (2006) focus was on undergraduate students, it is not a far reach to suggest that teachers in the K-12 sector could begin creating multiple true-false tests, or converting existing multiple-choice exams to that format, using CTT when ESL learners are in the classroom.
Measuring Intellectual Maturity in Children
In a different study, Rae & Hyland (2001) used both CTT and generalizability theory to determine whether the Koppitz (1968) Draw-A-Person test is reliable across test raters and across testing occasions. A sample of eighty-five children between the ages of eight and nine was given the Koppitz (1968) test at two different times, exactly two weeks apart. To maintain reliability, the same four raters scored all of the tests for both administrations, the first time with a thirty-item evaluation instrument and the second time with a modified version of the same instrument.
On each occasion the children were presented with a blank sheet of A4 paper and a standard pencil and informed that the drawings were for use in a 'homework' project that had been given to one of the authors by her university. The children were then instructed to 'Draw a picture of a whole person. It can be anybody you want to draw. Make it your best effort and do not copy from a friend.' (These were essentially the same instructions that Koppitz (1968, p. 6) employed in the development of her test.) The children were free to change their drawings and ask the 'examiner' any questions. Only one drawing was requested on each occasion. In a few instances children produced two drawings, in which case only the first one was scored. All drawings were completed in 15 minutes (Rae & Hyland, 2001, p. 372).
While the test-retest construct was followed through, and the raters produced very little error in their scoring (showing strong reliability for the test), there were differences in the test scores between the two administrations once the tests were evaluated. Rae & Hyland note that instructions as open-ended as "draw a person" might account for the score differences, as it is possible that a child's concept of a person changes from day to day. Still, for the purposes of test administration, the human figure drawing test is easy to administer, it doesn't require verbal responses from the children taking it, and the task given is a common activity for children. In this case, CTT falls short not because the scoring is unreliable but because the test instructions may be flawed.
Viewpoints
CTT Tests Now Obsolete
Joe McMahon, a retired clinical psychologist, worked in his field for twenty-five years, primarily with children and adolescents with behavioral disorders. When asked about ADHD treatments, McMahon stated that the TIP was not a popular method for assessing their effectiveness: parents of children with ADHD have a vested interest in the study sample (i.e., their children) and tend not to report all observed behaviors to third parties (Personal communication, October 18, 2007).
McMahon and many other psychologists stopped using the Koppitz (1968) human figure drawing test over twenty years ago because the test results are so subjective. Teachers tend to identify with good art regardless of its content, and they also tend to overreact when confronted with unexpected art. McMahon recalled a teacher jumping to conclusions about a student's possible predisposition to crime because the student drew a person holding a gun aimed at an airplane; the student was not and never became a criminal (Personal communication, October 18, 2007).
The battle between classical theorists and modern theorists is somewhat inherent in the testing methodologies used to analyze data. According to Reid, Kolakowsky-Hayner, Lewis, and Armstrong (2007), "IRT was developed to remedy at least three problems with psychometric assessment based on traditional (classical) test development" (p. 179).
The Downfalls of CTT
What makes CTT effective is also its primary downfall: the normative scores used to predict future scores are specific to the samples previously studied. Remember Joe, the eleventh grader on my street who took the APE in the United States? He may have received the highest score on the exam, but he was grouped with the whole population of test-takers when the APE results were used to predict future success or effectiveness. There really exists no Joe when researchers are looking at CTT for analysis purposes. That may not make Joe very happy, especially when he sees statistics showing that his population ranks at only the seventy-eighth percentile in English proficiency (Reid, et al., 2007, p. 179).
A secondary problem with CTT is that an entire testing instrument has to be completed before predictive information about a population or an individual can be gained. Correct or incorrect answers don't mean anything in and of themselves; the whole, the completed exam, is what matters. So if a teacher wants to test someone else or another population with a shorter exam, he or she can't without first norming the shorter exam. Furthermore, many software packages offer computerized adaptive assessments, which adjust according to student success: the response to item number four determines which item is presented next. This adaptation is not possible under CTT because of the need for norming (Reid, et al., 2007, p. 179).
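The shortened-test problem also has a quantitative side. The following is a minimal sketch using the Spearman-Brown prophecy formula, a standard CTT result (not named in this article) that predicts how reliability changes when a test is shortened or lengthened by a factor k:

```python
# Spearman-Brown prophecy formula, a standard CTT result: predicted
# reliability of a test whose length is changed by factor k, given the
# reliability of the original test.
def spearman_brown(reliability, k):
    """Predicted reliability of a test changed in length by factor k."""
    return k * reliability / (1 + (k - 1) * reliability)

# Halving a test (k = 0.5) that had reliability 0.80:
print(round(spearman_brown(0.80, 0.5), 2))  # 0.67 -- the shorter form is less reliable
```

Even this formula only projects reliability, however; the shorter form must still be normed before its scores can be interpreted.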
Finally, as Reid, et al. (2007) point out, "the instability of scores at extreme (either high or low) levels of an ability or trait, even within the normative sample" is a concern with CTT (p. 179). Bolton (2001) notes that "a test most accurately and efficiently measures ability when the average level of item difficulty is equal to the tested individual's own ability level, the point at which the examinee has a 50% probability of answering each item correctly" (as cited in Reid, et al., 2007, p. 179). Item response theory, however, allows for tests to be designed with a specific ability range in mind.
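Bolton's 50% point is the defining property of the Rasch model mentioned earlier in this article. The following is a minimal sketch, assuming the standard one-parameter logistic form: when an examinee's ability (theta) equals an item's difficulty (b), the probability of a correct response is exactly one half.

```python
# One-parameter (Rasch) item response model: the probability of a correct
# response depends only on the difference between ability and item difficulty.
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_probability(theta=1.0, b=1.0))            # 0.5 when ability matches difficulty
print(round(rasch_probability(theta=2.0, b=1.0), 2))  # ~0.73 for a more able examinee
```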
Terms & Concepts
Attention Deficit Hyperactivity Disorder (ADHD): A diagnosis made in children and/or adults when lack of concentration, hyperactivity, or inappropriate (and/or impulsive) behavior is observed.
Classical Test Theory (CTT): A branch of psychological theory that uses previous test scores to predict the scores of future tests given to specific populations of people. It is based on the responses of the entire population rather than those of individuals within the population.
Diagnostic and Statistical Manual of Mental Disorders (DSM-IV): Handbook created by members of the American Psychiatric Association for diagnosing specific mental disorders based on specific criteria, currently in its fourth edition.
Generalizability Theory: A psychological theory that accounts for multiple sources of error in a measurement, based on the individual factors that often vary in assessment: time, test items, setting, scorers, etc.
Item Response Theory (IRT): Branch of psychological theory based on the relationship between tests (or test items) and a person's ability to perform on that given test/item.
Normative Response: A generalization about a response acquired over several trials or attempts.
Psychometrics: The field of psychological measurement, which attempts to quantify qualities such as intelligence, behavior, and perception.
Rasch Analysis: An IRT model used to construct objective, fundamental measures from test data.
Test Reliability: The measure of whether a test or test item is dependable, that is, whether it yields the same response in repeated trials.
Test Validity: The measure of whether a test or test item is really measuring what it is supposed to measure.
Bibliography
American Psychiatric Association. (1987). Diagnostic and statistical manual of mental disorders (3rd ed., rev.). Washington, DC: Author.
Bolton, B. (2001). Handbook of measurement and evaluation in rehabilitation (3rd ed.). Gaithersburg, MD: Aspen.
Brennan, R. L. (2011). Generalizability theory and classical test theory. Applied Measurement in Education, 24, 1-21. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=57226088&site=ehost-live
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. New York, NY: Cambridge University Press.
Corkum, P., Andreou, P., Schachar, R., Tannock, R. & Cunningham, C. (2007). The Telephone Interview Probe. Educational & Psychological Measurement, 67, 169-185. Retrieved October 15, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=23745935&site=ehost-live
Cronbach, L. J. (1939). Note on the multiple true-false test exercise. Journal of Educational Psychology, 30, 628-31.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. The British Journal of Statistical Psychology, 16, 137-163.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.
Dudley, A. (2006). Multiple dichotomous-scored items in second language testing: Investigating the multiple true-false item type under norm-referenced conditions. Language Testing, 23, 198-228. Retrieved October 12, 2007 from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=20268789&site=ehost-live
Haladyna, T.M. (1999). Developing and validating multiple-choice test items (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Koppitz, E. M. (1968). Psychological evaluation of children's human-figure drawings. New York: Grune & Stratton.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.
Nunnally, J.C., & Bernstein, I.H. (1994) Psychometric theory (3rd ed.). New York: McGraw Hill.
Rae, G. & Hyland, P. (2001). Generalisability and classical test theory analyses of Koppitz's Scoring System for human figure drawings. British Journal of Educational Psychology, 71, 369. Retrieved October 15, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=7211134&site=ehost-live
Reid, C. A., Kolakowsky-Hayner, S. A., Lewis, A. N., & Armstrong, A. J. (2007). Modern psychometric methodology: Applications of item response theory. Rehabilitation Counseling Bulletin, 50, 177-188. Retrieved October 12, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=24418861&site=ehost-live
Sharkness, J., & DeAngelo, L. (2011). Measuring student involvement: A comparison of classical test theory and item response theory in the construction of scales from student surveys. Research in Higher Education, 52, 480-507. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=61930910&site=ehost-live
Zimmerman, D. W. (2011). Sampling variability and axioms of classical test theory. Journal of Educational & Behavioral Statistics, 36, 586-615. Retrieved December 15, 2013, from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=66697929&site=ehost-live
Suggested Reading
Albanese, M.A. & Sabers, D.L. (1978) Multiple response vs. multiple true-false scoring: A comparison of reliability and validity. Paper presented at the annual meeting of National Council on Measurement in Education, Toronto, ON.
Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL: Waveland Press.
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.
Barkley, R. A. (1990). Attention deficit hyperactivity disorder: A handbook for diagnosis and treatment. New York: Guilford.
Bechger, T. M., Verstralen, H. H. F. M., & Verhelst, N. D. (2002). Equivalent linear logistic test models. Psychometrika, 67, 123-136.
Bechger, T. M., Maris, G., Verstralen, H. H. F. M. & Beguin, A. A. (2003). Using classical test theory in combination with item response theory. Applied Psychological Measurement, 27, 319. Retrieved October 15, 2007 from EBSCO Online Database Education Research Complete. http://search.ebscohost.com/login.aspx?direct=true&db=ehh&AN=10851727&site=ehost-live
Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to tests and measurement. Boston: McGraw-Hill.
Eid, G. K. (2005). The effects of sample size on the equating of test items. Education, 126, 165-180. Retrieved October 12, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=18360701&site=ehost-live
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58, 357-381.
Goodenough, F. L., & Harris, D. B. (1963). The Goodenough-Harris drawing test. New York: Harcourt, Brace & World.
Gronlund, N. (1998). Assessment of student achievement (6th ed.). Needham Heights, MA: Allyn and Bacon.
Guyatt, G. H., Deyo, R. A., Charlson, M., Levine, M. N., & Mitchell, A. (1989). Responsiveness and validity in health status measurement: A clarification. Journal of Clinical Epidemiology, 42, 403-408.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer Nijhoff.
Hays, R. D., Anderson, R., & Revicki, D. (1993). Psychometric considerations in evaluating health related quality of life measures. Quality of Life, 2, 441-449.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38(Suppl. 2), II28-II42.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Hsu, T.-C., Moss, P.A. & Khampalikit, C. (1984) The merits of multiple answer items as evaluated by using six scoring formulas. Journal of Experimental Education, 52, 152-58.
Kreiter, G. D., Gordon, J. A., Elliott, S. & Callaway, M. (2004, Spring). Recommendations for assigning weights to component tests to derive an overall course grade. Teaching & Learning in Medicine, 16, 133-138. Retrieved October 11, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=13444931&site=ehost-live
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. London: Addison-Wesley.
McCarthy, D. (1972). Manual for the McCarthy scales of children's abilities. New York: Psychological Corporation.
Mellenbergh, G. J. (1994). A unidimensional latent trait model for continuous item responses. Multivariate Behavioral Research, 29, 223-236.
Mellenbergh, G. J. (1996). Measurement precision in test score and item response models. Psychological Methods, 1, 293-299.
Mitchell, S. K. (1979). Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychological Bulletin, 86, 376-390.
Muraki, E. (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 17, 351-363.
Naglieri, J. A. (1988). Draw a person: A qualitative scoring system. San Antonio, TX: Psychological Corporation.
Naglieri, J. A., McNeish, T. J., & Bardos, A. N. (1991). Draw a person: Screening procedure for emotional disturbance. Austin, TX: Pro-ed, Inc.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Nielsen & Lydiche.
Reid, C. (1995). Application of item response theory to practical problems in assessment with people who have disabilities. Assessment in Rehabilitation and Exceptionality, 2, 89-96.
Roos, B. & Hamilton, D. (2005, March). Formative assessment: A cybernetic viewpoint. Assessment in Education: Principles, Policy & Practice, 12, 7-20. Retrieved October 14, 2007 from EBSCO Online Database Academic Search Premier. http://search.ebscohost.com/login.aspx?direct=true&db=aph&AN=16146410&site=ehost-live
Rost, J. (1996). Lehrbuch Testtheorie, Testkonstruktion [Textbook for test theory and test construction]. Bern, Switzerland: Hans Huber.
Schachar, R., & Tannock, R. (1993). Childhood hyperactivity and psychostimulants: A review of extended treatment studies. Journal of Child and Adolescent Psychopharmacology, 3, 81-97.
Schachar, R., Tannock, R., Cunningham, C., & Corkum, P. (1997). Behavioral, situational, and temporal effects of treatment of ADHD with methylphenidate. Journal of the American Academy of Child and Adolescent Psychiatry, 36, 754-763.
Schumacker, R. (2005). Classical test analysis. Applied Measurement Associates.
Streiner, D. L. (1993). A checklist for evaluating the usefulness of rating scales. Canadian Journal of Psychiatry, 38, 140-148.